Capturing functions in the Linux kernel using ftrace

Ninja Penguin, En3l In one project related to the security of Linux systems, we needed to intercept calls to important functions inside the kernel (such as opening files and starting processes) in order to be able to monitor activity in the system and proactively block suspicious processes.

During the development process, we managed to invent a fairly good approach that allows you to conveniently intercept any function in the kernel by name and execute your code around its calls. Interceptor can be installed from a loadable GPL module, without rebuilding the kernel. The approach supports kernel versions 3.19+ for the x86_64 architecture.

(Picture of a penguin just above: © En3l with DeviantArt .)

Known approaches


Linux Security API


The most correct would be to use Linux Security API - a special interface created specifically for this purpose. In critical places of the nuclear code, there are calls to security functions, which in turn call callbacks set by the security module. The security module can examine the context of the operation and decide whether to allow or deny it.

Unfortunately, the Linux Security API has a couple of important limitations:


If the position of kernel developers is ambiguous about the multiplicity of modules, the ban on dynamic loading is fundamental: the security module must be part of the kernel in order to ensure security all the time from the moment it is loaded.

Thus, to use the Security API, you need to ship your own kernel assembly, as well as integrate an add-on module with SELinux or AppArmor, which are used by popular distributions. The customer did not want to subscribe to such obligations, so this path was closed.

For these reasons, the Security API did not suit us, otherwise it would be an ideal option.

Modifying the system call table


Monitoring was required mainly for actions performed by user applications, so that, in principle, could be implemented at the level of system calls. As you know, Linux stores all system call handlers in the sys_call_table table. The substitution of values ​​in this table leads to a change in the behavior of the entire system. Thus, by saving the old values ​​of the handler and substituting our own handler in the table, we can intercept any system call.

This approach has certain advantages:


However, it also suffers from some disadvantages:


Initially, we chose and successfully implemented this approach, pursuing the benefits of supporting the largest number of systems. However, at that time we did not yet know about the features of x86_64 and restrictions on intercepted calls. Later, we found critical support for system calls related to the launch of new processes - clone () and execve () - which are just special. That is what led us to search for new options.

Using kprobes


One of the options that were considered was the use of kprobes : a specialized API, primarily intended for debugging and kernel tracing. This interface allows you to set pre- and post-processors for any instruction in the kernel, as well as handlers for entry and return from a function. Handlers get access to registers and can change them. Thus, we could get both monitoring and the ability to influence the further course of work.

The advantages that kprobes provides for intercepting:


Disadvantages of kprobes:


In the process of researching the topic, our view fell on the ftrace framework, which can replace jprobes. As it turned out, it is better suited for our needs for intercepting function calls. However, if you need to trace specific instructions within functions, then kprobes should not be discounted.

Splicing


For completeness, it is also worth describing the classic way to intercept functions, which is to replace the instructions at the beginning of the function with an unconditional transition, leading to our handler. Original instructions are transferred to another place and executed before going back to the intercepted function. With the help of two transitions, we sew (splice in) our additional code into the function, so this approach is called splicing .

This is exactly how jump-optimization for kprobes is implemented. Using splicing, you can achieve the same results, but without additional costs for kprobes and with complete control of the situation.

The benefits of splicing are obvious:


However, the main drawback of this approach seriously overshadows the picture:


In general, if you are able to invoke this daemon, subordinating only to initiates, and are ready to tolerate it in your code, then splicing is quite a working approach for intercepting function calls. I had a negative attitude to the writing of bicycles, so this option remained as a backup for us in case there would be no progress at all with ready-made solutions easier.

New approach with ftrace


Ftrace is a framework for kernel tracing at the function level. It has been developed since 2008 and has a fantastic interface for user programs. Ftrace allows you to track the frequency and duration of function calls, display call graphs, filter functions of interest by templates, and so on. On the possibilities of ftrace, you can start reading from here , and then follow the links and official documentation.

It implements ftrace based on the compiler keys -pg and -mfentry , which insert at the beginning of each function a call to the special trace function mcount () or __fantry __ (). Usually in user programs this compiler feature is used by profilers to track calls of all functions. The kernel uses these functions to implement the ftrace framework.

Calling ftrace from each function is, of course, not cheap, so optimization is available for popular architectures: dynamic ftrace . The bottom line is that the kernel knows the location of all the mcount () or __fentry __ () calls and replaces their machine code with nop at the early stages of loading - a special instruction that does nothing. When tracing is enabled, ftrace calls are added back to the desired functions. Thus, if ftrace is not used, then its effect on the system is minimal.

Description of the necessary functions


Each intercepted function can be described by the following structure:

 /** * struct ftrace_hook -    * * @name:    * * @function:  -,     *   * * @original:   ,     *  ,    * * @address:   ,    * * @ops:   ftrace,  , *      */ struct ftrace_hook { const char *name; void *function; void *original; unsigned long address; struct ftrace_ops ops; }; 

The user only needs to fill in the first three fields: name, function, original. The remaining fields are considered implementation details. The description of all intercepted functions can be assembled into an array and use macros to increase the compactness of the code:

 #define HOOK(_name, _function, _original) \ { \ .name = (_name), \ .function = (_function), \ .original = (_original), \ } static struct ftrace_hook hooked_functions[] = { HOOK("sys_clone", fh_sys_clone, &real_sys_clone), HOOK("sys_execve", fh_sys_execve, &real_sys_execve), }; 

Wrappers over intercepted functions are as follows:

 /* *        execve(). *     .      *  :       , *    ABI (  "asmlinkage"). */ static asmlinkage long (*real_sys_execve)(const char __user *filename, const char __user *const __user *argv, const char __user *const __user *envp); /* *      .   —  *   .      *  .      ,  *    . */ static asmlinkage long fh_sys_execve(const char __user *filename, const char __user *const __user *argv, const char __user *const __user *envp) { long ret; pr_debug("execve() called: filename=%p argv=%p envp=%p\n", filename, argv, envp); ret = real_sys_execve(filename, argv, envp); pr_debug("execve() returns: %ld\n", ret); return ret; } 

As you can see, the functions being intercepted with a minimum of extra code. The only point that requires careful attention is the function signatures. They must match one to one. Without this, obviously, the arguments will be passed incorrectly and everything will go downhill. For interception of system calls, this is less important, since their handlers are very stable and for efficiency, arguments are taken in the same manner as the system calls themselves. However, if you plan to intercept other functions, then you should remember that there are no stable interfaces inside the kernel .

Ftrace initialization


First we need to find and save the address of the function that we will intercept. Ftrace allows you to trace functions by name, but we still need to know the address of the original function in order to call it.

You can get the address using kallsyms - a list of all characters in the kernel. This list includes all characters, not only exported for modules. Getting the address of the intercepted function looks like this:

 static int resolve_hook_address(struct ftrace_hook *hook) { hook->address = kallsyms_lookup_name(hook->name); if (!hook->address) { pr_debug("unresolved symbol: %s\n", hook->name); return -ENOENT; } *((unsigned long*) hook->original) = hook->address; return 0; } 

Next, you need to initialize the ftrace_ops structure. It is mandatory
the field is just a func indicating a callback, but we also need
Set some important flags:

 int fh_install_hook(struct ftrace_hook *hook) { int err; err = resolve_hook_address(hook); if (err) return err; hook->ops.func = fh_ftrace_thunk; hook->ops.flags = FTRACE_OPS_FL_SAVE_REGS | FTRACE_OPS_FL_IPMODIFY; /* ... */ } 

fh_ftrace_thunk () is our callback, which ftrace will call when tracing a function. About him later. The flags we set will be needed to perform the interception. They instruct ftrace to save and restore processor registers, the contents of which we can change in the callback.

Now we are ready to enable interception. To do this, you must first enable ftrace for the function of interest using ftrace_set_filter_ip (), and then allow ftrace to call our callback using register_ftrace_function ():

 int fh_install_hook(struct ftrace_hook *hook) { /* ... */ err = ftrace_set_filter_ip(&hook->ops, hook->address, 0, 0); if (err) { pr_debug("ftrace_set_filter_ip() failed: %d\n", err); return err; } err = register_ftrace_function(&hook->ops); if (err) { pr_debug("register_ftrace_function() failed: %d\n", err); /*    ftrace   . */ ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0); return err; } return 0; } 

Interception is turned off in the same way, only in reverse order:

 void fh_remove_hook(struct ftrace_hook *hook) { int err; err = unregister_ftrace_function(&hook->ops); if (err) { pr_debug("unregister_ftrace_function() failed: %d\n", err); } err = ftrace_set_filter_ip(&hook->ops, hook->address, 1, 0); if (err) { pr_debug("ftrace_set_filter_ip() failed: %d\n", err); } } 

After the unregister_ftrace_function () call is completed, the absence of activations of the installed callback in the system (and with it our wrappers) is guaranteed. Therefore, we can, for example, safely unload the interceptor module without fearing that our functions are still performed somewhere in the system (after all, if they disappear, the processor will be upset).

Perform interception


How is the actual interception? Very simple. Ftrace allows you to change the state of the registers after exiting the callback. By changing the% rip register — the pointer to the next executable instruction — we change the instructions that the processor executes — that is, we can force it to perform an unconditional transition from the current function to ours. Thus, we take control over ourselves.

The callback for ftrace looks like this:

 static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *ops, struct pt_regs *regs) { struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops); regs->ip = (unsigned long) hook->function; } 

Using the container_of () macro, we get the address of our struct ftrace_hook at the address of the struct ftrace_hook embedded in it, and then we replace the value of the% rip register in the struct pt_regs structure with the address of our handler. Everything. For architectures other than x86_64, this register may be called differently (like IP or PC), but the idea is in principle applicable to them.

Notice the notrace specifier added to the callback. They can mark functions that are not allowed for tracing with ftrace. For example, the functions of ftrace itself that are involved in the tracing process are labeled this way. This helps to prevent the system from hanging in an infinite loop while tracing all functions in the kernel (ftrace can do this).

A callback ftrace usually triggers with displacement disabled (like kprobes). There may be exceptions, but you should not count on them. In our case, however, this restriction is not important, so we just replace eight bytes in the structure.

The wrapper function that is called later will be executed in the same context as the original function. Therefore, there you can do the same that is allowed to do in the intercepted function. For example, if you intercept an interrupt handler, then you still cannot sleep in a wrapper.

Recursive Call Protection


: , ftrace, , . - .

, — parent_ip — ftrace-, , . . , .

, parent_ipshould point inside our wrapper, whereas in the first, somewhere else in the kernel. It is necessary to transfer control only at the first call of the function, all the others must allow the original function to be executed.

Entry checking can be done very efficiently by comparing the address with the boundaries of the current module (which contains all our functions). This works fine if in a module only a wrapper calls an intercepted function. Otherwise, you need to be more selective.

So, the correct ftrace callback is as follows:

 static void notrace fh_ftrace_thunk(unsigned long ip, unsigned long parent_ip, struct ftrace_ops *ops, struct pt_regs *regs) { struct ftrace_hook *hook = container_of(ops, struct ftrace_hook, ops); /*      . */ if (!within_module(parent_ip, THIS_MODULE)) regs->ip = (unsigned long) hook->function; } 

Distinctive features / advantages of this approach:



: ls , . (, Bash) fork () + execve () . clone() execve() . , execve(), .

- :

sequence-

, ( ) ( ), where the ftrace ( purple ) framework calls functions from our module ( green ).

  1. The user process executes SYSCALL. This instruction switches to kernel mode and transfers control to the low-level system call handler, entry_SYSCALL_64 (). It is responsible for all system calls of 64-bit programs on 64-bit kernels.
  2. . , , do_syscall_64 (), . sys_call_tablesys_execve ().
  3. ftrace. __fentry__ (), ftrace. , , nop , sys_execve() .
  4. Ftrace . ftrace , . , %rip, .
  5. . parent_ip , do_syscall_64() — sys_execve() — , %rip pt_regs .
  6. Ftrace . FTRACE_SAVE_REGS, ftrace pt_regs . ftrace . %rip — — .
  7. -. - sys_execve() . fh_sys_execve (). , do_syscall_64().
  8. . . fh_sys_execve() ( ) . . — sys_execve() , real_sys_execve , .
  9. . sys_execve(), ftrace . , -…
  10. . sys_execve() fh_sys_execve(), do_syscall_64(). sys_execve() . : ftrace sys_execve() .
  11. . sys_execve() fh_sys_execve(). . , execve() , , , . .
  12. Control is returned to the kernel. Finally, fh_sys_execve () is terminated and control passes to do_syscall_64 (), which assumes that the system call was completed as usual. The core continues its nuclear affairs.
  13. The control is returned to the user process. Finally, the kernel executes an IRET (or SYSRET) instruction, but for execve () it is always an IRET, setting registers for the new user process and putting the CPU in the execution mode of the user code. The system call (and the launch of the new process) is complete.

Advantages and disadvantages


As a result, we get a very convenient way to intercept any functions in the kernel, which has the following advantages:


What are the disadvantages of this solution?


Consider some of the disadvantages in more detail.

Kernel configuration requirements


For starters, the kernel must support ftrace and kallsyms. To do this, the following options should be enabled:


Then, ftrace should support dynamic modification of registers. The option is responsible for this opportunity.


Further, the kernel used should be based on version 3.19 or higher in order to access the FTRACE_OPS_FL_IPMODIFY flag. Earlier versions of the kernel also know how to replace the% rip register, but starting from 3.19, this should be done only after setting this flag. The presence of a flag for old kernels will lead to a compilation error, and its absence for new ones will lead to non-working interception.

Finally, to perform an interception, the location of the ftrace call inside the function is critical: the call must be placed at the very beginning, before the function prologue (where space is allocated for local variables and a stack frame is formed). This feature of the architecture is taken into account by the option


x86_64 , i386 — . - i386 ftrace , ftrace . %eip — , , .

ftrace 32- x86. , ( «»), , ftrace.


During testing, we encountered one interesting feature : on some distributions, interception of functions caused the system to hang tightly. Naturally, this happened only on systems other than those used by developers. The problem was also not reproduced on the original interception prototype, with any distributions and kernel versions.

Debugging showed that a hang occurs inside the intercepted function. For some mystical reason, when calling the original function inside the ftrace-callback, the address parent_ipcontinued to be pointed to the kernel code instead of the wrapper function code. Because of this, there was an infinite loop, as ftrace repeatedly caused our wrapper, without performing any useful actions.

Fortunately, we had both working and broken code at our disposal, so finding the differences was only a matter of time. After the unification of the code and ejection of all unnecessary, the differences between the versions could be localized to the wrapper function.

This option worked:

 static asmlinkage long fh_sys_execve(const char __user *filename, const char __user *const __user *argv, const char __user *const __user *envp) { long ret; pr_debug("execve() called: filename=%p argv=%p envp=%p\n", filename, argv, envp); ret = real_sys_execve(filename, argv, envp); pr_debug("execve() returns: %ld\n", ret); return ret; } 

and this one - hung the system:

 static asmlinkage long fh_sys_execve(const char __user *filename, const char __user *const __user *argv, const char __user *const __user *envp) { long ret; pr_devel("execve() called: filename=%p argv=%p envp=%p\n", filename, argv, envp); ret = real_sys_execve(filename, argv, envp); pr_devel("execve() returns: %ld\n", ret); return ret; } 

How does it come about that the level of logging affects behavior? A careful study of the machine code of the two functions quickly clarified the situation and caused the very feeling when the compiler was to blame. Usually he is on the list of suspects somewhere near the cosmic rays, but not this time.

The point, as it turned out, is that the pr_devel () calls are expanded to void. This version of the printk macro is used for logging during development. Such entries in the log are not interesting during operation, so they are automatically cut from the code, if you do not declare the DEBUG macro. After that, the function for the compiler turns into this:

 static asmlinkage long fh_sys_execve(const char __user *filename, const char __user *const __user *argv, const char __user *const __user *envp) { return real_sys_execve(filename, argv, envp); } 

And here come the optimization. In this case it worked so-called optimization tail calls (tail call optimization). It allows the compiler to replace an honest function call with a direct transition to its body, if one function calls another and immediately returns its value. In native code, an honest call looks like this:

 0000000000000000 <fh_sys_execve>: 0: e8 00 00 00 00 callq 5 <fh_sys_execve+0x5> 5: ff 15 00 00 00 00 callq *0x0(%rip) b: f3 c3 repz retq 

and non-working - like this:

 0000000000000000 <fh_sys_execve>: 0: e8 00 00 00 00 callq 5 <fh_sys_execve+0x5> 5: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax c: ff e0 jmpq *%rax 

CALL — __fentry__(), . real_sys_execve ( ) CALL fh_sys_execve() RET. real_sys_execve() JMP.

«» , , CALL. , — parent_ip . fh_sys_execve() , — . parent_ipcontinues to point inside the core, which ultimately leads to the formation of an infinite loop.

It also explains why the problem was reproduced only on some distributions. When compiling modules, different distributions use different sets of compilation flags. In distressed distributions, tail call optimization was enabled by default.

The solution for us was to turn off tail optimization for the entire file with wrapper functions:

 #pragma GCC optimize("-fno-optimize-sibling-calls") 

Conclusion


… Linux — . , - , .

, Github .

Source: https://habr.com/ru/post/413241/


All Articles