When function calls through a foreign function interface are faster than native C calls

Update: good discussion on Hacker News

On GitHub, David Yu has developed an interesting performance benchmark for function calls through various foreign function interfaces (FFIs).

He created a shared object file ( .so ) containing a single simple C function. Then, for each FFI, he wrote code that calls this function repeatedly while timing it.

For the C "FFI" he used standard dynamic linking, not dlopen() . This distinction is very important, as it really affects the benchmark results. One can argue about how fair a comparison with actual FFIs this is, but it is still interesting to measure.
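Concretely, the two styles differ at link time versus run time. Roughly like this (a sketch; the actual build commands in the repository may differ):

 $ cc -o bench bench.c ./empty.so    # link-time dynamic linking: calls resolve through the PLT
 $ cc -o bench bench.c -ldl          # load the library at run time with dlopen()/dlsym() instead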

The most surprising result of the benchmark is that LuaJIT's FFI is significantly faster than C: about 25% faster than a native C call to a shared-object function. How could a weakly, dynamically typed scripting language beat C at a benchmark? Is the result accurate?

In fact, this is quite logical. The benchmark runs on Linux, so the overhead comes from the Procedure Linkage Table (PLT). I put together a really simple experiment to demonstrate the effect in plain old C:

https://github.com/skeeto/dynamic-function-benchmark

Here are the results on an Intel i7-6700 (Skylake):

plt: 1.759799 ns/call
ind: 1.257125 ns/call
jit: 1.008108 ns/call


These are three different types of function calls:

  1. Through the PLT.
  2. An indirect function call (via dlsym(3) ).
  3. A direct function call (using a JIT-compiled function).

As you can see, the last one is the fastest. It is uncommon in C programs, but entirely natural in the presence of a JIT compiler, LuaJIT obviously included.

My benchmark calls this empty() function:

 void empty(void) { } 

Compiling into a shared object:

 $ cc -shared -fPIC -Os -o empty.so empty.c 

As in the previous PRNG comparison, the benchmark calls this function as many times as it can until an alarm goes off.
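The harness looks roughly like this (a minimal sketch of the idea, not the repository's exact code; it assumes the program is linked against empty.so):

 #include <signal.h>
 #include <stdio.h>
 #include <unistd.h>

 void empty(void);                     /* resolved from empty.so at load time */

 static volatile sig_atomic_t running; /* cleared by the alarm handler */

 static void
 alarm_handler(int signum)
 {
     (void)signum;
     running = 0;
 }

 int
 main(void)
 {
     signal(SIGALRM, alarm_handler);
     running = 1;
     alarm(1);                         /* deliver SIGALRM after one second */
     long count = 0;
     while (running) {                 /* call as fast as possible until then */
         empty();
         count++;
     }
     printf("%.6f ns/call\n", 1e9 / (double)count);
     return 0;
 }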

Procedure Linkage Tables


When a program or library calls a function in another shared object, the compiler cannot know where that function will reside in memory. That information is only known at run time, after the program and its dependencies have been loaded into memory. Typically the function ends up at a randomized location, e.g. due to Address Space Layout Randomization (ASLR).

How can this problem be solved? Well, there are a few options.

One of them is to record each such call site in the binary's metadata. The runtime dynamic linker then patches in the correct address at every call site. The exact mechanism depends on the code model used during compilation.

The disadvantages of this approach are slower loading, larger binaries, and less sharing of code pages between processes. Loading is slower because every dynamic call site must be patched with the correct address before the program starts executing. Binaries are bloated because every call site needs an entry in the table. And sharing is lost because the code pages get modified.

On the other hand, this eliminates the per-call cost of invoking dynamic functions, giving JIT-like performance, as the benchmark shows.

The second option is to route all dynamic calls through a table. The original call site calls a stub in this table, which jumps on to the actual dynamic function. With this approach the code does not need to be patched, so it remains trivially shared between processes. Only a single table entry needs to be patched per dynamic function. Moreover, these fixups can be performed lazily, on the function's first call, which speeds up loading even more.

On systems with ELF binaries, this table is called the Procedure Linkage Table (PLT). The PLT itself is not actually patched: it is mapped read-only along with the rest of the code. Instead, the Global Offset Table (GOT) is patched. The PLT stub fetches the dynamic function's address from the GOT and jumps to that address indirectly. To load function addresses lazily, the GOT entries are initialized with the address of a resolver function that looks up the target symbol, updates the GOT with its address, and then jumps to that function. Subsequent calls use the lazily resolved address.
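The same lazy-versus-eager tradeoff is visible from user code: dlopen(3) exposes it through its mode flags, and for ordinary dynamic linking the glibc environment variable LD_BIND_NOW forces eager resolution. A minimal sketch:

 #include <dlfcn.h>

 int
 main(void)
 {
     /* Resolve symbols on first use, like default (lazy) PLT binding: */
     void *lazy = dlopen("./empty.so", RTLD_LAZY);
     /* Resolve everything at load time, like LD_BIND_NOW=1 does: */
     void *eager = dlopen("./empty.so", RTLD_NOW);
     (void)lazy;
     (void)eager;
     return 0;
 }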



The disadvantage of the PLT is the extra overhead on every dynamic function call, which is what showed up in the benchmark. Since the benchmark measures nothing but function calls, the difference looks quite significant, but in practice it is usually close to zero.

Here is the benchmark:

 /* Cleared by an alarm signal. */
 volatile sig_atomic_t running;

 static long
 plt_benchmark(void)
 {
     long count;
     for (count = 0; running; count++)
         empty();
     return count;
 }

Since empty() is in the shared object, the call goes through the PLT.

Indirect dynamic calls


Another way to call dynamic functions is to bypass the PLT and obtain the target function's address within the program, e.g. via dlsym(3) .

 void *h = dlopen("path/to/lib.so", RTLD_NOW);
 void (*f)(void) = dlsym(h, "f");
 f();

Once the function's address has been obtained, calls through it are cheaper than calls through the PLT. There is no intermediate stub function and no GOT access. (Caveat: if the program has a PLT entry for the function, dlsym(3) may actually return the address of that stub.)
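That caveat is easy to check empirically. A sketch (the cast through void * is the usual POSIX idiom for dlsym; link with -ldl on older glibc):

 #include <dlfcn.h>
 #include <stdio.h>

 void empty(void);  /* also linked in normally, so a PLT entry may exist */

 int
 main(void)
 {
     void *h = dlopen("./empty.so", RTLD_NOW);
     void (*f)(void) = (void (*)(void))dlsym(h, "empty");
     /* Compare the two; with a PLT entry in the program, one or both
        may point at the stub rather than the function in empty.so. */
     printf("dlsym: %p, &empty: %p\n", (void *)f, (void *)empty);
     return 0;
 }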

But this is still an indirect call. On conventional architectures, direct function calls encode their target as an immediate relative address. That is, the call target is some hard-coded offset from the call site. The CPU can see far ahead of time where the call will go.

Indirect calls cost more. First, the address has to be stored somewhere. Even if that is only a register, using one increases register pressure. Second, indirect calls exercise the CPU's branch predictor, putting extra load on the processor. In the worst case, a call may even cause a pipeline stall.

Here is the benchmark:

 volatile sig_atomic_t running;

 static long
 indirect_benchmark(void (*f)(void))
 {
     long count;
     for (count = 0; running; count++)
         f();
     return count;
 }

The function passed to this benchmark is retrieved with dlsym(3) , so the compiler cannot do anything tricky, such as converting the indirect call back into a direct one.

If the loop body were complicated enough to create register pressure, forcing the address to be spilled onto the stack, this benchmark could also no longer be fairly compared with the PLT benchmark.

Direct function calls


The first two kinds of dynamic function calls are simple and easy to use. Direct calls to dynamic functions are trickier to arrange because they require modifying code at run time. In my benchmark I put together a little JIT compiler to generate the direct call.

The catch is that on x86-64, direct jumps are limited to a ±2 GB range because of their signed 32-bit immediate operand. This means the JIT code must be placed virtually nearby the target function, empty() . If the JIT code had to call two different dynamic functions separated by more than 2 GB, it could not make both calls directly.
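The reachability constraint is a few lines of arithmetic. A sketch (call_site is the address of the call opcode's first byte; the displacement is measured from the end of the 5-byte instruction):

 #include <stdint.h>

 static int
 call_in_range(uintptr_t call_site, uintptr_t target)
 {
     /* rel32 displacement for E8: target - (address just after the call) */
     intptr_t diff = (intptr_t)(target - (call_site + 5));
     return diff >= INT32_MIN && diff <= INT32_MAX;
 }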

To keep things simple, my benchmark is not precise or particularly careful about choosing the address for the JIT code. After obtaining the target function's address, it simply subtracts 4 MB, rounds down to the nearest page, allocates some memory there, and writes code into it. Doing this properly would require inspecting the program's own memory mappings to find a suitable spot, and there is no clean, portable way to do that. On Linux it requires parsing virtual files under /proc .
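A careful implementation would instead scan the process's mappings for a free spot within range. On Linux those live in /proc/self/maps ; a minimal sketch that just lists the occupied ranges:

 #include <stdio.h>

 int
 main(void)
 {
     FILE *maps = fopen("/proc/self/maps", "r");
     if (!maps)
         return 1;
     char line[512];
     unsigned long beg, end;
     while (fgets(line, sizeof(line), maps)) {
         /* each line starts with "beg-end perms offset dev inode path" */
         if (sscanf(line, "%lx-%lx", &beg, &end) == 2)
             printf("mapping: %#lx-%#lx\n", beg, end);
     }
     fclose(maps);
     return 0;
 }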

Here is what my JIT memory allocation looks like. It assumes reasonable behavior when casting to and from uintptr_t :

 static void
 jit_compile(struct jit_func *f, void (*empty)(void))
 {
     uintptr_t addr = (uintptr_t)empty;
     void *desired = (void *)((addr - SAFETY_MARGIN) & PAGEMASK);
     /* ... */
     unsigned char *p = mmap(desired, len, prot, flags, fd, 0);
     /* ... */
 }

There are two pages: a writable one and one holding non-writable code. As in my closure library, the lower page is writable and holds the running variable, which is cleared by the alarm. It must be located near the JIT code so it can be accessed RIP-relative, just like in the functions of the other two benchmarks. The upper page contains this assembly:

 jit_benchmark:
         push  rbx
         xor   ebx, ebx
 .loop:  mov   eax, [rel running]
         test  eax, eax
         je    .done
         call  empty
         inc   ebx
         jmp   .loop
 .done:  mov   eax, ebx
         pop   rbx
         ret

The call empty is the only dynamically generated instruction; it is needed to fill in the correct relative address (the minus 5 is because the displacement is relative to the end of the instruction):

 // call empty
 uintptr_t rel = (uintptr_t)empty - (uintptr_t)p - 5;
 *p++ = 0xe8;
 *p++ = rel >>  0;
 *p++ = rel >>  8;
 *p++ = rel >> 16;
 *p++ = rel >> 24;

If the empty() function were not in a shared object but in the same binary, this would essentially be the direct call the compiler would generate for plt_benchmark() , assuming it had not, for whatever reason, inlined empty() .

Ironically, calling the JIT-compiled code requires an indirect call (e.g. through a function pointer), and there is no way around that. What are you going to do, JIT-compile another function that makes the direct call? Fortunately this does not matter, since only the direct call inside the loop is measured.
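For illustration, entering the generated code looks something like this (a sketch: the jit_func field name and the prototype are my assumptions, not the benchmark's actual definitions):

 /* Hypothetical: assumes struct jit_func has a `code` pointer to the
    executable page laid out as jit_benchmark above. */
 long (*benchmark)(void) = (long (*)(void))f->code;  /* unavoidable indirect call */
 long count = benchmark();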

No secrets


Given these results, it becomes clear why LuaJIT generates more efficient dynamic function calls than calls through a PLT, even though they remain indirect calls. In my benchmark, indirect calls without the PLT were 28% faster than with it, and direct calls 43% faster than with it. That is a small edge that JIT-compiled programs have over plain old native programs, achieved at the cost of giving up code page sharing between processes entirely.

Source: https://habr.com/ru/post/413181/

