Not so long ago I had yet another conversation with a colleague on the eternal topic "by reference or by value?". This article is the result. In it I want to present my findings on this and related topics, namely:
- Registers and their purpose when calling functions.
- Passing and returning simple types and structures.
- How passing by reference versus by value affects the compiler's ability to optimize the function body.
- How stack space is used across multiple function calls.
- The mechanism of virtual calls.
- Tail call and recursion optimization.
- Initialization of structures, arrays and vectors.
Caution! The article contains a large amount of C++ and assembly code (Intel syntax, with comments), as well as a set of tables with performance estimates. Everything described applies to the x86-64 System V ABI, which is used in all modern Unix-like operating systems, for example Linux and macOS.
The information was taken from the System V Application Binary Interface for x86-64 document. Assembly listings were obtained for clang 5.0.0 x86-64 with the flags -O3 -std=c++1z -march=sandybridge (using https://godbolt.org). Performance measurements were made on an Intel® Xeon® CPU E5-2660 @ 2.20GHz.
X86-64 registers
All data is stored in RAM, and multi-level caches speed up access to it. But to actually operate on the data, one way or another it has to pass through registers (discussed in the comments). Below is a very brief description of the most used registers of the x86-64 architecture.
- 16 general-purpose registers: rax, rbx, rcx, rdx, rbp, rsi, rdi, rsp, as well as r8-r15. Each is 64 bits (8 bytes) wide. To access the lower 32 bits (4 bytes), the prefix e is used instead of r (rax → eax). They support only non-vector integer operations.
- rip (instruction pointer) points to the instruction that will be executed next. Various constant data placed in the section with instructions can be read at an offset relative to rip.
- rsp (stack pointer) points to the last item on the stack. The stack grows toward smaller addresses: pushing something onto the stack decreases rsp.
- 16 SSE registers of 128 bits each: xmm0-xmm15. If AVX is supported, they are the lower 128 bits of ymm0-ymm15, each of which is 256 bits wide. For vector or floating-point operations, the data must first be loaded into these registers.
Parameter passing
This section gives a somewhat abbreviated and simplified description of the algorithm that distributes arguments between registers and the stack. The full description can be found on page 17 of the "System V ABI" document.
We introduce several classes of objects:
- INTEGER — integral types that fit into general-purpose registers: bool, char, int, and so on.
- SSE — floating-point numbers that fit into a vector register: float and double.
- MEMORY — objects passed through the stack.
To unify the description, the types __int128 and complex are represented as structures of two fields:

```cpp
struct __int128 { int64 low, high; };
struct complexT { T real, imag; };
```
At the beginning, each function argument is classified:
1. If the type is larger than 128 bits, or has misaligned fields, it is MEMORY.
2. If it has a non-trivial destructor, a non-trivial copy constructor, virtual methods, or virtual base classes, it is passed by "invisible reference": the object is replaced with a pointer to it, which is of class INTEGER.
3. Aggregates — that is, structures and arrays — are analyzed in chunks of 8 bytes each:
   1. If a chunk contains a field of class MEMORY, the entire chunk is MEMORY.
   2. If it contains a field of class INTEGER, the entire chunk is INTEGER.
   3. Otherwise, the entire chunk is SSE.
4. If any chunk is of class MEMORY, the entire argument is MEMORY.
5. The types long double and complex long double use the special x87 FPU register set and are classified as MEMORY.
6. The types __m256, __m128 and __float128 are of class SSE.
After classification, all the 8-byte chunks (one chunk may hold several structure fields or array elements) are distributed among the registers:
- MEMORY chunks are passed through the stack.
- INTEGER chunks are passed through the next free register of rdi, rsi, rdx, rcx, r8, r9, in exactly that order.
- SSE chunks are passed through the next free register of xmm0-xmm7.
Arguments are processed from left to right; those that ran out of registers are passed through the stack. If there are not enough registers for every chunk of an argument, the entire argument is passed through the stack.
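Applying these rules by hand to a few aggregates (my worked examples, not taken from the ABI document):

```cpp
#include <cstdint>

struct A { int32_t i; float f; }; // one 8-byte chunk mixing INTEGER and SSE
                                  // => the whole chunk is INTEGER (rule 3.2).
struct B { float f[4]; };         // 16 bytes, two pure-float chunks => two SSE chunks.
struct C { double d[3]; };        // 24 bytes, larger than 128 bits => MEMORY (rule 1).
```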
The values are returned as follows:
- MEMORY types are returned through the stack. The caller provides space for the result and passes its address in rdi, as if it were a hidden first argument; on return, this address must be handed back in rax. The actual first argument is then passed as the second, and so on.
- INTEGER chunks are returned through the next free register of rax, rdx.
- SSE chunks are returned through the next free register of xmm0, xmm1 (so these registers are used both for passing and for returning values).
A summary table of the registers and their purpose is very useful when reading assembly:
Register | Purpose |
---|---
rax | Temporary register; returns the first (ret 1) INTEGER result. |
rbx | Belongs to the calling function; must be preserved by the time of return. |
rcx | Passing the fourth (4) INTEGER argument. |
rdx | Passing the third (3) INTEGER argument; returning the second (ret 2) INTEGER result. |
rsp | The stack pointer. |
rbp | Belongs to the calling function; must be preserved by the time of return. |
rsi | Passing the second (2) INTEGER argument. |
rdi | Passing the first (1) INTEGER argument. |
r8 | Passing the fifth (5) INTEGER argument. |
r9 | Passing the sixth (6) INTEGER argument. |
r10-r11 | Temporary registers. |
r12-r15 | Belong to the calling function; must be preserved by the time of return. |
xmm0-xmm1 | Passing the first and second SSE arguments; returning the first and second SSE results. |
xmm2-xmm7 | Passing the third through eighth SSE arguments. |
xmm8-xmm15 | Temporary registers. |
Registers belonging to the calling function must either be left untouched, or their values must be saved somewhere (for example, on the stack) and restored before returning.
Simple examples
Unless explicitly stated otherwise, all functions used were marked NOINLINE: we pretend that the function body is located in a separate cpp file and LTO is disabled. Also, all function results are passed to an empty NOINLINE function to prevent the optimizer from deleting the code under test entirely.
```cpp
#define NOINLINE __attribute__((noinline))
#define INLINE   static __attribute__((always_inline))
```
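The consuming function itself is not shown in the article; a minimal sketch of what it might look like, assumed from the `void sink<Point3f*>(Point3f* const&)` symbol that appears in later listings:

```cpp
// An empty NOINLINE function that "consumes" a value: the optimizer cannot
// prove the call has no effect, so the computation feeding it must survive.
template<typename T>
NOINLINE void sink(const T&) {}
```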
Consider something simple.
```cpp
double foo(int8_t a, int16_t b, int32_t c, int64_t d, float x, double y) {
    return a + b + c + d + x + y;
}
...
auto result = foo(1, 2, 3, 4, 5, 6);
```
Parameters are passed as:
Name | Register | Name | Register | Result |
---|---|---|---|---
a | rdi | d | rcx | xmm0 |
b | rsi | x | xmm0 | |
c | rdx | y | xmm1 | |
Consider the generated code in more detail.
```asm
foo(signed char, short, int, long, float, double):
    add edi, esi                    # a + b.
    add edi, edx                    # + c.
    movsxd rax, edi                 # sign-extend the 32-bit sum into the 64-bit rax.
    add rax, rcx                    # + d.
    vcvtsi2ss xmm2, xmm2, rax       # convert the integer sum to float in xmm2.
    vaddss xmm0, xmm2, xmm0         # + x; the 's' in vaddss means single precision.
    vcvtss2sd xmm0, xmm0, xmm0      # convert to double.
    vaddsd xmm0, xmm0, xmm1         # + y; the 'd' in vaddsd means double precision.
    ret                             # pop the return address and jump to it; rsp grows by 8.
.LCPI1_0:                           # constants.
    .long 1084227584                # float 5
.LCPI1_1:
    .quad 4618441417868443648       # double 6
main:                               # @main
    sub rsp, 24                     # reserve stack space.
    vmovss xmm0, dword ptr [rip + .LCPI1_0] # xmm0 = mem[0],zero,zero,zero
    vmovsd xmm1, qword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero
    mov edi, 1
    mov esi, 2
    mov edx, 3
    mov ecx, 4
    call foo(signed char, short, int, long, float, double) # call pushes the return address, shrinking rsp by 8.
    vmovsd qword ptr [rsp + 16], xmm0 # store the result.
```
So when a function takes parameters of simple types, you would have to try hard to make them not be passed through registers.
Consider various examples of aggregates. Arrays can be viewed as structures with multiple fields.
```cpp
struct St { double a, b; };

double foo(St s) { return s.a + s.b; }
...
St s{1, 2};
auto result = foo(s);
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s.a | xmm0 | s.b | xmm1 | xmm0 |
It would seem that nothing prevents packing both doubles into one xmm register at once. But alas, the distribution algorithm operates only on 8-byte chunks.
```asm
foo(St):                            # @foo(St)
    vaddsd xmm0, xmm0, xmm1         # both fields arrive in registers, as does the result.
    ret
.LCPI1_0:
    .quad 4607182418800017408       # double 1
.LCPI1_1:
    .quad 4611686018427387904       # double 2
main:                               # @main
    sub rsp, 24                     # reserve stack space.
    vmovsd xmm0, qword ptr [rip + .LCPI1_0] # xmm0 = mem[0],zero
    vmovsd xmm1, qword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero
    call foo(St)
    vmovsd qword ptr [rsp + 16], xmm0 # store the double result.
```
If you add another double field, the entire structure will be passed through the stack, since its size will exceed 128 bits.
```cpp
struct St { double a, b, c; };

double foo(St s) { return s.a + s.b + s.c; }
...
St s{1, 2, 3};
auto result = foo(s);
```
```asm
foo(St):                            # @foo(St)
    # The return address occupies 8 bytes at the top of the stack,
    # so the argument starts at rsp+8.
    vmovsd xmm0, qword ptr [rsp + 8] # xmm0 = mem[0],zero
    vaddsd xmm0, xmm0, qword ptr [rsp + 16]
    vaddsd xmm0, xmm0, qword ptr [rsp + 24]
    ret
.L_ZZ4mainE1s:
    .quad 4607182418800017408       # double 1
    .quad 4611686018427387904       # double 2
    .quad 4613937818241073152       # double 3
main:                               # @main
    sub rsp, 40                     # reserve 40 bytes.
    # The structure is copied onto the stack: two fields through an xmm
    # register, the third with a pair of mov instructions.
    mov rax, qword ptr [rip + .L_ZZ4mainE1s+16] # load '3'.
    mov qword ptr [rsp + 16], rax   # store '3' on the stack.
    vmovups xmm0, xmmword ptr [rip + .L_ZZ4mainE1s] # load '1' and '2' into xmm0.
    vmovups xmmword ptr [rsp], xmm0 # store '1' and '2'. Layout: 1 = *rsp, 2 = *(rsp+8), 3 = *(rsp+16).
    call foo(St)
    vmovsd qword ptr [rsp + 32], xmm0 # store the double result.
```
Let's see what happens if we replace double with uint64_t.
```cpp
struct St { uint64_t a, b; };

uint64_t foo(St s) { return s.a + s.b; }
...
St s{1, 2};
auto result = foo(s);
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s.a | rdi | s.b | rsi | rax |
```asm
foo(St):                            # @foo(St)
    lea rax, [rdi + rsi]
    ret
main:                               # @main
    sub rsp, 24
    mov edi, 1
    mov esi, 2
    call foo(St)
    mov qword ptr [rsp + 16], rax
```
The result is noticeably more compact. You can read more about why the lea instruction is used rather than add here: https://stackoverflow.com/a/6328441/1418863
If you add a third field then, as in the example with double, the structure is passed through the stack. The code, by the way, will be almost identical: even loading onto the stack is done through xmm registers.
Consider something more interesting.
```cpp
struct St { float a, b, c, d; };

St foo(St s1, St s2) {
    return {s1.a + s2.a, s1.b + s2.b, s1.c + s2.c, s1.d + s2.d};
}
...
St s1{1, 2, 3, 4}, s2{5, 6, 7, 8};
auto result = foo(s1, s2);
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s1.a | xmm0 | s1.b | xmm0 | xmm0, xmm1 |
s1.c | xmm1 | s1.d | xmm1 | |
s2.a | xmm2 | s2.b | xmm2 | |
s2.c | xmm3 | s2.d | xmm3 | |
Two float fields are packed into each xmm register.
```asm
foo(St, St):                        # @foo(St, St)
    # vaddps adds packed floats.
    vaddps xmm0, xmm0, xmm2
    vaddps xmm1, xmm1, xmm3
    ret
.LCPI1_0:
    .long 1065353216                # float 1
    .long 1073741824                # float 2
    .zero 4
    .zero 4
# .LCPI1_1 - .LCPI1_3 are analogous.
...
main:                               # @main
    sub rsp, 24
    # two fields per register, four registers in total.
    vmovapd xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = <1,2,u,u>
    vmovapd xmm1, xmmword ptr [rip + .LCPI1_1] # xmm1 = <3,4,u,u>
    vmovaps xmm2, xmmword ptr [rip + .LCPI1_2] # xmm2 = <5,6,u,u>
    vmovaps xmm3, xmmword ptr [rip + .LCPI1_3] # xmm3 = <7,8,u,u>
    call foo(St, St)
    # The result comes back in the low 64 bits of two registers: a and b in
    # xmm0, c and d in xmm1; vunpcklpd packs them together before storing.
    vunpcklpd xmm0, xmm0, xmm1      # xmm0 = xmm0[0],xmm1[0]
    vmovupd xmmword ptr [rsp + 8], xmm0
```
If the structure had three fields instead of four, the function code would be similar, except that the second vaddps instruction would be replaced with vaddss, which adds only the low 32 bits of the register.
```cpp
struct St { int32_t a, b, c, d; };

St foo(St s1, St s2) {
    return {s1.a + s2.a, s1.b + s2.b, s1.c + s2.c, s1.d + s2.d};
}
...
St s1{1, 2, 3, 4}, s2{5, 6, 7, 8};
auto result = foo(s1, s2);
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s1.a | rdi | s1.b | rdi | rax, rdx |
s1.c | rsi | s1.d | rsi | |
s2.a | rdx | s2.b | rdx | |
s2.c | rcx | s2.d | rcx | |
```asm
foo(St, St):                        # @foo(St, St)
    lea eax, [rdx + rdi]            # add the low 32-bit halves.
    movabs r8, -4294967296          # the 0xFFFFFFFF00000000 mask.
    and rdi, r8
    add rdi, rdx
    and rdi, r8
    or rax, rdi
    lea edx, [rcx + rsi]            # the same, for the second pair of fields.
    and rsi, r8
    add rsi, rcx
    and rsi, r8
    or rdx, rsi
    ret
main:                               # @main
    sub rsp, 24
    movabs rdi, 8589934593          # two 32-bit fields packed into one 64-bit constant.
    movabs rsi, 17179869187
    movabs rdx, 25769803781
    movabs rcx, 34359738375
    call foo(St, St)
    mov qword ptr [rsp + 8], rax    # store the result.
    mov qword ptr [rsp + 16], rdx
```
Bit magic happens inside the function, but the principle is roughly clear: each pair of 32-bit numbers is packed into one 64-bit register, and the result is returned the same way.
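The constants in main are easy to verify; a quick sketch of the packing (my example, not from the original article):

```cpp
#include <cstdint>

uint64_t pack(uint32_t lo, uint32_t hi) {
    return uint64_t(lo) | (uint64_t(hi) << 32);
}
// pack(1, 2) == 8589934593 and pack(3, 4) == 17179869187 —
// exactly the constants loaded by the movabs instructions above.
```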
Let's see what happens if we start mixing field types, but in such a way that within each 8-byte chunk they are still of the same class.
```cpp
struct St { int32_t a, b; float c, d; };

St foo(St s1, St s2) {
    return {s1.a + s2.a, s1.b + s2.b, s1.c + s2.c, s1.d + s2.d};
}
...
St s1{1, 2, 3, 4}, s2{5, 6, 7, 8};
auto result = foo(s1, s2);
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s1.a | rdi | s1.b | rdi | rax, xmm0 |
s1.c | xmm0 | s1.d | xmm0 | |
s2.a | rsi | s2.b | rsi | |
s2.c | xmm1 | s2.d | xmm1 | |
```asm
foo(St, St):                        # @foo(St, St)
    lea eax, [rsi + rdi]            # the integer part.
    movabs rcx, -4294967296
    and rdi, rcx
    add rdi, rsi
    and rdi, rcx
    or rax, rdi
    vaddps xmm0, xmm0, xmm1         # the float part.
    ret
.LCPI1_0:
    .long 1077936128                # float 3
    .long 1082130432                # float 4
    .zero 4
    .zero 4
...
main:                               # @main
    sub rsp, 24
    vmovaps xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = <3,4,u,u>
    vmovaps xmm1, xmmword ptr [rip + .LCPI1_1] # xmm1 = <7,8,u,u>
    movabs rdi, 8589934593
    movabs rsi, 25769803781
    call foo(St, St)
    mov qword ptr [rsp + 8], rax    # store the integer fields...
    vmovlps qword ptr [rsp + 16], xmm0 # ...and the float fields.
```
But this is not that interesting, since the field types within each 8-byte chunk are still the same. Let's shuffle the fields.
```cpp
struct St { int32_t a; float b; int32_t c; float d; };

St foo(St s1, St s2) {
    return {s1.a + s2.a, s1.b + s2.b, s1.c + s2.c, s1.d + s2.d};
}
...
St s1{1, 2, 3, 4}, s2{5, 6, 7, 8};
auto result = foo(s1, s2);
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s1.a | rdi | s1.b | rdi | rax, rdx |
s1.c | rsi | s1.d | rsi | |
s2.a | rdx | s2.b | rdx | |
s2.c | rcx | s2.d | rcx | |
Recall rule 3.2 of the classification: since each 8-byte chunk now contains both a float and an int, the entire chunk is of class INTEGER and is passed in general-purpose registers.
```asm
foo(St, St):                        # @foo(St, St)
    mov rax, rdx
    add edx, edi
    shr rdi, 32
    vmovd xmm0, edi
    mov rdi, rcx
    add ecx, esi
    shr rsi, 32
    vmovd xmm1, esi
    shr rax, 32
    vmovd xmm2, eax
    vaddss xmm0, xmm0, xmm2
    shr rdi, 32
    vmovd xmm2, edi
    vaddss xmm1, xmm1, xmm2
    vmovd eax, xmm0
    shl rax, 32
    or rdx, rax
    vmovd eax, xmm1
    shl rax, 32
    or rcx, rax
    mov rax, rdx
    mov rdx, rcx
    ret
main:                               # @main
    sub rsp, 24
    movabs rdi, 4611686018427387905 # 0x4000000000000001: the int in the low 32 bits, the float in the high 32.
    movabs rsi, 4647714815446351875
    movabs rdx, 4665729213955833861
    movabs rcx, 4683743612465315847
    call foo(St, St)
    mov qword ptr [rsp + 8], rax    # the result also comes back in 64-bit chunks.
    mov qword ptr [rsp + 16], rdx
```
Here you can see six shift operations for extracting the float fields and stuffing them back into the result registers, and no vector operations at all. In general, it is better not to mix field types within the 8-byte chunks of a structure.
Passing by reference
Passing a parameter by const reference is similar to passing a pointer to the object. If the object does not fit into registers, it is passed and returned through the stack. Let's see how this happens. For realism, consider structures for a three-dimensional point.
```cpp
struct Point3f { float x, y, z; };
struct Point3d { double x, y, z; };

Point3f scale(Point3f p)         { return {p.x * 2, p.y * 2, p.z * 2}; }
Point3f scaleR(const Point3f& p) { return {p.x * 2, p.y * 2, p.z * 2}; }
Point3d scale(Point3d p)         { return {p.x * 2, p.y * 2, p.z * 2}; }
Point3d scaleR(const Point3d& p) { return {p.x * 2, p.y * 2, p.z * 2}; }
```
Compare the code of the functions. Mostly the same xmm registers are used, so the logic is easy to follow.
```asm
scale(Point3f):                     # @scale(Point3f)
    # Everything is in registers: x and y arrive in xmm0, z in xmm1,
    # and the result is returned the same way.
    vaddps xmm0, xmm0, xmm0
    vaddss xmm1, xmm1, xmm1
    ret
scaleR(Point3f const&):             # @scaleR(Point3f const&)
    # The object's address is in rdi; the fields are loaded into xmm0, xmm1.
    vmovsd xmm0, qword ptr [rdi]    # xmm0 = mem[0],zero
    vaddps xmm0, xmm0, xmm0
    vmovss xmm1, dword ptr [rdi + 8] # xmm1 = mem[0],zero,zero,zero
    vaddss xmm1, xmm1, xmm1
    ret
scale(Point3d):                     # @scale(Point3d)
    # rdi holds the address for the result; the argument lives on the stack
    # at [rsp+8, rsp+32). [rsp, rsp+8) is the return address.
    vmovapd xmm0, xmmword ptr [rsp + 8]
    vaddpd xmm0, xmm0, xmm0
    vmovupd xmmword ptr [rdi], xmm0
    vmovsd xmm0, qword ptr [rsp + 24] # xmm0 = mem[0],zero
    vaddsd xmm0, xmm0, xmm0
    vmovsd qword ptr [rdi + 16], xmm0
    mov rax, rdi                    # the address of the result must be returned in rax.
    ret
scaleR(Point3d const&):             # @scaleR(Point3d const&)
    # Same as above, except that the argument is read from [rsi, rsi+24)
    # instead of [rsp+8, rsp+32).
    vmovupd xmm0, xmmword ptr [rsi]
    vaddpd xmm0, xmm0, xmm0
    vmovupd xmmword ptr [rdi], xmm0
    vmovsd xmm0, qword ptr [rsi + 16] # xmm0 = mem[0],zero
    vaddsd xmm0, xmm0, xmm0
    vmovsd qword ptr [rdi + 16], xmm0
    mov rax, rdi
    ret
```
Now let's look at the call sites of these functions.
```asm
# scale(Point3f)
main:                               # @main
    sub rsp, 24
    # the argument is loaded straight into registers.
    vmovaps xmm0, xmmword ptr [rip + .LCPI4_0] # xmm0 = <1,2,u,u>
    vmovss xmm1, dword ptr [rip + .LCPI4_1] # xmm1 = <3,u,u,u>
    call scale(Point3f)

# scaleR(const Point3f&)
main:                               # @main
    sub rsp, 24
    # only the address of the constant goes into rdi.
    mov edi, .L_ZZ4mainE1p          # <1,2,3,u>
    call scaleR(Point3f const&)

# scale(Point3d)
main:                               # @main
    sub rsp, 64
    # the argument is copied onto the stack.
    mov rax, qword ptr [rip + .L_ZZ4mainE1p+16]
    mov qword ptr [rsp + 16], rax
    vmovups xmm0, xmmword ptr [rip + .L_ZZ4mainE1p]
    vmovups xmmword ptr [rsp], xmm0
    lea rbx, [rsp + 40]
    mov rdi, rbx                    # the result goes to [rsp+40, rsp+64).
    call scale(Point3d)
.L_ZZ4mainE1p:
    .quad 4607182418800017408       # double 1
    .quad 4611686018427387904       # double 2
    .quad 4613937818241073152       # double 3

# scaleR(const Point3d&)
main:                               # @main
    sub rsp, 64
    # the argument is still built on the stack, but only its address is passed.
    mov rax, qword ptr [rip + .L_ZZ4mainE1p+16]
    mov qword ptr [rsp + 32], rax
    vmovups xmm0, xmmword ptr [rip + .L_ZZ4mainE1p]
    vmovaps xmmword ptr [rsp + 16], xmm0
    lea rbx, [rsp + 40]
    lea rsi, [rsp + 16]             # the argument at [rsp+16, rsp+40).
    mov rdi, rbx                    # the result at [rsp+40, rsp+64).
    call scaleR(Point3d const&)
```
Now let's see what happens if the structure has many fields but still fits into the registers. This is where the fun begins.
```cpp
struct St { char d[16]; };

St foo(St s1, St s2) {
    St r;
    for (int i{}; i < 16; ++i) r.d[i] = s1.d[i] + s2.d[i];
    return r;
}
```
The code of the function that takes its arguments by value:
Name | Register | Name | Register | Result |
---|---|---|---|---
s1.d[0:8] | rdi | s1.d[8:16] | rsi | rax, rdx |
s2.d[0:8] | rdx | s2.d[8:16] | rcx | |
```asm
foo(St, St):                        # @foo(St, St)
    mov qword ptr [rsp - 16], rdi
    mov qword ptr [rsp - 8], rsi
    mov qword ptr [rsp - 32], rdx
    mov qword ptr [rsp - 24], rcx
    mov eax, edx
    add al, dil
    mov byte ptr [rsp - 48], al
    mov r8, rdi
    shr r8, 8
    mov rax, rdx
    shr rax, 8
    add al, r8b
    mov byte ptr [rsp - 47], al
    mov r8, rdi
    shr r8, 16
    mov rax, rdx
    shr rax, 16
    add al, r8b
    mov byte ptr [rsp - 46], al
    mov r8, rdi
    shr r8, 24
    mov rax, rdx
    shr rax, 24
    add al, r8b
    mov byte ptr [rsp - 45], al
    mov r8, rdi
    shr r8, 32
    mov rax, rdx
    shr rax, 32
    add al, r8b
    mov byte ptr [rsp - 44], al
    mov r8, rdi
    shr r8, 40
    mov rax, rdx
    shr rax, 40
    add al, r8b
    mov byte ptr [rsp - 43], al
    mov r8, rdi
    shr r8, 48
    mov rax, rdx
    shr rax, 48
    add al, r8b
    mov byte ptr [rsp - 42], al
    shr rdi, 56
    shr rdx, 56
    add dl, dil
    mov byte ptr [rsp - 41], dl
    mov eax, ecx
    add al, sil
    mov byte ptr [rsp - 40], al
    mov rax, rsi
    shr rax, 8
    mov rdx, rcx
    shr rdx, 8
    add dl, al
    mov byte ptr [rsp - 39], dl
    shr rsi, 16
    shr rcx, 16
    add cl, sil
    mov byte ptr [rsp - 38], cl
    mov al, byte ptr [rsp - 21]
    mov cl, byte ptr [rsp - 20]
    add al, byte ptr [rsp - 5]
    mov byte ptr [rsp - 37], al
    add cl, byte ptr [rsp - 4]
    mov byte ptr [rsp - 36], cl
    mov al, byte ptr [rsp - 19]
    mov cl, byte ptr [rsp - 18]
    add al, byte ptr [rsp - 3]
    mov byte ptr [rsp - 35], al
    add cl, byte ptr [rsp - 2]
    mov byte ptr [rsp - 34], cl
    mov al, byte ptr [rsp - 17]
    add al, byte ptr [rsp - 1]
    mov byte ptr [rsp - 33], al
    mov rax, qword ptr [rsp - 48]
    mov rdx, qword ptr [rsp - 40]
    ret
```
Yes: all the arguments are copied onto the stack, after which everything is extracted and added one byte at a time. As you can count, there are exactly 16 add instructions in the function. GCC, by the way, produces much more compact code for this example, but still with copying through the stack. Can anything be improved? Let's pass the structure by reference.
```cpp
St fooR(const St& s1, const St& s2) {
    St r;
    for (int i{}; i < 16; ++i) r.d[i] = s1.d[i] + s2.d[i];
    return r;
}
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s1 | rdi | s2 | rsi | rax, rdx |
```asm
fooR(St const&, St const&):         # @fooR(St const&, St const&)
    vmovdqu xmm0, xmmword ptr [rsi]
    vpaddb xmm0, xmm0, xmmword ptr [rdi]
    vmovdqa xmmword ptr [rsp - 24], xmm0
    mov rax, qword ptr [rsp - 24]
    mov rdx, qword ptr [rsp - 16]
    ret
```
Oh yeah! This looks much better: all 16 single-byte elements are loaded into an xmm register at once, and vpaddb adds them all together in one operation. After that, the result is copied to the output registers via the stack. You might think this last copy can be removed by making the first argument a non-const reference:
```cpp
void fooR1(St& s1, const St& s2) {
    for (int i{}; i < 16; ++i) s1.d[i] += s2.d[i];
}
```
Name | Register | Name | Register | Result |
---|---|---|---|---
s1 | rdi | s2 | rsi | |
```asm
fooR1(St&, St const&):              # @fooR1(St&, St const&)
    mov al, byte ptr [rsi]
    add byte ptr [rdi], al
    mov al, byte ptr [rsi + 1]
    add byte ptr [rdi + 1], al
    mov al, byte ptr [rsi + 2]
    add byte ptr [rdi + 2], al
    mov al, byte ptr [rsi + 3]
    add byte ptr [rdi + 3], al
    mov al, byte ptr [rsi + 4]
    add byte ptr [rdi + 4], al
    mov al, byte ptr [rsi + 5]
    add byte ptr [rdi + 5], al
    mov al, byte ptr [rsi + 6]
    add byte ptr [rdi + 6], al
    mov al, byte ptr [rsi + 7]
    add byte ptr [rdi + 7], al
    mov al, byte ptr [rsi + 8]
    add byte ptr [rdi + 8], al
    mov al, byte ptr [rsi + 9]
    add byte ptr [rdi + 9], al
    mov al, byte ptr [rsi + 10]
    add byte ptr [rdi + 10], al
    mov al, byte ptr [rsi + 11]
    add byte ptr [rdi + 11], al
    mov al, byte ptr [rsi + 12]
    add byte ptr [rdi + 12], al
    mov al, byte ptr [rsi + 13]
    add byte ptr [rdi + 13], al
    mov al, byte ptr [rsi + 14]
    add byte ptr [rdi + 14], al
    mov al, byte ptr [rsi + 15]
    add byte ptr [rdi + 15], al
    ret
```
It seems something went wrong. The reason is that, by default, the compiler is very careful and assumes the programmer may be out of sorts and write something like this:
```cpp
char buff[17];
fooR1(*reinterpret_cast<St*>(buff + 1), *reinterpret_cast<const St*>(buff));
```
In this case, buff[i+1] += buff[i] is computed at each iteration — the pointers alias. To tell the compiler that no such exotic use of the function is foreseen, the keyword __restrict exists.
```cpp
void fooR2(St& __restrict s1, const St& s2) {
    for (int i{}; i < 16; ++i) s1.d[i] += s2.d[i];
}
```
Which gives the desired result.
```asm
fooR2(St&, St const&):              # @fooR2(St&, St const&)
    vmovdqu xmm0, xmmword ptr [rdi]
    vpaddb xmm0, xmm0, xmmword ptr [rsi]
    vmovdqu xmmword ptr [rdi], xmm0
    ret
```
Changing the signature to void fooR3(St& __restrict s1, St s2) again leads to bloated assembly, similar to the first example with St foo(St, St).
By the way, the same applies to raw pointers: in a function like void foo(char* __restrict s1, const char* s2, int size), __restrict gives the compiler the same no-aliasing guarantee, with the same effect on vectorization.
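A sketch of that pointer variant (my assumed example, in the spirit of the signature above):

```cpp
// __restrict promises that dst is not aliased by src, so the loop can be
// vectorized just like fooR2 above.
void addBytes(char* __restrict dst, const char* src, int size) {
    for (int i = 0; i < size; ++i) dst[i] += src[i];
}
```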
Now let's estimate the cost of all these variants. The baseline simply initializes two structures a and b and passes them to an empty function st; in every other row, the function under test is called four times per iteration:

```cpp
St a, b;
st(a, b);
```
Code | Cycles per iteration |
---|---
St a, b; st(a, b); | 7.6 |
4 x foo no reuse | 121.9 |
4 x foo | 117.7 |
4 x fooR no reuse | 66.3 |
4 x fooR | 64.6 |
4 x fooR1 | 84.5 |
4 x fooR2 | 20.6 |
4 x foo inline | 51.9 |
4 x fooR inline | 30.5 |
4 x fooR1 inline | 8.8 |
4 x fooR2 inline | 8.8 |
'no reuse' means the result of one call is not fed into the next; without it, the calls are chained: auto a2 = foo(a, b); auto a3 = foo(a2, b); and so on. 'inline' means the function was marked INLINE instead of NOINLINE.
The rows fooR1 inline / fooR2 inline are the most telling: once the bodies are inlined, the compiler can see for itself that the pointers do not alias, and both variants reach practically the baseline speed. foo inline / fooR inline stay slower, since the copies of the by-value argument and of the returned structure do not disappear. As usual: if performance matters, measure rather than guess.
Now let's see what a non-trivial destructor does to all of this.
```cpp
struct Point3f {
    float x, y, z;
    ~Point3f() {}
};

Point3f scale(Point3f p) { return {p.x * 2, p.y * 2, p.z * 2}; }
```
The function no longer receives its fields in registers: the caller provides space for the result and passes its address in rdi, while the argument itself is passed by invisible reference in rsi.
```asm
scale(Point3f):                     # @scale(Point3f)
    vmovss xmm0, dword ptr [rsi]    # xmm0 = mem[0],zero,zero,zero
    vaddss xmm0, xmm0, xmm0
    vmovss dword ptr [rdi], xmm0
    vmovss xmm0, dword ptr [rsi + 4] # xmm0 = mem[0],zero,zero,zero
    vaddss xmm0, xmm0, xmm0
    vmovss dword ptr [rdi + 4], xmm0
    vmovss xmm0, dword ptr [rsi + 8] # xmm0 = mem[0],zero,zero,zero
    vaddss xmm0, xmm0, xmm0
    vmovss dword ptr [rdi + 8], xmm0
    mov rax, rdi
    ret
```
So a single empty destructor (or copy constructor) is enough for the structure to stop being POD: it is now passed and returned through memory, and the body is identical to that of Point3f scaleR(const Point3f&). Now the call site.
```cpp
Point3f p{1, 2, 3};
auto result = scale(p);
sink(&result);
```
```asm
main:                               # @main
    push rbx
    sub rsp, 48
    movabs rax, 4611686019492741120 # the constants 1.0f and 2.0f packed together.
    mov qword ptr [rsp + 16], rax
    mov dword ptr [rsp + 24], 1077936128 # 3.0f
    lea rbx, [rsp + 32]
    lea rsi, [rsp + 16]             # the argument at [rsp+16, rsp+28).
    mov rdi, rbx                    # the result at [rsp+32, rsp+44).
    call scale(Point3f)
    mov qword ptr [rsp + 8], rbx    # a pointer to the result at [rsp+8, rsp+16), for sink.
    lea rdi, [rsp + 8]
    call void sink<Point3f*>(Point3f* const&)
    xor eax, eax
    add rsp, 48
    pop rbx
    ret
    # the exception cleanup path.
    mov rdi, rax
    call _Unwind_Resume
```
The empty destructor dissolved without a trace here. Marking it NOINLINE makes its calls visible:
```asm
main:                               # @main
    push r14
    push rbx
    sub rsp, 56
    movabs rax, 4611686019492741120 # p lives at [rsp, rsp+12).
    mov qword ptr [rsp], rax
    mov dword ptr [rsp + 8], 1077936128
    # a temporary copy pTmp at [rsp+24, rsp+36).
    mov eax, dword ptr [rsp + 8]
    mov dword ptr [rsp + 32], eax
    mov rax, qword ptr [rsp]
    mov qword ptr [rsp + 24], rax
    lea r14, [rsp + 40]
    lea rbx, [rsp + 24]
    mov rdi, r14                    # the result at [rsp+40, rsp+52).
    mov rsi, rbx                    # the argument is pTmp.
    call scale(Point3f)
    mov rdi, rbx                    # destroy pTmp; the address is passed as this.
    call Point3f::~Point3f()
    mov qword ptr [rsp + 16], r14   # a pointer to the result at [rsp+16, rsp+24), for sink.
    lea rdi, [rsp + 16]
    call void sink<Point3f*>(Point3f* const&)
    lea rdi, [rsp + 40]             # destroy result.
    call Point3f::~Point3f()
    mov rdi, rsp                    # destroy p.
    call Point3f::~Point3f()
    xor eax, eax
    add rsp, 56
    pop rbx
    pop r14
    ret
    # the exception cleanup path, with its own destructor calls.
    mov rbx, rax
    lea rdi, [rsp + 40]             # destroy result.
    call Point3f::~Point3f()
    mov rdi, rsp                    # destroy p.
    call Point3f::~Point3f()
    mov rdi, rbx
    call _Unwind_Resume
```
Note the temporary pTmp: because scale takes its parameter by value, a copy of p is created, passed by address, and destroyed right after the call; p and result are destroyed at the end of main. Now let's return to the destructor-free structures and look at chained calls.
```asm
# Point3f result = scale(scale(Point3f{1, 2, 3}));
    sub rsp, 24
    vmovaps xmm0, xmmword ptr [rip + .LCPI4_0] # xmm0 = <1,2,u,u>
    vmovss xmm1, dword ptr [rip + .LCPI4_1] # xmm1 = mem[0],zero,zero,zero
    # the result stays in xmm0, xmm1 and is immediately the argument
    # of the next call. Beautiful!
    call scale(Point3f)
    call scale(Point3f)
    vmovlps qword ptr [rsp + 8], xmm0
    vmovss dword ptr [rsp + 16], xmm1

# Point3f result = scaleR(scaleR(Point3f{1, 2, 3}));
    sub rsp, 56
    # the temporary is built at [rsp+24, rsp+36).
    movabs rax, 4611686019492741120 # 0x400000003F800000 = [2.0f, 1.0f]
    mov qword ptr [rsp + 24], rax
    mov dword ptr [rsp + 32], 1077936128 # 0x40400000 = 3.0f
    lea rdi, [rsp + 24]             # its address is the argument.
    call scaleR(Point3f const&)
    # the intermediate result is spilled to [rsp+8, rsp+20)...
    vmovlps qword ptr [rsp + 8], xmm0
    vmovss dword ptr [rsp + 16], xmm1
    lea rdi, [rsp + 8]              # ...so that its address can be taken.
    call scaleR(Point3f const&)
    vmovlps qword ptr [rsp + 40], xmm0
    vmovss dword ptr [rsp + 48], xmm1
```
For the register-class Point3f, the result of the inner call is immediately the argument of the outer one; with scaleR, each intermediate result must first be written to the stack so its address can be taken. For the MEMORY-class Point3d, there is stack traffic in both variants:
```asm
# Point3d result = scale(scale(Point3d{1, 2, 3}));
    sub rsp, 112
    # the argument area is [rsp, rsp+24).
    vmovaps xmm0, xmmword ptr [rip + .LCPI4_0] # xmm0 = [1.000000e+00,2.000000e+00]
    vmovaps xmmword ptr [rsp + 32], xmm0
    movabs rax, 4613937818241073152 # 0x4008000000000000 = 3.0
    mov qword ptr [rsp + 48], rax
    mov rax, qword ptr [rsp + 48]
    mov qword ptr [rsp + 16], rax
    vmovaps xmm0, xmmword ptr [rsp + 32]
    vmovups xmmword ptr [rsp], xmm0
    # the first result goes to [rsp+64, rsp+88).
    lea rdi, [rsp + 64]             # rdi holds the address for the result.
    call scale(Point3d)
    # the result is copied back into the argument area [rsp, rsp+24).
    mov rax, qword ptr [rsp + 80]   # copy z of the result into z of the argument.
    mov qword ptr [rsp + 16], rax
    vmovups xmm0, xmmword ptr [rsp + 64] # copy [x, y] of the result into [x, y] of the argument.
    vmovups xmmword ptr [rsp], xmm0
    # the second result goes to [rsp+88, rsp+112).
    lea rbx, [rsp + 88]
    mov rdi, rbx
    call scale(Point3d)

# Point3d result = scaleR(scaleR(Point3d{1, 2, 3}));
    sub rsp, 72
    # the temporary is built once at [rsp, rsp+24); no copies needed.
    vmovaps xmm0, xmmword ptr [rip + .LCPI4_0]
    vmovaps xmmword ptr [rsp], xmm0
    movabs rax, 4613937818241073152
    mov qword ptr [rsp + 16], rax
    lea r14, [rsp + 24]
    mov rsi, rsp                    # the argument at [rsp, rsp+24).
    mov rdi, r14                    # the result at [rsp+24, rsp+48).
    call scaleR(Point3d const&)
    lea rbx, [rsp + 48]
    mov rdi, rbx                    # the result at [rsp+48, rsp+72).
    mov rsi, r14                    # the argument at [rsp+24, rsp+48).
    call scaleR(Point3d const&)
```
As you can see, in the by-value version every call is preceded by copying the previous result into the argument area; in the by-reference version only addresses are shuffled. Neither variant is free. Time to measure. Here pf() and pd() are NOINLINE functions returning a Point3f and a Point3d respectively, so the compiler cannot look inside them.
Code | Cycles per iteration |
---|---
auto r = pf(); | 6.7 |
auto r = scale(pf()); | 11.1 |
auto r = scaleR(pf()); | 12.6 |
auto r = scale(scale(pf())); | 18.2 |
auto r = scaleR(scaleR(pf())); | 18.3 |
auto r = scale(scale(scale(pf()))); | 16.8 |
auto r = scaleR(scaleR(scaleR(pf()))); | 20.2 |
auto r = pd(); | 7.3 |
auto r = scale(pd()); | 11.7 |
auto r = scaleR(pd()); | 11.0 |
auto r = scale(scale(pd())); | 16.9 |
auto r = scaleR(scaleR(pd())); | 14.1 |
auto r = scale(scale(scale(pd()))); | 21.2 |
auto r = scaleR(scaleR(scaleR(pd()))); | 17.2 |
INLINE | 8.1 — 8.9 |
For integer points the picture is the same: struct Point3i { int32_t x, y, z; } behaves like Point3f, and struct Point3ll { int64_t x, y, z; } like Point3d. What matters is not float versus int, but whether the aggregate fits into registers — three 64-bit ints do not. Accordingly, struct Point2ll { int64_t x, y; } measures like Point3f, and struct Point4ll { int64_t x, y, z, a; } like Point3d, give or take.
Findings:
- Structures that fit into registers are cheap to pass and return by value.
- If a structure goes through memory, passing it by const reference avoids a copy and is usually cheaper.
- With inlining, most of the difference disappears, and __restrict can recover vectorization for output parameters. These considerations matter mainly for functions that will not be inlined — as always, measure before optimizing (see the sketch after this list).
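The same findings condensed into signatures (a summary sketch of my own, not code from the article; scaleBig and addTo are hypothetical names):

```cpp
Point3f scale(Point3f p);               // fits into registers: pass by value.
Point3d scaleBig(const Point3d& p);     // MEMORY class: const reference saves a copy.
void addTo(St& __restrict dst, const St& src); // out-parameter: __restrict enables
                                               // vectorization without a return copy.
```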
optional
What does std::optional cost when passed around? At the time of writing, std::optional (like boost::optional) was still experimental; the listings below were obtained with "x86-64 clang (experimental concepts)" on godbolt, and MSVC shows a similar picture. Compare the library type with a hand-made analogue:
```cpp
struct Point { float x, y; };
using OptPoint1 = optional<Point>;

struct OptPoint2 {
    float x, y;
    union { char _; bool d; };
};
```
As we are about to see, OptPoint1 is passed and returned through memory, while OptPoint2 travels entirely in registers.
```cpp
OptPoint1 foo(OptPoint1 s) { return Point{s->x + 1, s->y + 1}; }
OptPoint2 foo(OptPoint2 s) { return {s.x + 1, s.y + 1, true}; }
...
OptPoint1 s1{Point{1, 2}};
OptPoint2 s2{3, 4, true};
auto result1 = foo(s1);
auto result2 = foo(s2);
```
```asm
.LCPI0_0:
    .long 1065353216                # float 1
foo(std::optional<Point>):          # @foo(std::optional<Point>)
    vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
    # the argument's address is in rsi.
    vaddss xmm1, xmm0, dword ptr [rsi] # load x and add 1.
    vaddss xmm0, xmm0, dword ptr [rsi + 4] # load y and add 1.
    vmovss dword ptr [rdi], xmm1    # store x into the result, whose address came in rdi.
    vmovss dword ptr [rdi + 4], xmm0 # store y.
    mov byte ptr [rdi + 8], 1       # optional::has_value().
    mov rax, rdi                    # return the address of the result.
    ret
.LCPI1_0:
    .long 1065353216                # float 1
    .long 1065353216                # float 1
    .zero 4
    .zero 4
foo(OptPoint2):                     # @foo(OptPoint2)
    vaddps xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # add [1, 1] to [x, y]; the result stays in xmm0.
    mov al, 1                       # d; al is the low 8 bits of rax.
    ret
.LCPI2_0:
    .long 1077936128                # float 3
    .long 1082130432                # float 4
    .zero 4
    .zero 4
main:                               # @main
    push rbx
    sub rsp, 64
    movabs rax, 4611686019492741120 # 0x400000003F800000: x and y packed together.
    mov qword ptr [rsp + 32], rax   # x, y onto the stack.
    mov byte ptr [rsp + 40], 1      # optional::has_value().
    lea rbx, [rsp + 48]
    lea rsi, [rsp + 32]             # the address of the argument.
    mov rdi, rbx                    # the address for the result.
    call foo(std::optional<Point>)
    # the second variant needs no memory at all.
    vmovaps xmm0, xmmword ptr [rip + .LCPI2_0] # xmm0 = <3,4,u,u>
    mov edi, 1                      # bool d.
    call foo(OptPoint2)
    vmovlps qword ptr [rsp + 16], xmm0 # store the result.
    mov byte ptr [rsp + 24], al     # al is the low 8 bits of rax.
```
However, if foo is allowed to be inlined, the calling convention stops mattering, and the by-value and by-reference variants compile to the same code:
```asm
# OptPoint1 foo(OptPoint1) and OptPoint1 foo(const OptPoint1&)
    vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
    vaddss xmm1, xmm0, dword ptr [rsp + 32]
    vaddss xmm0, xmm0, dword ptr [rsp + 36]
    vmovss dword ptr [rsp + 8], xmm1
    vmovss dword ptr [rsp + 12], xmm0
    mov byte ptr [rsp + 16], 1

# OptPoint2 foo(OptPoint2) and OptPoint2 foo(const OptPoint2&)
    vmovsd xmm0, qword ptr [rsp + 48] # xmm0 = mem[0],zero
    vaddps xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
    vmovlps qword ptr [rsp + 8], xmm0
    mov byte ptr [rsp + 16], 1
```
The difference nearly disappears: both variants now work through the stack, and OptPoint2 merely saves a scalar addition by using a vector one. The conclusion: while everything is inlined, std::optional costs almost nothing; across a real call boundary it forces the argument and the result through memory.
Virtual calls
Virtual functions are commonly believed to be expensive, so let's see what a virtual call actually consists of and compare it with a regular and an inlined call. Consider a small interface with a single implementation:
```cpp
struct Fn {
    virtual ~Fn() noexcept = default;
    virtual int call(int x) const = 0;
};

struct Add final : Fn {
    Add(int a) : a(a) {}
    int call(int x) const override { return a + x; }
    int a;
};

NOINLINE bool isFixedPoint(const Fn& fn, int x) { return fn.call(x) == x; }

int main() {
    Add add{32};
    bool result = isFixedPoint(add, 10);
}
```
Let's walk through the generated code.
```asm
Add::call(int) const:               # @Add::call(int) const
    # rdi holds this; the field a lives at [rdi + 8], past the vtable pointer.
    add esi, dword ptr [rdi + 8]
    mov eax, esi                    # the 32-bit result is returned in eax.
    ret

# The vtable for Add. The pointer stored in objects points 16 bytes in,
# past the offset-to-top and RTTI entries.
vtable for Add:
    .quad 0
    .quad typeinfo for Add          # RTTI, at offset -8 from the object's vtable pointer.
    .quad Fn::~Fn()                 # the destructor entries, starting at offset 0.
    .quad Add::~Add()
    .quad Add::call(int) const      # the call method, at offset 16.

isFixedPoint(Fn const&, int):       # @isFixedPoint(Fn const&, int)
    push rbx                        # rbx is callee-saved, so preserve it.
    mov ebx, esi                    # stash the 32-bit x.
    mov rax, qword ptr [rdi]        # rdi points to the Fn object; load its vtable pointer.
    call qword ptr [rax + 16]       # indirect call of Add::call through the vtable.
    cmp eax, ebx                    # compare the result of call with x...
    sete al                         # ...and materialize the bool in the 8-bit al.
    pop rbx                         # restore rbx.
    ret

main:                               # @main
    sub rsp, 40
    mov qword ptr [rsp + 24], vtable for Add+16 # the object's vtable pointer: 16 bytes in, past the RTTI entries.
    mov dword ptr [rsp + 32], 32    # the add.a field.
    lea rdi, [rsp + 24]             # the address of the object.
    mov esi, 10                     # x.
    call isFixedPoint(Fn const&, int)
    mov byte ptr [rsp + 15], al     # store the result.
    ...
    mov rdi, rax                    # the exception cleanup path.
    call _Unwind_Resume
    mov rdi, rax
    call _Unwind_Resume
```
Note the vtable layout: the RTTI pointer sits just below the methods, and there is a pair of destructor entries. The NOINLINE on isFixedPoint is essential here: without it, the compiler would devirtualize the call, inline everything and reduce the answer (false) to a compile-time constant. For a fair comparison, here is a non-virtual version; call is marked NOINLINE so that it does not evaporate either.
```cpp
struct Add final {
    Add(int a) : a(a) {}
    NOINLINE int call(int x) const { return a + x; }
    int a;
};

template<typename T>
NOINLINE bool isFixedPoint(const T& fn, int x) { return fn.call(x) == x; }
```
```asm
Add::call(int) const:               # @Add::call(int) const
    add esi, dword ptr [rdi]        # no vtable anymore: this in rdi points straight at the field a.
    mov eax, esi
    ret
bool isFixedPoint<Add>(Add const&, int): # @bool isFixedPoint<Add>(Add const&, int)
    push rbx
    mov ebx, esi                    # Add::call is called directly, without any table lookups.
    call Add::call(int) const
    cmp eax, ebx
    sete al
    pop rbx
    ret
main:                               # @main
    sub rsp, 24
    mov dword ptr [rsp + 8], 32     # add.a.
    lea rdi, [rsp + 8]              # no vtable pointer: rdi points directly at add.a.
    mov esi, 10
    call bool isFixedPoint<Add>(Add const&, int)
    mov byte ptr [rsp + 7], al
    ...
    ret
```
The code is almost identical, minus the vtable load; the call is direct. Without the NOINLINE markers, everything would have been computed at compile time. For the benchmark, 1000 Add objects are created and isFixedPoint is invoked for each of them.
Code | Cycles per iteration |
---|---
virtual call, isFixedPoint inlined | 5267 |
virtual call, NOINLINE isFixedPoint | 10721 |
virtual call, INLINE isFixedPoint | 8291 |
devirtualized call, NOINLINE isFixedPoint | 10571 |
non-virtual NOINLINE call, NOINLINE isFixedPoint | 10536 |
non-virtual call, isFixedPoint inlined | 4505 |
non-virtual INLINE call, INLINE isFixedPoint | 4531 |
Findings:
- The virtual call itself is cheap: one extra memory load.
- The main cost is indirect: a virtual function cannot be inlined while the dynamic type is unknown.
- When the compiler can prove the dynamic type, it devirtualizes and even inlines the call, and the difference disappears (see the example below).
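For instance, when the static type is visible at the call site, the final specifier on Add lets the compiler skip the vtable entirely (my example, consistent with the findings above):

```cpp
Add add{32};
// The compiler knows the dynamic type here, and Add is 'final', so this
// call can be devirtualized and inlined: no table lookup remains.
bool r = add.call(10) == 10;
```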
A remark about inline: the INLINE macro used here is not the inline keyword. A function whose body lives in another cpp file cannot be inlined without LTO — that is exactly what NOINLINE simulates; the inline keyword by itself is merely a hint.
Tail call and recursion
When the last thing a function does is call another function, the call instruction can be replaced with jmp: the callee then returns directly to the original caller, and no extra stack frame is kept. The same machinery turns tail recursion into ordinary loops. clang does both. Let's start with recursion — exponentiation by squaring:
```cpp
double exp_by_squaring(double x, int n, double y = 1) {
    if (n < 0) return exp_by_squaring(1.0 / x, -n, y);
    if (n == 0) return y;
    if (n == 1) return x * y;
    if (n % 2 == 0) return exp_by_squaring(x * x, n / 2, y);
    return exp_by_squaring(x * x, (n - 1) / 2, x * y);
}
```
We get:
```asm
.LCPI0_0:
    .quad 4607182418800017408       # double 1
exp_by_squaring(double, int, double): # @exp_by_squaring(double, int, double)
    vmovsd xmm2, qword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero
    vmovapd xmm3, xmm0
    test edi, edi
    jns .LBB0_4
    jmp .LBB0_3
.LBB0_9:                            # in Loop: Header=BB0_4 Depth=1
    shr edi
    vmovapd xmm3, xmm0
    test edi, edi
    jns .LBB0_4
.LBB0_3:                            # =>This Inner Loop Header: Depth=1
    vdivsd xmm3, xmm2, xmm3
    neg edi
    test edi, edi
    js .LBB0_3
.LBB0_4:                            # =>This Inner Loop Header: Depth=1
    je .LBB0_7
    cmp edi, 1
    je .LBB0_6
    vmulsd xmm0, xmm3, xmm3
    test dil, 1
    je .LBB0_9
    lea eax, [rdi - 1]
    shr eax, 31
    lea edi, [rdi + rax]
    add edi, -1
    sar edi
    vmulsd xmm1, xmm3, xmm1
    vmovapd xmm3, xmm0
    test edi, edi
    jns .LBB0_4
    jmp .LBB0_3
.LBB0_6:
    vmulsd xmm1, xmm3, xmm1
.LBB0_7:
    vmovapd xmm0, xmm1
    ret
```
Not a single call is left: the compiler turned every recursive branch into a loop, so the stack does not grow. In my measurements, a hand-written loop differs from the recursive version by roughly 10%.
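For comparison, a hand-written iterative version (my sketch, not from the original article); the loops the compiler generated above are essentially this:

```cpp
double exp_by_squaring_iter(double x, int n) {
    double y = 1;
    if (n < 0) { x = 1.0 / x; n = -n; }
    while (n > 1) {
        if (n % 2 != 0) { y *= x; --n; } // peel off one factor for odd n.
        x *= x;                          // square the base...
        n /= 2;                          // ...and halve the exponent.
    }
    return n == 0 ? y : x * y;
}
```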
Now plain tail calls.
```cpp
int64_t sum(int64_t x, int64_t y) { return x + y; }

int64_t add1(int64_t x) { return sum(x, 1); }
int64_t add2(int64_t x) { return sum(1, x); }
int64_t add3(int64_t x) { return sum(-1, x) + 2; }
```
```asm
sum(long, long):                    # @sum(long, long)
    lea rax, [rdi + rsi]
    ret
add1(long):                         # @add1(long)
    mov esi, 1                      # the second argument is set...
    jmp sum(long, long)             # TAILCALL ...and we simply jump.
add2(long):                         # @add2(long)
    mov rax, rdi                    # the arguments have to be swapped.
    mov edi, 1
    mov rsi, rax
    jmp sum(long, long)             # TAILCALL
add3(long):                         # @add3(long)
    push rax                        # rax is pushed only to keep the stack aligned.
    mov rax, rdi                    # the arguments are shuffled as in add2, and then...
    mov rdi, -1
    mov rsi, rax
    call sum(long, long)
    add rax, 2                      # ...2 is added to the result; no tail call is possible.
    pop rcx
    ret
```
Where the result of sum is returned unchanged, call is replaced by jmp (the compiler even marks it TAILCALL). In add3, the result must be modified after sum returns, so a regular call remains, together with the push/pop needed to keep the stack aligned. The measured difference between the variants is around 10%.
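In this particular case the tail call in add3 is easy to restore by folding the constant into the arguments (my rewrite, not from the article):

```cpp
// (-1 + x) + 2 == 1 + x, so the addition after the call can be eliminated
// and the call becomes a tail call again.
int64_t add3_tail(int64_t x) { return sum(1, x); }
```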
Findings:
- A call in tail position costs nothing extra: it compiles to a jmp.
- Tail recursion is turned into loops, so the stack does not grow.
- Any work performed after the call (like the + 2 in add3) blocks the optimization.
Initialization
Finally, let's look at what the initialization of structures, arrays and vectors costs. Take a 2D point in three flavors:
```cpp
struct Point { double x, y; };
struct ZeroPoint { double x{}, y{} ; };
struct NanPoint { double x{quietNaN}, y{quietNaN}; };
Point leaves its fields uninitialized. ZeroPoint initializes them to zero, and zero is special. IEEE 754-1985 says:
The number zero is represented specially: sign = 0 for positive zero, 1 for negative zero; biased exponent = 0; fraction = 0;
That is, positive zero is the all-zeroes bit pattern, so doubles can be zeroed with a plain memset. NanPoint initializes its fields with numeric_limits<double>::quiet_NaN(), whose bit pattern is anything but zero, so no such shortcut exists.
```cpp
Point data;
```
```asm
    sub rsp, 24
```
Nothing to initialize: the compiler merely reserves stack space.
```cpp
ZeroPoint data;
Point data{};
```
Both variants generate identical code:
```asm
    sub rsp, 40
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp + 16], xmm0
```
Sixteen bytes are zeroed with a single store. xmm0 is cleared by XORing it with itself (vxorps) — the standard idiom for producing zero without loading a constant from memory.
```cpp
NanPoint data;
```
```asm
    sub rsp, 40
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    vmovaps xmmword ptr [rsp + 16], xmm0
```
The NaN pattern, unlike zero, has to be loaded from memory.
Moving on to arrays, the following sizes will be used:

```cpp
static constexpr size_t smallSize = 8;
static constexpr size_t bigSize = 321;
extern size_t smallUnknownSize;
extern size_t bigUnknownSize;
```
Start with the small array.
```cpp
array<Point, smallSize> data;
```
As before, only a stack reservation:
```asm
    sub rsp, 136
```
All three zero-initializing declarations produce the same code:

```cpp
array<ZeroPoint, smallSize> data;
array<ZeroPoint, smallSize> data{};
array<Point, smallSize> data{};
```
```asm
    sub rsp, 192
    vxorps ymm0, ymm0, ymm0
    vmovaps ymmword ptr [rsp + 128], ymm0
    vmovaps ymmword ptr [rsp + 96], ymm0
    vmovaps ymmword ptr [rsp + 64], ymm0
    vmovaps ymmword ptr [rsp + 32], ymm0
```
The zeroing is done with 256-bit ymm stores, two points per instruction.
```cpp
array<NanPoint, smallSize> data;
array<NanPoint, smallSize> data{};
```
```asm
    sub rsp, 136
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    vmovups xmmword ptr [rsp + 24], xmm0
    vmovups xmmword ptr [rsp + 8], xmm0
    vmovups xmmword ptr [rsp + 56], xmm0
    vmovups xmmword ptr [rsp + 40], xmm0
    vmovups xmmword ptr [rsp + 88], xmm0
    vmovups xmmword ptr [rsp + 72], xmm0
    vmovups xmmword ptr [rsp + 120], xmm0
    vmovups xmmword ptr [rsp + 104], xmm0
```
The NaN pattern is written one 16-byte point per instruction. Now the big array.
```cpp
array<Point, bigSize> data;
```
```asm
    sub rsp, 5144
```
Still just a reservation.
```cpp
array<ZeroPoint, bigSize> data;
array<ZeroPoint, bigSize> data{};
array<Point, bigSize> data{};
```
```asm
    sub rsp, 5152
    lea rbx, [rsp + 16]
    xor esi, esi
    mov edx, 5136
    mov rdi, rbx
    call memset                     # memset(rsp+16, 0, 5136).
```
This time the compiler calls memset. Note how its arguments are passed — in rdi, esi, edx, as the ABI prescribes — and that the 32-bit halves esi and edx are written instead of rsi and rdx: writing to the lower 32-bit (e) half of a register automatically zeroes the upper half of the 64-bit (r) register.
```cpp
array<NanPoint, bigSize> data;
```
```asm
    sub rsp, 5144
    lea rax, [rsp + 8]
    lea rcx, [rsp + 5144]
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    # the fill loop.
.LBB0_1:                            # =>This Inner Loop Header: Depth=1
    vmovups xmmword ptr [rax], xmm0
    vmovups xmmword ptr [rax + 16], xmm0
    vmovups xmmword ptr [rax + 32], xmm0
    add rax, 48
    cmp rax, rcx
    jne .LBB0_1
```
No memset here: an unrolled loop writes NaNs, three points (48 bytes) per iteration. Since 321 is divisible by 3, no tail handling is needed; rax is the current pointer and rcx the end pointer.
```cpp
array<NanPoint, bigSize> data{};
```
```asm
    sub rsp, 5152
    lea rbx, [rsp + 16]
    xor esi, esi
    mov edx, 5136
    mov rdi, rbx
    call memset                     # memset(rsp+16, 0, 5136).
    lea rax, [rsp + 5152]
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    # and then the same memory is overwritten with NaNs.
.LBB0_1:                            # =>This Inner Loop Header: Depth=1
    vmovups xmmword ptr [rbx], xmm0
    vmovups xmmword ptr [rbx + 16], xmm0
    vmovups xmmword ptr [rbx + 32], xmm0
    add rbx, 48
    cmp rbx, rax
    jne .LBB0_1
```
A double job: the array is first zeroed with memset and then overwritten with NaNs. Value-initialization ({}) first zeroes the whole aggregate and only then runs the members' initializers, and the compiler does not merge the two passes. Let's move on to vectors.
```cpp
vector<Point> data(smallSize);
```
```asm
    sub rsp, 40
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp], xmm0
    mov qword ptr [rsp + 16], 0
    mov edi, 128
    call operator new(unsigned long) # new(128).
    mov qword ptr [rsp], rax
    mov rcx, rax
    sub rcx, -128
    mov qword ptr [rsp + 16], rcx
    vxorps xmm0, xmm0, xmm0         # zero the elements in place.
    vmovups xmmword ptr [rax + 16], xmm0
    vmovups xmmword ptr [rax], xmm0
    vmovups xmmword ptr [rax + 32], xmm0
    vmovups xmmword ptr [rax + 48], xmm0
    vmovups xmmword ptr [rax + 64], xmm0
    vmovups xmmword ptr [rax + 80], xmm0
    vmovups xmmword ptr [rax + 96], xmm0
    vmovups xmmword ptr [rax + 112], xmm0
```
The memory is allocated with operator new and the elements are then zeroed in place: the vector value-initializes its elements, performing T{} for each one. ZeroPoint and NanPoint produce analogous code with their own fill patterns. The big size is more interesting:
```cpp
vector<Point> data(bigSize);
```
```asm
main:                               # @main
    push rbx
    sub rsp, 48
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp], xmm0
    mov qword ptr [rsp + 16], 0
    mov edi, 5136
    call operator new(unsigned long) # new(5136).
    mov qword ptr [rsp], rax
    mov qword ptr [rsp + 8], rax
    mov rcx, rax
    add rcx, 5136
    mov qword ptr [rsp + 16], rcx
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp + 32], xmm0
    xor edx, edx
    # the fill loop.
.LBB0_2:                            # =>This Inner Loop Header: Depth=1
    vmovaps xmm0, xmmword ptr [rsp + 32]
    vmovups xmmword ptr [rax + rdx], xmm0
    vmovaps xmm0, xmmword ptr [rsp + 32]
    vmovups xmmword ptr [rax + rdx + 16], xmm0
    vmovaps xmm0, xmmword ptr [rsp + 32]
    vmovups xmmword ptr [rax + rdx + 32], xmm0
    add rdx, 48
    cmp rdx, 5136
    jne .LBB0_2
    mov qword ptr [rsp + 8], rcx
    mov rax, rsp
```
Note the oddity: inside the fill loop, the zero pattern is reloaded from the stack ([rsp + 32]) before every store. The NanPoint version keeps the pattern in a register:
```cpp
vector<NanPoint> data(bigSize);
```
```asm
main:                               # @main
    push rbx
    sub rsp, 32
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp], xmm0
    mov qword ptr [rsp + 16], 0
    mov edi, 5136
    call operator new(unsigned long) # new(5136).
    mov qword ptr [rsp], rax
    mov qword ptr [rsp + 8], rax
    mov rcx, rax
    add rcx, 5136
    mov qword ptr [rsp + 16], rcx
    xor edx, edx
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    # fill with NaNs; the pattern stays in a register.
.LBB0_2:                            # =>This Inner Loop Header: Depth=1
    vmovups xmmword ptr [rax + rdx], xmm0
    vmovups xmmword ptr [rax + rdx + 16], xmm0
    vmovups xmmword ptr [rax + rdx + 32], xmm0
    add rdx, 48
    cmp rdx, 5136
    jne .LBB0_2
    mov qword ptr [rsp + 8], rcx
    mov rax, rsp
```
And ZeroPoint simply collapses into a memset:
```cpp
vector<ZeroPoint> data(bigSize);
```
```asm
    sub rsp, 32
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp], xmm0
    mov qword ptr [rsp + 16], 0
    mov edi, 5136
    call operator new(unsigned long) # new(5136).
    mov qword ptr [rsp], rax
    mov rbx, rax
    add rbx, 5136
    mov qword ptr [rsp + 16], rbx
    xor esi, esi
    mov edx, 5136
    mov rdi, rax
    call memset                     # memset(&data, 0, 5136).
```
memset again. It is unclear, by the way, why the compiler did not recognize the same memset opportunity for vector<Point>. Now let's make the size unknown at compile time and use bigUnknownSize:
```cpp
vector<NanPoint> data(bigUnknownSize);
```
```asm
    sub rsp, 32
    mov rbx, qword ptr [rip + bigUnknownSize]
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp], xmm0
    mov qword ptr [rsp + 16], 0
    test rbx, rbx                   # if bigUnknownSize == 0, skip operator new.
    je .LBB0_1
    mov rax, rbx
    shr rax, 60
    jne .LBB0_3
    mov rdi, rbx
    shl rdi, 4                      # multiply the element count by 16 (shift left by 4).
    call operator new(unsigned long) # new(bigUnknownSize*16).
    jmp .LBB0_6
.LBB0_1:
    xor eax, eax
.LBB0_6:
    mov rcx, rbx
    shl rcx, 4
    add rcx, rax
    mov qword ptr [rsp], rax
    mov qword ptr [rsp + 8], rax
    mov qword ptr [rsp + 16], rcx
    test rbx, rbx                   # if bigUnknownSize == 0, skip the initialization.
    je .LBB0_14
    lea rdx, [rbx - 1]
    mov rsi, rbx
    and rsi, 7
    je .LBB0_10
    neg rsi
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    # fill the first 0-7 elements, one point at a time.
.LBB0_9:                            # =>This Inner Loop Header: Depth=1
    vmovups xmmword ptr [rax], xmm0
    dec rbx
    add rax, 16
    inc rsi
    jne .LBB0_9
.LBB0_10:
    cmp rdx, 7
    jb .LBB0_13
    vmovaps xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = [nan,nan]
    # the main loop: 8 points per iteration.
.LBB0_12:                           # =>This Inner Loop Header: Depth=1
    vmovups xmmword ptr [rax], xmm0
    vmovups xmmword ptr [rax + 16], xmm0
    vmovups xmmword ptr [rax + 32], xmm0
    vmovups xmmword ptr [rax + 48], xmm0
    vmovups xmmword ptr [rax + 64], xmm0
    vmovups xmmword ptr [rax + 80], xmm0
    vmovups xmmword ptr [rax + 96], xmm0
    vmovups xmmword ptr [rax + 112], xmm0
    sub rax, -128
    add rbx, -8
    jne .LBB0_12
.LBB0_13:
    mov rax, rcx
.LBB0_14:
    mov qword ptr [rsp + 8], rax
    mov rax, rsp
```
The code has grown: the loop at .LBB0_9 first fills size % 8 points one at a time, after which the main loop fills eight points per iteration.
vector<Point> generates similar code, while vector<ZeroPoint> once again reduces to a memset:
```cpp
vector<ZeroPoint> data(bigUnknownSize);
```
```asm
    sub rsp, 40
    mov rbx, qword ptr [rip + bigUnknownSize]
    vxorps xmm0, xmm0, xmm0
    vmovaps xmmword ptr [rsp], xmm0
    mov qword ptr [rsp + 16], 0
    test rbx, rbx                   # if bigUnknownSize == 0, skip operator new.
    je .LBB0_1
    mov rax, rbx
    shr rax, 60
    jne .LBB0_3
    mov rdi, rbx
    shl rdi, 4
    call operator new(unsigned long) # new(bigUnknownSize*16).
    jmp .LBB0_6
.LBB0_1:
    xor eax, eax
.LBB0_6:
    mov rdx, rbx
    shl rdx, 4
    lea r14, [rax + rdx]
    mov qword ptr [rsp], rax
    mov qword ptr [rsp + 8], rax
    mov qword ptr [rsp + 16], r14
    test rbx, rbx                   # if bigUnknownSize == 0, skip the memset.
    je .LBB0_8
    xor esi, esi
    mov rdi, rax
    call memset                     # memset(&data, 0, bigUnknownSize*16).
    mov rax, r14
.LBB0_8:
    mov qword ptr [rsp + 8], rax
    mov rax, rsp
```
By the way, the variant

```cpp
vector<NanPoint> data;
data.resize(bigUnknownSize);
```

produces around 250 lines of assembly, which I will spare you. Also keep in mind that operator new is called in every variant, and its cost is included in the measurements below.
Code | Cycles per iteration |
---|---
Point p; | 4.5 |
ZeroPoint p; | 5.2 |
NanPoint p; | 4.5 |
array<Point, smallSize> p; | 4.5 |
array<ZeroPoint, smallSize> p; | 6.7 |
array<NanPoint, smallSize> p; | 6.7 |
array<Point, bigSize> p; | 4.5 |
array<ZeroPoint, bigSize> p; | 296.0 |
array<NanPoint, bigSize> p; | 391.0 |
array<Point, bigSize> p{}; | 292.0 |
array<NanPoint, bigSize> p{}; | 657.0 |
vector<Point> p(smallSize); | 32.3 |
vector<ZeroPoint> p(smallSize); | 33.8 |
vector<NanPoint> p(smallSize); | 33.8 |
vector<Point> p(bigSize); | 323.0 |
vector<ZeroPoint> p(bigSize); | 308.0 |
vector<NanPoint> p(bigSize); | 281.0 |
vector<ZeroPoint> p(smallUnknownSize); | 44.1 |
vector<NanPoint> p(smallUnknownSize); | 37.6 |
vector<Point> p(bigUnknownSize); | 311.0 |
vector<ZeroPoint> p(bigUnknownSize); | 315.0 |
vector<NanPoint> p(bigUnknownSize); | 290.0 |
vector<NanPoint> p; p.resize(bigUnknownSize); | 315.0 |
Findings:
- Declaring uninitialized variables and arrays is practically free.
- Zero-initialization is the cheapest kind: the compiler can use vectorized stores or memset for it.
- Filling with a non-zero pattern such as NaN is somewhat more expensive for large arrays.
- Be careful with value-initialization ({}) of aggregates whose members have their own initializers: you can end up paying both for the zeroing and for the member initializers.
- For vectors, the allocation itself dominates the cost; the choice of fill pattern matters less.
- As always, measure before drawing conclusions.