Go Compiler: SSA Optimization Rules Description Language


The gc compiler uses a special Lisp-like domain-specific language (DSL) to describe its Static Single Assignment (SSA) optimization rules.


In this article we will examine the main elements of this language, along with its features and limitations.
As an exercise, we will teach the Go compiler to generate an instruction it did not emit before, optimizing the expression a*b+c.


This is the first article in a series about the internals of the Go SSA compiler backend, so in addition to reviewing the rules DSL, we will look at related components, to build the foundation needed for the next parts.


Introduction


The Go compiler frontend ends with the generation of the SSA form from the annotated AST. The functions responsible for this conversion can be found in cmd/compile/internal/gc/ssa.go. The entry point to the SSA backend is the ssa.Compile function, defined in the cmd/compile/internal/ssa/compile.go file.


Terminology
  * Compiler frontend - parsing and lexical analysis, sometimes type checking; the intermediate representation is close to the source code, usually an annotated AST.
  * Compiler backend - lower-level optimizations and intermediate representations, code generation.
  * Form - nearly a synonym for "expression". In Lisp, "form" is a common way to refer to a program element, be it a list or an atom.
  * Optimization pass (phase) - execution of a specific algorithm over the program. The word "pass" is somewhat ambiguous, because one phase can perform several passes and/or use code shared with other phases.

If you come across a term that is still unclear as you read the article, please report it; perhaps it will be added to this table.


The Go SSA optimizer consists of several phases, each of which performs passes over the compiled function. Some phases use so-called "rewrite rules": rules for converting some SSA sequences into others that are potentially more optimal.


Transformation rules are described using S-expressions. The elements of these expressions are ssa.Value nodes. In the simplest case, these rules replace one ssa.Value with another.


For example, the code below collapses the multiplication of 8-bit constants:


 (Mul8 (Const8 [c]) (Const8 [d])) -> (Const8 [int64(int8(c*d))]) 
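The int64(int8(c*d)) conversion on the right-hand side matters: AuxInt is stored as an int64, so the product has to be truncated to 8 bits first to reproduce the wraparound the program would observe at run time. A small self-contained sketch (the fold8 helper is mine, not part of the compiler):

```go
package main

import "fmt"

// fold8 mirrors the right-hand side of the rule above: it folds
// the product of two 8-bit constants, truncating to 8 bits and
// widening back to the int64 AuxInt representation.
func fold8(c, d int64) int64 {
	return int64(int8(c * d))
}

func main() {
	// 100*3 = 300 does not fit in int8 and wraps around to 44,
	// exactly as the multiplication would behave on the target.
	fmt.Println(fold8(100, 3)) // 44
}
```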

There are two main categories of SSA values: high-level ones, almost independent of the target machine, and architecture-specific ones (usually mapped 1-to-1 to machine instructions).


Optimizations are described in terms of these two categories: first the high-level rules common to all architectures, then the platform-specific ones.


All code associated with the rules lives in cmd/compile/internal/ssa/gen. We will consider two sets:


  1. genericOps.go - machine-independent operations.
  2. AMD64Ops.go - operations specific to GOARCH=AMD64 (64-bit x86).

After the first few phases, which operate on the abstract machine, so-called lowering is performed, transitioning from genericOps to the operation set of a concrete architecture; in our example, AMD64Ops. After this point, all subsequent phases operate on the representation from the second category.


After the optimizer, the code generator comes into play. For AMD64, the code generation implementation can be found in the cmd/compile/internal/amd64 package. The code generator's task is to translate ssa.Block and ssa.Value nodes into a sequence of obj.Prog instructions passed to the assembler. The assembler produces machine code, which will be ready for execution after linking.


Optimization rules


If files with operation definitions are named "${ARCH}Ops.go", then the optimization rules are placed in "${ARCH}.Rules".


High-level rules perform simple transformations, most of the constant folding, as well as some transformations that simplify subsequent processing.


Each file with low-level rules consists of two parts:


  1. Lowering - replacing abstract operations with machine equivalents.
  2. The optimizations themselves.

An example of lowering an operation to a machine one:


 (Const32 [val]) -> (MOVLconst [val]) // L - long,  32-bit
 (Const64 [val]) -> (MOVQconst [val]) // Q - quad,  64-bit
       |                  |
   generic op         AMD64 op

It is in the low-level rules that the bulk of the important optimizations is performed, such as operation strength reduction, partial inlining, and exploiting the memory addressing modes available in the processor.


Operations have a mnemonic name, usually called an opcode. The opcodes of architecture-dependent operations, as a rule, mirror the names of the real instructions.


Rule Description Language Syntax


The basic grammar is described in rulegen.go :


 // rule syntax:
 //    sexpr [&& extra conditions] -> [@block] sexpr
 //
 // sexpr are s-expressions (lisp-like parenthesized groupings)
 // sexpr ::= [variable:](opcode sexpr*)
 //         | variable
 //         | <type>
 //         | [auxint]
 //         | {aux}
 //
 // aux      ::= variable | {code}
 // type     ::= variable | {code}
 // variable ::= some token
 // opcode   ::= one of the opcodes from the *Ops.go files



It is also worth mentioning that "//" comments are allowed inside .Rules files.


Let's analyze a simple example that contains all of these elements:


  Opcode=ADDLconst - addition of a 32-bit constant
  |          AuxInt=c - the constant that is added to `x`
  |          |
 (ADDLconst [c] x) && int32(c)==0 -> x
                      |               |
                      |               the replacement form
                      the rule condition (several can be joined with `&&`)

These explanatory labels are not part of a valid rule definition.

This rule converts x+0 into x. Everything inside the condition section is ordinary Go code,
limited to expressions whose result must be bool.
You can call predicates defined in rewrite.go.
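For example, a rule along these lines (slightly simplified here) appears in AMD64.Rules; it uses the isPowerOfTwo and log2 helpers from rewrite.go to turn multiplication by a power-of-two constant into a cheaper shift:

```
(MULQconst [c] x) && isPowerOfTwo(c) -> (SHLQconst [log2(c)] x)
```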


In addition to ordinary opcodes, alternation patterns that expand into several rules can be used:


 (ADD(Q|L)const [off] x:(SP)) -> (LEA(Q|L) [off] x)

 // First, the Q|L alternation is expanded:
 (ADDQconst [off] x:(SP)) -> (LEAQ [off] x)
 (ADDLconst [off] x:(SP)) -> (LEAL [off] x)

 // Then the `x` binding is expanded:
 (ADDQconst [off] (SP)) -> (LEAQ [off] (SP))
 (ADDLconst [off] (SP)) -> (LEAL [off] (SP))

(SP) is one of the operations in genericOps.go and expresses loading a pointer to the hardware stack. For architectures that have no hardware SP, it is emulated.

Variables in patterns (the S-expressions to the left of ->) have some special behavior:



 // These two rules are equivalent: both match any ADDQconst,
 // rewriting the form into itself (i.e. no real change happens):
 (ADDQconst _) -> v
 (ADDQconst x) -> (ADDQconst x)

If AuxInt is not specified (the expression in square brackets), the rule will fire on any value of AuxInt. The same applies to the {}-parameters (more on them below).


The name v refers to the outermost captured form.
For example, for the expression (ADDQconst (SUBQconst x)), the outermost form is ADDQconst.


Variables can be used multiple times, which lets you require that several parts of an S-expression match each other:


 (ADDQconst [v] (ADDQconst [v] x))
 // Matches, for example, "x+2+2" (x+v+v).

Types in rules


In some cases, you need to explicitly specify the type of the generated and/or matched form.
The type is specified in "angle brackets", like a type argument in C++ templates:


 // typ.UInt32 is the type of the BTSLconst form.
 // BSFL always produces typ.UInt32, so it needs
 // no explicit type annotation.
 (Ctz16 x) -> (BSFL (BTSLconst <typ.UInt32> [16] x))

In addition to types, there are "symbols" (or, more generally, Aux properties).


 (StaticCall [argsWidth] {target} mem) -> (CALLstatic [argsWidth] {target} mem) 


The semantics of Aux and AuxInt vary greatly from operation to operation. The best source of documentation here is the *Ops.go files. Each opcode is described by an opData structure whose aux field tells how to interpret these fields.


Types are described using the cmd/compile/internal/types package. Some types are specific to the SSA backend, such as types.TypeFlags, while others are shared between cmd/compile/internal/gc and cmd/compile/internal/ssa.


Special types


There is a special types.TypeMem type, which performs several functions at once:


  1. It allows sorting and grouping ssa.Value nodes by their memory access patterns. In particular, this guarantees the correct order of execution inside basic blocks (more on them below).
  2. It defines the state of the memory flow in the program. If an instruction modifies memory, a new SSA value of type types.TypeMem is generated as the result of that operation.

Like the special OpPhi operation, memory is treated in a special way in many phases of the optimizer.


Little about Phi

The role of Phi varies slightly from phase to phase.


At the very beginning of the SSA compilation, Phi serves its classical purpose and expresses the choice of a value depending on the execution path that led to it.


For example, if two paths jump into a block and both modify memory, the destination block receives a memory value equal to (Phi mem1 mem2). Loops also propagate Phi values.
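To make this concrete, here is a minimal Go function whose early SSA form contains a classical Phi (the variable names are mine; the exact SSA value numbering will differ):

```go
package main

import "fmt"

// At the join point after the if, the returned value is roughly
// v = Phi(0, 1): which operand is chosen depends on the
// predecessor block that transferred control there.
func pick(cond bool) int {
	v := 0
	if cond {
		v = 1
	}
	return v
}

func main() {
	fmt.Println(pick(true), pick(false)) // 1 0
}
```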


Another special type is the types.TypeFlags mentioned above. This type marks that an instruction generates CPU flags.


At the same time, instructions like ADDQ, although they generate flags, do not have the types.Flags type; they are instead marked with the clobberFlags attribute.


types.Flags is used to mark the result of instructions like CMPQ, which do not write their result to any of their explicit operands, but only update the internal processor state that can be used by the next instruction.


Instructions like SETL allow "reading" the flags and returning them as an ssa.Value that can be placed in a register.


    L - less than              G - greater than
    |                          |
 (SETL (InvertFlags x)) -> (SETG x)
          |
          a form that inverts the flags
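A tiny Go function is enough to see this flag materialization in practice; on amd64 a comparison whose result lands in a bool is typically lowered to CMPQ (which only produces flags) followed by a SETcc instruction (this describes the usual lowering, not a guarantee for every build):

```go
package main

import "fmt"

// On amd64, `a < b` stored into a bool usually compiles to
// CMPQ (producing flags, no register result) followed by SETL
// (reading the flags into a byte register).
func less(a, b int64) bool {
	return a < b
}

func main() {
	fmt.Println(less(1, 2), less(2, 1)) // true false
}
```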

SSA Inspection Program


Suppose we have the following program (example.go):


 package example

 func fusedMulAdd(a, b, c float64) float64 {
     return a*c + b
 }

We can view the SSA code that is generated for the fusedMulAdd function:


 $ GOSSAFUNC=fusedMulAdd go tool compile example.go > ssa.txt 

Now check the working (current) directory: in addition to ssa.txt, an ssa.html file has appeared there.



Machine code for fusedMulAdd

The ~r3 symbol has been renamed to ret for readability.


 v7  (4) MOVSD a(SP), X0
 v11 (4) MOVSD c+16(SP), X1
 v12 (4) MULSD X1, X0
 v6  (4) MOVSD b+8(SP), X1
 v13 (4) ADDSD X1, X0
 v15 (4) MOVSD X0, ret+24(SP)
 b1  (4) RET

This is what the SSA program for fusedMulAdd looks like after the lower phase (ssa.html):



The SSA program in text format

If for some reason you wanted to copy it:


 lower [77667 ns]
 b1:
     v1 (?) = InitMem <mem>
     v2 (?) = SP <uintptr>
     v7 (?) = LEAQ <*float64> {~r3} v2
     v8 (3) = Arg <float64> {a}
     v9 (3) = Arg <float64> {b}
     v10 (3) = Arg <float64> {c}
     v12 (+4) = MULSD <float64> v8 v10
     v13 (4) = ADDSD <float64> v12 v9
     v14 (4) = VarDef <mem> {~r3} v1
     v15 (4) = MOVSDstore <mem> {~r3} v2 v13 v14
 Ret v15 (line +4)

Translating this into S-expressions:


 (MOVSDstore {~r3}
             (SP)
             (ADDSD (MULSD (Arg {a}) (Arg {c}))
                    (Arg {b})))

SSA after regalloc phase

This is the output of ssa.html for the regalloc phase.



 regalloc [87237 ns]
 b1:
     v1 (?) = InitMem <mem>
     v14 (4) = VarDef <mem> {~r3} v1
     v2 (?) = SP <uintptr> : SP
     v8 (3) = Arg <float64> {a} : a[float64]
     v9 (3) = Arg <float64> {b} : b[float64]
     v10 (3) = Arg <float64> {c} : c[float64]
     v7 (4) = LoadReg <float64> v8 : X0
     v11 (4) = LoadReg <float64> v10 : X1
     v12 (+4) = MULSD <float64> v7 v11 : X0
     v6 (4) = LoadReg <float64> v9 : X1
     v13 (4) = ADDSD <float64> v12 v6 : X0
     v15 (4) = MOVSDstore <mem> {~r3} v2 v13 v14
 Ret v15 (line +4)

Adding new optimization rules


On processors with FMA, we can compute a*c + b with one instruction instead of two.


We will use CL117295 by Ilya Tokar as a basis.


For convenience, I have prepared a diff patch.


1. Adding a new operation - FMASD


In the file cmd/compile/internal/ssa/gen/AMD64Ops.go, find the AMD64ops slice variable and add a new element to it (anywhere in the list):


 { // fp64 fma
     name:         "FMASD",       // name of the SSA operation
     argLength:    3,
     reg:          fp31,          // register specifier for regalloc (defined below)
     resultInArg0: true,          // the result is written into the first source operand
     asm:          "VFMADD231SD", // the assembler instruction
 },

Since there were no (fp, fp, fp -> fp) operations before, we need to define a new register specifier:


   fp01 = regInfo{inputs: nil, outputs: fponly}
   fp21 = regInfo{inputs: []regMask{fp, fp}, outputs: fponly}
 + fp31 = regInfo{inputs: []regMask{fp, fp, fp}, outputs: fponly}

2. Adding an optimization rule


 (ADDSD (MULSD x y) z) -> (FMASD z x y)

A better implementation would not be unconditional and would check for FMA availability. We will assume that our target machine definitely has FMA.


The compiler uses the config for such checks:


 // When config.useFMA=false, the optimization is not applied.
 (ADDSD (MULSD x y) z) && config.useFMA -> (FMASD z x y)

How to check FMA support?

If lscpu is available in the system, then, for example:


 $ lscpu | grep fma 

3. Implementing code generation


Now, in the ssaGenValue function defined in cmd/compile/internal/amd64/ssa.go, you need to add code generation for FMASD:


 func ssaGenValue(s *gc.SSAGenState, v *ssa.Value) {
     switch v.Op {
     case ssa.OpAMD64FMASD:
         p := s.Prog(v.Op.Asm()) // Create a new obj.Prog in the current block
         // From: the first source operand.
         p.From = obj.Addr{Type: obj.TYPE_REG, Reg: v.Args[2].Reg()}
         // To: the destination operand.
         // v.Reg() returns the register allocated for the FMASD result.
         p.To = obj.Addr{Type: obj.TYPE_REG, Reg: v.Reg()}
         // From3: the second source operand.
         // The From3 name is historical. Nowadays there is
         // RestArgs, holding all source operands beyond the first.
         p.SetFrom3(obj.Addr{
             Type: obj.TYPE_REG,
             Reg:  v.Args[1].Reg(),
         })
         if v.Reg() != v.Args[0].Reg() { // Enforce the resultInArg0 contract
             s := v.LongString()
             v.Fatalf("input[0] and output not in same register %s", s)
         }
         // The cases for the other operations are omitted here;
         // the real switch contains many more of them.
     }
 }

Now everything is ready to verify our new optimization in action. Adding new instructions is a rare event; usually new optimizations are built from already defined SSA operations.


Checking results


The first step is to regenerate the Go code from the updated gen/AMD64Ops.go and gen/AMD64.Rules:


 # If your GOROOT is elsewhere, take the value printed by `go env GOROOT`.
 cd $GOROOT/src/cmd/compile/internal/ssa/gen && go run *.go

Next you need to build our new compiler:


 go install cmd/compile 

Now, compiling the same example produces different machine code:


 - v7  (4) MOVSD a(SP), X0
 - v11 (4) MOVSD c+16(SP), X1
 - v12 (4) MULSD X1, X0
 - v6  (4) MOVSD b+8(SP), X1
 - v13 (4) ADDSD X1, X0
 - v15 (4) MOVSD X0, ret+24(SP)
 - b1  (4) RET
 + v12 (4) MOVSD b+8(SP), X0
 + v7  (4) MOVSD a(SP), X1
 + v11 (4) MOVSD c+16(SP), X2
 + v13 (4) VFMADD231SD X2, X1, X0
 + v15 (4) MOVSD X0, ret+24(SP)
 + b1  (4) RET
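As a sanity check, the rewrite must not change observable results: whichever instruction sequence the compiler emits, the function returns the same value (for inputs like these, where every product and sum is exact; in general, FMA can only differ by being more precise):

```go
package main

import "fmt"

// The same function as in example.go.
func fusedMulAdd(a, b, c float64) float64 {
	return a*c + b
}

func main() {
	fmt.Println(fusedMulAdd(1.5, 2, 4)) // 1.5*4 + 2 = 8
}
```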

Basic blocks


Now that the hardest work is done, let's talk about basic blocks.


The values we optimized above live inside blocks, and the blocks live inside a function.


Blocks, like ssa.Value, come in abstract and machine-dependent flavors. All blocks have exactly one entry point and from 0 to 2 successor blocks (depending on the block type).


The simplest blocks are If , Exit and Plain :



Here are some simple examples of rewriting abstract blocks into AMD64 blocks:


                "then" branch (first successor)
                |   "else" branch (second successor)
                |   |
 (If (SETL cmp) yes no) -> (LT cmp yes no)
 (If (SETLE cmp) yes no) -> (LE cmp yes no)

Blocks will be discussed in more detail in the context of other optimization phases of the SSA compiler in a follow-up article.


Limitations of optimization rules


The SSA backend has its advantages: some optimizations can be performed in O(1). However, there are also disadvantages, due to which the SSA optimizer alone is not enough, at least until some aspects of its implementation change.


Suppose you want to merge adjacent appends:


 xs = append(xs, 'a')
 xs = append(xs, 'b')
 // =>
 xs = append(xs, 'a', 'b')

By the time the SSA is generated, the high-level structure of the code is lost, and append, not being an ordinary function, has already been expanded inside the body of the containing function. You would have to write a bulky rule capturing the entire sequence of operations generated for append.
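The equivalence the optimizer would need to prove is easy to state in source form (this snippet only illustrates the semantics; it is not something the rules DSL can currently match):

```go
package main

import "fmt"

func main() {
	xs := []byte{'x'}
	ys := []byte{'x'}

	// Two single-element appends...
	xs = append(xs, 'a')
	xs = append(xs, 'b')
	// ...leave the same contents as one merged append, but by
	// the time SSA is built each call has been expanded into its
	// own grow-and-store sequence with no compact pattern left.
	ys = append(ys, 'a', 'b')

	fmt.Println(string(xs), string(xs) == string(ys)) // xab true
}
```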


Speaking specifically of .Rules, this DSL is rather weak at working with blocks (ssa.Block). Any nontrivial optimization involving blocks cannot be expressed in this language. Partially updating a block is impossible. Discarding blocks is also impossible (although there is a hack in the form of the First block, which is used for dead code elimination).


Even if some of these shortcomings are fixable, most compiler developers agree that there is no single best intermediate representation for code.

Go that goes faster


If you come up with some cool optimization rules, feel free to send them to go-review.googlesource.com. I would be happy to review them (add iskander.sharipov@intel.com to CC).


Happy compiler hacking!



Bonus material


Examples of good patches in Go that added or modified SSA rules:



A README document describing the SSA part of the compiler appeared recently.
Recommended reading.

Source: https://habr.com/ru/post/415771/

