A study of the spilling and coalescing problems in register allocation as two separate phases


Extension to the spill non-everywhere problem

The spill everywhere problem considers a fixed cost for a live-range. If one wants to optimize loads and stores, i.e., to spill variables only on parts of their live-ranges, the cost is not fixed in advance. It is possible to extend the dynamic programming algorithm of Theorem 4.4 to the spill non-everywhere problem. At each point, the cost of spilling a variable now depends on whether it has already been chosen to be spilled by the dynamic algorithm (starting from the leaves), in which case the cost of the store has already been counted, or not, in which case a store will be inserted. We explain our ideas for a basic block; the dynamic algorithm for a tree, i.e., under SSA, then works as in the proof of Theorem 4.4. The idea for the dynamic programming to work is to separate, at each point, the variables into three sets:

• the variables not spilled (previously noted W);
• the variables spilled and not in a register (previously noted V_S);
• the variables spilled but still in a register (noted V_S^r).

Indeed, a spilled variable may still reside in a register between two uses, to save one load. This complicates the task of calculating the cost of a fitting set. Let us consider a basic block where one tries to calculate the cost of a solution for point p, depending on the solutions for the next point p′:

• If v ∈ W(p), the only possibility is v ∈ W(p′), with no cost.
• If v ∈ V_S(p), there are three cases: v ∈ V_S(p′) with no cost; or v ∈ V_S^r(p′), which costs a load (inserted between p and p′); or v ∈ W(p′), which costs a store (inserted at the definition of v) and a load (inserted between p and p′).
• If v ∈ V_S^r(p), there are three cases: v ∈ V_S^r(p′) with no cost; or v ∈ V_S(p′) with no cost (v just ceases to reside in a register); or v ∈ W(p′), which costs a store (inserted at the definition of v)7 but no load, since v is already in a register.

So we just need to be sure that, at each step, there is only a polynomial number of configurations to consider.
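The per-variable transition costs above can be sketched as follows. This is a minimal Python sketch; the state names W, VS, VSR and the unit costs STORE and LOAD are illustrative assumptions, not part of the original formulation.

```python
# States a variable can be in at a program point, following the text:
# W: not spilled; VS: spilled, not in a register; VSR: spilled, still in a register.
W, VS, VSR = "W", "VS", "VSR"

STORE, LOAD = 1, 1  # hypothetical unit costs for a store and a load

def transition_cost(state_p, state_next):
    """Cost of moving a variable v from its state at point p to its state
    at the next point p'. Returns None for impossible transitions."""
    if state_p == W:
        return 0 if state_next == W else None   # only possibility: stay in W
    if state_p == VS:
        if state_next == VS:
            return 0
        if state_next == VSR:
            return LOAD                          # load inserted between p and p'
        if state_next == W:
            return STORE + LOAD                  # store at the definition + load here
    if state_p == VSR:
        if state_next == VSR:
            return 0
        if state_next == VS:
            return 0                             # v just ceases to reside in a register
        if state_next == W:
            return STORE                         # store at the definition, no load needed
    return None
```

The dynamic program would sum these costs over all variables at each point, keeping the cheapest solution per configuration (W, V_S, V_S^r).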

Spill Everywhere with Holes

The previous section dealt with the spill everywhere problem without holes. To summarize, by looking again at Table 4.1, this problem is polynomial for a basic block, even in its weighted version, whereas it is NP-complete for a general CFG under SSA, except for a fixed (small) number of registers. As mentioned earlier, the model without holes does not reflect the reality of most architectures: it corresponds to CISC-like models, while many architectures are in fact RISC-like. The goal of this section is to tackle the problem of spill everywhere “with holes” on a basic block. We restricted the study to the cases where it was polynomial without holes. Indeed, the problem with holes is intuitively “harder” than the problem without holes: whenever the problem was already NP-complete, chances were that it would stay so.8

7 This is a restriction of the model, where the cost of the store is the same everywhere. In practice, however, it is better to add it at a place not often executed. It is not trivial to insert this cost into the dynamic programming, but it might work.
8 Actually, this is not straightforward to prove. To patch the proof of Theorem 4.6, other variables must be added to “counteract” the effects of holes. This is the same technique that will be used to patch the “δ-variables” in the proof of Theorem 4.11.


Where do the holes come from? For an architecture where operations are allowed only between registers, whenever a variable is spilled, one needs to insert a store instruction after its definition and load instructions before every use of this variable.9 Thus, new variables appear, with very short live-ranges, but which nonetheless need to be assigned to registers. In other words, when a variable is spilled, the number of simultaneously alive variables decreases by one at every point of the live-range, except where the variable is defined or used. Thus spilling a variable everywhere does not remove the complete interval, but only parts of it, since there are still some tiny sub-intervals left. This is why, for instance, in the algorithm of Chaitin et al. [1981], the register allocator must rebuild the interference graph and iterate if some variables are spilled.10

Holes and chads: The notion of holes can be formalized as follows. An SSA program on a basic block, or linear SSA code, is a pair (B, V) where B = {p1, . . . , pm} is a sequence of m program points, and V is the set of variables that appear in the code. Between two consecutive program points, there is an instruction. Each variable of V is defined at most once and, if it is not defined in the code, is considered live-in of the sequence B, i.e., alive at point p1. Similarly, each variable either has a “last use” (the last instruction that uses it) or is live-out of the sequence B. A variable is represented by a simple interval of the sequence B, starting during the instruction that defines it (or at p1 for a live-in), and ending during the instruction that last uses it (or at pm for a live-out). Spilling a variable v ∈ V decreases by one the register pressure at each of its points, but not at its definition and use points, i.e., the program point just after the instruction that defines it, and the program points just before the instructions that use it.
Some tiny sub-intervals of the live-range remain at these places. They represent temporary variables that must hold the value in a register before storing it or after having reloaded it. Hence, the set of points that is actually “removed” is the interval of v with “holes” in it. We call it a punched interval. The remaining points c ∈ v that are not removed are called chads, as if, when spilling the variable v, one had first punched the corresponding interval, leaving small intervals in place. It is important to specify precisely where the holes in live-ranges are, since they represent the locations where chads will remain, i.e., where problems will arise. We will do so while referring to Figure 4.2 for a graphical explanation. Note that an instruction first uses some variables simultaneously and then possibly defines some other new variables. Hence, the holes for the definitions come a bit later than the holes for the arguments. This is the expected behavior since, for instance for [d ← a+b], if a and d are spilled, the same register can be used to load a and to hold d before its store. Similarly, at a program point, the holes for the definitions of the preceding instruction should not overlap with the holes for the uses of the next instruction. For instance, between the definition of c and the one of d, if c and a are spilled, the same register can be used to store c, then to load a. So, holes for definitions start “in the middle” of the defining instruction and end at the next program point, while holes for uses start at the previous program point and end “in the middle” of the instruction that uses the variables.
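The punching of an interval can be sketched as follows. This is a coarse sketch that represents a live-range as a set of integer program points and ignores the “in the middle” refinement for definitions and uses; the function name and representation are assumptions for illustration only.

```python
# A minimal sketch of "punching" a live interval when a variable is spilled.
# The interval is a set of integer program points; the definition and use
# points (the chads) stay in registers, everything else is removed.

def punch(live_start, live_end, chad_points):
    """Return (removed, chads): the points freed by spilling the variable,
    and the tiny sub-intervals (chads) that remain at the definition and
    the uses."""
    interval = set(range(live_start, live_end + 1))
    chads = interval & set(chad_points)     # points where a chad remains
    removed = interval - chads              # the holes actually freed
    return removed, chads

# Variable live on points 0..9, defined at 0 and used at 4 and 9:
removed, chads = punch(0, 9, [0, 4, 9])
# Spilling frees every point of the interval except the chads {0, 4, 9}.
```

In this representation, the register pressure after spilling drops by one on `removed` but stays unchanged on `chads`, which is exactly why iterating (as in Chaitin et al.) is needed.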

Conclusion

The recent result that, under the SSA form, the interference graph of a program is chordal opened promising directions for the design of register allocation heuristics, by providing an exact test to decide whether spilling is necessary, and a polynomial algorithm to assign registers to variables once no spilling is necessary anymore. Studying the complexity of the spill “everywhere” problem, where variables are spilled on their entire live-range, was important in this context. Even if it is a restriction of the more general load-store optimization problem, the “everywhere” simplification is used by many register allocators (e.g., Iterated Register Coalescing (IRC)), and might either give clues for the load-store optimization problem, or work as an oracle where the spill everywhere approximation is sufficient, in JIT compilation for instance. Our results can provide insights for the design of aggressive register allocators that trade compile time for provably “optimal” results. Unfortunately, the main implication of our work is that SSA does not simplify the spill problem the way it does the assignment (coloring) problem. Our study considers several variants of the spill everywhere problem:

1. We distinguish the problem without or with holes, depending on whether the use operands of instructions can reside in memory slots or not. Live-ranges are then contiguous or with holes, which leave chads when spilled.
2. For the variant with chads, we study the influence of the number of simultaneous chads, i.e., the maximum number of use operands and the maximum number of definition operands of an instruction.
3. We distinguish the case of a basic block (linear sequence) and of a general SSA program (tree).
4. Our model uses a cost function for spilling a variable. We distinguish whether this cost function is uniform (unweighted) or arbitrary (weighted).
5. Finally, in addition to the general case, we consider the case of spilling with few registers, and the case of an incremental spilling that lowers the register pressure one by one.

The classical furthest-first greedy algorithm is optimal only for the unweighted version without holes on a basic block. The weighted version can be solved in polynomial time, but unfortunately only for a basic block and not for a general SSA program. The positive result of our study for architectures with few registers is that the spill everywhere problem with a bounded number of registers is polynomial even with holes. Of course, the complexity is exponential in the number of registers, but for architectures like X86, it suggests that algorithms based on dynamic programming might be considered in an aggressive compilation context. In particular, it may be a possible alternative to the commercial solvers required by ILP formulations of the same problem, even for models more general than spill everywhere, such as the one used by Appel and George [2001]. However, ILP is often faster than dynamic programming.
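The furthest-first greedy rule mentioned above can be sketched as follows. This is a simplified sketch for a basic block in the unweighted, no-holes setting (the only setting where the text says the rule is optimal); intervals given as (start, end) pairs and the function name are assumptions for illustration.

```python
def furthest_first_spill(intervals, R):
    """Spill everywhere on a basic block, unweighted, without holes:
    whenever more than R intervals are simultaneously live, evict the
    live interval that ends furthest away. Returns the set of indices
    of spilled intervals."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    live = []        # indices of currently live, non-spilled intervals
    spilled = set()
    for i in order:
        start, _ = intervals[i]
        live = [j for j in live if intervals[j][1] >= start]  # expire ended ones
        live.append(i)
        if len(live) > R:
            far = max(live, key=lambda j: intervals[j][1])    # furthest end point
            live.remove(far)
            spilled.add(far)
    return spilled

# Three intervals, two registers: at point 2 all three are live, so the one
# ending furthest (0..10) is spilled.
print(furthest_first_spill([(0, 10), (1, 3), (2, 4)], 2))  # → {0}
```

The intuition matches Belady-style eviction: removing the interval that extends furthest frees a register for the longest time, which is why the rule fails once weights or chads make intervals non-interchangeable.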

Table of contents

Foreword
Contents
Nomenclature
1 Introduction
1.1 Program compilation
1.2 Register allocation
1.3 Spilling & Coalescing
1.4 Techniques for register allocation
1.5 About this thesis
2 Grounds
2.1 Basis for register allocation
2.1.1 Programs and control-flow graphs
2.1.2 Live-ranges, interference graph
2.1.3 Maxlive
2.2 Coloring the interference graph
2.2.1 Testing if R registers are sufficient
2.2.1.1 Conditions on Maxlive
2.2.1.2 Chaitin et al.’s simplification scheme
2.2.2 Interesting graph structures
2.2.2.1 k-colorable graphs
2.2.2.2 Cliques
2.2.2.3 Interval graphs
2.2.2.4 Chordal graphs
2.2.2.5 Greedy-k-colorable graphs
2.2.2.6 Orderings of graphs structures
2.2.3 What to do if R registers are not sufficient?
2.2.4 Iterated Register Coalescing (IRC)
2.3 Static Single Assignment form
2.3.1 Definition of SSA
2.3.2 The dominance property
2.3.3 Properties of SSA
2.3.4 SSA interference graph is chordal
2.3.5 Why is coloring polynomial under SSA
2.3.6 SSA form is not machine code
2.3.7 Splitting and parallel copies
2.4 Conclusion
3 Revisiting the proof of Chaitin et al.
3.1 NP-completeness proofs
3.1.1 Direct consequences of Chaitin et al.’s proof
3.1.2 Splitting variables in Chaitin et al.’s proof
3.1.3 Split points on edges
3.1.4 Split points anywhere
3.1.5 Summary and discussion of complexity proofs
3.2 Polynomial solutions
3.2.1 Static Single Assignment
3.2.2 Color propagation
3.3 Explanation of complexity
3.4 Register allocation in two phases
3.5 Conclusion
3.5.1 Summary of Results
3.5.2 Organization of the thesis
4 Complexity of spill everywhere under SSA
4.1 Terminology and Notation
4.2 Spill Everywhere without Holes
4.2.1 Complexity results
4.2.2 Extension to the spill non-everywhere problem
4.3 Spill Everywhere with Holes
4.4 Conclusion
5 Complexity of register coalescing
5.1 Definitions & properties for NP-completeness
5.2 Complexity of aggressive coalescing
5.3 Complexity of conservative coalescing
5.4 Complexity of optimistic coalescing
5.5 Summary and conclusion
6 Advanced coalescing: improving the coloring
6.1 Recalling the coalescing problems
6.2 Conservative coalescing
6.2.1 Brute-force conservative coalescing
6.2.2 Chordal-based incremental coalescing
6.2.2.1 Two lemmas for chordal-based coalescing
6.2.2.2 Explaining the chordal-based algorithm
6.2.2.3 Complexity and quality of chordal-based
6.3 De-coalescing after aggressive coalescing
6.3.1 The existing strategy
6.3.2 Our approach
6.4 Optimal rules for coalescing
6.4.1 The optimal “clique” rule
6.4.2 The “terminal” rules
6.4.3 Using the optimal rules for aggressive coalescing
6.4.4 Disclaimer
6.5 Experiments and evaluation
6.5.1 Methodology
6.5.2 Conservative heuristics
6.5.3 Optimistic heuristics
6.5.4 Ordering the affinities
6.5.5 Using the optimal “clique” and “terminal” rules
6.5.5.1 Use of the “clique” rule
6.5.5.2 Use of optimal rules in aggressive coalescing
6.5.6 Quality conclusion of the experiments
6.5.7 Inside the Chordal scheme
6.6 Conclusion
7 Parallel copy motion to get out of colored SSA
7.1 About going out of colored SSA
7.1.1 Introducing the parallel copies
7.1.2 Duplications in parallel copies
7.1.3 Reversible parallel copies
7.2 Properties for moving a parallel copy away from an edge
7.2.1 The problem of critical edges
7.2.2 Compensation
7.3 Moving parallel copies away from critical edges
7.3.1 Decomposition of a parallel copy containing duplications
7.3.2 The problem of moving reversible parallel copies
7.3.3 Converting parallel copies to permutations
7.3.4 Sequentializing permutations
7.4 Put it all together
7.4.1 Another break in the (permutation) wall
7.4.2 Chains, trees and butterflies of critical edges
7.4.3 Whenever permutation motion is stuck
7.5 Conclusion
8 Conclusion
8.1 Reality is different from models
8.1.1 Architectural constraints complicate register allocation
8.1.1.1 The constraints
8.1.1.2 Solutions for constraints on registers
8.1.1.3 Splitting even more
8.1.2 Architectural constraints that simplify register allocation
8.1.2.1 Repairing color mismatches is easy
8.1.2.2 False critical edges can be split
8.2 Register allocation in practice
8.2.1 Global versus local
8.2.2 Proposed scheme(s)
8.2.2.1 Aggressive scheme
8.2.2.2 Local spilling followed by coalescing
8.2.2.3 A few more words on permutation motion
8.2.2.4 Towards JIT compilation
8.3 Conclusion
List of Figures
List of Tables
List of Algorithms
Bibliography
Index
