Inlining
The most critical optimisation for an automatic optimiser is inlining. Inlining is the enabling optimisation which replaces a function name by the definition of that function. After a function definition has been inlined, new optimisation opportunities become evident to the optimiser, such as β-reduction and known-case elimination.
The difficulty with inlining is that on its own it is not a beneficial code transformation. Inlining a function which does not unlock any further optimisation possibilities is wasted work which increases code size. Therefore any automatic partial evaluator must decide at what thresholds functions should be inlined and which factors about the call site must be taken into account when making an inlining decision. Unfortunately, designing a library which relies on these thresholds and decisions is inadvisable. Small changes to the source program can have big effects on which thresholds are breached, and understanding how your changes will affect the thresholds is very difficult.
1. Preliminaries
1.1 What is inlining?
Given a definition x = E, one can inline x at a particular occurrence by replacing the occurrence by E. For example:
let { f = \x -> x * 3 } in f (a + b) - c
⇝
(a + b) * 3 - c
We can identify three distinct transformations that collectively implement what we describe as inlining:
- Inlining itself replaces an occurrence of a let-bound variable by a copy of the right-hand side of its definition.
let { f = \x -> x * 3 } in f (a + b) - c
⇝ [inline f]
let { f = \x -> x * 3 } in (\x -> x * 3) (a + b) - c
Notice that not all occurrences of f need be inlined, hence the original definition of f must be retained.
- Dead code elimination discards bindings that are no longer used; this usually occurs when all occurrences of a variable have been inlined.
let { f = \x -> x * 3 } in (\x -> x * 3) (a + b) - c
⇝ [dead f]
(\x -> x * 3) (a + b) - c
- β-reduction rewrites a lambda application (\x -> E) A to let { x = A } in E.
(\x -> x * 3) (a + b) - c
⇝ [beta]
(let { x = a + b } in x * 3) - c
1.2 Factors affecting inlining
Although inlining may be valid, this does not mean that it is desirable. Inlining may increase code size, or duplicate work. There are three distinct factors to consider:
- Does any code get duplicated? And if so, how much? For example, consider the following code where “...big...” is a large expression:
let f = \v -> ...big... in (f 3, f 4)
Then inlining f would not duplicate any work (f will still be called twice), but it would duplicate the code for f’s body. Bloated programs are bad as they cause increased compilation time and lower cache hit rates. Inlining can often reduce code size by exposing new opportunities for transformations. GHC uses a number of heuristics to determine whether an expression is small enough to duplicate.
- Does any work get duplicated? And if so, how much? For example, consider the following code where foo is expensive to compute:
let x = foo 1000 in x + x
Inlining x would result in two calls to foo instead of one. Work can be duplicated even if x only appears once:
let x = foo 1000
    f = \y -> x * y
in ... (f 3) .. (f 4) ...
If we inline x at its (single) occurrence site, foo will be called every time f is. The general rule is that we must be careful when inlining inside a lambda. It’s not hard to come up with examples where a single inlining that duplicates work gives rise to an arbitrarily large increase in run time. GHC is therefore very conservative about work duplication. In general, GHC never duplicates work unless it is sure that the duplication is a small, bounded amount.
- Are any transformations exposed by inlining? For example, consider the bindings:
f = \x -> E
g = \ys -> map f ys
Suppose we were to inline f inside g:
g = \ys -> map (\x -> E) ys
No code is duplicated by doing so, but a small bounded amount of work is duplicated because the closure for (\x -> E) would have to be allocated each time g was called. It is often worth putting up with this work duplication, because inlining f exposes new transformation opportunities at the inlining site. But in this case, nothing at all would be gained by inlining f, because f is not applied to anything.
Inlining is not an optimisation by itself. The direct effects of careful inlining are small: it may duplicate code or a constant amount of work, and usually saves a call or jump (albeit not invariably). It is the indirect effects that we are really after: the main reason for inlining is that it often exposes new transformations, by bringing together two code fragments that were previously separate. Therefore, in general, inlining decisions must be influenced by context.
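As a concrete sketch of these indirect effects (the definitions below are illustrative, not drawn from any real library), inlining isJustInt at a call site with a known constructor allows β-reduction followed by known-case elimination:
isJustInt :: Maybe Int -> Bool
isJustInt = \m -> case m of
                    Nothing -> False
                    Just _  -> True

answer :: String
answer = if isJustInt (Just 5) then "yes" else "no"
-- Inlining isJustInt and β-reducing gives:
--   if (case Just 5 of { Nothing -> False; Just _ -> True }) then "yes" else "no"
-- Known-case elimination reduces the case on the known constructor Just 5 to True,
-- and the whole expression simplifies to "yes".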
1.3 Work duplication
If x is inlined in more than one place, or inlined inside a lambda, we have to worry about work duplication.
Work duplication will be bounded in at least the cases when x’s right-hand side is:
- A variable
- A constructor application
- A lambda abstraction
Constructor applications require careful treatment. Consider the following code:
x = (f y, g y)
h = \z -> case x of
            (a, b) -> ...
It would be disastrous to inline x inside the body of h, since that would duplicate the calls to f and g.
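A common way to restore safe inlining in such cases, sketched here with concrete stand-in definitions rather than GHC's actual machinery, is to bind the expensive components to their own variables first, so that x's right-hand side becomes a constructor applied only to variables and is therefore work-free to duplicate:
f, g :: Int -> Int
f n = n * 2      -- stand-ins for expensive computations
g n = n + 100

y :: Int
y = 10

a, b :: Int
a = f y          -- the expensive calls happen once, here
b = g y

x :: (Int, Int)
x = (a, b)       -- now a constructor application of variables: safe to inline

h :: Int -> Int
h = \z -> case x of
            (p, q) -> p + q + z
-- Inlining x inside h duplicates only the construction of the pair,
-- not the calls to f and g.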
2. Ensuring termination - Loop breakers
2.1 Sources of non-termination
GHC’s intermediate language extends System F with features which can cause the inliner to diverge in two distinct ways:
- Recursive Bindings: If a recursively-bound variable is inlined at one of its occurrences, that will introduce a new occurrence of the same variable. Unless this is restricted somehow, inlining could go on infinitely.
- Recursive Data Types: Consider the following code defining loop, where T is a recursive data type whose constructor C takes a function of type T -> Int.
data T = C (T -> Int)

g = \y -> case y of
            C h -> h y
loop = g (C g)
The function g is small and non-recursive, hence when processing loop = g (C g), g will be inlined. However the inlined call will very soon rewrite back to g (C g), which is just the expression we started with. The problem here is that the data type T is recursive, and it appears contravariantly in its own definition.
The former is an immediate and pressing problem as almost any interesting Haskell program involves recursion. In contrast, the latter is rather rare.
2.2 Recursive Bindings - The problem & solution
We call a group of bindings wrapped in rec a recursive group. Unrestricted inlining of non-recursive bindings is safe, but unrestricted inlining of recursive bindings may lead to non-termination.
The real problem with recursive bindings is that they can make the inliner fall into an infinite loop. The key insight of the solution is this: the inliner cannot loop if every cycle in the dependency graph is broken by a variable that is never inlined. Such a variable is called a loop breaker.
What is a loop breaker?
Explanation 1: If we were to inline recursive definitions without care, we could easily cause the simplifier to diverge. However, we still want to inline as many functions as possible which appear in mutually recursive blocks. GHC statically analyses each recursive group of bindings and chooses one of the bindings in each group as the loop-breaker. Any function which is marked as a loop-breaker will never be inlined. Other functions in the recursive group are free to be inlined as eventually a loop-breaker will be reached and the inliner will stop. GHC will never inline a loop breaker regardless of what pragma we attach to it.
Explanation 2: To prevent recursive functions from being inlined, some care is needed to deal with mutually recursive groups. For each mutually recursive group of functions, dependency analysis is performed and then a single loop-breaker is chosen which breaks any cyclic dependencies. The loop-breaker is never inlined, but other functions in the group can be.
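For instance, in a small mutually recursive group such as the one below (a toy example, not taken from GHC), dependency analysis finds the cycle isEven → isOdd → isEven; choosing, say, isOdd as the loop-breaker means isEven remains free to be inlined at its call sites, and the inliner stops whenever it reaches the never-inlined isOdd:
isEven, isOdd :: Int -> Bool
isEven n = n == 0 || isOdd  (n - 1)   -- may be inlined freely
isOdd  n = n /= 0 && isEven (n - 1)   -- loop-breaker: never inlined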
How does GHC choose a loop-breaker?
GHC chooses the loop-breaker according to a simple criterion: try not to select a variable that would be very beneficial to inline.
The choice of loop-breaker affects how a program is optimised, but the programmer has little control over which member of a recursive group is chosen. This explains why self-recursive functions are never inlined: they are always chosen as the loop-breaker for their recursive group of size one.
Inlining Thresholds
When each function is defined, its body is analysed and metrics are calculated that determine under what conditions the compiler considers it worthwhile to inline the function.
• What is an unfolding?
The unfolding of a function f is what f is replaced by when it is inlined. This is usually the definition of f.
• Code Size
The size of an unfolding is also a heuristic measure, with the understanding that inlining a lot of “big” definitions will lead to a very “big” program, which will take a long time to optimise. The flags controlling inlining are given below:
-funfolding-creation-threshold – Functions under this size will be given unfoldings and can therefore be inlined.
-funfolding-dict-discount – The argument discount if the argument is a dictionary.
-funfolding-fun-discount – The argument discount if the argument is a function.
-funfolding-use-threshold – The threshold for deciding whether to inline a function at the call site. The size is compared after the argument discounts have been applied.
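These are ordinary GHC flags, so they can be supplied on the command line or per module with an OPTIONS_GHC pragma; the module name and the numbers below are purely illustrative, not recommended settings:
{-# OPTIONS_GHC -funfolding-creation-threshold=800 -funfolding-use-threshold=100 #-}
module Hot where
-- Alternatively, for a whole build:
--   ghc -O2 -funfolding-creation-threshold=800 -funfolding-use-threshold=100 Hot.hs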
• When is a function not inlined?
- Self-recursive functions are never inlined.
- Functions which are “too big” are not inlined. The threshold is controlled by the -funfolding-use-threshold flag.
- Functions which are known to bottom (undefined – a computation which never completes successfully) are not inlined.
The first two restrictions are conservative, aimed at reducing the chance of the inliner looping or creating very large intermediate programs. In GHC the inliner never inlines a recursive definition, which is conservative and predictable. Bottoming functions are not inlined because they are lifted to the top level and inlining them would undo this work.
• When is a function inlined?
For each function definition the body of the function is analysed and the “unfolding guidance” is calculated. There are two varieties of unfolding guidance: UnfWhen and UnfIfGoodArgs.
For normal identifiers, UnfIfGoodArgs stipulates under what conditions the compiler thinks inlining would be beneficial, and it calculates this using three factors:
- The size of the function definition – this is taken into account to avoid creating program bloat.
- A discount for each argument which has already been evaluated to WHNF (weak head normal form). The argument discounts are used to encourage the inlining of functions which are guaranteed to unleash further optimisation opportunities.
- A discount if the function appears in a scrutinee position. If an argument appears in a scrutinee position in the body of a function, then inlining that function will cause the case to be reduced. Also, if the function itself returns a constructor, then inlining the function into a scrutinee position will enable the case to be eliminated.
These facts are all calculated by looking at the definition of a function. The size and discounts are unitless, created solely for the purpose of informing the inlining heuristic.
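A small sketch of the kind of call these discounts reward (the names are illustrative): choose scrutinises its first argument, so inlining it at a call site where that argument is a known, evaluated constructor lets the case be eliminated entirely:
choose :: Bool -> Int -> Int -> Int
choose b x y = case b of
  True  -> x
  False -> y

caller :: Int
caller = choose True 1 2
-- Inlining choose here yields  case True of { True -> 1; False -> 2 },
-- which known-case elimination reduces to 1.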
• Call Site Heuristics
At the call site (a call site of a function is the location in the code where the function is called), the unfolding guidance is used to decide whether to inline the function call. Roughly, a call is inlined when:
- The expression is “work-free”, i.e. we are not going to duplicate a lot of work by inlining it.
- The unfolding is small enough, as calculated using the guidance.
- There is some perceived benefit to inlining into this context, for example inlining into a scrutinee position.
The important take-home message from this description is that there are two halves to making an inlining decision: the definition site and the call site. At the definition site, certain criteria are set up which determine the decision about whether to inline at the call-site when more context is available.
Controlling Inlining
The compiler is naturally conservative when deciding whether to inline a function because, if an incorrect decision is made, the resulting compilation can take a very long time and use a lot of memory. It is therefore sometimes necessary for experts to direct the optimiser in more precise ways using compiler options or function-specific pragmas. These mechanisms can be used to control whether and how non-recursive functions are inlined. Recursive functions will still never be inlined.
Pragmas
• INLINE: This is the most well-known and dangerous way of controlling inlining. By annotating a function with INLINE you instruct the compiler to always inline the function, regardless of its size or calling context. Not only does the INLINE pragma force the function to be inlined at every call site, it also changes what is inlined. If a function is inlined naturally then the optimised unfolding is used to replace the function name. With the pragma, the unoptimised unfolding is used instead, which amounts to inserting an unoptimised version of the user-written right-hand side of the function directly.
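Attaching the pragma to a binding looks like this (plusOne is a made-up example):
{-# INLINE plusOne #-}
plusOne :: Int -> Int
plusOne x = x + 1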
• INLINABLE: This pragma is far more benign than INLINE. Instead of influencing the inlining decision at the call site, the pragma only overrides the -funfolding-creation-threshold option and makes sure the unfolding for a definition is included in the interface file. Like the INLINE pragma, it is the unoptimised definition which is included. Despite its name, the pragma is mainly used for ensuring the definition of recursive functions is made available in interface files so that they can be specialised across modules.
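A typical (illustrative) use is an overloaded recursive function whose unfolding we want recorded in the interface file so that importing modules can specialise it at concrete types such as Int or Double:
{-# INLINABLE sumPositive #-}
sumPositive :: (Ord a, Num a) => [a] -> a
sumPositive []     = 0
sumPositive (x:xs)
  | x > 0     = x + sumPositive xs
  | otherwise = sumPositive xs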
• NOINLINE: There is also the instruction NOINLINE, which indicates that a function should never be inlined. The usual reason for using NOINLINE is in conjunction with (rewrite) RULES pragmas.
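A sketch of that pairing (the function and rule are invented for illustration): the rule can only match calls to normalise while normalise is still visible by name, so the definition is marked NOINLINE to stop the inliner from erasing those calls before the rule has a chance to fire:
import Data.Char (toUpper)

{-# NOINLINE normalise #-}
normalise :: String -> String
normalise = map toUpper

{-# RULES
"normalise/normalise" forall s. normalise (normalise s) = normalise s
  #-}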
These three pragmas can be augmented with “phase annotations”. Phases count down from the initial starting phase. By default this is two, hence there are phases two, one, and zero. Each phase corresponds to one run of the main simplifier in the optimisation pipeline. Adding a phase annotation to an INLINE, INLINABLE, or NOINLINE pragma affects which of these phases the pragma is active in.
These phase annotations provide a way to loosely control the order in which definitions get inlined into each other, in an attempt to remedy the situation where function bodies become too big to inline because other definitions have been inlined into them. For example, if a function f contains calls to a lot of other functions which, when inlined, would make f itself too large to inline under the default heuristics, then f could be marked with INLINE[2] so that it is inlined into its callers before the other definitions are inlined into its body.
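A sketch of that scenario with hypothetical helpers: pipeline is marked INLINE [2] so that it is inlined into its callers during the earliest phase, before step1 and step2 (imagine them being large) are inlined into its body and push it over the size threshold:
step1, step2 :: Int -> Int
step1 = (* 3)    -- imagine a large definition
step2 = (+ 7)    -- imagine another large definition

{-# INLINE [2] pipeline #-}
pipeline :: Int -> Int
pipeline = step2 . step1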
It is also worthwhile, when dealing with rewrite rules, to mark definitions which we want to interact with these rules with INLINE[0], so that they are not inlined until the last phase and the rule-matching facility has the greatest opportunity to rewrite expressions before they are removed by inlining.
The following describes how the phase annotations interact with the INLINE and NOINLINE pragmas.
- INLINE[k] – Do not inline until phase k, and from then on be very keen to inline.
- NOINLINE[k] – Do not inline until phase k, and from then on inline as normal.
- INLINE[~k] – Be keen to inline until phase k, but do not inline it thereafter.
- NOINLINE[~k] – Inline normally until phase k, but do not inline it thereafter.
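Combining the phase notation with the rewrite-rule advice above, a definition that participates in a rule might be written as follows (both the function and the rule are illustrative): mapTwice stays un-inlined through phases 2 and 1, giving the rule every chance to fire, and is then inlined freely in phase 0:
{-# INLINE [0] mapTwice #-}
mapTwice :: (a -> a) -> [a] -> [a]
mapTwice f = map f . map f

{-# RULES
"mapTwice/fuse" forall f xs. mapTwice f xs = map (f . f) xs
  #-}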
The greatest issue with the INLINE pragma story is that a decision is made at the definition site about how each function should behave at the call site. As a first approximation, it is desirable to inline a function if it is applied to a statically or partially statically known argument (which is the purpose of the argument discounts) – whether this will happen is not known at the definition site. Therefore a certain amount of confidence is required of a library author who predicts (without reservation) that adding an INLINE to a definition will be beneficial to users.
Trick: Inlining with Typeclasses
GHC is, in general, rather reluctant to inline recursive code. There is only one exception: GHC is keen to create type-specialised copies of (constraint-)polymorphic recursive definitions, and to inline the definitions of typeclass methods in the process.
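A sketch of how this plays out (the function is invented for illustration): powerSum is both recursive and overloaded, so it will never be inlined as-is, but at a monomorphic call site GHC is keen to generate an Int-specialised copy in which the Num and Eq dictionary methods are inlined; a SPECIALISE pragma can be used to request such a copy explicitly:
powerSum :: (Eq a, Num a) => a -> a
powerSum 0 = 0
powerSum n = n * n + powerSum (n - 1)
{-# SPECIALISE powerSum :: Int -> Int #-}

useIt :: Int
useIt = powerSum 100   -- calls the Int-specialised copy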