Optimising Core

These are some organised notes on optimising core, summarised from other pages.

1. Dumping core & enabling optimisations

  • To inspect the core (in a coherent manner), we can add the following to either our package.yaml file (when using stack):

        dependencies:
        - dump-core
        ghc-options:
        - -fplugin=DumpCore
    

    or our .cabal file (when using cabal):

    build-depends: dump-core
    ghc-options:   -fplugin=DumpCore
    

    When we build our program (stack build or ghc), this creates a folder called dump-core containing html files. Opening one of these files gives us a nice visualisation of the core code of our program. The code we see is how our Haskell program looks after GHC has desugared and optimised it. (GHC’s built-in dump flags, noted at the end of this section, offer a plain-text alternative to the visualiser.)

  • To enable optimisations (for a cabal project), we can compile with -O2 by creating a cabal.project file and placing optimization: 2 inside it:

    packages: .
    optimization: 2
    
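
As an aside (these are standard GHC flags, not part of the dump-core plugin): GHC itself can print the core as plain text during compilation via its built-in dump flags, which is handy when the visualiser is not available, for example by adding to ghc-options in package.yaml:

    ghc-options:
    - -O2
    - -ddump-simpl
    - -dsuppress-all
    - -dsuppress-uniques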

2. Workflow

Usually, if one is confused about something in the core, it is best to compare it with the original program. This entire process is the general workflow when searching for optimisations by looking at Haskell core:

  • Writing the program
  • Generating the core
  • Analysing the core to see if we are unhappy about how some code has been generated
  • Tweaking the way the code is generated so that the inefficiency disappears. (If we are using staging, this means changing the code from the staging point-of-view.)
  • It is generally a good idea to benchmark to verify that an optimisation change made from looking at the core has improved run-time; however, this can be counter-productive because we may need to make a few changes before we see the benefits. The danger is that we make a change which results in a better-looking core but worse performance, and then roll back the change even though we only needed to continue with something else for the whole thing to pay off. The reason we inline things is to expose optimisations – inlining per se does not necessarily result in better performance.

The relationship between the profiling information and the core is more obvious than the relationship between the profiling information and the original code. We can use the profiler to guide us to the parts of the core that we want to look at.

3. Common Syntax

  • !, or a bang pattern, indicates strict evaluation. When we see a let, this is a lazy variable declaration, whereas a let! signifies a strict variable declaration.

  • $d is prepended to type class dictionaries which contain class method implementations.

  • $w is usually prepended to functions that the compiler has unwrapped some of the arguments for (which eliminates a pattern match).

  • $s is usually prepended to functions that the compiler has specialised.

  • # indicates an unboxed type. Most unboxed types end in the character #, e.g. Int#, Double#, Addr#, (#, #). Similarly, the primitive operations on these types look like, for example, <#, +#, *#. (A small example follows this list.)

  • lvl refers to a function bound to a variable that is defined somewhere else in the same module. It’s useful to click on these and follow them back to the source definition in the core.
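
For illustration, the unboxed types and primitive operations above can also be used directly from Haskell source via GHC.Exts. A minimal sketch (sumSquare is hypothetical, shown only to illustrate the # naming convention):

    {-# LANGUAGE MagicHash #-}
    module Unboxed where

    import GHC.Exts (Int (I#), Int#, (+#), (*#))

    -- A boxed Int is a heap object I# wrapping an unboxed Int#;
    -- the primops (+#) and (*#) work directly on the raw machine values.
    sumSquare :: Int -> Int -> Int
    sumSquare (I# x#) (I# y#) = I# (x# *# x# +# y# *# y#)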

4. Pragmas

  • INLINE : This is the most well-known and dangerous way of controlling inlining. Annotating a function with INLINE tells the compiler to always inline the function regardless of its size or calling context. Not only does the INLINE pragma force the function to be inlined at every call site, it also changes what is inlined. If a function is inlined naturally, the optimised unfolding is used to replace the function name; with the pragma, the unoptimised unfolding is used instead, which amounts to inserting an unoptimised version of the user-written right-hand side directly. (A short sketch of how these pragmas are written is given at the end of this section.)
  • INLINABLE : This pragma is far more benign than INLINE. Instead of influencing the inlining decision from the call site, the pragma only overrides the -funfolding-creation-threshold option and makes sure the unfolding for a definition is included in the interface file. Like the INLINE pragma, it is the unoptimised definition which is included. Despite its name, the pragma is mainly used for ensuring the definition of recursive functions is made available in interface files so that they can be specialised across modules.

    Reasons to possibly not use INLINABLE:

    • Without INLINABLE, the definition that goes in the interface file is the code after optimisation, whereas with INLINABLE, it is the code you wrote (more or less). In particular, without INLINABLE, GHC might inline other functions into the function’s definition.

    • Without INLINABLE, GHC will omit the definition from the interface file if it is too big. If some other function got inlined into the right-hand-side, this could easily push it over the limit.

    • INLINABLE also turns on some clever machinery that automatically specialises overloaded functions where they are used, and shares the specialised versions with other modules that transitively import the module in which the specialised version was created.

  • NOINLINE : There is also the instruction NOINLINE which indicates that a function should never be inlined (the usual reason for using NOINLINE is in conjunction with (rewrite) RULES pragmas).
  • UNPACK : When a constructor field is marked strict and its type has a single constructor, it is possible to ask GHC to unpack the contents of the field directly into its parent. The UNPACK pragma indicates to the compiler that it should unpack the contents of a constructor field into the constructor itself, removing a level of indirection. It says that we don’t care about the constructor that wraps the arguments. This can only be done for data types with one constructor. For example:

    data T = T {-# UNPACK #-} !Float
               {-# UNPACK #-} !Float
    

    will create a constructor T containing two unboxed floats. This may not always be an optimisation: if the T constructor is scrutinised and the floats passed to a non-strict function for example, they will have to be reboxed (this is done automatically by the compiler).

    This is good for flattening a nesting of data types.
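
As a rough usage sketch (the functions here are hypothetical, not from these notes), this is how the three inlining pragmas are attached to definitions:

    module Pragmas where

    -- INLINE: the (unoptimised) right-hand side is copied to every call site.
    square :: Int -> Int
    square x = x * x
    {-# INLINE square #-}

    -- INLINABLE: only guarantees the unfolding is kept in the interface file,
    -- so overloaded or recursive definitions can be specialised across modules.
    sumTo :: (Ord a, Num a) => a -> a
    sumTo n = go 0 1
      where
        go acc i
          | i > n     = acc
          | otherwise = go (acc + i) (i + 1)
    {-# INLINABLE sumTo #-}

    -- NOINLINE: never inline, e.g. so that rewrite RULES can still match on the name.
    describe :: Int -> String
    describe n = "value: " ++ show n
    {-# NOINLINE describe #-}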

5. Things to look out for

  • Random case statements
  • Duplication of code (such as within a let-binding where a variable is created twice with the exact same definition)

    For example, the function manyTest is supposed to take a sequence of alternating a’s and b’s and roll them all up:

    manyTest :: Parser [Char]
    manyTest = many (string "ab" $> (code 'c'))
    

    In the generated core, one of the problems we can observe is that a particular function called exit is created twice:

      λs1 ->
        ...
        let
          exit =
            λo# eta ipv ->
              case ==# o# ipv of
                DEFAULT ->
                  (#, #) eta Nothing
                1 ->
                  let! (#, #) ipv ipv1 = readMutVar# ipv1 eta in
                  (#, #) ipv (Just (ipv1 []))
          exit =
            λo# eta ipv ->
              case ==# o# ipv of
                DEFAULT ->
                  (#, #) eta Nothing
                1 ->
                  let! (#, #) ipv ipv1 = readMutVar# ipv1 eta in
                  (#, #) ipv (Just (ipv1 []))
    

    The question becomes “why has the compiler not worked out that these are the same function, and duplicated it when it didn’t need to?”. In the body of the let statement of the core version, we can observe where exit is used: it tells us that the program checks if there is enough input, and if so, whether that input is the character ‘b’. So we expect the call of exit to be somewhere where we read characters.

    in
      let ...
      in case <# ipv dt of
            DEFAULT ->
              exit o# eta ipv
            1 ->
              case indexWideCharArray# input# ipv of
                DEFAULT ->
                  exit o# eta ipv
                'b' ->
                  ...
    

    These exit variables would have been defined in our original code somewhere. In this case, we can see the two different functions sat and emitLengthCheck both use the expression $$bad which represents the exit variable.

    sat ... =
      ... if ... then ... else $$bad
    
    emitLengthCheck ... =
      ... if ... then ... else $$bad
    

    We fail in the case in which we don’t have enough input, and we fail in the case where the character doesn’t match. So we ask why these exit variables are two different values, when they could have been the same one - these calls to $$bad should really come from the same place. The question then becomes: “how do we work out where they came from, and how do we trace them back to their common source?”.

    We can find that in some function evalSat, we have defined a function maybeEmitCheck that binds the bad variable in a let statement and uses it twice in its body.

    evalSat .. = do
      ...
      where
        maybeEmitCheck ... =
          [|| let bad = $$(raise y)
              in $$(emitLengthCheck n (sat (genDefunc p) mk [||bad||]) [||bad||] y)
          ||]
    

    The first usage of bad is the one that gets passed to sat, and the second usage is the one that gets passed to emitLengthCheck. We can relate this to the previous core code: the first exit corresponds to the bad passed to emitLengthCheck, and the second exit corresponds to the bad passed to sat. In this case, we don’t currently know why GHC has chosen to define exit twice.

  • Situations where certain operations can be factored out

    For example, consider again the function manyTest from before, which is supposed to take a sequence of alternating a’s and b’s and roll them all up:

    manyTest :: Parser [Char]
    manyTest = many (string "ab" $> (code 'c'))
    

    In the generated core, we could ask: can these length checks <# o# dt and <# ipv dt be factored out, and why is it doing these twice?

    λo# eta ->
      case <# o# dt of
        DEFAULT ->
          exit eta
        1 ->
          ...
          let
            ipv =
              +# o# 1
          in case <# ipv dt of
            DEFAULT ->
              ...
    

    We can actually fix this by changing the definition of manyTest to:

    manyTest :: Parser [Char]
    manyTest = many (try (string "ab") $> (code 'c'))
    

    Going back to the core, we can now see that the first length check has been replaced with the (stronger) second one, i.e. the checks have been factored out into a single check. Because we always want to read an a character followed by a b character, this code will now check whether there are at least two characters before trying to read any characters at all, rather than reading an a and then failing when there are no further characters.

      λo# eta ->
        case <# (+# o# 1) dt of
          DEFAULT ->
            exit eta
          1 ->
            ...
            let
              ipv =
                +# o# 1
            in
              ...
    
  • Use of strict and lazy values

    When we see a let, this is a lazy variable declaration, whereas a let! signifies a strict variable declaration. We can spot code where values are lazily constructed and determine whether that value actually needs to be lazy. For example, the following code is lazy, but we should be able to compute it strictly:

      let
        ipv =
          +# o# 1
    

    In theory, we should be able to find where the creation of ipv is in the surface program and then bang-pattern it, resulting in let becoming let! in the core program.

    In the core visualisation, we can also hover over certain fragments of code, and the top right corner then shows us the type information as well as strictness properties. Even though something may appear to be lazy, we can identify that it is actually a strict binding (because it says S rather than L).

  • Expensive function applications (referenced from external modules)
  • Function calls that could have been inlined (referenced from external modules)

    For example, in the following function inferModel:

    inferModel :: Int -> Int -> Int -> IO [[(Params, Numeric.Log.Log Double)]]
    inferModel n_mhsteps n_timesteps n_particles = do
        ys <- parseFromFile dataParser "data/datafile"
        case ys of
          (Left _)    -> error "naughty"
          (Right obsData) -> sampleIO $ do
                  pmmhRes <- prior $ pmmh n_mhsteps n_timesteps n_particles paramsPrior (scoreModel initLatentState (map Obs obsData))
                  return pmmhRes
    

    In the generated core, we can see a function call to pmmh which is from an external module. The hope is that the call to pmmh (represented by the green-highlighted lvl) would have been inlined; however this is not the case, as it simply references pmmh from the monad-bayes library, which has not been marked as INLINE or INLINABLE. This means GHC isn’t specialising it, so the compiled program still performs dictionary lookups at run time.

    We can add an {-# INLINABLE pmmh #-} or {-# INLINE pmmh #-} pragma to pmmh:

    -- | Particle Marginal Metropolis-Hastings sampling.
    pmmh ::
      MonadInfer m =>
      -- | number of Metropolis-Hastings steps
      Int ->
      -- | number of time steps
      Int ->
      -- | number of particles
      Int ->
      -- | model parameters prior
      Traced m b ->
      -- | model
      (b -> Sequential (Population m) a) ->
      m [[(a, Log Double)]]
    pmmh t k n param model =
      mh t (param >>= runPopulation . pushEvidence . Pop.hoist lift .
            smcSystematic k n . model)
    
    {-# INLINE pmmh #-}
    

    Revisiting the inferModel function, we can now see that the call to pmmh (previously represented by lvl) has been inlined with what is now called $w$spmmh.

    • $w is usually prepended to functions that the compiler has unwrapped some of the arguments for (which eliminates a pattern match).
    • $s is usually prepended to functions that the compiler has specialised.

  • Unwrapping and rewrapping of data structures

    This is when we see functions that take arguments containing a data constructor, unwrap the constructor (i.e. use its contained values to compute new ones), and then pass the new values wrapped inside the constructor to another function (possibly recursively). This also wastes memory allocation. Note that this doesn’t happen with newtypes.

  • Arguments which haven’t been unboxed when instead they could have been

    Most types in GHC are boxed (values of that type are represented by a pointer to a heap object, e.g. the representation of Int is a two-word heap object). An unboxed type is represented by the value itself, no pointers or heap allocation is involved. These correspond to the “raw machine” types we would use in C. Most unboxed types end in the character #, e.g. Int#, Double#, Addr#, (#, #). Similarly, the primitive operations on these types look like, for example, <#, +#, *#.

    To illustrate the consequence of boxed/unboxed types, consider the situation where we have a recursive call which takes an integer as an argument. If this integer had not been unboxed in the core, then the program would box the integer to pass it to the recursive function; within the function it will then unbox the integer to do the work on it, and afterwards box up another integer as an argument to the recursive call. This is essentially a loop of unnecessary boxing and unboxing of values.

    The following code illustrates a recursive loop where the integer argument passed in (o#) is already successfully unboxed:

      let
        rec
          loop =
            λo# eta ->
              ...
              loop (+# ipv 1) s2#
    

    So we are looking for call sites where a value is wrapped up in a constructor, and immediately on the other end of the call the function unwraps it again. A small surface-level sketch of a loop whose accumulator GHC can pass unboxed is given below.
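
As a surface-level sketch (sumLoop is hypothetical, not from the notes above): with a strict accumulator, GHC’s strictness analysis and worker/wrapper transformation can pass the accumulator around the loop as an unboxed Int#, so the core resembles the loop (+# ipv 1) s2# pattern above instead of repeatedly boxing and unboxing Ints.

    {-# LANGUAGE BangPatterns #-}
    module SumLoop where

    -- Hypothetical loop: the bang patterns make the accumulator strict, so
    -- with -O the generated worker typically takes unboxed Int# arguments.
    sumLoop :: Int -> Int
    sumLoop n = go 0 n
      where
        go !acc 0 = acc
        go !acc i = go (acc + i) (i - 1)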

6. Other notes

  • Memory leaks might not be the cause of our performance issues, but they could be. In Haskell this is not called a memory leak but a space leak: a bunch of unevaluated thunks which cause a long chain of computation. The memory is held on to by the thunks even though it is potentially never used. If we see lots of heap activity, this could be a sign that laziness in our program is adding overhead. Also, if we are seeing a lot of objects being churned through (even if they are going through quite fast), we could ask “are those objects necessary, or could they have been unboxed and thrown away?”.
  • The canonical way of fixing performance issues in effect systems is removing the monad transformers and hand-rolling the entire instance. Although this is against the point of effect systems, at the same time it can be very helpful. If we hand-roll the instance (which we can do by taking the monad transformer instances and inlining the definitions ourselves until we get to something primitive), we are essentially checking whether the compiler has generated that same code in the core. This is a pattern of writing the ideal program by hand and working out which parts the compiler has missed. It is always a good approach when we don’t yet understand what can be improved in the original program, and is a way to recognise and understand unfamiliar structures in core.
  • GHC can specialise the monads when it sees them concretely, and then the inliner works its magic on the specialised code to form what we see there. No particular code changes are necessary to ensure that the program specialises; even if there are cases of non-concrete monads m in type definitions, the only part of the core that we are interested in is where m becomes concrete.
  • If we happen to inline a particular function and this results in a slower program, this implies that the code size has become too big, which leads to memory problems (cache issues). If the function isn’t being optimised after it has been inlined, then it shouldn’t be inlined.
  • Generally speaking, the INLINABLE pragma is always helpful (non-dangerous) for the compiler, but the INLINE pragma should be used when we are fairly certain that an optimisation can happen, but the compiler isn’t realising it even with the use of the INLINABLE pragma.
  • If we don’t see $w or $s in the core anywhere, optimisations are not on.
  • A way to avoid packing and unpacking of data structures in core is to use continuation-passing-style form (a small sketch is given at the end of these notes).
  • Q : Given a record data type such as:

    data Params = Params {
        transition_p  :: !Double,
        observation_p :: !Double
    } deriving Show
    

    is it unavoidable to lose the whole record feature (named parameters) if we were to optimise this to avoid the wrapping and rewrapping of the constructor?

    A : Nope! If it’s strict and Haskell can unbox it, you can still use the record syntax and record accessors – it’s just that they will get optimised out and replaced with the relevant argument of the function directly.

    data Params = Params {
        transition_p  :: {-# UNPACK #-} !Double,
        observation_p :: {-# UNPACK #-} !Double
    } deriving Show
    
  • GHC loves single-constructor datatypes, such as tuples. A single-constructor datatype can be unpacked when it is passed to a strict function. For example, given this function:

    f (x,y) = ...
    

    GHC’s strictness analyser will detect that f is strict in its argument, and compile the function like this:

    f z = case z of (x,y) -> f' x y
    f' x y = ...
    

    where f is called the wrapper, and f' is called the worker. The wrapper is inlined everywhere, so for example if you had a call to f like this:

    ... f (3,4) ...
    

    this will end up being compiled to

    ... f' 3 4 ...
    

    and the tuple has been completely optimised away. This only happens when the function argument is a single-constructor type.
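
Returning to the earlier note on continuation-passing style: a minimal sketch (divMod' and divModK are hypothetical names) of how passing a continuation avoids allocating an intermediate pair that the caller would immediately take apart:

    module Cps where

    -- Returning a pair allocates a (,) constructor that the caller scrutinises.
    divMod' :: Int -> Int -> (Int, Int)
    divMod' n d = (n `div` d, n `mod` d)

    -- Passing a continuation hands both results to the caller directly,
    -- with no intermediate tuple to pack and unpack.
    divModK :: Int -> Int -> (Int -> Int -> r) -> r
    divModK n d k = k (n `div` d) (n `mod` d)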

Last updated on 13 Nov 2020
Published on 13 Nov 2020