How to Benchmark and Profile
Benchmarking With Criterion & Stack
- The benchmark program we intend to profile needs to have a function which takes one argument (influencing the computational effort of executing the program) - this is the function we directly call for profiling.
  This should be under the test directory, e.g. test/BenchmarkProgram.hs, which contains a function:

    testPmmh :: Int -> IO ()
    testPmmh nsteps = do
      observedStates <- generateData initLatentState initParams nsteps
      print observedStates
- The program that calls the profiling functions should ideally be test/Spec.hs. This needs to import both the benchmarked program and Criterion:

    import BenchmarkProgram
    import Criterion.Main

  It should contain a main function which details how benchmarking is performed. (This isn't the optimal usage of Criterion's tools, but this is still the core idea.)

    main = defaultMain [
        bgroup "pmmhTest" [ bench "10"  $ whnf testPmmh 10
                          , bench "20"  $ whnf testPmmh 20
                          , bench "40"  $ whnf testPmmh 40
                          , bench "80"  $ whnf testPmmh 80
                          , bench "160" $ whnf testPmmh 160 ]
      ]
  In this example, the defaultMain Criterion function takes a list of Benchmark values, each of which describes a function to benchmark.

    defaultMain :: [Benchmark] -> IO ()

  The bgroup function groups related benchmarks together, where its first argument is a name for that group of benchmarks.

    bgroup :: String -> [Benchmark] -> Benchmark

  The bench function (unsurprisingly) benchmarks a program. The first argument is a name describing the activity we're benchmarking. The Benchmarkable type is a container for code that can be benchmarked.

    bench :: String -> Benchmarkable -> Benchmark

  The whnf function applies an argument to a function and evaluates the result to weak head normal form (WHNF).

    whnf :: (a -> b) -> a -> Benchmarkable
- The package.yaml file should contain the following:

    tests:
      benchmarks-test:
        main: Spec.hs
        source-dirs: test
        ghc-options:
        - -O
        - -threaded
        - -rtsopts
        - -fprof-auto
        - -with-rtsopts=-N
        dependencies:
        - benchmarks
        - criterion
        # and other dependencies which the benchmarked program needs, e.g.:
        - statistics
        - monad-bayes
        - mtl
        - log-domain
- Next, within the test directory, we can then compile Spec.hs (along with its local imports, i.e. the benchmarked program):

    stack ghc BenchmarkProgram.hs Spec.hs

  Then, by running the following, the --output option directs our program to write a report to the file spec.html:

    ./Spec --output spec.html

  Clicking spec.html opens this report.
Benchmarking With Criterion & Cabal
- The benchmark program we intend to profile needs to have a function which takes one argument (influencing the computational effort of executing the program) - this is the function we directly call for profiling.
  This can be under the src directory, e.g. src/PmmhHmm.hs, which contains a function:

    testPmmh :: Int -> IO ()
    testPmmh nsteps = do
      observedStates <- generateData initLatentState initParams nsteps
      print observedStates
- The program that calls the profiling functions should ideally be benchmarking/BenchmarkProgram.hs and be the Main module. This needs to import both the benchmarked program (here, the PmmhHmm module) and Criterion:

    module Main where

    import PmmhHmm
    import Criterion.Main

  It should contain a main function which details how benchmarking is performed. (This isn't the optimal usage of Criterion's tools, but this is still the core idea.)

    main = defaultMain [
        bgroup "pmmhTest" [ bench "10"  $ whnf testPmmh 10
                          , bench "20"  $ whnf testPmmh 20
                          , bench "40"  $ whnf testPmmh 40
                          , bench "80"  $ whnf testPmmh 80
                          , bench "160" $ whnf testPmmh 160 ]
      ]
- The .cabal file should contain the following (the build dependencies need to include our own package if the benchmarking program uses exposed modules from our own library):

    executable benchmark-program
      -- build-depends must list our own package name (here, probabilistic-programming)
      main-is:            BenchmarkProgram.hs
      build-depends:      base >=4.11 && <4.14,
                          criterion,
                          probabilistic-programming,
                          mtl >=2.2.2 && <2.3,
                          statistics >=0.14.0 && <0.16,
                          log-domain >=0.12 && <0.14,
                          megaparsec
      default-language:   Haskell2010
      hs-source-dirs:     benchmarking
- Next, within the benchmarking directory, we can then compile BenchmarkProgram.hs and execute it with the command line arguments --output benchmark-results.html, where the --output option directs our program to write a report to the file benchmark-results.html. Writing -- after cabal run <executable> lets us specify any following command line arguments to the executable:

    cabal run benchmark-program -- --output benchmark-results.html

  Clicking benchmark-results.html opens this report.
Criterion
Criterion allows two kinds of code to be benchmarked.
- Any IO action can be benchmarked directly
- We can benchmark pure functions (with some trickery)
• Benchmarking an IO action
We use nfIO to specify that after we run an IO action, its result must be evaluated to normal form (so that all of its internal constructors are fully evaluated, and it contains no thunks).
nfIO :: NFData a => IO a -> Benchmarkable
The function nfIO is best used when:
- Lazy I/O is involved (nfIO avoids resource leaks)
- We aren't confident how much evaluation will have been performed on the result of an action (nfIO makes sure that it is fully evaluated)
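As a minimal sketch (the file name and benchmark label here are invented for illustration), a lazy file read could be benchmarked with nfIO so that the whole result is forced:

import Criterion.Main

main :: IO ()
main = defaultMain
  [ -- readFile is lazy, so nfIO forces the whole file contents;
    -- the benchmark then measures the full cost of the read
    bench "readFile" $ nfIO (readFile "input.txt")
  ]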
We use whnfIO to specify that after we run an IO action, its result must be evaluated only deep enough for the outermost constructor to be known.
whnfIO :: IO a -> Benchmarkable
The function whnfIO is best used when:
- Our I/O action returns a simple value such as an Int
- Our I/O action returns something complex such as a Map, but evaluating the outermost constructor will do "enough work"
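As a sketch (the IORef and benchmark name are invented for illustration), an action returning a plain Int only needs whnfIO:

import Criterion.Main
import Data.IORef

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  defaultMain
    [ -- the result is a single Int, so WHNF already does all the work
      bench "bumpCounter" $ whnfIO (modifyIORef' ref (+ 1) >> readIORef ref)
    ]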
• Benchmarking a pure function
Lazy evaluation makes it hard to benchmark pure code:
- If we tried to saturate a function with all of its arguments and evaluate it repeatedly, laziness would ensure that we’d only do “real work” the first time through our benchmarking loop. The expression would be overwritten with that result, and no further work would happen on subsequent loops through our benchmarking harness.
We can overcome laziness by benchmarking an unsaturated function (a function which has been given all but one of its arguments).
The function nf takes two arguments: the unsaturated function we want to benchmark, and the final argument to pass to it. It applies the argument to the function and then evaluates the result to normal form.
nf :: NFData b => (a -> b) -> a -> Benchmarkable
The function whnf takes similar arguments but evaluates the result only to weak head normal form.
whnf :: (a -> b) -> a -> Benchmarkable
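For instance, with a naive Fibonacci function (invented here purely as a workload), we pass the function unsaturated and let Criterion apply the final argument on every iteration:

import Criterion.Main

-- deliberately slow Fibonacci, used only as a pure workload
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fib"
      [ bench "whnf/20" $ whnf fib 20
      , bench "nf/20"   $ nf   fib 20  -- equivalent here: an Integer has no lazy substructure
      ]
  ]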
• Understanding Criterion Reports
- The left chart is a kernel density estimate (KDE) of time measurements - this graphs the probability of any given time measurement occurring. A spike indicates that a measurement of a particular time occurred, and its height indicates how often that measurement was repeated.
- The right chart contains the raw measurements from which the kernel density estimate was built. The x axis indicates the number of loop iterations. The y axis shows measured execution time for the given number of iterations. The line is the linear regression generated from this data.
- OLS regression estimates the time needed for a single execution of the benchmarked activity, using an ordinary least-squares regression model. This number should be similar to the mean execution time. The OLS estimate is usually more accurate than the mean, as it more effectively eliminates measurement overhead and other constant factors.
- R² goodness-of-fit measures how accurately the linear regression model fits the observed measurements. If the measurements are not too noisy, R² should lie between 0.99 and 1, indicating an excellent fit. If the number is below 0.99, something is confusing the accuracy of the linear model; if it is below 0.9, this is very worrisome.
- Mean execution time and standard deviation are statistics calculated from execution time divided by the number of iterations.
  The lower and upper bounds measure the accuracy of the main estimate (using a statistical technique called bootstrapping) - this tells us that when randomly resampling the data, 95% of estimates fall within the lower and upper bounds. Lower and upper bounds which are close to the main estimate indicate that the main estimate is of good quality.
Profiling with Stack
- To run profiling while performing these benchmarks, run:

    stack test --profile

  The --profile flag turns on the --enable-library-profiling and --enable-executable-profiling cabal options, and also passes the +RTS -p -RTS runtime options to any test suites and benchmarks. This produces a .prof file containing time and allocation info for the test run.
- To run profiling for an executable, run:

    stack clean   # not always needed
    stack build --profile
    stack exec --profile -- my-exe +RTS <profiling options>
The GHC system compiles the code and then links it with an RTS (runtime system), which handles storage management, thread scheduling, profiling, and so on. The RTS has a lot of options to control its behaviour. To set RTS options, we can declare them on the command line between +RTS ... -RTS when running the program. The most common (and relevant) option is -p, which produces a standard time profile report.
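For example, assuming the executable built above is called my-exe and takes a single numeric argument (the argument value here is made up), a profiled run would look like:

    stack exec --profile -- my-exe 100 +RTS -p -RTS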
Profiling with Cabal
Run:
cabal v2-run --enable-profiling my-exe -- +RTS -p
This produces a file my-exe.prof.
• Understanding Profiling Reports
GHC’s profiling system assigns costs to cost centres. A cost is the time or space required to evaluate an expression. Cost centres are program annotations around expressions - all costs incurred by the annotated expression are assigned to the enclosing cost centre.
total time  =  0.14 secs  (7 ticks @ 20 ms)
total alloc = 8,741,204 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  %time %alloc

nfib         Main    100.0  100.0

                                           individual     inherited
COST CENTRE  MODULE      entries  %time %alloc    %time %alloc

MAIN         MAIN              0    0.0   0.0     100.0 100.0
 main        Main              0    0.0   0.0       0.0   0.0
 CAF         PrelHandle        3    0.0   0.0       0.0   0.0
 CAF         PrelAddr          1    0.0   0.0       0.0   0.0
 CAF         Main              9    0.0   0.0     100.0 100.0
  main       Main              1    0.0   0.0     100.0 100.0
   g         Main              1    0.0   0.0       0.0   0.2
    nfib     Main            465    0.0   0.2       0.0   0.2
   f         Main              1    0.0   0.0     100.0  99.8
    nfib     Main          242785  100.0  99.8     100.0  99.8
- The first part of the file gives the program name and options, and the total time and total memory allocation measured during the run of the program.
- The second part of the file is a break-down by cost centre of the most costly functions in the program.
- The third and final part of the file gives a full profile break-down by cost-centre stack. This is a call-graph profile of the program (in which we can find calls of all the various functions involved in the execution of the program).
The time and allocation incurred by a given part of the program is displayed in two ways:
- Individual, which are the costs incurred by the code covered by this cost-centre stack alone.
- Inherited, which includes the costs incurred by all the children of this node.
The meanings of the various columns in the profiling report are:
- Entries - the number of times this particular point in the call graph was entered
- Individual %time - the percentage of the total run time of the program spent at this point in the call graph
- Individual %alloc - the percentage of the total memory allocations (excluding profiling overheads) of the program made by this call
- Inherited %time - the percentage of the total run time of the program spent below this point in the call graph
- Inherited %alloc - the percentage of the total memory allocations (excluding profiling overheads) of the program made by this call and all of its sub-calls
• Inserting Cost Centres By Hand
Cost centres are just program annotations. When passing -fprof-auto (previously known as -auto-all) to the compiler, it automatically inserts a cost centre annotation around every top-level function not marked INLINE in the program. However, we can manually add the cost centre annotations ourselves. When using stack, we don't need to add any extra GHC options to the package.yaml file to do this.
The syntax of a cost centre annotation is:
{-# SCC "name" #-} <expression>
where "name"
is an arbitrary string that labels the cost centre as it appears in the profiling report, and <expression>
is any Haskell expression. An SCC
(Set Cost Centre) annotation extends as far to the right as possible when parsing.
For example:
main :: IO ()
main = do
  let xs = {-# SCC "X" #-} [1..1000000]
  let ys = {-# SCC "Y" #-} [1..2000000]
  print $ last xs
  print $ last $ init xs
  print $ last ys
  print $ last $ init ys
We can then find these labelled cost centres in the third section of our profiling report (in the call-graph profile / breakdown of cost centres).
• Visualising Profiler Reports
To make it easier to analyse .prof files, we can use ghcprofview.py from ghcprofview-py and then execute it on our .prof file:
./ghcprofview.py path/to/file.prof
This opens an interactive visualisation of the profile.