How to Benchmark and Profile
Benchmarking With Criterion & Stack
- The benchmark program we intend to profile needs to have a function which takes one argument (influencing the computational effort of executing the program) - this is the function we directly call for profiling.
  This should be under the test directory, e.g. test/BenchmarkProgram.hs, which contains a function:

    testPmmh :: Int -> IO ()
    testPmmh nsteps = do
      observedStates <- generateData initLatentState initParams nsteps
      print observedStates
- The program that calls the profiling functions should ideally be test/Spec.hs. This needs to import both the benchmarked program and Criterion:

    import BenchmarkProgram
    import Criterion.Main

  It should contain a main function which details how benchmarking is performed. (This isn't the optimal usage of Criterion's tools, but this is still the core idea.)

    main = defaultMain [
        bgroup "pmmhTest" [ bench "10"  $ whnf testPmmh 10
                          , bench "20"  $ whnf testPmmh 20
                          , bench "40"  $ whnf testPmmh 40
                          , bench "80"  $ whnf testPmmh 80
                          , bench "160" $ whnf testPmmh 160 ]
      ]
  In this example, the defaultMain Criterion function takes a list of Benchmark values, each of which describes a function to benchmark.

    defaultMain :: [Benchmark] -> IO ()

  The bgroup function groups related benchmarks together, where its first argument is a name for that group of benchmarks.

    bgroup :: String -> [Benchmark] -> Benchmark

  The bench function (unsurprisingly) benchmarks a program. The first argument is a name describing the activity we're benchmarking. The Benchmarkable type is a container for code that can be benchmarked.

    bench :: String -> Benchmarkable -> Benchmark

  The whnf function applies an argument to a function and evaluates the result to weak head normal form (WHNF).

    whnf :: (a -> b) -> a -> Benchmarkable
- The package.yaml file should contain the following:

    tests:
      benchmarks-test:
        main: Spec.hs
        source-dirs: test
        ghc-options:
        - -O
        - -threaded
        - -rtsopts
        - -fprof-auto
        - -with-rtsopts=-N
        dependencies:
        - benchmarks
        - criterion
        # and other dependencies which the benchmarked program needs, e.g.:
        - statistics
        - monad-bayes
        - mtl
        - log-domain
- Next, within the test directory, we can then compile Spec.hs (along with its local imports, i.e. the benchmarked program):

    stack ghc BenchmarkProgram.hs Spec.hs

  Then, by running the following, the --output option directs our program to write a report to the file spec.html:

    ./Spec --output spec.html

  Clicking spec.html opens this report.
Benchmarking With Criterion & Cabal
- The benchmark program we intend to profile needs to have a function which takes one argument (influencing the computational effort of executing the program) - this is the function we directly call for profiling.
  This can be under the src directory, e.g. src/PmmhHmm.hs, which contains a function:

    testPmmh :: Int -> IO ()
    testPmmh nsteps = do
      observedStates <- generateData initLatentState initParams nsteps
      print observedStates
- The program that calls the profiling functions should ideally be benchmarking/BenchmarkProgram.hs and be the Main module. This needs to import both the benchmarked program (here, the PmmhHmm module) and Criterion:

    module Main where

    import PmmhHmm
    import Criterion.Main

  It should contain a main function which details how benchmarking is performed. (This isn't the optimal usage of Criterion's tools, but this is still the core idea.)

    main = defaultMain [
        bgroup "pmmhTest" [ bench "10"  $ whnf testPmmh 10
                          , bench "20"  $ whnf testPmmh 20
                          , bench "40"  $ whnf testPmmh 40
                          , bench "80"  $ whnf testPmmh 80
                          , bench "160" $ whnf testPmmh 160 ]
      ]
- The .cabal file should contain the following (the build dependencies need to include our own package if the benchmarking program uses exposed modules from our own library):

    executable benchmark-program
      -- build-depends must list our own package name (here, probabilistic-programming)
      main-is:            BenchmarkProgram.hs
      build-depends:      base >=4.11 && <4.14,
                          criterion,
                          probabilistic-programming,
                          mtl >=2.2.2 && <2.3,
                          statistics >=0.14.0 && <0.16,
                          log-domain >=0.12 && <0.14,
                          megaparsec
      default-language:   Haskell2010
      hs-source-dirs:     benchmarking
- Next, within the benchmarking directory, we can then compile BenchmarkProgram.hs and execute it with the command line arguments --output benchmark-results.html, where the --output option directs our program to write a report to the file benchmark-results.html. Writing -- after cabal run <executable> lets us specify any following command line arguments to the executable:

    cabal run benchmark-program -- --output benchmark-results.html

  Clicking benchmark-results.html opens this report.
Criterion
Criterion allows two kinds of code to be benchmarked.
- Any IO action can be benchmarked directly
- We can benchmark pure functions (with some trickery)
• Benchmarking an IO action
We use nfIO to specify that after we run an IO action, its result must be evaluated to normal form (so that all of its internal constructors are fully evaluated, and it contains no thunks).
nfIO :: NFData a => IO a -> Benchmarkable
The function nfIO is best used when:
- Lazy I/O is involved (nfIO avoids resource leaks)
- We aren't confident how much evaluation will have been performed on the result of an action (nfIO makes sure that it is fully evaluated)
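As a minimal sketch (the file name and benchmark label here are invented for illustration), a lazy file read could be benchmarked with nfIO so that the whole result is forced:

import Criterion.Main

main :: IO ()
main = defaultMain
  [ -- readFile is lazy, so nfIO forces the whole file contents;
    -- the benchmark then measures the full cost of the read
    bench "readFile" $ nfIO (readFile "input.txt")
  ]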
We use whnfIO to specify that after we run an IO action, its result must be evaluated only deep enough for the outermost constructor to be known.
whnfIO :: IO a -> Benchmarkable
The function whnfIO is best used when:
- Our I/O action returns a simple value such as an Int
- Our I/O action returns something complex such as a Map, but evaluating the outermost constructor will do "enough work"
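As a sketch (the IORef and benchmark name are invented for illustration), an action returning a plain Int only needs whnfIO:

import Criterion.Main
import Data.IORef

main :: IO ()
main = do
  ref <- newIORef (0 :: Int)
  defaultMain
    [ -- the result is a single Int, so WHNF already does all the work
      bench "bumpCounter" $ whnfIO (modifyIORef' ref (+ 1) >> readIORef ref)
    ]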
• Benchmarking a pure function
Lazy evaluation makes it hard to benchmark pure code:
- If we tried to saturate a function with all of its arguments and evaluate it repeatedly, laziness would ensure that we’d only do “real work” the first time through our benchmarking loop. The expression would be overwritten with that result, and no further work would happen on subsequent loops through our benchmarking harness.
We can overcome laziness by benchmarking an unsaturated function (a function which has been given all but one of its arguments).
The function nf takes two arguments: the unsaturated function we want to benchmark, and the final argument to pass to it. It applies the argument to the function and then evaluates the result to normal form.
nf :: NFData b => (a -> b) -> a -> Benchmarkable
The function whnf takes similar arguments but evaluates the result only to weak head normal form.
whnf :: (a -> b) -> a -> Benchmarkable
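For instance, with a naive Fibonacci function (invented here purely as a workload), we pass the function unsaturated and let Criterion apply the final argument on every iteration:

import Criterion.Main

-- deliberately slow Fibonacci, used only as a pure workload
fib :: Int -> Integer
fib n
  | n < 2     = fromIntegral n
  | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = defaultMain
  [ bgroup "fib"
      [ bench "whnf/20" $ whnf fib 20
      , bench "nf/20"   $ nf   fib 20  -- equivalent here: an Integer has no lazy substructure
      ]
  ]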
• Understanding Criterion Reports
- The left chart is a kernel density estimate (KDE) of time measurements - this graphs the probability of any given time measurement occurring. A spike indicates that a measurement of a particular time occurred, and its height indicates how often that measurement was repeated.
- The right chart contains the raw measurements from which the kernel density estimate was built. The x axis indicates the number of loop iterations. The y axis shows measured execution time for the given number of iterations. The line is the linear regression generated from this data.
- OLS regression estimates the time needed for a single execution of the benchmarked activity, using an ordinary least-squares regression model. This number should be similar to the mean execution time. The OLS estimate is usually more accurate than the mean, as it more effectively eliminates measurement overhead and other constant factors.
- R² goodness-of-fit measures how accurately the linear regression model fits the observed measurements. If the measurements are not too noisy, R² should lie between 0.99 and 1, indicating an excellent fit. If the number is below 0.99, something is confusing the accuracy of the linear model; if it is below 0.9, this is very worrisome.
- Mean execution time and standard deviation are statistics calculated from execution time divided by the number of iterations.
  The lower and upper bounds measure the accuracy of the main estimate (using a statistical technique called bootstrapping) - this tells us that when randomly resampling the data, 95% of estimates fall within the lower and upper bounds. Lower and upper bounds which are close to the main estimate indicate that the main estimate is of good quality.
Profiling with Stack
- To run profiling while performing these benchmarks, run:

    stack test --profile

  The --profile flag turns on the --enable-library-profiling and --enable-executable-profiling cabal options, and also passes the +RTS -p -RTS runtime options to any test suites and benchmarks. This produces a .prof file containing time and allocation info for the test run.
- To run profiling for an executable, run:

    stack clean   # not always needed
    stack build --profile
    stack exec --profile -- my-exe +RTS <profiling options>
The GHC system compiles the code and then links it with an RTS (runtime system), which handles storage management, thread scheduling, profiling, and so on. The RTS has a lot of options to control its behaviour. To set RTS options, we can declare them on the command line between +RTS ... -RTS when running the program. The most common (and relevant) option is -p, which produces a standard time profile report.
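For example, assuming the executable built above is called my-exe and takes a single numeric argument (the argument value here is made up), a profiled run would look like:

    stack exec --profile -- my-exe 100 +RTS -p -RTS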
Profiling with Cabal
Run:
cabal v2-run --enable-profiling my-exe -- +RTS -p
This produces a file my-exe.prof.
• Understanding Profiling Reports
GHC’s profiling system assigns costs to cost centres. A cost is the time or space required to evaluate an expression. Cost centres are program annotations around expressions - all costs incurred by the annotated expression are assigned to the enclosing cost centre.
total time  =  0.14 secs  (7 ticks @ 20 ms)
total alloc = 8,741,204 bytes  (excludes profiling overheads)

COST CENTRE  MODULE  %time %alloc

nfib         Main    100.0  100.0

                                           individual     inherited
COST CENTRE  MODULE      entries  %time %alloc    %time %alloc

MAIN         MAIN              0    0.0   0.0     100.0 100.0
 main        Main              0    0.0   0.0       0.0   0.0
 CAF         PrelHandle        3    0.0   0.0       0.0   0.0
 CAF         PrelAddr          1    0.0   0.0       0.0   0.0
 CAF         Main              9    0.0   0.0     100.0 100.0
  main       Main              1    0.0   0.0     100.0 100.0
   g         Main              1    0.0   0.0       0.0   0.2
    nfib     Main            465    0.0   0.2       0.0   0.2
   f         Main              1    0.0   0.0     100.0  99.8
    nfib     Main          242785  100.0  99.8     100.0  99.8
- The first part of the file gives the program name and options, and the total time and total memory allocation measured during the run of the program.
- The second part of the file is a break-down by cost centre of the most costly functions in the program.
- The third and final part of the file gives a full profile break-down by cost-centre stack. This is a call-graph profile of the program (in which we can find calls of all the various functions involved in the execution of the program).
The time and allocation incurred by a given part of the program is displayed in two ways:
- Individual, which are the costs incurred by the code covered by this cost-centre stack alone.
- Inherited, which includes the costs incurred by all the children of this node.
The meanings of the various columns in the profiling report are:
- Entries - the number of times this particular point in the call graph was entered
- Individual %time - the percentage of the total run time of the program spent at this point in the call graph
- Individual %alloc - the percentage of the total memory allocations (excluding profiling overheads) of the program made by this call
- Inherited %time - the percentage of the total run time of the program spent below this point in the call graph
- Inherited %alloc - the percentage of the total memory allocations (excluding profiling overheads) of the program made by this call and all of its sub-calls
• Inserting Cost Centres By Hand
Cost centres are just program annotations. When passing -fprof-auto (previously known as -auto-all) to the compiler, it automatically inserts a cost centre annotation around every top-level function not marked INLINE in the program. However, we can manually add the cost centre annotations ourselves. When using stack, we don't need to add any extra GHC options to the package.yaml file to do this.
The syntax of a cost centre annotation is:
{-# SCC "name" #-} <expression>
where "name"
is an arbitrary string that labels the cost centre as it appears in the profiling report, and <expression>
is any Haskell expression. An SCC
(Set Cost Centre) annotation extends as far to the right as possible when parsing.
For example:
main :: IO ()
main = do
  let xs = {-# SCC "X" #-} [1..1000000]
  let ys = {-# SCC "Y" #-} [1..2000000]
  print $ last xs
  print $ last $ init xs
  print $ last ys
  print $ last $ init ys
We can then find these labelled cost centres in the third section of our profiling report (in the call-graph profile / breakdown of cost centres).
• Visualising Profiler Reports
To make it easier to analyse .prof files, we can use ghcprofview.py from ghcprofview-py and then execute it on our .prof file:
./ghcprofview.py path/to/file.prof
This opens an interactive visualisation of the profile.