Probabilistic Effects. λθ

Research Journal

def model(p):
    burglary <~ bern(0.0001)
    alarm    <~ bern(if burglary: 0.95 else: 0.01)
    return (alarm, burglary)
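The pseudocode above can be rendered as a small, self-contained Haskell data-generating process. The list-of-uniforms "sampler" below is a hypothetical stand-in for a real sampling monad (e.g. monad-bayes's MonadSample); only the model structure mirrors the pseudocode.

```haskell
-- Hypothetical stand-in for a sampling monad: a supply of uniform
-- draws in [0,1), threaded through by hand. A real implementation
-- would use a class like monad-bayes's MonadSample instead.
type Sampler a = [Double] -> (a, [Double])

-- bern p consumes one uniform draw; True with probability p
bern :: Double -> Sampler Bool
bern p (u:us) = (u < p, us)
bern _ []     = error "out of randomness"

-- The burglary/alarm model above, as a data-generating process
model :: Sampler (Bool, Bool)
model us0 =
  let (burglary, us1) = bern 0.0001 us0
      (alarm,    us2) = bern (if burglary then 0.95 else 0.01) us1
  in  ((alarm, burglary), us2)
```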

New or Long-Standing Research Questions, or Missing Research

  • How to elegantly implement composable approximate inference algorithms (SMC, MH, SMC², variational MCMC, and Monte Carlo methods in general, which are the primary methods for obtaining posteriors). In practice, exact inference is not widely used; most probabilistic inference algorithms are approximate.
  • Algebraic effects in probabilistic programming.
  • Use GADTs for typed tagless-final style?
  • Look at embeddings in Haskell (ICFP literature).
  • Ability to cleanly separate bayesian components in the model - i.e. the prior and likelihood? (The model as a data generation process can essentially be the likelihood when we don’t know the likelihood).
  • Having a clear workflow/set of steps to define a bayesian model and perform inference
  • Clean separation of model specification and inference methods.
  • Models as first class objects (purely as data generative processes), which can be inferred over, as opposed to having to redefine a model for both simulation and inference.
  • We are focused on simulation based inference.

  • Addressing 1: TraceT [Int, Int]
  • Addressing 2: Adding fresh addresses to all expressions at compile time, and appending them at evaluation run time for increased call stack depth. I don’t think this will work during evaluation: if we continue appending addresses for every function call, then the addresses of sample statements are affected by the entire preceding call stack.
  • Addressing 3: Addressing all sample statements at compile time, and indexing them by number of occurrences at evaluation run time (the Anglican approach).
  • Addressing 4: Using stable names to provide addresses at evaluation time, and likewise appending them as call stack depth increases.
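Addressing 3 above (the Anglican approach) can be sketched as a counter threaded through evaluation. The names and representation here are hypothetical, not Anglican's actual implementation:

```haskell
-- A sketch of compile-time site labels plus run-time occurrence
-- counts. Counts maps each static site label to how many times it
-- has been sampled so far; an address pairs the label with that count.
type Addr   = (String, Int)
type Counts = [(String, Int)]

-- Produce a fresh address for a sample site and bump its counter,
-- so the same site hit repeatedly (e.g. in a loop) gets distinct
-- addresses.
freshAddr :: String -> Counts -> (Addr, Counts)
freshAddr site counts =
  let n = maybe 0 id (lookup site counts)
  in  ((site, n), (site, n + 1) : filter ((/= site) . fst) counts)
```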

Activity (MN)

  • Implementing FOPPL from “Intro to Probabilistic Programming”
  • Reading “Design and Implementation of Probabilistic Programming Language Anglican”
  • Reading “Folding Domain-Specific Languages: Deep and Shallow Embeddings”, and “Tagless final style”

Activity (MN) 19/01/2021 - 22/01/2021

  • Reading and writing up introduction to probabilistic programming.
  • Meeting with Alessio on understanding how the free monad transformer tree is constructed and executed as a random decision function.
  • Things to look at: Tagless final style, Embedded probabilistic programming, PλωNK: Functional Probabilistic NetKAT, Algebraic effects

Activity (MN) 18/01/2021

  • Checking if all of the naive optimisation options have been exhausted:
    • Inlining (x)
    • CPS’ing (x)
    • Compile-time flags (x)
    • Modules – using explicit export lists (x)
    • The following have been partially applied, but they are typically time-consuming and yield small improvements:
      • Unboxed types (?)
      • Strictness annotations (?)
      • Strength reduction (?)
  • To do:
    • Implement a version of trace MCMC in C++/R to compare against, since there’s no point comparing to MCMC without tracing.
    • Learn about eff and implement monad-bayes in eff
  • Started using larger arguments to the benchmark program to increase the base run-time, so that optimisations yielding improved run-times are more noticeable.
  • Reading “Freer monads, more extensible effects” in order to understand and use freer-simple.
  • Discovered a number of resources on compiling with continuations (automating/algorithmically generating CPS code from direct-style code)
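For reference, the kind of CPS'ing mentioned above, applied to a state monad, might look like this minimal sketch (a standard encoding, not the monad-bayes code itself):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A CPS-encoded state monad: instead of returning a pair (a, s),
-- the computation passes both to a continuation. This avoids
-- allocating intermediate tuples and keeps binds right-associated.
newtype StateC s a = StateC { runStateC :: forall r. s -> (a -> s -> r) -> r }

instance Functor (StateC s) where
  fmap f m = StateC $ \s k -> runStateC m s (\a s' -> k (f a) s')

instance Applicative (StateC s) where
  pure a    = StateC $ \s k -> k a s
  mf <*> ma = StateC $ \s k ->
    runStateC mf s (\f s' -> runStateC ma s' (\a s'' -> k (f a) s''))

instance Monad (StateC s) where
  m >>= f = StateC $ \s k -> runStateC m s (\a s' -> runStateC (f a) s' k)

get' :: StateC s s
get' = StateC $ \s k -> k s s

put' :: s -> StateC s ()
put' s = StateC $ \_ k -> k () s

evalStateC :: StateC s a -> s -> a
evalStateC m s = runStateC m s (\a _ -> a)

-- Example: a counter that returns the pre-increment value
tick :: StateC Int Int
tick = do n <- get'; put' (n + 1); return n

example :: Int
example = evalStateC (tick >> tick >> get') 0
```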

Meeting (MN + AZ) 05/01/2021 - 17/01/2021

  • Understanding and writing up documentation regarding statistics, monad-bayes, trace mcmc etc.
  • CPS’ing the Coroutine library.

Meeting (MN + MW + SF + WZ) 05/01/2021

  • Still trying to make monad-bayes faster. Initial time: 27s; new time: 4.8s, achieved by: 1. inlining and unboxing, 2. continuation-passing style for the state monad, 3. switching to LogicT to reassociate binds so they are better optimised.
  • M: Where does Jamie get these ideas from?
    Minh: Probably from word of mouth, passed down. All word of mouth. Cont-passing is used, but it is not a keyword; mtl doesn’t use it.
    M: Can we automate using cont-passing style?
    Minh: Yes.
    M: What next?
    Minh: Most of the runtime (50%) is on coroutines, so I will learn about those. They are like free monads. There are many libs that do this; maybe a different one is better. Maybe cont-passing can be done there; there is a paper that does that.
    Minh: Most optimisations so far have been changing what monad-bayes uses, i.e. changing the monads.
    M: So monad-bayes is actually well designed, but it just uses rubbish inputs. Maybe you can propose a modular approach to improving the performance of the lib by switching all these things out. Use monad-bayes as an example, spot the patterns (e.g. the libs it uses) and know how to improve each of them. When you have all these patterns and how to optimise them, try the techniques out on a different language. For now there are probably more patterns to find. Keep up the good work! If you find more of the same, that is good confidence-building; if you find something new, that is also exciting!
    Minh: Benchmarking = the particle thing and the markov chain, i.e. the benchmark we made.
    M: Don’t forget to also evaluate the existing examples. They may not be as interesting as yours, since yours probably uses everything because that is good for testing, but you can translate it to third-party examples.
    M: Why this new example?
    Minh: The existing examples were too specific to certain components and were not realistic. Our benchmark is the biggest use of monad-bayes, as opposed to the existing toy examples.
    M: How old is the paper?
    Minh: 2018 (paper) / 2019 (thesis).
    M: Trace the references and see what others have done.

Activity (MN) 04/01/2021

  • Due to just over 50% of the benchmark run-time for monad-bayes currently being spent by coroutines: I’ve been trying to understand coroutines in Haskell, and also exploring existing coroutine libraries (monad-coroutine, pipes, conduit, three-continuation-approach).
  • Trying to reimplement the monad-coroutine library (which monad-bayes uses) in CPS.

Activity (MN)     28/12/2020 - 03/01/2021

  • Trying to reimplement and integrate components of monad-bayes in CPS-form
  • More CPS conversation with JW.
  • Learned and integrated some alternatives to List and ListT: DList and LogicT.
  • Successfully implemented and integrated a CPS version of the Trace data type of monad-bayes, trying both TraceCPS and TraceTCPS – neither versions showed an improvement in run-time.
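As a reminder of why DList helps, here is a minimal sketch of the idea (a standard encoding, not the dlist package's exact code): lists are represented by their prepend functions, so append becomes function composition, and left-nested appends no longer cost O(n) each.

```haskell
-- A difference list: a list represented as the function that
-- prepends it to another list.
newtype DList a = DList ([a] -> [a])

fromList :: [a] -> DList a
fromList xs = DList (xs ++)

toList :: DList a -> [a]
toList (DList f) = f []

-- Append is just composition: O(1) regardless of nesting.
append :: DList a -> DList a -> DList a
append (DList f) (DList g) = DList (f . g)
```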

Activity (MN)     22/12/2020 - 27/12/2020

Activity (MN)     16/12/2020 - 18/12/2020

Meeting (MN + RP + MW)     15/12/2020

  • Todo: look at handwriting some monad-bayes, and get more involved with Haskell core

Activity (MN)     11/12/2020 - 14/12/2020

Meeting (MN + NW)     11/12/2020

  • Inlining with Typeclasses - GHC is rather reluctant to inline in general recursive code. There is only one exception: GHC is keen to create type-specialised copies of (constraint) polymorphic recursive definitions and to inline the definitions of typeclass methods in the process.
  • If we want to go down the path of understanding performance in Haskell code, understanding core is quite essential, especially for low-level optimisations (for people who want to understand high-level optimisations such as fusion, it’s not as essential, but for understanding things such as inlining and staging, it is critical).
  • The vast majority of low-level optimisation techniques remain relevant from version to version of GHC.
  • The inlining of type classes trick is known by quite advanced Haskell hackers, not necessarily by Haskell researchers. The frustration with this trick is what led MP to go on and do some staging – the inliner is so unreliable, and all we can do is hit it with a global constant that says “try harder” which isn’t very predictable. We want some localised reasoning sometimes, and we don’t get that. The old way of doing this was to work hard to build a whole bunch of data types that interact in just the right way so that the typeclass system is going to work in our favour, and this is just very difficult. The whole point of MP’s research was to talk about how we can make things reliable through staging instead. For NW, this is the future of thinking about performance in core code. There are still some open questions – the nice thing about inline rules is that we can write down some laws, and with some hope, they will fire. The hope with staging is that we can encode some of these laws as data types that then get staged away so that those laws are forced upon the structure of things. The bad thing is that sometimes we want there to be some interaction between staging and inlining; NW does not know how to do that yet – what they have in mind is that you write some staged code, it puts the code into the right shape, some fusion through inlining possibly happens (that’s what we expect at the moment – the code gets rearranged in some way and good things happen), but what we don’t know how to do is to say “actually, now I want you to try this type of optimisation because I’m pretty sure it will work in certain cases”. So we want to have some sort of extra stage that happens after the first stage, so that between the first stage and the second stage, there needs to be some inlining happening, and NW doesn’t know how to control that.
  • The idea with staging is that we make a promise that says “please compile this code now, and leave this other piece of code for run-time”. So perhaps we want to compile, and instead of making this block of code happen at run-time, we want to make happen at the next compilation stage. So we have code within code within code within code etc; this is the idea of multi-staging.
  • We need to have a good expectation of what we hope to make faster in the first place. We want to have a good idea of what the different stages will look like before doing any staged programming. One thing JW does in Parsec is think “we think this thing could probably be staged” - they are thinking about what the final product looks like, what they are hoping to make the compiler do, and which bits of code they aim to make the compiler touch so it will go faster. If we don’t have a good sense of that, then there’s no point really in proceeding, because we don’t have any guarantees that anything will get staged away.
  • The reason NW advocates for writing things in terms of algebras all the time is for the optimiser. When we write things in terms of algebras, what we usually have are non-recursive bits of code which are going to work in sequence (or maybe even in parallel) on some data structure. We may be able to fuse those algebras together more easily when they’re exposed as algebras. If the algebraic content is trapped in some recursive loop, we have very little hope of getting that to fuse nicely.
  • Q: With respect to the idea of using Andre Loh’s inlining technique applied to folding over neural networks, what do you think about going down this route, and what would the focus of such a paper be?

    A: That would be an easy win as a first step. In response to what the focus of the paper would be, this is a direction that was previously discussed with MP but one they didn’t end up exploring. The idea is, given the construction of algebraic data types using fixed points, i.e. μ f = f (μ f), we can imagine doing an indexed version of that where our μ is keeping track of how many unrolls it has done so far before it gives up. It’s an n-unfolding version of μ. We might have some type-level parameter and if it reaches zero then we just use the actual fixed point. The point is that if we can start expressing algebras in those terms (which is quite trivial) then we can start staging those things (which NW thinks is quite trivial too), but NW doesn’t think they’ve seen anybody do it. There’s a story there to be told about staging recursive data types – a good paper around this would be some analysis of whether that has worked, and some examples of where this is interesting, and folding over neural networks is a great example as we have a big data structure with a finite number of unrollings, but it’s a finite and unpredictable number because maybe we want 10 levels of unrolling in this situation and 5 levels in another situation. So this example is really ripe for this question.

  • Q: Previously with folding over neural networks, we were kind of ignoring the performance aspect and focusing on the expressiveness of the whole idea.

    A: NW would still err on the side of caution, because we most likely wouldn’t be able to beat optimised libraries for neural networks using this technique. It depends what we’re trying to compare against. If we’re trying to compare against other neural network libraries, we won’t be able to win yet. There’s a PhD’s worth of work in trying to make neural networks in Haskell performant, and even if we get close, that would be a successful PhD. But a different angle of the question is, even if we can just show how to make recursive data types faster, that’s not comparing with the rest of the neural network literature, that’s just within the Haskell literature, and it’s an interesting question. And so the point of comparison becomes easier to argue – as a reviewer, they might critique us if we were trying to show that this is a good neural network library. But if we can show in terms of benchmarks how to make GHC better or make Haskell code better, then we have an infallible argument, it’s just correct. NW can imagine a line of research which goes something like: we first tinker with how we can make recursive data types inline better, then once we do that, we start telling that story, and as an example, we use neural networks. If that gets accepted, then part two could be to start thinking about neural networks seriously, and how far we are from competing with other neural network languages and what would be required. State of the art neural network systems use GPUs for their algorithms. That opens another question: given algebraic data types, what are the constraints on those data types that are required in order for us to do GPU processing on them? And that’s a really nice question, because we can imagine writing normal Haskell code with normal Haskell data types, and then using staging which realises that if our data types are only so big, then we are better off making that data type statically sized and then throwing it at our GPU.
We can do all of this behind the scenes without anybody ever needing to know. That would be wonderful. It might be that we start getting into the same ballpark as other optimisation frameworks for neural networks, except that ours doesn’t work for just neural networks, it works for a whole family of algebraic data types – and that’s really interesting.

  • Q: Given the paper feedback received for folding over neural networks, how do you think this research direction answers those critiques, or does it even need to answer those if we’re trying to argue its value with a different angle?

    A: NW doesn’t think it needs to answer those questions, and that we’re asking a different question here. The angle we were taking before was: here are neural networks, and here’s a different way of describing them which is more structured. And some people didn’t buy that, because they needed too much domain knowledge about recursion schemes to be able to see that this was useful, and at the same time, they knew too much about neural networks and thought that this research was too special case. The long term goal of neural networks in Haskell is a good long term goal which is achievable, but it would take a couple years of work, it isn’t something that can be done in half a year. There’s a kind of upfront risk there which is difficult to swallow. Going through the angle of “how do we make recursive data types faster?” is a way of getting to that goal, but with more confidence that people will be happy to accept that story, because it’s not about neural networks per se, it’s about something grander.

  • Q: I’m thinking of juggling both projects at the same time - both improving the performance of monad-bayes or improving probabilistic programming in Haskell, as well as improving performance in recursive data types using neural networks as an example. What would be the next steps?

    A: For one direction, the next steps would be to learn how to stage, and how to stage this n-unfolding version of μ. (NW is slightly concerned because they discussed this idea with JW, and doesn’t know whether they are pursuing this as research).

    For another direction, with respect to staging something like monad-bayes, one idea that is interesting there which is an area NW hasn’t explored, is – given that we can draw a random value from a uniform distribution, we can use this to create a value drawn from a normal distribution quite easily (there are a lot of techniques for doing this), and we can also get beta distributions and all sorts of distributions out of this. An interesting question is, how do we make that happen more quickly, or, how do we fuse together the code that does this? To be more precise, we tend to piggy-back off one distribution to make another, to make another, and so on. The process is that we usually write some function that takes our random value and does something to it, and then we take this value and do something else to it, and so on. What we’re hoping here is that there is some fusion that could happen. If we imagine transforming our uniformly distributed value through ten different transformations, we would want those transformations to fuse together into one single thing. Well, how do we make sure this is actually happening? Of course we’ve got inline rules and pragmas, but we’re not convinced that that’s always going to happen, so what’s the role of staging in this? It would be nice to find some cute distributions which require iterative steps of values drawn from the previous distribution, and onwards (markov chains do this sort of thing). Inlining these distributions would be cool, and inlining those through staging would be very cool. For example if we had a markov chain of a fixed length n, we could imagine writing the handrolled version which does n layers of iterations at once, but it’s huge – it takes all the input and does this great big sum and multiplication etc, and we’re hoping that it rearranges nicely.
The end optimal code is probably going to use things like Horner’s rule to reduce the number of multiplications, and things like these are what the inliner won’t spot. We can imagine there being a great story there for staging, one which hasn’t been explored.

    A suggestion for an initial direction could be to look at a markov chain with a small fixed length, and think about how we would unroll that chain, and what code we would write if we were to unroll that manually, and what optimisations we could perform (such as Horner’s rule). The next step would be to think about what the code would look like if we increased the number of layers in the chain, and write the code manually, and bench it. Compare the version that only does the unrolling n times versus the one that has been hand-optimised, and see if it makes a difference (it would be surprising if it didn’t). Then we could ask ourselves, how do we automate this? NW thinks staging is the answer to that question. This seems like it could be a successful project that takes perhaps 3 or 4 months to get through, and is also a good first step towards understanding the optimiser properly, and understanding staging properly.
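The n-unfolding version of μ discussed above might be sketched as follows. The names (NMu, Stop, Unroll) and representation here are hypothetical choices for illustration, not something NW specified:

```haskell
{-# LANGUAGE DataKinds, DeriveFunctor, GADTs, KindSignatures #-}
import Data.Kind (Type)

-- Ordinary fixed point of a functor: Mu f = f (Mu f)
newtype Mu f = In (f (Mu f))

cata :: Functor f => (f a -> a) -> Mu f -> a
cata alg (In x) = alg (fmap (cata alg) x)

-- Type-level naturals for counting unrollings
data Nat = Z | S Nat

-- Indexed fixed point: n explicit layers, then the ordinary Mu.
-- Each Unroll layer is non-recursive in the index, which is the
-- shape a staged/inlined fold could hope to exploit.
data NMu (n :: Nat) (f :: Type -> Type) where
  Stop   :: Mu f        -> NMu 'Z f
  Unroll :: f (NMu n f) -> NMu ('S n) f

ncata :: Functor f => (f a -> a) -> NMu n f -> a
ncata alg (Stop m)   = cata alg m
ncata alg (Unroll x) = alg (fmap (ncata alg) x)

-- Tiny example: lists as a fixed point, folded with a sum algebra
data ListF e r = Nil | Cons e r deriving Functor

sumAlg :: ListF Int Int -> Int
sumAlg Nil        = 0
sumAlg (Cons e r) = e + r

example :: Int
example = ncata sumAlg (Unroll (Cons 1 (Unroll (Cons 2 (Stop (In Nil))))))
```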

Activity (MN)     10/12/2020

Activity (MN)     08/12/2020 - 09/12/2020

Meeting (MN + MW + SF + WZ)     08/12/2020

  • MW: Inlining is the first unattractive step to investigate what this can and can’t do; this is not necessarily your direction. This doesn’t have to be about finding a solution – it could be finding out what the problem is. Don’t worry about small and specific problems; if you collect enough of them, there might be a bigger thing going on.
  • MW: Two questions re monad-bayes / similar libs:
    1. What is the problem (learning about inlining helps to answer this) and can we generalise this problem? What other languages suffer from this? So far we know it is slow, but inlining will help find out why it is slow and what feature is the issue. When we know this, we can see if other languages suffer too.
    2. What solutions exist for this?
  • MW: You’re making good progress! You are homing in on what the problem is, and all you need to know is enough about the problem in order to write a paper.
  • MN: inspiration for folding over neural nets
    • The inliner doesn’t inline recursive functions because we don’t necessarily know how many times they will recurse, but sometimes we do know, and dependent types can enforce this unrolling.
    • In a neural net you always know how many layers there are.
    • Dependent types would also be good for ensuring your neural net is correctly constructed.
    • Should I introduce dependent types and staging into the folding neural nets?
  • MW: It would be a good extension of existing work! The anticipated problem is that the criticism of the original paper still stands. This makes it stronger, but you still need to address the existing criticism. This extension is better if the first paper is published. Does this addition allow you to dodge the criticism?
  • MN: Is it okay to do two things?
  • MW: Everyone does multiple things at the same time. What you need to ask is: in the end, will I have a thesis? You want to be focused enough to make progress. My suggestion is to press on more with probabilistic programming; changing frequently between research paths is not a good idea, and switching is not the best if you have to drop one.
  • MW: For probabilistic programming, think more about the type class dictionary unrolling because that is different from inlining (learnings from MP’s thesis background).

Activity (MN)     07/12/2020

Activity/Meeting (MN + AZ)     05/12/2020

  • Reading and writing up of Secrets of the GHC inliner (inlining).
  • Todo: Look at Jennifer Hackett & Graham Hutton’s work on optimisations.
  • How can we make it easier for people to write DSLs and do consistent optimisations on them without having to know about GHC?

Query (MN + NW)     04/12/2020

  • Q: I’ve heard here and there that mtl isn’t considered a real effect system - why is this, and what do you think an effect system consists of?

  • A: That’s a strange statement. I suppose they are trying to talk about the fact that there are no “typing rules”. An effect system is presented, for instance, in the definition of Eff, but I think we can construct one for mtl.

Activity (MN)     04/12/2020

  • Reading and writing up of “Secrets of the GHC inliner” (inlining).
  • Thoughts:
    • Sudden inspiration to look at recursion schemes and neural networks again using dependent types and Andres Löh’s specialisation technique.
    • The general direction of optimisation can seem like a mundane path/dead-end a lot of the time; being able to generalise research on optimisations to a sufficient degree is difficult, and it seems that a lot of context-specific manual fiddling is involved.
    • We want intrinsic, robust optimisations from first principles which shouldn’t solely rely on the compiler or are ad-hoc.
    • Exploring research on expressibility is attractive.
    • Comparing effect systems is still attractive.

Activity (MN)     03/12/2020

  • Cloned monad-bayes in probabilistic-programming for creating a handrolled version.
  • Learning about handrolling monad transformer stacks.
  • Reading and writing up relevant parts of MP’s thesis (specialisation).

Activity (MN)     02/12/2020

Meeting (MN + RP + MW)     01/12/2020

  • What’s a reasonable expectation for the percentage run-time of bind?
  • Because monad-bayes uses the deprecated version of ListT, it’s possible that ListT isn’t as optimised as it could be.
  • Newtypes don’t incur any unwrapping/rewrapping of data types at run-time, so any newtypes in monad-bayes are not inherently problematic.
  • Being able to achieve good performance just by inlining could point to some interesting opportunities, but it’s not necessarily a bad thing if we’re unable to get very far in optimisations by just inlining; this could tell a story about other approaches (e.g. template Haskell, fusion, eff, algebraic effects), and it could say something about mtl itself.
  • Just by looking at the profiling report of our benchmark program, in theory it should be significantly faster than what we have at the moment, as most of the time is still being spent on operations such as bind, liftA2 and pure.
  • Is it actually folklore that monad transformers are slow? What’s the general impression of performance between different effect systems?

Activity (MN)     01/12/2020

Meeting (MN + SF + MW)     01/12/2020

  • Manually adding inline pragmas is needed to improve program performance, but we need to look at the core to know where to put them.
  • Need to benchmark again with more iterations to get a better average, and continue looking at opportunities to inline monad-bayes.
  • The necessity to inline at the user level is problematic as we can’t expect the user of the DSL to know this stuff.
  • Need to look at MP’s thesis. It has a survey of inlining techniques and all about how GHC behaves.
  • What is an effect system? Apparently mtl is not.

Activity (MN)     30/11/2020

  • Figured out and wrote up how to work criterion with cabal
  • Recorded benchmarks for differently inlined versions of monad-bayes

Meeting (MN + AZ)     27/11/2020

  • Research hypothesis currently would be that all of this inlining results in a faster library. Need to go back and benchmark & record timings for each optimisation change on monad-bayes when looking at core to see whether things have actually improved. Benchmark program will be PMMH inference on a HMM.
  • Need to create a handrolled version of monad-bayes (w.r.t its inference transformers), and also continue inlining to see how fast we can get the current version of monad-bayes. If the inlined version can reach the speed of the handrolled version, then it is sufficiently fast.
  • Why has GHC not inlined certain definitions automatically? Need to write up on how the GHC inliner works, and get a good understanding of inlining.
  • There’s a discussion to be had about how this style of doing things is bad - if your program is not reaching the desired performance and it’s not obvious as to how to optimise your program, it could be indicative that the compiler is doing something wrong. If such problems need to be solved by looking at core and then simply adding an inline pragma to certain functions, this seems like a pretty sub-optimal way to be writing programs.

Activity (MN)     25/11/2020

Activity (MN)     24/11/2020

  • Got out of cabal hell

    sudo apt-get remove --auto-remove cabal-install
    sudo apt-get purge cabal-install
    
  • Installed ghcup which allows me to set versions of ghc

    curl https://gitlab.haskell.org/haskell/ghcup/raw/master/bootstrap-haskell -sSf | sh
    source /home/minh/.ghcup/env
    ghcup upgrade
    . "$HOME/.ghcup/env"
    echo '. $HOME/.ghcup/env' >> "$HOME/.bashrc"
    
  • Set up new cabal project for probabilistic-programming, using ghc-8.6.5, containing monad-bayes locally and the original code for HMM simulation and PMMH inference which uses monad-bayes as a non-local build dependency.

    mkdir probabilistic-programming
    cd probabilistic-programming
    cabal init -n --is-executable
    ghcup install 8.6.5
    ghcup set 8.6.5
    
  • Added cabal.project file with optimization:2 flag to project.

  • Performed some analysis and optimisations on the dumped core of monad-bayes, specifically with Population.hs and Weighted.hs.

Meeting (MN + SF + MW + WZ)     24/11/2020

  • A big flaw of monad-bayes is that calls go through references, so inlining will speed this up; for some this is folklore, while others just aren’t aware. This is a topic we should probably try to clarify (pass knowledge on) to people.
  • A good idea is to explore how prevalent this occurrence actually is and see if others have this problem. Software developers can write refutation papers, which are uncommon in functional programming. For such a paper, the paper being refuted must either be an important example, or address many less important examples. Hence, for such a proposed paper to be written, we should find one important example of this problem, or many small ones.

Activity (MN)     23/11/2020

  • Explained issue to JW about being unable to expand definitions for >>= and liftA2 in core for monad-bayes, and having trouble cloning and dumping the core of the mtl and transformers libraries locally.

    JW: Being unable to expand out definitions for >>= and liftA2 is problematic to begin with - if the dumped core still references these functions, then the inliner isn’t doing its job. In parsley, which is a cabal project, he compiled with -O2 by placing optimization: 2 in the cabal.project file, which allows this inlining to occur.

  • As we’re using stack, I’m currently unable to find how the optimization flag translates into a package.yaml or stack.yaml file (it’s definitely not ghc-options: O2). Next step is to figure out cabal, create a cabal project for probabilistic-programming, add the optimization:2 option, and inspect if anything has changed in the dumped core.
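For reference, the cabal.project described by JW would look something like this (a sketch; the packages entry for a local monad-bayes checkout is an assumption about the project layout):

```
packages: .
          ./monad-bayes

optimization: 2
```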

Meeting (MN + AZ)     22/11/2020

  • Looked at core of files Population.hs and Weighted.hs from monad-bayes.
  • The two components of interest from the profiling report were:
    • (>>=) from Population.hs, which is essentially how (>>=) works for StateT (Log Double) (ListT m) a if we ignore the newtype wrappers.

      newtype Population m a = Population (Weighted (ListT m) a)
        deriving (Functor, Applicative, Monad, MonadIO, MonadSample, MonadCond, MonadInfer)
      

      The version of ListT that monad-bayes uses is from the transformers-0.5.2.0 library (this is deprecated in all later versions).

    • liftA2 and pure from Weighted.hs, which is essentially how liftA2 and pure work for StateT (Log Double) m a if we ignore the newtype wrapper.

      newtype Weighted m a = Weighted (StateT (Log Double) m a)
        deriving (Functor, Applicative, Monad, MonadIO, MonadTrans, MonadSample)


      The version of StateT that monad-bayes uses is from the mtl-2.2.2 library.

      We were unable to expand out the generated core of these, as StateT and ListT belong to libraries external to the stack project. Hence, the next steps were to clone the necessary libraries locally and dump their core.

  • Cloned mtl-2.2.2 locally into probabilistic-programming repo. (The local copy of monad-bayes does not yet use the local version of mtl-2.2.2.)
  • Dumped the core of mtl-2.2.2.
  • Tried to clone transformers-0.5.2.0 locally into probabilistic-programming repo, but ran into “duplicate instance declarations” error due to the definitions in the local transformers-0.5.2.0 package clashing with GHC.Base. Currently unsure how to solve this.
  • Discussed/learned about process of handrolling monad transformers in mtl.
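A minimal sketch of what handrolling the Weighted transformer could look like: instead of deriving through StateT (Log Double), the bind threads the weight directly, so there is nothing left for the inliner to miss. Names are hypothetical, and Log Double is simplified to plain Double to keep the sketch self-contained.

```haskell
-- Hand-rolled analogue of Weighted: a computation that threads a
-- running weight through an underlying monad, with the state-passing
-- written out directly rather than derived via StateT.
newtype WeightedH m a = WeightedH { runWeightedH :: Double -> m (a, Double) }

instance Monad m => Functor (WeightedH m) where
  fmap f m = WeightedH $ \w -> do (a, w') <- runWeightedH m w
                                  return (f a, w')

instance Monad m => Applicative (WeightedH m) where
  pure a = WeightedH $ \w -> return (a, w)
  mf <*> ma = WeightedH $ \w -> do (f, w1) <- runWeightedH mf w
                                   (a, w2) <- runWeightedH ma w1
                                   return (f a, w2)

instance Monad m => Monad (WeightedH m) where
  m >>= f = WeightedH $ \w -> do (a, w') <- runWeightedH m w
                                 runWeightedH (f a) w'

-- Scoring multiplies the running weight, as Weighted does
-- (with Log Double in the real library).
score :: Monad m => Double -> WeightedH m ()
score d = WeightedH $ \w -> return ((), w * d)
```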

Activity (MN)     20/11/2020

Meeting (MN + MW + RP)     19/11/2020

  • It’s useful to know whether performance improves asymptotically or only by a constant factor following the first attempt at optimising monad-bayes, e.g. by hand-rolling the inference transformers.
  • In response to the question on using Eff, “how do we turn this into research, rather than simply an application of a different effect system to an existing library?”, we can ask ourselves whether this application is straightforward. If it isn’t straightforward, then there’s probably a research element in it. It is definitely right to try to use what existing libraries/tools are already there. The suspicion is that, along the way, we will find something within these existing approaches that is worth researching, as there will be some problems to be addressed and changes to be made. Until we have exhausted existing options, it would be wrong to start our own. This would be a requirement from reviewers that we are not reinventing the wheel.
  • Eff is not a mature library at all - if we were to make progress understanding it and using it to reimplement the monad-bayes system in that setting, that would already be building towards some sort of contribution, out of which a lot of interesting work is likely to come. Eventually, it could feed into the evolution of the Eff library, or similarly of monad-bayes. There are two or three different directions we could imagine making progress in: one is a better version of monad-bayes or probabilistic programming in Haskell, another is a better version of Eff.
  • The current direction is already definitely research/publishable somewhere, the only question is if it is ICFP etc. We’re in a good position in the sense that we have quite an ambitious goal in mind and we have a plan with multiple paths of achieving that, but we haven’t arrived yet because we don’t know what we’ll end up with. It’s extremely likely that opportunities (for getting side-tracked onto more concrete research) will appear during this process.
  • We have at least two concrete targets at the moment: 1) using Eff to reimplement monad-bayes (sounds like a good thing to pursue); 2) hand-rolling monad-bayes and seeing if there is some kind of asymptotic change in performance - if there is, then we’ve at least identified the problem there.
  • Research almost always starts with a concrete example - once we understand it, then we can see how much we can generalise it.
  • It’s a good sign that we are taking ideas from multiple sources, and indicates that the work is not straightforward. (Taking an idea from one paper and applying it, is quite worrying, whereas being able to show that we have taken ideas from multiple papers is promising.)
  • Good idea to collect a literature review around probabilistic-effects in order to have a more concrete mind-map of research directions, and organize them around how they might contribute to our research going forward.

Meeting (MN + AZ)     19/11/2020

To do in preparation for work session on Sunday (22/11/2020):

  • Clone monad-bayes and link it as a local library by placing it in the same stack project/folder as probabilistic-programming. Then dump the core code for it.
  • Read and write up tutorials on core + get a gist of existing resources around optimising Haskell and what remains unexplored/unwritten about.
  • Familiarise ourselves with hand-rolling monad transformers.

Debugging - Getting dump-core to work (AZ + MN)     16/11/2020

  • Adding the ghc-option and dependencies for dump-core in package.yaml gives us the following error:

  • From this, adding the extra dependency dump-core-0.1.3.2 in stack.yaml gives us the below error. This is a resolver error, i.e. stack cannot resolve the constraints of the packages. Stack works by collecting sets of packages that are known to work together, called LTS (long-term support) snapshots. The constraint is generated by dump-core, which says that the version of base our stack project uses does not match the version of base that dump-core needs, i.e. we need a version of base >=4.9 && <4.13, and the most recent version satisfying this constraint is 4.12.0.

  • Looking at the various lts versions, the resolver version lts-14.1 supports base version 4.12.0, which uses GHC version 8.6.5.

  • Setting resolver to lts-14.1 then gives the following extra dependency error:

  • Introducing these extra dependencies then finally solves the problem:

  • Running stack build now generates a folder called dump-core, containing html files for visualisations of the core Haskell code generated from our program.
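Pulling the steps above together, the resulting configuration looks roughly like the following sketch (the exact extra-deps stack reported are not recorded here, and the plugin flag should be checked against the dump-core README):

```yaml
# stack.yaml
resolver: lts-14.1          # ships base-4.12.0 with GHC 8.6.5
extra-deps:
  - dump-core-0.1.3.2

# package.yaml: enable the plugin when compiling
ghc-options:
  - -fplugin=DumpCore
```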

Debugging - Getting dump-core to generate the desired files (MN)     18/11/2020

  • Given the below configuration of package.yaml, cabal does not generate any executables for the library fields, so running stack build with dump-core will unhelpfully generate a Paths_<package_name>.html file, part of which represents our code found in the src directory.

  • In order to generate core for desired files in directory src, we need to either:

    1. Move Main.hs to src and change the source-dirs field under executables from app to src
    2. Add src to the source-dirs field under executables, and add the dependencies listed under library to the dependencies under executables. The second option is shown below:
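A sketch of what the second option might look like (the package, executable, and dependency names here are placeholders, not the project’s actual configuration):

```yaml
library:
  source-dirs: src
  dependencies:
    - base
    - monad-bayes

executables:
  app-exe:
    main: Main.hs
    source-dirs:
      - app
      - src            # option 2: include src so its modules get core dumps
    dependencies:      # repeat the library's dependencies
      - base
      - monad-bayes
```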

Meeting (MN + AZ + JW)     13/11/2020

Following the previous meeting, this was an extremely helpful discussion (with JW) about how to proceed given the profiler report. Content from this meeting is fully elaborated on here. Briefly summarised, we were introduced to:

  • The ghc-option/dependency dump-core as a means of finding out what core Haskell code is generated from surface-level Haskell code
  • How to understand various core-level syntax
  • The general workflow needed when investigating/identifying inefficiencies in programs
  • Common problematic fragments of code to look out for in the core program.

Thoughts:

  • There is definitely a place for a good tutorial paper on how to do all of this (understanding Haskell core, the relationship between surface-level code and core, and identifying possible optimisations). If we pay strong attention to what we do in the next few weeks and write up our process, this would be a good contribution towards the general question of how to optimise a Haskell program. What we’re about to do is what a lot of people do without any resources. A quick outline of such a paper would be to: 1) emphasize the process of profiling your code and dumping the core of the most heavily used functions, and 2) distill what it is one should be looking out for in the core. This could be a good Haskell Symposium paper.

From this, our next targets are:

  • Compile monad-bayes with the dump-core plugin on, and invest time in figuring out where performance issues lie in the core code.
  • Write the hand-rolled versions of the instances for all the inference transformers of monad-bayes (there’s an incomplete page here on unrolling monad transformers), compare the core for the hand-rolled version and the monad-bayes library itself, and identify what is slow.
  • Look for some resources (e.g. tutorials or talks) on understanding core.
  • Record/write up our workflow and stream of consciousness as we progress on understanding how to optimise monad-bayes using core.
  • Have a work session during weekend on 21/11/2020
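As a starting point for the hand-rolling target, here is a sketch (not monad-bayes itself) of what a fully unrolled Weighted-style transformer might look like, with the StateT wrapper inlined away; the real library accumulates a Log Double, but plain Double is used here to stay self-contained:

```haskell
import Control.Monad (ap, liftM)

-- A Weighted-style transformer with StateT inlined: a computation is
-- a function from the incoming weight to a weighted result in m.
newtype Weighted m a = Weighted { runWeighted :: Double -> m (a, Double) }

instance Monad m => Functor (Weighted m) where
  fmap = liftM

instance Monad m => Applicative (Weighted m) where
  pure a = Weighted $ \w -> pure (a, w)
  (<*>)  = ap

instance Monad m => Monad (Weighted m) where
  -- Hand-rolled bind: thread the weight through explicitly, with no
  -- intermediate StateT constructors left for GHC to eliminate.
  Weighted m >>= k = Weighted $ \w -> do
    (a, w') <- m w
    runWeighted (k a) w'

-- Multiply the accumulated likelihood by a factor (illustrative helper).
score :: Monad m => Double -> Weighted m ()
score p = Weighted $ \w -> pure ((), w * p)
```

For example, `runWeighted (score 0.5 >> score 0.5) 1.0 :: Maybe ((), Double)` evaluates to `Just ((), 0.25)`. Comparing the core of such an instance against the derived one is the experiment proposed above.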

Meeting (MN + AZ)     12/11/2020

Analysing profile report on PMMH inference for a HMM:

The key points of interest are:

  • The (>>=) operation which occurs in Population.hs, accounting for 35% of the total runtime. The source line responsible for this is found in the typeclass derivations of the Population newtype, namely the deriving of the Monad class.

    -- | A collection of weighted samples, or particles.
    newtype Population m a = Population (Weighted (ListT m) a)
      deriving (Functor, Applicative, Monad, MonadIO, MonadSample, MonadCond, MonadInfer)
    
  • The liftA2 operation and the pure operation which both occur in Weighted.hs, accounting for 11.9% and 7.9% of the total runtime respectively. The source line responsible for this is found in the typeclass derivations of the Weighted newtype, namely the deriving of the Applicative class.

    -- | Execute the program using the prior distribution, while accumulating likelihood.
    newtype Weighted m a = Weighted (StateT (Log Double) m a)
      -- StateT is more efficient than WriterT
      deriving (Functor, Applicative, Monad, MonadIO, MonadTrans, MonadSample)
    

These overheads are associated with the inference representations/transformer stack which monad-bayes uses, as opposed to time spent on numerical computation such as with the bernoulli function.

This prompts the question:

  • What definitions of (>>=), liftA2, and pure are generated as a result of using deriving, and why are these inefficient for our case?
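On the first part of that question: with GeneralizedNewtypeDeriving, GHC derives the instance by coercing the underlying StateT instance. A rough sketch (W and bindW are illustrative names, and Double stands in for Log Double):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Data.Coerce (coerce)
import Control.Monad.Trans.State (StateT (..))

newtype W m a = W { unW :: StateT Double m a }

-- Roughly what GeneralizedNewtypeDeriving produces for (>>=): the
-- StateT bind, coerced to operate on the newtype. The wrapper itself
-- is free, but the StateT bind (and its tuple allocation) remains,
-- so any inefficiency lives in the underlying instance.
bindW :: forall m a b. Monad m => W m a -> (a -> W m b) -> W m b
bindW = coerce
  ((>>=) :: StateT Double m a -> (a -> StateT Double m b) -> StateT Double m b)
```

So the question becomes whether GHC manages to optimise through these coercions and the StateT/ListT layers in the dumped core.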

• Discussion following conversation with NW (MH + AZ)     11/11/2020

  • That link looks more like actually writing an effect system, not using an existing one. There is a slight reticence to get involved in actually creating effect systems. However, that direction may potentially be a good place to go in the future; a natural path after using all of these effect systems (of interest) would probably be developing an effect system. AZ is not particularly interested in this at the moment, but it could be of interest for MN.

  • Fusion for free with staging seems like a pretty obviously successful idea. AZ has a vague intuition for how that might work, but not confident enough to see the path forward there (w.r.t what NW is working on - not related to probabilistic-effects)

  • Intuition is that fusion for free is relevant to probabilistic-effects, but Alexis King’s Eff library is the best bet.

  • With respect to MW’s suggestion on a restricted calculus, this needs some thought, but is likely a good/possible direction.

  • The Eff library exists as a fork of GHC - we can just use that fork with the patch and the library. Eff is quite early stage but it looks stable enough for us to use. This is probably the best starting point. We also have the monad-bayes library, so re-implementing it shouldn’t be a huge step, and if it works, it would obviously be the best suggestion. Why rewrite everything in a complex way when you can just move to the same system but better?

  • One of the next best (computationally-related) goals would be to work out how to get Alexis King’s patch and library working. This looks like the patch branch: https://github.com/lexi-lambda/ghc-proposals/tree/delimited-continuation-primops

  • How do we turn this into research, rather than simply an application of a different effect system to an existing library?

  • The research aspect would be comparing the different methods in my opinion. If we’re looking for new things, then maybe the stuff NW is working on? It is possible that just staging it or just using codensity/fusion stuff would count as novel research, given that JW has a paper on staging parsley. Additionally, we believe that case studies/methodology papers like “how to optimise something built on an effect system” are important research, however sometimes academia can be a bit snooty.

  • If we think a case study methodology is too risky (out of a fear of some kind of lack of appreciation), maybe just staging or doing fusion would be best, as that would be novel. Staging is just more involved and would require more novel insight / the path is less charted, whereas reimplementing monad-bayes with Eff would likely be easier. If we feel more inclined to take a case study approach initially, going with Eff would be the best first move. Either option is plausible, but we’ll end up doing Eff either way as it’s the best bet to actually make it faster. Additionally, we’ll be able to reach a milestone faster and feel a bit better about taking other approaches.

  • Optimistically, we could get the Eff content done by the end of the year or early January.

• Slack (MN + NW)     11/11/2020

  • Touched base with NW on project status:
    • Current idea is to take monad-bayes (a slow probabilistic programming library in Haskell) and implement a similar idea using different effect systems, to understand what happens to performance and why - hopefully something interesting will come out of it. Also asked about staging as a means of optimising effect systems (a potential direction for improving monad-bayes).
    • Sounds promising
  • Asked about existing papers/resources about using staging to optimise effect systems in general (with the intention to apply as a possible direction towards improving monad-bayes)
    • NW is working on such a paper; it is most likely a natural combination of the work on staging with MP and the Fusion for Free work.
  • Need to set up meeting with NW when a valuable discussion is to be had about the current state of the project. We are still currently invested in reading.
  • Suggested related paper: https://dl.acm.org/doi/pdf/10.1145/3408975

• Slack (MN + JW)     11/11/2020

  • Does there exist a good paper/resource on how staging can be used to optimise effect systems in general?

  • Nope, it hasn’t been done in general yet - there are people (NW) working on this. Additionally, the staged SOP and the Parsley paper both provide meaningful nuggets about how to use staging for optimisation.

• Scrum Meeting 2 (MW + SF + MN)     10/11/2020

  • Meeting with AZ

  • Visualiser set up for the profiling info

  • Reading up on three different effect system optimisation approaches inc:

    • Alexis King - delimited continuations and the Cont monad.
    • Codensity transformations (Csongor used this in his generic deriving paper)
    • Staging (perhaps see JW’s staged parser combinator paper)
    • Tagless final style
      • SF has read this paper. Their notes: typed tagless final style is a good approach for embedding a DSL.
        • Pros: types are preserved; efficient; doesn’t get stuck; can express pattern matching and non-compositional things; extensible.
        • The heart of this style is adding a type parameter and relying on polymorphism and parameterisation. NW: “it is a precursor to algebraic effects”.
        • Tagless final style and algebraic effects are just implementations within a broader field; it would be better to understand the general approach to embedding properly. Following NW’s advice, SF will now read a few papers that give a broader insight into the field of algebraic effects, to get the more general view instead of focusing on a specific implementation.
        • Suggested papers: “Handlers of Algebraic Effects”, “Programming and Reasoning with Algebraic Effects and Effect Handlers”, “From Theory to Practice of Algebraic Effects and Handlers”. (SF started reading the first, but found it a bit hard due to lack of knowledge in that area.)
  • Plan = Current direction is to implement using three different effect system approaches and then evaluate; hopefully something interesting will come out of this.

  • Suggestion (MW): Just use a minimal calculus for experiments, and then have Monad Bayes as the big example.

  • Suggestion (MW): Evaluate whether the embedding of Monad Bayes' DSL is shallow/deep and how this affects performance; shallow embeddings are naturally less efficient than deep embeddings.

  • Weekly meetings set up between MN & AZ to keep discussions and progress churning. Currently reading with purpose and writing up notes, while also doing some coding/profiling and discussing each week. Progressing well.
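Of the optimisation approaches listed above, the codensity transformation is small enough to sketch directly: representing a monadic value by "what happens when you bind it" re-associates all binds to the right, the standard fix for quadratic left-nested bind chains. A minimal sketch (`rep`/`improve` are illustrative names):

```haskell
{-# LANGUAGE RankNTypes #-}

-- Codensity: represent an m-computation by its behaviour under bind.
-- Binds on Codensity are just function composition, so left-nested
-- chains get re-associated to the right for free.
newtype Codensity m a = Codensity { runCodensity :: forall b. (a -> m b) -> m b }

instance Functor (Codensity m) where
  fmap f (Codensity k) = Codensity (\c -> k (c . f))

instance Applicative (Codensity m) where
  pure a = Codensity ($ a)
  Codensity kf <*> Codensity ka = Codensity (\c -> kf (\f -> ka (c . f)))

instance Monad (Codensity m) where
  Codensity k >>= f = Codensity (\c -> k (\a -> runCodensity (f a) c))

-- Embed an ordinary monadic value, and project back out.
rep :: Monad m => m a -> Codensity m a
rep m = Codensity (m >>=)

improve :: Monad m => Codensity m a -> m a
improve (Codensity k) = k pure
```

For example, `improve (rep [1,2] >>= \x -> rep [x, x + 10])` evaluates to `[1,11,2,12]`, the same as the underlying list computation, but with the binds re-associated.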

• Scrum Meeting 1 (MW + SF + MN)     03/11/2020

  • Can the inference monad transformer stack of a program be inferred/fixed at compile time?

  • Identifying smaller goals and intermediate research ideas and what is valuable to do:

    • What’s the line between papers and a very good blog post? (new things are papers)
    • How can research/observations across monad-bayes be generalised?
    • It is possible to do an empirical evaluation and target software dev audiences
    • The programming languages field lacks proper evaluations; it is possible to question existing “folklore” with new observed results
    • It is important to know what the benchmarking tool can and can’t do - when experimenting we must always ask what the tool can do for us. The goal is guided by what we can measure.
Last updated on 13 Nov 2020
Published on 13 Nov 2020