Thursday, February 17, 2011

binary by the numbers

New home

I've taken the liberty of forking the darcs repo of binary and putting it on GitHub. It contains both the latest released version and the new experimental continuation-based Get monad, in a branch called cps.

git clone git://


It's interesting to run binary's benchmark on different architectures and with different versions of GHC. Although there has recently been community work on fast writing (blaze-builder comes to mind), I've mostly been working on how to read things fast.

The classic binary implementation of the Get monad is a state monad, while the new experimental version is continuation based, so the two are fundamentally different. They also perform differently. To produce the numbers below I ran binary's benchmark suite. It reads Word8, Word16, Word32 or Word64 values in a (hopefully) tight loop and then reports how fast it could do it. For example, see this graph of performance in a 32-bit environment;
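The inner loop of such a read benchmark looks roughly like this. This is a minimal sketch written against binary's public Data.Binary.Get API, not the actual benchmark suite:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Minimal sketch of a tight read loop (not the actual benchmark suite):
-- read n big-endian Word16 values with the classic Get monad and sum them.
module Main where

import Data.Binary.Get (Get, runGet, getWord16be)
import qualified Data.ByteString.Lazy as L
import Data.Word (Word16)

sumWord16s :: Int -> Get Word16
sumWord16s n = go n 0
  where
    go 0 !acc = return acc
    go k !acc = do
      w <- getWord16be
      go (k - 1) (acc + w)

main :: IO ()
main = do
  let n  = 1000
      bs = L.pack (concat (replicate n [0, 1])) -- n copies of 0x0001
  print (runGet (sumWord16s n) bs) -- prints 1000
```

The strict accumulator (via BangPatterns) matters: without it the loop builds up thunks and the benchmark measures allocation rather than parsing speed.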

The good news is that GHC 7.0.1 consistently performs better than GHC 6.12.3. Also, the experimental cps branch (the wide green line) is faster than the classic master branch.

Things seem to be going well in 32-bit land. Let's have a look at a 64-bit environment;

This gives a different picture. GHC 7.0.1 still performs better than GHC 6.12.3, but we can also see that the cps branch can't keep up with the state-monad-based master branch (in contrast to when compiling for 32 bits). Future work includes figuring out why, and how to fix it.

Let's have a look at how binary performs at writing too;

Benchmark Environment

The tests were performed on a Sandy Bridge CPU using GHC's native code generator. I wanted to try the LLVM backend too, but unfortunately LLVM crashes when compiling the benchmark executable.

Sunday, January 9, 2011

About binary

About a year ago I started hacking on a project to change the Get monad of the binary package. I've been working on it off and on since... :) The change will allow us to feed the Get monad input incrementally, as well as allow backtracking and more graceful error handling. The code was public, but got moved in the infrastructure changes. It's now available again;
darcs get
Johan Tibell writes about the API update in his recent blog post. Developers familiar with attoparsec will largely feel at home with the new binary Get monad too, as it's been a heavy influence. The type for the parse function is essentially something like this:
data Result r =
    Fail ByteString [ByteString] -- an error msg and a trace
  | Partial (ByteString -> Result r) -- incremental input
  | Done r -- finished!
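A consumer of this type loops on Partial, feeding chunks as they arrive. Here is a self-contained sketch; the feedAll driver and the toy need parser are my own illustration, not part of binary's API:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- The Result type from above, with strict ByteStrings.
data Result r =
    Fail B.ByteString [B.ByteString]   -- an error msg and a trace
  | Partial (B.ByteString -> Result r) -- incremental input
  | Done r                             -- finished!

-- Hypothetical driver: push chunks into a Partial until the parser
-- finishes; an empty chunk signals end of input.
feedAll :: Result r -> [B.ByteString] -> Either String r
feedAll (Done r)     _      = Right r
feedAll (Fail msg _) _      = Left (C.unpack msg)
feedAll (Partial k)  (c:cs) = feedAll (k c) cs
feedAll (Partial k)  []     =
  case k B.empty of
    Partial _ -> Left "not enough input"
    r         -> feedAll r []

-- Toy parser producing Results directly: collect n bytes.
need :: Int -> B.ByteString -> Result B.ByteString
need n acc
  | B.length acc >= n = Done (B.take n acc)
  | otherwise         = Partial $ \c ->
      if B.null c
        then Fail (C.pack "unexpected end of input") []
        else need n (B.append acc c)

main :: IO ()
main = print (feedAll (need 4 B.empty) [C.pack "ab", C.pack "cd"])
```

The point of the shape is that the parser never blocks on IO: whenever it runs out of bytes it hands control back to the caller via Partial, and the caller decides where the next chunk comes from.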
A few forks of binary try to address this too, each in their own way. Currently I know of cereal and binary-strict.


When benchmarking cereal against the new binary package, cereal comes out on top, at the expense of not being able to consume the input in incremental chunks. I couldn't find the benchmark suite for binary-strict. The reason cereal is faster, I think, is that a simpler state makes for simpler code: its state is essentially just a single strict ByteString (the input). In binary (and attoparsec) it's a bit more complicated, due to the incremental input (in combination with supporting MonadPlus):
data S =
  S { input      :: !B.ByteString -- the current input chunk
    , next_input :: !B.ByteString -- saved input to be used when backtracking
    , read_all   :: !Bool -- have we requested all input available to parse?
    } deriving Show

newtype Get a =
  C { runCont :: forall r.
                 S ->
                 Failure   r ->
                 Success a r ->
                 Result    r }

type Failure   r = S -> [String] -> String -> Result r
type Success a r = S -> a -> Result r

bindG :: Get a -> (a -> Get b) -> Get b
bindG (C c) f = C $ \st0 kf ks -> c st0 kf (\st1 a -> runCont (f a) st1 kf ks)
Unfortunately, this results in bad performance. I'm guessing it's because it reconstructs the state value (of type S), and the remaining ByteString input, for each value consumed, causing a lot of allocation. So, as an experiment, I manually unpacked S;
-- No longer using S
data S = S { input      :: !B.ByteString
           , next_input :: !B.ByteString
           , read_all   :: !Bool
           } deriving Show

newtype Get a =
  C { runCont :: forall r.
                 -- these three were part of S, now they are separate arguments
                 B.ByteString -> -- 1
                 B.ByteString -> -- 2
                 Bool ->         -- 3
                 Failure   r ->
                 Success a r ->
                 Result    r }

type Failure   r = B.ByteString -> B.ByteString -> Bool -> [String] -> String -> Result r
type Success a r = B.ByteString -> B.ByteString -> Bool -> a -> Result r

bindG :: Get a -> (a -> Get b) -> Get b
bindG (C c) f = C $ \inp next eof kf ks ->
                             c inp next eof kf
                               (\inp' next' eof' a -> runCont (f a) inp' next' eof' kf ks)
With ghc-7, this yields a huge speed boost, reaching about half the speed of the old binary library (and ~30% faster than cereal). Unfortunately, I find the code less readable and harder to maintain. Maybe it's worth it, though. I got a hint to look at GHC ticket #1349. Duncan Coutts summarizes the issue, this time for the Put monad. There are a lot of details and examples in those mails, suggesting an extension to GHC to control strictness in the arguments of higher-order functions. It would allow us to write the prettier version while enjoying the nicer properties of the unpacked version. It's unlikely that we'll see the proposal implemented soon, though. It seems there are four options;
  • Go with the manually unpacked code
  • Drop support for backtracking
  • Drop support for incremental input
  • Find something even better, you're all invited :)
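To make the first option concrete, here is a self-contained toy version of the manually unpacked design. The getWord8 primitive, the returnG/runToy functions, and the simplified Result and Failure types (I drop the Partial constructor and the error trace for brevity) are my own sketch, not binary's actual code:

```haskell
{-# LANGUAGE RankNTypes #-}
-- Toy version of the unpacked CPS Get: the three fields of S are
-- threaded as separate arguments instead of one record.
module Main where

import qualified Data.ByteString as B
import Data.Word (Word8)

data Result r = Fail String | Done r
  deriving Show

newtype Get a =
  C { runCont :: forall r.
                 B.ByteString -> -- current input chunk
                 B.ByteString -> -- saved input for backtracking
                 Bool ->         -- have we requested all input?
                 Failure   r ->
                 Success a r ->
                 Result    r }

type Failure   r = B.ByteString -> B.ByteString -> Bool -> String -> Result r
type Success a r = B.ByteString -> B.ByteString -> Bool -> a -> Result r

returnG :: a -> Get a
returnG a = C $ \inp next eof _kf ks -> ks inp next eof a

bindG :: Get a -> (a -> Get b) -> Get b
bindG (C c) f = C $ \inp next eof kf ks ->
  c inp next eof kf
    (\inp' next' eof' a -> runCont (f a) inp' next' eof' kf ks)

-- Read one byte from the current chunk (no refill in this toy).
getWord8 :: Get Word8
getWord8 = C $ \inp next eof kf ks ->
  case B.uncons inp of
    Just (w, rest) -> ks rest next eof w
    Nothing        -> kf inp next eof "demanded byte at end of input"

runToy :: Get a -> B.ByteString -> Result a
runToy (C c) bs =
  c bs B.empty True
    (\_ _ _ msg -> Fail msg)
    (\_ _ _ a   -> Done a)

main :: IO ()
main = print (runToy (getWord8 `bindG` \x ->
                      getWord8 `bindG` \y ->
                      returnG (x + y))
                     (B.pack [1, 2, 3])) -- prints Done 3
```

Every continuation receives the three pieces of state as plain arguments, which is exactly what makes the code noisier, and also what lets GHC keep them unboxed in registers instead of reboxing an S on every bind.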