Archive for October, 2013

Performance of Sambamba MRI bindings

October 27, 2013

So yesterday I hacked up a simple BAM reader as a C extension. For now, the BamRead class in Ruby only exposes a ‘name’ method, just for the sake of not being totally useless. After all, what I’m interested in is the cost of converting data from D to Ruby, and with careful coding that cost is not that large. (What I love about C extensions is full control over what happens.)

In fact, on multicore systems it easily beats PySAM. This simple test counts reads and computes the average read name length; the file contains Ion Torrent 200bp data and is 570MB in size.
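For reference, the computation boils down to something like the sketch below. This is not the actual test.py, which isn't shown here: the function name `name_stats` and the sample names are hypothetical, and the input is assumed to be any iterable of read name strings (with PySAM, the names of the reads iterated from the opened BAM file).

```python
def name_stats(names):
    """Return (read count, average name length) for an iterable of read names."""
    count = 0
    total_length = 0
    for name in names:
        count += 1
        total_length += len(name)
    # Guard against an empty input to avoid dividing by zero.
    # float() keeps true division on Python 2 as well as Python 3.
    average = float(total_length) / count if count else 0.0
    return count, average


if __name__ == "__main__":
    # Hypothetical read names standing in for the records of a BAM file.
    sample = ["READ:00001:00012", "READ:00001:00345", "READ:2:7"]
    count, average = name_stats(sample)
    print(count)
    print(average)
```

Streaming over the names keeps memory use flat regardless of file size, so what the benchmark measures is mostly the per-read cost of handing each record to the scripting language.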

Python 2.7.3 + PySAM
$ time python test.py
3186621
14.2702508394

real    0m13.034s
user    0m12.588s
sys     0m0.412s

Ruby 2.1.0-preview1
$ time LD_LIBRARY_PATH=. ruby test.rb
3186621
14.270250839368723

real    0m4.820s
user    0m16.756s
sys     0m0.700s
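To make the comparison explicit, here is a small sketch that does the arithmetic on the timings above. The helper names `speedup` and `cpu_utilization` are made up for illustration; the numbers are copied from the two `time` outputs.

```python
def speedup(baseline_real, candidate_real):
    """How many times faster the candidate is in wall-clock terms."""
    return baseline_real / candidate_real


def cpu_utilization(real, user, sys_time):
    """Average number of cores kept busy: CPU time divided by wall-clock time."""
    return (user + sys_time) / real


# Timings in seconds, taken from the `time` outputs above.
pysam = {"real": 13.034, "user": 12.588, "sys_time": 0.412}
ruby = {"real": 4.820, "user": 16.756, "sys_time": 0.700}

if __name__ == "__main__":
    print("wall-clock speedup: %.2fx" % speedup(pysam["real"], ruby["real"]))
    print("PySAM cores busy:   %.2f" % cpu_utilization(**pysam))
    print("Ruby cores busy:    %.2f" % cpu_utilization(**ruby))
```

The Ruby run finishes about 2.7 times faster in wall-clock time while keeping roughly 3.6 cores busy, against about 1 core for PySAM. Its total CPU time is actually higher (about 17.5s versus 13.0s), which is why the advantage only shows up on multicore systems.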

However, dynamic loading of my D library sometimes just hangs because of a deadlock =\ I currently use DMD 2.063, where support for shared libraries is not official, because 2.064 has an issue with zlib that I reported to the bug tracker. Hopefully that will be resolved soon.

Categories: Uncategorized

Ruby: FFI or not FFI?

October 26, 2013

I started making Ruby bindings for my SAM/BAM library, and it’s not at all clear whether to use FFI or a good old C extension for MRI.

To give you the full picture: I’m going to support only Linux and Mac OS X and distribute binary packages, because that’s the easiest option given the current state of the DMD compiler infrastructure. Unlike GCC, DMD is not available by default on almost every system.

One factor is convenience. By that word, FFI proponents usually mean that they are too lazy to sit down and write some C code. But hey, since I’m going to distribute binary packages only, I can just use Rice, which should be much easier.

Another important factor is interoperability. Well, I did some simple benchmarks and discovered that in MRI the speed sucks if I use FFI, and in JRuby simply calling the Picard library gives the same or better performance (provided, of course, that you’re aware of the flags --server and -Xji.objectProxyCache=false). But more importantly, on JRuby the overhead with either FFI or Java Integration is huge =\ Namely, counting reads in a 2.5GB file took about 2 minutes, and adding computation of the average read name length added another 2 minutes, for a total of 4! For comparison, it took only 2 minutes using PySAM, and that is the baseline to match.

My conclusions from this are:
1) The JVM is not well suited for dynamic languages, and this opinion is supported by JRuby developers. Hopefully, Topaz will mature eventually.
2) The overhead of FFI is too substantial to ignore in my particular case, where we want to work with lots of short reads.
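The per-call cost behind conclusion 2 is easy to get a feel for. The sketch below is only an analogy, not the Ruby benchmark itself: it uses Python's ctypes as the foreign function interface, calls the C library's `strlen` on a made-up read name, and compares that with the built-in `len`. It assumes Linux or Mac OS X, where `ctypes.CDLL(None)` exposes the symbols already loaded into the process; the helper names are hypothetical.

```python
import ctypes
import timeit

# dlopen(NULL): symbols already loaded into this process, including libc.
# Assumption: Linux or Mac OS X; this does not work on Windows.
libc = ctypes.CDLL(None)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def ffi_name_length(name):
    """Length of a bytes read name, computed by crossing into C via ctypes."""
    return libc.strlen(name)


def per_call_seconds(func, arg, calls=100000):
    """Average wall-clock seconds per call of func(arg)."""
    return timeit.timeit(lambda: func(arg), number=calls) / calls


if __name__ == "__main__":
    name = b"HYPOTHETICAL:READ:NAME"  # made-up read name
    native = per_call_seconds(len, name)
    foreign = per_call_seconds(ffi_name_length, name)
    print("built-in len:  %.3g s per call" % native)
    print("ctypes strlen: %.3g s per call" % foreign)
```

The absolute numbers depend on the machine and interpreter, so none are quoted here. The point is that any fixed cost per foreign call gets multiplied by the number of reads, and with millions of short reads per file that adds up quickly.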

So, I will go with Rice. The bonus is that I will have to write C++ classes wrapping the D functionality, which could theoretically also be used for a CPython extension via Boost.Python.

EDIT: I compared bindings generated with Rice and SWIG, and the latter wins. So the full chain is D -> C bindings -> C++ wrapper -> SWIG wrapper -> Ruby.

EDIT2: After playing with SWIG for a while, I realized that I want ultimate control over what’s going on, and finally decided to write the extension in good old C.

Categories: Uncategorized