GSoC weekly report #12
This is my final GSoC report. Sambamba improved significantly over the last two weeks. Several bugs were found and fixed, and some new functionality was added.
Pipeline support
The Sambamba library can now work with arbitrary streams, including non-seekable ones (random access is, of course, unavailable in that case). However, I haven’t yet figured out how to handle the ‘-’ command-line parameter with the std.getopt module, so please use /dev/stdin and /dev/stdout for the time being.
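To illustrate the workaround, here is a minimal Python sketch (not Sambamba's D code) of how a ‘-’ argument could be mapped to /dev/stdin or /dev/stdout, plus a reader that consumes a stream strictly front to back, which is all a non-seekable stream allows. The names resolve_stream_path and count_bytes_sequentially are hypothetical and exist only for this example.

```python
import io


def resolve_stream_path(arg: str, writing: bool = False) -> str:
    """Map the conventional '-' argument to a device path (hypothetical helper).

    Any other argument is returned unchanged and treated as a regular file name.
    """
    if arg == "-":
        return "/dev/stdout" if writing else "/dev/stdin"
    return arg


def count_bytes_sequentially(stream, chunk_size: int = 65536) -> int:
    """Consume a binary stream in order, never calling seek() or tell()."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes object means end of stream
            break
        total += len(chunk)
    return total


if __name__ == "__main__":
    # An in-memory stream stands in for a pipe here.
    print(resolve_stream_path("-"))
    print(count_bytes_sequentially(io.BytesIO(b"\x00" * 100000)))
```

The same sequential-read loop works for a pipe opened from /dev/stdin, since it only relies on read().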
MessagePack output
It seems I had underestimated the performance boost it can bring :)
I measured the time it takes to loop through all alignments in a 112MB BAM file and got the following results (on Pjotr’s 8-core box):
- MRI Ruby, JSON input with ‘oj’ gem — 26.0s real, 29.2s user
- MRI Ruby, MsgPack input — 6.7s real, 9.3s user
- Python, JSON input with ‘ujson’ — 17.8s real, 21.0s user
- Python, MsgPack input — 3.5s real, 7.3s user
For comparison, looping through the file with PySAM (fetch() method) takes 3.4s real/3.4s user, i.e. with MsgPack output, multithreaded execution compensates for the additional serialization/deserialization steps.
I’m pretty sure output speed can be improved further; I haven’t tuned it thoroughly yet.
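The arithmetic behind these numbers can be spelled out in a short Python sketch. It only uses the timings listed above; the dictionary layout and function names are made up for this example. A user/real ratio above 1 indicates that more than one core was busy, which is where the multithreaded producer shows up.

```python
# (real, user) wall-clock and CPU seconds, copied from the list above.
TIMINGS = {
    ("ruby", "json"): (26.0, 29.2),
    ("ruby", "msgpack"): (6.7, 9.3),
    ("python", "json"): (17.8, 21.0),
    ("python", "msgpack"): (3.5, 7.3),
}


def real_time_speedup(language: str) -> float:
    """How many times faster (wall-clock) MsgPack input is than JSON input."""
    json_real, _ = TIMINGS[(language, "json")]
    msgpack_real, _ = TIMINGS[(language, "msgpack")]
    return json_real / msgpack_real


def cpu_to_wall_ratio(language: str, fmt: str) -> float:
    """User time divided by real time; values above 1 mean several cores were used."""
    real, user = TIMINGS[(language, fmt)]
    return user / real


if __name__ == "__main__":
    for lang in ("ruby", "python"):
        print(f"{lang}: {real_time_speedup(lang):.1f}x faster with MsgPack, "
              f"user/real = {cpu_to_wall_ratio(lang, 'msgpack'):.2f}")
```

With these figures, MsgPack input is roughly 3.9 times faster than JSON in Ruby and roughly 5.1 times faster in Python, measured by real time.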
Ruby gem update
It now works with the MsgPack output format, and exception handling is improved. It’s not yet on Travis CI, however, and works only with Ruby 1.9, because I use Open3 to check stderr and throw meaningful exceptions.
I’ve also set up Travis CI hooks. MRI 1.9.2 and 1.9.3 pass tests. However, I can’t get JRuby working there due to the same issues as with BioRuby, something related to popen/popen3. But JRuby in 1.9 mode works fine, trust me :)
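The general pattern of capturing a child process's stderr and turning it into an exception can be sketched as follows. This is a Python analogue using subprocess, not the gem's actual Ruby/Open3 code; ToolError and run_tool are hypothetical names, and treating a non-zero exit status as the failure signal is an assumption of this sketch.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Hypothetical exception carrying the child process's stderr text."""


def run_tool(argv):
    """Run a command, return its stdout bytes, raise ToolError on failure."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode != 0:
        # Surface whatever the tool printed to stderr as the exception message.
        message = proc.stderr.decode("utf-8", errors="replace").strip()
        raise ToolError(message or f"command exited with status {proc.returncode}")
    return proc.stdout


if __name__ == "__main__":
    # A tiny child process that fails with a message stands in for a real tool.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('bad header'); sys.exit(2)"]
    try:
        run_tool(failing)
    except ToolError as err:
        print("caught:", err)
```

The caller then sees the tool's own error text instead of a bare exit status.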
Galaxy toolshed wrapper
I’ve somehow managed to make a wrapper for the filtering tool: http://toolshed.g2.bx.psu.edu/repos/lomereiter/sambamba_filter
However, I can’t say that I like Galaxy. Using the command line is way faster than waiting for some slow Python engine to do the same job. Well, I’m just not the kind of person Galaxy was created for :)
Future plans
The next step is making a pileup engine. That’s the most essential part my tools currently lack. For the design I’ll take ideas from PySAM and bio-alignment gem. (It also involves using statistics for variant calls and is thus related to my studies.) Hopefully, I’ll then get more feedback.
Another direction for further development is making a decent validation tool. I’m sure I can make a better one than ValidateSamFile from Picard (more flexible and faster), but at the moment motivation is lacking.
Conclusion
I’ve learned quite a bit about bioinformatics. This is definitely a very interesting field, and the amounts of data to be analyzed are growing rapidly. I won’t be surprised if BAM is replaced by CRAM or some other format in a few years, which would render my library useless. Nevertheless, I’ve now gained good experience in writing libraries of this sort and in tuning their performance. That will surely be of great help in the future.
Thank you, Google and the Open Bioinformatics Foundation :)
>For the design I’ll take ideas from PySAM and bio-alignment gem.
There are tests in bio-pileup_iterator that may be of use too.
Thanks, I’ll take a look. At least, they might provide some insights for understanding pileup format.
np. Do you know if there’s a proper pileup format spec somewhere?
I doubt you can find anything more decent than the description on the samtools homepage (http://samtools.sourceforge.net/pileup.shtml)
Alrighty. The part of the spec I least understand is the ‘=’ sign in the second-to-last field. I have the SAM file so could look it up, but got lazy.
https://github.com/wwood/bioruby-pileup_iterator/blob/master/test/test_bio-pileup_iterator.rb#L221
Good work with sambamba, by the way. I’ll be using it.