Home > Uncategorized > GSoC weekly report #12

GSoC weekly report #12

This is my final GSoC report. Sambamba improved significantly over the last two weeks. Several bugs were found and fixed, and some new functionality was added.

Pipeline support

Sambamba library can now work with arbitrary streams, including non-seekable ones (of course, random access is out in this case). However, I haven’t yet figured out how to deal with ‘-’ command-line parameters using std.getopt module, so please use /dev/stdin and /dev/stdout for the time being.

MessagePack output

Seems like earlier I’ve underestimated the performance boost it can bring :)

I’ve measured time of looping through all alignments from a 112MB BAM file, and got the following results (on Pjotr’s 8-core box):

  • MRI Ruby, JSON input with ‘oj’ gem — 26.0s real, 29.2s user
  • MRI Ruby, MsgPack input — 6.7s real, 9.3s user
  • Python, JSON input with ‘ujson’ — 17.8s real, 21.0s user
  • Python, MsgPack input — 3.5s real, 7.3s user

Just for comparison, time of looping through the file with PySAM (fetch() method) is 3.4s real/3.4s user, i.e. in case of MsgPack output multithreaded execution compensates for additional serialization/deserialization steps.

I’m pretty sure output speed can be further improved—I didn’t yet tweak it thoroughly.

Ruby gem update

Now it works with MsgPack output format, also exception handling is improved. It’s not yet on Travis CI, however, and works only with Ruby 1.9 — I use Open3 for checking stderr to throw meaningful exceptions.

I’ve also set up Travis CI hooks. MRI 1.9.2 and 1.9.3 pass tests. However, I can’t get JRuby working there due to the same issues as with BioRuby—something related to popen/popen3. But JRuby in 1.9 mode works fine, trust me :)

Galaxy toolshed wrapper

I’ve managed to somehow make a wrapper for filtering tool: http://toolshed.g2.bx.psu.edu/repos/lomereiter/sambamba_filter

However, I can’t say that I like Galaxy. Using command-line is way faster than waiting for some slow Python engine to do the same job. Well, I’m just not that kind of person Galaxy was created for :)

Future plans

Next step is making a pileup engine. That’s the most essential part which is lacking in my tools now. For the design I’ll take ideas from PySAM and bio-alignment gem. (Also it involves using statistics for variant calls and thus is related to my studies.) Hopefully, then I’ll get more feedback.

Another direction of further development is making a decent validation tool. I’m sure I can make a better one that ValidateSamFile from Picard (more flexible and faster) but at the moment motivation is lacking.

Conclusion

I’ve learned quite a bit about bioinformatics. This is definitely a very interesting field, and amounts of data to be analyzed are growing rapidly. I won’t be surprised if BAM will be replaced by CRAM or some another format in a few years, and that will render my library useless. Nevertheless, now I’ve got a good experience of writing libraries of that sort, and tweaking the performance. That will surely be of great help in the future.

Thank you Google and Open Bioinformatics Foundation :)

About these ads
Categories: Uncategorized Tags:
  1. Ben Woodcroft
    August 21, 2012 at 6:12 am | #1

    >For the design I’ll take ideas from PySAM and bio-alignment gem.

    There’s tests in bio-pileup_iterator that may be of use too.

    • August 21, 2012 at 10:05 am | #2

      Thanks, I’ll take a look. At least, they might provide some insights for understanding pileup format.

      • Ben Woodcroft
        August 21, 2012 at 10:09 am | #3

        np. Is there a proper pileup format spec somewhere do you know?

  2. August 21, 2012 at 10:22 am | #4

    I doubt you can find anything more decent than the description at samtools homepage (http://samtools.sourceforge.net/pileup.shtml)

  3. Ben Woodcroft
    August 21, 2012 at 10:31 am | #5

    Alrighty. The part of the spec I least understand is the ‘=’ sign in the second last field. I have the sam file so could look it up, but got lazy.

    https://github.com/wwood/bioruby-pileup_iterator/blob/master/test/test_bio-pileup_iterator.rb#L221

    Good work with sambamba, by the way. I’ll be using it.

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: