Archive

Archive for August, 2012

GSoC weekly report #12

August 20, 2012 5 comments

This is my final GSoC report. Sambamba improved significantly over the last two weeks. Several bugs were found and fixed, and some new functionality was added.

Pipeline support

Sambamba library can now work with arbitrary streams, including non-seekable ones (of course, random access is out in this case). However, I haven’t yet figured out how to deal with ‘-‘ command-line parameters using std.getopt module, so please use /dev/stdin and /dev/stdout for the time being.

MessagePack output

Seems like earlier I’ve underestimated the performance boost it can bring :)

I’ve measured time of looping through all alignments from a 112MB BAM file, and got the following results (on Pjotr’s 8-core box):

  • MRI Ruby, JSON input with ‘oj’ gem — 26.0s real, 29.2s user
  • MRI Ruby, MsgPack input — 6.7s real, 9.3s user
  • Python, JSON input with ‘ujson’ — 17.8s real, 21.0s user
  • Python, MsgPack input — 3.5s real, 7.3s user

Just for comparison, time of looping through the file with PySAM (fetch() method) is 3.4s real/3.4s user, i.e. in case of MsgPack output multithreaded execution compensates for additional serialization/deserialization steps.

I’m pretty sure output speed can be further improved—I didn’t yet tweak it thoroughly.

Ruby gem update

Now it works with MsgPack output format, also exception handling is improved. It’s not yet on Travis CI, however, and works only with Ruby 1.9 — I use Open3 for checking stderr to throw meaningful exceptions.

I’ve also set up Travis CI hooks. MRI 1.9.2 and 1.9.3 pass tests. However, I can’t get JRuby working there due to the same issues as with BioRuby—something related to popen/popen3. But JRuby in 1.9 mode works fine, trust me :)

Galaxy toolshed wrapper

I’ve managed to somehow make a wrapper for filtering tool: http://toolshed.g2.bx.psu.edu/repos/lomereiter/sambamba_filter

However, I can’t say that I like Galaxy. Using command-line is way faster than waiting for some slow Python engine to do the same job. Well, I’m just not that kind of person Galaxy was created for :)

Future plans

Next step is making a pileup engine. That’s the most essential part which is lacking in my tools now. For the design I’ll take ideas from PySAM and bio-alignment gem. (Also it involves using statistics for variant calls and thus is related to my studies.) Hopefully, then I’ll get more feedback.

Another direction of further development is making a decent validation tool. I’m sure I can make a better one that ValidateSamFile from Picard (more flexible and faster) but at the moment motivation is lacking.

Conclusion

I’ve learned quite a bit about bioinformatics. This is definitely a very interesting field, and amounts of data to be analyzed are growing rapidly. I won’t be surprised if BAM will be replaced by CRAM or some another format in a few years, and that will render my library useless. Nevertheless, now I’ve got a good experience of writing libraries of that sort, and tweaking the performance. That will surely be of great help in the future.

Thank you Google and Open Bioinformatics Foundation :)

Categories: Uncategorized Tags:

GSoC weekly report #11

August 6, 2012 Leave a comment

BAM output in viewing tool

For a long time, sambamba view had no option to output BAM files. That was related to awful quality of its code, which is now no longer the case. (Looking at the code in its current state and to the ‘Head First Design Patterns’ book, seems I’ve applied Template Method pattern.)

Other refactoring

A lot more things should be refactored, of course. E.g. all the tools now have hard-coded functionality in them (merging, sorting, indexing) while in principle, what one would like to see is methods ‘sortBy!Coordinate’, ‘index’ for BamFile, and moving merge functionality to MultiBamFile struct providing the same methods as BamFile. But that’s something I’ll work on after GSoC, not now.

Ruby DSL for selecting reads

I’ve described it in a Cucumber feature, a simple example of what it’s capable of (in fact, the following code just calls the command-line tool with –count and –filter=… parameters):


puts bam.alignments
        .referencing('20')
        .overlapping(100.Kbp .. 150.Kbp)
        .select {
            flag.first_of_pair.is_set
            flag.proper_pair.is_set
            sequence_length >= 80
            tag(:NM) == 0
        }.count

Man pages

I went down the same road as Clayton and created man pages with Ronn. For those who’re not yet aware, it’s a tool written in Ruby which allows to make both HTML and man output from files with Markdown syntax. Very convenient :)

Script for creating Debian packages

Also I wrote a Ruby script today to easily create debian packages for both i386 and amd64, and put them to Github downloads. Hopefully, now that I’ve done it I’ll make releases more often.

Future plans

We’ve agreed on importancy of having tools used by more people, and the decision is to put them on Galaxy Tool Shed. That should help in popularizing sambamba, my tool has some cool functionality to offer :)

I had a look at what’s available on Tool Shed, with special interest of what stuff for dealing with SAM is presented there. I’m planning to add an example of filtering because this is one of areas where sambamba really shines, and maybe also indexing.

After that I’ll start optimizing my D and Ruby code, and also reducing memory consumption. Perhaps, I’ll spend on that all remaining time. If not, I have a lot of other stuff to do, like stdin/stdout support.

Categories: Uncategorized Tags: