Archive

Archive for July, 2012

GSoC weekly report #10

July 24, 2012 Leave a comment

Filtering

Now sambamba supports custom filter expressions for specifying which alignments to process.

Example:

sambamba view mydata.bam -F "mapping_quality >= 60 and [MQ] >= 60 and mate_is_reverse_strand and [RG] == 'ERR016155' and sequence =~ /^ACTG/ and ([OP] != null or [OC] != null)" -c

As you can see, filter query syntax supports:

  • integer/string comparison for fields/tags
  • regex match operator
  • tag existence condition
  • and, or, not logical operators

Full description is available at sambamba wiki.

I also began working on providing Ruby wrapper for this syntax. After reading some pieces of Martin Fowler’ “Domain specific languages” book, I have appreciated the power of ‘instance_eval’ :-)

As of now, it looks like this (not pushed to the repo yet):

filter = Bio::Bam::AlignmentFilter.new {
  ref_id == 19

  flag.unmapped.is_unset
  flag.mate_is_unmapped.is_unset

  tag(:RG) =~ /ERR001*/i

  tag(:NM) > 0

  union {

    intersection {
      mapping_quality >= 40
      mapping_quality < 60
    }

    mapping_quality.is_unknown

  }
}

Looks kinda cool, but I’m still thinking of ways to reduce amount of dots and use spaces instead :-)

Progressbar

For me, it’s annoying to not have any progress indicator. In those cases where file size is known beforehand, it’s relatively easy to calculate current progress. The only issue is design.

What I came up with is as follows: the user can provide optional compile-time argument — a function with one argument, percentage. An important observation mentioned in code comments is that as float computations are not cheap, the user can make this argument lazy (http://blackwhale.github.com/lazy-evaluation.html) and compute it only, say, each 1000 alignments.

The working example is sambamba-index tool, where wget-like progressbar is outputted to STDERR. In the next few days I’ll add this functionality to the other utilities as well.

Advertisements
Categories: Uncategorized Tags:

GSoC weekly report #9

July 17, 2012 Leave a comment

Sorting

I’ve added a few more options to command line tool, namely:

  • setting compression level of sorted BAM file
  • sorting by read name instead of coordinate
  • outputting chunks as uncompressed BAM

In August, I’ll look again at how the performance can be improved, though it’s almost OK now.

Merging

Somehow this turned to be harder than sorting. Of course, there’s nothing hard in merging alignments, the hard part is really merging SAM headers. Samtools seems to don’t do that at all; Picard does a better job, and I had to read a lot of Java code to understand what’s going on. Finally, I wrote D version of it, and it was 2x shorter. Then I’ve improved it a bit, because Picard uses pretty dumb algorithm for merging sequence dictionaries.

Consider the following three SAM headers:

@HD    VN:1.3    SO:coordinate
@SQ    SN:A    LN:100
@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100
@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100
@SQ    SN:A    LN:100

Surely they are mergeable, but in this particular order Picard can’t do that! That’s because it merges sequence dictionaries in turn: merge first two, then merge the result with the third. So before merging with the last header, it assumes the order to be A < B, and it can’t be changed later in the algorithm. But the order must be consistent among headers in order to merge files sorted by coordinate (because they are first sorted by reference ID), and the last header implies that B < A.

Thus the pairwise approach taken by Picard didn’t satisfy me. What I do in sambamba is first build a directed graph out of all sequence dictionaries where edges indicate the order, and then do a topological sorting on it. That allows to process arbitrarily complex cases. (Perhaps, I should change implementation to use DFS instead of BFS for the result to look a bit more intuitive.)

Debian packages

Also I prepared debian packages for both amd64 and i386 architectures. That’s my first experience of building packages, but seems like everything works, I’ve tested them on live USBs with Debian (amd64) and Ubuntu 10.04 (i386).

They’re available at http://github.com/lomereiter/sambamba/downloads

For packaging, I had to follow samtools approach and make one executable instead of several, that’s because binaries are statically linked with libphobos and it’s better to have it copied only once. However, the old way of building tools separately is also available, that’s easier for development.

As the tool called ‘sambamba’ in the past is now ‘sambamba-view’ I also had to tweak Ruby bindings a bit, so I’ve published bioruby-sambamba gem 0.0.2 — no new functionality, just a minor update taking the renaming into account.

Categories: Uncategorized Tags:

GSoC weekly report #8

July 10, 2012 Leave a comment

Indexing

Last week I implemented building BAI index for BAM files. In order to understand how indexing works I had to read source code of BamTools and Picard (samtools code is unreadable to me).

Finally I managed to get it working, and it exploits parallelism so that time of building index is roughly equal to the time of reading the file. I did some experiments on Pjotr’s server, and when files are located on a single HDD, my tool works about 40% faster than ‘samtools index’: one thread does all I/O stuff + indexing, while other threads unpack BGZF blocks. It would be interesting to see what’s the situation with SSD because they have much better reading performance. (Pjotr was going to test my tool on his 4-core laptop with SSD.)

Sorting

Also I recently began to implement sorting. It already works, and the speed is good enough and also limited by I/O since external sorting involves dealing with a lot of files. I use the simplest approach: first write sorted chunks to temporary directory, then merge them in one file.

However, I’m not yet satisfied with memory consumption. For sorting, I need to tweak some numbers in runtime in order not to exceed a given memory limit, and it’s a bit tricky, partly because it’s hard to predict anything with garbage collector, partly – because memory allocations are hidden under several levels of abstraction :) I’m now investigating this issue, it’s essential for decreasing the number of sorted chunks and thus getting larger buffers for reading these chunks in merge phase.

 

Categories: Uncategorized Tags:

GSoC weekly report #7

First version of gem

I’ve released first version of bioruby-sambamba gem, it’s available on rubygems.org. It works via Bio::Command, parsing JSON output produced by sambamba executable with the aid of Oj gem (that stands for ‘Optimized Json’).

At the moment, users have to compile sambamba tool themselves, and I realize that it’s not very convenient. I’ll look into producing binaries for all platforms without any dependencies so that gem will download the executable for user. Another reason for doing that is slow code generated by DMD. In general, with respect to D compilers, the more time it takes to install the compiler, the better speed you have :) I’ve tested my code with three compilers (GDC, LDC2 and DMD), and from my experience, GDC generates the fastest code, LDC2 produces a little slower executables, and DMD is the slowest among the three in that respect. Moreover, on a long pieces of code generated by Ragel, DMD with optimization flag (-O) turned on spends a lot of time in optimization phase because… its optimization phase is, uhm, not well optimized =\ (http://d.puremagic.com/issues/show_bug.cgi?id=7157)

So I’m going to compile stand-alone executables with GDC, though I’ll perhaps will need help of Iain Buclaw to install the compiler (the latest version of it is hosted on Github, and instructions on bitbucket.org are a bit outdated).

Refactoring

I decided to spend some time on refactoring so as to eliminate some design drawbacks in my code. One of them was very inconvenient API for working with SAM header. Since I didn’t initially intend to provide BAM/SAM output, I didn’t pay any attention to it at all. I looked at the ways Picard and BamTools implement this kind of functionality, and borrowed a few ideas from there. Now the code is much better, the user can add/remove reference sequence, read group, or program information (corresponding to @SQ, @RG and @PG lines in a header).

Also, after the code review when Marjan pointed out some problems with the current validation code, I began to refactor it. I have a feeling that the code can be made much more flexible (maybe some kind of Visitor pattern with walking through a tree of validators?). During the process of refactoring, I also improved the performance quite a bit, using profiler and eliminating most bottlenecks.  Validation became about 5x faster, so the speed is decent now.

Faster SAM parsing

I’ve said earlier about the opportunity to cut the time of parsing SAM by 20-25%, and I used that. Now the parsing code uses way less (re-)allocations.

Speaking about allocations, it’s only this week that I realized that static variables in D are thread-local. That makes me reconsider some pieces of code and replace some heap-allocated arrays with static arrays which I was erroneously avoiding before, thinking that it might cause multithreading-related issues. That might give another performance boost.

Nicer syntax for working with tag values

I’ve implemented better opCast() and opEquals() for Value struct, and now the user doesn’t need to do explicit conversions in almost all cases. For example, previously in order to compare a value holding a ubyte with an int, one had to say “if (to!ubyte(value) == 32) …”, and now it is just “if (value == 32) …”. Also, when assigning a value to a read tag, previously one had to explicitly call Value constructor while now it’s not needed: ‘read[“RG”] = “111111”;’ instead of ‘read[“RG”] = Value(“111111”)’

I’ll spend a few days more on refactoring, but the main task for this week is BAM indexing, i.e. producing BAI files from a sorted BAM file. Next logical step is sorting/merging, I hope to finish that by 20th July at most, and then prepare next version of gem. Maybe I’ll release another version in the middle of July when I’ll have working binaries for all platforms.

Categories: Uncategorized Tags: