GSoC weekly report #9

Sorting

I’ve added a few more options to the command-line tool, namely:

  • setting the compression level of the sorted BAM file
  • sorting by read name instead of by coordinate
  • outputting chunks as uncompressed BAM

In August, I’ll take another look at how the performance can be improved, though it’s already almost OK.

Merging

Somehow this turned out to be harder than sorting. Of course, there’s nothing hard about merging alignments; the hard part is really merging SAM headers. Samtools doesn’t seem to do that at all; Picard does a better job, and I had to read a lot of Java code to understand what’s going on. Finally, I wrote a D version of it, and it was half as long. Then I improved it a bit, because Picard uses a pretty dumb algorithm for merging sequence dictionaries.

Consider the following three SAM headers:

@HD    VN:1.3    SO:coordinate
@SQ    SN:A    LN:100

@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100

@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100
@SQ    SN:A    LN:100

Surely they are mergeable (the order B, A is consistent with all three), but in this particular order Picard can’t do that! That’s because it merges sequence dictionaries in turn: it merges the first two, then merges the result with the third. So before merging with the last header, it has already assumed the order to be A < B, and that can’t be changed later in the algorithm. But the order must be consistent among headers in order to merge files sorted by coordinate (because they are first sorted by reference ID), and the last header implies that B < A.
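To make the failure concrete, here is a small Python sketch of a merge that proceeds in turn. It is a simplified model of the behaviour described above, not Picard’s actual code: the function names `merge_pair` and `merge_in_turn` are made up for illustration, sequence dictionaries are reduced to plain lists of names, and new names are simply appended at the end.

```python
def merge_pair(first, second):
    """Merge two name lists, treating the order already fixed in `first` as final.

    Simplified model: names that are new in `second` are appended at the end.
    Raises ValueError if `second` lists shared names in a different order.
    """
    position = {name: i for i, name in enumerate(first)}
    last = -1
    for name in second:
        if name in position:
            if position[name] < last:
                raise ValueError(f"order conflict at {name!r}")
            last = position[name]
    return first + [name for name in second if name not in position]


def merge_in_turn(dictionaries):
    """Fold the dictionaries left to right, one pairwise merge at a time."""
    merged = []
    for names in dictionaries:
        merged = merge_pair(merged, names)
    return merged


# The three headers from the example, reduced to their @SQ names.
try:
    merge_in_turn([["A"], ["B"], ["B", "A"]])
except ValueError as error:
    print("cannot merge:", error)

# The same headers in a different order merge without trouble.
print(merge_in_turn([["B", "A"], ["A"], ["B"]]))
```

In this model the result depends on the order in which the headers arrive, which is exactly the problem: the first two headers say nothing about the relative order of A and B, yet an order gets fixed anyway.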

Thus the pairwise approach taken by Picard didn’t satisfy me. What I do in sambamba is first build a directed graph out of all the sequence dictionaries, where edges indicate the order, and then do a topological sort on it. That makes it possible to handle arbitrarily complex cases. (Perhaps I should change the implementation to use DFS instead of BFS so that the result looks a bit more intuitive.)
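The graph-based idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the sambamba implementation (which is written in D): sequence dictionaries are reduced to lists of names, the function name `merge_sequence_orders` is hypothetical, and the BFS-style variant of topological sorting (Kahn’s algorithm) is used.

```python
from collections import defaultdict, deque


def merge_sequence_orders(dictionaries):
    """Merge several sequence-name orderings into one consistent order.

    Each dictionary is a list of sequence names in header order.
    Raises ValueError if the orderings contradict each other.
    """
    successors = defaultdict(set)  # name -> names that must come after it
    indegree = {}                  # name -> number of incoming edges

    for names in dictionaries:
        for name in names:
            indegree.setdefault(name, 0)
        # Neighbouring names in one header give an edge "before -> after".
        for before, after in zip(names, names[1:]):
            if after not in successors[before]:
                successors[before].add(after)
                indegree[after] += 1

    # Kahn's algorithm: repeatedly emit a name with no unprocessed predecessors.
    queue = deque(name for name, degree in indegree.items() if degree == 0)
    merged = []
    while queue:
        name = queue.popleft()
        merged.append(name)
        for following in sorted(successors[name]):
            indegree[following] -= 1
            if indegree[following] == 0:
                queue.append(following)

    # Names left over are part of a cycle, i.e. the headers disagree.
    if len(merged) != len(indegree):
        raise ValueError("sequence dictionaries have conflicting orders")
    return merged


# The three example headers, in the order that defeats a pairwise merge.
print(merge_sequence_orders([["A"], ["B"], ["B", "A"]]))  # ['B', 'A']
```

Because all the constraints are collected before any order is chosen, the result no longer depends on the order in which the headers are supplied, and a genuine contradiction (say, A before B in one header and B before A in another) shows up as a cycle.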

Debian packages

I also prepared Debian packages for both the amd64 and i386 architectures. This is my first experience of building packages, but everything seems to work: I’ve tested them on live USBs with Debian (amd64) and Ubuntu 10.04 (i386).

They’re available at http://github.com/lomereiter/sambamba/downloads

For packaging, I had to follow the samtools approach and make one executable instead of several, because the binaries are statically linked with libphobos and it’s better to have it copied only once. However, the old way of building the tools separately is still available, since that’s easier for development.

As the tool previously called ‘sambamba’ is now ‘sambamba-view’, I also had to tweak the Ruby bindings a bit, so I’ve published bioruby-sambamba gem 0.0.2: no new functionality, just a minor update that takes the renaming into account.
