First experience with Mono-D

December 16, 2012 Leave a comment

Today I installed Mono-D, hoping that it might make my experience with D a little better. Unfortunately, it was far from my expectations. When it is used on small pieces of code, everything looks fine. But when I added the whole BioD library (which I slowly develop) to the include paths, performance decreased drastically and it became impossible to type code. Each autocompletion took at least 30 seconds to resolve :-( No thanks, I would rather type method name manually, 10 times faster.

And even neglecting slowness, completion results are also far from what I need. I use a lot of templates and `auto` declarations that seem to confuse Mono-D engine. I also had no luck with type deduction of iteration variable in foreach statements.

Thus, I’m staying with Vim, although it’s totally dumb when it comes to the language semantics. Hopefully, Mono-D will improve over years. Perhaps, it already works well for Java-style projects that don’t use templates all over the place, but that’s not my case.

UPDATE (March 06, 2013)

Tried again with Xamarin Studio and 0.5.1.3. Speed issue has been resolved, but autocompletion still doesn’t play well with templated code.

Categories: Uncategorized

New cool stuff in Sambamba!

October 5, 2012 Leave a comment

During last month, I have introduced quite a few cool features in Sambamba library. This blog post is a brief overview of them.

MD tags -> reference bases

If reads in the BAM file have MD tags (e.g. calculated with the aid of samtools calmd), you can reconstruct parts of reference sequence from them. This is done lazily, without any heap allocations. If you need a string instead of a lazy character sequence, just wrap it with to!string.

Example (taken from http://davetang.org/muse/2011/01/28/perl-and-sam/)

    auto read = Alignment("r4",
                         "AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA",
                         [CigarOperation(2, 'M'),
                          CigarOperation(1, 'I'),
                          CigarOperation(7, 'M'),
                          CigarOperation(6, 'D'),
                          CigarOperation(26, 'M')]);
    read["MD"] = "3C3T1^GCTCAG26";
    assert(equal(dna(read), "AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA"));

The same works for ranges of coordinate-sorted reads aligned to the same reference. That is, in this case no heap-allocations occur as well, and the sequence is also lazy.

Notice that last two reads in the following example don’t overlap, therefore space between them is filled with ‘N’.

(in this example, some bases are skipped to fit into screen)

    auto r1 = Alignment("r1",
                        "AGGTTTTGTGAGTGGGA...GCTCAGAGTGGCTCTA",
                        [CigarOperation(89, 'M'),   CigarOperation(1, 'S')]);
    r1.position = 60246;    r1["MD"] = "89";

    auto r2 = Alignment("r2",
                        "TGTGAGTGGGACAGTTGCAGCAAAACACAACCATAGGTGCCCA...CTCCAGCCCTT",
                        [CigarOperation(83, 'M'),   CigarOperation(7, 'S')]);
    r2.position = 60252;    r2["MD"] = "82T0";

    auto r3 = Alignment("r3",
                        "CATAGGTGCCCATCCACCAAGGCAGGCTCTCCATCTTGCTCAGAGTGGCTCTAGCCCTTG...",
                        [CigarOperation(90, 'M')]);
    r3.position = 60283;    r3["MD"] = "90";

    auto r4 = Alignment("r4",    "CCCTTGCTGAC...TATCCAACCG",   [CigarOperation(90, 'M')]);
    r4.position = 60337;    r4["MD"] = "90";

    auto r5 = Alignment("r5",    "GAGGCTCCACCC...AAGCACTTTGT", [CigarOperation(90, 'M')]);
    r5.position = 60432;    r5["MD"] = "90";

    auto reads = [r1, r2, r3, r4, r5];
    assert(equal(dna(reads), "AGGTTTTGTG...CTATCCAACCGNNNNNGAGGC...CTTTGT"));

Pileup

This feature requires dcollections to be installed into Sambamba root directory. Installation process is described at the top of pileuprange.d file, this will also be in README soon.

The behaviour might be a bit different from samtools with respect to insertions/deletions, I’ll revisit it later. For now the main focus is SNPs.

Some example code showing how to work with pileup in Sambamba:

foreach (column; pileupWithReferenceBases(reads)) {
    writeln(column.position, ' ', column.coverage);
    foreach (read; column.reads) {
        if (read.current_base != column.reference_base &&
            read.current_base_quality > 30 && read.cigar_operation == 'M')
        {
            ...
        }
    }
}

Improved efficiency of random access code

Today I have revisited code related to random access. A nice idea came to my mind, and I parallelized decompression much better than before.

For instance, here are the times for counting alignments from NA12043.chrom20.LS454.ssaha2.CEU.low_coverage.20101123.bam.v1, overlapping the region 20:1261261-12162614 (the number of such reads is 63417).

samtools

real 0m0.579s
user 0m0.560s
sys 0m0.020s

sambamba 0.2.8

real 0m0.349s
user 0m0.616s
sys 0m0.104s

sambamba trunk

real 0m0.152s
user 0m0.648s
sys 0m0.120s

The experiment was done, as usual, on Pjotr’s 8-core machine. These results mean, basically, that level of parallelism for accessing a region using BAI index is now the same as for serial access. I will update the Debian package soon :)

Categories: Uncategorized

GSoC weekly report #12

August 20, 2012 5 comments

This is my final GSoC report. Sambamba improved significantly over the last two weeks. Several bugs were found and fixed, and some new functionality was added.

Pipeline support

Sambamba library can now work with arbitrary streams, including non-seekable ones (of course, random access is out in this case). However, I haven’t yet figured out how to deal with ‘-’ command-line parameters using std.getopt module, so please use /dev/stdin and /dev/stdout for the time being.

MessagePack output

Seems like earlier I’ve underestimated the performance boost it can bring :)

I’ve measured time of looping through all alignments from a 112MB BAM file, and got the following results (on Pjotr’s 8-core box):

  • MRI Ruby, JSON input with ‘oj’ gem — 26.0s real, 29.2s user
  • MRI Ruby, MsgPack input — 6.7s real, 9.3s user
  • Python, JSON input with ‘ujson’ — 17.8s real, 21.0s user
  • Python, MsgPack input — 3.5s real, 7.3s user

Just for comparison, time of looping through the file with PySAM (fetch() method) is 3.4s real/3.4s user, i.e. in case of MsgPack output multithreaded execution compensates for additional serialization/deserialization steps.

I’m pretty sure output speed can be further improved—I didn’t yet tweak it thoroughly.

Ruby gem update

Now it works with MsgPack output format, also exception handling is improved. It’s not yet on Travis CI, however, and works only with Ruby 1.9 — I use Open3 for checking stderr to throw meaningful exceptions.

I’ve also set up Travis CI hooks. MRI 1.9.2 and 1.9.3 pass tests. However, I can’t get JRuby working there due to the same issues as with BioRuby—something related to popen/popen3. But JRuby in 1.9 mode works fine, trust me :)

Galaxy toolshed wrapper

I’ve managed to somehow make a wrapper for filtering tool: http://toolshed.g2.bx.psu.edu/repos/lomereiter/sambamba_filter

However, I can’t say that I like Galaxy. Using command-line is way faster than waiting for some slow Python engine to do the same job. Well, I’m just not that kind of person Galaxy was created for :)

Future plans

Next step is making a pileup engine. That’s the most essential part which is lacking in my tools now. For the design I’ll take ideas from PySAM and bio-alignment gem. (Also it involves using statistics for variant calls and thus is related to my studies.) Hopefully, then I’ll get more feedback.

Another direction of further development is making a decent validation tool. I’m sure I can make a better one that ValidateSamFile from Picard (more flexible and faster) but at the moment motivation is lacking.

Conclusion

I’ve learned quite a bit about bioinformatics. This is definitely a very interesting field, and amounts of data to be analyzed are growing rapidly. I won’t be surprised if BAM will be replaced by CRAM or some another format in a few years, and that will render my library useless. Nevertheless, now I’ve got a good experience of writing libraries of that sort, and tweaking the performance. That will surely be of great help in the future.

Thank you Google and Open Bioinformatics Foundation :)

Categories: Uncategorized Tags:

GSoC weekly report #11

August 6, 2012 Leave a comment

BAM output in viewing tool

For a long time, sambamba view had no option to output BAM files. That was related to awful quality of its code, which is now no longer the case. (Looking at the code in its current state and to the ‘Head First Design Patterns’ book, seems I’ve applied Template Method pattern.)

Other refactoring

A lot more things should be refactored, of course. E.g. all the tools now have hard-coded functionality in them (merging, sorting, indexing) while in principle, what one would like to see is methods ‘sortBy!Coordinate’, ‘index’ for BamFile, and moving merge functionality to MultiBamFile struct providing the same methods as BamFile. But that’s something I’ll work on after GSoC, not now.

Ruby DSL for selecting reads

I’ve described it in a Cucumber feature, a simple example of what it’s capable of (in fact, the following code just calls the command-line tool with –count and –filter=… parameters):


puts bam.alignments
        .referencing('20')
        .overlapping(100.Kbp .. 150.Kbp)
        .select {
            flag.first_of_pair.is_set
            flag.proper_pair.is_set
            sequence_length >= 80
            tag(:NM) == 0
        }.count

Man pages

I went down the same road as Clayton and created man pages with Ronn. For those who’re not yet aware, it’s a tool written in Ruby which allows to make both HTML and man output from files with Markdown syntax. Very convenient :)

Script for creating Debian packages

Also I wrote a Ruby script today to easily create debian packages for both i386 and amd64, and put them to Github downloads. Hopefully, now that I’ve done it I’ll make releases more often.

Future plans

We’ve agreed on importancy of having tools used by more people, and the decision is to put them on Galaxy Tool Shed. That should help in popularizing sambamba, my tool has some cool functionality to offer :)

I had a look at what’s available on Tool Shed, with special interest of what stuff for dealing with SAM is presented there. I’m planning to add an example of filtering because this is one of areas where sambamba really shines, and maybe also indexing.

After that I’ll start optimizing my D and Ruby code, and also reducing memory consumption. Perhaps, I’ll spend on that all remaining time. If not, I have a lot of other stuff to do, like stdin/stdout support.

Categories: Uncategorized Tags:

GSoC weekly report #10

July 24, 2012 Leave a comment

Filtering

Now sambamba supports custom filter expressions for specifying which alignments to process.

Example:

sambamba view mydata.bam -F "mapping_quality >= 60 and [MQ] >= 60 and mate_is_reverse_strand and [RG] == 'ERR016155' and sequence =~ /^ACTG/ and ([OP] != null or [OC] != null)" -c

As you can see, filter query syntax supports:

  • integer/string comparison for fields/tags
  • regex match operator
  • tag existence condition
  • and, or, not logical operators

Full description is available at sambamba wiki.

I also began working on providing Ruby wrapper for this syntax. After reading some pieces of Martin Fowler’ “Domain specific languages” book, I have appreciated the power of ‘instance_eval’ :-)

As of now, it looks like this (not pushed to the repo yet):

filter = Bio::Bam::AlignmentFilter.new {
  ref_id == 19

  flag.unmapped.is_unset
  flag.mate_is_unmapped.is_unset

  tag(:RG) =~ /ERR001*/i

  tag(:NM) > 0

  union {

    intersection {
      mapping_quality >= 40
      mapping_quality < 60
    }

    mapping_quality.is_unknown

  }
}

Looks kinda cool, but I’m still thinking of ways to reduce amount of dots and use spaces instead :-)

Progressbar

For me, it’s annoying to not have any progress indicator. In those cases where file size is known beforehand, it’s relatively easy to calculate current progress. The only issue is design.

What I came up with is as follows: the user can provide optional compile-time argument — a function with one argument, percentage. An important observation mentioned in code comments is that as float computations are not cheap, the user can make this argument lazy (http://blackwhale.github.com/lazy-evaluation.html) and compute it only, say, each 1000 alignments.

The working example is sambamba-index tool, where wget-like progressbar is outputted to STDERR. In the next few days I’ll add this functionality to the other utilities as well.

Categories: Uncategorized Tags:

GSoC weekly report #9

July 17, 2012 Leave a comment

Sorting

I’ve added a few more options to command line tool, namely:

  • setting compression level of sorted BAM file
  • sorting by read name instead of coordinate
  • outputting chunks as uncompressed BAM

In August, I’ll look again at how the performance can be improved, though it’s almost OK now.

Merging

Somehow this turned to be harder than sorting. Of course, there’s nothing hard in merging alignments, the hard part is really merging SAM headers. Samtools seems to don’t do that at all; Picard does a better job, and I had to read a lot of Java code to understand what’s going on. Finally, I wrote D version of it, and it was 2x shorter. Then I’ve improved it a bit, because Picard uses pretty dumb algorithm for merging sequence dictionaries.

Consider the following three SAM headers:

@HD    VN:1.3    SO:coordinate
@SQ    SN:A    LN:100
@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100
@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100
@SQ    SN:A    LN:100

Surely they are mergeable, but in this particular order Picard can’t do that! That’s because it merges sequence dictionaries in turn: merge first two, then merge the result with the third. So before merging with the last header, it assumes the order to be A < B, and it can’t be changed later in the algorithm. But the order must be consistent among headers in order to merge files sorted by coordinate (because they are first sorted by reference ID), and the last header implies that B < A.

Thus the pairwise approach taken by Picard didn’t satisfy me. What I do in sambamba is first build a directed graph out of all sequence dictionaries where edges indicate the order, and then do a topological sorting on it. That allows to process arbitrarily complex cases. (Perhaps, I should change implementation to use DFS instead of BFS for the result to look a bit more intuitive.)

Debian packages

Also I prepared debian packages for both amd64 and i386 architectures. That’s my first experience of building packages, but seems like everything works, I’ve tested them on live USBs with Debian (amd64) and Ubuntu 10.04 (i386).

They’re available at http://github.com/lomereiter/sambamba/downloads

For packaging, I had to follow samtools approach and make one executable instead of several, that’s because binaries are statically linked with libphobos and it’s better to have it copied only once. However, the old way of building tools separately is also available, that’s easier for development.

As the tool called ‘sambamba’ in the past is now ‘sambamba-view’ I also had to tweak Ruby bindings a bit, so I’ve published bioruby-sambamba gem 0.0.2 — no new functionality, just a minor update taking the renaming into account.

Categories: Uncategorized Tags:

GSoC weekly report #8

July 10, 2012 Leave a comment

Indexing

Last week I implemented building BAI index for BAM files. In order to understand how indexing works I had to read source code of BamTools and Picard (samtools code is unreadable to me).

Finally I managed to get it working, and it exploits parallelism so that time of building index is roughly equal to the time of reading the file. I did some experiments on Pjotr’s server, and when files are located on a single HDD, my tool works about 40% faster than ‘samtools index’: one thread does all I/O stuff + indexing, while other threads unpack BGZF blocks. It would be interesting to see what’s the situation with SSD because they have much better reading performance. (Pjotr was going to test my tool on his 4-core laptop with SSD.)

Sorting

Also I recently began to implement sorting. It already works, and the speed is good enough and also limited by I/O since external sorting involves dealing with a lot of files. I use the simplest approach: first write sorted chunks to temporary directory, then merge them in one file.

However, I’m not yet satisfied with memory consumption. For sorting, I need to tweak some numbers in runtime in order not to exceed a given memory limit, and it’s a bit tricky, partly because it’s hard to predict anything with garbage collector, partly – because memory allocations are hidden under several levels of abstraction :) I’m now investigating this issue, it’s essential for decreasing the number of sorted chunks and thus getting larger buffers for reading these chunks in merge phase.

 

Categories: Uncategorized Tags:
Follow

Get every new post delivered to your Inbox.