
GSoC weekly report #12

August 20, 2012

This is my final GSoC report. Sambamba improved significantly over the last two weeks. Several bugs were found and fixed, and some new functionality was added.

Pipeline support

The sambamba library can now work with arbitrary streams, including non-seekable ones (of course, random access is out in this case). However, I haven't yet figured out how to deal with ‘-‘ command-line parameters using the std.getopt module, so please use /dev/stdin and /dev/stdout for the time being.

MessagePack output

It seems I earlier underestimated the performance boost it can bring :)

I’ve measured the time of looping through all alignments from a 112MB BAM file, and got the following results (on Pjotr’s 8-core box):

  • MRI Ruby, JSON input with ‘oj’ gem — 26.0s real, 29.2s user
  • MRI Ruby, MsgPack input — 6.7s real, 9.3s user
  • Python, JSON input with ‘ujson’ — 17.8s real, 21.0s user
  • Python, MsgPack input — 3.5s real, 7.3s user

Just for comparison, the time of looping through the file with PySAM (the fetch() method) is 3.4s real / 3.4s user, i.e. with MsgPack output, multithreaded execution compensates for the additional serialization/deserialization steps.

I’m pretty sure the output speed can be further improved; I haven’t tweaked it thoroughly yet.

Ruby gem update

It now works with the MsgPack output format, and exception handling is improved. It’s not yet on Travis CI, however, and works only with Ruby 1.9, since I use Open3 for checking stderr to throw meaningful exceptions.

I’ve also set up Travis CI hooks. MRI 1.9.2 and 1.9.3 pass the tests. However, I can’t get JRuby working there due to the same issues as with BioRuby, something related to popen/popen3. But JRuby in 1.9 mode works fine, trust me :)

Galaxy toolshed wrapper

I’ve managed to put together a wrapper for the filtering tool:

However, I can’t say that I like Galaxy. Using the command line is way faster than waiting for some slow Python engine to do the same job. Well, I’m just not the kind of person Galaxy was created for :)

Future plans

The next step is making a pileup engine. That’s the most essential part lacking in my tools now. For the design I’ll take ideas from PySAM and the bio-alignment gem. (It also involves using statistics for variant calls and is thus related to my studies.) Hopefully I’ll then get more feedback.

Another direction of further development is making a decent validation tool. I’m sure I can make a better one than ValidateSamFile from Picard (more flexible and faster), but at the moment motivation is lacking.


I’ve learned quite a bit about bioinformatics. This is definitely a very interesting field, and the amounts of data to be analyzed are growing rapidly. I won’t be surprised if BAM is replaced by CRAM or some other format in a few years, which would render my library useless. Nevertheless, I’ve now gained good experience writing libraries of this sort and tweaking their performance. That will surely be of great help in the future.

Thank you Google and Open Bioinformatics Foundation :)

Categories: Uncategorized Tags:

GSoC weekly report #11

August 6, 2012

BAM output in viewing tool

For a long time, sambamba view had no option to output BAM files. That was due to the awful quality of its code, which is no longer the case. (Looking at the code in its current state and at the ‘Head First Design Patterns’ book, it seems I’ve applied the Template Method pattern.)

Other refactoring

A lot more things should be refactored, of course. E.g. all the tools now have functionality hard-coded in them (merging, sorting, indexing), while in principle what one would like to see is methods like ‘sortBy!Coordinate’ and ‘index’ on BamFile, with merge functionality moved to a MultiBamFile struct providing the same methods as BamFile. But that’s something I’ll work on after GSoC, not now.

Ruby DSL for selecting reads

I’ve described it in a Cucumber feature; here is a simple example of what it’s capable of (in fact, the following code just calls the command-line tool with --count and --filter=… parameters):

puts bam.alignments
        .overlapping(100.Kbp .. 150.Kbp)
        .select {
            sequence_length >= 80
            tag(:NM) == 0
        }

Man pages

I went down the same road as Clayton and created man pages with Ronn. For those not yet aware, it’s a tool written in Ruby which generates both HTML and man output from files with Markdown syntax. Very convenient :)

Script for creating Debian packages

I also wrote a Ruby script today to easily create Debian packages for both i386 and amd64 and upload them to GitHub downloads. Hopefully, now that I’ve done it, I’ll make releases more often.

Future plans

We’ve agreed on the importance of having the tools used by more people, and the decision is to put them on the Galaxy Tool Shed. That should help popularize sambamba; my tool has some cool functionality to offer :)

I had a look at what’s available on the Tool Shed, with special interest in what’s there for dealing with SAM. I’m planning to add an example of filtering, because this is one of the areas where sambamba really shines, and maybe also indexing.

After that I’ll start optimizing my D and Ruby code, and also reducing memory consumption. Perhaps I’ll spend all the remaining time on that. If not, I have a lot of other stuff to do, like stdin/stdout support.


GSoC weekly report #10

July 24, 2012


Filter expressions

Sambamba now supports custom filter expressions for specifying which alignments to process.


sambamba view mydata.bam -F "mapping_quality >= 60 and [MQ] >= 60 and mate_is_reverse_strand and [RG] == 'ERR016155' and sequence =~ /^ACTG/ and ([OP] != null or [OC] != null)" -c

As you can see, filter query syntax supports:

  • integer/string comparison for fields/tags
  • regex match operator
  • tag existence condition
  • and, or, not logical operators

A full description is available on the sambamba wiki.

I also began working on a Ruby wrapper for this syntax. After reading some parts of Martin Fowler’s “Domain-Specific Languages” book, I have come to appreciate the power of ‘instance_eval’ :-)

As of now, it looks like this (not pushed to the repo yet):

filter = {
  ref_id == 19
  tag(:RG) =~ /ERR001*/i
  tag(:NM) > 0
  union {
    intersection {
      mapping_quality >= 40
      mapping_quality < 60
    }
  }
}

Looks kinda cool, but I’m still thinking of ways to reduce the number of dots and use spaces instead :-)


Progress indicators

For me, it’s annoying not to have any progress indicator. In cases where the file size is known beforehand, it’s relatively easy to calculate the current progress. The only remaining issue is the design.

What I came up with is as follows: the user can provide an optional compile-time argument, a function taking one argument, the percentage. An important observation mentioned in the code comments is that since float computations are not free, the user can make this argument lazy and compute it only, say, every 1000 alignments.

The working example is the sambamba-index tool, where a wget-like progress bar is output to STDERR. In the next few days I’ll add this functionality to the other utilities as well.
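An illustrative Ruby analogue of this design (the real code is D; all names here are made up): a callback receives a percentage, but is only invoked every `step` records so the progress computation stays cheap.

```ruby
# Iterate over records, reporting progress at most once per `step`
# records; `on_progress` receives the completed percentage.
def each_with_progress(records, total, step: 1000, on_progress: nil)
  records.each_with_index do |record, i|
    on_progress.call(100.0 * i / total) if on_progress && (i % step).zero?
    yield record
  end
end

# A wget-like bar string, suitable for printing to stderr with "\r":
def bar(percent, width = 40)
  filled = (width * percent / 100.0).round
  format('[%s%s] %5.1f%%', '=' * filled, ' ' * (width - filled), percent)
end

each_with_progress((1..5000).to_a, 5000,
                   on_progress: ->(p) { $stderr.print "\r#{bar(p)}" }) do |rec|
  # process the record here
end
```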


GSoC weekly report #9

July 17, 2012


Sorting

I’ve added a few more options to the sorting tool, namely:

  • setting compression level of sorted BAM file
  • sorting by read name instead of coordinate
  • outputting chunks as uncompressed BAM

In August, I’ll look again at how the performance can be improved, though it’s almost OK now.


Merging

Somehow this turned out to be harder than sorting. Of course, there’s nothing hard in merging alignments; the hard part is merging SAM headers. Samtools doesn’t seem to do that at all; Picard does a better job, and I had to read a lot of Java code to understand what’s going on. Finally, I wrote a D version of it, and it was half as long. Then I improved it a bit, because Picard uses a pretty dumb algorithm for merging sequence dictionaries.

Consider the following three SAM headers:

@HD    VN:1.3    SO:coordinate
@SQ    SN:A    LN:100

@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100

@HD    VN:1.3    SO:coordinate
@SQ    SN:B    LN:100
@SQ    SN:A    LN:100

Surely they are mergeable, but in this particular order Picard can’t do it! That’s because it merges sequence dictionaries pairwise: merge the first two, then merge the result with the third. So before merging with the last header, it assumes the order to be A < B, and that can’t be changed later in the algorithm. But the order must be consistent among headers in order to merge files sorted by coordinate (because records are first sorted by reference ID), and the last header implies that B < A.

Thus the pairwise approach taken by Picard didn’t satisfy me. What I do in sambamba is first build a directed graph out of all sequence dictionaries, where edges indicate the order, and then do a topological sort on it. That allows processing arbitrarily complex cases. (Perhaps I should change the implementation to use DFS instead of BFS so that the result looks a bit more intuitive.)

Debian packages

I also prepared Debian packages for both amd64 and i386 architectures. It’s my first experience building packages, but everything seems to work; I’ve tested them on live USBs with Debian (amd64) and Ubuntu 10.04 (i386).

They’re available at

For packaging, I had to follow the samtools approach and make one executable instead of several, because the binaries are statically linked with libphobos and it’s better to have it copied only once. However, the old way of building the tools separately is still available; that’s easier for development.

As the tool formerly called ‘sambamba’ is now ‘sambamba-view’, I also had to tweak the Ruby bindings a bit, so I’ve published bioruby-sambamba gem 0.0.2: no new functionality, just a minor update taking the renaming into account.


GSoC weekly report #8

July 10, 2012


BAM indexing

Last week I implemented building BAI indices for BAM files. In order to understand how indexing works, I had to read the source code of BamTools and Picard (the samtools code is unreadable to me).

Finally I got it working, and it exploits parallelism so that the time of building the index is roughly equal to the time of reading the file. I did some experiments on Pjotr’s server, and when files are located on a single HDD, my tool works about 40% faster than ‘samtools index’: one thread does all the I/O and indexing, while the other threads unpack BGZF blocks. It would be interesting to see what the situation is with SSDs, since they have much better read performance. (Pjotr was going to test the tool on his 4-core laptop with an SSD.)


Sorting

I also recently began implementing sorting. It already works, and the speed is good enough; it too is limited by I/O, since external sorting involves dealing with a lot of files. I use the simplest approach: first write sorted chunks to a temporary directory, then merge them into one file.

However, I’m not yet satisfied with memory consumption. For sorting, I need to tweak some numbers at runtime in order not to exceed a given memory limit, and it’s a bit tricky: partly because it’s hard to predict anything with a garbage collector, partly because memory allocations are hidden under several levels of abstraction :) I’m now investigating this issue; it’s essential for decreasing the number of sorted chunks and thus getting larger buffers for reading those chunks in the merge phase.



GSoC weekly report #7

First version of gem

I’ve released the first version of the bioruby-sambamba gem. It works via Bio::Command, parsing the JSON output produced by the sambamba executable with the aid of the Oj gem (which stands for ‘Optimized JSON’).

At the moment, users have to compile the sambamba tool themselves, and I realize that’s not very convenient. I’ll look into producing binaries for all platforms without any dependencies, so that the gem can download the executable for the user. Another reason for doing that is the slow code generated by DMD. In general, with D compilers, the more time it takes to install the compiler, the better speed you get :) I’ve tested my code with three compilers (GDC, LDC2 and DMD), and from my experience, GDC generates the fastest code, LDC2 produces slightly slower executables, and DMD is the slowest of the three in that respect. Moreover, on the long pieces of code generated by Ragel, DMD with the optimization flag (-O) turned on spends a lot of time in the optimization phase because… its optimization phase is, uhm, not well optimized =\

So I’m going to compile stand-alone executables with GDC, though I’ll perhaps need the help of Iain Buclaw to install the compiler (the latest version is hosted on GitHub, and the instructions are a bit outdated).


Refactoring

I decided to spend some time on refactoring to eliminate some design drawbacks in my code. One of them was a very inconvenient API for working with the SAM header. Since I didn’t initially intend to provide BAM/SAM output, I hadn’t paid any attention to it at all. I looked at how Picard and BamTools implement this kind of functionality, and borrowed a few ideas from there. Now the code is much better: the user can add/remove reference sequences, read groups, and program information (corresponding to @SQ, @RG and @PG lines in a header).

Also, after the code review where Marjan pointed out some problems with the current validation code, I began refactoring it. I have a feeling the code can be made much more flexible (maybe some kind of Visitor pattern, walking through a tree of validators?). During the refactoring, I also improved the performance quite a bit, using a profiler and eliminating most bottlenecks. Validation became about 5x faster, so the speed is decent now.

Faster SAM parsing

I mentioned earlier the opportunity to cut the time of parsing SAM by 20-25%, and I’ve used it. The parsing code now uses far fewer (re-)allocations.

Speaking of allocations, it’s only this week that I realized that static variables in D are thread-local. That makes me reconsider some pieces of code and replace some heap-allocated arrays with static arrays, which I was erroneously avoiding before, thinking they might cause multithreading-related issues. That might give another performance boost.

Nicer syntax for working with tag values

I’ve implemented better opCast() and opEquals() for the Value struct, and now the user doesn’t need explicit conversions in almost all cases. For example, previously, in order to compare a value holding a ubyte with an int, one had to say “if (to!ubyte(value) == 32) …”, and now it’s just “if (value == 32) …”. Also, when assigning a value to a read tag, one previously had to call the Value constructor explicitly, while now it’s not needed: ‘read[“RG”] = “111111”;’ instead of ‘read[“RG”] = Value(“111111”);’.

I’ll spend a few more days on refactoring, but the main task for this week is BAM indexing, i.e. producing BAI files from a sorted BAM file. The next logical step is sorting/merging; I hope to finish that by July 20th at the latest, and then prepare the next version of the gem. Maybe I’ll release another version in the middle of July when I have working binaries for all platforms.


GSoC weekly report #6

June 26, 2012

Ruby bindings

The bindings now work with JSON output from the command-line tool sambamba, which has to be installed separately. All the old Cucumber scenarios are passing except one not-very-important one. I’ll write documentation during this week and package everything into a gem.

SAM input

I got it working, and introduced a new type SamFile, which is very similar to BamFile except that it doesn’t provide random access. Unit tests ensure that parseAlignmentLine(toSam(read)) == read for all valid reads (otherwise invalid fields are default-initialized).

In order to allow invalid data, I made a simple rule invalid_field in Ragel, which just reads until the next tab character:

mandatoryfields = (qname | invalid_field) '\t'
                  (flag  | invalid_field) '\t'
                  (rname | invalid_field) '\t'
                  (pos   | invalid_field) '\t'
                  (mapq  | invalid_field) '\t'
                  ... // and so on

Parsing is now about 3x as slow as in samtools, but that has nothing to do with Ragel; the main reason is too many memory allocations. I did some profiling, and doubling the speed won’t take a lot of effort. As for tripling, I’m not so sure, but I’ll try :)

Sambamba CLI tool

It accepts both SAM and BAM files as input, and can output either SAM or JSON (the speed is the same in both cases). It also allows filtering by quality and/or read group, and accepts samtools syntax for regions.

More on wiki:


GSoC weekly report #5

June 19, 2012


Documentation

I’ve created several wiki pages covering the basics of working with the library: how to open a file, do random access, output SAM/BAM, access/modify tags, and so on.

Unfortunately, it turns out that installing DMD from the zip package on their website is not that easy; it is not very clear which steps are needed to get it working. Francesco tried to install the compiler this way and ran into trouble. We’ll investigate this issue, though installing from a repository should work fine.

Improved speed of conversion to SAM

The library is now on par with samtools in this respect. The approach I’ve taken is to build the string representation entirely on the stack when possible (i.e. when it doesn’t cause overflow). I’ve noticed that alignments are typically not big, just several kilobytes in size, and it’s safe to allocate memory on the stack in this case. Since I store alignments as contiguous chunks of memory, I know their size in binary form, and it’s easily estimated that in text form the size won’t grow more than 5x, the worst case being an array of signed bytes in the tags, where each byte can take 5 characters (minus sign, three digits, and a comma).

In case the alignment is bigger (which is not the common case), the old implementation is used, which is about two times slower for file streams.

Parallelized BAM compression

If DMD didn’t suffer from issue 5710, this paragraph wouldn’t even exist, because there would be nothing to tell. But due to that issue, and particularly the std.parallelism module suffering from it, I had to add this ugly switch to my code:

switch (compression_level) {
    case -1: writeAlignmentBlocks!(-1)(chunked_blocks, stream); break;
    case 0: writeAlignmentBlocks!(0)(chunked_blocks, stream); break;
    case 1: writeAlignmentBlocks!(1)(chunked_blocks, stream); break;
    // ... cases 2 through 8 ...
    case 9: writeAlignmentBlocks!(9)(chunked_blocks, stream); break;
    default: assert(0);
}
For those who are not familiar with D: the runtime parameter compression_level is sorta “converted” to a compile-time parameter; otherwise the code doesn’t compile. Luckily, the compression level only takes values in the range -1 .. 9.

Reconsidering D/Ruby interaction

Unfortunately, one of the latest threads on the forum shows that the situation with shared libraries hasn’t improved over the last two months, and my Ruby bindings can indeed suddenly crash during a D garbage collector run.

That was the main reason I wavered between C++11 and D in April. Being pragmatic, I don’t expect the situation with the D runtime library to be completely solved during the summer. So now I have to do something about it and take another approach.

Since shared libraries are out, the approach has to deal with interprocess communication.

The first thing I looked at was JSON.

I wrote a module for serializing BAM records to JSON, and it works at the same speed as serialization to SAM (which, in turn, works at about the same speed as ‘samtools view’ when compiled with GDC; not too bad!). So the speed of processing now depends on how fast JSON is parsed in the dynamic language. I looked at ‘cjson’ for Python and the ‘oj’ gem for Ruby. With the first, I’m able to deserialize 20000 records per second on my netbook, while Ruby is twice as slow. In other words, Python with ‘cjson’ deserializes all records from a ~300MB file in 3 minutes, whereas conversion to SAM with samtools takes about 30 seconds. I think that’s not bad, and is quite a good compromise between speed and simplicity. I can also serialize an alignment into several JSON objects (one for the most used fields, a few others for less used ones) and parse them lazily in Ruby/Python.

An example of how that works using Bio::Command:

require 'bio'
require 'oj'

class AlignmentIterator
  include Enumerable

  def initialize(filename)
    @filename = filename
  end

  def each
    Bio::Command.call_command(['./bam2json', @filename]) do |io|
      io.each do |line|
        yield Oj.load(line)
      end
    end
  end
end

FILENAME = 'HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam' do |alignment|
  puts alignment['pos'] unless alignment['tags']['RG'].nil?
end
So it’s not hard at all to write wrappers for command-line tools like this. Maybe I’ll consider other tools for serialization/deserialization, such as Apache Thrift or MessagePack, though I doubt they will provide much better performance than JSON.

Progress in SAM parsing

I haven’t paid a lot of attention to parsing SAM, but at least parsing valid records is quite easy. Here are 300 lines of Ragel/D code which parse an alignment record line:

Also, it turns out that Ragel has some built-in tools for error actions. A couple of pages in its user guide are devoted specifically to error handling. I’ll investigate this and add appropriate actions to deal with invalid data; then the library will be ready to iterate over SAM records as well.


GSoC weekly report #4

June 11, 2012

Last week I implemented BAM output in my library. In order for it to be useful, I also added support for alignment modification and creation.

BAM output

Currently, the interface consists of just one function, writeBAM, which takes the following parameters:

  1. Output stream (it can wrap stdout if you wish)
  2. Text of SAM header
  3. Information about reference sequences (which comes after header)
  4. Range of alignments
  5. (Optional) Compression level, from -1 to 9. The default is -1, which stands for the default compression in zlib. Zero corresponds to so-called uncompressed BAM, although that turns out to be zlib-compressed as well, just with zero compression.

For an example of usage, see the wiki:

Modification support

All tags, flags, and fields can now be modified. I used a lot of bit hacks to make it fast, especially for tag values, which can hold integers, floats, chars, strings, or arrays of numbers.

In my code, every tag value is a tagged union. The type tags are designed so that (typetag >> 5) is the size of the stored value (or of an element of a stored array) in bytes, (typetag & 1) tells whether the value is an array, and (typetag & 3) tells whether it is a string. This way it’s easy to copy data from a value to a stream; no more information is needed.
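As a toy illustration of these invariants (the concrete bit values below are made up and treat a string as an "array" of 1-byte chars; the actual constants in sambamba may well differ):

```ruby
# Build a type tag satisfying the invariants described above:
#   tag >> 5 == element size in bytes
#   tag & 1  == 1 for arrays
#   tag & 3  == 3 for strings (here: marked as 1-byte-char arrays)
def type_tag(elem_size, array: false, string: false)
  low = string ? 3 : (array ? 1 : 0)
  (elem_size << 5) | low
end

def elem_size(tag); tag >> 5;       end
def array?(tag);    (tag & 1) == 1; end
def string?(tag);   (tag & 3) == 3; end

int32_array = type_tag(4, array: true)
puts elem_size(int32_array)  # => 4
puts array?(int32_array)     # => true
puts string?(int32_array)    # => false
```

With such an encoding, the element size and the array bit are all a serializer needs to copy the raw bytes out, which is exactly the point made above.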

So it’s time to think about a new project title, because ‘BAMread’ is misleading now.

Now I’ll concentrate on filling the wiki with details about what the library is capable of, and on making the installation process more user-friendly. There’s a lot of functionality already, but it needs to be documented.

Meanwhile, I’ll create a branch for playing with Ragel and will work on adding SAM support.


GSoC weekly report #3

Caching module

A week ago, I wrote a post about memoization. I’ve implemented a generic memoize function which uses a chosen cache strategy. Strategies can be easily added, and the cache can be accessed from multiple threads, with each value being calculated only once.

I have implemented least-used and FIFO cache strategies. Later I will study other cache algorithms and measure how they perform. For now, a simple FIFO cache is used for storing decompressed BGZF blocks.

Random access

Random access is now covered with unit tests, which show that accessing via the index file produces the same results as naive filtering. It also uses caching, so a block doesn’t get decompressed twice.

It also works from Ruby; in general, I try to keep the bindings up to date.


CLI tool

Recently Pjotr and I discussed this, and we decided that I’d write a simple CLI application for fetching alignments. It’s ready now:

The output is still slow, and I’ll investigate how to make it faster. I’ve already rejected the D standard library output methods (see my previous post) and am now using only functions from stdio.h for output.

My current ideas are:

  • get rid of functions from the printf family and use only fwrite; *printf is slow because of format-string parsing
  • get rid of fputc calls and collect bytes in a string buffer instead

Although *printf is convenient, it’s not fast enough for big amounts of data. There aren’t many types which I need to convert to strings: mostly integers, and very rarely some float values. I’ll just adapt some existing implementations. For instance, this one: (they claim numbers are converted to strings 20x faster than with snprintf).
