Home > Uncategorized > GSoC weekly report #5

GSoC weekly report #5

Wiki

I’ve created several wiki pages covering basics of working with the library – how to open file, do random access, output SAM/BAM, how to access/modify tags, and so on.

Unfortunately, it turns out that DMD installation from zip package from their website is not that easy. It is not very clear which steps are needed to get it working. Francesco tried to install the compiler this way and ran into troubles. We’ll investigate this issue, though installing from repository should work fine.

Improved speed of conversion to SAM

Now the library is on par with samtools in this respect. The approach I have taken is to build string representation entirely on stack when possible (i.e. when it doesn’t cause overflow). I’ve noticed that alignments are typically not big, just several kilobytes in size, and it’s safe to allocate memory on stack in this case. Since I store alignments as contiguous chunks of memory, I know their size in binary form, and it’s easily estimated that in text form their size won’t grow more than 5x, the worst case being array of signed bytes in tags, where each byte can take 5 symbols (minus sign, three digits, and comma).

In case the alignment is bigger (which is not common case), old implementation is used which is about two times slower for file streams.

Parallelized BAM compression

If DMD didn’t suffer from issue 5710, this paragraph woudn’t even exist, because there would be nothing to tell about. But due to that issue and, particularly, std.parallelism module suffering from it, I had to add this ugly switch to my code:

switch (compression_level) {
    case -1: writeAlignmentBlocks!(-1)(chunked_blocks, stream); break;
    case 0: writeAlignmentBlocks!(0)(chunked_blocks, stream); break;
    case 1: writeAlignmentBlocks!(1)(chunked_blocks, stream); break;

    .............

    case 9: writeAlignmentBlocks!(9)(chunked_blocks, stream); break;
    default: assert(0);
}

For those who are not familiar with D, runtime parameter compression_level is sorta “converted” to compile-time parameter, otherwise the code doesn’t compile. Luckuly, compression level takes values in range -1 .. 9 only.

Reconsidering D/Ruby interaction

Unfortunately, one of latest threads on dlang.org forum shows that the situation with shared libraries didn’t improve over the last two months, and my Ruby bindings indeed can suddenly crash during D garbage collector run.

That was the main reason I tossed up between C++11 and D in April. Being pragmatic, I don’t hope the situation with D runtime library be completely solved during the summer. So now I have to do something about it and take another approach.

Since shared libraries are out, the approach has to deal with interprocess communication.

The first thing I looked at was JSON.

I wrote a module for serialization BAM records to JSON, and it works with the same speed as serialization to SAM (which, in turn, works with about the same speed as ‘samtools view’ when compiled with GDC – not too bad!). So the speed of processing now depends on how fast JSON is parsed in dynamic language. I looked at ‘cjson’ for Python, and ‘oj’ gem for Ruby. With first one, I’m able to deserialize 20000 records per second on my netbook, while Ruby is twice slower. In other words, Python with ‘cjson’ deserializes all records from ~300MB file in 3 minutes, whereas conversion to SAM with samtools takes about 30 seconds. I think it’s not that bad, and is quite a good compromise between speed and simplicity. I can also serialize an alignment into several JSON objects, one for most used fields, a few others for less used ones, and parse them lazily in Ruby/Python.

An example of how that works using Bio::Command:

require 'bio'
require 'oj'

class AlignmentIterator
  include Enumerable

  def initialize(filename)
    @filename = filename
  end

  def each
    Bio::Command.call_command(['./bam2json', @filename]) do |io|
      io.each do |line|
        yield Oj.load(line)
      end
    end
  end
end

FILENAME = 'HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam'

AlignmentIterator.new(FILENAME).each do |alignment|
  puts alignment['pos'] unless alignment['tags']['RG'].nil?
end

So it’s not hard at all to write wrappers for command line tools like this. Maybe I’ll consider some other tools for serialization/deserialization such as Apache Thrift or MessagePack, though I doubt they will provide much better performance than JSON.

Progress in SAM parsing

I didn’t pay a lot of attention to parsing SAM, but at least parsing valid records is quite easy. Here are 300 lines of Ragel/D code which parse alignment record line: https://github.com/lomereiter/sambamba/blob/samragel/sam/sam_alignment.rl

Also, it turns out that Ragel has some built-in tools for error actions. A couple of pages in its user guide are specifically devoted to error handling. I’ll investigate this and add appropriate actions to deal with invalid data, then the library will be ready to iterate over SAM records as well.

Advertisements
Categories: Uncategorized Tags:
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: