Archive for June, 2012

GSoC weekly report #6

June 26, 2012

Ruby bindings

The bindings now work with the JSON output of the command-line tool sambamba, which must be installed separately. All the old Cucumber scenarios pass except one not-very-important one. I’ll write documentation this week and package everything into a gem.

SAM input

I got it working and introduced a new type, SamFile, which is very similar to BamFile except that it doesn’t provide random access. Unit tests ensure that parseAlignmentLine(toSam(read)) == read for all valid reads (invalid fields are default-initialized instead).
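
The invariant can be sketched in Ruby with a toy serializer/parser pair (the field subset and helper names here are illustrative, not the library’s actual API):

```ruby
# Toy illustration of the round-trip invariant parse(to_sam(read)) == read,
# using only three of the eleven mandatory SAM fields.
def to_sam(read)
  [read[:qname], read[:flag].to_s, read[:pos].to_s].join("\t")
end

def parse_alignment_line(line)
  qname, flag, pos = line.split("\t")
  { qname: qname, flag: Integer(flag), pos: Integer(pos) }
end

read = { qname: "r001", flag: 99, pos: 7 }
raise "round trip failed" unless parse_alignment_line(to_sam(read)) == read
```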

To allow invalid data, I added a simple rule, invalid_field, to the Ragel grammar, which just reads until the next tab character:

mandatoryfields = (qname | invalid_field) '\t'
                  (flag  | invalid_field) '\t'
                  (rname | invalid_field) '\t'
                  (pos   | invalid_field) '\t'
                  (mapq  | invalid_field) '\t'
                  ... // and so on

Parsing is currently about 3x slower than in samtools, but that has nothing to do with Ragel; the main cause is too many memory allocations. I did some profiling, and doubling the speed won’t take much effort. As for tripling it, I’m less sure, but I’ll try :)

Sambamba CLI tool

It accepts both SAM and BAM files as input and can output either SAM or JSON (the speed is the same in both cases). It also allows filtering by quality and/or read group, and accepts samtools syntax for regions.

More on wiki: https://github.com/lomereiter/sambamba/wiki/Command-line-tools


GSoC weekly report #5

June 19, 2012

Wiki

I’ve created several wiki pages covering the basics of working with the library: how to open a file, do random access, output SAM/BAM, access and modify tags, and so on.

Unfortunately, it turns out that installing DMD from the zip package on their website is not that easy; it is not very clear which steps are needed to get it working. Francesco tried to install the compiler this way and ran into trouble. We’ll investigate this issue, though installing from a repository should work fine.

Improved speed of conversion to SAM

The library is now on par with samtools in this respect. The approach I took is to build the string representation entirely on the stack when possible (i.e. when it doesn’t cause an overflow). I’ve noticed that alignments are typically not big, just a few kilobytes in size, so it’s safe to allocate that much on the stack. Since I store alignments as contiguous chunks of memory, I know their size in binary form, and it’s easy to show that their size in text form won’t grow more than 5x: the worst case is an array of signed bytes in the tags, where each byte can take up to 5 characters (a minus sign, three digits, and a comma).
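
The 5x bound is easy to verify for the worst case. A quick Ruby check over all signed byte values (this is just an illustration of the estimate, not code from the library):

```ruby
# Worst case for SAM text size vs. binary size: an array of signed bytes in a
# tag. Each 1-byte value needs at most 5 characters: a minus sign, up to three
# digits, and the separating comma, e.g. -128 -> "-128,".
max_chars = (-128..127).map { |b| "#{b},".length }.max
raise unless max_chars == 5   # so the text form grows at most ~5x over binary
```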

If an alignment is bigger (which is uncommon), the old implementation is used, which is about two times slower for file streams.

Parallelized BAM compression

If DMD didn’t suffer from issue 5710, this paragraph wouldn’t exist, because there would be nothing to tell. But due to that issue, and the std.parallelism module in particular suffering from it, I had to add this ugly switch to my code:

switch (compression_level) {
    case -1: writeAlignmentBlocks!(-1)(chunked_blocks, stream); break;
    case 0: writeAlignmentBlocks!(0)(chunked_blocks, stream); break;
    case 1: writeAlignmentBlocks!(1)(chunked_blocks, stream); break;

    // ... cases 2 through 8 ...

    case 9: writeAlignmentBlocks!(9)(chunked_blocks, stream); break;
    default: assert(0);
}

For those who are not familiar with D: the runtime parameter compression_level is sort of “converted” into a compile-time parameter; otherwise the code doesn’t compile. Luckily, the compression level only takes values in the range -1 .. 9.

Reconsidering D/Ruby interaction

Unfortunately, one of the latest threads on the dlang.org forum shows that the situation with shared libraries hasn’t improved over the last two months, and my Ruby bindings can indeed crash suddenly during a D garbage collector run.

That was the main reason I hesitated between C++11 and D in April. Being pragmatic, I don’t expect the situation with the D runtime library to be completely resolved over the summer. So now I have to do something about it and take another approach.

Since shared libraries are out, the approach has to rely on interprocess communication.

The first thing I looked at was JSON.

I wrote a module for serializing BAM records to JSON, and it runs at the same speed as serialization to SAM (which, in turn, runs at about the same speed as ‘samtools view’ when compiled with GDC: not too bad!). So processing speed now depends on how fast JSON is parsed in the dynamic language. I looked at ‘cjson’ for Python and the ‘oj’ gem for Ruby. With the first one, I can deserialize 20,000 records per second on my netbook, while Ruby is twice as slow. In other words, Python with ‘cjson’ deserializes all records from a ~300MB file in 3 minutes, whereas conversion to SAM with samtools takes about 30 seconds. I think that’s not bad, and quite a good compromise between speed and simplicity. I could also serialize an alignment into several JSON objects, one for the most used fields and a few others for less used ones, and parse them lazily in Ruby/Python.

An example of how that works using Bio::Command:

require 'bio'
require 'oj'

class AlignmentIterator
  include Enumerable

  def initialize(filename)
    @filename = filename
  end

  def each
    Bio::Command.call_command(['./bam2json', @filename]) do |io|
      io.each do |line|
        yield Oj.load(line)
      end
    end
  end
end

FILENAME = 'HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam'

AlignmentIterator.new(FILENAME).each do |alignment|
  puts alignment['pos'] unless alignment['tags']['RG'].nil?
end

So it’s not hard at all to write wrappers for command-line tools in this way. Maybe I’ll also consider other serialization/deserialization tools such as Apache Thrift or MessagePack, though I doubt they would provide much better performance than JSON.

Progress in SAM parsing

I haven’t paid a lot of attention to parsing SAM yet, but at least parsing valid records is quite easy. Here are the 300 lines of Ragel/D code which parse an alignment record line: https://github.com/lomereiter/sambamba/blob/samragel/sam/sam_alignment.rl

It also turns out that Ragel has built-in tools for error actions; a couple of pages in its user guide are devoted specifically to error handling. I’ll investigate this and add appropriate actions to deal with invalid data; then the library will be ready to iterate over SAM records as well.


GSoC weekly report #4

June 11, 2012

Last week I implemented BAM output in my library. To make it useful, I also added support for alignment modification and creation.

BAM output

Currently, the interface consists of just one function, writeBAM, which takes the following parameters:

  1. Output stream (it can wrap stdout if you wish)
  2. Text of SAM header
  3. Information about reference sequences (which comes after the header)
  4. Range of alignments
  5. (Optional) Compression level, from -1 to 9. The default is -1, which stands for zlib’s default compression. Zero corresponds to so-called uncompressed BAM, although that turns out to be zlib-compressed as well, just with zero compression.

For a usage example, see the wiki: https://github.com/lomereiter/sambamba/wiki/Getting-started

Modification support

All tags, flags, and fields can now be modified. I used a lot of bit hacks to make this fast, especially for tag values, which can hold integers, floats, chars, strings, or arrays of numbers.

In my code, every tag value is a tagged union. The type tags are designed so that (typetag >> 5) is the size in bytes of the stored value (or of an element of a stored array), (typetag & 1) tells whether the value is an array, and (typetag & 3) tells whether it is a string. This makes it easy to copy data from a value to a stream; no further information is needed.
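
As an illustration, here is how such a decoding could look in Ruby; the concrete tag constants below are my own assumption, not the library’s actual values:

```ruby
# Hypothetical type-tag layout following the description above:
#   tag >> 5      -> size in bytes of the value (or of one array element)
#   tag & 1 == 1  -> the value is an array
#   tag & 3 == 3  -> the value is a string (checked first, since it also sets bit 0)
def describe(typetag)
  size = typetag >> 5
  kind = if (typetag & 3) == 3 then :string
         elsif (typetag & 1) == 1 then :array
         else :scalar
         end
  [size, kind]
end

raise unless describe(4 << 5) == [4, :scalar]        # e.g. a 32-bit integer
raise unless describe((4 << 5) | 1) == [4, :array]   # array of 32-bit values
raise unless describe((1 << 5) | 3) == [1, :string]  # string of 1-byte chars
```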

So it’s time to think about a new project title, because “BAMread” is misleading now.

Now I’ll concentrate on filling the wiki with details about what the library is capable of, and on making the installation process more user-friendly. There’s a lot of functionality already, but it needs to be documented.

Meanwhile, I’ll create a branch for playing with Ragel and will work on adding SAM support.


GSoC weekly report #3

Caching module

A week ago, I wrote a post about memoization. I’ve implemented a generic memoize function which uses a chosen cache strategy. Strategies can easily be added, and the cache can be accessed from multiple threads, with each value being calculated only once.

I have implemented least-used and FIFO cache strategies. Later I will study other cache algorithms and measure how they perform. For now, a simple FIFO cache is used for storing decompressed BGZF blocks.
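
The idea can be sketched as a minimal, single-threaded FIFO cache in Ruby (the real module is written in D and is thread-safe; the names here are illustrative):

```ruby
# Minimal single-threaded sketch of memoization with FIFO eviction.
class FifoCache
  def initialize(capacity)
    @capacity = capacity
    @store = {}          # Ruby hashes preserve insertion order -> FIFO
  end

  # Returns the cached value for key, computing it with the block on a miss.
  def fetch(key)
    return @store[key] if @store.key?(key)
    value = yield(key)
    @store.delete(@store.keys.first) if @store.size >= @capacity  # evict oldest
    @store[key] = value
  end
end

calls = 0
cache = FifoCache.new(2)
decompress = ->(block_id) { calls += 1; "data-#{block_id}" }

cache.fetch(1, &decompress)
cache.fetch(1, &decompress)   # served from cache, no recomputation
raise unless calls == 1
cache.fetch(2, &decompress)
cache.fetch(3, &decompress)   # evicts block 1 (the oldest entry)
cache.fetch(1, &decompress)   # recomputed after eviction
raise unless calls == 4
```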

Random access

The random access code is now covered with unit tests which show that access through an index file produces the same result as naive filtering. It also uses caching now, so a block doesn’t get decompressed twice.

It also works from Ruby; in general, I try to keep the bindings up to date.

Testing

I recently discussed this topic with Pjotr, and we decided that I’d write a simple CLI application for fetching alignments. It’s ready now: https://github.com/lomereiter/sambamba/blob/master/CLItools/bam-fetch/bam_fetch.d

The output is still slow; I’ll investigate how to make it faster. I’ve already rejected the D standard library’s output methods (see my previous post) and now use only functions from stdio.h for output.

My current ideas are:

  • get rid of the printf family of functions and use only fwrite; the *printf functions are slow because they parse a format string
  • get rid of fputc calls and collect bytes in a string buffer instead

Although the *printf functions are convenient, they are not fast enough for large amounts of data. There aren’t many types I need to convert to strings: mostly integers, and very occasionally some float values. I’ll just adapt some existing implementations, for instance this one: http://code.google.com/p/stringencoders/wiki/PerformanceNumToA (they claim numbers are converted to strings 20x faster than with snprintf).
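
The technique behind those fast converters is emitting digits from the end of a buffer using a precomputed two-digit lookup table. Here is a sketch of the algorithm in Ruby (the real implementations are in C and avoid allocations entirely):

```ruby
# Sketch of a fast unsigned-int-to-string routine: take two digits per
# division step via a precomputed "000102...9899" lookup table, so half as
# many divisions are needed compared to the naive digit-by-digit loop.
DIGIT_PAIRS = (0..99).map { |n| format("%02d", n) }.join

def utoa(n)
  return "0" if n.zero?
  buf = []
  while n >= 100
    n, r = n.divmod(100)
    buf.unshift(DIGIT_PAIRS[r * 2, 2])   # append the low two digits
  end
  buf.unshift(n < 10 ? n.to_s : DIGIT_PAIRS[n * 2, 2])
  buf.join
end

raise unless utoa(0) == "0"
raise unless utoa(7) == "7"
raise unless utoa(12345) == "12345"
```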


D and string formatting

The D standard library is completely unsuitable for outputting large amounts of text. That’s the conclusion I came to after profiling my BAM-to-SAM converter.

Let’s begin with the fact that there are two implementations of formatting in the language, both living in std.format. One is the doFormat function; the other is formattedWrite. The two have some subtle differences in behaviour (http://d.puremagic.com/issues/show_bug.cgi?id=4532).

doFormat

It treats all strings as UTF-8 text, which causes a lot of overhead for simple ASCII strings. Profiling showed that about 1/3 of the time is spent on this Unicode handling.

Also, I’m afraid to count how many bytes it allocates on the stack:

void formatArg(char fc)
{
    bool vbit;
    ulong vnumber;
    char vchar;
    dchar vdchar;
    Object vobject;
    real vreal;
    creal vcreal;
    Mangle m2;
    int signed = 0;
    uint base = 10;
    int uc;
    char[ulong.sizeof * 8] tmpbuf; // long enough to print long in binary
    const(char)* prefix = "";
    string s;

    /* here finally comes actual code */

It is used by all streams from the std.stream module.

formattedWrite

This one is newer and uses more template techniques, so there is less runtime overhead; more work is moved to compile time. It is used by the std.stdio module. But here another problem arises: about 1/3 of the time is spent locking and unlocking the output stream. Every time you use the writef function, it locks, writes a few bytes, and then unlocks. Absolutely awful.

 

Hey, I don’t need all this fancy stuff! As a C++ guy, I don’t want to pay for features I don’t use; I want performance, damn it! I tried std.cstream, which provides a stream interface around stdout. You think it worked? Ha-ha! A simple dout.printf("%d", 123) causes a segfault on x86_64 machines. I posted a bug, of course, but with respect to string formatting, D streams are completely f*cked up.

Another cool thing is that std.stream is going to be deprecated at some point anyway. Why?! Ah, just because ranges are so good that stream interfaces have to be redesigned with ranges in mind. Sorry, but I wrote a BAM format reader very quickly with std.stream, and I didn’t need any bloody ranges at all. Bytes, integers, and floats are not that abstract, you know. Even for streams operating on text, I can’t imagine a situation where I would need anything more than iterating over lines, which the standard library already provides.

The same applies to the container library. Better to have no containers at all than badly designed ones; that’s what Alexandrescu believes. And since he doesn’t have much time to flesh out the design because of his family and such, everybody will write queues and stacks from scratch, and use associative arrays as sets, until the miracle finally occurs after years of intense and deep thinking. It seems like D wants to become yet another Haskell, but with C-like syntax and a bit closer to the metal. That’s not necessarily bad, but the process is too slow. It’s been several years since people started asking for a container library. What do we see now? A bunch of strange stuff that can’t even be used easily in multithreaded applications. Array, for instance, is a reference-counted object which doesn’t use garbage collection. What the hell is that all about? Does it mean they don’t even hope to make the GC better than it is now? No: there’s even a GSoC project concerned with making the GC precise. Personally, I don’t consider it an issue, because 64-bit is the future. Though, as you can see from my experience with std.cstream, the D language developers don’t seem to think so.

So, my dear friend, use std.c.stdio, std.c.stdlib, std.c.string, and enjoy real performance with fprintf!


Ragel and bioinformatics

The term ‘bioinformatics’ should be familiar to the reader. So let’s become acquainted with Ragel.

What is it?

Ragel is a finite-state machine compiler. An FSM is a concept known to every computer scientist: basically, a set of predetermined states and transitions between them. There are final states and error states, and how the machine walks through the states depends on the input and some predetermined rules.

Why is it useful? Because it can parse any regular grammar, and most textual data formats in bioinformatics can be described by a regular language.
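
To make the idea concrete, here is a tiny hand-written FSM in Ruby recognizing the SAM tag-name grammar `alpha alnum` (one letter followed by a letter or digit); Ragel generates equivalent state tables automatically from the grammar:

```ruby
# A tiny hand-written FSM for the regular grammar "alpha alnum" (a SAM tag
# name: one letter followed by exactly one letter-or-digit). Any unexpected
# character, or extra input past the second character, drives it to :error.
def valid_tag_name?(s)
  state = :start
  s.each_char do |c|
    state = case state
            when :start     then c =~ /[A-Za-z]/    ? :saw_alpha : :error
            when :saw_alpha then c =~ /[A-Za-z0-9]/ ? :accept    : :error
            else :error
            end
  end
  state == :accept
end

raise unless valid_tag_name?("NM")
raise unless valid_tag_name?("X0")
raise if valid_tag_name?("1X")   # must start with a letter
raise if valid_tag_name?("NMX")  # too long
```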

Oh, enough theory… why should I care?

The code which Ragel generates is extremely fast. That’s a fact.

Ragel is used in

  • Gherkin gem — for parsing feature files. Here’s a line from Cucumber history:
    “Treetop is gone and replaced with Ragel. The new Ragel parser lives in the gherkin gem. Parse times are up to 100 times faster.”
  • Mongrel web server — for parsing HTTP requests.

    “Mongrel’s design is based around the idea that most of the security problems in HTTP servers comes from hand-coded parsers that are too “loose” with the protocol. Mongrel uses a generated parser (using Ragel) that is very strict and seems to block a huge number of attack attempts simply because it is so exacting. Since this protection comes at the HTTP level, any framework using Mongrel gets it for free.”

  • Rubinius — for implementing Array#pack/String#unpack methods.

Ok, how to use it?

The first step is defining a formal grammar for the format. That might not be easy, especially if there is no strict specification. However, when the specification gives detailed descriptions of all fields, this step is easy. It took me just 1.5 hours to write a more or less complete grammar for the SAM format. Here it is.

Then you can attach actions to be executed on transitions. An action is a snippet of code. The supported host languages are C, C++, D, Go, Java, and Objective-C. (Ruby is also supported, but it’s too slow for real-world parsers.)

Let’s take an example: we wish to extract all tag names from a SAM file. How do we do that once we have a Ragel grammar?

First, let’s find the line which deals with tags:

tag = alpha alnum ;

We need to add two actions, one at the beginning and one at the end (the snippets are in the D programming language):

action start_tag {
    tag_startpos = p - data.ptr;
}

action end_tag {
    tag_endpos = p - data.ptr;
    tagnames ~= cast(string)data[tag_startpos .. tag_endpos].dup;
}

tag = (alpha alnum) > start_tag % end_tag ;

Here p is a pointer to the current character, and data is the string being parsed; tag_startpos is a variable shared between the two actions. That’s it; now we just need to add some boilerplate:

void process(string data) {
    // Ragel requires the following 4 variables to be declared:
    char* p = cast(char*)data.ptr; // pointer to start
    char* pe = cast(char*)data.ptr + data.length; // pointer to end
    char* eof = null; // not needed in this simple example
    int cs; // identifier of the current state of FSM

    size_t tag_startpos;
    size_t tag_endpos;
    string[] tagnames;

    %%write init;
    %%write exec;
}

Now you can pass any SAM file to this function. Of course, in a real-world case you would read the file in chunks and pass those chunks instead of the whole file.
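
A chunked feeding loop could look like this in Ruby (the parser interface here is hypothetical; anything with a `<<` method works for the demonstration):

```ruby
require "tempfile"

# Feed a parser in fixed-size chunks instead of slurping the whole file.
# `parser` is any object responding to << (a hypothetical interface).
CHUNK_SIZE = 64 * 1024

def feed_in_chunks(path, parser, chunk_size = CHUNK_SIZE)
  File.open(path, "rb") do |f|
    while (chunk = f.read(chunk_size))
      parser << chunk
    end
  end
end

# Demonstration: "parse" by accumulating; the reassembled input must match.
data = "r001\t99\t7\n" * 1000
Tempfile.create("sam") do |tmp|
  tmp.write(data)
  tmp.flush
  acc = +""
  feed_in_chunks(tmp.path, acc, 256)
  raise unless acc == data
end
```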

 

The conclusion is that it’s much less work to write parsers this way. Furthermore, such grammar descriptions make it extremely easy to check that the parser conforms to the specification. Another nice feature is that validation happens during parsing, yet parsing remains very fast, because no memory copying occurs (except where you initiate it) and only the minimum of required operations is performed.

I’m going to write a SAM parser with Ragel when I find some time. I’m very interested in how it will compare with the samtools parser ;-)
