Archive

Posts Tagged ‘gsoc’

GSoC weekly report #5

June 19, 2012 Leave a comment

Wiki

I’ve created several wiki pages covering basics of working with the library – how to open file, do random access, output SAM/BAM, how to access/modify tags, and so on.

Unfortunately, it turns out that DMD installation from zip package from their website is not that easy. It is not very clear which steps are needed to get it working. Francesco tried to install the compiler this way and ran into troubles. We’ll investigate this issue, though installing from repository should work fine.

Improved speed of conversion to SAM

Now the library is on par with samtools in this respect. The approach I have taken is to build string representation entirely on stack when possible (i.e. when it doesn’t cause overflow). I’ve noticed that alignments are typically not big, just several kilobytes in size, and it’s safe to allocate memory on stack in this case. Since I store alignments as contiguous chunks of memory, I know their size in binary form, and it’s easily estimated that in text form their size won’t grow more than 5x, the worst case being array of signed bytes in tags, where each byte can take 5 symbols (minus sign, three digits, and comma).

In case the alignment is bigger (which is not common case), old implementation is used which is about two times slower for file streams.

Parallelized BAM compression

If DMD didn’t suffer from issue 5710, this paragraph woudn’t even exist, because there would be nothing to tell about. But due to that issue and, particularly, std.parallelism module suffering from it, I had to add this ugly switch to my code:

switch (compression_level) {
    case -1: writeAlignmentBlocks!(-1)(chunked_blocks, stream); break;
    case 0: writeAlignmentBlocks!(0)(chunked_blocks, stream); break;
    case 1: writeAlignmentBlocks!(1)(chunked_blocks, stream); break;

    .............

    case 9: writeAlignmentBlocks!(9)(chunked_blocks, stream); break;
    default: assert(0);
}

For those who are not familiar with D, runtime parameter compression_level is sorta “converted” to compile-time parameter, otherwise the code doesn’t compile. Luckuly, compression level takes values in range -1 .. 9 only.

Reconsidering D/Ruby interaction

Unfortunately, one of latest threads on dlang.org forum shows that the situation with shared libraries didn’t improve over the last two months, and my Ruby bindings indeed can suddenly crash during D garbage collector run.

That was the main reason I tossed up between C++11 and D in April. Being pragmatic, I don’t hope the situation with D runtime library be completely solved during the summer. So now I have to do something about it and take another approach.

Since shared libraries are out, the approach has to deal with interprocess communication.

The first thing I looked at was JSON.

I wrote a module for serialization BAM records to JSON, and it works with the same speed as serialization to SAM (which, in turn, works with about the same speed as ‘samtools view’ when compiled with GDC – not too bad!). So the speed of processing now depends on how fast JSON is parsed in dynamic language. I looked at ‘cjson’ for Python, and ‘oj’ gem for Ruby. With first one, I’m able to deserialize 20000 records per second on my netbook, while Ruby is twice slower. In other words, Python with ‘cjson’ deserializes all records from ~300MB file in 3 minutes, whereas conversion to SAM with samtools takes about 30 seconds. I think it’s not that bad, and is quite a good compromise between speed and simplicity. I can also serialize an alignment into several JSON objects, one for most used fields, a few others for less used ones, and parse them lazily in Ruby/Python.

An example of how that works using Bio::Command:

require 'bio'
require 'oj'

class AlignmentIterator
  include Enumerable

  def initialize(filename)
    @filename = filename
  end

  def each
    Bio::Command.call_command(['./bam2json', @filename]) do |io|
      io.each do |line|
        yield Oj.load(line)
      end
    end
  end
end

FILENAME = 'HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam'

AlignmentIterator.new(FILENAME).each do |alignment|
  puts alignment['pos'] unless alignment['tags']['RG'].nil?
end

So it’s not hard at all to write wrappers for command line tools like this. Maybe I’ll consider some other tools for serialization/deserialization such as Apache Thrift or MessagePack, though I doubt they will provide much better performance than JSON.

Progress in SAM parsing

I didn’t pay a lot of attention to parsing SAM, but at least parsing valid records is quite easy. Here are 300 lines of Ragel/D code which parse alignment record line: https://github.com/lomereiter/sambamba/blob/samragel/sam/sam_alignment.rl

Also, it turns out that Ragel has some built-in tools for error actions. A couple of pages in its user guide are specifically devoted to error handling. I’ll investigate this and add appropriate actions to deal with invalid data, then the library will be ready to iterate over SAM records as well.

Categories: Uncategorized Tags:

GSoC weekly report #4

June 11, 2012 Leave a comment

Last week, I’ve implemented BAM output in my library. In order for it to be useful, I’ve also added support for alignment modification and creation.

BAM output

Currently, the interface consists of just one function, writeBAM, which takes following parameters:

  1. Output stream (it can wrap stdout if you wish)
  2. Text of SAM header
  3. Information about reference sequences (which comes after header)
  4. Range of alignments
  5. (Optional) Compression level, from -1 to 9. Default is -1 which stays for default compression in zlib. Zero corresponds to so-called uncompressed BAM, although it turned out to be zlib-compressed, but with zero compression.

For example of usage, see wiki: https://github.com/lomereiter/sambamba/wiki/Getting-started

Modification support

All tags, flags, and fields now can be modified. I have used a lot of bithacks to make it fast, especially with tag values which can hold integers, floats, chars, strings, or arrays of numbers.

In my code, every tag value is a tagged union. Type tags are designed in such a way that (typetag >> 5) is size of stored value (or element of stored array) in bytes, (typetag & 1) tells whether tag value is an array or not, and (typetag & 3) tells whether it is a string or not. This way, it is easy to copy data from value to stream, no more information is needed.

So it’s time to think about new project title, because BAMread is misleading now.

Now I’ll concentrate on filling wiki with details about what library is capable of, and making installation process more user-friendly. There’s a lot of functionality already, but it needs to be documented.

Meanwhile, I’ll create a branch for playing with Ragel and will work on adding SAM support.

Categories: Uncategorized Tags:

GSoC weekly report #3

Caching module

A week ago, I wrote a post about memoization. I implemented a generic memoize function which uses chosen cache strategy. The strategies can be easily added, and the cache can be accessed from multiple threads, each value being calculated only once.

I have implemented least used and FIFO cache strategies. Later I will study other cache algorithms and measure how they perform. Now simple FIFO cache is used for storing decompressed BGZF blocks.

Random access

is now covered with unit tests which show that accessing with index file produces the same result as naive filtering. Also, it now uses caching so that a block doesn’t get decompressed twice.

It also works from Ruby, in general I try to keep bindings up to date.

Testing

Recently we discussed with Pjotr this topic, and decided that I write a simple CLI application for fetching alignments. It’s ready now: https://github.com/lomereiter/sambamba/blob/master/CLItools/bam-fetch/bam_fetch.d

The output is still slow, I’ll investigate how to make it faster. I already rejected D standard library output methods (see my previous post) and now I’m using only functions from stdio.h for output.

My current ideas are:

  • get rid of functions from printf family and use only fwrite; *printf are slow because of parsing format string
  • get rid of fputc calls, collect bytes in a string buffer instead

Although *printf are convenient, they are not fast enough for big amounts of data. There’re not so many types which I need to convert to strings — mostly integers, and very rarely some float values. I’ll just copy-paste some existing implementations. For instance, this one: http://code.google.com/p/stringencoders/wiki/PerformanceNumToA (they claim that number is converted to string 20x faster than with the aid of snprintf).

Categories: Uncategorized Tags:

GSoC weekly report #2

Ruby bindings

Now it’s possible to iterate over alignments from Ruby, although it’s much slower than doing the same from D. Iterating will make more sense when I’ll introduce filtering and random access. But anyway, Ruby syntax can be used for iteration, and it looks nice:

bam = BamFile.new "test/data/ex1_header.bam"
puts bam.alignments
        .collect {|read| read['NM'] } # access tag - it's integer
        .compact                      # remove nils
        .instance_eval { reduce(:+) / size.to_f } # get average

Validation

This week, I concentrated mostly on validation module. As I haven’t yet developed deep understanding of NGS technologies, I just did my best on translating all ‘must’, ‘ought to’, and ‘shall’ appearing in the specification, into program code :)

The results are good enough – my validator has catched a few errors in tags.bam provided by Peter Cock, all of them about invalid characters in read names and tags (spaces and @). That’s the output:

ReadNameContainsInvalidCharacters in read 'tag_zz:Z: '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:   '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@'
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@@@'
ReadNameContainsInvalidCharacters in read 'tag_aa:A: '
	Tag error: NonPrintableCharacter ('aa' -> 'A: ')
InvalidTags in read 'tag_aa:A: '
ReadNameContainsInvalidCharacters in read 'tag_aa:A:@'

I tried to design my validation module to be customizable. Therefore t works this way: if you need to validate, say, alignments, you subclass AbstractAlignmentValidator, implement your own error handlers, and then use its ‘validate’ method for alignments.

class Validator : AbstractAlignmentValidator {
    void onError(ref Alignment alignment, AlignmentError e) {
        writeln(e, " in read '", alignment.read_name, "'");
    }

    void onError(ref Alignment alignment, CigarError e) {
        writeln("\tCIGAR error: ", e);
        writeln("\t\t   cigar string: ", alignment.cigar_string);
        writeln("\t\tsequence length: ", alignment.sequence_length);
    }

    void onError(string key, ref Value value, TagError e) {
        writeln("\tTag error: ", e , " ('", key, "' -> '", value.to_sam(), "')");
    }
}

Notice that if you’re too lazy to implement all abstract methods, you can use std.typecons and subclass BlackHole!AbstractAlignmentValidator ;-)

Currently, there’re 8 types of alignment errors, 3 – of CIGAR ones (I’m sure there should be more of them), 10 – of errors in tags, and about 10 – of SAM header errors which are catched with another abstract class.

Tags

By the way, have you noticed this Value type used in tag error handler? It’s a C-style tagged union with a bunch of methods which play well with typical D code. You can assign it to a variable of any type allowed by the specification, compare, get the stored value (standard ConvException is thrown if wrong type is requested), and convert it to SAM format that easy:

short[] array_of_shorts = [4, 5, 6];
Value v = array_of_shorts;
assert(v.is_numeric_array);
assert(v.to_sam == "B:s,4,5,6");
assert(to!(short[])(v) == array_of_shorts);

I also attempted to provide this kind of facility for Ruby. For instance, ‘tags’ method of an alignment returns a Hash with objects of corresponding Ruby types, be it arrays of integers, strings, or whatever.

Random access

Yesterday I started getting my head around how random access works. Now I already have something working, but it’s probably very buggy. And slow (no caching yet). And works with DMD only :(

Even with DMD I had to tweak my code in several places in order to avoid internal compiler errors. It’s somehow related to passing delegates as template arguments to standard library algorithms. Functional style suits my thinking, but the two compilers are not ready for that. The errors I saw from GDC were truly astonishing! 14k characters long mangled function name is something unforgettable :) Here it is: https://gist.github.com/2802993

Since I’ve almost finished my plans for the second week (Ruby bindings just need some polishing), I’m going to continue playing with random access stuff next week. If I’ll manage to get it working before July (for when it is planned), I’ll probably try to implement indexing then, and maybe even investigate indexing schemes other than the default one, i.e. BAI. The rumors are, BAMTools has its own index format, BTI, which makes for faster random accesses, up to 10x. (http://bioinformatics.oxfordjournals.org/content/27/12/1691.full)

Categories: programming Tags: ,

GSoC weekly report #1

The coding period has finally started. I’m glad to become a part of BioRuby community which is, fortunately, a very friendly one :)

The past week saw a lot of activities of various sorts, so this particular report is a bit long.

Rubinius issues and BioRuby unit tests

I filed two bugs to Rubinius bug tracker. One is already fixed, and that will solve some of unit test failures. Also another bug in String#split was fixed recently, and I expect that after the next update Rubinius will pass all unit tests in 1.8 mode on Travis CI.

The situation with 1.9 mode is a little more complicated. For instance, it doesn’t yet fully implements IO#popen and Kernel#Spawn method semantics, and that causes Bio::Command to behave incorrectly. I filed a bug yesterday about that.

I will continue to investigate which Rubinius bugs cause unit test failures. I like LLVM in general, and this implementation of Ruby in particular. JIT compiler + no JVM dependency + clean code… Not that fast yet, but improvements are definitely possible. Clang already outperforms GCC in some cases ;)

D code optimizations

I started a wiki page on github: https://github.com/lomereiter/BAMread/wiki/D-optimization-techniques

Currently, it contains 6 small tips, and I expect it to grow during the summer.

BDD

Also, I learned a bit about behaviour-driven development in general, and about Cucumber. I really like this approach :) It’s very motivating to see at a glance what needed, why, and by whom. And splitting everything into scenarios and steps allows to easily divide one milestone into several smaller ones. I would anyway keep all those things in my head, and it’s cool that written in Gherkin, they can drive my development and serve testing and documentation purposes :)

Ruby object creation time

During previous week, I optimized my BAM reading library quite a bit. Currently, it uses std.parallelism.parallelMap to unpack BGZF blocks in parallel. What I wonder, though, is does it make any sense to be able to iterate alignments from Ruby?

Let’s take some big file and see how long it takes to parse it serially:

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam
 unpacking took 24925ms
 number of records: 10585955

OK, and how long does it take to create corresponding number of Ruby objects? (Ruby 1.9.3-p194)

$ ruby -rbenchmark -e 'p Benchmark::measure{10585955.times{Object.new}}.total'
 4.54

Compared to 25 seconds, it doesn’t look that bad.

But let’s try parallel version (thanks to Pjotr, I can use a 8-core server via SSH):

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam 7
 unpacking took 6733ms
 number of records: 10585955

Hmm, so if we use one object per alignment, it will take at least 4.5s to just create all of them in Ruby, while D code will parse 2/3 of the file in that time.

And if we create not Objects, but FFI Structs like this simple one:

class S < FFI::Struct
layout :dummy, :int
end

things get worse — creation of 10M objects of this class takes almost 20 seconds. Well, one might argue it’s not that bad either, for 0.8GB file…

…but in fact, I see no use cases for iterating alignments from Ruby (although I’ve somehow managed to write Cucumber feature for that). My library will provide heavily optimized functions for common high-level tasks to be used from dynamic languages. Whenever you want speed — write code in D. Whenever you want interactiveness (e.g., fetch alignments overlapping particular region and visualize them somehow) — use your favourite dynamic language, the library will provide such functions. Both these things should be easy, that’s what I aim at.

Categories: programming Tags:

GSoC weekly report #0.5

So, another week of “pre-coding” period have passed.

I was mainly concentrated on D part of the project, adding functionality to it. During the week, I implemented parsing of the whole BAM file :) Today I wrote a simple utility in D, which uses my library to convert BAM to SAM. It doesn’t work with array tags yet, and not as fast as samtools, but nevertheless… On a couple of BAM files from test/data directory (namely, bins.bam and ex1_header.bam) the output is identical to that of samtools view — I checked with diff — and that kinda proves that everything works fine. Speed issues are mainly due to using std.variant module for storing tags. It uses runtime reflection which is quite slow. Maybe, there’re some other reasons. Anyway, I’m going to write my own tagged union type next week, it should improve the performance quite a bit, and also fix design flaws.

For testing tag parsing, I used file tags.bam provided to me by Peter Cock. It contains tests for all types of tags, and my library successfully passes them. Later I’ll experiment with possible speed improvements, and having unit tests covering full range of possible tag types is a must.

Also, I downloaded and compiled gdc from trunk. It provides decent performance, not worse than dmd, at least. We expect gdc to gain shared library support in the next two months. Before that happens, we have to use dmd, although there’re some issues with its garbage collector, causing segfaults. We discussed that with Marjan and Pjotr and decided that the best option in such circumstances would be to disable GC during development — testing library on small files won’t consume much memory anyway.

Another thing I downloaded and compiled is Rubinius. I’m going to investigate why it hangs on BioRuby unittests in 1.9 mode. Another mode, 1.8, seems to work fine except maybe some very minor bugs.

During next week, I’m going to learn how to use Cucumber and Rspec, improve D library performance a little, and start to write Ruby bindings. So it will be mostly ‘Ruby week’ ;-)

Categories: programming Tags:

[GSoC] Weekly report #0

Finally, OBF seems to be convinced about usefulness of new BAM parser. Thus now I can spend time on coding instead of writing/talking on why what I’m about to do is so important :)

I began to write code before the official start for two reasons. First, I just can’t wait to get my hands on this. The project is a great opportunity to learn more about D and Ruby, and what they’re capable of when being combined. By the way, it’s one of the goals we set for this summer — discover some best practices. Moreover, I’ll describe some techniques I already use, in this very post ;) The second reason is more prosaic. I’ll have exams in June, and therefore I have to do some stuff in advance. However, I believe I’ll deal with time management issues. Since July I’ll be able to work mostly full-time.

Now let’s move on to what was done during last week. The project repository is at https://github.com/lomereiter/sambamba. Directory structure is not well organized yet, I’ll do that a bit later.

The main struct, BamFile (bamfile.d), has just a constructor, method close() and, the most important at the moment, header property which returns SamHeader struct (defined in samheader.d).

Now I advise you to take a look at samheader.d. It is rather small file which begins with ~50 lines of rather crazy mixture of string mixins and mixin templates, and then…

mixin HeaderLineStruct!("SqLine",
          Field!("sequence_name", "SN"),
          Field!("sequence_length", "LN", uint),
          Field!("assembly", "AS"),
          Field!("md5", "M5"),
          Field!("species", "SP"),
          Field!("uri", "UR"));

That’s how I want to define structs for information from SAM header line fields. And D allows me to do that. This code generates struct definition, with static method parse() which takes a string and returns an instance of the struct. Really, if “each line is TAB-delimited and except the @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-letter string <…>”, why should I write more than this tag and the type of the value? And since the value type is almost always a string, let’s make it default in the Field template ;-)

So, here’s the first useful technique: whenever you find some code to be too repetitive, see if it’s worthwhile to create a small DSL based on mixins. D makes code generation easy with compile-time function evaluation, almost as easy as writing Lisp macros.

How about writing Ruby FFI bindings to these structs? I decided to apply DRY principle again, and here’s my another idea: scaffolding. RoR guys have been doing automatic boilerplate generation for years, so why not apply this method for generating FFI bindings?

So I wrote a small Ruby code generator for D structs, which currently supports numeric types, strings, and arrays as struct fields. Now just a few lines of code will do:

import samheader;
mixin(import("utils/scaffold.d"));

import std.stdio;

void main() {
    File("bindings/scaffolds/SqLine.rb", "w").write(toRuby!SqLine);
    File("bindings/scaffolds/RgLine.rb", "w").write(toRuby!RgLine);
    File("bindings/scaffolds/PgLine.rb", "w").write(toRuby!PgLine);
    File("bindings/scaffolds/SamHeader.rb", "w").write(toRuby!SamHeader);
}
The results of generation are in bindings/scaffolds. Since Ruby has Open Classes, you can reopen any class and define/undefine its methods. For instance, I haven’t found a way to determine at compile-time whether a field is private or not, and you might want to undef corresponding methods. However, there is a thread on dlang.org with suggestion to add getProtection trait. Hopefully, it will be added to the language in the near future.
With a few lines of hand-coded bindings, we can use D functions from Ruby:

load './bindings/libbam.rb'

bam = BamFile.new "HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam"
puts bam.header.pg_lines.map(&:program_name)

The code lacks tests and documentation, though, and that’s what I’ll be working on during the next week.

Categories: programming Tags: , ,