Archive for May, 2012

Thoughts on memoization

As I said in the previous post about GSoC, recently I started to think about how to implement random access to BAM file (which consists of gzipped blocks about 20kbytes each).

The problem is, we have a decompression function which is computationally intensive, and we are to convert requested blocks in the most efficient way, preferably using multithreading.

Let’s generalize it. We have a function f: X \to Y. What we want is the same function but returning the result faster. However, that’s not a good idea because it will be hard to parallelize iteration over a subset of X. It’s better to define a memoization operator M mapping f to Mf: X \to D_Y which will return a delay instead of actual value. To work with this delay, we need another function g: D_Y \to Y such that f = g \circ Mf. The nice thing about this approach is that the delay can be returned immediately, while computation can be done in another thread.

Thus we can get both things working:

  • Parallel map. We can use a cyclic buffer of delays, prefilling it at the start. This way, there will always be a fixed amount of tasks being executed in a multithreaded task pool. Advancing to the next element will be accompanied with pushing new task into the pool.
  • Evaluation at random points. Memoization operator can maintain a cache of delays, to which access will be syncronized.

I googled these idea and found this great article: It contains some code in Clojure, and particularly, it contains an implementation described above, using futures and cache with synchronized access.

Converting that code to D shouldn’t be a big problem thanks to std.parallelism module providing future/promise stuff. I currently aim at developing a reusable solution built on templates and allowing arbitrary function and cache implementation.

(to be continued…)

Categories: Uncategorized

GSoC weekly report #2

Ruby bindings

Now it’s possible to iterate over alignments from Ruby, although it’s much slower than doing the same from D. Iterating will make more sense when I’ll introduce filtering and random access. But anyway, Ruby syntax can be used for iteration, and it looks nice:

bam = "test/data/ex1_header.bam"
puts bam.alignments
        .collect {|read| read['NM'] } # access tag - it's integer
        .compact                      # remove nils
        .instance_eval { reduce(:+) / size.to_f } # get average


This week, I concentrated mostly on validation module. As I haven’t yet developed deep understanding of NGS technologies, I just did my best on translating all ‘must’, ‘ought to’, and ‘shall’ appearing in the specification, into program code :)

The results are good enough – my validator has catched a few errors in tags.bam provided by Peter Cock, all of them about invalid characters in read names and tags (spaces and @). That’s the output:

ReadNameContainsInvalidCharacters in read 'tag_zz:Z: '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:   '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@'
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@@@'
ReadNameContainsInvalidCharacters in read 'tag_aa:A: '
	Tag error: NonPrintableCharacter ('aa' -> 'A: ')
InvalidTags in read 'tag_aa:A: '
ReadNameContainsInvalidCharacters in read 'tag_aa:A:@'

I tried to design my validation module to be customizable. Therefore t works this way: if you need to validate, say, alignments, you subclass AbstractAlignmentValidator, implement your own error handlers, and then use its ‘validate’ method for alignments.

class Validator : AbstractAlignmentValidator {
    void onError(ref Alignment alignment, AlignmentError e) {
        writeln(e, " in read '", alignment.read_name, "'");

    void onError(ref Alignment alignment, CigarError e) {
        writeln("\tCIGAR error: ", e);
        writeln("\t\t   cigar string: ", alignment.cigar_string);
        writeln("\t\tsequence length: ", alignment.sequence_length);

    void onError(string key, ref Value value, TagError e) {
        writeln("\tTag error: ", e , " ('", key, "' -> '", value.to_sam(), "')");

Notice that if you’re too lazy to implement all abstract methods, you can use std.typecons and subclass BlackHole!AbstractAlignmentValidator ;-)

Currently, there’re 8 types of alignment errors, 3 – of CIGAR ones (I’m sure there should be more of them), 10 – of errors in tags, and about 10 – of SAM header errors which are catched with another abstract class.


By the way, have you noticed this Value type used in tag error handler? It’s a C-style tagged union with a bunch of methods which play well with typical D code. You can assign it to a variable of any type allowed by the specification, compare, get the stored value (standard ConvException is thrown if wrong type is requested), and convert it to SAM format that easy:

short[] array_of_shorts = [4, 5, 6];
Value v = array_of_shorts;
assert(v.to_sam == "B:s,4,5,6");
assert(to!(short[])(v) == array_of_shorts);

I also attempted to provide this kind of facility for Ruby. For instance, ‘tags’ method of an alignment returns a Hash with objects of corresponding Ruby types, be it arrays of integers, strings, or whatever.

Random access

Yesterday I started getting my head around how random access works. Now I already have something working, but it’s probably very buggy. And slow (no caching yet). And works with DMD only :(

Even with DMD I had to tweak my code in several places in order to avoid internal compiler errors. It’s somehow related to passing delegates as template arguments to standard library algorithms. Functional style suits my thinking, but the two compilers are not ready for that. The errors I saw from GDC were truly astonishing! 14k characters long mangled function name is something unforgettable :) Here it is:

Since I’ve almost finished my plans for the second week (Ruby bindings just need some polishing), I’m going to continue playing with random access stuff next week. If I’ll manage to get it working before July (for when it is planned), I’ll probably try to implement indexing then, and maybe even investigate indexing schemes other than the default one, i.e. BAI. The rumors are, BAMTools has its own index format, BTI, which makes for faster random accesses, up to 10x. (

Categories: programming Tags: ,

GSoC weekly report #1

The coding period has finally started. I’m glad to become a part of BioRuby community which is, fortunately, a very friendly one :)

The past week saw a lot of activities of various sorts, so this particular report is a bit long.

Rubinius issues and BioRuby unit tests

I filed two bugs to Rubinius bug tracker. One is already fixed, and that will solve some of unit test failures. Also another bug in String#split was fixed recently, and I expect that after the next update Rubinius will pass all unit tests in 1.8 mode on Travis CI.

The situation with 1.9 mode is a little more complicated. For instance, it doesn’t yet fully implements IO#popen and Kernel#Spawn method semantics, and that causes Bio::Command to behave incorrectly. I filed a bug yesterday about that.

I will continue to investigate which Rubinius bugs cause unit test failures. I like LLVM in general, and this implementation of Ruby in particular. JIT compiler + no JVM dependency + clean code… Not that fast yet, but improvements are definitely possible. Clang already outperforms GCC in some cases ;)

D code optimizations

I started a wiki page on github:

Currently, it contains 6 small tips, and I expect it to grow during the summer.


Also, I learned a bit about behaviour-driven development in general, and about Cucumber. I really like this approach :) It’s very motivating to see at a glance what needed, why, and by whom. And splitting everything into scenarios and steps allows to easily divide one milestone into several smaller ones. I would anyway keep all those things in my head, and it’s cool that written in Gherkin, they can drive my development and serve testing and documentation purposes :)

Ruby object creation time

During previous week, I optimized my BAM reading library quite a bit. Currently, it uses std.parallelism.parallelMap to unpack BGZF blocks in parallel. What I wonder, though, is does it make any sense to be able to iterate alignments from Ruby?

Let’s take some big file and see how long it takes to parse it serially:

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam
 unpacking took 24925ms
 number of records: 10585955

OK, and how long does it take to create corresponding number of Ruby objects? (Ruby 1.9.3-p194)

$ ruby -rbenchmark -e 'p Benchmark::measure{10585955.times{}}.total'

Compared to 25 seconds, it doesn’t look that bad.

But let’s try parallel version (thanks to Pjotr, I can use a 8-core server via SSH):

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam 7
 unpacking took 6733ms
 number of records: 10585955

Hmm, so if we use one object per alignment, it will take at least 4.5s to just create all of them in Ruby, while D code will parse 2/3 of the file in that time.

And if we create not Objects, but FFI Structs like this simple one:

class S < FFI::Struct
layout :dummy, :int

things get worse — creation of 10M objects of this class takes almost 20 seconds. Well, one might argue it’s not that bad either, for 0.8GB file…

…but in fact, I see no use cases for iterating alignments from Ruby (although I’ve somehow managed to write Cucumber feature for that). My library will provide heavily optimized functions for common high-level tasks to be used from dynamic languages. Whenever you want speed — write code in D. Whenever you want interactiveness (e.g., fetch alignments overlapping particular region and visualize them somehow) — use your favourite dynamic language, the library will provide such functions. Both these things should be easy, that’s what I aim at.

Categories: programming Tags:

GSoC weekly report #0.5

So, another week of “pre-coding” period have passed.

I was mainly concentrated on D part of the project, adding functionality to it. During the week, I implemented parsing of the whole BAM file :) Today I wrote a simple utility in D, which uses my library to convert BAM to SAM. It doesn’t work with array tags yet, and not as fast as samtools, but nevertheless… On a couple of BAM files from test/data directory (namely, bins.bam and ex1_header.bam) the output is identical to that of samtools view — I checked with diff — and that kinda proves that everything works fine. Speed issues are mainly due to using std.variant module for storing tags. It uses runtime reflection which is quite slow. Maybe, there’re some other reasons. Anyway, I’m going to write my own tagged union type next week, it should improve the performance quite a bit, and also fix design flaws.

For testing tag parsing, I used file tags.bam provided to me by Peter Cock. It contains tests for all types of tags, and my library successfully passes them. Later I’ll experiment with possible speed improvements, and having unit tests covering full range of possible tag types is a must.

Also, I downloaded and compiled gdc from trunk. It provides decent performance, not worse than dmd, at least. We expect gdc to gain shared library support in the next two months. Before that happens, we have to use dmd, although there’re some issues with its garbage collector, causing segfaults. We discussed that with Marjan and Pjotr and decided that the best option in such circumstances would be to disable GC during development — testing library on small files won’t consume much memory anyway.

Another thing I downloaded and compiled is Rubinius. I’m going to investigate why it hangs on BioRuby unittests in 1.9 mode. Another mode, 1.8, seems to work fine except maybe some very minor bugs.

During next week, I’m going to learn how to use Cucumber and Rspec, improve D library performance a little, and start to write Ruby bindings. So it will be mostly ‘Ruby week’ ;-)

Categories: programming Tags:

[GSoC] Weekly report #0

Finally, OBF seems to be convinced about usefulness of new BAM parser. Thus now I can spend time on coding instead of writing/talking on why what I’m about to do is so important :)

I began to write code before the official start for two reasons. First, I just can’t wait to get my hands on this. The project is a great opportunity to learn more about D and Ruby, and what they’re capable of when being combined. By the way, it’s one of the goals we set for this summer — discover some best practices. Moreover, I’ll describe some techniques I already use, in this very post ;) The second reason is more prosaic. I’ll have exams in June, and therefore I have to do some stuff in advance. However, I believe I’ll deal with time management issues. Since July I’ll be able to work mostly full-time.

Now let’s move on to what was done during last week. The project repository is at Directory structure is not well organized yet, I’ll do that a bit later.

The main struct, BamFile (bamfile.d), has just a constructor, method close() and, the most important at the moment, header property which returns SamHeader struct (defined in samheader.d).

Now I advise you to take a look at samheader.d. It is rather small file which begins with ~50 lines of rather crazy mixture of string mixins and mixin templates, and then…

mixin HeaderLineStruct!("SqLine",
          Field!("sequence_name", "SN"),
          Field!("sequence_length", "LN", uint),
          Field!("assembly", "AS"),
          Field!("md5", "M5"),
          Field!("species", "SP"),
          Field!("uri", "UR"));

That’s how I want to define structs for information from SAM header line fields. And D allows me to do that. This code generates struct definition, with static method parse() which takes a string and returns an instance of the struct. Really, if “each line is TAB-delimited and except the @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-letter string <…>”, why should I write more than this tag and the type of the value? And since the value type is almost always a string, let’s make it default in the Field template ;-)

So, here’s the first useful technique: whenever you find some code to be too repetitive, see if it’s worthwhile to create a small DSL based on mixins. D makes code generation easy with compile-time function evaluation, almost as easy as writing Lisp macros.

How about writing Ruby FFI bindings to these structs? I decided to apply DRY principle again, and here’s my another idea: scaffolding. RoR guys have been doing automatic boilerplate generation for years, so why not apply this method for generating FFI bindings?

So I wrote a small Ruby code generator for D structs, which currently supports numeric types, strings, and arrays as struct fields. Now just a few lines of code will do:

import samheader;

import std.stdio;

void main() {
    File("bindings/scaffolds/SqLine.rb", "w").write(toRuby!SqLine);
    File("bindings/scaffolds/RgLine.rb", "w").write(toRuby!RgLine);
    File("bindings/scaffolds/PgLine.rb", "w").write(toRuby!PgLine);
    File("bindings/scaffolds/SamHeader.rb", "w").write(toRuby!SamHeader);
The results of generation are in bindings/scaffolds. Since Ruby has Open Classes, you can reopen any class and define/undefine its methods. For instance, I haven’t found a way to determine at compile-time whether a field is private or not, and you might want to undef corresponding methods. However, there is a thread on with suggestion to add getProtection trait. Hopefully, it will be added to the language in the near future.
With a few lines of hand-coded bindings, we can use D functions from Ruby:

load './bindings/libbam.rb'

bam = "HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam"

The code lacks tests and documentation, though, and that’s what I’ll be working on during the next week.

Categories: programming Tags: , ,

Reading BAM files in D

It turned out to be really easy, using and std.zlib modules :)

Sequence of actions is as follows:

  1. Create an instance of BufferedFile, providing file name to its constructor.
  2. Then pass it to constructor of EndianStream, specifying little endian byte order
  3. Read BGZF blocks from this stream, according to their specification, and decompress them using std.zlib uncompress function (winbits must be -15).
  4. Then the range of decompressed blocks is being wrapped by subclass of Stream, using MemoryStream internally. To make a Stream subclass, you only need to define just three methods: readBlock, writeBlock, and seek. The latter two may just throw an exception in this partucular case. That took me as little as 64 lines of code.
  5. Now that we have this full-fledged stream, we again pass it to EndianStream constructor. And then we can read BAM data from it.
About 200 lines in total. That’s how powerful D standard library is. Much like Python ‘batteries’ ;-) And what I like the most, comparing D to C++, are ranges instead of STL iterators. They are really easier to write and easier to use, at least for iterating when the length is not known in advance. Calling empty() is way more intuitive than comparing with some idiotic default iterator indicating end of stream, like it’s done with istreambuf_iterator.

Now that I’ve learned how to read data from BAM, next step will be SAM header parsing. The header is typically small, about a hundred of lines, only in extraordinary cases being more than 4MB, so it makes little sense to optimize speed. Better to spend more time on validation part.

Categories: programming Tags: ,

On programming languages and HPC

Some people believe one should never rewrite old code. I’m not among them, and let me tell why, taking samtools as an example.

If you want your programs to scale on modern multicores, you must use parallelism. Free lunch is over, damn it! It’s great that samtools’ authors realize that, but hey, do you really want to add 40 lines of code each time you need a trivial parallel for loop?

Modern languages make such stuff easy to use, and moreover, when it’s in a standard library, whenever someone comes up with the idea how to make some routine faster, you get this performance boost automatically with the new version. Conversely, if this ‘someone’ is you, you will help not only yourself but the whole community. It’s all about not reinventing the wheel and code reuse.

I’m sorry to say this, but to me, a lot of C coders seem a bit like cavemen. I’ve seen a lot of examples of using home-made object system, or string library, or container/algorithm library. They live without generic programming, and when they come to the point where even macros are of no help, they roll out their own code generators in Ruby/Python/etc. The situation with OOP is a bit better because of GObject existence. As for multithreading, I would advise many of them to study OpenMP and look if it’s suitable for their tasks. (That’s KISS principle.) And only if it is not, this is the case for pthread.h.

Well, one could ask, how about reusing old code? The thing is, in this particular case, samtools were not designed to be used as API. All the existing bindings to dynamic languages arose out of necessity. To appreciate what a pain in the ass it is, I advise you to read how Pjotr Prins describes his experience with creating bindings. Seems like nobody ever gave a damn about refactoring.

Now, why are we so concerned about having API, shared libraries, and dynamic languages? Scientists love DSLs, and R is the brightest example for that. In the absence of DSL, one uses some dynamic language so as to concentrate on his/her tasks instead of fighting with language quirks. Many people also like interactiveness, i.e. having REPL environment with ability to visualize data they’re working with. But dynamic languages are usually slow, and that’s the reason we have to use compiled code. There is some progress in creating languages both fast and dynamic, e.g. Julia, but it’s far too experimental at the moment.

And we don’t want to duplicate efforts in creating language bindings. At the moment, we have SWIG for C/C++, and GObject Introspection for Vala is being developed. There’re some developments for D, namely, RuDy and PyD. D is great in that it allows to generate C wrappers at compile time — you shouldn’t use a separate parser, just use __traits keyword and std.traits module. And compile-time function evaluation works well enough to write even compile-time raytracer.

Another point why D matters, in my opinion, is generic programming. Science is all about abstractions. I feel uncomfortable writing low-level code and being unable to express them in the most appropriate way, not losing any execution speed along the road. C++11 is another good language, though the template syntax is weird and opportunities are limited compared to D. Other than that, I see no languages supporting compile-time generic programming.

So, as I see it, new high-performance scientific libraries should be written in Cilk Plus or D. In my opinion, that’s the best way for them to be maintainable, generic, extremely fast, and easily bindable altogether. Another option would be to have a lot of domain-specific languages compiling to native code or using JIT compiler, like BioScala. Time shall tell…

Categories: Uncategorized Tags: