Archive for the ‘programming’ Category

GSoC weekly report #2

Ruby bindings

Now it’s possible to iterate over alignments from Ruby, although it’s much slower than doing the same from D. Iterating will make more sense when I’ll introduce filtering and random access. But anyway, Ruby syntax can be used for iteration, and it looks nice:

bam = "test/data/ex1_header.bam"
puts bam.alignments
        .collect {|read| read['NM'] } # access tag - it's integer
        .compact                      # remove nils
        .instance_eval { reduce(:+) / size.to_f } # get average


This week, I concentrated mostly on validation module. As I haven’t yet developed deep understanding of NGS technologies, I just did my best on translating all ‘must’, ‘ought to’, and ‘shall’ appearing in the specification, into program code :)

The results are good enough – my validator has catched a few errors in tags.bam provided by Peter Cock, all of them about invalid characters in read names and tags (spaces and @). That’s the output:

ReadNameContainsInvalidCharacters in read 'tag_zz:Z: '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:   '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@'
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@@@'
ReadNameContainsInvalidCharacters in read 'tag_aa:A: '
	Tag error: NonPrintableCharacter ('aa' -> 'A: ')
InvalidTags in read 'tag_aa:A: '
ReadNameContainsInvalidCharacters in read 'tag_aa:A:@'

I tried to design my validation module to be customizable. Therefore t works this way: if you need to validate, say, alignments, you subclass AbstractAlignmentValidator, implement your own error handlers, and then use its ‘validate’ method for alignments.

class Validator : AbstractAlignmentValidator {
    void onError(ref Alignment alignment, AlignmentError e) {
        writeln(e, " in read '", alignment.read_name, "'");

    void onError(ref Alignment alignment, CigarError e) {
        writeln("\tCIGAR error: ", e);
        writeln("\t\t   cigar string: ", alignment.cigar_string);
        writeln("\t\tsequence length: ", alignment.sequence_length);

    void onError(string key, ref Value value, TagError e) {
        writeln("\tTag error: ", e , " ('", key, "' -> '", value.to_sam(), "')");

Notice that if you’re too lazy to implement all abstract methods, you can use std.typecons and subclass BlackHole!AbstractAlignmentValidator ;-)

Currently, there’re 8 types of alignment errors, 3 – of CIGAR ones (I’m sure there should be more of them), 10 – of errors in tags, and about 10 – of SAM header errors which are catched with another abstract class.


By the way, have you noticed this Value type used in tag error handler? It’s a C-style tagged union with a bunch of methods which play well with typical D code. You can assign it to a variable of any type allowed by the specification, compare, get the stored value (standard ConvException is thrown if wrong type is requested), and convert it to SAM format that easy:

short[] array_of_shorts = [4, 5, 6];
Value v = array_of_shorts;
assert(v.to_sam == "B:s,4,5,6");
assert(to!(short[])(v) == array_of_shorts);

I also attempted to provide this kind of facility for Ruby. For instance, ‘tags’ method of an alignment returns a Hash with objects of corresponding Ruby types, be it arrays of integers, strings, or whatever.

Random access

Yesterday I started getting my head around how random access works. Now I already have something working, but it’s probably very buggy. And slow (no caching yet). And works with DMD only :(

Even with DMD I had to tweak my code in several places in order to avoid internal compiler errors. It’s somehow related to passing delegates as template arguments to standard library algorithms. Functional style suits my thinking, but the two compilers are not ready for that. The errors I saw from GDC were truly astonishing! 14k characters long mangled function name is something unforgettable :) Here it is:

Since I’ve almost finished my plans for the second week (Ruby bindings just need some polishing), I’m going to continue playing with random access stuff next week. If I’ll manage to get it working before July (for when it is planned), I’ll probably try to implement indexing then, and maybe even investigate indexing schemes other than the default one, i.e. BAI. The rumors are, BAMTools has its own index format, BTI, which makes for faster random accesses, up to 10x. (

Categories: programming Tags: ,

GSoC weekly report #1

The coding period has finally started. I’m glad to become a part of BioRuby community which is, fortunately, a very friendly one :)

The past week saw a lot of activities of various sorts, so this particular report is a bit long.

Rubinius issues and BioRuby unit tests

I filed two bugs to Rubinius bug tracker. One is already fixed, and that will solve some of unit test failures. Also another bug in String#split was fixed recently, and I expect that after the next update Rubinius will pass all unit tests in 1.8 mode on Travis CI.

The situation with 1.9 mode is a little more complicated. For instance, it doesn’t yet fully implements IO#popen and Kernel#Spawn method semantics, and that causes Bio::Command to behave incorrectly. I filed a bug yesterday about that.

I will continue to investigate which Rubinius bugs cause unit test failures. I like LLVM in general, and this implementation of Ruby in particular. JIT compiler + no JVM dependency + clean code… Not that fast yet, but improvements are definitely possible. Clang already outperforms GCC in some cases ;)

D code optimizations

I started a wiki page on github:

Currently, it contains 6 small tips, and I expect it to grow during the summer.


Also, I learned a bit about behaviour-driven development in general, and about Cucumber. I really like this approach :) It’s very motivating to see at a glance what needed, why, and by whom. And splitting everything into scenarios and steps allows to easily divide one milestone into several smaller ones. I would anyway keep all those things in my head, and it’s cool that written in Gherkin, they can drive my development and serve testing and documentation purposes :)

Ruby object creation time

During previous week, I optimized my BAM reading library quite a bit. Currently, it uses std.parallelism.parallelMap to unpack BGZF blocks in parallel. What I wonder, though, is does it make any sense to be able to iterate alignments from Ruby?

Let’s take some big file and see how long it takes to parse it serially:

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam
 unpacking took 24925ms
 number of records: 10585955

OK, and how long does it take to create corresponding number of Ruby objects? (Ruby 1.9.3-p194)

$ ruby -rbenchmark -e 'p Benchmark::measure{10585955.times{}}.total'

Compared to 25 seconds, it doesn’t look that bad.

But let’s try parallel version (thanks to Pjotr, I can use a 8-core server via SSH):

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam 7
 unpacking took 6733ms
 number of records: 10585955

Hmm, so if we use one object per alignment, it will take at least 4.5s to just create all of them in Ruby, while D code will parse 2/3 of the file in that time.

And if we create not Objects, but FFI Structs like this simple one:

class S < FFI::Struct
layout :dummy, :int

things get worse — creation of 10M objects of this class takes almost 20 seconds. Well, one might argue it’s not that bad either, for 0.8GB file…

…but in fact, I see no use cases for iterating alignments from Ruby (although I’ve somehow managed to write Cucumber feature for that). My library will provide heavily optimized functions for common high-level tasks to be used from dynamic languages. Whenever you want speed — write code in D. Whenever you want interactiveness (e.g., fetch alignments overlapping particular region and visualize them somehow) — use your favourite dynamic language, the library will provide such functions. Both these things should be easy, that’s what I aim at.

Categories: programming Tags:

GSoC weekly report #0.5

So, another week of “pre-coding” period have passed.

I was mainly concentrated on D part of the project, adding functionality to it. During the week, I implemented parsing of the whole BAM file :) Today I wrote a simple utility in D, which uses my library to convert BAM to SAM. It doesn’t work with array tags yet, and not as fast as samtools, but nevertheless… On a couple of BAM files from test/data directory (namely, bins.bam and ex1_header.bam) the output is identical to that of samtools view — I checked with diff — and that kinda proves that everything works fine. Speed issues are mainly due to using std.variant module for storing tags. It uses runtime reflection which is quite slow. Maybe, there’re some other reasons. Anyway, I’m going to write my own tagged union type next week, it should improve the performance quite a bit, and also fix design flaws.

For testing tag parsing, I used file tags.bam provided to me by Peter Cock. It contains tests for all types of tags, and my library successfully passes them. Later I’ll experiment with possible speed improvements, and having unit tests covering full range of possible tag types is a must.

Also, I downloaded and compiled gdc from trunk. It provides decent performance, not worse than dmd, at least. We expect gdc to gain shared library support in the next two months. Before that happens, we have to use dmd, although there’re some issues with its garbage collector, causing segfaults. We discussed that with Marjan and Pjotr and decided that the best option in such circumstances would be to disable GC during development — testing library on small files won’t consume much memory anyway.

Another thing I downloaded and compiled is Rubinius. I’m going to investigate why it hangs on BioRuby unittests in 1.9 mode. Another mode, 1.8, seems to work fine except maybe some very minor bugs.

During next week, I’m going to learn how to use Cucumber and Rspec, improve D library performance a little, and start to write Ruby bindings. So it will be mostly ‘Ruby week’ ;-)

Categories: programming Tags:

[GSoC] Weekly report #0

Finally, OBF seems to be convinced about usefulness of new BAM parser. Thus now I can spend time on coding instead of writing/talking on why what I’m about to do is so important :)

I began to write code before the official start for two reasons. First, I just can’t wait to get my hands on this. The project is a great opportunity to learn more about D and Ruby, and what they’re capable of when being combined. By the way, it’s one of the goals we set for this summer — discover some best practices. Moreover, I’ll describe some techniques I already use, in this very post ;) The second reason is more prosaic. I’ll have exams in June, and therefore I have to do some stuff in advance. However, I believe I’ll deal with time management issues. Since July I’ll be able to work mostly full-time.

Now let’s move on to what was done during last week. The project repository is at Directory structure is not well organized yet, I’ll do that a bit later.

The main struct, BamFile (bamfile.d), has just a constructor, method close() and, the most important at the moment, header property which returns SamHeader struct (defined in samheader.d).

Now I advise you to take a look at samheader.d. It is rather small file which begins with ~50 lines of rather crazy mixture of string mixins and mixin templates, and then…

mixin HeaderLineStruct!("SqLine",
          Field!("sequence_name", "SN"),
          Field!("sequence_length", "LN", uint),
          Field!("assembly", "AS"),
          Field!("md5", "M5"),
          Field!("species", "SP"),
          Field!("uri", "UR"));

That’s how I want to define structs for information from SAM header line fields. And D allows me to do that. This code generates struct definition, with static method parse() which takes a string and returns an instance of the struct. Really, if “each line is TAB-delimited and except the @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-letter string <…>”, why should I write more than this tag and the type of the value? And since the value type is almost always a string, let’s make it default in the Field template ;-)

So, here’s the first useful technique: whenever you find some code to be too repetitive, see if it’s worthwhile to create a small DSL based on mixins. D makes code generation easy with compile-time function evaluation, almost as easy as writing Lisp macros.

How about writing Ruby FFI bindings to these structs? I decided to apply DRY principle again, and here’s my another idea: scaffolding. RoR guys have been doing automatic boilerplate generation for years, so why not apply this method for generating FFI bindings?

So I wrote a small Ruby code generator for D structs, which currently supports numeric types, strings, and arrays as struct fields. Now just a few lines of code will do:

import samheader;

import std.stdio;

void main() {
    File("bindings/scaffolds/SqLine.rb", "w").write(toRuby!SqLine);
    File("bindings/scaffolds/RgLine.rb", "w").write(toRuby!RgLine);
    File("bindings/scaffolds/PgLine.rb", "w").write(toRuby!PgLine);
    File("bindings/scaffolds/SamHeader.rb", "w").write(toRuby!SamHeader);
The results of generation are in bindings/scaffolds. Since Ruby has Open Classes, you can reopen any class and define/undefine its methods. For instance, I haven’t found a way to determine at compile-time whether a field is private or not, and you might want to undef corresponding methods. However, there is a thread on with suggestion to add getProtection trait. Hopefully, it will be added to the language in the near future.
With a few lines of hand-coded bindings, we can use D functions from Ruby:

load './bindings/libbam.rb'

bam = "HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam"

The code lacks tests and documentation, though, and that’s what I’ll be working on during the next week.

Categories: programming Tags: , ,

Reading BAM files in D

It turned out to be really easy, using and std.zlib modules :)

Sequence of actions is as follows:

  1. Create an instance of BufferedFile, providing file name to its constructor.
  2. Then pass it to constructor of EndianStream, specifying little endian byte order
  3. Read BGZF blocks from this stream, according to their specification, and decompress them using std.zlib uncompress function (winbits must be -15).
  4. Then the range of decompressed blocks is being wrapped by subclass of Stream, using MemoryStream internally. To make a Stream subclass, you only need to define just three methods: readBlock, writeBlock, and seek. The latter two may just throw an exception in this partucular case. That took me as little as 64 lines of code.
  5. Now that we have this full-fledged stream, we again pass it to EndianStream constructor. And then we can read BAM data from it.
About 200 lines in total. That’s how powerful D standard library is. Much like Python ‘batteries’ ;-) And what I like the most, comparing D to C++, are ranges instead of STL iterators. They are really easier to write and easier to use, at least for iterating when the length is not known in advance. Calling empty() is way more intuitive than comparing with some idiotic default iterator indicating end of stream, like it’s done with istreambuf_iterator.

Now that I’ve learned how to read data from BAM, next step will be SAM header parsing. The header is typically small, about a hundred of lines, only in extraordinary cases being more than 4MB, so it makes little sense to optimize speed. Better to spend more time on validation part.

Categories: programming Tags: ,

[GSoC] Intro

April 26, 2012 Leave a comment

So… I’m accepted to GSoC this year, and that’s cool!

What is more cool, though, is that we have a great team in BioRuby community, working on similar tasks, and everybody is interested in performance and robustness. For me, it’s quite challenging, because the other two guys have professional experience and… uhm, a bit older :) It means I’m gonna learn a lot of new things during the summer :)

What is my project about? Parsing. Currently existing parsers mostly ignore existence of multicore processors, while it’s crucial to use all the available power when your data is gigabytes or even more in size. The long-term goal is to put the end to this situation.

But that’s not the only problem. Current implementations are mostly written in C, badly tested and documented, and hard to use from dynamic languages which are great for research.

My project is about parsing BAM data. It will be done in D, with Ruby FFI bindings. Of course I won’t be able to implement a lot of functionality. I’m gonna implement:

  • iterating over alignments
  • optional validation of both SAM header and alignment records
  • random access via BAM index file
  • and, of course, API available for use from any language via FFI

The purpose is to show that modern languages like D make not only for fast development, but also for fast and robust software utilizing multicore processors. Currently, DMD compiler is not as fast as C (because its developers are currently more focused on getting rid of compiler bugs), and GDC compiler doesn’t yet have support for shared libraries. However, this situation is likely to change in the near future, and with respect to speed the focus will be more on parallelizing things.

Why D?

  • Batteries included: std.zlib for decompressing data, for working with data streams — no more worries about endianness! (
  • Unit tests and contracts built into the language. That makes for robustness.
  • Great opportunities for generic programming. The code will be reusable. Bioinformaticians tend to reinvent the wheel (how many similar formats are there, huh?), and this situation is to be changed.
  • Built-in support for Actor model which makes multithreaded programs easy to reason about.
  • Effective string implementation, support for slicing. That’s invaluable in parsing.

I’ll add some links here, mostly for myself:

1) validation criteria, very good list.

2) Here’s what my design with respect to parallelization will be based on:

(I’ve come up with the same idea, and it’s easier to post a link than to describe it in my own words)

More detailed sequence diagram is here:

I’m gonna devise a generic solution for transforming one InputRange into another one in parallel, so that one will need to provide only transforming function and number of worker threads. The code will be encapsulated and thus reusable.

Also, #TODO: transforming one Range into a chunked one is already there ( but it doesn’t work with InputRanges. It should be easy to extend it for InputRanges with a bunch of additional static ifs, and make a pull request.

Categories: programming Tags:

Bug hunting

April 21, 2012 Leave a comment

It might seem that everything is clear about D and Ruby interaction.

Not quite so, as it turns out. Let’s consider this (almost) trivial example:


module test;
import std.array : array;
import std.algorithm : splitter;

extern (C) void foo() {
 array(splitter("abc", ' ')); // any string will do


#!/usr/bin/env ruby
require 'ffi'
module Bar
 extend FFI::Library
 ffi_lib './'
 attach_function :foo, [], :void # bind our function
end # let's call it!

(Full code and Makefile is available at

Running this Ruby code leads to a segmentation fault with the following backtrace:

./ [0xf72017d0]
./ [0xf7202f6f] /usr/include/d/std/array.d:1984
./ [0xf7202e82] /usr/include/d/std/array.d:1960
./ [0xf7203146] /usr/include/d/std/array.d:2007
./ [0xf7202b36] /usr/include/d/std/array.d:66

What the hell is going on?! Let’s investigate…

By looking at std.array source code near the lines mentioned in backtrace, and some playing with it (I, for example, inserted printf call before 1984th line to test which expression causes segfault), I managed to make our foo even simpler:

import core.bitop : bsr;

extern (C) void foo() {
 auto dummy = 0L / bsr(2);

Now, some simple observations.

  1. If you change 0L to 0 without ‘L’, you get no segfault.
  2. If you replace bsr(2) with its value, i.e. 1, you also get no segfault.
  3. If this construction is used in standalone D application, everything is also OK.
  4. If we don’t use the sophisticated scheme with Runtime.initialize() and Runtime.terminate(), everything also works fine. (Let me remind you that it’s necessary to initialize D runtime in order to be able to use its GC and most library functions.) That is, if you call attach and detach functions manually via FFI, it’s ok.
  5. If you replace / with +, -, or *, also no segfault.

Sounds completely crazy, doesn’t it?

Good news is that our observations contain a workaround: don’t use linking with C object file but use -shared flag of dmd instead, and call D runtime initialization function manually. I believe it should work. And, as a matter of fact, D should support “static this() {…}” as a function to be called at library loading. So basically, this workaround is just replacing one kludge with another.

Categories: programming Tags: ,