Home > programming > GSoC weekly report #1

GSoC weekly report #1

The coding period has finally started. I’m glad to become a part of BioRuby community which is, fortunately, a very friendly one :)

The past week saw a lot of activities of various sorts, so this particular report is a bit long.

Rubinius issues and BioRuby unit tests

I filed two bugs to Rubinius bug tracker. One is already fixed, and that will solve some of unit test failures. Also another bug in String#split was fixed recently, and I expect that after the next update Rubinius will pass all unit tests in 1.8 mode on Travis CI.

The situation with 1.9 mode is a little more complicated. For instance, it doesn’t yet fully implements IO#popen and Kernel#Spawn method semantics, and that causes Bio::Command to behave incorrectly. I filed a bug yesterday about that.

I will continue to investigate which Rubinius bugs cause unit test failures. I like LLVM in general, and this implementation of Ruby in particular. JIT compiler + no JVM dependency + clean code… Not that fast yet, but improvements are definitely possible. Clang already outperforms GCC in some cases ;)

D code optimizations

I started a wiki page on github: https://github.com/lomereiter/BAMread/wiki/D-optimization-techniques

Currently, it contains 6 small tips, and I expect it to grow during the summer.

BDD

Also, I learned a bit about behaviour-driven development in general, and about Cucumber. I really like this approach :) It’s very motivating to see at a glance what needed, why, and by whom. And splitting everything into scenarios and steps allows to easily divide one milestone into several smaller ones. I would anyway keep all those things in my head, and it’s cool that written in Gherkin, they can drive my development and serve testing and documentation purposes :)

Ruby object creation time

During previous week, I optimized my BAM reading library quite a bit. Currently, it uses std.parallelism.parallelMap to unpack BGZF blocks in parallel. What I wonder, though, is does it make any sense to be able to iterate alignments from Ruby?

Let’s take some big file and see how long it takes to parse it serially:

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam
 unpacking took 24925ms
 number of records: 10585955

OK, and how long does it take to create corresponding number of Ruby objects? (Ruby 1.9.3-p194)

$ ruby -rbenchmark -e 'p Benchmark::measure{10585955.times{Object.new}}.total'
 4.54

Compared to 25 seconds, it doesn’t look that bad.

But let’s try parallel version (thanks to Pjotr, I can use a 8-core server via SSH):

$ ./readbam ../HG00476.chrom11.ILLUMINA.bwa.CHS.low_coverage.20111114.bam 7
 unpacking took 6733ms
 number of records: 10585955

Hmm, so if we use one object per alignment, it will take at least 4.5s to just create all of them in Ruby, while D code will parse 2/3 of the file in that time.

And if we create not Objects, but FFI Structs like this simple one:

class S < FFI::Struct
layout :dummy, :int
end

things get worse — creation of 10M objects of this class takes almost 20 seconds. Well, one might argue it’s not that bad either, for 0.8GB file…

…but in fact, I see no use cases for iterating alignments from Ruby (although I’ve somehow managed to write Cucumber feature for that). My library will provide heavily optimized functions for common high-level tasks to be used from dynamic languages. Whenever you want speed — write code in D. Whenever you want interactiveness (e.g., fetch alignments overlapping particular region and visualize them somehow) — use your favourite dynamic language, the library will provide such functions. Both these things should be easy, that’s what I aim at.

Categories: programming Tags:
  1. No comments yet.
  1. No trackbacks yet.

Leave a comment