EU CodeFest 2014

September 22, 2014 Leave a comment

Yesterday morning I returned from the CodeFest, and it was a wonderful experience.

The format of the event is unusual. There are very few talks – in this year, about BioJS and BioNode, but they serve as a flavoring rather than the meat. The people are totally free to choose whatever they want to do. Of course, some structure is still needed, and for this matter at the very start the attendants organized into groups according to their fields of interest: BioJS, BioNode, RDF & semantic web, pipelines & HPC. Unsurprisingly, I stayed with the latter.

This division doesn’t at all mean there’s no communication between these groups. To the contrary, I noted for myself that seeing large groups raises curiosity about what’s so hot about their topic. Before the event, I had no idea what RDF (and SPARQL) is, but once I saw how much interest this topic attracts, it felt impossible to remain ignorant, so I approached one of guys sitting at the RDF table and asked him all sorts of novice questions about the technology, its current usage, performance and so on. My overall impression is that RDF vs relational DBs is like dynamic vs static typing in programming languages – it doesn’t require to think upfront about the data organization, thus being much easier to administer and play with.

Strikingly, main interest of about half of people was BioJS. As much as I don’t fancy JS, I have to admit its superiority when it comes to visualization. So, out of curiosity I attended the BioJS tutorial, from which I learned a few things. First of all, it was clarified that there’s a bit of a mess now because version 2.0 is not yet released but at the same time this is what everyone should use. Second, the tutorials and interactive examples at http://edu.biojs.net/ were highlighted, which look quite inspirational for newcomers. Finally, it’s interesting to observe how BioJS goes to decentralized development in the same way BioRuby did, so that joining the growing community is as easy as it gets, one just writes a plugin and makes it available through npm. And one is not forced to use JS, e.g. one of the most sophisticated plugins, MSA, is entirely in CoffeeScript! Overall, the presentation left the impression of a very impressive and well organized project.

It was also the first time I met a bioinformatics freelancer, Khalil from Belgium; and it is no wonder that he was one of (the one?) the oldest persons at the event—one has to build reputation in scientific community first, I believe. In some way, freelancing is fascinating because of the feel of freedom, but it doesn’t suit most of bioinformatics work; IMO it’s only applicable in the case of pure development, since any research usually requires some collaboration in order to be productive. Therefore, existence of freelancing in bioinformatics indicates just how vast the field is.

The last impressive piece is only related to the Codefest because of the hosting location, EBI. Pjotr and me met James Bonfield, the wizard of sequencing data compression and the author of CRAM implementation in HTSlib. We chiefly discussed  the CRAM format and its future. We agreed that it’s theoretically possible for BAM to fade out eventually, because flexibility of CRAM allows to write records in uncompressed form, making it suitable for piping. I also decided after the discussion that making another implementation goes against all possible principles of good programming, and I should reuse HTSlib in Sambamba and become a contributor to the library. I’ve heard from James that his implementation in Staden is somewhat better because the version in HTSlib slighly suffered from integration, but I’m not yet certain if it’s worthwhile to link to less popular library. So far I created D bindings to HTSlib, and was able to successfully read CRAM and convert its records into BAM (cram.d), and the performance is substantially better than that of my own implementation of CRAM reader (nevertheless, the latter served its purpose of understanding the format). The next step for me is to introduce more flexibility to HTSlib so that users can switch between the default thread pool implementation by James and any other implementations (in my case, it’s reusing thread pool from D standard library).

In summary, attending such events helps great for keeping in sync with recent developments, looking at things from a different perspective, and simply enjoying geek atmosphere. Kudos to the organizers!


I’m also very thankful for the Open Bioinformatics Foundation, my mentoring organization in GSoC 2012, for covering my travel & accommodation costs. It’s a remarkable example of a successful umbrella non-profit organization, associating with yet another one, Software in the Public Interest.

In this year, I participated in scoring GSoC 2014 proposals to O|B|F, and now that a month has passed since the end of the program, I must note that the average proposal scores correlate amazingly well with the quality of work & involvement of the students. Every year O|B|F meets my notion of success w.r.t. GSoC participation, in that at least one student continues contributing to their project after the end of summer. Hopefully this trend will continue!

Categories: Uncategorized

ggplot2

July 27, 2014 Leave a comment

ggplot2 sounds like a great package but I currently don’t have the optimism bandwidth to use it :(

I can, to some extent, forgive removal of relative height/width parameter for facet grids, but I strictly maintain that axis positioning (top/bottom, left/right) must be a basic feature of any decent plotting package.

Categories: Uncategorized

Getting Bibtex entries from DOI (command line)

May 12, 2014 1 comment

Here’s how:

#!/usr/bin/env bash
curl -LH “Accept: text/bibliography; style=bibtex” http://dx.doi.org/$1

Categories: Uncategorized Tags:

Performance of Sambamba MRI bindings

October 27, 2013 Leave a comment

So I hacked up a simple BAM reader yesterday as a C extension. For now, BamRead class in Ruby only exposes a ‘name’ method just for the sake of not being totally useless. After all, what I’m interested in is performance of conversion from D to Ruby. And with careful coding the cost is not that large. (What I love about C extensions is full control over what happens.)

In fact, on multicore systems it easily beats PySAM (this simple test is counting reads and computing average read name length; the file contains Ion Torrent 200bp data, 570MB in size)

Python 2.7.3 + PySAM
$ time python test.py
3186621
14.2702508394

real    0m13.034s
user    0m12.588s
sys     0m0.412s

Ruby 2.1.0-preview1
$ time LD_LIBRARY_PATH=. ruby test.rb
3186621
14.270250839368723

real    0m4.820s
user    0m16.756s
sys     0m0.700s

However, the dynamic loading of my D library sometimes just hangs due to some deadlock =\ I currently use 2.063 where support for shared libraries is not official, because in 2.064 there’s an issue with zlib that I reported to their bugtracker. Hopefully that will be resolved soon.

Categories: Uncategorized

Ruby: FFI or not FFI?

October 26, 2013 Leave a comment

I started making Ruby bindings for my SAM/BAM library, and it’s not at all clear whether to use FFI or good old C extension for MRI.

For you to get the clear picture, I’m going support only Linux and Mac OS X, distributing binary packages, because that’s the easiest option given the current state of DMD compiler infrastructure—it’s not available by default on almost every system, like GCC.

One factor is convenience. By that word FFI proponents usually mean that they are too lazy to sit and write some C code. But hey, since I’m going to distribute binary packages only, I can just use Rice which should be much easier.

Another important factor is interoperatibility. Well, I did some simple benchmarks and discovered that in MRI, the speed sucks if I use FFI, and in JRuby, simply calling Picard library gives the same or better performance (of course, if you’re aware of flags –server and -Xji.objectProxyCache=false). But more importantly, on JRuby the overhead with either FFI or Java Integration is huge =\ Namely, counting reads in a 2.5GB file took about 2 minutes, but adding computation of average read name length added another 2 minutes, giving total of 4! For comparison, it took only 2 minutes using PySAM, and this is the baseline that should be followed.

My conclusions from this are that
1) JVM is not well suited for dynamic languages, and this opinion is supported by JRuby developers. Hopefully, Topaz will mature eventually.
2) Overhead of FFI is too substantial to ignore in my particular case, where we want to work with lots of short reads.

So, I will go with Rice. The bonus part is that I will have to write C++ classes wrapping D functionality, which could theoretically be also used for CPython extension using Boost.Python.

EDIT:  compared bindings generated with Rice and SWIG, the latter wins. So, the full chain is D -> C bindings -> C++ wrapper -> SWIG wrapper -> Ruby

EDIT2: after some playing with SWIG I realized that I want ultimate control over what’s going on, and finally decided to write in good old C.

Categories: Uncategorized Tags: , ,

First experience with Mono-D

December 16, 2012 Leave a comment

Today I installed Mono-D, hoping that it might make my experience with D a little better. Unfortunately, it was far from my expectations. When it is used on small pieces of code, everything looks fine. But when I added the whole BioD library (which I slowly develop) to the include paths, performance decreased drastically and it became impossible to type code. Each autocompletion took at least 30 seconds to resolve :-( No thanks, I would rather type method name manually, 10 times faster.

And even neglecting slowness, completion results are also far from what I need. I use a lot of templates and `auto` declarations that seem to confuse Mono-D engine. I also had no luck with type deduction of iteration variable in foreach statements.

Thus, I’m staying with Vim, although it’s totally dumb when it comes to the language semantics. Hopefully, Mono-D will improve over years. Perhaps, it already works well for Java-style projects that don’t use templates all over the place, but that’s not my case.

UPDATE (March 06, 2013)

Tried again with Xamarin Studio and 0.5.1.3. Speed issue has been resolved, but autocompletion still doesn’t play well with templated code.

Categories: Uncategorized

New cool stuff in Sambamba!

October 5, 2012 Leave a comment

During last month, I have introduced quite a few cool features in Sambamba library. This blog post is a brief overview of them.

MD tags -> reference bases

If reads in the BAM file have MD tags (e.g. calculated with the aid of samtools calmd), you can reconstruct parts of reference sequence from them. This is done lazily, without any heap allocations. If you need a string instead of a lazy character sequence, just wrap it with to!string.

Example (taken from http://davetang.org/muse/2011/01/28/perl-and-sam/)

    auto read = Alignment("r4",
                         "AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA",
                         [CigarOperation(2, 'M'),
                          CigarOperation(1, 'I'),
                          CigarOperation(7, 'M'),
                          CigarOperation(6, 'D'),
                          CigarOperation(26, 'M')]);
    read["MD"] = "3C3T1^GCTCAG26";
    assert(equal(dna(read), "AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA"));

The same works for ranges of coordinate-sorted reads aligned to the same reference. That is, in this case no heap-allocations occur as well, and the sequence is also lazy.

Notice that last two reads in the following example don’t overlap, therefore space between them is filled with ‘N’.

(in this example, some bases are skipped to fit into screen)

    auto r1 = Alignment("r1",
                        "AGGTTTTGTGAGTGGGA...GCTCAGAGTGGCTCTA",
                        [CigarOperation(89, 'M'),   CigarOperation(1, 'S')]);
    r1.position = 60246;    r1["MD"] = "89";

    auto r2 = Alignment("r2",
                        "TGTGAGTGGGACAGTTGCAGCAAAACACAACCATAGGTGCCCA...CTCCAGCCCTT",
                        [CigarOperation(83, 'M'),   CigarOperation(7, 'S')]);
    r2.position = 60252;    r2["MD"] = "82T0";

    auto r3 = Alignment("r3",
                        "CATAGGTGCCCATCCACCAAGGCAGGCTCTCCATCTTGCTCAGAGTGGCTCTAGCCCTTG...",
                        [CigarOperation(90, 'M')]);
    r3.position = 60283;    r3["MD"] = "90";

    auto r4 = Alignment("r4",    "CCCTTGCTGAC...TATCCAACCG",   [CigarOperation(90, 'M')]);
    r4.position = 60337;    r4["MD"] = "90";

    auto r5 = Alignment("r5",    "GAGGCTCCACCC...AAGCACTTTGT", [CigarOperation(90, 'M')]);
    r5.position = 60432;    r5["MD"] = "90";

    auto reads = [r1, r2, r3, r4, r5];
    assert(equal(dna(reads), "AGGTTTTGTG...CTATCCAACCGNNNNNGAGGC...CTTTGT"));

Pileup

This feature requires dcollections to be installed into Sambamba root directory. Installation process is described at the top of pileuprange.d file, this will also be in README soon.

The behaviour might be a bit different from samtools with respect to insertions/deletions, I’ll revisit it later. For now the main focus is SNPs.

Some example code showing how to work with pileup in Sambamba:

foreach (column; pileupWithReferenceBases(reads)) {
    writeln(column.position, ' ', column.coverage);
    foreach (read; column.reads) {
        if (read.current_base != column.reference_base &&
            read.current_base_quality > 30 && read.cigar_operation == 'M')
        {
            ...
        }
    }
}

Improved efficiency of random access code

Today I have revisited code related to random access. A nice idea came to my mind, and I parallelized decompression much better than before.

For instance, here are the times for counting alignments from NA12043.chrom20.LS454.ssaha2.CEU.low_coverage.20101123.bam.v1, overlapping the region 20:1261261-12162614 (the number of such reads is 63417).

samtools

real 0m0.579s
user 0m0.560s
sys 0m0.020s

sambamba 0.2.8

real 0m0.349s
user 0m0.616s
sys 0m0.104s

sambamba trunk

real 0m0.152s
user 0m0.648s
sys 0m0.120s

The experiment was done, as usual, on Pjotr’s 8-core machine. These results mean, basically, that level of parallelism for accessing a region using BAI index is now the same as for serial access. I will update the Debian package soon :)

Categories: Uncategorized
Follow

Get every new post delivered to your Inbox.