Home > programming > GSoC weekly report #2

GSoC weekly report #2

Ruby bindings

Now it’s possible to iterate over alignments from Ruby, although it’s much slower than doing the same from D. Iterating will make more sense when I’ll introduce filtering and random access. But anyway, Ruby syntax can be used for iteration, and it looks nice:

bam = BamFile.new "test/data/ex1_header.bam"
puts bam.alignments
        .collect {|read| read['NM'] } # access tag - it's integer
        .compact                      # remove nils
        .instance_eval { reduce(:+) / size.to_f } # get average

Validation

This week, I concentrated mostly on validation module. As I haven’t yet developed deep understanding of NGS technologies, I just did my best on translating all ‘must’, ‘ought to’, and ‘shall’ appearing in the specification, into program code :)

The results are good enough – my validator has catched a few errors in tags.bam provided by Peter Cock, all of them about invalid characters in read names and tags (spaces and @). That’s the output:

ReadNameContainsInvalidCharacters in read 'tag_zz:Z: '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:   '
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@'
ReadNameContainsInvalidCharacters in read 'tag_zz:Z:@@@'
ReadNameContainsInvalidCharacters in read 'tag_aa:A: '
	Tag error: NonPrintableCharacter ('aa' -> 'A: ')
InvalidTags in read 'tag_aa:A: '
ReadNameContainsInvalidCharacters in read 'tag_aa:A:@'

I tried to design my validation module to be customizable. Therefore t works this way: if you need to validate, say, alignments, you subclass AbstractAlignmentValidator, implement your own error handlers, and then use its ‘validate’ method for alignments.

class Validator : AbstractAlignmentValidator {
    void onError(ref Alignment alignment, AlignmentError e) {
        writeln(e, " in read '", alignment.read_name, "'");
    }

    void onError(ref Alignment alignment, CigarError e) {
        writeln("\tCIGAR error: ", e);
        writeln("\t\t   cigar string: ", alignment.cigar_string);
        writeln("\t\tsequence length: ", alignment.sequence_length);
    }

    void onError(string key, ref Value value, TagError e) {
        writeln("\tTag error: ", e , " ('", key, "' -> '", value.to_sam(), "')");
    }
}

Notice that if you’re too lazy to implement all abstract methods, you can use std.typecons and subclass BlackHole!AbstractAlignmentValidator ;-)

Currently, there’re 8 types of alignment errors, 3 – of CIGAR ones (I’m sure there should be more of them), 10 – of errors in tags, and about 10 – of SAM header errors which are catched with another abstract class.

Tags

By the way, have you noticed this Value type used in tag error handler? It’s a C-style tagged union with a bunch of methods which play well with typical D code. You can assign it to a variable of any type allowed by the specification, compare, get the stored value (standard ConvException is thrown if wrong type is requested), and convert it to SAM format that easy:

short[] array_of_shorts = [4, 5, 6];
Value v = array_of_shorts;
assert(v.is_numeric_array);
assert(v.to_sam == "B:s,4,5,6");
assert(to!(short[])(v) == array_of_shorts);

I also attempted to provide this kind of facility for Ruby. For instance, ‘tags’ method of an alignment returns a Hash with objects of corresponding Ruby types, be it arrays of integers, strings, or whatever.

Random access

Yesterday I started getting my head around how random access works. Now I already have something working, but it’s probably very buggy. And slow (no caching yet). And works with DMD only :(

Even with DMD I had to tweak my code in several places in order to avoid internal compiler errors. It’s somehow related to passing delegates as template arguments to standard library algorithms. Functional style suits my thinking, but the two compilers are not ready for that. The errors I saw from GDC were truly astonishing! 14k characters long mangled function name is something unforgettable :) Here it is: https://gist.github.com/2802993

Since I’ve almost finished my plans for the second week (Ruby bindings just need some polishing), I’m going to continue playing with random access stuff next week. If I’ll manage to get it working before July (for when it is planned), I’ll probably try to implement indexing then, and maybe even investigate indexing schemes other than the default one, i.e. BAI. The rumors are, BAMTools has its own index format, BTI, which makes for faster random accesses, up to 10x. (http://bioinformatics.oxfordjournals.org/content/27/12/1691.full)

Categories: programming Tags: ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a comment