GSoC weekly report #6
The bindings now work with JSON output from the sambamba command-line tool, which must be installed separately. All the old Cucumber scenarios pass except for one that is not very important. I’ll write the documentation this week and package the bindings as a gem.
I got it working, and introduced a new type, SamFile, which is very similar to BamFile except that it doesn’t provide random access. Unit tests ensure that parseAlignmentLine(toSam(read)) == read for all valid reads (for invalid input, the invalid fields are default-initialized).
To allow invalid data, I added a simple Ragel rule, invalid_field, which just reads until the next tab character:
mandatoryfields = (qname | invalid_field) '\t' (flag | invalid_field) '\t' (rname | invalid_field) '\t' (pos | invalid_field) '\t' (mapq | invalid_field) '\t' ... // and so on
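The same fallback idea can be sketched outside Ragel. The Python below is only an illustration under stated assumptions, not the actual sambamba parser: the function names, the validation rules, and the default values are all hypothetical. Each mandatory field is tried with its own parser, and a field that fails validation gets a default value instead of aborting the whole line.

```python
import re

# Hypothetical defaults used when a field is invalid (illustrative only).
DEFAULTS = {"qname": "", "flag": 0, "rname": "*", "pos": 0, "mapq": 255}


def int_field(lo, hi):
    """Build a parser for a plain decimal integer in the range [lo, hi]."""
    def parse(text):
        if not re.fullmatch(r"[0-9]+", text):
            raise ValueError("not a decimal integer: %r" % text)
        value = int(text)
        if not lo <= value <= hi:
            raise ValueError("out of range: %d" % value)
        return value
    return parse


def name_field(text):
    """Accept any non-empty string without whitespace."""
    if not text or any(ch.isspace() for ch in text):
        raise ValueError("bad name: %r" % text)
    return text


# The first five mandatory fields, in order (assumed validation rules).
PARSERS = [
    ("qname", name_field),
    ("flag", int_field(0, 65535)),
    ("rname", name_field),
    ("pos", int_field(0, 2**31 - 1)),
    ("mapq", int_field(0, 255)),
]


def parse_mandatory_fields(line):
    """Parse tab-separated fields; invalid or missing ones keep their default."""
    record = dict(DEFAULTS)
    fields = line.rstrip("\n").split("\t")
    for (name, parser), text in zip(PARSERS, fields):
        try:
            record[name] = parser(text)
        except ValueError:
            # Counterpart of the invalid_field alternative: skip the bad
            # text up to the next tab and leave the default in place.
            pass
    return record
```

For example, parse_mandatory_fields("read1\tabc\tchr1\t100\t60") keeps the valid fields and leaves flag at its default, because "abc" is not a number.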
Parsing is now about 3x as slow as in samtools, but that has nothing to do with Ragel; the main reason is too many memory allocations. I did some profiling, and doubling the speed won’t take a lot of effort. As for tripling, I’m not so sure, but I’ll try :)
Sambamba CLI tool
It accepts both SAM and BAM files as input, and can output either SAM or JSON (the speed is the same in both cases). It also allows filtering by quality and/or read group, and accepts samtools syntax for regions.
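For readers unfamiliar with the samtools region syntax, a region is written as a reference name, optionally followed by a start position and an end position, as in chr1, chr1:100 or chr1:100-200. The Python sketch below shows one way to split such a string. It is a simplified illustration, not sambamba's actual parsing code: the function name parse_region is hypothetical, and reference names that themselves contain a colon are not handled.

```python
import re

# ref, optionally ":start", optionally "-end"; digits may contain commas.
REGION = re.compile(
    r"^(?P<ref>[^:]+)(?::(?P<start>[0-9,]+)(?:-(?P<end>[0-9,]+))?)?$"
)


def to_int(text):
    """Convert '1,000' style numbers to int; None stays None."""
    if text is None:
        return None
    return int(text.replace(",", ""))


def parse_region(region):
    """Return (reference, start, end); start and end are None if omitted."""
    match = REGION.match(region)
    if match is None:
        raise ValueError("cannot parse region: %r" % region)
    return (
        match.group("ref"),
        to_int(match.group("start")),
        to_int(match.group("end")),
    )
```

The coordinates are returned exactly as written in the string; converting them to whatever convention the reader uses internally is left out of the sketch.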