Archive

Archive for October, 2012

New cool stuff in Sambamba!

October 5, 2012 Leave a comment

During last month, I have introduced quite a few cool features in Sambamba library. This blog post is a brief overview of them.

MD tags -> reference bases

If reads in the BAM file have MD tags (e.g. calculated with the aid of samtools calmd), you can reconstruct parts of reference sequence from them. This is done lazily, without any heap allocations. If you need a string instead of a lazy character sequence, just wrap it with to!string.

Example (taken from http://davetang.org/muse/2011/01/28/perl-and-sam/)

    auto read = Alignment("r4",
                         "AGTGATGGGAGGATGTCTCGTCTGTGAGTTACAGCA",
                         [CigarOperation(2, 'M'),
                          CigarOperation(1, 'I'),
                          CigarOperation(7, 'M'),
                          CigarOperation(6, 'D'),
                          CigarOperation(26, 'M')]);
    read["MD"] = "3C3T1^GCTCAG26";
    assert(equal(dna(read), "AGGCTGGTAGCTCAGGGATGTCTCGTCTGTGAGTTACAGCA"));

The same works for ranges of coordinate-sorted reads aligned to the same reference. That is, in this case no heap-allocations occur as well, and the sequence is also lazy.

Notice that last two reads in the following example don’t overlap, therefore space between them is filled with ‘N’.

(in this example, some bases are skipped to fit into screen)

    auto r1 = Alignment("r1",
                        "AGGTTTTGTGAGTGGGA...GCTCAGAGTGGCTCTA",
                        [CigarOperation(89, 'M'),   CigarOperation(1, 'S')]);
    r1.position = 60246;    r1["MD"] = "89";

    auto r2 = Alignment("r2",
                        "TGTGAGTGGGACAGTTGCAGCAAAACACAACCATAGGTGCCCA...CTCCAGCCCTT",
                        [CigarOperation(83, 'M'),   CigarOperation(7, 'S')]);
    r2.position = 60252;    r2["MD"] = "82T0";

    auto r3 = Alignment("r3",
                        "CATAGGTGCCCATCCACCAAGGCAGGCTCTCCATCTTGCTCAGAGTGGCTCTAGCCCTTG...",
                        [CigarOperation(90, 'M')]);
    r3.position = 60283;    r3["MD"] = "90";

    auto r4 = Alignment("r4",    "CCCTTGCTGAC...TATCCAACCG",   [CigarOperation(90, 'M')]);
    r4.position = 60337;    r4["MD"] = "90";

    auto r5 = Alignment("r5",    "GAGGCTCCACCC...AAGCACTTTGT", [CigarOperation(90, 'M')]);
    r5.position = 60432;    r5["MD"] = "90";

    auto reads = [r1, r2, r3, r4, r5];
    assert(equal(dna(reads), "AGGTTTTGTG...CTATCCAACCGNNNNNGAGGC...CTTTGT"));

Pileup

This feature requires dcollections to be installed into Sambamba root directory. Installation process is described at the top of pileuprange.d file, this will also be in README soon.

The behaviour might be a bit different from samtools with respect to insertions/deletions, I’ll revisit it later. For now the main focus is SNPs.

Some example code showing how to work with pileup in Sambamba:

foreach (column; pileupWithReferenceBases(reads)) {
    writeln(column.position, ' ', column.coverage);
    foreach (read; column.reads) {
        if (read.current_base != column.reference_base &&
            read.current_base_quality > 30 && read.cigar_operation == 'M')
        {
            ...
        }
    }
}

Improved efficiency of random access code

Today I have revisited code related to random access. A nice idea came to my mind, and I parallelized decompression much better than before.

For instance, here are the times for counting alignments from NA12043.chrom20.LS454.ssaha2.CEU.low_coverage.20101123.bam.v1, overlapping the region 20:1261261-12162614 (the number of such reads is 63417).

samtools

real 0m0.579s
user 0m0.560s
sys 0m0.020s

sambamba 0.2.8

real 0m0.349s
user 0m0.616s
sys 0m0.104s

sambamba trunk

real 0m0.152s
user 0m0.648s
sys 0m0.120s

The experiment was done, as usual, on Pjotr’s 8-core machine. These results mean, basically, that level of parallelism for accessing a region using BAI index is now the same as for serial access. I will update the Debian package soon :)

Categories: Uncategorized