[GSoC] Intro

So… I’m accepted to GSoC this year, and that’s cool!

What’s even cooler, though, is that we have a great team in the BioRuby community working on similar tasks, and everybody is interested in performance and robustness. For me, it’s quite challenging, because the other two guys have professional experience and are… uhm, a bit older :) It means I’m gonna learn a lot of new things during the summer :)

What is my project about? Parsing. Existing parsers mostly ignore the existence of multicore processors, while it’s crucial to use all the available power when your data is gigabytes in size or even more. The long-term goal is to put an end to this situation.

But that’s not the only problem. Current implementations are mostly written in C, poorly tested and documented, and hard to use from dynamic languages, which are great for research.

More specifically, my project is about parsing BAM data. It will be done in D, with Ruby FFI bindings. Of course, I won’t be able to implement a lot of functionality, so here’s what I’m gonna implement:

  • iterating over alignments
  • optional validation of both the SAM header and the alignment records
  • random access via a BAM index file
  • and, of course, an API available for use from any language via FFI

The purpose is to show that modern languages like D make not only for fast development, but also for fast and robust software that utilizes multicore processors. Currently, the code generated by the DMD compiler is not as fast as what C compilers produce (because its developers are more focused on getting rid of compiler bugs), and the GDC compiler doesn’t yet have support for shared libraries. However, this situation is likely to change in the near future, and with respect to speed the focus will be more on parallelizing things.

Why D?

  • Batteries included: std.zlib for decompressing data, std.stream for working with data streams — no more worries about endianness! (http://dlang.org/phobos/std_stream.html#EndianStream)
  • Unit tests and contracts built into the language. That makes for robustness.
  • Great opportunities for generic programming. The code will be reusable. Bioinformaticians tend to reinvent the wheel (how many similar formats are there, huh?), and that has to change.
  • Built-in support for the Actor model, which makes multithreaded programs easy to reason about.
  • Efficient string implementation with support for slicing. That’s invaluable in parsing.
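
To make a few of these points concrete, here’s a tiny sketch (not code from the project). `firstField` is a hypothetical helper, and the byte-order conversion assumes `std.bitmanip` rather than the `std.stream` classes mentioned above. It shows a contract, a unit test, zero-copy slicing, and decoding a little-endian integer (the byte order BAM uses):

```d
import std.bitmanip : littleEndianToNative;

// Hypothetical helper for illustration: returns the text before the first tab.
// The result is a slice of `line`, so no characters are copied.
string firstField(string line)
in
{
    // Contract: callers must not pass an empty line.
    assert(line.length > 0, "line must not be empty");
}
do // spelled `body` in older compilers
{
    foreach (i, c; line)
    {
        if (c == '\t')
            return line[0 .. i];
    }
    return line;
}

unittest
{
    string record = "read1\t0\tchr1\t100";
    string name = firstField(record);
    assert(name == "read1");
    // Slicing: `name` points into `record`, nothing was copied.
    assert(name.ptr is record.ptr);

    // Four bytes in little-endian order decode to 1 on any host.
    ubyte[4] raw = [0x01, 0x00, 0x00, 0x00];
    assert(littleEndianToNative!uint(raw) == 1);
}

void main()
{
}
```

The unit test only runs when the file is compiled with the `-unittest` switch, e.g. `dmd -unittest -run example.d` (the file name is just a placeholder).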

I’ll add some links here, mostly for myself:

1) SAM validation criteria, a very good list:

http://genome.sph.umich.edu/wiki/SAM_Validation_Criteria

2) Here’s what my design with respect to parallelization will be based on:

http://stackoverflow.com/a/6763435

(I’ve come up with the same idea, and it’s easier to post a link than to describe it in my own words)

A more detailed sequence diagram is here: http://goo.gl/iVnyH

I’m gonna devise a generic solution for transforming one InputRange into another in parallel, so that one only needs to provide the transforming function and the number of worker threads. The code will be encapsulated and thus reusable.
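
A rough sketch of what such an interface could look like, built on something Phobos already ships rather than on a new design: `TaskPool.map` from `std.parallelism` accepts an input range, buffers it, and applies a function on worker threads while keeping the results in order. `Counter` and `square` are made-up stand-ins for a real record source and a real transformation, and the thread count and buffer size are arbitrary.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;
import std.parallelism : TaskPool;
import std.range : iota;

// Made-up input range yielding current, current + 1, ..., limit
// (no length, no indexing, so it is a plain input range).
struct Counter
{
    int current;
    int limit;

    @property bool empty() const { return current > limit; }
    @property int front() const { return current; }
    void popFront() { ++current; }
}

// Stand-in for an expensive per-record transformation.
int square(int x)
{
    return x * x;
}

void main()
{
    // The caller picks the number of worker threads...
    auto pool = new TaskPool(4);
    scope (exit) pool.finish();

    // ...and the transforming function. 100 is the buffer size.
    auto squares = pool.map!square(Counter(1, 1000), 100);

    // Results come out in the original order.
    assert(equal(squares, iota(1, 1001).map!(x => x * x)));
}
```

The result is itself an input range, so it can be fed into the next processing stage the same way.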

Also, #TODO: transforming a range into a chunked one is already there (http://dlang.org/phobos/std_range.html#chunks), but it doesn’t work with InputRanges. It should be easy to extend it to InputRanges with a bunch of additional static ifs and make a pull request.
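
Until then, chunking an input range can be hand-rolled. This is a minimal sketch, not the Phobos implementation and not a patch for it: `InputChunks`, `inputChunks`, and `Counter` are hypothetical names, and unlike `std.range.chunks` it eagerly copies each chunk into a fresh array instead of returning a lazy view.

```d
import std.algorithm.comparison : equal;
import std.range.primitives : ElementType, empty, front, isInputRange, popFront;

// Hypothetical wrapper: groups the elements of any input range
// into arrays of at most `chunkSize` elements.
struct InputChunks(R) if (isInputRange!R)
{
    private R source;
    private size_t chunkSize;
    private ElementType!R[] current;

    this(R source, size_t chunkSize)
    {
        assert(chunkSize > 0, "chunkSize must be positive");
        this.source = source;
        this.chunkSize = chunkSize;
        fill();
    }

    @property bool empty() const { return current.length == 0; }
    @property ElementType!R[] front() { return current; }
    void popFront() { fill(); }

    // Reads the next chunk; a new array is allocated every time,
    // so chunks handed out earlier are never overwritten.
    private void fill()
    {
        current = null;
        while (current.length < chunkSize && !source.empty)
        {
            current ~= source.front;
            source.popFront();
        }
    }
}

// Convenience function so the element type is inferred.
auto inputChunks(R)(R source, size_t chunkSize) if (isInputRange!R)
{
    return InputChunks!R(source, chunkSize);
}

// Made-up input range yielding current, current + 1, ..., limit.
struct Counter
{
    int current;
    int limit;

    @property bool empty() const { return current > limit; }
    @property int front() const { return current; }
    void popFront() { ++current; }
}

void main()
{
    auto chunked = inputChunks(Counter(1, 7), 3);
    assert(equal(chunked, [[1, 2, 3], [4, 5, 6], [7]]));
}
```

Each chunk is a plain array, which makes it a convenient unit of work to hand to a worker thread.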
