Home > Uncategorized > Ragel and bioinformatics

Ragel and bioinformatics

The term ‘bioinformatics’ should be familiar to the reader. So let’s become acquainted with Ragel.

What is it?

Finite-state machine compiler. FSM is a term which is known to every computer scientist. Basically, it’s a bunch of predetermined states and transitions between states. You have some final states, and some error states, and how your machine will walk through these states depends on the input and some predetermined rules.

Why is it useful? Because it allows to parse any regular grammar, and most textual data formats in bioinformatics can be described by some regular language.

Oh, enough theory… why should I care?

The code which Ragel generates is extremely fast. That’s a fact.

Ragel is used in

  • Gherkin gem — for parsing feature files. Here’s a line from Cucumber history:
    “Treetop is gone and replaced with Ragel. The new Ragel parser lives in the gherkin gem. Parse times are up to 100 times faster.”
  • Mongrel web server — for parsing HTTP requests.

    “Mongrel’s design is based around the idea that most of the security problems in HTTP servers comes from hand-coded parsers that are too “loose” with the protocol. Mongrel uses a generated parser (using Ragel) that is very strict and seems to block a huge number of attack attempts simply because it is so exacting. Since this protection comes at the HTTP level, any framework using Mongrel gets it for free.”

  • Rubinius — for implementing Array#pack/String#unpack methods.

Ok, how to use it?

First step is defining formal grammar for the format. That might not be easy, especially if there’re no strict specification. However, when specification gives detailed descriptions of all fields, this step is easy.  It took me just 1.5 hours to write more or less complete grammar for the SAM format. Here it is.

Then, you can insert actions to be executed on each transition. Action is a snippet of code. The languages supported are C, C++, D, Go, Java, and Objective C. (Ruby is also supported but it’s too slow for real-world parsers).

Let’s take an example. We wish to extract all tag names from SAM file. How to do that when we have Ragel grammar?

Ok, let’s find the line which deals with tags:

tag = alpha alnum ;

We need to add two actions, one at the beginning, and one at the end (the snippets are in D programming language):

action start_tag {
    tag_startpos = p - data.ptr;
}

action end_tag {
    tag_endpos = p - data.ptr;
    tagnames ~= cast(string)data[tag_startpos .. tag_endpos].dup;
}

tag = (alpha alnum) > start_tag % end_tag ;

Here p is a pointer which points to the current char, and data is a string being parsed. tag_startpos is a variable stored somewhere between the two actions. That’s it, now we just need to add some boilerplate to that:

void process(string data) {
    // Ragel requires the following 4 variables to be declared:
    char* p = cast(char*)data.ptr; // pointer to start
    char* pe = cast(char*)data.ptr + data.length; // pointer to end
    char* eof = null; // not needed in this simple example
    int cs; // identifier of the current state of FSM

    size_t tag_startpos;
    size_t tag_endpos;
    string[] tagnames;

    %%write init;
    %%write exec;
}

Now you can pass any SAM file to this function. Of course, in real-world case, you would read file in chunks, and pass those chunks instead of whole file.

 

The conclusion is, it is much less work to write parsers in this way. Furthermore, such grammar descriptions make it extremely easy to check that the parser conforms to the specification. Another cool feature is that validation is done during parsing, yet the parsing remains very fast, because no memory copying occurs (except initiated by you) and the minimum of required operations is done.

I’m going to write SAM parser with Ragel when I’ll find some time. I’m highly interested in how it will compare with samtools parser ;-)

Advertisements
Categories: Uncategorized Tags: ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: