Archive

Posts Tagged ‘Ruby’

Ruby: FFI or not FFI?

October 26, 2013 Leave a comment

I started making Ruby bindings for my SAM/BAM library, and it’s not at all clear whether to use FFI or good old C extension for MRI.

For you to get the clear picture, I’m going support only Linux and Mac OS X, distributing binary packages, because that’s the easiest option given the current state of DMD compiler infrastructure—it’s not available by default on almost every system, like GCC.

One factor is convenience. By that word FFI proponents usually mean that they are too lazy to sit and write some C code. But hey, since I’m going to distribute binary packages only, I can just use Rice which should be much easier.

Another important factor is interoperatibility. Well, I did some simple benchmarks and discovered that in MRI, the speed sucks if I use FFI, and in JRuby, simply calling Picard library gives the same or better performance (of course, if you’re aware of flags –server and -Xji.objectProxyCache=false). But more importantly, on JRuby the overhead with either FFI or Java Integration is huge =\ Namely, counting reads in a 2.5GB file took about 2 minutes, but adding computation of average read name length added another 2 minutes, giving total of 4! For comparison, it took only 2 minutes using PySAM, and this is the baseline that should be followed.

My conclusions from this are that
1) JVM is not well suited for dynamic languages, and this opinion is supported by JRuby developers. Hopefully, Topaz will mature eventually.
2) Overhead of FFI is too substantial to ignore in my particular case, where we want to work with lots of short reads.

So, I will go with Rice. The bonus part is that I will have to write C++ classes wrapping D functionality, which could theoretically be also used for CPython extension using Boost.Python.

EDIT:  compared bindings generated with Rice and SWIG, the latter wins. So, the full chain is D -> C bindings -> C++ wrapper -> SWIG wrapper -> Ruby

EDIT2: after some playing with SWIG I realized that I want ultimate control over what’s going on, and finally decided to write in good old C.

Advertisements
Categories: Uncategorized Tags: , ,

[GSoC] Weekly report #0

Finally, OBF seems to be convinced about usefulness of new BAM parser. Thus now I can spend time on coding instead of writing/talking on why what I’m about to do is so important :)

I began to write code before the official start for two reasons. First, I just can’t wait to get my hands on this. The project is a great opportunity to learn more about D and Ruby, and what they’re capable of when being combined. By the way, it’s one of the goals we set for this summer — discover some best practices. Moreover, I’ll describe some techniques I already use, in this very post ;) The second reason is more prosaic. I’ll have exams in June, and therefore I have to do some stuff in advance. However, I believe I’ll deal with time management issues. Since July I’ll be able to work mostly full-time.

Now let’s move on to what was done during last week. The project repository is at https://github.com/lomereiter/sambamba. Directory structure is not well organized yet, I’ll do that a bit later.

The main struct, BamFile (bamfile.d), has just a constructor, method close() and, the most important at the moment, header property which returns SamHeader struct (defined in samheader.d).

Now I advise you to take a look at samheader.d. It is rather small file which begins with ~50 lines of rather crazy mixture of string mixins and mixin templates, and then…

mixin HeaderLineStruct!("SqLine",
          Field!("sequence_name", "SN"),
          Field!("sequence_length", "LN", uint),
          Field!("assembly", "AS"),
          Field!("md5", "M5"),
          Field!("species", "SP"),
          Field!("uri", "UR"));

That’s how I want to define structs for information from SAM header line fields. And D allows me to do that. This code generates struct definition, with static method parse() which takes a string and returns an instance of the struct. Really, if “each line is TAB-delimited and except the @CO lines, each data field follows a format ‘TAG:VALUE’ where TAG is a two-letter string <…>”, why should I write more than this tag and the type of the value? And since the value type is almost always a string, let’s make it default in the Field template ;-)

So, here’s the first useful technique: whenever you find some code to be too repetitive, see if it’s worthwhile to create a small DSL based on mixins. D makes code generation easy with compile-time function evaluation, almost as easy as writing Lisp macros.

How about writing Ruby FFI bindings to these structs? I decided to apply DRY principle again, and here’s my another idea: scaffolding. RoR guys have been doing automatic boilerplate generation for years, so why not apply this method for generating FFI bindings?

So I wrote a small Ruby code generator for D structs, which currently supports numeric types, strings, and arrays as struct fields. Now just a few lines of code will do:

import samheader;
mixin(import("utils/scaffold.d"));

import std.stdio;

void main() {
    File("bindings/scaffolds/SqLine.rb", "w").write(toRuby!SqLine);
    File("bindings/scaffolds/RgLine.rb", "w").write(toRuby!RgLine);
    File("bindings/scaffolds/PgLine.rb", "w").write(toRuby!PgLine);
    File("bindings/scaffolds/SamHeader.rb", "w").write(toRuby!SamHeader);
}
The results of generation are in bindings/scaffolds. Since Ruby has Open Classes, you can reopen any class and define/undefine its methods. For instance, I haven’t found a way to determine at compile-time whether a field is private or not, and you might want to undef corresponding methods. However, there is a thread on dlang.org with suggestion to add getProtection trait. Hopefully, it will be added to the language in the near future.
With a few lines of hand-coded bindings, we can use D functions from Ruby:

load './bindings/libbam.rb'

bam = BamFile.new "HG00125.chrom20.ILLUMINA.bwa.GBR.low_coverage.20111114.bam"
puts bam.header.pg_lines.map(&:program_name)

The code lacks tests and documentation, though, and that’s what I’ll be working on during the next week.

Categories: programming Tags: , ,

D & Ruby FFI, part 2: working with arrays and strings

April 10, 2012 Leave a comment

Firstly, string in D is nothing but immutable(char)[], that is, an array of particular type.
Secondly, the memory layout of an array is simple: size_t (for storing length) + pointer (which points to the array elements).

The immediate conclusion is that D strings are not necessarily zero-terminated. It’s not very convenient when we’re working via FFI; the bright side is, together with immutability that gives the D compiler some opportunities to optimize operations with strings.

Strings

Let’s play a bit with the struct from the previous post:

public struct Foo {
    public string hello = "hello, world!";
}

Using FFI::Struct, it can be represented in Ruby code as

class Foo < FFI::Struct
    layout :hello_length, :size_t,
           :hello,        :string

    def initialize
        @ptr = MyLibrary.foo_new
        ObjectSpace.define_finalizer @ptr, Foo.finalize(@ptr)
        super(@ptr) # init FFI::Struct with the pointer
    end
    # ... finalize stuff ...
end

Then we may access our string as Foo.new[:hello]. Luckily, it’s zero-terminated — as all string literals in D.

Let’s now do something involving slicing:

string str;
extern (C) immutable(char)* foo_hello(Foo* p) {
    str = std.array.split(p.hello)[0];
    return str.ptr;
}

If you now bind it to Ruby by means of

...
    attach_function :foo_hello, [:pointer], :string
...
    def hello
        MyLibrary.foo_hello @ptr
    end
...

you’ll see that Foo.new.hello returns “hello, world!” instead of expected “hello,”.

In order to return a zero-terminated string, use std.string.toStringz.
To do the converse (that is to convert char* into a string), use std.conv.to!string (for some reasons, std.string.toString was deprecated).

 

Arrays

Now let’s invent a method to send arrays from D.

For instance, the array of strings which we get after calling split. Firstly, we have to convert all the strings into zero-terminated ones. Then we can return a struct with array as a field. (I don’t know if it’s possible to return the array without packing it into a struct)

import std.algorithm : map;
import std.array : array;

alias immutable(char)* cstring;

struct WordsArray {
    cstring[] arr;
}

static WordsArray words;

extern (C) WordsArray foo_hello(Foo* p) {
    words.arr = array(map!(std.string.toStringz)(std.array.split(p.hello)));
    return words;
}

The Ruby code is also not very sophisticated:

class WordsArray < FFI::Struct 
    layout :length, :size_t, # dynamic array layout
           :data,   :pointer # 

    def to_a
        self[:data].get_array_of_string(0, self[:length])
    end
end

...
    attach_function :foo_hello, [:pointer], WordsArray.by_value
...

    def hello
        (MyLibrary.foo_hello @ptr).to_a
    end
...
Categories: programming Tags: , ,

D & Ruby FFI, part 1: mapping D structs onto Ruby classes

April 10, 2012 Leave a comment

This post starts series about interaction between D and Ruby via FFI. Several things are to be investigated.

Firstly, both languages have their own garbage collectors. Although we can disable D garbage collector manually for some classes/structs, that’s not convenient at all. We want to use existing D code without modifying it.

Also, D ABI is a bit more sophisticated than C ABI (some documentation is at http://dlang.org/abi.html). So we need to understand what is the layout of D objects and how to work with it effectively.


This post describes how to deal with garbage collection issues.

The method suggested below currently works with structs only(Although, structs can be passed by reference in D, and they are more lightweight than classes, so it’s not a big limitation.)

Suppose we want to create D struct from Ruby. The memory occupied by the struct can’t be managed by D garbage collector, because it’s the Ruby code which is to manage the memory, not D code. D functions can only hide some details of allocating/deallocating memory.

Let’s take this struct as an example:

public struct Foo {
    public string hello = "hello, world!";
}

We have to provide two functions for alloc/free following C calling conventions:

extern (C) void* foo_new() {
    Foo* p = cast(Foo*)malloc(Foo.sizeof);
    std.conv.emplace(p);

    GC.addRange(p, Foo.sizeof);
    return p;
}

extern (C) void foo_free(Foo* p) {
    GC.removeRange(p);
    free(cast(void*)p);
}

Allocating memory is easy – one calls corresponding external function via FFI, gets a pointer, and that’s it. Deallocation is a little bit trickier, since Ruby also has its own GC. Therefore, when we create the instance, we should also register the object’s finalizer which will be called by GC after the instance destruction. Notice this word, ‘after’. It means that we must copy the pointer to somewhere, otherwise GC just won’t be able to destroy the object because the finalizer will reference the object itself. (See http://www.mikeperham.com/2010/02/24/the-trouble-with-ruby-finalizers/ where this subtle thing is described in detail.)

So the source code of the Ruby part is something like this:

require 'ffi'

module MyLibrary
    extend FFI::Library
    ffi_lib './test.so'
    attach_function :foo_new, [], :pointer
    attach_function :foo_free, [:pointer], :void
end

class Foo
    def initialize
        @ptr = MyLibrary.foo_new
        ObjectSpace.define_finalizer @ptr, Foo.finalize(@ptr)
    end

    def self.finalize ptr
        proc { MyLibrary.foo_free ptr }
    end
end

Here I visualised the process of interaction of the two languages. You may open in another tab this gist to see all the code.

Another subtle thing is, how do we initialize D garbage collector? In order for it to work, we are to call core.runtime.Runtime.initialize, and when we’re done with the library, we are to call terminate() method

Theoretically, one could use “static this() { … }” and “static ~this() { … }” in D code. Practically, it doesn’t work yet. Thus we are forced to resort to some dirty tricks with gcc function attributes, namely, with __attribute__((constructor))/__attribute__((destructor)). (Read the discussion at http://stackoverflow.com/questions/9759880/automatically-executed-functions-when-loading-shared-libraries).

UPD: dirty tricks mentioned above can cause segfaults for no apparent reason. You better manually call the function initializing runtime, through FFI.

Categories: programming Tags: , ,