Kaushik Ghose

All your genome are belong to us


Seeing chromosomes

Did you know that chromosomes can be seen under a light microscope, and they get their name from the fact that they are strongly stained by certain colored dyes?

I was reading through our government’s collection of educational materials on genomics. Going through the history of the gene I was startled to learn that people knew about chromosomes by the late 1800s. I had assumed they were a later discovery, requiring more sophisticated instruments than a simple light microscope.

Walther Flemming was the first to report seeing these structures while studying cell division. The dye he was using, aniline, was heavily absorbed by these mysterious granules in the nucleus, which he called chromatin. Though human chromosomes can be quite long when stretched out (chromosome 1, the longest, is about 85 mm), each one is a single molecule (!), so you can imagine it is effectively invisible under a microscope in that extended state.

It turns out, however, that during cell division the chromosomes condense and become dense enough that a stain will make them visible under a light microscope. Since the job of the DNA in the chromosomes is to serve as instructions for creating proteins, it is probably important for the DNA to be “unfurled” so it has a large surface area for ready transcription. When a cell needs to divide, however, the nuclear material needs to be transported and shuffled around, and it is best to have it in a tidy ball to do so.

The study of the structure of chromosomes using microscopes is called cytogenetics. One could imagine that the physical structure of the chromosome is not important for understanding the genetic code: we mostly study DNA as a one-dimensional, linear sequence of characters that we call the genetic code. It turns out, however, that the physical three-dimensional organization of chromosomes can affect which proteins are produced and in what quantity (“gene expression”).

I’m also told that chromosomes have their own domains in the nucleus – they are not just tendrils floating willy nilly in the nucleus.

Yup, that was your small, sticky ball of biological facts from me today …


Why would I ever write in C?

I’ve made computers do tricks in both C/C++ and Python for many years now. For the past few years I’ve written exclusively in Python for both work and play. I recently went back to a personal project to rewrite it in C++ and, for the first time in my life, thought, “Why would I ever write in C?”

I have two personal projects (a note taker/reference manager and a photo organizer) that I’ve been working on for over a decade now. I started both in C++ using the Qt widget library. I moved the note taker to Ruby on Rails (I had my browser-as-GUI phase), then to Python (Bottle is an amazing and lightweight web framework; you should try it), and then I abandoned it in favor of paper and pencil (but I don’t do much manuscript writing anymore, which accounts for me not needing a reference manager and making do with paper notes).

I made two versions of the photo organizer in C++ using Qt, and then I moved it to Python. I was quite happy with the Python version, but I decided I wanted some changes, and it seemed to be a new decade, so I started to itch to move it back into C++. Qt had changed a lot since 2012 (which I think was the last time I messed with the C++/Qt version of my photo organizer). There were a lot of new bells and whistles, and I got excited and wanted to sit down and see what I could do.

I wanted to start off with a library of core functions and an associated test suite. In Python, this kind of code is very easy to set up: I put the library in a source directory and the tests in a directory named ‘tests’ and use nosetests to automatically discover and run them. The tests come about organically as short pieces of code I write in the REPL to test newly added features, and they smoothly build up the module.
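
To make this concrete, here is a minimal sketch of that layout. The package, class and test names are made up for illustration; nosetests simply picks up tests/test_*.py files and runs the test_* functions inside them.

photo_organizer/
  __init__.py
  catalog.py          # the library code
tests/
  test_catalog.py     # discovered and run by nosetests

# tests/test_catalog.py -- typically starts life as a snippet typed into the REPL
from photo_organizer.catalog import Catalog

def test_add_photo():
  c = Catalog()
  c.add('IMG_0001.jpg')
  assert len(c) == 1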

The first thing I realized is that I missed the REPL. For twenty years I had done without a REPL. I would code up a skeleton of an application, craft the makefile, have function stubs and then just make sure it compiled. I would then add functionality by fleshing out the stubs and adding new functions, refactoring into new files and so on. But now, just the thought of spending hours, perhaps days, setting everything up without any feedback, raised a barrier. It made things not fun. It was WORK! This is not a good thing for a project done in spare time, for enjoyment.

The next thing I missed was nosetests. The scene for unit tests is now much better for C++, and Qt has its own little framework for writing and running tests. But they all look so CLUNKY! You have to declare a class, then include a header file, then make an application out of that. You have to tediously add every test separately, then every function, then every test case.

I guess, even with IDEs like Qt Creator, there was too much boilerplate, too much of a gap between me writing some code and seeing what it did. The compile cycle didn’t help. Doxygen sounded OK, but there was no smile on my face when I went to download it and see how I would be using it in my project.

[Image: an 8085 microprocessor kit]

My first introduction to computers was via BASIC, then Pascal. I then went to C and C++. I had a class in 8085 assembly (We used kits packaged in neat wooden boxes. I enjoyed keying in the program and the data and then hitting go. My favorite project had been to write an infinite precision division program. I had seen Apollo 13 a few years back, and those kits and their interfaces reminded me of the DSKY).

My introduction to proper C++ was in a data structures and algorithms class, and the height of my C++ programming adventures was a computer graphics class, where I made animations out of particles in OpenGL and a kind of fly-through of a complex scene. These were done on SGI O2 machines at the Indian Institute of Science’s supercomputer center, where I would retire early each evening after dinner, because the O2 machines were limited in number and the cubicles would be quickly occupied.

I say all this, perhaps in a subconscious attempt to convince you that I’m not totally inexperienced in programming and do indeed intrinsically enjoy programming computers and understanding the art of programming computers, and don’t view it merely as a means to an end, which is to make these amazing calculating machines do clever tricks.


But I think Python has spoiled me forever in terms of the manner in which I enjoy computers. Writing C/C++ has become work. It’s something I would do wearing a suit and tie (not that I wear a suit and tie at work – which should tell you something). It’s something I would do with a serious demeanor and the feeling, I imagine, that mental patients have when wearing a straitjacket.

Boilerplate, boilerplate everywhere I look. Don’t smile. The doctors will take you away to the padded room. Better to look serious and keep your head down.

Performance you say? Well, to be honest, there are very few parts of my code where I need “C like” speed. And where I do, well, I can use CFFI, or more likely, I use Cython. It looks a little ugly, but hey, I have to write so little of it. The only annoying thing is inspecting the generated C code to make sure it’s getting rid of the dynamic Python calls.

Oh, yeah, pip install has spoiled me too. I ain’t chasing down your exotic library to read EXIF tag data in photo files. I haven’t got that kind of time. But I do have time to type pip install.

So, perhaps the only caveat I would add, in closing, is to say, the full title of this post should be “Why would I ever write in C for fun?”



Cython __cinit__

cdef classes are awesome if you want lightweight data structures, for example when you need millions of them. These Cython classes have some special methods, the most basic of which is __cinit__, the analog of the __init__ method of regular Python classes.

The __cinit__ method is practical because it lets you initialize your Cython class transparently from straight Python. The downside is that we are now back to type checking and converting when we initialize, since we will accept Python variables. This can add perceptible overhead to object creation.

Consider the following class definition

cdef class A:
  cdef public:
    int a, b, c, d, e

  def __cinit__(self, int a, int b, int c, int d, int e):
    self.a, self.b, self.c, self.d, self.e = a, b, c, d, e

  def __repr__(self):
    return '({:d}, {:d})'.format(self.a, self.b)

If we run the following function

def one_million_objects():
  cdef:
    int n
    list x = []
  for n in xrange(1000000):
    a = A(n, n + 1, n + 2, n + 3, n + 4)
    x += [a]
  return x

and profile it, we obtain:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.235    0.235    0.235    0.235 {v1.one_million_objects}

If you peek at the translated C code (which, admittedly, is pretty ugly) you will find that the relevant part of the code goes:

 for (__pyx_t_2 = 0; __pyx_t_2 < 1000000; __pyx_t_2+=1) {
    __pyx_v_n = __pyx_t_2;

    /* "v1.pyx":17
 *     list x = []
 *   for n in xrange(1000000):
 *     a = A(n, n + 1, n + 2, n + 3, n + 4)             # <<<<<<<<<<<<<<
 *     x += [a]
 * 
 */
    __pyx_t_1 = __Pyx_PyInt_From_unsigned_long(__pyx_v_n); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_1);
    __pyx_t_3 = __Pyx_PyInt_From_unsigned_long((__pyx_v_n + 1)); if (unlikely(!__pyx_t_3)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_3);
    __pyx_t_4 = __Pyx_PyInt_From_unsigned_long((__pyx_v_n + 2)); if (unlikely(!__pyx_t_4)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_4);
    __pyx_t_5 = __Pyx_PyInt_From_unsigned_long((__pyx_v_n + 3)); if (unlikely(!__pyx_t_5)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_5);
    __pyx_t_6 = __Pyx_PyInt_From_unsigned_long((__pyx_v_n + 4)); if (unlikely(!__pyx_t_6)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_6);
    __pyx_t_7 = PyTuple_New(5); if (unlikely(!__pyx_t_7)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_7);
    PyTuple_SET_ITEM(__pyx_t_7, 0, __pyx_t_1);
    __Pyx_GIVEREF(__pyx_t_1);
    PyTuple_SET_ITEM(__pyx_t_7, 1, __pyx_t_3);
    __Pyx_GIVEREF(__pyx_t_3);
    PyTuple_SET_ITEM(__pyx_t_7, 2, __pyx_t_4);
    __Pyx_GIVEREF(__pyx_t_4);
    PyTuple_SET_ITEM(__pyx_t_7, 3, __pyx_t_5);
    __Pyx_GIVEREF(__pyx_t_5);
    PyTuple_SET_ITEM(__pyx_t_7, 4, __pyx_t_6);
    __Pyx_GIVEREF(__pyx_t_6);
    __pyx_t_1 = 0;
    __pyx_t_3 = 0;
    __pyx_t_4 = 0;
    __pyx_t_5 = 0;
    __pyx_t_6 = 0;
    __pyx_t_6 = __Pyx_PyObject_Call(((PyObject *)((PyObject*)__pyx_ptype_2v1_A)), __pyx_t_7, NULL); if (unlikely(!__pyx_t_6)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 17; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_6);
    __Pyx_DECREF(__pyx_t_7); __pyx_t_7 = 0;
    __Pyx_XDECREF_SET(__pyx_v_a, ((struct __pyx_obj_2v1_A *)__pyx_t_6));
    __pyx_t_6 = 0;

Our simple integers are being converted into Python objects and then back again.

If we omit the __cinit__ definition and manually initialize the elements of the structure:

cdef class A:
  cdef public:
    int a, b, c, d, e

  def __repr__(self):
    return '({:d}, {:d})'.format(self.a, self.b)

With our million objects function being written as:

def one_million_objects():
  cdef:
    int n
    list x = []
    A a
  for n in xrange(1000000):
    a = A()
    a.a, a.b, a.c, a.d, a.e = n, n + 1, n + 2, n + 3, n + 4
    x += [a]
  return x

and profile it, we obtain:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.146    0.146    0.146    0.146 {v2.one_million_objects}

This is a pretty big saving. As you will have guessed, it comes from the fact that we no longer do an expensive round trip through Python objects:

for (__pyx_t_2 = 0; __pyx_t_2 < 1000000; __pyx_t_2+=1) {
    __pyx_v_n = __pyx_t_2;

    /* "v2.pyx":15
 *     A a
 *   for n in xrange(1000000):
 *     a = A()             # <<<<<<<<<<<<<<
 *     a.a, a.b, a.c, a.d, a.e = n, n + 1, n + 2, n + 3, n + 4
 *     x += [a]
 */
    __pyx_t_1 = __Pyx_PyObject_Call(((PyObject *)((PyObject*)__pyx_ptype_2v2_A)), __pyx_empty_tuple, NULL); if (unlikely(!__pyx_t_1)) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 15; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __Pyx_GOTREF(__pyx_t_1);
    __Pyx_XDECREF_SET(__pyx_v_a, ((struct __pyx_obj_2v2_A *)__pyx_t_1));
    __pyx_t_1 = 0;

    /* "v2.pyx":16
 *   for n in xrange(1000000):
 *     a = A()
 *     a.a, a.b, a.c, a.d, a.e = n, n + 1, n + 2, n + 3, n + 4             # <<<<<<<<<<<<<<
 *     x += [a]
 * 
 */
    __pyx_t_3 = __pyx_v_n;
    __pyx_t_4 = (__pyx_v_n + 1);
    __pyx_t_5 = (__pyx_v_n + 2);
    __pyx_t_6 = (__pyx_v_n + 3);
    __pyx_t_7 = (__pyx_v_n + 4);
    __pyx_v_a->a = __pyx_t_3;
    __pyx_v_a->b = __pyx_t_4;
    __pyx_v_a->c = __pyx_t_5;
    __pyx_v_a->d = __pyx_t_6;
    __pyx_v_a->e = __pyx_t_7;

For some reason, we cannot declare the __cinit__ function as a cpdef function.
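
A related pattern, sketched below, is to wrap the manual initialization in a module-level cdef factory function (this is only a sketch, and it assumes the second class definition above, the one without __cinit__). Called from other Cython code it takes plain C ints, and A.__new__(A) is Cython’s fast-instantiation idiom.

cdef A make_A(int a, int b, int c, int d, int e):
  # fast instantiation: bypasses __init__-style argument handling
  cdef A obj = A.__new__(A)
  obj.a, obj.b, obj.c, obj.d, obj.e = a, b, c, d, e
  return obj

Inside one_million_objects we could then write a = make_A(n, n + 1, n + 2, n + 3, n + 4).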




How to shoot yourself in the foot with Cython

If you’ve grown fat and complacent using Python, try a little Cython. It’ll put hair on your chest and take some off your scalp.

Take the following Cython code I was writing (heavily paraphrased)

cdef:
  char *s = seq  # seq is passed in to the function
  unsigned char i, j
  unsigned char sub_mat[85]

for i, j in zip([65, 67, 71, 84], [84, 65, 67, 71]):
  sub_mat[i] = j

return [ sub_mat[s[i]]  for i in range(len(seq))]

This code compiles fine. It even runs fine for a few test cases I set up for it. But when I put it into a standard testing pipeline I use, it caused a different test to freeze up.

My first reaction was, I’ve lost a pointer somewhere. So I started to dig around. (That reminds me, there has got to be an easier way to hook a debugger up to Cython. The prescribed ways are so complicated.) I started to sprinkle print statements in the code. Nothing.

I then broke the final list comprehension out into a loop and put print statements in there. Suddenly I saw that my program wasn’t going off the reservation by losing a pointer. It was actually looping over and over again at that point. And then the lightbulb went off. The loop variable i had been typed as an unsigned char (max value 255), and if len(seq) was ever greater than that, the loop would go on forever as i overflowed and wrapped back to 0 repeatedly.
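
For the record, the fix is simply to give the loop index a type wide enough to hold any len(seq). Here is the same (paraphrased) snippet with that one change:

cdef:
  char *s = seq  # seq is passed in to the function
  Py_ssize_t i   # wide enough for any sequence length
  unsigned char j
  unsigned char sub_mat[85]

for i, j in zip([65, 67, 71, 84], [84, 65, 67, 71]):
  sub_mat[i] = j

return [ sub_mat[s[i]]  for i in range(len(seq))]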

Of course, those of you programming in C will have immediately noticed this issue and asked – “Does len(seq) ever get larger than 255?” But I’ve grown fat and lazy on Python, which lets me do things like 1<<1024 without missing a beat (yes, that creates a 1025-bit integer. Suck it up, C programmer), so, on occasion, I have to lose some hair from the top of my head when I slum it with all you C-coders.




Get more out of Cython

Getting more out of your Cython code doesn’t have to be a black art. A key step is to peek into the cythonized .c file and note if inner loops are getting down to bare C without any of the overheads associated with Python’s dynamism.

In an earlier post I introduced Cython, a really cool language that straddles Python and C. Cython is essentially Python with some C type declarations that help a translator (Cython) convert the Python code into a .c file, which can then be compiled and imported as a module. I’m one of the many folks who rave about this system because, just by changing .py to .pyx and running the code through Cython, you get a 2x speedup for your functions. The next stage is to add some type declarations to your code; these give the real speedups.

I became curious as to what the type declarations actually do, and how they can help. Beyond curiosity, I was hoping that understanding this process would help me write more efficient Cython code.

I was working on a simple piece of code to determine the transition matrix for a DNA sequence. The DNA sequence is a string of characters from the set {A,C,G,T,N} (The N represents an unknown character, or, sometimes, a don’t care character). The transition matrix simply represents the probabilities that, in the sequence, character X will be followed by character Y, for every possible transition. This means a 5×5 matrix.

My first pass (actg.pyx) looked like this:

def compute_transition_matrix(bytes seq):
  # Given a string, computes the transition matrix description of that string
  cdef:
    unsigned long int t_mat[85][85]
    unsigned long int seq_len = len(seq), n, m

  for n in [65, 67, 71, 84, 78]:
    for m in [65, 67, 71, 84, 78]:
      t_mat[n][m] = 0

  for n in range(1, seq_len):
    t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1

  # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
  return [[t_mat[n][m] for m in [65, 67, 71, 84, 78]] for n in [65, 67, 71, 84, 78]]

Inspecting the actg.c code generated by running Cython on this file, I found that my inner loop looked like this:

  /* "actg.pyx":17
 *       t_mat[n][m] = 0
 * 
 *   for n in range(1, seq_len):             # <<<<<<<<<<<<<<
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1
 * 
 */
  __pyx_t_4 = __pyx_v_seq_len;
  for (__pyx_t_7 = 1; __pyx_t_7 < __pyx_t_4; __pyx_t_7+=1) {
    __pyx_v_n = __pyx_t_7;

    /* "actg.pyx":18
 * 
 *   for n in range(1, seq_len):
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1             # <<<<<<<<<<<<<<
 * 
 *   # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
 */
    if (unlikely(__pyx_v_seq == Py_None)) {
      PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    }
    __pyx_t_8 = (__pyx_v_n - 1);
    __pyx_t_9 = __Pyx_PyBytes_GetItemInt(__pyx_v_seq, __pyx_t_8, 1); if (unlikely(__pyx_t_9 == ((char)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __pyx_t_10 = ((unsigned char)__pyx_t_9);
    if (unlikely(__pyx_v_seq == Py_None)) {
      PyErr_SetString(PyExc_TypeError, "'NoneType' object is not subscriptable");
      {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    }
    __pyx_t_9 = __Pyx_PyBytes_GetItemInt(__pyx_v_seq, __pyx_v_n, 1); if (unlikely(__pyx_t_9 == ((char)-1) && PyErr_Occurred())) {__pyx_filename = __pyx_f[0]; __pyx_lineno = 18; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
    __pyx_t_11 = ((unsigned char)__pyx_t_9);
    ((__pyx_v_t_mat[__pyx_t_10])[__pyx_t_11]) = (((__pyx_v_t_mat[__pyx_t_10])[__pyx_t_11]) + 1);
  }

I know the first thing you are thinking: man, this looks butt-ugly. Also, gotos?!?! Well, the code is not really meant for human consumption, AND there are some things in here that make it easy to find what you are looking for. The first is that the name mangling preserves the core of the function name, so you can search for the function in an editor. The second thing, which turns out to be very, very important, is that the original code appears in comments alongside the relevant C code. And for this I take my hat off to the Cython folks, because they did not have to do this, but they put it in anyway, which makes our inspection a LOT easier.

Now, you can see right away that the inner loop has a bunch of funny checks related to making sure variable types are correct and have appropriate attributes etc. And I thought, I don’t need this cruft, I KNOW this is a string of bytes – an array of chars.

My change was to explicitly tell Cython that my bytes variable was really an array of char (note the char *seq = s line):

def compute_transition_matrix(bytes s):
  """Given a string, computes the transition matrix description of that string"""
  cdef:
    unsigned long int t_mat[85][85]
    unsigned long int seq_len = len(s), n, m
    char *seq = s

  for n in [65, 67, 71, 84, 78]:
    for m in [65, 67, 71, 84, 78]:
      t_mat[n][m] = 0

  for n in range(1, seq_len):
    t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1

  # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
  return [[t_mat[n][m] for m in [65, 67, 71, 84, 78]] for n in [65, 67, 71, 84, 78]]

The inner loop that results from this code, looks like:

  /* "actg.pyx":35
 *       t_mat[n][m] = 0
 * 
 *   for n in range(1, seq_len):             # <<<<<<<<<<<<<<
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1
 * 
 */
  __pyx_t_5 = __pyx_v_seq_len;
  for (__pyx_t_8 = 1; __pyx_t_8 < __pyx_t_5; __pyx_t_8+=1) {
    __pyx_v_n = __pyx_t_8;

    /* "actg.pyx":36
 * 
 *   for n in range(1, seq_len):
 *     t_mat[<unsigned char> seq[n - 1]][<unsigned char> seq[n]] += 1             # <<<<<<<<<<<<<<
 * 
 *   # Prefer to copy the data into a python list, rather than mess around with pointer conversion etc.
 */
    __pyx_t_9 = ((unsigned char)(__pyx_v_seq[(__pyx_v_n - 1)]));
    __pyx_t_10 = ((unsigned char)(__pyx_v_seq[__pyx_v_n]));
    ((__pyx_v_t_mat[__pyx_t_9])[__pyx_t_10]) = (((__pyx_v_t_mat[__pyx_t_9])[__pyx_t_10]) + 1);
  }

Which looks A LOT BETTER.

This is reflected in some simple %timeit-based benchmarking, running this code on human chromosome 1. The first version takes 1.4 s to run while the second version takes 0.27 s – a 5x speedup, which corresponds with us taking all that dynamic cruft out of the inner loop.

One thing to note is that the PROPER way to do all this is to profile your code first and find out where the bottlenecks are. Sometimes doing %timeit on individual functions is all that is needed. Sometimes you will need to do a proper profile, and Cython has directives that allow you to profile Cython functions, but these can slow the code down.
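
As a rough sketch of what that looks like (the sequence file name here is hypothetical): the profile directive at the top of the .pyx file adds the hooks that let cProfile see into Cython functions, and then an ordinary cProfile run works as usual.

# at the very top of actg.pyx:
# cython: profile=True

# then, from plain Python:
import cProfile, pstats
import actg

seq = open('chr1.bin', 'rb').read()  # hypothetical file holding the raw sequence bytes
cProfile.runctx('actg.compute_transition_matrix(seq)', globals(), locals(), 'actg.prof')
pstats.Stats('actg.prof').sort_stats('cumulative').print_stats(10)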




Python setuptools entry points

This is a quick tutorial on how to use Python setuptools entry points to elegantly distribute command-line scripts as part of your program, and to enable your code to discover and use plugins in a principled fashion.

Installing scripts

Say your program supplies a Python script dothis.py. If you have been using distutils, in your setup.py script you probably have something like:

setup(...,
  scripts=['dothis.py']
     )

When I first used this it was like magic. This line would cause setup to place an executable wrapper script in /usr/local/bin that would point to your dothis.py script.

setuptools has a slightly different, but ultimately better, way to handle distributing command-line scripts. You may have to refactor your code a little bit to get this to work, but the refactoring improves the layout of your code. If your original script (like some of mine) had the pattern:

# Your module code

if __name__ == "__main__":
  # Parse command line args 
  # Do a bunch of stuff

You need to refactor the stuff in main into a function – let’s call it cli():

# Your module code

def cli():  # Entry point for scripts
  # Parse command line args 
  # Do a bunch of stuff

if __name__ == "__main__":
  cli()

(In general this is proper practice, but folks have been known to ignore it)

In your setup script you should now discard the scripts line and instead use:

    ...
    entry_points={
           ....
      # Command line scripts
      'console_scripts': ['runme = dothis:cli']
    },
    ...

This has several advantages, mainly relating to playing nicely with both POSIX and Windows systems. There are also options for distributing Python GUI programs!
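
Put together, a minimal setup.py for this might look like the sketch below (the package metadata is hypothetical; only the entry_points part comes from the discussion above):

from setuptools import setup

setup(
  name='dothis',          # hypothetical package name
  version='0.1.0',
  py_modules=['dothis'],  # the module that contains cli()
  entry_points={
    'console_scripts': ['runme = dothis:cli']
  },
)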

Plugins

Setuptools gives us a really nice way to implement a plugin system without having to write any plugin manager that keeps track of where plugins are stored and how to import them (Relative import? Absolute path? Dotted path?). It also ensures that we and the plugin writers don’t have to worry about setting up some common space into which plugin code has to be installed to be discoverable by the main app.

What you do is simply settle on a name for your plugin system’s entry point group. Say my.cool.plugins. It is important to choose a unique name, since the whole Python install will know about this name and we don’t want collisions. Having the name of your package in there somewhere is a good bet.

In your application code, you can use something like this to find and load a plugin module:

import pkg_resources
def _load_plugin(name, plugin_entry_point):
  for v in pkg_resources.iter_entry_points(plugin_entry_point, name):
    return v.load()
  raise ImportError('No plugin called "{:s}" has been registered.'.format(name))

Here ‘my.cool.plugins’ is the value you should pass for plugin_entry_point. This function returns the result of load(), which is the module itself, just as if we had imported it by hand.

You can also discover all the available plugin modules:

def discover_all_plugins():
  return sorted([(v.name, v.module_name) for v in pkg_resources.iter_entry_points('my.cool.plugins')],
                key=lambda x: x[0])

But how can we make the magic happen? How can we make a plugin discoverable this way? The magic lies in the setup.py of the package supplying the plugin. Say the plugin code lives in a file called plugin_file.py under the directory my/plugin/code:

    ...
    entry_points={
           ....
      # Register the built in plugins
      'my.cool.plugins': ['fancy_name = my.plugin.code.plugin_file']
    },
    ...

Once you run python setup.py install on this, your application will be able to see the plugin.
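
From the application side, using it then looks something like this (assuming, purely for illustration, that the plugin module defines a run() function):

plugin = _load_plugin('fancy_name', 'my.cool.plugins')
plugin.run()

print(discover_all_plugins())
# e.g. [('fancy_name', 'my.plugin.code.plugin_file')]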

Many thanks to Björn Pollex, who encouraged me to look into setuptools entry points as a way to clean up how I distribute scripts and implement plugins.



The Felix problem

Say you have a multiplayer game where there is no central server. You need to indicate to a pair of players if they are within a certain distance from each other, but you can’t let any of the players know the actual positions of any other players. Could you do it?

I was driving us back from the 2014 Biological Data Science meeting at Cold Spring Harbor when Felix posed this problem to me. The drive was alternately boring and annoying as we hit pockets of over-polite New York drivers on the way out. (Though there was a stretch of road along the Hutchinson River Parkway that was SPECTACULAR in terms of fall foliage. We didn’t take any pictures, but that alone made driving to and from the conference worth it.)

As a complete aside, Felix had the GPS voice on his phone on and we were entertained by numerous howlers, one of them being the voice calling it the “Hutchinson River Peek-wee” (pkwy) and the other being “Take I95 New Hampshire minus Maine”, which Felix declared would simply be New Hampshire, since there is no intersection between the two sets.

Anyhow, back on point, to relieve the boredom Felix was telling me crypto related things and came up with this question, which he claimed he did not have a definitive answer to.

Here are my attempts during the drive:

Players A and B want their distance checked. Players C and D are “third parties”.

Scheme 1 (Fails)

Player A sends their position, along with N-1 random positions, to C. Each position has a tag, and only A knows which tag marks the correct position. Player B does the same.


Player C computes N^2 position combinations and returns the results to both A and B.


My idea was that when the results come back, A and B would each be able to pick out the correct result because they know their own tag. But at this point I realized the scheme fails: each player gets back N results carrying their own tag, with no way to tell which of the other player’s tags marks the real position.

But this led me to a second scheme, which works:

Scheme 2

Players A and B send their position, along with N-1 random positions, all tagged, to C. Separately, they send their tags to D.


C sends the N^2 computed results to D, who picks out the correct answer based on the tag pair from A and B and then sends the result back to A and B.


Scheme 3 – a nice refinement of 2

I then realized that you can save on the large number of computations (N^2, even if they are simple) by doing the following: A adds a random offset to their position (p_a + q_a) and sends this to C. B sends a similarly offset position (p_b + q_b) to C. Both A and B send their random offsets to D.


C computes r_1 = (p_a + q_a) - (p_b + q_b) and sends it to D.


D then subtracts the difference of the offsets from C’s result: r_2 = r_1 - (q_a - q_b) = (p_a + q_a) - (p_b + q_b) - (q_a - q_b) = p_a - p_b.


This last is exactly the computation we want, and D broadcasts the flag “In range” or “Out of range”. Note that neither C nor D knows the actual positions of A and B, and A and B don’t know each other’s positions either.
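
Here is a toy simulation of Scheme 3 in one dimension, just to watch the algebra work (plain numbers stand in for positions, and uniform noise stands in for whatever offsets you would actually use):

import random

RANGE = 10.0
p_a, p_b = 12.0, 17.5  # true positions, known only to A and B respectively
q_a, q_b = random.uniform(-1e6, 1e6), random.uniform(-1e6, 1e6)

# A and B each send their offset position to C, and the offset itself to D
to_c = (p_a + q_a, p_b + q_b)
to_d = (q_a, q_b)

# C only ever sees the offset positions
r_1 = to_c[0] - to_c[1]

# D removes the offsets: r_2 equals p_a - p_b, though D never sees p_a or p_b
r_2 = r_1 - (to_d[0] - to_d[1])

print('In range' if abs(r_2) <= RANGE else 'Out of range')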

I was pretty happy at arriving at this solution and by this time we had meandered our way out of New Rochelle.
