Derived classes and std::vector

I want to talk about a fun corner of C++ I ran into recently. It has to do with how to handle calling member functions of a list of heterogenous objects.

My use case was a bit like the following: say we have a bunch of different functions – a * x^2, b * x, c/x – and we want to apply them one after the other to a series of x values. Which functions are included, and the order of application are user defined.

My solution was to have a base class

and three derived classes

But when I put them in a list and applied them:

I have

g++ --std=c++14 main_wrong.cpp -o test
./test
0
0
0
0

Whaaaaa? Waah! Why is it calling the dummy base class function. Isn’t the WHOLE PURPOSE of virtual functions to ensure that the function in the derived class gets called?

The core problem here is that std::vector makes a copy of the object when we use the push_back function. We’ve defined the container type as a base class and when our derived class is passed to push_back it “slices” the object down so that the subclass now looks like the base class – losing all it’s derived attributes. (I first ran into Object slicing thanks to “Cubbi”‘s answer here)

So what can we do? We could use a vector of pointers instead. But the internet seems to think this is an unsafe idea. UNSAFE!? We don’t write C++ to be SAFE!! Oh, alright. We’ll practice safe code. What can we do?

The safest route is to use std::unique_ptr:

I’m not full clear on the use of emplace_back instead of push_back: it certainly isn’t for efficiency reasons here – the just code won’t compile using push_back.

So there you have it.

Chromecast audio: command line

When I was casting about (I’m allowed one Dad pun per article. Oops, I mean bad pun. Sorry that was two much fun.)  for home audio solutions (It’s what the kids call speakers these days) a friend of mine encouraged me to try out Google’s Chromecast Audio device. And so another internet connected creature entered our home. This post starts with a rant. Feel free to skip to the section that says “Commandline” …

I was simultaneously curious, impressed, and a little disgusted by the Chromecast Audio. We got the Chromecast at the same time as the Synology NAS did and I found the two to be a study in contrast.

Synology gives you an appliance – it’s a snap to setup, to update, and to add functionality (“Apps”) to. This is great, but even greater is that you get root access – it’s a Linux powered box that you own and have control over – you can de-Appliance it.

The Chromecast Audio is a modern appliance. It’s a black box that has very little to configure (good) and it’s locked down in a weird way (very bad). It’s an internet connected device that updates itself and there is no way to access what’s inside. Thankfully, it does not have a camera or a microphone. Google does release some sort of source code. It’s a 4GB tar file. It’s not clear if this is the whole software the box is running, and if it’s the latest version. So there’s that.

One major symptom of this locked-down-ness (and probably of Google’s marketing prowess) is just how hard it was to find some useful information of WHAT the damn thing was doing. I’d always come across functional or how-to articles that told me to use this App or that App, or use the Chrome browser and use cast.

Finally, and ironically, the best description came from Synology’s blog:

Chromecast is basically a Chrome browser, you can’t see it, you can’t feel it, but it’s there: when you select a song to stream, DS audio/video simply passes the HTTP(S) url to the browser, meaning the stream goes from the DiskStation to the Chromecast, the device is never a bottleneck. And if you kill the app or log out, streaming will carry on uninterrupted!

So, here’s where I got disgusted – the chromecast is really built around smart phones or tablet devices that run “Apps” and really pushes you to Google’s ecosystem. It was very easy to configure the Chromecast via my Chrome browser, but that implies … that I was running a Chrome browser. Suppose I like Firefox or some other browser?

Next, streaming to the device requires a smart phone App. The only way to control the device from a Mac is to open up … Chrome and then use this janky thing called Cast. It’s not clear if it actually passes on the URL to the Chromecast (as it claims) or mirrors your computer output (Tab casting) – I got very glitchy playback when I tried to cast Youtube videos from my mac. It also messes up my sound output, to where I have to go into settings to restore things.

What’s with this App only stuff? Suppose I don’t have a smart phone (I actually don’t – we do exist.) How do I control the player? Just give me a small program to communicate with the device! You don’t have to go overboard and give me a fantastic browser based virtual desktop like Synology does. I don’t expect Google to have the software chops of a relatively small Taiwanese hardware company, but even a simple command line player would make me happy …

I like the hardware. I like the miniaturization. I like the ease of setup. I hate the ecosystem. It reeks of a desperation to make money that is unseemly.

Commandline

(My original plan was to see if I could outline here a bare bones Python program that commands the Chromecast to stream files or URLs of your choosing from basically any platform. After looking into the details of the chromecast protocol, I don’t believe I have the time to develop something from the ground up. Instead, we’ll use the wonderful pychromecast library from Paulus Schoutsen a little bit. But, we’ll start off with some experimentation, anyway)

First we are going to have to find our Chromecast device. Chromecasts use the mDNS protocol [a][b] and listen to the address _googlecast._tcp.local. We can use the zeroconf module (pip installable) to find the service.

import time
from zeroconf import ServiceBrowser, Zeroconf

class MyListener(object):
  def add_service(self, zeroconf, type, name):
    info = zeroconf.get_service_info(type, name)
    print("Service {} added, service info: {}\n".format(name, info))

zeroconf = Zeroconf()
listener = MyListener()
browser = ServiceBrowser(zeroconf, "_googlecast._tcp.local.", listener)
time.sleep(1)
zeroconf.close()

(It takes a little time for the handshaking to complete, so it’s best to add a delay before you shutdown zeroconf – hence the sleep)

If I run this, what do I get?

Service Chromecast-Audio-XXX._googlecast._tcp.local. added, 
service info: ServiceInfo(
type='_googlecast._tcp.local.', 
name='Chromecast-Audio-XXX._googlecast._tcp.local.', 
address=b'YYY', 
port=8009, 
weight=0, 
priority=0, 
server='XXX.local.', 
properties={
  b've': b'05', 
  b'md': b'Chromecast Audio', 
  b'rm': False, 
  b'st': b'0', 
  b'bs': b'......', 
  b'ic': b'/setup/icon.png', 
  b'ca': b'2052', 
  b'nf': b'1', 
  b'id': b'XXX', 
  b'fn': b'kitchen', 
  b'rs': False})

Whoa! Look at that! Some things I can figure out, like an IP address, a port and the name I gave the device (“kitchen”).

So, this IP address, can we ping it?

MacBook-Pro:~ kghose$ ping XXX.local
PING XXX.local (192.168.1.156): 56 data bytes
64 bytes from 192.168.1.156: icmp_seq=0 ttl=64 time=7.929 ms
64 bytes from 192.168.1.156: icmp_seq=1 ttl=64 time=4.980 ms
64 bytes from 192.168.1.156: icmp_seq=2 ttl=64 time=4.923 ms

Heh, Heh, heh. It’s ALIVE!

MacBook-Pro:~ kghose$ ssh XXX.local
ssh: connect to host XXX.local port 22: Connection refused

Ok, too much to ask for.

Can we port scan it? (I was lazy and used MacOSs built in Network Utility from a tip here)

Port Scan has started...

Port Scanning host: 192.168.1.156

     Open TCP Port:     8008        http-alt
     Open TCP Port:     8009
     Open TCP Port:     9000        cslistener
     Open TCP Port:     10001       scp-config
Port Scan has completed...

Not much open here – bodes well for security, I suppose. Note that the un-named port 8009 is the one the Chromecast returned during service discovery.

When I started out I thought that the Chromecast would have some kind of straightforward RESTful API and I looked forward to playing with it, but Google opted for a more unfriendly custom interface sitting just on top of TCP.

Fortunately, the excellent pychromecast library has done the tedious work of integrating things together to provide a Pythonic API to the chrome cast device.

(As a warning, just after I ran the following code and simply adjusted the volume on my Chromecast, it started misbehaving with the Apps on our phone. I had to reboot the Chromecast to get everything back to normal again. It could have been a coincidence, though)

import pychromecast

chromecasts = pychromecast.get_chromecasts()

[cc.device.friendly_name for cc in chromecasts]
#-> ['kitchen']

cast = next(cc for cc in chromecasts if cc.device.friendly_name == 'kitchen')
cast.wait()
cast.status
#-> CastStatus(is_active_input=None, is_stand_by=None, volume_level=0.25, volume_muted=False, app_id='ZZZ', display_name='Google Play Music', namespaces=['urn:x-cast:com.google.cast.broadcast', 'urn:x-cast:com.google.cast.media', 'urn:x-cast:com.google.cast.cac', 'urn:x-cast:com.google.android.music.cloudqueue'], session_id='YYY', transport_id='YYY', status_text='Google Play Music')

The only “write” operation I tried was to change the volume on the Chromecast, but as I mentioned, after this, the Chromecast had to be rebooted.

cast.volume_up()
#-> 0.35

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.4499999940395355

cast.volume_up()
#-> 0.549999988079071

cast.volume_up()
#-> 0.6500000119209289

cast.volume_up()
#-> 0.6500000119209289

cast.volume_up()
#-> 0.6500000119209289

cast.volume_up()
#-> 0.6500000119209289

cast.volume_down()
#-> 0.5500000357627869

cast.volume_down()
#-> 0.5500000357627869

So, I didn’t get further than that – I was put off because it seemed that controlling the Chromecast from this source borked the connection between the phone and the Chromecast for good i.e. until I rebooted it, but it SEEMS straightforward to have a little command line client to send music to the Chromecast using this library. We shall see.

References:
[1] Low level description of Chromecast V2 protocol
[2] Python library to communicate with Chromecast by Balloob
[3] Chromecast interface description (incomplete)
[4] Protocol buffer introduction

functools.partial

functools.partial is one of those things that you don’t need until you need it, and then you need it badly. Trying to create a series of lambda functions I discovered that the simplest answer that came to my mind did something counter intuitive, and functools.partial – which I had ignored so far, came to my rescue.

Effectively, I wanted to create a list of functions

f1(x), f2(x), f3(x) ...

where each function took the same input but did something different, for example

f1(x) = x + 1, f2(x) = x + 2, f3(x) = x + 3

What I tried was as follows

fl = [lambda x: x + i for i in range(4)]

It ran fine, but it gave me the following output

fl[0](1) -> 4
fl[1](1) -> 4
fl[2](1) -> 4
fl[3](1) -> 4

What was happening was that all the lambda functions were taking the last value of i (as a global), rather than the value of i at the time the function was created.

An object oriented person would probably do something like

class F:
  def __init__(self, i):
    self.i = i
  def f(self, x):
    return x + self.i

fl = [F(i) for i in range(4)]

But I enjoy Python because it borrows a lot of things from LISP and allows you to write a lot of code very succinctly, making it easier to grasp (if done well – it’s also possible to write Perl in Python)

Turns out the way you can capture the state of i in the lambda is to create a partial function, and functools.partial allows you to do this succinctly as

fl = [functools.partial(lambda x, j: x + j, j=i) for i in range(4)]

For those who are curious, WHY I had to do this – well, here is an analogy to my use case: I had data that was well expressed as a table of values. The data itself was stored as a list of dictionary which was sometimes nested – some of the values formed part of a hierarchy. For example, one row of the data could look like this:

row = {
 'name': 'Baker',
 'rank': 22,
 'score': {
   'raw': 43,
   'scaled': 25
 }
}

I wanted a table with columns

Last Name | Rank | raw score | scaled score |
----------|------|-----------|--------------|
    Baker |   22 |        43 |            25|

The actual table I was creating was meant to show ALL the data from an experiment with many variables and so had a lot of columns. I was wary of making a mistake in labeling the column relative to the value it had. It would be one of those silent and deadly errors.

If I had a simple flat structure, I might simply have changed the code such that the dictionary keys matched the column headers, weighing the fact that there was other code – already written and debugged – that used the existing keys and that the existing keys were short and succinct and I liked them.

My solution was to create a list of tuples

[(column header, function to extract column value from dictionary)]

And then simply loop over this list, first to create the header, and then for each row, to extract the data value into it’s correct place. Having the column header right next to the extraction function minimized the chances I would mess up somewhere, especially when adding or moving a column when editing or refactoring the code.

Self-Organizing Maps

Self-organizing maps are an old idea (first published in 1989) and take strong inspiration from some empirical neurophysiological observations from that time. The original paper “Self-Organizing Semantic Maps” by Ritter and Kohonen (pdf) has a nice discussion that took me back to some questions I was looking at in another life as a neurophysiologist. The discussion centers around cortical maps, which are most pronounced in the clearly sensory and motor areas of the brain.

Very early experiments (in their paper Lashley‘s work is referenced) led some people to propose the brain was a complete mishmash, with no functional separable internal organization. Each part of the brain did exactly what the other parts did, a bit like how a lens cut in half still works like the original lens (just dimmer), or how a hologram cut in half, still shows the original scene.

Later experiments looked more closely and found a wealth of specialization at all scales in the brain. The cortex (the most evolutionarily recent part of the brain) shows an especially fascinating organization at the smallest scales. This is most apparent in the visual cortex, where the neurons are often organized in two or higher dimensional maps. A famous example is the set of orientation maps in early visual cortex:

nrn3766-b1This figure is taken from Moser et al, Nature Reviews Neuroscience 15,466–481(2014) but you’ll run into such figures every where. It shows physical space along the cortex, color coded for the orientation of bar that the neurons at that location best respond to. One can see that the colors are not randomly assigned but form contiguous regions, indicating that neighboring neurons respond to similar orientations.

I could go on about cortical organization like this (and indeed I have elsewhere) but this is a good segway into self-organizing maps. Researchers studying machine learning got interested in this feature of the brain and wondered if it held some use for classification problems. Kohonen drew up an algorithm for what he called a self-organizing map that has been used on and off mainly for visualization via un-supervised learning.

The Kohonen map is constructed as follows.

  1. Construct a sheet of “neurons” with the dimensionality you want. Usually this is limited to one, two or three dimensions, because we want to visualize a complex higher dimensional dataset.
  2. Each neuron is a template – it has the same dimensions as the input data and can be overlaid on each instance of the input data.
  3. We initialize the templates randomly – so the templates look like white noise.
  4. We “present” each input example to the map. Then we find the template neuron that most closely matches this input. We then morph the template slightly so that it even more closely matches the input. We do this for all the neighbors of the neuron too.
  5. We keep repeating this for all the input we have.

What does this result in? I have two videos for you, you product of the MTV generation, that shows how this works.

I took the famous NIST handwritten digits data set and fed it to a two dimensional SOM.

The first video represents the self-organizing map as a set of templates that change as inputs are supplied to it.

This is easy to interpret as you can identify the templates as they evolve. Note how neurons form neighborhoods that naturally link together similar stimuli. It is easy to see with this data set because the data are two dimensional visual patterns and we can judge similarity intuitively.

The second video represents each neuron’s stimulus preferences, with it’s most preferred stimulus digit appearing as the largest.

You can see of the network evolves as examples are thrown at it and segregates out into zones that respond to or represent a particular form of the stimulus. You can see how zones travel and try to spread-out, ending up at the corners of the map. This is due to the competition created by the winner-take-all operation we use (closest matching template wins) combined with the effect of adjusting the winning neuron’s neighbors too.

SOMs are fun things to play with, but are a little fiddly. You may have picked up from the videos that we had two parameters called sigma and eta that changed as the learning went on. The numbers got smaller as the training went on. Sigma is the size of the neighborhood and eta is the learning rate – the smaller the value of eta the less the neurons change in response to inputs. These are two things you have to fiddle with to get decent results. Also as you can imagine, this whole process is sensitive to the random templates we start with.

Numerous attempts have been made to try and figure out objective ways to determine these numerous parameters for a SOM. It turns out they are pretty data dependent and most people just do multiple runs until they get results they are happy with.

As I was reading Bishop I ran into his version of the SOM which he calls Generative Topographic Mapping which Bishop claims is more principled and mathematically grounded than SOMs. The GTM, which I really haven’t heard about at all, seems like a fun thing to study, but as a ex-neurophysiologist SOMs are definitely more intuitive to understand.

Code (and link to data source) follows:

Bayesian networks and conditional independence

Bayesian networks are directed graphs that represent probabilistic relationships between variables. I’m using the super-excellent book “Pattern Recognition and Machine Learning” by C. Bishop as my reference to learn about bayesian networks and conditional independence, and I’d like to share some intuitions about these.

First, let’s motivate our study with a common home situation related to kids. We start with the question “Do we have to gas up the car?” (As you know, everything related to kids is binary. There is no room for gray areas). Well, this depends upon whether we need to do a grocery trip, or not. That, in turn depends upon whether the food in the fridge has gone bad and whether we need dino nuggets. Why would the food go bad? Well that depends on whether the electricity went out, and whether the darn kids broke the fridge again. The kids never broke your fridge? Lucky you. Whether we need dino nuggets depends on whether we’re having a party. The probability that the kids broke the fridge depends on whether there were a horde of them or not (i.e. a party).

Whoa, this paragraph looks complicated, like a GRE word problem. Let’s represent this as a directed graph:

Screen Shot 2017-04-13 at 7.57.02 AM

Much better!

All the variables here are binary and we can put actual numbers to this network as follows:

Screen Shot 2017-04-13 at 8.14.09 AM

Explaining some of the numbers, the entry for the (P)arty node indicates that there is a probability of 0.1 we’re going to have a party, and this is uninfluenced by anything else. Looking at the entry for the (G)as up node, we see that whether we gas up depends on the states of F and D. For example, if both F and D are false, there is a 0.1 probability of needing to gas up, and if F is true and D is false, there is a 0.6 probability of needing to gas up, and so on.

Intuitively, for a pair of nodes A and B, there is a linkage if the probability of the output depends on the state of the input(s). So, in the truth tables above I’ve made every row different. If, for example, the probability that B would take the state “T” is 0.2 regardless of whether A was “T” or “F” then we can intuit that A doesn’t influence B and there is no line from A to B. If, on the other hand, B takes the state “T” with p=0.2 if A is “T” and with p=0.8 if A is “F” then we can see that the probability B takes the state “T” depends on A.

Before we start to look at our kid’s party network, I’d like to examine three basic cases Bishop describes in his book, and try an develop intuitions for those, all involving just three nodes. For each of these cases I’m going to run some simulations and print results using the code given in this gist in case you want to redo the experiments for your satisfaction.

The only thing to know going forward is that I’ll be using the phi coefficient to quantify the degree of correlation between pairs of nodes. phi is a little more tricky to interpret than pearson’s R, but in general a phi of 0 indicates no correlation, and larger values indicate more correlation.

The first case is the tail-to-head configuration:Screen Shot 2017-04-13 at 4.56.51 PM.png

Say A=”T” with p=0.5 (i.e. a fair coin) and C=”T” p=0.7 if A=”T” and p=0.3 otherwise, and B=”T” p=0.7 if C=”T” and p=0.3 otherwise. Here “B” is related to “A” through an intermediary node “C”. Our intuition tells us that in general, A influences C, C influences B and so A has some influence on B. A and B are correlated (or dependent).

Now, here is the interesting thing: suppose we “observe C”, which effectively means, we do a bunch of experimental runs to get the states of A, C and B according to the probability tables and then we pick just those experiments where C = “T” or C = “F”. What happens to the correlation of A and B?

If we take just the experiments where C=”T”, while we do have a bias in A (because C comes up “T” a lot more times when A is “T”) and a bias in B (because B comes up “T” a lot more times when C is “T”) the two are actually not related anymore! This is because B is coming up “T” with a fixed 0.7 probability REGARDLESS OF WHAT A was. Yes, we have a lot more A=”T”, but that happens independently of what B does.

That’s kind of cool to think about. Bishop, of course, has algebra to prove this, but I find intuition to be more fun.

As a verification, running this configuration 100000 times I find the phi between A & B = 0.16379837272156458 while with C observed to be “T” it is -0.00037 and “F” it is 0.0024. Observing C decorrelates A and B. MATH WORKS!

As an aside, if I setup the network such that the bias of C does not change with the input from A, I get phi of A & B = -0.00138 and phi of A & B with C observed to be “T” = -0.00622 and “F” =  0.001665 confirming that there is never any correlation in the first place in such a “flat” network.

The second case bishop considers is the tail-to-tail configuration:

Screen Shot 2017-04-13 at 6.33.31 PM

Here we have a good intuition that C influences A and B, causing A and B to be correlated. If C is “T” it causes A to be “T” with pA(T)  and B to be “T” with pB(T) and when C is “F” A is “T” with pB(F) and B is “T” with pB(F). These probabilities switch to track C as it changes, resulting in the linkage.

What happens if we observe C? Say C is “T”. Now this fixes the probability of “T” for both A and B. They maybe different, but they are fixed, and the values A and B take are now independent!

The third case is very cool, the head-to-head configuration:

Screen Shot 2017-04-13 at 7.22.51 PM.png

We easily see that A and B are independent. What happens if we observe C? Let’s consider a concrete example with a probability table

A B  p(T) for C
F F  0.1
F T  0.7
T F  0.3
T T  0.9

Say we run 100 trials and A and B come up “T” or “F” equally likely (p=0.5). We expect each AB pattern will occur equally often (25 times) and then the expected number of C=T states is

A B  E(C=T)
F F  0.1 * 25 =  2.5
F T  0.7 * 25 = 17.5
T F  0.3 * 25 =  7.5
T T  0.9 * 25 = 22.5

So we have an expected 50 trials where C=T. Of that subset of trials, the fraction of trials that each pattern comes up is:

A B  p(AB|C=T)
F F   2.5 / 50 = 0.05
F T  17.5 / 50 = 0.35
T F   7.5 / 50 = 0.15
T T  22.5 / 50 = 0.45

Ah! You say. Look, when A and B are independent, each of those patterns come up equally often, but here, after we observe C, there is clearly an imbalance! How sneaky! Basically, because some patterns of AB are more likely to result in C=T this sneaks into the statistics once we pick a particular C (Those hoity, toity statisticians call this “conditioning on C”). A and B are now dependent!!

I have to say, I got pretty excited by this. Perhaps I’m odd.

But wait! There’s more! Say C has a descendent node. Now observing a descendent node actually “biases” the ancestor node – so in a way you are partially observing the ancestor node and this can also cause A and B to become dependent!

Now, I was going to then show you some experiments I did with a simulation of the Kid’s party network that I started with, and show you all these three conditions, and a funky one where there are two paths linking a pair of nodes (P and G) and how observing the middle node reduces their correlation, but not all the way, cause of the secondary path, but I think I’ll leave you with the simulation code so you can play with it yourself.

Code for this little experiment is available as a github gist.

CIGAR strings

When I first joined Seven Bridges my boss would be describing things and he would say “Do this with the CIGAR” or “compute that from the CIGAR”. Reluctant to be found out so early I nodded sagely and made a furtive note in my notebook to check out what the heck a CIGAR was.

You are most likely to run into this word – and it is in all caps, for reasons that may become clear later – if you operate closer to the nasty end of any work that requires genomic sequencing. One of the more compute intensive things we have to do to raw genomic data before we can get with the real sciencey stuff is to figure out how a sequence obtained in an experiment relates to some reference data.

A lot of life is tied together and it shows in the DNA. There are similarities not only between the DNA of humans, between the DNA of humans and chips, but also between mice and men, but the closest match is by far between human beings of all “races” and ethnicities, despite what some ignorant folk will try to tell you.

At the tiniest level, at very early stages of genomic sequence data analysis, we take a string of data, say

data = ACTGACTGACTGACTG

and compare it against some reference data, say

ref = ACTTACTAAGACTACTG

We notice that we can align the data against the reference by changing one letter (marked by *), assuming some letters have been deleted (– in the data) and some letters have been inserted (- in the ref)

data = ACTGACT--GACTGACTG
       |||*|||  |||| ||||
ref  = ACTTACTAAGACT-ACTG

 
In general the vast majority of these alignments of data against the reference can be described as a series of exact matches interspersed by single base mismatches, insertions and deletions.

Now, what does CIGAR stand for? It stands for Compact Idiosyncratic Gapped Alignment Record. Yes, it does sound like a backronym doesn’t it? But the “gapped alignment part” has probably given you a good clue. The major operation above was in inserting gaps, either in the data or the reference, in order to get the two strings to align against each other.

A CIGAR string is simply the set of operations we did encoded succinctly. So, in the case above, the CIGAR is

7M2D4M1I4M

 

In the CIGAR we usually ignore single base mismatches and focus on the insertions and deletions (though the “extended CIGAR” is much more detailed and includes information about mismatches) and so we convert the gapless bits to “M”, the number indicating how many letters (bases) form a stretch of gapless alignment. This is interspersed with “D”s and “I”s to indicate deleted subsequences and inserted subsequences.

And now you know.

Development environment

While it can be an avenue for procrastination, searching for a proper development environment often pays off with greatly improved productivity in the long term. I had some finicky requirements. I wanted to maintain my own CMakeLists file so that the project was IDE/editor/platform independent. I wanted to be able to just add files to the project from anywhere – from VI, by OS copy, by git pull – (I set wildcards in the CMakeLists.txt file) – and definitely not have to fiddle in the IDE to add files. I wanted to edit markdown files which I was using for documentation, and placing in a separate docs folder. I wanted to see a decent preview of the markdown occassionally. Ideally, my environment would simply track my directory organization, including non C++ files. However, I also wanted intelligent code completion, including for my own code.

At work, where a ton of my current stuff is in Python, I happily use Pycharm which has been a very worthwhile investment. So CLion was a natural candidate to start with. I didn’t get very far, because I balked at the license terms. It felt a little annoying to have to apply for an open source license and then renew every year and so on. When I want to play with red-tape I talk to the government.

I completely understand needing to charge people for your software. It’s a necessity for maintenance and continued development. But you can not get money from people who have none. To me the rational model is to charge professional entities for the software, to make it free to use for non-professional use and just not bother with the small fry. Basically, have a smallish legal department that goes after a big player if they violate. For companies and individual professionals a software license is the least of their costs (and probably the item with the highest ROI), and they will gladly pay to have some one to go to for support. Chasing down individual hobbyists just doesn’t make sense. I’m sorry, where did that soapbox come from? I must have stepped on it by accident. Moving on to the next editor/IDE …

I tried XCode next. It was a medium-ish install, and I think I needed an iTunes account. I was encouraged to use it because CMake can generate XCode projects and many coders working on Macs swore by it. I have used it infrequently – mostly when I was editing a largish Objective C project during my postdoc. I found it fine to use and debug with. It was fine, but I disliked how CMake organized the XCode project virtual directories, and I disliked how I had to manually add non-source and new files. And memories of the over complicated build system started to come back to me.

I then found something called CodeLite. It was confusing because I spent some time trying to figure out if it was related to Code::Blocks. CodeLite looked kind of promising. There was a nice tutorial on how to integrate with CMake based projects. But it looked cumbersome, and I had to add files to the project through the interface. By now, I was willing to endure this but I couldn’t get over the fact that there was no preview for Markdown. I liked that they limited the number of plugins, but I really wanted markdown preview. I told you I was finicky.

Then, quite by accident I started to poke around Visual Studio Code and downloaded it to try it out. In brief, it seems to satisfy all my finicky requirements. This choice rather surprised me, because it was just not on my radar. I also realize this does nothing to advance my attempt to get into the cool kids programming club where people use vim or emacs, but that was a lost cause anyway. (I now recall that I initially avoided VSC because a colleague spoke out very strongly against it a few months ago. Goes to show, you need to try things out for yourself)

CMake and XCode

Oddly enough, though I’ve been working on Macs for eight years now, I haven’t used Xcode a lot and in the past I used Bloodshed C++ and eclipse. I also have fond memories of using vi and make with my IDE being a cluster of four terminals (code, compile, debug and run output, I think they were. Or maybe one was pine). Also, the fondness is perhaps an effect of nostalgia.

Since I’m using CMake I looked around for ways to make XCode work with it. CMake will generate an xcode project with the command -G XCode but the structure of this project looked atrocious and I wondered what I was doing wrong. This post by John Lamp gives some nice details about how and why the generated XCode project is so structured.

One trick I learned (Lamp mentions it but I really paid attention when I watched this video) was to add the header files to the sources list as well, in the CMake project. After I did this the organization of the files in the XCode Project view became more sensiple – XCode knows how to split up header and source files.

Also, before you open up XCode, add the build directory to .gitignore. When XCode tries to manage your versioning, it seems to want to add just about everything to version control by default, and the build directory is an un-holy mess.

Cmake tutorial, slightly easier than the official one.

 

Starting a project

It’s that time of the decade again – to start a new hobby programming project that will probably span a few years, given that it’s a side project. So I went through the usual decision tree and this time I decided to document it for my future self.

Picking a language

I really surprised myself. I picked c++. A year or so ago I would have told you I was going to use Haskell or Common Lisp for new projects. I’m pretty proficient in Python and so I wouldn’t pick that. But C++??? Now, I’ll still be doing coding challenge type tasks in common lisp this year, and next year will be the year of Haskell, but every time I touch one of these I get fingers bitten off. There is always something easy I want to do that breaks things and the solution, often from practitioners in the field, seems to be pretty messy.

Python, funnily enough, is the closest I’ve come to an ideal language for practical use. I don’t use custom defined classes very much and it’s only fake functional (it comes out in the major – no tail recursion – and in the minor – try and use lambdas with the multiprocessing module) and it can be astoundingly slow for compute intensive tasks that can’t be cast as numpy operations. But you can’t beat it in time it takes to go from a blank file to a working, self-documenting, fairly bug free program where you can often write very complex things very concisely and elegantly.

So why C++?? C++14, to be precise, drew me back from these REPL oriented languages that make prototyping so fast and easy. In a nutshell I think the standard library has turned into quite a nice thing and things like CATCH and Cmake make medium sized projects easier to manage. I anticipate an intense amount of numerical computation, potentially large amounts of generated data and an attendant desire to control the type of the data and how it’s laid out. Common Lisp, Python and, I suspect, Haskell aren’t so bothered by such low level details. I suspect I’d be able to quickly prototype what I want to do in those languages, but then I’d get into a hellhole trying to remove the bottlenecks.

Build system: Cmake: I’ve used this before and found it manageable
Command line parser: cxxopts – one file (just header)
Test system: CATCH – I’ve used this before and I like how easy it is.

One thing I’m excited to experiment with is Cling – allegedly a REPL for C++. We’ll see …

Picking a version system

This was easy. git. I toyed briefly with using this as an excuse to learn mercurial, but I decided to reduce the number of moving parts for this particular project. Mercurial will have to wait.

Picking a license

I’ve always used the GPL. This time I picked the MIT license. There were a couple of funny thoughts that went through my head as I was picking the license. Funny because they reveal very deep (but perhaps not so serious) desires that I don’t often reflect on. One dominant thought was, what if this becomes widely popular (yeah, sure) and people start to capitalize on this but I don’t get any of the upside (fame/money). You’d probably want to use the GPL to block people like that.

And so I went with the MIT license, because almost everything I know, I know because random people on the internet unconditionally shared their knowledge with me, and almost every tool I use has been unconditionally granted to me by random people on the internet.  The GPL started to sound a little mean spirited. A little too over bearing. The MIT license, on the other hand, sounded more in the hacker spirit. Here is some code. I throw it to the wind. If you find it useful send, me a post card. Just don’t sue me. Leave the lawyers out of this.

Picking a host

github, because of the network effects, and I’m using git. And to be fair, github has made a lot of effort to make the site feature full. I do hope it does not end up like sourceforge. Remember when sourceforge was where we put all our projects?

Picking a documentation system

Text files in markdown. Algorithm design notes in the code files. Because everything else gets out of date and becomes a giant interdependent mess that you deal with only if this is the day job.

Work style

Ok, this one I’m gonna get right (yeah, sure). I’m starting with the specifications, then the user manual, then the tests, then the code. This is not a magic formula, and it doesn’t stop me from skipping around, but this set of practices has always seemed more logical and a lot less irritating to me. Code is living, you always go back and change things, but as long as you follow this order when you change things, sanity prevails.

Off we go …

A run-in with the garbage men

When you write and run a computer program you are processing data. You load data, create more data, process it and so on. All this data takes up memory on your computer. Without any other action the memory starts to look like a kid’s bedroom – all toys, and stale clothes, and empty cans and magazines and books everywhere. There are typically two ways programming languages deal with this clutter.

One is the “clean your room or I’ll blow up the house” method. I realize this is not what parents say, but this is what happens in languages like C++ where you – the programmer – are responsible for cleaning up after yourself. If you don’t your program crashes and burns and exits out with a nasty error message. If your operating system is not a grown up operating system, it often freezes up your whole computer.

The other is automatic memory management, which is a bit like having a housekeeping service. Common lisp is a garbage collected language. Which means that the runtime of the language takes care of the clutter for you. The service sends over a person who goes through your room and tidies up. As you can imagine, this is a lot easier for you, since now you don’t have to worry about all this low level housekeeping stuff and can concentrate on the fun things in life/computer science.

So imagine my surprise when I ran the following bit of code and my house blew up (program memory space, not my house literally, but you’ve caught on)

(defun random-string (length)
  (with-output-to-string (stream)
    (let ((*print-base* 36))
      (loop repeat length do (princ (+ (random 26) 10) stream)))))

(defun insert (count)
  (let ((ht (make-hash-table :test 'equal)))
    (loop
       repeat count
       do (setf (gethash (random-string 30) ht) 1))))

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (insert ins-count)))

(gc-test 20 1000000 nil)

I wasn’t actually doing this exact thing, but the principle was the same: I was inserting keys into a hash table which got larger and larger, and then I discarded the hash table. After a few cycles of this BAM! my program crashed and SBCL printed out the delightfully campy:

Heap exhausted, game over.

Common Lisp’s handy (room) function prints out the memory usage of the program, and when I instrumented my program with it I saw this:

no-forcing

Now you can see that up to the 4th iteration housekeeping kind of keeps up, though there is this funny sawtooth pattern where housekeeping slacks off on the first pass, does a thorough job on the second, slacks off on the third, catches up on the fourth. But then, the union goes on strike! Housekeeping refuses to show up and the dirty pizza boxes start to pileup and then the floor collapses. Contrary to what your friends working in sales will tell you, not all graphs that rise exponentially on the right hand side are good.

SBCL has a neat function

(sb-ext:gc)

that allows you to manually call for housekeeping, and I modified my program to do this each time I discarded the hash-table, and when I reran the program I saw this:

with-forcing

So, manually forcing the garbage collector to run does do the trick but it made me a little disappointed with SBCL’s implementation. I suspect commercial Lisp compilers do a better job. If any one has a guess as to why the GC fails in this case, please drop a line.

In case you are interested, the function, instrumented with (room) and with the garbage collection forcing looks like this

(defun gc-test (loop-count ins-count &optional (nudge-gc nil))
  (loop
     repeat loop-count
     do (progn
	  (insert ins-count)
	  (when nudge-gc (sb-ext:gc :full t))
	  (room nil))))

(gc-test 20 1000000 nil)