I have just released a new version of the
libagf machine-learning library. This is probably the biggest revision since starting the software project some seven years ago. The main addition is a generalization (called "multi-borders") of the "AGF borders" algorithm from binary to multi-class classifications. Rather than pick a single method and implement that (such as one-against-one or one-against-the-rest), which would've been quite trivial, I rather chose to generalize the whole business. For more details, please refer to the
paper I submitted to Journal of Machine Learning Research.
I think this is one of the first times I've created software without having some specific application in mind. I wrote it mainly for the challenge and because the library doesn't feel quite complete without it. Not to say that it won't be useful for many prediction tasks.
Here are a few thoughts on the creation of the new software.
On preparation
In preparing the "multi-borders" software, one thing that struck me was just how quickly I got it up and running. Perhaps this is just experience, but I think planning and preparation has a lot to do with it. I had a very good idea what I wanted to achieve and while I don't usually write much down, I have everything planned out in my head.
There are two components to the software: one for training and one for classification. At the training stage, all the software does is read the control file and output commands for the already-existing binary classification software and then output another control file for the classification stage. I got the first stage working long before the second, largely because testing is so simple: most of the input can be made up on the spot (i.e. the arguments don't have to refer to real files) and you just check the output for correctness. In both cases I spent maybe two to three weeks coding and got things up and running after three or four tries.
I object
I'm still not sure how I feel about the object-oriented programming paradigm. Usually I think it's just over-rated and anything that can be done using classes and objects can be done at least as well and with more flexibility using all of the features that made C so distinct from other languages of the time: pointers (especially function pointers and void pointers), unions and structures. The problem is because I'm still stuck in a very rigid mindset that says there's some kind of right way and wrong way of doing things that I haven't learned yet how to properly program in straight C.
Take the dreaded "goto"--everyone says it's bad, but it's less in simply using the goto as using it badly. In the
Elements of Programming Style, Kernighan and Plauger spend many pages showing how to clean up code with gotos. Gotos definitely have their place, such as breaking out of a series of nested loops or cleaning up after different types errors without a lot of repeated code. In a couple of programs I've written for the new software release (and elsewhere), I've demonstrated to my satisfaction how to write code that's just as hard to follow, using only structured programming constructs. Void pointers provide an even better mechanism for shooting yourself in the foot, but also unprecedented power in an otherwise strongly typed, compiled language.
As I mentioned above, there were two components to the software. The first part was written "the old-fashioned way" using standard, procedural programming. The second part was written with objects and polymorphism. Of the two, the latter was definitely the easier to get up and running. Perhaps this is the main benefit of objects: it doesn't make the language any more powerful, just makes the programs easier to think about and debug.
It can also produce code that looks very elegant, at least when viewed from a certain angle. The multi-class classifications are based on a hierarchical model: you divide the classes in two, divide them again and so on until there is only one class left. The class structure of course follows this logic as well. The elegant part comes in the fact that the same method is called all the way up the line with no break in the symmetry. But in order to achieve this it includes a very brain-dead construct: a one-class classifier (that is, a "one-class-classifier-class"), a fairly large piece of code that doesn't do anything except return the class label of the top-level partition.
Finally, here's how we might be able to do better than the O-O framework using the features of C. Here is the prototype for the re-factored subroutine for sampling the class borders from a set of training data in a binary classification problem:
template
nel_ta sample_class_borders(
real (*rfunc) (real *, void *, real *),
//returns difference in conditional prob. plus derivs
int (*sample) (void *, real *, real *),
//returns a random sample from each class
void *param, //these are just along for the ride
nel_ta n, //number of times to sample
dim_ta D, //number of dimensions
real tol, //desired tolerance
iter_ta maxit, //maximum number of iterations
real **border, //returned border samples
real **gradient, //returned border gradients
real rthresh=0); //location of Bayesian border
Yes, I do like to freely combine different elements from different programming paradigms: it's what make programming in C++ so much fun! I could've passed an object with two methods to this function, but chose instead to pass two function pointers and a void pointer. The normal method of calculating conditional probabilities requires an array of vectors (coordinate data) plus an array of class labels. You could encapsulate both in an object, but this seems a bit limiting: there are many other things you could potentially do with them: for instance you can pair the coordinate data with a different set of class labels, generated for example from a set of continuous ordinates. By having them only temporarily assigned to the parameter list the program becomes more flexible.
This is one thing that always annoyed me about O-O programming: the idea that all the data has to be hidden and then you end up with a whole bunch of pointless methods that do nothing but assign to or return a field. Sure in some languages (such as C++) you can scrap the data hiding, but then is it really object-oriented? Instead you end up with the older C-style programming paradigm with structures, pointers and unions.
Finally, suggesting that the object being passed to this method is simply a binary classifier obscures the true generality of the method. It is, in fact, a multi-dimensional root-finder.
rfunc could be any continuous, differentiable function that evaluates to both positive and negative values, not just a difference in conditional probabilities (i.e., a binary classifier). It makes more sense (at least to me) to represent it as a function to which you pass a set of parameters, rather than as an object class. Again, this to me is the beauty of the C++ language: you can choose amongst a multitude of programming paradigms most appropriate to the problem at hand.
Meta-programming
Another major addition to the program was a hierarchical clustering algorithm. I wrote this to help me understand the travelling salesman problem, as I've mentioned elsewhere. The program builds a dendrogram and then, using single-letter commands, allows you to interactively browse through it, assigning classes along the way. It didn't take me long to figure out that you can output commands as they're being input and then feed them into the program again. And from there, it's not a very big leap to see how you could write a program, that, instead of using the C++ objects directly to manipulate the dendrogram, uses the one letter commands from the interactive utility. This, although the language for the dendrogram browser is very simple, is nonetheless a type of
meta-programming.
It's also a technique I've used very little in the past. Perhaps in part because it's very easy to abuse. I remember once being in charge of maintaining a large, motley and rather finicky piece of software. In one part, the operations to be performed were first written to a file in a crude sort of language. Then, the file was read in again and the operations were performed -- achieving absolutely nothing in the process. I removed both pieces of code and rewrote them so that the operations were simply performed--with nothing in between!
Since learning lex/yacc and starting work on some of my own domain specific languages (or DSL, which itself is a very powerful concept) I'm starting to use this technique a bit more often. The training part of the multi-borders program, for instance, does very little except parse a control file. In the process, it prints out commands for training the binary classifiers that comprise the multi-class model. I could've run these directly using a system call, however because parsing the control file takes so little time, I chose to compile them into a script instead.