Thursday, February 14, 2013

Additional thoughts on software design: keeping it simple

There are some aspects of software design that are quite difficult to teach and even more difficult to learn, like elegance in design.  I remember helping a colleague with a subroutine to randomly permute a list--just like shuffling a deck of cards.  He had this enormous subroutine that generated random numbers and then checked them against the ones it had already generated.  I just shook my head.  It's counter-intuitive--at least until you figure it out--but the best way I've found is to generate a list of random numbers, sort it, and then use the resulting rearrangement (the sort indices) to permute the list.  In the language we were working with (Interactive Data Language) it's a one-liner.
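
In C++ the same trick takes only a few lines.  A minimal sketch (the container and generator choices are mine, not anything from the original code):

```cpp
// Sketch of the shuffle-by-sorting trick: attach a uniform random key to each
// element, sort on the keys, and the elements come out in random order.
#include <algorithm>
#include <cstdio>
#include <random>
#include <utility>
#include <vector>

int main() {
  std::vector<int> list = {1, 2, 3, 4, 5, 6, 7, 8};
  std::mt19937 rng(42);
  std::uniform_real_distribution<double> uni(0., 1.);

  // pair each element with a random key
  std::vector<std::pair<double, int> > keyed;
  for (int x : list) keyed.push_back(std::make_pair(uni(rng), x));

  // sorting on the random keys permutes the payload
  std::sort(keyed.begin(), keyed.end());

  for (size_t i = 0; i < keyed.size(); i++) printf("%d ", keyed[i].second);
  printf("\n");

  return 0;
}
```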

I don't remember having any trouble coming up with this, probably because I've worked with simulated random numbers for so long.  I've also used sorting for a more unusual application: calculation of equivalent latitude.  Check out the Wikipedia page (which I wrote).  I think the algorithm for calculating equivalent latitude is the epitome of elegance, and I've discovered a number of very similar tricks for quite different applications.
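
For the curious, here is a rough sketch of the sorting trick for equivalent latitude, assuming gridded tracer values with known cell areas and the usual definition A = 2*pi*R^2*(1 - sin(phi_e)), where A is the area over which the tracer exceeds the contour value:

```cpp
// Sketch of equivalent latitude by sorting, for a pole-centred tracer such as
// potential vorticity: sort the gridded values in descending order, accumulate
// the corresponding cell areas, and convert each enclosed area to a latitude
// via A = 2*pi*R^2*(1 - sin(phi_e)).
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// q[i] is the tracer value in grid cell i and area[i] its area; returns the
// equivalent latitude (in radians) of the contour passing through each cell.
std::vector<double> eq_lat(const std::vector<double> &q,
                           const std::vector<double> &area, double R) {
  const double pi = 3.141592653589793;
  size_t n = q.size();

  // sort cell indices by descending tracer value
  std::vector<size_t> idx(n);
  std::iota(idx.begin(), idx.end(), 0);
  std::sort(idx.begin(), idx.end(),
            [&q](size_t i, size_t j) { return q[i] > q[j]; });

  std::vector<double> lat(n);
  double A = 0.;                          // area enclosed by the contour so far
  for (size_t k = 0; k < n; k++) {
    A += area[idx[k]];
    double s = 1. - A/(2.*pi*R*R);        // sine of the equivalent latitude
    if (s > 1.) s = 1.;
    if (s < -1.) s = -1.;
    lat[idx[k]] = std::asin(s);
  }
  return lat;
}
```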

Not to say that I haven't created my share of bloated code.  I remember once writing a piece of code to find local maxima in a gridded, two-dimensional field.  What I came up with involved dividing the data values into several ranges and using a domain-filling algorithm (not of my own design).  It was neither particularly accurate nor particularly fast.  A simple gradient-ascent algorithm is far faster (I know because I eventually rewrote it), or you can simply check each point against its eight nearest neighbours.  I recall having tried both at the time, but couldn't get either working.  That's the other lesson: sometimes it takes patience to ferret out all the bugs in a piece of code.  It also takes confidence that a particular approach will work.
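
For reference, the eight-nearest-neighbour check really is only a few lines; a minimal sketch (ties count as non-maxima, which is exactly the kind of detail that needs patience to get right):

```cpp
// The eight-nearest-neighbour test for local maxima in a gridded,
// two-dimensional field z[i][j] (interior points only; ties count as
// non-maxima).
#include <utility>
#include <vector>

std::vector<std::pair<int, int> >
local_maxima(const std::vector<std::vector<double> > &z) {
  std::vector<std::pair<int, int> > maxima;
  int nx = z.size();
  int ny = nx > 0 ? z[0].size() : 0;
  for (int i = 1; i < nx-1; i++) {
    for (int j = 1; j < ny-1; j++) {
      bool is_max = true;
      for (int di = -1; di <= 1 && is_max; di++) {
        for (int dj = -1; dj <= 1; dj++) {
          if (di == 0 && dj == 0) continue;
          if (z[i+di][j+dj] >= z[i][j]) { is_max = false; break; }
        }
      }
      if (is_max) maxima.push_back(std::make_pair(i, j));
    }
  }
  return maxima;
}
```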

Another particularly embarrassing piece of bloatware I've written involved binning measurement locations from satellite data in the process of interpolating them to a regular grid.  The program did work quite well, but I'm pretty sure the binning actually made the algorithm slower.  I'm not certain of that since I haven't rewritten it yet.  Meanwhile, the indices it generated were gigantic.  (I'm still convinced that some type of binning can produce significant speed gains in nearest-neighbour-type problems, but I haven't yet been able to come up with the goods.)

Thursday, February 7, 2013

Stuff that I'm working on

I started this article some time ago, so some of the stuff is already out of date. It's quite long as well, so be warned...

New software release and planned improvements

The Peteysoft Foundation has released a new version of the libagf software and the libpetey scientific library it requires. The changes are fairly minor, but nonetheless important. The Runge-Kutta integrator now has a thread-safe version, and the function pointer passed to this routine is compatible with the GSL. New routines have been introduced, including a symbol table class and a "smart" pointer class. More significant improvements are detailed in the next two sections, followed by a number of exciting planned developments.

k-least algorithm

The kleast subroutine selects the k smallest elements in an array. Past versions of this algorithm were based on a binary tree and have a theoretical average time efficiency of O(n log k), although tests suggest that the actual running time is somewhat better, closer to n + k log k [1]. This is the theoretical optimum for a k-least algorithm that returns its output in sorted order, as one based on a binary tree necessarily does. According to Knuth, if the output need not be sorted, the efficiency can be as good as O(n) [2]. This can be achieved through a modified quicksort, but not through a heap, which again sorts the output. A new, faster k-least algorithm based on a quicksort has been implemented and applied in the libagf library both to pre-select the data and to find the k-nearest-neighbours.
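
As an illustration of the partition-based approach, here is a minimal sketch using std::nth_element from the standard library; this is not the libagf implementation, just the same idea in its simplest form:

```cpp
// Select the k smallest elements of an array in O(n) average time using a
// quicksort-style partition (std::nth_element).  The result is NOT sorted;
// sorting it afterwards costs an extra O(k log k).
#include <algorithm>
#include <vector>

std::vector<double> kleast(std::vector<double> data, size_t k) {
  if (k > data.size()) k = data.size();
  // partition in place so that the k smallest values occupy data[0..k-1]
  std::nth_element(data.begin(), data.begin()+k, data.end());
  return std::vector<double>(data.begin(), data.begin()+k);
}
```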

Solving for the filter width

The old algorithm for solving for the filter width that yields a given total of the weights, W, was based on repeatedly squaring the weights until the total passed the target value. The final result was found by exponential interpolation between the last value and the previous one. The purpose of this excessively clever method was to limit the number of exponentials computed--the number is fixed at two per weight. Unfortunately, it was quite finicky, and improper initialization or certain sample distributions (such as duplicate samples) would cause it to hang. To address these issues, it was replaced with nothing more than a root-finding algorithm, although a fairly sophisticated one. This is the same one used for searching out the class borders: a third-order interpolation method that brackets the root. I call this root-finding method "supernewton" [1].
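
To make the idea concrete, here is a minimal sketch of solving for the filter width with a bracketed root-finder; plain bisection in log h stands in for the third-order method, and the weights are assumed to be Gaussian functions of the squared distances d2[i]:

```cpp
// Solve for the Gaussian filter width h such that the total of the weights,
//   W(h) = sum_i exp(-d2[i]/(2*h*h)),
// matches a target value.  W increases monotonically with h, so any bracketed
// root-finder will do; plain bisection in log(h) stands in for the third-order
// method here.
#include <cmath>
#include <vector>

double total_weight(const std::vector<double> &d2, double h) {
  double W = 0.;
  for (size_t i = 0; i < d2.size(); i++) W += std::exp(-d2[i]/(2.*h*h));
  return W;
}

// assumes total_weight(hmin) < Wtarget < total_weight(hmax) and hmin, hmax > 0
double solve_filter_width(const std::vector<double> &d2, double Wtarget,
                          double hmin, double hmax, double tol = 1e-8) {
  for (int iter = 0; iter < 200; iter++) {
    double h = std::sqrt(hmin*hmax);            // bisect in log(h)
    if (total_weight(d2, h) < Wtarget) hmin = h; else hmax = h;
    if (hmax - hmin < tol*hmax) break;
  }
  return std::sqrt(hmin*hmax);
}
```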

While these changes should make the software faster and more robust, they are rather routine and somewhat boring improvements. What follows are some more interesting ideas, in order of interest and complexity. Of course, what makes an idea interesting often also makes it complex and difficult to implement.

The inverse error function and the superNewton algorithm

The inverse error function is quite useful: for a given integrated probability and Gaussian statistics, it tells you how far from the centre of the distribution you need to be. So, for instance, if you have n normally distributed random samples, you can use the inverse error function to figure out approximately where the most distant sample should lie. Unfortunately, most math libraries don't include a numerical implementation of it. I figured, no problem, I'll use my powerful numerical inverse method "superNewton" to implement my own. Rather than giving me a simple, efficient math function in record time, it revealed some rather striking limitations in the algorithm as currently implemented. Most importantly, it tends to get stuck re-bracketing the root on the same side over and over. This could probably be fixed quite easily, at the cost of some extra book-keeping: simply track how many times the estimated root has landed on the same side of the actual root, as well as the relative distance between the current left and right brackets. If these values get too high, the estimated root can be shifted to the opposite side of the actual root; I would suggest doubling the difference between the estimated root and the closest bracket.

Another issue is the tolerance of the root. This is exacerbated by the problem just discussed: very tight x-tolerances can't be used. If we use y-tolerances instead, then for the error function they clearly must be varied depending on the value of x. For a root-finding algorithm, the idea of "relative tolerance" doesn't have much meaning [3].
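
Here is a sketch of what I was after; simple bisection with a y-tolerance stands in for superNewton, and erf is taken from the standard math library:

```cpp
// Inverse error function by bracketed root-finding on y = erf(x), using plain
// bisection with a tolerance on y; erf is monotonic, so the bracket [x1, x2]
// always contains the root.
#include <cmath>

double inverf(double y, double ytol = 1e-12) {
  // erf maps (-6, 6) to very nearly (-1, 1), which is good enough for a bracket
  double x1 = -6., x2 = 6.;
  for (int iter = 0; iter < 200; iter++) {
    double x = 0.5*(x1 + x2);
    double err = std::erf(x) - y;
    if (std::fabs(err) < ytol) return x;
    if (err < 0) x1 = x; else x2 = x;
  }
  return 0.5*(x1 + x2);
}
```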

Cross-validation of PDF estimates

Cross-validation, or n-fold cross-validation, is a simple and reliable way of testing the accuracy of a supervised statistical classification algorithm. Something similar would be nice to have for probability density function (PDF) estimates as well. Unfortunately, unlike in a classification problem, the training data for a density estimation problem does not contain the actual values of the PDF, and the underlying PDF is frequently not known. This can be solved quite readily by adding an extra step. A new dataset with a distribution equivalent to that of the existing training data is first simulated, using the same method as in a Metropolis Monte Carlo [3]. This dataset is not used for PDF estimation; rather, the original dataset is randomly split up, PDFs are calculated for each portion at the locations in the simulated dataset and then compared, e.g. through cross-correlation. Note that the parameters used (k for KNN, W for AGF) will need to be scaled up in proportion (i.e. by n if this is considered a form of n-fold validation) when using the full dataset.

This has now been implemented in version 0.9.6.
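
A rough sketch of the comparison step in one dimension, assuming the simulated test points have already been generated and using a plain fixed-bandwidth Gaussian kernel estimator as a stand-in for AGF or KNN:

```cpp
// Cross-validation of PDF estimates, one-dimensional sketch: split the
// training data in two, estimate the density of each half at a set of
// simulated test points, and compare the two sets of estimates through their
// correlation coefficient.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// stand-in density estimator: fixed-bandwidth Gaussian kernel estimate
double kde(const std::vector<double> &x, double x0, double h) {
  const double pi = 3.141592653589793;
  double sum = 0.;
  for (size_t i = 0; i < x.size(); i++)
    sum += std::exp(-(x[i]-x0)*(x[i]-x0)/(2.*h*h));
  return sum/(x.size()*h*std::sqrt(2.*pi));
}

// Pearson correlation between two sets of density estimates
double correlate(const std::vector<double> &a, const std::vector<double> &b) {
  double ma = 0., mb = 0., sab = 0., sa = 0., sb = 0.;
  size_t n = a.size();
  for (size_t i = 0; i < n; i++) { ma += a[i]/n; mb += b[i]/n; }
  for (size_t i = 0; i < n; i++) {
    sab += (a[i]-ma)*(b[i]-mb);
    sa  += (a[i]-ma)*(a[i]-ma);
    sb  += (b[i]-mb)*(b[i]-mb);
  }
  return sab/std::sqrt(sa*sb);
}

// data: training samples; test: points simulated from the same distribution
double cv_pdf(std::vector<double> data, const std::vector<double> &test,
              double h) {
  std::shuffle(data.begin(), data.end(), std::mt19937(42));
  size_t half = data.size()/2;
  std::vector<double> d1(data.begin(), data.begin()+half);
  std::vector<double> d2(data.begin()+half, data.end());
  std::vector<double> p1, p2;
  for (size_t i = 0; i < test.size(); i++) {
    p1.push_back(kde(d1, test[i], h));  // note: with half the data, the
    p2.push_back(kde(d2, test[i], h));  // bandwidth (or W, k) may need rescaling
  }
  return correlate(p1, p2);
}
```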

Binning

There are so many applications for an efficient, multi-dimensional binning algorithm that I don't even know where to begin. For instance, in the surgery step of contour advection with surgery, a nearest-neighbour search must be performed for every pair of points to check whether they are too close. Kernel estimators are more efficient if the samples used are restricted to a subset near the centre of the kernel function. Binning can also be used in collocation routines, for instance when building satellite inverse methods.

In the absence of binning, the time efficiency of finding a set of k nearest neighbours is of order n, where n is the number of samples. With binning, we can do one of two things: we can either reduce the coefficient, or we can get it down to order k log n.

In the first case, the standard rectangular binning method, in which each dimension is evenly divided into several blocks, would apply. Ideally, the bins should be large enough that most of them contain several samples, but small enough that the number of samples in each bin is significantly smaller than the total. As the number of dimensions increases, the number of bins becomes unmanageably large, while the fraction occupied gets smaller. I have designed binning routines of this type for application over the surface of the globe. Oddly, they don't provide any speed advantage over simply running through all the samples. Either there is too much book-keeping and overhead, or there are optimizations implemented in the processor, the compiler or both that already provide a similar advantage.

An interesting solution to the multiplication of bins at higher dimensions and the presence of empty bins is to discretize each of the coordinates based on the bin size. Each coordinate of each sample is reduced to an integer bin index, and the resulting lists of integers are sorted just as you would sort any other integers. The storage requirement is typically no more than double the original. If the bin size is set so that the average number of samples per bin is very low, on the order of one, then we can build up the number of bins to check in a concentrically expanding hypersphere until we have the required number of samples. While this idea looks good on paper, the required amount of book-keeping and overhead is far more than that of simply searching through the samples one by one, especially since finding a given bin is now a log n operation.
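
A sketch of the idea in two dimensions, stopping at the point where a single bin is located (which, as noted above, is already a log n operation):

```cpp
// Sketch of sparse rectangular binning in two dimensions: discretize each
// coordinate by the bin size, sort the samples on their integer bin indices,
// and locate a given bin by binary search.  Only occupied bins are stored, so
// storage stays proportional to the number of samples, not the number of bins.
#include <algorithm>
#include <cmath>
#include <vector>

struct binned_sample {
  long bx, by;          // integer bin indices
  int ind;              // index of the original sample
};

bool operator<(const binned_sample &a, const binned_sample &b) {
  if (a.bx != b.bx) return a.bx < b.bx;
  return a.by < b.by;
}

std::vector<binned_sample> bin_samples(const std::vector<double> &x,
                                       const std::vector<double> &y,
                                       double binsize) {
  std::vector<binned_sample> bins(x.size());
  for (size_t i = 0; i < x.size(); i++) {
    bins[i].bx = (long) std::floor(x[i]/binsize);
    bins[i].by = (long) std::floor(y[i]/binsize);
    bins[i].ind = (int) i;
  }
  std::sort(bins.begin(), bins.end());  // samples in the same bin become contiguous
  return bins;
}

// all samples falling in bin (bx, by): a log n binary search plus a short scan
std::vector<int> samples_in_bin(const std::vector<binned_sample> &bins,
                                long bx, long by) {
  binned_sample key = {bx, by, 0};
  std::vector<int> result;
  auto range = std::equal_range(bins.begin(), bins.end(), key);
  for (auto it = range.first; it != range.second; ++it) result.push_back(it->ind);
  return result;
}
```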

Both these examples use a somewhat naive rectangular binning strategy. A more sophisticated technique would adapt the bins to the actual distribution of the data--something like a KD-tree. I have yet to try these out: coding one up would be far from trivial, while I have a distinct aversion to black boxes, so I tend to use pre-compiled libraries rather sparingly. The KD-tree also suffers from an excess of book-keeping, since even searching for a single nearest neighbour requires checking every adjacent bin. The number of adjacent bins grows, of course, with the number of dimensions, and they often lie on very different branches of the tree. Searching for k nearest neighbours or for all neighbours within a radius would be even more problematic.

Optimal AGF

The current version of Adaptive Gaussian filtering is quite useful and has a reasonable justification. We can, however, write an expression for the error of a kernel density estimator as a function of the bandwidth. There are two terms, a variance term and a bias term:

e^2 = (P/(n h^D))^2 + ((h^2/n) ∇^2 P)^2

calculated, respectively, in the limit as the filter width becomes very small and in the limit as it becomes very large. Here, P is the actual density function, n is the number of samples, h is the bandwidth and D is the number of dimensions. The actual density function, P, is of course not known, but there is no reason why we cannot fit for it through repeated estimates using different bandwidths. Since e = P - f, where f is the estimated density, we can rearrange the above equation as follows:

n^2 (f - P) = (1/h^D - n^2) P^2/f + h^4 (∇^2 P)^2/f

Since we have several different samples, we can construct a system of equations by substituting in values for f and h. The constant term, P, on the left-hand side can be eliminated by subtracting different equations in the system:

(1/h^D - n^2)(1/f_i - 1/f_j) P^2 + (h_i^4/f_i - h_j^4/f_j) (∇^2 P)^2 = n^2 (f_i - f_j)

so in theory we need a minimum of three samples (three different bandwidths) to solve for P and ∇^2 P. In practice, the method should be more stable if we take more than the minimum number of samples and perform a least-squares fit.
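
For example, with estimates f_1, f_2, f_3 computed at three different bandwidths h_1, h_2, h_3, taking the pairs (i,j) = (1,2) and (i,j) = (2,3) in the equation above gives a two-by-two linear system in the unknowns P^2 and (∇^2 P)^2, where A_ij and B_ij stand for the two bracketed coefficients in that equation:

A_12 P^2 + B_12 (∇^2 P)^2 = n^2 (f_1 - f_2)
A_23 P^2 + B_23 (∇^2 P)^2 = n^2 (f_2 - f_3)

With more than three bandwidths the same construction gives an over-determined system that can be solved by least squares.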

I recently took a second look at this project--I've had the method implemented for some time but it didn't work. I've now got it returning reasonable results for the optimal filter width, at least when it works at all, though not if you take the P^2 factor from the above equation. Part of the problem is that the matrix is very ill-conditioned--the two variables we are solving for are separated by many orders of magnitude, as are the different elements of the matrix.

Sparse matrix calculator

I have recently put out a paper that describes a powerful dynamical tracer reconstruction method called principal component proxy analysis [4]. The method is based on decomposing a matrix representation of the tracer dynamics using principal component analysis (PCA). Each PC is then regressed with a series of (typically sparse) tracer measurements and the entire field thus reconstructed. The manuscript was sent to Atmospheric Environment but was unfortunately rejected.

The reviewers wanted, first, the manuscript to be fleshed out somewhat, and second, a better way of determining the parameters of the method: how many principal components to use and how long a lead time to use. I am attempting something more general here: to build a simple sparse matrix calculator so that I can perform a series of numerical experiments with the matrices generated by the tracer simulation. I'm hoping this will help me better understand just how the method works: for instance, how is it possible to correlate measurements from before the matrix was integrated and still get good results?

The sparse matrix calculator has already been implemented and has even passed a few very basic tests. I still need to test it with actual, practical matrices like those from the tracer simulation, and of course it needs many refinements and features to make it complete, but the basic idea is there and it's running. At the moment I'm working on refactoring the entire trajectory suite and I'd like to finish that first before continuing with this.

Multi-class classification with AGF borders

If you have a discrimination border, this is only good for discriminating two classes. If you want to be more discriminating, you need more than one border. This is something I've been planning for a long time, but have made very little progress on. The problem is that there is no one-size-fits-all solution. You can divide up the classes in many different ways and still get usable results, and the best way to divide them may be highly problem-dependent. For instance, in the case of a discretized continuum variable, you'd probably want to make sure that all the borders separate only "adjacent" classes, that is, ranges that are next to each other in the discretization scheme. On the other hand, if there are four classes but they can be reduced to two overlapping sets of two classes, then that in itself tells you how to divide them up.

There are two major ways you can divide up the classes: in the first case so that all of the conditional probabilities can be solved for, and in the second case so that only the conditional probability of the "winning" class is returned. In the first case, you need at least (n-1)/2 borders--that is, for each border you can write two equations relating the conditional probabilities, plus one more for the entire system (they sum to unity). In the second case, the method is hierarchical: divide the set of classes into two groups, divide each of those groups into two more, and so on. Consider the case of dividing up four classes:

x | o
-----
+ | *

The four classes are x, o, + and *. In the first case, we think of this diagram as representing two borders: one horizontal and one vertical. In the second case, consider first a horizontal border and then two vertical borders, one on each side of it. It stands to reason that the two methods can also be combined.
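
As a sketch of how the hierarchical (second) case might be organized--this is a generic illustration, not a description of the libagf design--each border sits at a node of a binary tree and classification descends until a single class remains:

```cpp
// Generic sketch of hierarchical multi-class classification with two-class
// borders: each node of a binary tree holds a border that splits the remaining
// classes into two groups, and classification descends until a single class is
// left.
struct border_node {
  // returns -1 for the "left" group, +1 for the "right" group
  int (*classify_binary)(const double *x, int n_dims);
  border_node *left, *right;   // null at the leaves
  int class_label;             // valid only at a leaf
};

int classify(const border_node *node, const double *x, int n_dims) {
  if (node->left == nullptr && node->right == nullptr) return node->class_label;
  if (node->classify_binary(x, n_dims) < 0) return classify(node->left, x, n_dims);
  return classify(node->right, x, n_dims);
}
```

For the four-class example above, the root node would hold the horizontal border and its two children the vertical borders.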

References

[1] Peter Mills (2011). "Efficient statistical classification of satellite measurements." International Journal of Remote Sensing, 32 (21): 6109-6132.

[2] Donald Knuth (1998). The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd edition. Addison-Wesley.

[3] William Press, Saul Teukolsky, William Vetterling, Brian Flannery (1992). Numerical Recipes in C, 2nd edition. Cambridge University Press, Cambridge.

[4] Peter Mills (2012). "Principal component proxy tracer analysis." arXiv:1202.2194.

[5] Peter Mills (2006). "The influence of noise on a classical chaotic scatterer." Communications in Nonlinear Science and Numerical Simulation, 11 (8): 899-906.