Sunday, July 13, 2014

Multi-borders software: reflections

It's been several weeks since I released the "multi-borders" multi-class classification software.  Along with the software, I also submitted this paper to the Journal of Machine Learning Research: Machine Learning Open-Source Software (JMLR-MLOSS) stream, but sadly it was rejected.

Actually, I'm quite glad the paper was rejected, as there are issues with the software, not least of which is that it's not as accurate as it could be.  This is discussed in the paper: the eight-class test case returns an accuracy of about 56% for AGF without borders training, 54% using the "hierarchical" method (essentially a decision tree), and only 52% using the "non-hierarchical" method.  It's the non-hierarchical method that's the most interesting and the one I'll be discussing here.  It uses a matrix inversion (least squares, actually) to solve for the conditional probabilities of the classes based on the "raw" probabilities from each of the binary classifiers.  That is, if P is the vector of conditional probabilities for each of the classes and R is the vector of differences between the conditional probabilities from the binary classifiers, then there is a mapping, A, from one to the other, and we solve the following equation in a least-squares sense:

  A*P=R                    (1)
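To make this concrete, here is a rough numpy sketch of the inversion step.  The coding matrix A, the classifier outputs r and all of the variable names are made up for illustration; this is not the libAGF code or its interface:

  import numpy as np

  # Hypothetical coding matrix for 4 classes and 6 binary classifiers:
  # each row marks which side of that classifier's partition each class
  # falls on (+1 or -1); a 0 would mean the class is left out entirely.
  A = np.array([[ 1, -1, -1, -1],
                [-1,  1, -1, -1],
                [-1, -1,  1, -1],
                [-1, -1, -1,  1],
                [ 1,  1, -1, -1],
                [ 1, -1,  1, -1]], dtype=float)

  # "Raw" outputs R of the six binary classifiers for one test point
  # (differences of conditional probabilities; values made up).
  r = np.array([-0.75, -0.55, -0.45, -0.15, -0.35, -0.25])

  # Equation (1): solve A*P = R for the class probabilities P in the
  # least-squares sense.
  p, *_ = np.linalg.lstsq(A, r, rcond=None)
  print(p)    # estimated conditional probabilities, one per class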

At first I thought that errors were accumulating from the estimates of the conditional probabilities, which are themselves not all that accurate.  This may not be the case.

On the other hand, you can also solve for the class through "voting", which is how most other multi-class methods work.  Arithmetically, this can be expressed by multiplying the transpose of the mapping by the "raw" classes returned by each of the binary classifiers, where each class is expressed as one of {-1, 1}.  This method is more accurate, even when we use the difference in conditional probabilities in place of the class.  That is, we have:

  c=arg max A^T*R    (2)

where c is the class of the test point.  Unfortunately, this provides poor knowledge of the final conditional probabilities, which in the first method are very accurate. An obvious solution is to constrain (1) so that it always agrees with (2).  The least squares should probably be constrained anyway to keep each element of P between 0 and 1 and to ensure that the sum is always 1.
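Here is the same decision made by voting, reusing the made-up A and r from the sketch above; again, this only illustrates equation (2) and is not the libAGF interface:

  import numpy as np

  A = np.array([[ 1, -1, -1, -1],
                [-1,  1, -1, -1],
                [-1, -1,  1, -1],
                [-1, -1, -1,  1],
                [ 1,  1, -1, -1],
                [ 1, -1,  1, -1]], dtype=float)
  r = np.array([-0.75, -0.55, -0.45, -0.15, -0.35, -0.25])

  # Equation (2): each binary classifier "votes" for the classes on the
  # side it picked; using the continuous outputs r in place of hard
  # {-1, +1} decisions simply weights the votes.
  scores = A.T @ r
  c = int(np.argmax(scores))    # winning class, but no usable probabilities
  print(scores, c)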

In principle, the constrained minimization doesn't look that hard: if the different sets of active constraints (from 1 to N, where N is the number of classes) define sub-spaces, then it would be a matter of solving the minimization problem within progressively smaller sub-spaces until we find one whose solution satisfies the original constraints, judiciously pruning eligible sub-spaces as we go.  Adding the new constraint brought on by the "voting" solution would complicate this procedure somewhat.  In practice, it is easy to re-normalize the results from solving (1), so the constraints have so far been left out, and I doubt the constrained solution would be significantly more accurate than the unconstrained one.
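One simple way to do that re-normalization (a sketch only; I'm not claiming this is exactly what libAGF does) is to clip and rescale the least-squares solution:

  import numpy as np

  def renormalize(p):
      # Clip the least-squares estimate to [0, 1] and rescale so that the
      # elements sum to one; a cheap stand-in for a solver that enforces
      # the constraints properly.
      p = np.clip(p, 0.0, 1.0)
      total = p.sum()
      if total == 0.0:
          return np.full_like(p, 1.0 / p.size)    # degenerate case: uniform
      return p / total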

I find it really interesting that by basically flipping the whole equation around, we end up with something that works out about the same:

  arg max A^T*R ~= arg max A^(-1)*R

No question this is related to the properties of the matrix itself, and it should suggest better methods of solving the problem.
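One quick way to see how often the two rules actually agree is a small synthetic experiment.  The coding matrix, the Dirichlet-distributed class probabilities and the noise level below are arbitrary choices, not taken from the paper; if the two rules really do behave about the same, the printed fraction should be close to one:

  import numpy as np

  rng = np.random.default_rng(0)

  # Same style of hypothetical coding matrix as in the sketches above.
  A = np.array([[ 1, -1, -1, -1],
                [-1,  1, -1, -1],
                [-1, -1,  1, -1],
                [-1, -1, -1,  1],
                [ 1,  1, -1, -1],
                [ 1, -1,  1, -1]], dtype=float)

  agree = 0
  trials = 10000
  for _ in range(trials):
      p_true = rng.dirichlet(np.ones(4))              # random class probabilities
      r = A @ p_true + rng.normal(0.0, 0.1, size=6)   # noisy classifier outputs
      p_ls = np.linalg.lstsq(A, r, rcond=None)[0]     # inversion, equation (1)
      agree += np.argmax(p_ls) == np.argmax(A.T @ r)  # voting, equation (2)
  print(agree / trials)    # fraction of points where the two rules agree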

Of perhaps even more concern is that for all the multi-class problems I've tried so far, LIBSVM has handily outperformed libAGF.  At first I thought this made sense, since AGF borders approximates a curved class border with a piece-wise collection of linear borders, while a support vector machine analytically warps the space so that a curved boundary isn't even necessary.  But the difference persists even when using "raw" AGF with no borders training.  I may need to go back to the beginning and optimize the estimation of the conditional probabilities.

I confess I have been somewhat lax in testing this library.  Initially, I was content to try it on the synthetic dataset I had composed and on the particular problems I was trying to solve (discrete retrieval of humidity from satellite data and surface classification of Landsat images), because the speed advantage was just so astronomical that it seemed to make up for any other potential shortcomings, such as less-than-ideal accuracy.  In the problems I was working on, LIBSVM often either wouldn't converge or took so long to classify the test data that it just wasn't worth it.  Of course, for small training datasets the speed advantage of AGF evaporates.

The LIBSVM page itself has a very nice collection of datasets. I've messed around with a few of the binary datasets in the past and in retrospect I should've documented the results more thoroughly.  That should be the rule for just about every scientific project, even casual testing.  The ones I'm currently working on are "segment" and "satimage."

Another thing to note about SVM is that it optimizes the classification error.  As you know, AGF generates estimates of the conditional probabilities.  Interestingly, the conditional probability represents the maximum possible accuracy for estimates at that particular location in the feature space.  I used this property very early on to validate the conditional probability estimates by first binning them and then comparing the actual accuracies with the averages in each bin.  Thus, in theory, classifications based on the conditional probabilities are optimally accurate, but that assumes that the estimates themselves are accurate.  Unfortunately, AGF probability estimates are not optimized, at least not in any way that takes error rates into account.  This is something else I've been working on, mainly for probability density estimation.
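That binning check is easy to reproduce.  Here is a sketch, assuming we already have, for each test point, the estimated probability of the winning class and a flag for whether the classification was correct; the names p_max and correct are mine, not libAGF's:

  import numpy as np

  def calibration_check(p_max, correct, nbins=10):
      # Bin the estimated conditional probabilities of the winning class
      # and compare the mean probability in each bin with the fraction of
      # test points actually classified correctly; for well-calibrated
      # estimates the two columns should roughly match.
      edges = np.linspace(0.0, 1.0, nbins + 1)
      index = np.digitize(p_max, edges[1:-1])    # bin number for each point
      for b in range(nbins):
          mask = index == b
          if not mask.any():
              continue
          print("predicted %.3f   actual %.3f   (n=%d)"
                % (p_max[mask].mean(), correct[mask].mean(), mask.sum()))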