Saturday, December 13, 2014

Accelerate your LIBSVM classification models

I guess I finally have to admit it.  I'm not as clever as I thought.  The statistical classification algorithms I created, more or less from scratch, about nine years ago, are nowhere near as clever as a support vector machine (SVM).  And they may not even be as accurate either!

Yes, sad to say, LIBSVM is more accurate than libAGF.  I believe this is because it is maximizing the accuracy with respect to actual data whereas "adaptive Gaussian filtering" (AGF) is not.

About half a year ago I made a major addition to libAGF.  This was to generalize the pre-trained binary models to multi-class classifications.  I wrote a paper on my approach to the problem and submitted it to the Journal of Machine Learning Research (JMLR) as a "software paper."  Sadly it was rejected.  One of the objections the reviewer had was that the method couldn't be applied to binary classifiers other than the AGF "borders" classifier.  Actually it could, but only at the software level, not at the command-line interface level, and not easily.

So I set out to remedy this in libAGF version "0.9.8".  The software can now be combined with the two binaries from LIBSVM, svm-train and svm-predict, or anything that looks like them.  Whereas LIBSVM offers exactly one multi-class classification method, libAGF lets you choose from slightly more than nc! of them, where nc is the number of classes.

So let's say you've built a hybrid libAGF/LIBSVM model that's tailored exactly to the classification problem at hand.  Since we're ultimately still calling svm-predict to perform the classifications, you might find it a bit slow.

Why exactly is SVM so slow?  After all, in theory the method works by finding a linear discrimination border in a transformed space, so classifications should be O(1) (constant time) efficient!  Unfortunately, it's the caveat, "transformed space" that destroys this efficiency.

The truly clever part about support vector machines is the so-called "kernel trick."  With the kernel trick, we implicitly generate extra, transformed variables by performing operations on the dot product.  Consider a squared dot product of two, two-dimensional variables:

[(x1, x2) • (y1, y2)]²
= x1²y1² + 2 x1y1x2y2 + x2²y2²
= (x1², √2 x1x2, x2²) • (y1², √2 y1y2, y2²)

Clever huh?  The trouble is, unlike in this example, you usually can't calculate these extra variables explicitly.  By the same token, the discrimination border, which is linear only in the transformed space, is also neither calculated nor stored explicitly.  Rather, a series of "support vectors", along with matching coefficients, are stored and used to generate each classification estimate.  These support vectors are simply training vectors, usually close to the border, that affect the solution.  The more training data, the more support vectors.
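To make the trick concrete, here is a tiny C check (my own illustration, not LIBSVM code) that the squared dot product in the original space equals an ordinary dot product between the explicitly transformed vectors:

#include <stdio.h>
#include <math.h>

int main(void) {
  double x[2]={1.5, -2.0};
  double y[2]={0.5, 3.0};

  //kernel evaluation in the original space: (x . y)^2
  double dot=x[0]*y[0]+x[1]*y[1];
  double k1=dot*dot;

  //explicit transformation: (x1^2, sqrt(2) x1 x2, x2^2)
  double phix[3]={x[0]*x[0], sqrt(2.)*x[0]*x[1], x[1]*x[1]};
  double phiy[3]={y[0]*y[0], sqrt(2.)*y[0]*y[1], y[1]*y[1]};
  double k2=phix[0]*phiy[0]+phix[1]*phiy[1]+phix[2]*phiy[2];

  printf("kernel trick: %g; explicit transform: %g\n", k1, k2);

  return 0;
}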

The data I'm using to test my "multi-borders" routines contain over 80 000 training samples.  Using LIBSVM with these data is very slow because the number of support vectors is some non-negligible fraction of this eighty thousand.  If I'm testing LIBSVM algorithms I rarely use the full complement, rather some subset of it.

LibAGF multi-borders to the rescue!

The libAGF binary classifiers work by first generating a pair of samples on either side of the discrimination border, then zeroing the difference in conditional probabilities along the line between these two points.  In this way it is possible to generate a group of points lying directly on the discrimination border.  You can collect as many or as few of these points as you need.  In practice, this is rarely more than 100.  In combination with the gradient vectors, it is straightforward, and very fast, to use these points for classification.
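As a rough illustration of the idea (a toy sketch, not the actual libAGF code), here is how one border sample might be found by bisecting the difference in conditional probabilities, R(x)=P(2|x)-P(1|x), along the line between a point from each class; the function R below is made up purely for demonstration:

#include <stdio.h>
#include <math.h>

//toy difference in conditional probabilities, R(x)=P(2|x)-P(1|x);
//any binary classifier returning such a value could be substituted:
double R(double *x) {
  return tanh(x[0]+0.5*x[1]-1.);
}

int main(void) {
  int D=2;                              //number of dimensions
  double x1[2]={-2., -2.};              //sample from class 1: R<0
  double x2[2]={ 3.,  2.};              //sample from class 2: R>0
  double xb[2];                         //border sample
  double t1=0., t2=1., t;

  //bisect along the line x = x1 + t*(x2-x1) until R is close to zero:
  for (int iter=0; iter<50; iter++) {
    t=0.5*(t1+t2);
    for (int j=0; j<D; j++) xb[j]=x1[j]+t*(x2[j]-x1[j]);
    if (R(xb)>0) t2=t; else t1=t;
  }

  printf("border sample: (%g, %g); R=%g\n", xb[0], xb[1], R(xb));

  return 0;
}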

The crux of this algorithm is a function which returns estimates of the conditional probabilities.  You can plug in any function you want.  Since the idea here is to accelerate the LIBSVM classification stage, we plug in the LIBSVM estimates as generated by the svm-predict command along with the pre-trained SVM model.  Training that SVM model, incidentally, is still just as slow, so you'll have to live with that; training the new border-sampling model on top of it, however, is fast.

The gradient vectors are generated numerically.  Unfortunately, this produces a slight degradation in accuracy.  In principle, it should be possible, easy even, to produce these gradients analytically.  Later versions may allow this but it will require modifying the LIBSVM executable.
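For reference, a central-difference approximation is the obvious way to get those gradients numerically; a minimal sketch (the toy R and the step size are assumptions, not libAGF defaults):

#include <math.h>

//toy difference in conditional probabilities (see the previous sketch):
double R(double *x) {
  return tanh(x[0]+0.5*x[1]-1.);
}

//central-difference estimate of the gradient of R at x:
void num_gradient(double *x, int D, double dx, double *grad) {
  for (int i=0; i<D; i++) {
    double save=x[i];
    x[i]=save+dx;
    double rplus=R(x);
    x[i]=save-dx;
    double rminus=R(x);
    x[i]=save;
    grad[i]=(rplus-rminus)/(2.*dx);
  }
}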

But the speed increases are monumental.  We have swapped an implicit discrimination border, specified through the support vectors, for an explicit one, directly represented by a small set of samples.

Check out the latest version of libAGF for all the juicy details.

Friday, December 12, 2014

Cloud cover index


The other day I was out running after dark.  It was -8 degrees C and I was expecting it to be bitterly cold.  Instead, it felt quite mild, pleasant even.  It was overcast.

One of the things I learned while working on sea ice retrieval is that air temperature is just one of many factors that determine how an object exchanges heat with its environment, and a relatively minor one at that.  We've all heard of the wind chill factor.  We feel, of course, much colder on a windy day than on a calm one, even if the air temperature is the same.  Another factor is humidity, hence the humidex.  The one I'm going to discuss here is cloud cover.  I'm trying to design a temperature correction for cloud cover, much like that for wind chill and humidity.

Every warm object emits radiation, which is why stove elements glow red and incandescent light bulb filaments glow white.  At room temperature this radiation is too low in frequency to see with the naked eye.  When an object emits radiation, an equivalent amount of heat energy is lost.  When it absorbs radiation, the energy from the radiation goes towards warming it.  On cold, calm days, radiative cooling is actually a much more significant source of heat loss than direct conduction into the air because air is a poor conductor.  This is why down is so warm: it's not the down itself that's a good insulator, it's actually the air spaces in between.

Clouds reflect a lot of the long-wave radiation we emit, making overcast days feel warmer than clear days, especially during the evenings when the sun's rays are not there to compensate.  This of course is the well-worn explanation for climate change and the "greenhouse effect."

OK, so maybe you've heard all that; the more technical explanation follows.  But first, for the less technically inclined, let's go directly to the table:

Equivalent clear-sky temperature as a function of air temperature and cloud cover (all values in °C):

Air temp.  10%  20%  30%  40%  50%  60%  70%  80%  90%  100%
-20.0 -20.0 -19.8 -19.6 -19.2 -18.8 -18.2 -17.6 -16.8 -16.0 -15.0
-19.0 -19.0 -18.8 -18.6 -18.2 -17.8 -17.2 -16.6 -15.8 -14.9 -14.0
-18.0 -17.9 -17.8 -17.6 -17.2 -16.8 -16.2 -15.5 -14.8 -13.9 -12.9
-17.0 -17.0 -16.8 -16.5 -16.2 -15.7 -15.2 -14.5 -13.8 -12.9 -11.9
-16.0 -15.9 -15.8 -15.5 -15.2 -14.7 -14.2 -13.5 -12.7 -11.9 -10.9
-15.0 -14.9 -14.8 -14.5 -14.2 -13.7 -13.2 -12.5 -11.7 -10.8 -9.8
-14.0 -13.9 -13.8 -13.5 -13.2 -12.7 -12.1 -11.5 -10.7 -9.8 -8.8
-13.0 -12.9 -12.8 -12.5 -12.2 -11.7 -11.1 -10.5 -9.7 -8.8 -7.7
-12.0 -11.9 -11.8 -11.5 -11.2 -10.7 -10.1 -9.4 -8.6 -7.7 -6.7
-11.0 -10.9 -10.8 -10.5 -10.2 -9.7 -9.1 -8.4 -7.6 -6.7 -5.7
-10.0 -9.9 -9.8 -9.5 -9.2 -8.7 -8.1 -7.4 -6.6 -5.7 -4.6
-9.0 -8.9 -8.8 -8.5 -8.1 -7.7 -7.1 -6.4 -5.6 -4.6 -3.6
-8.0 -7.9 -7.8 -7.5 -7.1 -6.7 -6.1 -5.4 -4.5 -3.6 -2.6
-7.0 -6.9 -6.8 -6.5 -6.1 -5.6 -5.1 -4.3 -3.5 -2.6 -1.5
-6.0 -5.9 -5.8 -5.5 -5.1 -4.6 -4.0 -3.3 -2.5 -1.5 -0.5
-5.0 -4.9 -4.8 -4.5 -4.1 -3.6 -3.0 -2.3 -1.5 -0.5 0.6
-4.0 -3.9 -3.8 -3.5 -3.1 -2.6 -2.0 -1.3 -0.4 0.5 1.6
-3.0 -2.9 -2.8 -2.5 -2.1 -1.6 -1.0 -0.3 0.6 1.5 2.6
-2.0 -1.9 -1.8 -1.5 -1.1 -0.6 0.0 0.8 1.6 2.6 3.7
-1.0 -0.9 -0.8 -0.5 -0.1 0.4 1.0 1.8 2.6 3.6 4.7
0.0 0.1 0.2 0.5 0.9 1.4 2.0 2.8 3.7 4.6 5.7

In other words, the presence of cloud cover can make it seem several degrees warmer.  How does that work exactly?

The basic idea is to calculate the rate of heat loss for a cloudy atmosphere and then ask the question, "what would the clear sky air temperature need to be in order to produce the equivalent level of heat loss?"

Since the human body exchanges heat with the air using approximately the same mechanisms, let me describe the thermodynamic models I used for sea ice.  These were applied for various purposes: the first one I wrote was actually for an ocean model.  Surface heat exchange is the number one mechanism forcing ocean circulation.  I took the same model and used it to predict the temperature profile of Antarctic icepack and also to model the growth rate and salinity profile of sea ice.

There are four mechanisms by which ice (and other things, including us) exchange heat with the environment: latent heat, sensible heat, shortwave radiation and longwave radiation.  Latent heat, also called evaporative cooling, is the heat lost through evaporation.  Since water has a very high heat of vapourization, when water evaporates, for instance from the surface of pack ice, from the surface of the ocean or from our bodies, it takes a lot of heat with it.  The main thing determining evaporative cooling is the difference in vapour pressure between the air and the thing being cooled.  High humidity will slow evaporation.

Sensible heat is direct heat transfer through conduction.  In still air, sensible heat is basically nil.  As the wind picks up, it can get quite high, as the wind chill index will tell us.

Shortwave radiative flux is just the sun shining on us.  This is determined partly by geometry: objects pointing directly at the sun will be heated faster than those at an angle.  Since the sun is lower in the sky earlier and later in the day and at higher latitudes, ground heating is lower.  It is also determined by cloud cover.  In this model, we are deliberately ignoring solar heating: let's just assume it's night time.

Finally, the one that interests us the most is longwave flux.  This is heat that is radiated directly off us in the form of electromagnetic radiation.  The rate at which this occurs is proportional to the 4th power of temperature times the Stefan-Boltzmann constant: this is the Stefan-Boltzmann law.  Here is the equation I used, which also accounts for both clouds and the air re-radiating heat back:

 Qlw = -ε σ [0.39 (1 - ccc·cc²) Ts^4 + Ts^3 (Ts - Ta)]

where ε is the emissivity coefficient, σ is the Stefan-Boltzmann constant, ccc is the "cloud-cover coefficient," cc is cloud cover as a fraction between 0 and 1, Ts is skin temperature and Ta is air temperature.  I just got this from a paper on surface heat flux, but you can clearly see the Stefan-Boltzmann law embedded in there.

In the sea ice growth simulation I used net heat flux to determine the surface temperature of the ice.
The difference in temperature between the ice and the water (which is at a constant freezing temperature) is given by the heat flux times the ice thickness, divided by the conductivity.  Since all of the flux calculations also involve a temperature difference, this time between skin (surface) temperature and air temperature in one form or another, we need to solve for the skin temperature.  In other words, we need to invert the equation for flux in order to answer the question, "what value of skin temperature will match heat conduction through the ice (between the ice-water interface and the ice surface) to heat flux between the ice and air?"

The cloud cover index was generated in a similar way, except in this case the skin temperature was held constant at 37 deg. C while it was the air temperature that was being solved for.  It was also much simpler in that I didn't assume any conductive interfaces: heat was lost directly from the skin to the outside air.

While it might be possible to generate an analytic solution, I just fed the whole thing to a numerical root-finding algorithm, in this case bisection.  I didn't account for latent heat since it's hard to say how high it would be: it depends very much on whether the person is sweating or not.  Leaving out sensible heat so that only longwave is present, however, gave very high values, so I set the wind speed at a moderate 4 m/s (14.4 km/h) to produce some sensible heat loss.  You could also assume that the person is wearing a coat and try to match air temperature to conductive heat loss, much as I did with the sea ice growth model.  This would achieve a similar result.
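For the curious, here is a bare-bones C sketch of the procedure (a simplified stand-in, not the Python program linked at the end of this post): hold the skin temperature fixed, compute the longwave plus a crude sensible heat loss under cloud, then bisect for the clear-sky air temperature that produces the same total loss.  The emissivity, cloud-cover coefficient and sensible heat coefficient are round numbers assumed for illustration, so it won't reproduce the table above exactly.

#include <stdio.h>

#define SIGMA 5.67e-8           //Stefan-Boltzmann constant [W/(m^2 K^4)]
#define EPS_LW 0.97             //assumed emissivity
#define CCC 0.7                 //assumed cloud-cover coefficient
#define H_SENS 10.              //assumed sensible heat coefficient [W/(m^2 K)]

//net heat loss [W/m^2] from skin at Ts to air at Ta under cloud fraction cc:
double heat_loss(double Ts, double Ta, double cc) {
  double lw=EPS_LW*SIGMA*(0.39*(1.-CCC*cc*cc)*Ts*Ts*Ts*Ts+Ts*Ts*Ts*(Ts-Ta));
  double sens=H_SENS*(Ts-Ta);
  return lw+sens;
}

int main(void) {
  double Ts=37.+273.15;         //skin temperature [K]
  double Ta=-10.+273.15;        //actual air temperature [K]
  double cc=0.8;                //cloud cover fraction

  double target=heat_loss(Ts, Ta, cc);

  //bisect for the clear-sky air temperature giving the same heat loss:
  double tlo=Ta-50., thi=Ts, tmid;
  for (int i=0; i<60; i++) {
    tmid=0.5*(tlo+thi);
    if (heat_loss(Ts, tmid, 0.)>target) tlo=tmid; else thi=tmid;
  }

  printf("%g C air temp. at %g%% cloud cover feels like %g C under clear sky\n",
          Ta-273.15, 100.*cc, tmid-273.15);

  return 0;
}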

Here is the Python program if you want to mess around with it.

Saturday, November 22, 2014

The mathematics of tracer transport

The reviewers of this paper, about a new method of dynamical tracer reconstruction ("PC proxy"), were interested in knowing more about the algorithm: how to determine the parameters and what makes it better than previous methods.  In order to better understand it (only three years after the fact) I'm in the process of compiling a bunch of mathematics related to tracer transport, particularly in relation to how it's used in the paper.  A lot of it I derived a long time ago, but whenever I want to work on this stuff the notebook where I did the last derivation usually isn't available so I end up re-deriving it.  Now all I have to do is not lose the file--I've already misplaced an older collection of mathematical derivations.  If anybody's seen it, maybe send me an e-mail.

Anyways, I've uploaded it to my homepage.  It's still a work in progress so I will continue to add to and correct it.  I should probably upload it to Github or some other code repository for better version control.  I'm still waiting for a good science-based website for this type of thing.  Maybe if I had the resources I might try starting something like this myself.

Another thing that strikes me is how time-consuming doing up these types of derivations in LaTeX is.  Some software specifically for writing up and performing derivations could be really useful.  And by this I don't mean symbolic computation packages like Maple or Maxima.  I've always found these a whole lot less powerful and useful than one first imagines.  Partly it's just the sheer volume of methods (the Maxima documentation is over 1000 pages long!) but also: all they seem to consist of is a whole bunch of built-in functions for manipulating equations one way or another.  Not unlike a library for any standard procedural programming language, except instead of operating on variables, they're operating on what appear to the user to be mathematical expressions (but are actually structured variables in a programming language!)

What I'm talking about is something much simpler whose only purpose would be to write up equations in a pretty format as well as insert short sections of text and cross-reference previous equations.  A handful of shortcuts for LaTeX might fit the bill.  One idea would be to just take simple ASCII math and in real time turn this into pretty equations, with macros for symbols that have no ASCII representation.  The ultimate of course would be full WYSIWYG, maybe a bit like MathCAD, but without any actual math.  I might try coding up something in lex/yacc someday.

I'm also working on mathematical computations relating to multi-class classification and optimal adaptive Gaussian filtering.

Update: I've now posted this project on github.  There is also an updated version of the compiled pdf on my homepage.

Friday, August 8, 2014

Worshipping complexity

Frankenstein tells the story of a monster created by technology.  The story is popular because it is prophetic.  There are no literal Frankenstein monsters but I don't think many people recognize the real, runaway Frankenstein monster we've created.  A few weeks ago, in a fit of temper, I smashed my netbook.  My shoulder is still sore from it.  But in retrospect, I'm glad I destroyed that machine because I'm certain it was spying on me.  Why else would I be offered such a convenient device with such impressive specifications--dual 1.66 GHz processors, 1 gigabyte of RAM, 160 gigabytes of hard-disk space--for so little money?  Twenty years ago, that was super-computer territory.  A device that also happens to be equipped with a microphone and a low-light video camera.  In the wake of Edward Snowden, I don't think we need to ask the question.  It also explains why, despite its impressive technical specifications, the machine was so damn slow.

There's no question in my mind that in twenty years or so, there won't even be any human beings behind the spying, that is, assuming the computers even let us live.  Most if not all of the elements are in place for a complete take-over of our world by the machines.  Assuming it hasn't happened already.  Yes, this is our Frankenstein monster: runaway technology.  It no longer serves us, we serve it.  And it has caused considerable damage to the natural world, like the tar sands in Northern Alberta.

But these elements of complexity are beyond my means to tackle, so I'd like to focus on those that are within reach: inside my own computer code. Recently I wrote an answer on Quora about why so many people seem to hate C++.

The purpose of using a more expressive language is to simplify things: to accomplish more with less.  But the more I consider it, the more I'm beginning to suspect that C++ actually does the opposite: it encourages you to make your code more complex.  Take this class definition, presented in its entirety, for a sparse matrix:

namespace libpetey {
  namespace libsparse {

    //forward declaration of sparse_el:
    template <class index_t, class data_t> class sparse_el;

    typedef int32_t ind_t;

    #define EPS 5.96e-8         //this is actually pretty big... (?)

    template <class index_t, class data_t>
    class sparse:public matrix_base {
      //friend void conj_grad_norm(sparse *, data_t *, data_t *, data_t, long);
      //friend double * irreg2reg(double **xmat, long m, double *y, long n, double **grids, long *ngrid);
      private:
        //since all the "constructors" do the same thing, we have one private
        //routine to do them all:
        void copy(const sparse &other);
        void convert(sparse_array &other, data_t neps=EPS);

      protected:
        sparse_el *matrix;

        //dimensions:
        index_t m;
        index_t n;

        //number of nonzero elements:           (these types should depend on index_t
        long nel;                               //i.e. index_t = int32_t, then use int64_t)

        //index of last element searched:
        long last_search;

        //size of array
        long array_size;

        //update flag:
        //0=matrix needs updating
        //1=matrix is fully updated
        //2=matrix is sorted but has zero elements
        char update_flag;

        //"zero" tolerance:
        float eps;

        FILE *sparse_log;


      public:
        //initializers:
        sparse();
        sparse(data_t neps);
        sparse(index_t min, index_t nin, data_t neps=EPS);
        sparse(data_t **non, index_t min, index_t nin, data_t neps=EPS);
        virtual ~sparse();

        //create the identity matrix with existing dimensions:
        void identity();
        //create the identity matrix with specified dimensions:
        void identity(index_t mn, index_t nn);

        sparse(matrix_base *other);
        sparse(const sparse &other);
        sparse(sparse_array &other);
        sparse(full_matrix &other, data_t neps=EPS);

        void from_full(data_t **non, index_t min, index_t nin, data_t neps=EPS);

        //storage manipulation:
        void update();          //update matrix
        void remove_zeros();    //remove insignificant elements
        //remove elemnts of less than specified magnitude:
        void remove_zeros(data_t minmag);
        //clear matrix, but do not change storage:
        void reset(index_t mnew=0, index_t nnew=0);
        //clear matrix, deallocate storage:
        void clear(index_t mnew=0, index_t nnew=0);
        //extend storage by specified amount:
        void extend(long size);

        //add and access matrix elements:
        //add or change an element:
        long add_el(data_t val, index_t i, index_t j);
        //change existing element or insert new:
        virtual long cel(data_t val, index_t i, index_t j);

        //return value of element:
        virtual data_t operator ( ) (index_t i, index_t j);

        //access rows:
        virtual data_t *operator ( ) (index_t i);
        virtual void get_row(index_t i, data_t *row);
        //ratio of amount of storage to equivalent full:
        float storage_ratio();
        //approx. ratio of performance (for matrix multiply) to equivalent full:
        float performance_ratio();

        //I/O:
        virtual size_t read(FILE *fptr);                //binary read
        virtual size_t write(FILE *ptr);                //binary write
        virtual void print(FILE *ptr);          //print (ascii) to file
        virtual int scan(FILE *ptr);            //read from (ascii) file
        void print_raw(FILE *ptr);

        //the "canonical" versions:
        virtual matrix_base * mat_mult(
                    matrix_base *cand);
        virtual matrix_base * add(matrix_base *b);

        virtual data_t * vect_mult(data_t *cand);
        virtual void scal_mult(data_t m);
        virtual data_t * left_mult(data_t *cor);

        virtual matrix_base & operator = (full_matrix &other);
        virtual matrix_base & operator = (sparse &other);
        virtual matrix_base & operator = (sparse_array &other);

        virtual data_t norm();

        virtual matrix_base * clone();

       //sparse & operator = (matrix_base &other);

/*
        virtual operator sparse& ();
        virtual operator sparse_array& ();
        virtual operator full_matrix& ();
*/

    };

    template <class index_t, class data_t>
    inline long sparse<index_t, data_t>::size() {
      return nel;
    }

    template <class index_t, class data_t>
    inline void sparse<index_t, data_t>::remove_zeros() {
      remove_zeros(eps);
    }

    typedef sparse<ind_t, float> sparse_matrix;

  }  //end namespace libsparse
}    //end namespace libpetey

Hmmmm....  it doesn't look that simple.  (Hopefully you didn't read through all of it.)  Notice how I've thrown in everything but the kitchen sink: well, templates allow us to use the class with different data types, so why don't we throw them in?  Namespaces prevent collision between symbols in different libraries and they define modules, so why don't we throw them in as well?  And so on.

Why don't we start over, and this time work in C?  

#include <stdint.h>

typedef float scalar;
typedef int32_t integer;

struct sparse_element {
  integer i;
  integer j;
  scalar value;
};

This is how we could define the data structure. To create an instantiation of it, just make an array:

struct sparse_element sparse_matrix[nel+1];

The first element could hold the dimensions of the matrix:

sparse_matrix[0].i=m;
sparse_matrix[0].j=n;

This type of doubling up on variables is often discouraged, but it does have the nice side effect of making the code a little simpler and less cluttered.  The number of data elements could either be stored in a separate variable (common with arrays in C), stored in the value field, or signalled by an extra element that serves as an end marker tacked on to the end of the array, much like C strings.  This is not as mad as it sounds: most functions will operate on the whole list, iterating through each element one by one.  One thing we lose is the ability to tack on new elements easily, but on the other hand it is more efficient to have some idea how big your data structure will be ahead of time and to allocate space accordingly.
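To show that such a bare-bones structure is perfectly workable, here is a sketch of a sparse matrix-vector multiply over the element array, with the element count kept in a separate variable (one of the options mentioned above):

#include <stdint.h>

typedef float scalar;
typedef int32_t integer;

struct sparse_element {
  integer i;
  integer j;
  scalar value;
};

//multiply the sparse matrix a with the dense vector x of length n=a[0].j;
//the result goes in y, of length m=a[0].i; nel is the number of nonzero
//elements, which are stored starting at a[1]:
void sparse_vect_mult(struct sparse_element *a, long nel,
                scalar *x, scalar *y) {
  for (integer k=0; k<a[0].i; k++) y[k]=0;
  for (long k=1; k<=nel; k++) y[a[k].i]+=a[k].value*x[a[k].j];
}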

This doesn't include all the operators, which of course will all need to be set down.  But we've also left out the templates and the namespaces.  Sure, it's nice to be able to use the matrix class with different scalar data types and with different integer types for index locations, but are we really ever going to use, at the same time, multiple sparse matrices comprised of different elementary types?  Probably not.  A typedef works just as well and is far simpler.

But even if we need generic programming, this can be done quite easily in C.  Take this prototype for a sort function, for instance:

void heapsort(void **data, int n, int (* comp)(void *, void *));

Here, data are the data elements (any type) and comp is the comparison function.  Want polymorphism as well?  This isn't too hard either, and it occurs where it should: as far outside of the basic definitions as possible.  It's not built in from the very beginning.  Let's say we have two data types:

struct child1 {
  ...
};

struct child2 {
  ...
};

And they both have an operator called do_something:

int do_something1(struct child1 *);
int do_something2(struct child2 *);

But we need a more general operator that works on either type:

struct parent {
  int type;
  union {
    struct child1 *a;
    struct child2 *b;
  } data;
};

int do_something(struct parent *c) {
  int result;
  switch (c->type) {
    case 1:
      result=do_something1(c->data.a);
      break;
    case 2:
      result=do_something2(c->data.b);
      break;
    default:
      fprintf(stderr, "do_something: type %d undefined\n", c->type);
      exit(-1);
  }
  return result;
}

In this case, yes, the alternative in C is more complicated and "less elegant" whatever that might mean, but it has two features that I like.  First, the branch logic is explicitly written out, not hidden or implicit in the definitions.  Second, the more basic operations are defined completely independently.  There is no need to rough out the larger structure first.

Another thing I like about C, or any pure procedural language, is that processes and data are always strictly separate.  There is no need to mash them together, whether such an alliance is holy or not.  At least C++ gives you the choice; languages like Smalltalk do not.  Another thing I can't stand is forced data-hiding.

Unfortunately, I still haven't mastered these techniques myself.  I got into C++ programming early, just as it was first coming out, and haven't switched since.  I had to write a simulation of a chaotic scattering system--essentially a simple ray-tracing program--and started in Pascal because that's what I had learned in high school.  I had a Turbo Pascal compiler which had all the bells and whistles, including O-O.  Since I was curious about object-oriented programming and it was very new at the time, I coded the problem in this style and it was a very good "fit".  Later on I wanted to learn C and C++ because, like objects, they were also very popular and "cool" at the time.  Translating this program into C++ was almost trivial because the object-oriented aspects of each language were so similar.

Minimalist Coding

I wanted to call it, "extreme programming" but that's taken already and besides, the term is a bit cliche by now.  But I am going to borrow some ideas from extreme sports, in particular alpinism.  Don't ask me how I got interested in mountaineering.  Heights scare the shit out of me.  I figured if I could overcome this one fear, it would help me with more mundane fears like job interviews or asking a girl out.  I'm not convinced the idea was all that sound.  I never got much further than conquering the easiest routes of the bouldering wall at the local climbing gym.

I also bought and read Extreme Alpinism by Mark Twight.  There are two broad styles of climbing: expedition style and "alpine" style.  In expedition climbing, you take a big company with lots of gear, lay out camps and hang lots of fixed rope.  By being well equipped you are more protected.  On the other hand, you move a lot slower.  In alpine climbing, you take as little gear as possible and try to be self-sufficient.  You can move a lot faster and you're more flexible.

I was most interested in the most extreme style: climbing without rope or protection.  Consider: every time you insert a chock or other protection, you are distracting yourself from the main task at hand: getting up the mountain.  Also, if your protection is sloppily placed it may be little better than no protection at all.  Why not focus all your energy on the most important thing: making sure you have the best holds and that your hand and foot placement is secure?  Because you aren't carrying all this hardware, you're a lot lighter.  I think you can see where I'm going with this.

So we leave out range checking and don't make any attempt to catch or correct errors.  It's considered normal for a subroutine to return an error code, but if every call to a subroutine is followed by a check for an error condition and the stack is four levels deep, that's an awful lot of extra code and, worse, clutter.  When I first started learning computer science, the term GIGO was thrown around a lot: garbage in, garbage out.  Instead of spending so much effort testing for errors, why not spend that effort making sure your inputs are correct?  If your code is also correct, that will guarantee that your output is ultimately error free.

We apply the same minimalism to documentation: keep documentation to a minimum and keep variable names short.  Ideally, the code should be self-explanatory: because we have now reduced the clutter by a large fraction, the program is easier to interpret and it's easier to find things.  Naturally, this style works a lot better for scientific programming than for more user-oriented problems.  One place where your code almost always has to be fault-tolerant is when importing large amounts of data.  Other people's data is notoriously unreliable.  Often, however, large chunks of it can be ignored or thrown away without any ill effects, but we need to catch the bad apples first so as not to spoil the rest.

I confess I am a bit miffed at just how un-portable C and C++ are.  (Though it took me a while to realize that integer types may be different sizes on different machines and compilers--about the time I started working on 64-bit computers.)  But not miffed enough to use a configure script.  I recently inspected the configure script for an early C++ version of NetCDF: it's over 17000 lines long!  Meanwhile, the wrapper classes that comprise the body of the library are just over 4000 lines.  This is not a recipe for portability.  It's easier to debug a 200-line makefile than a 10000-line configure script.  Some time ago I had the privilege of doing so, and the code looked pretty hacked together.

Rising from the ashes

This is me and my foundation rising from the ashes...

Sunday, July 13, 2014

Multi-borders software: reflections

It's been several weeks since I released the "multi-borders" multi-class classification software.  Along with the software, I also submitted this paper to the Journal of Machine Learning Research, Machine Learning Open Source Software (JMLR-MLOSS) track, but sadly it was rejected.

Actually I'm quite glad that the paper was rejected as there are issues with the software, not least of which is the fact that it's not as accurate as it could be.  This is actually discussed in the paper: the eight-class test case returns an accuracy of about 56% for AGF without borders training, 54% using the "hierarchical" method (essentially a decision tree) and only 52% using the "non-hierarchical" method.  It's the non-hierarchical method that's the most interesting and that I'll be discussing here.  This uses a matrix inversion (least squares, actually) to solve for the conditional probabilities based on the "raw" probabilities from each of the binary classifiers.  That is, if P are the conditional probabilities for each of the classes and R are the differences between the conditional probabilities from the binary classifiers, then there will be a mapping, A, from one to the other and we solve the following equation in a least squares sense:

  A*P=R                    (1)

At first I thought that errors were accumulating from the estimates of the conditional probabilities, which themselves are not all that accurate.  This may not be the case.

On the other hand, you can also solve for the class through "voting" which is how most other multi-class methods work.  Arithmetically, this can be expressed by multiplying the transpose of the mapping with the "raw" classes of each of the binary classifiers, where the class is expressed as one of {-1, 1}.  This method is more accurate even when we use the difference in conditional probabilities in place of the class.  That is, we have:

  c=arg max A^T*R    (2)

where c is the class of the test point.  Unfortunately, this provides poor knowledge of the final conditional probabilities, which in the first method are very accurate. An obvious solution is to constrain (1) so that it always agrees with (2).  The least squares should probably be constrained anyway to keep each element of P between 0 and 1 and to ensure that the sum is always 1.

In principle this doesn't look that hard: if different sets of constraints (from 1 to N, where N is the number of classes) represent sub-spaces, then it would be a matter of solving the minimization problem within progressively smaller sub-spaces until we find one that satisfies the original constraints, all the while judiciously pruning eligible sub-spaces as we go along.  Adding the new constraint brought on by the "voting" solution would complicate this procedure somewhat.  In practice, it is easy to re-normalize the results from solving (1), so they have so far been left untouched and I doubt the constrained solution would be significantly more accurate than the unconstrained.

I find it really interesting that by basically flipping the whole equation around, we end up with something that works out about the same:

  arg max A^T*R ~= arg max A^(-1)*R

where A^(-1) is understood in the least-squares (pseudo-inverse) sense, so that the right-hand side is just the arg max of the solution, P, of (1).

No question this is related to the properties of the matrix itself and should suggest better methods of solving it.
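As a concrete illustration of the voting rule in (2), here is a toy example for a three-class, one-against-one set-up; the coding matrix A and the binary-classifier outputs R are made-up numbers, not output from libAGF:

#include <stdio.h>

#define NCLS 3          //number of classes
#define NBIN 3          //number of binary classifiers (one-against-one)

int main(void) {
  //each row is one binary classifier: +1 for the first class of the pair,
  //-1 for the second, 0 for classes it doesn't see:
  double A[NBIN][NCLS]={{ 1, -1,  0},
                        { 1,  0, -1},
                        { 0,  1, -1}};
  //differences in conditional probabilities returned by the binary
  //classifiers (made-up numbers):
  double R[NBIN]={-0.8, -0.2, 0.7};

  //voting: c = arg max A^T*R
  int c=0;
  double best=-1e30;
  for (int j=0; j<NCLS; j++) {
    double v=0;
    for (int k=0; k<NBIN; k++) v+=A[k][j]*R[k];
    printf("class %d: score %g\n", j, v);
    if (v>best) {
      best=v;
      c=j;
    }
  }
  printf("winning class: %d\n", c);

  return 0;
}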

Of perhaps even more concern is that for all the multi-class problems I've tried so far, LIBSVM has handily outperformed libAGF.  At first I thought this makes sense since AGF borders approximates a curved class border from a piece-wise collection of linear borders, while a support vector machine actually analytically warps the space so that a curved boundary is not even necessary.  But this difference is persisting even using the "raw" AGF with no borders training.  I may need to go back to the beginning to optimize the estimation of conditional probabilities.

I confess, I have been somewhat lax in testing this library.  Initially, I was content to try it on the synthetic dataset I had composed and on the particular problems I was trying to solve (discrete retrieval of humidity from satellite data and surface classification of Landsat images) because the speed advantage was just so astronomical--it seemed to make up for any other potential shortcomings such as less than ideal accuracy.  In the problems I was working on often either LIBSVM wouldn't converge or the time it took to classify the test data was so enormous as to just not be worth it.  Of course for small training datasets the speed advantage of AGF evaporates.

The LIBSVM page itself has a very nice collection of datasets. I've messed around with a few of the binary datasets in the past and in retrospect I should've documented the results more thoroughly.  That should be the rule for just about every scientific project, even casual testing.  The ones I'm currently working on are "segment" and "satimage."

Another thing to note about SVM is that it optimizes the classification error.  As you know, AGF generates estimates of the conditional probabilities.  Interestingly, the conditional probability represents the maximum possible accuracy for estimates at that particular location in the feature space.  I used this property very early on to validate the conditional probability estimates by first binning them and then comparing the actual accuracies with the averages in each bin.  Thus, in theory, classifications based on the conditional probabilities are optimally accurate, but that assumes that the estimates themselves are accurate.  Unfortunately, AGF probability estimates are not optimized, at least not in any way that takes into account error rates.  This is something else I've been working on, mainly for probability density estimation.

Tuesday, June 24, 2014

How to work towards your own enslavement

By now I hope everyone's familiar with the privacy issues related to Google, Facebook and other internet companies.  Here is an interesting talk about it:


The most important point: what are they doing with your data?  They are building electronic models of you so that they can predict your behaviour.

If it is not yet a theorem that prediction implies control, then it should be.  This has been shown for simple systems: assuming we have good knowledge of the system dynamics, small perturbations of a chaotic system can be used to control it.  In one position I was working with the "singular vectors" of weather systems.  These are based on the integrated tangent vector and tell you in which direction, approximately, you have to perturb the system to get the biggest result.  The intended use was for prediction, but I have a bigger idea: use them to control the weather.  If I don't get to this, someone else will eventually.  We set up heat pumps and fans/turbines in strategic locations and use those to drive the system to its desired state.  Hell, they probably use all the wind farms in Europe for this purpose already.

If prediction implies control in physical systems, it is even more so in human affairs.  When I was a kid I used to play a game with my brother called "Stratego."  It is a bit like chess, but with the conceit that the ranks of all the pieces are hidden from the opposing player.  Often I feel like this is my position in life: my hand is open--I am playing chess, while the elites are playing poker.

The models they make for humans are actually a lot less sophisticated.  Mainly these are statistical models.  Statistical models that could be built, for instance, from my libAGF software.  I have primarily applied the software to the observation side of things, namely retrieving data from satellites.  Recently a fellow e-mailed me asking for help with the software.  It turns out he was rewriting the whole thing in Matlab so his company could use it without any copyright issues.  Since requesting a donation from either him or his company, I haven't heard back.

There are still so many improvements I can make on this software and at times I work daily on it.  In fact the work never ends.  At the moment I am not getting a penny for it and I'm not sure I ever did so directly.  I suspect this is part of the elite's plan as well: encourage the development of "open systems" so that they can have as much free software (and plenty of other stuff) to steal as they need.  Software that will ultimately be used for extortion, repression and control.  Sure, if I'm not willing to do it, there are two dozen others ready to step into my shoes.  Sure, there are already dozens, probably hundreds, of other pieces of software that will do the same thing just as well.  LIBSVM and LVQ/SOMPAK are just two examples.  But I don't think I'm willing anymore to personally help brick in the walls of my own penitentiary.

Sunday, June 8, 2014

A rational ethical theory: first take

I've already mentioned my interest in ethical philosophy.  This is my first attempt to come up with a rational ethical theory:

1. An ethical statement is a statement that something is either right or wrong, good or bad.
2. An ethical system is a series of such statements.

By this definition the Ten Commandments comprise an ethical system.  The question becomes how we validate such a system, that is how do we determine its truth or falsity?  I've always been partial to the utilitarian philosophy: the greatest good for the greatest number.  In Ethics: Twelve Lectures on the Philosophy of Morality, David Wiggins makes some very pointed criticisms of utilitarianism.  For instance, imagine a situation where everyone can live well and find happiness, but at a cost: one person is excluded and lives in an extended hell.  (hmmmm... this sounds strangely familiar.)  Despite the lone sucker's nasty predicament, this would seem to fit the definition.

Apparently in U.S. states with the least difference between rich and poor, people are also better off in general and this points us towards a refinement to the utilitarian principle.  Imagine happiness lying along a distribution with the x-axis being happiness and the y-axis being number of people.  The narrower and further right this distribution lies, the better.  But most important, the further right the left-most non-zero point, the better.  In other words, we need to look at the happiness of the most unhappy person.

At the time I came up with the first part, my solution was much simpler.  I tried to create an ethical system comprised of only one statement that was its own validation system.  Namely: that which increases joy is good.  Joy I decided, was synonymous with good.

The problem here is ambivalence.  We are often drawn in two directions at once.  My first, concrete example, cigarette smoking, works but it's a bit weak.  So at the risk of revealing a bit too much about myself, I was "fortunate" enough to have had a girlfriend who gave me a much stronger example.  A large part of human laws and taboos revolve around sex and perhaps this example will illustrate why.  (Or as Foucault puts it in The History of Sexuality, human sexuality is a particularly dense set of power relations.)  It is interesting, and perhaps a bit ironic, that I first codified the above, rather simple idea of morality while staying at a hostel in Edinburgh after having first gotten together with this woman.  Fast forward to our break-up after about a year-and-a-half of on-again off-again bickering, I found myself staying at the apartment of and even sleeping in the same bed with this woman even as she was dating a new man.  At one point, as she was about to go on a date with this fellow she started getting frisky.  On the one hand, it's hard for me to imagine anything more arousing.  On the other, I really wanted to kill this guy with my bare hands.  I cannot imagine a deeper level of ambivalence.

So this, to me, points to the central aspect of ethics and morality: ambivalence.  Without ambivalence, if we are always certain of the rightness of our actions and their ultimate outcome, then there is no need for ethics.

On a more general level: I would very much like to do good in the world and to leave behind something better for future generations.  I am also ambitious: I would like to be successful and to achieve many great things.  Such achievement is often quite resource-intensive and can damage the environment.  This is the crisis facing our world today.  And it should be obvious to those of us with clearer heads that the elites have mortgaged the future of our grandchildren for wealth and success today.

As the above statement suggests, there is a tendency to privilege some desires and pleasures over others.  The concept of ambivalence suggests that all desires are equal: none is higher or lower.  I know I tend to rank them into more base, "animal" desires: food, sex, personal power and prestige, revenge; and higher, more "spiritual" desires: searching for truth and knowledge, fighting for justice, helping others, enjoying nature.  Ranking our desires of course is one way of resolving ambivalence.

The cancer of the Earth

Two Fridays ago I was chatting with an older fellow--a hard-luck case--who had worked in the oil fields as a youth.  He said some things to me that I found upsetting, not because I disagreed with them, but because I've been thinking along the same lines for some time.  He said that the planet is dying and the only way out is to find another planet to kill.  Oil is the blood of the planet and we are a cancer upon it.  The only way that we can create is by first destroying.

Well, I'm not sure about oil being the blood of the Earth.  Certainly it goes against much of the current scientific paradigm.  One could see how it might be: the centre of the Earth is very hot and it has an iron core.  Is the Earth a living thing that gains power from burning the iron in its core, with oil serving as a primer?  Keep in mind that Mars has a heavily oxidized surface.  Is it a now "dead" planet?  Shortly after he said this it occurred to me that scientists still don't have a very good understanding of fire and that it is actually a very distinct phase of matter.  As living things, we actually have more in common with flame than we do with more inert forms of matter.  And a large part of what separates us from other animals is our ability to harness fire, particularly in the internal combustion engine.  This, of course, is the legend of Prometheus.

This fellow worked on the oil fields: perhaps he saw things that others are not privy to.  I know very little about the chemistry and geology of the Earth's surface.  I suspect--no, I am certain--that the scientific community knows a lot less than they let on.  There was a Doctor Who episode in which a group of humans on a space ship discovered that their ship was actually a giant "space whale" and that, to keep it moving, it was being constantly tortured.  Is this the state of our modern industrial society?  In the Doctor Who episode the elites were simply ignorant of the facts.  Our elite, of course, would have no issue with simply lying through their teeth to cover up the facts.

But I digress.  Perhaps he is right about finding other planets.  This might be the only way out of our current predicament.  But not for the reasons you might imagine.  Assuming that we don't simply send people in suspended animation or send out a spore, building a space ship to travel to another planet (and by implication another star system) would require creating, in miniature, a self-sustaining ecosystem.  If we could do this inside a space ship, surely we could apply the lessons learned to make society down here on Earth more self-sustaining?

I am reminded of the movie WALL-E.  The movie doesn't really work except as reductio ad absurdum.  In it, humans have polluted the Earth to such an extent that they move out into space.  The problem is, they continue in their wasteful ways while on their ship.  Of course, had they continued to generate garbage at the same rate (as it shows in the movie) there would soon be nothing left of their ship.  If our wasteful ways are destructive down here on Earth, the same goes out in space, only far more so.

Tuesday, April 15, 2014

How to build a super-efficient motor-assisted bicycle

Moral philosophy is a weighty topic.  Sexual politics even more so.  So here is some lighter fare about how to build a better motor-assisted bicycle.  I have already written an article about how current products are rubbish and how they are hobbled by brain-dead regulations.

An under-powered vehicle needs a lot of gears to help it adapt to different terrain and driving conditions.  Modern bicycles with derailleur gears typically have a minimum of 16 gears while hub-gear bicycles will have seven (7) or more.  Transport trucks have 12 or more gears and vehicles designed for off-road use will usually have a special gear reducer that effectively doubles the number of gears.  Contrast this with personal automobiles, which in the past, could make do with as few as three and even now rarely have more than six (6).

A lot of motor conversion kits for bicycles have a very crude transmission set-up.  One older system involved simply dropping a rubber wheel onto the front tire, but newer chain drives often aren't much better: there is only one gear and the pedals are only useful to kick-start the engine.  On the one hand, transmission requirements for human versus gasoline power are very different because the former is high-torque, low RPM (revolutions per minute) and can produce generous torque at rest, while the latter is low-torque, high RPM and cannot produce any torque at rest.  On the other hand, to save weight, the transmission ought to be shared between the two.

This is the solution I'm currently working on.  First the engine.  I found this 2 hp gasoline engine, complete with magneto ignition:
The engine can drive the rear wheel through a standard bicycle derailleur gear by connecting to this Shimano front-freewheeling chainwheel from the 1980s:

Through a ratchet mechanism, the chainwheel is allowed to spin in a clockwise direction even when your feet aren't moving.  What's missing is how to connect the engine to the chainwheel, but it seems obvious from the photos: just chain it directly from the small sprocket mounted on the end of the engine crankshaft to the larger of the two chainrings.

There are at least three problems with this solution.  The first is that the engine achieves its horsepower peak at around 10000 rpm.  In top gear a human will have to pedal at roughly 100 rpm to travel at a speed of 50 km/h.  This implies a gear reduction of roughly 100:1!  It's not quite that bad since you don't have to be in top gear when the engine is pushing the bike at top speed.  There are multiple chainrings: potentially you could use just one very small one when the engine is engaged as long as there is still a reasonable spread.

With this in mind, notice that the chainwheel has been drilled out twice: first with a large spacing to accept the monster 60-tooth chainring shown at left, and again with a smaller spacing to accept a standard 24-30 tooth "granny" ring shown at right.
The cog on the engine is 14 teeth so even using the largest rear sprocket (roughly 1:1 ratio), you get a final drive ratio of 60:14=4.29.  At maximum rpm the rear wheel will still be turning at over 2300 rpm which translates to a maximum speed of 297 km/h!
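To spell out that arithmetic (assuming a 700c wheel with a circumference of roughly 2.1 m, so the numbers come out slightly different from those quoted above):

#include <stdio.h>

int main(void) {
  double engine_rpm=10000.;     //engine speed at the horsepower peak
  double reduction=60./14.;     //14-tooth engine cog driving the 60-tooth ring
  double crank_to_wheel=1.;     //largest rear sprocket: roughly 1:1
  double circumference=2.1;     //assumed wheel circumference [m]

  double wheel_rpm=engine_rpm/(reduction*crank_to_wheel);
  double speed=wheel_rpm*circumference*60./1000.;       //[km/h]

  printf("wheel speed: %.0f rpm; road speed: %.0f km/h\n", wheel_rpm, speed);

  return 0;
}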

The second problem is that the engine actually spins counter-clockwise, which hints at a solution to the first problem: combine the speed reducer and the reverse gear in one unit.  Two gears with a tooth ratio of about 20:1 (a compound gear might be better for such a large ratio)--17:1 in combination with the giant 60-tooth ring or 14:1 with the little granny--should do the trick.  If both are employed, the ratio could be as low as 10:1.

Such large ratios suggest the use of worm gears but this makes it impossible to kick or jump-start the engine with foot power or momentum and a starter motor adds unnecessary weight.  And third, we still haven't addressed the issue of engaging and disengaging the engine.  In the setup described above, the engine is always engaged making it impossible to use the pedals except for starting the engine or possibly assisting it.  This would only be acceptable for the earliest prototypes.

As a more mature, practical solution, I envision the motor being mounted beside the seat-tube with the crankshaft running perpendicular to the bottom-bracket (pedal-crank) axle rather than parallel.  Two bevel gears connect the two: a small one at the end of the crankshaft and a very large one on the outside of the chainwheel.  The engine is engaged and disengaged with a lever either by making the engine crankshaft extensible or through a pair of movable idler gears.

Another obvious essential feature of the final version will be a hand-operated clutch (foolishly illegal in the Ontario Traffic Act's definition of a moped).  Note that this will likely need to be separate from the engagement/disengagement mechanism as we don't want any extra gears or hardware connecting the pedals and creating drag while using only leg power.

If this project interests you, you can make a donation to the Peteysoft Foundation here.

New software release

I have just released a new version of the libagf machine-learning library.  This is probably the biggest revision since starting the software project some seven years ago.  The main addition is a generalization (called "multi-borders") of the "AGF borders" algorithm from binary to multi-class classifications.  Rather than pick a single method and implement that (such as one-against-one or one-against-the-rest), which would've been quite trivial, I chose to generalize the whole business.  For more details, please refer to the paper I submitted to the Journal of Machine Learning Research.

I think this is one of the first times I've created software without having some specific application in mind.  I wrote it mainly for the challenge and because the library doesn't feel quite complete without it.  Not to say that it won't be useful for many prediction tasks.

Here are a few thoughts on the creation of the new software.

On preparation

In preparing the "multi-borders" software, one thing that struck me was just how quickly I got it up and running.  Perhaps this is just experience, but I think planning and preparation had a lot to do with it.  I had a very good idea of what I wanted to achieve and, while I don't usually write much down, I had everything planned out in my head.

There are two components to the software: one for training and one for classification.  At the training stage, all the software does is read the control file and output commands for the already-existing binary classification software and then output another control file for the classification stage.  I got the first stage working long before the second, largely because testing is so simple: most of the input can be made up on the spot (i.e. the arguments don't have to refer to real files) and you just check the output for correctness.  In both cases I spent maybe two to three weeks coding and got things up and running after three or four tries.

I object


I'm still not sure how I feel about the object-oriented programming paradigm.  Usually I think it's just over-rated and that anything that can be done using classes and objects can be done at least as well, and with more flexibility, using all of the features that made C so distinct from other languages of the time: pointers (especially function pointers and void pointers), unions and structures.  The problem is that, because I'm still stuck in a very rigid mindset that says there's some kind of right way and wrong way of doing things, I haven't yet learned how to properly program in straight C.

Take the dreaded "goto"--everyone says it's bad, but the problem lies less in simply using the goto than in using it badly.  In The Elements of Programming Style, Kernighan and Plauger spend many pages showing how to clean up code with gotos.  Gotos definitely have their place, such as breaking out of a series of nested loops or cleaning up after different types of errors without a lot of repeated code.  In a couple of programs I've written for the new software release (and elsewhere), I've demonstrated to my satisfaction how to write code that's just as hard to follow using only structured programming constructs.  Void pointers provide an even better mechanism for shooting yourself in the foot, but also unprecedented power in an otherwise strongly typed, compiled language.

As I mentioned above, there were two components to the software.  The first part was written "the old-fashioned way" using standard, procedural programming.  The second part was written with objects and polymorphism.  Of the two, the latter was definitely the easier to get up and running.  Perhaps this is the main benefit of objects: they don't make the language any more powerful, they just make the programs easier to think about and debug.

It can also produce code that looks very elegant, at least when viewed from a certain angle.  The multi-class classifications are based on a hierarchical model: you divide the classes in two, divide them again and so on until there is only one class left.  The class structure of course follows this logic as well.  The elegant part comes in the fact that the same method is called all the way up the line with no break in the symmetry.  But in order to achieve this it includes a very brain-dead construct: a one-class classifier (that is, a "one-class-classifier-class"), a fairly large piece of code that doesn't do anything except return the class label of the top-level partition.

Finally, here's how we might be able to do better than the O-O framework using the features of C.  Here is the prototype for the re-factored subroutine for sampling the class borders from a set of training data in a binary classification problem:

template <class real>
nel_ta sample_class_borders(
            real (*rfunc) (real *, void *, real *),
                      //returns difference in conditional prob. plus derivs
            int (*sample) (void *, real *, real *),
                      //returns a random sample from each class
            void *param,        //these are just along for the ride
            nel_ta n,           //number of times to sample
            dim_ta D,           //number of dimensions
            real tol,           //desired tolerance
            iter_ta maxit,      //maximum number of iterations
            real **border,      //returned border samples
            real **gradient,    //returned border gradients
            real rthresh=0);    //location of Bayesian border


Yes, I do like to freely combine elements from different programming paradigms: it's what makes programming in C++ so much fun!  I could have passed an object with two methods to this function, but chose instead to pass two function pointers and a void pointer.  The normal method of calculating conditional probabilities requires an array of vectors (coordinate data) plus an array of class labels.  You could encapsulate both in an object, but this seems a bit limiting: there are many other things you could potentially do with them; for instance, you could pair the coordinate data with a different set of class labels, generated, say, from a set of continuous ordinates.  By binding them to the parameter list only for the duration of the call, the program becomes more flexible.
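
A made-up illustration of the pattern (not the actual libAGF code): the coordinate data and the class labels live in an ordinary structure, rfunc casts the void pointer back, and nothing stops you from building a second structure that pairs the same coordinates with a different set of labels:

//hypothetical parameter structure; the plain ints stand in for
//the libAGF integer typedefs
typedef float real;

struct cls_data {
  real **x;         //coordinate data: n vectors of dimension D
  int *cls;         //class labels--swap in a different set if you like
  int n;            //number of training samples
  int D;            //number of dimensions
};

//matches the rfunc signature above: returns the difference in conditional
//probabilities at v and fills in the gradient
real cond_prob_diff(real *v, void *param, real *grad) {
  cls_data *d = (cls_data *) param;
  real result = 0;
  for (int j = 0; j < d->D; j++) grad[j] = 0;
  //... estimate R(v)=P(2|v)-P(1|v) and its gradient from d->x and d->cls ...
  return result;
}

//pass &cond_prob_diff as rfunc and a pointer to a cls_data as param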

This is one thing that always annoyed me about O-O programming: the idea that all the data has to be hidden, so that you end up with a whole bunch of pointless methods that do nothing but assign to or return a field.  Sure, in some languages (such as C++) you can scrap the data hiding, but then is it really object-oriented?  Instead you end up back at the older, C-style programming paradigm of structures, pointers and unions.
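
A contrived example of what I mean (not from any real code base): the first version hides two fields behind four methods that add nothing; the second just says what it is.

//the "proper" O-O version: data hiding plus four do-nothing accessors
class sample_oo {
    int cls;
    double weight;
  public:
    int get_cls() const { return cls; }
    void set_cls(int c) { cls=c; }
    double get_weight() const { return weight; }
    void set_weight(double w) { weight=w; }
};

//the C-style version
struct sample_c {
  int cls;
  double weight;
};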

Finally, treating the thing passed to this routine as simply a binary classifier obscures the true generality of the method.  It is, in fact, a multi-dimensional root-finder: rfunc could be any continuous, differentiable function that evaluates to both positive and negative values, not just a difference in conditional probabilities (i.e., a binary classifier).  It makes more sense (at least to me) to represent it as a function to which you pass a set of parameters, rather than as an object class.  Again, this to me is the beauty of C++: you can choose, from amongst a multitude of programming paradigms, the one most appropriate to the problem at hand.
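
For instance, you could feed it a purely analytic function and get back points on its zero contour.  The sketch below assumes the declaration above is in scope (via the libAGF headers) with real taken as float; the sphere function, the sampling convention and the allocation of the output arrays are my own guesses, not part of the library's documented interface.

#include <cstdio>
#include <cstdlib>
#include <cmath>

typedef float real;

//rfunc: f(x)=|x|^2-1, zero on the unit sphere; also fills in the gradient
real sphere(real *x, void *param, real *grad) {
  int D = *(int *) param;
  real f = -1;
  for (int j = 0; j < D; j++) {
    f += x[j]*x[j];
    grad[j] = 2*x[j];
  }
  return f;
}

//"sample": one point inside the sphere, one outside (which argument is
//which side, and the meaning of the return value, are guesses)
int sphere_sample(void *param, real *x1, real *x2) {
  int D = *(int *) param;
  for (int j = 0; j < D; j++) {
    x1[j] = 0.5*rand()/(1.0+RAND_MAX)/sqrt((double) D);   //inside
    x2[j] = 1.0 + rand()/(1.0+RAND_MAX);                   //outside
  }
  return 0;
}

int main() {
  int D = 3;
  int n = 100;
  real **border = new real *[n];        //allocation convention guessed
  real **gradient = new real *[n];
  for (int i = 0; i < n; i++) {
    border[i] = new real[D];
    gradient[i] = new real[D];
  }
  nel_ta nfound = sample_class_borders(&sphere, &sphere_sample, (void *) &D,
              n, D, (real) 1.e-5, 100, border, gradient);
  printf("%d points on the unit sphere\n", (int) nfound);
  return 0;
}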

Meta-programming

Another major addition to the program was a hierarchical clustering algorithm.  I wrote this to help me understand the travelling salesman problem, as I've mentioned elsewhere.  The program builds a dendrogram and then, using single-letter commands, lets you interactively browse through it, assigning classes along the way.  It didn't take me long to figure out that you can output the commands as they're being input and then feed them into the program again.  And from there, it's not a very big leap to see how you could write a program that, instead of using the C++ objects directly to manipulate the dendrogram, uses the one-letter commands of the interactive utility.  Although the language of the dendrogram browser is very simple, this is nonetheless a type of meta-programming.
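
The general trick looks something like this (a toy, not the actual dendrogram browser or its command set): every accepted command is echoed to a log file, and since the program can't tell a keyboard from a pipe, that log can be edited, or generated by another program, and fed straight back in.

#include <cstdio>

//toy command loop: single-letter commands are read from standard input and
//echoed to a log; replaying the session is just:  browse < session.log
int main() {
  FILE *logfs = fopen("session.log", "w");
  int c;
  while ((c = getchar()) != EOF) {
    if (c == 'q') break;                  //quit
    switch (c) {
      case 'u': /* move up the dendrogram     */ break;
      case 'd': /* move down                  */ break;
      case 'a': /* assign a class to the node */ break;
      default: continue;                  //ignore anything else
    }
    fprintf(logfs, "%c", c);              //echo the command for later replay
  }
  fprintf(logfs, "q\n");
  fclose(logfs);
  return 0;
}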

It's also a technique I've used very little in the past, perhaps in part because it's so easy to abuse.  I remember once being in charge of maintaining a large, motley and rather finicky piece of software.  In one part, the operations to be performed were first written to a file in a crude sort of language; then the file was read back in and the operations were performed--achieving absolutely nothing in the process.  I removed both pieces of code and rewrote them so that the operations were simply performed, with nothing in between!

Since learning lex/yacc and starting work on some of my own domain-specific languages (DSLs--itself a very powerful concept), I've been using this technique a bit more often.  The training part of the multi-borders program, for instance, does very little except parse a control file.  In the process, it prints out the commands for training the binary classifiers that comprise the multi-class model.  I could have run these directly using system calls, but because parsing the control file takes so little time, I chose to compile them into a script instead.
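
In other words, instead of a system call for each partition, the parser emits something like the following (a sketch; the file names and the number of partitions are made up, though -b 1, which asks svm-train for probability estimates, is a real LIBSVM option):

#include <cstdio>

//write the training commands to a shell script rather than running them
int main() {
  int npart = 3;                        //number of binary partitions
  FILE *fs = fopen("train_all.sh", "w");
  fprintf(fs, "#!/bin/sh\n");
  for (int i = 0; i < npart; i++) {
    fprintf(fs, "svm-train -b 1 part%2.2d.svm part%2.2d.svmmod\n", i, i);
  }
  fclose(fs);
  return 0;
}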

Tuesday, April 1, 2014

The other day I came across a chest

in the woods, buried in snow.  I opened it up and sure enough there was some treasure inside: some coins from the Dominican Republic.  But I didn't take any.  I know, we've all read Treasure Island, we all dream of finding that massive windfall that will set us up for life.  But what if, when we stumble across a hidden chest, instead of taking what's inside, we add a little bit, because, after all, what's a chest without treasure?

Friday, March 21, 2014

Re-post

I notice that two posts on this blog, "Women in science" and "All men are rapists," have been getting a lot of traffic.  Certainly they make nice "click-bait," but this blog was not meant to be political.  Of course, it is the arrogance of scientists and philosophers to claim that they deal in "objective" or "eternal" or "universal" (or whichever other superlative you care to use) truths, rather than pieces of the moment.

As other posts make clear, I am interested in moral philosophy, and the morality of equality is one of the most important developments in Western thought, one whose ramifications have yet to be fully worked out.  The idea, however, is taking considerable abuse from modern feminists.

In any case, here is an older post that I think much more clearly reflects the "spirit" of this blog.

On the negation of modal verbs


I recently learned (which goes to show how diligently I've been practising my German) that Germans use their equivalent of "must not" in the opposite sense to English speakers, to mean "need not."  Thinking about this, I realized that there is an implicit "or" in any statement involving modal verbs, and that the sense of the negative depends upon which part of the logical proposition is negated.

For example, I would translate the phrase:

"You must do A"

into a logical proposition as follows:

^A -> P

where P is some form of punishment. Or:

A or P

We could negate the phrase either by negating the whole thing:

^A and ^P

Or we could negate only one part:

A or ^P = P -> A
(e.g., "You (must not) play in the street.")

^A or P = A -> P
(e.g. "You must (not play in the street.)")

The first form would seem to be how the Germans use the phrase, since whether we do A or not, we will not get punished for it, while the third is more in line with how English speakers use it: A implies punishment.  The second form seems rather more ambiguous and in fact inverts the construct: now, getting punished implies that we have done A.

This has rather deep implications, since all of ethics, law and morality are bound up with the use of modal verbs.  Can we use this idea to justify breaking the Ten Commandments?

I translate,

"I shall go to the store,"

as:

F (uture) -> S (going to the store) = ^F or S

The sixth commandment becomes:

You shall (not kill.) = F -> ^K = ^F or ^K = ^(F and K)

You (shall not) kill. = ^F -> K = F or K

or perhaps,

^(F -> K) = F and ^K

The second says that if there is a future, there may or may not be killing, while the first and third say what we want them to say: if F is true, K must be false.
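
If you don't trust the symbol-pushing, the identities are easy to check by brute force.  This throwaway snippet has nothing to do with the software above; it just prints the truth table for the three forms:

#include <cstdio>

//truth tables for the three readings of the commandment
int main() {
  for (int F = 0; F <= 1; F++) for (int K = 0; K <= 1; K++) {
    int form1 = !F || !K;      //"You shall (not kill.)":  F -> ^K = ^(F and K)
    int form2 = F || K;        //"You (shall not) kill.":  ^F -> K = F or K
    int form3 = F && !K;       //^(F -> K) = F and ^K
    printf("F=%d K=%d:  1st=%d  2nd=%d  3rd=%d\n", F, K, form1, form2, form3);
  }
  return 0;
}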