Sunday, January 18, 2015

The Overseer



One of the more interesting things I did this summer was visit an old-growth forest.  While the scale of these trees was truly astounding, what struck me most was just how little of the tree was canopy and how much was trunk.  Most of the mass of these trees is there to hold up other parts of the tree!  Only a tiny fraction is actually devoted to gathering sunlight.

Moreover, there will be an equilibrium density for the trees.  If they are clustered together too tightly, then one or more of their number will fall.  If they are too sparse, new trees will grow up in the gaps.  In other words, in their natural state, they live a marginal existence: clinging to the edge of life.

That is, unless there is someone tending the forest who culls the trees as necessary.  Then those remaining can grow up healthier and stronger, with thicker trunks relative to their height and a larger, broader canopy.

Monday, January 12, 2015

What is spirit?

In two or more posts, I've talked a little bit about "spirit".  If you're a skeptic and you've been reading this, you probably think that's nonsense: there's no such thing as spirit, it doesn't exist.  Probably not, but then again, there isn't any such thing as energy either, at least not in the sense that you can measure it directly.  It is a derived quantity: it always has to be calculated from two or more other quantities.  Nonetheless it's still an extremely useful quantity in physics.

In particular, I wanted to clarify some statements I made in the post called My Dream.  I thought about simply writing an addendum, but why not create a brand new post?  In that post I talk about negative spirit flowing out of Washington D.C. through a network of trails leading, ultimately, to long-distance scenic trails like the Appalachian Trail.  How does that work?  What is the mechanism behind this?  Simple.  The people walking (or cycling) towards the city are smiling.  Those moving away are not.

I experienced this directly.  I picked up two hikers from the A.T. and let them sleep over in my apartment.  The next day, we made arrangements to meet after I had finished work and they had finished sight-seeing.  Later, I began to worry that I would have trouble finding them since we had agreed to meet at one of the busiest subway stations during rush hour.  It turns out they were easy to spot: they were the only ones smiling.

Sunday, January 11, 2015

Are miracles possible?

What about the double-slit experiment?  That's kind of a miracle isn't it?  Point an electron gun at a pair of slits and then look at where the electrons land on a screen behind.  It doesn't matter how fast you fire the electrons, they will, over time, form an interference pattern.  In other words, they are behaving like waves.  OK, so now you're wondering, if they are behaving like waves, yet they're being fired one at a time, what is interfering with what?  Does each electron interfere with itself, or do they interfere with each other, somehow "seeing" the other electrons both forward and backward in time?

Let's design an experiment to test these ideas.  Now you flash a strobe light every time you fire an electron so you can figure out which slit it passes through.  The interference pattern disappears.

I've heard any number of interpretations of quantum mechanics: An object isn't really there when you aren't observing it.  There is a mystical connection between observer and observed.  The distinction between subject and object is an illusion. And so on.

But at least in this case, I think the interpretation is much, much simpler.  When you observe something, it changes.  Suppose instead of using the term observe, we substitute the word probe.  To probe is almost by definition to interfere with since a probe (as a noun) is almost always something material and solid.  Granted, if you drop a sounder into the ocean to measure depth, for instance, the ocean floor won't change much in response, certainly less than measurement error or the local variance.  But if you fire a photon at a sub-atomic particle, it will change the velocity of that particle, sometimes by a large amount.

The act of observation is the act of touching.  When something is touched, it will always move or change in response.

The analogy to investigations of so-called "psi" phenomena is (or should be) obvious.  Let's use prayer as an example.  I have, at various times in my life, prayed with considerable regularity.  While I no longer identify as Christian, I still pray, though less often than I used to.  It was a great sense of relief when I stopped attending church because I could finally admit to myself that I am completely ignorant about spiritual matters.  But then again, the whole point of a Supreme Being is that He or She or It is completely beyond us.

I don't know how many times I've heard of studies purporting to demonstrate the effectiveness of prayer.  I don't know how many times I've heard skeptics try to debunk such claims and reference their own studies.  But after all, if spiritual configurations (for lack of a better word) are as delicate as sub-atomic particles, should we be surprised if close observation destroys them?  Should we be surprised if they behave in the same way as the double slit experiment and lose their magic as soon as they are seriously scrutinized?

I don't know the answer to these questions, because there is no answer.  I guess it just comes down to faith.

Saturday, January 10, 2015

Fun with matrix operators

Now that I have more time, I can turn over ideas more carefully.  I'm not so focused on results. Take matrix multiplication.  It is an associative operator:

A(BC)=(AB)C

It's a property I use all the time, but typically never consciously think about. For instance, in my sparse matrix array classes, the sparse matrices are stored in reverse order because they are multiplied in reverse order.  This is for efficiency reasons.  If you have a set of matrices, {A_i}, and you want to multiply them through and then multiply the result with a vector, as follows:

A_1 A_2 A_3 ... A_{n-1} A_n v                                (1)

then this:

A_1(A_2(A_3( ... (A_{n-1}(A_n v)) ... )))             (2)

is more efficient than multiplying them through in the order in which they are written:

(( ... ((A_1 A_2)A_3) ... A_{n-1})A_n) v               (3)

because each intermediate result is a vector, rather than a matrix.  Similarly, if v is a full matrix instead of a vector, (2) is still more efficient than (3) because multiplying a sparse matrix with a full one is more efficient than multiplying two sparse matrices.
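
As a toy illustration of (2)--this is not the actual libAGF interface, just a sketch with the matrices stored as sorted triplets--each intermediate product really is just a vector:

#include <vector>

//toy sparse-matrix type: a list of (row, column, value) triplets
struct entry { int i, j; double v; };
typedef std::vector<entry> sparse;

//y = A x for a single sparse matrix (assumes square matrices of dimension x.size())
std::vector<double> sp_vec_mult(const sparse &A, const std::vector<double> &x) {
  std::vector<double> y(x.size(), 0.);
  for (size_t k=0; k<A.size(); k++) y[A[k].i] += A[k].v*x[A[k].j];
  return y;
}

//A_1 A_2 ... A_n v evaluated as in (2): A_1(A_2( ... (A_n v) ... ))
std::vector<double> apply_chain(const std::vector<sparse> &A, std::vector<double> v) {
  for (int k=(int) A.size()-1; k>=0; k--) v=sp_vec_mult(A[k], v);
  return v;
}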

Is associativity related to linearity?  No, because we can define a new, linear operator, let's call it #:

A#B = AB^T;   (A#B)_ij = Σ_k a_ik b_jk

that loses the associative property.  Now let's rewrite (1) in terms of #:

A_1 A_2 A_3 ... A_{n-1} A_n v = A_1#(A_2#(A_3#( ... (A_{n-1}#(A_n#v^T)^T)^T ... )^T)^T)^T

                     = A_1#(v^T#A_n#A_{n-1}# ... #A_3#A_2)

                     = (v^T#A_n#A_{n-1}# ... #A_3#A_2#A_1)^T                  (4)

since:

(A#B)^T = B#A

where, because # is not associative, the chains of # in (4) are evaluated from left to right.

So why is this useful?  From a strictly mathematical standpoint, the new operator is quite useless: it loses the useful and important property of associativity without gaining any new advantages.  Numerically, however, it is extremely useful: multiplying a sparse matrix with the transpose of another sparse matrix is much more straightforward than straight multiplication, since you move through the non-zero elements of both matrices in order.  This assumes that both are sorted by subscript in row-major order.
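
Here is a rough sketch of how one element of A#B gets formed under those assumptions; the point is that the two rows are merged in a single, in-order pass (toy triplet representation again, not the actual libAGF classes):

struct entry { int i, j; double v; };   //same toy triplet type as in the earlier sketch

//(A#B)_ij = dot product of row i of A with row j of B;
//a and b are the non-zero elements of those two rows, sorted by column index
double sparse_row_dot(const entry *a, int na, const entry *b, int nb) {
  double sum=0.;
  int p=0, q=0;
  while (p<na && q<nb) {
    if (a[p].j < b[q].j) p++;            //column only in a: contributes nothing
    else if (a[p].j > b[q].j) q++;       //column only in b: contributes nothing
    else {
      sum += a[p].v*b[q].v;              //matching columns: multiply and advance both
      p++;
      q++;
    }
  }
  return sum;
}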

Suppose, once again, that v is a matrix.  Even if v is sparse, it's usually more efficient to convert it to a full matrix and multiply through using (2) since the result will become full after only a few iterations.  Not always: if all the matrices are tri-diagonal, for instance, then the result will also be tri-diagonal.

So let's assume that not only is v a sparse matrix, but that the result will also be sparse.  Clearly, (4) is now more efficient than (2), since we need only 2 matrix transpositions instead of n, where n is the number of A's. Since sparse matrix multiplication is O(m^2) (where m is the number of non-zero elements in each matrix, assuming it is about the same for each) while matrix transposition is only O(m log m) (for a generalized sparse matrix: swap the subscripts, then re-sort them in row-major order), the gains will be quite modest.

When constructing parallel algorithms, it might be better to stick with (2) since we can multiply through in just about any order we want: e.g., multiply, in order, every second pair of consecutive matrices, then take the results and again multiply by consecutive pairs and so on.
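
A sketch of that pairwise scheme, with the matrix type and the product operation left generic (any associative multiply will do); the products within each pass are independent of one another, which is what you would hand off to separate threads:

#include <vector>

//multiply a list of matrices by repeatedly combining adjacent pairs;
//associativity guarantees the same result as left-to-right multiplication
//(assumes at least one matrix in the list)
template <class Mat, class Mult>
Mat pairwise_product(std::vector<Mat> A, Mult mult) {
  while (A.size() > 1) {
    std::vector<Mat> next;
    for (size_t k=0; k+1<A.size(); k+=2)
      next.push_back(mult(A[k], A[k+1]));         //these products are independent
    if (A.size() % 2 == 1) next.push_back(A.back());   //odd matrix carries over
    A=next;
  }
  return A[0];
}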

I mentioned above multiplying an array of sparse matrices with a full matrix.  This is also an interesting case, although mostly from a numerical perspective.  The intuitive approach would be to have two full matrices: one for the result, and one for the multiplicand. We multiply each of the A's in turn, swapping the result with the multiplicand in between.  A second approach would be to decompose the multiplicand into columns and multiply each of the columns in turn, all the way through.

It turns out that for large problems, the latter approach is much, much faster, even though it violates at least one optimization principle: if we have two loops with the same limits, combine them into a single loop.  For example, this:

for (int i=0; i<n; i++) a[i]=b[i]*b[i];
for (int i=0; i<n; i++) r+=a[i];

should be converted to:

for (int i=0; i<n; i++) {
  a[i]=b[i]*b[i];
  r+=a[i];
}

The column-by-column method is better because it uses less memory and because it operates on a smaller chunk of memory for longer.  Whereas the first method accesses a whole full matrix at each multiplication step, the second method accesses only one column of the matrix. Therefore there is less paging, either between virtual memory and RAM or between main memory and cache.  Like multiplication between two sparse matrices, this method is also more efficient when multiplying with the transpose, assuming row-major storage order once again for both the sparse matrices and the full matrices.

This serves as a reminder that while high-level languages like C and C++ do a good job of isolating the programmer from the hardware, you don't always end up with the best code by treating the computer as a perfect, abstract machine instead of a real one with real limitations and real, hacked-together solutions and work-arounds.

It is worth comparing the code for both methods, so I will post it in full.  Here is method #1:

template <class real>
void sparse_array<real>::mat_mult(real **cand,
                real **result) {
  real **med;           //intermediate result
  real **swap;

  med=allocate_matrix(m, m);
  //multiply the first sparse matrix with the multiplicand:
  sparse_a[0]->mat_mult(cand, result, m);
  //multiply through the remaining sparse matrices,
  //swapping the intermediate buffer with the result each time:
  for (long i=1; i<nsparse; i++) {
    sparse_a[i]->mat_mult(result, med, m);
    swap=med;
    med=result;
    result=swap;
  }

  //if the buffers were swapped an odd number of times,
  //copy the final product back into the caller's result array:
  if (nsparse % 2 == 0) {
    swap=result;
    result=med;
    med=swap;
    copy_matrix(med, result, m, m);
  }
  delete_matrix(med);
}

Here is method #2, applied, however, to the transpose of the multiplicand:

template <class real>
void sparse_array<real>::mat_mult_tt(real **cand,
                real **result) {
  printf("%6.1f%%", 0.);
  for (index_t i=0; i<m; i++) {
    //crude progress meter:
    printf("\b\b\b\b\b\b\b%6.1f%%", 100.*i/m);
    fflush(stdout);
    //multiply one row of the transposed multiplicand all the way through:
    vect_mult(cand[i], result[i]);
  }
  printf("\b\b\b\b\b\b\b%6.1f%%\n", 100.);
}

There are a few things worth noting.  In the first method, if the number of sparse matrices is even, the result and the intermediate buffer are swapped an odd number of times.  For efficiency, only the memory locations are swapped, not the actual values, so the memory locations need to be swapped one last time and the final product copied back into the result array before the function can return.  In the second method, we print out a crude progress meter in the form of the percentage of rows that have been multiplied through.

The second version is not only much simpler, it is also easier to parallelize since each row multiplication is completely independent.
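
For example, assuming vect_mult is thread-safe (my assumption, not something established above), a parallel variant (hypothetical name; the progress meter is dropped since printing from inside a parallel loop gets messy) needs only a single OpenMP directive:

//hypothetical parallel variant of the method above:
template <class real>
void sparse_array<real>::mat_mult_tt_p(real **cand,
                real **result) {
#pragma omp parallel for
  for (index_t i=0; i<m; i++) {
    //each row multiplication is completely independent:
    vect_mult(cand[i], result[i]);
  }
}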

Saturday, December 13, 2014

Accelerate your LIBSVM classification models

I guess I finally have to admit it.  I'm not as clever as I thought.  The statistical classification algorithms I created, more or less from scratch, about nine years ago, are nowhere near as clever as a support vector machine (SVM).  And they may not even be as accurate either!

Yes, sad to say, LIBSVM is more accurate than libAGF.  I believe this is because it is maximizing the accuracy with respect to actual data whereas "adaptive Gaussian filtering" (AGF) is not.

About half a year ago I made a major addition to libAGF.  This was to generalize the pre-trained binary models for multi-class classifications.  I wrote a paper on my approach to the problem and submitted it to the Journal of Machine Learning Research (JMLR) as a "software paper."  Sadly it was rejected.  One of the objections the reviewer had was that the method couldn't be applied to binary classifiers other than the AGF "borders" classifier.  Actually it could, but only at the software level, not at the command-line interface level, and not easily.

So I set out to remedy this in libAGF version "0.9.8."  The software can now be combined with the two binaries from LIBSVM, svm-train and svm-predict, or anything that looks like them.  Whereas in LIBSVM you can choose from exactly one multi-class classification method, libAGF lets you choose from slightly more than n_c! of them, where n_c is the number of classes.

So let's say you've built your hybrid libAGF/LIBSVM model that's tailored exactly to the classification problem at hand.  Since we're ultimately still calling svm-predict to perform the classifications, you might find it a bit slow.

Why exactly is SVM so slow?  After all, in theory the method works by finding a linear discrimination border in a transformed space, so classifications should be O(1) (constant time) efficient!  Unfortunately, it's the caveat "transformed space" that destroys this efficiency.

The truly clever part about support vector machines is the so-called "kernel trick."  With the kernel trick, we implicitly generate extra, transformed variables by performing operations on the dot product.  Consider a squared dot product of two, two-dimensional variables:

[(x_1, x_2) • (y_1, y_2)]^2
= x_1^2 y_1^2 + 2 x_1 y_1 x_2 y_2 + x_2^2 y_2^2
= (x_1^2, √2 x_1 x_2, x_2^2) • (y_1^2, √2 y_1 y_2, y_2^2)

Clever huh?  The trouble is, unlike in this example, you usually can't calculate these extra variables explicitly.  By the same token, the discrimination border, which is linear only in the transformed space, is also neither calculated nor stored explicitly.  Rather, a series of "support vectors", along with matching coefficients, are stored and used to generate each classification estimate.  These support vectors are simply training vectors, usually close to the border, that affect the solution.  The more training data, the more support vectors.
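
For reference, the decision function that svm-predict has to evaluate has the standard textbook form:

f(x) = Σ_i α_i y_i K(s_i, x) + b

where the s_i are the support vectors, the α_i are the stored coefficients, the y_i are the class labels, K is the kernel function and b is a constant offset.  The cost of a single classification is therefore proportional to the number of support vectors.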

The data I'm using to test my "multi-borders" routines contain over 80 000 training samples.  Using LIBSVM with these data is very slow because the number of support vectors is some non-negligible fraction of this eighty thousand.  If I'm testing LIBSVM algorithms I rarely use the full complement; rather, some sub-set of it.

LibAGF multi-borders to the rescue!

The libAGF binary classifiers work by first generating a pair of samples on either side of the discrimination border, then zeroing the difference in conditional probabilities along the line between these two points.  In this way it is possible to generate a group of points lying directly on the discrimination border.  You can collect as many or as few of these points as you need.  In practice, this is rarely more than 100.  In combination with the gradient vectors, it is straightforward, and very fast, to use these points for classification.
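
A minimal sketch of that zeroing step, assuming we have some function R(x) that returns the difference in conditional probabilities (the names here are mine, not the libAGF interface, and I use plain bisection for simplicity; the actual libAGF code may do this differently):

#include <cmath>
#include <vector>

//find a point on the discrimination border by bisection along the line between
//x1 and x2, where R(x1) and R(x2) have opposite signs;
//R(x) returns the difference in conditional probabilities between the two classes
std::vector<double> border_sample(std::vector<double> x1, std::vector<double> x2,
                double (*R)(const std::vector<double> &),
                double tol, int maxiter) {
  double r1=R(x1);
  std::vector<double> mid(x1.size());
  for (int iter=0; iter<maxiter; iter++) {
    for (size_t j=0; j<x1.size(); j++) mid[j]=0.5*(x1[j]+x2[j]);
    double rm=R(mid);
    if (fabs(rm) < tol) break;
    if (rm*r1 > 0) {
      //midpoint is on the same side as x1:
      x1=mid;
      r1=rm;
    } else {
      x2=mid;
    }
  }
  return mid;
}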

The crux of this algorithm is a function which returns estimates of the conditional probabilities. You can plug in any function you want.  Since the idea here is to accelerate the LIBSVM classification stage, we plug in the LIBSVM estimates as generated by the svm-predict command along with the pre-trained SVM model.  Training that SVM model, incidentally, is still just as slow, so you'll just have to live with that; training the new model on top of it, however, is not.

The gradient vectors are generated numerically.  Unfortunately, this produces a slight degradation in accuracy.  In principle, it should be possible, easy even, to produce these gradients analytically.  Later versions may allow this but it will require modifying the LIBSVM executable.
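
The finite differencing would be something along these lines (my notation, not the libAGF source), applied to the same difference in conditional probabilities, R:

∂R/∂x_j ≈ [R(x + h e_j) - R(x - h e_j)] / (2h)

where e_j is the unit vector along the j-th coordinate and h is a small step size.  Every evaluation of R carries its own estimation error, which presumably is where the slight loss of accuracy comes in.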

But the speed increases are monumental.  We have swapped an implicit discrimination border, specified through the support vectors, for an explicit one, directly represented by a small set of samples.

Check out the latest version of libAGF for all the juicy details.

Friday, December 12, 2014

Cloud cover index


The other day I was out running after dark.  It was -8 degrees C and I was expecting it to be bitterly cold.  Instead, it felt quite mild, pleasant even.  It was overcast.

One of the things I learned while working on sea ice retrieval is that air temperature is just one of many factors that determine how an object exchanges heat with its environment, and a relatively minor one at that.  We've all heard of the wind chill factor.  We feel, of course, much colder on a windy day than on a calm one, even if the air temperature is the same.  Another factor is humidity, hence the humidex. The one I'm going to discuss here is cloud cover.  I'm trying to design a temperature correction for cloud cover, much like that for wind chill and humidity.

Every warm object emits radiation, which is why stove elements glow red and incandescent light bulb filaments glow white.  At room temperature this radiation is too low in frequency to see with the naked eye.  When an object emits radiation, an equivalent amount of heat energy is lost.  When it absorbs radiation, the energy from the radiation goes towards warming it. On cold, calm days, radiative cooling is actually a much more significant source of heat loss than direct conduction into the air because air is a poor conductor.  This is why down is so warm: it's not the down itself that's a good insulator, it's actually the air spaces in between.

Clouds reflect a lot of the long-wave radiation we emit, making overcast days feel warmer than clear days, especially during the evenings when the sun's rays are not there to compensate.  This of course is the well-worn explanation for climate change and the "greenhouse effect."

OK, so maybe you've heard all that.  Here is the more technical explanation.  But first, for the less technically inclined, let's go directly to the table:

                 Cloud cover
Air temp.   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
-20.0 -20.0 -19.8 -19.6 -19.2 -18.8 -18.2 -17.6 -16.8 -16.0 -15.0
-19.0 -19.0 -18.8 -18.6 -18.2 -17.8 -17.2 -16.6 -15.8 -14.9 -14.0
-18.0 -17.9 -17.8 -17.6 -17.2 -16.8 -16.2 -15.5 -14.8 -13.9 -12.9
-17.0 -17.0 -16.8 -16.5 -16.2 -15.7 -15.2 -14.5 -13.8 -12.9 -11.9
-16.0 -15.9 -15.8 -15.5 -15.2 -14.7 -14.2 -13.5 -12.7 -11.9 -10.9
-15.0 -14.9 -14.8 -14.5 -14.2 -13.7 -13.2 -12.5 -11.7 -10.8 -9.8
-14.0 -13.9 -13.8 -13.5 -13.2 -12.7 -12.1 -11.5 -10.7 -9.8 -8.8
-13.0 -12.9 -12.8 -12.5 -12.2 -11.7 -11.1 -10.5 -9.7 -8.8 -7.7
-12.0 -11.9 -11.8 -11.5 -11.2 -10.7 -10.1 -9.4 -8.6 -7.7 -6.7
-11.0 -10.9 -10.8 -10.5 -10.2 -9.7 -9.1 -8.4 -7.6 -6.7 -5.7
-10.0 -9.9 -9.8 -9.5 -9.2 -8.7 -8.1 -7.4 -6.6 -5.7 -4.6
-9.0 -8.9 -8.8 -8.5 -8.1 -7.7 -7.1 -6.4 -5.6 -4.6 -3.6
-8.0 -7.9 -7.8 -7.5 -7.1 -6.7 -6.1 -5.4 -4.5 -3.6 -2.6
-7.0 -6.9 -6.8 -6.5 -6.1 -5.6 -5.1 -4.3 -3.5 -2.6 -1.5
-6.0 -5.9 -5.8 -5.5 -5.1 -4.6 -4.0 -3.3 -2.5 -1.5 -0.5
-5.0 -4.9 -4.8 -4.5 -4.1 -3.6 -3.0 -2.3 -1.5 -0.5 0.6
-4.0 -3.9 -3.8 -3.5 -3.1 -2.6 -2.0 -1.3 -0.4 0.5 1.6
-3.0 -2.9 -2.8 -2.5 -2.1 -1.6 -1.0 -0.3 0.6 1.5 2.6
-2.0 -1.9 -1.8 -1.5 -1.1 -0.6 0.0 0.8 1.6 2.6 3.7
-1.0 -0.9 -0.8 -0.5 -0.1 0.4 1.0 1.8 2.6 3.6 4.7
0.0 0.1 0.2 0.5 0.9 1.4 2.0 2.8 3.7 4.6 5.7

In other words, the presence of cloud cover can make it seem several degrees warmer.  How does that work exactly?

The basic idea is to calculate the rate of heat loss for a cloudy atmosphere and then ask the question, "what would the clear sky air temperature need to be in order to produce the equivalent level of heat loss?"

Since the human body exchanges heat with the air using approximately the same mechanisms, let me describe the thermodynamic models I used for sea ice.  These were applied for various purposes: the first one I wrote was actually for an ocean model.  Surface heat exchange is the number one mechanism forcing ocean circulation.  I took the same model and used it to predict the temperature profile of Antarctic icepack and also to model the growth rate and salinity profile of sea ice.

There are four mechanisms by which ice (and other things, including us) exchanges heat with the environment: latent heat, sensible heat, shortwave radiation and longwave radiation.  Latent heat, also called evaporative cooling, is the heat lost through evaporation.  Since water has a very high heat of vapourization, when water evaporates, for instance from the surface of pack ice, from the surface of the ocean or from our bodies, it takes a lot of heat with it.  The main thing determining evaporative cooling is the difference in vapour pressure between the air and the thing being cooled.  High humidity will slow evaporation, as will high temperatures.

Sensible heat is direct heat transfer through conduction.  In still air, sensible heat is basically nil.  As the wind picks up, it can get quite high, as the wind chill index will tell us.
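
For reference, these two fluxes are usually parameterized with bulk formulas along the lines of the following (standard textbook forms; the exact coefficients in my own model may differ):

Q_lat = ρ_a L_v C_E U (q_s - q_a)

Q_sens = ρ_a c_p C_H U (T_s - T_a)

where ρ_a is the air density, L_v the latent heat of vapourization, c_p the specific heat of air, C_E and C_H are bulk exchange coefficients, U the wind speed, q_s and q_a the specific humidities at the surface and in the air, and T_s and T_a the skin and air temperatures.  Both terms vanish in still air, which is why sensible heat is basically nil when the wind isn't blowing.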

Shortwave radiative flux is just the sun shining on us.  This is determined partly by geometry: objects pointing directly at the sun will be heated faster than those at an angle.  Since the sun is lower in the sky earlier and later in the day and at higher latitudes, ground heating is lower.  It is also determined by cloud cover.  In this model, we are deliberately ignoring solar heating: let's just assume it's nighttime.

Finally, the one that interests us the most is longwave flux.  This is heat that is radiated directly off us in the form of electromagnetic radiation.  The rate at which this occurs is proportional to the 4th power of temperature times the Stefan-Boltzmann constant: this is the Stefan-Boltzmann law.  This is the equation I used, which also accounts for both the clouds and the air re-radiating heat back:

Q_lw = -ε σ [0.39 (1 - ccc·cc) T_s^4 + T_s^3 (T_s - T_a)]

where ε is the emissivity coefficient, σ is the Stefan-Boltzmann constant, ccc is the "cloud-cover coefficient," cc is cloud cover as a fraction between 0 and 1, T_s is skin temperature and T_a is air temperature.  I just got this from a paper on surface heat flux, but you can clearly see the Stefan-Boltzmann law embedded in there.
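
To get a feel for the magnitudes involved: bare skin at about 37 deg. C (310 K), with an emissivity near 0.97, radiates on the order of

ε σ T_s^4 ≈ 0.97 × 5.67e-8 W m^-2 K^-4 × (310 K)^4 ≈ 510 W/m^2

in the gross sense.  Much of that is radiated back by the surroundings and the atmosphere; it is the comparatively small net term that the cloud cover modifies.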

In the sea ice growth simulation I used net heat flux to determine the surface temperature of the ice.
The difference in temperature between the ice surface and the water (which is at a constant freezing temperature) is given by the heat flux times the ice thickness, divided by the conductivity. Since all of the flux calculations also involve a temperature difference, this time between skin (surface) temperature and air temperature in one form or another, we need to solve for the skin temperature.  In other words, we need to invert the equation for flux in order to answer the question, "what value of skin temperature will match heat conduction through the ice (between the ice-water interface and the ice surface) to the heat flux between the ice and the air?"
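
In symbols (my notation here, not necessarily that of the original model), the balance being solved for T_s is roughly:

k (T_f - T_s) / h = Q(T_s, T_a)

where k is the thermal conductivity of the ice, h its thickness, T_f the freezing temperature at the ice-water interface, T_s the skin temperature and Q the total heat flux from the ice surface to the air.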

The cloud cover index was generated in a similar way, except in this case the skin temperature was held constant at 37 deg. C while it was the air temperature that was being solved for.  It was also much simpler in that I didn't assume any conductive interfaces: heat was lost directly from the skin to the outside air.

While it might be possible to generate an analytic solution, I just fed the whole thing to a numerical root-finding algorithm, in this case bisection.  I didn't account for latent heat since it's hard to say how high it would be: it depends very much on whether the person is sweating or not.  Leaving out sensible heat so that only longwave is present, however, gave very high values, so I set the wind speed at a moderate 4 m/s (14.4 km/h) to produce some sensible heat loss.  You could also assume that the person is wearing a coat and try to match air temperature to conductive heat loss, much as I did with the sea ice growth model.  This would achieve a similar result.
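
Here is a rough C++ sketch of that procedure (the actual program is in Python, linked below; the emissivity, cloud-cover coefficient and bulk sensible-heat coefficient used here are my own guesses, so treat the output as illustrative only):

#include <cstdio>
#include <cmath>

const double SIGMA=5.67e-8;   //Stefan-Boltzmann constant [W m^-2 K^-4]
const double EPS=0.97;        //emissivity of skin (assumed)
const double CCC=0.7;         //cloud-cover coefficient (assumed)
const double HSENS=10.;       //bulk sensible-heat coefficient for a ~4 m/s wind [W m^-2 K^-1] (assumed)
const double TSKIN=310.;      //skin temperature [K]

//net heat loss from the skin for a given air temperature [K] and cloud-cover fraction:
double heat_loss(double Ta, double cc) {
  double Ts=TSKIN;
  double Qlw=EPS*SIGMA*(0.39*(1-CCC*cc)*pow(Ts, 4)+pow(Ts, 3)*(Ts-Ta));
  double Qsens=HSENS*(Ts-Ta);
  return Qlw+Qsens;
}

//the clear-sky air temperature [K] producing the same heat loss as (Ta, cc), by bisection:
double equivalent_temperature(double Ta, double cc) {
  double target=heat_loss(Ta, cc);
  double t1=Ta-50., t2=Ta+50.;        //bracketing interval
  for (int i=0; i<60; i++) {
    double tm=0.5*(t1+t2);
    //heat loss decreases as the air temperature rises:
    if (heat_loss(tm, 0.)>target) t1=tm; else t2=tm;
  }
  return 0.5*(t1+t2);
}

int main() {
  //e.g. -8 deg. C under full cloud cover:
  double Ta=-8.+273.15;
  printf("equivalent clear-sky temperature: %.1f deg. C\n",
                equivalent_temperature(Ta, 1.)-273.15);
  return 0;
}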

Here is the Python program if you want to mess around with it.

Saturday, November 22, 2014

The mathematics of tracer transport

The reviewers of this paper, about a new method of dynamical tracer reconstruction ("PC proxy"), were interested in knowing more about the algorithm: how to determine the parameters and what makes it better than previous methods.  In order to better understand it (only three years after the fact) I'm in the process of compiling a bunch of mathematics related to tracer transport, particularly in relation to how it's used in the paper.  A lot of it I derived a long time ago, but whenever I want to work on this stuff the notebook where I did the last derivation usually isn't available so I end up re-deriving it.  Now all I have to do is not lose the file--I've already misplaced an older collection of mathematical derivations.  If anybody's seen it, maybe send me an e-mail.

Anyways, I've uploaded it to my homepage.  It's still a work in progress so I will continue to add to and correct it.  I should probably upload it to Github or some other code repository for better version control.  I'm still waiting for a good science-based website for this type of thing.  Maybe if I had the resources I might try starting something like this myself.

Another thing that strikes me is how time-consuming it is to write up these types of derivations in LaTeX.  Some software specifically for writing up and performing derivations could be really useful.  And by this I don't mean symbolic computation packages like Maple or Maxima; I've always found these a whole lot less powerful and useful than one first imagines.  Partly it's just the sheer volume of methods (the Maxima documentation is over 1000 pages long!), but also: all they seem to consist of is a whole bunch of built-in functions for manipulating equations one way or another.  Not unlike a library for any standard procedural programming language, except that instead of operating on variables, they operate on what appear to the user to be mathematical expressions (but are actually structured variables in a programming language!).

What I'm talking about is something much simpler whose only purpose would be to write up equations in a pretty format as well as insert short sections of text and cross-reference previous equations.  A handful of shortcuts for LaTeX might fit the bill. One idea would be to take simple ASCII math and in real time turn it into pretty equations, with macros for symbols that have no ASCII representation.  The ultimate of course would be full WYSIWYG, maybe a bit like MathCAD, but without any actual math.  I might try coding up something in lex/yacc someday.

I'm also working on mathematical computations relating to multi-class classification and optimal adaptive Gaussian filtering.

Update: I've now posted this project on github.  There is also an updated version of the compiled pdf on my homepage.