Friday, August 8, 2014

Worshipping complexity

Frankenstein tells the story of a monster created by technology.  The story is popular because it is prophetic.  There are no literal Frankenstein monsters but I don't think many people recognize the real, runaway Frankenstein monster we've created.  A few weeks ago, in a fit of temper, I smashed my netbook.  My shoulder is still sore from it.  But in retrospect, I'm glad I destroyed that machine because I'm certain it was spying on me.  Why else would I be offered such a convenient device with such impressive specifications--dual, 1.66 GHz processors, 1 gigabyte of RAM, 160 gigabytes of hard-disk space--for so little money?  Twenty years ago, that was super-computer territory.  That also happens to be equipped with a microphone and a low-light video-camera.  In the wake of Edward Snowden, I don't think we need to ask the question.  It also explains why, despite it's impressive technical specifications, the machine was so damn slow.

There's no question in my mind that in twenty years or so, there won't even be any human beings behind the spying, that is, assuming the computers even let us live.  Most if not all of the elements are in place for a complete take-over of our world by the machines.  Assuming it hasn't happened already.  Yes, this our Frankenstein monster: runaway technology.  It no longer serves us, we serve it.  And it has caused considerable damage to the natural world like the tar sands in Northern Alberta.

But these elements of complexity are beyond my means to tackle, so I'd like to focus on those that are within reach: inside my own computer code. Recently I wrote an answer on Quora about why so many people seem to hate C++.

The purpose of using a more expressive language is to simplify things: to accomplish more with less, but the more I consider it, the more I'm beginning to suspect that C++ actually does the opposite. It actually encourages you to make your code more complex.  Take this class definition, presented in it's entirety, for a sparse matrix:

namespace libpetey {
  namespace libsparse {

    //forward declaration of sparse_el:
    template class sparse_el;

    typedef int32_t ind_t;

    #define EPS 5.96e-8         //this is actually pretty big... (?)

    template
    class sparse:public matrix_base {
      //friend void conj_grad_norm(sparse *, data_t *, data_t *, data_t, long);
      //friend double * irreg2reg(double **xmat, long m, double *y, long n, double **grids, long *ngrid);
      private:
        //since all the "constructors" do the same thing, we have one private
        //routine to do them all:
        void copy(const sparse &other);
        void convert(sparse_array &other, data_t neps=EPS);

      protected:
        sparse_el *matrix;

        //dimensions:
        index_t m;
        index_t n;

        //number of nonzero elements:           (these types should depend on index_t
        long nel;                               //i.e. index_t = int32_t, then use int64_t)

        //index of last element searched:
        long last_search;

        //size of array
        long array_size;

        //update flag:
        //0=matrix needs updating
        //1=matrix is fully updated
        //2=matrix is sorted but has zero elements
        char update_flag;

        //"zero" tolerance:
        float eps;

        FILE *sparse_log;


      public:
        //initializers:
        sparse();
        sparse(data_t neps);
        sparse(index_t min, index_t nin, data_t neps=EPS);
        sparse(data_t **non, index_t min, index_t nin, data_t neps=EPS);
        virtual ~sparse();

        //create the identity matrix with existing dimensions:
        void identity();
        //create the identity matrix with specified dimensions:
        void identity(index_t mn, index_t nn);

        sparse(matrix_base *other);
        sparse(const sparse &other);
        sparse(sparse_array &other);
        sparse(full_matrix &other, data_t neps=EPS);

        void from_full(data_t **non, index_t min, index_t nin, data_t neps=EPS);

        //storage manipulation:
        void update();          //update matrix
        void remove_zeros();    //remove insignificant elements
        //remove elemnts of less than specified magnitude:
        void remove_zeros(data_t minmag);
        //clear matrix, but do not change storage:
        void reset(index_t mnew=0, index_t nnew=0);
        //clear matrix, deallocate storage:
        void clear(index_t mnew=0, index_t nnew=0);
        //extend storage by specified amount:
        void extend(long size);

        //add and access matrix elements:
        //add or change an element:
        long add_el(data_t val, index_t i, index_t j);
        //change existing element or insert new:
        virtual long cel(data_t val, index_t i, index_t j);

        //return value of element:
        virtual data_t operator ( ) (index_t i, index_t j);

        //access rows:
        virtual data_t *operator ( ) (index_t i);
        virtual void get_row(index_t i, data_t *row);
        //ratio of amount of storage to equivalent full:
        float storage_ratio();
        //approx. ratio of performance (for matrix multiply) to equivalent full:
        float performance_ratio();

        //I/O:
        virtual size_t read(FILE *fptr);                //binary read
        virtual size_t write(FILE *ptr);                //binary write
        virtual void print(FILE *ptr);          //print (ascii) to file
        virtual int scan(FILE *ptr);            //read from (ascii) file
        void print_raw(FILE *ptr);

        //the "canonical" versions:
        virtual matrix_base * mat_mult(
                    matrix_base *cand);
        virtual matrix_base * add(matrix_base *b);

        virtual data_t * vect_mult(data_t *cand);
        virtual void scal_mult(data_t m);
        virtual data_t * left_mult(data_t *cor);

        virtual matrix_base & operator = (full_matrix &other);
        virtual matrix_base & operator = (sparse &other);
        virtual matrix_base & operator = (sparse_array &other);

        virtual data_t norm();

        virtual matrix_base * clone();

       //sparse & operator = (matrix_base &other);

/*
        virtual operator sparse& ();
        virtual operator sparse_array& ();
        virtual operator full_matrix& ();
*/

    };

    template
    inline long sparse::size() {
      return nel;
    }

    template
    inline void sparse::remove_zeros() {
      remove_zeros(eps);
    }

    typedef sparse sparse_matrix;

  }  //end namespace libsparse
}    //end namespace libpetey

Hmmmm....  it doesn't look that simple.  (Hopefully you didn't read through all of it.)  Notice how I've thrown in everything but the kitchen sink: well, templates allow us to use the class with different data types, so why don't we throw them in?  Namespaces prevent collision between symbols in different libraries and they define modules, so why don't we throw them in as well?  And so on.

Why don't we start over, and this time work in C?  

typedef float scalar;
typedef integer int32_t;

struct sparse_element {
  integer i;
  integer j;
  real value;
}

This is how we could define the data structure. To create an instantiation of it, just make an array:

sparse_element sparse_matrix[nel+1];

The first element could hold the dimensions of the matrix:

sparse_matrix[0].i=m;
sparse_matrix[0].j=n;

This type of doubling up on variables is often discouraged, but does have the nice side effect of making code a little simpler and less cluttered.  The number of data elements could be either stored in a separate variable (common with arrays in C), stored in the value field or an extra element that serves as an end marker tacked on to the end of the array, much like C strings.  This is not as mad as it sounds: most functions will operate on the whole list, iterating through each element one-by-one.  One thing we lose is the ability to tack on new elements easily, but on the other hand it is more efficient to have some idea how big your data structure will be ahead of time and to allocate space accordingly.

This doesn't include all the operators which of course will all need be set down.  But we've also left out the templates and the namespaces.  Sure, it's nice to be able to use the matrix class with different scalar data types and with different integer types for index locations, but are we really ever going to use, at the same time, multiple sparse matrices comprised of different elementary types?  Probably not.  A typedef works just as well and is far simpler.

But even if we need generic programming, this can be done quite easily in C.  Take this prototype for a sort function, for instance:

void heapsort(void **data, int n, int (* comp)(void *, void *));

Where data are the data elements (any type) and comp is the comparison function.  Want polymorphism as well?  This isn't too hard either, and it occurs where it should: as far outside of the basic definitions as possible.  It's not built in from the very beginning.  Lets say we have two data types:

struct child1 {
  ...
};

struct child2 {
...
}

And they both have an operator called do_something:

int do_something1(child1 *);
int do_something2(child2 *);

But we need a more general operator that works on either type:

struct parent {
  int type;
  union data {
    child1 *a;
    child2 *b;
  }
};

int do_something(parent *c) {
  switch (c->type) {
    case(1):
      do_something1(c->data.a);
      break;
    case(2):
      do_something2(c->data.b);
      break;
    default:
      fprintf(stderr, "do_something: type %d undefined\n", c->type);
      exit(-1);
}

In this case, yes, the alternative in C is more complicated and "less elegant" whatever that might mean, but it has two features that I like.  First, the branch logic is explicitly written out, not hidden or implicit in the definitions.  Second, the more basic operations are defined completely independently.  There is no need to rough out the larger structure first.

Another thing I like about C, or any pure procedural language, is that processes and data are always strictly separate.  There is no need to mash them together, whether such an alliance is holy or not.  At least C++ gives you the choice; languages like Smalltalk do not.  Another thing I can't stand is forced data-hiding.

Unfortunately, I still haven't mastered these techniques myself.  I got into C++ programming early, just as it was first coming out and haven't switched since.  I had to write a simulation of a chaotic scattering system: essentially a simple ray-tracing progra and started in Pascal because that's what I had learned in highschool.  I had a Turbo-Pascal compiler which had all the bells and whistles, including O-O.  Since I was curious about object-oriented programming and it was very new at the time, I coded the problem in this style and it was a very good "fit".  Later on I wanted to learn C and C++ because, like objects, they were also very popular and "cool" at the time.  Translating this program into C++ was almost trivial because the object-oriented aspects of each language were so similar.

Minimalist Coding

I wanted to call it, "extreme programming" but that's taken already and besides, the term is a bit cliche by now.  But I am going to borrow some ideas from extreme sports, in particular alpinism.  Don't ask me how I got interested in mountaineering.  Heights scare the shit out of me.  I figured if I could overcome this one fear, it would help me with more mundane fears like job interviews or asking a girl out.  I'm not convinced the idea was all that sound.  I never got much further than conquering the easiest routes of the bouldering wall at the local climbing gym.

I also bought and read Extreme Alpinism by Mark Twight.  There are two broad styles of climbing: expedition style and "alpine" style.  In expedition climbing, you take a big company with lots of gear, lay out camps and hang lots of fixed rope.  By being well equipped you are more protected.  On the other hand, you move a lot slower.  In alpine climbing, you take as little gear as possible and try to be self-sufficient.  You can move a lot faster and you're more flexible.

I was most interested in the most extreme style: climbing without rope or protection.  Consider, every-time you insert a chock or other protection, you are distracting yourself from the main task at hand: getting up the mountain.  Also, if your protection is sloppily placed it may be little better than no protection at all.  Why not focus all your energy on the most important thing: making sure you have the best holds and that your hand and foot placement is secure?  Because you aren't carrying all this hardware, you're a lot lighter.  I think you can see where I'm going with this.

So we leave out range checking and don't make any attempt to catch or correct errors.  It's considered normal for a subroutine to return an error code but if every time after calling a subroutine we check for an error condition and the stack is four levels deep, that's an awful lot of extra code and worse, clutter.  When I first started learning computer science, the term GIGO was thrown around a lot: garbage in, garbage out.  Instead of spending so much effort testing for errors, why not spend that effort making sure your inputs are correct?  If your code is also correct, that will guarantee that your output is ultimately error free.

We apply the same minimalism to documentation: keep documentation to a minimum and keep variable names short.  Ideally, the code should be self-explanatory: because we have now reduced the clutter by a large percent, the program is easier to interpret and it's easier to find things.  Naturally, this style works a lot better for scientific programming than more user-oriented problems.  One place where your code almost always has to be fault-tolerant is when importing large amounts of data.  Other people's data is notoriously unreliable. Often, however, large chunks of it can be ignored or thrown away without any ill effects, but we need to catch the bad apples first so as not to spoil the rest.

I confess I am a bit miffed at just how un-portable C and C++ are.  (Though it took me a while to realize that integer types may be different sizes on different machines and compilers, about the time I started working on 64 bit computers.)  But not miffed enough to use a configure script.  I recently inspected the configure script for an early C++ version of NetCDF: it's over 17000 lines long!  Meanwhile, the wrapper classes that comprise the body of the library are just over 4000 lines.  This is not a recipe for portability.  It's easier to debug a 200 line make file than a 10000 line configure script.  Some time ago I had the privilege of doing so, and the code looked pretty hacked together.

Rising from the ashes

This is me and my foundation rising from the ashes...