Rich Data & Machine Learning

A commitment to truth creates a moral imperative that forces you to acknowledge the data and to take the important step of recognizing reality. - M. K. Gandhi

I am a huge advocate of serious, thoughtful data analysis.  I studied (among many other things) statistics at MIT, where I completed all of the course requirements for a PhD in statistics before getting distracted. I worked on a number of truly interesting problems, including applying stochastic methods to representations of social networks to demonstrate some interesting results about rumor contagion.

I have continued to learn and apply state-of-the-art data analysis during my career.  It was without irony that I was the development manager for the first versions of Lotus 1-2-3, which for years was the world's leading spreadsheet. Other rich data solutions in my career span fraud detection (a recommended solution by the American Bankers Association), dating and jobs site recommendations and matching, information extraction and analysis for biomedical applications, and more.

The biggest lessons I wish to impart are:

  • Rich data is not necessarily big data. The quantity of data is not an indicator of the quality of data.  In fact, usually the opposite holds.
    • You may be using wrong or poor data features. You have lots of data, but the data does not have the indicators or values that can be used to draw useful conclusions.  If you haven't collected or coded features that explain most of the variance of your dependent outcomes, then you are probably in trouble.  Sure, you may be able to find some features that proxy for what should have been assembled, but you will probably see a lot of spurious patterns.
    • You might be poorly preparing your features for a pattern algorithm.  You start with potentially rich data and you "simplify" it.  For example, text analytics companies are almost always incredibly guilty of putting coherent text through a grinder ("stemming" and "stopping") to create a "bag of words".  Now, the average word has about 2.5 separate meanings and most information-laden terms are multi-word expressions. No matter how clever your algorithms, it is still garbage in, garbage out. Yes, linguistics is hard. It's also the only way to make headway. And, if you start with lots of PDF files and aren't smart about extracting text, typically 10-20% will end up being garbled.
  • You don't need to do a rich data experiment for anything you can look up in Wikipedia.
    • Pattern-based methods are merely methods to apply when you run out of other ways to discover what you need to know. Tycho Brahe, a Dane, collected copious data on the positions of planets. From that, Kepler created the three laws of planetary motion (ellipses with the Sun at one focus, equal areas in equal time, orbital period squared is proportional to the cube of the semi-major axis).  Arguably, Kepler was using the "big data" methods of his time.  But Newton, in working out the inverse-square law of gravity, used Kepler's laws, not Brahe's data.  And Einstein used his Equivalence Principle, which was devised from a thought experiment, to work out his theory of gravity and then devised a test involving a single observational experiment (the positions of stars during a solar eclipse).
  • There is no free lunch. And there are proofs, so don't think otherwise.
    • If one algorithm seems to outperform another in one problem setting, it is not because of its overall superiority; it is merely that the algorithm fits the characteristics of the data at hand.  This, an informal statement of the No Free Lunch theorem, is not mere folklore. It is a warning that we always need to consider prior biases and models, the data itself, the trade-offs between generalizations and specializations to be made, and the way we penalize errors. Marketers may sing the praises of their company's superior algorithms. Beware. Remember that there are bad algorithms and bad approaches out there - but this does not imply there is ever going to be some universal "best" algorithm. Choose wisely.
    • Every day I spot something akin to "buy this product, become a data scientist immediately!" (where have I heard THAT before?) Almost as often, I see articles that unabashedly gush about map-reduce algorithms as a panacea. No. Data science is more about understanding use cases and wise application and interpretation of algorithms than about the algorithms themselves. I would not want to visit a doctor whose knowledge was limited to the inner workings of all of the devices he uses.
  • Ugly ducklings are all around us.
    • Our analysis is going to be influenced by our assumptions about what an interesting pattern might look like. The "ugly duckling" theorem states that there is no "best" set of features or feature attributes that applies independent of the problem that we are trying to solve.  Further, the assumptions that we make influence the patterns that we see as similar or significant.  A very simple example: If I told you that I flipped a coin six times and each time got "heads" you might think that remarkable.  If I told you that I flipped a coin and got the sequence H-T-T-H-T-H, you probably would think that ordinary.  And yet, each sequence arises with the same probability (1 out of 2^6, that is, 1 out of 64).

  • 95% confidence means 1 out of 20 times you are dead wrong.  Enough said.
  • Statistical significance is not the same as true significance.
    • We all know the beer and diapers story.  It seems compelling - the application of a data mining technique to big data to get a surprising and useful result. The problem in the story is that the technique used (discovery of association rules, most often by applying the "apriori" algorithm) generally creates a flood of associations that appear significant to the algorithm but have no practical value.  As a result, the approach is almost always best avoided.
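
The last two points can be illustrated with a small simulation. What follows is a minimal sketch, not a reconstruction of any real market-basket study: the baskets are purely random, so no real association exists between any two items. The function names, basket counts, and purchase probability are all illustrative assumptions, and a plain chi-square test on each item pair stands in for a full association-rule miner such as apriori. At a 95% threshold one would expect roughly one pair in twenty to be flagged anyway.

```python
import itertools
import random

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRITICAL_95 = 3.841


def make_random_baskets(n_baskets, n_items, p_buy, rng):
    """Build baskets in which every item is bought independently (hypothetical data)."""
    return [
        {item for item in range(n_items) if rng.random() < p_buy}
        for _ in range(n_baskets)
    ]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def count_flagged_pairs(baskets, n_items):
    """Count item pairs whose co-occurrence passes the 95% significance threshold."""
    n = len(baskets)
    flagged = 0
    for x, y in itertools.combinations(range(n_items), 2):
        both = sum(1 for b in baskets if x in b and y in b)
        only_x = sum(1 for b in baskets if x in b and y not in b)
        only_y = sum(1 for b in baskets if y in b and x not in b)
        neither = n - both - only_x - only_y
        if chi_square_2x2(both, only_x, only_y, neither) > CHI2_CRITICAL_95:
            flagged += 1
    return flagged


rng = random.Random(0)  # fixed seed so the run is repeatable
n_items = 30
baskets = make_random_baskets(n_baskets=1000, n_items=n_items, p_buy=0.2, rng=rng)
n_pairs = n_items * (n_items - 1) // 2
flagged = count_flagged_pairs(baskets, n_items)
print(f"{flagged} of {n_pairs} pairs flagged as significant in purely random data")
```

Every flagged pair here is statistically "significant" and practically meaningless. The number of candidate pairs grows quadratically with the number of items, so on a real catalog the false discoveries can easily outnumber the useful ones unless the threshold is corrected for the number of tests.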

I have had occasion to use a wide variety of approaches over the years, some exciting and some mundane. In this notebook, I have created pages for some technology I would like to share, along with some notes on "The sociology of data." The notes and code on kernel methods and SVMs will be packaged and shared on GitHub.