Machine Learning, Deep Learning and Scientific Understanding

Machine learning is super hot in Silicon Valley these days! It has emerged as a very useful discipline at the intersection of computer science and statistics. If you have machine learning skills, you are likely to land a nice job. But what exactly is machine learning? With the growing popularity of the field, engineers and scientists know the technical answer, but can it be explained to everyone in simple language? What are the current trends in machine learning, and what could come next?

In this article, we will focus on statistical learning and discuss the state of the art, current trends, and future directions.

Decisions, Decisions, Decisions!

Why is machine learning hot? Well, there are decisions to be made everywhere. If you want to predict, recommend, classify, or rank something in an automated, data-driven manner, your best bet is statistical machine learning.

Netflix wants to recommend movies you may like. Google wants to rank web pages by their relevance to your search query. Facebook and LinkedIn want to display the advertisements most likely to be clicked. A biotechnology company wants to offer a diagnostic platform that predicts a medical condition from gene and protein expression. When you wave your hand in front of a Kinect, it tries to classify its 3-D depth sensor data into different body parts and understand your gesture. A retailer wants to predict demand for inventory items. Self-driving cars need to understand the surrounding traffic. And the list goes on.

In the future, a household robot will recognize faces, understand gestures, facial expressions and speech, and move around your house helping with chores. It will have to make decisions almost non-stop! Machine learning will be the basis for such artificial intelligence.

It is not easy to come up with rules of thumb for making decisions in all these problems. There can be a very large number of situations to handle, and you may not be able to write down a simple recipe that makes the right decision in every possible case with the desired accuracy. There may be a complex underlying process at work which is quite difficult to capture in a simple handcrafted model.

Machine learning means building software that makes decisions in a statistical sense. You train this software on training data: data for which humans have provided their judgements, or labels, which are presumed to be the correct decisions. The software therefore captures human intelligence and decision-making experience. At a very concrete level, training data can be viewed as an Excel table with rows and columns. The rows are the different instances, or samples, of your problem. The columns represent the features, or signals, which you think could be useful. There is one extra column which contains the correct decision filled out by human experts.
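To make the table picture concrete, here is a minimal sketch in Python using the pandas library (the column names and values are invented purely for illustration):

```python
import pandas as pd

# Each row is one sample; each column except the last is a feature.
# The final "label" column holds the human expert's judgement,
# which we assume to be the correct decision.
training_data = pd.DataFrame({
    "age":         [34, 52, 27, 41],
    "income":      [55000, 82000, 31000, 67000],
    "num_friends": [12, 4, 25, 8],
    "label":       [7, 5, 9, 6],
})

features = training_data.drop(columns="label")  # the signal columns
labels = training_data["label"]                 # the decision column
print(features.shape, labels.shape)             # (4, 3) (4,)
```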

Example: Predicting Happiness

To make the discussion livelier, let us apply machine learning to a problem which is the subject of so many self-improvement books. We will try to achieve through machine learning, using nothing but statistics, something that has preoccupied humanity for ages: we will predict happiness!

Happiness is our target, or response, variable. Using our intuition, and in consultation with happiness experts, we make a list of the factors we presume to be important in predicting happiness. They could be age, gender, income, relationship status, number of children, political beliefs, religious beliefs, job satisfaction, number of friends, personality type, and so on; these constitute our features, signals, or predictor variables. Let us say we list 20 such features. We then go around the world surveying, say, 5000 people, and for each person we record the values of the 20 features. The training data, as an Excel table, will have 5000 rows and 21 columns. Why 21 columns? The first 20 columns are the features; the 21st column is the target variable, which indicates how happy the person is, for example on a scale from 0 to 10. How do we fill out this 21st column in our happiness project? Here human judgement, experience, and intelligence come into play. It could be a self-reported happiness level, or a value computed by experts whose task is to assign each person a happiness score. The important point is that this value is assumed to be correct. That is why training data is sometimes called a golden dataset or ground truth, and this process of learning from labeled data is called supervised learning.

Once we have our 5000-row, 21-column table filled out, we can use it as input to a machine learning training algorithm, which will try to build a mathematical function, or model, that maps the predictor variables to the target variable. The output of this supervised learning is a trained model which we can then use in practice to predict the happiness of any person.
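As a rough sketch of what this training step can look like in Python with scikit-learn (the happiness data here is randomly generated, purely so the example runs; a real project would load the survey table instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))        # 5000 people, 20 features each
y = rng.uniform(0, 10, size=5000)      # happiness scores on a 0-10 scale

model = LinearRegression()
model.fit(X, y)                        # learn a mapping from features to happiness

new_person = rng.normal(size=(1, 20))  # feature values for someone new
print(model.predict(new_person))       # the model's happiness prediction
```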

What is the form of this trained model? The machine learning literature offers many choices, from very simple to quite complex. Let us say we treat them all as black boxes, try each one in a brute-force fashion, and see which works best.
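A minimal sketch of such a brute-force comparison, scoring a handful of scikit-learn models with cross-validation (the candidate list and hyperparameters are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))        # same synthetic stand-in data as before
y = rng.uniform(0, 10, size=5000)

candidates = {
    "linear":        LinearRegression(),
    "ridge":         Ridge(alpha=1.0),
    "decision tree": DecisionTreeRegressor(max_depth=5),
    "k-NN":          KNeighborsRegressor(n_neighbors=10),
}

# Treat each model as a black box, score it, and keep the best.
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```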

The training algorithm will also tell us how well the training succeeded. Suppose it reports that we achieved high accuracy. Can we then open the Champagne and celebrate solving this age-old problem? No. We first need to apply our model to data it has not seen. This is called the test data, and we are shown it only once, after training has been completed. This is our real examination. If we do well on this test set, and assuming the test set is reasonably large and representative, then yes, we can definitely celebrate! 🙂 It will be another jewel in the crown of machine learning.
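A small sketch of this final examination: hold out a test set before training and evaluate on it exactly once (again with synthetic stand-in data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 20))
y = rng.uniform(0, 10, size=5000)

# Set aside 20% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = Ridge().fit(X_train, y_train)
test_error = mean_absolute_error(y_test, model.predict(X_test))
print(f"held-out mean absolute error: {test_error:.2f} happiness points")
```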

Machine Learning Concepts

While trying to predict happiness, we may read a few books on machine learning. In the machine learning literature, we would encounter concepts such as training error, generalization error, the bias-variance tradeoff, VC dimension, PAC learning, overfitting, underfitting, and ROC curves; we will become familiar with models such as linear models, logistic regression, neural networks, support vector machines, the Bayes classifier, decision trees, nearest neighbors, probabilistic graphical models, and generative and discriminative models; we will learn about gradient descent, convex optimization, clustering, expectation maximization, boosting, bagging, bootstrapping, Monte Carlo techniques, cross-validation, dimensionality reduction, and regularization; and many other things. Knowledge of linear algebra and statistics will be quite handy in mastering these concepts.
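To pick just one concept from this list, here is a minimal NumPy sketch of gradient descent, fitting a linear model by repeatedly stepping against the gradient of the mean squared error (the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)  # noisy linear data

w = np.zeros(3)       # start from an all-zero weight vector
lr = 0.1              # learning rate (step size)
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
    w -= lr * grad                         # step downhill
print(w)  # should end up close to [2.0, -1.0, 0.5]
```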

All this will give us a theoretical understanding of what is going on in machine learning, as well as a set of practical tools. We will discover that machine learning is an empirical science built on a rich theoretical foundation, requiring a lot of experimentation and iterative, continuing improvement cycles. We will realize that we need rich visualization tools which help us discover patterns in our data and in our results, so that we can keep making these improvements.

Deep Learning

In our example of happiness prediction, we used our intuition about, and our understanding of, happiness to list 20 features which we thought were important. Depending on the application, we come up with an appropriate list of features we think are important. For example, for Google, PageRank is an important and well-known feature for ranking web documents. If we are trying to classify a digital image, we will use certain computer vision features, for example those based on image intensity gradients. This is called feature engineering. A lot of innovation goes into designing such useful features, and publications which propose them are therefore well cited.
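As an illustrative sketch of such a handcrafted feature, the snippet below computes a histogram of intensity-gradient orientations over an image, loosely in the spirit of gradient-based descriptors such as HOG (the bin count is arbitrary and the random image is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))        # stand-in for a grayscale image

gy, gx = np.gradient(image)         # intensity gradients along y and x
magnitude = np.hypot(gx, gy)
orientation = np.arctan2(gy, gx)    # gradient direction in [-pi, pi]

# A handcrafted feature: orientation histogram weighted by gradient magnitude.
hist, _ = np.histogram(orientation, bins=8, range=(-np.pi, np.pi),
                       weights=magnitude)
feature_vector = hist / hist.sum()  # normalize so the bins sum to 1
print(feature_vector)
```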

Let us now stop briefly and make two observations. One crucial observation is that we live in a world which is best modeled hierarchically: from elementary particles to giant galaxies, from raw pixels in your digital camera image to a familiar face, from vibrations of air molecules to a Beethoven symphony, from the simple realities within us and around us to the elusive concept of happiness.

The other crucial observation is that the human mind, our best model of intelligence, does not exactly go through a process of supervised learning. A human baby looks around the world and figures out a lot of patterns in an unsupervised (or rather self-supervised) manner. Parents and teachers only provide a gentle supervised touch to this innate learning process. We follow a process of building models of the world based on evidence, and gradually refining them.

How exactly unsupervised (self-supervised) learning and explicitly supervised learning will work together to give the best solution is still being researched. It is worth noting that explicitly supervised approaches continue to perform well: when we do have a large amount of labeled data and we use layers of hierarchical features which are learned automatically, supervised learning offers competitive solutions and can outperform unsupervised learning. At the time of writing, convolutional neural networks trained in a supervised manner seem to perform best for computer vision.
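As a rough sketch of what such a supervised convolutional network looks like in code, here is a minimal PyTorch model with two convolutional layers of learned hierarchical features (the layer sizes are arbitrary and the training loop is omitted):

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """Two convolutional feature layers followed by a classifier head."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # low-level features (edges)
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # composite features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

# One fake 28x28 grayscale image through the network:
logits = TinyConvNet()(torch.randn(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```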

Can we somehow capture these ideas and improve on classical supervised machine learning? It would also be very useful, since this is the era of Big Data. Enormous amounts of web data, mobile data, image data, video data, social networking data, customer data, and biological and medical data are being collected in giant server farms, and it is practically impossible to label this humongous data as classical supervised machine learning requires.

A self-supervised or semi-supervised approach which works with unlabeled data therefore seems the better bet from a practical point of view. We can try to build hierarchical representations of features automatically from unlabeled data, just as a human baby does, through an iterative process of parts-and-whole model building, model matching, and model refinement, some components of which could be based on explicitly supervised learning. The human brain is believed to be a hierarchical deep network of neurons with many layers. Starting with raw data, from our eyes through the optic nerve to the visual cortex, or from our ears through the auditory nerve to the auditory cortex, it builds a hierarchical representation of features, which ultimately leads to the recognition of a familiar face or the appreciation of a beautiful song.

Deep learning derives its motivation from this biology. Deep learning is a family of techniques, currently an active research area in the machine learning community, in which we train hierarchical features in an unsupervised manner using huge amounts of unlabeled data, and then fine-tune them in the classical supervised manner using a much smaller amount of labeled data. We are thereby automating the process of feature engineering, which used to require human ingenuity.
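A minimal sketch of this two-stage recipe in PyTorch: an autoencoder first learns features from unlabeled data by reconstructing its input, and its encoder is then reused and fine-tuned on a much smaller labeled set (the single-layer encoder and random stand-in data are simplifications):

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())  # the features to be learned
decoder = nn.Linear(64, 784)                            # used only for pretraining

unlabeled = torch.randn(256, 784)   # stand-in for abundant unlabeled data
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
for _ in range(100):                # stage 1: unsupervised pretraining
    reconstruction = decoder(encoder(unlabeled))
    loss = nn.functional.mse_loss(reconstruction, unlabeled)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: supervised fine-tuning on a much smaller labeled set.
classifier = nn.Sequential(encoder, nn.Linear(64, 10))
labeled_x = torch.randn(32, 784)
labeled_y = torch.randint(0, 10, (32,))
opt2 = torch.optim.Adam(classifier.parameters())
for _ in range(50):
    loss = nn.functional.cross_entropy(classifier(labeled_x), labeled_y)
    opt2.zero_grad(); loss.backward(); opt2.step()
```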

Consider the example of classifying an image as that of a cat. One can feed lots of images, both of cats and non-cats, perhaps drawn from millions of YouTube videos (as was recently demonstrated by a machine learning team led by Prof. Andrew Ng), to a deep learning algorithm and let it combine raw pixels into features such as edges, then a next level of features such as composite edges, corners, and basic textures, then a next level such as eyes, ears, and fur, until we get a high-level feature which puts them all together into a cat's face.

Coming back to the human baby example, it seems three billion years of evolution have pre-trained the first layers of these robust vision features in us, and then, aided by the amazing flexibility of the human brain, a baby has no trouble training the higher layers to recognize cars, table lamps, people, butterflies, flowers, and so on. So the evolution of the mind and the flexibility of the mind go hand in hand in creating human intelligence! Parents, teachers, and other people are still important, as they train the highest layers, which give us social and emotional intelligence.

Artificial Intelligence and Natural Intelligence

Since deep learning has strong biological motivations, are we moving towards a future in which the line between natural intelligence and artificial intelligence blurs? It is exciting that computer scientists and neuroscientists can now work together to unravel the mysteries of the human mind!

At the same time, we should realize that machine learning, including deep learning, is best explained in terms of mathematics. We are really training a mathematical model which may or may not correspond to how the human mind works, yet which will have all the appearance of intelligence in functional form. Machine learning tries to replicate the mapping from features to correct predictions using whatever works best in practice. Such software therefore appears intelligent when viewed as a black box, but it may employ a totally different mechanism for this mapping than the one our brains use.

Which is the higher intelligence, artificial or human? Which has the better long-term potential?

Since the human mind is only one way to realize this mapping, despite our anthropocentric mindset there exist pure mathematical models out there which significantly outperform the human mind, and which we will hopefully discover at some point. Coupled with the fact that the cloud computing of the future will be able to train and execute these models at mind-boggling scale and speed, it is reasonable to predict that artificial intelligence will eventually surpass human intelligence! That should make us humble as well as proud.

Scientific Understanding

That all sounds very exciting, useful, and practical. In the era of Big Data computing, machine learning, including deep learning, is a great tool. But is it only useful for businesses? Aren't we truth seekers, and not just utilitarians? Does it lead to a better understanding of life and the universe?

Let us say we did a great job of predicting happiness using statistical machine learning. Does this model tell us any new truths about happiness? How do these features really affect happiness? What are the underlying processes and cause-effect relationships? Did we merely capture statistical correlations and nothing more? Where are the psychological truths? Happiness is a difficult topic, entangled with social and political realities. Does our model teach us how to create better societies which enhance happiness?

Taking a more concrete and down-to-earth example, suppose we used genomic and proteomic data to predict a disease with a machine learning model such as a neural network. Even if it achieves high accuracy, does it tell us anything useful about the underlying biological pathways? Understanding how genes and proteins interact with each other, under the influence of epigenetic and environmental factors, is as difficult as unravelling the paradoxes of happiness.

Machine learning used as a black box therefore seems, at least at first glance, to be just a statistical tool devoid of truths.

The good news is that we can turn it into a tool for scientific understanding. This is one exciting area where the field can grow and mature, in both the applied and the theoretical sense! How can we interpret machine learning models? What insights can they give us about the problem at hand? How can they assist truth seekers and at the same time give something useful to the utilitarian?

We want to better interpret the machine learning models we build. This desire need not be only a scientific goal; it can be rooted in the pragmatic goals of business. A business does not want unexpected blunders and errors from its trained machine learning model, and it may opt for a simpler model, amenable to human interpretation, in order to avoid such errors. Once the number of features becomes large and we start employing very complex models, we lose understanding and therefore control, which can cause uneasiness among business leaders.
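As one small sketch of what interpretation can look like in the simplest case: a linear model's coefficients directly show how strongly each feature pushes the prediction (synthetic data and invented feature names; real interpretation requires far more care about correlated and rescaled features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["income", "num_friends", "job_satisfaction"]
X = rng.normal(size=(500, 3))
y = 0.5 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X, y)
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:+.2f}")  # sign and size hint at each feature's influence
```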

Interesting future work lies in exploring such high-dimensional feature spaces, the complexity of models and their simplification, feature interactions, the underlying dynamics of processes, and unexpected errors.

Therefore, we should resist the immediate temptation to dismiss machine learning as a mere statistical tool for practical business goals, standing in contrast with science, which tries to understand reality.

We should also remember that our scientific laws are themselves mathematical models. Newton's law of gravitation was a mathematical equation until it was superseded by Einstein's space-time curvature. Though it is only an amusing story, for Newton the training data consisted of an apple falling on his head, yet it was enough to build a great model that worked for both the apple and the Moon. 🙂 Quantum mechanics is a mathematical model, and we are still struggling to interpret it. Is it just a useful tool, or is it a truthful depiction of reality? We believe in these mathematical theories in a statistical sense, and that may not be too far from machine learning. We do experiments, which are like our test set, and we confirm the predictions of these models. This is the scientific method. But for truth our bar is the highest possible: we firmly demand 100% accuracy. Even a single violation sends us back in search of a better theory. Newtonian mechanics is approximate, quite useful in practice but not exact, and Einstein provided us with a better theory. And the search goes on.

Collaboration in Machine Learning

One effective way to make progress in machine learning will be through collaboration. One of the bottlenecks in machine learning is training time. As we bridge the gap between artificial intelligence and human intelligence, we will have to replicate what three billion years of evolution did for natural intelligence, where more complex models are built on top of simpler ones.

To aid this incremental and hierarchical improvement, successful models built in academia and open research labs can be released into the public domain. There could be an open-source initiative which keeps a repository of machine learning models. Like the evolution of natural intelligence in nature, artificial intelligence would then evolve over time through a community effort. We should be able to reuse, modify, and enhance artificial intelligence models built by others. Of course, this will require some effort on data standardization, pre-processing, and post-processing, but that should be a solvable technical problem.

That concludes this article. It is hoped that scientific machine learning, which aims to understand, and social machine learning, which aims to replicate evolution, will lead us to exciting new frontiers in the coming decades.