This post continues from the previous one, so please read Part-1 first.
In the previous post we examined and visualised the dataset containing thousands of Kobe Bryant’s shots throughout his career. We used Python and Tableau and drew the first insights from the data. We will now continue with an introduction to predictive modelling (and we will put those first insights to the test as well). We will do so by demonstrating classification with decision trees. For this, we will use the Scikit-Learn Machine Learning framework, once again with Jupyter Notebook. For visualisations we will use Seaborn and Graphviz.
We will use decision trees as an introduction for two reasons:
- They are easily interpretable, thus they are good instructive material.
- They can be used as an improved baseline to compare with more advanced models, e.g. Random Forests, Logistic Regression, Support Vector Machines, Neural Networks etc. Such advanced algorithms will be the subject of Part-3, the post to follow.
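To make this concrete, here is a minimal Scikit-Learn sketch of the kind of model we will build. The feature values and column choices below are invented for illustration and are not the actual dataset schema:

```python
# Minimal decision-tree sketch with Scikit-Learn on toy "shot" data.
# The rows below are made up; the real dataset has thousands of rows
# and many more columns.
from sklearn.tree import DecisionTreeClassifier

# Each row: [shot_distance_ft, period]; label: 1 = made, 0 = missed
X = [[1, 1], [3, 2], [5, 1], [22, 4], [24, 2], [26, 4]]
y = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# The tree learns a distance threshold separating the two groups:
# predicts 1 (made) for the short shot, 0 (missed) for the long one
print(clf.predict([[2, 3], [25, 1]]))
```

Because the toy labels are perfectly separable by distance, the fitted tree is tiny and fully interpretable, which is exactly why decision trees make good instructive material.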
In the posts to follow, we will compare this model with more advanced Machine Learning classification algorithms. We will first explore whether ensemble methods based on decision trees, such as Random Forests, and further optimisations, such as Extremely Randomised Trees, can produce better models for this particular problem. One interesting aspect of decision trees and their family of algorithms is that they naturally perform feature selection, as shown in this post.
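As a preview of that feature-selection property, tree ensembles in Scikit-Learn expose a `feature_importances_` attribute. A minimal sketch on invented data, where only the first feature actually drives the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)              # 200 toy samples, 2 features
y = (X[:, 0] > 0.5).astype(int)   # only the first feature matters

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, y)

# Importances sum to 1; the informative feature should dominate
print(forest.feature_importances_)
```

On the real dataset, this ranking gives a quick, model-driven cross-check of the intuitions we build by visual exploration.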
We will then explore discriminant algorithms such as Logistic Regression and Support Vector Machines. For these algorithms feature selection is not inherent, so we will also investigate dimensionality reduction techniques such as Principal Component Analysis.
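A minimal sketch of that combination, chaining PCA into Logistic Regression on invented data (the informative features are given a larger scale here so that the principal components retain them):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                  # 5 toy features
X[:, :2] *= 3.0                       # make the first two high-variance
y = (X[:, 0] + X[:, 1] > 3.0).astype(int)

# Reduce to 2 principal components, then fit a linear classifier on them
model = make_pipeline(PCA(n_components=2), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))              # training accuracy of the reduced model
```

The pipeline object keeps the reduction and the classifier together, so the same transformation is applied consistently at training and prediction time.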
Earlier this year Kobe Bryant, one of the most important players in the history of the NBA, ended his career. He played for a full 20 years with the LA Lakers. Bryant is the all-time top scorer of the legendary team and he is considered by many to be the second best shooting guard in the history of the NBA, behind Michael Jordan.
After his retirement, a dataset containing 20 years’ worth of Bryant’s shots was released on Kaggle. The challenge was to build a model that predicts whether Bryant would score each shot or not. The dataset is a good opportunity to have some fun and demonstrate in detail the data science techniques one can leverage end-to-end, from data exploration to model evaluation. Here you can see ten “impossible” shots of Kobe Bryant that are quite probably included in the dataset.
The aim of this blog is not to produce the most accurate predictive model. The competition has ended and we might in fact use some of the insights posted by the community. Rather, it is to enjoy and, in the process, showcase how to approach Machine Learning classification problems like this. This is the first part of a series of posts that will cover the end-to-end process.
For max fun, we will use Python (Pandas, Jupyter Notebook, Scikit-Learn) for data exploration, visualisation and predictive modelling, as well as Tableau for super fast exploratory visualisations. Let’s dive deep into the data! We will start with a Jupyter Notebook. All the details of what is involved at each step are written inline in markdown, so that the notebook reads seamlessly in the blog.
After familiarising ourselves with the basics of the dataset, let’s now proceed to a fast Tableau analysis in order to drill down into the dataset and make it as transparent as possible.
What we want next is to build an intuition of how predictive each feature is of the target variable ‘shot_made_flag’. Roughly speaking, if there is notable variation in the target variable’s distribution across the subsets defined by a feature’s different values, this could be an indication that the feature is predictive of the target variable. It is important to make three notes at this point:
- In principle, if a target variable’s distribution does not vary across the different values of a feature, this does not necessarily mean that the feature is not predictive. The distribution might vary within a subset of the dataset, while we are currently examining the entire set. This means that the feature may still turn out to be predictive, depending on the modelling algorithm and the subsets it creates in its process.
- On the other hand, it may happen that a target variable’s distribution varies across the values of a feature, yet the feature will not make its way into a good predictive model if it is dependent on or correlated with another feature. In that case, it would possibly just add to overfitting. As an example of dependent features, the area variables are mappings of the coordinates, as we showed earlier in the Jupyter notebook.
- We can investigate further the empirical intuition of a “notable variation” by running tests to determine its statistical significance.
If the above are difficult to comprehend at this point, don’t worry we will return to these aspects in this series of blogs.
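For reference, the per-feature breakdown we are about to do visually in Tableau can be sketched in pandas with a groupby. The few rows below are invented but mimic the dataset’s shape:

```python
import pandas as pd

# A tiny invented sample in the shape of the Kaggle dataset
df = pd.DataFrame({
    "shot_zone_range": ["Less Than 8 ft.", "Less Than 8 ft.",
                        "24+ ft.", "24+ ft.", "24+ ft."],
    "shot_made_flag": [1, 1, 0, 1, 0],
})

# Success ratio of the target variable per value of a feature
success = df.groupby("shot_zone_range")["shot_made_flag"].mean()
print(success)
```

Since ‘shot_made_flag’ is 0/1, the group mean is exactly the success ratio per bucket, which is what the dashboards below visualise.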
Now let’s examine how ‘shot_made_flag’ is distributed for each value of each feature. This is a rather exhaustive exploratory process which in reality can be shorter. Here, we want to show how we can leverage the tools in order to make the dataset completely transparent and build good intuition. We will return to assess these intuitions in retrospect, once we build and evaluate our predictive models, in subsequent parts.
In the first dashboard you can see all features that are dependent on the distance of the shot. Blue are the scored shots and red are failed attempts. More specifically, 1a) shows the success ratio per ‘shot_zone_range’ bucket as given in the dataset. Evidently, the target variable is unevenly distributed across the subsets defined by the range feature. Next, 1b) shows how many shots there are in each bucket in total.
In 2), the blue line signifies scored shots per ‘shot_distance’. We can conclude that the distance is measured in feet, and one can see the steep increase at the 22-23 ft. limit, where the 3-point line lies. The red line shows the number of failed shots. 3a) shows the success ratio for the 2- and 3-pointers. Finally, 3b) is the share of total 2- and 3-pointers attempted (‘shot_type’).
All the above features are correlated, which means that only one of the truly independent variables will most probably make it into the predictive model. Distance and the x/y coordinates are different representations of the independent variable; the rest are mappings of the distance.
In 4) it becomes evident that the target variable is distributed unevenly across the ‘combined_shot_types’ as well.
In 5) one can observe that the ‘action_type’ is a more fine grained categorisation of the ‘combined_shot_type’. Again the variability is notable, and so we expect that action type is a good candidate for the predictive model.
In 6) we have summarised the performance per ‘period’ of the game and the share of shots in each period. Here we see a more even distribution.
7) illustrates the calculated field ‘remaining_time’ until the end of the period.
8a) shows the performance against each ‘opponent’…
and 8b) ‘matchup’, which is the same as 8a) to the granularity of home and away.
Finally, 9) shows the performance for each ‘season’. It seems that Bryant’s performance started declining towards the end of his career, which is what one would expect to come naturally with age. Lastly, in 10), the performance in the regular season vs the playoffs does not seem to vary (‘playoffs’ flag = 1).
At this stage, we have a full understanding of our dataset and it is time to dive deep into the details of predictive modelling. If you made it this far, you may as well stay tuned for the second part of this long blog. Thanks.
An artificial neural network (NN for short) is a classifier. In supervised machine learning, classification is one of the most prominent problems. The aim is to sort objects into classes that are defined a priori (this terminology is not to be confused with classes in Object-Oriented programming). Classification has a broad domain of applications, for example:
- in image processing we may seek to distinguish images depicting different kinds (classes) of objects (e.g. cars, bikes, buildings etc),
- in natural language processing (NLP) we may seek to classify texts into categories (e.g. distinguish texts that talk about politics, sports, culture etc),
- in financial transactions processing we may seek to decide if a new transaction is legitimate or fraudulent.
The term “supervised” refers to the fact that the algorithm is previously trained with “tagged” examples for each category (i.e. examples whose classes are made known to the NN) so that it learns to classify new, unseen ones in the future. We will see how training a NN works in a bit.
In simple terms, a classifier accepts a number of inputs, which are called features and collectively describe an item to be classified (be it a picture, text, transaction or anything else, as discussed previously), and outputs the class it believes the item belongs to. For example, in an image recognition task, the features may be the array of pixels and their colors. In an NLP problem, the features are the words in a text. In finance, the features may be several properties of each transaction, such as the time of day, the cardholder’s name, the billing and shipping addresses, the amount, etc.
It is important to understand that we assume an underlying real relationship between the characteristics of an item and the class it belongs to. The goal of running a NN is: given a number of examples, try to come up with a function that resembles this real relationship. (Of course, you’ll say: you are geeks, you are better with functions than relationships!) This function is called the predictive model, or just the model, because it is a practical, simplified version of how items with certain features belong to certain classes in the real world. Get comfy with using the word “function” as it comes up quite often; it is a useful abstraction for the rest of the conversation (no maths involved). You might be interested to know that a big part of the work that Data Scientists do (the people who work on such problems) is to figure out exactly which features best describe the entities of the problem at hand, which is similar to saying which characteristics seem to distinguish items of one class from those of another. This process is called feature selection.
A NN is a structure used for classification. It consists of several interconnected components organized in layers. These components are called artificial neurons (ANs) but we often refer to them as units. Each unit is itself a classifier, only a simpler one whose ability to classify is limited when used for complex problems. It turns out that we can completely overcome the limitations of simple classifiers by interconnecting a number of them to form powerful NNs. Think of it as an example of the principle Unite and Lead.
This structure of a combination of inputs that go through the artificial neuron resembles the functionality of a physical neuron in the brain, thus the name. In the following picture the structure of a physical and an artificial neuron are compared. The AN is shown as two nodes to illustrate its internals: An AN combines the inputs and then applies what is called the activation function (depicted as an S-curve), but it is usually represented as one node, as above.
- The inputs of the AN correspond to the dendrites,
- the AN itself (sum + activation) to the body/nucleus and
- the output to the axon.
The analogy goes deeper, as neurons are known to provide the human brain with a “generic learning algorithm”: by re-wiring various types of sensory data to a brain region, the same region can learn to recognize different types of input. E.g. the brain region responsible for the sense of taste can learn to distinguish touch input after the appropriate sensory re-wiring. This has been confirmed experimentally on ferrets.
Similarly ANs organized in NNs provide a generic algorithm in principle capable of learning to distinguish any classes. So, going back to the example applications in the beginning of this answer, you can use the same NN principles to classify pictures, texts or transactions. For a better understanding, read on.
However, no matter how deep the analogies feel and how beautiful they are, bear in mind that NNs are just a bio-inspired algorithm. They don’t really model the brain, the functioning of which is extremely complicated and, to a high degree, unknown.
At this point you must be wondering what on earth an activation function is. In order to understand this, we need to recall what a NN tries to compute: an output function (the model) that takes an example described by its features as an input and outputs the likelihood that the example falls into each one of the classes. What the activation function does is take as input the weighted sum of the unit’s inputs and transform it into a form that can be used as a component of the output function. When multiple such components from all the ANs of the network are combined, the goal output function is constructed.
Historically the S-curve (aka the sigmoid function) has been used as the activation function in NNs, in which case we are talking about Logistic Regression units (although better functions are now known). This choice relates to yet another biologically inspired analogy. Before explaining it, let’s see first how it looks (think of it as what happens when you can’t get the temperature in the shower right: first it’s too cold despite larger adjustment attempts and then it quickly turns too hot with smaller adjustment attempts):
Now the bio analogy: brain neurons activate more frequently as their electrical input stimulus increases. The relationship between the activation frequency and the input voltage is an S-curve. The S-curve is more pervasive in nature than just that, however: it is the curve of all kinds of phase transitions.
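Written out, the S-curve is the logistic (sigmoid) function; a minimal Python version:

```python
import math

def sigmoid(x):
    """Logistic activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Near 0 for large negative inputs, exactly 0.5 at 0,
# near 1 for large positive inputs - the shower-temperature curve
for x in (-6, -1, 0, 1, 6):
    print(x, round(sigmoid(x), 3))
```

The flat tails and the steep middle are exactly the "too cold, then suddenly too hot" behaviour described above.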
A typical NN is structured in layers, as follows:
- Its input layer consists of a number of units that depends on the number of input features. Features are engineered to describe the class instances, be they images, texts, transactions etc., depending on the application. For example, in an image recognition task, the features may be the array of pixels and their colors.
- Its output layer often consists of a number of units equal to the number of classes in the problem. When given a new, unseen example, each unit of the output layer assigns a probability that this example belongs to its particular class, based on the training.
- Between the input and output layers, there may be several hidden layers (for reasons briefly described next), but for many problems one or two hidden layers are enough.
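To make the layered structure concrete, here is a minimal forward pass through one hidden layer. The weights are arbitrary numbers chosen for illustration; training would learn them:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(features, hidden_w, output_w):
    """One forward pass: input layer -> hidden layer -> output unit.

    Each unit sums its weighted inputs and applies the activation.
    """
    hidden = [sigmoid(sum(w * f for w, f in zip(ws, features)))
              for ws in hidden_w]
    return sigmoid(sum(w * h for w, h in zip(output_w, hidden)))

# 2 input features, 2 hidden units, 1 output unit; weights are invented
p = forward([0.5, -1.0],
            hidden_w=[[1.0, -2.0], [-1.5, 0.5]],
            output_w=[2.0, -1.0])
print(p)  # a probability-like score in (0, 1)
```

Every unit repeats the same small computation (weighted sum, then activation); the expressive power comes from stacking and interconnecting them.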
Training is often done with the Back Propagation algorithm. During BackProp, the NN is fed with examples of all classes. As mentioned, the training examples are said to be “tagged”, meaning that the NN is given both the example (as described by its features) and the class it really belongs to. Given many such training examples, the NN constructs, during training, what we know by now as the model, i.e. a probabilistic mapping of certain features (input) to classes (output). The model is reflected in the weights of the units’ connectors (see previous figure); BackProp’s job is to compute these weights. Based on the constructed model, the NN will classify new untagged examples (i.e. instances that it has not seen during training), that is, it will predict the probability of a new example belonging to each class. Therefore there are fundamentally two distinct phases:
- During training, the NN is fed with several tagged examples from which it constructs its model.
- During testing, the NN classifies new, unknown instances into the known classes, based on the constructed model.
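The two phases can be sketched with a single logistic unit trained by gradient descent on invented, tagged one-feature examples; this is a stand-in for full BackProp on a network, but the weight-adjustment idea is the same:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training phase: tagged examples (feature value, true class)
examples = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, lr = 0.0, 0.0, 0.5

for _ in range(200):                 # epochs over the training set
    for x, target in examples:
        p = sigmoid(w * x + b)       # current prediction
        err = p - target             # gradient of the log-loss wrt the sum
        w -= lr * err * x            # adjust the connection weight
        b -= lr * err                # adjust the bias

# Testing phase: classify new, untagged examples with the learned model
for x in (-1.5, 1.5):
    print(x, sigmoid(w * x + b))
```

After training, the learned weight and bias constitute the model; the testing phase only applies them, it no longer changes them.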
NNs with multiple layers of perceptrons (deep neural networks) are powerful classifiers, in that we can use them to model very complex, non-linear classification patterns of instances that may be described by potentially several thousands of features. Depending on the application, such patterns may or may not be detectable by humans (e.g. the human brain is very good at image recognition but is not effective in tasks such as making predictions by generalizing historical data in complex, dynamic contexts).
I originally wrote this piece as a response to a question on Quora, which proved the most popular response of that thread, so I decided to re-blog it here. Link to my original response on Quora.