Earlier this year Kobe Bryant, one of the most important players in the history of the NBA, ended his career. He played for a full 20 years with the LA Lakers. Bryant is the all-time top scorer of the legendary team, and many consider him the second-best shooting guard in NBA history, behind Michael Jordan.

After his retirement, a dataset containing 20 years’ worth of Bryant’s shots was released by Kaggle. The challenge was to build a model that predicts whether Bryant would score each shot. The dataset is a good opportunity to have some fun and demonstrate in detail the data science techniques one can leverage end-to-end, from data exploration to model evaluation. Here you can see ten “impossible” shots of Kobe Bryant that are quite probably included in the dataset.

The aim of this blog is not to produce the most accurate predictive model. The competition has ended, and we might in fact use some of the insights posted by the community. Rather, the aim is to have fun and, in the process, showcase how to approach Machine Learning classification problems like this one. This is the first part of a series of posts that will cover the end-to-end process.

For max fun, we will use Python (pandas, Jupyter Notebook, scikit-learn) for data exploration, visualisation and predictive modelling, as well as Tableau for super fast exploratory visualisations. Let’s dive deep into the data! We will start with a Jupyter Notebook. The details of each step are written as Markdown cells inline, so that the notebook reads seamlessly in the blog.

Having familiarised ourselves with the basics of the dataset, let’s proceed to a fast Tableau analysis in order to dig deeper into the dataset and make it as transparent as possible.

What we want next is to build an intuition of how predictive each feature is of the target variable ‘shot_made_flag’. Roughly speaking, if the distribution of the target variable varies notably across the subsets defined by a feature’s different values, this could be an indication that the feature is predictive of the target variable. It is important to make three notes at this point:

- In principle, if the target variable’s distribution does not vary across the values of a feature, this does not necessarily mean that the feature is not predictive. We are currently examining the entire set, and the distribution might still vary within a subset of it. The feature may therefore turn out to be predictive, depending on the modelling algorithm and the subsets it creates in its process.
- On the other hand, the target variable’s distribution may vary across the values of a feature and yet the feature may not make its way into a good predictive model, because it depends on or is correlated with another feature. In that case, it would possibly just add to overfitting. As an example of dependent features, the area variables are mappings of the coordinates, as we showed earlier in the Jupyter notebook.
- We can put the empirical notion of a “notable variation” on firmer ground by running tests to determine its statistical significance.
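As a rough illustration of the third note, one option is a chi-square test of independence between a categorical feature and the target. The sketch below uses made-up numbers rather than the real dataset, and choosing `scipy.stats.chi2_contingency` is our assumption here, not necessarily the test used later in this series.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for the real shots table (all values are made up).
shots = pd.DataFrame({
    "shot_type": ["2PT"] * 60 + ["3PT"] * 40,
    "shot_made_flag": [1] * 30 + [0] * 30 + [1] * 10 + [0] * 30,
})

# Contingency table: one row per feature value, one column per target value.
table = pd.crosstab(shots["shot_type"], shots["shot_made_flag"])

# Chi-square test of independence between the feature and the target.
# For a 2x2 table SciPy applies Yates' continuity correction by default.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value suggests that the variation across the feature’s values is unlikely to be due to chance alone. It says nothing about whether the feature adds information beyond the other features, which is the point of the second note.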

If the above is difficult to follow at this point, don’t worry: we will return to these aspects later in this series of blogs.

Now let’s examine how ‘shot_made_flag’ is distributed for each value of each feature. This is a rather exhaustive exploratory process, which in practice can be shorter. Here, we want to show how we can leverage the tools in order to make the dataset completely transparent and build good intuition. We will return to assess these intuitions in retrospect, once we build and evaluate our predictive models in subsequent parts.
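The dashboards in this post were built in Tableau, but the same per-value breakdown can be approximated in pandas. The following is a minimal sketch on made-up rows; the helper name `success_by_value` is hypothetical and only serves the illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the dataset described in the post.
shots = pd.DataFrame({
    "period": [1, 1, 2, 2, 3, 4, 4, 4],
    "combined_shot_type": ["Jump Shot", "Dunk", "Jump Shot", "Layup",
                           "Jump Shot", "Dunk", "Jump Shot", "Layup"],
    "shot_made_flag": [0, 1, 1, 1, 0, 1, 0, 0],
})


def success_by_value(df, feature, target="shot_made_flag"):
    """Return the number of attempts and the success ratio per feature value."""
    summary = df.groupby(feature)[target].agg(attempts="count", success_ratio="mean")
    return summary.sort_values("success_ratio", ascending=False)


# Print one summary table per feature of interest.
for feature in ["period", "combined_shot_type"]:
    print(success_by_value(shots, feature), end="\n\n")
```

Reporting the number of attempts next to the ratio matters, because a ratio computed on a handful of shots is far less reliable than one computed on thousands.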

In the first dashboard you can see all features that are dependent on the distance of the shot. Blue are the scored shots and red are failed attempts. More specifically, 1a) shows the success ratio per ‘shot_zone_range’ bucket as given in the dataset. Evidently, the target variable is unevenly distributed across the subsets defined by the range feature. Next, 1b) shows how many shots there are in each bucket in total.

In 2) the blue line shows the number of scored shots per ‘shot_distance’ and the red line shows the number of failed shots. We can conclude that the distance is in feet, as one can see a steep increase at the 22-23 ft. mark, where the three-point line lies. 3a) shows the success ratio for 2 and 3 pointers. Finally, 3b) is the share of total 2 and 3 pointers attempted (‘shot_type’).

All the above features are correlated, which means that most probably only one of the truly independent variables will make it into the predictive model. Distance and x/y coordinates are different representations of the same independent variable. The rest are mappings of the distance.
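One way to check this kind of redundancy is to derive the distance from the coordinates and compare it with the distance column. The sketch below uses made-up rows; the column names `loc_x` and `loc_y`, and the assumption that the coordinates are in tenths of a foot measured from the basket, are ours and should be verified against the actual data.

```python
import numpy as np
import pandas as pd

# Made-up rows. Column names and units (tenths of a foot) are assumptions.
shots = pd.DataFrame({
    "loc_x": [0, 30, -120, 150, -220, 0],
    "loc_y": [0, 40, 50, 200, 100, 260],
    "shot_distance": [0, 5, 13, 25, 24, 26],
})

# Euclidean distance from the basket, converted from tenths of a foot to feet.
shots["derived_distance"] = np.hypot(shots["loc_x"], shots["loc_y"]) / 10

# A correlation close to 1 indicates that the two columns carry the same information.
corr = shots["derived_distance"].corr(shots["shot_distance"])
max_gap = (shots["derived_distance"] - shots["shot_distance"]).abs().max()

print(f"correlation={corr:.4f}, largest gap={max_gap:.2f} ft")
```

If the correlation is close to 1 on the real data, keeping both the coordinates and the distance adds little beyond what one of them already provides.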

In 4) it becomes evident that the target variable is distributed unevenly across the values of ‘combined_shot_type’ as well.

In 5) one can observe that ‘action_type’ is a more fine-grained categorisation of ‘combined_shot_type’. Again the variability is notable, and so we expect that the action type is a good candidate for the predictive model.

In 6) we have summarised the performance per ‘period’ of the game and the share of shots in each period. Here we see a more even distribution.

7) illustrates the calculated field ‘remaining_time’ until the end of the period.
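The calculated field lives in Tableau and its exact definition is not shown here. One plausible way to reproduce it in pandas, assuming the dataset has `minutes_remaining` and `seconds_remaining` columns (an assumption on our part), is to combine the two into seconds left in the period:

```python
import pandas as pd

# Made-up rows; the two input column names are assumptions about the dataset.
shots = pd.DataFrame({
    "minutes_remaining": [11, 5, 0, 0],
    "seconds_remaining": [30, 0, 45, 2],
    "shot_made_flag": [1, 0, 1, 0],
})

# Seconds left until the end of the period.
shots["remaining_time"] = shots["minutes_remaining"] * 60 + shots["seconds_remaining"]

print(shots[["remaining_time", "shot_made_flag"]])
```

A single numeric field like this is easier to plot and to bucket than two separate columns.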

8a) shows the performance against each ‘opponent’.

8b) shows the performance per ‘matchup’, which is the same as 8a) but at the granularity of home and away games.
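The home and away split can also be made explicit as a feature of its own. The sketch below assumes that ‘matchup’ strings look like "LAL vs. POR" for home games and "LAL @ POR" for away games; that format is an assumption and should be checked against the data before relying on it.

```python
import pandas as pd

# Made-up rows; the "vs." / "@" convention in `matchup` is an assumption.
shots = pd.DataFrame({
    "matchup": ["LAL @ POR", "LAL vs. UTA", "LAL @ UTA", "LAL vs. POR"],
    "shot_made_flag": [1, 0, 0, 1],
})

# True for home games, False for away games (literal match, not a regex).
shots["home_game"] = shots["matchup"].str.contains("vs.", regex=False)

# Success ratio at home versus away.
home_away = shots.groupby("home_game")["shot_made_flag"].mean()

print(home_away)
```

Together with ‘opponent’, a flag like this carries the same information as ‘matchup’ in a more compact form.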

Next, 9) shows the performance for each ‘season’. It seems that Bryant’s performance started declining towards the end of his career, which is what one would naturally expect with age. Finally, in 10) the performance in the regular season does not seem to differ from the performance in the playoffs (‘playoffs’ flag = 1).

At this stage, we have a thorough understanding of our dataset and it is time to dive deep into the details of predictive modelling. If you made it this far, you may as well stay tuned for the second part of this long blog. Thanks.