Decision trees work by splitting the sample space into purer segments. The splitting criterion we will use is entropy (although Gini impurity is also common). Entropy is defined as follows:

$$H(C) = - P(c_1)\cdot\log P(c_1) - P(c_2)\cdot\log P(c_2) - \ldots$$

where $P(c_1)$ is the probability of class $c_1$ in the segment, and so on. In a problem with two classes, a pure segment (containing only class $c_1$ instances or only class $c_2$ instances) has entropy 0.
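To make this concrete, here is a minimal sketch (not part of the original notebook) of how the entropy of a segment could be computed from its class labels:

```
import numpy as np

def segment_entropy(labels):
    """Shannon entropy (base 2) of a segment, from its class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(segment_entropy([0, 0, 1, 1]))  # 50/50 segment -> 1.0, the two-class maximum
print(segment_entropy([1, 1, 1, 1]))  # pure segment has zero entropy
```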

In this post we will perform the following operations in the context of decision trees:

- Pre-processing, which consists of
  - data cleansing,
  - feature engineering,
  - standardisation (of a subset of features) and
  - dummification of categorical features.
- Predictive modelling:
  - We will start by determining the complexity of the tree, to avoid both overfitting and high bias.
  - We will use an initial rough method to narrow the area down and
  - then perform k-fold cross-validation to pick the optimal complexity.
  - Next we will train, test and evaluate the model across the most typical metrics, i.e. accuracy, precision, recall and $F_1$-score.
- Visualisation:
  - We will visualise the bias/variance sweet spot,
  - of course the tree itself and
  - the feature space divisions that it creates.

Please right click on the images to see them in full size.

Let's dive into the data.

In [27]:

```
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
%matplotlib inline
filename = "data.csv"
raw = pd.read_csv(filename)
```

In [28]:

```
df = raw.dropna()
df = df.reset_index(drop=True)  # re-index after dropping rows
print(df.shape)
df.isnull().values.any()
```

Out[28]:

Let's start by examining the temporal features. They are relevant as follows:

- `'game_date'` may capture performance variability based on the point in a season (e.g. the month). We break `'game_date'` into month and year.
- `'season'` (and the year component of `'game_date'`) may capture the effect of ageing on the player's performance (see Part 1).
- Finally, we will use `'time_remaining'` (in seconds) and drop `'seconds_remaining'` and `'minutes_remaining'`.

`'shot_id'` and `'game_id'` may be useful for time series analysis, but for now we can drop them. `'team_name'` is always LA Lakers, so we can go ahead and drop it from the working dataframe.

In [29]:

```
df.loc[:, 'month'] = df['game_date'].str[5:7].astype(int)
df.loc[:, 'year'] = df['game_date'].str[0:4].astype(int)
# Combine the two time columns into seconds remaining in the period
df.loc[:, 'time_remaining'] = df.minutes_remaining * 60 + df.seconds_remaining
df = df.drop(['game_date', 'team_name', 'shot_id', 'game_id', 'game_event_id',
              'season', 'seconds_remaining', 'minutes_remaining'], axis=1)
```

The opponent information in `'matchup'` is already in `'opponent'`. However, we also want to capture the home/away property. We will create a series with 1s where *'vs.'* appears in the field (apparently corresponding to home games) and 0s elsewhere (*'@'* apparently corresponding to away games). We will then drop `'matchup'` and append the binary series:

In [30]:

```
# 1 for home games ('vs.' in the matchup), 0 for away games ('@')
df['home'] = Series([1 if 'vs.' in x else 0 for x in df['matchup']], index=df.index)
df = df.drop('matchup', axis=1)
```

Why are there both `'loc_x'`, `'loc_y'` and `'lon'`, `'lat'` in the dataset? We will now check whether they are redundant, "mirrored" data, as we suspect from the visualisations of Part 1. A simple way to do this is to compute the correlations between `'loc_x'` and `'lon'`, and between `'loc_y'` and `'lat'`, respectively.

In [31]:

```
A = df[['lon', 'lat', 'loc_x', 'loc_y']].to_numpy()
print(np.corrcoef(A[:, 0], A[:, 2]))  # lon vs loc_x
print(np.corrcoef(A[:, 1], A[:, 3]))  # lat vs loc_y
```

In [32]:

```
import seaborn as sns
pal = ['Crimson', 'Dodgerblue']
sns.pairplot(df, vars=['loc_x', 'loc_y', 'lat', 'lon', 'shot_distance'],
             hue='shot_made_flag', height=3, palette=pal)
plt.show()
```

`'loc_x'` contains the same information as `'lon'`, and `'loc_y'` the same as `'lat'`. We will drop `'lon'` and `'lat'` and check the schema of the dataset before going further.

In [33]:

```
df = df.drop(['lon', 'lat'], axis=1)
df.info()
df.isnull().values.any()
```

Out[33]:

**Good Data Science Note:** Strictly speaking, we should draw these insights by looking only at the training set, not the entire set. By looking at the entire set we are slightly "cheating".

In fact, `'shot_distance'` and an angle feature are sufficient to represent the spatial properties of a shot. The angle information is captured by either of the two coordinate pairs; the player may shoot better from certain angles, may shoot better with one hand than the other, etc. Of course, retaining both the coordinates and the distance feature induces redundancy, but entropy-based algorithms have an inherent feature-selection property.
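As an aside, an angle feature could also be derived directly with `np.arctan2`, which is well-defined even when a coordinate is zero. A minimal sketch of this alternative (illustrative only; it is not what we do below, where a standardised ratio is used instead):

```
# Hypothetical alternative: shot angle (in radians) relative to the basket location
angle_alt = np.arctan2(df['loc_y'], df['loc_x'])
```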

At this stage, before performing further pre-processing, it is better to split the dataset into training and test subsets, in order to avoid information from the test set (which in the real world is completely unknown) "leaking" into our training process.

**Good Data Science Note:** Any pre-processing applied to the training set will be applied to the test set as well, but separately.

We will now isolate the target variable vector from the features and split the dataset into an 80% training set and a 20% test set.

In [34]:

```
from sklearn.model_selection import train_test_split

y = np.array(df['shot_made_flag']).astype(int)
X = df.drop('shot_made_flag', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)
```

For the purposes of a tree classifier, feature standardisation is not required: trees do not rely on any notion of distance, only on class purity. In addition, we chose a tree classifier as our first model because of its interpretability, and standardising the features would compromise that interpretability.

For the purposes of creating an angle feature, though, we will standardise `'loc_x'` and `'loc_y'`, to avoid the exact zero values in the raw coordinates that would result in divisions by zero. With standardisation, each feature is rescaled to zero mean and unit standard deviation.
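Concretely, each value is transformed as

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the feature, estimated on the training set.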

**Good Data Science Note:** Once again, this processing must be done on the training and test sets separately.

**Technical Note:** Scikit's `StandardScaler` provides the facilities to transform the training and test sets consistently.

In [35]:

```
from sklearn import preprocessing
A = X_train[['loc_x', 'loc_y']].copy()
scaler = preprocessing.StandardScaler().fit(A)
A_st = scaler.transform(A)
print(A_st.mean(axis=0), A_st.std(axis=0)) #Note the mean of zero and std of 1 for the two columns
X_train.loc[:, 'angle'] = A_st[:,0]/A_st[:,1]
```

Next, we apply the same transformation to the test set, using the same `StandardScaler` object.

In [36]:

```
B = X_test[['loc_x', 'loc_y']]
B_st = scaler.transform(B)
X_test.loc[:,'angle'] = B_st[:,0]/B_st[:,1]
```

We can now drop `'loc_x'` and `'loc_y'` from both subsets.

In [37]:

```
X_train = X_train.drop(['loc_x', 'loc_y'], axis=1)
X_test = X_test.drop(['loc_x', 'loc_y'], axis=1)
X_test.info()
X_test.isnull().values.any()
```

Out[37]:

In order to use Scikit's classifiers, we need to convert the categorical fields. We can do this with Pandas' `get_dummies()`.

**Technical note:** Alternatively, Scikit's `OneHotEncoder` or `LabelBinarizer` can be used.
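A minimal sketch of the `OneHotEncoder` route (an alternative to what we do below; `cat_cols` stands for a hypothetical list of the categorical column names):

```
from sklearn.preprocessing import OneHotEncoder

# Fit on the training categories only, then reuse the fitted encoder on the test set;
# handle_unknown='ignore' maps unseen test categories to an all-zero row.
enc = OneHotEncoder(handle_unknown='ignore')
train_dummies = enc.fit_transform(X_train[cat_cols])
test_dummies = enc.transform(X_test[cat_cols])
```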

Let's see how the Pandas method works. We will first get the list of all categorical columns (they are of type `object`). We can do this with `select_dtypes()`.

In [38]:

```
cat = df.select_dtypes([object]).columns
cat
```

Out[38]:

What is interesting here is that, in the general case, as mentioned before, we should not look into the test set at all when pre-processing. Thus:

- Encoding the categorical data should be done based on the training set. At this point, the schema of the training set is finalised.
- Then, separately, as if we knew nothing about it beforehand, we encode the categorical data in the test set.
- With NumPy's `setdiff1d()` we can then determine which dummified columns are missing from the test set compared to the training set, and add them with 0 values across all rows.
- Finally, we select from the test set the columns that are in the training set, in that same order.

In [39]:

```
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
# Get the columns that are in training but not in test (and vice versa)
not_in_test = np.setdiff1d(X_train.columns, X_test.columns)
print(not_in_test)
not_in_train = np.setdiff1d(X_test.columns, X_train.columns)
print(not_in_train)
for c in not_in_test:  # Add these columns to test, setting them to zero
    X_test[c] = 0
X_test = X_test[X_train.columns]  # Make test match the training schema
```

We will now try to see what a reasonable complexity for our model is. A model that is too complex overfits, while one that is too simple is not predictive. What is the sweet spot? We will train tree classifiers of various complexities and visualise their performance on the training and test sets.

To do so, we will use the `min_samples_leaf` parameter of `DecisionTreeClassifier`, which controls the minimum number of samples present at each leaf. The smaller this number is, *the more complex* the corresponding model. We also use balanced class weights to prevent bias towards the dominant class.
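In scikit-learn, `class_weight='balanced'` weights each class inversely to its frequency, $w_c = \frac{n}{k \cdot n_c}$ for $n$ samples, $k$ classes and $n_c$ samples of class $c$, so the minority class is not drowned out when split purity is measured.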

In [40]:

```
from sklearn import tree
from sklearn import metrics

r = np.arange(1, 10000, 500)
a = []
b = []
print(X_test.shape, X_train.shape, y_test.shape, y_train.shape)
for i in r:
    clf = tree.DecisionTreeClassifier(min_samples_leaf=i, class_weight='balanced')
    clf = clf.fit(X_train, y_train)
    a.append(clf.score(X_test, y_test))
    # Scoring on the training data is done solely to visualise the overfitting
    # effect; evaluating against training data is not part of the analysis.
    b.append(clf.score(X_train, y_train))
plt.figure(figsize=(8, 5))
plt.plot(r, a)
plt.plot(r, b)
# Retroactively marked for visualisation purposes:
plt.axvspan(0, 500, color='r', alpha=0.1, lw=0)
plt.axvspan(500, 1550, color='g', alpha=0.1, lw=0)
# ---
plt.xlim(0, 10000)
plt.ylim(0.5, 1)
plt.xlabel('Min samples in leaf (inverse complexity)')
plt.ylabel('Accuracy')
plt.show()
```

**Good Data Science Note:** In reality you should never evaluate the performance of a model on training data. Here it is not part of the "real" analysis; it is done only to demonstrate the effect of overfitting and how accuracy varies with the model's complexity. See below.

- The green line represents performance on the training set and the blue one on the test set. Note that for small minimum numbers of samples per leaf (high complexity), the model performs very well on the training set and very poorly on the test set (red area). This is indicative of overfitting.
- At some point the two lines converge, and the sweet spot seems to be in the area of 500 to 1500 samples per leaf (green area).
- Beyond that, performance diminishes as the model becomes simplistic (high bias).

This is the expected behaviour. This rough process is mostly for instructional purposes, but it also helps narrow down an area of reasonable complexity, in which we can then perform cross-validation, a more expensive and accurate process, in order to pick an optimal value.

In [41]:

```
r = np.arange(250, 5000, 250)
r
```

Out[41]:

We will use the $F_1$ score to evaluate the models.

**Good Data Science Note:** $F_1$ is preferred over accuracy as a model evaluation metric. Accuracy is not informative of the false positive and false negative rates, which may not be of equal importance for a particular application, and it may be misleading in the case of very unbalanced classes (not the case here).

**Technical Note:** If you nevertheless want to use accuracy, declare it in the `scoring` parameter of `cross_val_score`. Scikit-learn provides an array of alternative metrics.
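For instance (a sketch only; the leaf size of 1000 is an arbitrary illustrative value):

```
from sklearn.model_selection import cross_val_score

clf = tree.DecisionTreeClassifier(min_samples_leaf=1000, class_weight='balanced')
print(cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy').mean())
```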

In [42]:

```
from sklearn.model_selection import cross_val_score

scores = []
for i in r:
    clf = tree.DecisionTreeClassifier(min_samples_leaf=i, class_weight='balanced')
    m = cross_val_score(clf, X_train, y_train, cv=10, scoring='f1')
    scores.append((i, m.mean(), m.var()))
scores
```

Out[42]:

In [43]:

```
clf = tree.DecisionTreeClassifier(criterion='entropy', min_samples_leaf=1750, class_weight='balanced')
clf = clf.fit(X_train, y_train)
```

In [44]:

```
with open("kobe3.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f, feature_names=X_train.columns,
                         class_names=True, filled=True, rounded=True,
                         special_characters=True)
```
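Assuming Graphviz is installed, the `.dot` file can then be rendered to an image from the command line, e.g. `dot -Tpng kobe3.dot -o kobe3.png`.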

Note that the tree is rooted at `'action_type'` (`jump shot`). If you go back to Part 1 of this analysis (previous post, Tableau graphs 4 and 5), you will notice that we were expecting this feature to make it into the predictive model, because the target variable varied a lot depending on it.

Also, note that the jump shot is the dominant type of action. Combined, these two facts mean that it creates divisions that maximise purity, so it is natural that it is the first predictive feature the model picks up. Single-feature trees (a.k.a. decision stumps) are often used as baseline classifiers; `'action type jump shot'` would be the feature for such a stump (see the sketch below).
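As an illustrative baseline (not part of the original analysis), a stump can be fit by capping the tree at a single split, which here would naturally land on the jump-shot dummy:

```
# Hypothetical baseline: a decision stump, i.e. a tree with exactly one split
stump = tree.DecisionTreeClassifier(max_depth=1, class_weight='balanced')
stump.fit(X_train, y_train)
print(stump.score(X_test, y_test))  # accuracy of the one-split baseline
```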

For similar reasons, `'shot_distance'` and `'season'` (`'year'` here) were also expected to rank high in the model. On the other hand, `'time_remaining'` was not that apparently predictive from the start.

It is interesting to see how the tree divides the feature space. A typical visualisation is a pairwise feature comparison. In our case, we have many categorical and binary features, which does not help the visualisation, so we will pick two significant numerical features, `'time_remaining'` and `'shot_distance'`, and compare how the model complexity affects the areas that the tree creates.

For visualisation purposes:

- We will train a tree classifier for various model complexities as before (specifically, a minimum of 250-1750 samples at each leaf), only this time with these two features alone.
- We will plot the predicted areas with a contour: In the blue areas, the tree predicts that shot will fail, while in the yellow ones that the shot will succeed.
- We will also plot the test dataset on top: Red points are actual missed shots and black ones are scored.
- The subplots represent gradually simpler models. Note how the boundary becomes gradually simpler. For small minimum numbers per leaf, the tree overfits.

In [45]:

```
Xviztr = DataFrame()
Xviztst = DataFrame()
Xviztr['time_remaining'] = X_train['time_remaining']
Xviztr['shot_distance'] = X_train['shot_distance']
Xviztst['time_remaining'] = X_test['time_remaining']
Xviztst['shot_distance'] = X_test['shot_distance']
t_min, t_max = -5, 714
d_min, d_max = -0.5, 74
tt, dd = np.meshgrid(np.arange(t_min, t_max, 1), np.arange(d_min, d_max, 1))
colors = np.where(y_test == 1, "black", "red")

def classgraph(n, i):
    sp = plt.subplot(3, 2, i)
    sp.set_title(n)
    plt.xlabel('time remaining')
    plt.ylabel('distance')
    clfviz = tree.DecisionTreeClassifier(min_samples_leaf=n, class_weight='balanced')
    clfviz.fit(Xviztr, y_train)
    # Predict over the grid; np.c_ stacks the flattened coordinates column-wise
    Z = clfviz.predict(np.c_[tt.ravel(), dd.ravel()])
    Z = Z.reshape(tt.shape)
    cs = plt.contourf(tt, dd, Z, cmap=plt.cm.Paired)
    plt.scatter(Xviztst['time_remaining'], Xviztst['shot_distance'],
                c=colors, s=2, alpha=1, lw=0)

leaf = [250, 500, 750, 1000, 1500, 1750]
plt.figure(figsize=(10, 15))
for k in leaf:
    classgraph(k, leaf.index(k) + 1)
```

In [46]:

```
y_predict = clf.predict(X_test)
y_predict.shape
```

Out[46]:

In [47]:

```
metrics.accuracy_score(y_test, y_predict, normalize=True)
```

Out[47]:

In [48]:

```
no = np.where(y_test == 0)[0].size
yes = np.where(y_test == 1)[0].size
print(no, yes)
```

In [49]:

```
no / (no + yes)  # fraction of missed shots in the test set: the accuracy of always predicting a miss
```

Out[49]:

In [50]:

```
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predict)
cm
```

Out[50]:

In [51]:

```
plt.figure(figsize=(4,4))
plt.title('Confusion matrix')
ax = sns.heatmap(cm, cmap=plt.cm.Blues, annot=True, fmt='d', square=True, linewidths=.5)
ax.set_xlabel('Predicted class')
ax.set_ylabel('Real class')
```

Out[51]:

Now let's examine the precision, recall and $F_1$ score:
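For reference, in terms of the confusion-matrix counts:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision} + \text{recall}}$$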

In [52]:

```
metrx = {'precision': metrics.precision_score(y_test, y_predict),
         'recall': metrics.recall_score(y_test, y_predict),
         'f1': metrics.f1_score(y_test, y_predict)}
print(metrx)
```

The high number of false negatives drives the recall down, since recall $= TP/(TP+FN)$.