
A decision tree is one of the simplest and most understandable classification algorithms. Alone, a single decision tree is a relatively weak learner that will be prone to overfitting. But decision trees are the building blocks of random forests, a powerful algorithm that is very difficult to overfit. To better understand random forests I wanted to better understand decision trees. So I coded them up from scratch. The following trees are for classifying continuous data.

Decision trees can be thought of as asking a series of questions in order to split data into smaller and smaller groups. For example, let's say we have a bunch of data on people's Google search history and whether or not they voted, and we want to build a model that predicts whether or not someone will vote. Looking at the data, we may think of several questions to ask, such as whether or not a person googled for their polling location.

If we ask that question, we'll get two subsets of data: folks who have searched for their polling location and folks who haven't. We can continue to ask questions, splitting our dataset into smaller and smaller subgroups. Our goal is to get perfectly pure subgroups, that is groups where everyone belongs to the same class. They either voted or they didn't. We then ask these same questions in the same order on new data to make predictions.
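
As a toy sketch of that idea (the records and field names below are invented to mirror the voting example), a single question splits the data into two subgroups, and we check how pure each one is:

# Hypothetical records for the voting example.
people = [
    {'searched_polling_location': True,  'voted': True},
    {'searched_polling_location': True,  'voted': True},
    {'searched_polling_location': False, 'voted': False},
    {'searched_polling_location': False, 'voted': True},
]

# Ask one question: did this person google their polling location?
searched = [p for p in people if p['searched_polling_location']]
did_not_search = [p for p in people if not p['searched_polling_location']]

print([p['voted'] for p in searched])        # [True, True] -- a pure subgroup
print([p['voted'] for p in did_not_search])  # [False, True] -- still mixed; ask another question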

A very simple decision tree. Each point where we ask a question is called a node. Each point where we've reached a pure group is called a leaf.

Uncertainty and impurity

The foundation of a decision tree is how we measure the purity of our subgroups. We quantify purity by measuring  uncertainty. For example, the worst case scenario is an even split: half of our subgroup belonging to one class and half belonging to the other. This case yields maximum uncertainty because it's a toss up. With equal proportions, we can't say which class is more likely. If a subgroup contains all the same class, there is no uncertainty.

One measure of the level of uncertainty in a subgroup is entropy. The entropy of a dataset, $S$, is as follows:

$$E(S) = \sum_{i=1}^{n} -p_i \log_2(p_i)$$ with the probabilities, $p_i$, calculated by simply counting the frequency of each class in the dataset. By convention, a term with $p_i = 0$ contributes zero, since $\log(0)$ is undefined.

The Counter object from the collections library makes this very simple in Python:

from collections import Counter
from math import log

def entropy(labels):
    """Calculates the entropy for a given set of labels."""
    probs = [freq/len(labels) for freq in Counter(labels).values()]
    return sum(-p*log(p, 2) if p else 0 for p in probs)

An alternative measure of uncertainty is Gini impurity. Gini impurity is computationally cheaper than entropy, since it avoids the logarithm, and yields comparable results.

$$G(S) = \sum_{i=1}^{n} p_i(1-p_i)$$

And in Python:

def gini(labels):
    """Calculates the gini impurity for a given set of labels"""
    probs = [freq/len(labels) for freq in Counter(labels).values()]
    return sum(p*(1-p) for p in probs)

For both measures, minimum uncertainty comes when a class's proportion is either 0% or 100%.

Entropy and Gini impurity as a function of class proportion.

>>> entropy([True, True, True, True])
0.0

>>> entropy([True, True, False, False])
1.0

>>> gini([False, False, False, False])
0.0

>>> gini([True, True, False, False])
0.5
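
If you want to reproduce a plot like the one above, here is a minimal sketch for a two-class problem, assuming numpy and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

# Proportion of one class in a two-class problem. Stay just inside (0, 1)
# so that log2(p) is defined; both measures go to zero at the endpoints.
p = np.linspace(0.001, 0.999, 200)

entropy_curve = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
gini_curve = 2 * p * (1 - p)

plt.plot(p, entropy_curve, label='entropy')
plt.plot(p, gini_curve, label='Gini impurity')
plt.xlabel('proportion of one class')
plt.ylabel('uncertainty')
plt.legend()
plt.show()

Both curves peak at a 50/50 split, entropy at 1.0 and Gini impurity at 0.5, matching the examples above.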
 

Asking a question

At this point I will introduce the data we can test our custom algorithm on. A single decision tree works well only with easily separable data, so I chose the sklearn Iris dataset. It contains the sepal and petal lengths and widths for three different kinds of iris flowers and is a nice training-wheels dataset.

The Iris dataset: sepal and petal measurements for three species of iris.

Say we are splitting on petal width. A question we could ask is whether the petal width is greater or less than a certain value. But which value should we choose? The median? The mean? How do we know which value will split the data into the purest groups? We can't know, so we try every unique value of petal width and record the best one. Below is that process in Python. This code can only split on continuous variables, so all categorical variables must be one-hot encoded.

import numpy as np

def best_split(feature, labels):
    """For a single feature, calculate the uncertainty of splitting on each
    data point and return the best split point."""
    # `uncertainty` is whichever measure you prefer: entropy or gini from above.
    lowest_uncertainty = float('inf')
    for test_point in set(feature):
        mask = np.array([pt < test_point for pt in feature])
        left = labels[mask]
        right = labels[~mask]

        left_score = uncertainty(left)
        right_score = uncertainty(right)

        # Net uncertainty: the weighted average of each subgroup's uncertainty
        net = (len(left)/len(labels))*left_score + \
              (len(right)/len(labels))*right_score

        if net < lowest_uncertainty:
            lowest_uncertainty = net
            best_split_point = test_point

    return best_split_point, lowest_uncertainty

Each question like this will split the data into two subgroups, each with less uncertainty. This reduction in uncertainty is termed information gain, and it is simply the difference between the initial uncertainty and the net uncertainty of the subgroups. We calculate net uncertainty as the weighted average of the uncertainty of each subgroup. Each subgroup's uncertainty measure is weighted by the size of that subgroup.
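
In code, that gain is just the parent's uncertainty minus the net uncertainty that best_split already returns. A minimal sketch, assuming feature and labels are columns of the Iris data and uncertainty is bound to either entropy or gini:

# Information gain = parent uncertainty - weighted (net) uncertainty of the subgroups.
split_point, net_uncertainty = best_split(feature, labels)
information_gain = uncertainty(labels) - net_uncertainty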

A decision tree is what we call a "greedy" algorithm. At each step, the algorithm chooses the feature that offers the most information gain. Perhaps a more effective tree could be formed by splitting on a feature with less information gain first and then splitting on the most informative one later. If that's the case, this algorithm will never find that tree. As long as you give it the same features, this algorithm will grow the same tree.

 

Recursion

At each iteration we calculate the information gain offered by each feature and split on the feature that offers the highest gain. If either or both of the resulting subgroups are sufficiently pure, we return a leaf. If not, we repeat the process on each subgroup. Perfect purity is not always possible, so we usually set a minimum uncertainty threshold. Before doing this, it's helpful to convert the data into a dictionary format, like so:

{'labels': array([0, 0, ... 2, 2]),
 'petal length (cm)': array([ 1.4,  1.4,  ... 5.4,  5.1]),
 'petal width (cm)': array([ 0.2,  0.2, ...  2.3,  1.8]),
 'sepal length (cm)': array([ 5.1,  4.9,  ... 6.2,  5.9]),
 'sepal width (cm)': array([ 3.5,  3.0,  ...  3.4,  3.0])}
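
One way to build that dictionary, as a sketch assuming sklearn's load_iris:

from sklearn.datasets import load_iris

iris = load_iris()

# One numpy array per feature, keyed by feature name, plus the labels.
data = {name: iris.data[:, i] for i, name in enumerate(iris.feature_names)}
data['labels'] = iris.target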

Since we perform the same process at each node, this algorithm is perfect for recursion. Using recursion makes this code clean and fast! 

from scipy.stats import mode

def grow_tree(data, min_uncertainty):
    """Creates a single node and returns two datasets split
    on that node. Runs recursively to return a decision tree."""
    if uncertainty(data['labels']) <= min_uncertainty:
        # Pure enough: return a leaf holding the most common label
        return mode(data['labels'])[0][0]
    else:
        # best_feature and split are helpers; a sketch of each follows below
        gain, feature_name, split_point = best_feature(data)
        left, right = split(data, feature_name, split_point)
        return {'feature_name': feature_name,
                'information_gain': gain,
                'split_point': split_point,
                'left': grow_tree(left, min_uncertainty),
                'right': grow_tree(right, min_uncertainty)}
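
grow_tree leans on two helpers that aren't defined in this post; the real versions are in the full code on my Github. A rough sketch of what they need to do, reusing best_split and uncertainty from above:

def best_feature(data):
    """Try best_split on every feature and return the feature offering
    the highest information gain, along with its split point."""
    labels = data['labels']
    parent_uncertainty = uncertainty(labels)

    best_gain, best_name, best_point = 0, None, None
    for name, feature in data.items():
        if name == 'labels':
            continue
        split_point, net = best_split(feature, labels)
        gain = parent_uncertainty - net
        if gain > best_gain:
            best_gain, best_name, best_point = gain, name, split_point
    return best_gain, best_name, best_point

def split(data, feature_name, split_point):
    """Partition the data dictionary into left and right dictionaries."""
    mask = data[feature_name] < split_point
    left = {name: values[mask] for name, values in data.items()}
    right = {name: values[~mask] for name, values in data.items()}
    return left, right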

The Iris dataset is so easily separable we can set our minimum uncertainty to zero. With that threshold, the above code returns this decision tree as a nested dictionary:

{'feature_name': 'petal length (cm)',
 'information_gain': 0.9324631422909171,
 'left': 0,
 'right': {'feature_name': 'petal width (cm)',
  'information_gain': 0.7377391157323221,
  'left': {'feature_name': 'petal length (cm)',
   'information_gain': 0.2939677433590534,
   'left': {'feature_name': 'sepal length (cm)',
    'information_gain': 0.1207548980710625,
    'left': {'feature_name': 'sepal width (cm)',
     'information_gain': 1.0,
     'left': 1,
     'right': 2,
     'split_point': 2.5},
    'right': 1,
    'split_point': 5.0},
   'right': 2,
   'split_point': 5.0999999999999996},
  'right': 2,
  'split_point': 1.8},
 'split_point': 3.0}
 

Classify New Data

Using this tree to make predictions also relies on recursion. At each node, follow the appropriate branch and return what lies beneath. If it's another node, repeat the same process. If it's a leaf, we have our prediction. The code passes smaller and smaller sub-trees to itself until it reaches a leaf. When this happens, sub_tree will be a single value instead of a dictionary.

def classify(sub_tree, data):
    """Classify a single data point using the provided tree"""
    if isinstance(sub_tree, dict):
        if data[sub_tree['feature_name']] < sub_tree['split_point']:
            return classify(sub_tree['left'], data)
        else:
            return classify(sub_tree['right'], data)
    else:
        return sub_tree
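
For example, assuming the tree above is stored in a variable called tree, classifying the first row of the Iris data looks like this:

# The first row of the Iris data shown earlier (label 0).
sample = {'sepal length (cm)': 5.1,
          'sepal width (cm)': 3.5,
          'petal length (cm)': 1.4,
          'petal width (cm)': 0.2}

classify(tree, sample)  # returns 0: petal length 1.4 < 3.0 sends us down the left branch to a leaf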
 

Pros

A decision tree is a simple and elegant algorithm. By far the greatest advantage of using a decision tree is how interpretable the output will be. The code will give you a tree that you could follow by hand if you want. Nothing could be further from a black box. This makes decision trees easy to share with non-technical team members.

Decision trees also require very little data prep on your part. They naturally ignore missing data and outliers won’t artificially skew the model. The algorithm doesn’t mandate a linear relationship between features and target or require that all features be independent.

They also handle much of the feature selection and engineering for you. The algorithm automatically selects the most important feature first. It will also inherently handle feature interaction; there is no need to create linear combinations of features before fitting. Most applications only handle continuous data, but one-hot-encoding categorical data gets around that. You can also get good results with small amounts of data and there are relatively few hyper-parameters to tune.
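
As a quick sketch of that one-hot-encoding workaround, assuming pandas is available:

import pandas as pd

# A toy categorical feature becomes one 0/1 column per category,
# which the continuous-only splitting code above can handle.
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})
one_hot = pd.get_dummies(df['color'], prefix='color')
print(list(one_hot.columns))  # ['color_blue', 'color_green', 'color_red']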

Finally, as demonstrated above, the uncertainty measures can handle any number of unique classes. This makes the algorithm inherently multi-class.

 

and cons

A decision tree is a pretty simplistic algorithm and works best on easily separable data. As mentioned above, decision trees are "greedy." While the algorithm naturally selects the most important features, it will select them in the same order every time.

Also, nothing says we can't split on the same feature more than once. A certain feature may have several helpful split points at different depths of the tree. Without setting a minimum uncertainty threshold, the tree will keep growing deeper and deeper until each leaf is perfectly pure. This means decision trees have a penchant for overfitting.

While one tree may be limited, a forest of trees can be very powerful. Random forests are an ensemble method that overcome trees’ limitations and retain all their strengths.

 

Let's keep going

I recommend trying to code from scratch any algorithm that you use often; it's the best way to learn what an algorithm does well and where it falls short. The full code for this algorithm and a notebook comparing its performance to sklearn can be found on my Github.