random forests

coded from scratch in python


A decision tree is one of the simplest and most understandable classification algorithms. Alone, a single decision tree is a relatively weak learner that is prone to overfitting. But decision trees are the building blocks of random forests, a powerful algorithm that is very difficult to overfit. To better understand random forests, I wanted to better understand decision trees, and to do that I decided to code them up from scratch.

Decision trees can be thought of as asking a series of questions in order to split data into smaller and smaller groups. For example, let's say we have a bunch of data on people's Google search history and whether or not they voted, and we want to build a model to predict whether or not someone will vote. Looking at the data, we may think of several questions to ask, such as whether or not a person googled for their polling location.

If we ask that question, we'll get two subsets of data: folks who have searched for their polling location and folks who haven't. We can continue to ask questions, splitting our dataset into smaller and smaller subgroups. Our goal is to get perfectly pure subgroups, that is groups where everyone belongs to the same class. They either voted or they didn't. We then ask these same questions in the same order on new data to make predictions.

A very simple decision tree. Each point where we ask a question is called a node. Each point where we've reached a pure group is called a leaf.


Uncertainty and impurity

The foundation of a decision tree is how we measure the purity of our subgroups. We quantify purity by measuring uncertainty. For example, the worst-case scenario is an even split: half of our subgroup belonging to one class and half belonging to the other. This case yields maximum uncertainty because it's a toss-up. With equal proportions, we can't say which class is more likely. If a subgroup contains all the same class, there is no uncertainty.

One measure of the level of uncertainty in a subgroup is entropy. The entropy of a dataset, S, is as follows:

$$E(S) = \sum_{i=1}^{n} -p_i \log_2(p_i)$$ with the probabilities, $p_i$, calculated by simply counting the frequency of each class in the dataset. Since $\log(0)$ is undefined, we use the convention that $0 \log_2(0) = 0$.

The Counter object from the collections library makes this very simple in python:

from collections import Counter
from math import log

def entropy(labels):
    """Calculates the entropy for a given set of labels."""
    probs = [freq/len(labels) for freq in Counter(labels).values()]
    return sum(-p*log(p, 2) if p else 0 for p in probs)

An alternative measure of uncertainty is Gini impurity. It is computationally cheaper than entropy (there is no logarithm to evaluate) and yields comparable results.

$$G(S) = \sum_{i=1}^{n} p_i(1-p_i)$$

And in python:

def gini(labels):
    """Calculates the gini impurity for a given set of labels"""
    probs = [freq/len(labels) for freq in Counter(labels).values()]
    return sum(p*(1-p) for p in probs)

For both measures, minimum uncertainty comes when a class's proportion is either 0% or 100%.

>> entropy([True, True, True, True])
0.0

>> entropy([True, True, False, False])
1.0

>> gini([False, False, False, False])
0.0

>> gini([True, True, False, False])
0.5
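
To visualize how the two measures behave for a two-class problem, here is a small plotting sketch (assuming matplotlib is installed) that computes both curves directly from the formulas above:

import numpy as np
import matplotlib.pyplot as plt

# Proportion of one class, avoiding 0 and 1 where log(0) would appear
p = np.linspace(0.001, 0.999, 200)

# Two-class entropy and Gini impurity as functions of the class proportion
entropy_curve = -p*np.log2(p) - (1 - p)*np.log2(1 - p)
gini_curve = 2*p*(1 - p)

plt.plot(p, entropy_curve, label='entropy')
plt.plot(p, gini_curve, label='gini impurity')
plt.xlabel('proportion of one class')
plt.ylabel('uncertainty')
plt.legend()
plt.show()

Both curves peak at an even split (entropy at 1.0, Gini impurity at 0.5) and drop to zero as a group becomes pure.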
 

Asking a question

At this point I'll introduce the data we can test our custom algorithm on. A single decision tree works well only with easily separable data, so I chose the sklearn Iris dataset. It contains the sepal and petal lengths and widths for three different kinds of iris flowers and makes a nice training-wheels dataset.


Say we are splitting on petal width. A question we could ask is whether the petal width is greater or less than a certain value. But which value should we choose? The median? The mean? How do we know which value will split the data into the purest groups? We can't know, so we try every unique value of petal width and keep the best one. Below is that process in python. This code can only split on continuous variables, so any categorical variables must be one-hot encoded.

import numpy as np

def best_split(feature, labels):
    """For a single feature, calculate the net uncertainty of splitting on
    each unique value and return the best split point."""

    best_split_point = None
    lowest_uncertainty = float('inf')
    for test_point in set(feature):
        mask = np.array([pt < test_point for pt in feature])
        left = labels[mask]
        right = labels[~mask]

        # Skip degenerate splits that leave one side empty
        if len(left) == 0 or len(right) == 0:
            continue

        # uncertainty is one of the measures above (entropy or gini)
        left_score = uncertainty(left)
        right_score = uncertainty(right)

        # Net uncertainty: the weighted average of the two subgroups
        net = (len(left)/len(labels))*left_score + \
              (len(right)/len(labels))*right_score

        if net < lowest_uncertainty:
            lowest_uncertainty = net
            best_split_point = test_point

    return best_split_point, lowest_uncertainty

Each question like this will split the data into two subgroups, each with less uncertainty. This reduction in uncertainty is termed information gain, and it is simply the difference between the initial uncertainty and the net uncertainty of the subgroups. We calculate net uncertainty as the weighted average of the uncertainty of each subgroup. Each subgroup's uncertainty measure is weighted by the size of that subgroup.
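
As a sketch, that calculation might look like the following; the information_gain helper is just for illustration and is not part of the code above, and the uncertainty function can be either entropy or gini:

def information_gain(parent_labels, left_labels, right_labels, uncertainty=entropy):
    """Uncertainty of the parent group minus the weighted average
    uncertainty of the two subgroups produced by a split."""
    n = len(parent_labels)
    net = (len(left_labels)/n)*uncertainty(left_labels) + \
          (len(right_labels)/n)*uncertainty(right_labels)
    return uncertainty(parent_labels) - net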

A decision tree is what we call a "greedy" algorithm. At each step, the algorithm chooses the feature that offers the most information gain. Perhaps a more effective tree could be formed by splitting on a feature with less information gain first and then splitting on the most informative one later. If that's the case, this algorithm will never find that tree. As long as you give it the same features, this algorithm will grow the same tree.

 

Recursion

At each iteration we calculate the information gain offered by each feature and split on the feature that offers the highest gain. If either or both of the resulting subgroups are sufficiently pure, we return a leaf. If not, we repeat the process on each subgroup. Perfect purity is not always possible, so we usually set a minimum uncertainty threshold. Before doing this, it's helpful to convert the data into a dictionary format, like so:

{'labels': array([0, 0, ... 2, 2]),
 'petal length (cm)': array([ 1.4,  1.4,  ... 5.4,  5.1]),
 'petal width (cm)': array([ 0.2,  0.2, ...  2.3,  1.8]),
 'sepal length (cm)': array([ 5.1,  4.9,  ... 6.2,  5.9]),
 'sepal width (cm)': array([ 3.5,  3.0,  ...  3.4,  3.0])}
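
One way to build that dictionary from the sklearn Iris dataset might be the following sketch (the preprocessing in the full code on Github may differ):

from sklearn.datasets import load_iris

iris = load_iris()

# One numpy array per feature, keyed by feature name, plus the class labels
data = {'labels': iris.target}
for i, name in enumerate(iris.feature_names):
    data[name] = iris.data[:, i]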

Since we perform the same process at each node, this algorithm is a perfect fit for recursion. Using recursion keeps the code clean and compact!

from scipy.stats import mode

def grow_tree(data, min_uncertainty):
    """Creates a single node, splits the data on that node, and runs
    recursively on each half to return a decision tree."""
    if uncertainty(data['labels']) <= min_uncertainty:
        # Pure enough: return a leaf holding the most common label
        return mode(data['labels'])[0][0]
    else:
        # best_feature and split are helpers from the full code on Github:
        # best_feature returns the feature and split point with the highest
        # information gain, and split divides the data dictionary at that point
        gain, feature_name, split_point = best_feature(data)
        left, right = split(data, feature_name, split_point)
        return {'feature_name': feature_name,
                'information_gain': gain,
                'split_point': split_point,
                'left': grow_tree(left, min_uncertainty),
                'right': grow_tree(right, min_uncertainty)}
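
Calling it on the data dictionary built earlier is then a single line (a sketch, using the standalone version of grow_tree above; the zero threshold is justified just below):

tree = grow_tree(data, min_uncertainty=0)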

The Iris dataset is so easily separable we can set our minimum uncertainty to zero. With that threshold, the above code returns this decision tree as a nested dictionary:

{'feature_name': 'petal length (cm)',
 'information_gain': 0.9324631422909171,
 'left': 0,
 'right': {'feature_name': 'petal width (cm)',
  'information_gain': 0.7377391157323221,
  'left': {'feature_name': 'petal length (cm)',
   'information_gain': 0.2939677433590534,
   'left': {'feature_name': 'sepal length (cm)',
    'information_gain': 0.1207548980710625,
    'left': {'feature_name': 'sepal width (cm)',
     'information_gain': 1.0,
     'left': 1,
     'right': 2,
     'split_point': 2.5},
    'right': 1,
    'split_point': 5.0},
   'right': 2,
   'split_point': 5.0999999999999996},
  'right': 2,
  'split_point': 1.8},
 'split_point': 3.0}
 

Classifying new data

Using this tree to make predictions also relies on recursion. At each node, follow the appropriate branch and return what lies beneath. If it's another node, repeat the same process. If it's a leaf, we have our prediction. The code passes smaller and smaller sub-trees to itself until it reaches a leaf. When this happens, sub_tree will be a single value instead of a dictionary.

def classify(sub_tree, data):
    """Classify a single data point using the provided tree."""
    if isinstance(sub_tree, dict):
        if data[sub_tree['feature_name']] < sub_tree['split_point']:
            return classify(sub_tree['left'], data)
        else:
            return classify(sub_tree['right'], data)
    else:
        return sub_tree
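
To sanity-check the tree, we can classify every row of the training data and measure accuracy. A rough sketch, assuming the data dictionary and the tree grown above:

feature_names = [name for name in data if name != 'labels']

predictions = []
for i in range(len(data['labels'])):
    # Each data point is a dict mapping feature names to a single value
    point = {name: data[name][i] for name in feature_names}
    predictions.append(classify(tree, point))

accuracy = sum(p == label for p, label in zip(predictions, data['labels'])) / len(predictions)
print(accuracy)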
 

Let's keep going

Once you've coded up a single decision tree, it's easy to grow several into a random forest! Random forests address the two main shortcomings of decision trees: their greedy feature selection and their penchant for overfitting.
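
As a teaser, here is a minimal sketch of the bagging half of that idea, reusing the grow_tree and classify functions above: each tree sees a different bootstrap sample of the rows, and predictions are made by majority vote. A full random forest would also consider only a random subset of features at each split, which this sketch leaves out.

import numpy as np
from collections import Counter

def grow_forest(data, n_trees=10, min_uncertainty=0):
    """Grow several trees, each on a bootstrap sample of the rows."""
    n_rows = len(data['labels'])
    forest = []
    for _ in range(n_trees):
        rows = np.random.choice(n_rows, size=n_rows, replace=True)
        sample = {name: values[rows] for name, values in data.items()}
        forest.append(grow_tree(sample, min_uncertainty))
    return forest

def classify_forest(forest, point):
    """Majority vote over the predictions of the individual trees."""
    votes = [classify(tree, point) for tree in forest]
    return Counter(votes).most_common(1)[0][0]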

I recommend trying to code from scratch any algorithm that you use often; it's the best way to learn what an algorithm does well and where it falls short. The full code for this algorithm and a notebook comparing its performance to sklearn can be found on my Github.