Bagging and random forests

Bagging targets variance

Bias–variance reminder

\[ \underbrace{\mathbb{E}[(h_D(x)-y)^2]}_{\mathrm{Error}} = \underbrace{\mathbb{E}[(h_D(x)-\bar h(x))^2]}_{\mathrm{Variance}} + \underbrace{\mathbb{E}[(\bar h(x)-\bar y(x))^2]}_{\mathrm{Bias}^2} + \underbrace{\mathbb{E}[(\bar y(x)-y(x))^2]}_{\mathrm{Noise}} \]

Bagging is designed to reduce the variance term without having to redesign the base learner. It is especially useful for high-variance learners such as full decision trees.

Ensemble intuition

\[ \hat h(x)=\frac{1}{m}\sum_{i=1}^{m} h_{D_i}(x) \]
  • If the component predictors wobble around a mean prediction, averaging smooths out the wobble.
  • The effect is strongest when the base learners are unstable and when the different bags do not all make exactly the same error.
  • Decision trees are a natural fit because tiny data changes can produce very different trees.
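The averaging effect can be checked with a quick numerical sketch (purely illustrative: the Gaussian noise below is a stand-in for unstable base learners that wobble around a common mean prediction):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each "predictor" outputs the true value 2.0 plus
# independent unit-variance noise, mimicking high-variance base learners.
true_value = 2.0
m = 100                      # number of component predictors in the ensemble
n_trials = 5000              # repetitions used to estimate the variances

single = rng.normal(true_value, 1.0, size=n_trials)              # one predictor
ensemble = rng.normal(true_value, 1.0, size=(n_trials, m)).mean(axis=1)

# For independent predictors, averaging m of them divides the variance by m.
print(single.var())    # close to 1.0
print(ensemble.var())  # close to 1/m = 0.01
```

With correlated predictors the reduction is smaller, which is exactly why the bullet above stresses that the bags should not all make the same error.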

If we had many datasets, averaging would be easy

Weak law reminder

\[ \frac{1}{m}\sum_{i=1}^{m}x_i \to \bar x \qquad \text{as } m\to\infty \]

For i.i.d. random variables with finite mean, the weak law of large numbers says the sample average converges (in probability) to the expectation \(\bar x = \mathbb{E}[x]\).

Apply the same idea to predictors

\[ \hat h = \frac{1}{m}\sum_{i=1}^m h_{D_i} \to \bar h \]

If the \(D_i\) were independently drawn datasets from \(P^n\), averaging the resulting predictors would move us toward the mean predictor \(\bar h\).

The practical problem is that we only have one dataset \(D\), not infinitely many independent draws from the data-generating distribution.

The bootstrap trick

Bagging replaces fresh draws from \(P\) with resamples from the observed training set \(D\), sampled uniformly with replacement.

Sampling distribution over the dataset

\[ Q((x_i,y_i)\mid D)=\frac{1}{n} \qquad \forall (x_i,y_i)\in D,\quad n=|D| \]

A bootstrap bag \(D_j\) is sampled from \(Q^n\): we draw \(n\) times with replacement from the empirical dataset.

\[ \hat h_D(x)=\frac{1}{m}\sum_{j=1}^{m} h_{D_j}(x) \]
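A minimal sketch of drawing one bootstrap bag, using a tiny illustrative dataset `D` (features only, for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training set D with n = 5 points.
D = np.array([10., 20., 30., 40., 50.])
n = len(D)

# One bootstrap bag D_j ~ Q^n: draw n indices uniformly with replacement.
idx = rng.integers(0, n, size=n)
bag = D[idx]

# Some points repeat, others are left out entirely (the "out-of-bag" points).
print(sorted(bag.tolist()))
oob = set(range(n)) - set(idx.tolist())
print(oob)
```

Repeating this \(m\) times, training a base learner on each bag, and averaging gives exactly the \(\hat h_D\) above.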

Important subtlety

  • The bags are not independent in the strict sense required by the weak law argument.
  • So the note does not claim a formal WLLN proof for bagging.
  • Instead, it argues that in practice bagging still reduces variance very effectively.

The component trees are trained on slightly different bootstrap views of the same data, which is often enough to decorrelate their errors.

Why is resampling from the data still reasonable?

Key takeaway from the analysis

\[ Q(X=x_i)=P(X=x_i) \]

The lecture note derives this with a binomial-count argument: first over how many copies of \(x_i\) appear in the training set \(D\) (itself drawn from \(P^n\)), and then over how likely a uniform bootstrap draw from \(D\) is to pick one of those copies.
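Writing \(p=P(X=x_i)\), a sketch of that binomial-count argument: the number of copies \(k\) of \(x_i\) in \(D\) is \(\mathrm{Binomial}(n,p)\), and a uniform bootstrap draw picks one of those copies with probability \(k/n\):

\[ Q(X=x_i)=\sum_{k=0}^{n}\underbrace{\binom{n}{k}p^{k}(1-p)^{n-k}}_{\Pr[k \text{ copies of } x_i \text{ in } D]}\cdot\frac{k}{n}=\frac{1}{n}\,\mathbb{E}[k]=\frac{np}{n}=p=P(X=x_i) \]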

Intuition

  • Think of bootstrap sampling as first picking a slot in the dataset, then asking which example fills that slot.
  • The slot itself carries no preference for particular examples.
  • So marginally, drawing from the empirical dataset preserves the original data distribution even though the bags are dependent.

This is enough to make bagging intuitively trustworthy even though the clean i.i.d. argument no longer applies.

Bagging summarized

Step 1

\[ D_1,\dots,D_m \sim Q^n \]

Sample \(m\) bootstrap datasets from the training set \(D\), each with replacement.

Step 2

\[ h_j = \mathrm{Train}(D_j) \]

Train one base learner on each bag. In our demos, the base learner is a full-depth decision tree.

Step 3

\[ h(x)=\frac{1}{m}\sum_{j=1}^{m} h_j(x) \]

Average the predictions. For classification, that corresponds to voting; for regression, it is literal averaging.
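The three steps can be sketched in a few lines, assuming scikit-learn is available for the full-depth trees (the toy data and the helper name `bagged_predict` are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical 2-D toy data: class 1 if the point lies above the diagonal.
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 1] > X[:, 0]).astype(int)
n, m = len(X), 25

# Steps 1 + 2: sample m bootstrap bags, train a full-depth tree on each.
trees = []
for _ in range(m):
    idx = rng.integers(0, n, size=n)          # D_j ~ Q^n (with replacement)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: average the predictions; for 0/1 labels, thresholding the
# average at 0.5 is exactly majority voting.
def bagged_predict(X_new):
    votes = np.mean([t.predict(X_new) for t in trees], axis=0)
    return (votes > 0.5).astype(int)

print((bagged_predict(X) == y).mean())  # training accuracy of the ensemble
```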

Larger \(m\) usually helps, but the returns diminish: beyond a certain point, adding more bags mostly costs computation without meaningfully improving accuracy.

Interactive demo: bagging for classification

This demo uses full-depth CART trees as the base learners. Move the sliders for the number of bags and the number of samples per bag to see how the ensemble boundary changes.

Interactive demo: bagging for regression

This demo uses full-depth regression trees as the base learners. Compare how the averaged fit changes as you vary the number of bags and the samples per bag.

Out-of-bag error comes almost for free

A major practical advantage of bagging is that every training point is left out of many bags: on average, a point is missing from about \(e^{-1}\approx 37\%\) of them. That gives us a built-in, test-like error estimate.

Leave-one-point-out ensemble

\[ S_i=\{k \mid (x_i,y_i)\notin D_k\}, \qquad \tilde h_i(x)=\frac{1}{|S_i|}\sum_{k\in S_i} h_k(x) \]

For each training example, average only the bags that never saw that example.

OOB error estimate

\[ \epsilon_{\mathrm{OOB}} = \frac{1}{n}\sum_{(x_i,y_i)\in D} \ell(\tilde h_i(x_i), y_i) \]

This behaves like a validation estimate without having to carve off a separate validation split from the dataset.
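A sketch of computing \(\epsilon_{\mathrm{OOB}}\) by hand, again assuming scikit-learn trees and illustrative toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: class 1 in the first and third quadrants.
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
n, m = len(X), 50

trees, in_bag = [], []
for _ in range(m):
    idx = rng.integers(0, n, size=n)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    in_bag.append(set(idx.tolist()))

# For each point i, vote only with the trees whose bag never contained i
# (the set S_i from the definition above).
errors, counted = 0, 0
for i in range(n):
    votes = [t.predict(X[i:i+1])[0] for t, bag in zip(trees, in_bag)
             if i not in bag]
    if votes:                 # skip the rare point that appears in every bag
        counted += 1
        errors += (round(float(np.mean(votes))) != y[i])

print(errors / counted)       # the OOB error estimate
```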

In the demos, the OOB estimate updates as we vary the number of bags and the bag size, a nice way to show that bagging can evaluate itself while training.

Random forest = bagged trees + random feature subsampling

Algorithm sketch

  1. Sample bootstrap datasets \(D_1,\dots,D_m\) from \(D\).
  2. For each \(D_j\), grow a full decision tree, with one modification:
  3. before every split, randomly select only \(k\le d\) of the features and search for the best split among those features alone.
  4. Average (regression) or vote (classification) the predictions across trees.
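In scikit-learn, this sketch corresponds roughly to `RandomForestClassifier`, whose `max_features` parameter plays the role of \(k\) (the toy data below is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: class 1 above the diagonal.
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 1] > X[:, 0]).astype(int)

forest = RandomForestClassifier(
    n_estimators=100,       # m bootstrap-trained trees
    max_features="sqrt",    # k = sqrt(d) features examined per split
    oob_score=True,         # compute the OOB estimate during training
    random_state=0,
).fit(X, y)

print(forest.oob_score_)    # OOB accuracy, the complement of the OOB error
```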

Why the extra randomness helps

  • Bagging reduces variance when the trees are not perfectly correlated.
  • If one feature is extremely strong, plain bagging tends to split on it in many trees, making the trees too similar.
  • Randomly limiting the features at each split forces the trees to differ more, which further reduces ensemble variance.

The note recommends \(k=\sqrt d\) as a solid default for classification, and as large an \(m\) as computation allows.

Interactive demo: random forest

This demo adds a feature-per-split slider \(k\). With \(d=2\), setting \(k=1\) forces each split to consider only one coordinate at a time, while \(k=2\) lets every split see both features and recovers plain bagging of trees.

Why random forests are so popular

Reason 1: few sensitive hyperparameters

  • The main knobs are the number of trees \(m\) and the number of features per split \(k\).
  • Performance is often fairly insensitive to the exact value of either one.
  • A good classification default is \(k=\sqrt d\).

Reason 2: decision trees need little preprocessing

  • Features can have different scales and units.
  • Trees can mix heterogeneous feature types naturally.
  • This makes forests especially convenient in practical pipelines.

Bridge to the next lecture

Once students understand bagging and random forests, the natural next step is boosting: instead of averaging independent-ish trees, we build trees sequentially to correct previous mistakes.
