Bagging is designed to reduce the variance term without having to redesign the base learner. It is especially useful for high-variance learners such as full decision trees.
By the weak law of large numbers, the average of i.i.d. random variables converges to their common mean.
If the \(D_i\) were independently drawn datasets from \(P^n\), averaging the resulting predictors would move us toward the mean predictor \(\bar h\).
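A minimal numerical sketch of why averaging helps (pure NumPy; the setup is illustrative, standing in for independent predictors that all target the same value):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each "predictor" is a noisy estimate of the same
# target; averaging m of them shrinks the variance roughly by a factor of m.
target = 2.0
noise_sd = 1.0

def noisy_predictions(m, trials=10_000):
    # m independent predictors per trial, averaged together
    preds = target + noise_sd * rng.standard_normal((trials, m))
    return preds.mean(axis=1)

for m in (1, 10, 100):
    avg = noisy_predictions(m)
    print(m, round(avg.var(), 3))  # variance shrinks like noise_sd**2 / m
```

Bagging tries to buy this same variance reduction without access to fresh datasets.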
Bagging replaces fresh draws from \(P\) with resamples from the observed training set \(D\), sampled uniformly with replacement.
A bootstrap bag \(D_j\) is drawn from \(Q^n\), where \(Q\) is the empirical distribution placing mass \(1/n\) on each training point: we draw \(n\) times with replacement from \(D\).
The lecture note derives this with a binomial-count argument: first over how many copies of \(x_i\) appear in the sampled dataset \(D\), and then over how likely a single bootstrap draw is to pick one of those copies.
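One consequence of sampling with replacement is easy to check numerically: each draw picks \(x_i\) with probability \(1/n\), so a bag of \(n\) draws misses a given point with probability \((1-1/n)^n \to e^{-1}\). A small sketch (the simulation parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Each bootstrap draw picks x_i with probability 1/n, so the chance a
# given point never appears in a bag of n draws is (1 - 1/n)**n -> 1/e.
analytic_miss = (1 - 1 / n) ** n

bags = 2000
hits = 0
for _ in range(bags):
    bag = rng.integers(0, n, size=n)  # indices sampled with replacement
    hits += 0 in bag                  # did point 0 make it into this bag?
empirical_in_bag = hits / bags

print(round(analytic_miss, 3))         # ≈ 0.368
print(round(1 - empirical_in_bag, 3))  # close to the analytic value
```

So each bag contains roughly 63% of the distinct training points, a fact that powers the out-of-bag estimate below.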
Sample \(m\) bootstrap datasets from the training set \(D\), each by drawing \(n\) points with replacement.
Train one base learner on each bag. In our demos, the base learner is a full-depth decision tree.
Average the predictions. For classification, that corresponds to voting; for regression, it is literal averaging.
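The three steps above can be sketched in a few lines (a minimal version using scikit-learn trees; the helper names `bagged_fit` and `bagged_predict` are my own, not from the note):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_fit(X, y, m, rng):
    """Train m full-depth trees, each on a bootstrap resample of (X, y)."""
    n = len(X)
    trees = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # sample n indices with replacement
        tree = DecisionTreeClassifier()    # full depth by default
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagged_predict(trees, X):
    """Majority vote across the ensemble (use the mean for regression)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (m, n_samples)
    # For classification, pick the most common label in each column.
    return np.apply_along_axis(
        lambda c: np.bincount(c.astype(int)).argmax(), 0, votes)

# Toy usage on a linearly separable problem (data is synthetic):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
trees = bagged_fit(X, y, m=25, rng=rng)
pred = bagged_predict(trees, X)
print((pred == y).mean())
```

Swapping `DecisionTreeClassifier` for a regressor and the vote for `votes.mean(axis=0)` gives the regression variant.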
This demo trains a full-depth CART tree on each bag. Move the sliders for the number of bags and the number of samples per bag to see how the ensemble boundary changes.
This demo trains a full-depth regression tree on each bag. Compare how the average fit changes as you vary the number of bags and the samples per bag.
A major practical advantage of bagging is that each training point is left out of roughly a third of the bags: a bag of \(n\) draws omits a given point with probability \((1-1/n)^n \approx e^{-1} \approx 0.37\). That gives us a built-in test-like estimate.
For each training example, average the predictions of only those bags that never saw that example.
This behaves like a validation estimate without having to carve off a separate validation split from the dataset.
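This out-of-bag procedure can be sketched directly on top of the bagging loop (a minimal version; the function name `oob_error` and the synthetic data are my own choices for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, m=50, seed=0):
    """Out-of-bag error: each point is scored only by trees whose bag missed it."""
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros((n, len(np.unique(y))))
    for _ in range(m):
        idx = rng.integers(0, n, size=n)        # bootstrap bag indices
        oob = np.setdiff1d(np.arange(n), idx)   # points this bag never saw
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        pred = tree.predict(X[oob]).astype(int)
        votes[oob, pred] += 1                   # tally out-of-bag votes only
    scored = votes.sum(axis=1) > 0              # points with at least one vote
    oob_pred = votes[scored].argmax(axis=1)
    return (oob_pred != y[scored]).mean()

# Toy usage on an easy synthetic problem:
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)
print(round(oob_error(X, y), 3))  # small error on this easy problem
```

Because every prediction used to score a point comes from trees that never trained on it, this number behaves like held-out error, with no data sacrificed.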
This demo adds a feature-per-split slider \(k\). With \(d=2\), using \(k=1\) forces each split to consider only one randomly chosen coordinate at a time; \(k=2\) (all features) recovers plain bagging of trees.
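The slider's effect can be mimicked in code with scikit-learn's `RandomForestClassifier`, whose `max_features` parameter plays the role of \(k\) (the dataset and parameter values below are illustrative, not from the demo):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 2-feature problem, so k ranges over {1, 2} as in the demo.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# max_features=1 forces each split to look at a single random coordinate;
# max_features=2 lets every split consider both, i.e. plain bagging.
scores = {}
for k in (1, 2):
    forest = RandomForestClassifier(n_estimators=100, max_features=k,
                                    random_state=0)
    scores[k] = cross_val_score(forest, X, y, cv=5).mean()
    print(k, round(scores[k], 3))
```

Restricting \(k\) decorrelates the trees, which is exactly what makes averaging them more effective than averaging near-identical full-feature trees.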
Once students understand bagging and random forests, the natural next step is boosting: instead of averaging independent-ish trees, we build trees sequentially to correct previous mistakes.