AI News Today

    Ensembles of Ensembles of Ensembles: A Guide to Stacking


    Like motorsport, machine learning is a hypercompetitive game of ensemble engineering. A slight improvement in lap time or loss score can be worth millions of dollars to a team willing to do what it takes to be the best. Not only does every single component of the system need to be perfect; the way it is all brought together needs to be perfect too.

    The state of the art

    Gradient boosted models have historically been the most competitive models for tabular and time series prediction problems. They are ensemble methods because they combine the results of several base estimators to produce a final answer that is better than any individual prediction alone. But the state of the art is beginning to change. Pre-trained models such as TabPFN for tabular data and Chronos for time series are beginning to match or exceed gradient boosted models on certain benchmarks. In a way these are also ensemble methods, except instead of ensembling many predictions, they ensemble the data that they learn from. The intuition behind this is broadly applicable, and can be taken further.

    Two completely different approaches are now battling for the top spot across ML leaderboards, followed closely by dozens of other architectures with their own strengths and weaknesses. Because they all learn in different ways, and from different data, they can be combined in an additional ensemble that retains most of the strengths while eliminating most of the weaknesses. Done properly, this almost always yields better performance and a more robust model.

    Assertions and assumptions

    The same strategies used to determine which data is important for a given prediction can also be used to determine which models are important for a given prediction. Just as a combination of base estimators in a gradient boosted model beats a single estimator, a combination of models beats any one model.

    For the rest of this discussion, there is a big assumption that all the correct data is used in the modelling process. In other words, all relevant information is known at time t (or at inference). In data science this is not a trivial assumption, and making it falsely will largely invalidate the claims made here. As it turns out, most of the work in data science is just trying to satisfy this assumption with data in the correct format. Also note that the covariates/features exposed to the models are not fixed: different architectures do better with different data, and some may not be able to handle certain data types at all (a particularly relevant point for pre-trained language/numeric model hybrids, which are still in early development).

    Multi-Layer Stacking

    A generalized approach that can be modified for time series or tabular regression/classification problems

    Layer 1

    There are many ways of creating ensemble methods, and it makes the most sense to organize these steps in layers. The first layer is the collection of base models (e.g. CatBoost, MLPs, TabPFN, etc.).

    For tabular problems, these can be trained with bootstrap aggregation (bagging), where new training sets are created by sampling from the base training set with replacement. Individual models are then trained on each new set and their predictions are averaged. Hyperparameter optimization can also be done for each of these models, though this is much more computationally expensive, as each model for each sample (or “bag”) is re-trained many times. To cut down on training time, a hyperparameter optimization framework like Optuna can prune runs that aren’t doing well and home in on a good optimum faster with statistical optimization tricks. Alternatively, several hyperparameter presets can be used for each model, based on what tends to work well for that particular model on similar datasets. The models with different presets can either be averaged together to “represent” one model, or registered as different versions of the model and used in the next layer.
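    The bagging step above can be sketched in a few lines. This is a minimal illustration, not a production recipe: `MeanModel` is a hypothetical stand-in for a real base estimator (CatBoost, an MLP, TabPFN), chosen only so the example runs with the standard library.

```python
import random
import statistics

class MeanModel:
    """Stand-in base estimator: predicts the mean of the targets it was fit on.
    In practice this would be CatBoost, an MLP, TabPFN, etc."""
    def fit(self, X, y):
        self.mean = statistics.fmean(y)
        return self

    def predict(self, X):
        return [self.mean] * len(X)

def fit_bagged(X, y, n_bags=10, seed=0):
    """Bootstrap aggregation: fit one model per resampled training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_bags):
        # Sample len(X) row indices with replacement for this bag.
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(MeanModel().fit([X[i] for i in idx], [y[i] for i in idx]))
    return models

def bagged_predict(models, X):
    """Average the predictions of all bagged models."""
    per_model = [m.predict(X) for m in models]
    return [statistics.fmean(col) for col in zip(*per_model)]

X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
models = fit_bagged(X, y)
prediction = bagged_predict(models, [[0.0]])[0]
```

    Swapping in a real estimator only requires that it expose `fit`/`predict`; the resampling and averaging logic stays the same.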

    For time series forecasting, traditional bootstrapping is a problem. Since the time dimension must be respected, the data cannot be randomly broken up and resampled into new training sets. Instead, cross-validation should be done with a rolling window through time: a new model is trained to predict a validation window whose timestamps are strictly after those in the training set. After training and evaluation, that validation window is added to the training set, and the process repeats for the next slice of time (the next validation window). This yields a good idea of how the model will perform throughout time, but models are not usually ensembled at this step. Since recent time series data is often the most informative, only the model trained at the last step is used for inference. However, the out-of-fold predictions from previous windows can still be used in the next layer.
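    The rolling-window split described above can be sketched as a small generator (index-based, with hypothetical window sizes chosen just for illustration):

```python
def rolling_splits(n_samples, initial_train, window):
    """Expanding rolling-window CV: each validation window sits strictly
    after the training set, then gets folded into it for the next step."""
    end = initial_train
    while end + window <= n_samples:
        yield list(range(end)), list(range(end, end + window))
        end += window

# With 10 timestamps, 4 initial training points, and windows of 2,
# the training set grows as 4 -> 6 -> 8 while validation rolls forward.
splits = list(rolling_splits(10, 4, 2))
```

    Only the final split’s model would be kept for inference; the validation predictions from every split become the out-of-fold inputs to layer 2.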

    Layer 2

    After training the base models, evaluation metrics on the training set and the validation set are available. For all intermediate steps, the test set should be completely ignored. In layer 2, new strategies can be used since model performance is known, and solid predictions have (hopefully) already been made.

    For tabular problems, a second round of bagged models can be trained where the predictions of the layer 1 models are added as features. In the case where a base model performs poorly on validation, it can be dropped from this step.
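    Feeding layer 1 predictions into layer 2 amounts to appending columns to the feature matrix. A minimal sketch (the model names are hypothetical, and alignment between rows and out-of-fold predictions is assumed):

```python
def add_oof_features(X, oof_preds):
    """Append each layer-1 model's out-of-fold prediction to the feature rows.

    X: list of feature rows.
    oof_preds: dict mapping model name -> list of predictions,
               aligned row-for-row with X.
    """
    names = sorted(oof_preds)  # fixed column order across train/inference
    return [row + [oof_preds[name][i] for name in names]
            for i, row in enumerate(X)]

X2 = add_oof_features(
    [[1.0], [2.0]],
    {"catboost": [0.9, 2.1], "tabpfn": [1.1, 1.9]},
)
# X2 == [[1.0, 0.9, 1.1], [2.0, 2.1, 1.9]]
```

    Dropping a weak base model is then just a matter of removing its key from `oof_preds` before this step.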

    In time series, the same strategy does not work, because the layer 1 models never made predictions for the entire training set: there is no earlier data to train on that would yield predictions for the beginning of the training set, and a model trained on anything after that point cannot be used to generate those predictions without leakage. A caveat: if the layer 2 architecture can handle missing values, or only the subset of the training set that has predictions is used, then a full re-train (on the training data plus layer 1 predictions) can be done at this layer. While this is possible, and maybe useful, there are more elegant approaches.

    Since model performance is known and predictions have been made, a combination of base model predictions can be used as new predictors. There are a handful of ways to do this:

    • Simply average all of them
    • Weight each prediction set by its validation performance and average them
    • Take a linear combination of all of the predictions that minimizes loss with ordinary least squares
    • Do a greedy ensemble that starts with the best performing model and slowly adds weight to other models until performance stops improving
    • If that’s not enough, an entire model can be trained purely on the predictions of the base models (this is only really useful if there is a sufficiently large number of out-of-fold predictions)
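    The greedy option is worth spelling out, since it is the least obvious of the list. Below is a minimal sketch of Caruana-style greedy forward selection with replacement, minimizing MSE; the model names and predictions are hypothetical:

```python
def greedy_ensemble(preds, y, max_rounds=50):
    """Start empty, repeatedly add whichever model most improves the
    blended MSE (models may be added more than once), stop when nothing
    helps. Returns the implied weights and the final score."""
    def mse(blend):
        return sum((p - t) ** 2 for p, t in zip(blend, y)) / len(y)

    chosen, best_score = [], float("inf")
    for _ in range(max_rounds):
        candidates = []
        for name in preds:
            trial = chosen + [name]
            # Uniform average over the multiset of chosen models.
            blend = [sum(preds[m][i] for m in trial) / len(trial)
                     for i in range(len(y))]
            candidates.append((mse(blend), name))
        score, name = min(candidates)
        if score >= best_score:
            break
        best_score, chosen = score, chosen + [name]
    weights = {m: chosen.count(m) / len(chosen) for m in set(chosen)}
    return weights, best_score

y = [1.0, 2.0, 3.0]
preds = {"good": [1.1, 2.0, 2.9], "bad": [3.0, 3.0, 3.0]}
weights, score = greedy_ensemble(preds, y)
```

    Selection with replacement is what turns counts into soft weights: a model added three times out of five ends up with weight 0.6.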

    Note that the validation windows of layer 1 become the training set of layer 2, so only the last validation window of layer 1 serves as the validation set of layer 2. Instead of trying to figure out which single approach is best, layer 2 should try them all, since these steps are computationally cheap.

    Layer 3

    Time to stack more layers. The tabular approach yielded predictions from another round of bagged models, and the time series approach yielded predictions from different ensembling strategies. Layer 3 simply uses one of the ensembling strategies from the layer 2 time series step to create the final meta-model. This is the model that should be evaluated on the test set, though it’s a good idea to verify that it actually outperforms the base models. The final model should almost always win, and it is less sensitive to bad predictions from any single model, since bad predictions can be down-weighted and tend to average out. Conversely, if one model picks up on a pattern that the others miss, the multi-layer stack can learn to amplify those predictions. The only cases where this is ineffective are when one model is better across the board, which is quite rare, or when one or more base models are quite bad, in which case they should be removed entirely.

    Was it all worth it?

    Probably. The downside is that it requires training many models instead of one. If datasets are sufficiently large, training and inference time can quickly become a constraint for certain applications. The counterargument is that the process is highly parallelizable, and efficient algorithms can be used in place of deep learning where needed; LightGBM is an order of magnitude faster than deep learning and is often still competitive.

    This philosophy of ensembling ensembles has been popularized and fully adopted by AutoGluon. In fact, it is the default strategy in their AutoML offering, and their team has contributed a great deal to both the open-source community and to bleeding edge research in the field. As the pre-training frontier for tabular/time series transformers has yet to be fully explored, expect the added diversity of models-to-come to further strengthen this strategy.

    There is good reason to believe this philosophy will continue to win, as it has in many other domains:

    • Democracy is an ensemble of elected officials, and elected officials represent the ensemble of their constituents (in theory at least). While not perfect, it’s still the best system yet.
    • Medical diagnosis improves with multiple opinions. Combining assessments from multiple radiologists, pathologists, or specialists consistently reduces misdiagnosis rates. Each doctor may catch different patterns or edge cases, and their combined judgment is more reliable than any individual assessment.
    • Even equities markets are an ensemble of beliefs about the future. While historically the information contained in the moves of these markets has not been directly relevant to most people, prediction markets and forecasting platforms are changing this.
    • In Claude Code’s recent release (February 2026), Anthropic introduced collaborative “agent teams” where multiple Claude instances work together on tasks, coordinating through shared task lists and peer-to-peer communication. xAI uses a similar multi-agent approach with Grok 4 Heavy/Grok 4.20, where independent agents work in parallel and “cross-validate” each other’s solutions before converging on a final answer.

    It turns out teamwork is the way to go. Ensembles of ensembles of ensembles show up repeatedly in the best systems humans have created, and the machine learning domain is no exception. In the age of intelligence, scaling this idea will not be optional.
