
Forecast ensemble
Model

Combination of several base forecasters -- simple averages, weighted blends, or stacking -- to reduce single-model risk.

Why does averaging several mediocre forecasts almost always beat the single best individual forecast?

Background

Bates and Granger (1969) noticed something striking: a simple average of two airline-passenger forecasts beat each individual forecast. This was not a fluke. Four decades of subsequent evidence---surveyed by Timmermann (2006) and Genre, Kenny, Meyler, and Timmermann (2013)---confirms that forecast combination is one of the most reliable empirical regularities in economics. The "forecast combination puzzle" is that simple averages often beat sophisticated optimal weighting schemes, likely because the estimation error in optimal weights offsets their theoretical advantage in small samples.

The mechanism is variance reduction through diversification. If two forecasters make independent errors of equal variance, their average has half the error variance of either one. In practice, forecast errors are correlated, but never perfectly so. As long as individual models capture partially distinct information, combining them cancels some errors. The mathematics is identical to portfolio diversification: the combined forecast is a portfolio of predictions, and the combination weights are portfolio weights that minimize forecast-error variance subject to the constraint that weights sum to one.
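As a back-of-the-envelope check on this logic, the short simulation below (hypothetical numbers, plain NumPy, not drawn from any source cited here) compares the error variance of a 50/50 average of two equally accurate forecasters at several error correlations:

```python
import numpy as np

# Illustrative simulation: two forecasters with equal error variance sigma^2.
# The equal-weight average has error variance sigma^2 * (1 + rho) / 2, so the
# gain from combining grows as the error correlation rho falls.
rng = np.random.default_rng(0)
n = 100_000
sigma = 1.0  # common standard deviation of forecast errors (assumed)

for rho in [1.0, 0.8, 0.5, 0.0]:
    cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
    e = rng.multivariate_normal([0.0, 0.0], cov, size=n)  # errors of models 1 and 2
    e_avg = e.mean(axis=1)                                 # error of the 50/50 combination
    print(f"rho={rho:.1f}  var(individual)={e[:, 0].var():.3f}  "
          f"var(average)={e_avg.var():.3f}")
```

With perfectly correlated errors the average buys nothing; with independent errors the variance halves, which is the portfolio logic described above.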

Modern practice goes beyond simple averaging. The ECB combines density forecasts from its Survey of Professional Forecasters using optimal linear pools. The Bank of England's Monetary Policy Committee informally combines outputs from multiple internal models (COMPASS, DSGE, SVARs) when producing its Inflation Report projections. The Federal Reserve Bank of Atlanta's GDPNow is itself a weighted combination of bridge equations, each using different indicator subsets. Academic researchers use Bayesian model averaging (BMA), where the weights are posterior model probabilities proportional to marginal likelihoods.

The field has moved from static to adaptive combination. Stock and Watson (2004) showed that the best individual forecasting model changes over time---a model that dominated in the 1990s may fail in the 2000s. Fixed weights estimated from a historical sample can assign high weight to a model whose advantage has evaporated. Time-varying weights that adapt to recent forecast performance---exponentially weighted moving averages of squared errors, Bayesian learning, or regime-switching weights---address this instability. Adaptive combinations sacrifice some statistical efficiency for robustness to structural change.
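A minimal sketch of one such adaptive scheme, inverse discounted-MSE weights with an illustrative forgetting factor (the function name and parameter values are assumptions, not a description of any institution's production system):

```python
import numpy as np

def adaptive_weights(errors: np.ndarray, decay: float = 0.9) -> np.ndarray:
    """Inverse discounted-MSE weights, one row of weights per period.

    errors: (T, M) array of past forecast errors for M models.
    decay:  forgetting factor in (0, 1); smaller values react faster to
            recent performance (0.9 is an illustrative choice).
    """
    T, M = errors.shape
    ewma = np.full(M, np.mean(errors[0] ** 2) + 1e-12)  # initial discounted MSE
    weights = np.empty((T, M))
    for t in range(T):
        ewma = decay * ewma + (1.0 - decay) * errors[t] ** 2
        inv = 1.0 / (ewma + 1e-12)
        weights[t] = inv / inv.sum()  # normalise so the weights sum to one
    return weights

# Example: model 0's accuracy deteriorates halfway through, so its weight falls.
rng = np.random.default_rng(1)
err = rng.normal(0.0, 1.0, size=(200, 2))
err[100:, 0] *= 3.0  # structural break in model 0's errors
w = adaptive_weights(err)
print("weights at t=50 :", np.round(w[50], 2))
print("weights at t=199:", np.round(w[199], 2))
```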

How the Parts Fit Together

The inputs are $M$ individual forecasts $\hat{y}_1, \ldots, \hat{y}_M$ for a common target variable $y$ at a common horizon $h$. Each forecast comes from a distinct model (VAR, DSGE, ARIMA, factor model, judgmental survey, etc.) or from the same model class with different specifications. The forecasts may be point forecasts (scalars) or density forecasts (full predictive distributions). Historical forecast errors are needed for weight estimation: a holdout sample of $T_0$ periods where both the forecasts and the realized values are observed.

The combined forecast is $\hat{y}_c = \sum_{m=1}^M w_m \hat{y}_m$, where $w_m \geq 0$ and $\sum_m w_m = 1$. The weights $w_m$ are chosen to minimize the forecast-error variance of the combination. Under the assumption that forecast errors have constant covariance $\Sigma_e$ (an $M \times M$ matrix), the optimal weights are $w^* = \Sigma_e^{-1} \iota / (\iota' \Sigma_e^{-1} \iota)$, where $\iota$ is a vector of ones. This is the minimum-variance portfolio from finance. In practice, $\Sigma_e$ must be estimated from the holdout sample, introducing estimation error that can make the estimated optimal weights worse than equal weights.
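A minimal sketch of these minimum-variance weights, assuming a holdout array of forecast errors with one column per model (the names and the optional ridge term are illustrative additions):

```python
import numpy as np

def min_variance_weights(errors: np.ndarray, ridge: float = 0.0) -> np.ndarray:
    """Bates-Granger optimal weights w* = Sigma^{-1} iota / (iota' Sigma^{-1} iota).

    errors: (T0, M) holdout forecast errors, one column per model.
    ridge:  optional shrinkage added to the diagonal of the estimated
            covariance; stabilises the inverse when T0 is small relative
            to M (an illustrative device, not part of the formula above).
    """
    sigma = np.cov(errors, rowvar=False)           # estimated error covariance (M x M)
    sigma = sigma + ridge * np.eye(sigma.shape[0])
    iota = np.ones(sigma.shape[0])
    sigma_inv_iota = np.linalg.solve(sigma, iota)  # Sigma^{-1} iota without explicit inversion
    return sigma_inv_iota / (iota @ sigma_inv_iota)

# Usage: weights from 80 holdout periods, applied to three new point forecasts.
rng = np.random.default_rng(2)
holdout_errors = rng.normal(0.0, 1.0, size=(80, 3))
w = min_variance_weights(holdout_errors)
new_forecasts = np.array([2.1, 1.8, 2.4])          # hypothetical point forecasts
print("weights:", np.round(w, 3), " combined forecast:", round(float(w @ new_forecasts), 3))
```

Note that the closed form imposes only the sum-to-one constraint; it can deliver negative weights, so a non-negativity restriction has to be added separately if it is wanted.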

Weight estimation methods range from simple to complex. Equal weights ($w_m = 1/M$) ignore the covariance structure entirely but avoid estimation error. Inverse-MSE weights set $w_m \propto 1/\text{MSE}_m$, using only individual model performance without cross-model covariance. Regression-based weights run the regression $y_t = \beta_0 + \sum_m \beta_m \hat{y}_{m,t} + \varepsilon_t$ and use $\hat{\beta}_m$ as weights (Granger-Ramanathan, 1984), but this requires enough holdout data to estimate $M+1$ parameters reliably. Bayesian model averaging sets $w_m$ proportional to the posterior model probability $p(\mathcal{M}_m \mid \text{data})$, computed from the marginal likelihood of each model. Time-varying weights use exponential decay: $w_{m,t} \propto \exp(-\alpha \sum_{s=1}^{t} (y_s - \hat{y}_{m,s})^2)$ with a forgetting factor $\alpha$.
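The simpler schemes above fit in a few lines; the sketch below computes equal, inverse-MSE, and Granger-Ramanathan regression weights from the same hypothetical holdout sample (the data and array names are made up for illustration):

```python
import numpy as np

def equal_weights(M: int) -> np.ndarray:
    return np.full(M, 1.0 / M)

def inverse_mse_weights(y: np.ndarray, yhat: np.ndarray) -> np.ndarray:
    """Weights proportional to 1 / MSE_m over the holdout sample."""
    mse = np.mean((y[:, None] - yhat) ** 2, axis=0)
    w = 1.0 / mse
    return w / w.sum()

def granger_ramanathan_weights(y: np.ndarray, yhat: np.ndarray) -> np.ndarray:
    """Unconstrained OLS of realisations on the M forecasts plus a constant.

    Returns (beta_0, beta_1, ..., beta_M): the slopes act as weights and the
    intercept absorbs any common bias in the individual forecasts.
    """
    X = np.column_stack([np.ones(len(y)), yhat])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Hypothetical holdout: T0 = 60 periods, M = 3 forecasts of the same target.
rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=60)
yhat = y[:, None] + rng.normal(0.0, [0.5, 0.8, 1.2], size=(60, 3))  # model-specific errors

print("equal              :", equal_weights(3))
print("inverse-MSE        :", np.round(inverse_mse_weights(y, yhat), 3))
print("Granger-Ramanathan :", np.round(granger_ramanathan_weights(y, yhat), 3))
```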

Applications

The ECB's Survey of Professional Forecasters (SPF) collects density forecasts from approximately 60 institutions for euro-area GDP growth, inflation, and unemployment. The ECB staff combines these into a weighted density forecast, where the weights reflect each forecaster's past predictive accuracy. The combined density is tighter (lower entropy) than most individual densities, demonstrating the variance-reduction benefit. This combined forecast enters the Governing Council's deliberations alongside the ECB's own staff projections.

The Federal Reserve Bank of Philadelphia's Survey of Professional Forecasters publishes a consensus (median and mean) forecast that is itself a forecast combination. Research by Reifschneider and Tulip (2019) shows that the SPF consensus systematically outperforms the Fed's Greenbook (now Tealbook) forecast at horizons beyond one quarter, even though Greenbook forecasters have access to confidential data. The combination of many diverse perspectives apparently compensates for the information advantage of a single well-resourced institution.

Central banks that maintain multiple internal models combine them informally. The Bank of England's Monetary Policy Committee sees output from at least four models (COMPASS central forecast, DSGE variant, reduced-form VARs, judgmental overlays). The committee members weight these outputs subjectively, but the process is structured to ensure that no single model dominates. Research by Kapetanios, Mitchell, Price, and Fawcett (2015) formalized this by showing that density combination with time-varying weights improves upon any fixed-weight scheme for the Bank's inflation forecast.

Forecast combination fails when the model pool lacks diversity. Combining five VARs with slightly different lag lengths provides little benefit because their errors are nearly identical. The gains come from combining structurally different approaches: a statistical model (VAR) with a structural model (DSGE), a machine-learning method with a judgmental forecast, or a domestic model with one that emphasizes international linkages. Combination also fails when the number of models $M$ is large relative to the holdout $T_0$, because weight estimation becomes unreliable. In this regime, equal weights dominate, and the ensemble reduces to a simple average.

Literature and Extensions

Key Papers

  • Bates, Granger (1969) --- foundational paper on forecast combination, established the basic variance-reduction principle
  • Granger, Ramanathan (1984) --- regression-based combination weights with and without constraints
  • Stock, Watson (2004) --- documented time variation in relative forecast performance, motivating adaptive combination
  • Timmermann (2006) --- comprehensive survey of forecast combination methods and the combination puzzle
  • Hall, Mitchell (2007) --- optimal combination of density forecasts using the logarithmic scoring rule

Named Variants

  • Bayesian model averaging --- weights proportional to posterior model probabilities
  • Exponentially weighted combination --- recent forecast errors weighted more heavily, adapting to structural change
  • Stacked generalization --- uses a meta-learner (regression, LASSO, neural network) to combine base model forecasts
  • Trimmed mean / median combination --- robust to outlier forecasts by discarding extreme predictions
  • Density forecast combination --- combines full predictive distributions, not just point forecasts

Open Questions

  • Why simple averages outperform estimated optimal weights in most empirical settings (the forecast combination puzzle)
  • How to optimally combine density forecasts when individual predictive distributions are misspecified
  • Whether machine-learning-based meta-learners can reliably improve upon equal weighting for macro forecast combination

Components

$\hat{y}_m$: Individual forecast

Point forecast from model $m$ for the target variable $y$ at forecast horizon $h$.

$w_m$: Combination weight

Weight assigned to model $m$ in the combined forecast. Non-negative, summing to one.

$\Sigma_e$: Forecast-error covariance

The $M \times M$ covariance matrix of individual forecast errors. Off-diagonal entries capture error correlation across models.

$\hat{y}_c$: Combined forecast

The weighted combination $\hat{y}_c = \sum_m w_m \hat{y}_m$. The main output of the ensemble.

$\text{MSE}_m$: Individual model MSE

Mean squared forecast error of model $m$ over the holdout period. Used for inverse-MSE weighting.

$T_0$: Holdout sample size

Number of past periods with observed forecasts and realizations, used to estimate combination weights.

Assumptions

Stationary forecast-error distribution (Testable)

The joint distribution of forecast errors $(e_1, \ldots, e_M)$ is stationary over the holdout and forecast periods.

If violated: Weights estimated from historical errors are no longer optimal. Structural breaks, policy shifts, or model degradation change the error covariance over time.

Unbiased individual forecasts (Testable)

Each individual forecast is unbiased: $E[y - \hat{y}_m] = 0$ for all $m$.

If violated: Biased forecasts transmit their bias to the combination unless the weights are constrained or a constant is included (Granger-Ramanathan regression). Mincer-Zarnowitz tests detect individual forecast bias.

Diversified information sets (Testable)

Individual models use partially non-overlapping information, so their errors are imperfectly correlated.

If violated: If all models make the same errors (perfect correlation), combination provides zero variance reduction. The benefit of combining scales with the degree of error diversification.

Sufficient holdout for weight estimation (Maintained)

The holdout sample $T_0$ is large enough to estimate the covariance $\Sigma_e$ reliably. For optimal weights, $T_0 \gg M$.

If violated: Estimated optimal weights overfit the holdout sample. Sample covariance inversion amplifies estimation error. Equal weights become preferable when $T_0 / M$ is small.

Correctly specified models for BMA (Maintained)

The true data-generating process is one of the $M$ candidate models (or well-approximated by one).

If violated: BMA posterior concentrates on the single best-approximating model, missing the combination benefits that arise from averaging across structurally different models.

Symmetric loss function (Maintained)

The loss function is quadratic (MSE). Over-prediction and under-prediction are equally costly.

If violated: Under asymmetric loss (e.g., underestimating inflation is worse than overestimating), MSE-optimal weights are suboptimal. Combination should target the asymmetric loss function directly.