
Ridge regression
Model

L2-penalized linear regression -- shrinks coefficients toward zero to stabilize forecasts when predictors are many or collinear.

How do you estimate regression coefficients when your predictors are highly collinear or outnumber your observations?

Background

Ordinary least squares breaks down when the Gram matrix $X'X$ is nearly singular. Multicollinearity inflates coefficient variances, making individual estimates unreliable even if the overall fit is good. Hoerl and Kennard (1970) proposed adding a penalty $\lambda I$ to the normal equations, shrinking coefficients toward zero and stabilizing the inversion. The resulting estimator, $\hat{\beta}_{\text{ridge}} = (X'X + \lambda I)^{-1} X'y$, is biased but can have dramatically lower mean squared error than OLS when collinearity is severe.

The mechanics are simple. The penalty $\lambda \|\beta\|_2^2$ added to the OLS objective function penalizes the sum of squared coefficients. This is equivalent to placing a Gaussian prior $\beta \sim N(0, \lambda^{-1} I)$ on the coefficient vector in a Bayesian regression. The penalty parameter $\lambda$ controls the bias-variance tradeoff: $\lambda = 0$ recovers OLS, and $\lambda \to \infty$ shrinks all coefficients to zero. Cross-validation or generalized cross-validation selects $\lambda$ to minimize out-of-sample prediction error.
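To make the prior equivalence concrete, here is the one-line posterior-mode derivation (a sketch assuming unit error variance, so the prior variance is exactly $\lambda^{-1}$):

\[
\hat{\beta}_{\text{MAP}} = \arg\max_{\beta}\,\big[\log p(y \mid \beta) + \log p(\beta)\big] = \arg\min_{\beta}\,\big[\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2\big] = \hat{\beta}_{\text{ridge}},
\]

since $-2\log p(y \mid \beta) = \|y - X\beta\|_2^2 + \text{const}$ under $y \mid \beta \sim N(X\beta, I)$, and $-2\log p(\beta) = \lambda \|\beta\|_2^2 + \text{const}$ under $\beta \sim N(0, \lambda^{-1} I)$.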

In macroeconomics, ridge regression appears whenever the predictor set is large relative to the sample. Nowcasting GDP from 100+ monthly indicators runs into a "fat" design matrix. Phillips-curve estimation with many candidate slack measures and their lags produces collinearity. International growth regressions with 60+ potential determinants (the Sala-i-Martin "I just ran two million regressions" problem) face the same issue. More recently, Plagborg-Moller and Wolf (2021) used ridge-regularized local projections to estimate impulse responses, showing that shrinkage can substantially tighten confidence bands at longer horizons.

The Federal Reserve Bank of New York's nowcasting model uses ridge-type shrinkage within its dynamic factor framework. The Bank of England's COMPASS suite applies regularization to large conditioning sets. Academic researchers at central banks routinely use ridge as a benchmark when comparing regularized versus factor-based forecasting methods. The estimator's computational simplicity---a single matrix inversion---makes it attractive when the model needs frequent re-estimation, as in real-time forecasting settings.

How the Parts Fit Together

The input is a standard regression dataset: $n$ observations of a dependent variable $y$ and $p$ predictors $X$. In macro applications, $y$ is typically a forecast target (next-quarter GDP growth, inflation) and $X$ collects a large set of candidate predictors (financial variables, survey indicators, labor market series, international data). The predictors are almost always standardized to unit variance before estimation so that the penalty treats all coefficients symmetrically.
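A minimal sketch of that preprocessing step, assuming $X$ is an $n \times p$ numpy array with no intercept column (the intercept is absorbed by centering $y$):

import numpy as np

def standardize(X, y):
    # Z-score each predictor and center the target so the ridge
    # penalty treats all coefficients symmetrically and no
    # intercept needs to be penalized.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_centered = y - y.mean()
    return X_std, y_centered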

The model solves $\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$, where $\lambda \geq 0$ is the penalty parameter. The closed-form solution is $\hat{\beta}_{\text{ridge}} = (X'X + \lambda I_p)^{-1} X'y$. Geometrically, OLS finds the coefficient vector that minimizes the residual sum of squares; ridge constrains that vector to lie within a ball of radius $t$, $\|\beta\|_2 \leq t$, for some $t$ that maps one-to-one to $\lambda$. The constraint set is convex with a smooth boundary, so the solution is unique and continuous in $\lambda$.
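A direct numpy translation of that closed form (a sketch; assumes predictors already standardized and $y$ centered, as above):

import numpy as np

def ridge_closed_form(X, y, lam):
    # beta_ridge = (X'X + lam * I)^{-1} X'y, computed with a linear
    # solve rather than an explicit inverse for numerical stability.
    # lam = 0 recovers OLS; large lam shrinks all coefficients to zero.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)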

The penalty parameter $\lambda$ is selected by $k$-fold cross-validation (typically $k = 10$) or generalized cross-validation (GCV), which provides an analytical approximation to leave-one-out CV. The GCV criterion is $\text{GCV}(\lambda) = n^{-1} \|y - X\hat{\beta}_{\text{ridge}}\|^2 / (1 - n^{-1} \text{tr}(H_\lambda))^2$, where $H_\lambda = X(X'X + \lambda I)^{-1} X'$ is the ridge hat matrix. The effective degrees of freedom $\text{df}(\lambda) = \text{tr}(H_\lambda)$ decreases monotonically from $p$ (at $\lambda = 0$) to $0$ (as $\lambda \to \infty$), providing a continuous measure of model complexity.
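Both quantities fall out of a single SVD of $X$, so a whole grid of penalties can be scanned cheaply; the sketch below uses the identity $H_\lambda = U\,\text{diag}(d_j^2/(d_j^2+\lambda))\,U'$ from the SVD $X = UDV'$:

import numpy as np

def gcv_path(X, y, lambdas):
    # One SVD gives fitted values and df for every lambda on the grid.
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    n = X.shape[0]
    Uy = U.T @ y
    results = []
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)        # shrinkage factor per singular value
        df = shrink.sum()                   # df(lam) = tr(H_lam)
        fitted = U @ (shrink * Uy)          # H_lam @ y
        rss = np.sum((y - fitted) ** 2)
        gcv = (rss / n) / (1.0 - df / n) ** 2
        results.append((lam, df, gcv))
    return results                          # pick the lambda with minimal GCV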

Applications

The Federal Reserve Bank of New York's nowcasting framework uses regularized regressions (including ridge) when estimating bridge equations from large sets of monthly indicators. With 100+ candidate predictors released on staggered schedules, OLS on the full set is infeasible for most vintages. Ridge provides stable coefficient estimates that update smoothly as new data arrive, which is critical for a real-time forecasting system where erratic coefficient jumps would undermine credibility.

Phillips-curve estimation illustrates the collinearity problem ridge was built for. If you include six measures of economic slack (output gap, unemployment gap, capacity utilization gap, employment-to-population ratio, vacancy-unemployment ratio, underemployment rate) plus their first and second lags, you have 18 highly correlated predictors. OLS assigns wild, offsetting coefficients to near-identical series. Ridge pulls these coefficients toward each other (and toward zero), producing a composite slack measure implicitly. De Mol, Giannone, and Reichlin (2008) showed that ridge regression on a large predictor set can perform comparably to principal-components-based factor models for macro forecasting.
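A toy simulation (synthetic data, not actual slack series) makes the offsetting-coefficients pathology visible:

import numpy as np

rng = np.random.default_rng(0)
n = 120                                     # ~30 years of quarterly data
slack = rng.standard_normal(n)              # one latent slack factor
X = slack[:, None] + 0.05 * rng.standard_normal((n, 6))   # six near-identical measures
y = -0.5 * slack + rng.standard_normal(n)   # target responds to true slack

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(6), X.T @ y)
print(np.round(b_ols, 2))     # typically erratic, with offsetting signs
print(np.round(b_ridge, 2))   # six similar small coefficients summing near -0.5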

In academic applied work, ridge serves as a benchmark in forecast comparison exercises. Researchers evaluating new machine-learning methods for macro forecasting (random forests, neural networks, gradient boosting) almost always include ridge as a baseline because it is the simplest regularized method with a closed-form solution. When ridge beats a fancy model, it usually means the fancy model is overfitting. When a fancy model barely beats ridge, the nonlinearity gains are negligible.

Ridge fails when variable selection matters. It keeps every predictor in the model, shrinking weak ones toward zero but never eliminating them. For practitioners who need to identify which predictors drive the outcome (policy evaluation, structural analysis, model interpretability), ridge is the wrong tool. It also fails when the true model is very sparse: if only 5 of 100 predictors matter, LASSO or elastic net will outperform ridge by zeroing out the irrelevant 95. Finally, ridge treats all predictors as exchangeable through the isotropic penalty $\lambda I$. If prior knowledge suggests that some groups of predictors should be penalized differently (e.g., financial variables less than survey variables), group-penalty methods like group LASSO or Bayesian hierarchical regression are more appropriate.
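A quick check of that sparse-truth scenario, assuming scikit-learn is available (note that sklearn calls the penalty alpha rather than $\lambda$):

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
n, p, k = 200, 100, 5                       # 100 candidates, only 5 matter
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0
y = X @ beta + rng.standard_normal(n)

ridge = RidgeCV(alphas=np.logspace(-2, 4, 30)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)
print((ridge.coef_ != 0).sum())   # all 100 predictors retained
print((lasso.coef_ != 0).sum())   # far fewer, concentrated on the true 5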

Literature and Extensions

Key Papers

  • Hoerl, Kennard (1970) --- introduced ridge regression, proved MSE improvement over OLS under collinearity
  • Golub, Heath, Wahba (1979) --- generalized cross-validation for selecting the penalty parameter
  • De Mol, Giannone, Reichlin (2008) --- ridge regression as a Bayesian shrinkage alternative to factor models for macro forecasting
  • Plagborg-Moller, Wolf (2021) --- ridge-regularized local projections for impulse response estimation
  • Hastie, Tibshirani, Friedman (2009, Ch. 3) --- textbook treatment embedding ridge in the penalized regression family

Named Variants

  • Kernel ridge regression --- extends to nonlinear relationships via the kernel trick without changing the penalty structure
  • Generalized ridge --- allows a diagonal penalty matrix $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ with predictor-specific penalties
  • Ridge with grouped penalties --- different $\lambda$ for different blocks of predictors (financial vs. real variables)
  • Bayesian ridge --- full posterior inference under Gaussian prior, providing credible intervals on coefficients
  • Elastic net --- combines L2 (ridge) and L1 (LASSO) penalties for simultaneous shrinkage and selection

Open Questions

  • Whether adaptive shrinkage methods (e.g., empirical Bayes or cross-validated elastic net) dominate fixed-lambda ridge in real-time macro forecasting
  • How to select the penalty parameter in time-series settings where standard cross-validation violates temporal ordering
  • Whether ridge-regularized impulse responses have better coverage properties than standard local projection confidence intervals in finite samples

Components

$\hat{\beta}_{\text{ridge}}$: Ridge coefficient vector

The $p \times 1$ vector of penalized regression coefficients, $(X'X + \lambda I)^{-1} X'y$. Biased toward zero but lower variance than OLS.

$\lambda$: Penalty parameter

Non-negative scalar controlling shrinkage intensity. Zero recovers OLS; infinity sends all coefficients to zero.

$X'X$: Gram matrix

The $p \times p$ matrix of predictor cross-products. Its eigenvalues determine collinearity severity; small eigenvalues signal near-singularity.

$\text{df}(\lambda)$: Effective degrees of freedom

Trace of the hat matrix $H_\lambda$. Equals $\sum_j d_j^2 / (d_j^2 + \lambda)$, where the $d_j$ are the singular values of $X$. Continuously interpolates between $p$ and $0$.

$\text{GCV}(\lambda)$: Generalized cross-validation score

Analytical approximation to leave-one-out CV error. Used to select $\lambda$ without repeatedly refitting the model.

$\text{VIF}_j$: Variance inflation factor

Measures collinearity of predictor $j$ with all others: $\text{VIF}_j = (1 - R_j^2)^{-1}$. Values above 10 signal problematic collinearity that ridge was designed to address.
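A small sketch of that diagnostic, regressing each predictor on the rest (the helper name vif is mine, not from the source; assumes centered columns):

import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    # column j of X on all the other (centered) columns.
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        out[j] = X[:, j].var() / resid.var()   # equals 1/(1 - R_j^2)
    return out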

Assumptions

Linear relationship (Testable)

The conditional expectation is linear: $E[y \mid X] = X\beta$ for some unknown $\beta$.

If violated: Ridge shrinks toward a wrong model. Residual plots and partial-response checks reveal nonlinearity.

Independent errors (Testable)

The error terms $\varepsilon_i$ are uncorrelated: $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

If violated: Cross-validation scores are invalid if errors are serially correlated (common in time-series data). Block cross-validation or time-series CV is required.
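A sketch of that time-series-aware selection using scikit-learn's forward-chaining splitter (toy data; sklearn's alpha is the $\lambda$ above):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
y = X[:, 0] + rng.standard_normal(200)

# Each validation fold lies strictly after its training window,
# so serially correlated errors cannot leak across the split.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-2, 4, 30)},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
).fit(X, y)
print(search.best_params_)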

Homoskedasticity for GCV (Testable)

Errors have constant variance: $\text{Var}(\varepsilon_i) = \sigma^2$ for all $i$.

If violated: GCV is inconsistent under heteroskedasticity. Use $k$-fold CV instead, which does not require constant variance.

Centered and scaled predictors (Maintained)

Predictors are standardized to zero mean and unit variance before applying the penalty.

If violated: Without standardization, the penalty penalizes coefficients on different scales unequally. A predictor measured in billions gets a near-zero coefficient regardless of its importance.

All predictors are relevant, no sparsity (Maintained)

The true coefficient vector is dense: most predictors contribute at least weakly to $y$.

If violated: If the truth is sparse (only a few predictors matter), ridge keeps all of them in the model with small nonzero coefficients. LASSO, which sets some coefficients exactly to zero, is more appropriate under sparsity.

Penalty does not grow with sample size (Maintained)

The selected $\lambda$ satisfies $\lambda / n \to 0$ as $n \to \infty$, so the ridge estimator converges to OLS in large samples.

If violated: If $\lambda$ is fixed regardless of sample size, the bias does not vanish and the estimator is inconsistent. Cross-validation naturally selects smaller $\lambda$ as $n$ grows.