
Ridge regression
Model

L2-penalized linear regression -- shrinks coefficients toward zero to stabilize forecasts when predictors are many or collinear.

How do you estimate regression coefficients when your predictors are highly collinear or outnumber your observations?

Background

Ordinary least squares breaks down when the Gram matrix $X'X$ is nearly singular. Multicollinearity inflates coefficient variances, making individual estimates unreliable even if the overall fit is good. Hoerl and Kennard (1970) proposed adding a penalty $\lambda I$ to the normal equations, shrinking coefficients toward zero and stabilizing the inversion. The resulting estimator, $\hat{\beta}_{\text{ridge}} = (X'X + \lambda I)^{-1} X'y$, is biased but can have dramatically lower mean squared error than OLS when collinearity is severe.

The mechanics are simple. The penalty $\lambda \|\beta\|_2^2$ added to the OLS objective function penalizes the sum of squared coefficients. This is equivalent to placing a Gaussian prior $\beta \sim N(0, \lambda^{-1} I)$ on the coefficient vector in a Bayesian regression. The penalty parameter $\lambda$ controls the bias-variance tradeoff: $\lambda = 0$ recovers OLS, and $\lambda \to \infty$ shrinks all coefficients to zero. Cross-validation or generalized cross-validation selects $\lambda$ to minimize out-of-sample prediction error.
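To make the prior equivalence concrete, here is the one-line posterior-mode derivation (a sketch assuming unit error variance, so the prior variance is exactly $\lambda^{-1}$):

\[
\hat{\beta}_{\text{MAP}} = \arg\max_{\beta}\,\big[\log p(y \mid \beta) + \log p(\beta)\big] = \arg\min_{\beta}\,\big[\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2\big] = \hat{\beta}_{\text{ridge}},
\]

since $-2\log p(y \mid \beta) = \|y - X\beta\|_2^2 + \text{const}$ under $y \mid \beta \sim N(X\beta, I)$, and $-2\log p(\beta) = \lambda \|\beta\|_2^2 + \text{const}$ under $\beta \sim N(0, \lambda^{-1} I)$.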

In macroeconomics, ridge regression appears whenever the predictor set is large relative to the sample. Nowcasting GDP from 100+ monthly indicators runs into a "fat" design matrix. Phillips-curve estimation with many candidate slack measures and their lags produces collinearity. International growth regressions with 60+ potential determinants (the Sala-i-Martin "I just ran two million regressions" problem) face the same issue. More recently, Plagborg-Moller and Wolf (2021) used ridge-regularized local projections to estimate impulse responses, showing that shrinkage can substantially tighten confidence bands at longer horizons.

The Federal Reserve Bank of New York's nowcasting model uses ridge-type shrinkage within its dynamic factor framework. The Bank of England's COMPASS suite applies regularization to large conditioning sets. Academic researchers at central banks routinely use ridge as a benchmark when comparing regularized versus factor-based forecasting methods. The estimator's computational simplicity---a single matrix inversion---makes it attractive when the model needs frequent re-estimation, as in real-time forecasting settings.

How the Parts Fit Together

The input is a standard regression dataset: $n$ observations of a dependent variable $y$ and $p$ predictors $X$. In macro applications, $y$ is typically a forecast target (next-quarter GDP growth, inflation) and $X$ collects a large set of candidate predictors (financial variables, survey indicators, labor market series, international data). The predictors are almost always standardized to unit variance before estimation so that the penalty treats all coefficients symmetrically.
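A minimal sketch of that preprocessing step, assuming $X$ is an $n \times p$ numpy array with no intercept column (the intercept is absorbed by centering $y$):

import numpy as np

def standardize(X, y):
    # Z-score each predictor and center the target so the ridge
    # penalty treats all coefficients symmetrically and no
    # intercept needs to be penalized.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    y_centered = y - y.mean()
    return X_std, y_centered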

The model solves $\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$, where $\lambda \geq 0$ is the penalty parameter. The closed-form solution is $\hat{\beta}_{\text{ridge}} = (X'X + \lambda I_p)^{-1} X'y$. Geometrically, OLS finds the coefficient vector that minimizes the residual sum of squares; ridge constrains that vector to lie within a ball of radius $t$, $\|\beta\|_2 \leq t$, for some $t$ that maps one-to-one to $\lambda$. The constraint set is convex with a smooth boundary, so the solution is unique and continuous in $\lambda$.
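A direct numpy translation of that closed form (a sketch; assumes predictors already standardized and $y$ centered, as above):

import numpy as np

def ridge_closed_form(X, y, lam):
    # beta_ridge = (X'X + lam * I)^{-1} X'y, computed with a linear
    # solve rather than an explicit inverse for numerical stability.
    # lam = 0 recovers OLS; large lam shrinks all coefficients to zero.
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)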

The penalty parameter $\lambda$ is selected by $k$-fold cross-validation (typically $k = 10$) or generalized cross-validation (GCV), which provides an analytical approximation to leave-one-out CV. The GCV criterion is $\text{GCV}(\lambda) = n^{-1} \|y - X\hat{\beta}_{\text{ridge}}\|^2 / (1 - n^{-1} \text{tr}(H_\lambda))^2$, where $H_\lambda = X(X'X + \lambda I)^{-1} X'$ is the ridge hat matrix. The effective degrees of freedom $\text{df}(\lambda) = \text{tr}(H_\lambda)$ decreases monotonically from $p$ (at $\lambda = 0$) to $0$ (as $\lambda \to \infty$), providing a continuous measure of model complexity.
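Both quantities fall out of a single SVD of $X$, so a whole grid of penalties can be scanned cheaply; the sketch below uses the identity $H_\lambda = U\,\text{diag}(d_j^2/(d_j^2+\lambda))\,U'$ from the SVD $X = UDV'$:

import numpy as np

def gcv_path(X, y, lambdas):
    # One SVD gives fitted values and df for every lambda on the grid.
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    n = X.shape[0]
    Uy = U.T @ y
    results = []
    for lam in lambdas:
        shrink = d**2 / (d**2 + lam)        # shrinkage factor per singular value
        df = shrink.sum()                   # df(lam) = tr(H_lam)
        fitted = U @ (shrink * Uy)          # H_lam @ y
        rss = np.sum((y - fitted) ** 2)
        gcv = (rss / n) / (1.0 - df / n) ** 2
        results.append((lam, df, gcv))
    return results                          # pick the lambda with minimal GCV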

Applications

The Federal Reserve Bank of New York's nowcasting framework uses regularized regressions (including ridge) when estimating bridge equations from large sets of monthly indicators. With 100+ candidate predictors released on staggered schedules, OLS on the full set is infeasible for most vintages. Ridge provides stable coefficient estimates that update smoothly as new data arrive, which is critical for a real-time forecasting system where erratic coefficient jumps would undermine credibility.

Phillips-curve estimation illustrates the collinearity problem ridge was built for. If you include six measures of economic slack (output gap, unemployment gap, capacity utilization gap, employment-to-population ratio, vacancy-unemployment ratio, underemployment rate) plus their first and second lags, you have 18 highly correlated predictors. OLS assigns wild, offsetting coefficients to near-identical series. Ridge pulls these coefficients toward each other (and toward zero), producing a composite slack measure implicitly. De Mol, Giannone, and Reichlin (2008) showed that ridge regression on a large predictor set can perform comparably to principal-components-based factor models for macro forecasting.
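A toy simulation (synthetic data, not actual slack series) makes the offsetting-coefficients pathology visible:

import numpy as np

rng = np.random.default_rng(0)
n = 120                                     # ~30 years of quarterly data
slack = rng.standard_normal(n)              # one latent slack factor
X = slack[:, None] + 0.05 * rng.standard_normal((n, 6))   # six near-identical measures
y = -0.5 * slack + rng.standard_normal(n)   # target responds to true slack

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_ridge = np.linalg.solve(X.T @ X + 10.0 * np.eye(6), X.T @ y)
print(np.round(b_ols, 2))     # typically erratic, with offsetting signs
print(np.round(b_ridge, 2))   # six similar small coefficients summing near -0.5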

In academic applied work, ridge serves as a benchmark in forecast comparison exercises. Researchers evaluating new machine-learning methods for macro forecasting (random forests, neural networks, gradient boosting) almost always include ridge as a baseline because it is the simplest regularized method with a closed-form solution. When ridge beats a fancy model, it usually means the fancy model is overfitting. When a fancy model barely beats ridge, the nonlinearity gains are negligible.

Ridge fails when variable selection matters. It keeps every predictor in the model, shrinking weak ones toward zero but never eliminating them. For practitioners who need to identify which predictors drive the outcome (policy evaluation, structural analysis, model interpretability), ridge is the wrong tool. It also fails when the true model is very sparse: if only 5 of 100 predictors matter, LASSO or elastic net will outperform ridge by zeroing out the irrelevant 95. Finally, ridge treats all predictors as exchangeable through the isotropic penalty $\lambda I$. If prior knowledge suggests that some groups of predictors should be penalized differently (e.g., financial variables less than survey variables), group-penalty methods like group LASSO or Bayesian hierarchical regression are more appropriate.
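A quick check of that sparse-truth scenario, assuming scikit-learn is available (note that sklearn calls the penalty alpha rather than $\lambda$):

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(1)
n, p, k = 200, 100, 5                       # 100 candidates, only 5 matter
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0
y = X @ beta + rng.standard_normal(n)

ridge = RidgeCV(alphas=np.logspace(-2, 4, 30)).fit(X, y)
lasso = LassoCV(cv=5).fit(X, y)
print((ridge.coef_ != 0).sum())   # all 100 predictors retained
print((lasso.coef_ != 0).sum())   # far fewer, concentrated on the true 5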

Literature and Extensions

Key Papers

  • Hoerl, Kennard (1970) --- introduced ridge regression, proved MSE improvement over OLS under collinearity
  • Golub, Heath, Wahba (1979) --- generalized cross-validation for selecting the penalty parameter
  • De Mol, Giannone, Reichlin (2008) --- ridge regression as a Bayesian shrinkage alternative to factor models for macro forecasting
  • Plagborg-Moller, Wolf (2021) --- ridge-regularized local projections for impulse response estimation
  • Hastie, Tibshirani, Friedman (2009, Ch. 3) --- textbook treatment embedding ridge in the penalized regression family

Named Variants

  • Kernel ridge regression --- extends to nonlinear relationships via the kernel trick without changing the penalty structure
  • Generalized ridge --- allows a diagonal penalty matrix $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_p)$ with predictor-specific penalties
  • Ridge with grouped penalties --- different $\lambda$ for different blocks of predictors (financial vs. real variables)
  • Bayesian ridge --- full posterior inference under Gaussian prior, providing credible intervals on coefficients
  • Elastic net --- combines L2 (ridge) and L1 (LASSO) penalties for simultaneous shrinkage and selection

Open Questions

  • Whether adaptive shrinkage methods (e.g., empirical Bayes or cross-validated elastic net) dominate fixed-lambda ridge in real-time macro forecasting
  • How to select the penalty parameter in time-series settings where standard cross-validation violates temporal ordering
  • Whether ridge-regularized impulse responses have better coverage properties than standard local projection confidence intervals in finite samples

Components

$\hat{\beta}_{\text{ridge}}$: Ridge coefficient vector

The $p \times 1$ vector of penalized regression coefficients, $(X'X + \lambda I)^{-1} X'y$. Biased toward zero but lower variance than OLS.

$\lambda$: Penalty parameter

Non-negative scalar controlling shrinkage intensity. Zero recovers OLS; infinity sends all coefficients to zero.

$X'X$: Gram matrix

The $p \times p$ matrix of predictor cross-products. Its eigenvalues determine collinearity severity; small eigenvalues signal near-singularity.

$\text{df}(\lambda)$: Effective degrees of freedom

Trace of the hat matrix $H_\lambda$. Equals $\sum_j d_j^2 / (d_j^2 + \lambda)$, where the $d_j$ are the singular values of $X$. Continuously interpolates between $p$ and $0$.

$\text{GCV}(\lambda)$: Generalized cross-validation score

Analytical approximation to leave-one-out CV error. Used to select $\lambda$ without repeatedly refitting the model.

$\text{VIF}_j$: Variance inflation factor

Measures collinearity of predictor $j$ with all others: $\text{VIF}_j = (1 - R_j^2)^{-1}$. Values above 10 signal problematic collinearity that ridge was designed to address.
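A small sketch of that diagnostic, regressing each predictor on the rest (the helper name vif is mine, not from the source; assumes centered columns):

import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    # column j of X on all the other (centered) columns.
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        others = np.delete(X, j, axis=1)
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        out[j] = X[:, j].var() / resid.var()   # equals 1/(1 - R_j^2)
    return out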

Assumptions

Linear relationship (Testable)

The conditional expectation is linear: $E[y \mid X] = X\beta$ for some unknown $\beta$.

If violated: Ridge shrinks toward a wrong model. Residual plots and partial-response checks reveal nonlinearity.

Independent errors (Testable)

The error terms $\varepsilon_i$ are uncorrelated: $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

If violated: Cross-validation scores are invalid if errors are serially correlated (common in time-series data). Block cross-validation or time-series CV is required.
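A sketch of that time-series-aware selection using scikit-learn's forward-chaining splitter (toy data; sklearn's alpha is the $\lambda$ above):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 10))
y = X[:, 0] + rng.standard_normal(200)

# Each validation fold lies strictly after its training window,
# so serially correlated errors cannot leak across the split.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": np.logspace(-2, 4, 30)},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_squared_error",
).fit(X, y)
print(search.best_params_)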

Homoskedasticity for GCV (Testable)

Errors have constant variance: $\text{Var}(\varepsilon_i) = \sigma^2$ for all $i$.

If violated: GCV is inconsistent under heteroskedasticity. Use $k$-fold CV instead, which does not require constant variance.

Centered and scaled predictors (Maintained)

Predictors are standardized to zero mean and unit variance before applying the penalty.

If violated: Without standardization, the penalty penalizes coefficients on different scales unequally. A predictor measured in billions gets a near-zero coefficient regardless of its importance.

All predictors are relevant, no sparsity (Maintained)

The true coefficient vector is dense: most predictors contribute at least weakly to $y$.

If violated: If the truth is sparse (only a few predictors matter), ridge keeps all of them in the model with small nonzero coefficients. LASSO, which sets some coefficients exactly to zero, is more appropriate under sparsity.

Penalty does not grow with sample size (Maintained)

The selected $\lambda$ satisfies $\lambda / n \to 0$ as $n \to \infty$, so the ridge estimator converges to OLS in large samples.

If violated: If $\lambda$ is fixed regardless of sample size, the bias does not vanish and the estimator is inconsistent. Cross-validation naturally selects smaller $\lambda$ as $n$ grows.