
LASSO
Model

L1-penalized linear regression -- drives some coefficients exactly to zero, performing variable selection inside the fit.

Which of these hundred candidate predictors actually matter for forecasting, and can you estimate their effects simultaneously?

Background

OLS treats every predictor as essential. Ridge regression treats every predictor as relevant but shrinks them toward zero. Neither eliminates variables. Tibshirani (1996) proposed the LASSO---Least Absolute Shrinkage and Selection Operator---which replaces ridge's L2 penalty with an L1 penalty: $\lambda \|\beta\|_1 = \lambda \sum_j |\beta_j|$. The geometry of the L1 ball (a diamond in two dimensions) means the constrained optimum often lands on a corner, where one or more coefficients are exactly zero. The LASSO simultaneously estimates coefficients and selects variables, producing a sparse model directly from the optimization.
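A minimal sketch of this contrast, using scikit-learn on simulated data (the sparse data-generating process and the penalty values are illustrative assumptions, not part of the model description above):

```python
# Sketch: L1 produces exact zeros, L2 only shrinks (simulated sparse truth).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))                     # n = 200, p = 50
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)                     # alpha plays the role of lambda
ridge = Ridge(alpha=10.0).fit(X, y)

print("LASSO exact zeros:", np.sum(lasso.coef_ == 0))  # most of the 50 coefficients
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))  # typically none
```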

The mechanics are sharp. The objective is $\min_\beta \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$. Unlike ridge, this problem has no closed-form solution because the L1 penalty is not differentiable at zero. The solution is computed by coordinate descent: cycle through predictors, update each coefficient by soft-thresholding, and repeat until convergence. The LARS (Least Angle Regression) algorithm of Efron et al. (2004) computes the entire solution path---all coefficient values for every $\lambda$ from infinity to zero---at the same computational cost as a single OLS fit. This path is piecewise linear, with kinks at the $\lambda$ values where variables enter or leave the active set.
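A sketch of the full path using scikit-learn's LARS implementation (the library choice and simulated data are assumptions for illustration):

```python
# Sketch: the entire LASSO solution path via LARS (simulated sparse truth).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

# method="lasso" yields the LASSO path: piecewise linear in lambda, with
# kinks wherever a variable enters or leaves the active set.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("kinks along the path:", len(alphas))
print("coefficient path shape (p, n_kinks):", coefs.shape)
```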

In macroeconomics, LASSO gained traction through the work of Bai and Ng (2008) on factor-augmented forecasting and Li and Chen (2014) on selecting relevant predictors from large macro panels. Central bank researchers use LASSO to identify which of dozens of financial indicators actually predict recessions, to select lag lengths in high-dimensional VARs, and to estimate sparse Phillips curves. The Bank of Canada, Reserve Bank of Australia, and Bank of England have all published research using LASSO or its variants for macro forecasting.

The method's limitation is well understood: LASSO's variable selection is unstable when predictors are correlated. If two variables carry the same information, LASSO arbitrarily picks one and zeros the other. Which one it picks can flip with small data perturbations. The elastic net (Zou and Hastie, 2005) addresses this by combining L1 and L2 penalties. The adaptive LASSO (Zou, 2006) uses data-dependent weights to achieve oracle properties---the same selection and estimation efficiency as if the true sparse model were known in advance.

How the Parts Fit Together

Inputs mirror those of ridge: $n$ observations, $p$ predictors (standardized to unit variance), and a dependent variable $y$. Typical macro applications involve $p$ ranging from 50 to 500 candidate predictors: monthly financial variables, survey indices, labor market indicators, international series, and their lags. The critical difference from ridge is that the analyst expects only a subset---perhaps 5 to 20---of these predictors to be active in the true model.

The model solves $\min_\beta \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|$. The L1 penalty creates a non-smooth optimization landscape. At each coordinate $j$, the solution is $\hat{\beta}_j = S(z_j, \lambda)$, where $z_j$ is the partial-residual regression coefficient and $S(z, \lambda) = \text{sign}(z)(|z| - \lambda)_+$ is the soft-thresholding operator. If $|z_j| \leq \lambda$, the coefficient is set exactly to zero. If $|z_j| > \lambda$, it is shrunk toward zero by $\lambda$. The entire algorithm cycles through coordinates repeatedly until all coefficients stabilize.
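A from-scratch sketch of that coordinate-descent loop, kept deliberately naive (the function names and simulated data are illustrative; a real implementation would add convergence checks and warm starts):

```python
# Sketch: naive coordinate descent for the LASSO objective
#   (1/2n) * ||y - X @ beta||^2 + lam * sum(|beta_j|).
# Assumes roughly standardized columns; illustrative, not optimized.
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n    # (1/n) * ||X_j||^2, ~1 if standardized
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: residual with predictor j's contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z_j, lam) / col_ss[j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)
print("active set:", np.flatnonzero(lasso_coordinate_descent(X, y, lam=0.1)))
```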

Penalty selection uses $k$-fold cross-validation, typically $k = 10$. The CV curve plots prediction error against $\log(\lambda)$, and its minimum identifies $\lambda_{\min}$. The one-standard-error rule instead picks $\lambda_{1\text{SE}}$, the largest $\lambda$ whose CV error lies within one SE of the minimum---a more parsimonious model that protects against overfitting. For time-series data, the CV folds must respect temporal ordering (rolling or expanding window). Information criteria (BIC, extended BIC) are alternatives that penalize model complexity directly and avoid cross-validation entirely.
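A sketch of that workflow with scikit-learn, using LassoCV's fold-wise error path and expanding-window splits from TimeSeriesSplit; the one-SE rule is computed by hand here, and the data are simulated, so the temporal ordering is purely illustrative:

```python
# Sketch: penalty selection with time-series folds and the one-SE rule.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

# Expanding-window folds keep every validation set later than its training set.
fit = LassoCV(cv=TimeSeriesSplit(n_splits=5), n_alphas=100).fit(X, y)

mse_mean = fit.mse_path_.mean(axis=1)                  # mean CV error per lambda
mse_se = fit.mse_path_.std(axis=1) / np.sqrt(fit.mse_path_.shape[1])
i_min = mse_mean.argmin()
# alphas_ is sorted in decreasing order, so the first alpha whose error is
# within one SE of the minimum is the largest such alpha.
alpha_1se = fit.alphas_[np.argmax(mse_mean <= mse_mean[i_min] + mse_se[i_min])]
print("lambda_min:", fit.alpha_, "lambda_1SE:", alpha_1se)
```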

Applications

The Reserve Bank of Australia uses LASSO to select predictors from a pool of 150+ monthly indicators when nowcasting GDP. The selected model typically retains 8--15 variables, whose composition shifts across vintages as the economic environment changes. The sparsity constraint forces the model to identify which signals are genuinely informative in each forecast round, avoiding the "kitchen sink" problem that plagues OLS with large predictor sets.

Academic research on Phillips-curve specification uses LASSO to adjudicate among competing slack measures. Stock and Watson (2019) included unemployment, the output gap, capacity utilization, and various labor-market indicators (with lags) as candidates. LASSO's selection reveals which measures contain independent forecasting power for inflation after accounting for the others---a question that generates long debates when addressed by eyeball comparison of in-sample fit.

In financial economics, LASSO identifies which of hundreds of firm characteristics predict stock returns in the cross-section. Gu, Kelly, and Xiu (2020) used LASSO (among other ML methods) on a large set of firm-level predictors, finding that 20--30 characteristics survive regularization. This approach has replaced ad hoc "anomaly" testing in much of the recent empirical asset pricing literature.

LASSO breaks down in three settings. First, when predictors are highly correlated, selection is unstable: LASSO picks one from a correlated group and discards the others, and which one it picks can change with minor data revisions. Elastic net mitigates this by adding an L2 penalty that encourages correlated predictors to share weight. Second, LASSO can select at most $\min(n, p)$ variables. In extreme high-dimensional settings ($p \gg n$), this cap binds before all true predictors are captured. Third, LASSO coefficients are biased toward zero by construction. Post-LASSO OLS---running OLS on the LASSO-selected variables---corrects the bias but introduces a two-stage procedure whose statistical properties are harder to characterize.
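A short sketch of the two-stage refit on simulated data (scikit-learn; the penalty value is an illustrative assumption):

```python
# Sketch: post-LASSO OLS. Select with the L1 penalty, then refit the
# selected columns by OLS to undo the shrinkage bias.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

selector = Lasso(alpha=0.1).fit(X, y)
active = np.flatnonzero(selector.coef_)          # the active set A(lambda)
post = LinearRegression().fit(X[:, active], y)   # unpenalized refit

print("selected predictors:", active)
print("LASSO (shrunk):", selector.coef_[active])
print("post-LASSO OLS:", post.coef_)             # larger in magnitude
```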

Literature and Extensions

Key Papers

  • Tibshirani (1996) --- introduced the LASSO and its geometric interpretation
  • Efron, Hastie, Johnstone, Tibshirani (2004) --- LARS algorithm for computing the entire LASSO solution path efficiently
  • Zou (2006) --- adaptive LASSO with oracle properties
  • Zou, Hastie (2005) --- elastic net combining L1 and L2 penalties
  • Bickel, Ritov, Tsybakov (2009) --- oracle inequalities for LASSO under restricted eigenvalue conditions

Named Variants

  • Adaptive LASSO --- predictor-specific penalty weights $\lambda w_j$, where $w_j = |\hat{\beta}_j^{\text{OLS}}|^{-\gamma}$ (see the sketch after this list)
  • Group LASSO --- penalizes $\|\beta_G\|_2$ for predefined groups $G$, selecting entire groups rather than individual variables
  • Elastic net --- combines L1 and L2 penalties: $\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2$
  • Post-LASSO OLS --- uses LASSO for selection, then re-estimates by OLS on selected variables to reduce bias
  • Square-root LASSO --- self-normalizing variant that does not require knowledge or estimation of the error variance
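
A sketch of the adaptive LASSO via the standard column-rescaling trick: the weighted penalty $\lambda \sum_j w_j |\beta_j|$ is equivalent to an ordinary LASSO on $X$ with column $j$ divided by $w_j$. The OLS pilot, $\gamma = 1$, simulated data with $n > p$, and the penalty value are all illustrative assumptions:

```python
# Sketch: adaptive LASSO by rescaling columns, then mapping coefficients back.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

pilot = LinearRegression().fit(X, y)      # pilot fit needs n > p
w = np.abs(pilot.coef_) ** -1.0           # data-dependent weights, gamma = 1
fit = Lasso(alpha=0.1).fit(X / w, y)      # standard L1 problem in rescaled X
beta_adaptive = fit.coef_ / w             # map back to the original scale
print("adaptive LASSO active set:", np.flatnonzero(beta_adaptive))
```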

Open Questions

  • Whether adaptive or thresholded LASSO achieves reliable variable selection in macro panels where the irrepresentable condition fails
  • How to construct valid confidence intervals for LASSO-selected coefficients without assuming the selected model is correct (selective inference)
  • Whether LASSO's instability under predictor correlation is a fundamental limitation or a finite-sample artifact that vanishes with enough data

Components

$\hat{\beta}_{\text{LASSO}}$ --- LASSO coefficient vector

The $p \times 1$ vector of L1-penalized coefficients. Some entries are exactly zero; the nonzero entries identify the selected variables.

$\lambda$ --- Penalty parameter

Non-negative scalar controlling sparsity. Larger $\lambda$ sets more coefficients to zero. At $\lambda_{\max} = \max_j |X_j' y| / n$, all coefficients are zero.

$S(z, \lambda)$ --- Soft-thresholding operator

The core nonlinearity: $S(z, \lambda) = \text{sign}(z)(|z| - \lambda)_+$. Shrinks $z$ toward zero by $\lambda$ and sets it to zero if $|z| \leq \lambda$.

$\mathcal{A}(\lambda)$ --- Active set

The set of predictors with nonzero coefficients at penalty $\lambda$: $\mathcal{A} = \{j : \hat{\beta}_j(\lambda) \neq 0\}$. This is the selected model.

$\lambda_{\max}$ --- Maximum penalty

The smallest $\lambda$ at which all coefficients are zero: $\lambda_{\max} = n^{-1} \max_j |X_j' y|$. The regularization path starts here (a numerical check appears after this list).

$\text{df}(\lambda)$ --- Degrees of freedom

For the LASSO, the effective degrees of freedom equals the number of nonzero coefficients $|\mathcal{A}(\lambda)|$ (Zou, Hastie, Tibshirani, 2007). Discrete, unlike ridge's continuous df.
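A quick numerical check of the $\lambda_{\max}$ formula on simulated data. Scikit-learn's Lasso uses the same $\frac{1}{2n}$ loss scaling as the text, which is why the formulas line up; the centering below mirrors its intercept handling, and the data are an illustrative assumption:

```python
# Sketch: verify the path starts at lambda_max = max_j |X_j' y| / n.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

Xc, yc = X - X.mean(axis=0), y - y.mean()          # match intercept handling
lam_max = np.max(np.abs(Xc.T @ yc)) / len(y)
at_max = Lasso(alpha=lam_max).fit(X, y)
just_below = Lasso(alpha=0.99 * lam_max).fit(X, y)
print("nonzero at lambda_max:", np.sum(at_max.coef_ != 0))       # expect 0
print("nonzero just below:", np.sum(just_below.coef_ != 0))      # typically 1
```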

Assumptions

Sparsity (maintained)

The true coefficient vector $\beta_0$ has $s$ nonzero entries with $s \ll p$. Most predictors are irrelevant.

If violated: If the truth is dense (all predictors contribute weakly), LASSO zeroes out some true signals and underperforms ridge. Elastic net is more robust to this case.

Restricted eigenvalue condition (maintained)

The design matrix $X$ satisfies $\|X \delta\|_2^2 / n \geq \kappa \|\delta\|_2^2$ for all $\delta$ in a restricted set around the true sparse support.

If violated: Without this condition, the LASSO cannot distinguish the true support from spurious alternatives. Estimation and selection consistency both fail.

Irrepresentable condition for selection consistency (testable)

The correlation between relevant and irrelevant predictors is bounded: $\|X_{\mathcal{A}^c}' X_{\mathcal{A}} (X_{\mathcal{A}}' X_{\mathcal{A}})^{-1} \text{sign}(\beta_{\mathcal{A}})\|_\infty < 1$.

If violated: LASSO may select incorrect variables even as $n \to \infty$. This condition is strong and often violated in macro data where predictors are correlated.

Linear model (testable)

$E[y \mid X] = X\beta$ for some $\beta$.

If violated: LASSO selects a linear approximation. Nonlinear effects are missed; important nonlinear predictors may be incorrectly zeroed out.

Independent or weakly dependent errors (testable)

Errors $\varepsilon_i$ are i.i.d. or weakly dependent for oracle inequality results to hold.

If violated: Serial correlation invalidates standard cross-validation and weakens the theoretical guarantees on selection and estimation rates.

Standardized predictors (maintained)

All predictors are scaled to zero mean and unit variance before applying the penalty.

If violated: The L1 penalty treats coefficients on different scales unequally. A predictor measured in large units gets penalized more heavily per unit of predictive power.
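
A sketch of this failure mode and its fix, with one column deliberately re-expressed in larger units (so its values shrink and its required coefficient balloons); the scikit-learn pipeline, simulated data, and penalty value are illustrative assumptions:

```python
# Sketch: a mis-scaled predictor is zeroed out unless standardized first.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)
X[:, 0] /= 1000.0                        # same signal, measured in larger units

raw = Lasso(alpha=0.1).fit(X, y)
scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
print("raw fit keeps column 0:", raw.coef_[0] != 0)        # likely False
print("scaled fit keeps column 0:",
      scaled.named_steps["lasso"].coef_[0] != 0)           # likely True
```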