
LASSO
Model

L1-penalized linear regression -- drives some coefficients exactly to zero, performing variable selection inside the fit.

Which of these hundred candidate predictors actually matter for forecasting, and can you estimate their effects simultaneously?

Background

OLS treats every predictor as essential. Ridge regression treats every predictor as relevant but shrinks them toward zero. Neither eliminates variables. Tibshirani (1996) proposed the LASSO---Least Absolute Shrinkage and Selection Operator---which replaces ridge's L2 penalty with an L1 penalty: $\lambda \|\beta\|_1 = \lambda \sum_j |\beta_j|$. The geometry of the L1 ball (a diamond in two dimensions) means the constrained optimum often lands on a corner, where one or more coefficients are exactly zero. The LASSO simultaneously estimates coefficients and selects variables, producing a sparse model directly from the optimization.
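A minimal sketch of this contrast, using scikit-learn on simulated data (the sparse data-generating process and the penalty values are illustrative assumptions, not part of the model description above):

```python
# Sketch: L1 produces exact zeros, L2 only shrinks (simulated sparse truth).
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))                     # n = 200, p = 50
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)                     # alpha plays the role of lambda
ridge = Ridge(alpha=10.0).fit(X, y)

print("LASSO exact zeros:", np.sum(lasso.coef_ == 0))  # most of the 50 coefficients
print("Ridge exact zeros:", np.sum(ridge.coef_ == 0))  # typically none
```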

The mechanics are sharp. The objective is $\min_\beta \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$. Unlike ridge, this problem has no closed-form solution because the L1 penalty is not differentiable at zero. The solution is computed by coordinate descent: cycle through predictors, update each coefficient by soft-thresholding, and repeat until convergence. The LARS (Least Angle Regression) algorithm of Efron et al. (2004) computes the entire solution path---all coefficient values for every $\lambda$ from infinity to zero---at the same computational cost as a single OLS fit. This path is piecewise linear, with kinks at the $\lambda$ values where variables enter or leave the active set.
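A sketch of the full path using scikit-learn's LARS implementation (the library choice and simulated data are assumptions for illustration):

```python
# Sketch: the entire LASSO solution path via LARS (simulated sparse truth).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

# method="lasso" yields the LASSO path: piecewise linear in lambda, with
# kinks wherever a variable enters or leaves the active set.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("kinks along the path:", len(alphas))
print("coefficient path shape (p, n_kinks):", coefs.shape)
```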

In macroeconomics, LASSO gained traction through the work of Bai and Ng (2008) on factor-augmented forecasting and Li and Chen (2014) on selecting relevant predictors from large macro panels. Central bank researchers use LASSO to identify which of dozens of financial indicators actually predict recessions, to select lag lengths in high-dimensional VARs, and to estimate sparse Phillips curves. The Bank of Canada, Reserve Bank of Australia, and Bank of England have all published research using LASSO or its variants for macro forecasting.

The method's limitation is well understood: LASSO's variable selection is unstable when predictors are correlated. If two variables carry the same information, LASSO arbitrarily picks one and zeros the other. Which one it picks can flip with small data perturbations. The elastic net (Zou and Hastie, 2005) addresses this by combining L1 and L2 penalties. The adaptive LASSO (Zou, 2006) uses data-dependent weights to achieve oracle properties---the same selection and estimation efficiency as if the true sparse model were known in advance.

How the Parts Fit Together

Inputs mirror those of ridge: $n$ observations, $p$ predictors (standardized to unit variance), and a dependent variable $y$. Typical macro applications involve $p$ ranging from 50 to 500 candidate predictors: monthly financial variables, survey indices, labor market indicators, international series, and their lags. The critical difference from ridge is that the analyst expects only a subset---perhaps 5 to 20---of these predictors to be active in the true model.

The model solves $\min_\beta \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^p |\beta_j|$. The L1 penalty creates a non-smooth optimization landscape. At each coordinate $j$, the solution is $\hat{\beta}_j = S(z_j, \lambda)$, where $z_j$ is the partial-residual regression coefficient and $S(z, \lambda) = \text{sign}(z)(|z| - \lambda)_+$ is the soft-thresholding operator. If $|z_j| \leq \lambda$, the coefficient is set exactly to zero. If $|z_j| > \lambda$, it is shrunk toward zero by $\lambda$. The entire algorithm cycles through coordinates repeatedly until all coefficients stabilize.
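A from-scratch sketch of that coordinate-descent loop, kept deliberately naive (the function names and simulated data are illustrative; a real implementation would add convergence checks and warm starts):

```python
# Sketch: naive coordinate descent for the LASSO objective
#   (1/2n) * ||y - X @ beta||^2 + lam * sum(|beta_j|).
# Assumes roughly standardized columns; illustrative, not optimized.
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n    # (1/n) * ||X_j||^2, ~1 if standardized
    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: residual with predictor j's contribution added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            z_j = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z_j, lam) / col_ss[j]
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)
print("active set:", np.flatnonzero(lasso_coordinate_descent(X, y, lam=0.1)))
```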

Penalty selection uses $k$-fold cross-validation, typically $k = 10$. The CV curve plots prediction error against $\log(\lambda)$, and its minimum identifies $\lambda_{\min}$. The one-standard-error rule instead picks $\lambda_{1\text{SE}}$, the largest $\lambda$ whose CV error lies within one SE of the minimum---a more parsimonious model that protects against overfitting. For time-series data, the CV folds must respect temporal ordering (rolling or expanding window). Information criteria (BIC, extended BIC) are alternatives that penalize model complexity directly and avoid cross-validation entirely.
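A sketch of that workflow with scikit-learn, using LassoCV's fold-wise error path and expanding-window splits from TimeSeriesSplit; the one-SE rule is computed by hand here, and the data are simulated, so the temporal ordering is purely illustrative:

```python
# Sketch: penalty selection with time-series folds and the one-SE rule.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

# Expanding-window folds keep every validation set later than its training set.
fit = LassoCV(cv=TimeSeriesSplit(n_splits=5), n_alphas=100).fit(X, y)

mse_mean = fit.mse_path_.mean(axis=1)                  # mean CV error per lambda
mse_se = fit.mse_path_.std(axis=1) / np.sqrt(fit.mse_path_.shape[1])
i_min = mse_mean.argmin()
# alphas_ is sorted in decreasing order, so the first alpha whose error is
# within one SE of the minimum is the largest such alpha.
alpha_1se = fit.alphas_[np.argmax(mse_mean <= mse_mean[i_min] + mse_se[i_min])]
print("lambda_min:", fit.alpha_, "lambda_1SE:", alpha_1se)
```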

Applications

The Reserve Bank of Australia uses LASSO to select predictors from a pool of 150+ monthly indicators when nowcasting GDP. The selected model typically retains 8--15 variables, whose composition shifts across vintages as the economic environment changes. The sparsity constraint forces the model to identify which signals are genuinely informative in each forecast round, avoiding the "kitchen sink" problem that plagues OLS with large predictor sets.

Academic research on Phillips-curve specification uses LASSO to adjudicate among competing slack measures. Stock and Watson (2019) included unemployment, the output gap, capacity utilization, and various labor-market indicators (with lags) as candidates. LASSO's selection reveals which measures contain independent forecasting power for inflation after accounting for the others---a question that generates long debates when addressed by eyeball comparison of in-sample fit.

In financial economics, LASSO identifies which of hundreds of firm characteristics predict stock returns in the cross-section. Gu, Kelly, and Xiu (2020) used LASSO (among other ML methods) on a large set of firm-level predictors, finding that 20--30 characteristics survive regularization. This approach has replaced ad hoc "anomaly" testing in much of the recent empirical asset pricing literature.

LASSO breaks down in three settings. First, when predictors are highly correlated, selection is unstable: LASSO picks one from a correlated group and discards the others, and which one it picks can change with minor data revisions. Elastic net mitigates this by adding an L2 penalty that encourages correlated predictors to share weight. Second, LASSO can select at most $\min(n, p)$ variables. In extreme high-dimensional settings ($p \gg n$), this cap binds before all true predictors are captured. Third, LASSO coefficients are biased toward zero by construction. Post-LASSO OLS---running OLS on the LASSO-selected variables---corrects the bias but introduces a two-stage procedure whose statistical properties are harder to characterize.
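A short sketch of the two-stage refit on simulated data (scikit-learn; the penalty value is an illustrative assumption):

```python
# Sketch: post-LASSO OLS. Select with the L1 penalty, then refit the
# selected columns by OLS to undo the shrinkage bias.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

selector = Lasso(alpha=0.1).fit(X, y)
active = np.flatnonzero(selector.coef_)          # the active set A(lambda)
post = LinearRegression().fit(X[:, active], y)   # unpenalized refit

print("selected predictors:", active)
print("LASSO (shrunk):", selector.coef_[active])
print("post-LASSO OLS:", post.coef_)             # larger in magnitude
```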

Literature and Extensions

Key Papers

  • Tibshirani (1996) --- introduced the LASSO and its geometric interpretation
  • Efron, Hastie, Johnstone, Tibshirani (2004) --- LARS algorithm for computing the entire LASSO solution path efficiently
  • Zou (2006) --- adaptive LASSO with oracle properties
  • Zou, Hastie (2005) --- elastic net combining L1 and L2 penalties
  • Bickel, Ritov, Tsybakov (2009) --- oracle inequalities for LASSO under restricted eigenvalue conditions

Named Variants

  • Adaptive LASSO --- predictor-specific penalty weights $\lambda w_j$, where $w_j = |\hat{\beta}_j^{\text{OLS}}|^{-\gamma}$ (see the sketch after this list)
  • Group LASSO --- penalizes $\|\beta_G\|_2$ for predefined groups $G$, selecting entire groups rather than individual variables
  • Elastic net --- combines L1 and L2 penalties: $\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2$
  • Post-LASSO OLS --- uses LASSO for selection, then re-estimates by OLS on selected variables to reduce bias
  • Square-root LASSO --- self-normalizing variant that does not require knowledge or estimation of the error variance
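
A sketch of the adaptive LASSO via the standard column-rescaling trick: the weighted penalty $\lambda \sum_j w_j |\beta_j|$ is equivalent to an ordinary LASSO on $X$ with column $j$ divided by $w_j$. The OLS pilot, $\gamma = 1$, simulated data with $n > p$, and the penalty value are all illustrative assumptions:

```python
# Sketch: adaptive LASSO by rescaling columns, then mapping coefficients back.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

pilot = LinearRegression().fit(X, y)      # pilot fit needs n > p
w = np.abs(pilot.coef_) ** -1.0           # data-dependent weights, gamma = 1
fit = Lasso(alpha=0.1).fit(X / w, y)      # standard L1 problem in rescaled X
beta_adaptive = fit.coef_ / w             # map back to the original scale
print("adaptive LASSO active set:", np.flatnonzero(beta_adaptive))
```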

Open Questions

  • Whether adaptive or thresholded LASSO achieves reliable variable selection in macro panels where the irrepresentable condition fails
  • How to construct valid confidence intervals for LASSO-selected coefficients without assuming the selected model is correct (selective inference)
  • Whether LASSO's instability under predictor correlation is a fundamental limitation or a finite-sample artifact that vanishes with enough data

Components

$\hat{\beta}_{\text{LASSO}}$ --- LASSO coefficient vector

The $p \times 1$ vector of L1-penalized coefficients. Some entries are exactly zero; the nonzero entries identify the selected variables.

$\lambda$ --- Penalty parameter

Non-negative scalar controlling sparsity. Larger $\lambda$ sets more coefficients to zero. At $\lambda_{\max} = \max_j |X_j' y| / n$, all coefficients are zero.

$S(z, \lambda)$ --- Soft-thresholding operator

The core nonlinearity: $S(z, \lambda) = \text{sign}(z)(|z| - \lambda)_+$. Shrinks $z$ toward zero by $\lambda$ and sets it to zero if $|z| \leq \lambda$.

$\mathcal{A}(\lambda)$ --- Active set

The set of predictors with nonzero coefficients at penalty $\lambda$: $\mathcal{A} = \{j : \hat{\beta}_j(\lambda) \neq 0\}$. This is the selected model.

$\lambda_{\max}$ --- Maximum penalty

The smallest $\lambda$ at which all coefficients are zero: $\lambda_{\max} = n^{-1} \max_j |X_j' y|$. The regularization path starts here (a numerical check appears after this list).

$\text{df}(\lambda)$ --- Degrees of freedom

For the LASSO, the effective degrees of freedom equals the number of nonzero coefficients $|\mathcal{A}(\lambda)|$ (Zou, Hastie, Tibshirani, 2007). Discrete, unlike ridge's continuous df.
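A quick numerical check of the $\lambda_{\max}$ formula on simulated data. Scikit-learn's Lasso uses the same $\frac{1}{2n}$ loss scaling as the text, which is why the formulas line up; the centering below mirrors its intercept handling, and the data are an illustrative assumption:

```python
# Sketch: verify the path starts at lambda_max = max_j |X_j' y| / n.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)

Xc, yc = X - X.mean(axis=0), y - y.mean()          # match intercept handling
lam_max = np.max(np.abs(Xc.T @ yc)) / len(y)
at_max = Lasso(alpha=lam_max).fit(X, y)
just_below = Lasso(alpha=0.99 * lam_max).fit(X, y)
print("nonzero at lambda_max:", np.sum(at_max.coef_ != 0))       # expect 0
print("nonzero just below:", np.sum(just_below.coef_ != 0))      # typically 1
```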

Assumptions

Sparsity (maintained)

The true coefficient vector $\beta_0$ has $s$ nonzero entries with $s \ll p$. Most predictors are irrelevant.

If violated: If the truth is dense (all predictors contribute weakly), LASSO zeroes out some true signals and underperforms ridge. Elastic net is more robust to this case.

Restricted eigenvalue condition (maintained)

The design matrix $X$ satisfies $\|X \delta\|_2^2 / n \geq \kappa \|\delta\|_2^2$ for all $\delta$ in a restricted set around the true sparse support.

If violated: Without this condition, the LASSO cannot distinguish the true support from spurious alternatives. Estimation and selection consistency both fail.

Irrepresentable condition for selection consistency (testable)

The correlation between relevant and irrelevant predictors is bounded: $\|X_{\mathcal{A}^c}' X_{\mathcal{A}} (X_{\mathcal{A}}' X_{\mathcal{A}})^{-1} \text{sign}(\beta_{\mathcal{A}})\|_\infty < 1$.

If violated: LASSO may select incorrect variables even as $n \to \infty$. This condition is strong and often violated in macro data where predictors are correlated.

Linear model (testable)

$E[y \mid X] = X\beta$ for some $\beta$.

If violated: LASSO selects a linear approximation. Nonlinear effects are missed; important nonlinear predictors may be incorrectly zeroed out.

Independent or weakly dependent errors (testable)

Errors $\varepsilon_i$ are i.i.d. or weakly dependent for oracle inequality results to hold.

If violated: Serial correlation invalidates standard cross-validation and weakens the theoretical guarantees on selection and estimation rates.

Standardized predictors (maintained)

All predictors are scaled to zero mean and unit variance before applying the penalty.

If violated: The L1 penalty treats coefficients on different scales unequally. A predictor measured in large units gets penalized more heavily per unit of predictive power.
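
A sketch of this failure mode and its fix, with one column deliberately re-expressed in larger units (so its values shrink and its required coefficient balloons); the scikit-learn pipeline, simulated data, and penalty value are illustrative assumptions:

```python
# Sketch: a mis-scaled predictor is zeroed out unless standardized first.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.standard_normal(200)
X[:, 0] /= 1000.0                        # same signal, measured in larger units

raw = Lasso(alpha=0.1).fit(X, y)
scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
print("raw fit keeps column 0:", raw.coef_[0] != 0)        # likely False
print("scaled fit keeps column 0:",
      scaled.named_steps["lasso"].coef_[0] != 0)           # likely True
```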