Macroeconomic model reference

Elastic Net Model

Regularized regression that blends LASSO selection with ridge shrinkage for correlated predictor blocks.

Elastic Net: question, structure, and use cases

How do you select variables from a large set of correlated predictors without arbitrarily dropping members of correlated groups?

Background

LASSO (Tibshirani, 1996) performs variable selection by imposing an L1 penalty on regression coefficients, shrinking some exactly to zero. Ridge regression (Hoerl and Kennard, 1970) imposes an L2 penalty that shrinks all coefficients toward zero but never eliminates any. Each method has a blind spot. LASSO is unstable when predictors are correlated: it tends to pick one predictor from a correlated cluster and zero the rest, and which one survives can flip with small data perturbations. Ridge cannot produce sparse models at all. Zou and Hastie (2005) proposed the elastic net, which applies both penalties simultaneously, scaled by an overall penalty strength $\lambda$: $\lambda \left[ \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \right]$. The mixing parameter $\alpha \in [0, 1]$ interpolates between pure ridge ($\alpha = 0$) and pure LASSO ($\alpha = 1$). For intermediate $\alpha$, the combination inherits LASSO's sparsity and ridge's stability under collinearity.

The core mechanism is the grouped selection property. When a block of predictors is highly pairwise correlated, the elastic net tends to assign them similar coefficient magnitudes and select or drop the entire block together. Pure LASSO cannot do this because the L1 ball's flat edges and sharp corners favor selecting a single representative from each correlated group. Adding the L2 penalty changes that geometry: the elastic net constraint set is a blend of the L1 diamond and the L2 sphere. It keeps the diamond's corners on the coordinate axes, which is what produces exact zeros, but bows the flat edges into strictly convex curves, which is what lets correlated predictors share weight. Formally, Zou and Hastie proved that the difference in elastic net coefficients between two predictors $j$ and $k$ is bounded by a function of $1 - \rho_{jk}$, where $\rho_{jk}$ is their sample correlation, so highly correlated predictors receive nearly equal treatment.
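The grouping bound can be written out as a short sketch. It is stated here in the original paper's parametrization rather than the $(\alpha, \lambda)$ form used on this page, so $\lambda_1$ and $\lambda_2$ are notation introduced for this illustration: they are the weights in the naive elastic net criterion $\|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1$. Assuming centered $y$, predictors standardized as in that paper, and two coefficients with the same sign ($\hat{\beta}_j \hat{\beta}_k > 0$):

$$
\frac{|\hat{\beta}_j - \hat{\beta}_k|}{\|y\|_1} \leq \frac{1}{\lambda_2} \sqrt{2 (1 - \rho_{jk})}
$$

As $\rho_{jk} \to 1$ the right-hand side goes to zero, so the two coefficients are forced together. As $\lambda_2 \to 0$ (pure LASSO) the bound becomes vacuous, which matches the absence of a grouping guarantee for LASSO.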

Central bank research groups adopted elastic net quickly. The Bank of England uses it for inflation density forecasting with 100+ candidate predictors (Kapetanios, Marcellino, Papailias, 2021). The Federal Reserve Bank of New York applies elastic net in mixed-frequency nowcasting pipelines where financial and real-activity indicators are collinear. The IMF's World Economic Outlook team uses penalized regressions including elastic net for cross-country growth projections with many correlated institutional and macroeconomic controls. In finance, Gu, Kelly, and Xiu (2020) found elastic net competitive with deeper machine learning methods for predicting stock returns from hundreds of firm characteristics.

The estimator was extended by Friedman, Hastie, and Tibshirani (2010) with the glmnet coordinate descent implementation, which made the full two-dimensional cross-validation grid over $(\alpha, \lambda)$ computationally tractable. The adaptive elastic net (Zou and Zhang, 2009) adds data-dependent coefficient weights to achieve oracle properties under weaker conditions. The group elastic net penalizes predefined blocks of variables (e.g., all lags of a given indicator), combining group LASSO and group ridge. The sparse group elastic net adds within-group sparsity on top of between-group selection.

How the Parts Fit Together

Inputs are the standard regression setup: $n$ observations, $p$ predictors (typically standardized to zero mean and unit variance), and a response vector $y$. In macro applications, $p$ ranges from 50 to 500 candidate predictors drawn from financial conditions indices, labor market indicators, survey data, international trade flows, and their lags. The elastic net expects that a moderate-sized subset of these predictors is relevant, possibly in correlated clusters. Unlike LASSO, the analyst does not need to assume that only one predictor per cluster is relevant.
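The standardization step can be sketched in a few lines of plain Python. The function name `standardize` and the toy numbers are made up for illustration, and the choice of the population variance (dividing by $n$ rather than $n - 1$) is an assumption made here so that each scaled column has mean square exactly one.

```python
import math


def standardize(X):
    """Scale each column of X (a list of rows) to zero mean and unit variance.

    Assumption for this sketch: population variance (divide by n), so every
    standardized column satisfies (1/n) * sum(x_ij ** 2) == 1.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    scales = []
    for j in range(p):
        var = sum((row[j] - means[j]) ** 2 for row in X) / n
        if var == 0.0:
            raise ValueError(f"column {j} is constant and cannot be standardized")
        scales.append(math.sqrt(var))
    Z = [[(row[j] - means[j]) / scales[j] for j in range(p)] for row in X]
    return Z, means, scales


# Two hypothetical predictors measured on very different scales.
X_raw = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
Z, means, scales = standardize(X_raw)
```

Returning `means` and `scales` alongside the scaled matrix lets the same transformation be applied to new observations at forecast time.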

The objective function is $\min_\beta \frac{1}{2n} \|y - X\beta\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + \frac{1 - \alpha}{2} \|\beta\|_2^2 \right]$. Two hyperparameters govern the fit: the overall penalty strength $\lambda \geq 0$ and the mixing parameter $\alpha \in [0, 1]$. At $\alpha = 1$, the penalty is pure L1 (LASSO). At $\alpha = 0$, the penalty is pure L2 (ridge). Intermediate $\alpha$ values produce models that are sparse (some coefficients exactly zero) but grouping-stable (correlated predictors share selection fate). The L2 component also removes the cap of $\min(n, p)$ on the number of selected variables that LASSO imposes: elastic net can select all $p$ predictors if $\alpha < 1$.
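To make the objective concrete, the following minimal sketch evaluates it for a given coefficient vector. The function name `elastic_net_objective` and the two-observation toy data are illustrative assumptions; the formula itself is the one stated above.

```python
def elastic_net_objective(X, y, beta, lam, alpha):
    """(1/2n)||y - X beta||^2 + lam * [alpha*||beta||_1 + (1-alpha)/2 * ||beta||_2^2]."""
    n = len(y)
    residuals = [
        yi - sum(xij * bj for xij, bj in zip(row, beta))
        for row, yi in zip(X, y)
    ]
    loss = sum(r * r for r in residuals) / (2 * n)
    l1 = sum(abs(b) for b in beta)        # ||beta||_1
    l2_sq = sum(b * b for b in beta)      # ||beta||_2^2
    return loss + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)


# Toy data (made up): one predictor, two observations.
X = [[1.0], [-1.0]]
y = [1.0, -1.0]

mixed = elastic_net_objective(X, y, beta=[0.5], lam=1.0, alpha=0.5)  # 0.4375
lasso = elastic_net_objective(X, y, beta=[0.5], lam=1.0, alpha=1.0)  # 0.625
ridge = elastic_net_objective(X, y, beta=[0.5], lam=1.0, alpha=0.0)  # 0.25
```

The squared-error term is 0.125 in all three calls; only the penalty changes as $\alpha$ moves from ridge to LASSO.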

Estimation proceeds by coordinate descent (Friedman, Hastie, Tibshirani, 2010). For a fixed $(\alpha, \lambda)$, the algorithm cycles through predictors and applies a soft-thresholding update that accounts for both penalties. The update for coordinate $j$ is $\hat{\beta}_j = S(z_j, \alpha \lambda) / (1 + \lambda(1 - \alpha))$, where $z_j$ is the coefficient from regressing the partial residual (the current residual with predictor $j$'s own contribution added back) on standardized predictor $j$, and $S$ is the soft-thresholding operator. The denominator $(1 + \lambda(1 - \alpha))$ is the ridge contribution: it further shrinks the surviving coefficients beyond what LASSO alone would do. The algorithm is warm-started along a grid of $\lambda$ values for each $\alpha$, making the full path computationally cheap. Hyperparameter selection uses two-dimensional $k$-fold cross-validation over a grid of $(\alpha, \lambda)$ pairs, or a one-dimensional search over $\lambda$ at a few fixed $\alpha$ values (0.25, 0.50, 0.75, 1.0) followed by selection of the best $(\alpha, \lambda)$ pair.
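The update rule can be sketched as a small cyclic coordinate descent loop in plain Python. This is an illustration under stated assumptions, not the glmnet implementation: it assumes columns standardized so that $\frac{1}{n} \sum_i x_{ij}^2 = 1$, a centered response with no intercept, and a single $(\alpha, \lambda)$ pair with no warm starts. The names `soft_threshold` and `elastic_net_cd` and the toy design are hypothetical.

```python
def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_cd(X, y, lam, alpha, max_sweeps=1000, tol=1e-10):
    """Cyclic coordinate descent for the elastic net at one (alpha, lam) pair.

    Assumes X is a list of rows whose columns satisfy (1/n) * sum_i x_ij**2 == 1
    and that y is centered, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    beta = [0.0] * p
    resid = list(y)  # equals y - X beta, with beta = 0 at the start
    for _ in range(max_sweeps):
        max_change = 0.0
        for j in range(p):
            # z_j: regression of the partial residual (resid + x_j * beta_j) on x_j.
            z = sum(x * r for x, r in zip(cols[j], resid)) / n + beta[j]
            new = soft_threshold(z, alpha * lam) / (1.0 + lam * (1.0 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                resid = [r - x * delta for r, x in zip(resid, cols[j])]
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Toy design (made up): columns 0 and 1 are identical, column 2 is orthogonal.
X = [[1.0, 1.0, 1.0], [-1.0, -1.0, 1.0], [1.0, 1.0, -1.0], [-1.0, -1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]

beta_en = elastic_net_cd(X, y, lam=0.5, alpha=0.5)     # about [0.778, 0.778, 0.0]
beta_lasso = elastic_net_cd(X, y, lam=0.5, alpha=1.0)  # [1.5, 0.0, 0.0]
```

With two identical predictors, the elastic net fit splits the weight equally between them (each converges to $7/9$), while the LASSO fit gives everything to whichever twin is visited first. That is the grouped selection behavior in miniature.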

Applications

The Bank of England's forecasting division uses elastic net to select from 150+ monthly economic indicators when building inflation fan charts. The grouped selection property matters here: inflation expectations from surveys, market breakevens, and model-implied measures are highly correlated, and the policy team needs all members of a relevant group to appear in the model rather than having LASSO arbitrarily pick one. The selected model typically retains 15 to 30 variables in 4 to 6 correlated clusters, which maps cleanly onto the narrative structure the Monetary Policy Committee expects.

In cross-country growth regressions, elastic net resolves the "open-endedness" problem identified by Levine and Renelt (1992): dozens of institutional, geographic, and policy variables are plausible growth determinants, many are correlated, and OLS cannot handle them all simultaneously. Elastic net selects a stable subset while keeping correlated governance indicators (rule of law, corruption control, regulatory quality) together rather than dropping all but one. The IMF and World Bank have published working papers using this approach for policy-relevant variable selection in development economics.

Financial risk management uses elastic net for portfolio construction with hundreds of assets. DeMiguel, Martin-Utrera, Nogales (2018) showed that elastic net portfolios outperform LASSO portfolios out of sample because the L2 penalty stabilizes weights among correlated assets (e.g., stocks in the same sector), reducing turnover and transaction costs.

Elastic net is a poor fit when predictors are genuinely orthogonal and very sparse selection is needed: the L2 penalty adds unnecessary bias in that setting, and pure LASSO is generally the better choice. It is also a poor fit when the practitioner needs valid post-selection confidence intervals, because the selective inference literature for elastic net is less developed than for LASSO. Finally, if the goal is pure prediction without any sparsity requirement, ridge or random forests may dominate because they avoid the bias introduced by L1 thresholding on weakly relevant predictors.

Components

$\hat{\beta}_{\text{EN}}$

Elastic net coefficient vector

The $p \times 1$ vector of penalized coefficients. Some entries are exactly zero (sparsity from L1); nonzero entries are shrunk toward zero (from both L1 and L2).

$\lambda$

Overall penalty strength

Non-negative scalar controlling the total amount of regularization. Larger $\lambda$ produces sparser, more shrunken models.

$\alpha$

L1/L2 mixing parameter

Scalar in $[0, 1]$ controlling the balance between LASSO ($\alpha = 1$) and ridge ($\alpha = 0$). Intermediate values yield grouped selection with sparsity.

$S(z, \gamma)$

Soft-thresholding operator

$S(z, \gamma) = \text{sign}(z)(|z| - \gamma)_+$. Returns zero if $|z| \leq \gamma$; otherwise shrinks $z$ toward zero by $\gamma$.

$\mathcal{A}(\alpha, \lambda)$

Active set

The set of predictors with nonzero elastic net coefficients: $\mathcal{A} = \{j : \hat{\beta}_j^{\text{EN}}(\alpha, \lambda) \neq 0\}$. Unlike LASSO, this set can exceed $n$ in size when $\alpha < 1$.

$\text{df}(\alpha, \lambda)$

Effective degrees of freedom

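Approximated by the number of nonzero coefficients divided by the ridge shrinkage factor: $|\mathcal{A}| / (1 + \lambda(1 - \alpha))$. The approximation is exact only when the active standardized predictors are uncorrelated; with correlated predictors it is a rough guide. At $\alpha = 1$ it reduces to the LASSO df (count of nonzeros); for $\alpha < 1$ the ridge factor pulls it below that count, in the spirit of the continuous trace form of ridge df.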
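The active set and the degrees-of-freedom approximation above are simple to compute from a fitted coefficient vector. The sketch below uses a hypothetical coefficient vector and the illustrative function names `active_set` and `approx_df`; it implements only the approximation $|\mathcal{A}| / (1 + \lambda(1 - \alpha))$, not an exact trace formula.

```python
def active_set(beta):
    """Indices of predictors with nonzero coefficients."""
    return [j for j, b in enumerate(beta) if b != 0.0]


def approx_df(beta, lam, alpha):
    """Approximate effective df: |A| / (1 + lam * (1 - alpha))."""
    return len(active_set(beta)) / (1.0 + lam * (1.0 - alpha))


# Hypothetical fitted coefficients, for illustration only.
beta_hat = [0.8, 0.0, -0.3, 0.0, 0.1]

selected = active_set(beta_hat)                      # [0, 2, 4]
df_lasso = approx_df(beta_hat, lam=0.5, alpha=1.0)   # 3.0, the count of nonzeros
df_mixed = approx_df(beta_hat, lam=0.5, alpha=0.5)   # 3 / 1.25 = 2.4
```

Holding the active set fixed, lowering $\alpha$ reduces the effective df because the surviving coefficients are shrunk harder by the ridge term.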

Assumptions

Approximate sparsity with grouping (Maintained)

The true coefficient vector $\beta_0$ has a moderate number of nonzero entries, possibly clustered among correlated predictors. The truth need not be as sparse as LASSO requires.

If violated: If the truth is extremely sparse with no correlated structure, elastic net's L2 component adds unnecessary bias without improving selection. Pure LASSO is more efficient in that case.

Linear conditional expectation (Testable)

$E[y | X] = X\beta$ for some coefficient vector $\beta$.

If violated: Elastic net estimates a linear approximation. Nonlinear effects are absorbed into residuals; important interaction or threshold effects are missed.

Standardized predictors (Maintained)

All predictors are scaled to zero mean and unit variance before applying the penalty.

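If violated: Both L1 and L2 penalties treat coefficients on different scales unequally. A predictor whose values are numerically small (for example, one recorded in large units) needs a large coefficient to have the same effect and is therefore penalized disproportionately, while a predictor with numerically large values escapes most of the penalty. Standardization is required for the grouped selection property to function correctly.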

Independent or weakly dependent errors (Testable)

Errors $\varepsilon_i$ are i.i.d. sub-Gaussian, or weakly dependent with summable mixing coefficients for the time-series case.

If violated: Strong serial correlation invalidates standard $k$-fold cross-validation. Oracle inequalities require adjustment for the reduced effective sample size.

Restricted eigenvalue condition (Maintained)

The design matrix $X$ satisfies a restricted eigenvalue condition: $\|X\delta\|_2^2 / n \geq \kappa \|\delta\|_2^2$ for all $\delta$ in a cone around the true sparse support.

If violated: Without this condition, the elastic net cannot reliably recover the true support direction. Estimation rates degrade and the grouped selection guarantee weakens.

Appropriate mixing parameter range (Maintained)

$\alpha$ is bounded away from zero: $\alpha \in [\alpha_{\min}, 1]$ with $\alpha_{\min} > 0$ for sparsity to hold.

If violated: At $\alpha = 0$ the model reduces to ridge with no sparsity. For the elastic net's variable selection properties to hold, $\alpha$ must be strictly positive.
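The error-dependence assumption above notes that strong serial correlation invalidates standard $k$-fold cross-validation. One common workaround for time-ordered data, not prescribed by this page, is forward-chaining (expanding-window) validation, where every test block lies strictly after its training block. The sketch below is a minimal illustration; the function name `expanding_window_splits` and its parameters are assumptions introduced here.

```python
def expanding_window_splits(n_obs, n_folds, min_train):
    """Yield (train_idx, test_idx) pairs in which each test block follows its training block.

    Assumption for this sketch: observations are already sorted in time order.
    """
    test_size = (n_obs - min_train) // n_folds
    if min_train < 1 or test_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        # The last fold absorbs any leftover observations.
        test_end = n_obs if k == n_folds - 1 else train_end + test_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Example: 12 time-ordered observations, 3 folds, at least 6 training points.
splits = list(expanding_window_splits(n_obs=12, n_folds=3, min_train=6))
# First fold trains on 0..5 and tests on 6..7; the last trains on 0..9 and tests on 10..11.
```

Each `(train_idx, test_idx)` pair can then be used to fit the model at a candidate $(\alpha, \lambda)$ on the training indices and score it on the later test indices, so no future observation leaks into the fit.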
