shrinkage-tutorial/tutorial.Rmd at master · mollyow/shrinkage-tutorial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
---
title: "Hierarchical ridge regression"
header-includes:
   - \usepackage{bm}
output:
  github_document:
    pandoc_args: --webtex
bibliography: tutorial.bib
nocite: |
  @friedman2009glmnet, @nloptr, @Offer-Westort:2019aa
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### The problem ###

Suppose we have a factorial experiment, where we want to account
for two-way and higher-order interactions. We may think that interaction effects
will be small but not exactly equal to zero, and higher-order interactions
will tend to be associated with smaller effects relative to
lower order interactions.

Accounting for all interactions in a standard linear model may be costly in
terms of variance, so we want to use some form of regularization.

**The solution proposed here:** _Hierarchical ridge regression_ (or optionally,
lasso or elastic net), where penalties are increasing with degree of complexity of
interactions.

We will only consider factorial experiments here, but this idea can be extended
to account for non-linearities in dose-type treatments as well. For our
purposes, we'll also consider only factors where there is a natural baseline, or
control, level.

We'll consider interactions up to order $K$, and $\omega_k$ will tell us how
many effects there are at each degree of complexity; i.e., for main effects,
$\omega_1$ is equivalent to the total number of factor treatment levels above
the baseline level, and $\omega_2$ is the total number of two-way interactions.

Let $X$ be the matrix of factors and all interactions up to order $K$,  $\beta$
be the vector of coefficients excluding the intercept term, and $\lambda$ be the
penalty vector. Thus, the estimator solves,

$$
\text{argmin}_{ \beta_0, \beta  } \bigg\{\sum^{n}_{i=1}\left(y_i - \beta_0 - \beta^T \text{x}_{i}\right)^2+  \beta^T \text{diag}(\lambda)\beta \bigg\}.
$$


For the hierarchical ridge, we will set the penalties so that each order of
interactions has a separate penalty, $\lambda_k$, constrained so
$\lambda_1\le \lambda_2 \le \dots \le \lambda_K$. To make up the full $\lambda$
vector, each $\lambda_k$ is repeated $\omega_k$ times.

### Implementation ###
The hierarchical ridge regression is built on the infrastructure of `glmnet.`
The `glmnet` functions include an option for a `penalty.factor`, "a number
that multiplies `lambda` to allow differential shrinkage." A penalty factor of
0 results in no shrinkage, while setting the penalty factor to 1 for all
variables reduces to standard lasso or ridge regression.

To find these values, the parameters optimized over represent the vector $(\lambda_2/\lambda_1, \dots, \lambda_K/\lambda_1)$,
i.e., the size of each penalty factor relative to penalties on main effects,
and, equivalent to above, constraints are implemented to ensure
$(1 \le \lambda_2/\lambda_1\le  \dots\le \lambda_K/\lambda_1)$. Once we've
found these values, we'll set $\lambda_1$ per usual in ridge regression, as the
the most regularized model within one standard error of the value with minimum
mean cross-validated error.

This optimization is implemented with the `nloptr` package, combining an augmented
Lagrangian method to impose inequality constraints on a subsidiary algorithm, here an implementation of the "Subplex" algorithm [@rowan1991functional].
This is described in the general documentation for `NLopt` [here](https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/#augmented-lagrangian-algorithm).
```{r library, echo = TRUE, message = FALSE}
library(glmnet)
library(nloptr)
set.seed(94305)
```

The minimizing function reports mean cross-validated error reported by
`cv.glmnet` from the `glmnet package, with 10 fixed folds, $\alpha = 0$, and $\lambda$ selected as that producing minimum cross-validated error with the parameters serving as relevant penalty factors. The first penalty factor is always 1, since we're
setting penalty factors as relative to the penalty on main effects. The intercept remains unpenalized.

(Alternative versions, with, for example, continuous variables, might set the
first penalty factor equal to zero for no penalty on main effects, implementing
penalization begining with non-linearities or interactions.)

```{r ridge.search}
ridge.search <- function( pf ){
  p.fac <- rep(c(1, pf), times = omegas)
  modelcv <- cv.glmnet(Dmat, yvec, alpha = 0,
                       penalty.factor = p.fac,
                       foldid = foldvals)
  mse.min <- modelcv$cvm[modelcv$lambda == modelcv$lambda.min]
  return(mse.min)
}
```

### Simulation
In our simulations here, we'll consider an experiment with four factors, each with
three treatment levels above a baseline control baseline level. We'll conduct estimation up
to third order interactions.
```{r }
k <- 3 # Order of interactions
f <- formula(paste0('y ~ .^', k))

fl <- rep(3, 4) # Number of factor levels ABOVE baseline for each factor

# Get number of values for each level of interactions, excluding baseline
(omegas <- unlist(lapply(1:k, function(x) sum(combn(fl, m = x, FUN = prod)) )))
```
The sum of `omegas` should be equal to the number of columns in our *X* matrix.
`omegas[1]` gives us the number of main effects, `omegas[2]` the number of
two-way interactions, and `omegas[3]` the number of three-way interactions.

For simulations, we will let coefficient values be drawn from normal
distributions with mean zero, with standard deviations decreasing with order of
complexity.
```{r }
sdv <- c(1, 1/2, 1/8) # SD of coefficients for each order

# Simulate (true, unknown) coefficient values
beta <- unlist(
  sapply(1:k,
         function(x) replicate(
           omegas[x],
           rnorm(1, mean = 0, sd = sdv[x])
         ) ) )


# Sample data set
n <- 1500

## Treatment matrix
x <- lapply(1:length(fl),
            function(x) as.character(sample(0:fl[x], n,
                                            replace = TRUE)))

Dmat <- as.data.frame(do.call(cbind, x))
Dmat <- model.matrix(lm(f, data = data.frame(y = 1, Dmat)))[,-1]


## Outcome vector
yvec <- Dmat %*% beta + rnorm(n, mean = 0, sd = 1)
```


We can then implement the hierarchical ridge regression.
```{r, echo = FALSE, eval = TRUE}
optpar <- paste0(paste(c("min", sdv, n), collapse = "_"), ".rda")
# Fix fold values for optimization; variation across folds can make optimization noisy
foldvals <- sample(rep(1:10, n/10))

# Constrain penalty factors to be of increasing magnitude
eval.g0 <- function( x ) {
  return(  c(-x[1]+1, -x[2]+x[1]) )
}
load(optpar)

p.min <- rep(c(1, min.c$solution), times = omegas)
```

```{r, eval = FALSE}
# Fix fold values for optimization; variation across folds can make optimization noisy
foldvals <- sample(rep(1:10, n/10))

# Constrain penalty factors to be of increasing magnitude
eval.g0 <- function( x ) {
  return(  c(-x[1]+1, -x[2]+x[1]) )
}

# Minimize
min.c <- nloptr(x0 = c(1, 1),
                eval_f = ridge.search,
                lb = c(1,1),
                ub = c(1e5, 1e5),
                eval_g_ineq  = eval.g0,
                opts = list('algorithm'='NLOPT_LN_AUGLAG',
                            'xtol_rel'=1.0e-8,
                            # 'print_level' = 3,
                            local_opts = list('algorithm'='NLOPT_LN_SBPLX',
                                              'xtol_rel'=1.0e-8)) )

p.min <- rep(c(1, min.c$solution), times = omegas)

```

```{r, eval = FALSE, echo = FALSE}
optpar <- paste0(paste(c("min", sdv, n), collapse = "_"), ".rda")
save(min.c, file = optpar)
```

We can then compare estimators
```{r}
# Standard Lasso
cv.l <- cv.glmnet(Dmat, yvec, alpha = 1)

# Standard ridge regression
cv.r <- cv.glmnet(Dmat, yvec, alpha = 0)

# Hierarchical ridge with optimized penalty factors
cv.ropt <- cv.glmnet(Dmat, yvec, alpha = 0,
                   penalty.factor = p.min)

ff <- formula(yvec ~ .)
# Linear model with main effects
lm1 <- lm(ff, data = data.frame(yvec, Dmat[,1:omegas[1]]))

# Linear model + second order interactions
lm2 <- lm(ff, data = data.frame(yvec, Dmat[,1:cumsum(omegas)[2]]))

# Linear model + third order interactions
lm3 <- lm(ff, data = data.frame(yvec, Dmat))
```
Passing some of the plotting commands to a sourced script...we see in the plot
below the true coefficient values, followed by those estimated under each of the
models. Dashed vertical lines divide coefficients into groups of main effects, two-way interactions, and three-way interactions; within categories, coefficients are not ordered.

We can see that lasso shrinks some of the coefficient values all the way to zero
in each of the categories. Ridge regression shrinks coefficients, but never all
the way to zero. And the hierarchical ridge regression imposes different amounts
of shrinkage for the different categories.

```{r, eval = TRUE, message = FALSE, warning=FALSE}
source('plotting.R')

```

<!-- We can conduct iterated simulations to compare the performance of the  -->
<!-- hierarchical ridge to other standard estimators.  -->
```{r, eval = FALSE, echo = FALSE}
# eventually load in simulations
````