The R package `penalizedclr` provides an implementation of the penalized logistic regression model that can be used in the analysis of matched case-control studies. The implementation allows to apply different penalties to different blocks of covariates, and is therefore particularly useful in the presence of multi-source heterogenous data, such as multiple layers of omics measurements. Both L1 and L2 penalties are implemented. Additionally, the package implements stability selection to allow for variable selection in the considered regression model.

You can install the released version of `penalizedclr` from CRAN with:

`install.packages("penalizedclr")`

And the development version from GitHub with:

```
library(devtools)
install_github("veradjordjilovic/penalizedclr")
```

Load the package with:

`library(penalizedclr)`

Assume that we have \(K\) independent matched case-control pairs \((Y_{ki}, X_{ki})\), where \(Y_{ki}\), \(k=1,\ldots,K;\) \(i=1,2,\) is a binary variable indicating case control status (1 if case, 0 if control) and \(X_{ki}\) is the associated \(p\)-dimensional vector of covariates. The conditional logistic regression models the probability of being a case given that the observation belongs to the \(k\)-th pair as: \[ {\mathrm {logit}}\left[P(Y_{ki}=1 \mid S=k)\right] = \beta_{0k} + \beta^{T}X_{ki}, \quad k\in\left\{1,\ldots, K\right\}, i\in\left\{1,2\right\} \] where \(S\) is the matched pair id, \(\beta_{0k}\) is the pair specific intercept and \(\beta=(\beta_1, \ldots,\beta_p)^T\) is a \(p\)-dimensional vector of fixed effects.

In the present package we:

estimate \(\beta\) in the high dimensional setting in which the number of covariates \(p\) is much higher than the number of pairs \(K\). We consider a penalized conditional logistic regression, which adds a penalty to the conditional log likelihood. Motivated by current medical applications considering clinical and molecular data, we allow \(X_{ki}\) to be a merge of heteregenous data sources;

perform stability selection to identify important variables, that is, variables for which \(\beta_j\neq 0\), \(j\in\left\{1,\ldots,p\right\}\).

In this section we provide examples of how to fit a penalized conditional regression model with source-specific penalty parameters and how to perform variable selection with `penalizedclr`.

Initial settings and libraries to be loaded:

```
set.seed(123)
require(tidyverse)
```

We generate a simple multi-source data set, with two groups of covariates containing 80 and 20 variables, respectively. As specified above, each case is matched to a control, and the probability of being a case in each stratum (case-control pair) is obtained from the linear predictor. An intercept is generated independently for each stratum.

```
# two groups of predictors
<- c(80, 20)
p
# percentage and number of non-null variables
<- c(0.2, 0.8)
p_nz <- round(p*p_nz, 0)
m_nz
# number of different strata (case-control pairs)
<- 125
K
# number of cases and controls in each stratum (not necessarily 1:1 matching, other designs are also allowed)
<- 1
n_cases <- 1
n_ctrl
# generating covariates
= cbind(matrix(rnorm(p[1] * K * (n_cases + n_ctrl), 0, 1), ncol = p[1]),
X matrix(rnorm(p[2] * K * (n_cases + n_ctrl), 0, 2), ncol = p[2]))
# coefficients
<- as.matrix(c(rnorm(m_nz[1], 0, 0.8),
beta rep(0, p[1] - m_nz[1]),
rnorm(m_nz[2], 0.1, 0.4),
rep(0, p[2] - m_nz[2])), ncol = 1)
<- rep(rnorm(K, 0, 2), each= n_cases+n_ctrl)
beta_stratum
# stratum membership
<- rep(1:K, each= n_cases+n_ctrl)
stratum
# linear predictor
<- beta_stratum + X %*% beta
lin_pred
<- exp(lin_pred) / (1 + exp(lin_pred))
prob_case
# generate the response
<- rep(0, length(stratum))
Y
<- as_tibble(data.frame(stratum = stratum,
data_sim probability = prob_case,
obs_id = 1 : length(stratum)))
<- data_sim %>%
data_sim_cases group_by(stratum)%>%
sample_n(n_cases, weight = probability)
$obs_id] <- 1 Y[data_sim_cases
```

The function `penalized.clr` fits a penalized conditional logistic regression model with different penalties for different blocks of covariates. The L1 penalty parameter `lambda`

(a numeric vector of the length equal to the number of blocks) can be specified by the user or computed internally. In the latter case, cross-validation is performed internally for each data layer separately. It is also possible to apply elastic net penalty through parameter `alpha`

.

This code illustrates how to fit `penalized.clr` with penalty parameters specified by the user. Here `Y` is the response vector, `X` is the multi-source matrix of covariates, `stratum` is the vector of ids of the matching pairs, and `p` is the vector of block dimensions. It has the same dimension as the vector of penalty parameters `lambda`. Penalty parameters are thus specified for each data source, i.e., a block of covariates. It is possible to standardize the variables by setting `standardize = TRUE` (`FALSE` by default).

```
<- penalized.clr(response = Y, penalized = X, stratum = stratum,
fit1 lambda = c(6,7), p = p, standardize = TRUE)
```

The function returns a list with elements:

`penalized`- Regression coefficients for the penalized covariates.`unpenalized`- Regression coefficients for the unpenalized covariates.`converged`- Whether the fitting process was judged to be converged.`lambda`- The tuning parameter for L1 used.`alpha`- The elastic net mixing parameter used.

For instance, `fit1$penalized`

is a numeric vector containing the regression coefficients for the penalized covariates. We can compare estimated coefficients with the true coefficients (the ones that generated our data), by constructing a `2 x 2`

contingency table cross-tabulating true non-zero and estimated non-zero coefficients:

```
<- (beta != 0) * 1 #index of nonzero coefficients
nonzero_index table(fitted = (fit1$penalized != 0) * 1, nonzero_index)
#> nonzero_index
#> fitted 0 1
#> 0 51 17
#> 1 17 15
```

When the user does not specify the penalty parameters, they are computed internally for each data source separately based on cross-validation.

```
<- penalized.clr(response = Y, penalized = X, stratum = stratum,
fit2 p = p,
standardize = TRUE)
```

The selected penalty coefficients are:

```
$lambda
fit2#> [1] 4.958049 5.641397
```

The package uses fast cross-validation implemented in the R package `clogitL1` (see the function `find.default.lambda` described below). We recommend to inspect manually the obtained `lambda`

parameters. It is also recommended to investigate different ratios of data-source specific penalties (see, for instance, Boulesteix et al. 2017 and examples of penalty grids therein). Note that cross validation depends on random data splits, and different runs will return different values of `lambda`

.

This code shows what happens when no information is given about the blocks of predictors (`p`

). In this case, all covariates are considered to form a single block and are penalized equally.

```
<- penalized.clr(response = Y, penalized = X, stratum = stratum,
fit3 standardize = TRUE)
```

The selected penalty coefficient is:

```
$lambda
fit3#> [1] 4.62712
```

Somethimes, a subset of covariates should be excluded from penalization. This can be achieved by specifying the `unpenalized`

argument. In what follows, we penalize the first block of covariates, and leave the remaining block unpenalized.

```
<- X[, 1:p[1]]
X1 <- X[, (p[1]+1):(p[1]+p[2])]
X2 <- penalized.clr(response = Y, penalized = X1, unpenalized = X2,
fit4 stratum = stratum, p = p[1],
standardize = TRUE)
```

This can be particularly useful when performing variable selection on omics variables (penalized) while adjusting for clinical covariates that should not be penalized.

The package `penalizedclr` is not limited to L1, i.e. lasso, penalty. While L1 penalty is more suited for variable selection, in the presence of highly correlated covariates, it can be useful to add small amount of L2 penalty. The two can be combined in an elastic net framework by specifying the mixing parameter `alpha`

that can assume values between 0 and 1. Default `alpha=1`

gives lasso.

```
<- penalized.clr(response = Y, penalized = X, stratum = stratum,
fit2 p = p,
standardize = TRUE, alpha = 0.6)
```

The function `stable.clr` performs stability selection (Meinshausen and Bühlmann 2010 and Shah and Samworth 2013) for variable selection in penalized conditional logistic regression. For details on stability selection, we refer to the original publications. Briefly, a set of L1 penalties is considered, and for each considered value of the penalty, \(2B\) random subsamples of \(\lfloor K/2 \rfloor\) matched pairs are taken and a penalized model is estimated. For each covariate and penalty value, selection probability is estimated as the proportion of estimated models in which the associated coefficient estimate is different from zero. Finally, the overall selection probability of a variable is obtained by taking the maximum selection probability over all considered penalty values.

Parameter \(B\) is set to 100 by default, but can be changed by the user. Note that this choice will have an impact on the computation time, and higher values of \(B\) will lead to a slower computation.

The function returns a list containing the selection probabilities for each covariate, i.e. the proportion of estimated models in which the associated coefficient estimate is different from zero. The user can then set a threshold for selection probability (values ranging from 0.55 to 0.9 are recommended) and obtain the set of selected covariates.

The following code performs stability selection when all covariates are considered as a single block (a single data source) and a sequence of L1 penalties contains only 2 values.

```
<- stable.clr(response = Y,
stable1 penalized = X, stratum = stratum,
lambda.seq = c(10,20))
```

The function returns a list with two elements:

`Pilambda`a numeric vector giving estimates of selection probabilities for each penalized covariate.`lambda.seq`a sequence of L1 penalties used.

To inspect the results, we can, for instance, print the covariates with selection probability higher than 0.6:

```
which(stable1$P>0.6)
#> [1] 82 86 87 90
```

It is possible to obtain the sequence of L1 penalty parameters via the function `find.default.lambda`. It relies on the `cv.clogitL1`

function of the `clogitL1`

package to perform cross-validation to determine a suitable `lambda`

sequence. Note that it considers each data source separately and returns a sequence of four values per source (the optimal penalty along with three additional values, see package documentation for details). The number of folds is set to 10 by default but can also be specified by the user.

```
<- find.default.lambda(response = Y,
lambda.seq stratum = stratum,
penalized = X,
alpha=1,
nfolds = 10)
lambda.seq#> [1] 5.300950 6.988013 9.211995 12.143774
```

```
<- stable.clr(response = Y,
stable2 penalized = X, stratum = stratum,
lambda.seq = lambda.seq)
```

Covariates with selection probability higher than 0.6:

```
which(stable2$P>0.6)
#> [1] 1 14 58 82 86 87 90 93
```

If we do not specify the penalty parameters as in case 2, `stable.clr` will compute them by default by using `find.default.lambda` with default options. We only need to run:

```
<- stable.clr(response = Y,
stable3 penalized = X,
stratum = stratum)
```

The selected sequence of `lambda`

is:

```
$lambda.seq
stable3#> [1] 6.521593 8.023312 9.870830 12.143774
```

Covariates with selection probability higher than 0.6:

```
which(stable3$P>0.6)
#> [1] 1 82 86 87 90 93
```

This code implements stability selection while taking into account the block structure of covariates (\(p\)). To achieve this, we use the function `stable.clr.g` which extends `stable.clr` to allow for blocks (groups) of covariates. In this case, the function takes as an argument `lambda.list`

, a list of sequences of L1 penalties, one for each block. Note that the computation time depends on the length of the block-specific sequences: for each combination of penalties, \(2B\) subsamples are taken and used to estimate a penalized conditional logistic regression model.

```
<- find.default.lambda(response = Y,
lambda.list penalized = X, stratum = stratum,
p = p)
lambda.list#> [[1]]
#> [1] 10.67917 13.13825 16.16358 19.88555
#>
#> [[2]]
#> [1] 9.211995 12.143774 16.008612 21.103461
```

```
<- stable.clr.g(response = Y,
stable.g1 penalized = X,
stratum = stratum,
p = p,
lambda.list = lambda.list)
```

```
which(stable.g1$P>0.6)
#> [1] 82 86 87 90 93
```

Note that if \(p\) is not specified, the package will automatically run `stable.clr` and apply equal penalties for all covariates.

Boulesteix, A. L., De Bin, R., Jiang, X., & Fuchs, M. (2017). IPF-LASSO: Integrative-penalized regression with penalty factors for prediction based on multi-omics data. Computational and mathematical methods in medicine, 2017.

Meinshausen, N., & Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), 417-473.

Shah, R. D., & Samworth, R. J. (2013). Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 55-80.