`penppml`

is an R package that enables users to fit
penalized Poisson Pseudo Maximum Likelihood (PPML) regressions with
high-dimensional fixed effects (HDFE). Supported penalties in the
current version are ridge and lasso. The original application that
motivated the development of `penppml`

was the estimation of
three-way gravity models of trade with a large number of PTA provision
dummies (Breinlich, Corradi, Rocha, Ruta, Santos Silva and Zylkin,
2021).

To install and load `penppml`

, you can use the following
commands:

`install.packages("penppml")`

`library(penppml)`

`::load_all() devtools`

The `penppml`

package features the `trade`

data
set, which integrates panel data on bilateral trade flows with
information about specific provisions in trade agreements for 220
exporters and 270 importers. The provisions included in the package are
a subset of 16 out of 305 “essential” provisions featured in the full
data set. More information about the data set and the variables is
included in the corresponding help file, accessible via
`?trade`

.

Along with the `trade`

data set, the package includes an
auxiliary data frame, `countries`

, which contains basic
information about the country ISO codes included in the main data
set.

This enables users to easily filter by region or subregion. For instance, if we want to restrict our analysis to countries in the Americas, we can do the following:

```
<- countries$iso[countries$region %in% c("Americas")]
selected <- trade[(trade$exp %in% selected) & (trade$imp %in% selected), -(5:6)] # We remove columns 5 and
trade2 # 6 because these variables are not needed in our regressions.
```

We will show the capabilities of this package using the filtered
`trade2`

data frame.

The package enables users to run unpenalized PPML regressions with
HDFE, using the `hdfeppml`

function as follows:

```
<- hdfeppml(data = trade2,
reg1 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")))
```

As we can see, the function is designed to be data-frame-friendly:
the user sets a data frame of reference in the `data`

argument and then picks the dependent, independent and fixed effect
variables either by name or by column number within the provided data
frame. Also, note the following points about the syntax:

The

`fixed`

argument can take either a single vector (just like the`dep`

argument; each element will be used as a separate fixed effect) or a list of vectors. The latter option is useful in cases where the desired fixed effect is the result of the interaction of two or more variables, as in the gravity model of trade. In those cases, you can specify any number of separate fixed effects in distinct elements of the list and, inside each element, which variables in the data set you want to interact.As explained in the note, if

`indep`

is empty, the function uses all remaining columns in the data frame by default.

Internally, the function will do the necessary transformations of the columns into vectors and matrices, as needed by the algorithm.

Alternatively, more advanced users who prefer to handle data
transformations themselves can use the internal function
`hdfeppml_int`

. For more information and examples on this
issue, run `?hdfeppml`

or `?hdfeppml_int`

. Note
also that this applies to all of the main functions in our package: both
the data-frame-friendly wrapper and the internal function are available
for use.

To obtain the results associated with our PPML model:

```
<- data.frame(prov = rownames(reg1$coefficients), b = reg1$coefficients, se = 0)
results $se[!is.na(reg1$coefficients)] <- reg1$se
results results
```

If we want to specify the independent variable, we can specify indep as pointing to columns 5 to 7 with the following code.

```
<- hdfeppml(data = trade2, indep=5:7,
reg1_indep dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")))
```

The `mlfitppml`

function is a flexible tool for computing
penalized PPML regressions with HDFE. For instance, if we want to fit a
PPML regression with a lasso penalty for several values of the penalty
parameter (lambda) at once, we can run:

```
<- c(0.05, 0.025, 0.01, 0.0075, 0.005, 0.0025, 0.001, 0.00075, 0.0005, 0.00025, 0.0001, 0)
lambdas
<- mlfitppml(data = trade2,
reg2 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambdas = lambdas)
```

The function returns a list that contains two sets of coefficients
for each value of lambda: the penalized coefficients (in the
`beta_pre`

element) and the post-penalty or unpenalized
coefficients (in the `beta`

element; these are calculated by
estimating a post-penalty regression with just the selected
variables).

If the user wishes to obtain the penalized estimates for a single
value of lambda, they can either use `mlfitppml`

as described
above (just setting `lambdas == x`

, where `x`

is a
number) or use the `penhdfeppml`

function, upon which
`mlfitppml`

is built, directly. For instance:

```
<- penhdfeppml(data = trade2,
reg3 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambda = 0.005)
```

For more details on `penhdfeppml`

, run
`?penhdfeppml`

.

`mlfitppml`

also allows user to use the ridge penalty in
their PPML regressions. Simply run:

```
<- seq(0.0001, 0, length.out = 10)
lambdas
<- mlfitppml(data = trade2,
reg4 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "ridge",
lambdas = lambdas)
```

Note that this feature is still in development and may contain bugs.

The `mlfitppml`

function enables users to carry out
cross-validation of their models via the `xval`

and
`IDs`

arguments. When `xval`

is set to
`TRUE`

, the function performs cross-validation using a
user-provided vector of IDs. For instance, if we want to do k-fold cross
validation with k = 20, splitting the data set by agreement (not by
observation), we can do the following:

```
<- unique(trade[(trade$exp %in% selected) & (trade$imp %in% selected), 5])
id <- 20
nfolds <- data.frame(id = id, fold = sample(1:nfolds, size = length(id), replace = TRUE))
unique_ids
<- merge(trade[(trade$exp %in% selected) & (trade$imp %in% selected), 5, drop = FALSE],
cross_ids by = "id", all.x = TRUE) unique_ids,
```

Now we activate the cross-validation option in `mlfitppml`

and input the ID vector:

```
<- mlfitppml(data = trade2,
reg5 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
lambdas = c(seq(0.5, 0.1, by = -0.1), 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0),
xval = TRUE,
IDs = cross_ids$fold)
```

Now the function also returns a `rmse`

element that
includes the cross-validation results (the average RMSE or root mean
squared error for each value of lambda). Users can employ this tool to
choose the value of the penalty parameter that minimizes the RMSE:

`$rmse reg5`

This package also enables the use of the plugin lasso method,
incorporating coefficient-specific penalty weights calculated
automatically by the package. The most convenient way to do this is to
set `method = "plugin"`

in `mlfitppml`

. Note that
the plugin algorithm requires a clustering variable - we are using the
interaction of the exporter and importer variables in the
`trade`

data set:

```
<- mlfitppml(data = trade2,
reg6 dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
method = "plugin",
cluster = c("exp", "imp"))
```

To display the plugin lasso results:

```
<- data.frame(prov = rownames(reg6$beta), b_pre = reg6$beta_pre, b = reg6$beta, se = 0)
results $se <- reg6$ses[1,]
results results
```

We can also set the gamma value as defined in Belloni et al. to something less strict, e.g. 0.3.

```
<- mlfitppml(data = trade2,
reg6_gamma dep = "export",
fixed = list(c("exp", "time"),
c("imp", "time"),
c("exp", "imp")),
penalty = "lasso",
method = "plugin",
cluster = c("exp", "imp"), gamma_val=0.3)
rownames(reg6$beta)[which(reg6_gamma$beta != 0)]
```

```
<- data.frame(prov = rownames(reg6_gamma$beta), b_pre = reg6_gamma$beta_pre, b = reg6_gamma$beta, se = 0)
results_gamma $se <- reg6_gamma$ses[1,]
results_gamma results_gamma
```

This package also allows users to implement the two-step lasso
described in Breinlich et al. (2021). This method consists in, first,
running a plugin lasso estimation and second, running individual lasso
regressions on each of the variables selected in the first stage. The
`iceberg`

function is designed precisely with this second
step in mind. It takes a vector of dependent variables and returns a
lasso regression for each one of them:

```
<- iceberg(data = trade2[, -(1:4)],
iceberg_results dep = results$prov[results$b != 0],
selectobs = (trade2$time == "2016"))
```

Currently, the function returns a matrix with coefficient estimates for each of the selected provisions. Support for standard errors (including clustered) is in development as of the current version of the package.

Since PPML coefficients can’t be easily interpreted, you may find it useful to see the raw correlations between the variables in the iceberg lasso step:

```
<- cor(trade2[, results$prov])
provcorr <- provcorr[, results$prov[results$b != 0]]) (provcorr
```

To illustrate this, we select the trade data, but now not dropping id and agreement variable.

We want to cluster by agreement, that means that we draw observations with the same agreement ID. For this, we need to create a new agreement ID in the following way:

```
<- trade[(trade$exp %in% selected) & (trade$imp %in% selected), ] # Now, we need id and agreement variable
trade3 # Let's cluster by agreement
$alt_id <- trade3$id # ID refers to agreement ID
trade3$alt_id[is.na(trade3$alt_id)] <- 0 # We set this to zero when the ID is missing, interpreting this as the country pair not being part of any agreement.
trade3
# Create pair ID
<- do.call(paste, as.data.frame(t(apply(trade3[1:2], 1, sort))))
v1 $pair <- match(v1, unique(v1))
trade3$pair <- trade3$pair + 500
trade3
# Create maximal ID from the preexisting ID-variable inside each pair
<- within(trade3, {alt_id2 = ave(alt_id,pair,FUN=max)} ) # This creates the maximum of the ID for each pair. This adjusts for the fact that some pairs might have been part of different agreements and we want to take get a unique agreement ID for each pair.
trade3
$alt_id2[trade3$alt_id2==0] <- trade3$pair[trade3$alt_id2==0]
trade3unique(trade3$alt_id2) # This cluster variable collects pairs for pairs that are in agreement, uses the pair ID for those that are not.
# Thus, it allows errors to be clustered within agreements.
<- factor(trade3$alt_id2)
alt_id2 $clus <- alt_id2 #Add the ID to the data trade3
```

Next, we run Bootstrap Lasso, which means drawing agreement clusters with replacement using the ID just created. In the bootstrap() command, first hdfeppml() is run, using a dummy which is one when any provision was in place and zero otherwise to get initial values of mu. These are then used in plugin Lasso. This is repeated B times (as specified in bootreps) and the command selects those provisions selected by plugin in at least a fraction of cases. This fraction can be specified by the option boot_threshold (Default is 1 pc).

```
set.seed(123)
<- bootstrap(data=trade3, dep="export", cluster_id="clus", fixed=list(c("exp", "time"), c("imp", "time"), c("exp", "imp")), indep=7:22, bootreps=10, colcheck_x = TRUE, colcheck_x_fes = TRUE, boot_threshold = 0.01, post=TRUE, gamma_val=0.01, verbose=FALSE) bs1
```

Breinlich, H., Corradi, V., Rocha, N., Ruta, M., Santos Silva, J.M.C. and T. Zylkin, T. (2021). “Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements”, Policy Research Working Paper; No. 9629. World Bank, Washington, DC.

Correia, S., P. Guimaraes and T. Zylkin (2020). “Fast Poisson
estimation with high dimensional fixed effects”, *STATA Journal*,
20, 90-115.

Gaure, S (2013). “OLS with multiple high dimensional category
variables”, *Computational Statistics & Data Analysis*, 66,
8-18.

Friedman, J., T. Hastie, and R. Tibshirani (2010). “Regularization
paths for generalized linear models via coordinate descent”, *Journal
of Statistical Software*, 33, 1-22.

Belloni, A., V. Chernozhukov, C. Hansen and D. Kozbur (2016).
“Inference in high dimensional panel models with an application to gun
control”, *Journal of Business & Economic Statistics*, 34,
590-605.