distrr provides some tools to estimate and manage empirical distributions. In particular, one of the main features of distrr is the creation of data cubes of estimated statistics, that include all the combinations of the variables of interest. The package makes strong usage of the tools provided by dplyr, which is a grammar of data manipulation.
The main functions to create a data cube are dcc5()
and
dcc6()
(dcc
stands for data cube
creation).
The data cube creation is like:
data %>%
group_by(some variables) %>%
summarise(one or more estimated statistic)
in dplyr terms, but the operation is done for each possible combination of the variables used for grouping. The result will be a data frame in “tidy form”. See some examples in the Usage section below.
# From CRAN
install.packages("distrr")
# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("gibonet/distrr")
Consider the invented_wages
dataset:
library(distrr)
str(invented_wages)
#> Classes 'tbl_df' and 'data.frame': 1000 obs. of 5 variables:
#> $ gender : Factor w/ 2 levels "men","women": 1 2 1 2 1 1 1 2 2 2 ...
#> $ sector : Factor w/ 2 levels "secondary","tertiary": 2 1 2 2 1 1 2 1 2 1 ...
#> $ education : Factor w/ 3 levels "I","II","III": 3 2 2 2 2 1 3 1 2 2 ...
#> $ wage : num 8400 4200 5100 7400 4300 4900 5400 2900 4500 3000 ...
#> $ sample_weights: num 105 32 36 12 21 46 79 113 34 32 ...
If we want to count the number of observations and estimate the average wage by gender, with dplyr we can do:
library(dplyr)
%>%
invented_wages group_by(gender) %>%
summarise(n = n(), av_wage = mean(wage))
#> # A tibble: 2 x 3
#> gender n av_wage
#> <fct> <int> <dbl>
#> 1 men 547 5435.
#> 2 women 453 4441.
We can estimate the same statistics but grouped by education by
changing the argument inside group_by
:
%>%
invented_wages group_by(education) %>%
summarise(n = n(), av_wage = mean(wage))
#> # A tibble: 3 x 3
#> education n av_wage
#> <fct> <int> <dbl>
#> 1 I 172 3774.
#> 2 II 719 5099.
#> 3 III 109 6139.
and estimate the statistics by gender and education including both
variables in group_by
:
%>%
invented_wages group_by(gender, education) %>%
summarise(n = n(), av_wage = mean(wage))
#> # A tibble: 6 x 4
#> # Groups: gender [2]
#> gender education n av_wage
#> <fct> <fct> <int> <dbl>
#> 1 men I 60 4627.
#> 2 men II 409 5278.
#> 3 men III 78 6886.
#> 4 women I 112 3317.
#> 5 women II 310 4865.
#> 6 women III 31 4261.
With dcc5
we can perform all the steps above with one
call:
%>%
invented_wages dcc5(.variables = c("gender", "education"), av_wage = ~mean(wage))
#> # A tibble: 12 x 4
#> gender education n av_wage
#> * <fct> <fct> <int> <dbl>
#> 1 Totale Totale 1000 4985.
#> 2 Totale I 172 3774.
#> 3 Totale II 719 5099.
#> 4 Totale III 109 6139.
#> 5 men Totale 547 5435.
#> 6 men I 60 4627.
#> 7 men II 409 5278.
#> 8 men III 78 6886.
#> 9 women Totale 453 4441.
#> 10 women I 112 3317.
#> 11 women II 310 4865.
#> 12 women III 31 4261.
The resulting data frame contains a column for each grouping variable, and the estimations of all the combinations of the variables:
.all
, which by default
is TRUE
).Note that in the result there are some rows where the variables take
the value "Totale"
. When a variable has this value, it
means that the subset of the data considered in that row contains all
the values of the variable. For example, the first row of the result of
dcc5
contains the estimations for all the dataset. The
value "Totale"
can be changed with the argument
.total
.
The same result of dcc5
can be produced by
dcc6
, with a slightly different approach.
# Set a list of function calls
<- list(
list_of_funs n = ~n(),
av_wage = ~mean(wage),
weighted_av_wage = ~weighted.mean(wage, sample_weights)
)
# Set the grouping variables
<- c("gender", "education")
vars
# And create the data cube with dcc6
%>%
invented_wages dcc6(.variables = vars, .funs_list = list_of_funs, .total = "TOTAL")
#> # A tibble: 12 x 5
#> gender education n av_wage weighted_av_wage
#> * <fct> <fct> <int> <dbl> <dbl>
#> 1 TOTAL TOTAL 1000 4985. 4645.
#> 2 TOTAL I 172 3774. 3527.
#> 3 TOTAL II 719 5099. 4917.
#> 4 TOTAL III 109 6139. 5885.
#> 5 men TOTAL 547 5435. 5323.
#> 6 men I 60 4627. 4681.
#> 7 men II 409 5278. 5129.
#> 8 men III 78 6886. 6173.
#> 9 women TOTAL 453 4441. 3614.
#> 10 women I 112 3317. 3227.
#> 11 women II 310 4865. 4225.
#> 12 women III 31 4261. 4388.
Compared to the results obtained with dcc5
, we added the
weighted average of wages and changed the "Totale"
value to
"TOTAL"
.