cat2cat
The introduced cat2cat
algorithm was designed to offer
an easy and clear interface to apply a transition table which was
provided by a data maintainer or built by a researcher. The objective is
to unify an inconsistent coded categorical variable in a panel dataset,
where a transition table is the core element of the process.
Examples of a dataset with such inconsistent coded categorical variable are ISCO (The International Standard Classification of Occupations) or ICD (International Classification of Diseases) based one. The both classifications are regularly updated to adjust to e.g. new science achievements. More clearly we might image that e.g. new science achievements brings new occupations types on the market or enable recognition of new diseases types.
The categorical variable encoding changes are typically provided by datasets providers in the form of transition table, for each time point the changes occurred. A transition table conveys information needed for matching all categories between two periods of time. More precisely it contains two columns where the first column contains old categories and the second column contains the new ones. Sometimes a transition table has to be created manually by a researcher.
The main rule is to replicate the observation if it could be assigned to a few categories. More precisely for each observation we look across a transition table to check how the original category could be mapped to the opposite period one. Then using simple frequencies or statistical methods to approximate probabilities of being assigned to each of them. For each observation that was replicated, the probabilities have to add up to one. The algorithm distinguishes different mechanics for panel data with and without unique identifiers.
There should be highlighted 3 important elements:
library(cat2cat)
library(dplyr)
data(verticals)
<- verticals[verticals$v_date == "2020-04-01", ]
agg_old <- verticals[verticals$v_date == "2020-05-01", ]
agg_new
## cat2cat_agg - could map in both directions at once although
## usually we want to have old or new representation
<- cat2cat_agg(data = list(old = agg_old,
agg new = agg_new,
cat_var = "vertical",
time_var = "v_date",
freq_var = "counts"),
%<% c(Automotive1, Automotive2),
Automotive c(Kids1, Kids2) %>% c(Kids),
%>% c(Home, Supermarket))
Home
## possible processing
$old %>%
agggroup_by(vertical) %>%
summarise(sales = sum(sales*prop_c2c), counts = sum(counts*prop_c2c), v_date = first(v_date))
## # A tibble: 11 × 4
## vertical sales counts v_date
## <chr> <dbl> <dbl> <chr>
## 1 Automotive1 49.4 87.1 2020-04-01
## 2 Automotive2 27.2 47.9 2020-04-01
## 3 Books 104. 7489 2020-04-01
## 4 Clothes 105. 1078 2020-04-01
## 5 Electronics 87.9 9544 2020-04-01
## 6 Fashion 94.5 7399 2020-04-01
## 7 Health 94.4 16102 2020-04-01
## 8 Home 94.3 2414 2020-04-01
## 9 Kids1 103. 17686 2020-04-01
## 10 Kids2 111. 32349 2020-04-01
## 11 Sport 91.1 4957 2020-04-01
$new %>%
agggroup_by(vertical) %>%
summarise(sales = sum(sales*prop_c2c), counts = sum(counts*prop_c2c), v_date = first(v_date))
## # A tibble: 11 × 4
## vertical sales counts v_date
## <chr> <dbl> <dbl> <chr>
## 1 Automotive1 100. 36453 2020-05-01
## 2 Automotive2 102. 20039 2020-05-01
## 3 Books 112. 14239 2020-05-01
## 4 Clothes 108. 27185 2020-05-01
## 5 Electronics 82.7 859 2020-05-01
## 6 Fashion 85.2 4981 2020-05-01
## 7 Health 104. 1934 2020-05-01
## 8 Home 178. 29375 2020-05-01
## 9 Kids1 37.3 309. 2020-05-01
## 10 Kids2 68.2 565. 2020-05-01
## 11 Sport 99.3 9843 2020-05-01
## the ean variable is a unique identifier
data(verticals2)
<- verticals2[verticals2$v_date == "2020-04-01", ]
vert_old <- verticals2[verticals2$v_date == "2020-05-01", ]
vert_new
## get transitions table
<- vert_old %>%
trans_v inner_join(vert_new, by = "ean") %>%
select(vertical.x, vertical.y) %>% distinct()
#
## cat2cat
## it is important to set id_var as then we merging categories 1 to 1
## for this identifier which exists in both periods.
<- cat2cat(
verts data = list(old = vert_old, new = vert_new, id_var = "ean", cat_var = "vertical", time_var = "v_date"),
mappings = list(trans = trans_v, direction = "backward")
)
data(occup)
data(trans)
<- occup[occup$year == 2008,]
occup_old <- occup[occup$year == 2010,] occup_new
## cat2cat
<- cat2cat(
occup_simple data = list(old = occup_old, new = occup_new, cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward")
)
## with informative features it might be usefull to run ml algorithm
## currently only knn, lda or rf (randomForest), a few methods could be specified at once
## where probability will be assessed as fraction of closest points.
<- cat2cat(
occup_2 data = list(old = occup_old, new = occup_new, cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = list(method = "knn", features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10))
)
# summary_plot
plot_c2c(occup_2$old, type = c("both"))
# mix of methods
<- cat2cat(
occup_2_mix data = list(old = occup_old, new = occup_new, cat_var = "code", time_var = "year"),
mappings = list(trans = trans, direction = "backward"),
ml = list(method = c("knn", "rf", "lda"), features = c("age", "sex", "edu", "exp", "parttime", "salary"),
args = list(k = 10, ntree = 50))
)# correlation between ml models and simple fequencies
$old %>% select(wei_knn_c2c, wei_rf_c2c, wei_lda_c2c, wei_freq_c2c) %>% cor() occup_2_mix
## wei_knn_c2c wei_rf_c2c wei_lda_c2c wei_freq_c2c
## wei_knn_c2c 1.0000000 0.8584472 0.8406724 0.8974173
## wei_rf_c2c 0.8584472 1.0000000 0.8886519 0.8738478
## wei_lda_c2c 0.8406724 0.8886519 1.0000000 0.8894577
## wei_freq_c2c 0.8974173 0.8738478 0.8894577 1.0000000
# cross all methods and subset one highest probability category for each subject
<- occup_2_mix$old %>%
occup_old_mix_highest1 cross_c2c(.) %>%
prune_c2c(.,column = "wei_cross_c2c", method = "highest1")
The replication process is neutral for calculating at least the first 2 central moments for all variables. This is because for each observation which was replicated, probabilities sum to one. If we are removing non-zero probability observations then replication probabilities have to be reweighed to still sum to one. Important note is that removing non zero probability observations should be done only if needed, as it impact the counts of categorical variable levels. More preciously removing non-zero weights will influence the regression model if we will use the unified categorical variable.
The next 3 regressions have the same results.
## orginal dataset
<- lm(I(log(salary)) ~ age + sex + factor(edu) + parttime + exp, occup_old, weights = multiplier)
lms2 summary(lms2)
##
## Call:
## lm(formula = I(log(salary)) ~ age + sex + factor(edu) + parttime +
## exp, data = occup_old, weights = multiplier)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -41.649 -4.154 -0.170 4.134 94.979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6049934 0.0175898 489.203 < 2e-16 ***
## age -0.0028783 0.0004561 -6.311 2.81e-10 ***
## sexTRUE 0.2539537 0.0050557 50.231 < 2e-16 ***
## factor(edu)2 -0.0799110 0.0097006 -8.238 < 2e-16 ***
## factor(edu)3 -0.3579335 0.0124684 -28.707 < 2e-16 ***
## factor(edu)4 -0.4252729 0.0072134 -58.956 < 2e-16 ***
## factor(edu)5 -0.4050551 0.0101847 -39.771 < 2e-16 ***
## factor(edu)6 -0.6473797 0.0072134 -89.746 < 2e-16 ***
## factor(edu)7 -0.5295860 0.0783593 -6.758 1.42e-11 ***
## factor(edu)8 -0.6751645 0.0110551 -61.073 < 2e-16 ***
## parttime 1.9342513 0.0114274 169.264 < 2e-16 ***
## exp 0.0128464 0.0004370 29.396 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.302 on 34164 degrees of freedom
## Multiple R-squared: 0.5797, Adjusted R-squared: 0.5796
## F-statistic: 4284 on 11 and 34164 DF, p-value: < 2.2e-16
## using one highest cross weights
## cross_c2c to cross differen methods weights
## prune_c2c - highest1 leave only one the highest probability obs for each subject
<- occup_2$old %>%
occup_old_2 cross_c2c(., c("wei_freq_c2c", "wei_knn_c2c"), c(1/2,1/2)) %>%
prune_c2c(.,column = "wei_cross_c2c", method = "highest1")
<- lm(I(log(salary)) ~ age + sex + factor(edu) + parttime + exp, occup_old_2, weights = multiplier)
lms summary(lms)
##
## Call:
## lm(formula = I(log(salary)) ~ age + sex + factor(edu) + parttime +
## exp, data = occup_old_2, weights = multiplier)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -41.649 -4.154 -0.170 4.134 94.979
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6049934 0.0175898 489.203 < 2e-16 ***
## age -0.0028783 0.0004561 -6.311 2.81e-10 ***
## sexTRUE 0.2539537 0.0050557 50.231 < 2e-16 ***
## factor(edu)2 -0.0799110 0.0097006 -8.238 < 2e-16 ***
## factor(edu)3 -0.3579335 0.0124684 -28.707 < 2e-16 ***
## factor(edu)4 -0.4252729 0.0072134 -58.956 < 2e-16 ***
## factor(edu)5 -0.4050551 0.0101847 -39.771 < 2e-16 ***
## factor(edu)6 -0.6473797 0.0072134 -89.746 < 2e-16 ***
## factor(edu)7 -0.5295860 0.0783593 -6.758 1.42e-11 ***
## factor(edu)8 -0.6751645 0.0110551 -61.073 < 2e-16 ***
## parttime 1.9342513 0.0114274 169.264 < 2e-16 ***
## exp 0.0128464 0.0004370 29.396 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.302 on 34164 degrees of freedom
## Multiple R-squared: 0.5797, Adjusted R-squared: 0.5796
## F-statistic: 4284 on 11 and 34164 DF, p-value: < 2.2e-16
## we have to adjust size of stds as we artificialy enlarge degrees of freedom
<- occup_2$old %>%
occup_old_3 prune_c2c(method = "nonzero") #many prune methods like highest
<- lm(I(log(salary)) ~ age + sex + factor(edu) + parttime + exp, occup_old_3, weights = multiplier * wei_freq_c2c)
lms_replicated # Adjusted R2 is meaningless here
$df.residual <- nrow(occup_old) - length(lms_replicated$assign)
lms_replicatedsuppressWarnings(summary(lms_replicated))
##
## Call:
## lm(formula = I(log(salary)) ~ age + sex + factor(edu) + parttime +
## exp, data = occup_old_3, weights = multiplier * wei_freq_c2c)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -34.550 -0.686 -0.045 0.616 65.260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.6049934 0.0175898 489.203 < 2e-16 ***
## age -0.0028783 0.0004561 -6.311 2.81e-10 ***
## sexTRUE 0.2539537 0.0050557 50.231 < 2e-16 ***
## factor(edu)2 -0.0799110 0.0097006 -8.238 < 2e-16 ***
## factor(edu)3 -0.3579335 0.0124684 -28.707 < 2e-16 ***
## factor(edu)4 -0.4252729 0.0072134 -58.956 < 2e-16 ***
## factor(edu)5 -0.4050551 0.0101847 -39.771 < 2e-16 ***
## factor(edu)6 -0.6473797 0.0072134 -89.746 < 2e-16 ***
## factor(edu)7 -0.5295860 0.0783593 -6.758 1.42e-11 ***
## factor(edu)8 -0.6751645 0.0110551 -61.073 < 2e-16 ***
## parttime 1.9342513 0.0114274 169.264 < 2e-16 ***
## exp 0.0128464 0.0004370 29.396 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.302 on 34164 degrees of freedom
## Multiple R-squared: 0.5797, Adjusted R-squared: -3.254
## F-statistic: 4284 on 11 and 34164 DF, p-value: < 2.2e-16
Example regression model with usage of the unified variable
(g_new_c2c
). A separate model for each occupational
group.
<- occup_2$old %>%
occup_old_4 prune_c2c(method = "nonzero") #many prune methods like highest
<- formula(I(log(salary)) ~ age + sex + factor(edu) + parttime + exp + factor(year))
formula_oo <- rbind(occup_old_4, occup_2$new) %>%
oo group_by(g_new_c2c) %>%
do(
lm = tryCatch(
summary(lm(formula_oo, ., weights = multiplier * wei_freq_c2c)),
error = function(e) NULL
)%>%
) filter(!is.null(lm))
head(oo)
## # A tibble: 6 × 2
## # Rowwise:
## g_new_c2c lm
## <chr> <list>
## 1 111201 <smmry.lm>
## 2 111301 <smmry.lm>
## 3 111405 <smmry.lm>
## 4 112001 <smmry.lm>
## 5 112002 <smmry.lm>
## 6 112003 <smmry.lm>
$lm[[1]] oo
##
## Call:
## lm(formula = formula_oo, data = ., weights = multiplier * wei_freq_c2c)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -7.277 -1.488 -0.145 1.362 6.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.520469 0.682841 12.478 < 2e-16 ***
## age 0.021666 0.013878 1.561 0.12433
## sexTRUE 0.089484 0.119301 0.750 0.45647
## factor(edu)2 -0.248113 0.187691 -1.322 0.19177
## factor(edu)4 -0.505550 0.255098 -1.982 0.05260 .
## parttime 1.758043 0.617628 2.846 0.00624 **
## exp 0.001256 0.010965 0.115 0.90920
## factor(year)2010 0.617386 0.148847 4.148 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.54 on 54 degrees of freedom
## Multiple R-squared: 0.4966, Adjusted R-squared: 0.4313
## F-statistic: 7.61 on 7 and 54 DF, p-value: 2.27e-06