Introduction

Iterative proportional fitting (IPF) is used in many disciplines to adjust an initial set of weights to match various marginal distributions. This package implements the iterative proportional updating algorithm based on the paper from Arizona State University (IPU). In survey raking or population synthesis, the IPU algorithm has the added advantage of being able to match household- and person-level marginal targets.

The primary functions of the ipfr package are discussed below.

ipu_matrix()
ipu()
synthesize()

suppressPackageStartupMessages({
  library(ipfr)
  library(dplyr)
})

Example: 2D / Matrix

The first example shows how to use ipfr to solve one of the most common IPF problems: Given a matrix of starting weights and a set of row and column targets, balance the matrix in order to satisfy both row and column targets.

set.seed(42)
mtx <- matrix(data = runif(9), nrow = 3, ncol = 3)
row_targets <- c(3, 4, 5)
column_targets <- c(5, 4, 3)
mtx
#>           [,1]      [,2]      [,3]
#> [1,] 0.9148060 0.8304476 0.7365883
#> [2,] 0.9370754 0.6417455 0.1346666
#> [3,] 0.2861395 0.5190959 0.6569923

The resulting matrix satisfies both row and column targets while retaining a similar relative distribution of weights in each cell.

result <- ipu_matrix(mtx, row_targets, column_targets)
result
#>          [,1]      [,2]      [,3]
#> [1,] 1.317384 0.9349039 0.7477130
#> [2,] 2.443984 1.3084532 0.2475772
#> [3,] 1.238632 1.7566429 2.0047098
rowSums(result)
#> [1] 3.000001 4.000015 4.999985
colSums(result)
#> [1] 5 4 3

Example: Household survey (simple)

The matrix example is common and conceptually simple, but makes thinking about higher-dimension problems difficult. A much better way to think about the problem (whether in two or ten dimensions) is as a table.

This example creates a household survey that will act as the seed (like the matrix from the last example). The number of persons and autos for each household is listed in the size and autos columns. A starting weight of 1 is also included.

survey <- tibble(
  size = c(1, 2, 1, 1),
  autos = c(0, 2, 2, 1),
  weight = 1
)

survey
#> # A tibble: 4 x 3
#>    size autos weight
#>   <dbl> <dbl>  <dbl>
#> 1     1     0      1
#> 2     2     2      1
#> 3     1     2      1
#> 4     1     1      1

Assume that we know the total number of households by size and households by auto-ownership from the Census. These target marginal distributions are just like our row and column totals from the previous example. Targets are passed into the ipu function as shown below. Take the size targets, for example:

The list name size corresponds to the size column in the survey.
The column names 1 and 2 correspond to the unique values in the survey’s size column.

targets <- list()
targets$size <- tibble(
  `1` = 75,
  `2` = 25
)
targets$autos <- tibble(
  `0` = 25,
  `1` = 50,
  `2` = 25
)

targets
#> $size
#> # A tibble: 1 x 2
#>     `1`   `2`
#>   <dbl> <dbl>
#> 1    75    25
#> 
#> $autos
#> # A tibble: 1 x 3
#>     `0`   `1`   `2`
#>   <dbl> <dbl> <dbl>
#> 1    25    50    25

A call to ipu() returns a named list of results.

result <- ipu(survey, targets)
names(result)
#> [1] "weight_tbl"   "weight_dist"  "primary_comp"

The first element is the resulting weight table. The primary seed has three columns appended:

weight
- The new/expanded weight of the record.
avg_weight
- The average weight, which is the total target / number of seed records.
weight_factor
- Each record’s weight divided by the average weight. This is a useful diagnostic to measure how extreme a weight is.

result$weight_tbl
#> # A tibble: 4 x 6
#>    size autos weight    id avg_weight weight_factor
#>   <dbl> <dbl>  <dbl> <int>      <dbl>         <dbl>
#> 1     1     0 25.0       1         25        1.00  
#> 2     2     2 24.7       2         25        0.990 
#> 3     1     2  0.255     3         25        0.0102
#> 4     1     1 50.0       4         25        2.00

The second element is a histogram of the weight_factor. This provides a quick overview of the distribution of weights. Many large or small weights can indicate underlying problems even if the ipu converged and matches targets. For instance, a record with a weight factor of 50 means that it is greatly over-represented in the re-weighted survey.

result$weight_dist

The next element is a comparison back to the targets provided. With complex seed and target tables, this makes investigating results quick and easy.

result$primary_comp
#> # A tibble: 5 x 6
#>   geography category target result  diff pct_diff
#>   <chr>     <chr>     <dbl>  <dbl> <dbl>    <dbl>
#> 1 geo_all   autos_0      25   25.0  0        0   
#> 2 geo_all   autos_1      50   50.0  0        0   
#> 3 geo_all   autos_2      25   25    0        0   
#> 4 geo_all   size_1       75   75.3  0.26     0.34
#> 5 geo_all   size_2       25   24.7 -0.26    -1.02

If secondary targets are provided to ipu(), a fourth item in the list will contain a secondary_comp table.

Example: Add person targets

In household survey expansion, it is common to want to control for certain features that describe households (like size) while controlling for other attributes that describe people (like age). This is possible with the ipu() function.

This example is taken directly from the Arizona paper on page 20: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.537.723&rep=rep1&type=pdf

In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).

setup_arizona() creates the seed and target tables used in the example.

result <- setup_arizona()
hh_seed <- result$hh_seed
hh_targets <- result$hh_targets
per_seed <- result$per_seed
per_targets <- result$per_targets

The household seed table is the primary_seed
The household target list is the primary_target
The person seed table is the secondary_seed
The person target list is the secondary_target

In the interest of keeping vignette build time short, the ipu() algorithm is only run for 30 iterations. After running for 400 or more iterations, the results match closely to those shown in the paper.

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The comparison table for the secondary (person) marginals is now included in result. Feel free to run the code chunk above for 400 or more iterations and then look again.

result$weight_tbl
#> # A tibble: 8 x 5
#>      id hhtype weight avg_weight weight_factor
#>   <int>  <dbl>  <dbl>      <dbl>         <dbl>
#> 1     1      1   3.77       12.5         0.301
#> 2     2      1  23.0        12.5         1.84 
#> 3     3      1   6.50       12.5         0.520
#> 4     4      2  25.7        12.5         2.06 
#> 5     5      2  17.4        12.5         1.39 
#> 6     6      2   7.26       12.5         0.581
#> 7     7      2   4.21       12.5         0.337
#> 8     8      2   7.26       12.5         0.581
  
result$primary_comp
#> # A tibble: 2 x 6
#>   geography category target result  diff pct_diff
#>   <chr>     <chr>     <dbl>  <dbl> <dbl>    <dbl>
#> 1 geo_all   hhtype_1     35   33.3 -1.73    -4.94
#> 2 geo_all   hhtype_2     65   61.9 -3.15    -4.84

result$secondary_comp
#> # A tibble: 3 x 6
#>   geography category  target result  diff pct_diff
#>   <chr>     <chr>      <dbl>  <dbl> <dbl>    <dbl>
#> 1 geo_all   pertype_1     91   88.4 -2.58    -2.83
#> 2 geo_all   pertype_2     65   63.8 -1.16    -1.79
#> 3 geo_all   pertype_3    104  104    0        0

Example 3: Using multiple geographies

ipu() allows different geographies to be specified for different marginal tables. There are a few rules that make this possible, but in short, a geography field in each target table tells the algorithm which scale to constrain to. If no geography field is present in a target table, it applies to the entire seed table (i.e. region).

All of the following rules are checked by the function and a warning message will show which (if any) is violated.

The primary seed table must contain all geography fields used by any target tables.
Do not duplicate geography fields on the secondary seed table.
- This prevents potential errors/inconsistencies between seed tables.
All fields that designate geographies must start with “geo_” e.g.
- geo_tract
- geo_region
- geo_state

To demonstrate, the Arizona example is modified to add two different tracts for household controls but to still control the person targets at the regional level.

# Repeat the hh_seed to create tract 1 and 2 households
new_hh_seed <- hh_seed %>%
  mutate(geo_tract = 1)
new_hh_seed <- bind_rows(
  new_hh_seed,
  new_hh_seed %>% 
    mutate(geo_tract = 2, id = id + 8)
)
new_hh_seed$geo_region <- 1

new_hh_seed
#> # A tibble: 16 x 4
#>       id hhtype geo_tract geo_region
#>    <dbl>  <dbl>     <dbl>      <dbl>
#>  1     1      1         1          1
#>  2     2      1         1          1
#>  3     3      1         1          1
#>  4     4      2         1          1
#>  5     5      2         1          1
#>  6     6      2         1          1
#>  7     7      2         1          1
#>  8     8      2         1          1
#>  9     9      1         2          1
#> 10    10      1         2          1
#> 11    11      1         2          1
#> 12    12      2         2          1
#> 13    13      2         2          1
#> 14    14      2         2          1
#> 15    15      2         2          1
#> 16    16      2         2          1

# Repeat the household targets for two tracts
new_hh_targets <- hh_targets
new_hh_targets$hhtype <- bind_rows(hh_targets$hhtype, hh_targets$hhtype)
new_hh_targets$hhtype <- new_hh_targets$hhtype %>%
  mutate(geo_tract = c(1, 2))

new_hh_targets$hhtype
#> # A tibble: 2 x 3
#>     `1`   `2` geo_tract
#>   <dbl> <dbl>     <dbl>
#> 1    35    65         1
#> 2    35    65         2

# Repeat the per_seed to create tract 1 and 2 persons
new_per_seed <- bind_rows(
  per_seed,
  per_seed %>% 
    mutate(id = id + 8)
)

new_per_seed
#> # A tibble: 46 x 2
#>       id pertype
#>    <dbl>   <dbl>
#>  1     1       1
#>  2     1       2
#>  3     1       3
#>  4     2       1
#>  5     2       3
#>  6     3       1
#>  7     3       1
#>  8     3       2
#>  9     4       1
#> 10     4       3
#> # … with 36 more rows

# Double the regional person targets
new_per_targets <- per_targets
new_per_targets$pertype <- per_targets$pertype %>%
  mutate_all(list(~. * 2)) %>%
  mutate(geo_region = 1)

new_per_targets$pertype
#> # A tibble: 1 x 4
#>     `1`   `2`   `3` geo_region
#>   <dbl> <dbl> <dbl>      <dbl>
#> 1   182   130   208          1

Run the IPU algorithm. Again, for vignette build time, only 30 iterations are performed. Run the code yourself with max_iterations set to 600 to see the converged result.

result <- ipu(
  new_hh_seed, new_hh_targets,
  new_per_seed, new_per_targets,
  max_iterations = 30
)

The tables below show the results compared back to targets.

result$primary_comp
#> # A tibble: 4 x 6
#>   geography   category target result  diff pct_diff
#>   <chr>       <chr>     <dbl>  <dbl> <dbl>    <dbl>
#> 1 geo_tract_1 hhtype_1     35   33.3 -1.73    -4.94
#> 2 geo_tract_1 hhtype_2     65   61.9 -3.15    -4.84
#> 3 geo_tract_2 hhtype_1     35   33.3 -1.73    -4.94
#> 4 geo_tract_2 hhtype_2     65   61.9 -3.15    -4.84

result$secondary_comp
#> # A tibble: 3 x 6
#>   geography    category  target result  diff pct_diff
#>   <chr>        <chr>      <dbl>  <dbl> <dbl>    <dbl>
#> 1 geo_region_1 pertype_1    182   177. -5.16    -2.83
#> 2 geo_region_1 pertype_2    130   128. -2.33    -1.79
#> 3 geo_region_1 pertype_3    208   208   0        0

Advanced options

ipu offers some advanced options for controlling the behavior. Weights can be capped with min_ratio and max_ratio options. Secondary target importance can be adjusted using the secondary_importance option. For info on using these options appropriately, see the common_ipf_problems vignette.

Synthetic Populations

The weight_tbl from ipu() can be fed into the synthesize() function to generate a synthetic population. The code block below takes the output from the previous example to demonstrate. The group_by option tells the function to group by the geo_tract field before random sampling takes places. This will ensure that the number of sampled records in each tract equals the total weight in each tract.

A new_id is created for each synthetic member (usually a household), but the old ID is preserved to facilitate joins or other operations where it could be useful.

set.seed(42)
synthesize(result$weight_tbl, group_by = "geo_tract") %>%
  head()
#> # A tibble: 6 x 5
#>   new_id    id hhtype geo_tract geo_region
#>    <int> <dbl>  <dbl>     <dbl>      <dbl>
#> 1      1     3      1         1          1
#> 2      2     7      2         1          1
#> 3      3     2      1         1          1
#> 4      4     8      2         1          1
#> 5      5     5      2         1          1
#> 6      6     5      2         1          1

A tidyverse-style call also works, as shown below.

set.seed(42)
result$weight_tbl %>%
  group_by(geo_tract) %>%
  synthesize() %>%
  head()
#> # A tibble: 6 x 5
#>   new_id    id hhtype geo_tract geo_region
#>    <int> <dbl>  <dbl>     <dbl>      <dbl>
#> 1      1     3      1         1          1
#> 2      2     7      2         1          1
#> 3      3     2      1         1          1
#> 4      4     8      2         1          1
#> 5      5     5      2         1          1
#> 6      6     5      2         1          1