An R interface to inter-geography and inter-temporal crosswalks.
This package provides a consistent API and standardized crosswalks so that the same approach works across different geography and year combinations. It also facilitates interpolation (adjusting source geography/year values by their crosswalk weights and translating them to the desired target geography/year) and provides diagnostics of the joins between source data and crosswalks.
The package sources crosswalks from:
- Geocorr 2022 (Missouri Census Data Center) - for same-year crosswalks between geographies
- IPUMS NHGIS - for inter-temporal crosswalks (across different census years)
- CT Data Collaborative - for Connecticut 2020→2022 crosswalks (planning region changes)

The package offers:
- Programmatic access: No more manual downloads from web interfaces
- Standardized output: Consistent column names across all crosswalk sources
- Metadata tracking: Full provenance stored as attributes
- Multi-step handling: Automatic chaining when both geography and year change
- Local caching: Reproducible workflows with cached crosswalks
# Install from GitHub
renv::install("UI-Research/crosswalk")

First we obtain a crosswalk and apply it to our data:
library(crosswalk)
library(dplyr)
library(stringr)
library(sf)
source_data = tidycensus::get_acs(
    year = 2023,
    geography = "zcta",
    output = "wide",
    variables = c(below_poverty_level = "B17001_002")) %>%
  dplyr::select(
    source_geoid = GEOID,
    count_below_poverty_level = below_poverty_levelE)

get_crosswalk(
  source_geography = "block",
  target_geography = "puma",
  source_year = 2010,
  target_year = 2020,
  weight = "population")
# Get a crosswalk from ZCTAs to PUMAs (same year, so this uses Geocorr 2022)
zcta_puma_crosswalk <- get_crosswalk(
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

# Apply the crosswalk to your data
crosswalked_data <- crosswalk_data(
  geoid_column = "source_geoid",
  data = source_data,
  crosswalk = zcta_puma_crosswalk)

What do the crosswalk(s) reflect, and how were they sourced?

attr(crosswalked_data, "crosswalk_metadata")

How well did the crosswalk join to our source data?
## look at all the characteristics of the join(s) between the source data
## and the crosswalks
join_quality = attr(crosswalked_data, "join_quality")
## what share of records in the source data do not join to a crosswalk and
## thus are dropped during the crosswalking process?
join_quality$pct_data_unmatched
## zctas aren't nested within states, otherwise join_quality$state_analysis_data
## would help us to ID whether non-joining source data were clustered within one
## or a few states. instead we can join to spatial data to diagnose further:
zctas_sf = tigris::zctas(year = 2023)
states_sf = tigris::states(year = 2023, cb = TRUE)
## apart from DC, which has a disproportionate number of non-joining ZCTAs--
## seemingly corresponding to federal areas and buildings--the distribution of
## non-joining ZCTAs appears proportionate to state-level populations and is
## distributed across many states:
zctas_sf %>%
  dplyr::filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>%
  sf::st_intersection(states_sf %>% dplyr::select(NAME)) %>%
  sf::st_drop_geometry() %>%
  dplyr::count(NAME, sort = TRUE)

And how accurate was the crosswalking process?
comparison_data = tidycensus::get_acs(
    year = 2023,
    geography = "puma",
    output = "wide",
    variables = c(
      below_poverty_level = "B17001_002")) %>%
  dplyr::select(
    source_geoid = GEOID,
    count_below_poverty_level_acs = below_poverty_levelE)

combined_data = dplyr::left_join(
  comparison_data,
  crosswalked_data,
  by = c("source_geoid" = "geoid"))

combined_data %>%
  dplyr::select(source_geoid, dplyr::matches("count")) %>%
  dplyr::mutate(
    difference_percent =
      (count_below_poverty_level_acs - count_below_poverty_level) /
        count_below_poverty_level_acs) %>%
  ggplot2::ggplot() +
  ggplot2::geom_histogram(ggplot2::aes(x = difference_percent)) +
  ggplot2::theme_minimal() +
  ggplot2::theme(panel.grid = ggplot2::element_blank()) +
  ggplot2::scale_x_continuous(labels = scales::percent) +
  ggplot2::labs(
    title = "Crosswalked data approximates observed values",
    subtitle = "Block group-level source data would produce more accurate crosswalked values",
    y = "",
    x = "Percent difference between observed and crosswalked values")

The package has two main functions:
| Function | Purpose |
|---|---|
| get_crosswalk() | Fetch crosswalk(s) |
| crosswalk_data() | Apply crosswalk(s) to interpolate data to the target geography-year |
get_crosswalk() always returns a list structured as follows:
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population"
)
names(result)
#> [1] "crosswalks" "plan" "message"

The list contains three elements:
| Element | Description |
|---|---|
| crosswalks | A named list of crosswalks (step_1, step_2, etc.) of length one or greater |
| plan | Details about what crosswalks are being fetched |
| message | A human-readable description of the crosswalk chain |
Single-step crosswalks (same year, different geography OR same geography, different year):
# Same year, different geography (Geocorr)
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population"
)
# result$crosswalks$step_1 contains one crosswalk

# Same geography, different year (NHGIS)
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "tract",
  source_year = 2010,
  target_year = 2020
)
# result$crosswalks$step_1 contains one crosswalk

Multi-step crosswalks (different geography AND different year):
When both geography and year change, no single crosswalk source provides this directly. The package automatically plans and fetches a two-step chain:
- Step 1 (NHGIS): Change year, keep geography constant
- Step 2 (Geocorr): Change geography at target year
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population"
)
# Two crosswalks are returned
names(result$crosswalks)
#> [1] "step_1" "step_2"
# Step 1: 2010 tracts -> 2020 tracts (NHGIS)
# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr)

Each crosswalk contains standardized columns:
| Column | Description |
|---|---|
| source_geoid | Identifier for source geography |
| target_geoid | Identifier for target geography |
| allocation_factor_source_to_target | Weight for interpolating values |
| weighting_factor | What attribute was used (population, housing, land) |
Additional columns may include source_year, target_year,
population_2020, housing_2020, and land_area_sqmi depending on the
source.
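To make the role of these columns concrete, here is a minimal base-R sketch of how two crosswalk steps could be composed into a single set of allocation factors. The GEOIDs and factors are made up, and the helper names (`left`, `right`, `composite_factor`) are illustrative only. The package's own multi-step handling may work differently, for example by applying each step to the data in sequence, which gives the same result for counts.

```r
# Toy two-step chain (all GEOIDs and factors are made up for illustration).
# Step 1: source-year units -> target-year units of the same geography.
step_1 <- data.frame(
  source_geoid = c("T1", "T1", "T2"),
  target_geoid = c("U1", "U2", "U2"),
  allocation_factor_source_to_target = c(0.6, 0.4, 1.0)
)
# Step 2: target-year units -> units of the target geography.
step_2 <- data.frame(
  source_geoid = c("U1", "U2", "U2"),
  target_geoid = c("Z1", "Z1", "Z2"),
  allocation_factor_source_to_target = c(1.0, 0.5, 0.5)
)

# Rename so the shared intermediate identifier is explicit before joining.
left <- setNames(step_1, c("source_geoid", "intermediate_geoid", "factor_1"))
right <- setNames(step_2, c("intermediate_geoid", "target_geoid", "factor_2"))
chained <- merge(left, right, by = "intermediate_geoid")

# The share of a source unit that reaches a target along one path is the
# product of the two factors; summing over paths gives the composite factor.
chained$composite_factor <- chained$factor_1 * chained$factor_2
composite <- aggregate(
  composite_factor ~ source_geoid + target_geoid,
  data = chained,
  FUN = sum
)
composite
```

In this toy case the composite factors for each source unit sum to 1, so counts are neither created nor lost across the chain.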
Each crosswalk tibble has a crosswalk_metadata attribute that
documents what the crosswalk represents and how it was created:
metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata")
names(metadata)
#> [1] "call_parameters" "data_source" "data_source_full_name" "download_url" ...

crosswalk_data() applies crosswalk weights to transform your data. It
automatically handles multi-step crosswalks.
The function auto-detects columns based on prefixes:
| Prefix | Treatment |
|---|---|
| count_ | Summed after weighting (for counts like population, housing units) |
| mean_, median_, percent_, ratio_ | Weighted mean (for rates, percentages, averages) |
You can also specify columns explicitly via count_columns and
non_count_columns. All non-count variables are interpolated using
weighted means, weighting by the allocation factor from the crosswalk.
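The two treatments can be illustrated with a small base-R sketch on toy data. It is not the package's implementation: the values are made up, and it only mirrors the description above (counts are weighted and summed; non-counts are averaged using the allocation factor as the weight).

```r
# Toy source data: two source units with one count and one percentage.
source_data <- data.frame(
  source_geoid = c("A", "B"),
  count_population = c(100, 200),
  percent_renters = c(0.50, 0.20)
)

# Toy crosswalk (made-up factors): A splits across X and Y; B maps wholly to Y.
crosswalk <- data.frame(
  source_geoid = c("A", "A", "B"),
  target_geoid = c("X", "Y", "Y"),
  allocation_factor_source_to_target = c(0.25, 0.75, 1.00)
)

joined <- merge(source_data, crosswalk, by = "source_geoid")
w <- joined$allocation_factor_source_to_target

# Counts: multiply by the allocation factor, then sum within each target.
counts <- tapply(joined$count_population * w, joined$target_geoid, sum)

# Non-counts: mean within each target, weighted by the allocation factor.
percents <- tapply(joined$percent_renters * w, joined$target_geoid, sum) /
  tapply(w, joined$target_geoid, sum)

interpolated <- data.frame(
  target_geoid = names(counts),
  count_population = as.numeric(counts),
  percent_renters = as.numeric(percents)
)
interpolated
```

Here target X receives 25 people (a quarter of A) and keeps A's 50% rate, while Y receives 275 people and a rate that blends A and B in proportion to their allocation factors.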
Geocorr provides 2022-vintage crosswalks between any of these geographies:
- block, block group, tract, county
- place, zcta, puma22
- cd118, cd119, urban_area, core_based_statistical_area
NHGIS provides cross-decade crosswalks with the following structure:
Source geographies: block, block_group, tract

Target geographies:
- From blocks (decennial years only): block, block_group, tract, county, place, zcta, puma, urban_area, cbsa
- From block_group or tract: block_group, tract, county

| Source Years | Target Years |
|---|---|
| 1990, 2000 | 2010, 2014, 2015, 2020, 2022 |
| 2010, 2011, 2012, 2014, 2015 | 1990, 2000, 2020, 2022 |
| 2020, 2022 | 1990, 2000, 2010, 2014, 2015 |

Notes:
- Within-decade crosswalks (e.g., 2010→2014) are not available from NHGIS
- Block→ZCTA, Block→PUMA, etc. are only available for decennial years (1990, 2000, 2010, 2020)
- The package automatically uses direct NHGIS crosswalks when available (e.g., get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020) returns a single-step NHGIS crosswalk)
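
As a quick way to reason about the year table above, the sketch below encodes its three rows and checks whether a given source/target pair is listed. `nhgis_year_pairs` and `nhgis_pair_listed()` are hypothetical names used only for illustration and are not part of the package. The check ignores the geography-specific restrictions in the notes.

```r
# The three rows of the source/target year table, encoded as a list.
nhgis_year_pairs <- list(
  list(source = c(1990, 2000),
       target = c(2010, 2014, 2015, 2020, 2022)),
  list(source = c(2010, 2011, 2012, 2014, 2015),
       target = c(1990, 2000, 2020, 2022)),
  list(source = c(2020, 2022),
       target = c(1990, 2000, 2010, 2014, 2015))
)

# Hypothetical helper: TRUE if some table row lists this year combination.
nhgis_pair_listed <- function(source_year, target_year) {
  any(vapply(
    nhgis_year_pairs,
    function(pair) source_year %in% pair$source && target_year %in% pair$target,
    logical(1)
  ))
}

nhgis_pair_listed(2010, 2020)  # cross-decade pair that appears in the table
nhgis_pair_listed(2010, 2014)  # within-decade pair that does not appear
```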
For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut (where planning regions replaced counties) and identity mappings for other states (where no changes occurred).
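
An identity mapping in this sense can be pictured as a crosswalk in which every GEOID maps to itself with an allocation factor of 1, so values pass through unchanged. The sketch below assumes that structure and uses made-up household counts; the identity crosswalks the package actually builds may carry additional columns.

```r
# Example county GEOIDs outside Connecticut; the counts are made up.
geoids <- c("06037", "36061", "48201")

# Assumed shape of an identity crosswalk: each unit maps to itself, factor 1.
identity_crosswalk <- data.frame(
  source_geoid = geoids,
  target_geoid = geoids,
  allocation_factor_source_to_target = 1
)

values <- data.frame(
  source_geoid = geoids,
  count_households = c(10, 20, 30)
)

joined <- merge(values, identity_crosswalk, by = "source_geoid")
joined$count_households_target <-
  joined$count_households * joined$allocation_factor_source_to_target

# Every value is carried over to the same GEOID unchanged.
all(joined$count_households_target == joined$count_households)
```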
NHGIS crosswalks require an IPUMS API key. Get one at
https://account.ipums.org/api_keys and add it to your .Renviron:
usethis::edit_r_environ()
# Add: IPUMS_API_KEY=your_key_here

Use the cache parameter to save crosswalks locally for reuse:
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population",
  cache = here::here("crosswalks-cache"))

The intellectual credit for the underlying crosswalks belongs to the original developers.
For NHGIS, see citation requirements at: https://www.nhgis.org/citation-and-use-nhgis-data
For Geocorr, a suggested citation:
Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html