crosswalk

An R interface to inter-geography and inter-temporal crosswalks.

Overview

This package provides a consistent API and standardized versions of crosswalks, so the same approach works across different geography and year combinations. The package also facilitates interpolation (that is, adjusting source geography/year values by their crosswalk weights and translating these values to the desired target geography/year) and provides diagnostics of the joins between source data and crosswalks.

The package sources crosswalks from:

Geocorr 2022 (Missouri Census Data Center) - for same-year crosswalks between geographies
IPUMS NHGIS - for inter-temporal crosswalks (across different census years)
CT Data Collaborative - for Connecticut 2020→2022 crosswalks (planning region changes)

Why Use `crosswalk`?

Programmatic access: No more manual downloads from web interfaces
Standardized output: Consistent column names across all crosswalk sources
Metadata tracking: Full provenance stored as attributes
Multi-step handling: Automatic chaining when both geography and year change
Local caching: Reproducible workflows with cached crosswalks

Installation

# Install from GitHub
renv::install("UI-Research/crosswalk")

Overview

First we obtain a crosswalk and apply it to our data:

library(crosswalk)
library(dplyr)
library(stringr)
library(sf)

source_data = tidycensus::get_acs(
    year = 2023,
    geography = "zcta",
    output = "wide",
    variables = c(below_poverty_level = "B17001_002")) %>%
  dplyr::select(
    source_geoid = GEOID,
    count_below_poverty_level = below_poverty_levelE)

# Get a crosswalk from 2010 blocks to 2020 PUMAs (different geography and year)
get_crosswalk(
  source_geography = "block",
  target_geography = "puma",
  source_year = 2010,
  target_year = 2020,
  weight = "population")

# Get a crosswalk from ZCTAs to PUMAs (same year, uses Geocorr (2022))
zcta_puma_crosswalk <- get_crosswalk(
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

# Apply the crosswalk to your data
crosswalked_data <- crosswalk_data(
  geoid_column = "source_geoid",
  data = source_data,
  crosswalk = zcta_puma_crosswalk)

What do the crosswalk(s) reflect and how were they sourced?

attr(crosswalked_data, "crosswalk_metadata")

How well did the crosswalk join to our source data?

## look at all the characteristics of the join(s) between the source data
## and the crosswalks
join_quality = attr(crosswalked_data, "join_quality")

## what share of records in the source data do not join to a crosswalk and
## thus are dropped during the crosswalking process?
join_quality$pct_data_unmatched

## zctas aren't nested within states, otherwise join_quality$state_analysis_data 
## would help us to ID whether non-joining source data were clustered within one
## or a few states. instead we can join to spatial data to diagnose further:
zctas_sf = tigris::zctas(year = 2023)
states_sf = tigris::states(year = 2023, cb = TRUE)

## apart from DC, which has a disproportionate number of non-joining ZCTAs--
## seemingly corresponding to federal areas and buildings--the distribution of
## non-joining ZCTAs appears proportionate to state-level populations and is 
## distributed across many states:
zctas_sf %>% 
  dplyr::filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>%
  sf::st_intersection(states_sf %>% dplyr::select(NAME)) %>%
  sf::st_drop_geometry() %>%
  dplyr::count(NAME, sort = TRUE)
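
To make the unmatched-share diagnostic concrete, here is a minimal base R sketch on toy data. It assumes the share is the percent of source rows whose GEOID has no row in the crosswalk; this README does not show how the package computes `pct_data_unmatched` internally, and `toy_data` and `toy_crosswalk` are made up for illustration.

```r
# Toy source data: four ZCTA-like GEOIDs (made up for illustration)
toy_data <- data.frame(
  source_geoid = c("00001", "00002", "00003", "00004"),
  count_below_poverty_level = c(10, 20, 30, 40))

# Toy crosswalk that covers only three of the four source GEOIDs
toy_crosswalk <- data.frame(
  source_geoid = c("00001", "00002", "00002", "00003"),
  target_geoid = c("A", "A", "B", "B"),
  allocation_factor_source_to_target = c(1, 0.25, 0.75, 1))

# GEOIDs present in the data but absent from the crosswalk
unmatched_geoids <- setdiff(toy_data$source_geoid, toy_crosswalk$source_geoid)

# Assumed definition: percent of source rows that fail to join
pct_unmatched <- 100 * mean(toy_data$source_geoid %in% unmatched_geoids)

unmatched_geoids
pct_unmatched
```

Rows that fail to join contribute nothing to any target geography, so a large unmatched share is worth investigating before trusting the crosswalked totals.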

And how accurate was the crosswalking process?

comparison_data = tidycensus::get_acs(
    year = 2023,
    geography = "puma",
    output = "wide",
    variables = c(
      below_poverty_level = "B17001_002")) %>%
  dplyr::select(
    source_geoid = GEOID,
    count_below_poverty_level_acs = below_poverty_levelE)

combined_data = dplyr::left_join(
  comparison_data,
  crosswalked_data,
  by = c("source_geoid" = "geoid")) 
  
combined_data %>%
  dplyr::select(source_geoid, dplyr::matches("count")) %>%
  dplyr::mutate(difference_percent = (count_below_poverty_level_acs - count_below_poverty_level) / count_below_poverty_level_acs) %>%
  ggplot2::ggplot() +
    ggplot2::geom_histogram(ggplot2::aes(x = difference_percent)) +
    ggplot2::theme_minimal() +
    ggplot2::theme(panel.grid = ggplot2::element_blank()) +
    ggplot2::scale_x_continuous(labels = scales::percent) +
    ggplot2::labs(
      title = "Crosswalked data approximates observed values",
      subtitle = "Block group-level source data would produce more accurate crosswalked values",
      y = "",
      x = "Percent difference between observed and crosswalked values")

Core Functions

The package has two main functions:

Function	Purpose
`get_crosswalk()`	Fetch crosswalk(s)
`crosswalk_data()`	Apply crosswalk(s) to interpolate data to the target geography-year

Understanding `get_crosswalk()` Output

get_crosswalk() always returns a list structured as follows:

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population"
)

names(result)
#> [1] "crosswalks" "plan" "message"

The list contains three elements:

Element	Description
`crosswalks`	A named list of crosswalks (`step_1`, `step_2`, etc.) of length one or greater
`plan`	Details about what crosswalks are being fetched
`message`	A human-readable description of the crosswalk chain

Single-Step vs. Multi-Step Crosswalks

Single-step crosswalks (same year, different geography OR same geography, different year):

# Same year, different geography (Geocorr)
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population"
)
# result$crosswalks$step_1 contains one crosswalk

# Same geography, different year (NHGIS)
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "tract",
  source_year = 2010,
  target_year = 2020
)
# result$crosswalks$step_1 contains one crosswalk

Multi-step crosswalks (different geography AND different year):

When both geography and year change, no single crosswalk source provides this directly. The package automatically plans and fetches a two-step chain:

Step 1 (NHGIS): Change year, keep geography constant
Step 2 (Geocorr): Change geography at target year

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population"
)

# Two crosswalks are returned
names(result$crosswalks)
#> [1] "step_1" "step_2"

# Step 1: 2010 tracts -> 2020 tracts (NHGIS)
# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr)
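
The two-step chain can be pictured as applying one crosswalk after the other. The base R sketch below does this for a count variable on toy data; `apply_count_crosswalk()` is a hypothetical helper written for this illustration, not a package function, and the package's own implementation may differ.

```r
# Hypothetical helper: weight counts by the allocation factor, sum by target
apply_count_crosswalk <- function(data, crosswalk) {
  joined <- merge(data, crosswalk, by.x = "geoid", by.y = "source_geoid")
  joined$count <- joined$count * joined$allocation_factor_source_to_target
  out <- aggregate(count ~ target_geoid, data = joined, FUN = sum)
  names(out)[names(out) == "target_geoid"] <- "geoid"
  out
}

# Toy 2010 tract counts (made up)
tracts_2010 <- data.frame(geoid = c("T1", "T2"), count = c(100, 50))

# Step 1: change year, keep geography (toy 2010 tracts -> 2020 tracts)
step_1 <- data.frame(
  source_geoid = c("T1", "T1", "T2"),
  target_geoid = c("N1", "N2", "N2"),
  allocation_factor_source_to_target = c(0.6, 0.4, 1))

# Step 2: change geography at the target year (toy 2020 tracts -> ZCTAs)
step_2 <- data.frame(
  source_geoid = c("N1", "N2", "N2"),
  target_geoid = c("Z1", "Z1", "Z2"),
  allocation_factor_source_to_target = c(1, 0.5, 0.5))

tracts_2020 <- apply_count_crosswalk(tracts_2010, step_1)
zctas_2020 <- apply_count_crosswalk(tracts_2020, step_2)
zctas_2020
```

Because the toy allocation factors sum to 1 within each source unit, the total count is the same before and after both steps.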

Crosswalk Structure

Each crosswalk contains standardized columns:

Column	Description
`source_geoid`	Identifier for source geography
`target_geoid`	Identifier for target geography
`allocation_factor_source_to_target`	Weight for interpolating values
`weighting_factor`	What attribute was used (population, housing, land)

Additional columns may include source_year, target_year, population_2020, housing_2020, and land_area_sqmi depending on the source.
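
One sanity check that may be useful on a crosswalk with these columns: if allocation factors represent the share of each source unit assigned to each target, they should sum to roughly 1 within each `source_geoid`. That interpretation is an assumption here rather than something this README states, and the toy crosswalk below is made up.

```r
# Toy crosswalk using the standardized column names
toy_crosswalk <- data.frame(
  source_geoid = c("T1", "T1", "T2"),
  target_geoid = c("Z1", "Z2", "Z2"),
  allocation_factor_source_to_target = c(0.7, 0.3, 1),
  weighting_factor = "population")

# Sum of allocation factors for each source unit
factor_sums <- tapply(
  toy_crosswalk$allocation_factor_source_to_target,
  toy_crosswalk$source_geoid,
  sum)

# Source units whose factors deviate from 1 beyond a small tolerance
incomplete_sources <- names(factor_sums)[abs(factor_sums - 1) > 1e-6]

factor_sums
incomplete_sources
```

Real crosswalks may legitimately deviate from 1 for some source units, so any flagged unit is a prompt to look closer rather than proof of an error.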

Accessing Metadata

Each crosswalk tibble has a crosswalk_metadata attribute that documents what the crosswalk represents and how it was created:

metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata")
names(metadata)
#> [1] "call_parameters" "data_source" "data_source_full_name" "download_url" ...
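
For readers less familiar with R attributes, the following base R sketch shows the general mechanism on a toy data frame. The metadata values are made up, and only two of the field names listed above are used.

```r
toy_crosswalk <- data.frame(source_geoid = "T1", target_geoid = "Z1")

# Attach a metadata list as an attribute (field values are made up)
attr(toy_crosswalk, "crosswalk_metadata") <- list(
  data_source = "toy",
  call_parameters = list(
    source_geography = "tract",
    target_geography = "zcta"))

# Retrieve the attribute and inspect its top-level structure
metadata <- attr(toy_crosswalk, "crosswalk_metadata")
names(metadata)
str(metadata, max.level = 1)

# attr() returns NULL when the attribute is absent
is.null(attr(toy_crosswalk, "not_an_attribute"))
```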

Using `crosswalk_data()` to Interpolate Data

crosswalk_data() applies crosswalk weights to transform your data. It automatically handles multi-step crosswalks.

Column Naming Convention

The function auto-detects columns based on prefixes:

Prefix	Treatment
`count_`	Summed after weighting (for counts like population, housing units)
`mean_`, `median_`, `percent_`, `ratio_`	Weighted mean (for rates, percentages, averages)

You can also specify columns explicitly via count_columns and non_count_columns. All non-count variables are interpolated using weighted means, weighting by the allocation factor from the crosswalk.
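
The two treatments can be illustrated with a small base R sketch on toy data. This is a hand-rolled illustration of the arithmetic described above, not the package's implementation; the data and the `count_households` and `percent_renters` column names are made up.

```r
toy_data <- data.frame(
  source_geoid = c("T1", "T2"),
  count_households = c(100, 200),
  percent_renters = c(40, 10))

toy_crosswalk <- data.frame(
  source_geoid = c("T1", "T1", "T2"),
  target_geoid = c("Z1", "Z2", "Z2"),
  allocation_factor_source_to_target = c(0.5, 0.5, 1))

joined <- merge(toy_data, toy_crosswalk, by = "source_geoid")
w <- joined$allocation_factor_source_to_target

# count_ columns: multiply by the allocation factor, then sum within target
counts <- tapply(joined$count_households * w, joined$target_geoid, sum)

# percent_ columns: mean weighted by the allocation factor within target
percents <- tapply(joined$percent_renters * w, joined$target_geoid, sum) /
  tapply(w, joined$target_geoid, sum)

result <- data.frame(
  target_geoid = names(counts),
  count_households = as.numeric(counts),
  percent_renters = as.numeric(percents))
result
```

In this toy case, Z2 receives half of T1 and all of T2, so its count is 50 + 200 = 250 and its percent is (40 × 0.5 + 10 × 1) / 1.5 = 20.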

Supported Geography and Year Combinations

Inter-Geography Crosswalks (Geocorr)

2022-vintage crosswalks between any of these geographies:

block, block group, tract, county
place, zcta, puma22
cd118, cd119, urban_area, core_based_statistical_area

Inter-Temporal Crosswalks (NHGIS)

NHGIS provides cross-decade crosswalks with the following structure:

Source geographies: block, block_group, tract

Target geographies:

From blocks (decennial years only): block, block_group, tract, county, place, zcta, puma, urban_area, cbsa
From block_group or tract: block_group, tract, county

Source Years	Target Years
1990, 2000	2010, 2014, 2015, 2020, 2022
2010, 2011, 2012, 2014, 2015	1990, 2000, 2020, 2022
2020, 2022	1990, 2000, 2010, 2014, 2015

Notes:

Within-decade crosswalks (e.g., 2010→2014) are not available from NHGIS
Block→ZCTA, Block→PUMA, etc. are only available for decennial years (1990, 2000, 2010, 2020)
The package automatically uses direct NHGIS crosswalks when available (e.g., get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020) returns a single-step NHGIS crosswalk)
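
As a quick way to check a year pair against the source/target year table above, the base R sketch below transcribes that table into a lookup. `nhgis_pair_listed()` is a hypothetical helper, not a package function, and it reflects only the NHGIS table, not the 2020→2022 handling described in the next section.

```r
# NHGIS source -> target years, transcribed from the table in this section
nhgis_year_pairs <- list(
  list(source = c(1990, 2000),
       target = c(2010, 2014, 2015, 2020, 2022)),
  list(source = c(2010, 2011, 2012, 2014, 2015),
       target = c(1990, 2000, 2020, 2022)),
  list(source = c(2020, 2022),
       target = c(1990, 2000, 2010, 2014, 2015)))

# Hypothetical helper: is this source/target year pair listed in the table?
nhgis_pair_listed <- function(source_year, target_year) {
  any(vapply(
    nhgis_year_pairs,
    function(row) source_year %in% row$source && target_year %in% row$target,
    logical(1)))
}

nhgis_pair_listed(2010, 2020)  # cross-decade pair
nhgis_pair_listed(2010, 2014)  # within-decade pair
```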

2020→2022 Crosswalks (CTData)

For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut (where planning regions replaced counties) and identity mappings for other states (where no changes occurred).

API Keys

NHGIS crosswalks require an IPUMS API key. Get one at https://account.ipums.org/api_keys and add it to your .Renviron:

usethis::edit_r_environ()
# Add: IPUMS_API_KEY=your_key_here
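
To confirm that R can see the key, a check along these lines may help. It uses only base R; the variable name `IPUMS_API_KEY` comes from the snippet above.

```r
# Read the key from the environment; an empty string means it is not set
ipums_key <- Sys.getenv("IPUMS_API_KEY")

if (nzchar(ipums_key)) {
  message("IPUMS_API_KEY is set.")
} else {
  message("IPUMS_API_KEY is not set; restart R after editing .Renviron.")
}
```

R reads .Renviron at startup, so a key added during a session is not visible until R is restarted.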

Caching

Use the cache parameter to save crosswalks to a local directory:

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population",
  cache = here::here("crosswalks-cache"))
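
This README does not describe the cache format or how cached files are reused. As a generic illustration of the read-through caching pattern only, the base R sketch below saves an object on first use and reads it back afterwards. `cached()` and `build_toy()` are hypothetical helpers, and the use of `.rds` files is an assumption, not the package's documented behavior.

```r
# Hypothetical helper: return the cached object if present, else build and save
cached <- function(path, build) {
  if (file.exists(path)) return(readRDS(path))
  value <- build()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(value, path)
  value
}

# Toy stand-in for an expensive download
build_toy <- function() data.frame(source_geoid = "T1", target_geoid = "Z1")

cache_file <- file.path(tempdir(), "crosswalks-cache", "toy_crosswalk.rds")

first <- cached(cache_file, build_toy)   # builds the object, writes the file
second <- cached(cache_file, build_toy)  # reads the file back
all.equal(first, second)
```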

Citations

The intellectual credit for the underlying crosswalks belongs to the original developers.

For NHGIS, see citation requirements at: https://www.nhgis.org/citation-and-use-nhgis-data

For Geocorr, a suggested citation:

Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html
