An R package for inter-temporal and inter-geography crosswalks

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md
UI-Research/crosswalk

crosswalk

An R interface to inter-geography and inter-temporal crosswalks.

Overview

This package provides a consistent API and standardized crosswalks so that the same approach works across different geography and year combinations. It also facilitates interpolation, that is, adjusting source geography/year values by their crosswalk weights and translating them to the desired target geography/year, and it reports diagnostics on the joins between source data and crosswalks.

The package sources crosswalks from:

  • Geocorr 2022 (Missouri Census Data Center) - for same-year crosswalks between geographies
  • IPUMS NHGIS - for inter-temporal crosswalks (across different census years)
  • CT Data Collaborative - for Connecticut 2020→2022 crosswalks (planning region changes)

Why Use crosswalk?

  • Programmatic access: No more manual downloads from web interfaces
  • Standardized output: Consistent column names across all crosswalk sources
  • Metadata tracking: Full provenance stored as attributes
  • Multi-step handling: Automatic chaining when both geography and year change
  • Local caching: Reproducible workflows with cached crosswalks

Installation

# Install from GitHub
renv::install("UI-Research/crosswalk")

Quick Start

First we obtain a crosswalk and apply it to our data:

library(crosswalk)
library(dplyr)
library(stringr)
library(sf)

source_data = tidycensus::get_acs(
    year = 2023,
    geography = "zcta",
    output = "wide",
    variables = c(below_poverty_level = "B17001_002")) %>%
  dplyr::select(
    source_geoid = GEOID,
    count_below_poverty_level = below_poverty_levelE)

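# Crosswalks can also span both geography and year. This request draws on
# NHGIS, so it needs an IPUMS API key (see "API Keys"); the result is printed
# rather than stored.
get_crosswalk(
  source_geography = "block",
  target_geography = "puma",
  source_year = 2010,
  target_year = 2020,
  weight = "population")

# Get a crosswalk from ZCTAs to PUMAs (same year, sourced from Geocorr 2022)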
zcta_puma_crosswalk <- get_crosswalk(
  source_geography = "zcta",
  target_geography = "puma22",
  weight = "population")

# Apply the crosswalk to your data
crosswalked_data <- crosswalk_data(
  geoid_column = "source_geoid",
  data = source_data,
  crosswalk = zcta_puma_crosswalk)

What do the crosswalk(s) reflect and how were they sourced?

attr(crosswalked_data, "crosswalk_metadata")

How well did the crosswalk join to our source data?

## look at all the characteristics of the join(s) between the source data
## and the crosswalks
join_quality = attr(crosswalked_data, "join_quality")

## what share of records in the source data do not join to a crosswalk and
## thus are dropped during the crosswalking process?
join_quality$pct_data_unmatched

## zctas aren't nested within states, otherwise join_quality$state_analysis_data 
## would help us to ID whether non-joining source data were clustered within one
## or a few states. instead we can join to spatial data to diagnose further:
zctas_sf = tigris::zctas(year = 2023)
states_sf = tigris::states(year = 2023, cb = TRUE)

## apart from DC, which has a disproportionate number of non-joining ZCTAs--
## seemingly corresponding to federal areas and buildings--the distribution of
## non-joining ZCTAs appears proportionate to state-level populations and is 
## distributed across many states:
zctas_sf %>% 
  dplyr::filter(GEOID20 %in% join_quality$data_geoids_unmatched) %>%
  sf::st_intersection(states_sf %>% select(NAME)) %>%
  sf::st_drop_geometry() %>%
  dplyr::count(NAME, sort = TRUE)
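
The unmatched share inspected above can be reproduced by hand as a sanity check. The sketch below uses made-up GEOIDs and base R only. It assumes the diagnostic counts source records whose GEOID is absent from the crosswalk; it is not the package's internal code, and the package's own field may be expressed on a different scale.

```r
# Toy source data and crosswalk; all GEOIDs and values are hypothetical.
source_data <- data.frame(
  source_geoid = c("00001", "00002", "00003", "00004"),
  count_below_poverty_level = c(10, 20, 30, 40)
)
toy_crosswalk <- data.frame(
  source_geoid = c("00001", "00002", "00002", "00003"),
  target_geoid = c("A", "A", "B", "B"),
  allocation_factor_source_to_target = c(1, 0.25, 0.75, 1)
)

# Source records with no matching row in the crosswalk
unmatched_geoids <- setdiff(source_data$source_geoid, toy_crosswalk$source_geoid)

# Share of source records that would be dropped, expressed as a percentage
pct_unmatched <- 100 * length(unmatched_geoids) / nrow(source_data)

unmatched_geoids
pct_unmatched
```

Here one of the four toy records ("00004") has no crosswalk row, so a quarter of the source records would be dropped.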

And how accurate was the crosswalking process?

comparison_data = tidycensus::get_acs(
    year = 2023,
    geography = "puma",
    output = "wide",
    variables = c(
      below_poverty_level = "B17001_002")) %>%
  dplyr::select(
    source_geoid = GEOID,
    count_below_poverty_level_acs = below_poverty_levelE)

combined_data = dplyr::left_join(
  comparison_data,
  crosswalked_data,
  by = c("source_geoid" = "geoid")) 
  
combined_data %>%
  dplyr::select(source_geoid, dplyr::matches("count")) %>%
  dplyr::mutate(
    difference_percent =
      (count_below_poverty_level_acs - count_below_poverty_level) /
      count_below_poverty_level_acs) %>%
  ggplot2::ggplot() +
    ggplot2::geom_histogram(ggplot2::aes(x = difference_percent)) +
    ggplot2::theme_minimal() +
    ggplot2::theme(panel.grid = ggplot2::element_blank()) +
    ggplot2::scale_x_continuous(labels = scales::percent) +
    ggplot2::labs(
      title = "Crosswalked data approximates observed values",
      subtitle = "Block group-level source data would produce more accurate crosswalked values",
      y = "",
      x = "Percent difference between observed and crosswalked values")

Core Functions

The package has two main functions:

| Function | Purpose |
| --- | --- |
| get_crosswalk() | Fetch crosswalk(s) |
| crosswalk_data() | Apply crosswalk(s) to interpolate data to the target geography-year |

Understanding get_crosswalk() Output

get_crosswalk() always returns a list structured as follows:

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population"
)

names(result)
#> [1] "crosswalks" "plan" "message"

The list contains three elements:

| Element | Description |
| --- | --- |
| crosswalks | A named list of one or more crosswalks (step_1, step_2, etc.) |
| plan | Details about which crosswalks are being fetched |
| message | A human-readable description of the crosswalk chain |

Single-Step vs. Multi-Step Crosswalks

Single-step crosswalks (same year, different geography OR same geography, different year):

# Same year, different geography (Geocorr)
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population"
)
# result$crosswalks$step_1 contains one crosswalk

# Same geography, different year (NHGIS)
result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "tract",
  source_year = 2010,
  target_year = 2020
)
# result$crosswalks$step_1 contains one crosswalk

Multi-step crosswalks (different geography AND different year):

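When both geography and year change, a single crosswalk source usually cannot provide the translation directly (block-based NHGIS crosswalks are the exception; see the notes under Inter-Temporal Crosswalks). In those cases the package automatically plans and fetches a two-step chain: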

  1. Step 1 (NHGIS): Change year, keep geography constant
  2. Step 2 (Geocorr): Change geography at target year

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  source_year = 2010,
  target_year = 2020,
  weight = "population"
)

# Two crosswalks are returned
names(result$crosswalks)
#> [1] "step_1" "step_2"

# Step 1: 2010 tracts -> 2020 tracts (NHGIS)
# Step 2: 2020 tracts -> 2020 ZCTAs (Geocorr)
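
To build intuition for what a two-step chain does to a count variable, the sketch below applies two toy crosswalks in sequence using base R. The GEOIDs, allocation factors, and the `apply_step()` helper are all hypothetical; `crosswalk_data()` handles chaining for you, and its internals may differ.

```r
# Hypothetical step 1: 2010 tracts -> 2020 tracts
step_1 <- data.frame(
  source_geoid = c("t10_1", "t10_1", "t10_2"),
  target_geoid = c("t20_a", "t20_b", "t20_b"),
  allocation_factor_source_to_target = c(0.6, 0.4, 1.0)
)
# Hypothetical step 2: 2020 tracts -> 2020 ZCTAs
step_2 <- data.frame(
  source_geoid = c("t20_a", "t20_b", "t20_b"),
  target_geoid = c("z1", "z1", "z2"),
  allocation_factor_source_to_target = c(1.0, 0.5, 0.5)
)

# Apply one crosswalk step to a count column: weight each row, then sum by target
apply_step <- function(data, crosswalk) {
  joined <- merge(data, crosswalk, by.x = "geoid", by.y = "source_geoid")
  joined$count <- joined$count * joined$allocation_factor_source_to_target
  out <- aggregate(count ~ target_geoid, data = joined, FUN = sum)
  names(out)[names(out) == "target_geoid"] <- "geoid"
  out
}

data_2010 <- data.frame(geoid = c("t10_1", "t10_2"), count = c(100, 50))

data_2020_tracts <- apply_step(data_2010, step_1)       # year changes
data_2020_zctas <- apply_step(data_2020_tracts, step_2) # geography changes
data_2020_zctas
```

In this toy case z1 receives 105 and z2 receives 45, so the original total of 150 is preserved because every source unit's factors sum to 1 at each step.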

Crosswalk Structure

Each crosswalk contains standardized columns:

| Column | Description |
| --- | --- |
| source_geoid | Identifier for source geography |
| target_geoid | Identifier for target geography |
| allocation_factor_source_to_target | Weight for interpolating values |
| weighting_factor | What attribute was used (population, housing, land) |

Additional columns may include source_year, target_year, population_2020, housing_2020, and land_area_sqmi depending on the source.
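
A quick check on any crosswalk with these columns is to sum the allocation factors within each source unit. The sketch below does this on a toy data frame in base R. It assumes factors are expected to sum to roughly 1 per source unit, which may not hold for every real crosswalk; all values are made up.

```r
# Toy crosswalk using the standardized column names; values are hypothetical
toy_crosswalk <- data.frame(
  source_geoid = c("001", "001", "002"),
  target_geoid = c("A", "B", "B"),
  allocation_factor_source_to_target = c(0.7, 0.3, 1.0),
  weighting_factor = "population"
)

# Sum of allocation factors for each source unit
factor_sums <- tapply(
  toy_crosswalk$allocation_factor_source_to_target,
  toy_crosswalk$source_geoid,
  sum
)
factor_sums

# Source units whose factors do not sum to (approximately) 1
incomplete <- names(factor_sums)[abs(factor_sums - 1) > 1e-6]
incomplete
```

Source units flagged in `incomplete` would have only part of their value allocated to any target, which is worth investigating before interpolating counts.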

Accessing Metadata

Each crosswalk tibble has a crosswalk_metadata attribute that documents what the crosswalk represents and how it was created:

metadata <- attr(result$crosswalks$step_1, "crosswalk_metadata")
names(metadata)
#> [1] "call_parameters" "data_source" "data_source_full_name" "download_url" ...
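
For readers unfamiliar with R attributes, the sketch below shows the general mechanism on a toy data frame: a named list is attached with `attr()` and read back the same way. The two field names are taken from the output above, but their values here are placeholders, not what the package stores.

```r
# Toy crosswalk; contents are hypothetical
toy_crosswalk <- data.frame(
  source_geoid = "001",
  target_geoid = "A",
  allocation_factor_source_to_target = 1
)

# Attach provenance as a named list (placeholder values)
attr(toy_crosswalk, "crosswalk_metadata") <- list(
  data_source = "example_source",
  download_url = NA_character_
)

# Read it back and inspect individual fields
metadata <- attr(toy_crosswalk, "crosswalk_metadata")
names(metadata)
metadata$data_source
```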

Using crosswalk_data() to Interpolate Data

crosswalk_data() applies crosswalk weights to transform your data. It automatically handles multi-step crosswalks.

Column Naming Convention

The function auto-detects columns based on prefixes:

| Prefix | Treatment |
| --- | --- |
| count_ | Summed after weighting (for counts like population, housing units) |
| mean_, median_, percent_, ratio_ | Weighted mean (for rates, percentages, averages) |

You can also specify columns explicitly via count_columns and non_count_columns. All non-count variables are interpolated using weighted means, weighting by the allocation factor from the crosswalk.
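
The two treatments can be illustrated by hand. The sketch below starts from toy data already joined to a crosswalk and computes a weighted sum for a count_ column and an allocation-factor-weighted mean for a percent_ column, using base R. The column names and values are hypothetical, and this is an illustration of the arithmetic rather than the package's implementation.

```r
# Toy source values already joined to a crosswalk; all values are hypothetical
joined <- data.frame(
  target_geoid = c("A", "A", "B"),
  allocation_factor_source_to_target = c(1.0, 0.25, 0.75),
  count_population = c(100, 200, 200),
  percent_poverty = c(10, 20, 20)
)

# Aggregate each target unit separately
by_target <- split(joined, joined$target_geoid)
result <- do.call(rbind, lapply(by_target, function(rows) {
  w <- rows$allocation_factor_source_to_target
  data.frame(
    target_geoid = rows$target_geoid[1],
    # counts: weight each source value, then sum
    count_population = sum(rows$count_population * w),
    # non-counts: mean weighted by the allocation factor
    percent_poverty = weighted.mean(rows$percent_poverty, w)
  )
}))
rownames(result) <- NULL
result
```

For target A, the count is 100 × 1 + 200 × 0.25 = 150 and the weighted mean is (10 × 1 + 20 × 0.25) / 1.25 = 12; target B receives a count of 150 and a mean of 20.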

Supported Geography and Year Combinations

Inter-Geography Crosswalks (Geocorr)

2022-vintage crosswalks between any of these geographies:

  • block, block group, tract, county
  • place, zcta, puma22
  • cd118, cd119, urban_area, core_based_statistical_area

Inter-Temporal Crosswalks (NHGIS)

NHGIS provides cross-decade crosswalks with the following structure:

Source geographies: block, block_group, tract

Target geographies:

  • From blocks (decennial years only): block, block_group, tract, county, place, zcta, puma, urban_area, cbsa
  • From block_group or tract: block_group, tract, county

| Source Years | Target Years |
| --- | --- |
| 1990, 2000 | 2010, 2014, 2015, 2020, 2022 |
| 2010, 2011, 2012, 2014, 2015 | 1990, 2000, 2020, 2022 |
| 2020, 2022 | 1990, 2000, 2010, 2014, 2015 |

Notes:

  • Within-decade crosswalks (e.g., 2010→2014) are not available from NHGIS
  • Block→ZCTA, Block→PUMA, etc. are only available for decennial years (1990, 2000, 2010, 2020)
  • The package automatically uses direct NHGIS crosswalks when available (e.g., get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020) returns a single-step NHGIS crosswalk)

2020→2022 Crosswalks (CTData)

For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut (where planning regions replaced counties) and identity mappings for other states (where no changes occurred).
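
An identity mapping is simply a crosswalk in which every unit maps to itself with an allocation factor of 1, so values pass through unchanged. The sketch below builds one by hand for a few example county GEOIDs outside Connecticut. The column names follow the standardized structure described earlier; the count column and its values are hypothetical.

```r
# Example county GEOIDs outside Connecticut
geoids <- c("01001", "06037", "36061")

# Identity crosswalk: each unit maps to itself with full weight
identity_crosswalk <- data.frame(
  source_geoid = geoids,
  target_geoid = geoids,
  allocation_factor_source_to_target = 1
)

# Hypothetical 2020 values
values_2020 <- data.frame(
  source_geoid = geoids,
  count_households = c(10, 20, 30)
)

joined <- merge(values_2020, identity_crosswalk, by = "source_geoid")
joined$count_households_2022 <-
  joined$count_households * joined$allocation_factor_source_to_target

joined[, c("target_geoid", "count_households", "count_households_2022")]
```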

API Keys

NHGIS crosswalks require an IPUMS API key. Get one at https://account.ipums.org/api_keys and add it to your .Renviron:

usethis::edit_r_environ()
# Add: IPUMS_API_KEY=your_key_here
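
After editing .Renviron, restart R so the new variable is loaded. A small check like the one below, using only base R, confirms whether the key is visible to the current session before you request an NHGIS crosswalk.

```r
# Read the key from the environment; returns "" when the variable is unset
ipums_key <- Sys.getenv("IPUMS_API_KEY")

if (nzchar(ipums_key)) {
  message("IPUMS_API_KEY is set.")
} else {
  message("IPUMS_API_KEY is not set; add it to .Renviron and restart R.")
}
```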

Caching

Use the cache parameter to save crosswalks to a local directory:

result <- get_crosswalk(
  source_geography = "tract",
  target_geography = "zcta",
  weight = "population",
  cache = here::here("crosswalks-cache"))
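
For intuition about why a local cache supports reproducible workflows, here is a generic fetch-or-read pattern in base R. It is a sketch under stated assumptions: the `cached_fetch()` helper, the file naming, and the use of .rds files are hypothetical and are not taken from the package, whose cache format may differ.

```r
# Generic helper: read from disk if present, otherwise fetch and save.
# `fetch_fun` stands in for any slow download step (hypothetical).
cached_fetch <- function(cache_dir, key, fetch_fun) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- fetch_fun()
  saveRDS(result, path)
  result
}

# Use a temporary directory so the example leaves no files in your project
cache_dir <- file.path(tempdir(), "crosswalks-cache")
slow_fetch <- function() {
  data.frame(source_geoid = "001", target_geoid = "A")
}

first <- cached_fetch(cache_dir, "tract_zcta", slow_fetch)  # fetched, then saved
second <- cached_fetch(cache_dir, "tract_zcta", slow_fetch) # read from disk
all.equal(first, second)
```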

Citations

The intellectual credit for the underlying crosswalks belongs to the original developers.

For NHGIS, see citation requirements at: https://www.nhgis.org/citation-and-use-nhgis-data

For Geocorr, a suggested citation:

Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html
