
reports2

Switchbox is a nonprofit think tank that produces rigorous, accessible data on U.S. state climate policy for advocates, policymakers, and the public. Find out more at www.switch.box

This repository contains Switchbox's reports. We use a modern bilingual stack combining Python and R.


🎯 Overview

  • Quarto Reports: All reports written in Quarto (located in reports/ directory)
  • Cloud Data: All data used in our reports is stored in an S3 bucket (`s3://data.sb/`) on AWS
  • Bilingual Analytics: Reports use both Python (polars) and R (tidyverse)
  • Fast Package Management:
    • Python: uv for lightning-fast dependency resolution
    • R: pak with P3M for binary package installation
  • Reproducible Development Environment: Intended to be used with VS Code Dev Containers
  • Task Runner: just for convenient command execution
  • Code Quality: Modern linting and formatting with ruff (Python) and air (R), automated via prek pre-commit hooks

🌍 Why Open Source?

Our reports (on our website) are shared under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), and the code behind them (contained in this repo) is released under the MIT License.

We do this for two reasons:

1. 🔍 Transparency

As a research group, our work is frequently cited in the press and aims to shape the policy conversation. Given this, we believe that the public has a right to see exactly how we produce our findings, going well beyond vague methodology sections.

2. 🔬 Open Science

We believe the clean energy transition will happen faster if energy researchers, particularly those working in the nonprofit sector, embrace open data and open source code so that we can build on each other's work rather than reinventing the wheel.

πŸ“ Repo Structure

While the rest of this README.md will walk through the contents of this repo in detail, here is an initial overview:

reports2/
├── .devcontainer/          # Dev container configuration
├── docs/                   # Published HTML reports (hosted via GitHub Pages at switchbox-data.github.io/reports2)
├── lib/                    # Shared libraries and utilities
├── reports/                # Switchbox report projects (source code for reports on our website)
│   ├── ct_hp_rates/        # Individual Quarto projects for each report
│   ├── il_lea/
│   └── ...
├── tests/                  # Python test suite
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── pyproject.toml          # Python dependencies and tool configuration
├── uv.lock                 # Locked Python dependencies
└── Justfile                # Command runner recipes

🛠️ Available Commands

While the rest of this README.md will explain when these commands should be used, here is an initial overview of the "tasks" you can perform in this repo.

Two places to use just: Commands are available in the repository root (for development environment and testing) and in individual report directories (for rendering reports). These commands are defined in Justfiles in each location - see the root Justfile and individual report Justfiles (e.g., reports/ny_aeba_grid/Justfile).

View all available commands in your current location:

just --list

Repository root commands:

  • just install - Set up development environment
  • just check - Run quality checks (same as CI: lock file + pre-commit hooks)
  • just check-deps - Check for obsolete dependencies with deptry
  • just test - Run test suite (same as CI)
  • just new_report - Create a new Quarto report in reports/
  • just aws - Authenticate with AWS SSO
  • just devpod - Launch devcontainer on AWS via DevPod
  • just clean - Remove generated files and caches

Report directory commands (reports/<project_code>/):

  • just render - Render HTML version of report using Quarto
  • just draft - Render Word document for content reviews using Quarto
  • just typeset - Render ICML for InDesign typesetting using Quarto
  • just publish - Copy rendered HTML version of report from project docs/ to root docs/ for web publishing
  • just clean - Remove generated files and caches

🚀 Quick Start

If you want to rerun and edit the code that generates our reports, this section shows you how to get started.

Option 1: Dev Container (Recommended)

The fastest way to get started is using the provided dev container:

  1. Prerequisites:

    • Docker (required to run dev containers)
    • VS Code or Cursor with the Dev Containers extension

  2. Launch:

    • Open this repository in VS Code or Cursor
    • Click "Reopen in Container" when prompted (or use Cmd+Shift+P → "Dev Containers: Reopen in Container")
    • The container will automatically install:
      • Core Tools: Quarto, just, AWS CLI, GitHub CLI (gh), prek
      • Python Stack: Python dependencies via uv, ruff for fast linting/formatting, ty for type checking
      • R Stack: R dependencies via pak, air for fast linting/formatting
      • Pre-commit Hooks: Configured via prek
      • Editor Extensions: Python, R, Quarto, TOML, just syntax, and more
  3. Verify Installation:

    just --list  # See all available commands

Option 2: Dev Container on AWS (via DevPod)

For faster data access (especially with large datasets), you can launch the devcontainer on AWS in region us-west-2, close to where our data is stored in S3.

  1. Prerequisites:

    • Install DevPod
    • Configure the AWS provider in DevPod (detailed setup documentation coming soon)
  2. Launch:

    just aws      # Authenticate with AWS so Devpod can create an EC2 instance
    just devpod   # Launch devcontainer on EC2 instance

This uses a prebuilt devcontainer image from GitHub Container Registry, so you don't have to wait for the image to build.

Option 3: Manual Installation (Without Dev Container)

If you prefer not to use dev containers:

  1. Install Prerequisites:

    • Python 3.9+
    • R 4.0+
    • pak package for R
    • just
  2. Run Setup:

    just install

    This command:

    • Installs uv
    • Uses uv to create a virtualenv and install Python packages into it
    • Uses pak to install R packages listed in DESCRIPTION file
    • Installs pre-commit hooks with prek

πŸ“ Creating a New Report

To create a new Quarto report from the Switchbox template:

just new_report

You'll be prompted to enter a report name.

Naming convention: Use state_topic format (e.g., ny_aeba, ri_hp_rates). If we've used a topic before in other states (like hp_rates), reuse it to maintain consistency across reports.
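As a quick illustration of the convention (not part of the repo's tooling; the name and pattern below are hypothetical), a report name check might look like:

```python
import re

# Hypothetical helper: two-letter state code, then one or more
# lowercase_with_underscores topic segments (e.g. ny_aeba, ri_hp_rates).
REPORT_NAME = re.compile(r"^[a-z]{2}_[a-z0-9]+(?:_[a-z0-9]+)*$")

def is_valid_report_name(name: str) -> bool:
    """Return True if `name` follows the state_topic convention."""
    return REPORT_NAME.match(name) is not None
```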

This will create a new Quarto project in reports/ based on the Switchbox template (see Report Structure below for the resulting layout).

🔄 Development Workflow

We follow a structured development process to ensure all work is tracked, organized, and reviewed. This workflow keeps PRs from growing stale, maintains code quality, and ensures every piece of work is tied to a clear ticket.

Our workflow:

  1. All work starts with a GitHub Issue - Captured in our Kanban board with clear "What", "Why", "How", and "Deliverables"
  2. Issues are reviewed before work begins - Ensures alignment before coding starts
  3. Branches are created from issues - Automatic linking between code and tickets
  4. PRs are created early - Work-in-progress is visible and reviewable
  5. Code is reviewed before merging - Quality checks and peer review catch issues
  6. PRs are short-lived - Merged within a week to keep momentum

This process ensures:

  • πŸ“ Every feature/fix has documentation (the issue)
  • πŸ”— Code changes are traceable to requirements
  • πŸ‘€ Work is visible and reviewable at all stages
  • βœ… Code is reviewed early and often

For complete details on our workflow, including how to create issues, request reviews, and merge PRs, see CONTRIBUTING.md.

📄 Understanding Quarto Reports

What is Quarto?

Quarto is a scientific publishing system that lets you combine narrative text with executable code. This approach, called literate programming, means your analysis and narrative live together in one document.

Simple example:

## Heat Pump Adoption

Total residential heat pumps installed in Rhode Island:

```{r}
total_installs <- sum(resstock_data$heat_pump_count)
print(total_installs)
```

This represents `{r} pct_electric`% of all heating systems statewide.

The code executes when you render the document, weaving results directly into your narrative. Learn more about Quarto.

Our Opinionated Approach

We use Quarto in a specific way, defined by our report template. Our reports are based on Quarto's Manuscript project type, which is designed for scientific and technical publications.

Report Structure

After running just new_report, you'll have this structure:

reports/your-report/
├── index.qmd              # The actual report (narrative + embedded results)
├── notebooks/
│   └── analysis.qmd       # The underlying analysis (data wrangling + analysis + visualization)
├── _quarto.yml            # Quarto project configuration (set up by template)
├── Justfile               # Commands to render the report
└── docs/                  # Generated output (HTML, DOCX, ICML, PDF) - gitignored, do not commit

Key files:

  • index.qmd: Your report's narrative. This is what readers see. Contains text, embedded charts, and inline results.
  • notebooks/analysis.qmd: Your data analysis. Data loading, cleaning, modeling, and creating visualizations.
    • Prefer a single analysis.qmd notebook for simplicity and clarity
    • If you need multiple notebooks, check with the team first to discuss organization
  • _quarto.yml: Quarto configuration - YAML front matter, format settings, and stylesheet references (all set up by the template)

Data Flow: Analysis β†’ Report

We keep analysis separate from narrative. Here's how data flows from notebooks/analysis.qmd to index.qmd:

1. Export Variables from Analysis (R)

In notebooks/analysis.qmd, export objects to a local RData file (gitignored, do not commit):

# Do your analysis
total_savings <- calculate_savings(data)
growth_rate <- 0.15

# Export variables for the report
save(total_savings, growth_rate, file = "report_vars.RData")

Then load them in index.qmd:

# Load analysis results
load("report_vars.RData")

Note: The .RData file is gitignored - only the code that generates it is versioned.

2. Embed Charts from Analysis

For visualizations, use Quarto's embed syntax to pull charts from notebooks/analysis.qmd into index.qmd:

In notebooks/analysis.qmd, create a labeled code chunk:

```{r}
#| label: fig-energy-savings
#| fig-cap: "Annual energy savings by upgrade"

ggplot(data, aes(x = upgrade, y = savings)) +
  geom_col() +
  theme_minimal()
```

In index.qmd, embed it:

{{< embed notebooks/analysis.qmd#fig-energy-savings >}}

The chart appears in your report without duplicating code!

Report Output Formats

Our template includes just commands for different outputs:

just render    # Generate HTML for web publishing (static site)
just draft     # Generate Word doc for content reviews with collaborators
just typeset   # Generate ICML for creating typeset PDFs in InDesign

When to use each:

  • HTML (just render): Publishing our reports as interactive web pages
  • DOCX (just draft): Sharing drafts for content review and feedback
  • ICML (just typeset): Professional typesetting in Adobe InDesign for PDF export

The template automatically configures these formats with our stylesheets and branding.

🌐 Publishing Web Reports

Once your report is ready to be published, you'll want to get it to the web so others can access it. This section walks through how our web hosting works and the steps to publish.

How Web Reports Work

When you run just render in your report directory (e.g., reports/ny_aeba_grid/), Quarto generates a complete static website in the docs/ subdirectory:

reports/ny_aeba_grid/
└── docs/                      # Generated by Quarto (gitignored in report dirs)
    ├── index.html             # Main HTML file
    ├── img/                   # Images
    ├── index_files/           # Supporting files
    ├── notebooks/             # Embedded notebook outputs
    └── site_libs/             # JavaScript, CSS, etc.

Why this matters: The entire docs/ directory is self-contained. You can copy it anywhere, open index.html, and the report will work perfectly - all assets are bundled together.

💡 Development Tip: Use this during development! Run just render frequently and open reports/<project_code>/docs/index.html in your browser to see exactly how your report will look when published to the web. This lets you iterate on design and layout before publishing.

How We Host Reports

We use GitHub Pages to host our reports automatically:

  1. Any files in the root docs/ directory on the main branch are automatically published to https://switchbox-data.github.io/reports2/

  2. To publish a report, we copy the rendered report from reports/<project_code>/docs/ to the root docs/<project_code>/ directory

  3. URLs: A report at docs/ny_aeba_grid/index.html becomes accessible at https://switchbox-data.github.io/reports2/ny_aeba_grid/

📂 See what's published: Check the docs/ directory to see currently published reports.

💡 Key insight: Publishing = rendering your report ➡️ copying to docs/<project_code>/ ➡️ merging to main

Publishing Step-by-Step

Follow these steps to publish or update a web report:

1. Prepare Your Report

  • Finish your work following the development workflow (create issue, work in a branch, etc.)
  • Make sure your PR is ready to merge
  • Open your devcontainer terminal

2. Navigate to Your Report Directory

cd reports/<project_code>

3. Verify Render Configuration

Check that all notebooks needed for the report are listed in _quarto.yml under project > render:

project:
  render:
    - index.qmd
    - notebooks/analysis.qmd

Why: Quarto only renders files you explicitly list, ensuring you have control over what gets published.

4. Clear Cache and Render

# Clear any cached content
just clean

# Render the HTML version
just render

This creates fresh output in reports/<project_code>/docs/.

5. Copy to Publishing Directory

# Copy rendered report to root docs/ directory
just publish

What this does: Copies all files from reports/<project_code>/docs/ to docs/<project_code>/. If files already exist at docs/<project_code>/, they're deleted first to ensure a clean publish.
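The delete-then-copy behavior can be sketched in a few lines of Python (illustrative only; the actual recipe lives in the report's Justfile):

```python
import shutil
from pathlib import Path

def publish(project_code: str, repo_root: str = ".") -> Path:
    """Sketch of what `just publish` does: clean-slate copy to root docs/."""
    src = Path(repo_root) / "reports" / project_code / "docs"
    dst = Path(repo_root) / "docs" / project_code
    if dst.exists():
        shutil.rmtree(dst)       # drop any previously published files first
    shutil.copytree(src, dst)    # then copy the freshly rendered site wholesale
    return dst
```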

6. Return to Repository Root

cd ../..

7. Stage the Published Files

git add -f docs/

Why -f (force)? The docs/ directory is gitignored in report directories to prevent accidental commits during development. The -f flag overrides this, allowing you to commit to the root docs/ directory intentionally.

8. Commit and Push

git commit -m "Publish new version of <project_code> report"
git push

Important: Pushing to your branch does NOT publish the report yet - it must be merged to main first.

9. Merge to Main

  • Go to your PR on GitHub
  • Merge it to main

10. Verify Deployment

  • GitHub automatically triggers a "pages build and deployment" workflow
  • Check Actions to see the workflow run
  • Once the workflow is green (✓), your report is live at https://switchbox-data.github.io/reports2/<project_code>/

Adding a PDF Link

Once you have a final PDF version (typically typeset in InDesign), you can add a download link to the web report:

1. Get the PDF

  • Download the final PDF from Google Drive: Projects > <project_code> > final
  • Ensure it's named switchbox_<project_code>.pdf (rename if needed)

2. Add to Report Directory

Place the PDF in your report root:

reports/<project_code>/
├── switchbox_<project_code>.pdf   # ← Add here
├── index.qmd
└── ...

3. Update YAML Front Matter

Add or uncomment this in index.qmd's YAML header:

other-links:
  - text: PDF
    icon: file-earmark-pdf
    href: switchbox_<project_code>.pdf

What this does: Adds a PDF download button to your web report's navigation.

4. Re-render and Publish

# Re-render with PDF link
just render

# Verify the PDF link appears and works locally

# Publish the update
just publish
cd ../..
git add -f docs/
git commit -m "Add PDF link to <project_code> report"
git push

Then merge to main following steps 9-10 above.

📦 Managing Dependencies

Python Dependencies

Adding a Python package

uv add <package-name>

This command:

  • Adds the package to pyproject.toml
  • Updates uv.lock with resolved dependencies
  • Installs the package in your virtual environment

Example:

uv add polars  # Add polars as a dependency
uv add --dev pytest-mock  # Add as a dev dependency

⚠️ Important: Do NOT use pip install to add packages. Using pip install will install the package locally but will not update pyproject.toml or uv.lock, meaning others won't get your dependencies. Always use uv add.

How Your Package Persists

In dev container:

  • When you run uv add package-name, packages are installed to /opt/venv/ inside the container
  • They stay in the container and are not exported to your local filesystem, so if you restart the container, the package will be gone!
  • To make your new package persist, you need to add it to the image itself, by committing pyproject.toml and uv.lock and pushing to GitHub
  • If you're using devcontainers on your laptop, rebuild the container and your package will be permanently installed within the image
  • If you're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; restart your workspace to use the new image
  • Bottom line: Run uv add, commit pyproject.toml and uv.lock, and rebuild the devcontainer to persist packages

On regular laptop:

  • When you run uv add package-name, packages are installed to .venv/, which persists in your local workspace
  • Packages remain installed between sessions
  • No reinstallation needed unless you delete .venv/ or run uv sync after changes

How Others Get Your Package

In dev container:

  1. You commit both pyproject.toml and uv.lock, and push to Github
  2. Others pull your changes
  3. They rebuild their container:
    • If they're using devcontainers on their laptop, when they rebuild their container, all packages in uv.lock (including your new one) will be permanently installed into the image
    • If they're using devcontainers on Devpod, GitHub Actions will automatically rebuild the image with the new package; they just need to restart their workspace to use the new image

On regular laptop:

  1. You commit both pyproject.toml and uv.lock to git
  2. Others pull your changes
  3. They must manually run uv sync to install the new dependency

R Dependencies

R dependency management works differently: you manually update a file that lists the packages, then install them.

Adding a new R package

  1. Add it to DESCRIPTION in the Imports section:

    Imports:
        dplyr,
        ggplot2,
        arrow
    
  2. Install it by running:

    just install

How Your Package Persists

In dev container:

  • If you install a package directly with pak::pak("dplyr"), the package is installed temporarily in the container
  • It will be gone when the container restarts!
  • If you add it to DESCRIPTION and run just install, as documented above, the package will likewise only be installed temporarily
  • However, if you then commit DESCRIPTION and push to GitHub...
  • If you're using devcontainers on your laptop, rebuild the container and every package in DESCRIPTION (including your new one) will be permanently installed within the image
  • If you're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; restart your workspace to use the new image
  • Bottom line: Add packages to DESCRIPTION, commit it, and rebuild the devcontainer to persist them

On regular laptop:

  • Packages are saved to your global R library (typically ~/R/library/)
  • Packages remain installed between sessions
  • No reinstallation needed unless you uninstall them or use a different R version

How Others Get Your Package

In dev container:

  1. You add a package to DESCRIPTION, commit it, and push to GitHub
  2. Others pull your changes
  3. They rebuild their container:
    • If they're using devcontainers on their laptop, when they rebuild their container, all packages in DESCRIPTION (including your new one) will be permanently installed into the image
    • If they're using devcontainers on Devpod, GitHub Actions will automatically rebuild the image with the new package; they just need to restart their workspace to use the new image

On regular laptop:

  1. You add a package to DESCRIPTION and commit it to git
  2. Others pull your changes
  3. They manually install dependencies:
    just install
    Or in an R session:
    pak::local_install_deps()  # Installs all dependencies from DESCRIPTION

🔀 When to Use Python vs. R

Both languages are available, but we have clear preferences based on the type of work:

Use R (with tidyverse) for:

  • Data analysis - Exploratory data analysis, statistical analysis
  • Data modeling - Statistical models, regression, forecasting
  • Data visualization - Creating charts and graphs for reports
  • Default choice - Unless there's a specific reason to use Python, prefer R for these tasks

Use Python for:

  • Data engineering - Scripts that fetch, process, and upload data to S3
  • Numerical simulations - Generating synthetic data, Monte Carlo simulations
  • Library requirements - When a specific Python library is needed

Why this split?

  • R/tidyverse excels at interactive analysis and producing publication-quality visualizations
  • Python excels at scripting, automation, and computational tasks
  • Our reports are written in Quarto, which works seamlessly with both languages
  • You can use both in the same report when needed, but prefer consistency within a single analysis

📊 Working with Data

🔑 AWS Configuration

Data for analyses is stored on S3. Ensure you have:

  • AWS credentials configured in ~/.aws/credentials
  • Default region set to us-west-2

The dev container automatically mounts your local AWS credentials.

Reading Data from S3

All data lives in S3 - we do not store data files in this git repository. Our primary data bucket is s3://data.sb/, and we also use public open data sources from other S3 buckets.

We prefer Parquet files in our data lake for efficient columnar storage and compression.

Using Python (with polars / arrow)

import polars as pl

# Scan multiple parquet files from S3 (lazy - doesn't load data yet)
lf = pl.scan_parquet("s3://data.sb/eia/heating_oil_prices/*.parquet")

# Apply transformations and aggregations (still lazy)
# By using parquet and arrow with lazy execution, we limit how much data is downloaded
result = (
    lf.filter(pl.col("state") == "RI")
    .group_by("year")
    .agg(pl.col("price").mean().alias("avg_price"))
)

# 💡 TIP: Stay in lazy execution as long as possible - do all filtering, grouping, and aggregations
# before calling collect(). This allows Polars to optimize the query plan and minimize data transfer.

# Collect to download subset of parquet data, perform aggregation, and load into memory
# Result is a Polars DataFrame (columnar, Arrow-based format)
df = result.collect()

# To convert to row-oriented format, use to_pandas() to get a pandas DataFrame
# ⚠️ WARNING: We do NOT use pandas for analysis - only use if a library requires pandas DataFrame
pandas_df = result.collect().to_pandas()

Using R (with dplyr / arrow)

library(arrow)
library(dplyr)

# Scan multiple parquet files from S3 (lazy - doesn't load data yet)
lf <- open_dataset("s3://data.sb/eia/heating_oil_prices/")  # point at the dataset directory; arrow discovers the parquet files (open_dataset doesn't expand globs)

# Apply transformations and aggregations (still lazy)
# By using parquet and arrow with lazy execution, we limit how much data is downloaded
result <- lf |>
  filter(state == "RI") |>
  group_by(year) |>
  summarize(avg_price = mean(price))

# 💡 TIP: Stay in lazy execution as long as possible - do all filtering, grouping, and aggregations
# before calling compute(). This allows Arrow to optimize the query plan and minimize data transfer.

# Compute to download subset of parquet data, perform aggregation, and load into memory
# Result is an Arrow Table (columnar, Arrow-based format - same as Python polars)
df <- result |> compute()

# Or use collect() to convert to tibble (row-oriented, standard R data frame)
# ⚠️ WARNING: stay in arrow whenever possible; only use if a library requires tibbles
tibble_df <- result |> collect()

Performance Considerations

Initial downloads can be slow depending on:

  • File size
  • Your internet connection
  • Distance from the S3 region (us-west-2)

Options to improve performance:

  1. Cache locally: Download files once and cache in data/ (gitignored)
  2. Run dev containers in the cloud: See Option 2 in Quick Start for launching devcontainers on AWS in us-west-2 region, same as the data bucket
  3. Use partitioned datasets: Only read the partitions you need

When reports execute: Data is downloaded from S3 at runtime. The first run may be slower, but subsequent runs can use cached data if you've set up local caching.

Writing Data to S3

⚠️ Note: We have naming conventions but the upload process is still being standardized.

Naming and Organization Conventions

Directory structure:

s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>
  • Org: Organization producing the data (e.g., nrel, eia, fred)
  • Dataset: Name of the dataset (e.g., heating_oil, resstock, inflation_factors)
    • Always use a dataset directory, even if there's only one file
    • Prefer official data product names when they exist (e.g., EIA's "State Energy Data System", Census Bureau's "American Community Survey")
    • If the official name is long, use a clear abbreviated version in lowercase with underscores
  • Filename: Descriptive name for the specific file
    • Do NOT include the org name or dataset name (already in the path)
    • Do include geographic scope, time granularity, or other distinguishing info about the contents of the file
    • Must end with _YYYYMMDD reflecting when the data was downloaded
  • Naming style: Everything lowercase with underscores
    • Good: eia/heating_oil_prices/ri_monthly_20240315.parquet
    • Bad: EIA/Heating-Oil_ri_eia_prices_2024-03-15.parquet
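The rules above can be seen all at once in a small checker (hypothetical, not part of the repo's tooling) that accepts the good example and rejects the bad one:

```python
import re

# Hypothetical checker: lowercase org/dataset segments, then a filename
# ending in _YYYYMMDD.parquet (the download date).
KEY_PATTERN = re.compile(
    r"^s3://data\.sb/"
    r"[a-z0-9_]+/"                  # org, e.g. eia
    r"[a-z0-9_]+/"                  # dataset, e.g. heating_oil_prices
    r"[a-z0-9_]+_\d{8}\.parquet$"   # descriptive stem + download date
)

def follows_convention(key: str) -> bool:
    return KEY_PATTERN.match(key) is not None
```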

Why timestamps matter:

  • Versioning: New snapshots get new dates (e.g., ri_monthly_20240415.parquet)
  • Reproducibility: Old code using ri_monthly_20240315.parquet continues to work
  • Traceability: Know exactly when data was retrieved

Example structure:

s3://data.sb/
├── eia/
│   ├── heating_oil_prices/
│   │   ├── ri_monthly_20240315.parquet
│   │   └── ct_monthly_20240315.parquet
│   └── propane/
│       └── ri_monthly_20240315.parquet
├── nrel/
│   └── resstock/
│       └── 2024_release2_tmy3/...
└── fred/
    └── inflation_factors/
        └── monthly_20240301.parquet

Uploading Data

Preferred method: Use scripted downloads via just commands

Check the lib/ directory for existing data fetch scripts:

  • lib/eia/fetch_delivered_fuels_prices_eia.py
  • lib/eia/fetch_eia_state_profile.py

These scripts should:

  1. Download data from the source
  2. Process and save as Parquet
  3. Upload to S3 with proper naming (including date)
  4. Be runnable via just commands for reproducibility
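A fetch script following those four steps might be skeletoned like this (hypothetical names and source URL throughout; the real scripts in lib/ may be structured differently):

```python
from datetime import date

def build_s3_key(org: str, dataset: str, stem: str, snapshot: date) -> str:
    """Build an S3 key following the naming convention above."""
    return f"{org}/{dataset}/{stem}_{snapshot.strftime('%Y%m%d')}.parquet"

def main() -> None:
    # Sketch only: the source URL and column handling are placeholders.
    import polars as pl
    import boto3

    df = pl.read_csv("https://example.org/heating_oil.csv")  # 1. download from source
    local_path = "ri_monthly.parquet"
    df.write_parquet(local_path)                             # 2. process and save as Parquet
    key = build_s3_key("eia", "heating_oil_prices", "ri_monthly", date.today())
    boto3.client("s3").upload_file(local_path, "data.sb", key)  # 3. upload with dated name

if __name__ == "__main__":
    main()  # 4. in practice, wired up as a `just` recipe for reproducibility
```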

Manual uploads (when necessary):

# Using AWS CLI
aws s3 cp local_file.parquet s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>

# Example: uploading heating oil data for Rhode Island
aws s3 cp ri_monthly_20240315.parquet s3://data.sb/eia/heating_oil/ri_monthly_20240315.parquet

⚠️ Before uploading: Coordinate with the team to ensure:

  • Naming follows conventions
  • Path doesn't conflict with existing data
  • Date reflects actual download/creation date

πŸ” Code Quality & Testing

Pre-commit Hooks

What are pre-commit hooks? Pre-commit hooks are automated scripts that run every time you make a git commit. They catch issues (formatting, linting, syntax errors) before code enters the repository, ensuring consistent code quality across the team. This saves time in code review and prevents broken code from being committed.

Why we use them: By automatically formatting and checking code at commit time, we ensure that all code in the repository meets our quality standards without requiring manual intervention. Everyone's code is formatted consistently, and common errors are caught immediately.

Pre-commit hooks are managed by prek and configured in .pre-commit-config.yaml.

Configured hooks:

  • ruff-check: Lints Python code with auto-fix
  • ruff-format: Formats Python code
  • ty-check: Type checks Python code using ty
  • trailing-whitespace: Removes trailing whitespace
  • end-of-file-fixer: Ensures files end with a newline
  • check-yaml/json/toml: Validates config file syntax
  • check-added-large-files: Prevents commits of files >600KB
  • check-merge-conflict: Detects merge conflict markers
  • check-case-conflict: Prevents case-sensitivity issues

Note on R formatting: We don't yet have the air formatter for R integrated with pre-commit hooks. Instead, use air's editor integration via the Posit.air-vscode extension, which is pre-installed in the dev container. Air will automatically format your R code in the editor.

Hooks run automatically on git commit. To run manually:

prek run -a  # Run all hooks on all files

Run Quality Checks

just check

This command performs (same as CI):

  1. Lock file validation: Ensures uv.lock is consistent with pyproject.toml
  2. Pre-commit hooks: Runs all configured hooks including type checking (see above)

Optional - Check for obsolete dependencies:

just check-deps

Runs deptry to check for unused dependencies in pyproject.toml.

Run Tests

just test

Runs the Python test suite with pytest (tests in tests/ directory), including doctest validation.

Note on R tests: We don't currently have R tests or a testing framework configured for R code. Only Python tests are run by just test and in CI.

🚦 CI/CD Pipeline

The repository uses GitHub Actions to automatically run quality checks and tests on your code. The CI runs the exact same just commands you use locally, in the same devcontainer environment, ensuring perfect consistency.

What Runs and When

The workflow runs two jobs in parallel for speed:

On Pull Requests (opened, updated, or marked ready for review):

  1. quality-checks: Runs just check (lock file validation + pre-commit hooks)
  2. tests: Runs just test (pytest test suite)

On Push to main:

  • Same checks and tests as pull requests (both jobs run in parallel)

Devcontainer in CI/CD

On every commit to main or a pull request, GitHub Actions actually builds the devcontainer image - the same process that happens when you rebuild in VS Code.

To avoid 15-minute builds every time, we use a two-tier caching strategy:

  1. Image caching: If .devcontainer/Dockerfile, pyproject.toml, uv.lock, and DESCRIPTION haven't changed, the devcontainer image build is skipped, and the most recent image (in GHCR) is reused (~30 seconds)
  2. Layer caching: If any of these files changed, the image is rebuilt, but only affected layers rebuild while the others are pulled from GHCR cache (incremental builds, ~2-5 minutes)
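The skip-or-rebuild decision amounts to fingerprinting the environment-defining files. An illustrative sketch only (the actual cache-key logic lives in the GitHub Actions workflow and may differ):

```python
import hashlib
from pathlib import Path

# The files whose contents define the devcontainer environment (listed above)
ENV_FILES = [".devcontainer/Dockerfile", "pyproject.toml", "uv.lock", "DESCRIPTION"]

def env_fingerprint(repo_root: str = ".") -> str:
    """Digest that changes exactly when an environment-defining file changes."""
    digest = hashlib.sha256()
    for name in ENV_FILES:
        path = Path(repo_root) / name
        if path.exists():
            digest.update(name.encode())   # bind each file's content to its name
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]         # short tag, usable as an image label
```

If the fingerprint matches the most recent image's tag, the build can be skipped and the cached image reused.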

Once built, the image is pushed to GitHub Container Registry (GHCR), where it's immediately available as ghcr.io/switchbox-data/reports2:latest.

The quality-checks and tests jobs then pull this prebuilt image and run just check and just test inside it - no rebuilding required.

Devcontainer prebuilds: Double Duty

These CI builds serve a dual purpose:

  1. For CI: Tests run in the exact devcontainer environment
  2. For Devpod: Users can launch devcontainers on AWS (similar to Codespaces) using the prebuilt image - unlike when using devcontainers on your laptop, you don't have to wait for the image to build

Bottom line: Every commit that modifies the devcontainer or dependencies triggers an automatic devcontainer image build. This ensures CI uses the correct environment, and anyone (including Devpod users) can use the fully built devcontainer without building it from scratch.

Why This Matters

  • Perfect Consistency: CI literally runs just check and just test inside the devcontainer (exactly what you run locally), and the devcontainer is rebuilt whenever its definition changes
  • Speed: Devcontainer is only rebuilt when necessary, so quality checks and tests usually run immediately, and they run in parallel, making CI faster
  • Safety net: Even if someone skips pre-commit checks locally, CI catches code quality issues before merge
  • Code quality: Every PR must pass all checks and tests before it can be merged
  • Optional checks: Dependency audits (just check-deps) are not part of CI but available locally for additional validation

You can see the workflow configuration in .github/workflows/data-science-main.yml.

🧹 Cleaning Up

Remove generated files and caches:

just clean

This removes:

  • .pytest_cache
  • .ruff_cache
  • tmp/
  • notebooks/.quarto

📚 Additional Resources

Development Environment

Python Tools

R Tools
