
reports2

Switchbox is a nonprofit think tank that produces rigorous, accessible data on U.S. state climate policy for advocates, policymakers, and the public. Find out more at www.switch.box

This repository contains Switchbox's reports. We use a modern bilingual stack combining Python and R.


🎯 Overview

  • Quarto Reports: All reports written in Quarto (located in reports/ directory)
  • Cloud Data: All data used in our reports is stored in an S3 bucket (`s3://data.sb/`) on AWS
  • Bilingual Analytics: Reports use both Python (polars) and R (tidyverse)
  • Fast Package Management:
    • Python: uv for lightning-fast dependency resolution
    • R: pak with P3M for binary package installation
  • Reproducible Development Environment: Intended to be used with VS Code Dev Containers
  • Task Runner: just for convenient command execution
  • Code Quality: Modern linting and formatting with ruff (Python) and air (R), automated via prek pre-commit hooks

🌍 Why Open Source?

Our reports (on our website) are shared under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), and the code behind them (contained in this repo) is released under the MIT License.

We do this for two reasons:

1. 🔍 Transparency

As a research group, our work is frequently cited in the press and aims to shape the policy conversation. Given this, we believe that the public has a right to see exactly how we produce our findings, going well beyond vague methodology sections.

2. 🔬 Open Science

We believe the clean energy transition will happen faster if energy researchers, particularly those working in the nonprofit sector, embrace open data and open source code so that we can build on each other's work rather than reinventing the wheel.

πŸ“ Repo Structure

While the rest of this README.md will walk through the contents of this repo in detail, here is an initial overview:

reports2/
├── .devcontainer/          # Dev container configuration
├── docs/                   # Published HTML reports (hosted via GitHub Pages at switchbox-data.github.io/reports2)
├── lib/                    # Shared libraries and utilities
├── reports/                # Switchbox report projects (source code for reports on our website)
│   ├── ct_hp_rates/        # Individual Quarto projects for each report
│   ├── il_lea/
│   └── ...
├── tests/                  # Python test suite
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── pyproject.toml          # Python dependencies and tool configuration
├── uv.lock                 # Locked Python dependencies
└── Justfile                # Command runner recipes

🛠️ Available Commands

While the rest of this README.md will explain when these commands should be used, here is an initial overview of the "tasks" you can perform in this repo.

Two places to use just: Commands are available in the repository root (for development environment and testing) and in individual report directories (for rendering reports). These commands are defined in Justfiles in each location - see the root Justfile and individual report Justfiles (e.g., reports/ny_aeba_grid/Justfile).

View all available commands in your current location:

just --list

Repository root commands:

  • just install - Set up development environment
  • just check - Run quality checks (same as CI: lock file + pre-commit hooks)
  • just check-deps - Check for obsolete dependencies with deptry
  • just test - Run test suite (same as CI)
  • just new_report - Create a new Quarto report in reports/
  • just aws - Authenticate with AWS SSO
  • just devpod - Launch devcontainer on AWS via DevPod
  • just clean - Remove generated files and caches

Report directory commands (reports/<project_code>/):

  • just render - Render HTML version of report using Quarto
  • just draft - Render Word document for content reviews using Quarto
  • just typeset - Render ICML for InDesign typesetting using Quarto
  • just publish - Copy rendered HTML version of report from project docs/ to root docs/ for web publishing
  • just clean - Remove generated files and caches

🚀 Quick Start

If you want to rerun and edit the code that generates our reports, this section shows you how to get started.

Option 1: Dev Container (Recommended)

The fastest way to get started is using the provided dev container:

  1. Prerequisites:

    • Docker (required to run dev containers)
    • VS Code or Cursor with the Dev Containers extension

  2. Launch:

    • Open this repository in VS Code or Cursor
    • Click "Reopen in Container" when prompted (or use Cmd+Shift+P → "Dev Containers: Reopen in Container")
    • The container will automatically install:
      • Core Tools: Quarto, just, AWS CLI, GitHub CLI (gh), prek
      • Python Stack: Python dependencies via uv, ruff for fast linting/formatting, ty for type checking
      • R Stack: R dependencies via pak, air for fast linting/formatting
      • Pre-commit Hooks: Configured via prek
      • Editor Extensions: Python, R, Quarto, TOML, just syntax, and more
  3. Verify Installation:

    just --list  # See all available commands

Option 2: Dev Container on AWS (via DevPod)

For faster data access (especially with large datasets), you can launch the devcontainer on AWS in region us-west-2, close to where our data is stored in S3.

  1. Prerequisites:

    • Install DevPod
    • Configure the AWS provider in DevPod (detailed setup documentation coming soon)
  2. Launch:

    just aws      # Authenticate with AWS so Devpod can create an EC2 instance
    just devpod   # Launch devcontainer on EC2 instance

This uses a prebuilt devcontainer image from GitHub Container Registry, so you don't have to wait for the image to build.

Option 3: Manual Installation (Without Dev Container)

If you prefer not to use dev containers:

  1. Install Prerequisites:

    • Python 3.9+
    • R 4.0+
    • pak package for R
    • just
  2. Run Setup:

    just install

    This command:

    • Installs uv
    • Uses uv to create a virtualenv and install Python packages into it
    • Uses pak to install R packages listed in DESCRIPTION file
    • Installs pre-commit hooks with prek

πŸ“ Creating a New Report

To create a new Quarto report from the Switchbox template:

just new_report

You'll be prompted to enter a report name.

Naming convention: Use state_topic format (e.g., ny_aeba, ri_hp_rates). If we've used a topic before in other states (like hp_rates), reuse it to maintain consistency across reports.
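As a quick illustration of the convention (not part of the repo's tooling; the name and pattern below are hypothetical), a report name check might look like:

```python
import re

# Hypothetical helper: two-letter state code, then one or more
# lowercase_with_underscores topic segments (e.g. ny_aeba, ri_hp_rates).
REPORT_NAME = re.compile(r"^[a-z]{2}_[a-z0-9]+(?:_[a-z0-9]+)*$")

def is_valid_report_name(name: str) -> bool:
    """Return True if `name` follows the state_topic convention."""
    return REPORT_NAME.match(name) is not None
```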

This will create a new Quarto project in reports/ based on the Switchbox template (see Report Structure below for the resulting layout).

🔄 Development Workflow

We follow a structured development process to ensure all work is tracked, organized, and reviewed. This workflow keeps PRs from growing stale, maintains code quality, and ensures every piece of work is tied to a clear ticket.

Our workflow:

  1. All work starts with a GitHub Issue - Captured in our Kanban board with clear "What", "Why", "How", and "Deliverables"
  2. Issues are reviewed before work begins - Ensures alignment before coding starts
  3. Branches are created from issues - Automatic linking between code and tickets
  4. PRs are created early - Work-in-progress is visible and reviewable
  5. Code is reviewed before merging - Quality checks and peer review catch issues
  6. PRs are short-lived - Merged within a week to keep momentum

This process ensures:

  • πŸ“ Every feature/fix has documentation (the issue)
  • πŸ”— Code changes are traceable to requirements
  • πŸ‘€ Work is visible and reviewable at all stages
  • βœ… Code is reviewed early and often

For complete details on our workflow, including how to create issues, request reviews, and merge PRs, see CONTRIBUTING.md.

📄 Understanding Quarto Reports

What is Quarto?

Quarto is a scientific publishing system that lets you combine narrative text with executable code. This approach, called literate programming, means your analysis and narrative live together in one document.

Simple example:

## Heat Pump Adoption

Total residential heat pumps installed in Rhode Island:

```{r}
total_installs <- sum(resstock_data$heat_pump_count)
print(total_installs)
```

This represents `{r} pct_electric`% of all heating systems statewide.

The code executes when you render the document, weaving results directly into your narrative. Learn more about Quarto.

Our Opinionated Approach

We use Quarto in a specific way, defined by our report template. Our reports are based on Quarto's Manuscript project type, which is designed for scientific and technical publications.

Report Structure

After running just new_report, you'll have this structure:

reports/your-report/
├── index.qmd              # The actual report (narrative + embedded results)
├── notebooks/
│   └── analysis.qmd       # The underlying analysis (data wrangling + analysis + visualization)
├── _quarto.yml            # Quarto project configuration (set up by template)
├── Justfile               # Commands to render the report
└── docs/                  # Generated output (HTML, DOCX, ICML, PDF) - gitignored, do not commit

Key files:

  • index.qmd: Your report's narrative. This is what readers see. Contains text, embedded charts, and inline results.
  • notebooks/analysis.qmd: Your data analysis. Data loading, cleaning, modeling, and creating visualizations.
    • Prefer a single analysis.qmd notebook for simplicity and clarity
    • If you need multiple notebooks, check with the team first to discuss organization
  • _quarto.yml: Quarto configuration - YAML front matter, format settings, and stylesheet references (all set up by the template)

Data Flow: Analysis β†’ Report

We keep analysis separate from narrative. Here's how data flows from notebooks/analysis.qmd to index.qmd:

1. Export Variables from Analysis (R)

In notebooks/analysis.qmd, export objects to a local RData file (gitignored, do not commit):

# Do your analysis
total_savings <- calculate_savings(data)
growth_rate <- 0.15

# Export variables for the report
save(total_savings, growth_rate, file = "report_vars.RData")

Then load them in index.qmd:

# Load analysis results
load("report_vars.RData")

Note: The .RData file is gitignored - only the code that generates it is versioned.

2. Embed Charts from Analysis

For visualizations, use Quarto's embed syntax to pull charts from notebooks/analysis.qmd into index.qmd:

In notebooks/analysis.qmd, create a labeled code chunk:

```{r}
#| label: fig-energy-savings
#| fig-cap: "Annual energy savings by upgrade"

ggplot(data, aes(x = upgrade, y = savings)) +
  geom_col() +
  theme_minimal()
```

In index.qmd, embed it:

{{< embed notebooks/analysis.qmd#fig-energy-savings >}}

The chart appears in your report without duplicating code!

Report Output Formats

Our template includes just commands for different outputs:

just render    # Generate HTML for web publishing (static site)
just draft     # Generate Word doc for content reviews with collaborators
just typeset   # Generate ICML for creating typeset PDFs in InDesign

When to use each:

  • HTML (just render): Publishing our reports as interactive web pages
  • DOCX (just draft): Sharing drafts for content review and feedback
  • ICML (just typeset): Professional typesetting in Adobe InDesign for PDF export

The template automatically configures these formats with our stylesheets and branding.

🌐 Publishing Web Reports

Once your report is ready to be published, you'll want to get it to the web so others can access it. This section walks through how our web hosting works and the steps to publish.

How Web Reports Work

When you run just render in your report directory (e.g., reports/ny_aeba_grid/), Quarto generates a complete static website in the docs/ subdirectory:

reports/ny_aeba_grid/
└── docs/                      # Generated by Quarto (gitignored in report dirs)
    ├── index.html             # Main HTML file
    ├── img/                   # Images
    ├── index_files/           # Supporting files
    ├── notebooks/             # Embedded notebook outputs
    └── site_libs/             # JavaScript, CSS, etc.

Why this matters: The entire docs/ directory is self-contained. You can copy it anywhere, open index.html, and the report will work perfectly - all assets are bundled together.

💡 Development Tip: Use this during development! Run just render frequently and open reports/<project_code>/docs/index.html in your browser to see exactly how your report will look when published to the web. This lets you iterate on design and layout before publishing.

How We Host Reports

We use GitHub Pages to host our reports automatically:

  1. Any files in the root docs/ directory on the main branch are automatically published to https://switchbox-data.github.io/reports2/

  2. To publish a report, we copy the rendered report from reports/<project_code>/docs/ to the root docs/<project_code>/ directory

  3. URLs: A report at docs/ny_aeba_grid/index.html becomes accessible at https://switchbox-data.github.io/reports2/ny_aeba_grid/

📂 See what's published: Check the docs/ directory to see currently published reports.

💡 Key insight: Publishing = rendering your report ➡️ copying to docs/<project_code>/ ➡️ merging to main

Publishing Step-by-Step

Follow these steps to publish or update a web report:

1. Prepare Your Report

  • Finish your work following the development workflow (create issue, work in a branch, etc.)
  • Make sure your PR is ready to merge
  • Open your devcontainer terminal

2. Navigate to Your Report Directory

cd reports/<project_code>

3. Verify Render Configuration

Check that all notebooks needed for the report are listed in _quarto.yml under project > render:

project:
  render:
    - index.qmd
    - notebooks/analysis.qmd

Why: Quarto only renders files you explicitly list, ensuring you have control over what gets published.

4. Clear Cache and Render

# Clear any cached content
just clean

# Render the HTML version
just render

This creates fresh output in reports/<project_code>/docs/.

5. Copy to Publishing Directory

# Copy rendered report to root docs/ directory
just publish

What this does: Copies all files from reports/<project_code>/docs/ to docs/<project_code>/. If files already exist at docs/<project_code>/, they're deleted first to ensure a clean publish.
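The delete-then-copy behavior can be sketched in a few lines of Python (illustrative only; the actual recipe lives in the report's Justfile):

```python
import shutil
from pathlib import Path

def publish(project_code: str, repo_root: str = ".") -> Path:
    """Sketch of what `just publish` does: clean-slate copy to root docs/."""
    src = Path(repo_root) / "reports" / project_code / "docs"
    dst = Path(repo_root) / "docs" / project_code
    if dst.exists():
        shutil.rmtree(dst)       # drop any previously published files first
    shutil.copytree(src, dst)    # then copy the freshly rendered site wholesale
    return dst
```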

6. Return to Repository Root

cd ../..

7. Stage the Published Files

git add -f docs/

Why -f (force)? The docs/ directory is gitignored in report directories to prevent accidental commits during development. The -f flag overrides this, allowing you to commit to the root docs/ directory intentionally.

8. Commit and Push

git commit -m "Publish new version of <project_code> report"
git push

Important: Pushing to your branch does NOT publish the report yet - it must be merged to main first.

9. Merge to Main

  • Go to your PR on GitHub
  • Merge it to main

10. Verify Deployment

  • GitHub automatically triggers a "pages build and deployment" workflow
  • Check Actions to see the workflow run
  • Once the workflow is green (✓), your report is live at https://switchbox-data.github.io/reports2/<project_code>/

Adding a PDF Link

Once you have a final PDF version (typically typeset in InDesign), you can add a download link to the web report:

1. Get the PDF

  • Download the final PDF from Google Drive: Projects > <project_code> > final
  • Ensure it's named switchbox_<project_code>.pdf (rename if needed)

2. Add to Report Directory

Place the PDF in your report root:

reports/<project_code>/
├── switchbox_<project_code>.pdf   # ← Add here
├── index.qmd
└── ...

3. Update YAML Front Matter

Add or uncomment this in index.qmd's YAML header:

other-links:
  - text: PDF
    icon: file-earmark-pdf
    href: switchbox_<project_code>.pdf

What this does: Adds a PDF download button to your web report's navigation.

4. Re-render and Publish

# Re-render with PDF link
just render

# Verify the PDF link appears and works locally

# Publish the update
just publish
cd ../..
git add -f docs/
git commit -m "Add PDF link to <project_code> report"
git push

Then merge to main following steps 9-10 above.

📦 Managing Dependencies

Python Dependencies

Adding a Python package

uv add <package-name>

This command:

  • Adds the package to pyproject.toml
  • Updates uv.lock with resolved dependencies
  • Installs the package in your virtual environment

Example:

uv add polars  # Add polars as a dependency
uv add --dev pytest-mock  # Add as a dev dependency

⚠️ Important: Do NOT use pip install to add packages. Using pip install will install the package locally but will not update pyproject.toml or uv.lock, meaning others won't get your dependencies. Always use uv add.

How Your Package Persists

In dev container:

  • When you run uv add package-name, packages are installed to /opt/venv/ inside the container
  • They stay in the container and are not exported to your local filesystem, so if you restart the container, the package will be gone!
  • To make your new package persist, you need to add it to the image itself, by committing pyproject.toml and uv.lock and pushing to GitHub
  • If you're using devcontainers on your laptop, rebuild the container and your package will be permanently installed within the image
  • If you're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; restart your workspace to use the new image
  • Bottom line: Run uv add, commit pyproject.toml and uv.lock, and rebuild the devcontainer to persist packages

On regular laptop:

  • When you run uv add package-name, packages are installed to .venv/, which persists in your local workspace
  • Packages remain installed between sessions
  • No reinstallation needed unless you delete .venv/ or run uv sync after changes

How Others Get Your Package

In dev container:

  1. You commit both pyproject.toml and uv.lock, and push to Github
  2. Others pull your changes
  3. They rebuild their container:
    • If they're using devcontainers on their laptop, when they rebuild their container, all packages in uv.lock (including your new one) will be permanently installed into the image
    • If they're using devcontainers on Devpod, GitHub Actions will automatically rebuild the image with the new package; they just need to restart their workspace to use the new image

On regular laptop:

  1. You commit both pyproject.toml and uv.lock to git
  2. Others pull your changes
  3. They must manually run uv sync to install the new dependency

R Dependencies

R dependency management works differently: you manually update a file that lists the packages, then install them.

Adding a new R package

  1. Add it to DESCRIPTION in the Imports section:

    Imports:
        dplyr,
        ggplot2,
        arrow
    
  2. Install it by running:

    just install

How Your Package Persists

In dev container:

  • If you install a package directly with pak::pak("dplyr"), the package is installed temporarily in the container
  • It will be gone when the container restarts!
  • If you add it to DESCRIPTION and run just install, as documented above, the package will likewise only be installed temporarily
  • However, if you then commit DESCRIPTION and push to GitHub...
  • If you're using devcontainers on your laptop, rebuild the container and every package in DESCRIPTION (including your new one) will be permanently installed within the image
  • If you're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; restart your workspace to use the new image
  • Bottom line: Add packages to DESCRIPTION, commit it, and rebuild the devcontainer to persist them

On regular laptop:

  • Packages are saved to your global R library (typically ~/R/library/)
  • Packages remain installed between sessions
  • No reinstallation needed unless you uninstall them or use a different R version

How Others Get Your Package

In dev container:

  1. You add a package to DESCRIPTION, commit it, and push to GitHub
  2. Others pull your changes
  3. They rebuild their container:
    • If they're using devcontainers on their laptop, when they rebuild their container, all packages in DESCRIPTION (including your new one) will be permanently installed into the image
    • If they're using devcontainers on Devpod, GitHub Actions will automatically rebuild the image with the new package; they just need to restart their workspace to use the new image

On regular laptop:

  1. You add a package to DESCRIPTION and commit it to git
  2. Others pull your changes
  3. They manually install dependencies:
    just install
    Or in an R session:
    pak::local_install_deps()  # Installs all dependencies from DESCRIPTION

🔀 When to Use Python vs. R

Both languages are available, but we have clear preferences based on the type of work:

Use R (with tidyverse) for:

  • Data analysis - Exploratory data analysis, statistical analysis
  • Data modeling - Statistical models, regression, forecasting
  • Data visualization - Creating charts and graphs for reports
  • Default choice - Unless there's a specific reason to use Python, prefer R for these tasks

Use Python for:

  • Data engineering - Scripts that fetch, process, and upload data to S3
  • Numerical simulations - Generating synthetic data, Monte Carlo simulations
  • Library requirements - When a specific Python library is needed

Why this split?

  • R/tidyverse excels at interactive analysis and producing publication-quality visualizations
  • Python excels at scripting, automation, and computational tasks
  • Our reports are written in Quarto, which works seamlessly with both languages
  • You can use both in the same report when needed, but prefer consistency within a single analysis

📊 Working with Data

🔑 AWS Configuration

Data for analyses is stored on S3. Ensure you have:

  • AWS credentials configured in ~/.aws/credentials
  • Default region set to us-west-2

The dev container automatically mounts your local AWS credentials.

Reading Data from S3

All data lives in S3 - we do not store data files in this git repository. Our primary data bucket is s3://data.sb/, and we also use public open data sources from other S3 buckets.

We prefer Parquet files in our data lake for efficient columnar storage and compression.

Using Python (with polars / arrow)

import polars as pl

# Scan multiple parquet files from S3 (lazy - doesn't load data yet)
lf = pl.scan_parquet("s3://data.sb/eia/heating_oil_prices/*.parquet")

# Apply transformations and aggregations (still lazy)
# By using parquet and arrow with lazy execution, we limit how much data is downloaded
result = (
    lf.filter(pl.col("state") == "RI")
    .group_by("year")
    .agg(pl.col("price").mean().alias("avg_price"))
)

# 💡 TIP: Stay in lazy execution as long as possible - do all filtering, grouping, and aggregations
# before calling collect(). This allows Polars to optimize the query plan and minimize data transfer.

# Collect to download subset of parquet data, perform aggregation, and load into memory
# Result is a Polars DataFrame (columnar, Arrow-based format)
df = result.collect()

# To convert to row-oriented format, use to_pandas() to get a pandas DataFrame
# ⚠️ WARNING: We do NOT use pandas for analysis - only use if a library requires pandas DataFrame
pandas_df = result.collect().to_pandas()

Using R (with dplyr / arrow)

library(arrow)
library(dplyr)

# Scan multiple parquet files from S3 (lazy - doesn't load data yet)
lf <- open_dataset("s3://data.sb/eia/heating_oil_prices/")  # point at the dataset directory; arrow discovers the parquet files (open_dataset doesn't expand globs)

# Apply transformations and aggregations (still lazy)
# By using parquet and arrow with lazy execution, we limit how much data is downloaded
result <- lf |>
  filter(state == "RI") |>
  group_by(year) |>
  summarize(avg_price = mean(price))

# 💡 TIP: Stay in lazy execution as long as possible - do all filtering, grouping, and aggregations
# before calling compute(). This allows Arrow to optimize the query plan and minimize data transfer.

# Compute to download subset of parquet data, perform aggregation, and load into memory
# Result is an Arrow Table (columnar, Arrow-based format - same as Python polars)
df <- result |> compute()

# Or use collect() to convert to tibble (row-oriented, standard R data frame)
# ⚠️ WARNING: stay in arrow whenever possible; only use if a library requires tibbles
tibble_df <- result |> collect()

Performance Considerations

Initial downloads can be slow depending on:

  • File size
  • Your internet connection
  • Distance from the S3 region (us-west-2)

Options to improve performance:

  1. Cache locally: Download files once and cache in data/ (gitignored)
  2. Run dev containers in the cloud: See Option 2 in Quick Start for launching devcontainers on AWS in us-west-2 region, same as the data bucket
  3. Use partitioned datasets: Only read the partitions you need

When reports execute: Data is downloaded from S3 at runtime. The first run may be slower, but subsequent runs can use cached data if you've set up local caching.

Writing Data to S3

⚠️ Note: We have naming conventions but the upload process is still being standardized.

Naming and Organization Conventions

Directory structure:

s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>
  • Org: Organization producing the data (e.g., nrel, eia, fred)
  • Dataset: Name of the dataset (e.g., heating_oil, resstock, inflation_factors)
    • Always use a dataset directory, even if there's only one file
    • Prefer official data product names when they exist (e.g., EIA's "State Energy Data System", Census Bureau's "American Community Survey")
    • If the official name is long, use a clear abbreviated version in lowercase with underscores
  • Filename: Descriptive name for the specific file
    • Do NOT include the org name or dataset name (already in the path)
    • Do include geographic scope, time granularity, or other distinguishing info about the contents of the file
    • Must end with _YYYYMMDD reflecting when the data was downloaded
  • Naming style: Everything lowercase with underscores
    • Good: eia/heating_oil_prices/ri_monthly_20240315.parquet
    • Bad: EIA/Heating-Oil_ri_eia_prices_2024-03-15.parquet
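The rules above can be seen all at once in a small checker (hypothetical, not part of the repo's tooling) that accepts the good example and rejects the bad one:

```python
import re

# Hypothetical checker: lowercase org/dataset segments, then a filename
# ending in _YYYYMMDD.parquet (the download date).
KEY_PATTERN = re.compile(
    r"^s3://data\.sb/"
    r"[a-z0-9_]+/"                  # org, e.g. eia
    r"[a-z0-9_]+/"                  # dataset, e.g. heating_oil_prices
    r"[a-z0-9_]+_\d{8}\.parquet$"   # descriptive stem + download date
)

def follows_convention(key: str) -> bool:
    return KEY_PATTERN.match(key) is not None
```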

Why timestamps matter:

  • Versioning: New snapshots get new dates (e.g., ri_monthly_20240415.parquet)
  • Reproducibility: Old code using ri_monthly_20240315.parquet continues to work
  • Traceability: Know exactly when data was retrieved

Example structure:

s3://data.sb/
├── eia/
│   ├── heating_oil_prices/
│   │   ├── ri_monthly_20240315.parquet
│   │   └── ct_monthly_20240315.parquet
│   └── propane/
│       └── ri_monthly_20240315.parquet
├── nrel/
│   └── resstock/
│       └── 2024_release2_tmy3/...
└── fred/
    └── inflation_factors/
        └── monthly_20240301.parquet

Uploading Data

Preferred method: Use scripted downloads via just commands

Check the lib/ directory for existing data fetch scripts:

  • lib/eia/fetch_delivered_fuels_prices_eia.py
  • lib/eia/fetch_eia_state_profile.py

These scripts should:

  1. Download data from the source
  2. Process and save as Parquet
  3. Upload to S3 with proper naming (including date)
  4. Be runnable via just commands for reproducibility
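A fetch script following those four steps might be skeletoned like this (hypothetical names and source URL throughout; the real scripts in lib/ may be structured differently):

```python
from datetime import date

def build_s3_key(org: str, dataset: str, stem: str, snapshot: date) -> str:
    """Build an S3 key following the naming convention above."""
    return f"{org}/{dataset}/{stem}_{snapshot.strftime('%Y%m%d')}.parquet"

def main() -> None:
    # Sketch only: the source URL and column handling are placeholders.
    import polars as pl
    import boto3

    df = pl.read_csv("https://example.org/heating_oil.csv")  # 1. download from source
    local_path = "ri_monthly.parquet"
    df.write_parquet(local_path)                             # 2. process and save as Parquet
    key = build_s3_key("eia", "heating_oil_prices", "ri_monthly", date.today())
    boto3.client("s3").upload_file(local_path, "data.sb", key)  # 3. upload with dated name

if __name__ == "__main__":
    main()  # 4. in practice, wired up as a `just` recipe for reproducibility
```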

Manual uploads (when necessary):

# Using AWS CLI
aws s3 cp local_file.parquet s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>

# Example: uploading heating oil data for Rhode Island
aws s3 cp ri_monthly_20240315.parquet s3://data.sb/eia/heating_oil/ri_monthly_20240315.parquet

⚠️ Before uploading: Coordinate with the team to ensure:

  • Naming follows conventions
  • Path doesn't conflict with existing data
  • Date reflects actual download/creation date

πŸ” Code Quality & Testing

Pre-commit Hooks

What are pre-commit hooks? Pre-commit hooks are automated scripts that run every time you make a git commit. They catch issues (formatting, linting, syntax errors) before code enters the repository, ensuring consistent code quality across the team. This saves time in code review and prevents broken code from being committed.

Why we use them: By automatically formatting and checking code at commit time, we ensure that all code in the repository meets our quality standards without requiring manual intervention. Everyone's code is formatted consistently, and common errors are caught immediately.

Pre-commit hooks are managed by prek and configured in .pre-commit-config.yaml.

Configured hooks:

  • ruff-check: Lints Python code with auto-fix
  • ruff-format: Formats Python code
  • ty-check: Type checks Python code using ty
  • trailing-whitespace: Removes trailing whitespace
  • end-of-file-fixer: Ensures files end with a newline
  • check-yaml/json/toml: Validates config file syntax
  • check-added-large-files: Prevents commits of files >600KB
  • check-merge-conflict: Detects merge conflict markers
  • check-case-conflict: Prevents case-sensitivity issues

Note on R formatting: We don't yet have the air formatter for R integrated with pre-commit hooks. Instead, use air's editor integration via the Posit.air-vscode extension, which is pre-installed in the dev container. Air will automatically format your R code in the editor.

Hooks run automatically on git commit. To run manually:

prek run -a  # Run all hooks on all files

Run Quality Checks

just check

This command performs (same as CI):

  1. Lock file validation: Ensures uv.lock is consistent with pyproject.toml
  2. Pre-commit hooks: Runs all configured hooks including type checking (see above)

Optional - Check for obsolete dependencies:

just check-deps

Runs deptry to check for unused dependencies in pyproject.toml.

Run Tests

just test

Runs the Python test suite with pytest (tests in tests/ directory), including doctest validation.

Note on R tests: We don't currently have R tests or a testing framework configured for R code. Only Python tests are run by just test and in CI.

🚦 CI/CD Pipeline

The repository uses GitHub Actions to automatically run quality checks and tests on your code. The CI runs the exact same just commands you use locally, in the same devcontainer environment, ensuring perfect consistency.

What Runs and When

The workflow runs two jobs in parallel for speed:

On Pull Requests (opened, updated, or marked ready for review):

  1. quality-checks: Runs just check (lock file validation + pre-commit hooks)
  2. tests: Runs just test (pytest test suite)

On Push to main:

  • Same checks and tests as pull requests (both jobs run in parallel)

Devcontainer in CI/CD

On every commit to main or a pull request, GitHub Actions actually builds the devcontainer image - the same process that happens when you rebuild in VS Code.

To avoid 15-minute builds every time, we use a two-tier caching strategy:

  1. Image caching: If .devcontainer/Dockerfile, pyproject.toml, uv.lock, and DESCRIPTION haven't changed, the devcontainer image build is skipped, and the most recent image (in GHCR) is reused (~30 seconds)
  2. Layer caching: If any of these files changed, the image is rebuilt, but only affected layers rebuild while the others are pulled from GHCR cache (incremental builds, ~2-5 minutes)
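The skip-or-rebuild decision amounts to fingerprinting the environment-defining files. An illustrative sketch only (the actual cache-key logic lives in the GitHub Actions workflow and may differ):

```python
import hashlib
from pathlib import Path

# The files whose contents define the devcontainer environment (listed above)
ENV_FILES = [".devcontainer/Dockerfile", "pyproject.toml", "uv.lock", "DESCRIPTION"]

def env_fingerprint(repo_root: str = ".") -> str:
    """Digest that changes exactly when an environment-defining file changes."""
    digest = hashlib.sha256()
    for name in ENV_FILES:
        path = Path(repo_root) / name
        if path.exists():
            digest.update(name.encode())   # bind each file's content to its name
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]         # short tag, usable as an image label
```

If the fingerprint matches the most recent image's tag, the build can be skipped and the cached image reused.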

Once built, the image is pushed to GitHub Container Registry (GHCR), where it's immediately available as ghcr.io/switchbox-data/reports2:latest.

The quality-checks and tests jobs then pull this prebuilt image and run just check and just test inside it - no rebuilding required.

Devcontainer prebuilds: Double Duty

These CI builds serve a dual purpose:

  1. For CI: Tests run in the exact devcontainer environment
  2. For Devpod: Users can launch devcontainers on AWS (similar to Codespaces) using the prebuilt image - unlike when using devcontainers on your laptop, you don't have to wait for the image to build

Bottom line: Every commit that modifies the devcontainer or dependencies triggers an automatic devcontainer image build. This ensures CI uses the correct environment, and anyone (including Devpod users) can use the fully built devcontainer without building it from scratch.

Why This Matters

  • Perfect Consistency: CI literally runs just check and just test inside the devcontainer (exactly what you run locally), and the devcontainer is rebuilt whenever its definition changes
  • Speed: Devcontainer is only rebuilt when necessary, so quality checks and tests usually run immediately, and they run in parallel, making CI faster
  • Safety net: Even if someone skips pre-commit checks locally, CI catches code quality issues before merge
  • Code quality: Every PR must pass all checks and tests before it can be merged
  • Optional checks: Dependency audits (just check-deps) are not part of CI but available locally for additional validation

You can see the workflow configuration in .github/workflows/data-science-main.yml.

🧹 Cleaning Up

Remove generated files and caches:

just clean

This removes:

  • .pytest_cache
  • .ruff_cache
  • tmp/
  • notebooks/.quarto

📚 Additional Resources

Development Environment

Python Tools

R Tools
