Switchbox is a nonprofit think tank that produces rigorous, accessible data on U.S. state climate policy for advocates, policymakers, and the public. Find out more at www.switch.box
This repository contains Switchbox's reports. We use a modern bilingual stack combining Python and R.
- Overview
- Why Open Source?
- Repo Structure
- Available Commands
- Quick Start
- Creating a New Report
- Development Workflow
- Understanding Quarto Reports
- Publishing Web Reports
- Managing Dependencies
- When to Use Python vs. R
- Working with Data
- Code Quality & Testing
- CI/CD Pipeline
- Cleaning Up
- Additional Resources
- Quarto Reports: All reports are written in Quarto (located in the `reports/` directory)
- Cloud Data: All data used in our reports is stored in an S3 bucket (`s3://data.sb/`) on AWS
- Bilingual Analytics: Reports use both Python (polars) and R (tidyverse)
- Fast Package Management: uv for Python and pak for R
- Reproducible Development Environment: Intended to be used with VS Code Dev Containers
- Task Runner: just for convenient command execution
- Code Quality: Modern linting and formatting with ruff (Python) and air (R), automated via prek pre-commit hooks
Our reports (on our website) are shared under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0), and the code behind them (contained in this repo) is released under the MIT License.
We do this for two reasons:
As a research group, our work is frequently cited in the press and aims to shape the policy conversation. Given this, we believe that the public has a right to see exactly how we produce our findings, going well beyond vague methodology sections.
We believe the clean energy transition will happen faster if energy researchers, particularly those working in the nonprofit sector, embrace open data and open source code so that we can build on each other's work rather than reinventing the wheel.
While the rest of this README.md will walk through the contents of this repo in detail, here is an initial overview:
```
reports2/
├── .devcontainer/           # Dev container configuration
├── docs/                    # Published HTML reports (hosted via GitHub Pages at switchbox-data.github.io/reports2)
├── lib/                     # Shared libraries and utilities
├── reports/                 # Switchbox report projects (source code for reports on our website)
│   ├── ct_hp_rates/         # Individual Quarto projects for each report
│   ├── il_lea/
│   └── ...
├── tests/                   # Python test suite
├── .pre-commit-config.yaml  # Pre-commit hooks configuration
├── pyproject.toml           # Python dependencies and tool configuration
├── uv.lock                  # Locked Python dependencies
└── Justfile                 # Command runner recipes
```
While the rest of this README.md will explain when these commands should be used, here is an initial overview of the "tasks" you can perform in this repo.
Two places to use just: Commands are available in the repository root (for development environment and testing) and in individual report directories (for rendering reports). These commands are defined in Justfiles in each location - see the root Justfile and individual report Justfiles (e.g., reports/ny_aeba_grid/Justfile).
View all available commands in your current location:

```shell
just --list
```

Repository root commands:

- `just install` - Set up development environment
- `just check` - Run quality checks (same as CI: lock file + pre-commit hooks)
- `just check-deps` - Check for obsolete dependencies with deptry
- `just test` - Run test suite (same as CI)
- `just new_report` - Create a new Quarto report in `reports/`
- `just aws` - Authenticate with AWS SSO
- `just devpod` - Launch devcontainer on AWS via DevPod
- `just clean` - Remove generated files and caches

Report directory commands (`reports/<project_code>/`):

- `just render` - Render HTML version of report using Quarto
- `just draft` - Render Word document for content reviews using Quarto
- `just typeset` - Render ICML for InDesign typesetting using Quarto
- `just publish` - Copy rendered HTML version of report from project `docs/` to root `docs/` for web publishing
- `just clean` - Remove generated files and caches
If you want to rerun and edit the code that generates our reports, this section shows you how to get started.
The fastest way to get started is using the provided dev container:
- **Prerequisites:**
- **Launch:**
  - Open this repository in VS Code or Cursor
  - Click "Reopen in Container" when prompted (or use Cmd+Shift+P → "Dev Containers: Reopen in Container")
  - The container will automatically install:
    - Core Tools: Quarto, just, AWS CLI, GitHub CLI (gh), prek
    - Python Stack: Python dependencies via `uv`, ruff for fast linting/formatting, ty for type checking
    - R Stack: R dependencies via `pak`, air for fast linting/formatting
    - Pre-commit Hooks: Configured via `prek`
    - Editor Extensions: Python, R, Quarto, TOML, just syntax, and more
- **Verify Installation:**

  ```shell
  just --list  # See all available commands
  ```
For faster data access (especially with large datasets), you can launch the devcontainer on AWS in region us-west-2, close to where our data is stored in S3.
- **Prerequisites:**
  - Install DevPod
  - Configure the AWS provider in DevPod (detailed setup documentation coming soon)
- **Launch:**

  ```shell
  just aws     # Authenticate with AWS so DevPod can create an EC2 instance
  just devpod  # Launch devcontainer on EC2 instance
  ```
This uses a prebuilt devcontainer image from GitHub Container Registry, so you don't have to wait for the image to build.
If you prefer not to use dev containers:
- **Install Prerequisites:**
  - Python 3.9+
  - R 4.0+
  - `pak` package for R
  - just
- **Run Setup:**

  ```shell
  just install
  ```

  This command:
  - Installs `uv`
  - Uses `uv` to create a virtualenv and install Python packages into it
  - Uses `pak` to install the R packages listed in the `DESCRIPTION` file
  - Installs pre-commit hooks with `prek`
To create a new Quarto report from the Switchbox template:

```shell
just new_report
```

You'll be prompted to enter a report name.

Naming convention: Use state_topic format (e.g., ny_aeba, ri_hp_rates). If we've used a topic before in other states (like hp_rates), reuse it to maintain consistency across reports.

This will:

- Create `reports/<state_topic>/`
- Initialize it with the switchbox-data/report_template
- Set up the necessary Quarto configuration files
We follow a structured development process to ensure all work is tracked, organized, and reviewed. This workflow keeps PRs from growing stale, maintains code quality, and ensures every piece of work is tied to a clear ticket.
Our workflow:
- All work starts with a GitHub Issue - Captured in our Kanban board with clear "What", "Why", "How", and "Deliverables"
- Issues are reviewed before work begins - Ensures alignment before coding starts
- Branches are created from issues - Automatic linking between code and tickets
- PRs are created early - Work-in-progress is visible and reviewable
- Code is reviewed before merging - Quality checks and peer review catch issues
- PRs are short-lived - Merged within a week to keep momentum
This process ensures:
- Every feature/fix has documentation (the issue)
- Code changes are traceable to requirements
- Work is visible and reviewable at all stages
- Code is reviewed early and often
For complete details on our workflow, including how to create issues, request reviews, and merge PRs, see CONTRIBUTING.md.
Quarto is a scientific publishing system that lets you combine narrative text with executable code. This approach, called literate programming, means your analysis and narrative live together in one document.
Simple example:
## Heat Pump Adoption
Total residential heat pumps installed in Rhode Island:
```{r}
total_installs <- sum(resstock_data$heat_pump_count)
# Share of all heating systems (heating_system_count is an illustrative column)
pct_electric <- 100 * total_installs / sum(resstock_data$heating_system_count)
print(total_installs)
```

This represents `{r} pct_electric`% of all heating systems statewide.

The code executes when you render the document, weaving results directly into your narrative. Learn more about Quarto.
We use Quarto in a specific way, defined by our report template. Our reports are based on Quarto's Manuscript project type, which is designed for scientific and technical publications.
After running just new_report, you'll have this structure:
```
reports/your-report/
├── index.qmd         # The actual report (narrative + embedded results)
├── notebooks/
│   └── analysis.qmd  # The underlying analysis (data wrangling + analysis + visualization)
├── _quarto.yml       # Quarto project configuration (set up by template)
├── Justfile          # Commands to render the report
└── docs/             # Generated output (HTML, DOCX, ICML, PDF) - gitignored, do not commit
```
Key files:
- `index.qmd`: Your report's narrative. This is what readers see. Contains text, embedded charts, and inline results.
- `notebooks/analysis.qmd`: Your data analysis. Data loading, cleaning, modeling, and creating visualizations.
  - Prefer a single `analysis.qmd` notebook for simplicity and clarity
  - If you need multiple notebooks, check with the team first to discuss organization
- `_quarto.yml`: Quarto configuration - YAML front matter, format settings, and stylesheet references (all set up by the template)
We keep analysis separate from narrative. Here's how data flows from notebooks/analysis.qmd to index.qmd:
1. Export Variables from Analysis (R)
In notebooks/analysis.qmd, export objects to a local RData file (gitignored, do not commit):
```r
# Do your analysis
total_savings <- calculate_savings(data)
growth_rate <- 0.15

# Export variables for the report
save(total_savings, growth_rate, file = "report_vars.RData")
```

Then load them in index.qmd:

```r
# Load analysis results
load("report_vars.RData")
```

Note: The .RData file is gitignored - only the code that generates it is versioned.
2. Embed Charts from Analysis
For visualizations, use Quarto's embed syntax to pull charts from notebooks/analysis.qmd into index.qmd:
In notebooks/analysis.qmd, create a labeled code chunk:
```{r}
#| label: fig-energy-savings
#| fig-cap: "Annual energy savings by upgrade"
ggplot(data, aes(x = upgrade, y = savings)) +
  geom_col() +
  theme_minimal()
```

In index.qmd, embed it:

```
{{< embed notebooks/analysis.qmd#fig-energy-savings >}}
```

The chart appears in your report without duplicating code!
Our template includes just commands for different outputs:
```shell
just render   # Generate HTML for web publishing (static site)
just draft    # Generate Word doc for content reviews with collaborators
just typeset  # Generate ICML for creating typeset PDFs in InDesign
```

When to use each:

- HTML (`just render`): Publishing our reports as interactive web pages
- DOCX (`just draft`): Sharing drafts for content review and feedback
- ICML (`just typeset`): Professional typesetting in Adobe InDesign for PDF export
The template automatically configures these formats with our stylesheets and branding.
Once your report is ready to be published, you'll want to get it to the web so others can access it. This section walks through how our web hosting works and the steps to publish.
When you run just render in your report directory (e.g., reports/ny_aeba_grid/), Quarto generates a complete static website in the docs/ subdirectory:
```
reports/ny_aeba_grid/
└── docs/             # Generated by Quarto (gitignored in report dirs)
    ├── index.html    # Main HTML file
    ├── img/          # Images
    ├── index_files/  # Supporting files
    ├── notebooks/    # Embedded notebook outputs
    └── site_libs/    # JavaScript, CSS, etc.
```
Why this matters: The entire docs/ directory is self-contained. You can copy it anywhere, open index.html, and the report will work perfectly - all assets are bundled together.
💡 Development Tip: Use this during development! Run just render frequently and open reports/<project_code>/docs/index.html in your browser to see exactly how your report will look when published to the web. This lets you iterate on design and layout before publishing.
We use GitHub Pages to host our reports automatically:
- Any files in the root `docs/` directory on the `main` branch are automatically published to https://switchbox-data.github.io/reports2/
- To publish a report, we copy the rendered report from `reports/<project_code>/docs/` to the root `docs/<project_code>/` directory
- URLs: A report at `docs/ny_aeba_grid/index.html` becomes accessible at:
  - https://switchbox-data.github.io/reports2/ny_aeba_grid/
  - https://www.switch.box/aeba-grid (our web admin embeds it in an iframe - you don't need to worry about this)

See what's published: Check the docs/ directory to see currently published reports.

💡 Key insight: Publishing = rendering your report ➡️ copying to `docs/<project_code>/` ➡️ merging to main
Follow these steps to publish or update a web report:
1. Prepare Your Report
- Finish your work following the development workflow (create issue, work in a branch, etc.)
- Make sure your PR is ready to merge
- Open your devcontainer terminal
2. Navigate to Your Report Directory
```shell
cd reports/<project_code>
```

3. Verify Render Configuration
Check that all notebooks needed for the report are listed in _quarto.yml under project > render:
```yaml
project:
  render:
    - index.qmd
    - notebooks/analysis.qmd
```

Why: Quarto only renders files you explicitly list, ensuring you have control over what gets published.
4. Clear Cache and Render
```shell
# Clear any cached content
just clean

# Render the HTML version
just render
```

This creates fresh output in reports/<project_code>/docs/.
5. Copy to Publishing Directory
```shell
# Copy rendered report to root docs/ directory
just publish
```

What this does: Copies all files from reports/<project_code>/docs/ to docs/<project_code>/. If files already exist at docs/<project_code>/, they're deleted first to ensure a clean publish.
6. Return to Repository Root
```shell
cd ../..
```

7. Stage the Published Files

```shell
git add -f docs/
```

Why -f (force)? The docs/ directory is gitignored in report directories to prevent accidental commits during development. The -f flag overrides this, allowing you to commit to the root docs/ directory intentionally.
8. Commit and Push
```shell
git commit -m "Publish new version of <project_code> report"
git push
```

Important: Pushing to your branch does NOT publish the report yet - it must be merged to main first.
9. Merge to Main
- Go to your PR on GitHub
- Merge it to `main`
10. Verify Deployment
- GitHub automatically triggers a "pages build and deployment" workflow
- Check Actions to see the workflow run
- Once the workflow is green (✅), your report is live at https://switchbox-data.github.io/reports2/<project_code>/
Once you have a final PDF version (typically typeset in InDesign), you can add a download link to the web report:
1. Get the PDF
- Download the final PDF from Google Drive: `Projects > <project_code> > final`
- Ensure it's named `switchbox_<project_code>.pdf` (rename if needed)
2. Add to Report Directory
Place the PDF in your report root:
```
reports/<project_code>/
├── switchbox_<project_code>.pdf  # ← Add here
├── index.qmd
└── ...
```
3. Update YAML Front Matter
Add or uncomment this in index.qmd's YAML header:
```yaml
other-links:
  - text: PDF
    icon: file-earmark-pdf
    href: switchbox_<project_code>.pdf
```

What this does: Adds a PDF download button to your web report's navigation.
4. Re-render and Publish
```shell
# Re-render with PDF link
just render

# Verify the PDF link appears and works locally

# Publish the update
just publish
cd ../..
git add -f docs/
git commit -m "Add PDF link to <project_code> report"
git push
```

Then merge to main following steps 9-10 above.
Adding a Python package
```shell
uv add <package-name>
```

This command:

- Adds the package to `pyproject.toml`
- Updates `uv.lock` with resolved dependencies
- Installs the package in your virtual environment

Example:

```shell
uv add polars            # Add polars as a dependency
uv add --dev pytest-mock # Add as a dev dependency
```

⚠️ Do NOT use `pip install` to add packages. Using `pip install` will install the package locally but will not update `pyproject.toml` or `uv.lock`, meaning others won't get your dependencies. Always use `uv add`.
How Your Package Persists
In dev container:
- When you run `uv add <package-name>`, packages are installed to `/opt/venv/` inside the container
- They stay in the container and are not exported to your local filesystem. So if you restart the container, the package will be gone!
- To make your new package persist, you need to add it to the image itself, by committing `pyproject.toml` and `uv.lock` and pushing to GitHub
  - If you're using devcontainers on your laptop, rebuild the container and your package will be permanently installed within the image
  - If you're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; restart your workspace to use the new image
- Bottom line: Run `uv add`, commit `pyproject.toml` and `uv.lock`, and rebuild the devcontainer to persist packages
On regular laptop:
- When you run `uv add <package-name>`, packages are installed to `.venv/`, which persists in your local workspace
- Packages remain installed between sessions
- No reinstallation needed unless you delete `.venv/` or run `uv sync` after changes
How Others Get Your Package
In dev container:
- You commit both `pyproject.toml` and `uv.lock`, and push to GitHub
- Others pull your changes
- They rebuild their container:
  - If they're using devcontainers on their laptop, when they rebuild their container, all packages in `uv.lock` (including your new one) will be permanently installed into the image
  - If they're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; they just need to restart their workspace to use the new image
On regular laptop:
- You commit both `pyproject.toml` and `uv.lock` to git
- Others pull your changes
- They must manually run `uv sync` to install the new dependency
R dependency management works differently: you manually update a file that lists packages, then install them.
Adding a new R package
- Add it to `DESCRIPTION` in the `Imports` section:

  ```
  Imports: dplyr, ggplot2, arrow
  ```

- Install it by running:

  ```shell
  just install
  ```
How Your Package Persists
In dev container:
- If you install a package directly with `pak::pak("dplyr")`, the package is installed temporarily in the container
- It will be gone when the container restarts!
- If you add it to `DESCRIPTION` and run `just install`, as documented above, the package will also install temporarily
- However, if you then commit `DESCRIPTION` and push to GitHub:
  - If you're using devcontainers on your laptop, rebuild the container and every package in `DESCRIPTION` (including your new one) will be permanently installed within the image
  - If you're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; restart your workspace to use the new image
- Bottom line: Add packages to `DESCRIPTION`, commit it, and rebuild the devcontainer to persist them
On regular laptop:
- Packages are saved to your global R library (typically `~/R/library/`)
- Packages remain installed between sessions
- No reinstallation needed unless you uninstall them or use a different R version
How Others Get Your Package
In dev container:
- You add a package to `DESCRIPTION`, commit it, and push to GitHub
- Others pull your changes
- They rebuild their container:
  - If they're using devcontainers on their laptop, when they rebuild their container, all packages in `DESCRIPTION` (including your new one) will be permanently installed into the image
  - If they're using devcontainers on DevPod, GitHub Actions will automatically rebuild the image with the new package; they just need to restart their workspace to use the new image
On regular laptop:
- You add a package to `DESCRIPTION` and commit it to git
- Others pull your changes
- They manually install dependencies:

  ```shell
  just install
  ```

  Or in an R session:

  ```r
  pak::local_install_deps()  # Installs all dependencies from DESCRIPTION
  ```
Both languages are available, but we have clear preferences based on the type of work:
Use R (with tidyverse) for:
- Data analysis - Exploratory data analysis, statistical analysis
- Data modeling - Statistical models, regression, forecasting
- Data visualization - Creating charts and graphs for reports
- Default choice - Unless there's a specific reason to use Python, prefer R for these tasks
Use Python (with polars) for:

- Data engineering - Scripts that fetch, process, and upload data to S3
- Numerical simulations - Generating synthetic data, Monte Carlo simulations
- Library requirements - When a specific Python library is needed
General guidance:

- R/tidyverse excels at interactive analysis and producing publication-quality visualizations
- Python excels at scripting, automation, and computational tasks
- Our reports are written in Quarto, which works seamlessly with both languages
- You can use both in the same report when needed, but prefer consistency within a single analysis
Data for analyses is stored on S3. Ensure you have:
- AWS credentials configured in `~/.aws/credentials`
- Default region set to `us-west-2`
The dev container automatically mounts your local AWS credentials.
All data lives in S3 - we do not store data files in this git repository. Our primary data bucket is s3://data.sb/, and we also use public open data sources from other S3 buckets.
We prefer Parquet files in our data lake for efficient columnar storage and compression.
Using Python (with polars / arrow)
```python
import polars as pl

# Scan multiple parquet files from S3 (lazy - doesn't load data yet)
lf = pl.scan_parquet("s3://data.sb/eia/heating_oil_prices/*.parquet")

# Apply transformations and aggregations (still lazy)
# By using parquet and arrow with lazy execution, we limit how much data is downloaded
result = (
    lf.filter(pl.col("state") == "RI")
    .group_by("year")
    .agg(pl.col("price").mean().alias("avg_price"))
)

# 💡 TIP: Stay in lazy execution as long as possible - do all filtering, grouping, and aggregations
# before calling collect(). This allows Polars to optimize the query plan and minimize data transfer.

# Collect to download subset of parquet data, perform aggregation, and load into memory
# Result is a Polars DataFrame (columnar, Arrow-based format)
df = result.collect()

# To convert to row-oriented format, use to_pandas() to get a pandas DataFrame
# ⚠️ WARNING: We do NOT use pandas for analysis - only use if a library requires a pandas DataFrame
pandas_df = result.collect().to_pandas()
```

Using R (with arrow / dplyr)

```r
library(arrow)
library(dplyr)

# Scan multiple parquet files from S3 (lazy - doesn't load data yet)
lf <- open_dataset("s3://data.sb/eia/heating_oil_prices/*.parquet")

# Apply transformations and aggregations (still lazy)
# By using parquet and arrow with lazy execution, we limit how much data is downloaded
result <- lf |>
  filter(state == "RI") |>
  group_by(year) |>
  summarize(avg_price = mean(price))

# 💡 TIP: Stay in lazy execution as long as possible - do all filtering, grouping, and aggregations
# before calling compute(). This allows Arrow to optimize the query plan and minimize data transfer.

# Compute to download subset of parquet data, perform aggregation, and load into memory
# Result is an Arrow Table (columnar, Arrow-based format - same as Python polars)
df <- result |> compute()

# Or use collect() to convert to tibble (row-oriented, standard R data frame)
# ⚠️ WARNING: Stay in arrow whenever possible - only use collect() if a library requires tibbles
tibble_df <- result |> collect()
```

Performance Considerations
Initial downloads can be slow depending on:
- File size
- Your internet connection
- Distance from the S3 region (us-west-2)
Options to improve performance:
- Cache locally: Download files once and cache in `data/` (gitignored)
- Run dev containers in the cloud: See Option 2 in Quick Start for launching devcontainers on AWS in the us-west-2 region, same as the data bucket
- Use partitioned datasets: Only read the partitions you need
When reports execute: Data is downloaded from S3 at runtime. The first run may be slower, but subsequent runs can use cached data if you've set up local caching.
Naming and Organization Conventions
Directory structure:
s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>
- Org: Organization producing the data (e.g., `nrel`, `eia`, `fred`)
- Dataset: Name of the dataset (e.g., `heating_oil`, `resstock`, `inflation_factors`)
  - Always use a dataset directory, even if there's only one file
  - Prefer official data product names when they exist (e.g., EIA's "State Energy Data System", Census Bureau's "American Community Survey")
  - If the official name is long, use a clear abbreviated version in lowercase with underscores
- Filename: Descriptive name for the specific file
  - Do NOT include the org name or dataset name (already in the path)
  - Do include geographic scope, time granularity, or other distinguishing info about the contents of the file
  - Must end with `_YYYYMMDD` reflecting when the data was downloaded
- Naming style: Everything lowercase with underscores
  - Good: `eia/heating_oil_prices/ri_monthly_20240315.parquet`
  - Bad: `EIA/Heating-Oil_ri_eia_prices_2024-03-15.parquet`
Why timestamps matter:
- Versioning: New snapshots get new dates (e.g., `ri_monthly_20240415.parquet`)
- Reproducibility: Old code using `ri_monthly_20240315.parquet` continues to work
- Traceability: Know exactly when data was retrieved
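The conventions above can be captured in a small helper. This is an illustrative sketch, not an existing utility in `lib/`; `build_key` and `KEY_PATTERN` are hypothetical names.

```python
import re
from datetime import date

# Encodes the rules above: lowercase_with_underscores org/dataset/filename,
# ending in _YYYYMMDD. Illustrative only - not an existing helper in lib/.
KEY_PATTERN = re.compile(
    r"^s3://data\.sb/"
    r"[a-z0-9_]+/"       # org, e.g. eia
    r"[a-z0-9_]+/"       # dataset, e.g. heating_oil_prices
    r"[a-z0-9_]+"        # descriptive filename
    r"_\d{8}\.parquet$"  # _YYYYMMDD download date
)


def build_key(org: str, dataset: str, name: str, downloaded: date) -> str:
    """Assemble an S3 key that follows the naming conventions, or raise."""
    key = f"s3://data.sb/{org}/{dataset}/{name}_{downloaded:%Y%m%d}.parquet"
    if not KEY_PATTERN.match(key):
        raise ValueError(f"Key violates naming conventions: {key}")
    return key
```

For instance, `build_key("eia", "heating_oil_prices", "ri_monthly", date(2024, 3, 15))` produces the "Good" example path above, while uppercase or hyphenated names are rejected.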
Example structure:
```
s3://data.sb/
├── eia/
│   ├── heating_oil_prices/
│   │   ├── ri_monthly_20240315.parquet
│   │   └── ct_monthly_20240315.parquet
│   └── propane/
│       └── ri_monthly_20240315.parquet
├── nrel/
│   └── resstock/
│       └── 2024_release2_tmy3/...
└── fred/
    └── inflation_factors/
        └── monthly_20240301.parquet
```
Uploading Data
Preferred method: Use scripted downloads via just commands
Check the lib/ directory for existing data fetch scripts:
- `lib/eia/fetch_delivered_fuels_prices_eia.py`
- `lib/eia/fetch_eia_state_profile.py`
These scripts should:
- Download data from the source
- Process and save as Parquet
- Upload to S3 with proper naming (including date)
- Be runnable via
justcommands for reproducibility
Manual uploads (when necessary):
```shell
# Using AWS CLI
aws s3 cp local_file.parquet s3://data.sb/<org>/<dataset>/<filename_YYYYMMDD.parquet>

# Example: uploading heating oil data for Rhode Island
aws s3 cp ri_monthly_20240315.parquet s3://data.sb/eia/heating_oil/ri_monthly_20240315.parquet
```

Before uploading, verify:

- Naming follows conventions
- Path doesn't conflict with existing data
- Date reflects actual download/creation date
What are pre-commit hooks? Pre-commit hooks are automated scripts that run every time you make a git commit. They catch issues (formatting, linting, syntax errors) before code enters the repository, ensuring consistent code quality across the team. This saves time in code review and prevents broken code from being committed.
Why we use them: By automatically formatting and checking code at commit time, we ensure that all code in the repository meets our quality standards without requiring manual intervention. Everyone's code is formatted consistently, and common errors are caught immediately.
Pre-commit hooks are managed by prek and configured in .pre-commit-config.yaml.
Configured hooks:
- ruff-check: Lints Python code with auto-fix
- ruff-format: Formats Python code
- ty-check: Type checks Python code using ty
- trailing-whitespace: Removes trailing whitespace
- end-of-file-fixer: Ensures files end with a newline
- check-yaml/json/toml: Validates config file syntax
- check-added-large-files: Prevents commits of files >600KB
- check-merge-conflict: Detects merge conflict markers
- check-case-conflict: Prevents case-sensitivity issues
Note on R formatting: We don't yet have the air formatter for R integrated with pre-commit hooks. Instead, use air's editor integration via the Posit.air-vscode extension, which is pre-installed in the dev container. Air will automatically format your R code in the editor.
Hooks run automatically on git commit. To run manually:
```shell
prek run -a  # Run all hooks on all files
```

Run all quality checks:

```shell
just check
```

This command performs (same as CI):

- Lock file validation: Ensures `uv.lock` is consistent with `pyproject.toml`
- Pre-commit hooks: Runs all configured hooks including type checking (see above)
Optional - Check for obsolete dependencies:

```shell
just check-deps
```

Runs deptry to check for unused dependencies in `pyproject.toml`.
Run the test suite:

```shell
just test
```

Runs the Python test suite with pytest (tests in tests/ directory), including doctest validation.
Note on R tests: We don't currently have R tests or a testing framework configured for R code. Only Python tests are run by just test and in CI.
The repository uses GitHub Actions to automatically run quality checks and tests on your code. The CI runs the exact same just commands you use locally, in the same devcontainer environment, ensuring perfect consistency.
The workflow runs two jobs in parallel for speed:
On Pull Requests (opened, updated, or marked ready for review):
- quality-checks: Runs `just check` (lock file validation + pre-commit hooks)
- tests: Runs `just test` (pytest test suite)
On Push to main:
- Same checks and tests as pull requests (both jobs run in parallel)
On every commit to main or a pull request, GitHub Actions actually builds the devcontainer image - the same process that happens when you rebuild in VS Code.
To avoid 15-minute builds every time, we use a two-tier caching strategy:
- Image caching: If `.devcontainer/Dockerfile`, `pyproject.toml`, `uv.lock`, and `DESCRIPTION` haven't changed, the devcontainer image build is skipped, and the most recent image (in GHCR) is reused (~30 seconds)
- Layer caching: If any of these files changed, the image is rebuilt, but only affected layers rebuild while the others are pulled from GHCR cache (incremental builds, ~2-5 minutes)
Once built, the image is pushed to GitHub Container Registry (GHCR), where it's immediately available as ghcr.io/switchbox-data/reports2:latest.
The quality-checks and tests jobs then pull this prebuilt image and run just check and just test inside it - no rebuilding required.
These CI builds serve a dual purpose:
- For CI: Tests run in the exact devcontainer environment
- For Devpod: Users can launch devcontainers on AWS (similar to Codespaces) using the prebuilt image - unlike when using devcontainers on your laptop, you don't have to wait for the image to build
Bottom line: Every commit that modifies the devcontainer or dependencies triggers an automatic devcontainer image build. This ensures CI uses the correct environment, and anyone (including Devpod users) can use the fully built devcontainer without building it from scratch.
- Perfect Consistency: CI literally runs `just check` and `just test` inside the devcontainer - exactly what you run locally - and the devcontainer is rebuilt when its definition changes
- Speed: The devcontainer is only rebuilt when necessary, so quality checks and tests usually run immediately, and they run in parallel, making CI faster
- Safety net: Even if someone skips pre-commit checks locally, CI catches code quality issues before merge
- Code quality: Every PR must pass all checks and tests before it can be merged
- Optional checks: Dependency audits (`just check-deps`) are not part of CI but available locally for additional validation
You can see the workflow configuration in .github/workflows/data-science-main.yml.
Remove generated files and caches:
```shell
just clean
```

This removes:

- `.pytest_cache`
- `.ruff_cache`
- `tmp/`
- `notebooks/.quarto`
- Dev Containers Documentation
- GitHub Actions Documentation
- just Documentation
- Quarto Documentation
- prek Documentation - Pre-commit hook manager
- uv Documentation - Package manager
- ruff Documentation - Linter and formatter
- ty Documentation - Type checker
- deptry Documentation - Dependency checker
- Polars Documentation - Fast DataFrame library
- PyArrow Documentation - Python bindings for Apache Arrow
- pak Documentation - Package manager
- air Documentation - Linter and formatter
- dplyr Documentation - Data manipulation grammar
- arrow Documentation - R bindings for Apache Arrow