Contributing

Thanks for considering a contribution! This repo curates the datasets that CodeCarbon uses to estimate computing CO2 emissions, so corrections to data, new sources, and fixes to the collection pipeline are all in scope.

Development setup

Requires Python 3.13+ and uv.

# Clone and install (creates .venv automatically)
git clone https://github.com/mlco2/codecarbon-data.git
cd codecarbon-data
uv sync

# Some collectors need credentials — see README "Source credentials"
export ELECTRICITY_MAPS_TOKEN=...

# Some collectors require a browser
uv run python -m playwright install --with-deps

Running tests and lint

uv run pytest                                    # full suite (mocked HTTP)
uv run pytest -m "not integration"               # skip live-network tests
uv run pytest tests/collectors/grid/             # one directory
uv run ruff check src/ tests/
uv run ruff format src/ tests/

The CI workflow runs pytest and ruff check on every push and pull request — keep both green before requesting review.

Running the pipeline locally

# Collect a single domain
uv run codecarbon-data collect grid --log-level DEBUG

# Validate the resulting CSV against datapackage.json
uv run codecarbon-data validate grid

# Inspect output
head data/grid_emissions.csv

The pipeline writes output to data/ and regenerates ATTRIBUTION.md from the sources that produced records. Don't hand-edit ATTRIBUTION.md: it is generated by generate_attribution in src/licensing.py, so manual changes are lost the next time the pipeline runs.
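To illustrate why hand edits to a generated attribution file do not survive, here is a minimal sketch of the idea. It is not the project's generate_attribution: the function name render_attribution, the metadata keys, and the output layout are all assumptions made for illustration.

```python
# Hypothetical sketch only; the real logic lives in src/licensing.py and
# may use different inputs, fields, and formatting.
def render_attribution(sources: dict[str, dict[str, str]], used: set[str]) -> str:
    """Build attribution text from the sources that produced records."""
    lines = ["Attribution", ""]
    # Sorting keeps the output stable, so regenerated files diff cleanly.
    for source_id in sorted(used):
        meta = sources[source_id]
        lines.append(f"- {meta['name']} ({meta['license']})")
    return "\n".join(lines) + "\n"


# Assumed registry shape, loosely modelled on "license metadata" in sources.yaml.
registry = {
    "alpha": {"name": "Alpha Grid Data", "license": "CC-BY-4.0"},
    "beta": {"name": "Beta Hardware Specs", "license": "ODbL-1.0"},
}

# Only "beta" produced records in this run, so only it is listed.
print(render_attribution(registry, {"beta"}))
```

Because the file is rebuilt from scratch on each run, the durable way to change it is to change the source metadata it is derived from.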

Adding a new data source

See ADDING_SOURCES.md for the full walkthrough. In short:

  1. Register the source in sources.yaml (with license metadata).
  2. Implement BaseCollector in src/collectors/<domain>/<source>.py.
  3. Add unit tests in tests/collectors/<domain>/test_<source>.py with the HTTP layer mocked. Use inline sample data rather than fixture files.
  4. Wire the collector into _get_domain_collectors in src/pipeline.py.
  5. If the new source overlaps an existing one for a domain, update the merge policy in src/collectors/<domain>/merge.py.
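Steps 2 and 3 can be sketched roughly as follows. This is a self-contained illustration, not the project's code: BaseCollector and DomainRecord are replaced by stand-ins, and the collect method, the record fields, the ExampleGridCollector class, and the URL are all hypothetical. See ADDING_SOURCES.md for the real interfaces.

```python
from dataclasses import dataclass
from unittest.mock import Mock


# Stand-ins for the project's real types; actual fields and method
# names may differ (assumptions for illustration only).
@dataclass(frozen=True)
class DomainRecord:
    region: str
    year: int
    value: float


class BaseCollector:
    def collect(self) -> list[DomainRecord]:
        raise NotImplementedError


class ExampleGridCollector(BaseCollector):
    """Hypothetical collector for an imaginary JSON endpoint."""

    URL = "https://example.invalid/grid.json"  # placeholder, not a real source

    def __init__(self, client) -> None:
        # The HTTP client is injected so tests can pass a mock.
        self._client = client

    def collect(self) -> list[DomainRecord]:
        payload = self._client.get(self.URL).json()
        return [
            DomainRecord(row["region"], int(row["year"]), float(row["value"]))
            for row in payload["rows"]
        ]


def test_example_grid_collector() -> None:
    # Inline sample data instead of a fixture file; the HTTP layer is a Mock.
    client = Mock()
    client.get.return_value.json.return_value = {
        "rows": [{"region": "FR", "year": 2023, "value": "56.0"}]
    }
    records = ExampleGridCollector(client).collect()
    assert records == [DomainRecord("FR", 2023, 56.0)]
    client.get.assert_called_once_with(ExampleGridCollector.URL)
```

Injecting the client keeps the test free of network access and makes the sample payload visible next to the assertion that depends on it.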

Code style

  • Type hints on public functions and Pydantic fields.
  • Don't return pd.DataFrame from collectors — return list[DomainRecord]. The CSV boundary is a single function (records_to_dataframe in src/models.py).
  • Sort records deterministically before returning so output diffs are reviewable.
  • Raise CollectorError on collection failures; let the pipeline log and continue. Don't print, don't sys.exit.
  • Use src.http.create_client() for all HTTP — it provides retries, rate limiting, and the project user agent.
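A few of these rules can be shown together in a short sketch. The types below are stand-ins defined locally so the example runs on its own; the record fields, the parse_rows helper, and the sort key are assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass


class CollectorError(Exception):
    """Stand-in for the project's exception of the same name."""


@dataclass(frozen=True)
class DomainRecord:
    # Hypothetical fields; the real model is defined in src/models.py.
    region: str
    year: int
    value: float


def parse_rows(rows: list[dict]) -> list[DomainRecord]:
    """Convert raw upstream rows to records, sorted by a stable key."""
    try:
        records = [
            DomainRecord(str(r["region"]), int(r["year"]), float(r["value"]))
            for r in rows
        ]
    except (KeyError, TypeError, ValueError) as exc:
        # Raise instead of printing or exiting; the caller decides what to do.
        raise CollectorError(f"malformed upstream row: {exc!r}") from exc
    # A deterministic order keeps CSV diffs small and reviewable.
    return sorted(records, key=lambda rec: (rec.region, rec.year))
```

Returning typed records and leaving DataFrame conversion to a single function means each collector can be tested without pandas, and the column layout is decided in one place.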

Pull requests

  • One PR per source / fix / feature when possible.
  • For data-only changes (e.g. a corrected GPU TDP), describe the upstream evidence in the PR body.
  • The monthly auto/data-update-* PR is generated by the Update datasets workflow and should be merged once CI is green. Don't open competing data-refresh PRs.

Reporting issues

Bugs and data corrections go in GitHub Issues. For data corrections, include a link to upstream evidence (vendor spec sheet, official report) — the project favors traceable factual extraction over editorial judgment.