Thanks for considering a contribution! This repo curates the datasets that CodeCarbon uses to estimate computing CO2 emissions, so corrections to data, new sources, and fixes to the collection pipeline are all in scope.
Requires Python 3.13+ and uv.
# Clone and install (creates .venv automatically)
git clone https://github.com/mlco2/codecarbon-data.git
cd codecarbon-data
uv sync
# Some collectors need credentials — see README "Source credentials"
export ELECTRICITY_MAPS_TOKEN=...
# Some collectors require a browser
uv run python -m playwright install --with-deps

uv run pytest # full suite (mocked HTTP)
uv run pytest -m "not integration" # skip live-network tests
uv run pytest tests/collectors/grid/ # one directory
uv run ruff check src/ tests/
uv run ruff format src/ tests/

The CI workflow runs pytest and ruff check on every push and pull
request; keep both green before requesting review.
# Collect a single domain
uv run codecarbon-data collect grid --log-level DEBUG
# Validate the resulting CSV against datapackage.json
uv run codecarbon-data validate grid
# Inspect output
head data/grid_emissions.csv

The pipeline writes output to data/ and regenerates ATTRIBUTION.md from
the sources that produced records. Don't hand-edit ATTRIBUTION.md; it is
generated by src/licensing.py:generate_attribution.
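To give a feel for what validating a CSV against datapackage.json can involve, here is a minimal sketch of a header check. It assumes the descriptor follows the usual Data Package layout (resources, each with a schema listing fields); the field names, the sample rows, and the header_problems helper are made up for illustration, and the real validate command may check considerably more.

```python
import csv
import io
import json

# Hypothetical descriptor fragment; field names are illustrative only.
DESCRIPTOR = json.loads("""
{
  "resources": [
    {
      "name": "grid",
      "path": "data/grid_emissions.csv",
      "schema": {"fields": [
        {"name": "region", "type": "string"},
        {"name": "year", "type": "integer"},
        {"name": "gco2_per_kwh", "type": "number"}
      ]}
    }
  ]
}
""")


def header_problems(csv_text: str, resource: dict) -> list[str]:
    """Return readable mismatches between a CSV header and the schema fields."""
    expected = [field["name"] for field in resource["schema"]["fields"]]
    actual = next(csv.reader(io.StringIO(csv_text)), [])
    problems = []
    for name in expected:
        if name not in actual:
            problems.append(f"missing column: {name}")
    for name in actual:
        if name not in expected:
            problems.append(f"unexpected column: {name}")
    return problems


# Made-up sample rows, not real emissions data.
good = "region,year,gco2_per_kwh\nFR,2023,56.0\n"
bad = "region,year\nFR,2023\n"

print(header_problems(good, DESCRIPTOR["resources"][0]))  # []
print(header_problems(bad, DESCRIPTOR["resources"][0]))   # ['missing column: gco2_per_kwh']
```

Running a check like this locally before pushing is only a rough approximation of the project's validation; the CLI command remains the source of truth.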
See ADDING_SOURCES.md for the full walkthrough. In short:
- Register the source in sources.yaml (with license metadata).
- Implement BaseCollector in src/collectors/<domain>/<source>.py.
- Add unit tests in tests/collectors/<domain>/test_<source>.py with the HTTP layer mocked (no fixture files; inline sample data).
- Wire the collector into _get_domain_collectors in src/pipeline.py.
- If the new source overlaps an existing one for a domain, update the merge policy in src/collectors/<domain>/merge.py.
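The testing step above (HTTP layer mocked, inline sample data) might look roughly like the sketch below. Everything in it is a stand-in: GridRecord, ExampleGridCollector, the collect method, and the URL are hypothetical names chosen for illustration, and a real collector would subclass the project's BaseCollector, whose interface is described in ADDING_SOURCES.md rather than here. A dataclass is used instead of a Pydantic model only to keep the example dependency-free.

```python
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass(frozen=True)
class GridRecord:
    """Hypothetical stand-in for a domain record model."""
    region: str
    year: int
    gco2_per_kwh: float


class ExampleGridCollector:
    """Hypothetical collector; a real one would subclass BaseCollector."""

    URL = "https://example.invalid/grid.json"  # placeholder, not a real endpoint

    def __init__(self, client) -> None:
        # The HTTP client is injected so tests can swap in a mock.
        self._client = client

    def collect(self) -> list[GridRecord]:
        payload = self._client.get(self.URL).json()
        return [
            GridRecord(row["region"], int(row["year"]), float(row["value"]))
            for row in payload["rows"]
        ]


def test_collect_parses_inline_sample() -> None:
    # Inline, made-up sample data instead of a fixture file.
    sample = {"rows": [{"region": "FR", "year": 2023, "value": "56.0"}]}
    client = MagicMock()
    client.get.return_value.json.return_value = sample

    records = ExampleGridCollector(client).collect()

    assert records == [GridRecord("FR", 2023, 56.0)]
    client.get.assert_called_once_with(ExampleGridCollector.URL)


# pytest would discover the test by name; calling it here keeps the sketch runnable.
test_collect_parses_inline_sample()
```

Injecting the client is one way to keep the network out of unit tests; patching the client factory with unittest.mock.patch would work equally well if the collector builds its own client.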
- Type hints on public functions and Pydantic fields.
- Don't return pd.DataFrame from collectors; return list[DomainRecord]. The CSV boundary is a single function (records_to_dataframe in src/models.py).
- Sort records deterministically before returning so output diffs are reviewable.
- Raise CollectorError on collection failures; let the pipeline log and continue. Don't print, don't sys.exit.
- Use src.http.create_client() for all HTTP; it provides retries, rate limiting, and the project user agent.
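Two of the conventions above, deterministic ordering and raising CollectorError instead of exiting, can be sketched as follows. The classes and functions here are local stand-ins with hypothetical names (GpuRecord, sort_records, run_all) and made-up values; the project's own CollectorError and pipeline loop may be shaped differently.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("pipeline-sketch")


class CollectorError(Exception):
    """Local stand-in for the project's CollectorError."""


@dataclass(frozen=True)
class GpuRecord:
    """Hypothetical record type with placeholder fields."""
    vendor: str
    model: str
    tdp_watts: int


def sort_records(records: list[GpuRecord]) -> list[GpuRecord]:
    # Sorting on a stable key keeps CSV diffs limited to real data changes.
    return sorted(records, key=lambda r: (r.vendor, r.model))


def failing_collector() -> list[GpuRecord]:
    # Signal failure with an exception; no print, no sys.exit.
    raise CollectorError("upstream returned an unexpected payload")


def working_collector() -> list[GpuRecord]:
    return sort_records([
        GpuRecord("vendor-b", "model-2", 300),
        GpuRecord("vendor-a", "model-9", 150),
        GpuRecord("vendor-a", "model-1", 75),
    ])


def run_all(collectors) -> list[GpuRecord]:
    """Log each collector failure and carry on with the remaining collectors."""
    collected: list[GpuRecord] = []
    for collect in collectors:
        try:
            collected.extend(collect())
        except CollectorError as exc:
            logger.warning("collector %s failed: %s", collect.__name__, exc)
    return collected


records = run_all([failing_collector, working_collector])
print([r.model for r in records])  # ['model-1', 'model-9', 'model-2']
```

Because the failure surfaces as an exception, the caller decides whether one broken source should abort the whole run; in this sketch it does not.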
- One PR per source / fix / feature when possible.
- For data-only changes (e.g. a corrected GPU TDP), describe the upstream evidence in the PR body.
- The monthly automated auto/data-update-* PR is generated by the Update datasets workflow and should be merged once CI is green; don't open competing data-refresh PRs against it.
Bugs and data corrections go in GitHub Issues. For data corrections, include a link to upstream evidence (vendor spec sheet, official report) — the project favors traceable factual extraction over editorial judgment.