ktalpay/CarbonOps-Parser

CarbonOps-Parser

Scheduled carbon factor ingestion and parsing reference project with Python and .NET implementation options.

CarbonOps-Parser is a standalone public technical project for scheduled carbon factor source ingestion and parsing. It checks selected public emission factor sources, detects source version or hash changes, archives raw source files, parses source-specific structures, validates parsed records, and stores ingestion metadata and source-specific records in PostgreSQL.

The project is independent from carbonops-assistant. It is not a continuation, module, plugin, or dependency of that project.

Current Status

CarbonOps-Parser is in early Phase 1. The repository currently emphasizes project documentation, architecture, schema contract notes, source support planning, and public contribution structure before parser implementation begins.

Implementation work is planned for two independent paths:

  • Python in src/python
  • .NET in src/dotnet

Users who clone or fork the repository should be able to choose either implementation path.

Phase 1 Scope

Phase 1 focuses on scheduled ingestion and parsing for:

  • GHG Protocol
  • DEFRA/DESNZ
  • IPCC EFDB

The intended Phase 1 workflow is:

  1. Read configuration.
  2. Validate the database provider.
  3. Connect to PostgreSQL.
  4. Check whether required tables exist.
  5. Create missing tables if needed.
  6. Initialize source schedules.
  7. Check source version and file hash.
  8. Download a source document when a new version or hash is detected.
  9. Archive the raw source file.
  10. Parse source-specific structures.
  11. Validate parsed records.
  12. Persist shared ingestion metadata and source-specific records.
  13. Store import summaries and validation issues.
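Steps 7 to 9 (hash check, conditional download, raw archive) can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: sha256_of_file, has_changed, and archive_raw_file are hypothetical names, and the hash-prefixed archive folder is an assumed layout.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(current_hash: str, last_known_hash: Optional[str]) -> bool:
    """Treat a source as changed when no hash is recorded or the hash differs."""
    return last_known_hash is None or current_hash != last_known_hash


def archive_raw_file(source_file: Path, archive_root: Path, file_hash: str) -> Path:
    """Copy the raw file into a folder named after a hash prefix (assumed layout)."""
    target_dir = archive_root / file_hash[:12]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)
    return target


# Usage with a throwaway file; no network access or database is involved.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "source.csv"
    downloaded.write_bytes(b"sample source content")
    file_hash = sha256_of_file(downloaded)
    if has_changed(file_hash, last_known_hash=None):
        archived_path = archive_raw_file(downloaded, Path(tmp) / "archive", file_hash)
        print(f"archived to {archived_path.name} under {archived_path.parent.name}")
```

Keying the archive folder on the content hash means a re-run against an unchanged file lands in the same folder, which keeps the archive step idempotent in this sketch.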

Architecture At A Glance

source schedule
  -> version/hash check
  -> download when changed
  -> raw file archive
  -> source-specific parser
  -> validation
  -> PostgreSQL persistence
  -> import summary and validation issues

Phase 1 uses shared ingestion metadata tables plus source-specific master/detail tables. It does not force GHG Protocol, DEFRA/DESNZ, and IPCC EFDB into one canonical factor table. A normalized or search-oriented projection may be considered in a later phase.

Implementation Options

Python

The Python implementation is planned first because it is practical for source discovery, spreadsheet inspection, parser mapping, validation, and data engineering workflows.

The initial Python source adapter contracts and in-memory registry live under src/carbonfactor_parser/source_adapters.

See src/python/README.md.

.NET

The .NET implementation is planned as an independent Worker Service path that follows the same conceptual workflow with .NET-oriented application structure.

See src/dotnet/README.md.

Install And Local Dry-Run Quickstart

From a fresh checkout or local working copy:

git clone <REPOSITORY_URL> CarbonOps-Parser
cd CarbonOps-Parser
python -m pip install -e .

Run the test suite if you want a quick local smoke check:

python -m pytest

Run the checked-in DEFRA/DESNZ fixture through the local dry-run CLI:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv

Expected summary:

status=success
parsed_record_count=2
normalization_record_count=2
persistence_input_record_count=2
ddl_preview_present=True
issue_count=0
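
To turn the expected summary into a scripted smoke check, the key=value lines can be parsed and compared with the values listed above. This is a minimal sketch: parse_summary and EXPECTED_SUMMARY are hypothetical names that are not part of the project, and it assumes the text output carries one key=value pair per line as shown.

```python
def parse_summary(text: str) -> dict:
    """Parse key=value lines into a dict, skipping blank or malformed lines."""
    fields = {}
    for line in text.splitlines():
        # Split on the first "=" only, so values may themselves contain "=".
        key, separator, value = line.strip().partition("=")
        if separator and key:
            fields[key] = value
    return fields


# Values copied from the expected summary in this README.
EXPECTED_SUMMARY = {
    "status": "success",
    "parsed_record_count": "2",
    "normalization_record_count": "2",
    "persistence_input_record_count": "2",
    "ddl_preview_present": "True",
    "issue_count": "0",
}

# In practice this text would be the captured stdout of the dry-run command.
sample_output = "\n".join(f"{key}={value}" for key, value in EXPECTED_SUMMARY.items())

parsed = parse_summary(sample_output)
mismatches = {
    key: (expected, parsed.get(key))
    for key, expected in EXPECTED_SUMMARY.items()
    if parsed.get(key) != expected
}
print("summary matches" if not mismatches else f"mismatches: {mismatches}")
```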

Run the JSON variant:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv \
  --json

Key output fields:

  • status: dry-run outcome such as success, failed, unsupported, or no_records
  • parsed_record_count: records parsed by the minimal local DEFRA/DESNZ fixture parser
  • normalization_record_count: records produced by the minimal fixture normalization mapper
  • persistence_input_record_count: records prepared as PersistenceInput
  • ddl_preview_present: whether review-only PostgreSQL DDL preview text is attached
  • issues: structured local loader, parser, normalization, or persistence-input issues

Optionally include PostgreSQL insert preview data in text output:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv \
  --include-postgresql-preview

Trimmed expected preview lines:

postgresql_preview_included=True
postgresql_preview_status=ready
postgresql_preview_only=True
postgresql_preview_sql_execution=False
postgresql_preview_database_connection=False
postgresql_preview_target_table=normalized_records
postgresql_preview_record_count=2
postgresql_preview_sql=INSERT INTO normalized_records (source_family, source_id, record_id, record_index, row_number, normalized_fields, source_reference, source_artifact_reference, source_checksum_sha256, parser_metadata, normalization_metadata, created_at, updated_at) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
postgresql_preview_issue_count=0

Run the JSON PostgreSQL preview variant:

carbonops-parser local-dry-run \
  --local-path examples/fixtures/defra_desnz_minimal.csv \
  --source-family defra_desnz \
  --source-id defra-desnz-minimal-fixture \
  --content-type text/csv \
  --format-hint csv \
  --json \
  --include-postgresql-preview

Trimmed expected JSON preview section:

{
  "postgresql_persistence_preview": {
    "included": true,
    "preview_only": true,
    "sql_execution": false,
    "database_connection": false,
    "status": "ready",
    "target_table": "normalized_records",
    "record_count": 2,
    "ordered_columns": [
      "source_family",
      "source_id",
      "record_id",
      "record_index",
      "row_number",
      "normalized_fields",
      "source_reference",
      "source_artifact_reference",
      "source_checksum_sha256",
      "parser_metadata",
      "normalization_metadata",
      "created_at",
      "updated_at"
    ],
    "idempotency_key_fields": [
      "source_family",
      "source_id",
      "record_id",
      "source_artifact_reference",
      "source_checksum_sha256"
    ],
    "issues": []
  }
}

The postgresql_persistence_preview section is preview-only. It includes the target table, ordered columns, parameter rows, record count, SQL text with placeholders, and idempotency metadata, but it does not execute SQL or persist records. No PostgreSQL server, database configuration, or credentials are required.
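
The relationship between the ordered columns, the placeholder SQL text, and the idempotency key fields can be illustrated without a database. The sketch below is not the project's SQL builder: build_insert_sql and idempotency_key are hypothetical helpers, the record values are made up, and the column lists are copied from the preview output above.

```python
ORDERED_COLUMNS = [
    "source_family", "source_id", "record_id", "record_index", "row_number",
    "normalized_fields", "source_reference", "source_artifact_reference",
    "source_checksum_sha256", "parser_metadata", "normalization_metadata",
    "created_at", "updated_at",
]

IDEMPOTENCY_KEY_FIELDS = [
    "source_family", "source_id", "record_id",
    "source_artifact_reference", "source_checksum_sha256",
]


def build_insert_sql(table: str, columns: list) -> str:
    """Build INSERT text with one %s placeholder per column.

    Table and column names come from trusted constants here; row values
    are never interpolated and would be passed separately as parameters.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


def idempotency_key(record: dict, key_fields: list) -> tuple:
    """Project a record onto its key fields, preserving field order."""
    return tuple(record[field] for field in key_fields)


sql_text = build_insert_sql("normalized_records", ORDERED_COLUMNS)

# A made-up record: every column is filled with a placeholder-style string.
example_record = {column: f"example-{column}" for column in ORDERED_COLUMNS}
parameter_row = tuple(example_record[column] for column in ORDERED_COLUMNS)
key = idempotency_key(example_record, IDEMPOTENCY_KEY_FIELDS)

print(sql_text)
print(len(parameter_row), len(key))
```

Two records that agree on every idempotency key field yield the same tuple, which is what lets a later persistence step recognize a repeated import of the same artifact.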

This quickstart is local dry-run only. It does not connect to PostgreSQL, write records, execute SQL, run migrations, perform network calls, trigger source acquisition, load config files, or require credentials. It does not make production DEFRA/DESNZ correctness claims.

For boundary details, see Local Dry-Run CLI Boundary, Local File Normalized Persistence Dry-Run Boundary, PostgreSQL Persistence Preview Boundary, and Local Dry-Run Troubleshooting.

Developer Tests

Run the lightweight Python test suite from the repository root:

python -m pytest

Pytest configuration is kept in pyproject.toml, including the src package import path used by the tests.

Public API Examples

The carbonfactor_parser.source_adapters package exposes source adapter contracts and lightweight helpers for tests, prototypes, and implementation slices.

Hash source content without reading or downloading files:

from carbonfactor_parser.source_adapters import (
    sha256_hex_from_bytes,
    sha256_hex_from_text,
)

content_hash = sha256_hex_from_bytes(b"sample source content")
note_hash = sha256_hex_from_text("sample metadata note")

Create and validate metadata for an existing local file:

from pathlib import Path

from carbonfactor_parser.source_adapters import (
    SourceFamily,
    build_source_document_from_file,
    validate_source_document_metadata,
)

document = build_source_document_from_file(
    source_family=SourceFamily.DEFRA_DESNZ,
    source_name="Example local factor file",
    file_path=Path("data/raw/example/source.csv"),
)

metadata_issues = validate_source_document_metadata(document)

Create and validate an ingestion summary contract:

from carbonfactor_parser.source_adapters import (
    SourceFamily,
    create_ingestion_run_summary,
    validate_ingestion_run_summary,
)

summary = create_ingestion_run_summary(
    ingestion_id="example-run-001",
    source_family=SourceFamily.DEFRA_DESNZ,
    source_name="Example local factor file",
)

summary_issues = validate_ingestion_run_summary(summary)

Use the artificial-only source acquisition validation pipeline with in-memory metadata:

from carbonfactor_parser import (
    create_artificial_source_acquisition_metadata,
    validate_and_summarize_artificial_source_acquisition_metadata,
)

metadata = create_artificial_source_acquisition_metadata(
    source_family="artificial_source_acquisition",
    logical_source_name="artificial-in-memory-source",
    declared_content_type="text/csv",
    checksum_sha256="a" * 64,
    acquired_at_label="static-artificial-acquisition-label",
)

pipeline_result = validate_and_summarize_artificial_source_acquisition_metadata(
    metadata,
)
issue_count = pipeline_result.summary.total_issue_count

This pipeline is limited to artificial metadata shape checks and deterministic summaries. It does not acquire real sources, read files, validate real source URLs, run parsers or normalization, or check factor correctness, and it makes no compliance, legal, or carbon accounting correctness claims. See docs/artificial-source-acquisition-validation-pipeline.md, docs/artificial-source-acquisition-module-recap.md, and examples/example_artificial_source_acquisition_validation_pipeline.py.

Source acquisition CLI quickstart

Use the carbonops-source-acquisition CLI for local source descriptor checks and acquisition flow previews.

  • Default run mode is noop and offline.
  • HTTP mode is opt-in with --client http.
  • validate checks local descriptor metadata only; it does not verify live URLs.
  • run --dry-run plans targets only and does not acquire content or write files/manifests.
  • Parser execution and database persistence are outside this CLI boundary at this phase.

carbonops-source-acquisition validate
carbonops-source-acquisition list
carbonops-source-acquisition list --source-id defra_desnz
carbonops-source-acquisition run --dry-run --base-directory ./data/source-acquisition
carbonops-source-acquisition run --output-format json
carbonops-source-acquisition run --client http --source-id ghg_protocol
carbonops-source-acquisition run --client http --source-id ghg_protocol --persist-content --base-directory ./data/source-acquisition

For boundary details, see:

See examples/example_acquisition_artifact_parser_input_mapping.py for a deterministic in-memory example of mapping acquisition artifact metadata into a future parser input boundary without executing a parser.

The parser, normalization, persistence, and pipeline packages expose the following contracts and helpers:

  • Parser package: ParserInputContract, create_parser_input_contract(), validate_parser_input_contract(), ParserFileContentInput, local parser file content loading helpers, parser file content validation helpers, parse_defra_desnz_file_content(), raw parsed record payload contracts, the ParserAdapter protocol, NoopParserAdapter, ArtificialParserAdapter, DefraDesnzParserAdapter, parser adapter registry helpers, parser execution planning and runner helpers, and parser execution result contracts for future parser adapter input handoff.
  • Normalization package: parser execution handoff helpers, normalization input helpers for successful parser results with raw payloads, and a minimal DEFRA/DESNZ fixture normalization mapper.
  • Persistence package: normalized result persistence input contracts, a logical PostgreSQL schema descriptor, a review-only DDL preview helper, a deterministic insert SQL builder, PostgreSQL persistence preview helpers, repository protocol/result contracts, an explicit caller-provided PostgreSQL options contract, a default-disabled PostgreSQL integration test boundary, and a PostgreSQL repository skeleton that returns unsupported results without database runtime behavior.
  • Pipeline package: a local DEFRA/DESNZ fixture dry-run helper that composes those boundaries to produce PersistenceInput plus DDL preview metadata without DB or network behavior.

These contracts keep acquisition metadata, already-loaded content, raw parser output, parser output metadata, normalization input, normalization handoff metadata, persistence input metadata, schema metadata, repository options metadata, integration test metadata, preview metadata, and repository result metadata separate; they do not include database connection behavior or full source-specific correctness claims.

Source Support

Each Phase 1 source family will have its own schedule, source version/hash check, parser, validation rules, archive layout, and source-specific tables.

| Source family | Phase 1 role | Table group |
|---|---|---|
| GHG Protocol | Source-specific parser and workbook/tool mapping | ghg_* |
| DEFRA/DESNZ | First planned ingestion slice after discovery | defra_* |
| IPCC EFDB | Heterogeneous source discovery and parser mapping | ipcc_* |

See docs/source-support.md and docs/source-discovery.md.

Configuration Summary

The conceptual configuration model includes:

  • Database provider and connection settings.
  • Raw archive path.
  • Source-specific enabled flags.
  • Source-specific schedules with day, week, month, year, time, and timezone support.

Phase 1 implements only postgres as the database provider. mysql and mssql are recognized as conceptual provider names but are not implemented in Phase 1.

See docs/configuration-model.md.

The shared conceptual example lives at config/carbonops.config.example.yaml.
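
The provider rule in this section (postgres implemented, mysql and mssql recognized only) can be expressed as a small validation step. This is a sketch under stated assumptions: ProviderStatus, classify_provider, and require_implemented_provider are hypothetical names, and case-insensitive matching is an assumption rather than documented behavior.

```python
from enum import Enum


class ProviderStatus(Enum):
    IMPLEMENTED = "implemented"
    RECOGNIZED_NOT_IMPLEMENTED = "recognized_not_implemented"
    UNKNOWN = "unknown"


# Provider names taken from this README; only postgres is implemented in Phase 1.
IMPLEMENTED_PROVIDERS = frozenset({"postgres"})
RECOGNIZED_PROVIDERS = frozenset({"postgres", "mysql", "mssql"})


def classify_provider(name: str) -> ProviderStatus:
    """Classify a configured provider name (case-insensitive by assumption)."""
    normalized = name.strip().lower()
    if normalized in IMPLEMENTED_PROVIDERS:
        return ProviderStatus.IMPLEMENTED
    if normalized in RECOGNIZED_PROVIDERS:
        return ProviderStatus.RECOGNIZED_NOT_IMPLEMENTED
    return ProviderStatus.UNKNOWN


def require_implemented_provider(name: str) -> str:
    """Return the normalized provider name or raise ValueError with a reason."""
    status = classify_provider(name)
    if status is ProviderStatus.IMPLEMENTED:
        return name.strip().lower()
    if status is ProviderStatus.RECOGNIZED_NOT_IMPLEMENTED:
        raise ValueError(f"provider {name!r} is recognized but not implemented in Phase 1")
    raise ValueError(f"unknown database provider {name!r}")


for candidate in ("postgres", "mysql", "sqlite"):
    print(candidate, classify_provider(candidate).value)
```

Separating "recognized but not implemented" from "unknown" lets a startup check give a more useful error message for a mistyped name than for a planned provider.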

Database Model Summary

PostgreSQL is the Phase 1 persistence target. The model includes:

  • Shared ingestion metadata tables: carbon_sources, carbon_source_versions, carbon_import_runs, carbon_raw_files, carbon_validation_issues, and carbon_job_locks.
  • DEFRA/DESNZ tables: defra_categories, defra_subcategories, defra_factor_sets, and defra_factor_values.
  • GHG Protocol tables: ghg_tools, ghg_factor_sheets, ghg_factor_groups, and ghg_factor_values.
  • IPCC EFDB tables: ipcc_sectors, ipcc_categories, ipcc_references, ipcc_factor_records, and ipcc_factor_values.

See docs/database-model.md, docs/database-startup.md, and database/postgres/README.md.
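
The "check whether required tables exist" step from the Phase 1 workflow reduces to a set difference against the table names listed above. The sketch below is illustrative only: REQUIRED_TABLES and find_missing_tables are hypothetical names, the group labels are invented for this example, and the existing table names are supplied in memory rather than read from a database.

```python
# Table names copied from the model summary in this README; group labels are illustrative.
REQUIRED_TABLES = {
    "shared": [
        "carbon_sources", "carbon_source_versions", "carbon_import_runs",
        "carbon_raw_files", "carbon_validation_issues", "carbon_job_locks",
    ],
    "defra_desnz": [
        "defra_categories", "defra_subcategories",
        "defra_factor_sets", "defra_factor_values",
    ],
    "ghg_protocol": [
        "ghg_tools", "ghg_factor_sheets", "ghg_factor_groups", "ghg_factor_values",
    ],
    "ipcc_efdb": [
        "ipcc_sectors", "ipcc_categories", "ipcc_references",
        "ipcc_factor_records", "ipcc_factor_values",
    ],
}


def find_missing_tables(existing_tables) -> dict:
    """Return the missing table names per group, omitting complete groups."""
    existing = set(existing_tables)
    missing = {}
    for group, tables in REQUIRED_TABLES.items():
        absent = [table for table in tables if table not in existing]
        if absent:
            missing[group] = absent
    return missing


# Example: only the shared metadata tables have been created so far.
already_created = list(REQUIRED_TABLES["shared"])
for group, tables in find_missing_tables(already_created).items():
    print(group, "->", ", ".join(tables))
```

In a real startup check the existing names would come from the database catalog, for example a query against information_schema.tables in PostgreSQL.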

Documentation Map

Roadmap Summary

Near-term work moves from documentation polish to schema scripts, Python source discovery, PostgreSQL startup checks, raw archive handling, and the first DEFRA/DESNZ ingestion slice. The .NET Worker Service path follows as an independent implementation option.

See docs/roadmap.md and docs/task-breakdown.md.

Governance

Issues and pull requests are welcome for documentation, examples, parser mappings, source discovery, database schema notes, and implementation improvements.

Non-Goals

CarbonOps-Parser does not:

  • Calculate carbon inventories.
  • Produce emissions reports.
  • Replace source-owner documentation or source files.
  • Guarantee source data correctness.
  • Provide a deployment platform.
  • Normalize all source families into one shared factor table during Phase 1.

License

CarbonOps-Parser is licensed under the Apache License 2.0.
