
GridSync — Claude Context

Project Overview

A data hub that downloads ERCOT public ESIID data (Report 203), archives it to Backblaze B2, loads it into Postgres, and exposes a fuzzy search API. Designed to be fully provider-agnostic — no AWS, GCP, or Azure dependencies.

Architecture Principles

  • No cloud vendor lock-in — every component runs on any Linux server, VPS, or container host
  • Primary language: Rust — use Rust for all components where viable
  • Fallback language: Go — only where Rust tooling or SDK support is insufficient
  • Storage: Backblaze B2 — S3-compatible object storage for raw file archives
  • Database: Supabase / Postgres — same schema either way; self-hostable
  • No managed schedulers — system cron or pg_cron only

Data Source

ERCOT Report 203 — Public TDSP ESIID Extract (no auth required)

  • Document list API: https://www.ercot.com/misapp/servlets/IceDocListJsonWS?reportTypeId=203
  • Download URL: https://www.ercot.com/misdownload/servlets/mirDownload?doclookupId={id}
  • File naming: ext.00000203.0000000000000000.YYYYMMDD.HHMMSSXXX.{TDSP}_{TYPE}.zip
  • File types: _FUL (full monthly snapshot) and _DAILY (incremental delta)
  • FUL publish cadence: first Tuesday following the first Monday of each month
  • DAILY publish cadence: every day
  • ERCOT retains ~31 days of documents (~380 files in the feed at any time)
  • Estimated data volume: 3–5 million rows across all TDSPs

TDSPs published: CENTERPOINT, ONCOR_ELEC, LUBBOCK, NUECES_ELEC, TNMP, AEP_CENTRAL, AEP_NORTH, ENTERGY_GULF, SHARYLAND, AEP_TEXAS_SP, SWEPCO_ENERG
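The filename convention above carries everything the pipeline needs (publish date, TDSP, file type) and can be parsed with plain string splitting. A minimal sketch — the real crate's parse_doc() may differ in shape:

```rust
/// Metadata recovered from an ERCOT Report 203 filename (sketch).
#[derive(Debug)]
struct DocMeta {
    publish_date: String, // YYYYMMDD segment
    tdsp: String,         // e.g. "ONCOR_ELEC"
    file_type: String,    // "FUL" or "DAILY"
}

/// Parse `ext.00000203.0000000000000000.YYYYMMDD.HHMMSSXXX.{TDSP}_{TYPE}.zip`.
fn parse_filename(name: &str) -> Option<DocMeta> {
    let parts: Vec<&str> = name.split('.').collect();
    if parts.len() != 7 || parts[0] != "ext" || parts[1] != "00000203" {
        return None;
    }
    // Split "{TDSP}_{TYPE}" from the right so TDSP names containing '_' survive.
    let (tdsp, file_type) = parts[5]
        .rsplit_once('_')
        .filter(|(_, t)| *t == "FUL" || *t == "DAILY")?;
    Some(DocMeta {
        publish_date: parts[3].to_string(),
        tdsp: tdsp.to_string(),
        file_type: file_type.to_string(),
    })
}
```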

CSV schema (24 fields):

esiid, address, address_overflow, city, state, zip_code, duns, read_cycle,
status, premise_type, power_region, utility, market_id, station_code,
station_name, metered, service_orders, polr_class, settlement_ams, tdsp_ams,
switch_hold, market_code, settlement_load_zone, publish_date

Rolling Data Strategy

  • FUL load — upsert all rows, then deactivate (is_active = FALSE) any ESIID for that TDSP not present in the file
  • DAILY load — upsert only; no deactivation
  • Lookback — 2 days for DAILY, 7 days for FUL (catches the monthly file automatically)
  • Downloader skips files already present in B2 (checksum/existence check), so reruns are safe
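The lookback rule reduces to a small predicate. A sketch, assuming age_days is the whole number of days between the file's publish date and today:

```rust
/// Should a document this old still be considered on this run?
/// DAILY files get a 2-day window; FUL files get 7 days, which is wide
/// enough to pick up the monthly snapshot on any daily cron run after it.
fn within_lookback(file_type: &str, age_days: u32) -> bool {
    match file_type {
        "DAILY" => age_days <= 2,
        "FUL" => age_days <= 7,
        _ => false,
    }
}
```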

Project Structure

gridsync/
├── Cargo.toml                    # Workspace — all dependency versions pinned here
├── rust-toolchain.toml           # Pins Rust stable channel
├── .mise.toml                    # Tool versions + .env autoloading
├── .env.example                  # Required environment variables
├── .gitignore
├── db/
│   └── migrations/
│       ├── 0001_extensions.sql   # pg_trgm, fuzzystrmatch, pg_cron
│       ├── 0002_create_esiids.sql
│       ├── 0003_create_pipeline_runs.sql
│       ├── 0004_indexes.sql
│       └── 0005_triggers.sql
├── .mise/tasks/                  # All runnable tasks (see Commands section)
└── crates/
    ├── common/                   # Shared: Config, B2Client, DB pool, models, errors
    ├── downloader/               # Fetches ERCOT doc list → downloads → uploads to B2
    ├── etl/                      # Downloads from B2 → parses ZIP/CSV → upserts Postgres
    └── api/                      # Axum REST API — fuzzy search endpoint

Crates

esiid-common

Shared library used by all other crates.

Modules:

  • config — Config::from_env() reads all env vars and returns a typed struct
  • db — connect() pool setup; migrate() runs embedded migrations
  • storage — B2Client: upload, download, exists, and the object_key() helper
  • models — Esiid, PipelineRun, UpsertStats structs with sqlx::FromRow
  • error — AppError enum wrapping sqlx, anyhow, and domain errors
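The config module's behavior can be sketched with an injectable lookup so it is testable without touching the process environment. The field set is abbreviated, and from_lookup is a hypothetical helper — the real Config::from_env() reads std::env directly:

```rust
/// Abbreviated sketch of the common crate's Config struct.
#[derive(Debug)]
struct Config {
    database_url: String,
    b2_bucket_name: String,
    api_port: u16,
}

impl Config {
    /// Build from any key -> value source; from_env() would pass a closure
    /// over std::env::var. Required vars error out; optional ones default.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let req = |k: &str| get(k).ok_or_else(|| format!("missing env var {k}"));
        Ok(Config {
            database_url: req("DATABASE_URL")?,
            b2_bucket_name: req("B2_BUCKET_NAME")?,
            api_port: get("API_PORT")
                .unwrap_or_else(|| "3000".into())
                .parse()
                .map_err(|e| format!("API_PORT: {e}"))?,
        })
    }
}
```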

esiid-downloader

Binary: downloader

Polls the ERCOT document list API, compares against B2 to find new files, downloads and uploads them, records each run in pipeline_runs.

Key modules:

  • ercot — ErcotClient: list_documents(), download(doc_id), parse_doc() (extracts TDSP + FileType from the filename)
  • storage — re-exports esiid_common::storage::B2Client

CLI flags:

--full-only        Only download FUL files
--daily-only       Only download DAILY files
--tdsp <name>      Filter to a specific TDSP
--dry-run          List files that would be downloaded without fetching

esiid-etl

Binary: etl

Downloads a specific file from B2, parses the ZIP/CSV, upserts into Postgres. For FUL files, deactivates ESIIDs no longer in the extract.

Key modules:

  • parser — parse_zip() extracts the CSV from the ZIP; ErcotRecord maps the CSV headers exactly
  • loader — upsert_records() uses an UNNEST bulk upsert in 1,000-row batches; deactivate_missing() handles FUL cleanup
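The UNNEST approach binds one array per column instead of one statement per row. A sketch with a two-field stand-in record — the real statement covers all 24 fields, and this SQL is an assumed reconstruction, not the actual loader query:

```rust
/// Two-field stand-in for the real 24-field ErcotRecord (illustration only).
struct Rec {
    esiid: String,
    address: String,
}

/// Transpose a batch into per-column arrays; these become the $1, $2, ...
/// array binds for the UNNEST statement below.
fn to_columns(batch: &[Rec]) -> (Vec<&str>, Vec<&str>) {
    batch
        .iter()
        .map(|r| (r.esiid.as_str(), r.address.as_str()))
        .unzip()
}

/// Sketch of the per-batch upsert (column list abbreviated).
const UPSERT_SQL: &str = r#"
INSERT INTO esiids (esiid, address, last_seen_at, is_active)
SELECT t.esiid, t.address, NOW(), TRUE
FROM UNNEST($1::text[], $2::text[]) AS t(esiid, address)
ON CONFLICT (esiid) DO UPDATE
SET address = EXCLUDED.address,
    last_seen_at = NOW(),
    is_active = TRUE
"#;
```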

CLI flags:

--key <b2-key>         B2 object key to process
--tdsp <name>          TDSP name (required for FUL deactivation)
--file-type <FUL|DAILY>
--skip-deactivate      Skip deactivation step on FUL files

esiid-api

Binary: api

Axum HTTP server with two routes:

  • GET /search — fuzzy search (see Search section)
  • GET /health — returns "ok"

CLI flags:

--port <n>    Override API_PORT env var

Database Schema

esiids table

  • id — BIGSERIAL PRIMARY KEY (surrogate; integer for index efficiency at 3–5M rows)
  • esiid — TEXT NOT NULL UNIQUE (ERCOT stable identifier; upsert key)
  • All 24 ERCOT CSV fields stored as TEXT or DATE — no coercion at ingest time
  • Pipeline tracking: first_seen_at, last_seen_at, updated_at, is_active
  • search_vector TSVECTOR — maintained by trigger, weighted: ESIID=A, address=B, city/zip=C

pipeline_runs table

Tracks every downloader and ETL run. Fields: source_file, source_checksum (SHA-256), tdsp, file_type (enum: FUL/DAILY), publish_date, timing, row counts, status (enum: running/completed/failed), error_message.

Indexes

  • B-tree on: utility, zip_code, premise_type, settlement_load_zone, status, is_active, publish_date
  • GIN trigram (gin_trgm_ops) on: address, city
  • GIN on: search_vector

Triggers

  • trg_esiids_updated_at — sets updated_at = NOW() on every update
  • trg_esiids_search_vector — rebuilds search_vector on insert/update using 'simple' dictionary (no stemming — address data benefits from exact/prefix matching)
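The trigger body might look like the following. This is an assumed reconstruction based on the weights and dictionary described above, not the actual contents of 0005_triggers.sql:

```rust
/// Hypothetical core of the search_vector trigger function:
/// 'simple' dictionary throughout, weighted ESIID=A, address=B, city/zip=C.
const SEARCH_VECTOR_TRIGGER: &str = r#"
NEW.search_vector :=
    setweight(to_tsvector('simple', coalesce(NEW.esiid, '')), 'A') ||
    setweight(to_tsvector('simple', coalesce(NEW.address, '')), 'B') ||
    setweight(to_tsvector('simple',
        coalesce(NEW.city, '') || ' ' || coalesce(NEW.zip_code, '')), 'C');
"#;
```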

Search

GET /search query parameters:

  • q (string) — free-text fuzzy query, matched via FTS + trigram similarity
  • zip (string) — exact zip code filter
  • utility (string) — TDSP name filter (case-insensitive)
  • premise_type (string) — e.g. RES, COM (case-insensitive)
  • load_zone (string) — settlement load zone filter
  • active_only (bool) — default true; only return is_active = TRUE rows
  • limit (int) — max results, clamped 1–100, default 25
  • offset (int) — pagination offset, default 0
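The defaulting and clamping rules above pin down to two small helpers; a sketch:

```rust
/// limit: default 25, clamped to the 1..=100 range.
fn clamp_limit(requested: Option<i64>) -> i64 {
    requested.unwrap_or(25).clamp(1, 100)
}

/// active_only: defaults to true when the query param is absent.
fn active_only(requested: Option<bool>) -> bool {
    requested.unwrap_or(true)
}
```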

When q is provided, results are ordered by ts_rank (FTS) DESC then similarity(address, q) DESC. Without q, results are ordered by esiid.
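The ranking described can be expressed in a single query. A hedged sketch — the API's actual SQL and its choice of tsquery function may differ (websearch_to_tsquery here is an assumption):

```rust
/// Sketch of the ranked search when q is present: FTS match OR trigram
/// match, ordered by ts_rank then pg_trgm similarity on address.
const SEARCH_SQL: &str = r#"
SELECT e.*,
       ts_rank(search_vector, websearch_to_tsquery('simple', $1)) AS rank,
       similarity(address, $1) AS sim
FROM esiids e
WHERE is_active
  AND (search_vector @@ websearch_to_tsquery('simple', $1)
       OR address % $1)
ORDER BY rank DESC, sim DESC
LIMIT $2 OFFSET $3
"#;
```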

Response: { total, limit, offset, results: Esiid[] }

B2 Object Key Convention

raw/{source}/{dataset}/YYYY/MM/DD/{original-filename}.zip

The source and dataset segments allow the bucket to expand beyond ESIID data without restructuring.

  • source — current: ercot; future examples: oncor, centerpoint
  • dataset — current: esiid; future examples: settlement, outages, usage

Example: raw/ercot/esiid/2026/03/22/ext.00000203...ONCOR_ELEC_FUL.zip
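A key builder matching the convention can be sketched as follows; the common crate's object_key() helper presumably does something similar, but this exact signature is an assumption:

```rust
/// Build `raw/{source}/{dataset}/YYYY/MM/DD/{filename}` from a
/// YYYYMMDD publish date as found in the ERCOT filename.
/// Assumes publish_date is exactly 8 digits (panics otherwise).
fn object_key(source: &str, dataset: &str, publish_date: &str, filename: &str) -> String {
    let (y, rest) = publish_date.split_at(4);
    let (m, d) = rest.split_at(2);
    format!("raw/{source}/{dataset}/{y}/{m}/{d}/{filename}")
}
```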

Environment Variables

  • DATABASE_URL (required) — Postgres connection string
  • B2_KEY_ID (required) — Backblaze B2 application key ID
  • B2_APPLICATION_KEY (required) — Backblaze B2 application key
  • B2_BUCKET_NAME (required) — B2 bucket name
  • B2_ENDPOINT (required) — e.g. https://s3.us-east-005.backblazeb2.com
  • B2_REGION — optional; default us-east-005
  • ERCOT_API_BASE — optional; default https://www.ercot.com
  • API_HOST — optional; default 0.0.0.0
  • API_PORT — optional; default 3000
  • RUST_LOG — optional; tracing filter, e.g. info or esiid_api=debug

The .env file is autoloaded by mise when entering the project directory.

Commands

All tasks run via mise run <task>. The .env file is loaded automatically.

mise run install          # Install sqlx-cli and cargo-watch
mise run build            # cargo build --workspace (debug)
mise run release          # cargo build --workspace --release
mise run check            # cargo check + cargo clippy -D warnings
mise run fmt              # cargo fmt --all
mise run test             # cargo test --workspace
mise run api              # Run the search API server
mise run watch            # Live-reload API server (requires cargo-watch)
mise run download         # Run the downloader (pass flags with --)
mise run etl              # Run the ETL (pass flags with --)
mise run cron             # Run the full pipeline (download + ETL signal)
mise run db:migrate       # Apply all pending migrations
mise run db:status        # Show applied/pending migration status
mise run db:revert        # Revert the last applied migration
mise run db:prepare       # Generate sqlx offline query cache (.sqlx/)

First-time setup

cp .env.example .env
# fill in DATABASE_URL, B2_*, etc.

mise install              # install Rust stable
mise run install          # install sqlx-cli + cargo-watch
mise run db:migrate       # apply migrations
mise run build            # verify everything compiles
mise run api              # start the server

sqlx offline mode

After adding or changing any sqlx::query! / sqlx::query_as! calls, regenerate the cache so CI can build without a live database:

mise run db:prepare       # writes .sqlx/ directory — commit this

Key Design Decisions

  • Surrogate BIGSERIAL PK + UNIQUE(esiid) — integer PK is faster for B-tree ops and foreign keys at 3–5M rows; ESIID remains the upsert key
  • All ERCOT fields stored as TEXT — avoids ingestion failures from unexpected values; coercion happens in the application layer
  • UNNEST bulk upsert — one round-trip per 1,000-row batch instead of N individual queries
  • 'simple' FTS dictionary — address data doesn't benefit from stemming; exact/prefix matching is more useful
  • search_vector weights — ESIID exact hits (A) rank above address matches (B) which rank above city/zip (C)
  • B2Client in common — shared between downloader and ETL; no duplication
  • Migrations embedded in binaries — each binary runs sqlx::migrate! on startup so schema is always current
  • FUL deactivation — after a full-snapshot load, ESIIDs for that TDSP not present in the file are marked is_active = FALSE rather than deleted, preserving history