feat(data,repo): local demo tooling + seeder price-history fix #298

Merged: w7-mgfcode merged 2 commits into dev from chore/local-demo-enrichment-tools on May 26, 2026.

Owner

@w7-mgfcode w7-mgfcode commented May 26, 2026

Summary

Bundles three carryover concerns from prior local demo work into one PR. Tracked by #297.

  • fix(data): app/shared/seeder/generators/facts.py. PriceHistoryGenerator could emit a row with valid_to < valid_from when a price change rolled on the window's first day. That violates ck_price_history_valid_dates and crashes ingest. The fix skips the degenerate row.
  • feat(data) — three new localhost-only scripts that drive the public API to enrich the demo DB without raw SQL writes:
    • scripts/seed_phase2_only.py — re-runs Phase 2 generators (replenishment, exogenous, returns, lifecycle) against existing dimensions; refuses unless DATABASE_URL resolves to localhost / 127.0.0.1
    • scripts/seed_historical_activity.py — submits a spread of train/predict/backtest jobs across 2024-Q4 → 2026-Q1 cutoffs via /jobs so the Registry / Jobs / Forecasts dashboards have meaningful content
    • scripts/seed_registry_from_jobs.py — walks completed train jobs, runs the canonical /registry/runs pending → running → success transition + alias stamping with deterministic stub metrics
  • chore(repo): uv.lock bumps forecastlabai 0.2.18 → 0.2.19 to match the already-merged release-please version bump (mirrors prior #239).
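
The localhost refusal described for scripts/seed_phase2_only.py could be sketched roughly as below. This is an illustration, not the script's actual code: `is_local_database_url` and `require_local_database` are hypothetical names, and the example assumes the guard compares the parsed host against the two names the PR mentions.

```python
from urllib.parse import urlsplit

# Hosts treated as local in this sketch; the PR names localhost and 127.0.0.1.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def is_local_database_url(database_url: str) -> bool:
    """Return True only when the URL's parsed host is one of LOCAL_HOSTS."""
    host = urlsplit(database_url).hostname
    return host in LOCAL_HOSTS


def require_local_database(database_url: str) -> None:
    """Abort before any write if the target is not a local database."""
    if not is_local_database_url(database_url):
        raise SystemExit("Refusing to run: DATABASE_URL does not resolve to localhost")
```

Comparing the parsed hostname, rather than searching the URL string for "localhost", keeps a host such as localhost.example.com from passing the check.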

Excluded (intentionally)

  • app/features/rag/models.py (modified locally) + alembic/versions/a2b3c4d5e6f7_rag_embedding_dim_2560_qwen3.py (untracked) — the migration's own docstring marks it "Local-only demo migration". It TRUNCATEs document_chunk, drops ix_chunk_embedding_hnsw, and hardcodes vector(2560) for qwen3-embedding:4b. Shipping it to dev/main would wipe any non-qwen3 user's RAG corpus on next alembic upgrade head (settings default is still 1536). Stays uncommitted locally. If upstream qwen3 support is wanted, it gets its own PRP (target-dim from settings, non-destructive upgrade path, HNSW → IVFFlat fallback).

Validation

| Check | Result |
|---|---|
| `uv run ruff check <touched files>` | ✅ All checks passed (after auto-fixing 2 imports + manually fixing 6 unicode chars + 1 generic + 1 sha1) |
| `uv run ruff format --check <touched files>` | ✅ 4 files already formatted |
| `uv run mypy app/shared/seeder/generators/facts.py` | ✅ Success, no issues (the only touched in-scope file; CI's mypy app/ does not cover scripts/) |
| `uv run pytest -v -m "not integration" app/shared/seeder/ -k "facts or phase1_regression or price_history"` | ✅ 10 passed, 249 deselected |

Not run:

  • Full mypy app/ and pyright app/ — CI will run these on push (they only cover app/, not the 3 new scripts/)
  • Integration tests (-m integration) — the seeder fix has unit coverage already; integration tests for the new scripts would need a live API + seeded DB
  • The 3 new scripts themselves — they require a running backend + seeded DB (out of scope for a docs-PR-adjacent merge)

Why one PR

Per the answer to the user's clarifying question. Alternative was three tiny PRs (seeder fix / uv.lock / scripts) — same content, more churn. Multi-scope commit data,repo is allowed by .claude/rules/commit-format.md for cross-cutting work.

Test plan

  • Lint + format clean on touched files
  • mypy clean on in-scope facts.py
  • Seeder unit tests pass (including the regression suite)
  • CI green on the four dev gates (Lint & Format, Type Check, Test, Migration Check)
  • Manual smoke (optional): uv run python scripts/seed_phase2_only.py --seed 42 against a freshly seeded local DB

Summary by Sourcery

Fix price history seeding and add local demo seeding scripts, along with a dependency lockfile bump.

New Features:

  • Add a local-only script to backfill registry runs and aliases from completed training jobs via the public API.
  • Add a local-only script to rerun phase-2 data enrichment generators against existing demo dimensions without altering phase-1 facts.
  • Add a local-only script to seed historical training, prediction, backtest, and batch jobs via the public API for richer dashboard data.
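
The job-seeding script submits jobs and then waits for each to finish. A minimal sketch of that polling step follows, assuming a hypothetical `fetch_job` callable that wraps GET /jobs/{job_id} and returns the decoded JSON body. Only the "completed" status appears in this PR; the other terminal status names, the default timeout, and the interval are assumptions.

```python
import time
from collections.abc import Callable
from typing import Any

# "completed" appears in the PR; the other terminal names are assumptions.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def poll_job(
    fetch_job: Callable[[str], dict[str, Any]],
    job_id: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r} after {timeout_s}s")
        sleep(interval_s)
```

Passing `fetch_job` and `sleep` in as parameters keeps the loop testable without a running backend; a backtest caller could simply pass a larger `timeout_s`.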

Bug Fixes:

  • Prevent the price history generator from emitting rows where the valid_to date precedes valid_from, avoiding ingest constraint violations.

Build:

  • Update uv.lock to bump the forecastlabai dependency to the latest released version.

Bundles three carryover concerns from prior local demo work into one PR.

* fix(data) — PriceHistoryGenerator could emit a row with valid_to <
  valid_from when a change roll fired on the window's first day. That
  violates ck_price_history_valid_dates and crashed the seeder during
  ingest. The fix skips the degenerate row.

* feat(data) — three new local-host scripts that drive the public API
  to enrich the demo DB without raw SQL writes:
  - seed_phase2_only: re-runs Phase 2 generators (replenishment,
    exogenous, returns, lifecycle) against existing dimensions
  - seed_historical_activity: submits varied train/predict/backtest
    jobs across 2024-Q4 -> 2026-Q1 cutoffs through /jobs
  - seed_registry_from_jobs: walks completed train jobs, runs the
    canonical pending -> running -> success transition + alias stamps

* chore(repo) — uv.lock refreshes forecastlabai 0.2.18 -> 0.2.19
  to match the release-please-merged version bump.

Excluded intentionally: alembic/a2b3c4d5e6f7 + rag/models.py — the
migration is self-marked "local-only demo" (truncates document_chunk,
drops HNSW index, hardcodes 2560 for qwen3) and would wipe any non-
qwen3 user's RAG corpus on upgrade. Stays uncommitted locally.

sourcery-ai Bot commented May 26, 2026

Reviewer's Guide

Fixes a bug in the price-history seeder that could generate invalid date windows, and adds three localhost-only demo scripts that seed additional data and historical activity via the public API while updating the lockfile to the current forecastlabai version.

Sequence diagram for seed_registry_from_jobs registry backfill

```mermaid
sequenceDiagram
    participant Script as SeedRegistryFromJobs
    participant JobsAPI as JobsAPI
    participant RegistryAPI as RegistryAPI
    participant FS as RegistryArtifactStorage

    Script->>JobsAPI: GET /jobs?job_type=train&status=completed (fetch_completed_train_jobs)
    JobsAPI-->>Script: completed train jobs

    loop for each completed job
        Script->>FS: check/copy model_path to registry_root
        Script->>RegistryAPI: POST /registry/runs
        alt created
            RegistryAPI-->>Script: run_id
            Script->>RegistryAPI: PATCH /registry/runs/{run_id} status=running
            RegistryAPI-->>Script: updated run
            Script->>RegistryAPI: PATCH /registry/runs/{run_id}
            activate Script
            Script-->>RegistryAPI: status=success, metrics, artifact_uri, artifact_hash
            deactivate Script
            RegistryAPI-->>Script: updated run
        else duplicate config_hash
            RegistryAPI-->>Script: 4xx (skip run)
        end
    end

    Script->>Script: select winners per (store_id, product_id)
    loop aliases for each winner
        Script->>RegistryAPI: POST /registry/aliases (champion/challenger)
        RegistryAPI-->>Script: alias created or error
    end
```

Sequence diagram for seed_historical_activity job seeding

```mermaid
sequenceDiagram
    participant Script as SeedHistoricalActivity
    participant JobsAPI as JobsAPI
    participant BatchAPI as BatchForecastingAPI

    loop for each (store_id, product_id, cutoff, model_type)
        Script->>JobsAPI: POST /jobs (job_type=train)
        JobsAPI-->>Script: job_id
        Script->>JobsAPI: GET /jobs/{job_id} (poll_job)
        JobsAPI-->>Script: status, run_id
    end

    Script->>Script: filter completed train jobs at latest cutoff
    loop for each latest run_id
        Script->>JobsAPI: POST /jobs (job_type=predict)
        JobsAPI-->>Script: job_id
        Script->>JobsAPI: GET /jobs/{job_id} (poll_job)
        JobsAPI-->>Script: status
    end

    loop for selected pairs
        Script->>JobsAPI: POST /jobs (job_type=backtest)
        JobsAPI-->>Script: job_id
        Script->>JobsAPI: GET /jobs/{job_id} (poll_job, longer timeout)
        JobsAPI-->>Script: status
    end

    Script->>BatchAPI: POST /batch/forecasting
    BatchAPI-->>Script: batch_id, item_count or error
```

File-Level Changes

Change Details Files
Ensure generated price history rows never violate the valid_from/valid_to check constraint by skipping zero-length initial windows.
  • Introduce a local valid_to variable when a price change triggers.
  • Guard the append of the previous price window with a valid_to >= current_valid_from check to avoid degenerate ranges.
  • Only update current_price and current_valid_from when a non-degenerate window was persisted.
app/shared/seeder/generators/facts.py
Add a localhost-only Phase 2 enrichment script that backfills lifecycle, replenishment, exogenous signals, and returns data against an existing seeded schema.
  • Wire up async SQLAlchemy session creation using DATABASE_URL from settings and hard-refuse if the URL is not localhost/127.0.0.1.
  • Compute lifecycle attributes per product deterministically from a seeded RNG and update Product rows in-place.
  • Generate replenishment events, exogenous signals, and sales return records via existing generator classes and bulk-insert them in chunks for performance.
  • Expose a small argparse CLI to control RNG seed and returns probability.
scripts/seed_phase2_only.py
Add a script to backfill realistic historical model activity by driving train/predict/backtest and batch jobs through the public /jobs and /batch/forecasting APIs.
  • Define fixed (store, product) pairs, cutoffs, and baseline model types to span 2024–2026 training windows.
  • Implement helpers to submit and poll jobs generically, then orchestrate train→predict→backtest flows sequentially for a small matrix of scenarios.
  • Optionally submit a small batch forecasting job and emit concise console summaries of counts and statuses.
  • Provide a simple CLI flag for the API base URL.
scripts/seed_historical_activity.py
Add a registry seeding script that turns completed train jobs into registry runs with deterministic stub metrics and aliasing.
  • Fetch paginated completed train jobs via /jobs and filter to supported baseline model types.
  • Locate and copy each job's trained model artifact into the registry artifact root, hashing contents to populate artifact metadata.
  • Create, transition, and finalize /registry/runs entries with stubbed metrics seeded from job run_id for deterministic but varied values.
  • Select lowest-WAPE runs per (store, product) for the latest cutoff and create champion/challenger aliases via /registry/aliases.
  • Expose a CLI for configuring the API base URL and use settings.registry_artifact_root for artifact placement.
scripts/seed_registry_from_jobs.py
Align the uv.lock dependency on forecastlabai with the already-bumped project version.
  • Update the forecastlabai locked version from 0.2.18 to 0.2.19 in uv.lock to match the release-please bump and keep environments reproducible.
uv.lock
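
For the registry seeding script described above, the two ideas of "deterministic but varied" stub metrics and lowest-WAPE winner selection might look roughly like this. Everything here is an assumption made for illustration: the function names, the metric ranges, and the shape of the run dictionaries are not taken from the script.

```python
import hashlib
import random


def stub_metrics(run_id: str) -> dict[str, float]:
    """Derive repeatable pseudo-metrics from a run id (not real model scores)."""
    # Hash the id to get a seed that is stable across processes; the built-in
    # hash() of a str is salted per process and would not be.
    seed = int.from_bytes(hashlib.sha256(run_id.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    # Ranges are illustrative assumptions.
    return {
        "wape": round(rng.uniform(0.10, 0.40), 4),
        "mae": round(rng.uniform(1.0, 10.0), 4),
    }


def pick_winners(runs: list[dict]) -> dict[tuple[int, int], dict]:
    """Keep the lowest-WAPE run for each (store_id, product_id) pair."""
    winners: dict[tuple[int, int], dict] = {}
    for run in runs:
        key = (run["store_id"], run["product_id"])
        best = winners.get(key)
        if best is None or run["metrics"]["wape"] < best["metrics"]["wape"]:
            winners[key] = run
    return winners
```

Because the seed comes from the run id alone, re-running the backfill against the same jobs would reproduce the same stub values, which keeps the demo dashboards stable between runs.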


coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • In fetch_completed_train_jobs, the pagination break condition if page * len(jobs) >= total can stop too early when the last page is partially filled; consider using the configured page_size (or an accumulated count) instead of page * len(jobs) to decide when to exit.
  • In seed_phase2_only.chunked, the use of the new generic function syntax def chunked[U](...) ties the script to Python 3.12+; if this repo targets earlier versions, switch to the TypeVar pattern (def chunked(items: list[U], size: int) -> Iterator[list[U]]) for broader compatibility.
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="scripts/seed_registry_from_jobs.py" line_range="102-111" />
<code_context>
+    model_type = str(params.get("model_type", ""))
+    if model_type not in {"naive", "seasonal_naive", "moving_average"}:
+        return None  # only baselines for this backfill
+    source_path = Path(str(result.get("model_path", "")))
+    if not source_path.exists():
+        # try relative-to-cwd
+        rel = Path.cwd() / source_path
+        if rel.exists():
+            source_path = rel
+        else:
+            return None
+    forecast_run_id = str(result.get("run_id", ""))
+    artifact_uri = f"backfill/{model_type}-{source_path.stem}.joblib"
+    dest = registry_root / artifact_uri
+    dest.parent.mkdir(parents=True, exist_ok=True)
+    if not dest.exists():
+        shutil.copy2(source_path, dest)
+    raw = dest.read_bytes()
+    artifact_hash = hashlib.sha256(raw).hexdigest()
</code_context>
<issue_to_address>
**issue (bug_risk):** Handling of missing or non-file `model_path` values can misbehave when the value is empty or points to a directory.

If `model_path` is missing or empty, `Path("")` resolves to the current directory and `exists()` is True, so `shutil.copy2` will be called on a directory and fail. The same applies if `model_path` points to an existing directory. Consider requiring a non-empty path and checking `source_path.is_file()` (and likewise for the `rel` candidate) before copying; otherwise, skip this entry.
</issue_to_address>

### Comment 2
<location path="scripts/seed_registry_from_jobs.py" line_range="120-129" />
<code_context>
+    artifact_hash = hashlib.sha256(raw).hexdigest()
+
+    # (a) create
+    r = await client.post(
+        "/registry/runs",
+        json={
+            "model_type": model_type,
+            "model_config": _model_config_payload(model_type),
+            "feature_config": None,
+            "data_window_start": str(params.get("start_date")),
+            "data_window_end": str(params.get("end_date")),
+            "store_id": int(params["store_id"]),
+            "product_id": int(params["product_id"]),
+            "agent_context": None,
+            "git_sha": None,
+        },
+    )
+    if r.status_code >= 400:
+        # duplicate config_hash → idempotent skip
+        return None
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Treating all non-2xx `/registry/runs` responses as duplicate-config skips can hide real errors.

`status_code >= 400` is broader than the duplicate-config case and will also catch unrelated 4xx/5xx errors, which then get treated as idempotent skips. That can hide real failures like registry downtime or validation errors. Please either detect the actual duplicate condition (e.g., specific status code or error payload) or at least log/raise on unexpected 4xx/5xx responses so operational issues are visible.

Suggested implementation:

```python
    artifact_hash = hashlib.sha256(raw).hexdigest()

    # (a) create
    r = await client.post(
        "/registry/runs",
        json={
            "model_type": model_type,
            "model_config": _model_config_payload(model_type),
            "feature_config": None,
            "data_window_start": str(params.get("start_date")),
            "data_window_end": str(params.get("end_date")),
            "store_id": int(params["store_id"]),
            "product_id": int(params["product_id"]),
            "agent_context": None,
            "git_sha": None,
        },
    )
    if r.status_code == 409:
        # duplicate config_hash → idempotent skip
        return None
    if r.status_code >= 400:
        # non-duplicate 4xx/5xx should surface as errors
        try:
            error_detail = r.json()
        except Exception:
            error_detail = r.text
        raise RuntimeError(
            f"Failed to create registry run (status {r.status_code}): {error_detail}"
        )
```

If your registry API uses a different status code or response shape to indicate "duplicate config" (e.g. 422 with a specific error code in the JSON body), update the `if r.status_code == 409:` condition to match that contract, or refine the check using `error_detail`. Also ensure that `client`, `_model_config_payload`, and `params` are in scope in this function as expected.
</issue_to_address>

Three corrections to register_one and fetch_completed_train_jobs:

* pagination — `page * len(jobs) >= total` stops too early when the
  last page is partial. Switch to accumulated-count + short-page
  detection (exit when len(jobs) < page_size or len(out) >= total).
* model_path validation — empty / directory paths slipped through
  because Path("") resolves to cwd and Path.exists() returns True for
  directories. Require non-empty path and Path.is_file() for both the
  raw and cwd-relative candidates.
* duplicate detection — `r.status_code >= 400` blanket-swallowed
  registry downtime and validation errors as idempotent skips. Narrow
  the skip to HTTP 409 (the actual DuplicateRunError code per
  registry/routes.py:113) and raise RuntimeError on other 4xx / 5xx
  with the response body for diagnostics.
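
The pagination exit condition from the first correction can be sketched in isolation. This is a simplified illustration rather than the script's `fetch_completed_train_jobs`: `fetch_page` is a hypothetical callable standing in for the GET /jobs request, and the "jobs" and "total" response keys are assumed.

```python
from collections.abc import Callable
from typing import Any


def fetch_all_pages(
    fetch_page: Callable[[int, int], dict[str, Any]], page_size: int = 100
) -> list[dict[str, Any]]:
    """Collect every item from a paginated endpoint.

    fetch_page(page, page_size) is a hypothetical callable returning a dict
    with "jobs" (the page's items) and, optionally, "total" (the overall count).
    """
    out: list[dict[str, Any]] = []
    page = 1
    while True:
        body = fetch_page(page, page_size)
        jobs = body.get("jobs", [])
        out.extend(jobs)
        total = body.get("total")
        # Stop on a short (or empty) page, or once everything has been collected.
        if len(jobs) < page_size or (total is not None and len(out) >= total):
            return out
        page += 1
```

Counting the items actually accumulated avoids relying on `page * len(jobs)`, which only equals the true count when every page is full.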

Python 3.12-only `def chunked[U](...)` syntax in seed_phase2_only.py
is intentional — `pyproject.toml:6` already pins `requires-python =
">=3.12"`.
@w7-mgfcode w7-mgfcode merged commit 26a105a into dev May 26, 2026
8 checks passed
@w7-mgfcode w7-mgfcode deleted the chore/local-demo-enrichment-tools branch May 26, 2026 04:17