feat(data,repo): local demo tooling + seeder price-history fix by w7-mgfcode · Pull Request #298 · w7-mgfcode/ForecastLabAI

w7-mgfcode · 2026-05-26T04:03:27Z

Summary

Bundles three carryover concerns from prior local demo work into one PR. Tracked by #297.

fix(data) — app/shared/seeder/generators/facts.py: PriceHistoryGenerator could emit a row with valid_to < valid_from when a price change rolled on the window's first day. That violates ck_price_history_valid_dates and crashes ingest. The fix skips the degenerate row.
feat(data) — three new localhost-only scripts that drive the public API to enrich the demo DB without raw SQL writes:
- scripts/seed_phase2_only.py — re-runs Phase 2 generators (replenishment, exogenous, returns, lifecycle) against existing dimensions; refuses unless DATABASE_URL resolves to localhost / 127.0.0.1
- scripts/seed_historical_activity.py — submits a spread of train/predict/backtest jobs across 2024-Q4 → 2026-Q1 cutoffs via /jobs so the Registry / Jobs / Forecasts dashboards have meaningful content
- scripts/seed_registry_from_jobs.py — walks completed train jobs, runs the canonical /registry/runs pending → running → success transition + alias stamping with deterministic stub metrics
chore(repo) — uv.lock bumps forecastlabai 0.2.18 → 0.2.19 to match the already-merged release-please version bump (mirrors prior #239).
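
The guard behind the fix(data) item can be sketched as follows. This is a minimal illustration, not the code in facts.py: `PriceWindow`, `build_price_windows`, and the `changes` mapping are hypothetical names. The sketch assumes a window closes the day before a change takes effect, and that a change landing on the window's first day is dropped together with the degenerate row.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class PriceWindow:
    """One price-history row (hypothetical shape, for illustration only)."""

    price: float
    valid_from: date
    valid_to: Optional[date]  # None marks the still-open window


def build_price_windows(
    start: date,
    end: date,
    initial_price: float,
    changes: dict[date, float],
) -> list[PriceWindow]:
    """Walk the date range and close a window whenever a price change fires."""
    windows: list[PriceWindow] = []
    current_price = initial_price
    current_valid_from = start
    day = start
    while day <= end:
        new_price = changes.get(day)
        if new_price is not None:
            valid_to = day - timedelta(days=1)
            # Guard: a change on the first day would give valid_to < valid_from,
            # which a CHECK constraint on the date pair would reject.
            if valid_to >= current_valid_from:
                windows.append(PriceWindow(current_price, current_valid_from, valid_to))
                current_price = new_price
                current_valid_from = day
        day += timedelta(days=1)
    # The last window stays open.
    windows.append(PriceWindow(current_price, current_valid_from, None))
    return windows
```

With a change on the first day and another two days later, only the second change closes a window, so every closed row satisfies valid_to >= valid_from.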

Excluded (intentionally)

app/features/rag/models.py (modified locally) + alembic/versions/a2b3c4d5e6f7_rag_embedding_dim_2560_qwen3.py (untracked) — the migration's own docstring marks it "Local-only demo migration". It TRUNCATEs document_chunk, drops ix_chunk_embedding_hnsw, and hardcodes vector(2560) for qwen3-embedding:4b. Shipping it to dev/main would wipe any non-qwen3 user's RAG corpus on next alembic upgrade head (settings default is still 1536). Stays uncommitted locally. If upstream qwen3 support is wanted, it gets its own PRP (target-dim from settings, non-destructive upgrade path, HNSW → IVFFlat fallback).

Validation

| Check | Result |
| --- | --- |
| `uv run ruff check <touched files>` | ✅ All checks passed (after auto-fixing 2 imports + manually fixing 6 unicode chars + 1 generic + 1 sha1) |
| `uv run ruff format --check <touched files>` | ✅ 4 files already formatted |
| `uv run mypy app/shared/seeder/generators/facts.py` | ✅ Success, no issues (the only touched in-scope file; CI's `mypy app/` does not cover `scripts/`) |
| `uv run pytest -v -m "not integration" app/shared/seeder/ -k "facts or phase1_regression or price_history"` | ✅ 10 passed, 249 deselected |

Not run:

Full mypy app/ and pyright app/ — CI will run these on push (they only cover app/, not the 3 new scripts/)
Integration tests (-m integration) — the seeder fix has unit coverage already; integration tests for the new scripts would need a live API + seeded DB
The 3 new scripts themselves — they require a running backend + seeded DB (out of scope for a docs-PR-adjacent merge)

Why one PR

Per the answer to the user's clarifying question. Alternative was three tiny PRs (seeder fix / uv.lock / scripts) — same content, more churn. Multi-scope commit data,repo is allowed by .claude/rules/commit-format.md for cross-cutting work.

Test plan

Lint + format clean on touched files
mypy clean on in-scope facts.py
Seeder unit tests pass (including the regression suite)
CI green on the four dev gates (Lint & Format, Type Check, Test, Migration Check)
Manual smoke (optional): uv run python scripts/seed_phase2_only.py --seed 42 against a freshly seeded local DB

Summary by Sourcery

Fix price history seeding and add local demo seeding scripts, along with a dependency lockfile bump.

New Features:

Add a local-only script to backfill registry runs and aliases from completed training jobs via the public API.
Add a local-only script to rerun phase-2 data enrichment generators against existing demo dimensions without altering phase-1 facts.
Add a local-only script to seed historical training, prediction, backtest, and batch jobs via the public API for richer dashboard data.

Bug Fixes:

Prevent the price history generator from emitting rows where the valid_to date precedes valid_from, avoiding ingest constraint violations.

Build:

Update uv.lock to bump the forecastlabai dependency to the latest released version.

Bundles three carryover concerns from prior local demo work into one PR.

* fix(data): PriceHistoryGenerator could emit a row with valid_to < valid_from when a change roll fired on the window's first day. That violates ck_price_history_valid_dates and crashed the seeder during ingest. The fix skips the degenerate row.
* feat(data): three new localhost scripts that drive the public API to enrich the demo DB without raw SQL writes:
  - seed_phase2_only: re-runs Phase 2 generators (replenishment, exogenous, returns, lifecycle) against existing dimensions
  - seed_historical_activity: submits varied train/predict/backtest jobs across 2024-Q4 -> 2026-Q1 cutoffs through /jobs
  - seed_registry_from_jobs: walks completed train jobs, runs the canonical pending -> running -> success transition + alias stamps
* chore(repo): uv.lock refreshes forecastlabai 0.2.18 -> 0.2.19 to match the release-please-merged version bump.

Excluded intentionally: alembic/a2b3c4d5e6f7 + rag/models.py. The migration is self-marked "local-only demo" (truncates document_chunk, drops HNSW index, hardcodes 2560 for qwen3) and would wipe any non-qwen3 user's RAG corpus on upgrade. Stays uncommitted locally.

sourcery-ai · 2026-05-26T04:03:34Z

Reviewer's Guide

Fixes a bug in the price-history seeder that could generate invalid date windows, and adds three localhost-only demo scripts that seed additional data and historical activity via the public API while updating the lockfile to the current forecastlabai version.

Sequence diagram for seed_registry_from_jobs registry backfill

sequenceDiagram
    participant Script as SeedRegistryFromJobs
    participant JobsAPI as JobsAPI
    participant RegistryAPI as RegistryAPI
    participant FS as RegistryArtifactStorage

    Script->>JobsAPI: GET /jobs?job_type=train&status=completed (fetch_completed_train_jobs)
    JobsAPI-->>Script: completed train jobs

    loop for each completed job
        Script->>FS: check/copy model_path to registry_root
        Script->>RegistryAPI: POST /registry/runs
        alt created
            RegistryAPI-->>Script: run_id
            Script->>RegistryAPI: PATCH /registry/runs/{run_id} status=running
            RegistryAPI-->>Script: updated run
            Script->>RegistryAPI: PATCH /registry/runs/{run_id}
            activate Script
            Script-->>RegistryAPI: status=success, metrics, artifact_uri, artifact_hash
            deactivate Script
            RegistryAPI-->>Script: updated run
        else duplicate config_hash
            RegistryAPI-->>Script: 4xx (skip run)
        end
    end

    Script->>Script: select winners per (store_id, product_id)
    loop aliases for each winner
        Script->>RegistryAPI: POST /registry/aliases (champion/challenger)
        RegistryAPI-->>Script: alias created or error
    end
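
The winner-selection step at the end of this diagram can be sketched as a pure function. This is a rough illustration under stated assumptions: `RunSummary` and `select_aliases` are hypothetical names, and treating the lowest-WAPE run as champion and the runner-up as challenger is an assumed reading of "champion/challenger", not something the script is confirmed to do.

```python
from collections import defaultdict
from typing import NamedTuple


class RunSummary(NamedTuple):
    """Minimal view of a finished registry run (hypothetical shape)."""

    run_id: str
    store_id: int
    product_id: int
    wape: float


def select_aliases(runs: list[RunSummary]) -> dict[tuple[int, int], dict[str, str]]:
    """Pick alias targets per (store_id, product_id), ranked by ascending WAPE."""
    by_pair: dict[tuple[int, int], list[RunSummary]] = defaultdict(list)
    for run in runs:
        by_pair[(run.store_id, run.product_id)].append(run)

    aliases: dict[tuple[int, int], dict[str, str]] = {}
    for pair, group in by_pair.items():
        # Tie-break on run_id so repeated invocations pick the same winner.
        ranked = sorted(group, key=lambda r: (r.wape, r.run_id))
        roles = {"champion": ranked[0].run_id}
        if len(ranked) > 1:
            roles["challenger"] = ranked[1].run_id
        aliases[pair] = roles
    return aliases
```

Keeping the selection free of HTTP calls makes it easy to check in isolation before the results are posted to /registry/aliases.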

Sequence diagram for seed_historical_activity job seeding

sequenceDiagram
    participant Script as SeedHistoricalActivity
    participant JobsAPI as JobsAPI
    participant BatchAPI as BatchForecastingAPI

    loop for each (store_id, product_id, cutoff, model_type)
        Script->>JobsAPI: POST /jobs (job_type=train)
        JobsAPI-->>Script: job_id
        Script->>JobsAPI: GET /jobs/{job_id} (poll_job)
        JobsAPI-->>Script: status, run_id
    end

    Script->>Script: filter completed train jobs at latest cutoff
    loop for each latest run_id
        Script->>JobsAPI: POST /jobs (job_type=predict)
        JobsAPI-->>Script: job_id
        Script->>JobsAPI: GET /jobs/{job_id} (poll_job)
        JobsAPI-->>Script: status
    end

    loop for selected pairs
        Script->>JobsAPI: POST /jobs (job_type=backtest)
        JobsAPI-->>Script: job_id
        Script->>JobsAPI: GET /jobs/{job_id} (poll_job, longer timeout)
        JobsAPI-->>Script: status
    end

    Script->>BatchAPI: POST /batch/forecasting
    BatchAPI-->>Script: batch_id, item_count or error
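
The submit-then-poll loop that recurs in this diagram could be structured roughly as below. This is a sketch, not the script's `poll_job`: the signature, the injected `fetch_status` callable (standing in for GET /jobs/{job_id}), and the terminal status names other than "completed" are assumptions.

```python
import time
from typing import Callable

# "completed" appears in the PR; the other terminal names are assumed.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled"})


def poll_job(
    fetch_status: Callable[[str], str],
    job_id: str,
    *,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal status or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status(job_id)
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without a running backend; a backtest caller would simply pass a larger `timeout_s`.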

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Ensure generated price history rows never violate the valid_from/valid_to check constraint by skipping zero-length initial windows. | Introduce a local valid_to variable when a price change triggers. Guard the append of the previous price window with a valid_to >= current_valid_from check to avoid degenerate ranges. Only update current_price and current_valid_from when a non-degenerate window was persisted. | `app/shared/seeder/generators/facts.py` |
| Add a localhost-only Phase 2 enrichment script that backfills lifecycle, replenishment, exogenous signals, and returns data against an existing seeded schema. | Wire up async SQLAlchemy session creation using DATABASE_URL from settings and hard-refuse if the URL is not localhost/127.0.0.1. Compute lifecycle attributes per product deterministically from a seeded RNG and update Product rows in-place. Generate replenishment events, exogenous signals, and sales return records via existing generator classes and bulk-insert them in chunks for performance. Expose a small argparse CLI to control RNG seed and returns probability. | `scripts/seed_phase2_only.py` |
| Add a script to backfill realistic historical model activity by driving train/predict/backtest and batch jobs through the public /jobs and /batch/forecasting APIs. | Define fixed (store, product) pairs, cutoffs, and baseline model types to span 2024–2026 training windows. Implement helpers to submit and poll jobs generically, then orchestrate train→predict→backtest flows sequentially for a small matrix of scenarios. Optionally submit a small batch forecasting job and emit concise console summaries of counts and statuses. Provide a simple CLI flag for the API base URL. | `scripts/seed_historical_activity.py` |
| Add a registry seeding script that turns completed train jobs into registry runs with deterministic stub metrics and aliasing. | Fetch paginated completed train jobs via /jobs and filter to supported baseline model types. Locate and copy each job's trained model artifact into the registry artifact root, hashing contents to populate artifact metadata. Create, transition, and finalize /registry/runs entries with stubbed metrics seeded from job run_id for deterministic but varied values. Select lowest-WAPE runs per (store, product) for the latest cutoff and create champion/challenger aliases via /registry/aliases. Expose a CLI for configuring the API base URL and use settings.registry_artifact_root for artifact placement. | `scripts/seed_registry_from_jobs.py` |
| Align the uv.lock dependency on forecastlabai with the already-bumped project version. | Update the forecastlabai locked version from 0.2.18 to 0.2.19 in uv.lock to match the release-please bump and keep environments reproducible. | `uv.lock` |
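
The hard refusal on non-local DATABASE_URL values in scripts/seed_phase2_only.py could be implemented along these lines. This is a minimal sketch under assumptions: `is_local_database_url` and `require_local_database` are hypothetical helper names, and the real script may parse the URL differently (for example through SQLAlchemy). Only the standard library is used here.

```python
from urllib.parse import urlsplit

# Hosts the PR description names as acceptable targets.
ALLOWED_HOSTS = frozenset({"localhost", "127.0.0.1"})


def is_local_database_url(url: str) -> bool:
    """Return True only when the URL's host component is exactly a local host.

    Comparing the parsed hostname avoids substring false positives such as
    "localhost.evil.example".
    """
    host = urlsplit(url).hostname
    return host in ALLOWED_HOSTS


def require_local_database(url: str) -> None:
    """Abort the script before any write if the target is not local."""
    if not is_local_database_url(url):
        raise SystemExit(
            "Refusing to run: DATABASE_URL must point at localhost / 127.0.0.1"
        )
```

Calling `require_local_database(...)` before the engine is created means a misconfigured environment fails fast, before any session is opened.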

Possibly linked issues

chore(data): local demo enrichment tooling + seeder price-history fix #297: PR exactly implements the issue’s three buckets: seeder bugfix, enrichment scripts, and uv.lock refresh.

coderabbitai · 2026-05-26T04:03:34Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

In fetch_completed_train_jobs, the pagination break condition if page * len(jobs) >= total can stop too early when the last page is partially filled; consider using the configured page_size (or an accumulated count) instead of page * len(jobs) to decide when to exit.
In seed_phase2_only.chunked, the use of the new generic function syntax def chunked[U](...) ties the script to Python 3.12+; if this repo targets earlier versions, switch to the TypeVar pattern (def chunked(items: list[U], size: int) -> Iterator[list[U]]) for broader compatibility.
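
For reference, the `TypeVar` form mentioned in the second point would look roughly like this. It is a sketch of the suggested signature only; the body and the size check are assumptions about what the script's `chunked` does, and the PEP 695 spelling `def chunked[U](...)` is equivalent on Python 3.12+.

```python
from collections.abc import Iterator
from typing import TypeVar

U = TypeVar("U")


def chunked(items: list[U], size: int) -> Iterator[list[U]]:
    """Yield consecutive slices of at most `size` items, preserving order."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start : start + size]
```

A bulk-insert loop would then iterate `for batch in chunked(rows, 1000): ...`, where the batch size is an arbitrary example value.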

## Individual Comments

### Comment 1
<location path="scripts/seed_registry_from_jobs.py" line_range="102-111" />
<code_context>
+    model_type = str(params.get("model_type", ""))
+    if model_type not in {"naive", "seasonal_naive", "moving_average"}:
+        return None  # only baselines for this backfill
+    source_path = Path(str(result.get("model_path", "")))
+    if not source_path.exists():
+        # try relative-to-cwd
+        rel = Path.cwd() / source_path
+        if rel.exists():
+            source_path = rel
+        else:
+            return None
+    forecast_run_id = str(result.get("run_id", ""))
+    artifact_uri = f"backfill/{model_type}-{source_path.stem}.joblib"
+    dest = registry_root / artifact_uri
+    dest.parent.mkdir(parents=True, exist_ok=True)
+    if not dest.exists():
+        shutil.copy2(source_path, dest)
+    raw = dest.read_bytes()
+    artifact_hash = hashlib.sha256(raw).hexdigest()
</code_context>
<issue_to_address>
**issue (bug_risk):** Handling of missing or non-file `model_path` values can misbehave when the value is empty or points to a directory.

If `model_path` is missing or empty, `Path("")` resolves to the current directory and `exists()` is True, so `shutil.copy2` will be called on a directory and fail. The same applies if `model_path` points to an existing directory. Consider requiring a non-empty path and checking `source_path.is_file()` (and likewise for the `rel` candidate) before copying; otherwise, skip this entry.
</issue_to_address>
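
A stricter path check along the lines of this comment might look like the sketch below. `resolve_model_path` and its `base_dir` parameter are hypothetical; the point is only that an empty value is rejected up front and that `is_file()` replaces `exists()` for both candidates.

```python
from pathlib import Path
from typing import Optional


def resolve_model_path(raw: object, base_dir: Optional[Path] = None) -> Optional[Path]:
    """Return a path to an existing regular file, or None if there is none.

    Path("") normalizes to the current directory, so an empty value must be
    rejected before any filesystem check.
    """
    text = "" if raw is None else str(raw).strip()
    if not text:
        return None
    candidate = Path(text)
    if candidate.is_file():
        return candidate
    # Fall back to a path relative to base_dir (the working directory by default).
    relative = (base_dir or Path.cwd()) / candidate
    if relative.is_file():
        return relative
    return None
```

Returning `None` for directories and missing values lets the caller skip the job instead of failing inside `shutil.copy2`.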

### Comment 2
<location path="scripts/seed_registry_from_jobs.py" line_range="120-129" />
<code_context>
+    artifact_hash = hashlib.sha256(raw).hexdigest()
+
+    # (a) create
+    r = await client.post(
+        "/registry/runs",
+        json={
+            "model_type": model_type,
+            "model_config": _model_config_payload(model_type),
+            "feature_config": None,
+            "data_window_start": str(params.get("start_date")),
+            "data_window_end": str(params.get("end_date")),
+            "store_id": int(params["store_id"]),
+            "product_id": int(params["product_id"]),
+            "agent_context": None,
+            "git_sha": None,
+        },
+    )
+    if r.status_code >= 400:
+        # duplicate config_hash → idempotent skip
+        return None
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Treating all non-2xx `/registry/runs` responses as duplicate-config skips can hide real errors.

`status_code >= 400` is broader than the duplicate-config case and will also catch unrelated 4xx/5xx errors, which then get treated as idempotent skips. That can hide real failures like registry downtime or validation errors. Please either detect the actual duplicate condition (e.g., specific status code or error payload) or at least log/raise on unexpected 4xx/5xx responses so operational issues are visible.

Suggested implementation:

```python
    artifact_hash = hashlib.sha256(raw).hexdigest()

    # (a) create
    r = await client.post(
        "/registry/runs",
        json={
            "model_type": model_type,
            "model_config": _model_config_payload(model_type),
            "feature_config": None,
            "data_window_start": str(params.get("start_date")),
            "data_window_end": str(params.get("end_date")),
            "store_id": int(params["store_id"]),
            "product_id": int(params["product_id"]),
            "agent_context": None,
            "git_sha": None,
        },
    )
    if r.status_code == 409:
        # duplicate config_hash → idempotent skip
        return None
    if r.status_code >= 400:
        # non-duplicate 4xx/5xx should surface as errors
        try:
            error_detail = r.json()
        except Exception:
            error_detail = r.text
        raise RuntimeError(
            f"Failed to create registry run (status {r.status_code}): {error_detail}"
        )
```

If your registry API uses a different status code or response shape to indicate "duplicate config" (e.g. 422 with a specific error code in the JSON body), update the `if r.status_code == 409:` condition to match that contract, or refine the check using `error_detail`. Also ensure that `client`, `_model_config_payload`, and `params` are in scope in this function as expected.
</issue_to_address>

Three corrections to register_one and fetch_completed_train_jobs:

* pagination: `page * len(jobs) >= total` stops too early when the last page is partial. Switch to accumulated-count + short-page detection (exit when len(jobs) < page_size or len(out) >= total).
* model_path validation: empty / directory paths slipped through because Path("") resolves to cwd and Path.exists() returns True for directories. Require non-empty path and Path.is_file() for both the raw and cwd-relative candidates.
* duplicate detection: `r.status_code >= 400` blanket-swallowed registry downtime and validation errors as idempotent skips. Narrow the skip to HTTP 409 (the actual DuplicateRunError code per registry/routes.py:113) and raise RuntimeError on other 4xx / 5xx with the response body for diagnostics.

Python 3.12-only `def chunked[U](...)` syntax in seed_phase2_only.py is intentional: `pyproject.toml:6` already pins `requires-python = ">=3.12"`.
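
The pagination exit rule described in this commit (accumulated count plus short-page detection) can be sketched independently of the HTTP client. `fetch_all_jobs` and the injected `fetch_page` callable are hypothetical stand-ins for the script's `fetch_completed_train_jobs` and its GET /jobs call; the response is assumed to expose the page's jobs and a total count.

```python
from typing import Any, Callable

Job = dict[str, Any]
# fetch_page(page, page_size) -> (jobs on that page, total matching jobs)
FetchPage = Callable[[int, int], tuple[list[Job], int]]


def fetch_all_jobs(fetch_page: FetchPage, page_size: int = 50) -> list[Job]:
    """Collect every page, stopping on a short page or once `total` is reached."""
    out: list[Job] = []
    page = 1
    while True:
        jobs, total = fetch_page(page, page_size)
        out.extend(jobs)
        # A short (or empty) page means nothing follows; the accumulated count
        # covers the case where the last page is exactly full.
        if len(jobs) < page_size or len(out) >= total:
            break
        page += 1
    return out
```

Both conditions are needed: the short-page test ends the loop when the server's total is stale or missing rows, and the accumulated-count test avoids one extra request when the total is an exact multiple of the page size.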

sourcery-ai Bot reviewed May 26, 2026

w7-mgfcode merged commit 26a105a into dev May 26, 2026
8 checks passed

w7-mgfcode deleted the chore/local-demo-enrichment-tools branch May 26, 2026 04:17

w7-mgfcode merged 2 commits into dev from chore/local-demo-enrichment-tools
