feat(cli): add --cloud mode to recce init for CLL pre-computation (#1284)
Conversation
Add `recce init --cloud --session-id <id>` that:

1. Downloads manifests + catalogs from the Recce Cloud session
2. Downloads the existing CLL cache (current session → base session fallback)
3. Computes per-node CLL and builds the full CLL map
4. Uploads `cll_map.json` + `cll_cache.db` back to the session S3 bucket

This enables the Cloud instance to pre-compute CLL data so the `/cll` endpoint can serve it without a running Recce instance (DRC-3183). The cache fallback chain (current → base → scratch) means the 200s+ cold start only happens once per project, on production metadata upload; subsequent PR sessions reuse the warm cache.

Also adds `cll_map.json` generation to local (non-cloud) `recce init`, saved alongside the SQLite cache for local development use.

Resolves DRC-3181

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
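The cache fallback chain (current session → base session → scratch) can be sketched as follows. This is an illustrative helper, not the actual CLI code; the dict keys and function name are assumptions:

```python
import requests

def fetch_cll_cache(cache_urls, dest):
    """Try presigned cache URLs in fallback order: current session first,
    then base session. Returns the source used, or None for a cold start.
    Hypothetical helper; key names are assumptions, not recce's API."""
    for source in ("current", "base"):
        url = cache_urls.get(source)
        if not url:
            continue
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            continue  # fall through to the next candidate
        if resp.status_code == 200 and resp.content:
            with open(dest, "wb") as f:
                f.write(resp.content)
            return source
    return None  # "scratch": cold start, pays the 200s+ CLL computation once
```

A `None` result is what triggers the one-time cold start described above; any later PR session finds a warm cache under `current` or `base`.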
- Handle `get_session` 403 errors explicitly (previously produced a misleading "missing org_id" error instead of "access denied")
- Fix the `state_file_host` override (was setting a nonexistent `.host` attribute; now correctly overrides `base_url` and `base_url_v2`)
- Wrap `get_download_urls` and `get_base_session_download_urls` in try/except for graceful error handling
- Remove a duplicate `import requests`

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: even-wei <evenwei@infuseai.io>
Pull request overview
Adds a new cloud-aware initialization workflow to the Recce CLI so CLL (column-level lineage) artifacts can be precomputed and made available in Recce Cloud before a Recce server instance is running.
Changes:
- Extend `recce init` with `--cloud` + `--session-id` to download dbt artifacts from Recce Cloud, warm-start the CLL cache, and upload results back via presigned URLs.
- Generate and persist a full `cll_map.json` during `recce init` (local and cloud modes).
- Add CLI test coverage for `recce init --cloud` flows (missing args, download/upload failures, map build failures).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `recce/cli.py` | Implements `recce init --cloud` artifact download/cache warm-start, full CLL map generation, and optional Cloud upload. |
| `tests/test_cli_cache.py` | Adds tests covering cloud-mode init argument validation and common failure/edge scenarios. |
```python
console.print(f"[bold]Cloud mode[/bold]: session {session_id}")

# Get session info
session_info = cloud_client.get_session(session_id)
```
cloud_client.get_session(session_id) can raise RecceCloudException (e.g., non-200/non-403 responses, network issues). Currently this call is not wrapped, so recce init --cloud may crash with an unhandled exception instead of producing a clean CLI error and exit code 1. Catch RecceCloudException here and print a user-friendly error before exiting.
Suggested change:

```python
try:
    session_info = cloud_client.get_session(session_id)
except RecceCloudException as e:
    console.print(f"[[red]Error[/red]] Failed to get session: {e}")
    exit(1)
```
```python
url = download_urls.get(artifact_key)
if url:
    resp = requests.get(url)
    if resp.status_code == 200:
        (target_path / filename).write_bytes(resp.content)
```
Artifact downloads use requests.get(url) and immediately access resp.content. For potentially large manifests/catalogs this is workable, but for large artifacts (and especially when similar logic is used for the cache) it can cause high memory usage and hangs because no timeout is set. Use a reasonable timeout and stream responses to disk (e.g., stream=True + chunked writes), and handle requests.RequestException explicitly so failures are reported cleanly.
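A sketch of the streamed, timeout-bounded download the comment asks for. The function name and chunk size are illustrative, not the PR's actual code:

```python
import requests

def download_artifact(url, target_path, timeout=30.0):
    """Stream a (possibly large) artifact to disk instead of buffering
    resp.content in memory; report failures as a clean boolean."""
    try:
        with requests.get(url, stream=True, timeout=timeout) as resp:
            if resp.status_code != 200:
                return False
            with open(target_path, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                    f.write(chunk)
        return True
    except requests.RequestException:
        return False
```

`stream=True` defers the body download until `iter_content` is consumed, so peak memory stays at one chunk rather than the full artifact.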
```python
if cll_cache_url:
    resp = requests.get(cll_cache_url)
    if resp.status_code == 200 and len(resp.content) > 0:
        Path(cache_db).write_bytes(resp.content)
        console.print(f"  Downloaded CLL cache from session ({len(resp.content) / 1024 / 1024:.1f} MB)")
```
Downloading the existing cll_cache.db reads the entire body into memory via resp.content before writing it. Since the cache DB can be large, this can spike memory usage in CI/containers. Stream the response to a temporary file and move it into place on success (and include a timeout / exception handling).
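The temp-file-then-move pattern the comment suggests could look like this. This is a hedged sketch (names are illustrative); the key point is that `os.replace` only swaps the file in after a complete download, so a failure never leaves a truncated `cll_cache.db`:

```python
import os
import tempfile
import requests

def download_cache_atomically(url, dest, timeout=60.0):
    """Stream the cache DB to a temp file in the same directory, then
    move it into place on success. Returns True only on a full download."""
    try:
        with requests.get(url, stream=True, timeout=timeout) as resp:
            if resp.status_code != 200:
                return False
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
            try:
                with os.fdopen(fd, "wb") as f:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
                os.replace(tmp, dest)  # atomic within one filesystem
                return True
            except BaseException:
                os.unlink(tmp)  # never leave a partial temp file behind
                raise
    except requests.RequestException:
        return False
```

Creating the temp file in the destination's own directory matters: `os.replace` is only atomic when source and target are on the same filesystem.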
```python
with open(cll_map_path, "rb") as f:
    resp = requests.put(cll_map_upload_url, data=f, headers={"Content-Type": "application/json"})
    if resp.status_code in (200, 204):
        console.print(f"  Uploaded cll_map.json ({cll_map_path.stat().st_size / 1024 / 1024:.1f} MB)")
    else:
```
The upload path uses requests.put(...) without a timeout and doesn't catch requests.RequestException, so transient network issues can hang the command or produce noisy tracebacks. Add a timeout and handle request exceptions similarly to other cloud interactions so recce init --cloud fails/warns deterministically.
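A minimal sketch of the bounded, exception-safe upload being requested (helper name is hypothetical; presigned S3 PUTs typically return 200):

```python
import requests

def upload_artifact(path, url, content_type, timeout=60.0):
    """PUT a file to a presigned URL with a timeout, converting transient
    network failures into a clean boolean instead of a traceback."""
    try:
        with open(path, "rb") as f:
            resp = requests.put(url, data=f, headers={"Content-Type": content_type}, timeout=timeout)
        return resp.status_code in (200, 204)
    except requests.RequestException:
        return False
```

The caller can then decide per artifact whether a failed upload is a warning or a hard error, rather than letting the whole `recce init --cloud` run die mid-command.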
```python
elif not cll_cache_upload_url:
    logger.debug("No cll_cache_url in upload URLs — cache upload not supported yet")

console.print("[bold green]Cloud upload complete.[/bold green]")
```
Cloud upload complete. is printed unconditionally even if one or both uploads fail (HTTP 5xx). This is misleading for automation/CI logs. Track whether each upload succeeded and print a success message only when everything succeeded, otherwise print a completion-with-warnings message (or consider a non-zero exit code if uploads are required).
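One way to track per-upload outcomes and only claim success when everything succeeded, as the comment suggests. This is an illustrative helper, not the PR's code:

```python
def summarize_uploads(results):
    """Map of artifact name -> bool success. Only report 'complete' when
    every upload succeeded; otherwise name the failures for CI logs."""
    failed = [name for name, ok in results.items() if not ok]
    if not failed:
        return "Cloud upload complete."
    return "Cloud upload finished with failures: " + ", ".join(failed)
```

The caller could also use the non-empty `failed` list to choose a non-zero exit code when uploads are required.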
```python
console.print("\n[bold]Building full CLL map...[/bold]")
t_map_start = time.perf_counter()
try:
    full_cll_map = dbt_adapter.build_full_cll_map()
    cll_map_path = Path(cache_db).parent / "cll_map.json"
```
recce init now builds and writes cll_map.json, but there are no assertions in the existing init tests verifying that this file is produced (and has expected structure) for the non-cloud path. Add/extend a unit test in tests/test_cli_cache.py to assert cll_map.json is created next to --cache-db and is valid JSON, so this new behavior is protected against regressions.
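The assertion half of the suggested regression test could look like this; the `recce init` invocation itself (e.g. via Click's `CliRunner`) is elided, and the helper name is hypothetical:

```python
import json
from pathlib import Path

def assert_cll_map_written(cache_db):
    """Check that cll_map.json was created next to the --cache-db path
    and parses as valid JSON; return the parsed map for further checks."""
    cll_map_path = Path(cache_db).parent / "cll_map.json"
    assert cll_map_path.exists(), f"expected {cll_map_path} to be written by recce init"
    return json.loads(cll_map_path.read_text())
```

A structure assertion (e.g. that keys are node IDs) could be layered on top once the expected `cll_map.json` schema is pinned down.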
PR checklist

What type of PR is this?

Feature -- adds `recce init --cloud` for CLL pre-computation

What this PR does / why we need it:

Adds `--cloud` mode to `recce init` so it can download artifacts from Recce Cloud, compute CLL, and upload results (`cll_map.json` + `cll_cache.db`) back to the session S3 bucket. This enables the Cloud server to serve `/cll` data before a Recce instance is available (DRC-3183).

Key behaviors:
- `cll_map.json`
- `cll_map.json` generation added to local (non-cloud) `recce init`

Which issue(s) this PR fixes:

Resolves DRC-3181

Special notes for your reviewer:

The `--cloud` flag reuses the same `recce_cloud_options` as `recce server --cloud`

Does this PR introduce a user-facing change?:

Generated with Claude Code