
feat(sync): dispatch v2 sync pipeline via Cloud Tasks #7801

Open
mdmohsin7 wants to merge 14 commits into main from feat/sync-cloud-tasks


@mdmohsin7

Member

Problem

The /v2/sync-local-files pipeline (decode → VAD → fair-use → STT → LLM) runs as a fire-and-forget asyncio task on whichever backend-sync instance received the upload. Consequences:

  • Invisible to the Cloud Run autoscaler — it scales on request concurrency, but the requests end at the 202. Instances grind up to 16 pipelines each while Cloud Run sees them as idle, so sync bursts overload the service (the min-10 instance floor exists to compensate).
  • Jobs die silently on scale-in/SIGTERM (10s drain, no redelivery).
  • No retries, no fleet-wide balancing, no backpressure — only a per-instance semaphore.

Change

With SYNC_DISPATCH_MODE=cloud_tasks, the fast path stages the raw .bin files in the syncing-local GCS bucket (blob name = local path, the pipeline's existing convention), enqueues one named Cloud Task per job (task id = job_id, enqueue-side dedup), and returns the same 202 {job_id}. The task POSTs back to a new OIDC-verified POST /v2/sync-jobs/run on the same service, which runs the same pipeline code inside the request — autoscaler-visible, durable, retried with backoff, rate-limitable at the queue.
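For illustration, the named task described above could be assembled roughly as follows. This is a minimal sketch, not the PR's `enqueue_sync_job`: the helper name `build_sync_task`, the payload shape, and the choice of the handler URL as the OIDC audience are assumptions. The field layout follows the Cloud Tasks v2 Task resource.

```python
import json
from typing import Any, Dict


def build_sync_task(
    project: str,
    location: str,
    queue: str,
    job_id: str,
    handler_url: str,
    invoker_sa: str,
    payload: Dict[str, Any],
) -> Dict[str, Any]:
    """Build a task dict for one sync job (hypothetical helper)."""
    parent = f"projects/{project}/locations/{location}/queues/{queue}"
    return {
        # Named task: enqueueing the same job_id twice is rejected by the
        # queue, which is what provides enqueue-side dedup.
        "name": f"{parent}/tasks/{job_id}",
        "http_request": {
            # With the Python client this is usually the HttpMethod.POST enum.
            "http_method": "POST",
            "url": handler_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode("utf-8"),
            # Cloud Tasks mints an OIDC token for this service account.
            # Using the handler URL as audience is an assumption here.
            "oidc_token": {
                "service_account_email": invoker_sa,
                "audience": handler_url,
            },
        },
    }


task = build_sync_task(
    "my-project", "us-central1", "sync-jobs", "job-123",
    "https://example.invalid/v2/sync-jobs/run",
    "invoker@my-project.iam.gserviceaccount.com",
    {"job_id": "job-123"},
)
print(task["name"])
```

With the `google-cloud-tasks` client, a dict like this would typically be passed to `CloudTasksClient.create_task` together with the queue path as `parent`.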

Zero app changes. Endpoint contract, multipart format, conversation_id param, 202+poll flow, and /v1/sync-local-files are untouched.

Inline path is preserved and used when: the flag is off (default), the request carries BYOK headers (keys are request-scoped and cannot follow a task), or staging/enqueue fails (automatic fallback — a Cloud Tasks outage degrades to today's behavior).
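The three inline conditions can be read as a small decision function. The sketch below is illustrative only: the BYOK header names and the `stage_and_enqueue` / `run_inline` callables are hypothetical, since the PR text does not list the actual headers or helper names.

```python
from typing import Callable, Mapping

# Hypothetical header names; the actual BYOK headers are not listed in this PR.
BYOK_HEADERS = ("x-byok-stt-key", "x-byok-llm-key")


def choose_dispatch_mode(env: Mapping[str, str], headers: Mapping[str, str]) -> str:
    """Return 'cloud_tasks' only when the flag is on and no BYOK header is present."""
    if env.get("SYNC_DISPATCH_MODE", "inline") != "cloud_tasks":
        return "inline"  # flag off is the default
    present = {name.lower() for name in headers}
    if any(h in present for h in BYOK_HEADERS):
        return "inline"  # request-scoped keys cannot follow a task
    return "cloud_tasks"


def dispatch(
    env: Mapping[str, str],
    headers: Mapping[str, str],
    stage_and_enqueue: Callable[[], None],
    run_inline: Callable[[], None],
) -> str:
    """Try the Cloud Tasks path, falling back to inline on any failure."""
    if choose_dispatch_mode(env, headers) == "cloud_tasks":
        try:
            stage_and_enqueue()
            return "cloud_tasks"
        except Exception:
            pass  # staging/enqueue failed: degrade to the inline path
    run_inline()
    return "inline"
```

Either way the caller returns the same 202 with the job id, so the app cannot tell which path was taken.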

Idempotency / failure handling

  • Per-job run lock (Redis, fail-closed, TTL 1800s > the 1500s request cap, so it can never expire under a live run). Duplicate deliveries get 409 and re-check terminal status later.
  • Terminal jobs are never re-run — duplicate delivery, stale-detector-failed, or DG-budget-failed jobs are acked and their staged blobs deleted.
  • Queued-reset before retryable 500s so the 600s stale detector can't terminally fail a job while the retry backoff elapses.
  • Final attempt (X-CloudTasks-TaskRetryCount) marks the job failed, deletes blobs, returns 200 (consume).
  • Processed-segment ledger — retries skip segments whose conversation writes already landed (timestamp dedup remains as second line).
  • Metering once-guards: record_speech_ms / record_dg_usage_ms / record_usage fire at most once per job across retries (no fair-use double-counting).
  • Staged-blob cleanup on every terminal outcome; the bucket's existing 1-day lifecycle rule covers hard crashes. Per-segment wav cleanup (480s post-STT) unchanged.
  • Expired blobs (task dispatched >24h after enqueue): handler marks the job failed and consumes; the app re-uploads (it keeps local files until terminal status).
  • Timeout middleware: new per-path override — /v2/sync-jobs/run gets HTTP_SYNC_JOBS_RUN_TIMEOUT (default 1500s) instead of the 120s default that would otherwise kill every attempt.
  • OIDC fails closed when env is unset: backend/backend-integration run the same image but reject all task traffic.
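The response codes in the list above can be summarised as a pure decision function. This is a sketch under stated assumptions, not the handler's code: the status names, the `Outcome` type, and the way the final attempt is derived from `X-CloudTasks-TaskRetryCount` (which is 0 on the first attempt) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

TERMINAL_STATUSES = {"completed", "failed"}  # assumed status names


@dataclass(frozen=True)
class Outcome:
    http_status: int
    action: str


def decide_outcome(
    lock_acquired: bool,
    job_status: str,
    blobs_present: bool,
    pipeline_ok: Optional[bool],
    retry_count: int,
    max_attempts: int,
) -> Outcome:
    """Map handler state to the HTTP status returned to Cloud Tasks."""
    if not lock_acquired:
        return Outcome(409, "duplicate_delivery")  # another run holds the lock
    if job_status in TERMINAL_STATUSES:
        return Outcome(200, "ack_and_delete_blobs")  # never re-run terminal jobs
    if not blobs_present:
        return Outcome(200, "mark_failed_staged_audio_expired")
    if pipeline_ok:
        return Outcome(200, "done_and_delete_blobs")
    # Assumption: attempt number = retry count + 1.
    if retry_count + 1 >= max_attempts:
        return Outcome(200, "mark_failed_and_delete_blobs")  # consume the task
    # Reset to queued first so the stale detector ignores the backoff window.
    return Outcome(500, "queued_reset_then_retry")
```

Returning 200 on the final attempt consumes the task; only the retryable branch returns 500, which is what triggers the queue's backoff.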

Tests

  • New tests/unit/test_sync_cloud_tasks.py (24 tests): lock semantics incl. fail-closed Redis errors, queued-reset, ledger, once-guards, OIDC rejection matrix, structural contract. Registered in test.sh.
  • Existing sync suites restored to main baseline (loader stubs updated for the new imports).
  • scripts/scan_async_blockers.py: no findings in changed files.

🚀 Deployment (ordered — the code ships inert)

The default is SYNC_DISPATCH_MODE=inline, so merging + deploying changes nothing until step 4.

1. Merge + deploy (regular merge, no squash):

gh workflow run gcp_backend.yml -f environment=prod -f branch=main

2. One-time infra (Cloud Tasks API is currently NOT enabled on based-hardware):

gcloud services enable cloudtasks.googleapis.com

gcloud tasks queues create sync-jobs --location=us-central1 \
  --max-concurrent-dispatches=96 --max-dispatches-per-second=10 \
  --max-attempts=5 --min-backoff=60s --max-backoff=300s

gcloud iam service-accounts create sync-tasks-invoker

# backend-sync runtime SA needs to enqueue + mint OIDC tokens for the invoker SA:
#   roles/cloudtasks.enqueuer (project) and roles/iam.serviceAccountUser on sync-tasks-invoker
# invoker SA needs roles/run.invoker on backend-sync (if the service requires auth)

# REQUIRED CHECK: backend-sync ingress must be "all" — Cloud Tasks push traffic
# does not count as internal; "internal and cloud load balancing" will 403 every dispatch.
gcloud run services describe backend-sync --region us-central1 --format="value(metadata.annotations.'run.googleapis.com/ingress')"

3. Env vars on backend-sync only (deploy workflow preserves env vars; one-time update):

gcloud run services update backend-sync --region us-central1 --update-env-vars \
SYNC_DISPATCH_MODE=inline,\
SYNC_TASKS_PROJECT=based-hardware,\
SYNC_TASKS_LOCATION=us-central1,\
SYNC_TASKS_QUEUE=sync-jobs,\
SYNC_TASKS_HANDLER_URL=https://backend-sync-hhibjajaja-uc.a.run.app/v2/sync-jobs/run,\
SYNC_TASKS_INVOKER_SA=sync-tasks-invoker@based-hardware.iam.gserviceaccount.com,\
SYNC_TASKS_MAX_ATTEMPTS=5,\
HTTP_SYNC_JOBS_RUN_TIMEOUT=1500

Invariants to keep in sync: SYNC_TASKS_MAX_ATTEMPTS must mirror the queue's --max-attempts; HTTP_SYNC_JOBS_RUN_TIMEOUT (1500) must stay below the 1800s run-lock TTL. Do NOT set the SYNC_TASKS_* vars on backend/backend-integration — unset env keeps the handler inert there.
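These two invariants are easy to drift out of sync, so a startup or CI check may be worthwhile. The following is a hypothetical sketch (no such check is part of this PR); `check_sync_invariants` and its arguments are illustrative names.

```python
from typing import List, Mapping

RUN_LOCK_TTL_S = 1800  # run-lock TTL stated in the PR description


def check_sync_invariants(
    env: Mapping[str, str],
    queue_max_attempts: int,
    run_lock_ttl_s: int = RUN_LOCK_TTL_S,
) -> List[str]:
    """Return human-readable problems; an empty list means the config is consistent."""
    problems: List[str] = []

    attempts = int(env.get("SYNC_TASKS_MAX_ATTEMPTS", "0"))
    if attempts != queue_max_attempts:
        problems.append(
            f"SYNC_TASKS_MAX_ATTEMPTS={attempts} does not mirror "
            f"queue --max-attempts={queue_max_attempts}"
        )

    timeout_s = float(env.get("HTTP_SYNC_JOBS_RUN_TIMEOUT", "1500"))
    if timeout_s >= run_lock_ttl_s:
        problems.append(
            f"HTTP_SYNC_JOBS_RUN_TIMEOUT={timeout_s:g}s must stay below "
            f"the {run_lock_ttl_s}s run-lock TTL"
        )
    return problems


print(check_sync_invariants(
    {"SYNC_TASKS_MAX_ATTEMPTS": "5", "HTTP_SYNC_JOBS_RUN_TIMEOUT": "1500"},
    queue_max_attempts=5,
))
```

The queue's `--max-attempts` value would have to be supplied by the caller, for example read from the queue description at deploy time.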

4. Smoke test, then flip:

# smoke: upload a small WAL sync from the app (inline mode still active), confirm unchanged behavior
gcloud run services update backend-sync --region us-central1 --update-env-vars SYNC_DISPATCH_MODE=cloud_tasks

Watch: queue depth (cloudtasks.googleapis.com/queue/depth), handler 5xx ratio on /v2/sync-jobs/run, job terminal-status mix, and failed_final/staged_audio_expired log events. Suggested alert: queue depth > 200 for 15 min.

5. Rollback = flip the flag back:

gcloud run services update backend-sync --region us-central1 --update-env-vars SYNC_DISPATCH_MODE=inline

In-flight queued tasks still drain through the handler — harmless.

6. Later (separate change): once autoscaling-on-real-load is proven for a week, lower backend-sync minScale 10 → 2 ($1k/month at current always-on pricing).

🤖 Generated with Claude Code

mdmohsin7 added 12 commits June 10, 2026 20:27
The Cloud Tasks sync-job handler runs the whole pipeline inside the
request and needs a cap above the 120s default. Path overrides take
precedence over method timeouts.

- per-job run lock (fail-closed, compare-and-delete release) so
  duplicate task deliveries can never run a job concurrently
- mark_job_queued_for_retry: stale-detector-exempt reset before retries
- processed-segment ledger so retries skip segments that already landed
- try_mark_once guards so fair-use/usage metering counts once per job

enqueue_sync_job creates one named HTTP task per job (task id = job_id
for enqueue-side dedup) with an OIDC token for the invoker SA.
verify_cloud_tasks_oidc fails closed when SYNC_TASKS_* env is unset so
services sharing the image never accept task traffic.

Blob name = local relative path (existing pipeline convention).
download returns False on NotFound so the handler can consume tasks
whose blobs were removed by the bucket's 1-day lifecycle rule.

The pipeline previously ran as a fire-and-forget asyncio task on the
instance that received the upload: invisible to the Cloud Run
autoscaler, killed on scale-in, no retries, no fleet-wide balancing.

Fast path now stages raw .bin files in GCS and enqueues one Cloud Task
per job (SYNC_DISPATCH_MODE=cloud_tasks); the task POSTs back to
/v2/sync-jobs/run which runs the same pipeline inside the request.
Inline path is kept for rollback, BYOK requests (header-scoped keys
cannot follow a task), and enqueue failures.

Handler semantics: per-job run lock (409 while held), terminal jobs
acked without re-running, queued-reset before retryable 500s so the
stale detector cannot kill jobs during backoff, final attempt marks
failed and consumes. Staged blobs are deleted on every terminal
outcome; the bucket's 1-day lifecycle covers hard crashes. Metering
once-guards and a processed-segment ledger make retries idempotent.
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Contributor

Greptile Summary

This PR replaces the fire-and-forget asyncio background task in /v2/sync-local-files with a Cloud Tasks-backed pipeline when SYNC_DISPATCH_MODE=cloud_tasks. It introduces staged-blob storage in GCS, named task enqueue with OIDC-verified delivery, and a comprehensive idempotency layer (run lock, queued-reset, processed-segment ledger, once-guards for metering).

  • New dispatch path (utils/cloud_tasks.py, routers/sync.py): uploads raw .bin files to GCS, enqueues one named task per job, and handles it at POST /v2/sync-jobs/run inside the request, making pipeline work autoscaler-visible. The inline path is preserved as default and automatic fallback.
  • Idempotency primitives (database/sync_jobs.py): Redis run-lock (fail-closed, TTL 1800s > request cap 1500s), queued-reset before retryable 500s, processed-segment ledger for retry skip, and SETNX once-guards on all three metering calls.
  • Timeout wiring (utils/other/timeout.py, main.py): per-path timeout override for /v2/sync-jobs/run using an explicit is not None guard, addressing the previous or-falsy concern.
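To make the idempotency primitives in the list above concrete, here is a minimal sketch of a fail-closed run lock with compare-and-delete release, plus a set-if-absent once-guard. The function names match those in the PR, but the signatures, key names, and the `InMemoryStore` stand-in for Redis are assumptions made so the example runs anywhere.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

RUN_LOCK_TTL_S = 1800  # from the PR: must exceed the 1500s request cap


class InMemoryStore:
    """Stand-in for Redis (not the PR's code)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def _live(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= time.monotonic():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[0]

    def set_nx(self, key: str, value: str, ttl_s: int) -> bool:
        # Mirrors Redis `SET key value NX EX ttl`: write only if absent.
        if self._live(key) is not None:
            return False
        self._data[key] = (value, time.monotonic() + ttl_s)
        return True

    def compare_and_delete(self, key: str, expected: str) -> bool:
        # In Redis this must be atomic, for example via a small Lua script.
        if self._live(key) != expected:
            return False
        del self._data[key]
        return True


def try_acquire_job_run_lock(store, job_id: str) -> Optional[str]:
    token = secrets.token_hex(16)
    try:
        acquired = store.set_nx(f"sync_job_run_lock:{job_id}", token, RUN_LOCK_TTL_S)
    except Exception:
        return None  # fail closed: a store error means "do not run"
    return token if acquired else None


def release_job_run_lock(store, job_id: str, token: str) -> bool:
    # Only the holder's token can release, so a late finisher cannot
    # delete a lock that a newer run has since acquired.
    return store.compare_and_delete(f"sync_job_run_lock:{job_id}", token)


def try_mark_once(store, job_id: str, marker: str, ttl_s: int = 86400) -> bool:
    # True only for the first caller; later retries skip the metering call.
    return store.set_nx(f"sync_job_once:{job_id}:{marker}", "1", ttl_s)
```

A handler would call the metering function only when `try_mark_once` returns True, and release the lock in a `finally` block with the token it was given.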

Confidence Score: 5/5

Safe to merge. The default SYNC_DISPATCH_MODE=inline means deploying changes nothing until the flag is flipped; the new Cloud Tasks path is fully gated and falls back to inline on any enqueue failure.

All three concerns from the previous review cycle are addressed. The only new observations are a partial-download cleanup gap that is practically unreachable (all blobs in a job expire simultaneously), and a lazy-singleton init pattern harmless under the GIL. Flag-based rollout and automatic inline fallback make the change low-risk.

No files require special attention - routers/sync.py has the most new logic but retry/idempotency paths are well-tested by 24 new unit tests.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/utils/cloud_tasks.py | New module: Cloud Tasks dispatch (enqueue_sync_job) and OIDC verification (verify_cloud_tasks_oidc). Lazy client singletons, env-driven fail-closed design, structured logging on token verification. |
| backend/routers/sync.py | Adds Cloud Tasks dispatch path, task-mode branch in the pipeline, run_sync_job handler with lock/idempotency/retry logic, and staged-blob helpers. Existing inline path and v1 endpoint unchanged. |
| backend/database/sync_jobs.py | Adds run-lock (fail-closed), queued-reset, processed-segment ledger, and once-guard primitives with correct Redis semantics (compare-and-delete, SETNX, TTL alignment). |
| backend/utils/other/timeout.py | Extends TimeoutMiddleware with per-path timeout overrides; uses explicit is not None guard and skips None entries in paths_timeout. |
| backend/tests/unit/test_sync_cloud_tasks.py | 24 tests covering lock semantics, queued-reset, processed-segment ledger, once-guards, OIDC verification matrix, and structural contracts. |

Sequence Diagram

sequenceDiagram
    participant App
    participant SyncEndpoint as POST /v2/sync-local-files
    participant GCS as GCS syncing-local
    participant CT as Cloud Tasks queue
    participant Handler as POST /v2/sync-jobs/run
    participant Redis
    participant Pipeline as Pipeline decode-VAD-STT-LLM

    App->>SyncEndpoint: multipart upload
    SyncEndpoint->>GCS: stage raw bin blobs
    SyncEndpoint->>CT: enqueue named task job_id
    SyncEndpoint-->>App: 202 job_id

    CT->>Handler: POST payload plus OIDC token
    Handler->>Handler: verify_cloud_tasks_oidc
    Handler->>Redis: try_acquire_job_run_lock
    Redis-->>Handler: token or None
    Handler->>Redis: get_sync_job
    Redis-->>Handler: job status
    alt terminal status
        Handler->>GCS: delete staged blobs
        Handler-->>CT: 200 acked
    else normal run
        Handler->>GCS: download blobs to local paths
        Handler->>Pipeline: run pipeline task_mode True
        Pipeline->>Redis: mark_job_processing
        Pipeline->>Pipeline: decode VAD STT LLM
        Pipeline->>Redis: add_processed_segment
        Pipeline->>Redis: try_mark_once metering
        alt success
            Handler->>GCS: delete staged blobs
            Handler-->>CT: 200 done
        else retryable error
            Handler->>Redis: mark_job_queued_for_retry
            Handler-->>CT: 500 retry
        else final attempt
            Handler->>Redis: mark_job_failed
            Handler->>GCS: delete staged blobs
            Handler-->>CT: 200 failed_final
        end
    end
    Handler->>Redis: release_job_run_lock

Reviews (2): Last reviewed commit: "fix(sync): log OIDC verification failure..."

Comment thread backend/utils/other/timeout.py Outdated
Comment on lines +77 to +79
timeout = self.paths_timeout.get(request.url.path) or self.methods_timeout.get(
    request.method, self.default_timeout
)

Contributor

P2 Using or to select the path timeout means any falsy value (specifically 0.0) silently falls back to the method timeout instead of applying the configured value. If HTTP_SYNC_JOBS_RUN_TIMEOUT=0 were ever set, the sync-jobs path would get the POST method timeout (120 s) rather than 0 s, with no error or warning. An explicit is not None check matches the intent.

Suggested change
- timeout = self.paths_timeout.get(request.url.path) or self.methods_timeout.get(
-     request.method, self.default_timeout
- )
+ path_timeout = self.paths_timeout.get(request.url.path)
+ timeout = path_timeout if path_timeout is not None else self.methods_timeout.get(
+     request.method, self.default_timeout
+ )
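The difference between the two lookups is easiest to see in isolation. The sketch below extracts the selection logic into free functions; `resolve_timeout` and `resolve_timeout_with_or` are hypothetical names, not part of TimeoutMiddleware.

```python
from typing import Mapping, Optional


def resolve_timeout(
    path: str,
    method: str,
    paths_timeout: Mapping[str, Optional[float]],
    methods_timeout: Mapping[str, float],
    default_timeout: float,
) -> float:
    """A configured path timeout wins, even when it is 0.0."""
    path_timeout = paths_timeout.get(path)
    if path_timeout is not None:
        return path_timeout
    return methods_timeout.get(method, default_timeout)


def resolve_timeout_with_or(
    path: str,
    method: str,
    paths_timeout: Mapping[str, Optional[float]],
    methods_timeout: Mapping[str, float],
    default_timeout: float,
) -> float:
    """The original `or` form: any falsy path timeout is silently skipped."""
    return paths_timeout.get(path) or methods_timeout.get(method, default_timeout)


paths = {"/v2/sync-jobs/run": 0.0}
methods = {"POST": 120.0}
print(resolve_timeout("/v2/sync-jobs/run", "POST", paths, methods, 60.0))
print(resolve_timeout_with_or("/v2/sync-jobs/run", "POST", paths, methods, 60.0))
```

The first call returns the configured 0.0, while the `or` variant falls through to the 120.0 POST timeout.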

Comment on lines +121 to +124
try:
    claims = id_token.verify_oauth2_token(auth_header[len('Bearer ') :], _get_auth_request(), audience=audience)
except Exception:
    raise HTTPException(status_code=403, detail='Invalid OIDC token')

Contributor

P2 The bare except Exception: catches both genuine token-validation failures and transient errors (e.g. a network timeout while fetching Google's JWKS endpoint), then raises the same 403 without logging the underlying cause. When cert-fetch latency spikes, operators would see a flood of "Invalid OIDC token" 403s with nothing in the logs to distinguish them from actual bad tokens, making the failure very hard to triage.

Suggested change
- try:
-     claims = id_token.verify_oauth2_token(auth_header[len('Bearer ') :], _get_auth_request(), audience=audience)
- except Exception:
-     raise HTTPException(status_code=403, detail='Invalid OIDC token')
+ try:
+     claims = id_token.verify_oauth2_token(auth_header[len('Bearer ') :], _get_auth_request(), audience=audience)
+ except Exception as e:
+     logger.warning('OIDC token verification failed: %s', e)
+     raise HTTPException(status_code=403, detail='Invalid OIDC token')
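Put together with the fail-closed behaviour described in the PR, the verification flow might look like the sketch below. It is not the actual `verify_cloud_tasks_oidc`: the verifier is injected as a callable so the example runs without `google-auth`, and the choice of env vars for audience and caller, the `Forbidden` exception, and the `email` / `email_verified` claim checks are assumptions.

```python
import logging
from typing import Any, Callable, Dict, Mapping, Optional

logger = logging.getLogger("sync_tasks_oidc_sketch")


class Forbidden(Exception):
    """Stands in for HTTPException(status_code=403) in this sketch."""


def authorize_task_request(
    auth_header: Optional[str],
    env: Mapping[str, str],
    verify_token: Callable[[str, str], Dict[str, Any]],
) -> Dict[str, Any]:
    audience = env.get("SYNC_TASKS_HANDLER_URL")
    invoker = env.get("SYNC_TASKS_INVOKER_SA")
    if not audience or not invoker:
        # Fail closed: services sharing the image but lacking the env
        # vars reject all task traffic.
        raise Forbidden("task handler not configured")
    if not auth_header or not auth_header.startswith("Bearer "):
        raise Forbidden("missing bearer token")
    try:
        claims = verify_token(auth_header[len("Bearer "):], audience)
    except Exception as e:
        # Log the cause so cert-fetch failures are distinguishable
        # from genuinely bad tokens.
        logger.warning("OIDC token verification failed: %s", e)
        raise Forbidden("invalid OIDC token") from e
    if claims.get("email") != invoker or not claims.get("email_verified"):
        raise Forbidden("unexpected caller")
    return claims
```

In production the injected `verify_token` would wrap `id_token.verify_oauth2_token`, as in the snippet under review.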

Comment thread backend/routers/sync.py
Comment on lines +1798 to +1800
except Exception as e:
    logger.error(f'sync_v2: Cloud Tasks dispatch failed job={job_id}, falling back inline: {e}')
    start_background_task(_delete_staged_blobs_async(owned_paths), name=f'sync_unstage:{job_id}')

Contributor

P2 The except block deletes staged blobs unconditionally, but dispatched = True is set before the local-cleanup steps. If an exception were to propagate after that point (e.g. a future change makes _cleanup_files raise), this code would delete the blobs that the already-enqueued task is about to download, silently breaking the job — the task would run, find no blobs, and mark the job staged_audio_expired. Both _cleanup_files and shutil.rmtree(_, True) currently swallow errors, so this path is unreachable today, but the guard is cheap and keeps the invariant explicit.

Suggested change
- except Exception as e:
-     logger.error(f'sync_v2: Cloud Tasks dispatch failed job={job_id}, falling back inline: {e}')
-     start_background_task(_delete_staged_blobs_async(owned_paths), name=f'sync_unstage:{job_id}')
+ except Exception as e:
+     logger.error(f'sync_v2: Cloud Tasks dispatch failed job={job_id}, falling back inline: {e}')
+     if not dispatched:
+         start_background_task(_delete_staged_blobs_async(owned_paths), name=f'sync_unstage:{job_id}')

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Member Author

Skipping this one per project convention: we don't add guards for paths that are unreachable by design — _cleanup_files and shutil.rmtree(_, True) both swallow errors deliberately, so nothing after dispatched = True can raise. Please remember this preference for future reviews on this repo.

Contributor

Got it, makes sense. I've noted this for future reviews: View rule


@mdmohsin7

Member Author

@greptile-apps re-review
