feat(scratchnode): /ask operability telemetry query (PR C)#446

Merged
HomenShum merged 2 commits into main from feat/ask-observability on Jun 1, 2026

@HomenShum
Owner

PR C of the /ask production-grade sprint. Backend-only, additive (new query, no schema/contract change).

Why

Launch ops can't run /ask blind. getAskTelemetry(eventId) is a bounded, read-only aggregate over an event's answers surfacing the operate-the-launch signals.

What it returns

  • mode mix { provider, cache, deterministic, provider_fallback }
  • provider failure rate = provider_fallback / provider ATTEMPTS (cache + deterministic excluded — they never reached the provider)
  • quality pass rate + avg score, total est. cost (cents), avg provider latency, live-search count
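
The mode mix and failure-rate definition above can be sketched as a pair of pure functions. This is a minimal illustration, not the code in convex/events.ts: the names `AskMode`, `ModeMix`, `countModes`, and `providerFailureRate` are hypothetical, and treating "provider attempts" as `provider + provider_fallback` is an assumption inferred from the statement that cache and deterministic answers are excluded from the denominator.

```typescript
// Hypothetical names and shapes, for illustration only.
type AskMode = "provider" | "cache" | "deterministic" | "provider_fallback";
type ModeMix = Record<AskMode, number>;

// Tally how many answers were served in each mode.
function countModes(modes: AskMode[]): ModeMix {
  const mix: ModeMix = { provider: 0, cache: 0, deterministic: 0, provider_fallback: 0 };
  for (const m of modes) mix[m] += 1;
  return mix;
}

// Assumption: an "attempt" is any answer that reached the provider,
// i.e. a provider success or a provider_fallback.
function providerFailureRate(mix: ModeMix): number | null {
  const attempts = mix.provider + mix.provider_fallback;
  // No attempts means no denominator, so report null rather than 0.
  return attempts === 0 ? null : mix.provider_fallback / attempts;
}

// Invented sample data: 3 provider, 1 fallback, 2 cache, 1 deterministic.
const sample = countModes([
  "provider", "provider", "provider", "provider_fallback",
  "cache", "cache", "deterministic",
]);
console.log(providerFailureRate(sample)); // 1 fallback out of 4 attempts: 0.25
```

Keeping cache and deterministic answers out of the denominator means a room served mostly from cache does not dilute a real provider outage.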

Honesty (agentic_reliability)

  • BOUND: scan capped ≤1000; capped flag when the window is full
  • HONEST_SCORES: every value computed from real rows; rates are null (not a fabricated 0%/100%) when there's no denominator; the UI renders "—", never invents "100% healthy"
  • No private data: liveEventAnswers are public; never touches userNotes
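
A rough sketch of the HONEST_SCORES rule, under the assumption that the frontend receives each rate as `number | null`. The helpers `rateOrNull` and `formatRate` are hypothetical names introduced here, not functions from the PR.

```typescript
// Hypothetical helper: a rate exists only when there is a denominator.
function rateOrNull(numerator: number, denominator: number): number | null {
  return denominator === 0 ? null : numerator / denominator;
}

// Hypothetical UI formatter: null renders as "—", never as "0%" or "100%".
function formatRate(rate: number | null): string {
  return rate === null ? "—" : `${Math.round(rate * 100)}%`;
}

console.log(formatRate(rateOrNull(0, 0))); // empty room: "—"
console.log(formatRate(rateOrNull(3, 4))); // "75%"
```

Carrying `null` through the type makes the "no data" case something the caller has to handle explicitly instead of a number that looks like a healthy reading.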

Tests

+3 scenario tests: full mixed-mode aggregate (7 answers), HONEST_SCORES empty-room null case, BOUND cap/capped flag.

Follow-up

Separate frontend PR (now that #445 landed): surface a host ask health line + a degraded badge on provider_fallback answers.

Verification floor

codegen 0 · tsc 0 · vitest 57 passed/1 skipped · build 0

🤖 Generated with Claude Code

… (PR C)

PR C of the /ask launch-readiness sprint. Backend-only, additive (new query,
no schema/contract change).

Launch ops can't run /ask blind. getAskTelemetry(eventId) is a bounded, read-only
aggregate over an event's answers that surfaces the operate-the-launch signals:
  - mode mix { provider, cache, deterministic, provider_fallback }
  - PROVIDER FAILURE RATE = provider_fallback / provider ATTEMPTS (cache +
    deterministic excluded from the denominator — they never reached the provider)
  - quality pass rate + avg score (from the deterministic answer evaluation)
  - total estimated cost (cents) and avg provider latency (from the provider_llm
    trace step)
  - live-search count

Honesty (agentic_reliability):
  - BOUND: scan capped at ≤1000; `capped` flag surfaced when the window is full.
  - HONEST_SCORES: every value is computed from real rows; rates are NULL (not a
    fabricated 0% / 100%) when there's no denominator — the UI must render "—",
    never invent "100% healthy" from zero data.
  - No private data: liveEventAnswers are public; the query never touches userNotes.

Tests (convex/__tests__/scratchnode.events.test.ts): +3 scenario tests — the full
aggregate from a 7-answer mixed-mode room, the HONEST_SCORES empty-room null case,
and the BOUND cap/`capped` flag.

Follow-up (separate frontend PR, after PR #445 lands): surface this in a host
"ask health" line + a degraded badge on provider_fallback answers.

Verification: convex codegen 0, tsc 0, vitest 57 passed / 1 skipped, build 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@HomenShum HomenShum enabled auto-merge (squash) June 1, 2026 17:20
@vercel

vercel Bot commented Jun 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| nodebench-ai | Ready | Preview, Comment | Jun 1, 2026 5:23pm |


@augmentcode

augmentcode Bot commented Jun 1, 2026

🤖 Augment PR Summary

Summary: Adds a new backend-only Convex query to surface bounded /ask operability telemetry for a live event, enabling launch/host operators to monitor health without scanning unbounded data.

Changes:

  • Introduced getAskTelemetry(eventId, limit?) query that scans up to a capped window (default 500, max 1000) of liveEventAnswers.
  • Computes a mode mix (provider, cache, deterministic, provider_fallback) and derives providerAttempts and providerFailureRate.
  • Aggregates quality metrics (pass rate + average score) from stored evaluation rows.
  • Aggregates estimated total cost (cents), average provider latency from traces, and counts external/live searches.
  • Implements “honest scores” semantics by returning null rates when there is no denominator, and returns capped when the scan hits the window size.
  • Added 3 scenario tests covering mixed-mode aggregation, empty-event null rates, and bounded scanning with a limit cap.

Technical Notes: This is additive/read-only (no schema or contract changes) and relies on the existing liveEventAnswers.by_event_time index plus trace/evaluation fields for metrics.
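
The bounded-scan behavior summarized above (default 500, max 1000, `capped` when the window is full) could look roughly like the following. This is a sketch over a plain array rather than a Convex query; `clampLimit` and `scanWindow` are hypothetical names, and the lower bound of 1 is an assumption not stated in the PR.

```typescript
const DEFAULT_LIMIT = 500; // default window, per the summary above
const MAX_LIMIT = 1000;    // hard cap, per the summary above

// Hypothetical helper: normalize a caller-supplied limit into [1, MAX_LIMIT].
function clampLimit(limit?: number): number {
  if (limit === undefined || !Number.isFinite(limit)) return DEFAULT_LIMIT;
  return Math.min(MAX_LIMIT, Math.max(1, Math.floor(limit)));
}

// Hypothetical helper: read at most `windowSize` rows and report whether
// the window filled up, in which case more rows may exist beyond it.
function scanWindow<T>(rows: T[], limit?: number): { scanned: T[]; capped: boolean } {
  const windowSize = clampLimit(limit);
  const scanned = rows.slice(0, windowSize);
  return { scanned, capped: scanned.length === windowSize };
}
```

With this definition `capped` is also true when the event has exactly as many answers as the window holds. That is a conservative reading of "the window is full": the flag says the aggregate may be partial, not that it definitely is.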



@augmentcode augmentcode Bot left a comment


Review completed. 2 suggestions posted.


Comment thread convex/events.ts
          if (r.evaluation.passed) passCount += 1;
        }
        const provStep = (r.trace ?? []).find(
          (s: any) => s.step === "provider_llm" && s.status === "ok",

@augmentcode augmentcode Bot Jun 1, 2026


convex/events.ts:949: avgProviderLatencyMs only includes trace steps where step === "provider_llm" && status === "ok", but provider_fallback rows record provider_llm with status: "error" (see existing trace emission in this file). This will systematically under-report provider latency during degraded periods (timeouts/errors) even though those attempts are part of the operability signal.

Severity: medium
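
To make the concern concrete, here is a small sketch of how filtering on `status === "ok"` changes the average. It is not the PR's code and not necessarily the fix that was adopted. `TraceStep` is a hypothetical type built from the field names mentioned in the review comments (`step`, `status`, `durationMs`), and the `includeErrors` switch exists only for the comparison.

```typescript
// Hypothetical trace shape, using the field names from the review comments.
interface TraceStep {
  step: string;
  status: string;
  durationMs?: number;
}

// Average provider_llm latency across answers; null when nothing was measured.
function avgProviderLatencyMs(traces: TraceStep[][], includeErrors: boolean): number | null {
  let total = 0;
  let count = 0;
  for (const trace of traces) {
    const s = trace.find(
      (t) =>
        t.step === "provider_llm" &&
        (includeErrors || t.status === "ok") &&
        typeof t.durationMs === "number",
    );
    if (s && typeof s.durationMs === "number") {
      total += s.durationMs;
      count += 1;
    }
  }
  return count === 0 ? null : total / count;
}

// Invented sample: one fast success, one slow failed attempt, one cache hit.
const traces: TraceStep[][] = [
  [{ step: "provider_llm", status: "ok", durationMs: 200 }],
  [{ step: "provider_llm", status: "error", durationMs: 8000 }],
  [{ step: "cache_lookup", status: "ok", durationMs: 5 }],
];
console.log(avgProviderLatencyMs(traces, false)); // ok-only: 200
console.log(avgProviderLatencyMs(traces, true));  // with errors: 4100
```

An alternative to one blended number is to report success latency and failed-attempt latency separately, so a timeout spike shows up without distorting the healthy-path figure.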


      },
    ): TableRecord {
      const { score = 100, passed = true, costCents = 0, providerMs = null, liveSearches = 0, createdAt } = opts;
      const trace = providerMs != null

@augmentcode augmentcode Bot Jun 1, 2026


convex/__tests__/scratchnode.events.test.ts:1083: telAnswer() builds a non-provider trace whenever providerMs == null, so the provider_fallback fixture ends up with no provider_llm step at all. In production, fallbacks include a provider_llm step with status: "error" + durationMs, so this test case isn’t actually exercising telemetry behavior against a realistic trace shape.

Severity: low

Other Locations
  • convex/__tests__/scratchnode.events.test.ts:1124
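
One way a fixture helper could mirror the production trace shape described in this comment is sketched below. `buildTrace` is a hypothetical helper, not the PR's `telAnswer()`, and the `local_answer` step name is a placeholder invented for the example. The only details taken from the comment are that a fallback carries a `provider_llm` step with `status: "error"` and a `durationMs`.

```typescript
type AskMode = "provider" | "cache" | "deterministic" | "provider_fallback";

interface TraceStep {
  step: string;
  status: string;
  durationMs?: number;
}

// Hypothetical fixture helper: the trace shape is driven by the answer mode,
// not by whether a latency value happens to be supplied.
function buildTrace(mode: AskMode, providerMs?: number): TraceStep[] {
  if (mode === "provider" || mode === "provider_fallback") {
    if (providerMs === undefined) {
      // Both modes reached the provider, so a measured duration is required.
      throw new Error(`${mode} fixtures need a providerMs value`);
    }
    const status = mode === "provider" ? "ok" : "error";
    return [{ step: "provider_llm", status, durationMs: providerMs }];
  }
  // Placeholder step name for answers that never reached the provider.
  return [{ step: "local_answer", status: "ok" }];
}
```

Throwing on a missing duration keeps a fallback fixture from silently degrading into a non-provider trace, which is the gap the comment points at.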


@HomenShum HomenShum merged commit ad256d0 into main Jun 1, 2026
16 checks passed
@HomenShum HomenShum deleted the feat/ask-observability branch June 1, 2026 17:34
@github-actions

github-actions Bot commented Jun 1, 2026

Demo: walkthrough of the surfaces this PR changed is available as a workflow artifact (pr-demo-446) at https://github.com/HomenShum/nodebench-ai/actions/runs/26771171588


2 participants