[FE / chore] Move evals to packages #4753

Draft
ardaerzin wants to merge 107 commits into main from fe-chore/move-evals-to-packages

@ardaerzin

Contributor

Summary

Testing

QA follow-up

  • evaluations & annotation queues testing

Checklist

  • Relevant tests pass locally
  • Relevant linting and formatting pass locally
  • I have signed the CLA, or I will sign it when the bot prompts me

ardaerzin added 30 commits June 8, 2026 00:15
New state+logic package for evaluations, mirroring the @agenta/annotation split
(headless here; React UI will follow in @agenta/evaluations-ui). Run/queue/result/
metric data molecules stay in @agenta/entities; this package owns run-config
construction and the run-creation controller. Registered as an @agenta/oss dep.

- core/buildRunConfig: PURE, headless port of OSS createEvaluationRunConfig. The
  four playground/workflow atoms it used to read via getDefaultStore are now passed
  in as a flat plain-data DTO (schemaContextByRevisionId), so the package imports
  zero jotai/playground/getDefaultStore. Unit tested without a store.
- controllers/createEvaluationRun: orchestrates createRuns -> createScenarios ->
  setResults via Fern, with deleteRuns rollback on partial failure (backend
  cascade-deletes scenarios/results). Injectable client → all branches (success,
  scenario-fail, results-fail, rollback-fail) unit tested with a fake, no backend.
- vendored slugify + extractEvaluatorMetricKeys with TODOs to consolidate onto
  @agenta/shared and entities extractMetrics in a later slice.

22 unit tests pass; types + lint clean. TODOS.md notes a backend atomic-create
endpoint that would remove the FE rollback entirely.
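
The run -> scenarios -> results orchestration with rollback described above can be sketched as follows. This is a minimal illustration, not the package's code: the `EvalClient` interface, the `CreateOutcome` shape, and the name `createEvaluationRunSketch` are assumptions made for the example; only the call order and the single `deleteRuns` rollback come from the commit message.

```typescript
// Hypothetical client surface for illustration; the real Fern client is not shown here.
interface EvalClient {
    createRuns(configs: unknown[]): Promise<{id: string}[]>
    createScenarios(runId: string, count: number): Promise<{id: string}[]>
    setResults(runId: string, scenarioIds: string[]): Promise<void>
    deleteRuns(runIds: string[]): Promise<void>
}

type Stage = "run" | "scenarios" | "results"

// Hypothetical return shape; the real controller is only said to return {runId} on success.
type CreateOutcome =
    | {ok: true; runId: string}
    | {ok: false; stage: Stage; rolledBack: boolean; error: unknown}

async function createEvaluationRunSketch(
    client: EvalClient,
    config: unknown,
    scenarioCount: number,
): Promise<CreateOutcome> {
    let runId: string | undefined
    let stage: Stage = "run"
    try {
        const runs = await client.createRuns([config])
        if (runs.length === 0) throw new Error("backend returned no run")
        runId = runs[0].id
        stage = "scenarios"
        const scenarios = await client.createScenarios(runId, scenarioCount)
        stage = "results"
        await client.setResults(runId, scenarios.map((s) => s.id))
        return {ok: true, runId}
    } catch (error) {
        // Nothing to undo if the run itself was never created.
        if (runId === undefined) return {ok: false, stage, rolledBack: false, error}
        try {
            // One delete is enough here because the backend is described as
            // cascade-deleting scenarios and results.
            await client.deleteRuns([runId])
            return {ok: false, stage, rolledBack: true, error}
        } catch {
            // Rollback failed: report it so the caller knows a run may be orphaned.
            return {ok: false, stage, rolledBack: false, error}
        }
    }
}
```

Because the client is a parameter, each branch (success, scenario failure, results failure, rollback failure) can be driven by a fake object without a backend.
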
Rewrite the evaluationRun and evaluationQueue API functions from raw axios
(@agenta/shared/api) to the Fern-generated @agentaai/api-client via @agenta/sdk,
matching the secret/gatewayTool precedent. project_id is injected through Fern's
queryParams (projectScopedRequest); the Zod boundary is preserved unchanged — it
now narrows Fern's all-optional generated types and remains the independent drift
check.

The SDK client is imported lazily (dynamic import) rather than statically:
@agentaai/api-client is ESM-only (no require export), and a static import would
break the tsx --test molecule/ETL suites the moment a molecule is imported. Lazy
import keeps those suites green and resolves correctly via the ESM loader at call
time. Existing node:test molecule (15) + ETL (9) + leak (5) suites pass.
Wire the OSS creation path to the new package and delete the duplicated config +
orchestration:

- usePreviewEvaluations.createNewRun now resolves per-revision schema context from
  the playground/workflow atoms (the app supplies inputs), calls the package's pure
  buildRunConfig, then the headless createEvaluationRun controller (run -> scenarios
  -> results with rollback). No bridge — OSS only reads atoms and hands plain data in.
- Delete services/evaluationRuns/api/index.ts (createEvaluationRunConfig), the inline
  createScenarios helper, and the hand-rolled run/scenario/step orchestration. Drops
  the now-orphaned slugify/uuid/useSWRConfig/SCENARIOS_ENDPOINT usages.
- NewEvaluationModalInner reads the controller's clean {runId} return shape.

The rewrite also removed 8 pre-existing type errors that lived in the old
orchestration (oss tsc: 593 -> 589); the migrated files are type- and lint-clean.
…; log parse failures in prod

Two correctness/reliability fixes to the evaluationRun-family Zod schemas and the
shared validation helper, de-risking the upcoming run-fetch consolidation (T6):

- Add .passthrough() to evaluationRun/data/step/mapping/reference/result/metric
  schemas. The backend mounts these payloads with extra="allow", and downstream
  consumers (notably the OSS EvalRunDetails run enrichment: buildRunIndex,
  evaluator-ref patching) read fields beyond what the schema declares. The default
  z.object() was silently stripping them — a data-loss bug, and the specific blocker
  to routing the OSS per-run fetch through the package molecule. Known fields are
  still strictly validated; this makes the schema a validator, not a field filter.
- safeParseWithLogging now logs validation failures in production too, not just dev.
  A Zod failure is always real signal (backend drift / a bug), never normal control
  flow, so it should be visible in prod logs instead of silently swallowed. The null
  return is preserved, so no caller's control flow changes.
- Add a schema-contract test (real-response-shaped fixtures) pinning passthrough of
  unknown top-level/nested/ref fields and that a missing required id still fails.

entities: types + lint clean; schema (6) + molecule (15) + ETL (9) + leak (5) +
vitest unit (589) suites pass. oss tsc error count unchanged.
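
To make the "validator, not a field filter" point concrete, here is a dependency-free sketch. It does not use Zod; `parseRunPassthrough` is a hand-written stand-in for a passthrough object schema, and the signature of `safeParseWithLoggingSketch` is an assumption (the real helper's signature is not shown in this PR text). What it keeps from the commit message: unknown fields survive, a missing required id still fails, failures are always logged, and the null return is preserved.

```typescript
type ParseResult<T> = {success: true; data: T} | {success: false; issues: string[]}

type RunLike = {id: string} & Record<string, unknown>

// Stand-in for a passthrough schema: validates the declared field (`id`)
// and keeps every other key instead of stripping it.
function parseRunPassthrough(input: unknown): ParseResult<RunLike> {
    if (typeof input !== "object" || input === null) {
        return {success: false, issues: ["expected an object"]}
    }
    const record = input as Record<string, unknown>
    const id = record["id"]
    if (typeof id !== "string") return {success: false, issues: ["id: expected string"]}
    return {success: true, data: {...record, id}}
}

// Hypothetical signature. Logs on every failure (no dev-only gate) and
// returns null so callers' control flow stays the same.
function safeParseWithLoggingSketch<T>(
    parse: (input: unknown) => ParseResult<T>,
    input: unknown,
    label: string,
    log: (message: string) => void = console.error,
): T | null {
    const result = parse(input)
    if (result.success) return result.data
    log(`[${label}] validation failed: ${result.issues.join("; ")}`)
    return null
}
```
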
…reading app-global state

The evaluationRun molecule imported projectIdAtom from @agenta/shared/state and read
it from the default store inside its query atoms (with a "projectId not yet available"
retry hack) — the package reaching into app-global state, and an assumption that a
project is always ambient in a global store. Decouple it: callers pass projectId.

- Re-key every run atom family from (runId) to ({projectId, runId}) and the scenario
  families to ({projectId, runId, scenarioId}), with projectId-aware areEqual. The
  query atoms take projectId straight from the family key — no store read, no
  projectIdAtom import, no retry hack (projectId is part of the key, captured at
  subscription, which also removes the atomWithQuery-can't-react-to-deps workaround).
- Public surface threads projectId: selectors.x({projectId, runId}),
  get.x(projectId, runId, ...), invalidateEvaluationRunCache({projectId, runId}).
- Consumers that use the changed surface are the annotation controllers /
  annotation-ui (already app-state-aware) — updated to pass projectId. The result/
  metric molecules already took projectId from callers and are unchanged. OSS does
  NOT consume this surface (its local evaluationRunQueryAtomFamily is a name
  collision, not the package export), so no OSS changes.

entities + annotation + annotation-ui types + lint clean; molecule (15) / ETL (9) /
schema (6) suites pass; oss tsc unchanged at baseline.
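
The re-keying idea can be illustrated without jotai. The sketch below is a hand-rolled family keyed by `{projectId, runId}` with a projectId-aware equality function, similar in spirit to passing an `areEqual` argument to jotai's `atomFamily`. `createFamilySketch`, `areRunKeysEqual`, and `runDescriptorFamily` are names invented for this example; the real molecule's internals are not shown in this PR text.

```typescript
interface RunKey {
    projectId: string
    runId: string
}

// Two keys denote the same family member only if both ids match.
const areRunKeysEqual = (a: RunKey, b: RunKey): boolean =>
    a.projectId === b.projectId && a.runId === b.runId

// Minimal family: returns the cached member for an equal key, else creates one.
function createFamilySketch<K, V>(
    create: (key: K) => V,
    areEqual: (a: K, b: K) => boolean,
): (key: K) => V {
    const entries: {key: K; value: V}[] = []
    return (key) => {
        for (const entry of entries) {
            if (areEqual(entry.key, key)) return entry.value
        }
        const value = create(key)
        entries.push({key, value})
        return value
    }
}

// Each member captures projectId from its key, so it never reads a global store.
const runDescriptorFamily = createFamilySketch(
    (key: RunKey) => ({describe: () => `project ${key.projectId} / run ${key.runId}`}),
    areRunKeysEqual,
)
```

Because object-literal keys are compared structurally, two call sites passing equal `{projectId, runId}` values share one member, while the same `runId` under another project gets its own.
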
… + createEvaluationRun

Fulfils the eng-review commitment (D5 / "table store testable with actual API
integration"): real-backend integration tests, skipped unless AGENTA_API_URL +
AGENTA_AUTH_KEY are set (globalSetup mints an ephemeral account + API key).

- @agenta/entities: extend the integration worker to also authenticate the Fern client
  (sets AGENTA_API_KEY/AGENTA_HOST) — the eval api goes through @agentaai/api-client, not
  axios, so the existing axios-only auth didn't cover it. New evaluationRun integration
  test exercises the atoms' data layer against a real backend: queryEvaluationRuns /
  fetchEvaluationRun / queryEvaluationResults / queryEvaluationMetrics / queryEvaluationQueues
  return well-formed, Zod-valid empty results on a fresh project, and the decoupled
  {projectId, runId} molecule atom fetches and resolves an absent run to null. Pins Fern
  auth + endpoint reachability + the Zod boundary (passthrough) + the projectId wiring
  against real responses.
- @agenta/evaluations: stand up the integration harness (config + ephemeral-account setup,
  Fern-auth worker) and a createEvaluationRun controller test that covers the DIFFERENT
  evaluation TYPES this controller produces — a matrix over human-origin, auto-origin, and
  no-evaluator runs — each create→fetch (asserting the meta.evaluation_kind type marker +
  annotation-step origin + step shape round-trip)→delete, plus deleteRuns (the rollback
  cleanup primitive) removing a run. Online evals use a separate endpoint (out of scope).
  The orchestration branches stay unit-covered by the faked client.

Both suites compile and skip cleanly with no backend (6 + 4 tests). New files lint clean.
…rn query (T6)

previewRunBatcher reimplemented the package evaluationRun molecule's batch fetch — the
same POST /evaluations/runs/query {run:{ids}} via raw axios. Delegate its network/query
layer to the shared Fern-backed queryEvaluationRuns from @agenta/entities/evaluationRun,
removing the duplicate axios query (and the last raw /runs/query call in the per-run
path). The batcher keeps its own in-memory cache + the list→detail priming; only the
fetch is shared now.

Behavior-preserving: identical query, same snake_case run shape (the eval schemas
passthrough unknown fields as of the T2 slice, so nothing the downstream enrichment reads
is stripped). queryEvaluationRuns is verified against a live backend by the entities
integration suite. oss tsc unchanged at baseline; file lints clean.

Remaining T6 (not a dedup — no package equivalent yet): the LIST fetch
(fetchPreviewRunsShared) still uses axios because its run.search / run.evaluation_kinds
filters aren't modelled in Fern's generated EvaluationRunQuery. Routing it through Fern
needs the OpenAPI spec extended (or a documented cast). The deeper consolidation — delete
previewRunBatcher entirely and read through the package molecule — is a follow-on (touches
the OSS enriched run atom + list-priming + ~6 consumers).
…e molecule (T6)

Completes the run-fetch consolidation: the OSS previewRunBatcher (a per-run batched
fetch + Map cache + list→detail priming, duplicating the package molecule's batcher) is
deleted. Its consumers now use the package's shared batched fetch.

- @agenta/entities: expose fetchEvaluationRunBatched({projectId, runId}) — the molecule's
  existing createBatchFetcher exposed imperatively, so async non-jotai call sites get the
  same batched POST /evaluations/runs/query without a second batcher.
- OSS enriched run atom (EvalRunDetails/atoms/table/run.ts) + EvaluationRunsTablePOC
  runSummaries: fetch the raw run via fetchEvaluationRunBatched instead of getPreviewRunBatcher.
- Drop the previewRunBatcher Map cache + its prime (from the list fetch + usePreviewEvaluations)
  + its invalidate calls (editEvaluation, PreviewEvalRunHeader, scenarios/api). These were
  side-cache clears; the real detail/list refetch is triggered separately (queryClient
  invalidate / refetchRunQueries), and with no Map every fetch is now always fresh but still
  batched. Behavior-preserving (a minor cross-query cache is the only thing lost).

Concurrent run reads still collapse into one batched query. oss tsc unchanged at baseline
(589; the 5 remaining table/run.ts errors are pre-existing — unimported axios, the
ensureEvaluatorRevisions return type, snakeToCamelCaseKeys typing). Package molecule (15) /
ETL (9) / schema (6) suites pass; entities + changed-file lint clean. The package query is
verified against the live backend by the integration suite.

NOTE: the OSS enriched-atom path has no automated view tests and wasn't UI-smoke-tested;
the change is type-neutral + behavior-preserving by construction, but a manual pass over the
evaluations list + run detail is worth doing before merge.
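
For readers unfamiliar with the "concurrent reads collapse into one batched query" behavior, here is a minimal sketch of the technique. It is not the package's `createBatchFetcher`: the name `createBatchFetcherSketch`, the microtask-based flush, and keying by a bare id (rather than `{projectId, runId}`) are simplifying assumptions for illustration.

```typescript
interface Pending<V> {
    id: string
    resolve: (value: V | null) => void
    reject: (error: unknown) => void
}

function createBatchFetcherSketch<V>(
    fetchMany: (ids: string[]) => Promise<Map<string, V>>,
): (id: string) => Promise<V | null> {
    let pending: Pending<V>[] = []

    const flush = async (): Promise<void> => {
        // Take ownership of the queue so calls made during the fetch start a new batch.
        const batch = pending
        pending = []
        try {
            const ids = Array.from(new Set(batch.map((p) => p.id)))
            const found = await fetchMany(ids)
            for (const p of batch) {
                const hit = found.get(p.id)
                // An id missing from the response resolves to null.
                p.resolve(hit === undefined ? null : hit)
            }
        } catch (error) {
            for (const p of batch) p.reject(error)
        }
    }

    return (id) =>
        new Promise<V | null>((resolve, reject) => {
            pending.push({id, resolve, reject})
            // The first caller in a tick schedules the flush; later callers join the queue.
            if (pending.length === 1) void Promise.resolve().then(flush)
        })
}
```

There is no long-lived cache in this sketch, which matches the "always fresh but still batched" trade-off described above: every call hits the network, but calls made in the same tick share one request.
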
queryStepResults reimplemented POST /evaluations/results/query via raw axios — the same
query the package's Fern-backed queryEvaluationResults already does. Delegate to it
(behavior-preserving: same request, same snake_case rows via schema passthrough; returns
[] when no project, as the package query does). Removes a duplicate axios read.

The result MUTATIONS in this file stay on axios for now and are NOT migrated: Fern's
generated EvaluationResultCreate under-declares fields the backend accepts (no span_id,
references, or data), so routing the annotation write-back through Fern would silently
drop span_id and break trace/span linking. Documented inline; unblock by extending the
backend OpenAPI spec + regenerating the client. oss tsc unchanged at baseline; lint clean.
The new @agenta/evaluations workspace package wasn't added to oss/next.config.ts, so
Next didn't transpile it — the OSS imports of it (buildRunConfig / createEvaluationRun)
failed to resolve and the app wouldn't load (404 on the chunk). Add it to both
transpilePackages and experimental.optimizePackageImports, alongside the other @agenta/*
workspace packages.
EE renders OSS pages that import @agenta/evaluations, but ee/package.json didn't declare
the workspace dep, so pnpm never linked it into ee/node_modules → module resolution failed
and the EE app 404'd on load. Add the dependency (and the optimizePackageImports entry);
transpilePackages is inherited via `{...ossConfig}` so the earlier oss/next.config fix
already covers EE's transpile step.
…apping kinds

The evaluations table rendered blank "Created by" and metric cells after the axios->Fern
migration. Root cause: `evaluationRunMappingKindSchema` was `z.enum(["input","ground_truth",
"application","evaluator","annotation"])`, but the backend emits `data.mappings[].column.kind`
values of "testset"/"invocation"/"annotation". Because that field sits deep inside the optional
`data` tree, a single unrecognized enum value failed the entire run parse, which failed the whole
`runs: z.array(evaluationRunSchema)` envelope -> `safeParseWithLogging` returned null ->
`queryEvaluationRuns` returned no runs -> the per-run summary atom resolved to null, blanking
`created_by_id` and the step-reference-derived metric columns. The old axios list path did no Zod
validation, so it tolerated these values.

Fix: validate the three string-union "kind" fields (mapping kind, step type, step origin) as
permissive `z.string()` instead of `z.enum`, keeping the known values as documented unions for
autocomplete. Backend payloads use extra="allow" and the taxonomy drifts; a strict enum on a
deeply-nested optional field is a catastrophic failure mode. Adds a regression test that parses a
real (UUID- and key-scrubbed) /evaluations/runs/query payload.
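
The failure mode and the fix can be reproduced without Zod. In this sketch the known-kind list is the one quoted above from the old enum; `parseRuns`, `RawRow`, and `RunRow` are simplified, hypothetical stand-ins for the real run schema and its array envelope. The `string & {}` idiom is one way to keep known values available for autocomplete while accepting any string; whether the package uses that exact idiom is not stated.

```typescript
const KNOWN_MAPPING_KINDS = ["input", "ground_truth", "application", "evaluator", "annotation"] as const
type KnownMappingKind = (typeof KNOWN_MAPPING_KINDS)[number]
// `string & {}` keeps editor autocomplete for the known values while accepting any string.
type MappingKind = KnownMappingKind | (string & {})

interface RunRow {
    id: string
    kind: MappingKind
}
type RawRow = {id: unknown; kind: unknown}

function isKnownKind(value: unknown): boolean {
    return (
        typeof value === "string" &&
        (KNOWN_MAPPING_KINDS as readonly string[]).indexOf(value) !== -1
    )
}

// Envelope-style parse: one invalid row rejects the whole list, as an array schema would.
function parseRuns(rows: RawRow[], acceptKind: (value: unknown) => boolean): RunRow[] | null {
    const parsed: RunRow[] = []
    for (const row of rows) {
        if (typeof row.id !== "string" || typeof row.kind !== "string" || !acceptKind(row.kind)) {
            return null
        }
        parsed.push({id: row.id, kind: row.kind})
    }
    return parsed
}

// Old behavior: strict enum membership.
const parseRunsStrict = (rows: RawRow[]) => parseRuns(rows, isKnownKind)
// Fixed behavior: any string is accepted as a kind.
const parseRunsPermissive = (rows: RawRow[]) => parseRuns(rows, (value) => typeof value === "string")
```

With a row whose kind is "testset", the strict variant returns null for the entire list, which is the shape of the bug: one unrecognized value blanks every run, not just the offending one.
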
… payloads

The integration test built run configs with `data.mappings: []` and never went through the
read-back/parse path the run table uses, so it could not catch the mapping-kind enum
regression that blanked the table — it passed against both the broken and fixed schema.

Two fixes:
- Populate mappings with the real `column.kind` values the package's buildRunConfig emits
  ("testset"/"invocation"/"evaluator"), so the created run actually exercises schema kind
  validation on read-back.
- Round-trip each created run through queryEvaluationRuns (the batched path the table uses)
  and assert the run survives the parse and its mapping kinds are preserved.

Verified: this now FAILS against the old `z.enum` mapping-kind schema and passes against
the fixed `z.string()` one. Note these tests are gated behind AGENTA_API_URL +
AGENTA_AUTH_KEY and skip (showing as green) when unset — they must be run with a backend.
Parses a real project's EXISTING runs through the production evaluationRunSchema, per-run, so
schema drift against production-shaped payloads (the class of bug that blanked the run table)
is caught with the offending run id + field path. Read-only (query only), safe against a real
project with a read-scoped key. Gated on AGENTA_API_URL + AGENTA_REAL_API_KEY +
AGENTA_REAL_PROJECT_ID; skips when unset.
The entities eval integration suite only asserted empty-envelope/absent cases against a
fresh ephemeral project, so it could never exercise run-data parsing or the molecule's
derived selectors — exactly why the mapping-kind regression slipped through. Add:

- A populated-run block: create a representative run via the raw Fern client (entities
  cannot depend on @agenta/evaluations) with testset/invocation/evaluator mappings, then
  assert queryEvaluationRuns + fetchEvaluationRun parse it and evaluationRunMolecule
  selectors (data/steps/annotationSteps/mappings/evaluatorIds) derive real values.
- An evaluationQueue CRUD round-trip: create a run + queue, verify queryEvaluationQueues /
  fetchEvaluationQueue parse the populated queue and the molecule entity atoms resolve its
  name/run id. Cleans up runs + queue in afterAll.

Verified: the populated-run block FAILS against the old z.enum mapping-kind schema (3
failures) and passes against the fix; 11/11 green against the live local stack.
ensureEvaluatorRevisions called `axios.patch('/evaluations/runs/{id}')` but axios was never
imported in that file, so the call threw ReferenceError, was swallowed by the surrounding
try/catch, and the evaluator-revision write-back silently never persisted (pre-existing).

Add a Fern-backed `editEvaluationRun` to @agenta/entities/evaluationRun (PATCH
/evaluations/runs/{run_id} via client.editRun, Zod-validated at the boundary) and route the
OSS enrichment through it. EvaluationRunEdit accepts id + data.steps, so this is not blocked
by the Fern under-declaration affecting result mutations.

Adds an integration test that patches a real run's annotation-step references and re-fetches
to assert the change persists. oss tsc 589 -> 588 (removes the latent `Cannot find name
'axios'`). entities: 591 unit + 12 eval integration green against the live stack.
…ckend contract

Investigation showed the result-mutation "blocker" was a false premise: evaluation_results
has no span_id/references/data columns (only trace_id et al.), so those FE-sent fields were
silently dropped by the backend, not "accepted". The result↔trace link is trace_id.

- Add Fern-backed `setEvaluationResults` to @agenta/entities/evaluationRun (POST
  /evaluations/results/, the upsert-on-natural-key setter) carrying only real columns.
- Route OSS `upsertStepResultWithAnnotation` through it, dropping the vestigial span_id
  (behavior-preserving — backend never persisted it). Removes the last axios usage from
  services/evaluations/results/api.ts.
- Delete dead `createStepResults` + `updateStepResults` (zero callers).
- Integration test: create run + scenario, upsert a result, read it back, assert trace_id
  persists. 13/13 eval integration green against the live stack; 591 unit; oss tsc 588.
… contract

fetchPreviewRunsShared was the last axios eval read. Add a Fern-backed
`queryEvaluationRunsList` to @agenta/entities (POST /evaluations/runs/query with the filters
query_runs actually supports — references/flags/statuses + windowing) and route the OSS
list fetch through it, keeping the OSS request-dedup cache + camelCasing wrapper.

Drops `search` and `evaluation_kinds` from the request: the backend has no such filters
(silently dropped), and free-text/kind filtering is client-side per the eval-filtering RFC —
so this is behavior-preserving. windowing is read off the raw envelope (the Zod envelope
doesn't model it) and returned for the paginating consumer (fetchAutoEvaluationRuns).

Integration test: create runs, list them through the parse, assert presence + windowing
cursor + limit. 15/15 eval integration green; 591 unit; oss tsc 588.
Add Fern-backed scenario primitives to @agenta/entities/evaluationRun: a minimal
evaluationScenario schema (passthrough) + `queryEvaluationScenarios` (POST
/evaluations/scenarios/query) and `setEvaluationScenarioStatuses` (PATCH
/evaluations/scenarios/, id+status only).

Route OSS services/evaluations/scenarios/api.ts through them; the run-status rollup
(checkAndUpdateRunStatus) now reuses queryEvaluationRuns + editEvaluationRun. Removes the
last axios from that file (and the bespoke SSRF id-guard — Fern encodes path params).

Integration tests: query a run's scenarios, edit a scenario status, re-query and assert it
persists. 17/17 eval integration green against the live stack; 591 unit; oss tsc 588.
Route services/evaluations/invocations/api.ts through the Fern package functions:
upsertStepResultWithInvocation -> setEvaluationResults (drops the vestigial span_id /
references / outputs that have no columns; keeps trace_id + error, both real columns);
updateScenarioStatus -> setEvaluationScenarioStatuses (deduped onto the same primitive as
services/evaluations/scenarios). Extends EvaluationResultSetInput with the real `error`
column. Removes the last axios from the file.

Behavior covered by the existing setEvaluationResults + setEvaluationScenarioStatuses
integration tests. oss tsc 588; 591 unit green.
The EvaluationRunsTablePOC delete action used raw axios.delete('/evaluations/runs/'). Add
Fern-backed `deleteEvaluationRuns` to @agenta/entities (DELETE /evaluations/runs/; backend
cascade-deletes scenarios/results/metrics) and route deletePreviewRuns through it.

Integration test: create a run, delete via the package fn, assert fetch returns null.
18/18 eval integration green; 591 unit; oss tsc 588.
…n delete

- Add Fern `queryEvaluationMetricsBatch` to @agenta/entities (POST /evaluations/metrics/query
  with the backend projection flags run_ids / scenario_ids / timestamps) and route the
  EvalRunDetails runMetrics batcher through it (run-level + temporal). Behavior-preserving:
  identical payload, and the metric schema is passthrough (only id/run_id required, both real
  columns) so no field stripping.
- Route DeleteEvaluationModalContent's run delete onto deleteEvaluationRuns (dedupes its
  private axios copy). Both files now axios-free.

Metrics are worker-computed (can't be made in the ephemeral harness), so verified the
populated path against the real project via the read-only smoke test: every existing metric
parses through evaluationMetricSchema with the exact batch payload. entities 591 unit + 18
eval integration; evaluations 22 unit; oss tsc 588.
Locks the structure for relocating the evaluation-run engine into a layered package
architecture (entities ← evaluations ← annotations, + -ui mirrors), with annotation queue
and human eval as presets over one evaluation engine.

Key decisions captured: extract the generic engine FROM @agenta/annotation (source of
truth) into @agenta/evaluations, keep annotation green throughout, prove parity vs the OSS
EvalRunDetails/EvaluationRunsTablePOC baseline before deleting OSS dups, move (not rewrite)
the single configurable run table from AnnotationQueuesView, keep etl in entities.

Includes §0 guardrails (anti-stray), the unified entity model, the controller
generic-vs-annotation decomposition map, sequenced Work Packages each keeping annotation
green, the regression methodology, and definition of done.
…ation plan

Adds an enforceable "clean up after yourself" requirement so agents can't leave eval
services/utils/data-layer atoms behind in OSS:
- §0 cardinal rule 7: each WP deletes its OSS counterpart in the same WP; migration is not
  done until the cleanup ledger is checked off.
- §7 cleanup ledger: explicit list of every OSS eval service/lib/atom path that must be
  deleted, mapped to the WP that deletes it; legacy bridge + onlineEvaluations tracked as
  terminal WPs (never silently left).
- §7.2 verification gate: concrete grep/find commands that must return empty at final DoD.
- §9 Definition of done now requires the zero-residue gate to pass.
… package

Closes a testing gap in the migration plan: WP-2 had only unit tests and WP-3 had none.
Now every WP that moves state/logic must ship a real-API integration test that drives the
SHIPPED atoms/molecules/controllers — not a test-local replica.

- §5: testing is part of every WP's DoD; adds an "Integration test (real API, real atoms)"
  line to WP-0..4, each naming the exact shipped surface to drive.
- §8: hard rule — import and exercise the real surface (if you delete the package code the
  test must fail to compile), run against the real backend, seed via raw client but assert
  through the package; bans the hand-built-payload anti-pattern that caused the mapping-kind
  bug; adds a per-WP coverage table; clarifies "tests green" means ran-with-backend not skipped.
Empty React UI package mirroring @agenta/annotation-ui, registered in OSS+EE
(package.json deps, next.config transpilePackages + optimizePackageImports). Will receive
the run list table, run detail view, scenario table, and metric cells in later work packages
(see docs/designs/evaluations-packages-migration-plan.md). No behavior change.
…y (WP-0)

Moves the scenario schema + queryEvaluationScenarios/setEvaluationScenarioStatuses out of
evaluationRun into a standalone @agenta/entities/evaluationScenario module (core/api/state),
adds a reactive {projectId, runId}-keyed molecule (list/ids/statuses selectors), and a
subpath export. evaluationRun no longer owns scenario code; OSS consumers
(services/evaluations/{scenarios,invocations}) re-point to the new module.

Integration test (real API, real atoms): drives the shipped evaluationScenario api +
molecule selectors against a real run's scenarios (the WP-0 DoD). entities 591 unit + 19
eval integration (run 16 + scenario 3) green against the live stack; oss tsc 588.
… update)

Reverses the earlier "etl stays in entities" decision. The ETL filtering is a feature where
OSS EvalRunDetails is ahead of annotation (annotation has no filtering — verified, it imports
none of the etl filtering), so:

- entities keeps only entity definitions; the eval-run ETL (hydration, mapping/column
  resolution, client-side filtering) moves to @agenta/evaluations (+ filter bar / column
  headers / resolved cells to @agenta/evaluations-ui).
- §4 source-of-truth exception: the ETL is extracted from OSS EvalRunDetails/etl, NOT from
  annotation; annotation gains filtering by depending on evaluations.
- New WP-3.5 (move the ETL, sourced from OSS) with its own real-API/real-atom integration
  test (hydrate real scenarios + apply a real rowPredicateFilter).
- Cleanup ledger + §7.2 gate now require OSS EvalRunDetails/etl gone and the entities
  evaluationRun/etl subpath removed; §10 records the reversal.
…ario source

Verified from code (no assumptions): the annotation session engine is founded on
simpleQueueMolecule, and the two consumers source the scenario LIST from different endpoints
— annotation via POST /simple/queues/{id}/scenarios/query (queue-scoped, optional user_id
annotator filter → may be a subset) and EvalRunDetails via POST /evaluations/scenarios/query
by run_id (run-scoped). Both return EvaluationScenario rows; scenario DATA is derived by
{projectId, runId, scenarioId} from the entities molecules in both.

Therefore the generic evaluations session engine must NOT hardcode a scenario molecule — it
takes an INJECTED source {projectId, runId, scenarios[], scenariosQuery} and owns
navigation/progress/current/focus/view. Annotation keeps feeding the QUEUE source (user-scoped
— do not swap to run-scoped); only the engine code is shared. §3.1 decomposition + WP-1 Move
updated; the truly-shared core is the scenario-data selectors keyed by {projectId,runId,scenarioId}.
Extract the scenario navigation/progress/focus/view engine from @agenta/annotation's
annotationSessionController into @agenta/evaluations/state (the navigation logic is moved
verbatim) with two genericizing changes:
  - the scenario LIST + query state are INJECTED via actions.setScenarios (no scenario
    molecule imported), so annotation can inject its queue-scoped source and the eval-run
    view a run-scoped one;
  - run/project context comes from openSession({projectId, runId}), decoupled from any store.

This is additive — @agenta/annotation is untouched (re-pointing it is the next WP-1 slice,
which needs annotation-route QA). Integration test drives the SHIPPED engine atoms over a
real run's scenarios (navigate next/prev, markCompleted → progress/status, hideCompletedInFocus
→ navigable filtering). 22 unit + 3 session-engine integration green vs the live stack.
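
A rough sketch of the injected-source idea: the engine owns navigation and progress, while the scenario list and the run/project context are handed in from outside. This is a plain-closure illustration, not the shipped atoms; `createSessionEngineSketch` and its method shapes are assumptions, and `hideCompletedInFocus` filtering is left out for brevity.

```typescript
interface ScenarioRef {
    id: string
}
interface SessionContext {
    projectId: string
    runId: string
}

function createSessionEngineSketch() {
    let context: SessionContext | null = null
    let scenarios: ScenarioRef[] = []
    let index = 0
    const completed = new Set<string>()

    return {
        // Context is passed in rather than read from a global store.
        openSession(next: SessionContext): void {
            context = next
            scenarios = []
            index = 0
            completed.clear()
        },
        // The list is injected: a queue-scoped or a run-scoped source both fit.
        setScenarios(list: ScenarioRef[]): void {
            scenarios = list
            index = Math.min(index, Math.max(list.length - 1, 0))
        },
        next(): void {
            if (index < scenarios.length - 1) index += 1
        },
        prev(): void {
            if (index > 0) index -= 1
        },
        markCompleted(scenarioId: string): void {
            completed.add(scenarioId)
        },
        current: (): ScenarioRef | null => scenarios[index] ?? null,
        progress: () => ({
            completed: scenarios.filter((s) => completed.has(s.id)).length,
            total: scenarios.length,
        }),
        getContext: (): SessionContext | null => context,
    }
}
```

The engine never imports a scenario source, so annotation can keep feeding its user-scoped queue list while a run view feeds a run-scoped one.
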
ardaerzin added 27 commits June 13, 2026 15:19
…s blank in Overview)

The WP-4h-5 relocation pinned @agenta/evaluations-ui to recharts ^2.13.0 (resolved
2.15.4), but the eval chart components are recharts-3 code (OSS/EE/main use ^3.1.0 →
3.8.1). Run under recharts 2.x, the Overview spider chart + per-evaluator distribution
charts rendered nothing while numeric stats showed — the chart APIs differ across the
major. It typechecked green under 2.x because the used API subset overlaps. Bump to
^3.1.0 (resolves the shared 3.8.1, same as main) and fix the recharts-3 Tooltip/formatter
callback signatures the stricter v3 types surfaced. oss tsc 363 (unchanged).
WP-4h moved the eval views into @agenta/evaluations-ui, but the Tailwind content
globs (oss/tailwind.config.ts, reused by ee via createConfig) were never updated to
scan it. So Tailwind didn't generate the package's utility classes — only ones that
also appear in already-scanned packages survived. Package-unique classes were dropped:
the run-overview spider's lg:flex-row + lg:w-7/12|w-5/12 (so it stacked under the table
instead of beside it) and its h-[480px]/h-full container (so the chart collapsed to 0
height and recharts rendered nothing — spider + per-evaluator distribution charts blank
while text showed). Add agenta-evaluations + agenta-evaluations-ui to the content array.
…inputs

The scenario focus drawer fed the whole testcase ENTITY ({id, created_at, data:{...},
testset_id, ...}) to TestcaseDataEditor, but the editor addresses values by bare column
key (valueKey, e.g. 'country') while the user columns live nested under .data. So every
input rendered empty when the testcase-entity branch was taken (row click resolves
sourceTestcaseId immediately); reload appeared to work because it rendered via the
flat embedded-steps fallback first. Unwrap to the inner .data record so the
testcase-entity branch matches the editor's bare keys, consistent with the
embedded-steps fallback (also flat). Diagnostic logging removed.
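
The mismatch is easy to see in a few lines. In this sketch `TestcaseEntity`, `toEditorRecord`, and `readColumn` are hypothetical names; `readColumn` stands in for the editor's lookup by bare `valueKey`, and the field values are made up for illustration.

```typescript
interface TestcaseEntity {
    id: string
    data?: Record<string, unknown>
    [key: string]: unknown
}

// Returns the flat record the editor can address by bare column key.
function toEditorRecord(testcase: TestcaseEntity): Record<string, unknown> {
    return testcase.data ?? {}
}

// Stand-in for the editor's lookup by valueKey.
const readColumn = (record: Record<string, unknown>, valueKey: string): unknown =>
    record[valueKey]

const exampleTestcase: TestcaseEntity = {
    id: "tc-1",
    testset_id: "ts-1",
    data: {country: "France"},
}
```

Reading `country` from `exampleTestcase` directly finds nothing because the column lives under `.data`; reading it from `toEditorRecord(exampleTestcase)` finds the value.
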
…fy casing/run-kind, cut debug log

- delete unused @deprecated facades getEvaluationKindWithFallback and
  CACHE_AWARE_HYDRATE_FETCHERS (+ their barrel re-exports); zero consumers
- collapse the duplicate snakeToCamelCaseKeys: delete the usePreviewEvaluations
  copy, re-point its sole importer at the canonical evalRun/utils/casing
- derive runsTable EvaluationRunKind from core (CoreEvaluationRunKind | "all")
  instead of restating the literal union
- remove the unconditional [runInvocationAction] Starting invocation debug log
…nto one factory

evaluationResultMolecule and evaluationMetricMolecule were ~95% identical
cache machinery (byScenario read, cache-aware prefetchByScenarioIds,
invalidate, evictByRunId, evictByScenarioIds, cacheKey). Extract the shared
logic into createScenarioCacheMolecule<T, K>; the two molecules now just bind
their element type, fetcher, cache-key prefix, and outcome list-key. Metrics
opts into skipItemsWithoutScenarioId for run-level aggregates (null scenario_id).

Public surface unchanged: same exported molecules, the Prefetch{Results,Metrics}
{Args,Outcome} types, the results/metrics outcome fields, and _internal.cacheKey
all preserved. Entities unit suite green (658 tests).
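
A simplified sketch of what such a factory might look like. It is not `createScenarioCacheMolecule` itself: the config shape, the single type parameter, the plain `Map` cache, and the key format are assumptions, and `skipItemsWithoutScenarioId`, `invalidate`, and `evictByScenarioIds` are omitted. It only shows how two near-identical caches can reduce to bindings of one generic function.

```typescript
interface ScenarioCacheConfig<T> {
    cacheKeyPrefix: string
    getScenarioId: (item: T) => string
    fetchByScenarioIds: (runId: string, scenarioIds: string[]) => Promise<T[]>
}

function createScenarioCacheSketch<T>(config: ScenarioCacheConfig<T>) {
    const cache = new Map<string, T[]>()
    const cacheKey = (runId: string, scenarioId: string): string =>
        `${config.cacheKeyPrefix}:${runId}:${scenarioId}`

    return {
        cacheKey,
        byScenario: (runId: string, scenarioId: string): T[] =>
            cache.get(cacheKey(runId, scenarioId)) ?? [],
        // Cache-aware: only scenario ids without a cache entry are requested.
        async prefetchByScenarioIds(
            runId: string,
            scenarioIds: string[],
        ): Promise<{fetched: string[]}> {
            const missing = scenarioIds.filter((id) => !cache.has(cacheKey(runId, id)))
            if (missing.length === 0) return {fetched: []}
            const items = await config.fetchByScenarioIds(runId, missing)
            // Seed empty buckets so scenarios with no items still count as cached.
            for (const id of missing) cache.set(cacheKey(runId, id), [])
            for (const item of items) {
                const bucket = cache.get(cacheKey(runId, config.getScenarioId(item)))
                if (bucket) bucket.push(item)
            }
            return {fetched: missing}
        },
        evictByRunId(runId: string): void {
            const prefix = `${config.cacheKeyPrefix}:${runId}:`
            for (const key of Array.from(cache.keys())) {
                if (key.indexOf(prefix) === 0) cache.delete(key)
            }
        },
    }
}
```

A results cache and a metrics cache would then differ only in the element type, the fetcher, and the prefix passed in.
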
- annotationSessionController: collectColumnPathValues and collectDataColumnKeys
  were the same depth-first leaf traversal differing only in accumulator; both
  now delegate to a single walkLeafColumns(data, visit) visitor.
- testsetSync: buildAddToTestsetOperations and remapTargetRowsToBaseRevision
  both built baseRowIds + baseRowIdByDedup from baseRows; extract a shared
  indexBaseRows(baseRows, {guardAmbiguous}) parameterized to preserve each
  caller's exact behavior. guardAmbiguous=true keeps the add-to-testset
  ambiguous-dedup guard; =false keeps the sync path's legacy last-writer-wins
  (the missing guard there is a documented latent gap, left unchanged given the
  AGE-3761 write-back sensitivity).

Behavior-preserving; annotation unit suite green (90 tests).
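
The visitor refactor in the first bullet can be sketched like this. Only the `walkLeafColumns(data, visit)` shape comes from the commit message; the `Sketch`-suffixed names, the dot-joined path format, and the choice to treat arrays as leaves are assumptions for illustration.

```typescript
type LeafVisitor = (path: string[], value: unknown) => void

// Depth-first traversal: plain objects are branches, everything else
// (arrays included, by assumption) is a leaf.
function walkLeafColumnsSketch(data: unknown, visit: LeafVisitor, path: string[] = []): void {
    if (typeof data === "object" && data !== null && !Array.isArray(data)) {
        const record = data as Record<string, unknown>
        for (const key of Object.keys(record)) {
            walkLeafColumnsSketch(record[key], visit, path.concat(key))
        }
        return
    }
    if (path.length > 0) visit(path, data)
}

// Accumulator 1: just the leaf paths.
function collectDataColumnKeysSketch(data: unknown): string[] {
    const keys: string[] = []
    walkLeafColumnsSketch(data, (path) => keys.push(path.join(".")))
    return keys
}

// Accumulator 2: leaf paths mapped to their values.
function collectColumnPathValuesSketch(data: unknown): Record<string, unknown> {
    const values: Record<string, unknown> = {}
    walkLeafColumnsSketch(data, (path, value) => {
        values[path.join(".")] = value
    })
    return values
}
```

Both collectors share one traversal, so a change to what counts as a leaf happens in a single place.
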
…esults fetcher

scenarioStepsBatcherFamily re-implemented POST /evaluations/results/query with
raw axios + manual envelope parsing (results ?? steps) — a duplicate of the
canonical typed/zod queryEvaluationResults the entities layer already owns.
Delegate the network call to queryEvaluationResults; the atomWithQuery shell
keeps caching + live 5s polling and the ScenarioStepsBatchResult/camelCase
output shape is unchanged, so consumers and polling behavior are preserved.

Note: the TanStack caches of the live-polling path and the cache-first
evaluationResultMolecule remain separate by design — the run-details poll needs
a fresh fetch each tick, which the cache-first molecule prefetch would skip.
Full single-cache unification would need a molecule cache-bypass mode + QA;
out of scope here. evaluations unit suite green (133 tests).
evaluationRunPaginatedStore (state/runList) had ZERO production consumers —
only its barrel re-export and one integration test referenced it. The live
run-list is the feature-rich runsTable engine (fetchAutoEvaluationRuns +
previewRunSummary, with subject-filter / fillToLimit / references); the generic
EvaluationListView takes its store as a prop and its sole renderer
(AnnotationQueuesView) passes simpleQueuePaginatedStore, not this one.

Its EvaluationRunTableRow type was a separate same-named shape; the ~35 live
consumers use the runsTable/types.ts EvaluationRunTableRow via
@agenta/evaluations/state/runsTable, unaffected.

Removed: state/runList/ (store + filter atoms), its top-barrel re-export, and
runListStore.integration.test.ts. ~190 LOC. evaluations suite green (133).
…g to -ui

The headless @agenta/evaluations package carried 16 injection seams that only the
relocated VIEWS (run-list + run-details, in @agenta/evaluations-ui) ever read —
URL/route/app-state, saved-queries, current-workflow, metric-blueprint /
resolved-label / evaluator-reference families, workspace-member-by-id,
navigation-request, and the onboarding-widget seams. Pure view/routing concerns
do not belong in the framework-agnostic state package.

Moved those 16 seams + their types into a new
@agenta/evaluations-ui/src/host/runViewInjection.ts with its own
registerRunViewInjections write-atom. The 6 seams the headless runtime atoms
actually read (workspaceMembers, testcaseQueryFamily, referenceResolver,
runInvalidate, clearMetricSelection, annotationTransform) plus the shared
ReferenceQueryResult and Query*Payload types stay in evalRunInjection.ts.

OSS hosts now split registration: register(...) for headless seams +
registerView(...) for view seams. 17 -ui consumers re-pointed to the local
module. evaluations + evaluations-ui green (tsc/lint/133 tests); oss tsc at its
pre-existing 363-error baseline with zero new host/seam errors.

Manual QA: run-list + run-details views (onboarding widget, navigation, URL
focus drawer, metric columns, online-eval start/stop).
… atoms files

Verbatim extraction of pure helpers into sibling files — no logic changes.

- metrics.ts (973 -> 421): pure metric compute/lookup block + the 3 metric types
  moved to metricsCompute.ts (560). metrics.ts keeps the caches, status helpers,
  resolveProjectId/resolveEffectiveRunId atom-getters, and all atoms; re-exports
  the public ScenarioMetricData / RunLevelMetricData types so the API is unchanged.
- scenarioColumnValues.ts (1231 -> 968): pure step/value helpers (getStepKind,
  pickStep, extractStepsByKind, extractStepError, findStepWithError,
  resolveAnnotationValue, …) moved to scenarioColumnValuesHelpers.ts (273). The
  727-line scenarioColumnValueBaseAtomFamily and all public exports stay.

Public API preserved; evaluations tsc+lint+133 unit tests green.

Deferred: runMetrics.ts / metricProcessor.ts splits (owned by the spun-off
metricProcessor-ReferenceError task — would collide). Note: the moved metrics
compute block carries a pre-existing latent `declare const applyAggregatesToRaw`
ReferenceError (sibling of the runMetrics one), preserved verbatim — needs its
own fix.
…ToRaw ReferenceError)

buildRunLevelMetricData referenced an undefined applyAggregatesToRaw (a declare-const
masking a pre-existing, unconditional ReferenceError — migration-plan §11.3 bug #1).
Its only transitive caller, runLevelMetricQueryAtomFamily, was unused (not exported
from any barrel, referenced nowhere) and superseded by runMetrics.ts's own run-level
engine (flattenRunLevelMetricData). Rather than implement a never-called function,
remove the dead path:

- metrics.ts: delete runLevelMetricQueryAtomFamily + its buildRunLevelMetricData /
  RunLevelMetricData imports and re-export.
- metricsCompute.ts: delete buildRunLevelMetricData, applyAggregatesToRaw, and the
  RunLevelMetricData type.

KEPT (live, used by buildGroupedMetrics → scenario metrics): computeAggregatedMetrics,
extractStatTotal, asNumber. Zero runtime change (dead code); evaluations tsc+lint+133
unit tests green.
…appers

Migrate 4 of the annotationFormController raw-axios /evaluations/* calls onto the
typed, zod-validated entities wrappers (Fern under the hood), per web/CLAUDE.md:
- PATCH /evaluations/scenarios/   -> setEvaluationScenarioStatuses
- POST  /evaluations/scenarios/query -> queryEvaluationScenarios
- POST  /evaluations/runs/query   -> queryEvaluationRuns
- PATCH /evaluations/runs/{id}     -> editEvaluationRun
Removed the now-orphaned getAgentaApiUrl()/apiUrl local in checkAndUpdateRunStatus.

Left on raw axios deliberately (documented inline):
- POST /evaluations/results/ — also sends span_id, which the wrapper's typed input
  omits (no backend column); migrating would drop span_id + cascade a param removal
  through the submit-entry flow.
- POST /evaluations/metrics/query + /evaluations/metrics/ — duplicate the
  (also-axios) upsertScenarioMetricData service; no Fern metrics-set wrapper exists.
  These need their own consolidation.
- POST /testsets/revisions/query (annotationSessionController) — intentionally reads
  raw, un-normalized rows to preserve testcase_dedup_id (AGE-3761); a normalizing
  wrapper would reintroduce the dedup duplication bug.

annotation tsc+lint+90 unit tests green.
…-to-packages

main now contains the merged fe-feat/add-evaluators-to-existing-eval base + eval
fixes that landed since this branch diverged. Integrated via merge (not rebase)
to resolve the relocation conflict set once.

Conflict resolutions (main's eval fixes ported onto the relocated package files):
- OverviewView/utils/evaluatorMetrics.ts: took main's id-OR-slug evaluator
  'definition' match ('evaluator name instead of default' fix); widened the local
  EvaluatorDefinitionLike with name?.
- evalRun/atoms/table/columns.ts: kept package eslint header + canonicalizeMetricKey,
  added main's extractMetrics import (schemaless-evaluator type-from-step-schema fix).
- evalRun/atoms/mutations/editEvaluation.ts: full-ported main's reliably-refresh
  improvements (key.includes(runId) surface match, authoritative run-status read,
  settle double-invalidation) and relocated previewRunBatcher (getPreviewRunBatcher /
  invalidatePreviewRunCache) into @agenta/evaluations; kept the injected-seam
  clearMetricSelectionCache to avoid a runsTable<->evalRun cycle.
- agenta-ui/package.json: union (immer ^10.1.3 + main's jotai ^2.16.1); lockfile
  regenerated via pnpm install.
Silent type-breaks from main's Fern api-client regen, fixed:
- createEvaluationRun.ts: EvaluationRunData -> EvaluationRunCreate['data'].

Rename detection paired all moved eval files; no old-OSS eval dirs resurrected.
Eval packages green (tsc+lint; entities 663 / evaluations 133 / annotation 90 tests).
…equest types

The eval wrappers passed request bodies through opaque `as never` casts. Replace
each with a named cast onto the Fern-generated request type (via `as unknown as
AgentaApi.X`), keeping the wrappers' intentionally-loose inputs and the Zod
response boundary unchanged (per web/CLAUDE.md: Fern under-declares extra="allow",
so the local Zod schema stays the drift check):

- editRun        -> AgentaApi.EvaluationRunEdit
- queryRuns      -> AgentaApi.EvaluationRunQueryRequest (both call sites)
- setResults     -> AgentaApi.EvaluationResultsSetRequest["results"]
- queryMetrics   -> AgentaApi.EvaluationMetricsQueryRequest
- editScenarios  -> AgentaApi.EvaluationScenarioEdit[]

Benefit: names the real request type (readability/intent) and gives a compile-time
drift signal if Fern renames/removes it — useful given the eval request surface is
actively changing. No response/entity types touched (those stay Zod by design).
entities tsc+lint+663 unit tests green.
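
As an illustration of the difference between the two casts, here is a small sketch. EvaluationRunEdit below is a local stand-in for AgentaApi.EvaluationRunEdit with hypothetical fields, and sendEditRun and LooseRunEditInput are invented names; none of this is the generated client's actual shape.

```typescript
// Stand-in for the Fern-generated request type (fields are hypothetical).
interface EvaluationRunEdit {
    id: string
    name?: string
}

// The wrapper's intentionally loose input, which accepts extra keys.
interface LooseRunEditInput {
    id: string
    [extra: string]: unknown
}

// Stand-in for the generated client call.
function sendEditRun(body: EvaluationRunEdit): string {
    return JSON.stringify(body)
}

function editRun(input: LooseRunEditInput): string {
    // Before: sendEditRun(input as never)
    // `never` is assignable to every parameter type, so that cast keeps
    // compiling even if the request type is renamed or removed.
    // After: the cast names the request type, so a rename or removal in the
    // generated client becomes a compile error at this line.
    return sendEditRun(input as unknown as EvaluationRunEdit)
}
```

The runtime payload is identical either way; the named cast only changes what the compiler can notice.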
…eferenceError)

runMetrics.ts run-metric-stats queryFn referenced metricProcessor at the
run-level-gap branch, but no such binding exists in that scope — the real
processor is local to the inner processMetrics helper (which already flushed). A
declare-const masked it at type-check; at runtime the branch threw a
ReferenceError whenever a run-level gap existed (no run-level entry + scenario-less
fetched metrics), failing the whole run-metrics query.

Even resolved, it would push a flag onto a throwaway processor never flushed there
(no-op). The legitimate gap-marking already happens inside processMetrics on the
flushed processor. Removed the misplaced branch + the declare-const + the unused
MetricProcessor import. Restores the query from throwing; preserves real behavior.
evaluations tsc+lint+133 tests green.
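
For readers unfamiliar with the failure mode, the sketch below reproduces it in isolation. The declared shape and the markRunLevelGap method are hypothetical; only the pattern (an ambient `declare const` with no runtime binding, reached on one branch) mirrors what the commit describes.

```typescript
// An ambient declaration satisfies the type checker but emits no JavaScript,
// so no binding named `metricProcessor` exists at runtime.
declare const metricProcessor: {markRunLevelGap: () => void}

function runLevelGapBranch(hasRunLevelGap: boolean): string {
    if (hasRunLevelGap) {
        // Type-checks, but throws a ReferenceError as soon as this branch runs.
        metricProcessor.markRunLevelGap()
    }
    return "ok"
}
```

Because the identifier is only evaluated inside the branch, the function works until the gap condition is met, which is why the bug stayed latent.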
… api client

Dead public surface (all verified zero external consumers, tsc/lint/663+90 tests green):
- evaluationRunMolecule: drop the 3 step-reference atomFamilies left behind when
  that logic moved to @agenta/evaluations (stepReferencesByEvaluatorId,
  stepKeysByEvaluatorSlug, scenarioInvocationStepKey — def+selector+get each) +
  the orphaned StepEvaluatorRefs interface; de-export invalidateEvaluationRunCache
  (kept as internal cache.invalidateDetail) + drop its barrel re-exports.
- evaluationScenarioMolecule: drop the unused  selector + imperative get.*
  block (only list/ids/statuses + atoms.query are consumed); kept the query family.
- annotation: drop dead getOutputsSchema/getMetricFieldsFromEvaluator/
  getMetricsFromAnnotation re-exports (real consumers import from @agenta/evaluations;
  re-pointed the one in-package test); drop the duplicate syncToTestset alias.

Dedup: evaluationQueue/api/client.ts was byte-identical to evaluationRun's — re-point
the sole importer at the run client and delete the dup.

~180 LOC removed. Note: canSyncToTestset/canSyncToTestsetAtom also look orphaned —
left pending UI confirmation.
…on god-file

annotationSessionController.ts was 2526 LOC mixing session/queue/scenario state
with ~1100 LOC of add-to-testset + sync-to-testset export orchestration. Move the
export machinery verbatim into a new sibling controllers/addToTestset.ts
(modal/job atoms, export-prep helpers, column-remap family, prepare*ExportRows,
addScenariosToTestsetAtom, sync preview + syncToTestsetsAtom). Pure relocation,
no logic change.

Session controller now 1447 LOC, focused on session state. Shared session atoms it
still owns are exported and imported into addToTestset.ts; the moved atoms/actions
are imported back so the public annotationSessionController object + barrels are
byte-identical. Benign ES-module cycle (refs only inside getters/setters).

annotation tsc+lint+90 tests green.
…narioMetricData

annotationFormController.upsertAnnotationMetrics hand-rolled the same
query-existing -> merge -> upsert flow that @agenta/evaluations
services/metrics.ts upsertScenarioMetricData already ships (and which the eval
run-details annotate flow uses). Keep the annotation-specific value shaping
(buildMetricDataFromValue -> attributes.ag.data.outputs.* under the step key) and
delegate persistence. Added an optional projectId param to upsertScenarioMetricData
so annotation keeps passing its explicit project id (existing callers fall back to
the store read, unchanged).

~55 LOC of duplicated query/merge/POST removed. Behavior delta: existing metrics
are now PATCHed by id (vs POST upsert) — same end state, slightly more correct.
QA: annotation submit (metric write-back) smoke test. evaluations + evaluations-ui
+ annotation tsc/lint green; 90 annotation tests pass.
…aluationRunKind

evalRun/state/evalType.ts hand-declared PreviewEvaluationType = auto|human|online|null,
a near-duplicate of core's EvaluationRunKind (auto|human|online|custom). The detection
logic was already shared (derivedEvalTypeAtomFamily delegates to deriveEvaluationKind);
only the type literal was duplicated. Redefine it as
Exclude<EvaluationRunKind, "custom"> | null so the union has a single source of truth
in core — identical narrow set (the run-details preview never surfaces the custom/SDK
kind), zero behavior/type change. evaluations + evaluations-ui tsc/lint green; 133 tests.
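
A short sketch of the derivation, assuming the core union is exactly the four literals named above. The RunsTableKind alias mirrors the earlier `CoreEvaluationRunKind | "all"` change; the two helper functions are hypothetical and exist only to exercise the types.

```typescript
// Single source of truth (stand-in for the core definition).
type EvaluationRunKind = "auto" | "human" | "online" | "custom"

// Derived rather than restated: adding or renaming a member in
// EvaluationRunKind flows through automatically.
type PreviewEvaluationType = Exclude<EvaluationRunKind, "custom"> | null
type RunsTableKind = EvaluationRunKind | "all"

// Hypothetical helper: the preview never surfaces the custom/SDK kind.
function toPreviewEvaluationType(kind: EvaluationRunKind | null): PreviewEvaluationType {
    return kind === "custom" ? null : kind
}

// Hypothetical helper: "all" matches every kind.
function matchesRunsTableFilter(filter: RunsTableKind, kind: EvaluationRunKind): boolean {
    return filter === "all" || filter === kind
}
```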

Note: a separate, unrelated PreviewEvaluationType (human|online|automatic|
single_model_test) lives in hooks/usePreviewEvaluations — different domain (legacy API
filter), left untouched (same-name footgun worth a future rename).
…fetcher

scenarioData/metrics.ts queried per-scenario metrics with raw axios, bypassing the
entities queryEvaluationMetrics (typed + zod). A single scenario belongs to exactly
one run, so adding the fetcher's run_ids constraint is a redundant, behavior-equivalent
narrowing — swap to queryEvaluationMetrics, dropping the raw axios path (closes the
spun-off scenarioData-metrics chip).

Scope note: the OTHER metric raw-axios paths are intentionally left:
- evalRun/atoms/metrics.ts batcher deliberately omits run_ids for scenario-scoped
  (cross-run comparison) queries to avoid over-filtering — queryEvaluationMetrics
  forces run_ids, so routing it there would regress.
- the /evaluations/metrics/refresh calls have no entities wrapper.

evaluations tsc+lint+133 tests green. QA: scenario metric display in run-details.
…eQueueStatus

Two unrelated types shared the name EvaluationStatus across subpaths: the canonical
run/scenario enum in evaluationRun/core/status.ts (EVALUATION_* + failed/incomplete,
used across OSS) and a different 7-value queue status (pending/queued/running/...) in
simpleQueue/core/schema.ts whose comment falsely claimed it was shared with
EvaluationRun. Same name, different shapes — a real footgun.

Rename the simpleQueue type to SimpleQueueStatus (kept the evaluationStatusSchema Zod
value name) and update its re-exports (simpleQueue + evaluationQueue barrels) and the
3 annotation-ui consumers. The run enum and its OSS consumers + Fern's generated
AgentaApi.EvaluationStatus are untouched. entities (663 tests) + annotation-ui
tsc/lint green.
…+ redundancy fixes

Focused dead-code sweep follow-up. Removed ~767 LOC of exported-but-zero-consumer
symbols (each re-verified across packages+oss+ee before deletion; tsc is the gate):

@agenta/evaluations:
- deleted whole files table/testcases.ts (superseded by molecule path) + services/workerUtils.ts
- dead atoms/helpers: serializeRunIndex/deserializeRunIndex, normalizeEvaluationKindString,
  evaluationMetricBatcherAtom, scenarioStepsBatcherAtom, clearScenarioStatusCache, the
  runDerived app/variant-id cluster, isInvocationRunningAtom, scenarioHasEmbeddedInputsAtomFamily,
  scenarioRowHeightPxAtom, tableScenario{Ids,Offset}AtomFamily, traceUtils extractRootSpanIdFromTraceData/
  findTraceForStep, clearAllBootstrapAttempts, evaluatorOutputTypes get/visibility helpers +
  dead version counter, invalidateMetricSelectionCache, FLAG_LABELS, primePreviewRunCache,
  paginationAtom — plus their barrel re-exports.

@agenta/entities: deleteEvaluationQueues (plural) + queryEvaluationQueueScenarios chain
(schema/type), unused *Molecule typeof-exports, Prefetch{Results,Metrics}{Args,Outcome} aliases.

Redundancy fixes: isEmptyMetrics now uses isEmptyValue; annotationFormController's private
getStore() dropped for the shared one; renamed the colliding hooks PreviewEvaluationType ->
PreviewEvaluationFilterType.

Re-verification SAVED 5 false-positives the broad sweep flagged but are actually used
(extractEvaluatorMetricKeys, getPreviewRunBatcher, invalidatePreviewRunCache,
evaluator{ColumnDefs,StepRefs}AtomFamily via object-map, searchQueryAtom) — kept.

Untouched: etl/ scaffolding (separate audit), annotation sync-to-testset (kept, pending UI),
evaluationQueue module. evaluations/entities/evaluations-ui/annotation tsc+lint+tests green.
etl/ audit: per-symbol consumer analysis (external + internal-non-barrel + test)
across web/. Removed only symbols dead on all three axes; kept everything with any
consumer.

Deleted:
- realScenarioSource.ts (whole file) — makeRealScenarioSource + types, 0/0/0
- cacheAwareFetchers.ts (whole file) — buildMoleculeBackedFetchers /
  MOLECULE_BACKED_HYDRATE_FETCHERS / cacheAwareFetchTestcases, 0/0/0
- hydrateScenariosTransform.ts: the makeHydrateScenariosTransform +
  DEFAULT_HYDRATE_FETCHERS cluster (kept the 3 live shared type exports)
- cacheDiagnostics.ts: inspectMemory + MemorySnapshot (kept inspectCache/clearCacheByPrefix)
- etl/index.ts: dropped the @agenta/entities/shared passthrough re-export block
  (every consumer imports those directly from entities, none via the etl barrel)

KEPT (verified live via evaluations-ui / internal etl / package state / tests):
resolveMappings + resolvers, rowPredicateFilter, runReferenceFilter, filterSchema,
hitRatioMeter, predicateToEntitySlices, all filtering/* hooks, inspectCache. Same-named
RunStep/RunMapping/ColumnGroup competing decls confirmed distinct.

evaluations + evaluations-ui + entities tsc/lint green; 133 evaluations tests pass.
Third main integration on this branch (80 commits, incl. release/v0.103.5,
OSS invite hardening, single-project batch fetchers, cascade evaluator
selector, table header-scroll-sync).

Conflicts resolved, porting main's landed fixes onto the relocated package
files (main still edits the OLD oss eval paths this branch moved):

- Playground/PlaygroundHeader: kept our re-point to openWorkflowRevisionDrawerAtom
  (the evaluatorDrawerStore compat bridge was deleted in WP-4 residue B; 3
  consumers call the underlying playground-ui atom directly). Took main's
  isolated-playground evaluator-create feature (currentAppSelection,
  handleCreatedEvaluator) and threaded its new params
  (isolatedPlayground/initialAppSelection/postCreateNavigation/onWorkflowCreated)
  through context: "evaluator-create" instead of the bridge's mode: "create".
- agenta-ui InfiniteVirtualTableInner: ported main's .ant-table-header /
  .ant-table-body scroll-sync useEffect (#4697) into the relocated package copy;
  removed the leftover deleted oss original.
- state/evaluator/evaluatorDrawerStore: accepted our deletion. main's added
  drawer params already merged into the underlying
  @agenta/playground-ui workflow-revision-drawer store, so no porting needed.

Auto-merged eval package files verified to carry main's changes:
- evaluations-ui RunDetails/Page: "SDK Evals" / kind:"custom" typeMap entry
  (no regression against the Q6 PreviewEvaluationType narrowing).
- entities evaluationRun molecule: single-project runBatchFetcher rewrite.
- annotation-ui CreateQueueDrawer: multi-select evaluator picker props.

Gates: package tsc 0 errors (entities/evaluations/evaluations-ui/annotation/
annotation-ui/ui/playground-ui); lint 0; tests evaluations 133, entities 669,
annotation 90. OSS tsc 355 (≤ pre-merge baseline ~363), no new signatures from
touched files.
Drop 4 ungated/untagged console.log leftovers and 2 dead commented-out
console blocks. Behaviour-only logging cleanup, no logic change.

Removed:
- metrics.ts triggerMetricsRefresh success log (kept the failure warn)
- useAnnotationState baseline-change + remaining-edits debug logs
- export/referenceResolvers stray console.log("slot")
- runMetrics dead "entry.needsTemporal" comment
- metricProcessor dead "flush called" comment block

Deliberately kept (guarded dev diagnostics / facilities, not noise):
- metricProcessorDebug isDev-gated logger; process.env.NODE_ENV-guarded
  [HUMAN_EVAL_REFRESH_LOG] / [EvalRunDetails2] diagnostics; buildRunIndex
  shouldLogDetails-gated debug; traces.ts debug facility; logExportAction
  helper; NEXT_PUBLIC_EVAL_RUN_DEBUG-parked blocks; catch-block error logs.

Gates: evaluations + evaluations-ui types=0, lint clean, 133 tests pass.
… them

EvaluationRunsTableStoreProvider mirrors injected eval-view seam atoms from
the parent store into its scoped store via store.set(atom, parentValue).
Several of those atoms hold a FUNCTION value (the query/metric/member
families, factories, the online-evaluations api). jotai's primitive set()
treats a function argument as a state updater and CALLS it, so the mirror
ran e.g. queriesQueryFamily(null) — crashing in the family's {payload}
destructure on the apps overview page — and silently corrupted every other
function-valued seam (storing factory(prev) instead of the factory).

Wrap function values in a constant updater when mirroring (both the initial
seed and the live sync), matching how the host registers them via
set(atom, () => v).
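
The pitfall and the fix can be reproduced without jotai. The sketch below uses a tiny stand-in store (createMiniStore, an invented name) that follows the same convention as a primitive atom's setter: a function argument is treated as an updater and called with the previous value. The mirrorNaive and mirrorSafe helpers are hypothetical, and the queriesQueryFamily body is made up for illustration.

```typescript
// Stand-in store: a function passed to set() is called as an updater.
function createMiniStore<V>(initial: V) {
    let value = initial
    return {
        get: (): V => value,
        set: (update: V | ((prev: V) => V)): void => {
            value = typeof update === "function" ? (update as (prev: V) => V)(value) : update
        },
    }
}

type MiniStore<V> = ReturnType<typeof createMiniStore<V>>

// A function-valued seam, e.g. a query family that destructures its argument.
type QueryFamily = (params: {payload: string}) => string
const queriesQueryFamily: QueryFamily = ({payload}) => `query:${payload}`

// Naive mirror: the function value is mistaken for an updater and invoked.
function mirrorNaive<V>(target: MiniStore<V>, value: V): void {
    target.set(value)
}

// Safe mirror: wrap function values in a constant updater so they are stored.
function mirrorSafe<V>(target: MiniStore<V>, value: V): void {
    if (typeof value === "function") {
        target.set(() => value)
    } else {
        target.set(value)
    }
}
```

With a scoped store seeded to null, mirrorNaive calls the family with null and fails in the destructure, while mirrorSafe stores the family itself.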
@ardaerzin ardaerzin requested a review from ashrafchowdury June 19, 2026 13:40
@vercel

vercel Bot commented Jun 19, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agenta-documentation | Ready | Preview, Comment | Jun 19, 2026 1:40pm |


@coderabbitai

coderabbitai Bot commented Jun 19, 2026


Important

Review skipped

Draft detected.


