Skip to content

feat(cli,api): migrate CLI and REST server to consume aicr.Client#1104

Closed
hkii wants to merge 41 commits into
NVIDIA:mainfrom
hkii:feat/cli-api-consume-client
Closed

feat(cli,api): migrate CLI and REST server to consume aicr.Client#1104
hkii wants to merge 41 commits into
NVIDIA:mainfrom
hkii:feat/cli-api-consume-client

Conversation

@hkii

@hkii hkii commented May 29, 2026

Copy link
Copy Markdown
Member

Summary

Route the CLI and API server entirely through the public aicr.Client facade, and replace the process-global DataProvider/criteria registry with per-Client state. This makes the facade the single integration surface for both internal consumers and external Go importers, and lets multiple Clients resolve recipes concurrently without clobbering shared globals.

Motivation / Context

AICR maintained three independent component-initialization paths: the CLI used the deprecated recipe.SetDataProvider process-global; the REST server directly constructed recipe.Builder, bundler.Bundler, and validator.Validator; and the facade used the new per-instance aicr.Client. Every change to wiring policy had to be applied in all three places. Per the project principle that "business logic lives in functional packages so CLI and API can both use it," the facade is the proper single aggregation point.

This PR migrates the CLI and REST server to consume aicr.Client as thin presentation layers, retires the legacy global-state and direct-construction wiring, and extends the facade (rather than bypassing it) where new capabilities were needed — preserving behavioral parity across all CLI flags and REST endpoints.

Fixes: #1077
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Validator (pkg/validator)
  • Docs/examples (docs/, examples/)
  • Other: public facade (aicr.go, bundle.go, query.go, options.go, types.go)

Implementation Notes

Facade surface (new):

  • Client.LoadRecipe (provider-bound file load), MakeBundle (full deployer-mode bundle), ResolveRecipeFromSnapshot (constraint eval vs a snapshot), ResolveRecipeFromCriteria, SelectFromRecipe (hydrate + select), LoadCatalog, Resolved(), plus CriteriaRegistry accessor.
  • New options: WithVersion, WithAllowLists + ParseAllowListsFromEnv (with enforcement), EmbeddedSource for embedded-only data, and validation options WithValidationPhases/Timeout/Commit/ImageRegistryOverride/ImageTagOverride.
  • Lossless type aliases (Recipe/AllowLists/Criteria).

Per-Client isolation (replaces process global):

  • pkg/recipe gains a per-DataProvider criteria registry with a global shim for back-compat; LoadFromFileWithProvider and catalog.Load now take a DataProvider instead of reading a hidden global. pkg/config adds ResolveCriteriaWithRegistry for per-provider validation.
  • CLI initDataProvider is dropped; recipe/query/mirror/bundle/validate and API /v1/recipe /v1/query /v1/bundle all consume aicr.Client. bundler.New is removed from server.go.

Correctness fixes folded in:

  • Deep-copy ComponentRef.Overrides nested values in clone, and deep-copy the recipe in AdoptRecipe, to preserve Client isolation (no slice/map aliasing across overlay merges).
  • validate preserves structured error codes via PropagateOrWrap.
  • Validation timeout is now opt-in (*time.Duration): the CLI passes 0 to uncap so an all-phase run isn't cut short by a fixed cap; controllers keep the default. Tolerations use an explicit tolerationsSet flag so a nil/empty override correctly clears the validator's default tolerate-all instead of being dropped.

Testing

make qualify

New/expanded coverage: aicr_test.go, aicr_internal_test.go, bundle_test.go, pkg/api/{recipe,bundle}_handler_test.go, pkg/recipe/{criteria_registry,loader_provider}_test.go, pkg/validator/catalog/catalog_test.go.

Risk Assessment

  • Low
  • Medium — Touches CLI, API, recipe engine, and validator, and changes how the DataProvider/criteria registry is wired. Behavior is preserved via a global shim and defensive copies; no external API contract changes.
  • High

Rollout notes: No migration required for external importers — the aicr.Client surface is additive and the prior process-global path retains a back-compat shim. Concurrent-Client isolation is a v0.12+ guarantee (documented in aicr.go); long-running consumers should retain and reuse Clients and call Close() to avoid monotonic cache growth.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (docs/integrator/go-library.md, docs/integrator/public-api.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

hkii added 30 commits May 28, 2026 16:20
… ResolveRecipeFromCriteria, SelectFromRecipe
LoadRecipe loads a recipe file (or cm:// ConfigMap URI) through the
Client's own DataProvider, hydrating overlay inputs against that
provider and returning an owner-stamped facade *RecipeResult ready for
ValidateState / BundleComponents.
Add WithValidationPhases to restrict a validation run to a subset of
phases (default unchanged: all phases in order). ValidateState now also
threads the Client's DataProvider into the validator so the catalog
loads from this Client's recipe source instead of the package global.
Add (*RecipeResult).Resolved() so internal consumers can reach the full
pkg/recipe.RecipeResult behind the lossy facade shape (constraints,
validation config, deployment order) needed for phase warnings and
evidence emission.

Add three ValidateOptions translating to validator options:
WithValidationCommit, WithValidationImageRegistryOverride, and
WithValidationImageTagOverride. Empty string is the unset sentinel,
matching the validator's own empty-as-noop handling.

Reject unknown WithValidationPhases values with ErrCodeInvalidRequest
before any cluster work so a typed Phase typo fails closed rather than
degrading to an empty/skipped run.
Add coverage proving catalog.Load reads the validator catalog from the
DataProvider it is handed rather than the package global, and update the
v1 README examples to the catalog.Load(dp, version, commit) signature.

This is the catalog-side half of letting the validate evidence path use
the command's data provider instead of the recipe package global, so a
--data overlay resolves the validator catalog against the command's
source.
Drive the validate command through the aicr.Client facade. The command
builds a Client from the resolved data source (--data flag, else
spec.recipe.data; filesystem-layered or embedded), loads the recipe via
client.LoadRecipe, and runs the phases via client.ValidateState with the
resolved CLI knobs mapped to WithValidation* options (namespace, run ID,
cleanup, image-pull-secrets, no-cluster, tolerations, node selector,
commit, image registry/tag overrides, and phases when explicitly
selected).

This removes the initDataProvider call from the validate Action, so
validate no longer mutates the recipe package global: the Client owns its
DataProvider. The same provider is built once and threaded into evidence
emission (emitRecipeEvidence -> catalog.Load) so a --data overlay
resolves the attestation catalog against the command's source.

The phase-warning helper and evidence emission need the full
pkg/recipe.RecipeResult, reached via the new RecipeResult.Resolved()
accessor. LoadRecipe wraps loaded recipes with a nil-criteria-tolerant
constructor so already-hydrated and bare RecipeResult files load the same
way recipe.LoadFromFile accepted them.
Add Client.MakeBundle plus BundleOptions / BundleConfig / BundleAttester /
BundleArtifact to the facade. MakeBundle wraps bundler.New + Make so callers
get the same full deployer-mode bundle artifact the CLI bundle command and
REST /v1/bundle handler produce today, with the Client's own DataProvider
(via recipe.Resolved()) and allowlist fencing applied.
Route the bundle command through the aicr.Client facade: build a Client from
the resolved data source (--data / spec.recipe.data), load the recipe via
client.LoadRecipe, and generate the bundle via client.MakeBundle. Drops the
process-global initDataProvider and the direct bundler.New / Make calls;
output is unchanged.
Route mirror list's --recipe branch through the aicr.Client facade: a Client
built from the command's data source loads the recipe via client.LoadRecipe,
and rec.Resolved() yields the raw *recipe.RecipeResult the Lister needs. The
criteria branch (buildRecipeFromCmdWithConfig) and initDataProvider remain
unchanged because they still depend on the process-global criteria registry,
de-globalized by NVIDIA#987 (Stage 4).
…rver.go

Add a pkg/api bundleHandler that reproduces bundler.HandleBundles exactly —
same method gate, body decode, allowlist pre-check, query-param parsing, and
zip response — but routes through the aicr.Client facade (AdoptRecipe +
MakeBundle) instead of constructing a bundler directly. server.go now builds
no recipe.Builder and no bundler.Bundler, completing NVIDIA#1077 criterion NVIDIA#2.

Supporting changes:
- recipe.RecipeResult.BindDataProvider: exported binder so the facade can
  adopt an externally-decoded RecipeResult onto the Client's provider.
- aicr.Client.AdoptRecipe / adoptRecipe: wrap a raw RecipeResult into a
  Client-owned, provider-bound, owner-stamped result for MakeBundle.
- bundler.ParseBundleConfig / StreamZipResponse: exported boundaries so the
  facade-backed handler shares one config-construction and zip-streaming site
  with the legacy handler (no drift).
- Parity test asserts identical status, content headers, and zip layout vs
  the legacy bundler.HandleBundles across deployer modes.
MakeBundle previously wrapped the context with a fixed
defaults.BundleHandlerTimeout (~60s), which the CLI bundle command
inherited — a regression for large bundles, --vendor-charts, and
attestation/signing that can each exceed 60s.

Add an opt-in BundleOptions.Timeout field: > 0 caps the run (bounded by
the smaller of it and any tighter caller deadline); 0 imposes no
facade-level cap and runs under the caller's ctx as-is. The REST
/v1/bundle handler sets Timeout = defaults.BundleHandlerTimeout to keep
its 60s request boundary; the CLI leaves it unset for uncapped bundles.
A prior change turned the exported catalog.Load(version, commit) into
Load(dp, version, commit), breaking the exported API. Rename the
provider-aware form to LoadWithDataProvider(dp, version, commit) and
re-add Load(version, commit) as a thin back-compat wrapper that passes a
nil provider (global fallback, preserving original behavior).

Internal provider-aware callers (validator.go, validate_evidence.go) use
LoadWithDataProvider; simple-path tests and README examples keep
Load(version, commit).
Replace errors.Wrap(ErrCodeInternal, ...) with PropagateOrWrap at three
validate-path sites so an inner *StructuredError's code survives instead
of being overwritten with ErrCodeInternal:
- validate_evidence.go: catalog.LoadWithDataProvider error
- validate.go: client.ValidateState error
- validate.go: aicr.NewClient error
- types.go: (*RecipeResult).Resolved() returns nil for a nil receiver
  instead of panicking.
- aicr_test.go: allowlist-rejection test now asserts the structured
  error carries ErrCodeInvalidRequest, mirroring the invalid-phase test.
- docs/integrator/go-library.md: reword the kubeconfig sentence to
  disambiguate which "path" the ConfigMap-URI condition refers to.
adoptRecipe bound the Client's DataProvider directly onto the
caller-supplied *Recipe, mutating caller-owned state. A caller reusing
one *Recipe across two Clients would have the second adopt overwrite the
first's binding. Add RecipeResult.DeepCopy (deep-copies all exported
fields, leaves the unexported provider nil for BindDataProvider) and
adopt against the copy so neither the caller's recipe nor a sibling
Client's binding is contaminated.
…ons, opt-in timeout

- Thread the Client's version into the validator so :latest images are
  rewritten to the release tag and AICR_CLI_VERSION is populated (the
  pre-facade CLI passed validator.WithVersion).
- Track tolerationsSet so WithValidationTolerations(nil) is an explicit
  override that clears the validator's default tolerate-all, instead of
  being dropped as "unset".
- Make the facade validation deadline opt-in via WithValidationTimeout:
  nil keeps the 60m default (controllers), 0 imposes no cap (per-validator
  timeouts govern), >0 sets an explicit cap — matching MakeBundle.
hkii added 8 commits May 29, 2026 09:26
Pass aicr.WithValidationTimeout(0) so CLI validation is uncapped at the
facade level (per-validator timeouts govern), and rely on the now-always
-emitted WithValidationTolerations override to clear the validator's
default tolerate-all — restoring the pre-facade CLI behavior for
all-phase runs that include the long inference-perf check.
…irror

Explain that validate's second DataProvider exists for evidence emission
(resolving against the command source, not the global), and log the
external data directory in mirror's --recipe path for observability
parity with validate/bundle.
De-globalize the criteria registry so it is cached per-DataProvider,
mirroring the existing component-registry pattern. The package-global
path stays working as a back-compat shim.

- Add criteriaRegistryCache (sync.Map keyed by DataProvider) plus
  GetCriteriaRegistryFor / EvictCachedCriteriaRegistry and the
  _ForTesting helpers. DefaultRegistry now delegates to
  GetCriteriaRegistryFor(GetDataProvider()) so the global and
  embedded-provider paths share one registry.
- seedCriteriaRegistry takes a DataProvider and registers into that
  provider's registry; buildMetadataStore threads its provider through,
  so loading a provider's metadata store seeds THAT provider's criteria
  registry.
- Lift the parse + All*Types cores onto *CriteriaRegistry methods
  (ParseService/Accelerator/Intent/OS/Platform, All*Types). The
  package-level ParseCriteria*Type / AllCriteria*Types / BuildCriteria
  delegate to DefaultRegistry() so existing callers compile unchanged.
- Add the registry-aware build path the CLI will consume:
  BuildCriteriaWithRegistry + WithServiceRegistry/... options that
  resolve against an explicit registry.
Add (*Client).LoadCatalog to eagerly seed THIS Client's per-provider
criteria registry (via recipe.LoadMetadataStoreFor) and
(*Client).CriteriaRegistry to return that registry, so CLI/library
callers parse and validate criteria against the same DataProvider the
Client resolves with. Includes a CriteriaRegistry type alias.

Tests cover embedded-vs-filesystem registry isolation, strict-mode
hiding of external values, and the nil-context/closed-Client guards.
Add (*RecipeSpec).ResolveCriteriaWithRegistry so config-sourced recipe
criteria validate against a caller-supplied per-provider criteria
registry instead of the package global. ResolveCriteria stays as a shim
over DefaultRegistry() for existing callers. This lets a --data overlay's
non-OSS criteria value supplied via spec.recipe.criteria validate the
same way the CLI flag path does.
…ider

Migrate the recipe, query, and mirror (criteria path) commands off the
process-global recipe data provider and global criteria registry onto a
per-command aicr.Client with a per-provider criteria registry.

- Add recipeClientFromCmd in root.go: builds an aicr.Client bound to the
  command's resolved data source (--data / spec.recipe.data, else
  embedded), replacing initDataProvider, which is removed.
- Thread a *recipe.CriteriaRegistry through buildRecipeFromCmdWithConfig,
  mergeCriteriaFromCmdAndConfig, buildCriteriaFromCmd, applyCriteriaOverrides,
  and applyCriteriaFromConfig so every criteria parse resolves against the
  Client's own provider registry (via BuildCriteriaWithRegistry /
  reg.Parse* / ResolveCriteriaWithRegistry).
- recipe/query/mirror each build a Client, defer Close, and call
  client.LoadCatalog to seed their provider registry before parsing; query
  now seeds explicitly (fixing a latent lazy-global ordering bug). Strict
  mode applies to the Client's registry. Snapshot path resolves through
  client.ResolveRecipeFromSnapshot, criteria path through
  client.ResolveRecipeFromCriteria.
- mirror builds one Client for both --recipe (LoadRecipe) and criteria
  paths; initDataProvider removed from mirror entirely.

Retarget the former TestInitDataProvider_* tests at recipeClientFromCmd
and pass DefaultRegistry() at the unit-test criteria-helper call sites
(OSS canonical values resolve via the registry fast path). No
initDataProvider/SetDataProvider references remain in pkg/cli.
@copy-pr-bot

copy-pr-bot Bot commented May 29, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on, @hkii — you've been carrying the facade thread since #1072 and this PR is the logical next step. Direction is right and matches the project principle that the facade should be the single aggregation point. A few architectural shape questions before line-level review starts; flagging them now since the PR is still draft.

1. Two parallel public-API tiers on the Client.
types.go introduces transparent aliases (type Recipe = recipe.RecipeResult) alongside the existing facade-owned shape (type RecipeResult struct {…}), and the surface has parallel resolve methods returning each (ResolveRecipe → *RecipeResult vs ResolveRecipeFromCriteria/FromSnapshot → *Recipe). The Stability note on Snapshot already concedes the alias tier "inherits pkg/snapshotter's Public (evolving) tier rather than the facade's stable tier — field-shape changes propagate without notice," and a comment points to #1078 as the future wrap. The risk: this PR ships the alias surface that #1078 will need to wrap, and any caller built against *Recipe between now and #1078 has to migrate twice. Suggest landing #1078 first so the facade-owned wrapper protects external Go importers; then this PR consumes the stable types.

2. The PR bundles a wiring refactor with several behavior changes.
Title says "migrate CLI and API to consume aicr.Client" but the implementation notes also include: deep-copy fix in AdoptRecipe + ComponentRef.Overrides (fixes a prior aliasing bug — i.e., behavior change), validation timeout time.Duration → *time.Duration with the CLI passing 0 to uncap (real behavior change for CLI users), tolerations explicit-set flag (semantic change), and validate propagating structured codes via PropagateOrWrap. Each of those is independently bisectable and reviewable. Bundling them with a 4,900-line refactor across 41 files in 38 commits makes regression hunting much harder if something breaks downstream. The "preserves behavioral parity" framing also fights the substance — uncapping validation timeout in the CLI is a deliberate change worth calling out plainly.

Suggested split:

  • Pre-PR: each correctness fix on its own commit with a regression test (AdoptRecipe deep-copy is a particularly good standalone candidate — small, testable, ships behavior improvement without coupling to wiring).
  • Pre-PR: per-DataProvider criteria registry (replaces the process global) with the back-compat shim's deprecation marker and a tracking comment pointing at #987 for removal.
  • This PR (slimmed): only the CLI/API migration onto the now-stable Client surface. Should drop 1,500–2,000 lines once the above ship separately.

3. Back-compat shim sunset plan.
"Per-DataProvider criteria registry with a global shim for back-compat" — shims tend to be sticky, and #987 is the sibling that removes the global entirely. Worth confirming the shim is flagged with //nolint:staticcheck + a Deprecated: doc tag pointing at #987, and that there's a test asserting no in-tree caller still touches the global. Otherwise we end up with a permanent dual code path.

4. PR mechanics. None of this is substance; flagging for the squash-and-rebase pass:

  • Squash: 38 commits → 1 per logical change (after splitting per concern #2 above)
  • Rebase: mergeable_state: dirty after #1100/#1101/#1102 landed
  • Sign: git commit -S (checkbox is unchecked)
  • make qualify checkbox: unchecked — confirm pass locally before flipping to ready-for-review

What's clearly good and worth keeping: the direction (single integration surface, per-instance state), the docs updates landing in the same PR (go-library.md, public-api.md), the per-file test coverage discipline, and the Close() + per-Client guarantee documented in rollout notes. The architectural boundaries are respected — no kubectl orchestration, no in-cluster controllers, no new deployers.

Happy to pair on any of the splits, especially the #1078 sequencing question. Thanks again for the work.

@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

Adds per-Client version and allowlist fields to the aicr Client; threads version into recipe builders; implements LoadCatalog and CriteriaRegistry for per-provider registries; centralizes allowlist enforcement and shared resolveCriteria for criteria/snapshot resolution; binds data providers into loaded RecipeResults and adds DeepCopy/Bind helpers; adds selected-phase validation and provider-aware validator/catalog loading; introduces Client.AdoptRecipe/MakeBundle and bundler streaming/config helpers; migrates CLI and server to construct/use per-command/server Clients; and adds extensive tests and documentation updates.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

Suggested labels

enhancement, area/recipes, area/tests

Suggested reviewers

  • mchmarny
  • lalitadithya
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/bundle.go`:
- Around line 605-626: The client construction block duplicates logic in
recipeClientFromCmd; replace the inline data-dir resolution and aicr.NewClient
call with a single call to recipeClientFromCmd(cmd, cfg), capture the returned
client and error, and keep the same defer to close the client (defer func() { _
= client.Close() }()). Ensure you remove the duplicated variables/dataDir logic
and still return the wrapped error from the recipeClientFromCmd failure path so
behavior and logging (including the "initializing external data provider"
message) remain identical.

In `@pkg/cli/validate.go`:
- Around line 747-771: This code duplicates client construction logic from
recipeClientFromCmd; replace the block that builds dataDir/source and calls
aicr.NewClient with a single call to recipeClientFromCmd (passing the command
context/flags and cfg and version as required), handle its returned client and
error the same way, and keep the defer client.Close() cleanup; ensure you remove
the local dataDir/source logic so the shared recipeClientFromCmd is the single
source of truth.

In `@pkg/recipe/criteria_registry.go`:
- Around line 235-258: The cache currently keys only by DataProvider identity
(criteriaRegistryCache and GetCriteriaRegistryFor), which preserves stale
registries when a mutable provider's internal state changes; change the cache
key to include the provider's generation counter so a provider state change
creates a new entry. Concretely, in GetCriteriaRegistryFor compute a composite
key (e.g., struct{dp DataProvider; gen uint64} or a string like
fmt.Sprintf("%p-%d", dp, gen)) where gen is obtained from the provider's
generation accessor (e.g., dp.Generation() or the appropriate
GetGeneration/GetVersion method), then LoadOrStore using that composite key and
keep the rest (criteriaRegistryCacheEntry, once, newCriteriaRegistry) unchanged;
apply the same keying change to the other cache use sites noted around the
299-301 region.

In `@pkg/validator/validator.go`:
- Around line 182-185: The catalog.LoadWithDataProvider call currently always
rewraps any error as errors.ErrCodeInternal (in the block around
catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)), which hides
upstream structured codes; change the handling to propagate coded errors
unmodified and only wrap unknown errors: after the call, detect if the returned
err already carries a structured code (the package's coded/error type), and if
so return it directly, otherwise wrap with errors.Wrap(errors.ErrCodeInternal,
"failed to load validator catalog", err). Apply the same propagate-or-wrap
change to the other occurrence around lines 240-243 so existing catalog error
codes are preserved.

In `@query.go`:
- Around line 29-32: The current handling of errors from recipe.HydrateResult
always wraps the error with errors.ErrCodeInternal, which can overwrite existing
structured error codes; replace the return logic to use errors.PropagateOrWrap
(or the library's propagate helper) so that if the hydrated error already
carries a code it is propagated unchanged, otherwise it is wrapped with a
contextual message like "hydrate recipe" and ErrCodeInternal—update the call
site where recipe.HydrateResult(r) is checked to use errors.PropagateOrWrap(err,
errors.ErrCodeInternal, "hydrate recipe") instead of errors.Wrap.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0648c27e-82ec-46d1-8f17-7c3aa625303d

📥 Commits

Reviewing files that changed from the base of the PR and between 4daf2af and 091e11f.

📒 Files selected for processing (41)
  • aicr.go
  • aicr_internal_test.go
  • aicr_test.go
  • bundle.go
  • bundle_test.go
  • docs/integrator/go-library.md
  • docs/integrator/public-api.md
  • options.go
  • pkg/api/bundle_handler.go
  • pkg/api/bundle_handler_test.go
  • pkg/api/recipe_handler.go
  • pkg/api/recipe_handler_test.go
  • pkg/api/server.go
  • pkg/api/server_test.go
  • pkg/bundler/handler.go
  • pkg/cli/bundle.go
  • pkg/cli/config_integration_test.go
  • pkg/cli/mirror.go
  • pkg/cli/query.go
  • pkg/cli/recipe.go
  • pkg/cli/recipe_test.go
  • pkg/cli/root.go
  • pkg/cli/validate.go
  • pkg/cli/validate_evidence.go
  • pkg/config/resolve.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_registry.go
  • pkg/recipe/criteria_registry_test.go
  • pkg/recipe/handler_query.go
  • pkg/recipe/handler_query_test.go
  • pkg/recipe/loader.go
  • pkg/recipe/loader_provider_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_store.go
  • pkg/validator/catalog/catalog.go
  • pkg/validator/catalog/catalog_test.go
  • pkg/validator/options.go
  • pkg/validator/types.go
  • pkg/validator/validator.go
  • query.go
  • types.go

Comment thread pkg/cli/bundle.go
Comment on lines +605 to +626
// Build the Client from the resolved data source (--data flag, else
// spec.recipe.data). A filesystem source layers the external dir over
// the embedded data; an empty data dir uses the embedded data only. The
// Client owns its DataProvider — LoadRecipe and MakeBundle thread it
// through, replacing the old process-global data provider.
dataDir := cmd.String("data")
if dataDir == "" {
dataDir = cfg.Recipe().DataDir()
}
source := aicr.EmbeddedSource()
if dataDir != "" {
slog.Info("initializing external data provider", "directory", dataDir)
source = aicr.FilesystemSource(dataDir)
}
client, err := aicr.NewClient(
aicr.WithRecipeSource(source),
aicr.WithVersion(version),
)
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to initialize data provider", err)
}
defer func() { _ = client.Close() }()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Extract shared client construction to avoid duplication.

This client construction block duplicates recipeClientFromCmd in root.go (same data-dir resolution, source selection, logging message, and error wrapping). Consider calling recipeClientFromCmd(cmd, cfg) here instead of reimplementing the logic inline.

♻️ Suggested refactor
-	// Build the Client from the resolved data source (--data flag, else
-	// spec.recipe.data). A filesystem source layers the external dir over
-	// the embedded data; an empty data dir uses the embedded data only. The
-	// Client owns its DataProvider — LoadRecipe and MakeBundle thread it
-	// through, replacing the old process-global data provider.
-	dataDir := cmd.String("data")
-	if dataDir == "" {
-		dataDir = cfg.Recipe().DataDir()
-	}
-	source := aicr.EmbeddedSource()
-	if dataDir != "" {
-		slog.Info("initializing external data provider", "directory", dataDir)
-		source = aicr.FilesystemSource(dataDir)
-	}
-	client, err := aicr.NewClient(
-		aicr.WithRecipeSource(source),
-		aicr.WithVersion(version),
-	)
-	if err != nil {
-		return errors.Wrap(errors.ErrCodeInternal, "failed to initialize data provider", err)
-	}
+	client, err := recipeClientFromCmd(cmd, cfg)
+	if err != nil {
+		return err
+	}
 	defer func() { _ = client.Close() }()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Build the Client from the resolved data source (--data flag, else
// spec.recipe.data). A filesystem source layers the external dir over
// the embedded data; an empty data dir uses the embedded data only. The
// Client owns its DataProvider — LoadRecipe and MakeBundle thread it
// through, replacing the old process-global data provider.
dataDir := cmd.String("data")
if dataDir == "" {
dataDir = cfg.Recipe().DataDir()
}
source := aicr.EmbeddedSource()
if dataDir != "" {
slog.Info("initializing external data provider", "directory", dataDir)
source = aicr.FilesystemSource(dataDir)
}
client, err := aicr.NewClient(
aicr.WithRecipeSource(source),
aicr.WithVersion(version),
)
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to initialize data provider", err)
}
defer func() { _ = client.Close() }()
client, err := recipeClientFromCmd(cmd, cfg)
if err != nil {
return err
}
defer func() { _ = client.Close() }()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/bundle.go` around lines 605 - 626, The client construction block
duplicates logic in recipeClientFromCmd; replace the inline data-dir resolution
and aicr.NewClient call with a single call to recipeClientFromCmd(cmd, cfg),
capture the returned client and error, and keep the same defer to close the
client (defer func() { _ = client.Close() }()). Ensure you remove the duplicated
variables/dataDir logic and still return the wrapped error from the
recipeClientFromCmd failure path so behavior and logging (including the
"initializing external data provider" message) remain identical.

Comment thread pkg/cli/validate.go
Comment on lines +747 to +771
// Build the Client from the resolved data source (--data flag,
// else spec.recipe.data). A filesystem source layers the external
// dir over the embedded data; an empty data dir uses the embedded
// data only. The Client owns its DataProvider, replacing the old
// process-global data provider. dataProvider is the
// same provider, built once so it can also be threaded into evidence
// emission (so --data overlays resolve the validator catalog against
// the command's source rather than the package global).
dataDir := cmd.String("data")
if dataDir == "" {
dataDir = cfg.Recipe().DataDir()
}
source := aicr.EmbeddedSource()
if dataDir != "" {
slog.Info("initializing external data provider", "directory", dataDir)
source = aicr.FilesystemSource(dataDir)
}
client, err := aicr.NewClient(
aicr.WithRecipeSource(source),
aicr.WithVersion(version),
)
if err != nil {
return errors.PropagateOrWrap(err, errors.ErrCodeInternal, "failed to initialize data provider")
}
defer func() { _ = client.Close() }()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consolidate client construction with recipeClientFromCmd.

This block duplicates the client construction logic from recipeClientFromCmd in root.go. Consider reusing that helper to avoid drift.

♻️ Suggested refactor
-	// Build the Client from the resolved data source (--data flag,
-	// else spec.recipe.data). A filesystem source layers the external
-	// dir over the embedded data; an empty data dir uses the embedded
-	// data only. The Client owns its DataProvider, replacing the old
-	// process-global data provider. dataProvider is the
-	// same provider, built once so it can also be threaded into evidence
-	// emission (so --data overlays resolve the validator catalog against
-	// the command's source rather than the package global).
-	dataDir := cmd.String("data")
-	if dataDir == "" {
-		dataDir = cfg.Recipe().DataDir()
-	}
-	source := aicr.EmbeddedSource()
-	if dataDir != "" {
-		slog.Info("initializing external data provider", "directory", dataDir)
-		source = aicr.FilesystemSource(dataDir)
-	}
-	client, err := aicr.NewClient(
-		aicr.WithRecipeSource(source),
-		aicr.WithVersion(version),
-	)
-	if err != nil {
-		return errors.PropagateOrWrap(err, errors.ErrCodeInternal, "failed to initialize data provider")
-	}
+	client, err := recipeClientFromCmd(cmd, cfg)
+	if err != nil {
+		return err
+	}
 	defer func() { _ = client.Close() }()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/validate.go` around lines 747 - 771, This code duplicates client
construction logic from recipeClientFromCmd; replace the block that builds
dataDir/source and calls aicr.NewClient with a single call to
recipeClientFromCmd (passing the command context/flags and cfg and version as
required), handle its returned client and error the same way, and keep the defer
client.Close() cleanup; ensure you remove the local dataDir/source logic so the
shared recipeClientFromCmd is the single source of truth.

Comment on lines +235 to +258
var criteriaRegistryCache sync.Map // map[DataProvider]*criteriaRegistryCacheEntry

// GetCriteriaRegistryFor returns the criteria registry for the supplied
// DataProvider, constructing it lazily on first access. Concurrent callers
// with the same provider observe the same singleton; distinct providers
// populate distinct cache entries and never share state. Each per-provider
// registry honors AICR_CRITERIA_STRICT at construction, exactly like the
// global default.
//
// A nil provider falls back to GetDataProvider() so the legacy global path —
// DefaultRegistry and the package-level ParseCriteria*Type shims — continues
// to work transparently and shares a single registry with the embedded
// provider.
func GetCriteriaRegistryFor(dp DataProvider) *CriteriaRegistry {
if dp == nil {
dp = GetDataProvider() //nolint:staticcheck // back-compat fallback for pre-WithDataProvider callers (#983 Stage 2)
}
e, _ := criteriaRegistryCache.LoadOrStore(dp, &criteriaRegistryCacheEntry{})
entry := e.(*criteriaRegistryCacheEntry)
entry.once.Do(func() {
entry.registry = newCriteriaRegistry()
})
return entry.registry
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Key criteria-registry cache entries by provider generation, not only provider identity.

Caching only on DataProvider identity can preserve stale registries when a settable provider’s underlying state changes while identity remains stable. Include the provider generation in the cache key so late-bound configuration reliably rebuilds.

As per coding guidelines, "Sync.Once caching — Key cache by generation counter when state depends on settable global (e.g., DataProvider-tied registry); recompute on miss so late-bound config takes effect".

Also applies to: 299-301

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/recipe/criteria_registry.go` around lines 235 - 258, The cache currently
keys only by DataProvider identity (criteriaRegistryCache and
GetCriteriaRegistryFor), which preserves stale registries when a mutable
provider's internal state changes; change the cache key to include the
provider's generation counter so a provider state change creates a new entry.
Concretely, in GetCriteriaRegistryFor compute a composite key (e.g., struct{dp
DataProvider; gen uint64} or a string like fmt.Sprintf("%p-%d", dp, gen)) where
gen is obtained from the provider's generation accessor (e.g., dp.Generation()
or the appropriate GetGeneration/GetVersion method), then LoadOrStore using that
composite key and keep the rest (criteriaRegistryCacheEntry, once,
newCriteriaRegistry) unchanged; apply the same keying change to the other cache
use sites noted around the 299-301 region.

Comment on lines +182 to 185
cat, err := catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)
if err != nil {
return nil, errors.Wrap(errors.ErrCodeInternal, "failed to load validator catalog", err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve structured catalog-load error codes instead of forcing ErrCodeInternal.

Both call sites currently rewrap catalog-load errors with ErrCodeInternal, which masks upstream structured codes (for example malformed catalog input). Propagate-or-wrap instead so existing coded errors stay intact.

Suggested patch
-	cat, err := catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)
+	cat, err := catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)
 	if err != nil {
-		return nil, errors.Wrap(errors.ErrCodeInternal, "failed to load validator catalog", err)
+		return nil, errors.PropagateOrWrap(err, errors.ErrCodeInternal, "failed to load validator catalog")
 	}
@@
-	cat, err := catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)
+	cat, err := catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)
 	if err != nil {
-		return nil, errors.Wrap(errors.ErrCodeInternal, "failed to load validator catalog", err)
+		return nil, errors.PropagateOrWrap(err, errors.ErrCodeInternal, "failed to load validator catalog")
 	}

As per coding guidelines: “Don't double-wrap errors that already have proper codes; propagate as-is to preserve error code.”

Also applies to: 240-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/validator/validator.go` around lines 182 - 185, The
catalog.LoadWithDataProvider call currently always rewraps any error as
errors.ErrCodeInternal (in the block around
catalog.LoadWithDataProvider(v.dataProvider, v.Version, v.Commit)), which hides
upstream structured codes; change the handling to propagate coded errors
unmodified and only wrap unknown errors: after the call, detect if the returned
err already carries a structured code (the package's coded/error type), and if
so return it directly, otherwise wrap with errors.Wrap(errors.ErrCodeInternal,
"failed to load validator catalog", err). Apply the same propagate-or-wrap
change to the other occurrence around lines 240-243 so existing catalog error
codes are preserved.

Comment thread query.go
Comment on lines +29 to +32
hydrated, err := recipe.HydrateResult(r)
if err != nil {
return nil, errors.Wrap(errors.ErrCodeInternal, "hydrate recipe", err)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve structured error codes from hydration failures.

recipe.HydrateResult errors may already carry a structured code; forcing ErrCodeInternal here can clobber it. Use errors.PropagateOrWrap to preserve existing codes.

Suggested fix
  hydrated, err := recipe.HydrateResult(r)
  if err != nil {
-		return nil, errors.Wrap(errors.ErrCodeInternal, "hydrate recipe", err)
+		return nil, errors.PropagateOrWrap(err, errors.ErrCodeInternal, "hydrate recipe")
  }
Based on learnings: if an error already includes the correct structured code, propagate it as-is and avoid double-wrapping.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@query.go` around lines 29 - 32, The current handling of errors from
recipe.HydrateResult always wraps the error with errors.ErrCodeInternal, which
can overwrite existing structured error codes; replace the return logic to use
errors.PropagateOrWrap (or the library's propagate helper) so that if the
hydrated error already carries a code it is propagated unchanged, otherwise it
is wrapped with a contextual message like "hydrate recipe" and
ErrCodeInternal—update the call site where recipe.HydrateResult(r) is checked to
use errors.PropagateOrWrap(err, errors.ErrCodeInternal, "hydrate recipe")
instead of errors.Wrap.

hkii added 2 commits May 29, 2026 12:45
# Conflicts:
#	aicr_test.go
#	pkg/cli/recipe.go
The criteria-registry seeding/strict tests construct per-provider
registries that honor AICR_CRITERIA_STRICT at build time. make test sets
it to 1 (the CI gate), which hid external --data criteria values and broke
the seeding/precondition assertions. Clear the env via t.Setenv so each
test's freshly-built registry starts non-strict; strict behavior is still
exercised explicitly via SetStrict.
@github-actions

Copy link
Copy Markdown
Contributor

@hkii this PR now has merge conflicts with main. Please rebase to resolve them.

@hkii

hkii commented May 29, 2026

Copy link
Copy Markdown
Member Author

Thanks for taking this on, @hkii — you've been carrying the facade thread since #1072 and this PR is the logical next step. Direction is right and matches the project principle that the facade should be the single aggregation point. A few architectural shape questions before line-level review starts; flagging them now since the PR is still draft.

1. Two parallel public-API tiers on the Client. types.go introduces transparent aliases (type Recipe = recipe.RecipeResult) alongside the existing facade-owned shape (type RecipeResult struct {…}), and the surface has parallel resolve methods returning each (ResolveRecipe → *RecipeResult vs ResolveRecipeFromCriteria/FromSnapshot → *Recipe). The Stability note on Snapshot already concedes the alias tier "inherits pkg/snapshotter's Public (evolving) tier rather than the facade's stable tier — field-shape changes propagate without notice," and a comment points to #1078 as the future wrap. The risk: this PR ships the alias surface that #1078 will need to wrap, and any caller built against *Recipe between now and #1078 has to migrate twice. Suggest landing #1078 first so the facade-owned wrapper protects external Go importers; then this PR consumes the stable types.

2. The PR bundles a wiring refactor with several behavior changes. Title says "migrate CLI and API to consume aicr.Client" but the implementation notes also include: deep-copy fix in AdoptRecipe + ComponentRef.Overrides (fixes a prior aliasing bug — i.e., behavior change), validation timeout time.Duration → *time.Duration with the CLI passing 0 to uncap (real behavior change for CLI users), tolerations explicit-set flag (semantic change), and validate propagating structured codes via PropagateOrWrap. Each of those is independently bisectable and reviewable. Bundling them with a 4,900-line refactor across 41 files in 38 commits makes regression hunting much harder if something breaks downstream. The "preserves behavioral parity" framing also fights the substance — uncapping validation timeout in the CLI is a deliberate change worth calling out plainly.

Suggested split:

  • Pre-PR: each correctness fix on its own commit with a regression test (AdoptRecipe deep-copy is a particularly good standalone candidate — small, testable, ships behavior improvement without coupling to wiring).
  • Pre-PR: per-DataProvider criteria registry (replaces the process global) with the back-compat shim's deprecation marker and a tracking comment pointing at feat(recipe): deprecate SetDataProvider / GetDataProvider globals #987 for removal.
  • This PR (slimmed): only the CLI/API migration onto the now-stable Client surface. Should drop 1,500–2,000 lines once the above ship separately.

3. Back-compat shim sunset plan. "Per-DataProvider criteria registry with a global shim for back-compat" — shims tend to be sticky, and #987 is the sibling that removes the global entirely. Worth confirming the shim is flagged with //nolint:staticcheck + a Deprecated: doc tag pointing at #987, and that there's a test asserting no in-tree caller still touches the global. Otherwise we end up with a permanent dual code path.

4. PR mechanics. None of this is substance; flagging for the squash-and-rebase pass:

What's clearly good and worth keeping: the direction (single integration surface, per-instance state), the docs updates landing in the same PR (go-library.md, public-api.md), the per-file test coverage discipline, and the Close() + per-Client guarantee documented in rollout notes. The architectural boundaries are respected — no kubectl orchestration, no in-cluster controllers, no new deployers.

Happy to pair on any of the splits, especially the #1078 sequencing question. Thanks again for the work.

Thanks, this is a useful read and most of it I agree with. Going point by point:

  1. The two API tiers and feat(aicr): wrap facade alias types as facade-owned structs #1078 ordering. Fair concern, the double-migrate risk is real. One thing I'd add though: the *Recipe-returning
    methods (ResolveRecipeFromCriteria, ResolveRecipeFromSnapshot, LoadRecipe) aren't an accident. They exist because the lossy facade
    RecipeResult (just Name/Version/Components) physically can't reproduce what the CLI and /v1/recipe actually emit: full metadata,
    constraints, applied overlays, deployment order. So whatever feat(aicr): wrap facade alias types as facade-owned structs #1078 lands, it can't just be "wrap everything in the lossy struct"; it needs a
    lossless path for these exact callers or the migration regresses output. With that caveat I'm fine landing feat(aicr): wrap facade alias types as facade-owned structs #1078 first and rebasing on top.
    For what it's worth, nothing external consumes the *Recipe methods yet (in-tree CLI/REST only), so the cost of getting the order wrong
    today is basically nil. But agreed, worth locking down before we invite outside use.

  2. The bundling. You're right. It's already a chain of small, independently bisectable commits, so splitting is mostly mechanical:

  • PR-1: the correctness fixes (AdoptRecipe/Overrides deep-copy, PropagateOrWrap), each with its own regression test. The deep-copy is the
    clean standalone you flagged.
  • PR-2: the per-DataProvider criteria registry plus shim.
  • PR-3 (this one, slimmed): just the CLI/API wiring onto the stable surface.

And you caught the right thing on "parity": the validation-timeout uncap is a real behavior change, not parity. The old CLI was bounded only
by per-validator timeouts, and the facade's fixed ValidationOperationTimeout would cancel an all-phase run that hits the 50m inference-perf
check. I'll say that plainly in the description instead of filing it under parity.

  1. Shim sunset. Will do: Deprecated: plus //nolint:staticcheck on the global shims, and a test that fails if any in-tree caller still
    reaches for the global. One correction though, the global here is the criteria registry (DefaultRegistry/ParseCriteria*), not
    SetDataProvider/GetDataProvider. feat(recipe): deprecate SetDataProvider / GetDataProvider globals #987 was the DataProvider globals and it's already closed, so I'll open a separate issue to track the
    registry removal and point the shim there. Otherwise we imply feat(recipe): deprecate SetDataProvider / GetDataProvider globals #987 still owns this, and it doesn't.

  2. Mechanics. Commits are -S signed; GitHub shows them Verified. The local --show-signature "No signature" is just this machine missing an
    allowedSignersFile. make qualify passes locally and I'll flip the box after the rebase. I'll rebase onto main once harden(recipes): direct dependencyRefs from monitoring-CR creators to prometheus-operator-crds #1100/feat(recipes): apply nodewright generic tuning to rtx-pro-6000 EKS #1101/feat(recipes): add nodewright h100 tuning to H200 EKS recipes #1102 are in
    and squash to one commit per logical change before taking it out of draft.

The #1078 ordering is the one with real consequences, so I'd like to settle that before I cut PR-1. Happy to pair on it.

@github-actions

Copy link
Copy Markdown
Contributor

@mchmarny

Copy link
Copy Markdown
Member

Closing in favor of the 2-PR split @hkii landed:

Thanks @hkii for splitting this. The 3 correctness fixes called out in earlier review (deep-copy ComponentRef.Overrides, deep-copy in AdoptRecipe, PropagateOrWrap for structured error codes) ride along in #1108's squash rather than as a separate pre-PR — happy to accept that bundling since the diff is small and self-contained within the squash.

Epic tracking moves to #1107#1108, then #1078 + #1080 to close #1076.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(cli,api): migrate CLI and REST server to consume aicr.Client

2 participants