Skip to content

Add managed Helm service mode to retriever harness#2146

Merged
charlesbluca merged 8 commits into
NVIDIA:mainfrom
charlesbluca:codex/retriever-harness-managed-helm
May 28, 2026
Merged

Add managed Helm service mode to retriever harness#2146
charlesbluca merged 8 commits into
NVIDIA:mainfrom
charlesbluca:codex/retriever-harness-managed-helm

Conversation

@charlesbluca
Copy link
Copy Markdown
Collaborator

Description

Adds Helm-managed service lifecycle support to retriever-harness so service-mode harness runs can optionally deploy NRL via Helm, port-forward the retriever service, run ServiceIngestor, collect logs/artifacts, and tear down the release.

Highlights:

  • Adds manage_service/--managed config and CLI plumbing for Helm chart refs, chart versions, values files, repeated --helm-set, timeouts, binary/sudo knobs, and --keep-up.
  • Adds an NRL-native HelmServiceManager with label-based service discovery, service/VectorDB port-forwarding, /v1/health polling, optional NIMService/NIMCache readiness, log collection, and secret redaction in displayed commands.
  • Updates service-mode harness result handling to the finalized ServiceIngestResult contract (job_id, document_ids, failures, elapsed_s) instead of legacy metrics/job_ids assumptions.
  • Adds managed Helm profile examples for core, no-nims-external, all-optional, audio-video, and split coverage.

Validation:

  • uv run --project nemo_retriever --extra service --extra dev pytest nemo_retriever/tests/test_harness_config.py nemo_retriever/tests/test_harness_run.py nemo_retriever/tests/test_harness_helm_manager.py nemo_retriever/tests/test_harness_helm_profiles.py -q (99 passed)
  • Rendered all five managed profiles against remote nim-nvstaging/nemo-retriever --version 26.05-RC6 with the RC6 service image override.
  • Live microk8s smoke: deployed RC6 chart/service image with no-nims-external, port-forwarded service + VectorDB, ingested one PDF through managed service mode, and verified Helm teardown left no resources in the smoke namespace.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • N/A - If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

@charlesbluca charlesbluca changed the title [codex] Add managed Helm service mode to retriever harness Add managed Helm service mode to retriever harness May 28, 2026
@charlesbluca
Copy link
Copy Markdown
Collaborator Author

@greptile review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR adds Helm-managed service lifecycle support to the retriever harness: it can now deploy NRL via Helm, port-forward the service, run ServiceIngestor, collect logs, and tear down the release — all driven by a new manage_service config flag and a new HelmServiceManager class. It also updates _run_service_mode to the finalized ServiceIngestResult contract and adds service-mode BEIR evaluation support.

  • HelmServiceManager (helm_manager.py): label-based service and VectorDB discovery, port-forward lifecycle with PermissionError handling, /v1/health polling, optional NIMService/NIMCache readiness waits, and secret redaction for displayed commands.
  • _run_managed_service_mode (run.py): wraps HelmServiceManager.start/stop around a normal _run_service_mode call; adds _run_service_beir_evaluation and migrates off the legacy metrics/job_ids contract.
  • Helm profiles (helm-profiles/): five example managed profiles (core, no-nims-external, all-optional, audio-video, split) covering the main NRL deployment topologies.

Confidence Score: 4/5

Safe to merge with the validation gap fixed; the Helm manager and run orchestration are well-structured and well-tested.

The validate() early return in config.py now silently skips BEIR evaluation checks for service-mode runs — a gap created by this PR adding BEIR support to service mode without updating validation. A managed run with evaluation_mode: beir and a missing beir_loader will deploy the full chart (potentially 10+ minutes) before failing inside _run_service_beir_evaluation.

nemo_retriever/src/nemo_retriever/harness/config.py — the validate() service-mode block needs BEIR and evaluation_mode checks before its early return.

Important Files Changed

Filename Overview
nemo_retriever/src/nemo_retriever/harness/config.py Adds manage_service and all Helm config fields; service-mode validate() early return now skips BEIR evaluation validation, which is newly exercised by this PR.
nemo_retriever/src/nemo_retriever/harness/helm_manager.py New NRL-native HelmServiceManager with label-based service discovery, port-forwarding, /v1/health polling, NIMService/NIMCache readiness, and secret redaction; uses logging correctly throughout.
nemo_retriever/src/nemo_retriever/harness/run.py Adds _run_managed_service_mode, service-mode BEIR evaluation, and migrates result handling to the finalized ServiceIngestResult contract; a leftover service_job_status key will always be None.
tools/harness/src/nv_ingest_harness/service_manager/helm.py Adds label-selector-based service discovery methods and fixes wildcard matching with fnmatch; pre-existing print() style is consistent with the rest of the file.
nemo_retriever/tests/test_harness_helm_manager.py New unit tests for HelmServiceManager; covers command building, port-forward lifecycle, secret redaction, and readiness polling with proper mocking.
nemo_retriever/tests/test_harness_run.py Extends existing run tests with managed service mode coverage; properly mocks HelmServiceManager and validates result payloads.

Sequence Diagram

sequenceDiagram
    participant Caller as harness run
    participant MSM as _run_managed_service_mode
    participant HSM as HelmServiceManager
    participant K8s as Kubernetes / Helm
    participant SM as _run_service_mode

    Caller->>MSM: "cfg (manage_service=true)"
    MSM->>HSM: start()
    HSM->>K8s: helm upgrade --install (--wait)
    HSM->>K8s: "kubectl get services -l component=service|gateway"
    HSM->>K8s: kubectl port-forward service/... local:7670
    HSM->>HSM: _poll_http_200 /v1/health
    HSM->>K8s: "kubectl wait --for=condition=Ready pod"
    HSM-->>MSM: "rc=0"
    MSM->>SM: "service_cfg (service_url=localhost:7670)"
    SM-->>MSM: result_payload
    MSM->>HSM: dump_logs(artifact_dir) [on failure]
    MSM->>HSM: "stop(uninstall=not keep_up)"
    HSM->>K8s: helm uninstall
    MSM-->>Caller: result_payload
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nemo_retriever/src/nemo_retriever/harness/config.py:189-193
**Service-mode `validate()` skips BEIR config validation entirely**

The `return errors` at the end of the `run_mode == "service"` block was pre-existing, but this PR adds BEIR evaluation support to `_run_service_mode`. Now a service-mode config that sets `evaluation_mode: beir` with a missing or invalid `beir_loader` passes `load_harness_config()` silently — the error only surfaces deep into the run (potentially after several minutes of Helm deployment). The `evaluation_mode` check and BEIR-specific validations on lines 193-240 should also apply to service mode.

```suggestion
                if not isinstance(self.helm_set, dict):
                    errors.append("helm_set must be a mapping/dict")

        if self.evaluation_mode not in VALID_EVALUATION_MODES:
            errors.append(f"evaluation_mode must be one of {sorted(VALID_EVALUATION_MODES)}")
        if self.evaluation_mode == "beir":
            if self.beir_loader not in VALID_BEIR_LOADERS:
                errors.append(f"beir_loader must be one of {sorted(VALID_BEIR_LOADERS)}")
            if self.beir_doc_id_field not in VALID_BEIR_DOC_ID_FIELDS:
                errors.append(f"beir_doc_id_field must be one of {sorted(VALID_BEIR_DOC_ID_FIELDS)}")
            if not self.beir_split:
                errors.append("beir_split must be a non-empty string")
            if not isinstance(self.beir_ks, (list, tuple)) or not self.beir_ks:
                errors.append("beir_ks must be a non-empty list/tuple of positive integers")
            return errors

        if self.evaluation_mode not in VALID_EVALUATION_MODES:
```

### Issue 2 of 2
nemo_retriever/src/nemo_retriever/harness/run.py:1285-1287
**`service_job_status` always writes `None` after the contract migration**

The PR description states the result contract is now `job_id`, `document_ids`, `failures`, and `elapsed_s`. `ServiceIngestResult` has no `job_status` attribute, so `getattr(result_obj, "job_status", None)` will always produce `None` in every `results.json`. This is a leftover key from the old contract that was not cleaned up alongside the `job_ids``job_id` migration.

```suggestion
        "service_job_id": getattr(result_obj, "job_id", None),
        "service_document_ids": document_ids,
```

Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/retriever..." | Re-trigger Greptile

Comment thread nemo_retriever/tests/test_harness_helm_manager.py
Comment thread nemo_retriever/src/nemo_retriever/harness/helm_manager.py
Comment thread nemo_retriever/src/nemo_retriever/harness/helm_manager.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/harness/run.py Outdated
@charlesbluca
Copy link
Copy Markdown
Collaborator Author

@greptile review

@charlesbluca
Copy link
Copy Markdown
Collaborator Author

@greptile review

Comment thread nemo_retriever/src/nemo_retriever/harness/helm_manager.py
@charlesbluca
Copy link
Copy Markdown
Collaborator Author

@greptile review

@charlesbluca
Copy link
Copy Markdown
Collaborator Author

@greptile review

1 similar comment
@charlesbluca
Copy link
Copy Markdown
Collaborator Author

@greptile review

@charlesbluca charlesbluca marked this pull request as ready for review May 28, 2026 17:43
@charlesbluca charlesbluca requested review from a team as code owners May 28, 2026 17:43
@charlesbluca charlesbluca requested a review from jioffe502 May 28, 2026 17:43
@sosahi
Copy link
Copy Markdown
Collaborator

sosahi commented May 28, 2026

/nvskills-ci

@charlesbluca charlesbluca merged commit e519b51 into NVIDIA:main May 28, 2026
10 checks passed
@charlesbluca charlesbluca deleted the codex/retriever-harness-managed-helm branch May 28, 2026 20:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants