Add managed Helm service mode to retriever harness#2146
Conversation
|
@greptile review |
Greptile SummaryThis PR adds Helm-managed service lifecycle support to the retriever harness: it can now deploy NRL via Helm, port-forward the service, run
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/harness/config.py | Adds manage_service and all Helm config fields; service-mode validate() early return now skips BEIR evaluation validation, which is newly exercised by this PR. |
| nemo_retriever/src/nemo_retriever/harness/helm_manager.py | New NRL-native HelmServiceManager with label-based service discovery, port-forwarding, /v1/health polling, NIMService/NIMCache readiness, and secret redaction; uses logging correctly throughout. |
| nemo_retriever/src/nemo_retriever/harness/run.py | Adds _run_managed_service_mode, service-mode BEIR evaluation, and migrates result handling to the finalized ServiceIngestResult contract; a leftover service_job_status key will always be None. |
| tools/harness/src/nv_ingest_harness/service_manager/helm.py | Adds label-selector-based service discovery methods and fixes wildcard matching with fnmatch; pre-existing print() style is consistent with the rest of the file. |
| nemo_retriever/tests/test_harness_helm_manager.py | New unit tests for HelmServiceManager; covers command building, port-forward lifecycle, secret redaction, and readiness polling with proper mocking. |
| nemo_retriever/tests/test_harness_run.py | Extends existing run tests with managed service mode coverage; properly mocks HelmServiceManager and validates result payloads. |
Sequence Diagram
sequenceDiagram
participant Caller as harness run
participant MSM as _run_managed_service_mode
participant HSM as HelmServiceManager
participant K8s as Kubernetes / Helm
participant SM as _run_service_mode
Caller->>MSM: "cfg (manage_service=true)"
MSM->>HSM: start()
HSM->>K8s: helm upgrade --install (--wait)
HSM->>K8s: "kubectl get services -l component=service|gateway"
HSM->>K8s: kubectl port-forward service/... local:7670
HSM->>HSM: _poll_http_200 /v1/health
HSM->>K8s: "kubectl wait --for=condition=Ready pod"
HSM-->>MSM: "rc=0"
MSM->>SM: "service_cfg (service_url=localhost:7670)"
SM-->>MSM: result_payload
MSM->>HSM: dump_logs(artifact_dir) [on failure]
MSM->>HSM: "stop(uninstall=not keep_up)"
HSM->>K8s: helm uninstall
MSM-->>Caller: result_payload
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
nemo_retriever/src/nemo_retriever/harness/config.py:189-193
**Service-mode `validate()` skips BEIR config validation entirely**
The `return errors` at the end of the `run_mode == "service"` block was pre-existing, but this PR adds BEIR evaluation support to `_run_service_mode`. Now a service-mode config that sets `evaluation_mode: beir` with a missing or invalid `beir_loader` passes `load_harness_config()` silently — the error only surfaces deep into the run (potentially after several minutes of Helm deployment). The `evaluation_mode` check and BEIR-specific validations on lines 193-240 should also apply to service mode.
```suggestion
if not isinstance(self.helm_set, dict):
errors.append("helm_set must be a mapping/dict")
if self.evaluation_mode not in VALID_EVALUATION_MODES:
errors.append(f"evaluation_mode must be one of {sorted(VALID_EVALUATION_MODES)}")
if self.evaluation_mode == "beir":
if self.beir_loader not in VALID_BEIR_LOADERS:
errors.append(f"beir_loader must be one of {sorted(VALID_BEIR_LOADERS)}")
if self.beir_doc_id_field not in VALID_BEIR_DOC_ID_FIELDS:
errors.append(f"beir_doc_id_field must be one of {sorted(VALID_BEIR_DOC_ID_FIELDS)}")
if not self.beir_split:
errors.append("beir_split must be a non-empty string")
if not isinstance(self.beir_ks, (list, tuple)) or not self.beir_ks:
errors.append("beir_ks must be a non-empty list/tuple of positive integers")
return errors
if self.evaluation_mode not in VALID_EVALUATION_MODES:
```
### Issue 2 of 2
nemo_retriever/src/nemo_retriever/harness/run.py:1285-1287
**`service_job_status` always writes `None` after the contract migration**
The PR description states the result contract is now `job_id`, `document_ids`, `failures`, and `elapsed_s`. `ServiceIngestResult` has no `job_status` attribute, so `getattr(result_obj, "job_status", None)` will always produce `None` in every `results.json`. This is a leftover key from the old contract that was not cleaned up alongside the `job_ids`→`job_id` migration.
```suggestion
"service_job_id": getattr(result_obj, "job_id", None),
"service_document_ids": document_ids,
```
Reviews (5): Last reviewed commit: "Merge branch 'main' into codex/retriever..." | Re-trigger Greptile
|
@greptile review |
|
@greptile review |
|
@greptile review |
|
@greptile review |
1 similar comment
|
@greptile review |
|
/nvskills-ci |
Description
Adds Helm-managed service lifecycle support to
retriever-harnessso service-mode harness runs can optionally deploy NRL via Helm, port-forward the retriever service, runServiceIngestor, collect logs/artifacts, and tear down the release.Highlights:
manage_service/--managedconfig and CLI plumbing for Helm chart refs, chart versions, values files, repeated--helm-set, timeouts, binary/sudo knobs, and--keep-up.HelmServiceManagerwith label-based service discovery, service/VectorDB port-forwarding,/v1/healthpolling, optional NIMService/NIMCache readiness, log collection, and secret redaction in displayed commands.ServiceIngestResultcontract (job_id,document_ids,failures,elapsed_s) instead of legacymetrics/job_idsassumptions.core,no-nims-external,all-optional,audio-video, andsplitcoverage.Validation:
uv run --project nemo_retriever --extra service --extra dev pytest nemo_retriever/tests/test_harness_config.py nemo_retriever/tests/test_harness_run.py nemo_retriever/tests/test_harness_helm_manager.py nemo_retriever/tests/test_harness_helm_profiles.py -q(99 passed)nim-nvstaging/nemo-retriever --version 26.05-RC6with the RC6 service image override.no-nims-external, port-forwarded service + VectorDB, ingested one PDF through managed service mode, and verified Helm teardown left no resources in the smoke namespace.Checklist