Skip to content

feat(guardrails): Update benchmark job; add latency analysis step#359

Open
JashG wants to merge 8 commits into
mainfrom
jgulabrai/guardrails-benchmark-analysis
Open

feat(guardrails): Update benchmark job; add latency analysis step#359
JashG wants to merge 8 commits into
mainfrom
jgulabrai/guardrails-benchmark-analysis

Conversation

@JashG

@JashG JashG commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

guardrails-benchmark job

Updates the guardrails-benchmark CI step:

  • Run two variants of the benchmark: one without Guardrails, and one with Guardrails (content safety input and output checks). This enables checking overhead added by the nemo-guardrails plugin.

guardrails-benchmark-analyze job

Updates the guardrails-benchmark-analyze CI step:

  • Outputs three tables:
    • Measured Latencies: Total e2e latency for both benchmarks.
    • Platform Overhead: Latency added by NMP (i.e. latencies without counting the mock LLM calls).
    • Guardrails Overhead vs. Baseline: Latency added by Guardrails compared to pre-defined baselines. These baselines are the average of several runs in CI. I added a "reasonable +/- threshold" to account for potential latency deviations.
  • The job fails if the Guardrails overhead exceeds the reasonable threshold vs. the baseline. Currently, this job won't block the pipeline.

Benchmark harness

  • run.py now accepts --variant {with-guardrails, without-guardrails, all}. all runs both variants sequentially against the same NMP, writing them to the same run directory under aiperf_results/<variant>/.
  • seeding.py creates a second control VirtualModel (no-guardrails-vm) alongside the existing guarded one.
  • aiperf_runner.py accepts a model_ref so each variant's sweep targets the correct VM.
  • New analyze.py (stdlib-only) reads both variants from a run dir and prints three tables: measured latencies, platform overhead (mock-LLM time subtracted), and the baseline check.

@github-actions github-actions Bot added the feat label Jun 16, 2026
@github-actions

github-actions Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
Suite Lines Covered Line Rate Branch Rate
Unit Tests 21176/27762 76.3% 61.2%
Integration Tests 12216/26531 46.0% 19.5%

JashG added 2 commits June 22, 2026 10:12
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
@JashG JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from 6115c89 to f78ffa8 Compare June 22, 2026 14:12
JashG added 4 commits June 22, 2026 11:23
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
@JashG JashG marked this pull request as ready for review June 23, 2026 14:10
@JashG JashG requested review from a team as code owners June 23, 2026 14:10
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds a guardrails-specific CI trigger, converts the Nemo Guardrails benchmark into with/without-guardrails variants, adds a latency comparison analyzer with baseline checks, and updates benchmark configs, tests, and documentation to match the new run and analysis flow.

Changes

Guardrails benchmark workflow

Layer / File(s) Summary
Change detection and benchmark job trigger
.github/actions/changes/action.yaml, .github/workflows/ci.yaml
Adds a guardrails-benchmark change flag and uses it to conditionally start a matrix benchmark job.
Benchmark variant constants and control virtual model
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{constants.py,seeding.py}, plugins/nemo-guardrails/tests/unit/benchmarks/test_seeding.py
New variant and control VM constants; seeding creates a no-middleware control VM alongside the main guardrails VM; tests updated to assert both VMs are created.
Paths and mock LLM environment configs
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{paths.py}, plugins/nemo-guardrails/benchmarks/configs/mock_llm/*, plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md
RunPaths gains in-repo mock LLM env fields and per-variant AIPerf helpers; mock env files define model, safety, and deterministic latency parameters; README documents versioning rationale.
Benchmark runtime with variant execution
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/{run.py,aiperf_runner.py}
Harness validates in-repo configs, selects per-variant VM references, prepares variant-specific runtime configs, executes variants sequentially, collects outcomes, and optionally prints analysis.
Latency analysis and baseline validation
plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py
New analyzer loads per-concurrency latency from both variant outputs, computes measured vs platform overhead tables, validates delta p50 against baseline constants with per-concurrency tolerance overrides, and produces pass/fail reports.
CI workflow matrix and artifact aggregation
.github/workflows/ci.yaml
Replaces single benchmark job with matrix variants; new analyze job downloads both artifacts, runs strict analysis, and uploads merged results.
Documentation updates
plugins/nemo-guardrails/benchmarks/README.md
Documents updated directory layout, two-job CI flow, baseline delta p50 validation with tolerance overrides, and analyzer local execution.

Sequence Diagram(s)

sequenceDiagram
  participant Changes as changes job
  participant Bench as guardrails-benchmark
  participant Store as artifact storage
  participant Analyze as guardrails-benchmark-analyze

  Changes->>Bench: guardrails-benchmark == true
  Bench->>Bench: run with-guardrails variant
  Bench->>Store: upload variant artifact
  Bench->>Bench: run without-guardrails variant
  Bench->>Store: upload variant artifact
  Analyze->>Store: download both artifacts
  Analyze->>Analyze: run analyze.py --strict
  Analyze->>Store: upload merged benchmark results
Loading

Possibly related PRs

  • NVIDIA-NeMo/nemo-platform#80: Introduced the same guardrails benchmark modules and CI path that this PR extends with dual variants and result analysis.

Suggested reviewers

  • gabwow
  • albcui
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 55.56% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title accurately captures the main changes: introducing dual-variant benchmark execution and a new latency analysis step in the guardrails CI pipeline.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jgulabrai/guardrails-benchmark-analysis

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/ci.yaml (1)

1306-1334: 🩺 Stability & Availability | 🟠 Major

Add guardrails-benchmark and guardrails-benchmark-analyze to ci-status dependencies.

The ci-status job (lines 1306-1334) omits both benchmark jobs. If ci-status is the required gate and benchmark jobs fail (when triggered by changes or dispatch), those failures are missed.

Suggested patch
  ci-status:
    name: CI status
    needs:
      - changes
+     - guardrails-benchmark
+     - guardrails-benchmark-analyze
      - actionlint
      - docker-bake-graph
      - build-cpu-smoke-images
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ci.yaml around lines 1306 - 1334, The ci-status job is
missing guardrails-benchmark and guardrails-benchmark-analyze from its needs
dependencies list. Add both of these job names to the needs array under the
ci-status job (after the other existing job names like web-studio-e2e and
opa-policy-test) to ensure benchmark job failures are reflected in the overall
CI status gate.
🧹 Nitpick comments (1)
plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md (1)

1-22: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Restructure this page to one Diataxis type with required sections.

This page is currently descriptive but missing required doc scaffolding. Convert it into a single quadrant (likely REFERENCE), add a top prerequisites block, add concrete CLI/Python examples in a tab-set, and end with a “Next Steps” section linking to related benchmark docs.
As per coding guidelines, Each documentation page should fit ONE Diataxis quadrant, Always list prerequisites at the top, Provide both Python SDK and CLI examples in tab-sets, and Include 'Next Steps' section at the end.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md` around lines 1
- 22, Restructure the Mock LLM configurations README to follow Diataxis
REFERENCE format by adding a prerequisites section at the top that lists
required dependencies and setup steps, reorganizing the existing descriptive
content into a properly structured REFERENCE page, adding a new section with
concrete CLI and Python SDK examples in tab-set format demonstrating how to use
and configure the app-llm.env and content-safety-llm.env files, and ending with
a "Next Steps" section that links to related benchmark documentation and guides
for running benchmarks with these configurations.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yaml:
- Around line 1142-1147: The analyze job's if condition prevents it from running
when the guardrails-benchmark job fails, even though the job handles missing
artifacts gracefully with continue-on-error. Modify the if condition that
currently checks for !cancelled() and the workflow_dispatch or changes.outputs
conditions to use always() instead, so the analyze job runs regardless of the
guardrails-benchmark job's status and can process partial artifacts when
available.
- Around line 1151-1154: The checkout action for nemo-platform is missing the
persist-credentials security setting. Add `persist-credentials: false` to the
`with:` section of the "Checkout nemo-platform" step to prevent the GitHub
Actions authentication token from being stored in the Git configuration, since
this job only needs to download and upload artifacts without performing Git
operations.

In `@plugins/nemo-guardrails/benchmarks/README.md`:
- Around line 227-229: The worked example in the benchmark README for c=16
tolerance incorrectly states the tolerance as 300 ms when the actual value used
in analyze.py is 200 ms. Update lines 227-229 to replace the 300 ms tolerance
value with the correct 200 ms value from analyze.py line 46, and recalculate the
pass/fail example thresholds accordingly so that the example accurately
demonstrates how the 200 ms tolerance determines whether a performance run
passes or fails based on the delta_p50 values.

In `@plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py`:
- Around line 181-194: The control VM creation in the seeding flow can reuse an
existing virtual model unchanged when create(..., exist_ok=True) returns it, so
the empty request_middleware and response_middleware settings may be ignored.
Update the no_guardrails_vm handling in the seeding logic by either calling
client.inference.virtual_models.update() after creation to explicitly clear
middleware, or deleting any existing NO_GUARDRAILS_VM_NAME before recreating it,
so the control VM always stays middleware-free.

---

Outside diff comments:
In @.github/workflows/ci.yaml:
- Around line 1306-1334: The ci-status job is missing guardrails-benchmark and
guardrails-benchmark-analyze from its needs dependencies list. Add both of these
job names to the needs array under the ci-status job (after the other existing
job names like web-studio-e2e and opa-policy-test) to ensure benchmark job
failures are reflected in the overall CI status gate.

---

Nitpick comments:
In `@plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md`:
- Around line 1-22: Restructure the Mock LLM configurations README to follow
Diataxis REFERENCE format by adding a prerequisites section at the top that
lists required dependencies and setup steps, reorganizing the existing
descriptive content into a properly structured REFERENCE page, adding a new
section with concrete CLI and Python SDK examples in tab-set format
demonstrating how to use and configure the app-llm.env and
content-safety-llm.env files, and ending with a "Next Steps" section that links
to related benchmark documentation and guides for running benchmarks with these
configurations.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1cd5064a-464d-418f-a424-914014ecf570

📥 Commits

Reviewing files that changed from the base of the PR and between c48f52c and 98f5d45.

📒 Files selected for processing (13)
  • .github/actions/changes/action.yaml
  • .github/workflows/ci.yaml
  • plugins/nemo-guardrails/benchmarks/README.md
  • plugins/nemo-guardrails/benchmarks/configs/mock_llm/README.md
  • plugins/nemo-guardrails/benchmarks/configs/mock_llm/app-llm.env
  • plugins/nemo-guardrails/benchmarks/configs/mock_llm/content-safety-llm.env
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/aiperf_runner.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/analyze.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/constants.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/paths.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/run.py
  • plugins/nemo-guardrails/src/nemo_guardrails_plugin/benchmarks/seeding.py
  • plugins/nemo-guardrails/tests/unit/benchmarks/test_seeding.py

Comment thread .github/workflows/ci.yaml
Comment thread .github/workflows/ci.yaml
Comment thread plugins/nemo-guardrails/benchmarks/README.md Outdated
Signed-off-by: Jash Gulabrai <jgulabrai@nvidia.com>
@JashG JashG force-pushed the jgulabrai/guardrails-benchmark-analysis branch from ad12028 to dcaf12f Compare June 23, 2026 14:36
@JashG JashG requested review from albcui and gabwow June 23, 2026 14:37
@JashG JashG changed the title feat(guardrails): Benchmark analysis feat(guardrails): Update benchmark job; add latency analysis step Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant