diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/SKILL.md b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/SKILL.md
index b907063734..0aaf6ac106 100644
--- a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/SKILL.md
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/SKILL.md
@@ -111,9 +111,9 @@ Training never runs inside the `nemo` CLI process. After `submit`, the platform'
 ## Gotchas
 
 - Resolve the CLI per **Pre-flight — CLI resolution** before any `nemo …` command; run from the **nemo-platform** git root, not a plugin subfolder.
-- Set `NEMO_BASE_URL` (or `NMP_BASE_URL`) only when the user gives a platform URL; default `http://127.0.0.1:8080` (same as `http://localhost:8080`). Track whether the user **overrode** the base URL — see **Platform unreachable** below.
+- Set `NMP_BASE_URL` only when the user gives a platform URL; default `http://127.0.0.1:8080` (same as `http://localhost:8080`). The `nemo` CLI reads this env var (see SDK `NMP_BASE_URL`). Track whether the user **overrode** the base URL — see **Platform unreachable** below.
 - **Platform unreachable** — if any platform API call fails with a connection error (`Connection error`, timeout, refused):
-  - **User gave a custom URL** (e.g. `10.0.0.51:8080`) or you exported a non-default `NEMO_BASE_URL` / `NMP_BASE_URL`: stop and tell the user the platform is not reachable at that address. Do **not** offer to start local services.
+  - **User gave a custom URL** or you exported a non-default `NMP_BASE_URL`: stop and tell the user the platform is not reachable at that address. Do **not** offer to start local services.
   - **Default URL only** (no user override): **ask** whether to start the platform locally. If they agree, from the **nemo-platform** git root run in the **background**:
 
     ```bash
@@ -139,8 +139,10 @@ Training never runs inside the `nemo` CLI process. After `submit`, the platform'
 - **Do not use local `docker info`** to pick automodel vs unsloth. Run `nemo jobs list-execution-profiles -f json` against the user's platform (login first only if auth is enabled — see **Authentication**; see `references/troubleshooting.md`). Default output is a table — **`-f json` is required** for scripting; parse **stdout only** (do not pipe `2>&1` into `json.load`).
 - **Do not merge stderr into stdout when parsing JSON** — `submit`, `explain`, and `-f json` commands write **JSON on stdout**; harmless warnings like `Configuration file not found, using defaults` go to **stderr**. Piping with **`2>&1`** before `json.load` raises `JSONDecodeError` even when submit **succeeded** — a common cause of **duplicate jobs** when the agent re-submits after a parse error. Parse stdout only; redirect stderr if needed (`2>/dev/null`). See `references/troubleshooting.md` § **Parsing CLI JSON**.
 - For submit/image/plugin errors (both backends), read `references/troubleshooting.md`. Unsloth needs the `nmp-unsloth-training` container image on the **platform host's** Docker daemon (see `docker/unsloth/README.md`).
-- **Missing training image on a remote platform** — if the user gave a non-localhost `NEMO_BASE_URL` / `NMP_BASE_URL` (e.g. `10.0.0.51:8080`) and the job errors with `Failed to pull image`, `manifest unknown`, or missing `nmp-unsloth-training` / automodel training image: **do not** run `docker build`, `docker pull`, or `docker buildx bake` on the agent machine. Report with **Report to user** (use **Output adapter fileset (planned):** on error), then append on-target build steps from `references/troubleshooting.md` § **Missing training images**.
+- **Missing training image on a remote platform** — if the user gave a non-localhost `NMP_BASE_URL` and the job errors with `Failed to pull image`, `manifest unknown`, or missing `nmp-unsloth-training` / automodel training image: **do not** run `docker build`, `docker pull`, or `docker buildx bake` on the agent machine. Report with **Report to user** (use **Output adapter fileset (planned):** on error), then append on-target build steps from `references/troubleshooting.md` § **Missing training images**.
 - **Gated HuggingFace models** (Llama, Gemma, …) — confirm `hf-token` + fileset `token_secret` before submit; download fails with `Failed to access upstream storage` / 502 when missing. See **HuggingFace token (gated models)** and `references/troubleshooting.md` § **Gated HuggingFace models**.
+- **Post-training eval format** — use the same CHAT `messages` JSONL as training. **Do not** flatten rows to `prompt`/`expected` for the evaluator. Send `messages[:-1]` at inference (exclude final assistant label); score against `messages[-1].content`. See `references/post-training-eval.md` and `references/eval_helpers.py`.
+- **LoRA adapters load automatically for eval** — when a LoRA job completes (`save_method: lora`), the adapter is registered on the base model entity and hot-reloaded on any **READY** deployment with `lora_enabled: true`. **Do not** create or update deployments before LoRA eval. **Full SFT** (`finetuning_type: all_weights`) and **merged checkpoints** (`merged_16bit` / `merged_4bit`) register a new **model** entity at `output.name` — **deploy that entity for inference** before chat or eval; full weights are not hot-reloaded onto the base deployment. For LoRA eval, route through the **provider** gateway (`/provider/<name>/-/v1` with `model: default--<adapter>`); the model-entity path (`/model/<entity>/-/v1`) always hits the base model. See `references/post-training-eval.md` § **Request routing (base vs LoRA)**.
 
 ## Workflow
 
@@ -148,7 +150,7 @@ Common steps then **branch by plugin pick**:
 
 ```text
 - [ ] Resolve CLI (Pre-flight — CLI resolution); cd nemo-platform
-- [ ] export NEMO_BASE_URL (if user provided endpoint); note whether base URL is user-overridden
+- [ ] export NMP_BASE_URL (if user provided endpoint); note whether base URL is user-overridden
 - [ ] nemo auth status — skip login if auth disabled; if auth enabled and unsigned JWT allowed, `nemo auth login --unsigned-token --email <…>`; if OIDC, `nemo auth login`
 - [ ] nemo jobs list-execution-profiles -f json — apply Plugin pick rules above (retry login on 401/403)
 - [ ] On connection error: default URL → ask to start platform (see Platform unreachable); custom URL → report unreachable and stop
@@ -162,12 +164,14 @@ Common steps then **branch by plugin pick**:
 - [ ] nemo customization automodel submit /tmp/job.json --workspace default
 - [ ] Poll until top-level terminal (`poll_customization_job.sh`; default 15s interval, or 30–60s manual polls)
 - [ ] Report using output template below
+- [ ] Optional: compare base vs adapter on validation — `references/eval_helpers.py …` (LoRA only; CHAT format; adapters hot-reload automatically; see `references/post-training-eval.md`)
 
 # unsloth branch (submit → Docker GPU job)
 - [ ] Write /tmp/job.json using the UnslothJobInput shape (see Fast path — unsloth)
 - [ ] nemo customization unsloth submit /tmp/job.json --workspace default [--profile <gpu-profile>]
 - [ ] Poll until top-level terminal (`poll_customization_job.sh unsloth-<job-id>`; default 15s interval)
 - [ ] Report using output template below
+- [ ] Optional: compare base vs adapter on validation — `references/eval_helpers.py …` (LoRA only; CHAT format; adapters hot-reload automatically; see `references/post-training-eval.md`)
 ```
 
 ## Fast path — automodel
@@ -177,7 +181,7 @@ Substitute `<hf-repo>`, `<hf-dataset>`, `<model-entity>`, `<weights-fileset>`, `
 **Setup**
 
 ```bash
-export NEMO_BASE_URL=http://127.0.0.1:8080   # user override only
+export NMP_BASE_URL=http://127.0.0.1:8080   # user override only
 cd /path/to/nemo-platform
 nemo auth status   # skip login if auth disabled; if enabled + unsigned JWT allowed → login --unsigned-token --email admin@example.com
 nemo jobs list-execution-profiles -f json   # platform GPU profiles → automodel; set training.execution_profile if needed
@@ -399,7 +403,7 @@ Pick the path by whether the **base model fits in ~48 GB on one GPU** (LoRA or f
 | 4B–8B | 1 | 2 | `5e-6` |
 | >8B | 1 | 1 | lower LR or use TP / shorter seq |
 
-Output type is **model** (full checkpoint), not adapter. Expect much longer runs than LoRA at the same batch.
+Output type is **model** (full checkpoint), not adapter. Expect much longer runs than LoRA at the same batch. **Inference:** deploy `default/<output.name>` as a new model entity — full SFT does not hot-reload onto the base model's LoRA deployment.
 
 ### `max_seq_length` scaling
 
@@ -458,7 +462,7 @@ There is no `parallelism` block, no TP / PP / DP, no GBS divisibility math. Mult
 
 `load_in_4bit: true` (default) keeps base weights in 4-bit, which is what makes the "smaller per-device batch on bigger models" rule milder than vanilla HF. If you raise `per_device_train_batch_size` and hit OOM (exit 137) or training crashes (exit 1), halve `per_device_train_batch_size` first and double `gradient_accumulation_steps` to keep the effective batch the same.
 
-**Save method.** Default `output.save_method: "lora"` (adapter only — small, fast, deploy-friendly). Use `"merged_16bit"` if the user wants a full-weight checkpoint to deploy without an adapter loader; `"merged_4bit"` only when storage is tight (lossy). Merged methods require `training.finetuning_type: "lora"`.
+**Save method.** Default `output.save_method: "lora"` (adapter only: small, fast, hot-reloads on LoRA-enabled deployments). Use `"merged_16bit"` if the user wants a full-weight checkpoint to deploy as a standalone model entity; `"merged_4bit"` only when storage is tight (lossy). Merged methods require `training.finetuning_type: "lora"`. Merged and full SFT outputs must be **deployed for inference**; they do not hot-reload onto the base model's LoRA deployment.
 
 **Tuning loop (unsloth):**
 
@@ -513,7 +517,7 @@ After polling reaches a **terminal** status (`completed`, `error`, or `cancelled
 
 | Status | Notes |
 |--------|-------|
-| `completed` | Brief success summary (e.g. adapter registered on model entity). When `metrics.train_loss` has ≥2 entries, add a loss-drop sentence: *Loss dropped from \<first value, 1 dp\> at step 1 to \<last value, 3 dp\> at step \<N\>; validation loss was \<val or n/a\>.* |
+| `completed` | Brief success summary. LoRA (`save_method: lora`): adapter registered on base model entity. Full SFT / merged checkpoint: new model entity at `output.name`. When `metrics.train_loss` has ≥2 entries, add a loss-drop sentence: *Loss dropped from \<first value, 1 dp\> at step 1 to \<last value, 3 dp\> at step \<N\>; validation loss was \<val or n/a\>.* Append **Using the adapter** (LoRA) or **Using the fine-tuned model** (full SFT / merged) with discovered provider name and concrete gateway URLs (see below). |
 | `error` | Quote `error_details.message` or the failing step; note setup that succeeded before the failure (auth, dataset upload, submit). |
 | `cancelled` | Cancellation reason if available. |
 
@@ -580,21 +584,113 @@ After polling reaches a **terminal** status (`completed`, `error`, or `cancelled
 | Output save method | lora |
 ```
 
-**Using the adapter (`completed` only)** — after **Training configuration**, run `nemo models get <model-entity> --workspace default` (parse stdout only) to confirm the adapter is listed under `adapters`. Append this section:
+**Using the output (`completed` only)** — after **Training configuration**, branch on output type:
+
+| Output | When | Report section |
+|--------|------|----------------|
+| LoRA adapter | `save_method: lora` (default) | **Using the adapter** — below |
+| Full model | `finetuning_type: all_weights`, or `save_method: merged_16bit` / `merged_4bit` | **Using the fine-tuned model** — below |
+
+### Using the adapter (LoRA / `save_method: lora`)
+
+Run these discovery commands (parse stdout only; do not pipe `2>&1` into JSON parsers):
+
+1. `nemo models get <model-entity> --workspace default` — confirm `<output.name>` appears under `adapters` with `enabled: true`.
+2. `nemo inference providers list --workspace default -f json` — pick a **READY** provider whose `served_models` includes `default/<model-entity>` (base entity). Record its `name` as `<provider>` (often matches the deployment name).
+
+On a deployment with `lora_enabled: true`, the adapter is **hot-reloaded automatically**, so no new deployment, deployment update, or provider reconfiguration is needed before inference or post-training eval. Append this section with **concrete URLs and provider name** from discovery:
 
 ```markdown
 ### Using the adapter
 
-The adapter `<output.name>` is attached to `default/<model-entity>`. List adapters with:
+The adapter `<output.name>` is registered on `default/<model-entity>`. Weights are hot-reloaded on LoRA-enabled deployments serving the **base** entity — no new deployment or provider update after training.
+
+#### Request routing (base vs LoRA)
+
+| Target | Gateway path | OpenAI base URL | Request `"model"` field |
+|--------|--------------|-----------------|-------------------------|
+| **Base** weights | model-entity | `$NMP_BASE_URL/apis/inference-gateway/v2/workspaces/default/model/<model-entity>/-/v1` | `default/<model-entity>` |
+| **LoRA adapter** | **provider** | `$NMP_BASE_URL/apis/inference-gateway/v2/workspaces/default/provider/<provider>/-/v1` | `default--<output.name>` |
+
+**Common mistake:** posting to the model-entity URL with `"model": "default--<output.name>"` still runs the **base** model. Base-vs-adapter eval will look identical until LoRA requests use the **provider** URL above. See `references/post-training-eval.md` § **Request routing (base vs LoRA)**.
+
+#### Chat inference (CHAT-trained models)
+
+Match training context at inference — send **`messages[:-1]`** (all turns except the final assistant label). Single-turn rows are just the user message; multi-turn rows keep prior user/assistant history.
+
+| Setting | Value | Why |
+|---------|-------|-----|
+| `messages` | All turns except the final assistant label from the JSONL row | Same decode path as SFT |
+| `max_tokens` | `64` for short assistant labels | Training targets are brief (e.g. MCQA choice text) |
+| `temperature` | `0` | Reproducible eval / regression checks |
+| `chat_template_kwargs.enable_thinking` | `false` for Qwen3 short-answer SFT | Thinking mode needs extra tokens and changes output shape vs training |
+
+#### Example — LoRA adapter via provider
+
+\`\`\`bash
+export NMP_BASE_URL=<platform-url>   # omit when using default localhost
+nemo inference gateway provider post v1/chat/completions <provider> --workspace default \\
+  --body '{
+    "model": "default--<output.name>",
+    "messages": [<all turns except final assistant label from the eval row>],
+    "max_tokens": 64,
+    "temperature": 0,
+    "chat_template_kwargs": {"enable_thinking": false}
+  }'
+\`\`\`
+
+#### Example — base model via model-entity (comparison)
+
+\`\`\`bash
+export NMP_BASE_URL=<platform-url>
+nemo inference gateway model post v1/chat/completions <model-entity> --workspace default \\
+  --body '{
+    "model": "default/<model-entity>",
+    "messages": [<same prompt turns as LoRA example — exclude final assistant label>],
+    "max_tokens": 64,
+    "temperature": 0,
+    "chat_template_kwargs": {"enable_thinking": false}
+  }'
+\`\`\`
+
+#### Post-training eval (optional)
+
+Validation loss from training is **not** accuracy. To compare base vs adapter on the validation split with correct routing:
 
 \`\`\`bash
-export NEMO_BASE_URL=<platform-url>   # omit line when using default localhost
 cd /path/to/nemo-platform
-nemo models get <model-entity> --workspace default
+uv run python plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py \\
+  --model-entity <model-entity> \\
+  --adapter <output.name> \\
+  --provider <provider> \\
+  --dataset-fileset <dataset-fileset> \\
+  --split validation.jsonl
 \`\`\`
+
+Uses CHAT `messages` rows unchanged from the training fileset (`messages[:-1]` at inference). Repeat `--adapter` for multi-adapter compare. `--provider` is optional when a READY provider is auto-discovered. Set `NMP_BASE_URL` (or pass `--base-url`) when the platform is not localhost. LoRA only — full SFT / merged outputs need a deployed model entity (see **Using the fine-tuned model**).
+```
+
+### Using the fine-tuned model (full SFT / merged checkpoint)
+
+When `finetuning_type: all_weights` or `save_method` is `merged_16bit` / `merged_4bit`, the job registers a **model** entity at `output.name` with full fine-tuned weights. **Deploy that entity before inference or eval** — full checkpoints are not hot-reloaded onto the base model's LoRA deployment.
+
+1. `nemo models get <output.name> --workspace default` — confirm the fine-tuned model entity exists.
+2. Create or update an inference deployment / provider that serves `default/<output.name>` (same workflow as deploying any model entity).
+3. Append this section with the **READY** provider or deployment name and concrete gateway URL.
+
+```markdown
+### Using the fine-tuned model
+
+Fine-tuned weights are on model entity `default/<output.name>`. Unlike LoRA adapters, full checkpoints **require a new inference deployment** (or provider update) before chat or eval.
+
+| Target | Gateway path | OpenAI base URL | Request `"model"` field |
+|--------|--------------|-----------------|-------------------------|
+| Fine-tuned model | model-entity | `$NMP_BASE_URL/apis/inference-gateway/v2/workspaces/default/model/<output.name>/-/v1` | `default/<output.name>` |
+
+Use the same chat settings as LoRA inference (`messages[:-1]`, `max_tokens`, `temperature`, `enable_thinking` as appropriate). Post-training eval: run generation eval against this model-entity URL (not `eval_helpers.py --adapter`, which is LoRA-specific).
 ```
 
-Use the user's platform URL in `NEMO_BASE_URL` when they overrode it; omit the export line for default `http://127.0.0.1:8080`. The JSON `adapters` array shows `name`, `fileset`, `finetuning_type`, and `lora_config` for each registered adapter.
+Use the user's platform URL in `NMP_BASE_URL` when they overrode it; omit the export line for default `http://127.0.0.1:8080`. Substitute `<provider>`, concrete URLs, and entity names with values from discovery — do not leave generic placeholders in the user-facing report. For **LoRA**, do **not** tell the user to update the deployment before calling the adapter — registration on the base model entity is sufficient. For **full SFT / merged**, tell the user they must deploy `<output.name>` before inference.
 
 **Save report to `/tmp`** — unless the user opts out, write the full Markdown report (header, **Training configuration**, **Using the adapter** when `completed`, and **Resources created** when a slug or new filesets were used) to `/tmp/fine-tune-result-<slug-or-job-suffix>.md`. Use the random slug from the run when one was assigned; otherwise use the job id suffix (e.g. `a925b07ff678`).
 
@@ -602,7 +698,7 @@ Use the user's platform URL in `NEMO_BASE_URL` when they overrode it; omit the e
 
 | Error type | Append |
 |------------|--------|
-| Missing training image + user-overridden `NEMO_BASE_URL` / `NMP_BASE_URL` | `references/troubleshooting.md` § **Missing training images** — on-target build steps, env vars, re-submit commands. **Do not** `docker build` locally for a remote platform. |
+| Missing training image + user-overridden `NMP_BASE_URL` | `references/troubleshooting.md` § **Missing training images** — on-target build steps, env vars, re-submit commands. **Do not** `docker build` locally for a remote platform. |
 | Download fails / `Failed to access upstream storage` / 502 on gated HF model | `references/troubleshooting.md` § **Gated HuggingFace models** — create/update `hf-token`, add `token_secret` to fileset, confirm HF license, re-submit. |
 | W&B not syncing / no `[launcher]` secret lines / `WandbCallback requires wandb` / wandb 401 | `references/troubleshooting.md` § **W&B / integrations not working** (jobs-launcher build, secret update, unsloth image). Setup: `references/integrations-setup.md`. |
 
@@ -626,5 +722,6 @@ For other terminal errors, keep the same header template; put remediation detail
 | W&B / MLflow field reference | `references/hyperparameters.md` § **Integrations (automodel + unsloth)** |
 | W&B secret + MLflow local server + jobs-launcher | `references/integrations-setup.md` |
 | Gated HF model auth (`hf-token`, fileset `token_secret`) | `references/troubleshooting.md` § **Gated HuggingFace models** |
+| Post-training eval (base vs LoRA, CHAT format parity) | `references/post-training-eval.md`, `references/eval_helpers.py` |
 
 Related: `plugins/nemo-automodel/README.md`, `plugins/nemo-unsloth/README.md`, `plugins/nemo-customizer/docs/CUSTOMIZATION.md`, skills **`nemo-files`**, **`nemo-status`**, **`nemo-secrets`**.
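
Reviewer sketch for the stdout-only parsing rule in the SKILL.md hunks above (the Gotchas bullets and the discovery steps). This is a minimal Python illustration that assumes nothing about the real `nemo` CLI beyond what the text states. `fake_cli` is a hypothetical stand-in command that writes a warning to stderr and JSON to stdout, and `run_json` is an illustrative helper name, not part of this PR.

```python
import json
import subprocess
import sys


def run_json(cmd: list[str]) -> dict:
    """Run a command and parse JSON from stdout only; stderr stays separate."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


# Hypothetical stand-in for a `nemo ... -f json` call:
# a harmless warning goes to stderr, the JSON payload goes to stdout.
fake_cli = [
    sys.executable,
    "-c",
    "import sys, json; "
    "print('Configuration file not found, using defaults', file=sys.stderr); "
    "print(json.dumps({'name': 'job-123', 'status': 'created'}))",
]

# Correct: stdout only.
result = run_json(fake_cli)

# Incorrect: merging stderr into stdout (the `2>&1` case) breaks the JSON parse
# even though the command itself succeeded.
merged = subprocess.run(
    fake_cli, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, check=True
).stdout
try:
    json.loads(merged)
    merged_parse_failed = False
except json.JSONDecodeError:
    merged_parse_failed = True
```

Capturing the two streams separately keeps a successful submit parseable, which avoids the re-submit-after-parse-error path that the Gotchas text warns about.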
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/dataset-formats.md b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/dataset-formats.md
index b03b43e12c..d1d026656f 100644
--- a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/dataset-formats.md
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/dataset-formats.md
@@ -54,3 +54,15 @@ Optional fields on the unsloth `dataset` block:
 - The automodel SFT format `{"prompt": "...", "completion": "..."}` is **not** directly consumable by unsloth — unsloth has no built-in `prompt`/`completion` concatenation. Convert to either messages or pre-rendered text before upload.
 
 EMBEDDING and CUSTOM (automodel-only schemas) are not supported by unsloth today.
+
+## Post-training evaluation
+
+Eval rows must use the **same CHAT `messages` shape** as training. Do not flatten to `prompt`/`expected` for the evaluator.
+
+| Training JSONL | Eval dataset | Eval `prompt_template` | Metric reference |
+|----------------|--------------|------------------------|------------------|
+| `messages` (single- or multi-turn) | Same fileset split (`validation.jsonl`) | `messages[:-1]` — exclude final assistant label — see `post-training-eval.md` | `{{ item.messages[-1].content }}` |
+
+LoRA inference and eval use the **provider** gateway on the **base** entity (`/provider/<name>/-/v1`, `model: default--<adapter>`). Base model uses the model-entity path. Full SFT / merged checkpoints use the **output** model entity's model-entity URL — deploy first. See `post-training-eval.md` and the **Using the adapter** / **Using the fine-tuned model** sections in `SKILL.md`.
+
+Shared helpers and compare CLI: `references/eval_helpers.py`. Full workflow: `references/post-training-eval.md`.
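
A minimal sketch of the row split and request routing described above, assuming the URL shapes quoted in this section and in `SKILL.md`. `split_chat_row` and `request_target` are hypothetical names used only for illustration (the PR's own helpers live in `references/eval_helpers.py`), and the entity, provider, and adapter names in the example are made up.

```python
from __future__ import annotations

from typing import Any


def split_chat_row(row: dict[str, Any]) -> tuple[list[dict[str, Any]], str]:
    """Return (prompt turns, reference label) for one CHAT row."""
    messages = row["messages"]
    if len(messages) < 2 or messages[-1].get("role") != "assistant":
        raise ValueError("expected prompt turns followed by a final assistant label")
    # Send everything except the final assistant turn; score against that turn.
    return messages[:-1], messages[-1]["content"]


def request_target(
    base_url: str,
    workspace: str,
    *,
    model_entity: str,
    provider: str | None = None,
    adapter: str | None = None,
) -> tuple[str, str]:
    """Return (OpenAI base URL, request `model` field) for base or LoRA requests."""
    root = f"{base_url.rstrip('/')}/apis/inference-gateway/v2/workspaces/{workspace}"
    if adapter is None:
        # Base weights: model-entity path, `workspace/entity` model field.
        return f"{root}/model/{model_entity}/-/v1", f"{workspace}/{model_entity}"
    if provider is None:
        raise ValueError("LoRA requests need a provider name")
    # LoRA adapter: provider path, `workspace--adapter` model field.
    return f"{root}/provider/{provider}/-/v1", f"{workspace}--{adapter}"


# Example values below are invented for illustration.
row = {
    "messages": [
        {"role": "user", "content": "Capital of France?"},
        {"role": "assistant", "content": "Paris"},
    ]
}
prompt, reference = split_chat_row(row)
base_url, base_model = request_target(
    "http://127.0.0.1:8080", "default", model_entity="my-base"
)
lora_url, lora_model = request_target(
    "http://127.0.0.1:8080/",
    "default",
    model_entity="my-base",
    provider="my-provider",
    adapter="my-adapter",
)
```

Both targets receive the same `prompt` turns; only the URL and the `model` field differ, which is what makes a base-vs-adapter comparison meaningful.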
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py
new file mode 100644
index 0000000000..1c866d4fcd
--- /dev/null
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py
@@ -0,0 +1,741 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""Post-training evaluation helpers — keep eval dataset shape aligned with CHAT training JSONL.
+
+**LoRA** (``output.save_method: lora``): adapters registered on the base model entity
+are hot-reloaded on deployments with ``lora_enabled: true`` — no deployment update or
+new inference deployment before eval.
+
+**Full SFT** (``finetuning_type: all_weights``) or **merged LoRA checkpoints**
+(``save_method: merged_16bit`` / ``merged_4bit``): the job registers a new **model**
+entity at ``output.name``. Deploy that entity for inference before chat or eval — full
+weights are not hot-reloaded onto the base model's deployment.
+
+Run from the nemo-platform git root (reads ``$NMP_BASE_URL`` when ``--base-url`` is omitted)::
+
+    export NMP_BASE_URL=http://127.0.0.1:8080
+    uv run python plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py \\
+        --model-entity <model-entity> --adapter <adapter-a> --adapter <adapter-b> \\
+        --provider <provider> --dataset-fileset <dataset-fileset> --split validation.jsonl
+
+Import in agent scripts (add references/ to sys.path or run via uv from repo root).
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import re
+import urllib.error
+import urllib.request
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Sequence
+
+# --- Train/eval format contract (CHAT JSONL) --------------------------------
+
+CHAT_ROW_KEYS = frozenset({"messages"})
+
+# Inference: all turns except the final assistant label (single- or multi-turn).
+CHAT_USER_PROMPT_TEMPLATE: dict[str, Any] = {
+    "messages": "{{ item.messages[:-1] }}",
+}
+
+# Metric reference: final assistant turn (the label to predict).
+CHAT_REFERENCE_TEMPLATE = "{{ item.messages[-1].content }}"
+
+# Back-compat single-turn variant for MCQA docs/snippets: sends only messages[0] as the user turn (not an alias of the template above).
+CHAT_SINGLE_TURN_USER_PROMPT_TEMPLATE = {
+    "messages": [{"role": "user", "content": "{{ item.messages[0].content }}"}],
+}
+
+PLATFORM_HTTP_TIMEOUT_SEC = 60
+
+
+def _assert_message_turn(turn: Any, *, label: str, index: int | str) -> dict[str, Any]:
+    """Validate one messages[] element is a dict before reading role/content."""
+    if not isinstance(turn, dict):
+        raise ValueError(f"{label}: messages[{index}] must be an object with role/content, got {type(turn).__name__}")
+    return turn
+
+
+def assert_chat_row(row: dict[str, Any], *, index: int | None = None) -> None:
+    """Validate one dataset row matches automodel/unsloth CHAT training shape."""
+    label = f"row {index}" if index is not None else "row"
+    if "messages" not in row:
+        raise ValueError(
+            f"{label}: expected CHAT format with 'messages' array; got keys {sorted(row)}. "
+            "Do not flatten to prompt/expected — use references/post-training-eval.md."
+        )
+    messages = row["messages"]
+    if not isinstance(messages, list) or len(messages) < 2:
+        raise ValueError(f"{label}: messages must be a list with at least one prompt turn + final assistant label")
+    first = _assert_message_turn(messages[0], label=label, index=0)
+    if first.get("role") != "user":
+        raise ValueError(f"{label}: expected messages[0]=user")
+    last = _assert_message_turn(messages[-1], label=label, index=-1)
+    if last.get("role") != "assistant":
+        raise ValueError(f"{label}: expected final messages[-1]=assistant (the label to score)")
+
+
+def reference_content(row: dict[str, Any]) -> str:
+    """Return the assistant label for a CHAT row (final turn)."""
+    assert_chat_row(row)
+    return row["messages"][-1]["content"]
+
+
+def load_chat_jsonl(path: Path | str) -> list[dict[str, Any]]:
+    """Load JSONL rows; validate CHAT shape; return rows unchanged."""
+    rows: list[dict[str, Any]] = []
+    with Path(path).open(encoding="utf-8") as handle:
+        for index, line in enumerate(handle, start=1):
+            if not line.strip():
+                continue
+            row = json.loads(line)
+            assert_chat_row(row, index=index)
+            rows.append(row)
+    return rows
+
+
+def load_chat_jsonl_from_platform(
+    *,
+    base_url: str,
+    workspace: str,
+    fileset: str,
+    remote_path: str,
+) -> list[dict[str, Any]]:
+    """Download a JSONL split from a platform fileset and validate CHAT rows."""
+    url = f"{base_url.rstrip('/')}/apis/files/v2/workspaces/{workspace}/filesets/{fileset}/-/{remote_path.lstrip('/')}"
+    with urllib.request.urlopen(url, timeout=PLATFORM_HTTP_TIMEOUT_SEC) as response:
+        content = response.read().decode("utf-8")
+    rows: list[dict[str, Any]] = []
+    for index, line in enumerate(content.splitlines(), start=1):
+        if not line.strip():
+            continue
+        row = json.loads(line)
+        assert_chat_row(row, index=index)
+        rows.append(row)
+    return rows
+
+
+def chat_metrics():
+    """Build default metrics for CHAT SFT eval (exact match + ROUGE + BLEU)."""
+    from nemo_evaluator_sdk import BLEUMetric, ROUGEMetric
+    from nemo_evaluator_sdk.metrics.exact_match import ExactMatchMetric
+
+    ref = CHAT_REFERENCE_TEMPLATE
+    return [
+        ExactMatchMetric(reference=ref),
+        ROUGEMetric(reference=ref),
+        BLEUMetric(references=[ref]),
+    ]
+
+
+def normalize_mcqa_answer(text: str) -> str:
+    """Normalize MCQA model output for comparison with bare choice-text references."""
+    text = text.strip()
+    bold = re.search(r"\*\*(?:[A-E]\.\s*)?([^*]+)\*\*", text)
+    if bold:
+        text = bold.group(1)
+    text = re.sub(r"^[A-E]\.\s*", "", text)
+    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)
+    return text.strip().lower()
+
+
+def served_model_name(*, workspace: str, entity_or_adapter: str, finetuning: str = "base") -> str:
+    """Return the ``model`` field for base entity or LoRA adapter requests."""
+    if finetuning == "base":
+        return f"{workspace}/{entity_or_adapter}"
+    if finetuning == "lora":
+        return f"{workspace}--{entity_or_adapter}"
+    raise ValueError("finetuning must be 'base' or 'lora'")
+
+
+def adapter_composite_entity_name(*, model_entity: str, workspace: str, adapter_name: str) -> str:
+    """LoRA composite model-entity path segment (for reference / OpenAI-route body only).
+
+    The model-entity proxy path ``model/{composite}/-/v1`` requires a dedicated
+    VirtualModel per composite and typically 404s on stock deployments. Prefer
+    :func:`provider_gateway_url` for adapter eval.
+    """
+    return f"{model_entity}&adapters/{workspace}/{adapter_name}"
+
+
+def model_entity_gateway_url(*, base_url: str, workspace: str, model_entity: str) -> str:
+    """OpenAI-compatible inference-gateway URL for a registered base model entity."""
+    return f"{base_url.rstrip('/')}/apis/inference-gateway/v2/workspaces/{workspace}/model/{model_entity}/-/v1"
+
+
+def provider_gateway_url(*, base_url: str, workspace: str, provider_name: str) -> str:
+    """OpenAI-compatible inference-gateway URL for a model provider (LoRA eval route)."""
+    return f"{base_url.rstrip('/')}/apis/inference-gateway/v2/workspaces/{workspace}/provider/{provider_name}/-/v1"
+
+
+def gateway_path_from_url(url: str) -> str:
+    """Return ``model-entity`` or ``provider`` from a gateway base URL."""
+    if "/provider/" in url:
+        return "provider"
+    if "/model/" in url:
+        return "model-entity"
+    return "unknown"
+
+
+def _platform_get_json(url: str) -> dict[str, Any]:
+    with urllib.request.urlopen(url, timeout=PLATFORM_HTTP_TIMEOUT_SEC) as response:
+        return json.loads(response.read().decode("utf-8"))
+
+
+def find_ready_provider_for_model_entity(
+    *,
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+) -> str | None:
+    """Return a READY provider name that serves ``workspace/model_entity`` (base or LoRA)."""
+    url = f"{base_url.rstrip('/')}/apis/models/v2/workspaces/{workspace}/providers?page_size=100&filter.status=READY"
+    payload = _platform_get_json(url)
+    base_entity_id = f"{workspace}/{model_entity}"
+    matches: list[str] = []
+    for provider in payload.get("data", []):
+        if provider.get("status") != "READY":
+            continue
+        for served in provider.get("served_models") or []:
+            entity_id = served.get("model_entity_id") or ""
+            if entity_id == base_entity_id or entity_id.startswith(f"{base_entity_id}&adapters/"):
+                matches.append(provider["name"])
+                break
+    if not matches:
+        return None
+    # Deterministic pick: alphabetically first match (no deployment-backed preference is applied here).
+    return sorted(set(matches))[0]
+
+
+@dataclass
+class JobAdapterInfo:
+    job_name: str
+    adapter_name: str
+    epochs: int | None
+    backend: str
+    model_entity: str
+    dataset_ref: str
+    status: str
+    created_at: str | None = None
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "job_name": self.job_name,
+            "adapter_name": self.adapter_name,
+            "epochs": self.epochs,
+            "backend": self.backend,
+            "model_entity": self.model_entity,
+            "dataset_ref": self.dataset_ref,
+            "status": self.status,
+            "created_at": self.created_at,
+        }
+
+
+def adapter_from_completed_job(
+    *,
+    base_url: str,
+    workspace: str,
+    job_name: str,
+) -> JobAdapterInfo:
+    """Resolve adapter output name and training epochs from a platform job."""
+    url = f"{base_url.rstrip('/')}/apis/jobs/v2/workspaces/{workspace}/jobs/{job_name}"
+    try:
+        job = _platform_get_json(url)
+    except urllib.error.HTTPError as exc:
+        raise ValueError(f"Could not fetch job {workspace}/{job_name}: HTTP {exc.code}") from exc
+    spec = job.get("spec") or {}
+    output_name = (spec.get("output") or {}).get("name") or spec.get("name")
+    if not output_name:
+        raise ValueError(f"Job {job_name} has no output adapter name in spec")
+    model = spec.get("model")
+    model_entity = model.get("name", "") if isinstance(model, dict) else (model or "")
+    if model_entity.startswith(f"{workspace}/"):
+        model_entity = model_entity.split("/", 1)[1]
+    dataset = spec.get("dataset") or {}
+    dataset_ref = dataset.get("path") or dataset.get("training") or ""
+    backend = job_name.split("-", 1)[0] if "-" in job_name else "unknown"
+    return JobAdapterInfo(
+        job_name=job_name,
+        adapter_name=output_name,
+        epochs=(spec.get("schedule") or {}).get("epochs"),
+        backend=backend,
+        model_entity=model_entity,
+        dataset_ref=dataset_ref,
+        status=job.get("status", "unknown"),
+        created_at=job.get("created_at"),
+    )
+
+
+def list_completed_job_adapters(
+    *,
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+    dataset_fileset: str | None = None,
+    page_size: int = 500,
+) -> list[JobAdapterInfo]:
+    """List completed customization jobs and their output adapter names."""
+    url = (
+        f"{base_url.rstrip('/')}/apis/jobs/v2/workspaces/{workspace}/jobs?page_size={page_size}&filter.status=completed"
+    )
+    payload = _platform_get_json(url)
+    dataset_ref = f"{workspace}/{dataset_fileset}" if dataset_fileset else None
+    model_ref = f"{workspace}/{model_entity}"
+    results: list[JobAdapterInfo] = []
+    for job in payload.get("data", []):
+        if job.get("status") != "completed":
+            continue
+        spec = job.get("spec") or {}
+        out = (spec.get("output") or {}).get("name") or spec.get("name")
+        if not out:
+            continue
+        model = spec.get("model")
+        job_model = model.get("name", "") if isinstance(model, dict) else (model or "")
+        ds = spec.get("dataset") or {}
+        job_ds = ds.get("path") or ds.get("training") or ""
+        if model_ref not in str(job_model):
+            continue
+        if dataset_ref and dataset_ref not in str(job_ds):
+            continue
+        backend = job["name"].split("-", 1)[0] if "-" in job["name"] else "unknown"
+        results.append(
+            JobAdapterInfo(
+                job_name=job["name"],
+                adapter_name=out,
+                epochs=(spec.get("schedule") or {}).get("epochs"),
+                backend=backend,
+                model_entity=model_entity,
+                dataset_ref=job_ds,
+                status=job.get("status", "completed"),
+                created_at=job.get("created_at"),
+            )
+        )
+    results.sort(key=lambda item: item.created_at or "", reverse=True)
+    return results
+
+
+def build_online_eval_config(
+    *,
+    max_tokens: int = 64,
+    temperature: float = 0,
+    parallelism: int = 8,
+    enable_thinking: bool = False,
+    limit_samples: int | None = None,
+):
+    """RunConfigOnlineModel defaults aligned with Qwen3 CHAT SFT eval."""
+    from nemo_evaluator_sdk.values import InferenceParams, RunConfigOnlineModel
+
+    extra_body = {"chat_template_kwargs": {"enable_thinking": enable_thinking}} if not enable_thinking else None
+    inference_kwargs: dict[str, Any] = {"max_tokens": max_tokens, "temperature": temperature}
+    if extra_body:
+        inference_kwargs["extra_body"] = extra_body
+    return RunConfigOnlineModel(
+        parallelism=parallelism,
+        limit_samples=limit_samples,
+        inference=InferenceParams(**inference_kwargs),
+    )
+
+
+def build_platform_model_target(
+    *,
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+    adapter_name: str | None = None,
+    provider_name: str | None = None,
+):
+    """SDK Model target for base entity or LoRA adapter on the platform gateway.
+
+    Base weights use the **model-entity** proxy
+    (``/model/{entity}/-/v1``). LoRA adapters must use the **provider** proxy
+    (``/provider/{name}/-/v1``) with ``model: {workspace}--{adapter}`` — the
+    model-entity path always routes to the base VirtualModel and ignores adapter
+    names in the request body.
+    """
+    from nemo_evaluator_sdk.enums import ModelFormat
+    from nemo_evaluator_sdk.values.models import Model
+
+    resolved_provider = provider_name or find_ready_provider_for_model_entity(
+        base_url=base_url,
+        workspace=workspace,
+        model_entity=model_entity,
+    )
+    if not resolved_provider:
+        raise ValueError(
+            f"No READY inference provider serves {workspace}/{model_entity}. "
+            "Deploy the base model (with lora_enabled: true for LoRA eval) or pass --provider <name>."
+        )
+
+    if adapter_name:
+        return Model(
+            url=provider_gateway_url(
+                base_url=base_url,
+                workspace=workspace,
+                provider_name=resolved_provider,
+            ),
+            name=served_model_name(workspace=workspace, entity_or_adapter=adapter_name, finetuning="lora"),
+            format=ModelFormat.NVIDIA_NIM,
+        )
+
+    return Model(
+        url=model_entity_gateway_url(base_url=base_url, workspace=workspace, model_entity=model_entity),
+        name=served_model_name(workspace=workspace, entity_or_adapter=model_entity, finetuning="base"),
+        format=ModelFormat.NVIDIA_NIM,
+    )
+
+
+@dataclass
+class EvalSummary:
+    target: str
+    model_name: str
+    gateway_url: str
+    gateway_path: str
+    num_samples: int
+    raw_exact_match: float
+    normalized_accuracy: float
+    aggregate_metrics: dict[str, dict[str, float | None]]
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "target": self.target,
+            "model_name": self.model_name,
+            "gateway_url": self.gateway_url,
+            "gateway_path": self.gateway_path,
+            "num_samples": self.num_samples,
+            "raw_exact_match": self.raw_exact_match,
+            "normalized_accuracy": self.normalized_accuracy,
+            "metrics": self.aggregate_metrics,
+        }
+
+
+def summarize_chat_eval_result(*, target: str, model_name: str, gateway_url: str, result) -> EvalSummary:
+    """Summarize Evaluator benchmark result for CHAT rows."""
+    em_rows = result.per_metric["exact-match"].row_scores
+    num_samples = len(em_rows)
+    raw_correct = sum(
+        1 for rs in em_rows if rs.sample.get("output_text", "").strip() == reference_content(rs.item).strip()
+    )
+    norm_correct = sum(
+        1
+        for rs in em_rows
+        if normalize_mcqa_answer(rs.sample.get("output_text", "")) == normalize_mcqa_answer(reference_content(rs.item))
+    )
+    aggregate_metrics: dict[str, dict[str, float | None]] = {}
+    for metric_name, metric_result in result.per_metric.items():
+        aggregate_metrics[metric_name] = {
+            score.name.split(".")[-1]: round(score.mean, 4) if score.mean is not None else None
+            for score in metric_result.aggregate_scores.scores
+        }
+    return EvalSummary(
+        target=target,
+        model_name=model_name,
+        gateway_url=gateway_url,
+        gateway_path=gateway_path_from_url(gateway_url),
+        num_samples=num_samples,
+        raw_exact_match=round(raw_correct / num_samples, 4) if num_samples else 0.0,
+        normalized_accuracy=round(norm_correct / num_samples, 4) if num_samples else 0.0,
+        aggregate_metrics=aggregate_metrics,
+    )
+
+
+def run_chat_online_eval(
+    *,
+    rows: Sequence[dict[str, Any]],
+    target,
+    config,
+    metrics=None,
+    prompt_template: dict[str, Any] | None = None,
+):
+    """Run online eval on CHAT rows using shared templates."""
+    from nemo_evaluator_sdk import Evaluator
+
+    for index, row in enumerate(rows):
+        assert_chat_row(row, index=index)
+    if metrics is None:
+        metrics = chat_metrics()
+    return Evaluator().run_sync(
+        metrics=metrics,
+        dataset=list(rows),
+        target=target,
+        prompt_template=prompt_template or CHAT_USER_PROMPT_TEMPLATE,
+        config=config,
+    )
+
+
+def _eval_target(
+    *,
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+    adapter_name: str | None,
+    provider_name: str | None,
+    rows: Sequence[dict[str, Any]],
+    config,
+    target_label: str,
+) -> EvalSummary:
+    target = build_platform_model_target(
+        base_url=base_url,
+        workspace=workspace,
+        model_entity=model_entity,
+        adapter_name=adapter_name,
+        provider_name=provider_name,
+    )
+    result = run_chat_online_eval(rows=rows, target=target, config=config)
+    return summarize_chat_eval_result(
+        target=target_label,
+        model_name=target.name,
+        gateway_url=target.url,
+        result=result,
+    )
+
+
+def compare_adapters(
+    *,
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+    adapter_names: Sequence[str],
+    rows: Sequence[dict[str, Any]],
+    include_base: bool = True,
+    provider_name: str | None = None,
+    max_tokens: int = 64,
+    enable_thinking: bool = False,
+    parallelism: int = 8,
+    limit_samples: int | None = None,
+) -> list[EvalSummary]:
+    """Compare base (optional) and one or more LoRA adapters on the same CHAT rows."""
+    config = build_online_eval_config(
+        max_tokens=max_tokens,
+        enable_thinking=enable_thinking,
+        parallelism=parallelism,
+        limit_samples=limit_samples,
+    )
+    summaries: list[EvalSummary] = []
+    if include_base:
+        summaries.append(
+            _eval_target(
+                base_url=base_url,
+                workspace=workspace,
+                model_entity=model_entity,
+                adapter_name=None,
+                provider_name=provider_name,
+                rows=rows,
+                config=config,
+                target_label="base",
+            )
+        )
+    for adapter_name in adapter_names:
+        summaries.append(
+            _eval_target(
+                base_url=base_url,
+                workspace=workspace,
+                model_entity=model_entity,
+                adapter_name=adapter_name,
+                provider_name=provider_name,
+                rows=rows,
+                config=config,
+                target_label=adapter_name,
+            )
+        )
+    return summaries
+
+
+def compare_base_vs_adapter(
+    *,
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+    adapter_name: str,
+    rows: Sequence[dict[str, Any]],
+    provider_name: str | None = None,
+    max_tokens: int = 64,
+    enable_thinking: bool = False,
+    parallelism: int = 8,
+    limit_samples: int | None = None,
+) -> list[EvalSummary]:
+    """Compare base model vs one LoRA adapter on the same CHAT validation rows."""
+    summaries = compare_adapters(
+        base_url=base_url,
+        workspace=workspace,
+        model_entity=model_entity,
+        adapter_names=[adapter_name],
+        rows=rows,
+        include_base=True,
+        provider_name=provider_name,
+        max_tokens=max_tokens,
+        enable_thinking=enable_thinking,
+        parallelism=parallelism,
+        limit_samples=limit_samples,
+    )
+    if len(summaries) == 2:
+        summaries[1].target = "lora"
+    return summaries
+
+
+def lift_vs_base(summaries: Sequence[EvalSummary]) -> dict[str, float]:
+    """Normalized accuracy delta vs the base summary (if present)."""
+    base = next((summary for summary in summaries if summary.target == "base"), None)
+    if base is None:
+        return {}
+    return {
+        summary.target: round(summary.normalized_accuracy - base.normalized_accuracy, 4)
+        for summary in summaries
+        if summary.target != "base"
+    }
+
+
+def routing_sanity_warnings(
+    summaries: Sequence[EvalSummary],
+    *,
+    routing_tolerance_pp: float = 0.015,
+) -> list[str]:
+    """Return human-readable warnings when LoRA routing or scores look suspicious."""
+    warnings: list[str] = []
+    base = next((summary for summary in summaries if summary.target == "base"), None)
+    for summary in summaries:
+        if summary.target == "base":
+            if summary.gateway_path != "model-entity":
+                warnings.append(
+                    f"base eval used {summary.gateway_path} route; expected model-entity ({summary.gateway_url})"
+                )
+            continue
+        if summary.gateway_path != "provider":
+            warnings.append(
+                f"{summary.target}: LoRA eval used {summary.gateway_path} route "
+                f"({summary.gateway_url}); expected provider gateway — scores may match base"
+            )
+        if base and abs(summary.normalized_accuracy - base.normalized_accuracy) <= routing_tolerance_pp:
+            warnings.append(
+                f"{summary.target}: normalized accuracy {summary.normalized_accuracy:.1%} is within "
+                f"{routing_tolerance_pp:.1%} of base ({base.normalized_accuracy:.1%}) — verify provider routing"
+            )
+    return warnings
+
+
+def build_eval_payload(
+    *,
+    summaries: Sequence[EvalSummary],
+    base_url: str,
+    workspace: str,
+    model_entity: str,
+    adapter_names: Sequence[str],
+    provider_name: str | None,
+) -> dict[str, Any]:
+    """Assemble CLI/programmatic JSON output with routing metadata and warnings."""
+    routing: dict[str, Any] = {}
+    if any(summary.target == "base" for summary in summaries):
+        routing["base"] = {
+            "gateway_path": "model-entity",
+            "url": model_entity_gateway_url(base_url=base_url, workspace=workspace, model_entity=model_entity),
+            "model_field": served_model_name(workspace=workspace, entity_or_adapter=model_entity, finetuning="base"),
+        }
+    for adapter_name in adapter_names:
+        target = build_platform_model_target(
+            base_url=base_url,
+            workspace=workspace,
+            model_entity=model_entity,
+            adapter_name=adapter_name,
+            provider_name=provider_name,
+        )
+        routing[adapter_name] = {
+            "gateway_path": "provider",
+            "url": target.url,
+            "model_field": target.name,
+        }
+    warnings = routing_sanity_warnings(summaries)
+    payload: dict[str, Any] = {
+        "dataset_format": "chat (messages)",
+        "prompt_template": CHAT_USER_PROMPT_TEMPLATE,
+        "reference_template": CHAT_REFERENCE_TEMPLATE,
+        "routing": routing,
+        "results": [summary.to_dict() for summary in summaries],
+        "lift_vs_base": lift_vs_base(summaries),
+        "primary_metric": "normalized_accuracy",
+    }
+    if warnings:
+        payload["warnings"] = warnings
+    return payload
+
+
+def default_base_url() -> str:
+    """Platform URL from env or localhost default."""
+    return os.environ.get("NMP_BASE_URL") or "http://127.0.0.1:8080"
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Compare base vs LoRA on CHAT validation JSONL")
+    parser.add_argument(
+        "--base-url",
+        default=default_base_url(),
+        help="Platform URL (default: $NMP_BASE_URL or http://127.0.0.1:8080)",
+    )
+    parser.add_argument("--workspace", default="default")
+    parser.add_argument("--model-entity", required=True)
+    parser.add_argument(
+        "--adapter",
+        action="append",
+        required=True,
+        help="Adapter name(s) registered on the model entity (repeat for multi-adapter compare)",
+    )
+    parser.add_argument(
+        "--provider",
+        default=None,
+        help="Inference provider name for LoRA requests (auto-discovered when omitted)",
+    )
+    parser.add_argument("--dataset-fileset", required=True)
+    parser.add_argument("--split", default="validation.jsonl")
+    parser.add_argument("--max-tokens", type=int, default=64)
+    parser.add_argument("--enable-thinking", action="store_true")
+    parser.add_argument("--limit-samples", type=int, default=None)
+    parser.add_argument("--output", type=Path, default=None)
+    parser.add_argument(
+        "--no-base",
+        action="store_true",
+        help="Skip base-model eval (adapter-only comparison)",
+    )
+    return parser.parse_args()
+
+
+def main() -> int:
+    args = _parse_args()
+    rows = load_chat_jsonl_from_platform(
+        base_url=args.base_url,
+        workspace=args.workspace,
+        fileset=args.dataset_fileset,
+        remote_path=args.split,
+    )
+    summaries = compare_adapters(
+        base_url=args.base_url,
+        workspace=args.workspace,
+        model_entity=args.model_entity,
+        adapter_names=args.adapter,
+        rows=rows,
+        include_base=not args.no_base,
+        provider_name=args.provider,
+        max_tokens=args.max_tokens,
+        enable_thinking=args.enable_thinking,
+        limit_samples=args.limit_samples,
+    )
+    payload = build_eval_payload(
+        summaries=summaries,
+        base_url=args.base_url,
+        workspace=args.workspace,
+        model_entity=args.model_entity,
+        adapter_names=args.adapter,
+        provider_name=args.provider,
+    )
+    text = json.dumps(payload, indent=2)
+    print(text)
+    if args.output:
+        args.output.parent.mkdir(parents=True, exist_ok=True)
+        args.output.write_text(text, encoding="utf-8")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
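
The routing split that `build_platform_model_target` implements above can be sketched in isolation. This is a minimal illustration, not the module's code: `gateway_target` is a hypothetical name, and the `<workspace>/<entity>` and `<workspace>--<adapter>` model fields are taken from the routing table in `post-training-eval.md`, on the assumption that `served_model_name` produces the same strings.

```python
from __future__ import annotations

GATEWAY_ROOT = "/apis/inference-gateway/v2/workspaces"


def gateway_target(
    *,
    base_url: str,
    workspace: str,
    model_entity: str,
    adapter_name: str | None = None,
    provider_name: str | None = None,
) -> tuple[str, str]:
    """Return (gateway_url, model_field). Hypothetical helper for illustration."""
    root = f"{base_url.rstrip('/')}{GATEWAY_ROOT}/{workspace}"
    if adapter_name is None:
        # Base weights: model-entity proxy, model field "<workspace>/<entity>".
        return f"{root}/model/{model_entity}/-/v1", f"{workspace}/{model_entity}"
    if not provider_name:
        # The model-entity route would silently serve base weights, so refuse.
        raise ValueError("LoRA targets need a provider name")
    # LoRA adapter: provider proxy, model field "<workspace>--<adapter>".
    return f"{root}/provider/{provider_name}/-/v1", f"{workspace}--{adapter_name}"


base = gateway_target(
    base_url="http://127.0.0.1:8080/",
    workspace="default",
    model_entity="my-entity",
)
lora = gateway_target(
    base_url="http://127.0.0.1:8080",
    workspace="default",
    model_entity="my-entity",
    adapter_name="my-adapter",
    provider_name="my-provider",
)
print(base)
print(lora)
```

Raising when no provider is given mirrors the `ValueError` in `build_platform_model_target`: falling back to the model-entity route would return base-model scores under the adapter's label.
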
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/hyperparameters.md b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/hyperparameters.md
index 2b6e5e0495..277b0be196 100644
--- a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/hyperparameters.md
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/hyperparameters.md
@@ -566,7 +566,7 @@ See **Integrations (automodel + unsloth)** above.
 |-------|---------|-------|
 | `name` | auto-derived from `<model-entity>-<dataset>-<hex12>` | The output model entity / fileset name. |
 | `description` | `null` | Free-form description carried onto the entity and fileset. |
-| `save_method` | `"lora"` | `"lora"` (adapter — small, deploy via NIM/vLLM with adapter loader), `"merged_16bit"` (merged checkpoint, deploy without adapter), `"merged_4bit"` (lossy, storage-tight). `merged_*` requires `training.finetuning_type: "lora"`. |
+| `save_method` | `"lora"` | `"lora"` (adapter — hot-reloads on base LoRA deployment; no new inference deploy), `"merged_16bit"` (merged checkpoint — **deploy** `output.name` as model entity), `"merged_4bit"` (lossy, storage-tight; deploy like merged). `merged_*` requires `training.finetuning_type: "lora"`. |
 
 After `to_spec`, the canonical `OutputResponse` also carries `type` (`"adapter"` for `save_method: "lora"`, `"model"` otherwise) and `fileset` (defaults to `name`); both are derived — submitter doesn't set them.
 
@@ -598,12 +598,12 @@ Drop `rank` before lowering batch when OOM. Higher `alpha/rank` ratios amplify a
 
 ### Save-method picker
 
-| User wants | `save_method` |
-|------------|---------------|
-| Smallest artefact, deploy via adapter loader (default NIM / vLLM) | `lora` |
-| Full-weight checkpoint to deploy without an adapter | `merged_16bit` |
-| Disk-tight merged checkpoint (lossy) | `merged_4bit` |
-| Full SFT (no LoRA) | `lora` is invalid here; output is always a full model — leave `save_method` at default and ignore the merged options |
+| User wants | `save_method` | Inference after training |
+|------------|---------------|--------------------------|
+| Smallest artefact; hot-reload on base LoRA deployment | `lora` | No new deploy — adapter loads on existing `lora_enabled` deployment |
+| Full-weight checkpoint as standalone model | `merged_16bit` | **Deploy** `output.name` as new model entity |
+| Disk-tight merged checkpoint (lossy) | `merged_4bit` | **Deploy** `output.name` as new model entity |
+| Full SFT (no LoRA) | `lora` is invalid; output is always a full model | **Deploy** `output.name` as new model entity |
 
 `merged_*` require `training.finetuning_type: "lora"`. The schema validator surfaces a clear error if violated.
 
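
The save-method picker above reduces to a small decision function. The sketch below is illustrative only: `inference_plan` and its return keys are hypothetical names, not part of the plugin or the schema validator, and the rules are read directly off the table and the `merged_*` constraint.

```python
from __future__ import annotations

MERGED_METHODS = {"merged_16bit", "merged_4bit"}


def inference_plan(finetuning_type: str, save_method: str = "lora") -> dict[str, str]:
    """Map training settings to output type and the inference step after training.

    Hypothetical helper for illustration; the real validation lives in the schema.
    """
    if finetuning_type != "lora":
        # Full SFT: merged_* is rejected, and the output is always a full model.
        if save_method in MERGED_METHODS:
            raise ValueError("merged_* requires training.finetuning_type: 'lora'")
        return {"output_type": "model", "inference": "deploy output.name"}
    if save_method == "lora":
        # Adapter output: loads onto the existing lora_enabled base deployment.
        return {"output_type": "adapter", "inference": "hot-reload"}
    if save_method in MERGED_METHODS:
        # Merged checkpoint: standalone model entity that needs its own deploy.
        return {"output_type": "model", "inference": "deploy output.name"}
    raise ValueError(f"unknown save_method: {save_method!r}")


print(inference_plan("lora"))
print(inference_plan("lora", "merged_16bit"))
print(inference_plan("all_weights"))
```
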
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/integrations-setup.md b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/integrations-setup.md
index 9927d26c17..bbad950c37 100644
--- a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/integrations-setup.md
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/integrations-setup.md
@@ -96,7 +96,7 @@ Job JSON references the secret by name:
 Store the API key in the **platform** secret store. A local `wandb login` cache on your laptop is **not** used by training containers.
 
 ```bash
-export NEMO_BASE_URL=http://<platform-host>:8080   # omit when using default localhost
+export NMP_BASE_URL=http://<platform-host>:8080   # omit when using default localhost
 cd /path/to/nemo-platform
 
 # Create (first time)
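
With `NMP_BASE_URL` as the single variable, resolving the platform URL and tracking whether the user overrode the default (the distinction the **Platform unreachable** gotcha relies on) could look like the sketch below. `resolve_base_url` is a hypothetical helper, and treating both `127.0.0.1` and `localhost` as the default is an assumption based on the SKILL.md wording.

```python
from __future__ import annotations

import os
from collections.abc import Mapping

# Both spellings are treated as "the default" (assumption from SKILL.md).
DEFAULT_BASE_URLS = ("http://127.0.0.1:8080", "http://localhost:8080")


def resolve_base_url(env: Mapping[str, str] | None = None) -> tuple[str, bool]:
    """Return (base_url, user_overrode). Hypothetical helper for illustration."""
    source = os.environ if env is None else env
    value = (source.get("NMP_BASE_URL") or "").strip().rstrip("/")
    if not value:
        # Unset or empty: fall back to the local default, no override.
        return DEFAULT_BASE_URLS[0], False
    # An explicit value equal to the default does not count as an override.
    return value, value not in DEFAULT_BASE_URLS


print(resolve_base_url({}))
print(resolve_base_url({"NMP_BASE_URL": "http://10.0.0.51:8080/"}))
```

When the second element is `True` and the platform is unreachable, the agent should stop and report the address instead of offering to start local services.
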
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/post-training-eval.md b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/post-training-eval.md
new file mode 100644
index 0000000000..4fb787ad82
--- /dev/null
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/post-training-eval.md
@@ -0,0 +1,228 @@
+# Post-training evaluation (train/eval format parity)
+
+Use after a customization job reaches **`completed`** when the user wants to compare **base vs LoRA** on the validation split.
+
+## Format contract
+
+Training and evaluation must use the **same CHAT JSONL row shape**:
+
+```json
+{
+  "messages": [
+    {"role": "user", "content": "<user turn or multi-turn prompt>"},
+    {"role": "assistant", "content": "<label to predict>"}
+  ]
+}
+```
+
+Multi-turn rows use the same rule: the **final** `messages[-1]` turn is the assistant label; all preceding turns are context.
+
+| Do | Don't |
+|----|-------|
+| Pass rows with `messages` unchanged from the training fileset | Flatten to `prompt` / `expected` or `prompt` / `completion` for eval |
+| Send **`messages[:-1]`** at inference (exclude only the final assistant label) | Pass full `messages` including the label turn, or use `{"messages": "{{ item.messages }}"}` unfiltered |
+| Score against **`messages[-1].content`** (final assistant turn) | Score against a renamed `expected` field unless you also keep `messages` |
+
+Single-turn rows (one user prompt + one assistant label) are the degenerate case: `messages[:-1]` is just the user turn.
+
+Automodel and unsloth both train on this shape when `has_chat` is true (see `hf-conversion.md`, `dataset-formats.md`).
+
+## Evaluator templates (required)
+
+```python
+CHAT_USER_PROMPT_TEMPLATE = {
+    "messages": "{{ item.messages[:-1] }}",
+}
+CHAT_REFERENCE_TEMPLATE = "{{ item.messages[-1].content }}"
+```
+
+Import from `references/eval_helpers.py` — do not re-type these in one-off scripts.
+
+## Inference defaults (thinking models, e.g. Qwen3)
+
+| Setting | Recommended | Avoid |
+|---------|-------------|-------|
+| `enable_thinking` | `false` via `chat_template_kwargs` for short-answer SFT | Thinking on without enough tokens — model never closes thinking tag so the strip hook fails |
+| `max_tokens` | `64` (short assistant labels) | `16` with thinking on; `1024` with thinking on but no strip hook (verbose prose) |
+| System prompt | Omit unless user asks — matches training | Extra system prompt changes decode path vs SFT |
+
+For thinking-enabled eval, set `reasoning=ReasoningParams(end_token="</think>")` **and** ensure `max_tokens` is large enough for the model to emit the end token before generating the answer.
+
+## Inference after customization (wrap-up)
+
+Include this in the completed-job report. Agents must discover `<provider>` from `nemo inference providers list --workspace default -f json` and fill concrete URLs — do not leave placeholders.
+
+### LoRA adapters: no new deployment
+
+Applies when the job used **`finetuning_type: lora`** and **`output.save_method: lora`** (adapter output).
+
+After a customization job reaches **`completed`**, the platform registers the adapter on the base **model entity**. On a deployment with **`lora_enabled: true`**, enabled adapters are **hot-reloaded automatically** (adapter sidecar → vLLM). **Do not** create a new inference deployment, update the deployment, re-create providers, or add the adapter to a `served_models` list before post-training eval — run eval as soon as the job completes.
+
+| Prerequisite (one-time) | Per-adapter step after training |
+|-------------------------|----------------------------------|
+| A **READY** inference deployment for the **base** model entity with `lora_enabled: true` | Confirm adapter appears under `nemo models get <model-entity>` → `adapters` |
+| Gateway reachable at the provider URL below | Target the adapter by name in the eval request (see table) |
+
+### Full SFT / merged checkpoints: deploy the output model
+
+Applies when the job used **`finetuning_type: all_weights`** (full-weight SFT) or **`save_method: merged_16bit` / `merged_4bit`** (merged LoRA checkpoint). Output `type` is **`model`**, not `adapter`.
+
+The fine-tuned weights live on a **new model entity** at `output.name` (`default/<output.name>`). **You must deploy that entity for inference** — create a new inference deployment or add it to a provider's `served_models` before chat or eval. Full checkpoints are **not** hot-reloaded onto the base model's LoRA deployment.
+
+| Step | Action |
+|------|--------|
+| Confirm registration | `nemo models get <output.name> --workspace default` — entity exists with fine-tuned weights |
+| Deploy for inference | Create or update an inference deployment / provider that serves `default/<output.name>` |
+| Inference / eval route | **Model-entity** URL on `<output.name>` with `model: default/<output.name>` (not the base entity) |
+
+Post-training eval for full models: compare against the base entity on its deployment, or eval the fine-tuned entity directly. `eval_helpers.py` is LoRA-oriented (`--adapter`); for full SFT, run generation eval against the **output** model entity's gateway URL.
+
+### Request routing (base vs LoRA)
+
+The model-entity proxy path **always** resolves to the base VirtualModel. Setting `"model": "default--<adapter-name>"` on `/model/<base-entity>/-/v1` does **not** select the adapter — gateway logs will show only the base path.
+
+| Target | Gateway route | URL pattern | Request `model` field |
+|--------|---------------|-------------|------------------------|
+| Base entity | **Model entity** | `$NMP_BASE_URL/apis/inference-gateway/v2/workspaces/default/model/<model-entity>/-/v1` | `default/<model-entity>` |
+| LoRA adapter | **Provider** | `$NMP_BASE_URL/apis/inference-gateway/v2/workspaces/default/provider/<provider>/-/v1` | `default--<adapter-name>` |
+
+`eval_helpers.py` auto-discovers a READY provider that serves the base entity (or pass `--provider <name>`). LoRA adapter weights hot-reload on that deployment — no provider update per adapter. (Full SFT / merged outputs need a separate deployment — see above.)
+
+Optional sanity checks:
+
+- `nemo models get <model-entity> --workspace default` — adapter listed with `enabled: true`
+- `nemo inference providers list --workspace default` — provider status **READY**
+- LoRA eval/inference logs should show `path=…/provider/<provider>/-/v1/chat/completions`, **not** `…/model/<model-entity>/-/v1`
+- JSON output includes `warnings` when routing looks wrong or adapter scores match base within ~1.5 pp
+
+### Why earlier evals looked wrong
+
+If base and LoRA scores were identical (~99% same outputs), the adapter was almost certainly called through the **model-entity** path. That path always resolves to the base VirtualModel — the `"model": "default--<adapter>"` field in the body is ignored. Fix: route LoRA through the **provider** URL with the same `model` field. `eval_helpers.build_platform_model_target()` and the CLI implement this split automatically.
+
+### Classification / short-answer metric interpretation
+
+For multiple-choice or short-label SFT, treat **`normalized_accuracy`** as the primary metric when labels need normalization (`normalize_mcqa_answer` strips `A. foo`, markdown, etc.).
+
+| Observation | Likely meaning |
+|-------------|----------------|
+| Base & LoRA normalized scores match within ~1–2 pp | LoRA likely hit **model-entity** path (base only) — check `warnings` and gateway logs |
+| Base raw exact match low, normalized much higher | Normal when the base model emits formatted prose but normalized labels match |
+| LoRA normalized clearly above base | Correct provider routing and real adapter lift |
+| Train loss dropped sharply but eval flat | Wrong eval routing, mismatched inference settings, or need more epochs — val loss ≠ accuracy |
+
+### Epoch / adapter ablations
+
+Resolve adapter names from completed job specs instead of guessing:
+
+```python
+import os
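+from eval_helpers import (
+    build_eval_payload,
+    compare_adapters,
+    list_completed_job_adapters,
+    load_chat_jsonl_from_platform,
+)
+
+base_url = os.environ.get("NMP_BASE_URL") or "http://127.0.0.1:8080"
+
+# Validation rows in CHAT format, read from the platform fileset.
+rows = load_chat_jsonl_from_platform(
+    base_url=base_url,
+    workspace="default",
+    fileset="<dataset-fileset>",
+    remote_path="validation.jsonl",
+)
+
+jobs = list_completed_job_adapters(
+    base_url=base_url,
+    workspace="default",
+    model_entity="<model-entity>",
+    dataset_fileset="<dataset-fileset>",
+)
+# jobs[0].epochs, jobs[0].adapter_name, jobs[0].backend (sorted newest first)
+
+# Assumes at least two completed jobs matched the filters above.
+adapter_names = [jobs[0].adapter_name, jobs[1].adapter_name]
+summaries = compare_adapters(
+    base_url=base_url,
+    workspace="default",
+    model_entity="<model-entity>",
+    adapter_names=adapter_names,
+    rows=rows,
+)
+payload = build_eval_payload(
+    summaries=summaries,
+    base_url=base_url,
+    workspace="default",
+    model_entity="<model-entity>",
+    adapter_names=adapter_names,
+    provider_name=None,  # None lets the helper auto-discover a READY provider
+)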
+# payload["lift_vs_base"], payload.get("warnings")
+```
+
+When comparing adapters from **different backends** (automodel vs unsloth) or batch configs, note confounds — epoch count alone may not explain the gap.
+
+### Production chat requests (same rules as eval)
+
+| Piece | LoRA adapter | Base model |
+|-------|--------------|------------|
+| HTTP base URL | `…/provider/<provider>/-/v1` | `…/model/<model-entity>/-/v1` |
+| `"model"` | `default--<adapter-name>` | `default/<model-entity>` |
+| `messages` | `messages[:-1]` from the training row (exclude final assistant label) | Same |
+| Short-answer SFT (e.g. Qwen3) | `"chat_template_kwargs": {"enable_thinking": false}` | Same |
+| `max_tokens` / `temperature` | `64` / `0` typical for short labels | Same |
+
+CLI shortcuts (substitute names from the job):
+
+```bash
+# LoRA
+nemo inference gateway provider post v1/chat/completions <provider> --workspace default \
+  --body '{"model":"default--<adapter>","messages":[{"role":"user","content":"…"}],"max_tokens":64,"temperature":0,"chat_template_kwargs":{"enable_thinking":false}}'
+
+# Base
+nemo inference gateway model post v1/chat/completions <model-entity> --workspace default \
+  --body '{"model":"default/<model-entity>","messages":[{"role":"user","content":"…"}],"max_tokens":64,"temperature":0,"chat_template_kwargs":{"enable_thinking":false}}'
+```
+
+## Metrics
+
+| Task | Metrics |
+|------|---------|
+| MCQA / exact label | `ExactMatchMetric` + `normalize_mcqa_answer()` when models return `A. foo` or markdown |
+| Similarity | `ROUGEMetric`, `BLEUMetric` with `CHAT_REFERENCE_TEMPLATE` |
+
+Validation loss from training is **not** accuracy; always run a generation eval to measure user-facing quality.
+
+## Helper script
+
+From **nemo-platform** git root:
+
+```bash
+export NMP_BASE_URL=http://127.0.0.1:8080   # user platform URL when not localhost
+
+# Base vs one adapter (--base-url optional when NMP_BASE_URL is set)
+uv run python plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py \
+  --model-entity <model-entity> \
+  --adapter <adapter-name> \
+  --provider <provider> \
+  --dataset-fileset <dataset-fileset> \
+  --split validation.jsonl \
+  --output /tmp/fine-tune-eval.json
+
+# Base vs multiple adapters (epoch ablation)
+uv run python plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/eval_helpers.py \
+  --model-entity <model-entity> \
+  --adapter <adapter-a> \
+  --adapter <adapter-b> \
+  --dataset-fileset <dataset-fileset> \
+  --split validation.jsonl \
+  --output /tmp/fine-tune-eval-multi.json
+```
+
+Programmatic use:
+
+```python
+from eval_helpers import (
+    load_chat_jsonl_from_platform,
+    compare_adapters,
+    compare_base_vs_adapter,
+    build_eval_payload,
+    list_completed_job_adapters,
+    routing_sanity_warnings,
+    CHAT_USER_PROMPT_TEMPLATE,
+)
+```
+
+(Add `references/` to `sys.path` or run via `uv run python` from repo root.)
+
+## Report to user
+
+After the comparison, report the following for **base and each adapter**:
+
+- **Normalized accuracy** (primary for MCQA)
+- Raw exact match (strict string comparison; often 0% on the base model when answers are formatted)
+- Lift vs base (`lift_vs_base` in JSON output)
+- ROUGE / BLEU aggregates if requested
+- Any `warnings` from routing sanity checks
+- Inference settings (`enable_thinking`, `max_tokens`) and dataset fileset ref
+
+This workflow uses the **nemo-evaluator SDK** (`Evaluator`, metrics, `RunConfigOnlineModel`) under the hood, so no separate evaluator skill doc is required. For general BYOB/rubric eval outside customization, use the **nemo-evaluator** skill.
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/troubleshooting.md b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/troubleshooting.md
index b0f799f91f..88d8c47d11 100644
--- a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/troubleshooting.md
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/references/troubleshooting.md
@@ -20,7 +20,7 @@ Any `nemo …` call may fail with `Connection error`, timeout, or connection ref
 
 | Situation | Action |
 |-----------|--------|
-| User gave a platform host/URL (e.g. `10.0.0.51:8080`) or you set `NEMO_BASE_URL` / `NMP_BASE_URL` to something other than `http://127.0.0.1:8080` or `http://localhost:8080` | Report that the platform is not reachable at that address. Ask them to confirm the host is up and the URL is correct. **Do not** start local services. |
+| User gave a platform host/URL (e.g. `10.0.0.51:8080`) or you set `NMP_BASE_URL` to something other than `http://127.0.0.1:8080` or `http://localhost:8080` | Report that the platform is not reachable at that address. Ask them to confirm the host is up and the URL is correct. **Do not** start local services. |
 | Default URL only — no user override | **Ask** whether to start the platform locally. If they agree, from the **nemo-platform** git root run in the **background**, then poll until healthy and retry the failed command: |
 
 ```bash
@@ -45,7 +45,7 @@ If the user already has a listener on `:8080` but health fails, see **nemo-statu
 
 ## Backend choice (automodel vs unsloth)
 
-**Do not** run `docker info` on the agent machine. The platform often runs elsewhere (`NEMO_BASE_URL`). Ask the **connected platform** what executors it exposes.
+**Do not** run `docker info` on the agent machine. The platform often runs elsewhere (`NMP_BASE_URL`). Ask the **connected platform** what executors it exposes.
 
 List profiles (login first only if auth is enabled — see **Authentication** in `SKILL.md`):
 
@@ -64,7 +64,7 @@ Each entry has `provider`, `profile` (name), and `backend` (e.g. `docker`, `kube
 | Response includes **`provider`: `gpu` or `gpu_distributed`** | **`automodel`** (default) |
 | No GPU profiles (only `subprocess` and/or CPU `provider`) | Report that GPU customization is unavailable |
 
-Both backends are **`submit`-only**. After submit, the platform's **Docker executor** runs GPU container steps on the daemon attached to the connected platform host (`platform.runtime: docker`). Training does not run in the CLI shell — query execution profiles on the platform (`NEMO_BASE_URL`), not GPU availability in the agent's terminal.
+Both backends are **`submit`-only**. After submit, the platform's **Docker executor** runs GPU container steps on the daemon attached to the connected platform host (`platform.runtime: docker`). Training does not run in the CLI shell — query execution profiles on the platform (`NMP_BASE_URL`), not GPU availability in the agent's terminal.
 
 ### Pick execution profile
 
@@ -175,13 +175,13 @@ After secret + fileset are wired, re-submit the same job JSON (use a fresh `outp
 
 ## Missing training images
 
-Job errors like `Failed to pull image … nmp-unsloth-training:… Not Found`, `manifest unknown`, or a missing automodel training image mean the **connected platform's Docker daemon** (the one that runs GPU job steps) does not have the image. With the default `NEMO_BASE_URL` / `NMP_BASE_URL` (`127.0.0.1:8080` / `localhost:8080`), that daemon is usually on the same machine as the agent; with a user-overridden URL (e.g. `10.0.0.51:8080`), it is on the remote target host instead.
+Job errors like `Failed to pull image … nmp-unsloth-training:… Not Found`, `manifest unknown`, or a missing automodel training image mean the **connected platform's Docker daemon** (the one that runs GPU job steps) does not have the image. With the default `NMP_BASE_URL` (`127.0.0.1:8080` / `localhost:8080`), that daemon is usually on the same machine as the agent; with a user-overridden URL (e.g. `10.0.0.51:8080`), it is on the remote target host instead.
 
 **Did the user override the base URL?** (same rule as **Platform unreachable** — track this from the start of the workflow.)
 
 | Situation | Action |
 |-----------|--------|
-| **Remote platform** — user gave a host/URL (e.g. `10.0.0.51:8080`) or you set `NEMO_BASE_URL` / `NMP_BASE_URL` to something other than `http://127.0.0.1:8080` or `http://localhost:8080` | **Do not** run `docker build`, `docker pull`, or `docker buildx bake` on the agent machine — that only affects the agent's local daemon, not the remote platform. Tell the user they must build or load the image **on the target host** (the machine whose Docker daemon runs the GPU job steps). Report with **Report to user** in `SKILL.md`, then append **Report follow-up — missing image (remote platform)** below. Stop; do not retry submit until the user confirms the image is available on the target. |
+| **Remote platform** — user gave a host/URL (e.g. `10.0.0.51:8080`) or you set `NMP_BASE_URL` to something other than `http://127.0.0.1:8080` or `http://localhost:8080` | **Do not** run `docker build`, `docker pull`, or `docker buildx bake` on the agent machine — that only affects the agent's local daemon, not the remote platform. Tell the user they must build or load the image **on the target host** (the machine whose Docker daemon runs the GPU job steps). Report with **Report to user** in `SKILL.md`, then append **Report follow-up — missing image (remote platform)** below. Stop; do not retry submit until the user confirms the image is available on the target. |
 | **Local platform** — default URL only (`127.0.0.1:8080` / `localhost:8080`) | Build or pull on **that same host** where `nemo services run` and Docker share a daemon. See build commands below and `docker/unsloth/README.md` (unsloth) or automodel docker docs. Set env vars **before** starting/restarting the platform. |
 
 Image env vars are read when the platform starts (not per job):
@@ -227,7 +227,7 @@ When submit or poll returns a missing-image error and the base URL is **user-ove
 **Re-submit after the image is available:**
 
 ```bash
-export NEMO_BASE_URL=<user's platform URL>
+export NMP_BASE_URL=<user's platform URL>
 cd /path/to/nemo-platform
 nemo customization <plugin> submit /tmp/job.json --workspace default [--profile <gpu-profile>]
 ```
@@ -285,7 +285,7 @@ Set `jobs.executors.docker.launcher_tool_path` in `~/.nemo/config.yaml` to the *
 | `Unsloth does not support local run` | Used `run` instead of `submit` | `nemo customization unsloth submit <job.json> -w <workspace>` |
 | `Unsloth training requires platform.runtime: docker` | Platform not configured for Docker GPU jobs | Start platform with Docker runtime and a GPU execution profile |
 | Unknown execution profile | Default `gpu` profile missing or wrong | Re-list profiles; pass `--profile <exact-name>` on submit |
-| Missing `nmp-unsloth-training` image / `Failed to pull image` / `manifest unknown` | Image not on the **platform host's** Docker daemon | **Remote platform** (`NEMO_BASE_URL` not localhost): tell user to build on the target — **do not** `docker build` locally. **Local platform**: build on same host; see **Missing training images** above and `docker/unsloth/README.md` |
+| Missing `nmp-unsloth-training` image / `Failed to pull image` / `manifest unknown` | Image not on the **platform host's** Docker daemon | **Remote platform** (`NMP_BASE_URL` not localhost): tell user to build on the target — **do not** `docker build` locally. **Local platform**: build on same host; see **Missing training images** above and `docker/unsloth/README.md` |
 | `torch.cuda.is_available()` False in training step logs | GPU not exposed to the container step | Confirm the execution profile is GPU-backed; check platform Docker GPU setup |
 | Job stuck in `active` after training step completes | Upload / model-entity steps still running | Keep polling top-level status (same as automodel) |
 
diff --git a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/scripts/poll_customization_job.sh b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/scripts/poll_customization_job.sh
index 302ff02325..52d030ce81 100755
--- a/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/scripts/poll_customization_job.sh
+++ b/plugins/nemo-customizer/src/nemo_customizer/skills/nemo-customizer/scripts/poll_customization_job.sh
@@ -1,7 +1,7 @@
 #!/usr/bin/env bash
 # Poll customization job until top-level status is terminal.
 # Usage: poll_customization_job.sh <plugin>-<job-id> [interval_seconds]
-# Requires: NEMO_BASE_URL or NMP_BASE_URL; run from nemo-platform root.
+# Requires: NMP_BASE_URL; run from nemo-platform root.
 # Resolves `nemo` on PATH, else `uv run nemo` (see SKILL.md Pre-flight).
 # Exit 0 on completed; exit 1 on error, cancelled, or get-status failure.
 
diff --git a/plugins/nemo-customizer/tests/test_eval_helpers.py b/plugins/nemo-customizer/tests/test_eval_helpers.py
new file mode 100644
index 0000000000..5dd81b6a4f
--- /dev/null
+++ b/plugins/nemo-customizer/tests/test_eval_helpers.py
@@ -0,0 +1,250 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+from __future__ import annotations
+
+import importlib.util
+import json
+import sys
+from pathlib import Path
+
+import pytest
+
+SKILL_REFERENCES = (
+    Path(__file__).resolve().parents[1] / "src" / "nemo_customizer" / "skills" / "nemo-customizer" / "references"
+)
+
+
+def _load_eval_helpers():
+    module_name = "nemo_customizer_eval_helpers_test"
+    spec = importlib.util.spec_from_file_location(
+        module_name,
+        SKILL_REFERENCES / "eval_helpers.py",
+    )
+    assert spec is not None and spec.loader is not None
+    module = importlib.util.module_from_spec(spec)
+    # Dataclasses resolve cls.__module__ during decoration; register before exec.
+    sys.modules[module_name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+eval_helpers = _load_eval_helpers()
+
+
+def test_served_model_names() -> None:
+    assert eval_helpers.served_model_name(workspace="default", entity_or_adapter="qwen3-1.7b") == "default/qwen3-1.7b"
+    assert (
+        eval_helpers.served_model_name(workspace="default", entity_or_adapter="my-lora", finetuning="lora")
+        == "default--my-lora"
+    )
+
+
+def test_adapter_composite_entity_name() -> None:
+    assert (
+        eval_helpers.adapter_composite_entity_name(
+            model_entity="qwen3-1.7b",
+            workspace="default",
+            adapter_name="my-lora",
+        )
+        == "qwen3-1.7b&adapters/default/my-lora"
+    )
+
+
+def test_build_platform_model_target_routes_lora_via_provider() -> None:
+    target = eval_helpers.build_platform_model_target(
+        base_url="http://10.0.0.51:8080",
+        workspace="default",
+        model_entity="qwen3-1.7b",
+        adapter_name="my-lora",
+        provider_name="my-provider",
+    )
+    assert "/provider/my-provider/-/v1" in target.url
+    assert "/model/qwen3-1.7b/-/v1" not in target.url
+    assert target.name == "default--my-lora"
+
+
+def test_build_platform_model_target_routes_base_via_model_entity() -> None:
+    target = eval_helpers.build_platform_model_target(
+        base_url="http://10.0.0.51:8080",
+        workspace="default",
+        model_entity="qwen3-1.7b",
+        provider_name="my-provider",
+    )
+    assert "/model/qwen3-1.7b/-/v1" in target.url
+    assert "/provider/" not in target.url
+    assert target.name == "default/qwen3-1.7b"
+
+
+def test_build_platform_model_target_requires_ready_provider_for_base(
+    monkeypatch: pytest.MonkeyPatch,
+) -> None:
+    monkeypatch.setattr(eval_helpers, "find_ready_provider_for_model_entity", lambda **kwargs: None)
+    with pytest.raises(ValueError, match="No READY inference provider"):
+        eval_helpers.build_platform_model_target(
+            base_url="http://10.0.0.51:8080",
+            workspace="default",
+            model_entity="qwen3-1.7b",
+        )
+
+
+def test_gateway_path_from_url() -> None:
+    assert eval_helpers.gateway_path_from_url("http://x/provider/p/-/v1") == "provider"
+    assert eval_helpers.gateway_path_from_url("http://x/model/m/-/v1") == "model-entity"
+
+
+def test_normalize_mcqa_answer() -> None:
+    assert eval_helpers.normalize_mcqa_answer("bank") == "bank"
+    assert eval_helpers.normalize_mcqa_answer("A. bank") == "bank"
+    assert eval_helpers.normalize_mcqa_answer("The correct answer is: **A. bank**") == "bank"
+
+
+def test_assert_chat_row_rejects_flattened() -> None:
+    with pytest.raises(ValueError, match="messages"):
+        eval_helpers.assert_chat_row({"prompt": "hi", "expected": "bye"})
+
+
+def test_assert_chat_row_accepts_single_turn() -> None:
+    row = {
+        "messages": [
+            {"role": "user", "content": "Question?"},
+            {"role": "assistant", "content": "yes"},
+        ]
+    }
+    eval_helpers.assert_chat_row(row)
+
+
+def test_assert_chat_row_accepts_multi_turn() -> None:
+    row = {
+        "messages": [
+            {"role": "user", "content": "Turn 1"},
+            {"role": "assistant", "content": "Reply 1"},
+            {"role": "user", "content": "Turn 2"},
+            {"role": "assistant", "content": "final label"},
+        ]
+    }
+    eval_helpers.assert_chat_row(row)
+    assert eval_helpers.reference_content(row) == "final label"
+
+
+def test_assert_chat_row_rejects_non_dict_message_turns() -> None:
+    with pytest.raises(ValueError, match="messages\\[0\\] must be an object"):
+        eval_helpers.assert_chat_row({"messages": ["x", "y"]})
+
+
+def test_assert_chat_row_rejects_missing_final_assistant() -> None:
+    row = {
+        "messages": [
+            {"role": "user", "content": "Turn 1"},
+            {"role": "assistant", "content": "Reply 1"},
+            {"role": "user", "content": "Turn 2"},
+        ]
+    }
+    with pytest.raises(ValueError, match="assistant"):
+        eval_helpers.assert_chat_row(row)
+
+
+def test_load_chat_jsonl(tmp_path: Path) -> None:
+    path = tmp_path / "val.jsonl"
+    path.write_text(
+        json.dumps(
+            {
+                "messages": [
+                    {"role": "user", "content": "Q"},
+                    {"role": "assistant", "content": "A"},
+                ]
+            }
+        )
+        + "\n",
+        encoding="utf-8",
+    )
+    rows = eval_helpers.load_chat_jsonl(path)
+    assert len(rows) == 1
+    assert rows[0]["messages"][-1]["content"] == "A"
+
+
+def test_chat_templates_use_messages_slice() -> None:
+    assert "item.messages[:-1]" in eval_helpers.CHAT_USER_PROMPT_TEMPLATE["messages"]
+    assert "item.messages[-1]" in eval_helpers.CHAT_REFERENCE_TEMPLATE
+
+
+def test_lift_vs_base() -> None:
+    summaries = [
+        eval_helpers.EvalSummary(
+            target="base",
+            model_name="default/m",
+            gateway_url="http://x/model/m/-/v1",
+            gateway_path="model-entity",
+            num_samples=10,
+            raw_exact_match=0.0,
+            normalized_accuracy=0.5,
+            aggregate_metrics={},
+        ),
+        eval_helpers.EvalSummary(
+            target="lora-a",
+            model_name="default--a",
+            gateway_url="http://x/provider/p/-/v1",
+            gateway_path="provider",
+            num_samples=10,
+            raw_exact_match=0.7,
+            normalized_accuracy=0.75,
+            aggregate_metrics={},
+        ),
+    ]
+    assert eval_helpers.lift_vs_base(summaries) == {"lora-a": 0.25}
+
+
+def test_routing_sanity_warnings_detects_flat_scores() -> None:
+    summaries = [
+        eval_helpers.EvalSummary(
+            target="base",
+            model_name="default/m",
+            gateway_url="http://x/model/m/-/v1",
+            gateway_path="model-entity",
+            num_samples=10,
+            raw_exact_match=0.0,
+            normalized_accuracy=0.59,
+            aggregate_metrics={},
+        ),
+        eval_helpers.EvalSummary(
+            target="lora-a",
+            model_name="default--a",
+            gateway_url="http://x/model/m/-/v1",
+            gateway_path="model-entity",
+            num_samples=10,
+            raw_exact_match=0.0,
+            normalized_accuracy=0.59,
+            aggregate_metrics={},
+        ),
+    ]
+    warnings = eval_helpers.routing_sanity_warnings(summaries)
+    assert any("provider" in warning for warning in warnings)
+    assert any("within" in warning for warning in warnings)
+
+
+def test_adapter_from_completed_job_parses_spec(monkeypatch: pytest.MonkeyPatch) -> None:
+    payload = {
+        "name": "unsloth-abc",
+        "status": "completed",
+        "created_at": "2026-06-16T20:22:09",
+        "spec": {
+            "schedule": {"epochs": 3},
+            "output": {"name": "my-adapter"},
+            "model": {"name": "default/qwen3-1.7b"},
+            "dataset": {"path": "default/commonsense_qa"},
+        },
+    }
+
+    def fake_get(url: str) -> dict:
+        assert url.endswith("/jobs/unsloth-abc")
+        return payload
+
+    monkeypatch.setattr(eval_helpers, "_platform_get_json", fake_get)
+    info = eval_helpers.adapter_from_completed_job(
+        base_url="http://10.0.0.51:8080",
+        workspace="default",
+        job_name="unsloth-abc",
+    )
+    assert info.adapter_name == "my-adapter"
+    assert info.epochs == 3
+    assert info.backend == "unsloth"