Epic: Gym CLI usability -- first class CLI experience

# Epic: CLI usability improvements -- first class CLI experience

## Use cases, pain points, and background
Most modern CLI tools use a `tool subcommand` pattern (`docker run`, `git status`, `kubectl apply`). A single top-level `gym` command with grouped subcommands would be more intuitive and easier to discover.

The goal is to replace the `ng_*` / `nemo_gym_*` entry points with a single `gym` CLI: top-level commands for common operations, plus subcommand groups for related clusters (e.g. `gym data *` for all data operations, `gym benchmark *` for benchmark operations). The `ng_*` aliases stay around as deprecated shims during a transition period, each emitting a deprecation warning that points users to the new command.

**Why now:** The user base and the catalog are both about to grow significantly, and the current CLI surface scales poorly with either. Fixing the shell once, before adoption peaks, is far cheaper than retrofitting later, and it lets every subsequent CLI feature (discoverability, status, info, etc.) inherit a unified surface for free.


### Problems

1. **Inconsistent conventions.** The current CLI has 40+ entry points across two naming conventions (`ng_*` and `nemo_gym_*`), all using underscore-separated names, e.g. `ng_collect_rollouts`, `ng_init_resources_server`, `ng_e2e_collect_rollouts`, `ng_upload_dataset_to_gitlab`.
2. **Environment discoverability**. A new user lands in Gym and asks: "What can I evaluate against, and what does each environment actually do?"  Ref: [#1433](https://github.com/NVIDIA-NeMo/Gym/issues/1433).
3. **Benchmark-centric flow; the agent-centric path needs better UX**. The current evaluation flow starts from "pick a benchmark", which bundles the agent, environment, and data together in a single config. Agent developers think the other way: "I have an agent, I want to evaluate it against environments." We need to improve the UX of this path.
4. **Agent swapping requires config duplication**. Running SWE-bench with OpenHands vs SWE Agent vs Mini SWE Agent requires three nearly identical YAML configs that differ by one line (the agent). Every agent-environment combination needs its own config file rather than being composable at the CLI level. Ref: [#1396](https://github.com/NVIDIA-NeMo/Gym/issues/1396).
5. **Config paths require familiarity with project structure**. Users need to specify exact paths like `resources_servers/gpqa/configs/gpqa.yaml`, which means knowing that model configs live in `responses_api_models/` and benchmark configs in `resources_servers/` or `benchmarks/`. Ref: [#1205](https://github.com/NVIDIA-NeMo/Gym/issues/1205).
6. **CLI terminology is training-centric, not eval-centric**. "Collect rollouts" and "reward profiling" are RL training terms — an agent developer who wants to evaluate their agent thinks in terms of "run evaluation" and "get metrics/scores." The current naming assumes the user's goal is to generate training data, not to assess agent performance. Similarly, `responses_create_params`, `verifier_metadata`, and `resources_server` are abstractions that don't map to how an agent developer thinks about their workflow ("tasks," "scores," "environments"). The help text explains the current concepts well, but users searching for evaluation workflows may not discover the right commands.
7. **Three commands to compose an agentic eval**. `ng_run` → `ng_collect_rollouts` → `ng_reward_profile` exposes the microservice infrastructure (start servers, collect, score) rather than abstracting it into the user's goal: "evaluate my agent on this task set." Ideally the user could specify tasks, model, agent, and environment in a single command. Ref: [#1188](https://github.com/NVIDIA-NeMo/Gym/issues/1188).
8. **Skills use and discoverability**. NeMo Gym lacks a strong workflow for evaluating skill iteration. Users need to modify skills independently from the agent and environment, then compare how those changes affect accuracy, cost, task completion time, and generalization across task sets. Ref: [#1235](https://github.com/NVIDIA-NeMo/Gym/issues/1235).

### Product Requirements

1. **CLI Ergonomics**. CLI commands should be structured according to overall workflow (Data -> Environments -> Evaluate <-> Improve -> Deploy). Gym is for evaluating and improving models and agents through iterative evaluation and experimentation. The CLI structure should reflect this workflow, not internal implementation details (server types, rollout collection mechanics).
2. **Discoverability**. Users should be able to discover what environments (Dataset + Agent Harness + Resources Server) are available in Gym from the CLI, without needing to browse GitHub or read source code.
3. **Environments**. The CLI should support both building and using environments. Users need to create new environments (scaffold, develop, test) and use existing ones (discover, configure, run, evaluate, train). The CLI should make both paths clear.
4. **Components**. CLI must account for clean composition: an environment is a dataset + agent harness + resources server + model server.
5. **Extensible via plugins**. External NeMo tools (Data Designer, Safe Synthesizer, Anonymizer) and training frameworks (NeMo RL, VeRL, Unsloth) should be able to register commands into the CLI without modifying Gym core. This is a first-class design constraint, not a nice-to-have.
6. **Skills**. Agent skills should be available via the CLI so that agents can navigate and use the product well without needing to refer to GitHub.
7. **Backward compatibility**. Existing `ng_*` commands must continue to work during a deprecation period, emitting a migration notice. Users have these in scripts, CI pipelines, and muscle memory.

### UX Standards

1. **Single entry point with --help everywhere**. One gym command exposes all capabilities. Every group and subcommand supports -h/--help. No hidden commands. 
2. **Standard flag syntax**. `--flag value` for common inputs. Hydra `+key=value` is an escape hatch for advanced config overrides, not the default interface. No shell-quoting gymnastics for arrays. Hydra is powerful for config composition, but its override syntax is designed for config management, not for user-facing CLI ergonomics. Most CLI tools in the ecosystem (kubectl, docker, git, pip, etc.) follow the POSIX/GNU convention: long flags (`--input data.jsonl`) and short flags (`-i data.jsonl`) with `--help` discoverability. Users coming from these tools find the `+key=value` syntax unfamiliar and have to consult documentation, which adds developer friction.
3. **Machine-readable output**. Every list/inspect/status command supports --json for piping into jq, CI assertions, or dashboards.
4. **Helpful failure paths**. Typos get "did you mean?" suggestions. Validation errors explain what's wrong and what to do. No bare asserts or raw stack traces.
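
The last two standards can be illustrated with a minimal sketch using only the standard library. The helper names (`suggest`, `lookup_error`, `render`), the row fields, and the example environment names are illustrative assumptions; the real lookup tables and output schema are not defined in this epic.

```python
import difflib
import json


def suggest(name, known, cutoff=0.6):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def lookup_error(kind, name, known):
    """Build an actionable message instead of surfacing a raw KeyError."""
    message = f"Unknown {kind} '{name}'."
    hint = suggest(name, known)
    if hint is not None:
        message += f" Did you mean '{hint}'?"
    return message


def render(rows, as_json=False):
    """Render rows as JSON for machines or as tab-separated text for humans."""
    if as_json:
        # sort_keys keeps the output stable across runs for CI assertions.
        return json.dumps(rows, indent=2, sort_keys=True)
    return "\n".join(f"{row['name']}\t{row['kind']}" for row in rows)


# Illustrative names only; the real catalog comes from Gym itself.
known_envs = ["gpqa", "swe_bench"]
print(lookup_error("environment", "gpqq", known_envs))
print(render([{"name": "gpqa", "kind": "environment"}], as_json=True))
```

Keeping the suggester in one shared helper means every lookup (environment, benchmark, task) produces the same style of message.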

## Description

A user should be able to:

- Discover every Gym capability from `gym --help`, drill into any subcommand with `gym <cmd> --help`.
- Use familiar `--flag value` syntax for common inputs, without learning Hydra override grammar to do basic things.
- Get JSON output from any list/inspect/status command and pipe it.
- Recover from typos with helpful suggestions.
- Turn on debug logging with `-v` from any subcommand.

Hydra remains the config-composition engine underneath — it just stops being the user-facing flag syntax.
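
A minimal sketch of what that translation layer might look like, assuming argparse on the front and the existing `+config_paths=[...]` override form underneath. The function names `to_hydra_overrides` and `parse_run_args` are hypothetical, and `+foo.bar=1` in the usage lines is a made-up override key used only to show the escape hatch.

```python
import argparse


def to_hydra_overrides(config_paths, passthrough=()):
    """Turn friendly flag values into the Hydra override strings used today."""
    overrides = []
    if config_paths:
        # Repeated --config flags collapse into a single list-valued override.
        overrides.append("+config_paths=[" + ",".join(config_paths) + "]")
    # Raw +key=value arguments are forwarded untouched (escape hatch).
    overrides.extend(passthrough)
    return overrides


def parse_run_args(argv):
    parser = argparse.ArgumentParser(prog="gym run")
    parser.add_argument("--config", action="append", default=[], metavar="PATH",
                        help="config file; may be given more than once")
    parser.add_argument("overrides", nargs="*",
                        help="raw Hydra overrides for advanced use")
    args = parser.parse_args(argv)
    return to_hydra_overrides(args.config, args.overrides)


print(parse_run_args(["--config", "a.yaml", "--config", "b.yaml"]))
print(parse_run_args(["--config", "a.yaml", "+foo.bar=1"]))
```

Because the output is a plain list of override strings, the existing Hydra-based entry points could consume it without changes, which keeps the new flags a thin layer rather than a second config system.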

### Design
- Single `gym` entry point with subparsers (`argparse` or `simple-parsing`). Group existing scripts under nouns/verbs: `gym run`, `gym test`, `gym ls …`, `gym dataset {upload,download,migrate}`, `gym prepare {benchmark,data,prompts}`, `gym init`, `gym status`, etc.
- Keep every existing `ng_*` / `nemo_gym_*` console script as a thin shim that calls into the new router, with a one-line deprecation warning. Removal happens at least one release later.
- Translate the most common `+key=value` overrides into first-class `--flags` on each subcommand (e.g. `gym run --config a.yaml --config b.yaml` instead of `ng_run "+config_paths=[a.yaml,b.yaml]"`). Keep `+key=value` as an escape hatch for power users.
- `-h` / `--help` works at every level. `-v` / `--verbose` sets `LOG_LEVEL=DEBUG`.
- `--json` (or `--output {table,json,wide}`) on every command whose primary output is structured.
- Error helpers: shared "did-you-mean" suggester (rapidfuzz or stdlib `difflib.get_close_matches`) wired into env/benchmark/task lookups.
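
The router, verbose flag, and shim bullets above could be sketched roughly as follows, assuming the argparse option. Only `gym run` and `gym dataset {upload,download,migrate}` are wired up, dispatch is left as a placeholder, and the `deprecated_shim` helper is hypothetical; the real command-to-handler mapping is not specified here.

```python
import argparse
import os
import sys


def build_parser():
    # Shared flags live on a parent parser so every level accepts them.
    # default=SUPPRESS stops a subparser's default from overwriting a -v
    # that was given earlier on the command line.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-v", "--verbose", action="store_true",
                        default=argparse.SUPPRESS, help="set LOG_LEVEL=DEBUG")

    parser = argparse.ArgumentParser(prog="gym", parents=[common])
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", parents=[common], help="run from config files")
    run.add_argument("--config", action="append", default=[], metavar="PATH")
    run.add_argument("overrides", nargs="*", help="raw Hydra overrides")

    dataset = sub.add_parser("dataset", parents=[common], help="dataset operations")
    dataset_sub = dataset.add_subparsers(dest="action", required=True)
    for action in ("upload", "download", "migrate"):
        dataset_sub.add_parser(action, parents=[common])
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if getattr(args, "verbose", False):
        os.environ["LOG_LEVEL"] = "DEBUG"
    # Placeholder: a real router would dispatch to the command handler here.
    return args


def deprecated_shim(old_name, new_argv):
    """Build a console-script entry point that warns, then calls the router."""
    def entry(argv=None):
        rest = sys.argv[1:] if argv is None else list(argv)
        print(f"warning: `{old_name}` is deprecated; "
              f"use `gym {' '.join(new_argv)}` instead.", file=sys.stderr)
        return main(list(new_argv) + rest)
    return entry


# The old console script keeps working but forwards to `gym run`.
ng_run = deprecated_shim("ng_run", ["run"])
```

With this shape, `gym -v run --config a.yaml` and `gym run -v --config a.yaml` behave the same, and an old invocation such as `ng_run "+config_paths=[a.yaml]"` still reaches the router with its override intact.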

## Acceptance Criteria

- [ ] `gym` is installed as a console script and prints a grouped command list on `gym` / `gym --help`.
- [ ] Every subcommand responds to `-h` / `--help` with usage, flags, and at least one example.
- [ ] The most common workflows accept standard `--flag value` syntax for their primary inputs; `+key=value` still works as an escape hatch.
- [ ] `--json` produces stable, documented output on every command whose primary purpose is to list, inspect, or report status.
- [ ] `-v` / `--verbose` works at the top level and is honored by every subcommand.
- [ ] Lookup failures (unknown env, unknown task, unknown invocation, unknown config key) print a "did you mean?" suggestion when a close match exists.
- [ ] All existing `ng_*` and `nemo_gym_*` scripts continue to work and emit a one-line deprecation notice pointing to the new equivalent.
- [ ] `gym --help` and a one-page CLI cheat-sheet land in `docs/`.
