Epic: CLI usability improvements -- first class CLI experience
Use cases, pain points, and background
Most modern CLI tools use a tool subcommand pattern (docker run, git status, kubectl apply). Gym should follow the same pattern: a single top-level command, gym, with grouped subcommands would be more intuitive and easier to discover.
The goal is to replace the ng_* / nemo_gym_* entry points with a single gym CLI: top-level commands for common operations, plus subcommand groups for related clusters (e.g. gym data * for all data operations, gym benchmark * for benchmark operations). The ng_* aliases stay around as deprecated shims during a transition period, each emitting a deprecation warning that points users to the new command.
Why now: The user base and the catalog are both about to grow significantly, and the current CLI surface scales poorly with either. Fixing the shell once, before adoption peaks, is far cheaper than retrofitting later, and it lets every subsequent CLI feature (discoverability, status, info, etc.) inherit a unified surface for free.
Problems
- Inconsistent conventions. The current CLI has 40+ entry points across two naming conventions (ng_* and nemo_gym_*), all using underscore-separated names such as ng_collect_rollouts, ng_init_resources_server, ng_e2e_collect_rollouts, and ng_upload_dataset_to_gitlab.
- Environment discoverability. A new user lands in Gym and asks: "What can I evaluate against, and what does each environment actually do?" Ref: #1433.
- Benchmark-centric flow; the agent-centric path needs better UX. The current evaluation flow starts from "pick a benchmark", which bundles the agent, environment, and data together in a single config. Agent developers think the other way: "I have an agent, I want to evaluate it against environments." We need to improve the UX of this path.
- Agent swapping requires config duplication. Running SWE-bench with OpenHands vs SWE Agent vs Mini SWE Agent requires three nearly identical YAML configs that differ by one line (the agent). Every agent-environment combination needs its own config file rather than being composable at the CLI level. Ref: #1396.
- Config paths require familiarity with project structure. Users need to specify exact paths like resources_servers/gpqa/configs/gpqa.yaml. This requires knowing the internal layout: model configs live in responses_api_models/ and benchmark configs in resources_servers/ or benchmarks/. Ref: #1205.
- CLI terminology is training-centric, not eval-centric. "Collect rollouts" and "reward profiling" are RL training terms; an agent developer who wants to evaluate their agent thinks in terms of "run evaluation" and "get metrics/scores." The current naming assumes the user's goal is to generate training data, not to assess agent performance. Similarly, responses_create_params, verifier_metadata, and resources_server are abstractions that don't map to how an agent developer thinks about their workflow ("tasks," "scores," "environments"). The help text explains the current concepts well, but users searching for evaluation workflows may not discover the right commands.
- Three commands to compose an agentic eval. ng_run → ng_collect_rollouts → ng_reward_profile exposes the microservice infrastructure (start servers, collect, score) rather than abstracting it into the user's goal: "evaluate my agent on this task set." Ideally the user could specify tasks, model, agent, and environment in a single command. Ref: #1188.
- Skills use and discoverability. NeMo Gym lacks a strong workflow for evaluating skill iteration. Users need to modify skills independently from the agent and environment, then compare how those changes affect accuracy, cost, task completion time, and generalization across task sets. Ref: #1235.
Product Requirements
- CLI Ergonomics. CLI commands should be structured according to overall workflow (Data -> Environments -> Evaluate <-> Improve -> Deploy). Gym is for evaluating and improving models and agents through iterative evaluation and experimentation. The CLI structure should reflect this workflow, not internal implementation details (server types, rollout collection mechanics).
- Discoverability. Users should be able to discover what environments (Dataset + Agent Harness + Resources Server) are available in Gym from the CLI, without needing to browse GitHub or read source code.
- Environments. The CLI should support both building and using environments: users need to create new environments (scaffold, develop, test) and use existing ones (discover, configure, run, evaluate, train). The CLI should make both paths clear.
- Components. CLI must account for clean composition: an environment is a dataset + agent harness + resources server + model server.
- Extensible via plugins. External NeMo tools (Data Designer, Safe Synthesizer, Anonymizer) and training frameworks (NeMo RL, VeRL, Unsloth) should be able to register commands into the CLI without modifying Gym core. This is a first-class design constraint, not a nice-to-have.
- Skills. Agent skills should be available via the CLI so that agents can navigate and use the product well without needing to refer to GitHub.
- Backward compatibility. Existing ng_* commands must continue to work during a deprecation period, emitting a migration notice. Users have these in scripts, CI pipelines, and muscle memory.
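The plugin requirement above could be met with Python packaging entry points, so that an external tool registers subcommands simply by being installed. The following is a minimal sketch under stated assumptions: the group name gym.commands, the register_plugins helper, and the "plugin exposes a register(subparsers) callable" contract are all hypothetical and not defined by this epic. It also assumes Python 3.10+, where importlib.metadata.entry_points accepts a group keyword.

```python
import argparse
from importlib.metadata import entry_points

# Hypothetical entry-point group name; the epic does not specify one.
PLUGIN_GROUP = "gym.commands"


def register_plugins(subparsers, plugins=None):
    """Let each installed plugin add its own subcommands.

    Assumed contract: every entry point in PLUGIN_GROUP resolves to a
    callable that takes the argparse subparsers object and adds parsers to it.
    """
    if plugins is None:
        # Discover plugins installed in the current environment (Python 3.10+).
        plugins = entry_points(group=PLUGIN_GROUP)
    registered = []
    for ep in plugins:
        register = ep.load()
        register(subparsers)
        registered.append(ep.name)
    return registered
```

Under this assumption, a plugin package would declare an entry point in the gym.commands group in its own packaging metadata, and Gym core would call register_plugins once while building the parser, without importing any plugin by name.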
UX Standards
- Single entry point with --help everywhere. One gym command exposes all capabilities. Every group and subcommand supports -h/--help. No hidden commands.
- Standard flag syntax. --flag value for common inputs. Hydra +key=value is an escape hatch for advanced config overrides, not the default interface. No shell-quoting gymnastics for arrays. Hydra is powerful for config composition, but its override syntax is designed for config management, not for user-facing CLI ergonomics. Most CLI tools in the ecosystem (kubectl, docker, git, pip, etc.) follow the POSIX/GNU convention: long flags (--input data.jsonl) and short flags (-i data.jsonl) with --help discoverability. Users coming from these tools find the +key=value syntax unfamiliar and need to consult documentation, which adds developer friction.
- Machine-readable output. Every list/inspect/status command supports --json for piping into jq, CI assertions, or dashboards.
- Helpful failure paths. Typos get "did you mean?" suggestions. Validation errors explain what's wrong and what to do. No bare asserts or raw stack traces.
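The "did you mean?" behavior in the last standard can be sketched with the standard library's difflib.get_close_matches. This is an illustration only: resolve_name, UnknownNameError, and the sample names used in the usage note are hypothetical, and the cutoff of 0.6 is simply difflib's default, not a tuned value.

```python
import difflib


class UnknownNameError(ValueError):
    """Raised when a requested name is not in the catalog (hypothetical type)."""


def resolve_name(name, known, kind="environment"):
    """Return name if it is known; otherwise raise an error that says what to do."""
    if name in known:
        return name
    # Up to three suggestions whose similarity ratio is at least 0.6.
    matches = difflib.get_close_matches(name, known, n=3, cutoff=0.6)
    message = f"Unknown {kind} '{name}'."
    if matches:
        message += " Did you mean: " + ", ".join(matches) + "?"
    else:
        message += " Known names: " + ", ".join(sorted(known)) + "."
    raise UnknownNameError(message)
```

For example, resolve_name("gpqq", ["gpqa", "swe_bench"]) raises an error whose message suggests gpqa. A CLI layer would catch UnknownNameError and print the message instead of a stack trace.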
Description
A user should be able to:
- Discover every Gym capability from gym --help, and drill into any subcommand with gym <cmd> --help.
- Use familiar --flag value syntax for common inputs, without learning Hydra override grammar to do basic things.
- Get JSON output from any list/inspect/status command and pipe it.
- Recover from typos with helpful suggestions.
- Turn on debug logging with -v from any subcommand.
Hydra remains the config-composition engine underneath — it just stops being the user-facing flag syntax.
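One way to keep Hydra underneath while hiding its syntax is to translate flags into overrides just before handing off. The sketch below assumes a repeatable --config flag that maps to the config_paths key accepted by ng_run today; parse_run_args and to_hydra_overrides are hypothetical helper names, and +seed=1 in the usage note is a made-up override.

```python
import argparse


def parse_run_args(argv):
    """Parse user-facing flags; anything shaped like key=value passes through."""
    parser = argparse.ArgumentParser(prog="gym run")
    parser.add_argument("--config", action="append", default=[], metavar="PATH",
                        help="Config file; repeat the flag to pass several.")
    args, extras = parser.parse_known_args(argv)
    # Unknown dash-prefixed flags or tokens without '=' are typos, not overrides.
    bad = [e for e in extras if e.startswith("-") or "=" not in e]
    if bad:
        parser.error("unrecognized arguments: " + " ".join(bad))
    return args.config, extras


def to_hydra_overrides(config_paths, passthrough=()):
    """Build the override list that the existing Hydra-based code would receive."""
    overrides = []
    if config_paths:
        overrides.append("+config_paths=[" + ",".join(config_paths) + "]")
    overrides.extend(passthrough)
    return overrides
```

With these helpers, the arguments --config a.yaml --config b.yaml +seed=1 become the overrides +config_paths=[a.yaml,b.yaml] and +seed=1, so users never have to quote an array for the shell themselves.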
Design
- Single gym entry point with subparsers (argparse or simple-parsing). Group existing scripts under nouns/verbs: gym run, gym test, gym ls …, gym dataset {upload,download,migrate}, gym prepare {benchmark,data,prompts}, gym init, gym status, etc.
- Keep every existing ng_* / nemo_gym_* console script as a thin shim that calls into the new router, with a one-line deprecation warning. Removal happens at least one release later.
- Translate the most common +key=value overrides into first-class --flags on each subcommand (e.g. gym run --config a.yaml --config b.yaml instead of ng_run "+config_paths=[a.yaml,b.yaml]"). Keep +key=value as an escape hatch for power users.
- -h / --help works at every level. -v / --verbose sets LOG_LEVEL=DEBUG.
- --json (or --output {table,json,wide}) on every command whose primary output is structured.
- Error helpers: shared "did-you-mean" suggester (rapidfuzz or stdlib difflib.get_close_matches) wired into env/benchmark/task lookups.
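The router and shim design above can be sketched with argparse subparsers. This is a rough illustration rather than the planned implementation: the handler bodies are placeholders, and cmd_run, cmd_dataset_upload, build_parser, and make_shim are hypothetical names. Only gym run and gym dataset upload are wired up, and -v is shared through a parent parser so it is accepted before or after the subcommand.

```python
import argparse
import os
import sys


def cmd_run(args):
    # Placeholder: a real handler would pass these to the existing run logic.
    print("run:", args.config, args.overrides)
    return 0


def cmd_dataset_upload(args):
    print("dataset upload:", args.path)  # placeholder handler
    return 0


def build_parser():
    # SUPPRESS keeps a subparser's default from clobbering a top-level "-v".
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-v", "--verbose", action="store_true",
                        default=argparse.SUPPRESS, help="Enable debug logging.")

    parser = argparse.ArgumentParser(prog="gym", parents=[common])
    groups = parser.add_subparsers(dest="command", required=True)

    run = groups.add_parser("run", parents=[common], help="Run from config files.")
    run.add_argument("--config", action="append", default=[], metavar="PATH")
    run.add_argument("overrides", nargs="*", metavar="+key=value",
                     help="Raw Hydra overrides (escape hatch).")
    run.set_defaults(func=cmd_run)

    dataset = groups.add_parser("dataset", parents=[common], help="Dataset operations.")
    dataset_cmds = dataset.add_subparsers(dest="dataset_command", required=True)
    upload = dataset_cmds.add_parser("upload", parents=[common], help="Upload a dataset.")
    upload.add_argument("path")
    upload.set_defaults(func=cmd_dataset_upload)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if getattr(args, "verbose", False):
        os.environ["LOG_LEVEL"] = "DEBUG"
    return args.func(args)


def make_shim(old_name, new_argv):
    """Build a console-script entry point that warns, then calls the router."""
    def entry(argv=None):
        rest = sys.argv[1:] if argv is None else list(argv)
        new_cmd = "gym " + " ".join(new_argv)
        print(f"warning: {old_name} is deprecated; use '{new_cmd}' instead.",
              file=sys.stderr)
        return main([*new_argv, *rest])
    return entry


# The old name keeps working and forwards its arguments unchanged.
ng_run = make_shim("ng_run", ["run"])
```

In this sketch an existing invocation such as ng_run "+config_paths=[a.yaml,b.yaml]" still runs, because the run subcommand accepts raw overrides as positional arguments, while new users can write gym run --config a.yaml --config b.yaml.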
Acceptance Criteria
- gym is installed as a console script and prints a grouped command list on gym / gym --help.
- Every group and subcommand supports -h/--help with usage, flags, and at least one example.
- Subcommands accept --flag value syntax for their primary inputs; +key=value still works as an escape hatch.
- --json produces stable, documented output on every command whose primary purpose is to list, inspect, or report status.
- -v/--verbose works at the top level and is honored by every subcommand.
- ng_* and nemo_gym_* scripts continue to work and emit a one-line deprecation notice pointing to the new equivalent.
- gym --help and a one-page CLI cheat-sheet land in docs/.