Epic: Gym CLI usability -- first class CLI experience #1434

@sephmard

Epic: CLI usability improvements -- first class CLI experience

Use cases, pain points, and background

Most modern CLI tools use a tool-subcommand pattern (docker run, git status, kubectl apply). Gym should follow suit: a single top-level command, gym, with grouped subcommands would be more intuitive and easier to discover.

The goal is to replace the ng_* / nemo_gym_* entry points with a single gym CLI: top-level commands for common operations, plus subcommand groups for related clusters (e.g. gym data * for all data operations, gym benchmark * for benchmark operations). The ng_* aliases stay around as deprecated shims during a transition period, each emitting a deprecation warning that points users to the new command.
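
A deprecated shim of this kind could look roughly like the sketch below. It is only an illustration: gym_main is a stand-in for the new router, make_shim is a hypothetical helper, and the ng_run to gym run mapping is taken from the example later in this issue rather than from an agreed migration table.

```python
import sys


def gym_main(argv):
    """Stand-in for the new `gym` router (hypothetical); echoes the command."""
    return "gym " + " ".join(argv)


def make_shim(old_name, new_argv_prefix):
    """Build an entry point for a legacy script that forwards to `gym`."""

    def shim(argv=None):
        argv = list(sys.argv[1:] if argv is None else argv)
        new_cmd = "gym " + " ".join(new_argv_prefix)
        # One-line notice on stderr so stdout stays clean for scripts and CI.
        print(
            f"warning: {old_name} is deprecated; use `{new_cmd}` instead.",
            file=sys.stderr,
        )
        return gym_main(list(new_argv_prefix) + argv)

    return shim


# Assumed mapping for illustration only.
ng_run = make_shim("ng_run", ["run"])

print(ng_run(["--config", "a.yaml"]))
```

The notice is printed directly rather than raised as a DeprecationWarning because Python hides that warning category by default in most contexts, which would defeat the purpose for end users.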

Why now: The user base and the catalog are both about to grow significantly. The current CLI surface scales poorly with either. Fixing the shell once, before adoption peaks, is far cheaper than retrofitting later — and unblocks every subsequent CLI feature (discoverability, status, info, etc.) from inheriting a unified surface for free.

Problems

  1. Inconsistent conventions. The current CLI has 40+ entry points across two naming conventions (ng_* and nemo_gym_*), all using underscore-separated names, e.g., ng_collect_rollouts, ng_init_resources_server, ng_e2e_collect_rollouts, ng_upload_dataset_to_gitlab, etc.
  2. Environment discoverability. A new user lands in Gym and asks: "What can I evaluate against, and what does each environment actually do?" Ref: #1433.
  3. The evaluation flow is benchmark-centric, not agent-centric. The current flow starts from "pick a benchmark," which bundles the agent, environment, and data together in a single config. Agent developers think the other way: "I have an agent, I want to evaluate it against environments." We need to improve the UX of this path.
  4. Agent swapping requires config duplication. Running SWE-bench with OpenHands vs SWE Agent vs Mini SWE Agent requires three nearly identical YAML configs that differ by one line (the agent). Every agent-environment combination needs its own config file rather than being composable at the CLI level. Ref: #1396.
  5. Config paths require familiarity with project structure. Users need to specify exact paths like resources_servers/gpqa/configs/gpqa.yaml. This requires familiarity with the internal project structure — you need to know that model configs live in responses_api_models/ and benchmark configs in resources_servers/ or benchmarks/. Ref: #1205.
  6. CLI terminology is training-centric, not eval-centric. "Collect rollouts" and "reward profiling" are RL training terms — an agent developer who wants to evaluate their agent thinks in terms of "run evaluation" and "get metrics/scores." The current naming assumes the user's goal is to generate training data, not to assess agent performance. Similarly, responses_create_params, verifier_metadata, and resources_server are abstractions that don't map to how an agent developer thinks about their workflow ("tasks," "scores," "environments"). The help text explains the current concepts well, but users searching for evaluation workflows may not discover the right commands.
  7. Three commands to compose an agentic eval. The sequence ng_run -> ng_collect_rollouts -> ng_reward_profile exposes the microservice infrastructure (start servers, collect, score) rather than abstracting it into the user's goal: "evaluate my agent on this task set." Ideally the user could specify tasks, model, agent, and environment in a single command. Ref: #1188.
  8. Skills use and discoverability. NeMo Gym lacks a strong workflow for evaluating skill iteration. Users need to modify skills independently from the agent and environment, then compare how those changes affect accuracy, cost, task completion time, and generalization across task sets. Ref: #1235.

Product Requirements

  1. CLI Ergonomics. CLI commands should be structured according to overall workflow (Data -> Environments -> Evaluate <-> Improve -> Deploy). Gym is for evaluating and improving models and agents through iterative evaluation and experimentation. The CLI structure should reflect this workflow, not internal implementation details (server types, rollout collection mechanics).
  2. Discoverability. Users should be able to discover what environments (Dataset + Agent Harness + Resources Server) are available in Gym from the CLI, without needing to browse GitHub or read source code.
  3. Environments. The CLI should support both building and using environments: users need to create new environments (scaffold, develop, test) and use existing ones (discover, configure, run, evaluate, train). The CLI should make both paths clear.
  4. Components. CLI must account for clean composition: an environment is a dataset + agent harness + resources server + model server.
  5. Extensible via plugins. External NeMo tools (Data Designer, Safe Synthesizer, Anonymizer) and training frameworks (NeMo RL, VeRL, Unsloth) should be able to register commands into the CLI without modifying Gym core. This is a first-class design constraint, not a nice-to-have.
  6. Skills. Agent skills should be available via the CLI so that agents can navigate and use the product well without needing to refer to GitHub.
  7. Backward compatibility. Existing ng_* commands must continue to work during a deprecation period, emitting a migration notice. Users have these in scripts, CI pipelines, and muscle memory.
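
One way the plugin requirement could be met is with Python packaging entry points: an external tool declares a registration function in its own package metadata, and gym calls it at startup. The sketch below assumes Python 3.10+ for the `entry_points(group=...)` selector. The group name "gym.cli_plugins", the load_plugins helper, and the data-designer subcommand are all hypothetical.

```python
import argparse
from importlib.metadata import entry_points

PLUGIN_GROUP = "gym.cli_plugins"  # hypothetical entry-point group name


def load_plugins(subparsers, plugins=None):
    """Let each plugin add its own subcommands to the `gym` parser.

    `plugins` is an iterable of (name, register) pairs. When omitted, plugins
    are discovered from installed packages through entry points.
    """
    if plugins is None:
        plugins = [(ep.name, ep.load()) for ep in entry_points(group=PLUGIN_GROUP)]
    loaded = []
    for name, register in plugins:
        register(subparsers)  # the plugin owns its flags and handler
        loaded.append(name)
    return loaded


# What an external tool might ship (illustrative only).
def register_data_designer(subparsers):
    p = subparsers.add_parser("data-designer", help="(hypothetical) plugin command")
    p.add_argument("--input")
    p.set_defaults(handler=lambda args: f"designing from {args.input}")


parser = argparse.ArgumentParser(prog="gym")
subparsers = parser.add_subparsers(dest="command")
loaded = load_plugins(subparsers, [("data-designer", register_data_designer)])
args = parser.parse_args(["data-designer", "--input", "seed.jsonl"])
print(loaded, args.handler(args))
```

With this shape Gym core never imports a plugin by name, so adding or removing an external tool requires no change to Gym itself.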

UX Standards

  1. Single entry point with --help everywhere. One gym command exposes all capabilities. Every group and subcommand supports -h/--help. No hidden commands.
  2. Standard flag syntax. --flag value for common inputs. Hydra +key=value is an escape hatch for advanced config overrides, not the default interface. No shell-quoting gymnastics for arrays. Hydra is powerful for config composition, but its override syntax is designed for config management, not for user-facing CLI ergonomics. Most CLI tools in the ecosystem (kubectl, docker, git, pip, etc.) follow the POSIX/GNU convention: long flags (--input data.jsonl) and short flags (-i data.jsonl) with --help discoverability. Users coming from these tools find the +key=value syntax unfamiliar and need to consult documentation, adding to developer friction.
  3. Machine-readable output. Every list/inspect/status command supports --json for piping into jq, CI assertions, or dashboards.
  4. Helpful failure paths. Typos get "did you mean?" suggestions. Validation errors explain what's wrong and what to do. No bare asserts or raw stack traces.
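
As a rough illustration of the "did you mean?" behavior, the sketch below uses the standard library's difflib.get_close_matches. The lookup helper, the UnknownNameError class, and the list of environment names are assumptions made for the example, not existing Gym APIs.

```python
import difflib


class UnknownNameError(Exception):
    """Raised with a user-facing message instead of a raw stack trace."""


def lookup(name, known, kind="environment"):
    """Return `name` if it is known, else raise with a suggestion if one is close."""
    if name in known:
        return name
    message = f"Unknown {kind} '{name}'."
    matches = difflib.get_close_matches(name, known, n=1)
    if matches:
        message += f" Did you mean '{matches[0]}'?"
    else:
        message += " Run `gym --help` to see available commands."
    raise UnknownNameError(message)


KNOWN_ENVS = ["gpqa", "swe_bench"]  # illustrative names only

try:
    lookup("gqpa", KNOWN_ENVS)
except UnknownNameError as err:
    typo_message = str(err)
    print(typo_message)
```

A top-level handler would catch UnknownNameError, print the message to stderr, and exit non-zero, so users never see a traceback for a simple typo.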

Description

A user should be able to:

  • Discover every Gym capability from gym --help, drill into any subcommand with gym <cmd> --help.
  • Use familiar --flag value syntax for common inputs, without learning Hydra override grammar to do basic things.
  • Get JSON output from any list/inspect/status command and pipe it.
  • Recover from typos with helpful suggestions.
  • Turn on debug logging with -v from any subcommand.

Hydra remains the config-composition engine underneath — it just stops being the user-facing flag syntax.
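
A minimal sketch of that split, assuming the CLI layer simply rewrites known flags into the override strings Hydra already accepts. The to_hydra_overrides helper is hypothetical; the config_paths key comes from the ng_run example in this issue, and +foo=1 is a made-up override used only to show the escape hatch.

```python
import argparse


def to_hydra_overrides(argv):
    """Translate `gym run` style flags into Hydra override strings."""
    parser = argparse.ArgumentParser(prog="gym run")
    parser.add_argument("--config", action="append", default=[], metavar="PATH")
    args, passthrough = parser.parse_known_args(argv)

    # Unknown dash-flags are user errors; only Hydra-style tokens pass through.
    unknown = [tok for tok in passthrough if tok.startswith("-")]
    if unknown:
        parser.error("unrecognized arguments: " + " ".join(unknown))

    overrides = []
    if args.config:
        overrides.append("+config_paths=[" + ",".join(args.config) + "]")
    overrides.extend(passthrough)  # escape hatch: +key=value is kept as-is
    return overrides


print(to_hydra_overrides(["--config", "a.yaml", "--config", "b.yaml", "+foo=1"]))
```

The user never has to quote a bracketed list for the shell, while anything the flag layer does not know about still reaches Hydra unchanged.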

Design

  • Single gym entry point with subparsers (argparse or simple-parsing). Group existing scripts under nouns/verbs: gym run, gym test, gym ls …, gym dataset {upload,download,migrate}, gym prepare {benchmark,data,prompts}, gym init, gym status, etc.
  • Keep every existing ng_* / nemo_gym_* console script as a thin shim that calls into the new router, with a one-line deprecation warning. Removal happens at least one release later.
  • Translate the most common +key=value overrides into first-class --flags on each subcommand (e.g. gym run --config a.yaml --config b.yaml instead of ng_run "+config_paths=[a.yaml,b.yaml]"). Keep +key=value as an escape hatch for power users.
  • -h / --help works at every level. -v / --verbose sets LOG_LEVEL=DEBUG.
  • --json (or --output {table,json,wide}) on every command whose primary output is structured.
  • Error helpers: shared "did-you-mean" suggester (rapidfuzz or stdlib difflib.get_close_matches) wired into env/benchmark/task lookups.
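
The router described above could be sketched with argparse as follows. Only gym run and gym dataset {upload,download,migrate} are wired up, the help strings are placeholders, and the structure is one possible layout rather than a settled design. The -v flag is shared through a parent parser so it works both before and after the subcommand.

```python
import argparse
import os


def build_parser():
    # Shared flags attached to every subcommand via `parents=`.
    # SUPPRESS keeps a subcommand's unset -v from overwriting a top-level -v.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-v", "--verbose", action="store_true",
                        default=argparse.SUPPRESS, help="enable debug logging")

    parser = argparse.ArgumentParser(prog="gym")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="enable debug logging")
    commands = parser.add_subparsers(dest="command", metavar="<command>")

    run = commands.add_parser("run", parents=[common], help="run from config files")
    run.add_argument("--config", action="append", default=[], metavar="PATH")

    dataset = commands.add_parser("dataset", parents=[common], help="dataset operations")
    actions = dataset.add_subparsers(dest="action", metavar="<action>")
    for name in ("upload", "download", "migrate"):
        actions.add_parser(name, parents=[common], help=f"{name} a dataset")
    return parser


def main(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.verbose:
        os.environ["LOG_LEVEL"] = "DEBUG"
    if args.command is None:
        parser.print_help()  # bare `gym` shows the grouped command list
    return args


if __name__ == "__main__":
    main()
```

argparse generates -h / --help for every parser and subparser automatically, which covers the "help at every level" bullet without extra code.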

Acceptance Criteria

  • gym is installed as a console script and prints a grouped command list on gym / gym --help.
  • Every subcommand responds to -h / --help with usage, flags, and at least one example.
  • The most common workflows accept standard --flag value syntax for their primary inputs; +key=value still works as an escape hatch.
  • --json produces stable, documented output on every command whose primary purpose is to list, inspect, or report status.
  • -v / --verbose works at the top level and is honored by every subcommand.
  • Lookup failures (unknown env, unknown task, unknown invocation, unknown config key) print a "did you mean?" suggestion when a close match exists.
  • All existing ng_* and nemo_gym_* scripts continue to work and emit a one-line deprecation notice pointing to the new equivalent.
  • gym --help and a one-page CLI cheat-sheet land in docs/.
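
For the --json criterion, "stable" could mean sorted keys plus an explicit schema version that only changes on breaking updates. The sketch below shows one such renderer. The render helper, the schema_version field, and the row fields are assumptions for illustration, not a proposed output contract.

```python
import json


def render(rows, as_json=False):
    """Render a list of flat dicts as JSON or as a plain-text table."""
    if as_json:
        # Sorted keys and a version field keep the output diff-friendly.
        return json.dumps({"schema_version": 1, "items": rows},
                          sort_keys=True, indent=2)
    if not rows:
        return "(none)"
    headers = list(rows[0])
    widths = {h: max(len(h), *(len(str(r[h])) for r in rows)) for h in headers}
    lines = ["  ".join(h.upper().ljust(widths[h]) for h in headers)]
    for row in rows:
        lines.append("  ".join(str(row[h]).ljust(widths[h]) for h in headers))
    return "\n".join(lines)


rows = [{"name": "gpqa", "kind": "benchmark"}]  # illustrative data
print(render(rows))
print(render(rows, as_json=True))
```

Because both formats are produced from the same row dicts, a CI assertion written against the JSON form sees exactly what a human sees in the table.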

Metadata

    Labels

    CLI, needs-design (Goal is clear, but needs input on technical design), usability (improvements to user experience)