(WIP) do not merge - Add nightly speed runs to check how fast we can solve a single map #439

Open
eugenevinitsky wants to merge 22 commits into emerge/temp_training from ev/single_agent_speed_runs

Conversation

@eugenevinitsky
No description provided.

Eugene Vinitsky and others added 11 commits May 22, 2026 22:09
New [env] knob goal_placement (gigaflow): 0 = route (existing forward-route
waypoints), 1 = random. In random mode compute_goals samples each target
waypoint at a random drivable point anywhere on the map, rejecting points
within min_waypoint_spacing of the agent, so goals can land in any direction
including behind it. The route/path are still built at spawn for lane
observations and progression; only the goal positions change. Goal-reached
stays a pure Euclidean check, so random goals reward correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
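
A minimal Python sketch of the rejection sampling described in the commit above, for readers who want the idea without reading drive.h. The function names, the `max_attempts` cap, and the `None` return on exhaustion are assumptions made for illustration; the actual C implementation samples drivable grid cells and may behave differently when every attempt is rejected.

```python
import math
import random


def sample_random_goal(agent_xy, drivable_points, min_spacing, rng, max_attempts=100):
    """Rejection-sample a goal that is at least min_spacing away from the agent.

    Hypothetical helper: returns None if every attempt is rejected.
    """
    for _ in range(max_attempts):
        candidate = rng.choice(drivable_points)
        # Reject points that fall within min_spacing of the agent.
        if math.dist(agent_xy, candidate) >= min_spacing:
            return candidate
    return None


def goal_reached(agent_xy, goal_xy, radius):
    """Pure Euclidean check: direction of travel and route are irrelevant."""
    return math.dist(agent_xy, goal_xy) <= radius
```

Because the check is distance-only, a goal behind the agent is rewarded exactly like one ahead of it, and `min_spacing=0` accepts any drivable point, including ones right next to the agent.
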
Manual launcher for single-agent-per-map training: one agent per env on Town02
only, many envs, goal_placement=random (goals anywhere on the map incl. behind).
Auto-creates the Town02-only map dir, runs gpu_heartbeat alongside, wandb-tracked.
Scheduling and scale (NUM_AGENTS) are overridable without editing the file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Launch the single-agent speed run as a 3-task SLURM array (one seed each via
--train.seed), all logged to wandb group "Nightly Test". max_goal_position=1000
and total_timesteps capped at 1B. Town02-folder create is atomic so concurrent
array tasks can't read a partial map file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
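
One common way to make a directory create atomic is to stage it under a temporary name and rename it into place. The sketch below shows that pattern in Python; it is an assumption about the general technique, not the launcher's actual shell code, and `publish_map_dir` plus the file names in the usage note are hypothetical.

```python
import os
import shutil
import tempfile


def publish_map_dir(src_map, final_dir):
    """Create final_dir holding a copy of src_map, visible only once complete."""
    if os.path.isdir(final_dir):
        return final_dir  # another task already published it
    parent = os.path.dirname(os.path.abspath(final_dir))
    os.makedirs(parent, exist_ok=True)
    # Stage in the same parent so the rename stays on one filesystem.
    staging = tempfile.mkdtemp(dir=parent, prefix=".staging_")
    try:
        shutil.copy2(src_map, staging)
        os.rename(staging, final_dir)  # atomic on POSIX within one filesystem
    except OSError:
        # Either the copy failed or another task won the race; drop our staging copy.
        shutil.rmtree(staging, ignore_errors=True)
        if not os.path.isdir(final_dir):
            raise
    return final_dir
```

A call such as `publish_map_dir("maps/Town02.bin", "maps_town02_only")` (illustrative paths) either finds a complete directory or creates one; a concurrent array task never observes a half-copied map file under the final name.
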
Adds an opt-in [env] knob use_map_cache (default 0). When set, environments
loading the same map file share one read-only copy of the static geometry --
road_elements, the spatial grid (cells + neighbor cache), and the lane graph
(incl. its O(n^2) distance matrix) -- via a per-process, reference-counted cache
keyed by map filename. Per-env mutable state (agents, traffic-light states) is
never shared, so dynamic traffic lights stay correct.

This removes the per-env map duplication that OOMs single-map / many-env runs
(e.g. 1 map x 1024 single-agent envs): the geometry is built once and borrowed.

Lifecycle: init() builds-or-borrows; c_close() decrements the refcount and frees
the entry only on the last reference. A getpid() guard prevents a forked worker
from freeing the parent's copy-on-write geometry (the use-after-free that the
upstream WIP #346 hit). Default off keeps existing single-owner behavior; the
single_agent_speed_run launcher passes --env.use_map_cache 1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
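
The lifecycle described above (build-or-borrow, decrement on close, free only on the last reference and only in the owning process) can be modeled in a few lines of Python. This is a toy model under stated assumptions, not a port of drive.h: `MapCache`, `acquire`, `release`, and the `builds` counter are hypothetical names, and Python's garbage collector stands in for the explicit frees in C.

```python
import os


class MapCache:
    """Per-process, reference-counted cache keyed by map filename (toy model)."""

    def __init__(self):
        self._entries = {}              # filename -> [geometry, refcount]
        self._owner_pid = os.getpid()   # process that built the cache
        self.builds = 0                 # how many times geometry was actually built

    def acquire(self, filename, build):
        """Build the geometry on first use, otherwise borrow the shared copy."""
        entry = self._entries.get(filename)
        if entry is None:
            entry = [build(filename), 0]
            self._entries[filename] = entry
            self.builds += 1
        entry[1] += 1
        return entry[0]

    def release(self, filename):
        """Drop one reference; free only on the last one, and only in the owner."""
        entry = self._entries[filename]
        entry[1] -= 1
        if entry[1] == 0 and os.getpid() == self._owner_pid:
            del self._entries[filename]
```

With 1024 environments on one map, `builds` stays at 1 while the refcount climbs to 1024, which is the memory saving the commit is after. The pid check mirrors the guard against a forked worker freeing geometry it only inherited.
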
Pass --env.traffic_light_behavior 0 so the agent isn't stopped/penalized at
lights during point-to-point speed runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The dynamic pufferl argparser registers dotted [env]/[train] keys with hyphens
(map_dir -> --env.map-dir), so the underscore forms were rejected as unrecognized
arguments. Convert all dotted overrides to hyphens. Also disable the nuPlan-based
evaluators (validation_replay + behavior buckets) for single-map CARLA training
via --eval.<name>.enabled 0; validation_gigaflow stays enabled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
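
To illustrate why the underscore forms were rejected, here is a small argparse sketch that registers dotted keys with hyphens, as the commit describes. It is not the pufferl parser; `build_parser` and the config dict are hypothetical stand-ins for whatever the real code generates from drive.ini.

```python
import argparse


def build_parser(config):
    """Register one --section.key flag per config entry, hyphenating the key."""
    parser = argparse.ArgumentParser()
    for section, keys in config.items():
        for key, default in keys.items():
            flag = f"--{section}.{key.replace('_', '-')}"
            parser.add_argument(
                flag,
                dest=f"{section}.{key}",   # keep the original key as the destination
                default=default,
                type=type(default),
            )
    return parser


parser = build_parser({"env": {"map_dir": "maps", "use_map_cache": 0}})

# Hyphenated form is recognized.
ok, ok_extra = parser.parse_known_args(["--env.map-dir", "maps/town02"])

# Underscore form was never registered, so it comes back as unrecognized.
bad, bad_extra = parser.parse_known_args(["--env.map_dir", "maps/town02"])
```

With `parse_args` instead of `parse_known_args`, the second call would exit with an "unrecognized arguments" error, which matches the failure the commit fixes.
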
validation_replay inherited enabled from validation_defaults, so the argparser never registered
--eval.validation-replay.enabled and CLI overrides were rejected. Declaring it
explicitly (same true default) makes it toggleable, matching the behaviors_* sections.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Dedicated, disabled-by-default evaluator mirroring the single-agent speed-run
task (Town02, one agent, random goals, lights off) with render_backend=obs_html.
Lets us faithfully obs-render any checkpoint via
  puffer eval puffer_drive --evaluator single_agent_obs --load-model-path <ckpt>
without it firing inline during training (enabled=false; standalone-by-name still
runs it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
With goal_placement=random, min_waypoint_spacing is the rejection floor on goal
distance; 0 lets goals land right next to the agent (near goals, not only far).
Set in the training launcher and the single_agent_obs render config.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Remove the [eval.single_agent_obs] section and the [eval.validation_replay]
enabled= line; keep drive.ini as the shared default. The general goal_placement
and use_map_cache [env] knobs stay (default off) since they're required for the
CLI flags to exist. The launcher no longer tries to disable validation_replay
(its enabled is inherited, not CLI-overridable without a drive.ini line).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_obs)

These are eval-scoped and do not alter default training: validation_replay's
enabled was already inherited-true (now explicit, CLI-toggleable); single_agent_obs
is enabled=false so it never runs inline. Restore the launcher flag that disables
validation_replay for single-agent runs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 23, 2026 15:09
traffic_light_behavior=0 only stops the env force-halting at reds; the agent still
learns to stop because a red-light violation zeroes the reward multiplier
(no_red_light, drive.h). The red-light metric is gated on
max_traffic_control_observations, so setting it to 0 removes lights from the
observation AND keeps red_light_violation_rate at 0 (multiplier never zeroed) --
the agent can neither see nor be penalized by lights. Set in the launcher and the
single_agent_obs render config.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
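
The gating argument above can be restated as a two-function toy model. This is a reading of the commit message only, not code from drive.h: the function names are hypothetical, and the real multiplier may combine other terms.

```python
def red_light_violation(ran_red_light, max_traffic_control_observations):
    """The metric is gated: with zero traffic-control observations it never fires."""
    return max_traffic_control_observations > 0 and ran_red_light


def reward_multiplier(ran_red_light, max_traffic_control_observations):
    """A recorded red-light violation zeroes the multiplier; otherwise it stays 1."""
    if red_light_violation(ran_red_light, max_traffic_control_observations):
        return 0.0
    return 1.0
```

Under this model, `traffic_light_behavior=0` alone leaves the penalty in place, whereas `max_traffic_control_observations=0` removes both the observation and the penalty.
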
Copilot AI left a comment

Pull request overview

Adds a SLURM “nightly speed run” training job and extends the Drive environment to support (1) random goal placement across the map and (2) an optional per-process shared map-geometry cache to reduce repeated map load/build overhead.

Changes:

  • Introduce scripts/single_agent_speed_run.sbatch for a 3-seed nightly single-map (Town02) training sweep with evaluator disabling.
  • Add goal_placement (route vs random) and use_map_cache (share static map geometry) plumbing from config → Python env → C env.
  • Extend drive.ini with new env keys and a disabled-by-default evaluator preset (eval.single_agent_obs) for manual obs_html rendering.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| scripts/single_agent_speed_run.sbatch | New SLURM script to run the single-agent nightly “speed run” training sweep. |
| pufferlib/ocean/drive/drive.py | Adds validation and passes through new env kwargs (goal_placement, use_map_cache). |
| pufferlib/ocean/drive/drive.h | Implements random goal sampling and a shared map-geometry cache with refcounted entries. |
| pufferlib/ocean/drive/binding.c | Unpacks new kwargs into the C Drive struct. |
| pufferlib/config/ocean/drive.ini | Adds new env keys and a disabled-by-default evaluator config for the speed-run task. |

Comment on lines +57 to +60
; Goal placement (gigaflow) - options: 0 - route (forward waypoints), 1 - random (anywhere on map)
goal_placement = 0
; Share static map geometry (roads/grid/lane-graph) across envs using the same map - 0 off, 1 on
use_map_cache = 0
Comment on lines +1918 to +1921
for (int attempt = 0; attempt < MAX_GOAL_ATTEMPTS; attempt++) {
    int list_idx = rand() % env->grid_map->num_drivable_grid_cell;
    int grid_idx = env->grid_map->grid_index_drivable[list_idx];

Comment on lines +1922 to +1934
GridMapEntity cell_candidates[MAX_ENTITIES_PER_CELL];
int candidate_count = 0;
for (int j = 0; j < env->grid_map->cell_entities_count[grid_idx]; j++) {
    GridMapEntity entity = env->grid_map->cells[grid_idx][j];
    if (is_drivable_road_lane(env->road_elements[entity.entity_idx].type)) {
        cell_candidates[candidate_count++] = entity;
    }
}
if (candidate_count == 0) {
    continue;
}

GridMapEntity chosen = cell_candidates[rand() % candidate_count];
Comment on lines +307 to +313
// Per-process map cache. Built lazily in init(). g_map_cache_pid stamps the
// process that built it: a forked worker inherits these pointers via copy-on-write
// and must never free them (it would corrupt the parent's heap), so the free path
// in c_close is guarded by getpid() == g_map_cache_pid.
static struct SharedMapData **g_map_cache = NULL;
static int g_map_cache_count = 0;
static pid_t g_map_cache_pid = 0;
Eugene Vinitsky and others added 10 commits May 23, 2026 11:26
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Set WANDB_NAME=<YYYY-MM-DD>_single_agent_seed<N> (WandbLogger passes no name=, so
the env var sets the run name). Date-first so runs sort chronologically per launch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
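
A short sketch of the naming scheme, assuming the launcher builds the name before starting the trainer. `WANDB_NAME` is a standard wandb environment variable; the helper `wandb_run_name` is hypothetical.

```python
import os
from datetime import date


def wandb_run_name(seed, today=None):
    """Date-first name so runs sort chronologically, then by seed."""
    today = today or date.today()
    return f"{today:%Y-%m-%d}_single_agent_seed{seed}"


# The logger passes no name=, so wandb falls back to this variable.
os.environ["WANDB_NAME"] = wandb_run_name(seed=0)
```

Because ISO dates sort lexicographically, a plain string sort of run names groups each night's launch together.
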
… bound

compute_goals_random previously only enforced a min distance and sampled
uniformly over the whole map (far-skewed, no cap). Now it accepts points in
[min_waypoint_spacing, max_waypoint_spacing], with a closest-to-band fallback so
a tight max still lands a nearby goal. The random-mode contexts (single-agent
launcher + single_agent_obs render) default max_waypoint_spacing to a huge value
(= anywhere on the map), preserving prior behavior; the global [env] default
stays 60 for route mode / default training.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
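
The band-plus-fallback behavior can be sketched as follows. This is an illustrative Python model of the description above, not the C code in compute_goals_random: `band_gap`, `sample_goal_in_band`, and the attempt cap are assumed names and values.

```python
import math
import random


def band_gap(distance, lo, hi):
    """How far distance lies outside [lo, hi]; 0 when it is inside the band."""
    return max(lo - distance, 0.0, distance - hi)


def sample_goal_in_band(agent_xy, drivable_points, lo, hi, rng, max_attempts=100):
    """Accept the first sample inside the band, else return the closest-to-band one."""
    best, best_gap = None, math.inf
    for _ in range(max_attempts):
        candidate = rng.choice(drivable_points)
        gap = band_gap(math.dist(agent_xy, candidate), lo, hi)
        if gap == 0.0:
            return candidate
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best  # fallback: nothing landed in the band
```

Setting `hi` to a very large value makes the upper bound inactive, which reproduces the earlier "anywhere on the map" behavior, while a tight `hi` still yields a nearby goal through the fallback.
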
free_shared_map_data NULLs an entry's slot on refcount zero; reuse those holes on
the next insert so a resample cycle (free all, rebuild) keeps g_map_cache_count
bounded by the number of distinct maps instead of appending past NULL holes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
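
The hole-reuse fix amounts to a first-free-slot insert. A minimal Python model, with `None` standing in for the NULLed slots and hypothetical function names:

```python
def insert_entry(slots, entry):
    """Store entry in the first free (None) slot; append only if none is free."""
    for i, slot in enumerate(slots):
        if slot is None:
            slots[i] = entry
            return i
    slots.append(entry)
    return len(slots) - 1


def free_entry(slots, index):
    """Leave a hole rather than shifting later entries."""
    slots[index] = None
```

Repeated "free all, rebuild" cycles then keep `len(slots)` equal to the number of distinct maps instead of growing by that number on every cycle.
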
Move the single-agent Town02 / random-goal / lights-off knobs into a
program_config YAML so the run goes through the canonical submit_cluster.py
path (code isolation + container + heartbeat) instead of a hand-rolled sbatch.
Launch 3 seeds via --args train.seed=0:1:2.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
submit_cluster.py joins the inner command into a bash -c string without quoting
arg values, so a space in --wandb-group split the arg and crashed argparse.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
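
The failure mode is easy to reproduce with the standard library. The sketch below contrasts a plain join with `shlex.join`, which is one possible remedy; the commit message does not show which fix was applied, and the argument list is illustrative rather than the real inner command.

```python
import shlex

# Illustrative inner command; only the group name with a space matters here.
args = ["train", "--wandb-group", "Nightly Test"]

unquoted = " ".join(args)   # naive join: the space splits the value
quoted = shlex.join(args)   # quotes values that contain whitespace

split_unquoted = shlex.split(unquoted)  # what bash -c would hand to argparse
split_quoted = shlex.split(quoted)
```

After the naive join, `Test` arrives as a stray positional argument, which is what crashes argparse; with quoting, the round trip returns the original list.
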
Superseded by the submit_cluster.py path (scripts/cluster_configs/single_agent_speed_run.yaml),
which provides per-run code isolation, container wrapping, and the heartbeat. The
hand-rolled sbatch ran from the shared checkout with no isolation, which is what
let a rebuild SIGBUS the live seeds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…d ablation)

Self-contained variant of single_agent_speed_run.yaml with lane (lane_align,
lane_center) and velocity (vel_align, velocity, overspeed) rewards zeroed, so the
ablation is defined by a config file rather than ad-hoc --args. seed stays on --args.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Override the inherited validation_gigaflow scenario_length=500 / render_max_steps=300
so the obs render matches the single-agent training episode length (1280 steps).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>