beam

Remote GPU training deployment CLI. Provision cloud GPU instances, sync your code and environment, and launch training jobs — all from one command.

What It Does

beam handles the full lifecycle of remote GPU training:

Provisions a GPU instance on RunPod or Vast.ai (or connects to an existing one)
Builds a Python virtual environment in Docker on your local machine, or reuses a cached build
Syncs the venv and your code to the remote machine via rsync/SSH
Sets up credentials (W&B, HuggingFace, Kaggle) on the remote
Launches your training command in a tmux session with sentinel-file tracking
Monitors the job via the Pulse subsystem (cron-based health checks, auto-restart, GPU idle detection)

Features

🚀 One-command deploy — beam run --provider vastai python train.py
☁️ Multi-provider — RunPod and Vast.ai, with fallback ordering
🔄 Smart venv sync — MD5-hashed deps, incremental transfers, Docker-built envs
📦 Registry mode — Push venv as Docker image to ghcr.io for fast deploys
💓 Pulse monitoring — Cron-based health checks with auto-restart and GPU idle detection
💸 Cost tracking — Log $/hr per session, idle-stop to save money
🎯 Bid strategies — 5 strategies for Vast.ai spot instances
🔁 Recovery — Reconnect to existing instances, recover from spot preemption
📋 Session state — JSON session files for status, attach, logs
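
As a rough illustration of the "MD5-hashed deps" idea, the sketch below hashes a set of dependency files and compares the result with a stored `.deps_hash` file to decide whether the venv needs rebuilding. The function names and the choice of `pyproject.toml` and `uv.lock` as inputs are assumptions made for this example, not beam's actual implementation.

```python
import hashlib
from pathlib import Path


def deps_hash(root: Path, dep_files=("pyproject.toml", "uv.lock")) -> str:
    """Return one MD5 hex digest over the dependency files found under root.

    The file list is an assumption for this sketch; beam may hash other files.
    """
    digest = hashlib.md5(usedforsecurity=False)  # change detection only
    for name in dep_files:
        path = root / name
        if path.is_file():
            digest.update(name.encode("utf-8"))  # include the name so renames count
            digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_rebuild(root: Path, hash_file: str = ".deps_hash") -> bool:
    """True when no stored hash exists or it differs from the current one."""
    stored = root / hash_file
    if not stored.is_file():
        return True
    return stored.read_text().strip() != deps_hash(root)
```

A caller would rebuild the venv when `needs_rebuild(Path("."))` is true and then write the new digest back to `.deps_hash`.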

Install

# Requires Python 3.11+, uv, docker (for venv builds)
git clone <repo>
cd beam
uv pip install -e .

The entry point is beam (defined as beam-deploy in pyproject.toml).

Quick Start

1. Create `beam.yaml` in your project root

project:
  name: my-training-run
  repo_root: .

registry:
  # Optional: set to push venv as Docker image (faster syncs)
  # Requires GITHUB_USERNAME + DOCKER_GHRC_TOKEN env vars

remote:
  user: root
  project_dir: /workspace

ssh:
  identity_file: ~/.ssh/id_rsa

credentials:
  wandb_api_key: ${WANDB_API_KEY}
  hugging_face_hub_token: ${HUGGING_FACE_HUB_TOKEN}

providers:
  vastai:
    machines:
      - name: rtx4090
        gpu_name: RTX_4090
        num_gpus: 1
        disk_gb: 50
        image: pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime
        bid_strategy: multiplier
        bid_multiplier: 1.1

  runpod:
    machines:
      - name: a100
        gpu_type: NVIDIA A100 80GB PCIe
        gpu_count: 1
        cloud_tier: SECURE
        image: runpod/pytorch:2.3.0-py3.11-cuda12.1.1-devel-ubuntu22.04

pulse:
  enabled: true
  interval_minutes: 5
  idle_stop: true
  idle_threshold_minutes: 15
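
The `${WANDB_API_KEY}`-style values in the `credentials` block above read as environment variable placeholders. One way such placeholders could be resolved is sketched below. `expand_env` is a hypothetical helper, and replacing unset variables with an empty string is an assumption of this example, not documented beam behavior.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in value with env[NAME].

    Assumption for this sketch: unset variables become an empty string.
    """
    source = os.environ if env is None else env
    return _PLACEHOLDER.sub(lambda m: source.get(m.group(1), ""), value)


# Example with an explicit mapping instead of the real environment.
print(expand_env("${WANDB_API_KEY}", {"WANDB_API_KEY": "abc123"}))
```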

2. Deploy

# Provision a new instance and run training
beam run --provider vastai --pod-config rtx4090 python train.py

# Connect to an existing machine
beam run --ip 1.2.3.4 -p 22 python train.py

# Dry run (no provisioning)
beam run --provider vastai --dry-run python train.py

# Run locally (no ip or provider = local passthrough)
beam run python train.py

3. Monitor

beam status                    # Show all sessions (live job state by default)
beam status --offline          # Fast offline view, skips SSH
beam attach <session-id>       # Attach to tmux session
beam logs <session-id>         # Tail training logs
beam stop <session-id>         # Stop instance + remove pulse cron

CLI Reference

`beam run`

Launch a training job (provision → sync → run).

Flag	Short	Description
`--ip`		Connect to existing machine at this IP
`--port`	`-p`	SSH port (default: 22)
`-i`		SSH identity file
`--user`		SSH user (default: root)
`--provider`		Cloud provider: `vastai` or `runpod`
`--pod-id`		Reconnect to existing pod by ID
`--pod-config`		Machine config name from beam.yaml
`--name`	`-n`	Session name
`--pulse`		Enable Pulse monitoring
`--pulse-idle-stop`		Stop instance when GPU is idle
`--no-auto-stop-on-complete`		Don't stop instance when job finishes
`--config`		Path to beam.yaml
`--verbose`	`-v`	Verbose output
`--quiet`	`-q`	Suppress non-essential output
`--dry-run`		Print plan, don't execute
`[command]`		Training command to run

Execution paths:

No --ip and no --provider → runs command locally via os.execvp
--ip given → pipeline on existing machine (no provisioning)
--provider given → provision new instance, then pipeline
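
The three execution paths can be sketched as a small dispatch function. The return labels and the choice to let `--ip` take precedence when both flags are given are assumptions of this example; the local path uses `os.execvp`, as described above.

```python
import os
from typing import Optional, Sequence


def execution_path(ip: Optional[str], provider: Optional[str]) -> str:
    """Pick an execution path from the --ip and --provider values.

    Labels are hypothetical. Checking ip first is an assumption for the
    case where both flags are supplied.
    """
    if ip:
        return "existing"   # run the pipeline on an existing machine
    if provider:
        return "provision"  # provision a new instance, then run the pipeline
    return "local"          # local passthrough


def run_local(command: Sequence[str]) -> None:
    """Replace the current process with the command (does not return on success)."""
    os.execvp(command[0], list(command))


print(execution_path(None, None))        # local
print(execution_path("1.2.3.4", None))   # existing
print(execution_path(None, "vastai"))    # provision
```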

`beam status`

Show all sessions — SSHes into each remote by default to report live Job State.

Flags: `--offline` to skip SSH checks (fast, offline view)

`beam init`

Initialize a beam.yaml in the current directory.

`beam offers`

List available GPU offers from a provider.

beam offers --provider vastai
beam offers --provider runpod

`beam attach`

Attach to the tmux session of a running job.

beam attach <session-id>

`beam logs`

Tail the log file for a session.

beam logs <session-id>
beam logs <session-id> --follow

`beam stop`

Stop a session's instance and remove its Pulse cron job.

beam stop <session-id>

`beam pulse`

Manage Pulse monitoring cron jobs.

beam pulse list                  # List active pulse jobs
beam pulse install <session-id>  # Install cron for session
beam pulse uninstall <session-id> # Remove cron for session
beam pulse check <session-id>    # Run one health check now

`beam.yaml` Config Reference

All fields are optional unless noted.

`project`

project:
  name: my-project          # Project name (used in remote paths)
  repo_root: .              # Local repo root (default: auto-detected)

`registry`

registry:
  image: ghcr.io/user/beam-env  # Docker image for venv (optional)

`remote`

remote:
  user: root                    # Remote SSH user
  project_dir: /workspace       # Base dir on remote (env: BEAM_REMOTE_PROJECT_DIR)

`ssh`

ssh:
  identity_file: ~/.ssh/id_rsa  # SSH key path
  port: 22

`credentials`

credentials:
  wandb_api_key: ""
  hugging_face_hub_token: ""
  kaggle_json_path: ~/.kaggle/kaggle.json

`providers.vastai.machines[]`

- name: rtx4090                  # Config name (use with --pod-config)
  gpu_name: RTX_4090             # GPU model filter
  num_gpus: 1
  num_gpus_max: 1                # Max GPUs (for multi-GPU offers)
  gpu_ram_min: 20                # Minimum GPU RAM (GB)
  machine_id: null               # Pin to specific machine ID
  offer_type: bid                # "bid" or "on-demand"
  offer_limit: 20                # Max offers to consider
  price: 0.30                    # Starting bid price ($/hr)
  price_max: 0.80                # Max bid price
  price_step: 0.05               # Bid increment
  disk_gb: 50
  runtype: ssh                   # "ssh" or "jupyter"
  target_state: running
  recovery_policy: restart       # "restart", "ignore", or "destroy"
  image: pytorch/pytorch:latest
  onstart: ""                    # Script to run on instance start
  env: {}                        # Environment variables
  pick_strategy: cheapest        # How to pick from offers
  bid_strategy: multiplier       # Bid strategy (see below)
  bid_multiplier: 1.1
  bid_percentile: 50
  reliability_min: 0.9
  dph_max: 1.0                   # Max $/hr
  dlperf_min: 0.0
  dlperf_per_dphtotal_min: 0.0

`providers.runpod.machines[]`

- name: a100
  gpu_type: "NVIDIA A100 80GB PCIe"
  gpu_count: 1
  cloud_tier: SECURE             # "SECURE" or "COMMUNITY"
  spot: false
  recovery_policy: restart
  image: runpod/pytorch:latest
  template_id: null
  container_disk_size: 20        # GB
  volume_size: 50                # GB
  volume_path: /workspace
  network_volume_id: null
  min_memory_gb: 0
  min_vcpu: 0
  max_cost: 0.0                  # 0 = no limit
  ports: "22/tcp"
  ssh_public_key_files: []
  env_vars: {}

`pulse`

pulse:
  enabled: false
  interval_minutes: 5
  idle_stop: false
  idle_threshold_minutes: 15
  idle_gpu_threshold: 10         # GPU utilization (%) used as the idle threshold
  max_restarts: 5

Pulse Monitoring

Pulse is a cron-based health monitoring subsystem. When enabled (--pulse flag or pulse.enabled: true), it installs a cron job that runs every N minutes and checks job health.
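
To make "runs every N minutes" concrete, here is a sketch of how a crontab entry for a session might be assembled. The wrapper and log paths follow the Filesystem Layout section. The exact line beam installs is not documented here, so the format below, including the output redirection, is an assumption.

```python
from pathlib import PurePosixPath


def pulse_cron_line(session_id: str, project_root: str, interval_minutes: int = 5) -> str:
    """Build a crontab line that runs the Pulse wrapper every N minutes.

    Hypothetical helper. Note that */N gives even spacing only when N
    divides 60.
    """
    if not 1 <= interval_minutes <= 59:
        raise ValueError("interval_minutes must be between 1 and 59")
    logs = PurePosixPath(project_root) / ".beam" / "logs"
    wrapper = logs / f".pulse_wrapper_{session_id}.sh"
    log_file = logs / "pulse_cron.log"
    return f"*/{interval_minutes} * * * * {wrapper} >> {log_file} 2>&1"


print(pulse_cron_line("abc123", "/home/me/proj"))
```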

Decision Matrix

Machine State	Job State	Action
running	FINISHED	Auto-stop instance
running	FAILED/crashed	Soft restart (re-sync + relaunch)
running	RUNNING	GPU idle check → destroy+restart if too many strikes
stopped/exited	any	Hard restart (full pipeline, fresh instance)
api_error	any	Skip (try again next cycle)
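
The matrix above can be read as a single lookup function. The sketch below is one way to encode it. The action labels, the `idle_strikes_exceeded` parameter, and the fallback to `skip` for unlisted combinations are assumptions made for illustration.

```python
def pulse_action(machine_state: str, job_state: str, idle_strikes_exceeded: bool = False) -> str:
    """Map (machine state, job state) to a Pulse action, following the matrix.

    Action names are hypothetical labels, not beam identifiers.
    """
    if machine_state == "api_error":
        return "skip"  # try again next cycle
    if machine_state in ("stopped", "exited"):
        return "hard_restart"  # full pipeline on a fresh instance
    if machine_state == "running":
        if job_state == "FINISHED":
            return "auto_stop"
        if job_state == "FAILED":
            return "soft_restart"  # re-sync and relaunch on the same machine
        if job_state == "RUNNING":
            # The GPU idle check only escalates after too many strikes.
            return "destroy_and_restart" if idle_strikes_exceeded else "none"
    return "skip"  # assumption: combinations not in the matrix are skipped


print(pulse_action("running", "FAILED"))
print(pulse_action("stopped", "RUNNING"))
```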

Restart Strategies

Soft restart — Re-sync code + relaunch on the same machine (fast)
Auto restart — Full pipeline on a fresh new instance
Destroy and restart — Destroy current instance + full pipeline (used after GPU idle strikes)
Idle stop — Stop instance + remove cron job (job done, save money)

Backoff

Restart backoff: min(60 × 2^restart_count, 1800) seconds (max 30 minutes).
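
The formula translates directly into code. The function name and keyword defaults below are illustrative, while the constants come from the formula above.

```python
def restart_backoff_seconds(restart_count: int, base: int = 60, cap: int = 1800) -> int:
    """Delay before the next restart: base * 2**restart_count, capped at cap seconds."""
    if restart_count < 0:
        raise ValueError("restart_count must be non-negative")
    return min(base * 2 ** restart_count, cap)


# Delays for the first six restarts: 60, 120, 240, 480, 960, 1800
for count in range(6):
    print(count, restart_backoff_seconds(count))
```

With these constants the cap is reached at the fifth restart, since 60 × 2^5 = 1920 exceeds 1800.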

Audit Log

All Pulse decisions are logged to .beam/logs/sessions/pulse_history_{session_id}.jsonl.
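
Because the audit log is JSON Lines, it can be inspected with a few lines of Python. The sketch below only assumes one JSON object per line; the record fields are not documented here, so it does not rely on any particular key.

```python
import json
from pathlib import Path


def read_pulse_history(path: Path) -> list[dict]:
    """Parse a .jsonl file into a list of dicts, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `read_pulse_history(Path(".beam/logs/sessions/pulse_history_<session-id>.jsonl"))` would return the recorded decisions in order, oldest first.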

Bid Strategies (Vast.ai)

Strategy	Selection	Bid Price	Best For
`multiplier` (default)	Cheapest offer	`min_bid × bid_multiplier`	General use
`cheapest`	Cheapest offer	`cfg.price` + increments	Legacy/manual
`best_value`	Best `dlperf/dph`	`min_bid × bid_multiplier`	Performance per $
`percentile`	Cheapest offer	Nth percentile of min_bids	Percentile control
`score`	Best `dlperf×reliability/min_bid`	`min_bid × bid_multiplier`	Reliability focus
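
The sketch below works through the table with toy data. The offer dictionaries and their keys (`dlperf`, `dph`, `reliability`, `min_bid`) are hypothetical stand-ins for provider offer data, and the nearest-rank percentile method is an assumption; beam may compute the percentile differently.

```python
import math


def multiplier_bid(min_bid: float, bid_multiplier: float = 1.1) -> float:
    """Bid price for the multiplier, best_value and score strategies."""
    return round(min_bid * bid_multiplier, 4)


def percentile_bid(min_bids: list[float], bid_percentile: float = 50) -> float:
    """Nth percentile of the offers' minimum bids (nearest-rank, an assumption)."""
    if not min_bids:
        raise ValueError("min_bids must not be empty")
    ordered = sorted(min_bids)
    rank = max(1, math.ceil(bid_percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def pick_best_value(offers: list[dict]) -> dict:
    """best_value: highest dlperf per dollar-hour."""
    return max(offers, key=lambda o: o["dlperf"] / o["dph"])


def pick_by_score(offers: list[dict]) -> dict:
    """score: highest dlperf * reliability / min_bid."""
    return max(offers, key=lambda o: o["dlperf"] * o["reliability"] / o["min_bid"])


# Toy offers, invented for this example.
offers = [
    {"id": 1, "dlperf": 40.0, "dph": 0.40, "reliability": 0.99, "min_bid": 0.20},
    {"id": 2, "dlperf": 30.0, "dph": 0.25, "reliability": 0.90, "min_bid": 0.18},
]
print(pick_best_value(offers)["id"], pick_by_score(offers)["id"])
print(multiplier_bid(0.20), percentile_bid([0.18, 0.20, 0.30, 0.40], 50))
```

On this toy data the two selection rules disagree: `best_value` favors the cheaper offer, while `score` favors the more reliable one.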

Environment Variables

Variable	Required	Description
`RUNPOD_API_KEY`	For RunPod	RunPod REST API key
`VAST_API_KEY`	For Vast.ai	Vast.ai API key
`GITHUB_USERNAME`	For registry mode	GitHub username for ghcr.io
`DOCKER_GHRC_TOKEN` or `GITHUB_TOKEN`	For registry mode	Token for ghcr.io push
`WANDB_API_KEY`	Optional	Weights & Biases API key
`HUGGING_FACE_HUB_TOKEN`	Optional	HuggingFace token
`BEAM_REMOTE_USER`	Optional	Override remote SSH user
`BEAM_REMOTE_PROJECT_DIR`	Optional	Override remote project directory
`FORCE_PUSH`	Optional	Force Docker image push even if unchanged

Filesystem Layout

Local

./beam.yaml                                  # Config file
./.deps_hash                                 # MD5 hash of dep files
./.heavy-deps-hash                           # Hash for heavy deps
./.beam/logs/                                # All local beam logs
./.beam/logs/sessions/{id}.json              # Session state
./.beam/logs/.pulse_wrapper_{id}.sh          # Pulse cron wrapper script
./.beam/logs/pulse_cron.log                  # Pulse cron output log
./.beam/logs/sessions/pulse_history_{id}.jsonl  # Pulse audit log

Remote

{remote.project_dir}/{project.name}/         # Code root
{remote_code_dir}/.venv/                     # Python venv
{remote_code_dir}/.run_session.sh            # Launch script
{remote.project_dir}/logs/{project.name}/    # Job logs dir
{remote_logs_dir}/{session_name}.log         # Training log
{remote_logs_dir}/.beam_state/RUNNING        # Sentinel: PID|timestamp
{remote_logs_dir}/.beam_state/FINISHED       # Sentinel: timestamp
{remote_logs_dir}/.beam_state/FAILED         # Sentinel: exit_code|timestamp
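
The sentinel files use small pipe-delimited payloads, so reading job state is mostly a matter of splitting on `|`. The parser below is a sketch. `JobState` and `parse_sentinel` are hypothetical names, and the timestamp is kept as an opaque string because its format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobState:
    state: str
    timestamp: str
    pid: Optional[int] = None
    exit_code: Optional[int] = None


def parse_sentinel(name: str, content: str) -> JobState:
    """Parse one sentinel file given its name and text content."""
    text = content.strip()
    if name == "RUNNING":        # PID|timestamp
        pid, timestamp = text.split("|", 1)
        return JobState("RUNNING", timestamp, pid=int(pid))
    if name == "FAILED":         # exit_code|timestamp
        code, timestamp = text.split("|", 1)
        return JobState("FAILED", timestamp, exit_code=int(code))
    if name == "FINISHED":       # timestamp
        return JobState("FINISHED", text)
    raise ValueError(f"unknown sentinel file: {name}")


print(parse_sentinel("RUNNING", "4242|1700000000\n"))
```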

Shell Completion

Beam supports Tab completion for session IDs, provider names, and config names in bash, zsh, and fish.

Setup (one-time):

# Detect your shell automatically
beam --install-completion

# Or specify explicitly
beam --install-completion zsh
beam --install-completion bash
beam --install-completion fish

Then restart your shell (or source ~/.zshrc / source ~/.bashrc).

What completes:

Command	Completes
`beam attach <TAB>`	Session IDs
`beam logs <TAB>`	Session IDs
`beam stop <TAB>`	Session IDs
`beam pulse install --session <TAB>`	Session IDs
`beam pulse uninstall --session <TAB>`	Session IDs
`beam pulse check --session <TAB>`	Session IDs
`beam run --provider <TAB>`	`runpod`, `vastai`
`beam offers --provider <TAB>`	`runpod`, `vastai`
`beam run --pod-config <TAB>`	Config names from `beam.yaml`
`beam offers --pod-config <TAB>`	Config names from `beam.yaml`

Session IDs are read from .beam/logs/sessions/ (filename-only, instant). No network calls during completion.
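
A filename-only lookup of this kind could look like the sketch below. `list_session_ids` is a hypothetical helper. It relies on session state files being named `{id}.json`, as shown in the Filesystem Layout section, and the `*.json` pattern leaves out the `pulse_history_*.jsonl` audit logs in the same directory.

```python
from pathlib import Path


def list_session_ids(sessions_dir: Path = Path(".beam/logs/sessions")) -> list[str]:
    """Return session IDs derived from {id}.json filenames, without opening any file."""
    if not sessions_dir.is_dir():
        return []
    # "*.json" does not match "*.jsonl", so Pulse audit logs are excluded.
    return sorted(path.stem for path in sessions_dir.glob("*.json"))
```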
