10 changes: 6 additions & 4 deletions docs/src/core/reference/resource-monitoring.md
@@ -227,9 +227,10 @@ metrics.

## Slurm Accounting Stats

-When running inside a Slurm allocation, Torc calls `sacct` after each job step completes and stores
-the results in the `slurm_stats` table. These complement the sysinfo-based metrics above with
-Slurm-native cgroup measurements.
+When running inside a Slurm allocation, Torc queues completed job steps for asynchronous `sacct`
+collection and stores the results in the `slurm_stats` table. Lookups are batched by allocation so
+job completion does not wait on Slurm accounting latency. These complement the sysinfo-based metrics
+above with Slurm-native cgroup measurements.

### Fields

@@ -245,7 +246,8 @@ Slurm-native cgroup measurements.
Additional identifying fields stored per record: `workflow_id`, `job_id`, `run_id`, `attempt_id`,
`slurm_job_id`.

-Fields are `null` when:
+Because `sacct` lookup is asynchronous, these rows may appear shortly after job completion instead
+of immediately. Fields are `null` when:

- The job ran locally (no `SLURM_JOB_ID` in the environment)
- `sacct` is not available on the node
9 changes: 5 additions & 4 deletions docs/src/specialized/design/srun-monitoring.md
@@ -234,15 +235,16 @@ sstat calls for completed steps return non-zero exit codes. These are logged at

## sacct Collection

-**Module:** `src/client/async_cli_command.rs`
+**Modules:** `src/client/job_runner.rs`, `src/client/async_cli_command.rs`

-After a job step exits, `collect_sacct_stats()` retrieves the final Slurm accounting record. This is
-a blocking call that runs on the job runner thread.
+After a job step exits, the job runner records completion first and enqueues the step for a
+background `sacct` worker. That worker batches lookups by Slurm allocation and calls
+`collect_sacct_stats_for_steps()` off the completion path.
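
For readers of this change, the queue-then-batch flow described above can be sketched roughly as follows. This is a minimal illustration and not Torc's actual implementation: `CompletedStep`, `batch_by_allocation`, and `spawn_worker` are hypothetical names, and the `lookup` closure only stands in for whatever `collect_sacct_stats_for_steps()` really does.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::Receiver;
use std::thread;

/// Hypothetical record for one finished job step (not Torc's real type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CompletedStep {
    /// Slurm allocation the step ran in (the value of SLURM_JOB_ID).
    pub slurm_job_id: String,
    /// Step name, used here only for illustration.
    pub step_name: String,
}

/// Group queued steps by allocation so each allocation needs a single lookup.
pub fn batch_by_allocation(steps: Vec<CompletedStep>) -> BTreeMap<String, Vec<CompletedStep>> {
    let mut batches: BTreeMap<String, Vec<CompletedStep>> = BTreeMap::new();
    for step in steps {
        batches
            .entry(step.slurm_job_id.clone())
            .or_insert_with(Vec::new)
            .push(step);
    }
    batches
}

/// Spawn a background worker that drains the queue and runs one lookup per
/// allocation. `lookup` is a stand-in for the real sacct collection call.
/// The worker exits once every sender is dropped and returns the number of
/// batched lookups it performed.
pub fn spawn_worker<F>(rx: Receiver<CompletedStep>, mut lookup: F) -> thread::JoinHandle<usize>
where
    F: FnMut(&str, &[CompletedStep]) + Send + 'static,
{
    thread::spawn(move || {
        let mut lookups_run = 0;
        // Block until at least one step is queued.
        while let Ok(first) = rx.recv() {
            let mut pending = vec![first];
            // Pick up anything else already queued, without blocking.
            pending.extend(rx.try_iter());
            for (allocation, steps) in batch_by_allocation(pending) {
                lookup(&allocation, &steps);
                lookups_run += 1;
            }
        }
        lookups_run
    })
}
```

In this sketch the job runner would keep the `Sender` half of a `std::sync::mpsc` channel and call `send` right after recording completion, so the completion path never waits on the accounting lookup.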

### Retry Logic

The Slurm accounting daemon (`slurmdbd`) often has a delay between step completion and record
-availability. The collector retries up to 6 times with 5-second delays:
+availability. The background collector retries up to 6 times with 5-second delays:

```mermaid
flowchart TD
9 changes: 6 additions & 3 deletions docs/src/specialized/hpc/slurm.md
@@ -397,8 +397,9 @@ if enabled.

### Slurm Accounting Stats

-After each job step exits, Torc calls `sacct` once to collect the following Slurm-native accounting
-fields and stores them in the `slurm_stats` table:
+After each job step exits, Torc records completion immediately and queues Slurm accounting
+collection on a background worker. That worker batches `sacct` lookups by allocation and stores the
+following Slurm-native accounting fields in the `slurm_stats` table:

| Field | sacct source | Description |
| ---------------------- | -------------- | ------------------------------------- |
@@ -412,7 +413,9 @@ fields and stores them in the `slurm_stats` table:
These fields complement the existing sysinfo-based metrics (`peak_memory_bytes`, `peak_cpu_percent`,
etc.) and are available via `torc slurm stats <workflow_id>`.

-`sacct` data is collected on a best-effort basis. Fields are `null` when:
+`sacct` data is collected on a best-effort basis. Because lookup runs asynchronously, stats may
+appear shortly after job completion rather than inline with the completion path. Fields are `null`
+when:

- The job ran locally (no `SLURM_JOB_ID`)
- `sacct` is not available on the node