## What

Three intertwined transient-failure modes from `brev create` keep tripping our automation. They're related — each one's downstream consequence makes the next one likelier — and a unified "verify the workspace state before treating create as failed" path on the CLI side would knock out all three.
### Mode 1 — `unexpected EOF`, rc=1 even after the workspace was created

`brev create <name> --type <T>` exits non-zero with a stderr body like:
```
WARN RESTY Post "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/<org-id>/workspaces?cli_version=v0.6.323&local=true&os=linux&utm_source=cli": unexpected EOF, Attempt 1
ERROR RESTY Post "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/<org-id>/workspaces?cli_version=v0.6.323&local=true&os=linux&utm_source=cli": unexpected EOF
[Worker 1] m8i-flex.2xlarge Failed: ... Post "...": unexpected EOF
Warning: Only created 0/1 instances
could only create 0/1 instances
```
…even though the workspace was actually created — it shows up in `brev ls --json` immediately after with `status=DEPLOYING` (then `RUNNING`) and is fully usable. The non-zero exit is a false negative: the POST appears to have completed server-side but the client got an EOF before reading the success response.
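For anyone scripting around this today, the verification is cheap enough to run after every failed create. A minimal sketch, assuming `brev ls --json` prints a JSON array of workspace objects with a `name` field (the array-of-objects shape is an assumption; the command and the `status` field are confirmed above):

```python
import json
import subprocess

def workspace_exists(name: str) -> bool:
    """Check server-side state instead of trusting the create exit code.

    Assumes `brev ls --json` emits a JSON array of workspace objects
    keyed by "name"; adjust the key to the schema your CLI version emits.
    """
    out = subprocess.run(
        ["brev", "ls", "--json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return any(ws.get("name") == name for ws in json.loads(out))
```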
### Mode 2 — `rpc error: code = Internal desc = context deadline exceeded`

Less common but observed today on `m8i-flex.2xlarge`:
```
[Worker 1] Trying m8i-flex.2xlarge for instance 'gr-ngc'...
[Worker 1] m8i-flex.2xlarge Failed: [error]
github.com/brevdev/brev-cli/pkg/cmd/gpucreate.(*createContext).createWorkspace
	/go/src/github.com/brevdev/brev-cli/pkg/cmd/gpucreate/gpucreate.go:999
: rpc error: code = Internal desc = context deadline exceeded
Type m8i-flex.2xlarge had failures, trying next type...
Warning: Only created 0/1 instances
could only create 0/1 instances.
```
The error path looks like an AWS-side RPC timing out before the workspace registration completes. Same false-negative shape as Mode 1 — the workspace can still end up created server-side — but with a different wire error, so any retry predicate keyed only on `unexpected EOF` misses it.
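The caller-side fix for that is to widen the predicate. A minimal sketch; the marker list below is just what we have observed in Modes 1 and 2, not an exhaustive taxonomy of Brev transients:

```python
# Substrings seen in Modes 1 and 2; extend as new transients show up.
TRANSIENT_MARKERS = (
    "unexpected EOF",             # Mode 1: truncated response
    "context deadline exceeded",  # Mode 2: internal rpc timeout
)

def looks_transient(stderr: str) -> bool:
    """True if a failed `brev create` stderr matches a known transient."""
    return any(marker in stderr for marker in TRANSIENT_MARKERS)
```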
### Mode 3 — `Error: workspace 'X' already exists` after a previous client-side timeout

The downstream consequence of the other two: a previous `brev create` on the same name appeared to fail to the caller, the caller cleaned up its own DB row but didn't `brev delete`, and the next run hits:
```
[Worker 1] m8i-flex.2xlarge Failed: [error]
github.com/brevdev/brev-cli/pkg/cmd/gpucreate.(*createContext).createWorkspace
	/go/src/github.com/brevdev/brev-cli/pkg/cmd/gpucreate/gpucreate.go:999
: duplicate workspace with name gr-validate
Error: workspace 'gr-validate' already exists. Use a different name or delete the existing workspace
```
This one is technically the operator's fault (failed to clean up the orphan), but the orphan only existed because of Mode 1 / Mode 2 in the first place — and the error message doesn't tell the caller "this is the same workspace your previous attempt actually succeeded in creating, you can attach to it." Treating the existing workspace as recoverable would close the loop.
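Until then, the caller has to resolve the collision itself. A sketch of the two sane options, assuming the name is unique to your pipeline so the colliding workspace really is your own orphan (the `adopt` convention is ours, not a CLI feature):

```python
import subprocess

def resolve_name_collision(name: str, adopt: bool = True) -> None:
    """Recover from `workspace 'X' already exists` after a false-negative create.

    adopt=True: treat the existing workspace as the successful result of
    the earlier create and carry on (it is the same workspace).
    adopt=False: delete the orphan so the next `brev create <name>`
    starts clean.
    """
    if adopt:
        return  # proceed as if the original create had succeeded
    subprocess.run(["brev", "delete", name], check=True)
```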
## Today's occurrences

Single 8-hour window, single Brev org (`vanguard-programming`), AWS `m8i-flex.2xlarge`. All `brev create` invocations from `gateroom_manager.brev_driver._dispatch_brev_create` running as root from a systemd unit. CLI version `v0.6.323`.
| time (UTC) | instance | env id | mode | notes |
|---|---|---|---|---|
| 17:01:58 | `gr-ngc` | - | 1 | retry attempt 1/3, workspace existed |
| 17:03:17 | `gr-ngc` | - | 1 | retry attempt 2/3, workspace not in ls |
| 17:03:30 | `gr-ngc` | - | 2 | rpc context deadline; attempt 3/3 final |
| 17:29:08 | `gr-ngc2` | - | 1 | retry attempt 1/3 (parallel-spawn batch) |
| 17:29:08 | `gr-ngc3` | - | 1 | retry attempt 1/3 (parallel-spawn batch) |
| 21:08:43 | `gr-validate` | `si3jsvf4t` | 3 | orphan from prior client-side timeout |
The successful spawns from this window — `q01vizlja` (`gr-validate`), `1nmqlnm60` (`gr-self-dev`), `b08jjhe8k` (`gr-ngc2`), `5ohm8qti5` (`gr-ngc3`), `ygg2q2r5c` (`gr-ngc`) — all came up cleanly once they got past whichever transient failure they hit first.
## Why it matters for automation

Downstream tooling that treats `brev create`'s exit code as ground truth concludes "create failed" and then does one of three things:

- Tries to clean up by calling `brev delete <name>` (which succeeds, terminating the perfectly good workspace).
- Marks an internal record (DB row, etc.) as destroyed/failed, leaving an orphan VM in Brev that no automation will reconcile.
- Bails out of the whole pipeline before whatever was supposed to run on the new workspace ever starts.

…and on the next attempt, hits Mode 3 because the orphan is still up.
## Workarounds we ship today

```python
async def _workspace_was_created(self, instance_name: str) -> bool:
    """True if the workspace is visible server-side despite a non-zero
    `brev create` exit (the Mode 1 false negative)."""
    try:
        return (await self.lookup_env_id(instance_name)) is not None
    except BrevError:
        return False

# After any non-zero `brev create` exit, call _workspace_was_created.
# If the workspace is in `brev ls --json`, treat the create as a
# success (log a warning) and proceed. If it's not, surface the
# original error.
```
This handles Mode 1 and partially handles Mode 3 (we now `brev delete` the orphan before retrying, so a later run doesn't collide). Mode 2 still slips through because we only retry on the literal `unexpected EOF` substring.
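Putting those pieces together, the post-failure classification we are converging on looks roughly like this (a sketch; `workspace_exists` and `looks_transient` are the hypothetical helpers sketched earlier, and the verdict strings are our convention):

```python
def classify_create_failure(name: str, stderr: str) -> str:
    """Map a non-zero `brev create` exit onto an actionable verdict."""
    if "already exists" in stderr:
        return "collide"  # Mode 3: orphan from an earlier run; adopt or delete
    if workspace_exists(name):
        return "success"  # Mode 1: the create landed; only the response was lost
    if looks_transient(stderr):
        return "retry"    # Mode 2, or Mode 1 before registration; safe to retry
    return "fail"         # genuine failure; surface the original error
```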
## What would fix this on the CLI side
Pick whichever fits the architecture best:
- **Server-side write idempotency + client retry.** If `POST /workspaces` is keyed by a client-supplied request ID, the CLI can retry on any transient (EOF, rpc deadline, …) without risking a duplicate workspace. Today's behaviour suggests the write is succeeding before the response is fully returned, so a retry would either re-fetch the result (if the API is idempotent) or surface the actual created object. A sketch of this shape follows the list.
- **CLI fallback on transient failure:** when the response is truncated mid-payload or returns an internal-rpc error, the CLI internally calls `GET /workspaces?name=<name>` to check whether the workspace was actually created, and treats that as the source of truth before failing.
- `Error: workspace 'X' already exists` could have a `--reuse-if-existing` flag (or auto-reuse) that re-emits the existing workspace's create-result JSON instead of failing — closing the Mode 1 → Mode 3 loop without operator intervention.
- At minimum, a clearer signal in the error message for Modes 1 and 2 that the failure might mean "successfully created but response truncated" — so callers know to verify via `brev ls` rather than assume failure and clean up.
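To make the first option concrete: a retry is only safe if every attempt carries the same client-generated key, so the server can deduplicate the write. A language-agnostic sketch in Python (the CLI itself is Go/resty; the `Idempotency-Key` header and its server-side semantics are assumptions, not the current Brev API):

```python
import uuid

import requests  # illustrative HTTP client; brev-cli uses resty in Go

def create_workspace_idempotent(base_url: str, org_id: str, body: dict,
                                attempts: int = 3) -> dict:
    """Retry a workspace-create POST without risking a duplicate.

    The key is generated once and resent on every attempt; a server that
    honoured it would return the original create result instead of making
    a second workspace. Hypothetical header, hypothetical semantics.
    """
    key = str(uuid.uuid4())
    last_exc: Exception = RuntimeError("no attempts made")
    for _ in range(attempts):
        try:
            resp = requests.post(
                f"{base_url}/api/organizations/{org_id}/workspaces",
                json=body,
                headers={"Idempotency-Key": key},
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:  # EOF, timeout, 5xx, ...
            last_exc = exc
    raise last_exc
```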
## Workaround for anyone hitting this today

Verify via `brev ls --json` after every non-zero `brev create` exit before treating it as a hard failure, and if the workspace exists, either reuse it or delete-and-retry.