## What

Three intertwined transient-failure modes from `brev create` keep tripping our automation. They're related — each one's downstream consequence makes the next one likelier — and a unified "verify the workspace state before treating create as failed" path on the CLI side would knock out all three.
### Mode 1 — `unexpected EOF`, rc=1 even after the workspace was created

`brev create <name> --type <T>` exits non-zero with a stderr body like:
```
WARN RESTY Post "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/<org-id>/workspaces?cli_version=v0.6.323&local=true&os=linux&utm_source=cli": unexpected EOF, Attempt 1
ERROR RESTY Post "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/<org-id>/workspaces?cli_version=v0.6.323&local=true&os=linux&utm_source=cli": unexpected EOF
[Worker 1] m8i-flex.2xlarge Failed: ... Post "...": unexpected EOF
Warning: Only created 0/1 instances
could only create 0/1 instances
```
…even though the workspace was actually created — it shows up in `brev ls --json` immediately after with `status=DEPLOYING` (then `RUNNING`) and is fully usable. The non-zero exit is a false negative: the POST appears to have completed server-side but the client got an EOF before reading the success response.
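For anyone scripting around this today, the verification is cheap enough to run after every failed create. A minimal sketch, assuming `brev ls --json` prints a JSON array of workspace objects with a `name` field (the array-of-objects shape is an assumption; the command and the `status` field are confirmed above):

```python
import json
import subprocess

def workspace_exists(name: str) -> bool:
    """Check server-side state instead of trusting the create exit code.

    Assumes `brev ls --json` emits a JSON array of workspace objects
    keyed by "name"; adjust the key to the schema your CLI version emits.
    """
    out = subprocess.run(
        ["brev", "ls", "--json"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return any(ws.get("name") == name for ws in json.loads(out))
```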
### Mode 2 — `rpc error: code = Internal desc = context deadline exceeded`

Less common but observed today on `m8i-flex.2xlarge`:
```
[Worker 1] Trying m8i-flex.2xlarge for instance 'gr-ngc'...
[Worker 1] m8i-flex.2xlarge Failed: [error]
github.com/brevdev/brev-cli/pkg/cmd/gpucreate.(*createContext).createWorkspace
	/go/src/github.com/brevdev/brev-cli/pkg/cmd/gpucreate/gpucreate.go:999
: rpc error: code = Internal desc = context deadline exceeded
Type m8i-flex.2xlarge had failures, trying next type...
Warning: Only created 0/1 instances
could only create 0/1 instances.
```
The error path looks like an AWS-side RPC timing out before the workspace registration completes. Same false-negative shape as Mode 1 — the workspace can still end up created server-side — but with a different wire error, so any retry predicate keyed only on `unexpected EOF` misses it.
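The caller-side fix for that is to widen the predicate. A minimal sketch; the marker list below is just what we have observed in Modes 1 and 2, not an exhaustive taxonomy of Brev transients:

```python
# Substrings seen in Modes 1 and 2; extend as new transients show up.
TRANSIENT_MARKERS = (
    "unexpected EOF",             # Mode 1: truncated response
    "context deadline exceeded",  # Mode 2: internal rpc timeout
)

def looks_transient(stderr: str) -> bool:
    """True if a failed `brev create` stderr matches a known transient."""
    return any(marker in stderr for marker in TRANSIENT_MARKERS)
```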
### Mode 3 — `Error: workspace 'X' already exists` after a previous client-side timeout

The downstream consequence of the other two: a previous `brev create` on the same name appeared to fail to the caller, the caller cleaned up its own DB row but didn't `brev delete`, and the next run hits:
```
[Worker 1] m8i-flex.2xlarge Failed: [error]
github.com/brevdev/brev-cli/pkg/cmd/gpucreate.(*createContext).createWorkspace
	/go/src/github.com/brevdev/brev-cli/pkg/cmd/gpucreate/gpucreate.go:999
: duplicate workspace with name gr-validate
Error: workspace 'gr-validate' already exists. Use a different name or delete the existing workspace
```
This one is technically the operator's fault (failed to clean up the orphan), but the orphan only existed because of Mode 1 / Mode 2 in the first place — and the error message doesn't tell the caller "this is the same workspace your previous attempt actually succeeded in creating, you can attach to it." Treating the existing workspace as recoverable would close the loop.
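Until then, the caller has to resolve the collision itself. A sketch of the two sane options, assuming the name is unique to your pipeline so the colliding workspace really is your own orphan (the `adopt` convention is ours, not a CLI feature):

```python
import subprocess

def resolve_name_collision(name: str, adopt: bool = True) -> None:
    """Recover from `workspace 'X' already exists` after a false-negative create.

    adopt=True: treat the existing workspace as the successful result of
    the earlier create and carry on (it is the same workspace).
    adopt=False: delete the orphan so the next `brev create <name>`
    starts clean.
    """
    if adopt:
        return  # proceed as if the original create had succeeded
    subprocess.run(["brev", "delete", name], check=True)
```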
## Today's occurrences

Single 8-hour window, single Brev org (`vanguard-programming`), AWS `m8i-flex.2xlarge`. All `brev create` invocations from `gateroom_manager.brev_driver._dispatch_brev_create` running as root from a systemd unit. CLI version `v0.6.323`.
| time (UTC) | instance | env id | mode | notes |
|---|---|---|---|---|
| 17:01:58 | `gr-ngc` | - | 1 | retry attempt 1/3, workspace existed |
| 17:03:17 | `gr-ngc` | - | 1 | retry attempt 2/3, workspace not in ls |
| 17:03:30 | `gr-ngc` | - | 2 | rpc context deadline; attempt 3/3 final |
| 17:29:08 | `gr-ngc2` | - | 1 | retry attempt 1/3 (parallel-spawn batch) |
| 17:29:08 | `gr-ngc3` | - | 1 | retry attempt 1/3 (parallel-spawn batch) |
| 21:08:43 | `gr-validate` | `si3jsvf4t` | 3 | orphan from prior client-side timeout |
The successful spawns from this window — `q01vizlja` (`gr-validate`), `1nmqlnm60` (`gr-self-dev`), `b08jjhe8k` (`gr-ngc2`), `5ohm8qti5` (`gr-ngc3`), `ygg2q2r5c` (`gr-ngc`) — all came up cleanly once they got past whichever transient failure they hit first.
## Why it matters for automation

Downstream tooling that treats `brev create`'s exit code as ground truth concludes "create failed" and then does one of three things:

- Tries to clean up by calling `brev delete <name>` (which succeeds, terminating the perfectly good workspace).
- Marks an internal record (DB row, etc.) as destroyed/failed, leaving an orphan VM in Brev that no automation will reconcile.
- Bails out of the whole pipeline before whatever was supposed to run on the new workspace ever starts.

…and on the next attempt, hits Mode 3 because the orphan is still up.
## Workarounds we ship today

```python
async def _workspace_was_created(self, instance_name: str) -> bool:
    """True if the workspace is visible server-side despite a non-zero
    `brev create` exit (the Mode 1 false negative)."""
    try:
        return (await self.lookup_env_id(instance_name)) is not None
    except BrevError:
        return False

# After any non-zero `brev create` exit, call _workspace_was_created.
# If the workspace is in `brev ls --json`, treat the create as a
# success (log a warning) and proceed. If it's not, surface the
# original error.
```
This handles Mode 1 and partially handles Mode 3 (we now `brev delete` the orphan before retrying, so a later run doesn't collide). Mode 2 still slips through because we only retry on the literal `unexpected EOF` substring.
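Putting those pieces together, the post-failure classification we are converging on looks roughly like this (a sketch; `workspace_exists` and `looks_transient` are the hypothetical helpers sketched earlier, and the verdict strings are our convention):

```python
def classify_create_failure(name: str, stderr: str) -> str:
    """Map a non-zero `brev create` exit onto an actionable verdict."""
    if "already exists" in stderr:
        return "collide"  # Mode 3: orphan from an earlier run; adopt or delete
    if workspace_exists(name):
        return "success"  # Mode 1: the create landed; only the response was lost
    if looks_transient(stderr):
        return "retry"    # Mode 2, or Mode 1 before registration; safe to retry
    return "fail"         # genuine failure; surface the original error
```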
## What would fix this on the CLI side
Pick whichever fits the architecture best:
- **Server-side write idempotency + client retry.** If `POST /workspaces` is keyed by a client-supplied request ID, the CLI can retry on any transient (EOF, rpc deadline, …) without risking a duplicate workspace. Today's behaviour suggests the write is succeeding before the response is fully returned, so a retry would either re-fetch the result (if the API is idempotent) or surface the actual created object. A sketch of this shape follows the list.
- **CLI fallback on transient failure:** when the response is truncated mid-payload or returns an internal-rpc error, the CLI internally calls `GET /workspaces?name=<name>` to check whether the workspace was actually created, and treats that as the source of truth before failing.
- `Error: workspace 'X' already exists` could have a `--reuse-if-existing` flag (or auto-reuse) that re-emits the existing workspace's create-result JSON instead of failing — closing the Mode 1 → Mode 3 loop without operator intervention.
- At minimum, a clearer signal in the error message for Modes 1 and 2 that the failure might mean "successfully created but response truncated" — so callers know to verify via `brev ls` rather than assume failure and clean up.
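To make the first option concrete: a retry is only safe if every attempt carries the same client-generated key, so the server can deduplicate the write. A language-agnostic sketch in Python (the CLI itself is Go/resty; the `Idempotency-Key` header and its server-side semantics are assumptions, not the current Brev API):

```python
import uuid

import requests  # illustrative HTTP client; brev-cli uses resty in Go

def create_workspace_idempotent(base_url: str, org_id: str, body: dict,
                                attempts: int = 3) -> dict:
    """Retry a workspace-create POST without risking a duplicate.

    The key is generated once and resent on every attempt; a server that
    honoured it would return the original create result instead of making
    a second workspace. Hypothetical header, hypothetical semantics.
    """
    key = str(uuid.uuid4())
    last_exc: Exception = RuntimeError("no attempts made")
    for _ in range(attempts):
        try:
            resp = requests.post(
                f"{base_url}/api/organizations/{org_id}/workspaces",
                json=body,
                headers={"Idempotency-Key": key},
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:  # EOF, timeout, 5xx, ...
            last_exc = exc
    raise last_exc
```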
## Workaround for anyone hitting this today

Verify via `brev ls --json` after every non-zero `brev create` exit before treating it as a hard failure, and if the workspace exists, either reuse it or delete-and-retry.