Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
45 changes: 23 additions & 22 deletions docs/case-studies/issue-96/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,41 +30,42 @@ $ agent --version

The expected output `0.13.2` is missing between `$ agent --version` and the `✓` marker.

## Fix (v0.24.9 / PR #97)
## Environment

Two changes were made:
- macOS with screen 4.00.03 (bundled with macOS, does not support `-Logfile` option)
- `agent` installed via `bun install -g start-command` (added to `~/.bun/bin/`)
- zsh 5.9 as default shell

### 1. Add `logfile flush 0` to screen's configuration
## Fix History

For the **native logging path** (`-L -Logfile`, screen ≥ 4.5.1), screen uses a periodic
flush timer that fires every **10 seconds** by default. If the command completes and the
screen session terminates before this timer fires, output buffered inside screen's internal
`FILE*` buffer may not be flushed to the log file before the session ends.
### v0.24.9 / PR #97 (partial fix)

The fix passes a temporary screenrc with `logfile flush 0` via screen's `-c` option. This
forces screen to flush the log after every write, eliminating the race condition.
Added `logfile flush 0` screenrc for the native logging path (screen ≥ 4.5.1) and
a 50ms retry for the tee fallback (screen < 4.5.1). This fixed the issue on Linux
but **did not fix macOS** because the tee fallback had inherent reliability issues.

Before fix:
```
screen -dmS <session> -L -Logfile <logfile> <shell> -c '<command>'
```
### v0.25.0 / PR #98 (complete fix)

Replaced the entire version-dependent logging approach with a **unified
screenrc-based strategy** that works on ALL screen versions:

After fix:
```
screen -dmS <session> -c <screenrc> -L -Logfile <logfile> <shell> -c '<command>'
logfile /path/to/output.log # custom log file path
logfile flush 0 # immediate buffer flush
deflog on # enable logging for all windows
```

where `<screenrc>` contains `logfile flush 0`.

### 2. Add integration test for quick-completing commands
This eliminates both the native `-L -Logfile` path and the tee fallback,
using only screenrc directives available since early screen versions.

Added a test case specifically for the issue: `runInScreen('agent --version')` must
capture the version output correctly.
Additional improvements:
1. **Exit code capture** — real exit code from command ($? via sidecar file)
2. **Enhanced retry logic** — 3 retries with increasing delays (50/100/200ms)
3. **Debug output** — responds to both `START_DEBUG` and `START_VERBOSE`
4. **New tests** — exit code capture, stderr capture, multi-line verification

## See Also

- [root-cause.md](root-cause.md) — Detailed root cause analysis
- [timeline.md](timeline.md) — Sequence of events
- [solutions.md](solutions.md) — Solutions considered and chosen
- Related: [Case Study issue-15](../issue-15/README.md) — Original screen output capture fix
- Related: [Case Study issue-25](../issue-25/README.md) — Screen quoting fix
131 changes: 87 additions & 44 deletions docs/case-studies/issue-96/root-cause.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,38 +54,20 @@ stdio buffers are NOT flushed. This can happen when:
This makes the issue **intermittent** — it depends on the OS, screen version, and exact
timing of the signal handling.

### The fix

Add `logfile flush 0` to a temporary screenrc passed via `screen -c`:

```
logfile flush 0
```

This sets `log_flush = 0` in screen's internal variable, causing `WLogString` to call
`logfflush()` after every write:

```c
if (!log_flush)
logfflush(win->w_log); // now always executes
```

This completely eliminates the flush delay for log file writes.

---

## Root Cause 2: Tee Pipe Buffering Race (Tee Fallback, Screen < 4.5.1)

### macOS bundled screen 4.0.3

On macOS, the system-bundled GNU Screen is version 4.0.3, which predates the `-Logfile`
option (added in 4.5.1). The code falls back to:
option (added in 4.5.1). The previous code fell back to:

```js
effectiveCommand = `(${effectiveCommand}) 2>&1 | tee "${logFile}"`;
```

The screen session runs:
The screen session ran:
```
screen -dmS session /bin/zsh -c "(agent --version) 2>&1 | tee /tmp/logfile"
```
Expand All @@ -108,41 +90,102 @@ there may be a brief window where:
This is a TOCTOU (time-of-check-time-of-use) race between the session appearing gone in
`screen -ls` and the log file having its complete content.

### The fix
### Why the v0.24.9 retry fix was insufficient

The v0.24.9 fix added a single 50ms retry when the log file was empty. However, on macOS:
- The race window was larger than 50ms in many cases
- `tee` itself could be killed by screen before flushing its buffer
- The fundamental architecture of piping through tee in a detached session was inherently
unreliable for fast-completing commands

---

## Root Cause 3: Exit Code Always Reported as 0

The `logfile flush 0` fix in the screenrc does NOT directly help the tee fallback path
(since tee is an external process, not governed by screen's log flush). However, an
additional retry mechanism can be added: if the log file is empty when first read but the
session just terminated, retry the read after a brief delay to let the OS flush complete.
A secondary bug was discovered during the investigation: the screen isolation code
**always reported exit code 0** regardless of what happened inside the screen session.

In practice, on Linux the tee fallback is not used (screen >= 4.5.1 is available), and
on macOS the `logfile flush 0` option works on screen 4.0.3 as well (it's a screenrc
command, not a version-gated feature).
```js
resolve({
success: true,
exitCode: 0, // Always 0!
message: `Screen session "${sessionName}" exited with code 0`,
});
```

This meant that even if `agent --version` failed (e.g., `agent` not in PATH inside
the screen session), the user would see exit code 0 and no output — making it
impossible to diagnose whether the issue was output capture or command execution.

---

## Root Cause 4: `deflog on` Applies Only to New Windows (v0.25.0 failure on macOS)

The v0.25.0 fix used `deflog on` in screenrc to enable logging for all windows.
However, this directive only applies to windows created **AFTER** the screenrc is
processed. In `screen -dmS` mode, the default window is created **BEFORE** screenrc
processing begins.

This means:
1. `screen -dmS session -c screenrc shell -c 'command'` creates the initial window
2. Screen then processes the screenrc (including `deflog on`)
3. `deflog on` takes effect for any NEW windows, but the initial window already exists
4. The initial window (where the command runs) has logging disabled
5. All output is lost

This is why the v0.25.0 fix worked on Linux (screen 4.09.01 may handle the timing
differently or process screenrc before creating the initial window) but failed on
macOS (screen 4.00.03 strictly creates the window first).

**Source:** GNU Screen manual documents `deflog on` as setting "the default log state
for all new windows" — the word "new" is key.

---

## Why This Is Intermittent
## The Fix: `-L` Flag + Screenrc Logging Directives

The issue doesn't always reproduce because:
The fix uses the `-L` command-line flag to explicitly enable logging for the initial
window, combined with screenrc directives for configuration:

1. **On Linux with screen 4.09.01**: The normal `fclose()` path usually flushes the buffer
correctly. Only under certain timing conditions (SIGCHLD handling, `_exit`) does it fail.
```
screen -dmS session -L -c screenrc shell -c 'command'
```

2. **On macOS with screen 4.0.3**: The tee fallback's race window is very small (~1ms).
Most of the time the file has content when read. But for very fast commands with a
busy system, the window widens.
Where the screenrc contains:
```
logfile /path/to/output.log # sets custom log file path
logfile flush 0 # forces immediate flush
deflog on # enables logging for any additional windows
```

3. **The `screen -ls` timing**: The check `!sessions.includes(sessionName)` returns true
as soon as the session process exits, but the OS may still be in the middle of flushing
the log file's write buffers to the page cache.
The `-L` flag:
1. Is available on **ALL screen versions** (including macOS 4.00.03)
2. Enables logging for the **initial window** at creation time
3. Combined with `logfile <path>` in screenrc, logs to our custom path
(not the default `screenlog.0`)
4. Does NOT require the `-Logfile` CLI option (which is 4.5.1+)

This approach:
1. Uses screen's **own logging mechanism** consistently (no external tee process)
2. Sets the **log file path via screenrc** (no need for -Logfile CLI option)
3. Forces **immediate flush** (no 10-second buffer delay)
4. Enables logging for the initial window via **-L flag** (not just deflog on)
5. Works on **all screen versions** including macOS 4.00.03

Additionally:
- **Exit code capture** via `$?` saved to a sidecar file
- **Enhanced retry** with 3 attempts and increasing delays (50/100/200ms)
- **Better diagnostics** via `[screen-isolation]` debug prefix
- **`--verbose` flag** now properly enables debug output in screen isolation

---

## Test Coverage Gap
## Test Coverage

There were no tests specifically for:
1. Quick-completing commands (like `--version` flags) in screen isolation
2. Verification that version-string output (short, non-whitespace text) is captured
Tests added specifically for this issue:
1. **Version-flag command output capture**: `node --version` output must be captured
2. **Exit code capture from failed commands**: `nonexistent_command` must report non-zero
3. **Stderr capture**: `echo "test" >&2` must appear in captured output
4. **Multi-line output with correct exit code**: Multiple echo commands with exit 0

The existing tests use `echo "hello"` which is also fast but the string is longer and
may flush more reliably in certain conditions.
Both JavaScript and Rust implementations have corresponding tests.
129 changes: 84 additions & 45 deletions docs/case-studies/issue-96/solutions.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,38 +33,17 @@ reading it up to 3 times with 10ms, 20ms, 40ms delays.
- Adds complexity to the polling logic
- Still a race for very slow systems

### Option C: Use a temporary screenrc with `logfile flush 0` (CHOSEN)
### Option C: Use a temporary screenrc with `logfile flush 0` + `-L -Logfile` (v0.24.9)

**Approach:** For the native logging path (`-L -Logfile`, screen ≥ 4.5.1), create a
temporary screenrc file containing `logfile flush 0` and pass it to screen via the
`-c` option.
`-c` option. For older screen (< 4.5.1), fall back to tee with retry.

```
echo "logfile flush 0" > /tmp/screen-rc-XXXX.rc
screen -dmS session -c /tmp/screen-rc-XXXX.rc -L -Logfile logfile shell -c command
```

`logfile flush 0` sets `log_flush = 0` in screen's internal state, which causes
`WLogString()` in `ansi.c` to call `logfflush()` (equivalent to `fflush()`) after
every write. This completely eliminates the 10-second buffering delay.

**Pros:**
- Eliminates the flush delay entirely for the native logging path
- Standard approach recommended in GNU Screen documentation
- No performance impact on the command itself (the flush is async in screen's event loop)
- Clean implementation — the temp file is created and deleted by our code
- Works on all screen versions that support the `-c` option (very old)

**Cons:**
- Requires creating/deleting an additional temp file per invocation
- The `-c` option overrides the user's `~/.screenrc` — but since we only put
`logfile flush 0` in it, there's no conflict for our use case

**Note on the tee fallback path (screen < 4.5.1):**
The `logfile flush 0` setting in a screenrc also works on older screen versions (it's
a screenrc directive, not version-gated). However, for the tee fallback path, screen
itself doesn't write the log — `tee` does. So `logfile flush 0` doesn't help for
the tee path. For this path, Option B (retry) is added as a complementary fix.
This was the initial fix in PR #97 / v0.24.9, but it **did not resolve the issue on
macOS with screen 4.00.03** because:
- The tee fallback path was still unreliable
- The `logfile flush 0` screenrc only helped screen ≥ 4.5.1 (native logging)
- On macOS, the tee pipe's TOCTOU race was larger than the 50ms retry window

### Option D: Use `script` instead of screen's `-L`

Expand All @@ -79,30 +58,90 @@ uses a pty and has more reliable flush-on-exit behavior.
- Complex to handle cross-platform
- Loses screen's session management features

### Option E: Unified screenrc-based logging for ALL versions (v0.25.0 — FAILED on macOS)

**Approach:** Replace both the native `-L -Logfile` path AND the tee fallback with a
single screenrc-based approach that works on ALL screen versions:

```
logfile /path/to/output.log # sets log file path (works since early screen)
logfile flush 0 # immediate flush (no 10-second buffer)
deflog on # enable logging for all new windows
```

Then run: `screen -dmS <session> -c <screenrc> <shell> -c '<command>'`

**Result:** Failed on macOS. `deflog on` only applies to windows created AFTER
screenrc processing. In `-dmS` mode, the default window is created BEFORE screenrc
is processed. The initial window never has logging enabled.

### Option F: `-L` flag + screenrc logging directives (CHOSEN, current fix)

**Approach:** Add the `-L` command-line flag to explicitly enable logging for the
initial window, combined with screenrc directives for configuration:

```
screen -dmS <session> -L -c <screenrc> <shell> -c '<command>'
```

Where the screenrc contains:
```
logfile /path/to/output.log # sets log file path
logfile flush 0 # immediate flush
deflog on # enable logging for any additional windows
```

**Pros:**
- `-L` enables logging for the initial window at creation time (before screenrc)
- `-L` is available on ALL screen versions (including macOS 4.00.03)
- Combined with `logfile <path>` in screenrc, `-L` logs to our custom path
- `logfile flush 0` ensures immediate flush
- No need for `-Logfile` CLI option (4.5.1+) or tee fallback
- Clean, single code path for all screen versions

**Cons:**
- The `-c` option overrides the user's `~/.screenrc` — but since we only use
logging directives, there's no conflict for our detached-session use case

---

## Chosen Solution: Option C + Option B
## Chosen Solution: Option F (`-L` flag + screenrc)

**Implementation:**

For native logging path (screen ≥ 4.5.1):
```js
// Create temporary screenrc with immediate log flush
const screenrcFile = path.join(os.tmpdir(), `screenrc-${sessionName}`);
fs.writeFileSync(screenrcFile, 'logfile flush 0\n');

const logArgs = ['-dmS', sessionName, '-c', screenrcFile, '-L', '-Logfile', logFile];
screenArgs = [...logArgs, shell, shellArg, effectiveCommand];

// Clean up screenrc on completion
try { fs.unlinkSync(screenrcFile); } catch {}
// Create temporary screenrc with logging configuration
const screenrcContent = [
`logfile ${logFile}`, // custom log file path
'logfile flush 0', // immediate flush
'deflog on', // enable logging for additional windows
].join('\n');
fs.writeFileSync(screenrcFile, screenrcContent);

// Run screen with -L flag (enables logging for initial window)
// + screenrc (sets log path and flush behavior)
const screenArgs = ['-dmS', sessionName, '-L', '-c', screenrcFile, shell, shellArg, command];
```

For tee fallback path (screen < 4.5.1): add a brief retry when the log file is empty
immediately after session completion.
**Additional improvements:**
1. **Exit code capture**: The command is wrapped as `<cmd>; echo $? > /tmp/exit-file`
to capture the real exit code (previously always reported as 0)
2. **Enhanced retry logic**: 3 retries with increasing delays (50ms, 100ms, 200ms)
instead of a single 50ms retry
3. **`--verbose` flag support**: Args parser now handles `--verbose` and sets
`START_VERBOSE=1` so screen-isolation debug output is visible
4. **Lazy debug evaluation**: DEBUG is now a function (`isDebug()`) so env vars
set after module load (by `--verbose` flag) are respected

**Why this is correct:**
- Setting `logfile flush 0` is the standard, official way to control screen's log
flush behavior, documented in the GNU Screen manual
- It directly addresses the root cause (buffered writes) rather than fighting symptoms
- The retry for the tee path handles the TOCTOU race for the remaining edge case
- `-L` is the standard screen flag for enabling logging, available since early versions
- `logfile` in screenrc sets the custom path (no need for `-Logfile` CLI option)
- `-L` enables logging at window creation time (before any output is produced)
- `logfile flush 0` ensures bytes hit disk immediately
- The approach directly addresses the root cause: `deflog on` only applies to NEW
windows, but `-L` applies to the INITIAL window

**Key insight from v0.25.0 failure:** `deflog on` means "all **new** windows"
not "all windows." The default window in `-dmS` mode is not "new" from the
screenrc's perspective — it already exists when screenrc is processed. The `-L`
flag is needed to bridge this gap.
Loading
Loading