Problem Statement
The current 30-second shell timeout is causing failures for legitimate long-running commands (e.g., npm install, large file operations). We need architectural patterns to handle varying command durations without arbitrary timeout increases.
Proposed Mitigation Strategies
1. Exponential Backoff with Jitter
- Start with a short timeout and double it on each retry
- Add random jitter so that concurrent retries do not synchronize (thundering herd)
- Example: 5s → 10s → 20s → 40s with ±20% jitter
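The escalation schedule above can be sketched as follows. This is a minimal illustration, not an existing API: the function name `timeoutForAttempt`, its parameters, and the injectable `random` source are all assumptions made for the example.

```typescript
// Hypothetical helper: computes the timeout for a given retry attempt.
// attempt is 0-based; baseMs and jitterRatio mirror the 5s / ±20% example.
function timeoutForAttempt(
  attempt: number,
  baseMs: number = 5000,
  jitterRatio: number = 0.2,
  random: () => number = Math.random,
): number {
  // Nominal timeout doubles on every retry: 5s, 10s, 20s, 40s, ...
  const nominal = baseMs * Math.pow(2, attempt);
  // Map random() in [0, 1) to a factor in [1 - jitterRatio, 1 + jitterRatio).
  const factor = 1 + (random() * 2 - 1) * jitterRatio;
  return Math.round(nominal * factor);
}

// With the random source pinned to 0.5 the jitter factor is exactly 1,
// so this yields the nominal schedule 5000, 10000, 20000, 40000.
const nominalSchedule = [0, 1, 2, 3].map((attempt) =>
  timeoutForAttempt(attempt, 5000, 0.2, () => 0.5),
);
console.log(nominalSchedule);
```

Passing the random source as a parameter is a design choice for testability; production code would normally rely on the `Math.random` default.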
2. Command Decomposition
- Break long operations into smaller checkpointed steps
- Example: npm install → check cache → download packages → link dependencies
- Each step gets a timeout appropriate to its complexity
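One way to picture checkpointing is a helper that, given the last completed step, returns only the steps that still need to run. The sketch below is an assumption-laden illustration: `Step`, `remainingSteps`, and the idea of recording a checkpoint by step name are hypothetical, and the step names are borrowed from the npm install example later in this proposal.

```typescript
// Hypothetical shape of one decomposed step.
interface Step {
  step: string;
  timeout: number; // milliseconds
}

// Returns the steps that still have to run after the given checkpoint.
// A null checkpoint means nothing has completed yet.
function remainingSteps(steps: Step[], lastCompleted: string | null): Step[] {
  if (lastCompleted === null) {
    return steps.slice();
  }
  const index = steps.findIndex((s) => s.step === lastCompleted);
  if (index === -1) {
    throw new Error(`unknown checkpoint: ${lastCompleted}`);
  }
  return steps.slice(index + 1);
}

const npmInstallSteps: Step[] = [
  { step: "cache-check", timeout: 5000 },
  { step: "download", timeout: 45000 },
  { step: "link", timeout: 10000 },
];

// If "download" timed out after "cache-check" succeeded,
// only "download" and "link" are attempted again.
const todo = remainingSteps(npmInstallSteps, "cache-check");
console.log(todo.map((s) => s.step));
```

Whether a given tool actually exposes its phases as separately runnable steps is outside what this sketch assumes; it only shows the bookkeeping.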
3. Timeout Tiering by Command Type
- Fast commands (ls, pwd): 5s
- Medium commands (git status, file reads): 15s
- Heavy commands (npm install, builds): 60s+
- Commands declare their tier via metadata
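A tier lookup along these lines could be as small as the sketch below. The registry `COMMAND_TIERS`, the function `timeoutFor`, and the choice to fall back to the medium tier for unknown commands are assumptions for illustration. The heavy tier uses 60000 ms as a floor, since the proposal only says "60s+".

```typescript
type Tier = "fast" | "medium" | "heavy";

// Timeouts per tier in milliseconds, taken from the list above.
const TIER_TIMEOUT_MS: Record<Tier, number> = {
  fast: 5000,
  medium: 15000,
  heavy: 60000, // assumed floor for "60s+"
};

// Hypothetical metadata registry: how commands declare their tier.
const COMMAND_TIERS = new Map<string, Tier>([
  ["ls", "fast"],
  ["pwd", "fast"],
  ["git status", "medium"],
  ["npm install", "heavy"],
]);

// Resolves a command to its timeout; unknown commands use the fallback tier
// (an assumption, the proposal does not specify a default).
function timeoutFor(command: string, fallback: Tier = "medium"): number {
  const tier = COMMAND_TIERS.get(command) ?? fallback;
  return TIER_TIMEOUT_MS[tier];
}

console.log(timeoutFor("npm install"));
console.log(timeoutFor("some-unregistered-command"));
```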
4. Circuit Breaker Pattern
- Track command failure rates
- Open circuit after N consecutive timeouts
- Prevent cascading failures
- Half-open state for gradual recovery
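The state transitions above can be sketched as a small class. Everything here is hypothetical: the class name `TimeoutCircuitBreaker`, the `threshold` and `cooldownMs` parameters, and the injectable clock are illustration choices. The sketch also simplifies the half-open state by not limiting how many trial commands may run before a result is recorded.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class TimeoutCircuitBreaker {
  private consecutiveTimeouts = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;   // N consecutive timeouts before opening
    this.cooldownMs = cooldownMs; // how long the circuit stays fully open
    this.now = now;               // injectable clock, useful in tests
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the cooldown, allow trial commands through (half-open).
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  canRun(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    // Any success closes the circuit and resets the counter.
    this.consecutiveTimeouts = 0;
    this.openedAt = null;
  }

  recordTimeout(): void {
    if (this.state() === "half-open") {
      // The trial command failed: reopen and restart the cooldown.
      this.openedAt = this.now();
      return;
    }
    this.consecutiveTimeouts += 1;
    if (this.consecutiveTimeouts >= this.threshold) {
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `canRun()` before dispatching a command and report the outcome with `recordSuccess()` or `recordTimeout()`.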
Concrete Example: npm install
Current behavior: Hits the 30s timeout on large dependency trees
Proposed solution:
{
  command: 'npm install',
  timeoutTier: 'heavy',
  baseTimeout: 60000,
  retryStrategy: 'exponential',
  decompose: [
    { step: 'cache-check', timeout: 5000 },
    { step: 'download', timeout: 45000 },
    { step: 'link', timeout: 10000 }
  ]
}
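To catch inconsistent configurations early, the object above could be given a type and a simple budget check. This is only a sketch: the interface names, the extra `'none'` retry strategy, and the rule that step timeouts should fit within `baseTimeout` are assumptions, not something the proposal states.

```typescript
interface StepConfig {
  step: string;
  timeout: number; // milliseconds
}

interface CommandConfig {
  command: string;
  timeoutTier: "fast" | "medium" | "heavy";
  baseTimeout: number; // milliseconds
  retryStrategy: "none" | "exponential"; // "none" is an assumed extra value
  decompose: StepConfig[];
}

// Returns by how many milliseconds the step timeouts exceed baseTimeout
// (0 when they fit). Assumes steps run sequentially within one budget.
function stepBudgetOverrun(config: CommandConfig): number {
  const total = config.decompose.reduce((sum, s) => sum + s.timeout, 0);
  return Math.max(0, total - config.baseTimeout);
}

const npmInstallConfig: CommandConfig = {
  command: "npm install",
  timeoutTier: "heavy",
  baseTimeout: 60000,
  retryStrategy: "exponential",
  decompose: [
    { step: "cache-check", timeout: 5000 },
    { step: "download", timeout: 45000 },
    { step: "link", timeout: 10000 },
  ],
};

// 5000 + 45000 + 10000 equals the 60000 ms budget, so the overrun is 0.
console.log(stepBudgetOverrun(npmInstallConfig));
```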
Required Metrics for Configuration
To properly configure timeouts, we need:
- Command duration histograms - distribution of actual execution times
- 95th/99th percentile durations - understand outliers vs typical cases
- Timeout frequency by command type - which commands fail most often
- Success rates across timeout thresholds - find optimal values
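Given a list of recorded durations, the percentile and success-rate figures can be computed as in the sketch below. The function names are hypothetical, and the nearest-rank percentile definition is one common choice among several. Note one caveat: commands killed by the current 30s timeout never report their true duration, so their samples are cut off at the timeout and would need to be collected with a longer limit.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Fraction of samples that would have finished within the given timeout.
function successRateAt(samplesMs: number[], timeoutMs: number): number {
  if (samplesMs.length === 0) throw new Error("no samples");
  const finished = samplesMs.filter((d) => d <= timeoutMs).length;
  return finished / samplesMs.length;
}

// Made-up durations in milliseconds, for illustration only.
const exampleDurations = [10, 20, 30, 40, 50];
console.log(percentile(exampleDurations, 50));
console.log(percentile(exampleDurations, 95));
console.log(successRateAt(exampleDurations, 30));
```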
Implementation Phases
Phase 1: Collect metrics on current command durations
Phase 2: Implement timeout tiering for known command types
Phase 3: Add exponential backoff for retryable commands
Phase 4: Implement circuit breaker for system-wide resilience
Related Discussions
This issue emerged from a chat discussion about shell command reliability and the need for more sophisticated timeout handling than a single global value.