Fix memory leaks and race conditions causing OOM shutdowns #24

Open
TheMarstonConnell wants to merge 5 commits into main from marston/fix-memory-leaks-and-crashes

@TheMarstonConnell
Member

Summary

  • Queue race conditions: Added a sync.Mutex to protect the messages, stopped, and running fields, which were accessed from multiple goroutines without synchronization
  • Infinite retry loop: popAndPost() retried forever if the blockchain was unreachable. Retries are now capped at 10, after which waiting goroutines are released with an error
  • Unbounded queue growth: Post() now rejects new messages when the queue exceeds 500, providing backpressure instead of growing until OOM
  • Goroutine leaks from *gin.Context: ProcessFile, PostFile, and UpdateCollection no longer receive *gin.Context in background goroutines; their dependencies are extracted before the goroutine is launched
  • http.DefaultClient mutation: Replaced with package-level clients (uploadClient, cloneClient) that have 120s timeouts
  • Clone handlers with no timeout: http.Get() calls replaced with context-aware requests using cloneClient
  • No DB connection pool limits: Set MaxOpenConns(25), MaxIdleConns(5), and ConnMaxLifetime(5min)
  • Storage purchaser silent exit: Changed return to continue on query error, and added COALESCE so SUM queries handle empty tables
  • GetUsage returning nil, nil: Now returns a proper error when no customer is found
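The first three fixes in the summary (mutex-guarded state, bounded queue, capped retries) can be illustrated with a minimal sketch. This is not the PR's code: `boundedQueue`, `ErrQueueFull`, and `sendWithRetry` are hypothetical names, and treating exactly 500 entries as "full" is an assumption. Only the 500 and 10 limits come from the summary.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const (
	maxQueueSize = 500 // queue limit quoted in the summary
	maxRetries   = 10  // retry cap quoted in the summary
)

// ErrQueueFull is a hypothetical sentinel error; the PR may name it differently.
var ErrQueueFull = errors.New("queue is full")

// boundedQueue is a simplified stand-in for the PR's Queue type.
type boundedQueue struct {
	mu       sync.Mutex // guards messages
	messages []string
}

// Post appends msg, or fails fast once the queue is full.
// Treating exactly maxQueueSize entries as "full" is an assumption.
func (q *boundedQueue) Post(msg string) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.messages) >= maxQueueSize {
		return ErrQueueFull
	}
	q.messages = append(q.messages, msg)
	return nil
}

// Size reports the current queue length under the same lock.
func (q *boundedQueue) Size() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.messages)
}

// sendWithRetry calls send until it succeeds or maxRetries attempts have
// failed, then returns the last error so callers can release any waiters.
func sendWithRetry(send func() error) error {
	var err error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err = send(); err == nil {
			return nil
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxRetries, err)
}
```

Returning an error from `Post` lets the HTTP layer fail the request immediately instead of blocking, which is the backpressure behaviour the summary describes.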

Test plan

  • go build ./... passes (verified)
  • go vet ./... passes (verified)
  • Deploy to staging and monitor memory usage over 24h; gradual memory growth should no longer appear
  • Test upload flow end-to-end (single file, multi-file, clone from URL, IPFS clone)
  • Test collection operations (add/remove files, nested collections)
  • Verify queue backpressure: when the queue is full, the API returns an error instead of hanging
  • Simulate blockchain downtime: confirm queue stops retrying after 10 attempts instead of looping forever

🤖 Generated with Claude Code

TheMarstonConnell and others added 5 commits February 8, 2026 15:42
…utdowns

- Add mutex to Queue struct to prevent data races on messages/stopped/running
- Cap broadcast retries at 10 and queue size at 500 to prevent unbounded growth
- Stop passing *gin.Context to background goroutines (causes leaks after handler returns)
- Replace http.DefaultClient mutation with package-level clients with timeouts
- Add DB connection pool limits (25 open, 5 idle, 5min lifetime)
- Fix storage purchaser goroutine silently exiting on query error
- Fix GetUsage returning nil error when no customer found
- Use COALESCE in SUM queries to handle empty tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Queue tests (17 tests):
- NewQueue initialization, Stop, Size, TooBusy threshold
- Post backpressure: rejects when full (500), accepts when not full
- Post blocks until released, returns error from MsgHolder
- Concurrent Post/Size/TooBusy/Stop with race detector
- popAndPost returns on empty queue, batches at most 20 messages
- Listen goroutine exits when stopped
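The "Post blocks until released, returns error from MsgHolder" behaviour tested above can be pictured with a small sketch. The internals of the real MsgHolder are not shown in this PR page, so `msgHolder`, `release`, `wait`, and `errBroadcastFailed` below are hypothetical and only illustrate one way to park a caller until the queue reports an outcome.

```go
package main

import (
	"errors"
	"sync"
)

// errBroadcastFailed is a hypothetical error used for the demonstration.
var errBroadcastFailed = errors.New("broadcast failed")

// msgHolder is a simplified guess at the PR's MsgHolder: it parks the
// caller of Post until the queue reports the outcome of the broadcast.
type msgHolder struct {
	done chan struct{}
	once sync.Once
	err  error
}

func newMsgHolder() *msgHolder {
	return &msgHolder{done: make(chan struct{})}
}

// release records the outcome and wakes every waiter; later calls are no-ops.
func (m *msgHolder) release(err error) {
	m.once.Do(func() {
		m.err = err
		close(m.done)
	})
}

// wait blocks until release has been called, then returns its error.
func (m *msgHolder) wait() error {
	<-m.done
	return m.err
}
```

Closing the channel inside `sync.Once` means a holder can be released from the retry-exhausted path and the success path without risking a double-close panic. Tests of this kind are typically run with `go test -race` so the race detector checks the hand-off.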

GetUsage tests (5 tests):
- Returns error (not nil) when no customer found
- Success path with correct byte calculations
- Handles empty files table via COALESCE
- Handles missing subscription
- Handles customer query failure

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ents, and uploads

Tests cover auth checks, SQL query correctness, pagination wiring, error handling,
and edge cases (empty results, not-found, DB errors) using httptest + go-sqlmock.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>