
fix(brainbar): harden reliability under db contention #276

Merged
EtanHey merged 1 commit into main from fix/brainbar-reliability-A on May 3, 2026

Conversation

EtanHey (Owner) commented May 3, 2026

Summary

  • Enforce a single BrainBar instance with a nonblocking socket-side flock; duplicate app launches now terminate instead of competing for /tmp/brainbar.sock.
  • Start the MCP socket/router before database write readiness, and move startup DB open/retry work off the socket queue so initialize and tools/list respond during migration/DDL lock contention.
  • Add a Swift brain_store write-budget fallback to the existing durable pending-store queue, including a regression test covering R02's write burst followed by brain_search staying responsive.

Context

Tests

  • swift test --package-path brain-bar --filter BrainBarReliabilityTests — 4 tests, 0 failures.
  • swift test --package-path brain-bar — 345 tests, 0 failures.
  • ./scripts/run_tests.sh — BrainLayer test gate passed: Python unit suite 1823 passed, 9 skipped, 75 deselected, 1 xfailed; MCP registration 3 passed; isolated eval/hook routing 32 passed; bun 1 passed; FTS5 regression PASS.
  • Pre-push hook reran the repo gate and passed with the same Python/MCP/eval/bun/regression results.

@codex review
@cursor review
@BugBot review


Note

Medium Risk
Touches BrainBar startup/server lifecycle and SQLite write paths; incorrect locking or async DB open coordination could prevent startup or lead to missed/duplicate writes under contention.

Overview
BrainBar now starts the MCP socket/router before the SQLite database is ready, moving database open/retry work off the socket queue so initialize/tools/list remain responsive while the DB is locked or recovering.
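The startup ordering above can be modeled as a toy server whose request handler is live immediately while the database opens on a background thread. This is a Python sketch under assumptions: MiniServer and its callback are illustrative stand-ins, not the real BrainBarServer API; the key property shown is that protocol handshakes never wait on the database, and DB-dependent calls fail loudly rather than hang:

```python
import threading
import time

class MiniServer:
    """Toy model of the startup ordering: the handler answers immediately;
    the (possibly slow) database open happens on a background thread and
    assigns self.db when done."""

    def __init__(self, open_db, on_database_ready=None):
        self.db = None
        self._open_db = open_db
        self._on_ready = on_database_ready
        threading.Thread(target=self._open_async, daemon=True).start()

    def _open_async(self):
        self.db = self._open_db()          # may block on locks/migrations
        if self._on_ready:
            self._on_ready(self.db)

    def handle(self, request):
        method = request.get("method")
        if method in ("initialize", "tools/list"):
            # Protocol handshake never waits on the database.
            return {"ok": True, "method": method}
        if self.db is None:
            # DB-dependent calls raise explicitly instead of hanging
            # or silently returning defaults.
            raise RuntimeError("database not ready")
        return {"ok": True, "db": True}
```

The explicit RuntimeError mirrors the repo convention that tool handlers must throw when the database is nil due to startup ordering, rather than return silent defaults.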

Adds single-instance enforcement via a non-blocking flock lockfile (BrainBarInstanceLock) owned by BrainBarServer, with an onStartRejected callback the app uses to terminate duplicate launches.

Hardens write behavior under contention by adding BrainDatabase.storeOrQueueWithinBudget (short busy-timeout, no retries) and switching brain_store to queue to the pending-store file when the DB is busy; includes new reliability tests covering instance locking, startup responsiveness during write locks, queued writes, and read-path responsiveness during queued write bursts.

Reviewed by Cursor Bugbot for commit d059427. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Harden BrainBar reliability under database contention with single-instance locking and queued writes

  • Adds BrainBarInstanceLock, a file-based single-instance lock using flock, so secondary app instances are rejected immediately via onStartRejected rather than competing for resources.
  • Refactors BrainBarServer to start with a dbPath and open the database asynchronously after the socket is bound, preventing startup hangs under write contention.
  • Adds storeOrQueueWithinBudget to BrainDatabase, which falls back to queuing writes when the database is busy, and returns a StoreWriteOutcome so callers can distinguish immediate vs. queued writes.
  • Updates the brain_store MCP tool handler in MCPRouter to use the budgeted write path and return queued=true in the response when a write is deferred.
  • Adds reliability tests in BrainBarReliabilityTests covering lock rejection, initialize responsiveness under write locks, and read-path stability during queued write bursts.
📊 Macroscope summarized d059427. 5 files reviewed, 6 issues evaluated, 1 issue filtered, 1 comment posted

🗂️ Filtered Issues

brain-bar/Sources/BrainBar/BrainBarServer.swift — 1 comment posted, 4 evaluated, 1 filtered
  • line 552: Misleading error message in handleSubscribeTool. The guard on lines 552-555 checks agentID, tags, AND database, but always returns "Database not available" regardless of which condition failed. If agent_id or tags is missing/invalid, the client receives an incorrect error message. [ Out of scope ]

Summary by CodeRabbit

Release Notes

  • New Features

    • Prevents multiple app instances from running simultaneously
    • Intelligent database write queuing when the database is busy
    • Improved timeout handling for database operations
  • Bug Fixes

    • Optimized startup sequence for faster initialization
    • Better handling of database contention scenarios
  • Tests

    • Added reliability tests for instance locking and write-lock behavior



coderabbitai Bot commented May 3, 2026

📝 Walkthrough

This PR introduces single-instance locking via OS-level file locks and makes BrainBar resilient to database write-locks during startup. A new BrainBarInstanceLock type enforces exclusive process ownership; BrainBarServer acquires the lock at startup and rejects via onStartRejected if another instance runs. Database opening is refactored to be asynchronous and non-blocking. BrainDatabase adds storeOrQueueWithinBudget() to queue writes when the database is under contention, and the startup flow defers database readiness via server.onDatabaseReady rather than blocking initialization.

Changes

Single-Instance Locking

  • Lock Mechanism (BrainBarInstanceLock.swift): New type providing OS-level exclusive flock-based locking. acquire(lockPath:) ensures the directory exists, opens or creates the lock file, attempts a non-blocking exclusive lock (throwing alreadyRunning on conflict), writes the PID, and fsyncs. release() is idempotent and cleans up the lock file.
  • Server Lock Integration (BrainBarServer.swift): The constructor accepts an optional instanceLockPath; startOnQueue() attempts lock acquisition before server setup and invokes the onStartRejected callback if another instance is detected. Shutdown releases the instance lock state.
  • App Startup Wiring (BrainBarApp.swift): Sets server.onStartRejected to terminate the app immediately if the startup lock fails, preventing duplicate instances from initializing.
  • Reliability Tests (BrainBarReliabilityTests.swift): testSingleInstanceLockRejectsSecondOwner verifies that lock acquisition blocks a second owner and throws alreadyRunning.

Write-Lock Resilience & Async Database Opening

  • Database Write Queuing (BrainDatabase.swift): Introduces the StoreWriteOutcome enum (.stored or .queued). Extends store(...) to accept configurable retries and a per-call busyTimeoutMillis (overriding the SQLite PRAGMA busy_timeout). Adds storeOrQueueWithinBudget(...), which attempts store() with retries: 0; on retryable lock errors it enqueues via queuePendingStore() and returns .queued instead of throwing.
  • Server Async DB Opening (BrainBarServer.swift): Database opening refactored to be asynchronous: attemptDatabaseOpen() initiates the open on a background queue when needed; finishDatabaseOpen() runs on the server queue and closes the DB only if the server is no longer listening and no external DB was provided. Adds a databaseOpenInProgress flag to guard against concurrent attempts.
  • App Startup Flow (BrainBarApp.swift): Removes the blocking background-queue DB open; the app now calls server.start() immediately. Runtime installation, database assignment, and URL flushing move into the server.onDatabaseReady callback (executed on @MainActor). The legacy menu-bar surface and quick-capture hotkey setup now run after server.start() without blocking on DB readiness.
  • Request Routing (MCPRouter.swift): handleBrainStore replaces exception-driven control flow (do/catch plus conditional queueing) with switch-based outcome handling: the .stored path returns queued: false with flush metadata; the .queued path returns queued: true without metadata.
  • Reliability Tests (BrainBarReliabilityTests.swift): testInitializeReturnsImmediatelyWhileStartupDatabaseIsWriteLocked verifies the server responds to initialize requests quickly even while the DB is write-locked. testBrainStoreQueuesWithinBudgetWhenDatabaseWriteLockIsHeld confirms brain_store returns queued: true under lock. testQueuedWriteBurstDoesNotTakeReadPathDown ensures a burst of queued writes does not break the read path (brain_search).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

The PR introduces a new concurrency-aware locking primitive (BrainBarInstanceLock with flock semantics), refactors server-database lifecycle management into an asynchronous, non-blocking flow, and introduces a new queuing strategy (storeOrQueueWithinBudget) that replaces exception-driven logic with enum-based outcome handling. The changes span multiple core files with nuanced state management (databaseOpenInProgress, instanceLock lifecycle, lock release in shutdown), plus detailed test utilities (socket framing, timeout handling, SQLite lock seeding). Logic is moderately dense and state interactions require careful reasoning across app, server, and database layers.


Poem

🔐 A lock and a queue, both born in this change,
One guards the gates so instances don't range,
The other queues writes when the DB's too tight,
Now startup's resilient—no more freezing in flight! 🐰✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'fix(brainbar): harden reliability under db contention' accurately summarizes the main change: hardening BrainBar reliability by addressing database contention through instance locking, responsive socket startup, and write queuing.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.



@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

guard bindResult == 0 else {
    NSLog("[BrainBar] Failed to bind: errno %d", errno)
    close(fd)
    return
}

P2: Release instance lock on socket bind/listen startup failure

After acquiring the singleton lock, several startup failure branches return early without cleanup (for example on bind failure). In those cases this process keeps instanceLock held even though the server never starts listening, so later relaunch attempts are rejected as "already running" until this broken process exits. These error paths should release the lock (or call a shared cleanup path) before returning.


guard shouldRelease else { return }
flock(fd, LOCK_UN)
close(fd)
try? FileManager.default.removeItem(atPath: lockPath)

P1: Keep lockfile inode stable while another instance may acquire

release() unlinks lockPath after unlocking, which breaks flock-based singleton guarantees during handoff: if a second process acquires the lock between LOCK_UN and removeItem, this code can delete the inode that process is locking, allowing a third process to recreate the path and acquire a separate lock concurrently. For flock locks, the lock file should remain in place (or at least not be removed in this sequence) to avoid split-brain ownership.



🟢 Low

private func handleMessage(fd: Int32, request: [String: Any]) -> [String: Any] {

handleMessage intercepts brain_subscribe, brain_unsubscribe, and brain_ack before they reach MCPRouter.handleToolsCall, skipping the validate(arguments:for:) check. This lets clients send oversized agent_id strings or unbounded tags arrays that bypass the maxLength/maxItems limits defined in limitedInputSchema, risking database storage failures or memory pressure.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file brain-bar/Sources/BrainBar/BrainBarServer.swift around line 417:

`handleMessage` intercepts `brain_subscribe`, `brain_unsubscribe`, and `brain_ack` before they reach `MCPRouter.handleToolsCall`, skipping the `validate(arguments:for:)` check. This lets clients send oversized `agent_id` strings or unbounded `tags` arrays that bypass the `maxLength`/`maxItems` limits defined in `limitedInputSchema`, risking database storage failures or memory pressure.

Evidence trail:
BrainBarServer.swift lines 417-425: handleMessage intercepts brain_subscribe/brain_unsubscribe/brain_ack before router.handle(). MCPRouter.swift line 148: validate(arguments:for:) is called in the router path. MCPRouter.swift lines 703-762: validate() enforces maxLength/maxItems from schema. MCPRouter.swift lines 659-700: limitedInputSchema adds default maxLength/maxItems. MCPRouter.swift lines 988-1030: brain_subscribe/brain_unsubscribe/brain_ack schemas use limitedInputSchema. BrainBarServer.swift lines 551-640: handleSubscribeTool/handleUnsubscribeTool/handleAckTool have no validation of argument sizes.
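A minimal illustration of the missing check, in Python (validate_limits is a hypothetical stand-in for MCPRouter's validate(arguments:for:)): enforce the schema's maxLength/maxItems limits before dispatching to any handler, so intercepted tools get the same protection as routed ones:

```python
def validate_limits(arguments: dict, schema: dict):
    """Enforce per-property maxLength on strings and maxItems on arrays,
    mirroring what a router-side validation pass would do before any
    tool handler sees the request."""
    for name, rules in schema.get("properties", {}).items():
        if name not in arguments:
            continue
        value = arguments[name]
        max_len = rules.get("maxLength")
        if max_len is not None and isinstance(value, str) and len(value) > max_len:
            raise ValueError(f"{name} exceeds maxLength {max_len}")
        max_items = rules.get("maxItems")
        if max_items is not None and isinstance(value, list) and len(value) > max_items:
            raise ValueError(f"{name} exceeds maxItems {max_items}")
```

The fix implied by the finding is simply to run this pass (or route through the validating path) for brain_subscribe, brain_unsubscribe, and brain_ack as well.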


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit d059427. Configure here.

    if let previousBusyTimeout {
        try? executeOnHandle(db, sql: "PRAGMA busy_timeout = \(previousBusyTimeout)")
    }
}

Busy timeout permanently lowered if pragma query fails

Medium Severity

The store() method's defer block only restores the previous busy_timeout when previousBusyTimeout is non-nil. Since previousBusyTimeout is computed via busyTimeoutMillis.flatMap { _ in queryPragma(…) }, it becomes nil if queryPragma fails — yet the PRAGMA is still changed to the low budget value (e.g., 50ms). In that case, the defer skips restoration and the shared database connection keeps the 50ms timeout permanently. All subsequent write operations under any contention would then fail far too quickly instead of using the configured 30-second timeout.
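One way to avoid the skipped restoration is to read the previous value before changing anything and restore it unconditionally in a finally block, so no failure mode can strand the shared connection on the short budget. A Python/sqlite3 sketch of that shape (store_with_budget and its schema are illustrative, not the repo's API):

```python
import sqlite3

def store_with_budget(conn: sqlite3.Connection, text: str, budget_ms: int):
    """Temporarily lower busy_timeout for one write, then restore it
    unconditionally. The previous value is read before any change, so
    a failure can never leave the connection stuck on the short budget."""
    previous_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
    conn.execute(f"PRAGMA busy_timeout = {max(1, budget_ms)}")
    try:
        with conn:
            conn.execute("CREATE TABLE IF NOT EXISTS notes(text TEXT)")
            conn.execute("INSERT INTO notes(text) VALUES (?)", (text,))
    finally:
        # Restore even when the write throws.
        conn.execute(f"PRAGMA busy_timeout = {previous_ms}")
```

CodeRabbit's later comment goes further and suggests a dedicated short-lived connection for budgeted writes, which sidesteps the shared-handle mutation altogether.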




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@brain-bar/Sources/BrainBar/BrainBarApp.swift`:
- Around line 112-120: server.start() returns before the server's lock check
completes causing installLegacyMenuBarSurface(with:), collector.start(), and
configureQuickCaptureHotkey() to run even for rejected duplicates; change the
flow to only perform those side effects after an explicit success callback from
the server (e.g. a completion handler on server.start or observing an
onStartAccepted/onStartRejected event) — move calls to
installLegacyMenuBarSurface(with: collector), collector.start(),
configureQuickCaptureHotkey(), and the NSLog trailing message into the
success/accepted path and ensure the rejected path (onStartRejected) does not
run them.

In `@brain-bar/Sources/BrainBar/BrainBarInstanceLock.swift`:
- Around line 55-64: The release() method currently unlinks the lock file after
releasing the flock, which can cause a race allowing multiple processes to hold
locks on different inodes; remove the FileManager.default.removeItem(atPath:
lockPath) call and keep only the unlock (flock(fd, LOCK_UN)) and close(fd) steps
(preserving the releaseLock guard logic) so the lock file remains on disk as a
marker and prevents the race.

In `@brain-bar/Sources/BrainBar/BrainBarServer.swift`:
- Around line 188-198: After successfully acquiring instanceLock via
BrainBarInstanceLock.acquire, ensure the lock is always released and
onStartRejected is invoked for any subsequent early-return failure paths:
introduce a deferred cleanup (e.g., a defer block or shared failure handler)
immediately after assignment of instanceLock that checks a startupSucceeded flag
(set true only on full successful startup) and if false calls
instanceLock.release (or the appropriate BrainBarInstanceLock release method)
and invokes onStartRejected with the existing error message; update all
post-acquire early returns to rely on this shared cleanup so the lock is not
left held when startup fails.

In `@brain-bar/Sources/BrainBar/BrainDatabase.swift`:
- Around line 722-730: The current code mutates the shared SQLite handle's
PRAGMA busy_timeout via busyTimeoutMillis/queryPragma/executeOnHandle which
races when storeAsync/store() run concurrently; instead, when a custom
busyTimeoutMillis is requested create and use a dedicated short-lived DB
connection/handle for that budgeted write (set its PRAGMA busy_timeout once on
the new handle, run the write via executeOnHandle or equivalent, then close the
handle) rather than changing and restoring the shared connection; update the
logic around busyTimeoutMillis, previousBusyTimeout, queryPragma, and
executeOnHandle to open/close a temporary connection for timed writes to
eliminate interleaving on the shared handle.

In `@brain-bar/Tests/BrainBarTests/BrainBarReliabilityTests.swift`:
- Around line 240-284: connectReliabilitySocket currently returns a blocking
socket but readReliabilityFramedMessage expects a non-blocking fd; after
creating fd in connectReliabilitySocket set O_NONBLOCK (use fcntl to get current
flags, add O_NONBLOCK, set them) and handle any fcntl errors by closing fd and
throwing; ensure the rest of connect loop still works (check for
EINPROGRESS/EWOULDBLOCK as needed) and leave the non-blocking flag on the
returned descriptor so readReliabilityFramedMessage’s EAGAIN/EWOULDBLOCK
handling and the timeout loop behave correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4fc27bf1-aa71-41dc-a9d1-e075d2720650

📥 Commits

Reviewing files that changed from the base of the PR and between 70606e0 and d059427.

📒 Files selected for processing (6)
  • brain-bar/Sources/BrainBar/BrainBarApp.swift
  • brain-bar/Sources/BrainBar/BrainBarInstanceLock.swift
  • brain-bar/Sources/BrainBar/BrainBarServer.swift
  • brain-bar/Sources/BrainBar/BrainDatabase.swift
  • brain-bar/Sources/BrainBar/MCPRouter.swift
  • brain-bar/Tests/BrainBarTests/BrainBarReliabilityTests.swift
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Cursor Bugbot
  • GitHub Check: Macroscope - Correctness Check
  • GitHub Check: test (3.12)
  • GitHub Check: test (3.13)
  • GitHub Check: test (3.11)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-03-18T00:12:08.774Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 87
File: brain-bar/Sources/BrainBar/BrainBarServer.swift:118-129
Timestamp: 2026-03-18T00:12:08.774Z
Learning: In Swift files under brain-bar/Sources/BrainBar, enforce that when a critical dependency like the database is nil due to startup ordering (socket before DB), any tool handler that accesses the database must throw an explicit error (e.g., ToolError.noDatabase) instead of returning a default/empty value. Do not allow silent defaults (e.g., guard let db else { return ... }). Flag patterns that silently return defaults when db is nil, as this masks startup timing issues. This guidance applies broadly to similar Swift files in the BrainBar module, not just this one location.

Applied to files:

  • brain-bar/Sources/BrainBar/BrainBarInstanceLock.swift
  • brain-bar/Sources/BrainBar/BrainBarApp.swift
  • brain-bar/Sources/BrainBar/BrainBarServer.swift
  • brain-bar/Sources/BrainBar/MCPRouter.swift
  • brain-bar/Sources/BrainBar/BrainDatabase.swift
📚 Learning: 2026-03-29T18:45:40.988Z
Learnt from: EtanHey
Repo: EtanHey/brainlayer PR: 133
File: brain-bar/Sources/BrainBar/BrainDatabase.swift:0-0
Timestamp: 2026-03-29T18:45:40.988Z
Learning: In the BrainBar module’s Swift database layer (notably BrainDatabase.swift), ensure that the `search()` function’s `unreadOnly=true` path orders results by the delivery frontier cursor so the watermark `maxRowID` stays contiguous. Specifically, when `unreadOnly` is enabled, the query must include `ORDER BY c.rowid ASC` (e.g., via `let orderByClause = unreadOnly ? "c.rowid ASC" : "f.rank"`). Do not replace the unread-only ordering with relevance-based sorting (e.g., `f.rank`) unconditionally or for the unread-only path, as it can introduce gaps in the watermark and incorrectly mark unseen rows as delivered. Flag any future change to the `ORDER BY` clause in this function that makes relevance sorting apply to the unread-only case.

Applied to files:

  • brain-bar/Sources/BrainBar/BrainBarInstanceLock.swift
  • brain-bar/Sources/BrainBar/BrainBarApp.swift
  • brain-bar/Sources/BrainBar/BrainBarServer.swift
  • brain-bar/Sources/BrainBar/MCPRouter.swift
  • brain-bar/Sources/BrainBar/BrainDatabase.swift
🪛 SwiftLint (0.63.2)
brain-bar/Tests/BrainBarTests/BrainBarReliabilityTests.swift

[Warning] 5-5: Classes should have an explicit deinit method

(required_deinit)

Comment on lines +112 to +120
server.start()

if launchMode == .legacyStatusItem {
    installLegacyMenuBarSurface(with: collector)
}

collector.start()
configureQuickCaptureHotkey()
NSLog("[BrainBar] Socket ready; database will self-heal — launchMode=%@", String(describing: launchMode))

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Don't start app side effects before startup acceptance is known.

server.start() returns before the lock check runs on the server queue, so a losing duplicate instance still installs the legacy surface, starts the collector, and registers the hotkey before onStartRejected gets back onto the main actor. That briefly recreates the duplicate side effects this PR is trying to eliminate. Gate this work behind an explicit successful-start callback, or make startup acceptance synchronous.


Comment on lines +55 to +64
func release() {
    let shouldRelease = releaseLock.withLock { released -> Bool in
        guard !released else { return false }
        released = true
        return true
    }
    guard shouldRelease else { return }
    flock(fd, LOCK_UN)
    close(fd)
    try? FileManager.default.removeItem(atPath: lockPath)

⚠️ Potential issue | 🔴 Critical


Remove the lock file deletion to prevent a race condition that breaks single-instance guarantees.

The flock syscall attaches locks to inodes, not pathnames. Between close(fd) and removeItem(lockPath), another process can open the same path and acquire the lock on the same inode. A third process can then create a new file at that path before the first process completes deletion, gaining a lock on a different inode. This allows two processes to hold locks simultaneously on different inodes, violating the single-instance invariant.

🔧 Suggested fix
     func release() {
         let shouldRelease = releaseLock.withLock { released -> Bool in
             guard !released else { return false }
             released = true
             return true
         }
         guard shouldRelease else { return }
         flock(fd, LOCK_UN)
         close(fd)
-        try? FileManager.default.removeItem(atPath: lockPath)
     }

The lock file can safely remain on disk; it acts as a marker and causes no harm on subsequent acquisitions.


Comment on lines +188 to +198
do {
    instanceLock = try BrainBarInstanceLock.acquire(lockPath: instanceLockPath)
} catch BrainBarInstanceLock.AcquireError.alreadyRunning {
    NSLog("[BrainBar] Another BrainBar instance owns %@. Exiting this server.", instanceLockPath)
    onStartRejected?("another BrainBar instance owns \(instanceLockPath)")
    return
} catch {
    NSLog("[BrainBar] Failed to acquire single-instance lock %@: %@", instanceLockPath, String(describing: error))
    onStartRejected?("failed to acquire single-instance lock \(instanceLockPath): \(error)")
    return
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release the singleton lock on every startup failure after this point.

Once instanceLock is acquired here, later startup failures still just return. That leaves this process holding the lock and never invokes onStartRejected, so the app can stay alive with no listener while blocking all future launches until exit. Add a shared failure path that releases the lock and reports rejection on every post-acquire early return.


Comment on lines +722 to +730
let previousBusyTimeout = busyTimeoutMillis.flatMap { _ in queryPragma(db, name: "busy_timeout") }
if let busyTimeoutMillis {
    try executeOnHandle(db, sql: "PRAGMA busy_timeout = \(max(1, busyTimeoutMillis))")
}
defer {
    if let previousBusyTimeout {
        try? executeOnHandle(db, sql: "PRAGMA busy_timeout = \(previousBusyTimeout)")
    }
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

Repository: EtanHey/brainlayer

# Inspect the flagged region, the store()/storeAsync path, and the helpers
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '1,50p'
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | head -100
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '1,150p' | tail -50
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '150,250p'
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '239,400p'
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '700,780p'
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '778,800p'
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '850,950p'
cat -n brain-bar/Sources/BrainBar/BrainDatabase.swift | sed -n '1642,1660p'

# Search for any synchronization around the shared db handle: queues, locks,
# mutexes, actors, serial access patterns (every search came back empty)
rg -n "nonisolated|DispatchQueue|Lock|Mutex|serial" brain-bar/Sources/BrainBar/BrainDatabase.swift | head -30
rg -n "DispatchQueue|serial|queue|Lock|Mutex" brain-bar/Sources/BrainBar/BrainDatabase.swift | head -10
rg -n "var db|let db|private.*db" brain-bar/Sources/BrainBar/BrainDatabase.swift | head -20
rg -n "@MainActor|@globalActor|nonisolated" brain-bar/Sources/BrainBar/BrainDatabase.swift
rg -n "serialQueue|dbQueue|dbLock|synchronized" brain-bar/Sources/BrainBar/BrainDatabase.swift
rg -n "private.*queue|private.*lock|private.*serial" brain-bar/Sources/BrainBar/BrainDatabase.swift
rg -n "FULLMUTEX|Sendable" brain-bar/Sources/BrainBar/BrainDatabase.swift | head -10

# Map callers of store()/storeAsync and the pragma helpers they rely on
rg -n "\.store\(" brain-bar/Sources/BrainBar/ -A 2 | head -40
rg -n "\.store\(|storeAsync|storeOrQueueWithinBudget" brain-bar/Sources/BrainBar/ --type swift | wc -l
rg -n "db.store|db.storeAsync|db.storeOrQueue" brain-bar/Sources/BrainBar/QuickCaptureController.swift -B 3 -A 3 | head -20
rg -n "func runWriteStatement|private func runWriteStatement" brain-bar/Sources/BrainBar/BrainDatabase.swift -A 10 | head -30
rg -n "func queryPragma|func executeOnHandle" brain-bar/Sources/BrainBar/BrainDatabase.swift -A 5
rg -n "func open|private func open" brain-bar/Sources/BrainBar/BrainDatabase.swift -A 15 | head -40
rg -n "func configureConnection|func openConnection" brain-bar/Sources/BrainBar/BrainDatabase.swift -A 15 | head -40
rg -n "openConfiguration|busyTimeoutMillis" brain-bar/Sources/BrainBar/BrainDatabase.swift | head -20

🌐 Web query:

SQLite PRAGMA busy_timeout connection-specific thread-safe

💡 Result:

PRAGMA busy_timeout is connection-specific (per database connection) and operates safely in thread-safe SQLite builds. Key points:

- It sets the busy handler for that specific connection only. Each connection has its own independent busy handler/timeout; setting it on one connection does not affect others. The default is 0 (no timeout, immediate SQLITE_BUSY).
- Usage: PRAGMA busy_timeout = milliseconds; (e.g., PRAGMA busy_timeout = 5000; for 5 seconds). The equivalent C API, sqlite3_busy_timeout(sqlite3 *db, int ms), is explicitly per connection.
- Thread safety: works in serialized mode (SQLITE_THREADSAFE=1, the default), where a full mutex serializes access to the connection, and in multi-thread mode (SQLITE_THREADSAFE=2), where connections must not be shared across threads without external synchronization, though the timeout itself is a simple per-connection value.
- PRAGMAs are session/connection-specific and do not persist, so the timeout must be set after opening each connection.
- In WAL mode (recommended for concurrency) it helps handle lock contention between the single writer and multiple readers. Timeouts may not apply in deadlocks (immediate SQLITE_BUSY) or certain recovery scenarios.


Concurrent PRAGMA mutations on the shared handle can cause operations to inherit wrong timeouts.

The code mutates busy_timeout on the shared SQLite connection and restores it via defer, but since storeAsync dispatches to a concurrent global queue and there is no external synchronization on the db handle, concurrent store() calls can interleave. One thread may set the timeout to 50 ms while another thread reads that value before restoration, causing unrelated operations to use incorrect timeout values or fail spuriously. Use a dedicated short-timeout connection for budgeted writes instead of mutating the shared handle.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@brain-bar/Sources/BrainBar/BrainDatabase.swift` around lines 722 - 730:
the current code mutates the shared SQLite handle's PRAGMA busy_timeout via
busyTimeoutMillis/queryPragma/executeOnHandle which races when
storeAsync/store() run concurrently; instead, when a custom busyTimeoutMillis is
requested create and use a dedicated short-lived DB connection/handle for that
budgeted write (set its PRAGMA busy_timeout once on the new handle, run the
write via executeOnHandle or equivalent, then close the handle) rather than
changing and restoring the shared connection; update the logic around
busyTimeoutMillis, previousBusyTimeout, queryPragma, and executeOnHandle to
open/close a temporary connection for timed writes to eliminate interleaving on
the shared handle.

Comment on lines +240 to +284
private func connectReliabilitySocket(path: String, timeout: TimeInterval) throws -> Int32 {
let fd = socket(AF_UNIX, SOCK_STREAM, 0)
guard fd >= 0 else {
throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
}

var addr = sockaddr_un()
addr.sun_family = sa_family_t(AF_UNIX)
let pathBytes = path.utf8CString
let pathCapacity = MemoryLayout.size(ofValue: addr.sun_path)
guard pathBytes.count <= pathCapacity else {
close(fd)
throw NSError(domain: NSPOSIXErrorDomain, code: Int(ENAMETOOLONG), userInfo: [
NSLocalizedDescriptionKey: "Socket path too long (\(pathBytes.count) > \(pathCapacity)): \(path)"
])
}
withUnsafeMutablePointer(to: &addr.sun_path) { ptr in
ptr.withMemoryRebound(to: CChar.self, capacity: pathCapacity) { dest in
pathBytes.withUnsafeBufferPointer { src in
_ = memcpy(dest, src.baseAddress!, src.count)
}
}
}

let deadline = Date().addingTimeInterval(timeout)
var lastErrno = ENOENT
while Date() < deadline {
let result = withUnsafePointer(to: &addr) { addrPtr in
addrPtr.withMemoryRebound(to: sockaddr.self, capacity: 1) { ptr in
connect(fd, ptr, socklen_t(MemoryLayout<sockaddr_un>.size))
}
}
if result == 0 {
return fd
}
lastErrno = errno
if errno != ENOENT && errno != ECONNREFUSED {
break
}
Thread.sleep(forTimeInterval: 0.01)
}

close(fd)
throw NSError(domain: NSPOSIXErrorDomain, code: Int(lastErrno))
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed:

find . -name "*.swift" -type f | head -20
rg "readReliabilityFramedMessage" --type swift -A 20
rg "connectReliabilitySocket" --type swift -B 3 -A 3

The socket must be nonblocking for the timeout to function.

The readReliabilityFramedMessage(...) function expects the fd to be non-blocking, as evidenced by its checks for EAGAIN/EWOULDBLOCK. However, connectReliabilitySocket returns a blocking socket from socket(AF_UNIX, SOCK_STREAM, 0). When the server never responds, read() blocks indefinitely and the deadline is never evaluated, causing the test to hang rather than timeout.

🔧 Suggested fix
         if result == 0 {
+            let flags = fcntl(fd, F_GETFL)
+            if flags >= 0 {
+                _ = fcntl(fd, F_SETFL, flags | O_NONBLOCK)
+            }
             return fd
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@brain-bar/Tests/BrainBarTests/BrainBarReliabilityTests.swift` around lines
240 - 284, connectReliabilitySocket currently returns a blocking socket but
readReliabilityFramedMessage expects a non-blocking fd; after creating fd in
connectReliabilitySocket set O_NONBLOCK (use fcntl to get current flags, add
O_NONBLOCK, set them) and handle any fcntl errors by closing fd and throwing;
ensure the rest of connect loop still works (check for EINPROGRESS/EWOULDBLOCK
as needed) and leave the non-blocking flag on the returned descriptor so
readReliabilityFramedMessage’s EAGAIN/EWOULDBLOCK handling and the timeout loop
behave correctly.
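The suggested diff above ignores fcntl failures; a small helper along the lines the prompt asks for (close-and-throw on error) might look like this. The function name is illustrative, not the repo's helper.

```swift
import Foundation
#if canImport(Glibc)
import Glibc
#endif

/// Flips a descriptor to non-blocking mode, surfacing fcntl failures instead
/// of silently ignoring them, so read()'s EAGAIN/EWOULDBLOCK handling and the
/// caller's deadline loop behave as intended.
func makeNonBlocking(_ fd: Int32) throws {
    let flags = fcntl(fd, F_GETFL)
    guard flags >= 0 else {
        throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
    }
    guard fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0 else {
        throw NSError(domain: NSPOSIXErrorDomain, code: Int(errno))
    }
}
```

The caller would invoke this on the successful-connect path (after `result == 0`) and close the descriptor if it throws, keeping the returned fd non-blocking for the framed-read loop.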

@EtanHey EtanHey merged commit 219f697 into main May 3, 2026
7 checks passed