perf(brain): unbounded maps without eviction + exclusive lock on cache read + O(n) ring buffer #546

@hrygo

Background

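internal/brain is the LLM client decorator chain and intent-routing module (including the brain/llm subpackage). The Phase 2 resource-mgmt + performance analysis found three unbounded maps with no TTL eviction, an exclusive lock on the cache read path, and an O(n) copy in the metrics rolling window.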

Scope: resource-mgmt, performance; cycle 203 (module analysis pass 3)
Key files: router.go, memory.go, llm/cost.go, llm/ratelimit.go, llm/metrics.go

Related: issue 501 (cost-calculator unbounded map, already tracked in Phase 1), issue 531 (SafetyGuard race)


Finding Summary

| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Resource-mgmt | 0 | 0 | 2 | 0 |
| Performance | 0 | 0 | 2 | 0 |
| Total | 0 | 0 | 4 | 0 |

Findings

intent-router-exclusive-lock-on-cache-read

Severity: Medium | Confidence: High | ROI: Medium
Location: router.go:372-387

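Problem: getFromCache serves reads (a map lookup plus an LRU MoveToFront) but acquires the exclusive Lock() instead of RLock(). Every cache hit therefore blocks all concurrent readers and writers. The mutex is declared as an RWMutex, yet the hot read path does not take advantage of it.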

Current Pattern:

func (r *IntentRouter) getFromCache(key string) *IntentResult {
    r.cacheMu.Lock()          // exclusive lock for READ
    defer r.cacheMu.Unlock()
    result, exists := r.cache[key]
    if !exists { return nil }
    if elem, ok := r.lruIndex[key]; ok {
        r.lruList.MoveToFront(elem)
    }
    return result
}

Proposed Fix: Look up the map under RLock and release it on a miss; acquire Lock separately for the LRU MoveToFront.

Acceptance Criteria:

  • getFromCache uses RLock for the map lookup
  • The LRU MoveToFront runs under a separate Lock
  • TestIntentRouter_ConcurrentCacheAccess, run with -race, reports no data races
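The two-phase locking described in the proposed fix could look roughly like the sketch below. It is not the repository's code: the struct carries only the fields named in this issue, and IntentResult's field, newIntentRouter, and put are hypothetical helpers added so the example is self-contained.

```go
package main

import (
	"container/list"
	"sync"
)

// IntentResult is a stand-in for the router's cached value type (assumed shape).
type IntentResult struct {
	Intent string
}

// IntentRouter mirrors only the cache-related fields named in the issue.
type IntentRouter struct {
	cacheMu  sync.RWMutex
	cache    map[string]*IntentResult
	lruIndex map[string]*list.Element
	lruList  *list.List
}

// newIntentRouter is a hypothetical constructor for this sketch.
func newIntentRouter() *IntentRouter {
	return &IntentRouter{
		cache:    make(map[string]*IntentResult),
		lruIndex: make(map[string]*list.Element),
		lruList:  list.New(),
	}
}

// put is a hypothetical helper that stores an entry and marks it most recently used.
func (r *IntentRouter) put(key string, res *IntentResult) {
	r.cacheMu.Lock()
	defer r.cacheMu.Unlock()
	r.cache[key] = res
	if elem, ok := r.lruIndex[key]; ok {
		r.lruList.MoveToFront(elem)
		return
	}
	r.lruIndex[key] = r.lruList.PushFront(key)
}

func (r *IntentRouter) getFromCache(key string) *IntentResult {
	// Phase 1: shared lock for the map lookup. A miss returns without
	// ever taking the exclusive lock.
	r.cacheMu.RLock()
	result, exists := r.cache[key]
	r.cacheMu.RUnlock()
	if !exists {
		return nil
	}

	// Phase 2: exclusive lock only for the LRU mutation. The entry may have
	// been evicted between the two phases, so lruIndex is re-checked here.
	r.cacheMu.Lock()
	if elem, ok := r.lruIndex[key]; ok {
		r.lruList.MoveToFront(elem)
	}
	r.cacheMu.Unlock()
	return result
}
```

With this shape, misses run concurrently with each other. A hit still takes the exclusive lock for the recency update, so the benefit on hit-heavy workloads depends on how short that second critical section stays.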

rate-limiter-unbounded-models-map

Severity: Medium | Confidence: Medium | ROI: Medium
Location: llm/ratelimit.go:33, llm/ratelimit.go:187-208

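Problem: The RateLimiter.models map creates a rate.Limiter for every unique model name but never evicts one. If model names are dynamic (user configuration or router responses), the map grows without bound.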

Proposed Fix: Add a TTL eviction goroutine, consistent with the evictStaleLimiters pattern used for SafetyGuard.userLimiters.

Acceptance Criteria:

  • RateLimiter adds lastAccess tracking and TTL eviction
  • The eviction goroutine exits gracefully on RateLimiter.Close()
  • TestRateLimiter_ModelEviction verifies that expired models are cleaned up

metrics-latency-ring-buffer-append-copy

Severity: Medium | Confidence: High | ROI: Medium
Location: llm/metrics.go:175-179

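Problem: The requestLatencies rolling window uses append(slice[1:], val), which, when the window is full, can reallocate and copy 999 float64 values (about 8 KB), all while holding mu.Lock. The OTel histogram already captures the latency distribution; the local window is only used by the GetStats() API.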

Current Pattern:

if len(mc.requestLatencies) >= mc.maxLatencySamples {
    mc.requestLatencies = append(mc.requestLatencies[1:], latencyMs)  // O(n) copy
}

Proposed Fix: Replace the slice with a ring buffer, or have GetStats() use the OTel histogram data.

Acceptance Criteria:

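  • Slice shifting is replaced with a ring buffer, or GetStats() switches to OTel histogram data
  • RecordRequest no longer allocates a new slice while holding mu.Lock
  • GetStats() return values keep the same precision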
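A fixed-capacity ring buffer along the lines of the first option might look like the following sketch. The type and method names (latencyRing, add, snapshot) are hypothetical, and capacity is assumed to be positive; in the real collector the buffer would be sized by maxLatencySamples and guarded by the existing mutex.

```go
package main

// latencyRing is a fixed-capacity circular buffer of latency samples in milliseconds.
type latencyRing struct {
	buf   []float64
	next  int // index of the slot the next sample will overwrite
	count int // number of valid samples, at most len(buf)
}

// newLatencyRing allocates the backing array once; capacity must be > 0.
func newLatencyRing(capacity int) *latencyRing {
	return &latencyRing{buf: make([]float64, capacity)}
}

// add records a sample in O(1) without allocating.
// Once the buffer is full, the oldest sample is overwritten.
func (r *latencyRing) add(ms float64) {
	r.buf[r.next] = ms
	r.next = (r.next + 1) % len(r.buf)
	if r.count < len(r.buf) {
		r.count++
	}
}

// snapshot copies the valid samples into a new slice, oldest first.
func (r *latencyRing) snapshot() []float64 {
	out := make([]float64, 0, r.count)
	start := 0
	if r.count == len(r.buf) {
		start = r.next // when full, next points at the oldest sample
	}
	for i := 0; i < r.count; i++ {
		out = append(out, r.buf[(start+i)%len(r.buf)])
	}
	return out
}
```

Here the write path never allocates, and the only copy happens in snapshot, which a GetStats()-style reader would call far less often than requests are recorded.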

memory-manager-unbounded-preferences-map

Severity: Medium | Confidence: High | ROI: High
Location: memory.go:500-503, memory.go:516-524

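Problem: MemoryManager.preferences is a two-level map (userID -> key -> value) whose entries are only ever added, never removed. There is no TTL, no cap on the number of users, and no background cleanup. Unlike SafetyGuard.userLimiters, which has evictStaleLimiters running at a 10-minute interval, MemoryManager has no eviction mechanism at all.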

Current Pattern:

type MemoryManager struct {
    preferences map[string]map[string]string // userID -> key -> value
    prefMu      sync.RWMutex
}

func (m *MemoryManager) RecordUserPreference(userID, key, value string) {
    m.prefMu.Lock()
    defer m.prefMu.Unlock()
    if m.preferences[userID] == nil {
        m.preferences[userID] = make(map[string]string)
    }
    m.preferences[userID][key] = value  // only adds, never evicts
}

Proposed Fix: Add lastAccess tracking and TTL eviction, following the same pattern as ContextCompressor's startCleanupDaemon.

Acceptance Criteria:

  • MemoryManager adds a lastAccess map[string]time.Time for tracking
  • A background goroutine periodically evicts user entries older than the TTL
  • TestMemoryManager_PreferenceEviction verifies the TTL cleanup behavior

Implementation Priority

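| Finding | Priority | Effort | Risk | Impact |
|---|---|---|---|---|
| memory-manager-unbounded | P1 | Small | Low | Prevents memory leaks in deployments with thousands of users |
| intent-router-lock | P1 | Small | Low | Higher cache-hit throughput |
| rate-limiter-unbounded | P2 | Medium | Low | Guards against dynamic model names |
| metrics-ring-buffer | P2 | Small | Low | Eliminates the O(n) copy |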

Recommended starting point: memory-manager-unbounded + intent-router-lock; both are low-effort, high-ROI fixes.


Out of Scope

  • CostCalculator.sessions unbounded map (already tracked in issue 501)
  • Metrics OTel context.Background() trace break (already tracked in issue 501)
  • fmt.Sprintf per-request prompt build (masked by LLM latency, low ROI)

Verification

  • make test passes with no regressions
  • make lint produces no new warnings
  • go test -race ./internal/brain/... reports no data races
