perf(session): N+1 lock in ListActive/WorkerHealth + sequential GC + pool double-lock #545

@hrygo

Background

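internal/session is the session state machine module. The Phase 2 resource-mgmt + performance analysis found that the lock patterns on the management API hot path, and GC batch processing, have room for optimization.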

Scope: resource-mgmt, performance; cycle 203 (module analysis pass 3)
Key files: manager.go, pool.go


Finding Summary

| Category    | Critical | High | Medium | Low |
|-------------|----------|------|--------|-----|
| Performance | 0        | 0    | 3      | 0   |
| Total       | 0        | 0    | 3      | 0   |

Findings

listactive-workerhealth-n-plus-one-lock-pattern

Severity: Medium | Confidence: High | ROI: High
Location: manager.go:820-832, manager.go:883-896

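Problem: ListActive() and WorkerHealthStatuses() iterate over all sessions while holding m.mu.RLock, and acquire ms.mu.RLock separately for each session. This is an N+1 lock pattern: 1 global lock + N session locks. With 100+ active sessions, each call produces 100+ lock/unlock pairs.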

Current Pattern:

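func (m *Manager) ListActive() []*SessionInfo {
    m.mu.RLock()
    defer m.mu.RUnlock()
    sessions := make([]*SessionInfo, 0, len(m.sessions)) // one entry per session
    for _, ms := range m.sessions {
        ms.mu.RLock()   // one extra lock/unlock pair per session (the "N" in N+1)
        info := ms.info // value copy taken under the per-session lock
        ms.mu.RUnlock()
        sessions = append(sessions, &info)
    }
    return sessions
}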

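Proposed Fix: Remove the per-session ms.mu.RLock. ms.info is a value-type struct, so copying it directly is safe (callers currently treat the slice/map pointers it contains as read-only). This assumes no goroutine writes ms.info while holding only ms.mu; the go test -race run under Verification should confirm it.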

Acceptance Criteria:

  • ListActive() no longer acquires ms.mu.RLock and copies directly with info := ms.info
  • WorkerHealthStatuses() likewise drops the per-session RLock
  • make test passes with zero regressions
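The proposed shape of ListActive() can be sketched as follows. This is a minimal, self-contained illustration rather than the module's actual code: the SessionInfo fields, the managedSession type name, and the map key type are assumptions made for the example, and the sketch presumes that writers to ms.info hold m.mu in write mode.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionInfo stands in for the real struct; these fields are hypothetical.
type SessionInfo struct {
	ID    string
	State string
}

// managedSession is a hypothetical name for the per-session wrapper.
type managedSession struct {
	mu   sync.RWMutex // still present, but not taken by ListActive below
	info SessionInfo
}

type Manager struct {
	mu       sync.RWMutex
	sessions map[string]*managedSession
}

// ListActive copies every session's info while holding only m.mu.RLock.
// Assumption: any writer of ms.info holds m.mu.Lock, so the copy is race-free.
func (m *Manager) ListActive() []*SessionInfo {
	m.mu.RLock()
	defer m.mu.RUnlock()

	out := make([]*SessionInfo, 0, len(m.sessions))
	for _, ms := range m.sessions {
		info := ms.info // value copy; no per-session lock
		out = append(out, &info)
	}
	return out
}

func main() {
	m := &Manager{sessions: map[string]*managedSession{
		"a": {info: SessionInfo{ID: "a", State: "running"}},
		"b": {info: SessionInfo{ID: "b", State: "idle"}},
	}}
	for _, s := range m.ListActive() {
		fmt.Println(s.ID, s.State)
	}
}
```

Each returned pointer refers to a fresh copy, so callers cannot mutate the manager's state through it. Any slice or map fields inside the real SessionInfo would still be shared with the original.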

gc-terminates-sessions-sequentially-under-lock

Severity: Medium | Confidence: High | ROI: Medium
Location: manager.go:1053-1062

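Problem: GC terminates expired sessions sequentially. Each TransitionWithReason call acquires and releases m.mu and ms.mu and performs a DB upsert. 50 expired sessions produce 200+ lock operations. A bulk cleanup after a weekend can delay the next GC cycle.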

Proposed Fix: Use errgroup to terminate sessions in batches with concurrency limited to 5. TransitionWithReason is safe to call concurrently for different sessions (each session has its own ms.mu).

Acceptance Criteria:

  • GC terminates sessions concurrently via errgroup, with concurrency limited to 5
  • TestGC_ConcurrentTermination verifies concurrency safety
  • A single failed transition does not affect other sessions

pool-acquire-memory-two-lock-roundtrips

Severity: Medium | Confidence: High | ROI: High
Location: manager.go:494-516, pool.go:67-88, pool.go:142-158

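Problem: AttachWorker first calls pool.Acquire(userID) and then pool.AcquireMemory(userID), each of which acquires and releases pool.mu once. The failure rollback (AcquireMemory fails -> Release) is a third round trip. 20 concurrent session starts can therefore mean up to 60 pool.mu operations.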

Proposed Fix: Merge the two calls into a single AcquireWithMemory(userID) method that checks both the slot and memory quotas under one Lock.

Acceptance Criteria:

  • Add a PoolManager.AcquireWithMemory(userID string) error method
  • AttachWorker calls AcquireWithMemory instead of making two calls
  • The existing Acquire/AcquireMemory/Release remain unchanged (backward compatible)
  • TestPool_AcquireWithMemory covers the slot-exceeded / memory-exceeded / success paths

Implementation Priority

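| Finding               | Priority | Effort | Risk   | Impact                                        |
|-----------------------|----------|--------|--------|-----------------------------------------------|
| listactive-n-plus-one | P1       | Small  | Low    | Lock operations reduced by 2/3 at 100+ sessions |
| pool-double-lock      | P1       | Small  | Low    | Hot-path lock operations reduced by 50%       |
| gc-sequential         | P2       | Medium | Medium | Reduced bulk-cleanup latency                  |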

Recommended starting point: listactive-n-plus-one (lowest risk, highest ROI)


Out of Scope

  • The transitionState lock gap and getManagedSession dropping ctx (already tracked in issue 527)
  • No eviction of non-terminal sessions from the sessions map (by design; the runningIndex optimization is already in place)

Verification

  • make test passes with no regressions
  • make lint produces no new warnings
  • go test -race ./internal/session/ reports no data races

Labels: P3 (Medium: tech debt, refactoring, improvements), architecture (Domain: design patterns, coupling, separation of concerns)
