Background
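internal/session is the session state machine module. The Phase 2 resource-mgmt + performance analysis found room to optimize the lock patterns on the management API hot path and the GC batching.
Scope: resource-mgmt, performance; cycle 203 (module analysis pass 3)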
Key files: manager.go, pool.go
Finding Summary
| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Performance | 0 | 0 | 3 | 0 |
| Total | 0 | 0 | 3 | 0 |
Findings
listactive-workerhealth-n-plus-one-lock-pattern
Severity: Medium | Confidence: High | ROI: High
Location: manager.go:820-832, manager.go:883-896
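Problem: ListActive() and WorkerHealthStatuses() iterate over all sessions while holding m.mu.RLock, and acquire ms.mu.RLock separately for each session. This is an N+1 lock pattern: 1 global lock plus N per-session locks. With 100+ active sessions, each call performs 100+ lock/unlock pairs.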
Current Pattern:
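```go
func (m *Manager) ListActive() []*SessionInfo {
	m.mu.RLock()
	defer m.mu.RUnlock()
	// The declaration of sessions is elided in the original excerpt; assumed here.
	sessions := make([]*SessionInfo, 0, len(m.sessions))
	for _, ms := range m.sessions {
		ms.mu.RLock() // N per-session locks
		info := ms.info
		ms.mu.RUnlock()
		sessions = append(sessions, &info)
	}
	return sessions
}
```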
Proposed Fix: Remove the per-session ms.mu.RLock. ms.info is a value-type struct, so copying it directly is safe (the slice/map pointers it contains are only read by current callers).
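Acceptance Criteria:
- ListActive() no longer acquires ms.mu.RLock and copies directly with info := ms.info
- WorkerHealthStatuses() likewise drops the per-session RLock
- make test passes with zero regressions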
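A minimal sketch of the proposed ListActive(), for illustration only. The type definitions are hypothetical stand-ins (only m.mu, m.sessions, ms.mu and ms.info appear in the report), and the sketch assumes that every writer of ms.info either holds m.mu.Lock or never mutates info after the session is inserted. If that assumption does not hold, the lock-free copy would race with writers, which the -race run under Verification should surface.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionInfo is a hypothetical stand-in for the real value-type struct.
type SessionInfo struct {
	ID    string
	State string
}

// managedSession mirrors the shape implied by the report: a per-session
// mutex plus an info value. Everything else is omitted.
type managedSession struct {
	mu   sync.RWMutex
	info SessionInfo
}

// Manager holds the global lock and the sessions map.
type Manager struct {
	mu       sync.RWMutex
	sessions map[string]*managedSession
}

// ListActive takes only the global read lock and copies each info value.
// ASSUMPTION: writers of ms.info hold m.mu.Lock, or info is immutable
// after insertion; otherwise this copy is a data race.
func (m *Manager) ListActive() []*SessionInfo {
	m.mu.RLock()
	defer m.mu.RUnlock()
	sessions := make([]*SessionInfo, 0, len(m.sessions))
	for _, ms := range m.sessions {
		info := ms.info // value copy, no per-session lock
		sessions = append(sessions, &info)
	}
	return sessions
}

func main() {
	m := &Manager{sessions: map[string]*managedSession{
		"a": {info: SessionInfo{ID: "a", State: "running"}},
		"b": {info: SessionInfo{ID: "b", State: "idle"}},
	}}
	fmt.Println("sessions listed:", len(m.ListActive()))
}
```

Each call now performs one RLock/RUnlock pair regardless of the number of sessions, and callers still receive independent copies of the info values.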
gc-terminates-sessions-sequentially-under-lock
Severity: Medium | Confidence: High | ROI: Medium
Location: manager.go:1053-1062
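Problem: GC terminates expired sessions sequentially. Each TransitionWithReason call acquires and releases m.mu and ms.mu and performs a DB upsert. 50 expired sessions produce 200+ lock operations. A bulk cleanup after a weekend can delay the next GC cycle.
Proposed Fix: Terminate in batches using errgroup, with concurrency limited to 5. TransitionWithReason is safe to call concurrently for different sessions (each session has its own ms.mu).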
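Acceptance Criteria:
- errgroup concurrent termination, with concurrency limited to 5
- TestGC_ConcurrentTermination verifies concurrency safety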
pool-acquire-memory-two-lock-roundtrips
Severity: Medium | Confidence: High | ROI: High
Location: manager.go:494-516, pool.go:67-88, pool.go:142-158
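Problem: AttachWorker calls pool.Acquire(userID) and then pool.AcquireMemory(userID), each acquiring and releasing pool.mu once. The failure rollback (AcquireMemory fails -> Release) is a third acquisition. 20 concurrent session starts mean up to 60 pool.mu operations.
Proposed Fix: Merge both calls into a single AcquireWithMemory(userID) method that checks the slot and memory quotas under one Lock.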
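Acceptance Criteria:
- PoolManager.AcquireWithMemory(userID string) error method
- AttachWorker calls AcquireWithMemory instead of the two separate calls
- Acquire/AcquireMemory/Release remain unchanged (backward compatible)
- TestPool_AcquireWithMemory covers the slot-exceeded / memory-exceeded / success paths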
Implementation Priority
| Finding | Priority | Effort | Risk | Impact |
|---|---|---|---|---|
| listactive-n-plus-one | P1 | Small | Low | Lock operations reduced by 2/3 at 100+ sessions |
| pool-double-lock | P1 | Small | Low | Hot-path lock operations reduced by 50% |
| gc-sequential | P2 | Medium | Medium | Lower bulk-cleanup latency |
Recommended starting point: listactive-n-plus-one (lowest risk, highest ROI)
Out of Scope
- transitionState lock gap and getManagedSession dropping ctx (already tracked in issue 527)
- No eviction of non-terminated sessions from the sessions map (by design; the runningIndex optimization is already in place)
Verification
- make test passes with no regressions
- make lint produces no new warnings
- go test -race ./internal/session/ reports no data races