perf(gateway): per-connection re-encode + clone-per-event + snapshot allocation on hot path #544

@hrygo

Background

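internal/gateway is HotPlex's core message-routing layer. The Phase 2 resource-mgmt + performance analysis found unnecessary repeated encoding and allocation overhead on the hot path.

Scope: resource-mgmt, performance; cycle 203 (module analysis pass 3)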
Key files: conn.go, hub.go, bridge_forward.go, platform_writer.go


Finding Summary

| Category | Critical | High | Medium | Low |
|---|---|---|---|---|
| Performance | 0 | 1 | 2 | 0 |
| Resource-mgmt | 0 | 0 | 0 | 0 |
| Total | 0 | 1 | 2 | 0 |

Findings

per-connection-re-encode-on-route

Severity: High | Confidence: High | ROI: Medium
Location: conn.go:656-671, hub.go:454-486, platform_writer.go:83-86

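Problem: Hub.routeMessage calls RouteWrite independently on every subscribed SessionWriter, and each call JSON-marshals the same Envelope again. For a session with 1 WS conn + 1 platform conn, every message.delta is encoded 2-3 times, and each encode allocates a new []byte.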

Current Pattern:

// hub.go routeMessage — per-conn encoding
for _, conn := range conns {
    if err := conn.RouteWrite(ctx, msg.Env); err == nil {
        continue
    }
}

// conn.go RouteWrite — encodes every time
func (c *Conn) RouteWrite(_ context.Context, env *events.Envelope) error {
    data, err := aep.EncodeJSON(env)  // allocates []byte per call
    ...
    c.sendData(data)
}

Proposed Fix: encode once in routeMessage and send the raw bytes to WS conns; platform conns keep their own encoding (they need different handling).

Acceptance Criteria:

  • Hub.routeMessage calls aep.EncodeJSON only once per Envelope
  • WS conns send the pre-encoded bytes through a new method SendData(data []byte)
  • BenchmarkRouteMessage_Throughput shows N-1 fewer encoding allocations (N is the number of conns)
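
The encode-once routing described above could look roughly like the following. This is a minimal sketch, not the actual HotPlex code: Envelope, RawSender, wsConn, platformConn, encodeJSON, and the encodeCalls counter are hypothetical stand-ins (encoding/json replaces aep.EncodeJSON, and the context parameter is dropped for brevity). Only the SendData(data []byte) name comes from the acceptance criteria.

```go
package main

import (
	"encoding/json"
	"errors"
)

// Envelope is a hypothetical stand-in for events.Envelope.
type Envelope struct {
	Type string `json:"type"`
	Data any    `json:"data,omitempty"`
}

// SessionWriter mirrors the per-connection write interface (simplified).
type SessionWriter interface {
	RouteWrite(env *Envelope) error
}

// RawSender is an assumed optional interface for conns that accept
// pre-encoded bytes.
type RawSender interface {
	SendData(data []byte) error
}

// encodeCalls counts encodes so the saving is observable (illustration only).
var encodeCalls int

func encodeJSON(env *Envelope) ([]byte, error) {
	encodeCalls++
	return json.Marshal(env)
}

// wsConn is a toy WS connection that records what it was asked to send.
type wsConn struct{ sent [][]byte }

func (c *wsConn) SendData(data []byte) error { c.sent = append(c.sent, data); return nil }

func (c *wsConn) RouteWrite(env *Envelope) error {
	data, err := encodeJSON(env)
	if err != nil {
		return err
	}
	return c.SendData(data)
}

// platformConn is a toy platform connection that does its own handling.
type platformConn struct{ got []string }

func (p *platformConn) RouteWrite(env *Envelope) error {
	p.got = append(p.got, env.Type)
	return nil
}

// routeMessage encodes the envelope at most once and shares the bytes with
// every RawSender; other writers fall back to RouteWrite.
func routeMessage(conns []SessionWriter, env *Envelope) error {
	var (
		data    []byte
		encoded bool
		errs    []error
	)
	for _, conn := range conns {
		raw, ok := conn.(RawSender)
		if !ok {
			if err := conn.RouteWrite(env); err != nil {
				errs = append(errs, err)
			}
			continue
		}
		if !encoded { // lazy: skip encoding when no raw-capable conn exists
			var err error
			if data, err = encodeJSON(env); err != nil {
				return err
			}
			encoded = true
		}
		if err := raw.SendData(data); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when errs is empty
}
```

Because every WS conn receives the same backing array, receivers would have to treat the slice as read-only; a conn that mutates its buffer in place would still need its own copy.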

forward-events-clone-per-event

Severity: Medium | Confidence: High | ROI: Medium
Location: bridge_forward.go:176-178, bridge_forward.go:156-159

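Problem: forwardEvents calls events.Clone() on every outbound event, deep-copying the Envelope and its map[string]any Data. Under high-frequency message.delta traffic (6-20/sec/session) this creates sustained allocation pressure. Clone is required for correctness (Hub.Run encodes concurrently), so the goal is to reduce the allocation cost rather than remove the copy.

Proposed Fix: verify that the typed Event.Data path (MessageDeltaData, DoneData, etc.) dominates. If >80% of events use typed Data (not map[string]any), the current Clone is already near-optimal (a struct copy does not trigger deepCopyMap).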

Acceptance Criteria:

  • Add a Prometheus counter that tracks the ratio of typed vs map[string]any Data paths
  • If the map path is < 20%, record it as an accepted risk; otherwise add a fast path for typed Data
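
The measurement and the 20% decision rule above can be sketched as follows. To keep the example free of external dependencies, sync/atomic counters stand in for the Prometheus counter, which is a deliberate substitution; dataPathStats, observe, mapRatio, and acceptAsRisk are hypothetical names, and the field on MessageDeltaData is invented for illustration.

```go
package main

import "sync/atomic"

// MessageDeltaData is a hypothetical typed payload; the field is illustrative.
type MessageDeltaData struct{ Content string }

// dataPathStats counts which Data path each event takes. In the real
// service these would be Prometheus counters; atomics are a stand-in here.
type dataPathStats struct {
	typed  atomic.Uint64
	mapped atomic.Uint64
}

// observe classifies one event payload. Anything that is not a
// map[string]any (including nil) is counted as typed in this sketch.
func (s *dataPathStats) observe(data any) {
	switch data.(type) {
	case map[string]any:
		s.mapped.Add(1)
	default:
		s.typed.Add(1)
	}
}

// mapRatio returns the fraction of observed events on the map path.
func (s *dataPathStats) mapRatio() float64 {
	m, t := s.mapped.Load(), s.typed.Load()
	if m+t == 0 {
		return 0
	}
	return float64(m) / float64(m+t)
}

// acceptAsRisk applies the acceptance criterion: a map share below 20%
// means the current Clone is recorded as an accepted risk.
func acceptAsRisk(s *dataPathStats) bool {
	return s.mapRatio() < 0.20
}
```

A call to observe would sit next to the Clone call in forwardEvents, so the ratio reflects exactly the events that pay the copy cost.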

snapshot-conns-per-route-allocation

Severity: Medium | Confidence: High | ROI: Medium
Location: hub.go:229-238, hub.go:454-455

Problem: snapshotConns allocates a new []SessionWriter slice on every routeMessage call. 10 events/sec * 10 sessions = 100 slice allocations/sec. Hub.Run is single-threaded, so it can iterate directly under RLock instead of taking a snapshot.

Current Pattern:

func (h *Hub) snapshotConns(sessionID string) []SessionWriter {
    h.mu.RLock()
    sessionConns := h.sessions[sessionID]
    conns := make([]SessionWriter, 0, len(sessionConns))
    for conn := range sessionConns {
        conns = append(conns, conn)
    }
    h.mu.RUnlock()
    return conns
}

Proposed Fix: hold the RLock and iterate directly in routeMessage, deferring error handling (removeSession) to a batch step after the iteration.

Acceptance Criteria:

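  • routeMessage no longer calls snapshotConns and iterates directly under RLock instead
  • Removal of failed conns is deferred to a batch step after the iteration, so the map is never modified under RLock
  • BenchmarkRouteMessage_Allocs shows fewer allocs/op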
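
One way the deferred-removal pattern above might look is sketched below. The Hub layout (a map of conn sets keyed by session ID) is inferred from the snapshotConns excerpt; the struct{} value type, the simplified RouteWrite signature, the fakeConn type, and the choice to delete individual conns rather than call removeSession are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"sync"
)

// SessionWriter is a simplified stand-in for the real interface.
type SessionWriter interface {
	RouteWrite(payload []byte) error
}

// Hub holds per-session connection sets; the value type is assumed.
type Hub struct {
	mu       sync.RWMutex
	sessions map[string]map[SessionWriter]struct{}
}

// routeMessage iterates under RLock without snapshotting, then removes
// failed conns in one batch under the write lock.
func (h *Hub) routeMessage(sessionID string, payload []byte) {
	var failed []SessionWriter // stays nil (no allocation) on the success path

	h.mu.RLock()
	for conn := range h.sessions[sessionID] {
		if err := conn.RouteWrite(payload); err != nil {
			failed = append(failed, conn) // defer removal: map is read-locked
		}
	}
	h.mu.RUnlock()

	if len(failed) == 0 {
		return
	}

	// sync.RWMutex cannot be upgraded, so release the read lock first and
	// take the write lock for the batch removal.
	h.mu.Lock()
	for _, conn := range failed {
		delete(h.sessions[sessionID], conn)
	}
	if len(h.sessions[sessionID]) == 0 {
		delete(h.sessions, sessionID)
	}
	h.mu.Unlock()
}

// fakeConn is a toy connection used only to exercise the sketch.
type fakeConn struct {
	fail   bool
	writes int
}

func (c *fakeConn) RouteWrite(_ []byte) error {
	if c.fail {
		return errors.New("write failed")
	}
	c.writes++
	return nil
}
```

The trade-off is that RouteWrite now runs while the read lock is held, so it must neither block for long nor re-acquire h.mu; this is the "Medium" risk the priority table assigns to this finding.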

Implementation Priority

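| Finding | Priority | Effort | Risk | Impact |
|---|---|---|---|---|
| per-connection-re-encode | P1 | Medium | Low | Hot-path encoding reduced by N-1 calls |
| forward-events-clone | P2 | Small | Low | Acceptable once the typed path is confirmed dominant |
| snapshot-allocation | P2 | Medium | Medium | Must handle error removal under RLock |

Recommended starting point: per-connection-re-encode, which offers the largest performance gain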


Out of Scope

  • Hub.Run single-threaded bottleneck (a known design choice; a sharding refactor has low ROI)
  • Conn.writeCh buffer size (64 is reasonable; no evidence of contention)

Verification

  • make test passes with no regressions
  • make lint produces no new warnings
  • go test -bench=BenchmarkRouteMessage -count=5 confirms the improvement

Related: issue 526 (accumMu lock contention)

Metadata

    Labels

    P2 (High: affects many users, daily occurrences)
    architecture (Domain: design patterns, coupling, separation of concerns)
