Add pprof & mcp metrics in api-service and request retry in aenv sdk#71
Add pprof & mcp metrics in api-service and request retry in aenv sdk#71lanmaoxinqing merged 19 commits intomainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the stability and observability of both the Python SDK and the Go API service. The Python SDK now features sophisticated retry logic for establishing MCP sessions and manages persistent sessions more efficiently, while carefully avoiding retries for potentially non-idempotent tool executions. Concurrently, the Go API service gains powerful profiling capabilities through Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces several valuable improvements across the api-service and the aenv Python SDK, enhancing robustness, performance, and observability through additions like pprof endpoints and MCP request metrics, and a robust retry mechanism for MCP sessions. However, a security audit identified critical vulnerabilities, including the exposure of unauthenticated pprof endpoints, a potential Denial of Service (DoS) due to unrestricted memory allocation when reading request bodies, and the logging of potentially sensitive information in plain text. Addressing these security concerns is paramount. Additionally, a minor suggestion for code cleanup in the Python SDK has been noted to improve maintainability.
|
|
||
| mainRouter.GET("/health", healthChecker) | ||
| mainRouter.GET("/metrics", gin.WrapH(promhttp.Handler())) | ||
| pprof.Register(mainRouter) |
There was a problem hiding this comment.
The pprof profiling endpoints are registered on the mainRouter without any authentication or access control. This exposes sensitive internal state and performance data of the application to anyone who can reach the service. It is recommended to wrap the pprof registration in an authentication middleware or expose it on a separate, internal-only port.
api-service/middleware/metrics.go
Outdated
| return func(c *gin.Context) { | ||
| var rpcMethod string | ||
| if c.Request.Method == "POST" && c.Request.Body != nil { | ||
| if body, err := io.ReadAll(c.Request.Body); err == nil && len(body) > 0 { |
There was a problem hiding this comment.
The MCPMetricsMiddleware uses io.ReadAll to read the entire request body into memory without any size limit. An attacker can send a very large request body to exhaust the server's memory, leading to a Denial of Service. It is recommended to use io.LimitReader to restrict the maximum size of the request body that can be read into memory.
| @@ -649,8 +697,39 @@ async def call_tool( | |||
| f"{self._log_prefix()} Executing tool: {actual_tool_name} in environment {self.env_name}, arguments={arguments}" | |||
| f"NOT retrying (non-idempotent safety). | " | ||
| f"Environment: {self.env_name} | " | ||
| f"Tool: {actual_tool_name} | " | ||
| f"Arguments: {arguments} | " |
aenv/src/aenv/core/environment.py
Outdated
| async def _ensure_mcp_session(self): | ||
| """Ensure a persistent MCP session is established.""" | ||
| if not self._mcp_session_active: | ||
| client = await self._get_mcp_client() | ||
| logger.info(f"{self._log_prefix()} Opening persistent MCP session") | ||
| await client.__aenter__() | ||
| self._mcp_session_active = True |
There was a problem hiding this comment.
This method _ensure_mcp_session is defined twice in the Environment class. The second definition at line 1005 will overwrite this one at runtime, making this block of code unreachable. This can be confusing for future maintenance. To improve code clarity and remove dead code, I recommend deleting this first definition.
Add comprehensive Prometheus metrics (requests, latency, instance ops, active instances gauge, uptime gauge) to api-service. Implement labels propagation through full chain: SDK -> api-service -> faas-api-service -> metaservice, enabling per-experiment/owner/app metric breakdown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Share request metrics with MCP data plane traffic on port 8081. Skip recording /health endpoint errors to avoid noise from proxy-less K8s liveness probes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove mockEnvInstanceService and mockBackendClient that were defined but never used, causing golangci-lint unused errors in CI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each call_tool/list_tools was wrapping the MCP client in `async with client:`, causing a full session lifecycle (initialize/call/DELETE) per invocation. When the server was busy, re-initialize could hang for 60s. Move session management to Environment __aenter__/__aexit__ so one session is reused for all tool calls within the context manager. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add gin-contrib/pprof v1.4.0 for runtime profiling - Fix resp.Body leak in faas_model/http_client.go Into() method - Fix resp.Body leak in faas_client.go InitializeFunction() - Add HTTP Transport pool config (MaxIdleConns, IdleConnTimeout) - Add MCPMetricsMiddleware with real path and JSON-RPC method extraction - Remove unused mock types in labels test to fix golangci-lint Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add circuit breaker (5 consecutive failures -> UnrecoverableEnvironmentError) - Rebuild MCP client after tool execution failure to break ClosedResourceError cascade - Add retry logic to __aenter__ session establishment (max 2 attempts) - Remove duplicate _ensure_mcp_session definition (dead code) - Add unit tests for circuit breaker and __aenter__ retry - Add e2e test suite for session persistence verification Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… leak - Add shared http.Transport with connection pooling to MCPGateway - SSE handler and HTTP proxy reuse transport instead of creating per-request - Update Dockerfile from golang:1.21 to golang:1.23 to match go.work Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ation tuning Truncate request/response body to 2KB in logs, compress rotated files, reduce max backups from 30 to 10, set 7-day max age, and raise file log level from Debug to Info. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…accuracy Add LimitReader(8KB) to prevent OOM, set GetBody for reverse proxy rewind, skip SSE duration histogram, normalize endpoint labels to prevent cardinality explosion, and store body/rpc_method in context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename the business metric label key from "env" to "envName" across code, metrics, and tests to avoid ambiguity with environment variables. Also fix typo in lumberjack import (natefinsh -> natefinch) and remove unused truncateString function. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add test_e2e_resilience.py covering SDK recovery scenarios: - Tool call recovery after api-service pod restart - Session rebuild after transient network failure - Instance preservation across session rebuilds - Concurrent tool calls during network jitter - Circuit breaker behavior under rapid failures Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ric label The previous commit incorrectly renamed the instance label key from "env" to "envName" in business logic and storage. The intent was only to rename the Prometheus metric label to avoid collision with system env variables. Instance labels remain "env" internally; the collector maps inst.Labels["env"] to Prometheus label "envName" via BusinessLabelKeys. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- aenv: increase __aenter__ max_attempts from 2 to 3 for consistency with call_tool session retry logic - api-service: cap large body read at 1MB in MCPMetricsMiddleware to prevent memory exhaustion from abnormal requests; remove unused _req_body context key Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
daihaowz
left a comment
There was a problem hiding this comment.
严重问题 (CRITICAL / HIGH)
- Redis SCAN 游标变量遮蔽 — 大数据集下无限循环
文件: api-service/service/redis.go — ListEnvInstancesFromRedis 和 deleteKeysByScan
for {
scannedKeys, cursor, err := r.client.Scan(r.ctx, cursor, pattern, 100).Result()
// ^^^^^^ := 创建了新的局部变量,外层 cursor 始终为 0
:= 遮蔽了外层的 cursor,下一轮循环传给 Scan 的始终是 0,导致大数据集下无限循环。小数据集碰巧一次扫完所以没触发。(注:此 bug 在 main 上已存在,但本分支修改了 redis.go 却没有修复)
- pprof 端点在生产环境无鉴权暴露
文件: api-service/main.go
pprof.Register(mainRouter)
/debug/pprof/ 注册在外部可访问的主路由上,无任何鉴权中间件。攻击者可获取 goroutine dump、堆内存信息,甚至触发 CPU profiling 造成 DoS。
- _rebuild_mcp_client 未持有 session 锁 — 并发竞态
文件: aenv/src/aenv/core/environment.py:190-200
_rebuild_mcp_client 直接修改 self._mcp_client 和 self._mcp_session_active,未获取 _mcp_session_lock。而 _ensure_mcp_session 通过锁保护同样的字段。当多个协程共享同一个 Environment 实例时,会导致状态不一致(新创建的 client 被 rebuild 覆盖销毁)。
- envInstance.Env 空指针 panic
文件: api-service/service/redis.go — ListEnvInstancesFromRedis
if envInstance != nil && envInstance.Env.Name != "" {
当 envInstance 非 nil 但 envInstance.Env 为 nil 时,会空指针 panic。应改为 envInstance != nil && envInstance.Env != nil && envInstance.Env.Name != ""。
- environmentVariables key 更名 — API 兼容性破坏
文件: api-service/controller/env_instance.go
// main: backendEnv.DeployConfig["environmentVariables"] = ...
// branch: backendEnv.DeployConfig["environment_variables"] = ...
key 从 camelCase 改为 snake_case,如果下游 controller 读的还是 "environmentVariables",环境变量将静默丢失。
- Redis fallback 丢失 version 过滤
文件: api-service/controller/env_instance.go — ListEnvInstances
当 Redis 数据不完整回退到 service 层时,ListEnvInstances(envName) 只按名称过滤,丢弃了版本信息,返回所有版本的实例。
中等问题 (MEDIUM)
- _req_body 上下文键从未设置 — 死代码
文件: api-service/middleware/logging.go:866 + middleware/metrics.go
LoggingMiddleware 检查 c.Get("_req_body"),但 MCPMetricsMiddleware 从未调用 c.Set("_req_body", ...),导致该分支永远不会执行。
- Collector Stop() 无 double-close 保护
文件: api-service/metrics/collector.go
func (c *Collector) Stop() { close(c.stopCh) }
无 sync.Once 保护,重复调用会 panic: close of closed channel。
- select{} 之后的 defer 永远不会执行
文件: api-service/main.go
defer metricsCollector.Stop()
defer cleanManager.Stop()
select {} // 永远阻塞,defer 不可达
资源清理代码实际上是死代码,Stop() 永远不会被调用。
- 熔断器无恢复机制
文件: aenv/src/aenv/core/environment.py:697-701
_consecutive_tool_errors 达到阈值后 Environment 永久不可用。没有 half-open 状态、超时重置或手动 reset 方法。长期运行的 Environment 一旦触发熔断就永久报废。
- 异步锁延迟初始化存在竞态
文件: aenv/src/aenv/core/environment.py:808-809, 1046-1047
if not hasattr(self, "_init_lock"):
self._init_lock = asyncio.Lock()
两个协程可在 hasattr 检查和赋值之间交替执行,创建两个不同的锁实例。应在 init 中初始化。
- Prometheus 标签基数爆炸风险
文件: api-service/middleware/metrics.go + api-service/metrics/metrics.go
- 主路由 FullPath() 对 404 路径回退到原始 URL → 扫描器可创建无限时序
- InstanceUptimeSeconds 包含 instance_id 标签 → 高频创建/销毁实例时基数爆炸
- CreateEnvInstance 在 binding 验证前就用 req.EnvName 做 label → 攻击者可注入任意值
- MCP body 静默截断
文件: api-service/middleware/metrics.go
body 超过 ~1MB+8KB 时被静默截断,读取错误被吞掉(remaining, _ := io.ReadAll(...)),向上游转发不完整的 body 而无任何告警。
- Warmup 未发送请求体
文件: api-service/service/env_instance.go
httpReq, err := http.NewRequest("PUT", url, nil) // body 是 nil
req *backend.Env 参数被接受但从未序列化发送。
- 5xx 响应不重试
文件: aenv/src/aenv/client/scheduler_client.py
重试逻辑只捕获 httpx.RequestError(网络层错误),服务端 500/503 等瞬态错误不会被重试。
- wait_for_status 不处理 TERMINATED 状态
文件: aenv/src/aenv/client/scheduler_client.py
等待 RUNNING 状态时只短路 FAILED,对 TERMINATED(终态)仍会一直轮询直到超时。
- Labels 在 list 响应中丢失
文件: api-service/service/schedule_client.go — ListEnvInstances
PodListResponseData 没有 Labels 字段,通过 ScheduleClient 路径 list 时创建时设置的 labels 全部丢失。
- Label 命名混淆
文件: api-service/metrics/metrics.go + metrics/collector.go
Prometheus label 名为 envName,但从 inst.Labels["env"] 取值。命名不一致容易引起后续维护 bug。
低等问题 (LOW)
┌─────┬──────────────────────────────┬────────────────────────────────────────────────────┐
│ # │ 文件 │ 问题 │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 19 │ environment.py:252 │ initialize() dummy 路径返回 None 而非声明的 bool │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 20 │ environment.py:513 │ call_function 使用可变默认参数 {} │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 21 │ environment.py:220 │ aenter 循环后有不可达的 return self │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 22 │ logging.py:82 │ getLogger 的 type 参数被完全忽略 │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 23 │ scheduler_client.py │ 使用废弃的 asyncio.get_event_loop() │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 24 │ redis.go │ Redis key 无 TTL,无状态对账机制,长期内存泄漏 │
├─────┼──────────────────────────────┼────────────────────────────────────────────────────┤
│ 25 │ env_instance.go (controller) │ TTL 行为从"始终设置"变为"条件设置",可能影响客户端 │
└─────┴──────────────────────────────┴────────────────────────────────────────────────────┘
已修复的 Bug (本分支修复了 main 上的问题)
- ListEnvInstances 中 token 为 nil 时缺少 return,导致空指针 — 已修复
- DeployConfig 为 nil 时直接写 map 导致 panic — 已修复(nil check 前移)
建议优先修复
Resolve conflicts in main.go keeping admission control changes alongside upstream PR #71 updates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No description provided.