LLM Gateway Metrics

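An LLM API performance monitoring component built on Spring Cloud Gateway. It is designed for large language model API gateways, provides precise request timing, and exposes the results as Prometheus metrics.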

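Features

Precise timing
- TTFT (Time To First Token): time until the first response byte is written
- Total response time: end-to-end latency from request start to response completion
- Response size: running count of response bytes
Streaming and non-streaming support
- Automatically detects SSE (Server-Sent Events) streaming responses
- Compatible with standard non-streaming JSON responses
- Supports text/event-stream and other streaming Content-Types
Flexible route matching
- Exact match on Route ID
- Prefix match on Route ID
- Ant-style match on path patterns
Multi-dimensional labels
- Request ID: unique request identifier
- Model: LLM model name (low cardinality, suitable as a Prometheus label)
- Tenant: tenant identifier (high cardinality, logged only by default)
- Status Code: HTTP status code and status class (2xx, 4xx, 5xx)
- Outcome: request result (success, client_error, server_error, cancelled, error)
Dual-channel output
- Prometheus Metrics: exposed via the /actuator/prometheus endpoint, ready for Grafana dashboards
- Structured Logging: structured log output for log analysis and alerting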
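
The stream detection listed above can be pictured with a small sketch. This is not the component's actual code: the class name `StreamModeDetector` is hypothetical, and the set of "other streaming Content-Types" (here `application/x-ndjson` and `application/stream+json`) is an assumption, since this README only names `text/event-stream` explicitly.

```java
import java.util.Locale;

/** Hypothetical helper: classifies a response by its Content-Type header. */
final class StreamModeDetector {

    private StreamModeDetector() {}

    /** Returns "stream", "non_stream", or "unknown" (the stream_mode values used in this README). */
    static String detect(String contentType) {
        if (contentType == null || contentType.isBlank()) {
            return "unknown"; // no Content-Type header to inspect
        }
        // Drop parameters such as "; charset=utf-8" and normalize case.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        boolean streaming = mediaType.equals("text/event-stream")   // SSE
                || mediaType.equals("application/x-ndjson")         // assumed streaming type
                || mediaType.equals("application/stream+json");     // assumed streaming type
        return streaming ? "stream" : "non_stream";
    }

    public static void main(String[] args) {
        System.out.println(detect("text/event-stream;charset=UTF-8"));
        System.out.println(detect("application/json"));
        System.out.println(detect(null));
    }
}
```

Matching on the media type only, with parameters stripped, keeps the check stable when an upstream adds a charset parameter.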

Quick Start

Requirements

JDK 21+
Maven 3.6+
Spring Boot 3.4+
Spring Cloud Gateway 2024.0.2+

Build the project

./mvnw clean package

Run the application

./mvnw spring-boot:run

Monitoring endpoints

Health check: http://localhost:8080/actuator/health
Metrics: http://localhost:8080/actuator/metrics
Prometheus: http://localhost:8080/actuator/prometheus

Configuration

Full configuration example

llm:
  gateway:
    timing:
      # Global switch
      enabled: true

      # Filter order (by default, runs before NettyWriteResponseFilter)
      order: -2

      # Route IDs matched exactly
      route-ids:
        - llm-openai-route
        - llm-anthropic-route

      # Route ID prefixes (matching any one is enough)
      route-id-prefixes:
        - llm
        - ai

      # Path patterns (Ant style)
      path-patterns:
        - /v1/chat/completions
        - /v1/completions
        - /v1/responses
        - /**/v1/chat/completions
        - /**/v1/completions

      # Custom request header names
      header-names:
        request-id: X-Request-Id
        model: X-LLM-Model
        tenant: X-Tenant-Id

      # Prometheus metrics settings
      metrics:
        enabled: true
        meter-prefix: llm.gateway
        include-model-tag: true      # model as a label (low cardinality)
        include-tenant-tag: false    # tenant not used as a label (high cardinality)

      # Logging settings
      logging:
        enabled: true

# Spring Boot Actuator settings
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  endpoint:
    health:
      show-details: always

Configuration reference

Property	Type	Default	Description
`llm.gateway.timing.enabled`	Boolean	true	Global switch
`llm.gateway.timing.order`	Integer	-2	GlobalFilter execution order
`llm.gateway.timing.route-ids`	List	[]	Route IDs matched exactly
`llm.gateway.timing.route-id-prefixes`	List	["llm"]	Route ID prefixes
`llm.gateway.timing.path-patterns`	List	see configuration file	Ant-style path patterns
`llm.gateway.timing.header-names.request-id`	String	X-Request-Id	Request header carrying the request ID
`llm.gateway.timing.header-names.model`	String	X-LLM-Model	Request header carrying the model name
`llm.gateway.timing.header-names.tenant`	String	X-Tenant-Id	Request header carrying the tenant ID
`llm.gateway.timing.metrics.enabled`	Boolean	true	Whether Prometheus metrics are enabled
`llm.gateway.timing.metrics.meter-prefix`	String	llm.gateway	Metric name prefix
`llm.gateway.timing.metrics.include-model-tag`	Boolean	true	Whether model is added as a label
`llm.gateway.timing.metrics.include-tenant-tag`	Boolean	false	Whether tenant is added as a label
`llm.gateway.timing.logging.enabled`	Boolean	true	Whether logging is enabled

Metrics

Prometheus Metrics

The component automatically registers the following Prometheus metrics:

1. `llm_gateway_ttft_seconds`

Time To First Token (TTFT): time until the first response byte

Type: Timer
Labels:
- route_id: route ID
- method: HTTP method (e.g. GET, POST)
- status_class: status class (2xx, 4xx, 5xx)
- stream_mode: response mode (stream, non_stream, unknown)
- model: LLM model name (when include-model-tag=true)
- tenant: tenant ID (when include-tenant-tag=true)

2. `llm_gateway_total_seconds`

Total Response Time: end-to-end response time

Type: Timer
Labels: same as `llm_gateway_ttft_seconds`

3. `llm_gateway_response_bytes`

Response Bytes: response size

Type: Distribution Summary
Labels: same as `llm_gateway_ttft_seconds`

Log Output

First Output Event (TTFT)

{
  "event": "llm_timing_first_output",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "route_id": "llm-openai-route",
  "method": "POST",
  "path": "/v1/chat/completions",
  "model": "gpt-4",
  "tenant": "customer-001",
  "status": 200,
  "status_class": "2xx",
  "stream_mode": "stream",
  "start_epoch_millis": 1717516800000,
  "ttft_nanos": 125000000,
  "ttft_ms": 125.0,
  "response_bytes": 0,
  "event_type": "first_output"
}
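
The relationship between `ttft_nanos` and `ttft_ms` in the event above is a plain unit conversion. A minimal sketch follows, assuming the elapsed time comes from the monotonic `System.nanoTime()` clock while `start_epoch_millis` comes from the wall clock; the record name `FirstOutput` is hypothetical and not taken from the project.

```java
/** Hypothetical value type mirroring a few fields of the first-output log event. */
record FirstOutput(long startEpochMillis, long ttftNanos) {

    /** Converts the nanosecond measurement to fractional milliseconds. */
    double ttftMs() {
        return ttftNanos / 1_000_000.0;
    }

    /** Captures a measurement from a start timestamp taken earlier with System.nanoTime(). */
    static FirstOutput capture(long startEpochMillis, long startNanos) {
        // nanoTime() is monotonic, so the difference is safe even if the wall clock jumps.
        return new FirstOutput(startEpochMillis, System.nanoTime() - startNanos);
    }

    public static void main(String[] args) {
        FirstOutput sample = new FirstOutput(1717516800000L, 125_000_000L);
        System.out.println(sample.ttftMs()); // prints 125.0
    }
}
```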

Completion Event

{
  "event": "llm_timing_completed",
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "route_id": "llm-openai-route",
  "method": "POST",
  "path": "/v1/chat/completions",
  "model": "gpt-4",
  "tenant": "customer-001",
  "status": 200,
  "status_class": "2xx",
  "stream_mode": "stream",
  "start_epoch_millis": 1717516800000,
  "ttft_nanos": 125000000,
  "ttft_ms": 125.0,
  "total_nanos": 3500000000,
  "total_ms": 3500.0,
  "response_bytes": 2048,
  "signal_type": "ON_COMPLETE",
  "outcome": "success"
}
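
How `status_class` and `outcome` might be derived from the status code and the terminating signal can be sketched as follows. The mapping is an assumption inferred from the value lists in this README (2xx/4xx/5xx and success, client_error, server_error, cancelled, error); the class `OutcomeClassifier` and its local `Signal` enum are hypothetical, and only `ON_COMPLETE` appears in the sample log.

```java
/** Hypothetical classifier; the exact rules used by the component are not documented here. */
final class OutcomeClassifier {

    /** Assumed terminating signals; only ON_COMPLETE is shown in the README. */
    enum Signal { ON_COMPLETE, ON_ERROR, CANCEL }

    private OutcomeClassifier() {}

    /** Maps 200 to "2xx", 404 to "4xx", and so on; anything outside 100..599 is "unknown". */
    static String statusClass(int status) {
        return (status >= 100 && status <= 599) ? (status / 100) + "xx" : "unknown";
    }

    /** Combines the terminating signal with the HTTP status into one outcome value. */
    static String outcome(Signal signal, int status) {
        return switch (signal) {
            case CANCEL -> "cancelled";   // client disconnected or subscription cancelled
            case ON_ERROR -> "error";     // the response stream failed
            case ON_COMPLETE -> {
                if (status >= 500) yield "server_error";
                if (status >= 400) yield "client_error";
                yield "success";
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(statusClass(200) + " " + outcome(Signal.ON_COMPLETE, 200));
    }
}
```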

Usage Examples

1. Configure a Gateway route

spring:
  cloud:
    gateway:
      routes:
        - id: llm-openai-route
          uri: https://api.openai.com
          predicates:
            - Path=/openai/**
          filters:
            - StripPrefix=1

2. Send a request

curl -X POST http://localhost:8080/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Request-Id: test-001" \
  -H "X-LLM-Model: gpt-4" \
  -H "X-Tenant-Id: customer-001" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
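
For callers written in Java, the same request can be built with the JDK's `java.net.http` API. This is only an illustrative equivalent of the curl command above; the class name `GatewayRequestExample` is hypothetical, and the header names are the defaults from the configuration section.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical client-side helper that mirrors the curl example. */
final class GatewayRequestExample {

    private GatewayRequestExample() {}

    /** Builds (but does not send) a chat completion request through the gateway. */
    static HttpRequest build(String baseUrl) {
        String body = """
                {"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}
                """;
        return HttpRequest.newBuilder(URI.create(baseUrl + "/openai/v1/chat/completions"))
                .header("Content-Type", "application/json")
                .header("X-Request-Id", "test-001")    // becomes request_id in logs
                .header("X-LLM-Model", "gpt-4")         // becomes the model label
                .header("X-Tenant-Id", "customer-001")  // logged, not a label by default
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:8080");
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    }
}
```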

3. Query Prometheus metrics

# Query TTFT metrics (including P95)
curl http://localhost:8080/actuator/prometheus | grep llm_gateway_ttft

# Sample output
llm_gateway_ttft_seconds{model="gpt-4",route_id="llm-openai-route",status_class="2xx",stream_mode="stream",quantile="0.95"} 0.15

4. Grafana visualization

Use the following PromQL queries in Grafana. The histogram_quantile queries rely on the _bucket series, which Micrometer publishes only when percentile histograms are enabled for the timers.

# TTFT P95 by model
histogram_quantile(0.95, 
  sum(rate(llm_gateway_ttft_seconds_bucket[5m])) by (le, model)
)

# Total response time P99
histogram_quantile(0.99, 
  sum(rate(llm_gateway_total_seconds_bucket[5m])) by (le)
)

# Response size trend (bytes per second)
rate(llm_gateway_response_bytes_sum[5m])

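Architecture

Core components

LlmTimingGlobalFilter (GlobalFilter)
    ↓
    ├─ LlmGatewayTimingProperties (configuration)
    │
    └─ LlmTimingRecorder (interface)
         ├─ MicrometerLlmTimingRecorder (Prometheus)
         └─ LoggingLlmTimingRecorder (logging)

How it works

Request interception: LlmTimingGlobalFilter runs as a GlobalFilter and intercepts the request before it is forwarded to the upstream service
Response decoration: the original response is wrapped in a ServerHttpResponseDecorator
Stream observation: the response data stream is observed through writeWith and writeAndFlushWith
Timing:
- Records the request start time (startNanos)
- Captures the arrival time of the first byte (firstOutputNanos)
- Records the request end time (endNanos)
Byte counting: accumulates the number of response bytes (responseBytes)
Multi-channel output: invokes every LlmTimingRecorder implementation, writing to Prometheus and to the log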
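
The per-request bookkeeping described above can be sketched with JDK atomics. This is a simplified illustration rather than the filter's real code: the class `RequestTimingState` and its method names are hypothetical, and it stores elapsed nanoseconds (instead of raw `System.nanoTime()` values) so that `-1` can safely mean "not recorded yet".

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-request state that a response decorator could update for each chunk. */
final class RequestTimingState {
    private final long startNanos;
    private final AtomicLong ttftNanos = new AtomicLong(-1);   // -1 = first chunk not seen yet
    private final AtomicLong totalNanos = new AtomicLong(-1);  // -1 = response not finished yet
    private final LongAdder responseBytes = new LongAdder();

    RequestTimingState(long startNanos) {
        this.startNanos = startNanos;
    }

    /** Call for every written chunk. Returns true only for the chunk that set the TTFT. */
    boolean onChunk(int byteCount, long nowNanos) {
        responseBytes.add(byteCount);
        // compareAndSet makes sure that only the first chunk wins, without locking.
        return ttftNanos.compareAndSet(-1, nowNanos - startNanos);
    }

    /** Call when the response completes, fails, or is cancelled. Only the first call counts. */
    boolean onEnd(long nowNanos) {
        return totalNanos.compareAndSet(-1, nowNanos - startNanos);
    }

    long ttftNanos() { return ttftNanos.get(); }
    long totalNanos() { return totalNanos.get(); }
    long responseBytes() { return responseBytes.sum(); }

    public static void main(String[] args) {
        RequestTimingState state = new RequestTimingState(System.nanoTime());
        state.onChunk(512, System.nanoTime());
        state.onEnd(System.nanoTime());
        System.out.println(state.responseBytes() + " bytes, ttft=" + state.ttftNanos() + "ns");
    }
}
```

In a real decorator, `onChunk` would be called from the `writeWith` and `writeAndFlushWith` data path and `onEnd` from the terminal signal.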

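Key design points

Non-intrusive: built on the Spring Cloud Gateway GlobalFilter mechanism, so no business code needs to change
Reactive: fully compatible with the WebFlux reactive programming model, including backpressure and streaming
High performance: uses AtomicLong and LongAdder for lock-free counting
Fault tolerant: exceptions in the measurement logic do not affect business requests (guarded by try-catch)
Low cardinality: high-cardinality fields such as tenant are excluded from Prometheus labels by default to avoid a metric explosion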
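
The fault-tolerance point can be illustrated with a composite recorder that isolates each delegate. The interface below is a hypothetical stand-in: the real `LlmTimingRecorder` signature is not shown in this README, so `TimingRecorder.onCompleted(String, long)` is an assumption made only for the sketch.

```java
import java.util.List;

/** Hypothetical recorder contract, standing in for LlmTimingRecorder. */
interface TimingRecorder {
    void onCompleted(String requestId, long totalNanos);
}

/** Fans one event out to several recorders and shields the request from their failures. */
final class CompositeTimingRecorder implements TimingRecorder {
    private final List<TimingRecorder> delegates;

    CompositeTimingRecorder(List<TimingRecorder> delegates) {
        this.delegates = List.copyOf(delegates);
    }

    @Override
    public void onCompleted(String requestId, long totalNanos) {
        for (TimingRecorder delegate : delegates) {
            try {
                delegate.onCompleted(requestId, totalNanos);
            } catch (RuntimeException e) {
                // A failing recorder must not break the request or the remaining recorders.
                System.err.println("timing recorder failed: " + e);
            }
        }
    }
}
```

With this shape, a metrics backend outage degrades observability only; the proxied response is still delivered.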

Tech stack

Spring Boot: 3.4.6
Spring Cloud Gateway: 2024.0.2
Java: 21
Micrometer: Registry Prometheus
Reactive Streams: Project Reactor
Maven: 3.6+

Project structure

src/main/java/com/glmapper/llm/metrics/
├── LlmGatewayMetricsApplication.java       # Main application class
├── LlmGatewayTimingConfiguration.java      # Auto-configuration class
├── LlmGatewayTimingProperties.java         # Configuration properties
├── LlmTimingGlobalFilter.java              # Core filter
├── LlmTimingRecorder.java                  # Recorder interface
├── LlmTimingSnapshot.java                  # Timing snapshot
├── MicrometerLlmTimingRecorder.java        # Prometheus recorder
└── LoggingLlmTimingRecorder.java           # Logging recorder

Best practices

1. Control label cardinality

llm:
  gateway:
    timing:
      metrics:
        include-model-tag: true      # ✅ model usually has only a few dozen values
        include-tenant-tag: false    # ❌ tenant may have thousands of values
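
The effect of the two switches on the label set can be sketched as below. The helper `MetricTags` is hypothetical, and the fallback value "unknown" for a missing header is an assumption; the label names are the ones listed in the metrics section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that assembles metric labels according to the include-*-tag switches. */
final class MetricTags {

    private MetricTags() {}

    static Map<String, String> build(String routeId, String method, String statusClass,
                                     String streamMode, String model, String tenant,
                                     boolean includeModelTag, boolean includeTenantTag) {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("route_id", routeId);
        tags.put("method", method);
        tags.put("status_class", statusClass);
        tags.put("stream_mode", streamMode);
        if (includeModelTag) {
            tags.put("model", model == null ? "unknown" : model);   // low cardinality
        }
        if (includeTenantTag) {
            tags.put("tenant", tenant == null ? "unknown" : tenant); // high cardinality, opt-in
        }
        return tags;
    }

    public static void main(String[] args) {
        // Defaults from this README: model included, tenant excluded.
        System.out.println(build("llm-openai-route", "POST", "2xx", "stream",
                "gpt-4", "customer-001", true, false));
    }
}
```

Every distinct combination of label values creates a separate time series, which is why an unbounded field such as tenant stays out of the label set unless explicitly enabled.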

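2. Route matching priority

Prefer Route ID matching; it is cheaper than path-pattern matching:

llm:
  gateway:
    timing:
      # Priority 1: exact match
      route-ids:
        - llm-openai-route

      # Priority 2: prefix match
      route-id-prefixes:
        - llm

      # Priority 3: path match (slower)
      path-patterns:
        - /v1/chat/completions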
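
The three-step priority can be sketched in plain Java. This is an approximation under stated assumptions: `RouteMatcher` is a hypothetical class, and `antToRegex` is a simplified stand-in for Ant-style matching that only handles `**`, `*`, and `?`; it does not reproduce every rule of Spring's path matchers.

```java
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical matcher: exact route ID first, then ID prefix, then path pattern. */
final class RouteMatcher {
    private final Set<String> routeIds;
    private final List<String> prefixes;
    private final List<Pattern> pathPatterns;

    RouteMatcher(Set<String> routeIds, List<String> prefixes, List<String> antPatterns) {
        this.routeIds = Set.copyOf(routeIds);
        this.prefixes = List.copyOf(prefixes);
        // Compile patterns once so the per-request cost is only the regex match.
        this.pathPatterns = antPatterns.stream().map(RouteMatcher::antToRegex).toList();
    }

    boolean matches(String routeId, String path) {
        if (routeId != null) {
            if (routeIds.contains(routeId)) return true;                        // 1. exact match
            if (prefixes.stream().anyMatch(routeId::startsWith)) return true;   // 2. prefix match
        }
        return path != null
                && pathPatterns.stream().anyMatch(p -> p.matcher(path).matches()); // 3. path match
    }

    /** Simplified Ant-to-regex conversion (assumption: only **, * and ? are special). */
    static Pattern antToRegex(String ant) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < ant.length()) {
            if (ant.startsWith("/**/", i)) {
                regex.append("/(?:.*/)?"); // zero or more intermediate path segments
                i += 4;
            } else if (ant.startsWith("**", i)) {
                regex.append(".*");
                i += 2;
            } else if (ant.charAt(i) == '*') {
                regex.append("[^/]*");     // anything within one segment
                i++;
            } else if (ant.charAt(i) == '?') {
                regex.append("[^/]");      // exactly one character within a segment
                i++;
            } else {
                regex.append(Pattern.quote(String.valueOf(ant.charAt(i))));
                i++;
            }
        }
        return Pattern.compile(regex.toString());
    }
}
```

The ID checks are a hash lookup and a few string comparisons, so a request whose route ID already matches never reaches the regex step.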

3. Recommended log level

logging:
  level:
    com.glmapper.llm.metrics.LoggingLlmTimingRecorder: INFO

4. Prometheus scrape configuration

scrape_configs:
  - job_name: 'llm-gateway'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']
