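Auto DevOps Architecture Refactoring Guide

📋 Overview

This document describes how the Auto DevOps project was refactored from a hand-written ReAct implementation to a standardized LangChain architecture. It covers the design rationale of the new architecture, its core features, usage, and best practices.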

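🎯 Refactoring Goals

Original Problems

The ReAct loop was hand-written, which made the code complex and hard to maintain
MockLLM returned fixed responses and could not simulate complex scenarios
There was no support from a standardized agent framework
Test scenarios were limited and did not cover real-world usage

Goals

✅ Adopt a standardized LangChain architecture pattern
✅ Implement the complete ReAct flow (the thought-action loop)
✅ Support dynamic scenario configuration and model switching
✅ Stay backward compatible without breaking existing APIs
✅ Provide thorough testing and debugging tools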

🏗️ New Architecture Design

Layered Architecture Diagram

┌──────────────────────────────────────────────────┐
│  FastAPI Layer                                   │
│  web_service.py (routing and HTTP handling)      │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│  Service Layer                                   │
│  diagnosis_core.py (adapter, compatible API)     │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│  Agent Layer                                     │
│  diagnosis_agent.py (high-level agent API)       │
│  react_executor.py (ReAct executor)              │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│  Model Layer                                     │
│  mock_llm.py (enhanced mock model)               │
│  glm_llm.py (real GLM model)                     │
│  model_factory.py (model factory)                │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│  Tool Layer                                      │
│  diagnosis_tools.py (diagnosis tools)            │
│  tool_manager.py (tool manager)                  │
└────────────────────────┬─────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────┐
│  Config Layer                                    │
│  scenarios.py (scenario configs)                 │
│  agent_configs.py (agent configs)                │
└──────────────────────────────────────────────────┘

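Core Components

1. Model Layer

BaseLLM: base class for all LLM implementations; defines the common interface
MockLLM: enhanced mock model that produces scenario-driven dynamic responses
GLMLLM: implementation backed by the real Zhipu GLM-4.6 model
ModelFactory: model factory that centralizes model creation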
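
To make the relationship between the base class, the mock model, and the factory concrete, here is a minimal sketch. The method name generate, the responses constructor argument, and the registry inside ModelFactory are assumptions made for illustration; the real classes in src/core/models/ may expose a different interface.

```python
from abc import ABC, abstractmethod


class BaseLLM(ABC):
    """Common interface that every model implementation is assumed to share."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's next response for the given prompt."""


class MockLLM(BaseLLM):
    """Replays scripted responses in order; stands in for a real model in tests."""

    def __init__(self, responses: list[str]):
        if not responses:
            raise ValueError("MockLLM needs at least one scripted response")
        self._responses = list(responses)
        self._index = 0

    def generate(self, prompt: str) -> str:
        # Once the script is exhausted, keep repeating the last response.
        response = self._responses[min(self._index, len(self._responses) - 1)]
        self._index += 1
        return response


class ModelFactory:
    """Maps a model-type name to a constructor (hypothetical registry design)."""

    _registry = {"mock": MockLLM}

    @classmethod
    def create(cls, model_type: str, **kwargs) -> BaseLLM:
        if model_type not in cls._registry:
            raise ValueError(f"unknown model type: {model_type}")
        return cls._registry[model_type](**kwargs)


llm = ModelFactory.create(
    "mock",
    responses=["Thought: check the build log", "Final Answer: done"],
)
first = llm.generate("diagnose")
second = llm.generate("diagnose")
third = llm.generate("diagnose")
print(first, second, third, sep="\n")
```

Because callers only depend on BaseLLM, swapping the mock for a real model would be a change in the factory call rather than in the agent code.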

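2. Tool Layer

Diagnosis Tools: the set of diagnosis tools (build logs, K8s logs, service logs, knowledge base queries)
Tool Manager: centralizes tool registration and invocation
Dynamic Response: tool responses can be driven by the scenario configuration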
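
The registration and invocation role of the tool manager can be sketched as follows. This is an illustrative sketch only: the ToolManager methods (register, names, call) and the make_scenario_tool helper are hypothetical names, not the API of tool_manager.py.

```python
from typing import Callable, Dict, List


class ToolManager:
    """Keeps a name-to-function registry and dispatches tool calls by name."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, func: Callable[[str], str]) -> None:
        self._tools[name] = func

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, tool_input: str) -> str:
        # Return an error string instead of raising, so an agent loop
        # can feed the problem back to the model as an observation.
        if name not in self._tools:
            return f"Error: unknown tool '{name}'"
        return self._tools[name](tool_input)


def make_scenario_tool(canned_output: str) -> Callable[[str], str]:
    """Build a tool that ignores its input and returns a scenario-defined result."""

    def tool(_tool_input: str) -> str:
        return canned_output

    return tool


manager = ToolManager()
manager.register("get_build_log", make_scenario_tool("[build log] BUILD SUCCESS"))
manager.register("get_k8s_log", make_scenario_tool("[k8s log] readiness probe failed"))

print(manager.names())
print(manager.call("get_k8s_log", "_"))
```

Swapping the canned outputs per scenario is one way the "dynamic response" behavior could be achieved without touching the agent.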

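3. Agent Layer

ReActExecutor: executor for the complete ReAct flow
- Standardized thought-action loop
- Intelligent tool invocation and result handling
- Robust error handling
DiagnosisAgent: high-level agent interface
- Combines the model, tools, and executor
- Supports scenario configuration and model switching
- Offers both streaming and synchronous invocation

4. Service Layer

DiagnosisService: service adapter
- Stays fully compatible with the original FastAPI interface
- Uses the new core architecture internally
- Provides advanced configuration and testing features

5. Config Layer

Scenarios: predefined test scenario configurations
Agent Configs: agent runtime configuration templates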

🔧 Core Features

1. Dynamic Scenario Configuration

Predefined Scenarios

# Five predefined scenarios
SCENARIOS = {
    "k8s_health_check_failure": "K8s health check failure",
    "compilation_failure": "Compilation failure",
    "runtime_failure": "Runtime error",
    "multi_step_complex": "Complex multi-step problem",
    "normal_flow": "Normal flow"
}

Scenario Configuration Structure

scenario_config = {
    "name": "k8s_health_check_failure",
    "model": {
        "sequence": [
            {"type": "thought", "content": "Starting analysis"},
            {"type": "action", "tool": "get_build_log", "input": "_"},
            {"type": "final", "content": "Final conclusion"}
        ]
    },
    "tools": {
        "build_log": "[build log]",
        "k8s_log": "[deployment log]",
        # ...
    }
}
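
One plausible way a mock model could consume the sequence in such a configuration is to render each step into ReAct-style text. The sketch below assumes that approach; render_step and render_scenario are hypothetical helpers, and the actual MockLLM may process the sequence differently.

```python
def render_step(step: dict) -> str:
    """Turn one scenario step into the ReAct text a mock model could emit."""
    kind = step["type"]
    if kind == "thought":
        return f"Thought: {step['content']}"
    if kind == "action":
        return f"Action: {step['tool']}\nAction Input: {step['input']}"
    if kind == "final":
        return f"Final Answer: {step['content']}"
    raise ValueError(f"unsupported step type: {kind!r}")


def render_scenario(config: dict) -> list[str]:
    """Render every step of config['model']['sequence'] in order."""
    return [render_step(step) for step in config["model"]["sequence"]]


# Same shape as the scenario configuration structure shown above.
scenario_config = {
    "name": "k8s_health_check_failure",
    "model": {
        "sequence": [
            {"type": "thought", "content": "Starting analysis"},
            {"type": "action", "tool": "get_build_log", "input": "_"},
            {"type": "final", "content": "Final conclusion"},
        ]
    },
    "tools": {"build_log": "[build log]", "k8s_log": "[deployment log]"},
}

rendered = render_scenario(scenario_config)
for text in rendered:
    print(text)
```

Rejecting unknown step types early makes a typo in a hand-written scenario fail at load time rather than partway through a diagnosis.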

2. Model Switching

# Mock model (default)
service = DiagnosisService()
service.switch_to_mock_model("compilation_failure")

# Real model
service.switch_to_real_model(api_key="your-api-key")

# Inspect the current configuration
info = service.get_service_info()
print(f"Current model: {info['model_type']}")
print(f"Current scenario: {info['current_scenario']}")

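3. Complete ReAct Flow

Standard ReAct Loop

Thought: analyze and reason about the problem
Action: choose a suitable tool
Action Input: prepare the tool arguments
Observation: run the tool and collect the result
Loop: repeat until a final conclusion is reached

Flow Example

# User input: "The release job failed, please analyze the cause"

# Inside the agent:
# 1. Thought: need to find why the release failed; check the build log first
# 2. Action: get_build_log
# 3. Action Input: _
# 4. Observation: [build log result]
# 5. Thought: the build is fine; check the K8s deployment log
# 6. Action: get_k8s_log
# 7. ... (loop continues)
# 8. Final Answer: final diagnosis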
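
The loop above can be sketched in a few lines of Python. This is a simplified illustration, not the code in react_executor.py: the run_react function, the regular expressions, and the scripted model are all assumptions chosen to keep the example self-contained.

```python
import re
from typing import Callable, Dict, List

ACTION_RE = re.compile(r"Action:\s*(?P<tool>\S+)\s*\nAction Input:\s*(?P<input>.*)")
FINAL_RE = re.compile(r"Final Answer:\s*(?P<answer>.*)", re.DOTALL)


def run_react(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_iterations: int = 5,
) -> str:
    """Alternate model output and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_iterations):
        output = llm(transcript)
        transcript += output + "\n"
        final = FINAL_RE.search(output)
        if final:
            return final.group("answer").strip()
        action = ACTION_RE.search(output)
        if action is None:
            observation = "could not parse an action"
        elif action.group("tool") not in tools:
            observation = f"unknown tool {action.group('tool')}"
        else:
            observation = tools[action.group("tool")](action.group("input").strip())
        # The observation is appended so the model sees it on the next turn.
        transcript += f"Observation: {observation}\n"
    return "Stopped: iteration limit reached"


calls: List[str] = []


def get_build_log(_arg: str) -> str:
    calls.append("get_build_log")
    return "[build log] BUILD SUCCESS"


def get_k8s_log(_arg: str) -> str:
    calls.append("get_k8s_log")
    return "[k8s log] readiness probe failed"


script = iter([
    "Thought: check the build log first\nAction: get_build_log\nAction Input: _",
    "Thought: the build is fine, check the deployment\nAction: get_k8s_log\nAction Input: _",
    "Thought: the probe is failing\nFinal Answer: health check failed",
])

answer = run_react(
    lambda _transcript: next(script),
    {"get_build_log": get_build_log, "get_k8s_log": get_k8s_log},
    "The release job failed",
)
print(answer)
```

The iteration cap matters in practice: without it, a model that never emits a final answer would loop forever.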

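📚 Usage Guide

Basic Usage (fully compatible with the previous API)

import asyncio

from src.demo.diagnosis_core import DiagnosisService


async def main():
    # Create a service instance (uses the scenario-based mock model by default)
    service = DiagnosisService(verbose=True)

    # Synchronous (non-streaming) diagnosis
    result = await service.diagnose("The release job failed, please analyze the cause")
    print(result)

    # Streaming diagnosis
    async for chunk in service.diagnose_stream("The release job failed, please analyze the cause"):
        if chunk["type"] == "tool_call":
            print(f"Calling tool: {chunk['tool']}")
        elif chunk["type"] == "final":
            print(f"Final result: {chunk['content']}")


# await is only valid inside a coroutine, so drive the calls with asyncio.run
asyncio.run(main())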

Advanced Features

Scenario Configuration

# Switch between scenarios (the await calls must run inside an async function)
service.configure_scenario("compilation_failure")
result = await service.diagnose("The release job failed")

service.configure_scenario("runtime_failure")
result = await service.diagnose("The release job failed")

service.configure_scenario("normal_flow")
result = await service.diagnose("The release job failed")

Model Switching

# Switch to the real GLM model
service.switch_to_real_model(api_key="your-glm-api-key")
result = await service.diagnose("The release job failed")

# Switch back to the mock model
service.switch_to_mock_model("k8s_health_check_failure")
result = await service.diagnose("The release job failed")

Batch Testing

# Test all predefined scenarios
results = await service.test_all_scenarios("The release job failed, please analyze the cause")

for scenario, result in results.items():
    print(f"{scenario}: {'✅' if result['success'] else '❌'}")
    if result['success']:
        print(f"  Steps: {result['total_steps']}")
    else:
        print(f"  Error: {result['error']}")

Creating a Custom Scenario

# Define a custom scenario
custom_scenario = {
    "name": "custom_database_error",
    "model": {
        "sequence": [
            {"type": "thought", "content": "Need to check for a database connection problem"},
            {"type": "action", "tool": "get_service_log", "input": "_"},
            {"type": "final", "content": "Database connection failed; check the configuration"}
        ]
    },
    "tools": {
        "service_log": "[service log] Database connection failed: Connection refused",
        "knowledge_base": {
            "database connection": "Check the database service status and connection settings"
        }
    }
}

# Register and apply the custom scenario
from src.core.configs.scenarios import ALL_SCENARIOS
ALL_SCENARIOS["custom_database_error"] = custom_scenario
service.configure_scenario("custom_database_error")

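🧪 Testing and Debugging

Running the Test Script

# Activate the virtual environment
source .venv/bin/activate

# Run the full test suite
python test_new_agent.py

Test Coverage

Single-scenario tests: verify the ReAct flow of each scenario
Scenario-switching tests: verify dynamic scenario switching
Model-switching tests: verify switching between the mock and real models
Batch tests: run all predefined scenarios
API compatibility tests: verify that the FastAPI endpoints work

Debugging Tips

Enabling Verbose Logging

# Create a service with verbose logging
service = DiagnosisService(verbose=True)

# Inspect the detailed process in the ReAct executor
agent = service.core_service.current_agent
print(agent.get_agent_info())

Inspecting Agent State

# Get the agent configuration
info = service.get_service_info()
print(f"Model type: {info['model_type']}")
print(f"Available tools: {info['available_tools']}")
print(f"Current scenario: {info['current_scenario']}")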

🚀 FastAPI Interface

Compatibility

The new architecture stays fully compatible with the original FastAPI interface:

Synchronous Endpoint

curl -X POST http://localhost:8002/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "The release job failed, please analyze the cause"}'

Streaming Endpoint

curl -X POST http://localhost:8002/api/chat/stream \
  -H "Content-Type: application/json" \
  -d '{"message": "The release job failed, please analyze the cause"}' --no-buffer

Health Check

curl -X GET http://localhost:8002/api/health

New Endpoints (optional)

To expose the new features at the FastAPI layer, extend web_service.py:

from typing import Optional

# New endpoint: switch scenario
@router.post("/api/scenario/{scenario_name}")
async def switch_scenario(scenario_name: str):
    service.configure_scenario(scenario_name)
    return {"message": f"Switched to scenario: {scenario_name}"}

# New endpoint: switch to the real model
@router.post("/api/model/real")
async def switch_to_real(api_key: Optional[str] = None):
    service.switch_to_real_model(api_key)
    return {"message": "Switched to the real model"}

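🔧 Production Deployment

Environment Configuration

Using the Real Model

from src.demo.diagnosis_core import DiagnosisService

# Create the production service
service = DiagnosisService(verbose=False)

# Switch to the real GLM model
service.switch_to_real_model(api_key="your-production-api-key")

Configuration Tuning

# Use the production configuration
from src.core.configs.agent_configs import PRODUCTION_CONFIG
from src.core.agents.diagnosis_agent import DiagnosisAgentBuilder

agent = (DiagnosisAgentBuilder()
          .with_glm_model(api_key="prod-api-key")
          .with_max_iterations(15)
          .with_verbose(False)
          .build())

Performance Recommendations

Model selection: use the real model in production
Caching: cache results for common problems
Async processing: use the async APIs to handle concurrent requests
Resource monitoring: monitor API call frequency and response time
Error handling: implement fallback and retry mechanisms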
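
The caching, retry, and fallback recommendations can be combined in a thin wrapper around an async diagnose function. The sketch below is one possible shape under stated assumptions: CachedDiagnoser, flaky_diagnose, and mock_fallback are hypothetical names, the cache is an unbounded in-memory dict, and the backoff delays are kept tiny for demonstration.

```python
import asyncio
from typing import Awaitable, Callable, Dict

DiagnoseFn = Callable[[str], Awaitable[str]]


class CachedDiagnoser:
    """Wraps an async diagnose function with a cache, retries, and a fallback."""

    def __init__(self, diagnose: DiagnoseFn, fallback: DiagnoseFn, max_retries: int = 2):
        self._diagnose = diagnose
        self._fallback = fallback
        self._max_retries = max_retries
        self._cache: Dict[str, str] = {}

    async def __call__(self, message: str) -> str:
        if message in self._cache:
            return self._cache[message]
        for attempt in range(self._max_retries + 1):
            try:
                result = await self._diagnose(message)
            except Exception:
                # Exponential backoff before the next attempt.
                await asyncio.sleep(0.01 * 2 ** attempt)
                continue
            self._cache[message] = result
            return result
        # Every attempt failed: degrade to the fallback; its result is not cached.
        return await self._fallback(message)


call_counts = {"primary": 0}


async def flaky_diagnose(message: str) -> str:
    call_counts["primary"] += 1
    if call_counts["primary"] == 1:
        raise ConnectionError("simulated transient failure")
    return f"diagnosis: {message}"


async def mock_fallback(message: str) -> str:
    return f"fallback diagnosis: {message}"


async def demo():
    diagnoser = CachedDiagnoser(flaky_diagnose, mock_fallback)
    first = await diagnoser("release job failed")   # fails once, then succeeds
    second = await diagnoser("release job failed")  # served from the cache
    return first, second


first_result, second_result = asyncio.run(demo())
print(first_result, call_counts)
```

A production version would likely bound the cache size, expire entries, and catch only the exception types that are worth retrying.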

📁 File Structure

src/
├── core/                          # Core modules of the new architecture
│   ├── models/                    # Model layer
│   │   ├── __init__.py
│   │   ├── base_llm.py            # LLM base class
│   │   ├── mock_llm.py            # Enhanced mock model
│   │   ├── glm_llm.py             # Real GLM model
│   │   └── model_factory.py       # Model factory
│   ├── tools/                     # Tool layer
│   │   ├── __init__.py
│   │   ├── diagnosis_tools.py     # Diagnosis tools
│   │   └── tool_manager.py        # Tool manager
│   ├── agents/                    # Agent layer
│   │   ├── __init__.py
│   │   ├── react_executor.py      # ReAct executor
│   │   └── diagnosis_agent.py     # Diagnosis agent
│   ├── services/                  # Service layer
│   │   ├── __init__.py
│   │   └── diagnosis_service.py   # Diagnosis service
│   └── configs/                   # Config layer
│       ├── __init__.py
│       ├── scenarios.py           # Scenario configs
│       └── agent_configs.py       # Agent configs
├── demo/                          # Compatibility layer
│   ├── diagnosis_core.py          # Adapter (uses the new architecture)
│   ├── diagnosis_core_old.py      # Original implementation (backup)
│   └── web_service.py             # FastAPI routes
└── base/                          # Original model (reference)
    └── agent.py                   # Original GLM implementation

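🎉 Refactoring Results

Technical Results

✅ Standardized architecture: follows LangChain best practices
✅ Complete ReAct flow: a genuine thought-action loop
✅ Dynamic configuration: scenarios and models can be switched at runtime
✅ Backward compatible: all existing APIs are unchanged
✅ Thorough testing: multi-scenario test coverage
✅ Production ready: supports plugging in the real model

Business Value

🚀 Development efficiency: the standardized architecture reduces development complexity
🔧 Maintainability: the modular design is easier to maintain and extend
🧪 Testability: a rich set of test tools and scenarios
📈 Extensibility: new models and tools are easy to add
🛡️ Stability: error handling and fallback mechanisms

Next Steps

Expand the tool set: add more diagnosis tools
Multi-model support: integrate more LLMs
Performance optimization: implement caching and batching
Monitoring and alerting: add system monitoring and alerts
A/B testing: support comparing the results of different models

📞 Technical Support

For questions or suggestions, see:

Project code: the detailed implementation under src/core/
Test examples: run test_new_agent.py for complete usage
Configuration reference: the configuration templates under src/core/configs/

Refactoring complete! 🎉 You can now test a variety of situations by editing the scenario configuration, and switch to the real model for production deployment.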
