Part VII: Tools & Platforms

Estimated Reading Time: 14 minutes
AI agent development frameworks have evolved beyond simple abstractions over LLM calls to include sophisticated built-in observability and evaluation capabilities. Understanding these native features is essential for teams building production agents, because framework-native instrumentation often provides deeper insight with less configuration effort. This section examines the observability and evaluation features of leading agent frameworks, including LangChain/LangGraph, LlamaIndex, CrewAI, and Semantic Kernel, along with several other notable options.
- 17.1 LangChain and LangGraph
- 17.2 LlamaIndex
- 17.3 CrewAI
- 17.4 Semantic Kernel
- 17.5 Other Notable Frameworks
- 17.6 Framework Comparison
17.1 LangChain and LangGraph

LangChain is the most widely adopted framework for building LLM applications, with LangGraph as its dedicated extension for building stateful, multi-actor agents. The ecosystem provides extensive observability capabilities, with LangSmith as the preferred (but not required) observability platform.
Callbacks System: LangChain's callback system enables real-time monitoring of all framework operations:
- on_llm_start / on_llm_end: LLM invocation tracking
- on_chain_start / on_chain_end: Chain execution monitoring
- on_tool_start / on_tool_end: Tool usage tracking
- on_agent_action / on_agent_finish: Agent decision monitoring
- on_retriever_start / on_retriever_end: RAG retrieval tracking
```python
# Example: Custom callback handler
from langchain.callbacks.base import BaseCallbackHandler

class AgentTracker(BaseCallbackHandler):
    def on_tool_start(self, serialized, input_str, **kwargs):
        print(f"Tool called: {serialized['name']}")
        self.log_tool_usage(serialized["name"], input_str)

    def on_agent_action(self, action, **kwargs):
        print(f"Agent decided: {action.tool} with {action.tool_input}")

    def on_agent_finish(self, finish, **kwargs):
        print(f"Agent completed: {finish.return_values}")

    def log_tool_usage(self, tool_name, input_str):
        # Forward tool usage to your own metrics or logging backend
        ...
```

Run Trees: LangChain automatically organizes executions into hierarchical run trees that capture:
- Parent-child relationships between operations
- Input/output at each level
- Timing information
- Error states and exceptions
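The handler above can be attached at invocation time through the runnable config; a minimal sketch, assuming an existing LangChain runnable called `agent_executor`:

```python
# Pass the custom handler for this run; the resulting run tree nests the
# chain, tool, and LLM spans produced during the invocation.
result = agent_executor.invoke(
    {"input": "Summarize today's support tickets"},
    config={"callbacks": [AgentTracker()]},  # AgentTracker defined above
)
```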
LangChain pushes strongly toward LangSmith for comprehensive observability:
Automatic Tracing:
- Environment variable configuration for zero-code instrumentation
- Captures all LLM calls, tool invocations, and chain executions
- Detailed token usage and cost tracking
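As a minimal sketch of the zero-code setup (environment variable names follow the LangSmith documentation; the values are placeholders):

```python
import os

# Enabling LangSmith tracing requires no changes to chain or agent code
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-project"  # optional: group runs by project

# All subsequent LangChain/LangGraph executions are traced automatically.
```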
Evaluation Capabilities:
- Dataset management for test cases
- LLM-as-judge evaluators
- Human feedback collection
- A/B testing support
- Regression detection
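A sketch of an offline evaluation run with the `langsmith` SDK; the dataset, the target function `my_agent`, and the evaluator are illustrative, not part of the library:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Build a small dataset of test cases
dataset = client.create_dataset(dataset_name="capital-questions")
client.create_example(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    dataset_id=dataset.id,
)

def exact_match(run, example):
    # Programmatic evaluator: compare the target's output to the reference answer
    predicted = (run.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": float(predicted.strip() == example.outputs["answer"])}

# Run the target over the dataset and apply the evaluators
results = evaluate(
    lambda inputs: {"answer": my_agent(inputs["question"])},  # my_agent is hypothetical
    data="capital-questions",
    evaluators=[exact_match],
)
```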
OpenTelemetry Support: LangSmith supports OTel to unify observability across services, enabling integration with existing monitoring stacks.
LangGraph adds agent-specific observability:
State Tracking:
- Monitor graph state at each node
- Track state mutations over time
- Debug complex state flows
Node-Level Instrumentation:
- Per-node timing and performance metrics
- Tool call patterns per node
- Token usage per node
Granular Tracing Recommendations: For LangGraph setups, best practices include:
- Use dedicated observability layer for span-level metrics
- Capture token usage, latency, and quality signals at both node and session levels
- Implement custom callbacks for graph-specific events
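A minimal sketch of node-level state observation using LangGraph's streaming API (the graph, state schema, and node logic are illustrative):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    answer: str

def answer_node(state: AgentState) -> dict:
    # In a real agent this node would call an LLM or a tool
    return {"answer": f"echo: {state['question']}"}

builder = StateGraph(AgentState)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)
graph = builder.compile()

# stream_mode="updates" yields each node's state mutation as it happens,
# which makes it easy to debug state flow node by node
for update in graph.stream({"question": "What changed today?"}, stream_mode="updates"):
    print(update)
```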
LangChain supports extensive third-party observability integrations:
- Langfuse: Full tracing and evaluation
- Arize Phoenix: OpenTelemetry-native observability
- MLflow: Tracing and experiment tracking
- Weights & Biases: Experiment management
- Helicone: Gateway-based observability
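For example, the Langfuse integration is a drop-in callback handler; a sketch assuming the Langfuse v2 SDK import path (newer releases may expose it elsewhere) and an existing runnable named `chain`:

```python
from langfuse.callback import CallbackHandler

# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST
langfuse_handler = CallbackHandler()

chain.invoke({"input": "hello"}, config={"callbacks": [langfuse_handler]})
```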
17.2 LlamaIndex

LlamaIndex excels at building production-grade RAG systems with built-in features for caching, streaming, and observability. The framework provides comprehensive instrumentation centered on OpenTelemetry.
Instrumentation Module: LlamaIndex provides a dedicated instrumentation module for comprehensive tracing:
```python
# Example: LlamaIndex instrumentation setup
from llama_index.core.instrumentation import get_dispatcher

# Get the global (root) dispatcher
dispatcher = get_dispatcher()

# Functions decorated with @dispatcher.span are traced automatically
@dispatcher.span
def custom_operation(query):
    return process_query(query)  # process_query is your own application logic
```

Event Types:
- LLM events (calls, streaming, tool use)
- Embedding events
- Retrieval events
- Query engine events
- Agent events
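Custom handlers can subscribe to these events on the root dispatcher; a minimal sketch, assuming the current `llama_index` instrumentation API:

```python
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler

class PrintEventHandler(BaseEventHandler):
    @classmethod
    def class_name(cls) -> str:
        return "PrintEventHandler"

    def handle(self, event, **kwargs) -> None:
        # Every instrumentation event (LLM, embedding, retrieval, ...) flows through here
        print(type(event).__name__, event.id_)

# Register on the root dispatcher so all modules report their events
get_dispatcher().add_event_handler(PrintEventHandler())
```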
LlamaIndex instrumentation is centered on OpenTelemetry:
Auto-Instrumentation:
- Compatible with OpenLLMetry for automatic tracing
- OpenInference conventions for standardized spans
- OTLP export to any compatible backend
Span Attributes:
- Query text and response
- Retrieved documents and scores
- LLM parameters and outputs
- Token counts and latency
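A sketch of wiring LlamaIndex spans to an OTLP backend through the OpenInference instrumentor (the collector endpoint is a placeholder):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Export spans to any OTLP-compatible collector (Phoenix, Langfuse, Jaeger, ...)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)

# All subsequent LlamaIndex operations emit OpenInference-convention spans
LlamaIndexInstrumentor().instrument(tracer_provider=provider)
```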
LlamaIndex provides native evaluation modules:
Built-in Evaluators:
- Faithfulness Evaluator: Measures if response is grounded in retrieved context
- Relevancy Evaluator: Assesses response relevance to query
- Correctness Evaluator: Compares responses to ground truth
- Guideline Evaluator: Checks adherence to custom guidelines
Evaluation Workflow:
```python
# Example: LlamaIndex evaluation
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# Evaluators use Settings.llm unless an LLM is passed explicitly
faithfulness = FaithfulnessEvaluator()
relevancy = RelevancyEvaluator()

# Evaluate a query response (query_engine is built elsewhere)
query = "What is the capital of France?"
response = query_engine.query(query)

# Check faithfulness: is the answer grounded in the retrieved context?
faith_result = faithfulness.evaluate_response(response=response)
print(f"Faithfulness: {faith_result.score}")

# Check relevancy: does the answer address the query?
rel_result = relevancy.evaluate_response(query=query, response=response)
print(f"Relevancy: {rel_result.score}")
```

LlamaIndex also offers a dedicated Langfuse integration:
- Automatic trace capture for all operations
- Cost tracking per query
- Session and user tracking
- Evaluation score logging
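A sketch of enabling it through the callback handler documented for the Langfuse v2 SDK (newer Langfuse releases use an instrumentation-based setup instead):

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from langfuse.llama_index import LlamaIndexCallbackHandler

# Credentials are read from the LANGFUSE_* environment variables
langfuse_handler = LlamaIndexCallbackHandler()
Settings.callback_manager = CallbackManager([langfuse_handler])

# From here on, every index and query operation is traced to Langfuse.
```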
17.3 CrewAI

CrewAI specializes in role-based multi-agent collaboration, enabling team-oriented workflows where multiple agents work together. The framework provides built-in observability features for tracking agent interactions and task execution.
Agent Execution Tracking:
- Monitor individual agent activities
- Track task assignments and completions
- Log inter-agent communications
Task Metrics:
- Task execution time
- Success/failure rates
- Retry counts
- Output quality indicators
Crew-Level Insights:
- Overall workflow progress
- Bottleneck identification
- Resource utilization
CrewAI supports multiple observability platforms:
Braintrust Integration:
```python
# Example: CrewAI with Braintrust
from crewai import Crew, Agent, Task
import braintrust

# Initialize Braintrust (project configuration per the Braintrust docs)
braintrust.init(api_key="your-api-key")

# Agents and tasks (research_agent, writer_agent, ...) are defined elsewhere
crew = Crew(
    agents=[research_agent, writer_agent],
    tasks=[research_task, writing_task],
    verbose=True,  # Enable logging
)

# Execute with automatic tracing
result = crew.kickoff()
```

Other Integrations:
- Langfuse for detailed tracing
- Arize Phoenix for OpenTelemetry-native observability
- MLflow for experiment tracking
- Custom callbacks for proprietary systems
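As one example, the Phoenix route uses an OpenInference instrumentor for CrewAI; a sketch assuming the `openinference-instrumentation-crewai` package and a local Phoenix collector:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.crewai import CrewAIInstrumentor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)

# Traces every agent, task, and tool call the crew executes
CrewAIInstrumentor().instrument(tracer_provider=provider)
```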
While CrewAI excels at multi-agent orchestration, teams often need to:
- Implement custom observability layers
- Integrate external evaluation frameworks
- Build governance and deployment pipelines separately
17.4 Semantic Kernel

Microsoft Semantic Kernel is an open-source SDK for building AI agents and copilots, with deep integration into the Microsoft ecosystem and Azure AI Foundry.
Telemetry Support: Semantic Kernel provides built-in telemetry through:
- OpenTelemetry instrumentation
- Azure Application Insights integration
- Custom telemetry providers
Kernel Events:
- Function invocation tracking
- Plugin execution monitoring
- Memory operation logging
- LLM call instrumentation
```csharp
// Example: Semantic Kernel with OpenTelemetry (C#)
using Microsoft.SemanticKernel;
using OpenTelemetry;
using OpenTelemetry.Trace;

// Listen to all Semantic Kernel activity sources; swap the console exporter
// for an OTLP exporter in production
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("Microsoft.SemanticKernel*")
    .AddConsoleExporter()
    .Build();

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", apiKey)
    .Build();
```

Semantic Kernel agents integrate seamlessly with Azure AI Foundry evaluation:
- Converter support for evaluation data
- Compatible with IntentResolution, ToolCallAccuracy, and other evaluators
- Continuous evaluation through Foundry dashboard
Semantic Kernel provides consistent observability across:
- C#/.NET: Primary SDK with full feature support
- Python: Feature-parity SDK
- Java: Growing feature set
17.5 Other Notable Frameworks

AutoGen enables building multi-agent applications with conversational patterns.
Observability Features:
- Conversation logging
- Agent interaction tracking
- Message flow visualization
Integrations:
- OpenLLMetry for tracing
- MLflow for experiment tracking
- Custom logging handlers
DSPy provides a programming framework for LLMs with automatic prompt optimization.
Observability Features:
- Compilation traces
- Module execution logs
- Optimization metrics
Integrations:
- Arize Phoenix for tracing
- OpenLLMetry instrumentation
- MLflow integration
Haystack is a framework for building RAG and NLP pipelines.
Observability Features:
- Pipeline component tracing
- Document retrieval metrics
- Query performance monitoring
Integrations:
- OpenTelemetry native support
- Langfuse integration
- Arize Phoenix compatibility
Mastra is an emerging framework with first-class observability support.
Observability Features:
- Built-in tracing
- Workflow monitoring
- Tool usage analytics
Integrations:
- Braintrust native support
- OpenTelemetry exporters
- Multiple backend options
NVIDIA NeMo offers framework-agnostic integration with production optimization features.
Observability Features:
- Cross-framework performance tracking
- Customizer for throughput optimization
- Agent Toolkit for comprehensive monitoring
Framework Compatibility: Works with CrewAI, Haystack, LangChain, and LlamaIndex.
17.6 Framework Comparison

Native observability and evaluation capabilities by framework:

| Framework | Native Tracing | OTel Support | Built-in Eval | Cost Tracking | Multi-Agent |
|---|---|---|---|---|---|
| LangChain/LangGraph | ✅ Callbacks | ✅ Via LangSmith | ✅ | ✅ | ✅ |
| LlamaIndex | ✅ Instrumentation | ✅ Native | ✅ Native | ✅ | Partial |
| CrewAI | ✅ Basic | ✅ Via integrations | ❌ External | ✅ | ✅ Native |
| Semantic Kernel | ✅ Telemetry | ✅ Native | ✅ Via Azure | ✅ | ✅ |
| AutoGen | ✅ Logging | ✅ Via integrations | ❌ External | ❌ | ✅ Native |
| DSPy | ✅ Traces | ✅ Via Phoenix | ❌ External | ❌ | ❌ |
| Haystack | ✅ Pipeline | ✅ Native | ✅ Basic | ✅ | Partial |
Integration support across third-party observability platforms:

| Framework | Langfuse | Phoenix | MLflow | LangSmith | Braintrust |
|---|---|---|---|---|---|
| LangChain | ✅ | ✅ | ✅ | ✅ Native | ✅ |
| LlamaIndex | ✅ Native | ✅ | ✅ | Partial | ✅ |
| CrewAI | ✅ | ✅ | ✅ | ✅ | ✅ Native |
| Semantic Kernel | Partial | ✅ | ✅ | ❌ | ✅ |
| AutoGen | ✅ | ✅ | ✅ | Partial | ✅ |
| DSPy | ✅ | ✅ Native | ✅ | Partial | ✅ |
| Haystack | ✅ | ✅ | ❌ | ❌ | ✅ |
Choose LangChain/LangGraph when:
- Building complex agent workflows
- Need mature ecosystem with extensive integrations
- Want LangSmith for managed observability
Choose LlamaIndex when:
- Primarily building RAG systems
- Need native evaluation capabilities
- Want strong OpenTelemetry integration
Choose CrewAI when:
- Building multi-agent collaborative systems
- Role-based agent design fits use case
- Team-oriented workflow patterns needed
Choose Semantic Kernel when:
- Working in Microsoft/Azure ecosystem
- Need C#/.NET support
- Want Azure AI Foundry integration
Regardless of framework choice, best practices include:

- Pair with a dedicated observability layer: No framework provides everything; add specialized tools for the gaps
- Implement simulation and evaluation: Framework tracing alone is insufficient; add evaluation loops
- Use human expert review: Automated metrics need human validation, especially for subjective quality
- Configure alerts: Monitor key metrics and trigger notifications on degradation
- Adopt a common production pattern: Consider LlamaIndex for ingestion/retrieval plus LangChain for orchestration

Key Takeaways:

- All major frameworks support OpenTelemetry: This enables mixing and matching frameworks with consistent observability
- Built-in evaluation varies significantly: LlamaIndex and LangChain provide native evaluation; others require external frameworks
- LangSmith integration dominates LangChain: While not required, LangSmith provides the most seamless experience for LangChain users
- Multi-agent frameworks lag in evaluation: CrewAI and AutoGen excel at orchestration but need external evaluation tools
- Framework choice doesn't lock observability: OpenTelemetry compatibility means you can change observability platforms without changing frameworks
- Hybrid patterns are common: Many production systems combine frameworks (e.g., LlamaIndex + LangChain) with unified observability
Navigation: ← Previous Section: Cloud Provider Evaluation Platforms | Table of Contents | Next Section: Benchmark Suites →