
Section 17: Observability and Evaluation Features in AI Development Frameworks

Part VII: Tools & Platforms | Estimated Reading Time: 14 minutes

Overview

AI agent development frameworks have evolved beyond simple abstractions for LLM calls to include sophisticated built-in observability and evaluation capabilities. Understanding these native features is essential for teams building production agents, as framework-native instrumentation often provides deeper insights with less configuration effort. This section examines the observability and evaluation features of leading agent frameworks including LangChain, LlamaIndex, CrewAI, and others.

17.1 LangChain and LangGraph

17.1.1 Overview

LangChain is the most widely adopted framework for building LLM applications, with LangGraph as its dedicated extension for building stateful, multi-actor agents. The ecosystem provides extensive observability capabilities, with LangSmith as the preferred (but not required) observability platform.

17.1.2 Native Observability Features

Callbacks System: LangChain's callback system enables real-time monitoring of all framework operations:

  • on_llm_start / on_llm_end: LLM invocation tracking
  • on_chain_start / on_chain_end: Chain execution monitoring
  • on_tool_start / on_tool_end: Tool usage tracking
  • on_agent_action / on_agent_finish: Agent decision monitoring
  • on_retriever_start / on_retriever_end: RAG retrieval tracking

# Example: custom callback handler for agent monitoring
from langchain_core.callbacks import BaseCallbackHandler

class AgentTracker(BaseCallbackHandler):
    def __init__(self):
        self.tool_calls = []  # simple in-memory log of (tool, input) pairs

    def on_tool_start(self, serialized, input_str, **kwargs):
        name = serialized.get("name", "unknown")
        print(f"Tool called: {name}")
        self.tool_calls.append((name, input_str))

    def on_agent_action(self, action, **kwargs):
        print(f"Agent decided: {action.tool} with {action.tool_input}")

    def on_agent_finish(self, finish, **kwargs):
        print(f"Agent completed: {finish.return_values}")

Run Trees: LangChain automatically organizes executions into hierarchical run trees that capture:

  • Parent-child relationships between operations
  • Input/output at each level
  • Timing information
  • Error states and exceptions

17.1.3 LangSmith Integration

LangChain pushes strongly toward LangSmith for comprehensive observability:

Automatic Tracing:

  • Environment variable configuration for zero-code instrumentation (sketch below)
  • Captures all LLM calls, tool invocations, and chain executions
  • Detailed token usage and cost tracking
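
For illustration, tracing can be switched on through environment variables alone, with no changes to the agent code itself. A minimal sketch; the variable names follow LangSmith's documented configuration, and the project name is a placeholder:

# Zero-code LangSmith tracing via environment variables; set these before
# running any LangChain/LangGraph code.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"           # enable tracing
os.environ["LANGCHAIN_API_KEY"] = "<your-api-key>"    # LangSmith API key
os.environ["LANGCHAIN_PROJECT"] = "my-agent-project"  # placeholder project name

# Everything executed after this point is traced automatically,
# with no callback wiring required.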

Evaluation Capabilities:

  • Dataset management for test cases
  • LLM-as-judge evaluators
  • Human feedback collection
  • A/B testing support
  • Regression detection

OpenTelemetry Support: LangSmith supports OTel to unify observability across services, enabling integration with existing monitoring stacks.

17.1.4 LangGraph-Specific Features

LangGraph adds agent-specific observability:

State Tracking:

  • Monitor graph state at each node
  • Track state mutations over time
  • Debug complex state flows

Node-Level Instrumentation:

  • Per-node timing and performance metrics
  • Tool call patterns per node
  • Token usage per node

Granular Tracing Recommendations: For LangGraph setups, best practices include:

  • Use a dedicated observability layer for span-level metrics
  • Capture token usage, latency, and quality signals at both node and session levels (a per-node timing sketch follows this list)
  • Implement custom callbacks for graph-specific events
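
A minimal sketch of per-node timing along these lines, assuming a simple two-node graph; the timed wrapper and log format are illustrative additions, not LangGraph APIs:

import time
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    query: str
    answer: str

def timed(name, fn):
    # Illustrative wrapper that records per-node latency
    def wrapper(state):
        start = time.perf_counter()
        result = fn(state)
        print(f"node={name} latency={time.perf_counter() - start:.3f}s")
        return result
    return wrapper

def retrieve(state: State) -> dict:
    return {"answer": f"docs for {state['query']}"}  # placeholder retrieval

def respond(state: State) -> dict:
    return {"answer": state["answer"].upper()}  # placeholder response step

builder = StateGraph(State)
builder.add_node("retrieve", timed("retrieve", retrieve))
builder.add_node("respond", timed("respond", respond))
builder.add_edge(START, "retrieve")
builder.add_edge("retrieve", "respond")
builder.add_edge("respond", END)

graph = builder.compile()
print(graph.invoke({"query": "observability", "answer": ""}))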

17.1.5 Third-Party Integrations

LangChain supports extensive third-party observability integrations:

  • Langfuse: Full tracing and evaluation
  • Arize Phoenix: OpenTelemetry-native observability
  • MLflow: Tracing and experiment tracking
  • Weights & Biases: Experiment management
  • Helicone: Gateway-based observability

17.2 LlamaIndex

17.2.1 Overview

LlamaIndex excels at building production-grade RAG systems with built-in features for caching, streaming, and observability. The framework provides comprehensive instrumentation centered on OpenTelemetry.

17.2.2 Native Observability Features

Instrumentation Module: LlamaIndex provides a dedicated instrumentation module for comprehensive tracing:

# Example: LlamaIndex instrumentation setup
from llama_index.core.instrumentation import get_dispatcher

# Get the root dispatcher
dispatcher = get_dispatcher()

# Trace a custom operation by wrapping it in a span
@dispatcher.span
def custom_operation(query: str) -> str:
    # This operation is automatically traced
    return query.strip()  # placeholder for real query processing

Event Types (a custom handler sketch follows this list):

  • LLM events (calls, streaming, tool use)
  • Embedding events
  • Retrieval events
  • Query engine events
  • Agent events
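
These events can be consumed by registering a custom handler on the dispatcher. A minimal sketch based on the llama_index.core.instrumentation API; the print logic is illustrative:

from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler

class PrintingEventHandler(BaseEventHandler):
    @classmethod
    def class_name(cls) -> str:
        return "PrintingEventHandler"

    def handle(self, event, **kwargs) -> None:
        # Fires for LLM, embedding, retrieval, query engine, and agent events
        print(f"{event.timestamp} {event.class_name()}")

get_dispatcher().add_event_handler(PrintingEventHandler())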

17.2.3 OpenTelemetry Integration

LlamaIndex instrumentation is centered on OpenTelemetry:

Auto-Instrumentation:

  • Compatible with OpenLLMetry for automatic tracing (see the sketch below)
  • OpenInference conventions for standardized spans
  • OTLP export to any compatible backend
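
As a brief sketch, OpenLLMetry-based auto-instrumentation typically takes a single initialization call; this assumes the traceloop-sdk package is installed, and the app name is a placeholder:

from traceloop.sdk import Traceloop

# Initializes OpenTelemetry tracing and auto-instruments supported
# libraries, including LlamaIndex; spans are exported over OTLP.
Traceloop.init(app_name="rag-service")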

Span Attributes:

  • Query text and response
  • Retrieved documents and scores
  • LLM parameters and outputs
  • Token counts and latency

17.2.4 Evaluation Capabilities

LlamaIndex provides native evaluation modules:

Built-in Evaluators:

  • Faithfulness Evaluator: Measures if response is grounded in retrieved context
  • Relevancy Evaluator: Assesses response relevance to query
  • Correctness Evaluator: Compares responses to ground truth
  • Guideline Evaluator: Checks adherence to custom guidelines

Evaluation Workflow:

# Example: LlamaIndex evaluation
# Assumes `query_engine` was built earlier (e.g., from a VectorStoreIndex)
# and that Settings.llm is configured; both evaluators use it as the judge.
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

faithfulness = FaithfulnessEvaluator()
relevancy = RelevancyEvaluator()

# Evaluate a query response
query = "What is the capital of France?"
response = query_engine.query(query)

# Check faithfulness: is the response grounded in the retrieved context?
faith_result = faithfulness.evaluate_response(response=response)
print(f"Faithfulness: {faith_result.score}")

# Check relevancy: does the response actually answer the query?
rel_result = relevancy.evaluate_response(query=query, response=response)
print(f"Relevancy: {rel_result.score}")

17.2.5 Langfuse Integration

LlamaIndex offers a dedicated Langfuse integration (setup sketch below):

  • Automatic trace capture for all operations
  • Cost tracking per query
  • Session and user tracking
  • Evaluation score logging
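
One way to wire this up, sketched under the assumption that the Langfuse SDK (v3) and the OpenInference LlamaIndex instrumentation package are installed; exact setup varies by SDK version, and the credential values are placeholders:

import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."  # placeholder
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"

# pip install langfuse openinference-instrumentation-llama-index
from langfuse import get_client
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

langfuse = get_client()                # reads the credentials set above
LlamaIndexInstrumentor().instrument()  # traces all LlamaIndex operations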

17.3 CrewAI

17.3.1 Overview

CrewAI specializes in role-based multi-agent collaboration, enabling team-oriented workflows where multiple agents work together. The framework provides built-in observability features for tracking agent interactions and task execution.

17.3.2 Native Observability Features

Agent Execution Tracking:

  • Monitor individual agent activities
  • Track task assignments and completions
  • Log inter-agent communications

Task Metrics:

  • Task execution time
  • Success/failure rates
  • Retry counts
  • Output quality indicators

Crew-Level Insights:

  • Overall workflow progress
  • Bottleneck identification
  • Resource utilization

17.3.3 Third-Party Integrations

CrewAI supports multiple observability platforms:

Braintrust Integration:

# Example: CrewAI with Braintrust
# One way to initialize Braintrust logging (see CrewAI's Braintrust
# integration docs for the full trace wiring); assumes the research/writer
# agents and their tasks were defined earlier with Agent(...) and Task(...).
from crewai import Crew, Agent, Task
import braintrust

# Initialize Braintrust logging; reads BRAINTRUST_API_KEY from the
# environment, and the project name is a placeholder
braintrust.init_logger(project="crewai-demo")

# Create the crew
crew = Crew(
    agents=[research_agent, writer_agent],
    tasks=[research_task, writing_task],
    verbose=True  # enable execution logging
)

# Execute the crew; traces are captured by the configured integration
result = crew.kickoff()

Other Integrations:

  • Langfuse for detailed tracing
  • Arize Phoenix for OpenTelemetry-native observability
  • MLflow for experiment tracking
  • Custom callbacks for proprietary systems

17.3.4 Considerations

While CrewAI excels at multi-agent orchestration, teams often need to:

  • Implement custom observability layers
  • Integrate external evaluation frameworks
  • Build governance and deployment pipelines separately

17.4 Semantic Kernel

17.4.1 Overview

Microsoft Semantic Kernel is an open-source SDK for building AI agents and copilots, with deep integration into the Microsoft ecosystem and Azure AI Foundry.

17.4.2 Native Observability Features

Telemetry Support: Semantic Kernel provides built-in telemetry through:

  • OpenTelemetry instrumentation
  • Azure Application Insights integration
  • Custom telemetry providers

Kernel Events:

  • Function invocation tracking
  • Plugin execution monitoring
  • Memory operation logging
  • LLM call instrumentation

// Example: Semantic Kernel with OpenTelemetry (C#)
using Microsoft.SemanticKernel;
using OpenTelemetry;
using OpenTelemetry.Trace;

// Export Semantic Kernel activity sources to the console
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("Microsoft.SemanticKernel*")  // matches SK's activity sources
    .AddConsoleExporter()
    .Build();

// Read the API key from the environment rather than hard-coding it
var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");

var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", apiKey)
    .Build();

17.4.3 Azure AI Foundry Integration

Semantic Kernel agents integrate seamlessly with Azure AI Foundry evaluation:

  • Converter support for evaluation data
  • Compatible with IntentResolution, ToolCallAccuracy, and other evaluators (see the sketch below)
  • Continuous evaluation through Foundry dashboard
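
A hedged sketch of running one such evaluator locally with the azure-ai-evaluation package; the endpoint, key, and deployment values are placeholders:

from azure.ai.evaluation import IntentResolutionEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",  # placeholder
    "api_key": "<your-api-key>",                                   # placeholder
    "azure_deployment": "gpt-4o",  # judge model deployment
}

evaluator = IntentResolutionEvaluator(model_config=model_config)
result = evaluator(
    query="Book me a flight to Paris next Friday.",
    response="I found three flights to Paris next Friday...",
)
print(result)  # score plus reasoning from the judge model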

17.4.4 Multi-Language Support

Semantic Kernel provides consistent observability across:

  • C#/.NET: Primary SDK with full feature support
  • Python: Feature-parity SDK
  • Java: Growing feature set

17.5 Other Notable Frameworks

17.5.1 AutoGen (Microsoft)

AutoGen enables building multi-agent applications with conversational patterns.

Observability Features:

  • Conversation logging
  • Agent interaction tracking
  • Message flow visualization

Integrations:

  • OpenLLMetry for tracing
  • MLflow for experiment tracking
  • Custom logging handlers

17.5.2 DSPy

DSPy provides a programming framework for LLMs with automatic prompt optimization.

Observability Features:

  • Compilation traces
  • Module execution logs
  • Optimization metrics

Integrations:

  • Arize Phoenix for tracing (see the sketch below)
  • OpenLLMetry instrumentation
  • MLflow integration
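
A hedged sketch of the Phoenix integration via OpenInference; the package and model names are assumptions based on the arize-phoenix-otel and openinference-instrumentation-dspy distributions:

import dspy
from phoenix.otel import register
from openinference.instrumentation.dspy import DSPyInstrumentor

# Start OTLP export to a running Phoenix instance
tracer_provider = register(project_name="dspy-demo")
DSPyInstrumentor().instrument(tracer_provider=tracer_provider)

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model
predict = dspy.Predict("question -> answer")      # each call is now traced
print(predict(question="What does DSPy optimize?").answer)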

17.5.3 Haystack

Haystack is a framework for building RAG and NLP pipelines.

Observability Features:

  • Pipeline component tracing
  • Document retrieval metrics
  • Query performance monitoring

Integrations:

  • OpenTelemetry native support
  • Langfuse integration
  • Arize Phoenix compatibility

17.5.4 Mastra

Mastra is an emerging framework with first-class observability support.

Observability Features:

  • Built-in tracing
  • Workflow monitoring
  • Tool usage analytics

Integrations:

  • Braintrust native support
  • OpenTelemetry exporters
  • Multiple backend options

17.5.5 NVIDIA NeMo

NVIDIA NeMo offers framework-agnostic integration with production optimization features.

Observability Features:

  • Cross-framework performance tracking
  • Customizer for throughput optimization
  • Agent Toolkit for comprehensive monitoring

Framework Compatibility: Works with CrewAI, Haystack, LangChain, and LlamaIndex.


17.6 Framework Comparison

17.6.1 Observability Feature Matrix

| Framework | Native Tracing | OTel Support | Built-in Eval | Cost Tracking | Multi-Agent |
|---|---|---|---|---|---|
| LangChain/LangGraph | ✅ Callbacks | ✅ | Via LangSmith | Via LangSmith | ✅ LangGraph |
| LlamaIndex | ✅ Instrumentation | ✅ Native | ✅ Native | Partial | |
| CrewAI | ✅ Basic | ✅ Via integrations | ❌ External | | ✅ Native |
| Semantic Kernel | ✅ Telemetry | ✅ Native | ✅ Via Azure | | |
| AutoGen | ✅ Logging | ✅ Via integrations | ❌ External | | ✅ Native |
| DSPy | ✅ Traces | ✅ Via Phoenix | ❌ External | | |
| Haystack | ✅ Pipeline | ✅ Native | ✅ Basic | Partial | |

17.6.2 Integration Ecosystem

| Framework | Langfuse | Phoenix | MLflow | LangSmith | Braintrust |
|---|---|---|---|---|---|
| LangChain | ✅ | ✅ | ✅ | ✅ Native | |
| LlamaIndex | ✅ Native | ✅ | Partial | | |
| CrewAI | ✅ | ✅ | ✅ | | ✅ Native |
| Semantic Kernel | Partial | | | | |
| AutoGen | Partial | | ✅ | | |
| DSPy | | ✅ Native | Partial | | |
| Haystack | ✅ | ✅ | | | |

Blank cells indicate no dedicated integration documented in this section.

17.6.3 Best Practices for Framework Selection

Choose LangChain/LangGraph when:

  • Building complex agent workflows
  • Need mature ecosystem with extensive integrations
  • Want LangSmith for managed observability

Choose LlamaIndex when:

  • Primarily building RAG systems
  • Need native evaluation capabilities
  • Want strong OpenTelemetry integration

Choose CrewAI when:

  • Building multi-agent collaborative systems
  • Role-based agent design fits use case
  • Team-oriented workflow patterns needed

Choose Semantic Kernel when:

  • Working in Microsoft/Azure ecosystem
  • Need C#/.NET support
  • Want Azure AI Foundry integration

17.6.4 Observability-Driven Development Pattern

Regardless of framework choice, best practices include:

  1. Pair with a dedicated observability layer: No framework provides everything; add specialized tools to fill the gaps

  2. Implement simulation and evaluation: Framework tracing alone is insufficient; add evaluation loops

  3. Use human expert review: Automated metrics need human validation, especially for subjective quality

  4. Configure alerts: Monitor key metrics and trigger notifications on degradation

  5. Adopt common production patterns: A frequent combination pairs LlamaIndex for ingestion/retrieval with LangChain for orchestration, as sketched below
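
A minimal sketch of that hybrid pattern, assuming both frameworks are installed; the directory path and tool wiring are illustrative:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from langchain_core.tools import tool

# LlamaIndex side: ingest documents and expose a retriever
docs = SimpleDirectoryReader("./docs").load_data()  # placeholder corpus
retriever = VectorStoreIndex.from_documents(docs).as_retriever(similarity_top_k=3)

# LangChain side: wrap retrieval as a tool an agent can call
@tool
def search_docs(query: str) -> str:
    """Retrieve supporting passages for a query."""
    nodes = retriever.retrieve(query)
    return "\n\n".join(n.node.get_content() for n in nodes)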


Key Takeaways

  1. All major frameworks support OpenTelemetry: This enables mixing and matching of frameworks with consistent observability

  2. Built-in evaluation varies significantly: LlamaIndex and LangChain provide native evaluation; others require external frameworks

  3. LangSmith integration dominates LangChain: While not required, LangSmith provides the most seamless experience for LangChain users

  4. Multi-agent frameworks lag in evaluation: CrewAI and AutoGen excel at orchestration but need external evaluation tools

  5. Framework choice doesn't lock observability: OpenTelemetry compatibility means you can change observability platforms without changing frameworks

  6. Hybrid patterns are common: Many production systems combine frameworks (e.g., LlamaIndex + LangChain) with unified observability


References

  1. LangChain. (2025). State of AI Agents. https://www.langchain.com/state-of-agent-engineering

  2. LangChain. (2025). LangSmith Observability. https://www.langchain.com/langsmith/observability

  3. Langfuse. (2025). Observability for LlamaIndex with Langfuse Integration. https://langfuse.com/integrations/frameworks/llamaindex

  4. Alphamatch. (2026). Top 7 Agentic AI Frameworks in 2026. https://www.alphamatch.ai/blog/top-agentic-ai-frameworks-2026

  5. AIMultiple. (2026). Compare Top 12 LLM Orchestration Frameworks in 2026. https://research.aimultiple.com/llm-orchestration/

  6. Kolekar, R. (2026). Production RAG in 2026: LangChain vs LlamaIndex. https://rahulkolekar.com/production-rag-in-2026-langchain-vs-llamaindex/

  7. Maxim AI. (2025). Best AI Agent Frameworks 2025. https://www.getmaxim.ai/articles/top-5-ai-agent-frameworks-in-2025-a-practical-guide-for-ai-builders/

  8. Softcery. (2025). 14 AI Agent Frameworks Compared. https://softcery.com/lab/top-14-ai-agent-frameworks-of-2025-a-founders-guide-to-building-smarter-systems

  9. CrewAI. (2025). Braintrust Integration. https://docs.crewai.com/en/observability/braintrust

  10. Mastra. (2026). Braintrust Exporter for AI Tracing. https://mastra.ai/docs/observability/ai-tracing/exporters/braintrust


Navigation: Previous Section: Cloud Provider Evaluation Platforms | Table of Contents | Next Section: Benchmark Suites