Part VII: Tools & Platforms

Estimated Reading Time: 14 minutes
AI agent development frameworks have evolved beyond simple abstractions over LLM calls to include sophisticated built-in observability and evaluation capabilities. Understanding these native features is essential for teams building production agents, because framework-native instrumentation often provides deeper insight with less configuration effort. This section examines the observability and evaluation features of leading agent frameworks, including LangChain/LangGraph, LlamaIndex, CrewAI, and Semantic Kernel, along with several other notable options.
- 17.1 LangChain and LangGraph
- 17.2 LlamaIndex
- 17.3 CrewAI
- 17.4 Semantic Kernel
- 17.5 Other Notable Frameworks
- 17.6 Framework Comparison
17.1 LangChain and LangGraph

LangChain is the most widely adopted framework for building LLM applications, with LangGraph as its dedicated extension for building stateful, multi-actor agents. The ecosystem provides extensive observability capabilities, with LangSmith as the preferred (but not required) observability platform.
Callbacks System: LangChain's callback system enables real-time monitoring of all framework operations:
- on_llm_start / on_llm_end: LLM invocation tracking
- on_chain_start / on_chain_end: Chain execution monitoring
- on_tool_start / on_tool_end: Tool usage tracking
- on_agent_action / on_agent_finish: Agent decision monitoring
- on_retriever_start / on_retriever_end: RAG retrieval tracking
```python
# Example: Custom callback handler
from langchain.callbacks.base import BaseCallbackHandler

class AgentTracker(BaseCallbackHandler):
    def on_tool_start(self, serialized, input_str, **kwargs):
        print(f"Tool called: {serialized['name']}")
        self.log_tool_usage(serialized["name"], input_str)

    def on_agent_action(self, action, **kwargs):
        print(f"Agent decided: {action.tool} with {action.tool_input}")

    def on_agent_finish(self, finish, **kwargs):
        print(f"Agent completed: {finish.return_values}")

    def log_tool_usage(self, tool_name, input_str):
        # Forward tool usage to your own metrics or logging backend
        ...
```

Run Trees: LangChain automatically organizes executions into hierarchical run trees that capture:
- Parent-child relationships between operations
- Input/output at each level
- Timing information
- Error states and exceptions
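The handler above can be attached at invocation time through the runnable config; a minimal sketch, assuming an existing LangChain runnable called `agent_executor`:

```python
# Pass the custom handler for this run; the resulting run tree nests the
# chain, tool, and LLM spans produced during the invocation.
result = agent_executor.invoke(
    {"input": "Summarize today's support tickets"},
    config={"callbacks": [AgentTracker()]},  # AgentTracker defined above
)
```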
LangChain pushes strongly toward LangSmith for comprehensive observability:
Automatic Tracing:
- Environment variable configuration for zero-code instrumentation
- Captures all LLM calls, tool invocations, and chain executions
- Detailed token usage and cost tracking
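As a minimal sketch of the zero-code setup (environment variable names follow the LangSmith documentation; the values are placeholders):

```python
import os

# Enabling LangSmith tracing requires no changes to chain or agent code
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-agent-project"  # optional: group runs by project

# All subsequent LangChain/LangGraph executions are traced automatically.
```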
Evaluation Capabilities:
- Dataset management for test cases
- LLM-as-judge evaluators
- Human feedback collection
- A/B testing support
- Regression detection
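A sketch of an offline evaluation run with the `langsmith` SDK; the dataset, the target function `my_agent`, and the evaluator are illustrative, not part of the library:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Build a small dataset of test cases
dataset = client.create_dataset(dataset_name="capital-questions")
client.create_example(
    inputs={"question": "What is the capital of France?"},
    outputs={"answer": "Paris"},
    dataset_id=dataset.id,
)

def exact_match(run, example):
    # Programmatic evaluator: compare the target's output to the reference answer
    predicted = (run.outputs or {}).get("answer", "")
    return {"key": "exact_match", "score": float(predicted.strip() == example.outputs["answer"])}

# Run the target over the dataset and apply the evaluators
results = evaluate(
    lambda inputs: {"answer": my_agent(inputs["question"])},  # my_agent is hypothetical
    data="capital-questions",
    evaluators=[exact_match],
)
```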
OpenTelemetry Support: LangSmith supports OTel to unify observability across services, enabling integration with existing monitoring stacks.
LangGraph adds agent-specific observability:
State Tracking:
- Monitor graph state at each node
- Track state mutations over time
- Debug complex state flows
Node-Level Instrumentation:
- Per-node timing and performance metrics
- Tool call patterns per node
- Token usage per node
Granular Tracing Recommendations: For LangGraph setups, best practices include:
- Use dedicated observability layer for span-level metrics
- Capture token usage, latency, and quality signals at both node and session levels
- Implement custom callbacks for graph-specific events
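A minimal sketch of node-level state observation using LangGraph's streaming API (the graph, state schema, and node logic are illustrative):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class AgentState(TypedDict):
    question: str
    answer: str

def answer_node(state: AgentState) -> dict:
    # In a real agent this node would call an LLM or a tool
    return {"answer": f"echo: {state['question']}"}

builder = StateGraph(AgentState)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)
graph = builder.compile()

# stream_mode="updates" yields each node's state mutation as it happens,
# which makes it easy to debug state flow node by node
for update in graph.stream({"question": "What changed today?"}, stream_mode="updates"):
    print(update)
```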
LangChain supports extensive third-party observability integrations:
- Langfuse: Full tracing and evaluation
- Arize Phoenix: OpenTelemetry-native observability
- MLflow: Tracing and experiment tracking
- Weights & Biases: Experiment management
- Helicone: Gateway-based observability
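For example, the Langfuse integration is a drop-in callback handler; a sketch assuming the Langfuse v2 SDK import path (newer releases may expose it elsewhere) and an existing runnable named `chain`:

```python
from langfuse.callback import CallbackHandler

# Credentials are read from LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST
langfuse_handler = CallbackHandler()

chain.invoke({"input": "hello"}, config={"callbacks": [langfuse_handler]})
```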
17.2 LlamaIndex

LlamaIndex excels at building production-grade RAG systems with built-in features for caching, streaming, and observability. The framework provides comprehensive instrumentation centered on OpenTelemetry.
Instrumentation Module: LlamaIndex provides a dedicated instrumentation module for comprehensive tracing:
```python
# Example: LlamaIndex instrumentation setup
from llama_index.core.instrumentation import get_dispatcher

# Get the global (root) dispatcher
dispatcher = get_dispatcher()

# Functions decorated with @dispatcher.span are traced automatically
@dispatcher.span
def custom_operation(query):
    return process_query(query)  # process_query is your own application logic
```

Event Types:
- LLM events (calls, streaming, tool use)
- Embedding events
- Retrieval events
- Query engine events
- Agent events
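Custom handlers can subscribe to these events on the root dispatcher; a minimal sketch, assuming the current `llama_index` instrumentation API:

```python
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler

class PrintEventHandler(BaseEventHandler):
    @classmethod
    def class_name(cls) -> str:
        return "PrintEventHandler"

    def handle(self, event, **kwargs) -> None:
        # Every instrumentation event (LLM, embedding, retrieval, ...) flows through here
        print(type(event).__name__, event.id_)

# Register on the root dispatcher so all modules report their events
get_dispatcher().add_event_handler(PrintEventHandler())
```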
LlamaIndex instrumentation is centered on OpenTelemetry:
Auto-Instrumentation:
- Compatible with OpenLLMetry for automatic tracing
- OpenInference conventions for standardized spans
- OTLP export to any compatible backend
Span Attributes:
- Query text and response
- Retrieved documents and scores
- LLM parameters and outputs
- Token counts and latency
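A sketch of wiring LlamaIndex spans to an OTLP backend through the OpenInference instrumentor (the collector endpoint is a placeholder):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Export spans to any OTLP-compatible collector (Phoenix, Langfuse, Jaeger, ...)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)

# All subsequent LlamaIndex operations emit OpenInference-convention spans
LlamaIndexInstrumentor().instrument(tracer_provider=provider)
```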
LlamaIndex provides native evaluation modules:
Built-in Evaluators:
- Faithfulness Evaluator: Measures if response is grounded in retrieved context
- Relevancy Evaluator: Assesses response relevance to query
- Correctness Evaluator: Compares responses to ground truth
- Guideline Evaluator: Checks adherence to custom guidelines
Evaluation Workflow:
```python
# Example: LlamaIndex evaluation
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# Evaluators use Settings.llm unless an LLM is passed explicitly
faithfulness = FaithfulnessEvaluator()
relevancy = RelevancyEvaluator()

# Evaluate a query response (query_engine is built elsewhere)
query = "What is the capital of France?"
response = query_engine.query(query)

# Check faithfulness: is the answer grounded in the retrieved context?
faith_result = faithfulness.evaluate_response(response=response)
print(f"Faithfulness: {faith_result.score}")

# Check relevancy: does the answer address the query?
rel_result = relevancy.evaluate_response(query=query, response=response)
print(f"Relevancy: {rel_result.score}")
```

LlamaIndex also offers a dedicated Langfuse integration:
- Automatic trace capture for all operations
- Cost tracking per query
- Session and user tracking
- Evaluation score logging
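A sketch of enabling it through the callback handler documented for the Langfuse v2 SDK (newer Langfuse releases use an instrumentation-based setup instead):

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from langfuse.llama_index import LlamaIndexCallbackHandler

# Credentials are read from the LANGFUSE_* environment variables
langfuse_handler = LlamaIndexCallbackHandler()
Settings.callback_manager = CallbackManager([langfuse_handler])

# From here on, every index and query operation is traced to Langfuse.
```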
17.3 CrewAI

CrewAI specializes in role-based multi-agent collaboration, enabling team-oriented workflows where multiple agents work together. The framework provides built-in observability features for tracking agent interactions and task execution.
Agent Execution Tracking:
- Monitor individual agent activities
- Track task assignments and completions
- Log inter-agent communications
Task Metrics:
- Task execution time
- Success/failure rates
- Retry counts
- Output quality indicators
Crew-Level Insights:
- Overall workflow progress
- Bottleneck identification
- Resource utilization
CrewAI supports multiple observability platforms:
Braintrust Integration:
```python
# Example: CrewAI with Braintrust
from crewai import Crew, Agent, Task
import braintrust

# Initialize Braintrust (project configuration per the Braintrust docs)
braintrust.init(api_key="your-api-key")

# Agents and tasks (research_agent, writer_agent, ...) are defined elsewhere
crew = Crew(
    agents=[research_agent, writer_agent],
    tasks=[research_task, writing_task],
    verbose=True,  # Enable logging
)

# Execute with automatic tracing
result = crew.kickoff()
```

Other Integrations:
- Langfuse for detailed tracing
- Arize Phoenix for OpenTelemetry-native observability
- MLflow for experiment tracking
- Custom callbacks for proprietary systems
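As one example, the Phoenix route uses an OpenInference instrumentor for CrewAI; a sketch assuming the `openinference-instrumentation-crewai` package and a local Phoenix collector:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.crewai import CrewAIInstrumentor

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:6006/v1/traces"))
)

# Traces every agent, task, and tool call the crew executes
CrewAIInstrumentor().instrument(tracer_provider=provider)
```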
While CrewAI excels at multi-agent orchestration, teams often need to:
- Implement custom observability layers
- Integrate external evaluation frameworks
- Build governance and deployment pipelines separately
17.4 Semantic Kernel

Microsoft Semantic Kernel is an open-source SDK for building AI agents and copilots, with deep integration into the Microsoft ecosystem and Azure AI Foundry.
Telemetry Support: Semantic Kernel provides built-in telemetry through:
- OpenTelemetry instrumentation
- Azure Application Insights integration
- Custom telemetry providers
Kernel Events:
- Function invocation tracking
- Plugin execution monitoring
- Memory operation logging
- LLM call instrumentation
```csharp
// Example: Semantic Kernel with OpenTelemetry (C#)
using Microsoft.SemanticKernel;
using OpenTelemetry;
using OpenTelemetry.Trace;

// Listen to all Semantic Kernel activity sources; swap the console exporter
// for an OTLP exporter in production
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddSource("Microsoft.SemanticKernel*")
    .AddConsoleExporter()
    .Build();

var apiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY");
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-4", apiKey)
    .Build();
```

Semantic Kernel agents integrate seamlessly with Azure AI Foundry evaluation:
- Converter support for evaluation data
- Compatible with IntentResolution, ToolCallAccuracy, and other evaluators
- Continuous evaluation through Foundry dashboard
Semantic Kernel provides consistent observability across:
- C#/.NET: Primary SDK with full feature support
- Python: Feature-parity SDK
- Java: Growing feature set
17.5 Other Notable Frameworks

AutoGen enables building multi-agent applications with conversational patterns.
Observability Features:
- Conversation logging
- Agent interaction tracking
- Message flow visualization
Integrations:
- OpenLLMetry for tracing
- MLflow for experiment tracking
- Custom logging handlers
DSPy provides a programming framework for LLMs with automatic prompt optimization.
Observability Features:
- Compilation traces
- Module execution logs
- Optimization metrics
Integrations:
- Arize Phoenix for tracing
- OpenLLMetry instrumentation
- MLflow integration
Haystack is a framework for building RAG and NLP pipelines.
Observability Features:
- Pipeline component tracing
- Document retrieval metrics
- Query performance monitoring
Integrations:
- OpenTelemetry native support
- Langfuse integration
- Arize Phoenix compatibility
Mastra is an emerging framework with first-class observability support.
Observability Features:
- Built-in tracing
- Workflow monitoring
- Tool usage analytics
Integrations:
- Braintrust native support
- OpenTelemetry exporters
- Multiple backend options
NVIDIA NeMo offers framework-agnostic integration with production optimization features.
Observability Features:
- Cross-framework performance tracking
- Customizer for throughput optimization
- Agent Toolkit for comprehensive monitoring
Framework Compatibility: Works with CrewAI, Haystack, LangChain, and LlamaIndex.
17.6 Framework Comparison

Native observability and evaluation capabilities by framework:

| Framework | Native Tracing | OTel Support | Built-in Eval | Cost Tracking | Multi-Agent |
|---|---|---|---|---|---|
| LangChain/LangGraph | ✅ Callbacks | ✅ Via LangSmith | ✅ | ✅ | ✅ |
| LlamaIndex | ✅ Instrumentation | ✅ Native | ✅ Native | ✅ | Partial |
| CrewAI | ✅ Basic | ✅ Via integrations | ❌ External | ✅ | ✅ Native |
| Semantic Kernel | ✅ Telemetry | ✅ Native | ✅ Via Azure | ✅ | ✅ |
| AutoGen | ✅ Logging | ✅ Via integrations | ❌ External | ❌ | ✅ Native |
| DSPy | ✅ Traces | ✅ Via Phoenix | ❌ External | ❌ | ❌ |
| Haystack | ✅ Pipeline | ✅ Native | ✅ Basic | ✅ | Partial |
Integration support across third-party observability platforms:

| Framework | Langfuse | Phoenix | MLflow | LangSmith | Braintrust |
|---|---|---|---|---|---|
| LangChain | ✅ | ✅ | ✅ | ✅ Native | ✅ |
| LlamaIndex | ✅ Native | ✅ | ✅ | Partial | ✅ |
| CrewAI | ✅ | ✅ | ✅ | ✅ | ✅ Native |
| Semantic Kernel | Partial | ✅ | ✅ | ❌ | ✅ |
| AutoGen | ✅ | ✅ | ✅ | Partial | ✅ |
| DSPy | ✅ | ✅ Native | ✅ | Partial | ✅ |
| Haystack | ✅ | ✅ | ❌ | ❌ | ✅ |
Choose LangChain/LangGraph when:
- Building complex agent workflows
- Need mature ecosystem with extensive integrations
- Want LangSmith for managed observability
Choose LlamaIndex when:
- Primarily building RAG systems
- Need native evaluation capabilities
- Want strong OpenTelemetry integration
Choose CrewAI when:
- Building multi-agent collaborative systems
- Role-based agent design fits use case
- Team-oriented workflow patterns needed
Choose Semantic Kernel when:
- Working in Microsoft/Azure ecosystem
- Need C#/.NET support
- Want Azure AI Foundry integration
Regardless of framework choice, best practices include:

- Pair with a dedicated observability layer: No framework provides everything; add specialized tools for the gaps
- Implement simulation and evaluation: Framework tracing alone is insufficient; add evaluation loops
- Use human expert review: Automated metrics need human validation, especially for subjective quality
- Configure alerts: Monitor key metrics and trigger notifications on degradation
- Adopt a common production pattern: Consider LlamaIndex for ingestion/retrieval plus LangChain for orchestration

Key Takeaways:

- All major frameworks support OpenTelemetry: This enables mixing and matching frameworks with consistent observability
- Built-in evaluation varies significantly: LlamaIndex and LangChain provide native evaluation; others require external frameworks
- LangSmith integration dominates LangChain: While not required, LangSmith provides the most seamless experience for LangChain users
- Multi-agent frameworks lag in evaluation: CrewAI and AutoGen excel at orchestration but need external evaluation tools
- Framework choice doesn't lock observability: OpenTelemetry compatibility means you can change observability platforms without changing frameworks
- Hybrid patterns are common: Many production systems combine frameworks (e.g., LlamaIndex + LangChain) with unified observability
Navigation: ← Previous Section: Cloud Provider Evaluation Platforms | Table of Contents | Next Section: Benchmark Suites →