A Comprehensive Framework for Evaluating Temporal Reasoning in RAG Systems
Features • Installation • Quick Start • Metrics • Examples • Docs • Citation
TempoEval is a state-of-the-art evaluation framework designed specifically for assessing temporal reasoning in Retrieval-Augmented Generation (RAG) systems. Unlike traditional metrics that only measure relevance, TempoEval provides 16 specialized metrics that evaluate how well your RAG system understands time, retrieves temporally relevant documents, and generates temporally accurate content.
Traditional RAG evaluation metrics fail to capture temporal nuances:
| Scenario | Traditional Metrics | TempoEval |
|---|---|---|
| Query: "What happened in 2020?" β Retrieved doc about 2019 | β High similarity | β Low temporal precision |
| Answer mentions dates not in context | β Fluent text | β Temporal hallucination detected |
| Cross-period query needs docs from multiple eras | β Partial coverage | β Full temporal coverage measured |
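As a concrete illustration of the first row, the Focus Time utilities shown in the Quick Start below make the mismatch measurable. A minimal sketch (the document text and the expected low score are illustrative; the calls mirror the Quick Start):

```python
from tempoeval.core import extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Query targets 2020, but the retrieved document talks about 2019.
qft = extract_qft("What happened in 2020?")
dfts = [extract_dft("Bitcoin's price recovered steadily throughout 2019.")]

# The focus times do not overlap, so temporal precision should be low
# even if a lexical or embedding similarity score would be high.
score = TemporalPrecision(use_focus_time=True).compute(qft=qft, dfts=dfts, k=1)
print(f"Temporal Precision@1: {score:.3f}")
```

The metrics that catch these failures are organized into three layers: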
```
┌────────────────────────────────────────────────────────────────────┐
│  Layer 3: REASONING METRICS                                        │
│  └─ Event Ordering • Duration Accuracy • Cross-Period Reasoning    │
├────────────────────────────────────────────────────────────────────┤
│  Layer 2: GENERATION METRICS                                       │
│  └─ Faithfulness • Hallucination • Coherence • Alignment           │
├────────────────────────────────────────────────────────────────────┤
│  Layer 1: RETRIEVAL METRICS                                        │
│  └─ Precision • Recall • NDCG • Coverage • Diversity • MRR         │
└────────────────────────────────────────────────────────────────────┘
```
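Each layer corresponds to a family of metric classes. A rough sketch of one representative per layer (the reasoning-layer import path is assumed to mirror the retrieval and generation metrics shown in the tables below):

```python
# One representative metric per layer; see the metric tables for the full list.
from tempoeval.metrics import (
    TemporalNDCG,          # Layer 1: retrieval
    TemporalFaithfulness,  # Layer 2: generation (LLM-judged)
    EventOrdering,         # Layer 3: reasoning (import path assumed)
    TempoScore,            # Composite score across all layers
)
```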
| Feature | Description |
|---|---|
| Focus Time Extraction | Automatically extract temporal focus from queries and documents |
| 16 Specialized Metrics | Comprehensive temporal evaluation across retrieval, generation, and reasoning |
| LLM-as-Judge | Use GPT-4, Claude, or other LLMs for nuanced temporal assessment |
| Dual-Mode Evaluation | Rule-based (fast) or LLM-based (accurate) metric computation |
| TempoScore | Unified composite score combining all temporal dimensions |
| Cost Tracking | Built-in efficiency monitoring for latency and API costs |
| TEMPO Benchmark | Integrated support for the TEMPO temporal QA benchmark |
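The dual-mode split shows up directly in the metric constructors: the retrieval metrics from the Quick Start run rule-based with no LLM, while the generation metrics take an `llm` provider. A minimal sketch reusing the calls shown below (credentials are needed only for the LLM-judged metric):

```python
from tempoeval.metrics import TemporalRecall, TemporalFaithfulness
from tempoeval.llm import AzureOpenAIProvider

# Rule-based mode: fast, deterministic, no API calls.
recall = TemporalRecall().compute(
    retrieved_ids=["doc_2020", "doc_2019"], gold_ids=["doc_2020"], k=5
)

# LLM-based mode: pass a provider to enable judge-style scoring.
llm = AzureOpenAIProvider()  # reads Azure credentials from the environment
faithfulness = TemporalFaithfulness(llm=llm).compute(
    answer="Bitcoin pruning was introduced in 2015.",
    contexts=["Bitcoin Core 0.11.0 was released on July 12, 2015 with pruning support."],
)
```

Both modes ship with the base package; the LLM providers are optional extras (see installation below).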
```bash
pip install tempoeval
```

Or install from source:

```bash
git clone https://github.com/DataScienceUIBK/tempoeval.git
cd tempoeval
pip install -e .
```

Optional dependencies:

```bash
# For LLM-based evaluation (recommended)
pip install openai anthropic

# For BM25 retrieval in examples
pip install gensim pyserini

# For TEMPO benchmark loading
pip install datasets huggingface_hub pyarrow
```

```python
from tempoeval.metrics import TemporalRecall, TemporalNDCG, TemporalPrecision
from tempoeval.core import FocusTime
# Your retrieval results
retrieved_ids = ["doc_2020", "doc_2019", "doc_2021"]
gold_ids = ["doc_2020", "doc_2021"]
# Compute metrics
recall = TemporalRecall().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
ndcg = TemporalNDCG().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
print(f"Temporal Recall@5: {recall:.3f}")
print(f"Temporal NDCG@5: {ndcg:.3f}")from tempoeval.core import FocusTime, extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision
# Extract Focus Time from query
query = "What happened to Bitcoin in 2017?"
qft = extract_qft(query) # FocusTime(years={2017})
# Extract Focus Time from documents
documents = [
"Bitcoin reached $20,000 in December 2017.",
"Ethereum launched in 2015.",
"The SegWit upgrade activated in August 2017.",
]
dfts = [extract_dft(doc) for doc in documents]
# Evaluate temporal precision
precision = TemporalPrecision(use_focus_time=True)
score = precision.compute(qft=qft, dfts=dfts, k=3)
print(f"Temporal Precision@3: {score:.3f}")import os
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import (
TemporalFaithfulness,
TemporalHallucination,
TemporalCoherence,
TempoScore
)
# Configure LLM
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_DEPLOYMENT_NAME"] = "gpt-4o"
llm = AzureOpenAIProvider()
# Your RAG output
query = "When was Bitcoin pruning introduced?"
contexts = ["Bitcoin Core 0.11.0 was released on July 12, 2015 with pruning support."]
answer = "Bitcoin pruning was introduced in version 0.11.0, released on July 12, 2015."
# Evaluate generation quality
faithfulness = TemporalFaithfulness(llm=llm)
hallucination = TemporalHallucination(llm=llm)
coherence = TemporalCoherence(llm=llm)
print(f"Faithfulness: {faithfulness.compute(answer=answer, contexts=contexts):.3f}")
print(f"Hallucination: {hallucination.compute(answer=answer, contexts=contexts):.3f}")
print(f"Coherence: {coherence.compute(answer=answer):.3f}")
# Compute unified TempoScore
tempo_scorer = TempoScore()
result = tempo_scorer.compute(
temporal_precision=0.9,
temporal_recall=0.85,
temporal_faithfulness=1.0,
temporal_coherence=1.0
)
print(f"\nπ― TempoScore: {result['tempo_weighted']:.3f}")| Metric | Description | LLM Required |
|---|---|---|
TemporalPrecision |
% of retrieved docs matching query's temporal focus | Optional |
TemporalRecall |
% of relevant temporal docs retrieved | Optional |
TemporalNDCG |
Ranking quality with temporal relevance grading | No |
TemporalMRR |
Reciprocal rank of first temporally relevant doc | No |
TemporalCoverage |
Coverage of required time periods (cross-period) | Yes |
TemporalDiversity |
Variety of time periods in retrieved docs | Optional |
AnchorCoverage |
Coverage of key temporal anchors | Optional |
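The list-based retrieval metrics share the `compute(retrieved_ids=..., gold_ids=..., k=...)` interface used in the Quick Start. A sketch for `TemporalMRR`, assuming it follows the same signature (see the API reference for exact parameters):

```python
from tempoeval.metrics import TemporalMRR

retrieved_ids = ["doc_2019", "doc_2020", "doc_2021"]  # first relevant doc at rank 2
gold_ids = ["doc_2020", "doc_2021"]

# Signature assumed to mirror TemporalRecall / TemporalNDCG.
mrr = TemporalMRR().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
print(f"Temporal MRR@5: {mrr:.3f}")  # expect ~0.5 if MRR = 1 / rank of first hit
```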
| Metric | Description | LLM Required |
|---|---|---|
| `TemporalFaithfulness` | Are temporal claims supported by context? | Yes |
| `TemporalHallucination` | % of fabricated temporal information | Yes |
| `TemporalCoherence` | Internal consistency of temporal statements | Yes |
| `AnswerTemporalAlignment` | Does answer focus on the right time period? | Yes |
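These metrics take the same `answer`/`contexts` pair as in the Quick Start. A small sketch with a deliberately unsupported date; the comments describe the expected direction of the scores, not guaranteed values:

```python
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import TemporalFaithfulness, TemporalHallucination

llm = AzureOpenAIProvider()  # Azure credentials via environment variables
contexts = ["Bitcoin Core 0.11.0 was released on July 12, 2015 with pruning support."]
bad_answer = "Bitcoin pruning was introduced in 2013."  # date not in the context

# Faithfulness should drop and hallucination should rise for the fabricated date.
print(TemporalFaithfulness(llm=llm).compute(answer=bad_answer, contexts=contexts))
print(TemporalHallucination(llm=llm).compute(answer=bad_answer, contexts=contexts))
```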
| Metric | Description | LLM Required |
|---|---|---|
| `EventOrdering` | Correctness of event sequence | Yes |
| `DurationAccuracy` | Accuracy of duration/interval claims | Yes |
| `CrossPeriodReasoning` | Quality of comparison across time periods | Yes |
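The reasoning metrics are also LLM-judged. Their exact signatures are not shown above, so the sketch below assumes an `answer`/`contexts` interface like the generation metrics; treat the argument names as placeholders and check the API reference:

```python
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import EventOrdering  # import path assumed

llm = AzureOpenAIProvider()
contexts = [
    "Ethereum launched in 2015.",
    "The SegWit upgrade activated in August 2017.",
]
answer = "Ethereum launched first; SegWit activated roughly two years later."

# Hypothetical call: argument names assumed to mirror the generation metrics.
score = EventOrdering(llm=llm).compute(answer=answer, contexts=contexts)
print(f"Event ordering: {score:.3f}")
```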
| Metric | Description |
|---|---|
| `TempoScore` | Unified score combining all temporal dimensions |
```
tempoeval/
├── core/                      # Core components
│   ├── focus_time.py          # Focus Time extraction
│   ├── evaluator.py           # Main evaluation orchestrator
│   ├── config.py              # Configuration management
│   └── result.py              # Result containers
├── metrics/                   # All 16 metrics
│   ├── retrieval/             # Layer 1 metrics
│   ├── generation/            # Layer 2 metrics
│   ├── reasoning/             # Layer 3 metrics
│   └── composite/             # TempoScore
├── llm/                       # LLM provider integrations
│   ├── openai_provider.py
│   ├── azure_provider.py
│   └── anthropic_provider.py
├── datasets/                  # Dataset loaders
│   ├── tempo.py               # TEMPO benchmark
│   └── timebench.py           # TimeBench
├── guidance/                  # Temporal guidance generation
├── efficiency/                # Cost & latency tracking
└── utils/                     # Utility functions
```
We provide comprehensive examples in the examples/ directory:
| Example | Description | LLM Required |
|---|---|---|
| `01_retrieval_bm25.py` | Basic retrieval evaluation | ❌ |
| `02_rag_generation.py` | RAG generation evaluation | ✅ |
| `03_full_pipeline.py` | Complete RAG pipeline | ✅ |
| `04_tempo_dataset.py` | Using TEMPO benchmark | ❌ |
| `05_cross_period.py` | Cross-period queries | ✅ |
| `06_tempo_hsm_complete.py` | Full HSM evaluation | ✅ |
| `07_generate_guidance.py` | Generate temporal guidance | ✅ |
| `08_pipeline_with_generated_guidance.py` | End-to-end pipeline | ✅ |
```bash
cd examples

# Copy and configure credentials (for LLM examples)
cp .env.example .env
# Edit .env with your API keys

# Run examples
python 01_retrieval_bm25.py   # No LLM needed
python 02_rag_generation.py   # Requires .env
```

Create a `.env` file or set environment variables:

```bash
# Azure OpenAI (Recommended)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_VERSION=2024-05-01-preview
# Or OpenAI
OPENAI_API_KEY=your-openai-key
# Or Anthropic
ANTHROPIC_API_KEY=your-anthropic-key
```

```python
from tempoeval.core import TempoEvalConfig
config = TempoEvalConfig(
k_values=[5, 10, 20], # Evaluation depths
use_focus_time=True, # Enable Focus Time extraction
llm_provider="azure", # LLM provider
parallel_requests=10, # Concurrent LLM calls
)
```

TempoEval includes built-in support for the TEMPO benchmark, a comprehensive temporal QA dataset:

```python
from tempoeval.datasets import load_tempo, load_tempo_documents
# Load queries with temporal annotations
queries = load_tempo(domain="bitcoin", max_samples=100)
# Load corpus documents
documents = load_tempo_documents(domain="bitcoin")
# Available domains: bitcoin, cardano, economics, hsm (History of Science & Medicine)
```

Track latency and costs for LLM-based evaluation:

```python
from tempoeval.efficiency import EfficiencyTracker
tracker = EfficiencyTracker(model_name="gpt-4o")
# ... run your evaluation ...
# Get summary
summary = tracker.summary()
print(f"Total Cost: ${summary['total_cost_usd']:.4f}")
print(f"Avg Latency: {summary['avg_latency_ms']:.1f}ms")# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_core.py -v
# Run with coverage
pytest tests/ --cov=tempoeval --cov-report=html
```

Full documentation is available at: https://tempoeval.readthedocs.io/en/latest/
If you use TempoEval in your research, please cite our paper:
Citation coming soon.

We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built on top of the TEMPO Benchmark
- LLM integrations via OpenAI, Azure OpenAI, and Anthropic
Made with ❤️ for the Temporal IR Community