⏱️ TempoEval

A Comprehensive Framework for Evaluating Temporal Reasoning in RAG Systems

PyPI Version GitHub Stars License Python Version arXiv

Features • Installation • Quick Start • Metrics • Examples • Docs • Citation


🎯 Overview

TempoEval is a state-of-the-art evaluation framework designed specifically for assessing temporal reasoning capabilities in Retrieval-Augmented Generation (RAG) systems. Unlike traditional metrics that only measure relevance, TempoEval provides 16 specialized metrics that evaluate how well your RAG system understands, retrieves, and generates temporally accurate content.

16 Metrics 3 Layers Focus Time

🤔 Why TempoEval?

Traditional RAG evaluation metrics fail to capture temporal nuances:

| Scenario | Traditional Metrics | TempoEval |
|---|---|---|
| Query: "What happened in 2020?" → Retrieved doc about 2019 | ✅ High similarity | ❌ Low temporal precision |
| Answer mentions dates not in context | ✅ Fluent text | ❌ Temporal hallucination detected |
| Cross-period query needs docs from multiple eras | ❌ Partial coverage | ✅ Full temporal coverage measured |

✨ Key Features

📊 Three-Layer Evaluation Architecture

┌─────────────────────────────────────────────────────────────────┐
│  Layer 3: REASONING METRICS                                     │
│  └─ Event Ordering • Duration Accuracy • Cross-Period Reasoning │
├─────────────────────────────────────────────────────────────────┤
│  Layer 2: GENERATION METRICS                                    │
│  └─ Faithfulness • Hallucination • Coherence • Alignment        │
├─────────────────────────────────────────────────────────────────┤
│  Layer 1: RETRIEVAL METRICS                                     │
│  └─ Precision • Recall • NDCG • Coverage • Diversity • MRR      │
└─────────────────────────────────────────────────────────────────┘
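
As a rough sketch of how the layers map onto code: Layer 1 metrics run rule-based, while Layers 2 and 3 are judged by an LLM. The calls below only combine examples shown in the Quick Start further down.

from tempoeval.metrics import TemporalRecall, TemporalFaithfulness

# Layer 1 (retrieval): rule-based, no LLM required
recall = TemporalRecall().compute(
    retrieved_ids=["doc_2020", "doc_2019", "doc_2021"],
    gold_ids=["doc_2020", "doc_2021"],
    k=5,
)

# Layers 2 and 3 (generation, reasoning) need an LLM judge, e.g.:
# faithfulness = TemporalFaithfulness(llm=llm).compute(answer=answer, contexts=contexts)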

🔑 Core Capabilities

| Feature | Description |
|---|---|
| 🎯 Focus Time Extraction | Automatically extract temporal focus from queries and documents |
| 📈 16 Specialized Metrics | Comprehensive temporal evaluation across retrieval, generation, and reasoning |
| 🤖 LLM-as-Judge | Use GPT-4, Claude, or other LLMs for nuanced temporal assessment |
| ⚡ Dual-Mode Evaluation | Rule-based (fast) or LLM-based (accurate) metric computation |
| 📊 TempoScore | Unified composite score combining all temporal dimensions |
| 💰 Cost Tracking | Built-in efficiency monitoring for latency and API costs |
| 📦 TEMPO Benchmark | Integrated support for the TEMPO temporal QA benchmark |
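
To illustrate dual-mode evaluation: retrieval metrics such as TemporalPrecision can be computed rule-based from Focus Times, or handed an LLM judge for more nuanced matching. The llm keyword on a retrieval metric below is an assumption, mirroring how the generation metrics accept llm=... in the Quick Start.

from tempoeval.metrics import TemporalPrecision

# Fast, rule-based mode: compare query and document Focus Times directly
precision_rule_based = TemporalPrecision(use_focus_time=True)

# LLM-judged mode (assumed constructor argument, by analogy with TemporalFaithfulness(llm=...))
# precision_llm_based = TemporalPrecision(llm=llm)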

📦 Installation

Via pip (Recommended)

pip install tempoeval

From Source

git clone https://github.com/DataScienceUIBK/tempoeval.git
cd tempoeval
pip install -e .

Optional Dependencies

# For LLM-based evaluation (recommended)
pip install openai anthropic

# For BM25 retrieval in examples
pip install gensim pyserini

# For TEMPO benchmark loading
pip install datasets huggingface_hub pyarrow

🚀 Quick Start

Basic Retrieval Evaluation (No LLM Required)

from tempoeval.metrics import TemporalRecall, TemporalNDCG

# Your retrieval results
retrieved_ids = ["doc_2020", "doc_2019", "doc_2021"]
gold_ids = ["doc_2020", "doc_2021"]

# Compute metrics
recall = TemporalRecall().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)
ndcg = TemporalNDCG().compute(retrieved_ids=retrieved_ids, gold_ids=gold_ids, k=5)

print(f"Temporal Recall@5: {recall:.3f}")
print(f"Temporal NDCG@5: {ndcg:.3f}")

Focus Time-Based Evaluation

from tempoeval.core import FocusTime, extract_qft, extract_dft
from tempoeval.metrics import TemporalPrecision

# Extract Focus Time from query
query = "What happened to Bitcoin in 2017?"
qft = extract_qft(query)  # FocusTime(years={2017})

# Extract Focus Time from documents
documents = [
    "Bitcoin reached $20,000 in December 2017.",
    "Ethereum launched in 2015.",
    "The SegWit upgrade activated in August 2017.",
]
dfts = [extract_dft(doc) for doc in documents]

# Evaluate temporal precision
precision = TemporalPrecision(use_focus_time=True)
score = precision.compute(qft=qft, dfts=dfts, k=3)
print(f"Temporal Precision@3: {score:.3f}")

LLM-Based Generation Evaluation

import os
from tempoeval.llm import AzureOpenAIProvider
from tempoeval.metrics import (
    TemporalFaithfulness,
    TemporalHallucination,
    TemporalCoherence,
    TempoScore
)

# Configure LLM
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_DEPLOYMENT_NAME"] = "gpt-4o"

llm = AzureOpenAIProvider()

# Your RAG output
query = "When was Bitcoin pruning introduced?"
contexts = ["Bitcoin Core 0.11.0 was released on July 12, 2015 with pruning support."]
answer = "Bitcoin pruning was introduced in version 0.11.0, released on July 12, 2015."

# Evaluate generation quality
faithfulness = TemporalFaithfulness(llm=llm)
hallucination = TemporalHallucination(llm=llm)
coherence = TemporalCoherence(llm=llm)

print(f"Faithfulness: {faithfulness.compute(answer=answer, contexts=contexts):.3f}")
print(f"Hallucination: {hallucination.compute(answer=answer, contexts=contexts):.3f}")
print(f"Coherence: {coherence.compute(answer=answer):.3f}")

# Compute unified TempoScore
tempo_scorer = TempoScore()
result = tempo_scorer.compute(
    temporal_precision=0.9,
    temporal_recall=0.85,
    temporal_faithfulness=1.0,
    temporal_coherence=1.0
)
print(f"\n🎯 TempoScore: {result['tempo_weighted']:.3f}")

📊 Metrics

Layer 1: Retrieval Metrics

| Metric | Description | LLM Required |
|---|---|---|
| TemporalPrecision | % of retrieved docs matching query's temporal focus | Optional |
| TemporalRecall | % of relevant temporal docs retrieved | Optional |
| TemporalNDCG | Ranking quality with temporal relevance grading | No |
| TemporalMRR | Reciprocal rank of first temporally relevant doc | No |
| TemporalCoverage | Coverage of required time periods (cross-period) | Yes |
| TemporalDiversity | Variety of time periods in retrieved docs | Optional |
| AnchorCoverage | Coverage of key temporal anchors | Optional |
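
TemporalMRR follows the same rule-based pattern as TemporalRecall and TemporalNDCG in the Quick Start; the argument names below are assumed to match that compute() pattern.

from tempoeval.metrics import TemporalMRR

# Reciprocal rank of the first temporally relevant document
mrr = TemporalMRR().compute(
    retrieved_ids=["doc_2019", "doc_2020", "doc_2021"],  # ranked list from your retriever
    gold_ids=["doc_2020", "doc_2021"],
    k=10,
)
print(f"Temporal MRR@10: {mrr:.3f}")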

Layer 2: Generation Metrics

| Metric | Description | LLM Required |
|---|---|---|
| TemporalFaithfulness | Are temporal claims supported by context? | Yes |
| TemporalHallucination | % of fabricated temporal information | Yes |
| TemporalCoherence | Internal consistency of temporal statements | Yes |
| AnswerTemporalAlignment | Does answer focus on the right time period? | Yes |

Layer 3: Reasoning Metrics

| Metric | Description | LLM Required |
|---|---|---|
| EventOrdering | Correctness of event sequence | Yes |
| DurationAccuracy | Accuracy of duration/interval claims | Yes |
| CrossPeriodReasoning | Quality of comparison across time periods | Yes |
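
The reasoning metrics are not demonstrated elsewhere in this README, so here is a minimal, hedged sketch. It reuses the llm provider configured in the Quick Start, and the compute() signatures are assumed to mirror the generation metrics (answer plus contexts).

from tempoeval.metrics import EventOrdering, DurationAccuracy

answer = "SegWit activated in August 2017, about seven months after Bitcoin passed $1,000 in January 2017."
contexts = [
    "Bitcoin crossed $1,000 in early January 2017.",
    "The SegWit upgrade activated in August 2017.",
]

# Assumed signatures: answer + contexts, judged by the configured LLM
ordering = EventOrdering(llm=llm).compute(answer=answer, contexts=contexts)
duration = DurationAccuracy(llm=llm).compute(answer=answer, contexts=contexts)
print(f"Event ordering: {ordering:.3f}  |  Duration accuracy: {duration:.3f}")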

Composite Metrics

| Metric | Description |
|---|---|
| TempoScore | Unified score combining all temporal dimensions |

πŸ“ Project Structure

tempoeval/
├── 📦 core/                    # Core components
│   ├── focus_time.py          # Focus Time extraction
│   ├── evaluator.py           # Main evaluation orchestrator
│   ├── config.py              # Configuration management
│   └── result.py              # Result containers
├── 📊 metrics/                 # All 16 metrics
│   ├── retrieval/             # Layer 1 metrics
│   ├── generation/            # Layer 2 metrics
│   ├── reasoning/             # Layer 3 metrics
│   └── composite/             # TempoScore
├── 🤖 llm/                     # LLM provider integrations
│   ├── openai_provider.py
│   ├── azure_provider.py
│   └── anthropic_provider.py
├── 📈 datasets/                # Dataset loaders
│   ├── tempo.py               # TEMPO benchmark
│   └── timebench.py           # TimeBench
├── 🔧 guidance/                # Temporal guidance generation
├── ⚡ efficiency/              # Cost & latency tracking
└── 🛠️ utils/                   # Utility functions

📚 Examples

We provide comprehensive examples in the examples/ directory:

| Example | Description | LLM Required |
|---|---|---|
| 01_retrieval_bm25.py | Basic retrieval evaluation | ❌ |
| 02_rag_generation.py | RAG generation evaluation | ✅ |
| 03_full_pipeline.py | Complete RAG pipeline | ✅ |
| 04_tempo_dataset.py | Using TEMPO benchmark | ❌ |
| 05_cross_period.py | Cross-period queries | ✅ |
| 06_tempo_hsm_complete.py | Full HSM evaluation | ✅ |
| 07_generate_guidance.py | Generate temporal guidance | ✅ |
| 08_pipeline_with_generated_guidance.py | End-to-end pipeline | ✅ |

Running Examples

cd examples

# Copy and configure credentials (for LLM examples)
cp .env.example .env
# Edit .env with your API keys

# Run examples
python 01_retrieval_bm25.py      # No LLM needed
python 02_rag_generation.py      # Requires .env

🔧 Configuration

Environment Variables (for LLM-based evaluation)

Create a .env file or set environment variables:

# Azure OpenAI (Recommended)
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY=your-api-key
AZURE_DEPLOYMENT_NAME=gpt-4o
AZURE_OPENAI_API_VERSION=2024-05-01-preview

# Or OpenAI
OPENAI_API_KEY=your-openai-key

# Or Anthropic
ANTHROPIC_API_KEY=your-anthropic-key
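
If you keep these settings in a .env file (as the examples do), one common option is to load it with python-dotenv before constructing a provider; note that python-dotenv is an extra dependency, not something tempoeval installs for you.

# pip install python-dotenv
from dotenv import load_dotenv
from tempoeval.llm import AzureOpenAIProvider

load_dotenv()                 # copies the .env entries into os.environ
llm = AzureOpenAIProvider()   # picks up the AZURE_* variables from the environment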

Programmatic Configuration

from tempoeval.core import TempoEvalConfig

config = TempoEvalConfig(
    k_values=[5, 10, 20],           # Evaluation depths
    use_focus_time=True,            # Enable Focus Time extraction
    llm_provider="azure",           # LLM provider
    parallel_requests=10,           # Concurrent LLM calls
)

📈 TEMPO Benchmark

TempoEval includes built-in support for the TEMPO benchmark - a comprehensive temporal QA dataset:

from tempoeval.datasets import load_tempo, load_tempo_documents

# Load queries with temporal annotations
queries = load_tempo(domain="bitcoin", max_samples=100)

# Load corpus documents
documents = load_tempo_documents(domain="bitcoin")

# Available domains: bitcoin, cardano, economics, hsm (History of Science & Medicine)
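
A sketch of wiring the loaded queries into the retrieval metrics. The record fields ("question", "gold_ids") and the toy retrieve() function are hypothetical placeholders for illustration, so check the TEMPO loader's actual schema before relying on them.

from tempoeval.datasets import load_tempo
from tempoeval.metrics import TemporalRecall

def retrieve(question: str) -> list[str]:
    # Placeholder retriever: swap in BM25/dense retrieval over load_tempo_documents()
    return ["doc_2017_01", "doc_2017_02"]

recall = TemporalRecall()
for record in load_tempo(domain="bitcoin", max_samples=5):
    # "question" and "gold_ids" are assumed field names for illustration
    retrieved_ids = retrieve(record["question"])
    score = recall.compute(retrieved_ids=retrieved_ids, gold_ids=record["gold_ids"], k=10)
    print(f"{record['question'][:60]}...  Recall@10 = {score:.3f}")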

⚡ Efficiency Tracking

Track latency and costs for LLM-based evaluation:

from tempoeval.efficiency import EfficiencyTracker

tracker = EfficiencyTracker(model_name="gpt-4o")

# ... run your evaluation ...

# Get summary
summary = tracker.summary()
print(f"Total Cost: ${summary['total_cost_usd']:.4f}")
print(f"Avg Latency: {summary['avg_latency_ms']:.1f}ms")

🧪 Testing

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_core.py -v

# Run with coverage
pytest tests/ --cov=tempoeval --cov-report=html

📖 Documentation

Full documentation is available at: https://tempoeval.readthedocs.io/en/latest/


📄 Citation

If you use TempoEval in your research, please cite our paper:

Citation coming soon.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgments


Made with ❤️ for the Temporal IR Community

Star on GitHub
