A framework for evaluating large language models (LLMs) across a variety of tasks. Supports async batch inference, Azure OpenAI, and locally hosted OpenAI-compatible endpoints.
| Task | TaskType | LLM Role | Description |
|---|---|---|---|
| Multiple-Choice QA | QA | Model under test | LLM answers questions directly; evaluates ability to select the correct option |
| Tool Calling | TOOLCALLING | Model under test | LLM performs function calling; supports custom datasets and BFCL format |
| Check Worthiness | CHECKWORTHINESS | Model under test | LLM determines whether a claim is worth fact-checking |
| Claim Evidence Stance | CLAIMEVIDENCESTANCE | Model under test / NLI | LLM (or an external NLI service) classifies the stance of evidence toward a claim: support / refute / irrelevant |
| Summarization Evaluation | SUMMARIZATION | G-Eval judge | LLM acts as a judge and scores summaries; metrics: Relevance / Coherence / Consistency / Fluency |
| Fact Extraction Evaluation | FACTEXTRACTION | G-Eval judge | LLM acts as a judge and scores extracted facts; metrics: Quality / Completeness |
| QA Generation Evaluation | QAGENERATE | G-Eval judge | LLM acts as a judge and scores generated QA pairs; metrics: Relevance / Accuracy / Fluency |
| FAQ Response Evaluation | FAQRESPONSE | G-Eval judge | LLM acts as a judge and scores FAQ responses; metrics: Completeness / Relevance / Coherence / Fluency / Consistency / Actionability / Evidence Use / Conciseness |
| Generate Claims Evaluation | GENERATECLAIMS | G-Eval judge | LLM acts as a judge and scores generated claims; metrics: Factual Accuracy / Coverage / Redundancy |
| Generate Truths Evaluation | GENERATETRUTHS | G-Eval judge | LLM acts as a judge and evaluates the quality of generated truth statements from documents |
| RAG Retrieval Evaluation | RAG | Not required | Uses Embedding + Reranking models only; evaluates hybrid retrieval accuracy (vector search + BM25) |
Requirements: Python >= 3.9

```bash
pip install -e .
```

Copy the example config and fill in your API credentials:

```bash
cp config/example_models.yaml config/models.yaml
```

Example `config/models.yaml`:
```yaml
params:
  default:
    temperature: 0.2
    max_tokens: 1000
    top_p: 1

LLM_engines:
  gpt-4o:
    model: "gpt-4o"
    azure_api_base: "https://your-resource.openai.azure.com/"
    azure_api_key: "your_azure_api_key"
    azure_api_version: "2024-02-01"
  Qwen3-14B:
    model: "Qwen3-14B"
    local_api_key: "Empty"
    local_base_url: "http://localhost:8000/v1"
    translate_to_cht: true  # optional: convert output to Traditional Chinese via OpenCC

embedding_models:
  bge-m3:
    model: "bge-m3"
    local_api_key: "Empty"
    local_base_url: "http://localhost:8001/v1"

reranking_models:
  bge-reranker-large:
    model: "bge-reranker-large"
    local_api_key: "Empty"
    local_base_url: "http://localhost:8002/v1"
```

- `LLM_engines`: If the model name starts with `gpt` (and does not contain `oss`), `AzureOpenAI` is used automatically; otherwise a local OpenAI-compatible endpoint is used.
- `translate_to_cht`: When set to `true`, model output is converted to Traditional Chinese using OpenCC.
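The engine-selection rule can be sketched as a small predicate. This is an illustrative helper written from the description above, not the framework's actual code; in particular, case handling is an assumption:

```python
def uses_azure(model_name: str) -> bool:
    """Illustrative version of the documented routing rule: AzureOpenAI for
    names starting with "gpt" that do not contain "oss"; otherwise a local
    OpenAI-compatible endpoint. Case-insensitive matching is an assumption."""
    name = model_name.lower()
    return name.startswith("gpt") and "oss" not in name

print(uses_azure("gpt-4o"))       # True  -> AzureOpenAI
print(uses_azure("gpt-oss-20b"))  # False -> "oss" in name, local endpoint
print(uses_azure("Qwen3-14B"))    # False -> local endpoint
```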
`EvalPipeline` lets you chain multiple tasks in a single script with a shared `batch_size` and output directory, and automatically generates an execution summary report.

```python
from src.api.async_llm_client import AsyncLLMChat
from src.pipeline import EvalPipeline, PipelineStep
from src.tasks.base_task import TaskConfig, TaskType

llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

pipeline = EvalPipeline(output_folder="output")

pipeline.add(PipelineStep(
    task_type=TaskType.QA,
    dataset_paths="./datasets/qa/example_qa.json",  # single path or a list
    config=TaskConfig(task_type=TaskType.QA, llm=llm, llm_params={...}),
    name="QA-Qwen3-14B",
))

pipeline.add(PipelineStep(
    task_type=TaskType.SUMMARIZATION,
    dataset_paths=[
        "./datasets/summarization/set_a.json",
        "./datasets/summarization/set_b.json",  # multiple datasets run sequentially
    ],
    config=TaskConfig(task_type=TaskType.SUMMARIZATION, llm=llm, llm_params={...}),
    evaluate_kwargs={"probability_normalize": True},
    name="Summarization-Qwen3-14B",
))

results = pipeline.run(batch_size=5)
pipeline.export_report("./output/pipeline_report.json")
```

Full example: `examples/llm_evals/pipeline_example.py`
| Parameter | Type | Description |
|---|---|---|
| `task_type` | `TaskType` | Task type (required) |
| `dataset_paths` | `str \| List[str]` | Dataset path(s); accepts a single string or a list of paths (required) |
| `config` | `TaskConfig` | `TaskConfig` for this task (required) |
| `evaluate_kwargs` | `dict` | Extra arguments forwarded to `async_evaluate()`, e.g. `probability_normalize`, `top_k`, `use_single_prompt` (optional) |
| `name` | `str` | Step label shown in logs and the report (optional; defaults to `task_type.value`) |
The JSON produced by `export_report()`:

```json
{
  "generated_at": "2026-03-16T10:00:00",
  "total_steps": 2,
  "success": 1,
  "partial": 1,
  "failed": 0,
  "total_elapsed_seconds": 120.5,
  "steps": [
    {
      "step": 1,
      "name": "QA-Qwen3-14B",
      "task_type": "qa",
      "dataset_paths": ["./datasets/qa/example_qa.json"],
      "status": "success",
      "total_elapsed_seconds": 45.2,
      "datasets": [
        {
          "dataset_path": "./datasets/qa/example_qa.json",
          "status": "success",
          "elapsed_seconds": 45.2,
          "result_path": "output/qa/result.json",
          "error": null
        }
      ]
    }
  ]
}
```

Step-level `status` values: `success` (all datasets passed), `partial` (some passed), `error` (all failed).
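Because the report is plain JSON, failures are easy to triage with a few lines. This is a sketch against the schema shown above; `failed_datasets` is a hypothetical helper, not part of the framework:

```python
def failed_datasets(report: dict) -> list:
    """Collect every dataset entry whose status is not "success",
    so failures inside "partial" and "error" steps are easy to list."""
    failures = []
    for step in report["steps"]:
        for ds in step["datasets"]:
            if ds["status"] != "success":
                failures.append({
                    "step": step["name"],
                    "dataset": ds["dataset_path"],
                    "error": ds["error"],
                })
    return failures

# In practice you would json.load() the exported report file; a stub here:
report = {"steps": [{"name": "QA-Qwen3-14B", "datasets": [
    {"dataset_path": "./datasets/qa/example_qa.json", "status": "error",
     "elapsed_seconds": 0.0, "result_path": None, "error": "timeout"}]}]}
print(failed_datasets(report))
```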
Each task is instantiated via `TaskConfig` + the corresponding Task class, then run with `asyncio.run()`.

```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.qa_task import QATask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

config = TaskConfig(
    task_type=TaskType.QA,
    llm=async_llm,
    llm_params={'temperature': 0.8, 'max_tokens': 500, 'top_p': 0.8},
    llm_extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

task = QATask(config)
asyncio.run(task.async_evaluate(dataset_path="./datasets/qa/example_qa.json", batch_size=2))
```

```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.summarization_task import SummarizationTask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

config = TaskConfig(
    task_type=TaskType.SUMMARIZATION,
    llm=async_llm,
    llm_params={'temperature': 2, 'max_tokens': 30, 'top_p': 1},
    llm_extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

task = SummarizationTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="./datasets/summarization/sample_summary.json",
    batch_size=2,
    probability_normalize=True
))
```

Metrics: Relevance, Coherence, Consistency, Fluency
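For G-Eval style judges, `probability_normalize` typically means weighting each candidate score by the probability the judge model assigned to its token, rather than keeping the single sampled score. This is my reading of the technique, not a description of this repo's exact implementation:

```python
def geval_weighted_score(score_probs: dict) -> float:
    """G-Eval style expected score: each candidate score (e.g. 1-5) is
    weighted by the probability mass the judge put on its token, then
    normalized by the total mass (the top-k probabilities may not sum to 1)."""
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total

# e.g. judge puts 70% mass on "4" and 30% on "5":
# (4 * 0.7 + 5 * 0.3) / 1.0 = 4.3
print(geval_weighted_score({4: 0.7, 5: 0.3}))
```

The high `temperature` in the summarization config above fits this scheme: it spreads probability mass across score tokens so the expectation is informative.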
```python
import asyncio
from src.api.embedding_rerank_client import EmbeddingModel, RerankingModel
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.rag.rag_task import RAGTask

emb = EmbeddingModel(embedding_model="bge-m3", config_path='./config/models.yaml', use_async=True)
rerank = RerankingModel(reranking_model="bge-reranker-large", config_path='./config/models.yaml', use_async=True)

config = TaskConfig(
    task_type=TaskType.RAG,
    embedding_model=emb,
    reranking_model=rerank,
    custom_params={
        "embedding_dim": 1024,
        "rrf_k": 60,
        "bm25_language": "zh"  # "en" or "zh"
    }
)

task = RAGTask(config)
asyncio.run(task.async_evaluate(dataset_path="./datasets/rag/example_rag.json", batch_size=10, top_k=5))
```

Custom dataset:
```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.tool_calling.tool_calling_task import ToolCallingTask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

config = TaskConfig(
    task_type=TaskType.TOOLCALLING,
    llm=async_llm,
    llm_params={'temperature': 0.8, 'max_tokens': 500, 'top_p': 0.8},
    llm_extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

task = ToolCallingTask(config)
asyncio.run(task.async_evaluate(dataset_path="./datasets/tool_calling/tool_calling_example.json", batch_size=5))
```

BFCL dataset:
```python
from src.tasks.base_task import TaskConfig, TaskType, TaskDatasetType

config = TaskConfig(
    task_type=TaskType.TOOLCALLING,
    dataset_type=TaskDatasetType.BCFL,  # use BFCL format
    llm=async_llm,
    ...
)

task = ToolCallingTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="./bfcl_data/BFCL_v3_exec_parallel_multiple.json",
    batch_size=5,
    folder_name="Qwen3-14B_run1"
))
```

```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.check_worthiness_task import CheckWorthinessTask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="config/models.yaml")

config = TaskConfig(
    task_type=TaskType.CHECKWORTHINESS,
    llm=async_llm,
    llm_params={'temperature': 0, 'max_tokens': 100, 'top_p': 1}
)

task = CheckWorthinessTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="datasets/factcheck/factcheck_claim_checkworthiness.json",
    batch_size=10,
    use_single_prompt=True
))
```

Supports two inference modes — LLM or an external NLI service:
```python
import asyncio
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.claim_evidence_stance_task import ClaimEvidenceStanceTask

# NLI mode
config = TaskConfig(
    task_type=TaskType.CLAIMEVIDENCESTANCE,
    custom_params={
        'use_nli': True,
        'nli_url': "http://your-nli-service/infer/xlm-roberta-large-xnli",
        'nli_timeout': 30.0,
        'nli_threshold': 0.8
    }
)

# LLM mode
# config = TaskConfig(task_type=TaskType.CLAIMEVIDENCESTANCE, llm=async_llm, custom_params={'use_nli': False}, ...)

task = ClaimEvidenceStanceTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="datasets/factcheck/factcheck_claim_evidence_stance.json",
    batch_size=5
))
```

For more examples, see `examples/llm_evals/`.
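One plausible way `nli_threshold` could map XNLI-style label probabilities to the task's stance labels is sketched below. This is an assumption for illustration only; the actual rule lives in `ClaimEvidenceStanceTask`:

```python
def stance_from_nli(probs: dict, threshold: float = 0.8) -> str:
    """Hypothetical mapping from NLI label probabilities to stance labels:
    entailment -> support, contradiction -> refute, neutral -> irrelevant,
    and any prediction below the confidence threshold falls back to irrelevant."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    if p < threshold:
        return "irrelevant"
    return {"entailment": "support",
            "contradiction": "refute",
            "neutral": "irrelevant"}[label]

print(stance_from_nli({"entailment": 0.92, "contradiction": 0.05, "neutral": 0.03}))  # support
print(stance_from_nli({"entailment": 0.50, "contradiction": 0.30, "neutral": 0.20}))  # irrelevant
```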
```
llm-evals/
├── src/
│   ├── api/
│   │   ├── llm_client.py                 # Synchronous LLM client (LLMChat)
│   │   ├── async_llm_client.py           # Async LLM client (AsyncLLMChat)
│   │   └── embedding_rerank_client.py    # Embedding / Reranking clients
│   ├── tasks/
│   │   ├── base_task.py                  # Abstract base class (BaseTask, TaskConfig, TaskType)
│   │   ├── qa_task.py                    # Multiple-choice QA
│   │   ├── summarization_task.py         # Summarization quality evaluation
│   │   ├── fact_extract_task.py          # Fact extraction evaluation
│   │   ├── qa_generate_task.py           # QA generation evaluation
│   │   ├── faq_response_task.py          # FAQ response evaluation
│   │   ├── check_worthiness_task.py      # Check worthiness
│   │   ├── claim_evidence_stance_task.py # Claim evidence stance
│   │   ├── generate_claims.py            # Generate claims evaluation
│   │   ├── generate_truths.py            # Generate truths evaluation
│   │   ├── rag/
│   │   │   └── rag_task.py               # RAG retrieval evaluation
│   │   └── tool_calling/
│   │       └── tool_calling_task.py      # Tool calling evaluation
│   ├── prompts/                          # Prompt templates for each task
│   ├── utils/
│   │   └── tool_process.py               # Tool calling data utilities
│   └── pipeline.py                       # EvalPipeline for multi-task evaluation
├── examples/
│   └── llm_evals/                        # Example scripts for each task
├── datasets/                             # Datasets organized by task
├── bfcl_data/                            # BFCL benchmark datasets
├── config/
│   ├── example_models.yaml               # Config template
│   └── models.yaml                       # Your config (create from template)
└── pyproject.toml
```
| Parameter | Type | Description |
|---|---|---|
| `task_type` | `TaskType` | Task type (required) |
| `llm` | `AsyncLLMChat` | Async LLM client |
| `llm_params` | `dict` | LLM inference parameters (`temperature`, `max_tokens`, etc.) |
| `llm_extra_body` | `dict` | Extra fields forwarded to the API request body (e.g. `enable_thinking`) |
| `embedding_model` | `EmbeddingModel` | Embedding model (used by the RAG task) |
| `reranking_model` | `RerankingModel` | Reranking model (used by the RAG task) |
| `custom_params` | `dict` | Task-specific custom parameters |
| `dataset_type` | `TaskDatasetType` | Dataset format: `DEFAULT` or `BCFL` |
This project is licensed under the MIT License.