A framework for evaluating large language models (LLMs) across a variety of tasks. Supports async batch inference, Azure OpenAI, and locally hosted OpenAI-compatible endpoints.
| Task | TaskType | LLM Role | Description |
|---|---|---|---|
| Multiple-Choice QA | QA | Model under test | LLM answers questions directly; evaluates ability to select the correct option |
| Tool Calling | TOOLCALLING | Model under test | LLM performs function calling; supports custom datasets and BFCL format |
| Check Worthiness | CHECKWORTHINESS | Model under test | LLM determines whether a claim is worth fact-checking |
| Claim Evidence Stance | CLAIMEVIDENCESTANCE | Model under test / NLI | LLM (or an external NLI service) classifies the stance of evidence toward a claim: support / refute / irrelevant |
| Summarization Evaluation | SUMMARIZATION | G-Eval judge | LLM acts as a judge and scores summaries; metrics: Relevance / Coherence / Consistency / Fluency |
| Fact Extraction Evaluation | FACTEXTRACTION | G-Eval judge | LLM acts as a judge and scores extracted facts; metrics: Quality / Completeness |
| QA Generation Evaluation | QAGENERATE | G-Eval judge | LLM acts as a judge and scores generated QA pairs; metrics: Relevance / Accuracy / Fluency |
| FAQ Response Evaluation | FAQRESPONSE | G-Eval judge | LLM acts as a judge and scores FAQ responses; metrics: Completeness / Relevance / Coherence / Fluency / Consistency / Actionability / Evidence Use / Conciseness |
| Generate Claims Evaluation | GENERATECLAIMS | G-Eval judge | LLM acts as a judge and scores generated claims; metrics: Factual Accuracy / Coverage / Redundancy |
| Generate Truths Evaluation | GENERATETRUTHS | G-Eval judge | LLM acts as a judge and evaluates the quality of generated truth statements from documents |
| RAG Retrieval Evaluation | RAG | Not required | Uses Embedding + Reranking models only; evaluates hybrid retrieval accuracy (vector search + BM25) |
Requirements: Python >= 3.9

```bash
pip install -e .
```

Copy the example config and fill in your API credentials:

```bash
cp config/example_models.yaml config/models.yaml
```

Example `config/models.yaml`:
```yaml
params:
  default:
    temperature: 0.2
    max_tokens: 1000
    top_p: 1

LLM_engines:
  gpt-4o:
    model: "gpt-4o"
    azure_api_base: "https://your-resource.openai.azure.com/"
    azure_api_key: "your_azure_api_key"
    azure_api_version: "2024-02-01"
  Qwen3-14B:
    model: "Qwen3-14B"
    local_api_key: "Empty"
    local_base_url: "http://localhost:8000/v1"
    translate_to_cht: true  # optional: convert output to Traditional Chinese via OpenCC

embedding_models:
  bge-m3:
    model: "bge-m3"
    local_api_key: "Empty"
    local_base_url: "http://localhost:8001/v1"

reranking_models:
  bge-reranker-large:
    model: "bge-reranker-large"
    local_api_key: "Empty"
    local_base_url: "http://localhost:8002/v1"
```

- `LLM_engines`: If the model name starts with `gpt` (and does not contain `oss`), `AzureOpenAI` is used automatically; otherwise a local OpenAI-compatible endpoint is used.
- `translate_to_cht`: When set to `true`, model output is converted to Traditional Chinese using OpenCC.
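The engine-selection rule can be sketched as a small predicate. This is an illustrative helper written from the description above, not the framework's actual code; in particular, case handling is an assumption:

```python
def uses_azure(model_name: str) -> bool:
    """Illustrative version of the documented routing rule: AzureOpenAI for
    names starting with "gpt" that do not contain "oss"; otherwise a local
    OpenAI-compatible endpoint. Case-insensitive matching is an assumption."""
    name = model_name.lower()
    return name.startswith("gpt") and "oss" not in name

print(uses_azure("gpt-4o"))       # True  -> AzureOpenAI
print(uses_azure("gpt-oss-20b"))  # False -> "oss" in name, local endpoint
print(uses_azure("Qwen3-14B"))    # False -> local endpoint
```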
`EvalPipeline` lets you chain multiple tasks in a single script with a shared `batch_size` and output directory, and automatically generates an execution summary report.

```python
from src.api.async_llm_client import AsyncLLMChat
from src.pipeline import EvalPipeline, PipelineStep
from src.tasks.base_task import TaskConfig, TaskType

llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

pipeline = EvalPipeline(output_folder="output")

pipeline.add(PipelineStep(
    task_type=TaskType.QA,
    dataset_paths="./datasets/qa/example_qa.json",  # single path or a list
    config=TaskConfig(task_type=TaskType.QA, llm=llm, llm_params={...}),
    name="QA-Qwen3-14B",
))

pipeline.add(PipelineStep(
    task_type=TaskType.SUMMARIZATION,
    dataset_paths=[
        "./datasets/summarization/set_a.json",
        "./datasets/summarization/set_b.json",  # multiple datasets run sequentially
    ],
    config=TaskConfig(task_type=TaskType.SUMMARIZATION, llm=llm, llm_params={...}),
    evaluate_kwargs={"probability_normalize": True},
    name="Summarization-Qwen3-14B",
))

results = pipeline.run(batch_size=5)
pipeline.export_report("./output/pipeline_report.json")
```

Full example: `examples/llm_evals/pipeline_example.py`
| Parameter | Type | Description |
|---|---|---|
| `task_type` | `TaskType` | Task type (required) |
| `dataset_paths` | `str \| List[str]` | Dataset path(s); accepts a single string or a list of paths (required) |
| `config` | `TaskConfig` | `TaskConfig` for this task (required) |
| `evaluate_kwargs` | `dict` | Extra arguments forwarded to `async_evaluate()`, e.g. `probability_normalize`, `top_k`, `use_single_prompt` (optional) |
| `name` | `str` | Step label shown in logs and the report (optional; defaults to `task_type.value`) |
The JSON produced by `export_report()`:

```json
{
  "generated_at": "2026-03-16T10:00:00",
  "total_steps": 2,
  "success": 1,
  "partial": 1,
  "failed": 0,
  "total_elapsed_seconds": 120.5,
  "steps": [
    {
      "step": 1,
      "name": "QA-Qwen3-14B",
      "task_type": "qa",
      "dataset_paths": ["./datasets/qa/example_qa.json"],
      "status": "success",
      "total_elapsed_seconds": 45.2,
      "datasets": [
        {
          "dataset_path": "./datasets/qa/example_qa.json",
          "status": "success",
          "elapsed_seconds": 45.2,
          "result_path": "output/qa/result.json",
          "error": null
        }
      ]
    }
  ]
}
```

Step-level `status` values: `success` (all datasets passed), `partial` (some passed), `error` (all failed).
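Because the report is plain JSON, failures are easy to triage with a few lines. This is a sketch against the schema shown above; `failed_datasets` is a hypothetical helper, not part of the framework:

```python
def failed_datasets(report: dict) -> list:
    """Collect every dataset entry whose status is not "success",
    so failures inside "partial" and "error" steps are easy to list."""
    failures = []
    for step in report["steps"]:
        for ds in step["datasets"]:
            if ds["status"] != "success":
                failures.append({
                    "step": step["name"],
                    "dataset": ds["dataset_path"],
                    "error": ds["error"],
                })
    return failures

# In practice you would json.load() the exported report file; a stub here:
report = {"steps": [{"name": "QA-Qwen3-14B", "datasets": [
    {"dataset_path": "./datasets/qa/example_qa.json", "status": "error",
     "elapsed_seconds": 0.0, "result_path": None, "error": "timeout"}]}]}
print(failed_datasets(report))
```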
Each task is instantiated via `TaskConfig` + the corresponding Task class, then run with `asyncio.run()`.

```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.qa_task import QATask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

config = TaskConfig(
    task_type=TaskType.QA,
    llm=async_llm,
    llm_params={'temperature': 0.8, 'max_tokens': 500, 'top_p': 0.8},
    llm_extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

task = QATask(config)
asyncio.run(task.async_evaluate(dataset_path="./datasets/qa/example_qa.json", batch_size=2))
```

```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.summarization_task import SummarizationTask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

config = TaskConfig(
    task_type=TaskType.SUMMARIZATION,
    llm=async_llm,
    llm_params={'temperature': 2, 'max_tokens': 30, 'top_p': 1},
    llm_extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

task = SummarizationTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="./datasets/summarization/sample_summary.json",
    batch_size=2,
    probability_normalize=True
))
```

Metrics: Relevance, Coherence, Consistency, Fluency
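For G-Eval style judges, `probability_normalize` typically means weighting each candidate score by the probability the judge model assigned to its token, rather than keeping the single sampled score. This is my reading of the technique, not a description of this repo's exact implementation:

```python
def geval_weighted_score(score_probs: dict) -> float:
    """G-Eval style expected score: each candidate score (e.g. 1-5) is
    weighted by the probability mass the judge put on its token, then
    normalized by the total mass (the top-k probabilities may not sum to 1)."""
    total = sum(score_probs.values())
    return sum(score * p for score, p in score_probs.items()) / total

# e.g. judge puts 70% mass on "4" and 30% on "5":
# (4 * 0.7 + 5 * 0.3) / 1.0 = 4.3
print(geval_weighted_score({4: 0.7, 5: 0.3}))
```

The high `temperature` in the summarization config above fits this scheme: it spreads probability mass across score tokens so the expectation is informative.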
```python
import asyncio
from src.api.embedding_rerank_client import EmbeddingModel, RerankingModel
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.rag.rag_task import RAGTask

emb = EmbeddingModel(embedding_model="bge-m3", config_path='./config/models.yaml', use_async=True)
rerank = RerankingModel(reranking_model="bge-reranker-large", config_path='./config/models.yaml', use_async=True)

config = TaskConfig(
    task_type=TaskType.RAG,
    embedding_model=emb,
    reranking_model=rerank,
    custom_params={
        "embedding_dim": 1024,
        "rrf_k": 60,
        "bm25_language": "zh"  # "en" or "zh"
    }
)

task = RAGTask(config)
asyncio.run(task.async_evaluate(dataset_path="./datasets/rag/example_rag.json", batch_size=10, top_k=5))
```

Custom dataset:
```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.tool_calling.tool_calling_task import ToolCallingTask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="./config/models.yaml")

config = TaskConfig(
    task_type=TaskType.TOOLCALLING,
    llm=async_llm,
    llm_params={'temperature': 0.8, 'max_tokens': 500, 'top_p': 0.8},
    llm_extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

task = ToolCallingTask(config)
asyncio.run(task.async_evaluate(dataset_path="./datasets/tool_calling/tool_calling_example.json", batch_size=5))
```

BFCL dataset:
```python
from src.tasks.base_task import TaskConfig, TaskType, TaskDatasetType

config = TaskConfig(
    task_type=TaskType.TOOLCALLING,
    dataset_type=TaskDatasetType.BCFL,  # use BFCL format
    llm=async_llm,
    ...
)

task = ToolCallingTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="./bfcl_data/BFCL_v3_exec_parallel_multiple.json",
    batch_size=5,
    folder_name="Qwen3-14B_run1"
))
```

```python
import asyncio
from src.api.async_llm_client import AsyncLLMChat
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.check_worthiness_task import CheckWorthinessTask

async_llm = AsyncLLMChat(model="Qwen3-14B", config_path="config/models.yaml")

config = TaskConfig(
    task_type=TaskType.CHECKWORTHINESS,
    llm=async_llm,
    llm_params={'temperature': 0, 'max_tokens': 100, 'top_p': 1}
)

task = CheckWorthinessTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="datasets/factcheck/factcheck_claim_checkworthiness.json",
    batch_size=10,
    use_single_prompt=True
))
```

Supports two inference modes — LLM or an external NLI service:
```python
import asyncio
from src.tasks.base_task import TaskConfig, TaskType
from src.tasks.claim_evidence_stance_task import ClaimEvidenceStanceTask

# NLI mode
config = TaskConfig(
    task_type=TaskType.CLAIMEVIDENCESTANCE,
    custom_params={
        'use_nli': True,
        'nli_url': "http://your-nli-service/infer/xlm-roberta-large-xnli",
        'nli_timeout': 30.0,
        'nli_threshold': 0.8
    }
)

# LLM mode
# config = TaskConfig(task_type=TaskType.CLAIMEVIDENCESTANCE, llm=async_llm, custom_params={'use_nli': False}, ...)

task = ClaimEvidenceStanceTask(config)
asyncio.run(task.async_evaluate(
    dataset_path="datasets/factcheck/factcheck_claim_evidence_stance.json",
    batch_size=5
))
```

For more examples, see `examples/llm_evals/`.
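One plausible way `nli_threshold` could map XNLI-style label probabilities to the task's stance labels is sketched below. This is an assumption for illustration only; the actual rule lives in `ClaimEvidenceStanceTask`:

```python
def stance_from_nli(probs: dict, threshold: float = 0.8) -> str:
    """Hypothetical mapping from NLI label probabilities to stance labels:
    entailment -> support, contradiction -> refute, neutral -> irrelevant,
    and any prediction below the confidence threshold falls back to irrelevant."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    if p < threshold:
        return "irrelevant"
    return {"entailment": "support",
            "contradiction": "refute",
            "neutral": "irrelevant"}[label]

print(stance_from_nli({"entailment": 0.92, "contradiction": 0.05, "neutral": 0.03}))  # support
print(stance_from_nli({"entailment": 0.50, "contradiction": 0.30, "neutral": 0.20}))  # irrelevant
```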
```
llm-evals/
├── src/
│   ├── api/
│   │   ├── llm_client.py                 # Synchronous LLM client (LLMChat)
│   │   ├── async_llm_client.py           # Async LLM client (AsyncLLMChat)
│   │   └── embedding_rerank_client.py    # Embedding / Reranking clients
│   ├── tasks/
│   │   ├── base_task.py                  # Abstract base class (BaseTask, TaskConfig, TaskType)
│   │   ├── qa_task.py                    # Multiple-choice QA
│   │   ├── summarization_task.py         # Summarization quality evaluation
│   │   ├── fact_extract_task.py          # Fact extraction evaluation
│   │   ├── qa_generate_task.py           # QA generation evaluation
│   │   ├── faq_response_task.py          # FAQ response evaluation
│   │   ├── check_worthiness_task.py      # Check worthiness
│   │   ├── claim_evidence_stance_task.py # Claim evidence stance
│   │   ├── generate_claims.py            # Generate claims evaluation
│   │   ├── generate_truths.py            # Generate truths evaluation
│   │   ├── rag/
│   │   │   └── rag_task.py               # RAG retrieval evaluation
│   │   └── tool_calling/
│   │       └── tool_calling_task.py      # Tool calling evaluation
│   ├── prompts/                          # Prompt templates for each task
│   ├── utils/
│   │   └── tool_process.py               # Tool calling data utilities
│   └── pipeline.py                       # EvalPipeline for multi-task evaluation
├── examples/
│   └── llm_evals/                        # Example scripts for each task
├── datasets/                             # Datasets organized by task
├── bfcl_data/                            # BFCL benchmark datasets
├── config/
│   ├── example_models.yaml               # Config template
│   └── models.yaml                       # Your config (create from template)
└── pyproject.toml
```
| Parameter | Type | Description |
|---|---|---|
| `task_type` | `TaskType` | Task type (required) |
| `llm` | `AsyncLLMChat` | Async LLM client |
| `llm_params` | `dict` | LLM inference parameters (`temperature`, `max_tokens`, etc.) |
| `llm_extra_body` | `dict` | Extra fields forwarded to the API request body (e.g. `enable_thinking`) |
| `embedding_model` | `EmbeddingModel` | Embedding model (used by the RAG task) |
| `reranking_model` | `RerankingModel` | Reranking model (used by the RAG task) |
| `custom_params` | `dict` | Task-specific custom parameters |
| `dataset_type` | `TaskDatasetType` | Dataset format: `DEFAULT` or `BCFL` |
This project is licensed under the MIT License.