A framework for evaluating LLMs with support for multiple metrics, datasets, and models.
uv pip install "git+https://github.com/serval-uni-lu/llm_eval"

from llm_eval_framework.llm import LLM
llm = LLM('google/gemma-3-4b-it')
output = llm.generate("Your prompt here", temperature=0.7, top_p=0.9)
print(output.content)
llm.unload()

Datasets require three files: data.parquet, metadata.json, and prompt.yaml.
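
Before loading a dataset, it can help to confirm that the three files named above are actually present. The following is a minimal sketch using only the standard library; the helper `missing_dataset_files` is hypothetical and not part of llm_eval_framework.

```python
from pathlib import Path

# File names come from the dataset requirement stated above.
# The helper itself is a hypothetical illustration, not a framework API.
REQUIRED_FILES = ("data.parquet", "metadata.json", "prompt.yaml")


def missing_dataset_files(dataset_dir):
    """Return the required file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory has all three files; otherwise the list names what is missing.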
from llm_eval_framework.dataset import Dataset
dataset = Dataset.from_path('path/to/dataset')
print(dataset.prompts[0])
print(dataset.answers[0])

Supports heuristic metrics (is_json, contains, bleu, rouge, etc.) and LLM-judge metrics (answer_correctness, bias, safety, etc.).
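
To illustrate what a heuristic metric does, here is a rough sketch of two such checks in plain Python. This is not the framework's implementation: the function names, the 0.0/1.0 scoring scale, and the case-insensitive matching are all assumptions made for the example.

```python
import json


def is_json_score(output: str) -> float:
    """Return 1.0 if output parses as JSON, else 0.0 (assumed scale)."""
    try:
        json.loads(output)
    except (TypeError, ValueError):  # JSONDecodeError subclasses ValueError
        return 0.0
    return 1.0


def contains_score(output: str, reference: str) -> float:
    """Return 1.0 if reference occurs in output, ignoring case (assumed)."""
    return 1.0 if reference.lower() in output.lower() else 0.0
```

Checks like these are deterministic and need no model, which is the practical difference from LLM-judge metrics.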
from llm_eval_framework.llm import LLM
from llm_eval_framework.metrics import Metric
llm = LLM('meta-llama/Llama-3.2-3B-Instruct')
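# LLM-judge metric: scored by the judge model passed as the last argument
correctness = Metric("answer_correctness")
score = correctness.score(input, output, reference, llm)
# Heuristic metric: scored from the output alone
is_json = Metric("is_json")
score = is_json.score(output)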
llm.unload()

Configure evaluations via YAML:
name: experiment_name
models:
  - name: Qwen/Qwen3-4B-Instruct-2507
    sampling_params:
      temperature: 0.7
datasets:
  - name: financebench
metrics:
  - answer_correctness
judge_model:
  name: google/gemma-3-4b-it
  sampling_params:
    temperature: 0.4

Run evaluation:
from llm_eval_framework.evaluation import EvaluationConfig, run_evaluation
config = EvaluationConfig.from_yaml("eval_config.yaml")
results = run_evaluation(config)

Parse PDFs to markdown using Docling:
from llm_eval_framework.parser import Parser
parser = Parser()
markdown_files = parser.parse('input_dir', 'output_dir')

from llm_eval_framework.chunker import Chunker
chunker = Chunker()
chunks = chunker.chunk(text)

from llm_eval_framework.visualization import save_results_plot
save_results_plot(output_dir='path/to/results', save_path='plot.png')