Welcome to the YourBench FAQ! This document aims to answer common questions about what YourBench is, how it works, and how you can use it to generate dynamic, document-grounded evaluation sets for Large Language Models. Below, you'll find practical information on installation, configuration, usage, and more.
YourBench is an open-source framework designed to generate new, domain-specific or up-to-date benchmarks for evaluating Large Language Models (LLMs). Instead of relying on static (and often outdated or contaminated) benchmarks, YourBench takes your custom documents—such as company reports, specialized academic texts, or newly published web content—and automatically creates fresh question-answer pairs to test models on relevant content. By doing so, it helps you:
- Avoid contamination with older, widely used datasets.
- Generate relevant evaluations for specialized or emerging topics.
- Automatically produce large volumes of questions with minimal human labor or cost.
YourBench also supports advanced pipeline stages like summarization, chunking, multi-hop question generation, citation filtering, and more, as detailed below.
YourBench’s pipeline typically consists of these stages (in a default order):
- Ingestion: Converts raw documents (PDF, MD, HTML, DOCX, etc.) into a normalized Markdown/text format.
- Upload/Save: Optionally packages ingested documents into a Hugging Face Dataset and pushes it to the Hub (or saves locally). Each stage can save its intermediate output either locally or to the Hugging Face Hub (or both), depending on your hf_configuration. This ensures that downstream stages can reliably load subsets across runs.
  - Local saving is controlled by local_saving and local_dataset_dir.
  - Remote saving (Hub) is controlled by push_to_hub and credentials (token + dataset name).
- Summarization: Creates short summaries of each document or chunk to provide global context.
- Chunking: Splits documents into manageable single-hop segments (e.g., by tokens or semantically) and optionally creates multi-hop groupings.
- Single-Shot Question Generation: Generates questions from individual chunk(s) plus the global summary. These are usually simpler, fact-based questions.
- Multi-Hop Question Generation: Generates more complex, integrative questions by referencing multiple chunks. Both the single-shot and multi-hop stages generate questions, but the questions they produce differ considerably in nature.
- LightEval: Assembles those questions into a final “evaluation dataset” that includes question text, ground truth, citations, and relevant chunk text.
- Citation Score Filtering: Performs fuzzy string matching to gauge how well each question’s citations match the source text, optionally filtering low-scoring items.
You can control which stages run by toggling them in your config (e.g., pipeline.ingestion.run: true or false).
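As a rough illustration of how the run flags select stages, the sketch below filters an ordered stage list against a config dictionary. The stage keys in STAGE_ORDER and the enabled_stages helper are assumptions made for this example (only some of these names appear in this FAQ), and treating a missing stage as disabled is also an assumption. It is not YourBench's actual config loader.

```python
# Hypothetical stage keys, listed in the default order described above.
STAGE_ORDER = [
    "ingestion",
    "upload_ingest_to_hub",
    "summarization",
    "chunking",
    "single_shot_question_generation",
    "multi_hop_question_generation",
    "lighteval",
    "citation_score_filtering",
]


def enabled_stages(config: dict) -> list[str]:
    """Return the stages whose `run` flag is true, in pipeline order."""
    pipeline = config.get("pipeline", {})
    return [
        name
        for name in STAGE_ORDER
        # Assumption: a stage that is absent from the config is treated as off.
        if pipeline.get(name, {}).get("run", False)
    ]


example_config = {
    "pipeline": {
        "ingestion": {"run": True},
        "summarization": {"run": True},
        "chunking": {"run": False},
    }
}
print(enabled_stages(example_config))
```

Keeping the order in one list means the config only has to say which stages run, not in what sequence.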
- Clone the Repository:

      git clone https://github.com/huggingface/yourbench.git
      cd yourbench

- Install Dependencies: We recommend using a virtual environment. Then install with:

      pip install -r requirements.txt

  (If you plan to do semantic chunking or advanced tasks, you’ll need PyTorch, Transformers, etc. as indicated in the repository docs.)
- Configure Your Environment Variables: Set required keys in .env or your shell. At minimum:

      OPENAI_API_KEY=...
      OPENAI_BASE_URL=https://api.openai.com/v1  # or your provider
      HF_TOKEN=...                               # needed to push datasets
      HF_ORGANIZATION=...                        # optional; auto-detected if HF_TOKEN is set

- Prepare a Configuration File: Start from example/default_example/config.yaml (shipped with the repo) or the minimal example in docs/CONFIGURATION.md. Point ingestion.source_documents_dir at your documents and list your models in model_list.
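Before a first run, it can help to confirm that the environment variables are actually visible to Python. The following is a minimal sketch, not part of YourBench: the missing_env_vars helper is hypothetical, and treating HF_ORGANIZATION as the only optional variable follows the comments in the list above.

```python
import os

# Names taken from the environment step above. HF_ORGANIZATION is left out
# because it is described as optional.
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENAI_BASE_URL", "HF_TOKEN"]


def missing_env_vars(environ=None) -> list[str]:
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Note that this only reads the process environment. If you keep keys in a .env file, they must be loaded into the environment before the check sees them.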
- Create or Edit a Config File:
  - Check out example/configs/simple_example.yaml (the minimal version).
  - Update source_documents_dir to the folder containing your raw data.
  - Optionally specify which model(s) you want for each pipeline stage in model_list and model_roles.
- Call the YourBench CLI: From the repo, using uv (recommended):

      uvx --from yourbench yourbench run path/to/your_config.yaml --debug

  Or, if installed locally:

      yourbench run path/to/your_config.yaml --debug

- View the Outputs:
  - By default, intermediate datasets are stored on Hugging Face Hub (if configured) and/or locally, named according to your hf_configuration.
  - Logs (errors, pipeline progress, etc.) are written to the logs/ folder.
YourBench is flexible. Typically:
- Ingestion converts each raw file (PDF, MD, HTML, DOCX, etc.) into standardized Markdown.
- If your documents are already in plain text or Markdown, just place them in a folder and point ingestion.source_documents_dir there.
Multi-document ingestion is handled automatically: each file becomes a separate “document” entry in the resulting dataset.
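To make the "one file, one document entry" idea concrete, here is a small sketch that walks a source folder and builds one record per file. The list_documents helper and the document_id and source_path field names are hypothetical, and scanning subfolders recursively is an assumption. The real ingestion stage also converts each file to Markdown, which this sketch does not attempt.

```python
from pathlib import Path
import tempfile


def list_documents(source_dir: str) -> list[dict]:
    """Build one entry per file found under source_dir, sorted by path."""
    entries = []
    for path in sorted(Path(source_dir).rglob("*")):
        if path.is_file():
            # Field names are illustrative, not YourBench's dataset schema.
            entries.append({"document_id": path.stem, "source_path": str(path)})
    return entries


# Demonstration with a throwaway directory holding two small files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "report.md").write_text("# Report\n", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("Some notes.\n", encoding="utf-8")
    docs = list_documents(tmp)
    print([d["document_id"] for d in docs])
```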
Absolutely. In your config’s model_list, define multiple models. For example:
    model_list:
      - model_name: gpt-4.1
        base_url: https://api.openai.com/v1
        api_key: $OPENAI_API_KEY
      - model_name: Qwen/Qwen3-30B-A3B
        provider: fireworks-ai

Then in model_roles, assign which model(s) perform each stage:

    model_roles:
      ingestion:
        - Qwen/Qwen3-30B-A3B
      summarization:
        - gpt-4.1
        - Qwen/Qwen3-30B-A3B
      single_hop_question_generation:
        - gpt-4.1

YourBench will run inference calls in parallel for each model assigned.
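A common mistake with this kind of setup is assigning a role to a model that was never defined. The sketch below checks a parsed config for that case. It assumes the config has already been loaded into a plain dictionary with the model_list and model_roles keys shown above. The undefined_role_models helper is hypothetical, and the "not-a-defined-model" entry is deliberately wrong to show what gets reported.

```python
def undefined_role_models(config: dict) -> dict[str, list[str]]:
    """Map each stage to the role models that are absent from model_list."""
    known = {entry["model_name"] for entry in config.get("model_list", [])}
    problems = {}
    for stage, models in config.get("model_roles", {}).items():
        unknown = [name for name in models if name not in known]
        if unknown:
            problems[stage] = unknown
    return problems


config = {
    "model_list": [
        {"model_name": "gpt-4.1"},
        {"model_name": "Qwen/Qwen3-30B-A3B"},
    ],
    "model_roles": {
        "ingestion": ["Qwen/Qwen3-30B-A3B"],
        "summarization": ["gpt-4.1", "Qwen/Qwen3-30B-A3B"],
        # Deliberate typo-style entry to demonstrate the check.
        "single_hop_question_generation": ["gpt-4.1", "not-a-defined-model"],
    },
}
print(undefined_role_models(config))
```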
- Ensure the pipeline’s multi_hop_question_generation stage is set to run: true.
- Make sure your chunking stage is also on and includes multihop_chunks, or define multi-hop chunking parameters in your config:

      pipeline:
        chunking:
          chunking_configuration:
            chunking_mode: semantic_chunking
            h_min: 2
            h_max: 5
            num_multihops_factor: 5
        multi_hop_question_generation:
          run: true
          # additional instructions or chunk sampling, etc.

- YourBench will then sample multi-chunk sets and call your chosen model(s) to produce questions requiring multiple pieces of context.
After generating questions, YourBench can verify if each question is grounded in its source chunk(s) by fuzzy matching. The pipeline stage citation_score_filtering:
- Compares the alleged citations to the actual chunk text (and optionally the ground-truth answer).
- Computes a “citation_score” by measuring string overlap (using partial ratio from thefuzz).
- Lets you filter or rank questions by how strongly they’re anchored in the original text.
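The fuzzy-matching idea can be sketched with only the standard library. The code below is neither thefuzz nor YourBench's implementation: it approximates a partial-ratio score with difflib.SequenceMatcher by sliding the shorter string across the longer one. The citation_score helper, which simply averages the per-citation scores, is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def partial_ratio(needle: str, haystack: str) -> float:
    """Best similarity (0-100) between needle and any same-length window of haystack."""
    if not needle or not haystack:
        return 0.0
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    window = len(needle)
    best = 0.0
    for start in range(len(haystack) - window + 1):
        candidate = haystack[start:start + window]
        score = SequenceMatcher(None, needle, candidate, autojunk=False).ratio()
        best = max(best, score)
        if best == 1.0:
            break  # an exact match cannot be improved on
    return 100.0 * best


def citation_score(citations: list[str], chunk_text: str) -> float:
    """Average the partial-ratio score of each citation against the chunk."""
    if not citations:
        return 0.0
    return sum(partial_ratio(c, chunk_text) for c in citations) / len(citations)


chunk = "YourBench generates question-answer pairs from your documents."
print(citation_score(["question-answer pairs"], chunk))
print(citation_score(["bananas are yellow"], chunk))
```

A citation copied verbatim from the chunk scores 100, while unrelated text scores lower, so a threshold on this value separates grounded citations from invented ones. The brute-force window scan is slow on long texts and is only meant to show the principle.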
Yes! In the paper, we demonstrate replicating the style and relative difficulty of MMLU subsets:
- Collect a few relevant documents for each subject domain (e.g., a handful of Wikipedia articles).
- Run the pipeline to generate multiple-choice questions.
- Evaluate your LLMs on these newly generated sets.
The results strongly correlated with the original MMLU in ranking models, but the newly generated questions are “harder” and are contamination-resistant. Just be sure to adapt your prompt instructions so that the question generation yields multiple-choice style Q&A.
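One way to check how closely two benchmarks agree on model ranking is Spearman's rank correlation. The sketch below is an illustration only: the scores are made-up placeholders rather than results from the paper, the model names are hypothetical, and the paper may rely on a different statistic. The simple formula used here also assumes there are no tied scores.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models from best (1) to worst; assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def spearman_rho(a: dict[str, float], b: dict[str, float]) -> float:
    """Spearman's rank correlation for two score tables over the same models."""
    if set(a) != set(b):
        raise ValueError("both tables must cover the same models")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two models")
    rank_a, rank_b = ranks(a), ranks(b)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Placeholder accuracies, invented purely to exercise the function.
original_scores = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60, "model_d": 0.50}
generated_scores = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.58, "model_d": 0.40}
print(spearman_rho(original_scores, generated_scores))
```

In this placeholder data the absolute scores drop on the generated set, yet only two neighbouring models swap places, which is the pattern a high rank correlation captures.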
- After single-shot or multi-hop generation, your “raw” question datasets appear under subset names like single_hop_questions and multi_hop_questions.
- The pipeline’s lighteval stage merges them into a single dataset called lighteval, containing columns like question, ground_truth_answer, citations, and the associated chunk(s).
- By default, these subsets are saved on the HF Hub (under your designated dataset name) and/or locally, depending on your config.
Not necessarily. In your config’s hf_configuration, you can disable or enable pushing:
    hf_configuration:
      local_saving: true                     # Enables saving to disk
      local_dataset_dir: ./results/datasets  # Where datasets are saved locally
      push_to_hub: true                      # Optional: also push each stage result to the Hub
      concat_if_exist: false                 # Whether to merge with existing datasets
      # private: true                        # Whether Hub datasets should be private

You can set local_dataset_dir (under hf_configuration) to a path and store your resulting datasets entirely locally — as long as local_saving: true is also set. Alternatively, you can enable both local saving and Hub pushing. The pipeline is flexible to your preference.
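The interaction between these flags can be summarized in a few lines. This is a sketch of the decision described above, not YourBench's code: the save_targets helper is hypothetical, and treating a missing flag as false is an assumption rather than a documented default.

```python
def save_targets(hf_configuration: dict) -> list[str]:
    """List where a stage's output would be written under this sketch's assumptions."""
    targets = []
    local_dir = hf_configuration.get("local_dataset_dir")
    # Local saving needs both the flag and a directory to write to.
    if hf_configuration.get("local_saving", False) and local_dir:
        targets.append(f"local:{local_dir}")
    if hf_configuration.get("push_to_hub", False):
        targets.append("hub")
    return targets


local_only = {"local_saving": True, "local_dataset_dir": "./results/datasets", "push_to_hub": False}
both = {"local_saving": True, "local_dataset_dir": "./results/datasets", "push_to_hub": True}
print(save_targets(local_only))
print(save_targets(both))
```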
Each pipeline stage saves its result using custom_save_dataset(). The behavior depends on both:
- The config file, especially hf_configuration.local_saving and local_dataset_dir.
- The per-stage logic, which calls:

      hf_settings = get_hf_settings(config)
      custom_save_dataset(
          dataset=dataset,
          config=config,
          subset="stage_name",  # e.g., "summarized", "chunked"
          save_local=hf_settings.local_saving,
          push_to_hub=True,
      )

This ensures datasets are:
- Persisted between stages, even across different runs.
- Reloadable by exact subset name (e.g., "chunked"), preventing missing subset errors.
For large documents, the pipeline automatically:
- Splits (chunking) by token-based thresholds or semantic boundaries.
- Summarizes each chunk to keep context windows from overflowing your model’s max context length.
- Optionally merges chunk-level summaries into a single short “document_summary.”
Because chunking is crucial for big inputs, carefully tune the chunking config (e.g., l_max_tokens, overlap, or semantic threshold) to ensure coverage without overloading your model.
Adjust the pipeline config to keep chunk sizes within that limit. For example:
    pipeline:
      summarization:
        max_tokens: 16384
      chunking:
        chunking_configuration:
          chunking_mode: fast_chunking
          l_max_tokens: 128  # or 1024 or 4096, depending on your model
          token_overlap: 128

These parameters let you manage how aggressively we split large documents and how much overlap we maintain between splits.
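To show how a maximum chunk size and an overlap interact, here is a minimal sliding-window sketch. It is not the fast_chunking implementation: the chunk_tokens helper is hypothetical, whitespace splitting stands in for a real tokenizer, and the sketch requires the overlap to be strictly smaller than the chunk size so that each window advances. How YourBench treats an overlap equal to the chunk size is not covered here.

```python
def chunk_tokens(tokens: list[str], l_max_tokens: int, token_overlap: int) -> list[list[str]]:
    """Split tokens into windows of at most l_max_tokens that share token_overlap tokens."""
    if l_max_tokens <= 0:
        raise ValueError("l_max_tokens must be positive")
    if not 0 <= token_overlap < l_max_tokens:
        raise ValueError("token_overlap must be in [0, l_max_tokens)")
    step = l_max_tokens - token_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + l_max_tokens])
        if start + l_max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
tokens = "a b c d e f g h i j".split()
for chunk in chunk_tokens(tokens, l_max_tokens=4, token_overlap=1):
    print(chunk)
```

A larger overlap preserves more context across chunk boundaries but produces more chunks, and therefore more model calls, for the same document.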
No, the pipeline itself is language-agnostic. If your model supports a given language, YourBench can ingest and generate questions for that language. For chunking in semantic mode, ensure you select a suitable multilingual embedding model (e.g., intfloat/multilingual-e5-large-instruct) in the config.
- Subset your data using the chunk_sampling config to generate fewer questions.
- Reduce multi-model usage if you only need a single model for question generation.
- Use smaller language models for some stages (like summarization or ingestion) while using larger ones only for question generation.
- Lower num_multihops_factor in your chunking configuration to limit the number of multi-chunk combinations.
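The effect of sampling chunks before question generation can be sketched as follows. The sample_chunks helper and its num_samples and seed parameters are hypothetical and do not mirror the actual chunk_sampling options. The point is only that a fixed-size random subset caps the number of generation calls regardless of document length.

```python
import random


def sample_chunks(chunks: list[str], num_samples: int, seed: int = 0) -> list[str]:
    """Pick up to num_samples chunks uniformly at random, reproducibly."""
    rng = random.Random(seed)  # a local generator keeps runs repeatable
    k = min(num_samples, len(chunks))
    return rng.sample(chunks, k)


all_chunks = [f"chunk-{i}" for i in range(20)]
subset = sample_chunks(all_chunks, num_samples=5, seed=42)
print(len(all_chunks), "chunks reduced to", len(subset))
```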
- The Paper provides a conceptual overview, demonstration, and thorough validation results.
- Each pipeline stage’s code is in yourbench/pipeline/.
- Utility modules (e.g., for inference concurrency, chunking, dataset management) are in yourbench/utils/.
- The top-level CLI is in yourbench/main.py.
We welcome feedback, feature requests, and bug reports! Feel free to:
- Open an issue on our GitHub repository.
- Submit a pull request if you have improvements or new features to propose.
YourBench can automate large-scale question generation and potentially replace some annotation tasks, which raises labor considerations. Additionally, if your LLM is biased or inaccurate, those biases can propagate into the generated benchmarks. It’s crucial to:
- Evaluate the outputs with human oversight.
- Use filtering steps (e.g., citation_score_filtering) or human review to catch low-quality or biased content.
- Be transparent about how these synthetic benchmarks are created.
Consider trying these advanced workflows:
- Creating Domain-Specific Benchmarks: Provide proprietary or niche documents (e.g., medical guidelines, legal briefs) to assess your model’s real-world domain knowledge.
- Temporal Evaluations: Use newly published documents (like the Tempora-0325 set from the paper) to see if your model can handle post-training knowledge.
- Multi-hop Reasoning: If your domain’s content requires integrative questions, ensure multi-hop chunk generation is enabled.
Happy benchmarking, and we hope YourBench transforms how you generate and evaluate custom LLM benchmarks!
If you have other questions, please open an Issue or check the repository’s README for the most up-to-date information.