Optimize Evaluation Workflow for Better Batching and Model Reuse For benchmarks with n_repeat > 1 by ihebchaa · Pull Request #125 · mlfoundations/evalchemy

ihebchaa · 2025-05-27T10:23:31Z

The current Evalchemy evaluation workflow for benchmarks with n_repeat > 1 is highly inefficient:

# Current inefficient approach
for i in range(n_repeat):
    evaluate()  # Reloads model every iteration

Key inefficiencies

Model reloading overhead: Models are loaded and unloaded n_repeat times, wasting significant time on repeated GPU memory allocation and release
Poor batching: Small benchmarks like AIME24/AIME25 (30 samples) result in tiny batches that underutilize GPU resources

Solution

Restructure the evaluation workflow to load the model once and batch across all repeats.

Key Improvements

Single model loading: Load model once for all evaluation repeats
Enhanced batching: Combine samples across repeats for larger, more efficient batches
Memory efficiency: Eliminate repeated model loading/unloading cycles
Better GPU utilization: Larger batches maximize hardware throughput
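
The improvements above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: Request, build_requests, evaluate_batched, and generate_fn are hypothetical names, and generate_fn stands in for whatever call sends a batch to an already loaded model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Request:
    """One prompt for one repeat (hypothetical structure, for illustration)."""
    prompt: str
    repeat_idx: int
    seed: int


def build_requests(prompts: Sequence[str], n_repeat: int, base_seed: int = 0) -> List[Request]:
    # Expand every prompt once per repeat, giving each repeat its own seed.
    return [
        Request(prompt=p, repeat_idx=r, seed=base_seed + r)
        for r in range(n_repeat)
        for p in prompts
    ]


def evaluate_batched(
    prompts: Sequence[str],
    n_repeat: int,
    generate_fn: Callable[[List[Request]], List[str]],
) -> List[List[str]]:
    # The model behind generate_fn is assumed to be loaded once, before this call.
    requests = build_requests(prompts, n_repeat)
    outputs = generate_fn(requests)  # one call covering all repeats

    # Regroup the flat outputs so each repeat can be scored separately.
    per_repeat: List[List[str]] = [[] for _ in range(n_repeat)]
    for request, output in zip(requests, outputs):
        per_repeat[request.repeat_idx].append(output)
    return per_repeat
```

With this shape, the generation call is made once regardless of n_repeat, and the per-repeat grouping is restored afterwards for scoring.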

Speedup

Tests on a 7B reasoning model using AIME24 with n_repeat=8, max_new_tokens=32k, and batch_size set to n_repeat * num_samples (i.e., total samples, so that vLLM processes all instances at once and handles batching) show nearly an 8× speedup.
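
For reference, the batch size used in that test works out as below. The helper name total_batch_size is hypothetical; the numbers come from the description above (AIME24 has 30 samples, n_repeat=8).

```python
def total_batch_size(num_samples: int, n_repeat: int) -> int:
    # One request per (sample, repeat) pair, all submitted together.
    return n_repeat * num_samples


# AIME24: 30 samples, repeated 8 times.
batch_size = total_batch_size(num_samples=30, n_repeat=8)
print(batch_size)  # 240
```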

slimfrkha · 2025-06-02T11:00:32Z

In the case of DP > 1:
Because the seed differs for each repeat, the batch is split into chunks (see the collator in task.py), and each chunk is generated in a for loop after being split across the DP ranks.
The problem here is that the DP models are loaded again for each chunk:

for chunk in chunks:
    self.generate(chunk)  # -> load DP models for each chunk

neginraoof · 2025-06-05T09:14:54Z

Thanks a lot @ihebchaa for the PR! Overall looks good to me, I'll look into testing and merging this.
@slimfrkha for DP > 1, do you think using a custom collator that doesn't chunk by seed would help?
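
To illustrate the chunking question, here is a small sketch of grouping requests by their generation kwargs, with and without the seed in the grouping key. It is an assumption-laden toy, not the collator from task.py or lm-evaluation-harness: chunk_requests and ignore_keys are hypothetical names, and requests are modeled as plain (prompt, gen_kwargs) tuples.

```python
from itertools import groupby


def chunk_requests(requests, ignore_keys=()):
    """Group (prompt, gen_kwargs) pairs with matching kwargs, skipping ignore_keys."""
    def key(request):
        _, gen_kwargs = request
        # repr() keeps the key sortable even if values have mixed types.
        return tuple(
            sorted((k, repr(v)) for k, v in gen_kwargs.items() if k not in ignore_keys)
        )

    # groupby only merges adjacent items, so sort by the same key first.
    ordered = sorted(requests, key=key)
    return [list(group) for _, group in groupby(ordered, key=key)]


# 30 prompts repeated 8 times, each repeat with its own seed.
requests = [
    (f"problem-{i}", {"seed": r, "max_new_tokens": 32768})
    for r in range(8)
    for i in range(30)
]

by_seed = chunk_requests(requests)                         # seed is part of the key
merged = chunk_requests(requests, ignore_keys=("seed",))   # seed is ignored
print(len(by_seed), len(merged))  # prints: 8 1
```

When the seed is part of the key, every repeat becomes its own chunk, so a per-chunk generate call runs n_repeat times. Dropping the seed from the key yields a single chunk. Whether that is acceptable depends on how the backend handles per-request seeds, which this sketch does not model.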

slimfrkha · 2025-06-05T11:06:22Z

Yes, I think that is the way to go. I'm planning to open a PR to change vllm_causallms.py.

update generate responses

49008c3

RyanMarten requested a review from neginraoof May 27, 2025 14:20

slimfrkha mentioned this pull request May 30, 2025

Enable CUDA Graphs with vLLM Data Parallel EleutherAI/lm-evaluation-harness#3020

Open

slimfrkha mentioned this pull request Jun 9, 2025

Ignore seed when splitting batch in chunks with groupby EleutherAI/lm-evaluation-harness#3047

Merged

ihebchaa wants to merge 1 commit into mlfoundations:main from ihebchaa:feat/optimize-eval-workflow


3 participants
