refactor: improve exactness score calculation in evaluation by alrocar · Pull Request #56 · tinybirdco/llm-benchmark

alrocar · 2025-08-05T18:28:14Z

Note

Replaces summary-based exactness with per-question averaging plus a capped exact-match bonus, adding validations and rounding.

Evaluation:
- blendedExactnessScore:
  - Computes average of per-question getExactnessScore across all validation questions instead of using summary distances.
  - Treats missing/failed queries as 0; applies a capped exact-match bonus without exceeding 100; rounds final score.
  - Adds robust validation/guardrails for missing model stats, invalid fields, zero questions, and non-finite results.
- Docs: Adds JSDoc comments for blendedExactnessScore and blendScore.
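
The steps summarized above could be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the function name `blendPerQuestionScores`, the input shape, and the 10-point `bonusWeight` are assumptions made for the example. Only the overall flow (average per-question scores, count missing entries as 0, add a capped bonus, round) comes from the summary.

```typescript
// Hypothetical sketch; names and the bonus weight are assumptions, not the PR's code.
function blendPerQuestionScores(
  scores: Array<number | undefined>, // per-question scores on a 0-100 scale; undefined = missing or failed
  exactMatches: number,
  bonusWeight = 10, // assumed maximum bonus, in points
): number {
  // Guardrail: with zero questions there is nothing to average.
  if (scores.length === 0) return 0;

  // Missing or non-finite entries count as 0.
  const total = scores.reduce<number>(
    (sum, s) => sum + (typeof s === "number" && Number.isFinite(s) ? s : 0),
    0,
  );
  const average = total / scores.length;

  // The bonus scales with the exact-match rate, clamped to [0, 1].
  const rate = Math.min(Math.max(exactMatches / scores.length, 0), 1);

  // Cap the blended score at 100, then round.
  const blended = Math.min(average + rate * bonusWeight, 100);
  return Number.isFinite(blended) ? Math.round(blended) : 0;
}
```

Under these assumptions, `blendPerQuestionScores([100, 100, undefined, 0], 2)` averages to 50, adds a 5-point bonus, and returns 55.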

Written by Cursor Bugbot for commit c9ab549. This will update automatically on new commits.

vercel · 2025-08-05T18:28:19Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
llm-benchmark	Ready	Preview	Comment	Sep 29, 2025 5:22pm


cursor · 2025-09-29T17:26:45Z

+    : 0;
+
+  // Apply exact match bonus (safe division)
+  const exactMatchRate = exactMatches / validationSummaries.totalQuestions;


Bug: Mismatched Counts Skew Exact Match Rate

The blendedExactnessScore function calculates the exact match rate using validationSummaries.totalQuestions, but processes individual scores based on questionKeys. If these counts don't align, the exact match rate will be inaccurate, affecting the final blended score.
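
One way to avoid the mismatch described above is to derive the exact-match rate from the same question count that the averaging loop uses. The sketch below is only illustrative: `exactMatchRate` and its parameters are hypothetical names, and whether `questionKeys.length` or `validationSummaries.totalQuestions` is the correct denominator remains the author's call.

```typescript
// Hypothetical helper: compute the rate from the count that was actually iterated.
function exactMatchRate(exactMatches: number, questionCount: number): number {
  // Guard against missing, non-finite, or zero counts (safe division).
  if (
    !Number.isFinite(exactMatches) ||
    !Number.isFinite(questionCount) ||
    questionCount <= 0
  ) {
    return 0;
  }
  // Clamp so an inconsistent exactMatches value cannot push the rate past 1.
  return Math.min(Math.max(exactMatches / questionCount, 0), 1);
}

// Example with made-up question keys: both the average and the rate
// would use exampleKeys.length as the denominator.
const exampleKeys = ["q1", "q2", "q3", "q4"];
const exampleRate = exactMatchRate(2, exampleKeys.length);
```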

vercel · 2025-09-29T17:44:25Z

+
+  for (const question of questionKeys) {
+    const individualScore = getExactnessScore(provider, model, question);
+    individualScores.push(individualScore);
+  }
+


The new implementation includes 0 scores for questions that models never attempted, unfairly lowering their exactness scores compared to the original calculation method.


📝 Patch Details

diff --git a/src/src/lib/eval.ts b/src/src/lib/eval.ts
index ee19459..ceff5c1 100644
--- a/src/src/lib/eval.ts
+++ b/src/src/lib/eval.ts
@@ -214,29 +214,21 @@ function blendedExactnessScore(provider: string, model: string) {
     return 0;
   }
 
-  const { totalMatches, exactMatches } = modelStats;
+  const { exactMatches, avgExactDistance, avgNumericDistance, avgFScore } = modelStats;
 
-  // Calculate individual exactness scores for all questions
-  const individualScores: number[] = [];
-
-  // Get all question keys from validation results
-  const questionKeys = Object.keys(validationResults).filter(key => key !== '_summary');
-
-  // Validate we have questions to process
-  if (questionKeys.length === 0) {
-    console.log(`No questions found in validation results for ${modelKey}`);
+  // Validate required aggregate fields exist and are numbers
+  if (
+    typeof avgExactDistance !== 'number' ||
+    typeof avgNumericDistance !== 'number' ||
+    typeof avgFScore !== 'number'
+  ) {
+    console.log(`Invalid aggregate distance data for ${modelKey}`);
     return 0;
   }
-
-  for (const question of questionKeys) {
-    const individualScore = getExactnessScore(provider, model, question);
-    individualScores.push(individualScore);
-  }
-
-  // Calculate average of individual scores (safe division)
-  const avgIndividualScore = individualScores.length > 0
-    ? individualScores.reduce((sum, score) => sum + score, 0) / individualScores.length
-    : 0;
+
+  // Use pre-calculated aggregates that only include questions the model attempted
+  // This ensures models aren't penalized for unattempted questions
+  const avgIndividualScore = blendScore(avgExactDistance, avgNumericDistance, avgFScore);
 
   // Apply exact match bonus (safe division)
   const exactMatchRate = exactMatches / validationSummaries.totalQuestions;

Analysis

Unfair scoring penalty in blendedExactnessScore() for models with unattempted questions

What fails: blendedExactnessScore() in src/src/lib/eval.ts iterates over all 50 questions and calls getExactnessScore() for each one. That function returns 0 for unattempted questions, which lowers a model's score relative to the pre-calculated aggregates.

How to reproduce:

1. Check models with unattempted questions (e.g., deepseek/deepseek-chat-v3-0324:free has 4 unattempted questions).
2. The current implementation averages scores across all 50 questions, including 0s for unattempted ones.
3. The pre-calculated aggregates (avgExactDistance, avgNumericDistance, avgFScore) only include attempted questions.

Result: Models such as deepseek/deepseek-chat-v3-0324:free get artificially low scores (48 instead of 56 points) because unattempted questions count as 0 rather than being excluded from the calculation.

Expected: Use the pre-calculated aggregate statistics, which only consider questions the model actually attempted. This matches the original benchmark methodology, which updates stats only when modelResult exists.
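
The effect of the denominator can be seen in a small example. The scores below are made up for illustration and are not taken from the benchmark; `undefined` marks an unattempted question, and both helper names are hypothetical.

```typescript
type Score = number | undefined;

// Average over every question, counting unattempted ones as 0.
function averageAll(scores: Score[]): number {
  if (scores.length === 0) return 0;
  const total = scores.reduce<number>((sum, s) => sum + (s ?? 0), 0);
  return total / scores.length;
}

// Average over attempted questions only.
function averageAttempted(scores: Score[]): number {
  const attempted = scores.filter((s): s is number => typeof s === "number");
  if (attempted.length === 0) return 0;
  return attempted.reduce((sum, s) => sum + s, 0) / attempted.length;
}

// Five questions, two of them unattempted (made-up values).
const sampleScores: Score[] = [80, 60, undefined, 40, undefined];
const overAll = averageAll(sampleScores); // 180 / 5 = 36
const overAttempted = averageAttempted(sampleScores); // 180 / 3 = 60
```

Which of the two is appropriate depends on whether skipping a question should cost the model points, which is the design question this review comment raises.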

vercel Bot deployed to Preview August 5, 2025 18:28 View deployment

alrocar force-pushed the v2 branch from 2b2ec20 to 9d0aea7 Compare August 5, 2025 18:29

vercel Bot deployed to Preview August 5, 2025 18:30 View deployment

alrocar force-pushed the v2 branch from 9d0aea7 to 280ee5c Compare August 13, 2025 17:51

vercel Bot deployed to Preview August 13, 2025 17:52 View deployment

alrocar force-pushed the v2 branch from 280ee5c to 7e675ad Compare September 1, 2025 10:33

vercel Bot deployed to Preview September 1, 2025 10:33 View deployment

alrocar force-pushed the v2 branch from 7e675ad to a332caf Compare September 1, 2025 10:40

vercel Bot deployed to Preview September 1, 2025 10:41 View deployment

alrocar force-pushed the v2 branch from a332caf to 918ce05 Compare September 5, 2025 16:55

vercel Bot deployed to Preview September 5, 2025 16:56 View deployment

alrocar force-pushed the v2 branch from 918ce05 to 6a618fd Compare September 12, 2025 10:40

vercel Bot deployed to Preview September 12, 2025 10:41 View deployment

alrocar force-pushed the v2 branch from 6a618fd to 6f85b41 Compare September 13, 2025 09:04

vercel Bot deployed to Preview September 13, 2025 09:05 View deployment

alrocar force-pushed the v2 branch from 6f85b41 to 783f98f Compare September 23, 2025 20:11

vercel Bot deployed to Preview September 23, 2025 20:13 View deployment

This comment was marked as outdated.


vercel Bot deployed to Preview September 24, 2025 07:59 View deployment

This comment was marked as outdated.


vercel Bot deployed to Preview September 24, 2025 08:23 View deployment

alrocar added 3 commits September 29, 2025 19:21

refactor: improve exactness score calculation in evaluation

d231ee1

feat: implement blended exactness score for model evaluation

5f2a606

refactor: improve blended exactness score calculation for model evalu…

c9ab549

…ation

alrocar force-pushed the v2 branch from 47c57b2 to c9ab549 Compare September 29, 2025 17:21

vercel Bot deployed to Preview September 29, 2025 17:22 View deployment

cursor Bot reviewed Sep 29, 2025


vercel Bot reviewed Sep 29, 2025


alrocar wants to merge 3 commits into main from v2
