    : 0;

  // Apply exact match bonus (safe division)
  const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
Bug: Mismatched Counts Skew Exact Match Rate
The blendedExactnessScore function calculates the exact match rate using validationSummaries.totalQuestions, but processes individual scores based on questionKeys. If these counts don't align, the exact match rate will be inaccurate, affecting the final blended score.
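One way to keep the numerator and denominator aligned is to derive the question count from the same key list the per-question loop iterates. The following is a minimal sketch only: the helper name `exactMatchRate`, the clamping behaviour, and the sample shape of `validationResults` are assumptions for illustration, not code from `eval.ts`.

```typescript
// Hypothetical helper, not part of eval.ts: computes the exact match rate
// against the same question set that the per-question loop iterates over.
function exactMatchRate(exactMatches: number, questionCount: number): number {
  // Guard the division: an empty or invalid question set yields a rate of 0.
  if (!Number.isFinite(questionCount) || questionCount <= 0) {
    return 0;
  }
  // Clamp to [0, 1] so a stale exactMatches count cannot exceed 100%.
  return Math.min(1, Math.max(0, exactMatches / questionCount));
}

// Assumed shape: one entry per question plus a '_summary' key.
const validationResults: Record<string, unknown> = {
  _summary: {},
  q1: {},
  q2: {},
  q3: {},
  q4: {},
};

// Same filter as the snippet under review, so both counts come from one source.
const questionKeys = Object.keys(validationResults).filter((key) => key !== '_summary');
const rate = exactMatchRate(3, questionKeys.length);
console.log(rate); // 0.75
```

Whether `questionKeys.length` or `validationSummaries.totalQuestions` is the right denominator depends on which count the benchmark treats as authoritative; the point of the sketch is only that both calculations should share it.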
  for (const question of questionKeys) {
    const individualScore = getExactnessScore(provider, model, question);
    individualScores.push(individualScore);
  }
The new implementation includes 0 scores for questions that models never attempted, unfairly lowering their exactness scores compared to the original calculation method.
📝 Patch Details
diff --git a/src/src/lib/eval.ts b/src/src/lib/eval.ts
index ee19459..ceff5c1 100644
--- a/src/src/lib/eval.ts
+++ b/src/src/lib/eval.ts
@@ -214,29 +214,21 @@ function blendedExactnessScore(provider: string, model: string) {
     return 0;
   }
 
-  const { totalMatches, exactMatches } = modelStats;
+  const { exactMatches, avgExactDistance, avgNumericDistance, avgFScore } = modelStats;
 
-  // Calculate individual exactness scores for all questions
-  const individualScores: number[] = [];
-
-  // Get all question keys from validation results
-  const questionKeys = Object.keys(validationResults).filter(key => key !== '_summary');
-
-  // Validate we have questions to process
-  if (questionKeys.length === 0) {
-    console.log(`No questions found in validation results for ${modelKey}`);
+  // Validate required aggregate fields exist and are numbers
+  if (
+    typeof avgExactDistance !== 'number' ||
+    typeof avgNumericDistance !== 'number' ||
+    typeof avgFScore !== 'number'
+  ) {
+    console.log(`Invalid aggregate distance data for ${modelKey}`);
     return 0;
   }
-
-  for (const question of questionKeys) {
-    const individualScore = getExactnessScore(provider, model, question);
-    individualScores.push(individualScore);
-  }
-
-  // Calculate average of individual scores (safe division)
-  const avgIndividualScore = individualScores.length > 0
-    ? individualScores.reduce((sum, score) => sum + score, 0) / individualScores.length
-    : 0;
+
+  // Use pre-calculated aggregates that only include questions the model attempted
+  // This ensures models aren't penalized for unattempted questions
+  const avgIndividualScore = blendScore(avgExactDistance, avgNumericDistance, avgFScore);
 
   // Apply exact match bonus (safe division)
   const exactMatchRate = exactMatches / validationSummaries.totalQuestions;
Analysis
Unfair scoring penalty in blendedExactnessScore() for models with unattempted questions
What fails: blendedExactnessScore() in src/src/lib/eval.ts iterates through ALL 50 questions and calls getExactnessScore(), which returns 0 for unattempted questions, unfairly lowering model scores compared to using pre-calculated aggregates
How to reproduce:
- Check models with unattempted questions (e.g., deepseek/deepseek-chat-v3-0324:free has 4 unattempted questions)
- Current implementation averages scores across all 50 questions (including 0s for unattempted)
- Pre-calculated aggregates (avgExactDistance, avgNumericDistance, avgFScore) only include attempted questions
Result: Models like deepseek/deepseek-chat-v3-0324:free get artificially low scores (48 vs 56 points) because unattempted questions count as 0 instead of being excluded from calculation
Expected: Use pre-calculated aggregate statistics that only consider questions the model actually attempted, matching the original benchmark methodology that updates stats only when modelResult exists
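The dilution described above can be illustrated with a small sketch. The `QuestionScore` type, both helper functions, and the uniform score of 0.5 are hypothetical and chosen only to make the arithmetic easy to follow; they are not the benchmark's data or the code in `eval.ts`. The 46 attempted / 4 unattempted split mirrors the example model mentioned above.

```typescript
// Hypothetical types and helpers for illustration only; eval.ts may differ.
type QuestionScore = { attempted: boolean; score: number };

// Average over attempted questions only (the behaviour the aggregates are said to have).
function averageAttempted(scores: QuestionScore[]): number {
  const attempted = scores.filter((s) => s.attempted);
  if (attempted.length === 0) return 0;
  return attempted.reduce((sum, s) => sum + s.score, 0) / attempted.length;
}

// Average over every question, counting unattempted ones as 0
// (the behaviour attributed to the per-question loop).
function averageAll(scores: QuestionScore[]): number {
  if (scores.length === 0) return 0;
  return scores.reduce((sum, s) => sum + (s.attempted ? s.score : 0), 0) / scores.length;
}

// 50 questions: 46 attempted with an assumed score of 0.5 each, 4 unattempted.
const scores: QuestionScore[] = [
  ...Array.from({ length: 46 }, () => ({ attempted: true, score: 0.5 })),
  ...Array.from({ length: 4 }, () => ({ attempted: false, score: 0 })),
];

console.log(averageAttempted(scores)); // 0.5
console.log(averageAll(scores)); // 0.46
```

With these made-up inputs the all-questions average is scaled by 46/50 relative to the attempted-only average. The real 48 vs 56 gap also reflects the exact-match bonus and the blending step, so it will not match this ratio exactly.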
Note
Replaces summary-based exactness with per-question averaging plus a capped exact-match bonus, adding validations and rounding.
blendedExactnessScore: averages getExactnessScore across all validation questions instead of using summary distances; adds validations and rounding in blendedExactnessScore and blendScore.
Written by Cursor Bugbot for commit c9ab549. This will update automatically on new commits.