chore: create failure objects #32

avocadoali · 2026-01-13T07:48:09Z

Not sure if this is needed, but it keeps the error objects consistent.

Copilot

Pull request overview

This pull request refactors inline dictionary creation into factory functions for evaluation result objects, improving code maintainability and consistency.

Changes:

- Added three factory functions (create_sample_result_success, create_sample_result_failure, create_task_result_failure) to standardize result object creation
- Replaced inline dictionary literals with factory function calls throughout the evaluation logic
- Improved consistency by ensuring all result objects include core fields like generated_command and exact_match even in error cases
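
As a rough illustration of the pattern summarized above, a failure factory can give every core field a default so that error results carry the same keys as successful ones. The key names below are the ones visible in this PR's diff; the signature, the default values, and the type hints are assumptions made for the sketch, not the actual code in src/evaluation/sglang_eval.py.

```python
from typing import Any, Dict, Optional


def create_sample_result_failure(
    sample_idx: int,
    task_id: str,
    error: str,
    choice_idx: Optional[int] = None,
    equivalent: bool = False,
    exact_match: bool = False,
    generated_command: str = "",
    generated_command_empty: bool = True,
    format_valid: bool = False,
    format_reason: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a failure result that has the same keys as a success result.

    Hypothetical sketch: the defaults are chosen for illustration only.
    """
    return {
        "sample_idx": sample_idx,
        "choice_idx": choice_idx,
        "task_id": task_id,
        "error": error,
        "equivalent": equivalent,
        "exact_match": exact_match,
        "generated_command": generated_command,
        "generated_command_empty": generated_command_empty,
        "format_valid": format_valid,
        "format_reason": format_reason,
    }


# One call replaces an inline dict literal and cannot forget a key.
result = create_sample_result_failure(
    sample_idx=0, task_id="task-1", error="empty command"
)
```

Because every call site goes through the same function, a field added later only has to be introduced in one place.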


src/evaluation/sglang_eval.py

maharajamihir

rethink variable naming but otherwise looks good 👍🏾

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.


src/evaluation/sglang_eval.py

Copilot · 2026-01-13T11:22:23Z

src/evaluation/sglang_eval.py

+def create_task_result_failure(
+    task_id: str,
+    error: str,
+    had_error: bool = True,


The had_error parameter with a default value of True is never overridden in any usage (lines 562-565, 570-573). Since this is a failure factory function, had_error should always be True. Consider removing this parameter and hardcoding the field to True in the returned dictionary, as it adds unnecessary complexity without any benefit.

sounds good?
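
A minimal sketch of what the suggestion above could look like, assuming the function only needs task_id and error as inputs. The diff shown here is truncated after the had_error parameter, so the real function may accept and return more fields than this example does.

```python
from typing import Any, Dict


def create_task_result_failure(task_id: str, error: str) -> Dict[str, Any]:
    """Build a task-level failure result (illustrative field set only)."""
    return {
        "task_id": task_id,
        "error": error,
        # Hardcoded: a result built by the failure factory always had an error.
        "had_error": True,
    }


failure = create_task_result_failure("task-1", "generation timed out")
```

With the parameter gone, a caller can no longer construct a "failure" whose had_error is False by accident.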

Copilot · 2026-01-13T11:22:23Z

src/evaluation/sglang_eval.py

+    """
+    return {
+        "sample_idx": sample_idx,
+        "choice_idx": choice_idx,
+        "task_id": task_id,
+        "error": error,
+        "equivalent": equivalent,
+        "exact_match": exact_match,
+        "generated_command": generated_command,
+        "generated_command_empty": generated_command_empty,
+        "format_valid": format_valid,
+        "format_reason": format_reason,
+    }
+
+


The factory function now adds generated_command and exact_match fields to failure results where they were previously omitted (e.g., the empty command case at lines 446-452). While this improves consistency with successful evaluation results, it changes the structure of failure objects. Ensure that any downstream code or data consumers that process these results can handle the additional fields, or document this as a breaking change.

Suggested change

-    """
-    return {
-        "sample_idx": sample_idx,
-        "choice_idx": choice_idx,
-        "task_id": task_id,
-        "error": error,
-        "equivalent": equivalent,
-        "exact_match": exact_match,
-        "generated_command": generated_command,
-        "generated_command_empty": generated_command_empty,
-        "format_valid": format_valid,
-        "format_reason": format_reason,
-    }
+    Note:
+        To avoid breaking downstream consumers that relied on the previous
+        structure of failure objects, optional fields such as
+        ``generated_command`` and ``exact_match`` are only attached when they
+        carry non-default values.
+    """
+    result: Dict[str, Any] = {
+        "sample_idx": sample_idx,
+        "task_id": task_id,
+        "error": error,
+        "equivalent": equivalent,
+        "generated_command_empty": generated_command_empty,
+        "format_valid": format_valid,
+        "format_reason": format_reason,
+    }
+    # Add optional fields if provided / non-default
+    if choice_idx is not None:
+        result["choice_idx"] = choice_idx
+    if generated_command:
+        result["generated_command"] = generated_command
+    if exact_match:
+        result["exact_match"] = exact_match
+    return result
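
Whichever shape the failure objects end up with, a downstream consumer can be written so that it tolerates both. The sketch below is one way to do that with dict.get; the helper name count_exact_matches and the sample records are hypothetical and do not come from this PR.

```python
from typing import Any, Dict, List


def count_exact_matches(results: List[Dict[str, Any]]) -> int:
    """Count exact matches, treating a missing key the same as False."""
    return sum(1 for r in results if r.get("exact_match", False))


# Failure record without the optional keys (previous structure).
old_style_failure = {"sample_idx": 0, "task_id": "task-1", "error": "empty command"}
# Failure record that always carries the core keys (structure after this PR).
new_style_failure = {**old_style_failure, "generated_command": "", "exact_match": False}
# A successful record for comparison.
success = {
    "sample_idx": 1,
    "task_id": "task-1",
    "error": None,
    "generated_command": "ls -la",
    "exact_match": True,
}

matches = count_exact_matches([old_style_failure, new_style_failure, success])
print(matches)  # 1
```

Code that indexes with result["exact_match"] directly would raise KeyError on the old-style record, which is the compatibility risk the review comment points at.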

chore: create failure object

346c55f

avocadoali requested review from Copilot and maharajamihir and removed request for Copilot January 13, 2026 07:48

Copilot started reviewing on behalf of avocadoali January 13, 2026 07:59

avocadoali changed the title ~~chore: create failure object~~ chore: create failure objects Jan 13, 2026

Copilot AI reviewed Jan 13, 2026


maharajamihir reviewed Jan 13, 2026


fix: address comments

96cc34b

avocadoali requested a review from Copilot January 13, 2026 11:17

Copilot started reviewing on behalf of avocadoali January 13, 2026 11:17

Copilot AI reviewed Jan 13, 2026


avocadoali marked this pull request as draft January 13, 2026 12:55

maharajamihir approved these changes Jan 13, 2026
