refactor benchmark agents into harnesses#56
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the agent architecture by introducing a harness-based system, replacing the previous template-driven agent construction. The changes involve migrating configuration files to use a canonical harness key, implementing a new harness factory, and updating documentation to reflect the new taxonomy. While the refactor standardizes the agent execution loop, the review comments highlight concerns regarding the consolidation of retired experimental variants into the new canonical harnesses, which may affect result comparability, and point out redundant legacy fallback logic in the benchmark executor.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 47bf2a1a5e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughMigrates from agents/templates to harness-based runtime with new base/memory/tools/planner/tool harnesses and factory; updates YAML configs, schemas, CLI, executor, and tests; deprecates legacy agents; and filters retired results in leaderboard generation. ChangesHarness Architecture Refactor
Sequence Diagram(s)sequenceDiagram
participant CLI
participant Factory
participant Harness
participant Memory
participant LLM
CLI->>Factory: create_harness(harness, model, params)
Factory-->>CLI: Harness instance
CLI->>Harness: get_action(observation, choices)
Harness->>Memory: get_context(step)
Harness->>LLM: completion(system, prompt)
LLM-->>Harness: response
Harness->>Harness: parse + safety + retries
Harness->>Memory: update(step_data)
Harness-->>CLI: action index
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 597ba7ac0b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Actionable comments posted: 9
🧹 Nitpick comments (12)
docs/HARNESS_ENGINEERING.md (1)
28-28: ⚡ Quick winUse the canonical
ReActterm for loop naming.On Line 28,
react loopis ambiguous and can be read as the React framework;ReAct loopis the clearer method name here.Proposed doc tweak
-| `planner` | Uses a plan-maintain-act loop with compact memory instead of a pure react loop. | +| `planner` | Uses a plan-maintain-act loop with compact memory instead of a pure ReAct loop. |🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/HARNESS_ENGINEERING.md` at line 28, Update the planner description to use the canonical "ReAct" term: change the phrase "react loop" in the table cell for `planner` to "ReAct loop" (e.g., "`Uses a plan-maintain-act loop with compact memory instead of a pure ReAct loop.`") so the method name is unambiguous and distinct from the React framework.llm_quest_benchmark/harnesses/tools.py (1)
9-54: 💤 Low valueCalculator sandboxing looks solid; minor note on float literals.
The two-stage defense (character whitelist + AST allow-list with no
ast.Name/ast.Call) blocks identifier and call-based escapes, and the exponent cap at line 82 prevents trivial DoS via**. One quirk worth being aware of: the whitelist permits scientific notation, so an input like1e400parses to a floatinf(and1e400 - 1e400yieldsnan), which then propagates through arithmetic/comparisons and is rendered into the returned string. If you want results to be bounded numbers, consider rejecting non-finite results before formatting.♻️ Optional guard against inf/nan results
result = _eval_calculator_node(tree.body) + if isinstance(result, float) and (result != result or result in (float("inf"), float("-inf"))): + return "error: non-finite result" except Exception as exc: return f"error: {exc}" return f"{expr} = {result}"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/tools.py` around lines 9 - 54, The calculator currently allows float infinities/NaNs through (e.g., "1e400") which then get formatted into the result; after computing result = _eval_calculator_node(tree.body) in calculator(expression: str) validate that the result is a finite numeric value (use math.isfinite for int/float, and treat booleans as valid) and if not finite return a clear error like "error: non-finite result" before the final return; ensure this check handles numeric types only and still lets boolean results pass.llm_quest_benchmark/agents/planner_agent.py (1)
7-7: ⚡ Quick winAvoid import-time deprecation side effects.
Emitting
DeprecationWarningduring module import can break consumers running with warnings-as-errors. Prefer warning whenPlannerAgentis instantiated (or first accessed) instead.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/agents/planner_agent.py` at line 7, The module currently emits a DeprecationWarning at import time via warnings.warn; instead move that warning into the PlannerAgent class so it only fires when the API is actually used (e.g., inside PlannerAgent.__init__ or the first property/method called). Update planner_agent.PlannerAgent to call warnings.warn("planner_agent is deprecated, use harnesses.planner", DeprecationWarning, stacklevel=2) during instantiation (or first access) and remove the top-level warnings.warn from the module.llm_quest_benchmark/harnesses/tool_harness.py (1)
214-224: 💤 Low valueError tag references retired
tool_agentnaming.The error response is tagged
tool_agent_error(line 220) but this harness isToolCompactHarness. With the agents→harnesses rename, downstream consumers grepping logs fortool_harness_error(mirroringplanner_errorinplanner.pyline 179) will miss these. Consider renaming for consistency.♻️ Proposed change
- reasoning=f"tool_agent_error: {exc}", + reasoning=f"tool_harness_error: {exc}",🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/tool_harness.py` around lines 214 - 224, The error tag in ToolCompactHarness's exception handler uses the retired "tool_agent_error" label; update the reasoning tag and any related log text to "tool_harness_error" so downstream consumers matching planner_error semantics will find it. Specifically, change the reasoning f-string in the exception block that constructs the LLMResponse (and the accompanying logger.error message if it embeds the old tag) to use "tool_harness_error" and keep LLMResponse creation (action, is_default, parse_mode) and history/_last_response updates unchanged.llm_quest_benchmark/harnesses/memory.py (1)
281-286: 💤 Low valueDead-fallback branch in
_format_transcript_for_compaction.
_format_transcript_for_compactionis only reachable from_maybe_compact, which has already returned when_steps_since_compaction < compaction_interval. Theelse self._transcript[-self.compaction_interval:]branch is therefore unreachable in practice. Either drop the fallback or assert/document the precondition so future call sites can't accidentally widen the contract.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/memory.py` around lines 281 - 286, The fallback branch in _format_transcript_for_compaction is dead because callers (notably _maybe_compact) only call it when _steps_since_compaction >= compaction_interval; remove the unreachable else branch and simplify the slice to use only the intended precondition, or alternatively add an explicit precondition check/assertion (e.g., assert _steps_since_compaction > 0 or assert _steps_since_compaction >= compaction_interval) with a brief docstring comment on _format_transcript_for_compaction explaining that callers must only invoke it when _steps_since_compaction >= compaction_interval; update references to _transcript, _steps_since_compaction, and compaction_interval accordingly.llm_quest_benchmark/schemas/config.py (2)
216-223: 💤 Low valueDuplicated legacy-key rejection.
The
template/memory_modechecks on lines 219-222 are already enforced insideHarnessConfig.__init__(lines 84-87) via**legacy_keys. Removing them here keeps a single source of truth for the validation messages and avoids drift if the wording is later tweaked in only one location.♻️ Proposed simplification
if "agents" in data: agents = [] for agent in data["agents"]: - if "template" in agent: - raise ValueError("Use harness: key instead of template:") - if "memory_mode" in agent: - raise ValueError("Use harness: key instead of memory_mode:") agents.append(HarnessConfig(**agent)) data["agents"] = agents🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/schemas/config.py` around lines 216 - 223, Remove the duplicated legacy-key rejection in the agents parsing loop by deleting the explicit checks for "template" and "memory_mode" (the two if "template" in agent / if "memory_mode" in agent branches) and let HarnessConfig(**agent) perform the validation/raise; this centralizes the error message logic in HarnessConfig.__init__ (the legacy_keys handling) and prevents divergent wording or double-checks while keeping the agents list construction (agents.append(HarnessConfig(**agent))) intact.
57-101: 💤 Low valueRedundant
@dataclassdecorator combined with custom__init__.The
@dataclassdecorator's generated__init__is fully overridden by the explicit__init__(lines 71-101), so the decorator's primary benefit is lost while still requiring the manual__post_init__()call at line 101. Either drop@dataclassand make it a plain class, or keep@dataclassand move legacy-key rejection into__post_init__(rejecting keys collected via a sentinel) so the generated__init__is preserved. Current form is functional but confusing for readers expecting standard dataclass semantics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/schemas/config.py` around lines 57 - 101, The HarnessConfig class currently decorates with `@dataclass` but fully overrides the generated __init__, which is confusing; pick one approach—either remove `@dataclass` and keep the explicit __init__ and __post_init__ call, or restore dataclass semantics by removing the explicit __init__ and moving the legacy-key validation into __post_init__ (use a sentinel field like _raw_kwargs: dict = field(default_factory=dict, repr=False) to capture unexpected kwargs or accept **kwargs in the dataclass-generated __init__ signature), then perform the template/memory_mode and unexpected-key checks inside __post_init__; update references to HarnessConfig.__init__ and HarnessConfig.__post_init__ accordingly and remove the manual __post_init__() call if you keep `@dataclass` and let dataclass initialization call it automatically.llm_quest_benchmark/harnesses/base.py (4)
559-562: 💤 Low valueForce-numeric retry heuristic matches any OpenAI model starting with
o.
self.model_spec.model_id.startswith("o")is meant to cover theo1/o3/o4-mini reasoning families, but it will also match any future OpenAI model whose normalized ID begins witho(e.g., a hypotheticalomni-…), forcing them through the JSON-then-number-only retry path even when they already return well-formed JSON. Consider tightening to an explicit prefix list or a regex like^o\d:♻️ Proposed tightening
- def _needs_force_numeric_retry(self) -> bool: - return self.model_spec.provider == "openai" and ( - self.model_spec.model_id.startswith("gpt-5") or self.model_spec.model_id.startswith("o") - ) + def _needs_force_numeric_retry(self) -> bool: + model_id = self.model_spec.model_id + return self.model_spec.provider == "openai" and ( + model_id.startswith("gpt-5") or bool(re.match(r"^o\d", model_id)) + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/base.py` around lines 559 - 562, The `_needs_force_numeric_retry` method currently treats any OpenAI model whose `model_id` starts with "o" as needing the numeric-retry path; tighten this by checking only the intended families—either match a regex like `^o\d` against `self.model_spec.model_id` or test against an explicit allowed-prefix list such as ("o1", "o3", "o4")—and keep the `self.model_spec.provider == "openai"` guard. Update `_needs_force_numeric_retry` to use the chosen stricter test (e.g., `re.match(r"^o\d", self.model_spec.model_id)` or `self.model_spec.model_id.startswith(prefix)` against the tuple) so future "o"-prefixed model IDs (e.g., "omni-") won’t be forced into the numeric-only retry path.
599-609: 💤 Low valueSafety filter is applied twice on the retry path.
_parse_with_retriesapplies_apply_safety_filterat lines 599-602, butPlannerHarness._get_action_impl(line 154) andToolCompactHarness._get_action_impl(line 198) also call_apply_safety_filteron the returnedparsed_response.action. The override is idempotent (re-applying produces the same answer), so this isn't a correctness bug, but thepolicy_safety_overridereasoning tag can be set in either layer and the second layer can no-op silently — making it ambiguous which layer is the source of truth.Consider centralizing safety filtering either inside
_parse_with_retries(and removing it from the per-harness implementations) or outside (and removing it here), so reasoning/usage attribution is consistent.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/base.py` around lines 599 - 609, The safety filter is being applied in multiple places causing ambiguous reasoning attribution; choose one location and remove the duplicate calls. Either (A) keep safety filtering inside _parse_with_retries (the block that calls _apply_safety_filter and sets parsed_response.reasoning = "policy_safety_override") and remove any calls to _apply_safety_filter in PlannerHarness._get_action_impl and ToolCompactHarness._get_action_impl, or (B) move filtering out of _parse_with_retries and ensure the harnesses call _apply_safety_filter once and set parsed_response.reasoning there; update only the redundant callers accordingly so _apply_safety_filter and the policy_safety_override tag are set exactly once.
309-325: 💤 Low value
@abstractmethoddecorator onresetis meaningful but conflicts with the in-body call tosuper().reset().
resetis marked@abstractmethodwhile also providing a real implementation that subclasses are expected to extend viasuper().reset(). This pattern works in Python, but it makes the contract surprising: the abstractness only forces subclasses to override the name, not to call into the base, and the body looks reusable. Either drop@abstractmethod(since the base actually does meaningful state reset and subclasses currently always callsuper().reset()), or split into a concrete_reset_base_state()helper and keepresettruly abstract. The current form risks a subclass redefiningresetwithout invokingsuper().reset()and silently leakinghistory/_observation_history/_state_action_countsacross episodes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/base.py` around lines 309 - 325, The current reset method is marked `@abstractmethod` yet contains real state-reset logic (history, _last_response, _observation_history, _decision_history, _state_action_counts, _step_count and memory_module.reset()), which risks subclasses overriding reset and skipping this base cleanup; change reset to be a concrete method that performs the shown base cleanup and removes the `@abstractmethod` decorator, and introduce a separate optional hook (e.g., an abstract or no-op _post_reset(self) or a protected _reset_base_state(self) that contains the current body while reset calls it and then calls _post_reset) so subclasses can extend behavior without accidentally skipping the base state reset (refer to reset, _reset_base_state/_post_reset, history, _last_response, _observation_history, _state_action_counts, memory_module).
450-467: 💤 Low value
_call_llmretry loop swallows nothing on the first attempt'sTypeErrorwhensystem_prompt is None.The retry loop is structured so:
system_prompt is not None+TypeError→ falls back to the no-system-prompt signature (line 461) and returns — no retries.system_prompt is None+TypeError→raise(line 462), terminates immediately.- any other
Exception→ recorded inlast_errorand retried up to 3 times.This is reasonable, but two subtleties are worth tightening:
- The fallback at line 461 is itself unprotected — if the second call also raises (any exception type), it propagates without participating in the retry budget.
- If
get_completionreturns successfully on attempt 1, great; if all 3 attempts raise the sameTypeErrorwhilesystem_prompt is None, only the first attempt actually runs because the loop body returns/raises before retrying. Thefor attempt in range(3)is misleading there.Consider documenting these branches in a comment, or wrapping the line-461 fallback in the same
last_error-tracking pattern so transient failures during the fallback are also retried.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/harnesses/base.py` around lines 450 - 467, The _call_llm retry loop currently returns or re-raises immediately inside the TypeError handler which bypasses the shared retry bookkeeping; update the except TypeError block in _call_llm so that if system_prompt is not None you call self.llm.get_completion(prompt) inside the same try/except logic (capture any exception into last_error and allow the loop to continue/retry and log via self.logger/debug like the other exceptions), and if system_prompt is None do not re-raise immediately but set last_error = exc and let the loop retry up to 3 times (with the existing logging path) so all fallback attempts participate in the retry budget and transient errors are recorded consistently.llm_quest_benchmark/agents/llm_agent.py (1)
23-23: 💤 Low valueModule-import
DeprecationWarningfires for every consumer.Emitting the warning at module-import time triggers on any import path that touches
llm_quest_benchmark.agents.llm_agent, including internal re-exports (parsing helpers, keyword constants,__all__). Consider moving the warning intoLLMAgent.__init__so only actual instantiations of the legacy class trigger it; this avoids spurious noise for callers that only useparse_llm_responseor the keyword tuples through this module.♻️ Proposed change
-warnings.warn("llm_agent is deprecated, use llm_quest_benchmark.harnesses", DeprecationWarning, stacklevel=2) - - class LLMAgent(MinimalHarness): """Backward-compatible LLMAgent facade backed by concrete harness classes.""" SUPPORTED_MODELS = MODEL_CHOICES def __init__( self, ... ): + warnings.warn( + "LLMAgent is deprecated, use llm_quest_benchmark.harnesses", + DeprecationWarning, + stacklevel=2, + ) ...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm_quest_benchmark/agents/llm_agent.py` at line 23, The module-level DeprecationWarning causes noise on import; remove the top-level warnings.warn call and instead emit the deprecation inside the legacy class constructor: add a warnings.warn(...) call in LLMAgent.__init__ (using DeprecationWarning and stacklevel=2) so the warning fires only when LLMAgent is instantiated; leave functions like parse_llm_response and exported constants untouched so importing them no longer triggers the warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@llm_quest_benchmark/agents/agent_factory.py`:
- Around line 66-90: The branches handling resolved_action_template ==
"planner.jinja" and resolved_action_template in ("tool_augmented.jinja",
"tool_augmented_hints.jinja") call create_harness but drop the memory_mode
parameter from the create_agent API; update both create_harness calls in these
branches to pass through memory_mode (or, if you prefer fail-fast, validate
memory_mode at the top of create_agent and raise a clear error when a
non-default value is provided for these harness types). Locate the two blocks
that call create_harness (the planner branch and the
tool_augmented/tool_augmented_hints branch) and either add
memory_mode=memory_mode to their argument lists or add an early guard in
create_agent that checks resolved_action_template and memory_mode and raises/
logs a clear error.
In `@llm_quest_benchmark/executors/benchmark.py`:
- Line 109: The artifact write uses a derived agent_id expression ("agent_id":
agent_config.harness_id if hasattr(agent_config, "harness_id") else
agent_config.agent_id) but the timeout-persistence code still uses
agent_config.agent_id directly; change the timeout DB write to use the same
derivation expression (or extract a single helper like derived_agent_id =
agent_config.harness_id if hasattr(agent_config, "harness_id") else
agent_config.agent_id and use that) so both artifact writes and timeout writes
use identical agent_id logic (update the code that persists timeouts to
reference derived_agent_id instead of agent_config.agent_id).
In `@llm_quest_benchmark/executors/cli/commands.py`:
- Around line 55-56: HARNESS_CHOICES is built only from HARNESS_REGISTRY.keys(),
causing CLI validation to reject special harness identifiers (e.g.,
random_choice and seeded variants) that create_harness accepts; replace the
strict Choice(...) usage that uses HARNESS_CHOICES with a small custom validator
that accepts either a key in HARNESS_REGISTRY or any value for which
is_random_choice_harness(value) returns true, and apply the same change to the
other Choice usages around the second occurrence (the block currently around
where HARNESS_CHOICES is used again at the later lines) so the CLI accepts both
registry keys and valid random/seeded harness identifiers.
In `@llm_quest_benchmark/harnesses/memo.py`:
- Line 30: The current initialization uses a falsy-coalescing pattern that can
override valid false-y custom modules: replace the expression that sets
memory_module (the argument currently using "memory_module or
CompactionMemory(compaction_interval=compaction_interval)") with an explicit
None check so that if memory_module is not None it is used, otherwise construct
CompactionMemory(compaction_interval=compaction_interval); reference the
parameter name memory_module and the fallback class CompactionMemory (and the
compaction_interval arg) when making this change.
In `@llm_quest_benchmark/harnesses/memory.py`:
- Around line 249-279: The compaction path in _maybe_compact currently leaves
_steps_since_compaction at-or-above compaction_interval when the LLM returns an
empty string or throws, causing repeated compaction attempts; wrap the call to
self.llm_client.get_completion(...) in a try/except to catch exceptions and
treat failures as empty summaries, and when the summary is empty (or an
exception occurred) set self._steps_since_compaction back to 0 or to a small
back-off value (e.g., compaction_interval // 2) to avoid retrying every step;
keep the existing behavior of resetting _transcript, _compaction_summary and
_steps_since_compaction only on a truthy summary, but ensure failures/exceptions
update _steps_since_compaction to prevent a per-step retry storm.
In `@llm_quest_benchmark/harnesses/minimal.py`:
- Around line 42-47: The parsed_response.action must be clamped to valid bounds
before you persist state; change the order in minimal.py so you
validate/normalize parsed_response.action (ensure 1 <= parsed_response.action <=
len(choices)) immediately after parsing and before calling _remember_decision,
and then update self.history and self._last_response and call
self._remember_decision(observation, choices, state_signature, parsed_response);
finally return parsed_response.action so the recorded decision, last response,
history, and returned value are all consistent with the normalized action.
In `@llm_quest_benchmark/harnesses/planner.py`:
- Around line 167-172: The action clamp is applied after bookkeeping, so
_remember_decision and history record the original out-of-range action; move the
clamp before any bookkeeping and history mutation: in the Planner (method where
self.history.append(parsed_response) and self._remember_decision(...) are
called), validate and clamp parsed_response.action into the range [1,
len(choices)] immediately after parsing (or before calling
self._remember_decision and before appending to self.history/_last_response) so
the stored counts in _state_action_counts and entries in _decision_history match
the actual returned action.
In `@llm_quest_benchmark/harnesses/tools.py`:
- Around line 149-170: The output formatter is using direct dict access which
can raise KeyError for partial entries; update the code that builds lines (in
the same method that iterates over self.step_log and computes scored/best) to
mirror the defensive access used during scoring: use entry.get("step", 0),
entry.get("observation", ""), entry.get("choices", []) (and coerce to an
iterable/string before joining), and entry.get("selected_choice", "n/a") so
missing keys won't crash search().
- Around line 160-163: The fallback selects oldest entries because after
scored.sort(..., reverse=True) the most recent (highest step) items are at the
head, but the code uses scored[-self.history_window:] which takes the tail;
change the fallback to take the first self.history_window items from scored
(e.g., scored[: self.history_window]) and, if chronological ordering is required
by the caller, reverse that slice before assigning to best; update the logic
around scored, best and self.history_window accordingly.
---
Nitpick comments:
In `@docs/HARNESS_ENGINEERING.md`:
- Line 28: Update the planner description to use the canonical "ReAct" term:
change the phrase "react loop" in the table cell for `planner` to "ReAct loop"
(e.g., "`Uses a plan-maintain-act loop with compact memory instead of a pure
ReAct loop.`") so the method name is unambiguous and distinct from the React
framework.
In `@llm_quest_benchmark/agents/llm_agent.py`:
- Line 23: The module-level DeprecationWarning causes noise on import; remove
the top-level warnings.warn call and instead emit the deprecation inside the
legacy class constructor: add a warnings.warn(...) call in LLMAgent.__init__
(using DeprecationWarning and stacklevel=2) so the warning fires only when
LLMAgent is instantiated; leave functions like parse_llm_response and exported
constants untouched so importing them no longer triggers the warning.
In `@llm_quest_benchmark/agents/planner_agent.py`:
- Line 7: The module currently emits a DeprecationWarning at import time via
warnings.warn; instead move that warning into the PlannerAgent class so it only
fires when the API is actually used (e.g., inside PlannerAgent.__init__ or the
first property/method called). Update planner_agent.PlannerAgent to call
warnings.warn("planner_agent is deprecated, use harnesses.planner",
DeprecationWarning, stacklevel=2) during instantiation (or first access) and
remove the top-level warnings.warn from the module.
In `@llm_quest_benchmark/harnesses/base.py`:
- Around line 559-562: The `_needs_force_numeric_retry` method currently treats
any OpenAI model whose `model_id` starts with "o" as needing the numeric-retry
path; tighten this by checking only the intended families—either match a regex
like `^o\d` against `self.model_spec.model_id` or test against an explicit
allowed-prefix list such as ("o1", "o3", "o4")—and keep the
`self.model_spec.provider == "openai"` guard. Update
`_needs_force_numeric_retry` to use the chosen stricter test (e.g.,
`re.match(r"^o\d", self.model_spec.model_id)` or
`self.model_spec.model_id.startswith(prefix)` against the tuple) so future
"o"-prefixed model IDs (e.g., "omni-") won’t be forced into the numeric-only
retry path.
- Around line 599-609: The safety filter is being applied in multiple places
causing ambiguous reasoning attribution; choose one location and remove the
duplicate calls. Either (A) keep safety filtering inside _parse_with_retries
(the block that calls _apply_safety_filter and sets parsed_response.reasoning =
"policy_safety_override") and remove any calls to _apply_safety_filter in
PlannerHarness._get_action_impl and ToolCompactHarness._get_action_impl, or (B)
move filtering out of _parse_with_retries and ensure the harnesses call
_apply_safety_filter once and set parsed_response.reasoning there; update only
the redundant callers accordingly so _apply_safety_filter and the
policy_safety_override tag are set exactly once.
- Around line 309-325: The current reset method is marked `@abstractmethod` yet
contains real state-reset logic (history, _last_response, _observation_history,
_decision_history, _state_action_counts, _step_count and memory_module.reset()),
which risks subclasses overriding reset and skipping this base cleanup; change
reset to be a concrete method that performs the shown base cleanup and removes
the `@abstractmethod` decorator, and introduce a separate optional hook (e.g., an
abstract or no-op _post_reset(self) or a protected _reset_base_state(self) that
contains the current body while reset calls it and then calls _post_reset) so
subclasses can extend behavior without accidentally skipping the base state
reset (refer to reset, _reset_base_state/_post_reset, history, _last_response,
_observation_history, _state_action_counts, memory_module).
- Around line 450-467: The _call_llm retry loop currently returns or re-raises
immediately inside the TypeError handler which bypasses the shared retry
bookkeeping; update the except TypeError block in _call_llm so that if
system_prompt is not None you call self.llm.get_completion(prompt) inside the
same try/except logic (capture any exception into last_error and allow the loop
to continue/retry and log via self.logger/debug like the other exceptions), and
if system_prompt is None do not re-raise immediately but set last_error = exc
and let the loop retry up to 3 times (with the existing logging path) so all
fallback attempts participate in the retry budget and transient errors are
recorded consistently.
In `@llm_quest_benchmark/harnesses/memory.py`:
- Around line 281-286: The fallback branch in _format_transcript_for_compaction
is dead because callers (notably _maybe_compact) only call it when
_steps_since_compaction >= compaction_interval; remove the unreachable else
branch and simplify the slice to use only the intended precondition, or
alternatively add an explicit precondition check/assertion (e.g., assert
_steps_since_compaction > 0 or assert _steps_since_compaction >=
compaction_interval) with a brief docstring comment on
_format_transcript_for_compaction explaining that callers must only invoke it
when _steps_since_compaction >= compaction_interval; update references to
_transcript, _steps_since_compaction, and compaction_interval accordingly.
In `@llm_quest_benchmark/harnesses/tool_harness.py`:
- Around line 214-224: The error tag in ToolCompactHarness's exception handler
uses the retired "tool_agent_error" label; update the reasoning tag and any
related log text to "tool_harness_error" so downstream consumers matching
planner_error semantics will find it. Specifically, change the reasoning
f-string in the exception block that constructs the LLMResponse (and the
accompanying logger.error message if it embeds the old tag) to use
"tool_harness_error" and keep LLMResponse creation (action, is_default,
parse_mode) and history/_last_response updates unchanged.
In `@llm_quest_benchmark/harnesses/tools.py`:
- Around line 9-54: The calculator currently allows float infinities/NaNs
through (e.g., "1e400") which then get formatted into the result; after
computing result = _eval_calculator_node(tree.body) in calculator(expression:
str) validate that the result is a finite numeric value (use math.isfinite for
int/float, and treat booleans as valid) and if not finite return a clear error
like "error: non-finite result" before the final return; ensure this check
handles numeric types only and still lets boolean results pass.
In `@llm_quest_benchmark/schemas/config.py`:
- Around line 216-223: Remove the duplicated legacy-key rejection in the agents
parsing loop by deleting the explicit checks for "template" and "memory_mode"
(the two if "template" in agent / if "memory_mode" in agent branches) and let
HarnessConfig(**agent) perform the validation/raise; this centralizes the error
message logic in HarnessConfig.__init__ (the legacy_keys handling) and prevents
divergent wording or double-checks while keeping the agents list construction
(agents.append(HarnessConfig(**agent))) intact.
- Around line 57-101: The HarnessConfig class currently decorates with
`@dataclass` but fully overrides the generated __init__, which is confusing; pick
one approach—either remove `@dataclass` and keep the explicit __init__ and
__post_init__ call, or restore dataclass semantics by removing the explicit
__init__ and moving the legacy-key validation into __post_init__ (use a sentinel
field like _raw_kwargs: dict = field(default_factory=dict, repr=False) to
capture unexpected kwargs or accept **kwargs in the dataclass-generated __init__
signature), then perform the template/memory_mode and unexpected-key checks
inside __post_init__; update references to HarnessConfig.__init__ and
HarnessConfig.__post_init__ accordingly and remove the manual __post_init__()
call if you keep `@dataclass` and let dataclass initialization call it
automatically.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 70dd43a1-3520-4c51-b877-f9eaed4dc2f4
📒 Files selected for processing (68)
configs/benchmarks/exp3_no_loop_breaker.yamlconfigs/benchmarks/exp3_stateful_compact.yamlconfigs/benchmarks/exp4_compaction_no_memo.yamlconfigs/benchmarks/exp4_memo_cot.yamlconfigs/benchmarks/exp4_memo_extended.yamlconfigs/benchmarks/exp4_memo_structured.yamlconfigs/benchmarks/exp5_stateful_compact_variance.yamlconfigs/benchmarks/exp6_prompt_hints.yamlconfigs/benchmarks/exp6_tools.yamlconfigs/benchmarks/exp6_tools_hints.yamlconfigs/benchmarks/exp6_unified_tools_screen.yamlconfigs/benchmarks/exp7_deepseek.yamlconfigs/benchmarks/exp7_haiku.yamlconfigs/benchmarks/exp7_llama.yamlconfigs/benchmarks/exp7_mistral.yamlconfigs/benchmarks/exp7_qwen.yamlconfigs/benchmarks/exp7b_model_upgrades.yamlconfigs/benchmarks/memory_compaction.yamlconfigs/benchmarks/memory_full_transcript.yamlconfigs/benchmarks/memory_modes_pilot.yamlconfigs/benchmarks/openrouter_smoke_test.yamlconfigs/default.yamlconfigs/kr1.yamlconfigs/kr1_micro.yamlconfigs/kr1_test.yamlconfigs/kr2_en_benchmark.yamlconfigs/kr2_en_test.yamlconfigs/test/parallel_agents_test.yamlconfigs/test/temperature_test.yamlconfigs/test/test_benchmark.yamldocs/ARCHITECTURE.mddocs/EXPERIMENTS_LOG.mddocs/EXPERIMENT_AUDIT.mddocs/HARNESS_ENGINEERING.mddocs/SPEC.mdllm_quest_benchmark/agents/__init__.pyllm_quest_benchmark/agents/agent_factory.pyllm_quest_benchmark/agents/llm_agent.pyllm_quest_benchmark/agents/planner_agent.pyllm_quest_benchmark/agents/strategic_agent.pyllm_quest_benchmark/agents/tool_agent.pyllm_quest_benchmark/core/leaderboard.pyllm_quest_benchmark/core/runner.pyllm_quest_benchmark/executors/benchmark.pyllm_quest_benchmark/executors/cli/commands.pyllm_quest_benchmark/harnesses/__init__.pyllm_quest_benchmark/harnesses/base.pyllm_quest_benchmark/harnesses/factory.pyllm_quest_benchmark/harnesses/memo.pyllm_quest_benchmark/harnesses/memory.pyllm_quest_benchmark/harnesses/minimal.pyllm_quest_benchmark/harnesses/planner.pyllm_quest_benchmark/harnesses/reasoning.pyllm_quest_benchmark/harnesses/tool_harness.pyllm_quest_benchmark/harnesses/tools.pyllm_quest_benchmark/schemas/__init__.pyllm_quest_benchmark/schemas/config.pyllm_quest_benchmark/tests/agents/test_anthropic.pyllm_quest_benchmark/tests/agents/test_llm_agent.pyllm_quest_benchmark/tests/agents/test_mode_agents.pyllm_quest_benchmark/tests/harnesses/__init__.pyllm_quest_benchmark/tests/harnesses/test_factory.pyllm_quest_benchmark/tests/harnesses/test_harnesses.pyllm_quest_benchmark/tests/integration/test_benchmark.pyllm_quest_benchmark/tests/integration/test_mode_agents_e2e.pyllm_quest_benchmark/tests/integration/test_quest_e2e.pyllm_quest_benchmark/tests/test_benchmark_with_directory.pyllm_quest_benchmark/tests/test_leaderboard.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f040e53af
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fea154f7ba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if resolved_action_template in harness_routed_templates and memory_mode != "default": | ||
| raise ValueError( | ||
| "memory_mode is not supported for planner/tool harness templates; configure memory via harness selection." | ||
| ) |
There was a problem hiding this comment.
Preserve planner/tool compatibility with compaction memory_mode
Do not reject memory_mode for planner/tool templates in create_agent: older callers commonly pass action_template='planner.jinja' or 'tool_augmented*.jinja' together with memory_mode='compaction', which used to work and now raises immediately. This turns previously valid legacy configs/API usage into hard failures even when the requested mode matches the routed harness behavior, so benchmark jobs and scripts that were not migrated to harness will break at startup.
Useful? React with 👍 / 👎.
| for entry in self.history: | ||
| self.logger.debug(f"\nObservation: {entry['observation']}") | ||
| self.logger.debug(f"Analysis: {entry['analysis']}") | ||
| raise ImportError("strategic_agent is deprecated; use llm_quest_benchmark.harnesses instead") |
There was a problem hiding this comment.
Keep strategic_agent importable as a deprecation wrapper
Replace the unconditional ImportError with a deprecation shim (like the other agent wrappers), because importing llm_quest_benchmark.agents.strategic_agent now always crashes at import time. Any existing integration that still imports StrategicAgent for backward compatibility cannot even start, which is a breaking change introduced by this refactor rather than a gradual deprecation path.
Useful? React with 👍 / 👎.
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…late Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42a89b1d8e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if harness == "human": | ||
| return HumanPlayer(skip_single=skip_single) | ||
| if harness not in HARNESS_REGISTRY: | ||
| raise ValueError(f"Unknown harness '{harness}'. Valid: {valid}") | ||
| is_random_model, seed = _parse_random_choice_seed(model) | ||
| if is_random_model: | ||
| return RandomAgent(seed=seed, debug=debug, skip_single=skip_single) |
There was a problem hiding this comment.
Handle
model=human when creating non-human harnesses
HarnessConfig allows model="human", but create_harness only special-cases harness == "human"; for any other harness it instantiates an LLM harness, which then fails when BaseHarness parses the model name. This turns inputs like llm-quest run --model human (with default harness=reasoning_recent) into runtime errors instead of a valid human run or an early validation error, and benchmark workers will emit ERROR rows for this configuration.
Useful? React with 👍 / 👎.
| is_random_model, seed = _parse_random_choice_seed(model) | ||
| if is_random_model: | ||
| return RandomAgent(seed=seed, debug=debug, skip_single=skip_single) |
There was a problem hiding this comment.
Prevent random model from silently bypassing selected harness
The factory returns RandomAgent whenever model matches random_choice*, even if the configured harness is reasoning_recent, planner, etc. Fresh evidence in this commit is that benchmark configs now include model: random_choice with non-random harnesses, so those runs execute random policy while still being reported under the chosen harness, corrupting harness-level comparisons and experiment attribution.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a3f0b0081
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return create_harness( | ||
| harness="planner", | ||
| model=model, | ||
| temperature=temperature, | ||
| skip_single=skip_single, |
There was a problem hiding this comment.
Preserve memory_mode when routing planner/tool templates
This wrapper now maps planner.jinja/tool_augmented*.jinja directly to harness constructors without forwarding memory_mode, so legacy create_agent(..., action_template='planner.jinja'|'tool_augmented.jinja', memory_mode='default') calls silently run with compaction memory instead of their configured mode. That changes prompt context, token usage, and behavior for existing integrations that still use the template-based API, even though the function still accepts memory_mode.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ae2265874
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| interval_tag = f"_ci{self.compaction_interval}" if self.memory_mode == "compaction" else "" | ||
| config_str = f"{self.model}_{self.temperature}_{self.system_template}_{self.action_template}_{self.memory_mode}{interval_tag}" | ||
| config_str = f"{self.model}_{self.temperature}_{self.harness}_{self.system_template}_{self.compaction_interval}" |
There was a problem hiding this comment.
Exclude irrelevant compaction interval from non-compaction IDs
HarnessConfig.harness_id always includes compaction_interval in its hash input, even for harnesses that never use compaction (for example minimal, reasoning_recent, and reasoning_full). That means behaviorally identical runs can get different agent_ids just because an unrelated interval value changed, which fragments benchmark aggregation/artifact paths and can skew comparisons that key by agent_id.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 131ecad7b0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| memory_mode = config.get("memory_mode") | ||
| if _is_retired_result( |
There was a problem hiding this comment.
Read memory mode from result rows when classifying modes
Use the memory_mode already stored on each benchmark result_row as a fallback here; currently mode derivation only reads config.get("memory_mode") from db_runs, but harness-based runs no longer serialize memory_mode in agent_config (they carry it on result rows via _result_entry). For reasoning_full runs this causes reasoning.jinja to be classified as short-context instead of full-history, which silently corrupts leaderboard taxonomy and any mode-level comparisons built from it.
Useful? React with 👍 / 👎.
Summary
minimal,reasoning_recent,reasoning_full,memo_compact,hinted_compact,tool_compact,tool_hinted,planner).llm_quest_benchmark/harnesses/and remove legacy LLM/planner/tool/strategic agent wrappers.agents/toplayers/soagentsno longer suggests LLM agent implementations.Verification
uv run pytest-178 passed.uv run pytest llm_quest_benchmark/tests/harnesses llm_quest_benchmark/tests/players llm_quest_benchmark/tests/integration/test_benchmark.py llm_quest_benchmark/tests/test_benchmark_with_directory.py-82 passed.uv run ruff check .uv run ruff format --check .