Skip to content

refactor benchmark agents into harnesses#56

Merged
yourconscience merged 24 commits into
masterfrom
harness
May 13, 2026
Merged

refactor benchmark agents into harnesses#56
yourconscience merged 24 commits into
masterfrom
harness

Conversation

@yourconscience

@yourconscience yourconscience commented May 11, 2026

Copy link
Copy Markdown
Owner

Summary

  • Replace template/memory-mode benchmark configs with canonical harness names (minimal, reasoning_recent, reasoning_full, memo_compact, hinted_compact, tool_compact, tool_hinted, planner).
  • Move LLM execution into llm_quest_benchmark/harnesses/ and remove legacy LLM/planner/tool/strategic agent wrappers.
  • Rename non-LLM primitives from agents/ to players/ so agents no longer suggests LLM agent implementations.
  • Tighten random/human config validation, remove unused retired prompt templates, and clean up harness memory/tool review feedback.
  • Update configs, CLI wiring, result attribution, leaderboard taxonomy, docs, and tests around the harness model.

Verification

  • uv run pytest - 178 passed.
  • uv run pytest llm_quest_benchmark/tests/harnesses llm_quest_benchmark/tests/players llm_quest_benchmark/tests/integration/test_benchmark.py llm_quest_benchmark/tests/test_benchmark_with_directory.py - 82 passed.
  • uv run ruff check .
  • uv run ruff format --check .

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the agent architecture by introducing a harness-based system, replacing the previous template-driven agent construction. The changes involve migrating configuration files to use a canonical harness key, implementing a new harness factory, and updating documentation to reflect the new taxonomy. While the refactor standardizes the agent execution loop, the review comments highlight concerns regarding the consolidation of retired experimental variants into the new canonical harnesses, which may affect result comparability, and point out redundant legacy fallback logic in the benchmark executor.

Comment thread configs/benchmarks/exp4_compaction_no_memo.yaml Outdated
Comment thread configs/benchmarks/exp4_memo_cot.yaml Outdated
Comment thread llm_quest_benchmark/executors/benchmark.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47bf2a1a5e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_quest_benchmark/harnesses/base.py
Comment thread llm_quest_benchmark/harnesses/base.py Outdated
@coderabbitai

coderabbitai Bot commented May 11, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4e216553-b506-4cf2-bb77-3cd74c20e37e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Migrates from agents/templates to harness-based runtime with new base/memory/tools/planner/tool harnesses and factory; updates YAML configs, schemas, CLI, executor, and tests; deprecates legacy agents; and filters retired results in leaderboard generation.

Changes

Harness Architecture Refactor

Layer / File(s) Summary
Base harness, memory, minimal/reasoning
llm_quest_benchmark/harnesses/*
Adds BaseHarness, Default/Full/Compaction memory, Minimal and Reasoning harnesses.
Memo, planner, tool harnesses and tools
llm_quest_benchmark/harnesses/*
Implements memo_compact variants, PlannerHarness, ToolCompact/Hinted, and calculator/scratchpad/history tools.
Harness factory and schema migration
llm_quest_benchmark/harnesses/factory.py, llm_quest_benchmark/schemas/*
Adds HARNESS_REGISTRY and create_harness; introduces HarnessConfig and updates schemas.
Executor, CLI, runner updates
llm_quest_benchmark/executors/*, llm_quest_benchmark/core/runner.py
Executor records harness, uses create_harness; CLI exposes --harness; runner type accepts HarnessConfig.
Legacy agents deprecation and routing
llm_quest_benchmark/agents/*
Wraps/aliases legacy agents to harnesses; routes planner/tool templates via harness factory.
Leaderboard retirement filtering
llm_quest_benchmark/core/leaderboard.py
Excludes retired benchmarks/harnesses/templates from aggregation.
YAML configs migrated to harness field
configs/**/*.yaml
Replaces template+memory_mode with harness across all configs.
Docs reframed to harness architecture
docs/*
Updates SPEC/ARCHITECTURE, adds HARNESS_ENGINEERING and EXPERIMENT_AUDIT, maps harness names.
Tests migrated to harness APIs
llm_quest_benchmark/tests/**/*
Updates unit/integration tests to harness flow; adds comprehensive harness tests; retires agent-mode tests.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant Factory
  participant Harness
  participant Memory
  participant LLM
  CLI->>Factory: create_harness(harness, model, params)
  Factory-->>CLI: Harness instance
  CLI->>Harness: get_action(observation, choices)
  Harness->>Memory: get_context(step)
  Harness->>LLM: completion(system, prompt)
  LLM-->>Harness: response
  Harness->>Harness: parse + safety + retries
  Harness->>Memory: update(step_data)
  Harness-->>CLI: action index
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

A harness hops where agents used to tread,
Compact memos nibble crumbs of what was said.
Tools twirl, plans unfurl—so neatly composed,
Retired tracks fade where the leaderboard glows.
I thump approval—carrots in my pack—
Onward we bound, along the harness track! 🐇✨

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch harness

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 597ba7ac0b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_quest_benchmark/harnesses/factory.py Outdated
Comment thread llm_quest_benchmark/schemas/config.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (12)
docs/HARNESS_ENGINEERING.md (1)

28-28: ⚡ Quick win

Use the canonical ReAct term for loop naming.

On Line 28, react loop is ambiguous and can be read as the React framework; ReAct loop is the clearer method name here.

Proposed doc tweak
-| `planner` | Uses a plan-maintain-act loop with compact memory instead of a pure react loop. |
+| `planner` | Uses a plan-maintain-act loop with compact memory instead of a pure ReAct loop. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/HARNESS_ENGINEERING.md` at line 28, Update the planner description to
use the canonical "ReAct" term: change the phrase "react loop" in the table cell
for `planner` to "ReAct loop" (e.g., "`Uses a plan-maintain-act loop with
compact memory instead of a pure ReAct loop.`") so the method name is
unambiguous and distinct from the React framework.
llm_quest_benchmark/harnesses/tools.py (1)

9-54: 💤 Low value

Calculator sandboxing looks solid; minor note on float literals.

The two-stage defense (character whitelist + AST allow-list with no ast.Name/ast.Call) blocks identifier and call-based escapes, and the exponent cap at line 82 prevents trivial DoS via **. One quirk worth being aware of: the whitelist permits scientific notation, so an input like 1e400 parses to a float inf (and 1e400 - 1e400 yields nan), which then propagates through arithmetic/comparisons and is rendered into the returned string. If you want results to be bounded numbers, consider rejecting non-finite results before formatting.

♻️ Optional guard against inf/nan results
         result = _eval_calculator_node(tree.body)
+        if isinstance(result, float) and (result != result or result in (float("inf"), float("-inf"))):
+            return "error: non-finite result"
     except Exception as exc:
         return f"error: {exc}"
     return f"{expr} = {result}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/tools.py` around lines 9 - 54, The calculator
currently allows float infinities/NaNs through (e.g., "1e400") which then get
formatted into the result; after computing result =
_eval_calculator_node(tree.body) in calculator(expression: str) validate that
the result is a finite numeric value (use math.isfinite for int/float, and treat
booleans as valid) and if not finite return a clear error like "error:
non-finite result" before the final return; ensure this check handles numeric
types only and still lets boolean results pass.
llm_quest_benchmark/agents/planner_agent.py (1)

7-7: ⚡ Quick win

Avoid import-time deprecation side effects.

Emitting DeprecationWarning during module import can break consumers running with warnings-as-errors. Prefer warning when PlannerAgent is instantiated (or first accessed) instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/agents/planner_agent.py` at line 7, The module currently
emits a DeprecationWarning at import time via warnings.warn; instead move that
warning into the PlannerAgent class so it only fires when the API is actually
used (e.g., inside PlannerAgent.__init__ or the first property/method called).
Update planner_agent.PlannerAgent to call warnings.warn("planner_agent is
deprecated, use harnesses.planner", DeprecationWarning, stacklevel=2) during
instantiation (or first access) and remove the top-level warnings.warn from the
module.
llm_quest_benchmark/harnesses/tool_harness.py (1)

214-224: 💤 Low value

Error tag references retired tool_agent naming.

The error response is tagged tool_agent_error (line 220) but this harness is ToolCompactHarness. With the agents→harnesses rename, downstream consumers grepping logs for tool_harness_error (mirroring planner_error in planner.py line 179) will miss these. Consider renaming for consistency.

♻️ Proposed change
-                reasoning=f"tool_agent_error: {exc}",
+                reasoning=f"tool_harness_error: {exc}",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/tool_harness.py` around lines 214 - 224, The
error tag in ToolCompactHarness's exception handler uses the retired
"tool_agent_error" label; update the reasoning tag and any related log text to
"tool_harness_error" so downstream consumers matching planner_error semantics
will find it. Specifically, change the reasoning f-string in the exception block
that constructs the LLMResponse (and the accompanying logger.error message if it
embeds the old tag) to use "tool_harness_error" and keep LLMResponse creation
(action, is_default, parse_mode) and history/_last_response updates unchanged.
llm_quest_benchmark/harnesses/memory.py (1)

281-286: 💤 Low value

Dead-fallback branch in _format_transcript_for_compaction.

_format_transcript_for_compaction is only reachable from _maybe_compact, which has already returned when _steps_since_compaction < compaction_interval. The else self._transcript[-self.compaction_interval:] branch is therefore unreachable in practice. Either drop the fallback or assert/document the precondition so future call sites can't accidentally widen the contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/memory.py` around lines 281 - 286, The fallback
branch in _format_transcript_for_compaction is dead because callers (notably
_maybe_compact) only call it when _steps_since_compaction >=
compaction_interval; remove the unreachable else branch and simplify the slice
to use only the intended precondition, or alternatively add an explicit
precondition check/assertion (e.g., assert _steps_since_compaction > 0 or assert
_steps_since_compaction >= compaction_interval) with a brief docstring comment
on _format_transcript_for_compaction explaining that callers must only invoke it
when _steps_since_compaction >= compaction_interval; update references to
_transcript, _steps_since_compaction, and compaction_interval accordingly.
llm_quest_benchmark/schemas/config.py (2)

216-223: 💤 Low value

Duplicated legacy-key rejection.

The template/memory_mode checks on lines 219-222 are already enforced inside HarnessConfig.__init__ (lines 84-87) via **legacy_keys. Removing them here keeps a single source of truth for the validation messages and avoids drift if the wording is later tweaked in only one location.

♻️ Proposed simplification
         if "agents" in data:
             agents = []
             for agent in data["agents"]:
-                if "template" in agent:
-                    raise ValueError("Use harness: key instead of template:")
-                if "memory_mode" in agent:
-                    raise ValueError("Use harness: key instead of memory_mode:")
                 agents.append(HarnessConfig(**agent))
             data["agents"] = agents
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/schemas/config.py` around lines 216 - 223, Remove the
duplicated legacy-key rejection in the agents parsing loop by deleting the
explicit checks for "template" and "memory_mode" (the two if "template" in agent
/ if "memory_mode" in agent branches) and let HarnessConfig(**agent) perform the
validation/raise; this centralizes the error message logic in
HarnessConfig.__init__ (the legacy_keys handling) and prevents divergent wording
or double-checks while keeping the agents list construction
(agents.append(HarnessConfig(**agent))) intact.

57-101: 💤 Low value

Redundant @dataclass decorator combined with custom __init__.

The @dataclass decorator's generated __init__ is fully overridden by the explicit __init__ (lines 71-101), so the decorator's primary benefit is lost while still requiring the manual __post_init__() call at line 101. Either drop @dataclass and make it a plain class, or keep @dataclass and move legacy-key rejection into __post_init__ (rejecting keys collected via a sentinel) so the generated __init__ is preserved. Current form is functional but confusing for readers expecting standard dataclass semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/schemas/config.py` around lines 57 - 101, The
HarnessConfig class currently decorates with `@dataclass` but fully overrides the
generated __init__, which is confusing; pick one approach—either remove
`@dataclass` and keep the explicit __init__ and __post_init__ call, or restore
dataclass semantics by removing the explicit __init__ and moving the legacy-key
validation into __post_init__ (use a sentinel field like _raw_kwargs: dict =
field(default_factory=dict, repr=False) to capture unexpected kwargs or accept
**kwargs in the dataclass-generated __init__ signature), then perform the
template/memory_mode and unexpected-key checks inside __post_init__; update
references to HarnessConfig.__init__ and HarnessConfig.__post_init__ accordingly
and remove the manual __post_init__() call if you keep `@dataclass` and let
dataclass initialization call it automatically.
llm_quest_benchmark/harnesses/base.py (4)

559-562: 💤 Low value

Force-numeric retry heuristic matches any OpenAI model starting with o.

self.model_spec.model_id.startswith("o") is meant to cover the o1/o3/o4-mini reasoning families, but it will also match any future OpenAI model whose normalized ID begins with o (e.g., a hypothetical omni-…), forcing them through the JSON-then-number-only retry path even when they already return well-formed JSON. Consider tightening to an explicit prefix list or a regex like ^o\d:

♻️ Proposed tightening
-    def _needs_force_numeric_retry(self) -> bool:
-        return self.model_spec.provider == "openai" and (
-            self.model_spec.model_id.startswith("gpt-5") or self.model_spec.model_id.startswith("o")
-        )
+    def _needs_force_numeric_retry(self) -> bool:
+        model_id = self.model_spec.model_id
+        return self.model_spec.provider == "openai" and (
+            model_id.startswith("gpt-5") or bool(re.match(r"^o\d", model_id))
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/base.py` around lines 559 - 562, The
`_needs_force_numeric_retry` method currently treats any OpenAI model whose
`model_id` starts with "o" as needing the numeric-retry path; tighten this by
checking only the intended families—either match a regex like `^o\d` against
`self.model_spec.model_id` or test against an explicit allowed-prefix list such
as ("o1", "o3", "o4")—and keep the `self.model_spec.provider == "openai"` guard.
Update `_needs_force_numeric_retry` to use the chosen stricter test (e.g.,
`re.match(r"^o\d", self.model_spec.model_id)` or
`self.model_spec.model_id.startswith(prefix)` against the tuple) so future
"o"-prefixed model IDs (e.g., "omni-") won’t be forced into the numeric-only
retry path.

599-609: 💤 Low value

Safety filter is applied twice on the retry path.

_parse_with_retries applies _apply_safety_filter at lines 599-602, but PlannerHarness._get_action_impl (line 154) and ToolCompactHarness._get_action_impl (line 198) also call _apply_safety_filter on the returned parsed_response.action. The override is idempotent (re-applying produces the same answer), so this isn't a correctness bug, but the policy_safety_override reasoning tag can be set in either layer and the second layer can no-op silently — making it ambiguous which layer is the source of truth.

Consider centralizing safety filtering either inside _parse_with_retries (and removing it from the per-harness implementations) or outside (and removing it here), so reasoning/usage attribution is consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/base.py` around lines 599 - 609, The safety
filter is being applied in multiple places causing ambiguous reasoning
attribution; choose one location and remove the duplicate calls. Either (A) keep
safety filtering inside _parse_with_retries (the block that calls
_apply_safety_filter and sets parsed_response.reasoning =
"policy_safety_override") and remove any calls to _apply_safety_filter in
PlannerHarness._get_action_impl and ToolCompactHarness._get_action_impl, or (B)
move filtering out of _parse_with_retries and ensure the harnesses call
_apply_safety_filter once and set parsed_response.reasoning there; update only
the redundant callers accordingly so _apply_safety_filter and the
policy_safety_override tag are set exactly once.

309-325: 💤 Low value

@abstractmethod decorator on reset is meaningful but conflicts with the in-body call to super().reset().

reset is marked @abstractmethod while also providing a real implementation that subclasses are expected to extend via super().reset(). This pattern works in Python, but it makes the contract surprising: the abstractness only forces subclasses to override the name, not to call into the base, and the body looks reusable. Either drop @abstractmethod (since the base actually does meaningful state reset and subclasses currently always call super().reset()), or split into a concrete _reset_base_state() helper and keep reset truly abstract. The current form risks a subclass redefining reset without invoking super().reset() and silently leaking history/_observation_history/_state_action_counts across episodes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/base.py` around lines 309 - 325, The current
reset method is marked `@abstractmethod` yet contains real state-reset logic
(history, _last_response, _observation_history, _decision_history,
_state_action_counts, _step_count and memory_module.reset()), which risks
subclasses overriding reset and skipping this base cleanup; change reset to be a
concrete method that performs the shown base cleanup and removes the
`@abstractmethod` decorator, and introduce a separate optional hook (e.g., an
abstract or no-op _post_reset(self) or a protected _reset_base_state(self) that
contains the current body while reset calls it and then calls _post_reset) so
subclasses can extend behavior without accidentally skipping the base state
reset (refer to reset, _reset_base_state/_post_reset, history, _last_response,
_observation_history, _state_action_counts, memory_module).

450-467: 💤 Low value

_call_llm retry loop swallows nothing on the first attempt's TypeError when system_prompt is None.

The retry loop is structured so:

  • system_prompt is not None + TypeError → falls back to the no-system-prompt signature (line 461) and returns — no retries.
  • system_prompt is None + TypeErrorraise (line 462), terminates immediately.
  • any other Exception → recorded in last_error and retried up to 3 times.

This is reasonable, but two subtleties are worth tightening:

  1. The fallback at line 461 is itself unprotected — if the second call also raises (any exception type), it propagates without participating in the retry budget.
  2. If get_completion returns successfully on attempt 1, great; if all 3 attempts raise the same TypeError while system_prompt is None, only the first attempt actually runs because the loop body returns/raises before retrying. The for attempt in range(3) is misleading there.

Consider documenting these branches in a comment, or wrapping the line-461 fallback in the same last_error-tracking pattern so transient failures during the fallback are also retried.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/harnesses/base.py` around lines 450 - 467, The _call_llm
retry loop currently returns or re-raises immediately inside the TypeError
handler which bypasses the shared retry bookkeeping; update the except TypeError
block in _call_llm so that if system_prompt is not None you call
self.llm.get_completion(prompt) inside the same try/except logic (capture any
exception into last_error and allow the loop to continue/retry and log via
self.logger/debug like the other exceptions), and if system_prompt is None do
not re-raise immediately but set last_error = exc and let the loop retry up to 3
times (with the existing logging path) so all fallback attempts participate in
the retry budget and transient errors are recorded consistently.
llm_quest_benchmark/agents/llm_agent.py (1)

23-23: 💤 Low value

Module-import DeprecationWarning fires for every consumer.

Emitting the warning at module-import time triggers on any import path that touches llm_quest_benchmark.agents.llm_agent, including internal re-exports (parsing helpers, keyword constants, __all__). Consider moving the warning into LLMAgent.__init__ so only actual instantiations of the legacy class trigger it; this avoids spurious noise for callers that only use parse_llm_response or the keyword tuples through this module.

♻️ Proposed change
-warnings.warn("llm_agent is deprecated, use llm_quest_benchmark.harnesses", DeprecationWarning, stacklevel=2)
-
-
 class LLMAgent(MinimalHarness):
     """Backward-compatible LLMAgent facade backed by concrete harness classes."""

     SUPPORTED_MODELS = MODEL_CHOICES

     def __init__(
         self,
         ...
     ):
+        warnings.warn(
+            "LLMAgent is deprecated, use llm_quest_benchmark.harnesses",
+            DeprecationWarning,
+            stacklevel=2,
+        )
         ...
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm_quest_benchmark/agents/llm_agent.py` at line 23, The module-level
DeprecationWarning causes noise on import; remove the top-level warnings.warn
call and instead emit the deprecation inside the legacy class constructor: add a
warnings.warn(...) call in LLMAgent.__init__ (using DeprecationWarning and
stacklevel=2) so the warning fires only when LLMAgent is instantiated; leave
functions like parse_llm_response and exported constants untouched so importing
them no longer triggers the warning.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@llm_quest_benchmark/agents/agent_factory.py`:
- Around line 66-90: The branches handling resolved_action_template ==
"planner.jinja" and resolved_action_template in ("tool_augmented.jinja",
"tool_augmented_hints.jinja") call create_harness but drop the memory_mode
parameter from the create_agent API; update both create_harness calls in these
branches to pass through memory_mode (or, if you prefer fail-fast, validate
memory_mode at the top of create_agent and raise a clear error when a
non-default value is provided for these harness types). Locate the two blocks
that call create_harness (the planner branch and the
tool_augmented/tool_augmented_hints branch) and either add
memory_mode=memory_mode to their argument lists or add an early guard in
create_agent that checks resolved_action_template and memory_mode and raises/
logs a clear error.

In `@llm_quest_benchmark/executors/benchmark.py`:
- Line 109: The artifact write uses a derived agent_id expression ("agent_id":
agent_config.harness_id if hasattr(agent_config, "harness_id") else
agent_config.agent_id) but the timeout-persistence code still uses
agent_config.agent_id directly; change the timeout DB write to use the same
derivation expression (or extract a single helper like derived_agent_id =
agent_config.harness_id if hasattr(agent_config, "harness_id") else
agent_config.agent_id and use that) so both artifact writes and timeout writes
use identical agent_id logic (update the code that persists timeouts to
reference derived_agent_id instead of agent_config.agent_id).

In `@llm_quest_benchmark/executors/cli/commands.py`:
- Around line 55-56: HARNESS_CHOICES is built only from HARNESS_REGISTRY.keys(),
causing CLI validation to reject special harness identifiers (e.g.,
random_choice and seeded variants) that create_harness accepts; replace the
strict Choice(...) usage that uses HARNESS_CHOICES with a small custom validator
that accepts either a key in HARNESS_REGISTRY or any value for which
is_random_choice_harness(value) returns true, and apply the same change to the
other Choice usages around the second occurrence (the block currently around
where HARNESS_CHOICES is used again at the later lines) so the CLI accepts both
registry keys and valid random/seeded harness identifiers.

In `@llm_quest_benchmark/harnesses/memo.py`:
- Line 30: The current initialization uses a falsy-coalescing pattern that can
override valid false-y custom modules: replace the expression that sets
memory_module (the argument currently using "memory_module or
CompactionMemory(compaction_interval=compaction_interval)") with an explicit
None check so that if memory_module is not None it is used, otherwise construct
CompactionMemory(compaction_interval=compaction_interval); reference the
parameter name memory_module and the fallback class CompactionMemory (and the
compaction_interval arg) when making this change.

In `@llm_quest_benchmark/harnesses/memory.py`:
- Around line 249-279: The compaction path in _maybe_compact currently leaves
_steps_since_compaction at-or-above compaction_interval when the LLM returns an
empty string or throws, causing repeated compaction attempts; wrap the call to
self.llm_client.get_completion(...) in a try/except to catch exceptions and
treat failures as empty summaries, and when the summary is empty (or an
exception occurred) set self._steps_since_compaction back to 0 or to a small
back-off value (e.g., compaction_interval // 2) to avoid retrying every step;
keep the existing behavior of resetting _transcript, _compaction_summary and
_steps_since_compaction only on a truthy summary, but ensure failures/exceptions
update _steps_since_compaction to prevent a per-step retry storm.

In `@llm_quest_benchmark/harnesses/minimal.py`:
- Around line 42-47: The parsed_response.action must be clamped to valid bounds
before you persist state; change the order in minimal.py so you
validate/normalize parsed_response.action (ensure 1 <= parsed_response.action <=
len(choices)) immediately after parsing and before calling _remember_decision,
and then update self.history and self._last_response and call
self._remember_decision(observation, choices, state_signature, parsed_response);
finally return parsed_response.action so the recorded decision, last response,
history, and returned value are all consistent with the normalized action.

In `@llm_quest_benchmark/harnesses/planner.py`:
- Around line 167-172: The action clamp is applied after bookkeeping, so
_remember_decision and history record the original out-of-range action; move the
clamp before any bookkeeping and history mutation: in the Planner (method where
self.history.append(parsed_response) and self._remember_decision(...) are
called), validate and clamp parsed_response.action into the range [1,
len(choices)] immediately after parsing (or before calling
self._remember_decision and before appending to self.history/_last_response) so
the stored counts in _state_action_counts and entries in _decision_history match
the actual returned action.

In `@llm_quest_benchmark/harnesses/tools.py`:
- Around line 149-170: The output formatter is using direct dict access which
can raise KeyError for partial entries; update the code that builds lines (in
the same method that iterates over self.step_log and computes scored/best) to
mirror the defensive access used during scoring: use entry.get("step", 0),
entry.get("observation", ""), entry.get("choices", []) (and coerce to an
iterable/string before joining), and entry.get("selected_choice", "n/a") so
missing keys won't crash search().
- Around line 160-163: The fallback selects oldest entries because after
scored.sort(..., reverse=True) the most recent (highest step) items are at the
head, but the code uses scored[-self.history_window:] which takes the tail;
change the fallback to take the first self.history_window items from scored
(e.g., scored[: self.history_window]) and, if chronological ordering is required
by the caller, reverse that slice before assigning to best; update the logic
around scored, best and self.history_window accordingly.

---

Nitpick comments:
In `@docs/HARNESS_ENGINEERING.md`:
- Line 28: Update the planner description to use the canonical "ReAct" term:
change the phrase "react loop" in the table cell for `planner` to "ReAct loop"
(e.g., "`Uses a plan-maintain-act loop with compact memory instead of a pure
ReAct loop.`") so the method name is unambiguous and distinct from the React
framework.

In `@llm_quest_benchmark/agents/llm_agent.py`:
- Line 23: The module-level DeprecationWarning causes noise on import; remove
the top-level warnings.warn call and instead emit the deprecation inside the
legacy class constructor: add a warnings.warn(...) call in LLMAgent.__init__
(using DeprecationWarning and stacklevel=2) so the warning fires only when
LLMAgent is instantiated; leave functions like parse_llm_response and exported
constants untouched so importing them no longer triggers the warning.

In `@llm_quest_benchmark/agents/planner_agent.py`:
- Line 7: The module currently emits a DeprecationWarning at import time via
warnings.warn; instead move that warning into the PlannerAgent class so it only
fires when the API is actually used (e.g., inside PlannerAgent.__init__ or the
first property/method called). Update planner_agent.PlannerAgent to call
warnings.warn("planner_agent is deprecated, use harnesses.planner",
DeprecationWarning, stacklevel=2) during instantiation (or first access) and
remove the top-level warnings.warn from the module.

In `@llm_quest_benchmark/harnesses/base.py`:
- Around line 559-562: The `_needs_force_numeric_retry` method currently treats
any OpenAI model whose `model_id` starts with "o" as needing the numeric-retry
path; tighten this by checking only the intended families—either match a regex
like `^o\d` against `self.model_spec.model_id` or test against an explicit
allowed-prefix list such as ("o1", "o3", "o4")—and keep the
`self.model_spec.provider == "openai"` guard. Update
`_needs_force_numeric_retry` to use the chosen stricter test (e.g.,
`re.match(r"^o\d", self.model_spec.model_id)` or
`self.model_spec.model_id.startswith(prefix)` against the tuple) so future
"o"-prefixed model IDs (e.g., "omni-") won’t be forced into the numeric-only
retry path.
- Around line 599-609: The safety filter is being applied in multiple places
causing ambiguous reasoning attribution; choose one location and remove the
duplicate calls. Either (A) keep safety filtering inside _parse_with_retries
(the block that calls _apply_safety_filter and sets parsed_response.reasoning =
"policy_safety_override") and remove any calls to _apply_safety_filter in
PlannerHarness._get_action_impl and ToolCompactHarness._get_action_impl, or (B)
move filtering out of _parse_with_retries and ensure the harnesses call
_apply_safety_filter once and set parsed_response.reasoning there; update only
the redundant callers accordingly so _apply_safety_filter and the
policy_safety_override tag are set exactly once.
- Around line 309-325: The current reset method is marked `@abstractmethod` yet
contains real state-reset logic (history, _last_response, _observation_history,
_decision_history, _state_action_counts, _step_count and memory_module.reset()),
which risks subclasses overriding reset and skipping this base cleanup; change
reset to be a concrete method that performs the shown base cleanup and removes
the `@abstractmethod` decorator, and introduce a separate optional hook (e.g., an
abstract or no-op _post_reset(self) or a protected _reset_base_state(self) that
contains the current body while reset calls it and then calls _post_reset) so
subclasses can extend behavior without accidentally skipping the base state
reset (refer to reset, _reset_base_state/_post_reset, history, _last_response,
_observation_history, _state_action_counts, memory_module).
- Around line 450-467: The _call_llm retry loop currently returns or re-raises
immediately inside the TypeError handler which bypasses the shared retry
bookkeeping; update the except TypeError block in _call_llm so that if
system_prompt is not None you call self.llm.get_completion(prompt) inside the
same try/except logic (capture any exception into last_error and allow the loop
to continue/retry and log via self.logger/debug like the other exceptions), and
if system_prompt is None do not re-raise immediately but set last_error = exc
and let the loop retry up to 3 times (with the existing logging path) so all
fallback attempts participate in the retry budget and transient errors are
recorded consistently.

In `@llm_quest_benchmark/harnesses/memory.py`:
- Around line 281-286: The fallback branch in _format_transcript_for_compaction
is dead because callers (notably _maybe_compact) only call it when
_steps_since_compaction >= compaction_interval; remove the unreachable else
branch and simplify the slice to use only the intended precondition, or
alternatively add an explicit precondition check/assertion (e.g., assert
_steps_since_compaction > 0 or assert _steps_since_compaction >=
compaction_interval) with a brief docstring comment on
_format_transcript_for_compaction explaining that callers must only invoke it
when _steps_since_compaction >= compaction_interval; update references to
_transcript, _steps_since_compaction, and compaction_interval accordingly.

In `@llm_quest_benchmark/harnesses/tool_harness.py`:
- Around line 214-224: The error tag in ToolCompactHarness's exception handler
uses the retired "tool_agent_error" label; update the reasoning tag and any
related log text to "tool_harness_error" so downstream consumers matching
planner_error semantics will find it. Specifically, change the reasoning
f-string in the exception block that constructs the LLMResponse (and the
accompanying logger.error message if it embeds the old tag) to use
"tool_harness_error" and keep LLMResponse creation (action, is_default,
parse_mode) and history/_last_response updates unchanged.

In `@llm_quest_benchmark/harnesses/tools.py`:
- Around line 9-54: The calculator currently allows float infinities/NaNs
through (e.g., "1e400") which then get formatted into the result; after
computing result = _eval_calculator_node(tree.body) in calculator(expression:
str) validate that the result is a finite numeric value (use math.isfinite for
int/float, and treat booleans as valid) and if not finite return a clear error
like "error: non-finite result" before the final return; ensure this check
handles numeric types only and still lets boolean results pass.

In `@llm_quest_benchmark/schemas/config.py`:
- Around line 216-223: Remove the duplicated legacy-key rejection in the agents
parsing loop by deleting the explicit checks for "template" and "memory_mode"
(the two if "template" in agent / if "memory_mode" in agent branches) and let
HarnessConfig(**agent) perform the validation/raise; this centralizes the error
message logic in HarnessConfig.__init__ (the legacy_keys handling) and prevents
divergent wording or double-checks while keeping the agents list construction
(agents.append(HarnessConfig(**agent))) intact.
- Around line 57-101: The HarnessConfig class currently decorates with
`@dataclass` but fully overrides the generated __init__, which is confusing; pick
one approach—either remove `@dataclass` and keep the explicit __init__ and
__post_init__ call, or restore dataclass semantics by removing the explicit
__init__ and moving the legacy-key validation into __post_init__ (use a sentinel
field like _raw_kwargs: dict = field(default_factory=dict, repr=False) to
capture unexpected kwargs or accept **kwargs in the dataclass-generated __init__
signature), then perform the template/memory_mode and unexpected-key checks
inside __post_init__; update references to HarnessConfig.__init__ and
HarnessConfig.__post_init__ accordingly and remove the manual __post_init__()
call if you keep `@dataclass` and let dataclass initialization call it
automatically.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 70dd43a1-3520-4c51-b877-f9eaed4dc2f4

📥 Commits

Reviewing files that changed from the base of the PR and between 75522c6 and 3f040e5.

📒 Files selected for processing (68)
  • configs/benchmarks/exp3_no_loop_breaker.yaml
  • configs/benchmarks/exp3_stateful_compact.yaml
  • configs/benchmarks/exp4_compaction_no_memo.yaml
  • configs/benchmarks/exp4_memo_cot.yaml
  • configs/benchmarks/exp4_memo_extended.yaml
  • configs/benchmarks/exp4_memo_structured.yaml
  • configs/benchmarks/exp5_stateful_compact_variance.yaml
  • configs/benchmarks/exp6_prompt_hints.yaml
  • configs/benchmarks/exp6_tools.yaml
  • configs/benchmarks/exp6_tools_hints.yaml
  • configs/benchmarks/exp6_unified_tools_screen.yaml
  • configs/benchmarks/exp7_deepseek.yaml
  • configs/benchmarks/exp7_haiku.yaml
  • configs/benchmarks/exp7_llama.yaml
  • configs/benchmarks/exp7_mistral.yaml
  • configs/benchmarks/exp7_qwen.yaml
  • configs/benchmarks/exp7b_model_upgrades.yaml
  • configs/benchmarks/memory_compaction.yaml
  • configs/benchmarks/memory_full_transcript.yaml
  • configs/benchmarks/memory_modes_pilot.yaml
  • configs/benchmarks/openrouter_smoke_test.yaml
  • configs/default.yaml
  • configs/kr1.yaml
  • configs/kr1_micro.yaml
  • configs/kr1_test.yaml
  • configs/kr2_en_benchmark.yaml
  • configs/kr2_en_test.yaml
  • configs/test/parallel_agents_test.yaml
  • configs/test/temperature_test.yaml
  • configs/test/test_benchmark.yaml
  • docs/ARCHITECTURE.md
  • docs/EXPERIMENTS_LOG.md
  • docs/EXPERIMENT_AUDIT.md
  • docs/HARNESS_ENGINEERING.md
  • docs/SPEC.md
  • llm_quest_benchmark/agents/__init__.py
  • llm_quest_benchmark/agents/agent_factory.py
  • llm_quest_benchmark/agents/llm_agent.py
  • llm_quest_benchmark/agents/planner_agent.py
  • llm_quest_benchmark/agents/strategic_agent.py
  • llm_quest_benchmark/agents/tool_agent.py
  • llm_quest_benchmark/core/leaderboard.py
  • llm_quest_benchmark/core/runner.py
  • llm_quest_benchmark/executors/benchmark.py
  • llm_quest_benchmark/executors/cli/commands.py
  • llm_quest_benchmark/harnesses/__init__.py
  • llm_quest_benchmark/harnesses/base.py
  • llm_quest_benchmark/harnesses/factory.py
  • llm_quest_benchmark/harnesses/memo.py
  • llm_quest_benchmark/harnesses/memory.py
  • llm_quest_benchmark/harnesses/minimal.py
  • llm_quest_benchmark/harnesses/planner.py
  • llm_quest_benchmark/harnesses/reasoning.py
  • llm_quest_benchmark/harnesses/tool_harness.py
  • llm_quest_benchmark/harnesses/tools.py
  • llm_quest_benchmark/schemas/__init__.py
  • llm_quest_benchmark/schemas/config.py
  • llm_quest_benchmark/tests/agents/test_anthropic.py
  • llm_quest_benchmark/tests/agents/test_llm_agent.py
  • llm_quest_benchmark/tests/agents/test_mode_agents.py
  • llm_quest_benchmark/tests/harnesses/__init__.py
  • llm_quest_benchmark/tests/harnesses/test_factory.py
  • llm_quest_benchmark/tests/harnesses/test_harnesses.py
  • llm_quest_benchmark/tests/integration/test_benchmark.py
  • llm_quest_benchmark/tests/integration/test_mode_agents_e2e.py
  • llm_quest_benchmark/tests/integration/test_quest_e2e.py
  • llm_quest_benchmark/tests/test_benchmark_with_directory.py
  • llm_quest_benchmark/tests/test_leaderboard.py

Comment thread llm_quest_benchmark/agents/agent_factory.py Outdated
Comment thread llm_quest_benchmark/executors/benchmark.py Outdated
Comment thread llm_quest_benchmark/executors/cli/commands.py
Comment thread llm_quest_benchmark/harnesses/memo.py Outdated
Comment thread llm_quest_benchmark/harnesses/memory.py Outdated
Comment thread llm_quest_benchmark/harnesses/minimal.py
Comment thread llm_quest_benchmark/harnesses/planner.py
Comment thread llm_quest_benchmark/harnesses/tools.py
Comment thread llm_quest_benchmark/harnesses/tools.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f040e53af

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_quest_benchmark/schemas/__init__.py
Comment thread llm_quest_benchmark/schemas/config.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fea154f7ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +52 to +55
if resolved_action_template in harness_routed_templates and memory_mode != "default":
raise ValueError(
"memory_mode is not supported for planner/tool harness templates; configure memory via harness selection."
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve planner/tool compatibility with compaction memory_mode

Do not reject memory_mode for planner/tool templates in create_agent: older callers commonly pass action_template='planner.jinja' or 'tool_augmented*.jinja' together with memory_mode='compaction', which used to work and now raises immediately. This turns previously valid legacy configs/API usage into hard failures even when the requested mode matches the routed harness behavior, so benchmark jobs and scripts that were not migrated to harness will break at startup.

Useful? React with 👍 / 👎.

for entry in self.history:
self.logger.debug(f"\nObservation: {entry['observation']}")
self.logger.debug(f"Analysis: {entry['analysis']}")
raise ImportError("strategic_agent is deprecated; use llm_quest_benchmark.harnesses instead")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep strategic_agent importable as a deprecation wrapper

Replace the unconditional ImportError with a deprecation shim (like the other agent wrappers), because importing llm_quest_benchmark.agents.strategic_agent now always crashes at import time. Any existing integration that still imports StrategicAgent for backward compatibility cannot even start, which is a breaking change introduced by this refactor rather than a gradual deprecation path.

Useful? React with 👍 / 👎.

yourconscience and others added 17 commits May 12, 2026 12:02
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…late

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42a89b1d8e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +67 to +73
if harness == "human":
return HumanPlayer(skip_single=skip_single)
if harness not in HARNESS_REGISTRY:
raise ValueError(f"Unknown harness '{harness}'. Valid: {valid}")
is_random_model, seed = _parse_random_choice_seed(model)
if is_random_model:
return RandomAgent(seed=seed, debug=debug, skip_single=skip_single)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle model=human when creating non-human harnesses

HarnessConfig allows model="human", but create_harness only special-cases harness == "human"; for any other harness it instantiates an LLM harness, which then fails when BaseHarness parses the model name. This turns inputs like llm-quest run --model human (with default harness=reasoning_recent) into runtime errors instead of a valid human run or an early validation error, and benchmark workers will emit ERROR rows for this configuration.

Useful? React with 👍 / 👎.

Comment on lines +71 to +73
is_random_model, seed = _parse_random_choice_seed(model)
if is_random_model:
return RandomAgent(seed=seed, debug=debug, skip_single=skip_single)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prevent random model from silently bypassing selected harness

The factory returns RandomAgent whenever model matches random_choice*, even if the configured harness is reasoning_recent, planner, etc. Fresh evidence in this commit is that benchmark configs now include model: random_choice with non-random harnesses, so those runs execute random policy while still being reported under the chosen harness, corrupting harness-level comparisons and experiment attribution.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a3f0b0081

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 74 to 78
return create_harness(
harness="planner",
model=model,
temperature=temperature,
skip_single=skip_single,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve memory_mode when routing planner/tool templates

This wrapper now maps planner.jinja/tool_augmented*.jinja directly to harness constructors without forwarding memory_mode, so legacy create_agent(..., action_template='planner.jinja'|'tool_augmented.jinja', memory_mode='default') calls silently run with compaction memory instead of their configured mode. That changes prompt context, token usage, and behavior for existing integrations that still use the template-based API, even though the function still accepts memory_mode.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ae2265874

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_quest_benchmark/schemas/config.py Outdated

interval_tag = f"_ci{self.compaction_interval}" if self.memory_mode == "compaction" else ""
config_str = f"{self.model}_{self.temperature}_{self.system_template}_{self.action_template}_{self.memory_mode}{interval_tag}"
config_str = f"{self.model}_{self.temperature}_{self.harness}_{self.system_template}_{self.compaction_interval}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude irrelevant compaction interval from non-compaction IDs

HarnessConfig.harness_id always includes compaction_interval in its hash input, even for harnesses that never use compaction (for example minimal, reasoning_recent, and reasoning_full). That means behaviorally identical runs can get different agent_ids just because an unrelated interval value changed, which fragments benchmark aggregation/artifact paths and can skew comparisons that key by agent_id.

Useful? React with 👍 / 👎.

Comment thread llm_quest_benchmark/harnesses/factory.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 131ecad7b0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread llm_quest_benchmark/core/leaderboard.py Outdated
Comment on lines +388 to +389
memory_mode = config.get("memory_mode")
if _is_retired_result(

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Read memory mode from result rows when classifying modes

Use the memory_mode already stored on each benchmark result_row as a fallback here; currently mode derivation only reads config.get("memory_mode") from db_runs, but harness-based runs no longer serialize memory_mode in agent_config (they carry it on result rows via _result_entry). For reasoning_full runs this causes reasoning.jinja to be classified as short-context instead of full-history, which silently corrupts leaderboard taxonomy and any mode-level comparisons built from it.

Useful? React with 👍 / 👎.

@yourconscience yourconscience merged commit 4fb0a9d into master May 13, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant