22 commits
658a8f9
feat(swe-bench): implement iterative predictor for SWE-bench
Jerryguan777 Jan 14, 2026
2395ac4
fix CI warnings when PR
Jerryguan777 Jan 15, 2026
92b1e66
feat(swe-bench): isolate workspace per instance for parallel execution
Jerryguan777 Jan 16, 2026
6a69861
change nim_llm to nemotron-3-nano, add and fix license headers
Jerryguan777 Jan 17, 2026
f80efce
clean up existed repo folder when downloading
Jerryguan777 Jan 22, 2026
f8f430a
fix CI warning: shutil.rmtree in an async function blocks the event loop
Jerryguan777 Jan 22, 2026
4baa36c
feat(swe-bench): add command validation to prevent dangerous shell co…
Jerryguan777 Jan 28, 2026
75e94e3
feat(swe-bench): add timeout and error handling for git operations
Jerryguan777 Jan 28, 2026
61b02b2
fix(swe-bench): remove redundant shutil import in cleanup method
Jerryguan777 Jan 28, 2026
26fda75
fix(swe-bench): add specific exception handling for repo setup
Jerryguan777 Jan 28, 2026
03e0b46
fix(swe-bench): add error handling for workspace cleanup
Jerryguan777 Jan 28, 2026
55039a3
docs(swe-bench): add benchmark context for iterative predictor
Jerryguan777 Jan 28, 2026
3f62472
docs(swe-bench): clarify step_limit override in config
Jerryguan777 Jan 28, 2026
0c4ded9
docs(swe-bench): add complete docstrings for public methods
Jerryguan777 Jan 28, 2026
df324e0
feat(swe-bench): add simple metrics logging for iterative agent
Jerryguan777 Jan 28, 2026
7ec5f78
test(swe-bench): add comprehensive tests for iterative predictor
Jerryguan777 Jan 28, 2026
201e379
style(swe-bench): rename fixtures to follow project naming convention
Jerryguan777 Jan 28, 2026
4569449
refactor(swe-bench): use pytest.raises for explicit exception assertion
Jerryguan777 Jan 28, 2026
a140745
fix(swe-bench): resolve ruff linting warnings in tests
Jerryguan777 Jan 28, 2026
b44a7c6
feat(swe-bench): add shell bypass pattern detection for security
Jerryguan777 Feb 2, 2026
e8e7e82
chore(swe-bench): code review fixes
Jerryguan777 Feb 2, 2026
329c712
fix(swe-bench): add input sanitization for get_repo_path
Jerryguan777 Feb 2, 2026
12 changes: 12 additions & 0 deletions examples/evaluation_and_profiling/swe_bench/README.md
@@ -159,6 +159,18 @@ That information is only used for evaluation. Using it can taint the predictor a
These predictors are provided in this NeMo Agent Toolkit example:
- `gold` - Uses the patch from the `SWEBenchInput` instance, bypassing problem-solving logic. See [predict_gold_stub.py](src/nat_swe_bench/predictors/predict_gold/predict_gold_stub.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_gold.yml`.
- `skeleton` - Skeleton code for creating a problem-solving workflow. This code can be copied to create a net-new predictor. See [predict_skeleton.py](src/nat_swe_bench/predictors/predict_skeleton/predict_skeleton.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_skeleton.yml`.
- `iterative` - Iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. See [predict_iterative.py](src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml`.

Contributor:
The PR description mentions a 70% success rate but doesn't provide context about what this means for benchmark saturation. I recommend adding the section below (provided as a suggested change) to the README to provide that context.

Author:
Updated as suggested. Thanks.

### Benchmark Context (January 2026)

The iterative predictor achieves a 70% success rate on SWE-bench Lite, which primarily reflects the capabilities of modern foundation models (Claude Sonnet 4.5, GPT-5.2) rather than framework-specific innovations. SWE-bench Lite is approaching saturation at 70-80% with simple agent architectures.

**For evaluating framework improvements beyond task correctness, consider tracking:**
- **Efficiency metrics:** Tokens consumed, steps taken, cost per solution (see the sketch below)
- **Reliability metrics:** Success rate variance over multiple runs
- **Harder benchmarks:** SWE-bench Verified (currently ~35% SOTA, not saturated) or the full SWE-bench dataset (2,294 problems)

This positions the iterative predictor as a reference implementation demonstrating the NeMo Agent Toolkit's builder pattern and tool integration capabilities.
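
To make the efficiency and reliability metrics above concrete, here is a minimal sketch of per-run metric tracking. It is illustrative only, built on plain dataclasses; the names (`RunMetrics`, `MetricsAggregator`) are hypothetical and are not taken from this PR's metrics-logging commit.

```python
# Hypothetical sketch of the metrics named above; not this PR's implementation.
from dataclasses import dataclass, field
from statistics import mean, pvariance


@dataclass
class RunMetrics:
    """Efficiency metrics for one agent run on one SWE-bench instance."""
    steps_taken: int
    tokens_consumed: int
    cost_usd: float
    solved: bool


@dataclass
class MetricsAggregator:
    """Aggregates runs into the efficiency/reliability metrics listed above."""
    runs: list[RunMetrics] = field(default_factory=list)

    def record(self, run: RunMetrics) -> None:
        self.runs.append(run)

    def success_rate(self) -> float:
        return mean(float(r.solved) for r in self.runs) if self.runs else 0.0

    def success_rate_variance(self) -> float:
        # Reliability: variance of solved/unsolved outcomes across runs.
        return pvariance(float(r.solved) for r in self.runs) if len(self.runs) > 1 else 0.0

    def cost_per_solution(self) -> float:
        total_cost = sum(r.cost_usd for r in self.runs)
        solutions = sum(r.solved for r in self.runs)
        return total_cost / solutions if solutions else float("inf")
```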

### Adding a net new predictor
To add a new predictor:
@@ -16,31 +16,59 @@
import typing

from pydantic import Discriminator
from pydantic import Field
from pydantic import Tag

from nat.data_models.common import BaseModelRegistryTag
from nat.data_models.common import TypedBaseModel
from nat.data_models.component_ref import LLMRef
from nat.data_models.function import FunctionBaseConfig


class SweBenchPredictorBaseConfig(TypedBaseModel, BaseModelRegistryTag):
    """Base configuration class for SWE-bench predictors."""
    description: str = "Swe Bench Problem Solver"


class SweBenchPredictorGoldConfig(SweBenchPredictorBaseConfig, name="gold"):
    """Configuration for the gold predictor that uses the provided patch directly.

    Attributes:
        verbose: Whether to enable verbose output for debugging.
    """
    verbose: bool = True


class SweBenchPredictorSkeletonConfig(SweBenchPredictorBaseConfig, name="skeleton"):
    """Configuration for the skeleton predictor template.

    Attributes:
        verbose: Whether to enable verbose output for debugging.
    """
    verbose: bool = False


class SweBenchPredictorIterativeConfig(SweBenchPredictorBaseConfig, name="iterative"):
    """Configuration for the iterative predictor that solves problems step-by-step.

    Attributes:
        llm_name: Reference to the LLM to use for iterative problem solving.
        step_limit: Maximum number of agent steps before termination.
        timeout: Command execution timeout in seconds.
    """
    llm_name: LLMRef = Field(description="LLM to use for iterative agent")
    step_limit: int = Field(default=250, description="Maximum number of agent steps")
    timeout: int = Field(default=60, description="Command execution timeout in seconds")


SweBenchPredictorConfig = typing.Annotated[
    typing.Annotated[SweBenchPredictorGoldConfig, Tag(SweBenchPredictorGoldConfig.static_type())]
    | typing.Annotated[SweBenchPredictorSkeletonConfig, Tag(SweBenchPredictorSkeletonConfig.static_type())]
    | typing.Annotated[SweBenchPredictorIterativeConfig, Tag(SweBenchPredictorIterativeConfig.static_type())],
    Discriminator(TypedBaseModel.discriminator)]


class SweBenchWorkflowConfig(FunctionBaseConfig, name="swe_bench"):
    """Configuration for the SWE-bench workflow.

    Attributes:
        predictor: The predictor configuration (gold, skeleton, or iterative).
    """
    predictor: SweBenchPredictorConfig
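
The `SweBenchPredictorConfig` union above uses pydantic's discriminated-union machinery (`Tag` entries plus a `Discriminator`) so the `_type` value in the YAML selects the concrete config class. Below is a simplified, self-contained sketch of the same dispatch pattern; it assumes pydantic v2 and uses a plain string discriminator instead of the toolkit's callable one, and all class names in it are hypothetical.

```python
# Simplified analogue of SweBenchPredictorConfig; illustrative, not toolkit code.
import typing

from pydantic import BaseModel, Discriminator, TypeAdapter


class GoldCfg(BaseModel):
    type: typing.Literal["gold"] = "gold"
    verbose: bool = True


class IterativeCfg(BaseModel):
    type: typing.Literal["iterative"] = "iterative"
    llm_name: str
    step_limit: int = 250
    timeout: int = 60


# The "type" field plays the role of "_type" in the workflow YAML.
PredictorCfg = typing.Annotated[GoldCfg | IterativeCfg, Discriminator("type")]

cfg = TypeAdapter(PredictorCfg).validate_python(
    {"type": "iterative", "llm_name": "openai_llm", "step_limit": 100})
assert isinstance(cfg, IterativeCfg) and cfg.step_limit == 100
```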
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
@@ -0,0 +1,78 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

llms:
  nim_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0.0
    max_tokens: 4096
  claude_sonnet_llm:
    _type: litellm
    model_name: anthropic/claude-sonnet-4-5-20250929
    temperature: 0.0
    api_key: "${ANTHROPIC_API_KEY}"
  openai_llm:
    _type: openai
    model_name: gpt-5.2
    temperature: 0.0
    api_key: "${OPENAI_API_KEY}"
Comment on lines +22 to +31
coderabbitai bot, Feb 1, 2026:

⚠️ Potential issue | 🟡 Minor

Use NAT_-prefixed environment variables for API keys.

✏️ Suggested update
-    api_key: "${ANTHROPIC_API_KEY}"
+    api_key: "${NAT_ANTHROPIC_API_KEY}"
@@
-    api_key: "${OPENAI_API_KEY}"
+    api_key: "${NAT_OPENAI_API_KEY}"

As per coding guidelines, use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`
around lines 22 - 31, Update the api_key environment variable names in the YAML
entries for claude_sonnet_llm and openai_llm so they use the NAT_ prefix;
replace "${ANTHROPIC_API_KEY}" with "${NAT_ANTHROPIC_API_KEY}" in the
claude_sonnet_llm block and "${OPENAI_API_KEY}" with "${NAT_OPENAI_API_KEY}" in
the openai_llm block to follow the NAT_ environment variable naming convention.

Author:

I'll keep the standard environment variable names (OPENAI_API_KEY, ANTHROPIC_API_KEY) for consistency with:

1. Other examples in this repository (e.g., nat_autogen_demo, agno_personal_finance)
2. Industry-standard conventions that users likely already have configured
3. The NAT_ prefix pattern in this project, which is used for NAT-specific configs (OAuth, user IDs), not third-party API keys

coderabbitai bot:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for API key environment variable usage patterns in examples directory
echo "=== Searching for ANTHROPIC_API_KEY usage ==="
rg 'ANTHROPIC_API_KEY' examples/ -n -C2 --type yaml

echo -e "\n=== Searching for OPENAI_API_KEY usage ==="
rg 'OPENAI_API_KEY' examples/ -n -C2 --type yaml

echo -e "\n=== Searching for NAT_ prefixed API keys ==="
rg 'NAT_.*API_KEY' examples/ -n -C2 --type yaml

echo -e "\n=== Checking specific examples mentioned by user ==="
fd -t f . examples/advanced_agents/nat_autogen_demo examples/advanced_agents/agno_personal_finance -e yml -e yaml -x cat {}

Length of output: 3286


@Jerryguan777, thank you for the clarification! You're absolutely right—I verified that other examples in the repository (nat_autogen_demo, agno_personal_finance) consistently use the standard OPENAI_API_KEY naming convention, and there are no NAT_-prefixed third-party API keys in the examples directory.

Your reasoning makes sense: using industry-standard environment variable names improves developer experience since users likely already have them configured. I appreciate you distinguishing between NAT-specific configuration variables and third-party service API keys.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to **/*.{py,md,toml,yaml,yml} : Use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to docs/source/**/*.md : Never use deprecated names 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' when referring to NeMo Agent Toolkit; update any occurrences found in documentation unless intentionally referencing deprecated names or compatibility layers.

Learnt from: willkill07
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1512
File: docs/source/run-workflows/launching-ui.md:20-20
Timestamp: 2026-01-30T15:03:44.777Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository documentation, "NeMo Agent Toolkit" should always be spelled with a capital "T" in "Toolkit". The documentation guideline states: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit."

Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 701
File: examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-with-custom-post-process.yml:16-18
Timestamp: 2025-08-22T20:29:13.433Z
Learning: In the NeMo-Agent-Toolkit project, configuration consistency across example files is prioritized over removing unused config options, even when those options aren't currently implemented in the runtime.

Learnt from: Jerryguan777
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1485
File: examples/advanced_agents/sandbox_agent/docker-compose.yaml:31-33
Timestamp: 2026-01-26T01:05:01.390Z
Learning: In the Sandbox Agent (examples/advanced_agents/sandbox_agent), host-side tools like web_search (Tavily) and youtube_transcript run on the host machine, not inside the Docker sandbox container. Therefore, their API keys (e.g., TAVILY_API_KEY) do not need to be passed as environment variables in docker-compose.yaml.

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Never commit API keys, credentials or personal data; use environment variables or '.env' files excluded from Git.

Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:51.370Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, most YAML configuration files use 2-space indentation rather than 4-space indentation. When reviewing YAML files, prefer consistency with the existing codebase style (2-space indentation) over the stated coding guideline (4-space indentation) until a repository-wide standardization is performed.


workflow:
  _type: swe_bench
  predictor:
    _type: iterative
    llm_name: "openai_llm"  # "nim_llm" or "claude_sonnet_llm" or "openai_llm"
    step_limit: 100  # Overrides default (250)
    timeout: 60

functions:
  git_repo_tool:
    _type: git_repo_tool
    workspace_dir: "./.workspace"
    cleanup_on_exit: true

eval:
  general:
    output_dir: .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/
    max_concurrency: 5
    dataset:
      _type: parquet
      file_path: hf://datasets/princeton-nlp/SWE-bench_Lite/data/test-00000-of-00001.parquet
      id_key: instance_id
      structure:
        disable: true
      filter:
        allowlist:
          field:
            instance_id:
              - sympy__sympy-20590
              # - sympy__sympy-21055
              # - sympy__sympy-11400
              # - astropy__astropy-12907
              # - django__django-15781
              # - astropy__astropy-6938
              # - django__django-11001
              # - mwaskom__seaborn-3010
              # - pallets__flask-4045
              # - psf__requests-1963

  evaluators:
    swe_bench:
      _type: swe_bench
      run_id: nat_iterative_1
      clean: true
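
For context, a config like this is normally exercised through the toolkit's evaluation CLI, e.g. `nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml`. That invocation is an assumption based on the toolkit's standard eval entry point and the config path referenced in the README; it is not shown in this diff. The `OPENAI_API_KEY` environment variable must be set when `llm_name` is `"openai_llm"`.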


Member:
All files (even if no code) must contain the license headers

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Comment on lines +1 to +14
coderabbitai bot, Feb 1, 2026:

⚠️ Potential issue | 🟡 Minor

Add a module docstring for the package initializer.
This public module currently has only the SPDX header; add a short Google-style module docstring describing the package.

✏️ Suggested docstring
@@
 # limitations under the License.
+"""Iterative SWE-bench predictor package."""
+

As per coding guidelines, Provide Google-style docstrings for every public module, class, function and CLI command.

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py`
around lines 1 - 14, The package initializer (__init__.py) for the
predict_iterative module is missing a public module docstring; add a short
Google-style module docstring at the top of __init__.py that briefly describes
the package purpose, the main functionality it exposes (e.g., iterative
predictors for NAT SWE benchmarking), and any important public symbols or
modules contained within the package; ensure the docstring follows Google-style
(one-line summary, blank line, short elaboration) and is placed before any code
or headers.

Author (Jerryguan777, Feb 2, 2026):
Acknowledged. Keeping as-is for consistency with other __init__.py files in this project.

coderabbitai bot:
@Jerryguan777, understood. I'll skip this suggestion.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to {src,packages}/**/*.py : Provide Google-style docstrings for every public module, class, function and CLI command.
