22 commits
658a8f9
feat(swe-bench): implement iterative predictor for SWE-bench
Jerryguan777 Jan 14, 2026
2395ac4
fix CI warnings when PR
Jerryguan777 Jan 15, 2026
92b1e66
feat(swe-bench): isolate workspace per instance for parallel execution
Jerryguan777 Jan 16, 2026
6a69861
change nim_llm to nemotron-3-nano, add and fix license headers
Jerryguan777 Jan 17, 2026
f80efce
clean up existed repo folder when downloading
Jerryguan777 Jan 22, 2026
f8f430a
fix CI warning: shutil.rmtree in an async function blocks the event loop
Jerryguan777 Jan 22, 2026
4baa36c
feat(swe-bench): add command validation to prevent dangerous shell co…
Jerryguan777 Jan 28, 2026
75e94e3
feat(swe-bench): add timeout and error handling for git operations
Jerryguan777 Jan 28, 2026
61b02b2
fix(swe-bench): remove redundant shutil import in cleanup method
Jerryguan777 Jan 28, 2026
26fda75
fix(swe-bench): add specific exception handling for repo setup
Jerryguan777 Jan 28, 2026
03e0b46
fix(swe-bench): add error handling for workspace cleanup
Jerryguan777 Jan 28, 2026
55039a3
docs(swe-bench): add benchmark context for iterative predictor
Jerryguan777 Jan 28, 2026
3f62472
docs(swe-bench): clarify step_limit override in config
Jerryguan777 Jan 28, 2026
0c4ded9
docs(swe-bench): add complete docstrings for public methods
Jerryguan777 Jan 28, 2026
df324e0
feat(swe-bench): add simple metrics logging for iterative agent
Jerryguan777 Jan 28, 2026
7ec5f78
test(swe-bench): add comprehensive tests for iterative predictor
Jerryguan777 Jan 28, 2026
201e379
style(swe-bench): rename fixtures to follow project naming convention
Jerryguan777 Jan 28, 2026
4569449
refactor(swe-bench): use pytest.raises for explicit exception assertion
Jerryguan777 Jan 28, 2026
a140745
fix(swe-bench): resolve ruff linting warnings in tests
Jerryguan777 Jan 28, 2026
b44a7c6
feat(swe-bench): add shell bypass pattern detection for security
Jerryguan777 Feb 2, 2026
e8e7e82
chore(swe-bench): code review fixes
Jerryguan777 Feb 2, 2026
329c712
fix(swe-bench): add input sanitization for get_repo_path
Jerryguan777 Feb 2, 2026
12 changes: 12 additions & 0 deletions examples/evaluation_and_profiling/swe_bench/README.md
@@ -159,6 +159,18 @@ That information is only used for evaluation. Using it can taint the predictor a
These predictors are provided in this NeMo Agent Toolkit example:
- `gold` - Uses the patch from the `SWEBenchInput` instance, bypassing problem-solving logic. See [predict_gold_stub.py](src/nat_swe_bench/predictors/predict_gold/predict_gold_stub.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_gold.yml`.
- `skeleton` - Skeleton code for creating a problem-solving workflow. This code can be copied to create a net-new predictor. See [predict_skeleton.py](src/nat_swe_bench/predictors/predict_skeleton/predict_skeleton.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_skeleton.yml`.
- `iterative` - Iterative agent that solves problems by executing bash commands step-by-step, observing results, and generating patches. See [predict_iterative.py](src/nat_swe_bench/predictors/predict_iterative/predict_iterative.py) and configuration file `examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml`.

Contributor:
The PR description mentions a 70% success rate but doesn't provide context about what this means for benchmark saturation. I recommend adding the section below (provided as a suggested change) to the README to provide that context.

Author:
Updated as suggested. Thanks.

### Benchmark Context (January 2026)

The iterative predictor achieves a 70% success rate on SWE-bench Lite, which primarily reflects the capabilities of modern foundation models (Claude Sonnet 4.5, GPT-5.2) rather than framework-specific innovations. SWE-bench Lite is approaching saturation at 70-80% with simple agent architectures.

**For evaluating framework improvements beyond task correctness, consider tracking:**
- **Efficiency metrics:** Tokens consumed, steps taken, cost per solution (see the sketch below)
- **Reliability metrics:** Success rate variance over multiple runs
- **Harder benchmarks:** SWE-bench Verified (currently ~35% SOTA, not saturated) or the full SWE-bench dataset (2,294 problems)

This positions the iterative predictor as a reference implementation demonstrating the NeMo Agent Toolkit's builder pattern and tool integration capabilities.
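
To make the efficiency and reliability metrics above concrete, here is a minimal sketch of per-run metric tracking. It is illustrative only, built on plain dataclasses; the names (`RunMetrics`, `MetricsAggregator`) are hypothetical and are not taken from this PR's metrics-logging commit.

```python
# Hypothetical sketch of the metrics named above; not this PR's implementation.
from dataclasses import dataclass, field
from statistics import mean, pvariance


@dataclass
class RunMetrics:
    """Efficiency metrics for one agent run on one SWE-bench instance."""
    steps_taken: int
    tokens_consumed: int
    cost_usd: float
    solved: bool


@dataclass
class MetricsAggregator:
    """Aggregates runs into the efficiency/reliability metrics listed above."""
    runs: list[RunMetrics] = field(default_factory=list)

    def record(self, run: RunMetrics) -> None:
        self.runs.append(run)

    def success_rate(self) -> float:
        return mean(float(r.solved) for r in self.runs) if self.runs else 0.0

    def success_rate_variance(self) -> float:
        # Reliability: variance of solved/unsolved outcomes across runs.
        return pvariance(float(r.solved) for r in self.runs) if len(self.runs) > 1 else 0.0

    def cost_per_solution(self) -> float:
        total_cost = sum(r.cost_usd for r in self.runs)
        solutions = sum(r.solved for r in self.runs)
        return total_cost / solutions if solutions else float("inf")
```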

### Adding a net new predictor
To add a new predictor:
@@ -16,31 +16,59 @@
import typing

from pydantic import Discriminator
from pydantic import Field
from pydantic import Tag

from nat.data_models.common import BaseModelRegistryTag
from nat.data_models.common import TypedBaseModel
from nat.data_models.component_ref import LLMRef
from nat.data_models.function import FunctionBaseConfig


class SweBenchPredictorBaseConfig(TypedBaseModel, BaseModelRegistryTag):
    """Base configuration class for SWE-bench predictors."""
    description: str = "Swe Bench Problem Solver"


class SweBenchPredictorGoldConfig(SweBenchPredictorBaseConfig, name="gold"):
    """Configuration for the gold predictor that uses the provided patch directly.

    Attributes:
        verbose: Whether to enable verbose output for debugging.
    """
    verbose: bool = True


class SweBenchPredictorSkeletonConfig(SweBenchPredictorBaseConfig, name="skeleton"):
    """Configuration for the skeleton predictor template.

    Attributes:
        verbose: Whether to enable verbose output for debugging.
    """
    verbose: bool = False


class SweBenchPredictorIterativeConfig(SweBenchPredictorBaseConfig, name="iterative"):
    """Configuration for the iterative predictor that solves problems step-by-step.

    Attributes:
        llm_name: Reference to the LLM to use for iterative problem solving.
        step_limit: Maximum number of agent steps before termination.
        timeout: Command execution timeout in seconds.
    """
    llm_name: LLMRef = Field(description="LLM to use for iterative agent")
    step_limit: int = Field(default=250, description="Maximum number of agent steps")
    timeout: int = Field(default=60, description="Command execution timeout in seconds")


SweBenchPredictorConfig = typing.Annotated[
    typing.Annotated[SweBenchPredictorGoldConfig, Tag(SweBenchPredictorGoldConfig.static_type())]
    | typing.Annotated[SweBenchPredictorSkeletonConfig, Tag(SweBenchPredictorSkeletonConfig.static_type())]
    | typing.Annotated[SweBenchPredictorIterativeConfig, Tag(SweBenchPredictorIterativeConfig.static_type())],
    Discriminator(TypedBaseModel.discriminator)]


class SweBenchWorkflowConfig(FunctionBaseConfig, name="swe_bench"):
    """Configuration for the SWE-bench workflow.

    Attributes:
        predictor: The predictor configuration (gold, skeleton, or iterative).
    """
    predictor: SweBenchPredictorConfig
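
The `SweBenchPredictorConfig` union above uses pydantic's discriminated-union machinery (`Tag` entries plus a `Discriminator`) so the `_type` value in the YAML selects the concrete config class. Below is a simplified, self-contained sketch of the same dispatch pattern; it assumes pydantic v2 and uses a plain string discriminator instead of the toolkit's callable one, and all class names in it are hypothetical.

```python
# Simplified analogue of SweBenchPredictorConfig; illustrative, not toolkit code.
import typing

from pydantic import BaseModel, Discriminator, TypeAdapter


class GoldCfg(BaseModel):
    type: typing.Literal["gold"] = "gold"
    verbose: bool = True


class IterativeCfg(BaseModel):
    type: typing.Literal["iterative"] = "iterative"
    llm_name: str
    step_limit: int = 250
    timeout: int = 60


# The "type" field plays the role of "_type" in the workflow YAML.
PredictorCfg = typing.Annotated[GoldCfg | IterativeCfg, Discriminator("type")]

cfg = TypeAdapter(PredictorCfg).validate_python(
    {"type": "iterative", "llm_name": "openai_llm", "step_limit": 100})
assert isinstance(cfg, IterativeCfg) and cfg.step_limit == 100
```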
examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml
@@ -0,0 +1,78 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

llms:
  nim_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0.0
    max_tokens: 4096
  claude_sonnet_llm:
    _type: litellm
    model_name: anthropic/claude-sonnet-4-5-20250929
    temperature: 0.0
    api_key: "${ANTHROPIC_API_KEY}"
  openai_llm:
    _type: openai
    model_name: gpt-5.2
    temperature: 0.0
    api_key: "${OPENAI_API_KEY}"
Comment on lines +22 to +31
coderabbitai bot, Feb 1, 2026:

⚠️ Potential issue | 🟡 Minor

Use NAT_-prefixed environment variables for API keys.

✏️ Suggested update
-    api_key: "${ANTHROPIC_API_KEY}"
+    api_key: "${NAT_ANTHROPIC_API_KEY}"
@@
-    api_key: "${OPENAI_API_KEY}"
+    api_key: "${NAT_OPENAI_API_KEY}"

As per coding guidelines, use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/configs/config_iterative.yml`
around lines 22 - 31, Update the api_key environment variable names in the YAML
entries for claude_sonnet_llm and openai_llm so they use the NAT_ prefix;
replace "${ANTHROPIC_API_KEY}" with "${NAT_ANTHROPIC_API_KEY}" in the
claude_sonnet_llm block and "${OPENAI_API_KEY}" with "${NAT_OPENAI_API_KEY}" in
the openai_llm block to follow the NAT_ environment variable naming convention.

Author:

I'll keep the standard environment variable names (OPENAI_API_KEY, ANTHROPIC_API_KEY) for consistency with:

1. Other examples in this repository (e.g., nat_autogen_demo, agno_personal_finance)
2. Industry-standard conventions that users likely already have configured
3. The NAT_ prefix pattern in this project, which is used for NAT-specific configs (OAuth, user IDs), not third-party API keys

coderabbitai bot:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for API key environment variable usage patterns in examples directory
echo "=== Searching for ANTHROPIC_API_KEY usage ==="
rg 'ANTHROPIC_API_KEY' examples/ -n -C2 --type yaml

echo -e "\n=== Searching for OPENAI_API_KEY usage ==="
rg 'OPENAI_API_KEY' examples/ -n -C2 --type yaml

echo -e "\n=== Searching for NAT_ prefixed API keys ==="
rg 'NAT_.*API_KEY' examples/ -n -C2 --type yaml

echo -e "\n=== Checking specific examples mentioned by user ==="
fd -t f . examples/advanced_agents/nat_autogen_demo examples/advanced_agents/agno_personal_finance -e yml -e yaml -x cat {}

Length of output: 3286


@Jerryguan777, thank you for the clarification! You're absolutely right—I verified that other examples in the repository (nat_autogen_demo, agno_personal_finance) consistently use the standard OPENAI_API_KEY naming convention, and there are no NAT_-prefixed third-party API keys in the examples directory.

Your reasoning makes sense: using industry-standard environment variable names improves developer experience since users likely already have them configured. I appreciate you distinguishing between NAT-specific configuration variables and third-party service API keys.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to **/*.{py,md,toml,yaml,yml} : Use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to docs/source/**/*.md : Never use deprecated names 'Agent Intelligence toolkit', 'aiqtoolkit', 'AgentIQ', 'AIQ', or 'aiq' when referring to NeMo Agent Toolkit; update any occurrences found in documentation unless intentionally referencing deprecated names or compatibility layers.

Learnt from: willkill07
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1512
File: docs/source/run-workflows/launching-ui.md:20-20
Timestamp: 2026-01-30T15:03:44.777Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository documentation, "NeMo Agent Toolkit" should always be spelled with a capital "T" in "Toolkit". The documentation guideline states: "Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit."

Learnt from: AnuradhaKaruppiah
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 701
File: examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-with-custom-post-process.yml:16-18
Timestamp: 2025-08-22T20:29:13.433Z
Learning: In the NeMo-Agent-Toolkit project, configuration consistency across example files is prioritized over removing unused config options, even when those options aren't currently implemented in the runtime.

Learnt from: Jerryguan777
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1485
File: examples/advanced_agents/sandbox_agent/docker-compose.yaml:31-33
Timestamp: 2026-01-26T01:05:01.390Z
Learning: In the Sandbox Agent (examples/advanced_agents/sandbox_agent), host-side tools like web_search (Tavily) and youtube_transcript run on the host machine, not inside the Docker sandbox container. Therefore, their API keys (e.g., TAVILY_API_KEY) do not need to be passed as environment variables in docker-compose.yaml.

Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Never commit API keys, credentials or personal data; use environment variables or '.env' files excluded from Git.

Learnt from: zterek
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 1243
File: examples/risk_and_security/retail_agent/src/nat_retail_agent/configs/red-teaming.yml:1-98
Timestamp: 2025-12-12T20:49:51.370Z
Learning: In the NVIDIA/NeMo-Agent-Toolkit repository, most YAML configuration files use 2-space indentation rather than 4-space indentation. When reviewing YAML files, prefer consistency with the existing codebase style (2-space indentation) over the stated coding guideline (4-space indentation) until a repository-wide standardization is performed.


workflow:
  _type: swe_bench
  predictor:
    _type: iterative
    llm_name: "openai_llm"  # "nim_llm" or "claude_sonnet_llm" or "openai_llm"
    step_limit: 100  # Overrides default (250)
    timeout: 60

functions:
  git_repo_tool:
    _type: git_repo_tool
    workspace_dir: "./.workspace"
    cleanup_on_exit: true

eval:
  general:
    output_dir: .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/
    max_concurrency: 5
    dataset:
      _type: parquet
      file_path: hf://datasets/princeton-nlp/SWE-bench_Lite/data/test-00000-of-00001.parquet
      id_key: instance_id
      structure:
        disable: true
      filter:
        allowlist:
          field:
            instance_id:
              - sympy__sympy-20590
              # - sympy__sympy-21055
              # - sympy__sympy-11400
              # - astropy__astropy-12907
              # - django__django-15781
              # - astropy__astropy-6938
              # - django__django-11001
              # - mwaskom__seaborn-3010
              # - pallets__flask-4045
              # - psf__requests-1963

  evaluators:
    swe_bench:
      _type: swe_bench
      run_id: nat_iterative_1
      clean: true
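
For context, a config like this is normally exercised through the toolkit's evaluation CLI, e.g. `nat eval --config_file examples/evaluation_and_profiling/swe_bench/configs/config_iterative.yml`. That invocation is an assumption based on the toolkit's standard eval entry point and the config path referenced in the README; it is not shown in this diff. The `OPENAI_API_KEY` environment variable must be set when `llm_name` is `"openai_llm"`.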


Member:
All files (even if no code) must contain the license headers

examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Comment on lines +1 to +14
coderabbitai bot, Feb 1, 2026:

⚠️ Potential issue | 🟡 Minor

Add a module docstring for the package initializer.
This public module currently has only the SPDX header; add a short Google-style module docstring describing the package.

✏️ Suggested docstring
@@
 # limitations under the License.
+"""Iterative SWE-bench predictor package."""
+

As per coding guidelines, Provide Google-style docstrings for every public module, class, function and CLI command.

🤖 Prompt for AI Agents
In
`@examples/evaluation_and_profiling/swe_bench/src/nat_swe_bench/predictors/predict_iterative/__init__.py`
around lines 1 - 14, The package initializer (__init__.py) for the
predict_iterative module is missing a public module docstring; add a short
Google-style module docstring at the top of __init__.py that briefly describes
the package purpose, the main functionality it exposes (e.g., iterative
predictors for NAT SWE benchmarking), and any important public symbols or
modules contained within the package; ensure the docstring follows Google-style
(one-line summary, blank line, short elaboration) and is placed before any code
or headers.

Author (Jerryguan777, Feb 2, 2026):
Acknowledged. Keeping as-is for consistency with other __init__.py files in this project.

coderabbitai bot:
@Jerryguan777, understood. I'll skip this suggestion.


🧠 Learnings used
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-31T03:35:56.118Z
Learning: Applies to {src,packages}/**/*.py : Provide Google-style docstrings for every public module, class, function and CLI command.
