feat(swe-bench): add an iterative predictor with ~70% success rate #1414
base: develop
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| llms: | ||
| nim_llm: | ||
| _type: nim | ||
| model_name: nvidia/nemotron-3-nano-30b-a3b | ||
| temperature: 0.0 | ||
| max_tokens: 4096 | ||
| claude_sonnet_llm: | ||
| _type: litellm | ||
| model_name: anthropic/claude-sonnet-4-5-20250929 | ||
| temperature: 0.0 | ||
| api_key: "${ANTHROPIC_API_KEY}" | ||
| openai_llm: | ||
| _type: openai | ||
| model_name: gpt-5.2 | ||
| temperature: 0.0 | ||
| api_key: "${OPENAI_API_KEY}" | ||
|
Comment on lines +22 to +31
Suggested update:
- api_key: "${ANTHROPIC_API_KEY}"
+ api_key: "${NAT_ANTHROPIC_API_KEY}"
@@
- api_key: "${OPENAI_API_KEY}"
+ api_key: "${NAT_OPENAI_API_KEY}"
As per coding guidelines, use 'nat' for the API namespace and CLI tool, 'nvidia-nat' for the package name, 'NAT_' prefix for environment variables, and 'NeMo-Agent-Toolkit' for URLs and directory names.
Author
I'll keep the standard environment variable names (
Your reasoning makes sense: using industry-standard environment variable names improves developer experience since users likely already have them configured. I appreciate you distinguishing between NAT-specific configuration variables and third-party service API keys. |
||
|
|
||
| workflow: | ||
| _type: swe_bench | ||
| predictor: | ||
| _type: iterative | ||
| llm_name: "openai_llm" # "nim_llm" or "claude_sonnet_llm" or "openai_llm" | ||
| step_limit: 100 # Overrides default (250) | ||
| timeout: 60 | ||
|
|
||
| functions: | ||
| git_repo_tool: | ||
| _type: git_repo_tool | ||
| workspace_dir: "./.workspace" | ||
| cleanup_on_exit: true | ||
|
|
||
| eval: | ||
| general: | ||
| output_dir: .tmp/nat/examples/evaluation_and_profiling/swe_bench/iterative/ | ||
| max_concurrency: 5 | ||
| dataset: | ||
| _type: parquet | ||
| file_path: hf://datasets/princeton-nlp/SWE-bench_Lite/data/test-00000-of-00001.parquet | ||
| id_key: instance_id | ||
| structure: | ||
| disable: true | ||
| filter: | ||
| allowlist: | ||
| field: | ||
| instance_id: | ||
| - sympy__sympy-20590 | ||
| # - sympy__sympy-21055 | ||
| # - sympy__sympy-11400 | ||
| # - astropy__astropy-12907 | ||
| # - django__django-15781 | ||
| # - astropy__astropy-6938 | ||
| # - django__django-11001 | ||
| # - mwaskom__seaborn-3010 | ||
| # - pallets__flask-4045 | ||
| # - psf__requests-1963 | ||
|
|
||
| evaluators: | ||
| swe_bench: | ||
| _type: swe_bench | ||
| run_id: nat_iterative_1 | ||
| clean: true | ||
|
|
||
|
|
||
|
Member
All files (even if no code) must contain the license headers. |
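
The `api_key: "${ANTHROPIC_API_KEY}"` and `api_key: "${OPENAI_API_KEY}"` entries in the config above depend on `${VAR}` placeholders being replaced from the environment. The diff does not show how the toolkit performs that substitution, so the following is only a minimal sketch of the idea using Python's standard `string.Template`; the helper name `expand_env` and the sample key value are hypothetical.

```python
import os
from string import Template
from typing import Mapping, Optional


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${VAR} placeholders in value using env (defaults to os.environ).

    Raises KeyError when a referenced variable is not set, so a missing
    API key fails early instead of being passed on as a literal placeholder.
    """
    source = os.environ if env is None else env
    return Template(value).substitute(source)


if __name__ == "__main__":
    # Hypothetical environment, for illustration only.
    fake_env = {"OPENAI_API_KEY": "sk-example"}
    print(expand_env("${OPENAI_API_KEY}", fake_env))  # prints "sk-example"
```

With the standard variable names kept, as the author prefers, a user who has already exported `OPENAI_API_KEY` for other tools would not need any additional setup for this config.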
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,14 @@ | ||
| # SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
Comment on lines +1 to +14
Add a module docstring for the package initializer.
Suggested docstring:
@@
# limitations under the License.
+"""Iterative SWE-bench predictor package."""
+
As per coding guidelines, provide Google-style docstrings for every public module, class, function and CLI command.
Author
Acknowledged. Keeping as-is for consistency with other
|
||
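
The requirement that every file carry the license header, including an initializer with no code like the 14-line file above, can be checked mechanically. The project's own check is not shown in this diff; the sketch below is an assumption-laden illustration that looks for the two SPDX lines near the top of each file. The function names, the 20-line window, and the file patterns are all hypothetical choices.

```python
from pathlib import Path
from typing import Iterable, List

# Markers taken from the header shown in the diff.
REQUIRED_MARKERS = (
    "SPDX-FileCopyrightText:",
    "SPDX-License-Identifier: Apache-2.0",
)


def has_license_header(text: str, max_lines: int = 20) -> bool:
    """Return True if every required marker appears in the first max_lines lines."""
    head = "\n".join(text.splitlines()[:max_lines])
    return all(marker in head for marker in REQUIRED_MARKERS)


def files_missing_header(
    root: str, patterns: Iterable[str] = ("*.py", "*.yml", "*.yaml")
) -> List[Path]:
    """Recursively collect files under root that lack the header."""
    missing = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            if path.is_file() and not has_license_header(
                path.read_text(encoding="utf-8")
            ):
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in files_missing_header("."):
        print(f"missing license header: {path}")
```

A substring check like this only confirms that the SPDX lines are present; it does not validate the copyright year or the full Apache-2.0 notice text.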
The PR description mentions a 70% success rate but doesn't provide context about what this means for benchmark saturation. I recommend adding the above to the README to provide context.
updated as suggested. Thanks
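
For readers weighing the ~70% figure in the PR title: a success rate of this kind is usually the number of resolved instances divided by the number attempted, and this excerpt does not say over which set of instances it was measured. The sketch below only shows that arithmetic; the function name and the per-instance outcomes are hypothetical and are not results from this PR.

```python
from typing import Mapping


def resolve_rate(results: Mapping[str, bool]) -> float:
    """Fraction of instances marked resolved (True) among all attempted."""
    if not results:
        raise ValueError("no results to score")
    return sum(results.values()) / len(results)


if __name__ == "__main__":
    # Hypothetical outcomes for ten placeholder instances: seven resolved.
    outcomes = {f"instance-{i}": i < 7 for i in range(10)}
    print(f"{resolve_rate(outcomes):.0%}")  # prints "70%"
```

A rate computed over a small allowlist, such as the single enabled instance in the example config, says little about the full benchmark, which is why the README context requested above matters.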