fix: replace O(n^2) regex with linear string search in code block extraction (ReDoS) #6118

Closed
Ashutosh0x wants to merge 3 commits into google:main from Ashutosh0x:fix/redos-code-extraction

Conversation

@Ashutosh0x

Contributor

Summary

Fix for #5992 — Replace catastrophic O(n²) regex backtracking in extract_code_and_truncate_content with a linear-time string search.

Problem

The regex pattern at line 153 of code_execution_utils.py uses multiple .*? groups with re.DOTALL:

```python
rf'(?P<prefix>.*?)({leading_delimiter_pattern})(?P<code>.*?)({trailing_delimiter_pattern})(?P<suffix>.*?)$'
```

When the input is large and contains no matching delimiters (or partial delimiters), the regex engine tries all possible combinations of how the lazy quantifiers can match, causing O(n²) backtracking that hangs the process.

CWE-1333: Inefficient Regular Expression Complexity (ReDoS)
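
The growth described above can be observed with a small timing sketch. The delimiters, the group names, and the helper `time_failed_match` are assumptions made for illustration, not the library's actual values; the input repeats a leading delimiter many times and never contains a trailing one, so every opener forces a fresh scan to the end of the text.

```python
import re
import time

# Hypothetical delimiters for illustration; the library defines its own list.
LEADING = "```python\n"
TRAILING = "\n```"

# Same shape as the pattern discussed above: three lazy groups under re.DOTALL.
PATTERN = re.compile(
    rf"(?P<prefix>.*?)({re.escape(LEADING)})(?P<code>.*?)"
    rf"({re.escape(TRAILING)})(?P<suffix>.*?)$",
    re.DOTALL,
)


def time_failed_match(n_openers: int) -> float:
    """Time one failed match on input with many openers and no closer."""
    # The "x" after each opener keeps the closer "\n```" from ever appearing.
    text = (LEADING + "x") * n_openers
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None  # no trailing delimiter, so the match must fail
    return elapsed


if __name__ == "__main__":
    # Doubling the input should roughly quadruple the time if growth is quadratic.
    for n in (200, 400, 800):
        print(f"{n:4d} openers: {time_failed_match(n):.4f}s")
```

Exact timings depend on the machine and Python version, so none are quoted here; the point is only the trend as the input doubles.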

Fix

Replaced the regex with a simple str.find()-based approach:

  1. For each delimiter pair, find the first occurrence of the leading delimiter
  2. Find the corresponding trailing delimiter after it
  3. Pick the earliest match

This runs in O(n × d) time where d = number of delimiter pairs (typically 2-3), which is effectively O(n).
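
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `find_first_code_block`, its return shape, and the delimiter pairs in the usage lines are assumptions.

```python
from typing import Optional, Sequence, Tuple


def find_first_code_block(
    text: str,
    delimiters: Sequence[Tuple[str, str]],
) -> Optional[Tuple[int, int, str]]:
    """Return (block_start, block_end, code) for the earliest block, or None."""
    best: Optional[Tuple[int, int, str]] = None
    for leading, trailing in delimiters:
        # Step 1: first occurrence of the leading delimiter.
        start = text.find(leading)
        if start == -1:
            continue
        # Step 2: the trailing delimiter, searched only after the leading one.
        code_start = start + len(leading)
        code_end = text.find(trailing, code_start)
        if code_end == -1:
            continue
        # Step 3: keep whichever delimiter pair matched earliest in the text.
        if best is None or start < best[0]:
            best = (start, code_end + len(trailing), text[code_start:code_end])
    return best


# Hypothetical delimiter pairs, for illustration only.
pairs = [("```tool_code\n", "\n```"), ("```python\n", "\n```")]
text = "intro ```python\nprint(1)\n``` outro"
block = find_first_code_block(text, pairs)
```

Each delimiter pair costs at most two forward scans of the text, which is where the O(n × d) bound comes from. The content before the block is `text[:block[0]]` and the content after it is `text[block[1]:]`.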

Testing

The fix preserves the same behavior — extracting the first code block and truncating content after it. The string search approach handles the same edge cases:

  • No delimiters found → returns None
  • Empty code block → returns None
  • Multiple code blocks → picks the earliest one
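
The edge cases listed above could be checked along these lines. The helper `first_code_block` and the delimiter pairs are hypothetical stand-ins written for this sketch; they are not the function or the unit tests from this PR.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical delimiter pairs, for illustration only.
DELIMITERS = (("```python\n", "\n```"), ("```tool_code\n", "\n```"))


def first_code_block(
    text: str, delimiters: Sequence[Tuple[str, str]] = DELIMITERS
) -> Optional[str]:
    """Return the earliest delimited block, or None if absent or empty."""
    best_start, best_code = len(text), None
    for leading, trailing in delimiters:
        start = text.find(leading)
        if start == -1 or start >= best_start:
            continue
        code_start = start + len(leading)
        end = text.find(trailing, code_start)
        if end != -1:
            best_start, best_code = start, text[code_start:end]
    # An empty block is treated the same as no block at all.
    return best_code or None


# No delimiters found.
assert first_code_block("no fences here") is None
# Empty code block.
assert first_code_block("```python\n\n```") is None
# Multiple code blocks: the earliest one wins, regardless of delimiter order.
two_blocks = "a ```tool_code\nx = 1\n``` b ```python\ny = 2\n```"
assert first_code_block(two_blocks) == "x = 1"
# Large input with openers but no closer returns None without hanging.
assert first_code_block("```python\nx" * 100_000) is None
```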

Fixes #5992

@Ashutosh0x

Contributor Author

Hi @surajksharma07 — this fixes the ReDoS (CWE-1333) reported in #5992.

The regex in extract_code_and_truncate_content() uses multiple .*? groups with re.DOTALL, causing O(n²) backtracking on large inputs without matching delimiters. Replaced with a simple str.find() loop that runs in O(n) time.

Behavioral parity maintained — same inputs produce same outputs, just without the hang.

@rohityan rohityan self-assigned this Jun 15, 2026
@wukath wukath self-assigned this Jun 16, 2026
@rohityan rohityan removed their assignment Jun 17, 2026
@rohityan rohityan added the tools [Component] This issue is related to tools label Jun 17, 2026
@rohityan

Collaborator

Hi @Ashutosh0x, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors so that we can proceed with the review.

@rohityan rohityan added the request clarification [Status] The maintainer need clarification or more information from the author label Jun 17, 2026
copybara-service Bot pushed a commit that referenced this pull request Jun 17, 2026
Merge #6118

## Summary
- Replace regular expression-based code block extraction with a simple and safe string-find based search. This avoids quadratic backtracking (ReDoS) when processing long or repeating inputs with missing trailing delimiters.
- Add unit tests to verify standard behavior and test against ReDoS vulnerability.

Co-authored-by: Kathy Wu <wukathy@google.com>
PiperOrigin-RevId: 933834549
@adk-bot

adk-bot commented Jun 17, 2026

Collaborator

Thank you @Ashutosh0x for your contribution! 🎉

Your changes have been successfully imported and merged via Copybara in commit 910e1c1.

Closing this PR as the changes are now in the main branch.

Labels

merged [Status] This PR is merged
request clarification [Status] The maintainer need clarification or more information from the author
tools [Component] This issue is related to tools

4 participants