fix: replace O(n^2) regex with linear string search in code block extraction (ReDoS) #6118
Ashutosh0x wants to merge 3 commits into
…nd_truncate_content (ReDoS)
Hi @surajksharma07, this fixes the ReDoS (CWE-1333) reported in #5992. The regex in `extract_code_and_truncate_content` is replaced with a linear-time string search. Behavioral parity is maintained: the same inputs produce the same outputs, just without the hang.
Hi @Ashutosh0x, thank you for your contribution! We appreciate you taking the time to submit this pull request. Please fix the formatting errors so that we can proceed with a review.
Merge #6118

## Summary
- Replace regular expression-based code block extraction with a simple and safe string-find based search. This avoids exponential backtracking (ReDoS) when processing long or repeating inputs with missing trailing delimiters.
- Add unit tests to verify standard behavior and test against ReDoS vulnerability.

Co-authored-by: Kathy Wu <wukathy@google.com>
PiperOrigin-RevId: 933834549
Thank you @Ashutosh0x for your contribution! 🎉 Your changes have been successfully imported and merged via Copybara in commit 910e1c1. Closing this PR as the changes are now in the main branch.
Summary
Fix for #5992: replace catastrophic O(n²) regex backtracking in `extract_code_and_truncate_content` with a linear-time string search.
Problem
The regex pattern at line 153 of `code_execution_utils.py` uses multiple `.*?` groups with `re.DOTALL`:
```python
rf'(?P<…>.*?)({leading_delimiter_pattern})(?P<…>.*?)({trailing_delimiter_pattern})(?P<…>.*?)$'
```
When the input is large and contains no matching delimiters (or partial delimiters), the regex engine tries all possible combinations of how the lazy quantifiers can match, causing O(n²) backtracking that hangs the process.
CWE-1333: Inefficient Regular Expression Complexity (ReDoS)
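The slowdown can be reproduced with a small timing sketch. It is not the code from `code_execution_utils.py`: the delimiter pairs, the group names (`prefix`, `code`, `suffix`), and the helper `time_failed_search` are assumptions made for illustration, and only the overall shape of the pattern follows the description above. The input contains no delimiter at all, so every search fails after retrying the lazy prefix from each start position.

```python
import re
import time

FENCE = "`" * 3  # built programmatically so the example nests cleanly in Markdown
# Hypothetical delimiter pairs, for illustration only.
DELIMITERS = [
    (FENCE + "python\n", "\n" + FENCE),
    (FENCE + "tool_code\n", "\n" + FENCE),
]

leading = "|".join(re.escape(d[0]) for d in DELIMITERS)
trailing = "|".join(re.escape(d[1]) for d in DELIMITERS)
# Same shape as the pattern described in the PR; the group names are assumed.
PATTERN = re.compile(
    rf"(?P<prefix>.*?)({leading})(?P<code>.*?)({trailing})(?P<suffix>.*?)$",
    re.DOTALL,
)


def time_failed_search(n: int) -> float:
    """Time one search over n characters that contain no delimiter."""
    text = "a" * n
    start = time.perf_counter()
    match = PATTERN.search(text)
    elapsed = time.perf_counter() - start
    assert match is None  # nothing to extract, so the search must fail
    return elapsed


# Sizes are kept small on purpose so the demo finishes quickly.
for n in (1000, 2000, 4000):
    print(f"n={n}: {time_failed_search(n):.4f}s")
```

If the search is quadratic as described, each doubling of `n` should roughly quadruple the reported time. Exact figures depend on the machine and Python version.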
Fix
Replaced the regex with a simple `str.find()`-based approach. This runs in O(n × d) time, where d is the number of delimiter pairs (typically 2-3), which is effectively O(n).
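A minimal sketch of what a `str.find()`-based search of this kind could look like is shown here. It is not the merged implementation: the function name `find_first_code_block`, its return shape, and the delimiter pairs are hypothetical. It also assumes each opening delimiter is paired only with its own closing delimiter, which may differ from how the original regex combined them.

```python
from typing import Optional, Sequence, Tuple


def find_first_code_block(
    text: str, delimiters: Sequence[Tuple[str, str]]
) -> Optional[Tuple[str, str, str]]:
    """Return (prefix, code, suffix) for the earliest complete block, or None.

    Hypothetical helper for illustration; not the function from the PR.
    """
    # best = (leading delimiter index, code start, code end, suffix start)
    best = None
    for leading, trailing in delimiters:
        start = text.find(leading)
        if start == -1:
            continue  # this opening delimiter never occurs
        code_start = start + len(leading)
        code_end = text.find(trailing, code_start)
        if code_end == -1:
            # No closing delimiter after the first opening one, so there is
            # none after any later opening one either.
            continue
        if best is None or start < best[0]:
            best = (start, code_start, code_end, code_end + len(trailing))
    if best is None:
        return None
    start, code_start, code_end, suffix_start = best
    return text[:start], text[code_start:code_end], text[suffix_start:]


FENCE = "`" * 3  # built programmatically so the example nests cleanly in Markdown
pairs = [(FENCE + "python\n", "\n" + FENCE), (FENCE + "tool_code\n", "\n" + FENCE)]
reply = "Here you go:\n" + FENCE + "python\nprint(1)\n" + FENCE + "\nDone."
print(find_first_code_block(reply, pairs))
```

Each delimiter pair costs at most two forward scans of the text, which is where the O(n × d) bound comes from. Unlike the regex, there is no backtracking state to revisit when a closing delimiter is missing.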
Testing
The fix preserves the same behavior: it extracts the first code block and truncates the content after it. The string search approach handles the same edge cases.
Fixes #5992