
fix(api): refactors the SQL LIKE pattern escaping logic to use a centralized utility function, ensuring consistent and secure handling of special characters across all database queries.#12

Open
tomerqodo wants to merge 4 commits into claude_claude_vs_qodo_base_fixapi_refactors_the_sql_like_pattern_escaping_logic_to_use_a_centralized__utility_function_ensuring_consistent_and_secure_handling_of_special_characters_across_all_database
from claude_claude_vs_qodo_head_fixapi_refactors_the_sql_like_pattern_escaping_logic_to_use_a_centralized__utility_function_ensuring_consistent_and_secure_handling_of_special_characters_across_all_database

Conversation

@tomerqodo

Benchmark PR from agentic-review-benchmarks#12

NeatGuyCoding and others added 4 commits January 25, 2026 12:05
…ralized

utility function, ensuring consistent and secure handling of special characters
across all database queries.

Signed-off-by: NeatGuyCoding <15627489+NeatGuyCoding@users.noreply.github.com>
…logic

Signed-off-by: NeatGuyCoding <15627489+NeatGuyCoding@users.noreply.github.com>
Comment on lines +293 to +298
escaped_query = escape_like_pattern(query)
query_pattern = f"%{escaped_query}%"
sql = f"""
SELECT TOP {top_k} id, text, meta
FROM {self.schema}.{self.table_name}
WHERE text LIKE ?
WHERE text LIKE ? ESCAPE '|'

🔴 The escape_like_pattern() function escapes special characters using backslash (e.g., % becomes \%), but the IRIS SQL query at line 298 uses ESCAPE '|' (pipe), so the database expects pipe-prefixed escapes but receives backslash-prefixed ones — making the escaping completely ineffective. Similarly, the ClickZetta query at clickzetta_vector.py:991 uses ESCAPE '\\' which may produce a two-character escape string depending on the SQL dialect, when ESCAPE requires exactly one character. Both ESCAPE clauses should use backslash to match what escape_like_pattern() produces.

Extended reasoning...

What the bug is

The escape_like_pattern() utility function (defined in api/libs/helper.py) escapes SQL LIKE special characters using backslash as the escape character: % becomes \%, _ becomes \_, and \ becomes \\. This function is used consistently across the PR to sanitize user input before LIKE queries.

However, two raw-SQL database implementations specify a different ESCAPE character than backslash, creating a mismatch that renders the escaping completely ineffective.

IRIS: Wrong ESCAPE character

In api/core/rag/datasource/vdb/iris/iris_vector.py at line 298, the SQL query specifies ESCAPE '|' (pipe character). The code calls escape_like_pattern(query) on line 293, which produces patterns like 50\% discount. When the database receives this pattern with ESCAPE '|', it looks for pipe-prefixed characters as escaped wildcards — but the pattern contains backslash-prefixed characters. The \ is treated as a literal backslash and % remains a wildcard. The escaping has zero effect.

ClickZetta: Ambiguous ESCAPE string

In api/core/rag/datasource/vdb/clickzetta/clickzetta_vector.py at line 991, the query's SQL text ends up as ESCAPE '\\'. How this is interpreted depends on ClickZetta's SQL dialect: if it follows ANSI SQL (where backslash is not special in string literals), '\\' is a two-character string, but the ESCAPE clause requires exactly one character — this would cause a SQL error or undefined behavior. The code on line 980 uses '' for quote escaping (ANSI style), further suggesting ClickZetta follows ANSI conventions.

Step-by-step proof (IRIS)

  1. User searches for 50% discount
  2. escape_like_pattern("50% discount") returns "50\% discount"
  3. The query pattern becomes "%50\% discount%"
  4. SQL executed: WHERE text LIKE '%50\% discount%' ESCAPE '|'
  5. IRIS looks for | as the escape character prefix
  6. It finds \ (literal backslash) followed by % (wildcard) — neither is preceded by |
  7. Result: % is still treated as a wildcard, matching any text containing 50 followed by anything followed by discount
  8. This defeats the entire purpose of the escaping

Impact

For IRIS users, LIKE wildcard injection is not prevented despite the PR's intent. Special characters in search queries will still be interpreted as SQL wildcards, potentially returning incorrect results or enabling wildcard-based injection. For ClickZetta users, the query may fail entirely or produce undefined behavior depending on the SQL dialect.

Fix

For IRIS, change ESCAPE '|' to ESCAPE '\' (or ESCAPE '\\' depending on how IRIS handles backslash escaping in string literals). For ClickZetta, verify the SQL dialect's string literal handling and ensure the ESCAPE clause specifies exactly one backslash character. Alternatively, escape_like_pattern could be adapted to use a per-database escape character.
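The mismatch is easy to reproduce against SQLite, which also supports LIKE ... ESCAPE. This is only a stand-in engine for illustration (IRIS wildcard semantics may differ in detail): the helper is re-implemented locally, and the table and rows are invented. With the wrong ESCAPE character, the backslash-escaped pattern no longer behaves as intended; with the matching one, the literal match works.

```python
import sqlite3

def escape_like_pattern(pattern: str) -> str:
    # Local re-implementation of the PR's helper: backslash-based escaping
    return pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (text TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)",
                 [("50% discount",), ("50 huge discount",)])

pattern = f"%{escape_like_pattern('50% discount')}%"  # '%50\% discount%'

# Mismatched ESCAPE char: the backslash is just a literal character, so the
# pattern now demands a '\' that the data never contains -- escaping is broken.
mismatched = conn.execute(
    "SELECT text FROM docs WHERE text LIKE ? ESCAPE '|'", (pattern,)
).fetchall()

# Matching ESCAPE char: '\%' is read as a literal percent sign.
matching = conn.execute(
    "SELECT text FROM docs WHERE text LIKE ? ESCAPE '\\'", (pattern,)
).fetchall()

print(mismatched)  # []
print(matching)    # [('50% discount',)]
```

The exact failure mode under a mismatched ESCAPE clause varies by engine, but in no case does the escaping do what the helper intends.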

Comment on lines +92 to +96
keyword_trimmed = keyword[:30]
escaped_keyword = escape_like_pattern(keyword_trimmed)
keyword_like_val = f"%{escaped_keyword}%"
keyword_conditions = [
WorkflowRun.inputs.ilike(keyword_like_val),
WorkflowRun.outputs.ilike(keyword_like_val),
WorkflowRun.inputs.ilike(keyword_like_val, escape="\\"),

🔴 Removing the unicode_escape encoding from keyword search in workflow_app_service.py breaks non-ASCII keyword searches (e.g., Chinese, Japanese) against WorkflowRun.inputs/outputs. These fields are stored via json.dumps() with default ensure_ascii=True, so non-ASCII characters like 中文 are persisted as \u4e2d\u6587. The new code searches for the literal Unicode characters which will never match the escaped form in the database.

Extended reasoning...

What the bug is

The old code at line 89 performed a Unicode escape transformation on the search keyword:

keyword_like_val = f"%{keyword[:30].encode('unicode_escape').decode('utf-8')}%".replace(r"\u", r"\\u")

This converted non-ASCII characters (e.g., Chinese 中文) into their \uXXXX escape sequence form before using them in a SQL LIKE pattern. The new code simply passes the keyword through escape_like_pattern(), which only escapes %, _, and \ — it does not convert Unicode characters to escape sequences.

Why this matters

The WorkflowRun.inputs and WorkflowRun.outputs fields are stored using json.dumps(domain_model.inputs) at api/core/repositories/sqlalchemy_workflow_execution_repository.py:152. Critically, this call does not pass ensure_ascii=False, so Python uses the default ensure_ascii=True. This means any non-ASCII characters in the JSON data are stored as \uXXXX escape sequences in the database TEXT column.

For example, if a workflow input contains the Chinese text 中文, it is stored in the database as the literal string \u4e2d\u6587.

Step-by-step proof

  1. A user runs a workflow with input {"query": "中文测试"} (Chinese text meaning "Chinese test")
  2. The repository calls json.dumps({"query": "中文测试"}) which produces {"query": "\u4e2d\u6587\u6d4b\u8bd5"} (because ensure_ascii=True by default)
  3. This escaped string is stored in the WorkflowRun.inputs TEXT column
  4. A user searches for workflow logs with keyword 中文
  5. Old behavior: The code would encode 中文 to \u4e2d\u6587 and search for %\u4e2d\u6587% — this MATCHES the stored data
  6. New behavior: The code passes 中文 through escape_like_pattern() which returns 中文 unchanged, then searches for %中文% — this does NOT match \u4e2d\u6587 in the database
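The mismatch above can be reproduced with nothing but the standard library (the keyword and payload are illustrative; the removed code's extra .replace(r"\u", r"\\u") step for the LIKE layer is omitted here for brevity):

```python
import json

# How the repository stores workflow inputs: ensure_ascii defaults to True,
# so non-ASCII characters are persisted as \uXXXX escape sequences.
stored = json.dumps({"query": "中文测试"})
print(stored)  # {"query": "\u4e2d\u6587\u6d4b\u8bd5"}

keyword = "中文"

# Old behavior: transform the keyword into the same escaped form first.
old_needle = keyword.encode("unicode_escape").decode("utf-8")
print(old_needle in stored)  # True  -> the LIKE pattern can match

# New behavior: search for the raw characters, which never appear in the
# ASCII-escaped text stored in the database.
print(keyword in stored)     # False -> zero results
```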

Impact

Any search for non-ASCII text (Chinese, Japanese, Korean, Arabic, etc.) in workflow run inputs or outputs will silently return zero results, even when matching data exists. This is a significant regression for users in non-English locales who rely on keyword search in workflow logs.

How to fix

The fix should either:

  1. Restore the unicode_escape encoding logic for the keyword before searching WorkflowRun.inputs/outputs, or
  2. Change the repository to use json.dumps(domain_model.inputs, ensure_ascii=False) so that actual Unicode characters are stored in the database (matching the pattern used in logstore_workflow_execution_repository.py:112), and then migrate existing data. Option 2 is the more correct long-term approach but requires a data migration.

case "not contains":
filters.append(json_field.notlike(f"%{value}%"))
escaped_value = escape_like_pattern(str(value))
filters.append(json_field.notlike(f"%{escaped_value}%"))

🔴 Bug: The "not contains" case on line 1207 calls json_field.notlike(f"%{escaped_value}%") without the escape="\\" parameter. Every other like/notlike/ilike call in this PR includes escape="\\" to work with the backslash-based escaping from escape_like_pattern(). Without it, searching for literal %, _, or \ in a "not contains" metadata filter will produce incorrect results because the database won't interpret the escape sequences.

Extended reasoning...

What the bug is

In api/core/rag/retrieval/dataset_retrieval.py at line 1207, the "not contains" condition uses:

filters.append(json_field.notlike(f"%{escaped_value}%"))

but is missing the escape="\\" parameter that every other LIKE/notlike call in this function (and across the entire PR) includes.

How it manifests

The escape_like_pattern() function transforms special characters using backslash escaping: % becomes \%, _ becomes \_, and \ becomes \\. However, for the database to correctly interpret these escape sequences, the ESCAPE clause must be specified in the SQL query. Without escape="\\", the database treats \% as a literal backslash followed by a wildcard %, rather than as an escaped literal percent character.

Step-by-step proof

  1. User creates a document with metadata field category = "50% off"
  2. User filters with condition "not contains" and value "%"
  3. escape_like_pattern("%") returns "\%"
  4. The pattern becomes "%\%%" passed to notlike()
  5. Without escape="\\": The database sees \ as a literal character and % as a wildcard. The pattern %\%% matches any string containing a literal backslash followed by anything — which does NOT match "50% off". So the filter incorrectly includes "50% off" in results (it should be excluded since the value does contain %).
  6. With escape="\\": The database correctly interprets \% as an escaped literal %. The pattern matches any string containing a literal %, and "50% off" is correctly excluded from the "not contains" results.
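The steps above can be sketched against SQLite as a stand-in engine (SQLAlchemy's escape="\\" parameter emits an ESCAPE '\' clause like the one used here; table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (category TEXT)")
conn.executemany("INSERT INTO docs VALUES (?)",
                 [("50% off",), ("plain text",)])

escaped = "%".replace("\\", "\\\\").replace("%", "\\%")  # escape_like_pattern("%") -> '\%'
pattern = f"%{escaped}%"

# Without an ESCAPE clause: '\' is a literal character and '%' stays a
# wildcard, so '50% off' is wrongly kept by the "not contains" filter.
without_escape = conn.execute(
    "SELECT category FROM docs WHERE category NOT LIKE ?", (pattern,)
).fetchall()

# With ESCAPE '\': '\%' is a literal percent, so '50% off' is correctly excluded.
with_escape = conn.execute(
    "SELECT category FROM docs WHERE category NOT LIKE ? ESCAPE '\\'", (pattern,)
).fetchall()

print(without_escape)  # [('50% off',), ('plain text',)]
print(with_escape)     # [('plain text',)]
```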

Why existing code doesn't prevent it

This is simply an omission in the PR. Compare the "contains" case on line 1203 which correctly includes escape="\\", with the "not contains" case on line 1207 which does not. The "start with" (line 1211) and "end with" (line 1215) cases also correctly include the escape clause.

Impact

Metadata filters using the "not contains" condition with values containing %, _, or \ will return incorrect results. The escaping done by escape_like_pattern() is rendered ineffective for this specific condition, defeating the purpose of the entire PR for this code path.

How to fix

Add escape="\\" to line 1207:

filters.append(json_field.notlike(f"%{escaped_value}%", escape="\\"))

Comment thread api/libs/helper.py
Comment on lines +35 to +64
def escape_like_pattern(pattern: str) -> str:
    """
    Escape special characters in a string for safe use in SQL LIKE patterns.

    This function escapes the special characters used in SQL LIKE patterns:
    - Backslash (\\) -> \\\\
    - Percent (%) -> \\%
    - Underscore (_) -> \\_

    The escaped pattern can then be safely used in SQL LIKE queries with the
    ESCAPE '\\' clause to prevent SQL injection via LIKE wildcards.

    Args:
        pattern: The string pattern to escape

    Returns:
        Escaped string safe for use in SQL LIKE queries

    Examples:
        >>> escape_like_pattern("50% discount")
        '50\\% discount'
        >>> escape_like_pattern("test_data")
        'test\\_data'
        >>> escape_like_pattern("path\\to\\file")
        'path\\\\to\\\\file'
    """
    if not pattern:
        return pattern
    # Escape backslash first, then percent and underscore
    escapedPattern = pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return escapedPattern

🟣 This PR misses two Document.name.like() locations that use the same vulnerable pattern it fixes elsewhere: api/controllers/console/datasets/datasets_document.py:266 and api/controllers/service_api/dataset/document.py:476. Both pass unescaped request.args.get("keyword") directly into f"%{search}%" without calling escape_like_pattern(), meaning a search for "50%" would match all documents. This is a pre-existing issue not introduced by this PR, but it falls squarely within the stated scope of this centralization effort.

Extended reasoning...

What the bug is

The PR introduces escape_like_pattern() and systematically applies it to 12+ locations where user input is used in SQL LIKE patterns (e.g., Dataset.name.ilike, App.name.ilike, Tag.name.ilike, DocumentSegment.content.ilike). However, two locations with the identical vulnerable pattern were missed:

  1. api/controllers/console/datasets/datasets_document.py:264-266
  2. api/controllers/service_api/dataset/document.py:474-476

Both use Document.name.like(f"%{search}%") where search comes directly from request.args.get("keyword").

The specific code path

In both files, the pattern is identical:

search = request.args.get("keyword")
if search:
    search = f"%{search}%"
    query = query.where(Document.name.like(search))

The user-supplied keyword parameter is interpolated directly into the LIKE pattern without any escaping. Compare this to the fixed pattern in, e.g., api/services/dataset_service.py:147:

escaped_search = helper.escape_like_pattern(search)
query = query.where(Dataset.name.ilike(f"%{escaped_search}%", escape="\\"))

Why existing code does not prevent it

These two files are not in the PR changed file list. The PR author likely searched for common patterns but missed these two controllers, possibly because they use .like() (case-sensitive) rather than .ilike() (case-insensitive), or because they are in controller files rather than service files.

Step-by-step proof

  1. User navigates to the document list endpoint (console or service API)
  2. User enters keyword "50%" to search for documents containing literal "50%"
  3. The code constructs search = "%50%%" - the % from the user input acts as a wildcard
  4. The LIKE pattern %50%% matches any document containing "50" followed by anything, effectively matching ALL documents that contain "50" anywhere (e.g., "50 items", "1500 records")
  5. With the fix applied elsewhere in this PR, the pattern would be %50\%% with ESCAPE '\', correctly matching only documents containing the literal string "50%"

Impact

Users searching for documents with special characters (%, _, \) in their names will get incorrect results. The % wildcard is particularly impactful as it matches any sequence of characters, causing searches like "50%" to return far more results than expected.

How to fix

Apply the same pattern used throughout the PR:

from libs.helper import escape_like_pattern

if search:
    escaped_search = escape_like_pattern(search)
    search = f"%{escaped_search}%"
    query = query.where(Document.name.like(search, escape="\\"))
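The wildcard leak can be demonstrated against SQLite as a stand-in engine (the helper is re-implemented locally; the table and rows are invented for illustration):

```python
import sqlite3

def escape_like_pattern(pattern: str) -> str:
    # Local re-implementation of the PR's helper
    return pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?)",
    [("50% off flyer",), ("50 items",), ("1500 records",)],
)

search = "50%"

# Unescaped: the user's '%' acts as a wildcard -> every name containing '50'
leaky = conn.execute(
    "SELECT name FROM documents WHERE name LIKE ?", (f"%{search}%",)
).fetchall()

# Escaped, with a matching ESCAPE clause -> only names containing literal '50%'
exact = conn.execute(
    "SELECT name FROM documents WHERE name LIKE ? ESCAPE '\\'",
    (f"%{escape_like_pattern(search)}%",),
).fetchall()

print(len(leaky))  # 3
print(exact)       # [('50% off flyer',)]
```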
