fix(api): refactors the SQL LIKE pattern escaping logic to use a centralized utility function, ensuring consistent and secure handling of special characters across all database queries.#12
Conversation
…ralized utility function, ensuring consistent and secure handling of special characters across all database queries. Signed-off-by: NeatGuyCoding <15627489+NeatGuyCoding@users.noreply.github.com>
…logic Signed-off-by: NeatGuyCoding <15627489+NeatGuyCoding@users.noreply.github.com>
```diff
 escaped_query = escape_like_pattern(query)
 query_pattern = f"%{escaped_query}%"
 sql = f"""
     SELECT TOP {top_k} id, text, meta
     FROM {self.schema}.{self.table_name}
-    WHERE text LIKE ?
+    WHERE text LIKE ? ESCAPE '|'
```
🔴 The escape_like_pattern() function escapes special characters using backslash (e.g., % → \%), but the IRIS SQL query at line 298 uses ESCAPE '|' (pipe), so the database expects pipe-prefixed escapes but receives backslash-prefixed ones — making the escaping completely ineffective. Similarly, the ClickZetta query at clickzetta_vector.py:991 uses ESCAPE '\\' which may produce a two-character escape string depending on the SQL dialect, when ESCAPE requires exactly one character. Both ESCAPE clauses should use backslash to match what escape_like_pattern() produces.
Extended reasoning...
What the bug is
The escape_like_pattern() utility function (defined in api/libs/helper.py) escapes SQL LIKE special characters using backslash as the escape character: % becomes \%, _ becomes \_, and \ becomes \\. This function is used consistently across the PR to sanitize user input before LIKE queries.
However, two raw-SQL database implementations specify a different ESCAPE character than backslash, creating a mismatch that renders the escaping completely ineffective.
IRIS: Wrong ESCAPE character
In api/core/rag/datasource/vdb/iris/iris_vector.py at line 298, the SQL query specifies ESCAPE '|' (pipe character). The code calls escape_like_pattern(query) on line 293, which produces patterns like 50\% discount. When the database receives this pattern with ESCAPE '|', it looks for pipe-prefixed characters as escaped wildcards — but the pattern contains backslash-prefixed characters. The \ is treated as a literal backslash and % remains a wildcard. The escaping has zero effect.
ClickZetta: Ambiguous ESCAPE string
In api/core/rag/datasource/vdb/clickzetta/clickzetta_vector.py at line 991, the Python string literal `ESCAPE '\\\\'` evaluates to the SQL text `ESCAPE '\\'` (two backslash characters between the quotes). How this is interpreted depends on ClickZetta's SQL dialect: if it follows ANSI SQL (where backslash is not special in string literals), `'\\'` is a two-character string, but the ESCAPE clause requires exactly one character — this would cause a SQL error or undefined behavior. The code on line 980 uses `''` for quote escaping (ANSI style), further suggesting ClickZetta follows ANSI conventions.
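As a quick, self-contained illustration (a hypothetical Python snippet, not the actual ClickZetta code), the string-literal layering can be checked directly:

```python
# Assumption: the Python source uses a regular (non-raw) string, so the
# four-backslash source form "\\\\" collapses to two backslash characters.
clause = "ESCAPE '\\\\'"   # hypothetical Python source fragment
print(clause)              # the SQL text the database receives: ESCAPE '\\'

# Characters between the single quotes in the SQL text:
between_quotes = clause[len("ESCAPE '"):-1]
print(len(between_quotes))  # 2 -- two characters; ANSI SQL requires the
                            # ESCAPE clause to name exactly ONE character
```

Under a MySQL-style dialect (backslash special inside string literals) the same SQL text would collapse back to one backslash and work; that dialect dependence is exactly the ambiguity flagged above.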
Step-by-step proof (IRIS)
1. User searches for `50% discount`
2. `escape_like_pattern("50% discount")` returns `"50\% discount"`
3. The query pattern becomes `"%50\% discount%"`
4. SQL executed: `WHERE text LIKE '%50\% discount%' ESCAPE '|'`
5. IRIS looks for `|` as the escape character prefix
6. It finds `\` (a literal backslash) followed by `%` (a wildcard) — neither is preceded by `|`
7. Result: `%` is still treated as a wildcard, matching any text containing `50` followed by anything followed by `discount`
8. This defeats the entire purpose of the escaping
Impact
For IRIS users, LIKE wildcard injection is not prevented despite the PR's intent. Special characters in search queries will still be interpreted as SQL wildcards, potentially returning incorrect results or enabling wildcard-based injection. For ClickZetta users, the query may fail entirely or produce undefined behavior depending on the SQL dialect.
Fix
For IRIS, change ESCAPE '|' to ESCAPE '\' (or ESCAPE '\\' depending on how IRIS handles backslash escaping in string literals). For ClickZetta, verify the SQL dialect's string literal handling and ensure the ESCAPE clause specifies exactly one backslash character. Alternatively, escape_like_pattern could be adapted to use a per-database escape character.
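A minimal sketch of the intended pairing, assuming `escape_like_pattern()` behaves as described above (this re-implementation is for illustration only, not the actual api/libs/helper.py code):

```python
def escape_like_pattern(pattern: str) -> str:
    # Backslash-based escaping, mirroring the behavior described in this review
    if not pattern:
        return pattern
    return pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

escaped = escape_like_pattern("50% discount")
print(escaped)  # 50\% discount

# The helper emits backslash escapes, so the ESCAPE clause must name backslash.
# With ESCAPE '|' the backslash above is just a literal character and '%'
# remains a live wildcard.
sql = "SELECT id FROM docs WHERE text LIKE ? ESCAPE '\\'"
print(sql)
```

Whatever single character the dialect ends up accepting, the invariant is that it must be the same character the helper emits.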
```diff
 keyword_trimmed = keyword[:30]
 escaped_keyword = escape_like_pattern(keyword_trimmed)
 keyword_like_val = f"%{escaped_keyword}%"
 keyword_conditions = [
-    WorkflowRun.inputs.ilike(keyword_like_val),
     WorkflowRun.outputs.ilike(keyword_like_val),
+    WorkflowRun.inputs.ilike(keyword_like_val, escape="\\"),
```
🔴 Removing the unicode_escape encoding from keyword search in workflow_app_service.py breaks non-ASCII keyword searches (e.g., Chinese, Japanese) against WorkflowRun.inputs/outputs. These fields are stored via json.dumps() with default ensure_ascii=True, so non-ASCII characters like 中文 are persisted as \u4e2d\u6587. The new code searches for the literal Unicode characters which will never match the escaped form in the database.
Extended reasoning...
What the bug is
The old code at line 89 performed a Unicode escape transformation on the search keyword:

```python
keyword_like_val = f"%{keyword[:30].encode('unicode_escape').decode('utf-8')}%".replace(r"\u", r"\\u")
```

This converted non-ASCII characters (e.g., Chinese 中文) into their `\uXXXX` escape-sequence form before using them in a SQL LIKE pattern. The new code simply passes the keyword through `escape_like_pattern()`, which only escapes `%`, `_`, and `\` — it does not convert Unicode characters to escape sequences.
Why this matters
The WorkflowRun.inputs and WorkflowRun.outputs fields are stored using json.dumps(domain_model.inputs) at api/core/repositories/sqlalchemy_workflow_execution_repository.py:152. Critically, this call does not pass ensure_ascii=False, so Python uses the default ensure_ascii=True. This means any non-ASCII characters in the JSON data are stored as \uXXXX escape sequences in the database TEXT column.
For example, if a workflow input contains the Chinese text 中文, it is stored in the database as the literal string \u4e2d\u6587.
Step-by-step proof
1. A user runs a workflow with input `{"query": "中文测试"}` (Chinese text meaning "Chinese test")
2. The repository calls `json.dumps({"query": "中文测试"})`, which produces `{"query": "\u4e2d\u6587\u6d4b\u8bd5"}` (because `ensure_ascii=True` by default)
3. This escaped string is stored in the `WorkflowRun.inputs` TEXT column
4. A user searches for workflow logs with keyword `中文`
5. Old behavior: the code would encode `中文` to `\u4e2d\u6587` and search for `%\u4e2d\u6587%` — this MATCHES the stored data
6. New behavior: the code passes `中文` through `escape_like_pattern()`, which returns `中文` unchanged, then searches for `%中文%` — this does NOT match `\u4e2d\u6587` in the database
Impact
Any search for non-ASCII text (Chinese, Japanese, Korean, Arabic, etc.) in workflow run inputs or outputs will silently return zero results, even when matching data exists. This is a significant regression for users in non-English locales who rely on keyword search in workflow logs.
How to fix
The fix should either:
1. Restore the `unicode_escape` encoding logic for the keyword before searching `WorkflowRun.inputs`/`outputs`, or
2. Change the repository to use `json.dumps(domain_model.inputs, ensure_ascii=False)` so that actual Unicode characters are stored in the database (matching the pattern used in `logstore_workflow_execution_repository.py:112`), and then migrate existing data.

Option 2 is the more correct long-term approach but requires a data migration.
```diff
 case "not contains":
-    filters.append(json_field.notlike(f"%{value}%"))
+    escaped_value = escape_like_pattern(str(value))
+    filters.append(json_field.notlike(f"%{escaped_value}%"))
```
🔴 Bug: The "not contains" case on line 1207 calls json_field.notlike(f"%{escaped_value}%") without the escape="\\" parameter. Every other like/notlike/ilike call in this PR includes escape="\\" to work with the backslash-based escaping from escape_like_pattern(). Without it, searching for literal %, _, or \ in a "not contains" metadata filter will produce incorrect results because the database won't interpret the escape sequences.
Extended reasoning...
What the bug is
In api/core/rag/retrieval/dataset_retrieval.py at line 1207, the "not contains" condition uses:

```python
filters.append(json_field.notlike(f"%{escaped_value}%"))
```

but it is missing the `escape="\\"` parameter that every other LIKE/notlike call in this function (and across the entire PR) includes.
How it manifests
The escape_like_pattern() function transforms special characters using backslash escaping: % becomes \%, _ becomes \_, and \ becomes \\. However, for the database to correctly interpret these escape sequences, the ESCAPE clause must be specified in the SQL query. Without escape="\\", the database treats \% as a literal backslash followed by a wildcard %, rather than as an escaped literal percent character.
Step-by-step proof
1. User creates a document with metadata field `category="50% off"`
2. User filters with condition `"not contains"` and value `"%"`
3. `escape_like_pattern("%")` returns `"\%"`
4. The pattern becomes `"%\%%"`, passed to `notlike()`
5. Without `escape="\\"`: the database sees `\` as a literal character and `%` as a wildcard. The pattern `%\%%` matches any string containing a literal backslash followed by anything — which does NOT match `"50% off"`. So the filter incorrectly includes `"50% off"` in the results (it should be excluded, since the value does contain `%`).
6. With `escape="\\"`: the database correctly interprets `\%` as an escaped literal `%`. The pattern matches any string containing a literal `%`, and `"50% off"` is correctly excluded from the "not contains" results.
Why existing code doesn't prevent it
This is simply an omission in the PR. Compare the "contains" case on line 1203 which correctly includes escape="\\", with the "not contains" case on line 1207 which does not. The "start with" (line 1211) and "end with" (line 1215) cases also correctly include the escape clause.
Impact
Metadata filters using the "not contains" condition with values containing %, _, or \ will return incorrect results. The escaping done by escape_like_pattern() is rendered ineffective for this specific condition, defeating the purpose of the entire PR for this code path.
How to fix
Add escape="\\" to line 1207:
```python
filters.append(json_field.notlike(f"%{escaped_value}%", escape="\\"))
```

```python
def escape_like_pattern(pattern: str) -> str:
    """
    Escape special characters in a string for safe use in SQL LIKE patterns.

    This function escapes the special characters used in SQL LIKE patterns:
    - Backslash (\\) -> \\\\
    - Percent (%) -> \\%
    - Underscore (_) -> \\_

    The escaped pattern can then be safely used in SQL LIKE queries with the
    ESCAPE '\\' clause to prevent SQL injection via LIKE wildcards.

    Args:
        pattern: The string pattern to escape

    Returns:
        Escaped string safe for use in SQL LIKE queries

    Examples:
        >>> escape_like_pattern("50% discount")
        '50\\% discount'
        >>> escape_like_pattern("test_data")
        'test\\_data'
        >>> escape_like_pattern("path\\to\\file")
        'path\\\\to\\\\file'
    """
    if not pattern:
        return pattern
    # Escape backslash first, then percent and underscore
    return pattern.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
```
🟣 This PR misses two Document.name.like() locations that use the same vulnerable pattern it fixes elsewhere: api/controllers/console/datasets/datasets_document.py:266 and api/controllers/service_api/dataset/document.py:476. Both pass unescaped request.args.get("keyword") directly into f"%{search}%" without calling escape_like_pattern(), meaning a search for "50%" would match all documents. This is a pre-existing issue not introduced by this PR, but it falls squarely within the stated scope of this centralization effort.
Extended reasoning...
What the bug is
The PR introduces escape_like_pattern() and systematically applies it to 12+ locations where user input is used in SQL LIKE patterns (e.g., Dataset.name.ilike, App.name.ilike, Tag.name.ilike, DocumentSegment.content.ilike). However, two locations with the identical vulnerable pattern were missed:
- api/controllers/console/datasets/datasets_document.py:264-266
- api/controllers/service_api/dataset/document.py:474-476
Both use Document.name.like(f"%{search}%") where search comes directly from request.args.get("keyword").
The specific code path
In both files, the pattern is identical:
```python
search = request.args.get("keyword")
if search:
    search = f"%{search}%"
    query = query.where(Document.name.like(search))
```

The user-supplied keyword parameter is interpolated directly into the LIKE pattern without any escaping. Compare this to the fixed pattern in, e.g., api/services/dataset_service.py:147:

```python
escaped_search = helper.escape_like_pattern(search)
query = query.where(Dataset.name.ilike(f"%{escaped_search}%", escape="\\"))
```

Why existing code does not prevent it
These two files are not in the PR changed file list. The PR author likely searched for common patterns but missed these two controllers, possibly because they use .like() (case-sensitive) rather than .ilike() (case-insensitive), or because they are in controller files rather than service files.
Step-by-step proof
1. User navigates to the document list endpoint (console or service API)
2. User enters keyword "50%" to search for documents containing the literal string "50%"
3. The code constructs `search = "%50%%"` — the `%` from the user input acts as a wildcard
4. The LIKE pattern `%50%%` matches any document name containing "50" followed by anything, effectively matching ALL documents that contain "50" anywhere (e.g., "50 items", "1500 records")
5. With the fix applied elsewhere in this PR, the pattern would be `%50\%%` with `ESCAPE '\'`, correctly matching only documents containing the literal string "50%"
Impact
Users searching for documents with special characters (`%`, `_`, `\`) in their names will get incorrect results. The `%` wildcard is particularly impactful, as it matches any sequence of characters, causing searches like "50%" to return far more results than expected.
How to fix
Apply the same pattern used throughout the PR:

```python
from libs.helper import escape_like_pattern

if search:
    escaped_search = escape_like_pattern(search)
    search = f"%{escaped_search}%"
    query = query.where(Document.name.like(search, escape="\\"))
```
Benchmark PR from agentic-review-benchmarks#12