Refactor to use DBMS_HYBRID_SEARCH.SEARCH directly#148
Refactor to use DBMS_HYBRID_SEARCH.SEARCH directly#148aopstudio wants to merge 1 commit intooceanbase:developfrom
Conversation
Replaces the two-step process of calling GET_SQL and executing the returned SQL with a direct call to DBMS_HYBRID_SEARCH.SEARCH. This simplifies the search logic and parses the JSON result directly, improving efficiency and maintainability.
|
|
📝 WalkthroughWalkthroughThis change refactors the hybrid search workflow in the client base by replacing a two-step pattern (DBMS_HYBRID_SEARCH.GET_SQL followed by SQL execution) with a direct DBMS_HYBRID_SEARCH.SEARCH call. The code now processes search results directly instead of generating and executing SQL, updating result parsing, error handling, and logging accordingly. Changes
Possibly related issues
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/client_base.py`:
- Line 2570: Replace the redundant f-string in the logging call so it isn't
using an f-string with no placeholders; change the logger.debug call that
currently reads logger.debug(f"Search result received from
DBMS_HYBRID_SEARCH.SEARCH") to use a plain string logger.debug("Search result
received from DBMS_HYBRID_SEARCH.SEARCH") in the same function where
logger.debug is called.
- Around line 2564-2569: The code reads search_result_json and passes it to
_transform_sql_result but doesn't validate its schema; if search_result_json is
a dict (e.g., a nested response with a 'hits' key from
DBMS_HYBRID_SEARCH.SEARCH) _transform_sql_result will iterate keys and corrupt
output. Update the parsing in the block that sets search_result (inspect
variable search_result_json) to detect dict-shaped responses: if it's a dict,
either extract the list of rows (e.g., search_result_json["hits"] or similar) or
pass the dict to _transform_search_result to obtain a list[dict[str, Any]]
before calling _transform_sql_result; ensure the final search_result is
validated as a list of dicts before further processing and raise or log a clear
error if the schema is unexpected.
- Around line 2549-2553: The SQL currently injects table_name directly into
search_query; instead build the query with a parameter placeholder and pass
table_name as a bound parameter to _execute_query_with_cursor to prevent SQL
injection: change the constructed search_query to use a %s placeholder for the
DBMS_HYBRID_SEARCH.SEARCH first argument (e.g. DBMS_HYBRID_SEARCH.SEARCH(%s,
`@search_parm`) ...) and call self._execute_query_with_cursor(conn, search_query,
[table_name], use_context_manager) so the table name from
CollectionNames.table_name() is bound instead of interpolated.
| # Call DBMS_HYBRID_SEARCH.SEARCH directly instead of GET_SQL + SQL execution | ||
| search_query = f"SELECT DBMS_HYBRID_SEARCH.SEARCH('{table_name}', @search_parm) as search_result FROM dual" | ||
| logger.debug(f"Executing search query: {search_query}") | ||
|
|
||
| rows = self._execute_query_with_cursor(conn, get_sql_query, [], use_context_manager) | ||
|
|
||
| if not rows or not rows[0].get("query_sql"): | ||
| logger.warning("No SQL query returned from GET_SQL") | ||
| rows = self._execute_query_with_cursor(conn, search_query, [], use_context_manager) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Find the method containing lines 2549-2553
rg -n "_collection_hybrid_search|def.*hybrid" src/pyseekdb/client/client_base.py -B 5 -A 30 | head -100Repository: oceanbase/pyseekdb
Length of output: 1876
🏁 Script executed:
# Check the signature of _execute_query_with_cursor
rg -n "_execute_query_with_cursor" src/pyseekdb/client/client_base.py -B 2 -A 10 | head -60Repository: oceanbase/pyseekdb
Length of output: 2494
🏁 Script executed:
# Understand how table_name is derived in the hybrid search context
rg -n "table_name.*=" src/pyseekdb/client/client_base.py | grep -A 2 -B 2 "2[45]"Repository: oceanbase/pyseekdb
Length of output: 929
🏁 Script executed:
# Check what _transform_sql_result expects
rg -n "_transform_sql_result" src/pyseekdb/client/client_base.py -B 2 -A 5Repository: oceanbase/pyseekdb
Length of output: 770
🏁 Script executed:
# Find CollectionNames class and table_name methods
rg -n "class CollectionNames|def table_name" src/pyseekdb -B 2 -A 8Repository: oceanbase/pyseekdb
Length of output: 1576
🏁 Script executed:
# Look for other uses of DBMS_HYBRID_SEARCH in the codebase
rg -n "DBMS_HYBRID_SEARCH" src/pyseekdb --context 3Repository: oceanbase/pyseekdb
Length of output: 1876
🏁 Script executed:
# Check how parameters are used in other execute_query_with_cursor calls
rg -n "_execute_query_with_cursor.*\[" src/pyseekdb/client/client_base.py | head -20Repository: oceanbase/pyseekdb
Length of output: 331
🏁 Script executed:
# Look at lines 2549-2570 more carefully to understand the full context
sed -n '2540,2580p' src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 1873
🏁 Script executed:
# Find escape_string function
rg -n "def escape_string|escape_string" src/pyseekdb --context 2 | head -40Repository: oceanbase/pyseekdb
Length of output: 2960
🏁 Script executed:
# Check how other SQL queries handle table_name interpolation
rg -n "SELECT.*\`.*table" src/pyseekdb/client/client_base.py | head -10Repository: oceanbase/pyseekdb
Length of output: 460
🏁 Script executed:
# Look for backtick usage around table_name in other queries
rg -n "\`.*table_name" src/pyseekdb/client/client_base.py | head -15Repository: oceanbase/pyseekdb
Length of output: 1795
🏁 Script executed:
# Check UPDATE statement with backticks (from earlier search showing line 1525)
sed -n '1520,1530p' src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 450
Escape or parameterize the table_name in the DBMS_HYBRID_SEARCH.SEARCH call to prevent SQL injection.
The table_name is derived from user-supplied collection_name via CollectionNames.table_name() and is interpolated directly into the SQL string without escaping or parameterization. Although the preceding SET statement uses escape_string() for the search parameters, this call lacks protection. Use parameter binding with %s to align with the parameterized approach already supported by _execute_query_with_cursor.
Proposed fix
- search_query = f"SELECT DBMS_HYBRID_SEARCH.SEARCH('{table_name}', `@search_parm`) as search_result FROM dual"
- logger.debug(f"Executing search query: {search_query}")
-
- rows = self._execute_query_with_cursor(conn, search_query, [], use_context_manager)
+ search_query = "SELECT DBMS_HYBRID_SEARCH.SEARCH(%s, `@search_parm`) as search_result FROM dual"
+ logger.debug(f"Executing search query: {search_query}")
+
+ rows = self._execute_query_with_cursor(conn, search_query, [table_name], use_context_manager)🧰 Tools
🪛 Ruff (0.14.14)
2550-2550: Possible SQL injection vector through string-based query construction
(S608)
🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` around lines 2549 - 2553, The SQL
currently injects table_name directly into search_query; instead build the query
with a parameter placeholder and pass table_name as a bound parameter to
_execute_query_with_cursor to prevent SQL injection: change the constructed
search_query to use a %s placeholder for the DBMS_HYBRID_SEARCH.SEARCH first
argument (e.g. DBMS_HYBRID_SEARCH.SEARCH(%s, `@search_parm`) ...) and call
self._execute_query_with_cursor(conn, search_query, [table_name],
use_context_manager) so the table name from CollectionNames.table_name() is
bound instead of interpolated.
| # Parse the search result JSON | ||
| search_result_json = rows[0]["search_result"] | ||
| if isinstance(search_result_json, str): | ||
| search_result = json.loads(search_result_json) | ||
| else: | ||
| search_result = search_result_json |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the file structure and the specific lines mentioned
wc -l src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 101
🏁 Script executed:
# Read the context around lines 2564-2573
sed -n '2540,2590p' src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 2233
🏁 Script executed:
# Find the _transform_sql_result method definition
rg -n "_transform_sql_result" src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 182
🏁 Script executed:
# Find the _transform_search_result method definition
rg -n "_transform_search_result" src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 182
🏁 Script executed:
# Search for DBMS_HYBRID_SEARCH.SEARCH usage
rg -n "DBMS_HYBRID_SEARCH\.SEARCH" src/pyseekdb/client/client_base.py -A 5 -B 5Repository: oceanbase/pyseekdb
Length of output: 1494
🏁 Script executed:
# Read _transform_sql_result implementation
sed -n '3007,3060p' src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 2030
🏁 Script executed:
# Read _transform_search_result implementation
sed -n '3107,3170p' src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 2271
🏁 Script executed:
# Check type hints for both methods
sed -n '3007,3010p' src/pyseekdb/client/client_base.py
sed -n '3107,3110p' src/pyseekdb/client/client_base.pyRepository: oceanbase/pyseekdb
Length of output: 539
Validate SEARCH response schema before transforming.
_transform_sql_result expects list[dict[str, Any]], but will silently corrupt output if passed a dict with nested structure (it would iterate over dict keys instead of rows). DBMS_HYBRID_SEARCH.SEARCH may return either a flat list or a nested dict with hits key, but the code provides no schema validation. Use the existing _transform_search_result method for dict responses or extract rows accordingly.
Suggested validation
- # Transform search result to standard format
- return self._transform_sql_result(search_result, include)
+ # Transform search result to standard format
+ if isinstance(search_result, dict) and "hits" in search_result:
+ return self._transform_search_result(search_result, include)
+ if isinstance(search_result, dict) and "rows" in search_result:
+ search_result = search_result["rows"]
+ if not isinstance(search_result, list):
+ raise ValueError(f"Unexpected SEARCH result schema: {type(search_result)}")
+ return self._transform_sql_result(search_result, include)🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` around lines 2564 - 2569, The code reads
search_result_json and passes it to _transform_sql_result but doesn't validate
its schema; if search_result_json is a dict (e.g., a nested response with a
'hits' key from DBMS_HYBRID_SEARCH.SEARCH) _transform_sql_result will iterate
keys and corrupt output. Update the parsing in the block that sets search_result
(inspect variable search_result_json) to detect dict-shaped responses: if it's a
dict, either extract the list of rows (e.g., search_result_json["hits"] or
similar) or pass the dict to _transform_search_result to obtain a list[dict[str,
Any]] before calling _transform_sql_result; ensure the final search_result is
validated as a list of dicts before further processing and raise or log a clear
error if the schema is unexpected.
| search_result = json.loads(search_result_json) | ||
| else: | ||
| search_result = search_result_json | ||
| logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH") |
There was a problem hiding this comment.
Remove redundant f-string.
f"..." without placeholders triggers F541.
🧹 Proposed fix
- logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")
+ logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH") | |
| logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH") |
🧰 Tools
🪛 Ruff (0.14.14)
2570-2570: f-string without any placeholders
Remove extraneous f prefix
(F541)
🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` at line 2570, Replace the redundant
f-string in the logging call so it isn't using an f-string with no placeholders;
change the logger.debug call that currently reads logger.debug(f"Search result
received from DBMS_HYBRID_SEARCH.SEARCH") to use a plain string
logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH") in the
same function where logger.debug is called.
|
Please sign the CLA and fix the CI/quality first. |
| search_result = json.loads(search_result_json) | ||
| else: | ||
| search_result = search_result_json | ||
| logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH") |
There was a problem hiding this comment.
There is no placeholders in this f-string, please check if this is intentional
Summary
Refactor to use DBMS_HYBRID_SEARCH.SEARCH directly
Solution Description
Replaces the two-step process of calling GET_SQL and executing the returned SQL with a direct call to DBMS_HYBRID_SEARCH.SEARCH. This simplifies the search logic and parses the JSON result directly, improving efficiency and maintainability.
Notes
This PR is for the AI coding activity and addresses issue #118.
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.