Skip to content

Refactor to use DBMS_HYBRID_SEARCH.SEARCH directly#148

Open
aopstudio wants to merge 1 commit intooceanbase:developfrom
aopstudio:simple_hybrid_search
Open

Refactor to use DBMS_HYBRID_SEARCH.SEARCH directly#148
aopstudio wants to merge 1 commit intooceanbase:developfrom
aopstudio:simple_hybrid_search

Conversation

@aopstudio
Copy link

@aopstudio aopstudio commented Jan 28, 2026

Summary

Refactor to use DBMS_HYBRID_SEARCH.SEARCH directly

Solution Description

Replaces the two-step process of calling GET_SQL and executing the returned SQL with a direct call to DBMS_HYBRID_SEARCH.SEARCH. This simplifies the search logic and parses the JSON result directly, improving efficiency and maintainability.

Notes

This PR is for the AI coding activity and addresses issue #118.

Summary by CodeRabbit

  • Refactor
    • Optimized search query execution workflow. The internal search processing has been streamlined to reduce overhead and improve performance. These changes are internal only; all existing functionality, public APIs, and backward compatibility remain completely unchanged.

✏️ Tip: You can customize this high-level summary in your review settings.

Replaces the two-step process of calling GET_SQL and executing the returned SQL with a direct call to DBMS_HYBRID_SEARCH.SEARCH. This simplifies the search logic and parses the JSON result directly, improving efficiency and maintainability.
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai
Copy link

coderabbitai bot commented Jan 28, 2026

📝 Walkthrough

Walkthrough

This change refactors the hybrid search workflow in the client base by replacing a two-step pattern (DBMS_HYBRID_SEARCH.GET_SQL followed by SQL execution) with a direct DBMS_HYBRID_SEARCH.SEARCH call. The code now processes search results directly instead of generating and executing SQL, updating result parsing, error handling, and logging accordingly.

Changes

Cohort / File(s) Summary
Hybrid Search Flow Refactoring
src/pyseekdb/client/client_base.py
Replaced GET_SQL + execution path with direct SEARCH call; updated result parsing to handle stringified JSON/dict from SEARCH instead of executing returned SQL; modified error handling to warn on missing search_result; adjusted logging to reflect new query flow and result reception

Possibly related issues

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A rabbit's ode to streamlined paths:
Once SQL strings danced in the night,
Through GET_SQL gates, so roundabout tight—
Now SEARCH hops directly, swift and lean,
One call replaces the in-between!
The result flows pure, no strings to bind,
A simpler way, with cleaner mind. 🌙

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: replacing the two-step GET_SQL + execution flow with a direct DBMS_HYBRID_SEARCH.SEARCH call, which is the primary focus of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/pyseekdb/client/client_base.py`:
- Line 2570: Replace the redundant f-string in the logging call so it isn't
using an f-string with no placeholders; change the logger.debug call that
currently reads logger.debug(f"Search result received from
DBMS_HYBRID_SEARCH.SEARCH") to use a plain string logger.debug("Search result
received from DBMS_HYBRID_SEARCH.SEARCH") in the same function where
logger.debug is called.
- Around line 2564-2569: The code reads search_result_json and passes it to
_transform_sql_result but doesn't validate its schema; if search_result_json is
a dict (e.g., a nested response with a 'hits' key from
DBMS_HYBRID_SEARCH.SEARCH) _transform_sql_result will iterate keys and corrupt
output. Update the parsing in the block that sets search_result (inspect
variable search_result_json) to detect dict-shaped responses: if it's a dict,
either extract the list of rows (e.g., search_result_json["hits"] or similar) or
pass the dict to _transform_search_result to obtain a list[dict[str, Any]]
before calling _transform_sql_result; ensure the final search_result is
validated as a list of dicts before further processing and raise or log a clear
error if the schema is unexpected.
- Around line 2549-2553: The SQL currently injects table_name directly into
search_query; instead build the query with a parameter placeholder and pass
table_name as a bound parameter to _execute_query_with_cursor to prevent SQL
injection: change the constructed search_query to use a %s placeholder for the
DBMS_HYBRID_SEARCH.SEARCH first argument (e.g. DBMS_HYBRID_SEARCH.SEARCH(%s,
`@search_parm`) ...) and call self._execute_query_with_cursor(conn, search_query,
[table_name], use_context_manager) so the table name from
CollectionNames.table_name() is bound instead of interpolated.

Comment on lines +2549 to +2553
# Call DBMS_HYBRID_SEARCH.SEARCH directly instead of GET_SQL + SQL execution
search_query = f"SELECT DBMS_HYBRID_SEARCH.SEARCH('{table_name}', @search_parm) as search_result FROM dual"
logger.debug(f"Executing search query: {search_query}")

rows = self._execute_query_with_cursor(conn, get_sql_query, [], use_context_manager)

if not rows or not rows[0].get("query_sql"):
logger.warning("No SQL query returned from GET_SQL")
rows = self._execute_query_with_cursor(conn, search_query, [], use_context_manager)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the method containing lines 2549-2553
rg -n "_collection_hybrid_search|def.*hybrid" src/pyseekdb/client/client_base.py -B 5 -A 30 | head -100

Repository: oceanbase/pyseekdb

Length of output: 1876


🏁 Script executed:

# Check the signature of _execute_query_with_cursor
rg -n "_execute_query_with_cursor" src/pyseekdb/client/client_base.py -B 2 -A 10 | head -60

Repository: oceanbase/pyseekdb

Length of output: 2494


🏁 Script executed:

# Understand how table_name is derived in the hybrid search context
rg -n "table_name.*=" src/pyseekdb/client/client_base.py | grep -A 2 -B 2 "2[45]"

Repository: oceanbase/pyseekdb

Length of output: 929


🏁 Script executed:

# Check what _transform_sql_result expects
rg -n "_transform_sql_result" src/pyseekdb/client/client_base.py -B 2 -A 5

Repository: oceanbase/pyseekdb

Length of output: 770


🏁 Script executed:

# Find CollectionNames class and table_name methods
rg -n "class CollectionNames|def table_name" src/pyseekdb -B 2 -A 8

Repository: oceanbase/pyseekdb

Length of output: 1576


🏁 Script executed:

# Look for other uses of DBMS_HYBRID_SEARCH in the codebase
rg -n "DBMS_HYBRID_SEARCH" src/pyseekdb --context 3

Repository: oceanbase/pyseekdb

Length of output: 1876


🏁 Script executed:

# Check how parameters are used in other execute_query_with_cursor calls
rg -n "_execute_query_with_cursor.*\[" src/pyseekdb/client/client_base.py | head -20

Repository: oceanbase/pyseekdb

Length of output: 331


🏁 Script executed:

# Look at lines 2549-2570 more carefully to understand the full context
sed -n '2540,2580p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 1873


🏁 Script executed:

# Find escape_string function
rg -n "def escape_string|escape_string" src/pyseekdb --context 2 | head -40

Repository: oceanbase/pyseekdb

Length of output: 2960


🏁 Script executed:

# Check how other SQL queries handle table_name interpolation
rg -n "SELECT.*\`.*table" src/pyseekdb/client/client_base.py | head -10

Repository: oceanbase/pyseekdb

Length of output: 460


🏁 Script executed:

# Look for backtick usage around table_name in other queries
rg -n "\`.*table_name" src/pyseekdb/client/client_base.py | head -15

Repository: oceanbase/pyseekdb

Length of output: 1795


🏁 Script executed:

# Check UPDATE statement with backticks (from earlier search showing line 1525)
sed -n '1520,1530p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 450


Escape or parameterize the table_name in the DBMS_HYBRID_SEARCH.SEARCH call to prevent SQL injection.

The table_name is derived from user-supplied collection_name via CollectionNames.table_name() and is interpolated directly into the SQL string without escaping or parameterization. Although the preceding SET statement uses escape_string() for the search parameters, this call lacks protection. Use parameter binding with %s to align with the parameterized approach already supported by _execute_query_with_cursor.

Proposed fix
-        search_query = f"SELECT DBMS_HYBRID_SEARCH.SEARCH('{table_name}', `@search_parm`) as search_result FROM dual"
-        logger.debug(f"Executing search query: {search_query}")
-
-        rows = self._execute_query_with_cursor(conn, search_query, [], use_context_manager)
+        search_query = "SELECT DBMS_HYBRID_SEARCH.SEARCH(%s, `@search_parm`) as search_result FROM dual"
+        logger.debug(f"Executing search query: {search_query}")
+
+        rows = self._execute_query_with_cursor(conn, search_query, [table_name], use_context_manager)
🧰 Tools
🪛 Ruff (0.14.14)

2550-2550: Possible SQL injection vector through string-based query construction

(S608)

🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` around lines 2549 - 2553, The SQL
currently injects table_name directly into search_query; instead build the query
with a parameter placeholder and pass table_name as a bound parameter to
_execute_query_with_cursor to prevent SQL injection: change the constructed
search_query to use a %s placeholder for the DBMS_HYBRID_SEARCH.SEARCH first
argument (e.g. DBMS_HYBRID_SEARCH.SEARCH(%s, `@search_parm`) ...) and call
self._execute_query_with_cursor(conn, search_query, [table_name],
use_context_manager) so the table name from CollectionNames.table_name() is
bound instead of interpolated.

Comment on lines +2564 to +2569
# Parse the search result JSON
search_result_json = rows[0]["search_result"]
if isinstance(search_result_json, str):
search_result = json.loads(search_result_json)
else:
search_result = search_result_json
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and the specific lines mentioned
wc -l src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 101


🏁 Script executed:

# Read the context around lines 2564-2573
sed -n '2540,2590p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 2233


🏁 Script executed:

# Find the _transform_sql_result method definition
rg -n "_transform_sql_result" src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 182


🏁 Script executed:

# Find the _transform_search_result method definition
rg -n "_transform_search_result" src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 182


🏁 Script executed:

# Search for DBMS_HYBRID_SEARCH.SEARCH usage
rg -n "DBMS_HYBRID_SEARCH\.SEARCH" src/pyseekdb/client/client_base.py -A 5 -B 5

Repository: oceanbase/pyseekdb

Length of output: 1494


🏁 Script executed:

# Read _transform_sql_result implementation
sed -n '3007,3060p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 2030


🏁 Script executed:

# Read _transform_search_result implementation
sed -n '3107,3170p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 2271


🏁 Script executed:

# Check type hints for both methods
sed -n '3007,3010p' src/pyseekdb/client/client_base.py
sed -n '3107,3110p' src/pyseekdb/client/client_base.py

Repository: oceanbase/pyseekdb

Length of output: 539


Validate SEARCH response schema before transforming.

_transform_sql_result expects list[dict[str, Any]], but will silently corrupt output if passed a dict with nested structure (it would iterate over dict keys instead of rows). DBMS_HYBRID_SEARCH.SEARCH may return either a flat list or a nested dict with hits key, but the code provides no schema validation. Use the existing _transform_search_result method for dict responses or extract rows accordingly.

Suggested validation
-        # Transform search result to standard format
-        return self._transform_sql_result(search_result, include)
+        # Transform search result to standard format
+        if isinstance(search_result, dict) and "hits" in search_result:
+            return self._transform_search_result(search_result, include)
+        if isinstance(search_result, dict) and "rows" in search_result:
+            search_result = search_result["rows"]
+        if not isinstance(search_result, list):
+            raise ValueError(f"Unexpected SEARCH result schema: {type(search_result)}")
+        return self._transform_sql_result(search_result, include)
🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` around lines 2564 - 2569, The code reads
search_result_json and passes it to _transform_sql_result but doesn't validate
its schema; if search_result_json is a dict (e.g., a nested response with a
'hits' key from DBMS_HYBRID_SEARCH.SEARCH) _transform_sql_result will iterate
keys and corrupt output. Update the parsing in the block that sets search_result
(inspect variable search_result_json) to detect dict-shaped responses: if it's a
dict, either extract the list of rows (e.g., search_result_json["hits"] or
similar) or pass the dict to _transform_search_result to obtain a list[dict[str,
Any]] before calling _transform_sql_result; ensure the final search_result is
validated as a list of dicts before further processing and raise or log a clear
error if the schema is unexpected.

search_result = json.loads(search_result_json)
else:
search_result = search_result_json
logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove redundant f-string.

f"..." without placeholders triggers F541.

🧹 Proposed fix
-        logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")
+        logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")
logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH")
🧰 Tools
🪛 Ruff (0.14.14)

2570-2570: f-string without any placeholders

Remove extraneous f prefix

(F541)

🤖 Prompt for AI Agents
In `@src/pyseekdb/client/client_base.py` at line 2570, Replace the redundant
f-string in the logging call so it isn't using an f-string with no placeholders;
change the logger.debug call that currently reads logger.debug(f"Search result
received from DBMS_HYBRID_SEARCH.SEARCH") to use a plain string
logger.debug("Search result received from DBMS_HYBRID_SEARCH.SEARCH") in the
same function where logger.debug is called.

@hnwyllmm
Copy link
Member

Please sign the CLA and fix the CI/quality first.
You can execute 'make pre-commit' to fix CI/quality.

search_result = json.loads(search_result_json)
else:
search_result = search_result_json
logger.debug(f"Search result received from DBMS_HYBRID_SEARCH.SEARCH")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no placeholders in this f-string, please check if this is intentional

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants