feat: implement lexical index search #11

Merged
programmer-ke merged 1 commit into master from feat/lexical-search
Jun 16, 2026

@programmer-ke programmer-ke commented Jun 16, 2026

Owner

This allows retrieving results from the index that match a user's query.

This PR implements the search-index use case and its required dependencies.

Summary by CodeRabbit

  • New Features

    • Added search capability to the document index
    • Query validation now rejects empty or whitespace-only searches
    • Search results can be limited via a configurable parameter (default: 10 results)
  • Documentation

    • Updated documentation to clarify that document chunks are indexed for lexical search
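
As an illustration of the query validation summarized above, here is a minimal sketch of a frozen dataclass that trims its text and rejects empty input. The names `Query` and `InvalidQueryError` come from this PR, but the body, the `ValueError` base class, and the error message are assumptions, not the merged implementation.

```python
from dataclasses import dataclass


class InvalidQueryError(ValueError):
    """Raised for empty queries (the ValueError base class is an assumption)."""


@dataclass(frozen=True)
class Query:
    text: str

    def __post_init__(self) -> None:
        cleaned = self.text.strip()
        if not cleaned:
            raise InvalidQueryError("query text must not be empty")
        # Frozen dataclasses block normal attribute assignment, so the
        # normalized value is stored via object.__setattr__ during init.
        object.__setattr__(self, "text", cleaned)
```

Normalizing in `__post_init__` means every `Query` that exists is already valid, so downstream code does not need to re-check the text.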

@coderabbitai

coderabbitai Bot commented Jun 16, 2026


📝 Walkthrough

Adds search capability to the docs-buddy system. New Query and QueryResult domain dataclasses are introduced alongside a search_index use case function. WhooshDocumentIndex gains persistent index initialization and a search method, plus a new WhooshIndexError. Fake adapters and unit/integration tests cover the full search path.

Changes

Index Search Feature

| Layer | File(s) | Summary |
|---|---|---|
| Domain Query/QueryResult types and validation | `src/docs_buddy/domain/__init__.py`, `tests/unit/test_domain.py` | Adds InvalidQueryError, frozen Query dataclass (trims and rejects empty text), and QueryResult with JSON `__str__`. Unit test covers normalization, empty, and whitespace-only inputs. |
| search_index use case and DocumentIndex protocol | `src/docs_buddy/services/use_cases.py`, `tests/unit/test_services.py` | Adds DEFAULT_MAX_RESULTS, SearchIndexError, DocumentIndex.search protocol method, and search_index function enforcing max_results >= 1. Unit tests verify result count limiting and invalid max_results raises SearchIndexError. |
| WhooshDocumentIndex persistent init and search | `src/docs_buddy/adapters/whoosh_index.py` | Defines WhooshIndexError, promotes schema to class-level _SCHEMA, makes constructor accept optional index_location to open existing index and build MultifieldParser, and adds search method that validates init, parses query, executes Whoosh search, and maps hits to QueryResult. |
| Fake adapter search helpers and integration tests | `src/docs_buddy/adapters/__init__.py`, `tests/integration/test_adapters.py` | Exports WhooshIndexError. Adds destination_content properties to FakeIntermediateStorage and FakeDocumentChunksPipeline. Adds FakeIndex.search with case-insensitive substring filter and max_results truncation. Integration tests verify error on uninitialized search and result count limiting on a fitted index. |
| README and todo.md minor updates | `README.md`, `todo.md` | README heading gains a leading space; indexing TODO narrowed to lexical-only. todo.md adds "address todo comments" entry. |
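
For readers unfamiliar with the "QueryResult with JSON `__str__`" pattern mentioned in the table, a rough sketch follows. The field names `chunk`, `path`, and `metadata` are inferred from the positional call `domain.QueryResult(c.chunk, c.path, c.metadata)` shown later in this review; the exact field order, types, and JSON layout are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class QueryResult:
    # Field names are assumed from the call site in the review, not confirmed.
    chunk: str
    path: str
    metadata: dict = field(default_factory=dict)

    def __str__(self) -> str:
        # Serialize all fields as one JSON object so results can be
        # printed or piped to other tools without custom formatting.
        return json.dumps(asdict(self), ensure_ascii=False)
```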

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant search_index
  participant DocumentIndex
  participant WhooshDocumentIndex
  Caller->>search_index: search_index(query: Query, index, max_results)
  alt max_results < 1
    search_index-->>Caller: raises SearchIndexError
  else valid
    search_index->>DocumentIndex: search(query, max_results)
    DocumentIndex->>WhooshDocumentIndex: MultifieldParser.parse(query.text)
    WhooshDocumentIndex->>WhooshDocumentIndex: searcher.search(parsed, limit=max_results)
    WhooshDocumentIndex-->>DocumentIndex: hits → QueryResult list (JSON metadata)
    DocumentIndex-->>search_index: list[QueryResult]
    search_index-->>Caller: list[QueryResult]
  end
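
The flow in the diagram can be sketched in a few lines of Python. This is a simplified illustration, not the merged code: it passes plain strings instead of the PR's `Query` and `QueryResult` types, and `InMemoryIndex` is a hypothetical stand-in for a real `DocumentIndex` implementation.

```python
from typing import Protocol

DEFAULT_MAX_RESULTS = 10  # default stated in the PR summary


class SearchIndexError(Exception):
    """Raised when search parameters are invalid."""


class DocumentIndex(Protocol):
    def search(self, query: str, max_results: int) -> list[str]: ...


def search_index(
    query: str, index: DocumentIndex, max_results: int = DEFAULT_MAX_RESULTS
) -> list[str]:
    # Reject invalid limits before touching the index.
    if max_results < 1:
        raise SearchIndexError(f"max_results must be >= 1, got {max_results}")
    return index.search(query, max_results)


class InMemoryIndex:
    """Hypothetical stand-in index: case-insensitive substring match."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def search(self, query, max_results):
        needle = query.lower()
        return [c for c in self._chunks if needle in c.lower()][:max_results]
```

Keeping the `max_results` check in the use case rather than in each adapter means every index implementation gets the same guard for free.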
Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • programmer-ke/docs-buddy#9: Established the foundational Whoosh-based lexical chunk indexing and fake adapter infrastructure that this PR extends with search, WhooshIndexError, and domain.Query/QueryResult.

Poem

🐇 A query hops in, text trimmed just right,
Through Whoosh fields it searches the night.
max_results guards the burrow door —
No empty questions, no results galore!
The index speaks: "Your chunks are found!" 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 38.71% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: implementing lexical index search functionality, which aligns with the PR objectives and the substantial changes across domain, adapter, and service layers. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/docs_buddy/adapters/whoosh_index.py (1)

54-72: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Initialize searchable state after fit so a fitted instance can be queried.

fit commits data but leaves self._index unset, so search on the same instance always raises WhooshIndexError even after successful indexing.

Proposed fix
     def fit(
         self, chunks: Iterator[domain.DocumentChunk], destination: PathLike
     ) -> None:
@@
         writer.commit()
+        self._index = ix
+        self._query_parser = qparser.MultifieldParser(
+            self._SEARCH_FIELDS,
+            schema=self._SCHEMA,
+            group=qparser.OrGroup,
+        )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/docs_buddy/adapters/whoosh_index.py` around lines 54 - 72, The fit method
creates and commits data to the Whoosh index but fails to store the index
instance in self._index, causing subsequent search calls on the same instance to
fail. After the writer.commit() call in the fit method, assign the created index
instance (ix) to self._index so that the searchable state is properly
initialized and search queries can use the indexed data.
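
A regression test for this finding could look roughly like the sketch below. It deliberately avoids Whoosh and uses a hypothetical `ToyIndex` with the same shape (an `_index` attribute that `search` checks), so it only illustrates the fit-then-search contract, not the real adapter.

```python
class IndexNotReadyError(Exception):
    """Stand-in for WhooshIndexError in this toy example."""


class ToyIndex:
    def __init__(self):
        self._index = None  # nothing searchable until fit() runs

    def fit(self, chunks):
        built = list(chunks)
        # The fix under discussion: keep a handle on what was just built,
        # so the same instance can be searched afterwards.
        self._index = built

    def search(self, text, max_results):
        if self._index is None:
            raise IndexNotReadyError("index is not initialized")
        needle = text.lower()
        return [c for c in self._index if needle in c.lower()][:max_results]


def test_fitted_instance_is_searchable():
    index = ToyIndex()
    index.fit(["alpha chunk", "beta chunk"])
    assert index.search("alpha", max_results=5) == ["alpha chunk"]
```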
🧹 Nitpick comments (1)
src/docs_buddy/adapters/__init__.py (1)

265-272: ⚡ Quick win

Align FakeIndex.search with real adapter search fields to avoid contract drift in tests.

The fake currently matches only chunk content, while WhooshDocumentIndex searches across content/metadata/path keywords. This can give false confidence in unit tests.

Proposed refactor
     def search(self, query, max_results):
         """Return results from the existing chunks"""
         chunks = self._pipeline.destination_content
+        needle = str(query).lower()
         return [
             domain.QueryResult(c.chunk, c.path, c.metadata)
             for c in chunks
-            if str(query).lower() in c.chunk.lower()
+            if needle in c.chunk.lower()
+            or needle in c.path.lower()
+            or needle in json.dumps(c.metadata).lower()
         ][:max_results]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/docs_buddy/adapters/__init__.py` around lines 265 - 272, FakeIndex.search
matches only against the chunk field, while the real WhooshDocumentIndex
searches across chunk content, metadata, and path. Update the filter condition
that checks str(query).lower() in c.chunk.lower() so that it matches when the
query appears in any of the three fields: chunk, metadata (as a string), or
path. This keeps the fake adapter's search behavior aligned with the real
adapter's contract and prevents false confidence in unit tests.
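
One way to keep the multi-field condition readable is to pull it into a small helper, sketched below. `matches_any_field` is a hypothetical name, and the approach assumes metadata is a JSON-serializable dict. Substring matching remains only an approximation of Whoosh's term-based matching, so the fake still cannot fully stand in for the real adapter.

```python
import json


def matches_any_field(needle: str, chunk: str, path: str, metadata: dict) -> bool:
    """Case-insensitive substring match across chunk, path, and metadata."""
    lowered = needle.lower()
    # Metadata is flattened to a JSON string so keys and values are searchable.
    haystacks = (chunk, str(path), json.dumps(metadata))
    return any(lowered in h.lower() for h in haystacks)
```

Note that `json.dumps` escapes non-ASCII characters by default, so queries containing such characters may need `ensure_ascii=False`.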

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4e32b699-27e4-4bc9-8fe3-76655af4ccba

📥 Commits

Reviewing files that changed from the base of the PR and between aac234c and 409aabb.

📒 Files selected for processing (9)
  • README.md
  • src/docs_buddy/adapters/__init__.py
  • src/docs_buddy/adapters/whoosh_index.py
  • src/docs_buddy/domain/__init__.py
  • src/docs_buddy/services/use_cases.py
  • tests/integration/test_adapters.py
  • tests/unit/test_domain.py
  • tests/unit/test_services.py
  • todo.md

@programmer-ke programmer-ke merged commit 0aefb42 into master Jun 16, 2026
1 check passed