
index and query prs #82

Open
nick1udwig wants to merge 6 commits into similigh:main from nick1udwig:index-and-query-prs-3

Conversation


@nick1udwig nick1udwig commented Feb 18, 2026

Description

Index and query PRs, not just issues

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to change)
  • 📚 Documentation update
  • 🔧 Configuration/build change
  • ♻️ Refactoring (no functional changes)
  • 🧪 Test update

Related Issues

Fixes #43

Changes Made

  • adds PR indexing
  • adds PR querying (against PRs and issues)

Testing

  • I have run go build ./... successfully
  • I have run go test ./... successfully
  • I have run go vet ./... successfully
  • I have tested the changes locally

(I've tested with OpenAI models)

Screenshots (if applicable)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes

Additional Notes

Summary by CodeRabbit

  • New Features

    • PR indexing into a dedicated collection and metadata (including changed files) when enabled
    • New pr-duplicate command: AI-powered duplicate detection for a PR against issues/PRs with top-k, threshold, and JSON output
  • Documentation

    • CLI docs and workflow examples updated for PR indexing, pr-duplicate usage, and RFC3339 --since behavior
  • Configuration

    • Sample env entries for main and optional PR collection names
  • Tests

    • Unit tests added for PR duplicate candidate parsing and helpers

@nick1udwig nick1udwig requested a review from Kavirubc as a code owner February 18, 2026 18:50

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Adds optional pull-request indexing and a new simili pr-duplicate CLI command. Updates config/environment to support a separate PR Qdrant collection, extends GitHub and Gemini integrations, updates the index workflow to optionally index PR metadata, and adds CLI and tests for PR duplicate detection.

Changes

Cohort / File(s): Summary

  • Env & Config (.env.sample, .simili.yaml, internal/core/config/config.go): adds a QDRANT_COLLECTION sample, an optional QDRANT_PR_COLLECTION, the qdrant.pr_collection config, and a PRCollection field with defaulting/merge logic.
  • CLI docs (README.md): documents --include-prs, --pr-collection, the RFC3339 --since note, the new simili pr-duplicate command, and example workflows showing PR indexing and duplicate checking.
  • Indexing pipeline (cmd/simili/commands/index.go, cmd/simili/commands/pr_support.go): adds the --include-prs flag, resolves/creates the PR collection, Job.IsPullRequest handling, processPullRequest, PR metadata/text building, changed-files collection, embedding and upsert into the PR collection, and dry-run support.
  • New command & tests (cmd/simili/commands/pr_duplicate.go, cmd/simili/commands/pr_duplicate_test.go): adds the pr-duplicate command (fetch PR, build metadata, embed, search issue and PR collections, LLM-based duplicate detection, output/JSON rendering) plus unit tests for candidate parsing and extraction.
  • Gemini / LLM (internal/integrations/gemini/llm.go, internal/integrations/gemini/prompts.go): adds PR duplicate detection types and a DetectPRDuplicate method; adds a prompt builder for PR duplicate detection (JSON response contract). Note: a duplicate prompt block was observed in the diff.
  • GitHub integration (internal/integrations/github/client.go): adds GetPullRequest, ListPullRequests, and ListPullRequestFiles methods (paginated aggregation).
  • Batch & infra (cmd/simili/commands/batch.go, .gitignore): makes the LLM client optional at init (non-fatal failure handling); tightens the .gitignore pattern for /simili.
  • Misc (other small doc/config tweaks): docs, sample env, and manifest changes (small line additions/removals).

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as simili CLI
    participant GitHub as GitHub API
    participant Gemini as Gemini LLM
    participant Qdrant as Qdrant

    User->>CLI: simili pr-duplicate --repo owner/repo --number 123
    CLI->>GitHub: GetPullRequest(owner,repo,123)
    GitHub-->>CLI: PR metadata
    CLI->>GitHub: ListPullRequestFiles(owner,repo,123)
    GitHub-->>CLI: changed files
    CLI->>CLI: buildPRMetadataText(PR + files)
    CLI->>Gemini: embed PR text
    Gemini-->>CLI: embedding vector
    CLI->>Qdrant: search issue collection (embedding)
    Qdrant-->>CLI: issue candidates (similarity)
    CLI->>Qdrant: search PR collection (embedding)
    Qdrant-->>CLI: PR candidates (similarity)
    CLI->>Gemini: DetectPRDuplicate(PR + merged candidates)
    Gemini-->>CLI: duplicate decision (is_duplicate, id, confidence)
    CLI->>User: render results (or JSON)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • Kavirubc

Poem

🐰 A hop, a nibble, a PR in sight,
Into collections we tuck it tight.
Vectors hum and Gemini peeks,
Finding echoes across issue weeks.
Hooray — duplicates found (or not) tonight! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 15.63%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check (✅ Passed): check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title 'index and query prs' clearly and concisely summarizes the main feature addition of PR indexing and querying capabilities throughout the changeset.
  • Linked Issues Check (✅ Passed): the PR fully implements all coding requirements from issue #43: the --include-prs flag for PR indexing, QDRANT_PR_COLLECTION configuration, the pr-duplicate CLI command, the processPullRequest worker, and dry-run support for PR indexing.
  • Out of Scope Changes Check (✅ Passed): all changes align with the PR indexing objectives: configuration additions, CLI enhancements, worker implementations, LLM integration for duplicate detection, and GitHub API client extensions for PR operations.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (5)
cmd/simili/commands/batch.go (1)

342-349: LLM initialization failure is silently swallowed in non-verbose mode, giving no indication that LLM-powered features will be skipped.

When verbose is false and LLM initialization fails, users get no message, even though downstream steps will skip duplicate detection, triage analysis, and other LLM-dependent features. This can result in confusing batch output.

Consider always logging a message when LLM is unavailable, reserving verbose mode for the detailed error:

Suggested change
 	llm, err := gemini.NewLLMClient(llmKey, llmModel)
 	if err == nil {
 		deps.LLMClient = llm
 		if verbose {
 			fmt.Printf("✓ Initialized LLM client (%s) with model: %s\n", llm.Provider(), llm.Model())
 		}
-	} else if verbose {
-		fmt.Printf("ℹ LLM client unavailable, continuing without LLM steps: %v\n", err)
+	} else {
+		fmt.Println("ℹ LLM client unavailable, continuing without LLM steps")
+		if verbose {
+			fmt.Printf("  LLM init error: %v\n", err)
+		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/batch.go` around lines 342 - 349, The current LLM init
branch only logs failures when verbose is true, so when err != nil and verbose
is false the fact that deps.LLMClient is nil is silent; change the else branch
that handles llm initialization to always print a concise notice like "LLM
client unavailable, continuing without LLM steps" (using the same fmt.Printf or
logger) and only append the full error details when verbose is true (e.g.,
include err in the message when verbose). Update the block around the llm
variable / deps.LLMClient assignment in batch.go so failures are always
announced but detailed error text remains controlled by the verbose flag.
cmd/simili/commands/pr_duplicate.go (2)

444-505: buildCandidateFromSearchResult has good defensive fallbacks, but silently defaults state to "open" which could be misleading.

Lines 487-489: when state is missing from the payload, it defaults to "open". A missing state could also indicate a closed or merged entity. Consider defaulting to "unknown" instead, so downstream consumers (including the LLM) aren't given a potentially incorrect signal.

Proposed change
 	state, _ := res.Payload["state"].(string)
 	if state == "" {
-		state = "open"
+		state = "unknown"
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/pr_duplicate.go` around lines 444 - 505, The function
buildCandidateFromSearchResult currently defaults a missing payload "state" to
"open", which can mislead downstream consumers; change the fallback to use
"unknown" instead of "open" (update the variable state handling in
buildCandidateFromSearchResult where state is read from res.Payload and set when
empty), ensure the candidate.State field receives "unknown" for missing/empty
state, and run/adjust any tests or callers that assumed "open" as the implicit
default.

192-206: New GitHub client is created per invocation — acceptable for CLI, but note the token is used without context timeout.

fetchPullRequestMetadataText uses context.Background() (inherited from caller). For a CLI tool this is fine, but if the GitHub API hangs, there's no timeout. Consider adding a context with timeout for robustness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/pr_duplicate.go` around lines 192 - 206,
fetchPullRequestMetadataText currently uses the caller context directly and
creates a new GitHub client per call without a timeout; wrap the incoming ctx in
a context.WithTimeout (e.g., 10s or configurable) inside
fetchPullRequestMetadataText, defer cancel(), and use that timed context when
calling similiGithub.NewClient, ghClient.GetPullRequest and
listAllPullRequestFilePaths so API calls cannot hang indefinitely; ensure the
cancel function is called to release resources and make the timeout value
configurable if needed.
cmd/simili/commands/pr_duplicate_test.go (1)

20-45: Test uses int64 for number fields, but JSON-unmarshaled map[string]interface{} yields float64.

When Qdrant search results are deserialized from JSON into map[string]interface{}, Go's encoding/json represents JSON numbers as float64, not int64. The toInt helper handles both, so production code works, but these tests don't exercise the realistic float64 path. Consider adding a test case (or adjusting one) with float64(42) to cover the real-world deserialization type.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/pr_duplicate_test.go` around lines 20 - 45, The test
TestBuildCandidateFromSearchResult_Issue builds a fake qdrant.SearchResult with
"issue_number" as int64, but real JSON-unmarshaled payloads use float64; update
the test payload to use float64(42) (or add an additional test case) so
buildCandidateFromSearchResult and the toInt helper are exercised with the
realistic float64 path; locate the test and change the "issue_number" value in
the payload map used by TestBuildCandidateFromSearchResult_Issue accordingly.
cmd/simili/commands/index.go (1)

298-364: processPullRequest looks solid overall.

The function correctly fetches the PR, retrieves file paths, builds metadata text, chunks, embeds, and upserts. The empty-text guard (Line 315) and empty-chunks fallback (Lines 321-323) are good defensive measures.

One minor note: the issue *github.Issue parameter name is slightly misleading since this processes a PR. Consider renaming to something like ghIssue to clarify that it's the stub from the issues list endpoint, not an actual issue.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/index.go` around lines 298 - 364, Rename the misleading
parameter issue in the processPullRequest function to ghIssue (or ghIssueStub)
to clarify it's the GitHub Issue stub for a PR; update the function signature
(processPullRequest(ctx context.Context, workerID int, ghIssue *github.Issue,
...)), replace all references inside processPullRequest (e.g., prNumber :=
issue.GetNumber()) to use ghIssue, and update every call site that invokes
processPullRequest to pass the renamed parameter variable accordingly so
compilation stays consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/simili/commands/index.go`:
- Around line 121-127: The PR collection is being created with
cfg.Embedding.Dimensions instead of the actual detected embeddingDimensions,
causing a vector size mismatch; modify the call to
qdrantClient.CreateCollection(ctx, prCollection, cfg.Embedding.Dimensions) to
pass embeddingDimensions instead (i.e., qdrantClient.CreateCollection(ctx,
prCollection, embeddingDimensions)) so the PR collection uses the embedder's
reported dimension; update any related log/messages if present to reflect
embeddingDimensions and ensure the variable embeddingDimensions (from where the
embedder reports dimensions) is in scope for this branch.

In `@cmd/simili/commands/pr_support.go`:
- Around line 35-53: The function listAllPullRequestFilePaths currently only
retrieves the first page (PerPage:100) from gh.ListPullRequestFiles; update it
to paginate by looping calls to gh.ListPullRequestFiles and using the response's
NextPage (or setting ListOptions.Page) to fetch subsequent pages until
NextPage==0, appending filenames from each page (trimming and skipping empty
names) into the paths slice, and then sort and return the aggregated result;
ensure you handle and propagate errors from each call and avoid duplicating
entries.


Comment on lines +35 to +53
func listAllPullRequestFilePaths(ctx context.Context, gh *similiGithub.Client, org, repo string, number int) ([]string, error) {
	files, _, err := gh.ListPullRequestFiles(ctx, org, repo, number, &github.ListOptions{
		PerPage: 100,
	})
	if err != nil {
		return nil, err
	}

	paths := make([]string, 0, len(files))
	for _, f := range files {
		name := strings.TrimSpace(f.GetFilename())
		if name != "" {
			paths = append(paths, name)
		}
	}

	sort.Strings(paths)
	return paths, nil
}

@coderabbitai coderabbitai bot Feb 18, 2026


⚠️ Potential issue | 🟠 Major

listAllPullRequestFilePaths only fetches the first page — PRs with >100 changed files will have truncated file lists.

The function passes PerPage: 100 but never loops on resp.NextPage, so only the first page of files is returned. This contrasts with the paginated comment-fetching pattern in processIssue (index.go, Lines 226-239) and the commit message that mentions "paginate GitHub PR file listings."

🐛 Proposed fix: add pagination loop
 func listAllPullRequestFilePaths(ctx context.Context, gh *similiGithub.Client, org, repo string, number int) ([]string, error) {
-	files, _, err := gh.ListPullRequestFiles(ctx, org, repo, number, &github.ListOptions{
-		PerPage: 100,
-	})
-	if err != nil {
-		return nil, err
-	}
-
-	paths := make([]string, 0, len(files))
-	for _, f := range files {
-		name := strings.TrimSpace(f.GetFilename())
-		if name != "" {
-			paths = append(paths, name)
+	var paths []string
+	opts := &github.ListOptions{PerPage: 100}
+	for {
+		files, resp, err := gh.ListPullRequestFiles(ctx, org, repo, number, opts)
+		if err != nil {
+			return nil, err
+		}
+		for _, f := range files {
+			name := strings.TrimSpace(f.GetFilename())
+			if name != "" {
+				paths = append(paths, name)
+			}
+		}
+		if resp == nil || resp.NextPage == 0 {
+			break
 		}
+		opts.Page = resp.NextPage
 	}
 
 	sort.Strings(paths)
 	return paths, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/pr_support.go` around lines 35 - 53, The function
listAllPullRequestFilePaths currently only retrieves the first page
(PerPage:100) from gh.ListPullRequestFiles; update it to paginate by looping
calls to gh.ListPullRequestFiles and using the response's NextPage (or setting
ListOptions.Page) to fetch subsequent pages until NextPage==0, appending
filenames from each page (trimming and skipping empty names) into the paths
slice, and then sort and return the aggregated result; ensure you handle and
propagate errors from each call and avoid duplicating entries.

@nick1udwig (Contributor, Author) replied:

I think this is not correct: gh.ListPullRequestFiles() paginates internally (see internal/integrations/github/client.go).

@coderabbitai bot replied:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the ListPullRequestFiles implementation to verify pagination
cat internal/integrations/github/client.go | grep -A 50 "func.*ListPullRequestFiles"

Length of output: 1585


@nick1udwig You're absolutely right, and I apologize for the confusion!

I've verified the implementation in internal/integrations/github/client.go, and ListPullRequestFiles does indeed handle pagination internally with a loop that fetches all pages until resp.NextPage == 0. The function aggregates all files across pages before returning them.

The current implementation in listAllPullRequestFilePaths is correct as-is — no additional pagination loop is needed. My review comment was mistaken. Thank you for catching that! 🐰

 />\_/\
( •.• )
 > ^ <  oops!


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
cmd/simili/commands/index.go (2)

314-323: Good defensive checks — consider parity with processIssue.

The empty-text guard (line 315) and empty-chunks fallback (lines 321–323) are sensible. processIssue lacks both — if an issue has an empty title, body, and no comments, it would pass empty chunks to EmbedBatch. Consider backporting these guards to processIssue for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/index.go` around lines 314 - 323, Add the same defensive
checks used in the PR flow into processIssue: after building the text for an
issue (the counterpart of buildPRMetadataText) verify
strings.TrimSpace(issueText) != "" and log+return when empty, and after calling
splitter.SplitText(issueText) ensure if len(chunks) == 0 you set chunks =
[]string{issueText} before calling EmbedBatch; reference the processIssue
function, the local issueText/variable that holds the concatenated
title/body/comments, splitter.SplitText, and the call to EmbedBatch to locate
where to add these guards.

336-356: Minor style inconsistencies with processIssue.

Two small divergences from processIssue:

  1. map[string]interface{} (line 342) vs map[string]any (line 278 in processIssue). Since Go 1.18, any is the idiomatic alias.
  2. []byte(fmt.Sprintf(...)) (line 338) vs fmt.Appendf(nil, ...) (line 274 in processIssue). Both work identically, but using the same pattern avoids cognitive overhead.
♻️ Suggested alignment
-		pointID := uuid.NewMD5(uuid.NameSpaceURL, []byte(fmt.Sprintf("%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i))).String()
+		pointID := uuid.NewMD5(uuid.NameSpaceURL, fmt.Appendf(nil, "%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i)).String()
 		points[i] = &qdrant.Point{
 			ID:     pointID,
 			Vector: embeddings[i],
-			Payload: map[string]interface{}{
+			Payload: map[string]any{
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/simili/commands/index.go` around lines 336 - 356, In the loop that builds
Qdrant points (points, pointID, embeddings, chunks) align styles with
processIssue: change the Payload type from map[string]interface{} to
map[string]any and build the UUID input using fmt.Appendf(nil,
"%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i) instead of
[]byte(fmt.Sprintf(...)); update the uuid.NewMD5 call to use that fmt.Appendf
result so the point ID generation and payload type match processIssue’s
patterns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@cmd/simili/commands/index.go`:
- Around line 121-127: The PR collection creation now correctly uses the
auto-detected embeddingDimensions when calling qdrantClient.CreateCollection (in
the block guarded by indexIncludePRs and prCollection != cfg.Qdrant.Collection),
but you should ensure consistency everywhere: confirm
qdrantClient.CreateCollection is passed embeddingDimensions (not
cfg.Embedding.Dimensions) for PR collections and replace any remaining
occurrences that pass cfg.Embedding.Dimensions for prCollection with the
embeddingDimensions variable to avoid dimension mismatches; keep the existing
conditional that checks indexIncludePRs and prCollection !=
cfg.Qdrant.Collection and preserve error handling in that block.



Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[0.2.0v][Feature] PR indexing with dedicated collection and pr-duplicate CLI

1 participant