Conversation
📝 WalkthroughWalkthroughAdds optional pull-request indexing and a new Changes
Sequence DiagramsequenceDiagram
actor User
participant CLI as simili CLI
participant GitHub as GitHub API
participant Gemini as Gemini LLM
participant Qdrant as Qdrant
User->>CLI: simili pr-duplicate --repo owner/repo --number 123
CLI->>GitHub: GetPullRequest(owner,repo,123)
GitHub-->>CLI: PR metadata
CLI->>GitHub: ListPullRequestFiles(owner,repo,123)
GitHub-->>CLI: changed files
CLI->>CLI: buildPRMetadataText(PR + files)
CLI->>Gemini: embed PR text
Gemini-->>CLI: embedding vector
CLI->>Qdrant: search issue collection (embedding)
Qdrant-->>CLI: issue candidates (similarity)
CLI->>Qdrant: search PR collection (embedding)
Qdrant-->>CLI: PR candidates (similarity)
CLI->>Gemini: DetectPRDuplicate(PR + merged candidates)
Gemini-->>CLI: duplicate decision (is_duplicate, id, confidence)
CLI->>User: render results (or JSON)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (5)
cmd/simili/commands/batch.go (1)
342-349: LLM initialization failure is silently swallowed in non-verbose mode, giving no indication that LLM-powered features will be skipped.When
verboseis false and LLM initialization fails, users get no message, even though downstream steps will skip duplicate detection, triage analysis, and other LLM-dependent features. This can result in confusing batch output.Consider always logging a message when LLM is unavailable, reserving verbose mode for the detailed error:
Suggested change
llm, err := gemini.NewLLMClient(llmKey, llmModel) if err == nil { deps.LLMClient = llm if verbose { fmt.Printf("✓ Initialized LLM client (%s) with model: %s\n", llm.Provider(), llm.Model()) } - } else if verbose { - fmt.Printf("ℹ LLM client unavailable, continuing without LLM steps: %v\n", err) + } else { + fmt.Println("ℹ LLM client unavailable, continuing without LLM steps") + if verbose { + fmt.Printf(" LLM init error: %v\n", err) + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/batch.go` around lines 342 - 349, The current LLM init branch only logs failures when verbose is true, so when err != nil and verbose is false the fact that deps.LLMClient is nil is silent; change the else branch that handles llm initialization to always print a concise notice like "LLM client unavailable, continuing without LLM steps" (using the same fmt.Printf or logger) and only append the full error details when verbose is true (e.g., include err in the message when verbose). Update the block around the llm variable / deps.LLMClient assignment in batch.go so failures are always announced but detailed error text remains controlled by the verbose flag.cmd/simili/commands/pr_duplicate.go (2)
444-505:buildCandidateFromSearchResulthas good defensive fallbacks, but silently defaults state to"open"which could be misleading.Lines 487-489: when
stateis missing from the payload, it defaults to"open". A missing state could also indicate a closed or merged entity. Consider defaulting to"unknown"instead, so downstream consumers (including the LLM) aren't given a potentially incorrect signal.Proposed change
state, _ := res.Payload["state"].(string) if state == "" { - state = "open" + state = "unknown" }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/pr_duplicate.go` around lines 444 - 505, The function buildCandidateFromSearchResult currently defaults a missing payload "state" to "open", which can mislead downstream consumers; change the fallback to use "unknown" instead of "open" (update the variable state handling in buildCandidateFromSearchResult where state is read from res.Payload and set when empty), ensure the candidate.State field receives "unknown" for missing/empty state, and run/adjust any tests or callers that assumed "open" as the implicit default.
192-206: New GitHub client is created per invocation — acceptable for CLI, but note the token is used without context timeout.
fetchPullRequestMetadataTextusescontext.Background()(inherited from caller). For a CLI tool this is fine, but if the GitHub API hangs, there's no timeout. Consider adding a context with timeout for robustness.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/pr_duplicate.go` around lines 192 - 206, fetchPullRequestMetadataText currently uses the caller context directly and creates a new GitHub client per call without a timeout; wrap the incoming ctx in a context.WithTimeout (e.g., 10s or configurable) inside fetchPullRequestMetadataText, defer cancel(), and use that timed context when calling similiGithub.NewClient, ghClient.GetPullRequest and listAllPullRequestFilePaths so API calls cannot hang indefinitely; ensure the cancel function is called to release resources and make the timeout value configurable if needed.cmd/simili/commands/pr_duplicate_test.go (1)
20-45: Test usesint64for number fields, but JSON-unmarshaledmap[string]interface{}yieldsfloat64.When Qdrant search results are deserialized from JSON into
map[string]interface{}, Go'sencoding/jsonrepresents JSON numbers asfloat64, notint64. ThetoInthelper handles both, so production code works, but these tests don't exercise the realisticfloat64path. Consider adding a test case (or adjusting one) withfloat64(42)to cover the real-world deserialization type.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/pr_duplicate_test.go` around lines 20 - 45, The test TestBuildCandidateFromSearchResult_Issue builds a fake qdrant.SearchResult with "issue_number" as int64, but real JSON-unmarshaled payloads use float64; update the test payload to use float64(42) (or add an additional test case) so buildCandidateFromSearchResult and the toInt helper are exercised with the realistic float64 path; locate the test and change the "issue_number" value in the payload map used by TestBuildCandidateFromSearchResult_Issue accordingly.cmd/simili/commands/index.go (1)
298-364:processPullRequestlooks solid overall.The function correctly fetches the PR, retrieves file paths, builds metadata text, chunks, embeds, and upserts. The empty-text guard (Line 315) and empty-chunks fallback (Lines 321-323) are good defensive measures.
One minor note: the
issue *github.Issueparameter name is slightly misleading since this processes a PR. Consider renaming to something likeghIssueto clarify that it's the stub from the issues list endpoint, not an actual issue.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/index.go` around lines 298 - 364, Rename the misleading parameter issue in the processPullRequest function to ghIssue (or ghIssueStub) to clarify it's the GitHub Issue stub for a PR; update the function signature (processPullRequest(ctx context.Context, workerID int, ghIssue *github.Issue, ...)), replace all references inside processPullRequest (e.g., prNumber := issue.GetNumber()) to use ghIssue, and update every call site that invokes processPullRequest to pass the renamed parameter variable accordingly so compilation stays consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/simili/commands/index.go`:
- Around line 121-127: The PR collection is being created with
cfg.Embedding.Dimensions instead of the actual detected embeddingDimensions,
causing a vector size mismatch; modify the call to
qdrantClient.CreateCollection(ctx, prCollection, cfg.Embedding.Dimensions) to
pass embeddingDimensions instead (i.e., qdrantClient.CreateCollection(ctx,
prCollection, embeddingDimensions)) so the PR collection uses the embedder's
reported dimension; update any related log/messages if present to reflect
embeddingDimensions and ensure the variable embeddingDimensions (from where the
embedder reports dimensions) is in scope for this branch.
In `@cmd/simili/commands/pr_support.go`:
- Around line 35-53: The function listAllPullRequestFilePaths currently only
retrieves the first page (PerPage:100) from gh.ListPullRequestFiles; update it
to paginate by looping calls to gh.ListPullRequestFiles and using the response's
NextPage (or setting ListOptions.Page) to fetch subsequent pages until
NextPage==0, appending filenames from each page (trimming and skipping empty
names) into the paths slice, and then sort and return the aggregated result;
ensure you handle and propagate errors from each call and avoid duplicating
entries.
---
Nitpick comments:
In `@cmd/simili/commands/batch.go`:
- Around line 342-349: The current LLM init branch only logs failures when
verbose is true, so when err != nil and verbose is false the fact that
deps.LLMClient is nil is silent; change the else branch that handles llm
initialization to always print a concise notice like "LLM client unavailable,
continuing without LLM steps" (using the same fmt.Printf or logger) and only
append the full error details when verbose is true (e.g., include err in the
message when verbose). Update the block around the llm variable / deps.LLMClient
assignment in batch.go so failures are always announced but detailed error text
remains controlled by the verbose flag.
In `@cmd/simili/commands/index.go`:
- Around line 298-364: Rename the misleading parameter issue in the
processPullRequest function to ghIssue (or ghIssueStub) to clarify it's the
GitHub Issue stub for a PR; update the function signature
(processPullRequest(ctx context.Context, workerID int, ghIssue *github.Issue,
...)), replace all references inside processPullRequest (e.g., prNumber :=
issue.GetNumber()) to use ghIssue, and update every call site that invokes
processPullRequest to pass the renamed parameter variable accordingly so
compilation stays consistent.
In `@cmd/simili/commands/pr_duplicate_test.go`:
- Around line 20-45: The test TestBuildCandidateFromSearchResult_Issue builds a
fake qdrant.SearchResult with "issue_number" as int64, but real JSON-unmarshaled
payloads use float64; update the test payload to use float64(42) (or add an
additional test case) so buildCandidateFromSearchResult and the toInt helper are
exercised with the realistic float64 path; locate the test and change the
"issue_number" value in the payload map used by
TestBuildCandidateFromSearchResult_Issue accordingly.
In `@cmd/simili/commands/pr_duplicate.go`:
- Around line 444-505: The function buildCandidateFromSearchResult currently
defaults a missing payload "state" to "open", which can mislead downstream
consumers; change the fallback to use "unknown" instead of "open" (update the
variable state handling in buildCandidateFromSearchResult where state is read
from res.Payload and set when empty), ensure the candidate.State field receives
"unknown" for missing/empty state, and run/adjust any tests or callers that
assumed "open" as the implicit default.
- Around line 192-206: fetchPullRequestMetadataText currently uses the caller
context directly and creates a new GitHub client per call without a timeout;
wrap the incoming ctx in a context.WithTimeout (e.g., 10s or configurable)
inside fetchPullRequestMetadataText, defer cancel(), and use that timed context
when calling similiGithub.NewClient, ghClient.GetPullRequest and
listAllPullRequestFilePaths so API calls cannot hang indefinitely; ensure the
cancel function is called to release resources and make the timeout value
configurable if needed.
| func listAllPullRequestFilePaths(ctx context.Context, gh *similiGithub.Client, org, repo string, number int) ([]string, error) { | ||
| files, _, err := gh.ListPullRequestFiles(ctx, org, repo, number, &github.ListOptions{ | ||
| PerPage: 100, | ||
| }) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| paths := make([]string, 0, len(files)) | ||
| for _, f := range files { | ||
| name := strings.TrimSpace(f.GetFilename()) | ||
| if name != "" { | ||
| paths = append(paths, name) | ||
| } | ||
| } | ||
|
|
||
| sort.Strings(paths) | ||
| return paths, nil | ||
| } |
There was a problem hiding this comment.
listAllPullRequestFilePaths only fetches the first page — PRs with >100 changed files will have truncated file lists.
The function passes PerPage: 100 but never loops on resp.NextPage, so only the first page of files is returned. This contrasts with the paginated comment-fetching pattern in processIssue (index.go, Lines 226-239) and the commit message that mentions "paginate GitHub PR file listings."
🐛 Proposed fix: add pagination loop
func listAllPullRequestFilePaths(ctx context.Context, gh *similiGithub.Client, org, repo string, number int) ([]string, error) {
- files, _, err := gh.ListPullRequestFiles(ctx, org, repo, number, &github.ListOptions{
- PerPage: 100,
- })
- if err != nil {
- return nil, err
- }
-
- paths := make([]string, 0, len(files))
- for _, f := range files {
- name := strings.TrimSpace(f.GetFilename())
- if name != "" {
- paths = append(paths, name)
+ var paths []string
+ opts := &github.ListOptions{PerPage: 100}
+ for {
+ files, resp, err := gh.ListPullRequestFiles(ctx, org, repo, number, opts)
+ if err != nil {
+ return nil, err
+ }
+ for _, f := range files {
+ name := strings.TrimSpace(f.GetFilename())
+ if name != "" {
+ paths = append(paths, name)
+ }
+ }
+ if resp == nil || resp.NextPage == 0 {
+ break
}
+ opts.Page = resp.NextPage
}
sort.Strings(paths)
return paths, nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func listAllPullRequestFilePaths(ctx context.Context, gh *similiGithub.Client, org, repo string, number int) ([]string, error) { | |
| files, _, err := gh.ListPullRequestFiles(ctx, org, repo, number, &github.ListOptions{ | |
| PerPage: 100, | |
| }) | |
| if err != nil { | |
| return nil, err | |
| } | |
| paths := make([]string, 0, len(files)) | |
| for _, f := range files { | |
| name := strings.TrimSpace(f.GetFilename()) | |
| if name != "" { | |
| paths = append(paths, name) | |
| } | |
| } | |
| sort.Strings(paths) | |
| return paths, nil | |
| } | |
| func listAllPullRequestFilePaths(ctx context.Context, gh *similiGithub.Client, org, repo string, number int) ([]string, error) { | |
| var paths []string | |
| opts := &github.ListOptions{PerPage: 100} | |
| for { | |
| files, resp, err := gh.ListPullRequestFiles(ctx, org, repo, number, opts) | |
| if err != nil { | |
| return nil, err | |
| } | |
| for _, f := range files { | |
| name := strings.TrimSpace(f.GetFilename()) | |
| if name != "" { | |
| paths = append(paths, name) | |
| } | |
| } | |
| if resp == nil || resp.NextPage == 0 { | |
| break | |
| } | |
| opts.Page = resp.NextPage | |
| } | |
| sort.Strings(paths) | |
| return paths, nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/simili/commands/pr_support.go` around lines 35 - 53, The function
listAllPullRequestFilePaths currently only retrieves the first page
(PerPage:100) from gh.ListPullRequestFiles; update it to paginate by looping
calls to gh.ListPullRequestFiles and using the response's NextPage (or setting
ListOptions.Page) to fetch subsequent pages until NextPage==0, appending
filenames from each page (trimming and skipping empty names) into the paths
slice, and then sort and return the aggregated result; ensure you handle and
propagate errors from each call and avoid duplicating entries.
There was a problem hiding this comment.
I think this is not correct because gh.ListPullRequestFiles() paginates internally (see in internal/integrations/github/client.go)
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check the ListPullRequestFiles implementation to verify pagination
cat internal/integrations/github/client.go | grep -A 50 "func.*ListPullRequestFiles"
Length of output: 1585
@nick1udwig You're absolutely right, and I apologize for the confusion!
I've verified the implementation in internal/integrations/github/client.go, and ListPullRequestFiles does indeed handle pagination internally with a loop that fetches all pages until resp.NextPage == 0. The function aggregates all files across pages before returning them.
The current implementation in listAllPullRequestFilePaths is correct as-is — no additional pagination loop is needed. My review comment was mistaken. Thank you for catching that! 🐰
/>\_/\
( •.• )
> ^ < oops!
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cmd/simili/commands/index.go (2)
314-323: Good defensive checks — consider parity withprocessIssue.The empty-text guard (line 315) and empty-chunks fallback (lines 321–323) are sensible.
processIssuelacks both — if an issue has an empty title, body, and no comments, it would pass empty chunks toEmbedBatch. Consider backporting these guards toprocessIssuefor consistency.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/index.go` around lines 314 - 323, Add the same defensive checks used in the PR flow into processIssue: after building the text for an issue (the counterpart of buildPRMetadataText) verify strings.TrimSpace(issueText) != "" and log+return when empty, and after calling splitter.SplitText(issueText) ensure if len(chunks) == 0 you set chunks = []string{issueText} before calling EmbedBatch; reference the processIssue function, the local issueText/variable that holds the concatenated title/body/comments, splitter.SplitText, and the call to EmbedBatch to locate where to add these guards.
336-356: Minor style inconsistencies withprocessIssue.Two small divergences from
processIssue:
map[string]interface{}(line 342) vsmap[string]any(line 278 inprocessIssue). Since Go 1.18,anyis the idiomatic alias.[]byte(fmt.Sprintf(...))(line 338) vsfmt.Appendf(nil, ...)(line 274 inprocessIssue). Both work identically, but using the same pattern avoids cognitive overhead.♻️ Suggested alignment
- pointID := uuid.NewMD5(uuid.NameSpaceURL, []byte(fmt.Sprintf("%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i))).String() + pointID := uuid.NewMD5(uuid.NameSpaceURL, fmt.Appendf(nil, "%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i)).String() points[i] = &qdrant.Point{ ID: pointID, Vector: embeddings[i], - Payload: map[string]interface{}{ + Payload: map[string]any{🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/simili/commands/index.go` around lines 336 - 356, In the loop that builds Qdrant points (points, pointID, embeddings, chunks) align styles with processIssue: change the Payload type from map[string]interface{} to map[string]any and build the UUID input using fmt.Appendf(nil, "%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i) instead of []byte(fmt.Sprintf(...)); update the uuid.NewMD5 call to use that fmt.Appendf result so the point ID generation and payload type match processIssue’s patterns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cmd/simili/commands/index.go`:
- Around line 121-127: The PR collection creation now correctly uses the
auto-detected embeddingDimensions when calling qdrantClient.CreateCollection (in
the block guarded by indexIncludePRs and prCollection != cfg.Qdrant.Collection),
but you should ensure consistency everywhere: confirm
qdrantClient.CreateCollection is passed embeddingDimensions (not
cfg.Embedding.Dimensions) for PR collections and replace any remaining
occurrences that pass cfg.Embedding.Dimensions for prCollection with the
embeddingDimensions variable to avoid dimension mismatches; keep the existing
conditional that checks indexIncludePRs and prCollection !=
cfg.Qdrant.Collection and preserve error handling in that block.
---
Nitpick comments:
In `@cmd/simili/commands/index.go`:
- Around line 314-323: Add the same defensive checks used in the PR flow into
processIssue: after building the text for an issue (the counterpart of
buildPRMetadataText) verify strings.TrimSpace(issueText) != "" and log+return
when empty, and after calling splitter.SplitText(issueText) ensure if
len(chunks) == 0 you set chunks = []string{issueText} before calling EmbedBatch;
reference the processIssue function, the local issueText/variable that holds the
concatenated title/body/comments, splitter.SplitText, and the call to EmbedBatch
to locate where to add these guards.
- Around line 336-356: In the loop that builds Qdrant points (points, pointID,
embeddings, chunks) align styles with processIssue: change the Payload type from
map[string]interface{} to map[string]any and build the UUID input using
fmt.Appendf(nil, "%s/%s/pr/%d/chunk/%d", org, repo, prNumber, i) instead of
[]byte(fmt.Sprintf(...)); update the uuid.NewMD5 call to use that fmt.Appendf
result so the point ID generation and payload type match processIssue’s
patterns.
Description
Index and query PRs, not just issues
Type of Change
Related Issues
Fixes #43
Changes Made
Testing
go build ./...successfullygo test ./...successfullygo vet ./...successfully(I've tested with OpenAI models)
Screenshots (if applicable)
Checklist
Additional Notes
Summary by CodeRabbit
New Features
Documentation
Configuration
Tests