feat: store content hashes on graph nodes for cross-branch inference caching #682

maoshuorz wants to merge 4 commits into potpie-ai:main from
Conversation
…caching

Previously, content hashes were only computed at inference time. Now they are computed and stored directly on Neo4j nodes during graph construction, enabling faster cross-branch cache lookups.

Changes:
- code_graph_service.py: compute content_hash during node creation, add Neo4j composite index on (content_hash, repoId), add get_cached_node_hashes() for cross-branch hash comparison
- inference_service.py: fetch_graph now returns content_hash and node_type, _lookup_cache_for_nodes reuses pre-computed graph hashes
- Add unit tests for content_hash, inference_cache_service, and cross-branch cache reuse scenarios

Closes potpie-ai#223
Walkthrough

Adds content-hash-based caching across parsing and inference: CodeGraphService computes/stores content_hash on Neo4j nodes during graph construction, and InferenceService reuses those pre-computed hashes for cache lookups.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Graph as CodeGraphService
participant Neo4j as Neo4j
participant Inference as InferenceService
participant Cache as InferenceCache
Graph->>Neo4j: Create/store nodes with `node_text` and optional `content_hash`
Neo4j-->>Graph: Acknowledge stored
Inference->>Neo4j: Fetch nodes (`node_text`, `content_hash`, `node_type`)
Neo4j-->>Inference: Return nodes
alt content_hash present
Inference->>Inference: Normalize `node_text` and verify cacheability
else content_hash absent
Inference->>Inference: Normalize `node_text`, check cacheability, generate `content_hash`
Inference->>Neo4j: Persist generated `content_hash` on node
end
Inference->>Cache: Lookup by `content_hash`
Cache-->>Inference: Return cached inference or miss
alt cache miss
Inference->>Cache: Store inference result keyed by `content_hash`
Cache-->>Inference: Confirm stored
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/unit-tests/parsing/test_graph_hash_integration.py (1)
79-124: These “integration” tests still stop at pure utility calls. They never touch CodeGraphService.create_and_store_graph() or InferenceService.fetch_graph(), so wiring regressions in the new graph-backed cache path can slip through. At least one test here should persist a node with content_hash and verify the inference path reuses it end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit-tests/parsing/test_graph_hash_integration.py` around lines 79 - 124, Add an end-to-end test that exercises the graph-backed cache: use generate_content_hash to compute a content_hash for a sample node, call CodeGraphService.create_and_store_graph (or the service method that persists nodes) to persist a node containing that content_hash and an associated inference payload, then call InferenceService.fetch_graph (or the inference entrypoint used in production) for the same code and assert the returned result reuses the persisted inference (e.g., compare docstring/tags) instead of recomputing; place this alongside the existing tests (e.g., near test_cache_lookup_simulation/test_partial_branch_changes) so the wiring between create_and_store_graph and fetch_graph is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/modules/parsing/graph_construction/code_graph_service.py`:
- Around line 66-69: The cache entry currently sets the stored vector under the
key "embedding" which downstream consumers expect as "embedding_vector", causing
Neo4j cache hits to miss; update the cache population where cache[content_hash]
is assigned (in code_graph_service.py) to use the key "embedding_vector" (or
include both "embedding_vector" and "embedding" if you want backward
compatibility) so consumers reading cached_inference["embedding_vector"] get the
stored vector and avoid unnecessary re-generation.
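The key mismatch described above can be sketched as follows; the helper and field names here are illustrative, not the project's actual code:

```python
def populate_cache_entry(cache: dict, content_hash: str, vector: list, docstring: str) -> None:
    """Populate a cache entry so readers of either key find the stored vector."""
    cache[content_hash] = {
        "embedding_vector": vector,  # key downstream consumers actually read
        "embedding": vector,         # legacy key, kept for backward compatibility
        "docstring": docstring,
    }

cache: dict = {}
populate_cache_entry(cache, "abc123", [0.1, 0.2], "Adds two numbers.")
assert cache["abc123"]["embedding_vector"] == [0.1, 0.2]
```

Writing both keys is the conservative choice while any consumer still reads the old "embedding" name.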
In `@tests/unit-tests/parsing/test_content_hash.py`:
- Around line 66-71: Update the test_includes_cache_version to temporarily
modify app.modules.parsing.utils.content_hash.CACHE_VERSION: call
generate_content_hash(code) to get hash1, monkeypatch or save the original
CACHE_VERSION, set CACHE_VERSION to a different value, call
generate_content_hash(code) again and assert the new hash differs from hash1,
then restore the original CACHE_VERSION; reference generate_content_hash and
CACHE_VERSION so the test verifies cache-version invalidation.
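A self-contained sketch of the pattern the prompt describes; the class below is a stand-in for app.modules.parsing.utils.content_hash, not the real module:

```python
import hashlib

class _ContentHashModule:
    """Stand-in for the content_hash module (assumed shape, for illustration only)."""
    CACHE_VERSION = "v1"

    def generate_content_hash(self, code: str) -> str:
        # Salting with CACHE_VERSION means bumping it invalidates every old hash
        payload = f"{self.CACHE_VERSION}:{code}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

mod = _ContentHashModule()
code = "def f():\n    return 1\n"
hash1 = mod.generate_content_hash(code)

original = mod.CACHE_VERSION
try:
    mod.CACHE_VERSION = "v2"  # what monkeypatch.setattr would do in the test
    hash2 = mod.generate_content_hash(code)
finally:
    mod.CACHE_VERSION = original  # restore, as the prompt requires

assert hash1 != hash2  # version bump changes the hash for identical code
```

In the real test, pytest's monkeypatch fixture would handle the save/restore automatically.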
- Around line 96-99: The test_long_content_cacheable currently builds `code` as
`" x = 1\n" * 20`, which yields only one unique line and contradicts
`is_content_cacheable` behavior; update the `code` fixture in the test to
contain many unique lines (for example by varying the line content per
iteration) so that `is_content_cacheable(code)` returns True, keeping the test
name and assertion as-is; edit the `test_long_content_cacheable` function to
generate non-repetitive multi-line content instead of repeating the same line.
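One way to build the non-repetitive fixture (illustrative; the real is_content_cacheable heuristic may count unique lines differently):

```python
def build_long_cacheable_code(n: int = 20) -> str:
    # Vary each line so a repetition heuristic sees n unique lines,
    # not a single line repeated n times
    return "\n".join(f"    x{i} = {i}" for i in range(n)) + "\n"

code = build_long_cacheable_code()
lines = [ln for ln in code.splitlines() if ln.strip()]
assert len(set(lines)) == len(lines) == 20  # every line is unique
```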
In `@tests/unit-tests/parsing/test_inference_cache_service.py`:
- Around line 75-83: The two test calls that assign to an unused variable (the
calls to service.store_inference in test_inference_cache_service.py) are causing
Ruff F841; remove the unused local by either asserting the returned value or
discarding it (call service.store_inference(...) without assigning to result)
for both occurrences (the one that creates "new_hash" and the other at lines
~95-98), ensuring the test still verifies behavior as needed (e.g., replace the
assignment with an assert on the return or simply call the method if the return
isn't needed).
---
Nitpick comments:
In `@tests/unit-tests/parsing/test_graph_hash_integration.py`:
- Around line 79-124: Add an end-to-end test that exercises the graph-backed
cache: use generate_content_hash to compute a content_hash for a sample node,
call CodeGraphService.create_and_store_graph (or the service method that
persists nodes) to persist a node containing that content_hash and an associated
inference payload, then call InferenceService.fetch_graph (or the inference
entrypoint used in production) for the same code and assert the returned result
reuses the persisted inference (e.g., compare docstring/tags) instead of
recomputing; place this alongside the existing tests (e.g., near
test_cache_lookup_simulation/test_partial_branch_changes) so the wiring between
create_and_store_graph and fetch_graph is covered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 344ad1c8-9f3f-477e-9cec-3735b2010aae
📒 Files selected for processing (6)
- app/modules/parsing/graph_construction/code_graph_service.py
- app/modules/parsing/knowledge_graph/inference_service.py
- tests/unit-tests/parsing/__init__.py
- tests/unit-tests/parsing/test_content_hash.py
- tests/unit-tests/parsing/test_graph_hash_integration.py
- tests/unit-tests/parsing/test_inference_cache_service.py
- code_graph_service.py: use "embedding_vector" key to match downstream consumers in inference_service.py (lines 1411, 1471)
- test_content_hash.py: properly test CACHE_VERSION invalidation; fix repetitive content detection in test_long_content_cacheable
- test_inference_cache_service.py: remove unused result assignments (Ruff F841)
- Safe-access node_id in logging statements to prevent KeyError/TypeError
- Guard content_hash slicing against None values
- Handle empty all_docstrings list in consolidate_docstrings()
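The defensive patterns in that commit might look like this (function and field names are illustrative, not the actual code):

```python
def safe_log_fields(node: dict, content_hash) -> tuple:
    """Extract log-safe values: no KeyError on missing node_id, no TypeError slicing None."""
    node_id = node.get("node_id", "<unknown>")   # .get() with default avoids KeyError
    preview = (content_hash or "")[:8]            # (x or "") guards the slice against None
    return node_id, preview

# Neither missing node_id nor a None hash raises
assert safe_log_fields({}, None) == ("<unknown>", "")
assert safe_log_fields({"node_id": "n1"}, "deadbeefcafe") == ("n1", "deadbeef")
```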
Actionable comments posted: 1
🧹 Nitpick comments (1)
app/modules/parsing/graph_construction/code_graph_service.py (1)
131-136: Consider index column order for query patterns. The composite index (content_hash, repoId) is created, but get_cached_node_hashes queries primarily by repoId. For Neo4j composite indexes, the leftmost column should match the most selective equality predicate. If most queries filter by repoId first (as in get_cached_node_hashes), consider reversing the order to (repoId, content_hash). However, if cross-branch lookups by content_hash across repos are planned, the current order may be intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/modules/parsing/graph_construction/code_graph_service.py` around lines 131 - 136, The composite index creation in code_graph_service.py currently uses (content_hash, repoId) but get_cached_node_hashes filters primarily by repoId first, so change the index to (repoId, content_hash) to match the leftmost-equals rule for Neo4j composite indexes; update the CREATE INDEX statement in the block where session.run(...) creates content_hash_repo_idx (or document a deliberate reason to keep the current order if cross-repo content_hash lookups are required), then run and benchmark get_cached_node_hashes to confirm improved index usage and ensure the migration remains idempotent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/modules/parsing/graph_construction/code_graph_service.py`:
- Around line 189-193: The duplicate_graph routine currently omits the
precomputed content_hash on duplicated nodes, causing redundant regeneration
later; update duplicate_graph to copy processed_node["content_hash"] into the
duplication query and ensure the node creation logic in the duplication path
preserves that content_hash field (so when creating new nodes in duplicate_graph
the content_hash produced earlier by generate_content_hash/is_content_cacheable
is carried over rather than recomputed).
---
Nitpick comments:
In `@app/modules/parsing/graph_construction/code_graph_service.py`:
- Around line 131-136: The composite index creation in code_graph_service.py
currently uses (content_hash, repoId) but get_cached_node_hashes filters
primarily by repoId first, so change the index to (repoId, content_hash) to
match the leftmost-equals rule for Neo4j composite indexes; update the CREATE
INDEX statement in the block where session.run(...) creates
content_hash_repo_idx (or document a deliberate reason to keep the current order
if cross-repo content_hash lookups are required), then run and benchmark
get_cached_node_hashes to confirm improved index usage and ensure the migration
remains idempotent.
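The reordered, idempotent index creation the prompt suggests could be sketched like this; the node label and the exact driver wiring are assumptions, not verified against the repo:

```python
# Cypher for the reordered composite index; IF NOT EXISTS keeps the migration
# idempotent so it can run safely on every startup. The NODE label is assumed.
CONTENT_HASH_INDEX = """
CREATE INDEX content_hash_repo_idx IF NOT EXISTS
FOR (n:NODE)
ON (n.repoId, n.content_hash)
"""

def create_content_hash_index(session) -> None:
    # session is expected to be a neo4j driver session
    session.run(CONTENT_HASH_INDEX)
```

With repoId leftmost, queries that equality-filter on repoId (as get_cached_node_hashes does) can use the index prefix.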
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5ff094b4-54f2-4f97-a497-8c30538175a0
📒 Files selected for processing (3)
- app/modules/parsing/graph_construction/code_graph_service.py
- tests/unit-tests/parsing/test_content_hash.py
- tests/unit-tests/parsing/test_inference_cache_service.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unit-tests/parsing/test_inference_cache_service.py
- tests/unit-tests/parsing/test_content_hash.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/modules/parsing/knowledge_graph/inference_service.py`:
- Around line 152-177: The precomputed content_hash on nodes is trusted without
confirming it was generated from the same normalized text; update the logic in
inference_service.py (around the node["content_hash"] branch in the loop that
uses _normalize_node_text, is_content_cacheable, and generate_content_hash) to
always compute normalized_text via _normalize_node_text and then recompute
content_hash from that normalized_text (using generate_content_hash) before
using it for cache lookups, or alternatively, read and trust a persisted
normalized hash if graph_construction/code_graph_service.py starts saving one;
in short, ensure the hash used for cache lookup is derived from the
normalized_text path (recompute content_hash from normalized_text in the branch
that currently trusts node["content_hash"]).
- Around line 506-508: When all_docstrings is empty, avoid creating a synthetic
parent DocstringNode with docstring="" and embeddings; instead return an empty
DocstringResponse immediately so update_neo4j_with_docstrings() is not invoked
for a no-op. Modify the logic around the consolidated_text creation (the branch
handling len(all_docstrings) == 0) to short-circuit and return an appropriate
empty DocstringResponse rather than building a parent node, ensuring functions
and callers that expect a DocstringResponse still receive one but no Neo4j
writes or embeddings are produced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c95a9097-1fdc-47e5-8d09-2106e1094d96
📒 Files selected for processing (1)
app/modules/parsing/knowledge_graph/inference_service.py
# Use pre-computed content_hash from graph if available
content_hash = node.get("content_hash")

if not content_hash:
    # Normalize text for consistent hashing
    normalized_text = self._normalize_node_text(text, node_dict)
    node["normalized_text"] = normalized_text

    # Check if content is cacheable
    if not is_content_cacheable(normalized_text):
        uncacheable_nodes += 1
        continue

    # Generate content hash
    content_hash = generate_content_hash(
        normalized_text, node.get("node_type")
    )
    node["content_hash"] = content_hash
else:
    # Hash already computed at graph construction time
    normalized_text = self._normalize_node_text(text, node_dict)
    node["normalized_text"] = normalized_text

    # Generate content hash
    content_hash = generate_content_hash(normalized_text, node.get("node_type"))
    node["content_hash"] = content_hash
    if not is_content_cacheable(normalized_text):
        uncacheable_nodes += 1
        continue
Precomputed hashes need the same canonicalization path.
This branch trusts node["content_hash"] before proving it was generated from the same normalized text that the fallback path hashes. In app/modules/parsing/graph_construction/code_graph_service.py:185-204, the graph writer stores hashes from raw node_text, so nodes containing "Code replaced for brevity...node_id ..." placeholders can keep branch-specific hashes and miss the cross-branch cache even when normalized_text is identical. Either persist the normalized hash during graph construction or recompute from normalized_text here before lookup.
Possible safe fix in this file
-if not content_hash:
-    # Normalize text for consistent hashing
-    normalized_text = self._normalize_node_text(text, node_dict)
-    node["normalized_text"] = normalized_text
-
-    # Check if content is cacheable
-    if not is_content_cacheable(normalized_text):
-        uncacheable_nodes += 1
-        continue
-
-    # Generate content hash
-    content_hash = generate_content_hash(
-        normalized_text, node.get("node_type")
-    )
-    node["content_hash"] = content_hash
-else:
-    # Hash already computed at graph construction time
-    normalized_text = self._normalize_node_text(text, node_dict)
-    node["normalized_text"] = normalized_text
-
-    if not is_content_cacheable(normalized_text):
-        uncacheable_nodes += 1
-        continue
+normalized_text = self._normalize_node_text(text, node_dict)
+node["normalized_text"] = normalized_text
+
+if not is_content_cacheable(normalized_text):
+    uncacheable_nodes += 1
+    continue
+
+content_hash = generate_content_hash(
+    normalized_text, node.get("node_type")
+)
+node["content_hash"] = content_hash

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/modules/parsing/knowledge_graph/inference_service.py` around lines 152 -
177, The precomputed content_hash on nodes is trusted without confirming it was
generated from the same normalized text; update the logic in
inference_service.py (around the node["content_hash"] branch in the loop that
uses _normalize_node_text, is_content_cacheable, and generate_content_hash) to
always compute normalized_text via _normalize_node_text and then recompute
content_hash from that normalized_text (using generate_content_hash) before
using it for cache lookups, or alternatively, read and trust a persisted
normalized hash if graph_construction/code_graph_service.py starts saving one;
in short, ensure the hash used for cache lookup is derived from the
normalized_text path (recompute content_hash from normalized_text in the branch
that currently trusts node["content_hash"]).
if len(all_docstrings) == 0:
    consolidated_text = ""
elif len(all_docstrings) == 1:
Don’t persist a synthetic empty docstring for empty chunk results.
When all_docstrings is empty, this now builds a parent DocstringNode with docstring="". That flows into update_neo4j_with_docstrings() and writes an empty docstring plus embedding for the parent node instead of leaving the node untouched. Return an empty DocstringResponse here.
Suggested fix
- if len(all_docstrings) == 0:
- consolidated_text = ""
+ if len(all_docstrings) == 0:
+ return DocstringResponse(docstrings=[])
elif len(all_docstrings) == 1:
    consolidated_text = all_docstrings[0]

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-if len(all_docstrings) == 0:
-    consolidated_text = ""
-elif len(all_docstrings) == 1:
+if len(all_docstrings) == 0:
+    return DocstringResponse(docstrings=[])
+elif len(all_docstrings) == 1:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/modules/parsing/knowledge_graph/inference_service.py` around lines 506 -
508, When all_docstrings is empty, avoid creating a synthetic parent
DocstringNode with docstring="" and embeddings; instead return an empty
DocstringResponse immediately so update_neo4j_with_docstrings() is not invoked
for a no-op. Modify the logic around the consolidated_text creation (the branch
handling len(all_docstrings) == 0) to short-circuit and return an appropriate
empty DocstringResponse rather than building a parent node, ensuring functions
and callers that expect a DocstringResponse still receive one but no Neo4j
writes or embeddings are produced.
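A minimal sketch of that short-circuit; DocstringResponse is simplified here and the real consolidation and Neo4j write logic is elided:

```python
from dataclasses import dataclass, field

@dataclass
class DocstringResponse:
    """Simplified stand-in for the real response model."""
    docstrings: list = field(default_factory=list)

def consolidate_docstrings(all_docstrings: list) -> DocstringResponse:
    if len(all_docstrings) == 0:
        # Short-circuit: no synthetic parent node, no Neo4j write, no embedding
        return DocstringResponse(docstrings=[])
    if len(all_docstrings) == 1:
        consolidated_text = all_docstrings[0]
    else:
        consolidated_text = "\n\n".join(all_docstrings)
    # ...build the parent DocstringNode from consolidated_text (elided)...
    return DocstringResponse(docstrings=[consolidated_text])

assert consolidate_docstrings([]).docstrings == []
assert consolidate_docstrings(["a", "b"]).docstrings == ["a\n\nb"]
```

Callers still receive a DocstringResponse in every case, so the signature contract is preserved.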
The duplicate_graph method was not copying the new content_hash and node_text fields when cloning nodes to a new repo, causing cache misses on duplicated projects.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/modules/parsing/graph_construction/parsing_service.py (1)
621-627: ⚠️ Potential issue | 🔴 Critical

Indent duplicate_graph as a method of ParsingService. The function is currently defined at module level (line 621) but uses self and accesses instance attributes (self.search_service, self.inference_service.driver). This will cause a NameError at runtime. Move the function definition to be inside the ParsingService class with proper indentation.

The field additions for content_hash and node_text are correctly implemented in both the query (lines 636–637) and create statement (lines 665–666), but they will not work until the function structure is fixed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/modules/parsing/graph_construction/parsing_service.py` around lines 621 - 627, The method duplicate_graph is currently defined at module level but uses instance members (self.search_service, self.inference_service.driver), so move and indent the entire async def duplicate_graph(...) block to be inside the ParsingService class definition so it becomes an instance method; ensure its signature remains async def duplicate_graph(self, old_repo_id: str, new_repo_id: str) and that all internal references (node_batch_size, relationship_batch_size, session usage, and the queries that add content_hash and node_text) keep their current names and indentation so they access self correctly within ParsingService.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@app/modules/parsing/graph_construction/parsing_service.py`:
- Around line 621-627: The method duplicate_graph is currently defined at module
level but uses instance members (self.search_service,
self.inference_service.driver), so move and indent the entire async def
duplicate_graph(...) block to be inside the ParsingService class definition so
it becomes an instance method; ensure its signature remains async def
duplicate_graph(self, old_repo_id: str, new_repo_id: str) and that all internal
references (node_batch_size, relationship_batch_size, session usage, and the
queries that add content_hash and node_text) keep their current names and
indentation so they access self correctly within ParsingService.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 79d3ff72-4ad2-43ab-ae7e-1cb2c0116e7a
📒 Files selected for processing (1)
app/modules/parsing/graph_construction/parsing_service.py
/claim #223
Proposed Changes
Implements hash-based caching for knowledge graph nodes as described in #223. The existing codebase already has an excellent inference cache infrastructure (PostgreSQL inference_cache table, InferenceCacheService, content_hash.py). This PR closes the gap by integrating content hashes into the graph construction phase so that cross-branch cache comparisons are efficient.

What changed
1. Graph Construction (code_graph_service.py)
- Computes content_hash for each node during graph construction using the existing generate_content_hash() utility
- Stores content_hash directly on Neo4j nodes for fast cross-branch lookups
- Adds a composite index content_hash_repo_idx on (content_hash, repoId) for efficient hash-based queries
- Adds a get_cached_node_hashes() method to retrieve cached inference data by content hash from Neo4j

2. Inference Service (inference_service.py)
- fetch_graph() now returns content_hash and node_type from Neo4j, avoiding redundant hash recomputation
- _lookup_cache_for_nodes() reuses pre-computed graph hashes when available, falling back to on-the-fly computation

3. Tests (3 new test files, 20+ test cases)
- test_content_hash.py — Hash generation, whitespace normalization, type differentiation, cacheability checks
- test_inference_cache_service.py — Cache hit/miss, store/upsert, access tracking, stats
- test_graph_hash_integration.py — Cross-branch cache reuse, partial branch changes, same/modified code scenarios

How it works
Proof
The content hash system ensures identical code produces identical hashes regardless of branch:
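That branch-independence claim can be illustrated with a toy version of the hash; the real generate_content_hash also salts with node type and CACHE_VERSION, so this sketch only demonstrates determinism under normalization:

```python
import hashlib

def toy_content_hash(node_text: str, node_type: str = "FUNCTION") -> str:
    # Hash depends only on normalized content and type, never on branch or repo
    normalized = "\n".join(line.rstrip() for line in node_text.strip().splitlines())
    return hashlib.sha256(f"{node_type}:{normalized}".encode("utf-8")).hexdigest()

main_branch = "def add(a, b):\n    return a + b\n"
feature_branch = "def add(a, b):\n    return a + b   \n"  # trailing whitespace only

# Identical (normalized) code hashes identically across branches...
assert toy_content_hash(main_branch) == toy_content_hash(feature_branch)
# ...while genuinely different code does not
assert toy_content_hash("def sub(a, b):\n    return a - b") != toy_content_hash(main_branch)
```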
Checklist
Summary by CodeRabbit
New Features
Improvements
Tests