feat: store content hashes on graph nodes for cross-branch inference caching #682

maoshuorz wants to merge 4 commits into potpie-ai:main from
Conversation
…caching

Previously, content hashes were only computed at inference time. Now they are computed and stored directly on Neo4j nodes during graph construction, enabling faster cross-branch cache lookups.

Changes:
- code_graph_service.py: compute content_hash during node creation, add Neo4j composite index on (content_hash, repoId), add get_cached_node_hashes() for cross-branch hash comparison
- inference_service.py: fetch_graph now returns content_hash and node_type, _lookup_cache_for_nodes reuses pre-computed graph hashes
- Add unit tests for content_hash, inference_cache_service, and cross-branch cache reuse scenarios

Closes potpie-ai#223
Walkthrough

Adds content-hash-based caching across parsing and inference: CodeGraphService computes/stores content_hash on Neo4j nodes during graph construction, and InferenceService reuses those pre-computed hashes for cache lookups.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Graph as CodeGraphService
participant Neo4j as Neo4j
participant Inference as InferenceService
participant Cache as InferenceCache
Graph->>Neo4j: Create/store nodes with `node_text` and optional `content_hash`
Neo4j-->>Graph: Acknowledge stored
Inference->>Neo4j: Fetch nodes (`node_text`, `content_hash`, `node_type`)
Neo4j-->>Inference: Return nodes
alt content_hash present
Inference->>Inference: Normalize `node_text` and verify cacheability
else content_hash absent
Inference->>Inference: Normalize `node_text`, check cacheability, generate `content_hash`
Inference->>Neo4j: Persist generated `content_hash` on node
end
Inference->>Cache: Lookup by `content_hash`
Cache-->>Inference: Return cached inference or miss
alt cache miss
Inference->>Cache: Store inference result keyed by `content_hash`
Cache-->>Inference: Confirm stored
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
tests/unit-tests/parsing/test_graph_hash_integration.py (1)
79-124: These “integration” tests still stop at pure utility calls. They never touch CodeGraphService.create_and_store_graph() or InferenceService.fetch_graph(), so wiring regressions in the new graph-backed cache path can slip through. At least one test here should persist a node with content_hash and verify the inference path reuses it end-to-end.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit-tests/parsing/test_graph_hash_integration.py` around lines 79 - 124, Add an end-to-end test that exercises the graph-backed cache: use generate_content_hash to compute a content_hash for a sample node, call CodeGraphService.create_and_store_graph (or the service method that persists nodes) to persist a node containing that content_hash and an associated inference payload, then call InferenceService.fetch_graph (or the inference entrypoint used in production) for the same code and assert the returned result reuses the persisted inference (e.g., compare docstring/tags) instead of recomputing; place this alongside the existing tests (e.g., near test_cache_lookup_simulation/test_partial_branch_changes) so the wiring between create_and_store_graph and fetch_graph is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/modules/parsing/graph_construction/code_graph_service.py`:
- Around line 66-69: The cache entry currently sets the stored vector under the
key "embedding" which downstream consumers expect as "embedding_vector", causing
Neo4j cache hits to miss; update the cache population where cache[content_hash]
is assigned (in code_graph_service.py) to use the key "embedding_vector" (or
include both "embedding_vector" and "embedding" if you want backward
compatibility) so consumers reading cached_inference["embedding_vector"] get the
stored vector and avoid unnecessary re-generation.
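The key mismatch described above can be sketched as follows; the helper and field names here are illustrative, not the project's actual code:

```python
def populate_cache_entry(cache: dict, content_hash: str, vector: list, docstring: str) -> None:
    """Populate a cache entry so readers of either key find the stored vector."""
    cache[content_hash] = {
        "embedding_vector": vector,  # key downstream consumers actually read
        "embedding": vector,         # legacy key, kept for backward compatibility
        "docstring": docstring,
    }

cache: dict = {}
populate_cache_entry(cache, "abc123", [0.1, 0.2], "Adds two numbers.")
assert cache["abc123"]["embedding_vector"] == [0.1, 0.2]
```

Writing both keys is the conservative choice while any consumer still reads the old "embedding" name.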
In `@tests/unit-tests/parsing/test_content_hash.py`:
- Around line 66-71: Update the test_includes_cache_version to temporarily
modify app.modules.parsing.utils.content_hash.CACHE_VERSION: call
generate_content_hash(code) to get hash1, monkeypatch or save the original
CACHE_VERSION, set CACHE_VERSION to a different value, call
generate_content_hash(code) again and assert the new hash differs from hash1,
then restore the original CACHE_VERSION; reference generate_content_hash and
CACHE_VERSION so the test verifies cache-version invalidation.
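A self-contained sketch of the pattern the prompt describes; the class below is a stand-in for app.modules.parsing.utils.content_hash, not the real module:

```python
import hashlib

class _ContentHashModule:
    """Stand-in for the content_hash module (assumed shape, for illustration only)."""
    CACHE_VERSION = "v1"

    def generate_content_hash(self, code: str) -> str:
        # Salting with CACHE_VERSION means bumping it invalidates every old hash
        payload = f"{self.CACHE_VERSION}:{code}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

mod = _ContentHashModule()
code = "def f():\n    return 1\n"
hash1 = mod.generate_content_hash(code)

original = mod.CACHE_VERSION
try:
    mod.CACHE_VERSION = "v2"  # what monkeypatch.setattr would do in the test
    hash2 = mod.generate_content_hash(code)
finally:
    mod.CACHE_VERSION = original  # restore, as the prompt requires

assert hash1 != hash2  # version bump changes the hash for identical code
```

In the real test, pytest's monkeypatch fixture would handle the save/restore automatically.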
- Around line 96-99: The test_long_content_cacheable currently builds `code` as
`" x = 1\n" * 20`, which yields only one unique line and contradicts
`is_content_cacheable` behavior; update the `code` fixture in the test to
contain many unique lines (for example by varying the line content per
iteration) so that `is_content_cacheable(code)` returns True, keeping the test
name and assertion as-is; edit the `test_long_content_cacheable` function to
generate non-repetitive multi-line content instead of repeating the same line.
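One way to build the non-repetitive fixture (illustrative; the real is_content_cacheable heuristic may count unique lines differently):

```python
def build_long_cacheable_code(n: int = 20) -> str:
    # Vary each line so a repetition heuristic sees n unique lines,
    # not a single line repeated n times
    return "\n".join(f"    x{i} = {i}" for i in range(n)) + "\n"

code = build_long_cacheable_code()
lines = [ln for ln in code.splitlines() if ln.strip()]
assert len(set(lines)) == len(lines) == 20  # every line is unique
```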
In `@tests/unit-tests/parsing/test_inference_cache_service.py`:
- Around line 75-83: The two test calls that assign to an unused variable (the
calls to service.store_inference in test_inference_cache_service.py) are causing
Ruff F841; remove the unused local by either asserting the returned value or
discarding it (call service.store_inference(...) without assigning to result)
for both occurrences (the one that creates "new_hash" and the other at lines
~95-98), ensuring the test still verifies behavior as needed (e.g., replace the
assignment with an assert on the return or simply call the method if the return
isn't needed).
---
Nitpick comments:
In `@tests/unit-tests/parsing/test_graph_hash_integration.py`:
- Around line 79-124: Add an end-to-end test that exercises the graph-backed
cache: use generate_content_hash to compute a content_hash for a sample node,
call CodeGraphService.create_and_store_graph (or the service method that
persists nodes) to persist a node containing that content_hash and an associated
inference payload, then call InferenceService.fetch_graph (or the inference
entrypoint used in production) for the same code and assert the returned result
reuses the persisted inference (e.g., compare docstring/tags) instead of
recomputing; place this alongside the existing tests (e.g., near
test_cache_lookup_simulation/test_partial_branch_changes) so the wiring between
create_and_store_graph and fetch_graph is covered.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 344ad1c8-9f3f-477e-9cec-3735b2010aae
📒 Files selected for processing (6)
- app/modules/parsing/graph_construction/code_graph_service.py
- app/modules/parsing/knowledge_graph/inference_service.py
- tests/unit-tests/parsing/__init__.py
- tests/unit-tests/parsing/test_content_hash.py
- tests/unit-tests/parsing/test_graph_hash_integration.py
- tests/unit-tests/parsing/test_inference_cache_service.py
- code_graph_service.py: use "embedding_vector" key to match downstream consumers in inference_service.py (lines 1411, 1471)
- test_content_hash.py: properly test CACHE_VERSION invalidation; fix repetitive content detection in test_long_content_cacheable
- test_inference_cache_service.py: remove unused result assignments (Ruff F841)
- Safe-access node_id in logging statements to prevent KeyError/TypeError
- Guard content_hash slicing against None values
- Handle empty all_docstrings list in consolidate_docstrings()
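The defensive patterns in that commit might look like this (function and field names are illustrative, not the actual code):

```python
def safe_log_fields(node: dict, content_hash) -> tuple:
    """Extract log-safe values: no KeyError on missing node_id, no TypeError slicing None."""
    node_id = node.get("node_id", "<unknown>")   # .get() with default avoids KeyError
    preview = (content_hash or "")[:8]            # (x or "") guards the slice against None
    return node_id, preview

# Neither missing node_id nor a None hash raises
assert safe_log_fields({}, None) == ("<unknown>", "")
assert safe_log_fields({"node_id": "n1"}, "deadbeefcafe") == ("n1", "deadbeef")
```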
Actionable comments posted: 1
🧹 Nitpick comments (1)
app/modules/parsing/graph_construction/code_graph_service.py (1)
131-136: Consider index column order for query patterns. The composite index (content_hash, repoId) is created, but get_cached_node_hashes queries primarily by repoId. For Neo4j composite indexes, the leftmost column should match the most selective equality predicate. If most queries filter by repoId first (as in get_cached_node_hashes), consider reversing the order to (repoId, content_hash). However, if cross-branch lookups by content_hash across repos are planned, the current order may be intentional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/modules/parsing/graph_construction/code_graph_service.py` around lines 131 - 136, The composite index creation in code_graph_service.py currently uses (content_hash, repoId) but get_cached_node_hashes filters primarily by repoId first, so change the index to (repoId, content_hash) to match the leftmost-equals rule for Neo4j composite indexes; update the CREATE INDEX statement in the block where session.run(...) creates content_hash_repo_idx (or document a deliberate reason to keep the current order if cross-repo content_hash lookups are required), then run and benchmark get_cached_node_hashes to confirm improved index usage and ensure the migration remains idempotent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/modules/parsing/graph_construction/code_graph_service.py`:
- Around line 189-193: The duplicate_graph routine currently omits the
precomputed content_hash on duplicated nodes, causing redundant regeneration
later; update duplicate_graph to copy processed_node["content_hash"] into the
duplication query and ensure the node creation logic in the duplication path
preserves that content_hash field (so when creating new nodes in duplicate_graph
the content_hash produced earlier by generate_content_hash/is_content_cacheable
is carried over rather than recomputed).
---
Nitpick comments:
In `@app/modules/parsing/graph_construction/code_graph_service.py`:
- Around line 131-136: The composite index creation in code_graph_service.py
currently uses (content_hash, repoId) but get_cached_node_hashes filters
primarily by repoId first, so change the index to (repoId, content_hash) to
match the leftmost-equals rule for Neo4j composite indexes; update the CREATE
INDEX statement in the block where session.run(...) creates
content_hash_repo_idx (or document a deliberate reason to keep the current order
if cross-repo content_hash lookups are required), then run and benchmark
get_cached_node_hashes to confirm improved index usage and ensure the migration
remains idempotent.
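The reordered, idempotent index creation the prompt suggests could be sketched like this; the node label and the exact driver wiring are assumptions, not verified against the repo:

```python
# Cypher for the reordered composite index; IF NOT EXISTS keeps the migration
# idempotent so it can run safely on every startup. The NODE label is assumed.
CONTENT_HASH_INDEX = """
CREATE INDEX content_hash_repo_idx IF NOT EXISTS
FOR (n:NODE)
ON (n.repoId, n.content_hash)
"""

def create_content_hash_index(session) -> None:
    # session is expected to be a neo4j driver session
    session.run(CONTENT_HASH_INDEX)
```

With repoId leftmost, queries that equality-filter on repoId (as get_cached_node_hashes does) can use the index prefix.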
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5ff094b4-54f2-4f97-a497-8c30538175a0
📒 Files selected for processing (3)
- app/modules/parsing/graph_construction/code_graph_service.py
- tests/unit-tests/parsing/test_content_hash.py
- tests/unit-tests/parsing/test_inference_cache_service.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unit-tests/parsing/test_inference_cache_service.py
- tests/unit-tests/parsing/test_content_hash.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@app/modules/parsing/knowledge_graph/inference_service.py`:
- Around line 152-177: The precomputed content_hash on nodes is trusted without
confirming it was generated from the same normalized text; update the logic in
inference_service.py (around the node["content_hash"] branch in the loop that
uses _normalize_node_text, is_content_cacheable, and generate_content_hash) to
always compute normalized_text via _normalize_node_text and then recompute
content_hash from that normalized_text (using generate_content_hash) before
using it for cache lookups, or alternatively, read and trust a persisted
normalized hash if graph_construction/code_graph_service.py starts saving one;
in short, ensure the hash used for cache lookup is derived from the
normalized_text path (recompute content_hash from normalized_text in the branch
that currently trusts node["content_hash"]).
- Around line 506-508: When all_docstrings is empty, avoid creating a synthetic
parent DocstringNode with docstring="" and embeddings; instead return an empty
DocstringResponse immediately so update_neo4j_with_docstrings() is not invoked
for a no-op. Modify the logic around the consolidated_text creation (the branch
handling len(all_docstrings) == 0) to short-circuit and return an appropriate
empty DocstringResponse rather than building a parent node, ensuring functions
and callers that expect a DocstringResponse still receive one but no Neo4j
writes or embeddings are produced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c95a9097-1fdc-47e5-8d09-2106e1094d96
📒 Files selected for processing (1)
app/modules/parsing/knowledge_graph/inference_service.py
# Use pre-computed content_hash from graph if available
content_hash = node.get("content_hash")

if not content_hash:
    # Normalize text for consistent hashing
    normalized_text = self._normalize_node_text(text, node_dict)
    node["normalized_text"] = normalized_text

    # Check if content is cacheable
    if not is_content_cacheable(normalized_text):
        uncacheable_nodes += 1
        continue

    # Generate content hash
    content_hash = generate_content_hash(
        normalized_text, node.get("node_type")
    )
    node["content_hash"] = content_hash
else:
    # Hash already computed at graph construction time
    normalized_text = self._normalize_node_text(text, node_dict)
    node["normalized_text"] = normalized_text

    # Generate content hash
    content_hash = generate_content_hash(normalized_text, node.get("node_type"))
    node["content_hash"] = content_hash
    if not is_content_cacheable(normalized_text):
        uncacheable_nodes += 1
        continue
Precomputed hashes need the same canonicalization path.
This branch trusts node["content_hash"] before proving it was generated from the same normalized text that the fallback path hashes. In app/modules/parsing/graph_construction/code_graph_service.py:185-204, the graph writer stores hashes from raw node_text, so nodes containing "Code replaced for brevity...node_id ..." placeholders can keep branch-specific hashes and miss the cross-branch cache even when normalized_text is identical. Either persist the normalized hash during graph construction or recompute from normalized_text here before lookup.
Possible safe fix in this file
-if not content_hash:
-    # Normalize text for consistent hashing
-    normalized_text = self._normalize_node_text(text, node_dict)
-    node["normalized_text"] = normalized_text
-
-    # Check if content is cacheable
-    if not is_content_cacheable(normalized_text):
-        uncacheable_nodes += 1
-        continue
-
-    # Generate content hash
-    content_hash = generate_content_hash(
-        normalized_text, node.get("node_type")
-    )
-    node["content_hash"] = content_hash
-else:
-    # Hash already computed at graph construction time
-    normalized_text = self._normalize_node_text(text, node_dict)
-    node["normalized_text"] = normalized_text
-
-    if not is_content_cacheable(normalized_text):
-        uncacheable_nodes += 1
-        continue
+normalized_text = self._normalize_node_text(text, node_dict)
+node["normalized_text"] = normalized_text
+
+if not is_content_cacheable(normalized_text):
+    uncacheable_nodes += 1
+    continue
+
+content_hash = generate_content_hash(
+    normalized_text, node.get("node_type")
+)
+node["content_hash"] = content_hash

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/modules/parsing/knowledge_graph/inference_service.py` around lines 152 -
177, The precomputed content_hash on nodes is trusted without confirming it was
generated from the same normalized text; update the logic in
inference_service.py (around the node["content_hash"] branch in the loop that
uses _normalize_node_text, is_content_cacheable, and generate_content_hash) to
always compute normalized_text via _normalize_node_text and then recompute
content_hash from that normalized_text (using generate_content_hash) before
using it for cache lookups, or alternatively, read and trust a persisted
normalized hash if graph_construction/code_graph_service.py starts saving one;
in short, ensure the hash used for cache lookup is derived from the
normalized_text path (recompute content_hash from normalized_text in the branch
that currently trusts node["content_hash"]).
if len(all_docstrings) == 0:
    consolidated_text = ""
elif len(all_docstrings) == 1:
Don’t persist a synthetic empty docstring for empty chunk results.
When all_docstrings is empty, this now builds a parent DocstringNode with docstring="". That flows into update_neo4j_with_docstrings() and writes an empty docstring plus embedding for the parent node instead of leaving the node untouched. Return an empty DocstringResponse here.
Suggested fix
- if len(all_docstrings) == 0:
- consolidated_text = ""
+ if len(all_docstrings) == 0:
+ return DocstringResponse(docstrings=[])
elif len(all_docstrings) == 1:
    consolidated_text = all_docstrings[0]

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-if len(all_docstrings) == 0:
-    consolidated_text = ""
-elif len(all_docstrings) == 1:
+if len(all_docstrings) == 0:
+    return DocstringResponse(docstrings=[])
+elif len(all_docstrings) == 1:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@app/modules/parsing/knowledge_graph/inference_service.py` around lines 506 -
508, When all_docstrings is empty, avoid creating a synthetic parent
DocstringNode with docstring="" and embeddings; instead return an empty
DocstringResponse immediately so update_neo4j_with_docstrings() is not invoked
for a no-op. Modify the logic around the consolidated_text creation (the branch
handling len(all_docstrings) == 0) to short-circuit and return an appropriate
empty DocstringResponse rather than building a parent node, ensuring functions
and callers that expect a DocstringResponse still receive one but no Neo4j
writes or embeddings are produced.
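A minimal sketch of that short-circuit; DocstringResponse is simplified here and the real consolidation and Neo4j write logic is elided:

```python
from dataclasses import dataclass, field

@dataclass
class DocstringResponse:
    """Simplified stand-in for the real response model."""
    docstrings: list = field(default_factory=list)

def consolidate_docstrings(all_docstrings: list) -> DocstringResponse:
    if len(all_docstrings) == 0:
        # Short-circuit: no synthetic parent node, no Neo4j write, no embedding
        return DocstringResponse(docstrings=[])
    if len(all_docstrings) == 1:
        consolidated_text = all_docstrings[0]
    else:
        consolidated_text = "\n\n".join(all_docstrings)
    # ...build the parent DocstringNode from consolidated_text (elided)...
    return DocstringResponse(docstrings=[consolidated_text])

assert consolidate_docstrings([]).docstrings == []
assert consolidate_docstrings(["a", "b"]).docstrings == ["a\n\nb"]
```

Callers still receive a DocstringResponse in every case, so the signature contract is preserved.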
The duplicate_graph method was not copying the new content_hash and node_text fields when cloning nodes to a new repo, causing cache misses on duplicated projects.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
app/modules/parsing/graph_construction/parsing_service.py (1)
621-627: ⚠️ Potential issue | 🔴 Critical

Indent duplicate_graph as a method of ParsingService. The function is currently defined at module level (line 621) but uses self and accesses instance attributes (self.search_service, self.inference_service.driver). This will cause a NameError at runtime. Move the function definition to be inside the ParsingService class with proper indentation.

The field additions for content_hash and node_text are correctly implemented in both the query (lines 636–637) and create statement (lines 665–666), but they will not work until the function structure is fixed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@app/modules/parsing/graph_construction/parsing_service.py` around lines 621 - 627, The method duplicate_graph is currently defined at module level but uses instance members (self.search_service, self.inference_service.driver), so move and indent the entire async def duplicate_graph(...) block to be inside the ParsingService class definition so it becomes an instance method; ensure its signature remains async def duplicate_graph(self, old_repo_id: str, new_repo_id: str) and that all internal references (node_batch_size, relationship_batch_size, session usage, and the queries that add content_hash and node_text) keep their current names and indentation so they access self correctly within ParsingService.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@app/modules/parsing/graph_construction/parsing_service.py`:
- Around line 621-627: The method duplicate_graph is currently defined at module
level but uses instance members (self.search_service,
self.inference_service.driver), so move and indent the entire async def
duplicate_graph(...) block to be inside the ParsingService class definition so
it becomes an instance method; ensure its signature remains async def
duplicate_graph(self, old_repo_id: str, new_repo_id: str) and that all internal
references (node_batch_size, relationship_batch_size, session usage, and the
queries that add content_hash and node_text) keep their current names and
indentation so they access self correctly within ParsingService.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 79d3ff72-4ad2-43ab-ae7e-1cb2c0116e7a
📒 Files selected for processing (1)
app/modules/parsing/graph_construction/parsing_service.py
/claim #223
Proposed Changes
Implements hash-based caching for knowledge graph nodes as described in #223. The existing codebase already has an excellent inference cache infrastructure (PostgreSQL inference_cache table, InferenceCacheService, content_hash.py). This PR closes the gap by integrating content hashes into the graph construction phase so that cross-branch cache comparisons are efficient.

What changed
1. Graph Construction (code_graph_service.py)
- Computes content_hash for each node during graph construction using the existing generate_content_hash() utility
- Stores content_hash directly on Neo4j nodes for fast cross-branch lookups
- Adds a composite index content_hash_repo_idx on (content_hash, repoId) for efficient hash-based queries
- Adds a get_cached_node_hashes() method to retrieve cached inference data by content hash from Neo4j

2. Inference Service (inference_service.py)
- fetch_graph() now returns content_hash and node_type from Neo4j, avoiding redundant hash recomputation
- _lookup_cache_for_nodes() reuses pre-computed graph hashes when available, falling back to on-the-fly computation

3. Tests (3 new test files, 20+ test cases)
- test_content_hash.py — Hash generation, whitespace normalization, type differentiation, cacheability checks
- test_inference_cache_service.py — Cache hit/miss, store/upsert, access tracking, stats
- test_graph_hash_integration.py — Cross-branch cache reuse, partial branch changes, same/modified code scenarios

How it works
Proof
The content hash system ensures identical code produces identical hashes regardless of branch:
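That branch-independence claim can be illustrated with a toy version of the hash; the real generate_content_hash also salts with node type and CACHE_VERSION, so this sketch only demonstrates determinism under normalization:

```python
import hashlib

def toy_content_hash(node_text: str, node_type: str = "FUNCTION") -> str:
    # Hash depends only on normalized content and type, never on branch or repo
    normalized = "\n".join(line.rstrip() for line in node_text.strip().splitlines())
    return hashlib.sha256(f"{node_type}:{normalized}".encode("utf-8")).hexdigest()

main_branch = "def add(a, b):\n    return a + b\n"
feature_branch = "def add(a, b):\n    return a + b   \n"  # trailing whitespace only

# Identical (normalized) code hashes identically across branches...
assert toy_content_hash(main_branch) == toy_content_hash(feature_branch)
# ...while genuinely different code does not
assert toy_content_hash("def sub(a, b):\n    return a - b") != toy_content_hash(main_branch)
```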
Checklist
Summary by CodeRabbit
New Features
Improvements
Tests