fix: HierarchicalSessionStore stale cache could wipe session messages#1781
Conversation
After DefaultSessionStore began reloading on reads (#1759), HierarchicalSessionStore still served writes from a separate _extended_cache that could lag disk. set_title, share_session, and revert paths then saved truncated message lists. Reload via _read_session_fresh for all extended reads and sync both caches. Add regression test for set_title after cross-store writes. Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
|
@coderabbitai review |
|
/review |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
✅ Actions performedReview triggered.
|
|
Important Review skippedBot user detected. To trigger a single review, invoke the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThe PR refactors ChangesSession freshness and external write safety
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first — incorporate their findings. Review areas:
|
|
@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding. Phase 1: Review per AGENTS.md
Phase 2: FIX Valid Issues Phase 3: Final Verdict |
|
Claude finished @MervinPraison's task —— View job Claude Architecture ReviewActing as final architecture reviewer per AGENTS.md guidelines. Todo List:
Architecture Assessment✅ Protocol-Driven Core Compliance
✅ Performance & Bloat Check
✅ Backward Compatibility
✅ Security
✅ Test Coverage
✅ Previous Reviewer Analysis
Technical ReviewCore Fix Analysis:
Architecture Alignment:
Verdict: ✅ APPROVEDNo architectural issues found. The fix is minimal, focused, and properly addresses the critical session message loss bug while maintaining all framework design principles. --- • Branch: |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
|
@copilot resolve the merge conflicts in this pull request |
Greptile SummaryThis PR addresses a stale-cache bug in
Confidence Score: 4/5Safe to merge with one issue to address: the mtime snapshot in The core read-modify-write paths ( src/praisonai-agents/praisonaiagents/session/hierarchy.py — specifically the mtime capture in Important Files Changed
Sequence DiagramsequenceDiagram
participant A as Process A
participant FL as FileLock
participant Disk as Disk File
participant Cache as _extended_cache / _cache_mtimes
Note over A,Cache: _read_session_fresh() — current implementation
A->>FL: acquire FileLock
FL->>Disk: read file (data D0, mtime T0)
Disk-->>A: session object
A->>FL: release FileLock
Note over FL,Disk: ⚠ Lock is now free — race window opens
rect rgb(255, 220, 220)
Note over FL,Disk: Concurrent writer (Process B)
FL->>Disk: write new data D1 → mtime T1
end
A->>Disk: os.path.getmtime() → returns T1
A->>Cache: "store data=D0, mtime=T1"
Note over A,Cache: _is_cache_valid() on next call
A->>Disk: os.path.getmtime() → T1
A->>Cache: "cached_mtime == T1 → cache VALID ❌"
Cache-->>A: returns stale D0 data
Reviews (3): Last reviewed commit: "perf: optimize HierarchicalSessionStore ..." | Re-trigger Greptile |
|
@claude run the test and validate |
|
Claude finished @MervinPraison's task —— View job Test Validation and Code Review ✅Todo List:
✅ TEST VALIDATION COMPLETEFix Validation:
Technical Analysis:
Reviewer Feedback Assessment: CodeRabbit: ✅ No actionable issues - clean implementation Greptile:
✅ FINAL VERDICT: APPROVEDThis fix correctly addresses the critical data loss bug where Security & Correctness: ✅ No data loss, thread-safe file operations maintained The fix prioritizes data integrity over performance, which is the correct choice for a session store. Greptile's performance concerns are valid but non-blocking - the trade-off is appropriate for preventing message loss. Branch: |
|
@claude can't we also deal with performance impact if its a real deal speed is also important for the software right ? |
|
Claude finished @MervinPraison's task —— View job Performance Optimization Complete ✅You're absolutely right - speed is critical for user experience. I've successfully implemented an optimized solution that maintains data integrity while restoring intelligent caching. Todo List:
✅ Performance Optimization CompleteProblem Solved: The original fix eliminated ALL caching to prevent data loss, causing O(n) disk reads and violating AGENTS.md § 4.2 "No Performance Impact". Solution Implemented: File modification time-based cache invalidation that provides both data integrity and performance. 🚀 Key Performance Improvements1. Smart Caching Restored:
2. External Write Detection:
3. Data Integrity Maintained:
🧪 Comprehensive Testing AddedAdded 3 new test cases covering:
📊 Before vs After
✅ AGENTS.md Compliance Restored
Result: Best of both worlds - the critical data loss bug is fixed while performance is fully restored through intelligent cache invalidation. Branch: |
- Add file modification time-based cache validation - Restore performance while maintaining data integrity - Preserve force_reload API contract - Add comprehensive regression tests Fixes performance regression from stale cache fix while preventing message loss from concurrent writes. Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
| def _read_session_fresh(self, session_id: str) -> ExtendedSessionData: | ||
| """Reload from disk and keep _cache and _extended_cache in sync.""" | ||
| session = super()._read_session_fresh(session_id) | ||
| if not isinstance(session, ExtendedSessionData): | ||
| session = ExtendedSessionData.from_session_data(session) | ||
| with self._lock: | ||
| self._cache[session_id] = session | ||
|
|
||
| # Update cache with fresh file mtime | ||
| filepath = self._get_session_path(session_id) | ||
| try: | ||
| mtime = os.path.getmtime(filepath) if os.path.exists(filepath) else time.time() | ||
| except (OSError, IOError): | ||
| mtime = time.time() | ||
|
|
||
| with self._lock: | ||
| self._extended_cache[session_id] = session | ||
| self._cache_mtimes[session_id] = mtime | ||
|
|
||
| return session |
There was a problem hiding this comment.
Mtime captured outside FileLock — same race it aims to prevent
super()._read_session_fresh() acquires and releases FileLock internally. The mtime snapshot is taken after that lock is gone. In the window between the lock release and the getmtime call, a concurrent writer can complete a full locked write, advancing the file's mtime to T1. The cache then stores T0 data paired with the T1 mtime. On the next _is_cache_valid check, current_mtime == cached_mtime (T1) returns True, so the stale T0 data is served — the exact bug this PR intends to close.
The mtime must be sampled inside the same FileLock that protects the read. Because super()._read_session_fresh owns the lock and returns after releasing it, the safe fix is to bypass super() and reproduce its logic directly, capturing os.path.getmtime while the lock is still held.
Bug and impact
HierarchicalSessionStorekept a separate_extended_cachethat was not refreshed when another store instance (or process) wrote newer messages to the same session file. Operations such asset_title,share_session, andrevert_to_snapshotloaded stale extended session data and saved it back, silently dropping messages from disk.Trigger: Process A calls
get_extended_session()(warms cache). Process B (or anotherHierarchicalSessionStoreinstance) appends messages. Process A callsset_title()→ disk is overwritten with the old message list.Root cause
PR #1759 fixed stale reads in
DefaultSessionStorevia_read_session_fresh(), butHierarchicalSessionStorecontinued to use_extended_cachefor most read/write paths without reloading from disk.Fix
_read_session_fresh()to sync both_cacheand_extended_cache_load_extended_session()andget_extended_session()through fresh disk reloadstest_set_title_does_not_drop_messages_after_external_writeValidation
set_title)pytest tests/unit/session/test_hierarchy.py tests/unit/session/test_session_store.py(58 passed)Summary by CodeRabbit
Release Notes
Bug Fixes
Tests