Archive System: External Data Source Indexing#3047
Conversation
…ditions Reverts the query/response approach from #3037 and fixes the actual bugs that caused empty ephemeral directories: - directory_listing.rs: Restore async indexer dispatch (return empty, populate via events). Subdirectories from a parent's shallow index now correctly fall through to trigger their own indexer job. - subscriptionManager.ts: Pre-register initial listener before calling transport.subscribe() so buffer replay events aren't broadcast to an empty listener Set. - useNormalizedQuery.ts: Seed TanStack Query cache when oldData is undefined, so events arriving before the query response aren't silently dropped by the setQueryData updater. Adds bridge test (Rust harness + TS integration) that reproduces the ephemeral event streaming flow end-to-end.
Updated project description in README.md.
|
Important Review skippedToo many files! This PR contains 243 files, which is 93 over the limit of 150. ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (57)
📒 Files selected for processing (243)
You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
- Create core/src/data/ module with SourceManager wrapping sd-archive Engine - Add Sources to GroupType and Source to ItemType enums - Add default Sources group to new library creation - Register source operations: create, list, get, delete, sync, list_items - Register adapter operations: list, config, update - Add bundled adapter sync from workspace adapters/ directory - Add adapter update system with BLAKE3 change detection and backup/rollback - Frontend: Sources home, source detail with virtualized list, adapters screen - Frontend: SourcesGroup sidebar, SpaceGroup dispatch, spaceItemUtils - Frontend: TopBar integration (path bar, search, sync, actions menu) - Frontend: Tab title sync, adapter icon lookup hook - Regenerate TypeScript types Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Refactor source sync to dispatch SourceSyncJob instead of inline sync - Rewrite VoiceOverlay with audio recorder and TTS playback hooks - Migrate TabBar to @spaceui/primitives - Update SpacebotContext query invalidation - Add SpaceUI section to CONTRIBUTING.md and README.md - Update sources UI (Adapters, SourceDetail) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Delete Sidebar.tsx, SidebarItem.tsx, Section.tsx, LocationsSection.tsx from the old explorer sidebar. The active sidebar is SpacesSidebar which uses SpaceItem from @spaceui/primitives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Sort `pub mod adapters` alphabetically in ops/mod.rs - Wrap long line in volume/fs/refs.rs - Handle empty TARGET_TRIPLE env var in xtask setup - Replace `link:@spacedrive/*` with published `^0.2.3` versions
- cargo fmt across all modified files - Add Vite client types and Window.__SPACEDRIVE__ declaration - Fix @sd/interface/platform import to @sd/interface - Align @types/react versions between tauri and interface packages - Remove unused imports/vars in useDropZone, DragOverlay, ContextMenuWindow - Fix WebviewWindow.location references to use globalThis - Exclude updater.example.ts from typecheck
- Add ReactComponent export to *.svg module declarations - Fix SdPath imports to use generated types (device_slug, Cloud/Sidecar variants) - Create useJobs barrel file for JobManager hooks - Remove unused imports across ~65 files - Add type annotations for implicit any params (d3, callbacks, map iterators) - Remove stale @ts-expect-error directives in MeshViewer - Add declare module for gaussian-splats-3d and qrcode - Fix Location field names (path→sd_path, total_file_count→file_count, etc) - Fix SdPath discriminated union narrowing (remove stale Local variant) - Fix React 19 RefObject<T|null> vs RefObject<T> mismatches - Fix null vs undefined mismatches throughout - Add missing required fields to ApplyTagsInput, policy types, etc
Matches the spacebot justfile pattern for local SpaceUI development. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR SummaryHigh Risk Overview Adds multiple production adapters under Updates project metadata/docs: removes git submodules, expands contributor guidance around SpaceUI local linking, rewrites Reviewed by Cursor Bugbot for commit beab5a7. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 6 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| params = { | ||
| "startHistoryId": history_id, | ||
| "maxResults": min(BATCH_SIZE, max_results), | ||
| "historyTypes": "messageAdded,messageDeleted,labelAdded,labelRemoved", |
There was a problem hiding this comment.
Gmail incremental sync returns wrong type causing unpack error
High Severity
sync_messages_incremental returns either a single integer (when falling back to full sync on line 289) or a tuple (total_changes, new_history_id) on line 346. The caller on line 468 always unpacks it as a tuple: total, new_history_id = sync_messages_incremental(...). When the history ID is expired and the function falls back to sync_messages_full, it returns a plain int, causing a ValueError: not enough values to unpack at runtime.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| version: a.version().to_string(), | ||
| author: a.author().to_string(), | ||
| data_type: a.data_type().to_string(), | ||
| kind: AdapterKind::Native, |
There was a problem hiding this comment.
AdapterRegistry.list always reports kind as Native
Medium Severity
AdapterRegistry::list hardcodes kind: AdapterKind::Native for every adapter, including script-based adapters. Since all 11 shipped adapters are script-based, their AdapterInfo will incorrectly report kind: Native instead of kind: Script. This will mislead any UI or logic that depends on the adapter kind.
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| AND {where} | ||
| ORDER BY hv.visit_time DESC | ||
| LIMIT ? | ||
| """ |
There was a problem hiding this comment.
FTS search inserts WHERE clause inside subquery incorrectly
Medium Severity
The Safari history SQL query applies the cursor filter (hv.visit_time > ?) and visit_count filter as top-level WHERE conditions alongside a correlated subquery that selects the latest visit. When the cursor filter is active, rows where the most recent visit is older than the cursor but a non-latest visit is newer could be incorrectly excluded or included, producing inconsistent results with the incremental sync model.
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| tokio::spawn(async move { | ||
| let _ = stdin.write_all(config_json.as_bytes()).await; | ||
| let _ = stdin.shutdown().await; | ||
| }); |
There was a problem hiding this comment.
Script adapter stdin sends config without cursor data
High Severity
The script adapter sends only the raw config JSON to the subprocess via stdin, but all adapter scripts expect the input to contain both config and cursor keys (e.g., input_data.get("cursor")). The cursor value stored in the database via db.get_cursor() is never retrieved or included in the stdin payload, so incremental sync will never work — adapters will always perform a full sync.
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| "to": "note", | ||
| "to_id": target_id, | ||
| }) | ||
| link_count += 1 |
There was a problem hiding this comment.
Obsidian adapter only resolves links from current sync batch
Medium Severity
During incremental sync, title_to_id is only populated from files modified since the last cursor. When resolving wikilinks in the second pass, links to unchanged notes won't resolve because those notes aren't in title_to_id. This means incremental syncs will silently lose inter-note links to any previously-synced, unmodified notes.
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| trust_tier: TrustTier::from_str_or_default(&self.trust_tier), | ||
| safety_mode: SafetyMode::from_str_or_default(&self.safety_mode), | ||
| quarantine_threshold: self.quarantine_threshold as u8, | ||
| flag_threshold: self.flag_threshold as u8, |
There was a problem hiding this comment.
Integer safety_score overflows on i32 to u8 cast
Low Severity
SourceRow::into_info casts quarantine_threshold and flag_threshold from i32 to u8 without bounds checking. If the database contains values outside 0–255 (e.g., from manual edits or corruption), the cast will silently truncate, producing incorrect threshold values that could affect safety screening behavior.
Reviewed by Cursor Bugbot for commit beab5a7. Configure here.
| std::fs::create_dir_all(&models_dir)?; | ||
| let safety = match SafetyModel::new(&models_dir) { | ||
| Ok(model) => { | ||
| tracing::info!("safety screening model loaded (Prompt Guard 2 22M)"); |
There was a problem hiding this comment.
This log line claims Prompt Guard 2, but SafetyModel is currently a stub (always safe, SAFETY_MODEL_VERSION = "stub-v1"). I'd either make the log reflect the actual model/version, or gate the Prompt Guard wording behind the real implementation.
| tracing::info!("safety screening model loaded (Prompt Guard 2 22M)"); | |
| tracing::info!(version = SAFETY_MODEL_VERSION, "safety screening model loaded"); |
| .envs(&env) | ||
| .stdin(std::process::Stdio::piped()) | ||
| .stdout(std::process::Stdio::piped()) | ||
| .stderr(std::process::Stdio::piped()) |
There was a problem hiding this comment.
stderr is piped but never drained, which can deadlock the adapter if it writes enough to stderr (pipe buffer fills, child blocks, parent waits on stdout/exit).
Also, adapter.runtime.timeout exists in the manifest, but it isn't enforced here. Wrapping the sync loop + child.wait() in a tokio::time::timeout (and killing the child on expiry) would prevent a hung adapter from stalling sync forever.
| if let Some(obj) = config.as_object() { | ||
| for (key, value) in obj { | ||
| env.insert( | ||
| format!("SPACEDRIVE_CONFIG_{}", key.to_uppercase()), |
There was a problem hiding this comment.
build_env exports every config key/value as SPACEDRIVE_CONFIG_*. Two concerns:
- Secrets:
adapter.tomlhas[[adapter.config]] secret = true, but that isn't used here. It seems safer to avoid exporting secret fields to env vars (stdin already carries full config). - Key sanitization: JSON keys can contain characters that are invalid/awkward in environment variable names (notably
=/NUL), which can makespawn()fail.
| // directory. Uses CARGO_MANIFEST_DIR at compile time to find the workspace | ||
| // root, matching the pattern from the spacedrive-data prototype. | ||
| let installed_dir = data_dir.join("adapters"); | ||
| Self::sync_bundled_adapters(&installed_dir); |
There was a problem hiding this comment.
This is running synchronous std::fs directory walking/copying inside an async fn new(). If this happens on the runtime worker thread, it can stall unrelated tasks.
Worth considering tokio::task::spawn_blocking (or tokio::fs) for the adapter sync/copy, and using symlink_metadata/explicit symlink handling in copy_dir_recursive to avoid accidentally following symlinks when copying into the library directory.
| .await | ||
| .map_err(|e| LibraryError::Other(format!("Failed to create source manager: {e}")))?; | ||
|
|
||
| self.source_manager |
There was a problem hiding this comment.
Minor race: get().is_some() + set(...) isn't atomic. If two callers race, one will hit the set error and bubble a LibraryError even though the manager is actually initialized.
Consider treating a set failure as success here (or switching to a get_or_try_init style API).
| .into_iter() | ||
| .find(|s| s.id == self.input.source_id) | ||
| .ok_or_else(|| { | ||
| QueryError::Internal(format!("Source not found: {}", self.input.source_id)) |
There was a problem hiding this comment.
Not-found is currently surfaced as QueryError::Internal, which likely ends up as a 500 for a normal "missing source" case. Seems more appropriate to return InvalidInput/Validation here.
| QueryError::Internal(format!("Source not found: {}", self.input.source_id)) | |
| QueryError::InvalidInput(format!("Source not found: {}", self.input.source_id)) |
| const source = await core.sources.create({ | ||
| name: "Work Gmail", | ||
| adapter_id: "gmail", | ||
| trust_tier: "external", |
There was a problem hiding this comment.
This TS example includes trust_tier, but core.sources.create input currently only accepts { name, adapter_id, config } and the trust tier comes from the adapter manifest.
| trust_tier: "external", | |
| // trust tier comes from the adapter manifest |
| let engine = Engine::new(config).await?; | ||
|
|
||
| // Create source from adapter | ||
| let source_id = engine.create_source( |
There was a problem hiding this comment.
The usage example here doesn't match the current Engine API (create_source returns SourceInfo, sync is engine.sync(&source_id), and search is cross-source with an optional SearchFilter). It'd be good to keep this example compiling so external users can copy/paste it.


Summary
Adds the Archive system to Spacedrive v2 - a data archival engine for indexing external sources (emails, notes, messages, etc.) beyond the filesystem.
Key additions:
Architecture
Standalone Crate
Built as
crates/archive/(package:sd-archive) for better CI caching and reusability:Core Integration
Integrates with v2 via library-scoped manager:
Storage Layout
Sources live alongside VDFS in library:
Adapters
Shipped adapters (11 total):
Adapter protocol:
adapters/directoryFeatures
Hybrid Search
Combines two search strategies via Reciprocal Rank Fusion:
Safety Screening
Every record passes through Prompt Guard 2 before becoming searchable:
Schema-Driven
Sources defined by TOML schemas, auto-generate:
Example schema:
License Change: AGPL → FSL
Changed from AGPL-3.0 to FSL-1.1-ALv2 (Functional Source License):
Why FSL:
Additional restrictions added:
Still permitted:
README Rewrite
Simplified and modernized the README:
New tagline: "One file manager for all your devices and clouds"
New opening:
Documentation
Design Doc
docs/core/design/archive.md(1,114 lines)Complete implementation plan:
User Documentation
docs/archive/README.md(403 lines)User-facing guide:
Crate Documentation
crates/archive/README.md(239 lines)Developer reference:
Implementation Status
✅ Completed
Phase 0: Adapters
Phase 1: Standalone Crate
crates/archive/Phase 2: Core Integration
sd-archivedependency to corecore/src/ops/sources/Documentation:
🚧 Next Steps
Phase 2 (continued):
library_sourcestablePhase 3: Jobs & Pipeline
Phase 4: Search
sources.searchqueryPhase 5: UI
Breaking Changes
License
Dependencies
lancedb = "0.15"(vector search)fastembed = "4"(embeddings)ort,tokenizers,hf-hub(safety screening)Testing
Archive Crate
Core Integration
cargo test -p spacedrive-core -- sources::Adapters
Performance
Benchmarks (M2 Max, 10k Gmail messages):
Memory:
Migration Guide
For Users
No migration needed. Archive is a new feature. Existing VDFS data unaffected.
For Developers
New operations available:
New crate available:
Related
docs/core/design/archive.mddocs/archive/README.mdcrates/archive/README.md~/Projects/spacedriveapp/spacedrive-archive-prototype🤖 Generated with Claude Code
Note
This PR introduces the Archive system, a comprehensive data archival engine for indexing external sources beyond the filesystem. Key additions include 11 production-ready adapters (Gmail, Slack, Obsidian, Chrome, Safari, GitHub, etc.), hybrid search combining FTS5 and LanceDB vector search via RRF, safety screening with Prompt Guard 2, and comprehensive documentation. The implementation is built as a standalone crate (
crates/archive/) for better CI caching and reusability. Additionally, this PR changes the license from AGPL-3.0 to FSL-1.1-ALv2 and rewrites the README. See the design documentation atdocs/core/design/archive.mdfor architectural details and integration patterns with Spacedrive v2.Written by Tembo for commit beab5a7. This will update automatically on new commits.