
Commit 2ae6ee9

reworked the concept for going forward
1 parent d872b76 commit 2ae6ee9

4 files changed

Lines changed: 358 additions & 416 deletions

File tree

proposed-changes/024-three-phase-sync-architecture.md

Lines changed: 12 additions & 13 deletions
@@ -1,20 +1,18 @@
 # 024 — Three-Phase Sync Architecture
 
-**Status**: Draft
-**Created**: 2026-03-01
+**Status**: Draft
+**Created**: 2026-03-01
+**Updated**: 2026-03-16
 **Context**: Initial sync takes ~48–54s for Chat Essence app (2015 messages, 62 group shards × 2 types = 124 shard operations, 263 files total). Root cause is sequential per-shard processing where download, merge, upload, and DB commit are interleaved.
 
-## Decision Alignment (026)
+## Strategic Context (026)
 
-This proposal is the mandatory baseline in the performance-first direction defined in `026-recap-sync-direction.md`.
+This proposal is the **primary performance lever** in the direction defined by `026-recap-sync-direction.md`.
 
-- 024 is required regardless of storage mode.
-- It improves execution order and batching without forcing a storage-format decision.
-- It applies to both profiles:
-  - Dataset/Flat mode (Dir, GDrive default profile).
-  - Linked-Data mode (Solid/interoperability profile).
-
-In short: 024 is phase A of the new strategy, not an optional optimization.
+- The logical index/shard model is sound and stays unchanged.
+- The performance problem is in **execution**: sequential I/O, per-shard DB commits, no parallelism.
+- This proposal fixes execution without touching the data model or requiring a new orchestrator.
+- It applies to all backends equally (Dir, GDrive, Solid).
 
 ---
@@ -292,13 +290,14 @@ Recommend **Option B** as the pragmatic choice — most of the benefit with boun
 
 ### Why Not Faster?
 
-The remaining ~15s is CPU time for CRDT merges across ~2015 documents. To go below that, structural changes (fewer files = less per-file overhead, see Proposal 025) or algorithmic optimizations in the merge logic are needed.
+The remaining ~15s is CPU time for CRDT merges across ~2015 documents. To reduce this further, algorithmic optimizations in the merge logic or reducing per-file serialization overhead (see Proposal 025 for backend-controlled physical layout) would be needed.
 
 ---
 
 ## Relationship to Other Proposals
 
-- **025 (Flat File Storage Architecture)**: Reduces file count from 263 to ~10–15. Three-phase sync is a prerequisite — it provides the download/merge/upload separation that Flat File mode builds on.
+- **025 (Backend-Controlled Physical Layout)**: Extends `_ShardSyncAdapter` so backends can decide how logical shards map to physical files. Three-phase sync is a prerequisite — it provides the download/merge/upload separation that bulk layout backends build on.
+- **026 (Recap Sync Direction)**: Defines the strategic direction. This proposal is the primary performance lever identified there.
 - **015 (Shard-Level File Consolidation)**: Dataset-mode shards. Three-phase sync works with both individual-resource and dataset-mode shards.
 - **014 (GDrive Sync Performance)**: Proposed batch APIs. Three-phase download naturally enables batch/parallel regardless of backend.
 - **013 (Sync Structure Analysis)**: Documented sequential overhead. This proposal directly addresses it.
proposed-changes/025-backend-controlled-physical-layout.md

Lines changed: 261 additions & 0 deletions

@@ -0,0 +1,261 @@
# 025 — Backend-Controlled Physical Layout

**Status**: Draft
**Created**: 2026-03-01
**Updated**: 2026-03-16
**Depends on**: 024 (Three-Phase Sync Architecture)
**Context**: After three-phase sync (024) fixes execution order and parallelism, further gains come from letting backends aggregate logical shards into fewer physical files — without changing the logical index/shard model.

## Strategic Context (026)

This proposal complements `026-recap-sync-direction.md` and builds on `024`:

- The logical index/shard model stays unchanged. There is **one** model, not two profiles.
- **024** fixes how shards are processed (parallel I/O, bulk commit).
- **025** fixes how backends map logical shards to physical files.
- The existing `_ShardSyncAdapter` abstraction is the right extension point.

---

## Problem Statement

After 024, all shard downloads and uploads happen in parallel with a single DB commit. But the file count per sync cycle remains high:

### File Count Analysis (Chat Essence)

| Category | Files |
|---|---|
| GroupIndex shards (SyncChatMessage) | 62 |
| GroupIndex shards (SyncMessageGroup) | 62 |
| FullIndex shards (SyncKeyword, etc.) | ~5–10 |
| Index metadata documents | ~128 |
| Infrastructure (ClientInstallation, etc.) | ~5 |
| **Total** | **~263** |

For the Dir backend, parallelized local file I/O makes 263 files manageable. For remote backends (GDrive, Solid, WebDAV), each file means an HTTP request — parallelism helps but can't eliminate per-request latency overhead.

### The Key Insight

The orchestrator works with **logical shards**. How those shards map to physical files is a backend concern. Currently, `_ShardSyncAdapter` has two implementations:

- `FilePerResourceShardSyncAdapter`: 1 document = 1 file (Linked Data mode)
- `FilePerShardShardSyncAdapter`: 1 shard's documents = 1 TriG dataset file

Both still produce one physical I/O operation per logical shard. A backend like GDrive could aggregate multiple logical shards into fewer physical files, dramatically reducing HTTP round-trips.

---

## Proposal: Extend `_ShardSyncAdapter` for Bulk Physical Layout

### Core Idea

The orchestrator continues to think in terms of logical shards. The `_ShardSyncAdapter` interface gains the ability to **aggregate** multiple shards into physical storage units at the backend's discretion.

### How It Works

1. **Orchestrator** (after 024) collects all shard download/upload specs in Phase 1/3.
2. **Backend adapter** receives the full list of shard specs and decides how to map them to physical files.
3. **Download**: Backend reads physical files, splits them back into per-shard data, returns logical shard contents.
4. **Upload**: Backend receives per-shard data, aggregates into its preferred physical layout, writes.

### Interface Evolution

```dart
/// Current interface (simplified):
abstract class _ShardSyncAdapter {
  Future<ShardData> downloadShard(ShardSpec spec);
  Future<void> uploadShard(ShardSpec spec, ShardData data);
}

/// Extended interface — backends can override bulk operations:
abstract class _ShardSyncAdapter {
  /// Download a single shard (default path, used by Dir backend).
  Future<ShardData> downloadShard(ShardSpec spec);

  /// Download multiple shards in bulk.
  /// Default: delegates to downloadShard() individually.
  /// Override: backend can read fewer physical files and split results.
  Future<Map<ShardSpec, ShardData>> downloadShards(List<ShardSpec> specs) async {
    return {for (final spec in specs) spec: await downloadShard(spec)};
  }

  /// Upload a single shard (default path).
  Future<void> uploadShard(ShardSpec spec, ShardData data);

  /// Upload multiple shards in bulk.
  /// Default: delegates to uploadShard() individually.
  /// Override: backend can aggregate into fewer physical files.
  Future<void> uploadShards(Map<ShardSpec, ShardData> shards) async {
    for (final entry in shards.entries) {
      await uploadShard(entry.key, entry.value);
    }
  }
}
```

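Even without aggregation, a remote backend gets most of the latency win by overriding the bulk methods to fan the per-shard requests out concurrently (the Solid/WebDAV strategy in the table below). A minimal sketch, assuming the interface above; the class name and the `httpGet`/`httpPut` transport calls are placeholders, not existing APIs:

```dart
/// Illustrative adapter for a remote backend: physical layout stays
/// 1 file per shard, but all requests are issued concurrently.
class ParallelRemoteShardSyncAdapter extends _ShardSyncAdapter {
  @override
  Future<ShardData> downloadShard(ShardSpec spec) =>
      httpGet(spec); // placeholder for a single GET

  @override
  Future<void> uploadShard(ShardSpec spec, ShardData data) =>
      httpPut(spec, data); // placeholder for a single PUT

  @override
  Future<Map<ShardSpec, ShardData>> downloadShards(
      List<ShardSpec> specs) async {
    // All GETs in flight at once: total latency approaches the slowest
    // single request rather than the sum of all requests.
    final results = await Future.wait(specs.map(downloadShard));
    return Map.fromIterables(specs, results);
  }

  @override
  Future<void> uploadShards(Map<ShardSpec, ShardData> shards) async {
    await Future.wait(
        shards.entries.map((e) => uploadShard(e.key, e.value)));
  }
}
```

A production adapter would presumably bound the fan-out (say, a few dozen requests in flight) to respect server rate limits; the orchestrator is unaffected either way.
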
### Per-Backend Strategies

| Backend | Download Strategy | Upload Strategy | Physical Files |
|---|---|---|---|
| **Dir** | Parallel file reads (024) | Parallel file writes | 1 per shard (unchanged) |
| **GDrive** | Batch API or aggregated files | Batch API or aggregated files | Configurable (1 per type, or fewer aggregated files) |
| **Solid** | Parallel GETs; future bulk endpoints | Parallel PUTs; future bulk endpoints | 1 per shard or per resource (Linked Data) |
| **WebDAV** | Parallel GETs | Parallel PUTs | 1 per shard (unchanged) |

---

## GDrive: Aggregated Shard Files

For GDrive, the adapter could aggregate all shards of a resource type into a single physical TriG dataset file:

### Physical Layout Example (Chat Essence on GDrive)

```
locorda/chat_essence/
├── SyncChatMessage.trig      # All 62 ChatMessage shards in one file
├── SyncMessageGroup.trig     # All 62 MessageGroup shards in one file
├── SyncKeyword.trig          # All Keyword shards in one file
├── ClientInstallation.trig   # Infrastructure
└── ...                       # ~10–15 physical files total
```

Each physical file contains multiple Named Graphs (one per logical shard). The shard structure is preserved within the file — the backend simply packs/unpacks shards from the physical container.

```trig
# SyncChatMessage.trig — contains all 62 logical shards

# Shard metadata (which shards are in this file, their clock hashes)
<> a idx:AggregatedShardFile ;
   idx:containsShard <shard/0>, <shard/1>, ... <shard/61> .

<shard/0> {
  # All resources belonging to shard 0
  <chat-message/msg-001#it> a chat:ChatMessage ; ... .
  <chat-message/msg-042#it> a chat:ChatMessage ; ... .
}

<shard/1> {
  # All resources belonging to shard 1
  ...
}
```

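On the upload side, the packing step is mechanical. A sketch of what it could look like for the layout above; `graphIri` and `serializeTriples` are hypothetical helpers (a real adapter would reuse the existing TriG serializer), and per-shard clock hashes are omitted for brevity, as in the example:

```dart
/// Illustrative pack step: wrap each logical shard's data in its own
/// named graph and prepend the idx:AggregatedShardFile metadata.
String packShards(Map<ShardSpec, ShardData> shards) {
  final buffer = StringBuffer()
    ..writeln('<> a idx:AggregatedShardFile ;')
    ..writeln('   idx:containsShard ${shards.keys.map(graphIri).join(', ')} .');
  shards.forEach((spec, data) {
    buffer
      ..writeln('${graphIri(spec)} {')
      ..writeln(serializeTriples(data)) // assumed: shard data -> TriG body
      ..writeln('}');
  });
  return buffer.toString();
}
```
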
### Change Detection

- Per-file ETag/hash detects whether the aggregated file has changed.
- Inside the file, per-shard clock hashes enable skipping unchanged shard data during merge (Phase 2); see the unpack sketch below.
- The backend downloads the full file but the orchestrator merges only changed shards — same as today.

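The download-side counterpart splits the container back into per-shard data and keeps only shards whose clock hash moved. `parseAggregatedTrig` (assumed to yield spec, clock hash, and data per named graph) and the stored hash map are illustrative assumptions, not existing APIs:

```dart
/// Illustrative unpack-and-filter step for one aggregated file.
Map<ShardSpec, ShardData> unpackChangedShards(
  String trigFileContents,
  Map<ShardSpec, String> lastMergedClockHashes,
) {
  // Assumed helper: yields (spec, clockHash, data) per named graph,
  // guided by the idx:containsShard metadata shown above.
  final shards = parseAggregatedTrig(trigFileContents);

  final changed = <ShardSpec, ShardData>{};
  for (final shard in shards) {
    // Unchanged clock hash => Phase 2 skips this shard's merge.
    if (lastMergedClockHashes[shard.spec] != shard.clockHash) {
      changed[shard.spec] = shard.data;
    }
  }
  return changed;
}
```
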
### Result: HTTP Round-Trips

| Scenario | Current (263 files) | After 024 (parallel) | After 024 + 025 (aggregated) |
|---|---|---|---|
| Initial sync downloads | 263 sequential | 263 parallel | ~10–15 parallel |
| Incremental sync downloads | scan all shards | scan all shards (parallel) | download changed type files |

---

## Dir Backend: Minimal Changes

The Dir backend already performs well with parallel local file I/O (after 024). Aggregation is **not needed** for local filesystems — the overhead per file is ~1ms.

The Dir adapter uses the default `downloadShards`/`uploadShards` implementation (delegating to individual shard operations) with parallelism from 024.

---

## Solid Backend: Linked Data Preservation

For Solid, the adapter continues to use per-resource or per-shard files to preserve Linked Data discoverability. If Solid Community Server adds bulk endpoints (on their roadmap), the adapter can implement `downloadShards`/`uploadShards` using those endpoints without changing the logical model.

---

## Open Questions

### 1. Granularity of Aggregation

**Question**: Should aggregation always be per-type, or should backends choose?

**Assessment**: Leave it to the backend. A simple GDrive backend might start with one file per type. A more sophisticated version might split very large types into size-based chunks. The `_ShardSyncAdapter` interface doesn't prescribe aggregation granularity.

### 2. Concurrent Write Conflicts on Aggregated Files

**Question**: Two installations modify different shards of the same type → both upload the same aggregated file.

**Assessment**: Same mitigations as today:
- **ETags** detect the conflict.
- **Re-download, re-merge, re-upload** resolves it (see the retry sketch below).
- **CRDT semantics** guarantee eventual convergence regardless.
- **Scale reminder**: 2–20 installations, low conflict probability.

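Sketched as a loop; `uploadIfMatch`, `downloadWithEtag`, `crdtMerge`, `PreconditionFailed`, and the `FileData` type are stand-ins for the real transport and merge machinery:

```dart
/// Illustrative optimistic upload for one aggregated file, guarded by
/// the ETag observed at download time.
Future<void> uploadAggregatedFile(
    String path, FileData merged, String etagSeen) async {
  var data = merged;
  var etag = etagSeen;
  while (true) {
    try {
      // Conditional PUT: succeeds only if the remote file still
      // carries the ETag we saw.
      await uploadIfMatch(path, data, ifMatch: etag);
      return;
    } on PreconditionFailed {
      // Another installation won the race: fetch theirs, merge ours
      // in (CRDT merge, so the order of application does not matter).
      final remote = await downloadWithEtag(path);
      data = crdtMerge(remote.data, data);
      etag = remote.etag;
    }
  }
}
```
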
### 3. Partial Sync / onRequest with Aggregated Files

**Question**: GroupIndex `onRequest` fetch policy expects to download individual shards. How does this work with aggregated files?

**Assessment**: Two options:
- **No onRequest for aggregated backends**: GDrive always uses prefetch. Aggregated files contain all shards — there's nothing to lazily fetch.
- **Backend decides**: A backend that supports fine-grained access (Solid) uses per-shard layout; a backend that doesn't (GDrive) uses aggregation. The fetch policy is a logical concern; physical layout adapts to what the backend can do.

### 4. Changelog-Based Incremental Sync (GDrive Future)

**Question**: GDrive has a Changes API. Could a future GDrive adapter skip the "download and compare" step?

**Assessment**: Yes. A changelog-based adapter could:
1. Query GDrive Changes API for modified files since last sync.
2. Download only those files.
3. Merge only changed shards within those files.

This is a backend-specific optimization that fits cleanly into the `downloadShards` interface — the orchestrator doesn't need to know how the backend detected changes.

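Roughly, with the Changes API hidden behind an assumed `listChangedFileIds` helper (real Drive paging and token handling omitted), `downloadFile` as an assumed fetch, and reusing the hypothetical unpack step from the Change Detection section:

```dart
/// Illustrative incremental bulk download for a changelog-capable
/// backend: fetch only the physical files the backend reports as
/// changed since the stored sync token.
Future<Map<ShardSpec, ShardData>> downloadChangedShards(
    String sinceToken,
    Map<ShardSpec, String> lastMergedClockHashes) async {
  // 1. Which physical files changed since last sync?
  final changedIds = await listChangedFileIds(sinceToken);

  // 2. Download only those files (in parallel, per 024).
  final files = await Future.wait(changedIds.map(downloadFile));

  // 3. Unpack; per-shard clock hashes still filter unchanged shards.
  final result = <ShardSpec, ShardData>{};
  for (final contents in files) {
    result.addAll(unpackChangedShards(contents, lastMergedClockHashes));
  }
  return result;
}
```
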
### 5. Memory Pressure for Large Aggregated Files

**Question**: Loading all shards of a type into memory for packing/unpacking.

**Assessment**: For Chat Essence (~2015 resources × ~1–2 KB = ~2–4 MB), well within limits. For very large types, the backend could split into multiple physical files (e.g., by shard range). The orchestrator remains unaware of this detail.

---

## Expected Impact

### Performance (Initial Sync, Chat Essence, GDrive Backend)

| Metric | Current | After 024 only | After 024 + 025 |
|---|---|---|---|
| HTTP requests (download) | 263 sequential | 263 parallel | ~10–15 parallel |
| HTTP requests (upload) | 263 sequential | 263 parallel | ~10–15 parallel |
| Per-request latency | ~200ms × 263 | ~200ms (amortized via parallelism) | ~200ms × ~12 |
| DB commits | 124 | 1 | 1 |
| **Estimated total (GDrive)** | **minutes** | **~30–40s** | **~5–10s** |

### Dir Backend: No Additional Impact Beyond 024

For Dir, 024's parallelism and bulk commit already bring the main gains. 025 is not needed for local filesystems.

### Code Complexity

| Aspect | Impact |
|---|---|
| Logical model | **No change** — indices, shards, groups unchanged |
| Orchestrator | **Minimal change** — calls bulk adapter methods instead of per-shard |
| `_ShardSyncAdapter` | **Extended** — bulk methods with default sequential fallback |
| New code | **Per-backend** — each backend implements its aggregation strategy |
| Existing adapters | **Unchanged** — default implementations preserve current behavior |

---

## Relationship to Other Proposals

- **024 (Three-Phase Sync)**: Prerequisite. Provides the phase separation that makes bulk adapter operations possible. 024 alone gives ~2× improvement; 025 on top gives the additional reduction in HTTP round-trips.
- **026 (Recap Sync Direction)**: Defines the strategic direction — one model, fix execution, let backends optimize physical layout.
- **015 (Shard-Level File Consolidation)**: Introduced dataset-mode shards (one TriG file per shard). This proposal extends the idea: backends can aggregate multiple shards into one file.
- **014 (GDrive Sync Performance)**: Identified HTTP latency as bottleneck. Backend-controlled aggregation addresses this by reducing request count.

---

## Next Steps (if approved)

1. **Implement 024 first** — three-phase sync is the foundation.
2. **Add bulk methods to `_ShardSyncAdapter`** — `downloadShards`/`uploadShards` with default sequential fallback.
3. **Update `RemoteSyncOrchestrator`** — call bulk methods from three-phase flow.
4. **Implement GDrive aggregation adapter** — pack/unpack shards into per-type physical files.
5. **Benchmark** — measure actual improvement with Chat Essence on GDrive.
6. **Evaluate changelog optimization** — prototype GDrive Changes API integration.
