# 025 — Backend-Controlled Physical Layout

**Status**: Draft
**Created**: 2026-03-01
**Updated**: 2026-03-16
**Depends on**: 024 (Three-Phase Sync Architecture)
**Context**: After three-phase sync (024) fixes execution order and parallelism, further gains come from letting backends aggregate logical shards into fewer physical files — without changing the logical index/shard model.

## Strategic Context (026)

This proposal complements `026-recap-sync-direction.md` and builds on `024`:

- The logical index/shard model stays unchanged. There is **one** model, not two profiles.
- **024** fixes how shards are processed (parallel I/O, bulk commit).
- **025** fixes how backends map logical shards to physical files.
- The existing `_ShardSyncAdapter` abstraction is the right extension point.

---

## Problem Statement

After 024, all shard downloads and uploads happen in parallel with a single DB commit. But the file count per sync cycle remains high:

### File Count Analysis (Chat Essence)

| Category | Files |
|---|---|
| GroupIndex shards (SyncChatMessage) | 62 |
| GroupIndex shards (SyncMessageGroup) | 62 |
| FullIndex shards (SyncKeyword, etc.) | ~5–10 |
| Index metadata documents | ~128 |
| Infrastructure (ClientInstallation, etc.) | ~5 |
| **Total** | **~263** |

For the Dir backend, parallelized local file I/O makes 263 files manageable. For remote backends (GDrive, Solid, WebDAV), each file means an HTTP request — parallelism helps but can't eliminate per-request latency overhead.

### The Key Insight

The orchestrator works with **logical shards**. How those shards map to physical files is a backend concern. Currently, `_ShardSyncAdapter` has two implementations:

- `FilePerResourceShardSyncAdapter`: 1 document = 1 file (Linked Data mode)
- `FilePerShardShardSyncAdapter`: 1 shard's documents = 1 TriG dataset file

Both still produce one physical I/O operation per logical shard. A backend like GDrive could aggregate multiple logical shards into fewer physical files, dramatically reducing HTTP round-trips.

---

## Proposal: Extend `_ShardSyncAdapter` for Bulk Physical Layout

### Core Idea

The orchestrator continues to think in terms of logical shards. The `_ShardSyncAdapter` interface gains the ability to **aggregate** multiple shards into physical storage units at the backend's discretion.

### How It Works

1. **Orchestrator** (after 024) collects all shard download/upload specs in Phase 1/3.
2. **Backend adapter** receives the full list of shard specs and decides how to map them to physical files.
3. **Download**: Backend reads physical files, splits them back into per-shard data, returns logical shard contents.
4. **Upload**: Backend receives per-shard data, aggregates into its preferred physical layout, writes.

### Interface Evolution

```dart
/// Current interface (simplified):
abstract class _ShardSyncAdapter {
  Future<ShardData> downloadShard(ShardSpec spec);
  Future<void> uploadShard(ShardSpec spec, ShardData data);
}

/// Extended interface — backends can override bulk operations:
abstract class _ShardSyncAdapter {
  /// Download a single shard (default path, used by Dir backend).
  Future<ShardData> downloadShard(ShardSpec spec);

  /// Download multiple shards in bulk.
  /// Default: delegates to downloadShard() individually.
  /// Override: backend can read fewer physical files and split results.
  Future<Map<ShardSpec, ShardData>> downloadShards(List<ShardSpec> specs) async {
    return {for (final spec in specs) spec: await downloadShard(spec)};
  }

  /// Upload a single shard (default path).
  Future<void> uploadShard(ShardSpec spec, ShardData data);

  /// Upload multiple shards in bulk.
  /// Default: delegates to uploadShard() individually.
  /// Override: backend can aggregate into fewer physical files.
  Future<void> uploadShards(Map<ShardSpec, ShardData> shards) async {
    for (final entry in shards.entries) {
      await uploadShard(entry.key, entry.value);
    }
  }
}
```
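
From the orchestrator's side, the three-phase flow (024) then reduces to one bulk call per phase. A minimal sketch, assuming the simplified types above; `collectShardSpecs` and `mergeShard` are hypothetical helpers standing in for the real Phase 1/2 logic:

```dart
/// Sketch only: drives the bulk adapter methods from the three-phase flow.
class RemoteSyncOrchestrator {
  final _ShardSyncAdapter adapter;
  RemoteSyncOrchestrator(this.adapter);

  Future<void> sync() async {
    // Phase 1: one bulk download. The backend decides how many physical
    // reads this becomes (1 per shard for Dir, ~10–15 files for GDrive).
    final specs = collectShardSpecs(); // hypothetical: gather shard specs
    final downloaded = await adapter.downloadShards(specs);

    // Phase 2: local CRDT merge, collecting shards that need re-upload.
    final dirty = <ShardSpec, ShardData>{};
    for (final entry in downloaded.entries) {
      final merged = mergeShard(entry.key, entry.value); // hypothetical merge
      if (merged != null) dirty[entry.key] = merged;
    }

    // Phase 3: one bulk upload; the backend aggregates as it sees fit.
    await adapter.uploadShards(dirty);
  }
}
```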

### Per-Backend Strategies

| Backend | Download Strategy | Upload Strategy | Physical Files |
|---|---|---|---|
| **Dir** | Parallel file reads (024) | Parallel file writes | 1 per shard (unchanged) |
| **GDrive** | Batch API or aggregated files | Batch API or aggregated files | Configurable (1 per type, or fewer aggregated files) |
| **Solid** | Parallel GETs; future bulk endpoints | Parallel PUTs; future bulk endpoints | 1 per shard or per resource (Linked Data) |
| **WebDAV** | Parallel GETs | Parallel PUTs | 1 per shard (unchanged) |

---

## GDrive: Aggregated Shard Files

For GDrive, the adapter could aggregate all shards of a resource type into a single physical TriG dataset file:

### Physical Layout Example (Chat Essence on GDrive)

```
locorda/chat_essence/
├── SyncChatMessage.trig    # All 62 ChatMessage shards in one file
├── SyncMessageGroup.trig   # All 62 MessageGroup shards in one file
├── SyncKeyword.trig        # All Keyword shards in one file
├── ClientInstallation.trig # Infrastructure
└── ...                     # ~10–15 physical files total
```

Each physical file contains multiple Named Graphs (one per logical shard). The shard structure is preserved within the file — the backend simply packs/unpacks shards from the physical container.

```trig
# SyncChatMessage.trig — contains all 62 logical shards

# Shard metadata (which shards are in this file, their clock hashes)
<> a idx:AggregatedShardFile ;
   idx:containsShard <shard/0>, <shard/1>, ... <shard/61> .

<shard/0> {
  # All resources belonging to shard 0
  <chat-message/msg-001#it> a chat:ChatMessage ; ... .
  <chat-message/msg-042#it> a chat:ChatMessage ; ... .
}

<shard/1> {
  # All resources belonging to shard 1
  ...
}
```
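
A rough sketch of what the pack/unpack adapter could look like. Everything backend-specific here is assumed rather than real: `resourceType` on `ShardSpec`, the `readAggregatedFile`/`writeAggregatedFile` transport calls, and the `splitIntoShards`/`packShards` TriG helpers are hypothetical names:

```dart
/// Sketch only: a GDrive adapter that maps many logical shards onto one
/// aggregated TriG file per resource type.
class GDriveAggregatedShardSyncAdapter extends _ShardSyncAdapter {
  @override
  Future<Map<ShardSpec, ShardData>> downloadShards(
      List<ShardSpec> specs) async {
    // Group logical shards by resource type (e.g. SyncChatMessage).
    final byType = <String, List<ShardSpec>>{};
    for (final spec in specs) {
      byType.putIfAbsent(spec.resourceType, () => []).add(spec);
    }
    // One HTTP GET per type instead of one per shard.
    final result = <ShardSpec, ShardData>{};
    await Future.wait(byType.entries.map((entry) async {
      final trig = await readAggregatedFile(entry.key); // hypothetical GET
      // Split the dataset back into per-shard Named Graph contents.
      result.addAll(splitIntoShards(trig, entry.value)); // hypothetical
    }));
    return result;
  }

  @override
  Future<void> uploadShards(Map<ShardSpec, ShardData> shards) async {
    final byType = <String, Map<ShardSpec, ShardData>>{};
    shards.forEach((spec, data) {
      byType.putIfAbsent(spec.resourceType, () => {})[spec] = data;
    });
    await Future.wait(byType.entries.map((entry) async {
      // Read-modify-write: shards not in this upload must be preserved.
      final current = await readAggregatedFile(entry.key);
      final packed = packShards(current, entry.value); // hypothetical
      await writeAggregatedFile(entry.key, packed); // hypothetical PUT
    }));
  }

  // Single-shard operations delegate to the bulk path.
  @override
  Future<ShardData> downloadShard(ShardSpec spec) async =>
      (await downloadShards([spec]))[spec]!;

  @override
  Future<void> uploadShard(ShardSpec spec, ShardData data) =>
      uploadShards({spec: data});
}
```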

### Change Detection

- Per-file ETag/hash detects whether the aggregated file has changed.
- Inside the file, per-shard clock hashes enable skipping unchanged shard data during merge (Phase 2); the skip logic is sketched below.
- The backend downloads the full file but the orchestrator merges only changed shards — same as today.
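
A minimal sketch of that merge-phase skip, assuming the local store keeps the last-merged clock hash per shard; `localClockHashFor`, `clockHash`, and `mergeShard` are hypothetical names:

```dart
/// Sketch only: merge downloaded shard data, skipping shards whose clock
/// hash matches what was last merged locally.
Future<void> mergeChangedShards(Map<ShardSpec, ShardData> downloaded) async {
  for (final entry in downloaded.entries) {
    final stored = await localClockHashFor(entry.key); // hypothetical lookup
    if (entry.value.clockHash == stored) continue; // unchanged: skip merge
    await mergeShard(entry.key, entry.value); // CRDT merge, as today
  }
}
```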

### Result: HTTP Round-Trips

| Scenario | Current (263 files) | After 024 (parallel) | After 024 + 025 (aggregated) |
|---|---|---|---|
| Initial sync downloads | 263 sequential | 263 parallel | ~10–15 parallel |
| Incremental sync downloads | scan all shards | scan all shards (parallel) | download only the changed per-type files |

---

## Dir Backend: Minimal Changes

The Dir backend already performs well with parallel local file I/O (after 024). Aggregation is **not needed** for local filesystems — the overhead per file is ~1ms.

The Dir adapter uses the default `downloadShards`/`uploadShards` implementation (delegating to individual shard operations) with parallelism from 024.

---

## Solid Backend: Linked Data Preservation

For Solid, the adapter continues to use per-resource or per-shard files to preserve Linked Data discoverability. If Solid Community Server adds bulk endpoints (on its roadmap), the adapter can implement `downloadShards`/`uploadShards` using those endpoints without changing the logical model.

---

## Open Questions

### 1. Granularity of Aggregation

**Question**: Should aggregation always be per-type, or should backends choose?

**Assessment**: Leave it to the backend. A simple GDrive backend might start with one file per type. A more sophisticated version might split very large types into size-based chunks. The `_ShardSyncAdapter` interface doesn't prescribe aggregation granularity.

### 2. Concurrent Write Conflicts on Aggregated Files

**Question**: What happens when two installations modify different shards of the same type, so both upload the same aggregated file?

**Assessment**: The same mitigations as today apply (see the sketch below):
- **ETags** detect the conflict.
- **Re-download, re-merge, re-upload** resolves it.
- **CRDT semantics** guarantee eventual convergence regardless.
- **Scale reminder**: 2–20 installations, low conflict probability.
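
A minimal sketch of that retry loop, assuming the backend exposes conditional writes; `fetchWithETag`, `putIfMatch`, `ETagMismatch`, and `mergeIntoAggregate` are hypothetical names for an HTTP GET, a conditional PUT (If-Match), its failure signal, and the CRDT merge into the aggregated file:

```dart
/// Sketch only: conditional upload of an aggregated file with
/// re-download / re-merge / re-upload on an ETag conflict.
Future<void> uploadAggregatedFile(
    String fileName, Map<ShardSpec, ShardData> dirtyShards) async {
  var (etag, remote) = await fetchWithETag(fileName); // hypothetical GET
  while (true) {
    // Merge our dirty shards into the latest remote state (CRDT).
    final merged = mergeIntoAggregate(remote, dirtyShards); // hypothetical
    try {
      await putIfMatch(fileName, merged, etag); // hypothetical If-Match PUT
      return; // no concurrent writer won the race
    } on ETagMismatch {
      // Another installation wrote first: fetch its version and retry.
      (etag, remote) = await fetchWithETag(fileName);
    }
  }
}
```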

### 3. Partial Sync / onRequest with Aggregated Files

**Question**: GroupIndex `onRequest` fetch policy expects to download individual shards. How does this work with aggregated files?

**Assessment**: Two options:
- **No onRequest for aggregated backends**: GDrive always uses prefetch. Aggregated files contain all shards — there's nothing to lazily fetch.
- **Backend decides**: A backend that supports fine-grained access (Solid) uses per-shard layout; a backend that doesn't (GDrive) uses aggregation. The fetch policy is a logical concern; physical layout adapts to what the backend can do (one possible capability flag is sketched below).
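
If the second option is chosen, the adapter could advertise its capability. A sketch, assuming a hypothetical flag name:

```dart
abstract class _ShardSyncAdapter {
  // ... existing members from the interface evolution above ...

  /// Sketch only (hypothetical flag): true for per-shard layouts
  /// (Dir, Solid, WebDAV); false for aggregated layouts (GDrive),
  /// which always prefetch and therefore ignore onRequest.
  bool get supportsPerShardFetch;
}
```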

### 4. Changelog-Based Incremental Sync (GDrive Future)

**Question**: GDrive has a Changes API. Could a future GDrive adapter skip the "download and compare" step?

**Assessment**: Yes. A changelog-based adapter could:
1. Query the GDrive Changes API for files modified since the last sync.
2. Download only those files.
3. Merge only changed shards within those files.

This is a backend-specific optimization that fits cleanly into the `downloadShards` interface — the orchestrator doesn't need to know how the backend detected changes. See the sketch below.
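
A sketch of how that could slot into the bulk download path; `fetchChangedFileIds` (a stand-in for a Changes API query with a saved page token), `readFileById`, and `splitIntoShards` are hypothetical:

```dart
/// Sketch only: changelog-based incremental download. Files not reported
/// as changed are never fetched, so no download-and-compare is needed.
Future<Map<ShardSpec, ShardData>> downloadChangedShards(
    List<ShardSpec> specs, String lastSyncToken) async {
  final changedIds = await fetchChangedFileIds(lastSyncToken); // hypothetical
  final result = <ShardSpec, ShardData>{};
  for (final fileId in changedIds) {
    final trig = await readFileById(fileId); // hypothetical GET by file id
    // Unpack only the logical shards contained in this aggregated file.
    result.addAll(splitIntoShards(trig, specs)); // hypothetical
  }
  return result;
}
```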

### 5. Memory Pressure for Large Aggregated Files

**Question**: Loading all shards of a type into memory for packing/unpacking.

**Assessment**: For Chat Essence (~2015 resources × ~1–2 KB = ~2–4 MB), well within limits. For very large types, the backend could split into multiple physical files (e.g., by shard range, as sketched below). The orchestrator remains unaware of this detail.
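
A tiny sketch of such range-based splitting; the chunk size is an assumed tuning knob:

```dart
import 'dart:math' as math;

/// Sketch only: split one type's shard specs into groups, each of which
/// would be packed into its own physical file.
List<List<ShardSpec>> chunkByShardRange(List<ShardSpec> specs,
    {int maxShardsPerFile = 32}) {
  final chunks = <List<ShardSpec>>[];
  for (var i = 0; i < specs.length; i += maxShardsPerFile) {
    chunks.add(specs.sublist(i, math.min(i + maxShardsPerFile, specs.length)));
  }
  return chunks;
}
```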

---

## Expected Impact

### Performance (Initial Sync, Chat Essence, GDrive Backend)

| Metric | Current | After 024 only | After 024 + 025 |
|---|---|---|---|
| HTTP requests (download) | 263 sequential | 263 parallel | ~10–15 parallel |
| HTTP requests (upload) | 263 sequential | 263 parallel | ~10–15 parallel |
| Per-request latency | ~200ms × 263 | ~200ms (amortized via parallelism) | ~200ms × ~12 |
| DB commits | 124 | 1 | 1 |
| **Estimated total (GDrive)** | **minutes** | **~30–40s** | **~5–10s** |

### Dir Backend: No Additional Impact Beyond 024

For Dir, 024's parallelism and bulk commit already bring the main gains. 025 is not needed for local filesystems.

### Code Complexity

| Aspect | Impact |
|---|---|
| Logical model | **No change** — indices, shards, groups unchanged |
| Orchestrator | **Minimal change** — calls bulk adapter methods instead of per-shard |
| `_ShardSyncAdapter` | **Extended** — bulk methods with default sequential fallback |
| New code | **Per-backend** — each backend implements its aggregation strategy |
| Existing adapters | **Unchanged** — default implementations preserve current behavior |

---

## Relationship to Other Proposals

- **024 (Three-Phase Sync)**: Prerequisite. Provides the phase separation that makes bulk adapter operations possible. 024 alone gives ~2× improvement; 025 on top gives the additional reduction in HTTP round-trips.
- **026 (Recap Sync Direction)**: Defines the strategic direction — one model, fix execution, let backends optimize physical layout.
- **015 (Shard-Level File Consolidation)**: Introduced dataset-mode shards (one TriG file per shard). This proposal extends the idea: backends can aggregate multiple shards into one file.
- **014 (GDrive Sync Performance)**: Identified HTTP latency as the bottleneck. Backend-controlled aggregation addresses this by reducing request count.

---

## Next Steps (if approved)

1. **Implement 024 first** — three-phase sync is the foundation.
2. **Add bulk methods to `_ShardSyncAdapter`** — `downloadShards`/`uploadShards` with default sequential fallback.
3. **Update `RemoteSyncOrchestrator`** — call bulk methods from the three-phase flow.
4. **Implement GDrive aggregation adapter** — pack/unpack shards into per-type physical files.
5. **Benchmark** — measure actual improvement with Chat Essence on GDrive.
6. **Evaluate changelog optimization** — prototype GDrive Changes API integration.