Add technology selection and CAP theorem analysis for cloud storage, distributed file system, and photo sharing
- Introduced comprehensive sections on technology selection for cloud storage, detailing choices for object storage backends, metadata stores, sync protocols, and conflict resolution strategies.
- Included CAP theorem analyses for cloud storage, distributed file systems, and photo sharing, clarifying consistency and availability trade-offs for various components.
- Enhanced documentation to provide a thorough understanding of architectural decisions and their implications for system performance and user experience.
`software_system_design/cloud_storage.md`: 168 additions, 0 deletions
Representative **REST-style** surface (names illustrative).
{: .tip }
> Treat **block uploads** as idempotent by **hash**. Treat **file commits** as idempotent with a **client mutation id** to avoid duplicate revisions on retries.
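A minimal sketch of that idempotency split, with invented names (`put_block`, `commit`) and in-memory state standing in for the real chunk store and metadata DB:

```python
import hashlib

class CommitEndpoint:
    """Toy server state; a real system backs this with the metadata DB."""
    def __init__(self):
        self.blobs: dict[str, bytes] = {}         # content hash -> bytes
        self.head_revision = 0
        self.seen_mutations: dict[str, int] = {}  # mutation id -> revision minted

    def put_block(self, data: bytes) -> str:
        # Idempotent by construction: the key IS the hash, so a retried
        # upload of the same bytes lands on the same key.
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data
        return digest

    def commit(self, client_mutation_id: str, manifest: list[str]) -> int:
        # A retried commit with the same mutation id returns the revision
        # it already minted instead of creating a duplicate one.
        if client_mutation_id in self.seen_mutations:
            return self.seen_mutations[client_mutation_id]
        self.head_revision += 1
        self.seen_mutations[client_mutation_id] = self.head_revision
        return self.head_revision

server = CommitEndpoint()
h = server.put_block(b"chunk bytes")
assert server.commit("mut-1", [h]) == server.commit("mut-1", [h])  # retry is safe
```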
### Technology Selection & Tradeoffs

A cloud storage stack splits into three planes: **immutable bytes** (object/block layer), **authoritative metadata** (tree, versions, ACLs), and **sync semantics** (how clients converge). Interviewers expect you to name realistic building blocks and justify trade-offs—not to pick one vendor dogmatically.
#### Object storage backend
| Option | Pros | Cons | When it wins |
|--------|------|------|--------------|
| **Custom chunk store** (hash-partitioned volumes, EC, scrubbers) | Full control over placement, cost, on-prem; can co-design with dedup/GC | Years of engineering; you own reliability, upgrades, and incident response | Hyperscalers or regulated environments building a proprietary storage plane |
| **Ceph / MinIO (S3-compatible cluster)** | Mature replication/EC; established ops patterns; self-hostable | Operating at exabyte scale is non-trivial; feature gaps vs public cloud (multi-region, compliance SKUs) | Private cloud, hybrid, or teams that want the S3 API without AWS |
| **Managed object storage (e.g., S3, GCS, Azure Blob)** | Durability/availability SLAs, global replication, lifecycle tiers, compliance certs | Cost at scale; less control over internals; egress/operation pricing | Fastest path for most products; the default in interviews unless "on-prem" is stated |
**Why it matters:** Metadata references chunks by **hash**; the object layer only needs **PUT-by-hash**, **GET-by-hash**, **lifecycle/GC hooks**, and **strong durability**. The hard part is **reference counting** and **async deletion** in your metadata plane—not the raw blob PUT.
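To make the reference-counting point concrete, a toy sketch (names and structure are assumptions, not a real product's API) of metadata-plane refcounts gating deletion in the object layer:

```python
import hashlib

class ChunkStore:
    """Stand-in for the object layer: only put-by-hash and get-by-hash."""
    def __init__(self):
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs[key] = data
        return key

class RefCounter:
    """Metadata-plane bookkeeping: a chunk is deletable only at zero references."""
    def __init__(self, store: ChunkStore):
        self.store = store
        self.refs: dict[str, int] = {}
        self.delete_queue: list[str] = []  # drained asynchronously by a GC worker

    def link(self, key: str) -> None:      # a manifest starts referencing the chunk
        self.refs[key] = self.refs.get(key, 0) + 1

    def unlink(self, key: str) -> None:    # a revision is pruned
        self.refs[key] -= 1
        if self.refs[key] == 0:
            # Enqueue rather than delete inline: a grace period protects
            # against races with in-flight commits reusing the same hash.
            self.delete_queue.append(key)

store = ChunkStore()
gc = RefCounter(store)
key = store.put(b"deduplicated bytes")
gc.link(key); gc.link(key)  # two files share the chunk
gc.unlink(key)              # still referenced: nothing queued
gc.unlink(key)              # last reference gone: queued for async deletion
```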
#### Metadata store

| Option | Pros | Cons | When it wins |
|--------|------|------|--------------|
| **Relational SQL** (PostgreSQL; Spanner/Cockroach-class when sharded) | Transactions for rename/move; secondary indexes for list and search | Needs a sharding story at billions of rows; hot directories require care | Default choice for the authoritative file catalog |
| **etcd** (or similar consistent KV) | Strong consistency, watches for coordination | Poor fit for large trees, heavy listing queries, and billions of rows—designed for **small** critical state | **Locks, quotas, rate-limit counters**, not the full file catalog |
| **Custom B-tree / LSM on disk** | Ultimate performance/cost tuning | You reimplement SQL, migrations, backup—rarely justified | Extreme embedded or legacy systems; not a typical greenfield choice |

**Why it matters:** File **names and hierarchy** need **transactional invariants**; search/list workloads need **secondary indexes**. A relational model maps cleanly; pure KV is usually paired with **another** system for graph/path queries unless you accept heavy client-side logic.
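A toy illustration of the transactional-invariant point, with SQLite standing in for the relational catalog (the schema is mine, not the document's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER,
        name TEXT NOT NULL,
        UNIQUE (parent_id, name)  -- invariant: no two siblings share a name
    )
""")
conn.execute("INSERT INTO nodes (id, parent_id, name) VALUES (1, NULL, 'docs')")
conn.execute("INSERT INTO nodes (id, parent_id, name) VALUES (2, 1, 'a.txt')")
conn.execute("INSERT INTO nodes (id, parent_id, name) VALUES (3, 1, 'b.txt')")
conn.commit()

# A rename is one transaction: either the whole change lands or none of it,
# and the UNIQUE constraint rejects a sibling collision atomically.
try:
    with conn:
        conn.execute("UPDATE nodes SET name = 'b.txt' WHERE id = 2")
except sqlite3.IntegrityError:
    print("rename rejected: sibling name collision")
```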
#### Sync protocol

| Approach | Pros | Cons | When it wins |
|----------|------|------|--------------|
| **CDC chunking + content-addressed blocks** (variable-size chunks, manifest commit) | Edits localize to a few changed chunks; natural fit for dedup and resumable upload | Client-side chunking logic; manifest bookkeeping | Default for general-purpose drives at scale |
| **Rsync-like delta sync** (rolling hash, send missing segments) | Minimizes bytes on **repeated similar** files; great for low uplink | Complex client; server may still be chunk/manifest based—align with CDC policy | Bandwidth-sensitive clients; backup tools; complement to chunk stores |
| **Full file replace** | Simple mental model | Wastes bandwidth on large files; fights user expectations for "sync" | Small files only; rarely the main strategy at scale |

**Why it matters:** Interviews reward **CDC + content-addressed blocks** because edits localize to a few chunks; "upload the whole file every time" fails the efficiency bar unless the scope is explicitly tiny files (see the chunking sketch below).
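A compact content-defined chunking sketch (Gear-style rolling hash; the table, mask, and size bounds are illustrative parameters, not a prescribed spec) showing why a local edit only reshuffles nearby chunk boundaries:

```python
import hashlib

# Gear-style table: 256 deterministic pseudo-random 64-bit values.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
        for i in range(256)]
MASK = (1 << 13) - 1          # cut when low 13 bits are zero: ~8 KiB average chunk
MIN_SIZE, MAX_SIZE = 2048, 65536

def cdc_chunks(data: bytes):
    """Yield content-defined chunks: boundaries depend only on local bytes,
    so an insert early in the file does not shift every later boundary."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]

blob = bytes(range(256)) * 400  # ~100 KiB of sample data
manifest = [hashlib.sha256(c).hexdigest() for c in cdc_chunks(blob)]
```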
#### Conflict resolution

| Strategy | Pros | Cons | When it wins |
|----------|------|------|--------------|
| **Last-writer-wins (LWW)** | Simple; single head revision | Silent overwrite—bad for shared folders and offline edits | Low-stakes caches; **not** ideal as the only story for collaboration |
| **Version vectors / DAG** | Captures **causality**; enables merge tools and audit | UX complexity; still needs a policy for binaries | Advanced sync; technical users; foundation for **branch + merge** flows |
| **User manual merge / conflict copies** | Safe for **binary** files; clear accountability | Noisy folders; user burden | **Default** for generic cloud drives on binary files |
**Our choice (interview narrative):**

- **Bytes:** Managed **S3-compatible object storage** (or Ceph/MinIO if hybrid/on-prem) for durability and operational leverage; **content-addressed** chunks with **erasure coding** behind the API.
- **Metadata:** **PostgreSQL** (or sharded Spanner/Cockroach-class SQL) for the transactional tree + ACLs + revision history; **etcd** only for **coordination** (locks, leases), not the main catalog.
- **Sync:** **CDC chunking + block upload + manifest commit** with a **cursor-based change feed** and **push** notifications; an optional **rsync-style** second pass only for niche bandwidth savings, not as the primary store of truth.
- **Conflicts:** **Optimistic concurrency** on commit (`etag`/base revision); for binaries, **conflict copies** or **explicit user resolution**; reserve **LWW** for clearly defined single-writer resources (see the sketch below).

**Rationale:** Optimize for **deduplicated storage**, a **clear consistency story on metadata**, and **honest conflict UX**—without building a custom object store unless the prompt demands it.
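A sketch of the commit-path behavior those bullets describe: compare-and-swap on the base revision, with a conflict copy as the binary-safe fallback (all names invented):

```python
import itertools

class FileHead:
    """Server-side head pointer for one file."""
    def __init__(self):
        self.revision = 0                     # current head (the etag)
        self.manifest: list[str] = []
        self._conflict_seq = itertools.count(1)

    def commit(self, base_revision: int, manifest: list[str],
               mergeable: bool = False):
        if base_revision == self.revision:    # CAS succeeds: advance head
            self.revision += 1
            self.manifest = manifest
            return ("ok", self.revision)
        if mergeable:                         # e.g., text with a merge tool
            return ("rebase_required", self.revision)
        # Binary default: never silently overwrite; park the loser visibly.
        return ("conflict_copy", f"conflict copy {next(self._conflict_seq)}")

head = FileHead()
print(head.commit(0, ["hash-a"]))  # ('ok', 1)
print(head.commit(0, ["hash-b"]))  # stale base -> ('conflict_copy', 'conflict copy 1')
```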
### CAP Theorem Analysis

**CAP** (Consistency, Availability, Partition tolerance) says that under a **network partition**, you cannot have both **linearizable consistency** and **full read/write availability** for the same data plane. Real products **partition responsibilities**: different subsystems pick different points on the spectrum.
For **cloud storage**:

- **File reads (content)** should stay **highly available**: clients can often read **replicated** chunks; temporary metadata staleness may block "latest" path resolution, but bytes addressed by a **known hash** remain readable (**AP**-leaning for immutable blobs).
- **Sync conflicts** need **careful consistency** on **metadata** (which revision is head, who is allowed to commit)—typically **CP**-leaning for the **commit path** (reject or branch on conflict), while **notifications** and **change feeds** are **eventually consistent** with bounded lag.
| Subsystem | Typical CAP stance | Interview phrasing |
|-----------|--------------------|--------------------|
| **File metadata** (path, size, head revision) | **CP** on commit: transactional updates, version checks | “We serialize commits per file or use compare-and-swap on the revision.” |
| **File content** (immutable chunks by hash) | **AP** for reads: multiple replicas; **eventual** visibility of new hashes after commit | “Chunks are immutable; once committed, reads don’t need quorum metadata.” |
| **Sync state** (per-device cursor, local vs server revision) | **AP** with **eventual** convergence; **repair** via change feed | “Devices are sources of truth for *pending work*; the server is the source of truth for *committed* state.” |
| **Sharing permissions** | **CP** when enforcing on sensitive operations; cached reads may be **eventually** fresh | “Writes to ACLs go through the authoritative store; reads may use cached policy with a short TTL.” |
```mermaid
flowchart TB
    subgraph cp [CP-leaning paths]
        M[Metadata commit\nfile tree + revision]
        A[ACL grant/revoke]
    end

    subgraph ap [AP-leaning paths]
        B[Block read by hash\nreplicated objects]
        N[Notify / change feed delivery]
    end

    subgraph part [Partition]
        P[Network split\nclients cannot reach all replicas]
    end

    P -->|choose| M
    P -->|choose| B
    M -->|may fail closed\nreject ambiguous commits| X[Consistent tree]
    B -->|still serve\nknown hashes| Y[Available reads]
```
{: .note }
> **PACELC** extension: even **without** a partition, you trade **latency (L)** vs **consistency (C)**—e.g., reading an ACL from a cache is faster but may be briefly stale. Mentioning PACELC signals seniority.
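A small sketch of that latency-vs-consistency knob: a TTL-bounded ACL cache in front of an authoritative store (the dictionaries here are placeholders for the real CP store, not an actual API):

```python
import time

AUTHORITATIVE_ACL = {"folder-1": {"alice", "bob"}}  # stand-in for the CP store
_cache: dict[str, tuple[float, set[str]]] = {}
TTL_SECONDS = 5.0  # the PACELC knob: larger TTL = lower latency, more staleness

def can_read(user: str, folder: str) -> bool:
    now = time.monotonic()
    hit = _cache.get(folder)
    if hit and now - hit[0] < TTL_SECONDS:
        members = hit[1]                       # fast path: stale by at most TTL
    else:
        members = AUTHORITATIVE_ACL[folder]    # slow path: authoritative read
        _cache[folder] = (now, set(members))
    return user in members
```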
### SLA and SLO Definitions

**SLA** = contract with customers (credits, legal). **SLO** = internal target; **SLI** = what you measure. Below are illustrative **SLOs** for a consumer/enterprise-grade drive; tune the numbers to the prompt.

| SLI | Definition | Target (SLO) | Window |
|-----|------------|--------------|--------|
| **Upload latency** | Time from **last byte** of chunk received to **ack** (or from **commit request** to **success**) | P99 < 500 ms for **commit**; chunk PUT P99 < 1 s for a **≤ 32 MB** chunk under normal load | 30-day rolling |
| **Download latency** | Time to **first byte** (TTFB) for a signed URL or gateway stream | P99 TTFB < 300 ms **same region**; higher cross-region (disclose it) | 30-day rolling |
| **Sync latency** | Wall-clock from **server commit** to **change visible** on a subscribed client (feed or push) | P95 < 10 s; P99 < 60 s (mobile/long-poll may widen the tail) | 30-day rolling |
| **Data durability** | Probability of **permanent loss** of a committed user object | **11 nines** for committed bytes after **ack** (provider-style claim; explain replication + EC + repair) | Annual / incident-based review |
| **Availability** | Successful **metadata** reads/writes and **auth** checks vs all attempts | **99.9%** monthly (consumer); **99.95%+** (business); exclude customer-caused throttling if defined | Monthly |
| **Conflict resolution accuracy** | Share of conflicts **correctly classified** (no silent wrong winner) vs detected conflicts | **99.99%** **detection** rate for concurrent commits to the same base revision; **0** silent LWW on shared folders if policy forbids it | Per release + sampled audits |
**Error budget policy** (illustrative):

| Policy | Action |
|--------|--------|
| **Burn alerts** | Page on **fast burn** (budget exhausted in days); ticket on **slow burn** (see the sketch below) |
| **Trade-offs** | If the sync latency SLO slips, **throttle** non-critical features (preview generation) before touching **durability** paths |
| **Freeze** | If the budget is exhausted, **freeze launches** until reliability work ships |
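A sketch of the fast-burn/slow-burn distinction referenced in the table; the 14.4x threshold is a common multi-window alerting convention, used here as an illustrative assumption:

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """How fast the error budget is burning: 1.0 = exactly on budget."""
    if total == 0:
        return 0.0
    error_rate = failed / total
    budget = 1.0 - slo  # e.g., a 99.9% SLO leaves a 0.1% error budget
    return error_rate / budget

FAST_BURN_THRESHOLD = 14.4  # page: budget gone in ~2 days at this rate
SLOW_BURN_THRESHOLD = 1.0   # ticket: budget gone by end of window

rate = burn_rate(failed=42, total=10_000, slo=0.999)
if rate >= FAST_BURN_THRESHOLD:
    print(f"PAGE: burning {rate:.1f}x budget")
elif rate >= SLOW_BURN_THRESHOLD:
    print(f"TICKET: burning {rate:.1f}x budget")
```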
{: .warning }
> **Never** conflate **durability** (bits not lost) with **availability** (API up). You can be **available** and **wrong** if you serve stale metadata—separate the SLIs.
### Database Schema

Logical schema (illustrative, SQL-oriented). Adjust types (`UUID`, `BIGINT`), indexing, and soft-delete behavior to your scale story.

**`files`** — one row per **logical file** (node); `path` may be **materialized** for performance or **derived** from a closure table—not both without a clear source of truth.

**`sync_state`** — one row per (user, device), tracking what the client has applied:

| Column | Type | Notes |
|--------|------|-------|
| `local_version` | BIGINT | Last **known applied** server revision on the device |
| `synced_at` | TIMESTAMPTZ | Last successful reconcile |

{: .tip }
> At scale, **`sync_state`** is often **sharded** with the user or stored **client-side** (SQLite) with server **cursors**—the table above is the **server-side** model for when you track enterprise devices centrally.
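A client-side sketch of how the device cursor drives reconcile against the change feed (function and field names assumed; `sync_state` mirrors the columns above):

```python
import datetime

def apply_locally(change: dict) -> None:
    # Stub: a real client writes the file bytes / metadata to local storage.
    print("applying", change["revision"], change["path"])

def reconcile(server_changes: list[dict], sync_state: dict) -> None:
    """Apply the change feed in revision order, advancing the device cursor.
    Replays are harmless because already-applied revisions are skipped."""
    for change in sorted(server_changes, key=lambda c: c["revision"]):
        if change["revision"] <= sync_state["local_version"]:
            continue  # already applied on this device
        apply_locally(change)
        sync_state["local_version"] = change["revision"]
    sync_state["synced_at"] = datetime.datetime.now(datetime.timezone.utc)

state = {"local_version": 2, "synced_at": None}
feed = [{"revision": 2, "path": "/a"}, {"revision": 3, "path": "/b"}]
reconcile(feed, state)  # applies only revision 3, then advances the cursor
```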