
Commit a74bfc7

Add technology selection and CAP theorem analysis for cloud storage, distributed file system, and photo sharing
- Introduced comprehensive sections on technology selection for cloud storage, detailing choices for object storage backends, metadata stores, sync protocols, and conflict resolution strategies.
- Included CAP theorem analyses for cloud storage, distributed file systems, and photo sharing, clarifying consistency and availability trade-offs for various components.
- Enhanced documentation to provide a thorough understanding of architectural decisions and their implications for system performance and user experience.
1 parent fae84fa commit a74bfc7

3 files changed

Lines changed: 400 additions & 4 deletions


software_system_design/cloud_storage.md

Lines changed: 168 additions & 0 deletions
@@ -116,6 +116,174 @@ Representative **REST-style** surface (names illustrative). Many products also u
{: .tip }
> Treat **block uploads** as idempotent by **hash**. Treat **file commits** as idempotent with a **client mutation id** to avoid duplicate revisions on retries.

### Technology Selection & Tradeoffs

A cloud storage stack splits into three planes: **immutable bytes** (object/block layer), **authoritative metadata** (tree, versions, ACLs), and **sync semantics** (how clients converge). Interviewers expect you to name realistic building blocks and justify trade-offs—not to pick one vendor dogmatically.

#### Object storage backend

| Option | Pros | Cons | When it wins |
|--------|------|------|--------------|
| **Custom chunk store** (hash-partitioned volumes, EC, scrubbers) | Full control over placement, cost, on-prem; can co-design with dedup/GC | Years of engineering; you own reliability, upgrades, and incident response | Hyperscalers or regulated environments building a proprietary storage plane |
| **Ceph / MinIO (S3-compatible cluster)** | Mature replication/EC; ops patterns exist; self-hostable | Operability at exabyte scale is non-trivial; feature gaps vs public cloud (multi-region, compliance SKUs) | Private cloud, hybrid, or teams that want the S3 API without AWS |
| **Managed cloud object storage (e.g., S3, GCS, Azure Blob)** | Durability/availability SLAs, global replication, lifecycle tiers, compliance certs | Cost at scale; less control over internals; egress/operation pricing | Fastest path for most products; default in interviews unless “on-prem” is stated |

**Why it matters:** Metadata references chunks by **hash**; the object layer only needs **PUT-by-hash**, **GET-by-hash**, **lifecycle/GC hooks**, and **strong durability**. The hard part is **reference counting** and **async deletion** in your metadata plane—not the raw blob PUT.

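A minimal sketch of that contract, assuming SHA-256 content addressing and an in-memory dict standing in for the object layer (`ChunkStore` and its method names are illustrative, not a product API):

```python
import hashlib

class ChunkStore:
    """Toy content-addressed store: keys are the SHA-256 of the chunk bytes."""

    def __init__(self):
        self._blobs = {}      # hash -> bytes (stand-in for the object layer)
        self._refcounts = {}  # hash -> number of committed manifests referencing it

    def put_by_hash(self, data: bytes) -> str:
        """Idempotent PUT: re-uploading identical bytes is a no-op (dedup for free)."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        return digest

    def get_by_hash(self, digest: str) -> bytes:
        return self._blobs[digest]

    def add_reference(self, digest: str) -> None:
        """Called by the metadata plane when a committed manifest points at this chunk."""
        self._refcounts[digest] = self._refcounts.get(digest, 0) + 1

    def release_reference(self, digest: str) -> None:
        """Called when a revision is garbage-collected; actual deletion is deferred."""
        self._refcounts[digest] = self._refcounts.get(digest, 0) - 1

    def sweep(self) -> None:
        """Async GC pass: drop blobs that no committed manifest references anymore."""
        for digest in [d for d, count in self._refcounts.items() if count <= 0]:
            self._blobs.pop(digest, None)
            self._refcounts.pop(digest, None)

store = ChunkStore()
first = store.put_by_hash(b"chunk bytes")
second = store.put_by_hash(b"chunk bytes")   # same content, same hash, stored once
assert first == second
store.add_reference(first)
```

The metadata plane, not the blob layer, decides when `release_reference` and `sweep` run—which is exactly why GC, not the PUT, is the hard part.
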
#### Metadata store

| Option | Pros | Cons | When it wins |
|--------|------|------|--------------|
| **PostgreSQL** (sharded) | ACID, constraints (`UNIQUE (parent_id, name)`), rich queries, mature tooling | Cross-shard transactions painful; need careful shard key (`owner_id` / `namespace_id`) | **Default interview answer** for tree + ACLs + transactional commits |
| **etcd** (or similar consistent KV) | Strong consistency, watches for coordination | Poor fit for large trees, heavy listing queries, and billions of rows—designed for **small** critical state | **Locks, quotas, rate-limit counters**, not the full file catalog |
| **Custom B-tree / LSM on disk** | Ultimate performance/cost tuning | You reimplement SQL, migrations, backup—rarely justified | Extreme embedded or legacy systems; not a typical greenfield choice |

**Why it matters:** File **names and hierarchy** need **transactional invariants**; search/list workloads need **secondary indexes**. A relational model maps cleanly; pure KV is usually paired with **another** system for graph/path queries unless you accept heavy client-side logic.

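A minimal sketch of the sibling-uniqueness invariant, using Python's built-in `sqlite3` as a stand-in for the sharded SQL catalog (the table mirrors the `files` schema later in this doc; `create_file` is an illustrative helper):

```python
import sqlite3

# SQLite stands in for the sharded Postgres catalog; the constraint is the point.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    id            INTEGER PRIMARY KEY,
    parent_id     INTEGER REFERENCES files(id),
    name          TEXT NOT NULL,
    head_revision INTEGER NOT NULL DEFAULT 0,
    UNIQUE (parent_id, name)   -- sibling names must be unique
);
""")

def create_file(parent_id, name):
    """Insert a node; the UNIQUE constraint rejects a duplicate sibling atomically."""
    try:
        with conn:  # one transaction per commit
            cur = conn.execute(
                "INSERT INTO files (parent_id, name) VALUES (?, ?)",
                (parent_id, name),
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        raise FileExistsError(f"{name!r} already exists under parent {parent_id}")

root = create_file(None, "root")
create_file(root, "report.pdf")
try:
    create_file(root, "report.pdf")     # a creation race collapses to an error
except FileExistsError as err:
    print("conflict:", err)
```

The same pattern extends to renames and moves: one transaction updates `parent_id`/`name` and bumps `head_revision`, and the constraint still holds.
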
#### Sync protocol

| Approach | Pros | Cons | When it wins |
|----------|------|------|--------------|
| **Rsync-like delta sync** (rolling hash, send missing segments) | Minimizes bytes on **repeated similar** files; great for low uplink | Complex client; server may still be chunk/manifest based—align with CDC policy | Bandwidth-sensitive clients; backup tools; complement to chunk stores |
| **Full file replace** | Simple mental model | Wastes bandwidth on large files; fights user expectations for “sync” | Small files only; rarely the main strategy at scale |
| **Block-level + content-defined chunking + manifest commit** | Stable dedup, resumable uploads, idempotent **PUT(hash)** | Requires manifest versioning and GC | **Industry-typical** for Drive/Dropbox-class products |

**Why it matters:** Interviews reward **CDC + content-addressed blocks** because edits localize to a few chunks; “upload whole file every time” fails the efficiency bar unless scope is explicitly tiny files.

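A toy content-defined chunker, assuming a simple windowed polynomial rolling hash and made-up size/mask parameters; real systems use Rabin fingerprints or FastCDC, but the resync property that localizes edits is the same:

```python
import hashlib, os

WINDOW = 48                      # bytes of context that decide a cut point
PRIME, MOD = 31, 1 << 32
POW_W = pow(PRIME, WINDOW, MOD)
MASK = (1 << 13) - 1             # boundary roughly every 8 KiB on random data
MIN_CHUNK, MAX_CHUNK = 2_048, 65_536

def cdc_chunks(data: bytes):
    """Cut wherever a windowed rolling hash matches the mask (content-defined)."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = (rolling * PRIME + byte) % MOD
        if i >= WINDOW:
            rolling = (rolling - data[i - WINDOW] * POW_W) % MOD  # slide the window
        size = i - start + 1
        if (size >= MIN_CHUNK and (rolling & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def manifest(data: bytes):
    """Ordered chunk hashes; a local edit only changes hashes near the edit."""
    return [hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)]

blob = os.urandom(256 * 1024)
print(len(manifest(blob)), "chunks")
```

Inserting a few bytes in the middle of `blob` typically changes only the chunk hashes around the edit; later boundaries resynchronize, which is why edits upload only a few blocks.
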
#### Conflict resolution

| Strategy | Pros | Cons | When it wins |
|----------|------|------|--------------|
| **Last-writer-wins (LWW)** | Simple; single head revision | Silent overwrite—bad for shared folders and offline | Low-stakes caches; **not** ideal as the only story for collaboration |
| **Version vectors / DAG** | Captures **causality**; enables merge tools and audit | UX complexity; still need policy for binaries | Advanced sync; technical users; foundation for **branch + merge** flows |
| **User manual merge / conflict copies** | Safe for **binary**; clear accountability | Noisy folders; user burden | **Default** for generic cloud drives on binary files |

**Our choice (interview narrative):**

- **Bytes:** Managed **S3-compatible object storage** (or Ceph/MinIO if hybrid/on-prem) for durability and operational leverage; **content-addressed** chunks with **erasure coding** behind the API.
- **Metadata:** **PostgreSQL** (or sharded Spanner/Cockroach-class SQL) for transactional tree + ACL + revision history; **etcd** only for **coordination** (locks, leases), not the main catalog.
- **Sync:** **CDC chunking + block upload + manifest commit** with **cursor-based change feed** and **push** notifications; optional **rsync-style** second pass only for niche bandwidth savings—not as the primary store of truth.
- **Conflicts:** **Optimistic concurrency** on commit (`etag`/base revision); for binaries, **conflict copies** or **explicit user resolution**; reserve **LWW** for clearly defined single-writer resources.

**Rationale:** Optimize for **deduplicated storage**, **clear consistency story on metadata**, and **honest conflict UX**—without building a custom object store unless the prompt demands it.

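A condensed sketch of that commit path, assuming an optimistic compare-and-swap on the base revision with a conflicted-copy fallback (`MetadataService` and the naming scheme are illustrative, not a documented API):

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    head_revision: int = 0
    manifest_hash: str = ""

class ConflictError(Exception):
    pass

class MetadataService:
    """Toy commit path: accept a commit only if the client built on the current head."""

    def __init__(self):
        self.files = {}   # file_id -> FileRecord

    def commit(self, file_id, base_revision, new_manifest_hash):
        record = self.files.setdefault(file_id, FileRecord())
        if base_revision != record.head_revision:
            # Someone else committed since this client synced: fail closed (CP on commit).
            raise ConflictError(f"base {base_revision} != head {record.head_revision}")
        record.head_revision += 1
        record.manifest_hash = new_manifest_hash
        return record.head_revision

def client_commit(svc, file_id, base_revision, manifest_hash):
    """Client policy for binaries: keep both versions instead of silent last-writer-wins."""
    try:
        return svc.commit(file_id, base_revision, manifest_hash)
    except ConflictError:
        return svc.commit(f"{file_id} (conflicted copy)", 0, manifest_hash)

svc = MetadataService()
client_commit(svc, "doc.bin", 0, "hashA")   # first writer wins, head becomes 1
client_commit(svc, "doc.bin", 0, "hashB")   # stale base revision lands as a conflicted copy
```
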
### CAP Theorem Analysis

**CAP** (Consistency, Availability, Partition tolerance) says that under a **network partition**, you cannot have both **linearizable consistency** and **full read/write availability** for the same data plane. Real products **partition responsibilities**: different subsystems pick different points on the spectrum.

For **cloud storage**:

- **File reads (content)** should stay **highly available**: clients can often read **replicated** chunks; temporary metadata staleness may block “latest” path resolution, but bytes addressed by **known hash** remain readable (**AP**-leaning for immutable blobs).
- **Sync conflicts** need **careful consistency** on **metadata** (which revision is head, who is allowed to commit)—typically **CP**-leaning for the **commit path** (reject or branch on conflict), while **notifications** and **change feeds** are **eventually consistent** with bounded lag.

| Subsystem | Typical CAP stance | Interview phrasing |
|-----------|--------------------|--------------------|
| **File metadata** (path, size, head revision) | **CP** on commit: transactional updates, version checks | “We serialize commits per file or use compare-and-swap on revision.” |
| **File content** (immutable chunks by hash) | **AP** for read: multiple replicas; **eventual** visibility of new hashes after commit | “Chunks are immutable; once committed, reads don’t need quorum metadata.” |
| **Sync state** (per-device cursor, local vs server revision) | **AP** with **eventual** convergence; **repair** via change feed | “Devices are sources of truth for *pending work*; server is source of truth for *committed* state.” |
| **Sharing permissions** | **CP** when enforcing on sensitive operations; cached reads may be **eventually** fresh | “Writes to ACLs go through authoritative store; reads may use cached policy with short TTL.” |

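One way to express the AP-leaning content read path, assuming hypothetical `metadata.latest_manifest` and `replica.get_by_hash` interfaces plus a client-side manifest cache:

```python
class PartitionError(Exception):
    pass

def read_file(metadata, replicas, cached_manifest, file_id):
    """AP-leaning read: if metadata is unreachable, fall back to the last manifest
    this client saw; chunk hashes are immutable, so any replica that answers is correct."""
    try:
        chunk_hashes = metadata.latest_manifest(file_id)   # may require quorum / be down
    except PartitionError:
        chunk_hashes = cached_manifest                     # possibly stale, never wrong bytes
    blocks = []
    for chunk_hash in chunk_hashes:
        for replica in replicas:                           # try replicas until one answers
            try:
                blocks.append(replica.get_by_hash(chunk_hash))
                break
            except PartitionError:
                continue
        else:
            raise PartitionError(f"no reachable replica holds {chunk_hash}")
    return b"".join(blocks)
```
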
```mermaid
flowchart TB
    subgraph cp [CP-leaning paths]
        M["Metadata commit<br/>file tree + revision"]
        A["ACL grant/revoke"]
    end

    subgraph ap [AP-leaning paths]
        B["Block read by hash<br/>replicated objects"]
        N["Notify / change feed delivery"]
    end

    subgraph part [Partition]
        P["Network split<br/>clients cannot reach all replicas"]
    end

    P -->|choose| M
    P -->|choose| B
    M -->|"may fail closed<br/>reject ambiguous commits"| X["Consistent tree"]
    B -->|"still serve<br/>known hashes"| Y["Available reads"]
```

{: .note }
> **PACELC** extension: even **without** a partition, you trade **latency (L)** vs **consistency (C)**—e.g., reading ACL from a cache is faster but may be briefly stale. Mentioning PACELC signals seniority.

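A sketch of the "cached policy with short TTL" idea from the table above, where `authoritative_lookup` stands in for a call into the CP ACL store and the 5-second TTL is an arbitrary illustration:

```python
import time

class PolicyCache:
    """Serve cached ACL decisions for a short TTL; fall through to the authoritative
    store when the entry is missing or stale. Faster reads, briefly stale permissions."""

    def __init__(self, authoritative_lookup, ttl_seconds=5.0):
        self._lookup = authoritative_lookup      # e.g., a call into the CP ACL store
        self._ttl = ttl_seconds
        self._entries = {}                       # (user_id, file_id) -> (decision, fetched_at)

    def can_read(self, user_id, file_id):
        key = (user_id, file_id)
        cached = self._entries.get(key)
        if cached and time.monotonic() - cached[1] < self._ttl:
            return cached[0]                     # fast path: may lag a recent revoke
        decision = self._lookup(user_id, file_id)
        self._entries[key] = (decision, time.monotonic())
        return decision
```

Sensitive operations (sharing changes, deletes) should skip the cache and hit the authoritative store, matching the CP row in the table above.
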
### SLA and SLO Definitions

**SLA** = contract with customers (credits, legal). **SLO** = internal target; **SLI** = what you measure. Below: illustrative **SLOs** for a consumer/enterprise-grade drive; tune numbers to the prompt.

| Category | SLI | Example SLO | Measurement window |
|----------|-----|-------------|--------------------|
| **Upload latency** | Time from **last byte** of chunk received to **ack** (or from **commit request** to **success**) | P99 < 500 ms for **commit**; chunk PUT P99 < 1 s for **≤ 32 MB** chunk under normal load | 30-day rolling |
| **Download latency** | Time to **first byte** (TTFB) for signed URL or gateway stream | P99 TTFB < 300 ms **same region**; higher cross-region (disclose) | 30-day rolling |
| **Sync latency** | Wall-clock from **server commit** to **change visible** on subscribed client (feed or push) | P95 < 10 s; P99 < 60 s (mobile/long-poll may widen tail) | 30-day rolling |
| **Data durability** | Probability of **permanent loss** of committed user object | **11 nines** for committed bytes after **ack** (provider-style claim; explain replication + EC + repair) | Annual / incident-based review |
| **Availability** | Successful **metadata** read/write and **auth** checks vs all attempts | **99.9%** monthly (consumer); **99.95%+** (business)—exclude customer-caused throttling if defined | Monthly |
| **Conflict resolution accuracy** | Share of conflicts **correctly classified** (no silent wrong winner) vs detected conflicts | **99.99%** **detection** rate for concurrent commits to same base revision; **0** silent LWW on shared folders if policy forbids | Per release + sampled audits |

**Error budget policy (how teams operate):**

| Element | Policy |
|---------|--------|
| **Budget** | e.g., **0.1%** monthly unavailability = ~43 minutes/month at 99.9% |
| **Burn alerts** | Page on **fast burn** (budget exhausted in days); ticket on **slow burn** |
| **Trade-offs** | If sync latency SLO slips, **throttle** non-critical features (preview gen) before dropping **durability** paths |
| **Freeze** | If budget exhausted, **freeze launches** until reliability work ships |

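Rough budget arithmetic behind the table above, assuming a 30-day month (the thresholds are simplifications, not policy):

```python
# Rough error-budget arithmetic for a 30-day month; numbers mirror the table above.
MINUTES_PER_MONTH = 30 * 24 * 60            # 43,200

def budget_minutes(slo: float) -> float:
    """Allowed 'bad' minutes per month for an availability SLO."""
    return MINUTES_PER_MONTH * (1 - slo)

def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    """1.0 means spending the budget exactly on pace; well above 1.0 means page someone."""
    expected_spend = budget_minutes(slo) * (elapsed_minutes / MINUTES_PER_MONTH)
    return bad_minutes / expected_spend if expected_spend else float("inf")

print(budget_minutes(0.999))                # ~43.2 min/month at 99.9%
print(budget_minutes(0.9995))               # ~21.6 min/month at 99.95%
print(burn_rate(10, 24 * 60, 0.999))        # 10 bad minutes on day one -> ~6.9x burn
```
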
{: .warning }
> **Never** conflate **durability** (bits not lost) with **availability** (API up). You can be **available** and **wrong** if you serve stale metadata—separate SLIs.

### Database Schema

Logical schema (illustrative SQL-oriented). Adjust types (`UUID`, `BIGINT`), indexing, and soft-delete to your scale story.

**`files`** — one row per **logical file** (node); `path` may be **materialized** for perf or **derived** from a closure table—not both without a clear source of truth.

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID / BIGINT | Primary key; stable across renames |
| `name` | VARCHAR | Display name; sibling uniqueness with `parent_id` |
| `path` | TEXT | Optional **denormalized** path for fast listing; or omit and use `parent_id` chain |
| `parent_id` | FK → `files.id` | **NULL** or sentinel for root |
| `size` | BIGINT | Logical size (bytes) for latest committed revision |
| `content_hash` | BYTEA / CHAR(64) | **Hash** of content (manifest root or whole-file); aligns with dedup story |
| `version` / `head_revision` | BIGINT | Monotonic revision for optimistic locking |
| `owner_id` | UUID | Billing / primary owner |
| `permissions` | ENUM / JSONB | **Default** visibility (e.g. `private`, `anyone_with_link`) or bitmask; **fine-grained** grants live in `sharing` |
| `created_at`, `updated_at` | TIMESTAMPTZ | Audit |
| `deleted_at` | TIMESTAMPTZ | Soft delete for sync trash |

**`file_versions`** — immutable **snapshots** for history and GC.

| Column | Type | Notes |
|--------|------|-------|
| `id` | BIGINT | Surrogate PK |
| `file_id` | FK → `files.id` | |
| `version` | BIGINT | Matches commit; **UNIQUE (file_id, version)** |
| `storage_key` | TEXT | Manifest id, or **pointer** to manifest table row |
| `timestamp` | TIMESTAMPTZ | Commit time (server authoritative) |

**`sharing`** — ACL edges.

| Column | Type | Notes |
|--------|------|-------|
| `file_id` | FK → `files.id` | Resource (file or folder node) |
| `user_id` | UUID | Principal |
| `permission_level` | ENUM / SMALLINT | `viewer`, `commenter`, `editor`, `owner` |
| `granted_by`, `granted_at` | UUID, TIMESTAMPTZ | Audit |

**`sync_state`** — **per-device** convergence (optional **per-file** row, or **summary + separate** pending table).

| Column | Type | Notes |
|--------|------|-------|
| `device_id` | UUID | Client-registered device |
| `file_id` | FK → `files.id` | |
| `local_version` | BIGINT | Last **known applied** server revision on device |
| `synced_at` | TIMESTAMPTZ | Last successful reconcile |

{: .tip }
> At scale, **`sync_state`** is often **sharded** with the user or stored **client-side** (SQLite) with server **cursors**—the table above is the **server-side** model when you track enterprise devices centrally.

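A minimal sketch of how a device could consume the change feed against this model; `server.changes_since` and the change-record fields (`file_id`, `version`, `cursor`) are assumptions for illustration:

```python
def apply_change_locally(change):
    """Placeholder: a real client would fetch the manifest and any missing chunks."""
    pass

def reconcile(server, device_state):
    """Advance this device toward the server's committed state.

    The invariant: the cursor only moves forward after a change has been applied
    locally, so retries and replays are safe and convergence is eventual.
    """
    cursor = device_state.setdefault("cursor", 0)
    files = device_state.setdefault("files", {})
    for change in server.changes_since(cursor):
        known = files.get(change["file_id"], {"local_version": 0})
        if change["version"] > known["local_version"]:
            apply_change_locally(change)
            known["local_version"] = change["version"]
            files[change["file_id"]] = known
        cursor = max(cursor, change["cursor"])
    device_state["cursor"] = cursor          # persist only after applying everything
    return device_state
```
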
---

## Step 2: Back-of-the-Envelope Estimation

software_system_design/distributed_file_system.md

Lines changed: 4 additions & 4 deletions
@@ -344,9 +344,9 @@ SLAs are **external promises** (often **contractual**); **SLOs** are **internal*

| SLO | Target | Measurement window | Notes |
|-----|--------|--------------------|-------|
-| **Read latency (p50 / p99)** | **< 5 ms / < 50 ms** same-rack sequential chunk read | **Rolling 30 days** | Excludes **cross-region**; **cold** EC tail may be **higher**—split SLO by **tier** |
-| **Write / append ack (p99)** | **< 100 ms** intra-cell | **Rolling 30 days** | Dominated by **pipeline** + **disk**; **record append** may **batch** |
-| **Metadata RPC (p99)** | **< 10 ms** for lookup / lease | **Rolling 30 days** | **Spikes** often mean **GC**, **HA** failover, or **hot** directory |
+| **Read latency (p50 / p99)** | **Under 5 ms / under 50 ms** same-rack sequential chunk read | **Rolling 30 days** | Excludes **cross-region**; **cold** EC tail may be **higher**—split SLO by **tier** |
+| **Write / append ack (p99)** | **Under 100 ms** intra-cell | **Rolling 30 days** | Dominated by **pipeline** + **disk**; **record append** may **batch** |
+| **Metadata RPC (p99)** | **Under 10 ms** for lookup / lease | **Rolling 30 days** | **Spikes** often mean **GC**, **HA** failover, or **hot** directory |
| **Data durability** | **99.999999999%** (11 nines) **annual** object survival | **Yearly** | **Justify** with **3× replication + scrubbing + MTTR**—not magic; **backups** for **metadata** |
| **Metadata durability** | **No silent loss** of committed namespace; **RPO** **near zero** for edit log | Per incident | **QJM/Raft** **fsync** policy matters |
| **Availability (data plane reads)** | **99.9%–99.99%** monthly | **Monthly** | **Planned** maintenance windows **excluded** or **budgeted** separately |
@@ -362,7 +362,7 @@ SLAs are **external promises** (often **contractual**); **SLOs** are **internal*
|---------|--------|
| **Budget** | **1 - SLO** per month (e.g., **99.99%** → **~4.32 min** bad metadata availability per month) |
| **Spend** | **Failover**, **GC pauses**, **slow disks** consume budget—track **burn rate** |
-| **Gate releases** | If **burn** > **** sustained, **freeze** risky changes; prioritize **reliability** work |
+| **Gate releases** | If **burn** exceeds **** sustained, **freeze** risky changes; prioritize **reliability** work |
| **Degraded modes** | **Read-only metadata** may **preserve** **C** at **cost of A**—document as **acceptable** for **batch** |
| **Customer comms** | **SLA** credits only if **external** monitoring agrees; **internal** SLOs stricter |
