# Live Video Architecture: Scale or Profit?

It's time to get that Staff Engineer promotion.
That's right, we're talking about _distributed systems design_.
It's not every day that you get to design a live video system (without blindly asking ChatGPT at least).

We'll walk through a series of designs; each is an iteration on the previous one with increasing complexity and usually cost.
Be warned that there is never a "right" answer in systems design and the investment you should make depends solely on your product.
Start small and build upon success and failure.


## Requirements
But first, a message from our sponsor: __requirements__

We're going to build a real-time media application.
There will be N broadcasters and M viewers in a room, some performing both roles.
Think Discord or Google Meet or Zoom or _your least favorite meeting app_ (there's a lot of them).

Considerations to be had:
- Some users will have poor networks.
- Some users will want higher quality even at higher latency (ex. screen share).
- Some users might be geo-dispersed, ex. some might live in Europe while others live just north of Mexico, and there's an ocean between them.
- Some customers might be willing to pay for better connectivity/quality.

Got that?
Basically just make a conferencing application.

## Design 0: Peer to Peer
We start with somehow both the least and most complex design: __peer-to-peer__.

That's right, boot up your favorite game from the 2000s and enter your friend's IP address manually.
We don't need no stinking servers... at least until your "friend" enables `noclip`.

The design is very simple, even if the implementation isn't.
Users connect directly to each other with only routers and switches in the middle.

The very first problem with P2P hits us immediately: how do you connect?
One of the clients needs to have a static IP/port which is unlikely because of these pesky things called __NATs__.

Even in a client/client protocol, one of the clients needs a public IP address.
You see, firewalls and NATs are widespread (even with IPv6) and unless configured (port forwarding), they only allow outgoing connections.
Even a fully peer-to-peer protocol requires at least one client to be the designated IT guy and act as a server just to get connections established.

But you should still have _some_ central server to exchange user details, like a "lobby" service, because we don't want to enter IPs manually or scan the Internet for friends.
No, we're not going to build "Media over Blockchain".
This same server can be used to punch holes in NATs *most of the time* via a protocol like ICE/STUN used in WebRTC.
I say *most of the time* with slanty text because it doesn't work for symmetric NATs.
Depending on the router you have at home, peer-to-peer might be impossible without literally proxying all traffic through an intermediate server (TURN).
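For the curious, this is roughly what NAT traversal looks like from the browser's point of view: you hand WebRTC a list of STUN/TURN servers and ICE does the rest. A minimal sketch; the server URLs and credentials are placeholders, not real infrastructure.

```ts
// Minimal WebRTC setup: STUN for hole punching, TURN as the relay of last resort.
// The URLs and credentials below are placeholders.
const pc = new RTCPeerConnection({
  iceServers: [
    // STUN: learn our public IP:port so the peer can attempt a direct path.
    { urls: "stun:stun.example.com:3478" },
    // TURN: proxy all traffic through an intermediate server when hole
    // punching fails (hello, symmetric NATs).
    {
      urls: "turn:turn.example.com:3478",
      username: "alice",
      credential: "hunter2",
    },
  ],
});

// ICE gathers candidates (host, server-reflexive, relay) and tries them in
// priority order; the relay candidate only wins when nothing else works.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    console.log(event.candidate.type, event.candidate.candidate);
  }
};
```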

But NATs are not the deal-breaker for P2P in many applications.
That award is reserved for _fanout_.

You see, if Alice is on a peer-to-peer call with Bob, then Alice needs to send their audio/video to Bob only.
That's like 3Mb/s at least for good quality video.
But if Candy, Dandy, Eric, Fritz, Geof, Harry, Iguana, Joel, Katrina, Luke, Mochael, Nigel, 🍊, Paul, Q, Raul, Steve, Tavana, Uguana, Video, Windy, Xylan, Yennifer, and Zee all join the call, then Alice needs to send N copies of their media.
There's no way to share the upload*, so we now need at least 75Mb/s of upload capacity to make sure everyone gets the video.
Your home Internet might be able to support that, but what about your cell phone?
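If you want to put numbers on it, the back-of-the-envelope math is just multiplication (assuming roughly 3Mb/s per copy, which is my assumption, not a law of nature):

```ts
// Rough P2P upload requirement: one copy of our stream per other participant.
const STREAM_BITRATE_MBPS = 3; // decent quality video, give or take

function p2pUploadMbps(participants: number): number {
  // We upload to everyone in the call except ourselves.
  return STREAM_BITRATE_MBPS * (participants - 1);
}

console.log(p2pUploadMbps(2)); // 3 Mb/s: a 1:1 call is fine
console.log(p2pUploadMbps(26)); // 75 Mb/s: Alice through Zee, good luck on a phone
```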

We need a different architecture.

## Design -1: Multicast
"Well of course, you daft blog author, that's why you use multicast!"

It's true, multicast lets you send one packet to multiple destinations.
The participants would all join a multicast group and then the network would figure it out.

"But, you daft blog author, nobody uses multicast!"

It's true, the protocol designed to solve all of our problems doesn't actually work on the internet.
There are a number of reasons but I'd be lying if I said I knew why.
My theory is that it's because multicast requires the world's ISPs to collaborate to build what amounts to a global CDN.
Multicast is still a good option within internal networks and datacenters provided you buy the right networking equipment.
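For what it's worth, joining a multicast group is trivial from the application's side; the hard part is the network in between. A sketch using Node's dgram module, with a made-up group address and port:

```ts
import dgram from "node:dgram";

// Join a multicast group: we tell the network we're interested in 239.0.0.1
// and it handles the fanout. Great on a LAN or inside a datacenter, a
// non-starter across the public internet.
const socket = dgram.createSocket({ type: "udp4", reuseAddr: true });

socket.on("message", (packet, sender) => {
  console.log(`got ${packet.length} bytes from ${sender.address}`);
});

socket.bind(5004, () => {
  socket.addMembership("239.0.0.1");
});
```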

So instead of asking ISPs to build a global CDN, we're going to build a global CDN ourselves using unicast.

We're stuck with unicast, but we can still steal _the ideas behind multicast_.

**New Requirement:** The broadcaster MUST send only one copy of the media.
Instead of the network layer (L3) performing fanout via multicast, we can fan out at the application layer (L7).
That's ultimately the premise behind CDNs: more expressive routing and caching can be performed at higher layers.
But those are spoilers; keep reading.

## Design 1: Hub and Spoke
Okay so we're going to use servers with public IPs.
We must first ask ourselves: "How do you determine which client connects to which server?"

I've got an answer for you: have everyone connect to the same server.

We're already at the most common conferencing architecture.
Discord uses it as do many conferencing applications.
It's a good design when meetings consist of a handful of cost-sensitive users.
I thought distributed systems design was supposed to be hard.
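In essence, the server does the fanout loop that Alice couldn't afford to do herself. A minimal in-memory sketch of one room on such a server (the transport and names are entirely made up):

```ts
// One room on the fanout server: the broadcaster uploads a single copy and the
// server duplicates it to every subscriber. This is multicast at L7.
type Send = (packet: Uint8Array) => void;

class Room {
  private subscribers = new Map<string, Send>();

  subscribe(viewerId: string, send: Send) {
    this.subscribers.set(viewerId, send);
  }

  unsubscribe(viewerId: string) {
    this.subscribers.delete(viewerId);
  }

  // Called once per packet from the broadcaster; the N copies are made here,
  // on the server's beefy uplink instead of the broadcaster's cell connection.
  broadcast(packet: Uint8Array) {
    for (const send of this.subscribers.values()) {
      send(packet);
    }
  }
}
```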

Unfortunately, there are two problems with our simple approach:
1. The maximum channel size is limited by the server size.
2. Users far away from the server will have a degraded experience.

When you hear the phrase "WebRTC doesn't scale", this is the architecture being referenced.
A single host has limits in terms of throughput and quality.
If we want to have thousands of people in the same room our only option is to scale vertically.
You can throw larger CPUs and NICs at the problem but eventually you hit a limit.
Other designs can scale horizontally so this isn't a fundamental problem with WebRTC.

And _who_ decides which server to use?
Ideally, the server is the closest on average to all users, but what if users trickle in one at a time?
There's no right answer, only convoluted business logic, but usually it's based on the first user to connect (aka the "host").

**Fun tip**: Next time you have a meeting, make sure the solo Euro member doesn't create the meeting.
Let them suffer alone with poor video quality and free healthcare.

## Design 2: Edge Nodes
So you're serious about improving the user experience and have additional cash to throw around?
Good.

There is an unspoken rule of the internet: it sucks.
The "public" internet consists of interconnected ISPs and transit providers that want to pay as little as possible while keeping customers happy enough.
You might have a fiber internet connection to your ISP's servers, but obviously they're not going to give you a dedicated fiber connection to everywhere in the world.
You're going to get a dedicated connection to the nearest speed test website and that's it.

The internet is the world's largest sharing experiment.
But to get the most reliable experience, we need to explicitly not share.
We want our media packets to spend as little time as possible on the public **inter**net and as much time on our private **intra**net.

That's right, it's time to copy a HTTP CDN and put edge servers next to each user.
Cue the map with all of the datacenters around the world.

<globe of dots>

That's the one, good stuff.

So we provision servers around the world and users will connect to the closest one.
This minimizes the amount of time our user's packets spend mingling with the commoners.
We're now a toddler who refuses to share with anyone else and can better control the quality/cost once the packets are in our network.

But how does a user know which server is the closest?
There are a few options:

1. Ping them all and use the minimum (sketched below).
2. Use GeoDNS.
3. Use anycast.
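Option 1 is the brute-force approach; a rough browser-side sketch, with placeholder edge hostnames:

```ts
// Probe each edge with a cheap request and pick the one with the lowest RTT.
// The hostnames are placeholders; a HEAD request is a crude but easy probe.
const EDGES = [
  "https://sfo.example.com",
  "https://lhr.example.com",
  "https://syd.example.com",
];

async function probe(url: string): Promise<number> {
  const start = performance.now();
  await fetch(url, { method: "HEAD", cache: "no-store" });
  return performance.now() - start;
}

async function closestEdge(): Promise<string> {
  const rtts = await Promise.all(EDGES.map(probe));
  let best = 0;
  for (let i = 1; i < rtts.length; i++) {
    if (rtts[i] < rtts[best]) best = i;
  }
  return EDGES[best];
}
```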

I'm a big anycast 🪭 as are most CDNs.
Basically, all of our servers advertise the same IP address and BGP figures out which one is closest.
This gives ISPs greater control over how traffic should be routed, potentially over paths with higher RTTs but lower congestion.

But I'm going to stop blabbing about anycast; we've got _distributed systems_ to design.

Note that we still have a central server that "owns" the meeting room.
This is commonly called the "origin" and each edge node needs to know its address to proxy media.
Store this in a database or something.
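A sketch of that lookup, with an in-memory map standing in for the database (a real deployment needs shared storage, expiry, and some answer to the obvious race):

```ts
// Map a room name to the origin server that owns it. The first edge to see
// the room claims origin duty; every later edge proxies to it.
const origins = new Map<string, string>();

function resolveOrigin(room: string, myAddress: string): string {
  const existing = origins.get(room);
  if (existing) return existing;
  origins.set(room, myAddress); // nobody owns it yet: we do now
  return myAddress;
}

// The first viewer lands on syd-1, so syd-1 becomes the origin...
console.log(resolveOrigin("standup", "syd-1.example.com")); // syd-1.example.com
// ...and a later viewer on lhr-2 is told to proxy to syd-1.
console.log(resolveOrigin("standup", "lhr-2.example.com")); // syd-1.example.com
```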


## Design 3: Deduplication

So the previous design has users connect to the closest server, but how do the servers get the media to each other?
As mentioned above, each edge proxies media to and from the room's "origin" server.

But critically, this edge-to-origin proxying also gives us the ability to **deduplicate**, serving the same content to multiple users, kinda like multicast.
That's why Google even has a CDN edge on a cruise ship.

The other benefit of connecting edges point-to-point is more subtle: it prevents __tromboning__.
For example, let's say someone from Boston is trying to call someone from the UK a ~~lolly gagger~~ (TODO make sure that properly censors the bad word).

- If we're using a single SFU in us-west, then the traffic first heads west and then back east.
- If we're using multiple SFUs, then the traffic more closely follows the shortest path.

This tromboning is rare in practice as most traffic is regional, but shush I'm trying to over-engineer the system.
We don't want to send traffic back and forth for no reason because it increases latency and reduces capacity.

So yeah, we have every user connect to the closest server via anycast or geo-DNS.
There's some database or gossip protocol used to discover the "origin" server for each user.
When the server needs to route traffic to a user, it routes it to the origin instead.
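The deduplication itself is just reference counting on the edge: no matter how many local viewers ask for a room, the edge holds one upstream subscription to the origin. A sketch under that assumption:

```ts
// An edge keeps at most one upstream subscription per room, no matter how
// many local viewers are watching. That's the deduplication.
class Edge {
  private upstream = new Map<string, Set<string>>();

  watch(room: string, viewerId: string) {
    let viewers = this.upstream.get(room);
    if (!viewers) {
      viewers = new Set();
      this.upstream.set(room, viewers);
      this.subscribeToOrigin(room); // one upstream fetch, shared by everyone here
    }
    viewers.add(viewerId);
  }

  leave(room: string, viewerId: string) {
    const viewers = this.upstream.get(room);
    if (!viewers) return;
    viewers.delete(viewerId);
    if (viewers.size === 0) {
      this.upstream.delete(room); // last local viewer left: drop the upstream
    }
  }

  private subscribeToOrigin(room: string) {
    console.log(`subscribing upstream for ${room}`);
  }
}
```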

But the end result looks like a tangled mess.
Or because 'tis the season, the end result looks more like my 🎄 lights after they magically knotted themselves in the attic (a miracle).
This is no good, because if every edge can connect to any other edge, there can be a lot of duplicated traffic on each link.

## Design 4: A Tree
The internet certainly behaves like a net.
There are many forks and traffic needs to figure out if it should go right or left.

We want to avoid our media going both left and right when one path would do, because it's a waste of our limited network capacity.
But we also want to avoid multiple copies of the same media flowing over the same link.

For example, let's say an edge in the UK wants to fetch content from an origin in San Francisco.
It could fetch from a server in Ireland, then New York, then Chicago, then Utah, then Frisco (nobody calls it that).
If other servers take a similar route, intersecting at some point, then we can combine multiple fetches into one.

It was called a tree at Twitch but it's really more like a river with many tributaries.
Each layer in the tree compounds, reducing hundreds of thousands of edge requests into a handful of origin requests.
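One way to picture it: every node has a designated parent, and a fetch just walks toward the origin, getting merged with other fetches at every hop. A toy version with a hard-coded topology (the node names are invented):

```ts
// A hard-coded fetch tree: each node pulls from its parent, so requests from
// different edges merge wherever their paths toward the origin overlap.
const parent: Record<string, string | null> = {
  sfo: null, // the origin
  slc: "sfo",
  ord: "slc",
  nyc: "ord",
  dub: "nyc",
  lhr: "dub",
};

// The path a fetch from `node` takes toward the origin.
function routeToOrigin(node: string): string[] {
  const path = [node];
  let next = parent[node];
  while (next) {
    path.push(next);
    next = parent[next];
  }
  return path;
}

console.log(routeToOrigin("lhr")); // [ 'lhr', 'dub', 'nyc', 'ord', 'slc', 'sfo' ]
```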

But computing this tree is not easy.
At Twitch we literally updated it by hand every time there was an outage or some sort of change to the network topology.
The process has since been automated as some smart engineers earned their paychecks... only to be scrapped as Twitch was absorbed into the mothership.
But that's a tale for another time...

So yeah, we're doing some application-based routing to try and maximize the cache-hit ratio.
This means less duplicate media flowing over the backbone which means less compute and network capacity needed, which means less money burned on the cloud.
It's just getting increasingly difficult to design as users can join from any edge and leave at any time.

## Design 5: Coalesce
The key word in that last sentence is _any_.
If a user can connect to _any_ edge, then we've already started at a disadvantage.

For example, let's say there are 3 Australian viewers connecting to our service and we have 6 servers in our Sydney datacenter.
If each user connects via anycast or round-robin DNS to the "closest" edge, there's actually some randomization involved to break ties.
Our servers within the Sydney data center have equal latency so our users will be assigned randomly.
This works, but it's not the cheapest.

You see, we want all of those users to connect to the same host so we can serve them from a single cache.
If they're connected to different edges within the same datacenter, then we need to transfer the media between those hosts eating up more compute and network resources.
Every host within the datacenter provides an equal user experience (provided they aren't full) so coalescing users to the same host is pure cost savings.

We need to designate at least one server per datacenter as the "owner" of each meeting.
But there's a gotcha: now we can end up with lopsided load, since some meetings are larger than others.
We also need the ability to spill over to additional servers for the largest events, as a single server is not enough to serve the Super Bowl.

There are a few ways to accomplish this.
One approach is to create a service that assigns each user to a host, but it quickly becomes a monolith as it needs the state of the entire system.
Another approach is to rely on hashing, returning a stateless assignment based on the meeting name with some mechanism to avoid overloading individual hosts.
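The hashing approach can be as simple as rendezvous (highest-random-weight) hashing: every edge scores every host against the meeting name and picks the winner, so they all agree without sharing any state, and the runners-up double as spillover targets. A sketch with a toy hash function:

```ts
// Rendezvous hashing: score each host against the meeting name and pick the
// highest. Every edge computes the same ranking with zero coordination, and
// the runners-up are the spillover order when the winner fills up.
function score(meeting: string, host: string): number {
  // Toy FNV-1a hash; fine for a sketch, not something to ship.
  let hash = 2166136261;
  for (const char of `${meeting}|${host}`) {
    hash ^= char.charCodeAt(0);
    hash = Math.imul(hash, 16777619);
  }
  return hash >>> 0;
}

function rankHosts(meeting: string, hosts: string[]): string[] {
  return [...hosts].sort((a, b) => score(meeting, b) - score(meeting, a));
}

const sydney = ["syd-1", "syd-2", "syd-3", "syd-4", "syd-5", "syd-6"];
const [owner, ...spillover] = rankHosts("standup", sydney);
console.log(owner, spillover); // identical on every edge in the datacenter
```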

## Design 6: Optimization
Now we're in the optimization end-game.

One key flaw in the previous designs is the word "any".
We went from a user connecting to a single host to the user connecting to _any_ host.

If we get academic for just a second, we have a connected graph.
- Every node in the graph is a user or server.
- Every edge is a possible route between two servers, or between a server and a user.

The number of permutations balloons out of control.
Even finding the shortest path starts becoming difficult.
But we're not just trying to find the shortest path; we're also trying to find a cheap one.

There's a cost associated with every node and edge.
Some are cheap, some are shit, and some are both although there's not much correlation.
Interconnect is a world of politics filled with price gouging and anti-competitive behavior.
No specifics will be mentioned lest I get sued by a South Korean ISP, or a German one, or really any nationality.
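If every viewer's traffic were unique, this would be textbook cheapest-path territory: Dijkstra with dollar cost as the edge weight instead of latency. A sketch with entirely invented link costs:

```ts
// Cheapest path between two sites, ignoring caching: plain Dijkstra where the
// weight of each hop is its dollar cost rather than its latency.
// All link costs below are invented.
type Graph = Record<string, Record<string, number>>;

const linkCost: Graph = {
  syd: { lax: 5, sin: 4 },
  sin: { lax: 3, fra: 6 },
  lax: { nyc: 1, fra: 7 },
  nyc: { fra: 2 },
  fra: {},
};

function cheapestPath(graph: Graph, from: string, to: string): string[] {
  const cost: Record<string, number> = { [from]: 0 };
  const prev: Record<string, string> = {};
  const unvisited = new Set(Object.keys(graph));

  while (unvisited.size > 0) {
    // Pick the unvisited node with the lowest known cost so far.
    let node: string | undefined;
    for (const candidate of unvisited) {
      if (cost[candidate] === undefined) continue;
      if (node === undefined || cost[candidate] < cost[node]) node = candidate;
    }
    if (node === undefined) break; // the rest is unreachable
    unvisited.delete(node);

    for (const [next, edge] of Object.entries(graph[node])) {
      const total = cost[node] + edge;
      if (cost[next] === undefined || total < cost[next]) {
        cost[next] = total;
        prev[next] = node;
      }
    }
  }

  // Walk the predecessors back from the destination.
  const path = [to];
  while (path[0] !== from) path.unshift(prev[path[0]]);
  return path;
}

console.log(cheapestPath(linkCost, "syd", "fra")); // [ 'syd', 'lax', 'nyc', 'fra' ]
```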

So, we have a classic graph optimization problem where every node/edge has a cost.
But what's really cool about this problem is that _only unique traffic counts_.
The more users we serve from cache, the cheaper a node/link becomes _per user_.

What are the ramifications of this?
Well, maybe serving those 3 users from Australia out of the Sydney data center isn't worth it.
It's expensive to reserve capacity on a wire sitting on the floor of the world's largest ocean.
The Aussies are used to awful Internet, so how about we send their traffic over congested transit links instead?

Maybe if there were 20 users then the cost would average out to something more palatable.
But again, users can join/leave at any time so the "optimal" decision constantly changes.
It's not feasible to constantly _recalculate splines_, so at some point you have to rely on heuristics.
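To make the Aussie example concrete, the quantity that matters is the fixed cost divided by the number of viewers sharing the cache. All prices here are invented:

```ts
// Per-viewer cost of serving a room out of a given site. The link/server cost
// is roughly fixed, so every extra cached viewer makes it cheaper per head.
function costPerViewer(fixedCostPerHour: number, cachedViewers: number): number {
  return fixedCostPerHour / Math.max(cachedViewers, 1);
}

// Invented numbers: a submarine-cable-backed Sydney POP vs. cheap but
// congested transit from somewhere far away.
console.log(costPerViewer(9.0, 3)); // 3.00/viewer: hard to justify Sydney
console.log(costPerViewer(9.0, 20)); // 0.45/viewer: now Sydney looks reasonable
console.log(costPerViewer(1.5, 3)); // 0.50/viewer: the far-away option wins for tiny rooms
// ...and the "right" answer flips every time a viewer joins or leaves.
```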

And we have plenty of crude heuristics.
Before beefing up capacity, Twitch would only allow assignments to data centers based on _global_ viewership.
20 Aussies watching a neighborhood event?
Sorry mate, best we can do is LA.
100 Brazilian viewers and the first Australian viewer shows up?
Welcome to our Sydney data center.

I believe this optimization problem is impossible to solve at scale.
But it's also the secret sauce behind every CDN; as you scale up, more can be cached but where do you draw the line between quality/cost?

## quic.video

<map of 3 edges>