From 7b4504f10d53718854a772e46fbe7fe0229aaec7 Mon Sep 17 00:00:00 2001 From: kixelated Date: Wed, 4 Dec 2024 09:48:53 -0800 Subject: [PATCH 1/7] Create live-arch.mdx --- src/pages/blog/live-arch.mdx | 80 ++++++++++++++++++++++++++++++++++++ 1 file changed, 80 insertions(+) create mode 100644 src/pages/blog/live-arch.mdx diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx new file mode 100644 index 0000000..d16848d --- /dev/null +++ b/src/pages/blog/live-arch.mdx @@ -0,0 +1,80 @@ +# Live Architecture + +It's time to get that Staff Engineer promotion. +That's right, we're talking about distributed systems design. + +I'm going to walk through a few designs for the same system with increasing complexity. +There's no right answer, just pros and cons, but I'd love to hear your favorites. +Or @me with an alternative because you're super smart and I used to eat paste in school (true story). + + +## Requirements +But first, a message from our sponsor: __requirements__ + +We're going to build a real-time media application. +There will be N broadcasters and M viewers in a room, some performing both roles. +Think Discord or Google Meet or Zoom or _your least favorite app here_. + +Additional considerations: +- Some users will have poor networks. +- Some broadcasters will want higher quality/latency (ex. screen share) +- Some users might be geo-dispersed. ex. some in Europe and some in ~~freedom land~~ North America. +- Some customers might be willing to pay for better connectivity/quality. + +Got that? +Basically just make a conferencing application. + +## Design 0: Peer to Peer +We start with both the least complicated and yet somehow most complex design: __peer-to-peer__. +That's right, boot up your favorite game from the 2000s and enter your friend's IP address manually. + +The design is very simple. +Users connect directly to each other without pesky intermediate servers (unless you count routers). +There should still be _some_ central server to exchange user details, like a "lobby" service, or else discovery becomes very difficult. +Sorry, we're not going to build "Media over Blockchain" or something like BitTorrent sans trackers. + +The very first problem with P2P hits us immediately: how do you connect? +In a client/client protocol, one of the two peers needs to act like a server using a public IP/port. +But this usually not an option because IPv4 addresses are expensive and IPv6 addresses are somehow still unavailable. +Instead, the internet consists of NATs that generate IP/port pairs for outgoing connections (client) but critically, reject incoming connections (server) unless configured to forward ports. + +And those NATs are a headache. +WebRTC uses STUN/ICE (and the help of a middle server) to connect clients in what only can be described as a giant hack, smuggling IP/port pairs between "connections". +However even this doesnt work in all scenarios (symmetric NATs) and some peers have to resort to a TURN, literally proxying UDP/TCP traffic through a middle server. + +But NATs are not the deal-breaker. +That award is reserved for _multiple participants_. + +You see, if Alice is on a peer-to-peer call with Bob, then Alice needs to send their audio/video to Bob only. +Thats like 3Mb/s at least for good quality video. +But if Candy, Dandy, Eric, Fritz, Geof, Harry, Iguana, Joel, Katrina, Luke, Mochael, Nigel, 🍊, Paul, Q, Raul, Steve, Tavana, Uguana, Video, Windy, Xylan, Yennifer, and Zee all join the call, then now Alice needs to send N copies of their media. 
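To put rough numbers on that fanout, here's a back-of-envelope sketch. The ~3Mb/s per-stream figure comes from the paragraph above; everything else is an assumption for illustration, not a measurement.

```python
# Rough upload math for a full-mesh (peer-to-peer) call.
# Assumption: every participant sends one copy of their stream to every
# other participant, and every stream costs roughly the same bitrate.

def mesh_upload_mbps(participants: int, bitrate_mbps: float = 3.0) -> float:
    """Upload each participant needs when there's no server to do the fanout."""
    copies = participants - 1  # one copy per remote peer
    return copies * bitrate_mbps

if __name__ == "__main__":
    for n in (2, 5, 10, 26):
        print(f"{n:>2} participants -> {mesh_upload_mbps(n):5.1f} Mb/s of upload each")
```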
+There's no way to share*; we need 78Mb/s of upload capacity now to make sure everyone gets the video. +Your home Internet might be able to support that, but what about your cell phone? + +We need a different architecture. + +## Design -1: Multicast +"Well of course, you daft blog author, that's why you use multicast!" + +It's true, multicast is the solution. +Instead of sending N unicast packets, you could send 1 multicast packet. +The participants would all join a multicast group and then the routers will figure it out. + +Unfortunately, multicast doesn't actually work in practice. +The main problem is that you're entirely reliant on ISPs to work together and build an optimal fanout architecture. +There's limited support for multicast outside of datacenter environments that use the same brand of routers. +I'm no expert on the subject, but those experts have tried. + +But I am an expert on congestion control, something that multicast doesn't really support. +You see, if you're downloading live media at 6Mb/s and your Internet suddenly has issues, the unicast solution is to lower the send rate and only transmit important data (ex. audio). +But multicast is a firehose that can't be turned off by the sender. +The only option is to have a router drop (random*) packets turning your live stream into swiss cheese. + +So we're using unicast. +But, we're using _the ideas behind multicast_. + +The broadcaster MUST send only one copy of the media. +Instead of the network performing fanout (L2), we use the application layer (L7) instead. +That's ultimately the premise behind CDNs: shorten and deduplicate the hops. +But those are spoilers; keep reading. + From 584c5edbe04d15a91a1a1bdc0b017689fc8b4e1c Mon Sep 17 00:00:00 2001 From: kixelated Date: Wed, 4 Dec 2024 17:49:10 -0800 Subject: [PATCH 2/7] Update live-arch.mdx --- src/pages/blog/live-arch.mdx | 81 +++++++++++++++++++++++++++++++++++- 1 file changed, 79 insertions(+), 2 deletions(-) diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx index d16848d..6ce1401 100644 --- a/src/pages/blog/live-arch.mdx +++ b/src/pages/blog/live-arch.mdx @@ -74,7 +74,84 @@ So we're using unicast. But, we're using _the ideas behind multicast_. The broadcaster MUST send only one copy of the media. -Instead of the network performing fanout (L2), we use the application layer (L7) instead. -That's ultimately the premise behind CDNs: shorten and deduplicate the hops. +Instead of the network layer (L2) performing fanout, we use the application layer (L7) instead. +That's ultimately the premise behind CDNs: more expressive routing and caching can be performed at higher layers. But those are spoilers; keep reading. +## Design 1: Hub and Spoke +We're already at the most common architecture as far as I can tell (ex. Discord uses it). +From now on, there will be pros and cons for each design. + +But first, ask yourself: "How do you determine which user connects to which server?" + +I've got an answer for you: have everyone connect to the same server. + +This is by far the simplest (and cheapest) architecture. +There are unfortunately two problems: + +1. The maximum channel size is limited by the server size. +2. Users far away from the server will have a degraded experience. + +When you hear the phrase "WebRTC doesn't scale", this is the architecture being referenced. +A single host has limits in terms of throughput and quality. +Our only option is to scale vertically, while other designs can scale horizontally. 
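If it helps to see why vertical scaling runs out of road, here's a rough sketch of what that single hub has to push. It assumes every participant publishes one ~3Mb/s stream and receives everyone else's, and it ignores real-world tricks like simulcast or audio-only tiers.

```python
# Why a single "hub" server hits a ceiling: its egress grows quadratically
# with room size, because every user downloads every other user's stream.

def hub_bandwidth_mbps(participants: int, bitrate_mbps: float = 3.0) -> tuple[float, float]:
    ingress = participants * bitrate_mbps                      # one upload per user
    egress = participants * (participants - 1) * bitrate_mbps  # N-1 downloads per user
    return ingress, egress

if __name__ == "__main__":
    for n in (10, 50, 200, 1000):
        ingress, egress = hub_bandwidth_mbps(n)
        print(f"{n:>4} users: {ingress:8.0f} Mb/s in, {egress:10.0f} Mb/s out of one host")
```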

But like I said, it is the cheapest and simplest option which is why I think it's a good fit for an application like Discord.

But _who_ decides which server to use?
Ideally the server is the one closest to all users, but what if users trickle in one at a time?
There's no right answer, only convoluted business logic, but usually it's based on the *host*.
**Fun tip**: Next time you have a meeting, make sure the lone Euro member doesn't create the meeting.

## Design 2: Point to Point
So you're serious about improving the user experience and have spare ~~cash~~ servers.
Good?

It's time to copy an HTTP CDN and put edge servers right next to each user.
The unspoken rule of the internet: it sucks.

We want our media packets to spend as little time as possible on the public internet and as much time on our private intranet.
We then have the ability to prioritize/reserve traffic instead of fighting with the commoners using transit.
It also gives us the ability to **deduplicate**, serving the same content to multiple users kinda like multicast.
That's why Google even has a CDN edge on a cruise ship.

The other benefit of point-to-point is more subtle: it prevents __tromboning__.
For example, let's say someone from Boston is trying to call someone from the UK a ~~lolly gagger~~ (TODO make sure that properly censors the bad word).

- If we're using a single SFU in us-west, then the traffic first heads west and then back east.
- If we're using multiple SFUs, then the traffic more closely follows the shortest path.

This tromboning is rare in practice as most traffic is regional, but shush I'm trying to over-engineer the system.
We don't want to send traffic back and forth for no reason because it increases latency and reduces capacity.

So yeah, we have every user connect to the closest server via anycast or geo-DNS.
There's some database or gossip protocol used to discover the "origin" server for each user.
When the server needs to route traffic to a user, it routes it to the origin instead.

But the end result looks like a tangled mess.
Or because 'tis the season, the end result looks more like my 🎄 lights after they magically knotted themselves in the attic (a miracle).
This is no good, because if every edge can connect to any other edge, there can be a lot of duplicated traffic on each link.

## Design 3: A Tree
The internet certainly behaves like a net.
There are many forks and traffic needs to figure out if it should go right or left.

We need to avoid our media going both left and right because it's a waste of our limited network capacity.
But we need to avoid multiple copies too.

For example, let's say an edge in the UK wants to fetch content from an origin in San Francisco.
It could fetch from a server in Ireland, then New York, then Chicago, then Utah, then Frisco (nobody calls it that).
If other servers take a similar route, intersecting at some point, then we can combine multiple fetches into one.

It was called a tree at Twitch but it's really more like a river with many tributaries.
Each layer in the tree compounds, reducing hundreds of thousands of edge requests into a handful of origin requests.

But computing this tree is not easy.
It was literally created by hand at Twitch and updated every time there was an outage of some sort.
The process has since been automated as some smart engineers earned their paychecks, only to be scrapped as Twitch embraced the mothership (AMZN).
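As a sketch of that "combine multiple fetches into one" trick, here's roughly what request coalescing inside a relay could look like. The class and names are invented for illustration, not lifted from any particular CDN.

```python
# If several downstream nodes ask this relay for the same track at roughly
# the same time, only the first request continues toward the parent/origin;
# everyone else waits for that one answer.

import threading

class CoalescingRelay:
    def __init__(self, fetch_upstream):
        self.fetch_upstream = fetch_upstream  # callable: track_id -> bytes
        self.lock = threading.Lock()
        self.inflight: dict[str, threading.Event] = {}
        self.cache: dict[str, bytes] = {}

    def fetch(self, track_id: str) -> bytes:
        with self.lock:
            if track_id in self.cache:
                return self.cache[track_id]          # already deduplicated
            waiter = self.inflight.get(track_id)
            if waiter is None:                       # first request becomes the leader
                waiter = self.inflight[track_id] = threading.Event()
                leader = True
            else:
                leader = False
        if leader:
            media = self.fetch_upstream(track_id)    # the single upstream fetch
            with self.lock:
                self.cache[track_id] = media
                self.inflight.pop(track_id).set()
            return media
        waiter.wait()                                # everyone else piggybacks
        with self.lock:
            return self.cache[track_id]
```

Stack a few layers of these and you get the compounding effect described above: each hop only ever sends one request upstream per track, no matter how many edges are asking.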
+ + + + + +## Design 4: Constraints From 779f70d0b21f341013a80da12c68ede827526b04 Mon Sep 17 00:00:00 2001 From: kixelated Date: Wed, 4 Dec 2024 20:28:39 -0800 Subject: [PATCH 3/7] Update live-arch.mdx better title --- src/pages/blog/live-arch.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx index 6ce1401..79c8ba4 100644 --- a/src/pages/blog/live-arch.mdx +++ b/src/pages/blog/live-arch.mdx @@ -1,4 +1,4 @@ -# Live Architecture +# Architect (like a boss) It's time to get that Staff Engineer promotion. That's right, we're talking about distributed systems design. From da997705be694ed1dfded1b297d1204aff2f5624 Mon Sep 17 00:00:00 2001 From: kixelated Date: Mon, 9 Dec 2024 10:46:19 -0800 Subject: [PATCH 4/7] Update live-arch.mdx --- src/pages/blog/live-arch.mdx | 69 ++++++++++++++++++++++++++++++++++-- 1 file changed, 66 insertions(+), 3 deletions(-) diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx index 79c8ba4..b476cb2 100644 --- a/src/pages/blog/live-arch.mdx +++ b/src/pages/blog/live-arch.mdx @@ -147,11 +147,74 @@ It was called a tree at Twitch but it's really more like a river with many tribu Each layer in the tree compounds, reducing hundreds of thousands of edge requests into a handful of origin requests. But computing this tree is not easy. -It was literally created by hand at Twitch and updated every time there was outage of some sort. -The process has since been automated as some smart engineers earned their paychecks, only to be scrapped as Twitch embraced the mothership (AMZN). +Ar Twitch we literally updated it by hand every time there was outage or some sort of change to the network topology. +The process has since been automated as some smart engineers earned their paychecks... only to be scrapped as Twitch was absorbed into the mothership. +But that's a tale for another time... +So yeah, we're doing some application-based routing to try and maximize the cache-hit ratio. +This means less duplicate media flowing over the backbone which means less compute and network capacity needed, which means less money burned on the cloud. +It's just getting increasingly difficult to design as users can join from any edge and leave at any time. +## Design 4: Coalesce +The key word in that last sentence is _any_. +If a user can connect to _any_ edge, then we've already started at a disadvantage. +For example, let's say there's 3 Australian viewers connecting to our service and we have 6 servers in our Sydney datacenter. +If each user connects via anycast or round-robin DNS to the "closest" edge, there's actually some randomization involved to break ties. +Our servers within the Sydney data center have equal latency so our users will be assigned randomly. +This works, but it's not the cheapest. +You see, we want all of those users to connect to the same host so we can serve them from a single cache. +If they're connected to different edges within the same datacenter, then we need to transfer the media between those hosts eating up more compute and network resources. +Every host within the datacenter provides an equal user experience (provided they aren't full) so coalescing users to the same host is pure cost savings. + +We need to designate at least one server per datacenter as the "owner" of each meeting. +But there's a gotcha, as now we can cause lopsided load as some meetings are larger than others. 
+We also need the ability to spill over to additional servers for the largest events as a single server is not enough to serve the Superbowl. + +There's a few ways to accomplish this. +One approach is to create a service that assigns each user to a host, but it quickly becomes a monolith as it needs the state of the entire system. +Another approach is to rely on hashing, returning a stateless assignment based on the meeting name with some mechanism to avoid overloading individual hosts. + +## Design 5: Optimization +Now we're in the optimization end-game. + +One key flaw in the previous designs is the word "any". +We went from a user connecting to a single host to the user connecting to _any_ host. + +If we get academic for just a second, we have a connected graph. +- Every node in the graph is a user or server. +- Every edge is a possible route between servers or a user. + +The number of permutations balloons out of control. +Even finding the shortest path starts becoming difficult. +But we're not trying to find the shortest path, we're also trying to find a cheap one. + +There's a cost associated with every node and edge. +Some are cheap, some are shit, and some are both although there's not much correlation. +Interconnect is a world of politics filled with price gouging and anti-competitive behavior. +No specifics will be mentioned lest I get sued by a South Korean ISP, or a German one, or really any nationality. + +So, we have a classic graph optimization problem where every node/edge has a cost. +But what's really cool about this problem is that _only unique traffic counts_. +The more users we serve from cache, the cheaper a node/link becomes _per user_. + +What's the ramifications of this? +Well, maybe serving those 3 users from Australia out of the Syndey data center isn't worth it. +It's expensive to reserve capacity on a wire sitting on the floor of the world's largest ocean. +The Aussies are used to awful Internet, how about we send their traffic over congestion transit links instead? + +Maybe if there were 20 users then the cost would average out to something more palatable. +But again, users can join/leave at any time so the "optimal" decision constantly changes. +It's no feasible to constantly _recalculating splines_ so at some point you have to rely on heuristics. + +And we have plenty of crude heuristics. +Before beefing up capacity, Twitch would only allow assignments to data centers based on _global_ viewership. +20 Aussies watching a neighborhood event? +Sorry mate, best we can do is LA. +100 Brazilian viewers and the first Australian viewer shows up? +Welcome to our Sydney data center. + +I believe this optimization problem is impossible to solve at scale. +But it's also the secret sauce behind every CDN; as you scale up, more can be cached but where do you draw the line between quality/cost? -## Design 4: Constraints From 049dd2c9c136167011d70976eebf83cb18e75cce Mon Sep 17 00:00:00 2001 From: kixelated Date: Wed, 18 Dec 2024 10:00:27 -0800 Subject: [PATCH 5/7] Update live-arch.mdx --- src/pages/blog/live-arch.mdx | 141 ++++++++++++++++++++++------------- 1 file changed, 89 insertions(+), 52 deletions(-) diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx index b476cb2..bdd5024 100644 --- a/src/pages/blog/live-arch.mdx +++ b/src/pages/blog/live-arch.mdx @@ -1,11 +1,17 @@ -# Architect (like a boss) +# Live Video Architecture: For Scale and Profit? It's time to get that Staff Engineer promotion. 
-That's right, we're talking about distributed systems design. +That's right, we're talking about _distributed systems design_. +It's a topic with surprisingly few (public) resources out there because the big companies that have to solve these scaling problems are not eager to share. + +Well I am! I'm going to walk through a few designs for the same system with increasing complexity. -There's no right answer, just pros and cons, but I'd love to hear your favorites. -Or @me with an alternative because you're super smart and I used to eat paste in school (true story). +As with many design problems, there's no right answer. +This is not a leet-code problem where you can count the number of cycles a program takes to execute. +It all boils down to pros and cons depending on your problem. + +But @me if you know best because you're a super smart Web3 affifiando while I used to eat paste in school (true). ## Requirements @@ -13,34 +19,40 @@ But first, a message from our sponsor: __requirements__ We're going to build a real-time media application. There will be N broadcasters and M viewers in a room, some performing both roles. -Think Discord or Google Meet or Zoom or _your least favorite app here_. +Think Discord or Google Meet or Zoom or _your least favorite meeting app_ (there's a lot of them). -Additional considerations: +Considerations to be had: - Some users will have poor networks. -- Some broadcasters will want higher quality/latency (ex. screen share) -- Some users might be geo-dispersed. ex. some in Europe and some in ~~freedom land~~ North America. +- Some users will want higher quality/latency (ex. screen share) +- Some users might be geo-dispersed. ex. some might live in Europe while some live in Freedom. - Some customers might be willing to pay for better connectivity/quality. Got that? Basically just make a conferencing application. ## Design 0: Peer to Peer -We start with both the least complicated and yet somehow most complex design: __peer-to-peer__. +We start with somehow both the least and most complex design: __peer-to-peer__. + That's right, boot up your favorite game from the 2000s and enter your friend's IP address manually. +We don't need no stinking servers, at least until your "friend" enables `noclip`. -The design is very simple. -Users connect directly to each other without pesky intermediate servers (unless you count routers). -There should still be _some_ central server to exchange user details, like a "lobby" service, or else discovery becomes very difficult. -Sorry, we're not going to build "Media over Blockchain" or something like BitTorrent sans trackers. +The design is very simple, even if the implementation isn't. +Users connect directly to each other with only routers and switches in the middle. +There should still _some_ central server to exchange user details, like a "lobby" service, because we don't want to enter IPs manually or scan the Internet for friends. +Sorry, we're not going to build "Media over Blockchain". The very first problem with P2P hits us immediately: how do you connect? -In a client/client protocol, one of the two peers needs to act like a server using a public IP/port. -But this usually not an option because IPv4 addresses are expensive and IPv6 addresses are somehow still unavailable. -Instead, the internet consists of NATs that generate IP/port pairs for outgoing connections (client) but critically, reject incoming connections (server) unless configured to forward ports. 
+One of the clients needs to have a static IP/port which is unlikely because of these pesky things called __NATs__. + +In a client/client protocol, one of the two peers needs to IP/port. + +You see, IPv4 addresses are expensive and somehow the Internet hasn't embraced IPv6 yet. +When a client creates a connection to a server, a NAT basically leases it a temporary IP/port pair until the connection is idle. +A client's public IP address can (and will) change potentially on a per-connection basis. -And those NATs are a headache. -WebRTC uses STUN/ICE (and the help of a middle server) to connect clients in what only can be described as a giant hack, smuggling IP/port pairs between "connections". -However even this doesnt work in all scenarios (symmetric NATs) and some peers have to resort to a TURN, literally proxying UDP/TCP traffic through a middle server. +And these NATs are a headache. +WebRTC uses STUN/ICE (and the help of a middle server) to connect clients in what only can be described as a giant hack: smuggling IP/port pairs between "connections". +However, this doesnt work in all scenarios (symmetric NATs) and some peers have to resort to a TURN, literally proxying UDP/TCP traffic through a middle server. But NATs are not the deal-breaker. That award is reserved for _multiple participants_. @@ -48,7 +60,7 @@ That award is reserved for _multiple participants_. You see, if Alice is on a peer-to-peer call with Bob, then Alice needs to send their audio/video to Bob only. Thats like 3Mb/s at least for good quality video. But if Candy, Dandy, Eric, Fritz, Geof, Harry, Iguana, Joel, Katrina, Luke, Mochael, Nigel, 🍊, Paul, Q, Raul, Steve, Tavana, Uguana, Video, Windy, Xylan, Yennifer, and Zee all join the call, then now Alice needs to send N copies of their media. -There's no way to share*; we need 78Mb/s of upload capacity now to make sure everyone gets the video. +There's no way to share the upload*, and now we need at least 78Mb/s of upload capacity now to make sure everyone gets the video. Your home Internet might be able to support that, but what about your cell phone? We need a different architecture. @@ -56,63 +68,85 @@ We need a different architecture. ## Design -1: Multicast "Well of course, you daft blog author, that's why you use multicast!" -It's true, multicast is the solution. +It's true, multicast let's you share uploads. Instead of sending N unicast packets, you could send 1 multicast packet. The participants would all join a multicast group and then the routers will figure it out. Unfortunately, multicast doesn't actually work in practice. -The main problem is that you're entirely reliant on ISPs to work together and build an optimal fanout architecture. +The main problem is that we're entirely reliant on ISPs to work together and build an optimal fanout architecture, and that's just not happening. There's limited support for multicast outside of datacenter environments that use the same brand of routers. -I'm no expert on the subject, but those experts have tried. - -But I am an expert on congestion control, something that multicast doesn't really support. -You see, if you're downloading live media at 6Mb/s and your Internet suddenly has issues, the unicast solution is to lower the send rate and only transmit important data (ex. audio). -But multicast is a firehose that can't be turned off by the sender. -The only option is to have a router drop (random*) packets turning your live stream into swiss cheese. So we're using unicast. 
But, we're using _the ideas behind multicast_. -The broadcaster MUST send only one copy of the media. -Instead of the network layer (L2) performing fanout, we use the application layer (L7) instead. +**New Requirement:** The broadcaster MUST send only one copy of the media. +Instead of the network layer (L2) performing fanout via multicast, we can fanout at the application layer (L7) instead. That's ultimately the premise behind CDNs: more expressive routing and caching can be performed at higher layers. But those are spoilers; keep reading. ## Design 1: Hub and Spoke -We're already at the most common architecture as far as I can tell (ex. Discord uses it). -From now on, there will be pros and cons for each design. - -But first, ask yourself: "How do you determine which user connects to which server?" +Okay so we're going to use servers with public IPs. +We must first ask ourselves: "How do you determine which client connects to which server?" I've got an answer for you: have everyone connect to the same server. -This is by far the simplest (and cheapest) architecture. -There are unfortunately two problems: +We're already at the most common conferencing architecture. +Discord uses it as do many conferencing applications. +It's a good deaign when meetings consist of a handful of cost-sensitive users. +I thought distributed systems design was supposed to be hard. +Unfortunately, there are two problems with our simple approach: 1. The maximum channel size is limited by the server size. 2. Users far away from the server will have a degraded experience. When you hear the phrase "WebRTC doesn't scale", this is the architecture being referenced. A single host has limits in terms of throughput and quality. -Our only option is to scale vertically, while other designs can scale horizontally. +If we want to have thousands of people in the same room our only option is to scale vertically. +You can throw larger CPUs and NICs at the problem but eventually you hit a limit. +Other designs can scale horizontally so this isn't a fundamental problem with WebRTC. -But like I said, it is the cheapest and simplest option which is why I think it's a good fit for an application like Discord. +And _who_ decides which server to use? +Ideally, the server is the closest on average to all users, but what if users trickle in one at a time? +There's no right answer, only convoluted business logic, but usually it's based on the first user to connect (aka the "host"). -But _who_ decides which server to use? -Ideally the server is the net closest to all users, but what if users trickle in one at a time? -There's no right answer, only convoluted business logic, but usually it's based on the *host*. -**Fun tip**: Next time you have a meeting, make sure the lone Euro member doesn't create the meeting. +**Fun tip**: Next time you have a meeting, make sure the solo Euro member doesn't create the meeting. +Let them suffer alone with poor video quality and free healthcare. -## Design 2: Point to Point -So you're serious about improving the user experience and have spare ~~cash~~ servers. +## Design 2: One to Many +So you're serious about improving the user experience and have additional cash to throw around? Good? -It's time to copy a HTTP CDN and put edge servers right next to each user. -The unspoken rule of the internet: it sucks. +There is an unspoken rule of the internet: it sucks. 
+The "public" internet consists of interconnected ISPs and transit providers that want to pay as little as possible while keeping customers happy enough. +You might have a fiber internet connection to your ISP's servers, but obviously they're not going to give you a dedicated fiber connection to everywhere in the world. +You're going to get a dedicated connection to the nearest speed test website and that's it. + +The internet is the world's largest sharing experiment. +But to get the most reliable experience, we need to explicitly not share. +We want our media packets to spend as little time as possible on the public **inter**net and as much time on our private **intra**net. + +That's right, it's time to copy a HTTP CDN and put edge servers next to each user. +Queue the map with all of the datacenters around the world. -We want our media packets to spend as little time as possible on the public internet and as much time on our private intranet. -We then have the ability to prioritize/reserve traffic instead of fighting with the commoners using transit. -It also gives us the ability to **deduplicate**, serving the same content to multiple users kinda like multicast. + + +That's the one, good stuff. + +We want servers around the world. +Users will connect to the closest server in order to minimize the amount of time their packets spend on the internet mingling with commoners. + +Once the packets are in our network, we can better control the quality/cost. +We're now a toddler who refuses to share with anyone else and can prioritize the most important traffic. + +How do you know which region is the closest? + +## Design 3: Deduplication + +So like I said, +We lease dedicated links and transit in regions that matter. +We also now have the ability to prioritize/reserve traffic instead of fighting with the commoners using transit. + +But critically, It also gives us the ability to **deduplicate**, serving the same content to multiple users kinda like multicast. That's why Google even has a CDN edge on a cruise ship. The other benefit of point-to-point is more subtle: it prevents __tromboning__. @@ -132,7 +166,7 @@ But the end result looks like a tangled mess. Or because it 'twis the season, the end result looks more like my 🎄 lights after they magically knotted themselves in the attic (a miracle). This is no good, because if every edge can connect to any other edge, there can be a lot of duplicated traffic on each link. -## Design 3: A Tree +## Design 4: A Tree The internet certainly behaves like a net. There are many forks and traffic needs to figure out if it should go right or left. @@ -155,7 +189,7 @@ So yeah, we're doing some application-based routing to try and maximize the cach This means less duplicate media flowing over the backbone which means less compute and network capacity needed, which means less money burned on the cloud. It's just getting increasingly difficult to design as users can join from any edge and leave at any time. -## Design 4: Coalesce +## Design 5: Coalesce The key word in that last sentence is _any_. If a user can connect to _any_ edge, then we've already started at a disadvantage. @@ -176,7 +210,7 @@ There's a few ways to accomplish this. One approach is to create a service that assigns each user to a host, but it quickly becomes a monolith as it needs the state of the entire system. Another approach is to rely on hashing, returning a stateless assignment based on the meeting name with some mechanism to avoid overloading individual hosts. 
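One possible shape for that stateless approach is rendezvous (highest-random-weight) hashing, sketched below. The host names and the "skip overloaded hosts" rule are placeholders for illustration, not a real assignment policy.

```python
# Every node can compute the same "owner" for a meeting without asking a
# central assignment service; hosts that report themselves as overloaded are
# skipped, which is the crude spill-over mechanism.

import hashlib

def owner_for_meeting(meeting: str, hosts: list[str], overloaded: frozenset = frozenset()) -> str:
    def weight(host: str) -> int:
        digest = hashlib.sha256(f"{meeting}:{host}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    candidates = [h for h in hosts if h not in overloaded] or hosts
    return max(candidates, key=weight)

if __name__ == "__main__":
    hosts = [f"syd{i}.example.internal" for i in range(6)]   # hypothetical Sydney hosts
    first = owner_for_meeting("team-standup", hosts)
    print(first)
    # If that host reports it's full, the next-best host wins instead.
    print(owner_for_meeting("team-standup", hosts, frozenset({first})))
```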
-## Design 5: Optimization +## Design 6: Optimization Now we're in the optimization end-game. One key flaw in the previous designs is the word "any". @@ -218,3 +252,6 @@ Welcome to our Sydney data center. I believe this optimization problem is impossible to solve at scale. But it's also the secret sauce behind every CDN; as you scale up, more can be cached but where do you draw the line between quality/cost? +# quic.video + + From 9744b9b865caa866fcab5402cb560118d94b1e30 Mon Sep 17 00:00:00 2001 From: kixelated Date: Wed, 18 Dec 2024 18:30:22 -0800 Subject: [PATCH 6/7] Update live-arch.mdx --- src/pages/blog/live-arch.mdx | 43 +++++++++++++++++++++--------------- 1 file changed, 25 insertions(+), 18 deletions(-) diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx index bdd5024..365bedc 100644 --- a/src/pages/blog/live-arch.mdx +++ b/src/pages/blog/live-arch.mdx @@ -2,16 +2,11 @@ It's time to get that Staff Engineer promotion. That's right, we're talking about _distributed systems design_. -It's a topic with surprisingly few (public) resources out there because the big companies that have to solve these scaling problems are not eager to share. +We're going to design a live video system one interation at a time with increasing complexity. -Well I am! - -I'm going to walk through a few designs for the same system with increasing complexity. -As with many design problems, there's no right answer. -This is not a leet-code problem where you can count the number of cycles a program takes to execute. -It all boils down to pros and cons depending on your problem. - -But @me if you know best because you're a super smart Web3 affifiando while I used to eat paste in school (true). +Be warned that there is no right answer. +This is not an optimization problem where you can literally count the number of cycles a program takes to execute. +It all boils down to pros and cons and how they align with your specific product. ## Requirements @@ -112,7 +107,7 @@ There's no right answer, only convoluted business logic, but usually it's based **Fun tip**: Next time you have a meeting, make sure the solo Euro member doesn't create the meeting. Let them suffer alone with poor video quality and free healthcare. -## Design 2: One to Many +## Design 2: Edge Nodes So you're serious about improving the user experience and have additional cash to throw around? Good? @@ -132,19 +127,31 @@ Queue the map with all of the datacenters around the world. That's the one, good stuff. -We want servers around the world. -Users will connect to the closest server in order to minimize the amount of time their packets spend on the internet mingling with commoners. +So we provision servers around the world and users will connect to the closest one. +This minimizes the amount of time our user's packets spend mingling with the commoners. +We're now a toddler who refuses to share with anyone else and can better control the quality/cost once the packets are in our network. + +But how does a user know which server is the closest? +There's a few options: + +1. Ping them all, use the min. +2. Use GeoDNS +3. Use anycast. + +I'm a big anycast 🪭 as are most CDNs. +Basically, all of our servers advertise the same IP address and BGP figures out which one is closest. +This gives ISPs greater control on how traffic should be routed, potentially over paths with higher RTTs but lower congestion. + +But I'm going to stop blabbing about anycast; we've got _distributed systems_ to design. 
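For completeness, option 1 ("ping them all, use the min") might look something like this on the client. The edge hostnames are placeholders, and a real client would probe over whatever protocol it actually plans to use rather than a bare TCP connect.

```python
import socket
import time

EDGES = ["syd.edge.example.com", "sfo.edge.example.com", "fra.edge.example.com"]

def probe(host: str, port: int = 443, timeout: float = 1.0) -> float:
    """Very rough handshake RTT; unreachable edges lose automatically."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")

def closest_edge(edges: list[str]) -> str:
    return min(edges, key=probe)

if __name__ == "__main__":
    print("connecting to", closest_edge(EDGES))
```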
-Once the packets are in our network, we can better control the quality/cost. -We're now a toddler who refuses to share with anyone else and can prioritize the most important traffic. +Note that we still have a central server that "owns" the meeting room. +This is commonly called the "origin" and each edge node needs to know it's address to proxy media. +Store this in a database or something. -How do you know which region is the closest? ## Design 3: Deduplication -So like I said, -We lease dedicated links and transit in regions that matter. -We also now have the ability to prioritize/reserve traffic instead of fighting with the commoners using transit. +So the previous design has users connect to the closest server, but how do the servers But critically, It also gives us the ability to **deduplicate**, serving the same content to multiple users kinda like multicast. That's why Google even has a CDN edge on a cruise ship. From fcb35c86ebfec9a75d302c132c059289c6dd3827 Mon Sep 17 00:00:00 2001 From: kixelated Date: Thu, 6 Feb 2025 09:17:38 -0800 Subject: [PATCH 7/7] Update live-arch.mdx --- src/pages/blog/live-arch.mdx | 52 +++++++++++++++++++----------------- 1 file changed, 27 insertions(+), 25 deletions(-) diff --git a/src/pages/blog/live-arch.mdx b/src/pages/blog/live-arch.mdx index 365bedc..2925dff 100644 --- a/src/pages/blog/live-arch.mdx +++ b/src/pages/blog/live-arch.mdx @@ -1,12 +1,12 @@ -# Live Video Architecture: For Scale and Profit? +# Live Video Architecture: Scale or Profit? It's time to get that Staff Engineer promotion. That's right, we're talking about _distributed systems design_. -We're going to design a live video system one interation at a time with increasing complexity. +It's not every day that you get to design a live video system (without blindly asking ChatGPT at least). -Be warned that there is no right answer. -This is not an optimization problem where you can literally count the number of cycles a program takes to execute. -It all boils down to pros and cons and how they align with your specific product. +Each is an iteration on the previous design with increasing complexity and usually cost. +Be warned that there is never a "right" answer in systems design and the investment you should make depends solely on your product. +Start small and build upon success and failure ## Requirements @@ -19,7 +19,7 @@ Think Discord or Google Meet or Zoom or _your least favorite meeting app_ (there Considerations to be had: - Some users will have poor networks. - Some users will want higher quality/latency (ex. screen share) -- Some users might be geo-dispersed. ex. some might live in Europe while some live in Freedom. +- Some users might be geo-dispersed. ex. some might live in Europe while some might live in North of Mexico, and there's an ocean between them. - Some customers might be willing to pay for better connectivity/quality. Got that? @@ -29,28 +29,26 @@ Basically just make a conferencing application. We start with somehow both the least and most complex design: __peer-to-peer__. That's right, boot up your favorite game from the 2000s and enter your friend's IP address manually. -We don't need no stinking servers, at least until your "friend" enables `noclip`. +We don't need no stinking servers... at least until your "friend" enables `noclip`. The design is very simple, even if the implementation isn't. Users connect directly to each other with only routers and switches in the middle. 
-There should still _some_ central server to exchange user details, like a "lobby" service, because we don't want to enter IPs manually or scan the Internet for friends. -Sorry, we're not going to build "Media over Blockchain". The very first problem with P2P hits us immediately: how do you connect? One of the clients needs to have a static IP/port which is unlikely because of these pesky things called __NATs__. -In a client/client protocol, one of the two peers needs to IP/port. +Even in a client/client protocol, one of the clients needs a public IP address. +You see, firewalls and NATs are widespread (even with IPv6) and unless configured (port forwarding), they only allow outgoing connections. +Even a fully peer-to-peer protocol requires at least one client to be the designated IT guy and act as a server just to get connections established. -You see, IPv4 addresses are expensive and somehow the Internet hasn't embraced IPv6 yet. -When a client creates a connection to a server, a NAT basically leases it a temporary IP/port pair until the connection is idle. -A client's public IP address can (and will) change potentially on a per-connection basis. +But you also should still have _some_ central server to exchange user details, like a "lobby" service, because we don't want to enter IPs manually or scan the Internet for friends. +No, we're not going to build "Media over Blockchain". +This same server can be used to punch holes in NATs *most of the time* via a protocol like ICE/STUN used in WebRTC. +I say *most of the time* with slanty text because it doesn't work for symmetric NATs. +Depending on the router you have at home, peer-to-peer might be impossible without literally proxying all traffic through an intermediate server (TURN). -And these NATs are a headache. -WebRTC uses STUN/ICE (and the help of a middle server) to connect clients in what only can be described as a giant hack: smuggling IP/port pairs between "connections". -However, this doesnt work in all scenarios (symmetric NATs) and some peers have to resort to a TURN, literally proxying UDP/TCP traffic through a middle server. - -But NATs are not the deal-breaker. -That award is reserved for _multiple participants_. +But NATs are not the deal-breaker for P2P in many applications. +That award is reserved for _fanout_. You see, if Alice is on a peer-to-peer call with Bob, then Alice needs to send their audio/video to Bob only. Thats like 3Mb/s at least for good quality video. @@ -63,13 +61,17 @@ We need a different architecture. ## Design -1: Multicast "Well of course, you daft blog author, that's why you use multicast!" -It's true, multicast let's you share uploads. -Instead of sending N unicast packets, you could send 1 multicast packet. -The participants would all join a multicast group and then the routers will figure it out. +It's true, multicast lets you send one packet to multiple destinations. +The participants would all join a multicast group and then the network would figure it out. + +"But, you daft blog author, nobody uses multicast!" + +It's true, the protocol designed to solve all of our problems doesn't actually work on the internet. +There are a number of reasons but I'd be lying if I said I knew why. +My theory is that its because multicast requires the world's ISPs to collaborate and work together to build what amounts to a global CDN. +Multicast is still a good option within internal networks and datacenters provided you buy the right networking equipment. 
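For what it's worth, here's roughly all it takes on a friendly internal network where multicast does work: one `send()` reaches every receiver that joined the group. The group address and port are made-up numbers, and this assumes the switches/routers in between actually honor the membership.

```python
import socket
import struct

GROUP, PORT = "239.1.2.3", 5004   # administratively-scoped multicast group (made up)

# Receiver: join the group, then just read packets.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
recv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
recv.bind(("", PORT))
membership = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
recv.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

# Sender: one packet out, and the network makes the copies.
send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
send.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)  # keep it on the local network
send.sendto(b"one packet, many receivers", (GROUP, PORT))

print(recv.recvfrom(2048))
```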
-Unfortunately, multicast doesn't actually work in practice. -The main problem is that we're entirely reliant on ISPs to work together and build an optimal fanout architecture, and that's just not happening. -There's limited support for multicast outside of datacenter environments that use the same brand of routers. +So instead of asking ISPs to build a global CDN, we're going to build a global CDN using unicast. So we're using unicast. But, we're using _the ideas behind multicast_.
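To make that concrete, here's a toy sketch of multicast's ideas at L7: the broadcaster uploads exactly one copy, and a relay we operate makes the copies. The names are invented and this isn't any particular protocol — just the shape of application-layer fanout.

```python
import asyncio

class Relay:
    def __init__(self):
        self.subscribers: set[asyncio.Queue] = set()

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self.subscribers.add(queue)
        return queue

    async def publish(self, packet: bytes) -> None:
        # One packet in, N packets out -- but the copies are made on our
        # servers, not on the broadcaster's uplink.
        for queue in self.subscribers:
            await queue.put(packet)

async def main():
    relay = Relay()
    viewers = [relay.subscribe() for _ in range(3)]
    await relay.publish(b"keyframe")
    print([await viewer.get() for viewer in viewers])

asyncio.run(main())
```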