[Feedback requested DO NOT MERGE] Rewriting the 'RPO and RTO' page to clear up common confusion #4091
|
|
||
| Temporal Cloud strives to maintain a P95 [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) of less than 1 minute. |
I removed this bit because
- I'm not sure p95 is good enough. This could be read as "up to 5% of Namespaces could be above the 1-minute RPO at any given moment."
- We already say we have a 1-minute RPO. I don't think we need additional standards / goals to be publicly stated. They would only add confusion. Let's state our main goal (RPO) and stand by it.
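To make the first concern concrete, here is a minimal sketch with made-up numbers. The lag samples and the helper names `percentile` and `share_above` are illustrative assumptions, not Temporal metrics or APIs. It shows that a P95 under 1 minute can coexist with an individual Namespace far above the 1-minute RPO.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def share_above(samples, threshold):
    """Fraction of samples strictly above the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)


# Hypothetical replication lag, in seconds, for 20 Namespaces at one instant:
# 19 are healthy, one is ten minutes behind.
lags = [5] * 19 + [600]

print(percentile(lags, 95))   # 5   (P95 is well under 60 seconds)
print(share_above(lags, 60))  # 0.05 (yet 5% of Namespaces exceed 1 minute)
```

In this toy data set the P95 goal is met even though one Namespace is ten minutes behind, which is the reading the comment warns about.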
|
|
||
| Internally, our components are distributed across a minimum of three availability zones per region. | ||
| We implement a cell architecture. | ||
| We implement a [cell architecture](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html). |
We've blogged about cells (https://temporal.io/blog/two-years-in#scale and https://temporal.io/blog/building-durable-cloud-control-systems-with-temporal#implementing-the-data-plane-a-cell-based-architecture), so we should have a first-class definition of what a Temporal Cloud cell is in our docs. I don't want to expand the scope here to a full Cloud architecture page (although I do think we should have one eventually). Maybe add a section on /cloud/service-availability and link to that?
|
|
||
| Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively. | ||
| In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure. | ||
| Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages. |
RPO and RTO are how we measure, and low values for RPO and RTO are what we strive for. Could tighten this phrasing.
| These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment. |
This sounds rough. Can we say RPO + RTO aren't part of the availability SLA instead (and link to the SLA page)?
| 1. **[High Availability](/cloud/high-availability) features enabled** (same-region, multi-region, or multi-cloud replication): Sub-1-minute RPO and 20 minutes or less RTO | ||
| 2. **Default (non-HA) namespace, regional failure**: 8-hour RPO and RTO | ||
| 3. **Default (non-HA) namespace, availability zone failure**: 0 RPO and RTO | ||
| When High Availability is enabled on a Namespace, the user chooses an "active" region (where processing happens) and a "replica" region (where processing will switch to in the event of a failure). If the active and replica are in the same cloud provider but different regions (e.g., AWS us-east-1 and AWS us-west-2), this is called Multi-region Replication. If the active and replica are in different cloud providers (e.g., AWS and GCP), this is called Multi-cloud Replication. If the active and replica are in the same region, this is called Same-region Replication. Temporal will always place the active and replica in different [cells](https://docs.aws.amazon.com/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/what-is-a-cell-based-architecture.html). |
If we add a cell definition in docs, update this link too
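The naming rules in the paragraph under review can be sketched as a small decision function. This is only an illustration of the wording in the diff: the `Placement` type and `replication_type` function are hypothetical names, not part of any Temporal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one side of a Namespace lives (hypothetical type for illustration)."""
    provider: str  # e.g. "aws" or "gcp"
    region: str    # e.g. "us-east-1"


def replication_type(active: Placement, replica: Placement) -> str:
    """Name the High Availability feature implied by the two placements."""
    if active.provider != replica.provider:
        return "Multi-cloud Replication"
    if active.region != replica.region:
        return "Multi-region Replication"
    # Same provider and same region; per the text, the active and replica
    # are still placed in different cells.
    return "Same-region Replication"


print(replication_type(Placement("aws", "us-east-1"), Placement("aws", "us-west-2")))
# Multi-region Replication
```

The provider check comes first because two placements in different cloud providers count as Multi-cloud Replication regardless of their region names.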
| In case of an outage in the active region or cell, Temporal Cloud will failover to the replica so that existing Workflow Executions will continue to run and new Executions can be started. | ||
|
|
||
| ## High Availability, Regional Failure | ||
| The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled: |
This breakdown is great!
| - "8-hour RPO" for Namespaces without the appropriate High Availability feature: Historically, regional outages have not led to data corruption or permanent data loss in data systems that are replicated across three availability zones. All Namespace data was available once the outage ended; affected Namespaces observed a recovery point of "zero" (no data loss). However, as a precaution, Temporal backs up all Namespaces in rolling 4-hour windows. Should a future outage cause permanent data loss in the underlying data system, these backups would meet an 8-hour RPO. | ||
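The text does not say how a rolling 4-hour backup window maps to an 8-hour RPO. One plausible reading, shown here purely as an assumption, is that the worst-case recovery point equals the backup interval plus the time a backup takes to complete and become restorable. The function names and the 4-hour completion time below are hypothetical.

```python
from datetime import timedelta


def worst_case_recovery_point(interval: timedelta, completion_time: timedelta) -> timedelta:
    """Oldest data you might have to fall back to under the assumption above.

    If a failure hits just before the newest backup becomes usable, the last
    usable backup started one full interval earlier, and the newest one has
    been running for up to completion_time.
    """
    return interval + completion_time


def meets_rpo(interval: timedelta, completion_time: timedelta, rpo: timedelta) -> bool:
    return worst_case_recovery_point(interval, completion_time) <= rpo


interval = timedelta(hours=4)         # from the text: rolling 4-hour windows
completion_time = timedelta(hours=4)  # assumption, not stated in the text
rpo = timedelta(hours=8)              # from the text: 8-hour RPO

print(worst_case_recovery_point(interval, completion_time))  # 8:00:00
print(meets_rpo(interval, completion_time, rpo))             # True
```

Under this reading, any backup that completes within one window keeps the worst case inside the stated 8 hours.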
|
|
||
| Temporal Cloud is designed to limit data loss after recovery when the incident triggering the failover is resolved. | ||
| - "Temporal-initiated failovers:" Also known as "automatic failovers," these failovers are initiated by Temporal's tooling and/or on-call engineers on Namespaces that have High Availability enabled. **Temporal highly recommends keeping Temporal-initiated failovers enabled,** which is the default for all Namespaces with High Availability features. Users can still trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. When Temporal-initiated failovers are disabled on a Namespace, Temporal's RTO for that Namespace is unbounded (it is dependent on how long the underlying outage lasts) |
Feels relevant to mention here: we don't control whether the customer ever actually initiates failover when there's an outage. I guess it doesn't change our RTO, but since this is a clarification section, maybe still useful to call out. I don't feel super strongly.
| Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including: | ||
|
|
||
| **Recovery Time Objective (RTO) - 20 minutes** | ||
| - Best-in-class data replication technology that keeps the replica up to date with the active. |
Could link to https://www.youtube.com/watch?v=mULBvv83dYM where Liang gets into more specifics
|
|
||
| Recovery time objective (RTO) for Temporal Cloud is 20 minutes or less per incident. | ||
| - Monitoring, alerting, and internal SLOs on the replication lag across all Temporal Cloud Namespaces. |
| - Monitoring, alerting, and internal SLOs on the replication lag across all Temporal Cloud Namespaces. | |
| - Monitoring, alerting, and internal SLOs on the replication lag for every Temporal Cloud Namespace. |
|
|
||
| **All writes to storage are synchronously replicated across AZs**, including our writes to ElasticSearch. | ||
| ElasticSearch is eventually consistent, but this does not impact our RPO as there is no data loss. | ||
| - You can detect outages that Temporal doesn't. In the cloud, regional outages never affect every service the same way. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor each service in your critical path and alert on unusual |
Can we suggest or link to specific guidance on how to do this?
| - You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you manually failover, you can choose the order in which these pieces switch to the replica region. You can then test that ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks. | ||
|
|
||
| This leads to the following objectives for availability zone failure: | ||
| - You can proactively failover more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before its known to be a true regional outage. |
| - You can proactively failover more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before its known to be a true regional outage. | |
| - You can proactively failover more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage. |
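The "sequence your failovers" point in the list above could be rehearsed with a small ordered runbook like the sketch below. It is a generic illustration, not Temporal tooling: the step names and the `run_failover` helper are hypothetical, and each action stands in for whatever real call moves that piece to the replica region.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_failover(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the names of the completed steps and the name of the failed
    step, or None if every step succeeded.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical ordering for a failover drill; every action is a placeholder.
drill: List[Step] = [
    ("data stores", lambda: True),
    ("Temporal Namespace", lambda: True),
    ("Temporal Workers", lambda: True),
]

completed, failed = run_failover(drill)
print(completed)  # ['data stores', 'Temporal Namespace', 'Temporal Workers']
print(failed)     # None
```

Stopping at the first failure keeps later pieces from switching regions while an earlier dependency is still in the old one, which is the kind of ordering problem a failover drill is meant to surface.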
DO NOT MERGE. Feedback requested first.
What does this PR do?
Corrects errors and clears up common confusion points around our RPO, RTO, and SLA.
Internal Note on the previously-stated 8-hour RTO / RPO for non-HA Namespaces:
Notes to reviewers
Todo items:
[ ] Must hear back from Eng re: what our RTO and RPO are for Same-region Replication
[ ] Must get alignment with Eng stakeholders @sergeybykov and @meiliang86 that this is an accurate framing of our RTO and RPO, especially re: the 8-hour RTO / RPO previously stated.
[ ] Determine whether we should discuss conflict resolution when talking about the RPO. Details in Slack