---
sidebar_label: Consistency Guarantees
---

# Consistency Guarantees

## OM (Ozone Manager) HA Consistency

:::info
Notice: Before Ozone 2.2.0, all operations in OM are linearizable. Once [HDDS-14424](https://issues.apache.org/jira/browse/HDDS-14424) is released in Ozone 2.2.0, users will have more options to configure the consistency guarantees for OM based on the trade-off between scalability, throughput, and staleness.
:::

### Default Configuration (Non-linearizable)
- **Read Path**: Only the leader serves read requests
- **Mechanism**: Reads query the state machine directly without ReadIndex
- **Guarantee**: **Non-linearizable** - may return stale data during leader transitions
- **Performance**: No heartbeat rounds required for reads, better latency
- **Risk**: A deposed leader may briefly serve stale reads during a leadership transition, e.g., under a network partition, or when a slow leader has lost leadership without yet knowing it. Writes are unaffected: they always go through consensus, so split-brain is not possible for writes
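A minimal model (illustrative Python, not Ozone code) of why reads served from a node's local state machine can be stale without a leadership check:

```python
# Hypothetical model of the non-linearizable read path: each node answers
# reads from its local state machine without confirming leadership.
# Names are illustrative, not Ozone APIs.

class Node:
    def __init__(self, name):
        self.name = name
        self.state = {}          # local state machine
        self.is_leader = False

    def read(self, key):
        # Non-linearizable read: answered locally, no leadership check.
        return self.state.get(key)

old_leader = Node("om1"); old_leader.is_leader = True
new_leader = Node("om2")

# Both replicas have applied the same committed write.
old_leader.state["k"] = "v1"
new_leader.state["k"] = "v1"

# A partition occurs: om2 wins an election and commits a write om1 never sees.
old_leader.is_leader = False          # om1 does not know this yet
new_leader.is_leader = True
new_leader.state["k"] = "v2"

# om1 still answers from its stale state: a short split-brain, for reads only.
assert old_leader.read("k") == "v1"   # stale
assert new_leader.read("k") == "v2"   # current
```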

### Optional: Linearizable Reads (starting in Ozone 2.2.0)
- **Configuration**: `ozone.om.ha.raft.server.read.option=LINEARIZABLE`
- **Mechanism**: Uses Raft ReadIndex (Raft section 6.4)
- **Guarantee**: Linearizability - reads reflect all committed writes
- **Trade-off**: Requires leader to confirm leadership via heartbeat rounds
- **Benefit**: Both the leader and followers can serve reads
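The steps above can be sketched as follows. This is a simplified, single-threaded model of the ReadIndex protocol from the Raft dissertation (section 6.4); the names are illustrative and not the Ratis API:

```python
# Sketch of a ReadIndex-style linearizable read: record the commit index,
# confirm leadership with a quorum heartbeat round, wait for the state
# machine to catch up, then serve the read locally.

class RaftServer:
    def __init__(self):
        self.commit_index = 0
        self.applied_index = 0
        self.state = {}

    def heartbeat_round(self):
        # Leader confirms it still holds leadership by reaching a quorum.
        # Modeled as always succeeding; a real server may discover it was
        # deposed here and must reject the read.
        return True

    def apply_up_to(self, index):
        # Modeled as an immediate catch-up of the state machine.
        self.applied_index = max(self.applied_index, index)

    def linearizable_read(self, key):
        read_index = self.commit_index          # 1. record current commit index
        if not self.heartbeat_round():          # 2. confirm leadership via quorum
            raise RuntimeError("not leader")
        while self.applied_index < read_index:  # 3. wait for catch-up
            self.apply_up_to(read_index)
        return self.state.get(key)              # 4. read reflects all commits

s = RaftServer()
s.state["vol/bucket/key"] = "metadata"
s.commit_index = 5
assert s.linearizable_read("vol/bucket/key") == "metadata"
```

A follower can run the same steps by asking the leader for the read index, which is why this option lets both the leader and followers serve reads.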

### Write Path
- **All writes** go through Ratis consensus for replication
- **Application**: Single-threaded executor ensures **ordered application** of transactions
- **Double Buffer**: Used for batching responses while maintaining ordering
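A rough sketch of the double-buffer idea, assuming a single applier thread and a separate flusher (illustrative only, not the actual OM double-buffer implementation):

```python
# Illustrative double buffer: the single applier thread appends responses to
# the "current" buffer in transaction order, while a flush swaps buffers and
# persists the previous batch as one DB write. Batching improves throughput
# without reordering transactions.

class DoubleBuffer:
    def __init__(self):
        self.current = []     # being filled by the single applier thread
        self.flushed = []     # what has reached the DB, in order

    def add(self, txn_index, response):
        self.current.append((txn_index, response))

    def swap_and_flush(self):
        batch, self.current = self.current, []
        self.flushed.extend(batch)   # one batched DB write for the whole batch

buf = DoubleBuffer()
for i in range(1, 4):
    buf.add(i, f"resp-{i}")          # single-threaded apply preserves order
buf.swap_and_flush()
assert [i for i, _ in buf.flushed] == [1, 2, 3]   # ordering survives batching
```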

### Advanced Read Optimizations

#### Follower Read with Local Lease (starting in Ozone 2.2.0)
- Config: `ozone.om.follower.read.local.lease.enabled=true` (default)
- Allows followers to serve reads if:
- Follower is caught up (within `ozone.om.follower.read.local.lease.lag.limit=10000` log entries)
- Leader is active (heartbeat within `ozone.om.follower.read.local.lease.time.ms=5000`)
- Reduces read latency by load balancing across replicas
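The eligibility check implied by the two settings above can be sketched as follows. The config names come from this page; the check itself is a simplification, not the actual implementation:

```python
# Sketch of the follower-side check: serve a read locally only if the
# follower is caught up within the lag limit AND the leader's heartbeat
# is fresh enough that the local lease is still valid.

import time

LAG_LIMIT = 10_000        # ozone.om.follower.read.local.lease.lag.limit
LEASE_MS = 5_000          # ozone.om.follower.read.local.lease.time.ms

def can_serve_local_read(follower_applied_index, leader_commit_index,
                         last_leader_heartbeat_ms, now_ms=None):
    now_ms = time.time() * 1000 if now_ms is None else now_ms
    caught_up = leader_commit_index - follower_applied_index <= LAG_LIMIT
    leader_active = now_ms - last_leader_heartbeat_ms <= LEASE_MS
    return caught_up and leader_active

# Within the lag limit and a fresh heartbeat: follower answers locally.
assert can_serve_local_read(99_500, 100_000,
                            last_leader_heartbeat_ms=0, now_ms=1_000)
# Too far behind: the read must be routed to the leader instead.
assert not can_serve_local_read(50_000, 100_000,
                                last_leader_heartbeat_ms=0, now_ms=1_000)
```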

#### Leader Skip Linearizable Read
- Config: `ozone.om.allow.leader.skip.linearizable.read=false` (default)
- Leader serves committed reads locally without ReadIndex
- Lower latency, but reads may be stale if the node has lost leadership without noticing

## SCM (Storage Container Manager) HA Consistency

### Consistency Model
- **Strong consistency** via Apache Ratis (Raft consensus)
- **Simpler than OM**: Leader-only reads, no follower read options

### Read Path
- **Only the leader serves requests**
- No linearizable read options exposed
- No follower read support

### Write Path
- All mutations use `@Replicate` annotation
- Routed through `SCMHAInvocationHandler` to Ratis
- Replicated to followers before acknowledgment
- **Batched DB operations** via `SCMHADBTransactionBufferImpl`
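A minimal sketch of the batching idea, in the spirit of `SCMHADBTransactionBufferImpl` (simplified, not the real class):

```python
# Illustrative transaction buffer: replicated mutations accumulate in memory
# and are persisted as one batched DB write, instead of one DB commit per
# Ratis transaction.

class TransactionBuffer:
    def __init__(self):
        self.pending = []       # mutations applied since the last flush
        self.db = {}

    def add(self, key, value):
        self.pending.append((key, value))

    def flush(self):
        # One batched write; later mutations to the same key win, preserving
        # the apply order within the batch.
        for key, value in self.pending:
            self.db[key] = value
        self.pending.clear()

buf = TransactionBuffer()
buf.add("container-1", "OPEN")
buf.add("container-1", "CLOSED")   # later mutation in the same batch wins
buf.flush()
assert buf.db["container-1"] == "CLOSED"
assert buf.pending == []
```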

### Key Differences from OM
| Aspect | SCM HA | OM HA |
| ------------ | --------------------------------- | -------------------------------- |
| Read options | Leader-only | Linearizable, follower reads |
| Complexity | Simpler | More flexible |
| Use case | Metadata for containers/pipelines | High-volume namespace operations |

## DN (DataNode) ContainerStateMachine Consistency

### Concurrent Execution Model
This is the **key differentiator** from OM/SCM:

#### OM/SCM
Single global executor, **sequential application** of all transactions
- Ensures simple ordering and `lastAppliedIndex` tracking
- Trade-off: Lower throughput, which is acceptable because metadata throughput is rarely the bottleneck in a storage system

#### ContainerStateMachine
**Per-container concurrency**
- Config: `hdds.container.ratis.num.container.op.executors=10` (default)
- Multiple `applyTransaction` calls can execute **concurrently**
- Each container has its own `TaskQueue` for **per-container serialization**
- Different containers process operations **in parallel**

### Architecture

```mermaid
graph TB
subgraph "Ratis Log Entries"
TX1[Transaction 1<br/>Container A]
TX2[Transaction 2<br/>Container B]
TX3[Transaction 3<br/>Container A]
TX4[Transaction 4<br/>Container C]
TX5[Transaction 5<br/>Container B]
end

subgraph "ContainerStateMachine.applyTransaction()"
AT[applyTransaction Semaphore<br/>Max Concurrent Limit]
end

subgraph "Per-Container Task Queues"
QA[TaskQueue<br/>Container A<br/>FIFO]
QB[TaskQueue<br/>Container B<br/>FIFO]
QC[TaskQueue<br/>Container C<br/>FIFO]
end

subgraph "Shared Thread Pool"
direction LR
T1[Thread 1]
T2[Thread 2]
T3[Thread 3]
T4[...]
T10[Thread 10]
end

subgraph "Execution Model"
CONCURRENT[Concurrent across containers<br/>Different containers process in parallel]
SEQUENTIAL[Sequential within container<br/>Same container operations are serialized]
end

TX1 --> AT
TX2 --> AT
TX3 --> AT
TX4 --> AT
TX5 --> AT

AT --> QA
AT --> QB
AT --> QA
AT --> QC
AT --> QB

QA -.-> T1
QA -.-> T2
QB -.-> T3
QC -.-> T4
QB -.-> T10

T1 --> CONCURRENT
T3 --> CONCURRENT
T4 --> CONCURRENT

QA --> SEQUENTIAL
QB --> SEQUENTIAL
QC --> SEQUENTIAL

style AT fill:#ff9999
style QA fill:#99ccff
style QB fill:#99ccff
style QC fill:#99ccff
style CONCURRENT fill:#99ff99
style SEQUENTIAL fill:#ffff99
```

1. **Container Independence**: Operations on different containers don't conflict
2. **Per-Container Ordering**: Each container's operations are serialized via `TaskQueue`
3. **Throughput Optimization**: Multiple containers can process concurrently
4. **Semaphore Control**: `applyTransactionSemaphore` limits total concurrent operations
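The four rules above can be modeled deterministically. This is an illustrative sketch only; the real `ContainerStateMachine` uses a shared thread pool, per-container `TaskQueue`s, and `applyTransactionSemaphore`:

```python
# Model of the per-container execution rules: operations on the same
# container are serialized FIFO, while different containers make progress
# in parallel. The shared pool is modeled as rounds that take one task
# from every non-empty per-container queue.

from collections import defaultdict, deque

def apply_transactions(log_entries):
    queues = defaultdict(deque)          # one TaskQueue per container, FIFO
    for container, op in log_entries:
        queues[container].append(op)

    executed = []                        # (container, op) in execution order
    while any(queues.values()):
        for container in list(queues):   # containers progress "in parallel"
            if queues[container]:
                executed.append((container, queues[container].popleft()))
    return executed

log = [("A", "tx1"), ("B", "tx2"), ("A", "tx3"), ("C", "tx4"), ("B", "tx5")]
executed = apply_transactions(log)

# Whatever the interleaving across containers, per-container order matches
# the Ratis log order.
for c in "ABC":
    ops = [op for cc, op in executed if cc == c]
    assert ops == [op for cc, op in log if cc == c]
```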

### BCSID (Block Commit Sequence ID)
- **Purpose**: Tracks block commit order **within each container**
- **Per-container sequence number** incremented on each block commit
- **Used for**:
- Conflict detection during replication
- Recovery and validation
- Ensuring read requests are valid
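An illustrative sketch of BCSID-based read validation, assuming a hypothetical `Container` model (not the DataNode implementation): a replica can serve a block read only if its container has committed at least up to the block's BCSID.

```python
# Model of BCSID validation: each container keeps a monotonically increasing
# block commit sequence id; a read carrying a BCSID newer than the replica's
# is rejected, because the replica may not have applied that commit yet.

class Container:
    def __init__(self):
        self.bcsid = 0               # highest committed block sequence id
        self.blocks = {}

    def commit_block(self, block_id, data):
        self.bcsid += 1              # per-container, incremented on each commit
        self.blocks[block_id] = (self.bcsid, data)
        return self.bcsid

    def read_block(self, block_id, expected_bcsid):
        if self.bcsid < expected_bcsid:
            # Replica is behind: the requested commit may be missing here.
            raise ValueError("replica BCSID behind requested BCSID")
        return self.blocks[block_id][1]

c = Container()
bcsid = c.commit_block("blk-1", b"data")
assert c.read_block("blk-1", bcsid) == b"data"

stale = Container()                  # a replica that missed the commit
try:
    stale.read_block("blk-1", bcsid)
    raise AssertionError("stale replica should have rejected the read")
except ValueError:
    pass                             # rejected, as expected
```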