feat: add OpenEBS LocalPV LVM provisioner #1624
Open
+76
−0
Problem
While troubleshooting issues in the staging environment, it became apparent that draining individual nodes often caused the entire MariaDB cluster to become unresponsive, even when the drained node was not hosting all of the affected pods.
Further investigation showed that the storage backend for the databases does not behave the way we had assumed. Last year, in an effort to reduce latency caused by write amplification, we switched the database PVCs from the ceph-block-replicated StorageClass to ceph-block-single.
As a reminder, ceph-block-replicated creates 3 replicas of each volume at the Ceph level. Since both MariaDB and Postgres also replicate data at the application level, we were effectively writing the same logical data to 18 physical disks (3 replicas at the DB layer × 3 replicas at the storage layer × 2 copies on RAID‑1).
The ceph-block-single StorageClass was introduced with the intention of reducing this to 3 × 1 × 2, i.e. 6 copies in total. This was configured in Rook by setting replicas: 1 on the corresponding CephBlockPool and then using that pool from the StorageClass, and it appeared to work as expected.
However, replicas: 1 does not mean “store all data for this volume on a single node.” It only means “keep a single copy of each object.” Ceph will still split the volume into multiple placement groups (PGs) and distribute those PGs across all participating OSDs in the cluster. In other words, even with replicas: 1, the data that backs a given PVC can be physically spread across multiple nodes.
In practice, this means that when we drain 1 out of 3 nodes, there is a high chance that one or more database instances on the other nodes lose access to their underlying volume, because the OSD(s) responsible for their data happen to be scheduled on the drained node. There is no guarantee that the OSDs backing a given PVC are located on the same node where the database pod is running.
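The placement behavior described above can be illustrated with a minimal sketch. This is a toy model, not Ceph's CRUSH algorithm: it assumes a hypothetical cluster with one OSD per node, uses the default 4 MiB RBD object size, and substitutes a plain hash for real PG/OSD mapping. All function names are illustrative only.

```python
import hashlib

NODES = ["node1", "node2", "node3"]  # assumption: one OSD per node
OBJECT_SIZE = 4 * 1024 * 1024        # 4 MiB, the default RBD object size


def place_object(volume: str, index: int, nodes: list[str]) -> str:
    """Pick the single node that stores one object (a hash stands in for CRUSH)."""
    digest = hashlib.sha256(f"{volume}.{index}".encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def nodes_backing_volume(volume: str, size_bytes: int, nodes: list[str]) -> set[str]:
    """Return every node that holds at least one object of the volume."""
    object_count = -(-size_bytes // OBJECT_SIZE)  # ceiling division
    return {place_object(volume, i, nodes) for i in range(object_count)}


def volumes_hit_by_drain(volumes: dict[str, int], drained: str, nodes: list[str]) -> list[str]:
    """List the volumes that lose at least one object when `drained` goes away."""
    return sorted(
        name for name, size in volumes.items()
        if drained in nodes_backing_volume(name, size, nodes)
    )


if __name__ == "__main__":
    # Hypothetical 1 GiB volumes, i.e. 256 objects each, one copy per object.
    volumes = {f"mariadb-{i}": 1024**3 for i in range(3)}
    for node in NODES:
        print(node, "->", volumes_hit_by_drain(volumes, node, NODES))
```

In this toy model a volume with only a few objects may depend on a subset of the nodes, while a volume with many objects tends to depend on all of them. Either way, which nodes a volume depends on is unrelated to where the pod that mounts it is scheduled.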
For example, consider the following situation:
If we drain node2, all MariaDB instances remain functional. If we drain node1, we simultaneously lose access to the volumes for both mariadb-0 and mariadb-2, even though one of those pods is still scheduled on node3. From the database operator’s perspective, nodes that are “up” can suddenly lose their storage when an unrelated node is drained.
Solution
Instead of using distributed rook-ceph volumes for databases that already implement their own replication, we should switch those workloads to node‑local volumes that are still dynamically provisioned and consumable via PVCs, so that operators can use them just like any other StorageClass.
This PR proposes solving that problem with the OpenEBS Local PV LVM provisioner. It provides semantics similar to the Kubernetes built‑in Local PV static provisioner, but without requiring administrators to pre-create and manage individual volumes on each node.
Instead, OpenEBS Local PV LVM dynamically carves out logical volumes from one or more configured LVM Volume Groups on each node and exposes them through a StorageClass. This gives us:
This solution was tested in the dev environment with MariaDB.
P.S. The "replicated" storage provider feature has been explicitly disabled for now to limit the scope, but in the future we may want to evaluate it as a replacement for Rook.