
Conversation


@skrobul skrobul commented Jan 22, 2026

Problem

While troubleshooting issues in the staging environment, it became apparent that draining individual nodes often caused the entire MariaDB cluster to become unresponsive, even when the drained node was not hosting all of the affected pods.

Further investigation showed that the storage backend for the databases does not behave the way we had assumed. Last year, in an effort to reduce latency caused by write amplification, we switched the database PVCs from the ceph-block-replicated StorageClass to ceph-block-single.

As a reminder, ceph-block-replicated creates 3 replicas of each volume at the Ceph level. Since both MariaDB and Postgres also replicate data at the application level, we were effectively writing the same logical data to 18 physical disks (3 replicas at the DB layer × 3 replicas at the storage layer × 2 copies on RAID‑1).

The ceph-block-single StorageClass was introduced with the intention of reducing this to 3 × 1 × 2, i.e. 6 copies in total. This was configured in Rook by setting replicas: 1 on the corresponding CephBlockPool and then using that pool from the StorageClass, and it appeared to work as expected.
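
For reference, the relevant pieces look roughly like the sketch below; the pool name, namespace, clusterID and the omitted CSI secret parameters are illustrative placeholders, not the exact values from this repository.

```yaml
# Sketch of the current Rook-based setup (names and IDs are illustrative).
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: single
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 1                        # keep a single copy of each object
    requireSafeReplicaSize: false  # Rook rejects size: 1 without this
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-single
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: single
  csi.storage.k8s.io/fstype: ext4
  # CSI secret parameters omitted for brevity
reclaimPolicy: Delete
allowVolumeExpansion: true
```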

However, replicas: 1 does not mean “store all data for this volume on a single node.” It only means “keep a single copy of each object.” Ceph will still split the volume into multiple placement groups (PGs) and distribute those PGs across all participating OSDs in the cluster. In other words, even with replicas: 1, the data that backs a given PVC can be physically spread across multiple nodes.

In practice, this means that when we drain 1 out of 3 nodes, there is a high chance that one or more database instances on the other nodes lose access to their underlying volume, because the OSD(s) responsible for their data happen to be scheduled on the drained node. There is no guarantee that the OSDs backing a given PVC are located on the same node where the database pod is running.

For example, consider the following situation:

Pod         Selected Node   OSD for PVC runs on
mariadb-0   node1           node3
mariadb-1   node2           node2
mariadb-2   node3           node1

If we drain node2, only mariadb-1 is affected: its pod and the OSD backing its volume sit on the same node, so the impact is contained and the cluster stays functional. If we drain node1, however, we evict mariadb-0 and at the same time lose the OSD backing mariadb-2’s volume, so two of the three instances are impacted at once, even though mariadb-2 is still scheduled on node3. From the database operator’s perspective, instances on nodes that are “up” can suddenly lose their storage when an unrelated node is drained.

Solution

Instead of using distributed rook-ceph volumes for databases that already implement their own replication, we should switch those workloads to node‑local volumes that are still dynamically provisioned and consumable via PVCs, so that operators can use them just like any other StorageClass.

This PR proposes solving that problem with the OpenEBS Local PV LVM provisioner. It provides semantics similar to the Kubernetes built‑in Local PV static provisioner, but without requiring administrators to pre-create and manage individual volumes on each node.

Instead, OpenEBS Local PV LVM dynamically carves out logical volumes from one or more configured LVM Volume Groups on each node and exposes them through a StorageClass. This gives us:

  • Node‑local storage for databases (no cross‑node Ceph hops on the I/O path).
  • Dynamic provisioning via PVCs, so existing operators (e.g. the MariaDB/Postgres operators) continue to work unchanged, aside from migrating to the new StorageClass (a sketch of which follows below).
  • Predictable failure domains: if a node goes down, only the database instances scheduled on that node are affected, which matches how the application‑level replication is already designed.
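
For illustration, a minimal OpenEBS Local PV LVM StorageClass could look like the sketch below; the class name and the volume group name (vg-db) are assumptions, not necessarily what this PR configures.

```yaml
# Minimal sketch of an OpenEBS Local PV LVM StorageClass, assuming an LVM
# volume group named "vg-db" already exists on each node.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvm-local
provisioner: local.csi.openebs.io
parameters:
  storage: lvm          # use the LVM flavour of Local PV
  volgroup: vg-db       # VG that logical volumes are carved from
  fsType: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true
# Delay binding until a pod is scheduled so the LV is created on the node
# that will actually run the database pod.
volumeBindingMode: WaitForFirstConsumer
```

WaitForFirstConsumer matters here: because the volume is node‑local, the scheduler has to pick the node first, and the logical volume is then provisioned on that node.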

This solution was tested with MariaDB in the dev environment.
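
From the operator’s point of view, the claims it creates are ordinary PVCs that only differ in the storageClassName; the example below is hypothetical (the claim name and size are made up).

```yaml
# Hypothetical PVC as a database operator would request it; only the
# storageClassName differs from the previous Ceph-backed setup.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-mariadb-0
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: openebs-lvm-local
  resources:
    requests:
      storage: 20Gi
```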

PS: The "replicated" storage provider has been explicitly disabled for now to limit the scope of this change, but in the future we may want to evaluate it as a replacement for Rook.


cardoe commented Jan 23, 2026

Good with me. Hoping that @ctria will weigh in.

@ctria ctria left a comment

I can see the use case/need for something other than ceph here.

I don't see anything wrong with OpenEBS.
