Commit 9541864

Add a section about operational readiness
1 parent c755401 commit 9541864

1 file changed

Lines changed: 31 additions & 6 deletions

docs/admin/going-into-production.md

@@ -313,7 +313,7 @@ needs. You should also use storage with high [IOPS] when possible to
 improve CrateDB performance.
 :::

-On a Unix-like system, you might mount an external volume to a path like
+On a [Unix-like] system, you might mount an external volume to a path like
 `/opt/cratedb`. If you are installing CrateDB by hand, you can then set
 [CRATE_HOME] to `/opt/cratedb`. Make sure to set `CRATE_HOME` before
 running {ref}`bin/crate <crate-reference:cli-crate>`.
@@ -401,26 +401,51 @@ For security reasons, most production clusters should use wire encryption for
 network traffic between nodes and clients. Check out the reference manual on
 {ref}`secured communications <crate-reference:admin_ssl>` for more information.

+(prod-monitoring)=
+
+## Operational readiness
+
+Going into production is not a one-time step. Operating CrateDB reliably
+requires continuous monitoring, maintenance, and lifecycle management.
+The following checklist highlights important aspects to consider for production clusters.
+
+- **Cluster health and capacity management**
+  - **Shard sizes:** Monitor your shard sizes to remain around 50 GB ({ref}`sharding-partitioning`). Especially for partitioned tables, observe how your data volume changes over time.
+  - **Disk usage:** If the {ref}`low watermark threshold <cluster.routing.allocation.disk.watermark.low>` is exceeded, CrateDB will no longer allocate new shards on affected nodes. Monitor your disk usage to guarantee seamless shard allocation.
+  - **Shard count per node:** The number of open shards per node is limited. Monitor your shard count to prevent exceeding {ref}`cluster.max_shards_per_node`.
+  - **Cluster and node health:** Several system tables expose the status of various health checks. Regularly check {ref}`sys-node-checks`, {ref}`sys-health`, and {ref}`sys-cluster_health`.
+- **Lifecycle and maintenance management**
+  - **Keep CrateDB up-to-date:** Regularly upgrade CrateDB to stay within supported versions. Consult the [Support Terms] regarding end-of-life policies.
+  - **Keep your ecosystem up-to-date:** Keep drivers, frameworks, and other tools within versions that are supported by their respective providers.
+  - **Data hygiene:** Delete data you no longer need, such as old partitions, columns, or deprecated tables.
+- **Disaster scenarios and planning**
+  - **Plan scenarios:** Actively think about failure scenarios you want to be protected against and their implications on your setup, such as {ref}`replication <ddl-replication>`, number of nodes, {ref}`multi-zone setup <multi-zone-setup>`, etc.
+  - **Practice recovery:** Test your contingency plans, observe how CrateDB and other components in your stack behave and ensure you have error logging, retries, the ability to replay ingestion payloads, and similar mechanisms in place.
+- **Self-managed: additional requirements** (if you are not using CrateDB Cloud)
+  - **Monitoring:** Have both operating system/container-level metrics such as CPU, I/O, memory, and network-related metrics available, as well as CrateDB's own {ref}`jmx_monitoring`.
+  - **Snapshots:** Regular {ref}`snapshots <snapshot-restore>` enable point-in-time recovery.
+  - **Infrastructure lifecycle:** Apply regular updates to your operating system, container runtime, etc. as well. If you are running in the cloud, switch to recent VM and storage generations.
+  - **TLS certificates:** When using wire encryption, renew your certificates in time to prevent communication breakdowns.
+- **Support readiness**
+  - When engaging with CrateDB support, have logs and monitoring metrics ready to share. In certain situations, CrateDB support may also ask for a {ref}`jfr`, {ref}`heap dump <jcmd>`, or [system table export].
+
 [configuration]: inv:crate-reference#config
 [configure]: inv:crate-reference#config
 [crate_heap_dump_path]: inv:crate-reference#conf-env-dump-path
-[crate_heap_size]: inv:crate-reference#conf-env-heap-size
 [crate_home]: inv:crate-reference#conf-env-crate-home
-[crate_java_opts]: inv:crate-reference#conf-env-java-opts
 [data paths]: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#paths
-[filesystem hierarchy standard]: https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard
 [iops]: https://en.wikipedia.org/wiki/IOPS
 [linux filesystem hierarchy]: https://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/
 [localhost]: https://en.wikipedia.org/wiki/Localhost
 [multiple types of node]: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#node-types
 [network.host]: inv:crate-reference#network.host
 [node.name]: inv:crate-reference#node.name
 [path settings]: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#paths
-[path.data]: inv:crate-reference#path.data
-[raid 0]: https://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_0
 [shared-nothing]: https://en.wikipedia.org/wiki/Shared-nothing_architecture
 [stderr]: https://en.wikipedia.org/wiki/Standard_streams
 [symbolic links]: https://en.wikipedia.org/wiki/Symbolic_link
 [systemd]: https://github.com/systemd/systemd
 [timeout settings]: https://cratedb.com/docs/crate/reference/en/latest/config/node.html#garbage-collection
 [unix-like]: https://en.wikipedia.org/wiki/Unix-like
+[support terms]: https://cratedb.com/legal/support-terms
+[system table export]: https://cratedb-toolkit.readthedocs.io/cfr/systable.html
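
The "Shard sizes" and "Disk usage" items in the new checklist could be turned into an automated check along the following lines. This is only a minimal sketch and not part of the commit: the two queries read `sys.shards` and `sys.nodes`, while the `Limits` class, the helper function names, and the 85 % disk ratio are illustrative assumptions (the ratio should mirror whatever `cluster.routing.allocation.disk.watermark.low` is set to on the cluster in question). The helpers operate on plain row tuples, so any CrateDB client can supply the data.

```python
from dataclasses import dataclass

# Illustrative queries; run them with the client of your choice.
# "primary" is quoted because it collides with an SQL keyword.
SHARD_SIZE_SQL = """
SELECT schema_name, table_name, id, size
FROM sys.shards
WHERE "primary" = true
"""

DISK_USAGE_SQL = """
SELECT name, fs['total']['used'], fs['total']['size']
FROM sys.nodes
"""


@dataclass(frozen=True)
class Limits:
    """Hypothetical alerting thresholds; adjust to your own cluster."""
    max_shard_bytes: int = 50 * 10**9   # "around 50 GB" from the checklist
    disk_used_ratio: float = 0.85       # assumed low watermark setting


def oversized_shards(rows, limits=Limits()):
    """Return (schema, table, shard id) for shards above the size limit.

    `rows` are tuples shaped like the result of SHARD_SIZE_SQL.
    """
    return [
        (schema, table, shard_id)
        for schema, table, shard_id, size in rows
        if size > limits.max_shard_bytes
    ]


def nodes_near_watermark(rows, limits=Limits()):
    """Return node names whose disk usage reached the configured ratio.

    `rows` are tuples shaped like the result of DISK_USAGE_SQL.
    Nodes reporting a total size of 0 are skipped.
    """
    return [
        name
        for name, used, total in rows
        if total > 0 and used / total >= limits.disk_used_ratio
    ]
```

Keeping the thresholds in one small dataclass makes it easy to align the check with the cluster's actual settings instead of hard-coding them in several places.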
