diff --git a/docs/operations/_index.md b/docs/operations/_index.md
index d01e61b314b..2d22155c77f 100644
--- a/docs/operations/_index.md
+++ b/docs/operations/_index.md
@@ -4,4 +4,30 @@ linkTitle: "Operations"
 no_section_index_title: true
 weight: 8
 menu:
----
\ No newline at end of file
+---
+
+This section covers day-2 operation of a Cortex cluster. Start here if you are
+running Cortex in production.
+
+## Core operator guides
+
+- [Monitoring Cortex]({{< relref "./monitoring-cortex.md" >}}) — install the
+  bundled dashboards, alert rules, and recording rules.
+- [Troubleshooting]({{< relref "./troubleshooting.md" >}}) — symptom-driven
+  decision tree for the write path, read path, storage, and rings.
+- [Upgrading]({{< relref "./upgrading.md" >}}) — version-to-version upgrade
+  procedure, component ordering, and downgrade caveats.
+
+## Specialized topics
+
+- [Scaling the Query Frontend]({{< relref "./scalable-query-frontend.md" >}})
+- [Query Auditor]({{< relref "./query-auditor.md" >}}) — detect query
+  correctness regressions.
+- [Query Tee]({{< relref "./query-tee.md" >}}) — compare two Cortex deployments
+  side-by-side.
+- [Requests Mirroring with Envoy]({{< relref
+  "./requests-mirroring-to-secondary-cluster.md" >}})
+
+For component-level operational guidance (HA pairs, shuffle sharding, zone
+replication, capacity planning, encryption), see the [Guides]({{< relref
+"../guides/" >}}) section.
diff --git a/docs/operations/monitoring-cortex.md b/docs/operations/monitoring-cortex.md
new file mode 100644
index 00000000000..74d33ed3029
--- /dev/null
+++ b/docs/operations/monitoring-cortex.md
@@ -0,0 +1,132 @@
+---
+title: "Monitoring Cortex"
+linkTitle: "Monitoring Cortex"
+weight: 1
+slug: monitoring-cortex
+---
+
+This page describes the bundled assets Cortex ships for monitoring a production
+deployment — Grafana dashboards, Prometheus alerting rules, and recording rules
+— and how to install them. The assets live in the repository and are kept in
+sync with the code; they are the same artifacts the Cortex maintainers use to
+operate their own clusters.
+
+## What ships with Cortex
+
+| Asset | Source | Purpose |
+|-------|--------|---------|
+| Dashboards (JSON) | `docs/getting-started/dashboards/` | Drop-in Grafana dashboards covering every Cortex component |
+| Alert rules | `docs/getting-started/alerts.yaml` | 50+ PrometheusRule alerts grouped by component |
+| Recording rules | `docs/getting-started/cortex-jsonnet/cortex-mixin/recording_rules.libsonnet` | Pre-aggregated series used by the dashboards and alerts |
+| Jsonnet mixin | `docs/getting-started/cortex-jsonnet/cortex-mixin/` | The source of truth — generates the JSON/YAML above |
+
+## Dashboards
+
+Each dashboard JSON in `docs/getting-started/dashboards/` is ready to import into
+Grafana via **Dashboards → Import → Upload JSON file**.
+
+| Dashboard | What to watch |
+|-----------|---------------|
+| `cortex-writes.json` | End-to-end write path: distributor QPS, ingestion rate, ingester push errors and latency, samples appended, WAL writes. The first dashboard to open during a write incident. |
+| `cortex-reads.json` | End-to-end read path: query QPS at the frontend, scheduler queue length, querier execution latency, store-gateway and ingester sub-queries. |
+| `cortex-queries.json` | Per-query breakdowns: chunks/series fetched, bytes processed, queries by tenant. Useful for hunting expensive queries. |
+| `cortex-slow-queries.json` | The slowest queries in the last interval, including the PromQL and the tenant. Pair with the query-frontend logs. |
+| `cortex-compactor.json` | Compactor run progress, blocks compacted vs. failed, sync errors. |
+| `cortex-compactor-resources.json` | CPU, memory, disk, and goroutines for the compactor pods. |
+| `cortex-object-store.json` | Object-store request rate, latency, and error rate broken down by operation (Get, Iter, Upload). |
+| `cortex-rollout-progress.json` | Rolling-deployment progress for stateful sets (ingester, store-gateway, compactor). |
+| `cortex-scaling.json` | Suggested replica counts derived from current load — pair with [Capacity Planning]({{< relref "../guides/capacity-planning.md" >}}). |
+| `cortex-config.json` | The runtime configuration currently in effect, by tenant. |
+| `alertmanager.json` | Alertmanager-specific: notification rate, replication, ring health. |
+| `ruler.json` | Ruler-specific: evaluation rate, missed evaluations, push and query errors. |
+
+Dashboards assume a Prometheus datasource named `Cortex`; either name your
+datasource that way or edit the dashboard variables on import. Several
+dashboards rely on the recording rules described below — install those first or
+some panels will be empty.
+
+## Alerts
+
+The bundled alerts in `docs/getting-started/alerts.yaml` are grouped by concern:
+
+| Group | Examples |
+|-------|----------|
+| `cortex_alerts` | `CortexIngesterUnhealthy`, `CortexRequestErrors`, `CortexRequestLatency`, `CortexQueriesIncorrect`, `CortexInconsistentRuntimeConfig`, `CortexKVStoreFailure`, `CortexMemoryMapAreasTooHigh` |
+| `cortex_ingester_instance_alerts` | `CortexIngesterReachingSeriesLimit`, `CortexIngesterReachingTenantsLimit`, `CortexDistributorReachingInflightPushRequestLimit` |
+| `cortex-rollout-alerts` | `CortexRolloutStuck` |
+| `cortex-provisioning` | `CortexProvisioningTooManyActiveSeries`, `CortexProvisioningTooManyWrites`, `CortexAllocatingTooMuchMemory` |
+| `ruler_alerts` | `CortexRulerTooManyFailedPushes`, `CortexRulerTooManyFailedQueries`, `CortexRulerMissedEvaluations`, `CortexRulerFailedRingCheck` |
+| `gossip_alerts` | `CortexGossipMembersMismatch` |
+| `etcd_alerts` | `EtcdAllocatingTooMuchMemory` |
+| `alertmanager_alerts` | `CortexAlertmanagerSyncConfigsFailing`, `CortexAlertmanagerRingCheckFailing`, `CortexAlertmanagerPartialStateMergeFailing`, `CortexAlertmanagerReplicationFailing`, `CortexAlertmanagerPersistStateFailing`, `CortexAlertmanagerInitialSyncFailed` |
+| `cortex_blocks_alerts` | `CortexIngesterHasNotShippedBlocks`, `CortexIngesterHasUnshippedBlocks`, `CortexIngesterTSDBHeadCompactionFailed`, `CortexIngesterTSDBWALCorrupted`, `CortexQuerierHasNotScanTheBucket`, `CortexQuerierHighRefetchRate`, `CortexStoreGatewayHasNotSyncTheBucket`, `CortexBucketIndexNotUpdated`, `CortexTenantHasPartialBlocks` |
+| `cortex_compactor_alerts` | `CortexCompactorHasNotSuccessfullyCleanedUpBlocks`, `CortexCompactorHasNotSuccessfullyRunCompaction`, `CortexCompactorHasNotUploadedBlocks` |
+
+For every alert, the file ships with `for`, `severity`, and a short summary in
+annotations. Treat these as a starting point — tune the thresholds (and which
+alerts page vs. ticket) to your SLOs.
+
+### Installing the alerts
+
+The alerts file is a standard Prometheus rule file. In Kubernetes with the
+Prometheus Operator, wrap it in a `PrometheusRule` resource; an example lives in
+`docs/getting-started/prometheusrule.yaml`. With a self-hosted Prometheus, add
+the file to `rule_files:` in `prometheus.yml`.
+
+If you also run a Cortex ruler, the same file can be loaded into Cortex itself
+via `cortextool rules load` (see [Sharded Ruler]({{< relref
+"../guides/sharded_ruler.md" >}})).
+
+## Recording rules
+
+The dashboards depend on a set of pre-aggregated metrics defined in
+`docs/getting-started/cortex-jsonnet/cortex-mixin/recording_rules.libsonnet`.
+These collapse per-instance counters into per-cluster/per-tenant rates so the
+dashboards stay fast on large deployments. Install them the same way you
+install the alerts — alongside, in the same Prometheus.
+
+Skipping the recording rules will leave several dashboard panels blank or
+extremely slow.
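As a rough illustration of the Prometheus Operator route described under "Installing the alerts" (which applies equally to the recording rules), the sketch below wraps already-parsed rule groups in a `PrometheusRule` manifest and prints it as JSON, which Kubernetes accepts alongside YAML. The resource name, namespace, and the sample group are hypothetical placeholders; in practice the groups would come from parsing the bundled rule files.

```python
import json


def wrap_in_prometheus_rule(groups, name, namespace):
    """Wrap Prometheus rule groups in a PrometheusRule manifest (as a dict)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"groups": groups},
    }


# Hypothetical stand-in for the groups parsed from a bundled rule file.
sample_groups = [
    {
        "name": "example_group",
        "rules": [
            {
                "alert": "ExampleAlert",
                "expr": "up == 0",
                "for": "5m",
                "labels": {"severity": "warning"},
            }
        ],
    }
]

# "cortex-alerts" and "monitoring" are assumed names, not Cortex defaults.
manifest = wrap_in_prometheus_rule(sample_groups, "cortex-alerts", "monitoring")
print(json.dumps(manifest, indent=2))
```

The example file `docs/getting-started/prometheusrule.yaml` remains the reference for what the finished resource should look like.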
+
+## The Jsonnet mixin
+
+If you already manage Prometheus rules and dashboards via Jsonnet/Tanka, import
+`docs/getting-started/cortex-jsonnet/cortex-mixin/` directly:
+
+```jsonnet
+local cortexMixin = import 'cortex-mixin/mixin.libsonnet';
+
+{
+  prometheusAlerts+:: cortexMixin.prometheusAlerts,
+  prometheusRules+:: cortexMixin.prometheusRules,
+  grafanaDashboards+:: cortexMixin.grafanaDashboards,
+}
+```
+
+The mixin honours the standard [monitoring-mixin
+contract](https://github.com/monitoring-mixins/docs), so it composes with mixins
+for Kubernetes, etcd, Memcached, and the other dependencies a Cortex cluster
+typically runs alongside.
+
+The mixin's `_config` block exposes knobs for the datasource name, single-binary
+vs. microservices mode, namespace/cluster labels, and per-component selectors.
+See `cortex-mixin/config.libsonnet` for the full list.
+
+## Tracing
+
+Dashboards and alerts cover RED metrics: rate, errors, and duration. For
+end-to-end request tracing, configure Cortex's OpenTelemetry/Jaeger exporter as
+described in [Tracing]({{< relref "../guides/tracing.md" >}}). The
+`cortex-slow-queries.json` dashboard surfaces a query ID that maps directly to a
+trace when tracing is enabled, making it easy to pivot from "this query was
+slow" to "here is where it spent its time."
+
+## Related
+
+- [Capacity Planning]({{< relref "../guides/capacity-planning.md" >}}) — sizing
+  inputs to feed the scaling dashboard.
+- [Tracing]({{< relref "../guides/tracing.md" >}}) — span exporter setup.
+- [Query Auditor]({{< relref "./query-auditor.md" >}}) — detecting query
+  correctness regressions.
+- [Query Tee]({{< relref "./query-tee.md" >}}) — comparing two Cortex
+  deployments side-by-side.
diff --git a/docs/operations/troubleshooting.md b/docs/operations/troubleshooting.md
new file mode 100644
index 00000000000..995ccb7a4d9
--- /dev/null
+++ b/docs/operations/troubleshooting.md
@@ -0,0 +1,187 @@
+---
+title: "Troubleshooting Cortex"
+linkTitle: "Troubleshooting"
+weight: 2
+slug: troubleshooting
+---
+
+A decision tree for the most common production issues. Each section starts with
+the symptom an operator sees, names the metrics and logs to inspect, and points
+to the upstream fix.
+
+The [bundled dashboards and alerts]({{< relref "./monitoring-cortex.md" >}})
+surface most of the signals referenced below. Install them first if you have
+not already.
+
+## Write path
+
+### Distributors return 5xx on `/api/v1/push`
+
+1. **Confirm where the error originates.** Distributor logs include the cause:
+   ingester unreachable, rate-limit exceeded, validation error. Filter for
+   `level=warn` and `level=error` on the distributor.
+2. **Check ingester health on the ring page** (`/ring` on any distributor). All
+   ingesters should be in state `ACTIVE`. `UNHEALTHY` or missing ingesters
+   point at a partition between distributor and ingester, or at the KV store.
+3. **Check the `CortexIngesterUnhealthy` alert.** If it is firing, follow it:
+   the offending ingester is in the alert's labels.
+4. **Inspect `cortex_distributor_ingester_append_failures_total`.** A non-zero
+   rate that matches the 5xx rate confirms ingester-side rejection.
+
+If the cause is `per-user limit exceeded`, raise the limit in `runtime_config`
+([Overrides]({{< relref "../guides/overrides.md" >}})) rather than scaling out.
+
+### Samples are accepted but never appear in queries
+
+1. **Verify the tenant header.** The push and the query must use the same
+   `X-Scope-OrgID`. The single most common cause of "missing data" is a
+   tenant-ID mismatch.
+2. **Check `cortex_ingester_memory_series` on the receiving ingester.** If
+   non-zero for the tenant, the data is in memory and queries should see it.
+3. **Confirm time-range overlap.** Ingesters serve recent data from the TSDB
+   head and from local on-disk blocks until they age out per
+   `-blocks-storage.tsdb.retention-period` (default `6h`). Queriers stop
+   consulting ingesters entirely for time ranges older than
+   `-limits.query-ingesters-within` (per-tenant, when set). Older data must
+   have been shipped and must be visible to the store-gateway via the bucket
+   index — check `cortex_ingester_shipper_uploads_total`, the
+   `CortexIngesterHasNotShippedBlocks` alert, and
+   `CortexBucketIndexNotUpdated`.
+
+### Distributor `inflight push requests` rejected
+
+The `CortexDistributorReachingInflightPushRequestLimit` alert fires when
+distributors near `-distributor.instance-limits.max-inflight-push-requests`.
+Either scale distributors horizontally or raise the limit if CPU and memory
+have headroom.
+
+## Read path
+
+### Queries time out at the frontend
+
+1. **Look at `cortex-reads.json` and `cortex-slow-queries.json`.** They show
+   queue depth, per-step latency, and the offending PromQL.
+2. **If the frontend queue is full** (`CortexFrontendQueriesStuck` or
+   `CortexSchedulerQueriesStuck`): there are not enough queriers, or queriers
+   are blocked on something downstream. Check querier CPU, then ingester and
+   store-gateway latency.
+3. **If the queue is empty but queries are still slow:** the bottleneck is in
+   the querier or below. Look at chunks fetched per query and bytes scanned —
+   an expensive query may need the protections in [Protecting Cortex from
+   Heavy Queries]({{< relref "../guides/protecting-cortex-from-heavy-queries.md"
+   >}}).
+
+### Queries return partial or no data for old time ranges
+
+Old data lives in object storage and is served by the store-gateway. Check:
+
+- `CortexStoreGatewayHasNotSyncTheBucket` — a stale store-gateway will not see
+  recently uploaded blocks.
+- `CortexBucketIndexNotUpdated` — the compactor maintains the bucket index;
+  querier and store-gateway use it to discover blocks.
+- `CortexQuerierHighRefetchRate` — symptom of store-gateways missing blocks
+  the querier expected to find.
+
+### Queries return incorrect results
+
+`CortexQueriesIncorrect` fires when the same query, run through the query-tee
+against two backends, disagrees. Cortex ships a [Query
+Auditor]({{< relref "./query-auditor.md" >}}) for this case; pair it with the
+[Query Tee]({{< relref "./query-tee.md" >}}) to bisect which deployment is
+wrong.
+
+## Storage path
+
+### Ingester is not shipping blocks
+
+The `CortexIngesterHasNotShippedBlocks` and `CortexIngesterHasUnshippedBlocks`
+alerts catch this. Common causes:
+
+- Object-store credentials misconfigured: check the ingester logs for
+  `403`/`AccessDenied`.
+- A new block has not been cut yet. Ingesters cut blocks every
+  `-blocks-storage.tsdb.block-ranges-period` (default `2h`); a recently
+  started ingester has nothing to ship until the first block-range elapses.
+- Disk pressure: check `cortex_ingester_tsdb_*` metrics and pod disk usage.
+
+### TSDB head compaction or WAL errors
+
+`CortexIngesterTSDBHeadCompactionFailed`, `CortexIngesterTSDBWALCorrupted`, and
+`CortexIngesterTSDBWALWritesFailed` indicate disk-level problems. Treat the
+affected ingester as a failed replica: cordon it, let traffic move to the
+other replicas in the ring, then restore from a healthy ingester or replay
+the WAL on a fresh volume. Do **not** restart in place if the WAL is corrupt —
+you will lose the in-memory series.
+
+### Compactor falls behind
+
+`CortexCompactorHasNotSuccessfullyRunCompaction` means recent blocks are
+piling up and queries will get slower over time. Check:
+
+- Compactor CPU and memory headroom — compaction is CPU-bound.
+- Object-store latency on the compactor (it does a lot of small reads/writes).
+- The `cortex-compactor.json` dashboard for per-tenant progress.
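For the "new block has not been cut yet" case under "Ingester is not shipping blocks", a quick sanity check is whether the ingester has been running for at least one block range. The sketch below is a deliberate simplification: it assumes the first block appears one full `2h` range after start-up and ignores how the TSDB aligns block boundaries, so treat it as a lower bound on patience rather than an exact prediction. The function name and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default for -blocks-storage.tsdb.block-ranges-period quoted above.
BLOCK_RANGE = timedelta(hours=2)


def too_early_to_expect_shipping(started_at, now, block_range=BLOCK_RANGE):
    """True while the ingester has been up for less than one block range."""
    return now - started_at < block_range


# Hypothetical start time for an ingester pod.
started = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)

print(too_early_to_expect_shipping(started, started + timedelta(minutes=45)))
print(too_early_to_expect_shipping(started, started + timedelta(hours=3)))
```

If the check says it is no longer too early and nothing has shipped, the credential and disk-pressure causes become the more likely suspects.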
+
+See [Partitioning Compactor]({{< relref "../guides/partitioning-compactor.md"
+>}}) for scaling out.
+
+## Hash ring and KV store
+
+### `CortexKVStoreFailure` is firing
+
+The component named in the alert cannot reach the KV store backend (Consul,
+etcd, or memberlist). Steps:
+
+1. From an affected pod, hit the KV backend's health endpoint directly.
+2. If the backend is up, look for network policy or DNS changes since the alert
+   started.
+3. With memberlist, check `cortex_memberlist_client_messages_received_total`
+   and `cortex_memberlist_client_messages_sent_total` on each pod; a partition
+   shows up as one-sided traffic.
+
+### Ingesters keep joining and leaving the ring
+
+`CortexGossipMembersMismatch` indicates members disagree on cluster membership.
+This is almost always a misconfigured `join_members:` list (some pods do not
+list a bootstrap peer that resolves) or a packet-loss issue between zones.
+[Gossip Ring Getting Started]({{< relref "../guides/gossip-ring-getting-started.md"
+>}}) walks through the canonical configuration.
+
+## Alertmanager
+
+`CortexAlertmanagerSyncConfigsFailing`, `CortexAlertmanagerReplicationFailing`,
+and the `*Persist*` / `*InitialSync*` alerts trace to the Alertmanager's
+storage backend or its peer replication. Inspect the alertmanager logs for the
+specific operation that failed; the alert annotations include the storage
+endpoint that returned the error.
+
+## Ruler
+
+A spike in `CortexRulerMissedEvaluations` typically means a ruler tenant has
+too many rules for the assigned shards. Either shard more aggressively (see
+[Sharded Ruler]({{< relref "../guides/sharded_ruler.md" >}})) or move
+heavy-evaluation tenants to the
+[query-frontend-backed rule evaluation path]({{< relref
+"../guides/rule-evaluations-via-query-frontend.md" >}}) so they share the
+query path's capacity rather than the ruler's local one.
+
+## Multi-tenant noisy-neighbour
+
+If one tenant is degrading the cluster for everyone:
+
+1. Use `cortex-queries.json` filtered by tenant to confirm the source.
+2. Apply tenant-specific limits via `runtime_config` ([Overrides]({{< relref
+   "../guides/overrides.md" >}})). Limits take effect within seconds — no
+   restart needed.
+3. For longer-term isolation, move the tenant to its own shuffle shard
+   ([Shuffle Sharding]({{< relref "../guides/shuffle-sharding.md" >}})).
+
+## When the answer isn't here
+
+- Search recent CHANGELOG entries for the component you suspect — many subtle
+  bugs are documented there before they show up in an issue.
+- Check [GitHub issues](https://github.com/cortexproject/cortex/issues) for the
+  alert name or error string; production issues are frequently filed verbatim.
+- Ask in the
+  [#cortex Slack channel](https://cloud-native.slack.com/messages/cortex) with
+  the alert name, the dashboard timeframe, and a relevant log line.
diff --git a/docs/operations/upgrading.md b/docs/operations/upgrading.md
new file mode 100644
index 00000000000..10d5e542291
--- /dev/null
+++ b/docs/operations/upgrading.md
@@ -0,0 +1,146 @@
+---
+title: "Upgrading Cortex"
+linkTitle: "Upgrading"
+weight: 3
+slug: upgrading
+---
+
+This page describes how to upgrade a running Cortex cluster safely. It covers
+the general procedure, the component-by-component ordering, and the places to
+look for version-specific breaking changes.
+
+## Where breaking changes are recorded
+
+Cortex follows semantic versioning for the [v1 surface]({{< relref
+"../configuration/v1-guarantees.md" >}}). Anything outside that surface — and
+anything inside it that requires explicit operator action — is recorded in
+`CHANGELOG.md` in the repository root, in the section for the target release.
+
+For each release, scan the CHANGELOG for these headings before upgrading:
+
+- `[CHANGE]` — behavioural changes that affect operators.
+- `[FEATURE]` / `[ENHANCEMENT]` — new options that may need configuration.
+- `[BUGFIX]` — fixes; useful when an upgrade resolves a known issue.
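Scanning by tag can be scripted. The sketch below collects entries with a given tag from every release after the current version up to and including the target. The changelog layout it parses (`## X.Y.Z` headings followed by `* [TAG] text` lines) and the sample entries are assumptions made for illustration; adjust the patterns to the real `CHANGELOG.md` before relying on it.

```python
import re

# Hypothetical excerpt; the real CHANGELOG.md layout may differ.
SAMPLE = """\
## 1.17.0
* [CHANGE] Example: renamed flag -old to -new.
* [FEATURE] Example: added a new limit.
## 1.16.0
* [CHANGE] Example: default for some option changed.
* [BUGFIX] Example: fixed a panic.
## 1.15.0
* [CHANGE] Example: removed a deprecated flag.
"""

HEADING = re.compile(r"^## (\d+)\.(\d+)\.(\d+)")
ENTRY = re.compile(r"^\* \[(CHANGE|FEATURE|ENHANCEMENT|BUGFIX)\] (.*)$")


def changes_between(changelog, current, target, tag="CHANGE"):
    """Return (version, text) pairs for `tag` entries where current < version <= target."""
    results = []
    version = None
    for line in changelog.splitlines():
        heading = HEADING.match(line)
        if heading:
            # Track which release section the following entries belong to.
            version = tuple(int(part) for part in heading.groups())
            continue
        entry = ENTRY.match(line)
        if entry and version and current < version <= target and entry.group(1) == tag:
            results.append((".".join(map(str, version)), entry.group(2)))
    return results


found = changes_between(SAMPLE, current=(1, 15, 0), target=(1, 17, 0))
for version, text in found:
    print(version, text)
```

Comparing versions as integer tuples avoids the string-ordering trap where `1.9.0` would sort after `1.10.0`.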
+
+`[CHANGE]` entries are the ones that bite. Always read every `[CHANGE]` in the
+range between your current version and the target, not just the target's
+entries.
+
+## Recommended upgrade procedure
+
+### 1. Pick a target version
+
+Upgrade one minor version at a time when possible — for example, `1.16 → 1.17`
+rather than `1.14 → 1.17` — so any `[CHANGE]` entries that require config
+adjustments can be applied incrementally. Patch upgrades (`1.17.0 → 1.17.1`)
+are safe to take in bulk.
+
+### 2. Reconcile configuration
+
+Compare your current config against `cortex -config.expand-env -modules` and
+the [Config File Reference]({{< relref
+"../configuration/config-file-reference.md" >}}) for the target version.
+
+Pay particular attention to:
+
+- Renamed or deprecated flags (the CHANGELOG `[CHANGE]` entries call these out
+  explicitly).
+- New required fields — usually called out in `[FEATURE]` entries.
+- Runtime config (`overrides:`) — limits added in newer versions take effect
+  even when the flag is absent, using the new default.
+
+Use `cortex -config.file=cortex.yaml -modules -log.level=debug` against the new
+binary to check that the config loads without starting the service.
+
+### 3. Deploy in the canonical order
+
+Cortex's components form a layered system. Upgrade in this order so each layer
+is always serving a version that understands what the layer above sends it:
+
+1. **Compactor, store-gateway** — read-side dependencies, downstream of
+   ingester. Safe to roll first because writes are unaffected.
+2. **Querier, query-frontend, query-scheduler** — depend on (1).
+3. **Ingester** — careful, see below.
+4. **Distributor** — depends on (3) speaking the new ingester wire format.
+5. **Ruler, alertmanager** — top-of-stack, depend on (1)-(4).
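The ordering above, combined with the one-minor-at-a-time advice, can be written down as a small plan generator. This is only a sketch: the version tuples and helper names are hypothetical, and it does not talk to a cluster; it just enumerates which layer to roll next for each minor step.

```python
# Layer order copied from the list above; names are spelled as on this page.
UPGRADE_LAYERS = [
    ("compactor", "store-gateway"),
    ("querier", "query-frontend", "query-scheduler"),
    ("ingester",),
    ("distributor",),
    ("ruler", "alertmanager"),
]


def minor_steps(current, target):
    """List the minor versions to pass through one at a time, e.g. 1.14 -> 1.17."""
    major, cur_minor = current
    target_major, target_minor = target
    if major != target_major or target_minor < cur_minor:
        raise ValueError("sketch only handles forward upgrades within one major version")
    return [f"{major}.{minor}" for minor in range(cur_minor + 1, target_minor + 1)]


def rollout_plan(current, target):
    """Yield (version, layer) pairs: every layer, in order, for each minor step."""
    for version in minor_steps(current, target):
        for layer in UPGRADE_LAYERS:
            yield version, layer


plan = list(rollout_plan((1, 14), (1, 17)))
for version, layer in plan:
    print(version, ", ".join(layer))
```

Each printed row corresponds to one "roll, then validate" cycle from the procedure, so the length of the plan is a fair estimate of how many validation pauses an upgrade needs.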
+
+Within a stateful set (ingester, store-gateway, compactor), use a rolling
+update that respects the ring. For ingesters, drain each pod via the
+`/ingester/shutdown` endpoint before terminating, so its in-memory series flush
+to object storage cleanly. The bundled `CortexRolloutStuck` alert fires if the
+rollout stalls — pair the rollout with the
+[rollout-progress dashboard]({{< relref "./monitoring-cortex.md" >}}).
+
+### 4. Validate after each layer
+
+After rolling each layer, before moving to the next:
+
+- The [bundled dashboards]({{< relref "./monitoring-cortex.md" >}}) should
+  show steady error rates and latency.
+- No new alerts should be firing.
+- Sample a few queries from production tenants to confirm the read path
+  returns the expected data.
+
+If anything regresses, **stop and roll back the current layer** before the
+next layer starts. Cortex tolerates running mixed versions across a single
+minor-version boundary; running across two minors is not supported.
+
+## Component-specific notes
+
+### Ingester
+
+Ingesters hold the active TSDB head in memory (and on the WAL) plus local
+on-disk blocks that have been cut but may not yet have shipped. A rough
+restart that loses the local disk loses up to one
+`-blocks-storage.tsdb.block-ranges-period` (default `2h`) of head data plus
+any on-disk blocks not yet uploaded.
+
+- Drain via `POST /ingester/shutdown` so blocks are uploaded and series are
+  handed off to a replacement replica before the pod terminates.
+- Watch `cortex_ingester_shipper_uploads_total` climb on the draining pod;
+  termination is safe once it stops increasing.
+- The `CortexIngesterHasUnshippedBlocks` alert is a hard "don't terminate yet"
+  signal.
+
+### Compactor
+
+The compactor owns the bucket index. After a compactor upgrade,
+`CortexBucketIndexNotUpdated` should clear within one compaction interval. If
+it does not, the new binary may be failing to read pre-existing index format —
+check compactor logs for `cannot decode`-style errors.
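The "stops increasing" check from the ingester notes above can be made explicit. The sketch below assumes you have already collected successive values of `cortex_ingester_shipper_uploads_total` for the draining pod at a fixed interval; the sample values and the choice of three stable points are arbitrary illustrations, not recommended thresholds.

```python
def uploads_plateaued(samples, stable_points=3):
    """True when the last `stable_points` counter samples are identical."""
    if len(samples) < stable_points:
        # Not enough history to call it a plateau yet.
        return False
    tail = samples[-stable_points:]
    return all(value == tail[0] for value in tail)


# Hypothetical counter readings taken while a pod drains.
still_uploading = [10, 11, 13, 14]
finished = [10, 11, 13, 14, 14, 14]

print(uploads_plateaued(still_uploading))
print(uploads_plateaued(finished))
```

A plateau alone is not sufficient: a firing `CortexIngesterHasUnshippedBlocks` alert still overrides it.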
+
+### Alertmanager
+
+Alertmanager persists state (silences, notification log) through the storage
+backend. Read the relevant `[CHANGE]` entries closely for any state migration:
+on rare upgrades the persisted format changes and silences/nflog from the
+previous version must be replayed.
+
+## Downgrade
+
+Cortex supports rolling **back** within the same minor version (any patch
+release of the same minor) at any time. Cross-minor downgrade is **not**
+supported: persisted formats on disk and in object storage may have been
+written in a way the older binary cannot read.
+
+If you must downgrade across a minor, the safe path is:
+
+1. Stop all writes (point Prometheus's `remote_write` at a buffer).
+2. Wait for compactor to finish and ingesters to ship.
+3. Restore the previous binary against object storage; ingester local state on
+   disk should be discarded.
+4. Resume writes.
+
+This is invasive — prefer to roll forward to a fix instead.
+
+## Related
+
+- [v1 Guarantees]({{< relref "../configuration/v1-guarantees.md" >}}) — what is
+  and isn't covered by the stability contract.
+- [Ingesters: Rolling Updates]({{< relref
+  "../guides/ingesters-rolling-updates.md" >}}) — the mechanics of draining an
+  ingester safely.
+- [Migrating the KV Store to Memberlist]({{< relref
+  "../guides/migration-kv-store-to-memberlist.md" >}}) — the canonical example
+  of a multi-step migration done online.
+- `CHANGELOG.md` in the repository root — version-by-version change list.