
Allow changefeed to resume from any timestamp above tikv_gc_safe_point #4967

@TailinLyu

Description


Is your feature request related to a problem?

Feature Request: Allow changefeed to operate from any timestamp where data physically exists on TiKV

Problem

TiCDC newarch (v8.5+) refuses to create or resume a changefeed from a timestamp that is below PD's service GC safepoint, even when TiKV still has the data. The check_gc_safe_point config option exists in ReplicaConfig but is dead code — never evaluated in any conditional.

The core issue: TiCDC conflates "PD says this timestamp is past GC" with "TiKV can't serve this data." These are different things. The window between tikv_gc_safe_point (physical compaction boundary) and PD's min service safepoint (logical bookkeeping) can be hours to days wide, and all data in that window is fully readable from TiKV.

Scenarios where this matters

  • Changefeed was paused/stopped longer than gc-ttl (default 24h) — its PD safepoint expired, but TiKV data is intact
  • Changefeed was in failed/stuck state — not refreshing its safepoint, PD expired it
  • Creating a new changefeed from an old timestamp — operator wants to replicate from a point in the past (e.g. for migration, backfill, or disaster recovery) where data still exists on TiKV
  • Incident recovery — something went wrong, but TiKV GC hasn't run yet; the data is there and we should be able to use it

In all cases, the data physically exists on TiKV and is readable. TiCDC should be able to use it.

Analysis

There are two distinct GC boundaries:

| Boundary | What it means | Where stored |
| --- | --- | --- |
| `tikv_gc_safe_point` | Physical: TiKV has compacted MVCC versions below this. Reads below this actually fail. | `mysql.tidb` table |
| PD min service GC safepoint | Logical: the minimum across all registered services. Controls when the GC worker is allowed to advance the physical boundary. | PD etcd |

TiCDC's newarch has 5 separate checks that use PD's logical safepoint as a hard boundary, rejecting timestamps that TiKV could serve:

  1. Create/resume API (api/v2/changefeed.go) — EnsureChangefeedStartTsSafety rejects startTs < minServiceGCTs
  2. Coordinator runtime (coordinator/coordinator.go) — checkStaleCheckpointTs kills changefeed if checkpoint is stale vs lastSafePointTs or exceeds gc-ttl
  3. Schema store bootstrap (logservice/schemastore/persist_storage.go) — getAllPhysicalTables and registerTable reject snapTs/startTs < p.gcTs (where p.gcTs = PD safepoint at schema store init time)
  4. Schema store table info lookup (logservice/schemastore/multi_version.go) — getTableInfo(ts) returns "no version found" when ts is before the earliest loaded version
  5. DDL event fetch (logservice/schemastore/persist_storage.go) — fetchTableDDLEvents rejects start < gcTs

None of these checks validate against tikv_gc_safe_point. They all use PD's logical safepoint or derivatives.
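The shared pattern behind all five checks can be sketched as follows. This is an illustrative reduction, not TiCDC's actual code: the function name mirrors `EnsureChangefeedStartTsSafety`, but the signature and values are made up for the example.

```go
package main

import "fmt"

// Sketch of the check pattern described above: startTs is rejected whenever
// it falls below PD's *logical* service safepoint, even though TiKV can serve
// any read at or above the *physical* tikv_gc_safe_point.
func ensureStartTsSafety(startTs, minServiceGCTs uint64) error {
	if startTs < minServiceGCTs {
		return fmt.Errorf("start-ts %d is earlier than GC safepoint %d", startTs, minServiceGCTs)
	}
	return nil
}

func main() {
	// tikvGCSafePoint < startTs < pdServiceSafePoint: the data is readable
	// on TiKV, but the check rejects it because it only consults PD.
	var tikvGCSafePoint, startTs, pdServiceSafePoint uint64 = 100, 200, 300
	_ = tikvGCSafePoint
	err := ensureStartTsSafety(startTs, pdServiceSafePoint)
	fmt.Println(err != nil) // true: rejected despite readable data
}
```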

Root cause in schema store architecture

The schema store initializes by:

  1. Reading PD's min service safepoint (gc.UnifyGetServiceGCSafepoint)
  2. Doing a TiKV snapshot read at that timestamp to load the schema
  3. Setting its internal p.gcTs to that value
  4. Serving all subsequent requests from its local pebble DB (which only has data from that point onward)

If the changefeed's target timestamp is below PD's service safepoint, the schema store has no local data for it — even though TiKV could serve a snapshot read at that timestamp. The schema store could have initialized at the changefeed's checkpoint, but it doesn't try.
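The bootstrap order above can be condensed into a small sketch. The type and method names here are hypothetical stand-ins for the real schema store; the point is only that `gcTs` gets pinned to PD's safepoint at init, so every later lookup below it has no local data to serve.

```go
package main

import "fmt"

// Illustrative model of the schema store bootstrap: gcTs marks the earliest
// timestamp the local pebble DB covers.
type schemaStore struct {
	gcTs uint64
}

func newSchemaStore(pdServiceSafePoint uint64) *schemaStore {
	// 1. read PD's min service safepoint
	// 2. snapshot-read the schema from TiKV at that timestamp (elided)
	// 3. pin gcTs to that value; all later requests are served locally
	return &schemaStore{gcTs: pdServiceSafePoint}
}

func (s *schemaStore) getAllPhysicalTables(snapTs uint64) error {
	if snapTs < s.gcTs {
		// check 3 from the list above: no local data below gcTs
		return fmt.Errorf("snapTs %d is older than gcTs %d", snapTs, s.gcTs)
	}
	return nil
}

func main() {
	s := newSchemaStore(300)
	fmt.Println(s.getAllPhysicalTables(200) != nil) // true: rejected
}
```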

What we did (workaround)

We patched the newarch binary to bypass/relax these checks when check_gc_safe_point=false:

  1. API gate: Skip EnsureChangefeedStartTsSafety on create/resume
  2. Coordinator: Skip checkStaleCheckpointTs
  3. Schema store: Clamp snapTs/startTs to p.gcTs in getAllPhysicalTables and registerTable (use current schema as approximation)
  4. Table info: Return earliest available version when requested ts is before it
  5. DDL fetch: Check allTargetTs[0] (actual event timestamp from DDL history) instead of caller's start parameter
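The clamping in item 3 can be sketched like this (a minimal model, not the patch itself; the function name and flag plumbing are invented for illustration). It shows why the workaround is only safe when no DDL happened in the gap: the older snapshot is answered with the schema as of `gcTs`.

```go
package main

import "fmt"

// clampSnapTs models the relaxed check: with check_gc_safe_point disabled,
// a snapTs below gcTs is clamped up to gcTs, i.e. the current schema is used
// as an approximation for the older snapshot.
func clampSnapTs(snapTs, gcTs uint64, checkGCSafePoint bool) (uint64, error) {
	if snapTs >= gcTs {
		return snapTs, nil
	}
	if checkGCSafePoint {
		// original strict behavior
		return 0, fmt.Errorf("snapTs %d is older than gcTs %d", snapTs, gcTs)
	}
	// workaround: serve the schema as of gcTs; correct only if no DDL
	// occurred between snapTs and gcTs
	return gcTs, nil
}

func main() {
	ts, err := clampSnapTs(200, 300, false)
	fmt.Println(ts, err) // 300 <nil>
}
```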

Testing

  • Deployed to production TiCDC cluster (v8.5.6-release.2 newarch)
  • Successfully created and resumed a changefeed from a timestamp 4+ days behind PD's service safepoint (but above tikv_gc_safe_point)
  • Changefeed caught up from 100+ hours of lag in ~4 hours at 20-50x replication rate
  • Zero encoding/decoding errors — data correctness confirmed
  • Changefeed has been running in steady-state (2-3s lag) for 24+ hours with no issues

Limitations of our workaround

  • Schema store clamping assumes no DDLs happened between the target timestamp and PD safepoint. If DDLs did occur, the schema store would use incorrect column definitions, potentially causing silent data corruption.
  • The check_gc_safe_point=false flag is a blunt instrument — it disables all safety checks without validating whether the data actually exists on TiKV.

Proposed solution

Option A (minimal): Wire up check_gc_safe_point config to control the existing checks. This is what we implemented. Backwards compatible but unsafe if DDLs happened in the gap.

Option B (better): When creating or resuming a changefeed, have the schema store initialize at the changefeed's target/checkpoint timestamp rather than at PD's current service safepoint. Since the target timestamp is above tikv_gc_safe_point, TiKV can serve the snapshot read. The schema store would then have accurate schema data at the exact point needed, and all downstream checks would pass naturally with no bypasses. The gc_keeper.go would register its service safepoint at the changefeed's checkpoint.

Option C (best): Separate the concepts properly throughout the codebase:

  • Pre-flight validation should check against tikv_gc_safe_point (can TiKV actually serve this data?), not PD service safepoint (is the bookkeeping current?)
  • The schema store should be able to initialize at any timestamp above tikv_gc_safe_point
  • Runtime checks should distinguish "safepoint expired but data exists" from "data physically gone"
  • Provide a clear operational path: if data exists, let operators use it
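A hedged sketch of what Option C's pre-flight validation could look like. The function name is hypothetical and fetching `tikv_gc_safe_point` (stored in the `mysql.tidb` table) is elided; the sketch only shows the changed comparison target.

```go
package main

import "fmt"

// validateStartTs checks the proposed startTs against the *physical*
// boundary (tikv_gc_safe_point) rather than PD's logical service safepoint.
// If the data physically exists on TiKV, the changefeed is allowed to start,
// even when PD's bookkeeping safepoint is ahead.
func validateStartTs(startTs, tikvGCSafePoint uint64) error {
	if startTs < tikvGCSafePoint {
		return fmt.Errorf("start-ts %d precedes tikv_gc_safe_point %d: data is physically gone", startTs, tikvGCSafePoint)
	}
	return nil
}

func main() {
	fmt.Println(validateStartTs(200, 100) == nil) // true: data readable
	fmt.Println(validateStartTs(50, 100) != nil)  // true: data compacted away
}
```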

Environment

  • TiCDC v8.5.6-release.2 (newarch)
  • gc-ttl default: 24 hours (pkg/config/server.go:105)
  • check_gc_safe_point config field exists in ReplicaConfig but is dead code upstream

Describe the feature you'd like

Described above; in short, we want the check_gc_safe_point config option to actually take effect.

Describe alternatives you've considered

No response

Teachability, Documentation, Adoption, Migration Strategy

No response

Labels: contribution, first-time-contributor