Is your feature request related to a problem?
Feature Request: Allow changefeed to operate from any timestamp where data physically exists on TiKV
Problem
TiCDC newarch (v8.5+) refuses to create or resume a changefeed from a timestamp that is below PD's service GC safepoint, even when TiKV still has the data. The `check_gc_safe_point` config option exists in `ReplicaConfig` but is dead code — it is never evaluated in any conditional.
The core issue: TiCDC conflates "PD says this timestamp is past GC" with "TiKV can't serve this data." These are different things. The window between `tikv_gc_safe_point` (the physical compaction boundary) and PD's min service safepoint (logical bookkeeping) can be hours to days wide, and all data in that window is fully readable from TiKV.
Scenarios where this matters
- Changefeed was paused/stopped longer than `gc-ttl` (default 24h) — its PD safepoint expired, but TiKV data is intact
- Changefeed was in a failed/stuck state — it stopped refreshing its safepoint, so PD expired it
- Creating a new changefeed from an old timestamp — operator wants to replicate from a point in the past (e.g. for migration, backfill, or disaster recovery) where data still exists on TiKV
- Incident recovery — something went wrong, but TiKV GC hasn't run yet; the data is there and we should be able to use it
In all cases, the data physically exists on TiKV and is readable. TiCDC should be able to use it.
Analysis
There are two distinct GC boundaries:
| Boundary | What it means | Where stored |
| --- | --- | --- |
| `tikv_gc_safe_point` | Physical: TiKV has compacted MVCC versions below this. Reads below this actually fail. | `mysql.tidb` table |
| PD min service GC safepoint | Logical: the minimum across all registered services. Controls when the GC worker is allowed to advance the physical boundary. | PD etcd |
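Both boundaries are easy to inspect. A minimal sketch (the DSN and driver choice here are illustrative assumptions, not part of TiCDC) of reading the physical boundary from `mysql.tidb`:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // any MySQL-protocol driver works against TiDB
)

func main() {
	// Hypothetical DSN; point it at any TiDB server in the cluster.
	db, err := sql.Open("mysql", "root:@tcp(127.0.0.1:4000)/")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Physical boundary: reads below this timestamp actually fail on TiKV.
	var safePoint string
	err = db.QueryRow(
		"SELECT VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME = 'tikv_gc_safe_point'",
	).Scan(&safePoint)
	if err != nil {
		panic(err)
	}
	fmt.Println("tikv_gc_safe_point:", safePoint)

	// Logical boundary: per-service GC safepoints (including TiCDC's) live in PD and
	// can be listed with `pd-ctl service-gc-safepoint`.
}
```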
TiCDC's newarch has 5 separate checks that use PD's logical safepoint as a hard boundary, rejecting timestamps that TiKV could serve:
- Create/resume API (`api/v2/changefeed.go`) — `EnsureChangefeedStartTsSafety` rejects `startTs < minServiceGCTs`
- Coordinator runtime (`coordinator/coordinator.go`) — `checkStaleCheckpointTs` kills the changefeed if the checkpoint is stale vs `lastSafePointTs` or exceeds `gc-ttl`
- Schema store bootstrap (`logservice/schemastore/persist_storage.go`) — `getAllPhysicalTables` and `registerTable` reject `snapTs`/`startTs < p.gcTs` (where `p.gcTs` = PD safepoint at schema store init time)
- Schema store table info lookup (`logservice/schemastore/multi_version.go`) — `getTableInfo(ts)` returns "no version found" when `ts` is before the earliest loaded version
- DDL event fetch (`logservice/schemastore/persist_storage.go`) — `fetchTableDDLEvents` rejects `start < gcTs`
None of these checks validate against `tikv_gc_safe_point`. They all use PD's logical safepoint or derivatives.
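All five gates share the same shape; a simplified paraphrase (name and signature are approximations, not the literal upstream code):

```go
// startTsIsSafe paraphrases the create/resume gate: minServiceGCTs comes from PD's
// *logical* service safepoint, so a timestamp TiKV could still serve (above
// tikv_gc_safe_point but below PD's bookkeeping) is rejected all the same.
func startTsIsSafe(startTs, minServiceGCTs uint64) bool {
	return startTs >= minServiceGCTs
}
```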
Root cause in schema store architecture
The schema store initializes by:
- Reading PD's min service safepoint (`gc.UnifyGetServiceGCSafepoint`)
- Doing a TiKV snapshot read at that timestamp to load the schema
- Setting its internal `p.gcTs` to that value
- Serving all subsequent requests from its local pebble DB (which only has data from that point onward)
If the changefeed's target timestamp is below PD's service safepoint, the schema store has no local data for it — even though TiKV could serve a snapshot read at that timestamp. The schema store could have initialized at the changefeed's checkpoint, but it doesn't try.
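In code terms, the bootstrap looks roughly like the following. This is a heavily simplified sketch with hypothetical type and function names, not the upstream implementation:

```go
// Hypothetical stand-ins for the real schema store internals.
type schemaSnapshot struct{}

type persistStorage struct {
	gcTs uint64          // lower bound for every later lookup (p.gcTs in the real code)
	db   *schemaSnapshot // local Pebble-backed copy, only covers data from gcTs onward
}

// loadSchemaSnapshotFromTiKV stands in for the one-off snapshot read at ts.
func loadSchemaSnapshotFromTiKV(ts uint64) (*schemaSnapshot, error) {
	return &schemaSnapshot{}, nil
}

// initSchemaStore sketches the bootstrap: the anchor timestamp is PD's min service
// safepoint, so everything below it becomes invisible to the schema store even
// though TiKV could still serve a snapshot read there.
func initSchemaStore(pdMinServiceSafepoint uint64) (*persistStorage, error) {
	gcTs := pdMinServiceSafepoint
	snap, err := loadSchemaSnapshotFromTiKV(gcTs)
	if err != nil {
		return nil, err
	}
	return &persistStorage{gcTs: gcTs, db: snap}, nil
}
```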
What we did (workaround)
We patched the newarch binary to bypass/relax these checks when `check_gc_safe_point=false` (a simplified sketch of the schema-store clamp follows the list):
- API gate: Skip `EnsureChangefeedStartTsSafety` on create/resume
- Coordinator: Skip `checkStaleCheckpointTs`
- Schema store: Clamp `snapTs`/`startTs` to `p.gcTs` in `getAllPhysicalTables` and `registerTable` (use current schema as approximation)
- Table info: Return earliest available version when requested `ts` is before it
- DDL fetch: Check `allTargetTs[0]` (actual event timestamp from DDL history) instead of caller's `start` parameter
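For illustration, the schema-store clamp has roughly this shape (paraphrased from our patch, names approximate):

```go
// effectiveSnapTs paraphrases the clamp inside getAllPhysicalTables/registerTable.
// With check_gc_safe_point disabled, a snapshot ts older than the store's lower bound
// falls back to the schema as of gcTs — which is only correct if no DDL happened in
// the gap (see the limitations below).
func effectiveSnapTs(snapTs, gcTs uint64, checkGCSafePoint bool) (uint64, bool) {
	if snapTs >= gcTs {
		return snapTs, true
	}
	if checkGCSafePoint {
		return 0, false // reject, as upstream does today
	}
	return gcTs, true // clamp: approximate with the schema at gcTs
}
```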
Testing
- Deployed to production TiCDC cluster (v8.5.6-release.2 newarch)
- Successfully created and resumed a changefeed from a timestamp 4+ days behind PD's service safepoint (but above `tikv_gc_safe_point`)
- Changefeed caught up from 100+ hours of lag in ~4 hours at 20-50x replication rate
- Zero encoding/decoding errors — data correctness confirmed
- Changefeed has been running in steady-state (2-3s lag) for 24+ hours with no issues
Limitations of our workaround
- Schema store clamping assumes no DDLs happened between the target timestamp and PD safepoint. If DDLs did occur, the schema store would use incorrect column definitions, potentially causing silent data corruption.
- The `check_gc_safe_point=false` flag is a blunt instrument — it disables all safety checks without validating whether the data actually exists on TiKV.
Proposed solution
Option A (minimal): Wire up the `check_gc_safe_point` config to control the existing checks. This is what we implemented. Backwards compatible, but unsafe if DDLs happened in the gap.
Option B (better): When creating or resuming a changefeed, have the schema store initialize at the changefeed's target/checkpoint timestamp rather than at PD's current service safepoint. Since the target timestamp is above `tikv_gc_safe_point`, TiKV can serve the snapshot read. The schema store would then have accurate schema data at the exact point needed, and all downstream checks would pass naturally with no bypasses. `gc_keeper.go` would register its service safepoint at the changefeed's checkpoint.
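A hedged sketch of Option B, reusing the hypothetical `initSchemaStore`/`persistStorage` from the bootstrap sketch above; the only real change is the anchor timestamp:

```go
// bootstrapForChangefeed anchors the schema store at the changefeed's checkpoint
// instead of PD's current service safepoint.
func bootstrapForChangefeed(checkpointTs, tikvGCSafePoint uint64) (*persistStorage, error) {
	if checkpointTs < tikvGCSafePoint {
		// Genuinely unrecoverable: data below tikv_gc_safe_point has been compacted.
		return nil, errors.New("checkpoint is below tikv_gc_safe_point; data already compacted")
	}
	// Same bootstrap path as today, different anchor; the gc keeper would then register
	// the changefeed's service safepoint at checkpointTs to hold GC from here forward.
	return initSchemaStore(checkpointTs)
}
```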
Option C (best): Separate the concepts properly throughout the codebase (see the sketch after this list):
- Pre-flight validation should check against `tikv_gc_safe_point` (can TiKV actually serve this data?), not PD's service safepoint (is the bookkeeping current?)
- The schema store should be able to initialize at any timestamp above `tikv_gc_safe_point`
- Runtime checks should distinguish "safepoint expired but data exists" from "data physically gone"
- Provide a clear operational path: if data exists, let operators use it
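A hedged sketch of the pre-flight classification Option C implies (hypothetical names; the point is that the two boundaries produce different verdicts):

```go
// startTsVerdict distinguishes "data physically gone" from "PD bookkeeping expired
// but data still readable", so callers can hard-fail the former and allow an explicit
// operator override for the latter.
type startTsVerdict int

const (
	startTsOK              startTsVerdict = iota
	startTsPastPDSafepoint // recoverable: safepoint expired, data intact
	startTsCompacted       // fatal: below tikv_gc_safe_point, TiKV cannot serve it
)

func classifyStartTs(startTs, tikvGCSafePoint, pdMinServiceSafepoint uint64) startTsVerdict {
	switch {
	case startTs < tikvGCSafePoint:
		return startTsCompacted
	case startTs < pdMinServiceSafepoint:
		return startTsPastPDSafepoint
	default:
		return startTsOK
	}
}
```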
Environment
- TiCDC v8.5.6-release.2 (newarch)
- `gc-ttl` default: 24 hours (`pkg/config/server.go:105`)
- `check_gc_safe_point` config field exists in `ReplicaConfig` but is dead code upstream
Describe the feature you'd like
Described above — in short, we want the `check_gc_safe_point` config to actually take effect.
Describe alternatives you've considered
No response
Teachability, Documentation, Adoption, Migration Strategy
No response