
Allow changefeed to resume from any timestamp above tikv_gc_safe_point #4967

@TailinLyu

Description


Is your feature request related to a problem?

Feature Request: Allow changefeed to operate from any timestamp where data physically exists on TiKV

Problem

TiCDC newarch (v8.5+) refuses to create or resume a changefeed from a timestamp that is below PD's service GC safepoint, even when TiKV still has the data. The check_gc_safe_point config option exists in ReplicaConfig but is dead code — never evaluated in any conditional.

The core issue: TiCDC conflates "PD says this timestamp is past GC" with "TiKV can't serve this data." These are different things. The window between tikv_gc_safe_point (physical compaction boundary) and PD's min service safepoint (logical bookkeeping) can be hours to days wide, and all data in that window is fully readable from TiKV.

Scenarios where this matters

  • Changefeed was paused/stopped longer than gc-ttl (default 24h) — its PD safepoint expired, but TiKV data is intact
  • Changefeed was in failed/stuck state — not refreshing its safepoint, PD expired it
  • Creating a new changefeed from an old timestamp — operator wants to replicate from a point in the past (e.g. for migration, backfill, or disaster recovery) where data still exists on TiKV
  • Incident recovery — something went wrong, but TiKV GC hasn't run yet; the data is there and we should be able to use it

In all cases, the data physically exists on TiKV and is readable. TiCDC should be able to use it.

Analysis

There are two distinct GC boundaries:

| Boundary | What it means | Where stored |
| --- | --- | --- |
| `tikv_gc_safe_point` | Physical: TiKV has compacted MVCC versions below this. Reads below this actually fail. | `mysql.tidb` table |
| PD min service GC safepoint | Logical: the minimum across all registered services. Controls when the GC worker is allowed to advance the physical boundary. | PD etcd |

TiCDC's newarch has 5 separate checks that use PD's logical safepoint as a hard boundary, rejecting timestamps that TiKV could serve:

  1. Create/resume API (api/v2/changefeed.go) — EnsureChangefeedStartTsSafety rejects startTs < minServiceGCTs
  2. Coordinator runtime (coordinator/coordinator.go) — checkStaleCheckpointTs kills changefeed if checkpoint is stale vs lastSafePointTs or exceeds gc-ttl
  3. Schema store bootstrap (logservice/schemastore/persist_storage.go) — getAllPhysicalTables and registerTable reject snapTs/startTs < p.gcTs (where p.gcTs = PD safepoint at schema store init time)
  4. Schema store table info lookup (logservice/schemastore/multi_version.go) — getTableInfo(ts) returns "no version found" when ts is before the earliest loaded version
  5. DDL event fetch (logservice/schemastore/persist_storage.go) — fetchTableDDLEvents rejects start < gcTs

None of these checks validate against tikv_gc_safe_point. They all use PD's logical safepoint or derivatives.
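The shared pattern behind all five checks can be sketched as follows. This is an illustrative reduction, not TiCDC's actual code: the function name mirrors `EnsureChangefeedStartTsSafety`, but the signature and values are made up for the example.

```go
package main

import "fmt"

// Sketch of the check pattern described above: startTs is rejected whenever
// it falls below PD's *logical* service safepoint, even though TiKV can serve
// any read at or above the *physical* tikv_gc_safe_point.
func ensureStartTsSafety(startTs, minServiceGCTs uint64) error {
	if startTs < minServiceGCTs {
		return fmt.Errorf("start-ts %d is earlier than GC safepoint %d", startTs, minServiceGCTs)
	}
	return nil
}

func main() {
	// tikvGCSafePoint < startTs < pdServiceSafePoint: the data is readable
	// on TiKV, but the check rejects it because it only consults PD.
	var tikvGCSafePoint, startTs, pdServiceSafePoint uint64 = 100, 200, 300
	_ = tikvGCSafePoint
	err := ensureStartTsSafety(startTs, pdServiceSafePoint)
	fmt.Println(err != nil) // true: rejected despite readable data
}
```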

Root cause in schema store architecture

The schema store initializes by:

  1. Reading PD's min service safepoint (gc.UnifyGetServiceGCSafepoint)
  2. Doing a TiKV snapshot read at that timestamp to load the schema
  3. Setting its internal p.gcTs to that value
  4. Serving all subsequent requests from its local pebble DB (which only has data from that point onward)

If the changefeed's target timestamp is below PD's service safepoint, the schema store has no local data for it — even though TiKV could serve a snapshot read at that timestamp. The schema store could have initialized at the changefeed's checkpoint, but it doesn't try.
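The bootstrap order above can be condensed into a small sketch. The type and method names here are hypothetical stand-ins for the real schema store; the point is only that `gcTs` gets pinned to PD's safepoint at init, so every later lookup below it has no local data to serve.

```go
package main

import "fmt"

// Illustrative model of the schema store bootstrap: gcTs marks the earliest
// timestamp the local pebble DB covers.
type schemaStore struct {
	gcTs uint64
}

func newSchemaStore(pdServiceSafePoint uint64) *schemaStore {
	// 1. read PD's min service safepoint
	// 2. snapshot-read the schema from TiKV at that timestamp (elided)
	// 3. pin gcTs to that value; all later requests are served locally
	return &schemaStore{gcTs: pdServiceSafePoint}
}

func (s *schemaStore) getAllPhysicalTables(snapTs uint64) error {
	if snapTs < s.gcTs {
		// check 3 from the list above: no local data below gcTs
		return fmt.Errorf("snapTs %d is older than gcTs %d", snapTs, s.gcTs)
	}
	return nil
}

func main() {
	s := newSchemaStore(300)
	fmt.Println(s.getAllPhysicalTables(200) != nil) // true: rejected
}
```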

What we did (workaround)

We patched the newarch binary to bypass/relax these checks when check_gc_safe_point=false:

  1. API gate: Skip EnsureChangefeedStartTsSafety on create/resume
  2. Coordinator: Skip checkStaleCheckpointTs
  3. Schema store: Clamp snapTs/startTs to p.gcTs in getAllPhysicalTables and registerTable (use current schema as approximation)
  4. Table info: Return earliest available version when requested ts is before it
  5. DDL fetch: Check allTargetTs[0] (actual event timestamp from DDL history) instead of caller's start parameter
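The clamping in item 3 can be sketched like this (a minimal model, not the patch itself; the function name and flag plumbing are invented for illustration). It shows why the workaround is only safe when no DDL happened in the gap: the older snapshot is answered with the schema as of `gcTs`.

```go
package main

import "fmt"

// clampSnapTs models the relaxed check: with check_gc_safe_point disabled,
// a snapTs below gcTs is clamped up to gcTs, i.e. the current schema is used
// as an approximation for the older snapshot.
func clampSnapTs(snapTs, gcTs uint64, checkGCSafePoint bool) (uint64, error) {
	if snapTs >= gcTs {
		return snapTs, nil
	}
	if checkGCSafePoint {
		// original strict behavior
		return 0, fmt.Errorf("snapTs %d is older than gcTs %d", snapTs, gcTs)
	}
	// workaround: serve the schema as of gcTs; correct only if no DDL
	// occurred between snapTs and gcTs
	return gcTs, nil
}

func main() {
	ts, err := clampSnapTs(200, 300, false)
	fmt.Println(ts, err) // 300 <nil>
}
```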

Testing

  • Deployed to production TiCDC cluster (v8.5.6-release.2 newarch)
  • Successfully created and resumed a changefeed from a timestamp 4+ days behind PD's service safepoint (but above tikv_gc_safe_point)
  • Changefeed caught up from 100+ hours of lag in ~4 hours at 20-50x replication rate
  • Zero encoding/decoding errors — data correctness confirmed
  • Changefeed has been running in steady-state (2-3s lag) for 24+ hours with no issues

Limitations of our workaround

  • Schema store clamping assumes no DDLs happened between the target timestamp and PD safepoint. If DDLs did occur, the schema store would use incorrect column definitions, potentially causing silent data corruption.
  • The check_gc_safe_point=false flag is a blunt instrument — it disables all safety checks without validating whether the data actually exists on TiKV.

Proposed solution

Option A (minimal): Wire up check_gc_safe_point config to control the existing checks. This is what we implemented. Backwards compatible but unsafe if DDLs happened in the gap.

Option B (better): When creating or resuming a changefeed, have the schema store initialize at the changefeed's target/checkpoint timestamp rather than at PD's current service safepoint. Since the target timestamp is above tikv_gc_safe_point, TiKV can serve the snapshot read. The schema store would then have accurate schema data at the exact point needed, and all downstream checks would pass naturally with no bypasses. The gc_keeper.go would register its service safepoint at the changefeed's checkpoint.

Option C (best): Separate the concepts properly throughout the codebase:

  • Pre-flight validation should check against tikv_gc_safe_point (can TiKV actually serve this data?), not PD service safepoint (is the bookkeeping current?)
  • The schema store should be able to initialize at any timestamp above tikv_gc_safe_point
  • Runtime checks should distinguish "safepoint expired but data exists" from "data physically gone"
  • Provide a clear operational path: if data exists, let operators use it
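A hedged sketch of what Option C's pre-flight validation could look like. The function name is hypothetical and fetching `tikv_gc_safe_point` (stored in the `mysql.tidb` table) is elided; the sketch only shows the changed comparison target.

```go
package main

import "fmt"

// validateStartTs checks the proposed startTs against the *physical*
// boundary (tikv_gc_safe_point) rather than PD's logical service safepoint.
// If the data physically exists on TiKV, the changefeed is allowed to start,
// even when PD's bookkeeping safepoint is ahead.
func validateStartTs(startTs, tikvGCSafePoint uint64) error {
	if startTs < tikvGCSafePoint {
		return fmt.Errorf("start-ts %d precedes tikv_gc_safe_point %d: data is physically gone", startTs, tikvGCSafePoint)
	}
	return nil
}

func main() {
	fmt.Println(validateStartTs(200, 100) == nil) // true: data readable
	fmt.Println(validateStartTs(50, 100) != nil)  // true: data compacted away
}
```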

Environment

  • TiCDC v8.5.6-release.2 (newarch)
  • gc-ttl default: 24 hours (pkg/config/server.go:105)
  • check_gc_safe_point config field exists in ReplicaConfig but is dead code upstream

Describe the feature you'd like

Described above; in short, we want the check_gc_safe_point config option to actually take effect.

Describe alternatives you've considered

No response

Teachability, Documentation, Adoption, Migration Strategy

No response

Labels: contribution, first-time-contributor