[Improvement]: Refactor snapshot-expiring via ProcessFactory plugin #4107
baiyangtx wants to merge 8 commits into apache:master
Conversation
…actory Co-Authored-By: Aime <aime@bytedance.com> Change-Id: Idfac8a56427baccaeeca27e8f71719d476d7839a
709fc06 to b5e817f
        ConfigOptions.key("expire-snapshots.enabled").booleanType().defaultValue(true);

    public static final ConfigOption<Duration> SNAPSHOT_EXPIRE_INTERVAL =
        ConfigOptions.key("expire-snapshot.interval")
The YAML config uses expire-snapshots.interval (plural "snapshots"), which does not match the key defined here.
    properties.keySet().stream()
        .filter(key -> key.startsWith(POOL_CONFIG_PREFIX))
        .map(key -> key.substring(POOL_CONFIG_PREFIX.length()))
        .map(key -> key.substring(0, key.indexOf(".") + 1))
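Because of the "+ 1", the substring keeps the trailing dot, so the keys that finally get looked up are pool.default..thread-count / pool.snapshots-expiring..thread-count (note the double dot). That does not resolve to the pool config.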
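To make the off-by-one concrete, here is a minimal standalone sketch. The value of POOL_CONFIG_PREFIX ("pool.") and the class and method names are assumptions for illustration only; the real constant is not shown in this diff. It compares the reviewed expression with one possible fix that cuts at the dot itself.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class PoolNameExtraction {
  // Assumed value; the actual POOL_CONFIG_PREFIX is not visible in the diff.
  static final String POOL_CONFIG_PREFIX = "pool.";

  // Mirrors the reviewed code: "+ 1" keeps the trailing dot in the pool name.
  static Set<String> poolNamesWithTrailingDot(Map<String, String> properties) {
    return properties.keySet().stream()
        .filter(key -> key.startsWith(POOL_CONFIG_PREFIX))
        .map(key -> key.substring(POOL_CONFIG_PREFIX.length()))
        .map(key -> key.substring(0, key.indexOf(".") + 1))
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // One possible fix: cut at the dot, and skip keys that have no pool segment.
  static Set<String> poolNames(Map<String, String> properties) {
    return properties.keySet().stream()
        .filter(key -> key.startsWith(POOL_CONFIG_PREFIX))
        .map(key -> key.substring(POOL_CONFIG_PREFIX.length()))
        .filter(key -> key.indexOf('.') > 0)
        .map(key -> key.substring(0, key.indexOf('.')))
        .collect(Collectors.toCollection(TreeSet::new));
  }

  public static void main(String[] args) {
    Map<String, String> props = Map.of(
        "pool.default.thread-count", "10",
        "pool.snapshots-expiring.thread-count", "5");
    System.out.println(poolNamesWithTrailingDot(props)); // [default., snapshots-expiring.]
    System.out.println(poolNames(props)); // [default, snapshots-expiring]
  }
}
```

With the trailing dot kept, rebuilding a key as prefix + name + ".thread-count" yields the double-dot keys quoted above.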
org.apache.amoro.process.ExecuteEngine
    @Override
    public void run() {
      try {
        AmoroTable<?> amoroTable = tableRuntime.loadTable();
The problem is that the new scheduling path no longer preserves the old “run, then record cleanup time” behavior for snapshot expiration.
In the old implementation, SnapshotsExpiringExecutor.java executed tableMaintainer.expireSnapshots() synchronously. Only after that finished did PeriodicTableScheduler.java (line 125) update lastCleanTime and schedule the next run. So the interval was effectively measured from the end of the previous cleanup.
In the new path, ActionCoordinatorScheduler.java (line 103) only submits/registers a process and returns immediately. After that return, PeriodicTableScheduler still updates lastCleanTime right away, even though the real cleanup work has not finished yet. The actual cleanup now happens later in SnapshotsExpiringProcess.java (line 53).
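One way to restore the old "run, then record" ordering on an asynchronous path is to update the timestamp from a completion callback instead of right after submission. The sketch below is not Amoro code: CleanTimeOnCompletion, submitCleanup, and the lastCleanTime field are hypothetical stand-ins, and it only illustrates the ordering with a plain CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class CleanTimeOnCompletion {
  // Hypothetical stand-in for the per-table lastCleanTime; 0 means "never cleaned".
  final AtomicLong lastCleanTime = new AtomicLong(0L);

  // Submits the cleanup asynchronously. The timestamp is recorded only after
  // the cleanup finishes normally; thenRun is skipped if the cleanup throws.
  CompletableFuture<Void> submitCleanup(Runnable cleanup, ExecutorService pool) {
    return CompletableFuture.runAsync(cleanup, pool)
        .thenRun(() -> lastCleanTime.set(System.currentTimeMillis()));
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CleanTimeOnCompletion scheduler = new CleanTimeOnCompletion();
    CompletableFuture<Void> done =
        scheduler.submitCleanup(() -> { /* placeholder for the expiration work */ }, pool);
    // Submission returns immediately; the timestamp is set once the work is done.
    done.join();
    System.out.println(scheduler.lastCleanTime.get() > 0); // true
    pool.shutdown();
  }
}
```

Whether a failed run should also advance the timestamp is a design choice; this sketch leaves it unchanged on failure so the next scheduling round retries.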
Building on your observation — the async submission also introduces a state-loss issue in LocalExecutionEngine.getStatus().
getStatus() removes the Future from the map on terminal states (isDone/isCancelled), making it non-idempotent:
Call 1: future.isDone() == true → remove → SUCCESS
Call 2: future == null → KILLED (wrong!)
TableProcessExecutor polls getStatus() in a loop (line 107), so if any retry or concurrent access queries the same identifier twice after completion, it gets KILLED instead of the real result.
There's also a TOCTOU race between containsKey and get across cancelingInstances/activeInstances (lines 67-70), since the compound check-then-act isn't atomic even with ConcurrentHashMap.
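A possible shape for an idempotent status lookup is sketched below. It is not the LocalExecutionEngine implementation: the class, the Status enum values, and the two maps are hypothetical. The idea is to cache the terminal status before dropping the Future, and to read each map once rather than pairing containsKey with get.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class IdempotentStatus {
  // Hypothetical status values, not the engine's real enum.
  enum Status { RUNNING, SUCCESS, FAILED, CANCELED, UNKNOWN }

  private final ConcurrentHashMap<String, Future<?>> active = new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, Status> finished = new ConcurrentHashMap<>();

  void register(String id, Future<?> future) {
    active.put(id, future);
  }

  Status getStatus(String id) {
    Status cached = finished.get(id);
    if (cached != null) {
      return cached; // repeated calls after completion see the same result
    }
    Future<?> future = active.get(id); // single read, no containsKey + get pair
    if (future == null) {
      // Another thread may have just moved the entry; re-check the cache.
      return finished.getOrDefault(id, Status.UNKNOWN);
    }
    if (!future.isDone()) {
      return Status.RUNNING;
    }
    // Publish the terminal status first, then drop the Future.
    finished.putIfAbsent(id, resolve(future));
    active.remove(id, future);
    return finished.get(id);
  }

  private static Status resolve(Future<?> future) {
    if (future.isCancelled()) {
      return Status.CANCELED;
    }
    try {
      future.get(); // does not block: the future is already done
      return Status.SUCCESS;
    } catch (CancellationException e) {
      return Status.CANCELED;
    } catch (ExecutionException e) {
      return Status.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Status.FAILED;
    }
  }

  public static void main(String[] args) {
    IdempotentStatus engine = new IdempotentStatus();
    engine.register("p1", CompletableFuture.completedFuture("ok"));
    System.out.println(engine.getStatus("p1")); // SUCCESS
    System.out.println(engine.getStatus("p1")); // SUCCESS
  }
}
```

The finished map grows without bound in this sketch; a real engine would need an eviction policy or an explicit release call once the caller has consumed the result.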
It looks like In Later, when snapshot expiration is triggered, In other words, the new expire-snapshots path is effectively disabled due to initialization order. We probably need to initialize execute engines before injecting them into process factories, or re-inject them after engine initialization.
Why are the changes needed?
Close #xxx.
Brief change log
How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before making a pull request
Documentation