Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions maintainer/maintainer.go
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ import (
)

const (
periodEventInterval = time.Millisecond * 100
periodEventInterval = time.Second * 120
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Increasing periodEventInterval from 100ms to 120s is a drastic change that will significantly impact the system's responsiveness. This interval controls how frequently the maintainer calculates the global checkpoint and resends critical messages (such as bootstrap requests or barrier ACKs). A 2-minute delay in these operations will lead to excessive replication lag and extremely slow recovery from transient network issues or node failures. If this change was intended to reduce CPU overhead for large-scale changefeeds (e.g., 1 million tables), it should be made configurable or set to a more reasonable value like 1-5 seconds.

periodRedoInterval = time.Second * 1
Comment on lines +50 to 51
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

120s shared periodic tick can regress checkpoint freshness and resend recovery.

periodEventInterval now drives both checkpoint calculation (Line 655) and EventPeriod resend/housekeeping (Line 1230 + Line 1178). Moving it to 120s means both flows can lag by up to 2 minutes.

Suggested split of intervals
 const (
-	periodEventInterval = time.Second * 120
+	periodEventInterval      = time.Second * 120 // resend/housekeeping
+	checkpointCalcInterval   = time.Second       // checkpoint advancement cadence
 	periodRedoInterval  = time.Second * 1
 )
@@
-func (m *Maintainer) calCheckpointTs(ctx context.Context) {
-	ticker := time.NewTicker(periodEventInterval)
+func (m *Maintainer) calCheckpointTs(ctx context.Context) {
+	ticker := time.NewTicker(checkpointCalcInterval)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/maintainer.go` around lines 50 - 51, The single shared variable
periodEventInterval was increased to 120s but is used for both checkpoint
calculation and EventPeriod resend/housekeeping; split it into two clearly named
intervals (e.g., periodCheckpointInterval for the checkpoint calculation used in
the checkpoint function referenced at Line 655, and periodEventInterval for
EventPeriod resend/housekeeping used around Lines 1178 and 1230) and update all
references so checkpoint logic uses periodCheckpointInterval (shorter, e.g.,
original/near-real-time value) while EventPeriod resend/housekeeping continues
to use the longer periodEventInterval; keep periodRedoInterval unchanged and
ensure names (periodCheckpointInterval, periodEventInterval, periodRedoInterval)
match across maintainer.go and any functions that read them.

)

Expand Down Expand Up @@ -309,7 +309,7 @@ func (m *Maintainer) HandleEvent(event *Event) bool {
zap.Stringer("changefeedID", m.changefeedID),
zap.Int("eventType", event.eventType),
zap.Duration("duration", duration),
zap.Any("Message", event.message),
zap.Any("MessageType", event.message.Type),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using zap.Stringer is preferred here since event.message.Type (of type IOType) implements the fmt.Stringer interface. This ensures the log contains the human-readable string representation of the message type rather than its raw integer value, which is much more helpful for debugging slow event processing. Additionally, logging only the type instead of the entire message is a good improvement for performance when dealing with large table counts.

Suggested change
zap.Any("MessageType", event.message.Type),
zap.Stringer("MessageType", event.message.Type),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard event.message before logging MessageType in defer path.

event.message.Type can nil-deref in the slow-event logger. A small guard avoids panic in diagnostic code.

Nil-safe logging tweak
 			if event.eventType == EventMessage {
+				messageType := "nil"
+				if event.message != nil {
+					messageType = event.message.Type.String()
+				}
 				log.Info("maintainer is too slow",
 					zap.Stringer("changefeedID", m.changefeedID),
 					zap.Int("eventType", event.eventType),
 					zap.Duration("duration", duration),
-					zap.Any("MessageType", event.message.Type),
+					zap.String("messageType", messageType),
 				)
 			} else {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
zap.Any("MessageType", event.message.Type),
if event.eventType == EventMessage {
messageType := "nil"
if event.message != nil {
messageType = event.message.Type.String()
}
log.Info("maintainer is too slow",
zap.Stringer("changefeedID", m.changefeedID),
zap.Int("eventType", event.eventType),
zap.Duration("duration", duration),
zap.String("messageType", messageType),
)
} else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maintainer/maintainer.go` at line 312, In the defer slow-event logging path,
guard against nil event.message before accessing event.message.Type: update the
zap logging call that currently uses zap.Any("MessageType", event.message.Type)
to first check if event.message != nil and only include event.message.Type when
non-nil (otherwise log nil or omit the field) so the defer logger cannot
nil-deref; locate this change around the slow-event defer block and the usage of
event.message and "MessageType".

)
} else {
log.Info("maintainer is too slow",
Expand Down
Loading