
Feat: prevent other processes seeing missing intervals during restatement #5285

Merged
erindru merged 4 commits into main from erin/adjust-restatement-stage on Sep 11, 2025

Conversation

@erindru (Collaborator) commented Sep 3, 2025

This PR builds on #5273 and #5274

Currently, the RestatementStage clears intervals from state before the BackfillStage populates data.

This means that other processes that look at state while the restatement is running will see missing intervals and try to fill them, competing with the restatement process.

This PR adjusts the plan evaluation order to:

  • move RestatementStage to after BackfillStage
    • note that despite the name, RestatementStage doesn't actually perform restatement, it's just responsible for clearing intervals from state
  • adjust RestatementStage to clear intervals from all environments except prod

The result is that:

  • Intervals are never cleared from prod, so other processes never see missing intervals or compete with the current plan to fill them
  • Intervals are only cleared from dev environments if they were successfully restated in prod

There are some new failure modes:

  • If the plan fails during restatement, prod will be in an inconsistent state. It's expected that the user will run the plan again until it succeeds. Once it succeeds, prod will be back in a consistent state.
  • If a user deploys a new version of the model while the restatement is occurring (a side-effect of keeping restatements internal is that other plans don't know about them), there is a check for new snapshot versions of the restated models. If some are detected, they are output to the console before the plan fails with the usual conflict error:
[Screenshot From 2025-09-10 14-48-29]
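The reordering described above can be sketched roughly as follows. This is a minimal illustration with plain dicts standing in for SQLMesh state-sync records; the stage names mirror the PR, everything else is hypothetical:

```python
# Intervals recorded in state for each environment (illustrative shapes).
prod_intervals = {"model_a": [(0, 100)]}
dev_intervals = {"model_a": [(0, 100)]}

def backfill_stage():
    # Runs first: re-computes the restated data in prod. Prod intervals
    # are never cleared, so concurrent readers never observe gaps.
    prod_intervals["model_a"] = [(0, 100)]  # restated data now in place

def restatement_stage():
    # Runs second (after this PR): clears intervals only from non-prod
    # environments, and only once the prod restatement has succeeded.
    dev_intervals["model_a"] = []

for stage in (backfill_stage, restatement_stage):
    stage()
```

The key property of the new ordering is visible in the final state: prod intervals are intact throughout, while dev intervals are cleared only after the backfill completed.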

@erindru force-pushed the erin/adjust-restatement-stage branch from 4ff0abb to ac4e372 on September 5, 2025 00:29
@erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch from 042e3a1 to f33490a on September 7, 2025 23:54
@erindru force-pushed the erin/adjust-restatement-stage branch from 2e8a99a to 6a817e9 on September 8, 2025 00:30
@erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch from f33490a to 81bac93 on September 8, 2025 01:46
@erindru force-pushed the erin/adjust-restatement-stage branch from 6a817e9 to 907dc44 on September 8, 2025 01:54
@erindru marked this pull request as ready for review September 8, 2025 02:40

self.state_sync.remove_intervals(
-    snapshot_intervals=list(snapshot_intervals_to_restate),
+    snapshot_intervals=[(s.table_info, s.interval) for s in intervals_to_clear.values()],
@izeigerman (Collaborator) commented Sep 9, 2025

I believe we should not nuke intervals for snapshots that have changed while the restatement was running. As per our discussion earlier:

There's a chance that another snapshot sneaks into prod while the restatement is running but before we nuke the intervals. This snapshot's intervals will be nuked, but now we have an intermittent period during which prod is in an incorrect state. The subsequent run will, of course, catch up the missing intervals. However, this is confusing to the user, since their next run will now backfill models it wasn't supposed to. To prevent this, the restatement plan will now check whether the snapshots for restated models changed in prod while this plan was running. If some snapshots were changed, the intervals for those won't be removed and instead the plan will return an error stating that the restatement for these models wasn't successful. The user will then have a choice whether to reissue restatement for these models or accept that these models haven't been restated properly and move on.

Specifically

If some snapshots were changed, the intervals for those won’t be removed

@erindru (Collaborator Author):

Ah, good point, I glossed over that when I was focusing on identifying the affected snapshots to begin with.

I've updated it to only clear intervals for snapshots whose names are not in the "deployed during restatement" list
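The filtering described here can be sketched as follows. The record shapes are hypothetical stand-ins for SQLMesh snapshot table infos and interval tuples; only the filtering logic reflects the change:

```python
# Intervals the restatement plan would like to clear, keyed by model name
# (illustrative shapes, not the real SQLMesh objects).
intervals_to_clear = {
    '"db"."model_a"': ("table_info_a", (0, 100)),
    '"db"."model_b"': ("table_info_b", (0, 100)),
}
# Models whose prod snapshot changed while the restatement was running.
deployed_during_restatement = {'"db"."model_b"'}

# Only clear intervals for snapshots that were NOT redeployed
# mid-restatement; the rest surface as a conflict error instead.
safe_to_clear = [
    info
    for name, info in intervals_to_clear.items()
    if name not in deployed_during_restatement
]
```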

plan.environment.naming_info,
self.default_catalog,
)
# note: the plan will automatically fail at the promotion stage with a ConflictingPlanError because the environment was changed by another plan
@izeigerman (Collaborator) commented Sep 9, 2025

I see. Please note that this is not always going to be the case since we plan to support concurrent plan applications soon. So I'd suggest we fail explicitly and not rely on a failure downstream.

@izeigerman (Collaborator):

Additionally, we won't need log_models_updated_during_restatement since we can just put the entire message into exception.

@erindru (Collaborator Author):

I think we still need log_models_updated_during_restatement to have useful formatted console output about the affected models, rather than trying to jam all this info into an exception message.

However, I've moved the "what to do about it" part into the exception message and now raise an exception here instead of relying on the finalize stage throwing the error.

I've updated the screenshot in the PR description to give an indication of what it looks like.
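The explicit-failure approach described above can be sketched like this. The class and helper names echo the discussion, but the bodies are stand-ins, not SQLMesh's actual implementation:

```python
class ConflictingPlanError(Exception):
    """Stand-in for the conflict error the plan raises."""

def log_models_updated_during_restatement(names):
    # Rich, formatted console output about the affected models stays in a
    # dedicated logging helper rather than the exception message.
    for name in names:
        print(f"  restated model changed in prod: {name}")

def check_restatement_conflicts(updated_during_restatement):
    if updated_during_restatement:
        log_models_updated_during_restatement(updated_during_restatement)
        # The "what to do about it" guidance lives in the exception message,
        # and we fail here explicitly rather than relying on a downstream
        # stage to detect the environment change.
        raise ConflictingPlanError(
            "Some models were redeployed while the restatement was running; "
            "re-run the restatement for them or accept the current state."
        )
```

Failing at this point, rather than waiting for the promotion stage, keeps the behavior correct even once concurrent plan applications are supported.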

# allows us to print the physical names of the tables affected in the console output
# note that we can't use the DeployabilityIndex on the plan because it only includes
# snapshots for the current environment, not across all environments
deployability_index = DeployabilityIndex.create(
@izeigerman (Collaborator):

We should only do this if the target environment is not prod.

Actually, I believe we should always use DeployabilityIndex.all_deployable() since "restatement" doesn't apply to "dev" tables. For "dev" tables it's always a preview.

@erindru (Collaborator Author):

RestatementStage isn't even produced if the target is not prod, because as of this PR it only handles clearing intervals in environments other than prod, which is only necessary if the target is prod.

I think I see what you're saying: a prod restatement would never care about dev previews because that data has no chance of ever being deployed. So there is no need to build a deployability index, and in the console output we should just reference the deployable version of the model's physical table, which we can do with DeployabilityIndex.all_deployable()
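The design choice settled on here can be sketched as follows. DeployabilityIndex.all_deployable() is the real SQLMesh helper being discussed, but the class body and the table-resolution function below are illustrative stand-ins:

```python
class DeployabilityIndex:
    """Stand-in for SQLMesh's DeployabilityIndex (illustrative only)."""

    def __init__(self, deployable: bool):
        self._deployable = deployable

    @classmethod
    def all_deployable(cls) -> "DeployabilityIndex":
        # Treat every snapshot as deployable: a prod restatement never
        # cares about dev previews, so there is nothing to look up.
        return cls(deployable=True)

    def is_deployable(self, snapshot_name: str) -> bool:
        return self._deployable

def table_to_display(snapshot_name, deployable_table, preview_table, index):
    # With all_deployable(), console output always references the
    # deployable physical table, never a dev preview table.
    return deployable_table if index.is_deployable(snapshot_name) else preview_table

index = DeployabilityIndex.all_deployable()
```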

@erindru force-pushed the erin/fix-restatement-clear-across-all-environments branch 2 times, most recently from 1438970 to 924e59e on September 9, 2025 23:39
@erindru force-pushed the erin/adjust-restatement-stage branch 2 times, most recently from bc326d7 to 4f19430 on September 10, 2025 02:26
@erindru requested a review from izeigerman September 10, 2025 02:51
@erindru force-pushed the erin/adjust-restatement-stage branch from 4f19430 to e71adf6 on September 10, 2025 02:59
)


def model_display_name(
@erindru (Collaborator Author) commented Sep 10, 2025

I added this because I wanted to output model display names based on SnapshotIdAndVersion and the existing display_name function required an entire SnapshotTableInfo object just to read the node type and name.

If we already know it's a model then all we need to produce a display name is the snapshot name string, so model_display_name reduces the requirement to just this.

SnapshotIdLike would work as well in order to help narrow the inputs to still be related to Snapshots rather than any random str; let me know if this is preferred
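A rough sketch of the narrowed helper being described; the body here is illustrative (the real implementation lives in the PR), but it shows the point that a known model's display name needs only the name string, not a full SnapshotTableInfo:

```python
from typing import Optional

def model_display_name(node_name: str, default_catalog: Optional[str] = None) -> str:
    # Illustrative body: strip quoting and drop the default catalog, as
    # SQLMesh-style display names typically omit it.
    parts = node_name.replace('"', "").split(".")
    if default_catalog and parts[0] == default_catalog:
        parts = parts[1:]
    return ".".join(parts)
```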

@izeigerman (Collaborator) commented Sep 11, 2025

Should we just add display_name to SnapshotIdAndVersion for consistency? Then you no longer need to do isinstance check

@erindru (Collaborator Author) commented Sep 11, 2025

I didn't do that initially because SnapshotIdAndVersion didn't have enough information to implement SnapshotInfoMixin and I didn't want to duplicate the method signature for display_name.

However, I've added it, and I guess mypy will keep the method signatures in sync.

Base automatically changed from erin/fix-restatement-clear-across-all-environments to main September 10, 2025 21:08
@erindru force-pushed the erin/adjust-restatement-stage branch from e71adf6 to 6eaa6eb on September 10, 2025 21:26
version = snapshot.table_info.version
if (
prod_snapshot := promoted_snapshots_by_name.get(name)
) and prod_snapshot.version != version:
@izeigerman (Collaborator):

Isn't it sufficient to just check the version? I'm sure the version will match if the snapshots match

@erindru (Collaborator Author):

Resolved internally: this is checking the version; the := just got misread as =
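For readers who hit the same misread, a minimal sketch of what the walrus expression in the diff does; names mirror the diff context, the data is illustrative:

```python
# The walrus operator binds prod_snapshot AND lets the same expression
# guard the comparison: the branch runs only when the lookup succeeded
# and the prod version differs from the one this plan restated.
promoted_snapshots_by_name = {"model_a": {"version": "v2"}}
name, version = "model_a", "v1"

changed = False
if (
    prod_snapshot := promoted_snapshots_by_name.get(name)
) and prod_snapshot["version"] != version:
    # The model was promoted in prod under a different version while
    # the restatement was running.
    changed = True
```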

if isinstance(snapshot, SnapshotInfoMixin):
return snapshot.display_name(**naming_kwargs)

return model_display_name(node_name=snapshot.name, **naming_kwargs)
@izeigerman (Collaborator) left a review comment:

Few more comments. LGTM once addressed

@erindru force-pushed the erin/adjust-restatement-stage branch from 6eaa6eb to acad2ed on September 11, 2025 21:57
@erindru merged commit 26ebace into main Sep 11, 2025
36 checks passed
@erindru deleted the erin/adjust-restatement-stage branch September 11, 2025 23:10