-
Notifications
You must be signed in to change notification settings - Fork 378
Builtin Linting Rule: cronintervalalignment #5091
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 8 commits
a18a50f
c11510d
c2c1862
f6aee9e
c72d253
347e312
5f5f429
b6c7a69
5898932
bd895ba
1c80707
c316b80
5861dba
e839d4e
a5fa6b4
cc5276f
67b984f
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -49,6 +49,7 @@ | |
| "nomissingaudits", | ||
| "nomissingowner", | ||
| "nomissingexternalmodels", | ||
| "cronvalidator", | ||
| ], | ||
| ), | ||
| ) | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -274,4 +274,34 @@ def create_fix(self, model_name: str) -> t.Optional[Fix]: | |
| ) | ||
|
|
||
|
|
||
| class CronValidator(Rule): | ||
| """Upstream model has a cron expression with longer intervals than downstream model.""" | ||
|
|
||
| def check_model(self, model: Model) -> t.Optional[RuleViolation]: | ||
| placeholder_start_date = "2020-01-01 10:00:00" | ||
|
|
||
| this_model_cron_next = model.cron_next(placeholder_start_date) | ||
|
|
||
| for upstream_model_name in model.depends_on: | ||
| upstream_model = self.context.get_model(upstream_model_name) | ||
|
|
||
| # Skip if upstream model doesn't exist | ||
| if upstream_model is None: | ||
| continue | ||
|
|
||
| # Skip model kinds since they don't run on cron schedules | ||
| skip_kinds = ["EXTERNAL", "EMBEDDED", "SEED"] | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest using enum values instead of hardcoded strings (
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. updated |
||
| if upstream_model.kind.name in skip_kinds: | ||
| continue | ||
|
|
||
| upstream_model_cron_next = upstream_model.cron_next(placeholder_start_date) | ||
|
|
||
| if upstream_model_cron_next > this_model_cron_next: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I thought about this more and I don't think this is the right way to handle this. Imagine graph A (upstream) <- B (downstream) with crons
Both models are hourly, so technically the upstream's cron interval is not "longer" as the violation suggests. However, B will run before A which might not be the desired behavior. But if it was this instead:
Your check will return false given the current placeholder start date, because So if the length of the interval is what you're trying to capture here then you want to do the following instead: and move away from the placehoder date.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was actually my first approach, but it won't work in isolation because the For example, upstream_cron_interval_unit = IntervalUnit.from_cron(upstream_model.cron)
this_cron_interval_unit = IntervalUnit.from_cron(model.cron)
if upstream_cron_interval_unit.seconds > this_cron_interval_unit.seconds:
# violationNew Approach: Use an OR condition where if upstream model's next cron is greater OR its interval unit in seconds is greater, return a violation. Example in action for your first scenario: You'll notice the placeholder date implementation works as expected. This is NOT a rule violation because model B runs AFTER model A. A: 5 * * * * (run at 5th minute of every hour) (Pdb) upstream_cron_interval_unit.seconds
3600
(Pdb) this_cron_interval_unit.seconds
3600
(Pdb) upstream_model_cron_next
datetime.datetime(2020, 1, 1, 10, 5, tzinfo=datetime.timezone.utc)
(Pdb) this_model_cron_next
datetime.datetime(2020, 1, 1, 11, 0, tzinfo=datetime.timezone.utc)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note: it's important WHEN models are linted because when it's to deployed to prod determines the next run. For nuanced cron schedules, a model applied at a specific time can emit violation vs. not. |
||
| return self.violation( | ||
| f"Upstream model {upstream_model_name} has longer cron interval ({upstream_model.cron}) " | ||
| f"than this model ({model.cron})" | ||
| ) | ||
| return None | ||
|
|
||
|
|
||
| BUILTIN_RULES = RuleSet(subclasses(__name__, Rule, (Rule,))) | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the name be more specific? I feel like there can be multiple approaches to validating cron
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this?
CronAlignmentValidatorThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or
CronIntervalAlignmentThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the last one is good, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated