scheduler: don't wait on input dependencies when cff.Predicate returns false #105

Open
Saijayavinoth wants to merge 2 commits into main from predicate-early-dispatch

@Saijayavinoth


Closes #104

Problem

When cff.Predicate returns false, the task body is skipped — but the scheduler still waits for every declared input dependency to finish first. An unrelated slow input imposes its full latency on the consumer, even though the predicate=false result is what decides the skip and the input value is never read. See linked issue for reproducer.

Change

| Layer | What changes |
|---|---|
| API (scheduler) | PredicateJob{Run func(ctx)(bool,error); Dependencies []*ScheduledJob} + Scheduler.EnqueuePredicate. Re-exported as cff.PredicateJob. |
| Scheduler (3 sites in scheduler/scheduler.go) | Worker arm calls predicateRun; done-branch fast-dispatches consumers on !result && err==nil; enqueue handler does the same for consumers enqueued after the predicate completed (cached in ScheduledJob.predicateResult). |
| Codegen (predicate.go.tmpl) | Emits EnqueuePredicate(PredicateJob{}); Run sig changes from error to (bool, error). Consumer codegen byte-identical: its existing if !p<hash> { return nil } gate is untouched. |

Safety

Fast-dispatch appears to race the slow producer's write to the consumer's input value. It does not:

  • Skip gate precedes input read. Consumer wrapper is shaped if !p<hash> { return nil }; ... use(inputs...), so on predicate=false the function returns before any input is dereferenced.
  • Slow producer (the dep we no longer wait on). When it eventually completes — success or error — the standard notification loop decrements consumer.remaining past zero (to −1). The if remaining == 0 push-to-ready check no longer matches, so the consumer is not re-dispatched. Producer errors still propagate to the flow's error path via the normal mechanism.
  • Fast sibling producer errors. If a non-predicate input errors before fast-dispatch fires, the consumer is already marked invalid; fast-dispatch still pushes it to ready, and the worker's j.invalid check fires before any body code runs, surfacing the error as errJobInvalid — identical to pre-PR behavior under ContinueOnError.
  • Late-enqueue race. Predicate may complete before the consumer is enqueued. Handled by caching predicateResult on ScheduledJob in the done branch; enqueue handler reads it and fast-dispatches inline.

Verification

  • go test -race ./scheduler/... — passes, including new TestPredicateEarlyDispatch, TestPredicateTrueWaitsForDeps, TestPredicateError.

  • make test — 21/22 internal/tests/ packages green. The one failure (TestPanicRecovered) is a pre-existing Go-version stack-format issue; reproduces with this change reverted.

  • TestBlockingInputs_PredicateShortCircuits — asserts elapsed < slowDelay/10 for a 100ms slow dep.

  • BenchmarkPredicateFalseWithSlowDep (internal/tests/benchmark/), 5ms slow dep, AMD EPYC 7B13:

    | Variant | wall ns/op | consumer_ns/op |
    |---|---|---|
    | baseline | 5,226,110 | 5,219,513 |
    | this PR | 5,226,572 | 32,877 |

    Wall time is bounded by Wait() draining the slow producer, so it is effectively unchanged; consumer_ns/op is the metric of interest: ~159× reduction.

Out of scope (planned follow-up PR)

Fast-dispatch invalid consumers — same mechanism applied to consumers whose dep errored (j.invalid short-circuits without dereferencing inputs). Only meaningful under ContinueOnError; needs its own test coverage. Will open as a follow-up after this lands.

@codecov

codecov Bot commented May 18, 2026


Codecov Report

❌ Patch coverage is 69.23077% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.94%. Comparing base (4cdfe95) to head (9ab9f6e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| scheduler/scheduler.go | 69.23% | 4 Missing and 4 partials ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #105   +/-   ##
=======================================
  Coverage   67.94%   67.94%           
=======================================
  Files          32       32           
  Lines        2059     2084   +25     
=======================================
+ Hits         1399     1416   +17     
- Misses        636      640    +4     
- Partials       24       28    +4     


lint: fix .golangci.yml to pass on golangci-lint v1.64.x

Two changes, both surfacing because v1.64.x added strict `config verify`
that older versions silently bypassed:

- Fix niliness → nilness (typo in govet enable list since e66de90).
  Older golangci-lint silently ignored the unknown name; v1.64.x
  rejects it during config verification.

- Exclude fmt.Fprint/Fprintf/Fprintln from errcheck. Once config
  verify passes, errcheck actually runs and catches conventional
  fmt-write callsites where the returned error is intentionally
  ignored. Standard exclusion in Go projects.

When cff.Predicate returns false, the consumer's task body is skipped
via its existing 'if !p<hash> { return nil }' gate — but today the
scheduler still waits for every declared input dependency before that
gate runs, so unrelated slow inputs impose their full latency.

Introduce PredicateJob and Scheduler.EnqueuePredicate so the scheduler
can recognize predicate evaluations. When a PredicateJob completes
with PredicateResult=false, walk its consumers in the done branch,
set remaining=0, and push to ready immediately, bypassing the wait on
unrelated data dependencies. Late-enqueue race handled by caching the
predicate's bool on ScheduledJob.predicateResult so consumers
enqueued after the predicate completed can also fast-dispatch.

Safety: the consumer wrapper's skip gate structurally precedes any
read of data-dep values, so early-dispatching it races no producer
write. Existing tests pass unchanged; consumer codegen is
byte-identical.

Benchmark (AMD EPYC 7B13, 5ms slow dep):
- Before (early-dispatch disabled): 5,219,513 consumer_ns/op
- After  (early-dispatch enabled):     45,364 consumer_ns/op
~115x reduction in consumer dispatch latency.

Closes #104
@Saijayavinoth Saijayavinoth force-pushed the predicate-early-dispatch branch from ed87797 to 9ab9f6e Compare May 18, 2026 11:19
@Saijayavinoth Saijayavinoth requested a review from abhinav May 18, 2026 11:41
@Saijayavinoth Saijayavinoth self-assigned this May 18, 2026
@Saijayavinoth

Author

@abhinav PTAL

@abhinav

abhinav commented May 26, 2026

Collaborator

Apologies, @Saijayavinoth, I don't work at Uber anymore. Someone from Uber can likely review.


Development

Successfully merging this pull request may close these issues.

scheduler: task waits for slow input dependencies even when cff.Predicate=false
