Conversation

@smudge (Member) commented Jun 26, 2024

This ensures that exceptions raised in thread callback hooks are rescued and properly mark jobs as failed.

This is also a good opportunity to change the `num` argument (of `work_off(num)`) to mean the number of jobs (give or take a few, due to `max_claims`), not the number of iterations. Previously (before threading was introduced) I think it meant the number of jobs (though jobs and iterations were 1:1). I would not have done this before the refactor, because there was no guarantee that one of `success` or `failure` would be incremented (the thread might crash for many reasons). Now, we only increment `success` and treat `total - success` as the failure number when we return from the method.
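The revised semantics can be sketched with a minimal, self-contained simulation. The names here (`WorkerSketch`, `MAX_CLAIMS`, the queue of callables) are illustrative stand-ins, not the gem's actual API; the point is that `num` now bounds jobs worked, only successes are counted directly, and failures are derived on return.

```ruby
# Sketch only: `num` counts jobs (not iterations), success is the only
# counter incremented in-loop, and failure is derived as total - success.
class WorkerSketch
  MAX_CLAIMS = 5 # stand-in for the gem's max_claims setting

  def initialize(queue)
    @queue = queue # array of callables standing in for reservable jobs
  end

  def work_off(num)
    success = 0
    total = 0
    while total < num
      jobs = @queue.shift([num - total, MAX_CLAIMS].min)
      break if jobs.empty? # queue drained: stop rather than spin
      jobs.each do |job|
        total += 1
        success += 1 if run(job) # only success is tracked directly
      end
    end
    [success, total - success] # failure count is derived, not tracked
  end

  private

  def run(job)
    job.call
    true
  rescue StandardError
    false # job would be marked failed here
  end
end
```

For example, a queue of two good jobs and one raising job reports `[2, 1]` even though no failure counter was ever incremented in the loop.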

Fixes #23 and #41

This is also a prereq for a resolution I'm cooking up for #36

smudge added 2 commits June 26, 2024 18:26
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

Fixes Betterment#23 and Betterment#41
(like in pre-threading implementation)
```ruby
pool = Concurrent::FixedThreadPool.new(jobs.length)
jobs.each do |job|
  pool.post do
    run_thread_callbacks(job) do
```
@smudge (Member Author) commented:

The explanation might make more sense if I put it here next to the related code.

Previously, if `run_thread_callbacks` crashed for any reason, neither `success` nor `failure` would be incremented. Furthermore, the job cleanup (down at the end of `run(job)`) would never occur, so `attempts` would not be incremented and `run_at` would not be bumped, making the job immediately available for pickup again.

This is what I mean by the "spinloop". A job gets picked up, its thread crashes, no cleanup occurs, it gets immediately picked up again, and so on. Pushing the `run_thread_callbacks` call into `run_job` allows us to clean up the job if, e.g., it fails to deserialize, or if one of our callbacks fails to connect to a secondary resource (as was the case in #41).
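A self-contained sketch of the shape of this fix (the method name `run_job` mirrors the discussion, but the bodies are illustrative stand-ins, not the gem's implementation): the callback hook now runs inside the rescued path, so a hook that raises still records the error and bumps `attempts`/`run_at` instead of leaving the job untouched.

```ruby
Job = Struct.new(:attempts, :run_at, :error)

# Sketch: the hook (stand-in for run_thread_callbacks + perform) executes
# inside the rescue, so a crashing hook still triggers job cleanup.
def run_job(job, hook:)
  hook.call(job)
  true # success
rescue StandardError => e
  job.error = e
  job.attempts += 1         # cleanup still happens on hook failure...
  job.run_at = Time.now + 5 # ...so the job is not immediately re-claimable
  false
end
```

With this shape, a hook that fails to connect to a secondary resource (as in #41) produces a failed job with a future `run_at`, rather than a spinloop.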


```ruby
job.error = e
failed(job)
false # work failed
```
@smudge (Member Author) commented:

This was a bug that improperly reported deserialization errors in the success number, which only impacted the logging output.

```diff
   fail-fast: false
   matrix:
-    ruby: ['2.6', '2.7', '3.0', '3.1', '3.2']
+    ruby: ['2.7', '3.0', '3.1', '3.2']
```
@smudge (Member Author) commented:

We can make this more formal, but I wanted to unblock this build for now, without the linter churn that comes from actually changing the minimum supported Ruby.

@samandmoore (Member) left a comment:

Change looks good as a solve for the issue. Just one question about the other behavior change!

```diff
 total = 0

-num.times do
+while total < num
```
@samandmoore (Member) commented:

So I'm not sure how likely it is to matter, but this change means that we will keep looping until we see `num` jobs, which means that if the queue becomes empty before we hit `num`, we will keep looping, right?

Is that okay or desired?

@smudge (Member Author) replied:

There's a `break if empty?` below that covers that case as well, so we should only ever continue the loop if there are jobs being returned in the query (and that's consistent with the way it worked pre-threading, too).

@samandmoore (Member) replied:

ah! i missed that. excellent.

@samandmoore (Member) left a comment:

domainlgtm

@samandmoore (Member) left a comment:

platformlgtm

@smudge smudge merged commit fd97a38 into Betterment:main Jun 27, 2024
@smudge smudge deleted the fix-spinloop branch June 27, 2024 16:35
smudge added a commit that referenced this pull request Dec 18, 2024
… for now) (#48)

This is related to #41 and #42 insofar as a kind of DB-bound "spinloop"
is still possible if a worker picks up jobs that take so little time
that the worker immediately turns around and asks for more. As of now,
there has been no way to tune the amount of time a worker should wait in
between _successful_ iterations of its run loop.

This introduces a configuration (`min_reserve_interval`) specifying a
minimum number of seconds (default: 0, as this is not a major release)
that a worker should wait in between _successful_ job reserve queries.
An existing config (`sleep_delay`) is still used to define the number of
seconds (default: 5) that a worker should wait in between _unsuccessful_
job reserve attempts (i.e. the queue is empty).

The job execution time is subtracted from `min_reserve_interval` when
the worker sleeps, and if jobs take longer than `min_reserve_interval` to
complete, the worker will not sleep at all before the next reserve query.
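The sleep math described above can be sketched as follows (assumed from the prose, not the gem's exact code; the helper name `reserve_sleep` is hypothetical): elapsed execution time is subtracted from the configured minimum, clamped at zero so slow batches never add extra sleep.

```ruby
# Sketch: how long to sleep before the next reserve query, given the
# configured minimum interval and the time already spent working jobs.
def reserve_sleep(min_reserve_interval, elapsed)
  [min_reserve_interval - elapsed, 0].max
end
```

So a 0.5s batch under a 2s minimum sleeps 1.5s, while a 3s batch sleeps not at all.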

/no-platform
smudge added a commit to smudge/delayed that referenced this pull request Apr 2, 2025
PR Betterment#42 inadvertently flipped the order of `:thread` and `:perform`, and
also pushed `:thread` far enough in that cleanup steps would happen
after the `with_connection`'s `end` block in the connection plugin.

This introduces the possibility of thread safety issues, or connections
being held longer than intended (exhausting the connection pool /
connection limits).
smudge added a commit that referenced this pull request Apr 2, 2025
/no-platform
/domain @argvniyx-enroute @mavenraven 

PR #42 inadvertently flipped the order of `:thread` and `:perform`, and also pushed `:thread` far enough in that cleanup steps would happen after the `with_connection`'s `end` block in the connection plugin.

This introduces the possibility of thread safety issues, or connections being held longer than intended (exhausting the connection pool / connection limits).
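The ordering concern can be illustrated with a simplified event log (this is not the gem's real plugin API; `with_connection` here is a stand-in lambda): cleanup that runs inside the `with_connection` block still holds the connection, whereas cleanup pushed outside it would run after the connection is checked back in.

```ruby
events = []

# Stand-in for the connection plugin's with_connection wrapper.
with_connection = lambda do |&blk|
  begin
    events << :checkout
    blk.call
  ensure
    events << :checkin # connection returns to the pool here
  end
end

# Correct nesting: the :thread lifecycle (including its cleanup steps)
# completes entirely inside the connection scope.
with_connection.call do
  events << :thread_start
  events << :perform
  events << :thread_cleanup # still holds the connection
end
```

If the cleanup step were instead emitted after the `with_connection.call` block, it would land after `:checkin`, which is the hazard the commit message describes.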

Successfully merging this pull request may close these issues.

Difference in error behavior when a job is undefined

2 participants