Conversation

@smudge (Member) commented Jun 26, 2024

This ensures that exceptions raised in thread callback hooks are rescued and properly mark jobs as failed.

This is also a good opportunity to change the `num` argument (of `work_off(num)`) to mean the number of jobs (give or take a few, due to `max_claims`), not the number of iterations. Previously (before threading was introduced) I think it meant the number of jobs (though jobs and iterations were 1:1). I would not have done this before the refactor, because there was no guarantee that one of `success` or `failure` would be incremented (the thread might crash for many reasons). Now, we only increment `success` and treat `total - success` as the failure number when we return from the method.
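The revised semantics can be sketched with a minimal, self-contained simulation. The names here (`WorkerSketch`, `MAX_CLAIMS`, the queue of callables) are illustrative stand-ins, not the gem's actual API; the point is that `num` now bounds jobs worked, only successes are counted directly, and failures are derived on return.

```ruby
# Sketch only: `num` counts jobs (not iterations), success is the only
# counter incremented in-loop, and failure is derived as total - success.
class WorkerSketch
  MAX_CLAIMS = 5 # stand-in for the gem's max_claims setting

  def initialize(queue)
    @queue = queue # array of callables standing in for reservable jobs
  end

  def work_off(num)
    success = 0
    total = 0
    while total < num
      jobs = @queue.shift([num - total, MAX_CLAIMS].min)
      break if jobs.empty? # queue drained: stop rather than spin
      jobs.each do |job|
        total += 1
        success += 1 if run(job) # only success is tracked directly
      end
    end
    [success, total - success] # failure count is derived, not tracked
  end

  private

  def run(job)
    job.call
    true
  rescue StandardError
    false # job would be marked failed here
  end
end
```

For example, a queue of two good jobs and one raising job reports `[2, 1]` even though no failure counter was ever incremented in the loop.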

Fixes #23 and #41

This is also a prereq for a resolution I'm cooking up for #36

smudge added 2 commits June 26, 2024 18:26
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

Fixes Betterment#23 and Betterment#41
(like in pre-threading implementation)
```ruby
pool = Concurrent::FixedThreadPool.new(jobs.length)
jobs.each do |job|
  pool.post do
    run_thread_callbacks(job) do
```
@smudge (Member Author) commented:

The explanation might make more sense if I put it here next to the related code.

Previously, if `run_thread_callbacks` crashed for any reason, neither `success` nor `failure` would be incremented. Furthermore, the job cleanup (down at the end of `run(job)`) would never occur, so `attempts` would not be incremented and `run_at` would not be bumped, making the job immediately available for pickup again.

This is what I mean by the "spinloop". A job gets picked up, its thread crashes, no cleanup occurs, it gets immediately picked up again, and so on. Pushing the `run_thread_callbacks` call into `run_job` allows us to clean up the job if, e.g., it fails to deserialize, or if one of our callbacks fails to connect to a secondary resource (as was the case in #41).
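A self-contained sketch of the shape of this fix (the method name `run_job` mirrors the discussion, but the bodies are illustrative stand-ins, not the gem's implementation): the callback hook now runs inside the rescued path, so a hook that raises still records the error and bumps `attempts`/`run_at` instead of leaving the job untouched.

```ruby
Job = Struct.new(:attempts, :run_at, :error)

# Sketch: the hook (stand-in for run_thread_callbacks + perform) executes
# inside the rescue, so a crashing hook still triggers job cleanup.
def run_job(job, hook:)
  hook.call(job)
  true # success
rescue StandardError => e
  job.error = e
  job.attempts += 1         # cleanup still happens on hook failure...
  job.run_at = Time.now + 5 # ...so the job is not immediately re-claimable
  false
end
```

With this shape, a hook that fails to connect to a secondary resource (as in #41) produces a failed job with a future `run_at`, rather than a spinloop.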


```ruby
job.error = e
failed(job)
false # work failed
```
@smudge (Member Author) commented:

This was a bug that improperly reported deserialization errors in the success number, which only impacted the logging output.

```diff
   fail-fast: false
   matrix:
-    ruby: ['2.6', '2.7', '3.0', '3.1', '3.2']
+    ruby: ['2.7', '3.0', '3.1', '3.2']
```
@smudge (Member Author) commented:

We can make this more formal, but I wanted to unblock this build for now, without the linter churn that comes from actually changing the minimum supported Ruby.

@samandmoore (Member) left a comment:

Change looks good as a solve for the issue. Just one question about the other behavior change!

```diff
 total = 0

-num.times do
+while total < num
```
@samandmoore (Member) commented:

So I'm not sure how likely it is to matter, but this change means that we will keep looping until we see `num` jobs, which means that if the queue becomes empty before we hit `num`, we will keep looping, right?

Is that okay or desired?

@smudge (Member Author) replied:

There's a `break if empty?` below that covers that case as well, so we should only ever continue the loop if there are jobs being returned in the query (and that's consistent with the way it worked pre-threading, too).

@samandmoore (Member) replied:

ah! i missed that. excellent.

@samandmoore (Member) left a comment:

domainlgtm

@samandmoore (Member) left a comment:

platformlgtm

@smudge smudge merged commit fd97a38 into Betterment:main Jun 27, 2024
@smudge smudge deleted the fix-spinloop branch June 27, 2024 16:35
smudge added a commit that referenced this pull request Dec 18, 2024
… for now) (#48)

This is related to #41 and #42 insofar as a kind of DB-bound "spinloop"
is still possible if a worker picks up jobs that take so little time
that the worker immediately turns around and asks for more. As of now,
there has been no way to tune the amount of time a worker should wait in
between _successful_ iterations of its run loop.

This introduces a configuration (`min_reserve_interval`) specifying a
minimum number of seconds (default: 0, as this is not a major release)
that a worker should wait in between _successful_ job reserve queries.
An existing config (`sleep_delay`) is still used to define the number of
seconds (default: 5) that a worker should wait in between _unsuccessful_
job reserve attempts (i.e. the queue is empty).

The job execution time is subtracted from `min_reserve_interval` when
the worker sleeps, and if jobs take longer than `min_reserve_interval` to
complete, the worker will not sleep at all before the next reserve query.
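The sleep math described above can be sketched as follows (assumed from the prose, not the gem's exact code; the helper name `reserve_sleep` is hypothetical): elapsed execution time is subtracted from the configured minimum, clamped at zero so slow batches never add extra sleep.

```ruby
# Sketch: how long to sleep before the next reserve query, given the
# configured minimum interval and the time already spent working jobs.
def reserve_sleep(min_reserve_interval, elapsed)
  [min_reserve_interval - elapsed, 0].max
end
```

So a 0.5s batch under a 2s minimum sleeps 1.5s, while a 3s batch sleeps not at all.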

/no-platform
smudge added a commit to smudge/delayed that referenced this pull request Apr 2, 2025
PR Betterment#42 inadvertently flipped the order of `:thread` and `:perform`, and
also pushed `:thread` far enough in that cleanup steps would happen
after the `with_connection`'s `end` block in the connection plugin.

This introduces the possibility of thread safety issues, or connections
being held longer than intended (exhausting the connection pool /
connection limits).
smudge added a commit that referenced this pull request Apr 2, 2025
/no-platform
/domain @argvniyx-enroute @mavenraven 

PR #42 inadvertently flipped the order of `:thread` and `:perform`, and also pushed `:thread` far enough in that cleanup steps would happen after the `with_connection`'s `end` block in the connection plugin.

This introduces the possibility of thread safety issues, or connections being held longer than intended (exhausting the connection pool / connection limits).
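The ordering concern can be illustrated with a simplified event log (this is not the gem's real plugin API; `with_connection` here is a stand-in lambda): cleanup that runs inside the `with_connection` block still holds the connection, whereas cleanup pushed outside it would run after the connection is checked back in.

```ruby
events = []

# Stand-in for the connection plugin's with_connection wrapper.
with_connection = lambda do |&blk|
  begin
    events << :checkout
    blk.call
  ensure
    events << :checkin # connection returns to the pool here
  end
end

# Correct nesting: the :thread lifecycle (including its cleanup steps)
# completes entirely inside the connection scope.
with_connection.call do
  events << :thread_start
  events << :perform
  events << :thread_cleanup # still holds the connection
end
```

If the cleanup step were instead emitted after the `with_connection.call` block, it would land after `:checkin`, which is the hazard the commit message describes.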

Successfully merging this pull request may close these issues.

Difference in error behavior when a job is undefined

2 participants