-
Notifications
You must be signed in to change notification settings - Fork 0
Reenqueue unparseable job #39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
d0354ec to
c2310b7
Compare
| -- | A default implementation for ccOnFailedToFetchJob, | ||
| -- when the parsing of the row should never fail. | ||
| -- This will create a logAttention and reenqueue for the next day. | ||
| shouldNotFail :: MonadLog m => String -> idx -> m Action |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to defaultOnFailedToFetchJob and add logging for the index please.
| -- ^ Fields needed to be selected from the jobs table in order to assemble a | ||
| -- job. | ||
| , ccJobFetcher :: !(row -> job) | ||
| , ccJobFetcher :: !(row -> Either (idx, String) job) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| , ccJobFetcher :: !(row -> Either (idx, String) job) | |
| , ccJobFetcher :: !(row -> Either (idx, Text) job) |
| defaultOnFailedToFetchJob :: (MonadLog m, Show idx) => Text -> idx -> m Action | ||
| defaultOnFailedToFetchJob msg idx = do | ||
| logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx] | ||
| pure . RerunAfter $ ihours 48 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick on API: take an Interval instead of defaulting to a magic value. This way users of the library need to think about what makes sense for their job, one size does not fit all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky, because this function is intended to be used only when you know that your job should always be parsed and could not fail. If I force users to think, it kinds of defeat its purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While defaulting to a magic value is fine, I'd prefer idays 1 so you can see in logs every day something's up.
jonathanjouty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consumers_job_execution_seconds[...] despite the misleading name
Yes, there's a bit more info in our internal docs here, I should probably document it here too 🤔
I didn't find a way to plug this to
consumer-monitoring, unfortunately.
You could wrap-around ccOnFailedToFetchJob here (in addition to ccProcessJob)?
https://github.com/scrive/consumers/blob/master/consumers-metrics-prometheus/src/Database/PostgreSQL/Consumers/Instrumented.hs#L238
We could add reporting into consumers_job_execution_seconds with 0 seconds and a new job_result label (hack), or add a dedicated counter for these failures (better).
| -- ^ Action taken if fetching a job failed. It is advised to reenqueue the | ||
| -- job at a later date and emit a warning in such a case. This is mostly | ||
| -- to ensure the application using consumers won't fail completely when | ||
| -- this happens. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Add 'defaultOnFailedToFetchJob' Haddock link.
consumers/test/Test.hs
Outdated
| , ccNotificationChannel = Just "consumers_test_chan" | ||
| , -- select some small timeout | ||
| ccNotificationTimeout = 100 * 1000 -- 100 msec | ||
| , ccOnFailedToFetchJob = \_ _ -> pure . RerunAfter $ idays 14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hypothetically, what happens if the test fails to parse a job? This silently reschedules and tests pass? I don't remember the test rig enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. (This is very hypothetical though, I don't see how it could happen: the payload is text; unless there's a slight discrepancy between the text formatting between PG and haskell, it's impossible for this to fail. This feature is really more meant for JSON deserialisation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the test job can't fail to parse, just use defaultOnFailedToFetchJob here.
d546f1a to
712e537
Compare
arybczak
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix CI.
| , "FOR UPDATE SKIP LOCKED" | ||
| ] | ||
| stuckJobs <- fetchMany ccJobFetcher | ||
| stuckJobs <- rights <$> fetchMany ccJobFetcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this ignoring LeftS? Below code uses index only, you get an index with a Left 🤔
| , "WHERE id IN (" <> reservedJobs now <> ")" | ||
| , "RETURNING" <+> mintercalate ", " ccJobSelectors | ||
| ] | ||
| -- Decode lazily as we want the transaction to be as short as possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this removed? The comment is still accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a diff.
| pure (batchSize > 0) | ||
|
|
||
| reserveJobs :: Int -> m ([job], Int) | ||
| reserveJobs :: MonadCatch m => Int -> m ([Either (idx, T.Text) job], Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need MonadCatch here, it's a top level m with MonadMask constraint.
|
|
||
| import Control.Exception (SomeException) | ||
| import Data.Aeson.Types qualified as A | ||
| import Data.Text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Import qualified or Text type only (I think this works).
| defaultOnFailedToFetchJob :: (MonadLog m, Show idx) => Text -> idx -> m Action | ||
| defaultOnFailedToFetchJob msg idx = do | ||
| logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx] | ||
| pure . RerunAfter $ ihours 48 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While defaulting to a magic value is fine, I'd prefer idays 1 so you can see in logs every day something's up.
consumers/test/Test.hs
Outdated
| , ccNotificationChannel = Just "consumers_test_chan" | ||
| , -- select some small timeout | ||
| ccNotificationTimeout = 100 * 1000 -- 100 msec | ||
| , ccOnFailedToFetchJob = \_ _ -> pure . RerunAfter $ idays 14 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the test job can't fail to parse, just use defaultOnFailedToFetchJob here.
| -- This will create a logAttention and reenqueue, to be replayed in 2 days. | ||
| defaultOnFailedToFetchJob :: (MonadLog m, Show idx) => Text -> idx -> m Action | ||
| defaultOnFailedToFetchJob msg idx = do | ||
| logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx] | |
| logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "job_id" A..= show idx] |
|
CI is fixed and I addressed all your concerns (hopefully). |
|
|
||
| -- | A default implementation for ccOnFailedToFetchJob, | ||
| -- when the parsing of the row should never fail. | ||
| -- This will create a logAttention and reenqueue, to be replayed in 2 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 day
| , "WHERE id IN (" <> reservedJobs now <> ")" | ||
| , "RETURNING" <+> mintercalate ", " ccJobSelectors | ||
| ] | ||
| -- Decode lazily as we want the transaction to be as short as possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still a diff.
| action <- | ||
| lift $ | ||
| either | ||
| (\(idx, t) -> ccOnFailedToFetchJob t idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That reminded me that we probably want to give ccOnFailedToFetchJob the same treatment we give to ccOnException here, i.e. have a safety net in case it itself throws due to a bug.
4b08bf6 to
45c29c5
Compare
45c29c5 to
6f76b3c
Compare
Proposal to ensure that an application running several different consumers doesn't crash if one of the job is unparseable for some reason.
I didn't find a way to plug this to
consumer-monitoring, unfortunately. I guess warning on a high number of failure inconsumers_job_execution_seconds, which despite the misleading name, includes the job_result is still the best way to go to detect this kind of issues. Or perhaps this is a good case for usinglogAttention.