Reenqueue unparseable job #39

Raveline · 2025-09-26T15:11:41Z

Proposal to ensure that an application running several different consumers doesn't crash if one of the jobs is unparseable for some reason.

I didn't find a way to plug this into consumer-monitoring, unfortunately. I guess warning on a high number of failures in consumers_job_execution_seconds (which, despite the misleading name, includes the job_result) is still the best way to detect this kind of issue. Or perhaps this is a good case for using logAttention.

arybczak · 2025-11-27T12:32:33Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

+-- | A default implementation for ccOnFailedToFetchJob,
+-- when the parsing of the row should never fail.
+-- This will create a logAttention and reenqueue for the next day.
+shouldNotFail :: MonadLog m => String -> idx -> m Action


Rename to defaultOnFailedToFetchJob and add logging for the index please.

arybczak · 2025-11-27T12:36:41Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

  -- ^ Fields needed to be selected from the jobs table in order to assemble a
  -- job.
-  , ccJobFetcher :: !(row -> job)
+  , ccJobFetcher :: !(row -> Either (idx, String) job)


Suggested change

, ccJobFetcher :: !(row -> Either (idx, String) job)

, ccJobFetcher :: !(row -> Either (idx, Text) job)
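To make the new shape concrete, here is a minimal sketch of a fetcher that reports a parse failure as `Left (idx, message)` instead of throwing. The `Job` and `Row` types and the `fetchJob` name are hypothetical and only serve the illustration; a real fetcher would more likely decode JSON.

```haskell
import Data.Int (Int64)
import Data.Text (Text)
import qualified Data.Text as T
import Text.Read (readMaybe)

-- Hypothetical job type, for illustration only.
data Job = Job
  { jobId :: Int64
  , jobCount :: Int
  }
  deriving (Eq, Show)

-- Hypothetical row shape: the job id plus a raw textual payload.
type Row = (Int64, Text)

-- A fetcher in the proposed shape: on a parse failure it returns the
-- job index together with an error message, so the caller can reschedule.
fetchJob :: Row -> Either (Int64, Text) Job
fetchJob (idx, payload) =
  case readMaybe (T.unpack payload) of
    Just n -> Right (Job idx n)
    Nothing -> Left (idx, T.pack "cannot parse payload: " <> payload)
```

Keeping the index in the `Left` is what allows the failed row to be rescheduled rather than dropped.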

jonathanjouty · 2025-11-27T13:55:41Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

+defaultOnFailedToFetchJob :: (MonadLog m, Show idx) => Text -> idx -> m Action
+defaultOnFailedToFetchJob msg idx = do
+  logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx]
+  pure . RerunAfter $ ihours 48


Nitpick on API: take an Interval instead of defaulting to a magic value. This way users of the library need to think about what makes sense for their job, one size does not fit all.

This is tricky, because this function is intended to be used only when you know that your job should always parse and cannot fail. If I force users to think, it kind of defeats its purpose.

While defaulting to a magic value is fine, I'd prefer idays 1 so you can see in logs every day something's up.
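Both positions in this thread can coexist: a variant that takes the delay explicitly, and a default built on top of it. The sketch below assumes local stand-ins for `Interval`, `idays` and `Action`, a logging callback in place of `logAttention`, and a hypothetical name `onFailedToFetchJobAfter`; it is not the signature proposed in the diff.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Local stand-ins for the library's Interval and Action types (assumptions).
newtype Interval = Hours Int
  deriving (Eq, Show)

idays :: Int -> Interval
idays d = Hours (24 * d)

data Action = RerunAfter Interval
  deriving (Eq, Show)

-- Hypothetical variant: the caller chooses the delay.
onFailedToFetchJobAfter
  :: (Monad m, Show idx)
  => (Text -> m ()) -- logging callback, standing in for logAttention
  -> Interval
  -> Text
  -> idx
  -> m Action
onFailedToFetchJobAfter logMsg delay msg idx = do
  logMsg (T.pack "Unexpected unparseable job " <> T.pack (show idx) <> T.pack ": " <> msg)
  pure (RerunAfter delay)

-- The default fixes the delay to one day, so callers need not pick one.
defaultOnFailedToFetchJob
  :: (Monad m, Show idx) => (Text -> m ()) -> Text -> idx -> m Action
defaultOnFailedToFetchJob logMsg = onFailedToFetchJobAfter logMsg (idays 1)
```

Users who know their job cannot fail to parse take the default; everyone else passes their own interval.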

jonathanjouty

consumers_job_execution_seconds [...] despite the misleading name

Yes, there's a bit more info in our internal docs here, I should probably document it here too 🤔

I didn't find a way to plug this to consumer-monitoring, unfortunately.

You could wrap ccOnFailedToFetchJob here (in addition to ccProcessJob)?
https://github.com/scrive/consumers/blob/master/consumers-metrics-prometheus/src/Database/PostgreSQL/Consumers/Instrumented.hs#L238

We could add reporting into consumers_job_execution_seconds with 0 seconds and a new job_result label (hack), or add a dedicated counter for these failures (better).
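A rough sketch of the dedicated-counter option, assuming an `IORef Int` as a stand-in for a real metrics counter and a hypothetical wrapper name `countingFailures`. It only shows the wrapping pattern, not the Prometheus integration.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- Wrap any failure handler so that each call bumps a counter first.
-- The IORef stands in for a real metrics counter (assumption).
countingFailures
  :: IORef Int -> (msg -> idx -> IO action) -> msg -> idx -> IO action
countingFailures counter handler msg idx = do
  atomicModifyIORef' counter (\n -> (n + 1, ()))
  handler msg idx

-- Small demo: two failed fetches are counted, the handler result is unchanged.
demo :: IO Int
demo = do
  counter <- newIORef 0
  let handler :: String -> Int -> IO String
      handler _ _ = pure "rerun"
      wrapped = countingFailures counter handler
  _ <- wrapped "parse error" 1
  _ <- wrapped "parse error" 2
  readIORef counter
```

The wrapper leaves the user's handler untouched, which mirrors how ccProcessJob is instrumented by wrapping rather than by changing the job code.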

jonathanjouty · 2025-11-27T13:58:02Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

+  -- ^ Action taken if fetching a job failed. It is advised to reenqueue the
+  -- job at a later date and emit a warning in such a case. This is mostly
+  -- to ensure the application using consumers won't fail completely when
+  -- this happens.


Nitpick: Add 'defaultOnFailedToFetchJob' Haddock link.

jonathanjouty · 2025-11-27T13:59:38Z

consumers/test/Test.hs

        , ccNotificationChannel = Just "consumers_test_chan"
        , -- select some small timeout
          ccNotificationTimeout = 100 * 1000 -- 100 msec
+        , ccOnFailedToFetchJob = \_ _ -> pure . RerunAfter $ idays 14


Hypothetically, what happens if the test fails to parse a job? This silently reschedules and tests pass? I don't remember the test rig enough.

Yes. (This is very hypothetical though, I don't see how it could happen: the payload is text; unless there's a slight discrepancy in text formatting between PG and Haskell, it's impossible for this to fail. This feature is really meant more for JSON deserialisation.)

Since the test job can't fail to parse, just use defaultOnFailedToFetchJob here.

arybczak

Please fix CI.

arybczak · 2025-12-17T19:15:19Z

consumers/src/Database/PostgreSQL/Consumers/Components.hs

            , "FOR UPDATE SKIP LOCKED"
            ]
-        stuckJobs <- fetchMany ccJobFetcher
+        stuckJobs <- rights <$> fetchMany ccJobFetcher


Why is this ignoring Lefts? The code below uses only the index, and you get an index with a Left 🤔
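For illustration, a minimal sketch of keeping the index from both sides of the `Either` instead of dropping the `Left`s with `rights`. The `stuckIndices` name and the `jobIndex` parameter are assumptions standing in for whatever the surrounding code uses to get an index from a job.

```haskell
-- Collect the index of every stuck row, whether or not it parsed.
-- A Left already carries the index; a Right yields it via jobIndex.
stuckIndices :: (job -> idx) -> [Either (idx, err) job] -> [idx]
stuckIndices jobIndex = map (either fst jobIndex)
```

With this shape, unparseable stuck jobs are handled the same way as parseable ones.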

arybczak · 2025-12-17T19:17:00Z

consumers/src/Database/PostgreSQL/Consumers/Components.hs

            , "WHERE id IN (" <> reservedJobs now <> ")"
            , "RETURNING" <+> mintercalate ", " ccJobSelectors
            ]
-      -- Decode lazily as we want the transaction to be as short as possible.


Why was this removed? The comment is still accurate.

There's still a diff.

arybczak · 2025-12-17T19:18:26Z

consumers/src/Database/PostgreSQL/Consumers/Components.hs

      pure (batchSize > 0)

-    reserveJobs :: Int -> m ([job], Int)
+    reserveJobs :: MonadCatch m => Int -> m ([Either (idx, T.Text) job], Int)


I don't think you need MonadCatch here, it's a top level m with MonadMask constraint.

arybczak · 2025-12-17T19:18:53Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs


 import Control.Exception (SomeException)
 import Data.Aeson.Types qualified as A
+import Data.Text


Import qualified or Text type only (I think this works).

arybczak · 2025-12-17T19:19:52Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

+defaultOnFailedToFetchJob :: (MonadLog m, Show idx) => Text -> idx -> m Action
+defaultOnFailedToFetchJob msg idx = do
+  logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx]
+  pure . RerunAfter $ ihours 48


While defaulting to a magic value is fine, I'd prefer idays 1 so you can see in logs every day something's up.

arybczak · 2025-12-17T19:20:37Z

consumers/test/Test.hs

        , ccNotificationChannel = Just "consumers_test_chan"
        , -- select some small timeout
          ccNotificationTimeout = 100 * 1000 -- 100 msec
+        , ccOnFailedToFetchJob = \_ _ -> pure . RerunAfter $ idays 14


Since the test job can't fail to parse, just use defaultOnFailedToFetchJob here.

arybczak · 2025-12-17T19:22:49Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

+-- This will create a logAttention and reenqueue, to be replayed in 2 days.
+defaultOnFailedToFetchJob :: (MonadLog m, Show idx) => Text -> idx -> m Action
+defaultOnFailedToFetchJob msg idx = do
+  logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx]


Suggested change

logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "idx" A..= show idx]

logAttention "Unexpected unparseable job" $ A.object ["error" A..= msg, "job_id" A..= show idx]

Raveline · 2025-12-18T16:41:14Z

CI is fixed and I addressed all your concerns (hopefully).

arybczak · 2025-12-18T17:38:47Z

consumers/src/Database/PostgreSQL/Consumers/Config.hs

+
+-- | A default implementation for ccOnFailedToFetchJob,
+-- when the parsing of the row should never fail.
+-- This will create a logAttention and reenqueue, to be replayed in 2 days.


arybczak · 2025-12-18T17:39:13Z

consumers/src/Database/PostgreSQL/Consumers/Components.hs

            , "WHERE id IN (" <> reservedJobs now <> ")"
            , "RETURNING" <+> mintercalate ", " ccJobSelectors
            ]
-      -- Decode lazily as we want the transaction to be as short as possible.


There's still a diff.

arybczak · 2025-12-18T17:45:02Z

consumers/src/Database/PostgreSQL/Consumers/Components.hs

+            action <-
+              lift $
+                either
+                  (\(idx, t) -> ccOnFailedToFetchJob t idx)


That reminded me that we probably want to give ccOnFailedToFetchJob the same treatment we give to ccOnException here, i.e. have a safety net in case it itself throws due to a bug.
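A minimal sketch of such a safety net, written against plain `IO` and `Control.Exception` rather than the library's monad stack. The `safely` name and the fallback-action approach are assumptions for illustration, not how ccOnException is actually guarded.

```haskell
import Control.Exception (SomeException, try)

-- Run a failure handler; if the handler itself throws, log the
-- exception and return a fixed fallback action instead of crashing.
-- Note: catching SomeException also catches asynchronous exceptions,
-- which real code would probably want to rethrow.
safely :: action -> IO action -> IO action
safely fallback run = do
  r <- try run
  case r of
    Right a -> pure a
    Left e -> do
      putStrLn ("failure handler threw: " <> show (e :: SomeException))
      pure fallback
```

A buggy handler then degrades to the fallback action instead of taking the consumer down.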

Raveline force-pushed the reenqueue-unparseable-job branch 2 times, most recently from d0354ec to c2310b7 Compare October 3, 2025 08:35

arybczak reviewed Nov 27, 2025

jonathanjouty reviewed Nov 27, 2025

Raveline and others added 7 commits December 2, 2025 15:37

WIP

03ae07c

fix

3de2d37

Change ccJobFetcher signature

cf48850

Add a default implementation for parsing that should not fail

468af9f

Apply PR suggestions

0fd9fdd

Set a more sensible reenqueue default time

ec838d6

Monitor jobs that failed to parse

712e537

Raveline force-pushed the reenqueue-unparseable-job branch from d546f1a to 712e537 Compare December 2, 2025 16:06

Fix formatting

87f4cae

arybczak reviewed Dec 17, 2025

Raveline added 2 commits December 18, 2025 17:20

Address PR comments

0c1570e

Fix formatting again

73c244e

arybczak reviewed Dec 18, 2025

Format also the changes to consumer-metrics

56a705d

Raveline force-pushed the reenqueue-unparseable-job branch from 4b08bf6 to 45c29c5 Compare December 19, 2025 08:47

Add protection against crashlooping in ccOnFailedToFetchJob

6f76b3c

Raveline force-pushed the reenqueue-unparseable-job branch from 45c29c5 to 6f76b3c Compare December 19, 2025 11:18
