Recover lost batches #7

@sleekweasel

If a batch run times out or otherwise fails to return results for some or all of the tests in the batch, we should consider re-queueing those tests. (Implies maintaining the currently running tests in the database.)

A test should not be re-queued endlessly - it's probably genuinely timing out or killing its agent. (Implies multiple queues.)

Re-queued tests are re-run in isolation, to separate bad tests from innocent batch-mates. (Implies workers know about secondary queuing.)

We should recover even (especially) if the agent is killed with extreme prejudice. (Implies worker-tracking.)

Workers should not terminate until the queue is empty and all workers are idle. (Implies coordination.)

Proposal:

  1. Workers should use transactions (http://redis.io/topics/transactions) to pull 'n' tests off the primary queue (or only 1 from the requeue) and into their own set, and then run them. Once the run is finished, any tests from the primary queue that weren't executed for any reason are added to the requeue.
  2. Worker-controller maintains a set listing each worker, removing a worker from the set when it terminates. If a worker terminates with tests in its set, worker-controller adds those tests to the requeue.
  3. The worker-controller polls for an empty queue and requeue, and for all worker sets to be empty, whereupon the worker-controller puts a 'tests complete' marker in a controller set and workers terminate in response.
  4. The various queues and sets have names based on that of the primary queue - e.g. queue, queue_requeue, queue_worker0, queue_control. These will be ensured empty at start-up by the processes using them (in case of previous catastrophic failure) but should be naturally empty by the end of a normally completed run.
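
To make the flow in items 1 to 3 concrete, here is a rough sketch of the bookkeeping, using plain in-memory Python collections as a stand-in for the Redis keys from item 4. It is not a Redis client: in a real implementation the `claim` step would have to be atomic (e.g. inside a Redis transaction, per item 1). The names `Store`, `claim`, `finish`, `reap`, `check_complete` and the `abandoned` list are all hypothetical, as is the assumption that a test which fails to report even when re-run in isolation is set aside rather than requeued again.

```python
from collections import deque


class Store:
    """In-memory stand-in for the Redis keys named in item 4 (assumption)."""

    def __init__(self, name, tests):
        self.name = name
        self.queue = deque(tests)  # primary queue, e.g. "queue"
        self.requeue = deque()     # e.g. "queue_requeue"
        self.workers = {}          # worker id -> set, e.g. "queue_worker0"
        self.control = set()       # e.g. "queue_control"
        self.abandoned = []        # hypothetical: tests that failed even in isolation


def claim(store, worker_id, n):
    """Item 1: move up to n tests from the primary queue, or only 1 from the
    requeue, into the worker's own set. Must be atomic in a real Redis version."""
    own = store.workers.setdefault(worker_id, set())
    if store.queue:
        for _ in range(min(n, len(store.queue))):
            own.add(store.queue.popleft())
        return "primary"
    if store.requeue:
        own.add(store.requeue.popleft())
        return "requeue"
    return None  # nothing to claim right now


def finish(store, worker_id, source, reported):
    """Item 1, after the run: tests claimed from the primary queue that produced
    no result go to the requeue. Tests already re-run in isolation are not
    requeued again (assumption: they are recorded in `abandoned`)."""
    own = store.workers[worker_id]
    missing = sorted(own - set(reported))
    if source == "primary":
        store.requeue.extend(missing)
    else:
        store.abandoned.extend(missing)
    own.clear()


def reap(store, worker_id):
    """Item 2: the worker-controller drops a terminated worker and moves any
    tests still in its set to the requeue."""
    orphans = store.workers.pop(worker_id, set())
    store.requeue.extend(sorted(orphans))


def check_complete(store):
    """Item 3: once both queues and every worker set are empty, put the
    'tests complete' marker in the control set."""
    all_idle = all(not tests for tests in store.workers.values())
    if not store.queue and not store.requeue and all_idle:
        store.control.add("tests complete")
        return True
    return False
```

For example, a worker that claims `t1`, `t2`, `t3` from the primary queue and only reports results for `t1` and `t3` would leave `t2` on the requeue, where the next `claim` picks it up on its own. Note that `reap` follows item 2 literally, so a test that kills its worker while being re-run would go back on the requeue; how to bound that case is left open in this sketch.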
