
Improve parallelization of processes #2

@vicentemundim

Description


Currently Massive steps will check the number of items to be processed and will create jobs accordingly. This works pretty well when all you need is to parallelize the processing of one file. But what happens when two processes are started concurrently (not necessarily at the same time)?

The first process step will enqueue its jobs (say, 100 jobs), which will be processed by some workers (3, for example). In this scenario 3 jobs are processed at a time, one per worker. Then the second process is started, enqueuing 50 more jobs. Since all workers are busy, those jobs will sit in the queue, waiting for a worker to become free.

What is worse, since Resque pushes the first process's 100 jobs onto the queue (with RPUSH) and pops them in order (with LPOP), the second process's jobs will only run after all of the first process's jobs complete.
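The head-of-line blocking can be seen with a minimal sketch (a plain Ruby array standing in for the Redis list; no Resque or Redis required):

```ruby
# RPUSH appends to the tail, LPOP takes from the head, so a later process's
# jobs wait behind everything enqueued before them.
queue = []

100.times { |i| queue.push([:first, i]) }   # first process enqueues 100 jobs
50.times  { |i| queue.push([:second, i]) }  # second process enqueues 50 jobs

processed = []
processed << queue.shift until queue.empty?

# The second process's first job only runs at position 101 (index 100):
first_second = processed.index { |(process, _)| process == :second }
puts first_second  # => 100
```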

A solution would be to use multiple queues with different priorities and have the jobs enqueued in those different queues.

For example, in the above scenario, suppose the first process enqueues its jobs in 100 different queues named massive-1, massive-2, massive-3, ..., massive-100. When the second process enqueues its jobs, it would also spread them across different queues: massive-1, massive-2, massive-3, ..., massive-50. The workers would need to be started with a splat (*), so that Resque is able to process those dynamic queues.
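A hedged sketch of that enqueuing scheme. The enqueue plan is computed as plain data here; the actual Resque calls (which need a running Redis) are shown in comments, and MassiveJob is a hypothetical job class:

```ruby
# One job per numbered queue: job i goes to queue "massive-#{i + 1}".
plan = (0...100).map { |i| ["massive-#{i + 1}", :job] }

# With Resque this would be:
#
#   plan.each { |queue, job| Resque.enqueue_to(queue, MassiveJob, job) }
#
# and workers are started with a splat so they pick up the dynamic queues:
#
#   QUEUE='*' rake resque:work
#
# Caveat: with QUEUE='*' Resque sorts queue names alphabetically, so
# "massive-10" sorts before "massive-2"; zero-padding the numbers
# (massive-001) would preserve the intended numeric priority.

puts plan.first.first  # => "massive-1"
puts plan.last.first   # => "massive-100"
```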

Now when one of the 3 workers finishes processing a job for the first process, it will take a job from the highest-priority non-empty queue: massive-1 first, then massive-2, and so on. Since the second process has jobs in those queues, they would be processed before further jobs from the first process.

It would happen like so with 3 workers:

  • First process enqueues 100 jobs in massive-N queues
  • Workers process 3 jobs from the first process in massive-1, massive-2, massive-3 queues
  • Second process enqueues 50 jobs in massive-N queues
  • After the job in massive-1 gets processed, the second process job in massive-1 starts processing
  • After the job in massive-2 gets processed, the second process job in massive-2 starts processing
  • After the job in massive-3 gets processed, the second process job in massive-3 starts processing
  • After the second process job in massive-1 gets processed, the first process job in massive-4 starts processing
  • After the second process job in massive-2 gets processed, the first process job in massive-5 starts processing
  • After the second process job in massive-3 gets processed, the first process job in massive-6 starts processing
  • And so on until there are no jobs left to process.
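The interleaving above can be approximated with a toy simulation (plain Ruby; equal job durations and sequential worker pops are simplifying assumptions, so the exact order differs slightly from the walkthrough):

```ruby
# Each numbered queue holds jobs, and every free worker takes from the
# lowest-numbered non-empty queue. Jobs are assumed to take one tick each.
queues = Hash.new { |h, k| h[k] = [] }

100.times { |i| queues[i] << :first }   # first process: one job per queue
50.times  { |i| queues[i] << :second }  # second process: jobs in the first 50 queues

order = []
until queues.values.all?(&:empty?)
  3.times do                            # 3 workers, one job each per tick
    q = queues.keys.sort.find { |k| !queues[k].empty? }
    order << queues[q].shift if q
  end
end

# Second-process jobs are interleaved instead of waiting behind all 100
# first-process jobs:
puts order.first(6).inspect  # => [:first, :second, :first, :second, :first, :second]
```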

In this way each process gets a fair share of processing when concurrent processes are run. By adding more workers we are able to process more jobs in parallel, including jobs from different processes.

The developer could specify how many queues to use, and whether to split jobs across queues at all. For example, we could cap it at 10 queues. A process with more than 10 jobs would then put several jobs in each queue, so concurrent processes would be interleaved at a coarser granularity, but this would still work well for processes with up to 10 jobs.
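Such a cap could be sketched as a simple round-robin assignment (the `max_queues` knob and the helper name are assumptions, not an existing Massive option):

```ruby
# Hypothetical helper: with at most max_queues queues, job i goes to queue
# (i % max_queues) + 1, so large processes wrap around and share queues.
def queue_for(job_index, max_queues: 10)
  "massive-#{(job_index % max_queues) + 1}"
end

puts queue_for(0)   # => "massive-1"
puts queue_for(9)   # => "massive-10"
puts queue_for(10)  # => "massive-1"
```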
