Currently, Massive steps check the number of items to be processed and create jobs accordingly. This works pretty well when all you need is to parallelize the processing of one file. But what happens when two processes are started concurrently (not necessarily at the same time)?
The first process step will enqueue its jobs (say 100 jobs), which will be processed by some workers (3, for example), so 3 jobs are processed at a time. Then the second process is started, enqueuing 50 more jobs. Since all workers are busy, this process will hang, waiting for a worker to become available.
What is worse, since Resque pushes the first process's 100 jobs into the queue (with RPUSH) and pops them in order (with LPOP), the second process's jobs will only run after all of the first process's jobs complete.
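The starvation this causes is easy to reproduce with a minimal sketch, using plain Ruby arrays to stand in for Redis RPUSH/LPOP (the `:process_1`/`:process_2` labels are illustrative, not Massive's actual job payloads):

```ruby
# Simulate Resque's single-queue behavior: RPUSH appends to the tail,
# LPOP takes from the head, so jobs run strictly in enqueue order.
queue = []

# First process enqueues 100 jobs, then a second process enqueues 50.
100.times { |i| queue.push([:process_1, i]) } # RPUSH
50.times  { |i| queue.push([:process_2, i]) } # RPUSH

# Workers LPOP jobs one by one; record the order they run in.
order = []
order << queue.shift until queue.empty?       # LPOP

# The first second-process job sits behind all 100 first-process jobs.
first_p2 = order.index { |(proc_id, _)| proc_id == :process_2 }
# => 100
```

No matter how many workers you add, the second batch cannot start until the head of the list has been drained past the first batch.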
A solution would be to use multiple queues with different priorities and have the jobs enqueued in those different queues.
For example, in the scenario above, suppose the first process enqueued its jobs in 100 different queues named `massive-1`, `massive-2`, `massive-3`, ..., `massive-100`. When the second process enqueues its jobs, it would also enqueue them in different queues: `massive-1`, `massive-2`, `massive-3`, ..., `massive-50`. The workers would need to be started with a splat (`*`), so that Resque is able to process those dynamic queues.
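Spreading a batch across priority queues only requires naming the target queue after the job's position in the batch. A sketch of the idea; the `queue_for` helper is hypothetical, and with real Resque its result would be passed to `Resque.enqueue_to`:

```ruby
# Map each job in a batch to its own priority queue: the Nth job of
# any process goes to "massive-N", regardless of which process owns it.
def queue_for(job_index)
  "massive-#{job_index + 1}"
end

# First process, 100 jobs -> massive-1 .. massive-100
first  = (0...100).map { |i| queue_for(i) }
# Second process, 50 jobs -> massive-1 .. massive-50
second = (0...50).map  { |i| queue_for(i) }

# Both batches share the high-priority queues, so their early jobs
# compete fairly instead of the second batch queueing behind the first.
# With real Resque this would be something like:
#   Resque.enqueue_to(queue_for(i), MassiveJob, payload)
```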
Now when one of the 3 workers finishes a job for the first process, it will pick up a job from the higher-priority queues first: `massive-1`, then `massive-2`, and so on. Since the second process's jobs are in those queues, they will be processed before more of the first process's jobs.
It would happen like so with 3 workers:
- First process enqueues 100 jobs in `massive-N` queues
- Workers process 3 jobs from the first process, in the `massive-1`, `massive-2` and `massive-3` queues
- Second process enqueues 50 jobs in `massive-N` queues
- After the job in `massive-1` gets processed, the second process job in `massive-1` starts processing
- After the job in `massive-2` gets processed, the second process job in `massive-2` starts processing
- After the job in `massive-3` gets processed, the second process job in `massive-3` starts processing
- After the second process job in `massive-1` gets processed, the first process job in `massive-4` starts processing
- After the second process job in `massive-2` gets processed, the first process job in `massive-5` starts processing
- After the second process job in `massive-3` gets processed, the first process job in `massive-6` starts processing
- And so on until there are no jobs left to process
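The walkthrough above can be simulated with plain Ruby data structures: arrays stand in for the Redis lists, and a single worker drains the queues so the order is deterministic. This is a sketch of the scheme under those assumptions, not Resque's actual implementation:

```ruby
# Splat-started workers scan queues in name order (massive-1 first) and
# pop the head of the first non-empty queue, which is what gives the
# lower-numbered queues higher priority.
queues = Hash.new { |h, k| h[k] = [] }

100.times { |i| queues[i + 1] << :p1 } # first process fills massive-1..100
3.times   { |i| queues[i + 1].shift }  # 3 workers take massive-1..massive-3
50.times  { |i| queues[i + 1] << :p2 } # second process fills massive-1..50

# Drain the remaining jobs, recording (queue number, owner) pairs.
order = []
until queues.values.all?(&:empty?)
  n = queues.keys.sort.find { |k| !queues[k].empty? }
  order << [n, queues[n].shift]
end

order.first(6)
# => [[1, :p2], [2, :p2], [3, :p2], [4, :p1], [4, :p2], [5, :p1]]
```

The output shows the fairness property: the second process's first three jobs run ahead of the first process's remaining 97, and after that the two batches interleave queue by queue.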
This way, each process gets a fair share of processing when concurrent processes are run. By adding more workers we are able to process more jobs in parallel, including jobs from different processes.
The developer could specify how many queues to use, and whether to split jobs across queues at all. For example, we could limit it to 10 queues: processes with more than 10 jobs would then share queues and not interleave as fairly, but this would still work well for processes with up to 10 jobs.
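Capping the queue count could be as simple as wrapping the job index with modulo; the `capped_queue_for` helper and `MAX_QUEUES` setting below are hypothetical names for that idea:

```ruby
# With at most MAX_QUEUES priority queues, the Nth job wraps around,
# so large batches share queues instead of creating one queue per job.
MAX_QUEUES = 10 # hypothetical developer-supplied setting

def capped_queue_for(job_index)
  "massive-#{(job_index % MAX_QUEUES) + 1}"
end

capped_queue_for(0)  # => "massive-1"
capped_queue_for(9)  # => "massive-10"
capped_queue_for(10) # => "massive-1"  (the 11th job shares the first queue)
```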