
Improve parallelization of processes #2

@vicentemundim

Description


Currently Massive steps will check the number of items to be processed and will create jobs accordingly. This works pretty well when all you need is to parallelize the processing of one file. But what happens when two processes are started concurrently (not necessarily at the same time)?

The first process step will enqueue its jobs (say, 100 jobs), which will be processed by some workers (3, for example). In this scenario 3 jobs are processed at a time, one per worker. Then the second process is started, enqueuing 50 more jobs. Since all workers are busy, those jobs will sit in the queue, waiting for a worker to become free.

What is worse, since Resque pushes the first process's 100 jobs onto the queue (with RPUSH) and pops them in order (with LPOP), the second process's jobs will only run after all of the first process's jobs complete.
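The head-of-line blocking can be seen with a minimal sketch (a plain Ruby array standing in for the Redis list; no Resque or Redis required):

```ruby
# RPUSH appends to the tail, LPOP takes from the head, so a later process's
# jobs wait behind everything enqueued before them.
queue = []

100.times { |i| queue.push([:first, i]) }   # first process enqueues 100 jobs
50.times  { |i| queue.push([:second, i]) }  # second process enqueues 50 jobs

processed = []
processed << queue.shift until queue.empty?

# The second process's first job only runs at position 101 (index 100):
first_second = processed.index { |(process, _)| process == :second }
puts first_second  # => 100
```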

A solution would be to use multiple queues with different priorities and have the jobs enqueued in those different queues.

For example, in the above scenario, suppose the first process enqueues its jobs in 100 different queues named massive-1, massive-2, massive-3, ..., massive-100. When the second process enqueues its jobs, it would also spread them across different queues: massive-1, massive-2, massive-3, ..., massive-50. The workers would need to be started with a splat (*), so that Resque is able to process those dynamic queues.
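A hedged sketch of that enqueuing scheme. The enqueue plan is computed as plain data here; the actual Resque calls (which need a running Redis) are shown in comments, and MassiveJob is a hypothetical job class:

```ruby
# One job per numbered queue: job i goes to queue "massive-#{i + 1}".
plan = (0...100).map { |i| ["massive-#{i + 1}", :job] }

# With Resque this would be:
#
#   plan.each { |queue, job| Resque.enqueue_to(queue, MassiveJob, job) }
#
# and workers are started with a splat so they pick up the dynamic queues:
#
#   QUEUE='*' rake resque:work
#
# Caveat: with QUEUE='*' Resque sorts queue names alphabetically, so
# "massive-10" sorts before "massive-2"; zero-padding the numbers
# (massive-001) would preserve the intended numeric priority.

puts plan.first.first  # => "massive-1"
puts plan.last.first   # => "massive-100"
```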

Now when one of the 3 workers finishes processing a job for the first process, it will take a job from the highest-priority non-empty queue: massive-1 first, then massive-2, and so on. Since the second process has jobs in those queues, they would be processed before further jobs from the first process.

It would happen like so with 3 workers:

  • First process enqueues 100 jobs in massive-N queues
  • Workers process 3 jobs from the first process in massive-1, massive-2, massive-3 queues
  • Second process enqueues 50 jobs in massive-N queues
  • After the job in massive-1 gets processed, the second process job in massive-1 starts processing
  • After the job in massive-2 gets processed, the second process job in massive-2 starts processing
  • After the job in massive-3 gets processed, the second process job in massive-3 starts processing
  • After the second process job in massive-1 gets processed, the first process job in massive-4 starts processing
  • After the second process job in massive-2 gets processed, the first process job in massive-5 starts processing
  • After the second process job in massive-3 gets processed, the first process job in massive-6 starts processing
  • And so on until there are no jobs left to process.
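The interleaving above can be approximated with a toy simulation (plain Ruby; equal job durations and sequential worker pops are simplifying assumptions, so the exact order differs slightly from the walkthrough):

```ruby
# Each numbered queue holds jobs, and every free worker takes from the
# lowest-numbered non-empty queue. Jobs are assumed to take one tick each.
queues = Hash.new { |h, k| h[k] = [] }

100.times { |i| queues[i] << :first }   # first process: one job per queue
50.times  { |i| queues[i] << :second }  # second process: jobs in the first 50 queues

order = []
until queues.values.all?(&:empty?)
  3.times do                            # 3 workers, one job each per tick
    q = queues.keys.sort.find { |k| !queues[k].empty? }
    order << queues[q].shift if q
  end
end

# Second-process jobs are interleaved instead of waiting behind all 100
# first-process jobs:
puts order.first(6).inspect  # => [:first, :second, :first, :second, :first, :second]
```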

In this way each process gets a fair share of processing when concurrent processes are run. By adding more workers we are able to process more jobs in parallel, including jobs from different processes.

The developer could specify how many queues to use, and whether to split jobs across queues at all. For example, we could cap it at 10 queues. A process with more than 10 jobs would then put several jobs in each queue, so concurrent processes would be interleaved at a coarser granularity, but this would still work well for processes with up to 10 jobs.
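Such a cap could be sketched as a simple round-robin assignment (the `max_queues` knob and the helper name are assumptions, not an existing Massive option):

```ruby
# Hypothetical helper: with at most max_queues queues, job i goes to queue
# (i % max_queues) + 1, so large processes wrap around and share queues.
def queue_for(job_index, max_queues: 10)
  "massive-#{(job_index % max_queues) + 1}"
end

puts queue_for(0)   # => "massive-1"
puts queue_for(9)   # => "massive-10"
puts queue_for(10)  # => "massive-1"
```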
