The current worker infrastructure has a number of issues.
Every job requires a fresh build of moolloy. When that build fails, the job fails with it, and this typically results in all of our workers dying.
Building moolloy also requires cloning the entire moolloy repo for every job, which is very costly because the alloy repo is at least 100 MB to clone.
Initial attempts at using a seed repo to reduce the download time have been unsuccessful.
To solve these issues, we will split the worker infrastructure into two steps.
The first step will be a build step, where a build worker will clone the repo, check out the appropriate commit, build moolloy, and then upload the resulting jar file to S3.
Each commit will only be built once.
The second step will be the run step, which is exactly the same as the current runner except that it no longer needs to build moolloy. Instead, it will download the previously uploaded jar file from S3 and run it.
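As a rough illustration of the "each commit will only be built once" rule, the sketch below keys the jar on the commit SHA and skips the build when an artifact already exists under that key. Everything here is an assumption made for illustration: the `builds/<sha>/moolloy.jar` key layout, the `InMemoryStore` class (a stand-in for S3, not a real S3 client), and the `build_fn` callback are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for S3: maps object keys to jar bytes."""
    objects: Dict[str, bytes] = field(default_factory=dict)

    def exists(self, key: str) -> bool:
        return key in self.objects

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


def jar_key(commit_sha: str) -> str:
    # Assumed key layout; the real bucket layout is not specified here.
    return f"builds/{commit_sha}/moolloy.jar"


def build_once(store: InMemoryStore,
               commit_sha: str,
               build_fn: Callable[[str], bytes]) -> Tuple[str, bool]:
    """Build and upload the jar for a commit unless it is already stored.

    Returns (key, built), where built is False if the build was skipped.
    """
    key = jar_key(commit_sha)
    if store.exists(key):
        return key, False
    store.put(key, build_fn(commit_sha))
    return key, True
```

With this shape, queuing the same commit twice costs one existence check rather than a second clone and build.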
The full workflow will be as follows:
- Commit hook is triggered from GitHub to the dashboard (or a manual build is scheduled)
- Dashboard queues a build job to the build queue
- Build worker receives job from the build queue
- Build worker clones the moolloy repo to a temporary directory
- Build worker checks out the specified commit
- Build worker runs `git submodule init && git submodule update`
- Build worker runs `ant deps configure dist` to build moolloy
- Build worker uploads jar file to S3
- Build worker reports success to dashboard along with S3 key and hash of jar file
- Build worker deletes temporary directory (if everything has completed successfully, otherwise directory will remain for debugging purposes)
- Build worker resumes polling build queue
- Dashboard queues a run job to the run queue (as a result of build completion if CI is enabled for the model, or as a result of manual user action)
- Run worker receives the job from the run queue
- Run worker creates temporary directory
- Run worker downloads jar file from S3
- Run worker verifies file hash
- Run worker downloads the model from S3
- Run worker extracts the model
- Run worker executes moolloy
- Run worker compares the results to the model's expected results
- Run worker tarballs the directory and uploads it to S3
- Run worker reports results to dashboard
- Run worker deletes temporary directory
- Run worker resumes polling run queue
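
The build commands and the jar hash check from the workflow above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual worker code: the function names are hypothetical, `repo_url` and `workdir` are placeholders supplied by the caller, and SHA-256 is assumed as the hash algorithm since the workflow does not name one.

```python
import hashlib
import subprocess


def build_steps(repo_url, commit, workdir):
    """Return the build commands as (argv, cwd) pairs, in workflow order."""
    return [
        (["git", "clone", repo_url, workdir], None),
        (["git", "checkout", commit], workdir),
        (["git", "submodule", "init"], workdir),
        (["git", "submodule", "update"], workdir),
        (["ant", "deps", "configure", "dist"], workdir),
    ]


def run_build(repo_url, commit, workdir):
    """Run each build command; check=True raises on the first failure."""
    for argv, cwd in build_steps(repo_url, commit, workdir):
        subprocess.run(argv, cwd=cwd, check=True)


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large jar is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_jar(path, expected_hash):
    """Return True if the downloaded jar matches the hash from the build."""
    return sha256_of_file(path) == expected_hash.lower()
```

In this sketch the build worker would report `sha256_of_file(jar_path)` to the dashboard along with the S3 key, and the run worker would call `verify_jar` on the downloaded file before executing moolloy.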