Scripts to run presto in a Slurm cluster #126
Conversation
@@ -0,0 +1,45 @@
#!/bin/bash
#SBATCH --time=00:25:00
This can take a little while due to the cost of the ANALYZE steps. Currently 25 min seems to be a decent limit.
sbatch "$@" --nodes=1 --ntasks-per-node=10 ${JOB_TYPE}_benchmarks.sbatch;
echo "Waiting for jobs to finish..."
while :; do
This just tracks the job in the terminal until it is finished.
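As an illustration of how the tracking loop could be kept testable, the `squeue` parsing can be separated from the polling itself. This is a sketch, not the PR's code; the `job_finished` helper name is hypothetical:

```shell
#!/bin/bash
# Return success once the given job id no longer appears in squeue output.
# The squeue output is passed in as a string so the check can be tested
# without a live cluster.
job_finished() {
  local job_id="$1" squeue_output="$2"
  ! grep -q "^ *${job_id} " <<< "${squeue_output}"
}

# Hypothetical polling loop using the helper:
# while ! job_finished "$JOB_ID" "$(squeue --noheader --jobs="$JOB_ID")"; do
#   sleep 10
# done
```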
paul-aiyedun left a comment
The overall approach makes sense to me. However, I had a few questions and code cleanup comments.
Can we give this file a more specific name e.g. echo_helper.sh?
rm *.log
rm *.out
[ $# -lt 1 ] && echo "$0 expected first argument is 'create/run'" && exit 1
Nit: I think using if statements for this type of check is more readable.
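For illustration, the chained-`&&` checks could be wrapped in a function with explicit if statements. A minimal sketch (the `validate_job_type` helper name is hypothetical, not in the PR):

```shell
#!/bin/bash
# Validate the job-type argument with if statements instead of chained &&.
# Returning rather than exiting keeps the check reusable and testable.
validate_job_type() {
  if [ $# -lt 1 ]; then
    echo "expected first argument 'create' or 'run'" >&2
    return 1
  fi
  if [ "$1" != "create" ] && [ "$1" != "run" ]; then
    echo "parameter must be 'create' or 'run'" >&2
    return 1
  fi
}
```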
[ $# -lt 1 ] && echo "$0 expected first argument is 'create/run'" && exit 1
JOB_TYPE="$1"
[ "$JOB_TYPE" != "create" ] && [ "$JOB_TYPE" != "run" ] && echo "parameter must be create or run" && exit 1
Can we create different scripts for each workflow?
[ "$JOB_TYPE" != "create" ] && [ "$JOB_TYPE" != "run" ] && echo "parameter must be create or run" && exit 1
shift 1
[ -z "$NUM_NODES" ] && echo "NUM_NODES env variable must be set" && exit 1
Can we consistently pass these variables as script arguments instead of a mix of environment variables and arguments?
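One way to do that is a `getopts`-based parser so everything arrives as flags. A sketch under the assumption that `-n`/`-t` are acceptable option names (they are not in the PR):

```shell
#!/bin/bash
# Parse all inputs as flags instead of mixing env variables and positionals.
# Option names (-n, -t) and the parse_args name are hypothetical.
parse_args() {
  NUM_NODES="" JOB_TYPE=""
  local opt OPTIND=1
  while getopts "n:t:" opt "$@"; do
    case "$opt" in
      n) NUM_NODES="$OPTARG" ;;
      t) JOB_TYPE="$OPTARG" ;;
      *) return 1 ;;
    esac
  done
  if [ -z "$NUM_NODES" ] || [ -z "$JOB_TYPE" ]; then
    echo "usage: $0 -n <num_nodes> -t <create|run>" >&2
    return 1
  fi
}
```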
fi
done
if ((${#missing[@]})); then
echo_error "required env var ${missing[*]} not set"
Should this file source common.sh?
source slurm_functions.sh
source setup_coord.sh
for i in $(seq 0 $(( $NUM_WORKERS - 1 )) ); do
Is this script run per node or per task?
rm ${WORKSPACE}/iterating_queries.sql
}
# Check if the coordinator is running via curl. Fail after 10 retries.
Can we reuse the existing wait_for_worker_node_registration function?
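Either way, the retry logic could be factored into a generic helper instead of being hardcoded in the curl loop. A minimal sketch (the `retry` helper is hypothetical; `COORD` is assumed set by the surrounding scripts):

```shell
#!/bin/bash
# Generic retry helper: run a check command until it succeeds or the
# attempt budget is exhausted.
retry() {
  local max="$1"; shift
  local i
  for (( i = 0; i < max; i++ )); do
    "$@" && return 0
    sleep 1  # back off between attempts
  done
  return 1
}

# Hypothetical usage against the coordinator's HTTP endpoint:
# retry 10 curl -sf "http://${COORD}:8080/v1/info"
```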
validate_environment_preconditions CONFIGS SINGLE_NODE_EXECUTION
local coord_config="${CONFIGS}/etc_coordinator/config_native.properties"
# Replace placeholder in configs
sed -i "s+discovery\.uri.*+discovery\.uri=http://${COORD}:8080+g" ${coord_config}
I think we should expand our config generation capability in velox-testing and avoid having to use sed commands.
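Until that capability exists, one sed-free option is to write the whole config from variables with a heredoc. A sketch only: the function name is hypothetical and the property set shown is illustrative, not the PR's actual config:

```shell
#!/bin/bash
# Generate the coordinator config from variables rather than sed-patching
# a checked-in file. Property names shown are illustrative.
write_coord_config() {
  local out="$1" coord_host="$2"
  cat > "$out" <<EOF
coordinator=true
http-server.http.port=8080
discovery.uri=http://${coord_host}:8080
EOF
}
```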
# so that the job will finish when the cli is done (terminating background
# processes like the coordinator and workers).
if [ "${type}" == "coord" ]; then
srun -w $COORD --ntasks=1 --overlap \
Is this kicking off another job or running the container locally?
I think it might make sense to name the top level directory slurm instead of cluster.
mkdir -p ${WORKSPACE}/.hive_metastore
# Run the worker with the new configs.
CUDA_VISIBLE_DEVICES=${gpu_id} srun -N1 -w $node --ntasks=1 --overlap \
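For the multi-worker-per-node case, the worker-to-GPU mapping can be isolated in a small helper so the assignment is easy to verify. A sketch under stated assumptions: `gpu_for_worker` is hypothetical, and `NUM_WORKERS`/`NUM_GPUS`/`node` are assumed set by the surrounding scripts:

```shell
#!/bin/bash
# Map worker index i to a GPU id: worker i gets GPU (i mod num_gpus),
# so workers wrap around the visible GPUs on the node.
gpu_for_worker() {
  local worker_idx="$1" num_gpus="$2"
  echo $(( worker_idx % num_gpus ))
}

# Hypothetical launch loop using the helper:
# for i in $(seq 0 $(( NUM_WORKERS - 1 ))); do
#   gpu_id=$(gpu_for_worker "$i" "$NUM_GPUS")
#   CUDA_VISIBLE_DEVICES=${gpu_id} srun -N1 -w "$node" --ntasks=1 --overlap ... &
# done
```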
Creating a set of scripts to automate running Presto (gpu-native) within a Slurm cluster. It currently supports single-node-single-GPU and single-node-multi-GPU (multiple workers on the same node with different GPU assignments), and is being used to run TPC-H SF1k.
TODOs:
Refactor this to use more of the velox-testing scripts internally (such as run_benchmarks.sh).
Create a common config setup.