Automatically drop caches when running benchmarks #189
base: main
Conversation
… misiug/jinjaDockerCompose
….sh. Argument -g or --gpu-ids allows passing in a comma-delimited set of GPU ids. This way more than one developer can run a multi-GPU cluster on the same 8-GPU server. For example, one dev can set -g 0,1,2,3 and another -g 4,5,6,7
…lox-testing into misiug/tempjinja
…lox-testing into misiug/tempjinja
# make cudf.exchange=true if we are running multiple workers
sed -i "s+cudf.exchange=false+cudf.exchange=true+g" ${worker_config}/config_native.properties
if [[ -n ${MEMORY_PERCENT} ]]; then
Memory percent is something that IBM usually sets to 50% in their multi-worker runs. If nothing else, we should be experimenting with this limit (the default is 0).
paul-aiyedun left a comment
Changes overall look good to me. However, I think cache dropping should be integrated into the benchmarking test suite.
| echo "Dropping cache" | ||
| if [[ -z ${SKIP_DROP_CACHE} ]]; then | ||
| dropcache; |
Can we have cache dropping be driven by the performance test suite (i.e. logic in presto/testing/performance_benchmarks/common_fixtures.py), similar to what we do for profiling? This should allow for an easier extension for cold runs.
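Along those lines, a minimal sketch of what a fixture-driven approach might look like. The fixture name, the --skip-drop-cache option (which would also need to be registered via pytest_addoption in a conftest.py), and the sudo-based drop are all assumptions for illustration, not the repo's actual API:

import subprocess

import pytest

@pytest.fixture(autouse=True)
def drop_caches(request):
    # Hypothetical fixture for common_fixtures.py: drop OS page caches before
    # each benchmark so every run starts cold, unless --skip-drop-cache is set.
    if not request.config.getoption("--skip-drop-cache", default=False):
        # Flush dirty pages, then drop page cache, dentries, and inodes
        # (writing to /proc/sys/vm/drop_caches requires root).
        subprocess.run(["sync"], check=True)
        subprocess.run(
            ["sudo", "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches"],
            check=True,
        )
    yield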
KVIKIO_ARRAY=(8)
DRIVERS_ARRAY=(2)
WORKERS_ARRAY=(1)
SCHEMA_ARRAY=(sf1k_64mb)
I think we should remove this default value.
exit 1
fi
;;
-s|--schema)
--schemas
done
}

export PRESTO_DATA_DIR=/raid/ocs_benchmark_data/tpch/experimental
This should probably be set by a parameter if not set already.
for memory in "${MEMORY_ARRAY[@]}"; do
    echo "Running combo: num_workers = $workers, kvikio_threads = $kvikio, num_drivers = $drivers, schema = $schema, memory = $memory"
    ./start_native_gpu_presto.sh -w $workers --kvikio-threads $kvikio --num-drivers $drivers --memory-percent $memory
    ./run_benchmark.sh -b tpch -s ${schema} --tag "${schema}_${workers}w_${drivers}d_${kvikio}k_${memory}m_dropcache"
Consider ${schema}_${workers}wk_${drivers}dr_${kvikio}kv_${memory}mp. The dropcache suffix should no longer be needed after extending the benchmark script to drop caches by default.
export NUM_WORKERS=1
export KVIKIO_THREADS=8
export VCPU_PER_WORKER=""
export MEMORY_PERCENT=""
I think config generation/updates should be separate from building/deploying the server, but we can revisit this later.
mbrobbel left a comment
Maybe we should/can also use the following endpoints?
GET /v1/operation/server/clearCache?type=memory: clears the memory cache on the worker node. Example:
curl -X GET "http://localhost:7777/v1/operation/server/clearCache?type=memory"
Cleared memory cache
GET /v1/operation/server/clearCache?type=ssd: clears the ssd cache on the worker node. Example:
curl -X GET "http://localhost:7777/v1/operation/server/clearCache?type=ssd"
Cleared ssd cache
GET /v1/operation/server/writeSsd: writes data from the memory cache to the ssd cache on the worker node. Example:
curl -X GET "http://localhost:7777/v1/operation/server/writeSsd"
Succeeded write ssd cache
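For reference, a rough sketch of how a harness could hit these endpoints from Python, assuming the requests library and a single worker at localhost:7777 as in the curl examples above; the WORKERS list and function name are hypothetical:

import requests

# Hypothetical list of worker base URLs; port 7777 matches the examples above.
WORKERS = ["http://localhost:7777"]

def clear_worker_caches(cache_type):
    # Clear the given cache type ("memory" or "ssd") on every worker node.
    for worker in WORKERS:
        resp = requests.get(
            f"{worker}/v1/operation/server/clearCache",
            params={"type": cache_type},
        )
        resp.raise_for_status()
        print(resp.text)  # e.g. "Cleared memory cache"

clear_worker_caches("memory")
clear_worker_caches("ssd")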
Altered run_benchmark.sh to automatically drop caches when running benchmarks (can be turned off via an option).
Also added a script to run benchmarks with multiple configurations, which is useful for quickly comparing benchmark settings.