
Conversation

@misiugodfrey (Contributor)

Altered run_benchmark.sh to automatically drop caches when running (this can be turned off via an option).

Also added a script to run benchmarks with multiple configurations, which is useful for quickly comparing benchmark settings.
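In rough terms, the drop-cache behavior in run_benchmark.sh looks like the sketch below. This is a minimal sketch, assuming dropcache wraps the standard Linux drop_caches mechanism and that the opt-out sets SKIP_DROP_CACHE; the --skip-drop-cache flag name is illustrative.

# Sketch only: flush dirty pages and drop the OS page/dentry/inode caches
# before the run unless the user opted out. Flag name is illustrative.
dropcache() {
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
}

case "$1" in
    --skip-drop-cache) SKIP_DROP_CACHE=1 ;;
esac

if [[ -z ${SKIP_DROP_CACHE} ]]; then
    echo "Dropping cache"
    dropcache
fi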

misiugodfrey and others added 30 commits November 19, 2025 10:59
….sh. The -g or --gpu-ids argument allows passing in a comma-delimited set of GPU ids, so that more than one developer can run a multi-GPU cluster on the same 8-GPU server. For example, one dev can set -g 0,1,2,3 and another -g 4,5,6,7.
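A hedged sketch of how that flag might be wired up, assuming the script restricts device visibility via CUDA_VISIBLE_DEVICES (the actual mechanism in start_native_gpu_presto.sh may differ, and the default value shown is illustrative):

# Sketch: accept -g/--gpu-ids and restrict the workers to those devices.
# Exporting CUDA_VISIBLE_DEVICES is an assumption about the mechanism used.
GPU_IDS="0,1,2,3,4,5,6,7"   # illustrative default: all eight GPUs
while [[ $# -gt 0 ]]; do
    case "$1" in
        -g|--gpu-ids) GPU_IDS="$2"; shift ;;
    esac
    shift
done
export CUDA_VISIBLE_DEVICES="${GPU_IDS}"
# e.g. one developer runs with -g 0,1,2,3 and another with -g 4,5,6,7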
# make cudf.exchange=true if we are running multiple workers
sed -i "s+cudf.exchange=false+cudf.exchange=true+g" ${worker_config}/config_native.properties
if [[ -n ${MEMORY_PERCENT} ]]; then
misiugodfrey (Contributor, Author)

Memory percent is something that IBM usually sets to 50% in their multi-worker runs. If nothing else, we should be experimenting with this limit (the default is 0).

@paul-aiyedun (Contributor) left a comment

Changes overall look good to me. However, I think cache dropping should be integrated into the benchmarking test suite.


echo "Dropping cache"
if [[ -z ${SKIP_DROP_CACHE} ]]; then
    dropcache;
Contributor

Can we have cache dropping be driven by the performance test suite (i.e., logic in presto/testing/performance_benchmarks/common_fixtures.py), similar to what we do for profiling? This should allow for easier extension to cold runs.

KVIKIO_ARRAY=(8)
DRIVERS_ARRAY=(2)
WORKERS_ARRAY=(1)
SCHEMA_ARRAY=(sf1k_64mb)
Contributor

I think we should remove this default value.

exit 1
fi
;;
-s|--schema)
Contributor

--schemas

done
}

export PRESTO_DATA_DIR=/raid/ocs_benchmark_data/tpch/experimental
Contributor

This should probably be set by a parameter if not set already.
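For example, the script could honor an already-set value (or a flag) and only fall back to the hard-coded path; a minimal sketch of that suggestion:

# Sketch: keep the caller's PRESTO_DATA_DIR if it is already set,
# otherwise fall back to the current hard-coded default.
export PRESTO_DATA_DIR="${PRESTO_DATA_DIR:-/raid/ocs_benchmark_data/tpch/experimental}"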

for memory in "${MEMORY_ARRAY[@]}"; do
echo "Running combo: num_workers = $workers, kvikio_threads = $kvikio, num_drivers = $drivers, schema = $schema, memory = $memory"
./start_native_gpu_presto.sh -w $workers --kvikio-threads $kvikio --num-drivers $drivers --memory-percent $memory
./run_benchmark.sh -b tpch -s ${schema} --tag "${schema}_${workers}w_${drivers}d_${kvikio}k_${memory}m_dropcache"
Contributor

Consider ${schema}_${workers}wk_${drivers}dr_${kvikio}kv_${memory}mp. The dropcache suffix should no longer be needed after extending the benchmark script to drop caches by default.
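Putting the arrays and the suggested tag format together, the sweep driver might look roughly like the sketch below; the loop nesting order is a guess, and MEMORY_ARRAY is assumed to be defined alongside the other arrays.

# Sketch of the full sweep using the arrays defined at the top of the script
# and the tag format suggested above.
for workers in "${WORKERS_ARRAY[@]}"; do
  for drivers in "${DRIVERS_ARRAY[@]}"; do
    for kvikio in "${KVIKIO_ARRAY[@]}"; do
      for schema in "${SCHEMA_ARRAY[@]}"; do
        for memory in "${MEMORY_ARRAY[@]}"; do
          echo "Running combo: num_workers = $workers, kvikio_threads = $kvikio, num_drivers = $drivers, schema = $schema, memory = $memory"
          ./start_native_gpu_presto.sh -w $workers --kvikio-threads $kvikio --num-drivers $drivers --memory-percent $memory
          ./run_benchmark.sh -b tpch -s ${schema} --tag "${schema}_${workers}wk_${drivers}dr_${kvikio}kv_${memory}mp"
        done
      done
    done
  done
done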

export NUM_WORKERS=1
export KVIKIO_THREADS=8
export VCPU_PER_WORKER=""
export MEMORY_PERCENT=""
Contributor

I think config generation/updates should be separate from building/deploying the server, but we can revisit this later.

@mbrobbel (Member) left a comment

Maybe we should/can also use the following endpoints?



    GET /v1/operation/server/clearCache?type=memory: clears the memory cache on the worker node. For example:

    curl -X GET "http://localhost:7777/v1/operation/server/clearCache?type=memory"

    Cleared memory cache

    GET /v1/operation/server/clearCache?type=ssd: clears the ssd cache on the worker node. For example:

    curl -X GET "http://localhost:7777/v1/operation/server/clearCache?type=ssd"

    Cleared ssd cache

    GET /v1/operation/server/writeSsd: writes data from the memory cache to the ssd cache on the worker node. For example:

    curl -X GET "http://localhost:7777/v1/operation/server/writeSsd"

    Succeeded write ssd cache


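A small sketch of how a cold-run setup step could hit these endpoints before each query run; the worker list and the 7777 port are illustrative (the port is taken from the examples above):

# Sketch: clear worker-side caches via the native worker HTTP endpoints.
# WORKER_HOSTS and the port are illustrative; adjust to the actual deployment.
WORKER_HOSTS=("localhost:7777")
for host in "${WORKER_HOSTS[@]}"; do
    curl -s -X GET "http://${host}/v1/operation/server/clearCache?type=memory"
    curl -s -X GET "http://${host}/v1/operation/server/clearCache?type=ssd"
done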
