Conversation

@TomAugspurger

This adds a script for posting the results of a benchmark run to a database. Each run would result in

  1. A new "QueryEngine" configuration, which identifies the exact software versions used to run the benchmark
  2. A new "BenchmarkRun", which identifies the hardware, query engine, storage configuration, benchmark definition, query engine configuration, and more
  3. Many new "QueryLogs", one per individual query iteration executed

Here's a sample of the document we'd post:

{
  "sku_name": "pdx-h100",
  "storage_configuration_name": "test",
  "benchmark_definition_name": "tpch-100",
  "cache_state": "warm",
  "query_engine": {
    "engine_name": "velox",
    "identifier_hash": "121e1e62b5426195",
    "version": "unknown",
    "commit_hash": "unknown"
  },
  "run_at": "2026-01-28T18:38:02+00:00",
  "node_count": 1,
  "query_logs": [
    {
      "query_name": "1",
      "execution_order": 0,
      "runtime_ms": 3982.0,
      "status": "success",
      "extra_info": {
        "execution_number": 1
      }
    },
    {
      "query_name": "1",
      "execution_order": 1,
      "runtime_ms": 1509.0,
      "status": "success",
      "extra_info": {
        "execution_number": 2
      }
      // repeated for each execution of each query
    }
  ],
  "concurrency_streams": 1,
  "engine_config": {
    "coordinator": {
      "coordinator": "true",
      "node-scheduler.include-coordinator": "false",
      "http-server.http.port": "9200",
      "discovery-server.enabled": "true",
      "discovery.uri": "http://gpu-h100-0017:9200"
      // ...
    },
    "worker": {
      "coordinator": "false",
      "http-server.http.port": "9000",
      "discovery.uri": "http://gpu-h100-0017:9200",
      "presto.version": "testversion",
      "system-memory-gb": "240",
      "query-memory-gb": "228",
      // ...
    }
  },
  "extra_info": {
    "kind": "single-node",
    "gpu_count": 1,
    "gpu_name": "H100",
    "num_drivers": 2,
    "worker_image": "presto-native-worker-gpu",
    "execution_number": 1
  },
  "is_official": false
}
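
For context, the posting step itself is roughly this shape (the endpoint and auth details below are placeholders rather than the real API; BENCHMARK_DB_TOKEN is a hypothetical env var):

import json
import os
import urllib.request

def post_benchmark_run(document, api_url):
    # Placeholder sketch: the real endpoint, auth scheme, and any payload
    # wrapping may differ from what the script actually does.
    request = urllib.request.Request(
        api_url,
        data=json.dumps(document).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + os.environ["BENCHMARK_DB_TOKEN"],
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status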

This script operates on the parsed output of the benchmark runner. The
expected directory structure is:

../velox-testing-benchmark-viz/2026/01/28/01/
Author

Is this the correct directory structure to target? Or is there an earlier, raw-er version we should use?

Contributor

That is the current format of the runner on that particular cluster (it will generate a commit under the date directory). Not all of our benchmarking is consistent with such a structure.

Standardizing on one directory structure is something we need to do, and hopefully we can get there and then layer this on top.

│ └── slurm-4575179.out
└── result_dir
├── benchmark_cold.json
├── benchmark_full.json
Author

Right now, I'm only using benchmark_full.json.

Contributor

I think that's fine. IIRC the others are just subsets of the full data for organization's sake. Currently that benchmark_full.json is generated from a branch misiug/cluster, but if that's the format we want it should not be hard to get that upstream.

Whatever output we want to generate should be in that same directory, although we can alter the format to whatever would best fit with the database API.

Comment on lines +30 to +32
--sku-name PDX-H100 \
--storage-configuration-name pdx-lustre-sf-100 \
--cache-state warm
Author

These values are currently required.

The --sku-name and --storage-configuration-name need to match a name in the database, so I think it's OK to require them.

--cache-state warm is probably incorrect, and it shouldn't need to be provided here. The DB also has an is_os_page_cache_cleared field at the query-log level. I'd recommend using that, but we need some way of knowing whether the OS page cache was actually cleared.

Contributor

There is a PR up (#189) to have the run_benchmark.sh script automatically drop caches before running benchmarks (unless told otherwise). With that, we should be able to assume the first iteration of every benchmark is always "lukewarm" and any subsequent iterations are "warm". For true "cold" runs we would need to run each query as a separate run, or add a "cold" option to the benchmark script.
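
A rough sketch of how that could translate into per-iteration labeling (names below are illustrative only):

def cache_state_for_iteration(execution_number, caches_dropped):
    # Assumes caches are dropped once before the run (per #189): the first
    # iteration then sees a cold OS page cache ("lukewarm"); any later
    # iteration of the same query is "warm". Without the drop, everything
    # is treated as "warm".
    if not caches_dropped:
        return "warm"
    return "lukewarm" if execution_number == 1 else "warm"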

data = json.loads(file_path.read_text())

keys = list(data)
if keys != ["tpch"]:
Author

The results_dir/benchmark_full.json file looks like

❯ head result_dir/benchmark_full.json 
{
  "tpch": {
    "raw_times_ms": {
      "Q1": [
        3982,
        1509,
        1497,
        1315,
        1498
      ],

Are there any other values we should be prepared for here?

Contributor

That should be all for now. We usually just use these base values and then calculate the statistics from them. The exception would be NULL values if any errors occurred.
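
For what it's worth, handling those NULLs when flattening raw_times_ms into query logs could look roughly like this (just a sketch against the sample above; the "failure" status and per-query execution_order are guesses, not confirmed field values):

def build_query_logs(raw_times_ms):
    # Flatten {"Q1": [3982, 1509, ...], ...} into one log entry per iteration.
    # A NULL/None runtime means that iteration errored, so mark it failed
    # rather than recording a runtime of 0.
    logs = []
    for query_name, times in raw_times_ms.items():
        for i, runtime_ms in enumerate(times, start=1):
            logs.append({
                "query_name": query_name.lstrip("Q"),
                "execution_order": i - 1,
                "runtime_ms": float(runtime_ms) if runtime_ms is not None else None,
                "status": "success" if runtime_ms is not None else "failure",
                "extra_info": {"execution_number": i},
            })
    return logs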

Comment on lines +275 to +283
# Generate or use provided identifier hash
if identifier_hash is None:
    identifier_hash = generate_identifier_hash(benchmark_metadata.timestamp, engine)

# Use placeholders for version info if not provided
if version is None:
    version = "unknown"
if commit_hash is None:
    commit_hash = "unknown"
Author

We need to sort out these version fields.

The identifier_hash is extremely important, since it helps uniquely identify the engine / software being run. I'd recommend using something like the sha256 digest of the container image, if we have one available. Do we?

commit_hash is the commit of the "primary" software, so presumably velox-cudf? Can we get that? Likewise for version.

Contributor

Assuming we store all our benchmarked images in the GitLab registry, we should have both a Manifest digest and a Configuration digest we can use.

For presto-gpu we are probably going to need to include a unique identifier for both the presto and velox version (as they can be mixed/matched). If we only use the velox-cudf hash then we could theoretically have images built with the same hash that could vary substantially from different presto commits. For our images, we are planning on tagging them with both hashes. Would we be able to fit both hashes into the entry here (something like <presto_hash>-<velox_hash>)?
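
A sketch of what that combined identifier could look like (the helper below is purely illustrative, not the script's actual API): either store the literal "<presto_hash>-<velox_hash>" string directly, or hash it down to a fixed length:

import hashlib

def combined_identifier_hash(presto_commit, velox_commit):
    # Illustrative only: join the two commits, then hash so the identifier
    # stays a short fixed-length string like "121e1e62b5426195".
    combined = f"{presto_commit}-{velox_commit}"
    return hashlib.sha256(combined.encode()).hexdigest()[:16]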

{
"query_name": query_name,
"execution_order": execution_order,
"runtime_ms": 0.0,
Author

I don't love putting 0.0 here (someone will forget to filter to only successes and include 0 in an aggregation). Check if that's allowed to be None in the API / DB.
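
If the API does allow it, I'd rather emit something like this for failed iterations (the status value is a guess; the point is the null runtime):

{
    "query_name": query_name,
    "execution_order": execution_order,
    "runtime_ms": None,   # null rather than 0.0, assuming the column is nullable
    "status": "failure",
}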

Contributor @misiugodfrey left a comment

Thanks for putting this together, Tom.

I think most of this should work as-is. The only issue is that we haven't standardized some of this expected output in upstream main yet (the output this is based on is from an experimental set of branches). This should be easy now that we have an upstream schema to store in.

I think we just need to standardize our output and then sync this on top.


Expects coordinator.config and worker.config files.
"""
coordinator_config = parse_config_file(configs_dir / "coordinator.config")
Contributor

In our presto environments these files are going to be in a configs directory, under <configs>/etc_worker/config_native.properties and <configs>/etc_coordinator/config_native.properties.
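
So the lookup would probably end up looking something like this (paths per above, reusing the script's existing parse_config_file; just a sketch of the layout, not a final implementation):

# Sketch only: resolve the presto-native config layout described above.
coordinator_config = parse_config_file(
    configs_dir / "etc_coordinator" / "config_native.properties"
)
worker_config = parse_config_file(
    configs_dir / "etc_worker" / "config_native.properties"
)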
