Skip to content

Add deterministic program output to BenchGen benchmarks (CSmith-style global hash) #51

@pronesto

Description

@pronesto

Description

BenchGen currently generates executable benchmarks that manipulate data structures but do not produce any observable output in normal (non-debug) mode. As a consequence:

  • There is no easy way to validate that two compiled versions of the same benchmark behave equivalently.
  • Differential testing across compilers, optimization levels, or languages is harder than necessary.
  • Benchmark executions are effectively “opaque”: they exercise computation, but produce no result.
  • Compiler optimizations might remove large chunks of code, provided that the compiler can conclude that it's dead code.

In contrast, tools like CSmith address this by computing a deterministic hash (e.g., MD5) over program state at the end of execution and printing it, enabling correctness checks without affecting control flow.

Motivation

BenchGen is explicitly positioned as a complement to fuzzers like CSmith, focusing on performance evaluation rather than bug finding. However, the lack of observable output limits its usefulness in several scenarios:

  • Compiler validation: ensuring that aggressive optimizations do not silently miscompile benchmarks.
  • Cross-language comparisons: checking semantic equivalence between C, C++, Julia, Go, etc.
  • Regression testing: detecting unintended semantic changes in BenchGen itself or in its language backends.

Currently, debug mode prints a trace of operations (NEW, INSERT, CONTAINS, REMOVE, etc.), but this is not suitable for optimized builds or performance experiments.

Proposed Solution

Introduce a deterministic global execution hash, inspired by CSmith, that is:

  • Updated during program execution.
  • Printed once at the end of main.
  • Independent of external libraries (important for multi-language support).

High-level idea

  1. BenchGen generates a global variable (e.g., uint64_t exec_hash).
  2. Certain behavioral operations (notably contains) update this variable deterministically.
  3. At the end of main, the final hash value is printed.

For example, the current C code generated for contains looks like:

for (int i = 0; i < array2->size; i++) {
  if (array2->data[i] == 62) {
    return array2;
  }
}

This could be extended to something like:

for (int i = 0; i < array2->size; i++) {
  if (array2->data[i] == 62) {
    // Update global execution hash using array2->data[i]
    exec_hash = hash_update(exec_hash, array2->data[i]);
    return array2;
  }
}

At the end of execution:

int main(int argc, char** argv) {
  // Initialize exec_hash
  // ... generated benchmark code ...
  printf("%llu\n", exec_hash);
  return 0;
}

Hash function

  • The hash does not need to be cryptographic (MD5 is only a conceptual reference).

  • It should be:

    • Deterministic
    • Simple
    • Easy to reimplement across languages (C, C++, Julia, Go, Rust, etc.)
  • A lightweight rolling hash (e.g., xor + multiplication) would likely suffice.

The key requirement is semantic observability, not security. Below we have an example:

#include <stdint.h>
#include <stdio.h>

/* Global execution hash */
static uint64_t exec_hash;

/*
 * Simple, deterministic hash update.
 * - No library calls
 * - Easy to reimplement in other languages
 * - Cheap enough to not distort performance measurements
 */
static inline void hash_update_u64(uint64_t value) {
    /* Large odd constant (same across languages) */
    const uint64_t MUL = 0x9e3779b97f4a7c15ULL;  // 2^64 / golden ratio
    exec_hash ^= value + MUL + (exec_hash << 6) + (exec_hash >> 2);
}

If we use this example, then contains would generate code like:

for (int i = 0; i < array2->size; i++) {
  if (array2->data[i] == 62) {
    /* Record the observed value in the global execution hash */
    hash_update_u64((uint64_t)array2->data[i]);
    return array2;
  }
}

Why contains?

contains is a natural candidate because:

  • It observes data without mutating it.
  • It already branches on concrete values.
  • It exists uniformly across supported data structures and languages.

That said, the mechanism could later be generalized to other operations if desired.

Benefits

  • Enables semantic equivalence checking across compilers and languages.
  • Improves BenchGen’s suitability for compiler correctness experiments.
  • Preserves existing performance characteristics (hash updates are cheap and local).
  • Aligns BenchGen more closely with established practices in compiler testing tools like CSmith.

Open questions

  • Which behavior blocks should update the hash (only contains, or others)?
  • Should this be optional (e.g., enabled via a flag)?
  • Preferred default hash width (32-bit vs 64-bit)?

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions