
Add XeGPU matrix multiplication benchmark #15

@tkarna

Description

This issue outlines the required steps to add an XeGPU matrix multiplication benchmark.

  1. mlir-gen should be a Python module.
    • Move it into the Python source tree as a module, e.g., python/lighthouse/payload_generator/payload_generator.py
    • Add command line interface, e.g., python/lighthouse/payload_generator/cli.py
    • Map it to an executable script (e.g., mlir-gen) in pyproject.toml:
      [project.scripts]
      mlir-gen = "lighthouse.payload_generator.cli:main"
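As a sketch of step 1, the CLI module could be a thin argparse wrapper whose entry function is what the [project.scripts] table points at. All names and flags here (build_parser, main, --sizes) are illustrative guesses, not a settled interface:

```python
import argparse


def build_parser():
    # Parser for the hypothetical mlir-gen console script; flags are guesses.
    parser = argparse.ArgumentParser(
        prog="mlir-gen",
        description="Generate payload MLIR for benchmarking.",
    )
    parser.add_argument(
        "--sizes", type=int, nargs=3, default=[4096, 4096, 4096],
        metavar=("M", "N", "K"), help="Matmul problem sizes.",
    )
    return parser


def main(argv=None):
    # Entry point referenced from [project.scripts] in pyproject.toml.
    args = build_parser().parse_args(argv)
    # Payload generation itself would live in payload_generator.py.
    print(f"sizes={','.join(map(str, args.sizes))}")
```

Keeping parsing (cli.py) separate from generation (payload_generator.py) lets other tools import the generator as a library without touching argparse.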
  2. Generic infrastructure to define and execute workloads
    • A Workload object to hold the payload IR and fixed metadata (e.g. problem size).
    • Provides methods to get the payload IR, the schedule IR, and the input arguments for calling the payload, plus (optionally) correctness verification, etc.
    • A generic execution_engine wrapper that can execute a Workload and can, e.g., run it in a timer loop for benchmarking.
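The Workload/execution-engine split in step 2 might look roughly like the following. The class and method names are guesses at the eventual API, and the compile/run callables stand in for the real MLIR execution engine:

```python
import time
from abc import ABC, abstractmethod


class Workload(ABC):
    """Holds the payload IR and fixed metadata such as problem size."""

    def __init__(self, **metadata):
        self.metadata = metadata

    @abstractmethod
    def payload_ir(self) -> str: ...

    @abstractmethod
    def schedule_ir(self) -> str: ...

    @abstractmethod
    def input_args(self): ...

    def verify(self, result) -> bool:
        """Optional correctness check; accepts everything by default."""
        return True


class ExecutionEngine:
    """Compiles and runs a Workload, optionally in a timing loop."""

    def __init__(self, compile_fn, run_fn):
        self._compile = compile_fn   # payload IR -> executable
        self._run = run_fn           # (executable, args) -> result

    def benchmark(self, workload: Workload, iterations: int = 10) -> float:
        """Return the mean wall-clock time per iteration, in seconds."""
        exe = self._compile(workload.payload_ir())
        args = workload.input_args()
        start = time.perf_counter()
        for _ in range(iterations):
            self._run(exe, args)
        return (time.perf_counter() - start) / iterations
```

Injecting compile_fn/run_fn keeps the engine generic: the XeGPU path, a CPU reference path, or a mock for tests can all plug into the same timing loop.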
  3. XeGPU matmul Workload and benchmarking tools
    • Currently supports only matmul + elementwise post ops.
    • Location: python/lighthouse/benchmark/xegpu/matmul.py (?)
    • Generates payload, defines lowering schedule, compiles, executes, measures performance.
    • Uses existing functionality: workload, payload generator, execution tools, etc.
    • Exposed as another CLI command, e.g., benchmark_xegpu_matmul
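For step 3, the matmul workload would generate a parameterized payload. As an illustration only (the real IR comes from the payload generator and will differ), a linalg.matmul template plus the GFLOPS arithmetic behind the sample output further below could look like:

```python
def matmul_payload_ir(m, n, k, dt_in="f16", dt_acc="f32"):
    # Illustrative payload template: a plain linalg.matmul on tensors,
    # parameterized by problem size and input/accumulator element types.
    return f"""\
func.func @matmul(%A: tensor<{m}x{k}x{dt_in}>, %B: tensor<{k}x{n}x{dt_in}>,
                  %C: tensor<{m}x{n}x{dt_acc}>) -> tensor<{m}x{n}x{dt_acc}> {{
  %0 = linalg.matmul
         ins(%A, %B : tensor<{m}x{k}x{dt_in}>, tensor<{k}x{n}x{dt_in}>)
         outs(%C : tensor<{m}x{n}x{dt_acc}>) -> tensor<{m}x{n}x{dt_acc}>
  return %0 : tensor<{m}x{n}x{dt_acc}>
}}
"""


def matmul_gflops(m, n, k, time_ms):
    # An (m, n, k) matmul performs 2*m*n*k floating-point operations.
    return 2.0 * m * n * k / (time_ms * 1e-3) / 1e9
```

For example, 4096x4096x4096 at 1.76 ms works out to roughly 78,000 GFLOPS, matching the sample output of the CLI tool.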
  4. Installation mechanism for XeGPU support:
    • Compile LLVM with LevelZero runtime and necessary flags.
    • Hook up LLVM Python bindings with Lighthouse.
      • Simply set PYTHONPATH="$LLVM_INSTALL_PATH/python_packages/mlir_core/"
    • Provide an easy-to-use install mechanism.
      • Build instructions in README(?).
      • Or a generic build script, invoked on install(?).
        • mlir-python-bindings Python package is not installed in this case(?).
      • When executing kernels, raise a descriptive error if Xe GPU device/drivers are missing.
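For the error-reporting bullet in step 4, a hypothetical guard could fail early with a descriptive message when the MLIR Python bindings are not importable; a real device probe would additionally query the LevelZero runtime. The function name, injectable importer, and message are all illustrative:

```python
def require_mlir_bindings(importer=__import__):
    """Raise a descriptive error if the LLVM/MLIR Python bindings are missing.

    `importer` is injectable for testing; it defaults to the builtin import.
    """
    try:
        importer("mlir.ir")
    except ImportError as err:
        raise RuntimeError(
            "MLIR Python bindings not found. Build LLVM with the LevelZero "
            "runtime and point PYTHONPATH at "
            "$LLVM_INSTALL_PATH/python_packages/mlir_core/."
        ) from err
```

Calling this at CLI startup turns a bare ImportError deep in the stack into an actionable install hint for the user.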

Examples of benchmark command-line tool usage:

$ benchmark_xegpu_matmul
sizes=4096,4096,4096 dt=f16,f32 wg-tile=256,256 sg-tile=32,32 ... time(ms): 1.76 GFLOPS: 78072
$ benchmark_xegpu_matmul --sizes 4096 2048 1024 --wg-tile-size 256 256 ...
...
$ benchmark_xegpu_matmul --dump-kernel {initial,tiled,vectorized,bufferized,xegpu-wg,...}
<prints payload IR at this stage of lowering>
$ benchmark_xegpu_matmul --dump-kernel xegpu-wg --dump-schedule
<also dumps the transform schedule used to lower to this level>
