MeshPress is a 3D mesh compression library for real-time GPU decompression. It extends the AMD GPUOpen meshlet compression approach with two encoders that both beat the AMD baseline while remaining crack-free:
- `MeshletGTSSegDelta` — segmented-delta vertices on GTS connectivity. 24% better compression than the AMD baseline, GPU-parallel decode (~10 µs/meshlet on an RTX 3090).
- `MeshletParaDelta` — Touma-Gotsman parallelogram predictor on a boundary/interior split, with Morton-delta boundary coding and an optional 13-feature curvature MLP. ~6–10% better compression than `GTSSegDelta` (29.34 vs 31.17 BPV on a 1.8M-vertex tank model; 30.16 vs 32.49 on stanford-bunny), still crack-free. CUDA-accelerated encode via a CuPy `RawKernel` — 33× speedup on the para predictor for million-vertex meshes.
- Crack-free meshlet compression via global quantization grid
- GPU-parallel decompression using GTS strip decode (countbits/firstbithigh intrinsics)
- Segmented delta vertex encoding: 24% smaller vertices than AMD baseline
- Parallelogram + delta-boundary encoder: another ~7% on top of SegDelta
- Controlled precision: guaranteed maximum reconstruction error
- Scales to millions of polygons (tested on 1.8M vertex models)
```bash
git clone https://github.com/maletsden/meshpress.git
cd meshpress
pip install -r requirements.txt
pip install scikit-learn cupy-cuda12x  # for meshlet encoders and GPU benchmarks
```

```python
from reader import Reader
from encoder import MeshletGTSSegDelta, MeshletGTSPlain, MeshletParaDelta

model = Reader.read_from_file('assets/stanford-bunny.obj')

# Our best: GTS connectivity + segmented delta vertices (crack-free, GPU-parallel)
encoder = MeshletGTSSegDelta(max_verts=256, precision_error=0.0005, verbose=True)
compressed = encoder.encode(model)

# AMD baseline: GTS connectivity + plain quantized vertices (crack-free, GPU-parallel)
encoder = MeshletGTSPlain(max_verts=256, precision_error=0.0005, verbose=True)
compressed = encoder.encode(model)

# Best compression: parallelogram predictor + delta boundary (crack-free)
encoder = MeshletParaDelta(max_verts=384, precision_error=0.0005,
                           use_nn=True, verbose=True)
compressed = encoder.encode(model)

# Million-vertex meshes: skip NN, run para predictor on GPU (CuPy)
encoder = MeshletParaDelta(max_verts=512, precision_error=0.0005,
                           use_nn=False, use_cuda=True, verbose=True)
compressed = encoder.encode(model)
```

All encoders use AMD GTS connectivity (Generalized Triangle Strip with L/R flags + inc/reuse packing, ~6 bpt, GPU-parallel decode via bit intrinsics). The difference is vertex encoding only.
| Model | Verts | Tris | AMD Baseline BPV | SegDelta BPV | Haar BPV | ParaDelta BPV | ParaDelta vs AMD |
|---|---|---|---|---|---|---|---|
| bunny | 2,503 | 4,968 | 47.34 | 41.53 | 42.28 | 37.19 | +21.4% |
| torus | 3,840 | 7,680 | 56.48 | 39.07 | 38.56 | 39.94 | +29.3% |
| stanford-bunny | 35,947 | 69,451 | 37.90 | 32.49 | 34.47 | 30.16 | +20.4% |
| Monkey | 504,482 | 1,007,616 | 41.54 | 31.69 | 33.77 | 31.53 | +24.1% |
| tank | 1,806,639 | 3,514,247 | — | 31.17 | — | 29.34 | −5.9% vs SegDelta |
All crack-free (0 shared vertex mismatches verified). 100% within error target.
ParaDelta is `MeshletParaDelta` at its default settings (`max_verts=384` with `use_nn=True` on bunny/stanford-bunny; `max_verts=512` with `use_nn=False, use_cuda=True` on Monkey/tank). SegDelta still wins on torus: at low triangle counts the boundary refs cost more than the interior prediction saves.
The para predictor is the per-meshlet causal hot loop. With `use_cuda=True` (CuPy `RawKernel`, 32 threads/block, 1 thread per meshlet) the same encoder runs on the GPU and produces bit-identical output.
| Mesh | CPU para | GPU para | Speedup | Total encode (CPU → GPU) |
|---|---|---|---|---|
| stanford-bunny (36K verts) | 6.0 s | — | — | — |
| Monkey (504K verts) | 56 s | 1.7 s | 33× | 92 s → 36 s |
| tank (1.8M verts) | ~10 min (est.) | ~3 s (est.) | ~30× | ~10 min → 127 s |
The CUDA path is required to test MeshletParaDelta on million-vertex models in
reasonable time — the CPU para predictor is O(meshlet_count × meshlet_size²) in
practice. Phase A (vert ordering, sort, GTS connectivity) and meshlet generation
are still pure-Python and dominate the remaining encode time.
Size breakdown (Monkey, `MeshletGTSSegDelta`):

| Component | Size | % | BPV | Description |
|---|---|---|---|---|
| Headers | 117 KB | 5.9% | 1.86 | Global + per-meshlet metadata |
| Vertex data | 1,083 KB | 54.2% | 17.18 | Segmented delta encoded |
| Connectivity | 798 KB | 39.9% | 12.66 | GTS: L/R flags + inc/reuse |
| Total | 1,998 KB | 100% | 31.69 | 9.1x compression |
| Method | S-Bunny | Monkey | vs Plain | GPU Parallel |
|---|---|---|---|---|
| Plain (AMD baseline) | 104 KB | 1,704 KB | — | Trivial (just read) |
| Segmented Delta | 80 KB | 1,083 KB | -36% | 32 independent prefix sums |
| Haar Wavelet | 89 KB | 1,215 KB | -29% | log(N) steps, 6 syncs |
| CDF 5/3 Wavelet | 91 KB | 1,228 KB | -28% | log(N) steps, 6 syncs |
Segmented delta gives the best compression AND the fastest GPU decode: 32 independent segments decoded via prefix sum (2 syncs total vs 6 for wavelets).
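In NumPy terms, the segmented-delta scheme reduces to a split + diff on encode and an independent per-segment cumulative sum on decode. A toy sketch (not the library's bit-packed implementation):

```python
import numpy as np

def segdelta_encode(values, n_segments=32):
    """Split one quantized channel into segments; keep an anchor plus deltas each."""
    segments = np.array_split(values, n_segments)
    anchors = [int(s[0]) for s in segments]
    deltas = [np.diff(s) for s in segments]
    return anchors, deltas

def segdelta_decode(anchors, deltas):
    """Each segment is an independent prefix sum, so all 32 decode in parallel."""
    parts = [np.concatenate(([a], a + np.cumsum(d))) for a, d in zip(anchors, deltas)]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
q = rng.integers(0, 1 << 14, size=256)   # one quantized vertex channel (toy data)
anchors, deltas = segdelta_encode(q)
assert np.array_equal(segdelta_decode(anchors, deltas), q)  # lossless on integers
```

Because each of the 32 segments only depends on its own anchor, a GPU thread group can run all prefix sums concurrently with just the synchronization points listed above.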
| Kernel | Stanford-Bunny (36K) | Monkey (504K) |
|---|---|---|
| AMD GTS+Plain (baseline) | 11.2 µs | 36.7 µs |
| GTS+SegDelta Opt v3 (ours) | 9.8 µs | 44.3 µs |
| GTS+Haar Opt v3 (ours) | ~10 µs | ~45 µs |
Our segmented delta adds only +21% decode overhead on large meshes vs the plain AMD baseline, while providing 24% better compression. On small meshes it's actually faster.
MeshletParaDelta extends the meshlet pipeline with three ideas that, together,
beat both GTSSegDelta and the AMD baseline by ~20–24% on real meshes:
```
Mesh → Global int-grid quantization (crack-free)
     → Meshlet generation (greedy, max 384 verts default)
     → Per meshlet:
         Boundary / interior split
           Boundary refs: Morton-sort table → axis-delta + Rice
         Interior verts: causal Touma-Gotsman parallelogram predictor
           residual = true − (a + b − c)
           per-axis best of {fixed, Rice, exp-Golomb, arith}
         Optional: 13-feat MLP curvature bias (second-ring apexes d_ac, d_bc
           projected into local frame; deterministic discovery, no
           flag bits transmitted; ~3x encode-time cost; ≤0.1 BPV gain)
         Connectivity: GTS v3 (L/R + FIFO reuse)
     → Compressed stream
```
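The causal parallelogram step above can be sketched on the integer grid (a minimal illustration of the predictor, not the library's encoder):

```python
import numpy as np

def para_residual(v, a, b, c):
    """Residual against the parallelogram prediction v_hat = a + b - c."""
    return v - (a + b - c)

def para_reconstruct(r, a, b, c):
    """Decoder side: add the residual back onto the same prediction."""
    return r + (a + b - c)

# quantized positions of an already-decoded triangle (a, b, c)
a = np.array([10, 4, 7])
b = np.array([12, 5, 7])
c = np.array([11, 3, 6])
v = np.array([12, 7, 9])              # true vertex across the gate edge (a, b)
r = para_residual(v, a, b, c)         # small residual -> cheap to entropy-code
assert np.array_equal(para_reconstruct(r, a, b, c), v)
```

On smooth surfaces the residual magnitudes are far smaller than the raw coordinates, which is what makes the per-axis Rice/exp-Golomb choice pay off.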
The boundary table is the key trick. A flat layout costs `n_boundary * 3 * g_bits` bits. Sorting boundary verts by 3D Morton code clusters them spatially, so zigzag-Rice coding of the per-axis deltas roughly halves the cost. Per-meshlet boundary refs then become small ascending integers, so Rice on the sorted deltas wins again.
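The Morton-sort + zigzag-delta idea can be sketched in a few lines (`morton3` and `zigzag` here are illustrative helpers, not the library's API):

```python
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one 3D Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def zigzag(n):
    """Map signed deltas to unsigned so small magnitudes stay small for Rice."""
    return 2 * n if n >= 0 else -2 * n - 1

# toy boundary vertices on the global quantization grid
pts = [(5, 9, 2), (6, 8, 2), (30, 1, 17), (5, 8, 3)]
order = sorted(range(len(pts)), key=lambda i: morton3(*pts[i]))
sorted_pts = [pts[i] for i in order]
# spatial neighbours end up adjacent, so per-axis deltas are small
deltas = [[zigzag(q[k] - p[k]) for k in range(3)]
          for p, q in zip(sorted_pts, sorted_pts[1:])]
```

Note how the two nearby points `(5, 9, 2)` and `(5, 8, 3)` become adjacent after the sort, leaving tiny deltas, while the one distant point pays its large delta only once.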
| stanford-bunny mv=384 | flat boundary | delta boundary |
|---|---|---|
| Boundary table | 20,742 B | 11,636 B |
| Boundary refs | 21,458 B | 10,380 B |
| Total | 156 KB | 136 KB |
| BPV | 34.72 | 30.23 |
The 13-feature MLP improves the parallelogram prediction by projecting the
two second-ring apexes (the third vertices of the triangles adjacent to T'
across edges (a,c) and (b,c)) into the local triangle frame. Both encoder
and decoder discover those apexes deterministically from the already-decoded
neighbour set, so no flag bit needs to be transmitted. The net BPV gain is small (−0.07 on stanford-bunny, −0.06 on Monkey), and on small meshes the fp32 weight header (~1.1 KB) outweighs the saving, so `use_nn=False` is the right choice for meshes below ~10K verts.
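The local-frame projection that makes those apex features frame-invariant can be sketched as follows (a geometric illustration; the function names are not the library's API):

```python
import numpy as np

def local_frame(a, b, c):
    """Orthonormal frame of triangle (a, b, c): u along (a - c), w = face
    normal, v = w x u. Rows of the returned matrix are the frame axes."""
    u = (a - c) / np.linalg.norm(a - c)
    w = np.cross(a - c, b - c)
    w = w / np.linalg.norm(w)
    return np.stack([u, np.cross(w, u), w])

def project(p, a, b, c):
    """Coordinates of point p expressed in the triangle's local frame."""
    return local_frame(a, b, c) @ (p - c)

# toy triangle and a second-ring apex across edge (a, c)
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
c = np.zeros(3)
apex = np.array([1.0, -1.0, 0.0])
feat = project(apex, a, b, c)   # frame-invariant feature for the MLP
```

Because both sides compute the same frame from already-decoded vertices, the features are reproducible at decode time without any extra bits.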
```python
# Small-to-medium meshes (NN gives ~0.07 BPV extra)
encoder = MeshletParaDelta(max_verts=384, precision_error=0.0005,
                           use_nn=True, nn_steps=300, verbose=True)
compressed = encoder.encode(model)

# Million-vertex meshes (skip NN, run para predictor on GPU)
encoder = MeshletParaDelta(max_verts=512, precision_error=0.0005,
                           use_nn=False, use_cuda=True, verbose=True)
compressed = encoder.encode(model)
```

```
Mesh → Global Quantization (crack-free integer grid)
     → Meshlet Generation (greedy region growing, max 256 verts)
     → Per meshlet:
         BFS vertex ordering (spatial locality for delta encoding)
         GTS strip generation (L/R flags + inc/reuse packing)
         Vertex encoding: segmented delta on globally-quantized integers
     → Compressed stream
```
```
┌── Meshlet Header (37 bytes)
│     n_verts, n_tris, per-channel: base_min, base_bits, delta_min, delta_bits
├── GTS Connectivity
│     L/R flags:    1 bit per triangle (which edge reused)
│     Inc flags:    1 bit per strip entry (new vertex vs reused)
│     Reuse buffer: uint8 per reused vertex (local index)
├── Vertex Data (per channel x, y, z)
│     Base stream:  32 anchor values × base_bits
│     Delta stream: (n_verts − 32) deltas × delta_bits
└── GPU decode: countbits + firstbithigh for connectivity
                32 parallel prefix sums for vertices
```
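Given that layout, the per-meshlet vertex payload is easy to budget. A small sketch with a hypothetical meshlet (the bit widths below are made-up example values, not measured ones):

```python
def meshlet_vertex_bits(n_verts, base_bits, delta_bits):
    """Vertex payload per the layout above: 32 anchors plus (n_verts - 32)
    deltas, with per-channel bit widths taken from the meshlet header."""
    base = 32 * sum(base_bits)                 # one anchor per segment, per channel
    delta = (n_verts - 32) * sum(delta_bits)   # remaining verts stored as deltas
    return base + delta

# hypothetical meshlet: 256 verts, 14-bit anchors and 6-bit deltas on every axis
bits = meshlet_vertex_bits(256, base_bits=[14, 14, 14], delta_bits=[6, 6, 6])
total_bytes = 37 + (bits + 7) // 8             # plus the 37-byte meshlet header
```

The point of the split is visible here: only 32 values per channel pay the wide anchor width, while the bulk of the vertices pay the narrow delta width.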
```
Phase 1: Read header (1 thread → broadcast)                  [1 sync]
Phase 2: GTS connectivity decode (parallel, bit intrinsics)  [0 syncs]
Phase 3: Vertex base values (32 threads)                     [1 sync]
Phase 4: Segment prefix sums (96 parallel: 32 segs × 3 ch)   [1 sync]
Phase 5: Global dequantize + write output (parallel)         [0 syncs]

Total: 3 __syncthreads
```
All vertices are quantized to a global integer grid at encode time. The segmented delta encoding is lossless on integers (prefix sum exactly reconstructs the original sequence). Shared boundary vertices in different meshlets get the same integer code → identical float reconstruction → zero cracks.
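A toy sketch of that guarantee, assuming the grid step is twice `precision_error` (round-to-nearest then bounds the error by half a step; the helper names are illustrative):

```python
import numpy as np

def quantize(verts, step):
    """Snap positions to one global integer grid shared by all meshlets."""
    return np.round(np.asarray(verts) / step).astype(np.int64)

def dequantize(codes, step):
    return codes.astype(np.float64) * step

precision_error = 0.0005
step = 2 * precision_error                 # round-to-nearest error <= step / 2
shared = np.array([0.12345, -0.5, 0.9])    # a vertex shared by two meshlets
c1 = quantize(shared, step)                # meshlet A encodes it...
c2 = quantize(shared, step)                # ...meshlet B encodes the same float
assert np.array_equal(c1, c2)              # identical integer codes -> zero cracks
assert np.all(np.abs(dequantize(c1, step) - shared) <= precision_error)
```

Local (per-meshlet AABB) quantization would break this property: the same float would land on different grids in different meshlets and reconstruct to slightly different positions.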
```bash
# Full encoder comparison on all models
python test_final_benchmark.py

# CUDA GPU decode speed benchmark (requires CuPy)
python benchmark_cuda_decode.py

# Wavelet/delta comparison
python test_wavelet_meshlet_compression.py

# Connectivity encode/decode verification
python test_connectivity_verify.py
```

| Encoder | Connectivity | Vertices | LOD | Crack-free | GPU Decode |
|---|---|---|---|---|---|
| `MeshletParaDelta` | GTS v3 | Para-predict + delta boundary | No | Yes | Parallel (CPU-decoded refs) |
| `MeshletGTSSegDelta` | GTS | Segmented delta | No | Yes | Parallel |
| `MeshletGTSHaar` | GTS | Haar wavelet | No | Yes | Parallel |
| `MeshletGTSPlain` | GTS | Plain quantized | No | Yes | Parallel |
| `MeshletWaveletGlobalEB` | EdgeBreaker | Int wavelet | No | Yes | Sequential |
| `MeshletPlainAMD` | Raw packed | Global grid | No | Yes | Parallel |
| `MeshletLOD` | FIFO-adjacency | Chain-delta + entropy | Yes (5 levels) | Yes | Parallel |
MeshletLOD is a progressive Level-of-Detail encoder built on QEM simplification
with synchronized boundary protection. A single compressed bitstream can be
decoded at any of 5 resolution levels without re-encoding; every level produces
a valid, crack-free triangulation covering the full mesh area.
```python
from reader import Reader
from encoder import MeshletLOD, decode_lod
from encoder.implementation.meshlet_wavelet import _to_numpy

model = Reader.read_from_file('assets/stanford-bunny.obj')
enc = MeshletLOD(max_verts=256, precision_error=0.0005, n_lod_levels=5, verbose=True)
compressed = enc.encode(model)

# Decode at a chosen LOD (0 = coarsest, 4 = full resolution)
verts, tris = _to_numpy(model)
out_verts, out_tris = decode_lod(verts, tris, compressed, lod_level=2)
```

```
Mesh → Generate meshlets (greedy region growing, max 256 verts)
     → Identify boundary vertices (shared between ≥2 meshlets)
     → QEM on full mesh with boundary PROTECTED from collapse
     → QEM on boundary subgraph alone (boundary-to-boundary collapses)
     → Per meshlet:
         Local AABB quantization for interior vertices
         Chain-delta encoding of interior positions (delta from collapse parent)
         FIFO-adjacency bitstream for connectivity
     → Compact ancestor tables:
         (collapse_step, direct_parent) per vertex
         Entropy/exp-Golomb coded parent deltas
```
- Geometric — boundary vertices use a global quantization grid, so the same boundary vertex yields bit-identical positions in every meshlet.
- Topological — triangles are redirected, not dropped. When a vertex collapses into its ancestor at low LOD, any triangle containing it gets its vertex replaced by the ancestor. Triangles that become degenerate disappear into neighbouring triangles; full area coverage is preserved.
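The redirection rule can be sketched with NumPy (a toy illustration; `ancestor` here stands for the already-resolved per-vertex ancestor index at the chosen LOD):

```python
import numpy as np

def redirect_triangles(tris, ancestor):
    """Replace every vertex index by its LOD ancestor, then drop triangles
    that became degenerate (two or more corners collapsed together)."""
    t = ancestor[tris]                              # redirect all corners at once
    keep = (t[:, 0] != t[:, 1]) & (t[:, 1] != t[:, 2]) & (t[:, 0] != t[:, 2])
    return t[keep]

# toy mesh: at this LOD, vertex 3 has collapsed into vertex 1
ancestor = np.array([0, 1, 2, 1])
tris = np.array([[0, 1, 2], [1, 2, 3], [0, 2, 3]])
out = redirect_triangles(tris, ancestor)            # [1, 2, 3] degenerates away
```

The degenerate triangle's area is not lost: the surviving triangle `[0, 2, 3] → [0, 2, 1]` stretches over it, which is why coverage is preserved at every level.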
| Model | Verts | Tris | BPV | Size | Non-LOD GTSSegDelta BPV | LOD overhead |
|---|---|---|---|---|---|---|
| bunny | 2,503 | 4,968 | 52.32 | 16 KB | 41.53 | +26% |
| torus | 3,840 | 7,680 | 50.25 | 24 KB | 39.07 | +29% |
| stanford-bunny | 35,947 | 69,451 | 49.80 | 224 KB | 32.49 | +53% |
| Monkey | 504,482 | 1,007,616 | 50.69 | 3.1 MB | 31.69 | +60% |
The ~25–60% BPV overhead vs a non-LOD encoder is the cost of progressive decode capability — a single bitstream covering 5 resolution levels instead of one fixed resolution.
| Component | Size | % | Description |
|---|---|---|---|
| Boundary positions | 24 KB | 11% | Global-quant, shared across meshlets |
| Interior positions | 43 KB | 19% | Local-AABB quant + chain-delta |
| Interior ancestors | 88 KB | 39% | Compact (collapse_step, parent) + entropy |
| Boundary ancestors | 20 KB | 9% | Same format, separate chain |
| Connectivity (FIFO) | 48 KB | 22% | Real AMD-style FIFO-adjacency bitstream |
| Total | 224 KB | 100% | 5 LODs in one bitstream |
| LOD | Verts | Tris | % of full |
|---|---|---|---|
| 0 | 3,004 | 3,727 | 5% |
| 1 | 11,241 | 20,056 | 29% |
| 2 | 19,476 | 36,520 | 53% |
| 3 | 27,712 | 52,989 | 76% |
| 4 | 35,947 | 69,451 | 100% |
Per-frame hot path: ancestor resolution (K1) → boundary composition (K2) → triangle redirection + emission (K3). All three kernels are embarrassingly parallel (1 thread per vertex or per triangle).
| Model | LOD 0 total | LOD 4 total | Fused kernel | Throughput |
|---|---|---|---|---|
| bunny (5K tris) | 65 µs | 80 µs | 24 µs | — |
| torus (7.7K tris) | 58 µs | 70 µs | 23 µs | — |
| stanford-bunny (70K tris) | 112 µs | 76 µs | 34 µs | ~900 M tri/s |
| Monkey (1M tris) | 268 µs | 78 µs | 37 µs | ~12.8 G tri/s |
One-time mesh-load costs (connectivity decode + ancestor table upload) are separate and amortized over all frames.
```bash
# Run the CUDA decompression benchmark
python benchmark_cuda_a_fifo.py
```

Per-frame (parallel):

```
K1   1 thread / vertex     Walk compact collapse chain → interior ancestor
K1'  1 thread / vertex     Walk boundary chain → boundary ancestor
K2   1 thread / vertex     Compose: combined = bnd_anc[int_anc]
K3   1 thread / triangle   Redirect 3 verts, skip degenerate, emit
```

One-time (at mesh load):

- Decode FIFO-adjacency connectivity into per-meshlet triangle lists
- Propagate chain-delta interior positions (roots → children)
- Upload ancestor tables, triangle list, positions to GPU
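A minimal sketch of K1's chain walk. The cutoff semantics and names here are assumptions for illustration (step −1 meaning "never collapsed", lower steps collapsing first), not the library's exact kernel:

```python
import numpy as np

def resolve_ancestor(v, collapse_step, parent, cutoff):
    """Walk the compact (collapse_step, direct_parent) chain until reaching a
    vertex still alive at this LOD. Assumption: a vertex whose collapse_step
    is <= cutoff has been collapsed away at this level."""
    while collapse_step[v] != -1 and collapse_step[v] <= cutoff:
        v = parent[v]
    return int(v)

# toy chain: vertex 3 collapsed into 1 at step 1, vertex 1 into 0 at step 2
collapse_step = np.array([-1, 2, -1, 1])
parent = np.array([0, 0, 2, 1])
coarse = resolve_ancestor(3, collapse_step, parent, cutoff=2)   # fully coarse
```

Each vertex's walk is independent, which is why K1 can run as one thread per vertex with no synchronization.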
| Aspect | `MeshletGTSSegDelta` | `MeshletLOD` |
|---|---|---|
| BPV | 32 (stanford-bunny) | 50 |
| LOD levels | 1 (fixed) | 5 (progressive) |
| Crack-free | Yes | Yes (geometric + topological) |
| GPU decode per frame | ~10 µs | ~80 µs |
| GPU throughput | ~35 G tri/s | ~12 G tri/s |
| Re-encoding for different LOD | Required | Not needed |
| Use case | Static draws | Distance-based LOD |
Choose `MeshletGTSSegDelta` when you need only one resolution level at minimum size and fastest decode. Choose `MeshletLOD` when camera distance varies and you want smooth transitions between detail levels from a single compressed representation.
| Encoder Model | Bytes/tri | Bytes/vert | Ratio |
|---|---|---|---|
| BaselineEncoder | 18.05 | 35.82 | 1.00 |
| PackedGTSQuantizator (radix) | 4.18 | 8.29 | 4.32 |
| PackedGTSEllipsoidFitter(0.0005, 4) | 2.75 | 5.47 | 6.55 |
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please open an issue or submit a pull request.

