VAE produces grid-like tile artifacts on flat regions (constant-green diagnostic) #202

@AmitMY

Description

We've been fine-tuning the LTX-2.3 VAE for a downstream tokenizer and noticed a regular grid pattern in the decoder output, most visible on flat regions (uniform backgrounds). It's small per-pixel but very structured, and we believe the same pattern contaminates natural images — it's just easier to isolate against a constant input.

The test

Encode → decode a constant (0, 128, 0) clip and look at the recon and its FFT. A constant input has zero spatial structure, so any structure in the output is the VAE's contribution.

Script (~50 lines)
import numpy as np
import torch
from src.ltx_vae import load_ltx_vae

enc, dec = load_ltx_vae("/path/to/vae.safetensors", device="cuda")
enc.eval(); dec.eval()

bg = torch.tensor([0, 128, 0], dtype=torch.float32) / 255.0
src = (bg * 2 - 1).view(1, 3, 1, 1, 1).expand(1, 3, 49, 256, 256).contiguous().cuda()
with torch.no_grad():
    recon = dec(enc(src)).clamp(-1, 1).cpu().numpy()

# FFT of the green channel of the middle frame, centre-cropped to 128x128.
g = recon[0, 1, 24, 64:192, 64:192]
mag = np.log10(np.abs(np.fft.fftshift(np.fft.fft2(g))) + 1e-9)

# Quantitative measure: fraction of summed FFT magnitude outside a 5x5 window around DC.
f = np.fft.fftshift(np.fft.fft2(recon[0, 1, 24]))
cy, cx = f.shape[0] // 2, f.shape[1] // 2
f_no_dc = f.copy(); f_no_dc[cy-2:cy+3, cx-2:cx+3] = 0
print(f"off-DC fraction: {np.abs(f_no_dc).sum() / np.abs(f).sum():.4f}")

What we see

Metric (G channel)     LTX-2.3
max |Δ|                0.124 (on a [0,1]-normalised input)
mean |Δ|               0.0026
FFT off-DC fraction    0.98

98% of the recon's summed FFT magnitude lies outside the 5×5 window around DC, i.e. it is in spatial structure, almost all of it on a regular grid. The grid period matches the VAE's patch_size=4 upsample stride and its harmonics.
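One way to check the period claim is to read it off the spectrum directly. The sketch below is a minimal illustration, not the code behind the numbers above: `dominant_period` is a hypothetical helper, and the synthetic frame (a flat value plus a faint period-4 ripple) stands in for a real recon frame such as `recon[0, 1, 24]`.

```python
import numpy as np

def dominant_period(frame, axis):
    """Period in pixels of the strongest non-DC component along `axis`.

    axis=0 looks for structure along y, axis=1 along x. The frame is
    averaged over the other axis first, so purely diagonal patterns
    are not detected by this simple version.
    """
    profile = frame.mean(axis=1 - axis)
    profile = profile - profile.mean()      # drop the DC term
    spec = np.abs(np.fft.rfft(profile))
    spec[0] = 0.0                           # ignore any residual DC
    k = int(np.argmax(spec))
    if k == 0:
        return float("inf")                 # no non-DC component found
    # rfftfreq is in cycles per pixel; the period is its reciprocal.
    return float(1.0 / np.fft.rfftfreq(profile.size)[k])

# Synthetic stand-in: flat value plus a weak ripple repeating every 4 pixels.
yy, xx = np.mgrid[0:128, 0:128]
frame = 0.5 + 0.01 * np.cos(2 * np.pi * xx / 4) + 0.01 * np.cos(2 * np.pi * yy / 4)

print(dominant_period(frame, axis=0), dominant_period(frame, axis=1))
```

On a real recon, a result of 4 (or 2, its first harmonic) on both axes would be consistent with the patch_size=4 explanation; other values would point elsewhere.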

[Image: source frame | recon | abs(diff)×50]

[Image: FFT log-magnitude]

[Image: FFT log-magnitude, source | recon, 128×128 centre crop]

We also see strong artifacts at the frame boundaries (presumably from conv-stack padding); the centre crop above isolates the periodic structure cleanly.

Why we think this matters for real video

Natural-image variance hides the grid visually, but the FFT energy is still being deposited there — it's just masked by content. For downstream tasks that care about high-frequency fidelity (compression, sharp-region reconstruction, anti-aliasing under camera pans) this floor is the limit of what any fine-tune on top can achieve. We've observed this empirically: even after substantial domain-adaptation training with perceptual + segmentation-weighted losses, the per-frame FFT continues to show the same harmonics in foreground regions.
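To look for the grid underneath real content, one option is to compare spectral power at the expected grid harmonics with the power in the bins right next to them. The following is a sketch under assumptions: the period of 4 is taken from the text above, `grid_peak_ratio` and its `halo` parameter are hypothetical, only horizontal harmonics are checked, and the noise-plus-grid frame is synthetic stand-in data rather than a real recon.

```python
import numpy as np

def grid_peak_ratio(frame, period=4, halo=2):
    """Power at the horizontal grid harmonics relative to neighbouring bins.

    Values near 1 mean nothing stands out at the harmonics; larger values
    mean a component with this period rises above the surrounding content.
    """
    h, w = frame.shape
    if w % period != 0:
        raise ValueError("width must be a multiple of the period")
    # Power spectrum along x, averaged over all rows.
    power = (np.abs(np.fft.rfft(frame - frame.mean(), axis=1)) ** 2).mean(axis=0)
    step = w // period
    ratios = []
    for k in range(step, w // 2 + 1, step):     # bins w/period, 2w/period, ...
        lo, hi = max(k - halo, 1), min(k + halo, power.size - 1)
        neighbours = [power[j] for j in range(lo, hi + 1) if j != k]
        ratios.append(power[k] / (np.mean(neighbours) + 1e-12))
    return float(np.mean(ratios))

# Synthetic stand-in: noise-like "content" with and without a period-4 grid.
rng = np.random.default_rng(0)
content = rng.normal(0.0, 0.1, size=(256, 256))
grid = 0.05 * (np.arange(256) % 4 == 0)         # broadcasts across rows

print(grid_peak_ratio(content))
print(grid_peak_ratio(content + grid))
```

Because the comparison is local to each harmonic, the ratio stays meaningful even when the overall spectrum is dominated by image content.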

Suggested directions if you're iterating on the VAE architecture

This is the textbook checkerboard pattern from stride-s transposed convs (Odena, Dumoulin, Olah, "Deconvolution and Checkerboard Artifacts," Distill 2016 — https://distill.pub/2016/deconv-checkerboard/). In rough order of disruption:

  1. Resize-then-conv upsampling — replace ConvTranspose(stride=s) with Upsample(scale=s) + Conv(stride=1). One-line change per stage; eliminates uneven kernel overlap. (Odena et al. 2016, link above.)
  2. PixelShuffle with ICNR initialisation — fast on GPU, checkerboard-free if initialised so the sub-pixel conv equals nearest-neighbour upsample at start. Shi et al., "Real-Time Single Image and Video Super-Resolution…", CVPR 2016, https://arxiv.org/abs/1609.05158 ; Aitken et al., "Checkerboard artifact free sub-pixel convolution," 2017, https://arxiv.org/abs/1707.02937.
  3. BlurPool downsampling in the encoder — Zhang, "Making Convolutional Networks Shift-Invariant Again," ICML 2019, https://arxiv.org/abs/1904.11486. Stops aliased frequencies from being baked into the latent on the way in.
  4. Alias-free design throughout — the principled fix; sinc-windowed up/downsample respecting the Nyquist limit. Karras et al., "Alias-Free Generative Adversarial Networks" (StyleGAN3), NeurIPS 2021, https://arxiv.org/abs/2106.12423.
  5. Patch-based ViT tokenizer (no conv stride at all): some recent tokenizers go this route and avoid the grid by construction. Yu et al., "An Image is Worth 32 Tokens for Reconstruction and Generation" (TiTok), NeurIPS 2024, https://arxiv.org/abs/2406.07550, is a ViT-based image tokenizer. For a video-tokenizer reference point, see Yu et al., "Language Model Beats Diffusion — Tokenizer is Key to Visual Generation" (MAGVIT-v2), ICLR 2024, https://arxiv.org/abs/2310.05737, though that model uses a causal 3D CNN rather than a ViT.
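The "uneven kernel overlap" behind item 1 can be seen in a toy 1-D case. The sketch below is an illustration only, with made-up sizes (kernel 3, stride 2) and hypothetical helper names; it says nothing about the actual LTX decoder layers. With an all-ones kernel and a constant input, each output value simply counts how many kernel copies land on it.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution: each input sample stamps a scaled kernel copy."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += v * kernel
    return out

def resize_then_conv1d(x, kernel, scale):
    """Nearest-neighbour upsample by `scale`, then an ordinary stride-1 convolution."""
    up = np.repeat(x, scale)
    return np.convolve(up, kernel, mode="same")

x = np.ones(8)        # constant input, like the flat-colour test clip
kernel = np.ones(3)   # all-ones kernel, so outputs count overlapping contributions

tc = transposed_conv1d(x, kernel, stride=2)
rc = resize_then_conv1d(x, kernel, scale=2)

print(tc[2:-2])   # interior alternates between two values, with period = stride
print(rc[1:-1])   # interior is flat
```

A trained kernel can partly compensate for the uneven overlap, but it has to learn to do so; with resize-then-conv the flat response comes for free.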

Happy to share the full diagnostic + numbers from our domain. The script above is fast (a few seconds) and the off-DC fraction on a constant input would make a useful CI regression test for any future VAE.
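As a rough idea of what such a regression check could look like, here is a sketch that reuses the same measure as the script above. The names `off_dc_fraction` and `check_flat_recon` and the `max_off_dc` threshold are placeholders, not values or APIs from this issue; in a real test the input would be a decoded frame such as `recon[0, 1, 24]`, replaced here by synthetic arrays.

```python
import numpy as np

def off_dc_fraction(frame, dc_halfwidth=2):
    """Share of summed FFT magnitude outside a small window around DC."""
    f = np.fft.fftshift(np.fft.fft2(frame))
    total = np.abs(f).sum()
    if total == 0:
        return 0.0                          # all-zero frame: nothing to measure
    cy, cx = f.shape[0] // 2, f.shape[1] // 2
    f_no_dc = f.copy()
    f_no_dc[cy - dc_halfwidth : cy + dc_halfwidth + 1,
            cx - dc_halfwidth : cx + dc_halfwidth + 1] = 0
    return float(np.abs(f_no_dc).sum() / total)

def check_flat_recon(recon_frame, max_off_dc=0.01):
    """Fail if a flat-input recon carries too much spatial structure.

    max_off_dc is a placeholder threshold and needs calibrating per model.
    """
    frac = off_dc_fraction(recon_frame)
    if frac > max_off_dc:
        raise AssertionError(f"off-DC fraction {frac:.4f} exceeds {max_off_dc}")
    return frac

# Synthetic stand-ins for a clean recon and one with a period-4 grid.
flat = np.full((256, 256), 0.5)
gridded = flat + 0.01 * (np.arange(256) % 4 == 0)

print(off_dc_fraction(flat))
print(off_dc_fraction(gridded))
```

One thing to keep in mind when picking a threshold: the ratio depends on the DC level of the test colour. In the script's [-1, 1] range, G=128 maps to (128/255)·2 − 1 ≈ 0.004, so the green channel's DC term is small and the fraction is correspondingly large. A threshold would need to be calibrated for the chosen colour and value range, or the measure normalised in some other way.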
