VAE produces grid-like tile artifacts on flat regions (constant-green diagnostic) #202

## VAE produces grid-like tile artifacts on flat regions

We've been fine-tuning the LTX-2.3 VAE for a downstream tokenizer and noticed a regular grid pattern in the decoder output, most visible on flat regions (uniform backgrounds). It's small per-pixel but very structured, and we believe the same pattern contaminates natural images — it's just easier to isolate against a constant input.

### The test

Encode → decode a constant `(0, 128, 0)` clip and look at the recon and its FFT. A constant input has zero spatial structure, so any structure in the output is the VAE's contribution.

<details>
<summary>Script (~50 lines)</summary>

```python
import numpy as np
import torch
from src.ltx_vae import load_ltx_vae

enc, dec = load_ltx_vae("/path/to/vae.safetensors", device="cuda")
enc.eval(); dec.eval()

bg = torch.tensor([0, 128, 0], dtype=torch.float32) / 255.0
src = (bg * 2 - 1).view(1, 3, 1, 1, 1).expand(1, 3, 49, 256, 256).contiguous().cuda()
with torch.no_grad():
    recon = dec(enc(src)).clamp(-1, 1).cpu().numpy()

# FFT of the green channel of the middle frame, centre-cropped to 128x128.
g = recon[0, 1, 24, 64:192, 64:192]
mag = np.log10(np.abs(np.fft.fftshift(np.fft.fft2(g))) + 1e-9)

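# Quantitative measure: fraction of summed FFT magnitude that is NOT at DC,
# computed on the full 256x256 frame (so it includes the border artifacts).
# Note: this sums |F|, not |F|**2, so it is a magnitude ratio, not an energy ratio.
f = np.fft.fftshift(np.fft.fft2(recon[0, 1, 24]))
cy, cx = f.shape[0] // 2, f.shape[1] // 2
f_no_dc = f.copy()
f_no_dc[cy-2:cy+3, cx-2:cx+3] = 0  # zero a 5x5 block around DC
print(f"off-DC fraction: {np.abs(f_no_dc).sum() / np.abs(f).sum():.4f}")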
```

</details>
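The script does not show how the per-pixel error figures were obtained. A minimal sketch of one way to compute them, assuming both tensors are in [-1, 1] with layout `(B, C, T, H, W)` and that the error is reported in [0, 1] units; `delta_stats` is a hypothetical helper and the synthetic ripple only stands in for a real reconstruction:

```python
import numpy as np

def delta_stats(recon, src, channel=1):
    """Max and mean absolute error of one channel, in [0, 1] units.

    Assumes both arrays are shaped (B, C, T, H, W) and scaled to [-1, 1].
    """
    # Map [-1, 1] back to [0, 1] so the numbers are comparable to 8-bit / 255.
    r = (recon[:, channel] + 1.0) / 2.0
    s = (src[:, channel] + 1.0) / 2.0
    diff = np.abs(r - s)
    return float(diff.max()), float(diff.mean())

# Synthetic stand-in for a real reconstruction: constant (0, 128, 0) plus a
# period-4 ripple of amplitude 0.02 (in [-1, 1] units) along the width axis.
src = np.full((1, 3, 2, 16, 16), 128 / 255 * 2 - 1, dtype=np.float64)
src[:, 0] = -1.0
src[:, 2] = -1.0
ripple = 0.02 * np.cos(2 * np.pi * np.arange(16) / 4)
recon = src + ripple  # broadcasts along W

max_d, mean_d = delta_stats(recon, src)
print(f"max |delta| = {max_d:.4f}, mean |delta| = {mean_d:.4f}")
```

With the real tensors, `recon` and `src.cpu().numpy()` from the script would take the place of the synthetic arrays.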

### What we see

| | LTX-2.3 |
|---|---:|
| max \|Δ\| (G channel) | **0.124** (on a [0,1]-normalised input) |
| mean \|Δ\| (G channel) | 0.0026 |
| FFT off-DC fraction (G channel) | **0.98** |

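About 98 % of the recon's summed FFT magnitude is *not* at DC, i.e. it sits in spatial structure, almost all of it on a regular grid. The grid period matches the VAE's `patch_size=4` upsample stride and its harmonics. One caveat when reading the ratio: the script works in [-1, 1], where green 128/255 maps to about 0.004, so the DC term is small to begin with; the |Δ| rows in the table give the absolute size of the artifact.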
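To put a number on "almost all of it on a regular grid", one option is to measure how much of the non-DC spectral energy falls exactly on the harmonics of a period-4 pattern. The sketch below is an assumption about how that could be done, not the measurement behind the table; `grid_energy_fraction` is a hypothetical helper, and it removes the mean first so the result does not depend on the brightness level:

```python
import numpy as np

def grid_energy_fraction(img, period=4):
    """Fraction of non-DC spectral energy lying on harmonics of `period`.

    Assumes `img` is a 2D array whose height and width are multiples of
    `period`, so each harmonic falls exactly on an FFT bin.
    """
    h, w = img.shape
    if h % period or w % period:
        raise ValueError("height and width must be multiples of the period")
    # Subtracting the mean zeroes the DC bin.
    power = np.abs(np.fft.fft2(img - img.mean())) ** 2
    # Harmonics of a period-`period` grid sit at multiples of N / period.
    ky = np.arange(h) % (h // period) == 0
    kx = np.arange(w) % (w // period) == 0
    on_grid = ky[:, None] & kx[None, :]
    on_grid[0, 0] = False  # exclude DC itself
    total = power.sum()
    return float(power[on_grid].sum() / total) if total > 0 else 0.0

# Synthetic check: a pure period-4 grid versus white noise.
y, x = np.mgrid[0:64, 0:64]
grid = 0.5 + 0.01 * np.cos(2 * np.pi * x / 4) * np.cos(2 * np.pi * y / 4)
noise = np.random.default_rng(0).normal(size=(64, 64))

frac_grid = grid_energy_fraction(grid)
frac_noise = grid_energy_fraction(noise)
print(f"grid: {frac_grid:.3f}  noise: {frac_noise:.3f}")
```

On the real data this would be applied to the centre crop `g` from the script, since the crop excludes the frame-boundary artifacts.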

Source frame | recon | abs(diff)×50:

<img width="768" height="256" alt="Image" src="https://github.com/user-attachments/assets/3eb10214-0768-4f58-a8c9-cb22683b1b3e" />

FFT log-magnitude:

<img width="768" height="256" alt="Image" src="https://github.com/user-attachments/assets/fc573efc-90bc-4a37-af5e-d1655c229dcd" />


FFT log-magnitude, source | recon, 128×128 centre crop:

<img width="256" height="128" alt="Image" src="https://github.com/user-attachments/assets/8b357f95-283b-4bd1-bae9-30a85acc24bd" />

We also see strong artifacts at the frame boundaries (presumably from conv-stack padding); the centre crop above isolates the periodic structure cleanly.


### Why we think this matters for real video

Natural-image variance hides the grid visually, but the VAE still deposits spectral energy at those frequencies; the content just masks it. For downstream tasks that care about high-frequency fidelity (compression, sharp-region reconstruction, anti-aliasing under camera pans), this artifact floor limits what a fine-tune of the same architecture can achieve. We've observed this empirically: even after substantial domain-adaptation training with perceptual + segmentation-weighted losses, the per-frame FFT still shows the same harmonics in foreground regions.

### Suggested directions if you're iterating on the VAE architecture

This looks like the textbook checkerboard pattern from stride-`s` transposed convs (Odena, Dumoulin, Olah, *"Deconvolution and Checkerboard Artifacts,"* Distill 2016, https://distill.pub/2016/deconv-checkerboard/). Options, roughly from least to most disruptive:

1. **Resize-then-conv upsampling** — replace `ConvTranspose(stride=s)` with `Upsample(scale=s) + Conv(stride=1)`. One-line change per stage; eliminates uneven kernel overlap. (Odena et al. 2016, link above.)
2. **PixelShuffle with ICNR initialisation** — fast on GPU, checkerboard-free if initialised so the sub-pixel conv equals nearest-neighbour upsample at start. Shi et al., *"Real-Time Single Image and Video Super-Resolution…"*, CVPR 2016, https://arxiv.org/abs/1609.05158 ; Aitken et al., *"Checkerboard artifact free sub-pixel convolution,"* 2017, https://arxiv.org/abs/1707.02937.
3. **BlurPool downsampling in the encoder** — Zhang, *"Making Convolutional Networks Shift-Invariant Again,"* ICML 2019, https://arxiv.org/abs/1904.11486. Stops aliased frequencies from being baked into the latent on the way in.
4. **Alias-free design throughout** — the principled fix; sinc-windowed up/downsample respecting the Nyquist limit. Karras et al., *"Alias-Free Generative Adversarial Networks"* (StyleGAN3), NeurIPS 2021, https://arxiv.org/abs/2106.12423.
5. **Patch-based ViT tokenizer (no conv stride at all)**: some recent tokenizers go this route and sidestep strided-conv upsampling altogether, e.g. Yu et al., *"An Image is Worth 32 Tokens for Reconstruction and Generation"* (TiTok, an image tokenizer), NeurIPS 2024, https://arxiv.org/abs/2406.07550. For video, see also Yu et al., *"Language Model Beats Diffusion — Tokenizer is Key to Visual Generation"* (MAGVIT-v2), ICLR 2024, https://arxiv.org/abs/2310.05737, though its tokenizer is a causal 3D CNN rather than a ViT.
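The "uneven kernel overlap" argument behind option 1 can be illustrated by counting how many kernel taps reach each output position of a 1D transposed conv. This is a toy model following the Distill article, not a description of the LTX decoder, and `transposed_conv_coverage` is a hypothetical helper:

```python
import numpy as np

def transposed_conv_coverage(n_in, kernel, stride):
    """Count how many kernel taps land on each output position (1D).

    Models a transposed conv with no padding: input sample i writes to
    output positions [i * stride, i * stride + kernel).
    """
    n_out = (n_in - 1) * stride + kernel
    coverage = np.zeros(n_out, dtype=int)
    for i in range(n_in):
        coverage[i * stride : i * stride + kernel] += 1
    return coverage

# Kernel size not divisible by the stride: interior coverage alternates.
uneven = transposed_conv_coverage(n_in=6, kernel=3, stride=2)
# Kernel size divisible by the stride: interior coverage is uniform.
even = transposed_conv_coverage(n_in=6, kernel=4, stride=2)

print(uneven.tolist())  # [1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1]
print(even.tolist())    # [1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1]
```

Uniform coverage is necessary but not sufficient: as the article notes, learned weights can still produce a checkerboard, which is part of the motivation for resize-then-conv.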

Happy to share the full diagnostic + numbers from our domain. The script above is fast (a few seconds) and the **off-DC fraction on a constant input** would make a useful CI regression test for any future VAE.
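A rough sketch of what such a regression check could look like. The function names and the `0.05` threshold are placeholders rather than measured values, and the sketch assumes the decoded frame has been mapped to [0, 1] so that the DC term reflects the actual brightness:

```python
import numpy as np

def off_dc_fraction(frame, dc_halfwidth=2):
    """Fraction of summed FFT magnitude outside a small block around DC."""
    f = np.fft.fftshift(np.fft.fft2(frame))
    cy, cx = f.shape[0] // 2, f.shape[1] // 2
    mag = np.abs(f)
    total = mag.sum()
    if total == 0:
        return 0.0
    dc = mag[cy - dc_halfwidth : cy + dc_halfwidth + 1,
             cx - dc_halfwidth : cx + dc_halfwidth + 1].sum()
    return float(1.0 - dc / total)

def check_flat_reconstruction(frame, max_off_dc=0.05):
    """Hypothetical CI gate: returns (passed, measured fraction).

    The default threshold is a placeholder and would need calibrating.
    """
    frac = off_dc_fraction(frame)
    return frac <= max_off_dc, frac

# Synthetic frames standing in for decoded output of a constant clip.
flat = np.full((64, 64), 0.5)
x = np.arange(64)
gridded = flat + 0.2 * np.cos(2 * np.pi * x / 4)  # period-4 ripple along W

ok_flat, frac_flat = check_flat_reconstruction(flat)
ok_grid, frac_grid = check_flat_reconstruction(gridded)
print(ok_flat, round(frac_flat, 4), ok_grid, round(frac_grid, 4))
```

In a real pipeline the frame would come from decoding a constant clip as in the script above, and the threshold would be set from a known-good checkpoint.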

