2x Performance Improvement to Forward+ Auto Exposure by SoftLattice · Pull Request #117963 · godotengine/godot

SoftLattice · 2026-03-29T13:06:50Z

Summary

Improved the luminance_reduce.glsl shader, which is used to calculate auto exposure for cameras in the Forward+ renderer. GPU traces indicate the improved version is approximately 2x faster across three passes. Frame rate measurements under a bottlenecked stress test show a 25% improvement in overall frame rate.

Motivation

Optimizing post-processing pipelines improves Godot's ability to produce both stylized and realistic effects while still hitting target frame rates. Auto exposure creates a realistic adaptive lighting effect important in scenes with dynamic lighting by measuring the average luminance across portions of the screen.

The current implementation computes the average with a binary tree reduction, which is an optimal parallel summation algorithm. However, the reduction is performed entirely in shared memory, which leaves room for improvement. With a fixed work-group size of 64, the reduction requires log2(64) = 6 stages, which means 6 write/read trips to shared memory in the current scheme.
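As a rough illustration of the stage count, the tree reduction can be simulated on the CPU. This is a minimal sketch, not the shader code: the function name `sharedmem_tree_reduce` and the Python list standing in for shared memory are assumptions made for the example.

```python
def sharedmem_tree_reduce(values):
    """Simulate a binary tree reduction over one work group.

    `values` must have a power-of-two length. The list `shared` stands in
    for the shader's shared-memory array (an assumption for illustration).
    """
    shared = list(values)
    stride = len(shared) // 2
    stages = 0
    while stride > 0:
        # On a GPU, every invocation with index < stride performs one add
        # in parallel, and a barrier separates consecutive stages.
        for i in range(stride):
            shared[i] += shared[i + stride]
        stride //= 2
        stages += 1  # one write/read trip to shared memory per stage
    return shared[0], stages


total, stages = sharedmem_tree_reduce(range(64))
print(total, stages)  # 2016 6
```

For 64 inputs the stride takes the values 32, 16, 8, 4, 2, 1, which gives the 6 stages mentioned above.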

The GL_KHR_shader_subgroup_arithmetic extension provides subgroupAdd, which allows stages of the reduction to be computed with register shuffles. Register shuffles are approximately 5 to 10x faster than shared memory accesses, which significantly reduces the time spent waiting on memory.

Shared memory is still needed to synchronize between subgroups, but only ceil(log2(64)/log2(subgroupSize)) - 1 write/read trips are required. This corresponds to 0 trips on AMD and 1 on hardware from most other major manufacturers (any device with subgroupSize > 4), which moves the shader bottleneck to the initial texture fetch.
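The trip count can be tabulated for common subgroup sizes. The sketch below assumes the rounding in the expression is a ceiling, which is the reading that reproduces the 0 and 1 trip counts quoted above; the helper name `shared_memory_trips` is hypothetical.

```python
def shared_memory_trips(workgroup_size, subgroup_size):
    """Shared-memory write/read trips needed by a subgroup-based reduction.

    Both sizes are assumed to be powers of two, and subgroup_size >= 2
    (log2(1) = 0 would make the expression undefined).
    """
    if subgroup_size < 2:
        raise ValueError("subgroup_size must be at least 2")
    total_stages = workgroup_size.bit_length() - 1    # log2(workgroup_size)
    stages_per_pass = subgroup_size.bit_length() - 1  # log2(subgroup_size)
    passes = -(-total_stages // stages_per_pass)      # ceiling division
    return passes - 1


for size in (4, 8, 16, 32, 64):
    print(size, shared_memory_trips(64, size))
```

Under that assumption, a subgroup size of 64 needs no shared-memory trip, sizes 8 through 32 need one, and a size of 4 needs two.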

Changes

- Moved the current implementation to sharedmem_reduction() as fallback behavior for subgroupSize < 4
- Created an alternative subgroup implementation, subgroup_reduction()
- The shader selects the code path based on the workgroup-level gl_NumSubgroups to avoid group divergence

NOTE: For a subgroup size of 2, subgroup_reduction() reduces to the original algorithm but with unnecessary barriers. A subgroup size of 1 would cause an infinite loop, so a defensive max(shift, 1u) is used for the loop iteration. These cases necessitate preserving sharedmem_reduction() as an alternative route.
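The overall shape of a subgroup-based reduction can be modelled on the CPU. This is a hedged sketch only: it does not reproduce the shader's loop or its shift variable, the name `simulated_subgroup_reduction` is hypothetical, and Python's `sum` over a chunk stands in for subgroupAdd.

```python
def simulated_subgroup_reduction(values, subgroup_size):
    """Reduce `values` by repeatedly summing subgroup-sized chunks.

    Returns the total and the number of shared-memory trips, i.e. how many
    times partial sums must be handed from one pass to the next.
    """
    if subgroup_size < 2:
        # With chunks of one element the list never shrinks, which is the
        # CPU analogue of the infinite loop mentioned in the note above.
        raise ValueError("subgroup_size must be at least 2")
    partials = list(values)
    trips = 0
    while len(partials) > 1:
        # Stand-in for subgroupAdd: each subgroup sums its own lanes.
        partials = [
            sum(partials[i:i + subgroup_size])
            for i in range(0, len(partials), subgroup_size)
        ]
        if len(partials) > 1:
            trips += 1  # partial sums pass through shared memory
    return partials[0], trips


for size in (4, 32, 64):
    print(size, simulated_subgroup_reduction(range(64), size))
```

The total is the same for every subgroup size; only the number of passes through shared memory changes.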

Accuracy

The resulting algorithm produces identical results up to floating point precision. A comparison of visual outputs using the Physical Light Camera Units demo is shown below.

Current Output

Output this PR

Benchmarks

Benchmarks were performed on an NVIDIA 3080 Ti. Traces of the shader captured with Nsight Graphics indicated a roughly 2x overall improvement for 3 passes over a 4K viewport.

Shader Pass	Current time (μs)	New time (μs)
0	304.2	140.8
1	13.6	11.3
2	6.1	5.1
Total	324.0	157.2
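The per-pass speedups follow directly from the table. The short calculation below uses only the timings listed above; the dictionary layout and the `speedup` helper are illustrative.

```python
# (current time, new time) in microseconds, copied from the table above.
timings_us = {
    "0": (304.2, 140.8),
    "1": (13.6, 11.3),
    "2": (6.1, 5.1),
    "Total": (324.0, 157.2),
}


def speedup(current_us, new_us):
    """Ratio of the current time to the new time (higher is better)."""
    return current_us / new_us


for name, (current_us, new_us) in timings_us.items():
    print(f"pass {name}: {speedup(current_us, new_us):.2f}x")
```

Pass 0 dominates the total time, so the overall ratio stays close to the pass 0 ratio even though passes 1 and 2 improve by a smaller factor.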

Additionally, a small project was created to record FPS statistics for a 4K viewport with auto exposure enabled, with V-Sync disabled and an uncapped frame rate.

The Godot project and a comparison of the measurements are below.

stress_test.tar.gz

Verified Compatibilities

Hardware
- Nvidia
OS
- Linux
- Windows
Drivers
- Vulkan
- D3D12

Notes

The change is isolated to servers/rendering/renderer_rd/shaders/effects/luminance_reduce.glsl
This only affects the Forward+ renderer
No AI was used to develop this code

clayjohn · 2026-03-30T21:59:54Z

Very nice!

Let's hold off on this until we understand why #117339 is performing poorly on Metal. I suspect that this PR will have the same underlying problem.

Improved forward+ luminance reduce shader

5f1420f

SoftLattice requested a review from a team as a code owner March 29, 2026 13:06

Nintorch added enhancement topic:shaders labels Mar 29, 2026

Nintorch added this to the 4.x milestone Mar 29, 2026

Chaosus added topic:rendering topic:3d and removed topic:shaders labels Mar 29, 2026

SoftLattice changed the title ~~Improved Forward+ Luminance Reduce Shader~~ 2x Performance Improvement to Forward+ Auto Exposure Mar 30, 2026

SoftLattice wants to merge 1 commit into godotengine:master from SoftLattice:optimized_luminance

4 participants
