2x Performance Improvement to Forward+ Auto Exposure#117963

Open
SoftLattice wants to merge 1 commit intogodotengine:masterfrom
SoftLattice:optimized_luminance

@SoftLattice

Summary

Improves the luminance_reduce.glsl shader, which is used to calculate auto exposure for cameras in the Forward+ renderer. GPU traces indicate the new version is approximately 2x faster across the three reduction passes. Frame rate measurements in a bottlenecked stress test show a 25% improvement in overall frame rate.

Motivation

Optimizing post-processing pipelines improves Godot's ability to produce both stylized and realistic effects while still hitting target frame rates. Auto exposure creates a realistic adaptive lighting effect important in scenes with dynamic lighting by measuring the average luminance across portions of the screen.

The current implementation uses a binary tree reduction scheme to compute the average. This is an optimal parallel summation algorithm, but the reduction is performed entirely in shared memory, which leaves room for improvement. With a fixed work-group size of 64, the reduction requires log2(64) = 6 stages, which means 6 write/read trips to shared memory in the current scheme.
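As a rough illustration of the stage count, the following is a minimal CPU-side sketch of a binary tree sum over one work-group. It is not the shader code: the function name `sharedmem_tree_sum` and the list standing in for shared memory are assumptions made for illustration, and each loop iteration stands in for one barrier-separated write/read trip.

```python
def sharedmem_tree_sum(values):
    """Simulate a binary tree reduction over one work-group on the CPU.

    Hypothetical helper: `shared` stands in for the shared-memory array and
    each outer iteration stands in for one write/read trip (one stage).
    """
    shared = list(values)
    n = len(shared)
    assert n > 0 and n & (n - 1) == 0, "work-group size must be a power of two"

    stages = 0
    stride = n // 2
    while stride > 0:
        # On the GPU, all invocations with index < stride do this in parallel.
        for i in range(stride):
            shared[i] += shared[i + stride]
        stride //= 2
        stages += 1
    return shared[0], stages


total, stages = sharedmem_tree_sum(range(64))
print(total, stages)  # 2016 6
```

With 64 inputs the loop runs for strides 32, 16, 8, 4, 2 and 1, which matches the 6 stages described above.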

The GL_KHR_shader_subgroup_arithmetic extension provides subgroupAdd, which allows stages of the reduction to be computed with register shuffles. Register shuffles are approximately 5 to 10x faster than shared memory, allowing for significant reductions in cache waits.

Shared memory is still needed to synchronize between subgroups, but only ceil(log2(64)/log2(subgroupSize)) - 1 write/read trips are required. This corresponds to 0 trips for AMD and 1 for most other major manufacturers (any device with subgroupSize > 4), which moves the shader bottleneck to the initial texture fetch.
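The trip count can be checked with a small CPU-side simulation. This is a sketch under stated assumptions, not the shader: it assumes the division in the formula is rounded up, that subgroup sizes are powers of two of at least 2, and it uses a Python `sum` over a slice as a stand-in for subgroupAdd. The names `expected_trips` and `simulate_subgroup_sum` are hypothetical.

```python
import math

WORKGROUP_SIZE = 64  # fixed work-group size stated in the description


def expected_trips(subgroup_size):
    """ceil(log2(64) / log2(subgroupSize)) - 1, for power-of-two sizes >= 2."""
    return math.ceil(math.log2(WORKGROUP_SIZE) / math.log2(subgroup_size)) - 1


def simulate_subgroup_sum(values, subgroup_size):
    """Reduce `values`, counting how often partial sums cross shared memory."""
    partials = list(values)
    trips = 0
    while True:
        # Stand-in for subgroupAdd: each subgroup sums its lanes in registers.
        partials = [
            sum(partials[i:i + subgroup_size])
            for i in range(0, len(partials), subgroup_size)
        ]
        if len(partials) == 1:
            return partials[0], trips
        # More than one subgroup result remains, so the partial sums are
        # exchanged through shared memory once before the next round.
        trips += 1


for size in (2, 4, 8, 16, 32, 64):
    total, trips = simulate_subgroup_sum([1] * WORKGROUP_SIZE, size)
    print(size, total, trips, expected_trips(size))
```

For a subgroup size of 64 the simulation needs no shared-memory trip, for sizes 8, 16 and 32 it needs one, and for a size of 4 it needs two, which agrees with the formula.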

Changes

  • Moved current implementation to sharedmem_reduction() as fallback behavior for subgroupSize < 4
  • Created alternative subgroup implementation subgroup_reduction()
  • Shader selects plan based on workgroup level gl_NumSubgroups to avoid group divergence

NOTE: For a subgroup size of 2, subgroup_reduction() reduces to the original algorithm but with unnecessary barriers, and a subgroup size of 1 would cause an infinite loop, so a defensive max(shift, 1u) is used for the loop iteration. These cases necessitate preserving sharedmem_reduction() as an alternative route.
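The reason a subgroup size of 1 is dangerous can be sketched as follows. The actual loop in the shader is not shown in this description, so the loop shape here is an assumption: a stride that grows by `shift = log2(subgroupSize)` bits per iteration until it covers the work-group. The function `loop_strides` and its `max_iters` guard are hypothetical and exist only to make the non-terminating case observable.

```python
def loop_strides(subgroup_size, workgroup_size=64, defensive=True, max_iters=16):
    """List the strides visited by an assumed shift-based reduction loop.

    Returns None if the loop would not terminate (detected via max_iters).
    """
    shift = subgroup_size.bit_length() - 1  # log2 for power-of-two sizes
    step = max(shift, 1) if defensive else shift  # mirrors max(shift, 1u)

    strides = []
    stride = 1
    while stride < workgroup_size:
        if len(strides) >= max_iters:
            return None  # stride never grows: this would loop forever
        strides.append(stride)
        stride <<= step
    return strides


print(loop_strides(1, defensive=False))  # None: shift is 0, stride stays at 1
print(loop_strides(1))                   # six stages, same as the tree reduction
print(loop_strides(2))                   # also six stages
print(loop_strides(32))                  # two stages
```

Under this assumed loop, subgroup sizes of 1 and 2 both degrade to six stages, which is consistent with keeping the shared-memory path as the fallback for small subgroups.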

Accuracy

The resulting algorithm produces identical results up to floating point precision. A comparison of visual outputs using the Physical Light Camera Units demo is shown below.

Current Output

new_shader_output

Output with this PR

old_shader_output

Benchmarks

Benchmarks were performed on an NVIDIA 3080 Ti. Using Nsight Graphics, traces of the shader indicated a roughly 2x overall improvement for three passes on a 4K viewport.

| Shader Pass | Current time (μs) | New time (μs) |
|-------------|-------------------|---------------|
| 0           | 304.2             | 140.8         |
| 1           | 13.6              | 11.3          |
| 2           | 6.1               | 5.1           |
| Total       | 324.0             | 157.2         |
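The per-pass and overall speedups implied by these timings can be worked out directly. The sketch below only uses the numbers from the table; the per-pass current times sum to 323.9 μs, which is within rounding of the 324.0 μs total shown.

```python
# Timings in microseconds, copied from the table (passes 0, 1, 2).
current_us = [304.2, 13.6, 6.1]
new_us = [140.8, 11.3, 5.1]

# Speedup is the ratio of current time to new time.
per_pass_speedup = [c / n for c, n in zip(current_us, new_us)]
total_speedup = sum(current_us) / sum(new_us)

for index, speedup in enumerate(per_pass_speedup):
    print(f"pass {index}: {speedup:.2f}x")
print(f"total: {total_speedup:.2f}x")
```

Pass 0 dominates the total time and shows about a 2.16x speedup, while the two smaller passes gain about 1.2x each, giving roughly 2.06x overall.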

Additionally, a small project was created to record FPS statistics for a 4K viewport with auto exposure enabled, with V-Sync disabled and the frame rate uncapped.

The Godot project and a comparison of the measurements are below.

stress_test.tar.gz

fps_comparison

Verified Compatibilities

  • Hardware
    • Nvidia
  • OS
    • Linux
    • Windows
  • Drivers
    • Vulkan
    • D3D12

Notes

  • The change is isolated to servers/rendering/renderer_rd/shaders/effects/luminance_reduce.glsl
  • This only affects the Forward+ renderer
  • No AI was used to develop this code

@SoftLattice SoftLattice requested a review from a team as a code owner March 29, 2026 13:06
@Nintorch Nintorch added this to the 4.x milestone Mar 29, 2026
@SoftLattice SoftLattice changed the title Improved Forward+ Luminance Reduce Shader 2x Performance Improvement to Forward+ Auto Exposure Mar 30, 2026
@clayjohn
Member

Very nice!

Let's hold off on this until we understand why #117339 is performing poorly on Metal. I suspect that this PR will have the same underlying problem.
