2x Performance Improvement to Forward+ Auto Exposure #117963
Open
SoftLattice wants to merge 1 commit into godotengine:master
Member
Very nice! Let's hold off on this until we understand why #117339 is performing poorly on Metal. I suspect that this PR will have the same underlying problem.
Summary
Improved the `luminance_reduce.glsl` shader, which is used to calculate Auto Exposure for cameras in the Forward+ renderer. GPU traces indicate the improved version is approximately 2x faster for three passes. Bottlenecked framerate measurements show a 25% improvement in total draw frame rate.
Motivation
Optimizing post-processing pipelines improves Godot's ability to produce both stylized and realistic effects while still hitting target frame rates. Auto exposure creates a realistic adaptive lighting effect important in scenes with dynamic lighting by measuring the average luminance across portions of the screen.
The current implementation uses a binary tree reduction to compute the average. This is an optimal parallel summation algorithm, but the reduction is performed entirely in shared memory, which leaves room for improvement. With a fixed work-group size of 64, the reduction requires `log2(64)` stages, which means 6 write/read trips to shared memory in the current scheme.
The `GL_KHR_shader_subgroup_arithmetic` extension provides `subgroupAdd`, which allows stages of the reduction to be computed with register shuffles. Register shuffles are approximately 5 to 10 times faster than shared memory, allowing for significant reductions in cache waits.
Shared memory is still needed to synchronize between subgroups, but only `ceil(log2(64) / log2(subgroupSize)) - 1` write/read trips are required. This corresponds to 0 trips for AMD and 1 for most other major manufacturers (any device with `subgroupSize` > 4), which moves the shader bottleneck to the initial texture fetch.
Changes
- `sharedmem_reduction()` as fallback behavior for `subgroupSize` < 4
- `subgroup_reduction()`
- `gl_NumSubgroups` to avoid group divergence
NOTE: For a subgroup size of 2, `subgroup_reduction()` reduces to the original algorithm but with unnecessary barriers, and a subgroup size of 1 would cause an infinite loop, so a defensive `max(shift, 1u)` is used for the loop iteration. These cases necessitate preserving `sharedmem_reduction()` as an alternative route.
Accuracy
The resulting algorithm produces identical results up to floating point precision. A comparison of visual outputs using the Physical Light Camera Units demo is shown below.
Current Output
Output with this PR
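The equivalence claim and the shared-memory trip counts from the Motivation section can also be checked on the CPU. The following is only a rough Python model under stated assumptions, not the shader code from this PR: `tree_reduce`, `subgroup_reduce`, and `expected_trips` are hypothetical helper names, `sum()` over a slice stands in for `subgroupAdd`, and a "trip" is counted each time partial sums would have to pass through shared memory.

```python
import math
import random

WORKGROUP_SIZE = 64  # fixed work-group size stated in the PR description


def tree_reduce(values):
    """Binary-tree sum in 'shared memory'; returns (sum, number of stages)."""
    shared = list(values)
    stages = 0
    stride = len(shared) // 2
    while stride >= 1:
        # One stage: each active invocation adds its partner's value.
        for i in range(stride):
            shared[i] += shared[i + stride]
        stages += 1
        stride //= 2
    return shared[0], stages


def subgroup_reduce(values, subgroup_size):
    """Subgroup-first sum; returns (sum, shared-memory trips).

    Assumes subgroup_size >= 2. A size of 1 never shrinks the list and
    would loop forever, mirroring the case the PR guards against.
    """
    live = list(values)
    trips = 0
    while True:
        # Stand-in for subgroupAdd: one partial sum per subgroup.
        partials = [sum(live[i:i + subgroup_size])
                    for i in range(0, len(live), subgroup_size)]
        if len(partials) == 1:
            return partials[0], trips
        # Several subgroups remain: partials go through shared memory.
        trips += 1
        live = partials


def expected_trips(subgroup_size):
    """Trip count formula from the description, for subgroup_size >= 2."""
    stages = math.log2(WORKGROUP_SIZE) / math.log2(subgroup_size)
    return math.ceil(stages) - 1


if __name__ == "__main__":
    random.seed(0)
    luminance = [random.random() for _ in range(WORKGROUP_SIZE)]
    reference, stages = tree_reduce(luminance)
    print("tree stages:", stages)
    for size in (2, 4, 8, 16, 32, 64):
        total, trips = subgroup_reduce(luminance, size)
        print(size, trips, expected_trips(size), math.isclose(total, reference))
```

In this model the two schemes add the same values in a different order, so the totals agree only up to rounding, which is why the comparison uses `math.isclose` rather than exact equality.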
Benchmarks
Benchmarks were performed on an NVIDIA RTX 3080 Ti. Using Nsight Graphics, traces of the shader indicated a roughly 2x overall improvement for three passes over a 4K viewport.
Additionally, a small project was created to record FPS statistics for a 4K viewport with auto exposure enabled, V-Sync disabled, and an uncapped frame rate.
The Godot project and a comparison of the measurements are below.
stress_test.tar.gz
Verified Compatibilities
Notes
servers/rendering/renderer_rd/shaders/effects/luminance_reduce.glsl