I was wondering whether you have considered using PGO (profile-guided optimization) to optimize rav1d, both for the straightforward "let's generate more optimized artifacts" approach and for the less obvious "let's do PGO, see what it has optimized, and try to backport these optimizations to the original source code" approach.
rav1d contains a lot of assembly that LLVM's PGO pipeline most likely won't touch, but out of curiosity I tried applying https://github.com/Kobzol/cargo-pgo to rav1d, and it still seems to produce a nice ~2% speedup:
hyperfine --warmup 0 --runs 1 "./rav1d -q -i Chimera-AV1-8bit-1280x720-3363kbps.ivf -o /dev/null --threads=1" "./rav1d-pgo -q -i Chimera-AV1-8bit-1280x720-3363kbps.ivf -o /dev/null --threads=1"
Benchmark 1: ./rav1d -q -i Chimera-AV1-8bit-1280x720-3363kbps.ivf -o /dev/null --threads=1
Time (abs ≡): 50.679 s [User: 50.599 s, System: 0.077 s]
Benchmark 2: ./rav1d-pgo -q -i Chimera-AV1-8bit-1280x720-3363kbps.ivf -o /dev/null --threads=1
Time (abs ≡): 49.915 s [User: 49.840 s, System: 0.074 s]
Summary
./rav1d-pgo -q -i Chimera-AV1-8bit-1280x720-3363kbps.ivf -o /dev/null --threads=1 ran
1.02 times faster than ./rav1d -q -i Chimera-AV1-8bit-1280x720-3363kbps.ivf -o /dev/null --threads=1
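For reference, here is a small sketch of how the "1.02 times faster" figure relates to the two timings above. The function names are just mine for illustration. With these numbers the ratio comes out at about 1.015 (roughly 1.5% less wall time), which hyperfine rounds to 1.02. Since this was a single run with no warmup, I would treat the exact figure as approximate.

```rust
/// Ratio of baseline time to optimized time (> 1.0 means the optimized build is faster).
fn speedup_ratio(baseline_s: f64, optimized_s: f64) -> f64 {
    baseline_s / optimized_s
}

/// Percentage of wall-clock time saved relative to the baseline.
fn time_saved_percent(baseline_s: f64, optimized_s: f64) -> f64 {
    (baseline_s - optimized_s) / baseline_s * 100.0
}

fn main() {
    // Wall-clock times taken from the hyperfine output above (one run each).
    let baseline_s = 50.679; // ./rav1d
    let pgo_s = 49.915; // ./rav1d-pgo

    println!("speedup ratio: {:.3}x", speedup_ratio(baseline_s, pgo_s));
    println!("time saved:    {:.2}%", time_saved_percent(baseline_s, pgo_s));
}
```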
Anyway, I don't have more than that; I was just curious what you think of this approach.