Regenerate binaries on ISPC 1.30.0 #60

Open
MarijnS95 wants to merge 11 commits into main from ispc-1.27

@MarijnS95
Member

@MarijnS95 MarijnS95 commented May 21, 2025

@MarijnS95 MarijnS95 requested a review from Jasper-Bekkers May 21, 2025 08:36
@MarijnS95
Member Author

Turns out there are a bunch of new generic target ISAs to streamline which vector sizes/widths to select, as well as Apple-specific CPU targets :)

@MarijnS95
Member Author

MarijnS95 commented May 26, 2025

On the MacBook Air M4

Main @ 6e7b616 (ISPC 1.20...)

Downsample `square_test.png` using ispc_downsampler
                        time:   [38.827 ms 38.848 ms 38.884 ms]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

This ispc-1.27 PR @ 3556673

Downsample `square_test.png` using ispc_downsampler
                        time:   [48.220 ms 48.253 ms 48.287 ms]
                        change: [+24.103% +24.190% +24.278%] (p = 0.00 < 0.05)
                        Performance has regressed.
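
As a rough cross-check of the `change:` line above, the relative difference between two point estimates can be computed by hand. This is a minimal sketch, assuming the middle value of each `time:` triple is the point estimate; `relative_change_percent` is a hypothetical helper, not part of the benchmark harness, and the harness derives its own figure from resampled measurements, so the result only agrees approximately (roughly 24% here).

```rust
/// Percentage change from `baseline_ms` to `candidate_ms`.
/// A positive result means the candidate is slower.
/// (Hypothetical helper, for illustration only.)
fn relative_change_percent(baseline_ms: f64, candidate_ms: f64) -> f64 {
    (candidate_ms - baseline_ms) / baseline_ms * 100.0
}

fn main() {
    // Point estimates quoted in this comment: main vs. this PR.
    let main_ms = 38.848;
    let pr_ms = 48.253;
    println!(
        "change vs. main: {:+.2}%",
        relative_change_percent(main_ms, pr_ms)
    );
}
```

The same helper applies to any pair of runs in this thread, as long as both numbers come from the same machine and power state.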

Recompiling locally on the M4 Air (ispc 1.27.0 from brew using cargo b -rF ispc):

Downsample `square_test.png` using ispc_downsampler
                        time:   [46.576 ms 46.586 ms 46.596 ms]
                        change: [-3.5237% -3.4550% -3.3855%] (p = 0.00 < 0.05)
                        Performance has improved.

That's a significant performance deficit, which we should investigate before merging. Even playing around with the new CPU flags from Twinklebear/ispc-rs#42, or the generic ISAs, or removing .target_isas() altogether to compile natively for the host yields no improvement.

Funny thing is, with the ISPC test this M4 Air whines a little, but it doesn't during resize 😓

@MarijnS95 MarijnS95 marked this pull request as draft May 26, 2025 10:31
@MarijnS95 MarijnS95 changed the title Regenerate binaries on ISPC 1.27 Regenerate binaries on ISPC 1.28 Aug 13, 2025
@MarijnS95
Member Author

MarijnS95 commented Aug 13, 2025

Looks like performance is not restored in 1.28, or we're still doing something wrong. Barely any change compared against 1.27 (which was 24% slower than main per the above):

This PR @ 00d0256

cargo bench
Downsample `square_test.png` using ispc_downsampler
                        time:   [48.412 ms 48.475 ms 48.539 ms]
                        change: [+29.098% +29.471% +29.821%] (p = 0.00 < 0.05)
                        Performance has regressed.

@MarijnS95 MarijnS95 changed the title Regenerate binaries on ISPC 1.28 Regenerate binaries on ISPC 1.29.1 Dec 24, 2025
@MarijnS95
Member Author

MarijnS95 commented Dec 24, 2025

Re-running this test on my host, recompiled on this ISPC version:

ispc --version
Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.28.2 (build commit  @ 20250924, LLVM 20.1.8)

On latest main @ f2ddfab (but not using those prebuilts)

cargo bench
Downsample `square_test.png` using ispc_downsampler
                        time:   [46.776 ms 46.875 ms 46.969 ms]

Then following the suggestion from @Jasper-Bekkers in Traverse-Research/intel-tex-rs-2#42 to only use i32x4, because NEON registers are 128 bits wide, slightly regresses performance:

cargo bench
Downsample `square_test.png` using ispc_downsampler
                        time:   [48.003 ms 48.101 ms 48.196 ms]
                        change: [+2.3395% +2.6161% +2.9034%] (p = 0.00 < 0.05)
                        Performance has regressed.
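
The reasoning behind the i32x4 suggestion is plain lane arithmetic: a 128-bit register holds four 32-bit elements, so a gang of four fits in one register while a gang of eight needs two. A minimal sketch of that arithmetic follows; both function names are hypothetical, and the code says nothing about how ISPC actually schedules wider gangs on NEON.

```rust
/// Number of elements of `element_bits` that fit in one register.
fn lanes_per_register(register_bits: u32, element_bits: u32) -> u32 {
    register_bits / element_bits
}

/// Number of registers needed to hold one gang of `gang_width` elements
/// (rounded up). Hypothetical helper, for illustration only.
fn registers_per_gang(gang_width: u32, element_bits: u32, register_bits: u32) -> u32 {
    let gang_bits = gang_width * element_bits;
    (gang_bits + register_bits - 1) / register_bits
}

fn main() {
    let neon_bits = 128; // NEON register width, as stated in the thread
    println!("i32 lanes per register: {}", lanes_per_register(neon_bits, 32));
    println!("registers for a gang of 4: {}", registers_per_gang(4, 32, neon_bits));
    println!("registers for a gang of 8: {}", registers_per_gang(8, 32, neon_bits));
}
```

The measurements in this thread show that matching the gang width to the register width did not help on this workload, so the arithmetic is only a starting point.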

Also, this M4 chip is supposed to support SME (Scalable Matrix Extension) but not SVE (Scalable Vector Extension), which I confirmed with sysctl -a hw.optional (NEON is confirmed as well).

Perhaps this needs to be reported upstream as I'm slightly out of ideas how to best bisect this compiler performance regression.

@MarijnS95
Member Author

Just went back in history to generate the blobs for all missing versions:

ISPC 1.23 @ 754d4bf

Downsample `square_test.png` using ispc_downsampler
                        time:   [37.430 ms 37.550 ms 37.666 ms]

ISPC 1.24 @

Downsample `square_test.png` using ispc_downsampler
                        time:   [37.180 ms 37.317 ms 37.454 ms]
                        change: [-1.1514% -0.6188% -0.1473%] (p = 0.01 < 0.05)
                        Change within noise threshold.

ISPC 1.25.3

Downsample `square_test.png` using ispc_downsampler
                        time:   [38.024 ms 38.151 ms 38.315 ms]
                        change: [+1.6876% +2.2352% +2.8037%] (p = 0.00 < 0.05)
                        Performance has regressed.

ISPC 1.26

Downsample `square_test.png` using ispc_downsampler
                        time:   [49.251 ms 49.422 ms 49.588 ms]
                        change: [+29.523% +30.093% +30.690%] (p = 0.00 < 0.05)
                        Performance has regressed.

1.26 is where this regression happened.
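
The manual walk through releases above amounts to scanning consecutive versions for the first jump beyond a noise threshold. A minimal sketch of that scan, using the point estimates quoted in this comment; `first_regression` and the 10% threshold are assumptions chosen for illustration, not something the benchmark harness provides.

```rust
/// Returns the label of the first sample that is slower than its
/// predecessor by more than `threshold` (a fraction, e.g. 0.10 for 10%).
/// Hypothetical helper, for illustration only.
fn first_regression<'a>(samples: &[(&'a str, f64)], threshold: f64) -> Option<&'a str> {
    samples.windows(2).find_map(|pair| {
        let (_, previous_ms) = pair[0];
        let (label, current_ms) = pair[1];
        ((current_ms - previous_ms) / previous_ms > threshold).then_some(label)
    })
}

fn main() {
    // (ISPC version, middle `time:` value in ms) from the runs above.
    let samples = [
        ("1.23", 37.550),
        ("1.24", 37.317),
        ("1.25.3", 38.151),
        ("1.26", 49.422),
    ];
    match first_regression(&samples, 0.10) {
        Some(version) => println!("first regression beyond 10%: ISPC {version}"),
        None => println!("no regression beyond 10%"),
    }
}
```

A threshold well above the 1-3% run-to-run variation seen here keeps the 1.25.3 result from being flagged.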

Turns out that the 1.26 release is exactly where a bunch of Apple improvements were announced. Unfortunately, playing with the new --darwin-version-min flag, or the new CPU targets mentioned above (which are only available up to A17, the "predecessor" to M4 in the iPhone space), doesn't make a difference. I couldn't immediately find out whether those iPhone SKUs have support for vector extensions at all..?

@Jasper-Bekkers
Member

Jasper-Bekkers commented Dec 24, 2025

Yeah, I closed that PR because later I realized why there was a big delta: I was profiling on battery.

@MarijnS95
Member Author

@Jasper-Bekkers Oh I'm also exclusively developing on battery (the perks of Apple putting RTGs in these MacBooks 🤤) but the ±37ms vs ±45ms regression remains consistent.

@pbrubaker

Hey all, apologies for the delay. I'm going to tag @aneshlya but I would create an issue on the ispc GitHub and link this issue. That's the best way to report these kinds of things right now.

@aneshlya

aneshlya commented Jan 7, 2026

Hi there, I don't have Apple HW to test performance on, but it would be helpful if you could share the ispc code corresponding to the benchmark. I have a theory about which ispc commit is responsible. I can create a custom build of ispc for you to test in your environment and check whether the regression disappears (if you're OK with that).

As Pete mentioned, don't hesitate to submit such bugs to ispc/ispc repo.

@MarijnS95
Member Author

Thanks for the reply @pbrubaker. Best to track this at the official ISPC repo as I was merely using this pull request to first bisect and track where the performance regression was happening and/or if our invocation arguments are at fault. Issue is opened at ispc/ispc#3688.

@aneshlya I'd be more than happy to try a custom-built ispc on these kernels, thanks!

@MarijnS95 MarijnS95 changed the title Regenerate binaries on ISPC 1.29.1 Regenerate binaries on ISPC 1.30.0 Feb 5, 2026
Comment on lines +12 to +15
#[allow(clippy::unnecessary_operation, clippy::identity_op)]
const _: () = {
["Size of WeightDimensions"][::std::mem::size_of::<WeightDimensions>() - 12usize];
["Alignment of WeightDimensions"][::std::mem::align_of::<WeightDimensions>() - 4usize];
Member Author

bindgen targets the latest Rust version by default; we should perhaps see if that can be configured through ispc_compile if this is breaking, and set our rust-version in Cargo.toml accordingly too.
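
For readers unfamiliar with the generated lines under review: they are compile-time layout checks. Indexing a one-element array with `size_of::<T>() - 12` only succeeds when the result is exactly 0; any other size makes the index out of bounds or underflows the subtraction, and const evaluation fails the build. A minimal sketch of the same pattern follows, using a hypothetical `ExampleDimensions` struct, since the real fields of `WeightDimensions` are not shown in this thread.

```rust
use std::mem::{align_of, size_of};

// Hypothetical stand-in: three 4-byte fields give size 12 and alignment 4.
// The actual WeightDimensions field list is not shown in this thread.
#[repr(C)]
#[derive(Debug, Default, Clone, Copy)]
pub struct ExampleDimensions {
    pub a: u32,
    pub b: u32,
    pub c: f32,
}

// Evaluated at compile time. If the size or alignment differs from the
// expected value, the index is not 0 (or the subtraction underflows) and
// the build fails instead of silently using a mismatched layout.
#[allow(clippy::unnecessary_operation, clippy::identity_op)]
const _: () = {
    ["Size of ExampleDimensions"][size_of::<ExampleDimensions>() - 12usize];
    ["Alignment of ExampleDimensions"][align_of::<ExampleDimensions>() - 4usize];
};

fn main() {
    println!(
        "size = {}, align = {}",
        size_of::<ExampleDimensions>(),
        align_of::<ExampleDimensions>()
    );
}
```

Changing `12usize` to any other value in this sketch turns the check into a compile error, which is the behaviour the generated bindings rely on.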

@MarijnS95
Member Author

Downsample `square_test.png` using ispc_downsampler
                        time:   [39.530 ms 39.542 ms 39.554 ms]
                        change: [+2.1826% +2.2370% +2.2959%] (p = 0.00 < 0.05)
                        Performance has regressed.

Performance is indeed mostly restored in 1.30, thanks @aneshlya!

@pbrubaker

> Downsample `square_test.png` using ispc_downsampler
>                         time:   [39.530 ms 39.542 ms 39.554 ms]
>                         change: [+2.1826% +2.2370% +2.2959%] (p = 0.00 < 0.05)
>                         Performance has regressed.
>
> Performance is indeed mostly restored in 1.30, thanks @aneshlya!

Glad to hear it!

@aneshlya

aneshlya commented Feb 5, 2026

> Performance is indeed mostly restored in 1.30.

Thanks for checking! I've also added ispc-downsampler to our CI but we can only check stability on open source runners.

@MarijnS95
Member Author

MarijnS95 commented Feb 10, 2026

That little 2% regression mostly seems to be noise: it fluctuates a bit on retesting, and doesn't seem to be affected by locally compiling the ISPC kernel (as opposed to using the one from CI, i.e. with slightly different host detection).

The suggested i32x4 change from Traverse-Research/intel-tex-rs-2#42 seems to consistently make things about 0.5-1 ms slower, though?

Downsample `square_test.png` using ispc_downsampler
                        time:   [38.503 ms 38.570 ms 38.636 ms]
                        change: [+2.0616% +3.1908% +4.0920%] (p = 0.00 < 0.05)
                        Performance has regressed.

@MarijnS95 MarijnS95 marked this pull request as ready for review February 10, 2026 12:27