I'm planning to benchmark Go runtime multi-threaded performance with go_benchmarks.go.
I modified the benchmark to run a single case multiple times (see Loop_single2.patch).
I observed that the "RSA2048 3-prime Sign" throughput gradually drifts upward from 12xxx ops/s to 15xxx ops/s over repeated runs (see Loop_single_result.txt).
Some other cases (ECDSA-P256, for example) show a similar trend.
I verified this with Go 1.17 and 1.20, on Intel(R) Xeon(R) Platinum 8480+ and Intel(R) Xeon(R) Gold 6240Y.
Loop_single2.patch
Loop_single_result.txt