I have a model expressed as a sum of many (roughly 10 to 40) SHO kernels, and I have been playing around with tinygp and celerite2 (JAX implementation). In my tests, celerite2 is faster than tinygp (see figure below) when using a sum of multiple semi-separable kernels.
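For context, here is a minimal sketch of the kind of comparison I am running. The data, the number of terms, and the SHO parameter values are placeholders rather than the exact setup behind the figure, and the mapping between tinygp's `SHO` parameters and celerite2's `SHOTerm` parameters is only approximate:

```python
import time

import jax
import numpy as np

import tinygp
from tinygp.kernels import quasisep
import celerite2.jax
from celerite2.jax import terms

jax.config.update("jax_enable_x64", True)

# Placeholder data and SHO parameters (not the exact values behind the figure)
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0.0, 100.0, 1000))
yerr = np.full_like(t, 0.1)
y = np.sin(t) + yerr * rng.normal(size=len(t))

n_terms = 20  # somewhere in my 10-40 range
omegas = np.linspace(0.5, 5.0, n_terms)
qualities = np.full(n_terms, 3.0)
sigmas = np.full(n_terms, 0.1)


def tinygp_loglike(t, y, yerr):
    # Sum of quasi-separable SHO kernels in tinygp
    kernel = quasisep.SHO(omega=omegas[0], quality=qualities[0], sigma=sigmas[0])
    for w, q, s in zip(omegas[1:], qualities[1:], sigmas[1:]):
        kernel += quasisep.SHO(omega=w, quality=q, sigma=s)
    gp = tinygp.GaussianProcess(kernel, t, diag=yerr**2)
    return gp.log_probability(y)


def celerite2_loglike(t, y, yerr):
    # Sum of SHOTerms in the celerite2 JAX backend
    # (the S0 <-> sigma parameter mapping here is only approximate)
    term = terms.SHOTerm(S0=sigmas[0] ** 2, w0=omegas[0], Q=qualities[0])
    for w, q, s in zip(omegas[1:], qualities[1:], sigmas[1:]):
        term += terms.SHOTerm(S0=s**2, w0=w, Q=q)
    gp = celerite2.jax.GaussianProcess(term, mean=0.0)
    gp.compute(t, yerr=yerr)
    return gp.log_likelihood(y)


def bench(fn, *args, repeats=20):
    # Warm up (triggers JIT compilation), then time repeated evaluations
    fn(*args).block_until_ready()
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args).block_until_ready()
    return (time.perf_counter() - start) / repeats


f_tinygp = jax.jit(tinygp_loglike)
f_celerite2 = jax.jit(celerite2_loglike)
print("tinygp:   ", bench(f_tinygp, t, y, yerr))
print("celerite2:", bench(f_celerite2, t, y, yerr))
```

Both log-likelihood functions are JIT-compiled and only timed after a warm-up call, so the difference I am seeing should come from the evaluation itself rather than from compilation.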
Could you give me some insight into why there is such a difference in runtime between the two libraries?
Also, would it be possible to reach celerite2's speed by modifying the tinygp implementation? I am currently reading the tinygp code to try to understand what could explain such a difference.

Thanks,