Dear Ben,
Nice examples, thanks for writing this!
The 030_timing.py code makes an unfair comparison: it times a slow Python loop against compiled OpenCL C code. In this case, the compiled kernel will always be faster. A fairer comparison would be to time C code that performs the sum against the OpenCL version executed on the GPU. Does this make sense?
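
To illustrate what I mean, here is a rough sketch of a compiled baseline that could be timed alongside the Python loop. It is only an assumption on my part that "the sum" is an element-wise sum of two arrays, and I use NumPy as a stand-in for hand-written C, since its addition runs in compiled code. The function names (`python_loop_add`, `numpy_add`, `best_time`) are hypothetical and not taken from 030_timing.py.

```python
import time

import numpy as np


def python_loop_add(a, b):
    """Element-wise sum using an interpreted Python loop (the slow baseline)."""
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out


def numpy_add(a, b):
    """Element-wise sum in compiled code (stand-in for a C implementation)."""
    return a + b


def best_time(fn, *args, repeats=5):
    """Run fn(*args) several times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.random(100_000, dtype=np.float32)
    b = rng.random(100_000, dtype=np.float32)

    t_loop, r_loop = best_time(python_loop_add, a, b)
    t_np, r_np = best_time(numpy_add, a, b)

    # Both versions should agree before their timings are compared.
    assert np.array_equal(r_loop, r_np)
    print(f"Python loop: {t_loop:.6f} s")
    print(f"NumPy (compiled): {t_np:.6f} s")
```

The compiled timing, not the Python loop, would then be the number to set against the GPU kernel time, ideally with host-to-device transfer time reported separately.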