Safely launching multiple thread blocks for a kernel

When processing a large amount of data in CUDA, the host has to launch multiple thread blocks. If the data size is not a multiple of the block size, the last block is only partially filled, so the number of active threads in each block must be computed correctly to get a correct result. It would be nice to have an abstraction that makes launching multiple thread blocks safe.
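As a starting point for discussion, here is a minimal sketch of what such an abstraction could look like for a 1D launch. The names `LaunchPlan`, `make_plan`, `grid_size`, `active_threads`, `is_active` and the `LAUNCH_HD` macro are hypothetical and not part of CUDA. The code is plain C++ that should also compile under nvcc, where the helpers become callable from device code.

```cpp
#include <cassert>
#include <cstddef>

// Under nvcc the helpers are usable on both host and device;
// with a plain C++ compiler the qualifier expands to nothing.
#ifdef __CUDACC__
#define LAUNCH_HD __host__ __device__
#else
#define LAUNCH_HD
#endif

// Hypothetical helper describing a 1D launch over `total` elements.
struct LaunchPlan {
    std::size_t total;    // number of elements to process
    unsigned block_size;  // threads per block

    // Number of blocks needed to cover every element (ceiling division).
    LAUNCH_HD unsigned grid_size() const {
        return static_cast<unsigned>((total + block_size - 1) / block_size);
    }

    // Number of threads in block `block` that map to a valid element.
    // Only the last block can be partially filled.
    LAUNCH_HD unsigned active_threads(unsigned block) const {
        std::size_t first = static_cast<std::size_t>(block) * block_size;
        if (first >= total) return 0;
        std::size_t remaining = total - first;
        return remaining < block_size ? static_cast<unsigned>(remaining)
                                      : block_size;
    }

    // True if thread `thread` of block `block` has an element to work on.
    LAUNCH_HD bool is_active(unsigned block, unsigned thread) const {
        return thread < active_threads(block);
    }
};

// Host-side constructor that rejects a zero block size up front.
inline LaunchPlan make_plan(std::size_t total, unsigned block_size) {
    assert(block_size > 0);
    return LaunchPlan{total, block_size};
}
```

With 1000 elements and 256 threads per block, this plan gives 4 blocks, and the last block has 232 active threads. A kernel could take the plan by value and return early when `plan.is_active(blockIdx.x, threadIdx.x)` is false. The host would launch with `plan.grid_size()` blocks and skip the launch when that is zero, since CUDA rejects an empty grid. The sketch ignores overflow for element counts near the limit of `std::size_t` and does not check the hardware grid-size limit.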