improve our `sum()`, `var()` and other aggregate functions

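We can probably abandon the aim of giving the same output regardless of the block structure. Instead, we would use NumPy's native methods to do the calculation within each block and then aggregate the per-block results across blocks. This trades some consistency for speed: because floating-point addition is not associative, the result may differ at rounding-error level depending on how the data is split into blocks.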
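As a rough illustration of the idea, here is a minimal sketch, assuming the blocks are available as a plain iterable of 1-D NumPy arrays. The names `blockwise_sum` and `blockwise_var` are hypothetical and not part of the existing code. The variance version combines per-block counts, means, and sums of squared deviations using the standard pairwise update for merging two partial results, which is one possible way to aggregate and not necessarily the one we would settle on.

```python
import numpy as np


def blockwise_sum(blocks):
    """Sum each block with NumPy, then add up the per-block partial sums."""
    return float(sum(np.sum(b) for b in blocks))


def blockwise_var(blocks, ddof=0):
    """Variance over all blocks, computed from per-block NumPy statistics.

    Hypothetical helper: each block contributes its size, mean, and sum of
    squared deviations, which are merged into running totals.
    """
    n_total = 0
    mean_total = 0.0
    m2_total = 0.0  # running sum of squared deviations from the running mean
    for b in blocks:
        b = np.asarray(b, dtype=float)
        n = b.size
        if n == 0:
            continue  # empty blocks contribute nothing
        mean = b.mean()
        m2 = b.var() * n  # sum of squared deviations from this block's mean
        delta = mean - mean_total
        n_new = n_total + n
        # Merge the block's statistics into the running totals.
        mean_total += delta * n / n_new
        m2_total += m2 + delta**2 * n_total * n / n_new
        n_total = n_new
    if n_total - ddof <= 0:
        return float("nan")  # not enough data for the requested ddof
    return float(m2_total / (n_total - ddof))
```

With this approach, two different block layouts of the same data should agree with `np.sum` and `np.var` to within floating-point tolerance, but not necessarily bit for bit, so any tests would need to compare with something like `np.isclose` instead of exact equality.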