# L_0 sketching implementation & testing status
| Vector size | Sketch sampling | Sketch addition | Sketch scalar multiplication | Sketch uniform |
|---|---|---|---|---|
| 100 | ✔️ | ✔️ | ❌ | ❌ |
| 1000 | ✔️ | ✔️ | ❌ | ❌ |
| 10000 | ✔️ | ✔️ | ❌ | ❌ |
(✔️ = implemented and successful, ❌ = not yet implemented)
- We first create an empty sketch.
- We then query this sketch. Since the sketch is empty, this should throw `AllBucketsZeroException`.
- We then query this sketch again. Since the sketch was already queried, this should throw `MultipleQueryException`.
- Next, we create another sketch and find a vector with no good buckets by repeatedly eliminating the indices of good buckets.
- Here's an example:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   | ^ |   |   |   |   |   |   | ^ |   |
| 0 | 2 | 3 | 4 | 5 | 6 | 7 | 9 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
|   |   |   |   |   |   | ^ |   |
- In this case, our vector has zeros at indices {1, 7, 8}, and nonzero values at indices {0, 2, 3, 4, 5, 6, 9}.
- Since we have created a vector with no good buckets, as long as we don't get a false positive from one of the buckets, this should throw `NoGoodBucketException`.
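The empty-sketch / double-query flow above can be illustrated with a toy stand-in. The class below is a minimal mock written for this note, not the repo's actual `Sketch`; it models only the exception contract the test exercises.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical minimal stand-in for the real Sketch class: an empty
// (all-zero) sketch throws AllBucketsZeroException on query, and any
// second query throws MultipleQueryException.
struct AllBucketsZeroException : std::runtime_error {
  AllBucketsZeroException() : std::runtime_error("all buckets zero") {}
};
struct MultipleQueryException : std::runtime_error {
  MultipleQueryException() : std::runtime_error("sketch already queried") {}
};

class ToySketch {
  uint64_t a = 0;        // xor of all updated indices (a single toy bucket)
  bool queried = false;  // queries are one-shot
 public:
  void update(uint64_t index) { a ^= index; }
  uint64_t query() {
    if (queried) throw MultipleQueryException();
    queried = true;
    if (a == 0) throw AllBucketsZeroException();
    return a;
  }
};
```

Querying a fresh `ToySketch` once and then again reproduces the two exceptions in exactly the order described above.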
If we give a sketch a vector with only one nonzero index, then the sketch should always return the correct value.
We want to make sure the failure modes of sketch sampling stay below a certain probability (TODO: figure out what the theoretical probability is). A sample can fail in two ways: there are no good buckets, or the returned sample is incorrect.
- For each vector size n, we randomly generate n updates using `Testing_Vector`, and use them to update a sketch.
- We then query the sketch.
- If the query does not throw an exception, and the returned index/value does not match the vector, increment the counter for sample incorrect failures.
- If the query throws `AllBucketsZeroException`, and the vector is nonzero, increment the counter for sample incorrect failures.
- If the query throws `NoGoodBucketException`, increment the counter for no good bucket failures.
- The query should never throw `MultipleQueryException`.
At the end, make sure the failure rate is less than or equal to the expected probability.
Very similar to sketch sampling.
- For each vector size n, we randomly generate 2 separate `Testing_Vector`s, each with n updates, and use them to update two separate sketches.
- We then add the sketches, and query the result.
- If the query does not throw an exception, and the returned index/value does not match the sum of the two vectors, increment the counter for sample incorrect failures.
- If the query throws `AllBucketsZeroException`, and the sum of the two vectors is nonzero, increment the counter for sample incorrect failures.
- If the query throws `NoGoodBucketException`, increment the counter for no good bucket failures.
- The query should never throw `MultipleQueryException`.
At the end, make sure the failure rate is less than or equal to the expected probability.
If we look at TestSketchSample, we're running 3 sizes of tests.
| # of runs | size of vector | number of updates | time using powermod (on Kenny's laptop) | time using hash+xor, native | time using hash+xor, larger ints |
|---|---|---|---|---|---|
| 10000 | 100 | 100 | 2.686 seconds | 1.88919 seconds | 2.16831 seconds |
| 1000 | 1000 | 1000 | 5.124 seconds | 3.50447 seconds | 3.83548 seconds |
| 1000 | 10000 | 10000 | 101.796 seconds | 66.7599 seconds | 68.6052 seconds |
For powermod, the majority of time is spent on the 1000 runs, size 10000 vector, 10000 updates test.
Looking at a gmon line-by-line profile, most of the time is taken up by PrimeGenerator::power (63.32 seconds = 60.81% of the time) and Bucket::contains (21.7 seconds = 20.85%).
We can't do much about Bucket::contains since in order to check if a bucket contains an index, we have to call XXH64.
- We have 1000 runs.
- For each run, we have 10000 updates, giving us 10^7 total updates.
- For each update, we check (floor(log_2(10000))+1)^2 = 196 buckets, giving us 1.96*10^9 total calls to `Bucket::contains`.
- For each call to `Bucket::contains`, we call `XXH64`, giving us 1.96*10^9 total calls to `XXH64`.
All these theoretical call counts match up with what I have found empirically using gcov.
I haven't run an XXH64 benchmark on my own laptop yet, but the benchmarks provided on xxHash's GitHub wiki suggest a throughput of somewhere around 10^8 hashes per second.
(1.96*10^9 calls to XXH64)/(10^8 hashes per second)=19.6 seconds, which is approximately what we have empirically.
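The counting above is simple enough to check in code. The constants mirror the 1000-run, size-10000-vector, 10000-update configuration; `num_buckets` is my reading of the (floor(log_2 n)+1)^2 formula.

```cpp
#include <cstdint>
#include <cmath>

// (floor(log_2(vec_size)) + 1)^2 buckets per sketch.
inline uint64_t num_buckets(uint64_t vec_size) {
  uint64_t levels = (uint64_t)std::floor(std::log2((double)vec_size)) + 1;
  return levels * levels;
}

// One Bucket::contains call, and thus one XXH64 call, per bucket per update.
inline uint64_t total_xxh64_calls(uint64_t runs, uint64_t updates,
                                  uint64_t vec_size) {
  return runs * updates * num_buckets(vec_size);
}
// At ~10^8 hashes/second, 1.96*10^9 calls take about 19.6 seconds.
```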
For PrimeGenerator::power, I'm not sure exactly how much better we can do either.
- We have 1000 runs.
- For each run, we have 10000 updates, giving us 10^7 total updates.
- For each update, we have (floor(log_2(10000))+1)^2 = 196 buckets.
- On average, we need to update a bucket (1 + 1/2 + 1/4 + ... + 1/2^13)/14 = 14.28% of the time, giving us 2.80*10^8 total calls to `PrimeGenerator::power`.
- For the sake of analysis, call the arguments to `PrimeGenerator::power` `x`, `y`, and `p`, and the result `res`.
- `y` is an integer picked uniformly from [1, 10000].
- Since we are exponentiating by squaring, `PrimeGenerator::power` iterates over the bits of `y`, looping ceiling(log_2(y)) times. On average, this is 12.3631 times.
- In each pass of the loop, `x = (x*x) % p` is run.
- An additional `res = (res*x) % p` is carried out if the current bit of `y` is 1, which happens about half the time.
- To summarize, each call to `PrimeGenerator::power` does on average 18.54 multiplication + mod operations. This gives us 5.19*10^9 multiplication + mod operations in total.
All these theoretical operation counts match up with what I have found empirically using gcov.
For my 2.3 GHz laptop, 1 clock cycle should be 434.78 picoseconds. Empirically, (63.32 seconds)/(5.19*10^9 operations) = 12.20 nanoseconds per multiplication + mod = 28.06 clock cycles. Considering that the most costly instruction is calculating the remainder, which alone has a latency of upwards of 20 clock cycles, we don't have much room for improvement if we stick with plain exponentiation by squaring. We'd have to switch to a more efficient powermod algorithm; apparently there's something called Montgomery multiplication/reduction, and something called Barrett reduction? I have no clue how they work, but they might be worth looking into if we really need the speedup.
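For reference, here's my guess at the shape of `PrimeGenerator::power` as square-and-multiply, with an optional counter so the multiplication + mod operation count per call can be sanity-checked. The repo's actual signature may differ, and this version assumes `p < 2^32` so the 64-bit products below cannot overflow.

```cpp
#include <cstdint>

// Square-and-multiply modular exponentiation (a sketch, not the repo's
// exact code). `ops` counts multiply+mod operations. Assumes p < 2^32.
uint64_t powmod(uint64_t x, uint64_t y, uint64_t p, uint64_t* ops = nullptr) {
  uint64_t res = 1;
  x %= p;
  while (y > 0) {
    if (y & 1) {                 // current bit of y is 1 (~half the time)
      res = (res * x) % p;
      if (ops) ++*ops;
    }
    x = (x * x) % p;             // squaring happens every iteration
    if (ops) ++*ops;
    y >>= 1;
  }
  return res;
}
```

For `y = 10` (binary 1010) the loop runs 4 times and multiplies twice, i.e. 6 multiplication + mod operations, matching the per-bit accounting above.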
The timings below use 10^6 updates for each run.
Maybe optimizations weren't enabled for the Boost runtimes; that could be causing the ~20x slowdown.
| Vector size | Boost multiprecision (s) | Native (s) | hash+xor (s) | inline, hash+xor (s) | column optimization (s) |
|---|---|---|---|---|---|
| 10^3 | 110.678 | 5.43079 | 2.86827 | 1.64006 | 0.33727 |
| 10^4 | 192.773 | 9.73575 | 5.2734 | 3.07678 | 0.466436 |
| 10^5 | 281.799 | 14.1229 | 7.41999 | 4.15527 | 0.521033 |
| 10^6 | 387.783 | 19.1509 | 10.0366 | 5.99906 | 0.603263 |
| 10^7 | 533.402 | 26.7054 | 14.4372 | 8.40956 | 0.720604 |
| 10^8 | 668.878 | 33.618 | 18.7456 | 10.6028 | 0.811287 |
| 10^9 | 818.414 | 41.2379 | 21.9179 | 12.6764 | 0.918864 |
| 10^10 | 1654.09 | 70.9191 | 29.1174 | 16.9599 | 1.0602 |
| 10^11 | 1911.86 | 82.8579 | 34.03 | 19.893 | 1.07846 |
| 10^12 | 2099.64 | 93.1091 | 39.7805 | 23.0413 | 1.15421 |
The hash+xor algorithm is ~4x as fast as the native implementation of powermod, and keeps its speedup for larger int sizes.
The previous "bug", where all buckets seemed to fail a bit more often than powermod for smaller vector sizes, was due to the unit tests not being updated to match the restriction to only +1/-1 updates. After updating the unit tests, this is no longer a problem.
For the timing of the hash+xor algorithm, the largest bottleneck is now `contains()` and `Sketch::update`, since for each bucket 2 hashes need to be calculated: 1 for getting the bucket seed, and 1 for checking whether the bucket contains the index.
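As a rough illustration of the hash-based membership check: `splitmix64` below stands in for XXH64, and the "low bits all zero" rule is my assumption about how `contains()` decides membership at a given bucket depth, not the repo's exact code.

```cpp
#include <cstdint>

// splitmix64, a stand-in for XXH64 in this sketch.
uint64_t splitmix64(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// A bucket at depth r holds a given index with probability 2^-r,
// decided by whether the low r bits of the hash are all zero.
bool contains(uint64_t index, uint64_t bucket_seed, int depth) {
  uint64_t h = splitmix64(index ^ bucket_seed);
  return (h & ((1ULL << depth) - 1)) == 0;
}
```

Depth 0 accepts every index, depth 1 about half, depth 2 about a quarter, and so on, which is the geometric bucket structure the sketch relies on.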
TBH I don't fully understand gprof.
There's the -pg option to enable profiling, but apparently there's also a -mfentry option that does profiling differently or something?
They give similar results, so I'm not sure what's up with that.
This is the raw output of gprof:

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds      calls  s/call  s/call  name
63.15     54.42    54.42                             _mcount_private
13.87     66.37    11.95 2266000000    0.00    0.00  Bucket_Boruvka::contains(unsigned long const&, unsigned long const&, unsigned long const&) const
 9.88     74.88     8.51                             __fentry__
 9.53     83.09     8.21   12000000    0.00    0.00  Sketch::update(unsigned long const&)
 2.16     84.95     1.86                             XXH64
 0.99     85.80     0.85  313965250    0.00    0.00  Bucket_Boruvka::update(unsigned long const&, unsigned long const&)
 0.36     86.11     0.31          3    0.10    7.12  test_sketch_sample(unsigned long, unsigned long, unsigned long, double, double)
 0.05     86.15     0.04   24048000    0.00    0.00  double_to_ull(double, double)
 0.02     86.17     0.02                             log2
```
The total runtime with profiling was 86.61 seconds.
_mcount_private took 54.42 seconds=63.15% runtime.
__fentry__ took 8.51 seconds=9.88% runtime.
Bucket::contains took 11.95 seconds=13.87% runtime.
Sketch::update took 8.21 seconds=9.53% runtime.
Bucket::update took 0.85 seconds=0.99% runtime.
A run without profiling took 58.709 seconds.
So it's not 100% accurate to say that _mcount_private is entirely gprof overhead, and I have no clue what __fentry__ is.
My understanding is that _mcount_private and __fentry__ are both more or less profiling things that are put at the beginning of function calls.
So reducing function calls should be a good optimization, right?
(Not sure why I didn't think of this earlier considering how small these functions are, but now's better than never)
With inline bucket functions, a run without profiling took 32.330 seconds.
Unfortunately gprof doesn't seem to work well with a lot of inlining: https://stackoverflow.com/a/2087943/, https://stackoverflow.com/a/378024/.
But we should still be able to conclude that the 11.95 seconds in Bucket::contains still counts toward the runtime, which is now about 36.96% of the total. Similarly, the 8.21 seconds in Sketch::update is about 25.40%.
I thought our method for getting rid of the b bucket was to make `a` xor-based as well, i.e. we change `a += index * delta` to `a ^= index`, and then we don't need deltas.
The failure rates are still low, which is why I didn't notice that I was not doing what we had originally planned.
One nice thing about xor is that we don't have to worry about overflow either: we can keep `a` as a 64-bit int, and it should be able to handle graphs with up to 6074001000 vertices.
So do we keep deltas and use the original `a` update method, or do we get rid of deltas entirely and use `a ^= index`? (In both methods, `c ^= h(index)`.)
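Here's a self-contained sketch of the `a ^= index` variant. `splitmix64` stands in for the checksum hash h, and note a real implementation would need to handle index 0 specially (e.g. by storing `index + 1`), since xor-ing 0 into `a` is a no-op.

```cpp
#include <cstdint>

// splitmix64, standing in for the checksum hash h.
uint64_t splitmix64(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// With only +1/-1 updates, a second update to the same index cancels
// the first, so no delta argument is needed.
struct XorBucket {
  uint64_t a = 0;  // xor of surviving indices
  uint64_t c = 0;  // xor of h(index) over surviving indices
};

void bucket_update(XorBucket& b, uint64_t index) {
  b.a ^= index;
  b.c ^= splitmix64(index);
}

// Good iff exactly one index survives (up to a hash false positive,
// where several surviving indices happen to satisfy the checksum).
bool bucket_good(const XorBucket& b) {
  return (b.a != 0 || b.c != 0) && splitmix64(b.a) == b.c;
}
```

Updating with {5, 9, 9} cancels the 9s, leaving a good bucket with `a == 5`; a fourth update with 9 leaves two surviving indices and the checksum check fails.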
David suggested that instead of using a different seed for each bucket, we can combine "columns" of buckets with decreasing mu guesses. So we do one hash call using both the column index of the bucket and the update index, and use the resulting number of trailing zeroes as the number of buckets to update in the column.
Here's an example of one column, with 1 marking the indices a bucket contains:

```
Bucket0: 11111111
Bucket1: 10011010
Bucket2: 00001010
Bucket3: 00001000
```
This saves a whole log factor of hash calls off the runtime for updating sketches.
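A sketch of the column trick under my assumptions: `splitmix64` stands in for the real hash, the trailing-zero count of one hash per (column, index) pair picks how deep the update goes, and `__builtin_ctzll` is GCC-specific.

```cpp
#include <cstdint>
#include <algorithm>

// splitmix64, standing in for the real hash (e.g. XXH64).
uint64_t splitmix64(uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// One hash per (column, index) pair; the trailing-zero count picks how
// many buckets in the column receive the update: Bucket0 always,
// Bucket1 with probability 1/2, Bucket2 with 1/4, and so on.
// An update touches buckets 0..depth-1 of the column.
int column_depth(uint64_t index, uint64_t column_seed, int num_buckets) {
  uint64_t h = splitmix64(index ^ column_seed);
  int tz = (h == 0) ? 64 : __builtin_ctzll(h); // trailing zeros of the hash
  return std::min(tz + 1, num_buckets);
}
```

Compared with hashing every bucket separately, this does one hash call per column instead of one per bucket, which is where the log factor of savings comes from.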