
L_0 sketching implementation & testing status


Progress table: type of test by vector size

| Vector size | Sketch sampling | Sketch addition | Sketch scalar multiplication | Sketch uniform |
|---|---|---|---|---|
| 100 | ✔️ | ✔️ | ❌ | ❌ |
| 1000 | ✔️ | ✔️ | ❌ | ❌ |
| 10000 | ✔️ | ✔️ | ❌ | ❌ |

(✔️= implemented and successful, ❌= not yet implemented)

Test exceptions:

  • We first create an empty sketch.
  • We then query this sketch. Since the sketch is empty, this should throw AllBucketsZeroException.
  • We then query this sketch again. Since the sketch was already queried, this should throw MultipleQueryException.
  • Next we create another sketch, and find a vector with no good buckets by repeatedly eliminating the indices of good buckets.
  • Here's an example (each row below is a bucket, each column an index; the ^ markers show the indices eliminated in that round):
```
0 1 2 3 4 5 6 7 8 9
1 1 1 1 1 1 1 1 1 1
1 0 0 1 0 1 0 0 1 1
0 1 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
0 0 1 0 1 0 0 1 1 1
0 1 1 1 1 0 0 0 0 0
0 0 0 0 1 1 0 1 0 1
1 1 1 1 1 1 1 1 1 1
0 1 0 1 0 1 1 0 1 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
  ^             ^

0 2 3 4 5 6 7 9
1 1 1 1 1 1 1 1
1 0 1 0 1 0 0 1
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
0 1 0 1 0 0 1 1
0 1 1 1 0 0 0 0
0 0 0 1 1 0 1 1
1 1 1 1 1 1 1 1
0 0 1 0 1 1 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
            ^
```
  • In this case, our vector has zeros at indices {1, 7, 8}, and nonzero values at indices {0, 2, 3, 4, 5, 6, 9}.
  • Since we have created a vector with no good buckets, as long as we don't get a false positive from one of the buckets, this should throw NoGoodBucketException.
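A minimal sketch of what this test could look like, assuming a Google Test harness; the Sketch constructor arguments and exception spellings below are guesses based on the description above, not the actual API:

```cpp
// Hedged sketch of the exception tests. Constructor arguments and
// exception type names are assumptions taken from the prose above.
#include <cstdlib>
#include <gtest/gtest.h>
#include "sketch.h"

TEST(SketchTestSuite, TestExceptions) {
  Sketch sketch(/*vector_size=*/10, /*seed=*/rand());
  // The sketch is empty, so every bucket is zero.
  EXPECT_THROW(sketch.query(), AllBucketsZeroException);
  // The sketch has already been queried once.
  EXPECT_THROW(sketch.query(), MultipleQueryException);
}
```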

Test giving only one index:

If we give a sketch a vector with only one nonzero index, then the sketch should always return the correct value.
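A hedged sketch of this check, with assumed update/query signatures:

```cpp
// Hedged: a single +1 update at index idx, so query() must return exactly
// that index and value. The update()/query() interfaces are assumptions.
TEST(SketchTestSuite, TestGiveOnlyOneIndex) {
  const unsigned long idx = 42;
  Sketch sketch(/*vector_size=*/100, /*seed=*/rand());
  sketch.update(idx);  // the only nonzero entry
  auto [sample_idx, sample_value] = sketch.query();
  EXPECT_EQ(sample_idx, idx);
  EXPECT_EQ(sample_value, 1);
}
```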

Test sketch sampling:

We want to make sure the failure modes of sketch sampling occur below a certain probability (TODO: figure out what the theoretical probability is). A sample can fail in two ways: either there are no good buckets, or the returned sample is incorrect.

  • For each vector size n, we randomly generate n updates using Testing_Vector, and use them to update a sketch.
  • We then query the sketch.
    • If the query does not throw an exception, and the returned index/value does not match the vector, increment the counter for sample incorrect failures.
    • If the query throws AllBucketsZeroException, and the vector is nonzero, increment the counter for sample incorrect failures.
    • If the query throws NoGoodBucketException, increment the counter for no good bucket failures.
    • The query should never throw MultipleQueryException.

At the end, make sure the failure rate is less than or equal to the expected probability.
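In outline, the loop looks something like this (Testing_Vector's interface and the counter names here are illustrative, not the actual test code):

```cpp
// Hedged outline of the sampling test; all names are illustrative.
unsigned sample_incorrect = 0, no_good_buckets = 0;
for (unsigned long i = 0; i < num_runs; i++) {
  Testing_Vector test_vec(vec_size, /*num_updates=*/vec_size);
  Sketch sketch(vec_size, rand());
  for (unsigned long j = 0; j < vec_size; j++)
    sketch.update(test_vec.get_update(j));
  try {
    auto [idx, value] = sketch.query();
    if (test_vec.get_entry(idx) != value)
      sample_incorrect++;                         // wrong index/value returned
  } catch (AllBucketsZeroException&) {
    if (!test_vec.is_zero()) sample_incorrect++;  // nonzero vector misread as zero
  } catch (NoGoodBucketException&) {
    no_good_buckets++;                            // allowed, but must stay rare
  }                                 // MultipleQueryException: never expected here
}
EXPECT_LE((double)sample_incorrect / num_runs, max_sample_fail_prob);
EXPECT_LE((double)no_good_buckets / num_runs, max_bucket_fail_prob);
```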

Test sketch addition:

Very similar to sketch sampling.

  • For each vector size n, we randomly generate 2 separate Testing_Vectors, each with n updates, and use them to update two separate sketches.
  • We then add the sketches, and query them.
    • If the query does not throw an exception, and the returned index/value does not match the sum of the two vectors, increment the counter for sample incorrect failures.
    • If the query throws AllBucketsZeroException, and the sum of the two vectors is nonzero, increment the counter for sample incorrect failures.
    • If the query throws NoGoodBucketException, increment the counter for no good bucket failures.
    • The query should never throw MultipleQueryException.

At the end, make sure the failure rate is less than or equal to the expected probability.
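Relative to the sampling loop above, only the middle of the loop changes (hedged: operator+ on sketches is assumed, and the ground truth is the sum of the two Testing_Vectors):

```cpp
// Hedged fragment: sketch1/sketch2 are built from test_vec1/test_vec2.
try {
  Sketch sum = sketch1 + sketch2;
  auto [idx, value] = sum.query();
  if (test_vec1.get_entry(idx) + test_vec2.get_entry(idx) != value)
    sample_incorrect++;  // sample doesn't match the vector sum
} catch (AllBucketsZeroException&) {
  // incorrect unless vec1 + vec2 really is the zero vector
} catch (NoGoodBucketException&) {
  no_good_buckets++;
}
```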

Timing analysis

If we look at TestSketchSample, we're running 3 sizes of tests.

| # of runs | Vector size | # of updates | Time: powermod (on Kenny's laptop) | Time: hash+xor, native | Time: hash+xor, larger ints |
|---|---|---|---|---|---|
| 10000 | 100 | 100 | 2.686 s | 1.88919 s | 2.16831 s |
| 1000 | 1000 | 1000 | 5.124 s | 3.50447 s | 3.83548 s |
| 1000 | 10000 | 10000 | 101.796 s | 66.7599 s | 68.6052 s |

For powermod, the majority of time is spent on the 1000 runs, size 10000 vector, 10000 updates test. Looking at a line-by-line gprof profile, the most time is taken up by PrimeGenerator::power, at 63.32 seconds (60.81% of the runtime), and Bucket::contains, at 21.7 seconds (20.85%).

We can't do much about Bucket::contains, since checking whether a bucket contains an index requires a call to XXH64.

  • We have 1000 runs.
  • For each run, we have 10000 updates, giving us 10^7 total updates.
  • For each update, we have (floor(log_2(10000))+1)^2=196 buckets, giving us 1.96*10^9 total calls to Bucket::contains.
  • For each call to Bucket::contains, we call XXH64, giving us 1.96*10^9 total calls to XXH64.

All of these theoretical call counts match what I have found empirically using gcov.

I haven't run an XXH64 benchmark on my own laptop yet, but the benchmarks provided on xxHash's GitHub wiki suggest a throughput of somewhere around 10^8 hashes per second. (1.96*10^9 calls to XXH64)/(10^8 hashes per second)=19.6 seconds, which is close to the 21.7 seconds we measured.

For PrimeGenerator::power, I'm not sure exactly how much better we can do either.

  • We have 1000 runs.
  • For each run, we have 10000 updates, giving us 10^7 total updates.
  • For each update, we have (floor(log_2(10000))+1)^2=196 buckets.
  • On average, we will need to update the bucket (1 + 1/2 + 1/4 + ... + 1/2^13)/14=14.28% of the time, giving us 2.80*10^8 total calls to PrimeGenerator::power.
    • For the sake of analysis, call the arguments to PrimeGenerator::power x, y, and p, and the result res.
    • y is an integer picked uniformly from [1, 10000].
    • Since we are exponentiating by squaring, PrimeGenerator::power iterates over the bits of y, looping floor(log_2(y))+1 times.
    • On average, this is 12.3631 times.
    • In each pass of the loop, x=(x*x)%p is run.
    • An additional res=(res*x)%p is carried out if the bit of y is 1, which is about half the time.
    • To summarize, for each call to PrimeGenerator::power, we do on average 18.54 multiplication + mod operations.
    • This gives us 5.19*10^9 multiplication + mod operations.

All of these theoretical operation counts match what I have found empirically using gcov.

For my 2.3 GHz laptop, 1 clock cycle should be 434.78 picoseconds. Empirically, (63.32 seconds)/(5.19*10^9 operations)=12.20 nanoseconds per multiplication + mod, which is 28.06 clock cycles. Considering that the most costly instruction is calculating the remainder, which alone has a latency of upwards of 20 clock cycles, we don't have much room for improvement if we stick with plain exponentiation by squaring. We'd have to switch to some more efficient powermod algorithm; apparently there's something called Montgomery multiplication/reduction, and something called Barrett reduction? I have no clue how they work, but they might be worth looking into if we really need the speedup.
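For reference, here's a minimal exponentiation-by-squaring powermod matching the loop described above; this is a sketch of the algorithm, not necessarily PrimeGenerator::power's exact signature:

```cpp
// Exponentiation by squaring: iterate over the bits of y, squaring x each
// pass and multiplying it into res when the current bit of y is set. Each
// pass costs one or two multiply+mod operations, as counted above.
typedef unsigned long long ull;

ull power(ull x, ull y, ull p) {
  ull res = 1;
  x %= p;
  while (y > 0) {
    if (y & 1)          // low bit of y set: multiply into the result
      res = (res * x) % p;
    x = (x * x) % p;    // square for the next bit
    y >>= 1;
  }
  return res;
}
```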

Large sketch timing

Each run uses 10^6 updates.

Maybe optimizations weren't enabled for the Boost runs; that could be causing the ~20x slowdown.

All times are in seconds.

| Vector size | Boost multiprecision | Native | hash+xor | inline, hash+xor | Column optimization |
|---|---|---|---|---|---|
| 10^3 | 110.678 | 5.43079 | 2.86827 | 1.64006 | 0.33727 |
| 10^4 | 192.773 | 9.73575 | 5.2734 | 3.07678 | 0.466436 |
| 10^5 | 281.799 | 14.1229 | 7.41999 | 4.15527 | 0.521033 |
| 10^6 | 387.783 | 19.1509 | 10.0366 | 5.99906 | 0.603263 |
| 10^7 | 533.402 | 26.7054 | 14.4372 | 8.40956 | 0.720604 |
| 10^8 | 668.878 | 33.618 | 18.7456 | 10.6028 | 0.811287 |
| 10^9 | 818.414 | 41.2379 | 21.9179 | 12.6764 | 0.918864 |
| 10^10 | 1654.09 | 70.9191 | 29.1174 | 16.9599 | 1.0602 |
| 10^11 | 1911.86 | 82.8579 | 34.03 | 19.893 | 1.07846 |
| 10^12 | 2099.64 | 93.1091 | 39.7805 | 23.0413 | 1.15421 |

Hash+xor

The hash+xor algorithm is ~4x as fast as the native implementation of powermod, and keeps its speedup for larger int sizes.

The previous "bug" where all buckets seemed to be failing a bit more often than powermod for smaller vector sizes was due to the unit testing not being updated to match the restriction of only +1/-1 updates. After updating the unit tests, this is no longer a problem.

For the timing of the hash+xor algorithm, the largest bottlenecks are now contains() and Sketch::update, since for each bucket two hashes need to be calculated: one to get the bucket seed, and one to check whether the bucket contains the index.
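Roughly, the per-bucket work looks like this; the seed derivation and the membership predicate below are assumptions for illustration, and only the two-hashes-per-bucket structure comes from the actual code:

```cpp
// Hedged illustration of the two XXH64 calls per bucket per update.
#include <cstdint>
#include <xxhash.h>

bool bucket_contains(uint64_t index, uint64_t bucket_seed, uint64_t guess) {
  // Hash #2: membership test, keeping indices with probability ~1/guess
  // (guess is assumed to be a power of two here).
  return XXH64(&index, sizeof(index), bucket_seed) % guess == 0;
}

void update_bucket(uint64_t index, uint64_t sketch_seed, uint64_t bucket_pos) {
  // Hash #1: derive this bucket's seed from the sketch seed and position.
  uint64_t bucket_seed = XXH64(&bucket_pos, sizeof(bucket_pos), sketch_seed);
  if (bucket_contains(index, bucket_seed, 1ull << bucket_pos)) {
    // xor index into a, xor h(index) into c, etc.
  }
}
```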

TBH I don't fully understand gprof. There's the -pg option to enable profiling, but apparently there's also a -mfentry option that does profiling differently or something? They give similar results, so I'm not sure what's up with that.

This is the raw output of gprof:

```
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 63.15     54.42    54.42                             _mcount_private
 13.87     66.37    11.95 2266000000     0.00     0.00  Bucket_Boruvka::contains(unsigned long const&, unsigned long const&, unsigned long const&) const
  9.88     74.88     8.51                             __fentry__
  9.53     83.09     8.21 12000000     0.00     0.00  Sketch::update(unsigned long const&)
  2.16     84.95     1.86                             XXH64
  0.99     85.80     0.85 313965250     0.00     0.00  Bucket_Boruvka::update(unsigned long const&, unsigned long const&)
  0.36     86.11     0.31        3     0.10     7.12  test_sketch_sample(unsigned long, unsigned long, unsigned long, double, double)
  0.05     86.15     0.04 24048000     0.00     0.00  double_to_ull(double, double)
  0.02     86.17     0.02                             log2
```

The total runtime with profiling was 86.61 seconds. _mcount_private took 54.42 seconds (63.15% of the runtime), __fentry__ took 8.51 seconds (9.88%), Bucket::contains took 11.95 seconds (13.87%), Sketch::update took 8.21 seconds (9.53%), and Bucket::update took 0.85 seconds (0.99%).

A run without profiling took 58.709 seconds. So it's not 100% accurate to say that _mcount_private is entirely gprof overhead: subtracting it would leave 86.61 - 54.42 = 32.19 seconds, well under the 58.709-second unprofiled run, so some real work must be getting attributed to it. And I have no clue what __fentry__ is.

My understanding is that _mcount_private and __fentry__ are both more or less profiling hooks inserted at the beginning of function calls. So reducing the number of function calls should be a good optimization, right? (Not sure why I didn't think of this earlier, considering how small these functions are, but better late than never.)

With inline bucket functions, a run without profiling took 32.330 seconds.

Unfortunately gprof doesn't seem to work well with a lot of inlining: https://stackoverflow.com/a/2087943/, https://stackoverflow.com/a/378024/.

But we should still be able to conclude that the 11.95 seconds in Bucket::contains still counts toward the runtime, which against the 32.330-second inlined run is about 36.96%. Similarly, the 8.21 seconds in Sketch::update is about 25.40%.

Are update deltas necessary?

I thought our method for getting rid of the b bucket was to make a an xor accumulator as well, i.e. change a+=index*delta to a^=index, so we don't need deltas at all. The failure rates are still low, which is why I didn't notice that I was not doing what we had originally planned. One nice thing about xor is that we don't have to worry about overflow either: we can keep a as a 64-bit int, and it should be able to handle graphs with up to 6074001000 vertices.

So do we keep deltas and use the original a update method, or do we get rid of deltas entirely and use a^=index? (In both methods, c^=h(index).)
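Side by side, the two candidate update rules (the Bucket layout and field names here are hypothetical):

```cpp
#include <cstdint>

uint64_t h(uint64_t index);                // the bucket hash from the text

struct Bucket { uint64_t a = 0, c = 0; };  // hypothetical field names

// Original plan: a accumulates index*delta, so deltas are needed and the
// sum can overflow 64 bits for large graphs.
void update_with_delta(Bucket& b, uint64_t index, int64_t delta) {
  b.a += index * delta;
  b.c ^= h(index);
}

// Current behavior: a is an xor accumulator, so deltas are unnecessary
// and overflow is a non-issue (a stays a 64-bit int).
void update_with_xor(Bucket& b, uint64_t index) {
  b.a ^= index;
  b.c ^= h(index);
}
```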

Bucket column optimization

David suggested that instead of using a different seed for each bucket, we can combine "columns" of buckets with decreasing mu guesses. So we do one hash call using both the column index of the bucket and the update index, and use the resulting number of trailing zeroes as the number of buckets to update in the column.

Here's an example of one column, with 1 being the indices a bucket contains:

```
Bucket0:11111111
Bucket1:10011010
Bucket2:00001010
Bucket3:00001000
```

This saves a whole log factor of hash calls off the runtime for updating sketches.
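A sketch of what the column update could look like (names and the exact hash input layout are illustrative; I'm assuming bucket 0 is always updated, matching the example above where Bucket0 contains every index):

```cpp
// One XXH64 call per column per update: the trailing zeros of the hash
// decide how deep into the column the update goes, reproducing the
// geometric bucket membership shown above.
#include <algorithm>
#include <cstdint>
#include <vector>
#include <xxhash.h>

struct Bucket { void update(uint64_t index); };  // stand-in for the real bucket

void update_column(std::vector<Bucket>& column, uint64_t col_index,
                   uint64_t update_index, uint64_t sketch_seed) {
  uint64_t input[2] = {col_index, update_index};
  uint64_t hash = XXH64(input, sizeof(input), sketch_seed);
  // Trailing zeros of the hash; guard hash == 0, where ctz is undefined.
  size_t tz = hash ? (size_t)__builtin_ctzll(hash) : 64;
  size_t depth = std::min(tz + 1, column.size());
  for (size_t i = 0; i < depth; i++)
    column[i].update(update_index);  // bucket 0 is always updated
}
```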
