--- rav1d ---
n=9831
mean 8215199.18ns
medi 10060897.00ns
p90 14742105.30ns
p95 15724595.45ns
p99 17899872.77ns
average std timing variation 261486.01
--- dav1d ---
n=9831
mean 7516056.93ns
medi 9230588.80ns
p90 13610381.10ns
p95 14609830.10ns
p99 16788979.33ns
average std timing variation 135148.00
--- ratio ---
n=9831
mean 1.09
medi 1.10
p90 1.13
p95 1.14
p99 1.16
-- Absolute speedup slowest frames are rav1d slowest frames --
improving the slowest 50% by 1% would save 627464976.13 (99.22%)
improving the slowest 50% by 5% would save 3137324880.65 (96.12%)
improving the slowest 50% by 10% would save 6274649761.30 (92.23%)
improving the slowest 50% by 20% would save 12549299522.60 (84.46%)
improving the slowest 25% by 1% would save 352693168.56 (99.56%)
improving the slowest 25% by 5% would save 1763465842.79 (97.82%)
improving the slowest 25% by 10% would save 3526931685.57 (95.63%)
improving the slowest 25% by 20% would save 7053863371.14 (91.27%)
improving the slowest 10% by 1% would save 159179283.66 (99.80%)
improving the slowest 10% by 5% would save 795896418.31 (99.01%)
improving the slowest 10% by 10% would save 1591792836.62 (98.03%)
improving the slowest 10% by 20% would save 3183585673.24 (96.06%)
improving the slowest 5% by 1% would save 84497686.67 (99.90%)
improving the slowest 5% by 5% would save 422488433.35 (99.48%)
improving the slowest 5% by 10% would save 844976866.69 (98.95%)
improving the slowest 5% by 20% would save 1689953733.38 (97.91%)
improving the slowest 1% by 1% would save 19710769.30 (99.98%)
improving the slowest 1% by 5% would save 98553846.48 (99.88%)
improving the slowest 1% by 10% would save 197107692.95 (99.76%)
improving the slowest 1% by 20% would save 394215385.90 (99.51%)
-- Difference speedup slowest frames are rav1d - dav1d --
improving the slowest 25% by 1% would save 30686228.72 (99.55%)
improving the slowest 25% by 5% would save 153431143.57 (97.77%)
improving the slowest 25% by 10% would save 306862287.15 (95.54%)
improving the slowest 25% by 20% would save 613724574.30 (91.07%)
improving the slowest 25% by 50% would save 1534311435.75 (77.68%)
improving the slowest 10% by 1% would save 14029421.62 (99.80%)
improving the slowest 10% by 5% would save 70147108.11 (98.98%)
improving the slowest 10% by 10% would save 140294216.22 (97.96%)
improving the slowest 10% by 20% would save 280588432.44 (95.92%)
improving the slowest 10% by 50% would save 701471081.10 (89.79%)
improving the slowest 5% by 1% would save 7609317.89 (99.89%)
improving the slowest 5% by 5% would save 38046589.43 (99.45%)
improving the slowest 5% by 10% would save 76093178.85 (98.89%)
improving the slowest 5% by 20% would save 152186357.70 (97.79%)
improving the slowest 5% by 50% would save 380465894.25 (94.46%)
improving the slowest 1% by 1% would save 1758226.73 (99.97%)
improving the slowest 1% by 5% would save 8791133.66 (99.87%)
improving the slowest 1% by 10% would save 17582267.33 (99.74%)
improving the slowest 1% by 20% would save 35164534.66 (99.49%)
improving the slowest 1% by 50% would save 87911336.65 (98.72%)
n frames in lowest 1.00% 98
slowest time diff frames rav1d
9187: rav1d= 16381141.8 dav1d= 14249326.5 ratio= 1.15, diff= 2131815.30 n_samples@100khz=1638
9177: rav1d= 16750423.1 dav1d= 14688030.8 ratio= 1.14, diff= 2062392.30 n_samples@100khz=1675
9186: rav1d= 16635372.1 dav1d= 14628069.6 ratio= 1.14, diff= 2007302.50 n_samples@100khz=1663
9176: rav1d= 16074932.2 dav1d= 14098569.2 ratio= 1.14, diff= 1976363.00 n_samples@100khz=1607
9195: rav1d= 16519818.2 dav1d= 14557388.9 ratio= 1.13, diff= 1962429.30 n_samples@100khz=1651
9183: rav1d= 16585604.4 dav1d= 14623517.8 ratio= 1.13, diff= 1962086.60 n_samples@100khz=1658
9196: rav1d= 15704063.8 dav1d= 13743497.9 ratio= 1.14, diff= 1960565.90 n_samples@100khz=1570
9178: rav1d= 16267450.5 dav1d= 14307055.7 ratio= 1.14, diff= 1960394.80 n_samples@100khz=1626
6865: rav1d= 18500798.6 dav1d= 16552558.0 ratio= 1.12, diff= 1948240.60 n_samples@100khz=1850
6285: rav1d= 15867283.1 dav1d= 13927933.1 ratio= 1.14, diff= 1939350.00 n_samples@100khz=1586
9174: rav1d= 15939393.7 dav1d= 14011077.2 ratio= 1.14, diff= 1928316.50 n_samples@100khz=1593
9193: rav1d= 16047787.1 dav1d= 14123608.6 ratio= 1.14, diff= 1924178.50 n_samples@100khz=1604
9201: rav1d= 16220579.4 dav1d= 14309487.5 ratio= 1.13, diff= 1911091.90 n_samples@100khz=1622
6866: rav1d= 15307081.8 dav1d= 13415563.3 ratio= 1.14, diff= 1891518.50 n_samples@100khz=1530
9194: rav1d= 16024488.5 dav1d= 14140632.3 ratio= 1.13, diff= 1883856.20 n_samples@100khz=1602
9185: rav1d= 16397133.0 dav1d= 14516647.0 ratio= 1.13, diff= 1880486.00 n_samples@100khz=1639
9192: rav1d= 16220258.7 dav1d= 14347078.0 ratio= 1.13, diff= 1873180.70 n_samples@100khz=1622
6484: rav1d= 17064982.4 dav1d= 15206699.6 ratio= 1.12, diff= 1858282.80 n_samples@100khz=1706
5183: rav1d= 11342964.7 dav1d= 9485317.0 ratio= 1.20, diff= 1857647.70 n_samples@100khz=1134
9184: rav1d= 15810618.0 dav1d= 13975471.3 ratio= 1.13, diff= 1835146.70 n_samples@100khz=1581
6476: rav1d= 14224412.7 dav1d= 12389343.1 ratio= 1.15, diff= 1835069.60 n_samples@100khz=1422
9204: rav1d= 16245757.7 dav1d= 14414383.3 ratio= 1.13, diff= 1831374.40 n_samples@100khz=1624
6859: rav1d= 11903174.4 dav1d= 10071824.1 ratio= 1.18, diff= 1831350.30 n_samples@100khz=1190
9169: rav1d= 15610712.9 dav1d= 13779669.9 ratio= 1.13, diff= 1831043.00 n_samples@100khz=1561
6475: rav1d= 17105298.0 dav1d= 15277185.7 ratio= 1.12, diff= 1828112.30 n_samples@100khz=1710
9264: rav1d= 16424634.7 dav1d= 14599886.3 ratio= 1.12, diff= 1824748.40 n_samples@100khz=1642
5543: rav1d= 16431777.2 dav1d= 14610115.7 ratio= 1.12, diff= 1821661.50 n_samples@100khz=1643
9168: rav1d= 16085764.4 dav1d= 14264471.9 ratio= 1.13, diff= 1821292.50 n_samples@100khz=1608
6326: rav1d= 14173021.1 dav1d= 12351892.9 ratio= 1.15, diff= 1821128.20 n_samples@100khz=1417
1738: rav1d= 15070580.8 dav1d= 13253411.8 ratio= 1.14, diff= 1817169.00 n_samples@100khz=1507
9175: rav1d= 15720511.8 dav1d= 13904022.2 ratio= 1.13, diff= 1816489.60 n_samples@100khz=1572
9237: rav1d= 16234274.3 dav1d= 14420144.3 ratio= 1.13, diff= 1814130.00 n_samples@100khz=1623
6874: rav1d= 17035022.3 dav1d= 15224567.3 ratio= 1.12, diff= 1810455.00 n_samples@100khz=1703
9200: rav1d= 15457874.8 dav1d= 13648365.4 ratio= 1.13, diff= 1809509.40 n_samples@100khz=1545
1920: rav1d= 15225706.2 dav1d= 13423026.3 ratio= 1.13, diff= 1802679.90 n_samples@100khz=1522
9181: rav1d= 18296001.4 dav1d= 16494613.2 ratio= 1.11, diff= 1801388.20 n_samples@100khz=1829
6740: rav1d= 16754291.3 dav1d= 14969318.7 ratio= 1.12, diff= 1784972.60 n_samples@100khz=1675
9308: rav1d= 16393914.0 dav1d= 14609544.5 ratio= 1.12, diff= 1784369.50 n_samples@100khz=1639
6862: rav1d= 12294920.7 dav1d= 10513683.5 ratio= 1.17, diff= 1781237.20 n_samples@100khz=1229
9167: rav1d= 15510178.1 dav1d= 13737766.3 ratio= 1.13, diff= 1772411.80 n_samples@100khz=1551
6870: rav1d= 11900472.3 dav1d= 10136742.6 ratio= 1.17, diff= 1763729.70 n_samples@100khz=1190
6480: rav1d= 12547323.6 dav1d= 10784544.8 ratio= 1.16, diff= 1762778.80 n_samples@100khz=1254
9191: rav1d= 15629694.5 dav1d= 13867475.7 ratio= 1.13, diff= 1762218.80 n_samples@100khz=1562
1751: rav1d= 14887199.4 dav1d= 13127492.6 ratio= 1.13, diff= 1759706.80 n_samples@100khz=1488
9317: rav1d= 16231630.4 dav1d= 14473621.4 ratio= 1.12, diff= 1758009.00 n_samples@100khz=1623
9199: rav1d= 17742668.4 dav1d= 15985142.7 ratio= 1.11, diff= 1757525.70 n_samples@100khz=1774
9320: rav1d= 16519434.2 dav1d= 14762244.4 ratio= 1.12, diff= 1757189.80 n_samples@100khz=1651
6284: rav1d= 18506325.2 dav1d= 16749223.2 ratio= 1.10, diff= 1757102.00 n_samples@100khz=1850
6867: rav1d= 12062536.4 dav1d= 10306431.2 ratio= 1.17, diff= 1756105.20 n_samples@100khz=1206
6861: rav1d= 12333730.6 dav1d= 10580966.6 ratio= 1.17, diff= 1752764.00 n_samples@100khz=1233
1569: rav1d= 15048133.7 dav1d= 13298266.5 ratio= 1.13, diff= 1749867.20 n_samples@100khz=1504
9321: rav1d= 15842388.7 dav1d= 14097614.8 ratio= 1.12, diff= 1744773.90 n_samples@100khz=1584
9152: rav1d= 15164263.8 dav1d= 13420003.8 ratio= 1.13, diff= 1744260.00 n_samples@100khz=1516
6561: rav1d= 15694681.4 dav1d= 13953587.1 ratio= 1.12, diff= 1741094.30 n_samples@100khz=1569
6597: rav1d= 15810137.1 dav1d= 14069158.1 ratio= 1.12, diff= 1740979.00 n_samples@100khz=1581
6736: rav1d= 12336682.9 dav1d= 10602805.7 ratio= 1.16, diff= 1733877.20 n_samples@100khz=1233
6869: rav1d= 11539205.0 dav1d= 9810908.2 ratio= 1.18, diff= 1728296.80 n_samples@100khz=1153
9202: rav1d= 15542952.5 dav1d= 13816646.3 ratio= 1.12, diff= 1726306.20 n_samples@100khz=1554
9235: rav1d= 17664942.8 dav1d= 15939932.7 ratio= 1.11, diff= 1725010.10 n_samples@100khz=1766
1660: rav1d= 15183188.3 dav1d= 13460438.7 ratio= 1.13, diff= 1722749.60 n_samples@100khz=1518
9190: rav1d= 18024098.9 dav1d= 16302028.7 ratio= 1.11, diff= 1722070.20 n_samples@100khz=1802
9173: rav1d= 15478870.3 dav1d= 13758195.7 ratio= 1.13, diff= 1720674.60 n_samples@100khz=1547
9268: rav1d= 15699709.9 dav1d= 13983821.0 ratio= 1.12, diff= 1715888.90 n_samples@100khz=1569
6601: rav1d= 13394198.3 dav1d= 11678414.6 ratio= 1.15, diff= 1715783.70 n_samples@100khz=1339
9258: rav1d= 16777138.2 dav1d= 15062249.6 ratio= 1.11, diff= 1714888.60 n_samples@100khz=1677
6773: rav1d= 12501258.9 dav1d= 10790214.3 ratio= 1.16, diff= 1711044.60 n_samples@100khz=1250
9182: rav1d= 15618810.0 dav1d= 13910518.5 ratio= 1.12, diff= 1708291.50 n_samples@100khz=1561
6875: rav1d= 13711879.3 dav1d= 12003729.1 ratio= 1.14, diff= 1708150.20 n_samples@100khz=1371
1894: rav1d= 14820272.2 dav1d= 13112558.7 ratio= 1.13, diff= 1707713.50 n_samples@100khz=1482
6382: rav1d= 21241665.4 dav1d= 19534148.6 ratio= 1.09, diff= 1707516.80 n_samples@100khz=2124
9220: rav1d= 15694992.1 dav1d= 13988952.7 ratio= 1.12, diff= 1706039.40 n_samples@100khz=1569
6741: rav1d= 13998722.6 dav1d= 12293335.9 ratio= 1.14, diff= 1705386.70 n_samples@100khz=1399
1582: rav1d= 14790423.1 dav1d= 13085234.3 ratio= 1.13, diff= 1705188.80 n_samples@100khz=1479
6863: rav1d= 11558143.7 dav1d= 9853388.8 ratio= 1.17, diff= 1704754.90 n_samples@100khz=1155
6856: rav1d= 17887305.3 dav1d= 16185632.4 ratio= 1.11, diff= 1701672.90 n_samples@100khz=1788
6478: rav1d= 12012756.0 dav1d= 10312156.7 ratio= 1.16, diff= 1700599.30 n_samples@100khz=1201
6771: rav1d= 12950217.7 dav1d= 11256301.3 ratio= 1.15, diff= 1693916.40 n_samples@100khz=1295
6682: rav1d= 13404503.6 dav1d= 11710972.6 ratio= 1.14, diff= 1693531.00 n_samples@100khz=1340
1244: rav1d= 15337421.8 dav1d= 13645752.5 ratio= 1.12, diff= 1691669.30 n_samples@100khz=1533
9221: rav1d= 15824731.5 dav1d= 14134069.1 ratio= 1.12, diff= 1690662.40 n_samples@100khz=1582
9318: rav1d= 15718866.8 dav1d= 14032765.0 ratio= 1.12, diff= 1686101.80 n_samples@100khz=1571
9205: rav1d= 15487603.5 dav1d= 13803014.7 ratio= 1.12, diff= 1684588.80 n_samples@100khz=1548
6868: rav1d= 11611749.2 dav1d= 9928509.6 ratio= 1.17, diff= 1683239.60 n_samples@100khz=1161
9228: rav1d= 15775589.1 dav1d= 14093263.5 ratio= 1.12, diff= 1682325.60 n_samples@100khz=1577
9214: rav1d= 15145493.5 dav1d= 13464096.5 ratio= 1.12, diff= 1681397.00 n_samples@100khz=1514
2128: rav1d= 15430232.0 dav1d= 13749780.9 ratio= 1.12, diff= 1680451.10 n_samples@100khz=1543
9326: rav1d= 15879450.0 dav1d= 14199993.8 ratio= 1.12, diff= 1679456.20 n_samples@100khz=1587
6745: rav1d= 12257767.0 dav1d= 10578696.4 ratio= 1.16, diff= 1679070.60 n_samples@100khz=1225
6560: rav1d= 18572310.0 dav1d= 16896700.0 ratio= 1.10, diff= 1675610.00 n_samples@100khz=1857
1699: rav1d= 15775983.9 dav1d= 14101787.6 ratio= 1.12, diff= 1674196.30 n_samples@100khz=1577
9265: rav1d= 15795137.0 dav1d= 14121877.4 ratio= 1.12, diff= 1673259.60 n_samples@100khz=1579
9347: rav1d= 16168011.8 dav1d= 14503087.7 ratio= 1.11, diff= 1664924.10 n_samples@100khz=1616
1764: rav1d= 14548468.9 dav1d= 12883636.9 ratio= 1.13, diff= 1664832.00 n_samples@100khz=1454
6602: rav1d= 12718329.3 dav1d= 11056896.8 ratio= 1.15, diff= 1661432.50 n_samples@100khz=1271
9319: rav1d= 15850951.6 dav1d= 14192264.5 ratio= 1.12, diff= 1658687.10 n_samples@100khz=1585
9236: rav1d= 15517922.5 dav1d= 13859411.6 ratio= 1.12, diff= 1658510.90 n_samples@100khz=1551
9203: rav1d= 15475966.7 dav1d= 13818568.1 ratio= 1.12, diff= 1657398.60 n_samples@100khz=1547
6738: rav1d= 11829454.9 dav1d= 10172663.3 ratio= 1.16, diff= 1656791.60 n_samples@100khz=1182
[1244,1569,1582,1660,1699,1738,1751,1764,1894,1920,2128,5183,5543,6284,6285,6326,6382,6475,6476,6478,6480,6484,6560,6561,6597,6601,6602,6682,6736,6738,6740,6741,6745,6771,6773,6856,6859,6861,6862,6863,6865,6866,6867,6868,6869,6870,6874,6875,9152,9167,9168,9169,9173,9174,9175,9176,9177,9178,9181,9182,9183,9184,9185,9186,9187,9190,9191,9192,9193,9194,9195,9196,9199,9200,9201,9202,9203,9204,9205,9214,9220,9221,9228,9235,9236,9237,9258,9264,9265,9268,9308,9317,9318,9319,9320,9321,9326,9347]
would get 150186 samples
This hasn't really yielded any concrete insights yet but I figured I'd share the results as they've just been sitting for a while.
All work was done single-threaded on Chimera 8-bit.
NOTE: wherever I say "frame" below, I really mean "a single call to `{r,d}av1d_decode_frame`". I'm only just realizing the Chimera file has 8929 frames and my analysis shows "frame" (call) 9187 is the slowest. Did I mess something up, or is that function called more than once per frame?
I wanted to get more granular performance indicators than just sampling / counters for the whole program, so I inserted some hacky code to measure per-frame metrics in both rav1d and dav1d for each call to `{r,d}av1d_decode_frame`. Comparing the two then gives an idea of which frames are interesting and potentially causing a slowdown.
The simplest metric is wall clock time per frame. Initially I thought the ratio `rav1d / dav1d` was interesting because that gives the ratio of slowdown, but actually I think the difference `rav1d - dav1d` is more interesting: we'd rather target a slow frame that takes 15ms in rav1d compared to 10ms in dav1d (diff = 5ms, ratio = 1.5) than a frame that takes 2ms in rav1d vs 1ms in dav1d (diff = 1ms, ratio = 2). These results are pasted below, with some hypothetical scenarios of speeding up the slowest X% of frames by Y%. The slowest X% is given first by absolute slowest frames, then by the slowest compared to dav1d in time difference. Then it shows the slowest 1% of frames compared to dav1d in time difference. Note these times are averaged over 10 runs. The interesting takeaway in the hypothetical speedups is that improving the slowest 10% of frames (ranked by time difference) by 20% would close about 4% of the total rav1d - dav1d gap (95.92% remaining). The percentages in the difference section work out to be relative to the summed time difference, not to overall runtime.
Frame time (output of `analyzeframetimes.py`)
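For anyone wanting to reproduce the hypothetical-speedup lines, here is a minimal sketch of how I'd expect them to be computed. It is not the actual `analyzeframetimes.py`; the function name, the `by` parameter, and the truncating `int()` used to pick the top X% are assumptions (the truncation does match 9831 calls giving 98 frames in the top 1%). The percentage returned is what remains of the summed cost (total rav1d time for `"abs"`, total `rav1d - dav1d` difference for `"diff"`).

```python
def hypothetical_savings(rav1d_ns, dav1d_ns, top_fraction, improvement, by="diff"):
    """Estimate the effect of speeding up the costliest frames.

    rav1d_ns, dav1d_ns: per-call wall clock times in nanoseconds.
    top_fraction:       share of calls to target, e.g. 0.10 for the slowest 10%.
    improvement:        relative improvement on those calls, e.g. 0.20 for 20%.
    by:                 "abs" ranks by rav1d time, "diff" by rav1d - dav1d.

    Returns (saved_ns, remaining_pct), where remaining_pct is the share of the
    summed cost that is left after the hypothetical improvement.
    """
    if by == "diff":
        cost = [r - d for r, d in zip(rav1d_ns, dav1d_ns)]
    elif by == "abs":
        cost = list(rav1d_ns)
    else:
        raise ValueError("by must be 'abs' or 'diff'")

    ranked = sorted(cost, reverse=True)      # costliest calls first
    k = int(len(ranked) * top_fraction)      # assumed: truncate, not round
    saved = improvement * sum(ranked[:k])
    total = sum(cost)
    return saved, 100.0 * (1.0 - saved / total)


if __name__ == "__main__":
    # The two example frames from the text: 15ms vs 10ms and 2ms vs 1ms.
    rav1d = [15e6, 2e6]
    dav1d = [10e6, 1e6]
    saved, remaining = hypothetical_savings(rav1d, dav1d, 0.5, 0.20, by="diff")
    print(f"saved {saved:.2f} ns, {remaining:.2f}% of the gap remains")
```

Ranking by `"diff"` picks the 15ms/10ms frame in this toy input, whereas ranking by ratio would have picked the 2ms/1ms frame, which is the reason for preferring the difference.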
And a graph of these times:
I then also recorded some CPU counters. The counters were: instructions, branch instructions, branch mispredictions, L1 dcm, L2 dcm, L2 icm, TLB dm, TLB im, branch not taken, and L1 dca. Here dcm/icm are data/instruction cache misses, TLB dm/im are TLB data/instruction misses, and dca is data cache accesses. I have to record groups of 4 counters at a time because of hardware limits. I then tried to look at how well those counters correlate with the time differences. The output is pasted below for each group, and it also calculates the correlation for the slowest X% of frames.
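As a sketch of the kind of correlation meant here (not the actual `analyzeframestats.py`), the snippet below computes a Pearson correlation between per-call time differences and a per-call counter value, optionally restricted to the slowest X% of calls by time difference. The function names are hypothetical, and whether the counter input should be the rav1d value or the rav1d - dav1d counter difference is an assumption left to the caller.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlation_for_slowest(time_diff_ns, counter_values, top_fraction=1.0):
    """Correlate counters with time differences over the slowest calls only.

    Calls are ranked by time difference (largest first) and the top
    `top_fraction` of them is kept; at least two calls are always used.
    """
    order = sorted(range(len(time_diff_ns)),
                   key=lambda i: time_diff_ns[i], reverse=True)
    k = max(2, int(len(order) * top_fraction))
    picked = order[:k]
    return pearson([time_diff_ns[i] for i in picked],
                   [counter_values[i] for i in picked])


if __name__ == "__main__":
    # Toy data: the counter grows linearly with the time difference.
    diffs = [1.0, 2.0, 3.0, 4.0]
    misses = [2.0, 4.0, 6.0, 8.0]
    print(correlation_for_slowest(diffs, misses, top_fraction=0.5))
```

Restricting to the slowest calls shrinks the sample, so correlations over the top 1% (98 calls here) will be noisier than those over all calls.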
Frame counter output of `analyzeframestats.py`
This counter work was done in this rav1d branch and this dav1d branch
So then I wanted to see why these frames were slower, so I hacked in a way to enable+disable perf record for each of the slow frames in this branch. It gets used like this:
and the perf record can be adjusted to record whatever you want. This outputs a single perf.data with only samples from the slow frames (hardcoded into the code).
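For readers who want the general idea without digging through the branch: one way to toggle sampling around specific calls (not necessarily what the branch does) is perf's control channel, where `perf record` is started with events disabled (`--delay=-1`) and a `--control` FIFO, and the profiled process writes `enable` / `disable` commands to that FIFO. The sketch below only illustrates the control flow in Python; the helper name and the way the slow-call set is passed in are hypothetical, and the real hack lives in the decoder itself.

```python
import io
from contextlib import contextmanager


@contextmanager
def record_if_slow(control, call_index, slow_calls):
    """Enable perf sampling only while a known-slow call is running.

    control:    writable text stream connected to perf's control FIFO
                (any file-like object works for testing).
    call_index: index of the decode call about to run.
    slow_calls: set of call indices that should be sampled.
    """
    active = call_index in slow_calls
    if active:
        control.write("enable\n")
        control.flush()
    try:
        yield active
    finally:
        if active:
            control.write("disable\n")
            control.flush()


if __name__ == "__main__":
    # Stand-in for the FIFO so the sketch runs without perf attached.
    fake_fifo = io.StringIO()
    slow = {9176, 9177}  # two call indices from the slowest-frames list
    for index in range(9175, 9179):
        with record_if_slow(fake_fifo, index, slow):
            pass  # the decode call would go here
    print(fake_fifo.getvalue(), end="")
```

Without an ack FIFO the commands are not synchronized with perf, so a few samples at the edges of each call may be lost or leak in from neighbouring calls.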
This still isn't granular enough though, so I've been thinking of more targeted analysis that could better pinpoint exactly which points are slow. I have a few rough ideas, and most revolve around being able to repeatedly run a specific frame (or call) with slightly different perf options (either via perf record out of process or directly with perf_event_open in process). Is there a (hopefully easy) way to hack in a flag that could repeatedly decode a single frame, or perhaps also the N previous frames to allow for realistic warmup of caches? Otherwise I can use the `--skip` flag, which I just discovered.
Ok, I think that's it. Let me know if I forgot to include something somewhere.