IR/cache.tex at master · Steap/IR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
\section{Estimating the cache size}

\subsection{A well-known technique}
The Servet developers have used the same technique as Rafael Saavedra and Alan
Smith\cite{Cache_TLB} to measure the cache size : reading from and writing to an
array $n$ times, increasing its size every time, and estimating $C(i), 1 \le i
\le n$, the number of cycles it takes. Then it is possible to compute the
gradients, $C[i+1]/C[i], 1 \le i < n$. Since traversing an array is faster if
its values can fit in the cache, there should be a noticeable peak of the
gradient   when the array becomes bigger than the cache.

\subsection{Testing the benchmark}
\subsubsection{Portability and reproducibility}
Scientists often need to conduct experiments. One key point is that they have to
be reproducible. In computer science, one cannot know for sure if an experiment
could work and give the expected results on every single machine it could
possibly be run on.

\subsubsection{Tests run by the Servet team}
The program was tested with the following machines :
    \begin{itemize}
        \item Intel Xeon E7450 (released in September 2008)
        \item Intel Xeon 5060 (released in May 2006)
        \item Itanium Montvale (released in October 2007)
        \item AMD Athlon single core processor (released before 2005)
    \end{itemize}

A large variety of hardware was used (AMD and Intel, multi-core and
monocore...). Unfortunately, there were no recent processors : it would have
been interesting to know whether the benchmarks could run well on new hardware,
especially when it comes to HPC, since all processors used in this field are
up-to-date\footnote{http://www.top500.org/stats/list/36/procgen}. Processors
have been changing a lot lately : for example, the L3 cache is now almost always
shared between all the cores.  These tests were all conclusive.
%Dempsey 5060 => May 23rd, 2006
%Xeon E7450 => September, 15th, 2008
%AMD Athlon 3200 unicore
%Itanium Montvale => October 2007

\subsubsection{Other tests}
%XXX
When we tried the program on a Nehalem E5504 though, it was unable to correctly
estimate the size of the L1 cache : indeed, it was only able to detect 16KB out
of 32KB of L1 cache.

The algorithm is definitely not wrong : the only issue is the way the first
peak is detected. Basically, the application computes gradients ($ C[k+1]/C[k],
0 \le k < n $), and the first gradient greater than a predefined threshold
indicates the cache size \footnote{See est\_cache\_size.c for more
information on the behaviour of the code}

\lstset{caption=A part of the output of $./job.sh\ est\_cache$ on a Nehalem E5504}
\begin{lstlisting}
For size 16 KBytes -> gradient = 1.356991
For size 32 KBytes -> gradient = 2.239814
For size 64 KBytes -> gradient = 1.142565
\end{lstlisting}

This output clearly shows that the cache size is 32KBytes. However, the
threshold being set to 1.35, the program will return "16KBytes" as the L1 cache
size.

I contacted the Servet developers to tell them about this issue. According to
them, one way to fix this bug would probably consist in selecting the highest
gradient instead of the first gradient greater than a arbitrarily given value :

\begin{quotation}
"Selecting the highest value in a typical L1 cache size range might be the best
approach. Probable we have to refine this, as this could have some impact. We
have seen around 99\% of successful L1 approximations, and any change should
increase this overall rate!"
\begin{flushright}Guillermo Lopez Taboada\end{flushright}
\end{quotation}

\subsection{Knowing the cache size may not be enough}
Knowing the cache size, developers may write code which does not raise cache
misses too often, thus increasing its performance. Another similar measurement
may help decreasing the memory access cost : the size of the TLB. Saavedra and
Smith\cite{Cache_TLB} proved that encountering TLB misses can significantly
decrease performance. That is definitely a benchmark to add to Servet in order
to be able to efficiently use a processor.

\subsection{Shared Caches}
In order to perfectly use the cache in a multicore machine, one has to know what
cache are shared between the cores. Indeed, if two different cores need to
access different data at the same time and use the same cache, they will keep
replacing data in the cache, thus losing time. On the opposite, if they have to
access the same piece of data at the same time, using a cache shared between
them could improve performance.

Therefore, the algorithm used by Servet is quite simple. Two cores try to
access two arrays at the same time. If the number of cycles required to achieve
this is significantly higher than expected (basically, greater than twice the
time needed for a single core to access the elements of one of the arrays), then
the cores probably do not share the cache. This algorithm is repeated for each
level of cache, using arrays of size $\frac{2}{3} * cache size$ (which makes it
impossible for the arrays to both fit in the cache).