tla/intro.tex at master · codemonkeyricky/tla · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
\chapter{Motivation}

\section{Catching Problems Early}

Years ago, I worked on a proprietary low-power processor in an embedded
system. The processor ran a microcode featuring a custom instruction set. To
enter a low-power state, a set (possibly hundreds) of instructions were
executed. These instructions progressively put the system in a lower power state.
For example: Turn off IP A, then turn off IP B, then turn off the power island
to the IPs. To save cost and power, the low-power processor had very limited
debuggability support.\\

An experienced reader may start to notice some red flags.\\

If the microcode attempts to access the memory interface when the power
island has been shut off, the processor will hang. Since the debug power island has
been shut off, the physical hardware debug port is also unavailable, leaving the
developer with \textit{no way} of live debugging problems. At this point,
the developer needs to search through numerous instructions to
catch system constraint violations (invariants) \textit{manually}.\\

As one can imagine, maintaining the microcode was very expensive.
Fortunately, the proprietary low-power processor only had a handful of
instructions, so I created a simulator for this proprietary processor to verify
the microcode before deploying it on-target. The simulator models the processor
states as a state graph, with executed instruction, transitions the state machine
to the next state. At every state, all the invariants are verified. Example
invariants include:
\begin{itemize}
    \item Accessing memory interface after power off leads to a hang
    \item Accessing certain register in certain chip revision leads to a hang
    \item Verify IPs are shut off in the allowed order
\end{itemize}
The verification algorithm was implemented using a \textit{depth-first-search},
providing 100\% microcode coverage before deployment on target.\\

In reality, any system can be modeled as an arbitrary set of states with a
collection of invariants that be true at all times. The complexity of such an
arbitrary system generally grows quadratically as the number of states grows
linearly (eg. in an N-state system, adding state N+1 may introduce N transitions
into the new state). Many engineering problems have a large number of states,
such as lockless or wait-free data structures, distributed algorithms, OS
schedulers, consensus protocols, and more. As the number of states grows, the
problem becomes more challenging for designers to reason about.\\

So, how do we produce a system that is \textit{correct by design?}

\section{The Generalized Problem}

Fast forward to now: I stumbled across TLA+, a formalized solution of what I
was looking for.\\

In the age of big data, vertical scaling is no longer sufficient. The industry
has been exploring and implementing horizontal scaling solutions for past two
decades. Instead of focusing on increasing clock speed, hardwares vendor has
been focusing on adding more instances of hardware in their design. On the other
hand, software vendors has been designing horizontal scaling solution that takes
advantage of large volume of commodity hardware. There is one small problem.
Horizontal scaling requires concurrent reasoning, and:\\

\textit{Humans are not good at concurrent reasoning}.\\

Our cognitive system is optimized for sequential reasoning. Enumerating
all scenarios in one's mind to ensure an arbitrary design accommodates all
the corner cases is challenging.\\

Consider a distributed system. The system is a cluster of independently
operating entities, which collectively needs to offer the correct system
behavior. At any given time, nodes in the cluster may crash, restart, receive
instructions out-of-order, etc.\\

Consider a single producer multiple consumer lockless queue. The consumers may
reserve an index in the queue in a certain order but may release it in a
different order. What if one reader is slow, and another reader is superfast
and possibly lapses the slow reader?\\

Consider an OS scheduler with locks. Assume all the processes have the same
priority. Can a process starve the other processes by repeatedly acquiring and
releasing the lock? How do we ensure scheduling is fair?\\

The \textit{anti-pattern} is to band-aid the design until the bugs stop coming.
This is never ideal, and a hard-to-reproduce crashing bug will come in at the
most inconvenient time. How do we make sure the system is \textit{correct by
design}? To solve this problem, we must rely on tools to do the reasoning
\textit{for us}.

\section{What is TLA+?}

TLA+ is a \textit{system specification language}, allowing the designer to
describe a system as a set of states with transitions. Designers can describe
safety properties that must hold in every state, and liveness properties that
must hold for a sequence of states.\\

Once the system is modeled as a set of states, the states can be
\textit{exhaustively} explored (via breath-first-search) to ensure properties
are upheld throughout the entire state space (either per state or a sequence of
states).\\

When designing a system, designer tend to focus on the obvious happy and unhappy
path, and create corresponding tests to provide coverage. Unless the system is
trivial, designers often miss subtle cases that can cause violations. These tend
to be caught during verification, at which points more tests are added to
prevent future escapes.\\

The \textit{superpower} of TLA+ is it will exhaustively cover \textit{all}
states and transitions before the design even enters verification. Consider the
following example system state graph:

\begin{center}
\begin{tikzpicture}[shorten >=1pt, node distance=2cm, on grid, auto]
    % Define the states
    \node[state] (q1) {s1};
    \node[state] (q2) [right=of q1] {s2};
    \node[state] (q3) [right=of q2] {s3};
    \node[state] (q4) [right=of q3] {s4};
    \node[state] (q5) [right=of q4] {s5};

    \node[state] (q10) [above=of q2] {s10};
    \node[state] (q11) [right=of q10] {s11};
    \node[state, dashed] (q12) [right=of q11] {s12};
    \node[state, dashed] (q13) [right=of q12] {s13};

    \node[state, dashed] (q20) [below=of q2] {s20};
    \node[state] (q21) [right=of q20] {s21};
    \node[state, dashed] (q22) [right=of q21] {s22};
    \node[state, dashed] (q23) [right=of q22] {s23};

    % Connect the states with arrows
    \path[->]
        (q1) edge node {} (q2)
        (q2) edge node {} (q3)
        (q3) edge node {} (q4)
        (q4) edge node {} (q5)
        (q1) edge node {} (q10)
        (q10) edge node {} (q11)
        (q2) edge node {} (q21)
        (q21) edge node {} (q4)
        (q11) edge node {} (q4)
        ;

    \path[->, dash pattern=on 2pt off 3pt]
        (q11) edge node {} (q12)
        (q12) edge node {} (q13)
        (q1) edge node {} (q20)
        (q20) edge node {} (q21)
        (q21) edge node {} (q22)
        (q22) edge node {} (q23)
        % (q1) edge node {} (q22)
        (q4) edge node {} (q23)
        (q4) edge node {} (q13)
        ;
\end{tikzpicture}
\end{center}

The solid states and edges are the ones designer has accounted for, and the rest
are state and transition the designer missed. A large part of the missed states
and transitions may be inconsequential, but a subset of these may be
catastrophic failure states that require explicit handling or re-architecting.\\

The best way to look at TLA+ is it is a \textit{thinking tool}. Once the
designer specifies the system with allowed transitions and safety properties,
the model checker will identify cases where violations can occur and the set of
transitions that led up to the violations.\\

Amazon AWS has used TLA+ to identify a bug that occurs after 35 transitions
\cite{aws_bug}. As humans, we are reasonably good at describing the
\textit{top-level} design, but need a tool like TLA+ to identify all the
\textit{dark corners} of the design.

\section{About This Book}
% This book was initially a set of notes I took while learning TLA+. I decided to
% formalize these notes into this short book, which I hope the readers will find
% helpful in their TLA+ journey.\newline

The book intends to teach the readers how to write TLA+ specification for their
design to provide confidence in its \textit{correctness}. This book is targeted
to software designers, hardware designers, system architects, and in general
anyone interested in designing correct systems.\\

To get the most out of the book, the reader is expected to have general computer
science knowledge. The reader doesn't need to be an expert in a particular
language to understand this book; TLA+ is effectively its own language. This
book is example-driven and will go through designs such as lockless queues,
simple task schedulers, consensus algorithms, etc. Readers will likely enjoy a
deeper insight if there is familiarity with these topics.

\section{How to Use This Book}

This book was designed to be used as a reference, providing examples and
references using TLA+.\\

This book is split into multiple parts, covering TLA+ native notation, PlusCal
(C-like syntax that transpiles down to TLA+ native notation), and rapid
prototyping with TLA+. All examples will follow a similar layout, covering the
problem statement, design, specification, safety and liveness properties.\\

All examples in this book will be presented using TLA+ \textit{mathematical
notation}. Converting between Mathematical and ASCII notation is assumed trivial
due to the one-to-one mapping. Readers are encouraged to consult Table 8 in
\cite{ss} as needed.\\

The last part of the book provides language references and focused discussion on
various topics such as liveness, fairness, patterns, etc.