| 1 | +--- |
| 2 | +title: 08/09 - Hashing |
| 3 | +date: 2026-02-16/18 |
| 4 | +--- |
| 5 | + |
| 6 | +## Roadmap |
| 7 | + |
| 8 | +These lectures introduce **hash tables**, the standard data structure for implementing dynamic sets with fast average-case operations. We motivate the design through direct-address tables, introduce hash functions and collision resolution via chaining, then cover **open addressing** and **universal hashing**, which gives provable expected-time guarantees on every input.
| 9 | + |
| 10 | +1. **The Dictionary Problem**: Operations and motivation. |
| 11 | +2. **Direct-Address Tables**: A simple starting point. |
| 12 | +3. **Hash Tables**: Hash functions and collisions. |
| 13 | +4. **Chaining**: Collision resolution via linked lists. |
| 14 | +5. **Analysis of Chaining**: Expected cost under simple uniform hashing. |
| 15 | +6. **Open Addressing**: Linear probing, quadratic probing, double hashing. |
| 16 | +7. **Analysis of Open Addressing**: Expected number of probes. |
| 17 | +8. **Universal Hashing**: Input-independent expected-time guarantees via randomized hash functions.
| 18 | +9. **A Universal Hash Family**: Construction and proof. |
| 19 | + |
| 20 | +--- |
| 21 | + |
| 22 | +## 1. The Dictionary Problem |
| 23 | + |
| 24 | +We want a data structure supporting: |
| 25 | + |
| 26 | +* `INSERT(S, x)`: Insert element $x$ into set $S$. |
| 27 | +* `DELETE(S, x)`: Remove element $x$ from set $S$. |
| 28 | +* `SEARCH(S, k)`: Find element with key $k$ in $S$. |
| 29 | + |
| 30 | +**Goal**: All three operations in $O(1)$ expected time. |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## 2. Direct-Address Tables |
| 35 | + |
| 36 | +**Idea**: If keys are drawn from a universe $U = \{0, 1, \dots, m-1\}$, allocate an array $T[0 \dots m-1]$. Slot $k$ holds a pointer to the element with key $k$ (or `NIL`). |
| 37 | + |
| 38 | +**Performance**: All operations take $\Theta(1)$ worst-case time. |
| 39 | + |
| 40 | +**Problem**: If the universe $U$ is large (e.g., all 64-bit integers, $|U| = 2^{64}$), the table is impractically large. In practice, the number of keys actually stored satisfies $n \ll |U|$, so most slots would sit empty.
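
As a minimal sketch of the idea, assuming small non-negative integer keys, a direct-address table is just an array indexed by key. The class name `DirectAddressTable` and its method names are illustrative, not from the lecture; `None` stands in for `NIL`.

```python
class DirectAddressTable:
    """Direct-address table for integer keys in range(universe_size)."""

    def __init__(self, universe_size):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        # Theta(1): the key itself is the array index.
        self.slots[key] = value

    def delete(self, key):
        self.slots[key] = None

    def search(self, key):
        return self.slots[key]


table = DirectAddressTable(universe_size=8)
table.insert(3, "three")
print(table.search(3))  # three
print(table.search(5))  # None
```

The memory cost is proportional to `universe_size` regardless of how many keys are stored, which is exactly the problem noted above.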
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +## 3. Hash Tables |
| 45 | + |
| 46 | +**Idea**: Use a **hash function** $h$ to map keys to table slots, and store the element with key $k$ in slot $h(k)$:
| 47 | + |
| 48 | +$$h: U \to \{0, 1, \dots, m-1\}$$ |
| 49 | + |
| 50 | +The table size $m$ is much smaller than $|U|$. |
| 51 | + |
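| 52 | +**Collision**: A **collision** occurs when two keys $k_1 \neq k_2$ satisfy $h(k_1) = h(k_2)$. Because $|U| > m$, some pair of keys in $U$ must collide (pigeonhole principle), and a collision among the stored keys is guaranteed once $n > m$.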
| 53 | + |
| 54 | +### Hash Function Design |
| 55 | + |
| 56 | +A good hash function approximately satisfies the **simple uniform hashing assumption**: each key is equally likely to hash to any of the $m$ slots, independently of where any other key hashes.
| 57 | + |
| 58 | +**Division method**: $h(k) = k \bmod m$. Choose $m$ to be a prime not close to a power of 2. |
| 59 | + |
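| 60 | +**Multiplication method**: $h(k) = \lfloor m \cdot (kA \bmod 1) \rfloor$ for some constant $0 < A < 1$, where $kA \bmod 1$ denotes the fractional part of $kA$. Knuth suggests $A \approx (\sqrt{5} - 1)/2 \approx 0.618$.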
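
The two methods can be sketched in a few lines of Python. The function names are illustrative, the default $A = (\sqrt{5}-1)/2$ is Knuth's suggested constant rather than something the formula requires, and the multiplication method here uses floating-point arithmetic for brevity (a production version would typically use fixed-width integer arithmetic).

```python
import math


def hash_division(key, m):
    """Division method: h(k) = k mod m."""
    return key % m


def hash_multiplication(key, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    frac = (key * A) % 1.0  # fractional part of k*A
    return math.floor(m * frac)


print(hash_division(100, 13))
print(hash_multiplication(123456, 2**14))
```

Unlike the division method, the multiplication method is not sensitive to the choice of $m$, so a power of 2 is a convenient table size.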
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## 4. Chaining |
| 65 | + |
| 66 | +**Chaining** resolves collisions by placing all elements that hash to the same slot into a **linked list**. |
| 67 | + |
| 68 | +* Slot $j$ contains a pointer to the head of the list of all elements with $h(k) = j$. |
| 69 | + |
| 70 | +### Pseudocode |
| 71 | + |
| 72 | +```text |
| 73 | +CHAINED-HASH-INSERT(T, x)
| 74 | +    insert x at the head of list T[h(x.key)]
| 75 | +
| 76 | +CHAINED-HASH-SEARCH(T, k)
| 77 | +    search for an element with key k in list T[h(k)]
| 78 | +
| 79 | +CHAINED-HASH-DELETE(T, x)
| 80 | +    delete x from the list T[h(x.key)]
| 81 | +``` |
| 82 | + |
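| 83 | +`INSERT` takes $O(1)$ worst-case time, assuming $x$ is not already in the table. `DELETE` takes $O(1)$ if the lists are doubly linked, since it is given a pointer to $x$ rather than a key. `SEARCH` takes time proportional to the length of the list $T[h(k)]$.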
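
A minimal Python sketch of chaining follows. It makes two simplifying assumptions that differ from the pseudocode: each chain is a Python list rather than a doubly linked list (so deletion by key scans the chain instead of taking $O(1)$), and the slot is computed from Python's built-in `hash`. The class name `ChainedHashTable` is illustrative.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _slot(self, key):
        # Assumption: Python's built-in hash, reduced mod m (division method).
        return hash(key) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:  # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        # Cost is proportional to the length of the chain in this slot.
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        slot = self._slot(key)
        self.buckets[slot] = [(k, v) for k, v in self.buckets[slot] if k != key]


t = ChainedHashTable(m=8)
t.insert(1, "a")
t.insert(9, "b")  # 9 mod 8 == 1, so this collides with key 1
print(t.search(1), t.search(9))
```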
| 84 | + |
| 85 | +--- |
| 86 | + |
| 87 | +## 5. Analysis of Chaining |
| 88 | + |
| 89 | +Define the **load factor** $\alpha = n/m$ (average number of elements per slot). |
| 90 | + |
| 91 | +**Theorem**: Under simple uniform hashing, an unsuccessful search takes expected time $\Theta(1 + \alpha)$. |
| 92 | + |
| 93 | +**Proof sketch**: An unsuccessful search examines all elements in slot $h(k)$. The expected list length is $\alpha = n/m$. Adding $O(1)$ for computing $h(k)$ gives $\Theta(1 + \alpha)$. |
| 94 | + |
| 95 | +**Theorem**: Under simple uniform hashing, a successful search takes expected time $\Theta(1 + \alpha)$. |
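
A short derivation of the successful-search bound, sketched under two extra assumptions: new elements are inserted at the head of each list, and the element searched for is equally likely to be any of the $n$ stored elements $x_1, \dots, x_n$ (indexed in insertion order). Searching for $x_i$ examines $x_i$ itself plus every later-inserted element that landed in the same slot, and each later element does so with probability $1/m$:

$$
\begin{aligned}
E[\text{elements examined}] &= \frac{1}{n} \sum_{i=1}^{n} \left( 1 + \sum_{j=i+1}^{n} \frac{1}{m} \right) = 1 + \frac{1}{nm} \sum_{i=1}^{n} (n - i) \\
&= 1 + \frac{n-1}{2m} = 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}.
\end{aligned}
$$

Adding $O(1)$ for computing the hash value gives $\Theta(1 + \alpha)$.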
| 96 | + |
| 97 | +**Interpretation**: If $n = O(m)$ (i.e., $\alpha = O(1)$), all operations take $O(1)$ expected time. |
| 98 | + |
| 99 | +--- |
| 100 | + |
| 101 | +## 6. Open Addressing |
| 102 | + |
| 103 | +In **open addressing**, all elements are stored in the hash table itself (no linked lists). On collision, we **probe** for an alternative slot. |
| 104 | + |
| 105 | +A **probe sequence** for key $k$ is a permutation $\langle h(k,0), h(k,1), \dots, h(k,m-1) \rangle$ of $\{0,1,\dots,m-1\}$. |
| 106 | + |
| 107 | +```text |
| 108 | +HASH-INSERT(T, k)
| 109 | +    i = 0
| 110 | +    repeat
| 111 | +        j = h(k, i)
| 112 | +        if T[j] == NIL
| 113 | +            T[j] = k
| 114 | +            return j
| 115 | +        else i = i + 1
| 116 | +    until i == m
| 117 | +    error "hash table overflow"
| 118 | +``` |
| 119 | + |
| 120 | +```text |
| 121 | +HASH-SEARCH(T, k)
| 122 | +    i = 0
| 123 | +    repeat
| 124 | +        j = h(k, i)
| 125 | +        if T[j] == k
| 126 | +            return j
| 127 | +        i = i + 1
| 128 | +    until T[j] == NIL or i == m
| 129 | +    return NIL
| 130 | +``` |
| 131 | + |
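| 132 | +**Deletion** is tricky: setting a slot to `NIL` would break `HASH-SEARCH`, which would stop at that slot before reaching keys that were inserted beyond it. Instead, mark the slot with a special `DELETED` sentinel: `HASH-SEARCH` keeps probing past `DELETED`, and `HASH-INSERT` must be modified to treat `DELETED` as an empty slot. With deletions, search time no longer depends on the load factor $\alpha$ alone.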
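
The insert, search, and delete procedures with a `DELETED` sentinel can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's implementation: it uses linear probing over Python's built-in `hash`, assumes keys are distinct and never `None`, and the class name `OpenAddressTable` is made up for the example.

```python
NIL = None
DELETED = object()  # tombstone: slot was occupied, then deleted


class OpenAddressTable:
    def __init__(self, m=8):
        self.m = m
        self.slots = [NIL] * m

    def _probe(self, key, i):
        # Assumption: linear probing on Python's built-in hash.
        return (hash(key) + i) % self.m

    def insert(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            # A DELETED slot can be reused, just like an empty one.
            if self.slots[j] is NIL or self.slots[j] is DELETED:
                self.slots[j] = key
                return j
        raise OverflowError("hash table overflow")

    def search(self, key):
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is NIL:  # truly empty: key cannot be further on
                return None
            if self.slots[j] is not DELETED and self.slots[j] == key:
                return j
            # Otherwise (other key or DELETED) keep probing.
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED


t = OpenAddressTable(m=8)
for key in (1, 9, 17):  # all hash to slot 1, so they occupy slots 1, 2, 3
    t.insert(key)
t.delete(9)
print(t.search(17))  # still found: the search probes past the tombstone
```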
| 133 | + |
| 134 | +### Probing Strategies |
| 135 | + |
| 136 | +**Linear Probing**: $h(k, i) = (h'(k) + i) \bmod m$. |
| 137 | +* Simple but causes **primary clustering**: long runs of occupied slots form and grow. |
| 138 | + |
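| 139 | +**Quadratic Probing**: $h(k, i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$ with $c_2 \neq 0$.
| 140 | +* Avoids primary clustering but causes the milder **secondary clustering**: two keys with the same $h'(k)$ have identical probe sequences. The constants $c_1$, $c_2$ and the table size $m$ must be chosen so that the sequence visits every slot.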
| 141 | + |
| 142 | +**Double Hashing**: $h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m$. |
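| 143 | +* Uses two auxiliary hash functions $h_1$ and $h_2$, so the step size $h_2(k)$ depends on the key.
| 144 | +* Gives $\Theta(m^2)$ distinct probe sequences (linear and quadratic probing give only $\Theta(m)$); approximates uniform hashing well.
| 145 | +* Requirement: $h_2(k)$ must be nonzero and coprime to $m$ for all $k$, so that the probe sequence visits every slot (e.g., choose $m$ prime with $1 \leq h_2(k) < m$, or $m$ a power of 2 with $h_2(k)$ odd).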
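
To make the three strategies concrete, the sketch below generates full probe sequences from given auxiliary hash values. The function names are illustrative. For quadratic probing it assumes $c_1 = c_2 = 1/2$ (offsets are the triangular numbers $i(i+1)/2$), a choice that visits every slot when $m$ is a power of 2.

```python
def linear_probe(h, m):
    """Probe sequence (h + i) mod m."""
    return [(h + i) % m for i in range(m)]


def quadratic_probe(h, m):
    """Probe sequence with c1 = c2 = 1/2: offsets are triangular numbers."""
    return [(h + i * (i + 1) // 2) % m for i in range(m)]


def double_hash_probe(h1, h2, m):
    """Probe sequence (h1 + i * h2) mod m."""
    return [(h1 + i * h2) % m for i in range(m)]


print(linear_probe(3, 8))          # [3, 4, 5, 6, 7, 0, 1, 2]
print(quadratic_probe(3, 8))       # [3, 4, 6, 1, 5, 2, 0, 7]
print(double_hash_probe(3, 2, 7))  # [3, 5, 0, 2, 4, 6, 1]
print(double_hash_probe(3, 2, 8))  # [3, 5, 7, 1, 3, 5, 7, 1]
```

The last call shows why coprimality matters: with $h_2(k) = 2$ and $m = 8$ the sequence cycles through only four slots, so an insert could report overflow while half the table is empty.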
| 146 | + |
| 147 | +--- |
| 148 | + |
| 149 | +## 7. Analysis of Open Addressing |
| 150 | + |
| 151 | +Assume **uniform hashing**: each key is equally likely to have any of the $m!$ permutations as its probe sequence. |
| 152 | + |
| 153 | +**Theorem**: Under uniform hashing with load factor $\alpha = n/m < 1$: |
| 154 | + |
| 155 | +* Expected number of probes in an **unsuccessful search**: $\leq \dfrac{1}{1 - \alpha}$. |
| 156 | +* Expected number of probes in a **successful search**: $\leq \dfrac{1}{\alpha} \ln \dfrac{1}{1-\alpha}$. |
| 157 | + |
| 158 | +**Implication**: For $\alpha$ bounded away from 1, operations take $O(1)$ expected time. As $\alpha \to 1$, performance degrades sharply. |
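
To see how sharply the bounds grow, the following sketch simply evaluates the two formulas from the theorem at a few load factors. The function names are illustrative.

```python
import math


def unsuccessful_bound(alpha):
    """Upper bound on expected probes in an unsuccessful search: 1 / (1 - alpha)."""
    return 1 / (1 - alpha)


def successful_bound(alpha):
    """Upper bound on expected probes in a successful search: (1/alpha) ln(1/(1-alpha))."""
    return (1 / alpha) * math.log(1 / (1 - alpha))


for alpha in (0.5, 0.9, 0.99):
    print(f"alpha={alpha}: unsuccessful <= {unsuccessful_bound(alpha):.2f}, "
          f"successful <= {successful_bound(alpha):.2f}")
```

At $\alpha = 0.5$ the bounds are 2 and about 1.39 probes; at $\alpha = 0.9$ they are 10 and about 2.56, which is why open-addressing tables are usually resized well before they fill up.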
| 159 | + |
| 160 | +--- |
| 161 | + |
| 162 | +## 8. Universal Hashing |
| 163 | + |
| 164 | +**Problem with fixed hash functions**: For any fixed (deterministic) $h$ and $|U| \geq nm$, some slot receives at least $n$ keys of $U$ (pigeonhole principle), so an adversary can choose $n$ keys that all hash to the same slot, giving $\Theta(n)$ worst-case time per operation.
| 165 | +
| 166 | +**Solution**: Choose the hash function **randomly** at runtime from a family $\mathcal{H}$, independently of the keys to be stored.
| 167 | + |
| 168 | +**Definition**: A family $\mathcal{H}$ of hash functions from $U$ to $\{0, \dots, m-1\}$ is **universal** if for any two distinct keys $k, \ell \in U$: |
| 169 | +$$\Pr_{h \in \mathcal{H}}[h(k) = h(\ell)] \leq \frac{1}{m}$$ |
| 170 | + |
| 171 | +**Theorem**: If $h$ is chosen uniformly at random from a universal family $\mathcal{H}$ and collisions are resolved by chaining, then for any key $k$ and any set of $n$ stored keys:
| 172 | +$$E[\text{number of stored keys } \ell \neq k \text{ with } h(\ell) = h(k)] \leq \frac{n}{m} = \alpha$$
| 173 | +
| 174 | +So all operations take $O(1 + \alpha) = O(1)$ expected time when $n = O(m)$, for every input. The expectation is over the random choice of $h$, not over the keys.
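
The theorem follows from linearity of expectation. As a sketch, let $S$ be the set of $n$ stored keys and, for each $\ell \in S$ with $\ell \neq k$, let $X_\ell$ be the indicator of the event $h(\ell) = h(k)$ (the symbols $S$ and $X_\ell$ are introduced here for the derivation):

$$
E\Big[\sum_{\ell \in S,\ \ell \neq k} X_\ell\Big] = \sum_{\ell \in S,\ \ell \neq k} \Pr_{h \in \mathcal{H}}[h(\ell) = h(k)] \leq \frac{|S \setminus \{k\}|}{m} \leq \frac{n}{m} = \alpha.
$$

If $k$ is itself one of the stored keys, the sum has only $n - 1$ terms, so the bound tightens to $(n-1)/m < \alpha$.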
| 175 | + |
| 176 | +--- |
| 177 | + |
| 178 | +## 9. A Universal Hash Family |
| 179 | + |
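| 180 | +**Construction**: Identify the keys with the integers $\{0, \dots, |U|-1\}$ and let $p$ be a prime larger than $|U|$, so that every key lies in $\mathbb{Z}_p = \{0, \dots, p-1\}$ and $p > m$. For $a \in \{1, \dots, p-1\}$ and $b \in \{0, \dots, p-1\}$, define: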
| 181 | +$$h_{a,b}(k) = ((ak + b) \bmod p) \bmod m$$ |
| 182 | + |
| 183 | +The family $\mathcal{H} = \{ h_{a,b} : a \in \{1,\dots,p-1\}, b \in \{0,\dots,p-1\} \}$ is universal. |
| 184 | + |
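| 185 | +**Proof sketch**: Fix distinct $k, \ell \in U$ and let $r = (ak + b) \bmod p$ and $s = (a\ell + b) \bmod p$. Since $r - s \equiv a(k - \ell) \pmod{p}$ with $a \not\equiv 0$ and $k \not\equiv \ell \pmod{p}$, we have $r \neq s$. Moreover, $(a, b) \mapsto (r, s)$ is a bijection onto the $p(p-1)$ pairs with $r \neq s$, so $(r, s)$ is uniformly distributed over those pairs. For a fixed $r$, the number of values $s \neq r$ with $s \equiv r \pmod{m}$ is at most $\lceil p/m \rceil - 1 \leq (p-1)/m$, so the collision probability is at most $\frac{(p-1)/m}{p-1} = \frac{1}{m}$.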
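
For small parameters the universality claim can be checked by brute force. The sketch below draws a random $h_{a,b}$ and, separately, enumerates every $(a, b)$ pair to compute the exact collision probability of a key pair. The helper names `make_hash` and `collision_probability` and the values $p = 17$, $m = 6$ are arbitrary choices for illustration.

```python
import random
from fractions import Fraction
from itertools import combinations


def make_hash(p, m):
    """Draw h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from the family."""
    a = random.randrange(1, p)  # a in {1, ..., p-1}
    b = random.randrange(0, p)  # b in {0, ..., p-1}

    def h(k):
        return ((a * k + b) % p) % m

    return h


def collision_probability(k, l, p, m):
    """Exact Pr[h(k) == h(l)] over all p*(p-1) choices of (a, b)."""
    collisions = sum(
        1
        for a in range(1, p)
        for b in range(p)
        if ((a * k + b) % p) % m == ((a * l + b) % p) % m
    )
    return Fraction(collisions, p * (p - 1))


p, m = 17, 6
# Check every pair of distinct keys in {0, ..., p-1}.
worst = max(collision_probability(k, l, p, m)
            for k, l in combinations(range(p), 2))
print(worst, "<=", Fraction(1, m), ":", worst <= Fraction(1, m))
```

Exact `Fraction` arithmetic avoids any floating-point doubt when comparing against $1/m$.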
| 186 | + |
| 187 | +--- |
| 188 | + |
| 189 | +## References |
| 190 | + |
| 191 | +* **CLRS**: Chapter 11 — Hash Tables (Sections 11.1–11.5). |