Location
- File: chapters/nlp-book-chapter8.pdf
- Page: 51
- Section: 8.3.4 Sharing across Heads and Layers
Problem Description
I would like to report a potential error in the description of Grouped-Query Attention (GQA).
In the text regarding the parameter $n_g$ (number of groups), the book states:
"By contrast, when $n_g = 1$, it becomes the GQA model."
Reasoning
If $n_g$ represents the number of groups:
- $n_g = 1$ implies that all query heads share a single key-value head. This is exactly the definition of MQA (Multi-Query Attention).
- As proposed in the original GQA paper (Ainslie et al.), GQA is an interpolation between MHA and MQA (writing $H$ for the number of query heads):
  - Limit 1 ($n_g = 1$): MQA
  - Limit 2 ($n_g = H$): MHA
  - Intermediate ($1 < n_g < H$): GQA
Therefore, stating that the model with $n_g = 1$ becomes the "GQA model" is confusing: GQA usually refers to the general case or the intermediate regime, whereas the specific limit $n_g = 1$ is widely recognized as MQA.
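To make the two limits concrete, here is a minimal sketch of how query heads map to shared key-value heads for a given $n_g$. This is my own illustration, not code from the book or the paper: the function names are hypothetical, and I assume that $H$ is divisible by $n_g$ and that consecutive query heads are grouped together.

```python
def kv_head_for_query_head(head: int, num_heads: int, n_g: int) -> int:
    """Index of the key-value head shared by query head `head`.

    Assumption: query heads are split into n_g contiguous, equal-sized groups.
    """
    if not 1 <= n_g <= num_heads or num_heads % n_g != 0:
        raise ValueError("n_g must divide num_heads and lie in [1, num_heads]")
    heads_per_group = num_heads // n_g
    return head // heads_per_group


def attention_variant(num_heads: int, n_g: int) -> str:
    """Name of the variant that a given number of groups corresponds to."""
    if n_g == num_heads:
        return "MHA"  # every query head has its own key-value head
    if n_g == 1:
        return "MQA"  # all query heads share one key-value head
    return "GQA"      # intermediate case: 1 < n_g < num_heads


H = 8  # hypothetical number of query heads
for n_g in (1, 2, 8):
    mapping = [kv_head_for_query_head(h, H, n_g) for h in range(H)]
    print(n_g, attention_variant(H, n_g), mapping)
```

With $n_g = 1$ every query head maps to key-value head 0, which is the MQA configuration, while $n_g = H$ gives each query head its own key-value head, which is MHA.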
Suggested Fix
I suggest changing the sentence to:
"By contrast, when $n_g = 1$, it becomes the MQA model."
Thank you for the great resources.