Trevor McGuire
Jul 26, 2023

--

Great catch. The sentence you highlighted was poorly worded, and I've just edited it to fix that. You are correct that it was misleading. Here's what actually happens in multi-head attention:

Let D represent the hidden dimensionality of the model and N represent the number of attention heads. The input sequences (Q, K, and V) get projected N times, where each projection has a dimensionality of D / N. After passing through the attention mechanism, the "pieces" are concatenated back together, with the resulting dimensionality once again being D.
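For illustration, here's a rough sketch in PyTorch of that shape bookkeeping. It's a minimal toy version, not a full implementation: the projection weights are random placeholders rather than learned parameters, and a real layer would also apply a final output projection after the concatenation.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, D, N):
    """Sketch: N separate D -> D/N projections, attention per head, concat back to D."""
    d_head = D // N
    outputs = []
    for _ in range(N):
        # Each head gets its own D -> D/N projections (random placeholders for brevity)
        Wq = torch.randn(D, d_head)
        Wk = torch.randn(D, d_head)
        Wv = torch.randn(D, d_head)
        q, k, v = Q @ Wq, K @ Wk, V @ Wv              # each: (seq_len, D/N)
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5
        outputs.append(F.softmax(scores, dim=-1) @ v)  # (seq_len, D/N)
    # Concatenate the N heads along the feature dimension: back to (seq_len, D)
    return torch.cat(outputs, dim=-1)

# Example: D = 512, N = 8 -> each head works in 512 / 8 = 64 dimensions
x = torch.randn(10, 512)
out = multi_head_attention(x, x, x, D=512, N=8)
print(out.shape)  # torch.Size([10, 512])
```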

Hope that clarifies things. Thanks for commenting!
