Trevor McGuire
Jul 26, 2023

--

Great catch. The sentence you highlighted was poorly worded, and I've just edited it to fix that. You are correct that it was misleading. Here's what actually happens in multi-head attention:

Let D represent the hidden dimensionality of the model and N represent the number of attention heads. The input sequences (Q, K, and V) get projected N times, where each projection has a dimensionality of D / N. After passing through the attention mechanism, the "pieces" are concatenated back together, with the resulting dimensionality once again being D.
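For illustration, here's a rough sketch in PyTorch of that shape bookkeeping. It's a minimal toy version, not a full implementation: the projection weights are random placeholders rather than learned parameters, and a real layer would also apply a final output projection after the concatenation.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(Q, K, V, D, N):
    """Sketch: N separate D -> D/N projections, attention per head, concat back to D."""
    d_head = D // N
    outputs = []
    for _ in range(N):
        # Each head gets its own D -> D/N projections (random placeholders for brevity)
        Wq = torch.randn(D, d_head)
        Wk = torch.randn(D, d_head)
        Wv = torch.randn(D, d_head)
        q, k, v = Q @ Wq, K @ Wk, V @ Wv              # each: (seq_len, D/N)
        scores = q @ k.transpose(-2, -1) / d_head ** 0.5
        outputs.append(F.softmax(scores, dim=-1) @ v)  # (seq_len, D/N)
    # Concatenate the N heads along the feature dimension: back to (seq_len, D)
    return torch.cat(outputs, dim=-1)

# Example: D = 512, N = 8 -> each head works in 512 / 8 = 64 dimensions
x = torch.randn(10, 512)
out = multi_head_attention(x, x, x, D=512, N=8)
print(out.shape)  # torch.Size([10, 512])
```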

Hope that clarifies things. Thanks for commenting!
