Great catch. The sentence you highlighted was poorly worded, and you're right that it was misleading. I've just edited the post to fix it. Here's what actually happens in multi-head attention:
Let D be the model's hidden dimensionality and N the number of attention heads. The inputs (Q, K, and V) are each projected N times, with each projection mapping down to D / N dimensions. After the attention mechanism runs on each head, the N "pieces" are concatenated back together, so the resulting dimensionality is once again D.
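To make the shape bookkeeping concrete, here's a minimal PyTorch sketch. The values D = 512, N = 8, and all tensor names are just illustrative, not anything from the post itself:

```python
# Minimal sketch of the shapes in multi-head attention (illustrative values).
import torch
import torch.nn.functional as F

D = 512          # model hidden dimensionality
N = 8            # number of attention heads
d_head = D // N  # each head's projection size: D / N = 64

batch, seq_len = 2, 10
x = torch.randn(batch, seq_len, D)  # input sequence

# One D x D projection per input (equivalent to N separate D x (D/N) projections).
W_q = torch.randn(D, D)
W_k = torch.randn(D, D)
W_v = torch.randn(D, D)

def split_heads(t):
    # Split the last dimension into N heads of size D / N.
    return t.view(batch, seq_len, N, d_head).transpose(1, 2)  # (batch, N, seq, D/N)

Q = split_heads(x @ W_q)
K = split_heads(x @ W_k)
V = split_heads(x @ W_v)

# Scaled dot-product attention runs independently in each head.
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5   # (batch, N, seq, seq)
out = F.softmax(scores, dim=-1) @ V                # (batch, N, seq, D/N)

# Concatenate the N head outputs: back to dimensionality D.
out = out.transpose(1, 2).reshape(batch, seq_len, D)
print(out.shape)  # torch.Size([2, 10, 512])
```

(In the standard Transformer the concatenated output also passes through one final linear projection, but that doesn't change the dimensionality.)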
Hope that clarifies things. Thanks for commenting!