It is my understanding that each encoder block takes the output from the previous encoder, and that the output is the attended representation (Z) of the sequence (aka sentence). My question is, how does the last encoder block produce K and V from Z (to be used in the encoder-decoder attention sublayer of the decoder)?

Are we simply taking W_K and W_V from the last encoder layer?

http://jalammar.github.io/illustrated-transformer/
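For reference, here is a minimal single-head sketch of the mechanism being asked about, assuming the standard Transformer formulation: K and V are obtained by projecting the encoder output Z, while Q comes from the decoder's own hidden states. The names (`W_q`, `W_k`, `W_v`, `d_model`) and the single-head simplification are illustrative, not taken from the linked article.

```python
import torch
import torch.nn as nn

d_model = 512  # illustrative hidden size

# Projection matrices for the encoder-decoder (cross-) attention sublayer
W_q = nn.Linear(d_model, d_model)  # projects the decoder's hidden states
W_k = nn.Linear(d_model, d_model)  # projects the encoder output Z
W_v = nn.Linear(d_model, d_model)  # projects the encoder output Z

def cross_attention(decoder_states, Z):
    """decoder_states: (tgt_len, d_model); Z: (src_len, d_model)."""
    Q = W_q(decoder_states)               # queries from the decoder
    K = W_k(Z)                            # keys from the encoder output
    V = W_v(Z)                            # values from the encoder output
    scores = Q @ K.T / d_model ** 0.5     # scaled dot-product attention
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                    # attended representation for the decoder
```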

1 Answer