# scaled_dot_product_attention¶

api_attr: Static Graph

This interface implements Multi-Head Attention using scaled dot-product attention. An attention mechanism can be seen as mapping a query and a set of key-value pairs to an output. Multi-Head Attention computes attention for multiple heads in parallel, with the inputs first transformed by linear projections. The formula is as follows:

$$
\begin{aligned}
MultiHead(Q, K, V) &= Concat(head_1, ..., head_h) \\
where \ head_i &= Attention(QW_i^Q, KW_i^K, VW_i^V) \\
Attention(Q, K, V) &= softmax\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right) V
\end{aligned}
$$

For more details, please refer to *Attention Is All You Need*.

Note that the implementation supports batched inputs: every matrix multiplication in $$Attention(Q, K, V)$$ is a batched matrix multiplication. Refer to api_fluid_layers_matmul .
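To make the formula concrete, below is a minimal pure-Python sketch of single-head scaled dot-product attention on one (unbatched) example. The function and helper names are illustrative only and are not part of the Paddle API; a real implementation would use batched matrix multiplication as described above.

```python
import math

def _matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def _softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head.

    q: (L_q x d_k), k: (L_k x d_k), v: (L_k x d_v); returns (L_q x d_v).
    """
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # transpose K
    scores = _matmul(q, k_t)                      # (L_q x L_k)
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [_softmax(row) for row in scaled]   # rows sum to 1
    return _matmul(weights, v)                    # (L_q x d_v)
```

When all keys are identical, every attention weight in a row is equal, so the output is the element-wise average of the values; this is a quick sanity check on the softmax step.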

Parameters
• queries (Variable) – A 3-D Tensor with shape $$[N, L_q, d_k \times h]$$, where $$N$$ stands for batch size, $$L_q$$ for the sequence length of query, $$d_k \times h$$ for the feature size of query, and $$h$$ for the number of heads. The data type should be float32 or float64.

• keys (Variable) – A 3-D Tensor with shape $$[N, L_k, d_k \times h]$$, where $$N$$ stands for batch size, $$L_k$$ for the sequence length of key, $$d_k \times h$$ for the feature size of key, and $$h$$ for the number of heads. The data type should be the same as queries.

• values (Variable) – A 3-D Tensor with shape $$[N, L_k, d_v \times h]$$, where $$N$$ stands for batch size, $$L_k$$ for the sequence length of key, $$d_v \times h$$ for the feature size of value, and $$h$$ for the number of heads. The data type should be the same as queries.

• num_heads (int, optional) – The number of attention heads. If it is 1, no linear projection is performed on the inputs. Default: 1.

• dropout_rate (float, optional) – The dropout probability applied to the attention weights. Default: 0.0, which means no dropout.

Returns

A 3-D Tensor with shape $$[N, L_q, d_v \times h]$$, where $$N$$ stands for batch size, $$L_q$$ for the sequence length of query, and $$d_v \times h$$ for the feature size of value. It has the same data type as the inputs and represents the output of Multi-Head Attention.

Return type

Variable

Raises
• TypeError – The dtype of inputs keys, values and queries should be the same.

• ValueError – Inputs queries, keys and values should all be 3-D tensors.

• ValueError – The hidden size of queries and keys should be the same.

• ValueError – The max sequence length in value batch and in key batch should be the same.

• ValueError – The hidden size of keys must be divisible by the number of attention heads.

• ValueError – The hidden size of values must be divisible by the number of attention heads.

Examples