# scaled_dot_product_attention¶

api_attr: declarative programming (static graph)

paddle.fluid.nets.scaled_dot_product_attention(queries, keys, values, num_heads=1, dropout_rate=0.0)[source]

This interface implements Multi-Head Attention using scaled dot-product attention. The attention mechanism can be seen as mapping a query and a set of key-value pairs to an output. Multi-Head Attention performs attention over multiple heads in parallel, with the attention inputs first transformed by linear projections. The formula is as follows:

\begin{align}\begin{aligned}MultiHead(Q, K, V) & = Concat(head_1, ..., head_h)\\\text{where } head_i & = Attention(QW_i^Q, KW_i^K, VW_i^V)\\Attention(Q, K, V) & = softmax\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V\end{aligned}\end{align}

For more details, please refer to Attention Is All You Need.

Note that the implementation works on batched inputs, so all matrix multiplications in $$Attention(Q, K, V)$$ are batched matrix multiplications. Refer to matmul.
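To make the formula above concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (an illustration only, not the Paddle implementation; it omits the linear projections, multi-head splitting, and dropout):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: [L_q, d_k], k: [L_k, d_k], v: [L_k, d_v]
    d_k = q.shape[-1]
    # scaled dot-product scores: [L_q, L_k]
    scores = q @ k.T / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of values: [L_q, d_v]
    return weights @ v

rng = np.random.default_rng(0)
q = rng.random((5, 9))
k = rng.random((6, 9))
v = rng.random((6, 10))
out = scaled_dot_product_attention(q, k, v)
out.shape  # (5, 10)
```

Because each row of the attention weights sums to 1, attending over a constant value tensor returns that constant unchanged, which is a quick sanity check for the softmax step.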

Parameters
• queries (Variable) – A 3-D Tensor with shape $$[N, L_q, d_k \times h]$$ , where $$N$$ stands for batch size, $$L_q$$ for the sequence length of query, $$d_k \times h$$ for the feature size of query, $$h$$ for head number. The data type should be float32 or float64.

• keys (Variable) – A 3-D Tensor with shape $$[N, L_k, d_k \times h]$$ , where $$N$$ stands for batch size, $$L_k$$ for the sequence length of key, $$d_k \times h$$ for the feature size of key, $$h$$ for head number. The data type should be the same as queries .

• values (Variable) – A 3-D Tensor with shape $$[N, L_k, d_v \times h]$$ , where $$N$$ stands for batch size, $$L_k$$ for the sequence length of key, $$d_v \times h$$ for the feature size of value, $$h$$ for head number. The data type should be the same as queries .

• num_heads (int, optional) – The number of attention heads. If the number is 1, no linear projection is performed on the inputs. Default: 1.

• dropout_rate (float, optional) – The dropout rate applied to the attention weights. Default: 0.0, which means no dropout.
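The parameter shapes above all share the pattern $$[N, L, d \times h]$$: the last dimension packs all heads together, which is why it must be divisible by num_heads. A small NumPy sketch (illustrative, with made-up sizes) of how such a packed tensor splits into per-head slices:

```python
import numpy as np

N, L, d_k, h = 3, 5, 4, 2             # batch size, sequence length, per-head size, heads
x = np.random.rand(N, L, d_k * h)     # last dimension must be divisible by h

# split the last dimension into h heads: [N, L, d_k*h] -> [N, h, L, d_k]
heads = x.reshape(N, L, h, d_k).transpose(0, 2, 1, 3)
heads.shape  # (3, 2, 5, 4)
```

Each head then attends independently over its own $$d_k$$-sized slice, and the head outputs are concatenated back along the last dimension.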

Returns

A 3-D Tensor with shape $$[N, L_q, d_v \times h]$$, where $$N$$ stands for batch size, $$L_q$$ for the sequence length of query, $$d_v \times h$$ for the feature size of value. It has the same data type as the inputs, representing the output of Multi-Head Attention.

Return type

Variable

Raises
• TypeError – The dtype of inputs keys, values and queries should be the same.

• ValueError – Inputs queries, keys and values should all be 3-D tensors.

• ValueError – The hidden size of queries and keys should be the same.

• ValueError – The max sequence length in value batch and in key batch should be the same.

• ValueError – The hidden size of keys must be divisible by the number of attention heads.

• ValueError – The hidden size of values must be divisible by the number of attention heads.

Examples

import paddle.fluid as fluid

# queries: [N, L_q, d_k * h] = [3, 5, 9]
queries = fluid.data(name="queries", shape=[3, 5, 9], dtype="float32")
# keys: [N, L_k, d_k * h] = [3, 6, 9]
keys = fluid.data(name="keys", shape=[3, 6, 9], dtype="float32")
# values: [N, L_k, d_v * h] = [3, 6, 10]
values = fluid.data(name="values", shape=[3, 6, 10], dtype="float32")
contexts = fluid.nets.scaled_dot_product_attention(queries, keys, values)
contexts.shape  # [3, 5, 10], i.e. [N, L_q, d_v * h]