scaled_dot_product_attention¶
- paddle.fluid.nets. scaled_dot_product_attention ( queries, keys, values, num_heads=1, dropout_rate=0.0 ) [source]
- api_attr
-
Static Graph
This interface implements Multi-Head Attention using scaled dot-product attention. The attention mechanism can be seen as mapping a query and a set of key-value pairs to an output. Multi-Head Attention performs attention over several heads in parallel, and the inputs of each attention head are transformed by a linear projection. The formula is as follows:
\[\begin{aligned}MultiHead(Q, K, V) & = Concat(head_1, ..., head_h)\\where \ head_i & = Attention(QW_i^Q , KW_i^K , VW_i^V )\\Attention(Q, K, V) & = softmax\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right) V\end{aligned}\]
For more details, please refer to Attention Is All You Need .
Note that the implementation operates on batches: every matrix multiplication in \(Attention(Q, K, V)\) is a batched matrix multiplication. Refer to api_fluid_layers_matmul .
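As an illustration only, here is a minimal NumPy sketch of the single-head \(Attention(Q, K, V)\) computation above, without linear projections or dropout. The function name and shapes are hypothetical; this is not the library implementation.
import numpy as np

def attention_sketch(q, k, v):
    # q: [N, L_q, d_k], k: [N, L_k, d_k], v: [N, L_k, d_v]
    d_k = q.shape[-1]
    # batched matmul of q with transposed k gives scores of shape [N, L_q, L_k]
    scores = np.matmul(q, k.transpose(0, 2, 1)) / np.sqrt(d_k)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # weighted sum of values, shape [N, L_q, d_v]
    return np.matmul(weights, v)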
- Parameters
-
queries (Variable) – A 3-D Tensor with shape \([N, L_q, d_k \times h]\) , where \(N\) stands for batch size, \(L_q\) for the sequence length of query, \(d_k \times h\) for the feature size of query, and \(h\) for the head number. The data type should be float32 or float64.
keys (Variable) – A 3-D Tensor with shape \([N, L_k, d_k \times h]\) , where \(N\) stands for batch size, \(L_k\) for the sequence length of key, \(d_k \times h\) for the feature size of key, and \(h\) for the head number. The data type should be the same as queries.
values (Variable) – A 3-D Tensor with shape \([N, L_k, d_v \times h]\) , where \(N\) stands for batch size, \(L_k\) for the sequence length of key, \(d_v \times h\) for the feature size of value, and \(h\) for the head number. The data type should be the same as queries.
num_heads (int, optional) – The number of attention heads. If it is 1, no linear projection is performed on the inputs. Default: 1.
dropout_rate (float, optional) – The rate at which to drop attention weights. Default: 0.0, which means no dropout.
- Returns
-
A 3-D Tensor with shape \([N, L_q, d_v \times h]\) , where \(N\) stands for batch size, \(L_q\) for the sequence length of query, and \(d_v \times h\) for the feature size of value. It has the same data type as the inputs, representing the output of Multi-Head Attention.
- Return type
-
Variable
- Raises
-
TypeError – The dtype of inputs keys, values and queries should be the same.
ValueError – Inputs queries, keys and values should all be 3-D tensors.
ValueError – The hidden size of queries and keys should be the same.
ValueError – The max sequence length in value batch and in key batch should be the same.
ValueError – The hidden size of keys must be divisible by the number of attention heads.
ValueError – The hidden size of values must be divisible by the number of attention heads.
Examples
import paddle.fluid as fluid
import paddle

paddle.enable_static()

queries = paddle.static.data(name="queries", shape=[3, 5, 9], dtype="float32")
keys = paddle.static.data(name="keys", shape=[3, 6, 9], dtype="float32")
values = paddle.static.data(name="values", shape=[3, 6, 10], dtype="float32")
contexts = fluid.nets.scaled_dot_product_attention(queries, keys, values)
contexts.shape  # [3, 5, 10]
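When num_heads is larger than 1, the hidden sizes of keys and values must both be divisible by the number of heads (see Raises above). The following is a sketch of such a call, with hypothetical shapes chosen so that the hidden sizes 9 and 12 are divisible by num_heads=3; the resulting shape follows from the documented return shape \([N, L_q, d_v \times h]\).
import paddle.fluid as fluid
import paddle

paddle.enable_static()

# hidden sizes 9 (queries/keys) and 12 (values) are divisible by num_heads=3
queries = paddle.static.data(name="mh_queries", shape=[3, 5, 9], dtype="float32")
keys = paddle.static.data(name="mh_keys", shape=[3, 6, 9], dtype="float32")
values = paddle.static.data(name="mh_values", shape=[3, 6, 12], dtype="float32")
contexts = fluid.nets.scaled_dot_product_attention(queries, keys, values, num_heads=3)
contexts.shape  # [3, 5, 12], i.e. [N, L_q, d_v * h]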