scaled_dot_product_attention

api_attr

declarative programming (static graph)

paddle.fluid.nets.scaled_dot_product_attention(queries, keys, values, num_heads=1, dropout_rate=0.0)

This interface implements Multi-Head Attention using scaled dot-product attention. An attention mechanism can be seen as mapping a query and a set of key-value pairs to an output. Multi-Head Attention runs several attention heads in parallel, and the inputs to each head are first transformed by a linear projection. The formula is as follows:

\[ \begin{align}\begin{aligned}MultiHead(Q, K, V ) & = Concat(head_1, ..., head_h)\\where \ head_i & = Attention(QW_i^Q , KW_i^K , VW_i^V )\\Attention(Q, K, V) & = softmax (\frac{QK^\mathrm{T}}{\sqrt{d_k}}) V\end{aligned}\end{align} \]

For more details, refer to the paper "Attention Is All You Need".

Note that the implementation operates on batches: every matrix multiplication in \(Attention(Q, K, V)\) is a batched matrix multiplication. Refer to matmul for details.
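The batched formula above can be illustrated with a minimal NumPy sketch. It is not the Paddle implementation: the helper names `softmax` and `attention` are hypothetical, dropout and the multi-head linear projections are left out, and the shapes are borrowed from the example at the end of this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q: [N, L_q, d_k], k: [N, L_k, d_k], v: [N, L_k, d_v]
    d_k = q.shape[-1]
    # Batched Q K^T: np.matmul broadcasts over the leading batch axis.
    scores = np.matmul(q, np.swapaxes(k, -1, -2)) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # [N, L_q, L_k]
    return np.matmul(weights, v)        # [N, L_q, d_v]

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 5, 9))
k = rng.standard_normal((3, 6, 9))
v = rng.standard_normal((3, 6, 10))
out = attention(q, k, v)
print(out.shape)  # (3, 5, 10)
```

Each of the 3 batch items is handled independently, so the result matches what a loop over the batch with ordinary 2-D matrix products would give.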

Parameters
  • queries (Variable) – A 3-D Tensor with shape \([N, L_q, d_k \times h]\), where \(N\) stands for the batch size, \(L_q\) for the sequence length of the queries, \(d_k \times h\) for the feature size of the queries, and \(h\) for the number of heads. The data type should be float32 or float64.

  • keys (Variable) – A 3-D Tensor with shape \([N, L_k, d_k \times h]\), where \(N\) stands for the batch size, \(L_k\) for the sequence length of the keys, \(d_k \times h\) for the feature size of the keys, and \(h\) for the number of heads. The data type should be the same as that of queries.

  • values (Variable) – A 3-D Tensor with shape \([N, L_k, d_v \times h]\), where \(N\) stands for the batch size, \(L_k\) for the sequence length of the keys (values must share it), \(d_v \times h\) for the feature size of the values, and \(h\) for the number of heads. The data type should be the same as that of queries.

  • num_heads (int, optional) – The number of attention heads. If it is 1, no linear projection is applied to the inputs. Default: 1.

  • dropout_rate (float, optional) – The dropout rate applied to the attention weights. Default: 0.0, which means no dropout.
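The shapes above pack all \(h\) heads into the last axis. One common way to separate them, shown here as a NumPy sketch, is a reshape followed by a transpose. The helpers `split_heads` and `combine_heads` are hypothetical names used for illustration, and the learned linear projections applied when num_heads is greater than 1 are not modelled.

```python
import numpy as np

def split_heads(x, num_heads):
    # [N, L, d * h] -> [N, h, L, d]
    n, length, hidden = x.shape
    if hidden % num_heads != 0:
        raise ValueError("hidden size must be divisible by num_heads")
    x = x.reshape(n, length, num_heads, hidden // num_heads)
    return x.transpose(0, 2, 1, 3)

def combine_heads(x):
    # [N, h, L, d] -> [N, L, d * h]; inverse of split_heads.
    n, num_heads, length, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(n, length, num_heads * depth)

x = np.arange(3 * 5 * 8, dtype=np.float32).reshape(3, 5, 8)
heads = split_heads(x, num_heads=2)
print(heads.shape)                              # (3, 2, 5, 4)
print(np.array_equal(combine_heads(heads), x))  # True
```

After the split, attention can be computed per head with batched matrix multiplication over the two leading axes, and `combine_heads` restores the \([N, L_q, d_v \times h]\) layout of the return value.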

Returns

A 3-D Tensor with shape \([N, L_q, d_v \times h]\), where \(N\) stands for the batch size, \(L_q\) for the sequence length of the queries, and \(d_v \times h\) for the feature size of the values. It has the same data type as the inputs and represents the output of Multi-Head Attention.

Return type

Variable

Raises
  • TypeError – The dtype of inputs keys, values and queries should be the same.

  • ValueError – Inputs queries, keys and values should all be 3-D tensors.

  • ValueError – The hidden size of queries and keys should be the same.

  • ValueError – The max sequence length in value batch and in key batch should be the same.

  • ValueError – The hidden size of keys must be divisible by the number of attention heads.

  • ValueError – The hidden size of values must be divisible by the number of attention heads.
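The conditions listed above can be restated as a standalone check in plain Python. This is only a sketch: `check_attention_inputs` is a hypothetical helper, it works on shape tuples and dtype strings rather than Paddle Variables, and the error messages are illustrative rather than the ones Paddle emits.

```python
def check_attention_inputs(q_shape, k_shape, v_shape,
                           q_dtype, k_dtype, v_dtype, num_heads=1):
    # Mirrors the documented requirements; not Paddle's own validation code.
    if not (q_dtype == k_dtype == v_dtype):
        raise TypeError("queries, keys and values must share one dtype")
    if not (len(q_shape) == len(k_shape) == len(v_shape) == 3):
        raise ValueError("queries, keys and values must all be 3-D")
    if q_shape[-1] != k_shape[-1]:
        raise ValueError("queries and keys must have the same hidden size")
    if k_shape[-2] != v_shape[-2]:
        raise ValueError("keys and values must have the same sequence length")
    if k_shape[-1] % num_heads != 0:
        raise ValueError("hidden size of keys must be divisible by num_heads")
    if v_shape[-1] % num_heads != 0:
        raise ValueError("hidden size of values must be divisible by num_heads")

# Shapes from the example on this page pass with the default num_heads=1.
check_attention_inputs((3, 5, 9), (3, 6, 9), (3, 6, 10),
                       "float32", "float32", "float32")

# With num_heads=3 the keys (9) divide evenly but the values (10) do not.
try:
    check_attention_inputs((3, 5, 9), (3, 6, 9), (3, 6, 10),
                           "float32", "float32", "float32", num_heads=3)
except ValueError as err:
    print("rejected:", err)
```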

Examples

import paddle.fluid as fluid

# Static-graph inputs: batch size 3, query length 5, key/value length 6.
queries = fluid.data(name="queries", shape=[3, 5, 9], dtype="float32")
keys = fluid.data(name="keys", shape=[3, 6, 9], dtype="float32")
values = fluid.data(name="values", shape=[3, 6, 10], dtype="float32")
# num_heads defaults to 1, so no linear projection is applied to the inputs.
contexts = fluid.nets.scaled_dot_product_attention(queries, keys, values)
contexts.shape  # [3, 5, 10]