FusedTransformerEncoderLayer
- class paddle.incubate.nn.FusedTransformerEncoderLayer ( d_model, nhead, dim_feedforward, dropout_rate=0.1, activation='relu', attn_dropout_rate=None, act_dropout_rate=None, normalize_before=False, weight_attr=None, bias_attr=None ) [source]
- 
         FusedTransformerEncoderLayer is composed of two sub-layers: self (multi-head) attention and a feedforward network. Before and after each sub-layer, a pre-process and a post-process step are applied to the input and output accordingly. If normalize_before is True, the pre-process is layer normalization and the post-process includes dropout and residual connection. Otherwise, there is no pre-process and the post-process includes dropout, residual connection and layer normalization.
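
         As an illustration of the two orderings, the following sketch re-states the description above with plain paddle.nn layers; it is not the fused kernel, and an identity function stands in for the MHA/FFN sub-layer:

    import paddle
    import paddle.nn as nn

    norm = nn.LayerNorm(128)
    dropout = nn.Dropout(0.1)

    def sublayer_pre_norm(x, fn):
        # normalize_before=True: layer norm as pre-process,
        # dropout + residual connection as post-process
        return x + dropout(fn(norm(x)))

    def sublayer_post_norm(x, fn):
        # normalize_before=False: no pre-process; dropout, residual
        # connection and layer norm as post-process
        return norm(x + dropout(fn(x)))

    x = paddle.rand((2, 4, 128))
    # the identity lambda stands in for the MHA or FFN sub-layer
    y = sublayer_pre_norm(x, lambda t: t)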
- Parameters
- 
           - d_model (int) – The expected feature size in the input and output. 
- nhead (int) – The number of heads in multi-head attention (MHA). 
- dim_feedforward (int) – The hidden layer size in the feedforward network (FFN). 
- dropout_rate (float, optional) – The dropout probability used in the pre-process and post-process of the MHA and FFN sub-layers. Default: 0.1. 
- activation (str, optional) – The activation function in the feedforward network. Default: relu. 
- attn_dropout_rate (float, optional) – The dropout probability used in MHA to drop some attention targets. If None, the value of dropout_rate is used. Default: None. 
- act_dropout_rate (float, optional) – The dropout probability used after the FFN activation. If None, the value of dropout_rate is used. Default: None. 
- normalize_before (bool, optional) – Indicates whether to put layer normalization into the pre-process of the MHA and FFN sub-layers. If True, the pre-process is layer normalization and the post-process includes dropout and residual connection. Otherwise, there is no pre-process and the post-process includes dropout, residual connection and layer normalization. Default: False. 
- weight_attr (ParamAttr|list|tuple, optional) – To specify the weight parameter property. If it is a list/tuple, weight_attr[0] would be used as the weight_attr for MHA, and weight_attr[1] would be used as the weight_attr for the linear layers in FFN. Otherwise, MHA and FFN both use it as the weight_attr to create parameters. See ParamAttr for usage details; a sketch of the list form follows the Examples block below. Default: None, which means the default weight parameter property is used.
- bias_attr (ParamAttr|list|tuple|bool, optional) – To specify the bias parameter property. If it is a list/tuple, bias_attr[0] would be used as the bias_attr for MHA, and bias_attr[1] would be used as the bias_attr for the linear layers in FFN. Otherwise, MHA and FFN both use it as the bias_attr to create parameters. The value False means the corresponding layer would have no trainable bias parameter. See ParamAttr for usage details. Default: None, which means the default bias parameter property is used.
 
 Examples

    # required: gpu
    import paddle
    from paddle.incubate.nn import FusedTransformerEncoderLayer

    # encoder input: [batch_size, src_len, d_model]
    enc_input = paddle.rand((2, 4, 128))
    # self attention mask: [batch_size, n_head, src_len, src_len]
    attn_mask = paddle.rand((2, 2, 4, 4))
    encoder_layer = FusedTransformerEncoderLayer(128, 2, 512)
    enc_output = encoder_layer(enc_input, attn_mask)  # [2, 4, 128]
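
 As a minimal sketch of the list form of weight_attr described above, the snippet below gives MHA and the FFN linear layers different initializers; the specific initializer choices are illustrative assumptions, not library defaults:

    # required: gpu
    import paddle
    from paddle import ParamAttr
    from paddle.incubate.nn import FusedTransformerEncoderLayer

    # weight_attr[0] applies to MHA, weight_attr[1] to the FFN linear layers
    mha_weight = ParamAttr(initializer=paddle.nn.initializer.XavierUniform())
    ffn_weight = ParamAttr(initializer=paddle.nn.initializer.KaimingUniform())
    encoder_layer = FusedTransformerEncoderLayer(
        128, 2, 512, weight_attr=[mha_weight, ffn_weight])
    enc_output = encoder_layer(paddle.rand((2, 4, 128)))  # [2, 4, 128]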
            
            forward(src, src_mask=None, cache=None)
- 
            Applies a Transformer encoder layer on the input.
- Parameters
- 
             - src (Tensor) – The input of Transformer encoder layer. It is a tensor with shape [batch_size, sequence_length, d_model]. The data type should be float32 or float64. 
- src_mask (Tensor, optional) – A tensor used in multi-head attention to prevent attention to some unwanted positions, usually the paddings or the subsequent positions. It is a tensor with a shape broadcastable to [batch_size, n_head, sequence_length, sequence_length]. When the data type is bool, the unwanted positions have False values and the others have True values. When the data type is int, the unwanted positions have 0 values and the others have 1 values. When the data type is float, the unwanted positions have -INF values and the others have 0 values. It can be None when nothing needs to be prevented from being attended to. Default: None. A construction sketch follows the Returns section below. 
- cache (Tensor, optional) – It is an instance of MultiHeadAttention.Cache. See TransformerEncoderLayer.gen_cache for more details. It is only used for inference and should be None for training. Default None. 
 
- Returns
- 
              Tensor|tuple. A tensor that has the same shape and data type as src, representing the output of the Transformer encoder layer. If cache is not None, a tuple is returned instead, containing the encoder layer output and the new cache; the new cache is the same as the input cache argument, except that its incremental_cache has an incremental length. See MultiHeadAttention.gen_cache and MultiHeadAttention.forward for more details. 
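
 The following is a minimal sketch of building a boolean padding mask for src_mask, assuming a shape broadcastable to [batch_size, n_head, sequence_length, sequence_length] is accepted as described above; the batch size and sequence lengths are arbitrary examples:

    # required: gpu
    import paddle
    from paddle.incubate.nn import FusedTransformerEncoderLayer

    # batch of 2 sequences padded to length 4; valid lengths are 3 and 4
    lengths = paddle.to_tensor([3, 4])
    # True = real token (attended), False = padding (masked out)
    valid = paddle.arange(4).unsqueeze(0) < lengths.unsqueeze(1)   # [2, 4]
    # reshape so it broadcasts to [batch_size, n_head, src_len, src_len]
    src_mask = valid.unsqueeze(1).unsqueeze(2)                     # [2, 1, 1, 4]

    encoder_layer = FusedTransformerEncoderLayer(128, 2, 512)
    enc_output = encoder_layer(paddle.rand((2, 4, 128)), src_mask)  # [2, 4, 128]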
 
 
