FusedEcMoe

class paddle.incubate.nn.FusedEcMoe ( hidden_size, inter_size, num_experts, act_type, weight_attr=None, bias_attr=None ) [source]

A FusedEcMoe Layer.

Parameters
  • hidden_size (int) – The hidden dimension of the input.

  • inter_size (int) – The intermediate dimension of the feed-forward network.

  • num_experts (int) – The number of experts.

  • act_type (string) – The activation type. Currently only gelu and relu are supported.

  • weight_attr (ParamAttr, optional) – The attribute for the learnable weight of this layer. The default value is None and the weight will be initialized to zero. For detailed information, please refer to paddle.ParamAttr.

  • bias_attr (ParamAttr|bool, optional) – The attribute for the learnable bias of this layer. If it is set to False, no bias will be added to the output. If it is set to None or one kind of ParamAttr, a bias parameter will be created according to ParamAttr. For detailed information, please refer to paddle.ParamAttr. The default value is None and the bias will be initialized to zero.

Attribute:
  • weight (Parameter): the learnable weight of this layer.

  • bias (Parameter): the learnable bias of this layer.

Shape:
  • input: Multi-dimensional tensor with shape \([batch\_size, seq\_len, d\_model]\) .

  • output: Multi-dimensional tensor with shape \([batch\_size, seq\_len, d\_model]\) .
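To make the shapes above concrete, here is a hedged NumPy sketch of a soft-gated mixture-of-experts feed-forward pass. This only approximates the fused kernel's semantics (the actual fused op may differ in routing and activation details); the function name `moe_forward` and the weight names `w1`, `b1`, `w2`, `b2` are illustrative, not part of the Paddle API, and relu is used in place of gelu for brevity.

```python
import numpy as np

def moe_forward(x, gate, w1, b1, w2, b2):
    """Soft-gated mixture-of-experts FFN (illustrative sketch, not the fused op).

    x:    [bsz, seq_len, d_model]      input hidden states
    gate: [bsz, seq_len, num_experts]  raw gating scores
    w1:   [num_experts, d_model, d_inter], b1: [num_experts, d_inter]
    w2:   [num_experts, d_inter, d_model], b2: [num_experts, d_model]
    """
    # Softmax over the expert axis turns raw scores into mixing weights.
    e = np.exp(gate - gate.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)

    # Each expert is a two-layer FFN; relu stands in for the configured activation.
    expert_out = np.stack(
        [np.maximum(x @ w1[k] + b1[k], 0.0) @ w2[k] + b2[k]
         for k in range(w1.shape[0])],
        axis=-2,  # -> [bsz, seq_len, num_experts, d_model]
    )

    # Gate-weighted sum over experts -> [bsz, seq_len, d_model]
    return np.einsum("bse,bsed->bsd", probs, expert_out)

bsz, seq_len, d_model, d_inter, num_experts = 2, 4, 8, 16, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((bsz, seq_len, d_model))
gate = rng.standard_normal((bsz, seq_len, num_experts))
w1 = rng.standard_normal((num_experts, d_model, d_inter))
b1 = np.zeros((num_experts, d_inter))
w2 = rng.standard_normal((num_experts, d_inter, d_model))
b2 = np.zeros((num_experts, d_model))

y = moe_forward(x, gate, w1, b1, w2, b2)
print(y.shape)  # (2, 4, 8) — output keeps the input's shape
```

Note how the output shape matches the input shape, mirroring the Shape section above: the expert axis introduced by the gate is reduced away by the weighted sum.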

Examples

>>> import paddle
>>> paddle.device.set_device('gpu')
>>> from paddle.incubate.nn.layer.fused_ec_moe import FusedEcMoe

>>> x = paddle.randn([10, 128, 1024]) # [bsz, seq_len, d_model]
>>> gate = paddle.randn([10, 128, 8]) # [bsz, seq_len, num_experts]
>>> moe = FusedEcMoe(1024, 4096, 8, act_type="gelu")
>>> y = moe(x, gate)
>>> print(y.shape)
[10, 128, 1024]
forward ( x, gate )

Defines the computation performed at every call.

Parameters
  • x (Tensor) – The input tensor with shape \([batch\_size, seq\_len, d\_model]\) .

  • gate (Tensor) – The gating scores with shape \([batch\_size, seq\_len, num\_experts]\) .