paddle.nn.quant.weight_quantize(x, algo='weight_only_int8', arch=None)

Quantizes a weight Tensor for use with the weight_only (int8 or int4) or llm.int8 algorithms.

Parameters

  • x (Tensor) – The input Tensor to be quantized. Its data type must be float16 or bfloat16.

  • algo (str) – The quantization algorithm to apply to x. Must be one of ‘weight_only_int8’, ‘weight_only_int4’, or ‘llm.int8’. Default: ‘weight_only_int8’.

  • arch (int) – The compute architecture of the target device, for example 80 for A100 or 70 for V100. If arch is not specified, it is detected from the current device. Default: None.
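The two arch values given above (A100 is 80, V100 is 70) match the CUDA compute capability of those GPUs written without the dot (8.0 and 7.0). The following is a minimal sketch of that mapping, assuming the same convention holds for other devices. `arch_from_capability` is a hypothetical helper for illustration and is not part of Paddle. Which arch values the function accepts is not stated here.

```python
def arch_from_capability(major, minor):
    """Combine a CUDA compute capability (major, minor) into one integer.

    Assumption: arch follows the "major * 10 + minor" convention implied
    by the documented examples (A100 -> 80, V100 -> 70).
    """
    return major * 10 + minor


# Compute capability 8.0 (A100) and 7.0 (V100), as in the parameter description.
a100_arch = arch_from_capability(8, 0)
v100_arch = arch_from_capability(7, 0)
```

A value computed this way could then be passed as the arch argument instead of relying on device detection.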


Returns

out (Tensor): The quantized weight. Its data type is int8 and its shape is the transpose of the shape of x.
scale (Tensor): The per-channel scale. Its data type is float32.
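To make the relationship between the quantized weight and the scale concrete, here is a pure-Python sketch of symmetric per-channel int8 quantization. It assumes an absmax scheme (scale = max |value| / 127 for each column of x) and a plain transpose for the output layout. The function names are hypothetical, and the real kernel may use a different rounding rule or a device-specific memory layout, so treat this as an illustration of the idea rather than Paddle's implementation.

```python
def quantize_per_channel_int8(x):
    """Symmetric absmax quantization of a 2-D weight given as a list of rows.

    Returns (q, scale): q holds integers in [-127, 127] laid out as the
    transpose of x, and scale has one entry per column (channel) of x.
    """
    rows, cols = len(x), len(x[0])
    q, scale = [], []
    for j in range(cols):
        column = [x[i][j] for i in range(rows)]
        absmax = max(abs(v) for v in column)
        # Avoid dividing by zero for an all-zero channel.
        s = absmax / 127.0 if absmax > 0 else 1.0
        scale.append(s)
        q.append([max(-127, min(127, round(v / s))) for v in column])
    return q, scale


def dequantize(q, scale):
    """Approximately recover the original matrix, in the original layout."""
    cols, rows = len(q), len(q[0])
    return [[q[j][i] * scale[j] for j in range(cols)] for i in range(rows)]


x = [[0.5, -2.0], [1.0, 4.0], [-0.25, 1.0]]  # shape [3, 2]
q, scale = quantize_per_channel_int8(x)       # q has shape [2, 3]
x_hat = dequantize(q, scale)                  # close to x, within scale / 2
```

In this sketch each reconstructed value differs from the original by at most half of its channel's scale, which is why a per-channel scale limits the error introduced by channels with a small value range.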

Return type

tuple of (out, scale), both Tensor


>>> import paddle
>>> from paddle.nn.quant import weight_quantize

>>> paddle.seed(2023)
>>> x = paddle.rand(shape=[64, 32], dtype=paddle.float16)
>>> out, scale = weight_quantize(x, algo='weight_only_int8')
>>> print(out.shape)
[32, 64]
>>> print(scale.shape)