weight_quantize

paddle.nn.quant.weight_quantize(x: Tensor, algo: _Algo = 'weight_only_int8', arch: int | None = None, group_size: _GroupSize = -1) → tuple[Tensor, Tensor]

Quantizes a weight Tensor for weight-only and llm.int8 quantization.

Parameters

x (Tensor) – The input Tensor to be quantized. The data type is float16 or bfloat16.
algo (str) – The quantization algorithm to apply to x. Must be one of ‘weight_only_int8’, ‘weight_only_int4’, ‘llm.int8’, ‘w4a8’, or ‘w4afp8’. Default: ‘weight_only_int8’.
arch (int) – The compute architecture of the target device, for example 80 for A100 or 70 for V100. If arch is not specified, it is detected from the current device. Default: None.
group_size (int) – The group size for weight quantization. -1 selects per-channel mode; the only other supported values are currently 64 and 128. Default: -1.
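The difference between per-channel mode and a group size of 64 or 128 comes down to how many scales are computed. The NumPy sketch below illustrates that idea only. It assumes symmetric absmax scaling and grouping along the first dimension of a [k, n] weight, neither of which is stated on this page, and `absmax_scales` is a hypothetical helper, not part of Paddle.

```python
import numpy as np

def absmax_scales(w: np.ndarray, group_size: int = -1) -> np.ndarray:
    """Illustrative symmetric int8 scales for a [k, n] weight (hypothetical helper)."""
    k, n = w.shape
    if group_size == -1:
        # Per-channel: one scale per column of w.
        return np.abs(w).max(axis=0) / 127.0
    if group_size not in (64, 128):
        raise ValueError("group_size must be -1, 64, or 128")
    if k % group_size != 0:
        raise ValueError("k must be divisible by group_size")
    # Group-wise (assumed layout): one scale per block of `group_size`
    # consecutive rows in each column.
    grouped = np.abs(w).reshape(k // group_size, group_size, n)
    return grouped.max(axis=1) / 127.0

rng = np.random.default_rng(2023)
w = rng.random((128, 32), dtype=np.float32)
per_channel = absmax_scales(w)     # shape (32,)
group_wise = absmax_scales(w, 64)  # shape (2, 32)
```

Smaller groups track local weight ranges more closely at the cost of storing more scales. The scale layout that Paddle actually returns for grouped quantization may differ from this sketch.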

Returns

out (Tensor): The quantized result. The data type is int8 and the shape is the transpose of the shape of x.
scale (Tensor): The per-channel scale. The data type is float32.

Return type

tuple[Tensor, Tensor]

Examples

>>> import paddle
>>> from paddle.nn.quant import weight_quantize

>>> paddle.seed(2023)
>>> x = paddle.rand(shape=[64, 32], dtype=paddle.float16)
>>> out, scale = weight_quantize(x, algo='weight_only_int8')
>>> print(out.shape)
paddle.Size([32, 64])
>>> print(scale.shape)
paddle.Size([32])
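To make the shapes in the example above concrete, here is a minimal NumPy sketch of per-channel int8 quantization that returns a transposed int8 array and a float32 scale per channel. It is a conceptual reference under stated assumptions (symmetric absmax scaling, rounding to nearest), not Paddle's implementation: `quantize_per_channel_int8` is a hypothetical name, and the real function may also rearrange the int8 data for the target arch, so the values should not be expected to match element by element.

```python
import numpy as np

def quantize_per_channel_int8(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical reference: symmetric absmax int8 quantization per column."""
    x32 = x.astype(np.float32)
    # One float32 scale per column (channel) of x.
    scale = (np.abs(x32).max(axis=0) / 127.0).astype(np.float32)
    # Guard against all-zero columns to avoid division by zero.
    safe = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.rint(x32 / safe), -127, 127).astype(np.int8)
    # Return the quantized weight transposed, mirroring the documented shape.
    return q.T.copy(), scale

rng = np.random.default_rng(2023)
x = rng.random((64, 32)).astype(np.float16)
out, scale = quantize_per_channel_int8(x)
print(out.shape, out.dtype)      # (32, 64) int8
print(scale.shape, scale.dtype)  # (32,) float32

# Dequantize and measure the largest round-trip error.
x_hat = out.T.astype(np.float32) * scale
max_err = np.abs(x_hat - x.astype(np.float32)).max()
```

With rounding to nearest, the round-trip error for each element stays within half of its channel's scale, which is a quick sanity check for any quantizer of this kind.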