2.3.0 Release Note¶
1. Important Updates¶
We are excited to release the PaddlePaddle Framework V2.3.0. This version contains the following highlights.
API¶
Added more than 100 new APIs, covering automatic differentiation, linear algebra, probability distribution, sparse tensor, framework performance analysis, hardware device management, vision domain, etc.
Added 4 new automatic differentiation APIs, 11 new linear algebra APIs, and 21 new probability distribution APIs to better support use cases in scientific computing, reinforcement learning, and other application areas.
Added 11 new Sparse Tensor APIs including basic functions of sparse tensor construction and conversion. The COO and CSR formats are supported.
Added 9 new framework performance analysis APIs. The new performance profiling APIs, centered around paddle.profiler.Profiler, help users collect and analyze performance statistics during training and inference.
Added 7 APIs for device management, facilitating hardware information acquisition.
Added several vision and text domain APIs to facilitate the reuse of MobileNetV3, ResNeXt, and other backbone networks for fast network construction.
Paddle HIgh reusability operator library¶
We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive APIs, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework’s performance and reusability, particularly in operator development. Such problems include inefficient cross-operator reuse, unclear operator interfaces, and the lack of direct C++ calls into the operator library. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user’s development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins.
Distributed Training¶
Fully upgrade the adaptive distributed training architecture, including modules such as elastic resource management, asynchronous pipelined executor, heterogeneous communication, and automatic parallelism, and support distributed training and inference on a variety of heterogeneous hardware.
Add MoE parallel strategy, GroupSharded parallel strategy, and Pure FP16 under dynamic graph hybrid Parallelism, which further supports the efficient distributed training of large models under the dynamic graph.
Comprehensively upgrade and optimize the architecture of the general heterogeneous parameter server, and simplify each module, such as communication and storage, to improve the secondary development experience of the parameter server. The performance of the GPU parameter server is improved by 2.38 times when training with 100 billion parameters and 10 billion samples of data.
Compile and Install¶
From version 2.3.0, PaddlePaddle upgrades GPU architectures supported.
Inference Deployment¶
Add the Java API and ONNX Runtime CPU backend.
Support TensorRT 8.0 / 8.2 and structured sparsity, with deep performance optimization for ERNIE-like structural models.
Hardware Backend Extension¶
Add custom device support: provide a plugin mechanism for extending PaddlePaddle hardware backends.
Add training/inference support for multiple heterogeneous chips such as HUAWEI Ascend 910 / GraphCore IPU / Cambricon MLU / KUNLUNXIN 2.
Framework Architecture¶
In this version, we did a lot of work on the framework executor. For details, please see New Dynamic Graph Execution Mechanism and New Static Graph Executor.
2. Incompatibility Upgrade¶
Due to limitation of the binary size, sm35 CUDA ARCH is dropped in precompiled binaries. (#41754)
When paddle.to_tensor converts a Python int scalar to a Tensor, the default data type on Windows changes from int32 to int64, aligning with Linux/Mac. (#39662)
To keep consistency with the division behavior under Python 3, the division symbol / has been changed from “rounding divide” to “true divide”, and the data type of the computed output has been switched from int to float. (#40890)
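The “rounding divide” vs. “true divide” distinction above is the same one Python 3 itself makes between // and /, which can be illustrated in plain Python:

```python
# Under "true divide" semantics, integer inputs produce a float result;
# "rounding divide" corresponds to Python's floor-division operator.
assert 7 / 4 == 1.75            # true divide: float output
assert 7 // 4 == 1              # rounding (floor) divide: int output
assert isinstance(7 / 4, float)
assert isinstance(7 // 4, int)
```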
Revise the ELU formula. The computation for alpha < 0 now aligns with the original paper, fixing a small number of cases where results were calculated incorrectly. Meanwhile, elu_ will report an error when alpha < 0, because it is not mathematically possible to compute the gradient from the output alone when alpha < 0. (#37316)
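The ELU definition referred to above can be sketched in NumPy (an illustration of the formula from the original paper, not Paddle’s kernel):

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU (Clevert et al.): x for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * np.expm1(x))

x = np.array([-1.0, 0.0, 2.0])
out = elu(x)
assert out[2] == 2.0
assert abs(out[0] - np.expm1(-1.0)) < 1e-12
```

Note why the in-place variant cannot work for alpha < 0: the negative branch alpha * (exp(x) - 1) then produces positive outputs that collide with the positive branch, so the input (and hence the gradient) cannot be recovered from the output alone.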
3. Training Framework (with the distributed function)¶
(1) New functions¶
API¶
Add 4 new automatic differentiation APIs to support scientific computing, as listed below: (#40692)
- paddle.incubate.autograd.vjp, to compute the vector-Jacobian product.
- paddle.incubate.autograd.jvp, to compute the Jacobian-vector product.
- paddle.incubate.autograd.Jacobian, to compute the Jacobian matrix.
- paddle.incubate.autograd.Hessian, to compute the Hessian matrix.
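To make the vector-Jacobian product concrete, here is a hand-derived sketch in NumPy for an elementwise square (the function and helper names are illustrative, not Paddle’s API):

```python
import numpy as np

def f(x):
    # Elementwise square: its Jacobian is the diagonal matrix diag(2 * x).
    return x ** 2

def vjp_of_square(x, v):
    # For an elementwise op, v^T J reduces to an elementwise multiply
    # of v with the local derivative 2 * x.
    return v * 2 * x

x = np.array([1.0, 2.0, 3.0])
v = np.ones(3)
print(vjp_of_square(x, v))  # [2. 4. 6.]
```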
Add linear algebra class APIs:
- paddle.linalg.triangular_solve, to solve systems of linear equations with a triangular coefficient matrix. (#36714)
- paddle.linalg.eig, to compute the eigendecomposition of a general square matrix. (#35764)
- paddle.linalg.solve, to compute solutions to systems of linear equations. (#35715)
- paddle.linalg.lstsq, to compute least-squares solutions to systems of linear equations. (#38585, #38621)
- paddle.linalg.qr, to compute the QR decomposition of a matrix. (#35742, #38824)
- paddle.inner, to compute the inner product of matrices. (#37706)
- paddle.outer, to compute the outer product of matrices. (#37706)
- paddle.linalg.cov, to compute the covariance between vectors. (#38392)
- paddle.linalg.cholesky_solve, to compute the Cholesky solution of an equation. (#38167)
- paddle.linalg.lu and paddle.linalg.lu_unpack, to compute the LU decomposition of a matrix and unpack the LU matrix. (#38617, #38559, #38616)
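The two most common of these operations, solving an exact square system and a least-squares overdetermined system, can be illustrated with NumPy’s equivalents:

```python
import numpy as np

# Solve the square system A x = b, then verify by substitution.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)  # [2. 3.]
assert np.allclose(A @ x, b)

# Least squares for an overdetermined system (more equations than unknowns):
# fit y = c0 + c1 * t through three points in the least-squares sense.
A2 = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b2 = np.array([6.0, 0.0, 0.0])
coef, residuals, rank, sv = np.linalg.lstsq(A2, b2, rcond=None)
```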
Add 21 new probability distribution class APIs for reinforcement learning, variational inference, scientific computing, and other scenarios, including 6 random variable distributions, 13 random variable transforms, and 2 KL divergence APIs, as listed below: (#40536, #38820, #38558, #38445, #38244, #38047)
- paddle.distribution.ExponentialFamily, base class for the exponential distribution family.
- paddle.distribution.Beta, Beta distribution.
- paddle.distribution.Dirichlet, Dirichlet distribution.
- paddle.distribution.Independent, independent distribution, used to create higher-order distributions.
- paddle.distribution.TransformedDistribution, transformed distribution, used to generate higher-order distributions from a base distribution and a series of transforms.
- paddle.distribution.Multinomial, multinomial distribution.
- paddle.distribution.Transform, base class for transforming random variables.
- paddle.distribution.AbsTransform, absolute value transform.
- paddle.distribution.AffineTransform, affine transform.
- paddle.distribution.ChainTransform, chain combination of transforms.
- paddle.distribution.ExpTransform, exponential transform.
- paddle.distribution.IndependentTransform, independent transform, used to extend the event_dim of a transform’s definition domain.
- paddle.distribution.PowerTransform, power transform.
- paddle.distribution.ReshapeTransform, reshape transform.
- paddle.distribution.SigmoidTransform, sigmoid transform.
- paddle.distribution.SoftmaxTransform, softmax transform.
- paddle.distribution.StackTransform, stack transform, used to combine multiple transforms in a stack fashion.
- paddle.distribution.StickBreakingTransform, stick-breaking transform.
- paddle.distribution.TanhTransform, tanh transform.
- paddle.distribution.kl_divergence, to compute KL divergence.
- paddle.distribution.register_kl, to register a user-defined KL divergence calculation function.
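The idea behind chaining invertible transforms, as in an affine transform followed by a sigmoid, can be sketched in pure Python (the function names are illustrative, not Paddle’s API): the inverse of a chain applies each inverse in reverse order.

```python
import math

# A hypothetical two-step chain: affine (scale/shift) then sigmoid.
def affine(x, scale=2.0, shift=1.0):
    return scale * x + shift

def affine_inv(y, scale=2.0, shift=1.0):
    return (y - shift) / scale

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_inv(y):
    # The inverse of sigmoid is the logit function.
    return math.log(y / (1.0 - y))

x = 0.5
y = sigmoid(affine(x))               # forward through the chain
x_back = affine_inv(sigmoid_inv(y))  # invert in reverse order
assert abs(x_back - x) < 1e-12
```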
Add high-level APIs:
- Add paddle.vision.models.AlexNet and paddle.vision.models.alexnet, to use AlexNet models directly. (#36058)
- Add paddle.vision.models.DenseNet, paddle.vision.models.densenet121, paddle.vision.models.densenet161, paddle.vision.models.densenet169, paddle.vision.models.densenet201, and paddle.vision.models.densenet264, to use DenseNet models directly. (#36069)
- Add paddle.vision.models.GoogLeNet and paddle.vision.models.googlenet, to use GoogLeNet models directly. (#36034)
- Add paddle.vision.models.InceptionV3 and paddle.vision.models.inception_v3, to use InceptionV3 models directly. (#36064)
- Add paddle.vision.models.MobileNetV3Small, paddle.vision.models.MobileNetV3Large, paddle.vision.models.mobilenet_v3_small, and paddle.vision.models.mobilenet_v3_large, to use MobileNetV3 models directly. (#38653)
- Add paddle.vision.models.ResNeXt, paddle.vision.models.resnext50_32x4d, paddle.vision.models.resnext50_64x4d, paddle.vision.models.resnext101_32x4d, paddle.vision.models.resnext101_64x4d, paddle.vision.models.resnext152_32x4d, and paddle.vision.models.resnext152_64x4d, to use ResNeXt models directly. (#36070)
- Add paddle.vision.models.ShuffleNetV2, paddle.vision.models.shufflenet_v2_x0_25, paddle.vision.models.shufflenet_v2_x0_33, paddle.vision.models.shufflenet_v2_x0_5, paddle.vision.models.shufflenet_v2_x1_0, paddle.vision.models.shufflenet_v2_x1_5, paddle.vision.models.shufflenet_v2_x2_0, and paddle.vision.models.shufflenet_v2_swish, to use ShuffleNetV2 models directly. (#36067)
- Add paddle.vision.models.SqueezeNet, paddle.vision.models.squeezenet1_0, and paddle.vision.models.squeezenet1_1, to use SqueezeNet models directly. (#36066)
- Add paddle.vision.models.wide_resnet50_2 and paddle.vision.models.wide_resnet101_2, to use WideResNet models directly. (#36952)
- Add paddle.vision.ops.nms, to support single-category and multi-category non-maximum suppression (NMS) algorithms, accelerating object detection and prediction tasks. (#40962)
- Add paddle.vision.ops.roi_pool and paddle.vision.ops.RoIPool, to support RoI region pooling operations in detection tasks. (#36154)
- Add paddle.vision.ops.roi_align and paddle.vision.ops.RoIAlign, to support RoI Align operations in detection tasks. (#35102)
- Add paddle.text.ViterbiDecoder and paddle.text.viterbi_decode, Viterbi decoding APIs, mainly for sequence tagging model prediction. (#35778)
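The NMS algorithm mentioned above follows the classic greedy scheme, which can be sketched in NumPy (a single-class illustration, not Paddle’s kernel):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy single-class NMS; boxes are [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with IoU 0.81 and is suppressed; box 2 is disjoint and kept.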
Add 11 Sparse class APIs to support basic functions such as creating sparse Tensors in COO and CSR formats, and add C++ interconversion with Tensor:
- paddle.sparse.sparse_coo_tensor, to create a sparse Tensor in COO format. (#40780)
- paddle.sparse.sparse_csr_tensor, to create a sparse Tensor in CSR format. (#40780)
- paddle.sparse.ReLU, to support the ReLU activation layer for SparseCooTensor. (#40959)
- paddle.sparse.functional.relu, to support the ReLU function for SparseCooTensor. (#40959)
- Tensor.values(), C++ method to get the non-zero elements of a SparseCooTensor or SparseCsrTensor. (#40608)
- Tensor.indices(), C++ method to get the coordinate information of a SparseCooTensor. (#40608)
- Tensor.crows(), C++ method to get the compressed row information of a SparseCsrTensor. (#40608)
- Tensor.cols(), C++ method to get the column information of a SparseCsrTensor. (#40608)
- Tensor.to_sparse_coo(), C++ method to convert a DenseTensor or SparseCsrTensor to a SparseCooTensor. (#40780)
- Tensor.to_sparse_csr(), C++ method to convert a DenseTensor or SparseCooTensor to a SparseCsrTensor. (#40780)
- Tensor.to_dense(), C++ method to convert a SparseCooTensor or SparseCsrTensor to a DenseTensor. (#40780)
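The difference between the COO and CSR layouts named above can be illustrated with NumPy: COO stores explicit (row, col) coordinates, while CSR compresses the row coordinates into cumulative row pointers (the “crows” of the API above).

```python
import numpy as np

# A dense 3x3 matrix with five non-zeros.
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]])

# COO: parallel arrays of coordinates plus values, in row-major order.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]
print(rows.tolist(), cols.tolist(), values.tolist())
# [0, 0, 1, 2, 2] [0, 2, 2, 0, 1] [1, 2, 3, 4, 5]

# CSR: compress the row indices into row pointers; crows[i+1] - crows[i]
# is the number of non-zeros in row i.
crows = np.zeros(dense.shape[0] + 1, dtype=int)
np.add.at(crows, rows + 1, 1)
crows = np.cumsum(crows)
print(crows.tolist())  # [0, 2, 3, 5]
```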
Add hardware related APIs:
- Add four GPU memory monitoring APIs: paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated, and paddle.device.cuda.memory_reserved, to view and analyze GPU memory usage in real time. (#38657)
- Add paddle.device.cuda.get_device_properties, to return the properties of the GPU device. (#35661)
- Add paddle.device.cuda.get_device_name and paddle.device.cuda.get_device_capability, to return the name and compute capability of the GPU device. (#35672)
Add Tensor operation APIs:
- Add paddle.nansum, to sum the input Tensor along axis while ignoring NaN values. (#38137)
- Add paddle.nanmean, to average the input Tensor along axis while ignoring NaN values. (#40472)
- Add paddle.clone, to return a copy of the input Tensor with gradient calculation support. (#38020)
- Add paddle.Tensor.element_size, to return the number of bytes allocated for a single element in a Tensor. (#38020)
- Add paddle.Tensor.to_uva_tensor, to convert numpy objects to objects accessible by CUDA through virtual addresses while physically stored in CPU memory. (#39146, #38950)
- Add paddle.rot90, to rotate the n-dimensional Tensor by 90 degrees in the plane specified by axes. (#37634)
- Add paddle.logit and paddle.Tensor.logit, to compute the logit function values for the input Tensor. (#37844)
- Add paddle.repeat_interleave, to copy the input along the specified axis and return a new Tensor. (#37981)
- Add paddle.renorm, to split the Tensor into multiple pieces at the specified axis and then perform p-norm operations separately. (#38130, #38459)
- Add paddle.mode and paddle.Tensor.mode, to search the values and indices of the input Tensor along the specified axis. (#38446)
- Add paddle.quantile and paddle.Tensor.quantile, to compute the q-quantile of a Tensor along the specified axis. (#38567)
- Add paddle.kthvalue and paddle.Tensor.kthvalue, to find the values and indices of the k-th smallest elements along the specified axis. (#38386)
- Add paddle.is_floating_point and paddle.Tensor.is_floating_point, to determine if the input Tensor is of floating point type. (#37885)
- Add paddle.erfinv and paddle.Tensor.erfinv, to compute the inverse error function of the input Tensor. (#38295)
- Add paddle.lerp and paddle.Tensor.lerp, to compute linear interpolation between the input Tensors based on the given weights. (#37253)
- Add paddle.angle, to compute the phase angle of a complex Tensor. (#37689)
- Add paddle.rad2deg and paddle.Tensor.rad2deg, to convert each element of the input from radians to degrees. (#37598)
- Add paddle.deg2rad and paddle.Tensor.deg2rad, to convert each element of the input from degrees to radians. (#37598)
- Add paddle.gcd and paddle.Tensor.gcd, to compute the greatest common divisors of the absolute values of two inputs elementwise. (#37819)
- Add paddle.lcm and paddle.Tensor.lcm, to compute the least common multiples of the absolute values of two inputs elementwise. (#37819)
- Add paddle.amax and paddle.Tensor.amax, to get the maximum value of Tensor elements along the specified dimension. (#38417)
- Add paddle.amin and paddle.Tensor.amin, to get the minimum value of Tensor elements along the specified dimension. (#38417)
- Add paddle.isclose, to determine whether each element of two Tensors is close to the other. (#37135)
- Add paddle.put_along_axis and paddle.take_along_axis, for placing or extracting elements with specified index subscripts. (#38608)
- Add paddle.bincount and paddle.Tensor.bincount, for counting the number of occurrences of each element in a Tensor. (#36317)
- Add paddle.fmax and paddle.fmin, to extend the max/min functions to support NaN values in the two Tensors: if one of the corresponding elements is NaN, return the non-NaN value; if both are NaN, return NaN. (#37826)
- Add paddle.diff, for computing the n-th forward difference along a given dimension; currently n=1 is supported. (#37441)
- Add inverse hyperbolic functions: paddle.asinh, paddle.acosh, and paddle.atanh. (#37076)
- Add paddle.as_real and paddle.as_complex, for conversion between real Tensors and complex Tensors. (#37784)
- Add paddle.complex, for constructing a complex Tensor from the given real and imaginary parts. (#37918, #38272)
- Add paddle.det and paddle.slogdet, to compute the determinant of a matrix and the natural logarithm of the determinant. (#34992)
- Add paddle.nn.utils.parameters_to_vector, to flatten parameters into a 1-D Tensor. (#38020)
- Add paddle.nn.utils.vector_to_parameters, to transform a 1-D Tensor back into parameters. (#38020)
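The NaN-ignoring reduction semantics of the first two APIs above match NumPy’s nansum/nanmean, which serves as a quick illustration:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
# NaN-ignoring reductions: only the finite entries contribute,
# and the mean divides by the count of non-NaN elements (2 here).
print(np.nansum(x))   # 4.0
print(np.nanmean(x))  # 2.0
```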
Add networking class APIs:
- Add paddle.nn.Fold and paddle.nn.functional.fold, to extract sliding local area blocks from the Tensors of a batch. (#38613)
- Add paddle.nn.CELU and paddle.nn.functional.celu, to support the CELU activation layer. (#36088)
- Add paddle.nn.HingeEmbeddingLoss, to compute hinge embedding loss, usually used for nonlinear embedding or semi-supervised learning. (#37540)
- Add paddle.nn.ZeroPad2D, for zero-padding according to the padding property. (#37151)
- Add paddle.nn.MaxUnPool3D and paddle.nn.MaxUnPool1D, for computing 3D and 1D max unpooling. (#38716)
- Add paddle.incubate.graph_khop_sampler, paddle.incubate.graph_sample_neighbors, and paddle.incubate.graph_reindex, to support multi-hop graph neighbor sampling and graph reindexing operations, mainly used for graph neural network model training. (#39146, #40809)
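The CELU activation added above has a simple closed form, sketched here in pure Python (an illustration of the published formula, not Paddle’s kernel):

```python
import math

def celu(x, alpha=1.0):
    # CELU(x) = max(0, x) + min(0, alpha * (exp(x / alpha) - 1))
    return max(0.0, x) + min(0.0, alpha * math.expm1(x / alpha))

assert celu(2.0) == 2.0                          # positive inputs pass through
assert abs(celu(-1.0) - math.expm1(-1.0)) < 1e-12  # smooth negative branch
```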
Add random number class APIs:
- Add paddle.poisson, to generate a Tensor that obeys the Poisson distribution with the lambda parameter. (#38117)
- Add paddle.randint_like, to generate a new Tensor that obeys a uniform distribution in the range [low, high), with the shape of the output matching the shape of the input. (#36169)
- Add paddle.Tensor.exponential_, an in-place API that fills the input Tensor with exponentially distributed random numbers. (#38256)
Add parameter initialization class APIs:
- Add paddle.nn.initializer.Dirac, to initialize 3D/4D/5D parameters with the Dirac delta function, commonly used for initializing Conv1D/Conv2D/Conv3D parameters in convolution layers. (#37389)
- Add paddle.nn.initializer.Orthogonal, for orthogonal matrix initialization; the initialized parameter is a (semi-)orthogonal matrix. (#37163)
- Add paddle.nn.initializer.calculate_gain, to get the recommended gain value for an activation function; the gain value can be used with certain initialization APIs to adjust the initialization range. (#37163)
Add learning rate class API:
- Add paddle.optimizer.lr.MultiplicativeDecay, to set the learning rate through a lambda function. (#38250)
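Multiplicative decay of this kind can be sketched in pure Python (a hypothetical helper, not Paddle’s scheduler): each epoch, the current rate is multiplied by the value of a user-supplied lambda.

```python
def multiplicative_decay(base_lr, lr_lambda, epochs):
    # Record the learning rate used at each epoch, then multiply
    # by the lambda's value to get the next epoch's rate.
    lr = base_lr
    history = []
    for epoch in range(epochs):
        history.append(lr)
        lr *= lr_lambda(epoch)
    return history

print(multiplicative_decay(0.1, lambda epoch: 0.5, 3))  # [0.1, 0.05, 0.025]
```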
Add distributed-related APIs:
- Add new optimizer-related APIs: (#40710)
  - paddle.incubate.optimizer.functional.minimize_bfgs, second-order optimizer BFGS.
  - paddle.incubate.optimizer.functional.minimize_lbfgs, second-order optimizer L-BFGS.
- Add the paddle.incubate.multiprocessing module, to provide Tensor (CPU/GPU) data transfer between Python processes. (#37302, #41339)
- Add paddle.incubate.autotune.set_config, to support auto-selection among multiple Kernel versions, automatic mixed precision data layout conversion, and automatic num_workers selection for DataLoader, to automatically improve model performance. (#42301)
- Add paddle.incubate.nn.FusedMultiTransformer and paddle.incubate.nn.functional.fused_multi_transformer, to fuse multiple transformer layers into a single op to improve model inference performance. Note that only the forward pass is supported. (#42311)
- Add einsum_v2 operators for a consistent interface between imperative and static modes. They are compatible with the original Python-side paddle.einsum implementation while supporting dynamic-to-static export and more complete InferShape inference. (#42495, #42327, #42397, #42105)
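Einstein-summation notation, as accepted by einsum-style APIs, can be illustrated with NumPy’s einsum: the subscript string fully describes the contraction.

```python
import numpy as np

# "ij,jk->ik" is a plain matrix multiplication.
a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)
out = np.einsum("ij,jk->ik", a, b)
assert np.array_equal(out, a @ b)

# "bii->b" is a batched trace: sum each matrix's diagonal.
batch = np.arange(8).reshape(2, 2, 2)
print(np.einsum("bii->b", batch).tolist())  # [3, 11]
```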
IR (Intermediate Representation)¶
Dynamic graph to static graph
- For the variable type StaticAnalysis module, add support for type tags similar to a, b = paddle.shape(x). (#39245)
- Add a computed field, supporting InputSpec.name as the Program cache hash key. (#38273)
- Add syntax support for dict['key'] = x.shape. (#40611)
- Add support for Pure FP16 training. (#36944)
- Add support for the for i in [x, y, z] syntax. (#37259)
- Add support for Python 3 type hint syntax. (#36544)
Pass development
- Add forward and backward fusion for FC + [relu|gelu] based on the NVIDIA cuBlasLt Epilogue. (#39437)
Kernel Primitive API
- Add KP operators on the GPU platform, including cast, scale, clip, bce_loss, abs_grad, reduce_sum_grad, reduce_mean_grad, full, full_like, distribution, random, masked_select_kernel, where_index, masked_select_grad, dropout, sigmoid, and where. (#36203, #36423, #39390, #39734, #38500, #38959, #39197, #39563, #39666, #40517, #40617, #40766, #39898, #39609)
- Add support for the XPU2 source code compilation mode. (#37254, #40397, #38455)
- Add support for KP operator reuse on XPU2 and GPU, including reduce, broadcast, elementwise_add, exp, log, relu, sigmoid, leaky_relu, softplus, hard_swish, and reciprocal. (#36904, #37226, #38918, #40560, #39787, #39917, #40002, #40364)
- Add unit tests of KP operators on the XPU2 platform, including brelu, ceil, celu, elu, floor, hard_shrink, hard_sigmoid, log1p, logsigmoid, relu6, silu, soft_relu, softsign, sqrt, square, swish, thresholded_relu, and softshrink. (#40448, #40524)
- Add support for XPU2 KP models, including resnet50, deepfm, wide_deep, yolov3-darknet53, det_mv3_db, bert, transformer, mobilenet_v3, and GPT2.
Mixed Precision Training¶
- Split the paddle.amp.GradScaler.unscale_ method out of the minimize method of the mixed precision training paddle.amp.GradScaler, to provide a separate interface for unscaling the gradients. (#35825)
- Add FP16 support for paddle.nn.ClipGradByGlobalNorm in dynamic graph mode, and add an FP16 Kernel for the clip op so that clip-related operations support FP16 computation. (#36198, #36577)
- Support the case where the optimizer parameter passed to paddle.amp.decorate is None. (#37541)
- For the merged_momentum op, add support for input of multiple learning rates, computing for the use_nesterov policy, and regularization computing. (#37527)
- Add the multi_tensor policy to the paddle.optimizer.Momentum optimizer, and add a set_to_zero branch to clear_grad of the Optimizer class. (#37564)
- Add the multi_tensor policy to paddle.optimizer.Adam. (#38010)
- Add the multi_precision policy to the paddle.optimizer.SGD optimizer. (#38231)
- Add storage of the master weight parameter to the optimizer's state_dict method. (#39121)
- Add support for op CUDA bfloat16 mixed precision training, supporting the O1 and O2 modes; enable these training modes via paddle.amp.auto_cast. (#39029, #39815)
- Add bfloat16 CUDA Kernels for the following ops: matmul, concat, split, dropout, reshape, slice, squeeze, stack, transpose, unbind, elementwise_max, elementwise_add, elementwise_mul, elementwise_sub, scale, sum, layer_norm, p_norm, reduce_sum, softmax, log_softmax, sigmoid, sqrt, softplus, square, gaussian_random, fill_constant, and fill_any_like. (#39485, #39380, #39395, #39402, #39457, #39461, #39602, #39716, #39683, #39843, #39999, #40004, #40027)
- Add bfloat16 CPU Kernels for the following ops: dropout, reshape, slice, squeeze, unsqueeze, stack, transpose, unbind, elementwise_max, elementwise_mul, elementwise_sub, and gather. (#39380, #39395, #39402, #39457, #39461, #39602, #39716, #39683)
- Support printing of Tensors with bfloat16 data. (#39375, #39370)
- Add FP16 computation support for p_norm, elementwise_max, fill_constant_batch_size_like, and scatter. (#35888, #39907, #38136, #38499)
- Add int16_t support for the following ops: cumsum, less_than, less_equal, greater_than, greater_equal, equal, not_equal, fill_any_like, gather_nd, reduce_sum, where_index, reshape, and unsqueeze. (#39636)
- Add int16_t label type support for the cross_entropy op. (#39409)
- Add int16_t id type support for the embedding op. (#39381)
- Add FP16 type support for the reduce_mean op. (#38289)
- Add FP16 type support for the elementwise_min op. (#38123)
- Update the bfloat16 AMP oneDNN default support list. (#39304)
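The loss-scaling idea behind GradScaler can be sketched in pure Python (names are illustrative, not Paddle’s API): the loss is multiplied by a scale factor before the backward pass so small FP16 gradients do not underflow, then the gradients are unscaled before the parameter update.

```python
# A minimal sketch of static loss scaling for mixed precision training.
def scaled_backward(grad_fn, params, scale=2.0 ** 15):
    # Scale the gradients as if the loss had been multiplied by `scale`
    # before backprop (gradients are linear in the loss)...
    scaled_grads = [scale * g for g in grad_fn(params)]
    # ...then unscale before handing them to the optimizer update.
    return [g / scale for g in scaled_grads]

# A toy quadratic loss 0.5 * sum(p**2), whose gradient w.r.t. p is p itself.
params = [1e-4, -2e-4]
grads = scaled_backward(lambda ps: list(ps), params)
assert all(abs(g - p) < 1e-12 for g, p in zip(grads, params))
```

In practice the scale is adjusted dynamically: it grows while gradients stay finite and shrinks when an overflow (inf/NaN) is detected, which is the part GradScaler automates.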
Paddle HIgh reusability operator library¶
We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive APIs, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework’s performance and reusability, particularly in operator development. Such problems include inefficient cross-operator reuse, unclear operator interfaces, and the lack of direct C++ calls into the operator library. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user’s development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins. The main contents include the six parts below:
The implementation of the operator library infrastructure, core components and mechanisms: reasonably plan the directory structure of the new operator library; design and implement its common base data structures, the new functional InferMeta and Kernel development paradigm, and the corresponding registration and management components. Support automated generation of Kernel compilation targets and compilation dependencies, allowing developers to focus only on the functional Kernel implementation and making the development paradigm clear and concise. (#34425, #37107, #36946, #36948, #37876, #37916, #37977, #38078, #38861, #39123, #39131, #39748, #39790, #39941, #40239, #40635, #41091, #37409, #37942, #39002, #38109, #37881, #37517, #39870, #40975, #39475, #37304, #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101)
Operator library C++ API system construction: design and implement a YAML-configuration-file-based operator definition paradigm, to automatically generate more than 200 C++ operator class APIs for internal and external developers to reuse. This reduces the cost of repeated development of basic operators. (#37668, #36938, #38172, #38182, #38311, #38438, #39057, #39229, #39281, #39263, #39408, #39436, #39482, #39497, #39651, #39521, #39760, #40060, #40196, #40218, #40640, #40732, #40729, #40840, #40867, #41025, #41368)
Operator library compatible with various execution systems: Implement new InferMeta and Kernel to access the original dynamic and static graph execution system. Support the safe removal of the original OpKernel registration and migration to the new Kernel form. (#34425, #38825, #38837, #38842, #38976, #39134, #39140, #39135, #39252, #39222, #39351)
Decouple the underlying data structures and tool functions of the operator library from the framework: relieve PHI’s dependence on the framework for core data structures, lay the foundation for subsequent independent compilation of PHI, and support infrt, custom Kernel, and a series of PHI-based construction work. (#38583, #39188, #39560, #39931, #39169, #38951, #38898, #38873, #38696, #38651, #39359, #39305, #39234, #39098, #39120, #38979, #38899, #38844, #39714, #39729, #39889, #39587, #39558, #39514, #39502, #39300, #39246, #39124)
Integration and improvement of the custom operator mechanism with PHI: support calling the over 200 C++ operator class APIs automatically generated by PHI when writing custom operators. This reduces custom operator development costs. A series of bugs are fixed. (#37122, #37276, #37281, #37262, #37415, #37423, #37583, #38776, #39353, #41072)
Operator scale migration and refactoring: migrate about 250 high-frequency forward and backward operator Kernels to the new operator library and refactor them as single functions. Achieve high-performance operators by encapsulating multiple base Kernel functions on the C++ side for fast combination. Meanwhile, add the corresponding YAML operator definitions and access the new dynamic graph execution system to improve Python API scheduling performance. The migrated and refactored operators include:
sqrt （#40727）
square（#40727）
sin (#40175)
sinh (#40175)
elementwise_fmax（#40140）
elementwise_fmin（#40140）
p_norm (#40819)
fill_constant_batch_size_like (#40784)
conv2d（#39354）
conv3d（#39354）
mish（#40727）
gather (#40500)
sgd (#40045)
momentum (#41319)
rmsprop（#40994）
adam (#40351)
layer_norm（#40193）
adagrad（#40994）
adamax (#40173)
adadelta (#40173)
ceil (#40913)
cos (#40175)
atan (#40175)
cosh (#40175)
erf（#40388）
asin (#40175)
acos (#40175)
scale (#39278)
elementwise_pow (#40993)
round (#40913)
floor (#40913)
pow (#40913)
elementwise_floordiv (#40993)
reciprocal（#40727）
log1p (#40785)
allclose (#40469)
mul (#40833)
elementwise_max (#40590)
elementwise_min (#40590)
elementwise_mod (#40590)
fill_any_like (#39807)
dot（#38359）
sum (#40873)
diag_v2 (#39914)
one_hot_v2 (#39876)
bce_loss (#39868)
argsort (#40151)
arg_max (#40222)
arg_min (#40222)
segment_pool (#40099)
dist (#40178)
isnan_v2 (#40076)
logical_and (#39942)
logical_not (#39942)
isfinite_v2 (#40076)
logical_or (#39942)
isinf_v2 (#40076)
is_empty (#39919)
logical_xor (#39942)
less_than（#39970）
not_equal（#39970）
equal（#39970）
less_equal（#39970）
equal_all（#39970）
uniform_random (#39937)
randperm (#41265)
unbind (#39789)
bernoulli (#39590)
where (#39811)
log10 (#40785)
log2 (#40785)
expm1（#40727）
atan2 (#39806)
empty (#38334)
tan (#40175)
bitwise_and （#40031）
bitwise_not（#40031）
bitwise_or（#40031）
poisson（#39814）
cholesky_solve（#40387）
bitwise_xor（#40031）
triangular_solve（#40417）
sigmoid (#40626)
atanh (#40175)
softsign（#40727）
thresholded_relu (#40385)
tanh_shrink (#40565)
stanh（#40727）
reduce_mean (#37559)
reduce_max（#40225）
reduce_min (#40374)
reduce_all (#40374)
reduce_any (#40374)
logsumexp (#40790)
softshrink（#40565）
stack（#40581）
tile (#40371)
unique（#40581）
unstack（#40581）
slice（#40736）
transpose2（#39327）
unsqueeze2（ #40596）
squeeze2（ #40596）
strided_slice (#40708)
softmax (#39547)
leaky_relu (#40385)
gelu (#40393)
prelu (#40393)
log_softmax (#40393)
elu (#40565)
logsigmoid (#40626)
kthvalue（#40575）
mode (#40571)
yolo_box（#40112）
yolov3_loss (#40944）
temporal_shift（#40727）
depthwise_conv2d（#39354）
pad3d (#40701)
pad（ #40012）
greater_equal（#39970）
kldiv_loss (#39770)
isclose (#39770)
silu (#40565)
unfold (#39778)
batch_norm (#39347)
norm（#39324）
label_smooth (#39796)
grid_sampler (#40585)
greater_than（#39970）
nearest_interp_v2 (#40855)
bilinear_interp_v2 (#40855)
softmax_with_cross_entropy (#40832)
rnn (#41007)
reverse (#40791)
trace (#39510)
kron（#40427）
accuracy（#39982）
dropout（#40148）
bincount (#39947)
assign_value (#40967)
assign (#40022)
cast (#37610)
where_index (#40255)
cumprod (#39770)
shard_index (#40254)
lookup_table_v2（#39901）
adamw (#40351)
tanh (#40385)
cross (#39829)
split (#39060)
linspace (#40124)
huber_loss (#39761)
hierarchical_sigmoid（#40553）
nll_loss (#39936)
exp（#40727）
rsqrt（#40727）
viterbi_decode (#40186)
conj (#38247)
lgamma (#39770)
relu (#40175)
log (#40785)
bilinear_tensor_product（#39903）
logit (#37844)
broadcast_tensors（#40047）
gumbel_softmax（#39873）
diagonal （#39575）
multi_dot (#40038)
matrix_power (#40231)
digamma（#39240）
masked_select（#39193）
determinant (#40539)
eigh (#40213)
shape (#40248)
reduce_prod (#39844)
histogram（#39496）
meshgrid (#41411)
brelu (#40385)
hard_swish (#40913)
hard_shrink (#40565)
selu (#39819)
expand_v2 (#39471)
top_k_v2（#40064）
expand_as_v2（#40373）
swish (#40913)
hard_sigmoid (#40626)
exp, det, assign, gaussian_random, matrix_rank, eye, and deformable_conv. (#41755, #41737)
New Dynamic Graph Execution Mechanism¶
To improve the scheduling performance and custom development capability of PaddlePaddle's dynamic graph execution mechanism, we have reconstructed its underlying execution mechanism. With the new execution method, the PHI operator library can be used for efficient runtime execution, and operators supported by the PHI operator library gain a significant improvement in scheduling performance when switched to the new dynamic graph mode. However, because upgrading the overall framework execution mechanism requires a huge amount of work, and because this work is tightly coupled with the PHI operator library, we still do not enable this execution method by default in this version. If you want to try it, you can switch to it by setting the environment variable FLAGS_enable_eager_mode=1. The details are as follows:
Implementation of the dynamic graph execution infrastructure, core components and mechanism: by staticizing dynamic-graph-related execution code, the original homogeneous operator construction is converted into specific calls to different PHI APIs, greatly reducing scheduling overhead. (#36059, #37323, #37556, #37555, #37478, #37458, #37479, #37599, #37659, #37654, #39200, #39309, #39319, #39414, #39504, #39526, #39878, #39963)
New dynamic graph execution mechanism subfunction development and adaptation: support more flexible and complete dynamic graph subfunctions such as hook, pylayer, double_grad, inplace, amp, etc. (#41396, #40400, #40695, #41043, #40915, #41104, #41350, #41209, #40830, #40891, #36814, #37377, #37193, #36965, #37810, #36837, #38488, #39282, #39449, #39531, #39638, #39674, #39893, #40170, #40693, #40937, #41016, #41051, #41121, #41198, #41287, #41380, #41306, #41387, #40623, #40945, #39282, #39449, #38488)
Automatic code generation mechanism for new dynamic graph execution: When we are trying to split the computation and scheduling logic of a large number of homogeneous operators into different specific scheduling logics, we find that it is a huge workload. So we introduce a new automatic code generation logic to generate code and thus simplify the runtime logic of dynamic graphs. Meanwhile, in order to adapt to the various types of runtime logic in the previous framework, we also use some complicated compilation techniques to obtain information at runtime to generate more accurate scheduling code. (#37574, #37575, #37639, #37723, #37753, #37812, #37837, #37910, #37943, #37992, #37959, #38017, #37969, #38160, #38085, #38562, #38573, #39192, #39215, #39355, #39358, #39328, #39233, #39628, #39767, #39743, #39897, #39797, #39997, #40058, #40080, #40107, #39962, #40132, #40276, #40266, #40480, #40482, #40368, #40650, #40815, #40907, #40935, #41089)
New dynamic graph execution mechanism accessed into the main framework and Integration test: we currently use some environment variables to distinguish between static graph mode and dynamic graph mode (including new dynamic graph and old dynamic graph mode). We have adapted most logics of dynamic graphs in these modes. However, there are still a lot of problems being fixed. (#37638, #37643, #37653, #38314, #38337, #38338, #39164, #39326, #40391, #40201, #40854, #40887)
Update some judgment logics under dynamic graphs, to support fast execution paths for dynamic graphs in compatible forms. (#40786)
Non-static graph mode (current transition scheme): _non_static_mode().
Determine whether the new dynamic graph is in use in dynamic graph mode (recommended judgment logic): _in_dygraph_mode().
Determine whether the old dynamic graph is in use in dynamic graph mode (not recommended; it will be deprecated in future versions): _in_legacy_dygraph().
Enable the old dynamic graph and disable the new dynamic graph in dynamic graph mode: _enable_legacy_dygraph() or exit _test_eager_guard().
Enable the new dynamic graph and disable the old dynamic graph in dynamic graph mode: _disable_legacy_dygraph() or with _test_eager_guard().
Determine whether the new dynamic graph is in use in static or dynamic graph mode: _in_eager_without_dygraph_check().
Support inplace after dynamic graph reconstruction: input and output are the same Tensor.
Adapt the inplace strategy for dynamic graph reconstruction intermediate states. (#40400)
Adapt the inplace strategy to the final state of the dynamic graph reconstruction. (#40695)
Add inplace strategy to the PyLayer function after dynamic graph reconstruction. (#41043)
Add inplace strategy for Tensor's setitem function after dynamic graph reconstruction. (#40915)
Add the _reset_grad_inplace_version interface after dynamic graph reconstruction, to set the inplace version of the Tensor's gradient to 0. (#41101)
If the value of the forward Tensor is not needed during the backward computation (the no_need_buffer property), no inplace version detection is needed for that Tensor; skip the inplace version check for Tensors with no_need_buffer. (#41350)
Unify error messages for inplace version checks before and after the reconstruction of dynamic graphs. (#41209)
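The inplace-version bookkeeping described above can be illustrated with a small framework-independent sketch (all names here are hypothetical stand-ins, not Paddle internals):

```python
class Tensor:
    """Toy tensor that tracks an inplace version counter, as dynamic graphs do
    to detect when a value saved for backward was later overwritten in place."""
    def __init__(self, data):
        self.data = list(data)
        self.inplace_version = 0

    def add_(self, other):
        """An inplace op: mutate the data and bump the version counter."""
        self.data = [a + b for a, b in zip(self.data, other)]
        self.inplace_version += 1
        return self

def check_inplace_version(saved_version, tensor, no_need_buffer=False):
    """Raise if the tensor changed after being saved for backward.
    Tensors marked no_need_buffer skip the check, as described above."""
    if no_need_buffer:
        return
    if tensor.inplace_version != saved_version:
        raise RuntimeError("tensor was modified by an inplace op after being saved")
```

A backward pass would record `tensor.inplace_version` at save time and call `check_inplace_version` before reusing the value.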
Support view strategy after dynamic graph reconstruction: input and output Tensor share underlying data.
Add support for weakref on the Python side of the new dynamic graph eager Tensor. (#41797)
Enhance the new dynamic graph DoubleGrad function to support the basic DoubleGrad feature. (#41893, #41894, #41895)
Add the core.eager.StringTensor interface, to support constructing StringTensor on the Python side and using StringTensor-related APIs. (#41039)
Add _grad_name and _grad_value to core.eager.Tensor to return the name and value of a gradient. (#41990)
Add processing of the no_need_buffer attribute for dynamic graph intermediate state. Tensors with the no_need_buffer attribute are skipped in the inplace backward check operation. (#41720)
New Static Graph Executor¶
In order to solve the problems that the original static graph executor of PaddlePaddle does not schedule well in some scenarios and that multiple streams are hard to use, we have implemented a new static graph executor with superior performance. It can easily take advantage of asynchronous scheduling across multiple streams and multiple threads. The new executor is a compatible upgrade of the original executor. At present, it is used by default in single-card scenarios; users do not need to make any changes in their training code. We also provide an interface to switch back to the original executor by setting the environment variable FLAGS_USE_STANDALONE_EXECUTOR=false. (#41179) The main contents are as follows:
Basic components: a high-performance thread pool for multi-threaded scheduling in the executor (#35470, #35930, #36030, #36480, #36688, #36740, #38335, #40770) and a thread co-op component (#38779, #40876, #40912); timely memory recovery after operator execution (#37642, #39617, #40859); a new dependency analysis algorithm for the parallel executor (#37231); etc.
Scheduling logic: optimize operator scheduling in the executor; support a multi-stream, multi-threaded asynchronous scheduling mechanism; turn transforms such as data type, device, and layout into operator-level scheduling to improve performance; support caching the operator Kernel selection; support selecting new PHI operators. (#35024, #34922, #35711, #35928, #39458, #36899)
Interface compatibility: Compatible with the user interface and functionality of the original executor, such as alignment with python interface Executor.run(), support for managing Tensor in Scope, etc. This ensures that users can switch to the new executor without perception. (#37278, #37379, #37445, #37510, #40955, #41778, #41058, #38584, #37957, #37672, #37474, #37085, #37061, #36945)
Enhance debugging and error reporting in multi-threaded scenarios by capturing error reports from sub-threads and throwing them uniformly in the main thread, to improve user experience. (#36692, #36802)
Fix the bug of the new executor communication flow resetting stream cache information in the allocator, to reduce RecordStream overhead in cross-stream scenarios. This improves the performance of DeepFM models by about 8%. (#42046)
Optimize the dependency analysis method between new executor operators to improve runtime performance. Establish correct dependencies for send/recv communication operators to support pipeline parallel. (#42009)
Distributed Training¶
Basic functions of multi-machine multi-card parallel training based on collective communication
Add support for elastic training, which enables scaling the number of workers up and down and resuming the training process after node failure, to improve the fault tolerance of distributed training. (#36684, #37177, #37781)
Refactor the launch startup module, add master collaboration and the nnodes (node number) definition, to improve the ease of use of distributed startup. (#40086, #40568, #40782, #40844, #40936, #41190, #41314)
Add support for GPU/NPU/XPU multi-hardware heterogeneous training. (#37613, #37998)
Add fleet_executor asynchronous pipeline executor. (#36966, #37049, #37087, #37126, #37150, #37203, #37167, #37282, #37319, #37462, #37507, #37533, #37576, #37605, #37691, #37742, #37783, #37809, #37862, #37882, #37934, #38024, #38083, #38164, #38261, #38290, #40607, #37093, #37106, #37143, #37338, #37376, #37485, #37531, #37623, #37693, #37755, #37807, #37889, #38420, #38539, #36892, #37084, #37158, #37361, #37509, #37603, #37703, #37824, #38114, #38322, #38535, #38650, #38709, #38799, #38839, #38904)
Add distributed inference function for large-scale models. (#38795, #39012, #39032, #39076, #39194, #39207, #39241, #39603, #39758, #39992)
Dynamic graph hybrid parallelism
Reconstruct paddle.distributed.fleet.utils.recompute, to support the new dynamic computational graph. (#41396)
Add pure FP16 training to support data parallelism. (#36420)
Add MoE (Mixture of Experts) parallel strategy, to support large-scale MoE model training. (#41092, #40895, #40850, #39224)
Add GroupSharded parallel strategy. Support stage1, stage2, stage3, with both synchronous and asynchronous communication. It can be used together with basic functions such as Recompute, AMP O1\O2, Offload, GroupShardedClipGrad, and GroupShardedScaler. (#37489, #37568, #37707, #37836, #37947, #38151, #38407, #38052, #39112, #38989, #39171, #39285, #39334, #39397, #39581, #39668, #40129, #40396, #40488, #40601, #37725, #37904, #38064)
Static graph hybrid parallelism
Add scale_gradient flag bit to gradient_scale_configs, to control the position where the gradient aggregation operation averages the gradients under pipeline parallelism. (#36384)
Under tensor parallelism, the dropout op supports settings of deterministic random seed generators, to ensure random consistency for non-distributed variables and randomness of distributed variables. (#36228)
NPU hybrid parallelism supports Offload, saving 40% of NPU memory. (#37224)
Add force_cpu optional parameter to the seed op, to allow dropout to read seed values directly from the CPU. (#35820)
Improve the Automatic Sparsity (ASP) sharding strategy and support selecting the sharding strategy according to the program. (#40028)
Automatic parallel
Add the process restart (relaunch) after automatic mapping between logical processes and physical devices. (#37523, #37326)
Improve the underlying mechanism and interface for automatic parallel to facilitate the unification of modules and add the optimized pass. (#36617, #38132)
Add unified resource representation, to support for automatic mapping between logical processes and physical devices. (#37091, #37482, #37094)
Improve the distributed attribute complementation for the backward and update parts of the computation graph. (#36744)
Add data slicing function. (#36055)
Add tensor resharding function to reshard the tensor according to the distributed properties of the tensor and operator. (#40865, #41106)
Add the automatic conversion pass of distributed parameters when the number of resources or parallel policy changes. (#40434)
Add GradientMerge pass to reduce the number of communications and improve training efficiency. (#38259, #40737)
Add Recompute pass to reduce the activation memory storage. (#38920)
Add Sharding optimization pass, to support p-g-os 3-stage optimization. (#38502)
Add fused QKV parallelization for Transformer class model. (#39080)
Improve the sharding propagation for while op to ensure convergence of the fixpoint algorithm. (#39939, #39086, #39014)
Support training and inference for subblock and while op control flow. (#39612, #39895, #40077)
Parameter Server
Add NaN/Inf value checking tool under GPUPS. (#38131)
Under GPUPS, add set_date interface to adapt incremental training. (#36194)
Under GPUPS, add asynchronous release dataset function. (#37790)
Under GPUPS, support dumping parameters and intermediate layers. (#36157)
Under GPUPS, support the optimizer parameter configuration. (#39783, #39849)
Under the Unified Parameter Server, refactor the base classes of each module such as communication and storage, to improve the ease of secondary development of each module. (#41207, #41022, #40702, #39341, #39377, #39191, #39064)
Add evaluation metrics module under the Unified Parameter Server, to support AUC/WuAUC/MaskAUC and other evaluation metrics calculation and customizable extensions. (#38789)
Support XPU parameter server training on KUNLUNXIN 2. (#41917, #42266, #41916)
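As background for the metrics module above, the AUC part can be sketched with the standard rank-sum (Mann-Whitney) formulation; this is a generic illustration, not Paddle's implementation:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.
    Tied scores receive averaged ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average rank within tie groups
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    rank_sum, pos, neg = 0.0, 0, 0
    for r, (_, label) in zip(ranks, pairs):
        if label == 1:
            rank_sum += r
            pos += 1
        else:
            neg += 1
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)
```
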
Profiler¶
Add the performance analysis module paddle.profiler in the Python layer: provide the ability to collect, export, and count performance data during the training process. (#40065, #40357, #40888)
paddle.profiler.Profiler: performance analyzer, the interface for user interaction. (#41029, #41524, #41157, #40249, #40111, #39964, #40133)
paddle.profiler.RecordEvent: provide custom punches to record time. (#39693, #39694, #39695, #39675, #41445, #41132)
paddle.profiler.ProfilerTarget: specify the target device for performance analysis.
paddle.profiler.ProfilerState: indicate the state of the performance analyzer.
paddle.profiler.SortedKeys: specify the sorting method of the data within the statistics form.
paddle.profiler.make_scheduler: the scheduler generating the performance analyzer state, implementing periodic control of the collection scope.
paddle.profiler.export_chrome_tracing: save performance data to a Google Chrome tracing file viewable with the chrome://tracing plugin. (#39316, #39984, #41029)
paddle.profiler.export_protobuf: save performance data to a protobuf file represented by an internal structure. (#39519, #39109, #39474)
paddle.profiler.load_profiler_result: load the performance data saved to a protobuf file.
paddle.profiler.Profiler generates statistics for data reading, step overhead, and throughput for model training when the timer_only parameter is specified. (#40386)
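To illustrate how a scheduler of the make_scheduler kind can drive periodic collection, here is a simplified stand-in: the state names mirror the APIs above, but the signature and behavior are a sketch, not the real paddle.profiler.make_scheduler (which, for example, also supports a skip_first option):

```python
from enum import Enum

class ProfilerState(Enum):
    CLOSED = 0              # not collecting
    READY = 1               # warming up, data discarded
    RECORD = 2              # collecting
    RECORD_AND_RETURN = 3   # last collecting step of a cycle

def make_scheduler(*, closed, ready, record, repeat=0):
    """Return a step -> state function cycling through CLOSED x closed,
    READY x ready, RECORD x record; repeat=0 means cycle forever."""
    period = closed + ready + record
    def schedule(step):
        if repeat and step >= repeat * period:
            return ProfilerState.CLOSED
        pos = step % period
        if pos < closed:
            return ProfilerState.CLOSED
        if pos < closed + ready:
            return ProfilerState.READY
        if pos == period - 1:
            return ProfilerState.RECORD_AND_RETURN
        return ProfilerState.RECORD
    return schedule
```

A training loop would call `schedule(step)` each iteration and start/stop collection on the state transitions.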
Refactor Profiler underlying infrastructure in C++ layer
Modify the name and type of logging for ops under the new dynamic graph. (#41771)
Add Kernel running statistics into profilers' summarization and optimize the summarization. (#41989)
Remove side effects on forward-computing performance when the Profiler is off. (#42142)
CINN compiler adoption¶
With the recent development of PaddlePaddle's compiler CINN (GitHub - PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks), the Paddle framework has also been changed to adapt to CINN features. These include the subgraph management related functions for the Paddle-CINN runtime, optimization of memory and speed performance, and bug fixing during development.
Functions developed:
Subgraph op related functions:
Add the function to find and generate CINN subgraphs from computational graphs. (#36345)
Add cinn_launch op as a runtime entry point to CINN. It is responsible for scheduling CINN to compile the subgraph, initialize the data, and execute the generated kernels. (#36600)
Add a helper class CinnLaunchContext to the kernel implementation of cinn_launch op to manage the intermediate data for compiling and running subgraphs, to improve scalability and code readability. (#37938)
Add additional fetch nodes to CINN subgraphs, thus ensuring that CINN external nodes can fetch the values of variables. (#37172, #37190)
Add the function to symbolize a CINN subgraph, which is used to topologically sort the subgraph and return the CINN execution sequence. (#36417)
Add a CinnCompiler class for invoking subgraphs in the CINN compiled graph that can be replaced by using CINN operators. (#36562, #36975)
Add the interface to the CINN symbolization class to get the names of subgraph fetched variables, to prevent fetched variables from being eliminated in compilation optimizations. (#37218)
Checking, debugging, and API changes related:
Synchronize the update of NetBuilder API name changes in CINN. (#40392)
Add necessary log information to Paddle-CINN for better debugging. (#36867)
Add the bidirectional conversion function between Paddle desc and CINN desc. (#36100)
The operator implemented in CINN may not use some input variables compared to Paddle. Therefore, remove the check that the input variables must be used in the cinn_launch op. (#37119)
Add cinn_instruction_run op for invoking CINN to execute a single generated instruction, facilitating the construction of scheduling run subgraphs on the Paddle side. (#39435, #39576)
Add control macros to Paddle for the CUDA/CUBLAS/MKL/CINN pass application required to compile CINN. (#37066, #36660)
Add two control flags FLAGS_allow_cinn_ops and FLAGS_deny_cinn_ops to control the categories of CINN operators used to replace native operators during Paddle training. (#36842)
Performance optimization:
Speed optimization
Optimize the computational time consumed by CinnCacheKey. (#37786, #37317)
Cache variable scope for CINN compiled subgraphs to reduce runtime parameter construction overhead. (#37983)
Utilize CINN's auto-tuning for subgraph compilation; it can be enabled by a flag for further tuning of training performance. (#41795)
Refactor the correctness check of compilation results for subgraph compilation, to avoid repeated checks at runtime and reduce the scheduling overhead. (#41777)
Enable the TransposeFolding and GemmRewriter optimization passes by default in Paddle-CINN training. (#41084)
Pass the CUDA stream created in Paddle into CINN so that Paddle and CINN can use the same CUDA stream for CUDA computing. (#37337)
Move CINN optimization pass application logic from Paddle to CINN. (#42047, #42070)
Device memory optimization
Add NoNeedBufferVars to cinn_launch op to declare a list of input variables that do not require a buffer, so that the memory can be freed in advance. (#38367)
Pass reference count information for external variables into the subgraph, so that subgraphs within cinn_launch can reuse memory optimization passes and reduce the memory overhead of using CINN. (#39209, #39622)
Add the function to convert a collection of executable instructions generated by CINN compilation into a Paddle Graph, supporting reuse of the Paddle scheduler and memory optimization pass, further reducing the memory overhead of using CINN. (#39724, #39911)
Add the Kernel of cinn_instruction_run op, to support dynamic device memory requests based on data types inferred from compilation results. (#40920)
Bug fixing:
Fix and optimize the generation logic of CINN subgraphs. (#36503)
Fix the bug that Paddle-CINN does not support no-input subgraphs. (#40814)
Fix an error reported due to CINN not being able to handle useless outputs in operators such as batch_norm. (#36996)
Fix several bugs in CINN subgraph partitioning and symbolization, and solve problems with Paddle training accessing CINN. (#36739, #36698)
CINN does not yet support control flow; add logic to skip control flow when it is encountered. (#40812)
Other¶
Model quantization
Upgrade quantization storage format to unify quantization formats for dynamic and static graphs. (#41041)
Add new post training quantization (PTQ) methods: EMD and Adaround. (#40421, #38460)
Support to quantize more operations in PTQ and QAT, such as crop, split, ab, unsqueeze etc. (#40083)
Support to quantize operators in control flow. (#37498)
Support quantization of matmul_v2 operator. (#36469)
Add support for quantized matmul_v2 inference on TensorRT. (#36594)
CUDA memory optimization
Implement a multi-stream safe Allocator to support safe and efficient use of CUDA memory in asynchronous computing scenarios. (#37290)
Add new APIs (paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved) for GPU memory monitoring in runtime. (#38657)
Support allocating CUDA Managed Memory to train super large models in memory-constrained scenarios. (#39075)
Add GetBasePtr interface in C++ to get device address created with cudaMalloc. (#37978)
Reduce the number of free blocks in AutoGrowth Allocator to improve memory allocation performance. (#35732)
Remove the redundant Float32 temporary tensor and cast operation for tensors with data type FP16 in initializer.Normal and initializer.Constant, to save 2x memory. (#38818)
High-order derivative testing for models in dynamic graphs.
Custom op: support custom ops on the ROCm (HIP) platform. (#36771)
Cost Model: add a basic Cost Model based on profiling information. (#35774)
Add a function to allow users to add their own layers and corresponding pruning methods to ASP support. (#40253)
Add a string tensor data structure, allowing the framework to represent and process strings. (#39830, #40992)
Add or upgrade oneDNN FP32/int8/bfloat16 Kernel, including:
ELU (#37149)
exp (#38624)
stack (#37002)
softplus (#36382)
round (#39653)
shape (#36033)
flatten and flatten2 (#35892)
slice (#37630)
elementwise_mul (#40546)
elementwise_add (#38176)
elementwise_div (#36158)
elementwise_sub (#35662)
roi_align (#37848)
assembly optimized Adam (#39158)
logsoftmax (#39793)
activation (#40721)
mul (#38552)
mean (#37104)
relu (#36265)
pool2d (#37081)
concat (#35889)
LayerNorm (#40418)
Add the 3-stage storage graph retrieval engine based on SSD - host memory - GPU device memory, to support large-scale graph neural network training. (#42472, #42321, #42027)
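The tiered-storage idea behind the SSD - host memory - GPU device memory engine can be sketched with a generic two-tier cache that demotes evicted entries instead of dropping them. This is an illustration only, not Paddle's implementation:

```python
from collections import OrderedDict

class TwoTierCache:
    """A small fast tier (think GPU memory) backed by a larger slow tier
    (think host memory); fast-tier evictions demote entries rather than
    discard them, and slow-tier hits promote them back."""
    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # LRU order: oldest first
        self.slow = {}
        self.cap = fast_capacity

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.cap:
            old_key, old_val = self.fast.popitem(last=False)
            self.slow[old_key] = old_val        # demote, don't discard

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:                    # promote on access
            self.put(key, self.slow.pop(key))
            return self.fast[key]
        return None
```

A real engine adds the third (SSD) tier below the slow tier with the same demote/promote discipline.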
Add a heterogeneous multi-cloud training communication module switch, implement the Send/Recv interface functions, and support communication across multiple heterogeneous clouds. (#40965, #40911)
(2) Function optimization¶
API¶
Add backward implementation of paddle.linalg.det. (#36013)
Add support for mixed precision training O2 mode for paddle.Model, i.e., support for the Pure FP16 training mode of the original dynamic/static graphs. (#36441)
Support self chain calls for paddle.nn.Layer. (#36609)
Add the is_distributed property setting to the to method of paddle.nn.Layer, to ensure that the distributed properties remain consistent before and after network parameter transform. (#36221)
Improve the parameter conversion logic of the to method of paddle.nn.Layer, to reduce the peak memory consumption of the conversion process and improve the conversion success rate. (#36862)
Support setting the shape of the output Tensor for paddle.incubate.graph_send_recv, to reduce the memory usage during the actual computation. (#40509)
Add support of int32 and int64 data types for paddle.incubate.segment_sum, segment_mean, segment_max, and segment_min. (#40577)
Add support of the bool type for the transpose op. (#35886)
Switch the paddle.mm underlying operator from matmul to matmul_v2. (#35770)
Support static graph mode and unknown shapes for paddle.einsum. (#40360)
Support data parallelism for paddle.nn.functional.margin_cross_entropy and paddle.nn.functional.class_center_sample. (#39852)
Support input of shape [1] for paddle.nn.functional.grid_sample. (#36183)
Support NHWC data format for paddle.nn.PRelu. (#37019)
Support a fixed random state using paddle.seed for paddle.nn.functional.class_center_sample. (#38248)
Add ROCM backend support for all APIs under paddle.fft, and optimize CUFFT backend error messages. (#36415, #36114)
Support the function that the slicing dimension is 0, i.e., allow slicing index results to be empty. (#37313)
Support int and bool type Tensors with a bool index for Tensor.setitem. (#37761)
Support nearest mode for paddle.nn.functional.interpolate when the input shape is 5D. (#38868)
Add support of int16 for paddle.nn.Embedding and paddle.gather. (#40964, #40052)
Support data parallelism on a single machine on the CPU platform in paddle.distributed.spawn. (#35745, #36758, #36637)
Add depthwise_conv2d MKLDNN operator. (#38484)
Add complex type checks in the static graph model for the APIs paddle.abs, paddle.transpose, paddle.squeeze, paddle.unsqueeze, paddle.matmul, and paddle.full. (#40113)
Support tuple and list type arguments for paddle.autograd.PyLayer. (#38146)
Add a check of whether the tensor is inplace and leaf when calculating the gradient. (#37931)
Support the HIP library for paddle.autograd.PyLayer. (#38184)
Support more sizes of inputs for paddle.take_along_axis and paddle.put_along_axis, and allow the index matrix shape size to be larger than the array matrix shape size. (#39072)
Optimize the error message of the API paddle.nn.Pad2D when replicate is 0. (#36510)
Support pad input in tuple format for the API paddle.nn.Pad2D. (#35985)
Add tdm_sample API in paddle.distributed.InMemoryDataset to support sampling operations in TDM algorithms. (#37044)
Add Pre-saving Hooks mechanism for paddle.jit.save. (#38186)
Add new higher-order differentiation-related APIs:
elementwise_add: add third-order Kernel, to support computation of third-order differentiation. (#36508, #36618)
matmul_v2: add third-order Kernel, to support computation of third-order differentiation. (#36459)
elementwise_mul: add third-order Kernel, to support computation of third-order differentiation. (#37152)
Improve the logic of paddle.amp.GradScaler to call the check_finite_and_unscale op, to eliminate the cudaMemcpy introduced by the creation of the bool variable. (#37770)
Add checks for the unstack and unique ops in case of an input Tensor with 0 elements. (#36021)
Add a new multi-layer, bidirectional LSTM function supporting KUNLUNXIN 2, to improve RNN forward/backward ops and support temporal model training. (#42076)
Add bce_loss forward/backward ops for KUNLUNXIN 2. (#41610)
IR(Intermediate Representation)¶
Dynamic Graphs to Static Graphs
Optimize the behavior of the ProgramCache.last interface for dynamic graph to static graph, so that it returns the most recently used Program instead of the final generated Program. (#39541)
Optimize the error message of the paddle.reshape API for dynamic graph to static graph, and add a new recommended usage hint. (#40599)
Optimize the type of exception caught in the is_api_in_module function when transcribing dynamic code to static code. (#40243)
Optimize the error message hints for dynamic graph to static graph; hide warning information by default. (#39730)
Add support of type hint syntax for dynamic graph to static graph, to improve the accuracy of variable type analysis. (#39572)
Optimize the paddle.cond function to allow values to be equal for basic types such as bool and int. (#37888)
Optimize the decorator @to_static to allow switching the train/eval mode. (#37383)
Optimize the error report stack for dynamic graph to static graph, to highlight user-related code and reduce the framework's redundant error stack. (#36741)
Remove the no_value placeholder from the return value of paddle.cond. (#36513, #36826)
Adapt the run_program op to the new dynamic graph mode. (#40198, #40355)
Add check for zip syntax. (#37846)
Fix the dynamic graph to static graph failure due to errors of dimension and type judgment in paddle.signal.frame, paddle.signal.stft and paddle.signal.istft. (#40113)
Add registration of complex type Kernels for the mean and pad3d ops. (#40113)
Mixed Precision Training¶
Distributed Training¶
Basic functions of the distributed training
Optimize Fleet API and DistributedStrategy configuration to use dynamic graph parallel function conveniently. (#40408)
Optimize Dynamic Graph mixed parallel HybridParallelClipGrad strategy, support 4D hybrid parallel and Pure FP16 training. (#36237, #36555)
Restructure dynamic graph data parallel strategy, to support new dynamic graph and communication. (#40389, #40593, #40836, #41119, #41413, #39987)
Support distributed tensor model parallel for fused_attention op. (#40101)
Support the distributed tensor model parallel for fused_feedforward op. (#40160)
Graph retrieval engine
Optimize the data format returned by the graph sampling interface of the graph engine, with a 3x improvement of the sampling speed. (#37315)
Reduce the amount of graph engine threads to improve performance. (#37098)
Optimize graph engine data transfer to improve performance. (#37341)
Optimize the merge logic of embedding op to improve performance by exploiting the topological relationship of embedding op in the model. (#35942)
Communication library: restructure the communication library to improve its scalability and ease of secondary development, and support heterogeneous communication. (#41398, #39720, #40911, #40579, #40629, #40437, #40430, #40228, #40181, #40100, #40097, #39892, #39384, #39737, #40040)
Support the publication of MoE-related interfaces in paddle.incubate.distributed.models.moe (moe.GShardGate, moe.BaseGate, moe.SwitchGate, moe.MoELayer, and moe.ClipGradForMOEByGlobalNorm). (#42300)
Fix the error report in the use of recomputation in paddle.incubate.distributed.models.moe.MoELayer. (#42128)
Fix the error report in the new dynamic graph pipeline parallel caused by different data types. (#41937, #42053)
Fix the error report in the new dynamic graph tensor model parallel due to different data types. (#41960)
Custom operator¶
Enhance the C++ custom operator mechanism for writing second-order gradient operators, to support adding suffixes to the gradient input variables of second-order gradient operators for use as outputs. (#41781)
Remove the use of the deprecated enumeration type PlaceType from the Tensor API member methods, keep it compatible, and add a deprecation warning. (#41882)
Add deprecation warnings for a number of deprecated interfaces of the original Tensor API, including the incomplete constructor, reshape, mutable_data, and copy_to methods. (#41882)
Other¶
Error report and debugging optimization
Optimize the error message of the label boundary check for the cross_entropy op. (#40001)
Add profile records for the infer_shape and compute methods of op execution in dynamic graphs, and show their cost in the timeline. (#39023)
Replace pybind::index_error error hints on Windows for unknown exceptions. (#40538)
Add error messages in the out-of-bounds checks for the scatter op. (#37429)
Download tool: for the problem of slow decompression of directories with multiple files in paddle.utils.download.get_path_from_url, replace the original way of decompressing files one by one (traversing the directory in a loop) with calling extractall on the whole directory, which greatly improves the decompression speed. (#37311)
Speed up the quantization training for fake_quantize_range_abs_max, fake_quantize_abs_max, fake_quantize_dequantize_abs_max, fake_quantize_moving_average_abs_max, etc. (#40491)
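For reference, the abs_max family of fake-quantize ops simulates quantization by scaling with the maximum absolute value, rounding to the integer grid, and dequantizing back so that training sees the quantization error. A minimal sketch, not the optimized kernels referenced above:

```python
def fake_quantize_abs_max(x, bit_length=8):
    """Fake-quantize a list of floats: quantize to a signed integer grid
    using the max-abs scale, then dequantize back to floats."""
    qmax = (1 << (bit_length - 1)) - 1          # 127 for 8-bit
    scale = max(abs(v) for v in x) or 1.0       # avoid dividing by zero
    quantized = [round(v / scale * qmax) for v in x]
    return [q * scale / qmax for q in quantized], scale
```

The moving_average and range variants differ only in how the scale is tracked across steps rather than recomputed per batch.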
(3) Performance optimization¶
Distributed Training¶
Hybrid parallel optimizer sharding_optimizer supports the optimize_cast optimization, which moves the parameter cast in the forward and backward stages to the optimizer stage. This improves performance by 7%. (#35878)
GPUPS optimization: support gradient fuse allreduce training. This improves training performance by 20%. (#35131)
GPUPS optimization: dump CPU optimization speed improves by 3.21x. (#40068)
CPU parameter server streaming training optimization: support automatic statistics of sparse parameters, incremental saving of sparse parameters, etc. The training performance improves by 20%. (#36465, #36601, #36734, #36909, #36943, #37181, #37194, #37515, #37626, #37995, #38582, #39250, #40762, #41234, #41320, #41400)
Autotuning¶
Add hardware-aware automatic performance tuning for the full training process, with performance improvements of about 3% to 50% or more on image classification, segmentation, detection, and image generation tasks compared to the models' default configurations. The auto-tuning status is set via the paddle.incubate.autotune.set_config API. By default, it is currently disabled. Auto-tuning has three specific levels:
Add the auto-tuning function to paddle.io.DataLoader, to select the best num_workers based on training data and device resources. (#42004)
Add the mixed-precision training data layout auto-tuning feature, to select the best data layout based on device type and data type, and automatically convert it at runtime. (#41964)
Add the automatic tuning of the required workspace size threshold for Conv, which is automatically set based on the GPU’s currently available requested device memory resources. Add the automatic selection of Conv cuDNN algorithms based on the generic AlgorithmCache design and Kernel timing component, which supports data variation length models.（#41833）
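The Conv tuning described above combines a kernel timing component with a shape-keyed cache, so variable-length workloads pay the tuning cost only on the first occurrence of each input configuration. The general idea can be sketched in pure Python (hypothetical class, not Paddle's internal AlgorithmCache):

```python
import time

class AlgorithmCache:
    """Pick the fastest algorithm per input shape and cache the choice.

    A simplified illustration of exhaustive-search autotuning with a
    shape-keyed cache, not Paddle's actual AlgorithmCache.
    """
    def __init__(self, algorithms):
        self.algorithms = algorithms      # name -> callable(shape)
        self.cache = {}                   # shape -> winning algorithm name

    def pick(self, shape):
        if shape not in self.cache:       # time candidates only on a miss
            timings = {}
            for name, fn in self.algorithms.items():
                start = time.perf_counter()
                fn(shape)
                timings[name] = time.perf_counter() - start
            self.cache[shape] = min(timings, key=timings.get)
        return self.cache[shape]

# Two toy "algorithms" with very different costs.
algos = {
    "naive": lambda shape: sum(range(shape[0] * 2000)),
    "tiled": lambda shape: sum(range(shape[0])),
}
cache = AlgorithmCache(algos)
choice = cache.pick((64, 64))   # tunes once, then reuses the cached winner
```

Subsequent calls with the same shape skip timing entirely, which is why the feature suits models whose input lengths vary over a bounded set of shapes.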
Operator Optimization¶
Optimize FasterTokenizer performance, with a 10% performance improvement compared to pre-optimization. (#36701)
Optimize index_select inverse computation, with a 3.7x to 25.2x performance improvement over pre-optimization. (#37055)
Optimize the performance of paddle.nn.ClipByGlobalNorm. Taking a 10*10 paddle.nn.Linear as an example, performance improves by about 30% compared to pre-optimization. (#38209)
Optimize the performance of pnorm with very large or very small axis dimensions, with a 31x to 96x improvement in forward speed and a 1.1x to 19x improvement in backward speed. (#37685, #38215, #39011)
Optimize softmax forward and backward performance, with a speedup ratio of about 2x for the axis!=-1 configuration. (#38602, #38609, #32387, #37927)
Optimize log_softmax forward and backward performance, with a speedup ratio of about 6x to 20x for axis!=-1 configurations. (#38992, #40612)
Optimize softmax_with_cross_entropy forward and backward performance, with a speedup ratio of about 1.3x for the hard_label configuration. (#39553, #40424, #40643)
Optimize top_k performance, with a speedup ratio of more than 22x for the one-dimensional, larger-k (k=5000) configuration. (#40941)
Optimize elementwise_mul backward computation, with a 1.85x to 12.16x performance improvement over pre-optimization. (#37728)
Optimize elementwise_min and elementwise_max backward computation, to equalize or improve performance by 1.05x to 18.75x over pre-optimization. (#38236, #37906)
Optimize nearest_interp forward and backward computation, with forward performance improved by 1.5x to 2.3x and backward performance improved by 60% to 1.8x over pre-optimization. (#38528, #39067)
Optimize bilinear_interp forward and backward computation, with forward performance improved by 0.4x to 2.3x and backward performance improved by 10% to 30% over pre-optimization. (#39243, #39423)
Optimize dropout forward and backward computation, with performance improved by about 20%. (#39795, #38859, #38279, #40053)
Optimize grid_sampler forward and backward computation, with forward performance improved by 10% to 30% and backward performance improved by 10% to 60% over pre-optimization. (#39751)
Optimize group_norm forward and backward computation, with forward performance improved by 1.04x to 2.35x and backward performance improved by 1.12x to 1.18x. (#39944, #40657, #39596)
Optimize conv1d forward and backward computation, with forward performance improved by 1.00x to 2.01x and backward performance improved by 1.01x to 474.56x. (#38425)
Optimize elementwise_div backward computation, with backward performance improved by 1.02x to 29.25x. (#38044)
Optimize gelu forward and backward computation, with forward performance improved by 1.13x to 1.43x and backward performance improved by 1.10x to 1.55x. (#38188, #38263)
Optimize elementwise_sub backward computation, with backward performance improved by 1.04x to 15.64x. (#37754)
Optimize flip's forward performance on one-dimensional data input, with performance improved by 100%. (#37825)
Optimize layer_norm forward and backward computation, with forward performance improved by 2x to 5x and backward performance improved by 20% to 50% over pre-optimization. (#39167, #39247)
Optimize embedding forward and backward computation, with a maximum improvement of 1.51x in forward performance and 1.03x to 7.79x in backward performance. (#39856, #39886)
Optimize gelu FP16 forward and backward computation, with forward performance improved by 9% to 12% and backward performance improved by 2% to 9% over pre-optimization. (#38980)
Remove the explicit CPU -> GPU data transfer in gather_nd forward and backward operators, and remove the explicit synchronization in index_select forward and backward operators. Change the GPU -> GPU data transfer in scatter_nd from a synchronous to an asynchronous operation. (#40933)
Optimize the Lars optimizer computation, with the training performance of the ResNet50 FP16 model improved by 5.1% over pre-optimization. (#35652, #35476)
Optimize AvgPool2dGrad computation, with performance improved by 2.6x over pre-optimization. (#35389)
Optimize Elementwise computation for multivariate output, improving performance by up to 15% over pre-optimization. (#38329, #38410)
Optimize paddle.sum performance, with performance improved by about 20%. (#42309)
Remove the CudaStreamSync operation from paddle.nn.ClipGradByGlobalNorm to reduce scheduling overhead during execution, with a 5% performance improvement on PTB models. (#42170)
Optimize a series of underlying data structures and detailed implementations in the original dynamic graph execution system to improve its scheduling performance. (#42010, #42171, #42224, #42256, #42306, #42329, #42340, #42368, #42425)
Simplify the probs computation logic of paddle.distribution.Categorical, to improve performance by 4x to 5x. (#42178)
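Two of the items above touch global-norm gradient clipping (ClipByGlobalNorm, ClipGradByGlobalNorm); the underlying computation is simple enough to state in a few lines (a pure-Python sketch, not Paddle's kernel):

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients so their joint L2 norm is at most clip_norm.

    grads: list of flat gradient lists. Illustrative only.
    """
    # Global norm is the L2 norm over every element of every gradient.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        return grads                       # already within bounds
    scale = clip_norm / global_norm        # single shared scaling factor
    return [[g * scale for g in grad] for grad in grads]

clipped = clip_by_global_norm([[3.0], [4.0]], 1.0)  # global norm is 5.0
```

Because every gradient shares one scale, the relative directions of the update are preserved; only the overall magnitude is bounded.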
(4) Bug fixing¶
API¶
Fix the output type error with paddle.sum when the input parameter type and output parameter type do not match and the number of reduce elements on the axis is 1. (#36123)
Fix an AttributeError in paddle.flops when the layer output type is tuple. (#38850)
Fix paddle.diag failing to propagate gradients because there is no backward kernel. (#40447)
Fix an error in sorting paddle.sort input with NaN values. (#41070)
Fix the error when paddle.full_like's input contains an INF value. (#40232)
Fix the bug in paddle.strided_slice: the strided_slice result is not consistent with slice when the data in the starts input is less than the rank. (#39066)
Fix the bug in the max_pool family of operators where infer_shape is calculated incorrectly when an index is returned. This affects the APIs: paddle.nn.functional.max_pool1d/2d/3d, paddle.nn.functional.adaptive_max_pool1d/2d/3d, paddle.nn.MaxPool1D/2D/3D, paddle.nn.AdaptiveMaxPool1D/2D/3D. (#40139)
Fix an issue where the dtype of the pooling_mask returned by the max_pool family of operators is incorrect. Now the dtype of pooling_mask is int32. The affected APIs are paddle.nn.functional.max_pool1d/2d/3d, paddle.nn.functional.adaptive_max_pool1d/2d/3d, paddle.nn.MaxPool1D/2D/3D, paddle.nn.AdaptiveMaxPool1D/2D/3D. (#39314)
Fix the bug with paddle.shape where the backward gradient by default causes a computation error. (#37340)
Fix the bug in paddle.nn.Layer's to method when converting both dtype and place at the same time. (#37007)
Fix the bug that paddle.amp.decorate fails to rewrite the parameters of non-leaf network layers to FP16. (#38402)
Fix the bug that paddle.amp.decorate rewrites the non-input parameters in paddle.nn.BatchNorm1D, paddle.nn.BatchNorm2D, and paddle.nn.BatchNorm3D to FP16. (#38541)
Fix the bug that paddle.amp.decorate rewrites the non-input parameters in paddle.nn.SyncBatchNorm to FP16. (#40943)
Fix redundant warnings in paddle.nn.Layer.to. (#36700)
Fix the bug in paddle.nn.RNN when used inside control flow. (#41162)
Fix the bug that paddle.to_tensor fails to specify the CUDAPlace of the Tensor. (#39662)
Fix the issue that paddle.nn.Identity is not exposed. (#39615)
Fix the bug where the output values of the fill_ and zero_ inplace APIs are incorrect when the input is on a CUDAPinned Place after the dynamic graph reconstruction. (#41229)
After refactoring the dynamic graph, fix the bug of an incorrect inplace version value of the output Tensor when calling the assign op via append op. Change it to call the assign op via _C_ops. (#41118)
Remove unreasonable code in elementwise_add's third-order kernel, and fix an uninitialized issue in the network creation process. (#36618)
Fix the missing attribute bug in conv2d execution of the cuDNN kernel. (#38827)
Fix an issue where the multiclass_nms3 output shape is incorrect. (#40059)
Fix an issue with yolo_box outputting an incorrect shape. (#40056)
Fix an issue where the higher-order differentiation gradients interface does not take effect as expected when target_grad is specified. (#40940)
Fix an issue that the network parameter type is incorrect when default_dtype is modified in the _BatchNormBase base class in dynamic graph mode. The affected APIs are paddle.nn.BatchNorm1D, paddle.nn.BatchNorm2D, paddle.nn.BatchNorm3D, and paddle.nn.SyncBatchNorm. Specific reason: when get_default_dtype() == 'float16', the default parameter data type is modified by set_default_dtype('float32'). The parameter type in dynamic graph mode is created from default_dtype; therefore, the change of the default parameter type causes a Parameter type error in subsequent networking. (#36376)
Fix the bug of an undefined intermediate variable in the backward op of the batch_norm op when the data type is FP32, dims = 2, and data_layout = NHWC. (#37020)
Fix the bug that the shape of weights is incorrect when using paddle.static.nn.prelu in static graph mode with input format NHWC and mode==channel. (#38310)
Fix the bug of paddle.nn.functional.class_center_sample: CUDA seed setting issue in the multi-machine case. (#38815)
Fix the bug of failing to report an error when the input of paddle.nn.functional.one_hot is incorrect. (#41335)
Fix an issue where a callback to reclaim device memory on a DCU device is not triggered in time, resulting in an OOM of the device memory. (#40445)
Fix the bugs of abnormal setitem backward gradients and abnormal inplace logic handling in some dynamic graph scenarios. (#37023, #38298)
Fix the bug of an abnormal index when a Tensor array uses Slice to index in dynamic-to-static scenarios. (#39251)
Fix the bug of memory or device memory leaks caused by some temporary variables not being correctly destructed when the paddle.Tensor.register_hook interface is used. (#40716)
Fix the bug that Tensor.getitem cannot get the value when the index is a bool Tensor with all False. (#41297)
Fix the bug that Tensor.getitem cannot get the value when the index is a bool scalar Tensor. (#40829)
Fix the bug in paddle.index_select when index is a 0-shape Tensor. (#41383)
Fix the bug when the number of GPU threads requested by paddle.index_select and paddle.index_sample exceeds the machine's resource limits. (#41127, #37816, #39736, #41563)
Fix the bug when the ReduceConfig, elemwise_grad, gather, gather_nd, and scatter ops request more GPU threads than the machine's resource limits. (#40813, #41127)
Fix the bug that memory access is out of bounds when NX != 1 in ReadData, ReadDataBc, and ReadDataReduce in the Kernel Primitive API. (#36373)
Fix the bug of abnormal computation results due to data overflow caused by the IndexRandom data type error. (#39867, #39891)
Fix the bug of an erroneous computing result of the reduce op when reduce_num = 1. (#38771)
Fix the bug of out-of-bounds memory access of the reduce op in the middle dimension of reduce in HIP environments. (#41273)
Fix the bug of the kernel failing to be properly released in the computation of two FP16 one-dimensional vectors in the matmul op.
Fix the bug caused by CUDA integer computation overflow for some operators, including: bernoulli, gaussian_random, gumbel_softmax, multinomial, truncated_gaussian_random, uniform_random_inplace, and uniform_random ops. (#37670)
Fix the bug where paddle.nn.Sequential reports a KeyError when traversing sublayers in a for loop. (#39372)
Fix the bug of the check shape error in paddle.nn.functional.unfold when compiling in static graphs. (#38907, #38819)
Fix the bug of reporting an error if axis is specified when using dropout in static graphs. (#37223)
Migrate the matmul operator in paddle.nn.MultiHeadAttention to the matmul_v2 operator. (#36222)
Fix the bug of an FPE being thrown when an empty Tensor is used in paddle.nn.functional.label_smooth. (#35861)
Fix the deformation bug of the reshape op when input is an empty Tensor. Support reshaping an empty Tensor to [-1]. (#36087)
Fix the bug of modified values incorrectly overriding other rows when the fill_diagonal's input parameter offset is non-zero. (#36212)
Modify stop_gradient returned by the range op being set to True in dynamic graph mode. (#37486)
Fix the bug where the Lamb optimizer is updated incorrectly when Beta1Pow and Beta2Pow are on the GPU. (#38518)
Fix the bug where the conv2d operator does not respect FLAGS_cudnn_deterministic. (#37173)
Fix the bug caused by an earlier version of cufft that does not define CUFFT_VERSION. (#37312)
Fix the computing error of paddle.ifftshift and paddle.fftshift. (#36834, #36748)
Fix the axis computation error in the paddle.fft series of APIs. (#36321)
Fix an output data type registration bug of the batch_norm_grad op in case of the FP16 data type. This bug causes compilation failure in some scenarios, and also impacts FP16 computational precision. (#42461)
Fix the incorrect Infershape information bug in the paddle.nn.functional.pad API when padding is a Tensor in dynamic-to-static conversion. (#42414)
Fix an exception in paddle.distribution.StickBreakingTransform when the input dimension exceeds 2. (#41762)
Fix a nan/inf bug in the QK^T computation in the fused_attention op. (#42032)
Fix a nan/inf bug in the FusedResidualDropoutBias computation of the fused_attention op on V100. (#42398)
Fix a redundant data transform bug introduced by the full_like op during execution. (#41973)
Fix a problem with the p_norm op calculating nan in GPU environments. (#41804)
Fix a section error of the split op when the sections parameter has a size of 0. (#41755)
Fix the bug of reporting "not supporting Place (gpu:0)" in multi-card training when broadcast is required in 6 elementwise ops (pow, complex, divide_double, multiply_double, fmax, and fmin). (#42332)
Fix the bug that a deprecated interface reports a warning on import paddle due to a PIL version update. (#42307)
Fix the bug that paddle.linalg.matrix_rank does not support tol as an FP64 Tensor under static graphs. (#42085)
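Several of the getitem fixes above concern boolean-mask indexing. The intended semantics, illustrated in plain Python (not Paddle code), are that an all-False mask selects an empty result rather than failing:

```python
def masked_select(values, mask):
    """Select values[i] where mask[i] is True.

    A plain-Python model of boolean-mask indexing; an all-False mask
    yields an empty list instead of raising.
    """
    if len(values) != len(mask):
        raise ValueError("mask length must match values length")
    return [v for v, m in zip(values, mask) if m]

out = masked_select([10, 20, 30], [False, False, False])  # empty result
```

The fixed bugs were the framework failing on exactly these degenerate masks (all-False and scalar bool), where an empty or zero-dimensional selection is the correct answer.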
IR(Intermediate Representation)¶
Dynamic to static graphs
Fix a type derivation error in reverse gradient accumulation when tensor_array is used with control flow. (#39585, #39689)
Fix an issue where the parameter gradient type is not set correctly during dynamic-to-static AMP training. (#40938)
Fix an issue of reporting an error in dynamic-to-static transcription when there are misplaced annotations in the code. (#39035, #38003)
Fix an issue where Tensor is not properly converted to Variable when calling a non-forward function in dynamic-to-static code. (#37296, #38540)
Fix an issue where paddle is incorrectly passed as a variable during dynamic-to-static transcription. (#37999)
Fix an issue where model parameters are incorrectly counted when calling paddle.flops after model dynamic-to-static conversion. (#36852)
Fix an issue where GPU memory keeps growing in train mode and no_grad contexts after loading models using the paddle.jit.save/load interface. (#36434)
Add a warning in the convert_call function when converting a generator function. (#35369)
Fix the run_program op dependency analysis bug. (#38470)
Fix the code conversion bug when returning a single value in control flow For. (#40683)
Fix the bug when generating a reverse op when the input to the conditional_block op contains LoDTensorArray. (#39585)
Fix the bug that paddle.jit.save loses the forward_pre_hook and forward_post_hook of the top Layer in the export of a dynamic-to-static model. (#42273)
Fix the dynamic-to-static conversion error report when the shape parameter in paddle.expand contains a Tensor. (#41973)
Distributed Training¶
Distributed training basic functions
Fix the bug of a port reporting error in distributed multi-machine training. (#37274)
Fix the brpc compilation dependency bug. (#37064)
Fix an occupied port issue due to TCP self-connections when Fleet starts. (#38174)
Fix the precision degradation bug under data parallel due to inconsistent initialization of FP16 parameters under multiple cards. (#38838, #38563, #38405)
Fix the precision degradation under data parallel due to FP16 gradient synchronization without dividing by the number of cards. (#38378)
Dynamic graph hybrid parallelism
Fix the bug where parameters are not updated in FP16 mode under mixed parallel by using the new update interface. (#36017)
Static graph hybrid parallelism
Fix an issue where grad merge is not compatible with ClipGradientByGlobalNorm in distributed dp mode. (#36334)
Fix an issue under hybrid parallelism where the nondistributed parameters of tensor model parallelism are not broadcast during the initialization phase, resulting in inconsistent nondistributed parameters across cards. (#36186)
Fix the issue that sharding’s save_persistables interface does not save FP16 parameters and offload persistent variables when sharding is enabled with offload. (#40477)
Fix the bug where ema parameters are not saved on non-0 cards when sharding is enabled for training. (#39860)
Fix an issue where FC incorrectly calculates gradients according to column cuts. (#38724)
Fix the bug reported when DistributedStrategy is set to without_graph_optimizer when used with rnn. (#36176)
GPUPS Parameter Server Training
Fix the CPU branch compilation bug triggered by the GPUPS macro definition. (#37248)
Fix an occasional error raised when saving delta and pullsparse concurrency during GPUPS streamline training. (#37233)
Fix a download error issue caused by HDFSClient querying a directory without returning the full path. (#36590)
Fix the bug with pulling old parameters in GPUPS streamline training. (#36512)
Fix a GPUPS multistream allocation issue. (#37476)
Fix the bug of the GPUPS pybind out of core. (#37287)
Other¶
Fix the clip_extra issue when saving models for dynamic graph quantization training. (#38323)
Fix an issue with abs_max scale initialization for dynamic graph quantization training. (#39307)
Fix an issue of exceptions in saving model in dynamic graph quantization training. (#38102, #38012)
Fix the offline quantization flatten op output error. (#37722)
Fix the non-matching dimension bug in case of inverse quantization of the matmul op. (#36982)
Fix the bug of adding quantization op when quantizing matmul_v2 without weights. (#36593)
Fix the error of saving the quant_axis attribute in conv op channel-wise quantization when saving models. (#39054)
Fix the slow training of channel-wise quantization. (#40772)
Fix the bug of quantization training producing nan when dividing by a tensor initialized as 0. (#36762)
Fix incorrect settings of amp_level for mixed precision in multithreaded scenarios. (#39198)
Fix an issue where mixed precision is not set correctly for PyLayer and Recompute when mixed precision training is used with them. (#39950, #40042)
Fix an issue where -D_GLIBCXX_USE_CXX11_ABI does not take effect when compiling custom operators on Mac. (#37878)
Fix the bug of inconsistent dynamic and static graph behaviors of the initializer-related APIs in case of block=None. (#37827)
Fix the bug in python 3.6 where there is no fluid module. (#35862)
Fix the bug where the paddle.optimizer.AdamW optimizer incorrectly calls the adam op. (#36028)
Fix a logic error when the paddle.optimizer.Momentum optimizer's regularizer property is None under the multi tensor policy. (#38344)
Fix the bug that the paddle.optimizer.Momentum and paddle.optimizer.Adam optimizers modify the multi_precision property under the multi tensor policy. (#38991)
Fix the code compilation error when using the final-state API amp in combination with optional Tensor. (#40980)
Fix the bug where paddle+lite+xpu prediction library would report an error when calling lite CPU prediction, and fix the bug where paddle+lite(without NNAdapter) would report an error when compiling. (#37449)
Fix the bug in Debug compile mode where LoDTensorArray crashes due to inconsistent Pybind11 bindings. (#37954)
Fix the bug that prevents correct construction of a Tensor in the extreme case where the shape parameter is a list mixing Tensor and int. (#38284)
Fix a compatibility issue with the paddle.optimizer.AdamW API. (#37905)
Fix the bug in _InstanceNormBase where the return value of extra_repr is incorrect. (#38537)
Fix the bug that Paddle Inference lacks the symbol paddle::distributed::TensorTable when -DWITH_DISTRIBUTED is used. (#41128)
Fix the bug that the matmul_v2 op reports an error when there is a 0 value in the shape. (#35791)
Fix the problem of repeated printing of the no-gradient-input hint message of recompute in dynamic graphs. Change it to print only once, using a warning. (#38293)
Fix the gelu op bug causing low accuracy on the validation set in later epochs when training vision models. (#38450)
Fix adamw op error in numerical computation. (#37746)
Add the parameters in the sparse_momentum _C_ops interface. (#39969)
Fix the bug where there is no distributed module in python 3.6. (#35848)
Fix the eigh unit test data initialization problem. (#39568)
Fix the eigvalsh unit test data initialization problem. (#39841)
Fix the bug of not working properly due to excessive register usage on V100 by segment op. (#38113)
Fix the bug of conv-related op sparsification with incorrectly set dimensions. (#36054)
Provide the alias Paddle.static.sparsity for static graph-related Automatic SParsity training functions. (#36525)
Fix the bug where the divide op's integer division still returns an integer. (#40890)
Fix the crash bug of paddle.multiplex when the input Tensor value is 0. (#34972)
Fix a speed exception when setting the reduction parameter in paddle.nn.functional.kl_div. (#37283)
Fix the data source unsorted bug in loading the Cifar dataset. (#37272)
Fix the conversion of loss from uint16 to float in the ProgressBar class. (#39231)
Fix the ShareBufferWith shared data type problem. (#37464, #37247)
Fix the performance issue when paddle.io.DataLoader uses IterableDataset and num_workers>0. (#40541)
Fix the bug where paddle.vision.ops.yolo_loss returns incomplete values in dynamic graphs. (#40185)
Remove the restriction that the input parameter dataset of paddle.io.BatchSampler must be of paddle.io.Dataset type, to expand support for user-defined datasets. (#40184)
Fix the bug of paddle.summary reporting that op_flops does not exist. (#36489)
Fix the formula error of the lars_momentum op when lars_weight_decay=0. (#40892)
Fix the bug that optimizer-offload cannot save persistable variables. (#36433)
Fix an issue where optimizer-offload does not support the adamw op type. (#36432)
Fix an issue where enable_program_desc_tracing_data in Tracer is not safe in multithreaded scenarios. (#39776)
Fix an issue where the model file size is not initialized when the model is read. (#40518)
Fix the logic bug of the Expand op. When the dimension of the input Tensor X is smaller than the shape to be expanded, it may result in the incorrect Out.Shape. (#38677)
Fix the dynamic to static transcription error when the Expand_As op takes only y.shape without Y variable entered. (#38677)
Fix the logic error when Expand_As op computes the output shape. (#38677)
Fix the bug that variables of the core.VarDesc.VarType.STRINGS type report an error when getting the lod_level property; set their lod_level to None. (#39077)
Fix an issue where the framework function PyLayer does not support different dtypes. (#37974)
Fix the bug of division by zero in the learning rate decay API paddle.optimizer.lr.PolynomialDecay. (#38782)
Fix the issue where some logs remained after calling the DisableGlogInfo() interface. (#36356)
Fix an error in the backward of multi-layer RNN (when dropout is set to 0) in CPU training of the SimpleRNN, GRU, and LSTM APIs. (#37080)
Add cache for fft on the backend of cufft and hipfft. (#36646)
Enable the shifts parameter of paddle.roll to support Tensor input. (#36727)
Add onemkl to fft as an optional computation backend. (#36414)
Fix the precision bug of the bfloat16 type in the matmul_v2 and elementwise_div ops. (#42479)
Fix a possible error in the next step caused by LoDTensorArray clearing only the internal Tensor and not clearing the Array during device memory recycling. (#42398)
4. Deployment Direction (Paddle Inference)¶
(1) New features¶
New APIs¶
Add the Java API so that Java developers can implement high performance inference on the server and in the cloud through a simple and flexible interface. (#37162)
Add GetTrtCompileVersion and GetTrtRuntimeVersion interfaces for getting TensorRT version information. (#36429)
Add the ShareExternalData interface to avoid memory copies of input data during inference. (#39809)
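The point of ShareExternalData is that the inference tensor aliases caller-owned memory instead of copying it. The idea can be shown in plain Python (illustrative names, not Paddle's API):

```python
class ExternalTensor:
    """Hold a zero-copy view over caller-owned memory.

    Illustrates the idea behind sharing external input data instead of
    copying it; not Paddle's ShareExternalData implementation.
    """
    def __init__(self, buffer):
        # memoryview aliases the caller's buffer: no bytes are copied,
        # and later writes by the caller remain visible to the tensor.
        self.data = memoryview(buffer)

data = bytearray([1, 2, 3, 4])
tensor = ExternalTensor(data)
data[0] = 99             # caller updates its buffer in place
first = tensor.data[0]   # the tensor sees the update without any copy
```

The trade-off is the usual one for shared buffers: the caller must keep the memory alive and stable for the duration of the inference call.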
New functions¶
Add ONNX Runtime backend support. Currently it supports only CPU in the integrated version. (#39988, #40561)
Add support for Ascend 310 inference based on the Paddle Lite subgraph approach. (#35226)
Add the native GPU FP16 inference. (#40531)
For the switch_ir_debug interface, add the dump model function. (#36581)
Add the configuration interface for TensorRT config: void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved), for special data layouts in int8 quantization inference. (#38884)
Add TensorRT inspector output information to the log. It is valid only for TensorRT 8.2 or later. (#38362, #38200)
Add the support of the TensorRT ASP sparse inference. (#36413)
(2) Underlying optimization¶
CPU performance optimization¶
Optimize the caching mechanism of MKLDNN. (#38336, #36980, #36695)
Add matmul_scale_fuse pass. (#37962)
Add MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass. (#37847, #40948)
Add MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass. (#36869)
Add MKLDNN matmul_v2_transpose_reshape_fuse_pass. (#36481)
Add MKLDNN softplus_activation_mkldnn_fuse_pass. (#36657)
Add MKLDNN elt_act_mkldnn_fuse_pass. (#36541)
Add MKLDNN mish operator and conv_mish_mkldnn_fuse_pass. (#38623)
GPU performance optimization¶
Change the inference default GPU memory allocation policy from naive_best_fit to auto_growth, to solve the problem of some models filling up the GPU memory. (#41491)
Support gelu and FC+gelu ops using TensorRT inference. (#38399)
Support deformable_conv inference using TensorRT under static shape. (#36612, #36850, #37345)
Support the nearest_interp_v2 op using TensorRT inference. (#34126)
Add the yolo_box TensorRT plugin to support the input parameters iou_aware and iou_aware_factor, so that the IoU computed by inference is used as a factor for confidence. (#34128)
Support elementwise_sub and elementwise_div calling for TensorRT inference. (#40806, #41253)
Support multiclass_nms3 using TensorRT inference. (#41181, #41344)
Support the flatten_contiguous_range op using TensorRT inference. (#38922)
Support the pool2d attribute padding using TensorRT inference when its dimension is 4, and global_pooling and ceil_mode are True. (#39545)
Support batch_norm and elementwise_add using TensorRT inference when the dimension is 5. (#36446)
Add the reduce int32 and float types to use TensorRT inference. Add reduce_mean GPU operator int32 and int64 registration. (#39088)
Modify the MatmulV2ToMul pass. Modify the qualifier (no support for broadcast) and the op_teller mapping condition. (#36652)
Add the support for TenorRT plugin interface AddPluginV2IOExt. (#36493)
Add the aligned attribute in roi_align op and support for TensorRT inference. (#38905)
Add the support for TensorRT inference with the concat attribute axis = 1. (#39096)
Add TensorRT plugins: preln_emb_eltwise_layernorm and preln_skip_layernorm ops, for ERNIE-like model performance optimization. (#39570)
Add TensorRT fuse passes: preln_embedding_eltwise_layernorm_fuse_pass and preln_skip_layernorm_fuse_pass, for ERNIE-like model performance optimization. (#39508)
Split matmul fusion-related passes based on different backends (GPU, CPU, TensorRT), to support transposing FC weights. (#39369)
Add TensorRT support for the roll, strided_slice, and slice ops in case of dynamic shapes. (#41913, #41573, #41467)
Add div op support for TensorRT. (#41243)
Quantization support
For the PostTrainingQuantization API, add support for paddle.io.DataLoader objects or Python Generator input. (#38686)
ERNIE full quantization model inference supports interleaved data layout. (#39424)
Support PaddleSlim's new quantized model format inference. (#41049)
Add the matmul int8 quantization inference op converter and plugin. (#37285)
Add a pass to determine if all ops in the model can support int8 quantization. (#36042)
Support quantization inference for the FC part of the multi-head attention of the non-variable-length branch. (#39660)
(3) Bug fixing¶
Framework and API fixing¶
Fix the bug of model clipping when saving static graphs. (#37579)
For the C API, add the wrapper PD_Cstr for strings, and provide construction and destruction methods so that users do not need to destruct strings directly via the C runtime library. (#38667)
Fix the logic bug with memory reuse at prediction time. (#37324)
Fix memory reuse error reporting in multithreading. (#37894)
Allow passing empty strings for inference when no weight file is available. (#38579)
Fix an issue of clone not being supported when TensorRT dynamic shape is enabled. (#38520)
Fix multithreaded clone error after TensorRT dynamic shape is enabled. (#40067)
For the lite xpu interface, fix an issue where the xpu card cannot be selected. (#36610)
Add a file existence check to the interface that automatically generates TensorRT dynamic shape parameters. (#36628)
Fix the bug that the MKLDNN does not support conv3d. (#42055)
Backend Capability Fixing¶
Fix the cuDNN default algorithm selection configuration for prediction to use non-deterministic policies. (#41491)
Fix the resource recovery handling error in the deformable_conv op TensorRT plugin. (#38374)
Fix a serialization error in the TensorRT plugin for deformable_conv op. (#38057)
Adapt the new refactor engine and serialization API of TensorRT 8.0. (#36769)
Fix the bug that the Flatten2MatmulFusePass, Squeeze2MatmulFusePass, and Reshape2MatmulFusePass do not take effect. (#37644)
Fix the bug with TensorRT input data reporting errors. (#37427)
Add error message when input dimension is wrong. (#38962)
Fix the bug with EmbEltwiseLayernorm output type error. (#40015)
Remove conv_affine_channel_fuse_pass and the corresponding unit test. (#39817)
Fix an issue where the adaptive_pool2d pass incorrectly replaces the pool attribute. (#39600)
Fix the bug that shuffle_channel_detect_pass incorrectly generates shuffle_channel op. (#39242)
Fix transpose parameter error. (#39006)
Fix the crash bug when nearest_interp_v2 input scale dimension is less than 1. (#38725)
Fix the bug that the prelu does not support onedimensional input in dynamic shape. (#39389)
Fix the bug in the kernel function of slice’s special_slice_plugin. (#39875)
Temporarily disable int8 branch under skip_layernorm variable length to prevent accuracy degradation. (#39991)
Fix some bugs regarding support for preln_ernie models. (#39733)
Fix the bug that slice may exceed the thread limit in ERNIE. Fix the bug that the special_slice is incorrectly triggered. (#39096)
Fix the bug that the elementwise does not support broadcast when the dimension is the same. (#37908)
Fix the problem that the underlying implementation of the nearest_interp op differs when align_corners is True, causing a diff between TensorRT layer results and the native op. (#37525)
Fix qkv_plugin: Kernel function computation error. (#37096)
Fix the bug with inference pass for dynamic quantization. (#35879)
Reuse directly when Tensor requests less memory than the allocated size. (#37880)
Fix the hang bug when ERNIE fixedlength model is enabled with TensorRT. (#37839)
Fix the crash bug when TensorRT int8 lacks of dynamic range information. (#36900)
Fix the bug with slice deserialization code. (#36588)
Fix yolo box calculation formula error. (#36240)
Fix the crash bug when an earlier-version model uses a later version of roi_align. (#38788)
External Developers
Fix the bug of a large performance difference of softmax between Python and C++. (#37130)
Fix matmul inference failure on static-shape 2-dimensional input and dynamic-shape 3-dimensional input. (#36849)
Fix reshape_transpose_matmul_mkldnn_fuse_pass mishandling of shapes. (#36731)
Fix an issue where TensorRT gets 4 dimensions when the input is 2 dimensions. (#36614)
Fix the error report when the scale attribute of the interpolate_v2 MKLDNN operator is null. (#36623)
Fix poor performance of the recurrent operator in multithreaded scenarios. (#36052)
Remove restrictions of relu, sigmoid, tanh, relu6, batch_norm, clip, concat, gelu, hard_sigmoid, prelu, softmax, split, and swish on TensorRT 2-dimensional inputs. (#37097)
Fix reshape op to use TensorRT inference. (#41090)
Fix matmul-related passes to be compatible with matmul_v2. (#36424)
Support VALID and SAME attributes in the padding method of the conv2d operator when TensorRT is enabled. (#38999)
Fix MKLDNN multi-input operator quantization problem. (#39593, #39346, #40717)
Fix scale error of conv+activation in MKLDNN quantization scenarios. (#38331)
Fix the bug in MKLDNN quantization of parameterless operators, where the quantization of subsequent operators was handled differently. (#39342)
Fix a data type related issue in MKLDNN cpu_bfloat16_placement_pass. (#38702)
Fix a split operator execution issue in MKLDNN bfloat16 inference. (#39548)
Fix the bug with MKLDNN matmul_v2 operator not supporting 6 dimensions. (#36342, #38665)
Fix MKLDNN DeviceContext error in MKLDNN matmul_v2_transpose_reshape. (#38554)
Fix incorrectly calculated results for segmentation models in MKLDNN inference scenarios. (#37310)
Fix MKLDNN bfloat16 placement operator list and add the missing operator. (#36291)
Fix format bugs of MKLDNN operators, including: FC, conv_transpose, 6-dimensional Tensor error reporting, and wrong output format of conv for NHWC input. (#38890, #37344, #37175, #38553, #40049, #39097)
Fix errors in MKLDNN multi-threaded inference scenarios caused by the cache mechanism. (#36290, #35884)
Fix MKLDNN quantization model accuracy anomaly caused by matmul and FC. (#38023, #37618)
Fix the abnormal quantization model accuracy issue in MKLDNN quantization conversion scripts caused by missing passes. (#37619, #40542, #38912)
Fix the crash bug in the MKLDNN conv op caused by a data type mismatch. (#38133)
Fix an issue where some MKLDNN ops need to change back to the original layout after modifying the layout. (#39422)
Fix a Python API error report in the Ascend 910 inference scenario, caused by a conflict with the Ascend software stack because the GIL lock was not released. (#38605)
5. Environment Adaptation¶
Compile and Install¶
From version 2.3.0, PaddlePaddle has adjusted and upgraded the types of GPU architectures supported by the framework. (For more information, please refer to: GPU architectures supported by PaddlePaddle)
Notes:
PIP source installation means downloading the installation package and dependency libraries from the official PIP website, using pip install paddlepaddle or pip install paddlepaddle-gpu. This supports fewer architecture types and has a lighter installation package, and only one CUDA version of the installation package is provided (compared with the BOS source).
Prior to version 2.3, the PIP source installer (CUDA10.2) supports the following GPU architectures: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, and 7.5.
From version 2.3 onward, the PIP source installer (CUDA11.0) supports the following GPU architectures: 6.0, 6.1, 7.0, 7.5, and 8.0.
The BOS source is a way to download the installation package and dependency libraries from the official PaddlePaddle website, and it supports more GPU architectures. The download source is hosted in China, which makes it much faster there (compared with the PIP source, it supports more kinds of architectures and provides installation packages for multiple CUDA versions).
Prior to version 2.3, the GPU architectures supported by the BOS source installer on the PaddlePaddle website:
CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
CUDA11: 5.2, 6.0, 6.1, 7.0, 7.5, 8.0
From version 2.3 onward, the GPU architectures supported by the BOS source installer on the PaddlePaddle website:
CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
CUDA11: 3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0
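The wheel coverage above can be summarized in a small lookup. This is an illustrative sketch only: the sets are transcribed from the CUDA11 lists in this note, and wheel_for is our own helper name, not a PaddlePaddle API.

```python
# Compute capabilities covered by each PaddlePaddle 2.3 CUDA11 wheel,
# transcribed from the architecture lists above.
PIP_CUDA11 = {"6.0", "6.1", "7.0", "7.5", "8.0"}
BOS_CUDA11 = {"3.5", "5.0", "6.0", "6.1", "7.0", "7.5", "8.0"}

def wheel_for(compute_cap: str) -> str:
    """Return which 2.3 CUDA11 wheel covers a GPU of this compute capability."""
    if compute_cap in PIP_CUDA11:
        return "pip"          # covered by the plain PIP wheel
    if compute_cap in BOS_CUDA11:
        return "bos"          # only the BOS source wheel covers it
    return "unsupported"

print(wheel_for("7.5"))  # pip
print(wheel_for("3.5"))  # bos (3.5 is no longer in the 2.3 PIP wheel)
print(wheel_for("2.0"))  # unsupported
```

For example, a Kepler-era card (compute capability 3.5) must fall back to a BOS source wheel under 2.3, while anything from Pascal (6.0) up is covered by the PIP wheel.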
Support Python 3.10. Fix compilation bugs caused by some Python C API changes on Windows. (#41180)
The Windows platform supports the compilation through Visual Studio 2019. (#38719)
Eliminate various warnings when compiling on the Windows platform. (#38034, #37890, #37442, #37439, #36857)
Fix jetson compilation issues introduced by the underlying data structure upgrade. (#39669, #39441)
New Hardware Backend Extension¶
Custom device support: provide a plugin way to extend PaddlePaddle hardware backends. With this function, developers do not need to modify PaddlePaddle code for specific hardware; they simply implement the standard interface and compile it into a dynamic link library that PaddlePaddle can call as a plugin. This reduces the development effort of adding a new hardware backend to PaddlePaddle. Currently it supports custom Runtime and custom Kernel.
Support Huawei NPU chip (Ascend 910) training/inference. Support ResNet50, YoloV3, BERT, Transformer and many other models. Support static graph + dynamic graph and auto mixed precision training. Support single-card, multi-card, and multi-machine distributed training.
Support Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU) training/inference. Support ResNet50, BERT and other models. Support static graph training. Support single-card, multi-card, and multi-machine distributed training.
Support Cambricon MLU chip (MLU370x4) training/inference. Support models such as ResNet50. Support static graph + dynamic graph training. Support auto mixed precision training. Support single-card, multi-card, and multi-machine distributed training.
Support KUNLUNXIN 2 chips (KUNLUNXIN AI acceleration cards R200, R300) training/inference. Support ResNet50, YoloV3, OCR-DB, SSD, MobileNetV3, UNet, BERT, Transformer, GPT-2, Wide&Deep, and DeepFM. Support static graph + dynamic graph training. Support auto mixed precision training. Support single-card, multi-card, and multi-machine distributed training.
Thanks to our Contributors¶
This release contains contributions from the project core team as well as:
Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, LifAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, MingXu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza