2.3.0 Release Note¶
1. Important Updates¶
We are excited to release the PaddlePaddle Framework V2.3.0. This version contains the following highlights.
API¶
Added more than 100 new APIs, covering automatic differentiation, linear algebra, probability distribution, sparse tensor, framework performance analysis, hardware device management, vision domain, etc.
Added 4 new automatic differentiation APIs, 11 new linear algebra APIs, and 21 new probability distribution APIs to better support use cases in scientific computing, reinforcement learning, and other application areas.
Added 11 new Sparse Tensor APIs including basic functions of sparse tensor construction and conversion. The COO and CSR formats are supported.
Added 9 new framework performance analysis APIs. The new performance profiling APIs, centered around paddle.profiler.Profiler, help users collect and analyze performance statistics during training and inference.
Added 7 APIs for device management, facilitating hardware information acquisition.
Added several vision and text domain APIs to facilitate the reuse of MobileNetV3, ResNeXt, and other backbone networks for fast network construction.
Paddle HIgh reusability operator library¶
We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive APIs, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework’s performance and reusability, particularly in operator development. Such problems include inefficient cross-operator reuse, unclear operator interfaces, and the lack of direct C++ calls into the operator library. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user’s development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins.
Distributed Training¶
Fully upgrade the adaptive distributed training architecture, including modules such as elastic resource management, asynchronous pipelined executor, heterogeneous communication, and automatic parallelism, and support distributed training and inference on a variety of heterogeneous hardware.
Add MoE parallel strategy, GroupSharded parallel strategy, and Pure FP16 under dynamic graph hybrid Parallelism, which further supports the efficient distributed training of large models under the dynamic graph.
Comprehensively upgrade and optimize the architecture of the general heterogeneous parameter server, and simplify each module, such as communication and storage, to improve the secondary development experience of the parameter server. The performance of the GPU parameter server is improved by 2.38 times when training with 100 billion parameters and 10 billion samples of data.
Compile and Install¶
From version 2.3.0, PaddlePaddle upgrades GPU architectures supported.
Inference Deployment¶
Add the Java API and ONNX Runtime CPU backend.
Support TensorRT 8.0 / 8.2 and structured sparsity, with deep performance optimization for ERNIE-like structural models.
Hardware Backend Extension¶
Add custom device support: provide a plugin mechanism for extending PaddlePaddle hardware backends.
Add training/inference support for multiple heterogeneous chips such as HUAWEI Ascend 910 / GraphCore IPU / Cambricon MLU / KUNLUNXIN 2.
Framework Architecture¶
In this version, we did a lot of work on the framework executor. For details, please see New Dynamic Graph Execution Mechanism and New Static Graph Executor.
2. Incompatibility Upgrade¶
Due to limitation of the binary size, sm35 CUDA ARCH is dropped in precompiled binaries. (#41754)
When paddle.to_tensor converts a Python int scalar to a Tensor, the default data type on Windows changes from int32 to int64, aligning with Linux/Mac. (#39662)
To keep consistency with the division behavior under Python 3, the division symbol / has been changed from “rounding divide” to “true divide”, and the data type of the computed output has been switched from int to float. (#40890)
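The “rounding divide” vs. “true divide” distinction above is the same one Python 3 itself makes between // and /, which can be illustrated in plain Python:

```python
# Under "true divide" semantics, integer inputs produce a float result;
# "rounding divide" corresponds to Python's floor-division operator.
assert 7 / 4 == 1.75            # true divide: float output
assert 7 // 4 == 1              # rounding (floor) divide: int output
assert isinstance(7 / 4, float)
assert isinstance(7 // 4, int)
```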
Revise the ELU formula. The computation for alpha < 0 now aligns with the original paper, fixing a small number of cases where results were calculated incorrectly. Meanwhile, elu_ will report an error when alpha < 0, because it is not mathematically possible to compute the gradient from the output alone when alpha < 0. (#37316)
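The ELU definition referred to above can be sketched in NumPy (an illustration of the formula from the original paper, not Paddle’s kernel):

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU (Clevert et al.): x for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * np.expm1(x))

x = np.array([-1.0, 0.0, 2.0])
out = elu(x)
assert out[2] == 2.0
assert abs(out[0] - np.expm1(-1.0)) < 1e-12
```

Note why the in-place variant cannot work for alpha < 0: the negative branch alpha * (exp(x) - 1) then produces positive outputs that collide with the positive branch, so the input (and hence the gradient) cannot be recovered from the output alone.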
3. Training Framework (with the distributed function)¶
(1) New functions¶
API¶
Add 4 new automatic differentiation APIs to support scientific computing, as listed below: (#40692)
- paddle.incubate.autograd.vjp, to compute the vector-Jacobian product.
- paddle.incubate.autograd.jvp, to compute the Jacobian-vector product.
- paddle.incubate.autograd.Jacobian, to compute the Jacobian matrix.
- paddle.incubate.autograd.Hessian, to compute the Hessian matrix.
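To make the vector-Jacobian product concrete, here is a hand-derived sketch in NumPy for an elementwise square (the function and helper names are illustrative, not Paddle’s API):

```python
import numpy as np

def f(x):
    # Elementwise square: its Jacobian is the diagonal matrix diag(2 * x).
    return x ** 2

def vjp_of_square(x, v):
    # For an elementwise op, v^T J reduces to an elementwise multiply
    # of v with the local derivative 2 * x.
    return v * 2 * x

x = np.array([1.0, 2.0, 3.0])
v = np.ones(3)
print(vjp_of_square(x, v))  # [2. 4. 6.]
```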
Add linear algebra class APIs:
- paddle.linalg.triangular_solve, to solve systems of linear equations with a triangular coefficient matrix. (#36714)
- paddle.linalg.eig, to compute the eigendecomposition of a general square matrix. (#35764)
- paddle.linalg.solve, to compute solutions to systems of linear equations. (#35715)
- paddle.linalg.lstsq, to compute least-squares solutions to systems of linear equations. (#38585, #38621)
- paddle.linalg.qr, to compute the QR decomposition of a matrix. (#35742, #38824)
- paddle.inner, to compute the inner product of matrices. (#37706)
- paddle.outer, to compute the outer product of matrices. (#37706)
- paddle.linalg.cov, to compute the covariance between vectors. (#38392)
- paddle.linalg.cholesky_solve, to compute the Cholesky solution of an equation. (#38167)
- paddle.linalg.lu and paddle.linalg.lu_unpack, to compute the LU decomposition of a matrix and unpack the LU matrix. (#38617, #38559, #38616)
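The two most common of these operations, solving an exact square system and a least-squares overdetermined system, can be illustrated with NumPy’s equivalents:

```python
import numpy as np

# Solve the square system A x = b, then verify by substitution.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = np.linalg.solve(A, b)
print(x)  # [2. 3.]
assert np.allclose(A @ x, b)

# Least squares for an overdetermined system (more equations than unknowns):
# fit y = c0 + c1 * t through three points in the least-squares sense.
A2 = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
b2 = np.array([6.0, 0.0, 0.0])
coef, residuals, rank, sv = np.linalg.lstsq(A2, b2, rcond=None)
```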
Add 21 new probability distribution class APIs for reinforcement learning, variational inference, scientific computing, and other scenarios, including 6 random variable distributions, 13 random variable transforms, and 2 KL divergence APIs, as listed below: (#40536, #38820, #38558, #38445, #38244, #38047)
- paddle.distribution.ExponentialFamily, base class for the exponential distribution family.
- paddle.distribution.Beta, Beta distribution.
- paddle.distribution.Dirichlet, Dirichlet distribution.
- paddle.distribution.Independent, independent distribution, used to create higher-order distributions.
- paddle.distribution.TransformedDistribution, transformed distribution, used to generate higher-order distributions from a base distribution and a series of transforms.
- paddle.distribution.Multinomial, multinomial distribution.
- paddle.distribution.Transform, base class for transforming random variables.
- paddle.distribution.AbsTransform, absolute value transform.
- paddle.distribution.AffineTransform, affine transform.
- paddle.distribution.ChainTransform, chain combination of transforms.
- paddle.distribution.ExpTransform, exponential transform.
- paddle.distribution.IndependentTransform, independent transform, used to extend the event_dim of a transform’s definition domain.
- paddle.distribution.PowerTransform, power transform.
- paddle.distribution.ReshapeTransform, reshape transform.
- paddle.distribution.SigmoidTransform, sigmoid transform.
- paddle.distribution.SoftmaxTransform, softmax transform.
- paddle.distribution.StackTransform, stack transform, used to combine multiple transforms in a stack fashion.
- paddle.distribution.StickBreakingTransform, stick-breaking transform.
- paddle.distribution.TanhTransform, tanh transform.
- paddle.distribution.kl_divergence, to compute KL divergence.
- paddle.distribution.register_kl, to register a user-defined KL divergence calculation function.
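The idea behind chaining invertible transforms, as in an affine transform followed by a sigmoid, can be sketched in pure Python (the function names are illustrative, not Paddle’s API): the inverse of a chain applies each inverse in reverse order.

```python
import math

# A hypothetical two-step chain: affine (scale/shift) then sigmoid.
def affine(x, scale=2.0, shift=1.0):
    return scale * x + shift

def affine_inv(y, scale=2.0, shift=1.0):
    return (y - shift) / scale

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_inv(y):
    # The inverse of sigmoid is the logit function.
    return math.log(y / (1.0 - y))

x = 0.5
y = sigmoid(affine(x))               # forward through the chain
x_back = affine_inv(sigmoid_inv(y))  # invert in reverse order
assert abs(x_back - x) < 1e-12
```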
Add high-level APIs:
- Add paddle.vision.models.AlexNet and paddle.vision.models.alexnet, to use AlexNet models directly. (#36058)
- Add paddle.vision.models.DenseNet, paddle.vision.models.densenet121, paddle.vision.models.densenet161, paddle.vision.models.densenet169, paddle.vision.models.densenet201, and paddle.vision.models.densenet264, to use DenseNet models directly. (#36069)
- Add paddle.vision.models.GoogLeNet and paddle.vision.models.googlenet, to use GoogLeNet models directly. (#36034)
- Add paddle.vision.models.InceptionV3 and paddle.vision.models.inception_v3, to use InceptionV3 models directly. (#36064)
- Add paddle.vision.models.MobileNetV3Small, paddle.vision.models.MobileNetV3Large, paddle.vision.models.mobilenet_v3_small, and paddle.vision.models.mobilenet_v3_large, to use MobileNetV3 models directly. (#38653)
- Add paddle.vision.models.ResNeXt, paddle.vision.models.resnext50_32x4d, paddle.vision.models.resnext50_64x4d, paddle.vision.models.resnext101_32x4d, paddle.vision.models.resnext101_64x4d, paddle.vision.models.resnext152_32x4d, and paddle.vision.models.resnext152_64x4d, to use ResNeXt models directly. (#36070)
- Add paddle.vision.models.ShuffleNetV2, paddle.vision.models.shufflenet_v2_x0_25, paddle.vision.models.shufflenet_v2_x0_33, paddle.vision.models.shufflenet_v2_x0_5, paddle.vision.models.shufflenet_v2_x1_0, paddle.vision.models.shufflenet_v2_x1_5, paddle.vision.models.shufflenet_v2_x2_0, and paddle.vision.models.shufflenet_v2_swish, to use ShuffleNetV2 models directly. (#36067)
- Add paddle.vision.models.SqueezeNet, paddle.vision.models.squeezenet1_0, and paddle.vision.models.squeezenet1_1, to use SqueezeNet models directly. (#36066)
- Add paddle.vision.models.wide_resnet50_2 and paddle.vision.models.wide_resnet101_2, to use WideResNet models directly. (#36952)
- Add paddle.vision.ops.nms, to support single-category and multi-category non-maximum suppression (NMS) algorithms, accelerating object detection and prediction tasks. (#40962)
- Add paddle.vision.ops.roi_pool and paddle.vision.ops.RoIPool, to support RoI region pooling operations in detection tasks. (#36154)
- Add paddle.vision.ops.roi_align and paddle.vision.ops.RoIAlign, to support RoI Align operations in detection tasks. (#35102)
- Add paddle.text.ViterbiDecoder and paddle.text.viterbi_decode, Viterbi decoding APIs, mainly for sequence tagging model prediction. (#35778)
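The NMS algorithm mentioned above follows the classic greedy scheme, which can be sketched in NumPy (a single-class illustration, not Paddle’s kernel):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy single-class NMS; boxes are [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box against the remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]
```

Box 1 overlaps box 0 with IoU 0.81 and is suppressed; box 2 is disjoint and kept.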
Add 11 Sparse class APIs to support basic functions such as creating sparse Tensors in COO and CSR formats, and add C++ interconversion with Tensor:
- paddle.sparse.sparse_coo_tensor, to create a sparse Tensor in COO format. (#40780)
- paddle.sparse.sparse_csr_tensor, to create a sparse Tensor in CSR format. (#40780)
- paddle.sparse.ReLU, to support the ReLU activation layer for SparseCooTensor. (#40959)
- paddle.sparse.functional.relu, to support the ReLU function for SparseCooTensor. (#40959)
- Tensor.values(), C++ method to get the non-zero elements of a SparseCooTensor or SparseCsrTensor. (#40608)
- Tensor.indices(), C++ method to get the coordinate information of a SparseCooTensor. (#40608)
- Tensor.crows(), C++ method to get the compressed row information of a SparseCsrTensor. (#40608)
- Tensor.cols(), C++ method to get the column information of a SparseCsrTensor. (#40608)
- Tensor.to_sparse_coo(), C++ method to convert a DenseTensor or SparseCsrTensor to a SparseCooTensor. (#40780)
- Tensor.to_sparse_csr(), C++ method to convert a DenseTensor or SparseCooTensor to a SparseCsrTensor. (#40780)
- Tensor.to_dense(), C++ method to convert a SparseCooTensor or SparseCsrTensor to a DenseTensor. (#40780)
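The difference between the COO and CSR layouts named above can be illustrated with NumPy: COO stores explicit (row, col) coordinates, while CSR compresses the row coordinates into cumulative row pointers (the “crows” of the API above).

```python
import numpy as np

# A dense 3x3 matrix with five non-zeros.
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]])

# COO: parallel arrays of coordinates plus values, in row-major order.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]
print(rows.tolist(), cols.tolist(), values.tolist())
# [0, 0, 1, 2, 2] [0, 2, 2, 0, 1] [1, 2, 3, 4, 5]

# CSR: compress the row indices into row pointers; crows[i+1] - crows[i]
# is the number of non-zeros in row i.
crows = np.zeros(dense.shape[0] + 1, dtype=int)
np.add.at(crows, rows + 1, 1)
crows = np.cumsum(crows)
print(crows.tolist())  # [0, 2, 3, 5]
```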
Add hardware related APIs:
- Add four GPU memory monitoring APIs: paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated, and paddle.device.cuda.memory_reserved, to view and analyze GPU memory usage in real time. (#38657)
- Add paddle.device.cuda.get_device_properties, to return the properties of the GPU device. (#35661)
- Add paddle.device.cuda.get_device_name and paddle.device.cuda.get_device_capability, to return the name and compute capability of the GPU device. (#35672)
Add Tensor operation APIs:
- Add paddle.nansum, to sum the input Tensor along axis while ignoring NaN values. (#38137)
- Add paddle.nanmean, to average the input Tensor along axis while ignoring NaN values. (#40472)
- Add paddle.clone, to return a copy of the input Tensor with gradient calculation support. (#38020)
- Add paddle.Tensor.element_size, to return the number of bytes allocated for a single element in a Tensor. (#38020)
- Add paddle.Tensor.to_uva_tensor, to convert numpy objects to objects accessible by CUDA through virtual addresses while physically stored in CPU memory. (#39146, #38950)
- Add paddle.rot90, to rotate the n-dimensional Tensor by 90 degrees in the plane specified by axes. (#37634)
- Add paddle.logit and paddle.Tensor.logit, to compute the logit function values for the input Tensor. (#37844)
- Add paddle.repeat_interleave, to copy the input along the specified axis and return a new Tensor. (#37981)
- Add paddle.renorm, to split the Tensor into multiple pieces at the specified axis and then perform p-norm operations separately. (#38130, #38459)
- Add paddle.mode and paddle.Tensor.mode, to search the values and indices of the input Tensor along the specified axis. (#38446)
- Add paddle.quantile and paddle.Tensor.quantile, to compute the q-quantile of a Tensor along the specified axis. (#38567)
- Add paddle.kthvalue and paddle.Tensor.kthvalue, to find the values and indices of the k-th smallest elements along the specified axis. (#38386)
- Add paddle.is_floating_point and paddle.Tensor.is_floating_point, to determine if the input Tensor is of floating point type. (#37885)
- Add paddle.erfinv and paddle.Tensor.erfinv, to compute the inverse error function of the input Tensor. (#38295)
- Add paddle.lerp and paddle.Tensor.lerp, to compute linear interpolation between the input Tensors based on the given weights. (#37253)
- Add paddle.angle, to compute the phase angle of a complex Tensor. (#37689)
- Add paddle.rad2deg and paddle.Tensor.rad2deg, to convert each element of the input from radians to degrees. (#37598)
- Add paddle.deg2rad and paddle.Tensor.deg2rad, to convert each element of the input from degrees to radians. (#37598)
- Add paddle.gcd and paddle.Tensor.gcd, to compute the greatest common divisors of the absolute values of two inputs elementwise. (#37819)
- Add paddle.lcm and paddle.Tensor.lcm, to compute the least common multiples of the absolute values of two inputs elementwise. (#37819)
- Add paddle.amax and paddle.Tensor.amax, to get the maximum value of Tensor elements along the specified dimension. (#38417)
- Add paddle.amin and paddle.Tensor.amin, to get the minimum value of Tensor elements along the specified dimension. (#38417)
- Add paddle.isclose, to determine whether each element of two Tensors is close to the other. (#37135)
- Add paddle.put_along_axis and paddle.take_along_axis, for placing or extracting elements with specified index subscripts. (#38608)
- Add paddle.bincount and paddle.Tensor.bincount, for counting the number of occurrences of each element in a Tensor. (#36317)
- Add paddle.fmax and paddle.fmin, to extend the max/min functions to support NaN values in the two Tensors: if one of the corresponding elements is NaN, return the non-NaN value; if both are NaN, return NaN. (#37826)
- Add paddle.diff, for computing the n-th forward difference along a given dimension; currently n=1 is supported. (#37441)
- Add inverse hyperbolic functions: paddle.asinh, paddle.acosh, and paddle.atanh. (#37076)
- Add paddle.as_real and paddle.as_complex, for conversion between real Tensors and complex Tensors. (#37784)
- Add paddle.complex, for constructing a complex Tensor from the given real and imaginary parts. (#37918, #38272)
- Add paddle.det and paddle.slogdet, to compute the determinant of a matrix and the natural logarithm of the determinant. (#34992)
- Add paddle.nn.utils.parameters_to_vector, to flatten parameters into a 1-D Tensor. (#38020)
- Add paddle.nn.utils.vector_to_parameters, to transform a 1-D Tensor back into parameters. (#38020)
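The NaN-ignoring reduction semantics of the first two APIs above match NumPy’s nansum/nanmean, which serves as a quick illustration:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0])
# NaN-ignoring reductions: only the finite entries contribute,
# and the mean divides by the count of non-NaN elements (2 here).
print(np.nansum(x))   # 4.0
print(np.nanmean(x))  # 2.0
```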
Add networking class APIs:
- Add paddle.nn.Fold and paddle.nn.functional.fold, to extract sliding local area blocks from the Tensors of a batch. (#38613)
- Add paddle.nn.CELU and paddle.nn.functional.celu, to support the CELU activation layer. (#36088)
- Add paddle.nn.HingeEmbeddingLoss, to compute hinge embedding loss, usually used for nonlinear embedding or semi-supervised learning. (#37540)
- Add paddle.nn.ZeroPad2D, for zero-padding according to the padding property. (#37151)
- Add paddle.nn.MaxUnPool3D and paddle.nn.MaxUnPool1D, for computing 3D and 1D max unpooling. (#38716)
- Add paddle.incubate.graph_khop_sampler, paddle.incubate.graph_sample_neighbors, and paddle.incubate.graph_reindex, to support multi-hop graph neighbor sampling and graph reindexing operations, mainly used for graph neural network model training. (#39146, #40809)
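The CELU activation added above has a simple closed form, sketched here in pure Python (an illustration of the published formula, not Paddle’s kernel):

```python
import math

def celu(x, alpha=1.0):
    # CELU(x) = max(0, x) + min(0, alpha * (exp(x / alpha) - 1))
    return max(0.0, x) + min(0.0, alpha * math.expm1(x / alpha))

assert celu(2.0) == 2.0                          # positive inputs pass through
assert abs(celu(-1.0) - math.expm1(-1.0)) < 1e-12  # smooth negative branch
```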
Add random number class APIs:
- Add paddle.poisson, to generate a Tensor that obeys the Poisson distribution with the lambda parameter. (#38117)
- Add paddle.randint_like, to generate a new Tensor that obeys a uniform distribution in the range [low, high), with the shape of the output matching the shape of the input. (#36169)
- Add paddle.Tensor.exponential_, an in-place API that fills the input Tensor with exponentially distributed random numbers. (#38256)
Add parameter initialization class APIs:
- Add paddle.nn.initializer.Dirac, to initialize 3D/4D/5D parameters with the Dirac delta function, commonly used for initializing Conv1D/Conv2D/Conv3D parameters in convolution layers. (#37389)
- Add paddle.nn.initializer.Orthogonal, for orthogonal matrix initialization; the initialized parameter is a (semi-)orthogonal matrix. (#37163)
- Add paddle.nn.initializer.calculate_gain, to get the recommended gain value for an activation function; the gain value can be used with certain initialization APIs to adjust the initialization range. (#37163)
Add learning rate class API:
- Add paddle.optimizer.lr.MultiplicativeDecay, to set the learning rate through a lambda function. (#38250)
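Multiplicative decay of this kind can be sketched in pure Python (a hypothetical helper, not Paddle’s scheduler): each epoch, the current rate is multiplied by the value of a user-supplied lambda.

```python
def multiplicative_decay(base_lr, lr_lambda, epochs):
    # Record the learning rate used at each epoch, then multiply
    # by the lambda's value to get the next epoch's rate.
    lr = base_lr
    history = []
    for epoch in range(epochs):
        history.append(lr)
        lr *= lr_lambda(epoch)
    return history

print(multiplicative_decay(0.1, lambda epoch: 0.5, 3))  # [0.1, 0.05, 0.025]
```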
Add distributed-related APIs:
- Add new optimizer-related APIs: (#40710)
  - paddle.incubate.optimizer.functional.minimize_bfgs, second-order optimizer BFGS.
  - paddle.incubate.optimizer.functional.minimize_lbfgs, second-order optimizer L-BFGS.
- Add the paddle.incubate.multiprocessing module, to provide Tensor (CPU/GPU) data transfer between Python processes. (#37302, #41339)
- Add paddle.incubate.autotune.set_config, to support auto-selection among multiple Kernel versions, automatic mixed precision data layout conversion, and automatic num_workers selection for DataLoader, to automatically improve model performance. (#42301)
- Add paddle.incubate.nn.FusedMultiTransformer and paddle.incubate.nn.functional.fused_multi_transformer, to fuse multiple transformer layers into a single op to improve model inference performance. Note that only the forward pass is supported. (#42311)
- Add einsum_v2 operators for a consistent interface between imperative and static modes. They are compatible with the original Python-side paddle.einsum implementation while supporting dynamic-to-static export and more complete InferShape inference. (#42495, #42327, #42397, #42105)
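Einstein-summation notation, as accepted by einsum-style APIs, can be illustrated with NumPy’s einsum: the subscript string fully describes the contraction.

```python
import numpy as np

# "ij,jk->ik" is a plain matrix multiplication.
a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)
out = np.einsum("ij,jk->ik", a, b)
assert np.array_equal(out, a @ b)

# "bii->b" is a batched trace: sum each matrix's diagonal.
batch = np.arange(8).reshape(2, 2, 2)
print(np.einsum("bii->b", batch).tolist())  # [3, 11]
```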
IR (Intermediate Representation)¶
Dynamic graph to static graph
- For the variable type StaticAnalysis module, add support for type tags similar to a, b = paddle.shape(x). (#39245)
- Add a computed field, supporting InputSpec.name as the Program cache hash key. (#38273)
- Add syntax support for dict['key'] = x.shape. (#40611)
- Add support for Pure FP16 training. (#36944)
- Add support for the for i in [x, y, z] syntax. (#37259)
- Add support for Python 3 type hint syntax. (#36544)
Pass development
- Add forward and backward fusion for FC + [relu|gelu] based on the NVIDIA cuBlasLt Epilogue. (#39437)
Kernel Primitive API
- Add KP operators on the GPU platform, including cast, scale, clip, bce_loss, abs_grad, reduce_sum_grad, reduce_mean_grad, full, full_like, distribution, random, masked_select_kernel, where_index, masked_select_grad, dropout, sigmoid, and where. (#36203, #36423, #39390, #39734, #38500, #38959, #39197, #39563, #39666, #40517, #40617, #40766, #39898, #39609)
- Add support for the XPU2 source code compilation mode. (#37254, #40397, #38455)
- Add support for KP operator reuse on XPU2 and GPU, including reduce, broadcast, elementwise_add, exp, log, relu, sigmoid, leaky_relu, softplus, hard_swish, and reciprocal. (#36904, #37226, #38918, #40560, #39787, #39917, #40002, #40364)
- Add unit tests of KP operators on the XPU2 platform, including brelu, ceil, celu, elu, floor, hard_shrink, hard_sigmoid, log1p, logsigmoid, relu6, silu, soft_relu, softsign, sqrt, square, swish, thresholded_relu, and softshrink. (#40448, #40524)
- Add support for XPU2 KP models, including resnet50, deepfm, wide_deep, yolov3-darknet53, det_mv3_db, bert, transformer, mobilenet_v3, and GPT2.
Mixed Precision Training¶
- Split the paddle.amp.GradScaler.unscale_ method out of the minimize method of the mixed precision training paddle.amp.GradScaler, to provide a separate interface for unscaling the gradients. (#35825)
- Add FP16 support for paddle.nn.ClipGradByGlobalNorm in dynamic graph mode, and add an FP16 Kernel for the clip op so that clip-related operations support FP16 computation. (#36198, #36577)
- Support the case where the optimizer parameter passed to paddle.amp.decorate is None. (#37541)
- For the merged_momentum op, add support for input of multiple learning rates, computing for the use_nesterov policy, and regularization computing. (#37527)
- Add the multi_tensor policy to the paddle.optimizer.Momentum optimizer, and add a set_to_zero branch to clear_grad of the Optimizer class. (#37564)
- Add the multi_tensor policy to paddle.optimizer.Adam. (#38010)
- Add the multi_precision policy to the paddle.optimizer.SGD optimizer. (#38231)
- Add storage of the master weight parameter to the optimizer's state_dict method. (#39121)
- Add support for op CUDA bfloat16 mixed precision training, supporting the O1 and O2 modes; enable these training modes via paddle.amp.auto_cast. (#39029, #39815)
- Add bfloat16 CUDA Kernels for the following ops: matmul, concat, split, dropout, reshape, slice, squeeze, stack, transpose, unbind, elementwise_max, elementwise_add, elementwise_mul, elementwise_sub, scale, sum, layer_norm, p_norm, reduce_sum, softmax, log_softmax, sigmoid, sqrt, softplus, square, gaussian_random, fill_constant, and fill_any_like. (#39485, #39380, #39395, #39402, #39457, #39461, #39602, #39716, #39683, #39843, #39999, #40004, #40027)
- Add bfloat16 CPU Kernels for the following ops: dropout, reshape, slice, squeeze, unsqueeze, stack, transpose, unbind, elementwise_max, elementwise_mul, elementwise_sub, and gather. (#39380, #39395, #39402, #39457, #39461, #39602, #39716, #39683)
- Support printing of Tensors with bfloat16 data. (#39375, #39370)
- Add FP16 computation support for p_norm, elementwise_max, fill_constant_batch_size_like, and scatter. (#35888, #39907, #38136, #38499)
- Add int16_t support for the following ops: cumsum, less_than, less_equal, greater_than, greater_equal, equal, not_equal, fill_any_like, gather_nd, reduce_sum, where_index, reshape, and unsqueeze. (#39636)
- Add int16_t label type support for the cross_entropy op. (#39409)
- Add int16_t id type support for the embedding op. (#39381)
- Add FP16 type support for the reduce_mean op. (#38289)
- Add FP16 type support for the elementwise_min op. (#38123)
- Update the bfloat16 AMP oneDNN default support list. (#39304)
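The loss-scaling idea behind GradScaler can be sketched in pure Python (names are illustrative, not Paddle’s API): the loss is multiplied by a scale factor before the backward pass so small FP16 gradients do not underflow, then the gradients are unscaled before the parameter update.

```python
# A minimal sketch of static loss scaling for mixed precision training.
def scaled_backward(grad_fn, params, scale=2.0 ** 15):
    # Scale the gradients as if the loss had been multiplied by `scale`
    # before backprop (gradients are linear in the loss)...
    scaled_grads = [scale * g for g in grad_fn(params)]
    # ...then unscale before handing them to the optimizer update.
    return [g / scale for g in scaled_grads]

# A toy quadratic loss 0.5 * sum(p**2), whose gradient w.r.t. p is p itself.
params = [1e-4, -2e-4]
grads = scaled_backward(lambda ps: list(ps), params)
assert all(abs(g - p) < 1e-12 for g, p in zip(grads, params))
```

In practice the scale is adjusted dynamically: it grows while gradients stay finite and shrinks when an overflow (inf/NaN) is detected, which is the part GradScaler automates.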
Paddle HIgh reusability operator library¶
We announce PHI as the new Paddle HIgh reusability operator library. PHI provides Primitive APIs, enabling kernel reuse for operator development. As a refactored functional operator library, PHI aims to solve legacy problems that harm the framework’s performance and reusability, particularly in operator development. Such problems include inefficient cross-operator reuse, unclear operator interfaces, and the lack of direct C++ calls into the operator library. With PHI, new operators can be easily implemented by composing functions available in the functional library. The library provides over 200 C++ operator class APIs and nearly 500 kernels. Composing new operators through these built-in functions can greatly reduce the user’s development effort. PHI supports different types of hardware (e.g., GPU and XPU). In addition, PHI is extensible with plugins for accommodating third-party accelerators (such as NPU) in a low-cost and reusable fashion. In short, PHI supports low-level operator composability, the reuse of kernels through Primitives, and accelerators through plugins. The main contents include the six parts below:
The implementation of the operator library infrastructure, core components and mechanisms: reasonably plan the directory structure of the new operator library; design and implement its common base data structures, the new functional InferMeta and Kernel development paradigm, and the corresponding registration and management components. Support automated generation of Kernel compilation targets and compilation dependencies, allowing developers to focus only on the functional Kernel implementation and making the development paradigm clear and concise. (#34425, #37107, #36946, #36948, #37876, #37916, #37977, #38078, #38861, #39123, #39131, #39748, #39790, #39941, #40239, #40635, #41091, #37409, #37942, #39002, #38109, #37881, #37517, #39870, #40975, #39475, #37304, #36910, #37120, #37146, #37215, #37255, #37369, #38258, #38257, #38355, #38853, #38937, #38977, #38946, #39085, #39153, #39228, #38301, #38275, #38506, #38607, #38473, #38632, #38811, #38880, #38996, #38914, #39101)
Operator library C++ API system construction: design and implement a YAML-configuration-file-based operator definition paradigm, to automatically generate more than 200 C++ operator class APIs for internal and external developers to reuse. This reduces the cost of repeated development of basic operators. (#37668, #36938, #38172, #38182, #38311, #38438, #39057, #39229, #39281, #39263, #39408, #39436, #39482, #39497, #39651, #39521, #39760, #40060, #40196, #40218, #40640, #40732, #40729, #40840, #40867, #41025, #41368)
Operator library compatible with various execution systems: Implement new InferMeta and Kernel to access the original dynamic and static graph execution system. Support the safe removal of the original OpKernel registration and migration to the new Kernel form. (#34425, #38825, #38837, #38842, #38976, #39134, #39140, #39135, #39252, #39222, #39351)
Decouple the underlying data structures and tool functions of the operator library from the framework: relieve PHI’s dependence on the framework for core data structures, lay the foundation for subsequent independent compilation of PHI, and support infrt, custom Kernel, and a series of PHI-based construction work. (#38583, #39188, #39560, #39931, #39169, #38951, #38898, #38873, #38696, #38651, #39359, #39305, #39234, #39098, #39120, #38979, #38899, #38844, #39714, #39729, #39889, #39587, #39558, #39514, #39502, #39300, #39246, #39124)
Integration and improvement of the custom operator mechanism with PHI: support calling the over 200 C++ operator class APIs automatically generated by PHI when writing custom operators. This reduces custom operator development costs. A series of bugs are fixed. (#37122, #37276, #37281, #37262, #37415, #37423, #37583, #38776, #39353, #41072)
Operator scale migration and refactoring: migrate about 250 high-frequency forward and backward operator Kernels to the new operator library and refactor them as single functions. Achieve high-performance operators by encapsulating multiple base Kernel functions on the C++ side for fast combination. Meanwhile, add the corresponding YAML operator definitions and access the new dynamic graph execution system to improve Python API scheduling performance. The migrated and refactored operators include:
sqrt （#40727）
square（#40727）
sin (#40175)
sinh (#40175)
elementwise_fmax（#40140）
elementwise_fmin（#40140）
p_norm (#40819)
fill_constant_batch_size_like (#40784)
conv2d（#39354）
conv3d（#39354）
mish（#40727）
gather (#40500)
sgd (#40045)
momentum (#41319)
rmsprop（#40994）
adam (#40351)
layer_norm（#40193）
adagrad（#40994）
adamax (#40173)
adadelta (#40173)
ceil (#40913)
cos (#40175)
atan (#40175)
cosh (#40175)
erf（#40388）
asin (#40175)
acos (#40175)
scale (#39278)
elementwise_pow (#40993)
round (#40913)
floor (#40913)
pow (#40913)
elementwise_floordiv (#40993)
reciprocal（#40727）
log1p (#40785)
allclose (#40469)
mul (#40833)
elementwise_max (#40590)
elementwise_min (#40590)
elementwise_mod (#40590)
fill_any_like (#39807)
dot（#38359）
sum (#40873)
diag_v2 (#39914)
one_hot_v2 (#39876)
bce_loss (#39868)
argsort (#40151)
arg_max (#40222)
arg_min (#40222)
segment_pool (#40099)
dist (#40178)
isnan_v2 (#40076)
logical_and (#39942)
logical_not (#39942)
isfinite_v2 (#40076)
logical_or (#39942)
isinf_v2 (#40076)
is_empty (#39919)
logical_xor (#39942)
less_than（#39970）
not_equal（#39970）
equal（#39970）
less_equal（#39970）
equal_all（#39970）
uniform_random (#39937)
randperm (#41265)
unbind (#39789)
bernoulli (#39590)
where (#39811)
log10 (#40785)
log2 (#40785)
expm1（#40727）
atan2 (#39806)
empty (#38334)
tan (#40175)
bitwise_and （#40031）
bitwise_not（#40031）
bitwise_or（#40031）
poisson（#39814）
cholesky_solve（#40387）
bitwise_xor（#40031）
triangular_solve（#40417）
sigmoid (#40626)
atanh (#40175)
softsign（#40727）
thresholded_relu (#40385)
tanh_shrink (#40565)
stanh（#40727）
reduce_mean (#37559)
reduce_max（#40225）
reduce_min (#40374)
reduce_all (#40374)
reduce_any (#40374)
logsumexp (#40790)
softshrink（#40565）
stack（#40581）
tile (#40371)
unique（#40581）
unstack（#40581）
slice（#40736）
transpose2（#39327）
unsqueeze2（ #40596）
squeeze2（ #40596）
strided_slice (#40708)
softmax (#39547)
leaky_relu (#40385)
gelu (#40393)
prelu (#40393)
log_softmax (#40393)
elu (#40565)
logsigmoid (#40626)
kthvalue（#40575）
mode (#40571)
yolo_box（#40112）
yolov3_loss (#40944）
temporal_shift（#40727）
depthwise_conv2d（#39354）
pad3d (#40701)
pad（ #40012）
greater_equal（#39970）
kldiv_loss (#39770)
isclose (#39770)
silu (#40565)
unfold (#39778)
batch_norm (#39347)
norm（#39324）
label_smooth (#39796)
grid_sampler (#40585)
greater_than（#39970）
nearest_interp_v2 (#40855)
bilinear_interp_v2 (#40855)
softmax_with_cross_entropy (#40832)
rnn (#41007)
reverse (#40791)
trace (#39510)
kron（#40427）
accuracy（#39982）
dropout（#40148）
bincount (#39947)
assign_value (#40967)
assign (#40022)
cast (#37610)
where_index (#40255)
cumprod (#39770)
shard_index (#40254)
lookup_table_v2（#39901）
adamw (#40351)
tanh (#40385)
cross (#39829)
split (#39060)
linspace (#40124)
huber_loss (#39761)
hierarchical_sigmoid（#40553）
nll_loss (#39936)
exp（#40727）
rsqrt（#40727）
viterbi_decode (#40186)
conj (#38247)
lgamma (#39770)
relu (#40175)
log (#40785)
bilinear_tensor_product（#39903）
logit (#37844)
broadcast_tensors（#40047）
gumbel_softmax（#39873）
diagonal （#39575）
multi_dot (#40038)
matrix_power (#40231)
digamma（#39240）
masked_select（#39193）
determinant (#40539)
eigh (#40213)
shape (#40248)
reduce_prod (#39844)
histogram（#39496）
meshgrid (#41411)
brelu (#40385)
hard_swish (#40913)
hard_shrink (#40565)
selu (#39819)
expand_v2 (#39471)
top_k_v2（#40064）
expand_as_v2（#40373）
swish (#40913)
hard_sigmoid (#40626)
exp, det, assign, gaussian_random, matrix_rank, eye, and deformable_conv. (#41755, #41737)
New Dynamic Graph Execution Mechanism¶
To improve the scheduling performance and custom development capability of PaddlePaddle's dynamic graph execution mechanism, we have reconstructed its underlying execution mechanism. With the new execution method, the PHI operator library can be used for efficient runtime execution, and operators supported by the PHI operator library gain a significant improvement in scheduling performance when switched to the new dynamic graph mode. However, because upgrading the overall framework execution mechanism requires a huge amount of work, and because this work is tightly coupled with the PHI operator library, we still do not enable this execution method by default in this version. If you want to try it, you can switch to it by setting the environment variable FLAGS_enable_eager_mode=1. The details are as follows:
Implementation of the dynamic graph execution infrastructure, core components and mechanism: by staticizing dynamic-graph-related execution code, the original homogeneous operator construction is converted into specific calls to different PHI APIs, greatly reducing scheduling overhead. (#36059, #37323, #37556, #37555, #37478, #37458, #37479, #37599, #37659, #37654, #39200, #39309, #39319, #39414, #39504, #39526, #39878, #39963)
New dynamic graph execution mechanism subfunction development and adaptation: support more flexible and complete dynamic graph subfunctions such as hook, pylayer, double_grad, inplace, amp, etc. (#41396, #40400, #40695, #41043, #40915, #41104, #41350, #41209, #40830, #40891, #36814, #37377, #37193, #36965, #37810, #36837, #38488, #39282, #39449, #39531, #39638, #39674, #39893, #40170, #40693, #40937, #41016, #41051, #41121, #41198, #41287, #41380, #41306, #41387, #40623, #40945, #39282, #39449, #38488)
Automatic code generation mechanism for new dynamic graph execution: When we are trying to split the computation and scheduling logic of a large number of homogeneous operators into different specific scheduling logics, we find that it is a huge workload. So we introduce a new automatic code generation logic to generate code and thus simplify the runtime logic of dynamic graphs. Meanwhile, in order to adapt to the various types of runtime logic in the previous framework, we also use some complicated compilation techniques to obtain information at runtime to generate more accurate scheduling code. (#37574, #37575, #37639, #37723, #37753, #37812, #37837, #37910, #37943, #37992, #37959, #38017, #37969, #38160, #38085, #38562, #38573, #39192, #39215, #39355, #39358, #39328, #39233, #39628, #39767, #39743, #39897, #39797, #39997, #40058, #40080, #40107, #39962, #40132, #40276, #40266, #40480, #40482, #40368, #40650, #40815, #40907, #40935, #41089)
New dynamic graph execution mechanism accessed into the main framework and Integration test: we currently use some environment variables to distinguish between static graph mode and dynamic graph mode (including new dynamic graph and old dynamic graph mode). We have adapted most logics of dynamic graphs in these modes. However, there are still a lot of problems being fixed. (#37638, #37643, #37653, #38314, #38337, #38338, #39164, #39326, #40391, #40201, #40854, #40887)
Update some judgment logics under dynamic graphs, to support fast execution paths for dynamic graphs in compatible forms. (#40786)
Non-static graph mode (current transition scheme): _non_static_mode().
Determine whether the new dynamic graph is in use in dynamic graph mode (recommended judgment logic): _in_dygraph_mode().
Determine whether the old dynamic graph is in use in dynamic graph mode (not recommended; it will be deprecated in future versions): _in_legacy_dygraph().
Enable the old dynamic graph and disable the new dynamic graph in dynamic graph mode: _enable_legacy_dygraph() or exit _test_eager_guard().
Enable the new dynamic graph and disable the old dynamic graph in dynamic graph mode: _disable_legacy_dygraph() or with _test_eager_guard().
Determine whether the new dynamic graph is in use in static or dynamic graph mode: _in_eager_without_dygraph_check().
Support inplace after dynamic graph reconstruction: input and output are the same Tensor.
Adapt the inplace strategy for dynamic graph reconstruction intermediate states. (#40400)
Adapt the inplace strategy to the final state of the dynamic graph reconstruction. (#40695)
Add inplace strategy to the PyLayer function after dynamic graph reconstruction. (#41043)
Add inplace strategy for Tensor's setitem function after dynamic graph reconstruction. (#40915)
Add the _reset_grad_inplace_version interface after dynamic graph reconstruction, to set the inplace version of the Tensor's gradient to 0. (#41101)
If the value of the forward Tensor is not needed during the backward computation (the no_need_buffer property), no inplace version detection is needed for that Tensor; skip the inplace version check for Tensors with no_need_buffer. (#41350)
Unify error messages for inplace version checks before and after the reconstruction of dynamic graphs. (#41209)
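The inplace-version bookkeeping described above can be illustrated with a small framework-independent sketch (all names here are hypothetical stand-ins, not Paddle internals):

```python
class Tensor:
    """Toy tensor that tracks an inplace version counter, as dynamic graphs do
    to detect when a value saved for backward was later overwritten in place."""
    def __init__(self, data):
        self.data = list(data)
        self.inplace_version = 0

    def add_(self, other):
        """An inplace op: mutate the data and bump the version counter."""
        self.data = [a + b for a, b in zip(self.data, other)]
        self.inplace_version += 1
        return self

def check_inplace_version(saved_version, tensor, no_need_buffer=False):
    """Raise if the tensor changed after being saved for backward.
    Tensors marked no_need_buffer skip the check, as described above."""
    if no_need_buffer:
        return
    if tensor.inplace_version != saved_version:
        raise RuntimeError("tensor was modified by an inplace op after being saved")
```

A backward pass would record `tensor.inplace_version` at save time and call `check_inplace_version` before reusing the value.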
Support view strategy after dynamic graph reconstruction: input and output Tensor share underlying data.
Add support for weakref on the Python side of the new dynamic graph eager Tensor. (#41797)
Enhance the new dynamic graph DoubleGrad function to support the basic DoubleGrad feature. (#41893, #41894, #41895)
Add the core.eager.StringTensor interface, to support constructing StringTensor on the Python side and using StringTensor-related APIs. (#41039)
Add _grad_name and _grad_value to core.eager.Tensor to return the name and value of a gradient. (#41990)
Add processing of the no_need_buffer attribute for dynamic graph intermediate state. Tensors with the no_need_buffer attribute are skipped in the inplace backward check operation. (#41720)
New Static Graph Executor¶
In order to solve the problems that the original static graph executor of PaddlePaddle does not schedule well in some scenarios and that multiple streams are hard to use, we have implemented a new static graph executor with superior performance. It can easily take advantage of asynchronous scheduling across multiple streams and multiple threads. The new executor is a compatible upgrade of the original executor. At present, it is used by default in single-card scenarios; users do not need to make any changes in their training code. We also provide an interface to switch back to the original executor by setting the environment variable FLAGS_USE_STANDALONE_EXECUTOR=false. (#41179) The main contents are as follows:
Basic components: a high-performance thread pool for multi-threaded scheduling in the executor (#35470, #35930, #36030, #36480, #36688, #36740, #38335, #40770) and a thread co-op component (#38779, #40876, #40912); timely memory recovery after operator execution (#37642, #39617, #40859); a new dependency analysis algorithm for the parallel executor (#37231); etc.
Scheduling logic: optimize operator scheduling in the executor; support a multi-stream, multi-threaded asynchronous scheduling mechanism; turn transforms such as data type, device, and layout into operator-level scheduling to improve performance; support caching the operator Kernel selection; support selecting new PHI operators. (#35024, #34922, #35711, #35928, #39458, #36899)
Interface compatibility: Compatible with the user interface and functionality of the original executor, such as alignment with python interface Executor.run(), support for managing Tensor in Scope, etc. This ensures that users can switch to the new executor without perception. (#37278, #37379, #37445, #37510, #40955, #41778, #41058, #38584, #37957, #37672, #37474, #37085, #37061, #36945)
Enhance debugging and error reporting in multi-threaded scenarios by capturing error reports from sub-threads and throwing them uniformly in the main thread, to improve user experience. (#36692, #36802)
Fix the bug of the new executor communication flow resetting stream cache information in the allocator, to reduce RecordStream overhead in cross-stream scenarios. This improves the performance of DeepFM models by about 8%. (#42046)
Optimize the dependency analysis method between new executor operators to improve runtime performance. Establish correct dependencies for send/recv communication operators to support pipeline parallel. (#42009)
Distributed Training¶
Basic functions of multi-machine multi-card parallel training based on collective communication
Add support for elastic training, which enables scaling the number of workers up and down and resuming the training process after node failure, to improve the fault tolerance of distributed training. (#36684, #37177, #37781)
Refactor the launch startup module, add master collaboration and the nnodes (node number) definition, to improve the ease of use of distributed startup. (#40086, #40568, #40782, #40844, #40936, #41190, #41314)
Add support for GPU/NPU/XPU multi-hardware heterogeneous training. (#37613, #37998)
Add fleet_executor asynchronous pipeline executor. (#36966, #37049, #37087, #37126, #37150, #37203, #37167, #37282, #37319, #37462, #37507, #37533, #37576, #37605, #37691, #37742, #37783, #37809, #37862, #37882, #37934, #38024, #38083, #38164, #38261, #38290, #40607, #37093, #37106, #37143, #37338, #37376, #37485, #37531, #37623, #37693, #37755, #37807, #37889, #38420, #38539, #36892, #37084, #37158, #37361, #37509, #37603, #37703, #37824, #38114, #38322, #38535, #38650, #38709, #38799, #38839, #38904)
Add distributed inference function for large-scale models. (#38795, #39012, #39032, #39076, #39194, #39207, #39241, #39603, #39758, #39992)
Dynamic graph hybrid parallelism
Reconstruct paddle.distributed.fleet.utils.recompute, to support the new dynamic computational graph. (#41396)
Add pure FP16 training to support data parallelism. (#36420)
Add MoE (Mixture of Experts) parallel strategy, to support large-scale MoE model training. (#41092, #40895, #40850, #39224)
Add GroupSharded parallel strategy. Support stage1, stage2, stage3, with both synchronous and asynchronous communication. It can be used together with basic functions such as Recompute, AMP O1\O2, Offload, GroupShardedClipGrad, and GroupShardedScaler. (#37489, #37568, #37707, #37836, #37947, #38151, #38407, #38052, #39112, #38989, #39171, #39285, #39334, #39397, #39581, #39668, #40129, #40396, #40488, #40601, #37725, #37904, #38064)
Static graph hybrid parallelism
Add scale_gradient flag bit to gradient_scale_configs, to control the position where the gradient aggregation operation averages the gradients under pipeline parallelism. (#36384)
Under tensor parallelism, the dropout op supports settings of deterministic random seed generators, to ensure random consistency for non-distributed variables and randomness of distributed variables. (#36228)
NPU hybrid parallelism supports Offload, saving 40% of NPU memory. (#37224)
Add force_cpu optional parameter to the seed op, to allow dropout to read seed values directly from the CPU. (#35820)
Improve the Automatic Sparsity (ASP) sharding strategy and support selecting the sharding strategy according to the program. (#40028)
Automatic parallel
Add the process restart (relaunch) after automatic mapping between logical processes and physical devices. (#37523, #37326)
Improve the underlying mechanism and interface for automatic parallel to facilitate the unification of modules and add the optimized pass. (#36617, #38132)
Add unified resource representation, to support for automatic mapping between logical processes and physical devices. (#37091, #37482, #37094)
Improve the distributed attribute complementation for the backward and update parts of the computation graph. (#36744)
Add data slicing function. (#36055)
Add tensor resharding function to reshard the tensor according to the distributed properties of the tensor and operator. (#40865, #41106)
Add the automatic conversion pass of distributed parameters when the number of resources or parallel policy changes. (#40434)
Add GradientMerge pass to reduce the number of communications and improve training efficiency. (#38259, #40737)
Add Recompute pass to reduce the activation memory storage. (#38920)
Add Sharding optimization pass, to support p-g-os 3-stage optimization. (#38502)
Add fused QKV parallelization for Transformer class model. (#39080)
Improve the sharding propagation for while op to ensure convergence of the fixpoint algorithm. (#39939, #39086, #39014)
Support training and inference for subblock and while op control flow. (#39612, #39895, #40077)
Parameter Server
Add NaN/Inf value checking tool under GPUPS. (#38131)
Under GPUPS, add set_date interface to adapt incremental training. (#36194)
Under GPUPS, add asynchronous release dataset function. (#37790)
Under GPUPS, support dumping parameters and intermediate layers. (#36157)
Under GPUPS, support the optimizer parameter configuration. (#39783, #39849)
Under the Unified Parameter Server, refactor the base classes of each module such as communication and storage, to improve the ease of secondary development of each module. (#41207, #41022, #40702, #39341, #39377, #39191, #39064)
Add evaluation metrics module under the Unified Parameter Server, to support AUC/WuAUC/MaskAUC and other evaluation metrics calculation and customizable extensions. (#38789)
Support XPU parameter server training on KUNLUNXIN 2. (#41917, #42266, #41916)
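As background for the metrics module above, the AUC part can be sketched with the standard rank-sum (Mann-Whitney) formulation; this is a generic illustration, not Paddle's implementation:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.
    Tied scores receive averaged ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:                       # assign average rank within tie groups
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    rank_sum, pos, neg = 0.0, 0, 0
    for r, (_, label) in zip(ranks, pairs):
        if label == 1:
            rank_sum += r
            pos += 1
        else:
            neg += 1
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)
```
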
Profiler¶
Add the performance analysis module paddle.profiler in the Python layer: provide the ability to collect, export, and count performance data during the training process. (#40065, #40357, #40888)
paddle.profiler.Profiler: performance analyzer, the interface for user interaction. (#41029, #41524, #41157, #40249, #40111, #39964, #40133)
paddle.profiler.RecordEvent: provide custom punches to record time. (#39693, #39694, #39695, #39675, #41445, #41132)
paddle.profiler.ProfilerTarget: specify the target device for performance analysis.
paddle.profiler.ProfilerState: indicate the state of the performance analyzer.
paddle.profiler.SortedKeys: specify the sorting method of the data within the statistics form.
paddle.profiler.make_scheduler: the scheduler generating the performance analyzer state, implementing periodic control of the collection scope.
paddle.profiler.export_chrome_tracing: save performance data to a Google Chrome tracing file viewable with the chrome://tracing plugin. (#39316, #39984, #41029)
paddle.profiler.export_protobuf: save performance data to a protobuf file represented by an internal structure. (#39519, #39109, #39474)
paddle.profiler.load_profiler_result: load the performance data saved to a protobuf file.
paddle.profiler.Profiler generates statistics for data reading, step overhead, and throughput for model training when the timer_only parameter is specified. (#40386)
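To illustrate how a scheduler of the make_scheduler kind can drive periodic collection, here is a simplified stand-in: the state names mirror the APIs above, but the signature and behavior are a sketch, not the real paddle.profiler.make_scheduler (which, for example, also supports a skip_first option):

```python
from enum import Enum

class ProfilerState(Enum):
    CLOSED = 0              # not collecting
    READY = 1               # warming up, data discarded
    RECORD = 2              # collecting
    RECORD_AND_RETURN = 3   # last collecting step of a cycle

def make_scheduler(*, closed, ready, record, repeat=0):
    """Return a step -> state function cycling through CLOSED x closed,
    READY x ready, RECORD x record; repeat=0 means cycle forever."""
    period = closed + ready + record
    def schedule(step):
        if repeat and step >= repeat * period:
            return ProfilerState.CLOSED
        pos = step % period
        if pos < closed:
            return ProfilerState.CLOSED
        if pos < closed + ready:
            return ProfilerState.READY
        if pos == period - 1:
            return ProfilerState.RECORD_AND_RETURN
        return ProfilerState.RECORD
    return schedule
```

A training loop would call `schedule(step)` each iteration and start/stop collection on the state transitions.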
Refactor Profiler underlying infrastructure in C++ layer
Modify the name and type of logging for ops under the new dynamic graph. (#41771)
Add Kernel running statistics into profilers' summarization and optimize the summarization. (#41989)
Remove side effects on forward-computing performance when the Profiler is off. (#42142)
CINN compiler adoption¶
With the recent development of PaddlePaddle's compiler CINN (GitHub - PaddlePaddle/CINN: Compiler Infrastructure for Neural Networks), the Paddle framework has also been changed to adapt to CINN features. These include the subgraph management related functions for the Paddle-CINN runtime, optimization of memory and speed performance, and bug fixing during development.
Functions developed:
Subgraph op related functions:
Add the function to find and generate CINN subgraphs from computational graphs. (#36345)
Add cinn_launch op as a runtime entry point to CINN. It is responsible for scheduling CINN to compile the subgraph, initialize the data, and execute the generated kernels. (#36600)
Add a helper class CinnLaunchContext to the kernel implementation of cinn_launch op to manage the intermediate data for compiling and running subgraphs, to improve scalability and code readability. (#37938)
Add additional fetch nodes to CINN subgraphs, thus ensuring that CINN external nodes can fetch the values of variables. (#37172, #37190)
Add the function to symbolize a CINN subgraph, which is used to topologically sort the subgraph and return the CINN execution sequence. (#36417)
Add a CinnCompiler class for invoking subgraphs in the CINN compiled graph that can be replaced by using CINN operators. (#36562, #36975)
Add the interface to the CINN symbolization class to get the names of subgraph fetched variables, to prevent fetched variables from being eliminated in compilation optimizations. (#37218)
Checking, debugging, and API changes related:
Synchronize the update of NetBuilder API name changes in CINN. (#40392)
Add necessary log information to Paddle-CINN for better debugging. (#36867)
Add the bidirectional conversion function between Paddle desc and CINN desc. (#36100)
The operator implemented in CINN may not use some input variables compared to Paddle. Therefore, remove the check that the input variables must be used in the cinn_launch op. (#37119)
Add cinn_instruction_run op for invoking CINN to execute a single generated instruction, facilitating the construction of scheduling run subgraphs on the Paddle side. (#39435, #39576)
Add control macros to Paddle for the CUDA/CUBLAS/MKL/CINN pass application required to compile CINN. (#37066, #36660)
Add two control flags FLAGS_allow_cinn_ops and FLAGS_deny_cinn_ops to control the categories of CINN operators used to replace native operators during Paddle training. (#36842)
Performance optimization:
Speed optimization
Optimize the computational time consumed by CinnCacheKey. (#37786, #37317)
Cache variable scope for CINN compiled subgraphs to reduce runtime parameter construction overhead. (#37983)
Utilize CINN's auto-tuning for subgraph compilation; it can be enabled by a flag for further tuning of training performance. (#41795)
Refactor the correctness check of compilation results for subgraph compilation, to avoid repeated checks at runtime and reduce the scheduling overhead. (#41777)
Enable the TransposeFolding and GemmRewriter optimization passes by default in Paddle-CINN training. (#41084)
Pass the CUDA stream created in Paddle into CINN so that Paddle and CINN can use the same CUDA stream for CUDA computing. (#37337)
Move CINN optimization pass application logic from Paddle to CINN. (#42047, #42070)
Device memory optimization
Add NoNeedBufferVars to cinn_launch op to declare a list of input variables that do not require a buffer, so that the memory can be freed in advance. (#38367)
Pass reference count information for external variables into the subgraph, so that subgraphs within cinn_launch can reuse memory optimization passes and reduce the memory overhead of using CINN. (#39209, #39622)
Add the function to convert a collection of executable instructions generated by CINN compilation into a Paddle Graph, supporting reuse of the Paddle scheduler and memory optimization pass, further reducing the memory overhead of using CINN. (#39724, #39911)
Add the Kernel of cinn_instruction_run op, to support dynamic device memory requests based on data types inferred from compilation results. (#40920)
Bug fixing:
Fix and optimize the generation logic of CINN subgraphs. (#36503)
Fix the bug that Paddle-CINN does not support no-input subgraphs. (#40814)
Fix an error reported due to CINN not being able to handle useless outputs in operators such as batch_norm. (#36996)
Fix several bugs in CINN subgraph partitioning and symbolization, and solve problems with Paddle training accessing CINN. (#36739, #36698)
CINN does not yet support control flow; add logic to skip control flow when it is encountered. (#40812)
Other¶
Model quantization
Upgrade quantization storage format to unify quantization formats for dynamic and static graphs. (#41041)
Add new post training quantization (PTQ) methods: EMD and Adaround. (#40421, #38460)
Support to quantize more operations in PTQ and QAT, such as crop, split, ab, unsqueeze etc. (#40083)
Support to quantize operators in control flow. (#37498)
Support quantization of matmul_v2 operator. (#36469)
Add support for quantized matmul_v2 inference on TensorRT. (#36594)
CUDA memory optimization
Implement a multi-stream safe Allocator to support safe and efficient use of CUDA memory in asynchronous computing scenarios. (#37290)
Add new APIs (paddle.device.cuda.max_memory_allocated, paddle.device.cuda.max_memory_reserved, paddle.device.cuda.memory_allocated and paddle.device.cuda.memory_reserved) for GPU memory monitoring in runtime. (#38657)
Support allocating CUDA Managed Memory to train super large models in memory-constrained scenarios. (#39075)
Add GetBasePtr interface in C++ to get device address created with cudaMalloc. (#37978)
Reduce the number of free blocks in AutoGrowth Allocator to improve memory allocation performance. (#35732)
Remove the redundant Float32 temporary tensor and cast operation for tensors with data type FP16 in initializer.Normal and initializer.Constant, to save 2x memory. (#38818)
High-order derivative testing for models in dynamic graphs.
Custom op: support custom ops on the ROCm (HIP) platform. (#36771)
Cost Model: add a basic Cost Model based on profiling information. (#35774)
Add a function to allow users to add their own layers and corresponding pruning methods to ASP support. (#40253)
Add a string tensor data structure, allowing the framework to represent and process strings. (#39830, #40992)
Add or upgrade oneDNN FP32/int8/bfloat16 Kernel, including:
ELU (#37149)
exp (#38624)
stack (#37002)
softplus (#36382)
round (#39653)
shape (#36033)
flatten and flatten2 (#35892)
slice (#37630)
elementwise_mul (#40546)
elementwise_add (#38176)
elementwise_div (#36158)
elementwise_sub (#35662)
roi_align (#37848)
assembly optimized Adam (#39158)
logsoftmax (#39793)
activation (#40721)
mul (#38552)
mean (#37104)
relu (#36265)
pool2d (#37081)
concat (#35889)
LayerNorm (#40418)
Add the 3-stage storage graph retrieval engine based on SSD - host memory - GPU device memory, to support large-scale graph neural network training. (#42472, #42321, #42027)
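The tiered-storage idea behind the SSD - host memory - GPU device memory engine can be sketched with a generic two-tier cache that demotes evicted entries instead of dropping them. This is an illustration only, not Paddle's implementation:

```python
from collections import OrderedDict

class TwoTierCache:
    """A small fast tier (think GPU memory) backed by a larger slow tier
    (think host memory); fast-tier evictions demote entries rather than
    discard them, and slow-tier hits promote them back."""
    def __init__(self, fast_capacity):
        self.fast = OrderedDict()   # LRU order: oldest first
        self.slow = {}
        self.cap = fast_capacity

    def put(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.cap:
            old_key, old_val = self.fast.popitem(last=False)
            self.slow[old_key] = old_val        # demote, don't discard

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key]
        if key in self.slow:                    # promote on access
            self.put(key, self.slow.pop(key))
            return self.fast[key]
        return None
```

A real engine adds the third (SSD) tier below the slow tier with the same demote/promote discipline.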
Add a heterogeneous multi-cloud training communication module switch, implement the Send/Recv interface functions, and support communication across multiple heterogeneous clouds. (#40965, #40911)
(2) Function optimization¶
API¶
Add backward implementation of paddle.linalg.det. (#36013)
Add support for mixed precision training O2 mode for paddle.Model, i.e., support for the Pure FP16 training mode of the original dynamic/static graphs. (#36441)
Support self chain calls for paddle.nn.Layer. (#36609)
Add the is_distributed property setting to the to method of paddle.nn.Layer, to ensure that the distributed properties remain consistent before and after network parameter transform. (#36221)
Improve the parameter conversion logic of the to method of paddle.nn.Layer, to reduce the peak memory consumption of the conversion process and improve the conversion success rate. (#36862)
Support setting the shape of the output Tensor for paddle.incubate.graph_send_recv, to reduce the memory usage during the actual computation. (#40509)
Add support of int32 and int64 data types for paddle.incubate.segment_sum, segment_mean, segment_max, and segment_min. (#40577)
Add support of the bool type for the transpose op. (#35886)
Switch the paddle.mm underlying operator from matmul to matmul_v2. (#35770)
Support static graph mode and unknown shapes for paddle.einsum. (#40360)
Support data parallelism for paddle.nn.functional.margin_cross_entropy and paddle.nn.functional.class_center_sample. (#39852)
Support input of shape [1] for paddle.nn.functional.grid_sample. (#36183)
Support NHWC data format for paddle.nn.PRelu. (#37019)
Support a fixed random state using paddle.seed for paddle.nn.functional.class_center_sample. (#38248)
Add ROCM backend support for all APIs under paddle.fft, and optimize CUFFT backend error messages. (#36415, #36114)
Support the function that the slicing dimension is 0, i.e., allow slicing index results to be empty. (#37313)
Support int and bool type Tensors with a bool index for Tensor.setitem. (#37761)
Support nearest mode for paddle.nn.functional.interpolate when the input shape is 5D. (#38868)
Add support of int16 for paddle.nn.Embedding and paddle.gather. (#40964, #40052)
Support data parallelism on a single machine on the CPU platform in paddle.distributed.spawn. (#35745, #36758, #36637)
Add depthwise_conv2d MKLDNN operator. (#38484)
Add complex type checks in the static graph model for the APIs paddle.abs, paddle.transpose, paddle.squeeze, paddle.unsqueeze, paddle.matmul, and paddle.full. (#40113)
Support tuple and list type arguments for paddle.autograd.PyLayer. (#38146)
Add a check of whether the tensor is inplace and leaf when calculating the gradient. (#37931)
Support the HIP library for paddle.autograd.PyLayer. (#38184)
Support more sizes of inputs for paddle.take_along_axis and paddle.put_along_axis, and allow the index matrix shape size to be larger than the array matrix shape size. (#39072)
Optimize the error message of the API paddle.nn.Pad2D when replicate is 0. (#36510)
Support pad input in tuple format for the API paddle.nn.Pad2D. (#35985)
Add tdm_sample API in paddle.distributed.InMemoryDataset to support sampling operations in TDM algorithms. (#37044)
Add Pre-saving Hooks mechanism for paddle.jit.save. (#38186)
Add new higher-order differentiation-related APIs:
elementwise_add: add third-order Kernel, to support computation of third-order differentiation. (#36508, #36618)
matmul_v2: add third-order Kernel, to support computation of third-order differentiation. (#36459)
elementwise_mul: add third-order Kernel, to support computation of third-order differentiation. (#37152)
Improve the logic of paddle.amp.GradScaler to call the check_finite_and_unscale op, to eliminate the cudaMemcpy introduced by the creation of the bool variable. (#37770)
Add checks for the unstack and unique ops in case of an input Tensor with 0 elements. (#36021)
Add a new multi-layer, bidirectional LSTM function supporting KUNLUNXIN 2, to improve RNN forward/backward ops and support temporal model training. (#42076)
Add bce_loss forward/backward ops for KUNLUNXIN 2. (#41610)
IR(Intermediate Representation)¶
Dynamic Graphs to Static Graphs
Optimize the behavior of the ProgramCache.last interface for dynamic graph to static graph, so that it returns the most recently used Program instead of the final generated Program. (#39541)
Optimize the error message of the paddle.reshape API for dynamic graph to static graph, and add a new recommended usage hint. (#40599)
Optimize the type of exception caught in the is_api_in_module function when transcribing dynamic code to static code. (#40243)
Optimize the error message hints for dynamic graph to static graph; hide warning information by default. (#39730)
Add support of type hint syntax for dynamic graph to static graph, to improve the accuracy of variable type analysis. (#39572)
Optimize the paddle.cond function to allow values to be equal for basic types such as bool and int. (#37888)
Optimize the decorator @to_static to allow switching the train/eval mode. (#37383)
Optimize the error report stack for dynamic graph to static graph, to highlight user-related code and reduce the framework's redundant error stack. (#36741)
Remove the no_value placeholder from the return value of paddle.cond. (#36513, #36826)
Adapt the run_program op to the new dynamic graph mode. (#40198, #40355)
Add check for zip syntax. (#37846)
Fix the dynamic graph to static graph failure due to errors of dimension and type judgment in paddle.signal.frame, paddle.signal.stft and paddle.signal.istft. (#40113)
Add registration of complex type Kernels for the mean and pad3d ops. (#40113)
Mixed Precision Training¶
Distributed Training¶
Basic functions of the distributed training
Optimize Fleet API and DistributedStrategy configuration to use dynamic graph parallel function conveniently. (#40408)
Optimize Dynamic Graph mixed parallel HybridParallelClipGrad strategy, support 4D hybrid parallel and Pure FP16 training. (#36237, #36555)
Restructure dynamic graph data parallel strategy, to support new dynamic graph and communication. (#40389, #40593, #40836, #41119, #41413, #39987)
Support distributed tensor model parallel for fused_attention op. (#40101)
Support the distributed tensor model parallel for fused_feedforward op. (#40160)
Graph retrieval engine
Optimize the data format returned by the graph sampling interface of the graph engine, with a 3x improvement of the sampling speed. (#37315)
Reduce the amount of graph engine threads to improve performance. (#37098)
Optimize graph engine data transfer to improve performance. (#37341)
Optimize the merge logic of embedding op to improve performance by exploiting the topological relationship of embedding op in the model. (#35942)
Communication library: restructure the communication library to improve its scalability and ease of secondary development, and support heterogeneous communication. (#41398, #39720, #40911, #40579, #40629, #40437, #40430, #40228, #40181, #40100, #40097, #39892, #39384, #39737, #40040)
Support the publication of MoE-related interfaces in paddle.incubate.distributed.models.moe (moe.GShardGate, moe.BaseGate, moe.SwitchGate, moe.MoELayer, and moe.ClipGradForMOEByGlobalNorm). (#42300)
Fix the error report in the use of recomputation in paddle.incubate.distributed.models.moe.MoELayer. (#42128)
Fix the error report in the new dynamic graph pipeline parallel caused by different data types. (#41937, #42053)
Fix the error report in the new dynamic graph tensor model parallel due to different data types. (#41960)
Custom operator¶
Enhance the C++ custom operator mechanism for writing second-order gradient operators, to support adding suffixes to the gradient input variables of second-order gradient operators for use as outputs. (#41781)
Remove the use of the deprecated enumeration type PlaceType from the Tensor API member methods, keep it compatible, and add a deprecation warning. (#41882)
Add deprecation warnings for a number of deprecated interfaces of the original Tensor API, including the incomplete constructor, reshape, mutable_data, and copy_to methods. (#41882)
Other¶
Error report and debugging optimization
Optimize the error message of the label boundary check for the cross_entropy op. (#40001)
Add profile records for the infer_shape and compute methods of op execution in dynamic graphs, and show their cost in the timeline. (#39023)
Replace pybind::index_error error hints on Windows for unknown exceptions. (#40538)
Add error messages in the out-of-bounds checks for the scatter op. (#37429)
Download tool: for the problem of slow decompression of directories with multiple files in paddle.utils.download.get_path_from_url, replace the original way of decompressing files one by one (traversing the directory in a loop) with calling extractall on the whole directory, which greatly improves the decompression speed. (#37311)
Speed up the quantization training for fake_quantize_range_abs_max, fake_quantize_abs_max, fake_quantize_dequantize_abs_max, fake_quantize_moving_average_abs_max, etc. (#40491)
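For reference, the abs_max family of fake-quantize ops simulates quantization by scaling with the maximum absolute value, rounding to the integer grid, and dequantizing back so that training sees the quantization error. A minimal sketch, not the optimized kernels referenced above:

```python
def fake_quantize_abs_max(x, bit_length=8):
    """Fake-quantize a list of floats: quantize to a signed integer grid
    using the max-abs scale, then dequantize back to floats."""
    qmax = (1 << (bit_length - 1)) - 1          # 127 for 8-bit
    scale = max(abs(v) for v in x) or 1.0       # avoid dividing by zero
    quantized = [round(v / scale * qmax) for v in x]
    return [q * scale / qmax for q in quantized], scale
```

The moving_average and range variants differ only in how the scale is tracked across steps rather than recomputed per batch.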
(3) Performance optimization¶
Distributed Training¶
Hybrid parallel optimizer sharding_optimizer supports the optimize_cast optimization, which moves the parameter cast in the forward and backward stages to the optimizer stage. This improves performance by 7%. (#35878)
GPUPS optimization: support gradient fuse allreduce training. This improves training performance by 20%. (#35131)
GPUPS optimization: dump CPU optimization speed improves by 3.21x. (#40068)
CPU parameter server streaming training optimization: support automatic statistics of sparse parameters, incremental saving of sparse parameters, etc. The training performance improves by 20%. (#36465, #36601, #36734, #36909, #36943, #37181, #37194, #37515, #37626, #37995, #38582, #39250, #40762, #41234, #41320, #41400)
Autotuning¶
Add hardware-aware automatic performance tuning for the full training process, with performance improvements of about 3% to 50% or more on image classification, segmentation, detection, and image generation tasks compared to the models' default configurations. The auto-tuning status is set via the paddle.incubate.autotune.set_config API. By default, it is currently disabled. Auto-tuning has three specific levels:
Add the auto-tuning function to paddle.io.DataLoader, to select the best num_workers based on training data and device resources. (#42004)
Add the mixed-precision training data layout auto-tuning feature, to select the best data layout based on device type and data type, and automatically convert it at runtime. (#41964)
Add the automatic tuning of the required workspace size threshold for Conv, which is automatically set based on the GPU’s currently available requested device memory resources. Add the automatic selection of Conv cuDNN algorithms based on the generic AlgorithmCache design and Kernel timing component, which supports data variation length models.（#41833）
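The Conv tuning described above combines a kernel timing component with a shape-keyed cache, so variable-length workloads pay the tuning cost only on the first occurrence of each input configuration. The general idea can be sketched in pure Python (hypothetical class, not Paddle's internal AlgorithmCache):

```python
import time

class AlgorithmCache:
    """Pick the fastest algorithm per input shape and cache the choice.

    A simplified illustration of exhaustive-search autotuning with a
    shape-keyed cache, not Paddle's actual AlgorithmCache.
    """
    def __init__(self, algorithms):
        self.algorithms = algorithms      # name -> callable(shape)
        self.cache = {}                   # shape -> winning algorithm name

    def pick(self, shape):
        if shape not in self.cache:       # time candidates only on a miss
            timings = {}
            for name, fn in self.algorithms.items():
                start = time.perf_counter()
                fn(shape)
                timings[name] = time.perf_counter() - start
            self.cache[shape] = min(timings, key=timings.get)
        return self.cache[shape]

# Two toy "algorithms" with very different costs.
algos = {
    "naive": lambda shape: sum(range(shape[0] * 2000)),
    "tiled": lambda shape: sum(range(shape[0])),
}
cache = AlgorithmCache(algos)
choice = cache.pick((64, 64))   # tunes once, then reuses the cached winner
```

Subsequent calls with the same shape skip timing entirely, which is why the feature suits models whose input lengths vary over a bounded set of shapes.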
Operator Optimization¶
Optimize FasterTokenizer performance, with a 10% performance improvement compared to pre-optimization. (#36701)
Optimize index_select inverse computation, with a 3.7x to 25.2x performance improvement over pre-optimization. (#37055)
Optimize the performance of paddle.nn.ClipByGlobalNorm. Taking a 10*10 paddle.nn.Linear as an example, performance improves by about 30% compared to pre-optimization. (#38209)
Optimize the performance of pnorm with very large or very small axis dimensions, with a 31x to 96x improvement in forward speed and a 1.1x to 19x improvement in backward speed. (#37685, #38215, #39011)
Optimize softmax forward and backward performance, with a speedup ratio of about 2x for the axis!=-1 configuration. (#38602, #38609, #32387, #37927)
Optimize log_softmax forward and backward performance, with a speedup ratio of about 6x to 20x for axis!=-1 configurations. (#38992, #40612)
Optimize softmax_with_cross_entropy forward and backward performance, with a speedup ratio of about 1.3x for the hard_label configuration. (#39553, #40424, #40643)
Optimize top_k performance, with a speedup ratio of more than 22x for the one-dimensional, larger-k (k=5000) configuration. (#40941)
Optimize elementwise_mul backward computation, with a 1.85x to 12.16x performance improvement over pre-optimization. (#37728)
Optimize elementwise_min and elementwise_max backward computation, to equalize or improve performance by 1.05x to 18.75x over pre-optimization. (#38236, #37906)
Optimize nearest_interp forward and backward computation, with forward performance improved by 1.5x to 2.3x and backward performance improved by 60% to 1.8x over pre-optimization. (#38528, #39067)
Optimize bilinear_interp forward and backward computation, with forward performance improved by 0.4x to 2.3x and backward performance improved by 10% to 30% over pre-optimization. (#39243, #39423)
Optimize dropout forward and backward computation, with performance improved by about 20%. (#39795, #38859, #38279, #40053)
Optimize grid_sampler forward and backward computation, with forward performance improved by 10% to 30% and backward performance improved by 10% to 60% over pre-optimization. (#39751)
Optimize group_norm forward and backward computation, with forward performance improved by 1.04x to 2.35x and backward performance improved by 1.12x to 1.18x. (#39944, #40657, #39596)
Optimize conv1d forward and backward computation, with forward performance improved by 1.00x to 2.01x and backward performance improved by 1.01x to 474.56x. (#38425)
Optimize elementwise_div backward computation, with backward performance improved by 1.02x to 29.25x. (#38044)
Optimize gelu forward and backward computation, with forward performance improved by 1.13x to 1.43x and backward performance improved by 1.10x to 1.55x. (#38188, #38263)
Optimize elementwise_sub backward computation, with backward performance improved by 1.04x to 15.64x. (#37754)
Optimize flip's forward performance on one-dimensional data input, with performance improved by 100%. (#37825)
Optimize layer_norm forward and backward computation, with forward performance improved by 2x to 5x and backward performance improved by 20% to 50% over pre-optimization. (#39167, #39247)
Optimize embedding forward and backward computation, with a maximum improvement of 1.51x in forward performance and 1.03x to 7.79x in backward performance. (#39856, #39886)
Optimize gelu FP16 forward and backward computation, with forward performance improved by 9% to 12% and backward performance improved by 2% to 9% over pre-optimization. (#38980)
Remove the explicit CPU -> GPU data transfer in gather_nd forward and backward operators, and remove the explicit synchronization in index_select forward and backward operators. Change the GPU -> GPU data transfer in scatter_nd from a synchronous to an asynchronous operation. (#40933)
Optimize the Lars optimizer computation, with the training performance of the ResNet50 FP16 model improved by 5.1% over pre-optimization. (#35652, #35476)
Optimize AvgPool2dGrad computation, with performance improved by 2.6x over pre-optimization. (#35389)
Optimize Elementwise computation for multivariate output, improving performance by up to 15% over pre-optimization. (#38329, #38410)
Optimize paddle.sum performance, with performance improved by about 20%. (#42309)
Remove the CudaStreamSync operation from paddle.nn.ClipGradByGlobalNorm to reduce scheduling overhead during execution, with a 5% performance improvement on PTB models. (#42170)
Optimize a series of underlying data structures and detailed implementations in the original dynamic graph execution system to improve its scheduling performance. (#42010, #42171, #42224, #42256, #42306, #42329, #42340, #42368, #42425)
Simplify the probs computation logic of paddle.distribution.Categorical, to improve performance by 4x to 5x. (#42178)
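Two of the items above touch global-norm gradient clipping (ClipByGlobalNorm, ClipGradByGlobalNorm); the underlying computation is simple enough to state in a few lines (a pure-Python sketch, not Paddle's kernel):

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients so their joint L2 norm is at most clip_norm.

    grads: list of flat gradient lists. Illustrative only.
    """
    # Global norm is the L2 norm over every element of every gradient.
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= clip_norm:
        return grads                       # already within bounds
    scale = clip_norm / global_norm        # single shared scaling factor
    return [[g * scale for g in grad] for grad in grads]

clipped = clip_by_global_norm([[3.0], [4.0]], 1.0)  # global norm is 5.0
```

Because every gradient shares one scale, the relative directions of the update are preserved; only the overall magnitude is bounded.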
(4) Bug fixing¶
API¶
Fix the output type error with paddle.sum when the input parameter type and output parameter type do not match and the number of reduce elements on the axis is 1. (#36123)
Fix an AttributeError in paddle.flops when the layer output type is tuple. (#38850)
Fix paddle.diag failing to propagate gradients because there is no backward kernel. (#40447)
Fix an error in sorting paddle.sort input with NaN values. (#41070)
Fix the error when paddle.full_like's input contains an INF value. (#40232)
Fix the bug in paddle.strided_slice: the strided_slice result is not consistent with slice when the data in the starts input is less than the rank. (#39066)
Fix the bug in the max_pool family of operators where infer_shape is calculated incorrectly when an index is returned. This affects the APIs: paddle.nn.functional.max_pool1d/2d/3d, paddle.nn.functional.adaptive_max_pool1d/2d/3d, paddle.nn.MaxPool1D/2D/3D, paddle.nn.AdaptiveMaxPool1D/2D/3D. (#40139)
Fix an issue where the dtype of the pooling_mask returned by the max_pool family of operators is incorrect. Now the dtype of pooling_mask is int32. The affected APIs are paddle.nn.functional.max_pool1d/2d/3d, paddle.nn.functional.adaptive_max_pool1d/2d/3d, paddle.nn.MaxPool1D/2D/3D, paddle.nn.AdaptiveMaxPool1D/2D/3D. (#39314)
Fix the bug with paddle.shape where the backward gradient by default causes a computation error. (#37340)
Fix the bug in paddle.nn.Layer's to method when converting both dtype and place at the same time. (#37007)
Fix the bug that paddle.amp.decorate fails to rewrite the parameters of non-leaf network layers to FP16. (#38402)
Fix the bug that paddle.amp.decorate rewrites the non-input parameters in paddle.nn.BatchNorm1D, paddle.nn.BatchNorm2D, and paddle.nn.BatchNorm3D to FP16. (#38541)
Fix the bug that paddle.amp.decorate rewrites the non-input parameters in paddle.nn.SyncBatchNorm to FP16. (#40943)
Fix redundant warnings in paddle.nn.Layer.to. (#36700)
Fix the bug in paddle.nn.RNN when used inside control flow. (#41162)
Fix the bug that paddle.to_tensor fails to specify the CUDAPlace of the Tensor. (#39662)
Fix the issue that paddle.nn.Identity is not exposed. (#39615)
Fix the bug where the output values of the fill_ and zero_ inplace APIs are incorrect when the input is on a CUDAPinned Place after the dynamic graph reconstruction. (#41229)
After refactoring the dynamic graph, fix the bug of an incorrect inplace version value of the output Tensor when calling the assign op via append op. Change it to call the assign op via _C_ops. (#41118)
Remove unreasonable code in elementwise_add's third-order kernel, and fix an uninitialized issue in the network creation process. (#36618)
Fix the missing attribute bug in conv2d execution of the cuDNN kernel. (#38827)
Fix an issue where the multiclass_nms3 output shape is incorrect. (#40059)
Fix an issue with yolo_box outputting an incorrect shape. (#40056)
Fix an issue where the higher-order differentiation gradients interface does not take effect as expected when target_grad is specified. (#40940)
Fix an issue that the network parameter type is incorrect when default_dtype is modified in the _BatchNormBase base class in dynamic graph mode. The affected APIs are paddle.nn.BatchNorm1D, paddle.nn.BatchNorm2D, paddle.nn.BatchNorm3D, and paddle.nn.SyncBatchNorm. Specific reason: when get_default_dtype() == 'float16', the default parameter data type is modified by set_default_dtype('float32'). The parameter type in dynamic graph mode is created from default_dtype; therefore, the change of the default parameter type causes a Parameter type error in subsequent networking. (#36376)
Fix the bug of an undefined intermediate variable in the backward op of the batch_norm op when the data type is FP32, dims = 2, and data_layout = NHWC. (#37020)
Fix the bug that the shape of weights is incorrect when using paddle.static.nn.prelu in static graph mode with input format NHWC and mode==channel. (#38310)
Fix the bug of paddle.nn.functional.class_center_sample: CUDA seed setting issue in the multi-machine case. (#38815)
Fix the bug of failing to report an error when the input of paddle.nn.functional.one_hot is incorrect. (#41335)
Fix an issue where a callback to reclaim device memory on a DCU device is not triggered in time, resulting in an OOM of the device memory. (#40445)
Fix the bugs of abnormal setitem backward gradients and abnormal inplace logic handling in some dynamic graph scenarios. (#37023, #38298)
Fix the bug of an abnormal index when a Tensor array uses Slice to index in dynamic-to-static scenarios. (#39251)
Fix the bug of memory or device memory leaks caused by some temporary variables not being correctly destructed when the paddle.Tensor.register_hook interface is used. (#40716)
Fix the bug that Tensor.getitem cannot get the value when the index is a bool Tensor with all False. (#41297)
Fix the bug that Tensor.getitem cannot get the value when the index is a bool scalar Tensor. (#40829)
Fix the bug in paddle.index_select when index is a 0-shape Tensor. (#41383)
Fix the bug when the number of GPU threads requested by paddle.index_select and paddle.index_sample exceeds the machine's resource limits. (#41127, #37816, #39736, #41563)
Fix the bug when the ReduceConfig, elemwise_grad, gather, gather_nd, and scatter ops request more GPU threads than the machine's resource limits. (#40813, #41127)
Fix the bug that memory access is out of bounds when NX != 1 in ReadData, ReadDataBc, and ReadDataReduce in the Kernel Primitive API. (#36373)
Fix the bug of abnormal computation results due to data overflow caused by the IndexRandom data type error. (#39867, #39891)
Fix the bug of an erroneous computing result of the reduce op when reduce_num = 1. (#38771)
Fix the bug of out-of-bounds memory access of the reduce op in the middle dimension of reduce in HIP environments. (#41273)
Fix the bug of the kernel failing to be properly released in the computation of two FP16 one-dimensional vectors in the matmul op.
Fix the bug caused by CUDA integer computation overflow for some operators, including: bernoulli, gaussian_random, gumbel_softmax, multinomial, truncated_gaussian_random, uniform_random_inplace, and uniform_random ops. (#37670)
Fix the bug where paddle.nn.Sequential reports a KeyError when traversing sublayers in a for loop. (#39372)
Fix the bug of the check shape error in paddle.nn.functional.unfold when compiling in static graphs. (#38907, #38819)
Fix the bug of reporting an error if axis is specified when using dropout in static graphs. (#37223)
Migrate the matmul operator in paddle.nn.MultiHeadAttention to the matmul_v2 operator. (#36222)
Fix the bug of an FPE being thrown when an empty Tensor is used in paddle.nn.functional.label_smooth. (#35861)
Fix the deformation bug of the reshape op when input is an empty Tensor. Support reshaping an empty Tensor to [-1]. (#36087)
Fix the bug of modified values incorrectly overriding other rows when the fill_diagonal's input parameter offset is non-zero. (#36212)
Modify stop_gradient returned by the range op being set to True in dynamic graph mode. (#37486)
Fix the bug where the Lamb optimizer is updated incorrectly when Beta1Pow and Beta2Pow are on the GPU. (#38518)
Fix the bug where the conv2d operator does not respect FLAGS_cudnn_deterministic. (#37173)
Fix the bug caused by an earlier version of cufft that does not define CUFFT_VERSION. (#37312)
Fix the computing error of paddle.ifftshift and paddle.fftshift. (#36834, #36748)
Fix the axis computation error in the paddle.fft series of APIs. (#36321)
Fix an output data type registration bug of the batch_norm_grad op in case of the FP16 data type. This bug causes compilation failure in some scenarios, and also impacts FP16 computational precision. (#42461)
Fix the incorrect Infershape information bug in the paddle.nn.functional.pad API when padding is a Tensor in dynamic-to-static conversion. (#42414)
Fix an exception in paddle.distribution.StickBreakingTransform when the input dimension exceeds 2. (#41762)
Fix a nan/inf bug in the QK^T computation in the fused_attention op. (#42032)
Fix a nan/inf bug in the FusedResidualDropoutBias computation of the fused_attention op on V100. (#42398)
Fix a redundant data transform bug introduced by the full_like op during execution. (#41973)
Fix a problem with the p_norm op calculating nan in GPU environments. (#41804)
Fix a section error of the split op when the sections parameter has a size of 0. (#41755)
Fix the bug of reporting "not supporting Place (gpu:0)" in multi-card training when broadcast is required in 6 elementwise ops (pow, complex, divide_double, multiply_double, fmax, and fmin). (#42332)
Fix the bug that a deprecated interface reports a warning on import paddle due to a PIL version update. (#42307)
Fix the bug that paddle.linalg.matrix_rank does not support tol as an FP64 Tensor under static graphs. (#42085)
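Several of the getitem fixes above concern boolean-mask indexing. The intended semantics, illustrated in plain Python (not Paddle code), are that an all-False mask selects an empty result rather than failing:

```python
def masked_select(values, mask):
    """Select values[i] where mask[i] is True.

    A plain-Python model of boolean-mask indexing; an all-False mask
    yields an empty list instead of raising.
    """
    if len(values) != len(mask):
        raise ValueError("mask length must match values length")
    return [v for v, m in zip(values, mask) if m]

out = masked_select([10, 20, 30], [False, False, False])  # empty result
```

The fixed bugs were the framework failing on exactly these degenerate masks (all-False and scalar bool), where an empty or zero-dimensional selection is the correct answer.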
IR(Intermediate Representation)¶
Dynamic to static graphs
Fix a type derivation error in reverse gradient accumulation when tensor_array is used with control flow. (#39585, #39689)
Fix an issue where the parameter gradient type is not set correctly during dynamic-to-static AMP training. (#40938)
Fix an issue of reporting an error in dynamic-to-static transcription when there are misplaced annotations in the code. (#39035, #38003)
Fix an issue where Tensor is not properly converted to Variable when calling a non-forward function in dynamic-to-static code. (#37296, #38540)
Fix an issue where paddle is incorrectly passed as a variable during dynamic-to-static transcription. (#37999)
Fix an issue where model parameters are incorrectly counted when calling paddle.flops after model dynamic-to-static conversion. (#36852)
Fix an issue where GPU memory keeps growing in train mode and no_grad contexts after loading models using the paddle.jit.save/load interface. (#36434)
Add a warning in the convert_call function when converting a generator function. (#35369)
Fix the run_program op dependency analysis bug. (#38470)
Fix the code conversion bug when returning a single value in control flow For. (#40683)
Fix the bug when generating a reverse op when the input to the conditional_block op contains LoDTensorArray. (#39585)
Fix the bug that paddle.jit.save loses the forward_pre_hook and forward_post_hook of the top Layer in the export of a dynamic-to-static model. (#42273)
Fix the dynamic-to-static conversion error report when the shape parameter in paddle.expand contains a Tensor. (#41973)
Distributed Training¶
Distributed training basic functions
Fix the bug of a port reporting error in distributed multi-machine training. (#37274)
Fix the brpc compilation dependency bug. (#37064)
Fix an occupied port issue due to TCP self-connections when Fleet starts. (#38174)
Fix the precision degradation bug under data parallel due to inconsistent initialization of FP16 parameters under multiple cards. (#38838, #38563, #38405)
Fix the precision degradation under data parallel due to FP16 gradient synchronization without dividing by the number of cards. (#38378)
Dynamic graph hybrid parallelism
Fix the bug where parameters are not updated in FP16 mode under mixed parallel by using the new update interface. (#36017)
Static graph hybrid parallelism
Fix an issue where grad merge is not compatible with ClipGradientByGlobalNorm in distributed dp mode. (#36334)
Fix an issue under hybrid parallelism where the nondistributed parameters of tensor model parallelism are not broadcast during the initialization phase, resulting in inconsistent nondistributed parameters across cards. (#36186)
Fix the issue that sharding’s save_persistables interface does not save FP16 parameters and offload persistent variables when sharding is enabled with offload. (#40477)
Fix the bug where ema parameters are not saved on non-0 cards when sharding is enabled for training. (#39860)
Fix an issue where FC incorrectly calculates gradients according to column cuts. (#38724)
Fix the bug reported when DistributedStrategy is set to without_graph_optimizer when used with rnn. (#36176)
GPUPS Parameter Server Training
Fix the CPU branch compilation bug triggered by the GPUPS macro definition. (#37248)
Fix an occasional error raised when saving delta and pullsparse concurrency during GPUPS streamline training. (#37233)
Fix a download error issue caused by HDFSClient querying a directory without returning the full path. (#36590)
Fix the bug with pulling old parameters in GPUPS streamline training. (#36512)
Fix a GPUPS multistream allocation issue. (#37476)
Fix the bug of the GPUPS pybind out of core. (#37287)
Other¶
Fix the clip_extra issue when saving models for dynamic graph quantization training. (#38323)
Fix an issue with abs_max scale initialization for dynamic graph quantization training. (#39307)
Fix an issue of exceptions in saving model in dynamic graph quantization training. (#38102, #38012)
Fix the offline quantization flatten op output error. (#37722)
Fix the non-matching dimension bug in case of inverse quantization of the matmul op. (#36982)
Fix the bug of adding quantization op when quantizing matmul_v2 without weights. (#36593)
Fix the error of saving the quant_axis attribute in conv op channel-wise quantization when saving models. (#39054)
Fix the slow training of channel-wise quantization. (#40772)
Fix the bug of quantization training producing nan when dividing by a tensor initialized as 0. (#36762)
Fix incorrect settings of amp_level for mixed precision in multithreaded scenarios. (#39198)
Fix an issue where mixed precision is not set correctly for PyLayer and Recompute when mixed precision training is used with them. (#39950, #40042)
Fix an issue where -D_GLIBCXX_USE_CXX11_ABI does not take effect when compiling custom operators on Mac. (#37878)
Fix the bug of inconsistent dynamic and static graph behaviors of the initializer-related APIs in case of block=None. (#37827)
Fix the bug in python 3.6 where there is no fluid module. (#35862)
Fix the bug where the paddle.optimizer.AdamW optimizer incorrectly calls the adam op. (#36028)
Fix a logic error when the paddle.optimizer.Momentum optimizer's regularizer property is None under the multi tensor policy. (#38344)
Fix the bug that the paddle.optimizer.Momentum and paddle.optimizer.Adam optimizers modify the multi_precision property under the multi tensor policy. (#38991)
Fix the code compilation error when using the final-state API amp in combination with optional Tensor. (#40980)
Fix the bug where paddle+lite+xpu prediction library would report an error when calling lite CPU prediction, and fix the bug where paddle+lite(without NNAdapter) would report an error when compiling. (#37449)
Fix the bug in Debug compile mode where LoDTensorArray crashes due to inconsistent Pybind11 bindings. (#37954)
Fix the bug that prevents correct construction of a Tensor in the extreme case where the shape parameter is a list mixing Tensor and int. (#38284)
Fix a compatibility issue with the paddle.optimizer.AdamW API. (#37905)
Fix the bug in _InstanceNormBase where the return value of extra_repr is incorrect. (#38537)
Fix the bug that Paddle Inference lacks the symbol paddle::distributed::TensorTable when -DWITH_DISTRIBUTED is used. (#41128)
Fix the bug that the matmul_v2 op reports an error when there is a 0 value in the shape. (#35791)
Fix the problem of repeated printing of the no-gradient-input hint message of recompute in dynamic graphs. Change it to print only once, using a warning. (#38293)
Fix the gelu op bug causing low accuracy on the validation set in later epochs when training vision models. (#38450)
Fix adamw op error in numerical computation. (#37746)
Add the parameters in the sparse_momentum _C_ops interface. (#39969)
Fix the bug where there is no distributed module in python 3.6. (#35848)
Fix the eigh unit test data initialization problem. (#39568)
Fix the eigvalsh unit test data initialization problem. (#39841)
Fix the bug of not working properly due to excessive register usage on V100 by segment op. (#38113)
Fix the bug of conv-related op sparsification with incorrectly set dimensions. (#36054)
Provide the alias Paddle.static.sparsity for static graph-related Automatic SParsity training functions. (#36525)
Fix the bug where the divide op's integer division still returns an integer. (#40890)
Fix the crash bug of paddle.multiplex when the input Tensor value is 0. (#34972)
Fix a speed exception when setting the reduction parameter in paddle.nn.functional.kl_div. (#37283)
Fix the data source unsorted bug in loading the Cifar dataset. (#37272)
Fix the conversion of loss from uint16 to float in the ProgressBar class. (#39231)
Fix the ShareBufferWith shared data type problem. (#37464, #37247)
Fix the performance issue when paddle.io.DataLoader uses IterableDataset and num_workers>0. (#40541)
Fix the bug where paddle.vision.ops.yolo_loss returns incomplete values in dynamic graphs. (#40185)
Remove the restriction that the input parameter dataset of paddle.io.BatchSampler must be of paddle.io.Dataset type, to expand support for user-defined datasets. (#40184)
Fix the bug of paddle.summary reporting that op_flops does not exist. (#36489)
Fix the formula error of the lars_momentum op when lars_weight_decay=0. (#40892)
Fix the bug that optimizer-offload cannot save persistable variables. (#36433)
Fix an issue where optimizer-offload does not support the adamw op type. (#36432)
Fix an issue where enable_program_desc_tracing_data in Tracer is not safe in multithreaded scenarios. (#39776)
Fix an issue where the model file size is not initialized when the model is read. (#40518)
Fix the logic bug of the Expand op. When the dimension of the input Tensor X is smaller than the shape to be expanded, it may result in the incorrect Out.Shape. (#38677)
Fix the dynamic to static transcription error when the Expand_As op takes only y.shape without Y variable entered. (#38677)
Fix the logic error when Expand_As op computes the output shape. (#38677)
Fix the bug that variables of the core.VarDesc.VarType.STRINGS type report an error when getting the lod_level property; set their lod_level to None. (#39077)
Fix an issue where the framework function PyLayer does not support different dtypes. (#37974)
Fix the bug of division by zero in the learning rate decay API paddle.optimizer.lr.PolynomialDecay. (#38782)
Fix the issue where some logs remained after calling the DisableGlogInfo() interface. (#36356)
Fix an error in the backward of multi-layer RNN (when dropout is set to 0) in CPU training of the SimpleRNN, GRU, and LSTM APIs. (#37080)
Add cache for fft on the backend of cufft and hipfft. (#36646)
Enable the shifts parameter of paddle.roll to support Tensor input. (#36727)
Add onemkl to fft as an optional computation backend. (#36414)
Fix the precision bug of the bfloat16 type in the matmul_v2 and elementwise_div ops. (#42479)
Fix a possible error in the next step caused by LoDTensorArray clearing only the internal Tensor and not clearing the Array during device memory recycling. (#42398)
4. Deployment Direction (Paddle Inference)¶
(1) New features¶
New APIs¶
Add the Java API so that Java developers can implement high performance inference on the server and in the cloud through a simple and flexible interface. (#37162)
Add GetTrtCompileVersion and GetTrtRuntimeVersion interfaces for getting TensorRT version information. (#36429)
Add the ShareExternalData interface to avoid memory copies of input data during inference. (#39809)
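The point of ShareExternalData is that the inference tensor aliases caller-owned memory instead of copying it. The idea can be shown in plain Python (illustrative names, not Paddle's API):

```python
class ExternalTensor:
    """Hold a zero-copy view over caller-owned memory.

    Illustrates the idea behind sharing external input data instead of
    copying it; not Paddle's ShareExternalData implementation.
    """
    def __init__(self, buffer):
        # memoryview aliases the caller's buffer: no bytes are copied,
        # and later writes by the caller remain visible to the tensor.
        self.data = memoryview(buffer)

data = bytearray([1, 2, 3, 4])
tensor = ExternalTensor(data)
data[0] = 99             # caller updates its buffer in place
first = tensor.data[0]   # the tensor sees the update without any copy
```

The trade-off is the usual one for shared buffers: the caller must keep the memory alive and stable for the duration of the inference call.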
New functions¶
Add ONNX Runtime backend support. Currently it supports only CPU in the integrated version. (#39988, #40561)
Add support for Ascend 310 inference based on the Paddle Lite subgraph approach. (#35226)
Add the native GPU FP16 inference. (#40531)
For the switch_ir_debug interface, add the dump model function. (#36581)
Add the configuration interface for TensorRT config: void UpdateConfigInterleaved(paddle_infer::Config* c, bool with_interleaved), for special data layouts in int8 quantization inference. (#38884)
Add TensorRT inspector output information to the log. It is valid only for TensorRT 8.2 or later. (#38362, #38200)
Add the support of the TensorRT ASP sparse inference. (#36413)
(2) Underlying optimization¶
CPU performance optimization¶
Optimize the caching mechanism of MKLDNN. (#38336, #36980, #36695)
Add matmul_scale_fuse pass. (#37962)
Add MKLDNN reshape_transpose_matmul_v2_mkldnn_fuse_pass. (#37847, #40948)
Add MKLDNN conv_hard_sigmoid_mkldnn_fuse_pass. (#36869)
Add MKLDNN matmul_v2_transpose_reshape_fuse_pass. (#36481)
Add MKLDNN softplus_activation_mkldnn_fuse_pass. (#36657)
Add MKLDNN elt_act_mkldnn_fuse_pass. (#36541)
Add MKLDNN mish operator and conv_mish_mkldnn_fuse_pass. (#38623)
GPU performance optimization¶
Change the inference default GPU memory allocation policy from naive_best_fit to auto_growth, to solve the problem of some models filling up the GPU memory. (#41491)
Support gelu and FC+gelu ops using TensorRT inference. (#38399)
Support deformable_conv inference using TensorRT under static shape. (#36612, #36850, #37345)
Support the nearest_interp_v2 op using TensorRT inference. (#34126)
Add the yolo_box TensorRT plugin to support the input parameters iou_aware and iou_aware_factor, so that the IoU computed by inference is used as a factor for confidence. (#34128)
Support elementwise_sub and elementwise_div calling for TensorRT inference. (#40806, #41253)
Support multiclass_nms3 using TensorRT inference. (#41181, #41344)
Support the flatten_contiguous_range op using TensorRT inference. (#38922)
Support the pool2d attribute padding using TensorRT inference when its dimension is 4, and global_pooling and ceil_mode are True. (#39545)
Support batch_norm and elementwise_add using TensorRT inference when the dimension is 5. (#36446)
Add the reduce int32 and float types to use TensorRT inference. Add reduce_mean GPU operator int32 and int64 registration. (#39088)
Modify the MatmulV2ToMul pass. Modify the qualifier (no support for broadcast) and the op_teller mapping condition. (#36652)
Add the support for TenorRT plugin interface AddPluginV2IOExt. (#36493)
Add the aligned attribute in roi_align op and support for TensorRT inference. (#38905)
Add the support for TensorRT inference with the concat attribute axis = 1. (#39096)
Add TensorRT plugins: preln_emb_eltwise_layernorm and preln_skip_layernorm ops, for ERNIE-like model performance optimization. (#39570)
Add TensorRT fuse passes: preln_embedding_eltwise_layernorm_fuse_pass and preln_skip_layernorm_fuse_pass, for ERNIE-like model performance optimization. (#39508)
Split matmul fusion-related passes based on different backends (GPU, CPU, TensorRT), to support transposing FC weights. (#39369)
Add TensorRT support for the roll, strided_slice, and slice ops in case of dynamic shapes. (#41913, #41573, #41467)
Add div op support for TensorRT. (#41243)
Quantization support
For the PostTrainingQuantization API, add support for paddle.io.DataLoader objects or Python Generator input. (#38686)
ERNIE full quantization model inference supports interleaved data layout. (#39424)
Support PaddleSlim's new quantized model format inference. (#41049)
Add the matmul int8 quantization inference op converter and plugin. (#37285)
Add a pass to determine if all ops in the model can support int8 quantization. (#36042)
Support quantization inference for the FC part of the multi-head attention of the non-variable-length branch. (#39660)
(3) Bug fixing¶
Framework and API fixing¶
Fix the bug of model clipping when saving static graphs. (#37579)
For the C API, add the wrapper PD_Cstr for strings, and provide construction and destruction methods so that users do not need to destruct strings directly via the C runtime library. (#38667)
Fix the logic bug with memory reuse at prediction time. (#37324)
Fix memory reuse error reporting in multithreading. (#37894)
Allow passing empty strings for inference when no weight file is available. (#38579)
Fix an issue of clone not being supported when TensorRT dynamic shape is enabled. (#38520)
Fix multithreaded clone error after TensorRT dynamic shape is enabled. (#40067)
For the lite xpu interface, fix an issue where the xpu card cannot be selected. (#36610)
Add a file existence check to the interface that automatically generates TensorRT dynamic shape parameters. (#36628)
Fix the bug that the MKLDNN does not support conv3d. (#42055)
Backend Capability Fixing¶
Fix the cuDNN default algorithm selection configuration for prediction to use non-deterministic policies. (#41491)
Fix the resource recovery handling error in the deformable_conv op TensorRT plugin. (#38374)
Fix a serialization error in the TensorRT plugin for deformable_conv op. (#38057)
Adapt the new refactor engine and serialization API of TensorRT 8.0. (#36769)
Fix the bug that the Flatten2MatmulFusePass, Squeeze2MatmulFusePass, and Reshape2MatmulFusePass do not take effect. (#37644)
Fix the bug with TensorRT input data reporting errors. (#37427)
Add error message when input dimension is wrong. (#38962)
Fix the bug with EmbEltwiseLayernorm output type error. (#40015)
Remove conv_affine_channel_fuse_pass and the corresponding unit test. (#39817)
Fix an issue where the adaptive_pool2d pass incorrectly replaces the pool attribute. (#39600)
Fix the bug that shuffle_channel_detect_pass incorrectly generates shuffle_channel op. (#39242)
Fix transpose parameter error. (#39006)
Fix the crash bug when nearest_interp_v2 input scale dimension is less than 1. (#38725)
Fix the bug that the prelu does not support onedimensional input in dynamic shape. (#39389)
Fix the bug in the kernel function of slice’s special_slice_plugin. (#39875)
Temporarily disable int8 branch under skip_layernorm variable length to prevent accuracy degradation. (#39991)
Fix some bugs regarding support for preln_ernie models. (#39733)
Fix the bug that slice may exceed the thread limit in ERNIE. Fix the bug that the special_slice is incorrectly triggered. (#39096)
Fix the bug that the elementwise does not support broadcast when the dimension is the same. (#37908)
Fix the problem that the underlying implementation of the nearest_interp op differs when align_corners is True, causing a diff between TensorRT layer results and the native op. (#37525)
Fix qkv_plugin: Kernel function computation error. (#37096)
Fix the bug with inference pass for dynamic quantization. (#35879)
Reuse directly when Tensor requests less memory than the allocated size. (#37880)
Fix the hang bug when ERNIE fixedlength model is enabled with TensorRT. (#37839)
Fix the crash bug when TensorRT int8 lacks of dynamic range information. (#36900)
Fix the bug with slice deserialization code. (#36588)
Fix yolo box calculation formula error. (#36240)
Fix the crash bug when an earlier-version model uses a later version of roi_align. (#38788)
External Developers
Fix the bug of a large performance difference of softmax between Python and C++. (#37130)
Fix matmul inference failure on static-shape 2-dimensional input and dynamic-shape 3-dimensional input. (#36849)
Fix reshape_transpose_matmul_mkldnn_fuse_pass mishandling of shapes. (#36731)
Fix an issue where TensorRT gets 4 dimensions when the input is 2 dimensions. (#36614)
Fix the error report when the scale attribute of the interpolate_v2 MKLDNN operator is null. (#36623)
Fix poor performance of the recurrent operator in multithreaded scenarios. (#36052)
Remove restrictions of relu, sigmoid, tanh, relu6, batch_norm, clip, concat, gelu, hard_sigmoid, prelu, softmax, split, and swish on TensorRT 2-dimensional inputs. (#37097)
Fix reshape op to use TensorRT inference. (#41090)
Fix matmul-related passes to be compatible with matmul_v2. (#36424)
Support VALID and SAME attributes in the padding method of the conv2d operator when TensorRT is enabled. (#38999)
Fix MKLDNN multi-input operator quantization problem. (#39593, #39346, #40717)
Fix scale error of conv+activation in MKLDNN quantization scenarios. (#38331)
Fix the bug in MKLDNN quantization of parameterless operators, where the quantization of subsequent operators was handled differently. (#39342)
Fix a data type related issue in MKLDNN cpu_bfloat16_placement_pass. (#38702)
Fix a split operator execution issue in MKLDNN bfloat16 inference. (#39548)
Fix the bug with MKLDNN matmul_v2 operator not supporting 6 dimensions. (#36342, #38665)
Fix MKLDNN DeviceContext error in MKLDNN matmul_v2_transpose_reshape. (#38554)
Fix incorrectly calculated results for segmentation models in MKLDNN inference scenarios. (#37310)
Fix MKLDNN bfloat16 placement operator list and add the missing operator. (#36291)
Fix format bugs of MKLDNN operators, including: FC, conv_transpose, 6-dimensional Tensor error reporting, and wrong output format of conv for NHWC input. (#38890, #37344, #37175, #38553, #40049, #39097)
Fix errors in MKLDNN multi-threaded inference scenarios caused by the cache mechanism. (#36290, #35884)
Fix MKLDNN quantization model accuracy anomaly caused by matmul and FC. (#38023, #37618)
Fix the abnormal quantization model accuracy issue in MKLDNN quantization conversion scripts caused by missing passes. (#37619, #40542, #38912)
Fix the crash bug in the MKLDNN conv op caused by a data type mismatch. (#38133)
Fix an issue where some MKLDNN ops need to change back to the original layout after modifying the layout. (#39422)
Fix a Python API error report in the Ascend 910 inference scenario, caused by a conflict with the Ascend software stack because the GIL lock was not released. (#38605)
5. Environment Adaptation¶
Compile and Install¶
From version 2.3.0, PaddlePaddle has adjusted and upgraded the types of GPU architectures supported by the framework. (For more information, please refer to: GPU architectures supported by PaddlePaddle)
Notes:
PIP source installation means downloading the installation package and dependency libraries from the official PIP website, using pip install paddlepaddle or pip install paddlepaddle-gpu. This supports fewer architecture types and has a lighter installation package, and only one CUDA version of the installation package is provided (compared with the BOS source).
Prior to version 2.3, the PIP source installer (CUDA10.2) supports the following GPU architectures: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, and 7.5.
From version 2.3 onward, the PIP source installer (CUDA11.0) supports the following GPU architectures: 6.0, 6.1, 7.0, 7.5, and 8.0.
The BOS source is a way to download the installation package and dependency libraries from the official PaddlePaddle website, and it supports more GPU architectures. The download source is hosted in China, which makes it much faster there (compared with the PIP source, it supports more kinds of architectures and provides installation packages for multiple CUDA versions).
Prior to version 2.3, the GPU architectures supported by the BOS source installer on the PaddlePaddle website:
CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
CUDA11: 5.2, 6.0, 6.1, 7.0, 7.5, 8.0
From version 2.3 onward, the GPU architectures supported by the BOS source installer on the PaddlePaddle website:
CUDA10: 3.5, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
CUDA11: 3.5, 5.0, 6.0, 6.1, 7.0, 7.5, 8.0
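The wheel coverage above can be summarized in a small lookup. This is an illustrative sketch only: the sets are transcribed from the CUDA11 lists in this note, and wheel_for is our own helper name, not a PaddlePaddle API.

```python
# Compute capabilities covered by each PaddlePaddle 2.3 CUDA11 wheel,
# transcribed from the architecture lists above.
PIP_CUDA11 = {"6.0", "6.1", "7.0", "7.5", "8.0"}
BOS_CUDA11 = {"3.5", "5.0", "6.0", "6.1", "7.0", "7.5", "8.0"}

def wheel_for(compute_cap: str) -> str:
    """Return which 2.3 CUDA11 wheel covers a GPU of this compute capability."""
    if compute_cap in PIP_CUDA11:
        return "pip"          # covered by the plain PIP wheel
    if compute_cap in BOS_CUDA11:
        return "bos"          # only the BOS source wheel covers it
    return "unsupported"

print(wheel_for("7.5"))  # pip
print(wheel_for("3.5"))  # bos (3.5 is no longer in the 2.3 PIP wheel)
print(wheel_for("2.0"))  # unsupported
```

For example, a Kepler-era card (compute capability 3.5) must fall back to a BOS source wheel under 2.3, while anything from Pascal (6.0) up is covered by the PIP wheel.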
Support Python 3.10. Fix compilation bugs caused by some Python C API changes on Windows. (#41180)
The Windows platform supports the compilation through Visual Studio 2019. (#38719)
Eliminate various warnings when compiling on the Windows platform. (#38034, #37890, #37442, #37439, #36857)
Fix jetson compilation issues introduced by the underlying data structure upgrade. (#39669, #39441)
New Hardware Backend Extension¶
Custom device support: provide a plugin way to extend PaddlePaddle hardware backends. With this function, developers do not need to modify PaddlePaddle code for specific hardware; they simply implement the standard interface and compile it into a dynamic link library that PaddlePaddle can call as a plugin. This reduces the development effort of adding a new hardware backend to PaddlePaddle. Currently it supports custom Runtime and custom Kernel.
Support Huawei NPU chip (Ascend 910) training/inference. Support ResNet50, YoloV3, BERT, Transformer and many other models. Support static graph + dynamic graph and auto mixed precision training. Support single-card, multi-card, and multi-machine distributed training.
Support Graphcore IPU chip (including IPU Mk2 GC200 and Bow IPU) training/inference. Support ResNet50, BERT and other models. Support static graph training. Support single-card, multi-card, and multi-machine distributed training.
Support Cambricon MLU chip (MLU370x4) training/inference. Support models such as ResNet50. Support static graph + dynamic graph training. Support auto mixed precision training. Support single-card, multi-card, and multi-machine distributed training.
Support KUNLUNXIN 2 chips (KUNLUNXIN AI acceleration cards R200, R300) training/inference. Support ResNet50, YoloV3, OCR-DB, SSD, MobileNetV3, UNet, BERT, Transformer, GPT-2, Wide&Deep, and DeepFM. Support static graph + dynamic graph training. Support auto mixed precision training. Support single-card, multi-card, and multi-machine distributed training.
Thanks to our Contributors¶
This release contains contributions from the project core team as well as:
Adam Osewski, Allen Guo, arlesniak, chenenquan, chenyanlann, fengkuangxiaxia, fuqianya, fwenguang, guguguzi, helen88, houj04, Jacek Czaja, jakpiase, jianghaicheng, joanna.wozna.intel, joeqiao12, Leo Chen, Leo Guo, LifAngyU, lidanqing, Liyulingyue, Matsumoto GAO, maxhuiy, MingXu Huang, Nyakku Shigure, piotrekobi, piotrekobiIntel, QingshuChen, qipengh, Skr Bang, Sylwester Fraczek, Sławomir Siwek, taixiurong, tanzhipeng, Tomasz Socha, TTerror, Webbley, yaozhixin, ykkk2333, yujun, Zhangjingyu06, zhangxiaoci, zhangyikun02, zhangyk0314, zlsh80826, zn, Zuza