3.1 Release Note

PaddlePaddle framework version 3.1 further optimizes and polishes its core automatic parallelism capability, improving both usability and performance. It also adds support for FP8 low-precision training, speeding up large language model training by 10-20%. The hardware extension mechanism has been improved, reducing the cost of adapting CUDA-like hardware: users only need to register the kernels. In addition, the framework's basic capabilities have been strengthened to improve stability. The key updated features are as follows:

  • Automatic Parallel Architecture: The automatic parallel architecture has been further refined to improve the usability of the core auto-parallel mechanism and the performance of dynamic graphs. The core mechanism now includes additional operator sharding derivation rules, support for sharding the same dimension of a distributed tensor across multiple mesh dimensions, and support for dynamic-graph parallel strategies (PP, CP, SEP, TP-CONV), among others. Meanwhile, the dynamic-graph auto-parallel system has been systematically optimized for performance, reaching essentially the same performance as manual parallelism on models such as Llama2, Qwen, and Baichuan.

  • Low-precision training: Based on the blockwise FP8 GEMM operator, supports low-precision training with accuracy comparable to BF16 while speeding up large model training by 10-20%.

  • Heterogeneous multi-core adaptation: Provides a kernel reuse mechanism for CUDA-like hardware; users only need to register the kernels to use them.

  • Framework stability enhancement: Fixed operator computation errors for 0-size tensors and large-dimension inputs.

1. User experience

This release focuses on API enhancements, bug fixes, and improvements aimed at a better user experience and improved API usability. The paddle.randn_like API has been added, multiple functional defects in existing APIs have been fixed, and support for complex types and 0-size Tensors has been enhanced. Documentation and code have also been updated and polished to improve overall accuracy and professionalism.

New Features

  • Added paddle.randn_like API. #72492
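
A minimal usage sketch for the new API (assuming the usual *_like convention of matching the input's shape, with an optional dtype override as in paddle.zeros_like; verify against the paddle.randn_like documentation):

```python
import paddle

x = paddle.ones([2, 3], dtype="float32")

# Standard-normal samples with the same shape as x.
y = paddle.randn_like(x)

# The dtype override is assumed to work as in other *_like APIs.
z = paddle.randn_like(x, dtype="float64")

print(y.shape, z.dtype)  # [2, 3] paddle.float64
```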

Bug Fixes

  • Fixed the issue of inconsistent input and output types in the tensordot API. #72139

  • Fixed the issue where the output of the atleast API was a Tensor list. #73102

  • Fixed the issue with the nonzero API. #72003

  • Fixed the memory leak issue in dualpipev. #72070

  • Fixed the overflow issue in softmax calculation. #71935

  • Fixed the shape checking issue in take_along_axis when broadcast=False. #72436

  • Fixed the incorrect handling of NaN input in the maximum and minimum functions (see the example after this list). #71933

  • Fixed the issue with visit_type. #72782

  • Fixed the int32 out-of-bounds issue in gather_scatter_functor. #72905

  • Fixed the inplace implementation of Bernoulli in PaddlePaddle. #73271

  • Fixed issues with moe_permute and moe_unpermute. #73365

  • Fixed the syntax checking issue of ast.parse for pyi files. #71872

  • Fixed the issue of complex division. #73331

  • Fixed issues related to TensorRT integration. #72302, #72278
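
As a quick illustration of the expected semantics behind the maximum/minimum NaN fix above, the sketch below shows the documented behavior of these APIs (NaN propagates for maximum/minimum, while fmax/fmin ignore NaN); it describes the intended semantics, not the internals of the fix:

```python
import paddle

x = paddle.to_tensor([1.0, float("nan"), 3.0])
y = paddle.to_tensor([2.0, 2.0, float("nan")])

# maximum/minimum propagate NaN: any NaN operand yields NaN.
print(paddle.maximum(x, y))  # [2. , nan, nan]
print(paddle.minimum(x, y))  # [1. , nan, nan]

# fmax/fmin ignore NaN when the other operand is a valid number.
print(paddle.fmax(x, y))     # [2., 2., 3.]
print(paddle.fmin(x, y))     # [1., 2., 3.]
```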

Improvements

Docs

  • Fixed errors in the documentation, improving its usability and user experience. #72549, #73036

2. Execution architecture

Supports FP8 matrix operations to improve model training efficiency, enhances multiple models to improve stability, and provides a C_ops-style interface for calling backward operators, facilitating memory optimization and functional experimentation.

Deprecations

  • Code cleanup: Removed Python 3.8 support declarations and completed the related code cleanup, dependency streamlining, and syntax modernization to improve maintainability and compatibility. #71815, #72802, #72856, #72854, #72855, #72873, #72870, #72868, #72891

Others

  • Others: Added kernel support for FP16/BF16 data types on CPU, and optimized error handling and tolerance configuration in the testing module. #71764, #71951, #72944

3. CINN

Optimized compiler performance and enhanced stability.

Performance

  • Support automatic conversion and optimization of Layout in training scenarios. (#71891)

  • Added backend kernel compilation optimizations for operators such as argmin, argmax, and arange. (#71956, #72598)

  • Support for fused optimization of matrix multiplication. (#72846)

  • Optimized the kernel computation performance of some operators. (#72871)

4. Auto Parallel architecture

In version 3.1, we further refined the automatic parallel architecture to improve the usability of automatic parallelism and the performance of dynamic graphs. Specifically, we improved the core auto-parallel mechanism, including adding multiple operator sharding derivation rules, supporting sharding the same dimension of a distributed tensor across multiple mesh dimensions, and supporting dynamic-graph parallel strategies (PP, CP, SEP, TP-CONV), among others. At the same time, we systematically optimized the performance of the dynamic-graph auto-parallel system, achieving performance that is basically on par with manual parallelism on models such as Llama.
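
As an illustration of sharding the same tensor dimension across multiple mesh dimensions, here is a minimal sketch using the public auto-parallel API (ProcessMesh, shard_tensor, Shard); it assumes a 4-GPU job launched with paddle.distributed.launch, and the [Shard(0), Shard(0)] placement corresponds to the capability described in #73233:

```python
# Minimal sketch: shard the same tensor dimension along two mesh dimensions.
# Assumes 4 GPUs, e.g. `python -m paddle.distributed.launch --gpus 0,1,2,3 demo.py`.
import paddle
import paddle.distributed as dist

# A 2x2 mesh with named dimensions "dp" and "mp".
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["dp", "mp"])

x = paddle.randn([8, 16])

# [Shard(0), Shard(0)] splits dimension 0 of x across both "dp" and "mp",
# so each rank holds an 8 / (2 * 2) x 16 = 2 x 16 local shard.
dist_x = dist.shard_tensor(x, mesh, [dist.Shard(0), dist.Shard(0)])

print(dist_x.shape)  # global shape stays [8, 16]; each rank stores a [2, 16] shard
```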

Improvements

  • Support for distributed tensors where the same dimension is partitioned by multiple mesh dimensions. #73233

  • Support for converting automatic parallel communication topology descriptions (ProcessMesh) into manual parallel communication groups. #72052

  • Support send/recv of any serializable Python object. #72098

  • Completed the dynamic-graph parallel strategies:

  • Support for pipeline parallelism strategies 1F1B and VPP scheduling. #72155, #72480, #72179

  • Support for parallel processing of long texts. #73195

  • Support for the parallelism strategy for vision models. #73063, #73039

  • Support automatic parallel communication in the data parallel dimension. #72540

  • Added sharding derivation rules for the following operators:

  • min, min_grad #72269

  • bitwise_or,atan2,fmax,fmin,reciprocal #72310

  • argmin, abs, cosh #72264

  • mean_all, mean_all_grad #72479

  • topk, topk_grad #72499

  • argsort #72388

  • round, mish, elu, selu, celu, stanh, softplus, softshrink, thresholded_relu, logit, nonzero #72312

  • unique ops #72824

  • put_along_axis #72766

  • round_grad, trunc_grad, ceil_grad, floor_grad, poisson_grad #72677

  • log_softmax, cummax, cummin #72720

  • unary #72177

  • unary_grad #72260

  • index_select, index_select_grad #72727

  • roll, roll_grad #72740

  • empty_like #73169

  • roi_align, roi_align_grad #72925

  • expand_as, expand_as_grad #73107

  • fused_gemm_epilogue #73126

  • label_smooth, label_smooth_grad #72845

  • group_norm, group_norm_grad #72946

  • instance_norm, instance_norm_grad #72938

  • batch_norm, sync_batch_norm #72918

  • reduce_any #73175

  • fused_gemm_epilogue_rule #73494

Performance

  • Support for the tensor_fusion optimization strategy and overlap optimization strategy with grouped parallel slicing. #72551, #72902, #73142, #71785

  • Optimize the reshard module to reduce communication overhead (see the reshard sketch after this list). #71969, #73024, #71868

  • Optimize the slicing derivation rule for multiply to reduce communication overhead. #73408

  • Optimize backward communication when the distributed sharding state is Partial, to reduce communication overhead. #73236

  • Communication fusion optimization during gradient update. #72120, #72745

  • Optimize the sharding derivation rule for gelu to reduce communication overhead. #73279

  • Optimize the sharding derivation rule of fused_rms_norm when an input is in the Partial state, to reduce communication and computation overhead. #73054
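
For context, the reshard path mentioned above backs the public paddle.distributed.reshard API; below is a minimal sketch (assuming a 2-GPU launch) of the kind of placement conversion whose communication these PRs reduce:

```python
# Minimal sketch: reshard a distributed tensor from Shard(0) to Replicate.
# Assumes 2 GPUs, e.g. `python -m paddle.distributed.launch --gpus 0,1 demo.py`.
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["dp"])
x = dist.shard_tensor(paddle.arange(8, dtype="float32"), mesh, [dist.Shard(0)])

# Converting Shard(0) -> Replicate gathers the shards onto every rank; this is the
# sort of communication the reshard optimizations above aim to minimize.
x_rep = dist.reshard(x, mesh, [dist.Replicate()])
print(x_rep.shape)  # [8] on every rank
```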

Bug fixes

  • Fixed a communication hang bug in the virtual pipeline parallel strategy on Hopper (H-series) GPUs. #71104, #73470

  • Fixed the bug in save/load. #72023

  • Fixed the bug that the linear_fused_grad_add strategy did not work in dynamic graph mode. #72708

  • Fixed issues where the fused_rms_norm operator failed to run, as well as accuracy bugs. #72663

  • Fixed a bug in the sharding derivation rule for the expand operator. #73154

Others

  • Clean up dead code to facilitate code maintenance. #71814, #72538

  • Added a new API, local_map, for passing distributed tensors to functions written for ordinary tensors (see the sketch after this list). (#71804)

  • Add checks for operator fused_linear_param_grad_add. (#72483)
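
A minimal sketch of local_map usage follows. The argument names and the nesting of out_placements are assumptions based on similar APIs; the official paddle.distributed.local_map documentation is authoritative. A 2-GPU launch is assumed:

```python
# Hypothetical sketch of local_map: apply a plain-tensor function to the local
# shards of distributed tensors. The exact signature (out_placements nesting, etc.)
# is an assumption; consult the paddle.distributed.local_map docs.
import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

def local_gelu(t):
    # Written for ordinary tensors; knows nothing about distribution.
    return paddle.nn.functional.gelu(t)

x = dist.shard_tensor(paddle.randn([8, 4]), mesh, [dist.Shard(0)])

# local_map runs local_gelu on each rank's shard and re-wraps the result as a
# distributed tensor with the declared output placement (format assumed below).
wrapped = dist.local_map(local_gelu, out_placements=[[dist.Shard(0)]])
y = wrapped(x)
print(y.shape)  # [8, 4]
```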

5. Operator Mechanism

New Features

  • Gradient and automatic differentiation optimization: Initially supports double-grad (second-order gradient) computation for the put_along_axis and repeat_interleave operations, improves the numerical stability of complex operators in automatic differentiation scenarios, and implements operator decomposition for the masked_fill operation. #72789, #73056, #73225

  • Operator mechanism extension: Added custom __radd__ and __rmul__ support, enhancing the framework's ability to overload reflected (right-hand) operators. #73119

  • FP8 Module Support and Operator Development: Added support for FP8 block quantization GEMM, introduced multiple fused operators, and provided efficient operator-level implementations for Mixture of Experts (MoE) models, enhancing training and inference performance. #73228, #73285, #73133, #73364, #73520, #73531
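
As a rough illustration of what "blockwise" FP8 quantization refers to, the sketch below emulates per-tile scales in plain float32 Paddle ops; it is a numerical stand-in only (integer rounding is a crude proxy for FP8 e4m3 rounding) and does not call the fused FP8 GEMM kernels added by the PRs above:

```python
import paddle

FP8_E4M3_MAX = 448.0  # largest magnitude representable in float8 e4m3

def blockwise_quant_sim(x, block=128):
    """Emulate blockwise quantization: one scale per (block x block) tile.

    Assumes a 2-D tensor whose dimensions are multiples of `block`. The real
    kernels quantize to float8 e4m3 and run a fused GEMM; rounding to integers
    here is only a crude stand-in for the precision loss.
    """
    m, n = x.shape
    tiles = x.reshape([m // block, block, n // block, block])
    amax = tiles.abs().max(axis=[1, 3], keepdim=True)     # per-tile max magnitude
    scale = amax.clip(min=1e-12) / FP8_E4M3_MAX           # per-tile scale factor
    q = (tiles / scale).round().clip(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return (q * scale).reshape([m, n]), scale             # dequantized view + scales

x = paddle.randn([256, 256])
w = paddle.randn([256, 256])
x_dq, _ = blockwise_quant_sim(x)
w_dq, _ = blockwise_quant_sim(w)
out = paddle.matmul(x_dq, w_dq)  # stands in for the fused blockwise FP8 GEMM
```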

Bug Fixes

  • Gradient and automatic differentiation stability improvement: Fixed several errors in backward operator gradient computation, enhancing numerical stability and functional correctness in automatic differentiation scenarios. #71716, #72299, #72358, #73037, #73140, #73185

  • Numerical accuracy and overflow protection: Addresses issues such as numerical overflow, loss of precision, and large tensor overflow, ensuring the reliability of low-precision computations and large tensor operations. #72584, #72608, #72681, #72639, #73245, #73359, #72456

  • Operator logic and framework alignment: Aligned operator computation logic, fixed issues such as abnormal operator inputs, and added checks to ensure the correctness of framework functionality, among other important fixes. #72282, #71863, #72650, #72843, #73070, #73141, #73203, #73350, #73440, #73539, #73339

  • CUDA kernel and hardware adaptation optimization: Supports NVIDIA SM90 architecture, fixes issues such as overflow, removes redundant CUDA error checks, and enhances GPU computing efficiency and adaptability to new hardware. #72507, #72849, #72959, #73130, #73489

Improvements

  • Added an int64_t version of the fast divide-and-modulo implementation, improving computational performance and numerical stability in large-integer scenarios. #72530

  • Optimize the strided tensor copy kernel to improve data-copy efficiency for non-contiguous memory layouts. #72662

  • Unify the usage of the quantization API in dynamic and static graph modes, simplifying the development process of quantized models. #73100

Performance

  • Optimize the decomposition performance of the gelu operator to enhance computational efficiency. #72812

6. Performance

New Features

  • The acc_steps of sharding_overlap is configurable. #72395

Bug fixes

  • Fixed the inplace issue of operator c_softmax_with_cross_entropy_grad. #72366

Improvements

  • Performance optimization and acceleration: Enabled cuDNN support for depthwise convolution, improving convolution efficiency. Updated the pooling operation strategy and optimized permute memory operations to reduce CUDA memory usage. Optimized printing speed to accelerate debugging and log output. #71796, #73442, #73563

  • Feature Enhancements and Operational Support: Added the masked_fill operation and Boolean index optimization to enhance tensor masking processing capabilities. Implemented the index_elementwise operation to support index-based element-level operations. Added pooling and reshape execution strategies to enhance the flexibility of model operations. #72788, #72942

  • Bug fixes and stability improvements: Fixed a partial state support issue with fused_rms_norm in SPMD parallel mode. Corrected index errors in output dimension calculation and IndexGetStride during the slice operation to ensure computational correctness. #72118, #72223, #73184, #73237, #73054

  • Faster Guard adaptation: Reduce SOT end-to-end overhead. #71900, #71979, #72081, #72327, #72564, #72823

  • Performance optimization and acceleration: Optimize operator scheduling strategy. Upgrade Flash Attention to version v3 to reduce computational overhead. Fix model performance bottlenecks and improve inference and training speed. #71937, #71828, #71461, #72039, #72228, #72225, #72623, #72666, #73147, #73393

  • Parallel computing: Optimize the mesh resharding strategy in automatic parallelism, fuse communication and optimization logic in the sharding stages, enhance the stability of distributed training, and reduce its communication overhead. #71969, #72120, #73279, #73406

  • Feature enhancements and fixes: Optimized operator indexing and kernel scheduling logic. #72625, #72741, #73082, #73501

  • Model and operation support: Supports depthwise convolution in NHWC format, adapting to more hardware memory layouts. #72121

7. Custom Device

Optimized the hardware adaptation mechanism and provided a solution for reusing CUDA kernels on CUDA-like hardware.

New Features

  • Based on the CustomDevice integration solution, we introduce a low-cost support solution for CUDA-like hardware backends. These backends can be plugged into Paddle in a modular manner, allowing cost-effective reuse of the majority of CUDA kernels from the NVIDIA ecosystem within Paddle. They can also be decoupled from feature upgrades within the Paddle framework, significantly reducing the cost of hardware backend integration and iteration, increasing vendors' willingness to adopt, and fostering a positive collaborative ecosystem between Paddle and hardware manufacturers. #72604, #72668, #72758, #72865, #72910, #73033, #73145, #73281, #73079

  • Enhanced XPU fundamental capabilities: added kernels, expanded data type support, and supplemented missing code branches in the XPU environment. #71424, #71809, #71594, #71779, #71756, #71573, #71883, #71954, #71931, #72280, #72361, #72406, #72528, #72752, #72852, #72982, #73357, #73414, #73464, #73234, #71776

  • Extended data type support for DCU kernels. #73129

8. Environment Adaptation

We have optimized the stability and cross-platform compatibility of the framework, and resolved issues related to compilation and installation failures on various platforms. We have upgraded key dependencies such as CUDA, further optimized the CI/CD process, improved the build speed, and enhanced the overall stability of the system. Additionally, we have ceased maintenance of compilation and installation in the Python 3.8 environment.

Bug fixes

  • Fixed compilation errors when using clang17 to compile third-party libraries. #72524

  • Fixed compilation issues when using CUDA 12.9. #72808, #72841, #72978, #73360

  • Fixed compilation issues when using GCC 13.3. #73144

  • Fixed compilation issues when WITH_PIP_CUDA_LIBRARIES=ON. #72907

  • Fixed compilation issues when WITH_NVSHMEM=ON. #73368

Improvements

  • Avoid copying temporary files generated during the compilation of custom operators. #73196

  • Warning message optimization. #72877

Deprecations

  • Discontinue support for compilation in Python 3.8 environment. #72827

9. List of contributors

0x3878f, A-nnonymous, AndSonder, ApricityXX, aquagull, author, baoqiwen, BeingGod, blacksheep-Aristotle, BoShen5, bukejiyu, cangtianhuang, carryyu, chang-wenbin, changeyoung98, chen2016013, ckl117, co63oc, cqulilujia, crashbussy, cszdrg, Cutelemon6, cyy536, DanielSun11, danleifeng, datutu-L, deepllz, Dmovic, DrRyanHuang, dynamicheart, Eddie-Wang1120, eggman-1024, emmanuel-ferdman, Enigmatisms, enkilee, fangfangssj, feixi21, FeixLiu, ForFishes, Function-Samuel, ggggxm, GITD245, Glencsa, GoldenStain, gongshaotian, gouzil, gzy19990617, hanlintang, Hongqing-work, houj04, huangjiyi, hxzd5568, HydrogenSulfate, jzhang533, LCStayingdullCircuit, leon062112, lifulll, linkk08, LittleHeroZZZX, liufengwei0103, Liujie0926, liuruyan, lixinqi, LiYuRio, lizexu123, lizhenyun01, lj970926, lshpku, megemini, mikethegoblin, ming1753, mzj104, NKNaN, ooooo-create, pesionzhao, phlrain, pkuzyc, PolaKuma, Qin-sx, RichardWooSJTU, risemeup1, runzhech, RuohengMa, sasaya123, shanjiang7, SigureMo, sneaxiy, swgu98, SylarTiaNII, tianhaodongbd, tianshuo78520a, timminator, tizhou86, umiswing, waliwali777, wanghuancoder, Waynezee, Wennie396, xiaoguoguo626807, XieYunshen, Xing-lil, xkkkkkk23, Xreki, xuxinyi389, Yeenyeong, yongqiangma, YqGe585, yuanlehome, YuanRisheng, yulangz, yuwu46, zeroRains, zhangbo9674, zhanghonggeng, zhangting2020, ZhangX-21, zhangyk0314, zhangyuqin1998, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhupengyang, zrr1999, zty-king, zyfncg