3.0 Release Note

Declaration: This document is translated by Baidu Translate

As China’s first independently developed industrial-grade deep learning platform, PaddlePaddle has consistently followed an open-source path and supported the intelligent upgrading of industry. PaddlePaddle framework 3.0 retains the defining traits of the 2.0 series, namely unified dynamic and static graphs and integrated training and inference, and adds breakthroughs in automatic parallelism, the neural network compiler, and high-order automatic differentiation. Together these provide strong support for technological innovation and industrial applications in the era of large models, and give developers a one-stop, high-performance deep learning development experience. Whether for cutting-edge algorithm research or for deploying industrial-grade large models, PaddlePaddle framework 3.0 aims to be the preferred tool for developers. The key features are as follows:

Unified Static and Dynamic Automatic Parallelism: This feature significantly reduces the cost of industrial development and training. Users only need to add a small number of tensor sharding annotations to single-device code; the framework then automatically derives the distributed sharding information and inserts communication operators to ensure logical correctness. Based on the model structure and cluster information, and combined with memory-level and scheduling-level optimizations, PaddlePaddle can also automatically search for the most efficient distributed parallel strategy. This greatly reduces the development cost of hybrid parallel training and lets developers focus on model and algorithm innovation. The automatic parallel architecture has been verified and refined in depth to better support the pre-training and fine-tuning workflow for common large model scenarios such as text-only dense models, text-only sparse (MoE) models, and multi-modal understanding models. This release improves the sharding derivation rules of operators and supports converting automatic parallel training parameters into manual parallel parameters for downstream inference, making automatic parallelism fully usable and helping users reduce the development cost of large model parallel programs. To further simplify distributed development, a new paddle.distributed.parallel interface is introduced. Built on the distributed tensor annotation syntax, it lets users configure common parallel strategies such as data parallelism, model parallelism, and pipeline parallelism non-intrusively, outside the model definition code. Furthermore, the static graph automatic parallel architecture has been comprehensively upgraded on top of PIR: the underlying basic components, core modules, parallel strategies, and performance optimization strategies are all implemented uniformly on the extended PIR DistDialect. This further improves the consistency of automatic parallelism between dynamic and static graphs, and performance on the Llama series models matches or even surpasses manual parallelism.
Integrated Training and Inference for Large Models: Since version 2.0, PaddlePaddle has followed the design philosophy of “unified dynamic and static graphs, integrated training and inference,” and version 3.0 continues to do so. Thanks to the unified architecture and interface design, PaddlePaddle fully supports both dynamic and static graph modes and offers strong whole-graph export capabilities: the success rate of whole-graph dynamic-to-static export in PaddlePaddle is as high as 95%, compared with 62% for PyTorch. “Integrated training and inference” means that training and inference code, especially the model definition code, can be reused within the same framework. Once a model has been developed and trained, only a small amount of additional work is needed to deploy it for inference. Reusing training and inference capabilities in this way provides a unified development experience and high training efficiency across the entire large model workflow, and dynamic-to-static conversion connects training and inference tasks seamlessly. Multiple mainstream large models are supported, and the full-size DeepSeek-R1 model can be deployed on a single machine with doubled throughput.
High-Order Automatic Differentiation for Scientific Computing: PaddlePaddle framework 3.0 provides high-order automatic differentiation, compilation optimization, and distributed training capabilities for scientific computing. In experiments on 41 different equations from NVIDIA Modulus, PaddlePaddle solved differential equations on average 115% faster than PyTorch with compiler optimization enabled. PaddlePaddle also provides the PaddleScience toolkit for solving general mathematical problems and the PaddleHelix toolkit focused on biological computing. In addition, PaddlePaddle framework 3.0 natively supports complex-number computation, which is important for data feature analysis in scenarios such as weather forecasting and the aerodynamic analysis of automobiles and aircraft.
Neural Network Compiler: This feature significantly reduces the cost of performance optimization. PaddlePaddle’s compiler is designed as an integral part of the framework and supports efficient training and variable-shape inference for a range of models, including generative models and scientific computing models, striking a good balance between computational flexibility and high performance. With the CINN compiler enabled, over 60% of models show significant performance improvements, with an average gain of 27.4%. In this version, the CINN neural network compiler has been comprehensively improved in both completeness and performance. We optimized the compiler’s front end and back end, including adding an automatic Re-Compute mechanism for backward computation graphs, optimizing front-end Pass performance, upgrading the symbol derivation mechanism, optimizing operator fusion strategies, and enhancing the back-end Schedule strategy and subscript expression simplification. We also investigated and fixed a large number of correctness and performance issues, systematically improving the compiler’s general optimization capabilities.
Heterogeneous Multi-Chip Adaptation: One of PaddlePaddle’s key features is its ability to adapt to heterogeneous chips and fully exploit the potential of the hardware. For integration, PaddlePaddle provides simple and efficient abstract interfaces and a basic operator system, which reduces adaptation cost. For execution, it optimizes the scheduling and storage sharing mechanisms to improve scheduling efficiency. At the operator kernel level, it offers a compiler-based automatic fusion and tuning solution to improve end-to-end performance. PaddlePaddle has also built research and development infrastructure for new hardware vendors, including code integration, continuous integration, and model regression testing. These mechanisms ensure that new hardware is included in PaddlePaddle’s regular release process, so users can install and try it directly without compiling from source. This comprehensive functionality and low-cost integration mechanism has attracted hardware vendors to contribute a total of 4001 pull requests (PRs) comprising 26584 commits.

In addition to the core features above, this release introduces a highly extensible intermediate representation. To improve the extensibility of the PaddlePaddle framework, we developed the Highly Extensible Intermediate Representation (PIR), which systematically abstracts the underlying core concepts and provides flexible and efficient components. As infrastructure, PIR supports technologies such as dynamic-to-static conversion, automatic differentiation, automatic parallelism, composite operators, and graph optimization, and is widely used in distributed training, model compression, and inference deployment. The Declarative Rewrite Rule (DRR) mechanism provided by PIR can reduce the development cost of a Pass by 60%. PIR has been verified in all scenarios and is enabled by default; it supports one-click dynamic-to-static conversion and ensures excellent performance and good extensibility of the framework.

The existing functionality of framework 2.0 has also been continuously improved, and the new features bring significant gains in user experience, performance, ease of secondary development, and hardware adaptability. At the user experience level, this version continues to enrich and enhance the API to cover more scenarios. For large model scenarios, the distributed parallel strategies have been optimized and inference functionality has been enhanced. Compilation and installation have been made considerably easier to use, with the installation method and the versions of dependent packages upgraded together. System security has been comprehensively reinforced, and the product documentation has been thoroughly checked and corrected. A large amount of obsolete code has also been cleaned up to keep the architecture simple.

Incompatible upgrade

PaddlePaddle APIs support implicit type promotion. In the most common computations, such as addition, subtraction, multiplication, and division, the output data type must be determined whenever the two inputs have different data types. Historically, PaddlePaddle supported implicit type promotion only partially, and the actual rules were unclear. In practice this showed up as inconsistencies between dynamic and static graphs, inconsistencies between APIs and overloaded operators, and results that were not commutative. With large models making wide use of mixed bf16/fp16 and fp32 computation, unexpected problems were easy to trigger and hard to locate. Starting from the 3.0 beta version, PaddlePaddle defines clear implicit data type promotion rules that specify in detail the result types for Tensor-Tensor and Tensor-scalar computations. These rules ensure that computation is commutative, that overloaded operators give the same results as the corresponding binary APIs, and that dynamic and static graphs produce consistent results, which better matches user expectations and industry conventions. #60638, #63842, #60011
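The commutativity requirement can be illustrated with a toy model. The sketch below is not PaddlePaddle code: the `promote` function and its width table are hypothetical, and the rule that mixed float16/bfloat16 widens to float32 is an assumption modeled on common framework practice, not a statement of PaddlePaddle's actual promotion table. It only shows what it means for a promotion rule to be independent of operand order.

```python
# Toy model only: `promote` and FLOAT_BITS are illustrative names,
# not part of the PaddlePaddle API.
FLOAT_BITS = {"float16": 16, "bfloat16": 16, "float32": 32, "float64": 64}


def promote(a: str, b: str) -> str:
    """Return the result dtype for a binary op on two floating dtypes."""
    if a == b:
        return a
    # Assumption: float16 and bfloat16 have the same width but different
    # layouts, so neither can hold the other; widen both to float32.
    if {a, b} == {"float16", "bfloat16"}:
        return "float32"
    # Otherwise choose the wider type, regardless of argument order.
    return a if FLOAT_BITS[a] > FLOAT_BITS[b] else b


# The rule is commutative: swapping the operands never changes the result.
dtypes = list(FLOAT_BITS)
assert all(promote(x, y) == promote(y, x) for x in dtypes for y in dtypes)

print(promote("bfloat16", "float32"))  # float32
print(promote("float32", "bfloat16"))  # float32
```

A rule written this way cannot produce the order-dependent results described above, because the result depends only on the unordered pair of input types.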

Discontinued Features

Support for 0-dimensional Tensors has been stable for two versions. This version removes the FLAGS_set_to_1d switch, which in some cases converted a 0-dimensional Tensor into a 1-dimensional Tensor containing a single element. The switch existed to accommodate incorrect usage in some model suites, where a 1-dimensional Tensor with a single element was used to represent a 0-dimensional Tensor. PaddlePaddle now fully distinguishes the semantics of a 0-dimensional Tensor from those of a 1-dimensional Tensor containing a single element; the two are not equivalent. #61227

1. User experience upgrade

New Features

Added PaddlePaddle APIs to expand PaddlePaddle’s functionality. These include paddle.nn.FeatureAlphaDropout, paddle.cartesian_prod, paddle.distributed.to_distributed, paddle.pi, etc. #64881, #65605, #70757, #71030, #69946, #70021, #69613, #68123, #70032
Introduce new Tensor class methods and attributes, along with corresponding unit tests, to enhance the usability of Tensor. #68334, #68681, #69132, #69270, #69256, #69197, #69231, #69222, #69257, #69301, #69361, #69348, #69464, #69542, #69667, #69563, #69796, #69477, #69779, #69724, #69835, #69781, #69982, #69913, #70026, #70013, #69539, #69736, #69841, #70277, #69580, #69599, #69693, #69848, #69751, #70556, #70591, #69673, #70647, #68192, #68511, #68833, #69406, #69480, #69463, #69632, #69473, #68694, #69534, #69820, #70121

API Function Enhancement

Enhanced the functionality of 43 APIs, making existing APIs easier to use and facilitating code conversion. This includes but is not limited to adding API parameters, expanding the data types supported by APIs, and correcting existing unreasonable designs. #65105, #65103, #62975, #64436, #63346, #68079, #67878, #68432, #68677, #69012, #69385, #65032, #64977, #67071, #67298, #66687, #65946, #66170, #66929, #67994, #67947, #68033, #68046, #68294, #68214, #68281, #68390, #68772, #69451, #69252, #69529, #69750, #69827, #69099, #68594, #70090, #70228, #70166, #70389, #70790, #71029, #71283, #71342
PaddlePaddle Python API fully supports type hints. All parameters and return values of Python API have been annotated with type hints for ease of development and use. #65209, #65201, #65190, #65082, #65226, #65076, #65238, #65236, #65247, #65249, #65244, #65272, #65191, #65290, #65255, #65292, #65300, #65301, #65332, #65323, #65326, #65273, #65317, #65354, #65283, #65372, #65337, #65085, #65382, #65381, #65378, #65274, #65380, #65386, #65351, #65284, #65366, #65308, #65375, #65376, #65464, #65197, #65455, #65457, #65487, #65486, #65547, #65504, #65460, #65183, #65454, #65559, #65560, #65570, #65569, #65566, #65620, #65568, #65567, #65660, #65645, #65600, #65532, #65765, #65767, #65770, #65768, #65771, #65772, #65774, #65769, #65773, #65766, #65776, #65775, #65755, #65779, #65777, #65823, #65807, #65821, #65819, #65810, #65808, #65824, #65553, #65818, #65812, #65803, #65865, #65870, #65866, #65844, #65845, #65853, #65874, #65871, #65809, #65867, #65822, #65872, #65873, #65869, #65868, #65849, #65875, #65876, #65843, #65727, #65587, #66006, #66005, #65785, #65784, #65811, #65919, #65838, #65852, #65847, #66014, #65805, #66009, #66012, #65633, #66011, #66010, #66013, #66015, #66016, #66030, #66028, #66029, #66054, #66040, #65993, #66058, #66280, #66037, #66057, #66077, #66051, #65912, #66090, #66189, #66127, #66277, #66119, #66270, #66305, #66306, #66279, #66276, #66295, #66301, #66473, #66384, #66505, #66328, #66394, #66392, #66432, #66575, #66572, #66656, #66475, #66654, #66616, #66694, #66686, #66766, #66749, #66760, #66803, #66770, #66693, #66771, #66792, #66862, #66867, #66684, #66966, #66793, #66987, #66985, #66989, #66639, #66994, #66986, #66993, #67002, #66996, #67001, #66864, #67031, #67089, #67143, #67179, #67178, #67284, #67104, #67079, #67132, #67147, #67204, #67112, #67233, #67366, #67067, #67391, #67428, #67197, #67047, #66890, #67159, #67439, #67555, #67448, #67556, #67469, #67558, #67405, #67644, #67624, #67679, #67677, #67785, #67767, #65319, #65277, 
#67673, #65557, #67527, #66965, #65905, #65657, #66357, #68163
Optimized the error messages of many PaddlePaddle APIs, making the errors more understandable. #67148, #67154, #67546, #67335, #67255, #67099, #67074, #67073, #66957, #67063, #67575, #67608, #67634, #67325, #67429, #67401, #66881, #68492, #67695, #69833, #70398

Bug Fixes

Fixed a bug in paddle.nn.functional.max_unpool1d when the input output_size is a tuple. #65910
Fixed the issue where paddle.base.core.eager.Tensor did not support paddle::DataType. #66765
Fixed an error that occurred during BF16 training when the PIR switch was turned on. #66833
Fixed an issue with the bias of linear layers under pipeline parallelism. #67212
Fixed an error when using the loss value in a condition under pipeline parallelism. #66980
Fixed an error when using paddle.Tensor.item under pipeline parallelism. #67441
Fixed bugs in paddle.einsum in specific scenarios. #67588
Fixed an error in paddle.nn.SyncBatchNorm during gradient computation. #67559
Fixed the issue reported in #69992. #70017
Fixed the issue where paddle.arange produced incorrect results when dealing with large integers. #70188
Fixed incorrect NaN propagation in paddle.max and paddle.min when the input contains NaN values. #70049
Fixed issues with APIs such as paddle.linalg.svd and paddle.linalg.any when handling 0-size Tensor. #70235, #70489, #70047, #70103, #70127, #70098, #70077, #70130, #70254, #70125, #70342, #70369, #71094, #71089, #71185, #70537, #70481
Fixed some issues with type hint annotations and documentation issues. #65429, #65496, #65461, #65542, #65575, #65545, #65609, #65644, #65700, #65697, #65719, #65639, #65742, #65891, #65877, #65895, #66007, #66679, #66680, #66676, #66677, #66884, #67288, #67302, #66978, #67295, #67520, #67421, #67529, #67536, #67618, #67661, #67698, #67800, #67933, #67893, #68108, #67927, #68322, #68341, #68415, #68372, #68559, #68598, #68708, #68780, #68992, #68989, #68895, #69014, #69139, #68996, #69090, #68922, #69333, #69141, #69609, #69652, #69715, #69716, #69934, #70253, #70297, #70252, #70468, #70102, #70546, #70616, #70582, #70635, #70499, #70755, #70935, #71133, #71172, #71238, #71230, #71394

Document optimization

Enhanced several API documents to make them easier to read and understand. #67772, #69895, #65904, #66480, #66974, #67100, #66991, #67287, #67841, #68206, #68305, #68462, #67061, #66503, #68856, #68866, #68768, #69215, #69449, #69396, #69498, #69413, #69404, #69729, #69749, #69266, #69989, #70209, #70128, #70143, #69874, #70242, #70145, #70813, #71046

2. Basic execution architecture

PIR is fully rolled out and enabled by default. It supports one-click dynamic-to-static conversion and ensures excellent performance and good extensibility of the framework.

Bug Fixes

Fixed accuracy issues caused by parameter configuration. #65814
Fixed bugs related to save/load. #65268, #65359, #65373, #65314, #65446, #65476, #66891, #66931, #65978, #67654, #67906, #68723, #71452, #71457, #67819, #68120, #68300, #68315, #68743, #68744, #69585, #71165, #71400
Skip/fix failed unit tests in PIR mode, including scenarios such as Windows and XPU. #65690, #65759, #65730, #65760, #65833, #65834, #65856, #65886, #65899, #65932, #65998, #65953, #65997, #66061, #66111, #66137, #66073, #66203, #66227, #65744, #66234, #67487, #67561, #67584, #67742, #69832, #65885, #66709, #66734, #66959, #67399, #67389, #67230, #67403, #67619, #67662, #67902, #67382, #67430, #67517, #67533, #67573, #67468, #67640, #67667, #67716, #68386, #67234, #67266, #67362, #67631, #68081
Fixed bugs related to dynamic graphs. #65619, #69163, #68862, #68164, #69867
Fixed bugs related to control flow. #65722, #70181
Fixed kernel operation-related bugs, including issues with operation positions and null pointers. #66334, #67931, #70353
Fixed Amp-related bugs. #66778, #67582, #67704, #68655
Fixed CINN-related bugs. #69577, #71101, #71387, #71401
Fixed bugs related to dynamic-to-static conversion. #67617, #67936, #68938, #68734, #69010, #69408, #69461, #69699, #69774, #69803, #69853, #70510, #70830, #70904, #70913, #71040, #71048, #71106, #71201, #71216, #71223, #71296, #71385, #71505, #66934, #71096, #71144, #71430, #71437, #71473, #71412, #65648, #67853, #66543, #68229, #70846, #67532
Fixed other bugs, including issues related to backpropagation gradient calculation, memory copying, and executor errors. #65493, #65678, #65673, #65794, #66358, #66875, #67339, #67465, #67754, #67835, #67892, #67967, #67952, #68036, #68063, #68128, #68151, #68140, #68167, #68200, #68325, #68376, #68539, #68530, #68637, #68639, #68688, #68751, #68806, #68810, #68779, #68811, #68844, #68790, #68870, #68960, #68999, #69036, #69188, #69234, #69375, #69399, #69538, #69603, #69633, #69765, #69768, #69821, #70091, #70123, #70147, #70201, #70198, #69815, #70420, #70377, #70552, #70545, #70595, #70836, #70771, #70922, #70969, #70926, #71117, #71151, #71194, #71234, #71339, #71445, #66350, #66533, #66622, #67721, #67700, #69207, #69615, #69785, #67805

Function optimization

Support save/load. #65296, #65671, #66231, #66185, #66722, #66863, #67057, #68101, #68628, #66359, #68481
Optimize the compilation process of custom operators. #67615, #67659
Support for composite operators. #69121, #69144, #70204, #71098, #71335
Support for CINN compiler execution. #69589, #70115
Support for custom devices. #70909, #71294, #71362, #71010, #71036, #70637, #71085
Execution support for other scenarios. #65050, #65664, #65741, #65786, #65499, #66441, #67668, #68199, #69088, #70199, #70308, #70709, #70937, #71066, #71079, #71121, #71136, #71205

New Features

SOT adapts to Python 3.13 bytecode, supporting static graph conversion (SOT mode) under Python 3.13. #68071, #69126, #69131, #69196, #69232, #69253, #69267, #69412, #69431, #69432, #69436, #69557, #69567, #69700, #69707, #69735, #69738, #69744, #69753, #69887, #69920, #69950, #70319, #70927
Support for custom devices. #68061, #68836, #70366, #70549
Adapted PIR forward execution. #65335
Support save/load. #67910
Adapted to pylayer. #70335
Adapt lazy_init. #67379, #67467
Optimize the logic under PIR. #67961
Support for other scenarios. #68344, #70071, #70291, #70752, #70812, #71033

Changes unrelated to ordinary users

Improved the SOT debugging experience and development efficiency. #67560, #69072, #69837, #70134, #70387, #70740, #71118, #71268, #71275, #71458, #71460
Other changes unrelated to user usage. #65393, #65795, #65799, #65911, #65977, #66982, #67563, #68761, #68909, #69130, #69233, #69956, #71142

Security Issues

Introduced approval rules for IR (Intermediate Representation) save/load operations to enhance security and governance during model serialization. #65737

Others

Sparse API migration. #66139, #66319, #66866
PIR function extension. #67966, #69909
Migrate file locations. #66477, #66824, #67592
Log addition. #68382, #70506
Enable PIR by default. #68278
Header file organization. #68422, #68471
Compilation optimization. #67831, #67821, #68717
Manage related tests with guards. #67816, #67827, #67989
Fixed spelling errors. #70784, #70787
Check for CUDA errors. #70399

Developer

Fix issues in dynamic-to-static conversion. Improve overall graph conversion success rate and optimize inference export experience. #65291, #66153, #66379, #66557, #67021, #67482, #67495, #67981, #68030, #68078, #68328, #68442, #68679, #68850, #68892, #68991, #69043, #69097, #69210, #69295, #69428, #69518, #69642, #69940, #70118, #70169, #70218, #70287, #70412, #71099, #71156, #71193, #71336, #71463, #71476, #71503
Inplace strategy upgrade. #65491
Control flow related development. #67251
Add environment variables. #68467
Support sparse operator operations. #67111
Other execution support development, including logic optimization, version adaptation, and adding unit tests. #69241, #69806, #70768, #66829, #67110, #67442, #67041, #67452, #69061, #69307, #68669, #69829, #70003, #70443, #70364, #71495

Performance optimization

Optimize dynamic shape handling in static graph conversion, reducing graph construction iterations and compilation time. #65235, #65477, #65517, #65882, #66346, #66746, #67786, #67876, #68113, #68302, #68337, #68616, #69354, #70009, #70877
End-to-end performance optimization for SOT, minimizing subgraph fragmentation, reducing scheduling overhead, and improving static training efficiency. #67591, #67746, #67823, #67890, #67921, #68031, #68153, #68729, #69249, #69263, #69300, #69313, #69325, #69353, #69411, #69506, #69672, #69746, #69834, #69836, #69852, #69975, #70151, #70293, #70405, #70851, #71039, #71254, #71295, #71298, #71346, #71377, #71407
Optimize the performance of dynamic shape scenarios. #68491, #68629
Accelerate the execution speed of PIR executor. #69513
Optimize PIR saving and loading performance. #69683
Optimize for device. #69676
Clean up redundant input and output information. #66278

Discontinued Features

Remove outdated test cases. #66269, #66690, #67505, #67464, #68400, #68178, #68194
Clean up obsolete flags and configurations. #69124, #69176, #69274, #68384
Removed old APIs. #66032, #67303
Cleaned up redundant PIR strategies and unit tests. #66366, #70534, #68444, #70599, #68801, #66303, #67854, #70795
Removed discontinued unit tests and APIs related to dynamic-to-static conversion. #66421, #68251, #68252, #68253, #68254, #68409, #70569, #71279
Removed discontinued unit tests related to automatic parallelism. #67857, #67862, #67995, #68012, #68013, #67798

3. Compiler architecture

The CINN compiler has been comprehensively improved in both completeness and performance. In this version, we optimized the compiler’s front end and back end, including adding an automatic Re-Compute mechanism for backward computation graphs, optimizing front-end Pass performance, upgrading the symbol derivation mechanism, optimizing operator fusion strategies, and enhancing the back-end Schedule strategy and subscript expression simplification. We also investigated and fixed a large number of correctness and performance issues, systematically improving the compiler’s general optimization capabilities. With the CINN compiler enabled, over 60% of the PaddleX series models show significant performance improvements compared with dynamic graph mode.

New Features

New hardware backend support: Added support for two new backends, HIP and SYCL. (#65146, #65329, #69554, #71204, #65438, #66476, #66620, #67813)
Added support for manually setting value ranges, equality constraints, and other information for symbolic dimensions in inference scenarios. (#67628, #67384)

Function optimization

Improved error message output to enhance the development and debugging experience. (#67738, #68769, #71076)
Added support for the Welford algorithm, which preserves both the performance and the accuracy of BatchNorm-related operator kernels. (#71184, #71057)
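For readers unfamiliar with it, Welford's algorithm computes the mean and variance in a single pass while avoiding the cancellation error of the naive sum-of-squares formula. The following is a minimal pure-Python sketch of the scalar algorithm. The function name is illustrative, and this is not CINN's kernel implementation, which operates on tensors on the device.

```python
def welford_mean_var(values):
    """Single-pass mean and population variance using Welford's update."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean          # deviation from the old mean
        mean += delta / count     # update the running mean
        m2 += delta * (x - mean)  # uses both the old and the new mean
    if count == 0:
        raise ValueError("at least one value is required")
    # BatchNorm normalizes with the biased (population) variance: m2 / count.
    return mean, m2 / count


print(welford_mean_var([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.25)
```

Because each update works with deviations from the running mean instead of raw squares, the result stays accurate even when the values are large relative to their spread.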

Performance optimization

Added new backend optimization strategies such as GridReduce, loop merging, Transpose tuning, and automatic vectorization, significantly improving Kernel performance across dimension spaces and hardware configurations. (#67236, #68897, #69409, #65336, #66419, #68338, #68364, #71087, #68019, #68122, #65187, #66742, #67083, #68667, #68750, #69376, #69350, #69740, #68918, #70092, #69607, #69794, #70258, #70547, #70581, #70649, #69732, #70786, #70942, #71014, #71263, #71249, #71340, #71301, #71380)
Optimize operator fusion strategies, upgrading various strategies including horizontal fusion, multi-downstream fusion, Reshape alignment fusion, etc., to further enhance the fusion capabilities of operators and improve end-to-end optimization performance. (#66034, #67829, #68171, #69478, #69691, #70665, #71103, #70873)
The simplification capability of backend subscript expressions has been upgraded, supporting the simplification of complex expressions with dynamic and static dimensions, significantly reducing the subscript computation overhead in the generated backend Kernel. (#68011, #68617, #68624, #68685, #68220, #68720, #68753, #68986, #68987, #69071, #69164, #69282, #69522, #69857, #70208, #70355, #70427, #70450, #68737, #70500, #70953, #70933, #71026, #70456, #70257, #70461, #70142, #71018, #71278)
Added an automatic Re-Compute mechanism for backward computation graphs, which effectively reduces memory usage during model training and improves performance. (#69342, #70255, #68241, #69954, #70832)
Optimize the backend Host and Device code compilation process to reduce compilation time and improve the processing performance of branches in the Broadcast scenario. (#65669, #65916, #66109, #65611, #65990, #66088, #66207, #66537, #66768, #70685, #71410, #66062)
Improved and upgraded the mechanisms for symbol derivation, simplification, and caching in dynamic dimensions, added symbol derivation interface implementations for all conventional operators (580+), and provided more constraint information for Kernel compilation. (#65343, #66582, #65500, #65591, #66637, #68208, #68056, #68015, #68096, #68236, #68973, #68967, #69133, #68550, #68882, #69005, #69911, #70376, #71153, #66644, #66650, #66642, #66729, #66838, #66762, #66580, #66612, #66625, #66643, #66837, #66946, #67018, #67049, #66956, #67008, #66930, #66877, #66896, #67120, #67117, #67098, #67136, #67294, #67327, #66827, #67201, #66892, #67377, #66619, #67037, #67412, #67394, #67374, #67418, #67348, #67337, #67390, #67407, #67491, #67422, #67461, #67458, #67486, #67490, #67462, #67364, #67435, #67665, #67426, #67507, #67730, #67776, #67806, #67803, #67788, #67705, #67814, #67858, #67751, #67875, #67663, #67434, #67818, #68180, #68547, #68548, #68670, #68964, #68929, #68907, #68917, #68984, #68644, #69167, #68975, #68947, #68978, #68980, #68979, #69329, #69055, #69331, #69414, #69335, #69017, #69344, #69069, #69698, #69919, #69964, #70337, #70282, #70741, #70818, #71031, #70541, #66609, #66889, #66633, #66735, #66935, #66627, #66730, #67210, #67115, #67275, #67472, #67577, #67328, #67566, #67451, #68098, #68225, #68177, #68102, #67951, #67957, #68235, #68447, #68446, #68183, #68318, #68385, #67635, #65623, #65956, #66063, #65992, #65880, #66343, #65889, #66606, #66618, #66737, #66607, #66579, #66732, #66849, #66400, #66952, #66570, #66967, #66595, #67121, #67206, #67444, #67494, #67499, #67267, #67567, #67455, #67161, #67581, #67539, #67625, #67690, #67454, #67731, #67734, #67735, #67607, #67413, #67387, #67882, #67864, #67503, #67861, #67888, #67884, #67826, #68044, #67851, #68276, #69888, #70093, #70436, #70914, #71222)
Optimized some front-end passes to enhance the robustness of the front-end processing flow and improve the performance of computationally intensive subgraphs. (#65142, #67466, #69228, #70994, #71226, #71297, #71443)
Designed new backend IR basic components and related Pass interfaces to provide a more concise and efficient way of developing optimization strategies. Through automatic pruning strategies, it can effectively reduce the traversal overhead of backend IR. (#70485, #70765, #71042, #70952, #69454, #70361, #70334, #70406, #70191, #70462, #70548, #70592, #70437, #70619, #70543, #69611, #70739, #70533, #70696, #70498, #70829, #71111, #70883)

Bug fixes

Fixed bugs in the symbol derivation of some operators and in its implementation logic. (#65185, #65231, #65266, #65951, #67142, #67286, #65958, #65955, #66470, #66764, #66036, #66662, #66741, #66745, #66807, #66791, #66859, #66880, #66962)
Fixed bugs in the lowering of some special operators to the compiler. (#68698, #68699, #68691, #68948, #70144, #70895)
Fixed errors reported in some scenarios during operator fusion. (#67038, #67400, #67655, #67723, #68029, #68042, #68888, #69250, #69937, #70924)
Fixed a backend correctness issue when handling extreme values, improving the robustness of the compiler. (#68327)
Fixed implementation logic bugs in the backend Schedule and post-processing tuning process, resolving errors and performance issues in some cases. (#68605, #68937, #68587, #69060, #69608, #71471, #71068)
Resolved the issue of randomness in the operator fusion process. (#69547, #70931)

4. Automatic parallel architecture

In the official 3.0 release, we verified and refined the automatic parallel architecture in depth to better support the pre-training and fine-tuning workflow for common large model scenarios such as text-only dense models, text-only sparse (MoE) models, and multi-modal understanding models. Specifically, we added sharding derivation rules for over 20 operators used in these scenarios and added support for converting automatic parallel training parameters into manual parallel parameters for downstream inference. This makes automatic parallelism fully usable and helps users reduce the development cost of large model parallel programs. To further simplify distributed development, we introduced a new paddle.distributed.parallel interface. Built on the distributed tensor annotation syntax, it lets users configure common parallel strategies such as data parallelism, model parallelism, and pipeline parallelism non-intrusively, outside the model definition code. Furthermore, the static graph automatic parallel architecture has been comprehensively upgraded on top of PIR: the underlying basic components, core modules, parallel strategies, and performance optimization strategies are all implemented uniformly on the extended PIR DistDialect. This further improves the consistency of automatic parallelism between dynamic and static graphs, and performance on the Llama series models matches or even surpasses manual parallelism.

New Features¶

Added the paddle.distributed.parallel interface to support configuring common parallel strategies outside of model networking, simplifying the distributed development process. #69004, #69033, #69077, #69136, #69169, #69212, #69217, #69283, #69288, #69326, #69365, #69384, #69426, #69443, #69462, #69492, #69628, #69677, #69697, #69776, #69896, #70138, #70182, #70539, #71116, #71210
Added support for MoE expert parallelism in pure text sparse scenarios, including a mechanism for converting expert parallelism to mesh partitioning and automatic invocation of all2all communication. #66462, #66750, #68004, #68053, #68187, #68477, #69098, #69262, #69296, #70715, #71292, #71320
To meet the needs of users in extreme manual optimization scenarios for managing segmentation status and communication operations, and to address the issue of being unable to use tensor segmentation syntax in some non-SPMD scenarios, we have added the LocalLayer interface to support a hybrid network of automatic and manual parallelism. #70519, #70525, #70600, #71232, #71264, #71373
To enable users to run automatic parallel programs using domestic hardware, we have completed the adaptation for Kunlun chips, and support for other chips is also underway. #70997, #71126, #71229, #71289, #71425, #71500
For situations where the data dimension cannot be divided evenly by the device dimension, non-balanced splitting derivation and splitting transformation are supported. #66103, #67756, #69265, #70072
The shard_dataloader function has been upgraded to support setting the gradient accumulation step count through batch_sampler, and also supports scenarios with multiple model inputs. #65325, #70659
Upgrades have been made to the parameter saving and loading functions, supporting asynchronous storage of parameters, mutual loading of master_weight between dynamic and static graphs, as well as parameter version control and offload functions. #66858, #67427, #70105, #70639
To support dynamic-to-static conversion of networks that use PyLayer, added PyLayer support in static graph mode, allowing distributed tensors to be run within PyLayer. #67326, #68190, #69089, #70831
The dynamic-to-static conversion interface now accepts a user-defined input_spec, so users can supply the required input_spec themselves. This resolves incorrect conversions caused by a mismatch between the format of the data stream input and the input_spec the model actually requires. #69183
Adapted the gradient clipping strategy for hybrid parallel scenarios. #65259, #65928, #69287, #69760, #71421
For scenarios where the number of model layers is not divisible by the number of devices, a non-balanced pipeline parallel strategy is supported, allowing users to split different numbers of network layers at different pipeline stages. #69728, #70164, #70230
Added set_mesh and get_mesh interfaces to enable users to easily set and retrieve the global mesh. #69999
Added automatic and manual parallelism accuracy alignment switches to facilitate the conversion of existing manual parallelism models to automatic parallelism and verify the accuracy of the results. #67681
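
The non-balanced cases listed above (a data dimension that is not evenly divisible by the device dimension, or a layer count that is not divisible by the number of pipeline stages) both come down to splitting N items into uneven chunks. The following is a minimal pure-Python sketch of one common convention, in which the leading chunks absorb the remainder. The function names `uneven_split_sizes` and `chunk_ranges` are hypothetical, and this is not Paddle's implementation; the rule Paddle actually applies may differ.

```python
from typing import List, Tuple


def uneven_split_sizes(total: int, parts: int) -> List[int]:
    """Split `total` items into `parts` chunks whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, remainder = divmod(total, parts)
    # The first `remainder` chunks each take one extra item.
    return [base + 1 if i < remainder else base for i in range(parts)]


def chunk_ranges(total: int, parts: int) -> List[Tuple[int, int]]:
    """Return the half-open [start, end) index range owned by each chunk."""
    ranges = []
    start = 0
    for size in uneven_split_sizes(total, parts):
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # A tensor dimension of length 10 sharded across 4 devices.
    print(uneven_split_sizes(10, 4))
    # 30 network layers assigned to 4 pipeline stages.
    print(chunk_ranges(30, 4))
```

Under this convention the chunk sizes always sum to the original length, so no element or layer is dropped or duplicated.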

Functional improvements¶

Improved and optimized the slicing derivation rules for operators

Added slicing derivation rules for the add_n, split, and softmax_grad operators. #65606, #69439
Added slicing derivation rules for the assign and embedding_grad operators. #67457
Added a slicing derivation rule for the clip operator. #70632
Added slicing derivation rules for the dist_stack and gather_nd operators. #65426
Added a slicing derivation rule for the dropout operator. #70216
Added a slicing derivation rule for the fused_dropout_add operator. #67722
Added a slicing derivation rule for the fast_ln custom operator. #68148
Added slicing derivation rules for the greater_equal and less_equal operators. #68868
Added slicing derivation rules for the greater_than and less_than operators. #68133
Added a slicing derivation rule for the if operator. #69357
Added slicing derivation rules for the logical_and, logical_not, logical_or, and logical_xor operators. #67840
Added a slicing derivation rule for the logsumexp operator. #67840
Added a slicing derivation rule for the non_zero operator. #67996
Added a slicing derivation rule for the pad operator. #68304
Added a slicing derivation rule for the p_norm operator. #68317
Added a slicing derivation rule for the scatter_nd operator. #67980
Added a slicing derivation rule for the sigmoid operator. #71092
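
As a rough illustration of what a slicing derivation rule does, the sketch below infers the output distribution of a same-shape elementwise binary operator (such as greater_than or logical_and) from the distributions of its two inputs. It assumes a `dims_mapping` style encoding in which entry `i` is the mesh dimension that shards tensor axis `i`, and `-1` means the axis is not sharded. The function names and the conflict handling (falling back to an unsharded axis) are simplifications chosen for this example, not Paddle's actual rules.

```python
from typing import List

REPLICATED = -1  # assumed marker: the tensor axis is not sharded


def merge_axis(a: int, b: int) -> int:
    """Merge the mesh dimension assigned to one tensor axis by two inputs."""
    if a == b:
        return a
    if a == REPLICATED:
        return b
    if b == REPLICATED:
        return a
    # The inputs disagree; this sketch falls back to replicating the axis.
    return REPLICATED


def infer_elementwise_dims_mapping(x_map: List[int], y_map: List[int]) -> List[int]:
    """Derive the output dims_mapping of a same-shape elementwise binary op."""
    if len(x_map) != len(y_map):
        raise ValueError("this sketch assumes inputs of equal rank (no broadcasting)")
    merged = [merge_axis(a, b) for a, b in zip(x_map, y_map)]
    # One mesh dimension can shard at most one tensor axis: keep its first use.
    used = set()
    out = []
    for mesh_dim in merged:
        if mesh_dim != REPLICATED and mesh_dim in used:
            out.append(REPLICATED)
        else:
            if mesh_dim != REPLICATED:
                used.add(mesh_dim)
            out.append(mesh_dim)
    return out
```

For example, inputs mapped as `[0, -1]` and `[-1, 1]` merge to `[0, 1]`, while inputs that shard the same axis along different mesh dimensions fall back to an unsharded axis in this sketch. A real rule would also have to report which inputs need resharding and which communication that implies.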

Upgraded the static graph automatic parallel architecture based on PIR

Upgraded Automatic Mixed Precision (AMP) training. #65089, #65892, #66418, #66674, #68545
Upgraded the recompute strategy. #69681, #70064
Upgraded the parameter slicing parallel strategy. #63542, #67748, #68288, #68314, #69059, #71167
Upgraded the pipeline parallelism strategy. #66810, #67174, #67522, #68141, #68742, #68962, #69052, #69201, #69244, #69578, #69584, #69654, #69799, #69894, #70360, #70615
Upgraded the gradient accumulation strategy. #66641, #67254, #67907, #68391, #68460, #68472, #68664, #68727, #69171, #69805
Upgraded the operator fusion strategy. #68087, #68207, #68383, #68623, #68650, #68736, #69103, #70889
Upgraded the tensor_fusion optimization strategy. #66130, #68475, #69243, #69560, #69823, #70195, #70309, #70363, #70869
Upgraded the tensor parallel optimization strategy. #68182, #68389
Upgraded the custom operator slicing derivation mechanism. #67614
Upgraded the parameter saving and loading mechanism. #66416, #67045, #67369, #68203
Optimized computation graph compilation time. #68796

Bug fixes¶

Fixed bugs in the segmentation derivation mechanism and the segmentation derivation rules for several operators. #65702, #65835, #66098, #66955, #67052, #67059, #67101, #67283, #67729, #67996, #68413, #68455, #68533, #68976, #68977, #69027, #69203, #69223, #69862, #69991, #70100, #70624, #71024, #71152, #71214, #71253, #71388
Fixed several bugs in the segmentation conversion mechanism. #65060, #65820, #67630, #67809, #68115, #68468, #70023
Fixed the bug of incorrect derivation of shard_degree in parameter slice parallelism. #68781, #69214
Fixed issues in scenarios such as inconsistent results between dynamic and static graphs in shard_dataloader, slicing dict-type data, and custom sampler scenarios. #65262, #66096, #66882, #69620
Fixed the bug where the recompute setting with use_reentrant=false was incompatible with parameter slicing. #65188
Fixed bugs in the parameter loading and saving functions. #66266, #69764
Fixed bugs in operators such as Conv2D, fill_constant, flash_attn_grad, reduce_scatter, if, tuple_push, and tuple_pop. #67587, #68008, #68586, #68589, #69519, #70207
Fixed bugs in communication operators such as reduce_scatter, p_send, and p_recv. #67386, #71433
Fixed bugs related to tensor type promotion. #66541, #68342
Fixed the bug where automatic allocation of GPU memory occurred when converting uninitialized distributed tensors to NumPy arrays on some cards. #66361
Fixed the bug that triggered data copying when calling to_tensor on non-segmented tensors. #67169
Fixed the bug related to the segmentation of the scaler parameter. #68289
Fixed the accuracy issue of enable_delay_scale_loss. #68525
Fixed the hang issue caused by different creation orders of communication groups. #68847
Fixed the bug of incorrect op_role setting in static graph scenarios. #67850, #67986, #68156
Fixed the bug where the output variable of the random number operator could not be sliced in static graphs. #67589, #67750, #68067
Fixed the bug where the graph cache mechanism failed in static graphs. #68488
Fixed the bug of index out-of-bounds in paddle.distributed.to_distributed. #70174
Fixed a bug in the pipeline parallel visualization tool. #71386

5. Operator mechanism¶

Operator-related PRs cover the decomposition of composite operators, the adaptation of operator kernels for new hardware, sparse operators, and the retirement of old IR operators. These changes lay the foundation for PIR-compatible compilers and for performance advantages across multiple hardware platforms. Standardizing the operator system has also optimized the code structure, reduced technical debt, and improved maintainability.

New Features¶

Supported the decomposition of composite operators. #65148, #65007, #65482, #65006, #65692, #65961, #65968, #65967, #66510, #66795, #66835, #67151, #67342, #67481, #67502, #67606, #67757, #67775, #67891, #67790, #67965, #67968, #68168, #68125, #68228, #68295, #68353, #68357, #68827, #68834, #69239, #68817, #69108, #69373, #69372, #68829, #69684, #68818, #68835, #69838, #69998, #69675, #70367, #70080, #71352, #66450, #67593, #67988, #68346, #68399, #68319, #68485, #68961, #68575
PIR supports PyLayer. #69674, #70375
Support for XPU-related operator computations. #65684, #65976, #68497
PIR supports sparse operators. #62663, #67885, #67976, #68261, #68326
Support manual Recompute. #65879
Implement the kernel and register the operator. #63130
Support for Custom Op. #68824, #68748
Added the second-order backward composite implementation for acos in dynamic graph mode. #70409
Support initialization and computation of 0-size tensors. #70504

Bug Fixes¶

Fixed bugs related to composite operators. #70250, #67170, #71218, #69095, #70189
Fixed XPU-related bugs. #65149, #70845
Fixed shape-related bugs. #68722, #70210, #70492
Fixed save/load-related bugs. #69153
Fixed bugs related to types. #65721, #65859
Fixed issues in the invocation and execution of other operators, including type matching, type inference, parameter type support, etc. #65360, #65024, #66308, #67085, #67285, #67076, #67547, #68007, #68527, #68549, #68543, #68604, #68741, #68859, #69025, #69065, #69405, #69688, #69912, #70177, #70517, #70596, #70788, #70870, #71332, #71454, #71442, #71499, #67459, #68470, #70206

Others¶

Optimize code style. #68536
Fix spelling errors. #67456, #66673, #68702, #68735, #68718, #70700, #70682, #70670, #70241, #69626, #70051, #67764, #68872, #70055, #67954, #67404, #69273, #66981, #68145, #69148, #69145, #69168, #68940, #70344
Modify the interface documentation. #69378
Replaced operator and parameter naming under the fluid operator system. #69345, #69382, #69484, #69444

Discarded¶

Retired the xshape output. #66769, #67009, #67152, #67172, #67355, #67373, #66089
Removed the obsolete operators under the fluid system, along with their kernels, related unit tests, and related calling code. #67370, #67088, #67324, #67666, #68058, #68311, #68358, #68312, #68355, #67528, #68316, #68356, #68397, #68441, #68417, #68567, #68583, #68649, #68331, #68730, #69754, #69445, #69921, #70268, #69446, #69544, #70272, #69745, #70300, #70388, #70421, #70302, #70445, #69275, #69081, #70588, #67778, #67953, #68093, #68092, #67684, #69665, #67915, #67917, #68403, #68404, #68969, #68953, #68954, #68942, #68950, #69381, #69380, #69448, #69680, #69775, #69812, #69840, #69828, #69742, #69923, #69922, #69904, #70002, #70054, #70052, #70053, #70713, #70718, #70717
Remove deprecated flags. #70727, #70726
Removed the deprecated API of composite operators. #69873, #69309

Developer-related¶

Supporting work for composite operators, including operator adaptation, new flags, and test cases. #67725, #65252, #67590, #68076, #66711, #68813, #68928, #69054, #69156, #69255, #69460, #70270
Add unit tests for operators. #68272, #68490
Added operator API aliases for PaddleCustomDevice. #69526
Define the position of the shift operator to ensure it only supports dynamic graphs. #69289
Annotate only forward computation operators. #68580
Changed the backward operator of the view operation to reuse the forward operator, supporting the higher-order differentiation needed in scientific computing scenarios. #71086
Migrate operator file location/modify function namespace/modify function parameter names, etc. #66393, #67066, #67012, #67243, #67367, #67760, #67242, #67189, #67899, #67687, #68035, #67682, #68464, #68469, #67900, #68563, #68562, #68564, #68479, #68588, #68726, #68719, #68767, #68557, #68671, #68786, #67948, #64999, #68581, #68361, #68656, #68396, #68059, #68785, #68665, #68869, #67626, #68921, #69268, #69271, #69306, #69302, #69341, #69364, #69343, #69383, #69415, #69437, #69494, #69541, #69543, #69540, #69569, #69568, #69621, #69622, #69701, #69702, #69704, #69743, #69780, #69814, #69822, #69893, #69967, #69976, #70011, #70015, #70007, #70010, #70346, #70414, #69951, #70299, #70441, #70435, #68420, #70671, #70705, #68540, #70211, #67489, #66927, #66942, #66848, #66796, #67036, #67244, #67299, #67171, #67293, #67208, #67408, #67523, #67689, #67694, #67797, #67894, #65969, #65939, #67928, #68097, #66744, #68496, #66943, #68773, #69272
Move test file locations. #67564, #68266, #68634
Preparatory changes for retiring the xshape output. #67543, #67572

Improvement¶

Supported more data types. #69143
Updated the XPU interface. #69800
Improved operator printing functionality. #69916
Upgraded the normalize operation to support more scenarios. #70152
Extended group_norm to handle cases where the rank is greater than 5. #68774
Improved the usage of backward_blacklist. #69356

Performance improvement¶

Optimized the performance of the where_double_grad operator. #70404
Replaced “for range” loops with “slice” operations to speed up the execution of grad. #69938
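
The general idea behind replacing an index loop with a slice can be shown in plain Python. This is only an analogy using Python lists, not the actual change made in #69938, and the function names are hypothetical.

```python
from typing import List


def take_by_loop(values: List[float], start: int, stop: int) -> List[float]:
    """Copy values[start:stop] one element at a time with an index loop."""
    out = []
    for i in range(start, stop):
        out.append(values[i])
    return out


def take_by_slice(values: List[float], start: int, stop: int) -> List[float]:
    """Copy the same range with a single slice operation."""
    return values[start:stop]


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    # Both forms select the same elements when 0 <= start <= stop <= len(data).
    print(take_by_loop(data, 2, 6) == take_by_slice(data, 2, 6))
```

The slice version performs the copy in one operation instead of one interpreter-level step per element, which is usually where the speedup comes from. The two forms are interchangeable only for in-range, non-negative indices.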

6. Framework performance optimization¶

PRs related to performance optimization, encompassing optimizing operator performance, enhancing kernel performance, optimizing memory usage, and refining namespaces, all aim to provide users with a superior development experience.

New Features¶

Enhanced support for fp8 type. #64735, #64955
Enhanced support for XPU. #65362, #65304, #68451
Enhanced support for DCU. #65398, #65857, #66423
Expand the capabilities of oneDNN. #66000, #66474, #66568
Rename parameters and support more complex masks. #65409
Support for flash-attention. #68968
Support OpenVINO CPU high-performance inference. #69122

Functional improvements¶

Enhance PIR pass to achieve better fusion. #65540
Enhanced oneDNN functionality. #65971, #70430, #70630, #70871
Improve the performance of FlashMask. #68109
Optimize kernel performance. #69660, #69596
Optimized composite operators. #69515, #69616

Bug Fixes¶

Fixed bugs related to PIR, CINN, SOT, oneDNN, etc. #68951, #69553, #69682, #67741, #69346, #69401, #68903
Fixed bugs related to composite operators. #69479, #69487, #67176
Fixed the issue with the FP8 data type on the CPU. #65539
Removed the unnecessary overhead of creating events in the compute stream. #67315
Fixed performance issues. #68378
Fixed issues related to types. #69720
Fixed other issues. #70019, #70008, #70645, #71209, #68152, #69907, #71207

Performance optimization¶

Optimizations related to the CINN compiler. #69455, #70284, #67576, #68946, #68615
Optimizations related to oneDNN. #68784, #68716, #67554
Memory-related optimizations. #68660, #69930, #68174, #70359
Kernel computation-related optimizations. #65507, #68541, #71479, #71403
XPU-related optimizations. #67051
Other optimizations include pass optimization of the inference process, dynamic shape optimization in automatic parallelism, and FlashAttention computation optimization. #68394, #68696, #68759, #68791, #69390, #69961, #69939, #70455, #70663, #71290

Others¶

Modify function namespaces. #66818, #67023, #67114, #67217, #67524, #67796, #67881
Upgrade oneDNN. #69917
Modify the pass level. #69524
Optimizations related to memory read and write. #65804, #66923
Optimize the GetValueName-related signatures. #66363, #66559, #66738

Discarded¶

Remove obsolete files and functions. #67514, #67811, #67911

7. Inference deployment¶

This release focuses on two core directions: building the ecosystem for the new-generation Paddle Intermediate Representation (PIR) and optimizing large model inference. The main breakthroughs include:

Deep fusion of PIR-TensorRT

Completed the refactoring and code optimization of the core execution mechanism, and developed over 50 operator converters
Added low-precision support (FP16/INT8) and Generic Plugin execution capability
Built a complete unit testing system that supports the entire model loading/saving process

Leap in inference performance of large models

Added full-process support for the Mixture of Experts (MoE) system, covering Hopper architecture optimization
Supports processing of 128K ultra-long sequences, enhancing long-text inference capabilities
Implemented cutting-edge quantization schemes such as FP8/W8A8 to reduce memory usage

Comprehensive upgrade of infrastructure

oneDNN has been upgraded to version 3.6, significantly enhancing CPU inference performance
Model loading speed optimized by over 40%, supporting fast loading of PIR models
Improved distributed inference support and fixed allreduce data type issues

New Features¶

Support Paddle-TensorRT based on PaddlePaddle’s new generation of intermediate representation (PIR)
Development of core basic execution mechanism functions and code optimization. #64995, #67054, #67660, #67755, #70762
Development of operator Marker and Converter. #67753, #67956, #68084, #67974, #68395, #68216, #68529, #68608, #68663, #68757, #68614, #68783, #68775, #68839, #68686, #68840, #68941, #69015, #69038, #69117, #69208, #69315, #69261, #68878, #69705, #69706, #70170, #70267, #70429, #69330, #70507, #70535, #70667, #70816, #70826, #70955, #71028, #71013, #71157, #71231, #69199, #68956, #66658, #66811, #67519, #67877, #68090, #69086, #68787, #68778, #69318, #69995, #70325, #70817, #70879, #70875, #71041, #68876
Support for Generic Plugin execution function. #66634, #70251
Low-precision (FP16, INT8) function support. #69597, #71127
Improved auxiliary functions such as the unit test system and pass usage support. #67525, #68034, #71281, #71235, #67568, #70139, #70529
Large model inference optimization
Added fused_moe function support (basic support/non-standard TopK/Hopper architecture) #66084, #67425, #67732
Support for mixed precision computation (GQA mixed precision/BF16 registration) #65078, #67769
Added inference optimization features (dynamic graph inference/support for 128K long sequences) #65962, #70088
Added implementation of quantization inference operator (FP8 W8A8 computation/weight-only int4 quantization) #65441, #64094
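
To give a feel for what weight-only int4 quantization involves, here is a minimal pure-Python sketch of symmetric quantization with a single scale per weight group. It is an illustration under simplifying assumptions (one scale per list of weights, round-to-nearest, no packing of two 4-bit values into a byte) and does not reflect Paddle's kernels; the names `quantize_int4` and `dequantize` are hypothetical.

```python
from typing import List, Tuple

INT4_MIN, INT4_MAX = -8, 7  # value range of a signed 4-bit integer


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetrically quantize a group of weights to int4 with one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to INT4_MAX; use 1.0 for an all-zero group.
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    quantized = [max(INT4_MIN, min(INT4_MAX, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from int4 values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [3.5, -1.0, 0.3, 0.0, -3.5]
    q, scale = quantize_int4(weights)
    print(q, scale)
    print(dequantize(q, scale))
```

In a weight-only scheme, only the stored weights are quantized this way and activations stay in higher precision, whereas W8A8 quantizes both weights and activations to 8 bits. The rounding error per weight in this sketch is at most half the scale.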

Functional improvements¶

Improved the functional mechanisms of Inference under PIR
The executor supports loading .json models #65223
Supported switching PIR mode on and off in a controllable way #65596
Improved the inference mechanism for large models
Optimized gemm algorithm search (cublaslt global search/offline caching) #65597, #66132
Enhance type system compatibility (PD_VISIT_FLOATING_AND_HALF_TYPES) #71022
Optimized attention mechanism (support for multiple blocks of MMHA/XPU) #67211, #68104

Performance optimization¶

oneDNN has been upgraded to version 3.6, resulting in a general improvement in model inference performance on GNR/EMR devices #69386
Operator performance optimization (layer_norm/top_p_sampling) #65711
Model loading acceleration (regular/PIR model) #69110, #70219

Bug fixes¶

Fixed issues related to Predictor when saving/loading PIR models. #65180, #65019, #65714, #69619, #67570, #65595, #69200
Fixed execution issues of inference unit tests in scenarios such as PIR and multiple hardware configurations. #65763, #66481, #67105, #67248, #67470, #67638, #68135, #68191, #68211, #68160, #68185, #68127, #68887, #69191, #70961, #68020, #67923, #67963, #68482, #68546, #68593, #68793
Fixed issues related to Paddle TensorRT conversion and execution. #66932, #66655, #67274, #67504, #65780, #68170, #68647, #68776, #69573, #69598, #69510, #69864, #69885, #70161, #70116, #70791, #70801, #70824, #70939, #71143, #71154, #71163, #71183, #71233, #71287, #71319, #67720, #69671, #70168, #69957
Fixed issues related to Paddle Inference compilation and linking. #65846, #67081, #63184
Fixed quantization issues. #67839, #68049, #70099, #64878, #65717, #67552, #67715
Fixed oneDNN inference issues. #67836, #68021, #68132, #71426, #68057
Fixed memory issues. #68631, #69129, #70314, #67863
Fixed issues with OpenVINO support in Paddle Inference. #70212, #70288
Fixed issues related to Pass. #65349, #65421, #65677, #66850, #67443, #67620, #68158, #68642, #68837, #68880, #68935, #69112, #69205, #69242, #69352, #69421, #69690
Fixed other issues. #70237, #68173
Fixed issues related to fused_moe (testing/GEMM/WINT4/multi-architecture compatibility/Bias optional) #67353, #67396, #67717, #67794, #67783
Fixed issues in the block_attention series (GQA discrepancy/out-of-bounds risk/multi-head support) #67175, #69001, #70763
Fixed PIR-related issues (layout conversion/BF16 replacement errors) #66977, #67830
Fixed distributed-related issues (allreduce data type/parameter synchronization) #67449, #69157
Fixed kernel execution issues (forward-backward conflict/default stream argsort) #67218, #68374
Other key fixes (reducing the size of the C++ library/fixing RoPE calculation in NeoX format/fixing static graph execution) #66041, #66583, #67580

Other modifications¶

Code cleanup and maintenance (API deprecation/compilation warning fixes) #68048, #70384
Third-party integration optimization (OpenVINO submodule management) #70313, #70425

8. Hardware adaptation¶

Continued to improve and upgrade support for platforms such as Kunlun and Haiguang to enhance the user experience.

New Features¶

Added ops and improved functionality on Kunlun Core XPU. The ops involved include: flash attention/flash_attn_unpadded, multinomial, matmul, repeat_interleave, logsumexp, index_put_grad, mean_grad, pow, pow_grad, rsqrt, full, rms_norm, rms_norm_grad, put_along_axis, Cumsum, argmin, masked_select/grad, expand_v2/grad, all2all, expand, reduce_sum, reduce_max, reduce_min, moe, fused_linear_param_grad_add, adamw, clip/clip_grad, tan, acos, blha_get_max_len, gather/gather_grad, scatter/scatter_grad, round, index_select/index_select_grad, isfinite, isinf, quantize_linear, dequantize_linear, conv3d_transpose, logsumexp_grad, index_add_grad, eye, gather_element, tril, triu, set_value_grad, argmax, take_along_axis, etc. #65413, #64846, #65656, #65963, #66143, #66482, #66585, #67077, #67173, #67551, #63989, #67919, #68052, #68176, #68408, #68454, #68478, #68473, #68453, #68770, #68933, #69042, #68713, #69368, #69723, #69767, #69898, #69970, #69771, #70176, #70428, #70573, #70576, #70633, #70114, #70627, #71038, #71132, #71228, #71274, #71364, #71375, #71431, #71451, #67585, #67637, #67914, #67641, #67913, #67955, #68411, #68560, #68423, #68894, #71053, #71047, #69056, #70843, #65653, #68023, #67780, #68622, #67215

Added support for rocsolver and warpctc on Haiguang DCU, and added ops and improved functionality. The ops involved include: flash_attention, hipblaslt, fastgelu, multiclass_nms3. #68066, #69457, #68603, #65599, #70587, #71337, #70173

Bug fixes¶

Fixed op bugs on Kunlun Core XPU. #65020, #65251, #65418, #65387, #65525, #65613, #65533, #65705, #65915, #66238, #66485, #67349, #67372, #67276, #67460, #67496, #67530, #67828, #68010, #68157, #68172, #68388, #68213, #68501, #68504, #68585, #69229, #69374, #69424, #69440, #69614, #68542, #69990, #70351, #70479, #70431, #70638, #70856, #70974, #70973, #71027, #71062, #71115, #71110, #70858, #71147, #71212, #71361, #71423, #70859, #71492, #71493, #69826, #67341, #68906, #71171

Fixed op bugs on Haiguang DCU. #69617, #65716, #66630, #65399

Performance optimization¶

Upgraded basic components such as streams on Kunlun Core XPU and optimized the performance of certain operations. #65102, #69727, #69899, #69942, #70025, #70640

Upgrade of hardware underlying basic libraries¶

Upgraded the underlying libraries to support Kunlun Core P800 and added support for basic components. #65494, #65924, #69752, #70835, #65554, #66998, #65278, #70614, #71012, #71178, #71168, #68740, #71100, #65221, #67983

Others¶

Modifications to related modules such as op test #65654, #66233, #66728, #67959, #68169, #68418, #68434, #68445, #68877, #68993, #69006, #70471, #70706, #67777, #65698, #68433, #65689

9. Environment update¶

We optimized the framework’s stability and cross-platform compatibility, fixed issues related to test coverage and compilation environment compatibility, and enhanced support for multiple platforms such as Windows, XPU, and DCU. Simultaneously, we streamlined the code structure, removed obsolete code and unnecessary dependent libraries to reduce maintenance costs, upgraded key dependencies such as CUDA, further optimized the CI/CD process, improved build speed, and enhanced overall system stability.

Bug Fixes¶

Improve the CI/CD process, fix test cases, resolve compilation and installation issues in different environments, and enhance the stability and cross-environment compatibility of the framework. #65627, #65736, #65900, #66069, #67000, #67312, #67432, #67540, #67670, #68449, #70806, #65665, #65652, #70644, #68119, #68466, #68858, #68788, #68934, #69883, #69924, #71187, #70798, #71248, #70512, #71363, #71438, #71291

Improvement and Upgrade¶

Environment upgrades #69491, #66560, #65686, #71177, #71284, #69791, #69349, #70944, #65411
Pipeline merging #66815, #67306
Improvement of DCU/NPU/KUNLUN pipeline #67516, #67629, #67987, #69903, #68448, #70401, #71192, #71197, #68027
Support for Windows environment #70390, #70785, #71286, #71414, #68901
Improvement of third-party libraries #71419
Other optimizations are aimed at enhancing CI stability and execution efficiency #67574, #69058, #70610, #67093, #69037, #65213, #65913, #65947, #66479, #71054, #71396

New Features¶

Added GitHub Actions mechanism #70571, #70626, #71325, #71344, #71353, #71322, #70415, #70465, #70524, #70550, #70564, #70579, #70580, #70963, #71200, #71261, #71265

Discarded¶

Cleanup of obsolete code and dependencies, including removing Python libraries that are no longer needed and simplifying compilation configurations to reduce maintenance costs #65635, #67542, #67609, #69572, #68150, #67604, #68561, #68904, #67219

10. Others¶

Changes that do not affect user-facing usage, including cleanup of obsolete code, code migration, unit test cleanup, debugging, and upgrades to monitoring mechanisms.

Developer-related content¶

Remove useless debugging code and migrate code #65256, #65782, #65836, #65840, #65862, #65863, #65987, #66547, #66556, #66645, #66646, #66648, #66672, #66783, #66083, #65562, #66564, #66370, #66912, #66913, #66914, #66915, #66664, #66671, #66121, #65907, #65949, #65950, #65954, #66545, #66649, #66900, #66901, #66902, #66903, #66904, #66906, #66907, #66908, #66909, #66549, #66555, #66647, #66898, #66886, #66042, #66043, #66045, #66046, #65826, #65825, #65827, #65829, #65830, #65831, #66081, #66082, #66087, #65980, #65981, #65983, #65985, #65979, #65986, #65988, #65989, #66682, #66717, #65802, #66159, #66147, #66149, #66150, #65798, #65731, #66145, #66086, #65781, #65837, #65828, #65864, #65959, #65706, #66918, #66191, #66689, #66808, #65424, #65452, #65463, #65478, #65339
Standardize code namespaces #64755, #64765, #64767, #64770, #64775, #64776, #64757, #64780, #64777, #64779, #64758, #64759, #64762
Modify operator list #66573, #65598, #65100, #65385, #65192, #65118, #65108, #65153, #65465, #65128, #65420, #65099, #65207, #66066, #65400, #65160, #65195, #65445, #65479, #65193, #65401, #66724, #65164, #65466, #65661, #65897, #66022, #65313, #65616, #65588, #65174, #65402, #65154, #65151, #65098, #64953, #65122, #65590, #65152
The old executor function of the Paddle framework is being phased out #65077, #65340
Error message prompt optimization #66668, #66675, #66605, #66613, #66507, #66700, #66739, #66719, #66733, #66552, #66548, #66623, #66702, #66705, #66718, #66727, #66860, #66869, #66933, #66939, #66553, #66774, #66794, #66551, #66540, #66617, #66841, #66788, #66954, #66698, #66782, #66844, #66443, #66455, #66517, #66804, #66802, #66536, #66707, #66525, #66753, #66550, #66857, #66471, #66628, #66469, #66775, #66506, #66780, #66953, #66695, #66603, #66491, #66715, #66632, #66594, #66615, #66578, #66534, #66569, #66529, #66530, #66522, #66789, #66600, #66511, #66512, #66527, #66518, #66958, #66532, #65258, #66487, #66876, #66832, #66872, #66830, #66708, #66502, #66521, #66592

Discarded¶

Clean up abandoned code and useless unit tests #65894, #66165, #66293, #66102, #66442, #66922, #66531, #65518, #66800, #66372, #65902, #65462, #65327, #65189, #65181, #66535, #65383, #65173, #66429, #66386, #66447, #66367, #66160, #65408, #65433, #65481, #65444, #65389, #65663, #65649, #65629, #66142, #65796, #66163, #66291, #65480, #65495, #65498, #65503, #65502, #65501, #65512, #65528, #65472, #65390, #65344, #65384, #65388, #65198, #65248, #65443, #65430

11. List of contributors¶

0x3878f, 0x45f, 2742195759, 86kkd, A-nnonymous, ADream-ki, Aganlengzi, Albresky, AndPuQing, AndSonder, Aoraki-Dream, ApricityXX, Asthestarsfalll, Aurelius84, BHmingyang, BeingGod, Betelgeu, BiynXu, CJ77Qi, Caogration, DDDivano, Dale1314, Deleter-D, DesmonDay, Difers, Dmovic, DongBaiYue, DrRyanHuang, DrownFish19, Eddie-Wang1120, EgoistSA, FeixLiu, ForFishes, Fripping, From00, Function-Samuel, GoldenStain, Guanhuachen2003, GuoxiaWang, Hanyonggong, HarperCy, Hongqing-work, HydrogenSulfate, JZ-LIANG, Jeff114514, JiaWenxuan, LLee233, LanCole, Lans1ot, Layssy, Leoforever123, LiYuRio, LielinJiang, LittleHeroZZZX, Liujie0926, Liyulingyue, Luohongzhige, Marcusryz, MarisaSparkL, Micalling, MikhayEeer, MrXnneHang, MufanColin, NKNaN, Neo-WY, NeroLoh, PolaKuma, Qin-sx, QingshuChen, RachelXu7, RichardWooSJTU, RuohengMa, SCUcookie, Sekiro-x, SigureMo, Sunny-bot1, SylarTiaNII, Sylence8, TBD1, TR666, TimeYWL, Tom-Zheng, Turingg, Victor-Bayim, Vvsmile, WAYKEN-TSE, Wanglongzhi2001, Wangzheee, Waynezee, Wennie396, Whsjrczr, Wizard-ZP, Wong4j, XavierZXY, XiaociZhang, XieYunshen, Xing-lil, Xreki, YKTian-x2b, YZW-explorer, YanhuiDua, YuanRisheng, ZHOU05030, ZhangHandi, ZhangX-21, ZibinGuo, a2064968462, anderson101866, aooxin, aquagull, baoqiwen, bapijun, blacksheep-Aristotle, bukejiyu, carryyu, ccsuzzh, chang-wenbin, changeyoung98, chen2016013, ckl117, cmcamdy, co63oc, continue-coding, cqulilujia, crazyxiaoxi, cszdrg, cubehan3, cyber-pioneer, danleifeng, decade-afk, deepllz, dynamicheart, eee4017, eggman-1024, enkilee, epiphanyer, ethan-sem, fangfangssj, feixi21, fightfat, fufu0615, fxfxfxfxfxfxfxfx, fxy1699, gitliuyf, gongel, gongshaotian, gongweibao, gouzil, gsq7474741, guixxiic, gzy19990617, hanyang2508, haoyu2022, heavyrain-lzy, houj04, huangjiyi, huangkr03, hxzd5568, icpcccpc, inaomIIsfarell, iosmers, jeff41404, jerrywgz, jiachengdai, jiahy0825, jinmingyi1998, jinyouzhi, joseflv, jychen21, jzhang533, kangguangli, kanze1, kineast, kircle888, l1cacheDell, leo0519, lifulll, linkk08, 
little1d, liufengwei0103, liuruyan, lixcli, liym27, liyongchao911, lizexu123, lizhenyun01, lj970926, lshpku, lszxb, ltd0924, luotao1, lwkhahaha, lxd-cumt, mayang002, megemini, mikemikimike, ming1753, monster1015, mori0umi, ndyysheep, nizne9, nobodynobody, ooooo-create, penPenf28, phlrain, pkuzyc, qili93, rich04lin, risemeup1, ronny1996, rsmallblue, runzhech, skywalker2012, smile2game, sneaxiy, successfulbarrier, sunzhongkai588, swgu98, tc20042008, tianhaodongbd, tianshuo78520a, tizhou86, tlxd, uanu2002, umiswing, vivienfanghuagood, waliwali777, walkalone20, wanghuancoder, wangna11BD, will-jl944, winffke, winter-wang, wwwuyan, xiaoguoguo626807, xiaoluomi, xiaoyao0115, xingmingyyj, xkkkkkk23, xu8117, xuxinyi389, xz-alex, yangrongxinuser, yeteye, yinfan98, yongqiangma, yuan20041218, yuanlehome, yuguo-Jack, yumin066, zbt78, zeroRains, zhangbo9674, zhanghonggeng, zhanglirong1999, zhangting2020, zhangyk0314, zhangyuqin1998, zhiminzhang0830, zhink, zhiqiu, zhouquan32, zhoutianzi666, zhwesky2010, zoooo0820, zrr1999, zty-king, zxcd, zyfncg