3.0 Beta Release Note

Overview of PaddlePaddle 3.0 Beta

The core features of this version include new technologies such as dynamic-static unified auto parallel and automatic optimization via the neural network compiler, aiming to address the new challenges in the current deep learning field. PaddlePaddle Framework 3.0 Beta extends the 2.x design concepts of dynamic-static unity and integrated training and inference, and its development interface is fully compatible with the 2.x versions. This means that code developed for version 2.x can, in most cases, run directly on version 3.x without modification. Several key features are detailed as follows:

  • Dynamic-static graph unified auto parallel: To make parallel training of large models easier to program, PaddlePaddle has optimized the semi-auto parallel programming paradigm with unified dynamic and static graphs. Developers no longer need to master the complex concepts and APIs required by manual parallel programming; a small amount of tensor sharding annotation is enough to build hybrid parallelism for large models (a usage sketch follows this list). The framework automatically derives distributed sharding states and inserts communication operators, and also supports one-click dynamic-to-static distributed training, dramatically simplifying the development of hybrid parallel training code. On the dynamic-static unity side, PaddlePaddle has comprehensively upgraded its dynamic-to-static training capability with bytecode-based dynamic-to-static conversion, supporting adaptive graph construction. This has been verified on more than 700 PaddlePaddle industrial-grade models, achieving a 100% success rate for one-click dynamic-to-static training.

  • Automatic optimization via the neural network compiler: The PaddlePaddle Compiler Infrastructure for Neural Networks (CINN) is designed to be integrated with the framework, supporting efficient training and dynamic-shape inference for generative models, scientific computing models, and others, and providing a good balance between computational flexibility and high performance. Through automatic operator fusion and code generation, the inference performance of the Llama2 and Stable Diffusion models has improved by 30%.

  • High-order automatic differentiation: To better support scientific computing scenarios, PaddlePaddle Framework designs and implements high-order automatic differentiation based on a combinatorial operator mechanism, combined with the automatic optimization of the neural network compiler. We have tested more than 40 differential equations in scientific computing scenarios, and the solution speed leads similar products in the industry by 70%.

  • Highly scalable intermediate representation: To improve the scalability of the PaddlePaddle framework, we have developed the highly scalable Paddle Intermediate Representation (PIR). This representation systematically abstracts the underlying core concepts and provides flexible and efficient components. PIR serves as the infrastructure supporting technologies such as dynamic-to-static, automatic differentiation, auto parallel, combinatorial operators, and graph optimization; it is widely used in scenarios such as distributed training, model compression, and inference deployment. With the Declarative Rewrite Rule (DRR) mechanism provided by PIR, the development cost of a Pass can be reduced by 60%. We have tested over 900 model configurations, and the results show that overall inference performance improves by more than 10% after switching to PIR.

  • Multi-Hardware adaptation: PaddlePaddle provides a well-functioning and low-cost solution for adapting hardware to large models. New hardware only needs to adapt a little more than 30 interfaces to support training, compression, and inference of large models. Meanwhile, PaddlePaddle provides a compiler-based hardware access mode: hardware vendors only need to implement the compiler's code-generation back-end as a plug-in to achieve efficient adaptation with the PaddlePaddle framework. This release additionally provides daily-release support for four hardware back-ends: Kunlun XPU, Ascend NPU, Hygon DCU and Cambricon MLU.
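
As referenced in the auto parallel feature above, the following is a minimal sketch of the tensor sharding annotation workflow. The mesh size, layer shapes, and launch command are illustrative only, and the snippet assumes a machine with two visible GPUs (e.g. launched via `python -m paddle.distributed.launch --gpus "0,1"`):

```python
import paddle
import paddle.distributed as dist

# A 1-D process mesh over two devices (illustrative).
mesh = dist.ProcessMesh([0, 1], dim_names=["x"])

class MLP(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        # Annotate one weight as column-sharded and the other as row-sharded;
        # the framework derives the remaining sharding states and inserts the
        # required communication operators automatically.
        self.w0 = dist.shard_tensor(
            self.create_parameter(shape=[1024, 4096]), mesh, [dist.Shard(1)])
        self.w1 = dist.shard_tensor(
            self.create_parameter(shape=[4096, 1024]), mesh, [dist.Shard(0)])

    def forward(self, x):
        y = paddle.matmul(x, self.w0)
        return paddle.matmul(y, self.w1)
```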

This version continues to improve some existing features of the 2.x framework, while the new features bring significant improvements in user experience, performance, ease of secondary development, and hardware adaptability. Beyond the core features above, this version continues to enrich and enhance the API surface to cover more scenarios, optimizes and improves distributed parallel strategies and inference functionality for large model scenarios, thoroughly improves the ease of compilation and installation, synchronously upgrades the installation methods and versions of dependency packages, comprehensively strengthens system security, and performs full error-correction checks on the product documentation. We have also cleaned up some deprecated code to keep the architecture simple. Without the new features enabled, PaddlePaddle 3.0 Beta remains mature and stable; each new feature provides a switch for flexible control, making it easy for users to understand and compare the related capabilities.

User Experience Upgrade

Incompatibility Upgrade

  • PaddlePaddle APIs support type promotion. In the most common computations such as addition, subtraction, multiplication, and division, when the two inputs have different data types, the data type of the output must be determined. Historically, PaddlePaddle supported this only partially and the actual rules were unclear: there were dynamic-static inconsistencies, inconsistencies between APIs and operator overloading, violations of commutativity, and hard-to-fix unexpected results, especially when large models mix bf16/fp16 with fp32 across a wide range of computations. Starting from 3.0 Beta, PaddlePaddle has clarified the type promotion rules and defined in detail the result types of Tensor-with-Tensor and Tensor-with-Scalar computations, ensuring that computation obeys the commutative law, that operator overloading is consistent with the binary APIs, and that dynamic graph results are consistent with static graph results. This better matches user expectations and industry practice. #60638, #63842, #60011
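
A small illustration of the clarified rules (assuming standard float16/float32 promotion; the dtypes in the comments are the expected results, not captured logs):

```python
import paddle

a = paddle.ones([2], dtype="float16")
b = paddle.ones([2], dtype="float32")

# Mixed-precision binary ops promote to the wider type, and the result is
# the same for operator overloading and the binary API, in both dynamic
# and static graph modes.
print((a + b).dtype)           # paddle.float32
print(paddle.add(b, a).dtype)  # paddle.float32 (order does not matter)
```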

Deprecated Features

  • 0-dimensional Tensors have been stably supported for two releases. This version removes the switch FLAGS_set_to_1d, which in some cases converted a 0-dimensional Tensor into a 1-dimensional Tensor with a single element. The switch existed only for compatibility with the incorrect practice, in some suites, of representing a 0-dimensional Tensor with a 1-element 1-dimensional Tensor. That is, PaddlePaddle now fully distinguishes the semantics of a 0-dimensional Tensor from those of a 1-dimensional Tensor with a single element; the two are not equivalent. #61227
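
For reference, the distinction can be observed directly (shapes in the comments are the expected values):

```python
import paddle

scalar = paddle.to_tensor(3.14)    # 0-dimensional Tensor, shape []
vector = paddle.to_tensor([3.14])  # 1-dimensional Tensor with one element, shape [1]

print(scalar.shape)              # []
print(vector.shape)              # [1]
print(paddle.sum(vector).shape)  # [] -- reductions return genuine 0-dimensional Tensors
```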

New API Features

Compared with the previous version, this version adds 126 new APIs, enriching the API surface to better support the needs of large models and scientific computing (a brief usage sketch follows this list). The details are as follows:

  • Add Tensor computation API. paddle.gammaln, paddle.gammainc, paddle.gammaincc, paddle.sinc, paddle.pdist, paddle.histogramdd, paddle.signbit, paddle.copysign, paddle.bitwise_right_shift/bitwise_left_shift, paddle.isposinf/isneginf/isreal, paddle.isin, paddle.hsplit/dsplit, paddle.column_stack/row_stack/dstack/hstack/vstack, paddle.slice_scatter, paddle.masked_scatter #60553, #59311, #59357, #63521, #57869, #57880, #57882, #60150, #57785, #58092, #63523, #64001, #58917, #59127, #59973, #59383

  • Add probability distribution API. paddle.distribution.ContinuousBernoulli, paddle.distribution.MultivariateNormal, paddle.distribution.Exponential, paddle.distribution.Gamma, paddle.distribution.Binomial, paddle.distribution.Poisson #58004, #57899, #57856

  • Add optimizer API. paddle.optimizer.ASGD, paddle.optimizer.NAdam, paddle.optimizer.RAdam, paddle.optimizer.Rprop #58834, #63671, #58851

  • Add Linear Algebra API. paddle.linalg.matrix_exp #59715

  • Add other APIs. paddle.bernoulli_, paddle.nn.ZeroPad1D/ZeroPad3D, paddle.nn.AdaptiveLogSoftmaxWithLoss, paddle.Tensor.apply #64252, #59690, #63728, #63302, #59374, #63227
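
As mentioned above, here is a brief usage sketch of a few of the newly added Tensor computation APIs (their semantics are assumed to follow the corresponding NumPy names; values in the comments are illustrative):

```python
import paddle

x = paddle.to_tensor([1.0, -2.0, 3.0])
y = paddle.to_tensor([-1.0, 1.0, -1.0])

print(paddle.copysign(x, y))                         # magnitude of x, sign of y
print(paddle.isin(x, paddle.to_tensor([3.0, 5.0])))  # elementwise membership test
print(paddle.column_stack([x, y]).shape)             # [3, 2]
```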

Some API Enhancements

API Performance Improvements

  • Focus on optimizing the performance of Tensor basic, advanced, and combined indexing, improving computational performance by 2X to 31X on GPUs and 1.8X to 1004X on CPUs. #60254, #60276, #60452, #60771, #61021, #60983, #61060, #60618
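
For reference, the three indexing forms mentioned above (illustrative only):

```python
import paddle

x = paddle.arange(12, dtype="float32").reshape([3, 4])

basic = x[0, 1:3]                            # basic indexing: integers and slices
advanced = x[paddle.to_tensor([0, 2])]       # advanced indexing: tensor/list indices
combined = x[paddle.to_tensor([0, 2]), 1:3]  # combined indexing: both in one expression
```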

Bug Fixing

  • Fix errors in paddle.optimizer.LBFGS caused by using non-Tensor computations #60219

  • Fix the problem of random numbers not being fixed in paddle.optimizer.LBFGS #60591

  • Fix the incorrect calculation of gradient of set_value operator #59034

  • Fix issues with adapting Tensor basic indexing to PIR #60259, #61103

  • Fix the Tensor combined-index assignment problem #60447

  • Fix a problem when taking values with Tensor combined indexing #61922

  • Fix the paddle.flatten stride calculation error, and add paddle.flatten_ #63084

  • Fix the result inconsistency problem between paddle.index_fill and paddle.index_fill_ #59863

  • Fix the paddle.masked_scatter error report issue #60835

  • Fix the paddle.histogramdd cpu error report issue #61891

  • Fix the bug where repeated use of paddle.cast_ on CPU leads to incorrect results #60054

  • Fix paddle.put_along_axis bug when input size is very large #60551

  • Fix paddle.nanmedian cpu error report issue #63221

  • Fix the bug that paddle.median does not support inputs other than floating-point types in the min branch. #64444

  • Fix the dataloader issue in distributed scenarios. #62696, #63378

  • Fix formatting issues in error prompts #63106, #63144

  • Fix the format issue under GLOG_v>=6. #63345

Security Improvements

  • Enhance the checking of parent_ids #62826

Basic Execution Architecture

The basic functions of PIR have been comprehensively upgraded and improved, and its maturity has greatly increased. Based on PIR, the design of the PaddlePaddle infrastructure is more reasonable, ensuring excellent performance and good scalability of the framework. In this version, we have completed verification of PIR in multiple scenarios: for the single-machine scenario, the PIR back-end switch in dynamic-to-static scenarios is complete; for the inference scenario, all existing models have been verified, and 84.2% of them show a gain of 10% or more; distributed scenarios based on PIR have also been verified. Meanwhile, based on PIR, we have completed the development and validation of core modules such as control flow, backward logic, save/load, and OneDNN adaptation, laying a solid foundation for switching PaddlePaddle to PIR by default. The functional completeness, execution efficiency, and stability of the framework's operator system are further improved, bringing a better usage and development experience to developers.

Function Optimization

PIR New Features

Dynamic-to-static Function Optimization

Optimize the basic dynamic-to-static capability, adapt to dynamic dimensions in SOT training scenarios, and support Python 3.12.
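
A minimal sketch of the one-click dynamic-to-static conversion (model definition and shapes are illustrative; bytecode-based SOT conversion is assumed to be the default behavior of paddle.jit.to_static):

```python
import paddle

class Net(paddle.nn.Layer):
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(8, 4)

    def forward(self, x):
        return paddle.nn.functional.relu(self.fc(x))

net = paddle.jit.to_static(Net())  # bytecode-based (SOT) conversion
out1 = net(paddle.randn([2, 8]))   # first call triggers graph capture
out2 = net(paddle.randn([16, 8]))  # a changed batch size is treated as a dynamic dimension
```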

Operator Mechanisms

To address incomplete kernel implementations and inefficient computation logic, we have improved and optimized some operator implementations and internal mechanisms of the framework, fixed some known problems, and added support for some new features.

Bug Fixing

Developer Content

Vulnerability Fixing

Deprecated Features

  • Clean up deprecated executors and other legacy logic to reduce redundant code. #64822, #60941

Compiler Infrastructure for Neural Networks (CINN)

In version 3.0, the compiler architecture has been significantly upgraded. Based on the Shape Dialect, we built a system for automatic symbolic derivation and simplification that supports symbolic expressions and constraint construction, enabling end-to-end execution of the compiler under dynamic shapes. Meanwhile, CINN has upgraded its automatic subgraph fusion and Pass Pipeline mechanisms, merged the core modules and iteration paths for dynamic and static shapes, and made the architecture clear and unified. In this version, important back-end modules of the compiler, such as AST Compute, the Schedule strategy, and Tiling, have been refactored, improving the compiler's general optimization capability; training and inference correctness as well as speedups under dynamic shapes have been verified on subgraphs of PaddlePaddle Industry Suite models and on typical large models such as Llama2-7B and Stable Diffusion.

New Features

  1. Upgrade the automatic subgraph fusion mechanism and propose the novel TrivialOp and ReduceOp fusion theory, supporting a wider range of vertical and horizontal fusion, ensuring the correctness and robustness of subgraph fusion, and giving full play to the fusion potential of the neural network compiler. (#63340, #63913, #63579, #63605, #60769, #62088, #63124, #63658, #64557, #63318, #62545)

  2. Add symbol derivation for dynamic shapes. Based on the Shape Dialect, implement dynamic symbol construction, automatic derivation, constraint expression, symbol simplification and other mechanisms, introduce the DimExpr concept, and upgrade the InferSymbolicShape logic of 150+ typical primitive operators in the PaddlePaddle framework, providing more information for compiler-supported training and inference under dynamic shapes. (#60843, #62662, #63790, #60098, #60511, #61232, #61939, #62798, #62955, #63029, #60572, #61035, #61224, #61587, #61937, #62314, #62394, #62569, #62495, #62844, #63000, #63016, #64222, #60129, #60899, #61342, #61439, #62766, #61133, #61430, #61498, #61680, #63367, #62151, #62665, #61407, #61502, #61655, #64115, #61791, #62141, #63422, #63577, #63978, #63576, #63947, #64332, #63990)

  3. Add the Pass Pipeline function, including PdToCinn, CinnPreprocess, BuildGroupOp, GroupClusterOp, CinnLowering, Accuracy Check and other Pass strategies, to support the lowering and execution of subgraphs under both dynamic and static shapes, with a clear architecture. (#61611, #62612, #64354, #61848, #62316, #64152, #61619, #62318, #61977, #62211, #63972, #63686, #64505)

  4. Add support for the BucketLower and DyShapeSchedule functions, to realize automatic bucketed compilation and optimization according to the range of dynamic shapes; and adapt and upgrade the CodeGen module logic to support generation of the InferShape function and the dispatch of conditional branches in the Host function, so as to accelerate training and inference of large models under dynamic shapes. (#62730, #61115, #59941, #62207, #64318, #64345, #60519, #62584, #60828, #60533, #61436, #62071, #63971, #61656, #63083, #64405, #63047, #64655, #63095, #63829, #63572)

  5. Add support for a compilation caching strategy, to automatically recognize, merge and reuse compilation results of identical subgraph structures, and improve compilation efficiency with multi-threading, so as to enhance the user experience. (#62952, #63269, #64718, #61367, #63305, #63750, #63871, #64893)

  6. Add support for the GenerateShape mechanism, add the corresponding AST Compute operator definitions, and support automatic resolution of dynamic symbols and automatic generation of ShapeOp in the lowering stage. (#64167, #64636, #61993, #64843, #62587)

Function Optimization

  1. Optimize the BuildCinnPass logic, upgrade the compiler's perception strategy for black-list and white-list operators, and improve the robustness of the Pass logic. (#62372, #61081, #61225, #58863)

  2. Optimize the OpLoweringGroup data structure, remove unnecessary interfaces and members, and reduce the coupling between upstream and downstream modules. (#62339)

  3. Optimize the component design of the compiler's Arch (architecture) layer, abstracting the hardware concept and reducing the cost of adapting to domestic hardware. (#63530, #64347, #64506, #64587)

  4. Upgrade the AST Compute module of the compiler's back-end operators to support computation logic under dynamic shapes. (#62488, #63581, #63687, #63654, #64217)

Performance Optimization

  1. Optimize the Schedule logic of AST IR, restructure core modules such as Vectorize, Unroll, AxisBind, and ComputeAt, and merge the iteration paths of dynamic and static shapes, so as to reduce development and maintenance costs. (#60449, #60155, #60342, #60498, #60538, #60190, #61197, #63140, #61156)

  2. Optimize the Tiling strategy and the temp Buffer function, support warp-level contiguous memory reads and the cache_read / cache_write functions, and improve subgraph execution performance. (#64240, #60562, #64711, #62856, #61576, #61901, #62581, #61987, #60190, #63138, #62517)

  3. Support automatic search of the Schedule configuration and an AOT offline saving mechanism, to accelerate subgraph kernel performance. (#64271, #64588, #64694, #64620, #64702, #63086)

  4. Support the OptimizeReductionTactic optimization strategy, to improve kernel performance in Reduce scenarios. (#6066, #61363, #60881, #63859)

  5. Enhance the DCE Pass to remove redundant If/For branch code and improve execution efficiency. (#61682)

  6. Add support for FuseParallelMatmulPass, which fuses multiple Matmul operators to achieve acceleration. (#63623)

Bug Fixing

  1. Fix bugs when lowering some special operators to the compiler, improving the end-to-end user experience. (#60800, #64720, #62593, #62661, #64626, #63320, #64581, #61608, #64135, #64659, #62391, #62490, #63891, #64529)

  2. Fix bugs in the symbolic derivation logic of some operators. (#62141, #62376, #62941, #63322, #64672, #64407, #60241, #60440, #62503, #62997, #63169, #61098, #63973, #62248, #62321, #63755, #63917, #63903, #64173, #64525, #64615, #62247, #62455, #62898, #62867, #63608, #63789, #64085, #64136, #64181)

  3. Fix compiler execution errors under dynamic and static shapes, improving the robustness of the framework mechanism. (#60813, #61877, #61909, #62954, #63614, #60339, #60623, #60658, #60669, #58823, #62483, #62742, #61797, #63411, #64077, #62736, #62390, #63689)

Deprecated Features

  1. Remove unused symbol-related components such as adt DimExpr, SymbolicDimExpr and ShapedTypeInterface. (#60901, #60933, #60744, #64176, #64140)

  2. Remove the old Group Cluster and the front-end representation under the old IR, to improve the simplicity of the architecture. (#63683, #64630, #61380)

Auto-Parallel Architecture

To further enhance the usability of the auto parallel architecture in large model training scenarios, PaddlePaddle has improved auto parallel functionality for both dynamic and static graphs, including newly added parallel strategies such as sharding parallelism and interleaved pipeline parallelism, as well as support for lazy initialization of parameters. SPMD derivation rules have been added or enhanced for a number of operators. The auto parallel architecture has been comprehensively verified on a number of mainstream large language models. Meanwhile, to build the new 3.0 architecture, the static graph auto parallel architecture has been comprehensively upgraded on top of PIR, PaddlePaddle's new-generation intermediate representation: it introduces DistDialect for the distributed components, natively supports DistAttr and DistTensor in the computation graph representation, and smooths the transition from dynamic to static graphs, further unifying auto parallel usage across dynamic and static graph modes. Finally, a number of performance optimizations have been added and improved, including a zero-bubble pipeline scheduling strategy, achieving end-to-end training performance on par with or better than manual parallelism on typical large models such as Llama-2 13B/70B.

Function Improvements

  • Add the dtensor_from_local interface for creating a DistTensor from a local (already sharded) tensor; correspondingly, shard_tensor creates a DistTensor from a global tensor before sharding. #60206

  • Add the unshard_tensor interface to convert a DistTensor back into a global tensor; it is the inverse operation of shard_tensor. #60272

  • To reduce the GPU memory usage during training, add Sharding parallelism, and support stage1, stage2 and stage3 modes. #61926, #62711, #62486, #62230

  • To solve the problem of insufficient GPU memory when parameters are initialized first and then sharded, add the LazyInit function, which supports sharding parameters first and then initializing them. #60316, #60441, #60563, #61792

  • To reduce pipeline-parallel bubbles, add interleaved pipeline parallelism, and support automatically converting a user's pipeline-parallel network into interleaved pipeline parallelism through configuration, so that users do not need to perform complicated marking in the network. #59751, #60050, #60467, #60868, #60187, #62884, #60560, #61541

  • Add SPMD derivation rules for stack, gather, scatter_grad, cumsum, unbind, swiglu, and fused_linear_param_grad. Improve and optimize the sharding derivation rules of fused_rope, reshape, flatten, fused_rms_norm, slice, tile, flash_attn, cross_entropy and other operators, to solve incompatibility problems in some model networking scenarios. #62720, #64202, #63361, #63290, #61460, #59986, #61184, #60144, #62525, #62053, #60709, #60111, #63681, #62180, #60794, #60632, #62439

  • Improve the distributed checkpoint storage and loading function, support master_weights strategy, and fix the random hanging problem. #60027, #59872

  • In order to support the auto parallel of arbitrary shape tensor, add the non-uniform tensor sharding feature. #62611, #61432

  • To support users' customized operators in auto parallel networking, support registering SPMD derivation rules for such operators outside the framework. #60509

  • Improve the slice SPMD rule, and support the transition from any state to replicate and from replicate state to any state. #60281, #59869

  • Add MoE expert parallelism (experimental). Currently, only dynamic graph auto parallel is supported. #63904

  • Fix several adaptation problems between auto parallel and dynamic graph execution, and with dynamic-to-static conversion. #60214, #60546, #62082, #61313, #61840, #60614, #60234, #64813, #61606, #63405, #64334, #60504

Performance Optimization

  • To reduce pipeline-parallel bubbles, support splitting the backward pass into separate parameter-gradient and activation-gradient computations, and add a zero bubble pipeline scheduling strategy to improve training performance. #62865, #62737, #64534

  • To improve the performance of sequence parallelism, fuse related communication and computation operations, and optimize redundant transpose operations. #64807, #63948, #64316, #64119

  • Optimize the time consumption of auto parallel graph optimization for static graphs, to reduce the delay from the start of training to the completion of the first step. #59912, #61817, #60022, #60125

  • Optimize the time consumption of related communication operations in hybrid parallel scenarios. #62157, #61622

  • Optimize the redundant GPU memory consumption of parameters under auto parallel dynamic-to-static. #62746

  • Improve the mixed precision training function of auto parallel: support setting local auto_cast and black/white lists, support the master grad function, and adapt to different parallel strategies (a single-card sketch follows this list). #60158, #59987, #62629, #60385, #62015, #60514, #61221, #60779, #63228

  • Optimize non-essential casts caused by type promotion and amp to improve performance. #63293, #63228
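
As referenced above, here is a single-card sketch of the auto_cast / black-and-white list usage that the auto parallel mixed precision support builds on (the operator names in the lists and the loss-scaling setup are illustrative):

```python
import paddle

model = paddle.nn.Linear(1024, 1024)
opt = paddle.optimizer.AdamW(parameters=model.parameters())
scaler = paddle.amp.GradScaler()

x = paddle.randn([8, 1024])
# Ops on the black list stay in float32; ops on the white list run in float16.
with paddle.amp.auto_cast(custom_white_list={"matmul_v2"},
                          custom_black_list={"reduce_sum"},
                          level="O1"):
    loss = model(x).mean()

scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```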

Upgrade Static Graph Auto Parallel Architecture

  • Based on the new-generation intermediate representation (PIR), add the new DistDialect, natively supporting DistAttr and DistTensor in the computation graph representation and enabling direct binding of distributed attributes to tensors and operators, which makes the auto parallel architecture simpler and more unified. #63828, #64299, #63870, #64144, #62524, #62630, #62897, #60478, #60574, #63876, #63798, #62560, #63676

  • Improve APIs such as shard_tensor, reshard, and to_static, so that users can convert a dynamic graph model network directly into a PIR static computation graph for better performance. #62945, #62356, #60175, #62654, #63347

  • Optimize the auto-parallel graph optimization compilation process, and reduce the compilation and optimization time of static graphs by refactoring and optimizing the procedure of computation graph parallelization and communication resolution. #64137, #62201, #64143, #62560

  • Optimize the SPMD derivation procedure in static graphs to produce results consistent with dynamic graphs, improving the unity and stability of the architecture. #62659, #62547, #63117, #63434, #63770, #64361, #63073

  • Upgrade the implementation of Reshard conversion in static graphs, and use consistent conversion rules under dynamic-static graphs to ensure the consistency of the execution logic and results of tensor reshard conversion in dynamic-static graphs, so as to improve user experience. #62718, #62694, #60215, #63362, #63072, #63962, #64223, #61796, #64465, #64623, #64418

Automatic Search and Tuning of Training Strategies

To improve the ease of use of the automatic training strategy search and tuning tool (AutoTuner), support user-defined search items, priorities for search items, and user-configured invalid strategy combinations; comprehensively enhance the error messages at runtime and in post-run logs; and support AutoTuner on NPU devices. #60101, #60294, #61898, #60248, #60417, #60954, #61499, #62724, #60954, #63693, #62853, #62984

CUDA Training Performance Optimization

This upgrade improves large model training efficiency from multiple perspectives, such as operator computation efficiency, distributed communication optimization, and GPU memory optimization.

Function Improvements

  • Enhance the FlashAttention operator, including support for NVIDIA SM90 GPU compilation, Group Query Attention, cuDNN access, QKV-packed inputs, and so on. #59820, #60776, #58680, #63289

  • In the Repeat_interleave operator, add support for BFloat16 data type. #61854

  • To address the large number of interface parameters of ResNet-style fusion operators such as fused_scale_bias_add_relu, fused_scale_bias_relu_conv_bn, and fused_dconv_drelu_dbn, and to improve operator usability, add the fuse_resunit pass, which automatically fuses the above operators and achieves generic performance optimization. (#59771)

Performance Improvement

  • To address the large GPU memory consumption of the SwiGLU activation module in Llama models, add a fused SwiGLU operator that avoids materializing intermediate variables, reducing memory overhead and recomputation during large model training and thereby improving performance (a reference sketch follows this list). The performance of the Llama-70B model improves by 9%. #61508

  • To address the high proportion of communication time in sequence parallelism, overlap the backward communication of sequence parallelism with Matmul computation, saving end-to-end time and improving end-to-end performance in large model training scenarios by 1%~2%. #62284, #63531

  • To address slow training caused by the need to divide by nranks after sharding backward communication, support fusing the backward communication with the division by nranks, and support the ReduceScatter Average mode, to improve large model training performance. #62623

  • To address training speed jitter caused by broadcasting input data in tensor model parallelism, fix unnecessary synchronization between CPU and GPU during data broadcasting, to keep the training speed stable. #60816

  • To address slow training caused by long P2P communication time in pipeline parallelism, overlap P2P communication with forward-backward computation. The end-to-end training performance of large models improves by 2%~3%. #61935, #62051

  • To address the low efficiency of bias gradient computation in the fused_linear_param_grad_add operator, optimize its computation efficiency, improving end-to-end large model training performance by 0.2%. #63114

  • To address the long parameter broadcasting process after sharding backward computation, overlap the parameter broadcast with the next step's computation. The end-to-end training performance of large models improves by more than 2%. #63945

  • To address excessive GPU memory occupied by gradients during pipeline parallel training, which slows training by introducing extra recomputation, implement a dynamic gradient release technique, improving end-to-end training performance of large models by 3.4%. #59739
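
As referenced in the SwiGLU item above, here is a reference sketch of the computation that the fused operator replaces. This assumes the common definition silu(x) * gate; the exposed location and signature of the fused kernel are assumptions, not confirmed API:

```python
import paddle
import paddle.nn.functional as F

def swiglu_reference(x, gate):
    # Unfused reference: materializes the silu(x) intermediate activation.
    return F.silu(x) * gate

x = paddle.randn([4, 4096])
gate = paddle.randn([4, 4096])
ref = swiglu_reference(x, gate)

# Fused path (assumed to be exposed under paddle.incubate.nn.functional):
# fused = paddle.incubate.nn.functional.swiglu(x, gate)
```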

Bug Fixing

  • Fix the StreamSafeCUDAAllocator CUDA Event resource leak, which slowed down large model training. #64621

  • Fix the backward computation error of the fused_rotary_position_embedding operator. #60217

  • Fix the bug that customized operators cannot control their computation precision via black and white lists in AMP scenarios. #60052

  • Fix the bug that operators such as add_ and divide_, which natively support operands of different data types, performed unexpected type promotion when type promotion was triggered. #64302

Distributed Strategy Enhancements

Focus on strengthening the functional experience of PaddlePaddle dynamic graph distributed computing: make various functional improvements to parallel strategies such as AutoTuner, pipeline parallelism, and sharding to enhance the flexibility of large model training, and add features such as Flash Attention Mask, which significantly reduce the GPU memory usage of large model training, especially long-sequence training, and improve training performance, providing stronger support for large model training. In addition, several bugs and potential security risks have been fixed, significantly improving overall system stability.

Function Optimization

  • Optimize the search space of Autotuner, which significantly improves the performance of search. #62608

  • For pipeline parallelism, where training could go wrong because of send-type checks during the eval process, add a training configuration to skip the redundant receive-side check of pipeline sends, providing higher flexibility and better performance. #63001

  • In dynamic graph pipeline parallelism, add checks on the size and type of sent and received data, together with error messages, improving robustness and debuggability. #59405

  • Support setting multiple loss functions and returning multiple losses in dynamic graph pipeline parallelism, improving its flexibility. #63167

  • In dynamic graph pipeline parallelism, add a configuration option to clear the send/receive cache of the pipeline in time, to better support dynamic batch size training. #62277

  • For the problem that the sharding stage3 strategy could not achieve bit-wise alignment, replace the unordered set with an OrderedSet to avoid errors caused by accumulation order; after the fix, results align bit by bit. #60085

  • To further reduce GPU memory usage in sequence parallelism, add a method of recomputing allgather, reducing the memory occupied by allgather activations. #64244

New Features for Dynamic Graphs

  • For the AutoTuner search space, add a new search dimension for refined recompute, making search results more accurate and lowering the barrier to model tuning. #62430

  • For the restriction on the training batch size in virtual pipeline parallelism, modify the pipeline scheduling method so that the batch size can be set more flexibly. #61561, #60314

  • To solve the problem that, when using flash attention with a mask, the mask's memory footprint is quadratic in the sequence length and performance is low, use a sparse mask to reduce the mask's memory complexity from quadratic to linear in the sequence length, cutting the number of memory accesses; meanwhile, use shared memory to accelerate memory access, greatly improving performance. #62029

  • In the dynamic graph sharding parallel strategy, add the communication-computation overlap function, to improve the performance of the training process. #60455

Communication Library Function Optimization

  • Enhance the functionality of the NCCL communication library to support the initialization of customized NCCL libraries by passing additional initialization parameters during initialization. #62193

  • Add the NCCL library path search function to support more flexible NCCL library search methods. #62492

Bug Fixing

  • Fix the dbias_out buffer allocation problem in the fused_linear_param_grad_add_kernel operator, and add gradient address checking logic to make error messages easier to debug. #363433, #64460

  • Fix the problem that, with the reduce_avg operation supported, the sharding strategy does not scale the gradient when comm_overlap is turned off. #62702

  • Fix the bug related to fusion in the calculation order of main grad in Stage2. #59142

  • Fix the bug that the switch attribute cannot be found when reduce_avg communication operation is turned on under the sharding strategy. #62502

  • Fix the handling of parameters with stop_gradient=True when sharding stage1 training supports non-trainable parameters. #62616

  • Fix the bug of message printing when TCP is turned off, to prevent misleading users. #62631

  • Fix a DataParallel training problem: a multi-card training error with a segmentation fault occurring when some gradients are not initialized. #62299

  • For the scenario of turning on sequence parallel, fix the bug caused by weight freezing in some models. #63596

  • Fix some bugs for autotuner scenarios with single dp. #60757

  • Fix aadiff bug of streaming parallel strategy. (#64716)

  • Remove some distributed unit tests. (#62762)

Security Risk Fixing

  • Fix a security vulnerability (information leakage risk) in the prune_by_memory_estimation operator. #61320

Parameter Server

This update mainly fixes several bugs in the process of using the parameter server as well as compilation and installation issues.

Bug Fixing

  • Fix an out-of-bounds read/write in the unique operator caused by setting a wrong length during its computation, ensuring the correctness of the unique operator. #60840

  • Fix the missing save/load functionality and compilation errors in the PGLBox training process, ensuring the correctness of PGLBox save/load and compilation. #63905

  • Fix the value of use_ps_gpu in CPUPS, which caused the CPUPS training process to trigger GPUPS logic and crash, ensuring the correctness of CPUPS training. #61406

  • For the cudaErrorInvalidResourceHandle error in GPUPS training under CUDA 12.3, add a device id switching mechanism to ensure that resource operations are carried out on the correct device. #63391

  • Fix garbled output in the PGLBox Embedding Dump process caused by improper use of C++ std::string, ensuring the correctness of Embedding Dump results. #65179

Documentation Improvement

  • Add security warnings to the RPC interface documentation, reminding users to use this interface only under secure network conditions. #64100

Security Enhancement

  • Fix several code security issues to prevent malicious code injection. #60023, #60544, #60615

Inference Deployment

The inference framework upgrades its PASSes based on PIR for GPU, XPU, and CPU hardware, significantly reducing the number of lines of code compared with the previous version and improving development efficiency. The underlying executor is upgraded to a new asynchronous executor, improving inference performance on most models. Adaptive integration with the CINN compiler for inference acceleration is complete. Switches are added for these features, so users can turn them on through settings. In addition, Paddle Inference natively supports directly loading optimized serialized models under mixed inference with TensorRT subgraphs, reducing startup time. For Paddle-TensorRT, interfaces are added to flexibly control node computation precision and whether a subgraph enters TensorRT, which is convenient for debugging. For performance optimization, GPU, XPU, and CPU gain more Transformer and LLM acceleration fusion operators, such as the group attention mechanism fusion operator, the GQA structure, and WINT4, with support for automatic matching by PASS.

New Features

  • Paddle-TensorRT

    • The underlying API called by Paddle-TensorRT is upgraded: when the TensorRT version is later than 8.5, the EnqueueV2 API (which will be deprecated in the future) is replaced by the EnqueueV3 API. #60807

    • Add the config.exp_disable_tensorrt_subgraph() to set some subgraphs not to enter TensorRT. #61967

    • Add the config.exp_disable_tensorrt_dynamic_shape_ops() to set dynamic shape input operators not to enter TensorRT. The default value is False. #62352

    • Add the config.exp_specify_tensorrt_subgraph_precision() to set nodes to run different precision types. #62402

  • In Inference, add a switch to turn on the CINN compiler: when configuring the inference config, enable CINN through config.enable_cinn() (see the combined sketch after this list). #61949

  • Upgrade of the PIR usage mechanism in Inference

    • In the config, add the enable_new_ir() interface to enable PIR. #61968

    • In the config, add the set_optimization_level() interface to set different optimization levels. #61968

    • Under the PIR mechanism, the PASS function supports custom C++ PASSes. #62468

    • The inference library exposes PIR-related implementation header files, supporting users' secondary development based on PIR, such as custom Pass development. #61863, #62293

    • The PIR mechanism supports registering hooks on the Predictor for operator inputs and outputs. #63101

  • The multi-layer Transformer fusion operator fused_multi_transformer_op supports GQA calculation. #64125
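
A minimal sketch combining several of the switches listed above (the model file names are placeholders, and the exact signatures of the experimental interfaces may differ):

```python
import paddle.inference as paddle_infer

config = paddle_infer.Config("inference.pdmodel", "inference.pdiparams")
config.enable_use_gpu(256, 0)   # 256 MB initial GPU memory pool on device 0
config.enable_new_ir()          # run the PIR-based pass and executor stack
config.enable_cinn()            # hand supported subgraphs to the CINN compiler

predictor = paddle_infer.create_predictor(config)
```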

Function Improvements

  • Inference supports loading optimized models directly, making it possible to skip IR optimization altogether; deploying this way minimizes framework overhead. #61598

  • Support re-specifying the shape range information file when loading a model already optimized by the IR PASS for inference. #60457

  • Collect the Shape information within the subgraph of the control flow operator, supporting the use of Paddle-TensorRT inference acceleration. #60451 ,#59588

  • The mixed-precision PASS (auto_mixed_precision_pass) for GPU-native inference supports the handling of sparse Tensor. #62656

  • XPU hardware related function

    • XPU’s fused PASS for Conv and FC supports conversion from Float to INT31 type. #59981

    • XPU’s strided slice operator supports the setting of strides non-negative. #62268

    • XPU’s multi-layer Encoder fusion PASS is adaptive to sequence length and supports variable length. #63825

  • Paddle TensorRT INT8 computation mode supports tile operator into TensorRT computation, to improve INT8 performance of some models. #60189

Model Compression

Fix bugs and optimize functions mainly for Post Training Quantization (PTQ) and Quantization Aware Training (QAT).

  • Support the simulation quantization grouped by channel. #61828

  • Support automatic saving of quantization scale to model parameter file under dynamic graphs. #59441

  • Remove the restriction that the dataloader must be a DataLoader instance. #61798

Performance Optimization

  • Upgrade the inference executor, to reduce GPU memory usage at runtime while keeping performance unchanged. This can be enabled through config.enable_use_executor(True). #57920, #58452, #63350, #64466

  • Upgrade oneDNN version of paddle inference to v3.4. Its overall performance has been improved compared with v3.3. #64661

  • Upgrade the CUTLASS-based support for matrix multiplication and activation fusion calculation. (#61925)

Add generic PASS in PIR mechanism

  • Add identity_op_clean_pass and matmul_scale_fuse_pass. #59840

  • Add fused_flash_attn_pass. This pass calls flash_attention to replace the original attention computation. #64213, #64707, #63304

  • In the inference PIR new architecture, upgrade layout adjustment algorithm, support the NHWC inference of conv class and norm class. The performance tested on SD models is significantly improved. #63628,#64634,#64658,#64708,#64830,#64896

  • Add remove_redundant_transpose PASS. #63357

  • Enable CSE PASS in inference to improve inference performance. #64523

GPU Performance Optimizations

Include new fusion operators and new PASS under PIR mechanism.

  • Optimize the performance of sparse convolution operator (sparse conv) to improve the inference performance of BEV and other models. #63067

  • Add the fusion PASS based on flash attention. #63220

  • The inference supports elementwise_add+group_norm+silu activated operator fusion pattern and its corresponding fusion kernel. #64199

  • Matrix multiplication supports groupwise weight-only INT4 computation. (#60422, #63212, #60204)

  • The group attention mechanism fusion operator block_multi_head_attention supports KV Cache quantization. (#59951)

  • Inference upgrades the conv fusion operator to a CUTLASS-based implementation with automatic PASS fusion, supporting bias and activation. Compared with the original cuDNN implementation, the new operator delivers significant acceleration. It is enabled through config.exp_enable_use_cutlass(True). #64201, #64641

  • Add the blha_get_max_len operator and remove every call to get_max_len in block_multihead_attention. This is used to accelerate dynamic inference of large models. #64246

  • Data layout optimization: the PASS prohibits NHWC computation for the conv fusion operator in FP32 precision, because cuDNN degrades performance under this condition. #63400

  • GPU peak memory optimization: upgrade the underlying TryShrinkMemory interface to support GPU places and release idle memory in the memory pool; in certain scenarios, peak GPU memory usage can be significantly reduced. #61319

CPU Performance Optimization

Includes new fusion operators, new PASSes under the PIR mechanism, and optimization of some kernels.

  • Add scale_matmul_fuse_pass. #63313

  • Add CPU implementation in fused_bias_residual_layernorm and fused_rms_norm to improve inference speed. #63196#63165

  • Add the cache optimization for Deconvolution kernel, to greatly improve the execution speed of this operator. #60922

  • In PIR, add depthwise_conv fusion PASS, to convert the depthwise_conv operator to conv2d, thus using the onednn conv2d kernel optimization to improve the inference speed of this operator. #63051

  • In PIR, add Conv and Activation Fusion PASS (conv_activation_mkldnn_fuse_pass), to support the fusion of conv and 13 kinds of activation functions, thus greatly improving the inference speed of conv-related operators. #63145

  • In PIR, add the fusion PASS (operator_unsqueeze_onednn_fuse_pass) between multiple operators and unsqueeze, to improve inference speed. #63592

  • In PIR, add PASS (operator_reshape_onednn_fuse_pass) to fuse reshape into multiple operators. #63812

  • In PIR, add scale fusion PASS (operator_scale_onednn_fuse_pass). #63811

  • In PIR, add PASS (conv2d_transpose_bias operator) that fuses conv and bias. #62241

  • In PIR, add onednn_placement_pass, which supports 151 operators to convert from Phi operators to oneDNN operators, so that the oneDNN high-performance library can be used for optimization, to improve the inference speed. #63982

  • In PIR, add the fusion between Elementwise type operators and 13 activation functions, to greatly improve the inference speed of enabling Onednn on the CPU. #63516

  • In PIR, add the fusion of multiple conv + concat + activation functions and fused_conv + concat + activation functions, to greatly improve the inference speed when there are concat and activation functions in conv. #62993#62713

  • In PIR, add matmul+add operator fusion PASS (matmul_elementwise_add_fuse_pass). #62715

  • In PIR, add the scale parameter to fold PASS (scale_matmul_fuse_pass). #63313

  • In PIR, add the fusion PASS (softplus_activation_fuse_pass) between softplus and 12 activation functions. #63617

  • In PIR, add fc operator conversion PASS (fc_onednn_enable_pass). #63518

  • In PIR, add self-attention operator fusion PASS (self_attention_fuse_pass). #63726

  • In PIR, add fusion PASS (fc_activation_fuse_pass) between fc and 12 activation functions. #63853

  • In PIR, add a BatchNorm folding PASS (conv2d_bn_onednn_fuse_pass) to increase the fusion opportunities of subsequent PASSes. #64524

  • In PIR, add the fusion PASS (matmul_activation_fuse_pass) between matmul and 12 activation functions. #62901

  • In PIR, add reshape + transpose + reshape fusion PASS (shuffle_channel_detect_pass), which is fused into a shuffle_channel operator under specific conditions. #64053

  • In PIR, add reshape + transpose + matmul fusion PASS (reshape_transpose_matmul_fuse_pass). #62998

  • In PIR, add matmul + transpose + reshape fusion PASS (matmul_transpose_reshape_fuse_pass) to significantly improve performance in some scenarios. #63151

  • XPU hardware new fusion PASS optimization:

    • Add qk_qkv_attention_xpu_fuse_pass and qkv_attention_xpu_kernel in XPU hardware. #60089

    • Add rotary position encoded fusion operator, to support elementwise_mul + strided_slice + sin/cos+ stack fusion to 1 operator in XPU hardware. #60025

    • Add group_norm_silu_xpu_fuse_pass. #62689

    • Add weight_only_linear_xpu_pass. #64185

    • Add block_multihead_attention operator and PASS, to support large model inference for LLaMA2 models in XPU devices. #65036

    • Support float16 type for squeeze_excitation_block_xpu_kernel. #61023

Bug Fixing

  • Fix mixed-precision conversions in models such as faster_rcnn_swin_tiny_fpn_1x_coco, and solve the mixed_precision_pass error. #64673

  • Prevent the fused_conv2d_add_act pass from taking effect when the activation function is sigmoid (fusing conv2d and sigmoid causes performance degradation with cuDNN versions 8.0 to 8.7). #64717

  • Fix compilation issues with self_dp_attention and fused_layer_norm_avx_kernel in Clang12. #63414

  • Fix the issue that scale and zeroPoints in the qdq operator of some models are deleted prematurely in the IR/Pass stage. #62225

  • Fix the issue that causes an error to be reported when both Config.UseOptimizedModel() and config.EnableMemoryOptim() are turned on. #62501

  • Add constraint on matmul_scale_fuse_pass, where input w must be a weight or the pass will not be matched. #62850

  • Keep inference model output key ordering guaranteed to be the same as when dynamic graph models are exported. #63791

  • Fix the error in the constant folding PASS when the folded op and its inputs and outputs do not belong to the same subgraph. #62148

  • Fix several runtime problems in Paddle-TensorRT mode, including the failure to generate the quantization calibration table caused by the yolo_box operator in int8 mode, and errors caused by incorrect handling of the dim attribute data type in the reduce operator. #61596

  • Fix several runtime errors in mixed-precision inference mode, including errors caused by sharing weights among fused conv2d operators without correctly converting the weight layout, the fused conv2d operator back-end not being properly selected as cuDNN, the fused conv2d operator incorrectly handling the bias dimension under NHWC, and incorrect handling of the input data type of norm-class operators. #60955, #60076, #63007, #63988

  • Fix the problem that config.delete_pass function does not take effect. #61056

  • Fix the GC mechanism of While control flow in PIR to recycle unwanted inputs in advance and reduce the peak memory, for example, 2GB memory reduction in LLaMA 7B model. #63062

  • Fix the OneDNN mean kernel rollback error. #64676

  • Add strong constraints to conv_bias_fuse_pass, e.g., the shape of the bias cannot be 1, to keep the pass's inference results stable. #64412

  • Add strong constraints to conv_elementwise_add_onednn_fuse_pass, e.g., conv2d_out and residual_param must have the same size, to keep the pass's inference results stable. #64448

  • Fix the problem of repeatedly inserting quantize/dequantize operators under certain circumstances. #63082

Hardware Adaptation

Adaptation Scheme (Custom Device)

For PaddlePaddle hardware access, this release adds daily-release support for four hardware back-ends: Kunlun XPU, Ascend NPU, Hygon DCU and Cambricon MLU. Meanwhile, problems in distributed communication have been fixed through large model training and inference deployment, and performance has been optimized through GPU memory optimization and the overlap of computation and communication. Furthermore, each hardware back-end adds support for a large number of BFloat16 operators, as well as many operator fusion passes and fusion operators. Through hardware-software co-design, hardware vendors' large Transformer operator libraries are integrated to fully improve the performance of large models.

New Features

  • Add the support for distributed policy sharding stage1 v2. #61500

  • Support the BF16 data type in the distributed communication module. Add BF16 support for some operators such as empty, shape, etc. #60768, #62140, #62604

  • Add the support for get_comm_name interface, support for memory stat function, and support for Profiler to record memory time. #62556,#61030,#62292

  • Add support for some fusion strategies and operators, including silu_fuse_pass, conv_elementwise_add_act_fuse_pass, and generator offset. #60595,#60708,#60616

Performance Optimization

  • The distributed communication strategy Sharding uses an asynchronous strategy when broadcasting parameters, to improve the overlap between computation and communication. #59745

  • Add the support for STRIDED Layout operator to improve the performance of the operator. #62532,#62697,#62649

  • Optimize the memory usage of the elementwise_mul operator. #62377

Bug Fixing

  • Fix bugs under the distributed strategy Sharding. #61942, #62236, #62305, #62535, #62572, #61601

  • Fix the problem that the c_embedding operator cannot be registered because it is not under the PHI namespace. #60774

  • Fix the xccl_comm release issue. #60465

  • Fix a data address error caused by the index_put operator falling back to CPU. #61842

  • Fix stream_safe_custom_device_allocator issue. #63369

  • Fix the distributed worker port conflict issue. #61409

  • Fix comm data type to improve device compatibility. #62306

  • Unify the use of comm data type to phi::DataType. #62464,#62562

  • Fix the problem of missing precision parameter in PD_ConfigEnableCustomDevice. #63702

Kunlun XPU

New Features

  • Add the support for BF16 data types for some operators, including compare_kernel and add reduce_all_kernel (#63602), empty(#60212), hybrid_parallel_optimizer(#60213), reduce_max/reduce_min(#60453), all_reduce/concat/split(#62364), tile/tile_grad(#63075), accuracy(#63863), swiglu/set_value(#64070), amp_master_grad(#63865), c_concat (#63403), flatten (#63997), compare_op (#64473), moment1/moment2 (#62688), fused_rope (#60064), c_softmax_with_cross_entropy (#60472), elementwise_pow/square/sin/cos (#60402), strided_slice (#60382), tile/sigmoid_grad (#60119), elementwise_sub/elementwise_div (#60386), softmax_with_cross_entropy (#63759)

  • Add the support for INT8 data types for some operators, including multi_encoder_xpu (#61212), qkv_attention (#63105)

  • Update Kunlun SDK versions including BKCL, XHPC, XCCL, etc. #59895, #59888, #63624, #60305, #62076, #62646, #63520, #64163, #64326, #60617, #60377, #60421, #60598, #61199

  • Add the support for memory stat function. #61116

  • Add multi-stream support, to assign default l3/gm buffer size to each stream. #62729

  • Add nonzero operator, to support simulator XPUSIM_SKIP_RUN mode. #60224#60388

  • Add stride_slice and stride_slice_grad operators, to support strides < 0. #62749

  • Add rotary_embedding, to support use_neox_rotary_style == True. #64090

  • Add fusion Pass and fusion operators including cross_attention (#63203), fused_bias_act (#62232), fused_layernorm (#62228), group_norm_silu_xpu_fuse_pass (#63342)

  • Add the support for distributed policy sharding stage3. #57457

  • Add the support for tf32 fc quantization mode. #62273

  • Add the flash attention operator. #60065

  • Add the roformer relative embedding pass & kernel and support multi_encoder_xpu. #62089

  • Add the support for pp + sharding strategy. #63640

  • Upgrade the XPU communication library architecture to support dynamic-static unified communication library function. #63817

Performance Optimization

  • Add XHPC buffer manager to improve the performance of Paddle and XHPC memory collaboration. #63924

  • Enhance TensorSetConstantXPU performance and support BF16 data type. #63920,#61818

  • Fuse multiple group_norm + silu + conv modules and reduce GPU memory usage. #62892

  • Optimize XPU memory allocation in comm manager. #64139

  • Optimize operator performance, including mean_all_grad (#61148), dropout_v2 (#61029), fused_rotary_position_embedding (#62846), cross_entropy (#63159), elementwise_add (#64289), fused_gemm_epilogue (#61350), check_nan_or_inf (#60853)

Bug Fixing

  • Fix the tile operator support for 0-dimensional Tensor. #64279

  • Fix the group_norm_silu_fuse_pass. #63449

  • Fix the XPU API GM memory issue. #60260,#60387,#62940

  • Fix the distributed strategy Sharding stage1 v2 bug. #64209

  • Fix the XPU constant issue. #60763

  • Fix some operator issues, including AdamW (#62251), dropout_v3 (#62726), softmax(#63780) , fused rope embedding (#62143), elementwise_add (#60252), resnet_basic_block (#62914)

  • Fix XPU runtime and installation related issues. #60028,#61970

  • Fix XPU compilation bugs. #63307

  • Fix end-side memory related bugs when initializing XPU communication library. #64396

Hygon DCU

New Features

  • Add the support for Hygon DCU K100. #63535

  • Support the complex64/128 data type and fusion operators such as fused_bias_residual_layernorm, fused_bias_dropout_residual_layer_norm, and rms_norm. #63217

Bug Fixing

Environment Updates

In this PaddlePaddle version, we completed release and update synchronization of the basic dependency libraries, and removed old dependency libraries that are no longer updated. We completed a number of optimizations to improve compilation efficiency and compatibility, and improved the CI pipeline monitoring functions to enhance the user installation experience. We fixed several known compilation problems, improved the Paddle build system, and added some new features. Through these optimizations, the compilation and installation experience of the PaddlePaddle framework is further improved, bringing developers a better usage and development experience.

New Support

Compilation Optimizations

CI Pipeline Improvements

  • Improve the merged-code monitoring mechanism in the CI pipeline, to ensure higher code quality and stability. Add a function monitoring module that monitors various indicators of the CI pipeline in real time, ensuring smooth execution of each stage and timely identification and resolution of issues. #61384, #62190, #60758, #60399, #58623, #62177, #62361, #62893, #63705, #64476, #64752, #64733, #61914

Others

Non-user related changes, including deprecated code cleanup, useless unit test cleanup, debugging or upgrade of monitoring mechanism. #63377,#64106,#64220,#64293,#64464,#64944,#63638,#63732,#63735,#63826,#63982,#63737,#64471,#64574,#64494,#62775,#63601,#62564,#63772,#64719,#61640,#63459,#64062,#63480,#63833#63673,#63672,#64131,#64156,#64155,#64159,#63902,#64230,#64229,#64236,#64260,#64175,#64250,#64269,#64238,#64349,#64394,#64402,#64401,#64388,#64329,#64502,#64501,#64515,#64503,#64514,#64601,#64564,#64012,#64697,#64682,#64051,#63267,#63426,#63626,#63257,#63266,#63468,#63262,#63248,#63241,#63252,#63258,#63235,#63399,#63488,#63487,#63466,#63464,#63483,#63486,#63475,#63489,#63470,#63457,#63493,#63561,#63584,#63587,#63586,#63569,#63559,#63558,#63555,#63543,#63589,#63583,#63565,#63564,#63265,#63562,#63591,#63460,#63238,#63631,#63707,#63714,#63854,#63929,#63532,#59628,#62209,#63742,#60518,#62078,#62684,#62723,#64141,#60404,#64212,#60652,#64545,#64477,#64556,#63160,#63796,#64693,#64484,#64677,#64461,#63189,#63855,#63896,#63193,#63200,#63406,#61283,#63607,#64486,#64004,#63132,#63553,#63572,#63794,#63919,#63980,#62917,#64451,#63541,#63703,#64536,#63264,#63335,#63841,#64628,#63419,#62210,#63557,#63064,#61442,#63537,#63839,#60927,#60566,#60842,#64612,#60047,#63898,#60415,#60474,#60439,#60565,#64414,#62526,#54183,#64096,#61325,#60629,#61051,#62103,#63594,#60968,#64613,#64073,#63816,#64416,#62499,#64531,#63827,#59885,#59949,#63428,#63218,#63538,#64497,#63082,#64395,#60183,#63691,#64428,#64648,#64650,#59926,#59750,#60080,#60208,#64124,#64187,#64166,#64284,#64253,#64555,#59878,#64081