##################################
Memory Allocation and Optimization
##################################

1. Memory Allocation Strategy
=============================

1.1. AutoGrowth Strategy
------------------------

Since version 1.6+, PaddlePaddle supports the AutoGrowth strategy, which allocates memory on demand. AutoGrowth strategy has been enabled by default in version 1.7+, making it convenient for users to run multiple tasks on the same GPU card at the same time.

Because the native CUDA system calls cudaMalloc and cudaFree are synchronous operations and very time-consuming, the AutoGrowth strategy caches allocated memory for subsequent allocations. It works as follows:

  • In the first few memory allocations, the PaddlePaddle framework calls cudaMalloc and allocates memory on demand. When the allocated memory is released, the framework does not call cudaFree to return it to the GPU, but caches it inside the framework.

  • In subsequent allocations, the framework first checks whether there is a fit block (a cached block no smaller than the required size) in the cached memory. If there is, it splits the required memory from the fit block and returns it. Otherwise, it calls cudaMalloc to allocate memory from the GPU. The allocated memory is likewise cached when released, for subsequent allocation.

Therefore, the AutoGrowth strategy may slow down the first few batches of model training, but does not affect the speed of the subsequent training process.
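The caching behavior described above can be sketched in a few lines of Python (a simplified illustration of the idea, not PaddlePaddle's actual allocator; cuda_malloc stands in for the real cudaMalloc call):

```python
# Simplified sketch of the AutoGrowth caching idea (illustration only,
# not PaddlePaddle's real allocator).

class AutoGrowthAllocator:
    def __init__(self):
        self.cached_blocks = []  # sizes of freed blocks kept inside the framework

    def cuda_malloc(self, size):
        # Stand-in for the (slow, synchronous) cudaMalloc system call.
        return size

    def alloc(self, size):
        # Look for a cached block large enough to hold the request.
        for i, block in enumerate(self.cached_blocks):
            if block >= size:
                remainder = self.cached_blocks.pop(i) - size
                if remainder > 0:
                    self.cached_blocks.append(remainder)  # keep the split-off part
                return size
        # No fit block cached: fall back to cudaMalloc.
        return self.cuda_malloc(size)

    def free(self, size):
        # Do not call cudaFree; cache the block for later allocations.
        self.cached_blocks.append(size)

allocator = AutoGrowthAllocator()
allocator.alloc(100)   # first allocation: goes through cuda_malloc
allocator.free(100)    # cached, not returned to the GPU
allocator.alloc(60)    # served by splitting the cached 100-byte block
print(allocator.cached_blocks)  # [40] remains cached
```

As in the real strategy, once the cache warms up, most allocations are served without touching cudaMalloc at all.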

1.2. Pre-Allocation Strategy
----------------------------

In addition to the AutoGrowth strategy, PaddlePaddle also provides the Pre-Allocation strategy, which was the default memory allocation strategy before PaddlePaddle 1.7.

The Pre-Allocation strategy allocates a large memory chunk at the first allocation, and most subsequent allocations are served from this pre-allocated chunk. The chunk size is determined by the environment variable FLAGS_fraction_of_gpu_memory_to_use, and is calculated as:

.. code-block:: text

  chunk_size = FLAGS_fraction_of_gpu_memory_to_use * currently available memory of a single GPU card

The default value of FLAGS_fraction_of_gpu_memory_to_use is 0.92, that is, the framework pre-allocates 92% of the currently available memory of the GPU card.
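For example, with the default fraction and a hypothetical GPU card with 16 GB currently available, the chunk size works out as follows (plain arithmetic; the 16 GB figure is an assumption for illustration):

```python
# Chunk size computed from the formula above, for a hypothetical GPU
# with 16 GB of memory currently available.
fraction_of_gpu_memory_to_use = 0.92  # default value of the flag
available_gpu_memory_gb = 16          # assumed free memory on the card

chunk_size_gb = fraction_of_gpu_memory_to_use * available_gpu_memory_gb
print(chunk_size_gb)  # 14.72
```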

The Pre-Allocation strategy allocates GPU memory as follows:

  • When allocating memory of requested_size,
    • If requested_size <= chunk_size, the framework will first allocate a memory chunk of chunk_size, then split a block of requested_size and return the block. Every subsequent memory allocation will be performed on the chunk.

    • If requested_size > chunk_size, the framework will call cudaMalloc to allocate memory block of requested_size and return.

  • When freeing a memory block of free_size,
    • If free_size <= chunk_size, the framework will put the memory block back into the pre-allocated chunk, instead of returning back to GPU.

    • If free_size > chunk_size, the framework will call cudaFree and return the memory back to GPU.
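The rules above can be summarized in a short sketch (an illustration only; the returned strings stand in for the real cudaMalloc/cudaFree calls, and the chunk size is arbitrary):

```python
# Simplified sketch of the Pre-Allocation decision rules (illustration only).

CHUNK_SIZE = 1000  # pretend chunk size, in arbitrary units

def alloc(requested_size):
    if requested_size <= CHUNK_SIZE:
        return "split from pre-allocated chunk"
    return "cudaMalloc a separate block"

def free(free_size):
    if free_size <= CHUNK_SIZE:
        return "returned to pre-allocated chunk"
    return "cudaFree back to GPU"

print(alloc(800))    # split from pre-allocated chunk
print(alloc(4000))   # cudaMalloc a separate block
print(free(800))     # returned to pre-allocated chunk
print(free(4000))    # cudaFree back to GPU
```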

If there are other tasks on your GPU card that occupy memory, you can appropriately decrease FLAGS_fraction_of_gpu_memory_to_use to ensure that the framework can pre-allocate a memory chunk of appropriate size, for example:

.. code-block:: shell

  export FLAGS_fraction_of_gpu_memory_to_use=0.4 # Pre-allocate 40% memory of a single GPU card

If FLAGS_fraction_of_gpu_memory_to_use is set to 0, the framework will call cudaMalloc and cudaFree every time memory is allocated or released, which seriously hurts performance and is not recommended. Set FLAGS_fraction_of_gpu_memory_to_use to 0 only when you want to measure the actual memory usage of the network, in which case you can observe the memory usage reported by the nvidia-smi command.

1.3. Configuration of memory allocation strategy
------------------------------------------------

Since version 1.6+, PaddlePaddle supports both the AutoGrowth strategy and the Pre-Allocation strategy; the strategy used by the framework is controlled by the environment variable FLAGS_allocator_strategy.

Use the AutoGrowth strategy:

.. code-block:: shell

  export FLAGS_allocator_strategy=auto_growth # Use AutoGrowth strategy

Use the Pre-Allocation strategy:

.. code-block:: shell

  export FLAGS_allocator_strategy=naive_best_fit # Use Pre-Allocation strategy

In addition, since version 1.7.2+, PaddlePaddle provides the environment variable FLAGS_gpu_memory_limit_mb, which controls the maximum GPU memory the process can allocate. If it is 0, there is no limit and all GPU memory is available to the process. If it is larger than 0, the process raises an out-of-memory error once its allocated memory exceeds the limit, even if there is still available memory on the GPU card. The unit is MB and the default value is 0.
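These flags can also be set from inside a Python script, as long as this happens before PaddlePaddle is imported, since the framework reads FLAGS_* environment variables during initialization (a minimal sketch; the limit value 4096 is just an example):

```python
import os

# Set allocator flags before PaddlePaddle is imported: the framework reads
# FLAGS_* environment variables when it initializes.
os.environ["FLAGS_allocator_strategy"] = "auto_growth"  # or "naive_best_fit"
os.environ["FLAGS_gpu_memory_limit_mb"] = "4096"        # cap the process at 4 GB

# import paddle.fluid as fluid  # import only after the flags are set
```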

2. Memory Optimization Strategy
===============================

PaddlePaddle provides several general memory optimization methods to reduce the memory usage of your network (both host memory and GPU memory).

2.1. GC Strategy: eager collection of memory garbage
----------------------------------------------------

The principle of GC (Garbage Collection) is to eagerly release the memory of useless variables while the network is running, in order to save memory. GC is suitable for training and inference using Executor or ParallelExecutor, but it is not suitable for the C++ inference library.

Since version 1.6+, GC Strategy is enabled by default.

The GC Strategy is controlled by three environment variables:

  • FLAGS_eager_delete_tensor_gb

The variable that enables GC; its data type is double. The default value is -1 in PaddlePaddle versions < 1.6 and 0 in versions >= 1.6. The GC Strategy caches a certain amount of memory garbage and releases it all at once; FLAGS_eager_delete_tensor_gb is the threshold of cached memory garbage, in GB. It is recommended to set FLAGS_eager_delete_tensor_gb=0.

If FLAGS_eager_delete_tensor_gb=0, memory garbage is collected immediately once it appears, saving the most memory.

If FLAGS_eager_delete_tensor_gb=1, memory garbage is collected once the cached amount reaches 1 GB.

If FLAGS_eager_delete_tensor_gb<0, the GC Strategy is disabled.

  • FLAGS_memory_fraction_of_eager_deletion

The variable that controls the scope of GC; its data type is double. The default value is 1, and the valid range is [0, 1]. It only applies to ParallelExecutor. GC sorts variables in descending order of the memory they occupy, and only collects the memory of the top FLAGS_memory_fraction_of_eager_deletion variables. It is recommended to keep the default value, that is, FLAGS_memory_fraction_of_eager_deletion=1.

If FLAGS_memory_fraction_of_eager_deletion=0.6, the top 60% of variables are collected.

If FLAGS_memory_fraction_of_eager_deletion=0, no variables are collected and the GC Strategy is disabled.

If FLAGS_memory_fraction_of_eager_deletion=1, all variables are collected.

  • FLAGS_fast_eager_deletion_mode

The variable that enables the fast GC Strategy; its data type is bool. The default value is True, meaning the fast GC Strategy is used. The fast GC Strategy collects memory garbage immediately instead of waiting for the CUDA kernel to finish. It is recommended to keep the default value, that is, FLAGS_fast_eager_deletion_mode=True.
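The behavior of the first two flags can be sketched with a toy simulation (an illustration of the idea only, not the framework's implementation; all sizes are made up):

```python
# Toy illustration of the GC flags (not PaddlePaddle's actual implementation).

def run_gc(garbage_sizes_gb, eager_delete_tensor_gb):
    """Simulate FLAGS_eager_delete_tensor_gb: count collection events
    for a stream of garbage sizes (in GB)."""
    if eager_delete_tensor_gb < 0:
        return 0  # GC disabled
    cached, collections = 0.0, 0
    for size in garbage_sizes_gb:
        cached += size
        if cached >= eager_delete_tensor_gb:
            collections += 1  # release everything cached so far
            cached = 0.0
    return collections

def collectable_vars(var_sizes, memory_fraction):
    """Simulate FLAGS_memory_fraction_of_eager_deletion: keep only the
    top-fraction variables, ranked by occupied memory, as GC candidates."""
    ranked = sorted(var_sizes, key=var_sizes.get, reverse=True)
    return ranked[:int(len(ranked) * memory_fraction)]

garbage = [0.3, 0.3, 0.6, 0.2]
print(run_gc(garbage, 0))   # 4 events: every piece collected immediately
print(run_gc(garbage, 1))   # 1 event: collected once the cache reaches 1 GB
print(run_gc(garbage, -1))  # 0 events: GC disabled

sizes = {"fc_out": 400, "conv_out": 300, "bias": 200, "scale": 100, "mean": 50}
print(collectable_vars(sizes, 0.6))  # ['fc_out', 'conv_out', 'bias']
```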

2.2. Inplace Strategy: output reuses input inside operator
----------------------------------------------------------

The principle of the Inplace strategy is that the output of some operators can reuse the memory of their input. For example, the output of the reshape operator can reuse the same memory as its input.
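The idea can be illustrated with a toy example (not PaddlePaddle's operator implementation): reshape only changes shape metadata, so its output tensor can point at the input's buffer.

```python
# Toy illustration of output-reuses-input for a reshape-like operator.

class Tensor:
    def __init__(self, buffer, shape):
        self.buffer = buffer  # the underlying storage
        self.shape = shape    # metadata describing the layout

def reshape_inplace(x, new_shape):
    # Reshape only changes shape metadata, so the output tensor can
    # share (reuse) the input's buffer instead of allocating a new one.
    return Tensor(x.buffer, new_shape)

x = Tensor(buffer=list(range(6)), shape=(2, 3))
y = reshape_inplace(x, (3, 2))
print(y.buffer is x.buffer)  # True: no new memory was allocated
```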

The Inplace strategy is suitable for ParallelExecutor and can be set through BuildStrategy. The strategy is not suitable for Executor+Program or the C++ inference library.

Since version 1.6+, Inplace Strategy is enabled by default.

The Inplace strategy is enabled as follows:

.. code-block:: python

  build_strategy = fluid.BuildStrategy()
  build_strategy.enable_inplace = True  # Enable Inplace Strategy

  compiled_program = fluid.CompiledProgram(train_program, build_strategy=build_strategy)

In PaddlePaddle versions < 1.6, due to some design issues, the variables in fetch_list of the subsequent exe.run must be persistable when the Inplace strategy is enabled. That is, if the variables you want to fetch are loss and acc, you must set:

.. code-block:: python

  loss.persistable = True
  acc.persistable = True

Since version 1.6+, it is no longer necessary to set the variables in fetch_list to persistable.

3. Memory Optimization Best Practice
====================================

The recommended memory optimization strategy is:

  • Enable the GC strategy: set FLAGS_eager_delete_tensor_gb=0.

  • Enable the Inplace strategy: set build_strategy.enable_inplace = True, and set the variables in fetch_list to persistable using var.persistable = True when the PaddlePaddle version is < 1.6.

Since version 1.6+, the above strategies are enabled by default, and setting the variables in fetch_list to persistable is not needed.