declarative programming (static graph)
ExecutionStrategy allows the user to more preciously control how to run the program in ParallelExecutor by setting the property.
import paddle.fluid as fluid x = fluid.layers.data(name='x', shape=, dtype='float32') y = fluid.layers.data(name='y', shape=, dtype='float32') y_predict = fluid.layers.fc(input=x, size=1, act=None) cost = fluid.layers.square_error_cost(input=y_predict, label=y) avg_loss = fluid.layers.mean(cost) sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001) sgd_optimizer.minimize(avg_loss) exec_strategy = fluid.ExecutionStrategy() exec_strategy.num_threads = 4 train_exe = fluid.ParallelExecutor(use_cuda=False, loss_name=avg_loss.name, exec_strategy=exec_strategy)
The type is BOOL, allow_op_delay represents whether to delay the communication operators to run, it may make the execution faster. Note that this option is invalid now, and it will be removed in next version. Default False.
The type is INT, num_iteration_per_drop_scope indicates how many iterations to clean up the temp variables which is generated during execution. It may make the execution faster, because the temp variable’s shape maybe the same between two iterations. Default 1.
If you fetch data when calling the ‘run’, the ParallelExecutor will clean up the temp variables at the end of the current iteration.
In some NLP model, it may cause the GPU memory is insufficient, in this case, you should reduce num_iteration_per_drop_scope.
This config that how many iteration the executor will run when user call exe.run() in python
The type is INT, num_threads represents the size of thread pool that used to run the operators of the current program in ParallelExecutor. If \(num\_threads=1\), all the operators will execute one by one, but the order maybe difference between iterations. If it is not set, it will be set in ParallelExecutor according to the device type and device count, for GPU, \(num\_threads=device\_count*4\), for CPU, \(num\_threads=CPU\_NUM*4\), the explanation of:math:CPU_NUM is in ParallelExecutor. if it is not set, ParallelExecutor will get the cpu count by calling multiprocessing.cpu_count(). Default 0.
This config that the this is distributed training with parameter server