DistributeTranspiler¶
class paddle.fluid.transpiler.distribute_transpiler.DistributeTranspiler(config=None) [source]

api_attr: Static Graph
DistributeTranspiler converts the fluid program to distributed data-parallelism programs. It supports two modes: parameter server (pserver) mode and nccl2 mode.

In pserver mode, the main_program is transformed to use a remote parameter server to do parameter optimization, and the optimization graph is put into a parameter server program.

In nccl2 mode, the transpiler appends a NCCL_ID broadcasting op to startup_program to share the NCCL_ID across the job nodes. After transpile_nccl2 is called, you *must* pass the trainer_id and num_trainers arguments to ParallelExecutor to enable NCCL2 distributed mode.

Examples

    import paddle.fluid as fluid

    x = fluid.data(name='x', shape=[13], dtype='float32')
    y = fluid.data(name='y', shape=[1], dtype='float32')
    y_predict = fluid.layers.fc(input=x, size=1, act=None)

    cost = fluid.layers.square_error_cost(input=y_predict, label=y)
    avg_loss = fluid.layers.mean(cost)

    sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.001)
    sgd_optimizer.minimize(avg_loss)

    # for pserver mode
    pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    trainer_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    current_endpoint = "192.168.0.1:6174"
    trainer_id = 0
    trainers = 4
    role = "PSERVER"
    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id, pservers=pserver_endpoints, trainers=trainers)
    if role == "PSERVER":
        pserver_program = t.get_pserver_program(current_endpoint)
        pserver_startup_program = t.get_startup_program(current_endpoint,
                                                        pserver_program)
    elif role == "TRAINER":
        trainer_program = t.get_trainer_program()

    # for nccl2 mode
    trainer_num = 2
    trainer_id = 0
    config = fluid.DistributeTranspilerConfig()
    config.mode = "nccl2"
    trainer_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id=trainer_id, trainers=trainer_endpoints,
                current_endpoint="192.168.0.1:6174")
    exe = fluid.ParallelExecutor(
        use_cuda=True,
        loss_name=avg_loss.name,
        num_trainers=trainer_num,
        trainer_id=trainer_id
    )
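The optional config argument is a fluid.DistributeTranspilerConfig. A minimal sketch of passing a non-default config in pserver mode is shown below; the attribute names slice_var_up and min_block_size are assumptions about the config object and may differ across versions, so treat them as illustrative only.

    import paddle.fluid as fluid

    # Sketch only: tune how parameter variables are split across pservers.
    # slice_var_up / min_block_size are assumed DistributeTranspilerConfig
    # attributes; check your installed version before relying on them.
    config = fluid.DistributeTranspilerConfig()
    config.slice_var_up = True       # allow splitting large variables into blocks
    config.min_block_size = 8192     # do not split variables smaller than this

    t = fluid.DistributeTranspiler(config=config)
    t.transpile(trainer_id=0,
                pservers="192.168.0.1:6174,192.168.0.2:6174",
                trainers=4)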
            
transpile(trainer_id, program=None, pservers='127.0.0.1:6174', trainers=1, sync_mode=True, startup_program=None, current_endpoint='127.0.0.1:6174')
Transpile the input program into distributed programs according to the config and arguments.

Parameters

    - trainer_id (int) – ID of the current trainer worker; if there are n workers, the ID ranges from 0 to n-1.
    - program (Program|None) – the program to transpile, default is fluid.default_main_program() (see the sketch after the example below for transpiling an explicit program).
    - startup_program (Program|None) – the startup_program to transpile, default is fluid.default_startup_program().
    - pservers (str) – comma separated ip:port string for the pserver list.
    - trainers (int|str) – in pserver mode this is the number of trainers, in nccl2 mode this is a string of trainer endpoints.
    - sync_mode (bool) – whether to do synchronous training, default is True.
    - current_endpoint (str) – the current endpoint; required when transpiling in nccl2 distributed mode, not used in pserver mode.
Examples

    import paddle.fluid as fluid

    transpiler = fluid.DistributeTranspiler()
    transpiler.transpile(
        trainer_id=0,
        pservers="127.0.0.1:7000,127.0.0.1:7001",
        trainers=2,
        sync_mode=False,
        current_endpoint="127.0.0.1:7000")
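For cases where the global default programs are not used, a minimal sketch of transpiling explicitly built programs via the program and startup_program parameters follows; the network mirrors the class-level example above and the endpoint values are illustrative.

    import paddle.fluid as fluid

    # Sketch only: transpile explicitly constructed programs instead of the
    # global defaults. Endpoints and network are illustrative.
    main_prog = fluid.Program()
    startup_prog = fluid.Program()
    with fluid.program_guard(main_prog, startup_prog):
        x = fluid.data(name='x', shape=[13], dtype='float32')
        y = fluid.data(name='y', shape=[1], dtype='float32')
        y_predict = fluid.layers.fc(input=x, size=1, act=None)
        avg_loss = fluid.layers.mean(
            fluid.layers.square_error_cost(input=y_predict, label=y))
        fluid.optimizer.SGD(learning_rate=0.001).minimize(avg_loss)

    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id=0,
        program=main_prog,
        startup_program=startup_prog,
        pservers="127.0.0.1:7000,127.0.0.1:7001",
        trainers=2)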
get_trainer_program(wait_port=True)
Get the transpiled trainer-side program. Compared with the original program, the trainer-side program has the following differences (see the inspection sketch after the example below):

    - Optimizer-related ops are deleted, because parameters are updated on the pserver.
    - After each op that computes the gradient of a parameter, a Send_op and a Recv_op are added.

Parameters

    wait_port (bool) – whether to wait for the parameter server ports to be ready before returning the program. Default is True.
Returns

    trainer side program.

Return type

    Program
Examples

    import paddle.fluid as fluid

    # this is an example, find available endpoints in your case
    pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    trainer_id = 0
    trainers = 4

    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id, trainers=trainers, pservers=pserver_endpoints)
    trainer_program = t.get_trainer_program()
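A minimal sketch of verifying the differences listed above, assuming trainer_program from the example; the op type strings "sgd", "send" and "recv" are assumptions about what the transpiler emits and may vary by version.

    # Sketch only: inspect the transpiled trainer program.
    # Op type names below ("sgd", "send", "recv") are assumptions.
    op_types = [op.type for op in trainer_program.global_block().ops]
    print("sgd" in op_types)    # expected False: optimizer ops moved to the pserver side
    print("send" in op_types)   # expected True: gradients are sent to the pserver
    print("recv" in op_types)   # expected True: updated parameters are received back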
get_pserver_program(endpoint)
Get the parameter server side program. Compared with the original program, the pserver-side program has the following differences (see the inspection sketch after the example below):

    - Only optimize-related and communication-related ops are included.
    - Block 0 contains only variable definitions and the listen_and_serv_op.
    - Every variable that needs to be updated has its own dedicated block.

Parameters

    endpoint (str) – current parameter server endpoint.
Returns

    the program for the current parameter server to run.

Return type

    Program
Examples

    import paddle.fluid as fluid

    # this is an example, find available endpoints in your case
    pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    current_endpoint = "192.168.0.1:6174"
    trainer_id = 0
    trainers = 4

    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id, pservers=pserver_endpoints, trainers=trainers)
    pserver_program = t.get_pserver_program(current_endpoint)
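A minimal sketch, assuming pserver_program from the example above, of checking the block layout described in the list; the op type string "listen_and_serv" is an assumption and may differ by version.

    # Sketch only: block 0 should hold little more than the serving loop,
    # and each variable to be updated gets its own optimize block.
    print(pserver_program.num_blocks)  # 1 + number of per-variable optimize blocks
    print([op.type for op in pserver_program.global_block().ops])  # expect "listen_and_serv" here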
get_pserver_programs(endpoint)
Get the pserver-side main program and startup program for distributed training. The main_program returned by this function is consistent with the return value of the function get_pserver_program.

Parameters

    endpoint (str) – current pserver endpoint.

Returns

    (main_program, startup_program), of type “Program”

Return type

    tuple
Examples

    import paddle.fluid as fluid

    # this is an example, find available endpoints in your case
    pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    current_endpoint = "192.168.0.1:6174"
    trainer_id = 0
    trainers = 4

    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id, pservers=pserver_endpoints, trainers=trainers)
    pserver_program, pserver_startup_program = t.get_pserver_programs(current_endpoint)
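The returned programs are typically run with a plain Executor on the pserver node. A minimal sketch, assuming pserver_program and pserver_startup_program from the example above (the CPUPlace choice is illustrative):

    import paddle.fluid as fluid

    # Sketch only: start the parameter server process.
    place = fluid.CPUPlace()
    exe = fluid.Executor(place)
    exe.run(pserver_startup_program)   # initialize pserver-side variables
    exe.run(pserver_program)           # serve trainer requests; this call blocks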
get_startup_program(endpoint, pserver_program=None, startup_program=None)
Deprecated.

Get the startup program for the current parameter server. Operator input variables are modified if any variables were split into several blocks.

Parameters

    - endpoint (str) – current pserver endpoint.
    - pserver_program (Program) – deprecated, call get_pserver_program first.
    - startup_program (Program) – deprecated, the startup_program should be passed when initializing.
Returns

    parameter server side startup program.

Return type

    Program
Examples

    import paddle.fluid as fluid

    pserver_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    trainer_endpoints = "192.168.0.1:6174,192.168.0.2:6174"
    current_endpoint = "192.168.0.1:6174"
    trainer_id = 0
    trainers = 4

    t = fluid.DistributeTranspiler()
    t.transpile(trainer_id, pservers=pserver_endpoints, trainers=trainers)
    pserver_program = t.get_pserver_program(current_endpoint)
    pserver_startup_program = t.get_startup_program(current_endpoint,
                                                    pserver_program)
 
