Profiler

class paddle.profiler.Profiler(*, targets: Optional[Iterable[paddle.profiler.profiler.ProfilerTarget]] = None, scheduler: Optional[Union[Callable[[int], paddle.profiler.profiler.ProfilerState], tuple]] = None, on_trace_ready: Optional[Callable[[...], Any]] = None, record_shapes: Optional[bool] = False, profile_memory=False, timer_only: Optional[bool] = False, emit_nvtx: Optional[bool] = False, custom_device_types: Optional[list] = [])

Profiler context manager. It is the user interface for managing the profiling process: starting and stopping the profiler, exporting profiling data, and printing the summary table.

Parameters
  • targets (list, optional) – The target devices to profile. By default, all existing and supported devices are chosen. Currently supported values are ProfilerTarget.CPU, ProfilerTarget.GPU and ProfilerTarget.MLU.

  • scheduler (Callable|tuple, optional) – If it is a callable object, it takes a step number as its parameter and returns the corresponding ProfilerState. Such a callable can be generated by the make_scheduler function. If it is not provided (None), the default scheduler keeps tracing until the profiler exits. If it is a tuple, it holds two values, start_batch and end_batch, and profiling covers the half-open range [start_batch, end_batch).

  • on_trace_ready (Callable, optional) – A callback function that takes the Profiler object as its parameter, giving users a way to post-process the result. It is called when the scheduler returns ProfilerState.RECORD_AND_RETURN. The default value is export_chrome_tracing.

  • timer_only (bool, optional) – If it is True, only the cost of the DataLoader and of every step of the model is measured, without profiling. Otherwise, the model is both timed and profiled. Default: False.

  • record_shapes (bool, optional) – If it is True, collect the input shape information of each op. Default: False.

  • profile_memory (bool, optional) – If it is True, collect tensor memory allocation and release information. Default: False.
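The scheduler contract described above (a step number goes in, a state comes out) can be sketched without Paddle. The snippet below is only an illustration: `State` is a stand-in enum that reuses the member names of ProfilerState, `range_scheduler` is a hypothetical helper, and the choice to spend the step just before the range in READY is an assumption, not a statement about how Paddle expands a tuple internally.

```python
from enum import Enum


class State(Enum):
    """Stand-in for paddle.profiler.ProfilerState (illustration only)."""
    CLOSED = 0
    READY = 1
    RECORD = 2
    RECORD_AND_RETURN = 3


def range_scheduler(start_batch, end_batch):
    """Hypothetical helper: build a step -> State callable for [start_batch, end_batch)."""
    def scheduler(step):
        if start_batch <= step < end_batch - 1:
            return State.RECORD
        if step == end_batch - 1 and step >= start_batch:
            # Last step inside the half-open range; a callback would fire here.
            return State.RECORD_AND_RETURN
        if step == start_batch - 1:
            # Assumed warm-up step right before recording starts.
            return State.READY
        return State.CLOSED
    return scheduler


scheduler = range_scheduler(2, 5)
states = [scheduler(step).name for step in range(7)]
print(states)
# ['CLOSED', 'READY', 'RECORD', 'RECORD', 'RECORD_AND_RETURN', 'CLOSED', 'CLOSED']
```

Step 5 is already CLOSED, which is what the half-open notation [2, 5) expresses: the end value itself is not profiled.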

Examples

  1. Profiling range [2, 5).

    # required: gpu
    import paddle.profiler as profiler
    with profiler.Profiler(
            targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
            scheduler = (2, 5),
            on_trace_ready = profiler.export_chrome_tracing('./log')) as p:
        for iter in range(10):
            #train()
            p.step()
    
  2. Profiling ranges [2, 4], [7, 9], [12, 14].

    # required: gpu
    import paddle.profiler as profiler
    with profiler.Profiler(
            targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
            scheduler = profiler.make_scheduler(closed=1, ready=1, record=3, repeat=3),
            on_trace_ready = profiler.export_chrome_tracing('./log')) as p:
        for iter in range(15):  # run steps 0..14 so that all three repeats can complete
            #train()
            p.step()
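
A quick way to check which steps a cyclic schedule records is to do the arithmetic by hand. The sketch below assumes that every cycle runs its closed, ready and record phases in that order, starting at step 0 (or after skip_first steps); `recorded_ranges` is a hypothetical helper written for this illustration and is not part of paddle.profiler.

```python
def recorded_ranges(closed, ready, record, repeat, skip_first=0):
    """Return the inclusive (first, last) recorded steps of each cycle.

    Assumes each cycle is laid out as: closed steps, ready steps, record steps.
    """
    period = closed + ready + record
    ranges = []
    for cycle in range(repeat):
        first = skip_first + cycle * period + closed + ready
        ranges.append((first, first + record - 1))
    return ranges


ranges = recorded_ranges(closed=1, ready=1, record=3, repeat=3)
print(ranges)  # [(2, 4), (7, 9), (12, 14)]
```

Under this assumption, a loop must run at least through step 14 for the third window to be recorded; a loop over range(10) stops at step 9 and reaches only the first two.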
    
  3. Use profiler without context manager, and use default parameters.

    # required: gpu
    import paddle.profiler as profiler
    p = profiler.Profiler()
    p.start()
    for iter in range(10):
        #train()
        p.step()
    p.stop()
    p.summary()
    
  4. Use profiler to get throughput and cost of the model.

    import paddle
    import paddle.profiler as profiler
    
    class RandomDataset(paddle.io.Dataset):
        def __init__(self, num_samples):
            self.num_samples = num_samples
    
        def __getitem__(self, idx):
            image = paddle.rand(shape=[100], dtype='float32')
            label = paddle.randint(0, 10, shape=[1], dtype='int64')
            return image, label
    
        def __len__(self):
            return self.num_samples
    
    class SimpleNet(paddle.nn.Layer):
        def __init__(self):
            super(SimpleNet, self).__init__()
            self.fc = paddle.nn.Linear(100, 10)
    
        def forward(self, image, label=None):
            return self.fc(image)
    
    dataset = RandomDataset(20 * 4)
    simple_net = SimpleNet()
    opt = paddle.optimizer.SGD(learning_rate=1e-3, parameters=simple_net.parameters())
    BATCH_SIZE = 4
    loader = paddle.io.DataLoader(
        dataset,
        batch_size=BATCH_SIZE)
    p = profiler.Profiler(timer_only=True)
    p.start()
    for i, (image, label) in enumerate(loader()):
        out = simple_net(image)
        loss = paddle.nn.functional.cross_entropy(out, label)
        avg_loss = paddle.mean(loss)
        avg_loss.backward()
        opt.minimize(avg_loss)
        simple_net.clear_gradients()
        p.step(num_samples=BATCH_SIZE)
        if i % 10 == 0:
            step_info = p.step_info(unit='images')
            print("Iter {}: {}".format(i, step_info))
            # The average statistics for 10 steps between the last and this call will be
            # printed when the "step_info" is called at 10 iteration intervals.
            # The values you get may be different from the following.
            # Iter 0:  reader_cost: 0.51946 s batch_cost: 0.66077 s ips: 6.054 images/s
            # Iter 10:  reader_cost: 0.00014 s batch_cost: 0.00441 s ips: 907.009 images/s
    p.stop()
    # The performance summary will be automatically printed when the "stop" is called.
    # Reader Ratio: 2.658%
    # Time Unit: s, IPS Unit: images/s
    # |                 |       avg       |       max       |       min       |
    # |   reader_cost   |     0.00011     |     0.00013     |     0.00007     |
    # |    batch_cost   |     0.00405     |     0.00434     |     0.00326     |
    # |       ips       |    1086.42904   |    1227.30604   |    959.92796    |
    
start()

Start the profiler and enter the first profiler step (step 0). The state changes from CLOSED to self.current_state, and the corresponding action is triggered.

Examples

# required: gpu
import paddle.profiler as profiler
prof = profiler.Profiler(
    targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
    scheduler = (1, 9),
    on_trace_ready = profiler.export_chrome_tracing('./log'))
prof.start()
for iter in range(10):
    #train()
    prof.step()
prof.stop()

stop()

Stop the profiler. The state changes from self.current_state to CLOSED, the corresponding action is triggered, and the profiler result, if one exists, is post-processed with self.on_trace_ready.

Examples

# required: gpu
import paddle.profiler as profiler
prof = profiler.Profiler(
    targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
    scheduler = (1, 7),
    on_trace_ready = profiler.export_chrome_tracing('./log'))
prof.start()
for iter in range(10):
    #train()
    prof.step()
prof.stop()

step(num_samples: Optional[int] = None)

Signal the profiler that the next profiling step has started. The profiler gets the new ProfilerState and triggers the corresponding action.

Parameters

num_samples (int|None, optional) – The batch size of each step of the model, used to compute throughput when timer_only is True. Default: None.

Examples

# required: gpu
import paddle.profiler as profiler
prof = profiler.Profiler(
    targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
    scheduler = (3, 7),
    on_trace_ready = profiler.export_chrome_tracing('./log'))

prof.start()
for iter in range(10):
    #train()
    prof.step()
prof.stop()

step_info(unit=None)

Get statistics for the current step. If the function is called at regular iteration intervals, the result is the average over all steps between the previous call and this call. The statistics are as follows:

  1. reader_cost: the cost of loading data, measured in seconds.

  2. batch_cost: the cost of a step, measured in seconds.

  3. ips (Instances Per Second): the throughput of the model, measured in samples/s or in another unit, depending on unit. When num_samples of step() is None, it is measured in steps/s.

Parameters

unit (string, optional) – The unit of the input data. It is only used when num_samples of step() is specified as a number. For example, when it is images, the unit of throughput is images/s. Default: None, in which case the unit of throughput is samples/s.

Returns

A string representing the statistic.

Return type

string
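
To make the three statistics concrete, here is a minimal sketch of how an interval average could be computed from per-step timings. It is not Paddle's implementation: `summarize_interval` is a hypothetical function, and computing ips as num_samples divided by the average batch_cost is an assumption made for illustration.

```python
def summarize_interval(batch_costs, reader_costs=None, num_samples=None, unit=None):
    """Format average statistics for the steps since the previous call."""
    steps = len(batch_costs)
    avg_batch = sum(batch_costs) / steps
    parts = []
    if reader_costs:
        # Only reported when data-loading time was measured.
        parts.append("reader_cost: {:.5f} s".format(sum(reader_costs) / steps))
    parts.append("batch_cost: {:.5f} s".format(avg_batch))
    if num_samples is None:
        # No batch size given: throughput is counted in steps.
        ips, ips_unit = 1.0 / avg_batch, "steps/s"
    else:
        ips, ips_unit = num_samples / avg_batch, "{}/s".format(unit or "samples")
    parts.append("ips: {:.3f} {}".format(ips, ips_unit))
    return " ".join(parts)


with_samples = summarize_interval(
    [0.25, 0.75], reader_costs=[0.125, 0.375], num_samples=4, unit="images")
without_samples = summarize_interval([0.25, 0.75])
print(with_samples)     # reader_cost: 0.25000 s batch_cost: 0.50000 s ips: 8.000 images/s
print(without_samples)  # batch_cost: 0.50000 s ips: 2.000 steps/s
```

The second call mirrors the case where num_samples is None, so the throughput falls back to steps/s.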

Examples

import paddle.profiler as profiler
prof = profiler.Profiler(timer_only=True)
prof.start()
for iter in range(20):
    #train()
    prof.step()
    if iter % 10 == 0:
        print("Iter {}: {}".format(iter, prof.step_info()))
        # The example does not call the DataLoader, so there is no "reader_cost".
        # Iter 0:  batch_cost: 0.00001 s ips: 86216.623 steps/s
        # Iter 10:  batch_cost: 0.00001 s ips: 103645.034 steps/s
prof.stop()
# Time Unit: s, IPS Unit: steps/s
# |                 |       avg       |       max       |       min       |
# |    batch_cost   |     0.00000     |     0.00002     |     0.00000     |
# |       ips       |   267846.19437  |   712030.38727  |   45134.16662   |

export(path='', format='json')

Export the tracing data to a file.

Parameters
  • path (str) – File path of the output.

  • format (str, optional) – Output format, either ‘json’ or ‘pb’: ‘json’ for Chrome tracing and ‘pb’ for protobuf. Default: ‘json’.
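
Once a ‘json’ trace has been exported, it can be inspected with ordinary JSON tooling. The sketch below assumes the file follows the Chrome Trace Event format, with complete events marked by "ph": "X" and carrying a "dur" field; the sample data is hand-written for this illustration and is not real profiler output, and `total_duration_by_name` is a hypothetical helper. The fields present in an actual exported file may differ.

```python
import json
from collections import defaultdict

# Tiny hand-written sample in Chrome Trace Event format (not real profiler output).
sample = json.dumps({"traceEvents": [
    {"name": "matmul", "ph": "X", "ts": 0, "dur": 120},
    {"name": "relu", "ph": "X", "ts": 130, "dur": 30},
    {"name": "matmul", "ph": "X", "ts": 200, "dur": 100},
]})


def total_duration_by_name(trace_text):
    """Sum the duration of complete ("X") events per event name."""
    trace = json.loads(trace_text)
    # The format allows either a bare event list or an object with "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event["name"]] += event["dur"]
    return dict(totals)


totals = total_duration_by_name(sample)
print(totals)  # {'matmul': 220.0, 'relu': 30.0}
```

For a real trace, the text of the exported file would be passed in place of `sample`.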

Examples

# required: gpu
import paddle.profiler as profiler
prof = profiler.Profiler(
    targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
    scheduler = (3, 7))
prof.start()
for iter in range(10):
    #train()
    prof.step()
prof.stop()
prof.export(path="./profiler_data.json", format="json")

summary(sorted_by=SortedKeys.CPUTotal, op_detail=True, thread_sep=False, time_unit='ms', views=None)

Print the summary table. The currently supported summaries are overview, model, distributed, operator, memory manipulation and user-defined.

Parameters
  • sorted_by (SortedKeys, optional) – How to rank the items of the op table. Default: SortedKeys.CPUTotal.

  • op_detail (bool, optional) – Whether to expand the detailed information of each operator. Default: True.

  • thread_sep (bool, optional) – Whether to print a separate op table for each thread. Default: False.

  • time_unit (str, optional) – Time unit for display, one of ‘s’, ‘ms’, ‘us’ or ‘ns’. Default: ‘ms’.

  • views (SummaryView|list[SummaryView], optional) – The summary tables to print. Default: None, which means all views are printed.

Examples

# required: gpu
import paddle.profiler as profiler
prof = profiler.Profiler(
    targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
    scheduler = (3, 7),
    on_trace_ready = profiler.export_chrome_tracing('./log'))
prof.start()
for iter in range(10):
    #train()
    prof.step()
prof.stop()
prof.summary(sorted_by=profiler.SortedKeys.CPUTotal, op_detail=True, thread_sep=False, time_unit='ms')