Profiler

class paddle.profiler.Profiler ( *, targets: Iterable[ProfilerTarget] | None = None, scheduler: Callable[[int], ProfilerState] | tuple[int, int] | None = None, on_trace_ready: Callable[[Profiler], None] | None = None, record_shapes: bool = False, profile_memory: bool = False, timer_only: bool = False, emit_nvtx: bool = False, custom_device_types: list[str] = [], with_flops: bool = False )

Profiler context manager: the user interface for managing the profiling process. It starts and stops profiling, exports profiling data, and prints a summary table.

Parameters

targets (list, optional) – Target devices to profile. By default, all existing and supported devices are chosen. Currently supported values are ProfilerTarget.CPU, ProfilerTarget.GPU and ProfilerTarget.XPU.
scheduler (Callable|tuple, optional) – If it is a callable object, it takes a step number as its parameter and returns the corresponding ProfilerState; such a callable can be generated by the make_scheduler function. If it is a tuple, it holds two values, start_batch and end_batch, and profiling covers the range [start_batch, end_batch). If not provided (None), the default scheduler keeps tracing until the profiler exits.
on_trace_ready (Callable, optional) – Callback function that takes the Profiler object as its parameter, giving users a way to do post-processing. It is called when the scheduler returns ProfilerState.RECORD_AND_RETURN. The default value is export_chrome_tracing.
timer_only (bool, optional) – If True, only the cost of the DataLoader and of every step of the model is timed, without profiling. Otherwise, the model is both timed and profiled. Default: False.
record_shapes (bool, optional) – If True, collect each op’s input shape information. Default: False.
profile_memory (bool, optional) – If True, collect tensor memory allocation and release information. Default: False.
custom_device_types (list, optional) – If targets contains profiler.ProfilerTarget.CUSTOM_DEVICE, custom_device_types selects the custom device types to profile. The default value means that all custom devices are selected.
with_flops (bool, optional) – If True, the FLOPs of each op are calculated. Default: False.
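
To illustrate the scheduler contract described above, the following sketch builds a callable that maps a step number to a state for a half-open range [start_batch, end_batch). The State enum and the range_scheduler helper are hypothetical stand-ins: State only mirrors the member names of ProfilerState so that the snippet runs without Paddle, and the mapping is an assumption about the intended behavior, not Paddle's internal conversion of the tuple.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    """Hypothetical stand-in that mirrors the member names of ProfilerState."""
    CLOSED = 0
    READY = 1
    RECORD = 2
    RECORD_AND_RETURN = 3


def range_scheduler(start_batch: int, end_batch: int) -> Callable[[int], State]:
    """Return a scheduler that records the steps in [start_batch, end_batch)."""
    if not 0 <= start_batch < end_batch:
        raise ValueError("expected 0 <= start_batch < end_batch")

    def scheduler(step: int) -> State:
        if step == end_batch - 1:
            # Last recorded step: the collected data is handed to the callback.
            return State.RECORD_AND_RETURN
        if start_batch <= step < end_batch:
            return State.RECORD
        if step == start_batch - 1:
            # Assumed warm-up step right before recording starts.
            return State.READY
        return State.CLOSED

    return scheduler


scheduler = range_scheduler(2, 5)
states = [scheduler(step).name for step in range(7)]
print(states)
```

With a real Profiler, a callable of this shape would return ProfilerState members instead of the stand-in enum.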

Examples

Profile the steps in the range [2, 5).

>>> import paddle.profiler as profiler
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> with profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = (2, 5),
...     on_trace_ready = profiler.export_chrome_tracing('./log')
... ) as p:
...     for iter in range(10):
...         # train()
...         p.step()

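Profile the steps in the ranges [2, 4], [7, 9] and [12, 14]. The loop in this example runs only 10 steps, so only the first two ranges are reached.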

>>> import paddle.profiler as profiler
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> with profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = profiler.make_scheduler(closed=1, ready=1, record=3, repeat=3),
...     on_trace_ready = profiler.export_chrome_tracing('./log')
... ) as p:
...     for iter in range(10):
...         # train()
...         p.step()
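
The recording windows of a repeating scheduler can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming that each cycle lasts closed + ready + record steps and that no steps are skipped at the beginning; record_windows is a hypothetical helper, not part of paddle.profiler.

```python
def record_windows(closed: int, ready: int, record: int, repeat: int) -> list[tuple[int, int]]:
    """Return the inclusive (first, last) step of every recording window."""
    period = closed + ready + record  # assumed length of one full cycle
    windows = []
    for cycle in range(repeat):
        first = cycle * period + closed + ready
        last = first + record - 1
        windows.append((first, last))
    return windows


windows = record_windows(closed=1, ready=1, record=3, repeat=3)
print(windows)  # [(2, 4), (7, 9), (12, 14)]

# A loop of 10 steps (0..9) only completes the windows that end before step 10.
reached = [w for w in windows if w[1] < 10]
print(reached)  # [(2, 4), (7, 9)]
```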

Use profiler without context manager, and use default parameters.

>>> import paddle.profiler as profiler
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> p = profiler.Profiler()
>>> p.start()
>>> for iter in range(10):
...     #train()
...     p.step()
>>> p.stop()
>>> p.summary()

Use profiler to get throughput and cost of the model.

>>> import paddle
>>> import paddle.profiler as profiler

>>> class RandomDataset(paddle.io.Dataset): # type: ignore[type-arg]
...     def __init__(self, num_samples):
...         self.num_samples = num_samples
...     def __getitem__(self, idx):
...         image = paddle.rand(shape=[100], dtype='float32')
...         label = paddle.randint(0, 10, shape=[1], dtype='int64')
...         return image, label
...     def __len__(self):
...         return self.num_samples
>>> class SimpleNet(paddle.nn.Layer):
...     def __init__(self):
...         super().__init__()
...         self.fc = paddle.nn.Linear(100, 10)
...     def forward(self, image, label=None):
...         return self.fc(image)
>>> dataset = RandomDataset(20 * 4)
>>> simple_net = SimpleNet()
>>> opt = paddle.optimizer.SGD(learning_rate=1e-3, parameters=simple_net.parameters())
>>> BATCH_SIZE = 4
>>> loader = paddle.io.DataLoader(
...     dataset,
...     batch_size=BATCH_SIZE)
>>> p = profiler.Profiler(timer_only=True)
>>> p.start()
>>> for i, (image, label) in enumerate(loader()):
...     out = simple_net(image)
...     loss = paddle.nn.functional.cross_entropy(out, label)
...     avg_loss = paddle.mean(loss)
...     avg_loss.backward()
...     opt.minimize(avg_loss)
...     simple_net.clear_gradients()
...     p.step(num_samples=BATCH_SIZE)
...     if i % 10 == 0:
...         step_info = p.step_info(unit='images')
...         print("Iter {}: {}".format(i, step_info))
...         # The average statistics for 10 steps between the last and this call will be
...         # printed when the "step_info" is called at 10 iteration intervals.
...         # The values you get may be different from the following.
...         # Iter 0:  reader_cost: 0.51946 s batch_cost: 0.66077 s ips: 6.054 images/s
...         # Iter 10:  reader_cost: 0.00014 s batch_cost: 0.00441 s ips: 907.009 images/s
>>> p.stop()
>>> # The performance summary will be automatically printed when the "stop" is called.
>>> # Reader Ratio: 2.658%
>>> # Time Unit: s, IPS Unit: images/s
>>> # |                 |       avg       |       max       |       min       |
>>> # |   reader_cost   |     0.00011     |     0.00013     |     0.00007     |
>>> # |    batch_cost   |     0.00405     |     0.00434     |     0.00326     |
>>> # |       ips       |    1086.42904   |    1227.30604   |    959.92796    |

start ( ) → None

Start the profiler and enter the first profiler step (step 0). The state changes from CLOSED to self.current_state, and the corresponding action is triggered.

Examples

>>> import paddle.profiler as profiler
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> prof = profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = (1, 9),
...     on_trace_ready = profiler.export_chrome_tracing('./log'))
>>> prof.start()
>>> for iter in range(10):
...     # train()
...     prof.step()
>>> prof.stop()

stop ( ) → None

Stop the profiler. The state changes from self.current_state to CLOSED, the corresponding action is triggered, and the profiler result, if one exists, is post-processed using self.on_trace_ready.

Examples

>>> import paddle.profiler as profiler
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> prof = profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = (1, 7),
...     on_trace_ready = profiler.export_chrome_tracing('./log'))
>>> prof.start()
>>> for iter in range(10):
...     # train()
...     prof.step()
>>> prof.stop()

step ( num_samples: Optional[int] = None ) → None

Signals the profiler that the next profiling step has started. Gets the new ProfilerState and triggers the corresponding action.

Parameters: num_samples (int|None, optional) – Specifies the batch size of every step of the model that is used to compute throughput when timer_only is True. Default: None.
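
The interplay of start, step and stop with the scheduler can be pictured with a small state-tracking model. This is only a sketch under stated assumptions: MiniProfiler is a hypothetical stand-in that collects no profiling data, the state names are plain strings that mirror ProfilerState, and the rule "call on_trace_ready once a RECORD_AND_RETURN step has finished" is inferred from the descriptions in this page rather than taken from Paddle's source.

```python
CLOSED, READY, RECORD, RECORD_AND_RETURN = "CLOSED", "READY", "RECORD", "RECORD_AND_RETURN"


class MiniProfiler:
    """Hypothetical stand-in that only tracks state transitions."""

    def __init__(self, scheduler, on_trace_ready=None):
        self.scheduler = scheduler
        self.on_trace_ready = on_trace_ready
        self.step_num = 0
        self.current_state = CLOSED

    def start(self):
        # Enter step 0 and adopt whatever state the scheduler assigns to it.
        self.step_num = 0
        self.current_state = self.scheduler(self.step_num)

    def step(self):
        previous = self.current_state
        self.step_num += 1
        self.current_state = self.scheduler(self.step_num)
        # A finished RECORD_AND_RETURN step means a trace is ready for post-processing.
        if previous == RECORD_AND_RETURN and self.on_trace_ready is not None:
            self.on_trace_ready(self)

    def stop(self):
        # Hand over a trace that was still being recorded when the loop ended.
        if self.current_state in (RECORD, RECORD_AND_RETURN) and self.on_trace_ready is not None:
            self.on_trace_ready(self)
        self.current_state = CLOSED


def scheduler(step):
    """Record steps 3..6, mirroring a (3, 7) range."""
    if step == 6:
        return RECORD_AND_RETURN
    if 3 <= step < 6:
        return RECORD
    if step == 2:
        return READY
    return CLOSED


ready_at = []
prof = MiniProfiler(scheduler, on_trace_ready=lambda p: ready_at.append(p.step_num))
prof.start()
for _ in range(10):
    prof.step()
prof.stop()
print(ready_at)  # [7]
```

In this model the callback fires once, when step 7 begins and the recording of steps 3 to 6 is complete.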

Examples

>>> import paddle.profiler as profiler
>>> import paddle
>>> paddle.device.set_device('gpu')
>>> prof = profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = (3, 7),
...     on_trace_ready = profiler.export_chrome_tracing('./log'))

>>> prof.start()
>>> for iter in range(10):
...     #train()
...     prof.step()
>>> prof.stop()

step_info ( unit: Optional[str] = None ) → str

Get statistics for current step. If the function is called at certain iteration intervals, the result is the average of all steps between the previous call and this call. Statistics are as follows:

1. reader_cost: the cost of loading data, measured in seconds.
2. batch_cost: the cost of a step, measured in seconds.
3. ips (Instance Per Second): the throughput of the model, measured in samples/s or in another unit, depending on unit. When num_samples of step() is None, it is measured in steps/s.

Parameters: unit (string, optional) – The unit of the input data. It is only used when num_samples of step() is specified as a number. For example, when it is images, the unit of throughput is images/s. Default: None, in which case the unit of throughput is samples/s.
Returns: A string representing the statistic.
Return type: string
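
The relation between batch_cost and ips can be sketched as follows, assuming that throughput is the number of samples per step divided by the step time, and one step per step time when num_samples is not given. The throughput function is a hypothetical helper for illustration; the real step_info string also contains reader_cost and batch_cost.

```python
from typing import Optional


def throughput(batch_cost: float, num_samples: Optional[int] = None,
               unit: Optional[str] = None) -> str:
    """Format the ips entry for one step that took batch_cost seconds."""
    if batch_cost <= 0:
        raise ValueError("batch_cost must be positive")
    if num_samples is None:
        # Without a sample count, throughput is reported in steps per second.
        return f"ips: {1.0 / batch_cost:.3f} steps/s"
    label = unit if unit is not None else "samples"
    return f"ips: {num_samples / batch_cost:.3f} {label}/s"


print(throughput(0.66077, num_samples=4, unit="images"))  # ips: 6.054 images/s
print(throughput(0.5, num_samples=4))                     # ips: 8.000 samples/s
print(throughput(0.5))                                    # ips: 2.000 steps/s
```

The first call uses the batch_cost and batch size of the "Iter 0" line in the timer-only example of the class description.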

Examples

>>> import paddle.profiler as profiler
>>> prof = profiler.Profiler(timer_only=True)
>>> prof.start()
>>> for iter in range(20):
...     #train()
...     prof.step()
...     if iter % 10 == 0:
...         print("Iter {}: {}".format(iter, prof.step_info()))
...         # The example does not call the DataLoader, so there is no "reader_cost".
...         # Iter 0:  batch_cost: 0.00001 s ips: 86216.623 steps/s
...         # Iter 10:  batch_cost: 0.00001 s ips: 103645.034 steps/s
>>> prof.stop()
>>> # Time Unit: s, IPS Unit: steps/s
>>> # |                 |       avg       |       max       |       min       |
>>> # |    batch_cost   |     0.00000     |     0.00002     |     0.00000     |
>>> # |       ips       |   267846.19437  |   712030.38727  |   45134.16662   |

export ( path: str = '', format: str = 'json' ) → None

Exports the tracing data to a file.

Parameters

path (str) – file path of the output.
format (str, optional) – output format, can be chosen from [‘json’, ‘pb’], ‘json’ for chrome tracing and ‘pb’ for protobuf, default value is ‘json’.
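
Because on_trace_ready receives the Profiler object, export can also be called from a custom callback. The sketch below writes the same trace in both formats; export_json_and_pb and its name argument are hypothetical, and the handler relies only on the export(path, format) signature documented here, so it accepts any object that provides that method.

```python
import os
from typing import Any, Callable


def export_json_and_pb(out_dir: str, name: str = "trace") -> Callable[[Any], None]:
    """Build a callback that exports one trace as both JSON and protobuf."""

    def handler(prof: Any) -> None:
        # 'json' is the chrome tracing format, 'pb' is protobuf.
        prof.export(path=os.path.join(out_dir, f"{name}.json"), format="json")
        prof.export(path=os.path.join(out_dir, f"{name}.pb"), format="pb")

    return handler
```

A handler built this way would be passed as on_trace_ready=export_json_and_pb('./log'); whether the output directory must already exist is not covered by this sketch.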

Examples

>>> import paddle
>>> paddle.device.set_device('gpu')
>>> import paddle.profiler as profiler
>>> prof = profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = (3, 7))
>>> prof.start()
>>> for iter in range(10):
...     # train()
...     prof.step()
>>> prof.stop()
>>> prof.export(path="./profiler_data.json", format="json")

summary ( sorted_by: SortedKeys = SortedKeys.CPUTotal, op_detail: bool = True, thread_sep: bool = False, time_unit: Literal['s', 'ms', 'us', 'ns'] = 'ms', views: Optional[Union[SummaryView, list[paddle.profiler.profiler.SummaryView]]] = None ) → None

Print the summary table. Currently supported summaries are overview, model, distributed, operator, memory manipulation and user-defined.

Parameters

sorted_by (SortedKeys, optional) – how to rank the items of the op table. Default: SortedKeys.CPUTotal.
op_detail (bool, optional) – whether to expand the detailed information of each operator. Default: True.
thread_sep (bool, optional) – whether to print the op table separately for each thread. Default: False.
time_unit (str, optional) – time unit for display, chosen from [‘s’, ‘ms’, ‘us’, ‘ns’]. Default: ‘ms’.
views (SummaryView|list[SummaryView], optional) – summary tables to print. Default: None, which means all views are printed.

Examples

>>> import paddle
>>> paddle.device.set_device('gpu')
>>> import paddle.profiler as profiler
>>> prof = profiler.Profiler(
...     targets=[profiler.ProfilerTarget.CPU, profiler.ProfilerTarget.GPU],
...     scheduler = (3, 7),
...     on_trace_ready = profiler.export_chrome_tracing('./log'))
>>> prof.start()
>>> for iter in range(10):
...     # train()
...     prof.step()
>>> prof.stop()
>>> prof.summary(sorted_by=profiler.SortedKeys.CPUTotal, op_detail=True, thread_sep=False, time_unit='ms')