Prepare Steps

Data preparation in PaddlePaddle Fluid can be separated into 3 steps.

Step 1: Define a reader to generate training/testing data

The generated data type can be Numpy Array or LoDTensor. According to the different data formats returned by the reader, it can be divided into Batch Reader and Sample Reader.

The batch reader yields a mini-batch data for each, while the sample reader yields a sample data for each.

If your reader yields a sample data, we provide a data augmentation and batching tool for you: Python Reader .

Step 2: Define data layer variables in network

Users should use fluid.data to define data layer variables. Name, dtype and shape are required when defining. For example,

import paddle.fluid as fluid

image = fluid.data(name='image', dtype='float32', shape=[None, 28, 28])
label = fluid.data(name='label', dtype='int64', shape=[None, 1])

None means that the dimension is uncertain. In this example, None means the batch size.

Step 3: Send the data to network for training/testing

PaddlePaddle Fluid provides 2 methods for sending data to the network: Asynchronous DataLoader API, and Synchronous Feed Method.

  • Asynchronous DataLoader API

User should use fluid.io.DataLoader to define a DataLoader object and use its setter method to set the data source. When using DataLoader API, the process of data sending works asynchronously with network training/testing. It is an efficient way for sending data and recommended to use.

  • Synchronous Feed Method

User should create the feeding data beforehand and use executor.run(feed=...) to send the data to fluid.Executor or fluid.ParallelExecutor . Data preparation and network training/testing work synchronously, which is less efficient.

Comparison of these 2 methods are as follows:

Comparison item

Synchronous Feed Method

Asynchronous DataLoader API

API

executor.run(feed=...)

fluid.io.DataLoader

Data type

Numpy Array or LoDTensor

Numpy Array or LoDTensor

Data augmentation

use Python for data augmentation

use Python for data augmentation

Speed

slow

rapid

Recommended applications

model debugging

industrial training

Choose different usages for different data formats

According to the different data formats of reader, users should choose different usages for data preparation.

Read data from sample reader

If user-defined reader is a sample reader, users should use the following steps:

Step 1. Batching

Use the data reader interfaces in PaddlePaddle Fluid for data augmentation and batching. Please refer to Python Reader for details.

Step 2. Sending data

If using Asynchronous DataLoader API, please use set_sample_generator or set_sample_list_generator to set the data source for DataLoader. Please refer to Asynchronous Data Reading for details.

If using Synchronous Feed Method, please use DataFeeder to convert the reader data to LoDTensor before sending to the network. Please refer to DataFeeder for details.

Read data from sample reader

Step 1. Batching

Since the reader has been a batch reader, this step can be skipped.

Step 2. Sending data

If using Asynchronous DataLoader API, please use set_batch_generator to set the data source for DataLoader. Please refer to Asynchronous Data Reading for details.

If using Synchronous Feed Method, please refer to Take Numpy Array as Training Data for details.