DataLoader

class paddle.fluid.reader. DataLoader [source]
static from_generator ( feed_list=None, capacity=None, use_double_buffer=True, iterable=True, return_list=False, use_multiprocess=False, drop_last=True )

from_generator

Warning

This API will be deprecated in the future, it is recommended to use paddle.io.DataLoader which supports multi-processes acceleration.

Note

The framework ensures that the data loading order of DataLoader is exactly the same as the user-defined data source.

Create a DataLoader object for loading data from Python generator. Data would be prefetched using Python thread and be pushed into a queue asynchronously.

The created DataLoader object provides 3 methods to set the data source set_sample_generator , set_sample_list_generator and set_batch_generator . Please see the following example codes to know their usages.

If iterable = True, the created DataLoader object is a Python generator object, which is iterable using for-range loop.

If iterable = False, the created DataLoader object provides start() and reset() method to control the data reading process.

Parameters
  • feed_list (list(Tensor)|tuple(Tensor)) – feed Tensor list. The Tensors should be created by paddle.static.data().

  • capacity (int) – capacity of the queue maintained in DataLoader. The unit is batch number. Set larger capacity if your reader is fast.

  • use_double_buffer (bool, optional) – whether to use double_buffer_reader. If use_double_buffer=True, the DataLoader would prefetch next batch data asynchronously, so it would speed up data feeding and occupies a little more CPU or GPU memory, i.e., the memory of one batch input data.

  • iterable (bool, optional) – whether the created DataLoader is iterable.

  • return_list (bool, optional) – whether the return value on each device is presented as a list. It is only valid when iterable=True. If return_list=False, the return value on each device would be a dict of str -> LoDTensor, where the key of the dict is the name of each fed Tensors. If return_list=True, the return value on each device would be a list(LoDTensor). It is recommended to use return_list=False in static graph mode and use return_list=True in dygraph mode.

  • use_multiprocess (bool, optional) – whether to use multi-process to speed up the data loading process in dygraph. Note: this parameter only can be used in the dygraph mode. In the static graph mode, whether this parameter is set or not has no effect. The Default value is False.

  • drop_last (bool, optional) – whether to drop the last batches whose number is less than the CPU core/GPU card number. The default value is True. In training phase, users should not set drop_last=False, because all CPU cores/GPU cards must read data from DataLoader. In inference phase, users can set drop_last=False, so that the last batches whose number is less than the CPU core/GPU card number can be tested.

Returns

the created DataLoader object.

Return type

loader (DataLoader)

Examples 1:

'''
Example in static graph mode
'''
import numpy as np

import paddle
import paddle.static as static
import paddle.nn.functional as F


BATCH_NUM = 10
BATCH_SIZE = 16
EPOCH_NUM = 4

CLASS_NUM = 10

ITERABLE = True # whether the created DataLoader object is iterable
USE_GPU = False # whether to use GPU

DATA_FORMAT = 'batch_generator' # data format of data source user provides

paddle.enable_static()

def simple_net(image, label):
    fc_tmp = static.nn.fc(image, size=CLASS_NUM)
    cross_entropy = F.softmax_with_cross_entropy(image, label)
    loss = paddle.mean(cross_entropy)
    sgd = paddle.optimizer.SGD(learning_rate=1e-3)
    sgd.minimize(loss)
    return loss

def get_random_images_and_labels(image_shape, label_shape):
    image = np.random.random(size=image_shape).astype('float32')
    label = np.random.random(size=label_shape).astype('int64')
    return image, label

# If the data generator yields one sample each time,
# use DataLoader.set_sample_generator to set the data source.
def sample_generator_creator():
    def __reader__():
        for _ in range(BATCH_NUM * BATCH_SIZE):
            image, label = get_random_images_and_labels([784], [1])
            yield image, label

    return __reader__

# If the data generator yield list of samples each time,
# use DataLoader.set_sample_list_generator to set the data source.
def sample_list_generator_creator():
    def __reader__():
        for _ in range(BATCH_NUM):
            sample_list = []
            for _ in range(BATCH_SIZE):
                image, label = get_random_images_and_labels([784], [1])
                sample_list.append([image, label])

            yield sample_list

    return __reader__

# If the data generator yields a batch each time,
# use DataLoader.set_batch_generator to set the data source.
def batch_generator_creator():
    def __reader__():
        for _ in range(BATCH_NUM):
            batch_image, batch_label = get_random_images_and_labels([BATCH_SIZE, 784], [BATCH_SIZE, 1])
            yield batch_image, batch_label

    return __reader__

# If DataLoader is iterable, use for loop to train the network
def train_iterable(exe, prog, loss, loader):
    for _ in range(EPOCH_NUM):
        for data in loader():
            exe.run(prog, feed=data, fetch_list=[loss])

# If DataLoader is not iterable, use start() and reset() method to control the process
def train_non_iterable(exe, prog, loss, loader):
    for _ in range(EPOCH_NUM):
        loader.start() # call DataLoader.start() before each epoch starts
        try:
            while True:
                exe.run(prog, fetch_list=[loss])
        except paddle.core.EOFException:
            loader.reset() # call DataLoader.reset() after catching EOFException

def set_data_source(loader, places):
    if DATA_FORMAT == 'sample_generator':
        loader.set_sample_generator(sample_generator_creator(), batch_size=BATCH_SIZE, drop_last=True, places=places)
    elif DATA_FORMAT == 'sample_list_generator':
        loader.set_sample_list_generator(sample_list_generator_creator(), places=places)
    elif DATA_FORMAT == 'batch_generator':
        loader.set_batch_generator(batch_generator_creator(), places=places)
    else:
        raise ValueError('Unsupported data format')

image = static.data(name='image', shape=[None, 784], dtype='float32')
label = static.data(name='label', shape=[None, 1], dtype='int64')

# Define DataLoader
loader = paddle.fluid.io.DataLoader.from_generator(feed_list=[image, label], capacity=16, iterable=ITERABLE)

# Define network
loss = simple_net(image, label)

places = static.cuda_places() if USE_GPU else static.cpu_places()
set_data_source(loader, places)

exe = static.Executor(places[0])
exe.run(static.default_startup_program())

prog = static.CompiledProgram(static.default_main_program())
if loader.iterable:
    train_iterable(exe, prog, loss, loader)
else:
    train_non_iterable(exe, prog, loss, loader)

Examples 2:

'''
Example in dynamic graph mode.
'''
import numpy as np

import paddle
import paddle.nn as nn
import paddle.optimizer as opt
import paddle.distributed as dist

BATCH_SIZE = 16
BATCH_NUM = 4
EPOCH_NUM = 4

IMAGE_SIZE = 784
CLASS_NUM = 10

USE_GPU = False # whether to use GPU

def _get_random_images_and_labels(image_shape, label_shape):
        image = np.random.random(size=image_shape).astype('float32')
        label = np.random.random(size=label_shape).astype('int64')
        return image, label

def __reader__():
        for _ in range(BATCH_NUM):
            batch_image, batch_label = _get_random_images_and_labels(
                [BATCH_SIZE, IMAGE_SIZE], [BATCH_SIZE, CLASS_NUM])
            yield batch_image, batch_label

def random_batch_reader():
    return __reader__

class LinearNet(nn.Layer):
    def __init__(self):
        super().__init__()
        self._linear = nn.Linear(IMAGE_SIZE, CLASS_NUM)

    @paddle.jit.to_static
    def forward(self, x):
        return self._linear(x)

# set device
paddle.set_device('gpu' if USE_GPU else 'cpu')

# create network
layer = LinearNet()
dp_layer = paddle.DataParallel(layer)
loss_fn = nn.CrossEntropyLoss()
adam = opt.Adam(learning_rate=0.001, parameters=dp_layer.parameters())

# create data loader
loader = paddle.fluid.io.DataLoader.from_generator(capacity=5)
loader.set_batch_generator(random_batch_reader())

for epoch_id in range(EPOCH_NUM):
    for batch_id, (image, label) in enumerate(loader()):
        out = layer(image)
        loss = loss_fn(out, label)

        loss.backward()

        adam.step()
        adam.clear_grad()
        print("Epoch {} batch {}: loss = {}".format(
            epoch_id, batch_id, np.mean(loss.numpy())))
static from_dataset ( dataset, places, drop_last=True )

from_dataset

Warning

This API will be deprecated in the future, it is recommended to use paddle.io.DataLoader which supports multi-processes acceleration.

Create an iterable DataLoader object for loading data from Dataset. Dataset is only supported in Linux system currently.

Parameters
  • dataset (InMemoryDataset|QueueDataset) – the dataset object.

  • places (list(CUDAPlace)|list(CPUPlace)|list(str)) – places where the result data should be converted. If places is list of string, the string in the list can be cpu, gpu:x and gpu_pinned, where x is the index of the GPUs.

  • drop_last (bool, optional) – whether to drop the last batch whose sample number is less than batch size. If drop_last = True, they would be dropped. If drop_last = False, they would be kept.

Returns

the created DataLoader object, which can be

treated as a Python generator.

Return type

loader (DataLoader)

Examples

import paddle
import paddle.static as static

paddle.enable_static()

image = static.data(name='image', shape=[None, 784], dtype='float32')
label = static.data(name='label', shape=[None, 1], dtype='int64')

dataset = paddle.distributed.QueueDataset()
dataset.init(
    batch_size=32,
    pipe_command='cat',
    use_var=[image, label])
dataset.set_filelist(['a.txt', 'b.txt', 'c.txt'])

loader = paddle.fluid.io.DataLoader.from_dataset(dataset, static.cpu_places())