DataLoader
class paddle.fluid.reader.DataLoader
static from_generator(feed_list=None, capacity=None, use_double_buffer=True, iterable=True, return_list=False, use_multiprocess=False, drop_last=True)
Warning
This API will be deprecated in a future release. Use paddle.io.DataLoader instead, which supports multi-process acceleration.

Note
The framework ensures that DataLoader yields data in exactly the same order as the user-defined data source.

Creates a DataLoader object that loads data from a Python generator. Data is prefetched by a Python thread and pushed into a queue asynchronously.
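The prefetching behaviour described above can be pictured with a small standard-library sketch. This is not Paddle's implementation: the `prefetch` function, the `_END` sentinel and the `batches` generator are hypothetical names introduced only for illustration. It shows how a single background thread feeding a bounded FIFO queue keeps the original order while the consumer works on earlier items.

```python
import queue
import threading

_END = object()  # hypothetical sentinel marking that the generator is exhausted


def prefetch(generator_fn, capacity):
    """Yield items from generator_fn() in order, produced by a background thread."""
    # Bounded queue: the producer blocks once `capacity` items are waiting.
    buffer = queue.Queue(maxsize=capacity)

    def producer():
        for item in generator_fn():
            buffer.put(item)
        buffer.put(_END)

    # Sketch only: exceptions raised inside the producer are not propagated.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _END:
            return
        yield item


def batches():
    for i in range(5):
        yield [i] * 3  # stand-in for one batch of data


result = list(prefetch(batches, capacity=2))
print(result)
```

Because there is one producer, one consumer and a FIFO queue, the consumer sees the items in the order the generator produced them. The `capacity` argument here plays the same role as the queue capacity described in the parameter list: a larger value lets a fast reader run further ahead.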
The created DataLoader object provides three methods for setting the data source: set_sample_generator, set_sample_list_generator and set_batch_generator. See the example code below for their usage.

If iterable=True, the created DataLoader object is a Python generator object that can be iterated with a for loop.

If iterable=False, the created DataLoader object provides the start() and reset() methods to control the data reading process.
- Parameters
-
feed_list (list(Tensor)|tuple(Tensor)) – the list of feed Tensors. The Tensors should be created by paddle.static.data().
capacity (int) – the capacity of the queue maintained in the DataLoader, measured in batches. Set a larger capacity if your reader is fast.
use_double_buffer (bool, optional) – whether to use double_buffer_reader. If use_double_buffer=True, the DataLoader prefetches the next batch asynchronously, which speeds up data feeding at the cost of a little more CPU or GPU memory, namely the memory of one batch of input data. The default value is True.
iterable (bool, optional) – whether the created DataLoader is iterable. The default value is True.
return_list (bool, optional) – whether the return value on each device is presented as a list. It is only valid when iterable=True. If return_list=False, the return value on each device is a dict of str -> LoDTensor, where each key is the name of a fed Tensor. If return_list=True, the return value on each device is a list(LoDTensor). Using return_list=False is recommended in static graph mode, and return_list=True in dygraph mode. The default value is False.
use_multiprocess (bool, optional) – whether to use multiple processes to speed up data loading in dygraph mode. Note: this parameter can only be used in dygraph mode. In static graph mode, setting it has no effect. The default value is False.
drop_last (bool, optional) – whether to drop the last batches when their number is less than the number of CPU cores/GPU cards. The default value is True. In the training phase, users should not set drop_last=False, because all CPU cores/GPU cards must read data from the DataLoader. In the inference phase, users can set drop_last=False so that the last batches, whose number is less than the number of CPU cores/GPU cards, can still be tested.
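The effect of drop_last across several devices can be sketched in plain Python. This is an illustration of the semantics described above, not Paddle's code: `group_for_places` is a hypothetical helper, and it assumes that each iteration hands exactly one batch to every place.

```python
def group_for_places(batches, num_places, drop_last=True):
    """Split batches into rounds, one batch per place in each round."""
    rounds = [batches[i:i + num_places]
              for i in range(0, len(batches), num_places)]
    # The final round is incomplete when fewer batches remain than places.
    if drop_last and rounds and len(rounds[-1]) < num_places:
        rounds.pop()
    return rounds


batches = ['b0', 'b1', 'b2', 'b3', 'b4']

# Training: every place must receive a batch, so the leftover 'b4' is dropped.
training_rounds = group_for_places(batches, num_places=2, drop_last=True)

# Inference: the leftover batch is kept so that it can still be evaluated.
inference_rounds = group_for_places(batches, num_places=2, drop_last=False)

print(training_rounds)
print(inference_rounds)
```

With five batches and two places, the training case keeps two full rounds, while the inference case adds a final round containing only one batch.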
- Returns
-
the created DataLoader object.
- Return type
-
loader (DataLoader)
Example 1:
```python
'''
Example in static graph mode
'''
import numpy as np

import paddle
import paddle.static as static
import paddle.nn.functional as F

BATCH_NUM = 10
BATCH_SIZE = 16
EPOCH_NUM = 4

CLASS_NUM = 10

ITERABLE = True  # whether the created DataLoader object is iterable
USE_GPU = False  # whether to use GPU

DATA_FORMAT = 'batch_generator'  # data format of data source user provides

paddle.enable_static()


def simple_net(image, label):
    fc_tmp = static.nn.fc(image, size=CLASS_NUM)
    # Compute the loss on the network output (fc_tmp), not on the raw input.
    cross_entropy = F.softmax_with_cross_entropy(fc_tmp, label)
    loss = paddle.mean(cross_entropy)
    sgd = paddle.optimizer.SGD(learning_rate=1e-3)
    sgd.minimize(loss)
    return loss


def get_random_images_and_labels(image_shape, label_shape):
    image = np.random.random(size=image_shape).astype('float32')
    label = np.random.random(size=label_shape).astype('int64')
    return image, label


# If the data generator yields one sample each time,
# use DataLoader.set_sample_generator to set the data source.
def sample_generator_creator():
    def __reader__():
        for _ in range(BATCH_NUM * BATCH_SIZE):
            image, label = get_random_images_and_labels([784], [1])
            yield image, label

    return __reader__


# If the data generator yields a list of samples each time,
# use DataLoader.set_sample_list_generator to set the data source.
def sample_list_generator_creator():
    def __reader__():
        for _ in range(BATCH_NUM):
            sample_list = []
            for _ in range(BATCH_SIZE):
                image, label = get_random_images_and_labels([784], [1])
                sample_list.append([image, label])

            yield sample_list

    return __reader__


# If the data generator yields a batch each time,
# use DataLoader.set_batch_generator to set the data source.
def batch_generator_creator():
    def __reader__():
        for _ in range(BATCH_NUM):
            batch_image, batch_label = get_random_images_and_labels(
                [BATCH_SIZE, 784], [BATCH_SIZE, 1])
            yield batch_image, batch_label

    return __reader__


# If DataLoader is iterable, use a for loop to train the network
def train_iterable(exe, prog, loss, loader):
    for _ in range(EPOCH_NUM):
        for data in loader():
            exe.run(prog, feed=data, fetch_list=[loss])


# If DataLoader is not iterable, use start() and reset() to control the process
def train_non_iterable(exe, prog, loss, loader):
    for _ in range(EPOCH_NUM):
        loader.start()  # call DataLoader.start() before each epoch starts
        try:
            while True:
                exe.run(prog, fetch_list=[loss])
        except paddle.core.EOFException:
            loader.reset()  # call DataLoader.reset() after catching EOFException


def set_data_source(loader, places):
    if DATA_FORMAT == 'sample_generator':
        loader.set_sample_generator(sample_generator_creator(),
                                    batch_size=BATCH_SIZE,
                                    drop_last=True,
                                    places=places)
    elif DATA_FORMAT == 'sample_list_generator':
        loader.set_sample_list_generator(sample_list_generator_creator(),
                                         places=places)
    elif DATA_FORMAT == 'batch_generator':
        loader.set_batch_generator(batch_generator_creator(), places=places)
    else:
        raise ValueError('Unsupported data format')


image = static.data(name='image', shape=[None, 784], dtype='float32')
label = static.data(name='label', shape=[None, 1], dtype='int64')

# Define DataLoader
loader = paddle.fluid.io.DataLoader.from_generator(
    feed_list=[image, label], capacity=16, iterable=ITERABLE)

# Define network
loss = simple_net(image, label)

places = static.cuda_places() if USE_GPU else static.cpu_places()
set_data_source(loader, places)

exe = static.Executor(places[0])
exe.run(static.default_startup_program())

prog = static.CompiledProgram(static.default_main_program())

if loader.iterable:
    train_iterable(exe, prog, loss, loader)
else:
    train_non_iterable(exe, prog, loss, loader)
```
Example 2:
```python
'''
Example in dynamic graph mode.
'''
import numpy as np

import paddle
import paddle.nn as nn
import paddle.optimizer as opt
import paddle.distributed as dist

BATCH_SIZE = 16
BATCH_NUM = 4
EPOCH_NUM = 4

IMAGE_SIZE = 784
CLASS_NUM = 10

USE_GPU = False  # whether to use GPU


def _get_random_images_and_labels(image_shape, label_shape):
    image = np.random.random(size=image_shape).astype('float32')
    label = np.random.random(size=label_shape).astype('int64')
    return image, label


def __reader__():
    for _ in range(BATCH_NUM):
        # Hard labels for CrossEntropyLoss hold one class index per sample,
        # so the label batch has shape [BATCH_SIZE, 1].
        batch_image, batch_label = _get_random_images_and_labels(
            [BATCH_SIZE, IMAGE_SIZE], [BATCH_SIZE, 1])
        yield batch_image, batch_label


def random_batch_reader():
    return __reader__


class LinearNet(nn.Layer):
    def __init__(self):
        super().__init__()
        self._linear = nn.Linear(IMAGE_SIZE, CLASS_NUM)

    @paddle.jit.to_static
    def forward(self, x):
        return self._linear(x)


# set device
paddle.set_device('gpu' if USE_GPU else 'cpu')

# create network
layer = LinearNet()
dp_layer = paddle.DataParallel(layer)
loss_fn = nn.CrossEntropyLoss()
adam = opt.Adam(learning_rate=0.001, parameters=dp_layer.parameters())

# create data loader
loader = paddle.fluid.io.DataLoader.from_generator(capacity=5)
loader.set_batch_generator(random_batch_reader())

for epoch_id in range(EPOCH_NUM):
    for batch_id, (image, label) in enumerate(loader()):
        out = layer(image)
        loss = loss_fn(out, label)

        loss.backward()

        adam.step()
        adam.clear_grad()
        print("Epoch {} batch {}: loss = {}".format(
            epoch_id, batch_id, np.mean(loss.numpy())))
```
static from_dataset(dataset, places, drop_last=True)
Warning
This API will be deprecated in a future release. Use paddle.io.DataLoader instead, which supports multi-process acceleration.

Creates an iterable DataLoader object that loads data from a Dataset. Dataset is currently supported only on Linux.
- Parameters
-
dataset (InMemoryDataset|QueueDataset) – the dataset object.
places (list(CUDAPlace)|list(CPUPlace)|list(str)) – the places to which the result data should be converted. If places is a list of strings, each string can be cpu, gpu:x or gpu_pinned, where x is the index of the GPU.
drop_last (bool, optional) – whether to drop the last batch when its sample number is less than the batch size. If drop_last=True, it is dropped. If drop_last=False, it is kept. The default value is True.
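For from_dataset, drop_last applies to samples within a batch rather than to batches across devices. A minimal sketch of that rule follows; `batch_samples` is a hypothetical helper written for illustration and is not part of Paddle.

```python
def batch_samples(samples, batch_size, drop_last=True):
    """Group samples into batches of batch_size, optionally dropping a short tail."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A non-empty remainder is a final batch smaller than batch_size.
    if batch and not drop_last:
        yield batch


dropped = list(batch_samples(range(7), batch_size=3, drop_last=True))
kept = list(batch_samples(range(7), batch_size=3, drop_last=False))

print(dropped)
print(kept)
```

With seven samples and a batch size of three, drop_last=True discards the single leftover sample, while drop_last=False keeps it as a final batch of size one.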
- Returns
-
the created DataLoader object, which can be treated as a Python generator.
- Return type
-
loader (DataLoader)
Examples
```python
import paddle
import paddle.static as static

paddle.enable_static()

image = static.data(name='image', shape=[None, 784], dtype='float32')
label = static.data(name='label', shape=[None, 1], dtype='int64')

dataset = paddle.distributed.QueueDataset()
dataset.init(
    batch_size=32,
    pipe_command='cat',
    use_var=[image, label])
dataset.set_filelist(['a.txt', 'b.txt', 'c.txt'])

loader = paddle.fluid.io.DataLoader.from_dataset(dataset, static.cpu_places())
```