Use NumPy Arrays as Training Data

In PaddlePaddle Fluid, you declare input data layers with fluid.data(). You can then supply training data either as NumPy arrays or as fluid.LoDTensor objects (C++ tensors created from Python), and feed them to fluid.Executor or fluid.ParallelExecutor through Executor.run(feed=...).

Configure Data Layer

Use fluid.data() to declare the input data layers of a neural network. For example:

import paddle.fluid as fluid

image = fluid.data(name="image", shape=[None, 3, 224, 224])
label = fluid.data(name="label", shape=[None, 1], dtype="int64")

# use image/label as layer input
prediction = fluid.layers.fc(input=image, size=1000, act="softmax")
loss = fluid.layers.cross_entropy(input=prediction, label=label)
...

In the code above, image and label are two input data layers created by fluid.data. image holds float32 data of shape [None, 3, 224, 224]; label holds int64 data of shape [None, 1]. Note that:

  1. At run time, the executor checks that the shape and dtype of the fed data match those declared for the data layer. If they do not match, the program exits with an error. In some tasks, a dimension varies between training steps. In that case, set the dimension to None. For example, use the shape [None, 3, 224, 224] when the 0th dimension varies.

  2. Category labels in Fluid use the int64 data type, and label values start from 0. For the supported data types, see Data types supported by Fluid.
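
The matching rule in note 1 can be illustrated with a small stand-alone sketch. The helper shape_matches is hypothetical and is not part of Fluid; it only mirrors the idea that a None entry in the declared shape accepts any size in that dimension, and it makes no claim about how the executor implements its check.

```python
def shape_matches(declared, actual):
    """Return True if `actual` is compatible with `declared`.

    Hypothetical helper for illustration only. A None entry in
    `declared` matches any size in that dimension.
    """
    if len(declared) != len(actual):
        return False
    return all(d is None or d == a for d, a in zip(declared, actual))


declared = [None, 3, 224, 224]
print(shape_matches(declared, (32, 3, 224, 224)))  # True
print(shape_matches(declared, (16, 3, 224, 224)))  # True
print(shape_matches(declared, (32, 1, 224, 224)))  # False
```

Only the dimension declared as None may differ between steps; every other dimension, and the number of dimensions, must agree.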

Transfer Train Data to Executor

Both Executor.run and ParallelExecutor.run accept a feed parameter, which is a Python dict. Each key is the name of a data layer, such as image in the code above, and each value is the corresponding NumPy array.

For example:

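import numpy

exe = fluid.Executor(fluid.CPUPlace())
# Initialize parameters by running the startup program once
exe.run(fluid.default_startup_program())
# Run the default main program on a random mini-batch of 32 samples
exe.run(feed={
   "image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
   # Draw class ids in [0, 1000); casting uniform [0, 1) floats to int64 would give all zeros
   "label": numpy.random.randint(0, 1000, size=(32, 1)).astype('int64')
})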
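
Real data rarely arrives in exactly the dtype and shape that the data layers declare. The following is a minimal sketch of preparing a feed dict with NumPy alone, assuming hypothetical raw inputs (raw_images as uint8 pixels and raw_labels as a flat vector of class ids). Scaling pixels by 255 is an illustrative choice, not something Fluid requires.

```python
import numpy

# Hypothetical raw data: 32 RGB images stored as uint8 pixels,
# and a flat vector of 32 class ids in [0, 1000).
raw_images = numpy.random.randint(0, 256, size=(32, 3, 224, 224), dtype="uint8")
raw_labels = numpy.random.randint(0, 1000, size=32)

feed = {
    # Cast to float32 to match the declared dtype of the image layer.
    "image": raw_images.astype("float32") / 255.0,
    # Cast to int64 and reshape to (N, 1) to match the label layer.
    "label": raw_labels.astype("int64").reshape(-1, 1),
}

for name, array in feed.items():
    print(name, array.shape, array.dtype)
```

The resulting dict has the same keys, shapes and dtypes as the one passed to exe.run above, so it could be supplied as the feed argument.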

Advanced Usage

How to feed Sequence Data

Sequence data is a data type with dedicated support in PaddlePaddle Fluid. To feed it, use LoDTensor as the input data type.

You need to provide:

  1. All the data in the mini-batch to be trained.

  2. The length of each sequence.

You can create a LoDTensor with fluid.create_lod_tensor.

To feed sequence information, you must also set lod_level, the nesting depth of the sequences, on the data layer.

For instance, if the training data are sentences consisting of words, lod_level=1; if the training data are paragraphs consisting of sentences, which in turn consist of words, lod_level=2.

For example:

sentence = fluid.data(name="sentence", dtype="int64", shape=[None, 1], lod_level=1)

...

exe.run(feed={
  "sentence": fluid.create_lod_tensor(
    data=numpy.array([1, 3, 4, 5, 3, 6, 8], dtype='int64').reshape(-1, 1),
    recursive_seq_lens=[[4, 1, 2]],
    place=fluid.CPUPlace()
  )
})

The training data fed to sentence contains three samples, of lengths 4, 1 and 2. They correspond to data[0:4], data[4:5] and data[5:7] respectively.
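
The mapping from sequence lengths to slices is a running sum. The sketch below reproduces it in plain Python; seq_lens_to_slices is a hypothetical helper written for illustration and is not a Fluid API.

```python
from itertools import accumulate


def seq_lens_to_slices(seq_lens):
    """Convert per-sequence lengths into (start, end) index pairs."""
    offsets = [0] + list(accumulate(seq_lens))
    return list(zip(offsets[:-1], offsets[1:]))


data = [1, 3, 4, 5, 3, 6, 8]
slices = seq_lens_to_slices([4, 1, 2])
print(slices)  # [(0, 4), (4, 5), (5, 7)]

sequences = [data[start:end] for start, end in slices]
print(sequences)  # [[1, 3, 4, 5], [3], [6, 8]]
```

The sum of the lengths must equal the number of rows in the data, here 4 + 1 + 2 = 7.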

How to prepare training data for every device in ParallelExecutor

When you feed data through ParallelExecutor.run(feed=...), you can explicitly assign data to each training device (such as a GPU).

To do so, pass a list to feed. Each element of the list is a dict for one device.

The keys of each dict are the names of the data layers, and the values are the data for those layers.

For example:

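# use_cuda is a required argument; loss is the loss variable defined earlier
parallel_executor = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)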
parallel_executor.run(
  feed=[
     {
       "image": numpy.random.random(size=(32, 3, 224, 224)).astype('float32'),
       "label": numpy.random.random(size=(32, 1)).astype('int64')
     },
     {
       "image": numpy.random.random(size=(16, 3, 224, 224)).astype('float32'),
       "label": numpy.random.random(size=(16, 1)).astype('int64')
     },
  ]
)

In the code above, GPU 0 trains on 32 samples and GPU 1 trains on 16 samples.
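
If the data starts out as a single mini-batch, the per-device list can be built by slicing. The following is a minimal sketch assuming a hypothetical helper, split_feed, that is not part of Fluid. It works on any sliceable values, so plain lists stand in for NumPy arrays here, which slice along the first dimension in the same way.

```python
def split_feed(feed, sizes):
    """Split one feed dict into a list of per-device feed dicts.

    Hypothetical helper for illustration only. `sizes` gives the
    number of samples assigned to each device, in device order.
    """
    total = sum(sizes)
    for name, value in feed.items():
        if len(value) != total:
            raise ValueError(
                "%r has %d samples, expected %d" % (name, len(value), total))
    per_device, start = [], 0
    for size in sizes:
        per_device.append(
            {name: value[start:start + size] for name, value in feed.items()})
        start += size
    return per_device


# Stand-in batch of 48 samples; real code would hold NumPy arrays.
batch = {"image": list(range(48)), "label": list(range(48))}
feed_list = split_feed(batch, sizes=[32, 16])
print([len(d["image"]) for d in feed_list])  # [32, 16]
```

The returned list has one dict per device, which is the structure that the feed argument expects in the example above.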

Data types supported by Fluid

PaddlePaddle Fluid supports the following data types:

  • float16: supported by only some operations

  • float32: the primary floating-point type

  • float64: a secondary floating-point type, supported by most operations

  • int32: a secondary type for labels

  • int64: the primary type for labels

  • uint64: a secondary type for labels

  • bool: used for control-flow data

  • int16: a secondary type for labels

  • uint8: an input data type, used for image pixels