MultiSlotDataGenerator

class paddle.distributed.fleet.MultiSlotDataGenerator

generate_batch(samples)

Override this function to process the samples produced by generate_sample(self, str). It is typically used for batch-level preprocessing, for example padding every sample to the length of the longest sample in the batch.

Parameters

samples (list | tuple) – the samples generated by generate_sample

Returns

A Python generator that yields data in the same format as the return value of generate_sample.

Example

import paddle.distributed.fleet.data_generator as dg

class MyData(dg.MultiSlotDataGenerator):

    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            # Each sample is a list of (name, [feasign, ...]) pairs.
            yield [("words", int_words)]
        return local_iter

    def generate_batch(self, samples):
        def local_iter():
            for s in samples:
                name, words = s[0]
                # Append a copy of the first word to the end of the slot.
                yield [(name, words + [words[0]])]
        return local_iter

mydata = MyData()
mydata.set_batch(128)
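
The padding use case mentioned in the description can be sketched without Paddle. The helper below is hypothetical (`pad_batch`, `slot`, and `pad_value` are illustrative names, not part of the Paddle API). It assumes each sample is a list of `(name, values)` pairs, as returned by generate_sample, and pads one slot to the longest length found in the batch. The same logic could be placed inside an overridden generate_batch.

```python
def pad_batch(samples, slot="words", pad_value=0):
    """Pad one slot of every sample to the longest length in the batch.

    Hypothetical helper, not part of Paddle. Each sample is assumed to be
    a list of (name, values) pairs, e.g. [("words", [1, 2, 3])].
    """
    # Longest value list for the chosen slot across the whole batch.
    max_len = max(
        len(values)
        for sample in samples
        for name, values in sample
        if name == slot
    )

    def local_iter():
        for sample in samples:
            padded = []
            for name, values in sample:
                if name == slot:
                    # Build a new list so the input sample is not mutated.
                    values = values + [pad_value] * (max_len - len(values))
                padded.append((name, values))
            yield padded

    return local_iter


batch = [[("words", [1, 2, 3])], [("words", [4])]]
padded = list(pad_batch(batch)())
```

Returning the inner `local_iter` function, rather than a list, mirrors the generator-returning convention used by the examples on this page.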

generate_sample(line)

This function needs to be overridden by the user to process the original data row into a list or tuple.

Parameters

line (str) – the original data row

Returns

Returns the data processed by the user.

The data format is list or tuple:

[(name, [feasign, …]), …]

or ((name, [feasign, …]), …)

For example: [("words", [1926, 8, 17]), ("label", [1])]

or (("words", [1926, 8, 17]), ("label", [1]))

Note

Each feasign must be an int or a float. If any float appears among the feasigns of a slot, the whole slot is treated as float.
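
The typing rule in the note can be illustrated with a small standalone check. This is only a sketch: `slot_type` is a hypothetical function, and the type names "uint64" and "float" are assumptions chosen for illustration rather than values confirmed by this page.

```python
def slot_type(feasigns):
    """Return "float" if any feasign is a float, otherwise "uint64".

    Illustrative only: the function and the type names are assumptions.
    """
    for value in feasigns:
        if not isinstance(value, (int, float)):
            raise ValueError(
                f"feasign must be int or float, got {type(value).__name__}"
            )
    # A single float is enough to make the whole slot a float slot.
    if any(isinstance(value, float) for value in feasigns):
        return "float"
    return "uint64"


int_slot = slot_type([1926, 8, 17])
mixed_slot = slot_type([1, 2.5, 3])
```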

Example

import paddle.distributed.fleet.data_generator as dg

class MyData(dg.MultiSlotDataGenerator):

    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            # Yield one sample: a list of (name, [feasign, ...]) pairs.
            yield [("words", int_words)]
        return local_iter

run_from_memory()

This function generates data from memory. It is usually used for debugging and benchmarking.

Example

import paddle.distributed.fleet.data_generator as dg

class MyData(dg.MultiSlotDataGenerator):

    def generate_sample(self, line):
        def local_iter():
            # No input row is read; the sample is produced in memory.
            yield [("words", [1, 2, 3, 4])]
        return local_iter

mydata = MyData()
mydata.run_from_memory()

run_from_stdin()

This function reads data rows from stdin, parses each row with the process function (generate_sample), and then parses the return value of the process function with the _gen_str function. The parsed data is written to stdout and the corresponding protofile is generated.

Example

import paddle.distributed.fleet.data_generator as dg

class MyData(dg.MultiSlotDataGenerator):

    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            # Yield one sample: a list of (name, [feasign, ...]) pairs.
            yield [("words", int_words)]
        return local_iter

mydata = MyData()
mydata.run_from_stdin()
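
The read, parse, and write flow described above can be approximated in plain Python. Everything here is a sketch under assumptions: `parse_line` stands in for a user-defined generate_sample, `to_text` is a hypothetical serializer whose layout (slot length followed by the slot values) is assumed rather than taken from this page, and protofile generation is left out.

```python
import io


def parse_line(line):
    """Stand-in for a user-defined generate_sample: one sample per row."""
    def local_iter():
        yield [("words", [int(x) for x in line.split()])]
    return local_iter


def to_text(sample):
    """Hypothetical serializer: each slot becomes its length, then its values."""
    parts = []
    for _name, values in sample:
        parts.append(str(len(values)))
        parts.extend(str(v) for v in values)
    return " ".join(parts) + "\n"


def run(stream_in, stream_out):
    # Read rows, parse each one, and write the serialized samples.
    for line in stream_in:
        for sample in parse_line(line)():
            stream_out.write(to_text(sample))


out = io.StringIO()
run(io.StringIO("1 2 3\n7 8\n"), out)
result = out.getvalue()
```

Passing `sys.stdin` and `sys.stdout` to `run` instead of the in-memory streams would give the same pipe-style behavior on the command line.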

set_batch(batch_size)

Set the batch size of the current DataGenerator. This is only necessary if the user overrides generate_batch.

Example

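import paddle.distributed.fleet.data_generator as dg

class MyData(dg.MultiSlotDataGenerator):

    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            # Each sample is a list of (name, [feasign, ...]) pairs.
            yield [("words", int_words)]
        return local_iter

    def generate_batch(self, samples):
        def local_iter():
            for s in samples:
                name, words = s[0]
                # Append a copy of the first word to the end of the slot.
                yield [(name, words + [words[0]])]
        return local_iter

mydata = MyData()
# Samples are handed to generate_batch in groups of up to 128.
mydata.set_batch(128)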
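
To show why the batch size only matters when generate_batch is overridden, the grouping step can be sketched as follows. `iter_batches` is a hypothetical helper, not Paddle's implementation; it assumes samples are collected until the batch size is reached and that a final, smaller batch is still emitted.

```python
def iter_batches(samples, batch_size):
    """Group an iterable of samples into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


batches = list(iter_batches(range(5), batch_size=2))
```

Each list produced this way corresponds to the `samples` argument that an overridden generate_batch would receive.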