MultiSlotDataGenerator¶
- class paddle.distributed.fleet.MultiSlotDataGenerator [source]
generate_batch(samples)¶
This function needs to be overridden by the user to process the samples generated by the generate_sample(self, line) function. It is typically used for batch-level preprocessing, e.g. padding each sample in a batch according to the length of the longest sample.
- Parameters
samples (list|tuple) – samples generated by generate_sample
- Returns
a Python generator yielding data in the same format as the return value of generate_sample
Example
import paddle.distributed.fleet.data_generator as dg

class MyData(dg.DataGenerator):
    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            yield ("words", int_words)
        return local_iter

    def generate_batch(self, samples):
        def local_iter():
            for s in samples:
                # list.extend returns None, so extend in place first, then yield
                s[1].extend([s[1][0]])
                yield ("words", s[1])
        return local_iter

mydata = MyData()
mydata.set_batch(128)
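The batch-level preprocessing that generate_batch is meant for can be shown without paddle at all. The sketch below is a minimal, paddle-free illustration of padding every sample in a batch to the length of the longest one; the function name pad_batch and the pad value 0 are arbitrary choices for this example, not part of the fleet API.

```python
def pad_batch(samples, pad_value=0):
    """Pad every (name, feasigns) sample to the longest feasign list in the batch."""
    max_len = max(len(feasigns) for _, feasigns in samples)
    for name, feasigns in samples:
        # append pad_value until the sample reaches max_len
        padded = feasigns + [pad_value] * (max_len - len(feasigns))
        yield (name, padded)

batch = [("words", [3, 1]), ("words", [7, 8, 9])]
print(list(pad_batch(batch)))
# → [('words', [3, 1, 0]), ('words', [7, 8, 9])]
```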
generate_sample(line)¶
This function needs to be overridden by the user to process the original data row into a list or tuple.
- Parameters
line (str) – the original data row
- Returns
The data processed by the user, as a list or tuple:
[(name, [feasign, ...]), ...]
or ((name, [feasign, ...]), ...)
For example: [("words", [1926, 8, 17]), ("label", [1])]
or (("words", [1926, 8, 17]), ("label", [1]))
Note
Each feasign must be an int or a float. If any float appears among a slot's feasigns, the whole slot is processed as float.
Example
import paddle.distributed.fleet.data_generator as dg

class MyData(dg.DataGenerator):
    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            yield ("words", int_words)
        return local_iter
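The float-promotion rule from the note above can be illustrated in plain Python. This is only an approximation of the described behavior, not fleet's internal code; the helper names slot_dtype and promote_slot are made up for this sketch.

```python
def slot_dtype(feasigns):
    """Return float if any feasign in the slot is a float, else int."""
    return float if any(isinstance(v, float) for v in feasigns) else int

def promote_slot(feasigns):
    """Convert every feasign in the slot to the promoted type."""
    dtype = slot_dtype(feasigns)
    return [dtype(v) for v in feasigns]

print(promote_slot([1926, 8, 17]))    # → [1926, 8, 17] (all ints stay ints)
print(promote_slot([1926, 8.0, 17]))  # → [1926.0, 8.0, 17.0] (one float promotes the slot)
```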
run_from_memory()¶
This function generates data from memory; it is usually used for debugging and benchmarking.
Example
import paddle.distributed.fleet.data_generator as dg

class MyData(dg.DataGenerator):
    def generate_sample(self, line):
        def local_iter():
            yield ("words", [1, 2, 3, 4])
        return local_iter

mydata = MyData()
mydata.run_from_memory()
run_from_stdin()¶
This function reads data rows from stdin, parses them with the process function, and further parses the return value of the process function with the _gen_str function. The parsed data is written to stdout, and the corresponding protofile is generated.
Example
import paddle.distributed.fleet.data_generator as dg

class MyData(dg.DataGenerator):
    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            yield ("words", int_words)
        return local_iter

mydata = MyData()
mydata.run_from_stdin()
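The read-parse-serialize pipeline that run_from_stdin drives can be sketched without paddle. In the sketch below, parse_line plays the role of the user's generate_sample logic and gen_str stands in for serialization; the textual format used here ("<count> <feasign> ..." per slot) is invented for illustration and is not fleet's actual _gen_str wire format.

```python
import io

def parse_line(line):
    """Parse one data row into a list of (name, feasigns) slots."""
    int_words = [int(x) for x in line.split()]
    return [("words", int_words)]

def gen_str(slots):
    """Serialize slots as '<count> <feasign> ...' per slot (hypothetical format)."""
    return " ".join(
        str(len(feasigns)) + " " + " ".join(map(str, feasigns))
        for _, feasigns in slots
    )

# stand-in for sys.stdin so the sketch is self-contained
stdin = io.StringIO("1 2 3\n4 5\n")
for row in stdin:
    print(gen_str(parse_line(row)))
# → 3 1 2 3
# → 2 4 5
```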
set_batch(batch_size)¶
Set the batch size of the current DataGenerator. This is necessary only if the user wants to define generate_batch.
Example
import paddle.distributed.fleet.data_generator as dg

class MyData(dg.DataGenerator):
    def generate_sample(self, line):
        def local_iter():
            int_words = [int(x) for x in line.split()]
            yield ("words", int_words)
        return local_iter

    def generate_batch(self, samples):
        def local_iter():
            for s in samples:
                # list.extend returns None, so extend in place first, then yield
                s[1].extend([s[1][0]])
                yield ("words", s[1])
        return local_iter

mydata = MyData()
mydata.set_batch(128)