MultiSlotStringDataGenerator¶
class paddle.distributed.fleet.MultiSlotStringDataGenerator [source]
generate_batch(samples)¶
This function needs to be overridden by the user to process the samples produced by the generate_sample(self, line) function. It is typically used for batch-level preprocessing, e.g. padding each sample according to the maximum sample length in the batch.
Parameters
    samples (list|tuple) – samples generated by generate_sample
Returns
    a Python generator, yielding data in the same format as the return value of generate_sample
Example
>>> import paddle.distributed.fleet.data_generator as dg
>>> class MyData(dg.DataGenerator):
...     def generate_sample(self, line):
...         def local_iter():
...             int_words = [int(x) for x in line.split()]
...             yield ("words", int_words)
...         return local_iter
...
...     def generate_batch(self, samples):
...         def local_iter():
...             for s in samples:
...                 # append a copy of the first feasign, then yield the slot
...                 s[1].extend([s[1][0]])
...                 yield ("words", s[1])
...         return local_iter
>>> mydata = MyData()
>>> mydata.set_batch(128)
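The padding use case mentioned above can be sketched as a standalone helper. This is a minimal illustration, not part of the Paddle API; pad_batch and pad_value are hypothetical names:

```python
# Hypothetical helper illustrating batch padding: each sample's feasign list
# is right-padded with pad_value up to the length of the longest sample.
def pad_batch(samples, pad_value=0):
    max_len = max(len(feasigns) for _, feasigns in samples)
    return [(name, feasigns + [pad_value] * (max_len - len(feasigns)))
            for name, feasigns in samples]

batch = [("words", [1, 2]), ("words", [3, 4, 5])]
print(pad_batch(batch))  # [('words', [1, 2, 0]), ('words', [3, 4, 5])]
```

Inside a real generate_batch override, this logic would run in the inner local_iter before yielding each slot.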
generate_sample(line)¶
This function needs to be overridden by the user to process the original data row into a list or tuple.
Parameters
    line (str) – the original data row
Returns
    the data processed by the user, as a list or tuple: [(name, [feasign, ...]), ...] or ((name, [feasign, ...]), ...)
For example: [("words", [1926, 8, 17]), ("label", [1])] or (("words", [1926, 8, 17]), ("label", [1]))
Note
The type of each feasign must be int or float. If any float appears among a slot's feasigns, the whole slot is processed as float.
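The note above can be illustrated with a small standalone sketch of the typing rule. infer_slot_type is a hypothetical name for illustration only, not a Paddle function:

```python
# Hypothetical illustration of the slot-typing rule: a slot is treated as
# float as soon as any of its feasigns is a float, otherwise it stays int.
def infer_slot_type(feasigns):
    return float if any(isinstance(v, float) for v in feasigns) else int

print(infer_slot_type([1926, 8, 17]))    # <class 'int'>
print(infer_slot_type([1926, 8.5, 17]))  # <class 'float'>
```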
Example
>>> import paddle.distributed.fleet.data_generator as dg
>>> class MyData(dg.DataGenerator):
...     def generate_sample(self, line):
...         def local_iter():
...             int_words = [int(x) for x in line.split()]
...             yield ("words", int_words)
...         return local_iter
run_from_memory()¶
This function generates data from memory; it is usually used for debugging and benchmarking.
Example
>>> import paddle.distributed.fleet.data_generator as dg
>>> class MyData(dg.DataGenerator):
...     def generate_sample(self, line):
...         def local_iter():
...             yield ("words", [1, 2, 3, 4])
...         return local_iter
>>> mydata = MyData()
>>> mydata.run_from_memory()
run_from_stdin()¶
This function reads each data row from stdin, parses it with the process function, and further parses the return value of the process function with the _gen_str function. The parsed data is written to stdout and the corresponding protofile is generated.
Example
>>> import paddle.distributed.fleet.data_generator as dg
>>> class MyData(dg.DataGenerator):
...     def generate_sample(self, line):
...         def local_iter():
...             int_words = [int(x) for x in line.split()]
...             yield ("words", int_words)
...         return local_iter
>>> mydata = MyData()
>>> mydata.run_from_stdin()
set_batch(batch_size)¶
Sets the batch size of the current DataGenerator. This is necessary only if the user overrides generate_batch.
Example
>>> import paddle.distributed.fleet.data_generator as dg
>>> class MyData(dg.DataGenerator):
...     def generate_sample(self, line):
...         def local_iter():
...             int_words = [int(x) for x in line.split()]
...             yield ("words", int_words)
...         return local_iter
...
...     def generate_batch(self, samples):
...         def local_iter():
...             for s in samples:
...                 # append a copy of the first feasign, then yield the slot
...                 s[1].extend([s[1][0]])
...                 yield ("words", s[1])
...         return local_iter
>>> mydata = MyData()
>>> mydata.set_batch(128)