InMemoryDataset

class paddle.fluid.dataset.InMemoryDataset [source]

InMemoryDataset loads data into memory and shuffles it before training. Instances of this class should be created by DatasetFactory.

Example

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")

set_feed_type(data_feed_type)

Set the data feed type of this dataset (stored in its data_feed_desc).
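The other methods on this page come with usage examples, so a sketch for this one as well. Hedged: the feed type string "MultiSlotInMemoryDataFeed" is an assumption, not confirmed by this page; check the DataFeed types available in your Paddle version. The import is guarded so the snippet degrades gracefully when PaddlePaddle is not installed.

```python
# Hedged usage sketch for set_feed_type. The feed type name
# "MultiSlotInMemoryDataFeed" is an assumption, not confirmed by this page.
try:
    import paddle.fluid as fluid  # legacy fluid API

    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    dataset.set_feed_type("MultiSlotInMemoryDataFeed")
    configured = True
except ImportError:
    configured = False  # PaddlePaddle not installed; snippet is illustrative
```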

set_queue_num(queue_num)

Set the number of output queues for the Dataset; training threads fetch data from these queues.

- Parameters
  - queue_num (int) – number of dataset output queues

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_queue_num(12)

set_parse_ins_id(parse_ins_id)

Set whether the Dataset needs to parse ins_id.

- Parameters
  - parse_ins_id (bool) – whether to parse ins_id

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_parse_ins_id(True)

set_parse_content(parse_content)

Set whether the Dataset needs to parse content.

- Parameters
  - parse_content (bool) – whether to parse content

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_parse_content(True)

set_parse_logkey(parse_logkey)

Set whether the Dataset needs to parse logkey.

- Parameters
  - parse_logkey (bool) – whether to parse logkey

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_parse_logkey(True)

set_merge_by_sid(merge_by_sid)

Set whether the Dataset merges instances by sid. If not, one instance is treated as one Pv.

- Parameters
  - merge_by_sid (bool) – whether to merge by sid

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_merge_by_sid(True)

set_enable_pv_merge(enable_pv_merge)

Set whether the Dataset needs to merge instances into Pv.

- Parameters
  - enable_pv_merge (bool) – whether to enable pv merge

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_enable_pv_merge(True)

preprocess_instance()

Merge instances into Pv instances and move them from input_channel to input_pv_channel. Takes effect only when enable_pv_merge is True.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.preprocess_instance()

set_current_phase(current_phase)

Set the current phase in training; it is mainly useful for unit tests. current_phase: 1 for join, 0 for update.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.set_current_phase(1)

postprocess_instance()

Split Pv instances back into individual instances and move them to input_channel.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.preprocess_instance()
exe.train_from_dataset(dataset)  # exe is an initialized fluid.Executor
dataset.postprocess_instance()

set_fleet_send_batch_size(fleet_send_batch_size=1024)

Set the fleet send batch size; the default is 1024.

- Parameters
  - fleet_send_batch_size (int) – fleet send batch size

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_fleet_send_batch_size(800)

set_fleet_send_sleep_seconds(fleet_send_sleep_seconds=0)

Set the fleet send sleep time in seconds; the default is 0.

- Parameters
  - fleet_send_sleep_seconds (int) – fleet send sleep time in seconds

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_fleet_send_sleep_seconds(2)

set_merge_by_lineid(merge_size=2)

Set merging by line id: instances with the same line id will be merged after shuffle. You need to parse the line id in your data generator.

- Parameters
  - merge_size (int) – number of instances to merge; default is 2

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_merge_by_lineid()

set_date(date)

- Api_attr
  - Static Graph

Set the training date for pulling sparse parameters and for saving and loading the model. Only used in PSGPU mode.

- Parameters
  - date (str) – training date (format: YYYYMMDD), e.g. "20211111"

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_date("20211111")

load_into_memory(is_shuffle=False)

Load data into memory.

- Parameters
  - is_shuffle (bool) – whether to use local shuffle; default is False

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()

preload_into_memory(thread_num=None)

Load data into memory asynchronously.

- Parameters
  - thread_num (int) – number of preload threads

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.preload_into_memory()
dataset.wait_preload_done()

wait_preload_done()

Wait until preload_into_memory is done.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.preload_into_memory()
dataset.wait_preload_done()

local_shuffle()

Local shuffle.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.local_shuffle()

global_shuffle(fleet=None, thread_num=12)

Global shuffle. It can only be used in distributed mode, i.e. multiple processes on a single machine or multiple machines training together. When running in distributed mode, pass a fleet instance instead of None.

- Parameters
  - fleet (Fleet) – fleet singleton; default is None
  - thread_num (int) – shuffle thread num; default is 12

Examples

import paddle.fluid as fluid
from paddle.incubate.distributed.fleet.parameter_server.pslib import fleet
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.global_shuffle(fleet)

release_memory()

- Api_attr
  - Static Graph

Release the in-memory data of the InMemoryDataset once it will not be used again.

Examples

import paddle.fluid as fluid
from paddle.incubate.distributed.fleet.parameter_server.pslib import fleet
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.global_shuffle(fleet)
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
exe.train_from_dataset(fluid.default_main_program(), dataset)
dataset.release_memory()

get_pv_data_size()

Get the in-memory Pv data size. Call this function to learn the number of Pvs across all workers after loading into memory.

Note

This function may hurt performance because it involves a barrier.

- Returns
  - The size of the in-memory pv data.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
print(dataset.get_pv_data_size())

get_memory_data_size(fleet=None)

Get the in-memory data size. Call this function to learn the number of instances across all workers after loading into memory.

Note

This function may hurt performance because it involves a barrier.

- Parameters
  - fleet (Fleet) – Fleet object

- Returns
  - The size of the in-memory data.

Examples

import paddle.fluid as fluid
from paddle.incubate.distributed.fleet.parameter_server.pslib import fleet
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
print(dataset.get_memory_data_size(fleet))

get_shuffle_data_size(fleet=None)

Get the shuffled data size. Call this function to learn the number of instances across all workers after local/global shuffle.

Note

This function may hurt the performance of local shuffle because it involves a barrier; it does not affect global shuffle.

- Parameters
  - fleet (Fleet) – Fleet object

- Returns
  - The size of the shuffled data.

Examples

import paddle.fluid as fluid
from paddle.incubate.distributed.fleet.parameter_server.pslib import fleet
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
filelist = ["a.txt", "b.txt"]
dataset.set_filelist(filelist)
dataset.load_into_memory()
dataset.global_shuffle(fleet)
print(dataset.get_shuffle_data_size(fleet))

set_graph_config(config)

Set the graph config; used in gpu graph mode.

- Parameters
  - config (dict) – config dict

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
graph_config = {"walk_len": 24,
                "walk_degree": 10,
                "once_sample_startid_len": 80000,
                "sample_times_one_chunk": 5,
                "window": 3,
                "debug_mode": 0,
                "batch_size": 800,
                "meta_path": "cuid2clk-clk2cuid;cuid2conv-conv2cuid;clk2cuid-cuid2clk;clk2cuid-cuid2conv",
                "gpu_graph_training": 1}
dataset.set_graph_config(graph_config)

set_pass_id(pass_id)

Set the pass id; used in gpu graph mode.

- Parameters
  - pass_id (int) – pass id

Examples

import paddle.fluid as fluid
pass_id = 0
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_pass_id(pass_id)

get_pass_id()

Get the pass id set in gpu graph mode.

- Returns
  - The pass id.

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
pass_id = dataset.get_pass_id()

dump_walk_path(path, dump_rate=1000)

Dump sampled walk paths to the given path; used in gpu graph mode.

- Parameters
  - path (str) – output path for the dumped walk paths
  - dump_rate (int) – dump rate; default is 1000
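A usage sketch for dump_walk_path, hedged: the output path "./walk_path" is illustrative, and the call itself requires a dataset that has already sampled graph walks in gpu graph mode, so it is left commented out in this guarded snippet.

```python
# Hedged sketch for dump_walk_path (gpu graph mode). The output path is
# illustrative; the call needs a dataset with sampled walks, which this
# snippet does not set up, so it is shown as a comment only.
try:
    import paddle.fluid as fluid  # legacy fluid API

    dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
    # After graph training has sampled walks, one would call:
    # dataset.dump_walk_path("./walk_path", dump_rate=1000)
    available = True
except ImportError:
    available = False  # PaddlePaddle not installed; snippet is illustrative
```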

desc()

Returns a protobuf message for this DataFeedDesc.

- Returns
  - A string message

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
print(dataset.desc())

set_batch_size(batch_size)

Set the batch size. Takes effect during training.

- Parameters
  - batch_size (int) – batch size

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_batch_size(128)

set_download_cmd(download_cmd)

Set a customized download command: download_cmd.

- Parameters
  - download_cmd (str) – customized download command

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_download_cmd("./read_from_afs")

set_fea_eval(record_candidate_size, fea_eval=True)

Set fea eval mode for slots shuffle, used to debug the importance level of slots (features). fea_eval must be set to True to enable slots shuffle.

- Parameters
  - record_candidate_size (int) – number of candidate instances to shuffle one slot
  - fea_eval (bool) – whether to enable fea eval mode for slots shuffle; default is True

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_fea_eval(1000000, True)

set_filelist(filelist)

Set the file list in the current worker.

- Parameters
  - filelist (list) – file list

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_filelist(['a.txt', 'b.txt'])

set_hdfs_config(fs_name, fs_ugi)

Set the HDFS config: fs name and ugi.

- Parameters
  - fs_name (str) – fs name
  - fs_ugi (str) – fs ugi

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_hdfs_config("my_fs_name", "my_fs_ugi")

set_pipe_command(pipe_command)

Set the pipe command of the current dataset. A pipe command is a UNIX pipeline command through which each raw data line is passed for preprocessing.

- Parameters
  - pipe_command (str) – pipe command

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_pipe_command("python my_script.py")

set_pv_batch_size(pv_batch_size)

Set the pv batch size. Takes effect when enable_pv_merge is set.

- Parameters
  - pv_batch_size (int) – pv batch size

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_pv_batch_size(128)

set_rank_offset(rank_offset)

Set rank_offset for merge_pv; it carries the Pv information.

- Parameters
  - rank_offset (str) – the name of the rank_offset variable

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_rank_offset("rank_offset")

set_so_parser_name(so_parser_name)

Set the shared-object (.so) parser name of the current dataset.

- Parameters
  - so_parser_name (str) – path of the .so parser

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_so_parser_name("./abc.so")

set_thread(thread_num)

Set the thread num, i.e. the number of readers.

- Parameters
  - thread_num (int) – thread num

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_thread(12)

set_use_var(var_list)

Set the Variables that will be used.

- Parameters
  - var_list (list) – variable list

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset()
dataset.set_use_var([data, label])  # data and label are Variables created beforehand

slots_shuffle(slots)

Slots shuffle is a shuffle method at the slot level, usually used on sparse features with a large number of instances. By comparing a metric, e.g. AUC, after shuffling one or several slots against the baseline, you can evaluate the importance level of those slots (features).

- Parameters
  - slots (list[string]) – the set of slots (strings) to shuffle

Examples

import paddle.fluid as fluid
dataset = fluid.DatasetFactory().create_dataset("InMemoryDataset")
dataset.set_merge_by_lineid()
# suppose there is a slot 0
dataset.slots_shuffle(['0'])