imikolov
imikolov’s simple dataset.
This module downloads the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parses the training set and test set into paddle reader creators.
- paddle.dataset.imikolov.build_dict(min_word_freq=50)
Build a word dictionary from the corpus. Keys of the dictionary are words, and values are the zero-based IDs of these words.
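The idea of a frequency-filtered dictionary with zero-based IDs can be illustrated on a toy corpus. The following is a minimal sketch, not the library's implementation: the helper name `build_dict_sketch`, the `>=` threshold comparison, and the ordering of IDs (most frequent word first, ties broken alphabetically) are all assumptions made for illustration.

```python
from collections import Counter


def build_dict_sketch(lines, min_word_freq=50):
    """Hypothetical helper: map sufficiently frequent words to zero-based IDs."""
    # Count every whitespace-separated token in the corpus.
    counts = Counter(word for line in lines for word in line.split())
    # Assumption: keep words seen at least min_word_freq times.
    kept = [w for w, c in counts.items() if c >= min_word_freq]
    # Assumption: most frequent word gets ID 0; ties are broken alphabetically.
    kept.sort(key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(kept)}


toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]
toy_dict = build_dict_sketch(toy_corpus, min_word_freq=2)
print(toy_dict)
```

With a threshold of 2, only "the" (3 occurrences) and "sat" (2 occurrences) survive in this toy corpus; every other word would have to be handled by the caller, for example by mapping it to a shared unknown-word ID.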
- paddle.dataset.imikolov.train(word_idx, n, data_type=1)
imikolov training set creator.
Returns a reader creator; each sample yielded by the reader is a tuple of word IDs.
- Parameters
word_idx (dict) – word dictionary
n (int) – sliding window size if data_type is NGRAM; otherwise the maximum sequence length
data_type (a member of DataType: NGRAM or SEQ) – sample format (n-gram or sequence)
- Returns
Training reader creator
- Return type
callable
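To make the terms "reader creator" and "sliding window" concrete, here is a minimal sketch in plain Python that does not call Paddle. The function name `ngram_reader_creator`, the `unk_id` parameter for out-of-dictionary words, and the toy input are hypothetical; the real reader may also add sentence boundary tokens or treat short lines differently.

```python
def ngram_reader_creator(lines, word_idx, n, unk_id):
    """Hypothetical sketch: return a reader that yields n-gram tuples of word IDs."""

    def reader():
        for line in lines:
            # Assumption: words missing from the dictionary map to unk_id.
            ids = [word_idx.get(w, unk_id) for w in line.split()]
            # Slide a window of size n over the line, one position at a time.
            for i in range(n, len(ids) + 1):
                yield tuple(ids[i - n:i])

    # The creator returns the reader itself (a callable), not its samples.
    return reader


toy_word_idx = {"the": 0, "sat": 1}
toy_reader = ngram_reader_creator(["the cat sat down"], toy_word_idx, n=3, unk_id=2)
toy_samples = list(toy_reader())
print(toy_samples)
```

A line of four words with n=3 produces two overlapping windows. Lines shorter than n produce no samples in this sketch.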
- paddle.dataset.imikolov.test(word_idx, n, data_type=1)
imikolov test set creator.
Returns a reader creator; each sample yielded by the reader is a tuple of word IDs.
- Parameters
word_idx (dict) – word dictionary
n (int) – sliding window size if data_type is NGRAM; otherwise the maximum sequence length
data_type (a member of DataType: NGRAM or SEQ) – sample format (n-gram or sequence)
- Returns
Test reader creator
- Return type
callable
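When the data type is a sequence rather than an n-gram, n acts as a maximum sequence length. The sketch below illustrates one possible reading of that rule in plain Python. It is an assumption, not the library's behavior, that lines longer than n are skipped rather than truncated, that a non-positive n disables the limit, and that each sample is a single tuple; `seq_reader_creator` and `unk_id` are hypothetical names.

```python
def seq_reader_creator(lines, word_idx, n, unk_id):
    """Hypothetical sketch: return a reader that yields whole lines as ID tuples."""

    def reader():
        for line in lines:
            # Assumption: words missing from the dictionary map to unk_id.
            ids = tuple(word_idx.get(w, unk_id) for w in line.split())
            # Assumption: a positive n caps the length, and longer lines are skipped.
            if n > 0 and len(ids) > n:
                continue
            yield ids

    return reader


seq_word_idx = {"the": 0, "sat": 1}
seq_lines = ["the cat sat", "the cat sat down on the mat"]
seq_reader = seq_reader_creator(seq_lines, seq_word_idx, n=4, unk_id=2)
seq_samples = list(seq_reader())
print(seq_samples)
```

Only the three-word line fits within n=4 here, so the reader yields a single sample.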