imikolov

imikolov’s simple dataset.

This module downloads the dataset from http://www.fit.vutbr.cz/~imikolov/rnnlm/ and parses the training and test sets into paddle reader creators.
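For readers unfamiliar with the term, the reader-creator pattern can be sketched in plain Python. This is an illustration only, not Paddle's code: make_reader and the sample data are hypothetical, and the sketch assumes a reader is a zero-argument callable that returns a fresh iterator over samples each time it is called.

```python
def make_reader(samples):
    """Hypothetical reader creator: returns a reader over `samples`."""

    def reader():
        # Each call starts a new pass over the data, e.g. one pass per epoch.
        for sample in samples:
            yield tuple(sample)

    return reader


reader = make_reader([[0, 1, 2], [1, 2, 3]])
first_pass = list(reader())
second_pass = list(reader())  # calling the reader again restarts iteration
```

Returning a callable rather than an iterator lets the same reader be consumed several times without rebuilding it.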

paddle.dataset.imikolov.build_dict(min_word_freq=50)

Build a word dictionary from the corpus. Keys of the dictionary are words, and values are the zero-based IDs of those words.
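The idea of a frequency-filtered word dictionary can be sketched as follows. This is not the module's implementation: build_word_dict and the toy corpus are hypothetical, and two details are assumptions made for the sketch, namely that a word is kept only when its count is strictly greater than min_word_freq and that IDs are assigned in order of decreasing frequency.

```python
from collections import Counter


def build_word_dict(lines, min_word_freq=50):
    """Hypothetical sketch: map frequent words to zero-based IDs."""
    counts = Counter(word for line in lines for word in line.split())
    # Assumption: keep words that occur strictly more than min_word_freq times.
    kept = [word for word, count in counts.items() if count > min_word_freq]
    # Assumption: most frequent word gets ID 0; ties are broken alphabetically.
    kept.sort(key=lambda word: (-counts[word], word))
    return {word: idx for idx, word in enumerate(kept)}


corpus = ["the cat sat", "the dog sat", "the end"]
word_idx = build_word_dict(corpus, min_word_freq=1)
```

With the toy corpus and a threshold of 1, only "the" and "sat" survive the filter, so the dictionary has two entries with IDs 0 and 1.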

paddle.dataset.imikolov.train(word_idx, n, data_type=1)

imikolov training set creator.

Returns a reader creator. Each sample produced by the reader is a tuple of word IDs.

Parameters
  • word_idx (dict) – word dictionary mapping words to IDs, such as the one returned by build_dict

  • n (int) – sliding window size if data_type is NGRAM; otherwise the maximum sequence length

  • data_type (member of DataType: NGRAM or SEQ) – sample format (n-gram or sequence)

Returns

Training reader creator

Return type

callable
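The two meanings of n can be illustrated with a small sketch that works on an already encoded sentence. The function names are hypothetical and this is not the module's code. In particular, skipping sequences longer than n (rather than truncating them) is an assumption, and the sketch leaves out any sentence boundary markers the real reader may add.

```python
def ngram_samples(ids, n):
    """Slide a window of size n over a list of word IDs, yielding tuples."""
    for end in range(n, len(ids) + 1):
        yield tuple(ids[end - n:end])


def seq_sample(ids, max_len):
    """Return the whole sentence as one tuple, or None if it is too long.

    Assumption: over-long sentences are skipped, not truncated.
    """
    if max_len > 0 and len(ids) > max_len:
        return None
    return tuple(ids)


sentence = [4, 7, 1, 9]  # word IDs of one sentence
windows = list(ngram_samples(sentence, 3))
whole = seq_sample(sentence, 5)
```

A sentence shorter than the window produces no n-gram samples at all, which is why small values of n are typical for n-gram models.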

paddle.dataset.imikolov.test(word_idx, n, data_type=1)

imikolov test set creator.

Returns a reader creator. Each sample produced by the reader is a tuple of word IDs.

Parameters
  • word_idx (dict) – word dictionary mapping words to IDs, such as the one returned by build_dict

  • n (int) – sliding window size if data_type is NGRAM; otherwise the maximum sequence length

  • data_type (member of DataType: NGRAM or SEQ) – sample format (n-gram or sequence)

Returns

Test reader creator

Return type

callable
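A minimal usage sketch tying the three functions together. preview_samples is a hypothetical helper, not part of the module. It takes the module as an argument so that the sketch itself runs without downloading anything. With the real package you would pass paddle.dataset.imikolov, assuming a Paddle release that still ships paddle.dataset.

```python
import itertools


def preview_samples(imikolov, n=5, limit=3, min_word_freq=50):
    """Hypothetical helper: return the first `limit` samples of the test set.

    `imikolov` is any object exposing build_dict() and test() with the
    signatures documented on this page, e.g. the paddle.dataset.imikolov module.
    """
    word_idx = imikolov.build_dict(min_word_freq=min_word_freq)
    reader = imikolov.test(word_idx, n)  # returned callable; call it to iterate
    return list(itertools.islice(reader(), limit))


# With Paddle installed (the dataset is downloaded by the module):
#   import paddle.dataset.imikolov as imikolov
#   print(preview_samples(imikolov))
```

Using itertools.islice avoids reading the whole test set when only a few samples are needed for inspection.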