Imikolov

class paddle.text.datasets.Imikolov(data_file=None, data_type='NGRAM', window_size=-1, mode='train', min_word_freq=50, download=True)

Implementation of the imikolov dataset.

Parameters
  • data_file (str) – Path to the data tar file. Can be None if download is True. Default None.

  • data_type (str) – ‘NGRAM’ or ‘SEQ’. Default ‘NGRAM’.

  • window_size (int) – Sliding window size for ‘NGRAM’ data. Default -1.

  • mode (str) – ‘train’ or ‘test’ mode. Default ‘train’.

  • min_word_freq (int) – Minimum word frequency for building the word dictionary. Default 50.

  • download (bool) – Whether to download the dataset automatically if data_file is not set. Default True.
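
To make data_type, window_size and min_word_freq more concrete, the following is a minimal pure-Python sketch of how such samples could be built from raw sentences. It is not the library's implementation: the helpers build_word_dict, ngram_samples and seq_sample are hypothetical names, the toy corpus is made up, and the details (keeping a word only when its count exceeds min_word_freq, the `<s>`/`<e>` sentence markers, `<unk>` taking the last index) are assumptions made for illustration.

```python
from collections import Counter


def build_word_dict(lines, min_word_freq):
    """Map each sufficiently frequent word to an integer id (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Count the sentence markers too, so they receive ids like ordinary words.
        counts.update(["<s>"] + line.split() + ["<e>"])
    counts.pop("<unk>", None)  # <unk> is assigned the last index below
    # Assumption: a word is kept only when its count exceeds min_word_freq.
    kept = sorted(
        (w for w, c in counts.items() if c > min_word_freq),
        key=lambda w: (-counts[w], w),  # most frequent first, ties alphabetical
    )
    word_idx = {w: i for i, w in enumerate(kept)}
    word_idx["<unk>"] = len(kept)
    return word_idx


def ngram_samples(line, word_idx, window_size):
    """'NGRAM'-style samples: every window of `window_size` consecutive ids."""
    unk = word_idx["<unk>"]
    ids = [word_idx.get(w, unk) for w in ["<s>"] + line.split() + ["<e>"]]
    return [tuple(ids[i - window_size:i]) for i in range(window_size, len(ids) + 1)]


def seq_sample(line, word_idx):
    """'SEQ'-style sample: (source, target) with the target shifted by one word."""
    unk = word_idx["<unk>"]
    ids = [word_idx.get(w, unk) for w in line.split()]
    return [word_idx["<s>"]] + ids, ids + [word_idx["<e>"]]


# Toy corpus, invented for this sketch.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
word_idx = build_word_dict(corpus, min_word_freq=1)
ngrams = ngram_samples("the dog sat", word_idx, window_size=3)
src, trg = seq_sample("the cat ran", word_idx)
print(word_idx)
print(ngrams)
print(src, trg)
```

In this sketch, rare words such as "dog" and "ran" fall below the frequency threshold and map to the `<unk>` id, which is the effect a higher min_word_freq has on the vocabulary.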

Returns

An instance of the imikolov dataset.

Return type

Dataset

Examples

import paddle
from paddle.text.datasets import Imikolov

# Toy layer that reduces the source and target word ids to their sums.
class SimpleNet(paddle.nn.Layer):
    def __init__(self):
        super().__init__()

    def forward(self, src, trg):
        return paddle.sum(src), paddle.sum(trg)

paddle.disable_static()

imikolov = Imikolov(mode='train', data_type='SEQ', window_size=2)

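# Build the model once, outside the loop.
model = SimpleNet()

for i in range(10):
    # With data_type='SEQ', each sample is a (source, target) pair.
    src, trg = imikolov[i]
    src = paddle.to_tensor(src)
    trg = paddle.to_tensor(trg)

    src_sum, trg_sum = model(src, trg)
    print(src_sum.numpy().shape, trg_sum.numpy().shape)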