wmt16

ACL 2016 Multimodal Machine Translation. See the task website for details: http://www.statmt.org/wmt16/multimodal-task.html#task1

If you use the dataset created for this task, please cite the following paper: Multi30K: Multilingual English-German Image Descriptions.

@article{elliott-EtAl:2016:VL16,
  author    = {{Elliott}, D. and {Frank}, S. and {Sima'an}, K. and {Specia}, L.},
  title     = {Multi30K: Multilingual English-German Image Descriptions},
  booktitle = {Proceedings of the 6th Workshop on Vision and Language},
  year      = {2016},
  pages     = {70--74}
}

paddle.dataset.wmt16.train(src_dict_size, trg_dict_size, src_lang='en')

WMT16 train set reader.

This function returns the reader for train data. Each sample the reader returns is made up of three fields: the source language word index sequence, target language word index sequence and next word index sequence.

NOTE: The original link for the training data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz

paddle.dataset.wmt16 provides a tokenized version of the original dataset, produced with the Moses tokenization script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Parameters
  • src_dict_size (int) – Size of the source language dictionary. Three special tokens will be added into the dictionary: <s> for start mark, <e> for end mark, and <unk> for unknown word.

  • trg_dict_size (int) – Size of the target language dictionary. Three special tokens will be added into the dictionary: <s> for start mark, <e> for end mark, and <unk> for unknown word.

  • src_lang (string) – A string indicating which language is the source language. Available options are: “en” for English and “de” for German.

Returns

The train reader.

Return type

callable
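
To make the three fields concrete, here is a minimal sketch of how one sample could be assembled from a tokenized sentence pair. The toy dictionaries, the helper name `make_sample`, the special-token indices 0, 1 and 2, and the wrapping of the source sequence in `<s>` and `<e>` are all assumptions for illustration, not taken from the library. The point is only that the target sequence and the next-word sequence hold the same tokens shifted by one position.

```python
# Illustrative only: toy dictionaries and indices are assumptions,
# not the contents of the real WMT16 dictionaries.
START, END, UNK = "<s>", "<e>", "<unk>"

src_dict = {START: 0, END: 1, UNK: 2, "a": 3, "dog": 4, "runs": 5}
trg_dict = {START: 0, END: 1, UNK: 2, "ein": 3, "hund": 4, "rennt": 5}


def make_sample(src_words, trg_words):
    """Return (src_ids, trg_ids, trg_ids_next) for one sentence pair."""
    # Words missing from a dictionary fall back to the <unk> index.
    src_body = [src_dict.get(w, src_dict[UNK]) for w in src_words]
    trg_body = [trg_dict.get(w, trg_dict[UNK]) for w in trg_words]

    src_ids = [src_dict[START]] + src_body + [src_dict[END]]
    trg_ids = [trg_dict[START]] + trg_body        # starts with <s>
    trg_ids_next = trg_body + [trg_dict[END]]     # same tokens, shifted by one
    return src_ids, trg_ids, trg_ids_next


# "schläft" is not in the toy target dictionary, so it maps to <unk>.
sample = make_sample(["a", "dog", "runs"], ["ein", "hund", "schläft"])
print(sample)
```

In this sketch, position `i` of the next-word sequence is the token that follows position `i` of the target sequence, which is the usual arrangement for training a sequence-to-sequence decoder.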

paddle.dataset.wmt16.test(src_dict_size, trg_dict_size, src_lang='en')

WMT16 test set reader.

This function returns the reader for test data. Each sample the reader returns is made up of three fields: the source language word index sequence, target language word index sequence and next word index sequence.

NOTE: The original link for the test data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz

paddle.dataset.wmt16 provides a tokenized version of the original dataset, produced with the Moses tokenization script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Parameters
  • src_dict_size (int) – Size of the source language dictionary. Three special tokens will be added into the dictionary: <s> for start mark, <e> for end mark, and <unk> for unknown word.

  • trg_dict_size (int) – Size of the target language dictionary. Three special tokens will be added into the dictionary: <s> for start mark, <e> for end mark, and <unk> for unknown word.

  • src_lang (string) – A string indicating which language is the source language. Available options are: “en” for English and “de” for German.

Returns

The test reader.

Return type

callable
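
The return type is a callable rather than a list of samples: calling it gives an iterator over the data. The sketch below mimics that pattern with a hypothetical in-memory `toy_reader_creator` standing in for the real reader, so it runs without downloading anything. The function name and the sample values are made up for illustration.

```python
def toy_reader_creator(samples):
    """Hypothetical stand-in for a dataset reader: returns a callable
    that yields (src_ids, trg_ids, trg_ids_next) tuples."""
    def reader():
        for sample in samples:
            yield sample
    return reader


toy_samples = [
    ([0, 3, 4, 1], [0, 5, 6], [5, 6, 1]),
    ([0, 7, 1], [0, 8], [8, 1]),
]
reader = toy_reader_creator(toy_samples)

# Each call to reader() starts a new pass over the data.
first_pass = list(reader())
second_pass = list(reader())

lengths = []
for src_ids, trg_ids, trg_ids_next in reader():
    # Record the length of each of the three fields.
    lengths.append((len(src_ids), len(trg_ids), len(trg_ids_next)))
print(lengths)
```

Under that reading, the object returned by `paddle.dataset.wmt16.test` would be consumed the same way: call it, then iterate over the result.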

paddle.dataset.wmt16.validation(src_dict_size, trg_dict_size, src_lang='en')

WMT16 validation set reader.

This function returns the reader for validation data. Each sample the reader returns is made up of three fields: the source language word index sequence, target language word index sequence and next word index sequence.

NOTE: The original link for the validation data is: http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz

paddle.dataset.wmt16 provides a tokenized version of the original dataset, produced with the Moses tokenization script: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

Parameters
  • src_dict_size (int) – Size of the source language dictionary. Three special tokens will be added into the dictionary: <s> for start mark, <e> for end mark, and <unk> for unknown word.

  • trg_dict_size (int) – Size of the target language dictionary. Three special tokens will be added into the dictionary: <s> for start mark, <e> for end mark, and <unk> for unknown word.

  • src_lang (string) – A string indicating which language is the source language. Available options are: “en” for English and “de” for German.

Returns

The validation reader.

Return type

callable
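
The `src_lang` argument decides which side of each sentence pair is treated as the source; the other language becomes the target. A minimal sketch of that choice follows, assuming (hypothetically) that each pair is held as an (English, German) tuple. The helper `split_pair` is not part of the library.

```python
def split_pair(pair, src_lang="en"):
    """Return (source_sentence, target_sentence) for an (en, de) pair."""
    if src_lang not in ("en", "de"):
        raise ValueError("src_lang must be 'en' or 'de'")
    # Column 0 holds English, column 1 holds German (an assumption here).
    src_col = 0 if src_lang == "en" else 1
    trg_col = 1 - src_col
    return pair[src_col], pair[trg_col]


pair = ("a dog runs", "ein Hund rennt")
en_to_de = split_pair(pair, "en")  # English is the source
de_to_en = split_pair(pair, "de")  # German is the source
print(en_to_de)
print(de_to_en)
```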

paddle.dataset.wmt16.get_dict(lang, dict_size, reverse=False)

Return the word dictionary for the specified language.

Parameters
  • lang (string) – A string indicating the language whose dictionary is returned. Available options are: “en” for English and “de” for German.

  • dict_size (int) – Size of the specified language dictionary.

  • reverse (bool) – If reverse is set to False, the returned python dictionary will use word as key and use index as value. If reverse is set to True, the returned python dictionary will use index as key and word as value.

Returns

The word dictionary for the specified language.

Return type

dict
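
The effect of `reverse` can be illustrated without the real data. The sketch below builds a toy dictionary capped at `dict_size` entries with the three special tokens placed first, then returns it in either direction. The corpus, the helper `build_toy_dict`, the frequency-based ordering, and the choice to count the special tokens toward `dict_size` are assumptions for illustration only.

```python
from collections import Counter

SPECIAL_TOKENS = ["<s>", "<e>", "<unk>"]


def build_toy_dict(sentences, dict_size, reverse=False):
    """Map the most frequent words to indices, special tokens first."""
    counts = Counter(word for s in sentences for word in s.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    vocab = (SPECIAL_TOKENS + ranked)[:dict_size]
    if reverse:
        # index -> word
        return {idx: word for idx, word in enumerate(vocab)}
    # word -> index
    return {word: idx for idx, word in enumerate(vocab)}


corpus = ["a dog runs", "a dog sleeps", "a cat sleeps"]
word_to_id = build_toy_dict(corpus, dict_size=6)
id_to_word = build_toy_dict(corpus, dict_size=6, reverse=True)
print(word_to_id)
print(id_to_word)
```

The word-to-index form is what is needed to encode text into index sequences, while the index-to-word form is convenient for turning model output back into readable tokens.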

paddle.dataset.wmt16.fetch()

Download the entire dataset.