使用预训练的词向量完成文本分类任务

作者: fiyen 日期: 2021.05 摘要: 本示例教程将会演示如何使用飞桨内置的Imdb数据集,并使用预训练词向量进行文本分类。

[17]:
print('自然语言相关数据集:', paddle.text.__all__)
自然语言相关数据集: ['Conll05st', 'Imdb', 'Imikolov', 'Movielens', 'UCIHousing', 'WMT14', 'WMT16']

一、简介

在这个示例中,将使用飞桨2.1完成针对Imdb数据集(电影评论情感二分类数据集)的分类训练和测试。Imdb将直接调用自飞桨2.1,同时, 利用预训练的词向量(GloVe embedding)完成任务。

二、环境设置

本教程基于Paddle 2.1 编写,如果你的环境不是本版本,请先参考官网安装 Paddle 2.1 。

[3]:
import paddle
from paddle.io import Dataset
import numpy as np
import paddle.text as text
import random

print(paddle.__version__)
2.1.0

三、用飞桨2.1调用Imdb数据集

由于飞桨2.1提供了经过处理的Imdb数据集,可以方便地调用所需要的数据实例,省去了数据预处理的麻烦。目前,飞桨2.1以及内置的高质量 数据集包括Conll05st、Imdb、Imikolov、Movielens、HCIHousing、WMT14和WMT16等,未来还将提供更多常用数据集的调用接口。

以下定义了调用imdb训练集合测试集的方法。其中,cutoff定义了构建词典的截止大小,即数据集中出现频率在cutoff以下的不予考虑;mode定义了返回的数据用于何种用途(test: 测试集,train: 训练集)。

3.1 定义数据集

[2]:
imdb_train = text.Imdb(mode='train', cutoff=150)
imdb_test = text.Imdb(mode='test', cutoff=150)
Cache file /home/aistudio/.cache/paddle/dataset/imdb/imdb%2FaclImdb_v1.tar.gz not found, downloading https://dataset.bj.bcebos.com/imdb%2FaclImdb_v1.tar.gz
Begin to download

Download finished

调用Imdb得到的是经过编码的数据集,每个term对应一个唯一id,映射关系可以通过imdb_train.word_idx查看。将每一个样本即一条电影评论,表示成id序列。可以检查一下以上生成的数据内容:

[4]:
print("训练集样本数量: %d; 测试集样本数量: %d" % (len(imdb_train), len(imdb_test)))
print(f"样本标签: {set(imdb_train.labels)}")
print(f"样本字典: {list(imdb_train.word_idx.items())[:10]}")
print(f"单个样本: {imdb_train.docs[0]}")
print(f"最小样本长度: {min([len(x) for x in imdb_train.docs])};最大样本长度: {max([len(x) for x in imdb_train.docs])}")
训练集样本数量: 25000; 测试集样本数量: 25000
样本标签: {0, 1}
样本字典: [(b'the', 0), (b'and', 1), (b'a', 2), (b'of', 3), (b'to', 4), (b'is', 5), (b'in', 6), (b'it', 7), (b'i', 8), (b'this', 9)]
单个样本: [5146, 43, 71, 6, 1092, 14, 0, 878, 130, 151, 5146, 18, 281, 747, 0, 5146, 3, 5146, 2165, 37, 5146, 46, 5, 71, 4089, 377, 162, 46, 5, 32, 1287, 300, 35, 203, 2136, 565, 14, 2, 253, 26, 146, 61, 372, 1, 615, 5146, 5, 30, 0, 50, 3290, 6, 2148, 14, 0, 5146, 11, 17, 451, 24, 4, 127, 10, 0, 878, 130, 43, 2, 50, 5146, 751, 5146, 5, 2, 221, 3727, 6, 9, 1167, 373, 9, 5, 5146, 7, 5, 1343, 13, 2, 5146, 1, 250, 7, 98, 4270, 56, 2316, 0, 928, 11, 11, 9, 16, 5, 5146, 5146, 6, 50, 69, 27, 280, 27, 108, 1045, 0, 2633, 4177, 3180, 17, 1675, 1, 2571]
最小样本长度: 10;最大样本长度: 2469

对于训练集,将数据的顺序打乱,以优化将要进行的分类模型训练的效果。

[5]:
shuffle_index = list(range(len(imdb_train)))
random.shuffle(shuffle_index)
train_x = [imdb_train.docs[i] for i in shuffle_index]
train_y = [imdb_train.labels[i] for i in shuffle_index]

test_x = imdb_test.docs
test_y = imdb_test.labels

从样本长度上可以看到,每个样本的长度是不相同的。然而,在模型的训练过程中,需要保证每个样本的长度相同,以便于构造矩阵进行批量运算。 因此,需要先对所有样本进行填充或截断,使样本的长度一致。

[6]:
def vectorizer(input, label=None, length=2000):
    if label is not None:
        for x, y in zip(input, label):
            yield np.array((x + [0]*length)[:length]).astype('int64'), np.array([y]).astype('int64')
    else:
        for x in input:
            yield np.array((x + [0]*length)[:length]).astype('int64')

3.2 载入预训练向量

以下给出的文件较小,可以直接完全载入内存。对于大型的预训练向量,无法一次载入内存的,可以采用分批载入,并行处理的方式进行匹配。

[7]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip
# !unzip -q glove.6B.zip

glove_path = "./glove.6B.100d.txt"
embeddings = {}

观察上述GloVe预训练向量文件一行的数据:

[8]:
# 使用utf8编码解码
with open(glove_path, encoding='utf-8') as gf:
    line = gf.readline()
    print("GloVe单行数据:'%s'" % line)
GloVe单行数据:'the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
'

可以看到,每一行都以单词开头,其后接上该单词的向量值,各个值之间用空格隔开。基于此,可以用如下方法得到所有词向量的字典。

[9]:
with open(glove_path, encoding='utf-8') as gf:
    for glove in gf:
        word, embedding = glove.split(maxsplit=1)
        embedding = [float(s) for s in embedding.split(' ')]
        embeddings[word] = embedding
print("预训练词向量总数:%d" % len(embeddings))
print(f"单词'the'的向量是:{embeddings['the']}")
预训练词向量总数:400000
单词'the'的向量是:[-0.038194, -0.24487, 0.72812, -0.39961, 0.083172, 0.043953, -0.39141, 0.3344, -0.57545, 0.087459, 0.28787, -0.06731, 0.30906, -0.26384, -0.13231, -0.20757, 0.33395, -0.33848, -0.31743, -0.48336, 0.1464, -0.37304, 0.34577, 0.052041, 0.44946, -0.46971, 0.02628, -0.54155, -0.15518, -0.14107, -0.039722, 0.28277, 0.14393, 0.23464, -0.31021, 0.086173, 0.20397, 0.52624, 0.17164, -0.082378, -0.71787, -0.41531, 0.20335, -0.12763, 0.41367, 0.55187, 0.57908, -0.33477, -0.36559, -0.54857, -0.062892, 0.26584, 0.30205, 0.99775, -0.80481, -3.0243, 0.01254, -0.36942, 2.2167, 0.72201, -0.24978, 0.92136, 0.034514, 0.46745, 1.1079, -0.19358, -0.074575, 0.23353, -0.052062, -0.22044, 0.057162, -0.15806, -0.30798, -0.41625, 0.37972, 0.15006, -0.53212, -0.2055, -1.2526, 0.071624, 0.70565, 0.49744, -0.42063, 0.26148, -1.538, -0.30223, -0.073438, -0.28312, 0.37104, -0.25217, 0.016215, -0.017099, -0.38984, 0.87424, -0.72569, -0.51058, -0.52028, -0.1459, 0.8278, 0.27062]

3.3 给数据集的词表匹配词向量

接下来,提取数据集的词表,需要注意的是,词表中的词编码的先后顺序是按照词出现的频率排列的,频率越高的词编码值越小。

[10]:
word_idx = imdb_train.word_idx
vocab = [w for w in word_idx.keys()]
print(f"词表的前5个单词:{vocab[:5]}")
print(f"词表的后5个单词:{vocab[-5:]}")
词表的前5个单词:[b'the', b'and', b'a', b'of', b'to']
词表的后5个单词:[b'troubles', b'virtual', b'warriors', b'widely', '<unk>']

观察词表的后5个单词,发现最后一个词是“<unk>”,这个符号代表所有词表以外的词。另外,对于形式b’the’,是字符串’the’ 的二进制编码形式,使用中注意使用b’the’.decode()来进行转换(’<unk>’并没有进行二进制编码,注意区分)。 接下来,给词表中的每个词匹配对应的词向量。预训练词向量可能没有覆盖数据集词表中的所有词,对于没有的词,设该词的词 向量为零向量。

[11]:
# 定义词向量的维度,注意与预训练词向量保持一致
dim = 100

vocab_embeddings = np.zeros((len(vocab), dim))
for ind, word in enumerate(vocab):
    if word != '<unk>':
        word = word.decode()
    embedding = embeddings.get(word, np.zeros((dim,)))
    vocab_embeddings[ind, :] = embedding

四、组网

4.1 构建基于预训练向量的Embedding

对于预训练向量的Embedding,一般期望它的参数不再变动,所以要设置trainable=False。如果希望在此基础上训练参数,则需要 设置trainable=True。

[12]:
pretrained_attr = paddle.ParamAttr(name='embedding',
                                   initializer=paddle.nn.initializer.Assign(vocab_embeddings),
                                   trainable=False)
embedding_layer = paddle.nn.Embedding(num_embeddings=len(vocab),
                                      embedding_dim=dim,
                                      padding_idx=word_idx['<unk>'],
                                      weight_attr=pretrained_attr)

4.2 构建分类器

这里,构建简单的基于一维卷积的分类模型,其结构为:Embedding->Conv1D->Pool1D->Linear。在定义Linear时,由于需要知 道输入向量的维度,可以按照公式官方文档 来进行计算。这里给出计算的函数如下:

[13]:
def cal_output_shape(input_shape, out_channels, kernel_size, stride, padding=0, dilation=1):
    return out_channels, int((input_shape + 2*padding - (dilation*(kernel_size - 1) + 1)) / stride) + 1


# 定义每个样本的长度
length = 2000

# 定义卷积层参数
kernel_size = 5
out_channels = 10
stride = 2
padding = 0

output_shape = cal_output_shape(length, out_channels, kernel_size, stride, padding)
output_shape = cal_output_shape(output_shape[1], output_shape[0], 2, 2, 0)
sim_model = paddle.nn.Sequential(embedding_layer,
                             paddle.nn.Conv1D(in_channels=dim, out_channels=out_channels, kernel_size=kernel_size,
                                              stride=stride, padding=padding, data_format='NLC', bias_attr=True),
                             paddle.nn.ReLU(),
                             paddle.nn.MaxPool1D(kernel_size=2, stride=2),
                             paddle.nn.Flatten(),
                             paddle.nn.Linear(in_features=np.prod(output_shape), out_features=2, bias_attr=True),
                             paddle.nn.Softmax())

paddle.summary(sim_model, input_size=(-1, length), dtypes='int64')
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #
===========================================================================
  Embedding-1       [[1, 2000]]         [1, 2000, 100]        514,700
   Conv1D-1       [[1, 2000, 100]]       [1, 998, 10]          5,010
    ReLU-1         [[1, 998, 10]]        [1, 998, 10]            0
  MaxPool1D-1      [[1, 998, 10]]        [1, 998, 5]             0
   Flatten-1       [[1, 998, 5]]          [1, 4990]              0
   Linear-1         [[1, 4990]]             [1, 2]             9,982
   Softmax-1          [[1, 2]]              [1, 2]               0
===========================================================================
Total params: 529,692
Trainable params: 14,992
Non-trainable params: 514,700
---------------------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 1.75
Params size (MB): 2.02
Estimated Total Size (MB): 3.78
---------------------------------------------------------------------------

[13]:
{'total_params': 529692, 'trainable_params': 14992}

4.3 读取数据,进行训练

可以利用飞桨2.0的io.Dataset模块来构建一个数据的读取器,方便地将数据进行分批训练。

[14]:
class DataReader(Dataset):
    def __init__(self, input, label, length):
        self.data = list(vectorizer(input, label, length=length))

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


# 定义输入格式
input_form = paddle.static.InputSpec(shape=[None, length], dtype='int64', name='input')
label_form = paddle.static.InputSpec(shape=[None, 1], dtype='int64', name='label')

model = paddle.Model(sim_model, input_form, label_form)
model.prepare(optimizer=paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters()),
              loss=paddle.nn.loss.CrossEntropyLoss(),
              metrics=paddle.metric.Accuracy())

# 分割训练集和验证集
eval_length = int(len(train_x) * 1/4)
model.fit(train_data=DataReader(train_x[:-eval_length], train_y[:-eval_length], length),
          eval_data=DataReader(train_x[-eval_length:], train_y[-eval_length:], length),
          batch_size=32, epochs=10, verbose=1)
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/10
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:77: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  return (isinstance(seq, collections.Sequence) and
step 586/586 [==============================] - loss: 0.4641 - acc: 0.6480 - 415ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3703 - acc: 0.7694 - 209ms/step
Eval samples: 6250
Epoch 2/10
step 586/586 [==============================] - loss: 0.5839 - acc: 0.7744 - 416ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3651 - acc: 0.7939 - 206ms/step
Eval samples: 6250
Epoch 3/10
step 586/586 [==============================] - loss: 0.3980 - acc: 0.7953 - 419ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3801 - acc: 0.7982 - 214ms/step
Eval samples: 6250
Epoch 4/10
step 586/586 [==============================] - loss: 0.4552 - acc: 0.8184 - 415ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3370 - acc: 0.8077 - 210ms/step
Eval samples: 6250
Epoch 5/10
step 586/586 [==============================] - loss: 0.4108 - acc: 0.8361 - 421ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3369 - acc: 0.8179 - 210ms/step
Eval samples: 6250
Epoch 6/10
step 586/586 [==============================] - loss: 0.4215 - acc: 0.8486 - 415ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3419 - acc: 0.8062 - 213ms/step
Eval samples: 6250
Epoch 7/10
step 586/586 [==============================] - loss: 0.4092 - acc: 0.8586 - 424ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3312 - acc: 0.8200 - 208ms/step
Eval samples: 6250
Epoch 8/10
step 586/586 [==============================] - loss: 0.4488 - acc: 0.8694 - 419ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3328 - acc: 0.8186 - 205ms/step
Eval samples: 6250
Epoch 9/10
step 586/586 [==============================] - loss: 0.5302 - acc: 0.8770 - 412ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3961 - acc: 0.8152 - 201ms/step
Eval samples: 6250
Epoch 10/10
step 586/586 [==============================] - loss: 0.3728 - acc: 0.8807 - 420ms/step
Eval begin...
step 196/196 [==============================] - loss: 0.3353 - acc: 0.8210 - 202ms/step
Eval samples: 6250

五、评估效果并用模型预测

[ ]:
# 评估
model.evaluate(eval_data=DataReader(test_x, test_y, length), batch_size=32, verbose=1)

# 预测
true_y = test_y[100:105] + test_y[-110:-105]
pred_y = model.predict(DataReader(test_x[100:105] + test_x[-110:-105], None, length), batch_size=1)
test_x_doc = test_x[100:105] + test_x[-110:-105]

# 标签编码转文字
label_id2text = {0: 'positive', 1: 'negative'}

for index, y in enumerate(pred_y[0]):
    print("原文本:%s" % ' '.join([vocab[i].decode() for i in test_x_doc[index] if i < len(vocab) - 1]))
    print("预测的标签是:%s, 实际标签是:%s" % (label_id2text[np.argmax(y)], label_id2text[true_y[index]]))