chunk_eval(input, label, chunk_scheme, num_chunk_types, excluded_chunk_types=None, seq_length=None)
This operator computes the precision, recall and F1-score for chunk detection. It is often used in sequence tagging tasks such as Named Entity Recognition (NER).
For the basics of chunking, please refer to Chunking with Support Vector Machines.
This operator supports the IOB, IOE, IOBES and IO (also known as plain) tagging schemes. Here is an NER example showing how each scheme labels the same sentence:
====== ====== ====== ===== == ============ ===== ===== ===== == =========
Scheme Li     Ming   works at Agricultural Bank  of    China in Beijing.
====== ====== ====== ===== == ============ ===== ===== ===== == =========
IO     I-PER  I-PER  O     O  I-ORG        I-ORG I-ORG I-ORG O  I-LOC
IOB    B-PER  I-PER  O     O  B-ORG        I-ORG I-ORG I-ORG O  B-LOC
IOE    I-PER  E-PER  O     O  I-ORG        I-ORG I-ORG E-ORG O  E-LOC
IOBES  B-PER  E-PER  O     O  B-ORG        I-ORG I-ORG E-ORG O  S-LOC
====== ====== ====== ===== == ============ ===== ===== ===== == =========
There are three chunk types (named entity types): PER (person), ORG (organization) and LOC (location). The labels have the form <tag type>-<chunk type>.
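To make "chunk detection" concrete, the following is a minimal sketch of how chunks could be read off a sequence of IOB label strings and compared between prediction and ground truth. The helper `iob_chunks` is hypothetical and is not part of the operator; treating a stray I- tag as the start of a new chunk is an assumption of this sketch, not a documented rule.

```python
def iob_chunks(tags):
    """Return a set of (start, end, chunk_type) spans from IOB tag strings.

    `end` is exclusive. An I- tag that does not continue a chunk of the
    same type is treated as starting a new chunk (assumption of this sketch).
    """
    chunks = set()
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, chunk_type = tag.partition("-")
        continues = prefix == "I" and chunk_type == current
        if not continues:
            # Close the chunk that was open, if any.
            if current is not None:
                chunks.add((start, i, current))
            # B- or I- opens a new chunk; O leaves no chunk open.
            if prefix in ("B", "I"):
                start, current = i, chunk_type
            else:
                start, current = None, None
    if current is not None:
        chunks.add((start, len(tags), current))
    return chunks


# Ground truth is the IOB row of the example sentence; the prediction
# (made up for illustration) cuts the ORG chunk short.
gold = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG", "O", "O", "O", "B-LOC"]

gold_chunks = iob_chunks(gold)
pred_chunks = iob_chunks(pred)
correct = gold_chunks & pred_chunks  # same span and same chunk type

print(len(pred_chunks), len(gold_chunks), len(correct))  # 3 3 2
```

In this sketch a predicted chunk counts as correct only when both its boundaries and its type match a ground-truth chunk, so the truncated ORG span is not counted.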
Because the implementation of this operator works on label ids rather than label strings, it needs a way to map each label id to a tag type and a chunk type. The operator uses the following mapping:
    tag_type = label % num_tag_type
    chunk_type = label / num_tag_type
where num_tag_type is the number of tag types in the tagging scheme, num_chunk_type is the number of chunk types, the division is integer division, and tag_type takes its value from the following table.
====== ===== ====== === ======
Scheme Begin Inside End Single
====== ===== ====== === ======
plain  0     -      -   -
IOB    0     1      -   -
IOE    -     0      1   -
IOBES  0     1      2   3
====== ===== ====== === ======
Accordingly, in the above NER example, if the tagging scheme is IOB and chunk types are ORG, PER and LOC, then the label ids would be as follows:
    B-ORG  0
    I-ORG  1
    B-PER  2
    I-PER  3
    B-LOC  4
    I-LOC  5
    O      6
With these ids, each label id maps to the correct tag type and chunk type.
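The mapping can be checked with a short sketch in plain Python. The function `decode_label` and the `TAG_TYPES` table are hypothetical names introduced here for illustration. The sketch also assumes that the O label always takes the id `num_tag_type * num_chunk_types`, as it does in the IOB example above (id 6 with 2 tag types and 3 chunk types).

```python
# Tag types per scheme, listed in id order as in the table above.
TAG_TYPES = {
    "plain": ["Begin"],
    "IOB": ["Begin", "Inside"],
    "IOE": ["Inside", "End"],
    "IOBES": ["Begin", "Inside", "End", "Single"],
}

# Chunk type ids used in the IOB example: ORG=0, PER=1, LOC=2.
CHUNK_NAMES = ["ORG", "PER", "LOC"]


def decode_label(label, chunk_scheme, num_chunk_types):
    """Map a label id to (tag type name, chunk type id)."""
    names = TAG_TYPES[chunk_scheme]
    num_tag_type = len(names)
    # Assumption: the id just past all (tag, chunk) pairs is the O label.
    if label == num_tag_type * num_chunk_types:
        return ("O", None)
    tag_type = label % num_tag_type
    chunk_type = label // num_tag_type  # integer division
    return (names[tag_type], chunk_type)


for label_id in range(7):
    tag, chunk = decode_label(label_id, "IOB", num_chunk_types=3)
    chunk_name = CHUNK_NAMES[chunk] if chunk is not None else "-"
    print(label_id, tag, chunk_name)
```

For label id 3 this gives tag type Inside and chunk type 1 (PER), which matches the I-PER row of the list above.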
input (Variable) – A Tensor or LoDTensor holding the labels predicted by the network. As a Tensor, its shape is [N, M, 1], where N is the batch size and M is the sequence length. As a LoDTensor, its shape is [N, 1], where N is the total length of all sequences in the mini-batch. The data type must be int64.
label (Variable) – A Tensor or LoDTensor holding the ground-truth labels. It should have the same shape, lod and data type as input.
chunk_scheme (str) – The tagging scheme in use. The value must be IOB, IOE, IOBES or plain.
num_chunk_types (int) – The number of chunk types.
excluded_chunk_types (list, optional) – The chunk types to leave out of the evaluation, given as a list of chunk type ids (integers). Default None.
seq_length (Variable, optional) – A 1D Tensor containing the length of each sequence when input and label are Tensors. It need not be provided if input and label are LoDTensors. Default None.
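When input and label are passed as padded Tensors, the per-sequence lengths have to be supplied separately. The following is a minimal sketch, in plain Python lists rather than framework tensors, of how variable-length label sequences could be laid out as [N, M, 1] data with a matching length vector. The helper `pad_label_batch` and the padding id `pad_id` are hypothetical; the assumption is that positions beyond each length are padding that the evaluation should not count.

```python
def pad_label_batch(sequences, pad_id=0):
    """Pad label id sequences to a common length.

    Returns nested lists shaped [N, M, 1] plus the original lengths,
    where N is the batch size and M is the longest sequence.
    """
    seq_length = [len(seq) for seq in sequences]
    max_len = max(seq_length)
    padded = []
    for seq in sequences:
        row = [[label_id] for label_id in seq]
        # Fill the tail with padding entries (pad_id is an arbitrary choice).
        row += [[pad_id] for _ in range(max_len - len(seq))]
        padded.append(row)
    return padded, seq_length


# Two sequences of label ids with lengths 3 and 1.
batch, lengths = pad_label_batch([[2, 3, 6], [4]])
print(batch)    # [[[2], [3], [6]], [[4], [0], [0]]]
print(lengths)  # [3, 1]
```

The `lengths` list here plays the role of seq_length; with LoDTensor inputs the sequence boundaries are already carried by the lod, so no such vector is needed.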
A tuple containing precision, recall, F1-score, the number of chunks detected, the number of chunks in the ground truth, and the number of chunks correctly detected. Each is a Tensor with shape [1]. Precision, recall and F1-score have data type float32; the three counts have data type int64.
- Return type: tuple
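The three scores follow from the three counts by the standard definitions of precision, recall and F1. The sketch below shows that arithmetic in plain Python. The function name `chunk_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch rather than documented behaviour of the operator.

```python
def chunk_metrics(num_infer_chunks, num_label_chunks, num_correct_chunks):
    """Compute (precision, recall, f1) from chunk counts."""
    # Precision: share of detected chunks that are correct.
    precision = num_correct_chunks / num_infer_chunks if num_infer_chunks else 0.0
    # Recall: share of ground-truth chunks that were detected correctly.
    recall = num_correct_chunks / num_label_chunks if num_label_chunks else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 4 chunks detected, 5 chunks in the ground truth, 3 of them correct.
precision, recall, f1 = chunk_metrics(4, 5, 3)
print(precision, recall)  # 0.75 0.6
```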
    import paddle.fluid as fluid

    dict_size = 10000
    label_dict_len = 7  # 3 chunk types x 2 IOB tag types, plus the O label

    # Token ids as a LoDTensor of variable-length sequences.
    sequence = fluid.data(
        name='id', shape=[None, 1], lod_level=1, dtype='int64')
    embedding = fluid.embedding(
        input=sequence, size=[dict_size, 512])
    hidden = fluid.layers.fc(input=embedding, size=512)

    # Ground-truth label ids, with the same lod as the input.
    label = fluid.data(
        name='label', shape=[None, 1], lod_level=1, dtype='int64')

    # The CRF layer and the decoder share the "crfw" parameter.
    crf = fluid.layers.linear_chain_crf(
        input=hidden, label=label,
        param_attr=fluid.ParamAttr(name="crfw"))
    crf_decode = fluid.layers.crf_decoding(
        input=hidden, param_attr=fluid.ParamAttr(name="crfw"))

    # Evaluate the decoded labels against the ground truth.
    fluid.layers.chunk_eval(
        input=crf_decode,
        label=label,
        chunk_scheme="IOB",
        num_chunk_types=int((label_dict_len - 1) / 2))