edit_distance( input, label, normalized=True, ignored_tokens=None, input_length=None, label_length=None )
This op computes the edit distance, also called the Levenshtein distance, between a batch of hypothesis strings and their references. It measures how dissimilar two strings are by counting the minimum number of operations needed to transform one string into the other. The operations are insertion, deletion, and substitution.
For example, given hypothesis string A = “kitten” and reference B = “sitting”, transforming A into B takes at least two substitutions and one insertion:
“kitten” -> “sitten” -> “sittin” -> “sitting”
So the edit distance between A and B is 3.
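The count above can be reproduced with the standard dynamic-programming formulation of Levenshtein distance. The following is a minimal pure-Python sketch for illustration only, not the op's actual implementation; the function name levenshtein is hypothetical.

```python
def levenshtein(hyp, ref):
    """Minimum number of insertions, deletions, and substitutions turning hyp into ref."""
    # prev[j] is the distance between the already processed prefix of hyp and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning hyp[:i] into an empty reference takes i deletions
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h from the hypothesis
                curr[j - 1] + 1,     # insert r into the hypothesis
                prev[j - 1] + cost,  # substitute h with r, or keep a match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The function works on any pair of sequences, so lists of integer token ids can be compared the same way as strings.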
The input is a Tensor, so input_length and label_length should be provided.
The batch size of label should be the same as that of input.
The output includes the edit distance between every input and its corresponding label, and the number of sequences. If normalized is True, each edit distance is divided by the length of its label.
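How the two outputs relate to a batch can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the op itself: the helper names levenshtein and batch_edit_distance are hypothetical, sequences are plain lists of token ids, and every label is assumed to be non-empty so that the division is defined.

```python
def levenshtein(hyp, ref):
    """Standard dynamic-programming edit distance between two sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def batch_edit_distance(inputs, labels, normalized=True):
    """Return one distance per (input, label) pair and the number of sequences."""
    if len(inputs) != len(labels):
        raise ValueError("inputs and labels must have the same batch size")
    distances = []
    for hyp, ref in zip(inputs, labels):
        d = float(levenshtein(hyp, ref))
        if normalized:
            d /= len(ref)  # divide by the label length (assumed non-zero)
        distances.append([d])  # one row per pair, mirroring shape (batch_size, 1)
    return distances, len(inputs)


inputs = [[1, 2, 3], [4, 5, 6], [4, 4, 4], [1, 1, 1]]
labels = [[1, 3, 4, 1], [4, 5, 8, 1], [7, 7, 7, 1], [1, 1, 1, 1]]

print(batch_edit_distance(inputs, labels, normalized=False))
# ([[3.0], [2.0], [4.0], [1.0]], 4)
print(batch_edit_distance(inputs, labels, normalized=True))
# ([[0.75], [0.5], [1.0], [0.25]], 4)
```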
input (Tensor) – The input tensor. Its rank should be 2 and its data type should be int64.
label (Tensor) – The label tensor. Its rank should be 2 and its data type should be int64.
normalized (bool, default True) – Indicates whether to normalize the edit distance by the length of the label.
ignored_tokens (list<int>, default None) – Tokens that will be removed before calculating edit distance.
input_length (Tensor) – The length of each sequence in input when input is a Tensor. It should have shape (batch_size, ) and its data type should be int64.
label_length (Tensor) – The length of each sequence in label when label is a Tensor. It should have shape (batch_size, ) and its data type should be int64.
NOTE – To avoid unexpected results, every element of input_length and label_length should equal the size of the second dimension of input and label, respectively. For example, for the input [[1,2,3,4],[5,6,7,8],[9,10,11,12]], the shape of input is [3,4] and input_length should be [4,4,4].
NOTE – This API is different from fluid.metrics.EditDistance.
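Two of the parameters above can be illustrated with small pure-Python helpers. This is a hedged sketch, not Paddle code: strip_ignored and full_lengths are hypothetical names, and the sketch assumes that removing ignored tokens simply drops every occurrence of the listed ids from a sequence.

```python
def strip_ignored(seq, ignored_tokens=None):
    """Drop every occurrence of the ignored token ids from a sequence."""
    if not ignored_tokens:
        return list(seq)
    ignored = set(ignored_tokens)
    return [tok for tok in seq if tok not in ignored]


def full_lengths(batch):
    """Build a length list whose entries all equal the batch's second dimension."""
    width = len(batch[0])
    if any(len(row) != width for row in batch):
        raise ValueError("all rows must have the same length")
    return [width] * len(batch)


print(strip_ignored([1, 0, 2, 0, 3], ignored_tokens=[0]))  # [1, 2, 3]
print(full_lengths([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]))  # [4, 4, 4]
```

The second helper mirrors the note on input_length: for an input of shape [3,4] it yields [4,4,4].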
distance (Tensor): the edit distance result. Its data type is float32 and its shape is (batch_size, 1).
sequence_num (Tensor): the number of sequences. Its data type is float32 and its shape is (1,).
import paddle
import paddle.nn.functional as F

input = paddle.to_tensor([[1,2,3],[4,5,6],[4,4,4],[1,1,1]], dtype='int64')
label = paddle.to_tensor([[1,3,4,1],[4,5,8,1],[7,7,7,1],[1,1,1,1]], dtype='int64')
input_len = paddle.to_tensor([3,3,3,3], dtype='int64')
label_len = paddle.to_tensor([4,4,4,4], dtype='int64')

distance, sequence_num = F.loss.edit_distance(input=input, label=label, input_length=input_len, label_length=label_len, normalized=False)

# print(distance)
# [[3.]
#  [2.]
#  [4.]
#  [1.]]
# if set normalized to True
# [[0.75]
#  [0.5 ]
#  [1.  ]
#  [0.25]]
#
# print(sequence_num)