edit_distance

paddle.fluid.layers.edit_distance(input, label, normalized=True, ignored_tokens=None, input_length=None, label_length=None)

This op computes the edit distances between a batch of hypothesis strings and their references. Edit distance, also called Levenshtein distance, measures how dissimilar two strings are by counting the minimum number of operations needed to transform one string into another. The allowed operations are insertion, deletion, and substitution.

For example, given hypothesis string A = “kitten” and reference B = “sitting”, the edit distance is 3, because transforming A into B takes at least two substitutions and one insertion:

“kitten” -> “sitten” -> “sittin” -> “sitting”
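The count above can be checked with a short reference implementation. The following is a minimal pure-Python sketch of the standard dynamic-programming algorithm for Levenshtein distance; the function name levenshtein is hypothetical and is not part of Paddle, and the op's own kernel is not shown here.

```python
def levenshtein(hyp, ref):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the sequence hyp into the sequence ref."""
    # prev[j] is the distance between the hyp prefix processed so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning hyp[:i] into an empty sequence takes i deletions
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The same function works on lists of integer token indices, which is the form the op receives.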

The input is a LoDTensor or Tensor holding all the hypothesis strings, batch_size in total. The boundaries between strings are given by the LoD information (for a LoDTensor) or by input_length (for a Tensor). The batch_size reference strings in label are arranged in the same order and in the same way as input.

The output contains batch_size results, one edit distance for each pair of strings. If normalized is True, each edit distance is divided by the length of its reference string.
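To make the output layout and the normalization concrete, here is a minimal pure-Python sketch that mirrors the description above. The names levenshtein and batch_edit_distance are hypothetical helpers written for illustration, not Paddle functions, and raising an error on an empty reference is a choice made in this sketch rather than documented behavior.

```python
def levenshtein(hyp, ref):
    """Edit distance between two token sequences (standard DP)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if h == r else 1)))
        prev = curr
    return prev[-1]


def batch_edit_distance(hyps, refs, normalized=True):
    """Return (distances, sequence_num), where distances has one row per pair."""
    if len(hyps) != len(refs):
        raise ValueError("hyps and refs must contain the same number of sequences")
    distances = []
    for hyp, ref in zip(hyps, refs):
        dist = float(levenshtein(hyp, ref))
        if normalized:
            if not ref:
                # Assumption of this sketch: an empty reference cannot be normalized.
                raise ValueError("reference must be non-empty when normalized=True")
            dist /= len(ref)  # divide by the reference length
        distances.append([dist])  # one row per pair, i.e. shape [batch_size, 1]
    return distances, len(distances)


hyps = [[1, 2, 3], [4, 5]]
refs = [[1, 2, 4, 3], [4, 5]]
print(batch_edit_distance(hyps, refs))                    # ([[0.25], [0.0]], 2)
print(batch_edit_distance(hyps, refs, normalized=False))  # ([[1.0], [0.0]], 2)
```

The first pair differs by one insertion against a reference of length 4, so its normalized distance is 1 / 4 = 0.25.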

Parameters
  • input (Variable) – The indices of the hypothesis strings. Its rank should be 2 and its data type should be int64.

  • label (Variable) – The indices of the reference strings. Its rank should be 2 and its data type should be int64.

  • normalized (bool, default True) – Indicates whether to normalize the edit distance by the length of the reference string.

  • ignored_tokens (list<int>, default None) – Tokens that should be removed before calculating edit distance.

  • input_length (Variable) – The length of each sequence in input, required when input is a Tensor rather than a LoDTensor. It should have shape [batch_size] and dtype int64.

  • label_length (Variable) – The length of each sequence in label, required when label is a Tensor rather than a LoDTensor. It should have shape [batch_size] and dtype int64.
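The roles of ignored_tokens, input_length, and label_length can be illustrated with a small pure-Python model of the padded-Tensor case. This is a sketch under stated assumptions, not the op's implementation: the helper names are hypothetical, each row is first cut to its given length, ignored tokens are then dropped from both hypothesis and reference, and normalization uses the reference length after that removal.

```python
def levenshtein(hyp, ref):
    """Edit distance between two token sequences (standard DP)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if h == r else 1)))
        prev = curr
    return prev[-1]


def edit_distance_padded(input_rows, label_rows, input_length, label_length,
                         normalized=True, ignored_tokens=None):
    """Model of the padded case: rows are padded, lengths mark the valid prefix."""
    ignored = set(ignored_tokens or [])
    distances = []
    for hyp_row, ref_row, hyp_len, ref_len in zip(
            input_rows, label_rows, input_length, label_length):
        # Keep only the valid prefix, then drop ignored tokens (assumed order).
        hyp = [t for t in hyp_row[:hyp_len] if t not in ignored]
        ref = [t for t in ref_row[:ref_len] if t not in ignored]
        dist = float(levenshtein(hyp, ref))
        if normalized:
            if not ref:
                raise ValueError("reference must be non-empty when normalized=True")
            dist /= len(ref)
        distances.append([dist])
    return distances, len(distances)


# Rows are padded with 0; token 9 plays the role of a token to ignore.
x_pad = [[1, 9, 2, 3, 0], [4, 5, 0, 0, 0]]
x_len = [4, 2]
y_pad = [[1, 2, 3, 0, 0, 0], [4, 6, 5, 7, 0, 0]]
y_len = [3, 4]

print(edit_distance_padded(x_pad, y_pad, x_len, y_len, ignored_tokens=[9]))
# ([[0.0], [0.5]], 2)
```

Padding values beyond the given lengths never enter the computation, so they do not need to be listed in ignored_tokens.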

Returns

edit_distance_out (Variable): the edit distance results, with shape [batch_size, 1].
sequence_num (Variable): the number of sequences, with shape [].

Return type

Tuple

Examples

import paddle.fluid as fluid

# using LoDTensor
x_lod = fluid.data(name='x_lod', shape=[None,1], dtype='int64', lod_level=1)
y_lod = fluid.data(name='y_lod', shape=[None,1], dtype='int64', lod_level=1)
distance_lod, seq_num_lod = fluid.layers.edit_distance(input=x_lod, label=y_lod)

# using Tensor
x_seq_len = 5
y_seq_len = 6
x_pad = fluid.data(name='x_pad', shape=[None,x_seq_len], dtype='int64')
y_pad = fluid.data(name='y_pad', shape=[None,y_seq_len], dtype='int64')
x_len = fluid.data(name='x_len', shape=[None], dtype='int64')
y_len = fluid.data(name='y_len', shape=[None], dtype='int64')
distance_pad, seq_num_pad = fluid.layers.edit_distance(input=x_pad, label=y_pad, input_length=x_len, label_length=y_len)