Minimum Edit Distance

Minimum Edit Distance

When we auto-correct a misspelled word, what would be the right strategy. Say, you have written a word graffe in some word processor. The word processor would suggest the words that are near to this misspelled word. I typed this word in MS word and the MS word gave me following suggestions.

The nearest or the most relevant word is on top in the list. There are many probabilistic and computational techniques, which help in finding the most relevant word. One of them is dynamic programming algorithm, Minimum Edit Distance.

What is dynamic programming?

The dynamic programming is a class of algorithms where we simplify the problem by dividing it into smaller solvable parts and then build up the final solution from those simpler and smaller solutions. It can be confused with the divide and conquer algorithms (binary search and quick sort) where we divide the problem into smaller parts and combine the solutions to get the final solutions. However, there is very basic difference between the two. In divide and conquer approach, we do not reuse the previous solutions. Say if there was a computation like x + y = 6, we will not reuse its results later on rather we will compute it again when needed. Whereas in dynamic programing, we reuse the result 6 instead of computing x + y every time.

What is Minimum Edit Distance?

The minimum distance is the editing cost to convert one string into another string. From editing, we mean

  1. Deletion
  2. Insertion
  3. Substitution

The deletion and insertion operations have same cost that is 1, whereas the substitution operation has cost 2. The substitution operation has double cost as it involves both operations i.e. deletion followed by insertion.

Examples:

  1. Cat to Cats: has minimum edit distance equal to 1 as we need to perform just one insertion operation i.e. inserting s at the end of the string Car.
  2. Cats to Cat: has minimum edit distance equal to 1 as we need to perform just one deletion operation i.e. deleting s at the end of the source string Cats.
  3. Cat to Cap: has minimum edit distance equal to 2 as we need to perform a substitution/replace operation i.e. substitution/replacing t with p.

Steps to compute minimum edit distance

  1. Say the source string has m letters and the destination string has n letters. Draw a table with m+2 rows and n+2 columns. Let us convert the string “cat” to “apes”. Here the source string has 3 letters and the destination string has 4 letters. We draw table with 3 + 2 = 5 rows and 4 + 2 = 6 columns
  1. Now in the first column, write the source string. Start writing the source string bottom up, leaving the lower two cells empty. Similarly, in the last row, write the destination string from left to right keeping the left two cells empty. Put hash # symbol in the second last cell of first column and the second cell of the last row. The hash # symbol here means, nothing or empty string.
t
a
c
#
# a p e s
  1. Now fill up the second column bottom up.
    1. Write 0 in the cell corresponding to both hash # symbol. The zero 0 here means, the cost of converting the empty string to the empty string is zero is zero 0.
    2. Write 1 next to the letter c in first column, it means the cost of converting the string “c” to empty string is 1 as we need to delete one letter that is c.
    3. Write 2 next to the letter a in first column, it means the cost of converting the string “ca” to empty string (#) is 2 as we need to delete two letters that is “ca”.
    4. Write 3 next to the letter t in first column, it means the cost of converting the string “cat” to empty string (#) is 3 as we need to delete three letters that is “cat”.
  2. The point to be note is, when we move bottom up, we write the cost of deletions
  3. Next, we fill up the last row from left to right.
    1. Write 1 above a in the last row. It means the cost of converting empty string to the string “a” is 1 that is inserting one letter “a”.
    2. Write 2 above p in the last row. It means the cost of converting empty string to the string “ap” is 2 that is inserting two letters “ap”.
    3. Write 3 above e in the last row. It means the cost of converting empty string to the string “ape” is 3 that is inserting three letters “ape”
    4. Write 4 above s in the last row. It means the cost of converting empty string to the string “apes” is 4 that is inserting four letters “apes”
  4. It means, when move from left to right, we calculate the cost of insertions.

 

t 3
a 2
c 1
# 0 1 2 3 4
# a p e s
  1. So, we can always fill the left column and the bottom row using the above discussed deletion and insertion rules.
  2. Now we fill the third column of the table. The value in any cell of the table will be computed by the following rules.
    1. If the letters of the source string and the destination string corresponding to the given cell are the same, we will simply copy the value from the diagonal cell.
    2. Otherwise, we will add 1 to the cell at left (insertion cost), 1 to the lower cell (deletion cost) and 2 to the diagonal cell(substitution cost). The minimum result of the three computations will be written in the given cell.
    3. Let us fill the cell in the second column that corresponds to the letter c of source string and the letter a of the destination string. We add 1 to the left cell , the result is 2. We add 1 to the lower cell, the result is 2 and we add 2 to the diagonal cell, here again the result is 2. The minimum value is 2. So we write in 2 in this cell.
t 3
a 2 1
c 1 2
# 0 1 2 3 4
# a p e s
    1. Let us fill the cell corresponding to the letter a of source string and a of the destination string. Since we have same letter in both strings, we will simply copy the value from the diagonal cell, that is 1.
t 3
a 2 1
c 1 2
# 0 1 2 3 4
# a p e s
    1. Let us fill the cell in the second column that corresponds to the letter t of source string and the letter a of the destination string. We add 1 to the left cell, the result is 4. We add 1 to the lower cell, the result is 2 and we add 2 to the diagonal cell, here again the result is 4. The minimum value is 2. Therefore, we write 2 in this cell. Here the cost 2 means the cost to convert the source string cat to the destination string a. The cost is 2 as by deleting two letters c and t from the word cat, we can get the string a.
t 3 2
a 2 1
c 1 2
# 0 1
# a p e s
  1. Similarly, we keep on filling all the cells. The value in the top right cell is actually the editing distance from the source string cat to the destination string apes. So in this case the minimum editing distance from cat to apes is 5.
t 3 2 3 4 5
a 2 1 2 3 4
c 1 2 3 4 5
# 0 1 2 3 4
# a p e s

Computing alignments by back trace.

The minimum edit distance just tells us the minimum cost of editing to convert from source string to the destination string. It does not tell us how to do that editing. To find the alignment of the two strings to go from the source to the destination, we may use back trace.

  1. Start from the top right corner and draw arrows to the cells from where the value in the given cell may come. In this example, the value 5 in the top right corner may come from left, diagonal and the lower cell. So we draw arrow from 5 to all those source cells.
t 3 2 3 4 5
a 2 1 2 3 4
c 1 2 3 4 5
# 0 1 2 3 4
# a p e S
  1. These three arrows provide us three different paths. We may choose any one of them. All will give us the same cost that is 5. For now let us choose the diagonal path and move to the cell with value 3.
  2. Now draw arrows from the current cell (the cell having value 3) to its source cells. As it is obvious, this cell was derived from the left cell. So we draw to left cell.
t 3 2 3 4 5
a 2 1 2 3 4
c 1 2 3 4 5
# 0 1 2 3 4
# a p e S
  1. Figuring out the source cell for every given cell, we reach to the cell with zero 0 value. Of course we have more than one paths where we have multiple source cells. For now, let us analyze just one of the possible paths.
t 3 2 3 4 5
a 2 1 2 3 4
c 1 2 3 4 5
# 0 1 2 3 4
# a p e S
  1. For this path, we start from 0 and reach 5 at top right corner with following set of rules.
    1. Vertical movement means deletion
    2. Horizontal movement means insertion.
    3. The diagonal movement means substitution if the value was incremented by 2.
    4. The diagonal movement means no change if the same value was retained.
  2. By following these rules on the above path, we get following alignment.
    1. Vertical movement from zero 0 to one 1.Deletion
    2. Diagonal movement from 1 to 1 i.e. keeping the same value. No Change.
    3. Horizontal Movement from 1 to 2. Insertion.
    4. Horizontal movement from 2 to 3. Insertion.
    5. Diagonal movement from 3 to 5. Substitution.
  3. Hence the final alignment by following above path from string cat to apes is
    Deletion no change insertion insertion substitution and the cost for this alignment is as written in the top right cell 1 + 0 + 1 + 1 + 2 = 5
  4. Let us apply this alignment on the source cat.
    1. Deletion: we deleted the first letter c. The remaining string is “at”.
    2. No Change: we kept the second letter without making any change. The resultant string is “at”.
    3. Insertion: we inserted p after a. The resultant string is “apt”.
    4. Insertion: we inserted e after p. The resultant string is “apet”.
    5. Substitution/Replace: We substituted the letter t with s. The resulted string is “apes”
  5. Now let us choose some other path to get some other alignment.
t 3 2 3 4 5
a 2 1 2 3 4
c 1 2 3 4 5
# 0 1 2 3 4
# a p e S
  1. In this path, we have the alignment:
    deletion no change substitution insertion insertion.
  2. By following this alignment we will edit the source string in following order.
    1. Deletion: we deleted the first letter c. The remaining string is “at”.
    2. No Change: we kept the second letter without making any change. The resultant string is “at”.
    3. Substitution/Replace: We substituted the letter t with p. The resulted string is “ap”
    4. Insertion: we inserted e after p. The resultant string is “ape”.
    5. Insertion: we inserted s after p. The resultant string is “apes”.

That is it. Enjoy.