Minimum Edit Distance
It is the editing distance between a source word and destination word. We calculate the distance to find the closest word. When you search google with misspelled word, Google automatically replaces it with the nearest words. Similarly auto correct mode in mobile or any word processor helps you correct the misspelled words automatically.
Take for example the word the misspelled word “graffe”. We have to find the edit distance of this word with its close words. The closest word would be chosen to replace it. Let’s say we have following words.
- graf —- > first four letters are same two letters f and e would inserted. 2 operations
- graft —–> t would be deleted and f and e will be inserted. 3 operations
- grail —–> first three letters are same. i and l will be deleted and ffe inserted. 5 operations
- giraffe —> only i would be inserted. 1 operation.
So the word giraffe is the closest as it has minimum editing operations (just 1). But we have done it by our human wisdom. we first of all find the string that is same and keep it same. Then we replace, insert or delete the remaining letters that are different.
Aligning the source and destination words
In above example, we found that the letter E is same and the string TION is same. So we align the two words so that same strings are not replaced. We delete the first I from the source word and insert and extra C after E. This arrangement is optional. There may be many more ways to arrange or align both words. For instance we replace N with C and can insert U before TION. All these alignments are correct. We need to find the most optimum with minimum cost.
Minimum Edit Distance Algorithm
This algorithm is example of dynamic programming. In dynamic programming, we start with the simplest solution and then we keep on improving its efficiency step by step. In other words the last solution is derived from many intermediate solutions.
Suppose we to calculate the edit distance from the word “intention” to execution. We can do it in following steps. The source string has n characters and destination string has m characters.
- create a 2 dimensional array of m+1, and n+1 size.
- Fill the first column bottom up from 0 to n. It means we are deleting all the characters one by one and the cost of deletion for one character is 1. So the cost of n characters would be n.
- Fill the first bottom row left to right from 0 to m. It means we are inserting characters into of destination string one by one and the cost of insertion for one character is 1. So the cost of insertion of nth character would be m
- The insertion and deletion cost would be one and the substitution cost would be 2 for different letters and 0 for same letters.
- Now we have left column and bottom row and we would start filling the array from bottom to top. Let’s fill the second column from bottom to up. The first unfilled cell from bottom to top would be filled from left cell, bottom cell or diagonally left bottom cell.
- We would compare the cost of insertion, deletion and substitution and will choose the minimum.
- The cell (1,1) can derived from left (1,0) by insertion adding 1 to left value (it makes 1 + 1 = 2) or from bottom(0,1) by deleting adding 1 to bottom cell value (1 + 1 = 2) or substituting i with e adding 2 to (0,0) cell. All operations give us same result 2. So we write 2 in that cell.
- Look at the cell(4,1). Here the source letter and the destination is same(e). So the substitution cost 0 would be added to 3 ( 3+0 = 3) resulting 3. Where as deletion and insertion costs would (4+1 = 5) be 5 each. So we would write 3 in cell (4,1).
- The process continues and the value on top right corner would be the minimum editing distance of two letters (8 in this case).
Alignment Trace for minimum edit distance
There may be multiple trances for alignments or arrangements using minimum edit distance. Here are the rules.
- We start from the top right corner.
- We look to the left, down and left-down diagonal cell from where the current cell was derived.
- We can move to any cell from where the current cell was derived.
- The only condition is that every new cell in our path should have less or equal cost.
- In short any non decreasing path from 0,0 to n,m or any non increasing path from n,m to 0,0 corresponds to alignment trance.
- The diagonal movement means substitution, vertical movement means deletion and the horizontal movement means insertion.