Soundex Algorithm

The Soundex algorithm is used to match names despite misspellings. When searching for a person's name, we often make spelling mistakes. For instance, the name Jurafsky can be misspelled as Jarofsky, Jarovsky, or Jarovski. Soundex assigns the same code to names that sound alike, so searching by code finds all of these variants of Jurafsky. The algorithm is mostly used in libraries, particularly for English names.

Steps for Soundex Algorithm

Step 1: Keep the initial letter as it is. Drop all other occurrences of the following letters:
a, e, i, o, u, y, w, h

Step 2: Replace the remaining letters with digits as follows. Keep the first letter unchanged.
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6

Step 3: If the same digit is repeated consecutively, keep just one occurrence and delete the rest.

Step 4: The usual format for a Soundex code is Letter Digit Digit Digit (i.e., the first letter followed by three digits). If the code has more than three digits, delete the extra digits; if it has fewer than three digits, add trailing zeros.
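
The four steps above can be sketched in Python as follows. This is a minimal sketch of the simplified procedure described in this post, not a reference implementation; the names CODES and soundex are chosen here for illustration, and fuller Soundex variants apply extra rules (for example around h and w) that are not covered.

```python
# Digit codes from Step 2. Letters that are not listed here
# (a, e, i, o, u, y, w, h) are the ones dropped in Step 1.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Simplified Soundex following the four steps in this post."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    # Steps 1 and 2: drop the unmapped letters and map the others to digits.
    digits = [CODES[ch] for ch in rest if ch in CODES]
    # Step 3: keep one digit from each run of identical digits.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step 4: cut to three digits, or pad with trailing zeros.
    return first.upper() + "".join(collapsed)[:3].ljust(3, "0")
```

Calling soundex("Jurafsky") and soundex("Jarovski") with this sketch gives the same code, which is what makes the misspelled forms findable.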

Example:

Name: Jurafsky

Step 1: Jrfsk // removing vowels and y, w, h
Step 2: J6122 // mapping from letters to digits
Step 3: J612 // deleting the consecutively repeated digits

Name: Jarofsky

Step 1: Jrfsk // removing vowels and y, w, h
Step 2: J6122 // mapping from letters to digits
Step 3: J612 // deleting the consecutively repeated digits

Name: Jarovsky

Step 1: Jrvsk // removing vowels and y, w, h
Step 2: J6122 // mapping from letters to digits
Step 3: J612 // deleting the consecutively repeated digits

Name: Jarovski

Step 1: Jrvsk // removing vowels and y, w, h
Step 2: J6122 // mapping from letters to digits
Step 3: J612 // deleting the consecutively repeated digits

Name: Bill

Step 1: Bll // removing vowels and y, w, h
Step 2: B44 // mapping from letters to digits
Step 3: B4 // deleting the consecutively repeated digits
Step 4: B400 // adding trailing zeros to get the format LetterDigitDigitDigit

Name: Clinton

Step 1: Clntn // removing vowels and y, w, h
Step 2: C4535 // mapping from letters to digits
Step 3: C4535 // no consecutively repeated digits to delete
Step 4: C453 // removing the extra digit to get the format LetterDigitDigitDigit

Usage in Databases

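The algorithm has been implemented in database servers such as MySQL and SQL Server. You can compute and compare Soundex codes with the built-in SOUNDEX function. Note that MySQL's SOUNDEX() can return more than four characters for longer names, so its result may need to be truncated to match the format described above. Here is a sample MySQL statement that returns the Soundex code for a string.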

SELECT SOUNDEX('Clinton');
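
To illustrate comparing codes inside a query, here is a small sketch that uses Python's built-in sqlite3 module instead of MySQL, so that it runs without a database server. It registers a simplified Soundex function under the hypothetical name py_soundex and selects the rows whose code matches the search term. The table name people and the helper find_similar are assumptions made for this example. In MySQL, the same idea would be a condition such as SOUNDEX(name) = SOUNDEX('Jarovsky').

```python
import sqlite3
from itertools import groupby

# Letter groups and their digits, as in Step 2 of this post.
GROUPS = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
CODES = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex(name):
    """Simplified Soundex: first letter plus three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = [CODES[ch] for ch in letters[1:] if ch in CODES]
    collapsed = "".join(d for d, _ in groupby(digits))  # drop repeated digits
    return letters[0].upper() + collapsed[:3].ljust(3, "0")


def find_similar(names, query):
    """Return the stored names whose Soundex code equals that of query."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL as py_soundex (one argument).
    conn.create_function("py_soundex", 1, soundex)
    conn.execute("CREATE TABLE people (name TEXT)")
    conn.executemany("INSERT INTO people VALUES (?)", [(n,) for n in names])
    rows = conn.execute(
        "SELECT name FROM people "
        "WHERE py_soundex(name) = py_soundex(?) ORDER BY name",
        (query,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    stored = ["Jurafsky", "Jarofsky", "Jarovski", "Clinton", "Bill"]
    print(find_similar(stored, "Jarovsky"))
```

Searching for "Jarovsky" in this sketch returns the three Jurafsky variants and leaves out Clinton and Bill, since their codes differ.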