– Part 2 of the Autocorrect Series

In a previous article, we discussed how a typical autocorrect solution is only 60% accurate and how this low accuracy could lead to $3M in lost revenue. Unbxd Autocorrect, on the other hand, offers 90% accuracy. This 30-percentage-point increase could turn lost revenue into a relevant shopper experience and high-margin profit. In this follow-up article, we discuss the technical details of how we built Unbxd e-commerce Site Search Autocorrect to provide 90% accuracy.
The 3-step process of Unbxd Autocorrect

Unbxd Autocorrect works in three layers:
- It checks whether the shopper's query contains any misspellings
- If yes, it computes the scores of correct alternatives to the misspelled word, taking word frequency, error likelihood, and shopper context into consideration
- Finally, it returns the word with the highest score as the closest possible alternative to the misspelled word
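The three layers above can be sketched as a small pipeline. This is an illustrative outline, not the production implementation; `corpus`, `candidates`, and `score` are hypothetical stand-ins for the components described in the rest of this article.

```python
# Minimal sketch of the three-layer flow; the corpus lookup, candidate
# generation, and scoring are detailed in the steps below.
def autocorrect(query, corpus, candidates, score):
    corrected = []
    for word in query.split():
        if word in corpus:                      # layer 1: is it misspelled?
            corrected.append(word)
            continue
        # layer 2: score every plausible alternative
        scored = [(score(word, alt), alt) for alt in candidates(word)]
        # layer 3: keep the highest-scoring alternative
        corrected.append(max(scored)[1] if scored else word)
    return " ".join(corrected)
```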
Step 1: Identifying misspellings

To determine whether a word is misspelled, we first need a list of all the words we know to be correct. This list is our corpus, and it forms the foundation of a competent autocorrect solution. Any word not present in our corpus, or any deviation from a known word in our corpus, is considered an error. A limited corpus makes it difficult to link a misspelled query to the right terms; it can also tag a correct word as a misspelling. It is therefore imperative that the corpus be exhaustive and inclusive. For instance, take the shopper query “skorts.” A basic autocorrect solution with a limited corpus would not identify this query as correct if it is not present in the corpus. The solution would return the closest possible alternative as the correct spelling or, worse, lead to a zero-results page.
A limited corpus leads an autocorrect solution to mistake a correct word for a spelling error

Unbxd Autocorrect uses an extensive, comprehensive corpus sourced from the product catalogs of e-commerce sites, basic and advanced English corpora, and other credible sources such as ConceptNet, in order to identify misspelled words with a high level of accuracy.
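A corpus built this way boils down to a membership test at query time. The snippet below is a hypothetical illustration with toy word sets; real corpora are vastly larger.

```python
# Hypothetical corpus merged from several sources, then used for a
# simple membership test. The word sets here are illustrative only.
catalog_terms = {"skorts", "shirts", "jeans"}      # from product catalogs
english_terms = {"black", "brick", "block"}        # from English word lists
external_terms = {"skort", "culottes"}             # from sources like ConceptNet

corpus = catalog_terms | english_terms | external_terms

def is_misspelled(word):
    """A word absent from the corpus is flagged as a potential misspelling."""
    return word.lower() not in corpus
```

Because “skorts” is present in the merged corpus, it is correctly accepted rather than “corrected” away.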
Step 2: Computing the closest possible alternatives

Say a shopper enters the query “blick shurts.” Unbxd Autocorrect recognizes that neither of these words is in the Unbxd corpus. It corrects the query one word at a time. First, it considers “blick.” Some possible contenders for this misspelled word are “black,” “brick,” “block,” “back,” and so on. To find the most relevant suggestion for “blick” from the many possible alternatives, Unbxd Autocorrect asks: “Among all these possibilities, which one is most likely to have been the originally intended word?” To answer this question, it considers a few factors:
1. How frequently does the word occur in our corpus?
The more frequently a word occurs in our corpus, the more likely it is that the shopper intended to type that word.
2. What is the likelihood of the spelling mistake?
If a spelling mistake looks highly unlikely, say, substituting a “z” with a “p” (keys placed far apart on a QWERTY keyboard), then it makes sense to assume that word is less likely to have been the originally intended one.
3. What is the shopper context?
The words that occur before and after the misspelled word in the query give Unbxd Autocorrect information about the context in which the spelling mistake was made.

Unbxd Autocorrect quantifies each of the above factors and computes a final score for every possible suggestion for the misspelled word. It then deems the suggestion with the highest score the most relevant substitution for the misspelled word.
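In noisy-channel terms, these factors combine multiplicatively: the best candidate maximizes the product of the error likelihood and the (context-aware) language-model probability. The sketch below assumes hypothetical probability functions `p_error` and `p_lm` standing in for Steps 2.2 and 2.1 respectively.

```python
# Hypothetical combination of the scoring signals into one final score.
def final_score(candidate, typed, context, p_error, p_lm):
    # p_error: likelihood of mistyping `candidate` as `typed` (Step 2.2)
    # p_lm: frequency/context probability of `candidate` (Step 2.1)
    return p_error(typed, candidate) * p_lm(candidate, context)

def best_correction(typed, candidates, context, p_error, p_lm):
    # The suggestion with the highest score wins.
    return max(candidates,
               key=lambda c: final_score(c, typed, context, p_error, p_lm))
```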
Step 2.1: Computing word frequency

Unbxd Autocorrect uses language models (probability distributions over sequences of words) to quantify word frequency in our corpus. It calculates the probability of a sequence of words using the n-gram model:

P(w1, …, wm) ≈ ∏ P(wi | wi−(n−1), …, wi−1)

It computes the conditional probability from frequency counts as:

P(wi | wi−(n−1), …, wi−1) = count(wi−(n−1), …, wi−1, wi) / count(wi−(n−1), …, wi−1)

When n = 1, Unbxd Autocorrect computes the word probability as:

P(wi) = count(wi) / N, where N is the total number of words in the corpus
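The count-ratio formulas above can be demonstrated with a toy bigram model. This is a minimal sketch over an invented six-word corpus; a production model would use a far larger corpus and add smoothing for unseen n-grams.

```python
from collections import Counter

# Toy corpus for illustration only
tokens = "black shirts black jeans blue shirts".split()

unigrams = Counter(tokens)                     # count(w)
bigrams = Counter(zip(tokens, tokens[1:]))     # count(prev, w)
total = len(tokens)                            # N

def p_unigram(w):
    # n = 1: P(w) = count(w) / N
    return unigrams[w] / total

def p_bigram(w, prev):
    # n = 2: P(w | prev) = count(prev, w) / count(prev)
    return bigrams[(prev, w)] / unigrams[prev]
```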
Step 2.2: Computing the likelihood of spelling errors

In order to quantify the likelihood of a spelling mistake, Unbxd Autocorrect uses the “Noisy Channel Model.” This model assumes that you send a signal (the originally intended word) through a channel, which adds noise and generates a noisy output (the spelling error).
The noisy channel model

Unbxd Autocorrect considers the following types of common spelling mistakes in order to compute the spelling-mistake likelihood:
- Insertion: Adding a letter to a word, resulting in an error
- Deletion: Deleting a letter from a word
- Substitution: Replacing a letter with a different one
- Reversal: Interchanging two adjacent letters
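These four edit operations can be sketched as candidate generation: produce every string one edit away from the typed word, then filter against the corpus. This is an illustrative implementation in the style popularized by Peter Norvig's spelling corrector, not Unbxd's production code.

```python
import string

def edits1(word):
    """All strings one insertion, deletion, substitution, or reversal away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    reversals = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutions = {L + c + R[1:] for L, R in splits if R for c in letters}
    insertions = {L + c + R for L, R in splits for c in letters}
    return deletes | reversals | substitutions | insertions
```

For example, “black” is one substitution away from “blick”, so it appears among the generated candidates.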
For each error type, Unbxd Autocorrect maintains confusion matrices of error counts. For example:

insert_matrix['ei'] = Number of times the character 'i' has been mistakenly inserted after 'e'
delete_matrix['gh'] = Number of times the character 'h' has been deleted when it follows the letter 'g'
substitution_matrix['mn'] = Number of times the character 'm' has been mistakenly replaced by 'n'
reversal_matrix['io'] = Number of times 'io' has been mistakenly typed as 'oi'

Unbxd Autocorrect then calculates the error probability as:

Probability(mistakenly inserting 'i' after 'e') = Number of times 'e' has been mistakenly written as 'ei' / Number of times 'e' has occurred

That is, P(ei | e) = insert_matrix['ei'] / count('e')

If there are multiple spelling mistakes in an error, Unbxd Autocorrect multiplies the individual probability values to arrive at the final probability of committing the spelling mistake.