– Part 2 of Autocorrect Series
In a previous article, we discussed how a typical autocorrect solution is only 60% accurate and how this low level of accuracy could lead to $3M in lost revenue.
Unbxd Autocorrect, on the other hand, offers 90% accuracy. This 30-percentage-point improvement can turn that lost revenue into a relevant shopper experience and high-margin profit.
In this follow-up article, we discuss the technical details of how we built Unbxd Autocorrect to achieve 90% accuracy.
The 3-step process of Unbxd Autocorrect
Unbxd Autocorrect works in three layers:
- It checks whether the entered shopper query contains any misspellings
- If yes, it computes the scores of correct alternatives to the misspelled word, taking word frequency, error likelihood, and shopper context into consideration
- Finally, it returns the word with the highest score as the closest possible alternative to the misspelled word
Step 1: Identifying misspellings
To determine whether a word is misspelled, we first need a list of all the words we know to be correct. This list is our corpus, and it forms the foundation of a competent autocorrect solution. Any word not present in the corpus, or any deviation from a known word in it, is considered an error.
A limited corpus makes it difficult to link a misspelled query to the right terms; it can also flag a correct word as a misspelling. Therefore, it is imperative that this corpus be exhaustive and inclusive.
For instance, take the shopper query “skorts.” A basic autocorrect solution with a limited corpus would flag this valid word as a misspelling. It would then return the closest alternative in its corpus as the “correct” spelling or, worse, lead to a zero-results page.
Limited corpus leads an autocorrect solution to mistake a correct word for a spelling error
Unbxd Autocorrect uses an extensive and comprehensive corpus sourced from product catalogs of e-commerce sites, basic and advanced English corpus, and other credible sources such as ConceptNet, in order to identify the misspelled words with a high level of accuracy.
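At its core, Step 1 is a set-membership check against the corpus. A minimal sketch in Python (the toy corpus and function name here are illustrative, not Unbxd’s actual implementation):

```python
def find_misspellings(query, corpus):
    """Return the query tokens that are absent from the known-word corpus."""
    return [w for w in query.lower().split() if w not in corpus]

# Toy corpus; the real one spans product catalogs, English word lists,
# and sources such as ConceptNet.
corpus = {"skorts", "black", "shirts"}
print(find_misspellings("skorts blick", corpus))  # -> ['blick']
```

Because “skorts” is in the corpus, it passes untouched; only the unknown token “blick” is flagged for correction.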
Step 2: Computing closest possible alternatives
Say a shopper enters the query “blick shurts.” Unbxd Autocorrect recognizes that neither word is in the Unbxd corpus, and it corrects the query one word at a time.
First, Unbxd Autocorrect considers “blick.” Some possible contenders for this misspelled word are “black,” “brick,” “block,” “back,” and so on. To find the most relevant suggestion for “blick” from the many possible alternatives, Unbxd Autocorrect asks:
“Among all these possibilities, which one is the most likely to have been the originally intended word?”
To answer this question, there are a few other factors that Unbxd Autocorrect considers:
1. How frequently does the word occur in our corpus?
The more frequently a word occurs in our corpus, the more likely it is that the shopper intended to type that word.
2. What is the likelihood of the spelling mistake?
If a spelling mistake looks highly unlikely, say, substituting a “z” with a “p” (keys that are far apart on a QWERTY keyboard), then it makes sense to assume that word is less likely to have been the one originally intended.
3. What is the shopper context?
The words that occur before and after the misspelled word in the query could give Unbxd Autocorrect information about the context in which the spelling mistake was made.
Unbxd Autocorrect quantifies each of the above factors and comes up with a final score for all the possible suggestions for the misspelled word in the query. Then, it deems the suggestion with the highest score as the most relevant substitution for the misspelled word.
Step 2.1: Computing word frequency
Unbxd Autocorrect uses language models (probability distributions over a sequence of words) to quantify the word frequency in our corpus.
It calculates the probability of a sequence of words using the n-gram model:

P(w_1, …, w_m) ≈ ∏_{i=1}^{m} P(w_i | w_{i−(n−1)}, …, w_{i−1})

It computes the conditional probability from frequency counts as:

P(w_i | w_{i−(n−1)}, …, w_{i−1}) = count(w_{i−(n−1)}, …, w_i) / count(w_{i−(n−1)}, …, w_{i−1})

When n = 1, Unbxd Autocorrect computes the word probability as:

P(w_i) = count(w_i) / N, where N is the total number of words in the corpus.
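The n = 1 (unigram) case reduces to simple frequency counting. A sketch with a toy corpus (names and numbers are illustrative):

```python
from collections import Counter

# Toy corpus; the real corpus is built from product catalogs,
# English word lists, and sources such as ConceptNet.
corpus_words = ["black", "shirts", "black", "jeans", "black"]
counts = Counter(corpus_words)
total = sum(counts.values())

def unigram_probability(word, counts, total):
    """P(w) = count(w) / N under the unigram (n = 1) model."""
    return counts[word] / total

print(unigram_probability("black", counts, total))  # -> 0.6 (3 of 5 words)
```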
Step 2.2: Computing likelihood of spelling errors
In order to quantify the likelihood of a spelling mistake, Unbxd Autocorrect uses the “Noisy Channel Model.” This model assumes that you send a signal (originally intended word) through a channel, which adds noise and generates a noisy output (spelling error).
Noisy channel model
Unbxd Autocorrect considers the following types of common spelling mistakes in order to compute the spelling mistake likelihood:
Insertion: Adding a letter to a word, resulting in an error
Deletion: Removing a letter from a word
Substitution: Replacing a single letter with a different one
Reversal: Interchanging two adjacent letters
Spelling errors can also combine more than one of the above types of deviations. For example:
‘beggining’ is an error formed by inserting a ‘g’ and deleting an ‘n’ from ‘beginning’
‘acheivemen’ is an error formed by reversing ‘ie’ and deleting ‘t’ from ‘achievement’
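Candidate generation under these four edit types is commonly implemented in the style popularized by Peter Norvig’s spelling corrector; the sketch below follows that approach and is not necessarily Unbxd’s exact code:

```python
def edits1(word):
    """All strings one insertion, deletion, substitution, or
    reversal (adjacent transposition) away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    reversals = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutions = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + reversals + substitutions + inserts)

print("black" in edits1("blick"))  # -> True: one substitution away
```

Errors that combine two deviations, such as “beggining” for “beginning,” are reachable by applying edits1 twice.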
Unbxd Autocorrect calculates the probabilities of the above types of spelling mistakes using matrices such as insert_matrix, delete_matrix, substitution_matrix, and reversal_matrix.
To calculate these matrices, it uses a data set of standard spelling mistakes in the English language. For example,
insert_matrix[‘ei’] = Number of times the character ‘i’ has been mistakenly inserted after ‘e’ (i.e., ‘e’ typed as ‘ei’)
delete_matrix[‘gh’] = Number of times the character “h” has been deleted when it follows the letter “g”
substitution_matrix[‘mn’] = Number of times the character “m” has been replaced by “n”
reversal_matrix[‘io’] = Number of times “io” has been mistakenly typed as “oi”
Unbxd Autocorrect then calculates the error probability as follows:
Probability(mistakenly inserting “i” after “e”) = Number of times “e” has been mistakenly written as “ei” / Number of times “e” has occurred. That is:

P(ins[e, i]) = insert_matrix[‘ei’] / count(‘e’)
If an error involves multiple spelling mistakes, Unbxd Autocorrect multiplies the individual probabilities to arrive at the final likelihood of the error.
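Under the matrix definitions above, the insertion probability and the multiplication of several edit probabilities can be sketched as follows (the counts are made-up illustrative numbers, not real error data):

```python
# Illustrative confusion-matrix counts (made-up numbers).
insert_matrix = {"ei": 120}   # times 'e' was mistakenly typed as 'ei'
char_counts = {"e": 4000}     # times 'e' occurred in the error data set

def insertion_probability(prev_char, inserted_char):
    """P(inserting `inserted_char` after `prev_char`), i.e.
    insert_matrix['ei'] / count('e') for the 'ei' example above."""
    return insert_matrix[prev_char + inserted_char] / char_counts[prev_char]

def error_probability(edit_probs):
    """An error with several edits: multiply the individual probabilities."""
    p = 1.0
    for q in edit_probs:
        p *= q
    return p

p_ei = insertion_probability("e", "i")  # 120 / 4000 = 0.03
print(error_probability([p_ei, 0.02]))  # combined likelihood of two edits
```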
Step 2.3: Computing shopper context
Sometimes, the word intended by the shopper is not the most frequently occurring word. To account for this, Unbxd Autocorrect gets information about the context in which the shopper entered this word.
Say, a shopper enters the query “chicken jeg.” Without considering the other correctly spelled word (chicken), a common autocorrect solution might return “jug” as the most relevant substitution for the error “jeg.”
However, Unbxd Autocorrect considers the word “chicken” entered before “jeg.” So, it is able to correct “jeg” to “leg,” since “chicken leg” is a phrase that is present in our corpus while “chicken jug” is not.
It calculates the context score by computing bigram or trigram scores using the n-gram model discussed above. This context score is calculated only if the query has more than one word.
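A bigram context score of this kind can be sketched as follows (the counts are illustrative; the real model is built over the full corpus):

```python
from collections import Counter

# Illustrative counts: "chicken leg" appears in the corpus, "chicken jug" does not.
unigram_counts = Counter({"chicken": 50, "leg": 30, "jug": 10})
bigram_counts = Counter({("chicken", "leg"): 20})

def bigram_score(prev_word, candidate, bigram_counts, unigram_counts):
    """P(candidate | prev_word) = count(prev, candidate) / count(prev).
    Counter returns 0 for unseen bigrams such as ("chicken", "jug")."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, candidate)] / unigram_counts[prev_word]

print(bigram_score("chicken", "leg", bigram_counts, unigram_counts))  # -> 0.4
print(bigram_score("chicken", "jug", bigram_counts, unigram_counts))  # -> 0.0
```

Because “chicken jug” never occurs, its context score is zero, which pushes “leg” ahead of “jug” in the final ranking.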
Step 3: Returning the closest possible alternative with the highest score
Once Unbxd Autocorrect has all three scores (word frequency, error likelihood, and shopper context), it multiplies them to get the final score. Unbxd returns the suggestion with the highest score as the most relevant substitution for a misspelled word in a query.
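Putting the three signals together, the final ranking can be sketched as follows (all numbers are illustrative):

```python
def final_score(freq_prob, error_prob, context_score):
    """Multiply the three signals, as described above."""
    return freq_prob * error_prob * context_score

def best_correction(candidates):
    """candidates maps each suggestion to its (freq, error, context) scores."""
    return max(candidates, key=lambda w: final_score(*candidates[w]))

# Illustrative scores for correcting "jeg" in the query "chicken jeg".
candidates = {
    "leg": (0.010, 0.05, 0.40),
    "jug": (0.015, 0.08, 0.00),  # likelier typo, but no "chicken jug" context
}
print(best_correction(candidates))  # -> leg
```

Even though “jug” wins on frequency and error likelihood alone, its zero context score eliminates it, and “leg” is returned.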
Unbxd Autocorrect offers a 90% accuracy level
Unbxd Autocorrect automatically handles multiple misspellings in a shopper query, taking various factors, particularly context, into consideration. As a result, it offers a 90%+ accuracy level.
This exceptional level of accuracy ensures that shoppers get relevant search results even if they have misspelled the search query. This significant improvement in both the technology and the shopper experience could add a potential $3M to a retailer’s top line.
To know how Unbxd Autocorrect can benefit your business, contact us at unbxd.com/contact.