Most apps have some kind of search.
It helps users find the specific thing if they don’t know its exact name. That’s at least how I define search in the context of this article.
Disclaimer: My solution is nothing groundbreaking. I just had to rediscover this solution on my own because it is hard to find optimized solutions for specific use cases. All I could find on the internet were general-purpose solutions that didn’t work for me.
I want all searches in all apps to get better. That’s why I publish my findings here.
What is the problem of most searches?
And the reason you need a search is because the user want to find something, so you can’t expect the user to already know the exact name of the things they want to find.
When I search for the color “yellow” and I misspell it as “yelow”. The color I wanted is not showing up. Why? Because the search is not searching, it is indexing.
When I search for the color “dark green” and it is (for whatever reason) saved as “green dark”, I don’t get the result. Why? Because the search is not respecting how language works.
When I search for the color “violet” but the color is saved as “purple”, I don’t get the result. Why? Because sometimes there are multiple words for the same thing.
What common solutions exist?
Vector search
Vector search is what you might know from search engines like Google. It focuses on semantic meaning instead of character matching. That’s why it works well for the case where you use a different word than was used to save the items you are searching for, but both describe the same thing.
A vector database is used to store the items in a multi-dimensional space. In this space, the terms that are close to each other have a similar meaning. The words “violet” and “purple” should be very close to each other, while “yellow” and “red” should be far apart.
A vector search is bad at handling misspellings. To a vector DB the term “yelow” is a very different from the term “yellow”.
Vector search
Type to search colors.
Fuzzy search
Fuzzy search is a common solution in developer tools like fzf or Telescope but I rarely see it in other contexts.
The term “Fuzzy search” is not tied to any algorithm. The most common type of fuzzy search is subsequence matching like Smith-Waterman or similar sequence alignment algorithms.
This means that they match strings where the typed characters appear in the exact sequential order specified, allowing for any number of un-typed characters (gaps) between them.
Because of this, you find “yellow”, even when you type “yelow”.
They can’t match strings where the typed characters are out of order or where a typed character is entirely missing from the target string. So when you type “yelllow”, it will not match “yellow”, because of the additional “l” in the middle.
Fuzzy search
Type to search colors.
My solution
This method modifies the common fuzzy search by splitting the search query and target strings by whitespace. For example, a query like “yel sun” becomes the independent words “yel” and “sun”. The system calculates a score for each word against the target item and adds them together (e.g., score("yel", item) + score("sun", item)). The closer the query word is to the item, the higher the score, allowing “yel sun” to score highly for “Sunbeam Yellow”.
Because the query is split into words, the input order does not matter. Results are then filtered by setting a threshold for the total score to 50% of the highest achieved score. Neither standard vector nor fuzzy searches would match this specific scenario. This algorithm solves the common limitations of fuzzy search, though matching semantic meaning still requires a vector database.
I found this to be the best solution for most scenarios. It’s fast, accurate, and easy to implement. It’s also easy to tweak to your needs. For example, you can change the threshold to a higher or lower deviation from the top to adjust the sensitivity of the search.
Magic search
Type to search colors.
The total score is calculated by splitting the search query by whitespace into independent words. The algorithm calculates a score for each word against the target item and adds them together.
Each individual word’s score is determined using the Smith-Waterman local alignment algorithm.
For a query word of length and a target text of length , an matrix is initialized with zeros. Each cell evaluates the maximum alignment score up to character in the word and character in the text.
The score for each cell is calculated using this recurrence relation:
Where:
- is the match or mismatch score. It equals 2 if the characters match, and -1 if they do not match.
- is the gap penalty, set to -1.
- The 0 resets negative scores, allowing the local alignment to start fresh anywhere in the text.
The final score for the word is the highest single integer value found anywhere within the populated matrix .
func RunMagicSearch(query string, items []SearchItem) []SearchResult {
// The total score is calculated by splitting the search query by whitespace into independent words.
words := SplitByWhitespace(ToLower(query))
var scored []SearchResult
for _, item := range items {
score := MagicScore(words, ToLower(item.Name))
if score > 0 {
scored = append(scored, SearchResult{Item: item, Score: score})
}
}
SortByScoreDescending(scored)
if len(scored) == 0 {
return []SearchResult{}
}
// Results are then filtered by setting a threshold for the total score to 50% of the highest achieved score.
topScore := scored[0].Score
threshold := topScore * 0.5
var filtered []SearchResult
for _, result := range scored {
if result.Score >= threshold {
filtered = append(filtered, result)
}
}
return filtered
}
func MagicScore(words []string, text string) float64 {
totalScore := float64(0)
// The algorithm calculates a score for each word against the target item and adds them together.
for _, word := range words {
totalScore += SmithWatermanScore(word, text)
}
return totalScore
}
The overall time complexity is , where:
- : Number of items in the list.
- : Number of words per query.
- : Length of a query word.
- : Length of a target text.
Potential alternative threshold logic:
- Statistical Variance: Cut off results that fall a specific number of standard deviations below the mean score of the current result set.
- Max-Drop (Elbow Method): Sort scores descending and set the threshold at the largest numerical gap between consecutive scores.
- Logarithmic Scaling: Base the threshold on a non-linear curve tied to the query’s character length or total word count.
- 1D Clustering: Separate the returned scores into two distinct clusters (matches versus noise) using Jenks natural breaks or K-means.