How to Make a Thing in Haskell, Part 4: Knowing the Score

At the heart of any fuzzy finder is an algorithm that determines what strings match the user’s query. selecta’s is especially suited to interactive fuzzy finding because it also assigns each string a score to approximate the quality of the match, so high-quality matches can be presented at the top of the list for ease of selection. This post describes selecta’s fuzzy matching algorithm so I’ll be prepared to reimplement it in Haskell.

selecta’s scoring algorithm is simple and elegant. Given a query and a potential string to be matched, which I’ll call a choice:

That last division step is where the magic happens. The numerator – (length(query) / length(shortest matching substring)) – penalizes long matches. So, if you’re comparing an aardvark to a pizza with the query aa, this step penalizes a pizza more than an aardvark – the shortest substring matching aa in a pizza is the whole string, which is 7 characters long, while the shortest matching substring in an aardvark is aa, which is only two characters long.

Then, this match-quality approximation is divided by the length of the choice. The idea here is that longer choices contain a greater variety of letters and are more likely to match the query than shorter choices. In a sense, the shorter the match, the more precisely the choice matches the query. The intuitive way to put this is that if you’re fuzzy-finding for a file and type the query bin, you’re more likely to want bin/ or brain than breakpoints/.

I’m also going to use Gary Bernhardt’s algorithm for finding the shortest matching substring. This is another clever, simple solution:

So, when matching the query an to the choice old arcana, you generate the suffixes arcana, ana, and a. The first matching substrings in these are arcan and an (a does not match at all), making the shortest substring an, at 2 characters long.

You may notice that this doesn’t compare the lengths of all the matches. For example, if you match aardvark on the query aa, you only compare aa and ardva, missing the valid match aardva. However, because aa is a prefix of that suffix, it’s guaranteed to be shorter than aardva, and since we’re just looking for the shortest substring, this incomplete consideration won’t change the correctness of the search.

So those are all the necessary moving parts – a way to find the shortest matching substring, and a function that calculates the score based on that. Next time, I’ll flex my tiny Haskell muscles and do some string manipulation!