How do I match a longer string to shorter word or string

I have a database of items with tags, such as:

  • item1 is tagged with "pork with apple sauce"
  • item2 is tagged with "pork",
  • item3 is tagged with "apple sauce".

If I match the string:

"Today I would like to eat pork with apple sauce, it would fill me up"

against the tags, I would get three results. However, I just want to get the most specific one, which in this case would be item1.

This is just an example and i'm not using a particular database. Just string and map in ruby. I came up with "fuzzy search". I'm not sure if this is correct. Can anybody suggest how to solve this particular problem?


Yes, you need to do a fuzzy match (aka approximate match). It is quite a well known problem, and implementing an approximate matching algorithm by hand is not an easy task (but I'm sure it's very fun! =D). There are lots of things that can affect how "similar" two strings, A and B, are, depending on what things you consider important, like how many times A appears in B, or how close the order and distance between the words in A appear in B, or if the "important" words in A appear in B, etc.

If you can get by with an existing library, there seems to be a couple of Ruby gems that can get the job done. For example, using this one called fuzzy-string-match, which uses the Jaro-Winkler distance ported from Lucene (a Java library... it also seems to have preserved the Java convention of camelCased method names ¬¬):

require 'fuzzystringmatch'

matcher = FuzzyStringMatch::JaroWinkler.create(:pure)

tags = ["pork with apple sauce", "pork", "apple sauce"]
input = "Today I would like to eat pork with apple sauce, it would fill me up"

# Select the tag by distance to the input string (distance == 1 means perfect 
# match)
best_tag = tags.max_by { |tag| matcher.getDistance(tag, input) }

p best_tag 

Will correctly select "pork with apple sauce".

There's also this other gem called amatch that has many other approximate matching algorithms.

Need Your Help

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.