# Finding similar/related texts algorithms

I searched a lot in stackoverflow and Google but I didn't find the best answer for this. Actually, I'm going to develop a news reader system that crawl and collect news from web (with a crawler) and then, I want to find similar or related news in websites (In order to prevent showing duplicated news in website)

I think the best live example for that is Google News, it collect news from web and then categorize and find related news and articles. This is what I want to do.

What's the best algorithm for doing this?

## Answers

A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance (en.wikipedia.org/wiki/Cosine_similarity) between these vectors as an estimate for semantic distance between articles.

This will probably capture semantic relationships better than Levenstein distance and is much faster to compute.

This is one: http://en.wikipedia.org/wiki/Levenshtein_distance

public static SqlInt32 ComputeLevenstheinDistance(SqlString firstString, SqlString secondString) { int n = firstString.Value.Length; int m = secondString.Value.Length; int[,] d = new int[n + 1,m + 1]; // Step 1 if (n == 0) { return m; } if (m == 0) { return n; } // Step 2 for (int i = 0; i <= n; d[i, 0] = i++) { } for (int j = 0; j <= m; d[0, j] = j++) { } // Step 3 for (int i = 1; i <= n; i++) { //Step 4 for (int j = 1; j <= m; j++) { // Step 5 int cost = (secondString.Value[j - 1] == firstString.Value[i - 1]) ? 0 : 1; // Step 6 d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost); } } // Step 7 return d[n, m]; }

This is handy for the task at hand: http://code.google.com/p/boilerpipe/

Also, if you need to reduce the number of words to analyze, try this: http://ots.codeplex.com/

I have found the OTS VERY useful in sentiment analysis, whereby I can reduce the number of sentences into a small list of common phrases and/or words and calculate the overall sentiment based on this. The same should work for similarity.