Finding similar/related texts algorithms

I searched a lot in stackoverflow and Google but I didn't find the best answer for this. Actually, I'm going to develop a news reader system that crawl and collect news from web (with a crawler) and then, I want to find similar or related news in websites (In order to prevent showing duplicated news in website)

I think the best live example for that is Google News, it collect news from web and then categorize and find related news and articles. This is what I want to do.

What's the best algorithm for doing this?

Answers


A relatively simple solution is to compute a tf-idf vector (en.wikipedia.org/wiki/Tf*idf) for each document, then use the cosine distance (en.wikipedia.org/wiki/Cosine_similarity) between these vectors as an estimate for semantic distance between articles.

This will probably capture semantic relationships better than Levenstein distance and is much faster to compute.


This is one: http://en.wikipedia.org/wiki/Levenshtein_distance

public static SqlInt32 ComputeLevenstheinDistance(SqlString firstString, SqlString secondString)
{
    int n = firstString.Value.Length;
    int m = secondString.Value.Length;
    int[,] d = new int[n + 1,m + 1];

    // Step 1
    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // Step 2
    for (int i = 0; i <= n; d[i, 0] = i++)
    {
    }

    for (int j = 0; j <= m; d[0, j] = j++)
    {
    }

    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (secondString.Value[j - 1] == firstString.Value[i - 1]) ? 0 : 1;

            // Step 6
            d[i, j] = Math.Min(Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}

This is handy for the task at hand: http://code.google.com/p/boilerpipe/

Also, if you need to reduce the number of words to analyze, try this: http://ots.codeplex.com/

I have found the OTS VERY useful in sentiment analysis, whereby I can reduce the number of sentences into a small list of common phrases and/or words and calculate the overall sentiment based on this. The same should work for similarity.


Need Your Help

Web page - using PDF file template instead of FPDF

php mysql web pdf-generation

I have a small PHP webpage on which people can register for some events. After registration they receive email with confirmation in PDF (template with fill in fields like last name, mail address et...

Prevent batch file from closing after it executes an external .exe program

windows powershell batch-file cmd

Believe it or not, I've searched all over stackoverflow and Google and can't find an answer to this that works for me.

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.