How can I find the Largest Common Substring between two strings in PHP?

Is there a fast algorithm for finding the Largest Common Substring of two strings, or is it an NP-complete problem?

In PHP I can find a needle in a haystack:

<?php

if (strstr("there is a needle in a haystack", "needle")) {
    echo "found<br>\n";
}
?>

I guess I could do this in a loop over one of the strings but that would be very expensive! Especially since my application of this is to search a database of email and look for spam (i.e. similar emails sent by the same person).

Does anyone have any PHP code they can throw out there?

Answers


The similar_text function may be what you want.

This calculates the similarity between two strings. Returns the number of matching chars in both strings.

You may also want to look at levenshtein, which returns the edit distance between two strings.
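For a quick feel of what these two built-ins report, here is a minimal sketch. The sample strings are made up for illustration; only `similar_text()` and `levenshtein()` themselves come from PHP.

```php
<?php
// Illustrative inputs, not taken from the question.
$a = "World";
$b = "Word";

// similar_text() returns the number of matching characters and, through the
// optional third argument (passed by reference), a similarity percentage.
$matching = similar_text($a, $b, $percent);

// levenshtein() returns the minimum number of single-character insertions,
// deletions and substitutions needed to turn $a into $b.
$distance = levenshtein($a, $b);

printf("matching=%d percent=%.1f distance=%d\n", $matching, $percent, $distance);
```

Note that neither function returns the common substring itself; they only score how alike the two strings are.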


Especially since my application of this is to search a database of email and look for spam (i.e. similar emails sent by the same person).

I think you should be looking at Bayesian spam inference algorithms, not necessarily longest common substring.

http://www.devshed.com/c/a/PHP/Implement-Bayesian-inference-using-PHP-Part-1/
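To give a rough idea of what Bayesian filtering looks like, here is a minimal naive Bayes sketch. It is not the code from the linked article: the function names (`tokenize`, `train`, `spamProbability`), the model layout and the training messages are all hypothetical, and it assumes at least one training message per class.

```php
<?php
// Hypothetical helper: split text into lowercase word tokens.
function tokenize(string $text): array
{
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Count word occurrences per class. $samples is a list of [label, text]
// pairs where label is 'spam' or 'ham' (an assumed format).
function train(array $samples): array
{
    $model = ['spam' => [], 'ham' => [], 'docs' => ['spam' => 0, 'ham' => 0], 'vocab' => []];
    foreach ($samples as [$label, $text]) {
        $model['docs'][$label]++;
        foreach (tokenize($text) as $word) {
            $model[$label][$word] = ($model[$label][$word] ?? 0) + 1;
            $model['vocab'][$word] = true;
        }
    }
    return $model;
}

// Estimate P(spam | text) by summing per-word log-likelihood ratios.
function spamProbability(array $model, string $text): float
{
    $vocabSize = count($model['vocab']);
    $spamTotal = array_sum($model['spam']);
    $hamTotal  = array_sum($model['ham']);
    // Start from the prior odds of spam vs. ham.
    $logOdds = log($model['docs']['spam']) - log($model['docs']['ham']);
    foreach (tokenize($text) as $word) {
        // Add-one (Laplace) smoothing so unseen words do not zero out a class.
        $pSpam = (($model['spam'][$word] ?? 0) + 1) / ($spamTotal + $vocabSize);
        $pHam  = (($model['ham'][$word] ?? 0) + 1) / ($hamTotal + $vocabSize);
        $logOdds += log($pSpam) - log($pHam);
    }
    // Convert log-odds back to a probability.
    return 1 / (1 + exp(-$logOdds));
}

// Made-up training data for illustration only.
$model = train([
    ['spam', 'buy cheap pills now'],
    ['spam', 'cheap pills cheap offer'],
    ['ham',  'meeting at noon tomorrow'],
    ['ham',  'lunch tomorrow with the team'],
]);
printf("%.2f\n", spamProbability($model, 'cheap pills'));
```

A real filter would need far more training data and better tokenization; the point is only that each message is scored against learned word statistics instead of being compared to every other message.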


I just wrote a function that finds the longest substring of str1 that also occurs in str2:

public static function getLongestMatchingSubstring($str1, $str2)
{
    $len_1 = strlen($str1);
    $longest = '';
    for ($i = 0; $i < $len_1; $i++) {
        // Try the longest candidate starting at $i first, and stop as soon as
        // the remaining candidates can no longer beat the current best.
        for ($j = $len_1 - $i; $j > strlen($longest); $j--) {
            $sub = substr($str1, $i, $j);
            if (strpos($str2, $sub) !== false) {
                $longest = $sub;
                break;
            }
        }
    }
    return $longest;
}

I have since found a relevant Wikipedia article. It is not an NP-complete problem; it can be solved in O(mn) time with a dynamic programming algorithm, where m and n are the lengths of the two strings.
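A minimal sketch of that dynamic programming approach might look like the following. The function name `longestCommonSubstringDp` is made up for this example, and the code works on bytes (`strlen`/`substr`), so multibyte text would need the `mb_*` equivalents.

```php
<?php
// Longest common substring via dynamic programming: O(m*n) time, O(n) memory.
function longestCommonSubstringDp(string $a, string $b): string
{
    $m = strlen($a);
    $n = strlen($b);
    $bestLen = 0; // length of the longest match found so far
    $bestEnd = 0; // offset in $a just past the end of that match

    // $prev[$j] holds the length of the longest common suffix of the first
    // ($i - 1) bytes of $a and the first $j bytes of $b.
    $prev = array_fill(0, $n + 1, 0);

    for ($i = 1; $i <= $m; $i++) {
        $curr = array_fill(0, $n + 1, 0);
        for ($j = 1; $j <= $n; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the match that ended at the previous byte of both strings.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $bestLen) {
                    $bestLen = $curr[$j];
                    $bestEnd = $i;
                }
            }
        }
        $prev = $curr;
    }
    return substr($a, $bestEnd - $bestLen, $bestLen);
}

echo longestCommonSubstringDp("there is a needle in a haystack", "needle"), "\n";
```

Only the previous row of the table is kept, because each cell depends solely on its upper-left neighbour.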

In PHP I found the similar_text function very useful. Here's a code sample that retrieves a series of text emails, loops through them, and finds the ones that are more than 90% similar to each other. Note: something like this is NOT scalable, because the number of comparisons grows quadratically with the number of messages:

<?php
// Gather all messages by a user. $db is assumed to be an open mysqli
// connection (the mysql_* functions were removed in PHP 7).
$stmt = $db->prepare('SELECT * FROM email_messages WHERE `from` = ?');
$stmt->bind_param('s', $someUserID);
$stmt->execute();
$msgs = $stmt->get_result()->fetch_all(MYSQLI_ASSOC);

// Loop over msgs and compare each one to every other, once per pair
$count = count($msgs);
for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
        similar_text($msgs[$i]['msgTxt'], $msgs[$j]['msgTxt'], $similarity_pst);
        if ($similarity_pst > 90) {
            echo "{$msgs[$i]['msgID']} is {$similarity_pst}% similar to {$msgs[$j]['msgID']}\n";
        }
    }
}
?>

Late to this party, but here is a way to find the largest common substring in an array of strings:

Example:

$array = array(
    'PTT757LP4',
    'PTT757A',
    'PCT757B',
    'PCT757LP4EV'
);
echo longest_common_substring($array); // => t757 (the function lowercases its input)

The function:

function longest_common_substring($words) {
    $words = array_map('strtolower', array_map('trim', $words));
    // Sort by length, then alphabetically. A closure is used because
    // create_function() was removed in PHP 8.
    $sort_by_strlen = function ($a, $b) {
        if (strlen($a) == strlen($b)) {
            return strcmp($a, $b);
        }
        return (strlen($a) < strlen($b)) ? -1 : 1;
    };
    usort($words, $sort_by_strlen);
    // Any substring common to all strings must occur in the shortest string
    // (the first one after the sort), so only its substrings are candidates.
    // If the strings have nothing in common, an empty string is returned.
    $longest_common_substring = array();
    $shortest_string = str_split(array_shift($words));

    while (sizeof($shortest_string)) {
        array_unshift($longest_common_substring, '');
        foreach ($shortest_string as $ci => $char) {
            foreach ($words as $wi => $word) {
                // strpos() !== false is used instead of strstr(), whose result
                // is falsy when the matched tail is the string "0".
                if (strpos($word, $longest_common_substring[0] . $char) === false) {
                    // No match
                    break 2;
                } // if
            } // foreach
            // we found the current char in each word, so add it to the first longest_common_substring element,
            // then start checking again using the next char as well
            $longest_common_substring[0] .= $char;
        } // foreach
        // We've finished with this starting position in shortest_string.
        // Remove the first char and start all over. Do this until there are no more
        // chars to search on.
        array_shift($shortest_string);
    }
    // If we made it here then we've run through everything
    usort($longest_common_substring, $sort_by_strlen);
    return array_pop($longest_common_substring);
}

I have written this up a little bit on my blog:


Please have a look at Algorithm implementation/Strings/Longest common substring on Wikibooks. I haven't tested the PHP implementation but it seems to match the general algorithm on the Wikipedia page.

