3

I'm wondering if there is a way to do fuzzy string matching in PHP: looking for a word in a long string and finding a potential match even if it's misspelled, for example if it is off by one character due to an OCR error.

I was thinking a regex generator might be able to do it. So given an input of "crazy" it would generate this regex:

.*((crazy)|(.+razy)|(c.+azy)|(cr.+zy)|(cra.+y)|(craz.+)).*

It would then return all matches for that word or variations of that word.

How to build the generator: I would probably split the search word into an array of characters and build the regex with a foreach over that array, replacing the character at each position in turn with ".+".

Is this a good way to do fuzzy text search, or is there a better way? What about some kind of string comparison that gives me a score based on how close the strings are? In short, I'm trying to see whether some badly converted OCR text contains a given word.

4

3 Answers

6

String distance functions are useless if you don't know what the correct word is. I recommend the pspell functions:

$p = pspell_new("en");
print_r(pspell_suggest($p, "crazzy"));

http://www.php.net/manual/en/function.pspell-suggest.php

answered 2009-11-12T09:08:51.907
3
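    echo generateRegex("crazy");

    // Build a pattern that matches $word exactly, or with any single
    // character replaced by another one (e.g. an OCR misread).
    // Add delimiters before use: '/' . generateRegex($word) . '/'
    function generateRegex($word)
    {
      $len = strlen($word);
      $regex = '\b((' . preg_quote($word, '/') . ')';
      for ($i = 0; $i < $len; $i++)
      {
        // Swap the character at position $i for a wildcard; escape the rest.
        $regex .= '|(' . preg_quote(substr($word, 0, $i), '/') . '.'
                . preg_quote(substr($word, $i + 1), '/') . ')';
      }
      return $regex . ')\b';
    }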
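
A minimal usage sketch of the same one-substitution idea, run against some text. The helper name `buildFuzzyPattern`, the `/i` flag, and the sample OCR string are assumptions made for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: one alternative per position, with that character
// replaced by "." and the literal parts escaped for use inside /.../.
function buildFuzzyPattern(string $word): string
{
    $alternatives = [preg_quote($word, '/')];
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $alternatives[] = preg_quote(substr($word, 0, $i), '/')
            . '.'
            . preg_quote(substr($word, $i + 1), '/');
    }
    // Delimiters and the case-insensitive flag are added here.
    return '/\b(?:' . implode('|', $alternatives) . ')\b/i';
}

// Made-up sample of badly converted OCR text.
$ocrText = 'The crazy fox, the cnazy dog and the crary cat met a lazy cow.';
preg_match_all(buildFuzzyPattern('crazy'), $ocrText, $matches);
print_r($matches[0]); // crazy, cnazy, crary
```

Note that this only catches a single substituted character. A dropped or extra character (for example "craazy") would need additional alternatives or an edit-distance approach.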
answered 2009-11-12T08:43:42.440
1

Levenshtein is one example of a string edit distance. There are different metrics for different purposes. Get to know them and find the one that suits you.
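
As a rough sketch of how an edit distance could be applied to the OCR case, the following uses PHP's built-in `levenshtein()` to score every word in a text against the search word. The helper name `findFuzzyMatches`, the ASCII-only word splitting, and the threshold of 1 are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: return the words in $text that are within
// $maxDistance edits of $needle, mapped to their edit distance.
function findFuzzyMatches(string $text, string $needle, int $maxDistance = 1): array
{
    $hits = [];
    // Split on anything that is not an ASCII letter or digit (an assumption;
    // real OCR output may need a more careful tokenizer).
    $words = preg_split('/[^A-Za-z0-9]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        $distance = levenshtein(strtolower($needle), strtolower($word));
        if ($distance <= $maxDistance) {
            $hits[$word] = $distance;
        }
    }
    return $hits;
}

// Made-up sample of badly converted OCR text.
$ocrText = 'He drove like a crazv man, then a craazy one, then a lazy one.';
print_r(findFuzzyMatches($ocrText, 'crazy')); // crazv => 1, craazy => 1
```

Unlike a one-wildcard regex, this also catches inserted or dropped characters. PHP's `similar_text()`, `soundex()` and `metaphone()` are other built-in comparisons that may suit different purposes.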

answered 2009-11-12T08:18:07.000