php - Remove words of three characters or fewer in PHP

Question

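I am using a PHP class to build a tag cloud from articles. I want to remove words of three characters or fewer, and also remove words that are numbers.

Example tags: 1111 monkey deer cat pig buffalo

Desired result: monkey deer buffalo

The PHP code from that class (the full code is here):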

function keywords_extract($text)
{
    $text = strtolower($text);
    $text = strip_tags($text);

    /* 
     * Handle common words first because they have punctuation and we need to remove them
     * before removing punctuation.
     */
    $commonWords = "'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't," .
        "as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't," .
        "don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his," .
        "how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like," .
        "likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,o'clock,of,off," .
        "often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've," .
        "shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd," .
        "they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what," .
        "what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd," .
        "who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll";

    $commonWords = strtolower($commonWords);
    $commonWords = explode(",", $commonWords);
    foreach($commonWords as $commonWord) 
    {
        $text = $this->str_replace_word($commonWord, "", $text);  
    }

    /* remove punctuation and newlines */
    /*
     * Changed to handle international characters
     */
    if ($this->m_bUTF8)
        $text = preg_replace('/[^\p{L}0-9\s]|\n|\r/u',' ',$text);
    else
        $text = preg_replace('/[^a-zA-Z0-9\s]|\n|\r/',' ',$text);

    /* remove extra spaces created */
    $text = preg_replace('/ +/',' ',$text);
    $text = trim($text);
    $words = explode(" ", $text);
    foreach ($words as $value) 
    {
        $temp = trim($value);
        if (is_numeric($temp))
            continue;
        $keywords[] = trim($temp);
    }
    return $keywords;
}

I tried various approaches, such as if (strlen($words)<3 && is_numeric($words)==true), but none of them worked.

Please help me.

score 1 · Accepted Answer

You need to change && to ||.

From:
if (strlen($words)<3 && is_numeric($words)==true)
to:
if (strlen($words)<3 || is_numeric($words)==true)

Also, if you want to remove words of three characters or fewer, you need to use <= instead of <:
if (strlen($words) <= 3 || is_numeric($words)==true)
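As a rough sketch of how this condition might sit inside the loop from the question: it assumes the check is applied to each individual word rather than to the $words array, and the helper name filter_words is hypothetical (it is not part of the original class).

```php
<?php
// Hypothetical helper for illustration; not part of the original class.
function filter_words(array $words): array
{
    $keywords = [];
    foreach ($words as $value) {
        $temp = trim($value);
        // Skip words of three characters or fewer, and skip numeric words.
        if (strlen($temp) <= 3 || is_numeric($temp)) {
            continue;
        }
        $keywords[] = $temp;
    }
    return $keywords;
}

print_r(filter_words(explode(' ', '1111 monkey deer cat pig buffalo')));
```

Note that strlen counts bytes, not characters. For UTF-8 text with non-ASCII letters, mb_strlen($temp, 'UTF-8') is probably the better fit, assuming the mbstring extension is available.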

score 1 · Accepted Answer

You can do this with a regular expression.

Change:

/* remove extra spaces created */
$text = preg_replace('/ +/',' ',$text);
$text = trim($text);
$words = explode(" ", $text);

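to:

/* remove short words (1 to 3 characters) and digits; PHP's PCRE has no "g" modifier, preg_replace already replaces every match */
$words = preg_replace('/\b\w{1,3}\s|[0-9]/i','',$text);
return $words;

Then delete the foreach section that follows, including its return statement.

Here is an explanation of the regular expression pattern: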

\b = Match a word boundary position (between a word character and a non-word character, or at the beginning/end of the string).
\w = Match any word character (alphanumeric & underscore).
{1,3} = Matches 1 to 3 of the preceding token.
\s = Match any whitespace character (spaces, tabs, line breaks).
| = or.
[0-9] = Match any numeric character.

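And here is a human-readable description of the pattern: "Find any word that is 1 to 3 word characters long from its starting position and is followed by a whitespace character, or any digit, and replace it with an empty string."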
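To see what the pattern does in practice, here is a small sketch run against the sample tags from the question. The final preg_split step is an addition for illustration, since preg_replace returns a string rather than an array of words.

```php
<?php
$text = '1111 monkey deer cat pig buffalo';

// Remove 1 to 3 character words that are followed by whitespace, and every digit.
$cleaned = preg_replace('/\b\w{1,3}\s|[0-9]/i', '', $text);

// Split on runs of whitespace, dropping empty pieces.
$words = preg_split('/\s+/', $cleaned, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);
```

Two caveats follow from the pattern itself: a short word at the very end of the string is not removed, because \s requires a whitespace character after it, and [0-9] also strips digits that appear inside longer words (for example "web2" becomes "web").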

score 1 · Accepted Answer

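I would change the process a little to make it run faster (at least, I think it should).

Step 1: Instead of replacing each common word in $text with an empty string (the replacement process is expensive), store each common word in a hash table so it can be filtered out later.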

$commonWords = explode(",", $commonWords);
foreach($commonWords as $commonWord)
    $hashWord[$commonWord] = $commonWord;
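As a small illustration of the lookup idea in step 1, the sketch below builds the same kind of table with array_flip. This is a swapped-in alternative to the loop, not the answer's own code, and the shortened word list is made up for the example.

```php
<?php
// Shortened, made-up word list for illustration only.
$commonWords = explode(',', "a,an,and,the,when,when");

// array_flip turns the values into keys, so isset() can test membership
// with a hash lookup; duplicate words simply collapse into one key.
$hashWord = array_flip($commonWords);

var_dump(isset($hashWord['the']));    // a common word
var_dump(isset($hashWord['monkey'])); // not a common word
```

Either way, the point is that membership is checked with isset on an array key instead of scanning or rewriting the whole text once per common word.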

Step 2: Filter out common words, numbers, and words shorter than 4 characters in a single pass.

$keywords = array();
$words = preg_split("/[\s\n\r]/", $text);
foreach ($words as $value) 
{
    // Skip if it is a common word
    if (isset($hashWord[$value])) continue;
    // Skip if it is numeric
    if (is_numeric($value)) continue;
    // Skip if the word has fewer than 4 characters
    if (strlen($value) < 4) continue;

    $keywords[] = preg_replace('/[^a-zA-Z0-9\s].+/', '', $value);
}

Below is the complete source code for this function (in case you want to copy and paste it):

function keywords_extract($text) {
    $text = strtolower($text);
    $text = strip_tags($text);

    $commonWords = "'tis,'twas,a,able,about,across,after,ain't,all,almost,also,am,among,an,and,any,are,aren't," .
        "as,at,be,because,been,but,by,can,can't,cannot,could,could've,couldn't,dear,did,didn't,do,does,doesn't," .
        "don't,either,else,ever,every,for,from,get,got,had,has,hasn't,have,he,he'd,he'll,he's,her,hers,him,his," .
        "how,how'd,how'll,how's,however,i,i'd,i'll,i'm,i've,if,in,into,is,isn't,it,it's,its,just,least,let,like," .
        "likely,may,me,might,might've,mightn't,most,must,must've,mustn't,my,neither,no,nor,not,o'clock,of,off," .
        "often,on,only,or,other,our,own,rather,said,say,says,shan't,she,she'd,she'll,she's,should,should've," .
        "shouldn't,since,so,some,than,that,that'll,that's,the,their,them,then,there,there's,these,they,they'd," .
        "they'll,they're,they've,this,tis,to,too,twas,us,wants,was,wasn't,we,we'd,we'll,we're,were,weren't,what," .
        "what'd,what's,when,when,when'd,when'll,when's,where,where'd,where'll,where's,which,while,who,who'd," .
        "who'll,who's,whom,why,why'd,why'll,why's,will,with,won't,would,would've,wouldn't,yet,you,you'd,you'll,";

    $commonWords = explode(",", $commonWords);
    foreach($commonWords as $commonWord)
        $hashWord[$commonWord] = $commonWord;

    $keywords = array();
    $words = preg_split("/[\s\n\r]/", $text);
    foreach ($words as $value) 
    {
        // Skip if it is a common word
        if (isset($hashWord[$value])) continue;
        // Skip if it is numeric
        if (is_numeric($value)) continue;
        // Skip if the word has fewer than 4 characters
        if (strlen($value) < 4) continue;

        $keywords[] = preg_replace('/[^a-zA-Z0-9\s].+/', '', $value);
    }
    return $keywords;
}

Demo: ideone.com/obG6n

score 0 · Accepted Answer

if ((strlen($word) <= 3) || is_numeric($word)) {
     // Don't add it to the list
}

answered 2012-07-08T04:22:25.847

score 0 · Accepted Answer

I have now added

    $text = preg_replace('!\\b\\w{1,3}\\b!', ' ', $text);

before

    $text = preg_replace('/ +/',' ',$text);
    $text = trim($text);
    $words = explode(" ", $text);

No errors :)
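For reference, a minimal standalone sketch of this approach on the sample tags from the question. It mirrors only the tail end of keywords_extract and is not the class's actual code; the variable names are chosen for the example.

```php
<?php
$text = '1111 monkey deer cat pig buffalo';

// Blank out every standalone word of 1 to 3 word characters.
$text = preg_replace('!\\b\\w{1,3}\\b!', ' ', $text);

// Collapse the extra spaces this creates, then split into words.
$text = preg_replace('/ +/', ' ', $text);
$text = trim($text);

$keywords = array();
foreach (explode(' ', $text) as $word) {
    // Numeric words are dropped separately, as in the original loop.
    if (is_numeric($word)) {
        continue;
    }
    $keywords[] = $word;
}

print_r($keywords);
```

Because the pattern ends in \b rather than \s, short words at the end of the string are removed as well. For UTF-8 input the u modifier would likely be needed so that \w matches non-ASCII letters.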

Source

If you want to use this PHP class, you can get the code here.

Thanks, everyone :)
