php - How to remove unreadable characters from content using PHP?

Question

Hi, I am feeding content to zend_lucene_search. It can find words that appear before a special character, but nothing that comes after it.

For example:

    very well to the other job boards � one of the main things that has impressed is the variety of the applications, especially with regards to the background of the candidates" manoj � Head

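If I search for "boards" I get a result, but if I search for any word or string that comes after the unreadable character, nothing is found.

How can I remove these characters and get plain text?

The characters appear when I convert .docx/.pdf files to text.

Also, please tell me how to feed only text to zend_search_lucene.

Please help.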

score 2 · Accepted Answer

You can remove all non-ASCII (so-called special) characters from a string with the following preg_replace call:

    $replaced = preg_replace('/[^\x00-\x7F]+/', '', $str);
    // $replaced now contains:
    //   very well to the other job boards  one of the main things that has impressed
    //   is the variety of the applications, especially with regards to the background of the
    //   candidates" manoj  Head
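
The call above leaves a double space wherever a character was removed. A minimal sketch that wraps the same pattern and tidies the leftover spaces might look like this; the helper name `strip_non_ascii` and the space-collapsing step are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: the name and the whitespace cleanup are illustrative only.
function strip_non_ascii(string $str): string
{
    // Remove every byte outside the 7-bit ASCII range (same pattern as above).
    $ascii = (string) preg_replace('/[^\x00-\x7F]+/', '', $str);

    // Collapse runs of spaces left behind where characters were removed.
    $collapsed = (string) preg_replace('/ {2,}/', ' ', $ascii);

    return trim($collapsed);
}

// "\xEF\xBF\xBD" is the UTF-8 byte sequence of the replacement character.
$text = "job boards \xEF\xBF\xBD one of the main things";
echo strip_non_ascii($text), PHP_EOL;
```

Note that this also strips legitimate non-ASCII text such as accented letters, so it only suits content that is expected to be plain English.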

score 1 · Accepted Answer

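You may need to convert the character set of the string you are processing so that it matches the character set of the current HTML document.

For example, if the HTML document uses UTF-8, you can run the string through utf8_encode(). If that does not help and you do not know which character set the string is in, try mb_convert_encoding() with some of the more common character sets.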
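
As a rough sketch of the conversion approach: utf8_encode() only converts from ISO-8859-1 and is deprecated as of PHP 8.2, so the example below uses mb_check_encoding() and mb_convert_encoding() instead. It assumes the mbstring extension is available; the helper name `to_utf8` and the Windows-1252 fallback are assumptions chosen for illustration, and the real source encoding of the converted files may differ.

```php
<?php
// Hypothetical helper: the name and the default fallback encoding are assumptions.
function to_utf8(string $str, string $fallback = 'Windows-1252'): string
{
    // Leave the string alone if its bytes are already valid UTF-8.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    // Otherwise treat the bytes as the fallback encoding and convert to UTF-8.
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

// "\xE9" is "é" in Windows-1252 and is not valid UTF-8 on its own.
$converted = to_utf8("caf\xE9");
echo $converted, PHP_EOL;
```

If the converted text still looks wrong, try another candidate for `$fallback`, for example 'ISO-8859-1'.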
