detection - How to detect the language of a text

Question

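Is there a good open source engine for detecting what language a text is in, perhaps with a probability metric? One that I can run locally and that does not query Google or Bing? I would like to detect the language of each page in about 15 million pages of OCR'ed text.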

Not all documents will contain languages that use the Latin alphabet.

score 8 · Accepted Answer

Depending on what you are doing, you might want to check out the Python Natural Language Toolkit (NLTK), which has some support for Bayesian learning algorithms.

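In general, letter and word frequencies would probably be the fastest evaluation, but NLTK (or a Bayesian learning algorithm in general) will probably be useful if you need to do anything beyond identifying the language. Bayesian methods will probably also be useful if you discover that the first two methods (letter and word frequencies) have too high an error rate.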
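To make the Bayesian idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier over character trigrams, written with the standard library only. It is a stand-in for illustration, not NLTK's API; the class name `NaiveBayesLanguageGuesser`, the helper `char_ngrams`, and the tiny training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageGuesser:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.vocab = set()

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts[language].update(grams)
        self.vocab.update(grams)

    def log_scores(self, text):
        """Return {language: log-likelihood}, assuming a uniform prior."""
        grams = char_ngrams(text, self.n)
        v = len(self.vocab) + 1  # one extra bucket for unseen n-grams
        scores = {}
        for language, counter in self.counts.items():
            total = sum(counter.values())
            scores[language] = sum(
                math.log((counter[g] + 1) / (total + v)) for g in grams
            )
        return scores

    def guess(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data (hypothetical; real use needs far larger corpora).
guesser = NaiveBayesLanguageGuesser()
guesser.train("en", "the quick brown fox jumps over the lazy dog and this is a sample text")
guesser.train("de", "der schnelle braune fuchs springt über den faulen hund und dies ist ein beispieltext")
print(guesser.guess("this is the text"))
print(guesser.guess("dies ist der hund"))
```

The log-likelihoods from `log_scores` can also be compared against each other to get a rough confidence measure, which is closer to the "probability metric" the question asks for.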

score 5 · Accepted Answer

If you have statistics on letter frequencies, digraph frequencies, and so on for your target languages, you could certainly write your own.

Then release it as open source. And voilà, you have an open source engine for detecting the language of text!
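As a rough sketch of that do-it-yourself approach, the following builds a frequency profile of single letters and digraphs (adjacent letter pairs) per language and ranks languages by cosine similarity to the sample. The function names, the choice of cosine similarity, and the two short reference sentences are assumptions for illustration only; usable profiles would need to be built from large corpora.

```python
import math
from collections import Counter

# Toy reference texts (hypothetical; real profiles need large corpora per language).
REFERENCE_TEXTS = {
    "en": "the quick brown fox jumps over the lazy dog while the children read their books",
    "fr": "le renard brun saute par dessus le chien paresseux pendant que les enfants lisent",
}


def profile(text):
    """Relative frequencies of letters and of digraphs within each word."""
    counts = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        counts.update(letters)
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


PROFILES = {lang: profile(text) for lang, text in REFERENCE_TEXTS.items()}


def rank_languages(text, profiles=PROFILES):
    """Return (language, similarity) pairs, best match first."""
    sample = profile(text)
    scored = [(lang, cosine(sample, ref)) for lang, ref in profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for lang, similarity in rank_languages("the children read over the dog"):
    print(lang, round(similarity, 3))
```

Because `str.isalpha` is Unicode-aware, the same profile code also works for non-Latin scripts, although languages written without spaces are treated as one long word.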

score 4 · Accepted Answer

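For future reference, the engine I ended up using is libtextcat, which is under the BSD license but seems not to have been maintained since 2003. Still, it does a good job and integrates easily into my toolchain.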

score 3 · Accepted Answer

Try CLD2:

Install

export CPPFLAGS="-std=c++98"  # https://github.com/CLD2Owners/cld2/issues/47
pip install cld2-cffi --user

Run

import cld2

res = cld2.detect("This is a sample text.")
print(res)
res = cld2.detect("Dies ist ein Beispieltext.")
print(res)
res = cld2.detect("Je ne peut pas parler cette language.")
print(res)
res = cld2.detect(" هذه هي بعض النصوص العربية")
print(res)
res = cld2.detect("这是一些阿拉伯文字")  # Chinese?
print(res)
res = cld2.detect("これは、いくつかのアラビア語のテキストです")
print(res)
print("Supports {} languages.".format(len(cld2.LANGUAGES)))

Gives

Detections(is_reliable=True, bytes_found=23, details=(Detection(language_name=u'ENGLISH', language_code=u'en', percent=95, score=1675.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=27, details=(Detection(language_name=u'GERMAN', language_code=u'de', percent=96, score=1496.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=38, details=(Detection(language_name=u'FRENCH', language_code=u'fr', percent=97, score=1134.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=48, details=(Detection(language_name=u'ARABIC', language_code=u'ar', percent=97, score=1263.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=False, bytes_found=29, details=(Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Detections(is_reliable=True, bytes_found=63, details=(Detection(language_name=u'Japanese', language_code=u'ja', percent=98, score=3848.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0), Detection(language_name=u'Unknown', language_code=u'un', percent=0, score=0.0)))
Supports 282 languages.
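For a large batch of OCR pages, it may help to reduce each result to a single language code and to discard unreliable detections such as the fifth one above. The sketch below relies only on the fields visible in the printed output (`is_reliable`, `details`, `language_code`, `percent`); the helper name `pick_language` and the `min_percent` threshold are assumptions, and the namedtuples are stand-ins so the example runs without cld2 installed.

```python
from collections import namedtuple


def pick_language(result, min_percent=90):
    """Return the top language code, or None when the result looks unreliable.

    `result` is any object shaped like the Detections values printed above:
    an `is_reliable` flag and a `details` sequence whose items carry
    `language_code` and `percent`.
    """
    if not result.is_reliable or not result.details:
        return None
    top = result.details[0]
    if top.language_code == "un" or top.percent < min_percent:
        return None
    return top.language_code


# Stand-in objects mirroring the printed output (not the real cld2 classes).
Detection = namedtuple("Detection", "language_name language_code percent score")
Detections = namedtuple("Detections", "is_reliable bytes_found details")

english = Detections(True, 23, (Detection("ENGLISH", "en", 95, 1675.0),))
unknown = Detections(False, 29, (Detection("Unknown", "un", 0, 0.0),))
print(pick_language(english))  # en
print(pick_language(unknown))  # None
```

With the real library, the same function could presumably be applied directly to the value returned by `cld2.detect(page_text)`, and pages that come back as `None` set aside for a second pass.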

Other

https://detectlanguage.com/ - a service built on CLD2

score 2 · Accepted Answer

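I don't think you need anything very sophisticated. For example, to detect whether a document is in English with a fairly high level of certainty, simply test whether it contains the N most common English words, something like: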

"the a an is to are in on in it"

If it contains all of those, I would say it is almost definitely English.
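A minimal sketch of this common-word test, extended to a few languages, might look like the following. The English list is the one quoted above; the German and French lists, the function names, and the `min_hits` threshold are assumptions added for illustration. Instead of requiring every word to be present, this version counts how many distinct common words appear, which is more forgiving on short pages.

```python
import re

STOPWORDS = {
    # English list taken from the answer above.
    "en": {"the", "a", "an", "is", "to", "are", "in", "on", "it"},
    # Assumed lists for illustration; tune them for real data.
    "de": {"der", "die", "das", "und", "ist", "ein", "nicht", "zu", "mit", "es"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "de", "pas", "que"},
}


def stopword_hits(text, stopwords=STOPWORDS):
    """Count how many distinct common words of each language occur in the text."""
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    return {lang: len(tokens & words) for lang, words in stopwords.items()}


def guess_by_stopwords(text, min_hits=2):
    """Return the language with the most hits, or None below the threshold."""
    hits = stopword_hits(text)
    lang, count = max(hits.items(), key=lambda item: item[1])
    return lang if count >= min_hits else None


print(guess_by_stopwords("This is a sample text."))
print(guess_by_stopwords("Dies ist ein Beispieltext."))
print(guess_by_stopwords("これは、いくつかのアラビア語のテキストです"))
```

Text in a script with no matching word list, such as the Japanese sample, simply yields `None`, so this test works best as a cheap first filter before a heavier detector.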

score 1 · Accepted Answer

Alternatively, you could try the WhatLanguage gem for Ruby. It is nice and simple, and I have used it for Twitter data analysis. For a quick demo, see http://www.youtube.com/watch?v=lNqZ2cqOReo&list=UUJ_3fstMOH-g4yBxtvgAWkw&index=0&feature=plcp.

score 1 · Accepted Answer

Check out Franc on GitHub. It is written in JavaScript, so it can be used in the browser as well as in Node.

Franc supports more languages than any other library, or Google.

Franc is easily forked to support 335 languages.

Franc is just as fast as the competition.
