c++ - Language detection

Question

I am using tesseract for OCR, mainly on invoices. However, tesseract requires the language to be specified before it starts processing a file.

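My idea is to run OCR with a predefined default language first, then use the resulting text to determine which language the document is actually written in. If it is not the default language, I would process the file again with the detected language to get better results from tesseract.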
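A minimal sketch of that two-pass flow, to make the control flow concrete. The names `OcrFn`, `DetectFn`, and `ocrWithLanguageRetry` are hypothetical and not part of tesseract; the OCR call and the detector are passed in as callbacks, so the sketch assumes nothing about either API beyond "image plus language code in, text out".

```cpp
#include <functional>
#include <string>

// Hypothetical callback types (not tesseract API):
// OcrFn runs OCR on an image path using the given language code,
// DetectFn guesses the language code of a piece of text ("" if unsure).
using OcrFn = std::function<std::string(const std::string& imagePath,
                                        const std::string& lang)>;
using DetectFn = std::function<std::string(const std::string& text)>;

// First pass with the default language; second pass only when the
// detector reports a different, non-empty language code.
std::string ocrWithLanguageRetry(const std::string& imagePath,
                                 const std::string& defaultLang,
                                 const OcrFn& runOcr,
                                 const DetectFn& detectLang) {
    std::string text = runOcr(imagePath, defaultLang);
    const std::string detected = detectLang(text);
    if (!detected.empty() && detected != defaultLang) {
        text = runOcr(imagePath, detected);  // re-run with the detected language
    }
    return text;
}
```

In a real program `runOcr` would wrap the tesseract call, and the codes returned by the detector would have to match the language codes tesseract expects (three-letter codes such as `eng` or `deu`).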

But how should I implement a language detection algorithm? Is there a C++ library I can use?

score 3 · Accepted Answer

The white paper "Natural Language Identification for OCR Applications" describes techniques for an identification task similar to your requirements.

score 3 · Accepted Answer

I am not sure whether this will help, since the library is written in Java. I found it really cool, though: it can detect around 50 languages from a given text with fairly good precision. You may want to have a look at it. It is open source, so if your application has to be written in C++ only, you could port the code to C++ and contribute the port back to the open source community.

Here is the link:

http://code.google.com/p/language-detection/

Note: It uses the Apache Nutch and Tika libraries for analysis.
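If porting the whole library is too much, a far simpler heuristic may already be enough to choose between a handful of invoice languages. The sketch below is not the linked library's algorithm; it swaps in plain stopword counting. The function name `guessLanguage` and the tiny word lists are illustrative assumptions, and OCR noise in the first pass will reduce the hit counts, so a real word list should be much larger.

```cpp
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Guess a language code by counting tokens that are known stopwords.
// The word lists are tiny, illustrative assumptions.
// Returns "" when no stopword of any language is found.
std::string guessLanguage(const std::string& text) {
    static const std::map<std::string, std::set<std::string>> stopwords = {
        {"eng", {"the", "and", "of", "to", "is", "for", "with"}},
        {"deu", {"der", "die", "das", "und", "ist", "mit", "nicht"}},
        {"fra", {"le", "la", "les", "et", "est", "pour", "avec"}},
    };

    // Lowercase ASCII letters and replace ASCII punctuation with spaces;
    // non-ASCII bytes are copied through unchanged.
    std::string cleaned;
    for (unsigned char c : text) {
        if (c < 128 && std::ispunct(c)) {
            cleaned += ' ';
        } else if (c < 128) {
            cleaned += static_cast<char>(std::tolower(c));
        } else {
            cleaned += static_cast<char>(c);
        }
    }

    // Count stopword hits per language.
    std::map<std::string, int> hits;
    std::istringstream in(cleaned);
    std::string word;
    while (in >> word) {
        for (const auto& entry : stopwords) {
            if (entry.second.count(word) > 0) {
                ++hits[entry.first];
            }
        }
    }

    // Pick the language with the most hits (ties go to the first code
    // in alphabetical order).
    std::string best;
    int bestHits = 0;
    for (const auto& entry : hits) {
        if (entry.second > bestHits) {
            best = entry.first;
            bestHits = entry.second;
        }
    }
    return best;
}
```

The returned codes use the same three-letter form as tesseract's language codes, so the result can be passed straight into a second OCR run. An empty result means "keep the default language".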
