c - How can I detect the alphabet used in UTF-8 text?

Question

score 2 · Accepted Answer

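I recommend looking at the Unicode "Script" property. The latest database is available here.

For a quick-and-dirty implementation, I would scan every character in the target text and look up its script name, then pick the script that accounts for the most characters.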
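A minimal sketch of that scan in C might look like the following. The range table is an assumption made for illustration: it is a small, block-level approximation covering a handful of scripts, whereas a real implementation would generate the table from the Unicode Script data. The names `detect_script`, `utf8_decode`, and `RANGES` are hypothetical, and the decoder is deliberately simple (it does not reject overlong encodings or surrogates).

```c
#include <stddef.h>
#include <stdint.h>

enum script {
    SCRIPT_UNKNOWN, SCRIPT_LATIN, SCRIPT_GREEK, SCRIPT_CYRILLIC,
    SCRIPT_HIRAGANA, SCRIPT_KATAKANA, SCRIPT_HAN, SCRIPT_HANGUL, NSCRIPTS
};

static const char *const SCRIPT_NAMES[NSCRIPTS] = {
    "Unknown", "Latin", "Greek", "Cyrillic",
    "Hiragana", "Katakana", "Han", "Hangul"
};

/* Illustrative approximation only; not the full Unicode Script data. */
static const struct { uint32_t lo, hi; enum script id; } RANGES[] = {
    {0x0041, 0x005A, SCRIPT_LATIN},   {0x0061, 0x007A, SCRIPT_LATIN},
    {0x00C0, 0x00D6, SCRIPT_LATIN},   {0x00D8, 0x00F6, SCRIPT_LATIN},
    {0x00F8, 0x024F, SCRIPT_LATIN},
    {0x0370, 0x03FF, SCRIPT_GREEK},
    {0x0400, 0x04FF, SCRIPT_CYRILLIC},
    {0x3041, 0x3096, SCRIPT_HIRAGANA},
    {0x30A1, 0x30FA, SCRIPT_KATAKANA},
    {0x4E00, 0x9FFF, SCRIPT_HAN},
    {0xAC00, 0xD7A3, SCRIPT_HANGUL},
};

/* Decode one code point from a NUL-terminated UTF-8 string.
   Returns the number of bytes consumed (always >= 1).
   A malformed byte yields U+FFFD and consumes one byte. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t need;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 3; }
    else                            { *cp = 0xFFFD; return 1; }
    for (size_t i = 1; i <= need; i++) {
        /* A NUL or any non-continuation byte stops the read here. */
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need + 1;
}

/* Count code points per script and return the most frequent one.
   Ties go to the script listed first in the enum. */
enum script detect_script(const char *text)
{
    size_t counts[NSCRIPTS] = {0};
    const unsigned char *p = (const unsigned char *)text;

    while (*p) {
        uint32_t cp;
        p += utf8_decode(p, &cp);
        for (size_t i = 0; i < sizeof RANGES / sizeof RANGES[0]; i++) {
            if (cp >= RANGES[i].lo && cp <= RANGES[i].hi) {
                counts[RANGES[i].id]++;
                break;
            }
        }
    }

    enum script best = SCRIPT_UNKNOWN;
    size_t best_count = 0;
    for (int s = SCRIPT_LATIN; s < NSCRIPTS; s++) {
        if (counts[s] > best_count) {
            best_count = counts[s];
            best = (enum script)s;
        }
    }
    return best;
}
```

With this table, `SCRIPT_NAMES[detect_script("Hello, world")]` gives "Latin", and text containing only digits and punctuation falls through to "Unknown", because those characters belong to no range in the table.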

score 1 · Accepted Answer

Use an N-gram model and give it a sufficiently large set of training data. A full example of this technique can be found on this page, among others:

http://phpir.com/language-detection-with-n-grams/

The article assumes a PHP implementation and takes "language" to mean something like English or Italian, but the same approach can be implemented in C. For training, simply substitute your notion of "alphabet" for "language". For example, pool all of your "Latin alphabet" strings and consider their n-grams for n=2:

Bonjour: "Bo", "on", "nj", "jo", "ou", "ur"

Hello: "He", "el", "ll", "lo"

With enough training data, you will discover dominant combinations that are likely in all Latin text; perhaps "Bo" and "el" turn out to be quite probable for text written in the "Latin alphabet". Likewise, these combinations are probably quite rare in text written in the "Hiragana alphabet". Similar discoveries will be made for any other alphabet classification for which you can provide sufficient training data.

This technique is closely related to Markov chains (an N-gram model is a Markov chain of order n-1) and, more loosely, to hidden Markov models; searching for these keywords will give more ideas for implementation. For "quick and dirty" I would use n=2 and gather just enough training data that the least common letter of each alphabet is encountered at least once, e.g. at least one 'z' and at least one 'ぅ' (small hiragana u).
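The bigram idea can be sketched in C roughly as follows. This is an illustration under stated assumptions, not the linked article's implementation: the names `struct model`, `model_train`, and `model_score` are hypothetical, bigrams are stored in a fixed-size array searched linearly, and the score is simply the average relative frequency of the input's bigrams in the training data, a crude stand-in for a properly smoothed probability.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_BIGRAMS 1024  /* hypothetical capacity for this sketch */

struct bigram { uint32_t a, b; unsigned count; };

struct model {            /* must be zero-initialized before training */
    struct bigram seen[MAX_BIGRAMS];
    size_t distinct;      /* number of distinct bigrams stored */
    unsigned total;       /* number of bigrams counted overall */
};

/* Minimal UTF-8 decoder: returns bytes consumed, U+FFFD on malformed input. */
static size_t next_cp(const unsigned char *s, uint32_t *cp)
{
    size_t need;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; need = 3; }
    else                            { *cp = 0xFFFD; return 1; }
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need + 1;
}

/* Index of bigram (a, b) in the model, or -1 if it was never seen. */
static long find(const struct model *m, uint32_t a, uint32_t b)
{
    for (size_t i = 0; i < m->distinct; i++)
        if (m->seen[i].a == a && m->seen[i].b == b)
            return (long)i;
    return -1;
}

/* Add every adjacent pair of code points in text to the model. */
void model_train(struct model *m, const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    uint32_t prev, cur;
    if (!*p) return;
    p += next_cp(p, &prev);
    while (*p) {
        p += next_cp(p, &cur);
        long i = find(m, prev, cur);
        if (i < 0 && m->distinct < MAX_BIGRAMS) {
            i = (long)m->distinct++;
            m->seen[i].a = prev;
            m->seen[i].b = cur;
            m->seen[i].count = 0;
        }
        if (i >= 0) {        /* bigrams beyond capacity are ignored */
            m->seen[i].count++;
            m->total++;
        }
        prev = cur;
    }
}

/* Average relative frequency of the input's bigrams under the model.
   Returns 0.0 for an untrained model or input shorter than two code points. */
double model_score(const struct model *m, const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    uint32_t prev, cur;
    double sum = 0.0;
    size_t n = 0;
    if (!*p || m->total == 0) return 0.0;
    p += next_cp(p, &prev);
    while (*p) {
        p += next_cp(p, &cur);
        long i = find(m, prev, cur);
        if (i >= 0) sum += (double)m->seen[i].count / m->total;
        n++;
        prev = cur;
    }
    return n ? sum / n : 0.0;
}
```

To classify an input, train one model per alphabet and pick the model that gives the highest score. A fuller version would likely use smoothed log-probabilities and a hash table instead of the linear search.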

EDIT:

For a simpler solution than N-grams, use only basic statistical tests (min, max and average) to compare your Input (a string given by the user) with an Alphabet (a string of all characters in one of the alphabets you are interested in).

Step 1. Place all the numerical values of the Alphabet (e.g. the Unicode code points decoded from the UTF-8 bytes) in an array. For example, if the Alphabet to be tested against is "Basic Latin", make an array DEF := {32, 33, 34, ..., 122}.

Step 2. Place all the numerical values of the Input into an array, for example, make an array INP := {73, 102, 32, ...}.

Step 3. Calculate a score for the input based on INP and DEF. If INP really comes from the same alphabet as DEF, then I would expect the following statements to be true:

min(INP) >= min(DEF)
max(INP) <= max(DEF)
|avg(INP) - avg(DEF)| < EPS, where EPS is a suitable constant

If all statements are true, the score should be close to 1.0. If all are false, the score should be close to 0.0. Once this "Score" routine is defined, all that's left is to run it against each alphabet you are interested in and choose the one which gives the highest score for a given Input.
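One possible reading of the "Score" routine, sketched in C: each of the three tests contributes one third of the score. The equal weighting, the function name `alphabet_score`, and passing EPS as a parameter are assumptions made for illustration; the inputs are arrays of already-decoded code points.

```c
#include <stddef.h>
#include <stdint.h>

struct stats { uint32_t min, max; double avg; };

/* Minimum, maximum and mean of a non-empty array of code points. */
static struct stats stats_of(const uint32_t *v, size_t n)
{
    struct stats s = { v[0], v[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < s.min) s.min = v[i];
        if (v[i] > s.max) s.max = v[i];
        sum += v[i];
    }
    s.avg = sum / (double)n;
    return s;
}

/* Fraction of the three tests (min, max, average) that INP passes
   against DEF: 1.0 if all hold, 0.0 if none do. */
double alphabet_score(const uint32_t *inp, size_t n_inp,
                      const uint32_t *def, size_t n_def, double eps)
{
    if (n_inp == 0 || n_def == 0)
        return 0.0;

    struct stats a = stats_of(inp, n_inp);
    struct stats d = stats_of(def, n_def);

    double diff = a.avg - d.avg;
    if (diff < 0) diff = -diff;

    int passed = 0;
    passed += (a.min >= d.min);   /* min(INP) >= min(DEF) */
    passed += (a.max <= d.max);   /* max(INP) <= max(DEF) */
    passed += (diff < eps);       /* |avg(INP) - avg(DEF)| < EPS */
    return passed / 3.0;
}
```

For the DEF array from Step 1 (32 through 122) and the input {73, 102, 32} with an EPS of 20, all three tests pass and the score is 1.0. Note that the min test alone passes for any input whose code points lie above the alphabet's range, so the three tests are only meaningful in combination.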
