java - Detecting word boundaries in text

Question

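I am having a problem with word boundary identification. I have removed all the markup from a Wikipedia document, and now I want to get a list of entities (meaningful terms). I am planning to take the bigrams and trigrams of the document and check whether each one exists in a dictionary (WordNet). Is there a better way to achieve this?

Below is the sample text. I want to identify the entities (shown here surrounded by double quotes):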

Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"

score 1 · Accepted Answer

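I think what you are describing is still a subject of burgeoning research rather than a simple matter of applying well-established algorithms.

I can't give you a simple "do this" answer, but here are some pointers off the top of my head: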

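- I think using WordNet could work (though I'm not sure where bigrams/trigrams come into it), but you should view WordNet lookups as one part of a hybrid system, not the be-all and end-all of spotting named entities.
- Then, start by applying some simple, common-sense criteria: sequences of capitalised words (try to accommodate frequent lower-case function words such as "of" within them), and sequences consisting of a "known title" plus capitalised word(s).
- Look for sequences of words that, statistically, you would not expect to appear next to one another by chance, and treat them as candidate entities.
- Can you build in dynamic web lookup? (Your system spots the capitalised sequence "IBM" and checks whether it can find, for example, a Wikipedia entry with the text pattern "IBM is ... [organisation|company|...]".)
- See if anything here, and in the "information extraction" literature in general, gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html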
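
The capitalised-sequence criterion from the pointers above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `CapitalisedSequences`, the method `candidates`, and the tiny function-word list are all made up for this example, and a real system would need a much fuller list plus handling for sentence-initial words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds runs of two or more capitalised words,
// allowing a few lower-case function words between them.
final class CapitalisedSequences {
    // One capitalised word, e.g. "Star" or "IBM".
    private static final String CAP = "\\p{Lu}[\\p{L}\\p{N}]*";
    // Assumed, deliberately tiny list of function words allowed inside a name.
    private static final String FUNC = "(?:of|the|and|for)";
    // A capitalised word followed by one or more groups of
    // (optional function words + another capitalised word).
    private static final Pattern CANDIDATE = Pattern.compile(
            "\\b" + CAP + "(?:\\s+(?:" + FUNC + "\\s+)*" + CAP + ")+");

    static List<String> candidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {          // find() moves on to the next match each time
            found.add(m.group());   // the whole matched sequence
        }
        return found;
    }
}
```

On the sample text from the question this picks out "Star Trek" and "United Federation of Planets". Single capitalised words such as "Vulcan" are deliberately skipped, so they would need a separate rule.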

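The truth is that when you look at the literature that is out there, it doesn't seem that people are using terribly sophisticated, well-established algorithms. So I think there is a lot of room for looking at your data, exploring, and seeing what you can come up with... Good luck!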
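
As one way of exploring the data along the statistical line mentioned above, adjacent word pairs can be scored with pointwise mutual information (PMI), which is high when two words occur together more often than their individual frequencies would predict. This is a swapped-in, textbook measure, not something the answer prescribes. The class `BigramAssociation`, the method `pmi`, and the `minCount` threshold are names assumed for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: PMI (in bits) for each adjacent word pair.
// Assumes tokens are non-empty strings that contain no spaces.
final class BigramAssociation {
    static Map<String, Double> pmi(String[] tokens, int minCount) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            unigrams.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                bigrams.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
        double totalWords = tokens.length;
        double totalPairs = tokens.length - 1;
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : bigrams.entrySet()) {
            if (e.getValue() < minCount) {
                continue; // rare pairs give unreliable estimates
            }
            String[] pair = e.getKey().split(" ");
            double pPair = e.getValue() / totalPairs;
            double pFirst = unigrams.get(pair[0]) / totalWords;
            double pSecond = unigrams.get(pair[1]) / totalWords;
            // log2( P(pair) / (P(first) * P(second)) )
            scores.put(e.getKey(), Math.log(pPair / (pFirst * pSecond)) / Math.log(2));
        }
        return scores;
    }
}
```

PMI is known to overrate very rare pairs, which is why the sketch takes a minimum count. High-scoring pairs are only candidates and would still need filtering, for example with the capitalisation or WordNet checks.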

score 0 · Accepted Answer

If I understand you correctly, you are trying to extract the substrings delimited by double-quotation marks ("). You could use capture groups in a regular expression:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
        " universe who evolved on the planet Vulcan and are noted for their " +
        "attempt to live by reason and logic with no interference from emotion" +
        " They were the first extraterrestrial species officially to make first" +
        " contact with Humans and later became one of the founding members of the" +
        " \"United Federation of Planets\"";
    String[] entities = new String[10];                 // An array to hold matched substrings (at most 10 in this example)
    Pattern pattern = Pattern.compile("\"(.*?)\"");     // Lazily capture everything between a pair of " characters
    Matcher matcher = pattern.matcher(text);            // The matcher that runs the regex over our text
    int count       = 0;                                // An index for the array of matches
    while (count < entities.length && matcher.find()) { // find() resumes after the previous match and returns false when none is left
        entities[count++] = matcher.group(1);           // Add the match to the array, without its " marks
    }

Or, write a "parser" using String functions:

    int startFrom = text.indexOf('"');                                        // Index of the first " character (-1 if there is none)
    int nextQuote = text.indexOf('"', startFrom + 1);                         // Index of the " character that closes it
    int count = 0;                                                            // An index for the array of matches
    while (startFrom > -1 && nextQuote > -1 && count < entities.length) {     // Stop when no complete pair of " characters is left or the array is full
        entities[count++] = text.substring(startFrom + 1, nextQuote);         // Retrieve the substring and add it to the array
        startFrom = text.indexOf('"', nextQuote + 1);                         // Find the next opening " character after nextQuote
        nextQuote = (startFrom > -1) ? text.indexOf('"', startFrom + 1) : -1; // Find its closing " character, if there was an opening one
    }

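In both cases the sample text is hard-coded for the sake of the example, and the same variables are presumed to exist (the String variable named text, and the entities array declared in the first snippet).

If you want to test the contents of the entities array: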

    int i = 0;
    while (i < count) {
        System.out.println(entities[i]);
        i++;
    }

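I have to warn you: there may be issues with boundary cases (i.e. when a " character is at the very beginning or end of the String). These examples also assume that the " characters are balanced; if the text contains an odd number of them, one quote has no partner and the results will not be what you expect. You could use a simple parity check beforehand: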

    static int countQuoteChars(String text) {
        int nextQuote = text.indexOf('"');              // Find the first " character
        int count = 0;                                  // A counter for " characters found
        while (nextQuote != -1) {                       // While there is another " character ahead
            count++;                                    // Increase the count by 1
            nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
        }
        return count;                                   // Return the result
    }

    static boolean quoteCharacterParity(int numQuotes) {
        return numQuotes % 2 == 0; // true for an even number of " characters, false for an odd number
    }

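Note that if numQuotes happens to be 0, this method returns true (0 modulo any number is 0, so (numQuotes % 2 == 0) is true), even though you would not want to go ahead with the parsing when there are no " characters at all, so you need to check for this condition somewhere.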
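
One way to fold the zero check into the parity check is sketched below. The class name `QuoteCheck` and the method `hasPairedQuotes` are hypothetical names chosen for this example, and the counting method is repeated only so that the sketch is self-contained.

```java
// Hypothetical helper combining the "at least one quote" and "even count" checks.
final class QuoteCheck {
    // Counts the " characters in the text.
    static int countQuoteChars(String text) {
        int count = 0;
        for (int i = text.indexOf('"'); i != -1; i = text.indexOf('"', i + 1)) {
            count++;
        }
        return count;
    }

    // True only when the text contains at least one " character
    // and every " character can be paired with another.
    static boolean hasPairedQuotes(String text) {
        int numQuotes = countQuoteChars(text);
        return numQuotes > 0 && numQuotes % 2 == 0;
    }
}
```

Calling `QuoteCheck.hasPairedQuotes(text)` before either extraction loop would let you skip texts with no quotes or with an unmatched quote.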

Hope this helps!

score 0 · Accepted Answer

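Someone else asked a similar question about how to find "interesting" words in a corpus of text. You should read the answers. In particular, Bolo's answer points to an interesting article that uses the density of a word's occurrences to decide how important it is, based on the observation that when a text talks about something, it usually refers to that something fairly often. The article is interesting because the technique requires no prior knowledge of the text being processed (for instance, you do not need a dictionary targeted at a specific vocabulary).

The article suggests two algorithms.

The first algorithm rates single words (such as "Federation" or "Trek") according to their measured importance. It is straightforward to implement, and I could even provide a (not very elegant) implementation in Python.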
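
To give a feel for what rating words by how their occurrences are distributed might look like, here is a rough sketch. It is not the article's algorithm, whose details are not given here. It simply assumes that a word whose occurrences bunch together is more "about something" than a word spread evenly, and it measures this as the standard deviation of the gaps between successive occurrences divided by the mean gap. The class `OccurrenceClustering`, the method `scores`, and the `minCount` parameter are all assumptions of this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: scores each word by how unevenly its occurrences are spaced.
final class OccurrenceClustering {
    static Map<String, Double> scores(String text, int minCount) {
        // Crude tokenisation: lower-case, split on anything that is not a letter or digit.
        String[] tokens = text.toLowerCase().split("[^\\p{L}\\p{N}]+");
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].isEmpty()) {
                continue;
            }
            positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : positions.entrySet()) {
            List<Integer> pos = e.getValue();
            if (pos.size() < minCount || pos.size() < 3) {
                continue; // at least two gaps are needed for a spread
            }
            int gaps = pos.size() - 1;
            double mean = (pos.get(gaps) - pos.get(0)) / (double) gaps;
            double sumSq = 0;
            for (int i = 1; i < pos.size(); i++) {
                double d = (pos.get(i) - pos.get(i - 1)) - mean;
                sumSq += d * d;
            }
            // 0 for perfectly even spacing; larger values mean more clustering.
            result.put(e.getKey(), Math.sqrt(sumSq / gaps) / mean);
        }
        return result;
    }
}
```

Scores like this are only meaningful on long texts where each word occurs many times, so on a single paragraph such as the sample above it would say very little.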

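The second algorithm is more interesting because it extracts noun phrases (such as "Star Trek") by ignoring whitespace completely and using a tree structure to decide how to split them. The results this algorithm gives when applied to Darwin's seminal text on evolution are very impressive. However, I admit that implementing it would take a bit more thought, because the description in the article is fairly elusive and the authors seem a little difficult to track down. That said, I did not spend much time on it, so you may have better luck.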
