java - 文中の 2 つの単語の共起をカウントする (java)

Question

この問題を解決する助けが必要です。

文字列の 2 つのバラバラなリストがあります。リストのlist A = {a1, ..., an}要素list B = {b1, ..., bn} は、「人工」や「インテリジェンス」などの単純な単語にすることも、「人工知能」などの単語をさらに組み合わせることもできます。また、多くの単語を含む文があります。それらのいくつかは、2 つのリストのいずれかに含まれています。

私がしなければならないことは、2 つのリストからの 2 つの文字列が同じ文で一緒に出現する回数を数えることです。

問題は、文の中に人工知能のような単語を見つけた場合、考慮すべき正しい単語は「人工知能」のみになることです（「人工」でも「知能」でもありません）。

文に含まれるリストからすべての単語をツリーセットに追加し、長さで並べ替えて、最も長い単語のみを取得することを考えていましたが、解決策は非常に素晴らしく効率的ではないと思います。

現時点では、コードは次のようになっています (ただし、私が話している問題はまだあります)。

// iterates on the words from the list A
for (String a: A)
    // if the phrase contains the word
    if (phrase.matches(".*\\b" + a + "\\b.*")
        // iterates on the words from the list B
        for (String b: B)
            // if the phrase contains the word
            if (phrase.matches(".*\\b" + b + "\\b.*")
                // do stuffs

何か提案はありますか？ありがとう！

score 1 · Accepted Answer

私の考えは、考慮された単語を追跡してからクリーンアップすることです。

次のようなことを試してください：

int counter = 0;
List<String[]> h = new ArrayList<String[]>();
HashSet<String> words = new HashSet<String>();

// iterates on the words from the list A
for (String a: A)
    // if the phrase contains the word
    if (phrase.matches(".*\\b" + a + "\\b.*"))
        // iterates on the words from the list B
        for (String b: B)
            // if the phrase contains the word
            if (phrase.matches(".*\\b" + b + "\\b.*")) {

                h.add(new String[]{a,b});
                words.add(a);
                words.add(b);
            }

// clean up:

// 1. clean words
for (String i:words) {
    // in words, keep only strings that are not contained by others
}

// 2. clean h
for (String[] i :  h) {
    // if i[0] or i[1] are both in words, then
    // increment counter... or whatever you want
}

あなたの問題を理解したと思います

score 1 · Accepted Answer

2 つのリストがあります。リスト内のすべての単語について、最初の単語からリスト内の残りの単語へのマップを作成します。たとえば、このリストに「人工知能」、「コウモリの洞窟」、「犬」がある場合、次のように保存します。

"artificial" => { "artificial intelligence" }

"bat" => { "bat cave" }

"dog" => { "dog" }

これが最初のステップになります。リストを前処理して、firstword からリスト内の残りの単語へのマップを取得します。

あなたのセリフに「人工知能はかっこいい」などの文が含まれている場合。で行を分割し\wます。あなたは言葉を得る。私たちが最初に出会う言葉は「人工」です。以前に取得した両方のマップを調べます。artificialそのため、マップの 1 つにのキーが表示されます。行の次の単語が何であるかを知っています。それにもかかわらず、最長の一致に対して一致させたいと考えています。したがって、に対応する単語のリストを比較して取得しartificialます。そして、最長の部分文字列を一致させます。artificial intelliegence最長の一致を探しているので、一緒に見つけます。それにもかかわらず、2 番目のリストに対してプロセスを繰り返します。どちらか長い方に応じて、リスト 1 に属するかリスト 2 に属するかを選択します。

ここにいくつかのサンプルコードがあります。

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;


public class WordSplits {
    public static Map<String, List<String>> first2rest(List<String> wordList) {
        Map<String, List<String>> first2RestWords = new HashMap<String, List<String>>();
        for (String word : wordList) {
            // TODO Make it use Pattern. Sample demo. Get the first word of
            // every string.
            String splits[] = word.split("\\W");
            String firstWord = splits[0];
            List<String> restWords = first2RestWords.get(firstWord);
            if (restWords == null) {
                restWords = new ArrayList<String>();
            }
            restWords.add(word);
            // store the complete pattern nevertheless
            first2RestWords.put(firstWord, restWords);
        }

        return first2RestWords;
    }

    public static Map<String, List<Integer>> longestSubstring(String line,
            List<String> first, List<String> second) {
        Map<String, List<Integer>> occurences = new LinkedHashMap<String, List<Integer>>();
        Map<String, List<String>> first2RestWords = first2rest(first);
        Map<String, List<String>> second2RestWords = first2rest(second);

        Matcher wordMatcher = Pattern.compile("\\w+").matcher(line);
        for (int start = 0; start < line.length() && wordMatcher.find(start);) {

            String word = wordMatcher.group();

            String maxWordFirst = "", maxWordSecond = "";
            if (first2RestWords.containsKey(word)) {
                maxWordFirst = longestMatch(
                        line.substring(wordMatcher.start()),
                        first2RestWords.get(word));
            }
            if (second2RestWords.containsKey(word)) {
                maxWordSecond = longestMatch(
                        line.substring(wordMatcher.start()),
                        second2RestWords.get(word));

            }

            if (maxWordFirst.length() > 0 || maxWordSecond.length() > 0) {
                if (maxWordFirst.equals(maxWordSecond)) {
                    System.out.println("Belongs to both the lists : " + maxWordFirst);
                } else {
                    if (maxWordFirst.length() > maxWordSecond.length()) {
                        System.out.println("Belongs to first list:  " + maxWordFirst);
                    } else if (maxWordSecond.length() > maxWordFirst.length()) {
                        System.out.println("Belongs to second list: " + maxWordSecond);
                    }
                }
            } else {
                System.out.println(word + " does not belong to any list");
            }
            // Take some action
            start = wordMatcher.start() + Math.max(maxWordFirst.length(), maxWordSecond.length()) + 1;
            start = Math.max(wordMatcher.end(), start);
        }

        return occurences;
    }

    public static String longestMatch(String line, List<String> wordList) {
        String maxWord = "";
        // poor way to compare
        for (String word : wordList) {
            if (line.startsWith(word) && word.length() > maxWord.length()) {
                maxWord = word;
            }
        }

        return maxWord;
    }

    public static void main(String[] args) {
        longestSubstring("artificial intelligence is cool. bat.",
                Arrays.asList("dog", "cow", "dog", "artificial intelligence", "bat"),
                Arrays.asList("artificial", "hound", "cool", "bat", "dog hound"));
    }
}

処理する行は"artificial intelligence is cool. bat."

l1 = `"dog", "cow", "dog", "artificial", "artificial intelligence", "bat"`

l2 = `"intelligence", "hound", "cool", "bat", "dog hound"`

プログラムからの出力は

Belongs to first list:  artificial intelligence
is does not belong to any list
Belongs to second list: cool
Belongs to both the lists : bat

やるべき最適化がたくさんあります。

score 0 · Accepted Answer

これに対する解決策を見つけたと思います。あなたはそれについて考えるのを手伝ってくれました。

私ができることは、両方のリストを別々に反復し、文章で見つけた単語を2つの一時的なマップに追加することです（オカレンスをカウントする重み付き）。その後、これら 2 つのマップを常に個別に反復処理し、文字列 a1 に文字列 a2 が含まれている場合、a2 の重みを 1 減らします。その後、正しい重みを含む 2 つのマップを取得し、各ペアの共起をインクリメントしながら両方を反復できます。

この方法でうまくいくと思います！

java - 文中の 2 つの単語の共起をカウントする (java)

4 に答える 4

Related

Reference