java - Searching a string for the longest keywords

Question

I am using Java. I have a large set (~15000) of keywords (strings) and documents (strings) that regularly contain those keywords.

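I would like to find the index of each use of a keyword in a document, giving preference to longer keywords (those with the most characters). For example, if the keywords are "water", "bottle", "drank", and "water bottle", and the document is "I drank from my water bottle", I want the following result: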

2 drank

16 water bottle

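My first attempt was to use a trie and walk through the document one character at a time, recording the starting index whenever a substring matched a keyword. However, some keywords are prefixes of longer keywords (for example, "water" and "water bottle"). My code records the index of "water" and then starts over, so it never finds the longer keyword.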
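The prefix problem described above can be sketched as follows. This is a minimal, hedged illustration and not the asker's actual code: the class `LongestMatchTrie` and its methods are hypothetical names. The idea is that the walk does not stop at the first keyword it reaches; it remembers the end of the longest keyword seen so far and keeps descending while the trie still has a matching child. Matching here is purely character-based and ignores word boundaries.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a trie that reports the longest keyword at each position
class LongestMatchTrie {
    // One trie node; isKeyword marks the end of a complete keyword
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isKeyword;
    }

    private final Node root = new Node();

    public void addKeyword(String keyword) {
        Node node = root;
        for (char c : keyword.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isKeyword = true;
    }

    // Returns "index keyword" entries, preferring the longest keyword at each start
    public List<String> findAll(String document) {
        List<String> hits = new ArrayList<>();
        int i = 0;
        while (i < document.length()) {
            Node node = root;
            int longestEnd = -1; // exclusive end of the longest keyword starting at i
            for (int j = i; j < document.length(); j++) {
                node = node.children.get(document.charAt(j));
                if (node == null) break;
                // Remember this match but keep walking: a longer keyword may follow
                if (node.isKeyword) longestEnd = j + 1;
            }
            if (longestEnd == -1) {
                i++; // no keyword starts here
            } else {
                hits.add(i + " " + document.substring(i, longestEnd));
                i = longestEnd; // resume after the matched keyword
            }
        }
        return hits;
    }
}
```

With the four keywords from the example, calling `findAll("I drank from my water bottle")` on this sketch yields the entries "2 drank" and "16 water bottle".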

In case it matters, keywords can contain lowercase letters, uppercase letters, spaces, hyphens, and apostrophes (and case matters).

I would appreciate any help with finding the longest keywords. Thanks.

score 0 · Accepted Answer

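If keywords can be built from smaller keywords, then all working code needs to do is check the longer keywords first. Note: I have not tested this at all, since I figure you have already put enough effort into this problem. If this helps, don't forget to upvote and accept.

That is: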

import java.util.TreeSet;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.HashMap;
import java.util.Iterator;

public class KeywordSearcher {
    private TreeSet<String> ts;

    public KeywordSearcher() {
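        ts = new TreeSet<String>(new Comparator<String>() {
            // Sort all the keywords by length, largest first. Ties are broken
            // alphabetically: returning 0 for equal lengths would make the
            // TreeSet treat same-length keywords as duplicates and drop them.
            public int compare(String arg0, String arg1) {
                if (arg0.length() > arg1.length()) return -1;
                if (arg0.length() < arg1.length()) return 1;
                return arg0.compareTo(arg1);
            }
        });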
    }

    public void addKeyword(String s) {
        ts.add(s);
    }

    private LinkedList<Integer> findKeyword(String document, String s) {
        int start = 0;
        int index;
        LinkedList<Integer> indexes = new LinkedList<Integer>();        

        while(true) {
            index = document.indexOf(s, start);
            if (index == -1) break;
            indexes.add(index);
            start = index + s.length();
        }

        return indexes;
    }

    public HashMap<String, LinkedList<Integer>> findAllKeywords(String document) {
        Iterator<String> is = ts.iterator();
        HashMap<String, LinkedList<Integer>> allIndices = new HashMap<String, LinkedList<Integer>>();

        while(is.hasNext()) {
            String nextKeyword = is.next();
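            // If a longer keyword that was already found contains this keyword,
            // skip it. Note that this skips the keyword for the whole document,
            // even at positions where it occurs on its own.
            boolean foundIt = false;
            for (String key : allIndices.keySet()) {
                if (key.contains(nextKeyword)) {
                    foundIt = true;
                    break;
                }
            }
            if (foundIt) continue;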

            // We didn't find the larger keyword, look for the smaller keyword
            LinkedList<Integer> indexes = findKeyword(document, nextKeyword);

            if (indexes.size() > 0) allIndices.put(nextKeyword, indexes);
        }

        return allIndices;
    }
}
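One limitation of the code above is that a shorter keyword is skipped everywhere once a longer keyword containing it has been found, even where the shorter keyword occurs on its own. A hedged variation on the same longest-first idea is sketched below. It is not part of the answer, and the class `LongestFirstMatcher` and its method names are made up for illustration. It marks the character positions already claimed by longer keywords and only rejects a shorter match when it overlaps one of them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: longest-first matching with per-position masking
class LongestFirstMatcher {
    // Returns start index -> keyword, giving longer keywords priority
    static Map<Integer, String> match(List<String> keywords, String document) {
        List<String> sorted = new ArrayList<>(keywords);
        // Longest first; ties broken alphabetically for a deterministic order
        sorted.sort((a, b) -> a.length() != b.length()
                ? Integer.compare(b.length(), a.length())
                : a.compareTo(b));

        boolean[] covered = new boolean[document.length()];
        Map<Integer, String> hits = new TreeMap<>();
        for (String keyword : sorted) {
            if (keyword.isEmpty()) continue; // an empty keyword would never advance
            int index = document.indexOf(keyword);
            while (index != -1) {
                int end = index + keyword.length();
                if (isFree(covered, index, end)) {
                    // Claim these characters so shorter keywords cannot reuse them
                    Arrays.fill(covered, index, end, true);
                    hits.put(index, keyword);
                    index = document.indexOf(keyword, end);
                } else {
                    // Overlaps a longer match; try the next possible start
                    index = document.indexOf(keyword, index + 1);
                }
            }
        }
        return hits;
    }

    private static boolean isFree(boolean[] covered, int from, int to) {
        for (int i = from; i < to; i++) {
            if (covered[i]) return false;
        }
        return true;
    }
}
```

For the document "I drank water from my water bottle" and the four example keywords, this sketch reports "drank" at 2, the standalone "water" at 8, and "water bottle" at 22. It still calls `indexOf` once per keyword, so with ~15000 keywords a trie-based scan may be preferable.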

score 0 · Accepted Answer

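If I understand correctly, when "water bottle" is found in the document, you want to skip the search for "water". This implies some kind of tree structure for the keywords.

My suggestion is to arrange the keywords in a sorted tree, like this: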

drank
water bottle
    bottle
    water

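In your code, you first search for the terms at the root ("drank" and "water bottle"). If "water bottle" has zero matches, you move down to the next level and search for those terms ("bottle" and "water").

Building the tree will take a bit of work.

However, this tree structure lets you have multiple levels of compound keywords.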

clean water bottle
    clean bottle
        clean
    water bottle
        bottle
        water
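The level-by-level search described above might look roughly like the following sketch. It assumes the tree has already been built by hand; how to derive the tree from the keyword list is left open, as in the answer. The names `KeywordTree`, `Node`, `search`, and `indexesOf` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: search a hand-built keyword tree level by level
class KeywordTree {
    // A keyword plus the shorter keywords it is composed of
    static final class Node {
        final String keyword;
        final List<Node> children = new ArrayList<>();

        Node(String keyword, Node... children) {
            this.keyword = keyword;
            for (Node child : children) {
                this.children.add(child);
            }
        }
    }

    // Search every node on this level; descend into a node's children
    // only when the node itself had zero matches in the document
    static void search(List<Node> level, String document,
                       Map<String, List<Integer>> hits) {
        for (Node node : level) {
            List<Integer> indexes = indexesOf(document, node.keyword);
            if (indexes.isEmpty()) {
                search(node.children, document, hits);
            } else {
                hits.put(node.keyword, indexes);
            }
        }
    }

    // All non-overlapping start indexes of keyword in document
    static List<Integer> indexesOf(String document, String keyword) {
        List<Integer> indexes = new ArrayList<>();
        if (keyword.isEmpty()) return indexes; // avoid an endless loop
        int index = document.indexOf(keyword);
        while (index != -1) {
            indexes.add(index);
            index = document.indexOf(keyword, index + keyword.length());
        }
        return indexes;
    }
}
```

A root level for the first tree could be built as `new Node("drank")` and `new Node("water bottle", new Node("bottle"), new Node("water"))`. Like the description it follows, this sketch only looks at the children when the parent has no matches at all, so a standalone "water" is not reported in a document that also contains "water bottle".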
