java - How to count the occurrences of words in a text

Question

I am working on a project to write a program that finds the 10 most frequently used words in a text, but I am stuck and don't know what to do next. Can someone help me?

This is how far I have come:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Lab4 {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner file = new Scanner(new File("text.txt")).useDelimiter("[^a-zA-Z]+");
        List<String> words = new ArrayList<String>();
        while (file.hasNext()){
            String tx = file.next();
            // String x = file.next().toLowerCase();
            words.add(tx);
        }
        Collections.sort(words);
        // System.out.println(words);
    }
}

score 9 · Accepted Answer

You can use a Guava Multiset. Here is a word-count example: http://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained

Here is how to find the words with the highest counts in a Multiset: "Simplest way to iterate through a Multiset in the order of element frequency?"

Update: I wrote this answer in 2012. Since then Java 8 has arrived, and it is now possible to find the 10 most used words in a few lines without external libraries:

List<String> words = ...

// map the words to their count
Map<String, Integer> frequencyMap = words.stream()
         .collect(toMap(
                s -> s, // key is the word
                s -> 1, // value is 1
                Integer::sum)); // merge function counts the identical words

// find the top 10
List<String> top10 = words.stream()
        .sorted(comparing(frequencyMap::get).reversed()) // sort by descending frequency
        .distinct() // take only unique values
        .limit(10)   // take only the first 10
        .collect(toList()); // put it in a returned list

System.out.println("top10 = " + top10);

The static imports are:

import static java.util.Comparator.comparing;
import static java.util.stream.Collectors.toList;
import static java.util.stream.Collectors.toMap;
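
To show how the pieces might fit together end to end, here is a minimal sketch that combines the question's Scanner delimiter with the `toMap` counting step. It is a variant, not the answer's exact code: the class name `TopWords` and the helpers `tokenize` and `topWords` are hypothetical, the words are lower-cased, ties are broken alphabetically so the result is deterministic, and the map entries are sorted instead of the full word list. The text is passed in as a String here; reading it from `text.txt` as in the question is left out to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Scanner;
import java.util.stream.Collectors;

class TopWords {

    // Splits the text on runs of non-letters (the delimiter used in the question)
    // and lower-cases each word so "The" and "the" are counted together.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("[^a-zA-Z]+")) {
            while (scanner.hasNext()) {
                words.add(scanner.next().toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    // Returns the n most frequent words, most frequent first.
    // Assumption: ties are broken alphabetically.
    static List<String> topWords(List<String> words, int n) {
        // map each word to its count, as in the answer above
        Map<String, Integer> frequencyMap = words.stream()
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum));

        // sort the distinct words (map entries) rather than the whole word list
        return frequencyMap.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = tokenize("The cat and the hat. The end! A cat.");
        System.out.println("top2 = " + topWords(words, 2)); // prints top2 = [the, cat]
    }
}
```

Sorting the map entries means each distinct word is compared once per sort step, which should matter only for large inputs; for a small file either form is fine.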

score 1 · Accepted Answer

Here is a version that is even shorter than lbalazscs's and also uses the Java 8 Streams API.

Arrays.stream(new String(Files.readAllBytes(PATH_TO_FILE), StandardCharsets.UTF_8).split("\\W+"))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, Collectors.counting()))
            .entrySet()
            .stream()
            .sorted(((o1, o2) -> o2.getValue().compareTo(o1.getValue())))
            .limit(10)
            .forEach(System.out::println);

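This does everything in one go: it loads the file, splits it on non-word characters, groups the tokens by word, assigns each group its count, and then prints the top 10 words together with their counts.

For a more detailed discussion of a very similar setup, see also https://stackoverflow.com/a/33946927/327301.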
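
The steps described above can also be written out one at a time, which may be easier to debug. The following is a sketch under a few assumptions: the class `WordFrequencies` and the method `topEntries` are hypothetical names, the text is an in-memory String rather than a file, empty tokens are filtered out (`split("\\W+")` yields a leading empty string when the text starts with a non-word character), and ties are broken alphabetically.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordFrequencies {

    static List<Map.Entry<String, Long>> topEntries(String text, int n) {
        // Step 1: split on runs of non-word characters.
        String[] tokens = text.split("\\W+");

        // Step 2: group equal words together and count the size of each group.
        Map<String, Long> counts = Arrays.stream(tokens)
                .filter(token -> !token.isEmpty()) // drop a possible leading empty token
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Step 3: sort by descending count (ties alphabetically) and keep the first n.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints be=2, to=2, is=1 (one entry per line)
        topEntries("to be, or not to be: that is the question", 3)
                .forEach(System.out::println);
    }
}
```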

score -1 · Accepted Answer

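Build the input as a String, from a file or from the command line, and pass it to the method below. It returns a map whose keys are the words and whose values are the number of times each word occurs in that sentence or paragraph.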

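public Map<String,Integer> getWordsWithCount(String sentences)
{
    Map<String,Integer> wordsWithCount = new HashMap<String, Integer>();

    // split on runs of whitespace so that repeated spaces do not produce empty "words"
    String[] words = sentences.trim().split("\\s+");
    for (String word : words)
    {
        if(word.isEmpty())
        {
            continue; // happens only when the input is blank
        }
        if(wordsWithCount.containsKey(word))
        {
            // word seen before: increment its count
            wordsWithCount.put(word, wordsWithCount.get(word)+1);
        }
        else
        {
            // first occurrence of this word
            wordsWithCount.put(word, 1);
        }
    }

    return wordsWithCount;
}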
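
The method above stops at the count map, while the question asks for the 10 most used words. One possible way to get from such a map to the top words is sketched below. The class `TopFromMap` and the helper `topKeys` are hypothetical and not part of the answer; the sketch copies the entries into a list, sorts them by descending count (ties alphabetically, an assumption), and takes the first n keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopFromMap {

    // Hypothetical helper: returns the n keys with the highest counts.
    static List<String> topKeys(Map<String, Integer> wordsWithCount, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(wordsWithCount.entrySet());

        // highest count first; equal counts are ordered alphabetically
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(n, entries.size()))) {
            top.add(entry.getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // build a small count map by hand; in practice it would come from the method above
        Map<String, Integer> counts = new HashMap<>();
        for (String word : "a b a c b a".split("\\s+")) {
            counts.merge(word, 1, Integer::sum);
        }
        System.out.println(topKeys(counts, 2)); // prints [a, b]
    }
}
```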
