java - Why do I get only one result for TF-IDF?

Question

// Calculating term frequency
    System.out.println("Please enter the required word  :");
    Scanner scan = new Scanner(System.in);
    String word = scan.nextLine();

    String[] array = word.split(" ");
    int filename = 11;
    String[] fileName = new String[filename];
    int a = 0;
    int totalCount = 0;
    int wordCount = 0;


    for (a = 0; a < filename; a++) {

        try {
            System.out.println("The word inputted is " + word);
            File file = new File(
                    "C:\\Users\\user\\fypworkspace\\TextRenderer\\abc" + a
                            + ".txt");
            System.out.println(" _________________");

            System.out.print("| File = abc" + a + ".txt | \t\t \n");

            for (int i = 0; i < array.length; i++) {

                totalCount = 0;
                wordCount = 0;

                Scanner s = new Scanner(file);
                {
                    while (s.hasNext()) {
                        totalCount++;
                        if (s.next().equals(array[i]))
                            wordCount++;

                    }

                    System.out.print(array[i] + " ---> Word count =  "
                            + "\t\t " + "|" + wordCount + "|");
                    System.out.print("  Total count = " + "\t\t " + "|"
                            + totalCount + "|");
                    System.out.printf("  Term Frequency =  | %8.4f |",
                            (double) wordCount / totalCount);

                    System.out.println("\t ");

                }
            }
        } catch (FileNotFoundException e) {
            System.out.println("File is not found");

        }

    }

System.out.println("Please enter the required word  :");
    Scanner scan2 = new Scanner(System.in);
    String word2 = scan2.nextLine();
    String[] array2 = word2.split(" ");
    int numofDoc;

    for (int b = 0; b < array2.length; b++) {

        numofDoc = 0;

        for (int i = 0; i < filename; i++) {

            try {

                BufferedReader in = new BufferedReader(new FileReader(
                        "C:\\Users\\user\\fypworkspace\\TextRenderer\\abc"
                                + i + ".txt"));

                int matchedWord = 0;

                Scanner s2 = new Scanner(in);

                {

                    while (s2.hasNext()) {
                        if (s2.next().equals(array2[b]))
                            matchedWord++;
                    }

                }
                if (matchedWord > 0)
                    numofDoc++;

            } catch (IOException e) {
                System.out.println("File not found.");
            }

        }
        System.out.println(array2[b]
                + " --> This number of files that contain the term  "
                + numofDoc);
        double inverseTF = Math.log10((float) numDoc / numofDoc);
        System.out.println(array2[b] + " --> IDF " +  inverseTF );
        double TFIDF = (((double) wordCount / totalCount) * inverseTF );
        System.out.println(array2[b] + " --> TFIDF " + TFIDF);
    }
}

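Hi, this is my code for calculating term frequency and TF-IDF. The first block calculates the term frequency of a given string in each file. The second block is supposed to use those values to calculate the TF-IDF for each file. However, I only get one value. It should produce a TF-IDF value for every document.

Example output for term frequency:

The word inputted is is

| File = abc0.txt |
is ---> Word count = |2|  Total count = |150|  Term Frequency = | 0.0133 |

The word inputted is is

| File = abc1.txt |
is ---> Word count = |0|  Total count = |9|  Term Frequency = | 0.0000 |

TF-IDF

is --> This number of files that contain the term  7

is --> IDF 0.1962946357308887

is --> TFIDF 0.0028607962606519654 <<< I expect one value per file here. That is, I have 10 files, so I expect 10 different values, one for each file. However, only one result is printed. Can someone point out my mistake?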

score 1 · Accepted Answer

The println statement that you expect to be repeated for each file is this one:

double TFIDF = (((double) wordCount / totalCount) * inverseTF );
System.out.println(array2[b] + " --> TFIDF " + TFIDF);

But it is enclosed in only a single loop,

for (int b = 0; b < array2.length; b++)

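and nothing else. That loop runs once per search word, so the line is printed once per word. If you want this line printed for each file, you need to enclose the statement in another loop over all the files.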
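
To make the difference in loop shape concrete, here is a minimal sketch. It is not the asker's program: `LoopShapeSketch`, `perWord`, and `perWordAndFile` are hypothetical names, and the lines are collected in a list instead of being computed from real files. Only the file naming (`abc0.txt` and so on) follows the question.

```java
import java.util.ArrayList;
import java.util.List;

class LoopShapeSketch {
    // One line per search word: mirrors a print that sits only inside the word loop.
    static List<String> perWord(String[] words) {
        List<String> lines = new ArrayList<>();
        for (String w : words) {
            lines.add(w + " --> TFIDF");
        }
        return lines;
    }

    // One line per (word, file) pair: the print sits inside an inner loop over files.
    static List<String> perWordAndFile(String[] words, int fileCount) {
        List<String> lines = new ArrayList<>();
        for (String w : words) {
            for (int f = 0; f < fileCount; f++) {
                lines.add(w + " --> TFIDF for abc" + f + ".txt");
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String[] words = {"is"};
        // Number of lines produced by each loop shape.
        System.out.println(perWord(words).size());
        System.out.println(perWordAndFile(words, 11).size());
    }
}
```

With a single search word, the first shape yields one line in total, while the second yields one line per file.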

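Since this is homework, I won't include the final code, but here is another hint: your TFIDF calculation also uses the variables wordCount and totalCount. These are specific to each file/word pair, so you need to either store them per file/word pair rather than just once, or recompute them in the final loop.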
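
For readers who want to see the general shape of that hint, the following is a rough sketch under stated assumptions, not the asker's final code. The documents are in-memory token lists instead of the `abc*.txt` files, and `TfIdfSketch`, `tf`, `idf`, and `tfIdfPerDoc` are hypothetical names. IDF is taken as log10(number of documents / documents containing the term), as in the question. Returning 0 for a term that appears in no document is an assumption made here to avoid dividing by zero.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    // Term frequency: occurrences of the term divided by the token count of one document.
    static double tf(List<String> tokens, String term) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long hits = tokens.stream().filter(term::equals).count();
        return (double) hits / tokens.size();
    }

    // Inverse document frequency: log10(total documents / documents containing the term).
    static double idf(List<List<String>> docs, String term) {
        long containing = docs.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) {
            return 0.0; // assumption: a term found nowhere gets IDF 0
        }
        return Math.log10((double) docs.size() / containing);
    }

    // One TF-IDF value per document: IDF is computed once, TF is recomputed per document.
    static double[] tfIdfPerDoc(List<List<String>> docs, String term) {
        double inverseDocFreq = idf(docs, term);
        double[] scores = new double[docs.size()];
        for (int i = 0; i < docs.size(); i++) {
            scores[i] = tf(docs.get(i), term) * inverseDocFreq;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up sample documents, already split into whitespace tokens.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("this", "is", "a", "test"),
                Arrays.asList("another", "file"),
                Arrays.asList("is", "is", "it", "so"));
        double[] scores = tfIdfPerDoc(docs, "is");
        for (int i = 0; i < scores.length; i++) {
            System.out.printf("is --> TFIDF for document %d = %.4f%n", i, scores[i]);
        }
    }
}
```

The point of the design is that the document count behind the IDF is complete before any per-file value is printed, and the term frequency is looked up per file instead of reusing whatever `wordCount` and `totalCount` held after the last iteration.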

score 0 · Accepted Answer

The part that prints the TF-IDF needs to be moved inside the for loop that iterates over all the files.

That is:

    System.out.println(array2[b]
            + " --> This number of files that contain the term  "
            + numofDoc);
    double inverseTF = Math.log10((float) numDoc / numofDoc);
    System.out.println(array2[b] + " --> IDF " +  inverseTF );
    double TFIDF = (((double) wordCount / totalCount) * inverseTF );
    System.out.println(array2[b] + " --> TFIDF " + TFIDF);
}

