c++ - Finding the most used words in both text files

Question

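I have two txt files that use the same words multiple times. I read both into arrays and sorted one of the unsorted txt files with an insertion sort.

Next, I need to compare the two sorted arrays to find the most common words and how many times each was used.

I know I can use a for loop that goes through each array, but I don't know how to do it. Any help?

Edit: Here is what I have so far.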

#include<iostream>
#include<fstream>
#include<string>
using namespace std;

const int size = 100;
void checkIF(string x)
{
    fstream infile;
    cout << "Attempting to open ";
    cout << x;
    cout << "\n";
    infile.open(x);
    if( !infile )
    {
        cout << "Error: File couldn't be opened.\n";
    }
    else
    {
        cout << "File opened successfully.\n";
    }
}
void checkFile()
{
    string f1 = "text1.txt", f2 = "abbreviations.txt";
    checkIF(f1);
    checkIF(f2);
}

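string* readFiles(string txt1[],string abb[])
{
    fstream intxt1("text1.txt");
    fstream inabb("abbreviations.txt");
    // Test the extraction itself rather than eof(), so a failed read at the
    // end of the file does not store an extra empty entry. ::size is the
    // global constant above and keeps the index inside the array.
    int i = 0;
    while (i < ::size && intxt1 >> txt1[i])
    {
        //cout << txt1[i];
        i++;
    }
    // Use a separate index so the second array is filled from its start.
    int j = 0;
    while (j < ::size && inabb >> abb[j])
    {
        //cout << abb[j];
        j++;
    }

    // Only one value can be returned; both arrays are filled in place.
    return txt1;
}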

string* insertionSort(string txt1[], int arraySize)
{
    int i, j;
    string insert;

    for (i = 1; i < arraySize; i++)
    {
        insert = txt1[i];
        j = i;
        while ((j > 0) && (txt1[j - 1] > insert))
        {
            txt1[j] = txt1[j - 1];
            j = j - 1;
        }
        txt1[j] = insert;
    }
    return txt1;
}


void compare(string txt1[],string abb[])
{

}

int main()
{
    string txt1Words[size];
    string abbWords[size];
    checkFile();
    readFiles(txt1Words,abbWords);
    insertionSort(txt1Words,100);
    compare(txt1Words,abbWords);
    system("Pause");
}

score 0 · Accepted Answer

Use a vector instead of an array.

Not

string txt1Words[size];

but

vector<string> txt1Words;

Then you can simply use

std::count(txt1Words.begin(), txt1Words.end(), word_to_search);
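As a rough illustration of that suggestion, the sketch below wraps the call in a small function. The helper name `countWord` is made up for this example; only `std::vector` and `std::count` come from the answer itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): number of times `word`
// occurs in `words`. std::count compares every element with `word`,
// so each call is a full linear pass over the vector.
std::size_t countWord(const std::vector<std::string>& words,
                      const std::string& word)
{
    return static_cast<std::size_t>(
        std::count(words.begin(), words.end(), word));
}
```

For a vector holding "apple", "banana", "apple", the call `countWord(words, "apple")` gives 2. Since every call scans the whole vector, counting all distinct words this way can get slow for large files.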

score 0 · Accepted Answer

First, let's address the problem of "the most used words in both text files". It depends heavily on how you define "most used". Essentially, you have two sets of words with counts.

For example

File A: "apple apple apple apple apple banana"

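File B: "apple apple banana orange orange orange orange orange"

If you store these as sets of names and counts, you get

File A: {("apple",5), ("banana",1)}

File B: {("apple",2), ("banana",1), ("orange",5)}

Note: this is not code, just mock-up notation.

In this small example, which word is the most used in both files? The question is whether "apple" should be the most used because it appears in both files, or whether "orange" should be the most used because it has the highest count in one of the files.

I'll assume you want some kind of intersection of the two sets, so only words that appear in both files are counted. Furthermore, if I were you, I would rank each word by the minimum number of times it appears in either file. That way the 5 occurrences of "apple" in file A don't give "apple" too much weight, since it appears only 2 times in file B.

Written out in code, it looks something like this: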

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

class Word
{
public:
    std::string Token;
    int Count;

    Word (const std::string &token, int count)
        : Token(token), Count(count) {}
};

and

    std::map<std::string, int> FileA;
    std::map<std::string, int> FileB;

    std::vector<Word> intersection;

    for (auto i = FileA.begin(); i != FileA.end (); ++i)
    {
        auto bentry = FileB.find (i->first); //Look up the word from A in B
        if (bentry == FileB.end ())
        {
            continue; //The word from file A was not in file B, try the next word
        }

        //We found the word from A in B
        intersection.push_back(Word (i->first,std::min(i->second,bentry->second))); //You can replace the std::min call with whatever method you want to quantify "most common"
    }

    //Now sort the intersection by Count
    std::sort (intersection.begin(),intersection.end(), [](const Word &a, const Word &b) { return a.Count > b.Count;});

    for (auto i = intersection.begin (); i != intersection.end (); ++i)
    {
        std::cout << (*i).Token << ": " << (*i).Count << std::endl;
    }
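The snippet above assumes that `FileA` and `FileB` are already filled. Here is a minimal sketch of how that might be done, assuming words are separated by whitespace and compared case-sensitively. `countWords` and `rankCommonWords` are hypothetical helper names, not part of the answer's code.

```cpp
#include <algorithm>
#include <istream>
#include <map>
#include <sstream>   // std::istringstream, handy for trying this without files
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read whitespace-separated words from a stream
// and count how often each one occurs.
std::map<std::string, int> countWords(std::istream& in)
{
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word)   // the loop ends as soon as an extraction fails
    {
        ++counts[word];  // operator[] starts a missing count at 0
    }
    return counts;
}

// Hypothetical helper: words present in both maps, ranked by the smaller
// of their two counts, highest first.
std::vector<std::pair<std::string, int>> rankCommonWords(
    const std::map<std::string, int>& a,
    const std::map<std::string, int>& b)
{
    std::vector<std::pair<std::string, int>> common;
    for (const auto& entry : a)
    {
        auto other = b.find(entry.first);
        if (other != b.end())
        {
            common.emplace_back(entry.first,
                                std::min(entry.second, other->second));
        }
    }
    // stable_sort keeps the map's alphabetical order among equal counts.
    std::stable_sort(common.begin(), common.end(),
                     [](const std::pair<std::string, int>& x,
                        const std::pair<std::string, int>& y)
                     { return x.second > y.second; });
    return common;
}
```

With real files, a `std::ifstream` opened on each file could be passed to `countWords` in place of the string stream. Any punctuation attached to a word is counted as part of that word here, so you may want to normalize the input first.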

Try running it: http://ideone.com/jbPm1g

Hope that helps.
