c - C (C++ ではない) の単語頻度統計

Question

文字列が単一の空白で区切られた単語で構成されている場合、文字列に出現する回数で並べ替えられた単語を降順に出力します。

たとえば、「ab bc bc」の入力文字列は、次の出力を生成します。

bc : 2
ab : 1

この問題は、マップなどの C++ データ構造を使用すると簡単に解決できます。しかし、問題が単純な古い C でしか解決できない場合は、はるかに困難に見えます。

ここでは、どのようなデータ構造とアルゴリズムを使用する必要がありますか? できるだけ詳しく教えてください。DSとアルゴが苦手です。:-(

score 4 · Accepted Answer

使用できるデータ構造の1つは、strcmpを使用して比較できる単語を含む単純な二分木です。（今のところ、ケースの問題は無視します）。

あなたはそれを成長させるときに木がバランスを保つことを確実にする必要があるでしょう。これについては、ウィキペディアまたは他の場所でAVLツリーまたは1-2ツリーまたは赤黒木を検索してください。

二分木構造体を作成するために、各ノードにはnullになる可能性のある左右のサブノードがあり、リーフノードの場合は両方のサブノードがnullになることを除いて、あまり詳しく説明しません。簡単にするには、値と2つのサブノードを持つ「侵入型」ノードを使用します。何かのようなもの：

struct Node
{
  char * value;
  size_t frequency; 
  struct Node * left;
  struct Node * right;
};

そして明らかにCであるため、すべてのメモリ管理を行う必要があります。

ツリーを再帰的に下って比較し、必要に応じて左または右に移動する関数があります。見つかった場合は、周波数が上がるだけです。そうでない場合は、関数がノードを挿入する場所を決定できるはずです。次に、挿入とリバランスのロジックが実行されます。もちろん、新しいノードには頻度1の単語が含まれます。

最後に、結果を印刷するツリーを再帰的に処理する方法が必要になります。あなたの場合、これは再帰関数である可能性があります。

ちなみに、代替データ構造はある種のハッシュテーブルであることに注意してください。

最も効率的な解決策を探していて、手元にたくさんのメモリがある場合は、データ構造を使用して、文字に遭遇したときに各文字を分岐します。したがって、「a」はaで始まるすべての単語を示し、次に「b」などの2番目の文字に移動します。データ構造を知らない人のために実装するのはかなり複雑なので、行くことをお勧めします単純な二分木で。

印刷では、頻度の逆順ではないため、最初に結果を並べ替える必要があることに注意してください。（マップを使用するC ++では、それらをこの順序で取得することもできません）。

score 2 · Accepted Answer

これには三分木を使用します。Jon Bentley と Robert Sedgewick によってデータ構造が紹介されている次の記事には、C での例があります。

http://www.cs.princeton.edu/~rs/strings/

score 1 · Accepted Answer

これが私がそれを行う方法のサンプルです。findWord() での検索は最適化できます。一度に 1 つではなく、単語のブロックを割り当てることによって、割り当ての数を減らすこともできます。この場合にも、リンクされたリストを実装できます。メモリの解放が不足しています。これでうまくいくはずです。

    #include <stdio.h>
    #include <assert.h>
    #include <stdlib.h>

    #define MAXWORDLEN 128

    const char* findWhitespace(const char* text)
    {
        while (*text && !isspace(*text))
            text++;
        return text;
    }

    const char* findNonWhitespace(const char* text)
    {
        while (*text && isspace(*text))
            text++;
        return text;
    }

    typedef struct tagWord
    {
        char word[MAXWORDLEN + 1];
        int count;
    } Word;

    typedef struct tagWordList
    {
        Word* words;
        int count;
    } WordList;

    WordList* createWordList(unsigned int count);

    void extendWordList(WordList* wordList, const int count)
    {
        Word* newWords = (Word*)malloc(sizeof(Word) * (wordList->count + count));
        if (wordList->words != NULL) {
            memcpy(newWords, wordList->words, sizeof(Word)* wordList->count);
            free(wordList->words);
        }
        for (int i = wordList->count; i < wordList->count + count; i++) {
            newWords[i].word[0] = '\0';
            newWords[i].count = 0;
        }
        wordList->words = newWords;
        wordList->count += count;
    }

    void addWord(WordList* wordList, const char* word)
    {
        assert(strlen(word) <= MAXWORDLEN);
        extendWordList(wordList, 1);
        Word* wordNode = &wordList->words[wordList->count - 1];
        strcpy(wordNode->word, word);
        wordNode->count++;  
    }

    Word* findWord(WordList* wordList, const char* word)
    {
        for(int i = 0; i < wordList->count; i++) {
            if (stricmp(word, wordList->words[i].word) == 0) {
                return &wordList->words[i];
            }
        }
        return NULL;
    }

    void updateWordList(WordList* wordList, const char* word)
    {
        Word* foundWord = findWord(wordList, word);
        if (foundWord == NULL) {
            addWord(wordList, word);
        } else {
            foundWord->count++;
        }
    }

    WordList* createWordList(unsigned int count)
    {
        WordList* wordList = (WordList*)malloc(sizeof(WordList));
        if (count > 0) {
            wordList->words = (Word*)malloc(sizeof(Word) * count);
            for(unsigned int i = 0; i < count; i++) {
                wordList->words[i].count = 0;
                wordList->words[i].word[0] = '\0';
            }
        }
        else {
            wordList->words = NULL;
        }
        wordList->count = count;    
        return wordList;
    }

    void printWords(WordList* wordList)
    {
        for (int i = 0; i < wordList->count; i++) {
            printf("%s: %d\n", wordList->words[i].word, wordList->words[i].count);
        }
    }

    int compareWord(const void* vword1, const void* vword2)
    {
        Word* word1 = (Word*)vword1;
        Word* word2 = (Word*)vword2;
        return strcmp(word1->word, word2->word);
    }

    void sortWordList(WordList* wordList)
    {
        qsort(wordList->words, wordList->count, sizeof(Word), compareWord);
    }

    void countWords(const char* text)
    {
        WordList   *wordList = createWordList(0);
        Word       *foundWord = NULL;
        const char *beg = findNonWhitespace(text);
        const char *end;
        char       word[MAXWORDLEN];

        while (beg && *beg) {
            end = findWhitespace(beg);
            if (*end) {
                assert(end - beg <= MAXWORDLEN);
                strncpy(word, beg, end - beg);
                word[end - beg] = '\0';
                updateWordList(wordList, word);
                beg = findNonWhitespace(end);
            }
            else {
                beg = NULL;
            }
        }

        sortWordList(wordList);
        printWords(wordList);
    }

int main(int argc, char* argv[])
{
    char* text = "abc 123 abc 456 def 789 \tyup this \r\ncan work yup 456 it can";
    countWords(text);
}

c - C (C++ ではない) の単語頻度統計

3 に答える 3

Related

Reference