c - Count the number of occurrences of each word in a text file

Question

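How can I keep track of the number of times a word appears in a text file? I would like to do this for every word.

For example, if the input is something like:

"The man said hi to the boy."

Each of "man said hi to boy" would have an occurrence count of 1.

"the" would have an occurrence count of 2.

I was thinking of keeping a dictionary of word/occurrence pairs, but I'm not sure how to implement this in C. Links to any similar or related problems with a solution would be great.

EDIT: To avoid rolling my own hash table, I decided to learn how to use glib. Along the way I found an excellent tutorial that walks through a similar problem. http://bo.majewski.name/bluear/gnu/GLib/ch03s03.html

I am amazed by the number of different approaches, and in particular by the simplicity and elegance of the Ruby implementation.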

score 5 · Accepted Answer

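Yes, a dictionary of word/occurrence pairs would work fine, and the usual way to implement such a dictionary is a hash table (or, sometimes, a binary search tree).

You could also use a trie (or its compressed version, the "Patricia trie"/radix trie), whose complexity is asymptotically optimal for this problem, although I suspect that in practice it may be slower than a (good) hash table implementation.

[I really think that whether hash tables or tries are better depends on the distribution of words in your input. For example, a hash table has to store each word in full in its hash bucket (to guard against collisions), whereas in a trie, words with common prefixes share those prefixes, and each prefix is stored only once. On the other hand, a trie still carries the overhead of all its pointers. If you do try both, I'm curious to know how they compare.]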
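To make the trie idea concrete, here is a minimal sketch of an uncompressed trie that counts words. It assumes words consist only of lowercase ASCII letters a-z, and the names `trie_new`, `trie_add`, `trie_count` and `trie_free` are made up for this illustration. It is not tuned for memory use: every node carries 26 child pointers, which is exactly the pointer overhead mentioned above.

```c
#include <assert.h>
#include <stdlib.h>

#define ALPHABET 26  /* assumption: words use lowercase ASCII a-z only */

struct trie {
    struct trie *child[ALPHABET];
    int count;  /* how many times the word ending at this node was seen */
};

/* Allocate an empty node; returns NULL if out of memory. */
struct trie *trie_new(void)
{
    struct trie *t = malloc(sizeof *t);
    if (t != NULL) {
        for (int i = 0; i < ALPHABET; i++)
            t->child[i] = NULL;
        t->count = 0;
    }
    return t;
}

/* Record one occurrence of word. Returns the new count,
 * or -1 on allocation failure or a character outside a-z. */
int trie_add(struct trie *root, const char *word)
{
    assert(root != NULL && word != NULL);
    struct trie *node = root;
    for (; *word != '\0'; word++) {
        if (*word < 'a' || *word > 'z')
            return -1;
        int i = *word - 'a';
        if (node->child[i] == NULL) {
            node->child[i] = trie_new();
            if (node->child[i] == NULL)
                return -1;
        }
        node = node->child[i];  /* shared prefixes reuse existing nodes */
    }
    return ++node->count;
}

/* Return how many times word was added (0 if never). */
int trie_count(const struct trie *root, const char *word)
{
    assert(root != NULL && word != NULL);
    const struct trie *node = root;
    for (; *word != '\0' && node != NULL; word++) {
        if (*word < 'a' || *word > 'z')
            return 0;
        node = node->child[*word - 'a'];
    }
    return node != NULL ? node->count : 0;
}

/* Release the whole trie. */
void trie_free(struct trie *node)
{
    if (node == NULL)
        return;
    for (int i = 0; i < ALPHABET; i++)
        trie_free(node->child[i]);
    free(node);
}
```

Each insertion and lookup takes time proportional to the length of the word, independent of how many distinct words are stored. A caller would lowercase and split the input first, then call `trie_add` once per word.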

score 5 · Accepted Answer

For the curious, here is a simple Ruby solution to the word count problem. The algorithm in C would be basically the same, just with a lot more code.

h = Hash.new(0)
File.read("filename.txt").split.each do |w|
  h[w] += 1
end
p h

score 4 · Accepted Answer

Does this count?

#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
    char buffer[2048];
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s file\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    snprintf(buffer, sizeof(buffer), "tr -cs '[a-z][A-Z]' '[\\n*]' < %s |"
                                     " sort | uniq -c | sort -n", argv[1]);
    return(system(buffer));
}

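This basically encapsulates the canonical shell script that illustrates how to count words on Unix.

The 'tr' command translates anything that isn't an alphabetic character into a newline and squeezes out duplicates. The first 'sort' groups all occurrences of each word together. 'uniq -c' counts the number of consecutive occurrences of each word and prints the count followed by the word. The second 'sort' orders the lines by count, lowest first. You may need to tinker with the options to 'tr'; it is not the most stable command from system to system, and it routinely sends me back to the manual page. On Solaris 10, using /usr/bin/tr, the code above produces the following (run on its own source):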

   1
   1 A
   1 EXIT
   1 FAILURE
   1 Usage
   1 Z
   1 a
   1 c
   1 cs
   1 exit
   1 file
   1 fprintf
   1 if
   1 main
   1 return
   1 sizeof
   1 snprintf
   1 stderr
   1 stdio
   1 stdlib
   1 system
   1 tr
   1 uniq
   1 z
   2 argc
   2 char
   2 h
   2 include
   2 int
   2 s
   2 sort
   3 argv
   3 n
   4 buffer
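
For anyone who wants the same sort-then-count idea without shelling out, here is a rough sketch of the pipeline done in-process. The function name `count_words` and the `struct wordcount` type are invented for this example; like the 'tr' invocation above, it treats anything that is not a letter as a separator and is case-sensitive.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

struct wordcount {
    char *word;  /* points into the caller's text buffer */
    int count;
};

/* Compare two char* elements by the strings they point to. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(char *const *)a, *(char *const *)b);
}

/* Order by ascending count, then alphabetically. */
static int cmp_count(const void *a, const void *b)
{
    const struct wordcount *x = a, *y = b;
    if (x->count != y->count)
        return (x->count > y->count) - (x->count < y->count);
    return strcmp(x->word, y->word);
}

/*
 * Count the words in text, which is modified in place (separators
 * become '\0') and must outlive out[]. Writes up to max distinct
 * words to out[], ordered by ascending count. Returns the number of
 * entries written, or -1 on allocation failure.
 */
int count_words(char *text, struct wordcount *out, int max)
{
    assert(text != NULL && out != NULL);
    size_t len = strlen(text), nwords = 0;
    /* at most every other character can start a word */
    char **words = malloc((len / 2 + 1) * sizeof *words);
    if (words == NULL)
        return -1;

    /* "tr -cs": split on non-letters, squeezing runs of separators */
    int in_word = 0;
    for (size_t i = 0; i < len; i++) {
        if (isalpha((unsigned char)text[i])) {
            if (!in_word)
                words[nwords++] = &text[i];
            in_word = 1;
        } else {
            text[i] = '\0';
            in_word = 0;
        }
    }

    /* "sort": bring equal words next to each other */
    qsort(words, nwords, sizeof *words, cmp_str);

    /* "uniq -c": collapse each run of equal words into one entry */
    int n = 0;
    for (size_t i = 0; i < nwords && n < max; ) {
        size_t j = i;
        while (j < nwords && strcmp(words[j], words[i]) == 0)
            j++;
        out[n].word = words[i];
        out[n].count = (int)(j - i);
        n++;
        i = j;
    }
    free(words);

    /* "sort -n": order the entries by count */
    qsort(out, (size_t)n, sizeof *out, cmp_count);
    return n;
}
```

Sorting costs O(n log n) comparisons, so a hash table will usually be faster on large inputs, but this version needs no data structure beyond two arrays.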

score 2 · Accepted Answer

For an individual word, there is no need to write a program at all, unless this is part of something larger:

sed -e 's/[[:space:]]/\n/g' < file.txt | grep -c WORD
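
If it does end up inside a larger C program, counting a single word is only a few lines. This is a sketch rather than an exact equivalent of the pipeline above: the function name `count_word` is made up, and it counts whole whitespace-delimited tokens equal to the target, whereas grep would also match the word as a substring of a longer token.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Count the whitespace-delimited tokens in text that equal target exactly. */
int count_word(const char *text, const char *target)
{
    assert(text != NULL && target != NULL);
    size_t tlen = strlen(target);
    int count = 0;
    const char *p = text;
    while (*p != '\0') {
        /* skip separators */
        while (isspace((unsigned char)*p))
            p++;
        /* scan one token */
        const char *start = p;
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;
        size_t len = (size_t)(p - start);
        if (len > 0 && len == tlen && memcmp(start, target, tlen) == 0)
            count++;
    }
    return count;
}
```

Punctuation is not stripped here, so "boy." and "boy" count as different tokens; whether that matters depends on the input.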

score 2 · Accepted Answer

You can use a hash table in which every entry points to a structure containing the word and the number of times it has been found so far.
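
A minimal sketch of that layout, assuming a fixed number of buckets with chaining and the well-known djb2 string hash. All names here (`struct entry`, `struct wordtable`, `wt_add`, `wt_count`, `wt_add_text`) are invented for the example, and a real program would probably want a table that grows.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024  /* assumed fixed size; no resizing in this sketch */

/* One entry per distinct word; entries in the same bucket are chained. */
struct entry {
    char *word;
    int count;
    struct entry *next;
};

struct wordtable {
    struct entry *buckets[NBUCKETS];
};

/* djb2 string hash, reduced to a bucket index. */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Add one occurrence of word; returns the new count, or -1 if out of memory. */
int wt_add(struct wordtable *t, const char *word)
{
    assert(t != NULL && word != NULL);
    unsigned long i = hash_word(word);
    for (struct entry *e = t->buckets[i]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return ++e->count;

    /* first time we see this word: prepend a new entry to the chain */
    struct entry *e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = t->buckets[i];
    t->buckets[i] = e;
    return 1;
}

/* Return the count for word, or 0 if it was never added. */
int wt_count(const struct wordtable *t, const char *word)
{
    assert(t != NULL && word != NULL);
    for (const struct entry *e = t->buckets[hash_word(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

/* Split text into lowercased alphabetic words and count each one. */
void wt_add_text(struct wordtable *t, const char *text)
{
    assert(t != NULL && text != NULL);
    char word[64];  /* letters beyond the 63rd of a word are dropped */
    size_t n = 0;
    for (;; text++) {
        if (isalpha((unsigned char)*text)) {
            if (n < sizeof word - 1)
                word[n++] = (char)tolower((unsigned char)*text);
        } else {
            if (n > 0) {
                word[n] = '\0';
                wt_add(t, word);
                n = 0;
            }
            if (*text == '\0')
                break;
        }
    }
}
```

The table must start zeroed (for example `struct wordtable t = {0};`). To process a file, read it line by line with `fgets` and pass each line to `wt_add_text`; freeing the entries is left out to keep the sketch short.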

score 1 · Accepted Answer

In Perl:

my %wordcount = ();
while(<>){map {$wordcount{$_}++} (split /\s+/)}
print "$_ = $wordcount{$_}\n" foreach sort keys %wordcount;

In Perl Golf (just for fun):

my%w;                       
map{$w{$_}++}split/\s+/while(<>); 
print"$_=$w{$_}\n"foreach keys%w;

score 0 · Accepted Answer

Warning: untested code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct LLNode
{
    struct LLNode* Next;
    char*          Word;
    int            Count;
} LLNode;

// Add one occurrence of word, appending a new node if it has not been seen yet.
void PushWord(LLNode** list, const char* word)
{
    LLNode** link = list;
    while (*link != NULL)
    {
        if (!strcmp((*link)->Word, word))
        {
            (*link)->Count++;
            return;
        }
        link = &(*link)->Next;
    }

    // not found: link now points at the NULL pointer that ends the list
    LLNode* node = malloc(sizeof *node);
    if (node == NULL)
        return;
    node->Word = malloc(strlen(word) + 1);
    if (node->Word == NULL)
    {
        free(node);
        return;
    }
    strcpy(node->Word, word);
    node->Count = 1;
    node->Next = NULL;
    *link = node;
}

void GetCounts(const LLNode* list)
{
    for (const LLNode* node = list; node != NULL; node = node->Next)
    {
        printf("Word: %s, Count: %i\n", node->Word, node->Count);
    }
}

// Split words on spaces and push each word onto the list.
void PushWords(LLNode** list, const char* words)
{
    size_t len = strlen(words);
    char* buff = malloc(len + 1); // a word can never be longer than the whole input
    size_t index = 0;
    if (buff == NULL)
        return;
    for (size_t i = 0; i <= len; i++) // i == len flushes the final word
    {
        char ch = words[i];
        if (ch == ' ' || ch == '\0')
        {
            if (index > 0)
            {
                buff[index] = '\0';
                PushWord(list, buff);
                index = 0;
            }
        }
        else
        {
            buff[index++] = ch;
        }
    }
    free(buff);
}

void FreeList(LLNode* list)
{
    while (list != NULL)
    {
        LLNode* next = list->Next;
        free(list->Word);
        free(list);
        list = next;
    }
}

int main(void)
{
    LLNode* list = NULL;
    PushWords(&list, "Hello world this is a hello world test bla");
    GetCounts(list);
    FreeList(list); // release our memory here
    return 0;
}

I wrote this just now, so it probably won't work, but that is the general idea.

Another rough draft, this time in C++ (note: std::map has pretty good lookup times):

#include <cstring>
#include <iostream>
#include <map>
#include <string>

using namespace std;

typedef map<string, int> CountMap;

void PushWords(CountMap& list, const char* words)
{
    size_t len = strlen(words);
    string str;
    for (size_t i = 0; i < len; i++)
    {
        char ch = words[i];
        if (ch != ' ')
        {
            str += ch;
        }
        else if (!str.empty())
        {
            list[str]++;
            str.clear(); // start collecting the next word
        }
    }

    if (!str.empty()) // the last word has no trailing space
    {
        list[str]++;
    }
}

void PrintCount(const CountMap& list)
{
    for (CountMap::const_iterator iter = list.begin(); iter != list.end(); ++iter)
    {
        cout << iter->first << " : " << iter->second << '\n';
    }
}


int main()
{
    CountMap map;
    PushWords(map, "Hello world this is a hello world test bla");
    PrintCount(map);
}

score 0 · Accepted Answer

#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>

using namespace std;

struct stdt
{
       char name[20] ;
       int id ;

}; //std

int main()
{
      stdt boy = {} ;
      int a = 0 ;
      ofstream TextFile ;
      cout << "Begin File Creation \n" ;
      // binary mode, because whole structs are written as raw bytes
      TextFile.open("F:\\C++ Book Chapter Program\\Ch  7\\File.txt", ios::binary );
      if ( !TextFile)
      {
           cerr <<"Error 100 Opening File.txt" ;
           exit(100);
      }//end if
      while ( a < 3 )
      {
            cout << "\nEnter Name : " ;
            cin  >> setw(sizeof boy.name) >> boy.name; // never read past the buffer
            cout << "\nEnter ID : " ;
            cin  >> boy.id ;
            // write the record only after it has been filled in
            TextFile.write( (char*) &boy , sizeof (boy) ) ;
            a++;
      }//end while

      TextFile.close();
      cout << "\nEnd File Creation" ;

      ifstream TextFile1 ;
      TextFile1.open("F:\\C++ Book Chapter Program\\Ch  7\\File.txt", ios::binary );
      while ( TextFile1.read( (char*) &boy , sizeof (boy) ) )
      {
            cout << "\nName : " << boy.name;
            cout << "\nID : " << boy.id ;
      }// end while

      return 0 ;
}//end main
