bash - How do I create a frequency list of every word in a file?

Question

I have a file like this:

This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

I would like to generate a two-column list. The first column shows which words appear, and the second column shows how often they appear. For example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1

To make this simpler, I will remove all punctuation and change all text to lowercase before processing the list.
Unless there is a simple solution around it, words and word can count as two separate words.

So far, I have this:

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
     count="$(grep -c $line file1.txt)"
     echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this only shows "0" after each word.

How can I generate a list of every word that appears in a file, along with frequency information?

score 74 · Accepted Answer

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

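In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The", and "the" are counted separately), and suppress the entry for a zero-length word. For ASCII text, this modified command does all of that: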

sed -e  's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$'| sort | uniq -c | sort -rn
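The question asked for word@count lines rather than the default uniq -c layout. A minimal sketch that combines the normalization above with the awk reformatting step from the first command; the function name word_freq is hypothetical, and ASCII input is assumed:

```shell
# Hypothetical helper: reads text on stdin, prints one word@count line per word.
word_freq() {
  tr -cs 'A-Za-z' '\n' |         # turn every run of non-letters into a single newline
    tr 'A-Z' 'a-z' |             # fold to lowercase so "The" and "the" match
    grep -v '^$' |               # drop the empty line left by a leading non-letter
    sort | uniq -c |             # count identical adjacent lines
    awk '{print $2 "@" $1}'      # swap columns: word first, then count
}
```

It can be called as word_freq < text.txt. For the three-line sample file it yields 17 lines, including words@3.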

score 47 · Accepted Answer

uniq -c already does what you want, so you only need to sort the input:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

Output:

  6 a
  7 d
  7 s
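The sort matters because uniq only collapses identical lines that are adjacent. A small sketch of the difference, using made-up input:

```shell
# Unsorted: the two "a" lines are separated by "b", so uniq reports
# three groups (a, b, a), each with a count of 1.
printf 'a\nb\na\n' | uniq -c

# Sorted first: identical lines sit next to each other, so uniq reports
# two groups (a with count 2, b with count 1).
printf 'a\nb\na\n' | sort | uniq -c
```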

score 12 · Accepted Answer

You can use tr for this. Just run:

tr ' ' '\12' <NAME_OF_FILE| sort | uniq -c | sort -nr > result.txt

Sample output for a text file of city names:

3026 Toronto
2006 Montréal
1117 Edmonton
1048 Calgary
905 Ottawa
724 Winnipeg
673 Vancouver
495 Brampton
489 Mississauga
482 London
467 Hamilton
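If only the most frequent entries are of interest, the same pipeline can be cut short with head. A minimal sketch; the helper name top_words and its default of 10 lines are assumptions, not part of the answer:

```shell
# Hypothetical helper: top_words FILE [N] prints the N most frequent
# whitespace-separated words in FILE (N defaults to 10).
top_words() {
  tr -s '[:space:]' '\n' < "$1" |   # one word per line, squeezing repeated whitespace
    sort | uniq -c |                # count each distinct word
    sort -nr |                      # highest count first
    head -n "${2:-10}"              # keep only the first N lines
}
```

For example, top_words NAME_OF_FILE 5 would print the five most frequent words.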

score 7 · Accepted Answer

Content of the input file

$ cat inputFile.txt
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic counts while ignoring case, but the result list then contains This instead of this.
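The two approaches can be compared side by side. In this sketch the function names are hypothetical; the first keeps whichever spelling happens to come first in each group, while the second folds case up front with tr (which also avoids the GNU-specific \L in sed):

```shell
# Hypothetical helper: case-insensitive count; the spelling that is printed
# is the first line of each group after sorting.
count_ignoring_case() {
  sort -f | uniq -ci
}

# Hypothetical helper: lowercase everything first, so the printed spelling
# is always lowercase.
count_lowercased() {
  tr '[:upper:]' '[:lower:]' | sort | uniq -c
}
```

Both read one word per line on stdin, for example printf 'This\nthis\n' | count_lowercased.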

score 5 · Accepted Answer

Let's use AWK!

This function lists the frequency of each word in the input it is given, in descending order:

function wordfrequency() {
  awk '
     BEGIN { FS="[^a-zA-Z]+" } {
         for (i=1; i<=NF; i++) {
             if ($i == "") continue   # skip empty fields left by leading or trailing non-letters
             word = tolower($i)
             words[word]++
         }
     }
     END {
         for (w in words)
              printf("%3d %s\n", words[w], w)
     } ' | sort -rn
}

You can call it on your file like this:

$ cat your_file.txt | wordfrequency

Source: AWK-ward Ruby

score 4 · Accepted Answer

This might work for you:

tr '[:upper:]' '[:lower:]' <file |
tr -d '[:punct:]' |
tr -s ' ' '\n' | 
sort |
uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2@\1/'

score 2 · Accepted Answer

Let's do it in Python 3!

"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""

# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/

import sys

# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])

D = {}
for line in lines:
    for word in line.split():
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        if word in D:
            D[word] += 1
        else:
            D[word] = 1

for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))

Name this script "frequency.py" and add a line to "~/.bash_aliases":

alias freq="python3 /path/to/frequency.py"

Now, to find the word frequencies in the file "content.txt", run:

freq content.txt

You can also pipe input to it:

cat content.txt | freq

You can even analyze text from multiple files:

cat content.txt story.txt article.txt | freq

If you are using Python 2, just replace:

''.join(list(filter(args...))) with filter(args...)
python3 with python
print(whatever) with print whatever

score 1 · Accepted Answer

The sorting requires GNU AWK (gawk). If you have another AWK without asort(), this can easily be adjusted and then piped to sort.

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile

Broken into multiple lines:

awk '{
    gsub(/\./, ""); 
    for (i = 1; i <= NF; i++) {
        w = tolower($i); 
        count[w]++; 
        words[w] = w
    }
} 
END {
    qty = asort(words); 
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile
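The adjustment for an AWK without asort() could look like the following sketch: the words array is dropped, the END block prints in arbitrary order, and the external sort does the ordering. The function name word_counts_portable is hypothetical:

```shell
# Hypothetical helper: same counting logic without asort(); reads the files
# given as arguments, or stdin when none are given.
word_counts_portable() {
  awk '{
      gsub(/\./, "")                        # strip periods, as in the gawk version
      for (i = 1; i <= NF; i++)
          count[tolower($i)]++              # count case-insensitively
  }
  END {
      for (w in count)                      # iteration order is unspecified in awk
          print w "@" count[w]
  }' "$@" | sort                            # sort alphabetically by word
}
```

Usage would be word_counts_portable inputfile.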

score 1 · Accepted Answer

Suppose file.txt contains the following text:

This is line number one
This is Line Number Tow
this is Line Number tow

You can find the frequency of each word with the following command:

 cat file.txt | tr ' ' '\n' | sort | uniq -c

Output:

  3 is
  1 line
  2 Line
  1 number
  2 Number
  1 one
  1 this
  2 This
  1 tow
  1 Tow

score 1 · Accepted Answer

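This is a slightly more complex task. We need to take at least the following into account:

Punctuation must be removed; otherwise sky is counted separately from sky. or sky?
Earth is different from earth, God from god, and Moon from moon, yet The and the should be considered the same. So whether to lowercase words is debatable.
The BOM character must be taken into account.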

$ file the-king-james-bible.txt 
the-king-james-bible.txt: UTF-8 Unicode (with BOM) text

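The BOM is a metacharacter at the very start of the file. If it is not removed, it stays attached to the first word and that one word is counted incorrectly.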
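The BOM can also be stripped in a separate step before any counting. A minimal sketch, assuming a UTF-8 BOM (bytes EF BB BF); the helper name strip_bom is hypothetical:

```shell
# Hypothetical helper: removes a leading UTF-8 BOM from stdin, if present.
strip_bom() {
  local bom
  bom=$(printf '\357\273\277')   # the three BOM bytes, written as octal escapes
  sed "1s/^${bom}//"             # only the start of the first line is touched
}
```

It can be placed in front of any of the pipelines, for example strip_bom < the-king-james-bible.txt | tr ' ' '\n' | sort | uniq -c.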

The following is a solution with AWK.

    {  

        if (NR == 1) { 
            sub(/^\xef\xbb\xbf/,"")
        }

        gsub(/[,;!()*:?.]*/, "")
    
        for (i = 1; i <= NF; i++) {
    
            if ($i ~ /^[0-9]/) { 
                continue
            }
    
            w = $i
            words[w]++
        }
    } 
    
    END {
    
        for (idx in words) {
    
            print idx, words[idx]
        }
    }

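It removes the BOM character and strips punctuation characters. It does not lowercase the words. In addition, because the program was used to count the words of the Bible, it skips tokens that start with a digit, such as verse numbers (the if condition with continue).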

$ awk -f word_freq.awk the-king-james-bible.txt > bible_words.txt

We run the program and write the output to a file.

$ sort -nr -k 2 bible_words.txt | head
the 62103
and 38848
of 34478
to 13400
And 12846
that 12576
in 12331
shall 9760
he 9665
unto 8942

With sort and head, we find the ten most frequent words in the Bible.

score 0 · Accepted Answer

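#!/usr/bin/env bash

# Associative array: word -> number of occurrences (requires bash 4+)
declare -A map
words="$1"

# Print a usage message and exit unless the argument is a regular file
[[ -f $words ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }

# Split each line on whitespace and count every word
while read -r line; do
  for word in $line; do
    ((map[$word]++))
  done
done < "$words"

# The count is the fifth field of each message, so sort on it numerically
for key in "${!map[@]}"; do
  echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5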

score -1 · Accepted Answer

awk 'BEGIN { word[""] = 0 }
{
    # count every whitespace-separated field
    for (el = 1; el <= NF; ++el) { word[$el]++ }
}
END {
    for (i in word) {
        if (i != "") {
            print word[i], i
        }
    }
}' file.txt | sort -nr
