linux - Counting word occurrences from a file in bash

Question

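Sorry for the very basic question, but I'm new to bash programming (I started a few days ago). Basically, what I want to do is keep one file that holds the occurrence counts of all the words found in another file.

I know I can do this: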

sort | uniq -c | sort

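Then I take the second file, calculate the occurrences again, and update the first file. After that I take the third file, and so on.

What I'm currently doing works without any problem (I'm using grep, sed and awk), but it looks pretty slow.

I'm pretty sure there is a very efficient way to do this with just a command or so, using uniq, but I can't figure it out.

Could you please point me in the right direction?

I'm also pasting the code I wrote.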

#!/bin/bash
#   counts the word occurrences in a file and writes them to another file   #
#   the words are listed from the most frequent to the least frequent       #

touch .check                # used to check the occurrences. Temporary file
touch distribution.txt      # final file with all the occurrences calculated

page=$1             # contains the file I'm calculating
occurrences=$2          # temporary file for the occurrences

# splits the file $page into lowercase words, one per line
cat $page | tr -cs A-Za-z\' '\n'| tr A-Z a-z > .check

# loop to update the old file with the new information
# basically what I do is check word by word and add them to the old file as an update
cat .check | while read words
do
    word=${words}       # word I'm calculating
    strlen=${#word}     # word's length
    # I use a blacklist to skip banned words (for example very short ones or insignificant words, like articles and prepositions)
    # "$word" is quoted so that an empty line does not make grep read the loop's own input
    if ! grep -Fxq -- "$word" .blacklist && [ $strlen -gt 2 ]
    then
        # if the word was never found before it writes it with 1 occurrence
        if [ `egrep -c -i "^$word: " $occurrences` -eq 0 ]
        then
            echo "$word: 1" | cat >> $occurrences
        # else it calculates the occurrences
        else
            old=`awk -v words=$word -F": " '$1==words { print $2 }' $occurrences`
            let "new=old+1"
            sed -i "s/^$word: $old$/$word: $new/g" $occurrences
        fi
    fi
done

rm .check

# finally it orders the words
awk -F": " '{print $2" "$1}' $occurrences | sort -rn | awk -F" " '{print $2": "$1}' > distribution.txt

score 8 · Accepted Answer

Well, I'm not sure I've understood exactly what you're trying to do, but I would do it this way:

while read file
do
  cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
done < file-list

Now you have statistics for every file, and you simply aggregate them:

while read file
do
  cat stat.$file
done < file-list \
| sort -k2 \
| awk '{if (NR>1 && $2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}'
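
As an alternative to the sort-then-compare aggregation above, the per-file statistics could also be summed with an awk associative array, which avoids having to pre-sort by word. This is only a minimal sketch, not the method from the answer: `merge_counts` is a hypothetical helper name, and it assumes every input line has the `count word` shape that `uniq -c` produces.

```shell
#!/bin/bash
# Sum per-file "count word" lines (the format written by uniq -c) into totals.
# merge_counts is a hypothetical helper name used only for this illustration.
merge_counts() {
    # counts[word] accumulates column 1 (the count) for each word in column 2;
    # the result is then ordered by total (descending) and by word.
    awk '{ counts[$2] += $1 } END { for (w in counts) print counts[w], w }' "$@" \
        | sort -k1,1nr -k2,2
}

# Possible usage: merge_counts stat.ls.txt stat.bash.txt stat.cp.txt | head
```

The trade-off is memory: the array holds one entry per distinct word, which is usually fine for text files of this size.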

Usage example:

$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF

$ while read file; do
> cat $file | tr -cs A-Za-z\' '\n'| tr A-Z a-z | sort | uniq -c > stat.$file
> done < file-list

$ while read file
> do
>   cat stat.$file
> done < file-list \
> | sort -k2 \
> | awk '{if (NR>1 && $2!=prev) {print s" "prev; s=0;}s+=$1;prev=$2;}END{print s" "prev;}' | sort -rn | head

3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell
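
To get closer to the question's original goal (a list in `word: count` form, skipping blacklisted words and words of two letters or fewer), the whole job could probably be done in a single pipeline. The following is a rough sketch under assumptions: `word_distribution` is a hypothetical function name, the blacklist is assumed to contain one lowercase word per line, and the minimum length of 3 mirrors the `[ $strlen -gt 2 ]` test in the question.

```shell
#!/bin/bash
# Sketch: count words across several files in one pass.
# word_distribution is a hypothetical helper; argument 1 is the blacklist file
# (one word per line), the remaining arguments are the text files to count.
word_distribution() {
    local blacklist=$1
    shift
    cat -- "$@" \
        | tr -cs "A-Za-z'" '\n' \
        | tr 'A-Z' 'a-z' \
        | awk -v blacklist="$blacklist" -v min_len=3 '
            # Load the banned words once, before any input is read.
            BEGIN { while ((getline line < blacklist) > 0) banned[line] }
            # Count only words that are long enough and not banned.
            length($0) >= min_len && !($0 in banned) { counts[$0]++ }
            END { for (w in counts) print counts[w], w }
        ' \
        | sort -k1,1nr -k2,2 \
        | awk '{ print $2 ": " $1 }'
}

# Possible usage: word_distribution .blacklist ls.txt bash.txt cp.txt > distribution.txt
```

Because every file is read once and each word is counted in memory, this avoids running grep, awk and sed once per word, which is most likely what made the original script slow.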
