python - Removing lines from files based on the ratio of characters/words - unix/bash

Question

I have two files, and I need to remove the lines that fall under a certain token ratio.

File 1:

This is a foo bar question
that is not a parallel sentence because it's too long
hello world

File 2:

c'est le foo bar question
creme bulee
bonjour tout le monde

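The ratio is calculated as total no. of words in file 1 / total no. of words in file 2, and a sentence is removed if it falls below this ratio.

The output is then a joined file in which the sentences from file 1 and file 2 are separated by a tab.

[out]: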

This is a foo bar question\tc'est le foo bar question
hello world\tbonjour tout le monde

The files always have the same number of lines. I have been doing it as follows, but how do I do the same thing in unix bash instead of using Python?

import io

# Calculate the ratio.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
    ratio = len(f1.read().split()) / float(len(f2.read().split()))
# Check and output to file.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
    for l1, l2 in zip(f1, f2):
        if len(l1.split()) / float(len(l2.split())) > ratio:
            fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

Also, if the ratio is calculated on characters instead of words, I can do it in Python as shown here, but how do I achieve the same in unix bash? Note that the only difference is counting with len(str.split()) versus len(str).

import io

# Calculate the ratio.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
    ratio = len(f1.read()) / float(len(f2.read()))
# Check and output to file.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
    for l1, l2 in zip(f1, f2):
        if len(l1) / float(len(l2)) > ratio:
            fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

score 1 · Accepted Answer

Here is a simple ratio calculator in Awk.

awk 'NR == FNR { a[NR] = NF; next }
    { print NF/a[FNR] }' file1 file2

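This merely prints the ratio for each line (words in the file2 line divided by words in the corresponding file1 line). Extending it to print the second file only when the ratio is within a particular range is easy.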

awk 'NR == FNR { a[NR] = NF; next }
    NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2' file1 file2

(This uses an Awk shorthand: in the general form condition { action }, if you omit the { action } it defaults to { print }. Similarly, if you omit the condition, the action is taken unconditionally.)

You could run a second pass to do the same for file1, or simply reverse the file names and run it again.

Oh, wait, here is a complete solution.

awk 'NR == FNR { a[NR] = NF; w[NR] = $0; next }
    NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2 { print w[FNR] "\t" $0 }' file1 file2
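For anyone who wants to cross-check the Awk one-liner against Python, here is a minimal sketch of the same filter, assuming Python 3 and the same 0.5 to 2 band used above as an example. The function name filter_pairs and its parameters are made up for illustration. The by_chars switch is an addition that covers the character-based variant from the question, and skipping pairs whose file1 line is empty is an assumption (the Awk version divides by that count without checking it).

```python
def filter_pairs(lines1, lines2, low=0.5, high=2.0, by_chars=False):
    """Return tab-joined pairs whose file2/file1 length ratio lies in [low, high]."""
    kept = []
    for l1, l2 in zip(lines1, lines2):
        l1, l2 = l1.strip(), l2.strip()
        # Count characters, or whitespace-separated words (what Awk's NF counts).
        n1 = len(l1) if by_chars else len(l1.split())
        n2 = len(l2) if by_chars else len(l2.split())
        if n1 == 0:
            continue  # assumption: skip pairs with an empty file1 line
        if low <= n2 / n1 <= high:
            kept.append(l1 + "\t" + l2)
    return kept


# The sample lines from the question.
file1 = [
    "This is a foo bar question",
    "that is not a parallel sentence because it's too long",
    "hello world",
]
file2 = [
    "c'est le foo bar question",
    "creme bulee",
    "bonjour tout le monde",
]

for row in filter_pairs(file1, file2):
    print(row)
```

With the sample lines this keeps the first and third pairs (ratios 5/6 and 4/2) and drops the second (2/10), which matches the output requested in the question.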

score 1 · Accepted Answer

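Tripleee's comment that bash is not well suited to non-integers is correct, but if you really want to do it in bash, this should get you started. You can do it with the program wc and its -w argument, which counts words. bc does floating-point division, among other things.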

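while IFS= read -r line1 <&3 && IFS= read -r line2 <&4; do
    # Count the whitespace-separated words on each line.
    line1_count=$(printf '%s\n' "$line1" | wc -w)
    line2_count=$(printf '%s\n' "$line2" | wc -w)
    # bc -l performs the floating-point division.
    ratio=$(echo "$line1_count / $line2_count" | bc -l)
    echo "$ratio"
done 3<file1 4<file2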

Also take a look at the section on relational expressions in man bc. That should let you compare against whatever your ratio threshold is.
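If the threshold is the overall ratio from the question (total words in file 1 divided by total words in file 2), the floating-point step can be avoided altogether: for positive counts, n1/n2 > T1/T2 is equivalent to n1*T2 > n2*T1, which needs only integer arithmetic. The sketch below illustrates that idea in Python rather than bash. The function name and the sample lines are invented for illustration, and skipping pairs whose second line is empty is an assumption.

```python
def keep_above_overall_ratio(lines1, lines2):
    """Keep pairs whose per-line word ratio exceeds the overall word ratio."""
    counts = [(len(a.split()), len(b.split())) for a, b in zip(lines1, lines2)]
    total1 = sum(n1 for n1, _ in counts)
    total2 = sum(n2 for _, n2 in counts)
    kept = []
    for (n1, n2), a, b in zip(counts, lines1, lines2):
        if n2 == 0:
            continue  # assumption: a pair with an empty second line is skipped
        # n1/n2 > total1/total2 rewritten without division
        # (n2 > 0 here, and therefore total2 > 0 as well).
        if n1 * total2 > n2 * total1:
            kept.append(a.strip() + "\t" + b.strip())
    return kept


# Made-up sample: both sides total 7 words, so the threshold is exactly 1.
left = ["a b c d", "a", "a b"]
right = ["x y", "x y z", "x y"]
print(keep_above_overall_ratio(left, right))
```

The comparison is strict, as in the question's Python code, so a pair whose ratio equals the overall ratio is dropped.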
