python - Extract the top 20 rows (in descending order) of a CSV file with respect to a column

Question

I have a CSV file with 3 columns that looks like this:

a,b,c
1,1,2
1,3,5
1,5,7
.
.
2,3,4
2,1,5
2,4,7

I want the output to look like this:

a,b,c
1,5,7
1,3,5
1,1,2
.
.
2,4,7
2,3,4
2,1,5

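That is, for each value in column a, I need only the top 20 rows (the 20 rows with the highest value of 'b'). Please excuse my poor explanation. This is what I have tried so far, but it does not give the required output: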

import csv
import heapq
from itertools import islice
csvout = open ("output.csv", "w")
writer = csv.writer(csvout, delimiter=',',quotechar='"', lineterminator='\n', quoting=csv.QUOTE_MINIMAL)
freqs = {}
with open('input.csv') as fin:
    csvin = csv.reader(fin)
    rows_with_mut = ([float(row[1])] + row for row in islice(csvin, 1, None) if row[2])
    for row in rows_with_mut:
        cnt = freqs.setdefault(row[0], [[]] * 20)
        heapq.heappushpop(cnt, row)

for assay_id, vals in freqs.iteritems():
    output = [row[1:] for row in sorted(filter(None, vals), reverse=True)]
    writer.writerows(output)

score 2 · Accepted Answer

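Since the file is sorted only on column a, you would have to sort it on columns b & c as well. I suggest using natsort to sort the file in ascending or descending order, then looping over it and printing 20 rows for each value of column a.

Something along the lines of: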

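import natsort

with open('myfile.csv', 'r') as inFile:
    header = inFile.readline()  # keep the header out of the sort
    lines = [line for line in inFile if line.strip()]  # skip blank lines

print(header, end='')
# natsorted() sorts ascending by default; reverse=True gives descending order
sortedList = natsort.natsorted(lines, reverse=True)
#alternatively, you might want to try natsort.versorted() which is used for version numbers
counter = 0
prevAVal = None
for line in sortedList:
    currentAVal = line.split(",")[0]
    if currentAVal != prevAVal:
        counter = 0  # new value in column a: restart the count
    if counter < 20:
        print(line, end='')
    counter = counter + 1
    prevAVal = currentAVal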
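
If natsort is not available, a similar result can be sketched with the standard library alone. This is only an illustration, not the code from the answer above: the function name top_rows_per_group is hypothetical, and it assumes that column b always holds a number and that rows arrive as lists of strings (for example from csv.reader, after the header has been skipped).

```python
import heapq
from collections import defaultdict


def top_rows_per_group(rows, n=20):
    """Return, for each value of column a, the n rows with the largest b."""
    groups = defaultdict(list)  # dicts keep insertion order (Python 3.7+)
    for row in rows:
        groups[row[0]].append(row)  # group rows by column a
    result = []
    for group_rows in groups.values():
        # nlargest returns the rows ordered from largest to smallest b
        result.extend(heapq.nlargest(n, group_rows, key=lambda r: float(r[1])))
    return result
```

The groups come out in the order in which each value of a first appears, so the input does not need to be sorted beforehand. The returned list can be passed directly to csv.writer(...).writerows() after writing the header row.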

score 1 · Accepted Answer

At the risk of downvotes, you could use a simple bash script:

#!/bin/bash
all=$(cat) #read from stdin
echo "$all" | head -n 1 #echo the header of the file
allt=$(echo "$all" | tail -n +2) #remove the header from memory
avl=$(echo "$allt" | cut -d ',' -f 1 | sort | uniq) #find all unique values in the a column
for av in $avl #iterate over these values
do
    echo "$allt" | grep "^$av," | sort -t$',' -k2nr | head -n 20 #for each value, find all lines with that value and sort them, return the top 20...
done

You can run this on the command line with:

bash script.sh < data.csv

This prints the result to the terminal...

Example:

Using your sample values (without the "dot" lines), you get:

user@machine ~> bash script.sh < data.csv 
a,b,c
1,5,7
1,3,5
1,1,2
2,4,7
2,3,4
2,1,5

If you want to write the results to a file (for example data2.csv), use:

bash script.sh < data.csv > data2.csv

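Do not read from and write to the same file: do not run bash script.sh < data.csv > data.csv. The shell truncates the output file before the script gets to read it, so the data would be lost.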
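
If the result has to end up under the original file name, one common pattern is to write to a temporary file first and rename it afterwards. Below is a minimal Python sketch of that pattern. The names rewrite_in_place and transform are hypothetical and chosen only for illustration; transform stands for any function that takes the file's text and returns the new text.

```python
import os
import tempfile


def rewrite_in_place(path, transform):
    """Replace the file at `path` with transform(original text), via a temp file."""
    with open(path, newline="") as fin:
        text = fin.read()  # read everything before anything is written
    directory = os.path.dirname(os.path.abspath(path))
    # create the temp file next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as fout:
            fout.write(transform(text))
        os.replace(tmp_path, path)  # swap the new file in under the old name
    except BaseException:
        os.remove(tmp_path)  # clean up the temp file if anything failed
        raise
```

Because the original file is only replaced after the new content has been written completely, a failure in transform leaves data.csv untouched.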
