shell - Looking for an efficient way to split a file

Question

Suppose this is my file:

$ cat file.txt 
A:1:i
B:2:ii
X:9:iv

With a for loop like this, I can print each field separately and redirect it to its own file:

$ for i in $(seq 1 3); do echo $i; awk -F ":" -v FL=$i '{print $FL}' file.txt > $i.out; done

which results in:

$ cat 1.out 
A
B
X

$ cat 2.out 
1
2
9

$ cat 3.out 
i
ii
iv

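Question: I need to do this on files with roughly 70 columns and roughly 10 GB in size. It works, but it is slow, because the loop reads the whole input once per column. Can anyone suggest a better or more efficient way to split this big data set? Thanks.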

$ for i in $(seq 1 70); do echo $i; awk -F ":" -v FL=$i '{print $FL}' *.data > $i.out; done

score 6 · Accepted Answer

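Given what you are trying to do, this should be fairly fast. It reads the input only once and writes each field to its own file as it goes (the output file name is parenthesized so the redirection parses the same way in every awk):

awk -F: '{ for (i=1; i<=NF; i++) print $i > (i ".out") }' file.txt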

score 2 · Accepted Answer

Python version

#!/usr/bin/env python

with open('file.txt', 'r') as ih:
    while True:
        line = ih.readline()
        if line == '': break
        for i,element in enumerate(line.strip().split(':')):
            outfile = "%d.out" % (i+1)
            with open(outfile, 'a') as oh:
                oh.write("%s\n" % element)

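Because this processes the original file only once, it may be somewhat faster. Note that it can be optimized further by keeping the output files open (as written, it closes and reopens each output file on every write).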

Edit

For example, something like this:

#!/usr/bin/env python

handles = dict()

with open('file.txt', 'r') as ih:
    while True:
        line = ih.readline()
        if line == '': break
        for i,element in enumerate(line.strip().split(':')):
            outfile = "%d.out" % (i+1)

            if outfile not in handles:
                handles[outfile] = open(outfile, 'a')

            handles[outfile].write("%s\n" % element)

for k in handles:
    handles[k].close()

This keeps the handles open for the duration of the run and closes them all before exiting.
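As a variation on the same idea, here is a minimal sketch that lets `contextlib.ExitStack` close every output file, even if an error interrupts the loop. The function name `split_columns` and the `out_pattern` parameter are made up for illustration, and the sketch opens the outputs in write mode rather than append mode, so rerunning it overwrites earlier results.

```python
from contextlib import ExitStack


def split_columns(in_path, sep=":", out_pattern="{}.out"):
    """Write field N of every line of in_path to out_pattern.format(N).

    Hypothetical helper: reads the input once and keeps one open
    handle per column. Returns the number of output files created.
    """
    handles = []  # handles[i] is the output file for field i + 1
    with ExitStack() as stack, open(in_path, "r") as ih:
        for line in ih:
            for i, element in enumerate(line.rstrip("\n").split(sep)):
                if i == len(handles):
                    # First time this column appears: open its file once
                    # and register it so ExitStack closes it on exit.
                    handles.append(
                        stack.enter_context(open(out_pattern.format(i + 1), "w"))
                    )
                handles[i].write(element + "\n")
    return len(handles)
```

Calling `split_columns("file.txt")` on the sample file from the question should leave `1.out`, `2.out`, and `3.out` in the current directory. With about 70 columns, the number of simultaneously open files stays well below typical per-process limits.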

score 1 · Accepted Answer

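In Perl you can do:

#!/usr/bin/perl -w
my $n = 3;    # number of columns
my @FILES;
# Open one output file per column and keep the handles in an array
for my $i (1..$n) {
  open(my $f, '>', "$i.out") or die "Cannot open $i.out: $!";
  push @FILES, $f;
}
while (<>) {
  chomp;
  my @a = split(/:/);
  for my $i (0..$#a) {
    # A handle stored in an array element must be wrapped in a block
    print { $FILES[$i] } $a[$i], "\n";
  }
}
close($_) for @FILES;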

score 0 · Accepted Answer

If you know there are three columns, you can use coreutils:

< file.txt tee >(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out) > /dev/null

To make this more general, here is one way to automate generating the command line:

# Determine number of fields and generate tee argument
arg=""
i=1
while read; do 
  arg="$arg >(cut -d: -f$i > $((i++)).out)"
done < <(head -n1 file.txt | tr ':' '\n')

arg is now:

>(cut -d: -f1 > 1.out) >(cut -d: -f2 > 2.out) >(cut -d: -f3 > 3.out)

Save it to a script file:

echo "< file.txt tee $arg > /dev/null" > script

And run it:

. ./script
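The command-generation step can also be sketched in Python. This only builds the same command string shown above for a given number of fields; it does not run it. The function name `build_tee_command` and its parameters are hypothetical, and the resulting line still has to be executed by bash, since it relies on process substitution.

```python
import shlex


def build_tee_command(n_fields, in_path="file.txt", sep=":"):
    """Return a bash command line that feeds in_path to one cut per field.

    Hypothetical helper: field i is written to "<i>.out".
    """
    subs = " ".join(
        ">(cut -d{} -f{} > {}.out)".format(shlex.quote(sep), i, i)
        for i in range(1, n_fields + 1)
    )
    return "< {} tee {} > /dev/null".format(shlex.quote(in_path), subs)
```

For three fields, `build_tee_command(3)` should produce the same line as the hand-written three-column command at the start of this answer.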

score 0 · Accepted Answer

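Here is a bash script that uses a rarely seen feature (available in bash 4.1 and later): you ask bash to allocate a file descriptor for a file and store that descriptor in a variable.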

# Read the first line to get a count of the columns
IFS=: read -a columns < file.txt

# Open an output file for each column, saving the file descriptor in an array
for c in "${columns[@]}"; do
    exec {a}>$((++i)).txt
    fds+=( $a )
done

# Iterate through the input, writing each column to the file opened for it
while IFS=: read -r -a fields; do
    # Field index and descriptor index both start at 0 on every line
    for i in "${!fields[@]}"; do
        printf '%s\n' "${fields[i]}" >&"${fds[i]}"
    done
done < file.txt

# Close the file descriptors
for fd in "${fds[@]}"; do
    exec {fd}>&-
done
