I have a lot of large compressed files called xaa.gz, xab.gz, xac.gz, etc. Unfortunately, their contents are not sorted. I would like to do the equivalent of the following:
zcat x*.gz | sort > largefile
split -l 1000000 largefile
Then I gzip the split files and throw away all the intermediate files made along the way.
The problem is that this creates one massive uncompressed file, and then lots of smaller uncompressed split files, before anything gets compressed. Is it possible to do the whole thing without making the huge file in the middle of the process, and ideally without saving the uncompressed split files to disk before compressing them either?
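Something like the following is what I'm imagining, though I'm not certain my version of split supports it (I believe GNU split gained a --filter option in coreutils 8.13):

# Hypothetical sketch: let split pipe each 1000000-line chunk
# straight into gzip, so no uncompressed chunk ever hits the disk.
# split sets $FILE to each output name; sorted_ is just an example prefix.
zcat x*.gz | sort | split -l 1000000 --filter='gzip > $FILE.gz' - sorted_

I realise sort itself will still spill its own temporary files while it works; if I understand correctly, GNU sort has a --compress-program option that at least keeps those temporaries compressed.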
I have 8 cores, so I would like to take advantage of them as well (I don't have coreutils 8.20, so sort --parallel isn't an option).
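For the parallel part, I was wondering whether something along these lines would be sensible: sort each file on its own core first, then merge the already-sorted streams with sort -m. This is only a sketch (the file names are illustrative, and it assumes the same --filter support as above):

# Sketch: sort each compressed file in a background job, one per file,
# keeping the per-file intermediate results compressed.
# With more files than cores, something like xargs -P 8 could cap the job count.
for f in x*.gz; do
  (zcat "$f" | sort | gzip > "sorted_$f") &
done
wait

# Merge the sorted streams (sort -m) and split/compress on the fly;
# extend the process substitutions to cover however many files there are.
sort -m <(zcat sorted_xaa.gz) <(zcat sorted_xab.gz) <(zcat sorted_xac.gz) \
  | split -l 1000000 --filter='gzip > $FILE.gz' - final_

Presumably gzip could also be swapped for pigz to parallelise the compression itself.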