bash - How can I split a large file into many small files using bash?

Question

For example, I have a file named all with 2000 lines, and I would like to split it into 4 smaller files containing lines 1-500, 501-1000, 1001-1500, and 1501-2000.

I can probably do this with:

cat all | head -500 >small1
cat all | tail -1500 | head -500 >small2
cat all | tail -1000 | head -500 >small3
cat all | tail -500 >small4

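However, this approach requires calculating line numbers by hand, which gets awkward when the line count is not a round number or when I want to split the file into many small files (e.g., a file all with 3241 lines, split into 7 files of 463 lines each).

Is there a better way to do this?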

score 38 · Accepted Answer

To split a file, use split:

split -l 500 all all

This splits the file into multiple files of 500 lines each (named allaa, allab, and so on). To split the file into 4 files of roughly equal size, use something like:

split -l $(( $( wc -l < all ) / 4 + 1 )) all all
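One thing to note about the command above: with / 4 + 1, a 2000-line file yields pieces of 501, 501, 501, and 497 lines, because the extra line is added even when the count divides evenly. A minimal sketch using ceiling division instead, applied to the 3241-line, 7-file case from the question. The variable names and the part_ prefix are my own choices for illustration, not anything split requires:

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 3241 > all                 # sample input: 3241 lines, as in the question

parts=7                        # desired number of output files (illustrative name)
total=$(wc -l < all)           # number of lines in the input

# Ceiling division: the smallest per-file line count that fits into $parts files.
per_file=$(( (total + parts - 1) / parts ))

split -l "$per_file" all part_ # writes part_aa, part_ab, ...

wc -l part_*                   # show how many lines each piece received
```

When the line count is not divisible by the number of parts, the last file simply ends up shorter than the others.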

score 10 · Answer

Take a look at the split command; it should do what you want (and more):

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names.
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes[=FROM]  use numeric suffixes instead of alphabetic.
                                   FROM changes the start value (default 0).
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines per output file
  -n, --number=CHUNKS     generate CHUNKS output files.  See below
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE is an integer and optional unit (example: 10M is 10*1024*1024).  Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).

CHUNKS may be:
N       split into N files based on size of input
K/N     output Kth of N to stdout
l/N     split into N files without splitting lines
l/K/N   output Kth of N to stdout without splitting lines
r/N     like 'l' but use round robin distribution
r/K/N   likewise but only output Kth of N to stdout
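As a rough illustration of the CHUNKS forms listed in the help text, here is a sketch that assumes GNU split (the -n option is not part of POSIX split). The file names second_chunk and rebuilt are made up for the demo:

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 2000 > all                 # sample input: 2000 lines

# l/2/4: send only the 2nd of 4 line-aligned chunks to stdout; no files are created by split.
split -n l/2/4 all > second_chunk

# Emitting chunks 1..4 in order and concatenating them should give back the input.
for k in 1 2 3 4; do
    split -n l/"$k"/4 all
done > rebuilt

cmp all rebuilt && echo "chunks reassemble to the original"
```

Chunks produced this way are sized by bytes, not by line count, so they generally contain different numbers of lines.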

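score 4 · Answer

As others have already mentioned, you can use split. The complicated command substitution used in the accepted answer is not necessary. For reference, I am adding the following commands, which accomplish almost everything that was requested. Note that when you use the -n command-line argument to specify the number of chunks, the small* files do not contain exactly 500 lines, because split -n divides the input by size in bytes.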

$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
 583 small1
 528 small2
 445 small3
 444 small4
2000 total
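If exactly 500 lines per file matters more than a fixed number of files, the same naming options can be combined with -l instead of -n. This is a sketch assuming GNU split, since --numeric-suffixes and --suffix-length are GNU extensions:

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 2000 > all                 # sample input: 2000 lines

# -l 500 fixes the line count per file; suffixes start at 1 and are one digit wide,
# so the outputs are named small1, small2, small3, small4.
split -l 500 --numeric-suffixes=1 --suffix-length=1 all small

wc -l small*                   # each file should report 500 lines
```

A one-digit suffix starting at 1 only leaves room for 9 output files, so a longer --suffix-length would be needed for larger inputs.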

Alternatively, you can use GNU parallel:

$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
 500 small1
 500 small2
 500 small3
 500 small4
2000 total

As you can see, this incantation is fairly complex. GNU Parallel is actually used most often for parallelizing pipelines. IMHO it is a tool worth looking into.
