tar - Load a TAR file and extract its *.bz2 contents to stdout with bzcat

Question

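Following on from this question, I am trying to load a 40 GB TAR file containing bz2-compressed JSON files into PostgreSQL in an efficient way.

As suggested in the answer there, I am trying to split the process into stages and use external tools to build the following flow: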

Open the file and extract it to stdout using tar (in this case bsdtar, since the Windows build of tar does not include extraction), *.bz2 files only.
Call bzcat to decompress the *.bz2 files (output to stdout).
Feed this into a Python script, 'file_handling', which maps each incoming line to a tweet and writes it to stdout as CSV.
Pipe this into psql to load everything with a single COPY command.
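The third step could look roughly like the sketch below. This is not the asker's actual 'file_handling' script, which is not shown. The function name `tweets_to_csv` and the field names `id`, `created_at` and `text` are assumptions made for illustration; adjust them to whatever columns copy.sql expects.

```python
import csv
import json


def tweets_to_csv(lines, out, fields=("id", "created_at", "text")):
    """Write one CSV row per JSON line that has all of `fields`; return the row count.

    `fields` is a hypothetical column selection, not taken from the question.
    """
    writer = csv.writer(out, lineterminator="\n")
    rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict) or any(f not in record for f in fields):
            continue  # skip records that lack one of the requested fields
        writer.writerow([record[f] for f in fields])
        rows += 1
    return rows
```

As a pipeline stage it would be called as `tweets_to_csv(sys.stdin, sys.stdout)` after `import sys`, so that psql can read the rows from the pipe.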

Currently I get an error when the pipeline reaches bzcat. This is the code that builds the command line for the steps above:

# Raw strings keep the backslashes literal; in a normal string the "\b" in "9.4\bin" is a backspace.
pipeline = [filename[1:3] + " && ",  # Change drive to H: so that tar can find the file without a drive name (it doesn't like absolute paths, apparently).
            r'"C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf ' + filename_nodrive + ' "*.bz2"',  # Call tar; -O writes the extracted data to stdout
            r" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe",  # Forward its output to bzcat
            r' | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py"',  # Extract tweets
            r' | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql ' + secret_login_d
           ]
module_call = "".join(pipeline)
# Resulting value of module_call:
# H: && "C:\Tools\GnuWin32\gnuwin32\bin\bsdtar" vxOf "Twitter datastream/Sourcefiles/archiveteam-twitter-stream-2013-01.tar" "*.bz2" | C:\Tools\GnuWin32\gnuwin32\bin\bzcat.exe | python "D:\Cloud\Dropbox\Coding\GitHub\pyTwitter\pyTwitter_filehandling.py" | "C:\Program Files\PostgreSQL\9.4\bin\psql.exe" -1f copy.sql "user=xxx password=xxx host=localhost port=5432 dbname=xxxxxx"

When I run just the tar command, the contents of the TAR file are written to the CMD prompt, which tells me that step works. At the bzcat stage, however, I get an error:

x 01/29/06/39.json.bz2
bzcat.exe: Data integrity error when decompressing.
    Input file = (stdin), output file = (stdout)

It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.

Running it with -tvv gives:

huff+mtf data integrity (CRC) error in data

I tried extracting the same archive with 7-zip (GUI), and that still works. Any help with troubleshooting this would be appreciated. I am running Windows 8.1 with GnuWin32.

score 2 · Accepted Answer

bsdtar.exe is converting the newline bytes in the file data to DOS CRLF sequences, which corrupts the bzip2 stream it writes to stdout.
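The effect can be reproduced without tar at all. The following is a minimal sketch that assumes the translation is a plain LF to CR LF replacement on the byte stream; the sample payload and the helper names `crlf_translate` and `survives_translation` are made up for illustration.

```python
import bz2
import hashlib


def crlf_translate(data: bytes) -> bytes:
    """Mimic a text-mode pipe that rewrites every LF byte as CR LF."""
    return data.replace(b"\n", b"\r\n")


def survives_translation(compressed: bytes, original: bytes) -> bool:
    """Return True only if the translated stream still decompresses to `original`."""
    try:
        return bz2.decompress(crlf_translate(compressed)) == original
    except (OSError, ValueError):
        # bz2 reports corrupt or truncated streams with these exceptions
        return False


# Made-up sample payload: JSON-like lines with pseudo-random content.
original = b"".join(
    b'{"id": "%s"}\n' % hashlib.sha256(str(i).encode()).hexdigest().encode()
    for i in range(2000)
)
compressed = bz2.compress(original)

print("LF bytes in compressed stream:", compressed.count(b"\n"))
print("intact stream round-trips:", bz2.decompress(compressed) == original)
print("translated stream round-trips:", survives_translation(compressed, original))
```

Compressed data is effectively random bytes, so the value 0x0A turns up by chance, and every inserted CR shifts the rest of the stream. That is consistent with the CRC error bzcat reports.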

GNU tar worked when given a relative path, but on Windows it does not handle absolute paths.

I suggest using 7-zip instead:

7z.exe x -so -ir!*.json.bz2 archive.tar | bzcat | ...
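If the downstream stage is a Python script anyway, another option is to skip the external extractors and read the archive with the standard library's `tarfile` and `bz2` modules, which work on bytes and do no newline translation. This is a swapped-in alternative, not what the answer above proposes, and it has not been benchmarked against the 7-zip pipeline; the function name `iter_bz2_lines` is hypothetical.

```python
import bz2
import tarfile


def iter_bz2_lines(tar_path, suffix=".bz2"):
    """Yield decompressed lines (bytes) from every member whose name ends with `suffix`."""
    with tarfile.open(tar_path, mode="r:") as archive:  # "r:" = uncompressed tar
        for member in archive:
            if not (member.isfile() and member.name.endswith(suffix)):
                continue  # skip directories and non-bz2 members
            raw = archive.extractfile(member)  # binary file object for this member
            with bz2.open(raw, mode="rb") as lines:
                yield from lines
```

Each yielded line is one raw JSON record, so it can be decoded and passed to the same code that currently reads from stdin.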
