
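I have only just started dabbling in Perl, as part of trying out a range of programming languages, so please excuse me if any of the following code is horrible.

I needed a quick and dirty CSV parser that could take a CSV file and split it into batches of files containing "X" CSV lines each (taking into account that entries may contain embedded newlines).

I came up with a working solution and it was going along fine. However, one of the CSV files I am trying to split contains serialized PHP code.

This seems to break the CSV parsing. As soon as I remove the serialization, the CSV file is parsed correctly.

Are there any tricks I need to know about when parsing serialized data in CSV files?

Here is a shortened sample of the code: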

use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ eol => $/, always_quote => 1, binary => 1 });
my $inputfile = "infile.csv";
my $lines = 0;

open my $in, "<:encoding(utf8)", $inputfile
    or die "cannot open input file $inputfile: $!";
open my $out, ">:encoding(utf8)", "outfile.000"
    or die "cannot open output file outfile.000: $!";
while (my $line = $csv->getline($in)) {
    $lines++;    # count parsed rows (used to decide when to start a new batch)
    $csv->print($out, $line);
}

Execution never enters the while loop above. As soon as I remove the serialized data, it suddenly does.

Edit:

An example of the lines causing the problem (copied straight from Vim, hence the ^M):

"26","other","1","20,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","37","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","6","1"
"27","other","1","35,000 Subscriber Plan","Some test here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","38","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","7","1"
"28","other","1","50,000 Subscriber Plan","Some text here.^M\
Some more text","on","","18","","0","","0","0","recurring","0","","payment","totalsend","0","tsadmin","R34bL9oq","39","0","0","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:18:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"73\";i:17;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","8","1"
"73","other","8","10,000,000","","","","0","","0","","0","0","recurring","0","","payment","","0","","","75","0","10000000","","","","","","","","","","","","","","","","","","","","","","","0","0","0","a:17:{i:0;s:1:\"3\";i:1;s:1:\"2\";i:2;s:2:\"59\";i:3;s:2:\"60\";i:4;s:2:\"61\";i:5;s:2:\"62\";i:6;s:2:\"63\";i:7;s:2:\"64\";i:8;s:2:\"65\";i:9;s:2:\"66\";i:10;s:2:\"67\";i:11;s:2:\"68\";i:12;s:2:\"69\";i:13;s:2:\"70\";i:14;s:2:\"71\";i:15;s:2:\"72\";i:16;s:2:\"74\";}","","","0","0","","0","0","0.0000","0.0000","0","","","0.00","","14","0"

1 Answer


The CSV you are trying to read escapes embedded quotes with a backslash, but by default Text::CSV_XS expects them to be escaped by doubling them. Try adding escape_char => '\\' to the Text::CSV_XS constructor.

You may also need allow_loose_escapes => 1 if backslashes are also used to escape other things that do not strictly need it, such as newlines.
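As a rough sketch of how the two options fit together, the snippet below parses a tiny in-memory sample. The sample row is made up for illustration (it only mimics the \" style of the data in the question), and reading from a string instead of infile.csv is an assumption made to keep the example self-contained.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample row: embedded quotes are escaped as \" rather than "".
# A single-quoted heredoc keeps the backslashes literal.
my $data = <<'END_CSV';
"26","a:1:{i:0;s:1:\"3\";}"
END_CSV

my $csv = Text::CSV_XS->new({
    binary              => 1,
    escape_char         => '\\',   # backslash instead of the default doubled quote
    allow_loose_escapes => 1,      # tolerate backslashes before other characters
}) or die "Cannot use CSV: " . Text::CSV_XS->error_diag;

# Read from the in-memory string; a real script would open the input file here.
open my $in, '<', \$data or die "cannot open in-memory sample: $!";
while (my $row = $csv->getline($in)) {
    print scalar(@$row), " fields, last = $row->[-1]\n";
}

# getline returns undef at end of file and on a parse error alike,
# so report the error if the parser did not actually reach EOF.
$csv->eof or $csv->error_diag;
close $in;
```

Checking eof after the loop is worth keeping in the real script as well: a loop that is silently never entered, as described in the question, is usually a parse error on the first record.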

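Another option is to change the writer to escape with doubled quotes instead of backslashes. That may or may not be possible for you. Doubled quotes are the more common flavor of CSV; programmatic parsers can generally read both (if told to), but you will not get Excel and the like to read the backslash variant.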
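To make the difference between the two flavors concrete, here is a minimal sketch that reads a backslash-escaped row with one Text::CSV_XS object and re-emits it with doubled quotes using a second one. The sample row and the in-memory input and output are assumptions for illustration; this is one possible workaround when the original writer cannot be changed, not a replacement for fixing it at the source.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical sample row in the backslash-escaped flavor.
my $data = <<'END_CSV';
"26","a:1:{i:0;s:1:\"3\";}"
END_CSV

# Reader understands \" ; writer uses the module defaults, which double the quote.
my $reader = Text::CSV_XS->new({ binary => 1, escape_char => '\\', allow_loose_escapes => 1 })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag;
my $writer = Text::CSV_XS->new({ binary => 1, always_quote => 1, eol => "\n" })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag;

my $converted = '';
open my $in,  '<', \$data      or die "cannot open in-memory input: $!";
open my $out, '>', \$converted or die "cannot open in-memory output: $!";

while (my $row = $reader->getline($in)) {
    $writer->print($out, $row) or die "cannot write row";
}
$reader->eof or $reader->error_diag;   # distinguish EOF from a parse error

close $in;
close $out;
print $converted;   # the embedded quotes now appear as "" instead of \"
```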

answered 2013-10-17T07:10:54.270