perl - File::Queue が perl で utf8 文字列を処理できるようにする方法は?

Question

XML ファイルからのデータを perl で処理していて、FIFO File::Queue を使用して処理を分割してスピードアップしたいと考えています。1 つの perl スクリプトが XML ファイルを解析し、別のスクリプト用に JSON 出力を準備します。

#!/usr/bin/perl -w
binmode STDOUT, ":utf8";
use utf8;
use strict;
use XML::Rules;
use JSON;
use File::Queue;

#do the XML magic: %data contains result

my $q = new File::Queue (File => './importqueue', Mode => 0666);
my $json = new JSON;
my $qItem = $json->allow_nonref->encode(\%data);
$q->enq($qItem);

long%dataに数値と z データのみが含まれている限り、これは正常に機能します。しかし、ワイド文字の 1 つ (例: ł、ą、ś、ż など) が発生すると、次のようになります。Wide character in syswrite at /usr/lib/perl/5.10/IO/Handle.pm line 207.

文字列が有効な utf8 かどうかを確認しようとしました:

print utf8::is_utf8($qItem). ':' . utf8::valid($qItem)

そして私は得ました1:1-はい、正しいutf8文字列を持っています。

その理由は、syswrite が :utf8 でエンコードされたファイルであることを認識していないキューファイルにファイルハンドラを取得することが原因である可能性があることがわかりました。

私は正しいですか？その場合、File:Queue に :utf8 ファイルハンドラを強制的に使用させる方法はありますか? File:Queue は最良の選択ではないかもしれません。sth else を使用して、2 つの perl スクリプト間に FIFO キューを作成する必要がありますか?

score 3 · Accepted Answer

utf8::is_utf8文字列が UTF-8 を使用してエンコードされているかどうかはわかりません。（その情報すら入手できません。）

>perl -MEncode -E"say utf8::is_utf8(encode_utf8(chr(0xE9))) || 0"
0

utf8::valid文字列が有効な UTF-8 かどうかはわかりません。

>perl -MEncode -E"say utf8::valid(qq{\xE9}) || 0"
1

どちらも内部ストレージの詳細を確認します。どちらも必要ありません。

File::Queue は、バイト文字列のみを送信できます。送信するデータを文字列にシリアル化するのはあなた次第です。

テキストをシリアル化する主な手段は、文字エンコーディング、または略して単にエンコーディングです。UTF-8 は文字エンコーディングです。

たとえば、文字列

dostępu

次の文字 (それぞれ Unicode コードポイント) で構成されます。

64 6F 73 74 119 70 75

これらの文字のすべてがバイトに収まるわけではないため、File::Queue を使用して文字列を送信することはできません。UTF-8 を使用してその文字列をエンコードすると、次の文字で構成される文字列が得られます。

64 6F 73 74 C4 99 70 75

これらの文字はバイトに収まるので、文字列は File::Queue を使用して送信できます。

JSON は、使用したように、Unicode コードポイントの文字列を返します。そのため、文字エンコーディングを適用する必要があります。

File::Queue は、文字列を自動的にエンコードするオプションを提供していないため、自分で行う必要があります。

encode_utf8およびdecode_utf8Encode モジュールから使用できます

 my $json = JSON->new->allow_nonref;
 $q->enq(encode_utf8($json->encode(\%data)));
 my $data = $json->decode(decode_utf8($q->deq()));

または、JSON にエンコード/デコードを任せることもできます。

 my $json = JSON->new->utf8->allow_nonref;
 $q->enq($json->encode(\%data));
 my $data = $json->decode($q->deq());

score 0 · Accepted Answer

ドキュメントを見てみると……

perldoc -f syswrite
              WARNING: If the filehandle is marked ":utf8", Unicode
               characters encoded in UTF-8 are written instead of bytes, and
               the LENGTH, OFFSET, and return value of syswrite() are in
               (UTF8-encoded Unicode) characters.  The ":encoding(...)" layer
               implicitly introduces the ":utf8" layer.  Alternately, if the
               handle is not marked with an encoding but you attempt to write
               characters with code points over 255, raises an exception.  See
               "binmode", "open", and the "open" pragma, open.

man 3perl open
use open OUT => ':utf8';
...
with the "OUT" subpragma you can declare the default
       layers of output streams.  With the "IO"  subpragma you can control
       both input and output streams simultaneously.

したがって、プログラムの先頭に追加use open OUT=> ':utf8'すると役立つと思います

perl - File::Queue が perl で utf8 文字列を処理できるようにする方法は?

2 に答える 2

Related

Reference