xml - How can I speed up XML::Twig?

Question

I am using XML::Twig to parse a very large XML document. I want to split it into chunks based on the <change></change> tags.

Right now I have:

my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {

  my ($xml, $change) = @_;

  my $message = $change->first_child('message');
  my @lines   = $message->children_text('line');

  foreach (@lines) {
    if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
      print outputData "$_\n";
    }
  }

  outputData->flush();
  $change->purge;
}

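Right now this runs the parseChange method each time it pulls a <change> block out of the XML. It is going exceedingly slowly. I tested it against reading the XML from the file with $/ = "</change>" and writing a function that returns the contents of an XML tag, and that was much faster.

Is there something I'm missing, or am I using XML::Twig incorrectly? I'm new to Perl.

EDIT: Here is an example change from the changes file. The file consists of many of these, one right after another, and there should be nothing in between them.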

<change>
<project>device_common</project>
<commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
<tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>      
<parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>      
<author_name>Jean-Baptiste Queru</author_name>      
<author_e-mail>jbq@google.com</author_e-mail>      
<author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>      
<commiter_name>Jean-Baptiste Queru</commiter_name>      
<commiter_email>jbq@google.com</commiter_email>      
<committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>      
<subject>chmod the output scripts</subject>      
<message>         
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>      
</message>      
<target>         
    <line>generate-blob-scripts.sh</line>      
</target>   
</change>

score 3 · Accepted Answer

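As it stands, your program processes the entire XML document, including the data outside the change elements, which you are not interested in.

If you change the twig_handlers parameter in your constructor to twig_roots, the tree structure will be built only for the elements of interest and the rest will be ignored.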

my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
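
For context, here is a rough end-to-end sketch of how the twig_roots version might look. It assumes the change elements sit inside a root element named changes, and it parses a small inline document rather than the real log file; the sample data, the @found array and the name parse_change are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Inline stand-in for the real log file.
# Assumption: a <changes> root element wraps the <change> elements.
my $xml = <<'XML';
<changes>
  <change>
    <subject>first</subject>
    <message><line>Fixes bug 42.</line><line>No match here</line></message>
  </change>
  <change>
    <subject>second</subject>
    <message><line>debugging notes</line></message>
  </change>
</changes>
XML

my @found;

my $twig = XML::Twig->new(
    # Only <change> subtrees are built; the rest of the document is skipped.
    twig_roots => { change => \&parse_change },
);
$twig->parse($xml);    # for a file on disk, use $twig->parsefile($path)

print "$_\n" for @found;

sub parse_change {
    my ($t, $change) = @_;

    # Guard against <change> elements that have no <message> child.
    my $message = $change->first_child('message');
    if ($message) {
        push @found,
            grep { /[^a-z0-9]bug[^a-z0-9]/i } $message->children_text('line');
    }

    $t->purge;    # discard everything that has been fully parsed so far
}
```

The handler still calls purge, because twig_roots only limits what gets built; the change subtrees that are built stay in memory until they are purged or flushed.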

score 1 · Accepted Answer

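XML::Twig includes a mechanism that lets you handle tags as they appear, then discard what you no longer need in order to free memory.

Here is an example taken from the documentation (which also has a lot more helpful information):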

my $t= XML::Twig->new( twig_handlers => 
                          { section => \&section,
                            para   => sub { $_->set_tag( 'p'); }
                          },
                       );
  $t->parsefile( 'doc.xml');

  # the handler is called once a section is completely parsed, ie when 
  # the end tag for section is found, it receives the twig itself and
  # the element (including all its sub-elements) as arguments
  sub section 
    { my( $t, $section)= @_;      # arguments for all twig_handlers
      $section->set_tag( 'div');  # change the tag name
      # let's use the attribute nb as a prefix to the title
      my $title= $section->first_child( 'title'); # find the title
      my $nb= $title->att( 'nb'); # get the attribute
      $title->prefix( "$nb - ");  # easy isn't it?
      $section->flush;            # outputs the section and frees memory
    }

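This is probably essential when working with a multi-gigabyte file because, according to the documentation, storing the whole thing in memory can take as much as 10 times the size of the file.

Edit: A couple of comments based on your edited question. Without knowing more about your file structure it is not clear exactly what is slowing you down, but here are a few things to try: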

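Flushing the output filehandle will slow you down if you are writing a lot of lines. Perl buffers file writes specifically for performance reasons, and you are bypassing that.
Instead of using the (?i) mechanism, a rather advanced feature that probably carries a performance penalty, why not make the whole match case-insensitive? /[^a-z0-9]bug[^a-z0-9]/i is equivalent. You might also be able to simplify it to /\bbug\b/i, which is nearly equivalent. The differences are that \b treats the underscore as a word character, and that \b also matches at the very start or end of the string, where the character-class version requires an actual character on each side of "bug".
A couple of other simplifications can be made as well to remove intermediate steps.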
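
To make the difference between the three patterns concrete, here is a small standalone sketch. The sample lines are invented for illustration and do not come from the question's data.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen to show where the patterns disagree.
my @samples = (
    'Fix bug in parser',    # "bug" surrounded by spaces
    'bug at line start',    # "bug" at the very start of the string
    'see my_bug_report',    # "bug" surrounded by underscores
    'debugging session',    # "bug" inside a longer word
);

my %patterns = (
    original  => qr/[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/,
    flag_i    => qr/[^a-z0-9]bug[^a-z0-9]/i,
    word_edge => qr/\bbug\b/i,
);

# For each line, list the names of the patterns that match it.
for my $line (@samples) {
    my @hits = grep { $line =~ $patterns{$_} } sort keys %patterns;
    printf "%-20s %s\n", $line, @hits ? join(', ', @hits) : 'no match';
}
```

The first sample matches all three patterns and the last matches none. The second is matched only by the \b version, because the character-class patterns need a character in front of "bug"; the third is matched only by the character-class patterns, because \b does not see a word boundary next to an underscore.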

How does this handler code compare to yours for speed?

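sub parseChange
{
    my ($xml, $change) = @_;

    # Match line by line: first_child_text('message') would return the whole
    # message as a single string, so take the <line> children instead.
    my $message = $change->first_child('message');
    my @lines   = $message ? $message->children_text('line') : ();

    foreach (grep /[^a-z0-9]bug[^a-z0-9]/i, @lines)
    {
        print outputData "$_\n";
    }

    $change->purge;
}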
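
One way to check the regex part of that question in isolation is the core Benchmark module. The sketch below is only a rough measurement under stated assumptions: the input lines are synthetic, the labels inline_modifier and trailing_flag are made up, and it says nothing about the time spent in XML parsing itself, which may well dominate.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic input (assumption): 1,000 short lines, one in ten mentioning a bug.
my @lines = map { $_ % 10 ? "Change-Id: I$_" : "Fixes bug $_ in parser" } 1 .. 1000;

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    inline_modifier => sub {
        my $n = grep { /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/ } @lines;
    },
    trailing_flag => sub {
        my $n = grep { /[^a-z0-9]bug[^a-z0-9]/i } @lines;
    },
});
```

If the two rates come out close, the regex is unlikely to be the bottleneck, and the flush call or the amount of tree building is a better place to look.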

score 0 · Accepted Answer

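Not an XML::Twig answer, but...

If you are going to extract data from XML files, you might want to consider XSLT. Using xsltproc and the following XSL stylesheet, I got the bug-containing change lines out of 1 GB of <change> elements in about a minute. I'm sure many improvements are possible.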

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" >

  <xsl:output method="text"/>
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <xsl:apply-templates select="changes/change/message/line"/>
  </xsl:template>

  <xsl:template match="line">
    <xsl:variable name="lower" select="translate(.,$uppercase,$lowercase)" />
    <xsl:if test="contains($lower,'bug')">
      <xsl:value-of select="."/>
      <xsl:text>
</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

If your XML processing can be done as

1. Extract to plain text
2. Wrangle the flattened text
3. Profit

then XSLT may be the tool for the first step in that process.

score 0 · Accepted Answer

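If your XML is really big, use XML::SAX. It does not have to load the whole data set into memory; instead, it reads the file sequentially and generates callback events for every tag. I have successfully used XML::SAX to parse XML larger than 1 GB. Here is an example of an XML::SAX handler for your data: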

#!/usr/bin/env perl
package Change::Extractor;
use 5.010;
use strict;
use warnings qw(all);

use base qw(XML::SAX::Base);

sub new {
    bless { data => '', path => [] }, shift;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
    push @{$self->{path}} => $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
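    # Compare the element path as a string; the smartmatch operator (~~)
    # triggers warnings on newer Perl versions.
    if (join('/', @{$self->{path}}) eq 'change/message/line') {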
        say $self->{data};
    }
    pop @{$self->{path}};
}

sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);

use XML::SAX::PurePerl;

my $handler = Change::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);

$parser->parse_file(\*DATA);

__DATA__
<?xml version="1.0"?>
<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>jbq@google.com</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>jbq@google.com</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>

Output

Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f
