html - Sorting HTML with Perl regular expressions

Question

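I have an HTML file consisting of an HTML table that contains links to scientific papers and their authors, along with the year of publication. The HTML is sorted from oldest to newest. I need to re-sort the table by parsing the file and producing a new file whose source is sorted from newest to oldest.

Here is a small Perl script that should do the job, but it produces half-sorted results: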

local $/=undef;
open(FILE, "pubTable.html")  or die "Couldn't open file: $!";
binmode FILE;
my $html = <FILE>; 
open (OUTFILE, ">>sorted.html") || die "Can't open output file.\n";
map{print OUTFILE "<tr>$_->[0]</tr>"} 
sort{$b->[1] <=> $a->[1]} 
map{[$_, m|, +(\d{4}).*</a>|]}
$html =~ m|<tr>(.*?)</tr>|gs;
close (FILE);  
close (OUTFILE);

And here is my input file: link

What I get as output: link

From the output you can see that the ordering mostly works, but 1993 appears after 1992 instead of at the top of the list.

score 2 · Accepted Answer

The problem was in the regex inside your map, because the HTML contains the following lines:

<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">,{UCLA}-Report 982051,Los Angeles,,1989,</a></td>   </tr>

and

<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">Phys.Rev.Lett., <b> 60</b>, 1514, 1988</a></td>   </tr>
<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">Phys. Rev. B, <b> 45</b>, 7115, 1992</a></td>   </tr>
<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">J.Chem.Phys., <b> 96</b>, 2269, 1992</a></td>   </tr>

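In the 1989 line, the year is followed by a trailing comma and has no leading space, so the pattern does not match at all. As a result, the script threw many warnings and always placed that line at the bottom.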

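In the other three lines there is an earlier four-digit number that matches (\d{4}), and the rest of the text, including the actual year, is swallowed by .* afterwards. The sort therefore used those other numbers (7115, 2269, 1514) as keys, and they got mixed in with the real years.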
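To see both failure modes in isolation, here is a minimal sketch that applies the original pattern to the link text of the four rows quoted above. The names `@samples` and `%captured` are introduced only for this illustration, and the surrounding markup has been trimmed.

```perl
use strict;
use warnings;

# Link text of the four rows quoted above (opening <a> tag trimmed).
my @samples = (
    ',{UCLA}-Report 982051,Los Angeles,,1989,</a>',
    'Phys.Rev.Lett., <b> 60</b>, 1514, 1988</a>',
    'Phys. Rev. B, <b> 45</b>, 7115, 1992</a>',
    'J.Chem.Phys., <b> 96</b>, 2269, 1992</a>',
);

my %captured;
for my $line (@samples) {
    # Original pattern from the question: comma, one or more spaces, four digits.
    my ($key) = $line =~ m|, +(\d{4}).*</a>|;
    $captured{$line} = $key;
    printf "%-46s => %s\n", $line, defined $key ? $key : 'undef (no match)';
}
```

The first row yields no key at all, and the other three yield 1514, 7115 and 2269 rather than their years. Under `<=>` an undefined key is treated as 0 (with a warning when warnings are enabled), which would explain why the 1989 row sinks to the bottom.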

To fix these problems, the regex needs to be adjusted accordingly.

Before:

map{[$_, m|, +(\d{4}).*</a>|]}

After:

map{[$_, m|, *(\d{4}),?</a>|]}
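As a quick check, the adjusted pattern can be run through the same decorate, sort, undecorate chain on the four sample rows from this answer. This is only a sketch on trimmed link text; `@samples` and `@sorted_years` are illustrative names, not part of the original script.

```perl
use strict;
use warnings;

# Link text of the four rows quoted in this answer (opening <a> tag trimmed).
my @samples = (
    ',{UCLA}-Report 982051,Los Angeles,,1989,</a>',
    'Phys.Rev.Lett., <b> 60</b>, 1514, 1988</a>',
    'Phys. Rev. B, <b> 45</b>, 7115, 1992</a>',
    'J.Chem.Phys., <b> 96</b>, 2269, 1992</a>',
);

# Pair each line with its year, sort newest first, then keep only the years.
my @sorted_years =
    map  { $_->[1] }
    sort { $b->[1] <=> $a->[1] }
    map  { [ $_, m|, *(\d{4}),?</a>| ] }
    @samples;

print join(', ', @sorted_years), "\n";   # prints "1992, 1992, 1989, 1988"
```

Because the four digits must now be followed directly by an optional comma and `</a>`, the volume and page numbers earlier in the line can no longer be captured, and the space before the year is optional.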

score 1 · Accepted Answer

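Here is also a solution using XML::Twig, which can be used to process HTML as well. It is fairly robust: it will not touch other tables in the file, and it copes with typos such as the one in the year of the UCLA report.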

#!/usr/bin/perl 

use strict;
use warnings;

use XML::Twig;

my $IN  = 'sort_input.html';
my $OUT = 'sort_output.html';

my $t= XML::Twig->new( twig_handlers => { 'table[@class="pubtable"]' => \&sort_table,
                                        },
                       pretty_print => 'indented',
                     )
       ->parsefile_html( $IN)
       ->print_to_file( $OUT);

sub sort_table
  { my( $t, $table)= @_;
   $table->sort_children( sub { if($_[0]->last_child( 'td')->text=~ m{(\d+)\D*$}s) { $1; } },
                          type => 'numeric', order => 'reverse'
                        );
  }
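The tolerance for the UCLA typo comes from the sort key pattern, which picks the last run of digits in the cell text no matter what punctuation follows it. A minimal sketch of just that pattern, using hypothetical cell texts modelled on the rows quoted in the first answer (`@cell_texts` and `@keys` are names made up for this example):

```perl
use strict;
use warnings;

# Hypothetical cell texts with tags already stripped, as ->text would return them.
my @cell_texts = (
    ',{UCLA}-Report 982051,Los Angeles,,1989,',
    'Phys.Rev.Lett.,  60, 1514, 1988',
    'Phys. Rev. B,  45, 7115, 1992',
);

my @keys;
for my $text (@cell_texts) {
    # Same pattern as in sort_table: digits followed only by non-digits up to the end.
    my ($key) = $text =~ m{(\d+)\D*$}s;
    push @keys, $key;
}

print join(', ', @keys), "\n";   # prints "1989, 1988, 1992"
```

Anchoring at the end of the text means earlier numbers such as 982051 or 7115 are skipped, and the trailing comma after 1989 is absorbed by `\D*`.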

score 0 · Accepted Answer

A solution using a robust HTML parsing/manipulation library:

use strictures;
use autodie qw(:all);
use Web::Query qw();
my $w = Web::Query->new_from_file('pubTable.html');
$w->find('table')->html(
    join q(),
    map { $_->[0]->html }
    sort { $b->[1] <=> $a->[1] }    # descending: newest year first, as the question asks
    @{
        $w->find('tr')->map(sub {
            my (undef, $row) = @_;
            my ($year) = $row->find('.pubname')->text =~ /(\d\d\d\d) ,? \s* \z/msx;
            return [$row => $year];
        })
    }
);

open my $out, '>:encoding(UTF-8)', 'sorted.html';
print {$out} $w->html;
close $out;
