perl - Perlでテキストファイルから表形式のデータを抽出/解析するにはどうすればよいですか?

Question

HTML::TableExtractのようなものを探しています。これは、HTML 入力用ではなく、インデントとスペースでフォーマットされた「テーブル」を含むプレーンテキスト入力用です。

データは次のようになります。

Here is some header text.

Column One       Column Two      Column Three
a                                           b
a                    b                      c


Some more text

Another Table     Another Column
abdbdbdb          aaaa

score 1 · Accepted Answer

パッケージ化されたソリューションを認識していませんが、ファイルに対して2つのパスを実行できると仮定すると、あまり柔軟ではないことを行うのはかなり簡単です（以下は部分的にPerlishの擬似コードの例です）

仮定：データにはスペースが含まれている可能性があり、スペースがある場合はCSVで引用されません。そうでない場合は、を使用してText::CSV(_XS)ください。
前提：フォーマットに使用されるタブはありません。
ロジックでは、「列区切り文字」を、100％スペースが入力された垂直行の連続セットとして定義しています。
誤ってすべての行にオフセットM文字のデータの一部であるスペースがある場合、ロジックはオフセットMを列区切り文字と見なします。これは、それ以上のことを知ることができないためです。それがよりよく知ることができる唯一の方法は、列の分離がX>1である少なくともXスペースである必要があるかどうかです-そのための2番目のコードフラグメントを参照してください。

サンプルコード：

my $INFER_FROM_N_LINES = 10; # Infer columns from this # of lines
                             # 0 means from entire file
my $lines_scanned = 0;
my @non_spaces=[];
# First pass - find which character columns in the file have all spaces and which don't
my $fh = open(...) or die;
while (<$fh>) {
    last if $INFER_FROM_N_LINES && $lines_scanned++ == $INFER_FROM_N_LINES;
    chomp;
    my $line = $_;
    my @chars = split(//, $line); 
    for (my $i = 0; $i < @chars; $i++) { # Probably can be done prettier via map?
        $non_spaces[$i] = 1 if $chars[$i] ne " ";
    }
}
close $fh or die;

# Find columns, defined as consecutive "non-spaces" slices.
my @starts, @ends; # Index at which columns start and end
my $state = " "; # Not inside a column
for (my $i = 0; $i < @non_spaces; $i++) {
    next if $state eq " " && !$non_spaces[$i];
    next if $state eq "c" && $non_spaces[$i];
    if ($state eq " ") { # && $non_spaces[$i] of course => start column
        $state = "c";
        push @starts, $i;
    } else { # meaning $state eq "c" && !$non_spaces[$i] => end column
        $state = " ";
        push @ends, $i-1;
    }
}
if ($state eq "c") { # Last char is NOT a space - produce the last column end
    push @ends, $#non_spaces;
}

# Now split lines
my $fh = open(...) or die;
my @rows = ();
while (<$fh>) {
    my @columns = ();
    push @rows, \@columns;
    chomp;
    my $line = $_;
    for (my $col_num = 0; $col_num < @starts; $col_num++) {
        $columns[$col_num] = substr($_, $starts[$col_num], $ends[$col_num]-$starts[$col_num]+1);
    }
}
close $fh or die;

ここで、列の分離を少なくともX> 1のXスペースにする必要がある場合は、それも実行可能ですが、列の場所のパーサーはもう少し複雑にする必要があります。

# Find columns, defined as consecutive "non-spaces" slices separated by at least 3 spaces.
my $min_col_separator_is_X_spaces = 3;
my @starts, @ends; # Index at which columns start and end
my $state = "S"; # inside a separator
NEXT_CHAR: for (my $i = 0; $i < @non_spaces; $i++) {
    if ($state eq "S") { # done with last column, inside a separator
        if ($non_spaces[$i]) { # start a new column
            $state = "c";
            push @starts, $i;
        }
        next;
    }
    if ($state eq "c") { # Processing a column
        if (!$non_spaces[$i]) { # First space after non-space
                                # Could be beginning of separator? check next X chars!
            for (my $j = $i+1; $j < @non_spaces
                            || $j < $i+$min_col_separator_is_X_spaces; $j++) {
                 if ($non_spaces[$j]) {
                     $i = $j++; # No need to re-scan again
                     next NEXT_CHAR; # OUTER loop
                 }
                 # If we reach here, next X chars are spaces! Column ended!
                 push @ends, $i-1;
                 $state = "S";
                 $i = $i + $min_col_separator_is_X_spaces;
            }
         }
        next;
    }
}

score 1 · Accepted Answer

これは、概要をコメントした非常に簡単な解決策です。（長くなって申し訳ありません。）基本的に、「単語」が列ヘッダーnの開始後に表示される場合、その本文の大部分が列n + 1 に続く場合を除き、列nで終了します。代わりにあります。これを整理したり、複数の異なるテーブルをサポートするように拡張したりすることは、演習として残されています。列ヘッダーの左オフセット以外のものを境界マークとして使用することもできます。たとえば、中央、または列番号によって決定される値です。

#!/usr/bin/perl


use warnings;
use strict;


# Just plug your headers in here...
my @headers = ('Column One', 'Column Two', 'Column Three');

# ...and get your results as an array of arrays of strings.
my @result = ();


my $all_headers = '(' . (join ').*(', @headers) . ')';
my $found = 0;
my @header_positions;
my $line = '';
my $row = 0;
push @result, [] for (1 .. @headers);


# Get lines from file until a line matching the headers is found.

while (defined($line = <DATA>)) {

    # Get the positions of each header within that line.

    if ($line =~ /$all_headers/) {
        @header_positions = @-[1 .. @headers];
        $found = 1;
        last;
    }

}


$found or die "Table not found! :<\n";


# For each subsequent nonblank line:

while (defined($line = <DATA>)) {
    last if $line =~ /^$/;

    push @{$_}, "" for (@result);
    ++$row;

    # For each word in line:

    while ($line =~ /(\S+)/g) {

        my $word = $1;
        my $position = $-[1];
        my $length = $+[1] - $position;
        my $column = -1;

        # Get column in which word starts.

        while ($column < $#headers &&
            $position >= $header_positions[$column + 1]) {
            ++$column;
        }

        # If word is not fully within that column,
        # and more of it is in the next one, put it in the next one.

        if (!($column == $#headers ||
            $position + $length < $header_positions[$column + 1]) &&
            $header_positions[$column + 1] - $position <
            $position + $length - $header_positions[$column + 1]) {

            my $element = \$result[$column + 1]->[$row];
            $$element .= " $word";

        # Otherwise, put it in the one it started in.

        } else {

            my $element = \$result[$column]->[$row];
            $$element .= " $word";

        }

    }

}


# Output! Eight-column tabs work best for this demonstration. :P

foreach my $i (0 .. $#headers) {
    print $headers[$i] . ": ";
    foreach my $c (@{$result[$i]}) {
        print "$c\t";
    }
    print "\n";
}


__DATA__

This line ought to be ignored.

Column One       Column Two      Column Three
These lines are part of the tabular data to be processed.
The data are split based on how much words overlap columns.

This line ought to be ignored also.

出力例:

列 1: これらの行は データが分割されている
列 2: 方法に基づく表の一部
列 3: 処理されるデータ。多くの単語が列に重なっています。

perl - Perlでテキストファイルから表形式のデータを抽出/解析するにはどうすればよいですか?

2 に答える 2

Related

Reference