python - Extracting alphanumeric text from a CSV file

Question

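I have a location grid (A-I and 1-9) that is referenced in a flat file (*.csv). The references come in various formats, sometimes with whitespace and random case (9-H, @ b 3, e-4, d4, c6, 5h, C2, i9, ...): any combination of a to i and 1 to 9, possibly mixed with whitespace, ~, @, and -.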

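What would be a good way to handle extracting these alphanumerics? Ideally the output would go in a separate column before "Notes", or in a separate text file. I can read a script and understand what it does, but I'm not yet good enough to write one.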

Sample input file:

Record  Notes
46651   Adrian reported green-pylons are in central rack. (e-4)
46652   Jose enetered location of triangles in the uppur corner. (b/c6)
46207   [Location: 5h] Gabe located the long pipes in the near the far corner.
46205   Committee-reports are in boxes in holding area, @ b 3).
45164   Caller-nu,mbers @ 1A
45165   All carbon rod tackles 3 F and short (top rack)
45166   USB(3 Port) in C2
45167   Full tackle in b2.
45168    5b; USB(4 port)
45073   SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD.
45169   Persistent CORDS ~i9
45170   Deliverate handball moved to D-2 on instructions from Pete
45440   slides and overheads + contact-sheets to 9-H (top bin).
45441   d7-slides and negatives (black and white)
<eof>

Desired output (in letter-number format, either in the same file or in a new file):

Record  Location    Notes  
46651   E4  
46652   C6  
46205   A1  
...  
46169   I9

In other words, always extract the latter set of characters.

OK, after running into a "Use of uninitialized value $note in pattern match (m//)" error, I experimented and got partway there.

#   # starts with anything then space or punctuation then letter then number
if ($note =~ /.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/) {
    $note =~ s/.*[\s\~\p{Punct}]([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x;

#   # starts line with letter then number
} elsif ($note =~ /^([a-iA-I])[\s\p{Punct}]*([0-9]).*/) {
   $note =~ s/^([a-iA-I])[\s\p{Punct}]*([0-9]).*/$1$2/x;

#   # after punctuation then number
} elsif ($note =~ /.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/) {
   $note =~ s/.*[\s\p{Punct}]([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x;

#   # beginning of line with number
} elsif ($note =~ /^([0-9])[\s\p{Punct}]*([a-iA-I]).*/) {
    $note =~ s/^([0-9])[\s\p{Punct}]*([a-iA-I]).*/$2$1/x;

#   # empty line or no record of any grid location except "#7 asdfg" format
} elsif  ($note=~ "") {
    $note = "##";

}

Where the script does not do so well is when it encounters records like 99994 and 99993.

99999 norecordofgridhere --
99998
99997 box #7 entered the array without an invoice.
99996 was down at h 7 and Coachela was e 8 when found off-field.
99994 cartons in office after 4 buckets
99993 6 boxes on the top shelf of the filing cabinet in the office

The output is:

99999 # # norecordofgridhere --
99998 # #
99997 E 7 box #7 entered the array without an invoice.
99996 E 8 was down at h 7 and Coachela was e 8 when found off-field.
99994 B 4 cartons in office after 4 buckets
99993 B 6 6 boxes on the top shelf of the filing cabinet in the office

99994 and 99993 should have #. Where did I go wrong? How can I fix this?

I suspect there is a cleaner way, such as using Text::CSV_XS, but I ran into glitches with Strawberry Perl even after testing that the module was installed properly. So now I'm back on ActiveState Perl.

score 0 · Accepted Answer

Use Text::CSV_XS to parse the CSV file. It's fast and accurate.

Next, build a regular expression that matches the IDs.

Finally, normalize each ID.

#!/usr/bin/perl

use v5.10;
use strict;
use warnings;
use autodie;

use Text::CSV_XS;

# Build up the regular expression to look for IDs
my $Separator_Set  = qr{ [- ] }x;
my $ID_Letters_Set = qr{ [a-i] }xi;
my $ID_Numbers_Set = qr{ [1-9] }x;
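# Group the alternation so that both \b anchors apply to each branch;
# without the group, "|" splits the pattern into "\b letter number"
# and "number letter \b".
my $Location_Re = qr{
    \b
    (?:
        $ID_Letters_Set $Separator_Set? $ID_Numbers_Set |
        $ID_Numbers_Set $Separator_Set? $ID_Letters_Set
    )
    \b
}x;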

# Initialize Text::CSV_XS and tell it this is a tab separated CSV
my $csv = Text::CSV_XS->new({
    sep_char => "\t",   # tab separated fields
}) or die "Cannot use CSV: ".Text::CSV_XS->error_diag ();

# Read in and discard the CSV header line.
my $headers = $csv->getline(*DATA);

# Output our own header line    
say "Record\tLocation\tNotes";

# Read each CSV row, extract and normalize the ID, and output a new row.
while( my $row = $csv->getline(*DATA) ) {
    my($record, $notes) = @$row;

    # Extract and normalize the ID
    my($id) = $notes =~ /($Location_Re)/;
    $id = normalize_id($id);

    # Output a new row
    printf "%d\t%s\t%s\n", $record, $id, $notes;
}


sub normalize_id {
    my $id = shift;

    # Return empty string if we were passed in a blank
    return '' if !defined $id or !length $id or $id !~ /\S/;

    my($letter) = $id =~ /($ID_Letters_Set)/;
    my($number) = $id =~ /($ID_Numbers_Set)/;

    return uc($letter).$number;
}

__END__
Record  Notes
46651   Adrian reported green-pylons are in central rack. (e-4)
46652   Jose enetered location of triangles in the uppur corner. (b/c6)
46207   [Location: 5h] Gabe located the long pipes in the near the far corner.
46205   Committee-reports are in boxes in holding area, @ b 3).
45164   Caller-nu,mbers @ 1A
45165   All carbon rod tackles 3 F and short (top rack)
45166   USB(3 Port) in C2
45167   Full tackle in b2.
45168    5b; USB(4 port)
45073   SHOVELs+ KIPER ON PET-FOOD (@g6), ALSO ATTEMPT-STALL AND DRAWCORD.
45169   Persistent CORDS ~i9
45170   Deliverate handball moved to D-2 on instructions from Pete
45440   slides and overheads + contact-sheets to 9-H (top bin).
45441   d7-slides and negatives (black and white)
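
Since the title carries the python tag, the same three steps (parse the file, match an ID, normalize it) can also be sketched in Python with the standard `csv` and `re` modules. This is only an illustrative sketch under assumptions: the input is taken to be tab-separated as in the Perl script above, and the names `LOCATION_RE`, `normalize_id`, and `add_location_column` are hypothetical, not part of the answer.

```python
import csv
import io
import re

# Assumed pattern, mirroring the answer's idea: a letter a-i and a digit 1-9
# in either order, with at most one '-' or space between them.
LOCATION_RE = re.compile(r"\b(?:[a-i][- ]?[1-9]|[1-9][- ]?[a-i])\b", re.IGNORECASE)


def normalize_id(raw):
    """Turn 'e-4', '5h' or '9-H' into 'E4', 'H5', 'H9'; return '' if incomplete."""
    letter = re.search(r"[a-i]", raw, re.IGNORECASE)
    digit = re.search(r"[1-9]", raw)
    if not (letter and digit):
        return ""
    return letter.group().upper() + digit.group()


def add_location_column(src, dst):
    """Copy tab-separated rows from src to dst, inserting a Location column."""
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    next(reader, None)  # discard the input header line
    writer.writerow(["Record", "Location", "Notes"])
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = row[0]
        notes = row[1] if len(row) > 1 else ""
        found = LOCATION_RE.search(notes)
        location = normalize_id(found.group()) if found else ""
        writer.writerow([record, location, notes])


# Two rows from the sample input, joined with tabs (an assumption about the file).
sample = (
    "Record\tNotes\n"
    "46651\tAdrian reported green-pylons are in central rack. (e-4)\n"
    "45164\tCaller-nu,mbers @ 1A\n"
)
out = io.StringIO()
add_location_column(io.StringIO(sample), out)
print(out.getvalue())
```

With a real file, `src` and `dst` would be file objects opened with `newline=""`, as the `csv` module documentation recommends.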

score 0 · Accepted Answer

...

my $coord;
if ($note =~ /
   (?&DEL)

   ( (?&ROW) (?&SEP)?+ (?&COL)
   | (?&COL) (?&SEP)?+ (?&ROW)
   )

   (?&DEL)

   (?(DEFINE)
      (?<ROW> [a-iA-I]    )
      (?<COL> [1-9]       )
      (?<SEP> [\s~\@\-]++ )
      (?<DEL> ^ | \W | \z )
   )
/x) {
    $coord = $1;
    ( my $row = uc($coord) ) =~ s/[^A-I]//g;
    ( my $col = uc($coord) ) =~ s/[^1-9]//g;
    $coord = "$row$col";
}

...
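
The delimiter idea above is also what keeps notes such as "cartons in office after 4 buckets" from producing B4: the "b" is glued to the rest of a word, so it should not count as a grid letter. A rough Python sketch of that check follows. It is an illustration, not the answerer's code: Python's `re` module has no `(?(DEFINE)...)`, so lookarounds stand in for the `DEL` rule, and the `extract_location` name and the "##" placeholder (borrowed from the question's script) are assumptions. It returns the last match in the note, following the question's "always extract the latter set" requirement.

```python
import re

# Hypothetical pattern: a letter a-i and a digit 1-9 in either order,
# optionally separated by whitespace, '~', '@' or '-'.
# The lookarounds reject candidates glued to other letters or digits.
LOCATION_RE = re.compile(
    r"(?<![A-Za-z0-9])"
    r"(?:([a-i])[\s~@-]*([1-9])|([1-9])[\s~@-]*([a-i]))"
    r"(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def extract_location(note):
    """Return the last grid reference in note as e.g. 'E4', or '##' if none."""
    matches = list(LOCATION_RE.finditer(note or ""))
    if not matches:
        return "##"
    last = matches[-1]
    # Exactly one of the two alternatives matched; pick its letter and digit.
    letter = last.group(1) or last.group(4)
    digit = last.group(2) or last.group(3)
    return letter.upper() + digit


print(extract_location("slides and overheads + contact-sheets to 9-H (top bin)."))
print(extract_location("cartons in office after 4 buckets"))
```

The first call should give H9 and the second "##", because "4 b" is followed by the rest of "buckets" and so fails the trailing lookahead.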
