perl - perlを使用してコーパス内のすべての一意の単語にインデックスを付ける方法

Question

.i 1
.t
 effici machineindepend procedur 
garbag collect  variou list structur
.w
 method  return regist   free
list   essenti part   list process
system.   paper past solut   recoveri
problem  review  compar.  new algorithm
 present  offer signific advantag  speed
 storag util.  routin  implement
 algorithm   written   list languag 
       insur  degre
 machin independ. final  applic  
algorithm   number  differ list structur
appear   literatur  indic.
.b
cacm august 1967
.a
schorr h.
wait w. m.
.n
ca670806 jb februari 27 1978 428 pm
.x
1024 4 1549

1024 4 1549

1050 4 1549

.i 2
.t
 comparison  batch process  instant turnaround
.w
 studi   program effort  student
  introductori program cours  present
  effect  have instant turnaround   minut
 oppos  convent batch process
 turnaround time    hour  examin. 
 item compar   number  comput
run  trip   comput center program prepar
time keypunch time debug time
number  run  elaps time    run
   run   problem.   
result  influenc   fact  bonu point
 given  complet   program problem
    specifi number  run 
 evid  support instant  batch.
.b
cacm august 1967
.a
smith l. b.
.n
ca670805 jb februari 27 1978 432 pm
.x
1550 4 1550

1550 4 1550

1304 5 1550

1472 5 1550

上記のテキストは、停止とステミングの両方が行われる2つのファイルの内容です。新しいファイルは、.i（数字が続く）から始まり、.tと.b、.bとの間のテキスト内の単語のインデックスを作成する必要があります。 a、.a＆.n、.n＆.xとし、.xから新しいドキュメントの先頭までのすべてのテキストを無視します。すなわち.I（数字が続く）

すべてのファイルの内容は、「コーパス」という1つのファイルに保存されます。コーパスおよび各ドキュメントに出現する回数を使用して、すべての一意の単語にインデックスを付ける必要があります。ドキュメント内のどの位置にある可能性があります。

open FILE, '<', 'sometext.txt' or die $!;
my @texts = <FILE>;
foreach my $text(@texts) {
        my @lines = split ("\n",$text);
        foreach my $line(@lines) {
            my @words = split (" ",$text);
            foreach my $word(@words) {
                $word = trim($word);
                my $match = qr/$word/i;

                open STFILE, '<', 'sometext.txt' or die $!;
                my $count=0;

                while (<STFILE>) {
                    if ($_ =~ $match) {
                        my @mword = split /\s+/, $_;
                        $_ =~ s/[A-Za-z0-9_ ]//g;
                        for my $i (0..$#mword) {
                            if ($mword[$i] =~ $match) {
                                #print "match found on line $. word ", $i+1,"\n";
                                $count++
                            }
                        }
                    }
                }
                print "$word appears $count times \n";
                close(STFILE) or die "Couldn't close $file: $!\n\n";
            }
        }
    }


    close(FILE) or die "Couldn't close $file: $!\n\n";

    sub trim($)
{
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

上記のコードは、コーパス内の各単語の出現回数をカウントします。個々のドキュメントでの単語の出現もカウントするように変更する方法。

score 2 · Accepted Answer

どうですか：

編集して、ドキュメントごとに個別のカウンターを追加します。

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $words;
my $doc;
my $file = 'path/to/file';
open my $fh, '<', $file or die "unable to open '$file' for reading:$!"
while(<$fh>) {
    chomp;
    $doc = $_ if /^\.i/;
    next if (/^\.x\b/ .. /^\.i\b/);
    next if /^\./;
    my @words = split;
    for(@words) {
        $words->{$_}{$doc}++;
    }
}
close $fh;
print Dumper $words;

score 1 · Accepted Answer

各単語の現在のカウントを含むハッシュ値で、ハッシュを使用します。すべての行とすべての単語をループします。.t と .b の間のテキストを無視するために、ダム (フラグ変数) ベースのステートマシンを使用します。

上記のコードのいずれかの記述に行き詰まった場合は、何が行き詰まったのかについて具体的な質問を投稿してください。

perl - perlを使用してコーパス内のすべての一意の単語にインデックスを付ける方法

2 に答える 2

Related

Reference