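I have a huge array of words. I want to count how many times two particular words occur within a certain distance of each other.
For example, whenever "time" and "late" are 3 words apart or fewer, I want to increment a counter. The words "time" and "late" can each appear hundreds of times in the array. How can I find the number of times they occur close to each other?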
An index hash would be a very effective solution:
my @words = qw( word1 word2 word3 word4 word5 word6 );
# That can be expensive, but you do it only once
my %index;
@index{@words} = (0..$#words);
# That will be real quick
my $distance = $index{"word6"} - $index{"word2"};
print "Distance: $distance \n";
The output of the above script is:
Distance: 4
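Note: Building the index hash can be expensive, since it takes one pass over the whole array. If you run many distance checks, though, it is worth it, because each lookup is fast (constant time, not even log(n)).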
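One caveat with the hash slice above: when a word occurs more than once, the later index overwrites the earlier one, so only the last position of each word is kept. The following is a minimal sketch of a variant that keeps every position per word. The helper name `build_positions` and the sample sentence are assumptions made for illustration, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: map each word to the list of all indices where it occurs.
sub build_positions {
    my @words = @_;
    my %positions;
    push @{ $positions{ $words[$_] } }, $_ for 0 .. $#words;
    return \%positions;
}

my @words = qw( time is late and time flies late );
my $pos   = build_positions(@words);

# Each entry is an array reference holding indices in ascending order.
print "time: @{ $pos->{time} }\n";   # time: 0 4
print "late: @{ $pos->{late} }\n";   # late: 2 6
```

Building this structure is still a single pass over the array, and looking up all positions of a word remains a constant-time hash access.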
Do you need to support duplicate words?
#! /usr/bin/perl
use strict;
use warnings;
use constant DEBUG => 0;
my @words;
if( $ARGV[0] && -f $ARGV[0] ) {
    open my $fh, "<", $ARGV[0] or die "Could not read $ARGV[0], because: $!\n";
    my $hugeTestFile = do { local $/; <$fh> };
    # split ' ' splits on runs of whitespace, so no empty "words" are produced
    @words = split ' ', $hugeTestFile; # $#words == 10M words with my test.log
    # Test words (below) were manually placed at equal distances (~every 900K words) in test.log
    # With above, TESTS ran in avg of 15 seconds. Likely test.log was in buffers/cache.
} else {
    @words = qw( word1 word2 word3 word4 word5 word6 word7 word8 word4 word9 word0 );
}
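sub IndexOf {
    # args: <word>, [Nth occurrence to find, default 1]
    # returns the 0-based index of that occurrence, or undef if not found
    my $searchFor = shift;
    # a plain truth test here would also reject the word "0"
    return undef if( !defined($searchFor) || $searchFor eq '' );
    my $Nth = shift || 1;
    my $cntr = 0;
    for my $word (@words) {
        if( $word eq $searchFor ) {
            $Nth--;
            return $cntr if( $Nth == 0 );
        }
        $cntr++;
    }
    return undef;
}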
sub Distance {
    # args: <1st word>, <2nd word>, [occurrence_of_1st_word], [occurrence_of_2nd_word]
    # for occurrence counts: 0, 1 & undef - all have the same effect (1st occurrence)
    my( $w1, $w2 ) = ($_[0], $_[1]);
    my( $n1, $n2 ) = ($_[2] || undef, $_[3] || undef );
    die "Missing words\n" if( !$w1 );
    $w2 = $w1 if( !$w2 );
    my( $i1, $i2 ) = ( IndexOf($w1, $n1), IndexOf($w2, $n2) );
    if( defined($i1) && defined($i2) ) {
        my $offset = $i1-$i2; # positive when the 1st word comes after the 2nd
        print " Distance (offset) = $offset\n";
        return $offset;
    } elsif( !defined($i1) && !defined($i2) ) {
        print " Neither word was ";
    } elsif( !defined($i1) ) {
        print " First word was not ";
    } else {
        print " Second word was not ";
    }
    print "found in list\n";
    return undef;
}
# TESTS
print "Your array has ".scalar(@words)." words\n"; # $#words is the last index, not the count
print "When 1st word is AFTER 2nd word:\n";
Distance( "word7", "word3" );
print "When 1st word is BEFORE 2nd word:\n";
Distance( "word2", "word5" );
print "When 1st word == 2nd word:\n";
Distance( "word4", "word4" );
print "When 1st word doesn't exist:\n";
Distance( "word00", "word6" );
print "When 2nd word doesn't exist:\n";
Distance( "word1", "word99" );
print "When neither the 1st nor the 2nd word exists:\n";
Distance( "word00", "word99" );
print "When the 1st word is AFTER the 2nd OCCURRENCE of 2nd word:\n";
Distance( "word9", "word4", 0, 2 );
print "When the 1st word is BEFORE the 2nd OCCURRENCE of the 2nd word:\n";
Distance( "word7", "word4", 1, 2 );
print "When the 2nd OCCURRENCE of the 2nd word doesn't exist:\n";
Distance( "word7", "word99", 0, 2 );
print "When the 2nd OCCURRENCE of the 1st word is AFTER the 2nd word:\n";
Distance( "word4", "word2", 2, 0 );
print "When the 2nd OCCURRENCE of the 1st word is BEFORE the 2nd word:\n";
Distance( "word4", "word0", 2, 0 );
print "When the 2nd OCCURRENCE of the 1st word exists, but 2nd doesn't:\n";
Distance( "word4", "word99", 2, 0 );
print "When neither of the 2nd OCCURRENCES of the words exist:\n";
Distance( "word00", "word99", 2, 2 );
print "Distance between 2nd and 1st OCCURRENCES of the same word:\n";
Distance( "word4", "", 2, 1 );
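The script above reports the offset between two chosen occurrences, while the original question asks for a count of how often the two words fall close together. Here is a minimal sketch of that count, assuming "distance" means the absolute difference of array indices and that the two words are different. The function name `count_near` and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: count pairs (one occurrence of $w1, one of $w2)
# whose indices differ by at most $max. Assumes $w1 ne $w2 and $max >= 0.
sub count_near {
    my ( $words, $w1, $w2, $max ) = @_;

    # Collect the ascending positions of each word in one pass.
    my ( @p1, @p2 );
    for my $i ( 0 .. $#$words ) {
        push @p1, $i if $words->[$i] eq $w1;
        push @p2, $i if $words->[$i] eq $w2;
    }

    my ( $count, $lo, $hi ) = ( 0, 0, 0 );
    for my $i (@p1) {
        # Advance $lo past positions that are too far to the left of $i.
        $lo++ while $lo < @p2 && $p2[$lo] < $i - $max;
        # Advance $hi to the first position that is too far to the right of $i.
        $hi++ while $hi < @p2 && $p2[$hi] <= $i + $max;
        # Everything in $p2[$lo .. $hi-1] lies within $max of $i.
        $count += $hi - $lo;
    }
    return $count;
}

my @words = qw( time is late and time flies late );
print count_near( \@words, 'time', 'late', 3 ), "\n";   # 3
```

Because both position lists are ascending, the two window markers only ever move forward, so after the initial scan the counting step is linear in the number of occurrences rather than quadratic.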