perl - How can I count the "real" words in a text with Perl?

Question

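I have run into a text processing problem. I am writing an article and want to know how many "real" words it contains.

Here is what I mean by "real". Articles usually contain various punctuation marks such as dashes, commas, and dots. What I want is the number of words, skipping things like a " -" dash or a " ," comma that are set off by spaces.

I tried this: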

my @words = split ' ', $article;
print scalar @words, "\n";

However, this counts the various punctuation marks set off by spaces as words.

So I am thinking of using this:

my @words = grep { /[a-z0-9]/i } split ' ', $article;
print scalar @words, "\n";

This matches every token that contains a letter or a digit. Do you think this is a good enough way to count the words in an article?

Does anyone know of a CPAN module that does this?

score 1 · Accepted Answer

I think your solution is about as good as you're going to get without resorting to something elaborate.

You could also write it as

my @words = $article =~ /\S*\w\S*/g;

or count the words in a file by writing

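use feature 'say';  # say requires Perl 5.10 or later

my $n = 0;
while (<>) {
  # Each match is a run of non-space characters containing at least one word character
  my @words = /\S*\w\S*/g;
  $n += @words;
}

say "$n words found";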

Try a few sample blocks of text and look at the list of "words" that it finds. If you are happy with that then your code works.
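
As a rough illustration of that kind of check, here is a minimal sketch that runs the plain split, the grep filter from the question, and the regex match side by side on one string. The sample text in $article is made up for the example, so substitute a paragraph from your own article.

```perl
use strict;
use warnings;

# Hypothetical sample text; the stray "," "-" and "!" are the tokens to skip.
my $article = "Hello , world - this is a test... 42 times !";

# Plain whitespace split: standalone punctuation is counted as a word.
my @naive = split ' ', $article;

# The question's approach: keep tokens containing an ASCII letter or digit.
my @filtered = grep { /[a-z0-9]/i } split ' ', $article;

# The regex approach: non-space runs containing at least one word character.
my @matched = $article =~ /\S*\w\S*/g;

# Print each count followed by the tokens themselves for inspection.
print scalar(@naive),    " tokens from split: @naive\n";
print scalar(@filtered), " words from grep:   @filtered\n";
print scalar(@matched),  " words from regex:  @matched\n";
```

One difference worth looking for in your own samples: \w also matches an underscore, so a standalone "_" would be counted by the regex version but not by the [a-z0-9] filter.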
