python - Get the total number of words of two or more letters in a document using Python

Question

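I have a small Python script that computes the 10 most frequent words, the 10 least frequent words, and the total number of words in a .txt document. According to the assignment, a word is defined as 2 letters or more. The 10 most frequent and 10 least frequent words print fine, but when I try to print the total number of words in the document, it prints the total of all words, including single-letter words (such as "a"). How can I get the total to count only words of 2 letters or more?

Here is my script: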

from string import *
from collections import defaultdict
from operator import itemgetter
import re

number = 10
words = {}
total_words = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)

"""Define function to count the total number of words"""
def count_words(s):
    unique_words = split(s)
    return len(unique_words)

"""Define words as 2 letters or more -- no single letter words such as "a" """
for word in words:
    if len(word) >= 2:
        counter[word] += 1


"""Open text document, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    total_words = total_words + count_words(line)
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Words: %s" % total_words

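I am by no means a Python expert; this is for a Python class I am currently taking. The neatness and proper formatting of my code are graded in this assignment. If possible, can someone also tell me whether the formatting of this code is considered "good practice"?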

score 3 · Accepted Answer

List comprehension method:

def countWords(s):
    words = s.split()
    return len([word for word in words if len(word)>=2])

Verbose method:

def countWords(s):
    words = s.split()
    count = 0
    for word in words:
        if len(word) >= 2:
            count += 1
    return count

As an aside, kudos for using defaultdict, but I would go with collections.Counter:

import collections

# Count only words of 2+ characters; split() breaks each line into words
words = collections.Counter(word for line in open(filepath)
                            for word in line.split() if len(word) >= 2)
mostFrequent = [w for w, _ in words.most_common(10)]
leastFrequent = [w for w, _ in words.most_common()[-10:]]

Hope this helps

score 1 · Accepted Answer

The word count simply uses split()

You need to use the words_only regex here as well

def count_words(s):
    unique_words = s.split()
    # keep only the tokens that match the words_only pattern
    return len(list(filter(words_only.match, unique_words)))
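
One caveat: the words_only pattern in the question is anchored and lowercase-only, so a raw token such as "Hello," will not match unless it is normalized the same way as in the question's main loop. A minimal sketch, assuming the same strip(punctuation).lower() normalization is wanted for the total (the sample sentence is made up for illustration):

```python
import re
from string import punctuation

# Same pattern as in the question: whole token, lowercase letters, 2 or more
words_only = re.compile(r'^[a-z]{2,}$')

def count_words(s):
    """Count tokens that, once normalized, are words of two or more letters."""
    tokens = (token.strip(punctuation).lower() for token in s.split())
    return sum(1 for token in tokens if words_only.match(token))

print(count_words("Hello, I am a Python user."))  # prints 4
```

With this normalization the total and the per-word counts are computed from the same set of tokens.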

Your style looks great :)

score 1 · Accepted Answer

Sorry, I seem to have gone a bit overboard with this solution. That is, I really took your code apart and put it back together the way I would do it.

from collections import defaultdict
from operator import itemgetter
from heapq import nlargest, nsmallest
from itertools import starmap
from textwrap import dedent
import re

class WordCounter(object):
    """
    Count the number of words consisting of two letters or more.
    """

    words_only = re.compile(r'[a-z]{2,}', re.IGNORECASE)

    def __init__(self, filename, number=10):
        self.counter = defaultdict(int)

        # Open text document and find all words
        with open(filename, 'r') as txt_file:
            for word in self.words_only.findall(txt_file.read()):
                self.counter[word.lower()] += 1

        # Get total count
        self.total_words = sum(self.counter.values())

        # Most Frequent Words
        self.top_words = nlargest(
            number, self.counter.items(), itemgetter(1))

        # Least Frequent Words
        self.least_words = nsmallest(
            number, self.counter.items(), itemgetter(1))

    def __str__(self):
        """
        Summary of least and most used words, and total word count.
        """
        template = dedent("""
            Most Frequent Words:
            {0}

            Least Frequent Words:
            {1}

            Total Number of Words: {2}
            """)

        line_template = "{0}: {1}".format
        top_words = "\n".join(starmap(line_template, self.top_words))
        least_words = "\n".join(starmap(line_template, self.least_words))

        return template.format(top_words, least_words, self.total_words)


print(WordCounter("charactermask.txt"))

Here is a summary of the changes I made, and why:

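Don't do from x import *. Some modules are designed so that this is safe, but it is generally a bad idea because it pollutes your namespace. Either import only what you need, or import the module under a shortened name: import string as st. This leads to less buggy code.
Make it a class. Writing this kind of thing as a script is fine, but it is a good habit to always wrap code in classes or functions, both to organize it better and in case you need it in another project. Then you can just do from wordcounter import WordCounter and you are good to go.
Docstrings moved inside the code blocks. That way they are used automatically when you type help(my_class_or_function) in the interactive interpreter.
Comments are usually prefixed with #, not written as throwaway strings. It is not a big deal, but it is a fairly common convention.
Use the with statement when opening files. It is a good habit; you don't have to worry about remembering to close them.
.strip().split() is redundant. Just use .split().
Use re.findall. This avoids the problem with words like "top-notch", which would not be counted at all with your method. With findall we count "top" and "notch", in keeping with the definition. It is also faster. The regex has to change slightly, though.
The words dict was not used. Removed.
Use sum to compute the total word count. This solves the problem in inspectorG4dget's code, where the words_only pattern has to be applied to each word twice (once for the total and once for the per-word counts) to get consistent results.
Use heapq.nlargest and heapq.nsmallest. They are faster and more memory-efficient than a full sort when you only need the n smallest or largest results.
Make a function that returns a string, which you may or may not want to print. Using print statements directly is less flexible, although it is great for debugging.
For new code, use the string format method instead of the % operator. The former was made to improve on and replace the latter.
Use a multi-line string instead of multiple consecutive prints. It is easier to see what will actually be written, and easier to maintain. The textwrap.dedent function helps when you want to indent the string to the same level as the surrounding code.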
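
A few of the points above (findall, sum, and the heapq helpers) can be checked on a small in-memory string. This is only a sketch: the sample text is invented, and a plain string stands in for the contents of the file.

```python
import re
from collections import defaultdict
from heapq import nlargest, nsmallest
from operator import itemgetter

text = "A top-notch plan is a plan. The plan is top."  # made-up sample text

# Unanchored pattern: findall pulls "top" and "notch" out of "top-notch"
words_only = re.compile(r'[a-z]{2,}', re.IGNORECASE)

counter = defaultdict(int)
for word in words_only.findall(text):
    counter[word.lower()] += 1

# The total is derived from the same counts, so it cannot include "a"
total_words = sum(counter.values())

# Select only the n largest / smallest entries by count, without a full sort
top_two = nlargest(2, counter.items(), key=itemgetter(1))
bottom_one = nsmallest(1, counter.items(), key=itemgetter(1))

print(total_words)
print(top_two)
print(bottom_one)
```

Because the total comes from the counter itself, the pattern is applied to the text only once.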

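There is also the question of which is more readable: starmap(line_template, self.top_words) or [line_template(*x) for x in self.top_words]. Most people always prefer list comprehensions, and I usually agree with them, but here I liked the terseness of the starmap approach.

That said, I agree with user1552512: your style looks great! Nice, readable code, well commented, and very PEP 8 compliant. You will go far. :)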
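
To make the comparison concrete, here is a small sketch showing that the two spellings produce the same lines. The (word, count) pairs are invented sample data.

```python
from itertools import starmap

line_template = "{0}: {1}".format
pairs = [("the", 12), ("python", 7)]  # invented sample data

# starmap unpacks each pair into the arguments of line_template
with_starmap = "\n".join(starmap(line_template, pairs))

# The list comprehension does the same unpacking explicitly
with_listcomp = "\n".join([line_template(*pair) for pair in pairs])

print(with_starmap)
print(with_starmap == with_listcomp)
```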

score 0 · Accepted Answer

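Personally, I think your code looks fine. I don't know whether it is "standard" Python style, but it is easy to read. I am fairly new to Python too, but here is my answer.

I assume your count_words(s) function is the one that computes the total number of words. The problem you are having is that it only calls split, which just breaks the text on whitespace.

Since you need to count only words of 2+ letters, write a loop in that function that counts only the words in the unique_words list that have 2+ letters.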
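
A sketch of how such a loop could plug into the per-line total from the question. The sample lines are invented and stand in for the lines of the file. This version checks raw whitespace-separated tokens, so punctuation still attached to a word counts toward its length.

```python
def count_words(s):
    """Return how many whitespace-separated tokens in s have 2+ characters."""
    unique_words = s.split()
    count = 0
    for word in unique_words:
        if len(word) >= 2:
            count += 1
    return count

# Invented sample lines standing in for the lines of the text file
lines = ["I have a small Python script", "It counts a few words"]

total_words = 0
for line in lines:
    total_words = total_words + count_words(line)

print(total_words)  # prints 8
```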
