python - How can I make my code read a text file for only specific words?

Question

How can I make my code read only specific words in a text file and display each of those words together with its count (the number of times the word occurs in the text file)?

from collections import Counter
import re

def openfile(filename):
    # Read the whole file; "r" is enough because nothing is written back.
    with open(filename, "r") as fh:
        return fh.read()

def removegarbage(text):
    # Replace runs of non-word characters with a space, then lowercase.
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

def getwordbins(words):
    cnt = Counter()
    for word in words:
        cnt[word] += 1
    return cnt

def main(filename, topwords):
    txt = openfile(filename)
    txt = removegarbage(txt)
    words = txt.split()  # split() with no argument drops empty strings
    bins = getwordbins(words)
    for key, value in bins.most_common(topwords):
        print(key, value)

main('filename.txt', 10)

score 1 · Accepted Answer

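I think what you're looking for is a simple dictionary structure. It lets you keep track not only of the words you're looking for, but also of their counts.

A dictionary stores things as key/value pairs. So, for example, you could have the key 'alice' (the word you want to find) and set its value to the number of times that keyword has been found.

The simplest way to check whether something is in a dictionary is Python's in keyword, i.e.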

if 'pie' in words_in_my_dict: do something
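
To make the key/value idea concrete, here is a minimal sketch of the two operations the counter relies on: a membership test with in, and an in-place increment. The dictionary contents are made up for illustration.

```python
# Hypothetical dictionary mapping each word of interest to its count so far.
words_in_my_dict = {'pie': 0, 'cake': 2}

# `in` tests the keys of a dictionary, not its values.
if 'pie' in words_in_my_dict:
    words_in_my_dict['pie'] += 1   # bump the count for a word we track

has_tart = 'tart' in words_in_my_dict   # False: 'tart' is not a key
```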

With that information out of the way, setting up the word counter is really easy!

def get_word_counts(words_to_count, text):
    # `text` is the file's contents as one string, not a file name.
    words = text.split(' ')
    for word in words:
        if word in words_to_count:
            words_to_count[word] += 1
    return words_to_count

if __name__ == '__main__':

    fake_file_contents = (
        "Alice's Adventures in Wonderland (commonly shortened to "
        "Alice in Wonderland) is an 1865 novel written by English"
        " author Charles Lutwidge Dodgson under the pseudonym Lewis"
        " Carroll.[1] It tells of a girl named Alice who falls "
        "down a rabbit hole into a fantasy world populated by peculiar,"
        " anthropomorphic creatures. The tale plays with logic, giving "
        "the story lasting popularity with adults as well as children."
        "[2] It is considered to be one of the best examples of the literary "
        "nonsense genre,[2][3] and its narrative course and structure, "
        "characters and imagery have been enormously influential[3] in "
        "both popular culture and literature, especially in the fantasy genre."
        )

    words_to_count = {
        'alice' : 0,
        'and' : 0,
        'the' : 0
        }

    print(get_word_counts(words_to_count, fake_file_contents))

This gives the following output:

{'alice': 0, 'and': 4, 'the': 5}

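The dictionary stores both the words we want to count and the number of times they have appeared. The whole algorithm simply checks whether each word is in the dict and, if it is, adds 1 to that word's value.

Read up on dictionaries for more details.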
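
Note that the lookup is case-sensitive, which is why 'alice' stays at 0 in the output above: the text only contains "Alice" and "Alice's". As a rough sketch, assuming you want to ignore case, you could lowercase each word before the lookup. The function name and the sample sentence below are made up for illustration.

```python
def get_word_counts_ignoring_case(words_to_count, text):
    """Count occurrences of the lowercase keys in words_to_count, ignoring case."""
    counts = dict(words_to_count)       # copy so the caller's dict is untouched
    for word in text.split():
        word = word.lower()             # 'Alice' and 'alice' now compare equal
        if word in counts:
            counts[word] += 1
    return counts

sample = "Alice met the Hatter and The Hare and alice laughed"
result = get_word_counts_ignoring_case({'alice': 0, 'and': 0, 'the': 0}, sample)
```

Words with punctuation attached, such as "Alice's", would still not match 'alice' with this approach.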

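Edit:

If you want to count all the words and then look up a specific set, a dictionary is a great fit for the task (and fast!).

The only change you need to make is to first check whether the key exists in the dictionary, and add it if it does not.

Example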

def get_all_word_counts(text):
    # `text` is the file's contents as one string, not a file name.
    words = text.split(' ')

    word_counts = {}
    for word in words:
        if word not in word_counts:   # If not already there,
            word_counts[word] = 0     # add it in.
        word_counts[word] += 1        # Increment the count accordingly
    return word_counts

This gives the following output:

and : 4
shortened : 1
named : 1
popularity : 1
peculiar, : 1
be : 1
populated : 1
is : 2
(commonly : 1
nonsense : 1
an : 1
down : 1
fantasy : 2
as : 2
examples : 1
have : 1
in : 4
girl : 1
tells : 1
best : 1
adults : 1
one : 1
literary : 1
story : 1
plays : 1
falls : 1
author : 1
giving : 1
enormously : 1
been : 1
its : 1
The : 1
to : 2
written : 1
under : 1
genre,[2][3] : 1
literature, : 1
into : 1
pseudonym : 1
children.[2] : 1
imagery : 1
who : 1
influential[3] : 1
characters : 1
Alice's : 1
Dodgson : 1
Adventures : 1
Alice : 2
popular : 1
structure, : 1
1865 : 1
rabbit : 1
English : 1
Lutwidge : 1
hole : 1
Carroll.[1] : 1
with : 2
by : 2
especially : 1
a : 3
both : 1
novel : 1
anthropomorphic : 1
creatures. : 1
world : 1
course : 1
considered : 1
Lewis : 1
Charles : 1
well : 1
It : 2
tale : 1
narrative : 1
Wonderland) : 1
culture : 1
of : 3
Wonderland : 1
the : 5
genre. : 1
logic, : 1
lasting : 1

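Note: as you can see, splitting the file contents with split(' ') produced a few "misfires". Specifically, some words still have an opening or closing parenthesis attached, and others carry punctuation or bracketed markers such as [2]. You will need to account for this in your file processing... but I'll leave that to you!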
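
One possible way to handle those misfires, sketched here with the standard re module and collections.Counter rather than the plain dictionary used above. The function name, the regular expressions, and the sample sentence are assumptions for illustration; adjust the pattern to whatever counts as a word in your file.

```python
import re
from collections import Counter

def count_clean_words(text):
    """Count words after dropping punctuation and bracketed markers like [2]."""
    text = re.sub(r'\[\d+\]', ' ', text)               # remove markers such as [1]
    tokens = re.findall(r"[a-z0-9']+", text.lower())   # letters, digits, apostrophes
    return Counter(tokens)

sample = "Alice in Wonderland) is a novel.[1] It tells of a girl named Alice, (commonly) lost."
counts = count_clean_words(sample)
```

Because apostrophes are kept, a possessive such as "Alice's" becomes the separate token "alice's" instead of being merged with "alice".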

score 1 · Accepted Answer

I think splitting this across many functions is overcomplicated. Why not do it in a single function?

# Wrap this in a function if desired;
# the file path, specific words, etc. could be parameters.

specificword = "alice"  # placeholder: the word you are looking for
counter = 0
with open("filename.txt") as f:
    for line in f:
        # Translate punctuation to spaces so that any interesting
        # word ends up surrounded by whitespace and can be detected.
        line = line.translate(str.maketrans(".,!?", "    "))
        words = line.split()  # splits on any run of whitespace
        for word in words:
            if word == specificword:
                # or use a list of specific words:
                # if word in specificwordlist:
                counter += 1
                print(word)
                # you could also append the words to some list,
                # create a dictionary, etc.
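
As a rough sketch of the "def function if desired" suggestion, the loop could be wrapped like this. The function name and parameters are hypothetical; it accepts any iterable of lines, so an open file object works as well as the list used here.

```python
import string
from collections import Counter

def count_specific_words(lines, specific_words):
    """Count how often each word in specific_words appears in an iterable of lines."""
    wanted = set(specific_words)
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for line in lines:
        for word in line.translate(table).split():
            if word in wanted:
                counts[word] += 1
    return counts

lines = ["the cat sat, the dog ran.", "A cat! The end?"]
counts = count_specific_words(lines, ["the", "cat"])
```

The match is still case-sensitive, so "The" in the second line is not counted as "the".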

score 0 · Accepted Answer

This will probably do... it's not exactly what you asked for, but the end result is what you want (I think).

from collections import Counter

interesting_words = ["ipsum", "dolor"]

some_text = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec viverra consectetur sapien, sed posuere sem rhoncus quis. Mauris sit amet ligula et nulla ultrices commodo sed sit amet odio. Nullam vel lobortis nunc. Donec semper sem ut est convallis posuere adipiscing eros lobortis. Nullam tempus rutrum nulla vitae pretium. Proin ut neque id nisi semper faucibus. Sed sodales magna faucibus lacus tristique ornare.
"""

d = Counter(some_text.split())
# Keep only the (word, count) pairs for the words we care about.
final_list = list(filter(lambda item: item[0] in interesting_words, d.items()))

However, its complexity isn't great, so it may be slow for large files or for a large list of interesting_words.
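
One way to reduce that cost, sketched under the assumption that you only need the counts of the interesting words: store them in a set and look each one up in the Counter directly, instead of scanning every counted item against a list. The shortened sample text is made up for illustration.

```python
from collections import Counter

interesting_words = {"ipsum", "dolor"}   # a set: membership tests are O(1) on average

some_text = "Lorem ipsum dolor sit amet, ipsum lorem dolor dolor"

counts = Counter(some_text.split())
# Look up only the words we care about; Counter returns 0 for missing keys.
final_counts = {word: counts[word] for word in interesting_words}
```

Counting the words is still linear in the size of the text, but the final selection now depends only on the number of interesting words.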
