python - Finding the most common words on a website

Translated from: https://stackoverflow.com/questions/17904354 2013-07-28T02:14:14.187


I'm new to Python. I have a simple program that counts how many times each word is used on a website.

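import re
import urllib2  # Python 2 standard library
from collections import Counter

from bs4 import BeautifulSoup  # assumed: the original snippet omits its imports

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

url = 'http://en.wikipedia.org/wiki/Albert_Einstein'
ourUrl = opener.open(url).read()
soup = BeautifulSoup(ourUrl)
dem = soup.findAll('p')  # find all paragraphs
for i in dem:  # loop over each paragraph
    words = re.findall(r'\w+', i.text)
    cap_words = [word.upper() for word in words]
    word_counts = Counter(cap_words)  # a new Counter is built for every paragraph
    print word_counts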

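The problem is that this gives me the word counts per paragraph rather than the total word counts for the whole page. What changes are needed? Also, if I want to exclude common articles such as a, an, and the, what code do I need to add?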
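
One possible approach, as a minimal sketch rather than a definitive answer: create a single Counter before the loop and update it for each paragraph, skipping any word found in a stop-word set. The names count_words and STOP_WORDS are hypothetical, the stop-word list is only an example, and the sample strings stand in for the text of the paragraphs returned by soup.findAll('p') so that the sketch runs without a network connection. It is written in Python 3 syntax, unlike the Python 2 snippet in the question.

```python
import re
from collections import Counter

# Hypothetical stop-word list (upper-cased to match the counting below);
# extend it with whatever words should be ignored.
STOP_WORDS = {"A", "AN", "THE"}


def count_words(paragraph_texts, stop_words=STOP_WORDS):
    """Return one Counter covering all paragraphs, skipping stop words."""
    totals = Counter()  # created once, outside the loop
    for text in paragraph_texts:
        words = re.findall(r"\w+", text)
        # Upper-case as in the question so "The" and "the" count together.
        totals.update(w.upper() for w in words if w.upper() not in stop_words)
    return totals


# Stand-in for [p.text for p in soup.findAll('p')].
sample = ["The cat sat on a mat.", "A cat is an animal; the cat sleeps."]
word_counts = count_words(sample)
print(word_counts.most_common(3))
```

With the real page, the list passed to count_words would be the text of each paragraph, and most_common(n) then gives the n most frequent words across the whole page.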
