python - How to get the same result as the book "Web Scraping with Python: Collecting Data from the Modern Web", Chapter 7 "Data Normalization" section

Question

Python version: 2.7.10

My code:

# -*- coding: utf-8 -*-

from urllib2 import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict
import re
import string

def cleanInput(input):
    input = re.sub('\n+', " ", input)
    input = re.sub('\[[0-9]*\]', "", input)
    input = re.sub(' +', " ", input)
    # input = bytes(input, "UTF-8")
    input = bytearray(input, "UTF-8")
    input = input.decode("ascii", "ignore")

    cleanInput = []
    input = input.split(' ')

    for item in input:
        item = item.strip(string.punctuation)
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = []

    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'
html = urlopen(url)
bsObj = BeautifulSoup(html, 'lxml')
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams = ngrams(content, 2)
keys = range(len(ngrams))
ngramsDic = {}
for i in range(len(keys)):
    ngramsDic[keys[i]] = ngrams[i]
# ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
ngrams = OrderedDict(sorted(ngramsDic.items(), key=lambda t: t[1], reverse=True))


print ngrams
print "2-grams count is: " + str(len(ngrams))

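I recently started learning web scraping by following the book Web Scraping with Python: Collecting Data from the Modern Web. In the Chapter 7 Data Normalization section, I first wrote the code exactly as it appears in the book and got this error in the terminal: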

Traceback (most recent call last):
  File "2grams.py", line 40, in <module>
    ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
AttributeError: 'list' object has no attribute 'items'

So I changed the code by creating a new dictionary whose values are the entries of the ngrams list. But I got a completely different result:

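Questions:

If I want results like the ones the book shows (sorted by value and frequency), do I need to write my own lines to count the occurrences of each 2-gram, or did the book's code already do that (the book's code was written for Python 3)? Book sample code on GitHub
The frequencies in my output were quite different from the author's. For example, [u'Software', u'Foundation'] occurred 37 times rather than 40. What causes the difference (could it be an error in my code)?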

Screenshot from the book:

score 1 · Accepted Answer

ngrams was a list, so I got the same error in this chapter. I converted it to a dict and it worked:

ngrams1 = OrderedDict(sorted(dict(ngrams1).items(), key=lambda t: t[1], reverse=True))
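
A note on what this conversion does, as a minimal sketch with made-up data (the `pairs` list below is hypothetical, not output from the book's code): `dict()` applied to a list of two-item lists reads each pair as key and value. The AttributeError goes away, but a repeated first word keeps only its last partner, and nothing is counted.

```python
from collections import OrderedDict

# Hypothetical sample standing in for the list returned by ngrams(content, 2).
pairs = [['Python', 'Software'], ['Software', 'Foundation'],
         ['Python', 'is'], ['Software', 'Foundation']]

# dict() reads each two-item list as (key, value): first word -> second word.
# Later pairs with the same first word overwrite earlier ones.
as_dict = dict(pairs)

# Sorting by t[1] therefore orders by the second word, not by frequency.
ordered = OrderedDict(sorted(as_dict.items(), key=lambda t: t[1], reverse=True))
print(ordered)
```

If the goal is the book's frequency table, each 2-gram still has to be counted explicitly.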

score 1 · Accepted Answer

I had the same problem while reading this book. ngrams needs to be a dict. Python version 3.4

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
from collections import OrderedDict
import re
import string

def cleanInput(input):
    input = re.sub(r'\n+', ' ', input)
    input = re.sub(r'\[[0-9]*\]', '', input)  # strip citation markers such as [12]
    input = re.sub(r' +', ' ', input)  # collapse runs of spaces into one
    input = bytes(input, 'utf-8')
    input = input.decode('ascii', 'ignore')
    cleanInput = []
    input = input.split(' ')
    for item in input:
        item = item.strip(string.punctuation)
        if len(item) >1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleanInput.append(item)
    return cleanInput

def ngrams(input, n):
    input = cleanInput(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

html = urlopen("http://en.wikipedia.org/wiki/Python_(programming_language)")
bsObj = BeautifulSoup(html, "lxml")
content = bsObj.find("div", {"id": "mw-content-text"}).get_text()
ngrams1 = ngrams(content, 2)
#ngrams1  is something like this [['This', 'article'], ['article', 'is'], ['is', 'about'], ['about', 'the'], ['the', 'programming'], ['programming', 'language'],
ngrams = {}
for i in ngrams1:
    j = str(i)   #the key of ngrams should not be a list
    ngrams[j] = ngrams.get(j, 0) + 1
    # ngrams.get(j, 0) means return a value for the given key j. If key j is not available, then returns default value 0.
    # when key j appear again, ngrams[j] = ngrams[j]+1

ngrams = OrderedDict(sorted(ngrams.items(), key=lambda t: t[1], reverse=True))
print(ngrams)
print("2-grams count is:"+str(len(ngrams)))

This is part of my result:

OrderedDict([("['Python', 'Software']", 37), ("['Software', 'Foundation']", 37), ("['of', 'the']", 37), ("['of', 'Python']", 35), ("['Foundation', 'Retrieved']", 32),
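
An alternative way to do the counting, offered as a sketch rather than the book's method: `collections.Counter` with tuple keys avoids turning each 2-gram into a string. The `sample` list is hypothetical test data; in the script above it would be `ngrams1`.

```python
from collections import Counter

# Hypothetical sample standing in for ngrams1 (a list of two-item lists).
sample = [['Python', 'Software'], ['Software', 'Foundation'],
          ['of', 'the'], ['Software', 'Foundation']]

# Lists are unhashable, so convert each 2-gram to a tuple before counting.
counts = Counter(tuple(gram) for gram in sample)

# most_common() returns (ngram, count) pairs sorted by count, highest first.
for gram, freq in counts.most_common():
    print(gram, freq)
```

Tuple keys keep the two words separate, so they can be read back with `gram[0]` and `gram[1]` instead of parsing a string such as "['of', 'the']".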
