python-2.7 - Problem scraping data from a website using Beautiful Soup

Question

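I am trying to scrape a list of 41 items and their prices from a website, but my output CSV is missing the 2-3 items at the end of the page. The reason is that some devices have their price listed under a different class than the rest. The loop in my code walks through names and prices together, so for an item whose price is under the other class it takes the price value from the next device. As a result the last 2-3 items are skipped, because their prices have already been consumed by earlier devices in the loop. Here is the code in question: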

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=True)
    prices = soup.findAll('div', {"class": "listGrid-price"})
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:            
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])

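The price is normally listed under listGrid-price, but for the 2-3 items that are currently out of stock it is under listGrid-price-outOfStock. I need to include this class in my loop as well, so that the correct price appears next to each item and the loop runs for every device.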
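
The misalignment described above can be reproduced without the site. This is a minimal sketch with made-up device names and prices (none of them come from the AT&T page): when one price is missing from the second list, zip() pairs each later device with its neighbour's price and silently drops the last device.

```python
# Hypothetical data: three devices, but only two prices were found because
# the middle device's price sits under a different CSS class.
items = ["Phone A", "Phone B (out of stock)", "Phone C"]
prices = ["$99.99", "$199.99"]  # the price for Phone B was never matched

# zip() pairs purely by position and stops at the shorter list.
rows = list(zip(items, prices))

for name, price in rows:
    print("%s -> %s" % (name, price))
```

Phone B ends up with Phone C's price, and Phone C never reaches the output, which is the same pattern as the missing rows in the CSV.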

Please pardon my ignorance; I am new to programming.

score 0 · Accepted Answer

You can use a comparator function to do a custom comparison and pass it to findAll().

So if you change the line with the prices assignment to:

prices = soup.findAll('div', class_=match_both)

And define the function as:

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

(The function can be made much more concise; it is verbose here just to show how it works.)
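
As a sketch of the more concise form mentioned above, the same check can be written as a single membership test. The names PRICE_CLASSES and match_both_short are only illustrative and are not part of the original answer.

```python
# The two class names from the question, kept in one place.
PRICE_CLASSES = ("listGrid-price", "listGrid-price-outOfStock")

def match_both_short(arg):
    # True only when the class value is exactly one of the two price classes.
    # None (a tag with no class attribute) is simply not in the tuple.
    return arg in PRICE_CLASSES
```

It should be usable in the same place as the longer version, for example as class_=match_both_short.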

It will then compare against both class names and return a match for either one.
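
To check this behaviour offline, the function can be tried against a small piece of inline markup. This is a sketch under assumptions: the HTML below is invented (only the two price class names come from the question), the listGrid-rating class is hypothetical, and it uses find_all, the current spelling of findAll in bs4, with the built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<div class="listGrid-price">$99.99</div>
<div class="listGrid-price-outOfStock">$199.99</div>
<div class="listGrid-rating">4 stars</div>
"""

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

soup = BeautifulSoup(html, "html.parser")

# The function is called with each tag's class value and keeps the tag
# when it returns True.
prices = soup.find_all("div", class_=match_both)
matched = [div.get_text(strip=True) for div in prices]
print(matched)
```

Both price divs are kept in document order and the unrelated div is skipped, so the price list lines up with the item list again.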

For more information, see the documentation (the has_six_characters variant).

Now, since you also asked how to exclude specific text:

The text argument to findAll() can also take a custom comparator. In this case you don't want the text Write a review to match, since that would shift the names out of line with the prices.
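
What such a text filter accepts and rejects can be seen by trying it on plain strings. This is a sketch with invented link texts, and the name is_device_text is hypothetical; unlike a filter that passes a falsy value straight through, it always returns a strict boolean.

```python
def is_device_text(text):
    # Empty or missing text never counts as a device name.
    if not text:
        return False
    # Reject the review link, keep everything else.
    return "Write a review" not in text

# Hypothetical link texts, in the order they might appear on a page.
link_texts = ["Phone A", "Write a review", None, "Phone B"]
device_names = [t for t in link_texts if is_device_text(t)]
print(device_names)
```

Only the device names survive, so each remaining entry can be paired with exactly one price.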

So here is your script, edited to exclude the review part:

# -*- coding: cp1252 -*-
import csv
import urllib2
import sys
import time
from bs4 import BeautifulSoup

def match_both(arg):
    if arg == "listGrid-price" or arg == "listGrid-price-outOfStock":
        return True
    return False

def not_review(arg):
    if not arg:
        return arg
    return "Write a review" not in arg

page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.deviceListGridView.xhr.flowtype-NEW.deviceGroupType-Cellphone.paymentType-postpaid.packageType-undefined.html?taxoStyle=SMARTPHONES&showMoreListSize=1000').read()
soup = BeautifulSoup(page)
soup.prettify()
with open('AT&T_2012-12-28.csv', 'wb') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(["Date","Month","Day of Week","Device Name","Price"])
    items = soup.findAll('a', {"class": "clickStreamSingleItem"},text=not_review)
    prices = soup.findAll('div', class_=match_both)
    for item, price in zip(items, prices):
        textcontent = u' '.join(price.stripped_strings)
        if textcontent:
            spamwriter.writerow([time.strftime("%Y-%m-%d"),time.strftime("%B"),time.strftime("%A") ,unicode(item.string).encode('utf8').replace('â„¢','').replace('Â®','').strip(),textcontent])
