python-2.7 - Problem printing data from different classes of an HTML page using Beautiful Soup

Question

I need to get device prices from a website. Prices appear on the site in two forms:

Single price, e.g. $99.99
Price range, e.g. "$49.99" to "$99.99"

A single price sits in a single class, and I can extract those values, but a price range is spread across two classes.

<div class="gridPrice">"$199.99"
 <span class="multiDevicePrice-to">to</span> "$399.99"
</div>

Prices given as a range are enclosed in double quotes, while a single price has no quotes.

I am using the following code:

import csv
import urllib2
import sys  
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/devices/smartphones.html').read()
soup = BeautifulSoup(page)
soup.prettify()
for anchor1 in soup.findAll('div', {"class": "listGrid-price"},text=True):
    if anchor1.string:
        print unicode(anchor1.string).strip()
for anchor2 in soup.findAll('div', {"class": "gridPrice"},text=True):
    if anchor2.string:
        print unicode(anchor2.string).strip()

The output does not include the price-range values. What I need is a single list of all the prices.

score 1 · Accepted Answer

You can use the .stripped_strings attribute to get an iterable of all the (whitespace-stripped) text values inside a given tag:

for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
    textcontent = u' '.join(anchor1.stripped_strings)
    if textcontent:
        print textcontent
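
To see why this helps, here is a minimal sketch run against an inline fragment instead of the live page. The HTML below is a hypothetical stand-in modelled on the markup in the question, and the `prices` list is an illustrative name, not something from the original code. It relies on the documented Beautiful Soup behaviour that `.string` is None when a tag has more than one child, which is the case for the price-range div. The sketch uses print() so that it also runs on Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the question's markup (not the real page).
html = '''
<div class="listGrid-price">$99.99</div>
<div class="listGrid-price">$199.99
 <span class="multiDevicePrice-to">to</span> $399.99
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', {'class': 'listGrid-price'})

# The range div has three children (text, <span>, text), so .string is None.
range_string = divs[1].string

prices = []
for div in divs:
    # stripped_strings yields every text node with surrounding whitespace removed.
    text = u' '.join(div.stripped_strings)
    if text:
        prices.append(text)

print(range_string)
print(prices)
```

Collecting the joined text into one list, rather than printing inside the loop, gives the single list of all prices that the question asks for.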

You may need to pick out just one or two of those values; itertools.islice could help there:

from itertools import islice

for anchor1 in soup.findAll('div', {"class": "listGrid-price"}):
    textcontent = u' '.join(islice(anchor1.stripped_strings, 0, 3, 2))
    if textcontent:
        print textcontent

This islice call selects only the first and third elements, which are the start and end prices in the grid.
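
The slicing itself can be checked without Beautiful Soup. The sketch below uses plain lists as hypothetical stand-ins for what `stripped_strings` might yield for a range and for a single price. It shows that `islice(iterable, 0, 3, 2)` takes indices 0 and 2, and simply returns the one available element when the input is shorter.

```python
from itertools import islice

# Hypothetical stand-ins for the strings found inside a price div.
range_strings = ['$199.99', 'to', '$399.99']
single_strings = ['$99.99']

# start=0, stop=3, step=2 selects indices 0 and 2, skipping the "to" in between.
range_picked = list(islice(range_strings, 0, 3, 2))

# With only one element, islice stops early without raising an error.
single_picked = list(islice(single_strings, 0, 3, 2))

print(range_picked)
print(single_picked)
```

Because islice works on any iterable, it can consume the `stripped_strings` generator directly without first converting it to a list.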
