python - BeautifulSoup - Differences in scraping between the lxml and html5lib parsers

Question

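I am using BeautifulSoup 4 with Python 2.7. I would like to extract certain elements from a website (the quantities; see the example below). For some reason, the lxml parser does not let me extract all of the desired elements from the page: it outputs only the first three. I am trying the html5lib parser to see whether it can extract all of them.

The page contains multiple items, each with a price and a quantity. The portion of the markup that holds the desired information for each item looks like this: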

<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>

Consider the following three cases.

Case 1 - Data:

#! /usr/bin/python
from bs4 import BeautifulSoup
data = """
<td class="size-price last first" colspan="4">
                    <span>453 grams </span>
            <span> <span class="strike">$619.06</span> <span class="price">$523.91</span>
                    </span>
                </td>"""                
soup = BeautifulSoup(data)
print soup.td.span.text

Prints:

453 grams

Case 2 - LXML:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "lxml")
print soup.find('td', {'class': 'size-price'}).span.text

Prints:

453 grams

Case 3 - HTML5LIB:

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen
webpage = urlopen('The URL goes here')
soup=BeautifulSoup(webpage, "html5lib")
print soup.find('td', {'class': 'size-price'}).span.text

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Dom\Python-Code\src\Testing-Code.py", line 6, in <module>
    print soup.find('td', {'class': 'size-price'}).span.text
AttributeError: 'NoneType' object has no attribute 'span'

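How do I need to adapt my code to extract the information I want using the html5lib parser? If I simply print the soup to the console after parsing with html5lib, I can see all of the desired information, so I assumed it would let me get what I want. That is not the case with the lxml parser, so I am also curious why lxml does not seem to extract all of the quantities when I use: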

print [td.span.text for td in soup.find_all('td', {'class': 'size-price'})]
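The AttributeError in Case 3 means that `soup.find(...)` returned None, that is, no matching `td` was found in the tree html5lib built. One way to narrow this down is to run the same lookup through each installed parser and compare the results. The following is only a minimal diagnostic sketch: `SAMPLE` (including its second row), `quantities` and `compare_parsers` are hypothetical names and data made up for illustration, not the real page. It is written so that it should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function  # same print syntax on Python 2.7 and 3

from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical stand-in for the real page: two cells inside a complete table,
# the second one with a different class list and made-up values.
SAMPLE = """
<table>
  <tr><td class="size-price last first" colspan="4">
    <span>453 grams </span>
    <span> <span class="strike">$619.06</span> <span class="price">$523.91</span></span>
  </td></tr>
  <tr><td class="size-price last" colspan="4">
    <span>907 grams </span>
    <span> <span class="price">$899.00</span></span>
  </td></tr>
</table>
"""


def quantities(markup, parser):
    """Return the stripped text of the first <span> in each size-price cell."""
    soup = BeautifulSoup(markup, parser)
    # A class filter matches any cell whose class list contains 'size-price'.
    cells = soup.find_all('td', {'class': 'size-price'})
    # td.span is None when a cell has no <span>; skip those instead of crashing.
    return [td.span.get_text(strip=True) for td in cells if td.span is not None]


def compare_parsers(markup, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each installed parser name to the quantities it yields."""
    results = {}
    for name in parsers:
        try:
            results[name] = quantities(markup, name)
        except FeatureNotFound:
            # The parser library is not installed; leave it out of the report.
            continue
    return results


if __name__ == '__main__':
    for name, found in sorted(compare_parsers(SAMPLE).items()):
        print(name, len(found), found)
```

Running `compare_parsers` on the real page source instead of `SAMPLE` would show whether the parsers disagree on how many `size-price` cells exist. If they do, the page markup is probably malformed and each parser is repairing it differently.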

score 1 · Accepted Answer

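from lxml import etree

html = 'your html'  # placeholder: the page source as a string
tree = etree.HTML(html)
# Select cells whose class attribute is exactly "size-price last first"
tds = tree.xpath('.//td[@class="size-price last first"]')
for td in tds:
    price = td.xpath('.//span[@class="price"]')[0].text
    strike = td.xpath('.//span[@class="strike"]')[0].text
    spans = td.xpath('.//span')
    # Skip spans whose .text is None before testing for 'grams'
    quantity = [i.text for i in spans if i.text and 'grams' in i.text][0].strip()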
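Note that an XPath test of the form `@class="size-price last first"` compares the whole attribute string, so a cell whose class list differs (for example just `size-price`) would not be selected, whereas the question's BeautifulSoup filter matches on the single class `size-price`. The variant below is a sketch of the usual XPath 1.0 idiom for matching one class token, and it tolerates cells that lack a price or strike span. `CELL_XPATH` and `extract_items` are hypothetical names, and the assumption that the quantity is the first child `<span>` of the cell comes from the sample markup only.

```python
from lxml import etree

# XPath 1.0 idiom: true when the class attribute contains the token size-price.
CELL_XPATH = ('.//td[contains(concat(" ", normalize-space(@class), " "), '
              '" size-price ")]')


def extract_items(html):
    """Return one dict per size-price cell; fields that are absent are None."""
    tree = etree.HTML(html)
    items = []
    for td in tree.xpath(CELL_XPATH):
        # string(...) evaluates to '' when nothing matches, so no IndexError.
        price = td.xpath('string(.//span[@class="price"])').strip() or None
        strike = td.xpath('string(.//span[@class="strike"])').strip() or None
        # Assumption: the quantity is the first <span> child of the cell.
        quantity = td.xpath('string(./span[1])').strip() or None
        items.append({'quantity': quantity, 'price': price, 'strike': strike})
    return items
```

Calling `extract_items(page_source)` returns a list of dictionaries, one per cell, which makes it easy to check whether the number of items matches what is visible on the page.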
