python - How to extract tags from HTML using BeautifulSoup in Python

Question

I am trying to parse an HTML page, simplified as follows:

<div class="anotherclass part">
  <a href="http://example.com" >
    <div class="column abc"><strike>&#163;3.99</strike><br>&#163;3.59</div>
    <div class="column def"></div>
    <div class="column ghi">1 Feb 2013</div>
    <div class="column jkl">
      <h4>A title</h4>
      <p>
        <img class="image" src="http://example.com/image.jpg">A, List, Of, Terms, To, Extract - 1 Feb 2013</p>
    </div>
  </a>
</div>

I am a beginner at Python coding, and I have read and re-read the BeautifulSoup documentation at http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html.

I have this code:

import re
from BeautifulSoup import BeautifulSoup

with open("file.html") as fp:
  html = fp.read()

soup = BeautifulSoup(html)

parts = soup.findAll('a', attrs={"class": re.compile('part', re.IGNORECASE)})
for part in parts:
  mypart={}

  # ghi
  mypart['ghi'] = part.find(attrs={"class": re.compile('ghi')} ).string
  # def
  mypart['def'] = part.find(attrs={"class": re.compile('def')} ).string
  # h4
  mypart['title'] = part.find('h4').string

  # jkl
  mypart['other'] = part.find('p').string

  # abc
  pattern = re.compile( r'\&\#163\;(\d{1,}\.?\d{2}?)' )
  theprices = re.findall( pattern, str(part) )
  if len(theprices) == 2:
    mypart['price'] = theprices[1]
    mypart['rrp'] = theprices[0]
  elif len(theprices) == 1:
    mypart['price'] = theprices[0]
    mypart['rrp'] = theprices[0]
  else:
    mypart['price'] = None
    mypart['rrp'] = None

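I want to extract the text from classes def and ghi, which I think my script does correctly.

I also want to extract the two prices from abc, which my script currently does in a rather clunky way. This part sometimes contains two prices, sometimes one, and sometimes none.

Finally, I want to extract the "A, List, Of, Terms, To, Extract" part from class jkl, which my script fails to do. I thought getting the string part of the p tag would work, but I cannot understand why it does not. The date in this part always matches the date in class ghi, so it should be easy to replace or remove it.

Any advice? Thank you!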

score 2 · Accepted Answer

First, if you add convertEntities=bs.BeautifulSoup.HTML_ENTITIES to the constructor call:

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

then HTML entities such as &#163; are converted to their corresponding Unicode characters, such as £. This lets you use a simpler regex to identify the prices.
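For readers on Python 3, where BeautifulSoup 3 is not available, the same idea can be sketched with the standard library alone. This is a substitute for illustration, not the convertEntities option itself: html.unescape decodes &#163; to £, after which a plain regex is enough. The fragment string and the names fragment, decoded, and prices are assumptions made for this example.

```python
import html
import re

# Hypothetical fragment modeled on the "column abc" div from the question.
fragment = '<strike>&#163;3.99</strike><br>&#163;3.59'

# html.unescape turns numeric character references into Unicode characters.
decoded = html.unescape(fragment)

# With a literal pound sign in the text, the pattern no longer has to
# spell out the escaped entity.
prices = re.findall(r'£(\d+\.\d{2})', decoded)

print(decoded)
print(prices)
```

Compare the pattern here with the escaped r'\&\#163\;...' pattern in the question: decoding first keeps the regex focused on the number itself.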

Now, given part, you can find the text contents of the <div> holding the prices by using its contents attribute:

In [37]: part.find(attrs={"class": re.compile('abc')}).contents
Out[37]: [<strike>£3.99</strike>, <br />, u'\xa33.59']

All we need to do is extract the number from each item, or skip the item if it has no number:

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

price = []
for item in part.find(attrs={"class": re.compile('abc')}).contents:
    item = parse_price(item.string)
    if item:
        price.append(item)
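To see how parse_price behaves on its own, here is a small standalone check. The function body is the one shown above; the sample inputs are made up for illustration. It suggests one thing to watch for: the pattern requires a decimal point, so a whole-number price such as '£3' is treated as having no number.

```python
import re

def parse_price(text):
    """Return the first decimal number in text as a float, or None."""
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        # TypeError: text is None; AttributeError: the regex found no match.
        return None

# Hypothetical sample inputs, not taken from the original page.
samples = ['£3.99', '£3.59', None, 'no digits here', '£3']
results = [parse_price(s) for s in samples]
print(results)
```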

At this point price will be a list of 0, 1, or 2 floats. We would like to write

mypart['rrp'], mypart['price'] = price

but that will not work if price contains only one item or is the empty list, [].
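The failure can be reproduced without BeautifulSoup. The sketch below uses a hypothetical helper, try_unpack, that attempts the two-name unpacking and reports what happens for lists of length 2, 1, and 0. The exact wording of the error message depends on the Python version, so it is printed rather than assumed.

```python
def try_unpack(price):
    """Attempt the two-name unpacking; return the pair or the error text."""
    try:
        rrp, current = price
        return (rrp, current)
    except ValueError as exc:
        # Raised when the list does not contain exactly two items.
        return str(exc)

for price in ([3.99, 3.59], [3.59], []):
    print(price, '->', try_unpack(price))
```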

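Your way of handling the three cases with if..else is fine. It is the most straightforward and arguably the most readable way to proceed, but it is also a bit mundane. If you would like something a little more terse, you could do the following.

Since we want to repeat the same price when price contains only one item, you might be led to think about itertools.cycle.

When price is the empty list, [], we want itertools.cycle([None]); otherwise we can use itertools.cycle(price).

So, to combine both cases into one expression, we could use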

price = itertools.cycle(price or [None])
mypart['rrp'], mypart['price'] = next(price), next(price)

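The next function peels off the values of the iterator price one at a time. Since price cycles through its values, it never ends; it just keeps yielding the values in sequence, starting over again when necessary, which is exactly what we want.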
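As a quick check of the cycle idiom, the sketch below wraps it in a hypothetical helper, split_prices, and runs it against the three possible list lengths. The helper name is an assumption for this example; the expression inside it is the one shown above.

```python
import itertools

def split_prices(price):
    """Return (rrp, price) from a list of zero, one, or two floats."""
    # An empty list is falsy, so `price or [None]` substitutes [None].
    cycled = itertools.cycle(price or [None])
    return next(cycled), next(cycled)

print(split_prices([3.99, 3.59]))  # two prices: rrp, then current price
print(split_prices([3.59]))        # one price: repeated for both fields
print(split_prices([]))            # no price: None for both fields
```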

The text A, List, Of, Terms, To, Extract - 1 Feb 2013 can be obtained, again, by using the contents attribute:

# jkl
mypart['other'] = [item for item in part.find('p').contents
                   if not isinstance(item, bs.Tag) and item.string.strip()]

So the full runnable code would look like this:

import BeautifulSoup as bs
import os
import re
import itertools as IT

def parse_price(text):
    try:
        return float(re.search(r'\d*\.\d+', text).group())
    except (TypeError, ValueError, AttributeError):
        return None

filename = os.path.expanduser("~/tmp/file.html")
with open(filename) as fp:
    html = fp.read()

soup = bs.BeautifulSoup(html, convertEntities=bs.BeautifulSoup.HTML_ENTITIES)

for part in soup.findAll('div', attrs={"class": re.compile('(?i)part')}):
    mypart = {}
    # abc
    price = []
    for item in part.find(attrs={"class": re.compile('abc')}).contents:
        item = parse_price(item.string)
        if item:
            price.append(item)

    price = IT.cycle(price or [None])
    mypart['rrp'], mypart['price'] = next(price), next(price)

    # jkl
    mypart['other'] = [item for item in part.find('p').contents
                       if not isinstance(item, bs.Tag) and item.string.strip()]

    print(mypart)

which yields

{'price': 3.59, 'other': [u'A, List, Of, Terms, To, Extract - 1 Feb 2013'], 'rrp': 3.99}
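The question notes that the date at the end of this text always matches the date in class ghi, so it can be removed. One possible way to do that, and to split the remaining terms, is sketched below. The helper extract_terms and the variable names are hypothetical, and the sketch assumes the date is joined to the terms by ' - ' exactly as in the sample.

```python
def extract_terms(text, date):
    """Drop a trailing ' - <date>' from text and split the rest on commas."""
    suffix = ' - ' + date
    if text.endswith(suffix):
        text = text[:-len(suffix)]
    # Ignore empty pieces that stray commas might leave behind.
    return [term.strip() for term in text.split(',') if term.strip()]

# Values copied from the sample output and the ghi div in the question.
other = 'A, List, Of, Terms, To, Extract - 1 Feb 2013'
ghi = '1 Feb 2013'
terms = extract_terms(other, ghi)
print(terms)
```

If the date is missing or formatted differently, the text is left untouched and is only split on commas.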
