python - Extract child href into a BeautifulSoup list

Question

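I'm learning Python and using BeautifulSoup to crawl a few web pages. What I want is to find the 'a' child of the first 'td' in each row, extract its href, and add it to the list. How and where should I add the href alongside the cell text?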

import urllib2

from BeautifulSoup import BeautifulSoup

def listify(table):
    """Convert an html table to a nested list""" 
    result = []
    rows = table.findAll('tr')
    for row in rows:
        result.append([])
        cols = row.findAll('td')
        for col in cols:
            strings = [_string.encode('utf8') for _string in col.findAll(text=True)]
            text = ''.join(strings)
            result[-1].append(text)
    return result

score 1 · Accepted Answer

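To find the first td, use row.find('td') instead of row.findAll('td'); it returns the first match.
To find the a child, call .find('a') again on that cell, which likewise returns the first match.
Elements behave like a Python dict: use item access to read element attributes such as href.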
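The dict-like attribute access described above can be illustrated with a small sketch. It assumes the bs4 package (BeautifulSoup 4, the successor to the BeautifulSoup 3 module imported in the question) and a made-up HTML snippet; the names sample_html, first_link and bare_anchor are hypothetical.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

# Hypothetical markup: the second anchor deliberately has no href.
sample_html = (
    '<table><tr>'
    '<td><a href="/page/1">One</a></td>'
    '<td><a>No link</a></td>'
    '</tr></table>'
)
soup = BeautifulSoup(sample_html, 'html.parser')

cells = soup.find_all('td')
first_link = cells[0].find('a')   # first <a> inside the first <td>
bare_anchor = cells[1].find('a')  # an <a> without an href attribute

print(first_link['href'])       # /page/1
print(first_link.get('href'))   # /page/1
print(bare_anchor.get('href'))  # None; bare_anchor['href'] would raise KeyError
```

Item access is convenient when the attribute is known to exist; .get() is the safer choice when some anchors may lack an href.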

Put together, that looks like this:

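cell = row.find('td')
link = cell.find('a') if cell is not None else None
# Use .get() here: `'href' in link` tests the tag's children, not its attributes.
href = link.get('href') if link is not None else None
if href is not None:
    result[-1].append(href)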
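For context, here is one way the snippet might slot into the question's listify function. This is only a sketch under assumptions: it uses bs4 on Python 3 rather than the BeautifulSoup 3 import from the question, the sample table is invented, and appending the href after the cell texts is one possible placement, not something the answer prescribes.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed


def listify(table):
    """Convert an HTML table to a nested list, one inner list per row.

    When the first <td> of a row contains an <a> with an href,
    that href is appended after the row's cell texts.
    """
    result = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        texts = [cell.get_text() for cell in cells]

        # First <a> inside the first <td>, if the row has any cells.
        link = cells[0].find('a') if cells else None
        href = link.get('href') if link is not None else None
        if href is not None:
            texts.append(href)

        result.append(texts)
    return result


# Hypothetical sample data for illustration only.
sample_html = """
<table>
  <tr><td><a href="/users/1">Alice</a></td><td>42</td></tr>
  <tr><td>Bob</td><td>7</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, 'html.parser').find('table')
rows = listify(table)
print(rows)  # [['Alice', '42', '/users/1'], ['Bob', '7']]
```

Rows without a link simply keep their cell texts. If every row needs the same number of columns, appending None for a missing href would be a reasonable variation.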
