python - Python BeautifulSoup: extract specific URLs

Question

Is it possible to get only specific URLs?

For example:

<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>

I need to output only the URLs from http://www.iwashere.com/

That is, the output URLs should be:

http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html

I did it with string logic. Is there a direct way to do this using BeautifulSoup?

score 17 · Accepted Answer

You can match on multiple aspects, including using a regular expression for an attribute value:

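import re
# soup is a BeautifulSoup object built from the HTML in the question.
# Raw string avoids invalid-escape warnings; ^ restricts the match to the start of the value.
soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/'))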

which matches (for your example):

[<a href="http://www.iwashere.com/washere.html">next</a>, <a href="http://www.iwashere.com/wasnot.html">next</a>]

That is, all <a> tags that have an href attribute whose value starts with the string http://www.iwashere.com/.
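One detail worth checking here: Beautiful Soup's documentation says a compiled pattern is applied with its search() method, so a pattern without a leading ^ matches anywhere in the attribute value, not only at the start. The sketch below uses only the re module to show the difference. The redirect URL is a made-up value for illustration and does not come from the question.

```python
import re

# Sample href values. The last one is hypothetical: it only contains the
# target site inside its query string.
hrefs = [
    "http://www.iwashere.com/washere.html",
    "http://www.heelo.com/hello.html",
    "http://example.org/redirect?to=http://www.iwashere.com/wasnot.html",
]

unanchored = re.compile(r"http://www\.iwashere\.com/")
anchored = re.compile(r"^http://www\.iwashere\.com/")

# search() scans the whole string, so the unanchored pattern also
# accepts the redirect URL.
contains = [h for h in hrefs if unanchored.search(h)]
starts_with = [h for h in hrefs if anchored.search(h)]

print(contains)
print(starts_with)
```

With the anchor, only values that really begin with the site prefix are kept.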

You can loop over the results and pick out just the href attribute:

>>> for elem in soup.find_all('a', href=re.compile(r'^http://www\.iwashere\.com/')):
...     print(elem['href'])
... 
http://www.iwashere.com/washere.html
http://www.iwashere.com/wasnot.html
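For reference, here is a minimal end-to-end sketch of the same idea that can be run as a script. It assumes the bs4 package is installed and uses the standard-library html.parser backend; that parser choice, and the names html, pattern and urls, are assumptions made for this example.

```python
import re
from bs4 import BeautifulSoup

# The HTML fragment from the question.
html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# html.parser is an assumption; any parser supported by bs4 should work.
soup = BeautifulSoup(html, "html.parser")

# Keep only <a> tags whose href starts with the wanted prefix.
pattern = re.compile(r"^http://www\.iwashere\.com/")
urls = [a["href"] for a in soup.find_all("a", href=pattern)]

print(urls)
```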

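To match all relative paths instead, use a negative look-ahead assertion that tests whether the value does not start with a scheme (e.g. http: or mailto:) or a double slash (//hostname/path); any value that passes this test must be a relative path: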

soup.find_all('a', href=re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))'))
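The look-ahead can be checked on its own, without Beautiful Soup. This is a small sketch using only the re module; the sample values are invented for illustration.

```python
import re

# Same pattern as above: succeed only if the value does NOT start with
# "scheme:" or with "//".
relative = re.compile(r'^(?!(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//))')

# Hypothetical href values covering each case.
samples = [
    "page.html",                             # relative
    "/docs/index.html",                      # relative to the site root
    "../up/one.html",                        # relative
    "http://www.iwashere.com/washere.html",  # absolute, has a scheme
    "mailto:someone@example.com",            # has a scheme
    "//hostname/path",                       # protocol-relative
]

relative_only = [s for s in samples if relative.search(s)]
print(relative_only)
```

Note that the pattern also accepts an empty string or a bare fragment such as #top, since neither starts with a scheme or a double slash.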

score 6 · Accepted Answer

If you are using BeautifulSoup 4.0.0 or later:

soup.select('a[href^="http://www.iwashere.com/"]')
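The [href^="..."] form is the CSS attribute selector for "value begins with". A self-contained sketch follows, assuming bs4 is installed with CSS selector support and using the html.parser backend; the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question.
html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

# html.parser is an assumption; other bs4 parsers should behave the same here.
soup = BeautifulSoup(html, "html.parser")

# ^= matches attribute values that begin with the given string.
selected = [a["href"] for a in soup.select('a[href^="http://www.iwashere.com/"]')]

print(selected)
```

Because the selector is a literal prefix match, no regular-expression escaping is needed.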

score 0 · Accepted Answer

You can solve this with partial matching in gazpacho.

Input:

html = """\
<a href="http://www.iwashere.com/washere.html">next</a>
<span class="class">...</span>
<a href="http://www.heelo.com/hello.html">next</a>
<span class="class">...</span>
<a href="http://www.iwashere.com/wasnot.html">next</a>
<span class="class">...</span>
"""

Code:

from gazpacho import Soup

soup = Soup(html)
links = soup.find('a', {'href': "http://www.iwashere.com/"}, partial=True)
[link.attrs['href'] for link in links]

Which will output:

# ['http://www.iwashere.com/washere.html', 'http://www.iwashere.com/wasnot.html']
