python - BeautifulSoup: how to get the URL from an img src when it contains ../..

Question

Let's say I was trying to get the links to specific images, like this:

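import os
import urllib2
import urlparse  # Python 2 module; in Python 3 it is urllib.parse

from bs4 import BeautifulSoup

# BeautifulSoup parses markup, not URLs, so fetch the page first
html = urllib2.urlopen("http://examplesite.com").read()
soup = BeautifulSoup(html, "html.parser")
for image in soup.find_all("img"):
    src = image["src"]  # raw value of the src attribute
    srcd = urlparse.urlparse(src)
    path = srcd.path  # gets the path
    fn = os.path.basename(path)  # gets filename

# let's say the webpage I was scraping had its images like this:
# <img src="../..someimage.jpg" />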

Is there an easy way to get the full URL from that? Or do I need to use a regex?

score 2 · Accepted Answer

Use urlparse.urljoin:

>>> import urlparse
>>> base_url = "http://example.com/foo/"
>>> urlparse.urljoin(base_url, "../bar")
'http://example.com/bar'
>>> urlparse.urljoin(base_url, "/baz")
'http://example.com/baz'
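
To connect this back to the question, here is a minimal sketch of applying urljoin to each src value and then pulling out the filename. It assumes Python 3, where the same functions live in urllib.parse rather than the Python 2 urlparse module used above. The helper name resolve_image_urls, the page URL, and the sample src values are all made up for illustration.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image_urls(page_url, srcs):
    """Return (absolute_url, filename) pairs for the given img src values.

    resolve_image_urls is a hypothetical helper, not part of any library.
    """
    results = []
    for src in srcs:
        # urljoin resolves "../" segments and leading "/" against the page URL
        absolute = urljoin(page_url, src)
        # URL paths always use "/", so posixpath is used for the basename
        filename = posixpath.basename(urlparse(absolute).path)
        results.append((absolute, filename))
    return results


# Hypothetical page URL and src values, for illustration only
page_url = "http://example.com/gallery/2013/index.html"
srcs = ["../../someimage.jpg", "/static/logo.png", "thumb.gif"]

for absolute, filename in resolve_image_urls(page_url, srcs):
    print(absolute, filename)
```

With BeautifulSoup, the src values could be collected with something like [img["src"] for img in soup.find_all("img", src=True)], and page_url should be the URL the HTML was actually fetched from, since relative paths are resolved against it.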
