python - Download image files from the HTML page source using Python?

Question

I am writing a scraper that downloads all the image files from an HTML page and saves them to a specific folder. All the images are part of the HTML page.

score 88 · Accepted Answer

Here is some code that downloads all the images from the supplied URL and saves them in the specified output folder. You can modify it to suit your needs.

"""
dumpimages.py
    Downloads all the images on the supplied URL, and saves them to the
    specified output folder ("/test/" by default)

Usage:
    python dumpimages.py http://example.com/ [output]
"""
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse, urlunparse
from urllib.request import urlopen, urlretrieve
import os
import sys

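def main(url, out_folder="/test/"):
    """Downloads all the images at 'url' to out_folder ("/test/" by default)"""
    # Name the parser explicitly; otherwise bs4 picks one and emits a warning.
    soup = bs(urlopen(url), "html.parser")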
    parsed = list(urlparse(url))

    # src=True skips <img> tags that have no src attribute.
    for image in soup.find_all("img", src=True):
        print("Image: %(src)s" % image)
        filename = image["src"].split("/")[-1]
        parsed[2] = image["src"]
        outpath = os.path.join(out_folder, filename)
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
        else:
            urlretrieve(urlunparse(parsed), outpath)

def _usage():
    print("usage: python dumpimages.py http://example.com [outpath]")

if __name__ == "__main__":
    # Expect one or two arguments: the page URL, then an optional output folder.
    args = sys.argv[1:]
    if len(args) not in (1, 2) or not args[0].lower().startswith("http"):
        _usage()
        sys.exit(1)
    main(*args)

Edit: You can now specify the output folder.
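
One caveat with the script above: taking image["src"].split("/")[-1] as the filename keeps any query string and gives an empty name when the URL ends in a slash. Here is a minimal sketch of a more defensive helper. The name filename_from_url and the fallback "image" are my own assumptions, not part of the original answer.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(image_url, default="image"):
    """Return a usable file name for image_url (hypothetical helper)."""
    # Only the path component matters; the query string and fragment are dropped.
    path = unquote(urlparse(image_url).path)
    name = posixpath.basename(path)
    # Fall back when the path ends in "/" or names a directory entry.
    if name in ("", ".", ".."):
        return default
    return name


print(filename_from_url("http://example.com/img/logo.png?v=3"))  # logo.png
```

The result can be passed to os.path.join(out_folder, ...) in place of the split-based name.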

score 13 · Accepted Answer

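Ryan's solution is good, but it fails when the image source URLs are absolute, or in any other case where simply concatenating them to the main page URL does not give a valid result. urljoin recognizes absolute and relative URLs, so replace the loop in the middle with: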

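# Requires: from urllib.parse import urljoin
for image in soup.find_all("img", src=True):
    print("Image: %(src)s" % image)
    # urljoin leaves absolute URLs alone and resolves relative ones against url.
    image_url = urljoin(url, image["src"])
    filename = image["src"].split("/")[-1]
    outpath = os.path.join(out_folder, filename)
    urlretrieve(image_url, outpath)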
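
To make the difference concrete, here is a small sketch of what urljoin returns for the three kinds of src value. The page URL and the src values are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and src values, chosen only to show the three cases.
page_url = "http://example.com/gallery/index.html"

absolute = urljoin(page_url, "http://cdn.example.com/a.png")
root_relative = urljoin(page_url, "/static/b.png")
relative = urljoin(page_url, "thumbs/c.png")

print(absolute)       # http://cdn.example.com/a.png (returned unchanged)
print(root_relative)  # http://example.com/static/b.png (path replaced)
print(relative)       # http://example.com/gallery/thumbs/c.png (resolved against /gallery/)
```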

score 8 · Accepted Answer

And this is a function for downloading one image:

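import urllib.request

def download_photo(self, img_url, filename):
    # DOWNLOADED_IMAGE_PATH is a constant defined elsewhere in the module.
    file_path = "%s%s" % (DOWNLOADED_IMAGE_PATH, filename)

    # The with block closes both the connection and the file, even on error.
    with urllib.request.urlopen(img_url) as image_on_web, \
            open(file_path, "wb") as downloaded_image:
        # Copy in 64 KiB chunks so a large image is never held in memory whole.
        while True:
            buf = image_on_web.read(65536)
            if len(buf) == 0:
                break
            downloaded_image.write(buf)

    return file_path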
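
The same chunked copy can also be left to the standard library. This is a minimal sketch using shutil.copyfileobj; the standalone function download_to and its parameters are assumptions for illustration, not part of the answer above.

```python
import shutil
import urllib.request


def download_to(img_url, file_path, chunk_size=65536):
    """Stream img_url into file_path and return file_path (hypothetical helper)."""
    with urllib.request.urlopen(img_url) as response, open(file_path, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time until EOF.
        shutil.copyfileobj(response, out, chunk_size)
    return file_path
```

Usage would look like download_to("http://example.com/logo.png", "logo.png").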

score 8 · Accepted Answer

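You have to download the page, parse the HTML document, find your images with a regular expression, and download them. You can use urllib2 (urllib.request in Python 3) for downloading and Beautiful Soup for parsing the HTML file.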
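
A rough sketch of the regular-expression step, using only the standard library. The pattern, the function name find_image_urls and the sample HTML are my own assumptions; a regex like this is fragile (for example, it also matches attributes such as data-src), so a real parser is the safer choice for arbitrary pages.

```python
import re
from urllib.parse import urljoin

# Matches the quoted src value of an <img> tag; the pattern is illustrative only.
IMG_SRC_RE = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def find_image_urls(html, base_url):
    """Return absolute URLs for every <img src=...> found in html."""
    return [urljoin(base_url, src) for src in IMG_SRC_RE.findall(html)]


sample = '<p><img src="a.png"><IMG alt="x" SRC=\'/b.jpg\'></p>'
print(find_image_urls(sample, "http://example.com/dir/page.html"))
# ['http://example.com/dir/a.png', 'http://example.com/b.jpg']
```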

score 3 · Accepted Answer

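Use htmllib to extract all img tags (override do_img), then use urllib2 to download all the images. Both are Python 2 modules; in Python 3 their counterparts are html.parser and urllib.request.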
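
A Python 3 sketch of the same idea with html.parser, where overriding handle_starttag plays the role of do_img. The class name ImgCollector and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects absolute image URLs from the HTML fed to it (hypothetical class)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            self.image_urls.append(urljoin(self.base_url, src))


collector = ImgCollector("http://example.com/dir/page.html")
collector.feed('<IMG SRC="a.png"><img alt="no src"><img src="/b.jpg"/>')
print(collector.image_urls)
# ['http://example.com/dir/a.png', 'http://example.com/b.jpg']
```

Self-closing tags such as `<img ... />` are covered as well, because the default handle_startendtag calls handle_starttag.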

score 1 · Accepted Answer

If the request requires authorization, see the following, which uses requests with HTTP basic authentication:

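import requests

# img_url, username and password are placeholders to be supplied by the caller.
r_img = requests.get(img_url, auth=(username, password))
r_img.raise_for_status()  # stop here if the server rejected the request
with open('000000.jpg', 'wb') as f:
    f.write(r_img.content)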
