python - Extracting images from a website using Selenium Webdriver (Python)

Question

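I need to crawl several thousand subsites and extract information from them.

Unfortunately, the information in question is not ordinary HTML text but images in which the text has been rendered dynamically.

How can I extract these images for further processing? I am using Selenium Webdriver with Python.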

score 1 · Accepted Answer

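There is very little that mechanize plus BeautifulSoup cannot do. Further processing of the images could be done with pytesser, but I have no experience with it. Advice from anyone familiar with OCR in Python would be welcome.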

import mechanize
from bs4 import BeautifulSoup

browser = mechanize.Browser()
html = browser.open("http://www.dreamstime.com/free-photos").read()
soup = BeautifulSoup(html, "html.parser")
for ii, image in enumerate(soup.find_all('img')):
    # Some <img> tags have no src attribute; treat those as empty.
    _src = image.get('src', '')
    # Keep only absolute http:// URLs that point at JPEG files.
    if _src.startswith('http://') and _src.endswith('.jpg'):
        print('Storing this image:', _src)
        data = browser.open(_src).read()
        fl = 'image' + str(ii) + '.jpg'
        # The with block closes the file on exit.
        with open(fl, 'wb') as f:
            f.write(data)
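
Since the question asks about Selenium, the same idea could be sketched with WebDriver instead of mechanize. This is a rough sketch rather than a tested solution: the function names `decode_data_uri` and `save_page_images` are hypothetical, it assumes Firefox with a matching driver is available, and it assumes that dynamically rendered images either arrive as base64 `data:` URIs or can be captured as element screenshots.

```python
import base64
from pathlib import Path


def decode_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64 data: URI, else None."""
    if not src or not src.startswith("data:"):
        return None
    header, sep, payload = src.partition(",")
    if not sep or not header.endswith(";base64"):
        return None
    mime_type = header[len("data:"):].split(";")[0]
    return mime_type, base64.b64decode(payload)


def save_page_images(url, out_dir="images"):
    """Save every visible <img> on the page; return the paths written."""
    # Imported lazily so decode_data_uri works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    driver = webdriver.Firefox()  # assumption: Firefox and its driver are available
    try:
        driver.get(url)
        for i, img in enumerate(driver.find_elements(By.TAG_NAME, "img")):
            if not img.is_displayed():
                continue  # hidden elements cannot be screenshotted reliably
            decoded = decode_data_uri(img.get_attribute("src"))
            if decoded:
                mime_type, data = decoded
                suffix = "." + mime_type.split("/")[-1]
            else:
                # Fall back to a PNG screenshot of the element as rendered.
                data = img.screenshot_as_png
                suffix = ".png"
            path = out / f"image{i}{suffix}"
            path.write_bytes(data)
            written.append(path)
    finally:
        driver.quit()
    return written
```

Calling `save_page_images("http://www.dreamstime.com/free-photos")` would write the files into an `images` directory. The screenshot fallback captures the image at its on-screen size, which may be smaller than the original file.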

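For the OCR step, one possible starting point is sketched below. It swaps in pytesseract and Pillow in place of the pytesser package named above, and it assumes the Tesseract binary is installed. The helper names `normalize_ocr_text` and `image_to_text` are hypothetical, and the greyscale conversion is only a heuristic.

```python
def normalize_ocr_text(raw):
    """Collapse the runs of whitespace that OCR output often contains."""
    return " ".join(raw.split())


def image_to_text(path, lang="eng"):
    """Run Tesseract on one image file and return the cleaned-up text."""
    # Imported lazily so normalize_ocr_text works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        # Greyscale input is sometimes easier for Tesseract to read.
        raw = pytesseract.image_to_string(img.convert("L"), lang=lang)
    return normalize_ocr_text(raw)
```

Something like `image_to_text("image0.jpg")` could then be run over each stored file. Recognition quality depends heavily on the font and resolution of the rendered text, so results should be spot-checked.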