python - Scrapy error: exceptions.IOError: cannot identify image file

Question

I keep getting the following error, with no image file name or response URL that would let me track it down:

2012-08-20 08:14:34+0000 [spider] Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
    self._runCallbacks()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
--- <exception caught here> ---
  File "/usr/lib/pymodules/python2.7/scrapy/contrib/pipeline/images.py", line 204, in media_downloaded
    checksum = self.image_downloaded(response, request, info)
  File "/usr/lib/pymodules/python2.7/scrapy/contrib/pipeline/images.py", line 252, in image_downloaded
    for key, image, buf in self.get_images(response, request, info):
  File "/usr/lib/pymodules/python2.7/scrapy/contrib/pipeline/images.py", line 261, in get_images
    orig_image = Image.open(StringIO(response.body))
  File "/usr/lib/python2.7/dist-packages/PIL/Image.py", line 1980, in open
    raise IOError("cannot identify image file")
exceptions.IOError: cannot identify image file

How can I fix this? It matters because the spider stops once a certain number of errors, which I have already set in settings.py, has occurred.

score 3 · Accepted Answer

The offending line is the call to PIL's Image.open() in scrapy.contrib.pipeline.images.ImagesPipeline:

def get_images(self, response, request, info):
    key = self.image_key(request.url)
    orig_image = Image.open(StringIO(response.body))
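This error often means the response body is not image data at all, for example an HTML error page or an empty body. As a rough diagnostic sketch, the helper below guesses the type from the leading bytes so the log can show what was actually downloaded. `sniff_image_type` and `describe_body` are hypothetical names, not part of Scrapy or PIL, and only three common signatures are checked, so a `None` result does not prove that PIL would reject the data.

```python
# Hypothetical helpers (not part of Scrapy or PIL): guess an image type from
# the leading "magic" bytes of a downloaded body.
IMAGE_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_type(body):
    """Return a short type name if body starts with a known signature, else None."""
    for signature, name in IMAGE_SIGNATURES:
        if body.startswith(signature):
            return name
    return None


def describe_body(url, body, preview=40):
    """Build a one-line description of a response body for a log message."""
    kind = sniff_image_type(body)
    if kind is not None:
        return "%s looks like %s (%d bytes)" % (url, kind, len(body))
    # Show the first few bytes so an HTML page or empty body is easy to spot.
    return "%s is not a recognised image (%d bytes), starts with %r" % (
        url, len(body), body[:preview])
```

Calling `describe_body(request.url, response.body)` from a pipeline's exception handler would be one way to surface the URL that the traceback leaves out.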

The try block in media_downloaded() catches the exception, but its handler then logs it as an error itself:

except Exception:
    log.err(spider=info.spider)

You could hack that file as follows:

try:
    key = self.image_key(request.url)
    checksum = self.image_downloaded(response, request, info)
except ImageException, ex:
    log.msg(str(ex), level=log.WARNING, spider=info.spider)
    raise
except IOError, ex:
    log.msg(str(ex), level=log.WARNING, spider=info.spider)
    raise ImageException
except Exception:
    log.err(spider=info.spider)
    raise ImageException
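The `except IOError, ex` spelling above is Python 2 only. The same idea (downgrade the IOError to a warning that names the URL, then re-raise it as a pipeline-level exception) might look like this as a standalone sketch in the `except ... as` form, which also runs on Python 3. `ImageException`, `identify`, and `checked_identify` are stand-ins defined here for illustration, not Scrapy's or PIL's own objects.

```python
import logging

logger = logging.getLogger("images_sketch")


class ImageException(Exception):
    """Stand-in for the pipeline's 'bad image' exception (defined for this sketch)."""


def identify(body):
    """Stand-in for Image.open(): accepts only data with a PNG signature."""
    if not body.startswith(b"\x89PNG\r\n\x1a\n"):
        raise IOError("cannot identify image file")
    return "png"


def checked_identify(url, body):
    """Log an IOError as a warning that includes the URL, then re-raise it."""
    try:
        return identify(body)
    except IOError as ex:  # "as" form of "except IOError, ex"
        logger.warning("%s: %s", url, ex)
        raise ImageException("%s: %s" % (url, ex))
```

Because the warning carries the URL, the failing image can be traced without touching the error count that a logged error would increase.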

However, I would recommend creating your own pipeline instead and overriding the image_downloaded() method in your pipelines.py file:

from scrapy import log
from scrapy.contrib.pipeline.images import ImagesPipeline

class BkamImagesPipeline(ImagesPipeline):

    def image_downloaded(self, response, request, info):
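        try:
            # Return the checksum so media_downloaded() still receives it.
            return super(BkamImagesPipeline, self).image_downloaded(response, request, info)
        except IOError, ex:
            # Log a warning that names the failing URL instead of an error.
            log.msg('%s: %s' % (request.url, ex), level=log.WARNING, spider=info.spider)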

Make sure to declare this pipeline in your settings file:

ITEM_PIPELINES = [
    'bkam.pipelines.BkamImagesPipeline',
]
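On Scrapy releases newer than the one in the question, `ITEM_PIPELINES` is written as a dict that maps each pipeline path to an integer order rather than as a list. A sketch of the equivalent settings under that assumption follows; the order value 300 and the `IMAGES_STORE` path are placeholders, not values taken from the original project.

```python
# settings.py sketch for newer Scrapy releases (values are placeholders).
ITEM_PIPELINES = {
    'bkam.pipelines.BkamImagesPipeline': 300,  # lower numbers run earlier
}

# Directory where the images pipeline stores downloaded files.
IMAGES_STORE = '/path/to/images'
```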
