python - How can I create a custom link extractor in Scrapy?

Question

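I want to write a custom Scrapy link extractor for extracting links.

The Scrapy documentation says that there are two built-in extractors:

http://doc.scrapy.org/en/latest/topics/link-extractors.html

However, I have not seen any code example showing how to implement a custom link extractor. Can someone give an example of writing a custom extractor?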

score 7 · Accepted Answer

Here is an example of a custom link extractor.

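# Imports assumed for the old Scrapy 0.x / Python 2 code this answer targets;
# the module paths may differ in the Scrapy version you have installed.
from urlparse import urljoin

from w3lib.html import remove_entities, remove_tags, replace_escape_chars
from scrapy.link import Link
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.linkextractors.regex import clean_link, linkre


class RCP_RegexLinkExtractor(SgmlLinkExtractor):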
    """High performant link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]

Usage

rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #     [a-z]{2} - matches a two-character state abbreviation
            #     [a-z]+   - matches a state name (one or more lowercase letters)
            #     [0-9]{4} - matches a 4-digit unique webpage identifier

            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default 
        process_links='processLinks',
        process_request='processRequest',
    ),
)
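The allow pattern above can be tried outside of Scrapy before running a crawl. This is a minimal sketch that assumes the pattern is applied as a regex search against the absolute URL; the sample URLs and the helper name is_allowed are made up for illustration and are not taken from the site or from Scrapy.

```python
import re

# Pattern copied from the allow= argument above.
ALLOW = re.compile(
    r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"
)


def is_allowed(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, invented for illustration only.
samples = [
    "http://www.realclearpolitics.com/epolls/2012/president/oh/ohio_romney_vs_obama-1860.html",
    "http://www.realclearpolitics.com/epolls/2012/president/nh/new_hampshire_romney_vs_obama-2030.html",
    "http://www.realclearpolitics.com/epolls/2012/senate/oh/ohio_senate-2100.html",
]

for url in samples:
    print(is_allowed(url), url)
```

Note that [a-z]+ does not match an underscore, so a state name written with one (as in the second made-up URL) would be rejected by this pattern.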

See the full project here: https://github.com/jtfairbank/RCP-Poll-Scraper

score 2 · Accepted Answer

I had a hard time finding recent examples for this, so I decided to post a walkthrough of how I wrote a custom link extractor.

The reason why I decided to create a custom link extractor

I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:

<a href="
       /something/something.html
         " />

Supposing the page that had this link was at:

http://example.com/something/page.html

Instead of transforming this href url into:

http://example.com/something/something.html

Scrapy transformed it into:

http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20

This caused an infinite loop, because the crawler kept going deeper and deeper into those badly interpreted URLs.
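The effect of the stray whitespace can be checked with the standard library alone. This is a small sketch in Python 3 (the answer itself targets Python 2); the helper name join_stripped is hypothetical. The result of the naive join differs between Python versions, so it is only printed here, while the stripped join is the one to rely on.

```python
from urllib.parse import urljoin

PAGE_URL = "http://example.com/something/page.html"
RAW_HREF = "\n       /something/something.html\n         "


def join_stripped(base, href):
    """Strip surrounding whitespace from href before resolving it."""
    return urljoin(base, href.strip())


# Naive join: the outcome depends on the Python version, so just inspect it.
print(repr(urljoin(PAGE_URL, RAW_HREF)))

# Stripped join: resolves to the intended absolute URL.
print(join_stripped(PAGE_URL, RAW_HREF))
```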

I tried the process_value parameter of LxmlLinkExtractor and the process_links parameter of Rule, as suggested elsewhere, without luck, so I decided to patch the method that processes relative URLs.

Finding the original code

As of the current version of Scrapy (1.0.3), the recommended link extractor is LxmlLinkExtractor.

If you want to extend LxmlLinkExtractor, check how the code looks in the Scrapy version that you are using.

You can probably open the directory of your installed Scrapy code by running this from the command line (on OS X):

open $(python -c 'import site; print(site.getsitepackages()[0] + "/scrapy")')
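If the open command is not available (it is specific to OS X), the same location can be found from Python itself. This is a sketch for Python 3 using importlib; the function name package_dir is hypothetical, and it returns None when the package is not installed in the current environment.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an importable top-level package, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


# Prints the Scrapy source directory, or None if Scrapy is not installed.
print(package_dir("scrapy"))
```

From there, the file mentioned below can be opened in any editor.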

In the version that I use (1.0.3) the code of LxmlLinkExtractor is in:

scrapy/linkextractors/lxmlhtml.py

There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.

So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The original version of the single line I modified is kept as a comment.

# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *
_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):

            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()

            attr_val = urljoin(base_url, attr_val.strip())

            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links


# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs

        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

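        # Deliberately skip LxmlLinkExtractor.__init__ (it would build the stock parser)
        # and call the next initializer in the MRO with the custom parser instead.
        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,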
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)

And when defining the rules, I use CustomLinkExtractor:

from scrapy.spiders import Rule


rules = (
    Rule(
        CustomLinkExtractor(
            canonicalize=False,
            # Raw string, so the regex reaches the re module unchanged.
            allow=[r'^https?://example\.com/something/.*'],
        ),
        callback='parse_item',
        follow=True,
    ),
)
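The core of the patch is the order of operations: strip the attribute value first, then resolve it against the base URL. Here is a framework-free sketch of that idea using only the Python 3 standard library. It is not Scrapy's implementation; the class StrippingLinkParser and the function extract_links are hypothetical names used for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StrippingLinkParser(HTMLParser):
    """Collect absolute, de-duplicated links from <a> and <area> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area"):
            return
        for name, value in attrs:
            if name == "href" and value:
                # Strip first, then resolve: the same order as the patched line.
                url = urljoin(self.base_url, value.strip())
                if url not in self.links:
                    self.links.append(url)


def extract_links(html, base_url):
    """Return the list of absolute links found in the given HTML text."""
    parser = StrippingLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


html = '<a href="\n       /something/something.html\n         " />'
print(extract_links(html, "http://example.com/something/page.html"))
```

This can be handy for checking how a problematic page behaves before touching the spider.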

score 0 · Accepted Answer

I also found LinkExtractor examples at https://github.com/geekan/scrapy-examples and https://github.com/mjhea0/Scrapy-Samples

(Edited after the needed information could not be found at the links above.)

More precisely: https://github.com/geekan/scrapy-examples/search?utf8=%E2%9C%93&q=linkextractors&type=Code and https://github.com/mjhea0/Scrapy-Samples/search?utf8=%E2%9C%93&q=リンクエクストラクタ
