python - If I have a scraper for a web page, is it possible to make it work on the additional pages as well?

Question

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        try:
            urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
        site = urlopen(url)   
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

My code opens only one page for each URL in the file. Sometimes there are more pages to follow; in that case the pattern for the next pages is &page=x.

Here are the pages I am talking about:

http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track
http://www.last.fm/user/TheBladeRunner_/library/tags?tag=long+track&page=7

score 1 · Accepted Answer

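You can read the href attribute from the next-page link and append it to your list of URLs (yes, you need to change the generator expression to a list so that it can be appended to). It could look something like this: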

from twill.commands import *
from bs4 import BeautifulSoup
from urllib import urlopen
import urllib2
import urlparse

with open('urls.txt') as inf:
    urls = [line.strip() for line in inf]
    for url in urls:
        try:
            # Fetch the page once and reuse the response for parsing.
            site = urllib2.urlopen(url)
        except urllib2.HTTPError, e:
            print e
            continue  # skip this URL instead of fetching it again
        soup = BeautifulSoup(site)
        for td in soup.find_all('td', {'class': 'subjectCell'}):
            print td.find('a').text

        next_page = soup.find_all('a', {'class': 'nextlink'})
        if next_page:
            next_page = next_page[0]
            urls.append(urlparse.urljoin(url, next_page['href']))
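One thing to watch with appending to the list while iterating over it: if a next link ever points back to a page that was already visited, the loop never ends. Below is a minimal Python 3 sketch of the same idea with an explicit queue and a set of visited URLs. The names `crawl`, `fetch`, and `find_next` are hypothetical and not part of the code above; `fetch` and `find_next` are passed in by the caller (for example a wrapper around `urllib.request.urlopen` and a BeautifulSoup lookup of the `nextlink` anchor), so the loop itself needs no network access.

```python
from collections import deque
from urllib.parse import urljoin


def crawl(start_urls, fetch, find_next):
    """Visit each start URL and keep following its 'next page' link.

    fetch(url) returns the page content; find_next(content) returns the
    href of the next-page link or None. Both are assumptions supplied by
    the caller, not functions from any library.
    """
    queue = deque(start_urls)
    seen = set()
    visited = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue  # already visited: prevents endless loops
        seen.add(url)
        content = fetch(url)
        visited.append(url)
        href = find_next(content)
        if href:
            # Resolve a relative href against the current page URL and
            # visit it before moving on to the next start URL.
            queue.appendleft(urljoin(url, href))
    return visited
```

Putting the next page at the front of the queue keeps all pages of one listing together; appending it at the back would match the ordering of the list-based version above.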

score 0 · Accepted Answer

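You could write something that collects all the links on a page and follows them, which is what Scrapy does for free.

You can create a spider that follows every link on the page. Assuming there are pagination links to the other pages, your scraper will follow them automatically.

You can achieve the same thing by parsing all the links on the page with BeautifulSoup, but why do that when Scrapy already does it for free?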
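For comparison, here is a rough Python 3 sketch of the hand-written alternative: collecting every link on a page so that the pagination links can be followed. It uses the standard library's `html.parser` rather than BeautifulSoup or Scrapy only to stay self-contained; `LinkCollector` and `collect_links` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute URLs.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Return all absolute link targets found in the given HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The returned list still contains every link on the page, so it would need filtering (for example by the `page=` parameter) and a record of visited URLs before being fed back into a crawl loop. That bookkeeping is the part Scrapy handles for you.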

score -1 · Accepted Answer

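I'm not sure I understand your question, but you might consider writing a regular expression (http://www.tutorialspoint.com/python/python_reg_expressions.htm) that matches your "next" pattern and searching for it among the URLs found on the page. I use this approach a lot when the links within a site follow a highly regular pattern.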
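As a sketch of what that could look like for the &page=x pattern from the question (Python 3, with the URLs assumed to have been extracted from the page already; `PAGE_PARAM` and `pagination_links` are hypothetical names):

```python
import re

# Matches a page=N query parameter that directly follows '?' or '&'.
PAGE_PARAM = re.compile(r"[?&]page=(\d+)(?:[&#]|$)")


def pagination_links(urls):
    """Return (page_number, url) pairs for URLs with a page parameter,
    sorted by page number."""
    found = []
    for url in urls:
        match = PAGE_PARAM.search(url)
        if match:
            found.append((int(match.group(1)), url))
    return sorted(found)
```

Anchoring the pattern on `?` or `&` keeps it from matching parameters that merely end in "page", such as `homepage=3`.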
