python - Web scraping with Python

Question

I'd like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? What modules are used? Are there any tutorials available?

score 194 · Accepted Answer

Use urllib2 in combination with the brilliant BeautifulSoup library:

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://example.com').read())

for row in soup('table', {'class': 'spad'})[0].tbody('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
    # will print date and sunrise
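
The snippet above is Python 2 (urllib2 and the print statement). For Python 3, a rough equivalent might look like the sketch below, assuming BeautifulSoup 4 is installed. The table class "spad" and the column order are taken from the snippet above; the helper names extract_rows and fetch_rows and the sample markup are made up for illustration.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def extract_rows(html):
    """Return (date, sunrise) pairs from the first table with class 'spad'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="spad")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skip header rows that only contain <th> cells
            rows.append((tds[0].get_text(strip=True), tds[1].get_text(strip=True)))
    return rows


def fetch_rows(url):
    """Download the page at url and extract its rows (needs network access)."""
    with urlopen(url) as response:
        return extract_rows(response.read())


# Made-up sample markup so the parsing step can be tried offline.
SAMPLE_HTML = """
<table class="spad">
  <tr><th>Date</th><th>Sunrise</th><th>Sunset</th></tr>
  <tr><td>1 Dec</td><td>08:18</td><td>16:10</td></tr>
  <tr><td>2 Dec</td><td>08:19</td><td>16:10</td></tr>
</table>
"""

for date, sunrise in extract_rows(SAMPLE_HTML):
    print(date, sunrise)
```

For a real page you would call fetch_rows with the site's URL instead of parsing the sample string.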

score 63 · Accepted Answer

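I'd really recommend Scrapy.

Quote from a deleted answer:

Scrapy crawling is faster than mechanize because it uses asynchronous operations (on top of Twisted).

Scrapy has better and faster support for parsing (x)html on top of libxml2.

Scrapy is a mature framework with full Unicode support; it handles redirects, gzipped responses, odd encodings, an integrated HTTP cache, and so on.

Once you are into Scrapy, you can write a spider in less than 5 minutes that downloads images, creates thumbnails, and exports the extracted data directly to csv or json.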

score 17 · Accepted Answer

I collected scripts from my web scraping work into this Bitbucket library.

Example script for your case:

from webscraping import download, xpath
D = download.Download()

html = D.get('http://example.com')
for row in xpath.search(html, '//table[@class="spad"]/tbody/tr'):
    cols = xpath.search(row, '/td')
    print 'Sunrise: %s, Sunset: %s' % (cols[1], cols[2])

Output:

Sunrise: 08:39, Sunset: 16:08
Sunrise: 08:39, Sunset: 16:09
Sunrise: 08:39, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:10
Sunrise: 08:40, Sunset: 16:11
Sunrise: 08:40, Sunset: 16:12
Sunrise: 08:40, Sunset: 16:13

score 10 · Accepted Answer

I would strongly suggest checking out pyquery. It uses jQuery-like (aka CSS-like) syntax, which makes things really easy for those coming from that background.

For your case, it would be something like:

from pyquery import *

html = PyQuery(url='http://www.example.com/')
trs = html('table.spad tbody tr')

for tr in trs:
  tds = tr.getchildren()
  print tds[1].text, tds[2].text

Output:

5:16 AM 9:28 PM
5:15 AM 9:30 PM
5:13 AM 9:31 PM
5:12 AM 9:33 PM
5:11 AM 9:34 PM
5:10 AM 9:35 PM
5:09 AM 9:37 PM

score 7 · Accepted Answer

You can use urllib2 to make the HTTP requests, and then you'll have the web content.

You can get it like this:

import urllib2
response = urllib2.urlopen('http://example.com')
html = response.read()
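
On Python 3, urllib2 no longer exists; its functionality lives in urllib.request and urllib.error. A minimal sketch of the same fetch follows. The helper name fetch_html and the 10-second default timeout are illustrative choices, not part of the answer above.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if urllib raises URLError.

    HTTPError (4xx/5xx responses) is a subclass of URLError, so it is
    handled here as well.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the response, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except URLError as exc:
        print("Request failed:", exc)
        return None
```

Calling fetch_html('http://example.com') then plays the role of the three lines above, with the body already decoded to a str.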

Beautiful Soup is a Python HTML parser that is supposed to be good for screen scraping.

In particular, see their tutorial on parsing an HTML document.

Good luck!

score 4 · Accepted Answer

I use a combination of Scrapemark (finding URLs, py2) and httplib2 (downloading images, py2+3). scrapemark.py is 500 lines of code, but it uses regular expressions, so it may not be very fast; I have not tested that.

Example for scraping your website:

import sys
from pprint import pprint
from scrapemark import scrape

pprint(scrape("""
    <table class="spad">
        <tbody>
            {*
                <tr>
                    <td>{{[].day}}</td>
                    <td>{{[].sunrise}}</td>
                    <td>{{[].sunset}}</td>
                    {# ... #}
                </tr>
            *}
        </tbody>
    </table>
""", url=sys.argv[1] ))

Usage:

python2 sunscraper.py http://www.example.com/

Result:

[{'day': u'1. Dez 2012', 'sunrise': u'08:18', 'sunset': u'16:10'},
 {'day': u'2. Dez 2012', 'sunrise': u'08:19', 'sunset': u'16:10'},
 {'day': u'3. Dez 2012', 'sunrise': u'08:21', 'sunset': u'16:09'},
 {'day': u'4. Dez 2012', 'sunrise': u'08:22', 'sunset': u'16:09'},
 {'day': u'5. Dez 2012', 'sunrise': u'08:23', 'sunset': u'16:08'},
 {'day': u'6. Dez 2012', 'sunrise': u'08:25', 'sunset': u'16:08'},
 {'day': u'7. Dez 2012', 'sunrise': u'08:26', 'sunset': u'16:07'}]

score 1 · Accepted Answer

Make your life easier by using CSS Selectors

I know I have come late to the party, but I have a nice suggestion for you.

Using BeautifulSoup has already been suggested; I would rather use CSS Selectors to scrape data inside HTML.

import random
import time

import requests
from bs4 import BeautifulSoup

main_url = "http://www.example.com"

# Assumed placeholders: the original snippet used these names without defining them.
header = [{"User-Agent": "Mozilla/5.0"}]  # list of header dicts to pick from
timeout_time = 10  # request timeout in seconds


# Fetches the URL; if that fails, it retries (waiting 20 seconds after each
# failed retry) until it succeeds.
# (I use it because my internet connection sometimes disconnects.)
def tryAgain(passed_url):
    try:
        page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
        return page
    except Exception:
        while True:
            print("Trying again the URL:")
            print(passed_url)
            try:
                page = requests.get(passed_url, headers=random.choice(header), timeout=timeout_time).text
                print("-------------------------------------")
                print("---- URL was successfully scraped ---")
                print("-------------------------------------")
                return page
            except Exception:
                time.sleep(20)
                continue


main_page_html = tryAgain(main_url)
main_page_soup = BeautifulSoup(main_page_html, "html.parser")

# Scrape the TDs inside each matching table (the selectors are placeholders)
for table in main_page_soup.select("table.class_of_table"):
    for td in table.select("td#id"):
        print(td.text)
        # For anchors inside TD
        print(td.select("a")[0].text)
        # Value of href attribute
        print(td.select("a")[0]["href"])
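
The retry loop above never gives up, so a URL that is permanently broken will keep it spinning forever. One possible alternative is a bounded retry, sketched below. The helper name retry and its parameters are hypothetical, and the sleep function is injectable only so the helper can be tried without actually waiting.

```python
import time


def retry(func, attempts=3, delay=20, sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns func()'s result, or re-raises the last exception if every try fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of tries: let the caller see the error
            sleep(delay)
```

With requests, usage could look like `page = retry(lambda: requests.get(url, timeout=10).text)`.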

score 1 · Accepted Answer

If you want to get the names of items from a specific category, you can do that by specifying the class name of that category with a CSS selector:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.flipkart.com/').text, "lxml")
for link in soup.select('div._2kSfQ4'):
    print(link.text)

These are partial search results:

Puma, USPA, Adidas & moreUp to 70% OffMen's Shoes
Shirts, T-Shirts...Under ₹599For Men
Nike, UCB, Adidas & moreUnder ₹999Men's Sandals, Slippers
Philips & moreStarting ₹99LED Bulbs & Emergency Lights
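
Class names like the one above are specific to one site and may change at any time, so it can help to try selectors on a small piece of markup first. The sketch below does that with made-up HTML; the class names "category" and "item" are invented for illustration and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; real pages will use different class names.
SAMPLE_HTML = """
<div class="category">
  <div class="item"><a href="/shoes">Men's Shoes</a></div>
  <div class="item"><a href="/bulbs">LED Bulbs</a></div>
</div>
<div class="item">Outside the category</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A descendant selector: only links nested inside the category block match.
links = [(a.get_text(strip=True), a["href"])
         for a in soup.select("div.category div.item a")]
print(links)

# select_one returns the first match, or None when nothing matches.
first = soup.select_one("div.category > div.item")
print(first.get_text(strip=True))
```

select always returns a list (possibly empty), so it is safe to loop over, while the result of select_one should be checked for None before use.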
