python - Python fetch web page data

Question

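I'm trying to write a program that fetches the HTML from the TV Catchup website and then uses the split function to cut all of that HTML down to just the channel names and the programme currently on, as a table, e.g. 1 - "programme name". I need help with what to do after the first split, if anyone can help with that.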

import urllib2
import string


proxy = urllib2.ProxyHandler({"http" : "http://c99.cache.e2bn.org:8084"})

opener = urllib2.build_opener(proxy)

urllib2.install_opener(opener)

tvCatchup = urllib2.urlopen('http://www.TVcatchup.com')

html = tvCatchup.read()

firstSplit = html.split('<a class="enabled" href="/watch.html?c=')[1:]
for i in firstSplit:
    print i

secondSplit = html.split('1" title="BBC One"></a></li><li class="v-type" style="color:#6d6d6d;">')[1:]

for i in secondSplit:
    print i

score 1 · Accepted Answer

Don't split the output; use an HTML parser of some kind instead. Beautiful Soup is a good choice.
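As a rough illustration of the parser approach, here is a minimal sketch that uses Python 3's standard-library `html.parser` rather than Beautiful Soup, so that it runs without extra installs. The sample markup is an assumption pieced together from the strings in the question's `split` calls, the programme names are invented, and `ChannelParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: the tag layout is inferred from the question's
# split strings, and the programme names are invented for illustration.
SAMPLE_HTML = (
    '<ul>'
    '<li><a class="enabled" href="/watch.html?c=1" title="BBC One"></a></li>'
    '<li class="v-type" style="color:#6d6d6d;">Example Programme A</li>'
    '</ul>'
    '<ul>'
    '<li><a class="enabled" href="/watch.html?c=2" title="BBC Two"></a></li>'
    '<li class="v-type" style="color:#6d6d6d;">Example Programme B</li>'
    '</ul>'
)


class ChannelParser(HTMLParser):
    """Collect (channel, programme) pairs from the assumed markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []            # finished (channel, programme) tuples
        self.channel = None        # title of the most recent enabled link
        self.in_programme = False  # True while inside <li class="v-type">
        self.text = []             # text fragments of the current programme

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "enabled":
            self.channel = attrs.get("title")
        elif tag == "li" and attrs.get("class") == "v-type":
            self.in_programme = True
            self.text = []

    def handle_data(self, data):
        if self.in_programme:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self.in_programme:
            self.in_programme = False
            if self.channel is not None:
                self.pairs.append((self.channel, "".join(self.text).strip()))
                self.channel = None


parser = ChannelParser()
parser.feed(SAMPLE_HTML)
parser.close()
for channel, programme in parser.pairs:
    print("%s - %s" % (channel, programme))
```

Because the parser reacts to tags and attributes rather than exact substrings, it should be less sensitive to small changes in the page than chained `split` calls, although the real page may be laid out differently from this sample.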

score 0 · Accepted Answer

It sounds like you need a screen scraper rather than substringing the HTML. A good screen-scraping tool is Scrapy, which uses XPath to retrieve the data.

Scrapy's at-a-glance page is useful: it gives a complete example of how to extract data from a web page.
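To give a feel for XPath-style selection without installing Scrapy, the sketch below uses the limited XPath subset in Python 3's standard-library `xml.etree.ElementTree`. This is not Scrapy's API, and it only works on well-formed markup, which real pages often are not. The fragment is a hypothetical sample modelled on the strings in the question, and the "Example Off" channel is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modelled on the question's split strings.
SAMPLE = (
    '<div>'
    '<ul><li><a class="enabled" href="/watch.html?c=1" title="BBC One"></a></li></ul>'
    '<ul><li><a class="disabled" href="/watch.html?c=9" title="Example Off"></a></li></ul>'
    '<ul><li><a class="enabled" href="/watch.html?c=2" title="BBC Two"></a></li></ul>'
    '</div>'
)

root = ET.fromstring(SAMPLE)

# XPath: every <a> below the root whose class attribute equals "enabled".
links = root.findall(".//a[@class='enabled']")

# Pull the attributes of interest out of each matched element.
channels = [(a.get("title"), a.get("href")) for a in links]
print(channels)
```

A similar expression could presumably be passed to a Scrapy selector, which also copes with markup that is not well-formed.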

score -1 · Accepted Answer

Don't use urllib2; use requests instead: https://github.com/kennethreitz/requests

For HTML parsing, use BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Note: this proxy appears to be down; the code works if you remove the proxy settings.

import requests
# BeautifulSoup 3 import; with bs4 installed, use "from bs4 import BeautifulSoup"
from BeautifulSoup import BeautifulSoup

proxyDict = {"http": "http://c99.cache.e2bn.org:8084"}
r = requests.get("http://www.TVcatchup.com", proxies=proxyDict)

soup = BeautifulSoup(r.text)
tvs = list()

# Each channel sits in its own <ul class="channels2">
uls = soup.findAll("ul", {"class": "channels2"})
for ul in uls:
    div = ul.find("div")
    if div:
        # The <div> carries the show id; the <a> carries the link and channel name
        showid = div.get("showid")
        link = ul.find("a")
        href = link.get("href")
        title = link.get("title")
        tvs.append({"showid": showid, "href": href, "title": title})
print tvs

This gives you:

[{'showid': u'450263', 'href': u'/watch.html?c=1', 'title': u'BBC One'}, 
{'showid': u'450353', 'href': u'/watch.html?c=2', 'title': u'BBC Two'}, 
{'showid': u'450398', 'href': u'/watch.html?c=3', 'title': u'ITV1'}, 
{'showid': u'450521', 'href': u'/watch.html?c=4', 'title': u'Channel 4'},...
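If a "channel - value" table like the one the question asks for is wanted, the list can be formatted along these lines. This is only a sketch in Python 3 (unlike the Python 2 code above): it reuses the four entries shown in the output, and the show ID stands in for the programme name, which this result does not contain. The names `rows` and `width` are illustrative.

```python
# The four entries visible in the output above (Python 3 drops the u prefix).
tvs = [
    {'showid': '450263', 'href': '/watch.html?c=1', 'title': 'BBC One'},
    {'showid': '450353', 'href': '/watch.html?c=2', 'title': 'BBC Two'},
    {'showid': '450398', 'href': '/watch.html?c=3', 'title': 'ITV1'},
    {'showid': '450521', 'href': '/watch.html?c=4', 'title': 'Channel 4'},
]

# Pad every channel name to the width of the longest one so the dashes line up.
width = max(len(tv["title"]) for tv in tvs)
rows = ["%s - %s" % (tv["title"].ljust(width), tv["showid"]) for tv in tvs]

for row in rows:
    print(row)
```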
