python - Parsing HTML with Python

Question

I want to write a Python function that fetches the content of a website. For example, I want to get the organization shown on a web page.

In the HTML below, the organization is University of Tokyo.

<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>

How can I fetch the content of a website directly, with something like get http://www.ip-adress.com/ip_tracer/157.123.22.11, without installing anything new?

score 3 · Accepted Answer

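I like BeautifulSoup; it makes it easy to access data in an HTML string. How hard the job actually is depends on how the HTML is structured. If the HTML uses "id" and "class" attributes, it is easy. If not, you have to rely on something more static, such as "take the first div, then the second list item...", which breaks badly when the page's HTML changes significantly.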
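To make the difference concrete, here is a minimal sketch of an attribute-based lookup next to a positional one. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without installing anything, which only works because the sample is well-formed; real-world HTML often is not. The table and its "Country:" row are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table used only for illustration.
TABLE = """<table>
  <tr class="even"><th>Country:</th><td>Japan</td></tr>
  <tr class="odd"><th>Organization:</th><td>University of Tokyo</td></tr>
</table>"""

root = ET.fromstring(TABLE)

# Attribute-based lookup: selects the row by its class, wherever it sits.
by_class = root.find(".//tr[@class='odd']/td").text

# Positional lookup: "second row, first cell". This silently returns the
# wrong cell (or None) as soon as a row is added or removed above it.
by_position = root.find("./tr[2]/td[1]").text

print(by_class)
print(by_position)
```

Both lookups return the same cell for this sample, but only the first one keeps doing so if the site inserts another row at the top of the table.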

To download the HTML, here is an example quoted from the BeautifulSoup documentation:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
    print

score 2 · Accepted Answer

Use BeautifulSoup:

import bs4

html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = bs4.BeautifulSoup(html, "html.parser")
univ = soup.tr.td.getText()
assert univ == u"University of Tokyo"

Edit:

If you need to fetch the HTML first, use urllib2:

import urllib2

html = urllib2.urlopen("http://example.com/").read()
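Since the question asks for something that works without installing anything new, here is a minimal sketch that does the extraction with only the Python 3 standard library (html.parser). It assumes the label sits in a th that is followed by the td holding the value, as in the snippet from the question; FieldParser and find_field are hypothetical names introduced for this example, not part of any library.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text of the first <td> that follows a <th> with a given label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.value = None
        self._tag = None          # cell currently being read: "th", "td" or None
        self._label_seen = False  # True once the matching <th> has been closed
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <th> or <td>.
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = "".join(self._buffer).strip()
        if tag == "th":
            self._label_seen = (text == self.label)
        elif tag == "td" and self._label_seen and self.value is None:
            self.value = text
            self._label_seen = False
        self._tag = None


def find_field(html, label):
    """Return the cell text for `label`, or None if the label is not found."""
    parser = FieldParser(label)
    parser.feed(html)
    parser.close()
    return parser.value


html = """<tr class="odd">
  <th>Organization:</th>
  <td>University of Tokyo</td>
</tr>
"""
print(find_field(html, "Organization:"))
```

On Python 3, urllib2 was split into urllib.request and urllib.error, so the html string would come from urllib.request.urlopen(...).read() decoded to text before being passed to find_field.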

score 0 · Accepted Answer

This website filters access by checking whether the request comes from a recognized user agent, so a plain urllib2.urlopen call gets a 403 Access Forbidden error. Here is the complete version, which sends a User-Agent header:

import urllib2
import lxml.html as lh

req = urllib2.Request("http://www.ip-adress.com/ip_tracer/157.123.22.11", headers={'User-Agent' : "Magic Browser"})
html = urllib2.urlopen(req).read()
doc=lh.fromstring(html)
print ''.join(doc.xpath('.//*[@class="odd"]')[-1].text_content().split())
>>> 
Organization:ZenithDataSystems
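For Python 3, the same User-Agent trick can be sketched with urllib.request. The helper names build_request and fetch are made up for this example, and whether the site still accepts the "Magic Browser" string is an assumption; the snippet below only builds the request and does not contact the site until fetch is called.

```python
import urllib.request

URL = "http://www.ip-adress.com/ip_tracer/157.123.22.11"


def build_request(url, user_agent="Magic Browser"):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Download the page body as bytes (performs a real network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


req = build_request(URL)
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Calling fetch(URL) returns the raw bytes, which can then be handed to lxml or BeautifulSoup as in the answers above.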
