python - Scraping a web page with LXML, malformed HTML

Translated from: https://stackoverflow.com/questions/16134319 2013-04-21T17:51:26.400

Viewed 135 times

I am trying to scrape the article text from this website, http://sana.sy/eng/21/2013/01/07/pr-460536.htm, but the HTML is malformed. Can anyone show me how to get this right?

This is the code:
import urllib2
from lxml import etree
import StringIO

speachesurls = ["http://sana.sy/eng/21/2013/01/07/pr-460536.htm", "http://sana.sy/eng/21/2012/06/04/pr-423234.htm", "http://sana.sy/eng/21/2012/01/12/pr-393338.htm"]


# scrape the speeches

for url in speachesurls:
    result = urllib2.urlopen(url)
    html = result.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO.StringIO(html), parser)
    xpath = "//html/body/table[3]/tbody/tr[3]/td[4]/table[2]/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr[2]/td/div/table/tbody/tr[2]/td/div/p"
    a = tree.find(xpath)
    print a.text_content()
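
One possible fix, offered as a sketch rather than a confirmed solution. Two things in the code above are likely to go wrong regardless of how broken the page is: text_content() is a method of lxml.html elements, so a tree built with etree.HTMLParser does not provide it, and the tbody elements that browser developer tools display are usually added by the browser and are often absent from the raw markup, so an XPath copied from a browser may match nothing. The example below parses with lxml.html.fromstring, which tolerates unclosed tags, and uses a short relative XPath. SAMPLE_HTML, extract_paragraphs, and the //td//p expression are hypothetical stand-ins; the real page structure at sana.sy has not been checked, and the example is written for Python 3 rather than the Python 2 used in the question.

```python
import lxml.html

# Hypothetical stand-in for a downloaded page: the <p> and <td> tags are
# never closed, and the markup contains no <tbody> element.
SAMPLE_HTML = """
<html><body>
<table><tr><td>
<div><p>First paragraph of the speech.
<p>Second <b>paragraph</b> of the speech.
</div>
</table>
</body></html>
"""


def extract_paragraphs(html_text):
    """Return the text of every <p> found inside a table cell.

    Whitespace runs (including newlines) are collapsed to single spaces.
    """
    # lxml.html.fromstring uses a lenient HTML parser that repairs
    # unclosed tags instead of raising an error.
    root = lxml.html.fromstring(html_text)
    # Relative XPath without tbody steps (an assumption about the layout).
    paragraphs = root.xpath("//td//p")
    # text_content() includes the text of nested elements such as <b>.
    return [" ".join(p.text_content().split()) for p in paragraphs]


if __name__ == "__main__":
    for text in extract_paragraphs(SAMPLE_HTML):
        print(text)
```

For the real pages, the downloaded bytes could be passed to the same function, with the XPath adjusted after inspecting the raw source rather than the browser's rendered DOM.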
