python - Web Crawler: how to get links from a new website

Question

I am trying to get the links from a page of a news website (one of its archive pages). I wrote the following lines of Python code.

main.py contains:

import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br =  mechanize.Browser()
htmltext = br.open(url).read()

articletext = ""
soup = BeautifulSoup(htmltext)
for tag in soup.findAll('li', attrs={"data-section":"Business"}):
    articletext += tag.contents[0]

print articletext

Example of a tag.contents[0] object: <a href="http://www.thehindu.com/business/itc-to-issue-11-bonus/article472545.ece" target="_blank">ITC to issue 1:1 bonus</a>

But when I run it, I get the following error:

File "C:\Python27\crawler\main.py", line 4, in <module>
    text = articletext.getArticle(url)
  File "C:\Python27\crawler\articletext.py", line 23, in getArticle
    return getArticleText(htmltext)
  File "C:\Python27\crawler\articletext.py", line 18, in getArticleText
    articletext += tag.contents[0]
TypeError: cannot concatenate 'str' and 'Tag' objects

Can someone help me sort this out? I am new to Python programming. Thanks.

score 3 · Accepted Answer

You are using link_dictionary vaguely. If you are not using it for reading purposes, try the following code:

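import re
import mechanize
from bs4 import BeautifulSoup

url = "http://www.thehindu.com/archive/web/2010/06/19/"

br = mechanize.Browser()
htmltext = br.open(url).read()
soup = BeautifulSoup(htmltext)

for tag_li in soup.findAll('li', attrs={"data-section": "Op-Ed"}):
    for link in tag_li.findAll('a'):
        # Follow each link and fetch the article page.
        urlnew = link.get('href')
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        soupnew = BeautifulSoup(htmltextnew)
        # Concatenate the text (a string, not a Tag) of every paragraph.
        articletext = ""
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        # Collapse runs of whitespace into single spaces before printing.
        print re.sub(r'\s+', ' ', articletext, flags=re.M)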

Note: re is for regular expressions. For this, import the re module.
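To see what the re.sub call in this answer does on its own, here is a minimal sketch. The helper name normalize_whitespace and the sample string are made up for illustration, and the trailing .strip() is an extra step that is not in the answer's code.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # .strip() is an addition for this sketch; the answer above only calls re.sub.
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample resembling text scraped from several <p> tags.
sample = "ITC to issue\n   1:1 bonus\t\tto shareholders  "
cleaned = normalize_whitespace(sample)
```

The pattern \s+ already matches newlines, so the substitution should behave the same with or without flags=re.M.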

score 2 · Accepted Answer

I suggest using the powerful XPath query language with the faster lxml module. It is as simple as this:

import urllib2
from lxml import etree

url = 'http://www.thehindu.com/archive/web/2010/06/19/'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Business']/a"):
    print '{} ({})'.format(link.text, link.attrib['href'])

Update for @data-section='Chennai':

#!/usr/bin/python
import urllib2
from lxml import etree

url = 'http://www.thehindu.com/template/1-0-1/widget/archive/archiveWebDayRest.jsp?d=2010-06-19'
html = etree.HTML(urllib2.urlopen(url).read())

for link in html.xpath("//li[@data-section='Chennai']/a"):
    print '{} => {}'.format(link.text, link.attrib['href'])
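If you want to try the attribute-predicate query without fetching the live page, the following is a rough sketch that uses the standard library's xml.etree.ElementTree as a stand-in for lxml. ElementTree supports only a subset of XPath and needs well-formed markup, so real-world HTML would still call for lxml's etree.HTML. The SAMPLE markup, the example.com URL, and the helper name links_for_section are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the archive page markup.
SAMPLE = """
<ul>
  <li data-section="Business"><a href="http://www.thehindu.com/business/itc-to-issue-11-bonus/article472545.ece" target="_blank">ITC to issue 1:1 bonus</a></li>
  <li data-section="Chennai"><a href="http://example.com/chennai-story">A Chennai story</a></li>
</ul>
"""

def links_for_section(markup, section):
    """Return (text, href) pairs for <a> tags directly under matching <li> elements."""
    root = ET.fromstring(markup)
    # Same shape as the query above: li[@data-section='...']/a
    path = ".//li[@data-section='{}']/a".format(section)
    return [(a.text, a.get('href')) for a in root.findall(path)]

business_links = links_for_section(SAMPLE, "Business")
chennai_links = links_for_section(SAMPLE, "Chennai")
```

Each result is a plain (text, href) tuple of strings, so the values can be concatenated or printed without running into the 'str' and 'Tag' TypeError from the question.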
