
Goal: pass a search string to Google, run the search, and scrape the URL, the title, and the short description that is published alongside the URL and title.

I have the code below, but at the moment it only gives me the first 10 results, which is Google's default limit for one page. I don't understand how to actually handle pagination while web scraping. Also, when I compare the actual page of results with what gets printed, there is a discrepancy. And I'm not sure what the best way to parse the span elements is.

So far I have spans like the one below; I want to remove the elements and concatenate the remaining strings. What is the best way to do that?

<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span

Code:

from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')

My output looks like this:

http://www.crummy.com/software/BeautifulSoup/
<span class="st"><em>Beautiful Soup</em>: a library designed for screen-scraping HTML and XML.<br /></span>
http://pypi.python.org/pypi/BeautifulSoup/3.2.1
<span class="st"><span class="f">Feb 16, 2012 &ndash; </span>HTML/XML parser for quick-turnaround applications like screen-scraping.<br /></span>
http://www.beautifulsouptheatercollective.org/
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
http://lxml.de/elementsoup.html
<span class="st"><em>BeautifulSoup</em> is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. <em>BeautifulSoup</em> uses a different parsing <b>...</b><br /></span>
https://launchpad.net/beautifulsoup/
<span class="st">The discussion group is at: http://groups.google.com/group/<em>beautifulsoup</em> &middot; Home page <b>...</b> <em>Beautiful Soup</em> 4.0 series is  the current focus of development <b>...</b><br /></span>
http://www.poetry-online.org/carroll_beautiful_soup.htm
<span class="st"><em>Beautiful Soup BEAUTIFUL Soup</em>, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, <em>beautiful Soup</em>!<br /></span>
http://www.youtube.com/watch?v=hDG73IAO5M8
<span class="st"><span class="f">Jul 6, 2009 &ndash; </span>taken from the motion picture &quot;Alice in wonderland&quot; (1999) http://www.imdb.com/<wbr>title/tt0164993/<br /></wbr></span>
http://www.soupsong.com/
<span class="st">A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary <b>...</b><br /></span>
http://www.facebook.com/beautifulsouptc
<span class="st">To connect with The <em>Beautiful Soup</em> Theater Collective, sign up for Facebook <b>...</b> We&#39;re thrilled to announce the cast of <em>Beautiful Soup&#39;s</em> upcoming production of <b>...</b><br /></span>
http://blog.dispatched.ch/webscraping-with-python-and-beautifulsoup/
<span class="st"><span class="f">Mar 15, 2009 &ndash; </span>Recently my life has been a hype; partly due to my upcoming Python addiction. There&#39;s simply no way around it; so I should better confess it in <b>...</b><br /></span>

The Google search page results have the following structure:

<li class="g">
<div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
<h3 class="r">
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
<div class="f kv">
<div id="poS5" class="esc slp" style="display:none">
<div class="f slp">3 answers&nbsp;-&nbsp;Jan 16, 2009</div>
<span class="st">
I read this without finding the solution:
<b>...</b>
The "normal" way is to: Go to the
<em>Beautiful Soup</em>
web site,
<b>...</b>
Brian beat me too it, but since I already have
<b>...</b>
<br>
</span>
</div>
<div>
</div>
<h3 id="tbpr_6" class="tbpr" style="display:none">
</li>

Each search result is listed under an <li> element.
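For the pagination part, one possible workaround, sketched below on the assumption that Google keeps honoring the same q/num/start query parameters and the li.g markup shown above, is to loop over the start parameter and merge the pages:

from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape_paged(query, pages=3, per_page=10):
    # Sketch only: step the start parameter page by page and collect the links.
    links = []
    for page in range(pages):
        start = page * per_page
        address = "http://www.google.com/search?q=%s&num=%d&hl=en&start=%d" % (
            urllib.quote_plus(query), per_page, start)
        request = urllib2.Request(address, None, {'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(urllib2.urlopen(request).read())
        for li in soup.findAll('li', attrs={'class': 'g'}):
            sLink = li.find('a')
            if sLink is not None:
                links.append(sLink['href'])
    return links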


3 Answers


This list comprehension strips the <em> tags:

>>> sSpan
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
>>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
[None]
>>> sSpan
<span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
Answered 2012-07-17T05:14:03.180

I wrote a simple HTML-tag regex, then called the replace function on the cleaned-up string to remove the dots:

import re

# strip anything that looks like an HTML tag, then drop the dots
p = re.compile(r'<.*?>')
print p.sub('', str(sSpan)).replace('.', '')

<span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>

The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things, 
Answered 2012-07-17T17:59:57.740
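Putting the pieces together, here is a rough sketch (not from either answer) of how the loop in google_scrape could actually populate linkdictionary with the URL, link text, and cleaned-up description, reusing the li.g / span.st selectors from the question:

    for li in soup.findAll('li', attrs={'class': 'g'}):
        sLink = li.find('a')
        sSpan = li.find('span', attrs={'class': 'st'})
        if sLink is None or sSpan is None:
            continue  # skip results that do not match the expected markup
        # join the bare text nodes so every tag is dropped
        title = ''.join(sLink.findAll(text=True))
        description = ''.join(sSpan.findAll(text=True))
        linkdictionary[sLink['href']] = {'title': title, 'description': description}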