python - Stuck learning Python regular expressions and web scraping

Question

I am trying to do some web scraping with Python. My goal is to get the link to this product:

http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

This is the URL / site I am scraping:

 http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream

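When I view the page source, there is no particular ID or tag that would help me pick out the URL I need. I am also not good with regular expressions. This is what I have in Python so far: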

import urllib
import re
product = "3-Piece Reversible Bonded Leather Match Sofa Set in Cream"
productSearchUrl = product.replace(" ","+");
myurl = "http://www.fastfurnishings.com/SearchResults.asp?Search="+productSearchUrl
print myurl
htmlfile = urllib.urlopen(myurl)
htmltext = htmlfile.read()
regex = '<td valign="top" width="33%" align="center">(.+?)</td> '
r = re.compile(regex)
print re.findall(r,htmltext)

But it doesn't match anything... Any help would be appreciated.

score 3 · Accepted Answer

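I recommend using a web scraping library such as Scrapy or BeautifulSoup. It will save you a lot of hassle and let you focus on what you actually want to do with the information once you have scraped it.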

score 3 · Accepted Answer

This is why you use an HTML parser such as BeautifulSoup:

>>> import urllib2
>>> from bs4 import BeautifulSoup as BS
>>> html = urllib2.urlopen('http://www.fastfurnishings.com/SearchResults.asp?Search=3-Piece+Reversible+Bonded+Leather+Match+Sofa+Set+in+Cream')
>>> soup = BS(html)
>>> print soup.find('td', {'valign':'top', 'width':'33%', 'align':'center'}).a['href']
http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm

See how easy that was ;)
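For readers on Python 3 who do not have BeautifulSoup installed, here is a rough sketch of the same lookup. Note that it is not BeautifulSoup: it swaps in the standard library's html.parser module. The class name FirstCellLink and the sample_html string are made up for illustration, and the sample is only an assumption about what one result cell looks like, not a copy of the real page.

```python
from html.parser import HTMLParser


class FirstCellLink(HTMLParser):
    """Hypothetical helper: record the href of the first <a> inside a matching <td>."""

    # Same attributes that the soup.find() call above filters on.
    WANTED = {"valign": "top", "width": "33%", "align": "center"}

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td" and all(attrs.get(k) == v for k, v in self.WANTED.items()):
            self.in_cell = True
        elif tag == "a" and self.in_cell and self.href is None:
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        # Simplification: nested tables inside the cell are not handled.
        if tag == "td":
            self.in_cell = False


# Assumed sample markup standing in for the downloaded search results page.
sample_html = """
<table><tr>
<td valign="top" width="33%" align="center">
<a href="http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm">Sofa Set</a>
</td>
</tr></table>
"""

parser = FirstCellLink()
parser.feed(sample_html)
print(parser.href)
```

To run it against the live site you would feed it the decoded response body from urllib.request.urlopen instead of sample_html. A real parser library is still the more robust choice for messy markup.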

score 0 · Accepted Answer

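Don't do this with a regex. That said, it looks like there are newlines you haven't accounted for. By default, . in a pattern does not match a newline, so a cell whose contents span several lines never matches; re.DOTALL changes that: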

r = re.compile(regex, re.DOTALL)
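A small self-contained sketch of the difference, written for Python 3. The sample_html string is an assumed stand-in for one cell of the search results, not the real page. The pattern also leaves out the trailing space after </td> that the question's pattern has, since that space would require a literal space after the closing tag.

```python
import re

# Assumed sample: the cell's contents span several lines.
sample_html = '''<td valign="top" width="33%" align="center">
<a href="http://www.fastfurnishings.com/3-Piece-Reversible-Bonded-Leather-Match-Sofa-Set-i-p/bstrblm3p.htm">Sofa Set</a>
</td>'''

pattern = r'<td valign="top" width="33%" align="center">(.+?)</td>'

# Without DOTALL, "." cannot cross the newline right after the opening tag.
without_dotall = re.findall(pattern, sample_html)

# With DOTALL, "." also matches newlines, so the whole cell is captured.
with_dotall = re.findall(pattern, sample_html, re.DOTALL)

print(without_dotall)    # []
print(len(with_dotall))  # 1
```

Even with the flag, the capture is the raw cell contents, so the href still has to be pulled out of it afterwards, which is part of why a parser is the better tool here.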
