I am using the following code to scrape data from a website.
# -*- coding: cp1252 -*-
import urllib2
import sys
from bs4 import BeautifulSoup
page = urllib2.urlopen('http://www.att.com/shop/wireless/plans-new.html#fbid=U-XD_DHOGEp').read()
soup = BeautifulSoup(page)
plans = soup.findAll('div', {"class": "planTitle"})
for plan in plans:
    planname = u' '.join(plan.stripped_strings)
    plantypes = soup.findAll('div', {"class":"top"})
    prices = soup.findAll('div', {"class":"bottom"})
    for plantype, price in zip(plantypes, prices):
        plantype1 = u' '.join(plantype.stripped_strings)
        price1 = u' '.join(price.stripped_strings)
        print planname, plantype1, price1
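Problem: The web page referenced in the code lists 4 to 5 plan types, and each plan has 3 voice options and 2 to 3 data options. I want to scrape the data so that, for each plan, I get its voice options and the monthly price of each of those options.
The code I am running now returns every possible combination of plan name and voice option. Because entries are created even for mismatched plan name and voice option pairs, I end up with 20 to 30 entries per plan name. For example, I get "Individual Plan - 550 minutes - $59.99", even though 550 minutes and $59.99 belong to the Family Plan.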
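To make the blow-up concrete, here is a toy illustration of what I think is happening. The plan names and options in it are made-up stand-ins, not values from the real page, and it is written for Python 3. It only shows how pairing every plan name with a page-wide list of options produces more entries than the pairs I actually want.

```python
# Hypothetical stand-in data: which voice options really belong to which plan.
options_by_plan = {
    "Plan A": ["450 Minutes", "900 Minutes"],
    "Plan B": ["550 Minutes", "700 Minutes"],
}

# What a page-wide search sees: every option, with no link back to its plan.
all_options = [opt for opts in options_by_plan.values() for opt in opts]

# Nested loops pair every plan name with every option on the page.
cross_product = [(plan, opt) for plan in options_by_plan for opt in all_options]

# The pairs actually wanted: each plan with its own options only.
wanted = [(plan, opt) for plan, opts in options_by_plan.items() for opt in opts]

print(len(cross_product), len(wanted))
```

The first count grows with (number of plans) x (number of options on the whole page), while the second only grows with the number of options.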
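I want to run the loop so that only the correct plan + voice option combinations are extracted.
Web page snippet: The page shows one box per plan, and each box contains the voice options and their corresponding prices. I want to loop over each box, but the element and class combination used for the voice options and their prices is not unique. As a result, each plan name also picks up values from the other boxes.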
<div class="innerContainer">
<div class="planTitle">
<h2><a href="http://www.att.com/shop/wireless/plans/individualplans.html" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">AT&T Individual Plans</a></h2>
</div>
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-clock.jpg" alt="">
<p>Voice plan options:</p>
</div>
<!-- Begin three white boxes -->
<!-- Note, extra boxes can be added to the row with the following method -->
<!-- 1. Add more div containers inside .whiteBox -->
<!-- 2. Modify class names to boxes_one, boxes_two, boxes_three etc... (max six) -->
<div class="whiteBox">
<div class="boxes_three">
<a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_450" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830290.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoice450';return false;" aria-describedby="smartphone_individual_voice_450" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a>
<span id="smartphone_individual_voice_450" class="tips" role="tooltip">$0.45/min. for additional minutes</span>
<div class="top">
<p class="stat">450</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$39.99/mo.</p>
</div>
</div>
<div class="boxes_three">
<a class="lnk-help tooltips fullBoxLink" href="#smartphone_individual_voice_900" onclick="window.location.href = 'http://www.att.com/shop/wireless/plans/voice/sku3830292.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoice900';return false;" aria-describedby="smartphone_individual_voice_900" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010" title=""></a>
<span id="smartphone_individual_voice_900" class="tips" role="tooltip">$0.40/min. for additional minutes</span>
<div class="top">
<p class="stat">900</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$59.99/mo.</p>
</div>
</div>
<div class="boxes_three borderNone">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/plans/voice/sku3830293.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindvoiceunlim" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<div class="top">
<p class="stat">Unlimited</p>
<p class="statText">Minutes</p>
</div>
<div class="bottom">
<p>$69.99/mo.</p>
</div>
</div>
</div>
<!-- End three white boxes -->
<!-- Begin left gray container -->
<div class="containerTwoThirds">
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-globe.jpg" alt="">
<p>Data plan options:</p>
</div>
<div class="grayTwoThirds">
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/dataplus300mb-smartphone4glte-sku5380269.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata300mb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>300MB</strong></p>
<p class="statText">$20.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro3gb-smartphone4glte-sku5470232.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata3gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>3GB</strong></p>
<p class="statText">$30.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox">
<a class="fullBoxLink" href="http://www.att.com/shop/wireless/services/datapro5gb-smartphone4glte-sku5480228.html?source=IC95ATPLP00PSP00L&wtExtndSource=spinddata5gb" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"></a>
<p class="stat"><strong>5GB</strong></p>
<p class="statText">$50.00/mo.</p>
</div>
</div>
</div>
<!-- End left gray container -->
<!-- Begin right gray container -->
<div class="containerThird">
<div class="planSubTitle">
<img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/tiny-phone.jpg" alt="">
<p>Messaging plan options: <span class="fix"></span></p>
</div>
<div class="grayThird">
<div class="grayBox">
<a data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2012325" href="http://www.att.com/shop/wireless/services/messagingunlimited-sku1160055.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindmessunlim" class="fullBoxLink"></a>
<p class="stat"><strong>ULTD</strong> MSGS</p>
<p class="statText">$20.00/mo.</p>
</div>
<div class="grayBoxBreak"></div>
<div class="grayBox last">
<p class="stat"><strong>PAY PER USE</strong></p>
<p class="statText">20¢/text <span class="lightGray">|</span> 30¢/pic/video</p>
</div>
</div>
</div>
<!-- End right gray container -->
<!-- Begin sub footer -->
<div class="bottomLinks">
<div class="links">
<a href="http://www.att.com/shop/wireless/plans/individualplans.html?taxoPlan=POSTPAID-INDIVIDUAL-CANADA&source=IC95ATPLP00PSP00L&wtExtndSource=spindcanada" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">Nation with Canada Plans</a> | <a href="http://www.att.com/shop/wireless/plans/voice/sku5740279.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindhomephone" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">Unlimited Home Phone</a> | <a href="http://www.att.com/shop/wireless/plans/voice/sku3830294.html?source=IC95ATPLP00PSP00L&wtExtndSource=spindsenior" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010">Senior Plans</a>
</div>
<a class="shop_button" href="http://www.att.com/shop/wireless/devices/smartphones.html?source=IC95ATPLP00PSP00L&wtExtndSource=indshopsp" data-cqpath="/content/att/shop/en/wireless/plans-new/jcr:content/maincontent/authortext;2013010"><img src="/shopcms/media/att/2012/shop/wireless/promotions/Plans-CM_page/buttons/shop_smartphones.png" alt="Shop Smartphones" width="158" height="29"></a>
</div>
<!-- End sub footer -->
</div>
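A minimal sketch of the per-box loop I have in mind, assuming every plan sits in its own div with class "innerContainer" as in the snippet above, and that the voice options are the "top"/"bottom" pairs inside "whiteBox". I have not confirmed that the other plan boxes on the live page follow the same layout. SAMPLE_HTML is a trimmed stand-in, its second box ("Example Family Plans") is invented for illustration, and extract_voice_options is a hypothetical helper name. The only BeautifulSoup 4 calls used are find_all, find, select and find_next_sibling.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page: two plan boxes that reuse the same class names.
# The second box and its values are made up for illustration.
SAMPLE_HTML = """
<div class="innerContainer">
  <div class="planTitle"><h2><a href="#">AT&amp;T Individual Plans</a></h2></div>
  <div class="whiteBox">
    <div class="boxes_three">
      <div class="top"><p class="stat">450</p><p class="statText">Minutes</p></div>
      <div class="bottom"><p>$39.99/mo.</p></div>
    </div>
    <div class="boxes_three">
      <div class="top"><p class="stat">900</p><p class="statText">Minutes</p></div>
      <div class="bottom"><p>$59.99/mo.</p></div>
    </div>
  </div>
</div>
<div class="innerContainer">
  <div class="planTitle"><h2><a href="#">Example Family Plans</a></h2></div>
  <div class="whiteBox">
    <div class="boxes_three">
      <div class="top"><p class="stat">550</p><p class="statText">Minutes</p></div>
      <div class="bottom"><p>$59.99/mo.</p></div>
    </div>
  </div>
</div>
"""


def extract_voice_options(html):
    """Return (plan name, voice option, price) tuples, one per voice box."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # Outer loop: one iteration per plan box (assumed container class).
    for box in soup.find_all("div", class_="innerContainer"):
        title = box.find("div", class_="planTitle")
        if title is None:
            continue
        plan_name = " ".join(title.stripped_strings)
        # Search inside this box only, so other plans' options are not picked up.
        for option in box.select("div.whiteBox div.top"):
            # The price is the "bottom" div that follows this "top" div.
            price = option.find_next_sibling("div", class_="bottom")
            if price is None:
                continue
            rows.append((plan_name,
                         " ".join(option.stripped_strings),
                         " ".join(price.stripped_strings)))
    return rows


for row in extract_voice_options(SAMPLE_HTML):
    print(" | ".join(row))
```

The difference from my current code is that the inner searches are called on box rather than on soup, so each plan name is only ever paired with the options found inside its own container.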
I am new to programming, so any help solving this problem would be appreciated.