I am scraping data from the following HTML structure on 30-40 web pages like this one: https://www.o2.co.uk/shop/tariffs/sony/xperia-z-purple/
<td class="monthlyCost">£13<span>.50</span></td>
<td class="phoneCost">£479.99</td>
<td><span class="lowLight">24 Months</span></td>
<td>50</td>
<td>Unlimited</td>
<td class="dataAllowance">100MB</td>
<td class="extras">
I am indexing into the results to scrape the data under the td tags that have no class (such as 50 and Unlimited), which correspond to the Minutes and Texts columns in my dataset. The code I am using is:
results = tariff_link_soup.findAll('td', {"class": None})
minutes = results[1]
texts = results[2]
print minutes,texts
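For reference, here is a minimal sketch of what I expect that filter to match, run against the sample row pasted above (the BeautifulSoup import line and the wrapping table tag are just assumptions so the snippet is self-contained):
# -*- coding: utf-8 -*-
# minimal sketch against the sample row pasted above
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (BS3), whichever is installed

sample = '''<table><tr>
<td class="monthlyCost">£13<span>.50</span></td>
<td class="phoneCost">£479.99</td>
<td><span class="lowLight">24 Months</span></td>
<td>50</td>
<td>Unlimited</td>
<td class="dataAllowance">100MB</td>
<td class="extras"></td>
</tr></table>'''

soup = BeautifulSoup(sample)
for td in soup.findAll('td', {"class": None}):
    # I expect only the class-less cells here: "24 Months", "50", "Unlimited"
    print td.get('class'), td.text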
All of these 30-40 links are listed on the https://www.o2.co.uk/shop/phones/ page. I find the device links there, follow each one, and from the device page reach the tariff page shown above; all of these final tariff pages follow the same structure (the full code is in the update below).
Problem: I was expecting to get only the minutes and texts values (e.g. 50 & Unlimited, 200 & Unlimited), which sit at the 2nd and 3rd positions (results[1] and results[2]) on every page. Instead, I am also getting other values when I print the data, e.g. 500MB or 100MB, which belong to td tags with the dataAllowance class. I am filtering on the class attribute being None, but I am still not getting only the required data. I checked the HTML structure and it is consistent across pages.
Please help me solve this issue, as I cannot fathom the reason for this anomaly.
Update: here is the entire Python code I am using:
import re
import urllib2
import urlparse
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (BS3), whichever is installed

urls = ['https://www.o2.co.uk/shop/phones/',
        'https://www.o2.co.uk/shop/phones/?payGo=true']
plans = ['Pay Monthly', 'Pay & Go']

for url, plan in zip(urls, plans):
    if plan == 'Pay Monthly':
        # parse().direct_url is my own helper; it returns the matching <span class="model"> tags from the listing page
        device_links = parse().direct_url(url, 'span', {"class": "model"})
        for device_link in device_links:
            # resolve the relative device link and open the device page
            device_link.parent['href'] = urlparse.urljoin(url, device_link.parent['href'])
            device_link_page = urllib2.urlopen(device_link.parent['href'])
            device_link_soup = BeautifulSoup(device_link_page)
            dev_names = device_link_soup.find('h1')
            for devname in dev_names:
                # follow the "View tariffs" link to the tariff page
                tariff_link = device_link_soup.find('a', text=re.compile('View tariffs'))
                tariff_link['href'] = urlparse.urljoin(url, tariff_link['href'])
                tariff_link_page = urllib2.urlopen(tariff_link['href'])
                tariff_link_soup = BeautifulSoup(tariff_link_page)

                dev_price = tariff_link_soup.findAll('td', {"class": "phoneCost"})
                monthly_price = tariff_link_soup.findAll('td', {"class": "monthlyCost"})
                tariff_length = tariff_link_soup.findAll('span', {"class": "lowLight"})
                data_plan = tariff_link_soup.findAll('td', {"class": "dataAllowance"})

                # td cells with no class should be the minutes and texts columns
                results = tariff_link_soup.findAll('td', {"class": None})
                print results[1].text
                print results[2].text
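To narrow this down, a quick diagnostic like the following (just a sketch, placed inside the innermost loop where tariff_link_soup is defined) should show which class, if any, each matched cell actually carries:
# sketch: print index, class attribute and text of every td matched by the class-None filter
for i, td in enumerate(tariff_link_soup.findAll('td', {"class": None})):
    print i, td.get('class'), td.text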