python - Issue with scraping data using indexing from html structure

Question

I am scraping data from following html structure from 30-40 webpages like these https://www.o2.co.uk/shop/tariffs/sony/xperia-z-purple/ :

    <td class="monthlyCost">£13<span>.50</span></td>
              <td class="phoneCost">£479.99</td>
              <td><span class="lowLight">24 Months</span></td>
    <td>50</td>
    <td>Unlimited</td>
    <td class="dataAllowance">100MB</td>
    <td class="extras">

I am indexing to scrape data present under td tags having no class like 50 & Unlimited which corresponds to Minutes and texts column in the dataset. Code which I am using is:

        results       = tariff_link_soup.findAll('td', {"class": None})
        minutes = results[1]
        texts = results[2]
        print minutes,texts

All these 30-40 webplinks are present on https://www.o2.co.uk/shop/phones/ webpage, I am finding those links on this webpage accessing them and then reaching this desired webpage, all these final device webpages follow same structure.

Problem: I was hoping to get only minutes and text values which are like 50 & Unlimited, 200 & Unlimited and are present at 2nd and 3rd index for all webpages. Still I am getting some other values when I am printing the data for eg. 500MB, 100MB which are values under dataAllowance class and td tag. I am using class as None attribute but still not able to get required data. I checked html structure and it was consistent across pages.

Please help me in solving this issue as I am not able to fathom reason for this anomaly.

Update: Entire Python code which I am using:

urls  =  ['https://www.o2.co.uk/shop/phones/',
          'https://www.o2.co.uk/shop/phones/?payGo=true']

plans =  ['Pay Monthly','Pay & Go']
for url,plan in zip(urls,plans):

    if plan == 'Pay Monthly':
        device_links = parse().direct_url(url,'span', {"class": "model"})

        for device_link in device_links:
            device_link.parent['href'] = urlparse.urljoin(url, device_link.parent['href'])            
            device_link_page           = urllib2.urlopen(device_link.parent['href'])
            device_link_soup           = BeautifulSoup(device_link_page)

        dev_names = device_link_soup.find('h1')
        for devname in dev_names:

            tariff_link = device_link_soup.find('a',text = re.compile('View tariffs'))

            tariff_link['href'] = urlparse.urljoin(url, tariff_link['href'])

            tariff_link_page    = urllib2.urlopen(tariff_link['href'])
            tariff_link_soup    = BeautifulSoup(tariff_link_page)
            dev_price     = tariff_link_soup.findAll('td', {"class": "phoneCost"})
            monthly_price = tariff_link_soup.findAll('td', {"class": "monthlyCost"})
            tariff_length = tariff_link_soup.findAll('span', {"class": "lowLight"})
            data_plan     = tariff_link_soup.findAll('td', {"class": "dataAllowance"})
            results       = tariff_link_soup.xpath('//td[not(@class)]')
            print results[1].text
            print results[2].text

score 0 · Accepted Answer

I finally used following code to solve my problem:

    for row in tariff_link_soup('table', {'id' : 'tariffTable'})[0].tbody('tr'):                                                                                                                                                               
        tds = row('td')                                                                                                                                                   
        #print tds[0].text,tds[1].text,tds[2].text,tds[3].text,tds[4].text,tds[5].text
        monthly_prices = unicode(tds[0].text).encode('utf8').replace("Â£","").replace("FREE","0").replace("Free","0").strip()
        dev_prices     = unicode(tds[1].text).encode('utf8').replace("Â£","").replace("FREE","0").replace("Free","0").strip()
        tariff_lengths = unicode(tds[2].text).encode('utf8').strip()
        minutes        = unicode(tds[3].text).encode('utf8').strip()
        texts          = unicode(tds[4].text).encode('utf8').strip()
        data           = unicode(tds[5].text).encode('utf8').strip()
        device_names   = unicode(dev_names).encode('utf8').strip()

I am accessing the required data row by row here, using the tabular structure in which data is present. I am taking all elements present in a row and assigning names to those which are required in my data.

python - Issue with scraping data using indexing from html structure

1 に答える 1

Related

Reference