python - BeautifulSoup: extracting text from an anchor tag

Question

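I want to extract:

the text of the src attribute of the image tag, and
the text of the anchor tag inside the div with class data.

I managed to extract the img src, but I am having trouble extracting the text from the anchor tag.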

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

Here is the link to the entire HTML page.

Here is my code:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

What I am trying to do is extract the image src (the link) and the title inside the div class=data. For example, given:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>

I need to extract:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

score 78 · Accepted Answer

This should help:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img  
        src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img  
        src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")

for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])
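
For the layout described in the question, a minimal sketch might look like the following. It assumes each div with class image is followed by a sibling div with class data that holds the a class="title" link. The sample HTML and the name pairs are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question: an image div followed by a data div.
html = """
<div class="result">
  <div class="image"><a href="http://www.example.com/eg1"><img src="http://image.example.com/img1.jpg" /></a></div>
  <div class="data"><h3><a class="title" href="http://www.example.com/eg1">Camera One</a></h3></div>
</div>
<div class="result">
  <div class="image"><a href="http://www.example.com/eg2"><img src="http://image.example.com/img2.jpg" /></a></div>
  <div class="data"><h3><a class="title" href="http://www.example.com/eg2">Camera Two</a></h3></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for div in soup.find_all("div", class_="image"):
    img = div.find("img")
    # find_next_sibling returns a single Tag (or None), so search inside it
    # instead of looping over it.
    data = div.find_next_sibling("div", class_="data")
    title = data.find("a", class_="title") if data else None
    if img is not None and title is not None:
        pairs.append((img["src"], title.get_text(strip=True)))

for src, text in pairs:
    print(src, text)
```

Looping directly over the result of findNextSibling, as the code in the question does, walks the direct children of that div, which can include plain text nodes, so searching inside the returned tag is usually the safer approach.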

If you are looking at Amazon products, you should use the official API. There is at least one Python package that will ease your scraping issues and keep your activity within the terms of use.

score 29 · Accepted Answer

In my case, it worked like this:

import urllib
from BeautifulSoup import BeautifulSoup as bs

url="http://blabla.com"

# Python 2 with BeautifulSoup 3
soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.string

Hope that helps!

score 8 · Accepted Answer

I would suggest going the lxml route and using XPath.

from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
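
As a self-contained illustration of the same XPath query, here is a small sketch run against made-up HTML (the markup and the name titles are assumptions for the example). Note that @class="title" compares the whole attribute value, so an anchor with several classes would not match this expression.

```python
from lxml import etree

# Hypothetical sample markup, not the real Amazon page.
html = '''<div class="data">
<a class="title" href="http://www.example.com/eg1">Camera One</a>
<a class="other" href="http://www.example.com/x">Ignored</a>
<a class="title" href="http://www.example.com/eg2">Camera Two</a>
</div>'''

tree = etree.HTML(html)
# text() selects the text nodes directly under each matching anchor.
titles = tree.xpath('//a[@class="title"]/text()')
print(titles)
```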

score 3 · Accepted Answer

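All of the answers above really helped me build my own, so I upvoted every answer the other users posted. In the end, though, I put together my own answer to the exact problem I was dealing with.

As the question clearly states, I needed to access a sibling and some of its children in the DOM structure. This solution iterates over the images in the DOM structure, builds an image name from the product title, and saves the image to a local directory.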

import os
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    # Download the page (Python 2, BeautifulSoup 3)
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    # Expand '~' to the home directory; the folder must already exist
    output_folder = os.path.expanduser('~/amazon')
    # Extract the images that are inside div(s) with class "image"
    for div in soup.findAll('div', attrs={'class':'image'}):
        try:
            # Get the data div using findNext
            nextDiv = div.findNext('div', attrs={'class':'data'})
            # Use findNext again on that object to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            modified_file_name = fileName.replace(' ','-') + '.jpg'
        except (TypeError, AttributeError):
            # No data div or anchor found, so there is no title to name the file
            print 'skip'
            continue
        imageUrl = div.find('img')['src']
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
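
For readers on Python 3 with bs4, the pairing and file-naming steps could be sketched roughly as follows. This version works on an HTML string so it runs without network access. The helper names title_to_filename and image_targets and the sample markup are hypothetical, and the file-name rule is stricter than the plain space replacement used above.

```python
import re
from bs4 import BeautifulSoup

def title_to_filename(title):
    """Turn a product title into a simple .jpg file name."""
    # Replace every run of characters that is not a letter, digit, dot or
    # hyphen with a single hyphen, then trim hyphens from both ends.
    return re.sub(r"[^A-Za-z0-9.-]+", "-", title).strip("-") + ".jpg"

def image_targets(html):
    """Return (image URL, file name) pairs for each image/data div pair."""
    soup = BeautifulSoup(html, "html.parser")
    targets = []
    for div in soup.find_all("div", class_="image"):
        img = div.find("img")
        data = div.find_next("div", class_="data")
        link = data.find("a") if data else None
        if img is None or link is None:
            continue  # skip incomplete entries
        targets.append((img["src"], title_to_filename(link.get_text(strip=True))))
    return targets

# Hypothetical sample markup, not the real Amazon page.
html = """
<div class="image"><img src="http://image.example.com/img1.jpg" /></div>
<div class="data"><a class="title" href="http://www.example.com/eg1">Camera One 16.1 MP (Red)</a></div>
"""
print(image_targets(html))
```

Each pair could then be handed to urllib.request.urlretrieve together with an output directory, which mirrors the download step in the function above.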

score 1 · Accepted Answer

>>> import bs4
>>> txt = '<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> '
>>> fragment = bs4.BeautifulSoup(txt)
>>> fragment
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 
>>> fragment.find('a', {'class': 'title'})
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'}).string
u'Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)'
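
One caveat worth illustrating: .string only returns text when the tag has exactly one child. The sketch below uses made-up markup to compare it with get_text() on an anchor that also contains an image; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one plain anchor, one anchor that also wraps an image.
html = (
    '<a class="title" href="http://www.example.com/eg1">Plain title</a>'
    '<a class="title" href="http://www.example.com/eg2">Title with image'
    '<img src="http://image.example.com/img2.jpg" /></a>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a", class_="title")

first_string = first.string                 # the single text child
second_string = second.string               # None, because this tag has two children
second_text = second.get_text(strip=True)   # collects the text regardless

print(first_string, second_string, second_text)
```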

score 1 · Accepted Answer

print(link_addres.contents[0])

prints the contents of the anchor tag.

Example:

 statement_title = statement.find('h2',class_='briefing-statement__title')
 statement_title_text = statement_title.a.contents[0]

score 1 · Accepted Answer

To get the href from an anchor tag, use tag.get("href"); to get the img src, use tag.img.get("src").

An example using this data:

data = """
            <div class="image">
            <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
            </div>
            <div class="image">
            <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
            </div>
        """

Get the links and text:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    if response.ok:
        return BeautifulSoup(response.text, features="html.parser")

def get_links(soup):
    links = []
    for tag in soup.findAll("a", href=True):
        if img := tag.img:
            img = img.get("src")
        links.append(dict(url=tag.get("href"), text=tag.text, img=img))
    return links

# soup = get_soup('www.example.com')
soup = BeautifulSoup(data, features="html.parser")
links = get_links(soup)

Output:

[{'url': 'http://www.example.com/eg1', 'text': 'Content1', 'img': 'http://image.example.com/img1.jpg'},
{'url': 'http://www.example.com/eg2', 'text': 'Content2 ', 'img': 'http://image.example.com/img2.jpg'}]
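
A reason to prefer tag.get("href") over tag["href"] is how each behaves when the attribute is missing. The following small sketch, using a made-up anchor without an href, shows the difference; the names href and raised exist only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor that has no href attribute.
soup = BeautifulSoup('<a name="top">No link here</a>', "html.parser")
tag = soup.a

href = tag.get("href")   # missing attribute gives None, no exception

try:
    tag["href"]          # dictionary-style access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(href, raised)
```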
