python - BeautifulSoup find nextClass

Question

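So basically, I have two classes. One is the release date of the shoes, and the other is the shoes released on that date. However, they are two entirely different classes. I am trying to scrape from both: "month-header", which contains all the dates, and the next class, "sneaker-post-main", which contains all the shoes for the date in the month-header. These are two different classes and they are not linked to each other. So I tried calling .nextSibling on the h4 class to catch the "section" class. It did not work that way.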

<h4 class="month-header">April 15, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<h4 class="month-header">April 16, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<h4 class="month-header">April 17, 2016</h4>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>
<section class = "sneaker-post-main">...</section>

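Also, in case my HTML does not make sense, this is the website I am scraping: http://sneakernews.com/air-jordan-release-dates/ I wanted the result to look like a dictionary where the date is the key and the value is the list of shoes released on that date, as shown below.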

April 16 2015
{
    Shoe info 1
    Shoe info 2
    Shoe info 3
}
April 17 2015
{
    Shoe info 1
    Shoe info 2
    Shoe info 3
}

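I am trying to do this task with BeautifulSoup, and I can't seem to figure it out. In the HTML above, the h4 element (April 15, 2016) is the release-date HTML, and each section element (...) contains the shoe information, etc.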

from bs4 import BeautifulSoup
import requests
import json


headers = {
    #'Cookie': 'X-Mapping-fjhppofk=FF3085BC452778AD1F6476C56E952C7A; _gat=1; __qca=P0-293756458-1459822661767; _gat_cToolbarTracker=1; _ga=GA1.2.610207006.1459822661',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36,(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept': '*/*',
    'Connection': 'keep-alive',
    'Content-Length': '0'  # header values must be strings in recent versions of requests
}
response = requests.get('http://sneakernews.com/air-jordan-release-dates/', headers=headers).text
soup = BeautifulSoup(response, 'html.parser')
for tag in soup.find_all('h4', attrs={'class': 'month-header'}):
    print(tag.nextSibling.nextSibling.nextSibling)

This is my code so far!

score 1 · Accepted Answer

You can use the find_next_siblings() method and a simple slice operation to return the section tags that come right after each h4 tag.

Demo using your sample HTML document:

In [32]: from bs4 import BeautifulSoup 

In [33]: result = []

In [34]: html = """<h4 class="month-header">April 15, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <h4 class="month-header">April 16, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <h4 class="month-header">April 17, 2016</h4>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>
   ....: <section class = "sneaker-post-main">...</section>"""

In [35]: soup = BeautifulSoup(html, 'html.parser')

In [36]: for header in soup.find_all('h4', class_='month-header'):
   ....:     d = {}
   ....:     d['month'] = header.get_text()
   ....:     d['released'] = [s.get_text() for s in header.find_next_siblings('section', class_='sneaker-post-main')[:3]]
   ....:     result.append(d)
   ....:     

In [37]: result
Out[37]: 
[{'month': 'April 15, 2016', 'released': ['...', '...', '...']},
 {'month': 'April 16, 2016', 'released': ['...', '...', '...']},
 {'month': 'April 17, 2016', 'released': ['...', '...', '...']}]
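As a side note on why the .nextSibling attempt in the question likely failed: with html.parser, the node directly after each h4 is the newline text between the tags, not the section. The following is a minimal sketch on a shortened, made-up version of the sample HTML (the "shoe A" and "shoe B" texts are placeholders, not data from the site):

```python
from bs4 import BeautifulSoup

# Shortened sample; the section texts are placeholders for illustration.
html = """<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">shoe A</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">shoe B</section>"""

soup = BeautifulSoup(html, 'html.parser')
header = soup.find('h4', class_='month-header')

# next_sibling returns whatever node comes next, including whitespace text.
raw_sibling = header.next_sibling

# find_next_sibling() only considers tags, so the whitespace is skipped.
tag_sibling = header.find_next_sibling('section')

print(repr(raw_sibling))
print(tag_sibling.get_text())
```

Because find_next_sibling() and find_next_siblings() only match tags, they avoid having to chain .nextSibling an unknown number of times.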

Update:

If the number of "sections" is not constant, I would do it like this, using a generator function (probably not efficient).

def gen(soup):
    for header in soup.find_all('h4', class_='month-header'):
        d = {}
        d['month'] = header.get_text()
        d['released'] = []
        for s in header.find_next_siblings('section', class_='sneaker-post-main'):
            d['released'].append(s)
            nxt = s.find_next_sibling()
            # Stop when there is no next tag or the next tag is the following h4 header
            if not isinstance(nxt, Tag) or nxt.name == 'h4':
                break
        yield d

The generator function takes one argument, which is the "soup".

from bs4 import BeautifulSoup, SoupStrainer, Tag

wanted_tag = SoupStrainer(['h4', 'section']) # only parse h4 and section tags 
soup = BeautifulSoup(response, 'html.parser', parse_only = wanted_tag)

for tag in soup(['script', 'style', 'img']):
    tag.decompose() #  Just to clean up little bit

for d in gen(soup):
    print(d['month'], len(d['released']))  # do something with each entry
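To get the dictionary shape asked for in the question (date as key, list of shoes as value), one possible variation is sketched below. It is not the generator above but an alternative that walks every following tag sibling and stops at the next h4; the function name group_by_date and the sample section texts are assumptions made for illustration.

```python
import json
from bs4 import BeautifulSoup

def group_by_date(soup):
    """Map each h4.month-header text to the texts of the sections after it."""
    releases = {}
    for header in soup.find_all('h4', class_='month-header'):
        shoes = []
        # With no filter, find_next_siblings() yields every following tag sibling.
        for sibling in header.find_next_siblings():
            if sibling.name == 'h4':
                break  # reached the next date header
            if sibling.name == 'section' and 'sneaker-post-main' in sibling.get('class', []):
                shoes.append(sibling.get_text(strip=True))
        releases[header.get_text(strip=True)] = shoes
    return releases

# Made-up sample with a different number of sections per date.
sample = """<h4 class="month-header">April 15, 2016</h4>
<section class="sneaker-post-main">Shoe info 1</section>
<section class="sneaker-post-main">Shoe info 2</section>
<h4 class="month-header">April 16, 2016</h4>
<section class="sneaker-post-main">Shoe info 3</section>"""

releases = group_by_date(BeautifulSoup(sample, 'html.parser'))
print(json.dumps(releases, indent=4))
```

Stopping at the next h4 rather than slicing a fixed count means a date with no sections ends up with an empty list instead of picking up the following date's shoes.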
