python - Retrieving data from a web page with select menus using Python and BeautifulSoup

Question

I am trying to gather data from a web page that has a lot of select lists I need to pull data from. Here is the page: http://www.asusparts.eu/partfinder/Asus/All In One/E Series/

And this is what I have so far:

import glob, string
from bs4 import BeautifulSoup
import urllib2, csv

for file in glob.glob("http://www.asusparts.eu/partfinder/*"):

##-page to show all selections for the E-series-##
selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/'

##-
page = urllib2.urlopen(selected_list)
soup = BeautifulSoup(page)

##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'


##-identify the id of select list which contains the E-series-##  
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')

##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]

for option in option_tags:
    open(url + option['value'])


html = urllib2.urlopen("http://www.asusparts.eu/partfinder/")

soup = BeautifulSoup(html)

all = soup.find('div', id="accordion")

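I am not sure whether I am going about this the right way, because all the select menus make it confusing. Basically, I need to grab all the data from the selected results, such as images, price, and description. They are all contained within one div tag named "accordion" that holds all the results, so would this still gather all the data? Or would I need to dig deeper and search through the tags inside this div? Also, I would prefer to search by id rather than class, since I could then fetch all the data in one go. How would I go about this from what I have above? Thanks. I am also unsure whether I am using the glob function correctly.

Edit

Here is my edited code. It returns no errors, but I am not sure whether it returns all the models for the E-series.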

import string, urllib2, urllib, csv, urlparse
from bs4 import BeautifulSoup


##-page which shows results after selecting one option-##
url = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'


base_url = 'http://www.asusparts.eu/' + url

print base_url

##-page to show all selections for the E-series-##
selected_list = urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')
print urllib.quote(base_url + '/Asus/All In One/E Series/ET10B')

#selected_list = 'http://www.asusparts.eu/partfinder/Asus/All In One/E Series/ET10B'

##-
page = urllib2.urlopen('http://www.asusparts.eu/partfinder/Asus/All%20In%20One/E%20Series')
soup = BeautifulSoup(page)

print soup

##-identify the id of select list which contains the E-series-##  
select = soup.find('select', id="myselectListModel")
option_tags = select.findAll('option')

print option_tags 

##-omit first item in list as isn't part of the option-##
option_tags = option_tags[1:]

print option_tags


for option in option_tags:
    url + option['redirectvalue']

print " " + url + option['redirectvalue']

score 1 · Accepted Answer

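First, I'd like to point out a few problems in the code you posted. To begin with, the glob module is not typically used for making HTTP requests. It is useful for iterating over a subset of files at a specified path on the local filesystem. See its documentation for details.

The second problem is in the following line: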
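To make the point about glob concrete, here is a minimal sketch in Python 3 (the code in this thread is Python 2). The temporary directory and file names are made up for illustration. It shows glob matching local files and finding nothing for a URL pattern, because the URL is treated as an ordinary local path.

```python
import glob
import os
import tempfile

# glob expands wildcard patterns against the local filesystem only.
with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical files, created just for this demonstration.
    for name in ("a.html", "b.html", "notes.txt"):
        with open(os.path.join(tmp, name), "w"):
            pass

    local_matches = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(tmp, "*.html"))
    )

# A URL is just a (non-existent) local path as far as glob is concerned,
# so no HTTP request is made and nothing matches.
url_matches = glob.glob("http://www.asusparts.eu/partfinder/*")

print(local_matches)
print(url_matches)
```

Fetching pages needs an HTTP client (urllib2 in the posted code), and the list of pages to visit has to come from the site itself, for example from the select menus.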

for file in glob.glob("http://www.asusparts.eu/partfinder/*"):

No indented code follows it, so Python raises an indentation error. The error is raised before anything runs, which prevents the rest of the code from being executed.
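A small Python 3 sketch of that failure, using compile() on a shortened copy of the pattern (the snippet string is an illustration, not the full posted script):

```python
# A loop header followed by code at the same indentation level.
snippet = (
    "for path in []:\n"
    "\n"
    "selected_list = 'http://www.asusparts.eu/partfinder/'\n"
)

try:
    # compile() only parses the source; nothing in the snippet is executed.
    compile(snippet, "<example>", "exec")
    error_name = None
except IndentationError as exc:
    error_name = type(exc).__name__

print(error_name)
```

Because the error is detected while the source is being parsed, not even the statements before the loop get a chance to run.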

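Another problem is that you are using some of Python's "reserved" names for your variables. Strictly speaking, all and file are built-in names rather than keywords, so Python lets you rebind them, but doing so hides the built-in. Avoid using words such as all and file as variable names.

Finally, when you loop over option_tags: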
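As a rough illustration in Python 3 of why reusing a built-in name is risky (the function names and sample rows are invented for this sketch): once all is rebound inside a function, the built-in all() is no longer reachable under that name there. Note that file is a built-in only in Python 2.

```python
def shadowed(rows):
    """Reuses the name `all` for data, then tries to call the built-in."""
    all = [row for row in rows if row]  # `all` is now a list in this scope
    try:
        return all(rows)  # fails: a list is not callable
    except TypeError as exc:
        return "TypeError: {}".format(exc)


def renamed(rows):
    """Same logic with a descriptive name, so the built-in stays usable."""
    non_empty = [row for row in rows if row]
    return all(rows), non_empty


sample = ["price", "", "image"]
print(shadowed(sample))
print(renamed(sample))
```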

for option in option_tags:
    open(url + option['value'])

This open statement will try to open a local file whose path is url + option['value']. That will most likely raise an error, since I doubt you have a file at that location. Also note that you are not doing anything with the opened file.
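A minimal Python 3 sketch of the difference, assuming a hypothetical option value of ET10B. It runs without network access: the actual request is left as a comment.

```python
from urllib.parse import quote

url = "http://www.asusparts.eu/partfinder/Asus/All In One/E Series/"
option_value = "ET10B"  # hypothetical value read from an <option> tag

# open() treats its argument as a local path, so this fails.
try:
    open(url + option_value)
    open_error = None
except OSError as exc:  # FileNotFoundError on most systems
    open_error = type(exc).__name__

# For an HTTP request, percent-encode the spaces and use an HTTP client.
request_url = quote(url + option_value, safe=":/")

print(open_error)
print(request_url)
# from urllib.request import urlopen
# html = urlopen(request_url).read()  # network call, not executed here
```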

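Alright, enough criticism. I took a look at the asus page and I think I have an idea of what you want to accomplish. From what I understand, you want to scrape the list of parts (images, text, price, etc.) for each computer model on the asus page. Each model has its list of parts at a unique URL (for example: http://www.asusparts.eu/partfinder/Asus/Desktop/B%20Series/BM2220). This means you need to be able to build this unique URL for each model. To complicate matters further, each parts category is loaded dynamically. For example, the parts in the "Cooling" section are not loaded until you click the "Cooling" link. So the problem has two parts: 1) get all of the valid (brand, type, family, model) combinations, and 2) figure out how to load all the parts for a given model.

I was a bit bored, so I decided to write a simple program that takes care of most of the heavy lifting. It isn't the most elegant thing out there, but it will get the job done. Step 1) is done in get_model_information(). Step 2) is handled in parse_models(), but it is a little less obvious. If you look at the asus website, whenever you click on a parts subsection the JavaScript function getProductsBasedOnCategoryID() runs and makes an ajax call to a formatted PRODUCTS_URL (see below). The response is JSON that is used to populate the section you clicked on.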
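Building the per-model URL could look roughly like this in Python 3. The helper name model_url is an assumption made for this sketch. It percent-encodes each path segment, which reproduces the example URL above.

```python
from urllib.parse import quote

BASE_URL = "http://www.asusparts.eu/partfinder/"


def model_url(brand, type_, family, model):
    """Join the four path segments, percent-encoding each one."""
    segments = (brand, type_, family, model)
    return BASE_URL + "/".join(quote(segment, safe="") for segment in segments)


print(model_url("Asus", "Desktop", "B Series", "BM2220"))
print(model_url("Asus", "All In One", "E Series", "ET10B"))
```

Encoding segment by segment, rather than the whole URL at once, keeps a stray slash inside a name from being mistaken for a path separator.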

import urllib2
import json
import urlparse
from bs4 import BeautifulSoup

BASE_URL = 'http://www.asusparts.eu/partfinder/'
PRODUCTS_URL = 'http://json.zandparts.com/api/category/GetCategories/'\
               '44/EUR/{model}/{family}/{accessory}/{brand}/null/'
ACCESSORIES = ['Cable', 'Cooling', 'Cover', 'HDD', 'Keyboard', 'Memory',
               'Miscellaneous', 'Mouse', 'ODD', 'PS', 'Screw']


def get_options(url, select_id):
    """
    Gets all the options from a select element.
    """
    r = urllib2.urlopen(url)
    soup = BeautifulSoup(r)
    select = soup.find('select', id=select_id)
    try:
        options = [option for option in select.strings]
    except AttributeError:
        print url, select_id, select
        raise
    return options[1:]  # The first option is the menu text


def get_model_information():
    """
    Finds all the models for each family, all the families and models for each
    type, and all the types, families, and models for each brand.

    These are all added as tuples (brand, type, family, model) to the list
    model_info.
    """
    model_info = []

    print "Getting brands"
    brand_options = get_options(BASE_URL, 'mySelectList')

    for brand in brand_options:
        print "Getting types for {0}".format(brand)
        # brand = brand.replace(' ', '%20')  # URL encode spaces
        brand_url = urlparse.urljoin(BASE_URL, brand.replace(' ', '%20'))
        types = get_options(brand_url, 'mySelectListType')

        for _type in types:
            print "Getting families for {0}->{1}".format(brand, _type)
            bt = '{0}/{1}'.format(brand, _type)
            type_url = urlparse.urljoin(BASE_URL, bt.replace(' ', '%20'))
            families = get_options(type_url, 'myselectListFamily')

            for family in families:
                print "Getting models for {0}->{1}->{2}".format(brand,
                                                                _type, family)
                btf = '{0}/{1}'.format(bt, family)
                fam_url = urlparse.urljoin(BASE_URL, btf.replace(' ', '%20'))
                models = get_options(fam_url, 'myselectListModel')

                model_info.extend((brand, _type, family, m) for m in models)

    return model_info


def parse_models(model_information):
    """
    Get all the information for each accessory type for every
    (brand, type, family, model). accessory_info will be the python formatted
    json results. You can parse, filter, and save this information or use
    it however suits your needs.
    """

    for brand, _type, family, model in model_information:
        for accessory in ACCESSORIES:
            r = urllib2.urlopen(PRODUCTS_URL.format(model=model, family=family,
                                                 accessory=accessory,
                                                 brand=brand,))
            accessory_info = json.load(r)
            # Do something with accessory_info
            # ...


def main():
    models = get_model_information()
    parse_models(models)


if __name__ == '__main__':
    main()
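To try the option-extraction step without hitting the site, here is an offline sketch in Python 3. It swaps BeautifulSoup for the standard library's html.parser so that it has no dependencies. The class name, the function name, and the sample HTML (including the model name ET-EXAMPLE) are invented for illustration, and the real page markup may differ.

```python
from html.parser import HTMLParser


class SelectOptions(HTMLParser):
    """Collects the text of every <option> in the <select> with a given id."""

    def __init__(self, select_id):
        super().__init__()
        self.select_id = select_id
        self.in_select = False
        self.in_option = False
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "select" and dict(attrs).get("id") == self.select_id:
            self.in_select = True
        elif tag == "option" and self.in_select:
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
        elif tag == "option":
            self.in_option = False

    def handle_data(self, data):
        if self.in_option:
            self.options[-1] += data


def get_options_from_html(html, select_id):
    """Return the option texts, skipping the first entry (the menu prompt)."""
    parser = SelectOptions(select_id)
    parser.feed(html)
    return [text.strip() for text in parser.options][1:]


# Invented sample markup; only the select ids are taken from the code above.
SAMPLE_HTML = """
<select id="mySelectList"><option>Select brand</option><option>Asus</option></select>
<select id="myselectListModel">
  <option>Select model</option>
  <option>ET10B</option>
  <option>ET-EXAMPLE</option>
</select>
"""

print(get_options_from_html(SAMPLE_HTML, "myselectListModel"))
```

Like get_options() above, it drops the first option on the assumption that it is the menu prompt rather than a real value.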

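Finally, one side note. I have dropped urllib2 in favor of the requests library. Personally, I think it provides much more functionality and has better semantics, but you can use whatever you like.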
