python - Web scraper returning a list of elements

Question

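I'm trying to build a scraper that uses mechanize and lxml to scrape information from tables on multiple web pages. The code below returns a list of elements, and I'm trying to find a way to get the text out of those elements (appending .text doesn't work on a list object).

Here is the code: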

import mechanize
import lxml.html as lh
import csv

br = mechanize.Browser()
response = br.open("http://localhost/allproducts")

output = csv.writer(file(r'output.csv','wb'), dialect='excel')

for link in br.links(url_regex="product"):
    follow = br.follow_link(link)
    url = br.response().read()
    find = lh.document_fromstring(url)
    find = find.findall('.//td')
    print find
    output.writerows([find])

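If I add the following to the end of the code above, the text from the tds shows up in the CSV file, but the text from each td is on a separate line. I'd like the format to be the same as with the code above, but with text instead of a list of elements (all of the information from each page on one line).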

for find in find:
    print find.text
    output.writerows([find.text])

I pieced this code together from a lot of other examples, so any general recommendations are also welcome.

score 0 · Accepted Answer

You were very close! There are two problems with your code:

1) find is a list of objects, not a list of strings. Here is some Python to verify this:

>>> type(find)
<type 'list'>
>>> find
[<Element td at 0x101401e30>, <Element td at 0x101401e90>, <Element td at 0x101401ef0>, <Element td at 0x101401f50>, <Element td at 0x101401fb0>, <Element td at 0x101404050>, <Element td at 0x1014040b0>, <Element td at 0x101404110>, <Element td at 0x101404170>, <Element td at 0x1014041d0>, <Element td at 0x101404230>, <Element td at 0x101404290>, <Element td at 0x1014042f0>, <Element td at 0x101404350>, <Element td at 0x1014043b0>, <Element td at 0x101404410>]
>>> type(find[0])
<class 'lxml.html.HtmlElement'>

In other words, the find variable points to a list of <class 'lxml.html.HtmlElement'> objects. A structure of this type should not be passed directly to output.writerows; the csv writer expects text items, not element objects.
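To make the difference between elements and their text concrete, here is a minimal sketch in Python 3. It assumes the standard library's xml.etree.ElementTree as a stand-in for lxml (lxml elements offer the same findall and .text interface), and the markup and the names page, cells, and texts are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for one product page.
page = "<table><tr><td>Widget</td><td>9.99</td><td>In stock</td></tr></table>"

root = ET.fromstring(page)
cells = root.findall(".//td")            # a list of Element objects, not strings

texts = [cell.text for cell in cells]    # pull the text out of each element

print(cells)   # element objects, similar to the interpreter output above
print(texts)   # plain strings that a csv writer can handle
```

The list itself has no .text attribute, which is why appending .text to find fails; the attribute lives on each element inside the list.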

2) When you iterate over find, you reassign the variable name find. Don't give the loop variable the same name as the collection you are iterating over.

for item in find:
    print item.text
    output.writerow([item.text])  # one row per td; writerows() would treat the string itself as a row
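A small Python 3 sketch of why reusing the name is a problem. The list of letters is only a placeholder for the list of td elements.

```python
find = ["a", "b", "c"]   # stand-in for the list of <td> elements

for find in find:        # the loop variable reuses the list's name
    pass

# After the loop, the name refers to the last item, and the list is no longer reachable by it.
print(find)
```

The loop itself still runs over every item, because the iterator is created before the name is rebound, but any later code that expects find to be the list will get a single item instead.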

Putting it all together, you get:

for link in br.links(url_regex="product"):
    follow = br.follow_link(link)
    url = br.response().read()
    find = lh.document_fromstring(url)
    find = find.findall('.//td')
    print find
    results = []  # Create a place to store the text names
    for item in find:
        results.append(item.text)  # Store the text name of the item in the results list.
    output.writerow(results)  # Write all of this page's cell text as a single CSV row.
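Whether all the cells of a page end up on one CSV line depends on which writer method is called: writerow writes a single row, while writerows writes one row for each item of the sequence it is given. The Python 3 sketch below shows both, writing to in-memory buffers instead of a file; the cell values are hypothetical.

```python
import csv
import io

results = ["Widget", "9.99", "In stock"]   # hypothetical cell texts from one page

one_row = io.StringIO()
csv.writer(one_row).writerow(results)       # one row, one column per cell

many_rows = io.StringIO()
csv.writer(many_rows).writerows([[text] for text in results])  # one row per cell

print(repr(one_row.getvalue()))
print(repr(many_rows.getvalue()))
```

Note that passing a bare list of strings to writerows would treat each string as its own row and each character as a column, which is rarely what is wanted.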

Pro tip

You can also build the results as a one-liner with a list comprehension, like this:

results = [item.text for item in find]
output.writerow(results)  # one CSV row per page

This replaces three lines of Python with one.
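One caveat with .text: it holds only the text that comes before an element's first child, so a cell whose content is wrapped in another tag, or an empty cell, yields None. The Python 3 sketch below compares it with joining itertext(), again assuming xml.etree.ElementTree as a stand-in for lxml (which also provides itertext()); the markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical row: plain text, text nested inside <b>, and an empty cell.
row = ET.fromstring("<tr><td>Widget</td><td><b>9.99</b></td><td/></tr>")
cells = row.findall(".//td")

plain = [cell.text for cell in cells]                        # None where there is no direct text
full = ["".join(cell.itertext()).strip() for cell in cells]  # all nested text, never None

print(plain)
print(full)
```

If the real pages contain nested markup inside their td cells, the second form is likely the safer choice for building each CSV row.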
