python - Length of web-parsed content with lxml

Question

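I scrape web pages with lxml in Python. However, to get the number of table rows, I first fetch all of them and then call len(). That seems wasteful. Is there another way to get that number (it varies from page to page) so I can use it for further scraping?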

import lxml.html

doc = None
try:
    doc = lxml.html.parse('url')
except OSError:
    # the page could not be loaded, so skip it
    pass

if doc is not None:
    buf = ''
    # get the total number of rows in the table
    tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
    table = []
    # iterate over the table rows, limited to the number found above
    for i in range(3, len(tr)):
        # get the content of row i (XPath positions are 1-based)
        table += doc.xpath("body/div[1]/div[1]/table[1]/tbody/tr[%s]/td" % i)

score 0 · Accepted Answer

from itertools import islice

trs = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
# skip the first three rows without calling len()
for tr in islice(trs, 3, None):
    for td in tr.xpath('td'):
        ...  # whatever you need to do with each cell
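
Which rows islice selects depends on how its arguments are read, so a small check on a plain list may help. This is only a sketch: the `rows` list is a made-up stand-in for the list of tr elements returned by xpath().

```python
from itertools import islice

# Hypothetical stand-in for the list of <tr> elements.
rows = ["row0", "row1", "row2", "row3", "row4"]

# A single number is the stop value: only the first three items.
first_three = list(islice(rows, 3))

# start=3, stop=None: skip the first three items and take the rest.
after_three = list(islice(rows, 3, None))

# For a real list, slicing gives the same result as the second call.
sliced = rows[3:]
```

Since xpath() already returns a list, plain slicing is usually enough; islice mainly pays off when the rows come from a lazy iterator.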

score 0 · Accepted Answer

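Have you tried the iterator approach described in this section: http://lxml.de/api.html#iteration? I'm fairly sure there is a way to do it like that. Finding the length of something and then looping over it with (x)range is never an elegant solution, and I'm confident the people behind lxml have provided the right tools for this.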
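
As a rough illustration of that iterator idea, the sketch below walks the rows with iter() and never asks for a length. The sample markup and the `cell_texts` helper are invented for this example, and the fallback to the standard library's ElementTree is only there so the snippet runs without lxml installed; this part of the API is shared by both.

```python
# Prefer lxml; fall back to the stdlib ElementTree, which offers the same iter() API.
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical sample markup standing in for the scraped page.
SAMPLE = """
<table>
  <tbody>
    <tr><td>h1</td><td>h2</td></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </tbody>
</table>
"""

def cell_texts(table, skip=0):
    """Collect the text of every td, skipping the first `skip` rows."""
    cells = []
    for index, row in enumerate(table.iter("tr")):
        if index < skip:
            continue
        cells.extend(td.text for td in row.iter("td"))
    return cells

table = etree.fromstring(SAMPLE.strip())
cells = cell_texts(table, skip=1)
```

Here enumerate() supplies the row position, so no separate count is needed before the loop starts.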

score 0 · Accepted Answer

You can use the matched tr elements as a starting point. You can simply iterate over the elements, just as you would with a Python list.

tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
table = []
# slice the list of rows to skip the first three
for row in tr[3:]:
    table += row.findall('td')

The above uses .findall() to get all the contained td elements, but if you need finer control you can make further .xpath() calls instead.
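
To show the difference between the two calls on a row, here is a small sketch. The sample table and the `class="name"` attribute are assumptions made up for the example, not taken from the asker's page.

```python
import lxml.html

# Hypothetical sample markup standing in for the scraped page.
SAMPLE = """
<table>
  <tr><td>skip</td><td>0</td></tr>
  <tr><td>skip</td><td>0</td></tr>
  <tr><td>skip</td><td>0</td></tr>
  <tr><td class="name">alpha</td><td>1</td></tr>
  <tr><td class="name">beta</td><td>2</td></tr>
</table>
"""

doc = lxml.html.fromstring(SAMPLE.strip())
rows = doc.xpath(".//tr")

all_cells = []  # every td element in the remaining rows
names = []      # only the text of cells marked class="name"
for row in rows[3:]:
    all_cells += row.findall("td")
    # A relative XPath on the row can filter by attribute and return text directly.
    names += row.xpath("td[@class='name']/text()")
```

findall() returns the td elements themselves, while the XPath expression narrows the selection and extracts the text in a single step.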
