python - Python/lxml: How can I capture a row in an HTML table?

Question

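For my stock screening tool, I need to switch from BeautifulSoup to lxml in my script. After the Python script downloads the web pages it needs to process, BeautifulSoup parses them correctly, but far too slowly: parsing the balance sheet, income statement, and cash flow statement of a single stock takes about 10 seconds. Given that my script has more than 5000 stocks to analyze, that is unacceptably slow.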

According to some benchmark tests (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. So lxml should be able to finish in under 10 minutes a job that would take BeautifulSoup 14 hours.
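A quick back-of-the-envelope check of those figures, taking the roughly 10 seconds per stock, the 5000 stocks, and the benchmark's claimed 100x speedup at face value (the constant names are only illustrative, and the real speedup on these pages may well differ):

```python
SECONDS_PER_STOCK = 10   # from above: about 10 s per stock with BeautifulSoup
STOCK_COUNT = 5000       # from above: a little over 5000 stocks
SPEEDUP = 100            # assumption: the benchmark's ~100x carries over

total_seconds = SECONDS_PER_STOCK * STOCK_COUNT
bs_hours = total_seconds / 3600.0
lxml_minutes = total_seconds / float(SPEEDUP) / 60.0

print("BeautifulSoup: %.1f hours" % bs_hours)
print("lxml, if the speedup holds: %.1f minutes" % lxml_minutes)
```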

How can I use lxml to capture the contents of a row in an HTML table? One example of an HTML page that my script needs to download and parse: http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB

Here is the source code that uses BeautifulSoup to parse this HTML table:

    url_local = local_balancesheet(symbol_input)
    url_local = "file://" + url_local
    page = urllib2.urlopen(url_local)
    soup = BeautifulSoup(page)
    soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
    list_output = soup_line_item.findAll('td')  # List of elements

If I'm looking for Cash & Short Term Investments, then title_input = "Cash & Short Term Investments".

How can I do the same thing with lxml?

score 1 · Accepted Answer

You can use the lxml parser with BeautifulSoup, so I'm not sure why you're doing this.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    soup = BeautifulSoup(markup, "lxml")

Edit: here's some code to play with. It runs in about 6 seconds for me.

    import urllib2

    from bs4 import BeautifulSoup

    def get_page_data(url):
        f = urllib2.urlopen(url)
        soup = BeautifulSoup(f, 'lxml')  # parse with the lxml backend
        f.close()
        trs = soup.findAll('tr')
        data = {}
        for tr in trs:
            try:
                if tr.div.text.strip() in ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
                                           'Total Liabilities', 'Preferred Stock (Carrying Value)'):
                    # strip thousands separators before converting, e.g. '1,234' -> 1234
                    data[tr.div.text] = [int(''.join(e.text.strip().split(','))) for e in tr.findAll('td')]
            except (AttributeError, ValueError):
                # headers don't have a div tag, so tr.div is None and raises AttributeError
                # 'Fiscal Year Ending in 2011' raises ValueError
                pass
        return data
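
If you do want to drop BeautifulSoup entirely, a rough lxml-only equivalent of the snippet in the question might look like the sketch below. It has not been run against the real SmartMoney page: the function name get_row_values, the SAMPLE_HTML markup, and the assumption that the label is a text node somewhere inside the row are all made up for illustration.

```python
from lxml import html

# Hypothetical markup standing in for the real page, whose structure may differ.
SAMPLE_HTML = (
    "<html><body><table>"
    "<tr><th>Fiscal Year Ending in</th><td>2011</td><td>2010</td></tr>"
    "<tr><td><div>Cash &amp; Short Term Investments</div></td>"
    "<td>1,234</td><td>5,678</td></tr>"
    "</table></body></html>"
)

def get_row_values(page_source, title):
    """Return the text of every <td> in the first row whose label equals title."""
    tree = html.fromstring(page_source)
    # Select rows containing a text node equal to the title (whitespace-normalized).
    # lxml passes keyword arguments to xpath() as XPath variables ($title here).
    rows = tree.xpath('//tr[.//text()[normalize-space(.) = $title]]', title=title)
    if not rows:
        return []
    return [td.text_content().strip() for td in rows[0].xpath('./td')]

print(get_row_values(SAMPLE_HTML, "Cash & Short Term Investments"))
```

Like soup_line_item.findAll('td') in the question, this returns every cell in the row, including the one holding the label, so you would still need to skip that cell and strip the commas before converting to int.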
