python - Parsing HTML data into a Python list for manipulation

Question

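I'm trying to read in HTML websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read the page in and use either BeautifulSoup or html2text to create a huge text block. I then want to search that text (I have been using re.search) but can't seem to get it to work properly. Here is the line I am trying to access:

EPS (Basic)\n13.4620.6226.6930.1732.81\n\n

So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].

Thanks for any help.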

from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup

ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials'  #build url

text_soup = BeautifulSoup(urlopen(full_url).read()) #read in 

text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)

eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
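
A note on why the pattern above finds nothing: in the extracted text, "EPS" is followed by " (Basic)" rather than by digits, so `EPS\s+(\d+)` cannot match. The sketch below is written for Python 3 (the code in this thread is Python 2). It matches the label explicitly and then splits the fused run of digits. It assumes every value has exactly two decimal places, which is taken from the sample line and is not guaranteed by the page. The name `split_values` is made up for illustration.

```python
import re

# The line from the question, as it appears once the tags are stripped.
text = "EPS (Basic)\n13.4620.6226.6930.1732.81\n\n"

# The original pattern fails: "EPS" is followed by " (Basic)", not by digits.
assert re.search(r"EPS\s+(\d+)", text) is None


def split_values(text):
    """Return the EPS values as floats, or [] if the label is missing."""
    match = re.search(r"EPS \(Basic\)\s*([\d.]+)", text)
    if match is None:
        return []
    # Assumption: each value has exactly two decimals, so the fused run
    # "13.4620.62..." can be cut after every second decimal digit.
    return [float(v) for v in re.findall(r"\d+\.\d{2}", match.group(1))]


EPS = split_values(text)
print(EPS)
```

This only works while the two-decimal assumption holds, and a negative value would lose its sign, so reading the table cells individually is the more robust route.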

score 2 · Accepted Answer

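Using regular expressions to parse HTML is not recommended. Use the BeautifulSoup parser instead: find the cell that has the rowTitle class and contains the text EPS (Basic), then iterate over its following siblings that have the valueCell class.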

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in

titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]

Prints:

['13.46', '20.62', '26.69', '30.17', '32.81']
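
The cells come back as strings, while the question asks for a list of numbers. A minimal sketch of the conversion, assuming every cell holds a plain decimal number (a cell containing a dash or a thousands separator would need extra handling):

```python
# Output of the snippet above, copied here so the example runs on its own.
values = ['13.46', '20.62', '26.69', '30.17', '32.81']

# float() raises ValueError on anything that is not a plain number,
# which makes unexpected cell contents easy to spot.
EPS = [float(v) for v in values]
print(EPS)
```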

Hope that helps.

score 2 · Accepted Answer

I would take a very different approach. We use lxml for scraping HTML pages.

One reason we switched is that BS went without maintenance, or rather without updates, for a while.

In my test I ran the following

import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content

tree = html.fromstring(page_as_string)

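Looking at the page, I can see that the data is split into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically, but I will leave that to you.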

tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
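
Picking the table programmatically could look roughly like the sketch below: scan each table for a cell whose text equals the wanted row label. To keep the example self-contained it uses the standard library's xml.etree.ElementTree on a small hand-written sample (Python 3). lxml elements offer the same iter() and itertext() methods, so the function should carry over, but that has not been run against the real page. The function name and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page: two tables, EPS in the second one.
# The "Placeholder row" values are invented.
SAMPLE = """<html><body>
<table><tr><th>Item</th><th>2012</th></tr>
<tr><td>Placeholder row</td><td>1.00</td></tr></table>
<table><tr><th>Item</th><th>2012</th></tr>
<tr><td> EPS (Basic) </td><td>32.81</td></tr></table>
</body></html>"""


def find_table_with_label(root, label):
    """Return the first table containing a td whose stripped text is label."""
    for table in root.iter('table'):
        for cell in table.iter('td'):
            if "".join(cell.itertext()).strip() == label:
                return table
    return None


root = ET.fromstring(SAMPLE)
eps_table = find_table_with_label(root, 'EPS (Basic)')
```

Returning None when nothing matches lets the caller decide how to handle a page whose layout has changed.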

I noticed that the first row has the column headings, so I want to separate out all of the rows

table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']

Now get the column headings:

column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']

Finally, we can map the column headings to the row labels and cell values

my_results = []
for row in table_rows[1:]:
    cell_content = [ e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell

    my_results.append(temp_dict)

Now access the results

for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]   


row_label :  EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :

There is still more to do here. For example, I did not test for squareness (that the number of cells in each row is equal).
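
A squareness test could be as small as the sketch below (Python 3). It assumes each row has already been reduced to a list of cell strings, as cell_content is in the loop above. The function name, the placeholder first heading and the short row are made up for illustration.

```python
def ragged_rows(column_headings, rows):
    """Return (row_index, cell_count) for rows that do not match the header."""
    expected = len(column_headings)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


# The first heading is a placeholder; the others come from the output above.
column_headings = ['label', '2008', '2009', '2010', '2011', '2012', '5-year trend']
rows = [
    ['EPS (Basic)', '13.46', '20.62', '26.69', '30.17', '32.81', ''],
    ['Invented short row', '1.00'],  # deliberately too short
]

problems = ragged_rows(column_headings, rows)
print(problems)
```

Running such a check before the mapping loop would also guard the column_headings[numb] lookup against an IndexError on rows that are longer than the header.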

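Finally, I am a novice, and I suspect others will advise more direct ways of getting at these elements (xPath or cssselect), but this does work and it gets you everything from the table in a nicely structured manner.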
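
To give an idea of what a path-based lookup looks like, here is a sketch using the limited XPath subset of the standard library's xml.etree.ElementTree on a hand-written sample (Python 3). lxml elements have an xpath() method that supports full XPath, so the equivalent lxml expression could be more flexible. This sketch is a stand-in and has not been run against the real page. The sample markup and the placeholder row are invented.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; only the EPS (Basic) values come from the thread.
SAMPLE = """<table>
<tr><th>Item</th><th>2008</th><th>2009</th></tr>
<tr><td>EPS (Basic)</td><td>13.46</td><td>20.62</td></tr>
<tr><td>Placeholder row</td><td>1.00</td><td>2.00</td></tr>
</table>"""

root = ET.fromstring(SAMPLE)

# [td='...'] selects rows that have a td child whose full text equals
# the label exactly (no whitespace stripping is done here).
row = root.find(".//tr[td='EPS (Basic)']")

# Skip the first cell (the label) and keep the values.
values = [td.text for td in row.findall('td')[1:]]
print(values)
```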

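I should add that every row from the table is available, and that the rows are in their original order. The first item (a dictionary) in the my_results list holds the data from the first row, the second item holds the data from the second row, and so on.

When I need a new build of lxml, I visit a page maintained by a really nice guy at UC-IRVINE.

I hope this helps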

score 1 · Accepted Answer

from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd

url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'

soup = BeautifulSoup(urllib2.urlopen(url).read(), 'lxml')  # name the parser explicitly

table = soup.find('table', {'data-ajax-content' : 'true'})

data = []

for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])

df = pd.DataFrame(data)

print df

dictframe = df.to_dict()

print dictframe

The code above gets a DataFrame from the web page and uses it to create a Python dictionary.
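
With its default arguments, to_dict() returns a dictionary keyed by column, with each column mapping row index to value. If a lookup by row label is more convenient, the row lists collected in data can be turned into a dictionary directly. The sketch below (Python 3) assumes the first non-empty cell of each row is its label. The labels and numbers are invented, since the real page contents are not shown here.

```python
# Invented rows in the shape the loop above produces: header rows contain
# only th cells, so they come out as empty lists.
data = [
    [],
    ['Row label A', '1.0', '2.0', '3.0'],
    ['Row label B', '4.0', '5.0', '6.0'],
]

# Skip empty rows, use the first cell as the key and the rest as the values.
by_label = {row[0]: row[1:] for row in data if row}
print(by_label['Row label B'])
```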
