python - Windmill does not retrieve all of the HTML content

Question

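I'm trying to scrape data from a web page using the Python Windmill framework, but I'm having trouble getting the contents of an HTML table from the page. The table is generated by JavaScript, which is why I'm using Windmill to fetch the content. However, the content that comes back does not include the table, so I get an error when I try to parse it with BeautifulSoup.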

from windmill.authoring import WindmillTestClient
from BeautifulSoup import BeautifulSoup

from copy import copy
import re

def get_massage():
    my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
    my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
    my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
    return my_massage

def test_scrape():
    my_massage = get_massage()
    client = WindmillTestClient(__name__)
    client.open(url='http://marinetraffic.com/ais/datasheet.aspx?MMSI=636092060&TIMESTAMP=2&menuid=&datasource=POS&app=&mode=&B1=Search')
    client.waits.forPageLoad(timeout='60000')
    html = client.commands.getPageText()
    assert html['status']
    assert html['result']
    soup=BeautifulSoup(html['result'],markupMassage=my_massage)
    print soup.prettify()

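Looking at the output from the soup, the table is missing, yet it shows up when I inspect the page content with something like Firebug. Overall, I'm trying to get the contents of the table and parse them into some kind of data structure for further processing. Any help is greatly appreciated!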

score 0 · Accepted Answer

The problem is that the markup massage you are using does not work well with the page you are working on: it removes more HTML than necessary.
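One plausible way a massage rule can remove too much is a greedy `.+` in the regular expression. The sketch below uses only the standard `re` module and a made-up one-line HTML string (an assumption, not the real page source) to show how the `alt=".+">` pattern from the question can swallow a table that sits between two `alt` attributes on the same line. The narrower pattern is an illustrative alternative, not something taken from the original answer.

```python
import re

# Hypothetical single-line snippet standing in for the real page source.
html = ('<img src="a.png" alt="logo">'
        '<table><tr><td>636092060</td></tr></table>'
        '<img src="b.png" alt="flag">')

# Pattern from the question: ".+" is greedy, so the match runs from the
# first alt=" to the last "> on the line, including everything in between.
greedy = re.compile(u'alt=".+">')
greedy_result = greedy.sub(">", html)

# Narrower variant: [^"]* cannot run past the attribute's closing quote.
narrow = re.compile(u'alt="[^"]*">')
narrow_result = narrow.sub(">", html)

print(greedy_result)  # the table is gone
print(narrow_result)  # the table survives
```

Because `.` does not match a newline by default, this only bites when the affected markup shares a line, which is common in generated pages.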

To check whether BeautifulSoup could parse the web page you need, I tried the following:

soup = BeautifulSoup(html['result'])
soup.table

That worked fine, so it seems that in this case no markup massage is needed after all.
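Once the table is present in the HTML, its cells can be collected into a list of rows for further processing. The following is a minimal sketch that uses Python 3's standard `html.parser` rather than BeautifulSoup, so that it runs without extra packages. The `TableRows` class, the sample markup, and the column names are all hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup; in the question this would be html['result'].
sample = ("<table><tr><th>MMSI</th><th>Speed</th></tr>"
          "<tr><td>636092060</td><td>12.3</td></tr></table>")

parser = TableRows()
parser.feed(sample)
parser.close()
rows = parser.rows
print(rows)
```

With BeautifulSoup itself, the same idea would start from `soup.table` and walk its rows and cells.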

