python - Scraping a document with BeautifulSoup 4.2 and windmill: how to remove "document.write();" from the markup passed to the BeautifulSoup constructor

Question

I see that BS4 no longer uses "markup massage" the way BS3 did. However, I still need a similar way to discard unwanted document.write calls. In BS3 I would do the following; how do I do it in BS4?

# Javascript code in this page generates HTML markup
# that isn't parsed correctly by BeautifulSoup.
# To avoid this problem, all document.write fragments are removed
my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))

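Also, since the BS4 BeautifulSoup constructor no longer supports the markupMassage argument, where in my program should I handle the document.write problem? I'm only trying to print the table's markup, and I get a thread exception when I run windmill, so I assume that this is the cause.

This is what my code looks like: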

#!/usr/bin/env python
# Generated by the windmill services transformer
#from windmill.authoring import WindmillTestClient
from bs4 import BeautifulSoup

import re, urlparse
from copy import copy
from windmill.authoring import setup_module, WindmillTestClient
from windmill.conf import global_settings
import sys


global_settings.START_CHROME = True # This makes it use Chrome
setup_module(sys.modules[__name__])


def get_table_info(client):
    """
    Parse the HTML page and extract the trades table
    """
    # Get Javascript updated HTML page
    client.waits.forElement(xpath=u"//table[@id='trades']",
                        timeout=40000)
    response = client.commands.getPageText()
    assert response['status']
    assert response['result']

    # Create soup from HTML page and get desired information
    soup = BeautifulSoup(response['result'])

    table_info = soup.select("#trades")
    return table_info


def test_scrape():
    """
    Scrape site
    """

    # Open the trader page
    client = WindmillTestClient(__name__)
    client.open(url='http://www.zulutrade.com/trader/128391')


    table_info = {}
    table_info = get_table_info(client)


    print table_info



test_scrape()

score 0 · Accepted Answer

You don't need to tell BeautifulSoup how to massage the markup. You can modify it yourself before passing it to the BeautifulSoup constructor.

html = response['result']
# Escape the dot and parentheses so they match literally
html = re.sub(r'document\.write\(.+\);', '', html)
html = re.sub(r'alt=".+">', '>', html)
soup = BeautifulSoup(html)
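
To make the pre-processing step easier to reuse and to check on its own, it could be wrapped in a small helper. The sketch below is only an illustration: the names strip_document_write, DOCUMENT_WRITE_RE and ALT_ATTR_RE and the sample markup are hypothetical, and the non-greedy patterns are an assumption on my part (they stop at the first ");" instead of the last one on the line, which will cut a call short if the written string itself contains ");").

```python
import re

# Hypothetical helper, not part of BeautifulSoup or windmill.
# The patterns mirror the ones used above, but are non-greedy (an assumption)
# so that two document.write calls on one line are removed separately.
DOCUMENT_WRITE_RE = re.compile(r'document\.write\(.+?\);')
ALT_ATTR_RE = re.compile(r'alt="[^"]*">')


def strip_document_write(html):
    """Remove document.write(...) calls and alt attributes from raw markup."""
    html = DOCUMENT_WRITE_RE.sub('', html)
    html = ALT_ATTR_RE.sub('>', html)
    return html


# Made-up sample markup, only to exercise the helper.
sample = ('<script>document.write("<td>");</script>'
          '<img src="a.png" alt="x">'
          '<table id="trades"></table>')
cleaned = strip_document_write(sample)
```

In get_table_info, the cleaned string would then replace response['result'] in the constructor call, i.e. BeautifulSoup(strip_document_write(response['result'])). Note that "." does not match newlines by default, so a document.write call spread over several lines would be left untouched by these patterns.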
