python - Extracting HTML from elements/nodes

Question

Suppose I have an HTML string...

<div class="content">
   This is some test <b>this is bold </b> this is great list of text.
</div>
<div class="content">
   <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
   </ul>
</div>

Now, using Scrapy, I want to scrape the contents of these two elements into a single variable.

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
   hxs = HtmlXPathSelector(response)

   # this returns all nested elements/nodes except text
   contents = hxs.select('//div[@class="content"]/*').extract()

   # this returns all nested text except elements/nodes
   contents = hxs.select('//div[@class="content"]/text()').extract()

How can I get the entire nested HTML of both elements/nodes as a string in a variable?

score 1 · Accepted Answer

You can do that: see node() in the answer to a similar question, https://stackoverflow.com/a/10899531/85461.

# Returns all child nodes - text as well as elements.
contents = hxs.select('//div[@class="content"]/node()').extract()

Note that extract() returns a list, which you can join in the usual way to recover the HTML.

html = "\n".join(contents)

score 0 · Accepted Answer

If speed is not a concern, this is easy to do with BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs4/doc/).

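from bs4 import BeautifulSoup  # BeautifulSoup 4, matching the linked docs

# Parse the raw body of the Scrapy response
soup = BeautifulSoup(response.body, "html.parser")
contents = soup.find_all("div", {"class": "content"})
for content in contents:
    print(content)  # this is the div's HTML, including the <div> tag itself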
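
Printing each match as above yields the whole div, wrapper tag included. If only the nested HTML is wanted, as the question asks, one possible variation is to call decode_contents() on each tag, which returns the markup inside the tag as a string. The sketch below runs on a literal copy of the question's markup instead of a Scrapy response; the HTML constant is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed input for this sketch).
HTML = """\
<div class="content">
   This is some test <b>this is bold </b> this is great list of text.
</div>
<div class="content">
   <ul>
      <li>Item 1</li>
      <li>Item 2</li>
      <li>Item 3</li>
   </ul>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# decode_contents() serializes a tag's children without the tag itself.
inner_blocks = [
    div.decode_contents() for div in soup.find_all("div", class_="content")
]

# One string holding the nested HTML of both divs.
combined = "\n".join(inner_blocks)
print(combined)
```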
