python - How can I extract only the text with a Scrapy selector in Python?

Question

I have this code:

   site = hxs.select("//h1[@class='state']")
   log.msg(str(site[0].extract()),level=log.ERROR)

The output is:

 [scrapy] ERROR: <h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>

Is it possible to get only the text, without the HTML tags?

score 57 · Accepted Answer

//h1[@class='state']

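With the XPath above, you are selecting the h1 tag whose class attribute is state.

That is why it selects everything that comes inside the h1 element.

If you only want to select the text of the h1 tag itself, use: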

//h1[@class='state']/text()

If you want to select the text of the h1 tag as well as the text of its child tags, you need to use:

//h1[@class='state']//text()

So the difference is that /text() selects the text of the specific tag only, while //text() selects the text of the specific tag and of its child tags.
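To see the difference on the markup from the question, here is a minimal sketch. It does not use Scrapy: it assumes the fragment is well-formed XML and leans on the standard library's xml.etree.ElementTree, where an element's own text plus the tails of its direct children stand in for /text(), and itertext() stands in for //text(). The helper names direct_text and all_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The h1 fragment from the question, parsed as XML (an assumption: it is well-formed).
HTML = """<h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>"""

h1 = ET.fromstring(HTML)


def direct_text(elem):
    """Text nodes that are direct children of elem (the /text() case)."""
    parts = [elem.text or ""]
    parts += [child.tail or "" for child in elem]
    return parts


def all_text(elem):
    """Text nodes of elem and of all its descendants (the //text() case)."""
    return list(elem.itertext())


# Join the pieces, then tidy up the whitespace.
direct = "".join(direct_text(h1)).strip()
everything = " ".join("".join(all_text(h1)).split())

print(repr(direct))
print(everything)
```

For this particular h1, the direct text nodes hold only whitespace, so the /text() form strips down to an empty string, while the //text() form gives the whole sentence.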

The code below should work for you (switch to //text() if you also need the text inside the child tags):

site = ''.join(hxs.select("//h1[@class='state']/text()").extract()).strip()

score 3 · Answer

You can use BeautifulSoup's get_text() function.

from bs4 import BeautifulSoup

text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly

print(soup.get_text())
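If bs4 is not installed, roughly the same idea can be sketched with the standard library's html.parser. This is only an approximation for simple markup, not how BeautifulSoup is implemented; the TextCollector class and the get_text helper below are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects character data and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; tags themselves are skipped.
        self.parts.append(data)


def get_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.parts)


text = '''
<td><a href="http://www.fakewebsite.com">Please can you strip me?</a>
<br/><a href="http://www.fakewebsite.com">I am waiting....</a>
</td>
'''

plain = get_text(text)
# Collapse the leftover newlines into single spaces before printing.
print(" ".join(plain.split()))
```

The tags are dropped and only the two link texts remain, joined by a single space once the whitespace is collapsed.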

score 1 · Answer

You can use BeautifulSoup to strip the HTML tags. Here is an example:

from BeautifulSoup import BeautifulSoup
''.join(BeautifulSoup(str(site[0].extract())).findAll(text=True))

After that, you can strip any extra whitespace, newlines, and so on.

If you don't want to use an additional module, you can try a simple regular expression:

import re
# replace html tags with ' '
text = re.sub(r'<[^>]*?>', ' ', str(site[0].extract()))
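As a rough, self-contained illustration of the regular-expression route, the sketch below applies the same pattern to the h1 markup from the question and then collapses the leftover whitespace. The strip_tags helper is a made-up name, and the markup is hard-coded here in place of site[0].extract().

```python
import re


def strip_tags(markup):
    """Replace every tag with a space, then collapse runs of whitespace."""
    text = re.sub(r'<[^>]*?>', ' ', markup)
    # split() with no arguments splits on any whitespace, newlines included.
    return ' '.join(text.split())


html = """<h1 class="state"><strong>
            1</strong>
            <span> job containing <strong>php</strong> in <strong>region</strong> paying  <strong>$30-40k per year</strong></span>
                </h1>"""

print(strip_tags(html))
```

Keep in mind that a pattern like this does not decode entities such as &amp;amp; and can be confused by a > inside an attribute value, a comment, or a script block, so it is best kept for simple, known markup.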

score 1 · Answer

I don't have a Scrapy instance running, so I couldn't test this, but you can use text() inside the search expression.

For example:

site = hxs.select("//h1[@class='state']/text()")

(I got this from the tutorial.)
