python - Scrapy Body Text Only

Question

I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.

Wishing some scholars might be able to help me here scraping all the text from the <body> tag.

score 4 · Accepted Answer

Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body>? (This assumes it is nested in <html>.) It might be even simpler to use the //body selector:

x.select("//body").extract()    # extract body

You can find more information about the selectors Scrapy provides in the Scrapy documentation.

score 2 · Accepted Answer

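It would be handy to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I have gotten close by extracting the text of all the children of paragraph elements.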

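I started with //body//text(), which pulled all the text elements within the body, but this included script elements. //body//p gets all the paragraph elements within the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses text from subtags (bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.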
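The script problem mentioned above can also be handled outside XPath. The following is a minimal sketch using only Python 3's standard html.parser module, not Scrapy itself; the class name VisibleTextParser and the sample markup are hypothetical and chosen only for illustration. It collects text nodes while skipping anything inside script or style elements.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


# Hypothetical sample page, not taken from the original question.
html = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<p>Second paragraph</p></body></html>"
)
parser = VisibleTextParser()
parser.feed(html)
parser.close()
print(parser.chunks)  # ['Hello', 'world', 'Second paragraph']
```

Unlike //body//p//text(), this sketch keeps text outside paragraph elements as well, so it is closer in spirit to //body//text() with the script content filtered out.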

In XPath, / means a direct child, while // includes all descendants.
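The child versus descendant distinction can be tried without Scrapy. This is a small sketch using the standard library's xml.etree.ElementTree, which supports only a limited XPath subset (no text() function) and requires well-formed markup; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (ElementTree needs valid XML).
doc = ET.fromstring(
    "<html><body>"
    "<p>top-level paragraph</p>"
    "<div><p>nested paragraph</p></div>"
    "</body></html>"
)
body = doc.find("body")

# "./p" matches only <p> elements that are direct children of <body>.
direct = [p.text for p in body.findall("./p")]

# ".//p" matches <p> elements at any depth below <body>.
descendants = [p.text for p in body.findall(".//p")]

print(direct)       # ['top-level paragraph']
print(descendants)  # ['top-level paragraph', 'nested paragraph']
```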

% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()

Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'&lt;body&gt;',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'&lt;body&gt;',
u"? (assuming it's nested in ",
u'&lt;html&gt;',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',

Joining the strings with a space gives pretty good output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the  &lt;body&gt;  tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the  /html/body  path to extract  &lt;body&gt; ? (assuming it's nested in  &lt;html&gt; ). It might be even simpler to use the  //body  selector: You can find more information about the selectors Scrapy provides  here . This is a collaboratively edited question and answer site for  professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged asked 1 year ago viewed 280 times active 1 year ago"
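The joined output above still contains doubled spaces, non-breaking spaces (\xa0), and a stray \r\n. One possible cleanup, sketched here in Python 3 with hypothetical fragments that merely resemble the extracted text nodes, is to split on whitespace and join again.

```python
# Hypothetical fragments resembling the extracted text nodes above.
parts = [
    "scraping all the text from the ",
    "<body>",
    " tag.",
    "about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged",
]

# str.split() with no arguments splits on any run of whitespace,
# including non-breaking spaces (\xa0) and line breaks.
text = " ".join(" ".join(parts).split())
print(text)
```

This collapses every run of whitespace to a single space, which is usually acceptable for plain-text dumps but discards paragraph boundaries.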
