Here is my code snippet. I am using Scrapy to scrape a website and want to store the data in Elasticsearch for indexing.

def parse(self, response):
    for news in response.xpath('head'):
        yield {
            'pagetype': news.xpath('//meta[@name="pagetype"]/@content').extract(),
            'description': news.xpath('//div[@class="module__content"]/*/node()/text()').extract(),
        }

My problem is the value that ends up stored in the 'description' field:

    [u'\n              \n              ', u'"For\n              many of us what we eat on Christmas day isn\'t what we would usually consume and\n              that\u2019s perfectly ok," Dr said.', u'"However\n              it is not uncommon for festive season celebrations to begin in November and\n              continue well in to the New Year.', u'"So\n              if health is on the agenda, being mindful about what we put into our bodies\n              with a balanced approach, throughout the whole festive season, is important."', u"Dr\n              , a lecturer at School\n              Sciences, said balancing fresh, healthy food with being physically active was a\n              good start.", u'"Whatever\n              the celebration, try to limit processed foods, often high in fat, sugar and\n              salt," she said.', u'"Taking\n              time during holidays to prepare food and make the most of fresh ingredients is\n              often a much healthier option than relying on convenience foods and take away.', u'"Being\n              mindful about going back for seconds is important too.\xa0 We don\u2019t need to eat until we feel\n              uncomfortable and eating the foods we enjoy doesn\'t necessarily mean we need to\n              eat copious amounts."', u"Dr\n             own healthy tips and substitutes for the Christmas season\n              include:", u'But\n              just because Dr  is a dietitian, doesn\u2019t mean she doesn\u2019t enjoy a\n              Christmas treat or two.', u'"I\n              would have to say my sister in law\'s homemade rocky road is my favourite\n              festive treat. She makes it every Christmas day and it gets better each year," she\n              said.', u'"I\n              also enjoy a summer cocktail every so often during the festive season and a\n              mojito would be one of my favourites on Christmas day. 
We make it with extra\n              mint from the garden which is a nice, fresh addition.', u'"Rather\n              than focusing on food avoidance, moderation is the best approach.', u'"There\n              are definitely some more healthy choices and some less healthy options when it\n              comes to the typical Christmas day menu, but it\'s more important to be mindful\n              of a healthy, balanced diet throughout the festive period, rather than avoiding\n              specific foods on one day of the year."', u'\n                ', u'\n              \n                ', u'\n                ', u'\n              \n                ', u'\n              ', u'\n                ', u'\n                        ', u'\n                        ', u'\n                        ', u'\n                    ', u'\n            ', u'Related News', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'\n          ', u'\n        ', u'Search for related news']

It contains a lot of whitespace, newline (\n) codes, and 'u' characters.

How can I process this further so that it contains only plain text, without the extra whitespace, newline (\n) codes, and 'u' characters?

I have read that BeautifulSoup works well with Scrapy, but I could not find any examples of how to integrate the two. I am open to other approaches as well. Any help is greatly appreciated.

Thanks

1 Answer

You can remove the extra spaces and newlines from the strings in the list using, for example, the method shown in this answer:

[' '.join(item.split()) for item in list_of_strings]

where list_of_strings is the list of strings you gave as an example.
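As a slightly fuller sketch of the same idea, the helper below (the name `clean_strings` is made up for illustration) normalizes the whitespace inside each string and also drops the entries that end up empty, since the scraped list contains many whitespace-only items. It assumes the items are text (Unicode) strings, as in the output above.

```python
def clean_strings(list_of_strings):
    """Collapse runs of whitespace in each string and drop empty results."""
    cleaned = []
    for item in list_of_strings:
        # split() with no argument splits on any whitespace run,
        # including newlines and the non-breaking space (\xa0).
        normalized = u' '.join(item.split())
        if normalized:  # skip items that were whitespace only
            cleaned.append(normalized)
    return cleaned


# A few values shortened from the scraped output above.
raw = [
    u'\n              \n              ',
    u'"For\n              many of us',
    u'important too.\xa0 We don\u2019t',
    u'Related News',
]
cleaned = clean_strings(raw)
print(cleaned)
```

In the spider, a helper like this could be applied to the list returned by `.extract()` before the item is yielded.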

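As for the 'u' characters, you do not need to worry about them. They are not part of the data; they only indicate that the strings are Unicode strings. See, for example, this question on the subject.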
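To illustrate, printing or joining the strings shows only their contents; the prefix appears only when Python 2 displays the list itself. A small sketch using two values taken from the output above:

```python
items = [u'Related News', u'Search for related news']

# Printing each element shows the text only, with no u prefix or quotes.
for item in items:
    print(item)

# Joining the pieces gives a single string suitable for one text field.
description = u' '.join(items)
print(description)
```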

answered 2016-12-24T12:40:51.967