python - Python Scrapy: how to remove extra parsed characters

Question

While parsing with Scrapy, I got this output:

[u'TARTARINI AUTO SPA (CENTRALINO SELEZIONE PASSANTE)'],"[u'VCBONAZZI\xa043', u'40013', u'CASTEL MAGGIORE']",[u'0516322411'],[u'info@tartariniauto. it'],[u'CARS (LPG INSTALLERS)'],[u'track.aspx?id=0&url=http://www.tartariniauto.it']

As you can see, there are some extra characters, such as:

u' \xa043 " ' [ ]

I don't want these. How can I remove them? Also, this string contains 5 items. I want the string to look like this:

item1, item2, item3, item4, item5

Here is my pipelines.py code:

from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, MapCompose, Join
import re
import json
import csv

class InfobelPipeline(object):
    def __init__(self):
      self.file = csv.writer(open('items.csv','wb'))
    def process_item(self, item, spider):
      name = item['name']
      address = item['address']
      phone = item['phone']
      email = item['email']
      category = item['category']
      website = item['website']
      self.file.writerow((name,address,phone,email,category,website))
      return item

Thanks

score 5 · Accepted Answer

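The extra characters you are seeing come from Unicode strings. You will see them a lot when scraping the web. Common examples include the copyright symbol (©, Unicode code point U+00A9) and the trademark symbol (™, Unicode code point U+2122).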
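To find out which character an escape such as `\xa0` stands for, the standard `unicodedata` module can look up its name. This is only a small sketch; the helper name `describe` is made up for illustration and is not part of Scrapy.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

# \xa0 appears in the scraped address; the other two are the symbols named above
for ch in (u"\xa0", u"\xa9", u"\u2122"):
    print(describe(ch))
# U+00A0 NO-BREAK SPACE
# U+00A9 COPYRIGHT SIGN
# U+2122 TRADE MARK SIGN
```

So the `\xa0` in `u'VCBONAZZI\xa043'` is a non-breaking space, not stray garbage.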

The easiest way to remove them is to encode the string to ASCII and discard any characters that are not ASCII (none of these are ASCII characters):

>>> example = u"Xerox ™ printer"
>>> example
u'Xerox \u2122 printer'
>>> example.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 6: ordinal not in range(128)
>>> example.encode('ascii', errors='ignore')
'Xerox  printer'
>>>

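As you can see, trying to encode the symbol to ASCII raises a UnicodeEncodeError, because the character cannot be represented in ASCII. However, if you add the errors='ignore' keyword argument, any symbol that cannot be encoded is simply dropped.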
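Applied to the scraped fields, the same idea might look like the sketch below. It assumes each item field is a list of unicode strings, as in the output shown in the question. The helper name `clean_field` and the choice to turn `\xa0` into a plain space before encoding are illustrative assumptions, not something Scrapy provides.

```python
def clean_field(values, sep=u" "):
    """Join a list of scraped strings into one ASCII-only string."""
    cleaned = []
    for value in values:
        # Assumption: keep word boundaries by mapping non-breaking spaces
        # to ordinary spaces before dropping non-ASCII characters.
        value = value.replace(u"\xa0", u" ")
        # Drop every character that cannot be represented in ASCII.
        value = value.encode("ascii", "ignore").decode("ascii")
        cleaned.append(value.strip())
    return sep.join(cleaned)

address = [u"VCBONAZZI\xa043", u"40013", u"CASTEL MAGGIORE"]
print(clean_field(address))
# VCBONAZZI 43 40013 CASTEL MAGGIORE
```

In `process_item`, each field could be passed through such a helper before calling `writerow`, so the CSV receives plain strings rather than the printed form of a Python list, which is where the `u'`, `[` and `]` come from.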
