python - Crawler produces duplicates when run twice?

Question

I'm using the Python crawler framework Scrapy, and I use the pipelines.py file to store my items in a file in JSON format. The code for doing this is given below:

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped fields and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file.
        json.dump(d, self.file)
        return item

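The problem is that when I run my crawler twice (say), I get duplicate scraped items in my file. I tried to prevent this by first reading from the file and then comparing its contents with the new data to be written. Since the data in the file is in JSON format, I decoded it with the json.loads() function, but it doesn't work: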

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")
        self.temp = json.loads(file.read())

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Iterate over the scraped fields and build a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print "Index out of range"
        # Write it to the file, but only if the newly generated data
        # doesn't match what is already in the file.
        if d != self.temp:
            json.dump(d, self.file)
        return item

Please suggest a way to do this.

Note: I have to open the file in "append" mode, since I may crawl a different set of links. However, running the crawler twice with the same start_url writes the same data to the file twice.

score 1 · Accepted Answer

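You can filter out duplicates by using a custom middleware, such as this one. To actually use it in your spider, though, you'll need two more things: some way of assigning IDs to items so that the filter can identify duplicates, and some way of persisting the set of visited IDs between spider runs. The second is easy: you could use something Pythonic like shelve, or one of the many key-value stores that are popular these days. The first part is going to be harder, though, and will depend on the problem you're trying to solve.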
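As a rough illustration of those two pieces, here is a minimal sketch, written as an item pipeline rather than a middleware. Everything named here is an assumption for the example, not part of the question or of Scrapy itself: the class `DedupPipeline`, the helper `item_fingerprint`, the file names `seen_items.shelf` and `a11ypi_dict.jsonl`, and above all the choice to treat two items as duplicates when all of their fields are equal. The only Scrapy name used is `scrapy.exceptions.DropItem`, with a local stand-in so the sketch also runs without Scrapy installed.

```python
import hashlib
import json
import shelve

try:
    from scrapy.exceptions import DropItem
except ImportError:  # stand-in so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


def item_fingerprint(item):
    """Hypothetical item ID: SHA-1 of the fields serialized in a stable key order."""
    payload = json.dumps(dict(item), sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


class DedupPipeline(object):
    """Drops items whose fingerprint was already stored by this or an earlier run."""

    def __init__(self, seen_path="seen_items.shelf", out_path="a11ypi_dict.jsonl"):
        # Both paths are assumptions made for this example.
        self.seen_path = seen_path
        self.out_path = out_path

    def open_spider(self, spider):
        # The shelf persists the set of seen IDs between spider runs.
        self.seen = shelve.open(self.seen_path)
        # Append mode, so different crawls can add to the same file.
        self.out = open(self.out_path, "a")

    def close_spider(self, spider):
        self.seen.close()
        self.out.close()

    def process_item(self, item, spider):
        key = item_fingerprint(item)
        if key in self.seen:
            raise DropItem("duplicate item: %s" % key)
        self.seen[key] = True
        # One JSON object per line keeps the appended file parseable.
        self.out.write(json.dumps(dict(item), sort_keys=True) + "\n")
        return item
```

Writing one object per line is a deliberate choice in this sketch: a file built by repeated json.dump() calls in append mode holds several JSON documents back to back, and json.loads() on the whole file raises an error once there is more than one. If only some fields identify an item (for example the URL), the fingerprint would need to be computed from those fields instead.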
