python-2.7 - Python code for removing duplicate vCards from a vcf file works with vobject, but only for "exact duplicates"

Question

#!/usr/bin/env python2.7 

import vobject

abinfile='/foo/bar/dir/infile.vcf' #ab stands for address book  

aboutfile='/foo/bar/dir/outfile.vcf'  

def eliminate_vcard_duplicates (abinfile, aboutfile):

    #we first convert the Address Book IN FILE into a list

    with open(abinfile) as source_file:
        ablist = list(vobject.readComponents(source_file))

    #then add each vcard from that list in a new list unless it's already there

    ablist_norepeats=[]
    ablist_norepeats.append(ablist[0])

    for i in range(1, len(ablist)):
        jay=len(ablist_norepeats)
        for j in reversed(range(0, jay)): #we do reversed because usually cards have duplicates nearby
            if ablist_norepeats[j].serialize() == ablist[i].serialize():
                break
            else:
                jay += -1
        if jay == 0:
            ablist_norepeats.append(ablist[i])

    #and finally write the singularized list to the Address Book OUT FILE

    with open(aboutfile, 'w') as destination_file:
        for j in range(0, len(ablist_norepeats)):
            destination_file.write(ablist_norepeats[j].serialize()) #serialize must be called, not passed as a method object

eliminate_vcard_duplicates(abinfile, aboutfile)

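The code above works and creates a new file with no exact duplicates (cards that are identical in every detail). I know the code has some efficiency problems: each vCard could be serialized only once, the for loops are used inefficiently, and so on. Here I wanted to provide short code that illustrates one of the problems I don't know how to solve.

The problem I don't know how to solve elegantly is this: if some of the fields in a card appear in a different order, the code does not detect that the cards are equal. Is there a way to detect such duplicates with vobject, re, or another approach?

The file used in the test contains four equivalent vCards, in which the phone numbers and the email line appear in varying order: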

BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:123456789
TEL;TYPE=CELL:987654321
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
END:VCARD
BEGIN:VCARD
VERSION:3.0
FN:Foo_bar1
N:;Foo_bar1;;;
TEL;TYPE=CELL:987654321
TEL;TYPE=CELL:123456789
EMAIL;TYPE=INTERNET:foobar1@foo.bar.com
END:VCARD

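The code above does not detect that all four are the same, because the last one has its phone numbers in a different order.

As a bonus, if someone has a faster algorithm, it would be great if they could share it. The one above takes days on a file with 30,000 vCards...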

score 1 · Accepted Answer

Below is faster code (by about three orders of magnitude), but it still removes only exact duplicates...

    #!/usr/bin/env python2.7 

    import vobject
    import datetime

    abinfile='/foo/bar/dir/infile.vcf' #ab stands for address book  

    aboutfile='/foo/bar/dir/outfile.vcf' 

    def eliminate_vcard_duplicatesv2(abinfile, aboutfile):

        #we first convert the Address Book IN FILE into a list
        ablist=[]
        with open(abinfile) as source_file:
            ablist = list(vobject.readComponents(source_file))

        #we then serialize the list to expedite comparison process
        ablist_serial=[]
        for i in range(0, len(ablist)):
            ablist_serial.append(ablist[i].serialize())

        #then add each unique vcard's position from that list in a new list unless it's already there
        ablist_singletons=[]
        duplicates=0
        for i in range(0, len(ablist_serial)): #start at 0 so the first card is not skipped
            if i % 1000 == 0:
                print "COMPUTED CARD:", i, "Number of duplicates: ", duplicates, "Current time:", datetime.datetime.now().time()
            jay=len(ablist_singletons)
            for j in reversed(range(0, jay)): #we do reversed because usually cards have duplicates nearby
                if ablist_serial[ablist_singletons[j]] == ablist_serial[i]:
                    duplicates += 1
                    break
                else:
                    jay += -1
            if jay == 0:
                ablist_singletons.append(i)

        print "Length of Original Vcard File: ", len(ablist)
        print "Length of Singleton Vcard File: ", len(ablist_singletons)
        print "Generating Singleton Vcard file and storing it in: ", aboutfile

        #and finally write the singularized list to the Address Book OUT FILE
        with open(aboutfile, 'w') as destination_file:
            for k in range(0, len(ablist_singletons)):
                destination_file.write(ablist_serial[ablist_singletons[k]])

    eliminate_vcard_duplicatesv2(abinfile, aboutfile)
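
For the reordered-field case from the question, one possible approach is to compare cards by a key that ignores line order and to track seen keys in a set, which avoids the nested loop altogether. The following is only a minimal sketch and not a tested replacement for the code above. It does not use vobject. The helper names `split_vcards`, `card_key` and `dedupe_vcards` are made up for this illustration, and it assumes that two cards are "the same" when they contain the same content lines in any order (so differences in case or in parameters such as TYPE still count as different cards).

```python
def split_vcards(text):
    """Split raw .vcf text into cards, each a list of unfolded content lines."""
    cards = []
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # ignore blank lines
        if raw[:1] in (" ", "\t") and current:
            # a line starting with whitespace continues the previous (folded) line
            current[-1] += raw[1:]
            continue
        line = raw.strip()
        if line.upper() == "BEGIN:VCARD":
            current = [line]
        elif current is not None:
            current.append(line)
            if line.upper() == "END:VCARD":
                cards.append(current)
                current = None
    return cards


def card_key(lines):
    """Order-insensitive key (assumption: same set of lines means same card)."""
    body = [l for l in lines if l.upper() not in ("BEGIN:VCARD", "END:VCARD")]
    return tuple(sorted(body))


def dedupe_vcards(text):
    """Return .vcf text keeping only the first card seen for each key."""
    seen = set()
    unique = []
    for lines in split_vcards(text):
        key = card_key(lines)
        if key not in seen:  # set lookup instead of scanning all kept cards
            seen.add(key)
            unique.append(lines)
    return "".join("\r\n".join(lines) + "\r\n" for lines in unique)
```

To use it, read the whole input file into a string, pass it to `dedupe_vcards`, and write the returned string to the output file. With the four test cards from the question, all four produce the same key, so only the first one should be kept. Because each card is looked up in a set, the work grows roughly linearly with the number of cards instead of quadratically.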
