python - Pandas fuzzy merge/match name column, with duplicates

Question

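I currently have two dataframes, one for donors and one for fundraisers. I am trying to find out whether any fundraisers also gave donations and, if so, copy some of that information into my fundraisers dataset (donor name, email, and their first donation). The problems with my data are: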

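1. I need to match by name and email, but a user might have a slightly different name (e.g. 'Kat' and 'Kathy').
2. There are duplicate names in both donors and fundraisers:
- 2a) With donors, I can reduce to unique name/email combinations, since I only care about the first donation date.
- 2b) With fundraisers, though, I need to keep both rows and not lose data such as dates.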
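
To make the first problem concrete, here is a minimal sketch of how a similarity score separates a near-match such as 'Kat test' / 'Kathy test' from an unrelated name. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzy-matching libraries discussed in this thread, and the helper name `name_similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


near = name_similarity("Kat test", "Kathy test")   # same person, nickname vs. longer form
far = name_similarity("Kat test", "Tes Ester")     # unrelated names
exact = name_similarity("John Doe", "john doe")    # differs only by case

print(f"near={near:.2f} far={far:.2f} exact={exact:.2f}")
```

Any real threshold between "near" and "far" would have to be tuned on the actual data; the scores here only show that the ordering is the useful signal.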

Sample code I have right now:

import pandas as pd
import datetime
from fuzzywuzzy import fuzz
import difflib

donors = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Tom Smith", "Jane Doe", "Jane Doe", "Kat test"],
    "Email": ['a@a.ca', 'a@a.ca', 'b@b.ca', 'c@c.ca', 'something@a.ca', 'd@d.ca'],
    "Date": ["27/03/2013  10:00:00 AM", "1/03/2013  10:39:00 AM", "2/03/2013  10:39:00 AM", "3/03/2013  10:39:00 AM", "4/03/2013  10:39:00 AM", "27/03/2013  10:39:00 AM"],
})
fundraisers = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Kathy test", "Tes Ester", "Jane Doe"],
    "Email": ['a@a.ca', 'a@a.ca', 'd@d.ca', 'asdf@asdf.ca', 'something@a.ca'],
    "Date": ["2/03/2013  10:39:00 AM", "27/03/2013  11:39:00 AM", "3/03/2013  10:39:00 AM", "4/03/2013  10:40:00 AM", "27/03/2013  10:39:00 AM"],
})

# Parse the day-first date strings of each dataframe
donors["Date"] = pd.to_datetime(donors["Date"], dayfirst=True)
fundraisers["Date"] = pd.to_datetime(fundraisers["Date"], dayfirst=True)

# Keep only each donor's earliest donation, keyed on name + email
donors["code"] = donors["name"].astype(str) + ' ' + donors["Email"].astype(str)
idx = donors.groupby('code')["Date"].transform("min") == donors['Date']
donors = donors[idx].reset_index(drop=True)

So this leaves me with the first donation by each donor (assuming anyone with exactly the same name and email is the same person).
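
The same "earliest donation per donor" result can also be sketched with `sort_values` followed by `drop_duplicates`, which avoids building the helper `code` column. This is an alternative under the same assumption (identical name and email means the same person). The date strings are the question's donor sample with single spaces, parsed with an explicit format that is my assumption about the intended layout.

```python
import pandas as pd

donors = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Tom Smith", "Jane Doe", "Jane Doe", "Kat test"],
    "Email": ["a@a.ca", "a@a.ca", "b@b.ca", "c@c.ca", "something@a.ca", "d@d.ca"],
    "Date": ["27/03/2013 10:00:00 AM", "1/03/2013 10:39:00 AM", "2/03/2013 10:39:00 AM",
             "3/03/2013 10:39:00 AM", "4/03/2013 10:39:00 AM", "27/03/2013 10:39:00 AM"],
})
# An explicit format removes any day/month ambiguity (assumed layout: day/month/year, 12-hour clock)
donors["Date"] = pd.to_datetime(donors["Date"], format="%d/%m/%Y %I:%M:%S %p")

first_donations = (
    donors.sort_values("Date")                                      # earliest donation first
          .drop_duplicates(subset=["name", "Email"], keep="first")  # one row per name/email pair
          .sort_index()                                             # restore the original row order
          .reset_index(drop=True)
)
print(first_donations)
```

Unlike the boolean-mask version, this keeps exactly one row per donor even if two donations share the same earliest timestamp.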

Ideally, I want my fundraisers dataset to look like this:

Date                 Email           name        Donor Name  Donor Email     Donor Date
2013-03-27 10:00:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-27 10:00:00
2013-01-03 10:39:00  a@a.ca          John Doe    John Doe    a@a.ca          2013-03-27 10:00:00
2013-02-03 10:39:00  d@d.ca          Kathy test  Kat test    d@d.ca          2013-03-27 10:39:00
2013-03-03 10:39:00  asdf@asdf.ca    Tes Ester
2013-04-03 10:39:00  something@a.ca  Jane Doe    Jane Doe    something@a.ca  2013-04-03 10:39:00

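I tried following this thread: is it possible to do fuzzy match merge with python pandas? but I keep getting index out of range errors (I am guessing it doesn't like the duplicated names in fundraisers) :( So, are there any ideas on how I can match/merge these datasets?
Doing it with for loops (which works, but is very slow, and I feel there has to be a better way):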

Code:

fundraisers["donor name"] = ""
fundraisers["donor email"] = ""
fundraisers["donor date"] = ""
for donindex in range(len(donors.index)):
    best_ratio = 75  # minimum score a candidate has to beat
    aname = donors.at[donindex, "name"]
    for funindex in range(len(fundraisers.index)):
        comp = fundraisers.at[funindex, "name"]
        ratio = fuzz.ratio(aname, comp)
        if ratio > best_ratio:
            # An exact email match doubles the score
            if donors.at[donindex, "Email"] == fundraisers.at[funindex, "Email"]:
                ratio *= 2
            best_ratio = ratio
            # .at writes to the dataframe directly instead of through chained indexing
            fundraisers.at[funindex, "donor name"] = aname
            fundraisers.at[funindex, "donor email"] = donors.at[donindex, "Email"]
            fundraisers.at[funindex, "donor date"] = donors.at[donindex, "Date"]

score 1 · Accepted Answer

I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].

This is how I would do it with Jaro-Winkler from the jellyfish package:

import jellyfish
import pandas

def get_closest_match(x, list_strings):
    """Return the string in list_strings with the highest Jaro-Winkler similarity to x."""
    best_match = None
    highest_jw = 0

    for current_string in list_strings:
        # Older jellyfish releases expose this function as jellyfish.jaro_winkler
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)

        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string

    return best_match

df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])

# Replace each df2 index label with its closest match in df1's index
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))

df1.join(df2)

Output:

    number  letter
one     1   a
two     2   b
three   3   c
four    4   d
five    5   e
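
To connect this back to the question, the sketch below applies the same closest-match idea to the donors and fundraisers sample while keeping duplicate fundraiser rows. Several things here are assumptions that do not come from the answer: it uses `difflib.SequenceMatcher` from the standard library in place of Jaro-Winkler so that it runs without extra packages, it requires an exact email match before comparing names, and it uses an arbitrary similarity cutoff of 0.75. The helper `closest_donor` and the constants `CUTOFF` and `DONOR_COLS` are hypothetical names, and the fundraisers `Date` column is left out for brevity.

```python
from difflib import SequenceMatcher

import pandas as pd

# Donors already reduced to one row per name/email pair (first donation only)
donors = pd.DataFrame({
    "name": ["John Doe", "Tom Smith", "Jane Doe", "Jane Doe", "Kat test"],
    "Email": ["a@a.ca", "b@b.ca", "c@c.ca", "something@a.ca", "d@d.ca"],
    "Date": pd.to_datetime(["2013-03-01 10:39", "2013-03-02 10:39", "2013-03-03 10:39",
                            "2013-03-04 10:39", "2013-03-27 10:39"]),
})
fundraisers = pd.DataFrame({
    "name": ["John Doe", "John Doe", "Kathy test", "Tes Ester", "Jane Doe"],
    "Email": ["a@a.ca", "a@a.ca", "d@d.ca", "asdf@asdf.ca", "something@a.ca"],
})

CUTOFF = 0.75  # assumed threshold; tune it on real data
DONOR_COLS = ["Donor Name", "Donor Email", "Donor Date"]


def closest_donor(fundraiser):
    """Return the donor fields of the best name match among donors sharing the email."""
    best, best_score = None, CUTOFF
    same_email = donors[donors["Email"] == fundraiser.Email]
    for donor in same_email.itertuples(index=False):
        score = SequenceMatcher(None, fundraiser.name.lower(), donor.name.lower()).ratio()
        if score >= best_score:
            best, best_score = donor, score
    if best is None:
        return {"Donor Name": None, "Donor Email": None, "Donor Date": pd.NaT}
    return {"Donor Name": best.name, "Donor Email": best.Email, "Donor Date": best.Date}


# One lookup per fundraiser row, so duplicate fundraisers are all preserved
rows = [closest_donor(f) for f in fundraisers.itertuples(index=False)]
matches = pd.DataFrame(rows, index=fundraisers.index, columns=DONOR_COLS)
matched = fundraisers.join(matches)
print(matched)
```

Filtering on email first keeps the number of name comparisons small, but it also means a donor who used a different email address is never matched; dropping that filter and scoring every donor would trade speed for recall.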

Update: for better performance, use jaro_winkler from the Levenshtein module.

from jellyfish import jaro_winkler as jf_jw
from Levenshtein import jaro_winkler as lv_jw

%timeit jf_jw("appel", "apple")
>> 339 ns ± 1.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit lv_jw("appel", "apple")
>> 193 ns ± 0.675 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
