python - Fuzzy matching between two DataFrames is slow

Question

I have DataFrame A (df_cam) with cli id and origin:

cli id |            origin
------------------------------------
123    | 1234 M-MKT XYZklm 05/2016

And DataFrame B (df_dict) with shortcut and campaign:

shortcut |         campaign
------------------------------------
M-MKT    | Mobile Marketing Outbound

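For example, I know that a client with origin 1234 M-MKT XYZklm 05/2016 actually comes from the campaign Mobile Marketing Outbound, because the origin contains the keyword M-MKT.

Note that the shortcut is a general keyword that the algorithm uses to decide on a match. The origin can also contain M-Marketing, MMKT or Mob-MKT. I created the list of shortcuts manually by analyzing all origins first. I also use a regex to clean origin before it is passed to the program.

I would like to match each customer origin with a campaign via the shortcut, and attach a score so I can see how close the match is. As shown below: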

cli id | shortcut |         origin            |        campaign           | Score
---------------------------------------------------------------------------------
123    | M-MKT    | 1234 M-MKT XYZklm 05/2016 | Mobile Marketing Outbound | 0.93

Below is my program. It works, but it is really slow. DataFrame A has ~400,000 rows and DataFrame B has ~40 rows.

Is there a way to make it faster?

from fuzzywuzzy import fuzz
list_values = df_dict['Shortcut'].values.tolist()

def TopFuzzMatch(tokenA, dict_, position, value):
    """
    Calculates similarity between two tokens and returns TOP match and score
    -----------------------------------------------------------------------
    tokenA: similarity to this token will be calculated
    dict_: list with shortcuts
    position: whether I want first, second, third...TOP position
    value: 0=similarity score, 1=associated shortcut
    -----------------------------------------------------------------------
    """
    sim = [(fuzz.token_sort_ratio(x, tokenA),x) for x in dict_]
    sim.sort(key=lambda tup: tup[0], reverse=True)
    return sim[position][value]

df_cam['1st_choice_short'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,1), axis=1 )
df_cam['1st_choice_sim'] = df_cam.apply(lambda x: TopFuzzMatch(x['cli_origin'],list_values,0,0), axis=1 )

Note that I would also like to calculate the second and third best matches to evaluate the accuracy.

Edit

I found the process.extractOne method, but the speed stays the same. So my code now looks like this:
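One thing that stands out in the code above is that TopFuzzMatch is called twice per row, and each call scores and sorts every shortcut again. A minimal sketch of scoring once and keeping the best n results is shown here. It uses difflib.SequenceMatcher from the standard library as a stand-in scorer (an assumption made so the sketch runs without extra packages; it is not the fuzzywuzzy algorithm), and the names similarity, top_matches and the sample shortcut list are hypothetical.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer (assumption): 0-100 score, not fuzz.token_sort_ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def top_matches(token, shortcuts, n=3):
    """Score every shortcut once and return the n best as (score, shortcut) pairs."""
    scored = ((similarity(token, s), s) for s in shortcuts)
    return heapq.nlargest(n, scored)


# Hypothetical shortcut list, for illustration only
shortcuts = ["M-MKT", "TV-AD", "EMAIL"]
best = top_matches("M-MKT", shortcuts, n=2)
first_score, first_shortcut = best[0]
```

With fuzzywuzzy itself, process.extract(token, choices, scorer=..., limit=3) returns the best three matches in a single call, which should serve the same purpose.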

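from fuzzywuzzy import fuzz, process

def TopFuzzMatch(token, dict_, value):
    # extractOne returns (match, score): value=0 is the shortcut, value=1 the score
    score = process.extractOne(token, dict_, scorer=fuzz.token_sort_ratio)
    return score[value]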

score 1 · Accepted Answer

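I found a solution: after cleaning the origin column with a regex (removing digits and special characters), only a few hundred distinct values remain, repeated many times. Running the fuzzy algorithm on just those distinct values improves the run time significantly.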

import re

import numpy as np
import pandas as pd
from fuzzywuzzy import process

def TopFuzzMatch(df_cam, df_dict):
    """
    Calculates similarity between two tokens and returns the TOP match
    The idea is to do it only over distinct values in given DF (takes ages otherwise)
    -----------------------------------------------------------------------
    df_cam: DataFrame with client id and origin
    df_dict: DataFrame with the abbreviation that is matched to the description I need
    -----------------------------------------------------------------------
    """
    #Clean special characters and numbers
    df_cam['clean_camp'] = df_cam.apply(lambda x: re.sub('[^A-Za-z]+', '',x['origin']), axis=1)

    #Get unique values and calculate similarity
    uq_origin = np.unique(df_cam['clean_camp'].values.ravel())
    top_match = [process.extractOne(x, df_dict['Shortcut'])[0] for x in uq_origin]

    #To DataFrame
    df_match = pd.DataFrame({'unique': uq_origin})
    df_match['top_match'] = top_match

    #Merge
    df_cam = pd.merge(df_cam, df_match, how = 'left', left_on = 'clean_camp', right_on = 'unique')
    df_cam = pd.merge(df_cam, df_dict, how = 'left', left_on = 'top_match', right_on = 'Shortcut')

    return df_cam

df_out = TopFuzzMatch(df_cam, df_dict)
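
The same deduplication idea can be sketched without pandas: score each distinct cleaned value once, keep the result in a dictionary, and look every row up in it. The sketch below again uses difflib.SequenceMatcher as a stand-in scorer (an assumption, not the fuzzywuzzy algorithm), and the sample data and the names similarity, best_shortcut and match_unique are hypothetical. The calls counter exists only to make the saving visible.

```python
from difflib import SequenceMatcher

calls = 0  # counts scorer invocations, only to make the saving visible


def similarity(a, b):
    """Stand-in scorer (assumption), not the fuzzywuzzy algorithm."""
    global calls
    calls += 1
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_shortcut(value, shortcuts):
    """Return the shortcut most similar to value."""
    return max(shortcuts, key=lambda s: similarity(value, s))


def match_unique(values, shortcuts):
    """Score each distinct value once, then map the result back to every row."""
    lookup = {v: best_shortcut(v, shortcuts) for v in set(values)}
    return [lookup[v] for v in values]


# Hypothetical cleaned origins and shortcuts, for illustration only
origins = ["MMKTXYZklm", "EMAILabc", "MMKTXYZklm", "MMKTXYZklm", "EMAILabc"]
shortcuts = ["MMKT", "EMAIL"]
matched = match_unique(origins, shortcuts)
```

Here five rows contain two distinct values, so the scorer runs 2 x 2 times instead of 5 x 2. With a few hundred distinct values across ~400,000 rows, the reduction is proportionally much larger.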
