
I'm working with a large database of businesses.

I'd like to be able to compare two business names for similarity to see whether they are possible duplicates.

Below is a list of business names that should test as having a high probability of being duplicates. What is a good way to go about this?

George Washington Middle Schl
George Washington School

Santa Fe East Inc
Santa Fe East

Chop't Creative Salad Co
Chop't Creative Salad Company

Manny and Olga's Pizza
Manny's & Olga's Pizza

Ray's Hell Burger Too
Ray's Hell Burgers

El Sol
El Sol de America

Olney Theatre Center for the Arts
Olney Theatre

21 M Lounge
21M Lounge

Holiday Inn Hotel Washington
Holiday Inn Washington-Georgetown

Residence Inn Washington,DC/Dupont Circle
Residence Inn Marriott Dupont Circle

Jimmy John's Gourmet Sandwiches
Jimmy John's

Omni Shoreham Hotel at Washington D.C.
Omni Shoreham Hotel

10 Answers


I recently did a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings.

First, I'd recommend taking a look at the paper How to play the "Names Game": Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is here, and a PDF is freely available here. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering.

Parsing consists of applying various cleaning techniques. Some examples:

  • Standardizing lettercase (e.g., all lowercase)
  • Standardizing punctuation (e.g., commas must be followed by spaces)
  • Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
  • Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
  • Standardizing legal control terms (e.g., converting "Co." to "Company")

In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by their unaccented counterparts, removed all other special characters, and removed legal control terms from the beginnings and ends of the names, following a list.
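A minimal sketch of that kind of parsing pipeline in Python; the LEGAL_TERMS set here is a tiny made-up stand-in for a proper list of legal control terms:

import string
import unicodedata

# Hypothetical, tiny stand-in for a real list of legal control terms.
LEGAL_TERMS = {"inc", "co", "company", "llc", "ltd", "corp"}

def parse(name):
    # Fold accented characters to unaccented ASCII counterparts.
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Lowercase, then replace punctuation with whitespace.
    name = name.lower()
    name = "".join(" " if c in string.punctuation else c for c in name)
    # Collapse runs of whitespace and strip legal terms from both ends.
    tokens = name.split()
    while tokens and tokens[0] in LEGAL_TERMS:
        tokens = tokens[1:]
    while tokens and tokens[-1] in LEGAL_TERMS:
        tokens = tokens[:-1]
    return " ".join(tokens)

print(parse("Santa Fe East, Inc."))  # -> santa fe east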

Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the Jaccard index is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching.
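As a sketch, the bigram/Jaccard comparison is only a few lines of Python:

def bigrams(name):
    # Set of adjacent character pairs, e.g. "sol" -> {"so", "ol"}.
    return {name[i:i+2] for i in range(len(name) - 1)}

def jaccard(a, b):
    # Size of the intersection over size of the union of the bigram sets.
    a, b = bigrams(a), bigrams(b)
    return len(a & b) / len(a | b) if a | b else 1.0

print(jaccard("examples department", "department of examples"))  # ~0.70, despite word order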

In my case, I did the matching in two steps, first comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining names. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made.
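The bound itself is cheap to compute: for sets of sizes m and n with m ≤ n, the Jaccard index can be at most m/n (the best case being the smaller set entirely contained in the larger one). A sketch of that screening step, with an illustrative threshold:

def jaccard_upper_bound(size_a, size_b):
    # Best case: the smaller set is entirely contained in the larger one.
    return min(size_a, size_b) / max(size_a, size_b)

THRESHOLD = 0.8  # illustrative cutoff, not a value from the original answer

def candidate_pairs(bigram_sets):
    # bigram_sets: dict mapping each parsed name to its set of bigrams.
    names = list(bigram_sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sa, sb = bigram_sets[a], bigram_sets[b]
            if jaccard_upper_bound(len(sa), len(sb)) < THRESHOLD:
                continue  # cannot possibly reach the threshold; skip the full test
            if len(sa & sb) / len(sa | sb) >= THRESHOLD:
                yield a, b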

Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards.
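A sketch of such a pre-filter, assuming each record carries a hypothetical city field:

def same_city(record_a, record_b):
    # Reject a candidate pair outright when the two businesses are in
    # different cities; "city" is a hypothetical auxiliary field.
    return record_a.get("city") == record_b.get("city")

# e.g. pairs = [(a, b) for a, b in pairs if same_city(a, b)]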

I didn't do much filtering. I checked the countries of the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and there was manual checking planned, anyway.

Answered on 2011-06-19T06:58:05.380

I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.

Parsing

Let's use this odd name as an example.

name = "THE |  big,- Pharma: LLC"  # example of a company name

We can start by removing legal control terms (here, LLC). There is an awesome cleanco Python library that does exactly that:

from cleanco import cleanco
name = cleanco(name).clean_name()  # 'THE | big,- Pharma'

Remove all punctuation:

import string
name = name.translate(None, string.punctuation)  # 'THE  big Pharma'

(For Unicode strings, the following code works instead, using the third-party regex module:)

import regex
name = regex.sub(ur"[[:punct:]]+", "", name)  # u'THE  big Pharma'

Split the name into tokens using NLTK:

import nltk
tokens = nltk.word_tokenize(name)  # ['THE', 'big', 'Pharma']

Lowercase all tokens:

tokens = [t.lower() for t in tokens]  # ['the', 'big', 'pharma']

Remove stop words. Note that this might cause problems with companies like On Mars, which would be incorrectly matched to Mars, because On is a stopword.

from nltk.corpus import stopwords
tokens = [t for t in tokens if t not in stopwords.words('english')]  # ['big', 'pharma']

I don't cover accented and special characters here (improvements welcome).

Matching

Now that we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenshtein for this task, but it is still not good enough. The reason is that it does not take into account the importance of words in the name (the way TF-IDF does), so common words like "Company" influence the score just as much as words that might uniquely identify a company.

To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example from it:

# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency[t]**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens, token2frequency)
    b_uniq = sequence_uniqueness(b_tokens, token2frequency)
    return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency) / (a_uniq * b_uniq) ** 0.5

Using that, you can match names whose similarity exceeds a certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard, and Jaro-Winkler) and train a binary classification model on some labeled data which, given those scores, outputs whether the candidate pair is a match or not. More on this can be found in the same blog post.
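As a sketch of that classifier idea with scikit-learn, where the feature rows (uniqueness, Jaccard, and Jaro-Winkler scores per candidate pair) and the labels are placeholders you would produce from your own labeled data:

from sklearn.linear_model import LogisticRegression

# Each row: [uniqueness_score, jaccard_score, jaro_winkler_score] for one
# candidate pair; X_train / y_train are hypothetical labeled examples
# (1 = duplicate, 0 = not a duplicate).
X_train = [[0.9, 0.8, 0.95], [0.2, 0.1, 0.40], [0.7, 0.6, 0.85], [0.1, 0.2, 0.30]]
y_train = [1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Probability that a new candidate pair is a match.
print(clf.predict_proba([[0.8, 0.7, 0.9]])[0][1])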

Answered on 2016-11-09T00:02:43.230

You could use the Levenshtein distance, which can be used to measure the differences between two sequences (it is, essentially, an edit distance).

Levenshtein distance in Python:

def levenshtein_distance(a,b):
    n, m = len(a), len(b)
    if n > m:
        # Make sure n <= m, to use O(min(n,m)) space
        a,b = b,a
        n,m = m,n

    current = range(n+1)
    for i in range(1,m+1):
        previous, current = current, [i]+[0]*n
        for j in range(1,n+1):
            add, delete = previous[j]+1, current[j-1]+1
            change = previous[j-1]
            if a[j-1] != b[i-1]:
                change = change + 1
            current[j] = min(add, delete, change)

    return current[n]

if __name__=="__main__":
    from sys import argv
    print levenshtein_distance(argv[1],argv[2])
Answered on 2011-06-19T05:18:38.183

There is a great library for searching for similar/fuzzy strings in Python: fuzzywuzzy. It's a nice wrapper around the Levenshtein distance measuring mentioned above. Here is how your names could be analysed:

#!/usr/bin/env python

from fuzzywuzzy import fuzz

names = [
    ("George Washington Middle Schl",
     "George Washington School"),

    ("Santa Fe East Inc",
     "Santa Fe East"),

    ("Chop't Creative Salad Co",
     "Chop't Creative Salad Company"),

    ("Manny and Olga's Pizza",
     "Manny's & Olga's Pizza"),

    ("Ray's Hell Burger Too",
    "Ray's Hell Burgers"),

    ("El Sol",
    "El Sol de America"),

    ("Olney Theatre Center for the Arts",
    "Olney Theatre"),

    ("21 M Lounge",
    "21M Lounge"),

    ("Holiday Inn Hotel Washington",
    "Holiday Inn Washington-Georgetown"),

    ("Residence Inn Washington,DC/Dupont Circle",
    "Residence Inn Marriott Dupont Circle"),

    ("Jimmy John's Gourmet Sandwiches",
    "Jimmy John's"),

    ("Omni Shoreham Hotel at Washington D.C.",
    "Omni Shoreham Hotel"),
]

if __name__ == '__main__':
    for pair in names:
        print "{:>3} :: {}".format(fuzz.partial_ratio(*pair), pair)

>>>  79 :: ('George Washington Middle Schl', 'George Washington School')
>>> 100 :: ('Santa Fe East Inc', 'Santa Fe East')
>>> 100 :: ("Chop't Creative Salad Co", "Chop't Creative Salad Company")
>>>  86 :: ("Manny and Olga's Pizza", "Manny's & Olga's Pizza")
>>>  94 :: ("Ray's Hell Burger Too", "Ray's Hell Burgers")
>>> 100 :: ('El Sol', 'El Sol de America')
>>> 100 :: ('Olney Theatre Center for the Arts', 'Olney Theatre')
>>>  90 :: ('21 M Lounge', '21M Lounge')
>>>  79 :: ('Holiday Inn Hotel Washington', 'Holiday Inn Washington-Georgetown')
>>>  69 :: ('Residence Inn Washington,DC/Dupont Circle', 'Residence Inn Marriott Dupont Circle')
>>> 100 :: ("Jimmy John's Gourmet Sandwiches", "Jimmy John's")
>>> 100 :: ('Omni Shoreham Hotel at Washington D.C.', 'Omni Shoreham Hotel')

Another way of solving this kind of problem could be Elasticsearch, which also supports fuzzy searches.
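For example, a fuzzy match query against a hypothetical businesses index with a name field might look like this (a sketch using the standard _search endpoint):

import json
import urllib.request

# Hypothetical index ("businesses") and field ("name"); adjust to your mapping.
query = {
    "query": {
        "match": {
            "name": {"query": "Jimmy Johns Gourmet Sandwiches", "fuzziness": "AUTO"}
        }
    }
}

req = urllib.request.Request(
    "http://localhost:9200/businesses/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))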

Answered on 2016-01-09T01:04:37.603

This is a bit of an update to Dennis's answer. That answer was really helpful, as were the links he posted, but I couldn't get them to work right off. After trying the fuzzywuzzy search, I found this approach gave me a much better set of answers. I have a large list of merchants and I just want to group them together. Eventually I'll have a table I can use to try some machine learning to play around with, but for now this takes a lot of the effort out of it.

I only had to update his code a little bit and add a function to create the token2frequency dictionary. The original article didn't have that either, and its functions didn't reference it correctly.

import pandas as pd
from collections import Counter
from cleanco import cleanco
import regex
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# token2frequency is just a Counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency[t]**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a)
    b_tokens = set(b)
    a_uniq = sequence_uniqueness(a, token2frequency)
    b_uniq = sequence_uniqueness(b, token2frequency)
    if a_uniq==0 or b_uniq == 0:
        return 0
    else:
        return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency)/(a_uniq * b_uniq) ** 0.5

def parse_name(name):
    name = cleanco(name).clean_name()
    #name = name.translate(None, string.punctuation)
    name = regex.sub(r"[[:punct:]]+", "", name)
    tokens = nltk.word_tokenize(name) 
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stopwords.words('english')] 
    return tokens

def build_token2frequency(names):
    alltokens = []
    for tokens in names.values():
        alltokens += tokens

    return Counter(alltokens)

with open('marchants.json') as merchantfile:
    merchants = pd.read_json(merchantfile)

# Assumes the JSON parses to a single column of merchant names;
# squeeze() turns that one-column frame into a Series before deduplicating.
merchants = merchants.squeeze().unique()
parsed_names = {merchant: parse_name(merchant) for merchant in merchants}
token2frequency = build_token2frequency(parsed_names)

grouping = {}
for merchant, tokens in parsed_names.items():
    grouping[merchant] = {merchant2: name_similarity(tokens, tokens2, token2frequency) for merchant2, tokens2 in parsed_names.items()}

filtered_matches = {}
for merchant in parsed_names:
    filtered_matches[merchant] = {merchant1: ratio for merchant1, ratio in grouping[merchant].items() if ratio > 0.3}

This will give you a final filtered list of names and the other names they match up to. It's the same basic code as the other post, just with a couple of missing pieces filled in. This was run in Python 3.8.

Answered on 2020-07-15T16:36:16.890

Searching for "python edit distance", this library came up as the first result: http://www.mindrot.org/projects/py-editdist/

Another Python library that does the same job can be found here: http://pypi.python.org/pypi/python-Levenshtein/

An edit distance represents the amount of work needed to convert one string into another, following only simple (usually character-based) edit operations. Every operation (substitution, deletion, insertion, and sometimes transposition) has an associated cost, and the minimum edit distance between two strings is a measure of how dissimilar they are.

In your particular case, you could order the strings so that you find the distance from the longer one to the shorter one, and penalize character deletions less (since, in many cases, one of the strings is almost a substring of the other). So deletions shouldn't be penalized heavily.
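As a sketch, a Levenshtein variant with a cheaper deletion cost when going from the longer string to the shorter one; the 0.5 cost is an arbitrary illustration:

def weighted_edit_distance(longer, shorter, del_cost=0.5, ins_cost=1.0, sub_cost=1.0):
    # Standard dynamic-programming edit distance, except that deleting a
    # character from the longer string is cheaper than inserting or substituting.
    m, n = len(longer), len(shorter)
    prev = [j * ins_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * del_cost] + [0.0] * n
        for j in range(1, n + 1):
            cost = 0.0 if longer[i - 1] == shorter[j - 1] else sub_cost
            cur[j] = min(prev[j] + del_cost,     # delete from the longer string
                         cur[j - 1] + ins_cost,  # insert into the longer string
                         prev[j - 1] + cost)     # match / substitute
        prev = cur
    return prev[n]

print(weighted_edit_distance("olney theatre center for the arts", "olney theatre"))  # 10.0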

You could also make use of this sample code: http://norvig.com/spell-correct.html

Answered on 2011-06-19T05:12:29.497

Consider using the Diff-Match-Patch library. You would be interested in the Diff process: applying a diff to your texts can give you a good idea of the differences, along with a programmatic representation of them.
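A minimal sketch with the Python port of the library (pip install diff-match-patch), using its diff_main and diff_cleanupSemantic calls:

from diff_match_patch import diff_match_patch

dmp = diff_match_patch()
diffs = dmp.diff_main("Santa Fe East Inc", "Santa Fe East")
dmp.diff_cleanupSemantic(diffs)  # merge tiny edits into human-readable chunks
# Each tuple is (operation, text): 0 = equal, -1 = delete, 1 = insert.
print(diffs)  # [(0, 'Santa Fe East'), (-1, ' Inc')]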

Answered on 2011-06-19T05:14:22.733

Separate the names into words using whitespace, commas, and so on. Then count the number of words one name has in common with another, and set a threshold on that count before the two names are considered "similar".

The other way is to do the same, but then take the words and compare them character by character. For each word, check whether its characters are found in the other word in the same order (from either side) for x characters (or a percentage of them); if so, you can say the word is similar too.

For example: you have "square" and "squares".

Checking character by character, you find that all of "square" is in "squares", in the same order, so they are similar words.
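A rough sketch of both ideas in Python; the splitting pattern and thresholds are arbitrary placeholders:

import re

def shared_word_count(a, b):
    # Split on whitespace/commas and count the words the two names share.
    words_a = set(re.split(r"[\s,]+", a.lower()))
    words_b = set(re.split(r"[\s,]+", b.lower()))
    return len(words_a & words_b)

def is_subsequence(short, long_):
    # True if every character of `short` appears in `long_` in the same order.
    it = iter(long_)
    return all(c in it for c in short)

print(shared_word_count("El Sol", "El Sol de America"))  # 2 shared words
print(is_subsequence("square", "squares"))               # True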

Answered on 2011-06-19T04:04:01.593

The algorithms that are based on the Levenshtein distance are good (not perfect), but their main disadvantage is that each comparison is slow, and you would have to compare every possible combination of names.

Another way of working out the problem would be to use embeddings or bag-of-words to transform each company name (after some cleaning and preprocessing) into a vector of numbers, and after that apply an unsupervised or supervised ML method, depending on what is available.
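A sketch of the bag-of-words route with scikit-learn, vectorizing the names on character n-grams and comparing them with cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "George Washington Middle Schl",
    "George Washington School",
    "Santa Fe East Inc",
    "Santa Fe East",
]

# Character 2-3-grams within word boundaries; robust to small spelling variations.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(names)

# Pairwise cosine similarities; high off-diagonal values suggest candidate duplicates.
print(cosine_similarity(vectors).round(2))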

Answered on 2018-07-05T13:00:08.547

I created matchkraft (https://github.com/MatchKraft/matchkraft-python). It works on top of fuzzywuzzy, and you can fuzzy-match company names in one list.

It is very easy to use. Here is an example in Python:

import time

from matchkraft import MatchKraft

mk = MatchKraft('<YOUR API TOKEN HERE>')

job_id = mk.highlight_duplicates(name='Stackoverflow Job',
    primary_list=[
        'George Washington Middle Schl',
        'George Washington School',
        'Santa Fe East Inc',
        'Santa Fe East',
        'Rays Hell Burger Too',
        'El Sol de America',
        'microsoft',
        'Olney Theatre',
        'El Sol'
    ]
)
print(job_id)

mk.execute_job(job_id=job_id)

job = mk.get_job_information(job_id=job_id)
print(job.status)

# Poll until the matching job has finished.
while job.status != 'Completed':
    print(job.status)
    time.sleep(10)
    job = mk.get_job_information(job_id=job_id)

results = mk.get_results_information(job_id=job_id)
if isinstance(results, list):
    for r in results:
        print(r.master_record + ' --> ' + r.match_record)
else:
    print("No Results Found")
Answered on 2021-06-24T02:31:06.707