python - How does Python's difflib.get_close_matches() function work?

Question

The following are two arrays:

import difflib
import scipy
import numpy

a1=numpy.array(['198.129.254.73','134.55.221.58','134.55.219.121','134.55.41.41','198.124.252.101'], dtype='|S15')
b1=numpy.array(['198.124.252.102','134.55.41.41','134.55.219.121','134.55.219.137','134.55.220.45', '198.124.252.130'],dtype='|S15')

difflib.get_close_matches(a1[-1],b1,2)

Output:

['198.124.252.130', '198.124.252.102']

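Shouldn't '198.124.252.102' be the closest match to '198.124.252.101'?

I went through the documentation, which specifies a float-valued cutoff weight, but it gives no information about the algorithm used.

I need to find out whether the absolute difference between the last octets of the two addresses is 1 (provided the first three octets are the same).

So I am first finding the closest string and then checking that closest string against the condition above.

Is there any other function or way to achieve this? Also, how does get_close_matches() behave?

ipaddr does not seem to offer this kind of operation on IPs.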

score 7 · Accepted Answer

Well, there is this part of the docs that explains your issue:

This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.
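To see what this means for the addresses in the question, here is a minimal sketch that prints the similarity ratio SequenceMatcher assigns to each candidate. It assumes plain Python 3 strings rather than the NumPy byte arrays from the question. Both candidates share 14 matching characters with the target out of 30 in total, so the ratios are equal. The ranking then seems to come down to ordinary string comparison of the candidates, which is an observation about the CPython implementation and not a documented guarantee.

```python
import difflib

word = '198.124.252.101'
candidates = ['198.124.252.102', '198.124.252.130']

# get_close_matches() scores each candidate with SequenceMatcher.ratio(),
# defined as 2 * M / T (M = matched characters, T = combined length).
ratios = {}
for candidate in candidates:
    matcher = difflib.SequenceMatcher(None, candidate, word)
    ratios[candidate] = matcher.ratio()
    print(candidate, ratios[candidate])

# The two ratios are equal, so the order of the result is decided by a tie-break.
print(difflib.get_close_matches(word, candidates, 2))
```

In other words, the matcher does not consider '198.124.252.130' closer; it considers both candidates equally close.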

To get the result you expect, you could use the Levenshtein distance.
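As a rough illustration, the standard library has no Levenshtein function, so the sketch below uses a small hand-written dynamic-programming version. The name levenshtein is hypothetical and not part of difflib or any other library.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

base = '198.124.252.101'
candidates = ['198.124.252.130', '198.124.252.102']

# Sort candidates by edit distance to the base address, nearest first.
by_distance = sorted(candidates, key=lambda ip: levenshtein(base, ip))
print(by_distance)
```

Here '198.124.252.102' is one substitution away from the base, while '198.124.252.130' needs two edits, so the former sorts first.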

But for comparing IPs, I would suggest using integer comparison:

>>> parts = [int(s) for s in '198.124.252.130'.split('.')]
>>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
>>> from operator import sub
>>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
>>> diff
29

You can use this style to create a compare function:

from functools import cmp_to_key, partial
from operator import sub

def compare_ips(base, ip1, ip2):
    # Compare ip1 and ip2 by their weighted octet distance from base.
    base = [int(s) for s in base.split('.')]
    parts1 = (int(s) for s in ip1.split('.'))
    parts2 = (int(s) for s in ip2.split('.'))
    test1 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts1)))
    test2 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts2)))
    # Python 3 has no built-in cmp(); this expression yields -1, 0 or 1.
    return (test1 > test2) - (test1 < test2)

base = '198.124.252.101'
test_list = ['198.124.252.102','134.55.41.41','134.55.219.121',
             '134.55.219.137','134.55.220.45', '198.124.252.130']
# sorted() no longer accepts cmp= in Python 3, so wrap the comparison with cmp_to_key.
sorted(test_list, key=cmp_to_key(partial(compare_ips, base)))
# yields:
# ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137',
#  '134.55.220.45', '134.55.41.41']
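The weights 10**(3-pos) are one possible choice. As an alternative that is not part of the answer above, each address can be treated as a single 32-bit integer, which amounts to weighting the octets by powers of 256. The sketch below assumes Python 3 and its standard ipaddress module; ip_distance is a hypothetical helper name.

```python
import ipaddress

def ip_distance(base, ip):
    """Absolute difference between two IPv4 addresses taken as 32-bit integers."""
    return abs(int(ipaddress.IPv4Address(base)) - int(ipaddress.IPv4Address(ip)))

base = '198.124.252.101'
test_list = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
             '134.55.219.137', '134.55.220.45', '198.124.252.130']

# Sort by numeric distance from the base address, nearest first.
nearest_first = sorted(test_list, key=lambda ip: ip_distance(base, ip))
print(nearest_first)
```

The two addresses in the same /24 still come first, but the order among the distant 134.55.x.x addresses differs from the weighted-sum version, so which metric is appropriate depends on what "close" should mean.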

score 2 · Accepted Answer

A hint from the difflib documentation:

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.
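The recursive idea in that passage can be sketched in a few lines. This is a simplified illustration only: it ignores junk handling and the other refinements in difflib, it is not the library's implementation, and the helper names (longest_common_block, matched_chars, gestalt_ratio) are made up for this example.

```python
from difflib import SequenceMatcher

def longest_common_block(a, b):
    """Return (i, j, size) for the longest contiguous block shared by a and b."""
    best = (0, 0, 0)
    # prev[j + 1] is the length of the common run ending at a[i - 1] and b[j].
    prev = [0] * (len(b) + 1)
    for i, ca in enumerate(a):
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b):
            if ca == cb:
                cur[j + 1] = prev[j] + 1
                if cur[j + 1] > best[2]:
                    size = cur[j + 1]
                    best = (i - size + 1, j - size + 1, size)
        prev = cur
    return best

def matched_chars(a, b):
    """Count matched characters: longest block first, then recurse left and right."""
    i, j, size = longest_common_block(a, b)
    if size == 0:
        return 0
    return (size
            + matched_chars(a[:i], b[:j])
            + matched_chars(a[i + size:], b[j + size:]))

def gestalt_ratio(a, b):
    """Similarity as 2 * matched / total length, the same formula as ratio()."""
    total = len(a) + len(b)
    return 2.0 * matched_chars(a, b) / total if total else 1.0

a, b = '198.124.252.130', '198.124.252.101'
print(gestalt_ratio(a, b), SequenceMatcher(None, a, b).ratio())
```

For these short strings the sketch and SequenceMatcher agree; on longer inputs difflib's extra heuristics may produce different numbers.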

As for your requirement to compare IPs based on custom logic: you should first validate that each string is a proper IP address. After that, writing the comparison logic with simple integer arithmetic should be an easy task. A library is not needed at all.
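A minimal sketch of that approach, assuming IPv4 dotted-quad strings and assuming the condition from the question means "same first three octets, last octets differ by exactly 1". Both function names are hypothetical.

```python
def parse_ipv4(text):
    """Return the four octets of a dotted-quad string, or raise ValueError."""
    parts = text.split('.')
    if len(parts) != 4:
        raise ValueError('expected four octets: %r' % text)
    octets = []
    for part in parts:
        # isdigit() rejects empty strings, signs and whitespace.
        if not part.isdigit() or int(part) > 255:
            raise ValueError('invalid octet %r in %r' % (part, text))
        octets.append(int(part))
    return octets

def is_adjacent(ip1, ip2):
    """True if the first three octets match and the last octets differ by 1."""
    a, b = parse_ipv4(ip1), parse_ipv4(ip2)
    return a[:3] == b[:3] and abs(a[3] - b[3]) == 1

print(is_adjacent('198.124.252.101', '198.124.252.102'))
print(is_adjacent('198.124.252.101', '198.124.252.130'))
```

With a check like this, the fuzzy string match becomes unnecessary: each candidate can be tested against the condition directly.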

score 1 · Accepted Answer

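The difflib documentation mentions:

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching".

As for what that means, the "Gestalt Pattern Matching" Wikipedia page can provide some answers. That page also mentions the Python difflib library and its implementation in its "Applications" section.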

https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching
