python - How to optimize combining two lists of tuples and removing their duplicates?

Question

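From the question "How to remove elements from a list of tuples if the second item in each tuple is a duplicate?", I am able to remove duplicates of the second element of the tuples from ONE list of tuples.

Let's say I have two lists of tuples: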

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

I need to combine the scores where the second element is the same (score_from_alist * score_from_blist) to get the desired output:

clist = [(0.51,'this is a foo bar sentence'), # 0.51 = 0.789 * 0.646
(0.105, 'this is not really a foo bar')] # 0.105 = 0.325 * 0.323

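Currently I get clist with the function below, but it takes more than 5 seconds when alist and blist each hold around 5500+ tuples whose second element is roughly 20-40 words long. Is there a way to speed up the following function?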

import time

def overlapMatches(alist, blist):
    start_time = time.time()
    clist = []
    overlap = set()
    # compare every pair: len(alist) * len(blist) string comparisons
    for d in alist:
        for dn in blist:
            if d[1] == dn[1]:
                score = d[0]*dn[0]
                overlap.add((score,d[1]))
    # keep the 20 highest-scoring (score, sentence) pairs
    for s in sorted(overlap, reverse=True)[:20]:
        clist.append((s[0],s[1]))
    print("overlapping matches takes %s" % (time.time() - start_time))
    return clist

score 3 · Accepted Answer

Use a dictionary/set to eliminate the duplicates and provide fast lookup:

alist = [(0.7897897,'this is a foo bar sentence'),
(0.653234, 'this is a foo bar sentence'),
(0.353234, 'this is a foo bar sentence'),
(0.325345, 'this is not really a foo bar'),
(0.323234, 'this is a foo bar sentence'),]

blist = [(0.64637,'this is a foo bar sentence'),
(0.534234, 'i am going to foo bar this sentence'),
(0.453234, 'this is a foo bar sentence'),
(0.323445, 'this is not really a foo bar')]

bdict = {k:v for v,k in reversed(blist)}
clist = []
cset = set()
for v,k in alist:
    if k not in cset:
        b = bdict.get(k, None)
        if b is not None:
            clist.append((v * b, k))
            cset.add(k)
print(clist)

Here, blist is turned into bdict, which drops all but the first occurrence of each sentence and provides fast lookup by sentence.
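To see why iterating over `reversed(blist)` keeps the first occurrence: assigning to an existing dictionary key overwrites its value, so walking the list back to front leaves the front-most score in place. A minimal sketch, using a made-up `pairs` list that is not part of the question's data:

```python
# Illustrative data only: two entries share the sentence 'x'.
pairs = [(0.9, 'x'), (0.5, 'y'), (0.1, 'x')]

# Forward iteration: the LAST score seen for each sentence wins.
last_wins = {sentence: score for score, sentence in pairs}

# Reversed iteration: the FIRST score in the list for each sentence wins.
first_wins = {sentence: score for score, sentence in reversed(pairs)}

print(last_wins['x'])   # 0.1
print(first_wins['x'])  # 0.9
```

If the lists are sorted by score in descending order, "first occurrence" also means "highest score" for that sentence.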

If you don't care about the order of clist, you can simplify the structure somewhat:

bdict = {k:v for v,k in reversed(blist)}
cdict = {}
for v,k in alist:
    if k not in cdict:
        b = bdict.get(k, None)
        if b is not None:
            cdict[k] = v * b
# cdict maps sentence -> combined score; emit (score, sentence) tuples
print([(score, sentence) for sentence, score in cdict.items()])
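The question's function also sorts the result and keeps the top 20. The dictionary approach above could be wrapped up the same way; the following is only a sketch, and the name `overlap_matches_fast` and the `top_n` parameter are hypothetical, not from the question or the answer. Note one behavioural difference: the original nested loop multiplies every matching pair, whereas this version uses only the first occurrence of each sentence in each list.

```python
def overlap_matches_fast(alist, blist, top_n=20):
    """Combine scores of sentences found in both lists (first occurrence wins)."""
    # sentence -> score, keeping the first occurrence in blist
    bdict = {sentence: score for score, sentence in reversed(blist)}
    cdict = {}
    for score, sentence in alist:
        # one dictionary lookup per tuple instead of a scan over blist
        if sentence not in cdict and sentence in bdict:
            cdict[sentence] = score * bdict[sentence]
    combined = [(score, sentence) for sentence, score in cdict.items()]
    # highest combined score first, truncated to top_n entries
    return sorted(combined, reverse=True)[:top_n]


# Shortened sample data for illustration.
alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar')]
blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.323445, 'this is not really a foo bar')]
result = overlap_matches_fast(alist, blist)
print(result)
```

Each list is traversed once, so the work grows with len(alist) + len(blist) rather than with their product.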

score 1 · Accepted Answer

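Assuming that each list is sorted by the first item of the tuple in descending order, that when a single list contains duplicates the tuple with the highest first item should be kept, and that tuples are matched across lists when their second items are the same: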

# remove duplicates (take the 1st item among duplicates)
a, b = [{sentence: score for score, sentence in reversed(lst)}
        for lst in [alist, blist]]

# merge (leave only tuples that have common 2nd items (sentences))
# viewkeys() is the Python 2.7 spelling; it returns set-like key views
clist = [(a[s]*b[s], s) for s in a.viewkeys() & b.viewkeys()]
clist.sort(reverse=True) # sort by (score, sentence) in descending order
print(clist)

Output:

[(0.510496368389, 'this is a foo bar sentence'),
 (0.10523121352499999, 'this is not really a foo bar')]
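The snippet above relies on `dict.viewkeys()`, which exists only in Python 2.7. On Python 3, `dict.keys()` already returns a set-like view that supports `&`, so the same idea could be written as follows. This is a sketch of a port, not part of the original answer:

```python
alist = [(0.7897897, 'this is a foo bar sentence'),
         (0.653234, 'this is a foo bar sentence'),
         (0.353234, 'this is a foo bar sentence'),
         (0.325345, 'this is not really a foo bar'),
         (0.323234, 'this is a foo bar sentence')]

blist = [(0.64637, 'this is a foo bar sentence'),
         (0.534234, 'i am going to foo bar this sentence'),
         (0.453234, 'this is a foo bar sentence'),
         (0.323445, 'this is not really a foo bar')]

# remove duplicates: reversed() makes the first occurrence win
a, b = [{sentence: score for score, sentence in reversed(lst)}
        for lst in (alist, blist)]

# Python 3: keys() views support set intersection directly
common = a.keys() & b.keys()

# multiply the scores and sort by (score, sentence) in descending order
clist = sorted(((a[s] * b[s], s) for s in common), reverse=True)
print(clist)
```

The sentences and their order match the output shown above; the last printed digits of the floats may differ slightly between Python versions.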
