python - 順序を維持しながら、リストのリストを「圧縮された」リストのリストに置き換えます

Question

添付したコードのようにリストのリストがあります。共通の値があれば、各サブリストをリンクしたい。次に、リストのリストをリストの要約リストに置き換えたいと思います。例:[[1,2,3],[3,4]]欲しいリストがあるとし[1,2,3,4]ます。あるなら[[4,3],[1,2,3]]欲しい[4,3,1,2]。私が持っているなら、私は[[1,2,3],[a,b],[3,4],[b,c]]欲しい[[1,2,3,4],[a,b,c]]か、どちらか[[a,b,c],[1,2,3,4]]は気にしません。

私はほとんどそこにいます...

[[1,2,3],[10,5],[3,8,5]]私の問題は、私が望むようなケースがある[1,2,3,10,5,8]が、現在のコードで私が得たときです[1,2,3,8,10,5]

これが私のコードです：

import itertools

a = [1,2,3]
b = [3,4]
i = [21,22]
c = [88,7,8]
e = [5,4]
d = [3, 50]
f = [8,9]
g=  [9,10]
h = [20,21]

lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000  
#I have a lot of list but not very many different lists

def any_overlap(a, b):
  sb = set(b)
  return any(itertools.imap(sb.__contains__, a))

def find_uniq(lst):
    ''' return the uniq parts of lst'''
    seen = set()
    seen_add = seen.add
    return [ x for x in lst if x not in seen and not seen_add(x)]

def overlap_inlist(o_lst, lstoflst):
    '''
    Search for overlap, using "any_overlap", of a list( o_lst) in a list of lists (lstoflst).
    If there is overlap add the uniq part of the found list to the search list, and keep 
    track of where that list was found 
    '''
    used_lst =[ ]
    n_lst =[ ]
    for lst_num, each_lst in enumerate(lstoflst):
        if any_overlap(o_lst,each_lst):
            n_lst.extend(each_lst)
            used_lst.append(lst_num)
    n_lst= find_uniq(n_lst)
    return  n_lst, used_lst

def comb_list(lst):
    '''
    For each list in a list of list find all the overlaps using 'ovelap_inlist'.
    Update the list each time to delete the found lists. Return the final combined list. 
    '''
    for updated_lst in lst:
        n_lst, used_lst = overlap_inlist(updated_lst,lst)
        lst[:] = [x for i,x in enumerate(lst) if i not in used_lst]
        lst.insert(0,n_lst)
    return lst
comb_lst = comb_list(lst)
print comb_lst

このスクリプトの出力は次のとおりです。

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 50, 5], [21, 22, 20]]

キーが次のような元の順序になるようにします。

[[88, 7, 8, 9, 10], [1, 2, 3, 4, 5, 50,], [21, 22, 20]]

5 と 50 は新しい lst[2] で切り替えられます

私はPythonに少し慣れていません。問題の解決策や現在のコードに関するコメントをいただければ幸いです。私はコンピューター科学者ではありませんが、これを迅速に行うある種のアルゴリズムが存在する可能性があると思います (おそらく集合論から?)。そのようなアルゴリズムがある場合は、正しい方向に向けてください。

私はこの方法をもっと複雑にしているかもしれません...ありがとう!!

score 14 · Accepted Answer

これはブルートフォースアプローチです（理解しやすいかもしれません）：

from itertools import chain

def condense(*lists):
    # remember original positions
    positions = {}
    for pos, item in enumerate(chain(*lists)):
        if item not in positions:
            positions[item] = pos

    # condense disregarding order
    sets = condense_sets(map(set, lists))

    # restore order
    result = [sorted(s, key=positions.get) for s in sets]
    return result if len(result) != 1 else result[0]

def condense_sets(sets):
    result = []
    for candidate in sets:
        for current in result:
            if candidate & current:   # found overlap
                current |= candidate  # combine (merge sets)

                # new items from candidate may create an overlap
                # between current set and the remaining result sets
                result = condense_sets(result) # merge such sets
                break
        else:  # no common elements found (or result is empty)
            result.append(candidate)
    return result

例

>>> condense([1,2,3], [10,5], [3,8,5])
[1, 2, 3, 10, 5, 8]
>>> a = [1,2,3]
>>> b = [3,4]
>>> i = [21,22]
>>> c = [88,7,8]
>>> e = [5,4]
>>> d = [3, 50]
>>> f = [8,9]
>>> g=  [9,10]
>>> h = [20,21]
>>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
>>> condense([1], [2, 3, 2])
[[1], [2, 3]]

性能比較

condense_*()関数は、この質問への回答からのものです。lst_OP質問からの入力リスト (異なるサイズ)、 - @Blckknght の回答lst_BKからのテストリスト(異なるサイズ)。ソースを参照してください。

測定では、「互いに素な集合」および「無向グラフの連結成分」の概念に基づくソリューションが、両方のタイプの入力に対して同様に機能することが示されています。

name                       time ratio comment
condense_hynekcer     5.79 msec  1.00 lst_OP
condense_hynekcer2     7.4 msec  1.28 lst_OP
condense_pillmuncher2 11.5 msec  1.99 lst_OP
condense_blckknght    16.8 msec  2.91 lst_OP
condense_jfs            26 msec  4.49 lst_OP
condense_BK           30.5 msec  5.28 lst_OP
condense_blckknght2   30.9 msec  5.34 lst_OP
condense_blckknght3   32.5 msec  5.62 lst_OP


name                       time  ratio comment
condense_blckknght     964 usec   1.00 lst_BK
condense_blckknght2   1.41 msec   1.47 lst_BK
condense_blckknght3   1.46 msec   1.51 lst_BK
condense_hynekcer2    1.95 msec   2.03 lst_BK
condense_pillmuncher2  2.1 msec   2.18 lst_BK
condense_hynekcer     2.12 msec   2.20 lst_BK
condense_BK           3.39 msec   3.52 lst_BK
condense_jfs           544 msec 563.66 lst_BK


name                       time ratio comment
condense_hynekcer     6.86 msec  1.00 lst_rnd
condense_jfs          16.8 msec  2.44 lst_rnd
condense_blckknght    28.6 msec  4.16 lst_rnd
condense_blckknght2   56.1 msec  8.18 lst_rnd
condense_blckknght3   56.3 msec  8.21 lst_rnd
condense_BK           70.2 msec 10.23 lst_rnd
condense_pillmuncher2  324 msec 47.22 lst_rnd
condense_hynekcer2     334 msec 48.61 lst_rnd

結果を再現するには、 gist を複製して実行しますmeasure-performance-condense-lists.py

score 6 · Accepted Answer

これが私のアプローチです。「ばらばらなセット」の概念を使用して、最初にどのサブリストが互いに重複しているかを識別し、次にそれらを結合して重複を排除します。

from collections import OrderedDict
def disjoint_set_find(djs, node): # disjoint set find, with path compression
    if node not in djs: # base case, node is a root of a set
        return node
    djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)   # find root of first set
    second = disjoint_set_find(djs, second) # and of second set
    if first < second: # make smaller root the root of the new combined set
        djs[second] = first
    elif second < first:
        djs[first] = second
    # deliberately ignore the case where first == second (same set)

def condenseBK(*master_list):
    values = OrderedDict() # maps values to the first sublist containing them
    overlaps = {}  # maps sublist indexes to each other to form a disjoint set
    for i, sublist  in enumerate(master_list):
        for v in sublist:
            if v not in values: # no overlap, so just store value
                values[v] = i
            else: # overlap detected, do a disjoint set union
                disjoint_set_union(overlaps, values[v], i)
    output = [] # results
    output_map = {} # map from original indexes to output indexes
    for v, i, in values.items(): # iterate over values in order
        root = disjoint_set_find(overlaps, i)
        if root not in output_map:
            output_map[i] = len(output)
            output.append([]) # create new output sublists as necessary
        output[output_map[root]].append(v)
    return output

出力例:

>>> a = [1,2,3]
>>> b = [3,4]
>>> c = [88,7,8]
>>> d = [3, 50]
>>> e = [5,4]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> i = [21,22]
>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
>>> condenseBK(*lst)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]

アルゴリズムの説明:

リクエストに応じて、私のコードがどのように機能するかを説明します。

最初の 2 つの関数は、素集合findのandunion操作を実装します。データ構造は、ノードを親ノードにマッピングするディクショナリで実装されます。ディクショナリのキーではないノードは、セットのです。この操作は、指定されたを含むセットのルートノードを返します。パフォーマンスを少し向上させるために、「パス圧縮」を実装しました。これにより、時間の経過とともに必要な再帰ステップの数が削減されます。この操作は、引数およびを含むセットを結合します。rootfindnodeunionfirstsecond

メインcondense関数には 2 つの部分があります。まず、いくつかのデータ構造をセットアップし、次にそれらを使用して出力を構築します。

values各値からそれが含まれる最初のサブリストのインデックスにマップする OrderedDictionary です。各値が追加される順序は、正しい順序で出力を生成するために使用されます。

overlaps互いに素な集合として使用される辞書です。そのノードは、重複するサブリストの整数インデックスです。

最初のループは、これら 2 つのデータ構造を埋めます。サブリストとその内容をループします。以前に見られなかった値は、valuesディクショナリに追加されます。見つかった場合、現在のサブリストはその値を含む以前のサブリストと重複しています。

オーバーラップを解決するために、コードは 2 つのサブリストを含む互いに素なセットの結合を行います。

出力はoutputリストに組み込まれます。出力サブリストは入力よりも少ない可能性が高いため、(入力からの) 古いインデックスと出力リストに適用される新しいインデックスとの間をマッピングするための追加の辞書が必要です。

出力リストを埋めるために、値を反復処理します。これは、OrderedDict クラスを使用することで、追加された順序で行われます。互いに素なセットを使用すると、重複するリストを正しく組み合わせることができます。

このアルゴリズムは、すぐにはオーバーラップしない処理対象のリストが多数ある場合に、非常に優れたパフォーマンスを発揮します。たとえば、次の 200 個の 3 要素リストのセットは最終的にすべて重複しますが、最初の 100 個が処理された後にのみ重複が表示されるようになります。

lst2 = [list(range(4*i,   4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]

score 4 · Accepted Answer

あなたの問題は本質的にグラフ理論的なものであり、接続されたコンポーネントの問題であり、各コンポーネントの要素の順序に関する要件が追加されています。

プログラムでは、すべてのリストのセットが無向グラフを形成し、各リストがグラフのノードになります。2 つのリストが共通の要素を持っている場合は直接接続され、両方が直接的または間接的に接続されている 3 番目のリストが存在する場合は間接的に接続されていると言います。たとえば、3 つのリスト [a, b]、[b, c]、および [c, d] が与えられた場合、[a, b] と [b, c] は直接接続され、[b, c] と [c, d] ですが、[a, b] と [c, d] は間接的に接続されています。これらは共通の要素を共有していませんが、両方とも同じリスト [b, c] を持つ要素を共有しているためです。

グループ内のすべてのノードが (直接的または間接的に) 接続されており、グラフ内の他のノードがグループ内のどのノードにも接続されていない場合、ノードのグループは接続コンポーネントです。

無向グラフのすべての接続コンポーネントを生成する、かなり単純な線形時間アルゴリズムがあります。それを使用して、それらの要素の順序を維持しながら、凝縮された互いに素なリストのすべてのリストを生成する関数を定義できます。

from itertools import imap, combinations_with_replacement
from collections import defaultdict

def connected_components(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield component(node)

def condense(lists):
    tuples = combinations_with_replacement(enumerate(imap(tuple, lists)), 2)
    overlapping = ((a, b) for a, b in tuples
                          if not set(a[1]).isdisjoint(b[1]))
    seen = set()
    see = seen.add
    for component in connected_components(overlapping):
        yield [item for each in sorted(component)
                    for item in each[1]
                    if not (item in seen or see(item))]

print list(condense([[1, 2, 3], [10, 5], [3, 8, 5], [9]]))
print list(condense([[1, 2, 3], [5, 6], [3, 4], [6, 7]]))

結果：

[[1, 2, 3, 10, 5, 8], [9]]
[[5, 6, 7], [1, 2, 3, 4]]

condense()すべてのリストを他のすべてのリストに対してテストして、共通の要素があるかどうかを確認する必要があるため、の時間計算量は 2 次です。したがって、パフォーマンスはひどいです。どうにかして改善できないでしょうか？はい。

共通の要素がある場合、2 つのリストは直接接続されます。この関係を逆転させることができます: 2 つの要素が同じリストに属している場合、これらの要素は直接接続され、両方に (直接または間接的に) 接続されている 3 番目の要素が存在する場合、間接的に接続されます。たとえば、2 つのリスト [a, b] と [b, c] が与えられた場合、a と b は直接接続されるだけでなく、b と c も接続されるため、a と c は間接的に接続されます。各要素の最初の出現位置を格納するという JFSebastians のアイデアも採用すると、次のように実装できます。

def condense(lists):
    neighbors = defaultdict(set)
    positions = {}
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node), key=positions.get)

引き続き連結要素アルゴリズムを使用しますが、今回は要素をリストではなく連結要素として表示します。結果は以前と同じですが、時間の計算量が線形になったため、はるかに高速に実行されます。

score 4 · Accepted Answer

これを行うためのよりクリーンな方法があると確信していますが、特定の道を歩み始め、リファクタリングなしで機能させるために必要なことを行いました。

lookup = {}
out = []
index = 0
for grp in lst:
    keys = [lookup.get(num, None) for num in grp]
    keys = [key for key in keys if key is not None]
    if len(keys):
        if len(set(keys)) != 1:
            for num in grp:
                out[keys[0]].append(num)
            seen = set()
            keys = [key for key in keys if key not in seen and not seen.add(key)]
            for key in keys[1:]:
                out[keys[0]].extend(out[key])
                del out[key]
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
        else:
            for num in grp:
                lookup[num] = keys[0]
            out[keys[0]].extend(grp)
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
    else:
        out.append(grp)
        for num in grp:
            lookup[num] = index
        index += 1

print out

セットを使用したリスト削減手法については、@Steven に感謝します。

出力

[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]

score 4 · Accepted Answer

I tried to write a fast and readable solution. It is never much slower than other solutions, if I know, but can be sometimes much faster because it is additionally optimized for longer sublist or for many sublists that are subset of any yet existing group. (This is motivated by text of the question "I have a lot of list but not very many different lists.") The code uses less memory only for condensed data that can be much less than the original data. It can work e.g. with a generator collecting data from a realtime process. The estimate of complexity is O(n log n). I think that no algorithm that uses sorting can be of linear complexity.

def condense(lists):
    groups = {}         # items divided into groups {id(the_set): the_set,...}
    members = {}        # mapping from item to group
    positions = {}      # mapping from item to sequential ordering
    iposition = 0       # counter for positions
    for sublist in lists:
        if not sublist or members.get(sublist[0], set()).issuperset(sublist):
            continue    # speed-up condition if all is from one group
        common = set([x for x in sublist if x in members])
        if common:      # any common group can be a destination for merge
            dst_group = members[common.pop()]
            common = common - dst_group   # are more groups common for sublist?
            while common:
                grp = members[common.pop()]
                if len(grp) > len(dst_group):   # merge shorter into longer grp
                    grp, dst_group = dst_group, grp
                dst_group.update(grp)
                for item in grp:
                    members[item] = dst_group
                del groups[id(grp)]
                common = common - dst_group
        else:           # or create a new group if nothing common
            dst_group = set()
            groups[id(dst_group)] = dst_group
        newitems = []
        for item in sublist:    # merge also new items
            if item not in positions:
                positions[item] = iposition
                iposition += 1
                newitems.append(item)
                members[item] = dst_group
        dst_group.update(newitems)
    return [sorted(x, key=positions.get) for x in groups.values()]

It is faster than pillmuncher2 for subslists longer than approximately 8 items because it can work on more items together. It is also very fast for lists with many similar sublists or many sublists that are subset of any yet existing group. It is faster by 25% over pillmuncher2 for lst_OP, however slower by 15% for lst_BK.

An example of test data with long sublists is [list(range(30)) + [-i] for i in range(100)].

I intentionally wrote "common = common - dst_group" instead of using the set operator -= or "set.difference_update", because the updade in-place is not effective if the set on the right side is much bigger then on the left side.

Modified pillmuncher's solution for easier readability. The modification is a little slower than the original due to replacing a generator by list.append. Probably the most readable fast solution.

# Modified pillmuncher's solution
from collections import defaultdict

def condense(lists):
    neighbors = defaultdict(set)  # mapping from items to sublists
    positions = {}                # mapping from items to sequential ordering
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    see = seen.add
    for node in neighbors:
        if node not in seen:
            unseen = set([node])      # this is a "todo" set
            next_unseen = unseen.pop  # method alias, not called now
            group = []                # collects the output
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                group.append(node)
            yield sorted(group, key=positions.get)

score 3 · Accepted Answer

class List(list): pass

rv = dict()

def condense_step():
    """find and merge one overlapping pair in rv"""
    for i, iv in rv.items():
        for j, jv in rv.items():
            if i != j and i.intersection(j):
                m = i.union(j)
                del rv[i]
                del rv[j]
                rv.setdefault(m, [])
                rv[m] += iv
                rv[m] += jv
                return True

def unique(l):
    """flatten list-of-lists excluding duplicates"""
    seen = set()
    for i in sum(l, []):
        if i not in seen:
            seen.add(i)
            yield i

def condense(inp):
    rv.clear()
    inp = map(List, inp)

    for i in range(len(inp)):
        inp[i].order = i
        rv.setdefault(frozenset(inp[i]), [])
        rv[frozenset(inp[i])].append(inp[i])

    while condense_step():
        pass

    for v in rv.values():
        v.sort(key=lambda x: x.order)

    return [list(unique(i)) for i in rv.values()]

score 2 · Accepted Answer

このソリューションでは、Ordered Dictionary のみを使用します。
元のコピーを変更せずに残したい場合は、 deepcopy()が必要です。

from collections import OrderedDict
from copy import deepcopy

def treat(passed_list):
    L = deepcopy(passed_list)

    dic = OrderedDict()
    for subl in L:
        for x in subl:
            if x not in dic:
                dic[x] =  subl
    print 'dic at start'
    print '\n'.join('%-3s : %s' % (a,dic[a])
                    for a in dic) + '\n'

    for sublist in L:
        short = []
        short.extend(el for el in sublist
                     if el not in short)
        seen  = []
        for k,val in dic.iteritems():
            if val is sublist:
                break
            if k in short:
                if val not in seen:
                    seen.append(val)

        sumseen = []
        for elseen in seen:
            for y in elseen:
                sumseen.append(y)
                dic[y] = sumseen

        if seen:
            for el in sublist:
                if el not in sumseen:
                    sumseen.append(el)
                    dic[el] = sumseen

        sublist[:] = short

    cumul = []
    cumul.extend(lu for lu in dic.itervalues()
                 if lu not in cumul)
    return cumul


plus = [[1,2,3,2,1],[10,5,5,5,10],
        [8,5,3,3,5],[45,50,12,45,40,12]]

lst = [[1,2,3], [10,5], [3,8,5]]

for one_list in (plus,lst):
    print 'one_list before == %r\n' % one_list
    print 'treat(one_list) == %r\n' % treat(one_list)
    print 'one_list after  == %r\n' % one_list
    print '===================================='

結果

one_list before == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]

dic at start
1   : [1, 2, 3, 2, 1]
2   : [1, 2, 3, 2, 1]
3   : [1, 2, 3, 2, 1]
10  : [10, 5, 5, 5, 10]
5   : [10, 5, 5, 5, 10]
8   : [8, 5, 3, 3, 5]
45  : [45, 50, 12, 45, 40, 12]
50  : [45, 50, 12, 45, 40, 12]
12  : [45, 50, 12, 45, 40, 12]
40  : [45, 50, 12, 45, 40, 12]

treat(one_list) == [[1, 2, 3, 10, 5, 8], [45, 50, 12, 40]]

one_list after  == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]

====================================
one_list before == [[1, 2, 3], [10, 5], [3, 8, 5]]

dic at start
1   : [1, 2, 3]
2   : [1, 2, 3]
3   : [1, 2, 3]
10  : [10, 5]
5   : [10, 5]
8   : [3, 8, 5]

treat(one_list) == [[1, 2, 3, 10, 5, 8]]

one_list after  == [[1, 2, 3], [10, 5], [3, 8, 5]]

====================================

このソリューションには不便があります。JF Sebastian のソリューションよりも 2 倍から 3 倍遅くなります。

python - 順序を維持しながら、リストのリストを「圧縮された」リストのリストに置き換えます

7 に答える 7

例

性能比較

Related

Reference