python - How do I do a SQL style disjoint or set difference on two Pandas DataFrame objects?

Question

I'm trying to use Pandas to solve an issue courtesy of an idiot DBA not doing a backup of a now crashed data set, so I'm trying to find differences between two columns. For reasons I won't get into, I'm using Pandas rather than a database.

What I'd like to do is, given:

Dataset A = [A, B, C, D, E]  
Dataset B = [C, D, E, F]

I would like to find values which are disjoint.

Dataset A!=B = [A, B, F]

In SQL this is standard set logic; the syntax varies by dialect, but the operation itself is standard. How do I apply this elegantly in Pandas? I would love to include some code, but nothing I have is even remotely correct. It's a situation in which I don't know what I don't know. Pandas has set logic for intersection and union, but nothing for disjoint/set difference.

Thanks!

score 9 · Accepted Answer

You can use the set.symmetric_difference function (the session below assumes `from pandas import DataFrame` has already been run):

In [1]: df1 = DataFrame(list('ABCDE'), columns=['x'])

In [2]: df1
Out[2]:
   x
0  A
1  B
2  C
3  D
4  E

In [3]: df2 = DataFrame(list('CDEF'), columns=['y'])

In [4]: df2
Out[4]:
   y
0  C
1  D
2  E
3  F

In [5]: set(df1.x).symmetric_difference(df2.y)
Out[5]: set(['A', 'B', 'F'])
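The same result can also be reached without leaving pandas. The following is a minimal sketch, assuming the two columns are wrapped in `pd.Index` objects so that `Index.symmetric_difference` and `Index.difference` can be used; the variable names `either_only`, `only_in_df1`, and `only_in_df2` are illustrative and not part of the original answer. Note that on Python 3 the plain `set` result above is displayed as `{...}` and has no guaranteed order, whereas these Index methods return sorted values by default.

```python
import pandas as pd

df1 = pd.DataFrame(list('ABCDE'), columns=['x'])
df2 = pd.DataFrame(list('CDEF'), columns=['y'])

# Values that appear in exactly one of the two columns (symmetric difference).
either_only = pd.Index(df1['x']).symmetric_difference(pd.Index(df2['y']))

# One-sided differences: values present in one column but not the other.
only_in_df1 = pd.Index(df1['x']).difference(pd.Index(df2['y']))
only_in_df2 = pd.Index(df2['y']).difference(pd.Index(df1['x']))

print(list(either_only))   # ['A', 'B', 'F']
print(list(only_in_df1))   # ['A', 'B']
print(list(only_in_df2))   # ['F']
```

The one-sided variants correspond to SQL's EXCEPT (or MINUS in some dialects), while the symmetric version matches the `[A, B, F]` result asked for in the question.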

score 0 · Answer

Here is a solution for multiple columns. It is probably not very efficient; I'd welcome any feedback on making it faster.

import pandas as pd

# Named input_df/limit_df so the built-in input() is not shadowed.
input_df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit_df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})


def set_difference(input_set, limit_on_set):
    """Return the rows of input_set whose (A, B) pair is absent from limit_on_set."""
    limit_on_set_sub = limit_on_set[['A', 'B']]
    # A set of tuples gives constant-time membership tests.
    limit_on_tuples = {tuple(x) for x in limit_on_set_sub.values}

    # True for each row whose (A, B) pair also occurs in limit_on_set.
    entries_in_limit = input_set.apply(lambda row:
        (row['A'], row['B']) in limit_on_tuples, axis=1)

    return input_set[~entries_in_limit]

>>> set_difference(input_df, limit_df)

   A  B
1  2  a
3  3  a
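Since the answer asks for ways to speed this up, one commonly used alternative is an anti-join built on `DataFrame.merge` with `indicator=True`, which avoids the row-by-row `apply`. The following is only a sketch on the same sample data; the helper name `anti_join` and its `on` parameter are hypothetical, and it assumes neither frame already has a column called `_merge`.

```python
import pandas as pd

input_df = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit_df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` values do not occur in `right`."""
    # De-duplicate the right-hand keys so the merge cannot multiply left rows.
    keys = right[on].drop_duplicates()
    # indicator=True adds a '_merge' column: 'both' or 'left_only' per row.
    merged = left.merge(keys, on=on, how='left', indicator=True)
    # A left merge keeps the left row order, so the mask lines up by position.
    mask = (merged['_merge'] == 'left_only').to_numpy()
    return left[mask]


result = anti_join(input_df, limit_df, on=['A', 'B'])
print(result)
```

Applying the mask to the original frame, rather than returning the merged frame, preserves the original index labels (1 and 3 here). For a symmetric difference across both frames, the same idea should work with `how='outer'` and keeping rows where `_merge` is not `'both'`.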
