I have a DataFrame data laid out like this:

Observation     A_1    A_2    A_3    B_1    B_2    B_3
Obs1            yes    no     yes    no     no     no
Obs2            no     no     no     yes    yes    yes
Obs3            yes    yes    yes    yes    yes    yes

The goal: calculate the frequency of all observations marked "yes" that are:

  • only in "A" samples
  • only in "B" samples
  • in both groups

EDIT: This means that, for the first two counts, I need to exclude the observations that contain "yes" in both the A and B groups (see the third row, Obs3).

I thought about using groupby:

grouper = data.groupby(lambda x: x.split("_")[0], axis=1)
grouped = grouper.agg(lambda x: sum(x == "yes"))

But this gives me counts per row, which is not what I want.

What would be the best course of action here?

EDIT: As requested, more information on the output. I'd like something like

Frequency of valid [meaning "yes"] observations in group A: X
Frequency of valid observations in group "B": Y
Frequency for all valid observations: Z

Where X, Y, and Z are the counts returned.

I don't care about this output for the individual observations. I'm interested in the values across all of them.

2 Answers

In [129]: a = ['A_1', 'A_2', 'A_3']

In [130]: b = ['B_1', 'B_2', 'B_3']

In [131]: ina = (df[a] == 'yes').any(axis=1)

In [132]: inb = (df[b] == 'yes').any(axis=1)

In [133]: ina & ~inb
Out[133]:
Observation
Obs1            True
Obs2           False
Obs3           False
dtype: bool

In [134]: ~ina & inb
Out[134]:
Observation
Obs1           False
Obs2            True
Obs3           False
dtype: bool

In [135]: ina & inb
Out[135]:
Observation
Obs1           False
Obs2           False
Obs3            True
dtype: bool

You can get the counts with value_counts: (ina & inb).value_counts()[True]
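
For reference, here is a minimal, self-contained sketch of the same idea that produces the three numbers the question asks for. The frame is rebuilt from the sample in the question, the names a_cols, b_cols, only_a, only_b and both are illustrative rather than anything from the original post, and it assumes that "frequency" means the number of observations (rows), not the number of individual "yes" cells. It uses .sum() on the boolean masks instead of value_counts()[True], since the latter can fail when no row is True.

```python
import pandas as pd

# Sample data copied from the question; "Observation" is used as the index.
df = pd.DataFrame(
    {
        "A_1": ["yes", "no", "yes"],
        "A_2": ["no", "no", "yes"],
        "A_3": ["yes", "no", "yes"],
        "B_1": ["no", "yes", "yes"],
        "B_2": ["no", "yes", "yes"],
        "B_3": ["no", "yes", "yes"],
    },
    index=pd.Index(["Obs1", "Obs2", "Obs3"], name="Observation"),
)

# Pick the columns of each group by prefix instead of listing them by hand.
a_cols = [c for c in df.columns if c.startswith("A_")]
b_cols = [c for c in df.columns if c.startswith("B_")]

# True for each row that has at least one "yes" in the group.
in_a = (df[a_cols] == "yes").any(axis=1)
in_b = (df[b_cols] == "yes").any(axis=1)

# Summing a boolean Series counts its True values; it gives 0 when there are none.
only_a = int((in_a & ~in_b).sum())
only_b = int((~in_a & in_b).sum())
both = int((in_a & in_b).sum())

print(f"Observations with 'yes' only in group A: {only_a}")
print(f"Observations with 'yes' only in group B: {only_b}")
print(f"Observations with 'yes' in both groups: {both}")
```

On the three-row sample each of the three counts is 1.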

answered 2013-05-15T09:57:50.533

It's still not clear to me whether a row like yes no yes no no no should count as 1 or 2. The closest thing to what you've asked for so far would be something like this:

>>> df
             A_1  A_2  A_3  B_1  B_2  B_3
Observation                              
Obs1         yes   no  yes   no   no   no
Obs2          no   no   no  yes  yes  yes
Obs3         yes  yes  yes  yes  yes  yes
Obs4         yes  yes   no   no   no   no
>>> y = (df == "yes").groupby(lambda x: x.split("_")[0], axis=1).sum()
>>> y
             A  B
Observation      
Obs1         2  0
Obs2         0  3
Obs3         3  3
Obs4         2  0
>>> which = y.apply(lambda x: tuple(x.index[x > 0]), axis=1)
>>> which
Observation
Obs1             (A,)
Obs2             (B,)
Obs3           (A, B)
Obs4             (A,)
dtype: object
>>> y.groupby(which).sum()
        A  B
(A,)    4  0
(A, B)  3  3
(B,)    0  3

Or maybe simply

>>> which.value_counts()
(A,)      2
(A, B)    1
(B,)      1
dtype: int64

depending on your goal.
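
Newer pandas releases deprecate groupby(..., axis=1), so the session above may emit a warning or stop working depending on your version. Below is a rough sketch of the same two results that groups the transposed frame instead. The names is_yes, per_group, present, label, row_counts and cell_totals are illustrative, and the groups are labelled with strings such as "A+B" rather than tuples, which is a substitution on my part to keep the example simple.

```python
import pandas as pd

# Four-row sample from the session above.
df = pd.DataFrame(
    {
        "A_1": ["yes", "no", "yes", "yes"],
        "A_2": ["no", "no", "yes", "yes"],
        "A_3": ["yes", "no", "yes", "no"],
        "B_1": ["no", "yes", "yes", "no"],
        "B_2": ["no", "yes", "yes", "no"],
        "B_3": ["no", "yes", "yes", "no"],
    },
    index=pd.Index(["Obs1", "Obs2", "Obs3", "Obs4"], name="Observation"),
)

is_yes = df == "yes"

# Transpose so the column labels become the index, group them by prefix,
# sum the booleans, and transpose back: one "yes" count per row and group.
per_group = is_yes.T.groupby(lambda name: name.split("_")[0]).sum().T

# Label each row by the groups in which it has at least one "yes".
present = per_group > 0
label = pd.Series(
    ["+".join(present.columns[row]) for row in present.to_numpy()],
    index=present.index,
)

# Number of observations per label (each row counted once).
row_counts = label.value_counts()

# Number of individual "yes" cells per label.
cell_totals = per_group.groupby(label).sum()

print(row_counts)
print(cell_totals)
```

row_counts answers the question if each observation counts once; cell_totals is the one to use if every "yes" cell should count.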

answered 2013-05-15T10:47:28.040