python - What is the equivalent of cut/qcut for pandas date fields?

Question

Update: starting with version 0.20.0, pandas cut/qcut DOES handle date fields. See What's New for more.

pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH14714, GH14798)
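
For readers on pandas 0.20.0 or later, here is a minimal sketch of what that looks like. The sample dates, the variable names, and the choice of quarter-start bin edges built with pd.date_range are all assumptions made for illustration, not part of the release note:

```python
import pandas as pd

# Made-up sample dates, roughly in the range used in the question
dates = pd.Series(pd.to_datetime(
    ["2012-07-25", "2012-11-30", "2013-03-13", "2013-04-18"]))

# Explicit datetime bin edges: quarter starts from 2012-07-01 to 2013-07-01
edges = pd.date_range("2012-07-01", "2013-07-01", freq="QS")

# right=False makes each bin closed on the left: [quarter start, next quarter start)
by_quarter = pd.cut(dates, bins=edges, right=False)

# qcut picks the edges itself so that each bin holds about the same number of rows
halves = pd.qcut(dates, q=2)

print(by_quarter.value_counts(sort=False))
print(halves.value_counts(sort=False))
```

Both calls return a categorical Series, so the result can be passed to groupby() in the same way as the numeric cut/qcut results below.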

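Original question: Pandas' cut and qcut functions are great for 'bucketing' continuous data for use in pivot tables and so forth, but I can't see an easy way to get datetime axes into the mix. Frustrating, since pandas is so great at all the time-related stuff!

Here's a simple example: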

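import numpy as np
import pandas as pd

def randomDates(size, start=134e7, end=137e7):
    # start/end are seconds since the epoch; cast to int because randint expects integer bounds
    return np.array(np.random.randint(int(start), int(end), size), dtype='datetime64[s]')

df = pd.DataFrame({'ship' : randomDates(10), 'recd' : randomDates(10),
                   'qty' : np.random.randint(0,10,10), 'price' : 100*np.random.random(10)})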
df

     price      qty recd                ship
0    14.723510   3  2012-11-30 19:32:27 2013-03-08 23:10:12
1    53.535143   2  2012-07-25 14:26:45 2012-10-01 11:06:39
2    85.278743   7  2012-12-07 22:24:20 2013-02-26 10:23:20
3    35.940935   8  2013-04-18 13:49:43 2013-03-29 21:19:26
4    54.218896   8  2013-01-03 09:00:15 2012-08-08 12:50:41
5    61.404931   9  2013-02-10 19:36:54 2013-02-23 13:14:42
6    28.917693   1  2012-12-13 02:56:40 2012-09-08 21:14:45
7    88.440408   8  2013-04-04 22:54:55 2012-07-31 18:11:35
8    77.329931   7  2012-11-23 00:49:26 2012-12-09 19:27:40
9    46.540859   5  2013-03-13 11:37:59 2013-03-17 20:09:09

To bin by groups of price or quantity, I can use cut/qcut to bucket them:

df.groupby([pd.cut(df['qty'], bins=[0,1,5,10]), pd.qcut(df['price'],q=3)]).count()

                       price  qty recd ship
qty     price               
(0, 1]  [14.724, 46.541]   1   1   1   1
(1, 5]  [14.724, 46.541]   2   2   2   2
        (46.541, 61.405]   1   1   1   1
(5, 10] [14.724, 46.541]   1   1   1   1
        (46.541, 61.405]   2   2   2   2
         (61.405, 88.44]   3   3   3   3

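But I can't see any easy way of doing the same thing with my 'recd' or 'ship' date fields. For example, I'd like to generate a similar table of counts broken down by (say) monthly buckets of recd and ship. It seems like resample() has all of the machinery to bucket into periods, but I can't figure out how to apply it here. The buckets (or levels) in the 'date cut' would be equivalent to a pandas.PeriodIndex, and then I want to label each value of df['recd'] with the period it falls into?

So the kind of output I'm looking for would be something like: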

ship    recv     count
2011-01 2011-01  1
        2011-02  3
        ...      ...
2011-02 2011-01  2
        2011-02  6
...     ...      ...

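More generally, I'd like to be able to mix and match continuous or categorical variables in the output. Imagine df also contains a 'status' column with red/yellow/green values; then maybe I want to summarize counts by status, price bucket, ship and recd buckets, so: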

ship    recv     price   status count
2011-01 2011-01  [0-10)   green     1
                            red     4
                 [10-20) yellow     2
                  ...      ...    ...
        2011-02  [0-10)  yellow     3
        ...      ...       ...    ...

As a bonus question, what's the simplest way to modify the groupby() result above to contain just a single output column called 'count'?

score 5 · Accepted Answer

Here's a solution using pandas.PeriodIndex (caveat: PeriodIndex doesn't seem to support time rules with a multiple > 1, such as '4M'). I think the answer to your bonus question is .size().

In [49]: df.groupby([pd.PeriodIndex(df.recd, freq='Q'),
   ....:             pd.PeriodIndex(df.ship, freq='Q'),
   ....:             pd.cut(df['qty'], bins=[0,5,10]),
   ....:             pd.qcut(df['price'],q=2),
   ....:            ]).size()
Out[49]: 
                qty      price 
2012Q2  2013Q1  (0, 5]   [2, 5]    1
2012Q3  2013Q1  (5, 10]  [2, 5]    1
2012Q4  2012Q3  (5, 10]  [2, 5]    1
        2013Q1  (0, 5]   [2, 5]    1
                (5, 10]  [2, 5]    1
2013Q1  2012Q3  (0, 5]   (5, 8]    1
        2013Q1  (5, 10]  (5, 8]    2
2013Q2  2012Q4  (0, 5]   (5, 8]    1
        2013Q2  (0, 5]   [2, 5]    1
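
The same idea can be sketched with the .dt.to_period() accessor, which should be equivalent to wrapping the column in a PeriodIndex. This is only an illustration: the four rows of data are made up, and using reset_index(name="count") to get a single 'count' column is my assumption about what the bonus question is after:

```python
import pandas as pd

# Made-up rows standing in for the question's 'recd' and 'ship' columns
df = pd.DataFrame({
    "recd": pd.to_datetime(["2012-11-30", "2012-11-02", "2012-12-07", "2013-01-03"]),
    "ship": pd.to_datetime(["2013-03-08", "2013-03-20", "2013-02-26", "2012-08-08"]),
})

# Label each date with the month it falls into, then count rows per (ship, recd) pair
counts = (
    df.groupby([df["ship"].dt.to_period("M"), df["recd"].dt.to_period("M")])
      .size()                      # one number per group instead of one per column
      .reset_index(name="count")   # flatten to columns: ship, recd, count
)
print(counts)
```

Further keys such as pd.cut(df['price'], ...) or a categorical 'status' column can be appended to the same list passed to groupby().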

score 4 · Accepted Answer

You just need to set the index to the field you'd like to resample by. Here are some examples:

In [36]: df.set_index('recd').resample('1M',how='sum')
Out[36]: 
                 price  qty
recd                       
2012-07-31   64.151194    9
2012-08-31   93.476665    7
2012-09-30   94.193027    7
2012-10-31         NaN  NaN
2012-11-30         NaN  NaN
2012-12-31   12.353405    6
2013-01-31         NaN  NaN
2013-02-28  129.586697    7
2013-03-31         NaN  NaN
2013-04-30         NaN  NaN
2013-05-31  211.979583   13

In [37]: df.set_index('recd').resample('1M',how='count')
Out[37]: 
2012-07-31  price    1
            qty      1
            ship     1
2012-08-31  price    1
            qty      1
            ship     1
2012-09-30  price    2
            qty      2
            ship     2
2012-10-31  price    0
            qty      0
            ship     0
2012-11-30  price    0
            qty      0
            ship     0
2012-12-31  price    1
            qty      1
            ship     1
2013-01-31  price    0
            qty      0
            ship     0
2013-02-28  price    2
            qty      2
            ship     2
2013-03-31  price    0
            qty      0
            ship     0
2013-04-30  price    0
            qty      0
            ship     0
2013-05-31  price    3
            qty      3
            ship     3
dtype: int64
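
A note for newer pandas versions: as far as I know, the how= argument used above was later deprecated and removed from resample(), and the aggregation is called as a method on the resampler instead. A minimal sketch with made-up data follows. The 'MS' (month start) alias is my choice here, so the labels are month starts rather than the month ends shown above:

```python
import pandas as pd

# Made-up data indexed by the 'recd' date, as set_index('recd') would produce
df = pd.DataFrame(
    {"qty": [3, 2, 7, 8], "price": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(["2012-07-25", "2012-07-30", "2012-09-08", "2012-10-01"]),
)
df.index.name = "recd"

# Method-style aggregation replaces resample(..., how='sum') / how='count'
monthly_sum = df.resample("MS").sum()
monthly_count = df.resample("MS").count()

print(monthly_sum)
print(monthly_count)
```

Months with no rows still appear in the result, which is the main difference from grouping on period labels.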

score 1 · Accepted Answer

I came up with an idea that relies on the underlying storage format of datetime64[ns]. If you define dcut() like this:

def dcut(dts, freq='d', right=True):
    hi = pd.Period(dts.max(), freq=freq) + 1   # get first period past end of data
    periods = pd.PeriodIndex(start=dts.min(), end=hi, freq=freq)
    # get a list of integer bin boundaries representing ns-since-epoch
    # note the extra period gives us the extra right-hand bin boundary we need
    bounds = np.array(periods.to_timestamp(how='start'), dtype='int')
    # bin our time field as integers
    cut = pd.cut(np.array(dts, dtype='int'), bins=bounds, right=right)
    # relabel the bins using the periods, omitting the extra one at the end
    cut.levels = periods[:-1].format()
    return cut

then I can do what I wanted:

df.groupby([dcut(df.recd, freq='m', right=False),dcut(df.ship, freq='m', right=False)]).count()

to get:

                price qty recd ship
2012-07 2012-10   1    1    1    1
2012-11 2012-12   1    1    1    1
        2013-03   1    1    1    1  
2012-12 2012-09   1    1    1    1
        2013-02   1    1    1    1  
2013-01 2012-08   1    1    1    1
2013-02 2013-02   1    1    1    1
2013-03 2013-03   1    1    1    1
2013-04 2012-07   1    1    1    1
        2013-03   1    1    1    1
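
On pandas 0.20.0 or later, where pd.cut accepts datetime values directly, the integer round-trip should no longer be needed. Here is a rough sketch of the same dcut() idea under that assumption. It uses pd.period_range and the labels= argument of pd.cut in place of the PeriodIndex(start=..., end=...) constructor and the cut.levels assignment, and the sample dates are made up:

```python
import pandas as pd

def dcut(dts, freq="M", right=False):
    """Sketch: bin a datetime Series into calendar periods, labeled by period."""
    first = dts.min().to_period(freq)
    last = dts.max().to_period(freq) + 1          # first period past the end of the data
    periods = pd.period_range(first, last, freq=freq)
    # the extra period supplies the right-hand edge of the last bin
    edges = periods.to_timestamp(how="start")
    labels = [str(p) for p in periods[:-1]]       # omit the extra period when labeling
    return pd.cut(dts, bins=edges, right=right, labels=labels)

recd = pd.Series(pd.to_datetime(["2012-07-25", "2012-07-30", "2012-09-08"]))
binned = dcut(recd, freq="M")
print(binned)
```

Unlike the to_period() approach, the result is a categorical that also lists months with no data (2012-08 in this sample).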

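I guess you could similarly define dqcut(), which first 'rounds' each datetime value to the integer representing the start of its containing period (at your specified frequency), and then uses qcut() to choose amongst those boundaries. Or do qcut() first on the raw integer values and then round the resulting bins based on your chosen frequency?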
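
A rough sketch of the first variant, assuming a pandas version where qcut accepts datetime values (0.20.0 or later). dqcut() is a hypothetical helper, not a pandas function, and the sample dates are made up:

```python
import pandas as pd

def dqcut(dts, q, freq="M"):
    """Hypothetical helper: snap each date to its period start, then qcut."""
    rounded = dts.dt.to_period(freq).dt.start_time
    # duplicates='drop' merges bins whose edges coincide after rounding
    return pd.qcut(rounded, q=q, duplicates="drop")

ship = pd.Series(pd.to_datetime(
    ["2012-07-25", "2012-08-08", "2012-09-08", "2012-10-01"]))
halves = dqcut(ship, q=2)
print(halves.value_counts(sort=False))
```

Note that qcut interpolates its quantile edges, so the interior edges are not guaranteed to land exactly on period boundaries; only the values being binned are rounded.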

No joy on the bonus question yet? :)

score 0 · Accepted Answer

How about using a Series: put the parts of the DataFrame that you're interested in into it, then call cut on the Series object?

price_series = pd.Series(df.price.tolist(), index=df.recd)

and then

 pd.qcut(price_series, q=3)

and so on. (Though I think @Jeff's answer is best.)
