python - How do I group a time series by one or more dimensions in Pandas?

Question

I have data that looks like this:

timestamp, country_code,  request_type,   latency
2013-10-10-13:40:01,  1,    get_account,    134
2013-10-10-13:40:63,  34,   get_account,    256
2013-10-10-13:41:09,  230,  modify_account, 589
2013-10-10-13:41:12,  230,  get_account,    43
2013-10-10-13:53:12,  1,    modify_account, 1003

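The timestamps have one-second resolution and are irregularly spaced.

How do I express queries like these in pandas:

Number of requests per country_code at 10-minute resolution?
99th-percentile latency per request_type at 1-minute resolution?
Number of requests per country_code and request_type at 10-minute resolution?

And then plot all the groups on the same graph, each as its own line over time?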

Update:

Based on the suggestions in the first answer, I have:

bycc = df.groupby('country_code').reason.resample('10T', how='count')
bycc.plot() # BAD: uses (country_code, timestamp) on the x axis
bycc[1].plot() # properly graphs the time-series for country_code=1

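However, I can't seem to find an easy way to graph each country_code as a separate line, with proper timestamps on the x axis and the values on the y axis. I think there are two problems: (1) the timestamps are not the same for each country_code, so they need to be aligned to the same start and end, and (2) I need to find the right API/method to go from a multi-indexed TimeSeries object to a single plot with one line for each first-level value of the multi-index. Working my way through it...

Update 2

The following seems to do it: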

i = 0
max = 3
pylab.rcParams['figure.figsize'] = (20.0, 10.0) # get bigger graph
for cc in bycc.index.levels[0]:
    i = i + 1
    if (i <= max):
        cclabel = "cc=%d" % (cc)
        bycc[cc].plot(legend=True, label=cclabel)

It only plots the first max series, because the graph gets noisy otherwise. Next I need to figure out how to better display a plot with many time series.

score 6 · Accepted Answer

Note: pandas can't parse the datetime string "2013-10-10-13:40:63", because 63 is not a valid seconds value (that is, dateutil can't parse it, and pandas uses dateutil to parse dates). For the sake of exposition I changed it to "2013-10-10-13:40:59".
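
As a small illustration of the parsing problem, here is a sketch that parses the question's timestamp format with an explicit format string. The format string and the use of `errors="coerce"` are assumptions added here for illustration; the answer itself simply edited the bad value by hand.

```python
import pandas as pd

# Two timestamps from the question; the second has an invalid seconds field.
raw = ["2013-10-10-13:40:01", "2013-10-10-13:40:63"]

# The explicit format matches the "date-time" layout used in the data.
# errors="coerce" (an illustrative choice) turns unparseable values into NaT
# instead of raising, which makes the bad rows easy to find.
parsed = pd.to_datetime(raw, format="%Y-%m-%d-%H:%M:%S", errors="coerce")

print(parsed)
```

Rows that come back as NaT can then be inspected and corrected or dropped before building the time index.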

1. Number of requests per `country_code` at 10-minute resolution:

In [83]: df
Out[83]:
                     country_code    request_type  latency
timestamp
2013-10-10 13:40:01             1     get_account      134
2013-10-10 13:40:59            34     get_account      256
2013-10-10 13:41:09           230  modify_account      589
2013-10-10 13:41:12           230     get_account       43
2013-10-10 13:53:12             1  modify_account     1003

In [100]: df.groupby('country_code').request_type.resample('10T', how='count')
Out[100]:
country_code  timestamp
1             2013-10-10 13:40:00    1
              2013-10-10 13:50:00    1
34            2013-10-10 13:40:00    1
230           2013-10-10 13:40:00    2
dtype: int64
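
The `how=` keyword shown in the transcript above is no longer accepted by recent pandas releases, and the `'T'` alias is deprecated in newer versions in favor of `'min'`. A minimal sketch of one way to get the same counts on a current pandas, assuming the frame is rebuilt by hand from the table above and using `pd.Grouper` for the time bins (a substitution for the original call, not the answerer's code):

```python
import pandas as pd

# Sample data from the answer, indexed by timestamp.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.to_datetime(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"]
    ).rename("timestamp"),
)

# Group by country and by 10-minute bins of the index, then count rows.
counts = df.groupby(["country_code", pd.Grouper(freq="10min")]).size()

print(counts)
```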

2. 99th-percentile `latency` per `request_type` at 1-minute resolution

You can take a very similar approach here as well.

In [107]: df.groupby('request_type').latency.resample('T', how=lambda x: x.quantile(0.99))
Out[107]:
request_type    timestamp
get_account     2013-10-10 13:40:00     254.78
                2013-10-10 13:41:00      43.00
modify_account  2013-10-10 13:41:00     589.00
                2013-10-10 13:42:00        NaN
                2013-10-10 13:43:00        NaN
                2013-10-10 13:44:00        NaN
                2013-10-10 13:45:00        NaN
                2013-10-10 13:46:00        NaN
                2013-10-10 13:47:00        NaN
                2013-10-10 13:48:00        NaN
                2013-10-10 13:49:00        NaN
                2013-10-10 13:50:00        NaN
                2013-10-10 13:51:00        NaN
                2013-10-10 13:52:00        NaN
                2013-10-10 13:53:00    1003.00
dtype: float64
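
For readers on a current pandas, where `how=` is no longer available, the same percentile can be sketched with `pd.Grouper` and `quantile`. The frame construction below is an assumption that mirrors the sample table; it is one possible equivalent, not the original call.

```python
import pandas as pd

# Sample data from the answer, indexed by timestamp.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.to_datetime(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"]
    ).rename("timestamp"),
)

# Group by request type and by 1-minute bins of the index,
# then take the 0.99 quantile of latency within each group.
p99 = df.groupby(["request_type", pd.Grouper(freq="1min")])["latency"].quantile(0.99)

print(p99)
```

As far as I can tell, this form only reports bins that actually contain data, so the run of NaN rows for the empty minutes in the transcript above should not appear.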

3. Number of requests per `country_code` and `request_type` at 10-minute resolution

This is basically the same as #1, except that an additional group is added to the call to DataFrame.groupby.

In [108]: df.groupby(['country_code', 'request_type']).request_type.resample('10T', how='count')
Out[108]:
country_code  request_type    timestamp
1             get_account     2013-10-10 13:40:00    1
              modify_account  2013-10-10 13:50:00    1
34            get_account     2013-10-10 13:40:00    1
230           get_account     2013-10-10 13:40:00    1
              modify_account  2013-10-10 13:40:00    1
dtype: int64
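
On a current pandas the two-dimension count can be sketched the same way, by listing both columns alongside a `pd.Grouper` for the time bins. As before, the frame is rebuilt by hand from the sample table, and `pd.Grouper` is a substitution for the original `resample(..., how='count')` call.

```python
import pandas as pd

# Sample data from the answer, indexed by timestamp.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.to_datetime(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"]
    ).rename("timestamp"),
)

# Group by country, request type, and 10-minute bins of the index; count rows.
counts = df.groupby(
    ["country_code", "request_type", pd.Grouper(freq="10min")]
).size()

print(counts)
```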

As far as plotting goes, it's not clear what you're asking for. Please elaborate.
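
One possible reading of the plotting request in the question is "one line per country_code on a shared time axis". Under that assumption, a sketch is to unstack the country level into columns, so every country shares the same timestamps, and then plot the resulting wide frame. The helper name `plot_per_country`, the zero fill for missing bins, and the axis labels are illustrative choices, not something from the answer.

```python
import pandas as pd

# Sample data from the answer, indexed by timestamp.
df = pd.DataFrame(
    {
        "country_code": [1, 34, 230, 230, 1],
        "request_type": ["get_account", "get_account", "modify_account",
                         "get_account", "modify_account"],
        "latency": [134, 256, 589, 43, 1003],
    },
    index=pd.to_datetime(
        ["2013-10-10 13:40:01", "2013-10-10 13:40:59", "2013-10-10 13:41:09",
         "2013-10-10 13:41:12", "2013-10-10 13:53:12"]
    ).rename("timestamp"),
)

# Requests per country per 10-minute bin, as a (country_code, timestamp) Series.
counts = df.groupby(["country_code", pd.Grouper(freq="10min")]).size()

# Move country_code into the columns so all countries share one time index;
# asfreq then inserts any 10-minute bins that no country had data for.
wide = counts.unstack("country_code", fill_value=0).asfreq("10min", fill_value=0)


def plot_per_country(frame):
    """Hypothetical helper: draw one line per column (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ax = frame.plot(figsize=(20, 10))
    ax.set_xlabel("timestamp")
    ax.set_ylabel("requests per 10 minutes")
    plt.show()


print(wide)
# plot_per_country(wide)  # uncomment to draw the chart
```

With many countries the chart gets crowded, so selecting a subset of columns first, for example `wide[[1, 230]]`, keeps it readable.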
