python - Pandas crosstab: changing the order of columns named as formatted dates (mmm yy)

Question

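I have been looking in vain for a way to order the columns of a pandas crosstab. Specifically, I need to order columns that are named as formatted dates (mmm yy) by their date value, not alphabetically by the three-letter month name (mmm).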

The details of my code are as follows.

Python 3.3

pandas 0.12.0

f_dtflt is a pandas DataFrame.

f_dtflt.COLLECTION_DATE has dtype datetime64[ns].

My crosstab statement is:

pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%b %y")), margins=True)

The output is:

COLLECTION_DATE    Apr 13  Aug 13  Dec 12  Feb 13  Jan 13  Jul 13  Jun 13 
EW_REGIONCOLLSITE                                                           
EAST                 1964    2092    2280    2272    2757    2113    1902   
WEST                 2579    2011    1003    2351    2216    1506    1823   
All                  4543    4103    3283    4623    4973    3619    3725   

COLLECTION_DATE    Mar 13  May 13  Nov 12  Oct 12  Sep 13    All  
EW_REGIONCOLLSITE                                                 
EAST                 1682    1981    2108     825     975  22951  
WEST                 2770    3014     407      42     888  20610  
All                  4452    4995    2515     867    1863  43561

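I want to sort the columns in ascending date order: Oct 12, Nov 12, ..., Jan 13, ..., Sep 13. I could format the dates as yy-mm (e.g. 13-01), but these labels will be used in a report, so I would rather not compromise.

I am new to python and pandas, so please connect the dots in your responses to help a beginner! Thank you very much.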

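Method 1

Edit in response to the first part of @Andy's answer. I am having trouble with step 3.

I tried to implement Andy's suggestion; here are more details on that effort.

1) I ran the following line to see what the dates look like. It produces values such as "2012-10" for the collection date (perhaps "prettified" by print?).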

print(pd.DatetimeIndex(f_dtflt['COLLECTION_DATE']).to_period('M'))

2) When the statement above is passed to crosstab, the month values change to numbers such as 513, 514, etc. (the actual underlying values of the field?)

table1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, pd.DatetimeIndex(f_dtflt['COLLECTION_DATE']).to_period('M'), margins=True)

The output is:

col_0              513   514   515   516   517   518   519   520   521   522
EW_REGIONCOLLSITE                                                              
EAST               825  2108  2280  2757  2272  1682  1964  1981  1902  2113   
WEST                42   407  1003  2216  2351  2770  2579  3014  1823  1506   
All                867  2515  3283  4973  4623  4452  4543  4995  3725  3619   

col_0               523   524    All  
EW_REGIONCOLLSITE                     
EAST               2092   975  22951  
WEST               2011   888  20610  
All                4103  1863  43561

3) Running the following code throws an error saying that the 'int' object has no attribute 'strftime'.

table1.columns = table1.columns.map(lambda x: x.strftime("%b %y"))

I have played around with this quite a bit. Here are some of my notes:

# This runs and creates an array of strings: '513' etc.
pd.to_datetime(table1.columns.map(str), unit='M')

# The last entry in table1.columns is "All" and needs to be removed.  Hence [:-1] slice.
# This also runs but seems to give years in 1630's.
pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M')

# This does not run because it says object is immutable
table1.columns[:-1]=pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M')

# This also runs but the output is weird.  It seems to give an array of both dates and -1
table1.columns.reindex(pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M'))

# Does not run:  DatetimeIndex() must be called with a collection of some kind, '513' was passed
table1.columns = table1.columns.map(lambda x: pd.DatetimeIndex(str(x)).strftime("%b %y"))

# Does not run:  DatetimeIndex object is not callable
table1.rename(columns=pd.DatetimeIndex(table1.columns[:-1].map(str)).to_datetime('M'))

4) This does work for labelling the columns of the crosstab:

table1.columns.name = 'COLLECTION_DATE'
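The integer labels in step 2 look like month ordinals counted from January 1970: 513 would be October 2012, which matches the Oct 12 counts (825 / 42 / 867) in the first table. A minimal sketch under that assumption, using only the standard library; ordinal_to_label is a hypothetical helper, not a pandas function, and the month abbreviations assume an English (or C) locale:

```python
from datetime import date


def ordinal_to_label(ordinal):
    """Turn a month ordinal into a 'Mon yy' label.

    Assumption: the integer counts months since January 1970 (ordinal 0).
    """
    years, month_index = divmod(ordinal, 12)
    return date(1970 + years, month_index + 1, 1).strftime("%b %y")


# A shortened stand-in for table1.columns, including the margins column.
columns = [513, 514, 515, 524, "All"]

# Leave the "All" margins label alone and convert everything else.
labels = [c if c == "All" else ordinal_to_label(c) for c in columns]
print(labels)
```

Because the integers are already in ascending date order, assigning the converted labels back to the columns would keep the chronological order that crosstab produced.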

Method 2

@Andy made a second suggestion, and I played around with it but could not get it to work. Much of the problem is my unfamiliarity with python, pandas, and numpy. I took notes as I tried to sort it out. Here are my notes:

# Working with a new concept
# This creates column titles of 12 10, 12 11, etc.
table1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%y %m")), margins=True)

# This throws an error that yb is not defined
table1.columns.map(lambda yb: "%s %s" % (y, b) for y, b in yb.split())

# Tried to simplify and see what happens.  Runs and creates an array of lists such as [['12', '10'], ['12', '11']...]
table1.columns.map(lambda x: x.split())

# Trying a different approach.  This creates a numpy array of datetimes.
tempholder=table1.columns[:-1].map(lambda x: datetime.datetime(year=int(x[0:2]), month=int(x[3:]), day=1))

# Noted that f_dtflt['COLLECTION_DATE'] was a dtype of datetime64[ns] but tempholder was dtype object. So had issue.
# Convert to datetime64
# Get error:  Out of bounds nanosecond timestamp: 12-10-01 00:00:00
tempholder=pd.to_datetime(tempholder)

# Tempholder is an array of datetimes from the datetime module.  I used the pandas date function above.  
# Need to change that and use python datetime module function.
# Does not work: 'numpy.ndarray' object has no attribute 'apply'...
# this is a pandas function which does not work on a numpy array.
tempholder.apply(lambda x: x.strftime('%b %y'))

# This works for numpy array but I can't tell what it contains.  
# print(tempholder) gives <map object at 0x0000000026C04F28>
# tempholder gives Out[169]: <builtins.map at 0x26c04f28>
tempholder=map(lambda x: x.strftime('%b %y'), tempholder)
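Two of the notes above have a plain-Python explanation. datetime.datetime(year=int(x[0:2]), ...) builds a date in the year 12, not 2012, which is why the conversion to a nanosecond timestamp is out of bounds. And in Python 3, map() returns a lazy iterator, so it has to be wrapped in list() before its contents can be inspected. A minimal sketch with a shortened, made-up column list, assuming an English (or C) locale for the month abbreviations:

```python
from datetime import datetime

# A shortened stand-in for table1.columns ("yy mm" labels plus the margins column).
columns = ["12 10", "12 11", "13 01", "All"]

# strptime with %y reads "12" as 2012, unlike datetime(year=12, ...).
parsed = [datetime.strptime(c, "%y %m") for c in columns[:-1]]

# list() forces the lazy map object so the values become visible.
labels = list(map(lambda d: d.strftime("%b %y"), parsed))

# Put the margins label back at the end.
labels.append(columns[-1])
print(labels)
```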

score 1 · Accepted Answer

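I approached this problem from a slightly different angle and wrote a function that can be used as a general way to order columns in a pandas crosstab. It may also work for pivot tables, but I have not tested that or looked into the details. I think it could also be used to order row labels, but I have not tried it.

This creates a crosstab with column labels such as "12 10_Oct 12" and "12 11_Nov 12". The label effectively forces the alphabetical order of the crosstab: the part of the label used for alphabetical ordering is concatenated with "_" and then the label I actually want to use.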

table_1=pd.crosstab(f_dtflt.EW_REGIONCOLLSITE, f_dtflt.COLLECTION_DATE.apply(lambda x: x.strftime("%y %m_%b %y")), margins=True)

Output:

COLLECTION_DATE    12 10_Oct 12  12 11_Nov 12  12 12_Dec 12  13 01_Jan 13
EW_REGIONCOLLSITE                                                           
EAST                        825          2108          2280          2757   
WEST                         42           407          1003          2216   
All                         867          2515          3283          4973   

COLLECTION_DATE    13 02_Feb 13  13 03_Mar 13  13 04_Apr 13  13 05_May 13  
EW_REGIONCOLLSITE                                                           
EAST                       2272          1682          1964          1981   
WEST                       2351          2770          2579          3014   
All                        4623          4452          4543          4995   

COLLECTION_DATE    13 06_Jun 13  13 07_Jul 13  13 08_Aug 13  13 09_Sep 13  
EW_REGIONCOLLSITE                                                           
EAST                       1902          2113          2092           975   
WEST                       1823          1506          2011           888   
All                        3725          3619          4103          1863   

COLLECTION_DATE      All  
EW_REGIONCOLLSITE         
EAST               22951  
WEST               20610  
All                43561

Function and call:

def clean_label(label_list, margins=False):
    ''' This function takes the column index list from a crosstab (or pivot table?) in pandas and removes the 
    part of the label before and including the "_".  This allows the user to order the columns manually by creating
    an alphabetical index followed by "_" and then the label that they would like to use.  For example, a label such as
    ['a_Positive', 'b_Negative'] will be converted to ['Positive', 'Negative'].  Another example would be to order dates
    in a table from ['12 10_Oct 12', '12 11_Nov 12'] to ['Oct 12', 'Nov 12']

    margins = False if the crosstab was created without margins and therefore does not have an "All" at the end of the list
    margins = True if the crosstab was created with margins and therefore has an "All" at the end of the list
    '''
    corrected_list=list()

    # If one creates margins in pivot/crosstab, will get the last column of "All"
    # This has to be removed from the following code or it will throw an error.
    if margins:
        convert_list = label_list[:-1]
    else:
        convert_list = label_list

    for l in convert_list:
        x,y=l.split('_')
        corrected_list.append(y)

    if margins:
        corrected_list.append('Total')  # Renames "All" to "Total"

    return corrected_list  

# Change the labels on the crosstab table
table_1.columns=clean_label(table_1.columns, margins=True)

# Change name of columns
table_1.columns.name = 'Month of Collection'

# Change name of rows
table_1.index.name = 'Region'

Output (final table):

Month of Collection  Oct 12  Nov 12  Dec 12  Jan 13  Feb 13  Mar 13  Apr 13
Region                                                                        
EAST                    825    2108    2280    2757    2272    1682    1964   
WEST                     42     407    1003    2216    2351    2770    2579   
All                     867    2515    3283    4973    4623    4452    4543   

Month of Collection  May 13  Jun 13  Jul 13  Aug 13  Sep 13  Total  
Region                                                              
EAST                   1981    1902    2113    2092     975  22951  
WEST                   3014    1823    1506    2011     888  20610  
All                    4995    3725    3619    4103    1863  43561
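The same sort-prefix idea can be tried end to end on a small table. The sketch below uses made-up data in place of f_dtflt and the Series .dt.strftime accessor, which is available in more recent pandas releases than the 0.12.0 used in the question; it assumes an English (or C) locale for the month abbreviations. It keeps "All" rather than renaming it to "Total".

```python
import pandas as pd

# Toy data standing in for f_dtflt (hypothetical values).
df = pd.DataFrame({
    "EW_REGIONCOLLSITE": ["EAST", "WEST", "EAST", "WEST", "EAST", "WEST"],
    "COLLECTION_DATE": pd.to_datetime([
        "2012-10-03", "2012-11-15", "2013-01-07",
        "2013-04-21", "2013-04-02", "2012-10-30",
    ]),
})

# "yy mm" sorts chronologically as text; the part after "_" is the display label.
keys = df["COLLECTION_DATE"].dt.strftime("%y %m_%b %y")
table = pd.crosstab(df["EW_REGIONCOLLSITE"], keys, margins=True)

# Strip the sort prefix. "All" contains no "_", so it passes through unchanged.
table.columns = [label.split("_", 1)[-1] for label in table.columns]
table.columns.name = "Month of Collection"
table.index.name = "Region"
print(table)
```

Splitting with a maximum of one split means a display label that itself contains "_" would survive intact, which the two-value unpacking in clean_label would reject.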

score 0 · Accepted Answer

If you produce the year and month as a string (and it comes out in the correct order), you can then reverse it:

In [1]: df = pd.DataFrame([['a', 'b']], columns=['12 Mar', '12 Jun'])

In [2]: df.columns.map(lambda yb: ' '.join(reversed(yb.split())))
Out[2]: array(['Mar 12', 'Jun 12'], dtype=object)

In [3]: df.columns = df.columns.map(lambda yb: ' '.join(reversed(yb.split())))
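The reversal only helps if the columns were already in date order before relabelling. When the table already carries "Mon yy" labels, another option is to sort the labels with a date-valued key and reorder the columns by that list. A minimal sketch using only the standard library; month_key is a hypothetical helper, the label list is a shortened example, and the month abbreviations assume an English (or C) locale:

```python
from datetime import datetime

# A shortened stand-in for the alphabetically ordered crosstab columns.
labels = ["Apr 13", "Aug 13", "Dec 12", "Feb 13", "Jan 13", "Oct 12", "All"]


def month_key(label):
    """Sort key: real months by date, the "All" margins column last."""
    if label == "All":
        return (1, datetime.max)
    return (0, datetime.strptime(label, "%b %y"))


ordered = sorted(labels, key=month_key)
print(ordered)

# Hypothetical follow-up on the real table: table = table[ordered]
```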

I had suggested that you could do this with periods:

pd.DatetimeIndex(f_dtflt['COLLECTION_DATE']).to_period('M')

Then, once the columns are cleaned up into the format you need, do the following:

df.columns = df.columns.map(lambda x: x.strftime("%b %y"))
df.columns.name = 'COLLECTION_DATE'

However, this seems to change the period index to ints (possibly a bug?).
