python - How can I apply a custom column order (Categorical) to a pandas boxplot?

Question

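EDIT: this question arose back in 2013 with pandas ~0.13 and was obsoleted by direct support for boxplot somewhere between versions 0.15 and 0.18 (as per @Cireo's late answer; pandas has also greatly improved its support for categoricals since this was asked).

I can get a boxplot of a salary column in a pandas DataFrame...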

train.boxplot(column='Salary', by='Category', sym='')

...but I can't figure out how to define the index order used on the 'Category' column. I want to supply my own custom order, according to another criterion:

# pandas 0.13 API: Series.order() was later replaced by Series.sort_values()
category_order_by_mean_salary = train.groupby('Category')['Salary'].mean().order().keys()

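How can I apply my custom column order to the boxplot columns? (Other than an ugly kludge of prefixing the column names to force the ordering.)

'Category' is a string column taking 27 distinct values: ['Accounting & Finance Jobs','Admin Jobs',...,'Travel Jobs']. (Really it should be a categorical, but this was back in 0.13, when categoricals were third-class citizens.) So it can easily be factorized with pd.Categorical.from_array()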

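On inspection, the limitation is inside pandas.tools.plotting.py:boxplot(), which converts the column object without allowing an ordering:

pandas.core.frame.py.boxplot() is a passthrough to
pandas.tools.plotting.py:boxplot(), which instantiates ...
matplotlib.pyplot.py:boxplot(), which instantiates ...
matplotlib.axes.py:boxplot()

I suppose I could either hack up a custom version of the pandas boxplot(), or reach into the internals of the object. I will also file an enhancement request.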

score 12 · Accepted Answer

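It is hard to say how to do this without a working example. My first guess would be to add an integer column holding the order you want.

A simple, brute-force way is to add the boxplots one at a time.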

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))
columns_my_order = ['C', 'A', 'D', 'B']
fig, ax = plt.subplots()
for position, column in enumerate(columns_my_order):
    ax.boxplot(df[column], positions=[position])

ax.set_xticks(range(position+1))
ax.set_xticklabels(columns_my_order)
ax.set_xlim(left=-0.5)  # the old xmin= keyword is not accepted by recent matplotlib
plt.show()
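
The same brute-force idea can also be applied to long-format data like the question's. The following is only a minimal sketch, assuming a hypothetical `train` frame with `Category` and `Salary` columns (the data is made up for illustration): it derives the order from the mean salary and then hands matplotlib one array per category, already in that order.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical long-format data standing in for the question's `train` frame.
rng = np.random.default_rng(0)
categories = ["Admin Jobs", "IT Jobs", "Travel Jobs"]
train = pd.DataFrame({
    "Category": np.repeat(categories, 20),
    "Salary": np.concatenate([
        rng.normal(40000, 2000, 20),  # Admin Jobs
        rng.normal(60000, 2000, 20),  # IT Jobs
        rng.normal(30000, 2000, 20),  # Travel Jobs
    ]),
})

# Custom order: categories sorted by mean salary, lowest first.
order = train.groupby("Category")["Salary"].mean().sort_values().index.tolist()

# One array of salaries per category, listed in the desired order.
groups = [train.loc[train["Category"] == c, "Salary"].to_numpy() for c in order]

fig, ax = plt.subplots()
positions = list(range(len(order)))
ax.boxplot(groups, positions=positions)
ax.set_xticks(positions)
ax.set_xticklabels(order)
```

Call plt.show() afterwards to display the figure. Because the labels are set from the same `order` list that built `groups`, boxes and labels cannot drift apart.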

score 6 · Accepted Answer

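EDIT: this is the right answer now that direct support has been added somewhere between versions 0.15 and 0.18.

tl;dr: for recent pandas, use the positions argument to boxplot.

I am adding this as a separate answer, although it could perhaps be a separate question. Feedback is appreciated.

I wanted to apply a custom column order within a groupby, which caused me many problems. In the end I had to avoid calling boxplot on the groupby object, and instead go through each subplot myself and provide explicit positions.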

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame()
df['GroupBy'] = ['g1', 'g2', 'g3', 'g4'] * 6
df['PlotBy'] = [chr(ord('A') + i) for i in range(24)]
df['SortBy'] = list(reversed(range(24)))
df['Data'] = [i * 10 for i in range(24)]

# Note that this has no effect on the boxplot
df = df.sort_values(['GroupBy', 'SortBy'])
for group, info in df.groupby('GroupBy'):
    print('Group: %r\n%s\n' % (group, info))

# With the below, cannot use
#  - sort data beforehand (not preserved, can't access in groupby)
#  - categorical (not all present in every chart)
#  - positional (different lengths and sort orders per group)
# df.groupby('GroupBy').boxplot(layout=(1, 5), column=['Data'], by=['PlotBy'])

fig, axes = plt.subplots(1, df.GroupBy.nunique(), sharey=True)
for ax, (g, d) in zip(axes, df.groupby('GroupBy')):
    d.boxplot(column=['Data'], by=['PlotBy'], ax=ax, positions=d.index.values)
plt.show()

In my final code, determining the positions was slightly more involved because I had multiple data points for each sort value, and I ended up having to do the following:

to_plot = data.sort_values([sort_col]).groupby(group_col)
for ax, (group, group_data) in zip(axes, to_plot):
    # Use existing sorting
    ordering = enumerate(group_data[sort_col].unique())
    positions = [ind for val, ind in sorted((v, i) for (i, v) in ordering)]
    ax = group_data.boxplot(column=[col], by=[plot_by], ax=ax, positions=positions)

score 3 · Accepted Answer

I actually got stuck on the same question. I solved it by creating a mapping and resetting the xticklabels, with code like the following:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A":["d","c","d","c",'d','c','a','c','a','c','a','c']})
df['val']=(np.random.rand(12))
df['B']=df['A'].replace({'d':'0','c':'1','a':'2'})
ax=df.boxplot(column='val',by='B')
ax.set_xticklabels(list('dca'))

score 2 · Accepted Answer

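Note that pandas can now create categorical columns. If you don't mind having all the columns present in your graph, or trimming them appropriately, you can do something like the following: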

http://pandas.pydata.org/pandas-docs/stable/categorical.html

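# astype('category', ordered=True) was removed in later pandas; pass a CategoricalDtype instead
df['Category'] = df['Category'].astype(pd.api.types.CategoricalDtype(ordered=True))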

Recent pandas also appears to allow positions to be passed all the way through from the frame to the axes.
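
To make the categorical route concrete, here is a minimal sketch, assuming a small made-up frame and a hand-picked order (both hypothetical). It relies on recent pandas grouping a categorical `by` column in category order rather than alphabetically, which is worth checking against your own pandas version.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data; alphabetical grouping would give high, low, mid.
df = pd.DataFrame({
    "Category": ["low", "high", "mid", "low", "high", "mid", "low", "high", "mid"],
    "Salary": [10, 90, 50, 12, 95, 55, 14, 85, 45],
})

# The desired left-to-right order becomes the category order.
order = ["low", "mid", "high"]
df["Category"] = pd.Categorical(df["Category"], categories=order, ordered=True)

# The boxes are grouped by the categorical column.
ax = df.boxplot(column="Salary", by="Category")
```

Call plt.show() afterwards to display the figure. Any value not listed in `categories` becomes NaN, so the list should cover every value that occurs in the column.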

score 2 · Accepted Answer

As Cireo pointed out:

Use the new positions= argument:

df.boxplot(column=['Data'], by=['PlotBy'], positions=df.index.values)

I know this is exactly what was said before, but it was not clear or summarised enough for beginners like me.
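
For an arbitrary order, the positions have to be computed per group. The following is a minimal sketch with made-up data, assuming that pandas draws the groups of a plain string `by` column in sorted key order and forwards `positions` to matplotlib unchanged; the names `desired` and `keys` are introduced here for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data with three groups.
df = pd.DataFrame({
    "PlotBy": ["a", "b", "c"] * 4,
    "Data": [1, 5, 9, 2, 6, 10, 3, 7, 11, 4, 8, 12],
})

desired = ["c", "a", "b"]              # left-to-right order we want
keys = sorted(df["PlotBy"].unique())   # assumed drawing order of the groups

# For each group, in drawing order, the x position it should occupy.
positions = [desired.index(k) for k in keys]

ax = df.boxplot(column="Data", by="PlotBy", positions=positions)
```

Here `positions` comes out as [1, 2, 0]: group a is drawn at x=1, b at x=2 and c at x=0, so the plot reads c, a, b from left to right. Call plt.show() afterwards to display the figure.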

score 1 · Accepted Answer

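If you are not happy with the default column order in your boxplot, you can change it to a specific order by setting the column parameter of the boxplot function.

Check the two examples below: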

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame(np.random.rand(37,4), columns=list('ABCD'))

##
plt.figure()
df.boxplot()
plt.title("default column order")

##
plt.figure()
df.boxplot(column=['C','A', 'D', 'B'])
plt.title("Specified column order")
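
The column list does not have to be typed by hand; it can be derived from the data. A minimal sketch, assuming wide-format data as in the example above (the per-column offsets are made up so that the medians differ clearly):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(0)
df = pd.DataFrame(np.random.rand(37, 4), columns=list("ABCD"))

# Hypothetical offsets, one per column, so the medians are well separated.
df = df + [3, 1, 4, 2]

# Column names sorted by their median, lowest first.
order = df.median().sort_values().index.tolist()

ax = df.boxplot(column=order)
ax.set_title("Columns ordered by median")
```

Call plt.show() afterwards to display the figure. Swapping median() for mean() or any other per-column statistic changes the ordering criterion without touching the plotting call.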

score 0 · Accepted Answer

This might sound a bit silly, but many plotting functions let you specify the order. For example:

Library & dataset

import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('iris')

Specific order

p1=sns.boxplot(x='species', y='sepal_length', data=df, order=["virginica", "versicolor", "setosa"])
plt.show()

score 0 · Accepted Answer

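This can be solved by applying a categorical order. You can decide on the ranking yourself. Here is an example with days of the week.

Provide a categorical order for the weekdays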

#List categorical variables in correct order
weekday = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
#Assign the above list to category ranking
wDays = pd.api.types.CategoricalDtype(ordered= True, categories=weekday)
#Apply this to the specific column in DataFrame
df['Weekday'] = df['Weekday'].astype(wDays)
# Then generate your plot
plt.figure(figsize = [15, 10])
# 'Y Axis Variable' and colour are placeholders for your own column name and colour
sns.boxplot(data = df, x = 'Weekday', y = 'Y Axis Variable', color = colour)
