python - Creating a MultiIndex from CSV census data

Question

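I want to create a DataFrame with a MultiIndex so that I can calculate values in a more systematic way.

I know there is a more elegant solution, but I'm having trouble finding it. Most of what I've found deals with Series and tuples. I'm fairly new to pandas (and to programming), and this is my first attempt at using/creating a MultiIndex.

After downloading the census data as a CSV and creating a DataFrame with the relevant fields, I have the following: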

county housingunits2010 housingunits2012 occupiedunits2010 occupiedunits2012
8001   120              200              50                100
8002   100              200              75                125

And I would like to end up with:

id    Year  housingunits occupiedunits
8001  2010  120          50
      2012  200          100
8002  2010  100          75
      2012  200          125

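From there I want to add columns of calculated values (i.e. the difference between years, % change) and columns from other DataFrames, merging and matching on county and year.

I found a workaround using the basic methods I've learned (see below), but... it's certainly not elegant. Any suggestions would be appreciated.

First, create two separate DataFrames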

df3 = df2[["county_id","housingunits2012"]]
df4 = df2[["county_id","housingunits2010"]]

Add a year column

df3['year'] = np.array(['2012'] * 7)
df4['year'] = np.array(['2010'] * 7)
df3.columns = ['county_id','housingunits','year']
df4.columns = ['county_id','housingunits','year']

Append

df5 = df3.append(df4)

Write to CSV

df5.to_csv('/Users/ntapia/df5.csv', index = False)

Read back and sort

df6 = pd.read_csv('/Users/ntapia/df5.csv', index_col=[0, 2])
df6.sort_index(0)

Result (actual data):

                      housingunits
county_id year              
8001      2010        163229
          2012        163986
8005      2010        238457
          2012        239685
8013      2010        127115
          2012        128106
8031      2010        285859
          2012        288191
8035      2010        107056
          2012        109115
8059      2010        230006
          2012        230850
8123      2010         96406
          2012         97525

Thanks!

score 1 · Accepted Answer

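import re
import pandas as pd

df = df.set_index('county')
# Split each column name, e.g. 'housingunits2010' -> ('housingunits', '2010')
df = df.rename(columns=lambda x: re.search(r'([a-zA-Z_]+)(\d{4})', x).groups())
df.columns = pd.MultiIndex.from_tuples(df.columns, names=['label', 'year'])
# Move both column levels into the row index, which produces a Series
s = df.unstack()
s.name = 'count'
print(s)

gives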

label          year  county
housingunits   2010  8001      120
                     8002      100
               2012  8001      200
                     8002      200
occupiedunits  2010  8001       50
                     8002       75
               2012  8001      100
                     8002      125
Name: count, dtype: int64

If you need a DataFrame instead, call reset_index():

print(s.reset_index())

yields

           label  year  county  count
0   housingunits  2010    8001    120
1   housingunits  2010    8002    100
2   housingunits  2012    8001    200
3   housingunits  2012    8002    200
4  occupiedunits  2010    8001     50
5  occupiedunits  2010    8002     75
6  occupiedunits  2012    8001    100
7  occupiedunits  2012    8002    125
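
The output above is in long form, with one row per label, year and county. To get closer to the layout asked for in the question (one row per county and year, one column per label), one option is to move only the year level into the row index with stack. The following is a minimal sketch, not part of the original answer. It rebuilds the sample data from the question so that it runs on its own, and the names tidy, split_name, occupancy_rate and housing_pct_change are made up for illustration. It also converts the year to an integer, which is an assumption about what is convenient for later merges.

```python
import re

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "county": [8001, 8002],
        "housingunits2010": [120, 100],
        "housingunits2012": [200, 200],
        "occupiedunits2010": [50, 75],
        "occupiedunits2012": [100, 125],
    }
).set_index("county")


def split_name(col):
    """Split e.g. 'housingunits2010' into ('housingunits', 2010)."""
    label, year = re.search(r"([a-zA-Z_]+)(\d{4})", col).groups()
    return label, int(year)


# Build a two-level column index: (label, year).
df.columns = pd.MultiIndex.from_tuples(
    [split_name(col) for col in df.columns], names=["label", "year"]
)

# Move only the 'year' level from the columns into the row index.
# The result is indexed by (county, year) with one column per label.
tidy = df.stack("year").sort_index()
tidy.columns.name = None  # drop the leftover 'label' header
print(tidy)

# Derived columns of the kind mentioned in the question.
tidy["occupancy_rate"] = tidy["occupiedunits"] / tidy["housingunits"]
# Change relative to the previous year within each county (NaN for the first year).
tidy["housing_pct_change"] = (
    tidy.groupby(level="county")["housingunits"].pct_change()
)
print(tidy)
```

Because county and year form the row index, another DataFrame indexed the same way could then be combined with tidy.join(other). Depending on the pandas version, stack may emit a FutureWarning about its new implementation; the result for this data should be the same either way.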
