python - Sorting by date within groups defined by the first attribute

Question

I have a dataset with 8 attributes (sorted by the first attribute) in the following format (tab-separated, shown as an example):

AX  0123  December 20, 2010  1  2  8.0  hello this
AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0214  September 10, 2009  2  3  9.0 this king
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
MG  980   April 20, 2007  2  4  7.1  not available
MG  246   May 8, 2005  5  1  2.1  make goat

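The file is already sorted by the first attribute. Within each group sharing the same first attribute, I need to sort the rows by date, so the output should look like the listing below. I don't want to use a database. This is a huge file (2 GB), so I think I may need special Python code (I'm not sure whether simple code can do this).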

AX  2313  April 19, 2009  2  3  4.0  hi there
AX  4532  December 19, 2010  6  2  8.0  nice tie
AX  0123  December 20, 2010  1  2  8.0  hello this
AX  1244  January 10, 2011  3  4  8.0  king tale
BX  0114  February 9, 2003  4  9  4.0  his brought
BX  3214  September 1, 2006  1  3  3.0 is great
BX  0214  September 10, 2009  2  3  9.0 this king
MG  246   May 8, 2005  5  1  2.1  make goat
MG  980   April 20, 2007  2  4  7.1  not available

Any reply is much appreciated. If anything is unclear, please ask.

score 1 · Accepted Answer

OK, quick and dirty. You can improve on it:

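from datetime import datetime as dt
from collections import defaultdict

# Maps each first-column value to the list of rows that share it.
dd = defaultdict(list)

with open('test.txt') as f:
    for line in f:
        fields = line.split('\t')
        dd[fields[0]].append(fields)

def mydate(fields):
    # The third column holds a date such as "April 19, 2009".
    return dt.strptime(fields[2], "%B %d, %Y")

keys = sorted(dd.keys())

my_list = []
for key in keys:
    # Sort each group by date, then append it to the overall result.
    dd[key].sort(key=mydate)
    my_list.extend(dd[key])

for item in my_list:
    print(item)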

This produces:

['AX', '2313', 'April 19, 2009', '2', '3', '4.0', 'hi there\n']
['AX', '4532', 'December 19, 2010', '6', '2', '8.0', 'nice tie\n']
['AX', '0123', 'December 20, 2010', '1', '2', '8.0', 'hello this\n']
['AX', '1244', 'January 10, 2011', '3', '4', '8.0', 'king tale\n']
['BX', '0114', 'February 9, 2003', '4', '9', '4.0', 'his brought\n']
['BX', '3214', 'September 1, 2006', '1', '3', '3.0 is great\n']
['BX', '0214', 'September 10, 2009', '2', '3', '9.0 this king\n']
['MG', '246', 'May 8, 2005', '5', '1', '2.1', 'make goat']
['MG', '980', 'April 20, 2007', '2', '4', '7.1', 'not available\n']

Then just turn each list back into a string with str.join():

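text_lines = []
for item in my_list:
    # Normalize line endings: the file's last line may lack a trailing
    # newline and could end up in the middle of the sorted output.
    text_lines.append('\t'.join(item).rstrip('\n') + '\n')

full_text = ''.join(text_lines)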
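Since the question says the file is already sorted by the first attribute, the rows for each key are contiguous, so it should be possible to process one group at a time instead of loading all 2 GB into a dictionary. The following is a minimal sketch of that idea using itertools.groupby. It assumes each individual group fits in memory and that the date is always the third tab-separated field; the function names and the sample rows are illustrative only.

```python
from datetime import datetime
from itertools import groupby


def parse_date(fields):
    # The third tab-separated field holds a date such as "April 19, 2009".
    return datetime.strptime(fields[2].strip(), "%B %d, %Y")


def sort_groups_by_date(lines):
    """Yield lines group by group, each group sorted by its date field.

    Assumes the input is already grouped by the first field, so only
    one group has to be held in memory at any time.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for _, group in groupby(rows, key=lambda fields: fields[0]):
        for fields in sorted(group, key=parse_date):
            yield "\t".join(fields) + "\n"


# Small in-memory sample taken from the question's data.
sample = [
    "AX\t0123\tDecember 20, 2010\t1\t2\t8.0\thello this\n",
    "AX\t2313\tApril 19, 2009\t2\t3\t4.0\thi there\n",
    "BX\t0214\tSeptember 10, 2009\t2\t3\t9.0\tthis king\n",
    "BX\t0114\tFebruary 9, 2003\t4\t9\t4.0\this brought\n",
]
result = list(sort_groups_by_date(sample))
```

For a real file, the same generator can be fed an open file object and written out lazily, for example with `dst.writelines(sort_groups_by_date(src))` inside a `with open(...)` block for both files (the file names being whatever you use).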

score 0 · Accepted Answer

pandas is a Python library designed for analyzing datasets that mix different data types.

If your data is in data.txt, you can read it with pandas.read_csv() and then sort the resulting DataFrame.

>>> import datetime
>>> import pandas as pd

>>> def date_converter(date_string):
...     return datetime.datetime.strptime(date_string, '%B %d, %Y').date()
>>> df = pd.read_csv('data.txt', sep='\t', header=None,
...                  converters={2:date_converter})
>>> print(df)
  X.1   X.2         X.3  X.4  X.5  X.6            X.7
0  AX   123  2010-12-20    1    2  8.0     hello this
1  AX  2313  2009-04-19    2    3  4.0       hi there
2  AX  4532  2010-12-19    6    2  8.0       nice tie
3  AX  1244  2011-01-10    3    4  8.0      king tale
4  BX   214  2009-09-10    2    3  9.0      this king
5  BX   114  2003-02-09    4    9  4.0    his brought
6  BX  3214  2006-09-01    1    3  3.0       is great
7  MG   980  2007-04-20    2    4  7.1  not available
8  MG   246  2005-05-08    5    1  2.1      make goat

>>> df = df.set_index(['X.1', 'X.3'])  # using a hierarchical index
>>> df = df.sort_index()
>>> print(df)
                 X.2  X.4  X.5  X.6            X.7
X.1 X.3                                           
AX  2009-04-19  2313    2    3  4.0       hi there
    2010-12-19  4532    6    2  8.0       nice tie
    2010-12-20   123    1    2  8.0     hello this
    2011-01-10  1244    3    4  8.0      king tale
BX  2003-02-09   114    4    9  4.0    his brought
    2006-09-01  3214    1    3  3.0       is great
    2009-09-10   214    2    3  9.0      this king
MG  2005-05-08   246    5    1  2.1      make goat
    2007-04-20   980    2    4  7.1  not available

Because pandas is built on numpy, it is a good choice for large datasets.
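The session above appears to come from an older pandas release (note the X.1-style default column names). Recent versions label headerless columns with integers instead, so here is a hedged sketch of the same idea for current pandas. The column names are made up for illustration, and the sample is read from an in-memory string rather than data.txt.

```python
import io

import pandas as pd

# A few rows from the question, held in memory for the example.
raw = io.StringIO(
    "AX\t0123\tDecember 20, 2010\t1\t2\t8.0\thello this\n"
    "AX\t2313\tApril 19, 2009\t2\t3\t4.0\thi there\n"
    "BX\t0214\tSeptember 10, 2009\t2\t3\t9.0\tthis king\n"
    "BX\t0114\tFebruary 9, 2003\t4\t9\t4.0\this brought\n"
)

# Hypothetical column names; the original data has no header row.
columns = ["key", "id", "date", "a", "b", "score", "text"]

# Read "id" as a string so leading zeros such as "0123" are preserved.
df = pd.read_csv(raw, sep="\t", header=None, names=columns, dtype={"id": str})

# Parse dates like "April 19, 2009" into real timestamps.
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y")

# Sort by the first attribute, then by date within each group.
df = df.sort_values(["key", "date"]).reset_index(drop=True)
```

Sorting with sort_values on two columns keeps the data as ordinary columns, which may be more convenient than a hierarchical index if the result is written straight back to a file.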
