python - Pandas read_csv and removing daylight saving time

Question

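I have a 312.5 MB csv file containing EURUSD 1min OHLC data from 27 July 2003 to the present, but all of the dates are adjusted for daylight saving time, so the index has duplicates and gaps.

Because the file is so large, the default date parser is far too slow, so I did the following: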

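from datetime import datetime

import dateutil.tz
from pandas import read_csv

tizo = dateutil.tz.tzfile('/usr/share/zoneinfo/GB')

def date_parse_1min(s):
    # Slice the fixed-width date fields directly instead of using the slow default parser
    return datetime(int(s[6:10]),   # year
                    int(s[3:5]),    # month
                    int(s[0:2]),    # day
                    int(s[11:13]),  # hour
                    int(s[14:16]),  # minute
                    tzinfo=tizo)

df = read_csv("EURUSD_1m_clean_w_header.csv", index_col=0, parse_dates=True, date_parser=date_parse_1min)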

#verify that it's got the tz right:
df.index
Exception AttributeError: "'NoneType' object has no attribute 'toordinal'" in 'pandas.tslib._localize_tso' ignored
Exception AttributeError: "'NoneType' object has no attribute 'toordinal'" in 'pandas.tslib._localize_tso' ignored
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-07-26 23:00:00, ..., 2012-12-15 23:59:00]
Length: 4938660, Freq: None, Timezone: tzfile('/usr/share/zoneinfo/GB')

I don't understand why the AttributeError shows up there.

df.index.get_duplicates()
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-10-26 01:00:00, ..., 2012-10-28 01:59:00]
Length: 600, Freq: None, Timezone: None
df1 = df.tz_convert('GMT')
df1.index.get_duplicates()
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-10-26 01:00:00, ..., 2012-10-28 01:59:00]
Length: 600, Freq: None, Timezone: None

How can I get pandas to remove the daylight saving offset? Obviously I could work out the integer indices of the rows that need changing and do it that way, but there must be a better way.

score 0 · Accepted Answer

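The simplest way to fix the problem is to find the first and last duplicated values for each year and shift the data between them by one hour. Keep in mind that the first data points start in daylight saving time.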
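
As a rough illustration of that idea (not the answerer's actual code), the sketch below converts an ordered sequence of naive UK wall-clock timestamps to UTC by subtracting one hour from everything that falls in summer time. It assumes the rule in force over the data's date range (clocks change on the last Sunday of March and of October), and it assumes the repeated 01:00-01:59 hour appears in file order, with the first pass still in summer time. The names `last_sunday` and `uk_local_to_utc` are hypothetical helpers invented for this example.

```python
from datetime import datetime, timedelta


def last_sunday(year, month):
    """Midnight on the last Sunday of March or October (both have 31 days)."""
    d = datetime(year, month, 31)
    # weekday(): Monday == 0 ... Sunday == 6, so this steps back to the Sunday
    return d - timedelta(days=(d.weekday() + 1) % 7)


def uk_local_to_utc(stamps):
    """Convert ordered, naive UK wall-clock datetimes to naive UTC datetimes.

    Assumption: rows are in file order, so the repeated autumn hour shows up
    as a backwards step in the wall-clock time.
    """
    out = []
    prev = None
    second_pass = False  # True once the repeated autumn hour has restarted
    for s in stamps:
        spring = last_sunday(s.year, 3) + timedelta(hours=2)   # first summer-time wall time
        autumn = last_sunday(s.year, 10) + timedelta(hours=1)  # start of the repeated hour
        if autumn <= s < autumn + timedelta(hours=1):
            # Ambiguous hour: summer time until the clock is seen to step back
            if prev is not None and s <= prev:
                second_pass = True
            dst = not second_pass
        else:
            second_pass = False
            dst = spring <= s < autumn
        out.append(s - timedelta(hours=1) if dst else s)
        prev = s
    return out


# Hypothetical sample spanning the 2012 autumn change
sample = [
    datetime(2012, 10, 28, 0, 59),
    datetime(2012, 10, 28, 1, 0),
    datetime(2012, 10, 28, 1, 59),
    datetime(2012, 10, 28, 1, 0),   # clock stepped back: second pass
    datetime(2012, 10, 28, 1, 59),
    datetime(2012, 10, 28, 2, 0),
]
utc = uk_local_to_utc(sample)
```

After the conversion the sample has no duplicates and is strictly increasing. If the index is read as naive timestamps (without `tzinfo=tizo`), pandas' own `df.index.tz_localize('Europe/London', ambiguous='infer')` followed by `tz_convert('UTC')` may give the same result without a hand-written loop, provided the repeated hour is in order in the file.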
