python - Why are the following objects numpy strings instead of datetime.datetime objects?

Question

I have a csv file laid out like this:

Person,Date1,Date2,Status
Person1,12/10/11,17/10/11,Done
...

I want to perform various operations on it, so I start by pulling it into Python and converting the date strings to datetime.datetime objects. I have the following code:

import re
import numpy as np
from datetime import datetime, timedelta
from dateutil import rrule

def get_data(csv_file = '/home/garry/Desktop/complaints/input.csv'):
    inp = np.genfromtxt(csv_file,
        delimiter=',',
        filling_values = None,
        dtype = None)

    date = re.compile(r'\d+/\d+/\d+')
    count = 0
    item_count = 0

    for line in inp:
        for item in line:
            if re.match(date, item):
                item = datetime.strptime(item, '%d/%m/%y')
                inp[count][item_count] = item
                item_count += 1
            else:
                item_count += 1
        item_count = 0
        count += 1

    return inp

def get_teams(data):
    team_list = []
    for line in data:
        if line[0] not in team_list:
            team_list.append(line[0])
        else:
            pass
    del team_list[0]
    return team_list

def get_months():
    month_list = []
    months = [1,2,3,4,5,6,7,8,9,10,11,12]
    now = datetime.now()
    start_month = now.month - 7
    for count in range(0,7):
        if months[start_month] > now.month:
            year = now.year - 1
        else:
            year = now.year
        month_list.append([months[start_month], year])
        start_month += 1
    return month_list

if __name__ == "__main__":
    inp = get_data()
    for item in inp[2]:
        print type(item)
    team_list = get_teams(inp)
    month_list = get_months()

The print statement in the main block (inserted for debugging) outputs the following:

<type 'numpy.string_'>
<type 'numpy.string_'>
<type 'numpy.string_'>
<type 'numpy.string_'>

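The loop in the get_data() function is supposed to change the date strings into datetime.datetime objects, so this is obviously not what I'm after. When I run the same code as in the loop on an individual date string as a test, it converts the type fine. The code above also works in a sense, because the strings are changed to the datetime.datetime format; they just aren't of the right type. Can anyone see what I'm doing wrong here?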

score 2 · Accepted Answer

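The problem is that numpy arrays have a fixed type. Numpy stores data in fixed-size contiguous blocks of memory, so when you assign a value to an index in a numpy array, numpy converts the data before storing it in the array. It does this even with arrays of strings. For example: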

>>> a = numpy.array(['xxxxxxxxxx'] * 10)
>>> for index, datum in enumerate(a):
...     print datum, a[index], type(a[index])
...     a[index] = 5
...     print datum, a[index], type(a[index])
... 
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>
xxxxxxxxxx xxxxxxxxxx <type 'numpy.string_'>
xxxxxxxxxx 5 <type 'numpy.string_'>

Conveniently (or not!), datetime.datetime objects can be converted using str, so in this line...

inp[count][item_count] = item

...numpy simply converts the item to a string and inserts it into the array.
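To see the same effect with an actual date, here is a minimal Python 3 sketch. The two-element array and the names dates and parsed are made up for illustration; they only stand in for one column of the array from the question.

```python
from datetime import datetime

import numpy as np

# A fixed-width string array, standing in for a date column from the question.
dates = np.array(['12/10/11', '17/10/11'])

parsed = datetime.strptime(str(dates[0]), '%d/%m/%y')
dates[0] = parsed  # numpy stores a string form of parsed, not the object itself

print(type(parsed))    # a datetime.datetime object
print(type(dates[0]))  # still a numpy string type
print(dates.dtype)     # the array's string dtype has not changed
```

Because the array's item width is fixed, the stored text is also cut to that width, so what ends up in the array is not even the full string form of the date.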

Now, you can get around this behavior by using dtype=object. But doing so negates much of numpy's speed, because you are forcing numpy to call a bunch of slow python code.

>>> a = numpy.array(['xxxxxxxxxx'] * 10, dtype=object)
>>> for index, datum in enumerate(a):
...     print datum, a[index], type(a[index])
...     a[index] = 5
...     print datum, a[index], type(a[index])
... 
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
xxxxxxxxxx xxxxxxxxxx <type 'str'>
xxxxxxxxxx 5 <type 'int'>
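Applied to dates, the object-dtype route might look like the following Python 3 sketch. The array contents and the name cells are assumptions for illustration only.

```python
from datetime import datetime

import numpy as np

# dtype=object makes the array hold references to arbitrary Python objects.
cells = np.array(['12/10/11', '17/10/11'], dtype=object)

for index, text in enumerate(cells):
    # The datetime object is stored as-is; no conversion to a string happens.
    cells[index] = datetime.strptime(text, '%d/%m/%y')

print(type(cells[0]))       # a datetime.datetime object
print(cells[1] - cells[0])  # ordinary datetime arithmetic works on the elements
```

The trade-off described above still applies: every element access goes through Python objects, so this is convenient rather than fast.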

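Let me add that you aren't using numpy to its full potential here. Numpy is designed to process arrays in a vectorized way, without explicit for loops. (See the tutorial for details.) So whenever you find yourself using a for loop to work with numpy, it's natural to ask how you could avoid it. Rather than pointing out the problems with your code, I'll show you one interesting thing you can do: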

>>> numpy.genfromtxt('input.csv', delimiter=',', dtype=None, names=True)
array([('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done'),
       ('Person1', '12/10/11', '17/10/11', 'Done')], 
      dtype=[('Person', '|S7'), ('Date1', '|S8'), 
             ('Date2', '|S8'), ('Status', '|S4')])
>>> a = numpy.genfromtxt('input.csv', delimiter=',', dtype=None, names=True)
>>> a['Status']
array(['Done', 'Done', 'Done', 'Done', 'Done', 'Done'], 
      dtype='|S4')
>>> a['Date1']
array(['12/10/11', '12/10/11', '12/10/11', '12/10/11', '12/10/11',
       '12/10/11'], 
      dtype='|S8')

Now you can access the dates directly instead of looping over the table with a regular expression.
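Once a date column is available as its own array, one possible next step (not part of the original answer) is to turn it into numpy's native datetime64 type, so that date arithmetic applies to the whole column at once. The sketch below is Python 3; the sample values and the helper name to_datetime64 are hypothetical.

```python
from datetime import datetime

import numpy as np

# Stand-ins for a['Date1'] and a['Date2'] from the structured array above.
date1 = np.array(['12/10/11', '12/10/11'])
date2 = np.array(['17/10/11', '20/10/11'])


def to_datetime64(column):
    """Parse dd/mm/yy strings and return a datetime64[D] array."""
    days = [datetime.strptime(str(s), '%d/%m/%y').date() for s in column]
    return np.array(days, dtype='datetime64[D]')


# The subtraction runs over the whole column without an explicit loop.
elapsed = to_datetime64(date2) - to_datetime64(date1)
print(elapsed)
```

Parsing still happens once per element here, but after that the arrays are ordinary fixed-type numpy arrays and comparisons or differences are vectorized.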

score 1 · Accepted Answer

The problem is that the inp array you define in get_data gets its dtype ("|S8") from np.genfromtxt. When you try to replace one of its elements with another object, that object is converted to a string.

A first idea would be to convert inp to a list with inp.tolist(). That way, you could change the type of individual fields as needed. But there is something better (I think):

According to your example, the second and third columns are always dates, right? Then you can convert the strings to datetime objects straight away with np.genfromtxt:

np.genfromtxt(csv_file,
              delimiter=",",
              dtype=None,
              names=True,
              converters={1:lambda d:datetime.strptime(d,"%d/%m/%y"),
                          2:lambda d:datetime.strptime(d,"%d/%m/%y")})

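Here names=True means that you get a structured ndarray as output, with the field names taken from the first non-commented line (here, your Person,Date1,Date2,Status). As you may have guessed, the converters keyword transforms the strings in the second and third columns into datetime objects.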

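If you already know that the first and last columns are strings, you may want to use something other than None for dtype: np.genfromtxt works faster when it doesn't have to guess the type of each column.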
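For readers on Python 3, the converters approach could be sketched as follows. The in-memory CSV text, the helper parse_date, and the explicit encoding='utf-8' argument are assumptions added for this sketch; the original answer reads from a file and uses lambdas.

```python
import io
from datetime import datetime

import numpy as np


def parse_date(text):
    """Convert a dd/mm/yy field to a datetime object."""
    if isinstance(text, bytes):  # some numpy versions pass bytes to converters
        text = text.decode('utf-8')
    return datetime.strptime(text.strip(), '%d/%m/%y')


# In-memory stand-in for the csv file from the question.
csv_text = io.StringIO(
    "Person,Date1,Date2,Status\n"
    "Person1,12/10/11,17/10/11,Done\n"
    "Person2,13/10/11,20/10/11,Done\n"
)

data = np.genfromtxt(csv_text,
                     delimiter=',',
                     dtype=None,
                     names=True,
                     encoding='utf-8',
                     converters={1: parse_date, 2: parse_date})

print(data.dtype.names)        # field names taken from the header line
print(type(data['Date1'][0]))  # a datetime.datetime object
```

The date columns end up with an object dtype, so they hold real datetime objects, while the other columns keep the string types that genfromtxt infers.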

Now, for another comment:

Instead of keeping counters in your for loop, it is simpler to use something like for (i, item) in enumerate(whatever).
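As a rough illustration, the nested loop from get_data could be written with enumerate like this. The sketch works on a plain list of lists (what inp.tolist() would give you) rather than on the numpy array, and the sample row and the name DATE_PATTERN are made up.

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r'\d+/\d+/\d+')

# A plain list of lists can hold mixed types, unlike the fixed-type array.
rows = [['Person1', '12/10/11', '17/10/11', 'Done']]

for i, row in enumerate(rows):
    for j, item in enumerate(row):
        if DATE_PATTERN.match(item):
            # i and j replace the manual count and item_count counters.
            rows[i][j] = datetime.strptime(item, '%d/%m/%y')

print(rows[0])
```

There is nothing to reset or increment by hand, which removes the bookkeeping that the original loop needed in both branches of the if.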
