python - Python Pandas and NumPy.where behavior with mismatched row counts

Question

I am using Pandas 0.8.1 in all of the examples below, but I can confirm that the same examples behave the same way with Pandas 0.11.

Solutions that rely on upgrading Pandas to a newer version are not applicable to my current problem (though please feel free to add comments, rather than answers, about whether this is fixed in newer Pandas versions).

I have a sample Pandas DataFrame object:

In [20]: dfrm
Out[20]:
          A         B         C     D
0  1.202034 -0.285256  0.392160     0
1  1.799628 -0.169389 -0.305984     3
2  1.262144 -1.165034 -1.780316     6
3 -0.355975  1.610605  1.298506  None
4 -0.139220  0.024292  0.132928    12
5  0.921821 -0.109189 -0.539100    15
6  0.987901 -1.253987 -1.139684    18
7  2.170929  0.520814 -0.139740   NaN
8 -2.329704 -0.475419  1.473144    24
9  1.161275  0.918900 -1.077892    27

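First, I am a little confused about the type error I am seeing. When I try to use numpy.where to create string labels for different subsets of a particular column, I get an error that appears to be caused by the labels being strings.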

In [21]: np.where(dfrm['D'] > 12, 'L', 'S')
Out[21]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-a40c5cd8713c> in <module>()
----> 1 np.where(dfrm['D'] > 12, 'L', 'S')

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/displayhook.pyc in __call__(self, result)
    236             self.start_displayhook()
    237             self.write_output_prompt()
--> 238             format_dict = self.compute_format_data(result)
    239             self.write_format_data(format_dict)
    240             self.update_user_ns(result)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/displayhook.pyc in compute_format_data(self, result)
    148             MIME type representation of the object.
    149         """
--> 150         return self.shell.display_formatter.format(result)
    151
    152     def write_format_data(self, format_dict):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/formatters.pyc in format(self, obj, include, exclude)
    124                     continue
    125             try:
--> 126                 data = formatter(obj)
    127             except:
    128                 # FIXME: log the exception

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    445                 type_pprinters=self.type_printers,
    446                 deferred_pprinters=self.deferred_printers)
--> 447             printer.pretty(obj)
    448             printer.flush()
    449             return stream.getvalue()

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/lib/pretty.pyc in pretty(self, obj)
    358                             if callable(meth):
    359                                 return meth(obj, self, cycle)
--> 360             return _default_pprint(obj, self, cycle)
    361         finally:
    362             self.end_group()

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/IPython/lib/pretty.pyc in _default_pprint(obj, p, cycle)
    478     if getattr(klass, '__repr__', None) not in _baseclass_reprs:
    479         # A user-provided repr.
--> 480         p.text(repr(obj))
    481         return
    482     p.begin_group(1, '<')

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in __repr__(self)
    772             result = self._get_repr(print_header=True,
    773                                     length=len(self) > 50,
--> 774                                     name=True)
    775         else:
    776             result = '%s' % ndarray.__repr__(self)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/series.pyc in _get_repr(self, name, print_header, length, na_rep, float_format)
    833                                         length=length, na_rep=na_rep,
    834                                         float_format=float_format)
--> 835         return formatter.to_string()
    836
    837     def __str__(self):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in to_string(self)
    109
    110         fmt_index, have_header = self._get_formatted_index()
--> 111         fmt_values = self._get_formatted_values()
    112
    113         maxlen = max(len(x) for x in fmt_index)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in _get_formatted_values(self)
    100         return format_array(self.series.values, None,
    101                             float_format=self.float_format,
--> 102                             na_rep=self.na_rep)
    103
    104     def to_string(self):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in format_array(values, formatter, float_format, na_rep, digits, space, justify)
    460                         justify=justify)
    461
--> 462     return fmt_obj.get_result()
    463
    464

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in get_result(self)
    479             fmt_values = self._format_strings(use_unicode=True)
    480         else:
--> 481             fmt_values = self._format_strings(use_unicode=False)
    482
    483         return _make_fixed_width(fmt_values, self.justify)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/format.pyc in _format_strings(self, use_unicode)
    512         vals = self.values
    513
--> 514         is_float = lib.map_infer(vals, com.is_float) & notnull(vals)
    515         leading_space = is_float.any()
    516

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in notnull(obj)
    100     boolean ndarray or boolean
    101     '''
--> 102     res = isnull(obj)
    103     if np.isscalar(res):
    104         return not res

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in isnull(obj)
     58     from pandas.core.generic import PandasObject
     59     if isinstance(obj, np.ndarray):
---> 60         return _isnull_ndarraylike(obj)
     61     elif isinstance(obj, PandasObject):
     62         # TODO: optimize for DataFrame, etc.

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/common.pyc in _isnull_ndarraylike(obj)
     75         shape = values.shape
     76         result = np.empty(shape, dtype=bool)
---> 77         vec = lib.isnullobj(values.ravel())
     78         result[:] = vec.reshape(shape)
     79

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.isnullobj (pandas/src/tseries.c:5269)()

ValueError: Does not understand character buffer dtype format string ('s')

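If I replace the strings 'L' and 'S' with integers such as -1 and 1, it works fine, so that is a workaround. The stranger problem, however, is what happens when I mix the output of np.where with a DataFrame that has fewer rows.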

In [22]: dfrm1 = dfrm.ix[0:7]

In [23]: dfrm1
Out[23]:
          A         B         C     D
0  1.202034 -0.285256  0.392160     0
1  1.799628 -0.169389 -0.305984     3
2  1.262144 -1.165034 -1.780316     6
3 -0.355975  1.610605  1.298506  None
4 -0.139220  0.024292  0.132928    12
5  0.921821 -0.109189 -0.539100    15
6  0.987901 -1.253987 -1.139684    18
7  2.170929  0.520814 -0.139740   NaN

In [24]: dfrm
Out[24]:
          A         B         C     D
0  1.202034 -0.285256  0.392160     0
1  1.799628 -0.169389 -0.305984     3
2  1.262144 -1.165034 -1.780316     6
3 -0.355975  1.610605  1.298506  None
4 -0.139220  0.024292  0.132928    12
5  0.921821 -0.109189 -0.539100    15
6  0.987901 -1.253987 -1.139684    18
7  2.170929  0.520814 -0.139740   NaN
8 -2.329704 -0.475419  1.473144    24
9  1.161275  0.918900 -1.077892    27

**Why does the following line work without an error?**

In [25]: dfrm1['E'] = np.where(dfrm['D'] > 12, -1, 1)

In [26]: dfrm1
Out[26]:
          A         B         C     D  E
0  1.202034 -0.285256  0.392160     0  1
1  1.799628 -0.169389 -0.305984     3  1
2  1.262144 -1.165034 -1.780316     6  1
3 -0.355975  1.610605  1.298506  None  1
4 -0.139220  0.024292  0.132928    12  1
5  0.921821 -0.109189 -0.539100    15 -1
6  0.987901 -1.253987 -1.139684    18 -1
7  2.170929  0.520814 -0.139740   NaN  1

Even if I first save the output of np.where (which does not have the right number of rows for the smaller DataFrame dfrm1), using the saved object also works.

In [28]: tmp = np.where(dfrm['D'] > 12, -1, 1)

In [29]: tmp
Out[29]:
0    1
1    1
2    1
3    1
4    1
5   -1
6   -1
7    1
8   -1
9   -1
Name: D

In [30]: dfrm1['F'] = tmp

In [31]: dfrm1
Out[31]:
          A         B         C     D  E  F
0  1.202034 -0.285256  0.392160     0  1  1
1  1.799628 -0.169389 -0.305984     3  1  1
2  1.262144 -1.165034 -1.780316     6  1  1
3 -0.355975  1.610605  1.298506  None  1  1
4 -0.139220  0.024292  0.132928    12  1  1
5  0.921821 -0.109189 -0.539100    15 -1 -1
6  0.987901 -1.253987 -1.139684    18 -1 -1
7  2.170929  0.520814 -0.139740   NaN  1  1

I thought that Pandas might somehow share metadata about Index objects, so that truncation is allowed when inserting data that comes from an object with the same index.

In [33]: tmp1 = tmp.reset_index(drop=True)

In [34]: dfrm1['G'] = tmp1

In [35]: dfrm1
Out[35]:
          A         B         C     D  E  F  G
0  1.202034 -0.285256  0.392160     0  1  1  1
1  1.799628 -0.169389 -0.305984     3  1  1  1
2  1.262144 -1.165034 -1.780316     6  1  1  1
3 -0.355975  1.610605  1.298506  None  1  1  1
4 -0.139220  0.024292  0.132928    12  1  1  1
5  0.921821 -0.109189 -0.539100    15 -1 -1 -1
6  0.987901 -1.253987 -1.139684    18 -1 -1 -1
7  2.170929  0.520814 -0.139740   NaN  1  1  1

But even after inspecting the object IDs of the Index objects involved, there is no clear pattern.

In [36]: id(tmp.index)
Out[36]: 96118016

In [37]: id(tmp1.index)
Out[37]: 104735160

In [38]: id(dfrm.index)
Out[38]: 96118016

In [39]: id(dfrm1.index)
Out[39]: 104322304

Note that trying to assign a range of data with the wrong dimension does fail.

In [40]: dfrm1['H'] = np.arange(10)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-40-987f4eb97131> in <module>()
----> 1 dfrm1['H'] = np.arange(10)

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in __setitem__(self, key, value)
   1710         else:
   1711             # set column
-> 1712             self._set_item(key, value)
   1713
   1714     def _boolean_set(self, key, value):

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _set_item(self, key, value)
   1749         ensure homogeneity.
   1750         """
-> 1751         value = self._sanitize_column(key, value)
   1752         NDFrame._set_item(self, key, value)
   1753

/opt/epd/7.3-2_pandas0.8.1/lib/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value)
   1778                     value = value.reindex(self.index).values
   1779             else:
-> 1780                 assert(len(value) == len(self.index))
   1781
   1782                 if not isinstance(value, np.ndarray):

AssertionError:

In [41]: dfrm1['H'] = np.arange(8)

In [42]: dfrm1
Out[42]:
          A         B         C     D  E  F  G  H
0  1.202034 -0.285256  0.392160     0  1  1  1  0
1  1.799628 -0.169389 -0.305984     3  1  1  1  1
2  1.262144 -1.165034 -1.780316     6  1  1  1  2
3 -0.355975  1.610605  1.298506  None  1  1  1  3
4 -0.139220  0.024292  0.132928    12  1  1  1  4
5  0.921821 -0.109189 -0.539100    15 -1 -1 -1  5
6  0.987901 -1.253987 -1.139684    18 -1 -1 -1  6
7  2.170929  0.520814 -0.139740   NaN  1  1  1  7

Why is the output of np.where handled differently?

score 4 · Accepted Answer

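This is expected. You are assigning a Series to a DataFrame column. It does not matter whether the Series is shorter (or longer), because it is aligned: the indexes are matched up and the corresponding values are taken. The reason a plain numpy array works only when the lengths are the same is simple: it cannot be aligned, so it has to be the same length.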
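To illustrate the difference between the two cases, here is a minimal sketch written against a recent pandas rather than 0.8.1. The frames `full` and `small` and their values are made up for the example. In recent versions `np.where` returns a plain ndarray, so the result is wrapped in a Series explicitly to get the aligned behavior; the length mismatch is reported there as a `ValueError` instead of the `AssertionError` seen in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for dfrm / dfrm1 from the question.
full = pd.DataFrame({"D": [0, 3, 6, 12, 15, 18, 24, 27]})
small = full.iloc[:5].copy()  # fewer rows, same leading index labels

# A Series carries an index, so column assignment aligns on labels
# and silently ignores labels that the target frame does not have.
labels = pd.Series(np.where(full["D"] > 12, -1, 1), index=full.index)
small["E"] = labels

# A bare ndarray has no index to align on, so its length must match.
try:
    small["H"] = np.arange(len(full))
    length_error = None
except (ValueError, AssertionError) as exc:
    length_error = exc

print(small)
print("ndarray assignment failed:", length_error is not None)
```

The Series has eight entries but only the five whose labels exist in `small` are used, which is the same truncation the question observed.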

The id of the indexes is irrelevant; what matters is only whether they are equal (i.e. i1.equals(i2)).
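A small sketch of that point, using made-up indexes `i1` and `i2`: two Index objects can be distinct Python objects (different `id`) and still be equal label for label, and alignment only looks at the labels.

```python
import pandas as pd

i1 = pd.Index([0, 1, 2])
i2 = pd.Index([0, 1, 2])  # a separate object holding the same labels

same_object = i1 is i2       # identity, i.e. id(i1) == id(i2)
same_labels = i1.equals(i2)  # label-by-label equality

# Assignment aligns on labels, so a Series built on i2 fits a frame built on i1.
df = pd.DataFrame({"A": [10, 20, 30]}, index=i1)
df["B"] = pd.Series([1, 2, 3], index=i2)

print(same_object, same_labels)
print(df)
```

This is why comparing `id(tmp.index)` with `id(dfrm1.index)` shows no pattern: the ids were never part of the decision.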

Try this whole exercise with labels that are not numbers, or that are offset (not starting at 0), and you will see the alignment at work.
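The suggested experiment might look like the following sketch. All labels and values here are invented for illustration. With string labels in a different order, or with integer labels that do not overlap, it becomes visible that values are matched by label and not by position.

```python
import pandas as pd

# Non-numeric labels, given in a different order and with one extra label "w".
df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=["x", "y", "z"])
s = pd.Series([30, 10, 99], index=["z", "x", "w"])
df["B"] = s  # "x" and "z" are matched, "y" has no value, "w" is dropped

# Offset integer labels: nothing overlaps, so every row gets a missing value.
df2 = pd.DataFrame({"A": [1, 2, 3]})          # index 0, 1, 2
offset = pd.Series([1, 2, 3], index=[5, 6, 7])
df2["B"] = offset

print(df)
print(df2)
```

With the default 0-based integer index used in the question, positional and label-based assignment happen to give the same result, which hides the alignment.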

score 0 · Accepted Answer

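The ValueError occurs because where builds an ndarray of strings, whereas isnullobj takes an array of object dtype.
In version 0.8.1 the assignment works because where returns a Series, so the right-hand side of the expression (the where computation) can be reindexed to match the smaller index of the DataFrame on the left-hand side.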
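As a rough sketch of how both points could be handled explicitly, the example below converts the string result to object dtype and wraps it in a Series that carries the original index. It is written against a recent pandas, where `np.where` returns a plain ndarray, and has not been checked on 0.8.1; the column values are adapted from the question, with NaN used for both missing entries.

```python
import numpy as np
import pandas as pd

# Column D adapted from the question (NaN used for both missing entries).
dfrm = pd.DataFrame({"D": [0, 3, 6, np.nan, 12, 15, 18, np.nan, 24, 27]})
dfrm1 = dfrm.iloc[:8].copy()

# np.where with string branches yields a fixed-width string ndarray.
raw = np.where(dfrm["D"] > 12, "L", "S")

# Converting to object dtype and attaching the index gives a Series
# that can be aligned against the shorter frame on assignment.
labels = pd.Series(raw.astype(object), index=dfrm.index)
dfrm1["E"] = labels

print(dfrm1)
```

Comparisons against NaN are False, so the rows with missing values fall into the 'S' branch, just as they received 1 in the integer version.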
