python - Filter a NumPy ndarray (matrix) according to column values

Question

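This question is about filtering a NumPy ndarray according to some column values.

I have a fairly large NumPy ndarray (300000, 50) and I am filtering it according to the values of some specific columns. I use dtypes so that I can access each column by name.

The first column is named category_code, and I need to filter the matrix to return only the rows where category_code is in ("A", "B", "C").

The result needs to be another NumPy ndarray whose columns are still accessible by their dtype names.

Here is what I do now: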

index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]

A list comprehension like this:

rows = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(rows)

did not work, because the dtypes I originally had are no longer accessible.

Is there a better / more Pythonic way of achieving the same result?

Something that could look like:

filtered_data = data.where({'category_code': ('A', 'B', 'C')})

Thanks!

score 10 · Accepted Answer

You can use the NumPy-based library, Pandas, which has a more generally useful implementation of ndarrays:

>>> # import the library
>>> import pandas as PD

Create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, one key/value pair per column:

>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'], 
            'value':[4, 2, 6, 3, 8, 4, 3, 9]}

>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)

Returning just the rows in which category_code is either B or C takes two steps conceptually, but can easily be done in a single line:

>>> # step 1: create the index 
>>> idx = (D.category_code== 'B') | (D.category_code == 'C')

>>> # then filter the data against that index:
>>> D.loc[idx]

        category_code  value
   2             B      6
   3             C      3
   6             C      3

Note the difference between indexing in Pandas versus NumPy, the library upon which Pandas is built. In NumPy, you would just place the index inside the brackets, indicating which dimension you are indexing with a ",", and using ":" to indicate that you want all of the values (columns) in the other dimension:

>>> D[idx, :]

In Pandas, you use the data frame's loc indexer, and place only the index inside the brackets:

>>> D.loc[idx]
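
For a longer list of codes, such as the ("A", "B", "C") set in the question, chaining `==` comparisons with `|` gets unwieldy. A minimal sketch using the Series method `isin` on the same sample data follows; the names `df`, `wanted`, `mask` and `filtered` are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as above
data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'],
        'value': [4, 2, 6, 3, 8, 4, 3, 9]}
df = pd.DataFrame(data)

# Boolean mask: True for rows whose category_code is one of the wanted codes
wanted = ['A', 'B', 'C']
mask = df['category_code'].isin(wanted)

# Keep only the matching rows; column names are preserved
filtered = df.loc[mask]
print(filtered)
```

The mask is an ordinary boolean Series, so it can be combined with other conditions using `&` and `|` before being passed to `loc`.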

score 2 · Accepted Answer

If you have the choice, I strongly recommend pandas. It has "column indexing" built in, along with many other features. It is built on NumPy.
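
If staying in plain NumPy is preferable, a boolean mask built with `numpy.isin` may be enough. The sketch below assumes the data is a structured array with a field named category_code, as described in the question; the small array and its second field `value` are made up for illustration.

```python
import numpy as np

# Hypothetical small structured array standing in for the real data
dtype = [('category_code', 'U1'), ('value', 'i8')]
data = np.array([('D', 4), ('A', 2), ('B', 6), ('C', 3),
                 ('D', 8), ('A', 4), ('C', 3), ('A', 9)], dtype=dtype)

# Vectorized membership test on one named field
mask = np.isin(data['category_code'], ['A', 'B', 'C'])

# Boolean indexing returns a structured array with the same dtype,
# so fields are still accessible by name
filtered_data = data[mask]
print(filtered_data.dtype.names)
print(filtered_data['value'])
```

This avoids the Python-level loop over rows from the question, because the membership test runs on the whole column at once.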
