python - Find element's index in pandas Series

Question

I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in python pandas? (The first occurrence would suffice.)

I.e., I'd like something like:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3

Certainly, it is possible to define such a method with a loop:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print(find(myseries, 7))

But I assume there should be a better way. Is there?

score 53 · Accepted Answer

Converting to an Index, you can use get_loc:

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3

In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10

Duplicate handling

In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

If the matches are not contiguous, it returns a boolean array

In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)
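Since get_loc can hand back an int, a slice, or a boolean mask depending on the data, a small wrapper helps if all you want is the first match. A minimal sketch, assuming you want the 0-based position and None when the value is absent; first_position is a hypothetical helper name, not part of pandas:

```python
import numpy as np
import pandas as pd


def first_position(series, value):
    """Return the 0-based position of the first occurrence of value, or None."""
    try:
        loc = pd.Index(series).get_loc(value)
    except KeyError:
        # get_loc raises KeyError when the value is not present
        return None
    if isinstance(loc, slice):
        # contiguous duplicates: the slice start is the first match
        return loc.start
    if isinstance(loc, np.ndarray):
        # non-contiguous duplicates: boolean mask, take the first True
        return int(np.flatnonzero(loc)[0])
    # unique match: already an integer position
    return int(loc)
```

To turn the position into the Series' own label, use series.index[pos].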

It uses a hashtable internally, so it is fast

In [7]: s = pd.Series(np.random.randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = pd.Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

As Viktor points out, there is a one-time creation overhead to creating an index (it is incurred when you actually do something with the index, e.g. is_unique).

In [2]: s = pd.Series(np.random.randint(0,10,10000))

In [3]: %timeit pd.Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit pd.Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
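If you want to reproduce this kind of comparison outside IPython, something like the following should do. It is only a sketch using the standard timeit module. The seed, the repeat count, and the variable names are my own choices, and the numbers will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
s = pd.Series(rng.integers(0, 10, 10000))

# Cost of only constructing the Index (cheap, the work is deferred)
build = timeit.timeit(lambda: pd.Index(s), number=100)

# Cost of constructing it and touching a property, which forces the real work
build_and_use = timeit.timeit(lambda: pd.Index(s).is_unique, number=100)

# Cost of lookups once the Index already exists and has been used
idx = pd.Index(s)
idx.is_unique
reuse = timeit.timeit(lambda: idx.get_loc(5), number=100)

print(f"build: {build:.6f}s  build+use: {build_and_use:.6f}s  reuse: {reuse:.6f}s")
```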

score 15 · Accepted Answer

In [92]: (myseries==7).argmax()
Out[92]: 3

This works if you know 7 is there in advance. You can check this with (myseries==7).any()

Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is:

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
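One caveat worth spelling out: as far as I know, in recent pandas versions Series.argmax returns the integer position while Series.idxmax returns the index label, so the two differ whenever the index is not 0..n-1. A short sketch combining the any() guard with both variants (the variable names are just for illustration):

```python
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])
mask = myseries == 7

# Guard with any(): on an all-False mask, argmax/idxmax would still
# return the first element, which would be misleading.
first_label = mask.idxmax() if mask.any() else None    # 'b'
first_pos = int(mask.argmax()) if mask.any() else None  # 1

# All matching labels; an empty list when there is no match
all_labels = myseries[mask].index.tolist()  # ['b', 'd']
```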

score 7 · Accepted Answer

Another way to do this, although equally unsatisfying, is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

Returns: 3

Timings on a dataset I'm currently working with (consider it random):

In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

score 6 · Accepted Answer

Using numpy, you can get an array of the indices where your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:

(array([3], dtype=int64),)
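Note that np.where gives positions, not index labels. If the Series has a non-default index, you can map the positions back through the index. A small sketch, assuming a string-labelled Series made up for illustration:

```python
import numpy as np
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])

# np.where returns a tuple; element 0 holds the 0-based positions
positions = np.where(myseries.to_numpy() == 7)[0]

# Translate positions into the Series' own labels
labels = myseries.index[positions]

# First match, or None if the value never occurs
first_label = labels[0] if len(labels) else None
```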

score 2 · Accepted Answer

Another method that hasn't been mentioned yet is the tolist method:

myseries.tolist().index(7)

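Assuming the value exists in the Series, this should return the correct index (strictly speaking, the 0-based position of the first occurrence, which matches the index label here because the index is 0..n-1).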
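If the value might be missing, list.index raises ValueError, so it may be worth wrapping. A minimal sketch that also converts the position into the index label; first_label_via_list is a hypothetical helper name:

```python
import pandas as pd


def first_label_via_list(series, value):
    """Return the index label of the first occurrence of value, or None."""
    try:
        pos = series.tolist().index(value)  # 0-based position
    except ValueError:
        # list.index raises ValueError when the value is absent
        return None
    return series.index[pos]
```

For example, with pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e']), looking up 7 gives 'b' and looking up 10 gives None.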

score 1 · Accepted Answer

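Pandas has a built-in class Index with a method called get_loc. This method returns one of:

index (the index of the element)
slice (if the specified number occurs in a contiguous run)
array (a bool array if the number occurs at multiple non-contiguous indices)

Example: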

import pandas as pd

>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10)  # Returns index
3  # Index of 10 in series

>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10)  # Returns slice
slice(3, 6, None)  # 10 occurs at index 3 (included) to 6 (not included)


# If the data is not in sequence then it would return an array of bool's.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False,  True, False, False,  True,  True,  True, False,  True])

There are many other options, but I found this one very simple.
