python - Find the nth occurrence of a substring in a string

Question

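This seems like it should be pretty trivial, but I am new to Python and want to do it the most Pythonic way.

I want to find the index corresponding to the nth occurrence of a substring within a string.

There's got to be something equivalent to what I want to do, which is

mystring.find("substring", 2nd)

How can you achieve this in Python?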

score 101 · Accepted Answer

Here's a more Pythonic version of the straightforward iterative solution:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

Example:

>>> find_nth("foofoofoofoo", "foofoo", 2)
6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start

Example:

>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3

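This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

Simple is better than complex.
Flat is better than nested.
Readability counts.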

score 89 · Accepted Answer

Mark's iterative approach would be the usual way, I think.

Here's an alternative using string splitting, which can often be useful for finding-related processes:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)
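
To see why the arithmetic in the last line works, here is a small worked example. It is only a sketch: the function is repeated so the snippet runs on its own, and the sample string is my own choice. Note that n is zero-based in this version, so n=1 means the second occurrence.

```python
def findnth(haystack, needle, n):
    # Same split-based function as above; n is zero-based here.
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

haystack = "foo bar bar bar"

# Splitting at most twice leaves the text after the second "bar" as the tail.
parts = haystack.split("bar", 2)
print(parts)                          # ['foo ', ' ', ' bar']

# Index = total length - tail length - needle length = 15 - 4 - 3.
print(findnth(haystack, "bar", 1))    # 8 (second occurrence)
print(findnth(haystack, "bar", 0))    # 4 (first occurrence)
print(findnth(haystack, "bar", 3))    # -1 (there is no fourth occurrence)
```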

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')
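
The one-liner only returns the right index because 'XXX' has the same length as 'bar', so nothing after the replaced text shifts. A possible generalisation to the nth occurrence is sketched below. The function name and the chaff_char parameter are my own inventions, and n is 1-based here. Like the split version, it builds a modified copy of the string.

```python
def find_nth_by_replace(haystack, needle, n, chaff_char="\0"):
    # chaff_char is an assumed filler; it must not occur in the needle,
    # otherwise the chaff itself could produce a match.
    if not needle or chaff_char in needle:
        raise ValueError("needle must be non-empty and must not contain chaff_char")
    # Chaff of the same length keeps every later index unchanged.
    chaff = chaff_char * len(needle)
    # Blank out the first n - 1 occurrences, then find the next one.
    return haystack.replace(needle, chaff, n - 1).find(needle)

print(find_nth_by_replace("foo bar bar bar", "bar", 2))   # 8
print(find_nth_by_replace("foo bar bar bar", "bar", 4))   # -1
```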

score 43 · Accepted Answer

This will find the second occurrence of substring in string.

def find_2nd(string, substring):
   return string.find(substring, string.find(substring) + 1)

Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:

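def find_nth(string, substring, n):
   if (n == 1):
       return string.find(substring)
   else:
       previous = find_nth(string, substring, n - 1)
       # stop if an earlier occurrence is missing; otherwise -1 + 1
       # would restart the search at index 0
       if previous == -1:
           return -1
       return string.find(substring, previous + 1)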

score 34 · Accepted Answer

Understanding that regex is not always the best solution, I'd probably use one here:

>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence 
11
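
If the needle is plain text rather than a pattern, it probably should be escaped first, and the list does not have to be built in full. The following is a rough sketch of that idea, not part of the answer above: the function name and the overlap flag are assumptions of mine, and n is 1-based here (unlike the list index above).

```python
import re
from itertools import islice

def find_nth_regex(haystack, needle, n, overlap=False):
    # Treat the needle as literal text, not as a regular expression.
    pattern = re.escape(needle)
    if overlap:
        # A lookahead matches without consuming characters,
        # so overlapping occurrences are found as well.
        pattern = "(?=" + pattern + ")"
    matches = re.finditer(pattern, haystack)
    # Skip n - 1 matches lazily and take the next one, if there is one.
    match = next(islice(matches, n - 1, None), None)
    return match.start() if match is not None else -1

print(find_nth_regex("ababdfegtduab", "ab", 3))                     # 11
print(find_nth_regex("foofoofoofoo", "foofoo", 2))                  # 6
print(find_nth_regex("foofoofoofoo", "foofoo", 2, overlap=True))    # 3
print(find_nth_regex("a.b.c", ".", 2))                              # 3
```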

score 20 · Accepted Answer

I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let's benchmark findnth():

In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let's see how the approach based on str.find() does:

In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop

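Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes the second advantage of find_nth(): we can use it on mm directly, such that zero copies of the file are required: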

In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth's total of 47 s.
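
The session above is Python 2. For anyone trying the same idea on Python 3, here is a minimal sketch under a few assumptions: mmap objects search for bytes there, so the file is opened in binary mode and the needle is b"\n". The helper name nth_newline_offset and the tiny temporary file standing in for 'bigfile' are made up for the demo.

```python
import mmap
import os
import tempfile

def find_nth(s, x, n=0, overlap=False):
    # Python 3 port of the find_nth() above (range instead of xrange);
    # n is zero-based, so n=0 means the first occurrence.
    step = 1 if overlap else len(x)
    i = -step
    for _ in range(n + 1):
        i = s.find(x, i + step)
        if i < 0:
            break
    return i

def nth_newline_offset(path, n):
    # Hypothetical helper: map the file read-only and search it in place.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return find_nth(mm, b"\n", n)

# Small throwaway file standing in for 'bigfile'.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"one\ntwo\nthree\n")

offsets = [nth_newline_offset(tmp.name, n) for n in range(4)]
os.remove(tmp.name)
print(offsets)   # [3, 7, 13, -1]
```

The same find_nth() also accepts ordinary str or bytes objects, since it only relies on a find() method that takes a start offset.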

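I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c: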

#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        /* search only the bytes that remain after the previous match */
        b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}

Here is the setup.py file:

from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])

Install as usual with python setup.py install. The C code has an advantage here since it is limited to finding single characters, but let's see how fast this is:

In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop

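Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring...

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.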

score 11 · Accepted Answer

Simplest way?

text = "This is a test from a test ok" 

firstTest = text.find('test')

print(text.find('test', firstTest + 1))

score 8 · Accepted Answer

I'd probably do something like this, using the find function that takes an index parameter:

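def find_nth(s, x, n):
    # Start at -len(x) so the first search begins at index 0;
    # starting at -1 would skip a match at the very beginning.
    i = -len(x)
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

print(find_nth('bananabanana', 'an', 3))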

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i 
    else:
        return find_nth(s, x, n - 1, i + len(x))

print(find_nth('bananabanana', 'an', 3))

It's a functional way to solve it, but I don't know if that makes it more Pythonic.

score 2 · Accepted Answer

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

import itertools
import re

def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``haystack`` is a ``str``
    * an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1

score 2 · Accepted Answer

Here is another approach using re.finditer.
The difference is that this only looks into the haystack as far as necessary.

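from re import finditer
from itertools import dropwhile
needle='an'
haystack='bananabanana'
n=2
# finditer is imported directly, so call it without the re. prefix;
# enumerate from 1 so that n=2 selects the second match
next(dropwhile(lambda x: x[0]<n, enumerate(finditer(needle,haystack), 1)))[1].start()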

score 1 · Accepted Answer

>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...   if s[n:n+2] =="ab":
...     print(n, i)
...     j=j+1
...     if j==2: print("2nd occurrence at index position: ", n)
...
0 a
6 a
2nd occurrence at index position:  6
12 a
14 a

score 1 · Accepted Answer

I'll offer another "tricky" solution, which uses split and join.

In your example, you can use

len("substring".join([s for s in ori.split("substring")[:2]]))
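
The trick is that re-joining the first n pieces rebuilds everything in front of the nth occurrence, so the length of that prefix is the index. One caveat: when there are fewer than n occurrences, the one-liner returns len(ori) instead of -1. A guarded variant might look like the sketch below, where the function name and the sample string are my own and n is 1-based.

```python
def find_nth_split_join(ori, needle, n):
    # Split at most n times; n + 1 pieces means at least n occurrences exist.
    pieces = ori.split(needle, n)
    if len(pieces) <= n:
        return -1
    # Re-joining the first n pieces rebuilds everything before the nth needle.
    return len(needle.join(pieces[:n]))

ori = "a substring b substring c"
print(find_nth_split_join(ori, "substring", 2))   # 14
print(find_nth_split_join(ori, "substring", 3))   # -1
```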

score 1 · Accepted Answer

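A solution without using loops or recursion.

Use the required pattern in the compile method and enter the desired occurrence in the variable 'n'; the last statement will print the starting index of that occurrence of the pattern in the given string. Here the result of finditer, i.e. an iterator, is converted to a list and the nth index is accessed directly. Because list indices are zero-based, n=2 selects the third match.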

import re
n=2
sampleString="this is history"
pattern=re.compile("is")
matches=pattern.finditer(sampleString)
print(list(matches)[n].span()[0])

score -1 · Accepted Answer

This is the answer you really want:

def Find(String, ToFind, Occurence=1):
    index = 0
    count = 0
    # Slicing never raises IndexError, so no try/except is needed.
    while index < len(String):
        if String[index:index + len(ToFind)] == ToFind:
            count += 1
            if count == Occurence:
                return index
        index += 1
    # Return -1 (like str.find) rather than False, which equals index 0.
    return -1

score -1 · Accepted Answer

A simple solution for those with basic programming knowledge:

# Function to find the nth occurrence of a substring in a text
def findnth(text, substring, n):

    # n count
    occurrence = 0

    # loop through every index in the text
    for index in range(len(text)):

        # if the substring starts at the current index
        # (comparing a single letter would only work for
        # one-character substrings)
        if text.startswith(substring, index):

            # increment occurrence
            occurrence += 1

            # if this is the nth time the substring is found
            if occurrence == n:

                # return its index
                return index

    # otherwise indicate there is no match
    return "No match"

# example of how to call function
print(findnth('C$100$150xx', "$", 2))
