python - Removing part of a string based on start and end indices

Question

I have a lot of long strings, so I'm looking for an efficient way to do this. Suppose I have a string like the following:

 "< stuff to remove> get this stuff <stuff to remove>"

I'm trying to extract "get this stuff".

So I'm writing something like this:

strt_pos = 0
end_pos = 0
while True:
    strt_idx = string.find(start_point, strt_pos)  # start_point = "<" in our example
    end_idx = string.find(end_point, end_pos)      # end_point = ">" in our example
    chunk_to_remove = string[strt_idx:end_idx]
    # Now how do I chop this part off from the string?
    strt_pos = strt_pos + 1
    end_pos = end_pos + 1
    if strt_pos >= len(string):  # or maybe end_pos >= len(string)
        break

What is a better way to implement this?

score 2 · Accepted Answer

I'm not sure whether the find operations you're doing are part of the question. If you already have a start and end index and want to remove those characters from the string, you don't need a special function for that. Python lets you use numeric indices on the characters of a string.

>>> x = "abcdefg"
>>> x[1:3]
'bc'

The operation you want to perform would be something like x[:strt_idx] + x[end_idx:]. (If you omit the first index, it means "start from the beginning", and if you omit the second, it means "continue to the end".)
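
As a rough sketch of that slicing idea, the helper below (remove_span is a name made up for this illustration) drops the characters from start up to, but not including, end. It assumes the delimiters are "<" and ">" as in the question. If end_idx is the position of the closing ">" itself, pass end_idx + 1 so the delimiter goes too.

```python
def remove_span(text, start, end):
    """Return text without the characters in text[start:end]."""
    # Keep everything before start and everything from end onward.
    return text[:start] + text[end:]


s = "< stuff to remove> get this stuff <stuff to remove>"
start = s.find("<")        # index of the first opening delimiter
end = s.find(">", start)   # index of the closing delimiter that follows it
# end + 1 so that the ">" itself is removed as well
trimmed = remove_span(s, start, end + 1)
print(repr(trimmed))
```

This only removes the first chunk; the second "<...>" is still there and would need another pass.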

score 2 · Accepted Answer

Use a regular expression.

>>> s = "< stuff to remove> get this stuff <stuff to remove>"
>>> import re
>>> re.sub(r'<[^<>]*>', '', s)
' get this stuff '

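The expression <[^<>]*> matches a string that starts with <, ends with >, and has neither < nor > in between. The sub command then replaces each match with the empty string, which removes it.

You can then call .strip() on the result to remove the leading and trailing spaces, if you like.

Of course, this will fail if you have nested tags, for example, but it works for the example.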
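
To make the .strip() step and the nested-tag caveat concrete, here is a small sketch. The name TAG and the nested sample string are invented for illustration.

```python
import re

# "<", then any run of characters that are neither "<" nor ">", then ">"
TAG = re.compile(r"<[^<>]*>")

s = "< stuff to remove> get this stuff <stuff to remove>"
cleaned = TAG.sub("", s).strip()   # strip() drops the surrounding spaces
print(repr(cleaned))               # 'get this stuff'

# With nesting, only the innermost "<inner>" matches, so the outer
# brackets are left behind in the result.
nested = "<outer <inner> tag> keep me"
print(repr(TAG.sub("", nested)))
```

If nesting can occur in your data, a single regex pass like this is probably not enough.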

score 2 · Accepted Answer

A regular expression is an easy way to do this (though not necessarily the fastest, as jedwards' answer shows).

import re
s = '< stuff to remove> get this stuff <stuff to remove>'
s = re.sub(r'<[^>]*>', '', s)

After this, s is the string ' get this stuff '.

score 0 · Accepted Answer

If you have the start and end indices of the substring, you can do the following:

substring = string[s_ind:e_ind]

Here s_ind is the index of the first character you want to include in the substring, and e_ind is the index of the first character you don't want to include.

For example:

string = "Long string of which I only want a small part"
#         012345678901234567890123456789012345678901234
#         0         1         2         3         4
substring = string[21:32]
print(substring)

prints I only want

You can find the indices the same way you currently do.
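
For what it's worth, here is one way the find-based loop from the question could look once the chopping step is filled in. This is a sketch, not the only way to do it: strip_delimited is a made-up name, and it assumes delimiters are not nested and that an unmatched "<" should be left alone.

```python
def strip_delimited(text, start_point="<", end_point=">"):
    """Remove every start_point...end_point chunk from text using str.find."""
    kept = []   # pieces of text that lie outside the delimiters
    pos = 0     # where the next search begins
    while True:
        strt_idx = text.find(start_point, pos)
        if strt_idx == -1:
            break  # no more opening delimiters
        end_idx = text.find(end_point, strt_idx + len(start_point))
        if end_idx == -1:
            break  # unmatched opening delimiter: leave the rest untouched
        kept.append(text[pos:strt_idx])
        pos = end_idx + len(end_point)  # resume just past the closing delimiter
    kept.append(text[pos:])  # whatever follows the last removed chunk
    return "".join(kept)


s = "< stuff to remove> get this stuff <stuff to remove>"
print(repr(strip_delimited(s)))   # ' get this stuff '
```

Collecting the kept pieces and joining them once avoids rebuilding the whole string on every removal, which matters more as the strings get longer.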

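Edit: As for efficiency, this type of solution is actually more efficient than the regex solution. The reason is that regular expressions carry a lot of overhead that isn't necessarily needed here.

I'd encourage you to test these things yourself rather than blindly doing whatever people claim is most efficient.

Consider the following test program: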

#!/bin/env python

import re
import time

def inner_regex(s):
    return re.sub(r'<[^>]*>', '', s)

def inner_substr(s):
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]


s = '<stuff to remove> get this stuff <stuff to remove>'

tr1 = time.time()
for i in range(100000):
    s1 = inner_regex(s)
tr2 = time.time()
print("Regex:     %f" % (tr2 - tr1))

ts1 = time.time()
for i in range(100000):
    s2 = inner_substr(s)
ts2 = time.time()
print("Substring: %f" % (ts2 - ts1))

The output is:

Regex:     0.511443
Substring: 0.148062

In other words, the regex approach is more than 3 times slower than a corrected version of your original approach.
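
If you'd rather not hand-roll the timing loop, the standard library's timeit module can run the same comparison. The sketch below reuses the two functions from the test program; the iteration counts are arbitrary choices, and the numbers will differ from machine to machine.

```python
import re
import timeit

def inner_regex(s):
    return re.sub(r'<[^>]*>', '', s)

def inner_substr(s):
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]

s = '<stuff to remove> get this stuff <stuff to remove>'

# Both approaches should agree before their speed is compared.
assert inner_regex(s) == inner_substr(s)

# repeat() returns one total per run; the minimum is the least noisy figure.
regex_time = min(timeit.repeat(lambda: inner_regex(s), number=10000, repeat=3))
substr_time = min(timeit.repeat(lambda: inner_substr(s), number=10000, repeat=3))
print("Regex:     %f" % regex_time)
print("Substring: %f" % substr_time)
```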

Edit: Regarding the comment about compiled regular expressions: they are faster than uncompiled ones, but still slower than explicit substringing.

#!/bin/env python

import re
import time

def inner_regex(s):
    return re.sub(r'<[^>]*>', '', s)

def inner_regex_compiled(s,r):
    return r.sub('', s)

def inner_substr(s):
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]


s = '<stuff to remove> get this stuff <stuff to remove>'


tr1 = time.time()
for i in range(100000):
    s1 = inner_regex(s)
tr2 = time.time()


tc1 = time.time()
r = re.compile(r'<[^>]*>')
for i in range(100000):
    s2 = inner_regex_compiled(s,r)
tc2 = time.time()


ts1 = time.time()
for i in range(100000):
    s3 = inner_substr(s)
ts2 = time.time()


print("Regex:          %f" % (tr2 - tr1))
print("Regex Compiled: %f" % (tc2 - tc1))
print("Substring:      %f" % (ts2 - ts1))

Output:

Regex:          0.512799  # >3 times slower
Regex Compiled: 0.297863  # ~2 times slower
Substring:      0.144910

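Moral of the story: regular expressions are a handy tool to have in your toolbox, but they are not as efficient as simpler approaches when those are available.

And don't take people's word for things you can easily test yourself.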
