python - str.replace(..).replace(..) ad nauseam は Python の標準イディオムですか?

Question

たとえば、HTML で使用する文字列をエスケープする関数が必要だとします (Django のエスケープフィルターのように):

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

これは機能しますが、すぐに見苦しくなり、アルゴリズムのパフォーマンスが低下するように見えます (この例では、文字列は 5 回繰り返しトラバースされます)。より良いのは次のようなものです。

    def escape(string):
        """
        Returns the given string with ampersands, quotes and angle 
        brackets encoded.
        """
        # Note that ampersands must be escaped first; the rest can be escaped in 
        # any order.
        return replace_multi(string.replace('&', '&amp;'),
                             {'<': '&lt;', '>': '&gt;', 
                              "'": '&#39;', '"': '&quot;'})

そのような関数は存在しますか、それとも以前に書いたものを使用するための標準的な Python イディオムですか?

score 24 · Accepted Answer

実行速度が遅すぎるアプリケーションがあり、それをプロファイリングして、このスニペットのような行が速度低下の原因であることがわかりましたか? 思わぬところでボトルネックが発生。

現在のスニペットは文字列を 5 回トラバースし、毎回 1 つのことを実行します。おそらく毎回5つのことを実行する（または少なくとも毎回何かを実行する）ことで、一度トラバースすることを提案しています。これが自動的に私にとってより良い仕事をしてくれるかどうかは明らかではありません。現在使用されているアルゴリズムは O(n*m) (文字列の長さがルールの内容よりも長いと仮定) です。ここで、n は文字列の長さで、m は置換ルールの数です。アルゴリズムの複雑さを O(n*log(m)) のようなものに減らすことができ、特定のケースでは、元のものはすべて1文字だけです(ただし、複数の呼び出しの場合はそうではありません) to in general)—O(n) ですが、 m は 5ですがn は unboundedreplaceであるため、これは問題ではありません。

m が一定に保たれている場合、両方のソリューションの複雑さは実際には O(n) になります。5 つの単純なパスを 1 つの複雑なパスに変換しようとするのが価値のある作業になるかどうかは明らかではありません。実際の時間は現時点では推測できません。スケーリングを改善できる何かがあれば、もっとやりがいのある仕事だと思ったでしょう。

連続したパスではなく 1 回のパスですべてを行うには、競合するルールをどうするか、およびそれらをどのように適用するかについての質問にも答える必要があります。これらの質問に対する解決策は、一連のreplace.

score 17 · Accepted Answer

これを行うためのさまざまな方法をテストして、どれがより速くなるかを確認してみませんか (最速の方法だけに関心があると仮定して)。

def escape1(input):
        return input.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

translation_table = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;',
}

def escape2(input):
        return ''.join(translation_table.get(char, char) for char in input)

import re
_escape3_re = re.compile(r'[&<>\'"]')
def _escape3_repl(x):
    s = x.group(0)
    return translation_table.get(s, s)
def escape3(x):
    return _escape3_re.sub(_escape3_repl, x)

def escape4(x):
    return unicode(x).translate(translation_table)

test_strings = (
    'Nothing in there.',
    '<this is="not" a="tag" />',
    'Something & Something else',
    'This one is pretty long. ' * 50
)

import time

for test_i, test_string in enumerate(test_strings):
    print repr(test_string)
    for func in escape1, escape2, escape3, escape4:
        start_time = time.time()
        for i in xrange(1000):
            x = func(test_string)
        print '\t%s done in %.3fms' % (func.__name__, (time.time() - start_time))
    print

これを実行すると、次のようになります。

'Nothing in there.'
    escape1 done in 0.002ms
    escape2 done in 0.009ms
    escape3 done in 0.001ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.012ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

'This one is pretty long. <snip>'
    escape1 done in 0.008ms
    escape2 done in 0.386ms
    escape3 done in 0.011ms
    escape4 done in 0.310ms

次々と入れ替えていくのが一番手っ取り早いようです。

編集: 1000000 回の反復でテストを再度実行すると、最初の 3 つの文字列に対して次の結果が得られます (4 番目の文字列は、私のマシンでは待機するのに時間がかかりすぎます =P):

'Nothing in there.'
    escape1 done in 0.001ms
    escape2 done in 0.008ms
    escape3 done in 0.002ms
    escape4 done in 0.005ms

'<this is="not" a="tag" />'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.009ms
    escape4 done in 0.007ms

'Something & Something else'
    escape1 done in 0.002ms
    escape2 done in 0.011ms
    escape3 done in 0.003ms
    escape4 done in 0.007ms

数値はほぼ同じです。最初のケースでは、文字列の直接置換が現在最速であるため、実際にはさらに一貫性があります。

score 13 · Accepted Answer

私は次のようなきれいなものを好みます：

substitutions = [
    ('<', '&lt;'),
    ('>', '&gt;'),
    ...]

for search, replacement in substitutions:
    string = string.replace(search, replacement)

score 9 · Accepted Answer

削減を使用できます：

reduce(lambda s,r: s.replace(*r),
       [('&', '&amp;'),
        ('<', '&lt;'),
        ('>', '&gt;'),
        ("'", '&#39;'),
        ('"', '&quot;')],
       string)

score 7 · Accepted Answer

それがDjangoの機能です：

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

score 5 · Accepted Answer

bebrawの提案に従って、これが私が最終的に使用したものです（もちろん、別のモジュールで）：

import re

class Subs(object):
    """
    A container holding strings to be searched for and replaced in
    replace_multi().

    Holds little relation to the sandwich.
    """
    def __init__(self, needles_and_replacements):
        """
        Returns a new instance of the Subs class, given a dictionary holding 
        the keys to be searched for and the values to be used as replacements.
        """
        self.lookup = needles_and_replacements
        self.regex = re.compile('|'.join(map(re.escape,
                                             needles_and_replacements)))

def replace_multi(string, subs):
    """
    Replaces given items in string efficiently in a single-pass.

    "string" should be the string to be searched.
    "subs" can be either:
        A.) a dictionary containing as its keys the items to be
            searched for and as its values the items to be replaced.
        or B.) a pre-compiled instance of the Subs class from this module
               (which may have slightly better performance if this is
                called often).
    """
    if not isinstance(subs, Subs): # Assume dictionary if not our class.
        subs = Subs(subs)
    lookup = subs.lookup
    return subs.regex.sub(lambda match: lookup[match.group(0)], string)

使用例:

def escape(string):
    """
    Returns the given string with ampersands, quotes and angle 
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in 
    # any order.
    escape.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape.subs)

ずっといい：）。助けてくれてありがとう。

編集

気にしないで、マイク・グラハムは正しかった。私はそれをベンチマークしましたが、交換は実際にははるかに遅くなります。

コード：

from urllib2 import urlopen
import timeit

def escape1(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    return string.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace("'", '&#39;').replace('"', '&quot;')

def escape2(string):
    """
    Returns the given string with ampersands, quotes and angle
    brackets encoded.
    """
    # Note that ampersands must be escaped first; the rest can be escaped in
    # any order.
    escape2.subs = Subs({'<': '&lt;', '>': '&gt;', "'": '&#39;', '"': '&quot;'})
    return replace_multi(string.replace('&', '&amp;'), escape2.subs)

# An example test on the stackoverflow homepage.
request = urlopen('http://stackoverflow.com')
test_string = request.read()
request.close()

test1 = timeit.Timer('escape1(test_string)',
                     setup='from __main__ import escape1, test_string')
test2 = timeit.Timer('escape2(test_string)',
                     setup='from __main__ import escape2, test_string')
print 'multi-pass:', test1.timeit(2000)
print 'single-pass:', test2.timeit(2000)

出力：

multi-pass: 15.9897229671
single-pass: 66.5422530174

それだけです。

score 3 · Accepted Answer

どうやら正規表現を介してそれを実装することはかなり一般的です。この例は、ASPNとここで見つけることができます。

score 2 · Accepted Answer

よし、座って計算した。plsは私に腹を立てないでください。ΤΖΩΤΖΙΟΥの解決策について具体的に議論することに答えますが、これはコメント内に押し込むのがやや難しいので、このようにさせてください。実際、OPの質問に関連するいくつかの考慮事項も放映します。

最初に、私は ΤΖΩΤΖΙΟΥ と彼のアプローチの優雅さ、正確さ、実行可能性について話し合ってきました。提案のように見えますが、置換ペアを格納するためのレジスタとして (本質的に順序付けされていない) 辞書を使用していますが、実際には一貫して正しい結果を返します。これはitertools.starmap()、以下の 11 行目のへの呼び出しが、2 番目の引数として、単一の文字/バイトのペア (詳細は後述) のイテレータを取得するため[ ( 'h', 'h', ), ( 'e', 'e', ), ( 'l', 'l', ), ... ]です。これらの文字/バイトのペアは、最初の引数replacer.getが繰り返し呼び出されるものです。'>'最初に変換され'>'、次に不注意で再び変換される状況に遭遇する機会はありません。'&gt;'これは、各文字/バイトが置換のために 1 回だけ考慮されるためです。したがって、この部分は原則として問題なく、アルゴリズム的に正しいです。

次の質問は実行可能性であり、これにはパフォーマンスの検討が含まれます。重要なタスクが厄介なコードを使用して 0.01 秒で正しく完了し、素晴らしいコードを使用して 1 秒で完了する場合、実際には厄介なコードが望ましいと見なされる可能性があります (ただし、1 秒の損失が実際に耐えられない場合に限ります)。テストに使用したコードは次のとおりです。さまざまな実装が含まれています。それはpython 3.1で書かれているので、それ自体が素晴らしい識別子にUnicodeギリシャ文字を使用できます（zippy3kではpy2と同じを返しますitertools.izip）：

import itertools                                                                  #01
                                                                                  #02
_replacements = {                                                                 #03
  '&': '&amp;',                                                                   #04
  '<': '&lt;',                                                                    #05
  '>': '&gt;',                                                                    #06
  '"': '&quot;',                                                                  #07
  "'": '&#39;', }                                                                 #08
                                                                                  #09
def escape_ΤΖΩΤΖΙΟΥ( a_string ):                                                  #10
  return ''.join(                                                                 #11
    itertools.starmap(                                                            #12
      _replacements.get,                                                          #13
      zip( a_string, a_string ) ) )                                               #14
                                                                                  #15
def escape_SIMPLE( text ):                                                        #16
  return ''.join( _replacements.get( chr, chr ) for chr in text )                 #17
                                                                                  #18
def escape_SIMPLE_optimized( text ):                                              #19
  get = _replacements.get                                                         #20
  return ''.join( get( chr, chr ) for chr in text )                               #21
                                                                                  #22
def escape_TRADITIONAL( text ):                                                   #23
  return text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')\    #24
    .replace("'", '&#39;').replace('"', '&quot;')                                 #25

タイミング結果は次のとおりです。

escaping with SIMPLE            took 5.74664253sec for 100000 items
escaping with SIMPLE_optimized  took 5.11457801sec for 100000 items
escaping TRADITIONAL in-situ    took 0.57543013sec for 100000 items
escaping with TRADITIONAL       took 0.62347413sec for 100000 items
escaping a la ΤΖΩΤΖΙΟΥ          took 2.66592320sec for 100000 items

「伝統的な」方法が「すぐに醜くなり、アルゴリズムのパフォーマンスが低下するように見える」という元の投稿者の懸念は、このコンテキストに入れると部分的に不当に見えることが判明しました. 実際に最高のパフォーマンスを発揮します。関数呼び出しに隠しておくと、8% のパフォーマンスの低下が見られます (「メソッドの呼び出しはコストがかかります」が、一般的にはそれでも行う必要があります)。比較すると、ΤΖΩΤΖΙΟΥ の実装は、従来の方法の約 5 倍の時間がかかります。これは、python の長年にわたって磨かれた最適化された文字列メソッドと競合しなければならない複雑さがより高いことを考えると、驚くことではありません。

ここにはさらに別のアルゴリズム、SIMPLE があります。私が見る限り、これは ΤΖΩΤΖΙΟΥ のメソッドが行うこととまったく同じです: テキスト内の文字/バイトを反復処理し、それぞれのルックアップを実行し、すべての文字/バイトを結合して、結果のエスケープされたテキストを返します。これを行う方法の 1 つは、かなり長くて謎めいた定式化を伴う場合であることがわかりますが、SIMPLE 実装は実際には一目で理解できます。

ただし、ここで本当につまずくのは、SIMPLE アプローチのパフォーマンスがいかに悪いかということです。従来の方法の約 10 倍遅く、ΤΖΩΤΖΙΟΥ の方法の 2 倍も遅いです。私はここで完全に途方に暮れています。おそらく、誰かがなぜそうすべきなのかを考え出すことができます。Python の最も基本的なビルディングブロックのみを使用し、2 つの暗黙的な反復で動作するため、使い捨てリストやすべてを構築することは回避されますが、それでも遅く、その理由はわかりません。

ΤΖΩΤΖΙΟΥ のソリューションのメリットについて述べて、このコードレビューを締めくくります。私はそれを十分に明確にしましたが、コードが読みにくく、目の前のタスクには誇張されすぎています。しかし、それよりも重要なのは、彼が文字を扱う方法を見つけて、特定の小さな範囲の文字に対してバイトのような方法で動作することを確認することです。目の前のタスクでは確実に機能しますが、たとえばバイト文字列「ΤΖΩΤΖΙΟΥ」を反復処理するとすぐに、単一の文字を表す隣接するバイトを反復処理します。ほとんどの場合、これはまさに避けるべきことです。これがまさに、py3k の「文字列」が古い「Unicode オブジェクト」になり、古い「文字列」が「bytes」および「bytearray」になった理由です。2 シリーズから 3 シリーズへのコードの移行に費用がかかる可能性がある py3k の機能を 1 つ挙げるとしたら、それは py3k のこの 1 つの特性です。それ以来、すべてのエンコーディングの問題の 98% が解消されたばかりであり、私の動きを真剣に疑うような巧妙なハックはありません。上記のアルゴリズムは「概念的に8ビットクリーンでユニコードセーフ」ではありません.2010年であることを考えると、これは私にとって深刻な欠点です.

score 1 · Accepted Answer

Unicode以外の文字列でPython<3.0を使用している場合は、別のtranslate方法を試してください。

# Python < 3.0
import itertools

def escape(a_string):
    replacer= dict( (chr(c),chr(c)) for c in xrange(256))
    replacer.update(
        {'&': '&amp;',
         '<': '&lt;',
         '>': '&gt;',
         '"': '&quot;',
         "'": '&#39;'}
    )
    return ''.join(itertools.imap(replacer.__getitem__, a_string))

if __name__ == "__main__":
    print escape('''"Hello"<i> to George's friend&co.''')

$ python so2484156.py 
&quot;Hello&quot;&lt;i&gt; to George&#39;s friend&amp;co.

これは、必要に応じて、入力文字列の「シングルスキャン」に近いものです。

編集

私の意図は、unicode.translate1文字の置換に限定されない同等のものを作成することだったので、上記の答えを思いつきました。ユーザーの「フロー」から、ほぼ完全にコンテキストから外れたコメントがありましたが、正しい点が1つあります。上記のコードは、そのままでは、Unicode文字列ではなくバイト文字列で機能することを目的としています。私が強く嫌う明らかな更新（すなわちunichr（）…xrange（sys.maxunicode + 1））があるので、Pythonが保証することを前提として、ユニコード文字列とバイト文字列の両方で機能する別の関数を思いつきました。

all( (chr(i)==unichr(i) and hash(chr(i))==hash(unichr(i)))
    for i in xrange(128)) is True

新しい関数は次のとおりです。

def escape(a_string):
    replacer= {
        '&': '&amp;',
        '<': '&lt;',
        '>': '&gt;',
        '"': '&quot;',
        "'": '&#39;',
    }
    return ''.join(
        itertools.starmap(
            replacer.get, # .setdefault *might* be faster
            itertools.izip(a_string, a_string)
        )
    )

タプルのシーケンスでスターマップが使用されていることに注意してください。置換辞書にない文字については、その文字を返します。

python - str.replace(..).replace(..) ad nauseam は Python の標準イディオムですか?

9 に答える 9

編集

編集

Related

Reference