python - string identity comparison in CPython

Question

I recently discovered a potential bug in a production system where two strings were compared using the identity operator, e.g.:

if val[2] is not 's':

I imagine this will often work anyway, because as far as I know CPython stores short immutable strings in the same location. I've replaced it with !=, but I need to confirm that the data that previously went through this code is correct, so I'd like to know whether this always worked or only sometimes worked.

The Python version has always been 2.6.6 as far as I know and the above code seems to be the only place where the is operator was used.

Does anyone know if this line will always work as the programmer intended?

edit: Because this is no doubt very specific and unhelpful to future readers, I'll ask a different question:

Where should I look to confirm with absolute certainty the behaviour of the Python implementation? Are the optimisations in CPython's source code easy to digest? Any tips?

score 3 · Accepted Answer

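You certainly shouldn't use the is / is not operators when you just want to compare the values of two objects rather than check whether they are the same object.

It makes sense that Python wouldn't create a new string object with the same content as an existing one (strings are immutable, after all), which would make equality and identity equivalent. Still, I wouldn't depend on that, especially with the large number of Python implementations out there.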
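To make the distinction concrete, here is a minimal sketch. It assumes Python 3 rather than the 2.6 interpreter from the question, and the helper name build is made up for illustration. It constructs two strings with the same content at run time and compares them both ways. Equality is guaranteed by the language; whether both names refer to one object is an implementation detail.

```python
def build(parts):
    # Hypothetical helper: join at run time so the result is not a
    # compile-time constant that the compiler could share.
    return "".join(parts)

a = build(["sp", "am"])
b = build(["sp", "am"])

print("a == b -->", a == b)   # value comparison: always True here
print("a is b -->", a is b)   # identity: depends on the implementation
```

Only the first result is something code should rely on.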

score 3 · Accepted Answer

You can look at the CPython code for 2.6.x: http://svn.python.org/projects/python/branches/release26-maint/Objects/stringobject.c

One-character strings are treated specially, and each of them exists only once, so your code is safe. Here is some of the key code (excerpt):

static PyStringObject *characters[UCHAR_MAX + 1];

PyObject *
PyString_FromStringAndSize(const char *str, Py_ssize_t size)
{
    register PyStringObject *op;
    if (size == 1 && str != NULL &&
        (op = characters[*str & UCHAR_MAX]) != NULL)
    {
        Py_INCREF(op);
        return (PyObject *)op;
    }

...

score 3 · Accepted Answer

As people have already pointed out, this should always hold for strings created in Python (or in CPython, at any rate), but not necessarily if you are using a C extension.

As a quick counter-example:

import numpy as np

x = 's'
y = np.array(['s'], dtype='|S1')

print x
print y[0]

print 'x is y[0] -->', x is y[0]
print 'x == y[0] -->', x == y[0]

This yields:

s
s
x is y[0] --> False
x == y[0] --> True

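Of course, if nothing ever goes through a C extension, you're probably safe... I wouldn't count on it, though...

Edit: As an even simpler example, it also doesn't hold if the values have been pickled or packed with struct at some point.

For example: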

import pickle
x = 's'
pickle.dump(x, file('test', 'w'))
y = pickle.load(file('test', 'r'))

print x is y
print x == y

Also (using a different letter for clarity, since "s" is needed for the format string):

import struct
x = 'a'
y = struct.pack('s', x)

print x is y
print x == y

score 2 · Accepted Answer

This behaviour always applies to empty and single-character latin-1 strings. From unicodeobject.c:

PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u,
                                Py_ssize_t size)
{
.....
    /* Single character Unicode objects in the Latin-1 range are
       shared when using this constructor */
    if (size == 1 && *u < 256) {
        unicode = unicode_latin1[*u];

This snippet is from Python 3, but a similar optimization probably exists in earlier versions as well.
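A rough way to probe this from Python itself, assuming a Python 3 interpreter (the function name shared_single_chars is hypothetical). It builds each one-character string twice and records whether both results are the same object. This only observes the interpreter it runs on; it proves nothing about other versions or implementations.

```python
def shared_single_chars(code_points):
    """Map each code point to whether two separately built
    one-character strings turned out to be the same object."""
    result = {}
    for cp in code_points:
        first = chr(cp)
        second = chr(cp)
        result[cp] = first is second
    return result

report = shared_single_chars(range(0, 512))
latin1_shared = sum(1 for cp, same in report.items() if cp < 256 and same)
other_shared = sum(1 for cp, same in report.items() if cp >= 256 and same)
print("shared in Latin-1 range:", latin1_shared, "of 256")
print("shared outside Latin-1: ", other_shared, "of 256")
```

Reading the source remains the way to learn what is intended; a probe like this just shows what one build happens to do.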

score 0 · Accepted Answer

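Granted, it works because of automatic interning of short strings (the same as for constants in Python source, such as the literal 's'), but using identity here is pretty silly.

Python is about duck typing, and any object that looks like a string could be used. For example, the same code will fail if val[2] is actually u"s".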
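A small sketch of that failure mode, assuming Python 3, where u"s" and "s" are the same type, so a str subclass stands in for the string-like object. The class Flag, the constant EXPECTED and both function names are invented for illustration.

```python
class Flag(str):
    """Hypothetical string-like type; instances compare equal to plain str."""

EXPECTED = "s"

def differs_by_identity(value):
    # Mirrors the original check: true whenever value is not that exact object.
    return value is not EXPECTED

def differs_by_equality(value):
    # The corrected check: compares contents.
    return value != EXPECTED

flag = Flag("s")
print(differs_by_identity(flag))  # True: a Flag instance is never the plain str object
print(differs_by_equality(flag))  # False: the contents match
```

The identity check reports a difference even though the contents are the same, which is the opposite of what the programmer intended.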
