python - string.decodeカスタムエラー引数

Question

私はこのPython2.7コードを持っています：

# coding: utf-8
#
f = open('data.txt', 'r')

for line in f:
  line = line.decode(encoding='utf-8', errors='foo23')
  print len(line)

f.close()

エラーの有効な/登録されたコーデックは次のとおりであるため、Pythonがエラーを発行しないのはなぜですか。

厳しい
無視
交換
xmlcharrefreplace
backslashreplace

ドキュメントには、自分で登録できると書かれていますが、「foo23」は登録していません。Pythonコードはエラー/警告なしで実行されます。エンコーディング引数を変更するとエラーが発生しますが、エラーをカスタム文字列に変更するとすべて問題ありません。

line = line.decode(encoding='utf-9', errors='foo23')

 File "parse.py", line 7, in <module>
line = line.decode(encoding='utf-9', errors='foo23')
LookupError: unknown encoding: utf-9

score 7 · Accepted Answer

デコード中にエラーがない場合。パラメータは使用されず、errors文字列である限りその値は重要ではありません。

>>> b'\x09'.decode('utf-8', errors='abc')
u'\t'

指定されたエンコーディングを使用してバイトをデコードできない場合は、エラーハンドラが使用され、存在しないエラーハンドラを指定するとエラーが発生します。

>>> b'\xff'.decode('utf-8', errors='abc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "../lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
LookupError: unknown error handler name 'abc'

score 1 · Accepted Answer

errors重要な引数は、エラーをどのように処理するかを関数に指示することstr.decode()です。それ自体ではすべてが発生するわけではありません。2番目の例でエラーが発生する理由はencoding、関数に無効な引数を渡したためであり、他の理由はありません。

score 0 · Accepted Answer

jfsの回答で述べられているように、Pythonはデコードにエラーがないときにエラーハンドラが有効かどうかをチェックしないため、無効なエラーハンドラを提供してもエラーが発生しない可能性があります。

ただし、この動作は実装に依存することに注意してください。ご覧のとおり、CPythonでは、エラーが発生するまで、encodeanddecode関数はエラーハンドラーの存在をチェックしません。

対照的に、IronPythonでは、encodeanddecode関数はエンコード/デコードを試みる前に指定されたエラーハンドラーの存在をチェックするため、指定したサンプルコードは次のようなエラーを生成します。

Traceback (most recent call last):
  File ".\code.py", line 6, in <module>
LookupError: unknown error handler name 'foo23'

もちろん、他のPython実装は、この状況では異なる動作をする可能性があります。

実際、CPythonがデコードエラーが発生するまでエラーハンドラーの検証を待機していることと、IronPythonが発生していないことを確認したかったので、両方の実装のソースコードを確認しました。

CPython

以下は、Python2.6.2PyUnicode_DecodeUTF8Statefulのファイルにある関数のコードです。unicodeobject.cこの関数は、UTF-8でエンコードされたバイトをデコードする作業のほとんどを実行するように見えます。

PyObject *PyUnicode_DecodeUTF8Stateful(const char *s,
                                       Py_ssize_t size,
                                       const char *errors,
                                       Py_ssize_t *consumed)
{
    const char *starts = s;
    int n;
    Py_ssize_t startinpos;
    Py_ssize_t endinpos;
    Py_ssize_t outpos;
    const char *e;
    PyUnicodeObject *unicode;
    Py_UNICODE *p;
    const char *errmsg = "";
    PyObject *errorHandler = NULL;
    PyObject *exc = NULL;

    /* Note: size will always be longer than the resulting Unicode
       character count */
    unicode = _PyUnicode_New(size);
    if (!unicode)
        return NULL;
    if (size == 0) {
        if (consumed)
            *consumed = 0;
        return (PyObject *)unicode;
    }

    /* Unpack UTF-8 encoded data */
    p = unicode->str;
    e = s + size;

    while (s < e) {
        Py_UCS4 ch = (unsigned char)*s;

        if (ch < 0x80) {
            *p++ = (Py_UNICODE)ch;
            s++;
            continue;
        }

        n = utf8_code_length[ch];

        if (s + n > e) {
            if (consumed)
                break;
            else {
                errmsg = "unexpected end of data";
                startinpos = s-starts;
                endinpos = size;
                goto utf8Error;
            }
        }

        switch (n) {

        case 0:
            errmsg = "unexpected code byte";
            startinpos = s-starts;
            endinpos = startinpos+1;
            goto utf8Error;

        case 1:
            errmsg = "internal error";
            startinpos = s-starts;
            endinpos = startinpos+1;
            goto utf8Error;

        case 2:
            if ((s[1] & 0xc0) != 0x80) {
                errmsg = "invalid data";
                startinpos = s-starts;
                endinpos = startinpos+2;
                goto utf8Error;
            }
            ch = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
            if (ch < 0x80) {
                startinpos = s-starts;
                endinpos = startinpos+2;
                errmsg = "illegal encoding";
                goto utf8Error;
            }
            else
                *p++ = (Py_UNICODE)ch;
            break;

        case 3:
            if ((s[1] & 0xc0) != 0x80 ||
                (s[2] & 0xc0) != 0x80) {
                errmsg = "invalid data";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
            if (ch < 0x0800) {
                /* Note: UTF-8 encodings of surrogates are considered
                   legal UTF-8 sequences;

                   XXX For wide builds (UCS-4) we should probably try
                   to recombine the surrogates into a single code
                   unit.
                */
                errmsg = "illegal encoding";
                startinpos = s-starts;
                endinpos = startinpos+3;
                goto utf8Error;
            }
            else
                *p++ = (Py_UNICODE)ch;
            break;

        case 4:
            if ((s[1] & 0xc0) != 0x80 ||
                (s[2] & 0xc0) != 0x80 ||
                (s[3] & 0xc0) != 0x80) {
                errmsg = "invalid data";
                startinpos = s-starts;
                endinpos = startinpos+4;
                goto utf8Error;
            }
            ch = ((s[0] & 0x7) << 18) + ((s[1] & 0x3f) << 12) +
                ((s[2] & 0x3f) << 6) + (s[3] & 0x3f);
            /* validate and convert to UTF-16 */
            if ((ch < 0x10000)        /* minimum value allowed for 4
                                         byte encoding */
                || (ch > 0x10ffff))   /* maximum value allowed for
                                         UTF-16 */
            {
                errmsg = "illegal encoding";
                startinpos = s-starts;
                endinpos = startinpos+4;
                goto utf8Error;
            }
#ifdef Py_UNICODE_WIDE
            *p++ = (Py_UNICODE)ch;
#else
            /*  compute and append the two surrogates: */

            /*  translate from 10000..10FFFF to 0..FFFF */
            ch -= 0x10000;

            /*  high surrogate = top 10 bits added to D800 */
            *p++ = (Py_UNICODE)(0xD800 + (ch >> 10));

            /*  low surrogate = bottom 10 bits added to DC00 */
            *p++ = (Py_UNICODE)(0xDC00 + (ch & 0x03FF));
#endif
            break;

        default:
            /* Other sizes are only needed for UCS-4 */
            errmsg = "unsupported Unicode code range";
            startinpos = s-starts;
            endinpos = startinpos+n;
            goto utf8Error;
        }
        s += n;
        continue;

      utf8Error:
        outpos = p-PyUnicode_AS_UNICODE(unicode);
        if (unicode_decode_call_errorhandler(
                errors, &errorHandler,
                "utf8", errmsg,
                starts, size, &startinpos, &endinpos, &exc, &s,
                &unicode, &outpos, &p))
            goto onError;
    }
    if (consumed)
        *consumed = s-starts;

    /* Adjust length */
    if (_PyUnicode_Resize(&unicode, p - unicode->str) < 0)
        goto onError;

    Py_XDECREF(errorHandler);
    Py_XDECREF(exc);
    return (PyObject *)unicode;

  onError:
    Py_XDECREF(errorHandler);
    Py_XDECREF(exc);
    Py_DECREF(unicode);
    return NULL;
}

この関数が別の関数を呼び出していることがわかりますunicode_decode_call_errorhandler。これは実際にエラーハンドラーを使用する関数です。関数のコードは以下のとおりです

static
int unicode_decode_call_errorhandler(const char *errors, PyObject **errorHandler,
                                     const char *encoding, const char *reason,
                                     const char *input, Py_ssize_t insize, Py_ssize_t *startinpos,
                                     Py_ssize_t *endinpos, PyObject **exceptionObject, const char **inptr,
                                     PyUnicodeObject **output, Py_ssize_t *outpos, Py_UNICODE **outptr)
{
    static char *argparse = "O!n;decoding error handler must return (unicode, int) tuple";

    PyObject *restuple = NULL;
    PyObject *repunicode = NULL;
    Py_ssize_t outsize = PyUnicode_GET_SIZE(*output);
    Py_ssize_t requiredsize;
    Py_ssize_t newpos;
    Py_UNICODE *repptr;
    Py_ssize_t repsize;
    int res = -1;

    if (*errorHandler == NULL) {
        *errorHandler = PyCodec_LookupError(errors);
        if (*errorHandler == NULL)
            goto onError;
    }

    if (*exceptionObject == NULL) {
        *exceptionObject = PyUnicodeDecodeError_Create(
            encoding, input, insize, *startinpos, *endinpos, reason);
        if (*exceptionObject == NULL)
            goto onError;
    }
    else {
        if (PyUnicodeDecodeError_SetStart(*exceptionObject, *startinpos))
            goto onError;
        if (PyUnicodeDecodeError_SetEnd(*exceptionObject, *endinpos))
            goto onError;
        if (PyUnicodeDecodeError_SetReason(*exceptionObject, reason))
            goto onError;
    }

    restuple = PyObject_CallFunctionObjArgs(*errorHandler, *exceptionObject, NULL);
    if (restuple == NULL)
        goto onError;
    if (!PyTuple_Check(restuple)) {
        PyErr_SetString(PyExc_TypeError, &argparse[4]);
        goto onError;
    }
    if (!PyArg_ParseTuple(restuple, argparse, &PyUnicode_Type, &repunicode, &newpos))
        goto onError;
    if (newpos<0)
        newpos = insize+newpos;
    if (newpos<0 || newpos>insize) {
        PyErr_Format(PyExc_IndexError, "position %zd from error handler out of bounds", newpos);
        goto onError;
    }

    /* need more space? (at least enough for what we
       have+the replacement+the rest of the string (starting
       at the new input position), so we won't have to check space
       when there are no errors in the rest of the string) */
    repptr = PyUnicode_AS_UNICODE(repunicode);
    repsize = PyUnicode_GET_SIZE(repunicode);
    requiredsize = *outpos + repsize + insize-newpos;
    if (requiredsize > outsize) {
        if (requiredsize<2*outsize)
            requiredsize = 2*outsize;
        if (_PyUnicode_Resize(output, requiredsize) < 0)
            goto onError;
        *outptr = PyUnicode_AS_UNICODE(*output) + *outpos;
    }
    *endinpos = newpos;
    *inptr = input + newpos;
    Py_UNICODE_COPY(*outptr, repptr, repsize);
    *outptr += repsize;
    *outpos += repsize;
    /* we made it! */
    res = 0;

  onError:
    Py_XDECREF(restuple);
    return res;
}

エラーハンドラを使用してPyUnicode_DecodeUTF8Stateful呼び出すので、を呼び出します。これは、提供されたエラーハンドラを最終的に検証するものです。以下のコードを参照してください。unicode_decode_call_errorhandlerNULLunicode_decode_call_errorhandlerPyCodec_LookupError

PyObject *PyCodec_LookupError(const char *name)
{
    PyObject *handler = NULL;

    PyInterpreterState *interp = PyThreadState_GET()->interp;
    if (interp->codec_search_path == NULL && _PyCodecRegistry_Init())
    return NULL;

    if (name==NULL)
    name = "strict";
    handler = PyDict_GetItemString(interp->codec_error_registry, (char *)name);
    if (!handler)
    PyErr_Format(PyExc_LookupError, "unknown error handler name '%.400s'", name);
    else
    Py_INCREF(handler);
    return handler;
}

PyUnicode_DecodeUTF8Statefulその呼び出しのコードunicode_decode_call_errorhandlerはutf8Errorラベルの下にあることに注意してください。これは、デコード中にエラーが発生した場合にのみ到達可能です。

IronPython

IronPython 2.7.9では、デコードは以下のStringOps.DoDecode関数（in StringOps.cs）で処理されます。

internal static string DoDecode(CodeContext context, string s, string errors, string encoding, Encoding e, bool final, out int numBytes) {
    byte[] bytes = s.MakeByteArray();
    int start = GetStartingOffset(e, bytes);
    numBytes = bytes.Length - start;

#if FEATURE_ENCODING
    // CLR's encoder exceptions have a 1-1 mapping w/ Python's encoder exceptions
    // so we just clone the encoding & set the fallback to throw in strict mode.
    e = (Encoding)e.Clone();

    switch (errors) {
        case "backslashreplace":
        case "xmlcharrefreplace":
        case "strict": e.DecoderFallback = final ? DecoderFallback.ExceptionFallback : new ExceptionFallBack(numBytes, e is UTF8Encoding); break;
        case "replace": e.DecoderFallback = ReplacementFallback; break;
        case "ignore": e.DecoderFallback = new PythonDecoderFallback(encoding, s, null); break;
        default:
            e.DecoderFallback = new PythonDecoderFallback(encoding, s, LightExceptions.CheckAndThrow(PythonOps.LookupEncodingError(context, errors)));
            break;
    }
#endif

    string decoded = e.GetString(bytes, start, numBytes);

#if FEATURE_ENCODING
    if (e.DecoderFallback is ExceptionFallBack fallback) {
        byte[] badBytes = fallback.buffer.badBytes;
        if (badBytes != null) {
            numBytes -= badBytes.Length;
        }
    }
#endif

    return decoded;
}

ここで、DoDecode関数はswitchデコードする前にステートメントにエラーハンドラーを作成しています。エラーハンドラーの名前を含む文字列（errors）が認識された組み込みハンドラーの1つでない場合は、関数を介して登録済みエラーハンドラーのディクショナリから取得されたPython関数オブジェクトを使用してオブジェクトをDoDecode作成します（以下を参照）。PythonDecoderFallbackPythonOps.LookupEncodingError

[LightThrowing]
internal static object LookupEncodingError(CodeContext/*!*/ context, string name) {
    Dictionary<string, object> errorHandlers = context.LanguageContext.ErrorHandlers;
    lock (errorHandlers) {
        if (errorHandlers.ContainsKey(name))
            return errorHandlers[name];
        else
            return LightExceptions.Throw(PythonOps.LookupError("unknown error handler name '{0}'", name));
    }
}

ディクショナリで指定されLookupEncodingErrorたエラーハンドラが見つからない場合、LookupErrorを「スロー」します。つまり、オブジェクトを作成して返します。次に、このオブジェクトは関数によってチェックされ、IronPythonで無効なエラーハンドラーを使用して呼び出したときに表示される「不明なエラーハンドラー名」エラーが最終的に生成されます。nameerrorHandlersLightExceptionLightExceptionLightExceptions.CheckAndThrowdecode

繰り返しますが、これはすべてオブジェクトのメソッドが呼び出されるDoDecode前に発生するため、IronPythonは、デコードエラーがあるかどうかに関係なく、無効なエラーハンドラーでエラーを生成します。EncodingGetString

python - string.decodeカスタムエラー引数

3 に答える 3

CPython

IronPython

Related

Reference