python - Cython: Converting a Unicode string to a wchar array

Question

I am using Cython to interface with an external C API that accepts Unicode strings in UCS2 format (arrays of wchar). (I understand the limitations of UCS2 compared with UTF-16, but this is a third-party API.)

Cython version: 0.15.1
Python version: 2.6 (narrow Unicode build)
OS: FreeBSD

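The Cython user guide explains in detail how to convert Unicode to byte strings, but I could not work out how to convert to a 16-bit array. I realize that I first need to encode to UTF-16 (for now I am assuming that no code points beyond the BMP will occur). What should I do next? Please help.

Thanks in advance.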

score 2 · Accepted Answer

This is entirely possible in Python 3, and a solution looks like this:

# cython: language_level=3

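from cpython.mem cimport PyMem_Free
from libc.stddef cimport wchar_t

cdef extern from "Python.h":
    # Returns a new NUL-terminated buffer allocated with PyMem_Malloc, or NULL on error.
    wchar_t* PyUnicode_AsWideCharString(object, Py_ssize_t *) except NULL

cdef extern from "wchar.h":
    int wprintf(const wchar_t *, ...)

my_string = u"Foobar\n"
cdef Py_ssize_t length  # receives the character count, excluding the terminator
cdef wchar_t *my_wchars = PyUnicode_AsWideCharString(my_string, &length)

wprintf(my_wchars)
print("Length:", <long>length)
print("Null End:", my_wchars[7] == 0)  # index 7 == length, i.e. the terminator

# The caller owns the buffer and must release it with PyMem_Free.
PyMem_Free(my_wchars)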
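Note that the snippet above produces wchar_t units, whose width is platform dependent (commonly 32 bits on Unix-like systems). If the third-party API really expects 16-bit units, one alternative is to encode to UTF-16 explicitly and split the bytes into 16-bit integers. The following is only a minimal sketch in plain Python 3, not part of the original answer; the helper name `to_ucs2_units` is hypothetical, and little-endian byte order is an assumption that should be checked against what the API expects.

```python
import struct


def to_ucs2_units(text):
    """Return the 16-bit code units of `text` as a list of ints, plus a trailing 0.

    Hypothetical helper for illustration. Assumes the target API wants
    little-endian UCS-2 with a NUL terminator.
    """
    # UCS-2 cannot represent code points outside the BMP, so reject them early.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("code point outside the BMP cannot be stored in UCS-2")
    # Naming the byte order explicitly means no BOM is emitted.
    raw = text.encode("utf-16-le")
    count = len(raw) // 2
    units = struct.unpack("<%dH" % count, raw)
    return list(units) + [0]


units = to_ucs2_units(u"Foobar\n")
print(units)
```

In Cython, the resulting values could then be copied into an array of an unsigned 16-bit type (for example `uint16_t` from `libc.stdint`) before being passed to the C function.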

A less pleasant Python 2 approach follows. It may be relying on undefined or broken behavior, so I would not trust it readily.

# cython: language_level=2

from cpython.ref cimport PyObject
from libc.stddef cimport wchar_t
from libc.stdio  cimport fflush, stdout
from libc.stdlib cimport malloc, free

cdef extern from "Python.h":
    ctypedef PyObject PyUnicodeObject
    Py_ssize_t PyUnicode_AsWideChar(PyUnicodeObject *o, wchar_t *w, Py_ssize_t size)

my_string = u"Foobar\n"
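# Cheating: the "UTF-16" codec prepends a BOM, so for BMP-only text this is
# len(my_string) + 1, which leaves room for the terminating NUL.
cdef Py_ssize_t length = len(my_string.encode("UTF-16")) // 2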
cdef wchar_t *my_wchars = <wchar_t *>malloc(length * sizeof(wchar_t))
cdef Py_ssize_t number_written = PyUnicode_AsWideChar(<PyUnicodeObject *>my_string, my_wchars, length)

# wprintf breaks things for some reason
print [my_wchars[i] for i in range(length)]
print "Length:", <long>length
print "Number Written:", <long>number_written
print "Null End:", my_wchars[7] == 0

free(my_wchars)
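The "cheating" length calculation in the Python 2 code appears to work because of a side effect: Python's "UTF-16" codec prepends a byte order mark, which adds one extra 16-bit unit to the count. The plain-Python sketch below (Python 3 syntax, added for illustration and not part of the original answer) shows the arithmetic for BMP-only text.

```python
text = u"Foobar\n"

# The generic "UTF-16" codec writes a 2-byte BOM before the data.
with_bom = text.encode("UTF-16")
# Naming the byte order explicitly suppresses the BOM.
without_bom = text.encode("UTF-16-LE")

units_with_bom = len(with_bom) // 2        # characters + 1 (the BOM)
units_without_bom = len(without_bom) // 2  # characters only

print(units_with_bom, units_without_bom, len(text))
```

For text limited to the BMP, writing `len(my_string) + 1` would state the same intent more directly than depending on the BOM.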
