python-3.x - アクセント付き文字からアスキー文字への規則

Question

ascii に関連付けられたすべてのアクセント付き文字の UTF-8 コードを見つけるのに役立つルールはありますか? たとえば、文字の UTF-8 コードからすべてのアクセント付き文字é, è,... をすべての UTF-8 コードにできeますか?

上記の Ramchandra Apte によるソリューションを使用した Python 3 のショーケースを次に示します。

import unicodedata

def accented_letters(letter):
    accented_chars = []

    for accent_type in "acute", "double acute", "grave", "double grave":
        try:
            accented_chars.append(
                unicodedata.lookup(
                    "Latin small letter {letter} with {accent_type}" \
                    .format(**vars())
                )
            )

        except KeyError:
            pass

    return accented_chars

print(accented_letters("e"))


for kind in ["NFC", "NFKC", "NFD", "NFKD"]:
    print(
        '---',
        kind,
        list(unicodedata.normalize(kind,"é")),
        sep = "\n"
    )

for oneChar in "βεέ.¡¿?ê":
    print(
        '---',
        oneChar,
        unicodedata.name(oneChar),

Unicode で字形的に似ている文字を見つけますか?

        unicodedata.normalize('NFD', oneChar).encode('ascii','ignore'),
        sep = "\n"
    )

対応する出力。

['é', 'è', 'ȅ']
---
NFC
['é']
---
NFKC
['é']
---
NFD
['e', '́']
---
NFKD
['e', '́']
---
β
GREEK SMALL LETTER BETA
b''
---
ε
GREEK SMALL LETTER EPSILON
b''
---
έ
GREEK SMALL LETTER EPSILON WITH TONOS
b''
---
.
FULL STOP
b'.'
---
¡
INVERTED EXCLAMATION MARK
b''
---
¿
INVERTED QUESTION MARK
b''
---
?
QUESTION MARK
b'?'
---
ê
LATIN SMALL LETTER E WITH CIRCUMFLEX
b'e'

UTF-8 に関する技術情報 (cjc343 による参照)

https://www.rfc-editor.org/rfc/rfc3629

score 1 · Accepted Answer

多くの場合、多くの言語で異なる文字であると想定されています。ただし、これが本当に必要な場合は、文字列を正規化する関数を見つける必要があります。したがって、文字列内の 2 つの Unicode コードポイントになる分解された文字を取得するために正規化する必要があります。

score 0 · Accepted Answer

unicodedata.lookup の使用:

import unicodedata

def accented_letters(letter):
    accented_chars = []
    for accent_type in "acute", "double acute", "grave", "double grave":
        try:
            accented_chars.append(unicodedata.lookup("Latin small letter {letter} with {accent_type}".format(**vars())))
        except KeyError:
            pass
    return accented_chars

print(accented_letters("e"))

逆に、NFD 形式でunicodedata.normalizeを使用して最初の文字を取得できます。これは、2 番目の文字が結合形式のアクセントであるためです。

print(unicodedata.normalize("NFD","è")[0]) # Prints "e".

python-3.x - アクセント付き文字からアスキー文字への規則

上記の Ramchandra Apte によるソリューションを使用した Python 3 のショーケースを次に示します。

Unicode で字形的に似ている文字を見つけますか?

UTF-8 に関する技術情報 (cjc343 による参照)

2 に答える 2

Related

Reference