python - Splitting an ascii/unicode string

Question

I am trying to decode the ID3v2 (MP3 header) protocol using Python. The format of the data to decode is as follows.

s1, s2, ..., sn-1 are Unicode (utf-16/utf-8) strings, and the last string 'sn' is either a Unicode string or a binary string.

data = s1+delimiters+s2+delimiters+...+sn

Here, the delimiter is '\x00'+'\x00' for utf-16 and '\x00' for utf-8.

I receive `data` together with its Unicode encoding type. I now need to extract all the strings (s1, s2, ..., sn) from `data`. For this, I am using split() as follows.

#!/usr/bin/python

def extractStrings(encoding_type, data):
    if(encoding_type == "utf-8"): delimitors = '\x00'
    else: delimitors = '\x00'+'\x00'
    return data.split(delimitors)

def main():        
    # Set-1
    encoding_type = "utf-8"
    delimitors = '\x00'
    s1="Hello".encode(encoding_type)
    s2="world".encode(encoding_type)
    data = s1+delimitors+s2
    print extractStrings(encoding_type, data)

    # Set-2
    encoding_type = "utf-16"
    delimitors = '\x00'+'\x00'
    s1="Hello".encode(encoding_type)
    s2="world".encode(encoding_type)
    data = s1+delimitors+s2
    print extractStrings(encoding_type, data)

if __name__ == "__main__":
    main()

Output:

['Hello', 'world']

['\xff\xfeH\x00e\x00l\x00l\x00o', '\x00\xff\xfew\x00o\x00r\x00l\x00d\x00']

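This works for the set-1 data but not for set-2, because the 'data' of set-2

'\xff\xfeH\x00e\x00l\x00l\x00o\x00\x00\x00\xff\xfew\x00o\x00r\x00l\x00d\x00'
                             ^               ^

has an extra '\x00' in front of the delimiter, which belongs to the letter 'o', so split() cannot do a proper job.

Can anyone help me decode 'data' properly in both cases?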

Update:

I will try to simplify the problem.

s1 = encoded (utf-8/utf-16) string

s2 = binary string (not Unicode)

The delimiter is '\x00'+'\x00' for utf-16 and '\x00' for utf-8.

data = (s1+delimiter)+s2

Can anyone help me extract s1 and s2 from 'data'?

Update2: Solution

The following code works for my requirement.

def splitNullTerminatedEncStrings(self, data, encoding_type, no_of_splits):
    # Decode first so that the terminator is always the single character U+0000.
    data_dec = data.decode(encoding_type, 'ignore')
    chunks = data_dec.split('\x00', no_of_splits)
    enc_str_lst = []
    for data_dec_seg in chunks[:-1]:
        enc_str_lst.append(data_dec_seg.encode(encoding_type))
    # Re-encode the terminated strings to measure how many bytes they occupy.
    data_dec_chunks = '\x00'.join(chunks[:-1])
    if(data_dec_chunks): data_dec_chunks += '\x00'
    data_chunks = data_dec_chunks.encode(encoding_type)
    data_chunks_len = len(data_chunks)
    enc_str_lst.append(data[data_chunks_len:]) # last segment
    return enc_str_lst

score 4 · Accepted Answer

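Here, the delimiter is '\x00'+'\x00' for utf-16 and '\x00' for utf-8.

Not exactly. The delimiter for UTF-16 is \0\0 on a code unit boundary only. One \0 at the end of a code unit followed by another \0 at the start of the next code unit does not constitute a delimiter. The ID3 standard, in talking about 'synchronisation' of bytes, implies that this is not the case, but it is wrong.

[Aside: unfortunately, many tag-reading tools do interpret it that literally, with the result that any sequence containing two zero bytes (such as Āa in UTF-16BE, U+0100, U+0061, or, as you have discovered, ASCII at the end of a string in UTF-16LE) will break the frame. The consequence is that the UTF-16 text formats (0x01 UTF-16+BOM and 0x02 UTF-16BE) are completely unreliable and should be avoided by all tag writers. Text format 0x00 is also unreliable for anything other than pure ASCII. UTF-8 is the winner!]

If you have a list-of-encoded-terminated-strings structure such as that specified for T frames (other than TXXX), the simple approach is to decode them before splitting on the U+0000 terminator: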

def extractStrings(encoding_type, data):
    chars = data.decode(encoding_type)
    # chars is now a Unicode string, delimiter is always character U+0000
    return chars.split(u'\0')
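As a minimal sketch of the same decode-then-split idea for Python 3, where `data` is a `bytes` object: the name `extract_strings` and the sample inputs are illustrative assumptions, and `utf-16-le` is used so that no byte order mark ends up in the middle of the sample data.

```python
def extract_strings(encoding_type, data):
    # After decoding, the terminator is always the single character U+0000,
    # no matter how many bytes it occupied in the encoded form.
    chars = data.decode(encoding_type)
    return chars.split('\0')

# Assumed sample data, built the same way as in the question.
utf8_data = "Hello".encode("utf-8") + b"\x00" + "world".encode("utf-8")
utf16_data = "Hello".encode("utf-16-le") + b"\x00\x00" + "world".encode("utf-16-le")

print(extract_strings("utf-8", utf8_data))       # ['Hello', 'world']
print(extract_strings("utf-16-le", utf16_data))  # ['Hello', 'world']
```

This only works when every segment is text in the same encoding; a trailing binary segment would be mangled by the decode step.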

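If data is a whole ID3 frame, I'm afraid you cannot process it with a single split(). Frames other than the T family contain a mixture of encoded terminated strings, ASCII-only terminated strings, binary objects (which have no terminator), and integer byte/word values. APIC is one such frame, but in the general case you have to know the structure of every frame you want to parse in advance, and consume each field one at a time, finding each terminator manually as you go.

To find the code-unit-aligned terminator in UTF-16-encoded data without misinterpreting Āa, you can use a regular expression like this: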

import re

ix= re.match('((?!\0\0)..)*', data, re.DOTALL).end()
s, remainder= data[:ix], data[ix+2:]
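For Python 3, where the frame data is `bytes`, the pattern has to be a bytes pattern as well. The following is a hedged sketch of the aligned search above; the helper name `split_utf16_terminated` and the sample data are assumptions made for illustration.

```python
import re

# Consume two bytes at a time as long as the next pair is not 00 00,
# so the match can only stop on a code unit boundary.
ALIGNED = re.compile(rb'(?:(?!\x00\x00)..)*', re.DOTALL)

def split_utf16_terminated(data):
    ix = ALIGNED.match(data).end()
    return data[:ix], data[ix + 2:]

# U+0100 U+0061 in UTF-16BE is 01 00 00 61, which contains a misaligned 00 00.
data = "\u0100a".encode("utf-16-be") + b"\x00\x00" + b"\x89PNG"

print(data.split(b"\x00\x00", 1))    # [b'\x01', b'a\x00\x00\x89PNG']  (wrong)
print(split_utf16_terminated(data))  # (b'\x01\x00\x00a', b'\x89PNG')
```

If no aligned terminator exists, the match runs to the end of the data and the remainder is empty.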

This is not really a lot of fun. ID3v2 is not a very clean format. Off the top of my head and untested, this kind of thing is how I might approach it:

p= FrameParser(data)
if frametype=='APIC':
    encoding= p.encoding()
    mimetype= p.string()
    pictype= p.number(1)
    desc= p.encodedstring()
    img= p.binary()

import re # used by _string() below

class FrameParser(object):
    def __init__(self, data):
        self._data= data
        self._ix= 0
        self._encoding= 0

    def encoding(self): # encoding byte - remember for later call to unicode()
        self._encoding= self.number(1)
        if not 0<=self._encoding<4:
            raise ValueError('Unknown ID3 text encoding %r' % self._encoding)
        return self._encoding

    def number(self, nbytes= 1): # big-endian integer of nbytes bytes
        n= 0
        for i in range(nbytes):
            n*= 256
            n+= ord(self._data[self._ix])
            self._ix+= 1
        return n

    def binary(self): # the whole of the rest of the data, uninterpreted
        s= self._data[self._ix:]
        self._ix= len(self._data)
        return s

    def string(self): # non-encoded, maybe-terminated string
        return self._string(0)

    def encodedstring(self): # encoded, maybe-terminated string
        return self._string(self._encoding)

    def _string(self, encoding):
        if encoding in (1, 2): # UTF-16 - look for double zero byte on code unit boundary
            ix= re.match('((?!\0\0)..)*', self._data[self._ix:], re.DOTALL).end()
            s= self._data[self._ix:self._ix+ix]
            self._ix= min(self._ix+ix+2, len(self._data)) # skip the two-byte terminator, if present
        else: # single-byte encoding - look for first zero byte
            ix= self._data.find('\0', self._ix) # absolute index, or -1 if unterminated
            s= self._data[self._ix:ix] if ix!=-1 else self._data[self._ix:]
            self._ix= ix+1 if ix!=-1 else len(self._data) # skip the one-byte terminator
        return s.decode(['windows-1252', 'utf-16', 'utf-16be', 'utf-8'][encoding])
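The field-by-field approach above can also be sketched for Python 3 `bytes`. This is an illustration under assumptions, not a complete ID3 parser: `take_string` and `parse_apic` are hypothetical helper names, the codec table is copied from the answer's code, and the sample frame body is made up.

```python
import re

# Codec names indexed by the ID3v2 text-encoding byte (table taken from the code above).
CODECS = ['windows-1252', 'utf-16', 'utf-16-be', 'utf-8']
ALIGNED = re.compile(rb'(?:(?!\x00\x00)..)*', re.DOTALL)

def take_string(data, pos, encoding):
    """Decode one maybe-terminated string at pos; return (text, next position)."""
    if encoding in (1, 2):  # UTF-16: terminator is 00 00 on a code unit boundary
        end = ALIGNED.match(data, pos).end()
        width = 2
    else:                   # byte-oriented encodings: first zero byte
        end = data.find(b'\x00', pos)
        if end == -1:
            end = len(data)
        width = 1
    text = data[pos:end].decode(CODECS[encoding])
    return text, min(end + width, len(data))

def parse_apic(body):
    """Split an APIC frame body into (mimetype, pictype, description, image)."""
    encoding = body[0]
    if not 0 <= encoding < 4:
        raise ValueError('Unknown ID3 text encoding %r' % encoding)
    mimetype, pos = take_string(body, 1, 0)           # non-encoded string
    pictype = body[pos]                               # one-byte picture type
    desc, pos = take_string(body, pos + 1, encoding)  # encoded string
    return mimetype, pictype, desc, body[pos:]        # the rest is binary

# Made-up frame body: UTF-16 with BOM, picture type 3, description "Cover".
body = (b'\x01' + b'image/png\x00' + b'\x03'
        + b'\xff\xfe' + 'Cover'.encode('utf-16-le') + b'\x00\x00'
        + b'\x89PNG\x00\x00data')

print(parse_apic(body))  # ('image/png', 3, 'Cover', b'\x89PNG\x00\x00data')
```

The binary image is left untouched even though it contains zero bytes, because nothing is searched for after the last terminated string has been consumed.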

score 3 · Accepted Answer

Why not decode the string first?

Python 2:

decoded = unicode(data, 'utf-8')
# or
decoded = unicode(data, 'utf-16')

Python 3:

decoded = str(data, 'utf-8')
# or
decoded = str(data, 'utf-16')

You can then work directly with the encoding-independent data, where the delimiter is always a single null.
