python - File contains Unicode strings in a different encoding

Question

My system is Fedora. For some reason, the last field of each record is a Unicode string (the data is copied out of a guest machine in qemu with memcpy). The Unicode strings are Windows registry key names.

smss.exe|NtOpenKey|304|4|4|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@C^@o^@n^@t^@r^@o^@l^@\^@S^@e^@s^@s^@i^@o^@n^@ ^@M^@a^@n^@a^@g^@e^@r^@
smss.exe|NtClose|304|4|4|0|
System|NtOpenKey|4|0|2147484532|0|\^@R^@e^@g^@i^@s^@t^@r^@y^@\^@M^@a^@c^@h^@i^@n^@e^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@
services.exe|NtOpenKey|680|624|636|0|\^@R^@E^@G^@I^@S^@T^@R^@Y^@\^@M^@A^@C^@H^@I^@N^@E^@\^@S^@y^@s^@t^@e^@m^@\^@C^@u^@r^@r^@e^@n^@t^@C^@o^@n^@t^@r^@o^@l^@S^@e^@t^@\^@S^@e^@r^@v^@i^@c^@e^@s^@

Here is part of the hex dump. '|' is used as the field separator. The first six fields are ASCII strings. The last field is a Windows Unicode string (I think it is UTF-16).

0000000 6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
0000010 7965 337c 3430 347c 347c 307c 5c7c 5200
0000020 6500 6700 6900 7300 7400 7200 7900 5c00
0000030 4d00 6100 6300 6800 6900 6e00 6500 5c00
0000040 5300 7900 7300 7400 6500 6d00 5c00 4300
0000050 7500 7200 7200 6500 6e00 7400 4300 6f00
0000060 6e00 7400 7200 6f00 6c00 5300 6500 7400
0000070 5c00 4300 6f00 6e00 7400 7200 6f00 6c00
0000080 5c00 5300 6500 7300 7300 6900 6f00 6e00
0000090 2000 4d00 6100 6e00 6100 6700 6500 7200

I use Python to parse the file and insert the records into a database. This is how I handle it:

def parsecreate(filename):
    sourcefile = codecs.open("data.db",mode="r",encoding='utf-8')
    cx = sqlite3.connect("sqlite.db")
    cu = cx.cursor()
    cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
    eachline = []
    for lines in sourcefile:
        eachline = lines.split('|')
        eachline[-1] = eachline[-1].strip('\n')
        eachline[-1] = eachline[-1].decode('utf-8')

        cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]) )

    cx.commit()
    cx.close()

I get this error:

  File "./parse1.py", line 18, in parsecreate
    for lines in sourcefile:
  File "/usr/lib/python2.7/codecs.py", line 684, in next
    return self.reader.next()
  File "/usr/lib/python2.7/codecs.py", line 615, in next
    line = self.readline()
  File "/usr/lib/python2.7/codecs.py", line 530, in readline
    data = self.read(readsize, firstline=True)
  File "/usr/lib/python2.7/codecs.py", line 477, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 51: invalid continuation byte

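This is probably because the Unicode string contains bytes that are not valid UTF-8. How can I read the last field correctly?

In short: I do not have a UTF-16 encoded file, I have a file with UTF-16 strings embedded in it. How do I insert that field into the database correctly? Python reads a file using a single encoding. Can I read the raw bytes instead, and then combine those bytes into a Unicode string?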

score 3 · Accepted Answer

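Your data file is not a text-only file, so open it as binary and decode the text fields explicitly. I had to manipulate your data a fair bit to get back what I think is the original binary data. The original data was probably a dump similar to the final sqlite3.exe output below, except that the last field was stored as a UTF-16-encoded BLOB instead of TEXT.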
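As a rough illustration of "open as binary, decode explicitly", here is a minimal Python 3 sketch (the thread itself uses Python 2.7). The helper name parse_record is hypothetical, and it assumes every record has exactly six ASCII fields followed by one UTF-16LE field, as in the sample data.

```python
def parse_record(raw):
    """Split one raw record into six ASCII fields plus a UTF-16LE key name.

    Hypothetical helper: assumes the layout seen in the sample data.
    """
    # maxsplit=6 stops after the six ASCII fields, so the last part is
    # the untouched UTF-16LE payload.
    parts = raw.split(b"|", 6)
    if len(parts) != 7:
        raise ValueError("expected 7 fields, got %d" % len(parts))
    head = [p.decode("ascii") for p in parts[:6]]
    key_name = parts[6].decode("utf-16-le")
    return head + [key_name]


# Build a sample record the same way the file stores it:
# ASCII fields, then the key name as UTF-16LE bytes.
record = b"smss.exe|NtOpenKey|304|4|4|0|" + "\\Registry\\Machine".encode("utf-16-le")
fields = parse_record(record)
print(fields)
```

The returned list can be passed to a parameterized sqlite3 insert as-is, since every element is already a text string.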

Note that parsing line by line and splitting on '|' could cause problems if the UTF-16 data contains bytes that represent '\n' or '|', but I am ignoring that here.
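To make that caveat concrete, here is a small Python 3 sketch. The record contents are made up for illustration. It shows a UTF-16LE string that happens to contain a 0x7c byte, and how limiting the split to the six known ASCII fields sidesteps the '|' half of the problem. It does not fix the newline half: a character such as U+010A also contains a 0x0a byte, so line-based reading can still cut a record in the wrong place.

```python
# U+017C encodes to the bytes 7c 01 in UTF-16LE, so the encoded
# string contains a byte equal to ASCII '|'.
tricky = "a\u017cb".encode("utf-16-le")
has_pipe_byte = b"|" in tricky

# U+010A encodes to 0a 01, so it contains a byte equal to '\n'.
has_newline_byte = b"\n" in "\u010a".encode("utf-16-le")

# Made-up record: six ASCII fields, then the tricky UTF-16LE field.
record = b"proc|NtOpenKey|1|2|3|0|" + tricky

naive = record.split(b"|")       # also splits inside the UTF-16 field
bounded = record.split(b"|", 6)  # stops after the six ASCII fields

print(has_pipe_byte, has_newline_byte)
print(len(naive), len(bounded))
print(bounded[6].decode("utf-16-le"))
```

If such characters can really occur in the key names, a framing that does not depend on delimiter bytes (for example a length prefix written by the producer) would be more robust, but that is a change to the data format rather than to the parser.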

Here is my test:

from binascii import unhexlify
import sqlite3

data = unhexlify('''\
6d73 7373 652e 6578 4e7c 4f74 6570 4b6e
7965 337c 3430 347c 347c 307c 5c7c 5200
6500 6700 6900 7300 7400 7200 7900 5c00
4d00 6100 6300 6800 6900 6e00 6500 5c00
5300 7900 7300 7400 6500 6d00 5c00 4300
7500 7200 7200 6500 6e00 7400 4300 6f00
6e00 7400 7200 6f00 6c00 5300 6500 7400
5c00 4300 6f00 6e00 7400 7200 6f00 6c00
5c00 5300 6500 7300 7300 6900 6f00 6e00
2000 4d00 6100 6e00 6100 6700 6500 7200'''.replace(' ','').replace('\n',''))

# OP's data dump must have been decoded from the original data
# as little-endian words, and is missing a final 0x00 byte.
# Byte-swapping and adding missing zero byte to get back what
# was likely the original binary data.
data = ''.join(a+b for a,b in zip(data[1::2],data[::2])) + '\x00'

with open('data.db','wb') as f:
    f.write(data)

def parsecreate(filename):
    with open(filename,'rb') as sourcefile:
        with sqlite3.connect("sqlite.db") as cx:
            cu = cx.cursor()
            cu.execute("create table data(id integer primary key,command text, ntfunc text, pid text, ppid text, handle text, roothandle text, genevalue text)")
            eachline = []
            for line in sourcefile:
                eachline = line.split('|')
                eachline[-1] = eachline[-1].decode('utf-16le')
                cu.execute("insert into data(command,ntfunc,pid,ppid,handle,roothandle,genevalue) values(?,?,?,?,?,?,?)",(eachline[0],eachline[1],eachline[2],eachline[3],eachline[4],eachline[5],eachline[-1]) )

parsecreate('data.db')

Output:

C:\>sqlite3 sqlite.db
SQLite version 3.7.9 2011-11-01 00:52:41
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select * from data;
1|smss.exe|NtOpenKey|304|4|4|0|\Registry\Machine\System\CurrentControlSet\Control\Session Manager
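The byte-swapping step in the test relies on Python 2 string semantics. For readers on Python 3, the same idea could be sketched as follows. The helper name words_to_bytes is hypothetical, and the dump string is only the first 18 words of the dump from the question, so the decoded key name is cut short.

```python
from binascii import unhexlify


def words_to_bytes(hex_words):
    """Turn hexdump-style 16-bit little-endian words back into raw bytes."""
    out = bytearray()
    for word in hex_words.split():
        pair = unhexlify(word)  # "6d73" -> b"\x6d\x73"
        out += pair[::-1]       # file order is low byte first: b"\x73\x6d"
    return bytes(out)


# First 18 words of the dump in the question (shortened for the example).
dump = ("6d73 7373 652e 6578 4e7c 4f74 6570 4b6e "
        "7965 337c 3430 347c 347c 307c 5c7c 5200 6500 6700")

# The dump starts the UTF-16 field on an odd offset, so the final
# zero byte of the last character is missing and is added back here.
raw = words_to_bytes(dump) + b"\x00"

key_name = raw.split(b"|", 6)[6].decode("utf-16-le")
print(raw[:29])
print(key_name)
```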
