python - Fixed length data field and variable length utf-8 encoding

Question

I have a Python project where I have a fixed byte-length text field (NOT FIXED CHAR-LENGTH FIELD) in a comm protocol that contains a utf-8 encoded, NULL padded, NULL terminated string.

I need to ensure that a string fits into the fixed byte-length field. Since utf-8 is a variable width encoding, this makes using brute force to truncate the string at a fixed byte length dicey since you could possibly leave part of a multi-byte character dangling at the end.

Is there a module/method/function/etc that can help me with truncating utf-8 variable width encoded strings to a fixed byte-length?

Something that does Null padding and termination would be a bonus.

This seems like a nut that would have already been cracked. I don't want to reinvent something if it already exists.

score 5 · Accepted Answer

Let Python detect and eliminate any partial or invalid characters.

byte_str = uni_str.encode('utf-8')
byte_str = byte_str[:size].decode('utf-8', 'ignore').encode('utf-8')

This works because the UTF-8 spec encodes the number of following bytes in the first byte of a character, so the missing bytes can be easily detected.

Edit: Here are the results of this code on a string of Chinese characters I pulled from another question. The first number is the maximum size in bytes; the second is the actual number of bytes in the truncated UTF-8 string.

45 45 具有靜電產生裝置之影像輸入裝置
44 42 具有靜電產生裝置之影像輸入裝
43 42 具有靜電產生裝置之影像輸入裝
42 42 具有靜電產生裝置之影像輸入裝
41 39 具有靜電產生裝置之影像輸入
40 39 具有靜電產生裝置之影像輸入
39 39 具有靜電產生裝置之影像輸入
38 36 具有靜電產生裝置之影像輸
37 36 具有靜電產生裝置之影像輸
36 36 具有靜電產生裝置之影像輸
35 33 具有靜電產生裝置之影像
34 33 具有靜電產生裝置之影像
33 33 具有靜電產生裝置之影像
32 30 具有靜電產生裝置之影
31 30 具有靜電產生裝置之影
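For the NUL padding and termination asked about in the question, the same trick can be wrapped in a small helper. This is only a sketch: the name `fit_utf8_field` is made up here, and it assumes the field must always end in at least one NUL byte, so only `size - 1` bytes are available for text.

```python
def fit_utf8_field(text: str, size: int) -> bytes:
    """Hypothetical helper: encode `text` as UTF-8, truncate it on a
    character boundary, and NUL-pad the result to exactly `size` bytes."""
    if size < 1:
        raise ValueError("size must leave room for the NUL terminator")
    raw = text.encode("utf-8")
    # Keep at most size - 1 bytes so the last byte is always a NUL.
    # Decoding with 'ignore' drops a partial character left by the slice.
    truncated = raw[:size - 1].decode("utf-8", "ignore").encode("utf-8")
    # Pad on the right with NUL bytes up to the full field width.
    return truncated.ljust(size, b"\0")


field = fit_utf8_field("具有靜電", 8)
print(len(field), field)
```

Note that a string which itself contains "\0" would look terminated early to the receiver; whether to reject such input depends on the protocol.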

score 4 · Accepted Answer

It is very easy to see in a UTF-8 stream whether a given byte is at the start of a character's byte sequence. If the byte is of the form 10xxxxxx, it is a non-initial (continuation) byte of a character. If it is of the form 0xxxxxxx, it is a single-byte character. All other bytes are the initial byte of a multi-byte character.

As such, you can build your own function without too much difficulty. Just ensure that the last byte you add to your field is either of the form 0xxxxxxx, or is of the form 10xxxxxx where the next byte (which you're not adding) is not of the form 10xxxxxx. I.e. you make sure you've just added a one-byte UTF-8 character or the last byte of a multi-byte UTF-8 character. You can then just add NUL (zero) bytes to fill in the rest of your field.
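A minimal sketch of that byte-level check, assuming the input is already valid UTF-8 and `size` is non-negative. The function name `truncate_utf8` is hypothetical. Instead of inspecting the last kept byte, it looks at the first byte that would be dropped and backs up while that byte is a continuation byte, which amounts to the same rule.

```python
def truncate_utf8(raw: bytes, size: int) -> bytes:
    """Hypothetical helper: cut valid UTF-8 `raw` to at most `size` bytes
    without splitting a multi-byte character."""
    if len(raw) <= size:
        return raw
    cut = size
    # raw[cut] is the first byte that is NOT kept. If it has the form
    # 10xxxxxx, the cut falls inside a character, so move it back.
    while cut > 0 and (raw[cut] & 0xC0) == 0x80:
        cut -= 1
    return raw[:cut]


data = "具有靜電".encode("utf-8")          # 12 bytes, 3 per character
field = truncate_utf8(data, 8).ljust(8, b"\0")  # fill the rest with NULs
print(len(field), field)
```

Unlike a decode and re-encode round trip, this only touches a few bytes near the cut point, which may matter for large buffers.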

score 0 · Accepted Answer

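def fit(s, l):
    # s: UTF-8 encoded bytes; l: field length in bytes
    u = s.decode("utf8")
    while True:
        if len(s) <= l:
            # Pad with NUL bytes (bytes literal, so this also works on Python 3)
            return s + b"\0" * (l - len(s))
        # Drop one whole character and re-encode
        u = u[:-1]
        s = u.encode("utf8")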

This should be roughly what you need. You may have to refine it; it is untested.

I edited because I accidentally answered in C. I changed the algorithm to a not so optimal one, but easier to understand.
