c - Bitwise AND on signed chars

Question

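I have a file that I've read into an array of data type signed char. I cannot change this fact.

I would now like to do this: !((c[i] & 0xc0) & 0x80), where c[i] is one of the signed characters.

Now, I know from section 6.5.10 of the C99 standard that "Each of the operands [of the bitwise AND] shall have integer type."

And section 6.5 of the C99 specification tells me: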

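Some operators (the unary operator ~ , and the binary operators << , >> , & , ^ , and | , collectively described as bitwise operators) shall have operands that have integer type. These operators return values that depend on the internal representations of integers, and thus have implementation-defined aspects for signed types.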

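My question is two-fold:

1. Since I want to work with the original bit patterns from the file, how can I convert/cast my signed char to unsigned char so that the bit patterns remain unchanged?
2. Is there a list of these "implementation-defined aspects" anywhere (say, for MSVC and GCC)?

Or could I take a different route and argue that this produces the same result for both signed and unsigned chars for any value of c[i]?

Naturally, I will reward references to relevant standards or authoritative texts, and discourage "informed" speculation.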

score 5 · Accepted Answer

As others point out, in all likelihood your implementation is based on two's complement, and will give exactly the result you expect.

However, if you're worried about the results of an operation involving a signed value, and all you care about is the bit pattern, simply cast directly to an equivalent unsigned type. The results are defined under the standard:

6.3.1.3 Signed and unsigned integers

...
Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type.

This is essentially specifying that the result will be the two's complement representation of the value.
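A minimal sketch of that conversion rule, assuming CHAR_BIT is 8 so that UCHAR_MAX is 255. The function names are made up for illustration and do not come from the standard. One helper applies the "add one more than the maximum" rule by hand, the other uses the plain cast, and the two should agree for every signed char value.

```c
#include <limits.h>

/* Hypothetical helper: apply the 6.3.1.3 rule by hand. A negative value
   has (UCHAR_MAX + 1) added until it lands in 0..UCHAR_MAX. */
unsigned char to_unsigned_by_rule(signed char value)
{
    int wide = value;              /* every signed char value fits in int */
    while (wide < 0)
        wide += UCHAR_MAX + 1;     /* 256 when CHAR_BIT is 8 (assumed) */
    return (unsigned char)wide;    /* already in range, value unchanged */
}

/* The same conversion written as the plain cast recommended above. */
unsigned char to_unsigned_by_cast(signed char value)
{
    return (unsigned char)value;
}

/* Returns 1 if both helpers agree for every signed char value. */
int conversions_agree(void)
{
    for (int v = SCHAR_MIN; v <= SCHAR_MAX; v++) {
        signed char ch = (signed char)v;
        if (to_unsigned_by_rule(ch) != to_unsigned_by_cast(ch))
            return 0;
    }
    return 1;
}
```

Because the rule is stated on values, not on stored bits, the cast gives this result on any conforming implementation, whatever representation it uses for negative numbers.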

Fundamental to this is that in two's complement maths the result of a calculation is taken modulo some power of two (2^N, where N is the number of bits in the type), which in turn is exactly equivalent to masking off everything but the low N bits. And the complement of a number is the number subtracted from that power of two.

Thus adding a negative value is the same as adding any value which differs from the value by a multiple of that power of two.

i.e:

        (0 + signed_value) mod (2^N)
==
      (2^N + signed_value) mod (2^N)
==
  (7 * 2^N + signed_value) mod (2^N)

etc. (if you know modulo, that should be pretty self-evidently true)

So if you have a negative number, adding a power of two will make it positive (-5 + 256 = 251), but the bottom 'N' bits will be exactly the same (0b11111011) and it will not affect the outcome of a mathematical operation. As values are then truncated to fit the type, the result is exactly the binary value you expected, even if the result 'overflows' (i.e. what you might think happens if the number was positive to start with; this wrapping is also well defined behaviour).

So in 8-bit two's complement:

-5 is the same as 251 (i.e 256 - 5) - 0b11111011
If you add 30, and 251, you get 281. But that's larger than 256, and 281 mod 256 equals 25. Exactly the same as 30 - 5.
251 * 2 = 502. 502 mod 256 = 246. 246 and -10 are both 0b11110110.
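The three figures above can be checked mechanically. This is a small sketch that assumes the fixed-width types uint8_t and int8_t from <stdint.h> are available (which implies 8-bit bytes and two's complement for int8_t); the helper names are illustrative only.

```c
#include <stdint.h>

/* Add two 8-bit values; converting back to uint8_t wraps modulo 256. */
uint8_t add8(uint8_t a, uint8_t b)
{
    return (uint8_t)((unsigned int)a + b);
}

/* Multiply two 8-bit values, again reduced modulo 256. */
uint8_t mul8(uint8_t a, uint8_t b)
{
    return (uint8_t)((unsigned int)a * b);
}

/* The unsigned 8-bit value that shares its bit pattern with v. */
uint8_t pattern_of(int8_t v)
{
    return (uint8_t)v;   /* 6.3.1.3: 256 is added to negative values */
}
```

With these, pattern_of(-5) is 251, add8(30, 251) is 25, and mul8(251, 2) equals pattern_of(-10), which is 246.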

Likewise if you have:

unsigned int a;
int b;

a - b == a + (unsigned int) -b;

Under the hood, this cast is unlikely to be implemented with arithmetic and will certainly be a straight assignment from one register/value to another, or just optimised out altogether, as the maths does not make a distinction between signed and unsigned (interpretation of CPU flags is another matter, but that's an implementation detail). The standard exists to ensure that an implementation doesn't take it upon itself to do something strange instead, or, I suppose, for some weird architecture which isn't using two's complement...
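The identity a - b == a + (unsigned int) -b from this answer can be tried out with a short sketch. The function names are hypothetical, and the second one assumes b is not INT_MIN, since negating INT_MIN overflows.

```c
/* Subtract a signed value from an unsigned one. The usual arithmetic
   conversions turn b into unsigned int; the cast only makes that visible. */
unsigned int sub_mixed(unsigned int a, int b)
{
    return a - (unsigned int)b;
}

/* Add the negated value instead. Assumes b != INT_MIN, because -b would
   overflow in that one case. */
unsigned int add_negated(unsigned int a, int b)
{
    return a + (unsigned int)-b;
}
```

Both functions return the same value for any a and any b other than INT_MIN, because unsigned arithmetic is defined modulo UINT_MAX + 1.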

score 1 · Accepted Answer

unsigned char UC = *(unsigned char*)&CC; this is how you can convert signed to unsigned while preserving the "bit pattern". Thus you could change your code to something like this:

!(( (*(unsigned char*)(c+i)) & 0xc0) & 0x80)

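Explanation (with references):

761 When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object.

1124 When [sizeof is] applied to an operand that has type char, unsigned char, or signed char (or a qualified version thereof), the result is 1.

These two imply that the unsigned char pointer points to the same byte as the original signed char pointer.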
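A sketch contrasting the pointer approach with a plain value cast; the function names are invented for this example. On a two's complement machine both return the same byte. The difference only matters on other representations, where the pointer version returns the bits as stored and the value cast returns the value reduced modulo UCHAR_MAX + 1.

```c
/* Read the byte through an unsigned char pointer: this reinterprets
   the stored bits without any arithmetic conversion. */
unsigned char byte_via_pointer(const signed char *p)
{
    return *(const unsigned char *)p;
}

/* Convert the value instead: defined arithmetically by 6.3.1.3. */
unsigned char byte_via_cast(signed char value)
{
    return (unsigned char)value;
}

/* The question's test, applied to element i of a signed char buffer. */
int top_bit_clear(const signed char *c, int i)
{
    return !((byte_via_pointer(c + i) & 0xc0) & 0x80);
}
```

Accessing an object through a pointer to unsigned char is always permitted, so this does not run into aliasing problems.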

score 0 · Accepted Answer

You appear to have something similar to:

signed char c[] = "\x7F\x80\xBF\xC0\xC1\xFF";

for (int i = 0; c[i] != '\0'; i++)
{
    if (!((c[i] & 0xC0) & 0x80))
        ...
}

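You are (correctly) worried about sign extension of the signed char type. In practice, however, (c[i] & 0xC0) will convert the signed character to a (signed) int, but the & 0xC0 will discard any set bits in the more significant bytes; the result of the expression will be in the range 0x00 .. 0xFF. This will, I believe, apply regardless of whether you use sign-and-magnitude, one's complement or two's complement binary values. The detailed bit pattern you get for a specific signed character value varies depending on the underlying representation; but the overall conclusion that the result will be in the range 0x00 .. 0xFF is valid.

There is an easy resolution for that concern: cast the value of c[i] to an unsigned char before using it: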

if (!(((unsigned char)c[i] & 0xC0) & 0x80))

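The value c[i] is converted to an unsigned char before it is promoted to int (or, the compiler might promote to int, then coerce to unsigned char, then promote the unsigned char to int), and the unsigned value is used in the & operations.

Of course, the code is now merely redundant. Using & 0xC0 followed by & 0x80 is entirely equivalent to just & 0x80.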
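The equivalence can be confirmed exhaustively, since a signed char has only a few hundred values. A sketch, assuming a two's complement implementation (the agreement between the signed and unsigned forms is not guaranteed by this code on other representations); the function names are illustrative.

```c
#include <limits.h>

/* The test exactly as written in the question, on the signed value. */
int test_signed(signed char ch)
{
    return !((ch & 0xC0) & 0x80);
}

/* The same test after casting to unsigned char first. */
int test_unsigned_cast(signed char ch)
{
    return !(((unsigned char)ch & 0xC0) & 0x80);
}

/* The simplified form: only bit 7 matters. */
int test_simplified(signed char ch)
{
    return !((unsigned char)ch & 0x80);
}

/* Returns 1 if the three forms agree for every signed char value. */
int all_forms_agree(void)
{
    for (int v = SCHAR_MIN; v <= SCHAR_MAX; v++) {
        signed char ch = (signed char)v;
        int expected = test_simplified(ch);
        if (test_signed(ch) != expected || test_unsigned_cast(ch) != expected)
            return 0;
    }
    return 1;
}
```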

If you're processing UTF-8 data and looking for continuation bytes, the correct test is:

if (((unsigned char)c[i] & 0xC0) == 0x80)
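As a sketch of how that test might be used, here is a code point counter over a signed char buffer. Both function names are hypothetical, and the counter assumes the input is well-formed, NUL-terminated UTF-8; it performs no validation.

```c
#include <stddef.h>

/* 1 if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(signed char ch)
{
    return ((unsigned char)ch & 0xC0) == 0x80;
}

/* Count code points by counting every byte that is not a continuation
   byte. Assumes well-formed, NUL-terminated UTF-8. */
size_t utf8_length(const signed char *s)
{
    size_t count = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (!is_utf8_continuation(s[i]))
            count++;
    }
    return count;
}
```

For the four bytes 0x61 0xC3 0xA9 0x62 (the text "aéb"), only 0xA9 is a continuation byte, so the count is 3.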

score 0 · Accepted Answer

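"Since I want to work with the original bit patterns from the file, how can I convert/cast my signed char to unsigned char so that the bit patterns remain unchanged?"

As someone already explained in a previous answer to your question on the same topic, any small integer type, be it signed or unsigned, will get promoted to the type int whenever it is used in an expression.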

C11 6.3.1.1

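"If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int; otherwise, it is converted to an unsigned int. These are called the integer promotions."

Also, as explained in the same answer, unsuffixed integer literals whose value fits in an int, such as the ones used here, are of the type int.

Therefore, your expression will boil down to the pseudo code (int) & (int) & (int). The operations will be performed on three temporary int variables and the result will be of type int.

Now, if the original data contained bits that may be interpreted as sign bits for the specific signedness representation (in practice this will be two's complement on all systems), you will get problems, because these bits will be preserved upon promotion from signed char to int.

And then the bit-wise & operator performs an AND on every single bit, irrespective of the contents of its integer operands (C11 6.5.10/3), be they signed or not. If you had data in the sign bits of your original signed char, it will now be lost, because the integer literals (0xC0 or 0x80) will have no bits set that correspond with the sign bits.

The solution is to prevent the sign bits from getting transferred to the "temporary int". One solution is to cast c[i] to unsigned char, which is completely well-defined (C11 6.3.1.3). This will tell the compiler that "the whole contents of this variable is an integer, there are no sign bits to worry about".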
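A sketch of what the promotion does to the bits, assuming two's complement and an int wider than 8 bits. The function names are made up for illustration.

```c
/* Promote a signed char to int the way the & operator would, then
   return the int's bits as an unsigned value for inspection. */
unsigned int promoted_bits(signed char ch)
{
    int promoted = ch;                  /* value preserved, so negative
                                           values are sign-extended */
    return (unsigned int)promoted;
}

/* The same, but with the cast to unsigned char applied first. */
unsigned int promoted_bits_after_cast(signed char ch)
{
    int promoted = (unsigned char)ch;   /* always 0..UCHAR_MAX */
    return (unsigned int)promoted;
}
```

With a 32-bit int, promoted_bits(-128) has every bit from bit 7 upwards set, while promoted_bits_after_cast(-128) is just 0x80. Masking either one with 0xC0 leaves 0x80, which is why the original expression still behaves on two's complement machines.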

Better yet, make a habit of always using unsigned data in every form of bit manipulation. The purist, 100% safe, MISRA-C compliant way of rewriting your expression is this:

if ( (((uint8_t)c[i] & 0xc0u) & 0x80u) > 0u)

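The u suffix actually forces the expression to be of type unsigned int, but it is good practice to always cast to the intended type. It tells the reader of the code "I actually know what I am doing and I also understand all the weird implicit promotion rules in C".

And then, if we know our hex, (0xc0 & 0x80) is pointless: it always evaluates to 0x80, so x & 0xC0 & 0x80 is always the same as x & 0x80. Therefore simplify the expression to: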

if ( ((uint8_t)c[i] & 0x80u) > 0u)

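"Is there a list of these 'implementation-defined aspects' anywhere?"

Yes, the C standard conveniently lists them in Annex J.3. The only implementation-defined aspect you encounter in this case, though, is the representation of signed integers. In practice, this is always two's complement.

EDIT: The quoted text in the question is concerned with the various bit-wise operators producing implementation-defined results. This is briefly mentioned as implementation-defined even in the annex, with no exact references. The actual chapter 6.5 doesn't say much regarding implementation-defined behavior of &, | etc. The only operators where it is explicitly mentioned are << and >>, where left shifting a negative number is even undefined behavior, but right shifting it is implementation-defined.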
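For the shift operators, the same cast sidesteps the problem, because the value being shifted is never negative. A small sketch, assuming CHAR_BIT is 8; the function name is hypothetical.

```c
/* Top four bits of the byte, as a value in 0..15. Casting to unsigned
   char first means the shifted operand is a non-negative int, so the
   right shift has a fully defined result. */
unsigned int high_nibble(signed char ch)
{
    return (unsigned int)((unsigned char)ch >> 4);
}
```

Writing ch >> 4 directly on a negative signed char would instead shift a negative int, which is the implementation-defined case mentioned above.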
