c - Why do C's string functions work on arrays of char rather than unsigned char?

Question

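In the C standard library functions, the elements of strings are chars. Is there a good reason why this was chosen instead of unsigned char?

Using unsigned char for 8-bit strings has some advantages, albeit small ones:

ASCII codes are usually remembered as unsigned values, so unsigned char is more intuitive. Also, when working with binary data, I prefer the unsigned range 0x00 to 0xFF to dealing with negative numbers, so I have to cast.
Using unsigned integers may be faster and more efficient, or may generate smaller code, on some processors.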

score 12 · Accepted Answer

C provides three different character types:

char represents a character (which C also calls a "byte").
unsigned char represents a byte-sized pattern of bits, or an unsigned integer.
signed char represents a byte-sized signed integer.

It is implementation-defined whether char is a signed or an unsigned type, so I think the question amounts to either "why does char exist at all as this maybe-signed type?" or "why doesn't C require char to be unsigned?".

The first thing to know is that Ritchie added the "char" type to the B language in 1971, and C inherited it from there. Prior to that, B was word-oriented rather than byte-oriented (so says the man himself, see "The Problems of B".)

With that done, the answer to both of my questions might be that early versions of C didn't have unsigned types.

Once char and the string-handling functions were established, changing them all to unsigned char would be a serious breaking change (i.e. almost all existing code would stop working), and one of the ways C has tried to cultivate its user-base over the decades is by mostly avoiding catastrophic incompatible changes. So it would be surprising for C to make that change.

Given that char is going to be the character type, and that (as you observe) it makes a lot of sense for it to be unsigned, but that plenty of implementations already existed in which char was signed, I suppose that making the signedness of char implementation-defined was a workable compromise -- existing code would continue working. Provided that it was using char only as a character and not for arithmetic or order comparisons, it would also be portable to implementations where char is unsigned.

Unlike some of C's age-old implementation-defined variations, implementers do still choose signed characters (Intel). The C standard committee cannot help but observe that some people seem to stick with signed characters for some reason. Whatever those people's reasons are, current or historical, C has to allow it because existing C implementations rely on it being allowed. So forcing char to be unsigned is far lower on the list of achievable goals than forcing int to be 2's complement, and C hasn't even done that.

A supplementary question is "why does Intel still specify char to be signed in its ABIs?", to which I don't know an answer but I'd guess that they've never had an opportunity to do otherwise without massive disruption. Maybe they even like them.

score 4 · Accepted Answer

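Good question. Since the standard does not define whether char is unsigned or signed (that is left to the implementation), I think the preference for char came from a combination of two things:

char takes less time to type than unsigned char, and it makes the prototypes of the string-handling functions easier to read and use.
The original ASCII specification was 7-bit, so whether the valid values ranged from 0 to 127 or from 0 to 255 did not matter to the C specification. Standardization of 8-bit character sets came considerably later.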

score 4 · Accepted Answer

The signedness of char is implementation-defined.

A cleaner solution to the problem you're describing would be to mandate that plain char must be unsigned.

The reason plain char may be either signed or unsigned is partly historical, and partly related to performance.

Very early versions of C didn't have unsigned types. Since ASCII only covers the range 0 to 127, it was assumed that there was no particular disadvantage in making char a signed type. Once that decision was made, some programmers might have written code that depends on that, and later compilers kept char as a signed type to avoid breaking such code.

Quoting a C Reference Manual from 1975, 3 years before the publication of K&R1:

Characters (declared, and hereinafter called, char) are chosen from the ASCII set; they occupy the right-most seven bits of an 8-bit byte. It is also possible to interpret chars as signed, 2’s complement 8-bit numbers.

EBCDIC requires 8-bit unsigned char, but apparently EBCDIC-based machines weren't yet supported at that time.

As for performance, values of type char are implicitly converted, in many contexts, to int (assuming that int can represent all values of type char, which is usually the case). This is done via the "integer promotions". For example, this:

char ch = '0';
ch++;

doesn't just perform an 8-bit increment. It converts the value of ch from char to int, adds 1 to the result, and converts the sum back from int to char to store it in ch. (The compiler can generate any code that provably achieves the same effect.)

Converting an 8-bit signed char to a 32-bit signed int requires sign extension. Converting an 8-bit unsigned char to a 32-bit signed int requires zero-filling the high-order 24 bits of the target. (The actual widths of these types may vary.) Depending on the CPU, one of these operations may be faster than the other. On some CPUs, making plain char signed might result in faster generated code.

(I don't know what the magnitude of this effect is.)

score 3 · Accepted Answer

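There are three relevant types:

signed char, designed to store small signed integers
unsigned char, designed to store small unsigned integers
char, designed to store characters

I think what you really want to know is why char is not an unsigned type.

There was a time when C had no unsigned types[1]. char was described as signed (see page 4), but even then it was noted that "the sign-propagation feature disappears in other implementations", so it already behaved as signed in some places and as unsigned in others. I think each implementation's choice simply reflects what was easiest for it (for example, on the PDP-11, where the first C implementation was written, MOVB performed sign extension, so it could not be used by itself to convert a byte to a word without sign extension).

Today, most implementations I know of use a signed char. The only ones I know of with an unsigned char are IBM's, where it is required by EBCDIC support (the character codes of the characters in the basic character set must be positive, and in EBCDIC most of them are above 128).

[1] Pointers were used instead...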

score 3 · Accepted Answer

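No, there is no good reason. Nor is there any reason for the signedness of char to be implementation-defined. There is no kind of symbol table that uses negative indices.

I think all of this stems from the strange, mistaken assumption that there are 8-bit integers and then there are "characters", and that "characters" are something magical and mysterious.

This is just one of many irrational flaws in the C standard, inherited from the days when dinosaurs walked the earth. The mysterious signedness of char adds nothing to the language, except perhaps the potential for signedness-related bugs caused by implicit integer promotion.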

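EDIT:

They may have allowed char to be signed because they wanted char to behave like the other integer types (short, int, long), all of which the standard guarantees to be signed by default.

Using unsigned integers may be faster and more efficient, or may generate smaller code, on some processors.

Which type you end up with is not exactly intuitive. Whenever you use a char as an operand in an expression, it is always promoted to int. Likewise, character constants such as 'a' and '\n' have type int, not char. The C language forces the compiler to promote types according to the implicit promotion rules (known as the "integer promotions" and the "usual arithmetic conversions", also called "balancing").

Once that promotion is done, the compiler may optimize the type into whatever is most efficient, provided it can prove that the optimization does not change the result.

If you have this code: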

char a = 'a';
char b = 'b';
char c = a + b;

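then a lot of obscure things are going on between the lines. First, the literals 'a' and 'b' are silently truncated from int to signed/unsigned char. Next, in the expression a + b, both a and b are implicitly promoted to int by the integer promotion rules. The addition is performed on two ints. Finally, the result is silently truncated back to signed/unsigned char.

If the compiler can prove that an optimization does not affect any of the obscure steps above, it can replace the whole thing with a sane 8-bit operation.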

score 1 · Accepted Answer


Because the standard does not define char as signed char.

answered 2012-08-24T09:05:25.497
