c - Cで可変長の'文字列'-yを実装する方法

Question

私はかなりグーグルで検索しましたが、可変長文字列が一般的に高級言語でどのように実装されているかについての情報を見つけることができません。私は自分でそのような言語を作成していますが、文字列からどこから始めればよいのかわかりません。

stringタイプを説明する構造体と、createそのような「文字列」を割り当てる関数があります。

/* A safer `strcpy()`, using `strncpy()` and `sizeof()` */
#define STRCPY(TO, FROM) \
  strncpy(TO, FROM, sizeof(TO)); TO[sizeof(TO) - 1] = '\0'

struct string {
  // …
  char  native[1024];
};

string String__create(char native[]) {
  string this = malloc(sizeof(struct string));

  // …
  STRCPY(this->native, native);

  return this;
}

ただし、これでは1kbの長さの文字列しか許可されません。それは一種のばかげたことであり、ほとんどの場合、メモリの膨大な浪費です。

何らかの方法で使用するメモリを宣言する必要があるとすると、（効果的に）無制限の文字数を（効率的に）格納できる文字列を実装するにはどうすればよいですか？

score 12 · Accepted Answer

現在、多くのC++std::string実装は「小さな文字列の最適化」を使用しています。擬似コードの場合：

struct string {
    Int32 length
    union {
        char[12] shortString
        struct {
           char* longerString
           Int32 heapReservedSpace
        }
    }
}

アイデアは、最大12文字の文字列がshortString配列に格納されるというものです。文字列全体が連続し、単一のキャッシュラインのみを使用します。より長い文字列はヒープに格納されます。これにより、文字列オブジェクトに12バイトのスペアバイトが残ります。ポインタはそのすべてを取得するわけではないので、ヒープに割り当てたメモリの量を覚えておくこともできます（>=length）。これは、文字列を少しずつ増やすシナリオをサポートするのに役立ちます。

score 4 · Accepted Answer

一般的なアプローチは、長さのフィールドと、文字を保持するために動的に割り当てられたメモリ領域へのポインタを持つことです。

typedef struct string {
    size_t length;
    unsigned char *native;
} string_t;

string_t String__create(char native[]) {
    string_t this;
    this.length = strlen(native);
    this.native = malloc(this.length+1);
    if (this.native) {
        strncpy(this.native, native, this.length+1);
    } else {
        this.length = 0;
    }
    return this;
}

動的に割り当てる場合string_t：

string_t* String__create(char native[]) {
    string_t* this;
    if (this = malloc(sizeof(*this))) {
        this->length = strlen(native);
        this->native = malloc(this->length+1);
        if (this->native) {
            strncpy(this->native, native, this->length+1);
        } else {
            free(this);
            this=NULL;
        }
    }
    return this;
}
void String__delete(string_t** str) {
    free((*str)->native);
    free((*str));
    *str=NULL;
}

score 4 · Accepted Answer

他の人があなたに言ったことに加えて、私は「容量」の概念も含めます：割り当てられたベクトルのサイズをメモリに知ることはできません。それを保存する必要があります。文字列構造体のsizeofを実行すると、4バイト*3つの数値フィールド=12バイトが返されます（構造体で使用されているパディングのためにおそらく大きくなります）。また、sizeofを使用して割り当てられたメモリの長さを取得することはできません。

typedef struct _mystring {
        char * native;
        size_t size;
        size_t capacity;
} String;

このように、容量は常に文字列が含まれるチャンクの実際のサイズに影響します。文字列が短くなるとしましょう。文字列の容量とサイズを完全に一致させるために、再割り当てする必要はありません。また、最初の文字列にある文字ではなく、文字列に必要な文字を最初から割り当てることができます。最後に、C ++文字列char動的ベクトルを模倣し、文字列が容量制限を超えるたびに容量を2倍にすることができます。これらはすべて、メモリ操作を最小限に抑え、パフォーマンスを向上させます（メモリを浪費しますが、1Kbほどになることはありません）。

String String__create(char native[], size_t capacity) {
  String this;

  this.size = strlen( native );
  if ( capacity < ( this.size + 1 ) )
        this.capacity = this.size + 1;
  else  this.capacity = capacity;

  this.native = (char *) malloc( capacity * sizeof( char ) );
  strcpy( this.native, native );

  return this;
}

String * String__set(String *this, char native[]) {
    this->size = strlen( native );

    if ( this->size >= this->capacity ) {
        do {
            this->capacity <<= 1;
        } while( this->size > this->capacity );

        this->native = realloc( this->native, this->capacity );
    }

    strcpy( this->native, native );

    return this;
}

void String__delete(String *this)
{
    free( this->native );
}

score 2 · Accepted Answer

realloc will relocate your memory to a bigger buffer -- simply using this command will allow you to resize your string. Use the following struct for your string:

struct mystring
{
    char * string;
    size_t size;
};

The important part being keeping a track of the size of your string, and having every string manipulation function testing if the size makes sense.

You could pre-allocate a large buffer and keep adding to it, and only realloc when said buffer is full -- you have to implement all the functions for this. It is preferable (far less error prone, and more secure) to mutate string by moving from one immutable string to another (using the semantics of realoc).

score 0 · Accepted Answer

一部の人々は、連続した文字列（C文字列）ではなく、「ロープ」データ構造を使用して、無制限の長さの文字列を格納することを好みます。

簡略化されたロープは、次のように定義できます。

#include <stdio.h>

struct non_leaf_rope_node{
    char zero;
    union rope* left_substring;
    union rope* right_substring;
    // real rope implementations have a few more items here
};
#define rope_leaf_max ( sizeof( struct non_leaf_rope_node ) )

typedef union rope {
    char rope_data[ rope_leaf_max ];
    struct non_leaf_rope_node pointers;
} rope;

void print( union rope *node ){
    if( node->rope_data[0] != '\0' ){
        // short literal data
        fputs( node->rope_data, stdout );
    }else{
        // non-root node
        print( node->pointers.left_substring );
        print( node->pointers.right_substring );
    };
};
// other details involving malloc() and free() go here

int main(void){
    rope left = { "Hello," };
    rope right = { " World!" };
    rope root = {0,0,0};
    root.pointers.left_substring = &left;
    root.pointers.right_substring = &right;
    print( &root );

    return 0;
};

lope_leaf_max文字未満のロープは、nullで終了するC文字列と同じように格納されます。左と右のサブ文字列を指すルートnon_leaf_rope_nodeとして、rope_leaf_max文字を超えるロープが格納されます（これは、左と右のサブサブ文字列を指す場合があります）。最終的には、リーフノードとリーフノードを指します。それぞれに、完全な文字列の少なくとも1つの文字が含まれています。

ロープは常に少なくとも1つの文字を格納するため、常に次のことがわかります。ロープノードの最初のバイトがゼロ以外の場合、そのノードはリテラル文字を格納するリーフノードです。ロープノードの最初のバイトがゼロの場合、そのノードは左右のサブ文字列へのポインタを格納します。（実際のロープの実装には、多くの場合、第3の種類のロープノードがあります）。

多くの場合、ロープを使用すると、C文字列を使用するよりも必要なRAMスペースの合計が少なくなります。（「ニューヨーク市」などのフレーズを含むノードは、1つのロープで、または2つのロープ間で共有される一部の実装で複数回再利用できます）。ロープを使用する方が、Cストリングを使用するよりも速い場合があります。

c - Cで可変長の'文字列'-yを実装する方法

5 に答える 5

Related

Reference