c++ - C の最適化 - 低レベルのコード

Question

glibc で使用される malloc である dlmalloc に匹敵するメモリアロケータを作成しようとしています。dlmalloc は、ブロック分割に最適なアロケーターであり、ブロックを再び大きなブロックに統合する前に、最近使用されたブロックのプールを保持します。私が書いているアロケーターは、代わりに最初にフィットするスタイルを行います。

私の問題は 2 つあります。(1) 私のコードのテスト時間は glibc malloc のテスト時間と比較して非常に不規則であり、(2) 私のコードの平均実行時間は 3 倍から 4 倍長くなる日があります。(2) は大したことではありませんが、glibc malloc が同じように苦しんでいない理由を理解したいと思います。さらに、この投稿では、(1) で説明した malloc と私のコードの間の動作のサンプルを示します。1000 回のテストのバッチの平均時間が malloc (上記の問題 (2)) よりもはるかに長い場合もあれば、平均が同じ場合もあります。しかし、私のコードでの一連のテストのテスト時間は常に非常に不規則です (上記の問題 (1))。つまり、テストのバッチには平均の 20 倍の時間のジャンプがあり、これらのジャンプは通常の (平均に近い) 時間に散在しています。glibc malloc はこれを行いません。

私が取り組んでいるコードは次のとおりです。

===================================

/* represent an allocated/unallocated  block of memory */
struct Block {

    /* previous allocated or unallocated block needed for consolidation but not used in allocation */
    Block* prev;
    /* 1 if allocated and 0 if not */
    unsigned int tagh;
   /* previous unallocated block */
   Block* prev_free;
   /* next unallocated block  */
   Block* next_free;
   /* size of current block */
   unsigned int size;
};

#define CACHE_SZ 120000000

/* array to be managed by allocator */
char arr[CACHE_SZ] __attribute__((aligned(4)));

/* initialize the contiguous memory located at arr for allocator */
void init_cache(){
/* setup list head node that does not change */
   Block* a = (Block*)  arr;
  a->prev = 0; 
  a->tagh = 1;
  a->prev_free = 0;
  a->size = 0;

/* setup the usable data block */
  Block* b = (Block*) (arr + sizeof(Block));
  b->prev = a; 
  b->tagh = 0;
  b->prev_free = a;
  b->size = CACHE_SZ - 3*sizeof(Block);
  a->next_free = b;

/* setup list tail node that does not change */
  Block* e = (Block*)((char*)arr + CACHE_SZ - sizeof(Block)); 
  e->prev = b;
  e->tagh = 1;
  e->prev_free = b;
  e->next_free = 0;
  e->size = 0;
  b->next_free = e;
}

char* alloc(unsigned int size){
  register Block* current = ((Block*) arr)->next_free; 
  register Block* new_block;

/* search for a first-fit block */

   while(current != 0){
       if( current->size >= size + sizeof(Block)) goto good;
       current = current->next_free;
   }

/* what to do if no decent size block found */
   if( current == 0) {
       return 0;
   }

/* good block found */
good:
/* if block size is exact return it */
   if( current->size == size){
       if(current->next_free != 0) current->next_free->prev_free = current->prev_free;
       if(current->prev_free != 0) current->prev_free->next_free = current->next_free;
       return (char* ) current + sizeof(Block);
   }

/* otherwise split the block */

   current->size -= size + sizeof(Block); 

    new_block = (Block*)( (char*)current + sizeof(Block) + current->size);
    new_block->size = size;
    new_block->prev = current;
    new_block->tagh = 1;
   ((Block*)((char*) new_block + sizeof(Block) + new_block->size ))->prev = new_block;

   return (char* ) new_block + sizeof(Block);
}

main(int argc, char** argv){
    init_cache();
    int count = 0;

/* the count considers the size of the cache arr */
    while(count < 4883){

/* the following line tests malloc; the quantity(1024*24) ensures word alignment */
   //char * volatile p = (char *) malloc(1024*24);
/* the following line tests above code in exactly the same way */
    char * volatile p = alloc(1024*24);
        count++;

    }
}

=====================================

上記のコードを次のように単純にコンパイルします。

g++ -O9 alloc.c

常にブロックを分割し、正確なサイズのブロックを返さない簡単なテストを実行します。

bash$ for((i=0; i<1000; i++)); do (time ./a.out) 2>&1|grep real; 終わり

私のコードと glibc malloc のテストのサンプル出力は次のとおりです。

私のコード:

real    0m0.023s
real    0m0.109s    <----- irregular jump >
real    0m0.024s
real    0m0.086s
real    0m0.022s
real    0m0.104s    <----- again irregular jump >
real    0m0.023s
real    0m0.023s
real    0m0.098s
real    0m0.023s
real    0m0.097s
real    0m0.024s
real    0m0.091s
real    0m0.023s
real    0m0.025s
real    0m0.088s
real    0m0.023s
real    0m0.086s
real    0m0.024s
real    0m0.024s

malloc コード (Nice と通常は 20ms 近くに留まります):

real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.026s
real    0m0.024s
real    0m0.026s
real    0m0.025s
real    0m0.026s
real    0m0.026s
real    0m0.025s
real    0m0.025s
real    0m0.024s
real    0m0.024s
real    0m0.024s
real    0m0.025s
real    0m0.026s
real    0m0.025s

malloc コード時間がより規則的になっていることに注意してください。他の予測不可能な時間には、私のコードは 0m0.020 秒ではなく 0m0.070 秒であるため、平均実行時間は 25 ミリ秒ではなく 70 ミリ秒に近くなります (上記の問題 (2)) が、ここには示されていません。この場合、malloc の平均 (25 ミリ秒) 近くで実行できたのは幸運でした。

質問は、(1) glibc malloc などのより規則的な時間を持つようにコードを変更するにはどうすればよいですか? (2) dlmalloc は特徴的にバランスのとれたアロケーターであり、最速ではないことを読んだので、可能であれば glibc の malloc よりも高速にするにはどうすればよいでしょうか (スプリット/ベストフィット/ファーストフィットアロケーターのみを考慮し、他のアロケーターは考慮しません)。 ?

c++ - C の最適化 - 低レベルのコード

2 に答える 2

Related

Reference