c - Why is accessing a memory-aligned buffer more expensive on Linux?

Question

The program below uses two buffers, one aligned to 64 bytes and one aligned to 16 bytes, on a 64-bit Linux host running a 2.6.x kernel.

The cache line is 64 bytes long, so the program only ever touches one cache line at a time. I expected the buffer from posix_memalign to be at least as fast as the buffer without explicit alignment, if not faster. Here are some measurements:

./readMemory 10000000

time taken by posix_memaligned buffer: 293020299 
time taken by standard buffer: 119724294 

./readMemory 100000000

time taken by posix_memaligned buffer: 548849137 
time taken by standard buffer: 211197082

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

void now(struct timespec *t);

int main(int argc, char **argv)
{
    char *buf;
    struct timespec st_time, end_time;
    int runs;

    if (argc != 2)
    {
        printf("Usage: ./readMemory <number of runs>\n");
        exit(1);
    }
    errno = 0;
    runs = strtol(argv[1], NULL, 10);
    if (errno != 0)
    {
        printf("Invalid number of runs: %s \n", argv[1]);
        exit(1);
    }

    int returnVal = -1;

    /* 1024-byte heap buffer aligned to a 64-byte cache line. */
    returnVal = posix_memalign((void **)&buf, 64, 1024);
    if (returnVal != 0)
    {
        printf("error in posix_memalign\n");
        exit(1);
    }

    char tempBuf[64];
    char *temp = buf;

    size_t cpyBytes = 64;

    now(&st_time);
    for (int x = 0; x < runs; x++)
    {
        temp = buf;
        /* Note: i advances by 64 against a bound of 15, so the body runs once per pass. */
        for (int i = 0; i < ((1024/64) - 1); i += 64)
        {
            memcpy(tempBuf, temp, cpyBytes);
            temp += 64;
        }
    }
    now(&end_time);

    /* Note: only tv_nsec is subtracted here; tv_sec is ignored. */
    printf("time taken by posix_memaligned buffer: %ld \n", (end_time.tv_nsec - st_time.tv_nsec));

    /* Stack buffer with whatever alignment the compiler chooses. */
    char buf1[1024];
    temp = buf1;
    now(&st_time);
    for (int x = 0; x < runs; x++)
    {
        temp = buf1;
        for (int i = 0; i < ((1024/64) - 1); i += 64)
        {
            memcpy(tempBuf, temp, cpyBytes);
            temp += 64;
        }
    }
    now(&end_time);
    printf("time taken by standard buffer: %ld \n", (end_time.tv_nsec - st_time.tv_nsec));
    return 0;
}

void now(struct timespec *tnow)
{
    if (clock_gettime(CLOCK_MONOTONIC_RAW, tnow) < 0)
    {
        printf("error getting time");
        exit(1);
    }
}

The disassembly of the first loop is:

    movq    -40(%rbp), %rdx        
    movq    -48(%rbp), %rcx        
    leaq    -176(%rbp), %rax
    movq    %rcx, %rsi
    movq    %rax, %rdi
    call    memcpy
    addq    $64, -48(%rbp)
    addl    $64, -20(%rbp)

The disassembly of the second loop is:

    movq    -40(%rbp), %rdx
    movq    -48(%rbp), %rcx
    leaq    -176(%rbp), %rax
    movq    %rcx, %rsi
    movq    %rax, %rdi
    call    memcpy
    addq    $64, -48(%rbp)
    addl    $64, -4(%rbp)

score 1 · Accepted Answer

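Your benchmark has several problems:

The run time is very short, so the results probably contain a lot of noise and jitter.
If CPU frequency scaling is enabled, the first loop may be running before the CPU has switched to its full/turbo frequency. Either warm the CPU up first or turn frequency scaling off while benchmarking.
You are not running at real-time priority, so you may be measuring scheduling effects.
You take only one sample per run. You need at least 30 runs to draw any kind of scientific conclusion (a scientific study with a single sample is generally called an anecdote).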
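
The points above can be sketched as a small timing harness. This is a minimal sketch, not the asker's code: the names elapsed_ns, median_ns and bench_median_ns are hypothetical, CLOCK_MONOTONIC is an assumed choice of clock, and the median is just one reasonable way to summarize many samples. It does one untimed warm-up call, collects n samples, and computes each interval from both tv_sec and tv_nsec.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Full elapsed time in nanoseconds, including the seconds field. */
int64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

static int cmp_i64(const void *a, const void *b)
{
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
int64_t median_ns(int64_t *samples, size_t n)
{
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_i64);
    return (n % 2) ? samples[n / 2]
                   : (samples[n / 2 - 1] + samples[n / 2]) / 2;
}

/* Hypothetical harness: one untimed warm-up call, then n timed samples.
   fn is the code under test; samples must have room for n entries. */
int64_t bench_median_ns(void (*fn)(void *), void *arg,
                        int64_t *samples, size_t n)
{
    struct timespec t0, t1;

    fn(arg); /* warm-up: lets caches fill and the CPU clock ramp up */
    for (size_t i = 0; i < n; i++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn(arg);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        samples[i] = elapsed_ns(&t0, &t1);
    }
    return median_ns(samples, n);
}
```

With something like this, each copy loop would be wrapped in its own function and passed to bench_median_ns with n of 30 or more, so the two buffers are compared on a summary of many samples rather than on a single reading.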

score 1 · Accepted Answer

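The reason may be the relative alignment of the buffers.

memcpy works fastest when it copies data aligned on word boundaries (32/64 bits).
If both buffers are properly aligned, everything is fine.
If both buffers are misaligned in the same way, memcpy handles it by copying a small prefix byte by byte and then proceeding word by word over the rest.

But if one buffer is aligned on a word boundary and the other is not, there is no way to make both the reads and the writes word-aligned. memcpy still works word by word, but half of the memory accesses are misaligned.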
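
The strategy described in this answer can be sketched as follows. This is a simplified illustration under stated assumptions, not how any particular libc implements memcpy (real implementations typically use SIMD and many size-specific paths). The name copy_wordwise is hypothetical, and the inner memcpy calls on a uint64_t are only a portable way to express one word-sized load and store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-wise copy: align the destination with a byte-wise
   prefix, then move one 64-bit word at a time. */
void *copy_wordwise(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    const size_t W = sizeof(uint64_t);

    /* Byte-wise prefix until the destination reaches a word boundary. */
    while (n > 0 && (uintptr_t)d % W != 0) {
        *d++ = *s++;
        n--;
    }

    /* Word-wise body. Stores are now aligned; loads are aligned only if
       src started at the same offset within a word as dst did. */
    while (n >= W) {
        uint64_t w;
        memcpy(&w, s, W); /* one word-sized load, possibly misaligned */
        memcpy(d, &w, W); /* one word-sized, aligned store */
        d += W;
        s += W;
        n -= W;
    }

    /* Byte-wise tail for whatever is left. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

When dst and src share the same offset within a word, the prefix aligns both at once. When the offsets differ, only one side can be aligned, which is the case where half of the accesses stay misaligned.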

If both of your stack buffers are misaligned in the same way (for example, both addresses are of the form 8*x+2) while the buffer from posix_memalign is aligned, that would explain what you are seeing.
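
One way to check this hypothesis is to print where each buffer sits relative to the boundaries of interest. The sketch below is only an illustration: misalignment and report_alignment are hypothetical helper names, and the moduli 8, 16 and 64 are chosen to match the word size, the stack alignment and the cache line size mentioned in the question.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Offset of p past the previous multiple of align (0 means aligned). */
size_t misalignment(const void *p, size_t align)
{
    assert(align != 0);
    return (size_t)((uintptr_t)p % align);
}

/* Print how a buffer sits relative to word, 16-byte and cache-line
   boundaries. */
void report_alignment(const char *name, const void *p)
{
    printf("%-8s %p  mod 8 = %zu, mod 16 = %zu, mod 64 = %zu\n",
           name, (void *)p,
           misalignment(p, 8), misalignment(p, 16), misalignment(p, 64));
}
```

Calling report_alignment on buf, buf1 and tempBuf inside the question's program would show whether the source and destination of each memcpy share the same offset or not.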
