c - このラングトンの蟻シムを最適化する方法は?

Question

Langton's ant sim (ルールストリング RLR 用) を作成しており、速度を最適化しようとしています。関連するコードは次のとおりです。

#define AREA_X 65536
#define AREA_Y 65536
#define TURN_LEFT 3
#define TURN_RIGHT 1
int main()
{
  uint_fast8_t* state;
  uint_fast64_t ant=((AREA_Y/2)*AREA_X) + (AREA_X/2);
  uint_fast8_t ant_orientation=0;
  uint_fast8_t two_pow_five=32;
  uint32_t two_pow_thirty_two=0;/*not fast, relying on exact width for overflow*/
  uint_fast8_t change_orientation[4]={0, TURN_RIGHT, TURN_LEFT, TURN_RIGHT};
  int_fast64_t* move_ant={AREA_X, 1, -AREA_X, -1};
  ... initialise empty state
  while(1)
  {
    while(two_pow_five--)/*removing this by doing 32 steps per inner loop, ~16% longer*/
    {
      while(--two_pow_thirty_two)
      {
        /*one iteration*/
        /* 54 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + (117>>((++state[ant])*2 )) )&3;
        state[ant] = (36 >> (state[ant] *2) ) & 3;
        ant+=move_ant[ant_orientation];
        */

        /* 47 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        state[ant] += (state[ant]==2)?-2:1;
        ant+=move_ant[ant_orientation];
        */

        /* 46 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((state[ant])==1?3:1) )&3;
        if(state[ant]==2)
        {
          --state[ant];
          --state[ant];
        }
        else
          ++state[ant];
        ant+=move_ant[ant_orientation];
        */

        /* 44 seconds for init + 2^32 steps
        ant_orientation = ( ant_orientation + ((++state[ant])==2?3:1) )&3;
        if(state[ant]==3)state[ant]=0;
        ant+=move_ant[ant_orientation];
        */

        // 37 seconds for init + 2^32 steps
        // handle every situation with nested switches and constants
        switch(ant_orientation)
        {
          case 0:
            switch(state[ant])
            {
              case 0:
                ant_orientation=1;
                state[ant]=1;
                ++ant;
                break;
              case 1:
                ant_orientation=3;
                state[ant]=2;
                --ant;
                break;
              case 2:
                ant_orientation=1;
                state[ant]=0;
                ++ant;
                break;
            }
            break;
          case 1:
            switch(state[ant])
            {
              ...
            }
            break;
          case 2:
            switch(state[ant])
            {
              ...
            }
            break;
          case 3:
            switch(state[ant])
            {
              ...
            }
            break;
        }

      }
    }
    two_pow_five=32;
    ... dump to file every 2^37 steps
  }
  return 0;
}

2 つの質問があります。

試行錯誤のテストにより、c でできる限り最適化しようとしましたが、利用していないトリックはありますか? いつかアセンブルしてみようと思いますが、アセンブルではなくcで話してみてください。
速度を上げるために問題をモデル化するより良い方法はありますか?

詳細: 移植性は問題ではありません。私は 64 ビット Linux を使用しており、gcc、i5-2500k、および 16 GB の RAM を使用しています。状態配列はそのままで 4GiB を使用しますが、プログラムは 12GiB の RAM を使用することができます。sizeof(uint_fast8_t)=1. 境界チェックは意図的に存在しません。破損はダンプから手動で簡単に見つけることができます。

編集:おそらく直観に反して、switch ステートメントを削除するのではなく積み重ねることで、これまでで最高の効率が得られました。

編集:私は問題を再モデル化し、反復ごとに 1 つのステップよりも速いものを考え出しました。以前は、各状態要素は 2 ビットを使用し、Langton の ant グリッド内の 1 つのセルを記述していました。新しい方法は 8 ビットすべてを使用し、グリッドの 2x2 セクションを記述します。反復ごとに、ステップ数、新しい方向、および現在の状態 + 方向の新しい状態の事前計算値を検索することにより、可変数のステップが実行されます。すべての可能性が同じであると仮定すると、反復ごとに平均して 2 つのステップが実行されます。おまけとして、メモリの 1/4 を使用して同じ領域をモデル化します。

while(--iteration)
{
        // roughly 31 seconds per 2^32 steps
        table_offset=(state[ant]*24)+(ant_orientation*3);
        it+=twoxtwo_table[table_offset+0];
        state[ant]=twoxtwo_table[table_offset+2];
        ant+=move_ant2x2[(ant_orientation=twoxtwo_table[table_offset+1])];
}

まだ最適化を試みていませんが、次に試みることは、以前のようにネストされたスイッチと定数を使用してオフセット式とルックアップを削除することです (ただし、内部ケースは 12 ではなく 648 です)。

score 3 · Accepted Answer

または、分岐の代わりに単一の符号なしバイト定数を人工レジスタとして使用できます。

value:   1  3  1  1
bits:   01 11 01 01 ---->101 decimal value for an unsigned byte
index    3  2  1  0  ---> get first 2 bits to get  "1"  (no shift)
                      --> get second 2 bits to get "1"  (shifting for 2 times)
                      --> get third 2 bits to get  "3"  (shifting for 4 times)
                      --> get last 2 bits to get   "1"  (shifting for 6 times)

 Then "AND" the result with  binary(11) or decimal(3) to get your value.

 (101>>( (++state[ant])*2  ) ) & 3 would give you the turnright or turnleft

 Example:
 ++state[ant]= 0:  ( 101>>( (0)*2 ) )&3  --> 101 & 3 = 1
 ++state[ant]= 1:  ( 101>>( (1)*2 ) )&3  --> 101>>2 & 3 = 1
 ++state[ant]= 2:  ( 101>>( (2)*2 ) )&3  --> 101>>4 & 3 = 3  -->turn left
 ++state[ant]= 3:  ( 101>>( (3)*2 ) )&3  --> 101>>6 & 3 = 1

 Maximum six-shifting + one-multiplication + one-"and" may be better.
 Dont forget constant can be auto-promoted so you may add some suffixes or something else.

%4 モジュラスに「unsigned int」を使用しているため、「and」演算を使用できます。

  state[ant]=state[ant]&3; instead of state[ant]=state[ant]%4;

熟練していないコンパイラの場合、これにより速度が向上するはずです。

最も難しい部分: modulo-3

  C = A % B is equivalent to C = A – B * (A / B)
  We need  state[ant]%3
  Result = state[ant] - 3 * (state[ant]/3)

  state[ant]/3 is always <=1 for your valid direction states.
  Only when state[ant]  is 3 then state[ant]/3 is 1, other values give 0.
  When multiplied by 3, that part is 0 or 3 (only 3 when state[ant] is 3 otherwise 0)
  Result = state[ant] - (0 or 3)

  Lets look at all possibilities:

  state[ant]=0:  0 - 0  ---> 0   ----> 00100100 shifted by 0 times &3 --> 00000000
  state[ant]=1:  1 - 0  ---> 1   ----> 00100100 shifted by 2 times &3 --> 00000001
  state[ant]=2:  2 - 0  ---> 2   ----> 00100100 shifted by 4 times &3 --> 00000010
  state[ant]=3:  3 - 3  ---> 0   ----> 00100100 shifted by 6 times &3 --> 00000000

  00100100 is 36 in decimal.

  (36 >> (state[ant] *2) ) & 3 will give you state[ant]%3 for your valid states (0,1,2,3)

  Example: 

  state[ant]=0: 36 >> 0  --> 36 ----> 36& 3 ----> 0  satisfies 0%3
  state[ant]=1: 36 >> 2  --> 9 -----> 9 & 3 ----> 1  satisfies 1%3
  state[ant]=2: 36 >> 4  --> 2 -----> 2 & 3 ----> 2 satisfies  2%3
  state[ant]=3: 36 >> 6  --> 0 -----> 0 & 3 ----> 0 satisfies 3%3

c - このラングトンの蟻シムを最適化する方法は?

1 に答える 1

Related

Reference