embedded - ARMThumb2の効率的な整数平方根アルゴリズムを探しています

Question

符号なし整数の平方根（その整数部分）を見つけるための高速な整数のみのアルゴリズムを探しています。コードは、ARMThumb2プロセッサで優れたパフォーマンスを発揮する必要があります。アセンブリ言語またはCコードである可能性があります。

ヒントは大歓迎です。

score 36 · Accepted Answer

Jack W. Crenshaw によるInteger Square Rootsは、別のリファレンスとして役立ちます。

C スニペットアーカイブには、整数平方根の実装もあります。これは単なる整数の結果を超えて、答えの余分な小数 (固定小数点) ビットを計算します。(更新: 残念ながら、C スニペットアーカイブは現在機能していません。リンクはページの Web アーカイブを指しています。) C スニペットアーカイブのコードは次のとおりです。

#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))

struct int_sqrt {
    unsigned sqrt, frac;
};

/* usqrt:
    ENTRY x: unsigned long
    EXIT  returns floor(sqrt(x) * pow(2, BITSPERLONG/2))

    Since the square root never uses more than half the bits
    of the input, we use the other half of the bits to contain
    extra bits of precision after the binary point.

    EXAMPLE
        suppose BITSPERLONG = 32
        then    usqrt(144) = 786432 = 12 * 65536
                usqrt(32) = 370727 = 5.66 * 65536

    NOTES
        (1) change BITSPERLONG to BITSPERLONG/2 if you do not want
            the answer scaled.  Indeed, if you want n bits of
            precision after the binary point, use BITSPERLONG/2+n.
            The code assumes that BITSPERLONG is even.
        (2) This is really better off being written in assembly.
            The line marked below is really a "arithmetic shift left"
            on the double-long value with r in the upper half
            and x in the lower half.  This operation is typically
            expressible in only one or two assembly instructions.
        (3) Unrolling this loop is probably not a bad idea.

    ALGORITHM
        The calculations are the base-two analogue of the square
        root algorithm we all learned in grammar school.  Since we're
        in base 2, there is only one nontrivial trial multiplier.

        Notice that absolutely no multiplications or divisions are performed.
        This means it'll be fast on a wide range of processors.
*/

void usqrt(unsigned long x, struct int_sqrt *q)
{
      unsigned long a = 0L;                   /* accumulator      */
      unsigned long r = 0L;                   /* remainder        */
      unsigned long e = 0L;                   /* trial product    */

      int i;

      for (i = 0; i < BITSPERLONG; i++)   /* NOTE 1 */
      {
            r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
            a <<= 1;
            e = (a << 1) + 1;
            if (r >= e)
            {
                  r -= e;
                  a++;
            }
      }
      memcpy(q, &a, sizeof(long));
}

次のコードに落ち着きました。これは基本的に、平方根計算法に関するウィキペディアの記事からのものです。しかし、 usestdint.h型uint32_tなどに変更されています。厳密には、戻り値の型をに変更できますuint16_t。

/**
 * \brief    Fast Square root algorithm
 *
 * Fractional parts of the answer are discarded. That is:
 *      - SquareRoot(3) --> 1
 *      - SquareRoot(4) --> 2
 *      - SquareRoot(5) --> 2
 *      - SquareRoot(8) --> 2
 *      - SquareRoot(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRoot(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }
    return res;
}

私が発見した良い点は、かなり単純な変更で「丸められた」答えを返すことができることです。これは、特定のアプリケーションで精度を高めるために役立つことがわかりました。この場合、 2 ³² - 1uint32_tの丸められた平方根は 2 ¹⁶であるため、戻り値の型はでなければならないことに注意してください。

/**
 * \brief    Fast Square root algorithm, with rounding
 *
 * This does arithmetic rounding of the result. That is, if the real answer
 * would have a fractional part of 0.5 or greater, the result is rounded up to
 * the next integer.
 *      - SquareRootRounded(2) --> 1
 *      - SquareRootRounded(3) --> 2
 *      - SquareRootRounded(4) --> 2
 *      - SquareRootRounded(6) --> 2
 *      - SquareRootRounded(7) --> 3
 *      - SquareRootRounded(8) --> 3
 *      - SquareRootRounded(9) --> 3
 *
 * \param[in] a_nInput - unsigned integer for which to find the square root
 *
 * \return Integer square root of the input value.
 */
uint32_t SquareRootRounded(uint32_t a_nInput)
{
    uint32_t op  = a_nInput;
    uint32_t res = 0;
    uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type


    // "one" starts at the highest power of four <= than the argument.
    while (one > op)
    {
        one >>= 2;
    }

    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res +  2 * one;
        }
        res >>= 1;
        one >>= 2;
    }

    /* Do arithmetic rounding to nearest integer */
    if (op > res)
    {
        res++;
    }

    return res;
}

score 9 · Accepted Answer

一般的なアプローチの1つは、二分法です。

hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    else
        hi = mid
    mid = ( hi + lo ) / 2
    mid2 = mid*mid

そのようなものはかなりうまくいくはずです。log2（number）テストを行い、log2（number）を乗算および除算します。除算は2による除算であるため、。に置き換えることができます>>。

終了条件が特定されていない可能性があるため、さまざまな整数をテストして、2による除算が2つの偶数の値の間で誤って振動しないことを確認してください。それらは1以上異なります。

score 8 · Accepted Answer

高速ではありませんが、小さくてシンプルです。

int isqrt(int n)
{
  int b = 0;

  while(n >= 0)
  {
    n = n - b;
    b = b + 1;
    n = n - b;
  }

  return b - 1;
}

score 8 · Accepted Answer

ほとんどのアルゴリズムは単純なアイデアに基づいていますが、必要以上に複雑な方法で実装されていることがわかりました。http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (Ross M. Fosler 著)からアイデアを取り入れ、非常に短い C 関数にしました。

uint16_t int_sqrt32(uint32_t x)
{
    uint16_t res=0;
    uint16_t add= 0x8000;   
    int i;
    for(i=0;i<16;i++)
    {
        uint16_t temp=res | add;
        uint32_t g2=temp*temp;      
        if (x>=g2)
        {
            res=temp;           
        }
        add>>=1;
    }
    return res;
}

これは、私の blackfin で 5 サイクル/ビットにコンパイルされます。while ループの代わりに for ループを使用すると、コンパイルされたコードは一般的に高速になり、決定論的な時間の追加の利点が得られると思います (ただし、コンパイラが if ステートメントを最適化する方法にある程度依存します)。

score 7 · Accepted Answer

sqrt 関数の使用方法に依存します。高速バージョンを作成するために、いくつかの近似をよく使用します。たとえば、 vector のモジュールを計算する必要がある場合:

Module = SQRT( x^2 + y^2)

私が使う：

Module = MAX( x,y) + Min(x,y)/2

次のように 3 つまたは 4 つの命令でコーディングできます。

If (x > y )
  Module  = x + y >> 1;
Else
   Module  = y + x >> 1;

score 4 · Accepted Answer

このウィキペディアの記事で説明されている 2 桁ごとのアルゴリズムに似たものに落ち着きました。

score 1 · Accepted Answer

これは、整数 log_2 とニュートン法を組み合わせてループのないアルゴリズムを作成する Java のソリューションです。欠点として、分割が必要です。コメント行は、64 ビットアルゴリズムにアップコンバートするために必要です。

private static final int debruijn= 0x07C4ACDD;
//private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;

static
{
  for(int x= 0; x<32; ++x)
  {
    final long v= ~( -2L<<(x));
    DeBruijnArray[(int)((v*debruijn)>>>27)]= x; //>>>58
  }
  for(int x= 0; x<32; ++x)
    SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));
}

public static int sqrt(final int num)
{
  int y;
  if(num==0)
    return num;
  {
    int v= num;
    v|= v>>>1; // first round up to one less than a power of 2 
    v|= v>>>2;
    v|= v>>>4;
    v|= v>>>8;
    v|= v>>>16;
    //v|= v>>>32;
    y= SQRT[(v*debruijn)>>>27]; //>>>58
  }
  //y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  y= (y+num/y)>>>1;
  return y*y>num?y-1:y;
}

仕組み: 最初の部分は、約 3 ビットの精度の平方根を生成します。行y= (y+num/y)>>1;は精度をビット単位で 2 倍にします。最後の行は、生成される屋根のルートを削除します。

score 0 · Accepted Answer

この方法は長い除算に似ています。根の次の桁を推測し、減算を行い、差が特定の基準を満たしている場合はその桁を入力します。バイナリバージョンでは、次の桁の選択肢は 0 または 1 しかないため、常に 1 を推測し、減算を行い、差がマイナスでない限り 1 を入力します。

http://www.realitypixels.com/turk/opensource/index.html#FractSqrt

score -2 · Accepted Answer

RGB ガンマ圧縮用に 16 ビット sqrt を設計しました。上位 8 ビットに基づいて、3 つの異なるテーブルにディスパッチします。欠点: テーブルに約 1 キロバイトを使用し、正確な sqrt が不可能な場合は丸めが予測不能になり、最悪の場合、単一の乗算を使用します (ただし、入力値が非常に少ない場合のみ)。

static uint8_t sqrt_50_256[] = {
  114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,
  133,134,135,136,137,138,139,140,141,142,143,143,144,145,146,147,148,149,150,
  150,151,152,153,154,155,155,156,157,158,159,159,160,161,162,163,163,164,165,
  166,167,167,168,169,170,170,171,172,173,173,174,175,175,176,177,178,178,179,
  180,181,181,182,183,183,184,185,185,186,187,187,188,189,189,190,191,191,192,
  193,193,194,195,195,196,197,197,198,199,199,200,201,201,202,203,203,204,204,
  205,206,206,207,207,208,209,209,210,211,211,212,212,213,214,214,215,215,216,
  217,217,218,218,219,219,220,221,221,222,222,223,223,224,225,225,226,226,227,
  227,228,229,229,230,230,231,231,232,232,233,234,234,235,235,236,236,237,237,
  238,238,239,239,240,241,241,242,242,243,243,244,244,245,245,246,246,247,247,
  248,248,249,249,250,250,251,251,252,252,253,253,254,254,255,255
};

static uint8_t sqrt_0_10[] = {
  1,2,3,3,4,4,5,5,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,9,10,10,10,10,10,11,11,11,
  11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,
  15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,18,18,
  18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,
  20,20,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,23,
  23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,24,24,25,25,
  25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,27,
  27,27,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,28,28,28,
  28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,
  30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,
  32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,
  33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,
  35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,
  36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,
  37,37,37,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,
  39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,
  40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,
  41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,
  42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,
  43,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,45,45,
  45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,46,46,46,46,
  46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,47,47,47,47,47,47,
  47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,
  48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,
  49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,
  50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,
  51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,
  52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,53,53
};

static uint8_t sqrt_11_49[] = {
  54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,0,76,77,78,
  0,79,80,81,82,83,0,84,85,86,0,87,88,89,0,90,0,91,92,93,0,94,0,95,96,97,0,98,0,
  99,100,101,0,102,0,103,0,104,105,106,0,107,0,108,0,109,0,110,0,111,112,113
};

uint16_t isqrt16(uint16_t v) {
  uint16_t a, b;
  uint16_t h = v>>8;
  if (h <= 10) return v ? sqrt_0_10[v>>2] : 0;
  if (h >= 50) return sqrt_50_256[h-50];
  h = (h-11)<<1;
  a = sqrt_11_49[h];
  b = sqrt_11_49[h+1];
  if (!a) return b;
  return b*b > v ? a : b;
}

私はそれを log2 ベースの sqrt と比較し__builtin_clzましsqrtfた(int)sqrtf((float)i)。そしてかなり奇妙な結果を得ました：

$ gcc -O3 test.c -o test && ./test 
isqrt16: 6.498955
sqrtf: 6.981861
log2_sqrt: 61.755873

Clangは命令sqrtfへの呼び出しをコンパイルしましたsqrtss。これは、その table とほぼ同じ速さですsqrt。学んだ教訓: x86 では、コンパイラは十分な速さを提供できますsqrt。これは、自分で思いつくよりも 10% も遅くなく、多くの時間を無駄にするか、醜いビット単位のハックを使用すると 10 倍速くなる可能性があります。それでもsqrtssカスタム関数よりも少し遅いので、これらの 5% が本当に必要な場合は取得できます。たとえば、ARM にはがないsqrtssため、log2_sqrt はそれほど遅れてはいけません。

FPU が利用可能な x86 では、古い Quake ハックが整数平方根を計算する最速の方法のようです。この表や FPU の sqrtss よりも 2 倍高速です。

embedded - ARMThumb2の効率的な整数平方根アルゴリズムを探しています

14 に答える 14

Related

Reference