c# - Efficient way to round a double-precision number to a lower precision given as a number of bits

Question

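In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike the usual rounding, I want to round to a number of significant bits. Large numbers will therefore change much more in absolute terms than small numbers, but they will tend to change by the same proportion. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first so that the value rounds up.

I prefer "halfway" numbers to be rounded up.

If it were an integer type, a possible algorithm would be: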

  1. Find: zero-based index of the most significant binary digit set H.
  2. Compute: B = H - P, 
       where P is the number of significant digits of precision to round
       and B is the binary digit to start rounding, where B = 0 is the ones place, 
       B = 1 is the twos place, etc. 
  3. Add: x = x + 2^B 
       This will force a carry if necessary (we round halfway values up).
  4. Zero out: x = x - (x mod 2^(B+1)).
       This clears the B place and all lower digits.
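
The four steps above can be sketched in C# for unsigned integers. This is only an illustration: the class and method names are made up here, and `BitOperations.Log2` assumes .NET Core 3.0 or later (on older frameworks a shift loop would have to stand in for it). Step 4 clears bit B and everything below it with a mask.

```csharp
using System;
using System.Numerics;

public static class IntegerRounding
{
    // Hypothetical helper: rounds x to `digits` significant binary digits,
    // with halfway values rounded up.
    public static ulong RoundToSignificantBits(ulong x, int digits)
    {
        if (digits < 1)
            throw new ArgumentOutOfRangeException(nameof(digits));
        if (x == 0)
            return 0;

        int h = BitOperations.Log2(x);   // step 1: index of the highest set bit
        int b = h - digits;              // step 2: bit position where rounding happens
        if (b < 0)
            return x;                    // x already fits in `digits` bits

        ulong rounded = x + (1UL << b);  // step 3: wraps around if x is close to ulong.MaxValue
        ulong lowMask = (1UL << (b + 1)) - 1;
        return rounded & ~lowMask;       // step 4: clear bit b and everything below it
    }
}
```

With `digits` set to 3, this maps 9 to 10, 11 to 12 and 15 to 16, and leaves 7 and 8 unchanged.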

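The problem is finding an efficient way to locate the highest bit set. If I were using integers, there are cool bit hacks for finding the MSB. I would rather not call Round(Log2(x)) if I can help it. This function will be called many millions of times.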
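
For a double, the position of the most significant bit does not need a search or a logarithm, because it is already stored in the exponent field. A minimal sketch, assuming finite, normal, nonzero inputs (the class and method names are invented for illustration):

```csharp
using System;

public static class DoubleBits
{
    // Hypothetical helper: returns floor(log2(|d|)) for a normal, finite, nonzero double.
    public static int HighestBitIndex(double d)
    {
        long bits = BitConverter.DoubleToInt64Bits(d);
        int biasedExponent = (int)((bits >> 52) & 0x7FF); // 11-bit exponent field
        // 0 means zero or subnormal, 0x7FF means infinity or NaN; neither is handled here.
        return biasedExponent - 1023;                      // remove the IEEE 754 bias
    }
}
```

For example, `DoubleBits.HighestBitIndex(8.0)` is 3 and `DoubleBits.HighestBitIndex(0.75)` is -1.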

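Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It works for C++. I am using C#.

Update:

This is the code I am using (modified from what the answerer supplied):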

/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// 
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
/// 
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 =>10000  16
///     16   10000 =>10000  16
///     
/// This is different from rounding in that we are specifying the place where rounding occurs as the distance to the right
/// in binary digits from the highest bit set, not the distance to the left from the zero bit.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}

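Update 2: Dekker's algorithm

Thanks to the other respondent, I derived this from Dekker's algorithm. It rounds to the nearest value instead of truncating as the code above does, and it uses only safe code.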

private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}

Update 2: Timing

I ran tests and found that Dekker's algorithm is more than twice as fast.

Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)

score 8 · Accepted Answer

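Dekker's algorithm splits a floating-point number into a high part and a low part. If the significand has s bits (53 in IEEE 754 64-bit binary), then *x0 receives the high s - b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time (for example, the constant 43), you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may work too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).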

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}
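
To see the tie-breaking behaviour in C#, here is a small sketch of the high part of Split with Scale + 1 built at run time. The class and method names are hypothetical, and the sketch assumes that arithmetic is carried out in plain IEEE 754 double precision with round-to-nearest and that the multiplication does not overflow.

```csharp
using System;

public static class DekkerRounding
{
    // Hypothetical helper: keeps `digits` significant bits of d, rounding to nearest.
    public static double RoundToBits(double d, int digits)
    {
        if (digits < 1 || digits > 52)
            throw new ArgumentOutOfRangeException(nameof(digits));

        int b = 53 - digits;                           // number of low bits to drop
        double scalePlusOne = (double)(1L << b) + 1.0; // 2^b + 1, exact for b <= 52
        double t = d * scalePlusOne;                   // assumed not to overflow
        return t - (t - d);                            // high part, same value as *x0 in Split
    }
}
```

With 3 bits, the halfway values 9 and 13 go to the neighbour whose significand is even (8 and 12), while 11 and 15 go to 12 and 16. A value that is not halfway, such as 1412.04125 at 10 bits, simply goes to the nearest candidate, 1412.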

score 2 · Answer

Interesting... I have never heard of a need for this, but I think you can "do it" with some funky unsafe code...

void Main()
{
    // how many bits you want "saved"
    var maxBits = 20;

    // create a mask like binary 1111000 where # of 1's == maxBits
    var shift = (sizeof(int) * 8) - maxBits;
    var maxBitsMask = (0xffffffff >> shift) << shift;

    // some floats
    var floats = new []{ 1.04125f, 2.19412347f, 3.1415926f};
    foreach (var f in floats)
    {
        var localf = f;
        unsafe
        {
            // float -> fixed point (sorta); uint keeps the masked value 32 bits wide
            uint toInt = *(uint*)(&localf);
            // mask off your least-sig bits
            var modInt = toInt & maxBitsMask;
            // fixed point -> float (sorta)
            localf = *(float*)(&modInt);
        }
        Console.WriteLine("Was {0}, now {1}", f, localf);
    }
}

And with doubles:

void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (0xffffffffffffffff >> shift) << shift;
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var local = d;
        unsafe
        {
            var toLong = *(ulong*)(&local);
            var modLong = toLong & maxBitsMask;
            local = *(double*)(&modLong);
        }
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}

Oh... got unaccepted. :)

For completeness, here is the same thing using Jeppe's approach, which avoids unsafe code:

void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (long)((0xffffffffffffffff >> shift) << shift);
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var local = d;
        var asLong = BitConverter.DoubleToInt64Bits(d);
        var modLong = asLong & maxBitsMask;
        local = BitConverter.Int64BitsToDouble(modLong);
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}
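
None of the snippets above round halfway cases upward, which is what the question asked for. One possible way to get that with BitConverter, offered as a sketch rather than as part of any answer, is to add half of the lowest kept bit to the raw bit pattern before masking. The class and method names are hypothetical. Ties round away from zero because the sign is stored separately, and subnormal inputs are rounded at a fixed bit position rather than relative to their highest set bit.

```csharp
using System;

public static class BitRounding
{
    // Hypothetical helper: keeps `digits` significant bits of d,
    // rounding halfway cases away from zero.
    public static double RoundHalfUp(double d, int digits)
    {
        if (digits < 1 || digits > 53)
            throw new ArgumentOutOfRangeException(nameof(digits));
        if (digits == 53 || double.IsNaN(d) || double.IsInfinity(d))
            return d;                              // nothing to drop, or not a finite number

        int shift = 53 - digits;                   // stored fraction bits to clear
        long bits = BitConverter.DoubleToInt64Bits(d);
        bits += 1L << (shift - 1);                 // add half of the lowest kept bit;
                                                   // a carry may bump the exponent, which is intended
        bits &= ~((1L << shift) - 1);              // clear the dropped bits
        return BitConverter.Int64BitsToDouble(bits); // becomes infinity when rounding up past double.MaxValue
    }
}
```

With `digits` set to 3, this gives 9 to 10, 11 to 12, 13 to 14 and 15 to 16, matching the table in the question's doc comment.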
