是否可以在不獲取二進制表示的情況下實作nextafter？-有解無憂

通常nextafter通過以下方式實作：

double nextafter(double x, double y)
{
    // handle corner cases

    int delta = ((x > 0) == (x < y)) ? 1 : -1;
    unsigned long long mant = __mant(x);  // get mantissa as int
    mant  = delta;
    ...
}

這里，使用獲得二進制表示__mant(x)。

出于好奇：是否可以在nextafter不獲得二進制表示的情況下實作？例如，使用一系列算術浮點運算。

uj5u.com熱心網友回復：

下面的代碼nextafter在上升方向上實作了 IEEE-754 演算法的有限值，并使用舍入到最近關系到偶數。NaN、無窮大和下降方向的處理是顯而易見的。

在不假設 IEEE-754 或舍入到最近的情況下，浮點屬性在 C 2018 5.2.4.2.2 中有足夠好的特征，我們可以通過這種方式實作 nextafter（再次在上升方向）：

如果輸入是NaN，則回傳，如果是信令NaN則報錯。
如果輸入是?∞，則回傳-DBL_MAX。
如果輸入為-DBL_TRUE_MIN，則回傳零。
如果輸入為零，則回傳 DBL_TRUE_MIN。
如果輸入是 DBL_MAX，則回傳 ∞。
如果輸入是 ∞，則回傳 ∞。（請注意，這在完整nextafter(x, y)實作中永遠不會發生，因為它會將第一個引數移動到第二個引數的方向，所以我們永遠不會從 ∞ 上升，因為我們永遠不會收到大于 ∞ 的第二個引數。）
否則，如果它是正數，則使用logb來查找指數e。如果e小于DBL_MIN，則回傳輸入加號DBL_TRUE_MIN（次法線和最低法線的 ULP）。如果e不小于DBL_MIN，則回傳輸入加號scalb(1, e 1 - DBL_MANT_DIG)（輸入的特定 ULP）。舍入方法無關緊要，因為這些添加是準確的。
否則，輸入為負。除非輸入恰好是FLT_RADIX（輸入等于scalb(1, e)）的冪，否則使用上面的方法，將的第二個引數減scalb一（因為此nextafter步驟從較大的指數過渡到較低的指數）。

請注意，FLT_RADIX上面是正確的；沒有DBL_RADIX; 所有浮點格式都使用相同的基數。

如果您想將logb和scalb視為操作浮點表示的函式，則可以將它們替換為普通算術。log可以找到可以快速改進為真實指數的快速近似值，并且scalb可以通過多種方式實作，可能只是查表。如果log仍然令人反感，那么試驗比較就足夠了。

上面處理有或沒有次法線的格式，因為如果支持次法線，它會隨著遞減而進入它們，如果不支持次法線，最小法線幅度是DBL_TRUE_MIN，所以它在上面被認為是我們步進的點接下來歸零。

有一個警告；C 標準允許“不確定”一個實作是否支持次正規，“如果浮點運算不能一致地將次正規表示解釋為零，也不能解釋為非零。” 在那種情況下，我沒有看到標準指定了標準的nextafter作用，所以我們沒有什么可以在我們的實作中匹配它。假設有時支持次正規，DBL_TRUE_MIN必須是次正規值，并且以上將嘗試作業，好像次正規支持當前處于打開狀態（例如，重繪為零已關閉），如果關閉，您將獲得任何您想要的得到。

#include <float.h>
#include <math.h>


/*  Return the next floating-point value after the finite value q.

    This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
    Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
    05.12_, Faculty for Information and Communication Sciences, Hamburg
    University of Technology, November 13, 2005.
    
    IEEE-754 and the default rounding mode,
    round-to-nearest-ties-to-even, may be required.
*/
double NextAfter(double q)
{
    /*  Scale is .625 ULP, so multiplying it by any significand in [1, 2)
        yields something in [.625 ULP, 1.25 ULP].
    */
    static const double Scale = 0.625 * DBL_EPSILON;

    /*  Either of the following may be used, according to preference and
        performance characteristics.  In either case, use a fused multiply-add
        (fma) to add to q a number that is in [.625 ULP, 1.25 ULP].  When this
        is rounded to the floating-point format, it must produce the next
        number after q.
    */
#if 0
    // SmallestPositive is the smallest positive floating-point number.
    static const double SmallestPositive = DBL_EPSILON * DBL_MIN;

    if (fabs(q) < 2*DBL_MIN)
        return q   SmallestPositive;

    return fma(fabs(q), Scale, q);
#else
    return fma(fmax(fabs(q), DBL_MIN), Scale, q);
#endif
}


#if defined CompileMain


#include <stdio.h>
#include <stdlib.h>


#define NumberOf(a) (sizeof (a) / sizeof *(a))


int main(void)
{
    int status = EXIT_SUCCESS;

    static const struct { double in, out; } cases[] =
    {
        {  INFINITY,                 INFINITY                },
        {  0x1.fffffffffffffp1023,   INFINITY                },
        {  0x1.ffffffffffffep1023,   0x1.fffffffffffffp1023  },
        {  0x1.ffffffffffffdp1023,   0x1.ffffffffffffep1023  },
        {  0x1.ffffffffffffcp1023,   0x1.ffffffffffffdp1023  },
        {  0x1.0000000000003p1023,   0x1.0000000000004p1023  },
        {  0x1.0000000000002p1023,   0x1.0000000000003p1023  },
        {  0x1.0000000000001p1023,   0x1.0000000000002p1023  },
        {  0x1.0000000000000p1023,   0x1.0000000000001p1023  },

        {  0x1.fffffffffffffp1022,   0x1.0000000000000p1023  },

        {  0x1.fffffffffffffp1,      0x1.0000000000000p2     },
        {  0x1.ffffffffffffep1,      0x1.fffffffffffffp1     },
        {  0x1.ffffffffffffdp1,      0x1.ffffffffffffep1     },
        {  0x1.ffffffffffffcp1,      0x1.ffffffffffffdp1     },
        {  0x1.0000000000003p1,      0x1.0000000000004p1     },
        {  0x1.0000000000002p1,      0x1.0000000000003p1     },
        {  0x1.0000000000001p1,      0x1.0000000000002p1     },
        {  0x1.0000000000000p1,      0x1.0000000000001p1     },

        {  0x1.fffffffffffffp-1022,  0x1.0000000000000p-1021 },
        {  0x1.ffffffffffffep-1022,  0x1.fffffffffffffp-1022 },
        {  0x1.ffffffffffffdp-1022,  0x1.ffffffffffffep-1022 },
        {  0x1.ffffffffffffcp-1022,  0x1.ffffffffffffdp-1022 },
        {  0x1.0000000000003p-1022,  0x1.0000000000004p-1022 },
        {  0x1.0000000000002p-1022,  0x1.0000000000003p-1022 },
        {  0x1.0000000000001p-1022,  0x1.0000000000002p-1022 },
        {  0x1.0000000000000p-1022,  0x1.0000000000001p-1022 },

        {  0x0.fffffffffffffp-1022,  0x1.0000000000000p-1022 },
        {  0x0.ffffffffffffep-1022,  0x0.fffffffffffffp-1022 },
        {  0x0.ffffffffffffdp-1022,  0x0.ffffffffffffep-1022 },
        {  0x0.ffffffffffffcp-1022,  0x0.ffffffffffffdp-1022 },
        {  0x0.0000000000003p-1022,  0x0.0000000000004p-1022 },
        {  0x0.0000000000002p-1022,  0x0.0000000000003p-1022 },
        {  0x0.0000000000001p-1022,  0x0.0000000000002p-1022 },
        {  0x0.0000000000000p-1022,  0x0.0000000000001p-1022 },

        { -0x1.fffffffffffffp1023,  -0x1.ffffffffffffep1023  },
        { -0x1.ffffffffffffep1023,  -0x1.ffffffffffffdp1023  },
        { -0x1.ffffffffffffdp1023,  -0x1.ffffffffffffcp1023  },
        { -0x1.0000000000004p1023,  -0x1.0000000000003p1023  },
        { -0x1.0000000000003p1023,  -0x1.0000000000002p1023  },
        { -0x1.0000000000002p1023,  -0x1.0000000000001p1023  },
        { -0x1.0000000000001p1023,  -0x1.0000000000000p1023  },

        { -0x1.0000000000000p1023,  -0x1.fffffffffffffp1022  },

        { -0x1.0000000000000p2,     -0x1.fffffffffffffp1     },
        { -0x1.fffffffffffffp1,     -0x1.ffffffffffffep1     },
        { -0x1.ffffffffffffep1,     -0x1.ffffffffffffdp1     },
        { -0x1.ffffffffffffdp1,     -0x1.ffffffffffffcp1     },
        { -0x1.0000000000004p1,     -0x1.0000000000003p1     },
        { -0x1.0000000000003p1,     -0x1.0000000000002p1     },
        { -0x1.0000000000002p1,     -0x1.0000000000001p1     },
        { -0x1.0000000000001p1,     -0x1.0000000000000p1     },

        { -0x1.0000000000000p-1021, -0x1.fffffffffffffp-1022 },
        { -0x1.fffffffffffffp-1022, -0x1.ffffffffffffep-1022 },
        { -0x1.ffffffffffffep-1022, -0x1.ffffffffffffdp-1022 },
        { -0x1.ffffffffffffdp-1022, -0x1.ffffffffffffcp-1022 },
        { -0x1.0000000000004p-1022, -0x1.0000000000003p-1022 },
        { -0x1.0000000000003p-1022, -0x1.0000000000002p-1022 },
        { -0x1.0000000000002p-1022, -0x1.0000000000001p-1022 },
        { -0x1.0000000000001p-1022, -0x1.0000000000000p-1022 },

        { -0x1.0000000000000p-1022, -0x0.fffffffffffffp-1022 },
        { -0x0.fffffffffffffp-1022, -0x0.ffffffffffffep-1022 },
        { -0x0.ffffffffffffep-1022, -0x0.ffffffffffffdp-1022 },
        { -0x0.ffffffffffffdp-1022, -0x0.ffffffffffffcp-1022 },
        { -0x0.0000000000004p-1022, -0x0.0000000000003p-1022 },
        { -0x0.0000000000003p-1022, -0x0.0000000000002p-1022 },
        { -0x0.0000000000002p-1022, -0x0.0000000000001p-1022 },
        { -0x0.0000000000001p-1022, -0x0.0000000000000p-1022 },
    };

    for (int i = 0; i < NumberOf(cases);   i)
    {
        double in = cases[i].in, expected = cases[i].out;
        double observed = NextAfter(in);
        printf("NextAfter(%a) = %a.\n", in, observed);
        if (! (observed == expected))
        {
            printf("\tError, expected %a.\n", expected);
            status = EXIT_FAILURE;
        }
    }

    return status;
}


#endif  // defined CompileMain

uj5u.com熱心網友回復：

考慮到 FP 可能/可能不支持次法線、無窮大、大多數值具有唯一值（例如使用 2float表示double）、支持 /= 0、沒有 FP 編碼的內在知識，使用假設mant = delta;是下一個值導致可移植性失敗 - 即使使用二進制操作。僅使用 FP 操作需要對 FP 編碼進行許多假設。

我認為更有用的方法是發布使用“一系列算術浮點運算”的候選代碼，然后詢問 1）它失敗的條件 2）如何改進？

uj5u.com熱心網友回復：

我只能說，這只有在融合乘加 (FMA) 可用時才有可能。對于我下面的示例 ISO-C99 實作，我使用float映射到 IEEE-754，binary32因為我可以通過這種方式獲得更好的測驗覆寫率。假設 C 的 IEEE-754 系結有效（因此 C 浮點型別系結到 IEEE-754 二進制型別，支持次正規等），有效的舍入模式是舍入到最近或偶數, 和 ISO-C 標準 ( FE_INEXACT, FE_OVERFLOW, FE_UNDERFLOW)指定的例外信號要求被放棄（傳遞掩碼回應）。

代碼可能比它需要的更復雜；我只是簡單地將各種運算元類分開并一一處理。我使用具有最嚴格浮點設定的 Intel 編譯器進行編譯。實施nextafterf()從英特爾數學庫作為黃金參考。我認為我的實作存在與英特爾庫實作中的錯誤相匹配的錯誤極不可能，但顯然并非不可能。

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <float.h>
#include <string.h>
#include <fenv.h>
#include <math.h>

float my_nextafterf (float a, float b)
{
    const float FP32_MIN_NORMAL = 0x1.000000p-126f;
    const float FP32_MAX_NORMAL = 0x1.fffffep 127f;
    const float FP32_EPSILON = 0x1.0p-23f;
    const float FP32_ONE = 1.0f;
    const float FP32_HALF = 0.5f;
    const float FP32_SUBNORM_SCALE = FP32_ONE / FP32_MIN_NORMAL;
    const float FP32_INC = (FP32_ONE   FP32_EPSILON) * FP32_EPSILON * FP32_HALF;
    const float FP32_INT_SCALE = FP32_ONE / FP32_EPSILON;

    float r;
    if ((!(fabsf(a) <= INFINITY)) || (!(fabsf(b) <= INFINITY))) { // unordered
        r = a   b;
    }
    else if (a == b) { // equal
        r = b;
    }
    else if (fabsf (a) == INFINITY) { // infinity
        r = (a >= 0) ? FP32_MAX_NORMAL : (-FP32_MAX_NORMAL);
    }
    else if (fabsf (a) >= FP32_MIN_NORMAL) { // normal
        float factor = ((a >= 0) != (a > b)) ? FP32_INC : (-FP32_INC);
        r = fmaf (factor, a, a);
    } else { // subnormal or zero
        float scal = (a >= 0) ? FP32_INT_SCALE : (-FP32_INT_SCALE);
        float incr = ((a >= 0) != (a > b)) ? FP32_ONE : (-FP32_ONE);
        r = (a * scal * FP32_SUBNORM_SCALE   incr) / scal / FP32_SUBNORM_SCALE;
    }
    return r;
}

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

// Fixes via: Greg Rose, KISS: A Bit Too Simple. http://eprint.iacr.org/2011/007
static uint32_t kiss_z = 362436069;
static uint32_t kiss_w = 521288629;
static uint32_t kiss_jsr = 123456789;
static uint32_t kiss_jcong = 380116160;
#define znew (kiss_z=36969*(kiss_z&0xffff) (kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&0xffff) (kiss_w>>16))
#define MWC  ((znew<<16) wnew)
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17),kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong 13579)
#define KISS ((MWC^CONG) SHR3)

int main (void)
{
    float a, b, res, ref;
    uint32_t ia, ib, ires, iref;
    unsigned long long int count = 0;
    const uint32_t FP32_QNAN_BIT = 0x00400000;
    const uint32_t FP32_SIGN_BIT = 0x80000000;
    const uint32_t FP32_INFINITY = 0x7f800000;
    
    printf ("Testing nextafterf()\n");

    while (1) {
        ia = KISS;
        ib = KISS;
        /* increase likelihood of zeros, infinities, equality */
        if (!(KISS & 0xfff)) ia = ia & FP32_SIGN_BIT;
        if (!(KISS & 0xfff)) ib = ib & FP32_SIGN_BIT;
        if (!(KISS & 0xfff)) ia = ia | FP32_INFINITY;
        if (!(KISS & 0xfff)) ib = ib | FP32_INFINITY;
        if (!(KISS & 0xffffff)) ib = ia;
        a = uint32_as_float (ia);
        b = uint32_as_float (ib);

        res = my_nextafterf (a, b);
        ref = nextafterf (a, b);
        ires = float_as_uint32 (res);
        iref = float_as_uint32 (ref);

        if (ires != iref) {
            /* if both 'from' and 'to' are NaN, result may be either NaN, quietened */
            if (!(isnan (a) && isnan (b) && 
                  ((ires == (ia | FP32_QNAN_BIT)) || (ires == (ib | FP32_QNAN_BIT))))) {
                printf ("error: a=x b=x  res=x  ref=x\n", ia, ib, ires, iref);
            }
        }
        count  ;
        if (!(count & 0xffffff)) printf ("\rcount = 0x%llx", count);
    }
    return EXIT_SUCCESS;
}

轉載請註明出處，本文鏈接：https://www.uj5u.com/houduan/401801.html

標籤：C 算法浮点库

上一篇：從用戶那里獲取輸入，直到它是負數，然后計算它們的總和。（使用while回圈）

下一篇：在HTML表單提交上通過Ajax檢查Wordpress中是否存在電子郵件