pajenegod's blog

By pajenegod, history, 4 years ago, In English

Introduction

I'm writing this blog because of the large number of blogs asking why they get strange floating point arithmetic behaviour in C++. For example:

"WA using GNU C++17 (64) and AC using GNU C++17" https://codeforces.com/blog/entry/78094

"The curious case of the pow function" https://codeforces.com/blog/entry/21844

"Why does this happen?" https://codeforces.com/blog/entry/51884

"Why can this code work strangely?" https://codeforces.com/blog/entry/18005

and many many more.

Example

Here is a simple example of the kind of weird behaviour I'm talking about:

Example showing the issue
Output for 32 bit g++
Output for 64 bit g++

Looking at this example, the output one would expect from $$$10 * 10 - 10^{-15}$$$ is exactly $$$100$$$, since $$$100$$$ is the closest representable double to the exact result. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden excess precision causing the output to be $$$100$$$ only sometimes(???).
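The collapsed example is not reproduced in this text. A minimal program in the same spirit (my own sketch, so the exact variables and output format are guesses rather than the blog's code) would be:

    #include <cstdio>

    int main() {
        volatile double a = 10;        // volatile so the compiler cannot fold the arithmetic at compile time
        double res = a * a - 1e-15;    // mathematically 99.999999999999999
        // With FLT_EVAL_METHOD == 0, res is rounded to the nearest double, which is exactly 100.
        // With FLT_EVAL_METHOD == 2, res may still sit in an 80 bit x87 register holding
        // 99.999999999999999..., so the comparison below can print 0 even though the
        // printed value of res still looks like 100.
        printf("%d\n", res == 100);
        printf("%.17g\n", res);
    }

Whether the excess precision actually survives into the comparison depends on when the compiler decides to spill res to memory, which is exactly the "sometimes(???)" part.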

Explanation

In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support. You can detect which one is being used from the value of FLT_EVAL_METHOD, found in cfloat. In mode 2 (which is what 32 bit g++ uses by default) all floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, which can / will cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.
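Checking which mode your compiler uses is a one-liner; a minimal sketch:

    #include <cstdio>
    #include <cfloat>

    int main() {
        // 0: each operation is evaluated in its own type (no excess precision)
        // 2: all operations are internally evaluated in long double
        printf("FLT_EVAL_METHOD = %d\n", FLT_EVAL_METHOD);
    }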

Detecting and turning on/off excess precision

Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)

Test for detecting excess precision

If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation: MSVC (C++17), Clang and 64 bit g++17 all use mode 0 and print 0, while 32 bit g++11, g++14 and g++17, as expected, all use mode 2 and print something like 8e-17.
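The collapsed test is not included in this text. A test with the same shape, double b = a*a, but with my own choice of a (so the excess comes out as roughly 5.6e-17 rather than the 8e-17 mentioned above) could look like:

    #include <cstdio>

    int main() {
        volatile double a = 1.0 + 1.0/134217728.0;   // 1 + 2^-27, exactly representable
        double b = a * a;                            // exact value is 1 + 2^-26 + 2^-54
        // Mode 0: b is correctly rounded to the double 1 + 2^-26, so this prints 0.
        // Mode 2: b may still sit in an 80 bit register carrying the extra 2^-54
        //         (about 5.6e-17), so this prints a tiny nonzero value instead.
        printf("%g\n", b - (1.0 + 1.0/67108864.0));
    }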

The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to use the SSE instruction set (version 2 or later) on top of this, which supports both float and double arithmetic natively. On GCC you can turn this on with the flags -mfpmath=sse -msse2. This will not change the value of FLT_EVAL_METHOD, but it will effectively turn off excess precision, see 81993714.

It is also possible to effectively turn on excess precision with -mfpmath=387, see 81993724.
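For local experiments these flags go straight on the compile command; a typical pair of invocations (file names are placeholders of mine):

    g++ -m32 -O2 -mfpmath=sse -msse2 test.cpp -o test_sse   # SSE arithmetic: effectively no excess precision
    g++ -m32 -O2 -mfpmath=387 test.cpp -o test_x87          # x87 arithmetic: excess precision on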

Fun exercise

Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this:

Try to hack this

Conclusion / TLDR

32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue.

»
4 years ago, # |

So what is the suggested solution for CP contests?

#pragma GCC target("fpmath=387") // Turns on excess precision => This seems very brittle and platform-dependent.

Does using long double for all calculations also work?

  • »
    »
    4 years ago, # ^ |

    Solution: Don't do stupid things with floating point. Don't check floating point for equality. Be aware of how floating point behaves with respect to precision.

    If you need extended precision you can explicitly use it, but in most cases it won't be needed if you take some care with how you do floating point operations (or avoid floating point when it's not needed).
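    For instance, the standard alternative to exact equality is a tolerance check. A minimal sketch (the helper name approx_equal and the tolerance 1e-9 are my own choices, not something prescribed in this thread):

        #include <cmath>
        #include <cstdio>

        // Treat two doubles as equal if they are within some problem-appropriate tolerance.
        bool approx_equal(double a, double b, double eps = 1e-9) {
            return std::fabs(a - b) <= eps;
        }

        int main() {
            // Prints 1 whether or not excess precision is in play.
            printf("%d\n", approx_equal(10.0 * 10.0 - 1e-15, 100.0));
        }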

    • »
      »
      »
      4 years ago, # ^ |

      I explicitly use -Wfloat-equal to ensure I never end up equating floating points.

      However, at least in the example codes in the blog, a very simple subtraction operation is done. While I am aware of floating point issues, I would say that extended precision is preferable to thinking about whether this particular line may lead to WA later.

      • »
        »
        »
        »
        4 years ago, # ^ |

        However, at least in the example codes in the blog, a very simple subtraction operation is done.

        Is there any substantial difference between a - b == 0 and a == b? The code in the blog does a - b and realizes it can give either 0 or stupidly close to zero. This is the exact same phenomenon as trying to compare two floats. I.e. not doing equality checks resolves it.

  • »
    »
    4 years ago, # ^ |

    Yes, using long double everywhere works; doing it like that means you won't have to worry about excess precision. In general, simply using 64 bit C++ also means you don't have to worry about excess precision.
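    In practice that usually just means making long double the working type everywhere; a minimal sketch (mine, not the commenter's code):

        #include <cstdio>

        typedef long double ld;   // do all floating point work in long double

        int main() {
            ld res = 10.0L * 10.0L - 1e-15L;
            // The variable type matches the type the hardware computes in,
            // so there is no hidden excess precision to worry about.
            printf("%.20Lg\n", res);
        }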

    I wrote this blog to inform people about how floating point arithmetic is done in C++. That does not mean I think excess precision is a good idea. The way I see it, excess precision is a thing of the past, and I'm happy I don't have to bother with it when I use 64 bit C++.

»
4 years ago, # |

Well explained! But the real problem is people comparing floats using a == b :)

»
4 years ago, # |

Doesn't this violate IEEE 754, since it requires basic floating point operations to be correctly rounded (i.e. the result is the closest representable value)? Is it related to https://codeforces.com/blog/entry/21844?

Edit: Seems that IEEE 754 allows excess precision.

  • »
    »
    4 years ago, # ^ |

    After testing I'm pretty sure https://codeforces.com/blog/entry/21844 is related to excess precision. I found a minimal working example that I believe has the same issue.

    Example showing the issue
    Output for 32 bit g++
    Output for 64 bit g++

    As you can see, excess precision is able to "leak out" of f.

    The thing I'm not sure about is whether pow having excess precision should be seen as a bug or not. From testing on CF, pow only has excess precision when submitting under g++11; it does not seem to have excess precision on any later version. So pow having excess precision should probably be categorized as a bug, one that has since been fixed.

»
21 month(s) ago, # |

The code in the (Detecting and turning on/off excess precision) section:

Using g++20 in CLion gives 0, and FLT_EVAL_METHOD says that the mode is 2, but when I change double b = a*a to long double b = a*a it gives 8e-17. As I understand from you they should work the same, or did I get something wrong?

And the code in the (Fun exercise) section:

How does the int w = pow(y, 2); affect the result?

  • »
    »
    21 month(s) ago, # ^ |

    Using g++20 in CLion gives 0, and FLT_EVAL_METHOD says that the mode is 2, but when I change double b = a*a to long double b = a*a it gives 8e-17. As I understand from you they should work the same, or did I get something wrong?

    The way mode 2 works is that all intermediate floating point calculations are done using long doubles. But when the floating point numbers are stored in memory, they are stored as their respective type.
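    A small sketch of that difference (my own code, not from this thread; under 32 bit g++, whether the double really gets rounded also depends on whether the compiler keeps it in a register):

        #include <cstdio>

        int main() {
            volatile double a = 1.0 + 1.0/134217728.0;   // 1 + 2^-27
            double d = a * a;        // rounded to a double whenever it is actually written to memory
            long double l = a * a;   // a long double variable keeps the extra bits
            printf("%Lg\n", (long double)d - 1.0L - 1.0L/67108864.0L);   // 0 once d has been rounded
            printf("%Lg\n", l - 1.0L - 1.0L/67108864.0L);                // about 5.6e-17 under mode 2
        }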

    How does the int w = pow(y, 2); affect the result?

    The pow call forces y to be stored in memory, which rounds y. For example, if y had the value $$$1-2^{-60}$$$, then the pow call rounds it to exactly 1.0 since 1.0 is the closest representable double to $$$1-2^{-60}$$$.