Discussion: Modern G++ Command Line

3 years ago, # ^ |

+29

if (source.contains("#pragma")) {
  return COMPILATION_ERROR;
}

→ Reply

SkyDreams

3 years ago, # ^ |

Where do we need to write this ?

→ Reply

3 years ago, # ^ |

+14

Mike can write this somewhere inside Codeforces code

→ Reply

arpit0891

3 years ago, # ^ |

-15

Agreed ...

→ Reply

arpit0891

3 years ago, # ^ |

For the different optimizations, you can refer to this.

→ Reply

iensen

3 years ago, # ^ |

+14

I don't think this works. One can use _Pragma ("GCC optimize(3)") instead.
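A minimal sketch of that point, assuming GCC: _Pragma is the operator form of #pragma, so a file can request O3 without ever containing the text "#pragma". The function name sum_of_squares is made up for illustration.

```cpp
#include <cstdint>

// Operator form of "#pragma GCC optimize("O3")". A naive substring check
// for "#pragma" in the source would not see this line.
_Pragma("GCC optimize(\"O3\")")

// Hypothetical hot function; it is compiled under the optimize setting above.
std::uint64_t sum_of_squares(std::uint32_t n) {
    std::uint64_t total = 0;
    for (std::uint32_t i = 1; i <= n; ++i) {
        total += static_cast<std::uint64_t>(i) * i;
    }
    return total;
}
```

Compilers that don't know the GCC pragma should just ignore it (possibly with a warning), so the function itself behaves the same either way.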

→ Reply

pshevchuk

3 years ago, # ^ |

Or you can use "??=" as a synonym for # and write ??=pragma

→ Reply

3 years ago, # ^ |

+68

Then the contestants will just use function attributes, which have exactly the same effect as pragmas, but for individual functions:

#define fast_function \
  __attribute__((__optimize__("O3,unroll-loops"))) \
  __attribute__((__target__("avx,avx2,sse,sse2,sse3,sse4,popcnt,fma")))

int fast_function solve(int n) { ... }

Here's an example: 128071115. The fast_function define can also be obfuscated a little bit to avoid trivial regexp-based detection. There may be other loopholes.

Disabling AVX instructions support on the judge machines may be a useful equalizer option (AVX is the biggest contributor when the pragmas help). Some people on the Internet are saying that disabling AVX is possible in Windows: https://newbedev.com/how-can-i-check-whether-intel-s-avx-is-enabled-on-my-computer

→ Reply

swiftc

3 years ago, # ^ |

← Rev. 2 →

+38

Actually, we can disable #pragmas at compiler level, like this, but I think it's not necessary.

→ Reply

Wind_Eagle

3 years ago, # |

← Rev. 2 →

+70

I think that -O2 is a fine standard. For example, in Belarusian OI we have -O2 as a standard.

One can use pragmas to get -O3, -funroll-loops and other features.

→ Reply

Kevin114514

3 years ago, # |

-8

what you need to do is

#pragma GCC optimize(3)
#pragma GCC optimize("inline")
#pragma GCC optimize("-fgcse")
#pragma GCC target("avx","sse2")
#pragma GCC optimize("-fgcse-lm")
#pragma GCC optimize("-fipa-sra")
#pragma GCC optimize("-ftree-pre")
#pragma GCC optimize("-ftree-vrp")
#pragma GCC optimize("-fpeephole2")
#pragma GCC optimize("-ffast-math")
#pragma GCC optimize("-fsched-spec")
#pragma GCC optimize("unroll-loops")

→ Reply

lis05

3 years ago, # ^ |

could you please explain what each of them does?

→ Reply

tmaddy

3 years ago, # ^ |

-77

no-one cares

→ Reply

lis05

3 years ago, # ^ |

-19

fine

→ Reply

CartesianTree

3 years ago, # |

← Rev. 2 →

-13

Deleted.

→ Reply

2_3_3

3 years ago, # |

+89

The optimization is really good news for C++ programmers.

But I have seen that it can sometimes make code slower.

So I suggest letting users decide for themselves whether to use it.

→ Reply

jeal0uspengu1n

3 years ago, # |

+10

Ahhh!!! c++20 noice noice

→ Reply

Kyou_mo_kawaii

3 years ago, # |

← Rev. 2 →

+127

Xellos claims that O3 can sometimes be slower than O2 in https://codeforces.com/blog/entry/66279. That thread is also a great resource for pragmas.

I think that rather than asking people, you can do this more scientifically by rerunning existing submissions under both the old and new flags. It will be interesting to see the stats on how many AC submissions become MLE or WA (due to some UB being exposed?), which types of problems benefit the most from O3/unroll, and which ones it actually slows down. It would make an interesting blog post if you get those stats!

→ Reply

Time_To_Night_Sky

3 years ago, # |

← Rev. 2 →

-38

[Deleted]

→ Reply

3 years ago, # ^ |

+93

The constraints are wrong, not the solution. The author is the one to blame.

→ Reply

LeoPro

3 years ago, # ^ |

+58

Yeah, you're kind of right. We are given a problem and have to implement a solution that works correctly and fits into the time (and memory) limit. But what do the contest authors actually want?

The correctness of the algorithm is checked using test cases. The time and memory limits check how good the solution is. Very often, quality just means complexity. The editorials show us the intended complexity. They show neither the intended language nor the intended pragmas.

Apparently, the main goal is to keep solutions that are too slow (and therefore incorrect) from passing. That is why time limits can be tight. Then it can be difficult to get a solution with the correct complexity accepted, particularly if it has a big constant factor (say, because it is not written in C++). But tight limits are discouraged on Codeforces.

You are offered a list of languages in which you can write a solution. I believe that they are alternatives: we can pick any of them, translate our correct idea and get accepted. I believe that choosing C++ shouldn't be a must (in a perfect world). The correct idea must be accepted, whether it is written in C++ or not. But the languages differ in speed. The way to check the asymptotics is to set enormous limits, say $$$n \le 10^9$$$ for an $$$O(n\log{n})$$$ algorithm and a time limit of about 300 seconds. But that is impossible.

Consider an example. The problemsetters decide that the solution must work in $$$O(n\log^2{n})$$$ and that $$$O(n^2/w)$$$ should not pass. They implement the best $$$O(n^2/w)$$$ solution and an ordinary $$$O(n\log^2{n})$$$ solution, and set the time limit somewhere between their running times. The bigger the gap, the more room there is for other languages or non-optimal solutions. Forbidding pragmas increases the size of the gap. It simplifies the problemsetters' work and increases the probability of getting a good problem. Yes, it's their fault, but do you want to help them prepare high-quality problems or not?

P. S. Yes, I have used pragmas three or four times, and every time the program ran slower. So you can read my comment as "I do not know where (and how) I should use pragmas. Please forbid all contestants from using them!". On top of that, putting some random lines at the beginning of the file to make the program faster hardly fits, in my mind, with thinking about the best solution for the problem.

→ Reply

Time_To_Night_Sky

3 years ago, # ^ |

← Rev. 2 →

-16

[Deleted]

→ Reply

Lyrically

3 years ago, # |

Looking forward to the release of the new standard of C++.

→ Reply

Sparky_Master_WCH1226

3 years ago, # |

Wow, thanks! That is a very useful and meaningful idea.

→ Reply

FrozenKandy

3 years ago, # |

pypy3 64 bit is coming!! I couldn't be happier :D

→ Reply

prabhav_

3 years ago, # |

Psyched for C++ 20 :D

→ Reply

3 years ago, # |

Sometimes (for example, when there are no autovectorized loops), pragmas can be useless and even slow down the program (I had some examples of it). Maybe there are some that never slow down?

→ Reply

Marslai24

3 years ago, # |

I think the optimization of compiler would be great, but on the other hand, this would give less pleasure in optimizing the code itself. By the way, thanks for the update!

→ Reply

hbarp

3 years ago, # |

-26

This is probably the wrong place to ask, but I don't know where else to ask, so sorry for that :)

Are there any upcoming lectures for the EDU section?

→ Reply

christchurch

3 years ago, # |

-39

O2 is enough. A Chinese OJ called POJ doesn't even turn on O2 optimization, and only supports the C++98 standard QAQ

→ Reply

oneninetyfive

3 years ago, # ^ |

+16

POJ might be the single worst online judge that exists today... The website seems like it is stuck in the 1990s.

→ Reply

purplesyringa

3 years ago, # |

-9

gcc11 std=c++20

What about c++2b? That would be even better.

→ Reply

3 years ago, # ^ |

As far as I know, c++20 itself isn't fully supported in any compiler as of yet. It'll be a long way to getting it to work on cf anyway, so c++2b doesn't make sense.

If your comment was sarcasm, well...

→ Reply

purplesyringa

3 years ago, # ^ |

← Rev. 2 →

It wasn't sarcastic in the slightest.

C++20 is not fully supported, right. C++23 neither. That does not mean we can't use features that are supported, however, does it?

Also, I'm not sure why making C++20 work would be difficult. It only requires updating the compiler, as far as I am aware.

→ Reply

3 years ago, # ^ |

← Rev. 2 →

+26

Ah I see. The rest of the comment will reference this.

My point about C++20 was that if compilers don't fully support all features, it might be a great surprise for some people if some features won't work on CF (but work locally). For instance, this post: https://codeforces.com/blog/entry/94457, where even C++17 features don't work, since the version of GCC is 9.2.0, while most people on Ubuntu use 9.3.0 (and people on rolling-release distros use much more recent versions, like 11.1.0). However, the majority of "good" C++20 features have already been implemented by GCC, so it might be less of an issue for C++20. Some good features (like constexpr std::string and std::vector) still haven't been implemented on GCC though, so I would still prefer waiting for them to be implemented.

The problem would be worse for C++23 (since most C++23 features in the STL aren't implemented in compilers anyway), so people who have updated compilers might face unexpected issues in due course of time as the compiler gets outdated.

As far as updating the compiler is concerned, I suppose it takes some non-trivial amount of effort which might lead to maintenance delays (even for half an hour, that's quite a bit of a delay). I don't think CF would want to follow a rolling-release pattern and update compilers as and when more features are implemented, which would keep us stuck with compilers that don't fully implement the standard (as in the case of 9.2.0). I hope that clears up the reasons behind my concerns.

→ Reply

Ritwin

3 years ago, # ^ |

-10

I think a lot of people will be happy to use set.contains(x) though (And no, I don't know why it took this long to add a .contains method)

→ Reply

purplesyringa

3 years ago, # ^ |

+11

GCC is known to support newer language features even when the selected standard is quite old. Structured bindings, for instance, are supported even with the -std=c++11 option: GCC will still compile them and only raise a warning.

So why don't we do a similar thing here? Make a compiler called 'GCC (C++20, partial C++2b support)', or maybe even 'GCC (C++20)' if you think minor lies are allowable, that uses the -std=c++2b CLI option? In this case, if you use C++20 features only, you're in the clear. And if you use newer features, you'll either get an error because the feature was not implemented yet, or it will compile correctly and you won't lose precious seconds rewriting the code.

It's a win-win situation. Participants that are aware of this little trick can benefit from it. Participants that aren't will get the same results as before, except they get more lucky, so to say, in pushing the boundaries, if they accidentally use too new features.

As far as updating the compiler is concerned, I suppose it takes some non-trivial amount of effort which might lead to maintenance delays (even for half an hour, that's quite a bit of a delay).

Do you think the only problem is the updating difficulty, and that if it was automated, it'd work out?

→ Reply

3 years ago, # ^ |

+11

Sure, that can be done I guess (not sure about all the technical details about deploying these updates on CF though).

It's just a matter of personal taste to wait for a more polished implementation of C++20 (or maybe even C++23). However, if there is a guarantee that compiler updates can happen frequently enough without breaking stuff very frequently, I would go with the newer compilers too, since that mimics my own update cycle as well. Note that there tend to be bugs in new releases of compilers too, so that might be a minor inconvenience here and there.

→ Reply

depressed_ontop

3 years ago, # |

+24

pypy3 64-bit: now Python users don't have to switch between Python and PyPy. We can just always use pypy3.

→ Reply

sky123

3 years ago, # |

I think this is good, because it will reduce the impact of reasons other than the complexity of the program on the judge results. :)

→ Reply

ilyaleshchik

3 years ago, # |

I think this option will be good as a separate compiler option when you submit a task.

→ Reply

oversolver

3 years ago, # |

only if -O3 is necessary to beat Rust-ers

→ Reply

nikgaevoy

3 years ago, # ^ |

I understand that it was a joke, but I still want to answer :)

From my experience, the Rust standard library is incredibly slow, even in comparison with the -O2 version of STL. At least BTreeSet is, which seems to be the Rust counterpart to std::set.

Would be happy to hear from Rust experts that I am wrong.

→ Reply

avm

3 years ago, # ^ |

+10

Let's insert 5000000 32-bit integers into sorted set and compute some polynomial hash.

Rust 128344948 submission runs 1637 ms and used 49200 Kb memory.
GNU C++17 (64) 128344998 submission runs 5256 ms and used 239500 Kb memory.

You can read in the documentation why this happens and why BTreeSet is slower than a red-black tree for small sets.
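The two submissions are not reproduced here, so the following is only a guess at what the C++ side of such a test might look like: put the integers into a std::set, then fold them in sorted order into a polynomial hash. The function names, the LCG constants and the hash base 31 are my own assumptions, not taken from submission 128344998.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Polynomial hash (base 31, arithmetic modulo 2^64) of the distinct values
// in ascending order, using std::set as the sorted container.
std::uint64_t hash_sorted(const std::vector<std::uint32_t>& values) {
    std::set<std::uint32_t> sorted(values.begin(), values.end());
    std::uint64_t h = 0;
    for (std::uint32_t v : sorted) {
        h = h * 31 + v;  // unsigned wraparound is well defined
    }
    return h;
}

// Deterministic pseudo-random input from a simple LCG (illustrative constants).
std::vector<std::uint32_t> make_input(std::size_t n) {
    std::vector<std::uint32_t> out;
    out.reserve(n);
    std::uint32_t x = 12345;
    for (std::size_t i = 0; i < n; ++i) {
        x = x * 1664525u + 1013904223u;
        out.push_back(x);
    }
    return out;
}
```

Calling hash_sorted(make_input(5000000)) would roughly match the size of the test described above; any timing depends on the machine and flags.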

→ Reply

CodingKnight

3 years ago, # |

According to the following blog, Speed up Code executions with help of Pragma in C/C++, the optimization flag -Ofast is a superset of -O3, and its performance is slightly better than that of -O3.

→ Reply

3 years ago, # ^ |

This blog is misleading about the -Ofast option and doesn't mention the downsides. This is what the official documentation says: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens.

If people don't mind deviating from strict standards compliance, then they can always use pragmas to enforce -Ofast for their code. But everyone else would be rightfully upset if their valid code doesn't behave as expected, especially if this results in a WA verdict.
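To make the -ffast-math downside concrete, here is a small sketch (my own example, not from the blog or the GCC docs) of code that depends on strict floating-point semantics. Kahan summation keeps a correction term that is algebraically zero, so a compiler that is allowed to reassociate floating-point arithmetic may simplify it away.

```cpp
#include <vector>

// Plain left-to-right summation.
double naive_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum;
}

// Kahan (compensated) summation. The variable c captures the rounding error
// of each addition. Algebraically c is always zero, so with reassociation
// enabled (as -ffast-math / -Ofast permit) the compiler may drop it and
// this can degrade to the naive sum.
double kahan_sum(const std::vector<double>& xs) {
    double sum = 0.0;
    double c = 0.0;
    for (double x : xs) {
        double y = x - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}
```

With strict IEEE double arithmetic, summing 1.0 followed by several values of 1e-16 shows the difference: the naive sum stays at exactly 1.0, while the compensated sum ends up slightly above it.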

→ Reply

CodingKnight

3 years ago, # ^ |

Thanks for sharing the gcc link. It seems that I was lucky enough to not get WA verdict when I tried this pragma optimization flag.

→ Reply

sanyacoder

3 years ago, # |

-119

This is a great idea except for one thing. I am strongly against -O2 and -O3 keys in command line. (Also sometimes it is not fair for people not using C++)

I have an example:

test.cpp:

#include <iostream>
using namespace std;
int main() {
    int t = 1;
    while (t > 0) {
        t *= 9;
        cout << t << '\n';
    }
    cout << "finished\n";
}

Then run g++ test.cpp -O1 -o test && ./test in command shell. I got such result:

But if I run g++ test.cpp -O2 -o test && ./test or g++ test.cpp -O3 -o test && ./test, I will get a program that goes into an infinite loop.

Part of its output:

291559329
-1670933335
2141469169

So, I think -O1 would be better for the command line.

Possible explanation (I'm not an expert):
Unsigned comparison is a bit faster than signed because signed comparison is not just a "stupid" comparison. The processor needs to look at the first bit, then choose how it will compare the numbers (it depends on the first bit), etc. The compiler does not consider that the number can be negative and just uses an unsigned comparison to improve speed.

→ Reply

webmaster

3 years ago, # ^ |

+79

Overflow of signed integers is undefined behaviour in C++. That's why your program gets stuck in O2/O3.
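For illustration, here is one way to write the loop from the comment above without undefined behaviour: check whether the next multiplication would overflow before doing it. The function name powers_of_nine is made up, and this is just a sketch of the idea, not the only possible fix.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Collects 9, 81, 729, ... for as long as the next power still fits in a
// 32-bit signed integer. The overflow check happens BEFORE the
// multiplication, so no signed overflow ever occurs.
std::vector<std::int32_t> powers_of_nine() {
    std::vector<std::int32_t> out;
    std::int32_t t = 1;
    while (t <= std::numeric_limits<std::int32_t>::max() / 9) {
        t *= 9;
        out.push_back(t);
    }
    return out;
}
```

Since the loop condition no longer relies on t becoming negative after an overflow, the result should not depend on the optimization level.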

→ Reply

3 years ago, # ^ |

+14

Your code is just incorrect. Add -fsanitize=undefined option to the gcc command line when compiling, run your program and it will tell you what's wrong. Yes, C++ is a very difficult programming language and it's often criticized for that. If you want an infinite loop and no unexpected surprises, then you can use Python.

→ Reply

tfg

3 years ago, # ^ |

+11

I'd say that calling a function that doesn't exist and not getting an error until the function is actually called, or having a typo in a variable name and the program not flagging it as an error, is an unexpected surprise by itself.

→ Reply

3 years ago, # ^ |

Yet I don't see any Python users posting blogs about such errors and being unable to troubleshoot them. The biggest source of actual confusion may be something like the somewhat unpredictable performance of string concatenation in CPython. There's also Rust for those who prefer statically typed programming languages and catching more errors at the compilation stage.

→ Reply

nikgaevoy

3 years ago, # |

+11

As far as I understand, -O3 can be faster and also can be slower than -O2, but it is more often faster than slower. The only reason why the standard compilation line for big competitions (such as ICPC) contains -O2 is that -O3 allows the compiler to do some optimizations that are not present in the C++ standard (or maybe even do something contradicting the standard, assuming it does not change the program behaviour, but I am not sure about that).

On the other hand, -O2 is guaranteed to perform almost all optimizations from the standard, so it is usually enough for any reasonable solution, and you always can add pragmas whenever you specifically need -O3.

Anyway, it is impossible to make the best command line to deal with pragmas once and forever, because there always will be something like sse or avx, that is not added by -march=native -mtune=native, but still can be added manually via pragmas and make some difference in running time.

→ Reply

3 years ago, # ^ |

In my experience it's more often slower than faster since regular CP code isn't bottlenecked by how instructions are generated.

→ Reply

nikgaevoy

3 years ago, # ^ |

To be very honest, I think it is more often more or less equal. Also, I bet that it depends not only on the problem but also on your preferred code style. E.g. I tend to write very template-heavy code that relies on inline optimizations that are different in -O2 and -O3 (according to godbolt). I can't quite tell which one is better though.

In general, I think that if the optimizations went that far, then you better simply check both -O2 and -O3.

→ Reply

3 years ago, # ^ |

I do check, it's how I know. However, even if it was 50/50 on which flag gives better results, then O2 is the one to use as the default.

I can't quite tell which one is better though.

That's the thing. You get different code, that doesn't mean you get faster code. It should depend on problem type rather than code style though (unless we're talking about bad code style that needs O3), for example randomly accessing memory is bottlenecked by cache misses and no amount of compiler optimisation can change that.

→ Reply

3 years ago, # |

+24

Great resource: https://wiki.gentoo.org/wiki/GCC_optimization.

Basically, feel free to use machine-specific options (-march=native is enough) since that's where the code will run, but keep optimisation to -O2.

-ftree-vectorize is an optimization option (default at -O3 and -Ofast), which attempts to vectorize loops using the selected ISA if possible. The reason it isn't enabled at -O2 is that it doesn't always improve code, it can make code slower as well, and usually makes the code larger; it really depends on the loop etc.
On an Intel/AMD64 platform with -march=native -O2 or lower optimization level, the code will likely end up with AVX instructions used but using shorter SSE XMM registers. To take full advantage of AVX YMM registers, the -ftree-vectorize, -O3 or -Ofast options should be used as well
Also available are the -mtune and -mcpu flags. These flags are normally only used when there is no available -march option
Even the GCC manual says that using -funroll-loops and -funroll-all-loops will make code larger and run more slowly.
Stick to the basics: -march, -O, and -pipe.

Note -pipe, which only makes compilation faster, but that's useful too.

→ Reply

3 years ago, # ^ |

Here is an alternative opinion from another Linux distro: https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html

Usually we recommend using -O2. This is the optimization level we use to build most SUSE and openSUSE packages, because at this level the compiler makes balanced size and speed trade-offs when building a general-purpose operating system. However, we suggest using -O3 if you know that your project is compute-intensive and is either small or an important part of your actual workload. Moreover, if the compiled code contains performance-critical floating-point operations, we strongly advise that you investigate whether -ffast-math or any of the fine-grained options it implies can be safely used.

Looks like you are right and -march=native also implies -mtune=native at least for the x86 target. As for -funroll-loops/-funroll-all-loops, GCC documentation says that:

Spoiler

It's important to note that popular open source software typically has a long history. Years have been spent ironing out bugs and improving performance. Many of the performance-critical parts have already been identified and optimized using hand-written assembly or intrinsics. Many loops that could actually benefit from unrolling have been unrolled manually. There is relatively little low-hanging fruit left for the compilers. This limits the usefulness of the -ftree-vectorize and -funroll-loops optimization options when used with such code.

But competitive programming solutions are usually small, somewhat messy and there's not much time available for doing manual loops unrolling due to short contests duration. This provides more optimization opportunities for the compiler. Pre-written library code is the only exception.

→ Reply

3 years ago, # ^ |

Keep in mind that compute-intensive in OS applications usually means things like video processing, not complex algorithms. The vast majority doesn't have performance-critical parts at all.

and there's not much time available for doing manual loops unrolling due to short contests duration

The tradeoff isn't simply "no time to do this", it's "no time to do this probably useless thing". Copypasting the insides of a loop and renaming variables isn't hard when you know it's useful, but when you don't know, there's a good chance it's not useful.

→ Reply

3 years ago, # ^ |

← Rev. 2 →

Manual loop unrolling doesn't make any sense anymore, unless it is a non-trivial rearrangement of program logic that the compiler can't detect. At O3, the compiler is smart enough to unroll loops which it deems fit conservatively (for instance, this: https://godbolt.org/z/hzKvnsGfz — try it without -funroll-loops).

-funroll-loops usually leads to a better performance (especially if it's a lightweight tight loop), but again, compiler options don't always do what you want them to do, and this doesn't unroll all loops either, since the unrolling is heuristic-based. Compilers might even reroll your loops if they find that some of your loops are hand-unrolled and are pretty bad due to the absence of such a guarantee (this is just a hypothetical).

Compilers are smart enough at micro-optimizations (more so than most programmers), so I believe we shouldn't rule out the usefulness of loop-rolling that the compiler does.

→ Reply

3 years ago, # ^ |

+21

This is probably the most sane thing to do.

I would however add that if -march=native isn't set already, then bmi/bmi2 (if not that, then popcnt, lzcnt, tzcnt) should be considered for non-trivial bit operations (like __builtin_popcount, __builtin_ctz, __builtin_clz, __lg and so on). These would be pretty helpful in doing these bitwise operations and also speed up bitsets a bit.
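A small sketch of the kind of helpers this affects, using the GCC/Clang builtins named above (the wrapper names are made up). The builtins work with or without the extra ISA flags; as far as I know, the flags only change whether they compile down to single instructions such as popcnt, tzcnt and lzcnt.

```cpp
#include <cstdint>

// Number of set bits in x.
int bit_count(std::uint64_t x) {
    return __builtin_popcountll(x);
}

// Index of the lowest set bit, or -1 for x == 0
// (__builtin_ctzll is undefined for 0, hence the guard).
int lowest_set_bit(std::uint64_t x) {
    return x == 0 ? -1 : __builtin_ctzll(x);
}

// floor(log2(x)), or -1 for x == 0
// (__builtin_clzll is undefined for 0, hence the guard).
int floor_log2(std::uint64_t x) {
    return x == 0 ? -1 : 63 - __builtin_clzll(x);
}
```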

Ofast is unsafe for floating point operations (I remember someone saying that it breaks some solutions to geometry problems, which was the biggest reason I stopped using Ofast), and O3 might make code slower (from my experience on CF as well, but I tend to include it anyway since it happens rarely). One good thing about O3 (as pointed out above) is the larger bit vectors that you get (but you can do that by enabling AVX2 anyway).

The point is, optimized code should not be necessary to pass, although it's nice to have.

-march=native or -mtune=native being the only addition to the compiler options should be fine in my opinion. The reason why it's not used in production is that people want to build for other architectures too, but since all code compilation and execution happens on one system in the case of competitive programming, it makes sense to actually get as much performance out of it as possible. Some might argue it's even the best thing, since it allows you to use instructions from most ISAs that are supported, which makes it fair game for people who don't actually want to use pragmas (and for people who are afraid of using pragmas due to the fear that their submission will get an RE due to SIGILL).

I would actually be interested in the flags that are supported on codeforces (the flags turned on by -march=native can be found by running g++ -march=native -E -v - </dev/null 2>&1 | grep cc1). For instance, my machine gives the following output.

Output

Still, in an ideal scenario, such optimization options shouldn't be involved (and perhaps be banned from code too), but cheesing problems using SIMD is way too fun.

→ Reply

ggdwbg

3 years ago, # ^ |

There's actually another way to get free performance: write high level idiomatic code and constexpr'ing as much as you can.

Better to show with an example: https://gcc.godbolt.org/z/YocGTro41

High level doesn't mean slow. Example above is overly high-level, it computes some nontrivial thing (sum of all node labels in some implicitly defined graph that is not stored in memory) and it compiles in a single machine instruction that just moves the result into the register.

You can do memory allocations for no cost (one can think about memory for constants/precomputed stuff as registers); nasty cases/whatever can be elegantly expressed in this way.
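The godbolt example is not reproduced here, so this is a much smaller sketch of the same idea with a made-up function: a sieve written as ordinary loop code, but evaluated entirely by the compiler (C++14 or later), so the resulting program only contains the final constant.

```cpp
#include <cstddef>

// Counts primes below N with a simple sieve. Because the function is
// constexpr and N is a template parameter, the whole computation can
// happen at compile time.
template <std::size_t N>
constexpr int count_primes_below() {
    bool composite[N] = {};
    int count = 0;
    for (std::size_t i = 2; i < N; ++i) {
        if (composite[i]) continue;
        ++count;
        for (std::size_t j = i * i; j < N; j += i) composite[j] = true;
    }
    return count;
}

// Forces compile-time evaluation; the sieve never runs when the program does.
constexpr int kPrimesBelow1000 = count_primes_below<1000>();
static_assert(kPrimesBelow1000 == 168, "computed by the compiler");
```

The static_assert is the easy way to confirm that the value really was computed during compilation: if it were not a constant expression, the code would not compile.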

Not sure if this is common knowledge in CP.

→ Reply