### MikeMirzayanov's blog

By MikeMirzayanov, 13 months ago,

Hello Codeforces!

I am in the process of making improvements and updates to the judgment machines. I've read the post https://codeforces.com/blog/entry/94587 and I think, maybe it is a good idea to make such a compilation line -O3 -funroll-loops -march=native -mtune=native? I haven't done any research that it is definitely better than -O2 and it is best in the general case for CP solutions. In a way, this will only strengthen the gap from Python/PyPy/Java, on the other hand: in pragmas and so you can set up everything. What do you think? What are suggestions to the command line?

P.S. You got it right. Yes, gcc11 std=c++20, pypy3 64-bit and more are coming.

• +830

 » 13 months ago, # |   +111 yeah c++20 will be great
 » 13 months ago, # | ← Rev. 2 →   +46 I think -O2 is enough in CP, regardless of whether there are better compilation options. I even think we should not allow #pragmas. Too much compiler optimization may let a bad solution get Accepted, which will make CP bored.
•  » » 13 months ago, # ^ |   +37 How do you disable pragmas to run in code when the user has them in code ? (genuine question, idk anything about pragmas other than the fact that they are some dark magic used to speed up code)
•  » » » 13 months ago, # ^ |   +14 I am not quite sure about the details myself, but I remember hearing somewhere that other OJ's are able to "stop" pragmas. Correct me if I'm wrong, but USACO judgers do this. I also know that that the teamscode OJ used for the recent teamscode round also stopped the effect of pragmas. Nonetheless, I do not know this topic well enough, so I might be wrong.
•  » » » 13 months ago, # ^ |   +29 if (source.contains("#pragma")) { return COMPILATION_ERROR; } 
•  » » » » 13 months ago, # ^ |   +4 Where do we need to write this ?
•  » » » » » 13 months ago, # ^ |   +14 Mike can write this somewhere inside Codeforces code
•  » » » » » » 13 months ago, # ^ |   -15 Agreed ...
•  » » » » » 13 months ago, # ^ |   0 Different optimizations you can refer to this .
•  » » » » 13 months ago, # ^ |   +14 I don't think this works. One can use _Pragma ("GCC optimize(3)") instead.
•  » » » » » 13 months ago, # ^ |   0 Or you can use "??=" as a synonym for # and write ??=pragma
•  » » » » 13 months ago, # ^ |   +68 Then the contestants will just use function attributes, which have exactly the same effect as pragmas, but for individual functions: Spoiler#define fast_function \ __attribute__((__optimize__("O3,unroll-loops"))) \ __attribute__((__target__("avx,avx2,sse,sse2,sse3,sse4,popcnt,fma"))) int fast_function solve(int n) { ... } Here's an example: 128071115. The fast_function define can be also obfuscated a little bit to avoid trivial regexp based detection. There may be other loopholes.Disabling AVX instructions support on the judge machines may be a useful equalizer option (AVX is the biggest contributor when the pragmas help). Some people on the Internet are saying that disabling AVX is possible in Windows: https://newbedev.com/how-can-i-check-whether-intel-s-avx-is-enabled-on-my-computer
•  » » » 13 months ago, # ^ |   +38 Actually, we can disable #pragmas in compiler level, like this, but I think it's not necessary.
•  » » 13 months ago, # ^ |   0 As far as I know, GCC's -O2 doesn't perform loop unroll and vectorization, but clang does.
 » 13 months ago, # | ← Rev. 2 →   +70 I think that -O2 is a fine standard. For example, in Belarusian OI we have -O2 as a standard.One can use pragmas to get -O3, -funroll-loops and other features.
 » 13 months ago, # |   -8 what you need to do is #pragma GCC optimize(3) #pragma GCC optimize("inline") #pragma GCC optimize("-fgcse") #pragma GCC target("avx","sse2") #pragma GCC optimize("-fgcse-lm") #pragma GCC optimize("-fipa-sra") #pragma GCC optimize("-ftree-pre") #pragma GCC optimize("-ftree-vrp") #pragma GCC optimize("-fpeephole2") #pragma GCC optimize("-ffast-math") #pragma GCC optimize("-fsched-spec") #pragma GCC optimize("unroll-loops") 
•  » » 13 months ago, # ^ |   0 could you please explain what each of them does?
•  » » » 13 months ago, # ^ |   -77 no-one cares
•  » » » » 13 months ago, # ^ |   0 I care!
•  » » » » 13 months ago, # ^ |   -19 fine
 » 13 months ago, # | ← Rev. 2 →   -13 Deleted.
 » 13 months ago, # |   +89 The optimization is really a good news for C++ programmers.But I have seen that sometimes it can give negative-optimization.So I suggest that users can decide to use it or not by themselves.
 » 13 months ago, # |   +10 Ahhh!!! c++20 noice noice
 » 13 months ago, # | ← Rev. 2 →   +127 Xellos claims that O3 can sometimes be slower than O2 in https://codeforces.com/blog/entry/66279. That thread is also a great resource for pragmas.I think rather than asking people you can do this more scientifically by rerunning existing submissions under both the old and new flags. It will be interesting to see the stats of how many AC will become MLE or WA(due to some UB being exposed?) and which types of problems benefit the most from O3/unroll and which ones it will actually slow down. It will make an interesting blog post if you get those stats!
 » 13 months ago, # | ← Rev. 2 →   -38 [Deleted]
•  » » 13 months ago, # ^ |   +93 It's wrong constraints, not wrong solution. Who to blame is the author.
•  » » » 13 months ago, # ^ |   +58 Yeah, you kinda right. We are suggested a problem, we have to implement a solution, which works correctly and fits into the time (and memory) limit. But what in fact do the contest authors want? The correctness of algorithm is checked using testcases. The time and memory limit check, how good is the solution. Very often, the quality is just the complexity. The editorials show us the intended complexity. They do not show neither intended language, nor intended pragmas. Apparently, the main thing is not to allow too slow — incorrect — solutions to pass. That is why time limits can be tight. Then it can be difficult to make accepted a solution with correct complexity — particularly if it has big constant factor (say, written not in C++). But tight limits are discouraged in codeforces. You are suggested a list of languages you can write a solution in. I believe that they are alternatives — we can pick any of them, translate our correct idea and get accepted. I believe that choosing C++ isn't a must (in perfect world). The correct idea must be accepted, whenever it is written in C++ or not. But the languages differ in the speed. The way to check the asymptotics is to set enormous limits — say, $n \le 10^9$ for $O(n\log{n})$ algorithm and about 300 seconds of time limit. But it is impossible. Consider some example. Problemsetters decide that solution must work in $O(n\log^2{n})$ and $O(n^2/w)$ should not pass. They implement the best solution for $O(n^2/w)$, common solution for $O(n\log^2{n})$ and set time limit somewhere between their ruinning times. The more the gap is, the more other language or non-optimal solutions are allowed. Forbidding to use pragmas is increasing the size of the gap. It simplifies work for problemsetters and increases probability of receiving good problem. Yes, it's their fault, but do you want to help them preparing high quality problems or not?P. S. Yes, I used pragmas three or four times and all times the program worked slower. So you can read my comment as " I do not know, where (and how) should I use pragmas. Please, forbid all the contestants to use them!". And to add up, putting some random lines in the beginning of the file to make the program faster hardly correlate in my mind with thinking about best solution for the problem.
•  » » » 13 months ago, # ^ | ← Rev. 2 →   -16 [Deleted]
 » 13 months ago, # |   0 Looking forward to the release of the new standard of C++.
 » 13 months ago, # |   +1 Wow, thanks! That is a very useful and meaningful idea.
 » 13 months ago, # | ← Rev. 2 →   0 C++20 ...Now thats a good news!
 » 13 months ago, # |   +4 pypy3 64 bit is coming!! I couldn't be happier :D
 » 13 months ago, # |   +3 Psyched for C++ 20 :D
 » 13 months ago, # |   0 Sometimes (for example, when there are no autovectorized loops), pragmas can be useless and even slow down the program (I had some examples of it). Maybe there are some that never slow down?
 » 13 months ago, # |   0 I think the optimization of compiler would be great, but on the other hand, this would give less pleasure in optimizing the code itself. By the way, thanks for the update!
 » 13 months ago, # |   -26 Probably the wrong place to ask this but I dont know where to ask so sorry for it :) Are there any upcoming lectures for EDU section ?
 » 13 months ago, # |   -39 O2 is enough.A Chinese oj called POJ did not even turn on o2 optimization，and only supports the C98 standard QAQ
•  » » 13 months ago, # ^ |   +16 POJ might be the single worst online judge that exists today... The website seems like it is stuck in the 1990s.
 » 13 months ago, # |   -9 gcc11 std=c++20 What about c++2b? That would be even better.
•  » » 13 months ago, # ^ |   0 As far as I know, c++20 itself isn't fully supported in any compiler as of yet. It'll be a long way to getting it to work on cf anyway, so c++2b doesn't make sense.If your comment was sarcasm, well...
•  » » » 13 months ago, # ^ | ← Rev. 2 →   +6 It wasn't sarcastic in the slightest.C++20 is not fully supported, right. C++23 neither. That does not mean we can't use features that are supported, however, does it?Also, I'm not sure why making C++20 work would be difficult. It only requires updating the compiler, as far as I am aware.
•  » » » » 13 months ago, # ^ | ← Rev. 2 →   +26 Ah I see. The rest of the comment will reference this.My point about C++20 was that if compilers don't fully support all features, it might be a great surprise for some people if some features won't work on CF (but work locally). For instance, this post: https://codeforces.com/blog/entry/94457, where even C++17 features don't work, since the version of GCC is 9.2.0, while most people on Ubuntu use 9.3.0 (and people on rolling-release distros use much more recent versions, like 11.1.0). However, the majority of "good" C++20 features have already been implemented by GCC, so it might be less of an issue for C++20. Some good features (like constexpr std::string and std::vector) still haven't been implemented on GCC though, so I would still prefer waiting for them to be implemented.The problem would be worse for C++23 (since most C++23 features in the STL aren't implemented in compilers anyway), so people who have updated compilers might face unexpected issues in due course of time as the compiler gets outdated.As far as updating the compiler is concerned, I suppose it takes some non-trivial amount of effort which might lead to maintenance delays (even for half an hour, that's quite a bit of a delay). I don't think CF would want to follow a rolling-release pattern and update compilers as and when more features are implemented, which would keep us stuck with compilers that don't fully implement the standard (as in the case of 9.2.0). I hope that clears up the reasons behind my concerns.
•  » » » » » 13 months ago, # ^ |   -10 I think a lot of people will be happy to use set.contains(x) though (And no, I don't know why it took this long to add a .contains method)
•  » » » » » 13 months ago, # ^ |   +11 GCC is known to support newer language features even when the selected standard is quite old. Structured buildings, for instance, supported even with -std=c++11 option--GCC will still compile them and only raise a warning.So why don't we do a similar thing here? Make a compiler called 'GCC (C++20, partial C++2b support)', or maybe even 'GCC (C++20)' if you think minor lies are allowable, that uses the -std=c++2b CLI option? In this case, if you use C++20 features only, you're in the clear. And if you use newer features, you'll either get an error because the feature was not implemented yet, or it will compile correctly and you won't lose precious seconds rewriting the code.It's a win-win situation. Participants that are aware of this little trick can benefit from it. Participants that aren't will get the same results as before, except they get more lucky, so to say, in pushing the boundaries, if they accidentally use too new features. As far as updating the compiler is concerned, I suppose it takes some non-trivial amount of effort which might lead to maintenance delays (even for half an hour, that's quite a bit of a delay). Do you think the only problem is the updating difficulty, and that if it was automated, it'd work out?
•  » » » » » » 13 months ago, # ^ |   +11 Sure, that can be done I guess (not sure about all the technical details about deploying these updates on CF though).It's just a matter of personal taste to wait for a more polished implementation of C++20 (or maybe even C++23). However, if there is a guarantee that compiler updates can happen frequently enough without breaking stuff very frequently, I would go with the newer compilers too, since that mimics my own update cycle as well. Note that there tend to be bugs in new releases of compilers too, so that might be a minor inconvenience here and there.
 » 13 months ago, # |   +24 pypy3 64-bit ,now python users don't have to switch in python and pypy. Now we are going to use pypy3 always.
 » 13 months ago, # |   +1 I think this is good, because it will reduce the impact of reasons other than the complexity of the program on the judge results. :)
 » 13 months ago, # |   +3 I think this option will be good as a separate compiler option when you submit a task.
 » 13 months ago, # |   +1 only if -O3 is necessary to beat Rust-ers
•  » » 13 months ago, # ^ |   0 I understand that it was a joke, but I still want to answer :)From my experience, Rust standard library is incredibly slow even in comparison with -O2 version of STL. At least BTreeSet, which seems to be the counterpart to std::set in Rust.Would be happy to hear from Rust experts that I am wrong.
•  » » » 13 months ago, # ^ |   +10 Let's insert 5000000 32-bit integers into sorted set and compute some polynomial hash.Rust 128344948 submission runs 1637 ms and used 49200 Kb memory. GNU C++17 (64) 128344998 submission runs 5256 ms and used 239500 Kb memory.Why this happen and why BTreeSet is slower than red-black tree for small sets you could read in the man.
 » 13 months ago, # |   0 According to the following blog, Speed up Code executions with help of Pragma in C/C++, the optimization flag -Ofast is superset of -O3, and its performance is slightly better than -O3.
•  » » 13 months ago, # ^ |   0 This blog is misleading about the -Ofast option and doesn't mention the downsides. This is what the official documentation says: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. If people don't mind deviating from strict standards compliance, then they can always use pragmas to enforce -Ofast for their code. But everyone else would be rightfully upset if their valid code doesn't behave as expected, especially if this results in a WA verdict.
•  » » » 13 months ago, # ^ |   0 Thanks for sharing the gcc link. It seems that I was lucky enough to not get WA verdict when I tried this pragma optimization flag.
 » 13 months ago, # |   -119 This is a great idea except for one thing. I am strongly against -O2 and -O3 keys in command line. (Also sometimes it is not fair for people not using C++)I have an example:test.cpp: #include using namespace std; int main() { int t = 1; while (t > 0) { t *= 9; cout << t << '\n'; } cout << "finished\n"; } Then run g++ test.cpp -O1 -o test && ./test in command shell. I got such result: 9 81 729 6561 59049 531441 4782969 43046721 387420489 -808182895 finished But if I run g++ test.cpp -O2 -o test && ./test or g++ test.cpp -O3 -o test && ./test, I will get a program that goes into an infinite loop.A part of it's output: 291559329 -1670933335 2141469169 So, I think -O1 would be better for the command line.Possible explanation (I'm not an expert):Unsigned comparison is a bit faster than signed because it is not just a "stupid" comparison. Processor needs to look at first bit, than choose how it will compare numbers (it depends on first bit), еtс. Compiler does not consider the context that it can be a negative number and just uses unsigned comparation to imporove speed.
•  » » 13 months ago, # ^ |   +79 Overflow of signed integers is undefined behaviour in C++. That's why your program gets stuck in O2/O3.
•  » » 13 months ago, # ^ |   +14 Your code is just incorrect. Add -fsanitize=undefined option to the gcc command line when compiling, run your program and it will tell you what's wrong. Yes, C++ is a very difficult programming language and it's often criticized for that. If you want an infinite loop and no unexpected surprises, then you can use Python.
•  » » » 13 months ago, # ^ |   +11 I'd say calling a function that doesn't exists somewhere and not having that raising an error until the function is called or having a typo in a variable name and the program not accusing that of being an error is an unexpected surprise by itself.
•  » » » » 13 months ago, # ^ |   +3 Yet I don't see any Python users posting blogs about such errors and being unable to troubleshoot them. The biggest source of actual confusion may be something like a somewhat unpredictable performance of string concatenation in CPython. There's also Rust for those, who prefer statically typed programming languages and catching more errors at the compilation stage.
 » 13 months ago, # |   +11 As far as I understand, -O3 can be faster and also can be slower than -O2, but it is more often faster than slower. The only reason why the standard compilation line for big competitions (such as ICPC) contains -O2 is that -O3 allows the compiler to do some optimizations that are not present in the C++ standard (or maybe even do something contradicting the standard, assuming it does not change the program behaviour, but I am not sure about that).On the other hand, -O2 is guaranteed to perform almost all optimizations from the standard, so it is usually enough for any reasonable solution, and you always can add pragmas whenever you specifically need -O3.Anyway, it is impossible to make the best command line to deal with pragmas once and forever, because there always will be something like sse or avx, that is not added by -march=native -mtune=native, but still can be added manually via pragmas and make some difference in running time.
•  » » 13 months ago, # ^ |   +3 In my experience it's more often slower than faster since regular CP code isn't bottlenecked by how instructions are generated.
•  » » » 13 months ago, # ^ |   0 To be very honest, I think it is more often more or less equal. Also, I bet that it depends not only on the problem but also on your preferred code style. E.g. I tend to write very template-heavy code that relies on inline optimizations that are different in -O2 and -O3 (according to godbolt). I can't quite tell which one is better though.In general, I think that if the optimizations went that far, then you better simply check both -O2 and -O3.
•  » » » » 13 months ago, # ^ |   0 I do check, it's how I know. However, even if it was 50/50 on which flag gives better results, then O2 is the one to use as the default. I can't quite tell which one is better though. That's the thing. You get different code, that doesn't mean you get faster code. It should depend on problem type rather than code style though (unless we're talking about bad code style that needs O3), for example randomly accessing memory is bottlenecked by cache misses and no amount of compiler optimisation can change that.
 » 13 months ago, # |   +24 Great resource: https://wiki.gentoo.org/wiki/GCC_optimization.Basically, feel free to use machine-specific options (-march=native is enough) since that's where the code will run, but keep optimisation to -O2. -ftree-vectorize is an optimization option (default at -O3 and -Ofast), which attempts to vectorize loops using the selected ISA if possible. The reason it isn't enabled at -O2 is that it doesn't always improve code, it can make code slower as well, and usually makes the code larger; it really depends on the loop etc.On an Intel/AMD64 platform with -march=native -O2 or lower optimization level, the code will likely end up with AVX instructions used but using shorter SSE XMM registers. To take full advantage of AVX YMM registers, the -ftree-vectorize, -O3 or -Ofast options should be used as wellAlso available are the -mtune and -mcpu flags. These flags are normally only used when there is no available -march optionEven the GCC manual says that using -funroll-loops and -funroll-all-loops will make code larger and run more slowly.Stick to the basics: -march, -O, and -pipe. Note -pipe which just makes compilation faster but that's also useful.
•  » » 13 months ago, # ^ |   0 Here is an alternative opinion from another Linux distro: https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html Usually we recommend using -O2. This is the optimization level we use to build most SUSE and openSUSE packages, because at this level the compiler makes balanced size and speed trade-offs when building a general-purpose operating system. However, we suggest using -O3 if you know that your project is compute-intensive and is either small or an important part of your actual workload. Moreover, if the compiled code contains performance-critical floating-point operations, we strongly advise that you investigate whether -ffast-math or any of the fine-grained options it implies can be safely used. Looks like you are right and -march=native also implies -mtune=native at least for the x86 target. As for -funroll-loops/-funroll-all-loops, GCC documentation says that: Spoiler-funroll-loops Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. -funroll-loops implies -frerun-cse-after-loop, -fweb and -frename-registers. It also turns on complete loop peeling (i.e. complete removal of loops with a small constant number of iterations). This option makes code larger, and may or may not make it run faster.-funroll-all-loopsUnroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly. -funroll-all-loops implies the same options as -funroll-loops. It's important to note that popular open source software typically has a long history. Years had been used to iron out bugs and improve performance. Many of the performance critical parts had been already identified and optimized using hand written assembly or intrinsics. Many loops that could actually benefit from unrolling, had been unrolled manually. There are relatively few low hanging fruits left for the compilers. This limits the usefulness of -ftree-vectorize and -funroll-loops optimization options when used with such code.But competitive programming solutions are usually small, somewhat messy and there's not much time available for doing manual loops unrolling due to short contests duration. This provides more optimization opportunities for the compiler. Pre-written library code is the only exception.
•  » » » 13 months ago, # ^ |   +3 Keep in mind that compute-intensive in OS applications usually means things like video processing, not complex algorithms. The vast majority doesn't have performance-critical parts at all. and there's not much time available for doing manual loops unrolling due to short contests duration The tradeoff isn't simply "no time to do this", it's "no time to do this probably useless thing". Copypasting the insides of a loop and renaming variables isn't hard when you know it's useful, but when you don't know, there's a good chance it's not useful.
•  » » » 13 months ago, # ^ | ← Rev. 2 →   0 Manual loop unrolling doesn't make any sense anymore, unless it is a non-trivial rearrangement of program logic that the compiler can't detect. At O3, the compiler is smart enough to unroll loops which it deems fit conservatively (for instance, this: https://godbolt.org/z/hzKvnsGfz — try it without -funroll-loops).-funroll-loops usually leads to a better performance (especially if it's a lightweight tight loop), but again, compiler options don't always do what you want them to do, and this doesn't unroll all loops either, since the unrolling is heuristic-based. Compilers might even reroll your loops if they find that some of your loops are hand-unrolled and are pretty bad due to the absence of such a guarantee (this is just a hypothetical).Compilers are smart enough at micro-optimizations (more so than most programmers), so I believe we shouldn't rule out the usefulness of loop-rolling that the compiler does.
•  » » 13 months ago, # ^ |   +21 This is probably the most sane thing to do.I would however add that if -march=native isn't set already, then bmi/bmi2 (if not that, then popcnt, lzcnt, tzcnt) should be considered for non-trivial bit operations (like __builtin_popcount, __builtin_ctz, __builtin_clz, __lg and so on). These would be pretty helpful in doing these bitwise operations and also speed up bitsets a bit.Ofast is unsafe for floating point operations (I remember someone saying that it breaks some solutions to geometry problems, which was the biggest reason I stopped using Ofast), and O3 might make code slower (from my experience on CF as well, but I tend to include it anyway since it happens rarely). One good thing about O3 (as pointed out above) is the larger bit vectors that you get (but you can do that by enabling AVX2 anyway).The point is, optimized code should not be necessary to pass, although it's nice to have.-march=native or -mtune=native being the only addition to the compiler options should be fine in my opinion. The reason why it's not used in production is that people want to build for other architectures too, but since all code compilation and execution happens on one system in the case of competitive programming, it makes sense to actually get as much performance out of it as possible. Some might argue it's even the best thing, since it allows you to use instructions from most ISAs that are supported, which makes it fair game for people who don't actually want to use pragmas (and for people who are afraid of using pragmas due to the fear that their submission will get an RE due to SIGILL).I would actually be interested in the flags that are supported on codeforces (the flags turned on by -march=native can be found by running g++ -march=native -E -v - &1 | grep cc1). For instance, my machine gives the following output. Output/usr/lib/gcc/x86_64-pc-linux-gnu/11.1.0/cc1 -E -quiet -v — -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -msgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=9216 -mtune=skylake -dumpbase -Still, in an ideal scenario, such optimization options shouldn't be involved (and perhaps be banned from code too), but cheesing problems using SIMD is way too fun.
•  » » 13 months ago, # ^ |   0 There's actually another way to get free performance: write high level idiomatic code and constexpr'ing as much as you can.Better to show with an example: https://gcc.godbolt.org/z/YocGTro41High level doesn't mean slow. Example above is overly high-level, it computes some nontrivial thing (sum of all node labels in some implicitly defined graph that is not stored in memory) and it compiles in a single machine instruction that just moves the result into the register.You can do memory allocations for no cost (one can think about memory for constants/precomputed stuff as registers); nasty cases/whatever can be elegantly expressed in this way.Not sure if this is common knowledge in CP.
•  » » » 13 months ago, # ^ |   0 That's not taking any inputs, so it's not applicable to programming contests.The question isn't whether a speedup trick is possible in some situation that can arise with non-zero chance. Plenty of those exist. It has to be useful in contests.
 » 13 months ago, # | ← Rev. 3 →   -18 [Deleted]
 » 13 months ago, # |   +19 As a somewhat strong competitor who mainly uses a language other than C++: Don't worry about widening the performance gap between C++ and slower languages with a change like this one. Most of what these flags will accomplish was already possible with pragmas, and SIMD/vectorization only provides a large advantage in some fairly specific situations, which are often simply absent in the performance bottlenecks of even somewhat complex code. The main effect of changing the compilation flags would be to make these optimizations more accessible, and harder for problem-setters to overlook.Most* of the pressure I feel to sometimes switch to C++ comes from the small minority of problems that have rather tight time constraints, and I think the same goes for pretty much every contestant not using a reliably-very-fast language like C++. So, unless this change meaningfully affects the frequency or severity of such occurrences, I think it won't meaningfully alter competitors' decisions. And while this change will certainly lead to tighter constraints on some problems, I intuitively expect that in most situations where this is enough to create serious difficulties for slower languages, it will come with the flip-side of also removing unintended C++ solutions (often much simpler than the alternatives), and those are themselves a strong incentive to switch languages for that problem. *Why only "most"The rest of this pressure comes from the relative inconvenience of describing some mutation-heavy algorithms in a pure language like Haskell. But this comes mainly from (in decreasing order of importance) (1) my tendency to not realize how stupid and improvable my ideas are until I've started implementing them, (2) the unique characteristics of fully-immutable coding, and (3) my lack of skill and confidence at the programming part of competitive programming; and none of those have anything to do with C++ in particular. I could just as easily switch to some other language (say, Kotlin) on a problem for this reason.PS: A belated thanks for the (unannounced?) change 6 months ago removing -XSafe from the GHC compilation settings. It would also be nice if the version of GHC shown in the Languages drop-down (currently 8.10.1) agreed with the version of GHC actually being used (currently 8.6.3).In the long run, I think the best way to narrow the gap is by adding more languages that can be run as 64-bit executables. This is why I am excited by the introduction of PyPy3 64-bit even though I will probably rarely use it myself. I think many share this sentiment.
•  » » 13 months ago, # ^ |   +23 Oh wow reaching GM with Haskell. orz
 » 13 months ago, # |   -7 Please don't, like 90% of the time -O3 flag is slower by a lot. -O2 is best. If user wants they can upgrade to -O3 with pragma.
•  » » 13 months ago, # ^ |   +14 Can you provide any examples of real Codeforces submissions that get slower with -O3? (since you write "90% of the time" and "by a lot" you'll probably need multiple examples) I've never seen any cases where -O3 is significantly slower. (but adding -funroll-loops to the default command line is probably a bad idea)
 » 13 months ago, # |   +5 Add swift please
 » 13 months ago, # |   0 It would be great to explore what's new in C++20
 » 13 months ago, # |   -23 As a c++ programmer, I can say that it will be a great job
 » 13 months ago, # |   0 Yes, it sounds good!!
 » 13 months ago, # | ← Rev. 2 →   0 c++20 Awesome
 » 13 months ago, # | ← Rev. 2 →   -12 n
 » 13 months ago, # |   0 Java SE 17 (LTS) is coming soon and hopefully u will add it as well.
 » 12 months ago, # |   -15 I believe complexity is one of the most important things for CP. If the programs cannot accept the judgement though it uses a accurate algorithm with acceptable complexity just due to the fact that it doesn't use skills like fast read or other optimizing codes, this situation should never happen in rated contests because of unfairness. I do believe more auto -O2 or -O3 can avoid this in certain condition. I believe there should be a judgement program to decide which one to use.