E869120's blog

By E869120, 12 months ago, In English,

Hello Codeforces!

Recently, at JOI 2018 Spring Camp, a young genius, QCFium, wrote a naive $$$O(NQ)$$$ algorithm for the problem Examination and got a perfect score. The code begins as follows:

#pragma GCC target ("avx2")
#pragma GCC optimize ("O3")
#pragma GCC optimize ("unroll-loops")
//Naive Solution as follows...

I found this very surprising: even with $$$N, Q \leq 100{,}000$$$, the $$$O(NQ)$$$ naive algorithm with only these 3 extra lines actually got accepted. To check whether this speedup is really effective, I wrote two versions of the code: one with the 3-line speedup and one without it.

Code A (With speedup)


Code B (Without speedup)


As a result, in custom invocation (code-test) on Codeforces, Code A used 2074 ms and Code B used 2292 ms, so Code A is faster by about 10.5%. QCFium also said, "The simpler the code is, the larger the relative gain from the 3-line speedup." So with the speedup, the running speed could become x1.2 or even x1.5.

So I have some questions for the community.
Is it possible to use this speedup in Codeforces official contests?
And is it legal to use it in the IOI selection contests in your country? (In the Japanese Olympiad in Informatics, it is OK to use this speedup.)

I would really appreciate you sharing your opinions.
Thank you for reading this blog.

  • Vote: I like it
  • +47
  • Vote: I do not like it

»
12 months ago, # |
  Vote: I like it +8 Vote: I do not like it

Yes, I've used such pragmas when trying to break some CF problem and it worked. You just need to pay attention to the compiler, since MSVC-specific pragmas don't work on GCC and vice versa (they're simply ignored).

In our IOI selection last year, one guy managed to squeeze some points out of one problem I reused from JOI with looser constraints. It didn't give full score, I think, but it was still quite an impressive speedup, and squeezing points like that was in fact one of the expected strategies for those unable to get full score. I imagine most local olympiads don't care — or don't know it matters. Anyway, adding pragmas for AVX and loop unrolling on top of every code you write most likely won't slow it down, and -O3 is often unnecessary or sometimes even slower than -O2.

»
12 months ago, # |
  Vote: I like it +36 Vote: I do not like it

We need MrDindows and dmkozyrev here

»
12 months ago, # |
Rev. 6   Vote: I like it +70 Vote: I do not like it

This is a short example of an x8 speedup for 625,000,000 multiplications of complex numbers: original 3759 ms, 396 KB; improved 187 ms, 400 KB.

I solved a lot of problems (one example) using a naive solution in O(n^2) or O(n*q), where q, n <= 200000. Only Ofast, avx, avx2 and fma help a lot (an x8 speedup, not something small like x1.2–x1.5); the rest is not significant. AVX is for packed floats/doubles, AVX2 for packed integer types, and FMA for more efficient instructions. When this is enabled, the compiler can generate machine-specific code that works with 256-bit registers using AVX instructions. But you need to write the code in a parallel style, with independent loop iterations.

You can read a guide to vectorization with Intel® C++ compilers. I'm using this in my everyday work too.

UPD. At the moment this is a GCC/Clang extension, but it is proposed for inclusion in a future C++ standard. Link: experimental::simd

UPD 2. Increasing all the constraints to 300–400k would help kill all such solutions.

  • »
    »
    12 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Your improved 187 ms, 400kb code contains:

    #pragma GCC optimize("Ofast")
    #pragma GCC target("avx,avx2,fma")
    

    which is a bit different from the 3 lines shown in the blog post:

    #pragma GCC target ("avx2")
    #pragma GCC optimize ("O3")
    #pragma GCC optimize ("unroll-loops")
    

    Which one is better to use for speeding up the code?

    • »
      »
      »
      12 months ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      I think that -Ofast includes all safe and unsafe optimizations for speed. You can check here. I'm using the first, but it seems that we need to write unroll-loops too.

      You can compile your source code with the following flags: -fopt-info, -fopt-info-loop, -fopt-info-loop-missed, -fopt-info-vec, -fopt-info-vec-missed (link to all options). They report which lines of code failed to be optimized and why.

      UPD. I remember that something from the list of optimizations allowed me to speed up a Segment Tree by 2 times, because it removed tail recursion in the recursive queries.

      • »
        »
        »
        »
        5 weeks ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        But a query in a segment tree uses the results from l to mid and from mid+1 to r and then combines them, which is not tail recursion, I think (since the recursive call is not the last thing done in the query function). Correct me if I am wrong?

    • »
      »
      »
      2 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      It is better to use the 2-liner, as #pragma GCC optimize("Ofast") enables all -O3 optimizations plus optimizations that are not valid for all standard-compliant programs: it turns on -ffast-math, -fallow-store-data-races, etc. Note, however, that -funroll-loops is not enabled by -Ofast, so declaring it explicitly can still make a difference. shahidul_brur

»
12 months ago, # |
  Vote: I like it +9 Vote: I do not like it

What do these 3 lines do? And if I put these pragmas in my code, will it speed it up?

»
12 months ago, # |
  Vote: I like it +8 Vote: I do not like it

Hi, does this work on CMS?

  • »
    »
    12 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Actually, it does work, at least in JOI 2018/2019 Spring Camp.

»
12 months ago, # |
  Vote: I like it +1 Vote: I do not like it

Does anyone know why simpler code makes the speedup larger? And do macros, or including bits/stdc++.h instead of, for example, iostream, affect the speedup?

»
12 months ago, # |
  Vote: I like it +13 Vote: I do not like it

Here is another example where a naive solution can get accepted with pragmas: https://codeforces.com/contest/911/submission/33820899

Vectorization of code can give really big speedups...

»
12 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Is there a way to solve today's Div2B 1143B - Nirvana with this optimization?

»
12 months ago, # |
  Vote: I like it 0 Vote: I do not like it

To everyone who doesn't know what's going on here: it seems that the topic starter doesn't know it either, and it looks like magic to him.

Better refer to dmkozyrev's message above in the comments.

»
12 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Just to let you folks know, last time I checked, it didn't work on USACO.

»
12 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Codeforces uses 32-bit binaries (although the servers themselves are 64-bit IIRC), so AVX won't work. Although I'm not completely sure that every language other than C++ also runs in 32-bit mode. If someone found a language running with a 64-bit interpreter, there would be an opportunity for some "bitness arbitrage"...

  • »
    »
    12 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Never mind, I am wrong. You actually can generate AVX instructions on x86.

»
8 months ago, # |
  Vote: I like it +3 Vote: I do not like it

Will it work only for naive algorithms? Because if I am just doing simple input and output, the execution time slows down by up to 4x... so what's the use of it?

  • »
    »
    2 months ago, # ^ |
      Vote: I like it +14 Vote: I do not like it

    Honestly, the main cause of the speedup is called vectorization, which the compiler does automatically due to the pragmas. After blindly trying for some time, I realized that auto-vectorization has very very limited use cases (i.e. it doesn't work most of the time).

    In fact, to truly know how your code has been optimized by the compiler, you need to get down to the assembly code. If you are afraid of the assembly code, stay away from these optimization pragmas and optimize on the algo level only.

»
2 months ago, # |
  Vote: I like it 0 Vote: I do not like it

This does not work for oj.uz

»
5 weeks ago, # |
Rev. 2   Vote: I like it +1 Vote: I do not like it

Consider using

#pragma GCC target ("native")

To learn about how different compilers do on different architectures with autovectorization, try

https://godbolt.org