M.Mahdi's blog

By M.Mahdi, 9 years ago, In English

Hi!
I was trying to solve problem E from the last contest (542E - Игра на графе), and I needed to solve the all-pairs shortest path problem. With n BFSs it could be done in O(nm), but I was too lazy to implement that! I saw the 3 second time limit, so I used Floyd-Warshall and bang... it got accepted in 1.7 seconds! 11022079
I tested my code on my machine without -O2, on a simple connected graph with 1000 vertices, and it took 15 seconds for my program to answer; with -O2 it took just 1.5 seconds!
I wrote another non-optimizable program with n^3 running time (you can see it here), and with n = 1000 it runs in 12 seconds on my machine.
So obviously -O2 can optimize Floyd-Warshall! I wonder how that's possible?!

P.S. As I_love_Tanya_Romanova said, my second example is nonsense! :D
Anyway, could anyone explain how my solution is optimized?
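
For reference, here is a minimal sketch of the kind of O(n^3) Floyd-Warshall loop in question (my own reconstruction on a toy path graph, not the exact submitted code); compiling it with and without -O2 should show a gap similar to the one described above:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

const int INF = 1e9;  // "infinity"; the sum of two such values still fits in an int

// Classic O(n^3) Floyd-Warshall. d[i][j] starts as the edge weight
// (INF if there is no edge, 0 on the diagonal).
void floyd_warshall(vector<vector<int>>& d, int n) {
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                d[i][j] = min(d[i][j], d[i][k] + d[k][j]);
}

int main() {
    int n = 1000;  // same order of magnitude as in the experiment above
    vector<vector<int>> d(n, vector<int>(n, INF));
    for (int i = 0; i < n; ++i) d[i][i] = 0;
    for (int i = 0; i + 1 < n; ++i) d[i][i + 1] = d[i + 1][i] = 1;  // simple path graph
    floyd_warshall(d, n);
    printf("%d\n", d[0][n - 1]);  // expect n - 1
}
```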

  • +28

»
9 years ago, # |
Rev. 3 | +39

Sounds like some sort of joke to me.

Right, I have no idea how -O2 works, and absolutely no idea about the possible speed-up from using -O2.

But you wrote code that does only about 1e9 extremely simple operations, and for some reason you expect it to take 10 seconds? We are in 2015 now, not in 2005) You don't use floating point calculations, you don't use slow stuff like / or %, so it shouldn't take long.

I don't know why it was slow on your machine, but CF servers have had this level of productivity for a long time already, not just for the last week or month. You can even find cases where problem authors decided that CF servers were very slow (like you did in this topic) and then set something like n=50000 here; we were not expecting n^2/2 to pass within the TL, and we are sorry that it happened :)

And your second example... Once again, is it a joke? :) You are comparing N^3 from Floyd with N^3 iterations of calling rand(), using both % and /?

P.S. Speaking of the productivity of CF servers, 472G - Design Tutorial: Increase the Constraints comes to mind :) Calculate how many operations the brute force does there, and still it is fast enough for 7 seconds.

  • »
    »
    9 years ago, # ^ |
    Rev. 2 | +14

    I didn't think /, % and rand() were slow. Now I have tested them and yes, you are right. Thank you!
    My main reason for surprise was the big gap between the running times of my solution with and without -O2. I tested it on another machine and got the same result. You should test it on your machine to see it for yourself! :D
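
    If you want to repeat that kind of test, here is a rough micro-benchmark sketch (my own code with placeholder names like time_ms, not the exact test from this thread; absolute timings vary a lot by machine and compiler flags):

    ```cpp
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    // Times a loop of "cheap" operations so the relative cost of +, %, / and
    // rand() can be compared on one machine. The divisors are compile-time
    // constants, so -O2 may turn / and % into multiplications; the contrast
    // with plain additions is clearest at -O0.
    template <class F>
    long long time_ms(F f) {
        auto start = std::chrono::steady_clock::now();
        f();
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
    }

    int main() {
        const long long N = 1000000000LL;  // ~1e9 iterations, the same order as n^3 in the post
        volatile long long sink = 0;       // volatile so the loops are not optimized away

        long long t_add = time_ms([&] { for (long long i = 0; i < N; ++i) sink += i; });
        long long t_mod = time_ms([&] { for (long long i = 1; i <= N; ++i) sink += i % 1000; });
        long long t_div = time_ms([&] { for (long long i = 1; i <= N; ++i) sink += i / 1000; });
        long long t_rnd = time_ms([&] { for (long long i = 0; i < N / 10; ++i) sink += std::rand(); });

        std::printf("+: %lld ms, %%: %lld ms, /: %lld ms, rand (N/10 calls): %lld ms\n",
                    t_add, t_mod, t_div, t_rnd);
    }
    ```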

  • »
    »
    9 years ago, # ^ |
      +24

    "I don't know why it was slow on your machine, but CF servers have had this level of productivity for a long time already"

    Are you comparing his code without optimizations with the same code on CF servers (which always enable optimizations)?

»
9 years ago, # |
  +53

You can compare the generated code by compiling with debug symbols ("-g") and then disassembling with "objdump -S yourexecutable". This prints the disassembled code interleaved with the source code. The basic idea is that the unoptimized code is 10x slower because it does 10x more stuff.

There were 2 main differences:

In the unoptimized version there are a lot of function calls, mostly for operating on the vectors. The ones called most often are vector::iterator::operator++ and the iterator comparison operator, and they in turn call even more functions, because the STL implementation is quite complicated. In the optimized version, on the other hand, everything is inlined. Instead of calling those two functions on every loop iteration, there are only 3 asm instructions: add a constant to a register, compare it with another register, jump if not equal. Even min results in ~20 instructions: move values from one set of registers to another to pass them as function arguments, call the min function, save the base register, move the arguments from the argument registers to local variables in stack memory, read the values back from the stack into registers, dereference the values passed by reference, compare the values, branch depending on the result, move the result to the correct register, restore the base register, return, and move the result to yet another register. The optimized version has only 2 instructions: compare the values in the right registers (cmp) and perform a conditional move (which is also a single instruction).

The other big difference is that most accesses to variables result in moving data from memory to registers and back, even when the same value is accessed twice in a row. There are also temporary variables, like i, that could fit in registers but are kept in stack memory instead.
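
To make this concrete, here is a toy loop of the same flavor (my own example, not the code from the submission): the body uses vector iterators and std::min, so at -O0 every ++it, it != v.end(), *it and min(...) is a real function call, while at -O2 they all get inlined. The build and objdump commands from the first paragraph are repeated in the comments; file names are just examples.

```cpp
// Build and inspect, as described above (file names are only examples):
//   g++ -g -O0 inner.cpp -o inner_O0 && objdump -S inner_O0 > inner_O0.asm
//   g++ -g -O2 inner.cpp -o inner_O2 && objdump -S inner_O2 > inner_O2.asm
#include <algorithm>
#include <cstdio>
#include <vector>
using namespace std;

// At -O0, operator++, operator!=, operator* and std::min are all real calls
// with the argument/stack shuffling described above. At -O2 they are inlined
// and the loop body shrinks to a few instructions.
int smallest(const vector<int>& v) {
    int best = v.front();
    for (vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        best = min(best, *it);
    return best;
}

int main() {
    vector<int> v(1000);
    for (int i = 0; i < 1000; ++i) v[i] = (i * 37) % 101;
    printf("%d\n", smallest(v));
}
```

Diffing the two objdump outputs is an easy way to see the difference described above.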

»
9 years ago, # |
  +45

Here are some images of the solve function: Unoptimized, Optimized with -O2, -O3 -march=native. Keep in mind that in the image for the unoptimized version you can't see the code of the STL functions that get called. Adding "-O3 -march=native" enables more aggressive optimizations like loop unrolling and vector instructions.
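
As a small extra illustration (my own sketch, not the code shown in the images): compiling a plain min-reduction loop like the one below with -O3 -march=native and disassembling it will typically show loop unrolling and packed-min SIMD instructions (for example pminsd / vpminsd on x86 CPUs with SSE4.1 / AVX2), while -O2 alone, at least with the GCC versions of that time, keeps it scalar.

```cpp
#include <cstdio>

// A min-reduction over a plain array. With -O3 -march=native, GCC typically
// unrolls this loop and uses packed-min SIMD instructions; look for names
// like pminsd / vpminsd in the disassembly (exact instructions depend on the CPU).
int min_of(const int* a, int n) {
    int best = a[0];
    for (int i = 1; i < n; ++i)
        if (a[i] < best) best = a[i];
    return best;
}

int main() {
    static int a[1000];
    for (int i = 0; i < 1000; ++i) a[i] = (1000 - i) * 7 % 997;
    printf("%d\n", min_of(a, 1000));
}
```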

  • »
    »
    9 years ago, # ^ |
      +26

    Thank you so much for your clear explanation!

  • »
    »
    9 years ago, # ^ |
      0

    Out of interest, what’s the disassembler you’re using? (Is it good?)

    • »
      »
      »
      9 years ago, # ^ |
        +8

      Those images were created with radare. Don't know if it's good, but at least it can draw control flow graphs.

»
9 years ago, # |
  +8

And this is why you should never compile code with optimizations disabled, especially code that uses the C++ standard library, because basically everything in it relies on inlining to avoid being terribly slow.