NiceClock's blog

By NiceClock, 3 years ago, In English

Hello, Codeforces! A few days ago I solved a task where I had to optimize an $$$O(N^3)$$$ solution with $$$N \leq 2000$$$. It involved a $$$2000 \times 2000$$$ matrix. After all the optimizations, the following happened:

1) a[2000][2000] got TL

2) a[2048][2048] got TL

3) a[2048][2000] got AC

I understand that sizes like $$$2^k$$$ are ~~faster~~ slower because of cache and all that jazz. But I can't reconcile that knowledge with the result I got.

Does anyone know why this is happening?

P.S. It was explained to me by the Great and Powerful Xellos.

P.P.S. Half of my knowledge about optimizations comes from [The Lord of the Chairs who wished to stay behind the scenes]; I am grateful to him for that, and for the following optimization fact:

If the task requires XOR, then it is faster to use unsigned int instead of int. Why? An int stores its sign in a dedicated bit, and we all know that 1 XOR 1 = 0. So, the intuition goes, when XORing two ints this sign bit has to be handled separately (informally, with an 'if'). An unsigned int has no bit responsible for the sign, so nothing needs to be processed separately. That's the intuition! In practice I've tested it, and it speeds up the code a little (at most 1.4x, around 1.1x on average).
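
Not part of the original tests, just a minimal sketch of how one could measure the claim on one's own machine: the program below XOR-reduces the same data stored as int and as unsigned int and prints both timings. Sizes and constants are arbitrary; compile with -O2 and run it several times, since any difference is small and noisy.

```cpp
// Micro-benchmark sketch (illustrative, not the author's code):
// XOR-reduce a large array as int and as unsigned int and compare timings.
// Build: g++ -O2 -std=c++17 xor_bench.cpp
#include <chrono>
#include <cstdio>
#include <vector>

template <class T>
long long bench(const std::vector<T>& v) {
    auto t0 = std::chrono::steady_clock::now();
    T acc = 0;
    for (T x : v) acc ^= x;                    // the operation under test
    auto t1 = std::chrono::steady_clock::now();
    // Print acc so the compiler cannot throw the loop away.
    std::printf("acc=%llu  ", (unsigned long long)acc);
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    const int n = 1 << 24;                     // ~16M elements
    std::vector<int> a(n);
    std::vector<unsigned int> b(n);
    for (int i = 0; i < n; ++i) {
        a[i] = (int)(i * 2654435761u);         // mixed positive/negative values
        b[i] = (unsigned int)a[i];
    }
    std::printf("int:      %lld us\n", bench(a));
    std::printf("unsigned: %lld us\n", bench(b));
}
```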

The question is: why would we need optimizations that speed up the code by a factor of 1.1? Idk, it is just very interesting, for when you are bored.


»
3 years ago, # |

Have you considered that it's just a small perturbation in run time?

  • »
    »
    3 years ago, # ^ |

    I tested it several times.

I found that if two addresses differ by a power of two, they map to the same cache set, and when that set overflows, it slows us down. But if so, why is a[2048][2000] faster than a[2000][2000]?
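
A toy illustration of that mapping (the cache geometry is an assumption: 64-byte lines and 64 sets, as in a typical 32 KiB 8-way L1): it prints which set the first element of each row lands in, for row lengths of 2048 and 2000 ints. With a 2048-int row every row start hits the same set; with a 2000-int row the starts spread over many sets.

```cpp
// Toy illustration with assumed cache geometry: 64-byte lines, 64 sets.
#include <cstdio>

int main() {
    const long long LINE = 64, SETS = 64;              // assumed cache parameters
    for (long long rowLen : {2048LL, 2000LL}) {        // row length in ints
        std::printf("row length %lld ints:", rowLen);
        for (long long row = 0; row < 12; ++row) {
            long long byteOffset = row * rowLen * 4;   // offset of a[row][0] from a[0][0]
            std::printf(" %2lld", (byteOffset / LINE) % SETS);
        }
        std::printf("\n");   // 2048 -> always set 0; 2000 -> spread across sets
    }
}
```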

    • »
      »
      »
      3 years ago, # ^ |

Well, that was exactly my first idea, and the reason I didn't write it.

Anyway, I suggest you share the submissions (and the tests, if possible), because otherwise people can only guess blindly.

      • »
        »
        »
        »
        3 years ago, # ^ |

It was this task, but with a TL of 2.5 s and in a Yandex contest (a homework task at university). In this system, a[2048][2048] gets TL, a[2000][2000] gets AC in 4985 ms, and a[2048][2000] gets AC in 4760 ms.

        Code

        • »
          »
          »
          »
          »
          3 years ago, # ^ |

Also I found out that std::swap is faster than swapping via XOR (very sad), and now all 3 codes pass on e-olymp. But the time difference is still visible, and in the original contest only 1 of the solutions still gets AC. (Btw, I know that a faster solution with lists exists, but that doesn't make the question less interesting.)

          • »
            »
            »
            »
            »
            »
            3 years ago, # ^ |

"Also I found out that std::swap is faster than swapping via XOR (very sad)"

            But just why did you expect the opposite? :)

            • »
              »
              »
              »
              »
              »
              »
              3 years ago, # ^ |

Some 'olds' told me that they had been using that optimization since 6th grade. Probably the implementation of std::swap in modern C++ is better :)

              • »
                »
                »
                »
                »
                »
                »
                »
                3 years ago, # ^ |

For a simple swap you need just two reads and two writes. That is the minimal and optimal way, and it always has been.

Probably you or those "olds" mixed up the xor-swap with the arithmetic replacement of comparisons, which really can help sometimes by removing branching.
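
A small sketch of the two things being contrasted (mine, not from the thread): the xor-swap next to plain std::swap, plus one common form of "arithmetic replacement of a comparison", a branch-free min. The shift trick assumes two's-complement arithmetic right shift and no overflow in a - b.

```cpp
#include <algorithm>
#include <cstdio>

// xor-swap: three dependent xors; note it breaks when &x == &y.
void xor_swap(int& x, int& y) { x ^= y; y ^= x; x ^= y; }

// Branch-free min: uses the sign of (a - b) as a mask.
// Assumes two's complement, arithmetic right shift, no overflow in a - b.
int branchless_min(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;        // all ones if a < b, else all zeros
    return b + (diff & mask);     // b + (a - b) = a when a < b, otherwise b
}

int main() {
    int a = 5, b = 9;
    xor_swap(a, b);               // a = 9, b = 5
    std::swap(a, b);              // the plain swap: two reads, two writes
    std::printf("a=%d b=%d min=%d\n", a, b, branchless_min(a, b));
}
```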

»
3 years ago, # |

This was discussed some time ago in Russian here.

Briefly "cache and all that jazz" are much more complicated than it seems from high level, So it's funny, but the exactly the opposite you wrote is true: sizes like 2**k might be much slower :)

A link from that blog: http://igoro.com/archive/gallery-of-processor-cache-effects/; read the section about cache associativity. The problem shows up when you access addresses with a step of 2**k.
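
A rough re-creation of that experiment (array size, pass count and step values are my own choices, not the article's): touch every step-th int of a large array repeatedly and time it. Power-of-two steps tend to come out slower because all the touched cache lines compete for a small number of sets.

```cpp
// Stride-experiment sketch in the spirit of the linked article. Build with -O2.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;                         // 16M ints = 64 MB
    std::vector<int> a(N, 1);
    for (int step : {2000, 2048, 4000, 4096}) {
        auto t0 = std::chrono::steady_clock::now();
        for (int pass = 0; pass < 2000; ++pass)
            for (int i = 0; i < N; i += step)
                a[i]++;                            // touches ~N/step distinct cache lines
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("step %4d: %lld ms   (a[0]=%d)\n", step, ms, a[0]);
    }
}
```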

  • »
    »
    3 years ago, # ^ |

Actually I know, it was just a misprint, sorry for that. But the problem still exists: a[2048][2000] is faster than a[2000][2000]!

»
3 years ago, # |

If you know similar problems where this can be tested, please send them.

  • »
    »
    3 years ago, # ^ |

I tested it with the problem and solution from the post shef_2318 mentioned. With a[2048][2048], as mentioned in that post, it gets TL. With a[2048][2001] it runs in 904 ms; with a[2047][2001] and a[2049][2001], in 920 ms. That is about a 1.7% difference in a problem where we don't even touch the array that much. I mean, it is just another piece of evidence that a[2^k][n] works faster than a[n][n].

    • »
      »
      »
      3 years ago, # ^ |

904 ms and 920 ms are not really different, especially in yac.

»
3 years ago, # |

There's a more general effect in play here: you're throwing away time/memory locality, which is a big factor in cache performance. If you accessed nearby places in memory in nearby instructions, your code would be much faster because it actually benefits from cache, and array sizes wouldn't matter.

In this case, you have code that's shit no matter what array sizes you pick; however, by taking advantage of the particular (best) hardware design, you're making it slightly less shit when you replace 2048 with 2000. You still have a lot of cache misses and replaced entries, but not as many as before, so it's not shit enough to get TLE.
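
A small illustration of the locality point (my sketch, nothing from this thread): summing the same padded matrix row by row versus column by column. The row-wise pass reads consecutive addresses and is typically several times faster, and its speed barely depends on whether the row length is 2000 or 2048; the column-wise pass jumps a whole row ahead on every read and is slow either way.

```cpp
// Locality sketch: unit-stride vs row-stride traversal. Build with -O2.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 2000, STRIDE = 2048;            // logical size, padded row length
    std::vector<int> a((long long)N * STRIDE, 1);
    auto run = [&](bool byRows) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += byRows ? a[(long long)i * STRIDE + j]   // consecutive addresses
                              : a[(long long)j * STRIDE + i];  // 8 KB jump per read
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: sum=%lld, %lld us\n", byRows ? "rows" : "cols", sum,
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    };
    run(true);    // cache-friendly order
    run(false);   // cache-hostile order
}
```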

»
3 years ago, # |

Sadly, that's forbidden knowledge.

»
3 years ago, # |

Carlsen, is that you?

»
3 years ago, # |

There is a bit of text in Codechef's footer that goes like:

We also aim to have training sessions and discussions related to algorithms, binary search, technicalities like array size and the likes.

I have always wondered what could be so technical about array size. Now I finally understand!