NiceClock's blog

By NiceClock, 3 years ago, In English

Hello, Codeforces! A few days ago I solved a task where I had to optimize an $$$O(N^3)$$$ solution with $$$N \leq 2000$$$. It involved a $$$2000 \times 2000$$$ matrix. After all the optimizations, the following happened:

1) a[2000][2000] got TL

2) a[2048][2048] got TL

3) a[2048][2000] got AC

I understand that sizes like $$$2^k$$$ are ~~faster~~ slower because of cache and all that jazz. But I can't reconcile that knowledge with the result I got.

Does anyone know why this is happening?

P.S. It was explained to me by the Great and Powerful Xellos.

P.P.S. Half of my knowledge about optimizations comes from [The Lord of the Chairs who wished to stay behind the scenes]; I am grateful to him for that, and for the following optimization fact:

If the task requires XOR, then it is faster to use unsigned int instead of int. Why? An int stores its sign in a dedicated bit, and we all know that 1 XOR 1 = 0. So, the intuition goes, when XORing two ints this sign bit has to be handled separately (informally, with an 'if'). An unsigned int has no bit responsible for the sign, so nothing needs to be processed separately. That's the intuition! In practice I've tested it, and it speeds up the code a little (at most 1.4x, around 1.1x on average).
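
Not part of the original tests, just a minimal sketch of how one could measure the claim on one's own machine: the program below XOR-reduces the same data stored as int and as unsigned int and prints both timings. Sizes and constants are arbitrary; compile with -O2 and run it several times, since any difference is small and noisy.

```cpp
// Micro-benchmark sketch (illustrative, not the author's code):
// XOR-reduce a large array as int and as unsigned int and compare timings.
// Build: g++ -O2 -std=c++17 xor_bench.cpp
#include <chrono>
#include <cstdio>
#include <vector>

template <class T>
long long bench(const std::vector<T>& v) {
    auto t0 = std::chrono::steady_clock::now();
    T acc = 0;
    for (T x : v) acc ^= x;                    // the operation under test
    auto t1 = std::chrono::steady_clock::now();
    // Print acc so the compiler cannot throw the loop away.
    std::printf("acc=%llu  ", (unsigned long long)acc);
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    const int n = 1 << 24;                     // ~16M elements
    std::vector<int> a(n);
    std::vector<unsigned int> b(n);
    for (int i = 0; i < n; ++i) {
        a[i] = (int)(i * 2654435761u);         // mixed positive/negative values
        b[i] = (unsigned int)a[i];
    }
    std::printf("int:      %lld us\n", bench(a));
    std::printf("unsigned: %lld us\n", bench(b));
}
```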

The question is: why would we need optimizations that speed up the code by a factor of 1.1? Idk, it is just very interesting, for when you are bored.


»
3 years ago, # |

Have you considered that it's just a small perturbation in run time?

  • »
    »
    3 years ago, # ^ |

    I tested it several times.

I found that if two addresses differ by a power of two, they map to the same cache set, and when that set overflows, it slows us down. But if so, why is a[2048][2000] faster than a[2000][2000]?
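
A toy illustration of that mapping (the cache geometry is an assumption: 64-byte lines and 64 sets, as in a typical 32 KiB 8-way L1): it prints which set the first element of each row lands in, for row lengths of 2048 and 2000 ints. With a 2048-int row every row start hits the same set; with a 2000-int row the starts spread over many sets.

```cpp
// Toy illustration with assumed cache geometry: 64-byte lines, 64 sets.
#include <cstdio>

int main() {
    const long long LINE = 64, SETS = 64;              // assumed cache parameters
    for (long long rowLen : {2048LL, 2000LL}) {        // row length in ints
        std::printf("row length %lld ints:", rowLen);
        for (long long row = 0; row < 12; ++row) {
            long long byteOffset = row * rowLen * 4;   // offset of a[row][0] from a[0][0]
            std::printf(" %2lld", (byteOffset / LINE) % SETS);
        }
        std::printf("\n");   // 2048 -> always set 0; 2000 -> spread across sets
    }
}
```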

    • »
      »
      »
      3 years ago, # ^ |

Well, that was exactly my first idea, and the reason I didn't write it.

Anyway, I suggest you share the submissions (and the tests, if possible), because otherwise people can only guess blindly.

      • »
        »
        »
        »
        3 years ago, # ^ |

It was this task, but with a TL of 2.5 s and in a Yandex contest (a homework task at university). In this system, a[2048][2048] gets TL, a[2000][2000] gets AC in 4985 ms, and a[2048][2000] gets AC in 4760 ms.

        Code

        • »
          »
          »
          »
          »
          3 years ago, # ^ |

Also I found out that std::swap is faster than swapping via XOR (very sad), and now all 3 codes pass on e-olymp. But the time difference is still visible, and in the original contest only 1 of the solutions still gets AC. (Btw, I know that a faster solution with lists exists, but that doesn't make the question less interesting.)

          • »
            »
            »
            »
            »
            »
            3 years ago, # ^ |

"Also I found out that std::swap is faster than swapping via XOR (very sad)"

            But just why did you expect the opposite? :)

            • »
              »
              »
              »
              »
              »
              »
              3 years ago, # ^ |

Some 'olds' told me that they had been using that optimization since 6th grade. Probably the implementation of std::swap in modern C++ is better :)

              • »
                »
                »
                »
                »
                »
                »
                »
                3 years ago, # ^ |

For a simple swap you need just two reads and two writes. That is the minimal and optimal way, and it always has been.

Probably you or those "olds" mixed up the xor-swap with the arithmetic replacement of comparisons, which really can help sometimes by removing branching.
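
A small sketch of the two things being contrasted (mine, not from the thread): the xor-swap next to plain std::swap, plus one common form of "arithmetic replacement of a comparison", a branch-free min. The shift trick assumes two's-complement arithmetic right shift and no overflow in a - b.

```cpp
#include <algorithm>
#include <cstdio>

// xor-swap: three dependent xors; note it breaks when &x == &y.
void xor_swap(int& x, int& y) { x ^= y; y ^= x; x ^= y; }

// Branch-free min: uses the sign of (a - b) as a mask.
// Assumes two's complement, arithmetic right shift, no overflow in a - b.
int branchless_min(int a, int b) {
    int diff = a - b;
    int mask = diff >> 31;        // all ones if a < b, else all zeros
    return b + (diff & mask);     // b + (a - b) = a when a < b, otherwise b
}

int main() {
    int a = 5, b = 9;
    xor_swap(a, b);               // a = 9, b = 5
    std::swap(a, b);              // the plain swap: two reads, two writes
    std::printf("a=%d b=%d min=%d\n", a, b, branchless_min(a, b));
}
```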

»
3 years ago, # |

This was discussed some time ago in Russian here.

Briefly "cache and all that jazz" are much more complicated than it seems from high level, So it's funny, but the exactly the opposite you wrote is true: sizes like 2**k might be much slower :)

A link from that blog: http://igoro.com/archive/gallery-of-processor-cache-effects/; read the section about cache associativity. The problem shows up when you access addresses with a step of 2**k.
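
A rough re-creation of that experiment (array size, pass count and step values are my own choices, not the article's): touch every step-th int of a large array repeatedly and time it. Power-of-two steps tend to come out slower because all the touched cache lines compete for a small number of sets.

```cpp
// Stride-experiment sketch in the spirit of the linked article. Build with -O2.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;                         // 16M ints = 64 MB
    std::vector<int> a(N, 1);
    for (int step : {2000, 2048, 4000, 4096}) {
        auto t0 = std::chrono::steady_clock::now();
        for (int pass = 0; pass < 2000; ++pass)
            for (int i = 0; i < N; i += step)
                a[i]++;                            // touches ~N/step distinct cache lines
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("step %4d: %lld ms   (a[0]=%d)\n", step, ms, a[0]);
    }
}
```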

  • »
    »
    3 years ago, # ^ |

Actually I know, it was just a misprint, sorry for that. But the problem still exists: a[2048][2000] is faster than a[2000][2000]!

»
3 years ago, # |

If you know similar problems where this can be tested, please send them.

  • »
    »
    3 years ago, # ^ |

I tested it with the problem and solution from the post shef_2318 mentioned. With a[2048][2048], as mentioned in that post, it gets TL. With a[2048][2001] it runs in 904 ms; with a[2047][2001] and a[2049][2001], in 920 ms. That is about a 1.7% difference in a problem where we don't even touch the array that much. I mean, it is just another piece of evidence that a[2^k][n] works faster than a[n][n].

    • »
      »
      »
      3 years ago, # ^ |

904 ms and 920 ms are not really different, especially in yac.

»
3 years ago, # |

There's a more general effect in play here: you're throwing away time/memory locality, which is a big factor in cache performance. If you accessed nearby places in memory in nearby instructions, your code would be much faster because it actually benefits from cache, and array sizes wouldn't matter.

In this case, you have code that's shit no matter what array sizes you pick; however, by taking advantage of the particular (best) hardware design, you're making it slightly less shit when you replace 2048 with 2000. You still have a lot of cache misses and replaced entries, but not as many as before, so it's not shit enough to get TLE.
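
A small illustration of the locality point (my sketch, nothing from this thread): summing the same padded matrix row by row versus column by column. The row-wise pass reads consecutive addresses and is typically several times faster, and its speed barely depends on whether the row length is 2000 or 2048; the column-wise pass jumps a whole row ahead on every read and is slow either way.

```cpp
// Locality sketch: unit-stride vs row-stride traversal. Build with -O2.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 2000, STRIDE = 2048;            // logical size, padded row length
    std::vector<int> a((long long)N * STRIDE, 1);
    auto run = [&](bool byRows) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += byRows ? a[(long long)i * STRIDE + j]   // consecutive addresses
                              : a[(long long)j * STRIDE + i];  // 8 KB jump per read
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: sum=%lld, %lld us\n", byRows ? "rows" : "cols", sum,
            (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    };
    run(true);    // cache-friendly order
    run(false);   // cache-hostile order
}
```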

»
3 years ago, # |

Sadly, that's forbidden knowledge.

»
3 years ago, # |

Carlsen, is that you?

»
3 years ago, # |

There is a bit of text in Codechef's footer that goes like:

We also aim to have training sessions and discussions related to algorithms, binary search, technicalities like array size and the likes.

I have always wondered what could be so technical about array size. Now I finally understand!