Strange behaviour in dp & bitmask problem — Codeforces 401D

→ Pay attention

Before contest
Codeforces Round 940 (Div. 2) and CodeCraft-23
3 days
Register now »

*has extra registration

→ Top rated

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

Countries | Cities | Organizations

View all →

→ Top contributors

#	User	Contrib.
1	maomao90	174
2	adamant	164
2	awoo	164
4	TheScrasse	160
5	nor	159
6	maroonrk	156
7	-is-this-fft-	150
8	SecondThread	147
9	orz	146
10	pajenegod	145

View all →

→ Find user

→ Recent actions

Detailed →

zoomer's blog

Strange behaviour in dp & bitmask problem — Codeforces 401D

By zoomer, history, 9 years ago, In English

I just got AC for this dynamic programming problem 401D - Roman and Numbers after many TLEs, WAs and trials using recursion then loops..

I have faced some behaviour, which -in my opinion- seems to be strange, and I can't figure out why it behaves like that:

Declaring the dp array like this : dp[100][1<<18] gives a TLE. However, dp[1<<18][100] doesn't.
using a long long for bitmask is slower than int !!! .... I know that 32bit integer would be sufficient for this problem, but I thought it would be safer to use long long.

I hope someone can explain it for me, thanks a lot :D

dp, bitmask, arrays, data structures

zoomer
9 years ago
5

Comments (5)

Write comment?

ykaya

9 years ago, # |

I dont know about your first question but using long long is slower than integer because long long has got 64bits and integer has got 32bits.

For example to sum 2 number in long long you have to check 64bits but in integer you have to check 32bits.So that means integer is x2 faster than long long. Its better use long long just when necessary.

→ Reply

zoomer

9 years ago, # ^ |

But I thought that operations on both 32bits integer and 64bits long long are made in the same time, not bit by bit!!

→ Reply

misof

9 years ago, # |

← Rev. 2 →

The first issue most likely is due to cache hits/misses. The second version will often run faster if you access multiple values dp[x][y] with the same x roughly at the same time -- which is indeed the case if you iterate over all subsets in the outer cycle of your DP. When you access some dp[x][y1], a whole chunk of consecutive bytes of memory (most likely including the entire dp[x]) can be brought into the cache, which then speeds up access to dp[x][y2].

The big picture here is that code that makes consecutive memory accesses will run faster than code that jumps all over memory. For the same reason, iterating over a vector can be much faster than iterating over a linked list with the same number of elements, if those elements happen to be scattered all across the process's memory.

→ Reply

zoomer

9 years ago, # ^ |

Thanks a lot .. it's quite clear now ..

I don't want to look like a dumb, but I don't know much about how memory works internally.

so I have one little question: I have learnt that accessing an element in an array made by calculating its location using a little equation.. and this makes all elements in array are accessible in a constant time.. why do "consecutive memory accesses will run faster"? although elements here are an array in the same block in the memory! Thanks,

→ Reply

misof

9 years ago, # ^ |

← Rev. 2 →

If you have an array A[3][3], the elements are stored in the memory in the order A[0][0], A[0][1], A[0][2], A[1][0], A[1][1], and so on.

If you have an array dp[100][1<<18], there about 16 megabytes of data between the elements dp[31][12345] and dp[47][12345]. If you access these two memory locations one after another, your program will be forced to access each of them separately.

On the other hand, if you have an array dp[1<<18][100], there are only a few bytes of data between dp[12345][31] and dp[12345][47]. These memory accesses happen locally, in the same short chunk of memory, and that may often make them faster.

Look at these two pieces of code:

They both do the exact same thing: they fill a matrix with data and then sum the data a few times. The first one is cache-friendly, accesses the physical memory in consecutive order, and terminates in 0.18 seconds. The second one jumps all over the place, which makes it run 4.89 seconds.

→ Reply