nskybytskyi's blog

By nskybytskyi, history, 3 years ago, In English

I was upsolving AtCoder Beginner Contest 195 problem F, aka "Coprime Present", where you need to count subsets (of a special set of size < 72) with pairwise coprime elements.

My solution is a bitmask dp over the prime factors used so far and a prefix of the set. My first submission.
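Since the submission is only linked, here is a minimal self-contained sketch of this approach (not the actual submitted code; the I/O handling and variable names are mine, for illustration only):

```cpp
#include <bits/stdc++.h>
using namespace std;

int main() {
    // Any prime dividing two values that differ by at most 72 must itself be
    // below 72, so only these 20 primes can ever be "shared".
    const vector<int> primes = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29,
                                31, 37, 41, 43, 47, 53, 59, 61, 67, 71};
    const int P = primes.size();  // 20

    long long A, B;
    cin >> A >> B;
    const int n = B - A + 1;  // size of the set, at most ~73

    // sig[i] = bitmask of the small primes dividing A + i
    vector<int> sig(n, 0);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < P; ++j)
            if ((A + i) % primes[j] == 0) sig[i] |= 1 << j;

    // dp[i][mask] = number of pairwise coprime subsets of the first i values
    //               whose union of small prime factors is exactly mask
    vector<vector<long long>> dp(n + 1, vector<long long>(1 << P, 0));
    dp[0][0] = 1;
    for (int i = 0; i < n; ++i)
        for (int mask = 0; mask < (1 << P); ++mask) {
            if (dp[i][mask] == 0) continue;
            dp[i + 1][mask] += dp[i][mask];             // skip A + i
            if ((mask & sig[i]) == 0)                   // take A + i if still coprime
                dp[i + 1][mask | sig[i]] += dp[i][mask];
        }

    long long answer = 0;
    for (int mask = 0; mask < (1 << P); ++mask) answer += dp[n][mask];
    cout << answer << '\n';
}
```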

Fairly straightforward, nothing smart: $$$O(N \cdot 2^P)$$$ time and memory, where $$$N$$$ is the size of the set and $$$P = 20$$$ is the number of primes below $$$72$$$. However, I was disappointed by the memory usage of 600+ MB. Therefore, I implemented a two-layer version of the same dp. Here is my second submission.

The difference is minimal: I just index the layers by layer_number & 1 and make sure to clear the previous layer before proceeding to the next one.
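In code the change amounts to something like this (again only a sketch, reusing sig, n and P from the snippet above):

```cpp
    // two rolling layers instead of n + 1 full ones
    vector<vector<long long>> dp(2, vector<long long>(1 << P, 0));
    dp[0][0] = 1;
    for (int i = 0; i < n; ++i) {
        const int cur = i & 1, nxt = cur ^ 1;
        fill(dp[nxt].begin(), dp[nxt].end(), 0LL);  // clear the layer we are about to fill
        for (int mask = 0; mask < (1 << P); ++mask) {
            if (dp[cur][mask] == 0) continue;
            dp[nxt][mask] += dp[cur][mask];
            if ((mask & sig[i]) == 0) dp[nxt][mask | sig[i]] += dp[cur][mask];
        }
    }
    // the answer is now the sum of dp[n & 1][mask] over all masks
```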

Obviously, it uses way less memory, only 30 MB. What surprised me is that the running time also decreased, from 472 ms to just 147 ms, an improvement by a factor of 3!

I have two hypotheses on why this happens:

  • two layers fit into a faster cache level, speeding up memory accesses;

  • memory allocation itself takes some time (there must be a reason why people write custom allocators, right?).

Can someone more experienced please tell me what actually happens there?

Vote: +106

» 3 years ago, # | +38

Memory allocation

» » 3 years ago, # ^ | Rev. 2, +9

    Yeah, 30 MB is too much to fit into cache at any level, so the speedup is purely related to memory allocation.

» » » 3 years ago, # ^ | +93

      meme

» » » » 3 years ago, # ^ | Rev. 2, +16

        Good one :)
        BTW, L4 cache exists, but that's another story :)

» » » 3 years ago, # ^ | Rev. 2, +1

      It's not about the cache; it's about memory allocation. Obviously even one layer wouldn't fit into cache, so there is no difference in that regard.

      UPD: I didn't understand your message at first, but it seems that you meant the same thing as me :) Hope this comment will clarify our statements.

» » » » 3 years ago, # ^ | +8

        I just edited my comment to make it more clear :)

» » » » 3 years ago, # ^ | Rev. 3, +10

        I don't know how accurate it is, but "sudo perf record -e cache-misses solution" shows 5 times more cache misses for the first solution than for the second on my machine (these counters may work differently on different CPUs, and obviously different CPUs may have different caches).

        PS: Hm, the cache-miss percentage is almost the same, 75% for both solutions, because the second solution has not only 5 times fewer cache misses but also 5 times fewer cache references.

» » 3 years ago, # ^ | Rev. 2, +24

    The problem is not memory allocation itself, but rather high memory utilization. So it seems to me that it is actually closer to a caching issue than to expensive allocations.

    Two submission examples:

    First: a huge amount of memory is allocated once per run, but since it is never used, the running time stays low (17 ms on the smallest test).

    Second: a huge amount of memory is allocated, but gradually, and the previous dp layers are freed as we go. So only two layers are allocated at any given time and the maximum running time is the same as in the two-layer solution (170 ms), yet the total allocated memory is still huge.
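    Roughly, that second pattern looks like this (a sketch that borrows sig, n and P from the snippet in the post, not the exact submitted code):

```cpp
// allocate a fresh layer on every step and free the previous one: the total
// amount allocated over the whole run is huge, but only two layers are live
// at any given moment
long long* prev = new long long[1 << P]();      // zero-initialized
prev[0] = 1;
for (int i = 0; i < n; ++i) {
    long long* cur = new long long[1 << P]();   // brand-new layer each step
    for (int mask = 0; mask < (1 << P); ++mask) {
        if (prev[mask] == 0) continue;
        cur[mask] += prev[mask];
        if ((mask & sig[i]) == 0) cur[mask | sig[i]] += prev[mask];
    }
    delete[] prev;                              // return the old layer to the allocator
    prev = cur;
}
// prev now holds the final layer; sum it up for the answer, then delete[] it
```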

» » » 3 years ago, # ^ | +1

      I think I just don't understand how this works: why is allocating huge memory and not using it fast? Godbolt says the allocation actually happens, so there is no compiler help on that side.

      Maybe in the second example the compiler optimized something too? However, I'm not sure that's possible based on my knowledge. On the other hand, maybe the allocator decides to hand out the same memory block on each allocation, so it's indeed equivalent to the two-layer dp?

» » » » 3 years ago, # ^ | Rev. 2, +31

        C++ new int64_t[...] and delete[] operate on heap memory. Since the requested memory blocks are all of the same size, it indeed works almost exactly like a small fixed pool of arrays.
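        A quick way to see this in action (the standard does not guarantee it, but it is typical allocator behavior when the freed block matches the size of the next request):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // repeatedly allocate and free blocks of the same size; a typical
    // allocator hands back the same region every time, so the printed
    // addresses usually coincide
    for (int i = 0; i < 4; ++i) {
        int64_t* layer = new int64_t[1 << 20];
        std::printf("iteration %d: %p\n", i, static_cast<void*>(layer));
        delete[] layer;
    }
}
```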

        As for the difference in time, it is simply due to the difference in the number of active memory pages. The more pages your process touches, the more frequently page faults happen. Allocating a huge unused array is fast because the actual page allocation happens lazily, on first access.
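        A rough way to observe the lazy behavior on Linux (the size below mimics the full dp table; exact numbers depend on the allocator and kernel settings):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    using clk = std::chrono::steady_clock;
    const std::size_t n = std::size_t(75) << 20;  // ~600 MB worth of int64_t

    auto t0 = clk::now();
    int64_t* big = new int64_t[n];            // uninitialized: pages only reserved
    auto t1 = clk::now();

    for (std::size_t i = 0; i < n; i += 512)  // touch one element per 4 KiB page
        big[i] = 1;
    auto t2 = clk::now();

    auto ms = [](clk::time_point a, clk::time_point b) {
        return static_cast<long long>(
            std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count());
    };
    std::printf("allocate: %lld ms, first touch: %lld ms\n", ms(t0, t1), ms(t1, t2));
    delete[] big;
}
```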

» » » » » 3 years ago, # ^ | +8
          some calls of first submission
» » » » » » 3 years ago, # ^ | Rev. 2, +3

            So, the biggest offenders are not the page faults per se, but the initializations of new pages?

» 3 years ago, # | -53

Obviously, it uses way less memory, only 30 MB. What surprised me is that the running time also decreased, from 472 ms to just 147 ms, an improvement by a factor of 3!

Improved by a factor of 3, not 3!

» 3 years ago, # | Rev. 3, -8

upd: deleted

» 3 years ago, # | Rev. 2, +11

memory allocation itself takes some time (there must be a reason why people write custom allocators, right?).

That's more important when you have a huge number of allocations (which is not the same as a large total allocated size), for example when allocating individual tree nodes.
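For illustration, here is a minimal node-pool sketch (hypothetical names, just to show the idea of replacing many small new calls with one upfront allocation):

```cpp
#include <cstddef>
#include <vector>

// A tiny fixed-capacity pool: one big allocation up front instead of one
// heap allocation per tree node (illustrative sketch only).
struct Node {
    int key = 0;
    Node* left = nullptr;
    Node* right = nullptr;
};

struct NodePool {
    std::vector<Node> storage;
    std::size_t used = 0;
    explicit NodePool(std::size_t capacity) : storage(capacity) {}
    Node* alloc(int key) {           // assumes capacity is never exceeded
        Node* node = &storage[used++];
        node->key = key;
        return node;
    }
};

int main() {
    NodePool pool(1 << 20);          // room for ~a million nodes, allocated once
    Node* root = pool.alloc(42);
    root->left = pool.alloc(7);      // each "allocation" is just an index bump
}
```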