myst-6's blog

By myst-6, history, 9 months ago, In English

I was practising earlier today on this problem from the most recent contest. When I use a bitset of size 5005, I get a memory usage of 3688KB. When I use a bitset of size 50005, I get a memory usage of 31168KB. When I use a bitset of size 400005, I get a memory usage of a huge 244924KB! But then, if I take the bitset operations outside of the recursive function call, even with a bitset of size 400005, I get a memory usage of only 796KB!

Inside my recursive function, I am defining no new variables except a single long long and a single int — to my knowledge, combined with an absolute worst case of 5000 function calls, this should only yield around 60KB of extra memory usage compared to a blank dfs. Obviously I am wrong somewhere. But I am confused — I am not declaring any new bitsets inside the recursive function, so why does the memory usage spike so high, and why does it depend on the size of the bitset?
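For reference, the shape of the code in question is roughly this (a trimmed sketch, not the actual submission; the tree, the count values, and the bitset size here are illustrative):

```cpp
#include <bitset>
#include <functional>
#include <vector>

const int B = 5005;                 // 50005 and 400005 in the heavier versions
std::bitset<B> knapsack;            // the only bitset declared anywhere
std::vector<std::vector<int>> adj;  // tree adjacency list
std::vector<int> cnt;               // count[] from the solution

void solve() {
    knapsack.reset();
    knapsack[0] = 1;
    std::function<void(int, int)> dfs = [&](int v, int p) {
        for (int u : adj[v])
            if (u != p) dfs(u, v);
        // No new bitset is declared here, yet the memory usage
        // still grows with the bitset size:
        knapsack |= knapsack << cnt[v];
    };
    dfs(0, -1);
}
```

With a three-vertex path and cnt = {1, 2, 3}, knapsack ends up with every subset sum of {1, 2, 3} set, i.e. bits 0 through 6.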

If anybody understands why, I would be grateful for you to leave an explanation in the comments. Thanks

  • Vote: I like it
  • +28
  • Vote: I do not like it

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

I am defining no new variables — but you are defining a new bitset (knapsack << count[v] returns a new bitset).

My guess is that the compiler figured out how to optimize this when it is outside of the lambda (I don't know how to check).
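It is easy to check how big that temporary is: std::bitset is a fixed-size value type with all of its words stored inline, so knapsack << count[v] is a full copy. A quick sketch (assuming the common layout of 64-bit words):

```cpp
#include <bitset>
#include <cstddef>

// std::bitset<N> stores ceil(N / 64) 64-bit words inline (no heap),
// so every value returned by operator<< carries all of them.
constexpr std::size_t kBits  = 400005;
constexpr std::size_t kBytes = sizeof(std::bitset<kBits>);

// At least N/8 bytes by construction; about 50KB in practice.
static_assert(kBytes >= kBits / 8, "bitset is stored entirely by value");
```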

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I thought knapsack << count[v] would be a transient value and not stored on the call stack. It just seems weird that this would cause the memory usage to spike; surely it should be deallocated right after it is used, since I'm not storing it in any variable. But maybe you're right, it's the best explanation I've got for now

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Well, I may be wrong, but what I've noticed is that there are $$$(5000^2 \times 4 + 5000 \times 4 + 10^6 + 5000 \times 64) \text{ bytes} \approx 100 \text{ MB}$$$, plus the variables of the recursive calls on the stack. Besides that, you're using C++14, which may cause this; I think newer versions have more control over memory. Try compiling with C++17 or C++20 and see if you get the same usage.

  • »
    »
    9 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    Can you explain how you derived $$$(5000^2 \times 4 + 5000 \times 4 + 10^6 + 5000 \times 64)$$$? Also, unfortunately, using a newer version of C++ does not change the memory usage.

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

When I try to submit using C++20, it gives RTE on test 10.

»
9 months ago, # |
  Vote: I like it +13 Vote: I do not like it

Very strange. Both codes should work identically.

Either the compiler makes some strange optimizations, or...

»
9 months ago, # |
Rev. 6   Vote: I like it +36 Vote: I do not like it

Okay, I did a little bit of testing. The memory problem was indeed caused by knapsack |= knapsack << count[v];, as it creates a temporary bitset on the stack. I was able to fix that here: link
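In sketch form, the idea of the fix is to route the shift through one scratch bitset declared outside the recursion and use the in-place <<=, so no per-call temporary is needed (names here are illustrative, not the exact code behind the link):

```cpp
#include <bitset>
#include <functional>
#include <vector>

const int B = 400005;
std::bitset<B> knapsack;
std::bitset<B> shifted;             // single scratch buffer shared by all calls
std::vector<std::vector<int>> adj;  // tree adjacency list
std::vector<int> cnt;

void solve() {
    knapsack.reset();
    knapsack[0] = 1;
    std::function<void(int, int)> dfs = [&](int v, int p) {
        for (int u : adj[v])
            if (u != p) dfs(u, v);
        shifted = knapsack;   // copy into static storage, not the stack
        shifted <<= cnt[v];   // in-place shift: no temporary bitset at all
        knapsack |= shifted;
    };
    dfs(0, -1);
}
```

Since shifted lives in static storage, each recursive frame holds only the loop state and call overhead, independent of B.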

I didn't really get into the problem, but I think that in the worst case the line knapsack |= knapsack << count[v]; can be executed somewhere around $$$O(n \log n)$$$ times. Therefore, it should take about 400 MB. Though $$$O(n \log n)$$$ is actually higher than the real value, the memory usage should be at least 200 MB, and that's exactly what we are seeing.

I think that the submission with memory usage of 796KB is actually working *wrong*. It should have the same memory usage as the other code, as they are just absolutely the same.

My theory is that the compiler did optimizations that prevent using too much of the stack memory. Maybe it's because the first code is run in a lambda function and the second isn't, maybe not, I don't know. Would be happy if anyone could contribute to my theory.

UPD: I disassembled the codes. Indeed, they look different.

In the code that uses a lot of memory the line knapsack |= knapsack << count[v]; corresponds to:

mov     rdi, QWORD PTR [r12+8]
call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov     rsi, rbx
lea     rdi, [rsp+48]
movsx   rdx, DWORD PTR [rax]
call    std::bitset<400005ul>::operator<<(unsigned long) const
lea     rsi, [rsp+48]
mov     rdi, rbx
call    std::bitset<400005ul>::operator|=(std::bitset<400005ul> const&)

And in the second code:

mov     rdi, r13
call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov     rsi, rbx
lea     rdi, [rsp+50224]
movsx   rdx, DWORD PTR [rax]
call    std::bitset<400005ul>::operator<<(unsigned long) const
lea     rsi, [rsp+50224]
mov     rdi, rbx
call    std::bitset<400005ul>::operator|=(std::bitset<400005ul> const&)

As you can see, the first code has stack offset 48, and the second code has 50224. If we do the math, 50224 * 8 is a little bit more than 400k bits.

I can't read assembly very well, but my explanation of this is:

In the first code, the compiler can't optimize the bitset operation, so it simply places a new bitset at the current top of the stack (remember, the stack grows downwards). In the second code, the compiler optimizes it by placing the new bitset at a fixed offset of ~400k bits. When the operation is executed, the new bitset does not move the stack pointer, therefore not using any additional space. When the next iteration begins, the stack pointer stays in the same place, reusing the old 400k bits again.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Your comment and sammyuri's comment give a good insight into what's happening. Thank you for the detailed explanation. It's interesting to see how creating another bitset to store the knapsack << count[v] actually reduces the memory usage down to normal.

»
9 months ago, # |
  Vote: I like it +5 Vote: I do not like it

Comparing the assembly of the low and high memory submissions using Godbolt, as far as I can tell the $$$244924KB$$$ program allocates one bitset on the stack in the solve() subroutine, and one additional bitset when the lambda is called, each of which are just over $$$50KB$$$. So if the lambda is recursively called about $$$5000$$$ times (which happens in test case 10), there are about $$$5000 \times 50KB = 250,000KB$$$ worth of bitsets allocated on the stack simultaneously.

With the other program, however, the solve() subroutine allocates two bitsets on the stack which total $$$100KB$$$, then reuses the second one for all the << operations.
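The arithmetic lines up with the observed numbers: a bitset<400005> occupies just over 50KB, so roughly 5000 live frames account for the ~244MB (a sketch, assuming the usual layout of 64-bit words):

```cpp
#include <cstddef>

// ceil(400005 / 64) = 6251 words of 8 bytes each:
constexpr std::size_t words = (400005 + 63) / 64;   // 6251
constexpr std::size_t bytes = words * 8;            // 50008 bytes, ~50KB

// One such temporary alive in each of ~5000 recursive frames:
constexpr std::size_t total_kb = 5000ULL * bytes / 1024;

static_assert(words == 6251 && bytes == 50008, "per-frame temporary size");
static_assert(total_kb == 244179, "close to the observed 244924KB");
```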

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Your comment and lis05's comment give a good insight into what's happening. Thank you for the detailed explanation. I guess the compiler doesn't think to reuse the memory for some strange reason, but this is useful to know!

»
9 months ago, # |
  Vote: I like it +5 Vote: I do not like it

I faced the same issue while solving the latest E2. It's fine to solve E1 with a bitset of size 5000, but the memory consumption went wild and I got RTE when trying to migrate my solution to E2. I thought it was the 1e6 bitset's fault and used some binary lifting tricks and Python-generated code to avoid extra bitsets, but that didn't work. Many thanks for your blog and lis05's comment; now I see why and how.