myst-6's blog

By myst-6, history, 9 months ago, In English

I was practising earlier today on this problem from the most recent contest. When I use a bitset of size 5005, I get a memory usage of 3688KB. When I use a bitset of size 50005, I get a memory usage of 31168KB. When I use a bitset of size 400005, I get a memory usage of a huge 244924KB! But then, if I take the bitset operations outside of the recursive function call, even with a bitset of size 400005, I get a memory usage of only 796KB!

Inside my recursive function, I am defining no new variables except a single long long and a single int — to my knowledge, combined with an absolute worst case of 5000 function calls, this should only yield around 60KB of extra memory usage compared to a blank dfs. Obviously I am wrong somewhere. But I am confused — I am not declaring any new bitsets inside the recursive function, so why does the memory usage spike so high, and why does it depend on the size of the bitset?
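For reference, the shape of the code in question is roughly this (a trimmed sketch, not the actual submission; the tree, the count values, and the bitset size here are illustrative):

```cpp
#include <bitset>
#include <functional>
#include <vector>

const int B = 5005;                 // 50005 and 400005 in the heavier versions
std::bitset<B> knapsack;            // the only bitset declared anywhere
std::vector<std::vector<int>> adj;  // tree adjacency list
std::vector<int> cnt;               // count[] from the solution

void solve() {
    knapsack.reset();
    knapsack[0] = 1;
    std::function<void(int, int)> dfs = [&](int v, int p) {
        for (int u : adj[v])
            if (u != p) dfs(u, v);
        // No new bitset is declared here, yet the memory usage
        // still grows with the bitset size:
        knapsack |= knapsack << cnt[v];
    };
    dfs(0, -1);
}
```

With a three-vertex path and cnt = {1, 2, 3}, knapsack ends up with every subset sum of {1, 2, 3} set, i.e. bits 0 through 6.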

If anybody understands why, I would be grateful for you to leave an explanation in the comments. Thanks

  • Vote: I like it
  • +28
  • Vote: I do not like it

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

I am defining no new variables — but you are defining a new bitset (knapsack << count[v] returns a new bitset).

My guess is that the compiler figured out how to optimize this when it is outside of the lambda (I don't know how to check).
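It is easy to check how big that temporary is: std::bitset is a fixed-size value type with all of its words stored inline, so knapsack << count[v] is a full copy. A quick sketch (assuming the common layout of 64-bit words):

```cpp
#include <bitset>
#include <cstddef>

// std::bitset<N> stores ceil(N / 64) 64-bit words inline (no heap),
// so every value returned by operator<< carries all of them.
constexpr std::size_t kBits  = 400005;
constexpr std::size_t kBytes = sizeof(std::bitset<kBits>);

// At least N/8 bytes by construction; about 50KB in practice.
static_assert(kBytes >= kBits / 8, "bitset is stored entirely by value");
```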

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I thought knapsack << count[v] would be a transient value and not stored on the call stack. It just seems weird that this would cause the memory usage to spike; surely it should be deallocated right after it is used, since I'm not storing it in any variable. But maybe you're right, it's the best explanation I've got for now

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Well, I may be wrong, but what I've noticed is that there are $$$(5000^2 \times 4 + 5000 \times 4 + 10^6 + 5000 \times 64) \text{ bytes} \approx 100 \text{ MB}$$$, plus the variables of the recursive calls on the stack. Besides that, you're using C++14, which may cause this; I think newer versions have more control over memory. Try compiling with C++17 or C++20 and see if you get the same usage.

  • »
    »
    9 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    Can you explain how you derived $$$(5000^2 \times 4 + 5000 \times 4 + 10^6 + 5000 \times 64)$$$? Also, unfortunately, using a newer version of C++ does not change the memory usage.

»
9 months ago, # |
  Vote: I like it 0 Vote: I do not like it

When I try to submit using C++20, it gives RTE on test 10.

»
9 months ago, # |
  Vote: I like it +13 Vote: I do not like it

Very strange. Both codes should work identically.

Either the compiler makes some strange optimizations, or...

»
9 months ago, # |
Rev. 6   Vote: I like it +36 Vote: I do not like it

Okay, I did a little bit of testing. The memory problem was indeed caused by knapsack |= knapsack << count[v];, as it creates a temporary bitset on the stack. I was able to fix that here: link
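In sketch form, the idea of the fix is to route the shift through one scratch bitset declared outside the recursion and use the in-place <<=, so no per-call temporary is needed (names here are illustrative, not the exact code behind the link):

```cpp
#include <bitset>
#include <functional>
#include <vector>

const int B = 400005;
std::bitset<B> knapsack;
std::bitset<B> shifted;             // single scratch buffer shared by all calls
std::vector<std::vector<int>> adj;  // tree adjacency list
std::vector<int> cnt;

void solve() {
    knapsack.reset();
    knapsack[0] = 1;
    std::function<void(int, int)> dfs = [&](int v, int p) {
        for (int u : adj[v])
            if (u != p) dfs(u, v);
        shifted = knapsack;   // copy into static storage, not the stack
        shifted <<= cnt[v];   // in-place shift: no temporary bitset at all
        knapsack |= shifted;
    };
    dfs(0, -1);
}
```

Since shifted lives in static storage, each recursive frame holds only the loop state and call overhead, independent of B.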

I didn't really get into the problem, but I think that in the worst case the line knapsack |= knapsack << count[v]; can be executed somewhere around $$$O(n \log n)$$$ times. Therefore, it should take about 400 MB. Though $$$O(n \log n)$$$ is actually higher than the real value, the memory usage should be at least 200 MB, and that's exactly what we are seeing.

I think that the submission with memory usage of 796KB is actually working *wrong*. It should have the same memory usage as the other code, as they are just absolutely the same.

My theory is that the compiler did optimizations that prevent using too much of the stack memory. Maybe it's because the first code is run in a lambda function and the second isn't, maybe not, I don't know. Would be happy if anyone could contribute to my theory.

UPD: I disassembled the codes. Indeed, they look different.

In the code that uses a lot of memory the line knapsack |= knapsack << count[v]; corresponds to:

mov     rdi, QWORD PTR [r12+8]
call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov     rsi, rbx
lea     rdi, [rsp+48]
movsx   rdx, DWORD PTR [rax]
call    std::bitset<400005ul>::operator<<(unsigned long) const
lea     rsi, [rsp+48]
mov     rdi, rbx
call    std::bitset<400005ul>::operator|=(std::bitset<400005ul> const&)

And in the second code:

mov     rdi, r13
call    std::vector<int, std::allocator<int> >::operator[](unsigned long)
mov     rsi, rbx
lea     rdi, [rsp+50224]
movsx   rdx, DWORD PTR [rax]
call    std::bitset<400005ul>::operator<<(unsigned long) const
lea     rsi, [rsp+50224]
mov     rdi, rbx
call    std::bitset<400005ul>::operator|=(std::bitset<400005ul> const&)

As you can see, the first code has stack offset 48, and the second code has 50224. If we do the math, 50224 * 8 is a little bit more than 400k bits.

I can't read assembly very well, but my explanation of this is:

In the first code, the compiler can't optimize the bitset operation, so it simply places a new bitset at the current top of the stack (remember, the stack grows downwards). In the second code, the compiler optimizes it by placing the new bitset at a fixed offset of ~400k bits. When the operation is executed, the new bitset does not move the stack pointer, therefore not using any additional space. When the next iteration begins, the stack pointer stays in the same place, reusing the old 400k bits again.

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Your comment and sammyuri's comment give a good insight into what's happening. Thank you for the detailed explanation. It's interesting to see how creating another bitset to store the knapsack << count[v] actually reduces the memory usage down to normal.

»
9 months ago, # |
  Vote: I like it +5 Vote: I do not like it

Comparing the assembly of the low and high memory submissions using Godbolt, as far as I can tell the $$$244924KB$$$ program allocates one bitset on the stack in the solve() subroutine, and one additional bitset when the lambda is called, each of which are just over $$$50KB$$$. So if the lambda is recursively called about $$$5000$$$ times (which happens in test case 10), there are about $$$5000 \times 50KB = 250,000KB$$$ worth of bitsets allocated on the stack simultaneously.

With the other program, however, the solve() subroutine allocates two bitsets on the stack which total $$$100KB$$$, then reuses the second one for all the << operations.
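The arithmetic lines up with the observed numbers: a bitset<400005> occupies just over 50KB, so roughly 5000 live frames account for the ~244MB (a sketch, assuming the usual layout of 64-bit words):

```cpp
#include <cstddef>

// ceil(400005 / 64) = 6251 words of 8 bytes each:
constexpr std::size_t words = (400005 + 63) / 64;   // 6251
constexpr std::size_t bytes = words * 8;            // 50008 bytes, ~50KB

// One such temporary alive in each of ~5000 recursive frames:
constexpr std::size_t total_kb = 5000ULL * bytes / 1024;

static_assert(words == 6251 && bytes == 50008, "per-frame temporary size");
static_assert(total_kb == 244179, "close to the observed 244924KB");
```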

  • »
    »
    9 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Your comment and lis05's comment give a good insight into what's happening. Thank you for the detailed explanation. I guess the compiler doesn't think to reuse the memory for some strange reason, but this is useful to know!

»
9 months ago, # |
  Vote: I like it +5 Vote: I do not like it

I faced the same issue while solving the latest E2. It's fine to solve E1 with a bitset of size 5000, but the memory consumption went wild and I got RTE when trying to migrate my solution to E2. I thought it was the 1e6 bitset's fault and used some binary lifting tricks and Python-generated code to avoid extra bitsets, but that didn't work. Many thanks for your blog and lis05's comment; now I see why and how.