conqueror_of_tourist's blog

By conqueror_of_tourist, history, 7 months ago,

In the recent Goodbye 2021 Round on Problem E my code seems to run a lot slower when running on pypy64 as opposed to regular pypy.

The following submissions are identical, with the exception being the language used.

141113239 (pypy3.7 — AC 732ms)

141185703 (pypy3.8 64bit — TLE)

Running on Custom Invocation on one testcase of size $10^5$ shows that pypy3.7 will take about 250ms whereas pypy3.8-64 will take 5600ms.

Profiling my code it seems like the culprit is my find function

#Finds smallest index of segtree such that seg[ind] <= x
#Only works if func = min
def find(self, x):
curr = 1

if self.data[curr] > x:
return -1
while curr < self._size:
assert self.data[curr] <= x
if self.data[2 * curr] <= x:
curr = 2 * curr
else:
curr = 2 * curr + 1
assert self.data[curr] <= x
return curr - self._size


Replacing this function with a trivial (but incorrect) function makes the submission run in ~230 ms in both languages. However it seems like this function should run in $O(log n)$ and not be 20x slower than the rest of my code (and if itis slow, why is it not an issue in 32bit pypy?). If anyone can figure out the issue here (and what I need to avoid to be able to run code in pypy64 quickly) it would be greatly appreciated.

• +82

 » 7 months ago, # |   +48 I've also noticed that 64 bit PyPy on CF sometimes runs super slow. Like for example herePyPy3 64 bit 1949 ms 140113619 PyPy3 32 bit 374 ms 141188674I've also had many people messaging me about weird TLEs with 64 bit PyPy:https://codeforces.com/blog/entry/98385?#comment-872293https://codeforces.com/blog/entry/98385?#comment-872290Something definitely seems buggy with 64 bit PyPy on CF. But I haven't yet had the time to figure out what is causing it or if I can even replicate the TLEs locally.
•  » » 7 months ago, # ^ | ← Rev. 2 →   +97 An update, I found a really really good killer for i in range(10**7): if sum([]): break Using custom invocation on Codeforces, it runs in 4.5 s in 64 bit PyPy3 and 61 ms in 32 bit PyPy3.
•  » » » 7 months ago, # ^ |   +55 I've opened up an issue about this on PyPy's website https://foss.heptapod.net/pypy/pypy/-/issues/3629 . It seems that when PyPy made the jump from Python3.7 to Python3.8 they introduced some kind of bug. pypy3.8-v7.3.7 has the bug, but pypy3.7-v7.3.7 doesn't. A quick fix for CF would be to switch from pypy3.8-v7.3.7-win64 to pypy3.7-v7.3.7-win64. The problem is not related to 32 bit vs 64 bit, nor is it a windows vs linux thing.One fun thing to note is that adding empty for loops everywhere seems to fix it. For example for i in range(10**7): for _ in range(1): pass if sum([]): break runs fast no matter PyPy version.
•  » » » » 7 months ago, # ^ |   +52 Thank you. I temporarily reverted it to 7.3.5 here.
•  » » » » » 7 months ago, # ^ |   +23 Oh nice!I just tried resubmitting many of the mysterious TLEs, and they all got AC now. =)
 » 7 months ago, # |   0 This is why I don't use 64-bit on contests :P
 » 7 months ago, # | ← Rev. 2 →   +87 I tested it a bit in custom invocation with random input. If you comment out the asserts in the while loop, it runs in 650ms, otherwise 6.5 seconds. Funny thing is that you can replace it with an assert True or if 0 > 1: break and the same bug will still occur (but lowered to ~5seconds).So my theory is that loop got unrolled but the additional branches blew up the number of possible code paths. I think told pajenegod about an similar issue before for pyrival's RMQ in 117322863. The solution there was to add a no-op for _ in range(1): pass to prevent inlining (and indeed the same trick seems to work for unrolling too). Explanation from the dev about why this happens can be found in this issue. Some background on how jit guards/bridges work can be found at: https://youtu.be/NQfpHQII2cU?t=1389Anyway this is occurring often enough that you should report it to https://foss.heptapod.net/groups/pypy/-/issues. The devs are fairly responsive and have fixed other jit bugs reported by codeforce users before. You should probably try to reduce it to a minimal reproducible example first though.Though I can't read them, you can also output jit logs with something like PYPYLOG=jit-log-opt,jit-summary,jit-backend:logfile pypy3.8 bug.py. Summary without that assert line: Tracing: 81 0.070865 Backend: 81 0.029770 TOTAL: 0.422342 ops: 116056 heapcached ops: 129132 recorded ops: 35992 calls: 3995 guards: 7006 opt ops: 20512 opt guards: 3911 opt guards shared: 2224 forcings: 0 abort: trace too long: 0 abort: compiling: 0 abort: vable escape: 0 abort: bad loop: 0 abort: force quasi-immut: 0 abort: segmenting trace: 0 nvirtuals: 2512 nvholes: 825 nvreused: 625 vecopt tried: 0 vecopt success: 0 Total # of loops: 30 Total # of bridges: 52 Freed # of loops: 0 Freed # of bridges: 0 Summary with the assert: Tracing: 672 1.453424 Backend: 671 0.186138 TOTAL: 3.310882 ops: 3332025 heapcached ops: 1164403 recorded ops: 318645 calls: 111855 guards: 68843 opt ops: 338865 opt guards: 43555 opt guards shared: 39539 forcings: 0 abort: trace too long: 1 abort: compiling: 0 abort: vable escape: 0 abort: bad loop: 0 abort: force quasi-immut: 0 abort: segmenting trace: 0 nvirtuals: 2999 nvholes: 768 nvreused: 611 vecopt tried: 0 vecopt success: 0 Total # of loops: 29 Total # of bridges: 643 Freed # of loops: 10 Freed # of bridges: 19 
•  » » 7 months ago, # ^ |   -202 such a shame to see a master on CodeForces not using spoiler to place in a long comment.Pathetic!
 » 7 weeks ago, # |   +28 I updated PyPy3-64 to 7.3.9 (3.9.10). I will be glad if you could test this version.
•  » » 7 weeks ago, # ^ | ← Rev. 2 →   +12 I retested their submissions and it's now very good:new version 779 ms vs old version TLE.new version 358 ms vs old version 1949 ms. The below code also run in custom invocation in 779 ms in pypy3 64bit vs 1185 ms in pypy3 32bit for i in range(10**9): if sum([]): break 
•  » » 7 weeks ago, # ^ | ← Rev. 2 →   +14 Yes, the 3.9 version is better. Thank you MikeMirzayanov!I have tested it now against one of the recent TLEs I faced in a contest.Mysterious TLE — 2000msSweetly Accepted — 1996msWhat I was able to learn from this is the importance of even .004 seconds in CP for a Python coder, to move from a TLE code to an AC code. I had say that's a narrow escape.
•  » » 7 weeks ago, # ^ |   +33 Also I updated Go and PHP. Welcome!
 » 6 weeks ago, # |   +3 new pypy3 64 sub pypy3 sub do not know the issue of this random TLE