conqueror_of_tourist's blog

By conqueror_of_tourist, history, 7 months ago, In English

In the recent Goodbye 2021 Round on Problem E my code seems to run a lot slower when running on pypy64 as opposed to regular pypy.

The following submissions are identical, with the exception being the language used.

141113239 (pypy3.7 — AC 732ms)

141185703 (pypy3.8 64bit — TLE)

Running on Custom Invocation on one testcase of size $$$10^5$$$ shows that pypy3.7 will take about 250ms whereas pypy3.8-64 will take 5600ms.

Profiling my code it seems like the culprit is my find function

#Finds smallest index of segtree such that seg[ind] <= x
#Only works if func = min
def find(self, x):
    curr = 1
    
    if self.data[curr] > x:
        return -1
    while curr < self._size:
        assert self.data[curr] <= x
        if self.data[2 * curr] <= x:
            curr = 2 * curr
        else:
            curr = 2 * curr + 1
        assert self.data[curr] <= x
    return curr - self._size

Replacing this function with a trivial (but incorrect) function makes the submission run in ~230 ms in both languages. However it seems like this function should run in $$$O(log n)$$$ and not be 20x slower than the rest of my code (and if itis slow, why is it not an issue in 32bit pypy?). If anyone can figure out the issue here (and what I need to avoid to be able to run code in pypy64 quickly) it would be greatly appreciated.

 
 
 
 
  • Vote: I like it
  • +82
  • Vote: I do not like it

»
7 months ago, # |
  Vote: I like it +48 Vote: I do not like it

I've also noticed that 64 bit PyPy on CF sometimes runs super slow. Like for example here

PyPy3 64 bit 1949 ms 140113619

PyPy3 32 bit 374 ms 141188674

I've also had many people messaging me about weird TLEs with 64 bit PyPy:

https://codeforces.com/blog/entry/98385?#comment-872293

https://codeforces.com/blog/entry/98385?#comment-872290

Something definitely seems buggy with 64 bit PyPy on CF. But I haven't yet had the time to figure out what is causing it or if I can even replicate the TLEs locally.

  • »
    »
    7 months ago, # ^ |
    Rev. 2   Vote: I like it +97 Vote: I do not like it

    An update, I found a really really good killer

    for i in range(10**7):
        if sum([]):
            break
    

    Using custom invocation on Codeforces, it runs in 4.5 s in 64 bit PyPy3 and 61 ms in 32 bit PyPy3.

    • »
      »
      »
      7 months ago, # ^ |
        Vote: I like it +55 Vote: I do not like it

      I've opened up an issue about this on PyPy's website https://foss.heptapod.net/pypy/pypy/-/issues/3629 .

      It seems that when PyPy made the jump from Python3.7 to Python3.8 they introduced some kind of bug. pypy3.8-v7.3.7 has the bug, but pypy3.7-v7.3.7 doesn't. A quick fix for CF would be to switch from pypy3.8-v7.3.7-win64 to pypy3.7-v7.3.7-win64. The problem is not related to 32 bit vs 64 bit, nor is it a windows vs linux thing.

      One fun thing to note is that adding empty for loops everywhere seems to fix it. For example

      for i in range(10**7):
          for _ in range(1): pass
          if sum([]):
              break
      

      runs fast no matter PyPy version.

      • »
        »
        »
        »
        7 months ago, # ^ |
          Vote: I like it +52 Vote: I do not like it

        Thank you. I temporarily reverted it to 7.3.5 here.

        • »
          »
          »
          »
          »
          7 months ago, # ^ |
            Vote: I like it +23 Vote: I do not like it

          Oh nice!

          I just tried resubmitting many of the mysterious TLEs, and they all got AC now. =)

»
7 months ago, # |
  Vote: I like it 0 Vote: I do not like it

This is why I don't use 64-bit on contests :P

»
7 months ago, # |
Rev. 2   Vote: I like it +87 Vote: I do not like it

I tested it a bit in custom invocation with random input. If you comment out the asserts in the while loop, it runs in 650ms, otherwise 6.5 seconds. Funny thing is that you can replace it with an assert True or if 0 > 1: break and the same bug will still occur (but lowered to ~5seconds).

So my theory is that loop got unrolled but the additional branches blew up the number of possible code paths. I think told pajenegod about an similar issue before for pyrival's RMQ in 117322863. The solution there was to add a no-op for _ in range(1): pass to prevent inlining (and indeed the same trick seems to work for unrolling too). Explanation from the dev about why this happens can be found in this issue. Some background on how jit guards/bridges work can be found at: https://youtu.be/NQfpHQII2cU?t=1389

Anyway this is occurring often enough that you should report it to https://foss.heptapod.net/groups/pypy/-/issues. The devs are fairly responsive and have fixed other jit bugs reported by codeforce users before. You should probably try to reduce it to a minimal reproducible example first though.

Though I can't read them, you can also output jit logs with something like PYPYLOG=jit-log-opt,jit-summary,jit-backend:logfile pypy3.8 bug.py.

Summary without that assert line:

Tracing:      	81	0.070865
Backend:      	81	0.029770
TOTAL:      		0.422342
ops:             	116056
heapcached ops:  	129132
recorded ops:    	35992
  calls:         	3995
guards:          	7006
opt ops:         	20512
opt guards:      	3911
opt guards shared:	2224
forcings:        	0
abort: trace too long:	0
abort: compiling:	0
abort: vable escape:	0
abort: bad loop: 	0
abort: force quasi-immut:	0
abort: segmenting trace:	0
nvirtuals:       	2512
nvholes:         	825
nvreused:        	625
vecopt tried:    	0
vecopt success:  	0
Total # of loops:	30
Total # of bridges:	52
Freed # of loops:	0
Freed # of bridges:	0

Summary with the assert:

Tracing:      	672	1.453424
Backend:      	671	0.186138
TOTAL:      		3.310882
ops:             	3332025
heapcached ops:  	1164403
recorded ops:    	318645
  calls:         	111855
guards:          	68843
opt ops:         	338865
opt guards:      	43555
opt guards shared:	39539
forcings:        	0
abort: trace too long:	1
abort: compiling:	0
abort: vable escape:	0
abort: bad loop: 	0
abort: force quasi-immut:	0
abort: segmenting trace:	0
nvirtuals:       	2999
nvholes:         	768
nvreused:        	611
vecopt tried:    	0
vecopt success:  	0
Total # of loops:	29
Total # of bridges:	643
Freed # of loops:	10
Freed # of bridges:	19
  • »
    »
    7 months ago, # ^ |
      Vote: I like it -202 Vote: I do not like it

    such a shame to see a master on CodeForces not using spoiler to place in a long comment.

    Pathetic!

»
7 weeks ago, # |
  Vote: I like it +28 Vote: I do not like it

I updated PyPy3-64 to 7.3.9 (3.9.10). I will be glad if you could test this version.

  • »
    »
    7 weeks ago, # ^ |
    Rev. 2   Vote: I like it +12 Vote: I do not like it

    I retested their submissions and it's now very good:
    new version 779 ms vs old version TLE.
    new version 358 ms vs old version 1949 ms.

    The below code also run in custom invocation in 779 ms in pypy3 64bit vs 1185 ms in pypy3 32bit

    for i in range(10**9):
        if sum([]):
            break 
    
  • »
    »
    7 weeks ago, # ^ |
    Rev. 2   Vote: I like it +14 Vote: I do not like it

    Yes, the 3.9 version is better. Thank you MikeMirzayanov!

    I have tested it now against one of the recent TLEs I faced in a contest.

    Mysterious TLE — 2000ms

    Sweetly Accepted — 1996ms

    What I was able to learn from this is the importance of even .004 seconds in CP for a Python coder, to move from a TLE code to an AC code. I had say that's a narrow escape.

  • »
    »
    7 weeks ago, # ^ |
      Vote: I like it +33 Vote: I do not like it

    Also I updated Go and PHP. Welcome!

»
6 weeks ago, # |
  Vote: I like it +3 Vote: I do not like it

new pypy3 64 sub pypy3 sub do not know the issue of this random TLE