user14767553's blog

By user14767553, 2 years ago, In English

After uphacking many Python 3 submissions in 1625B - Elementary Particles, I decided to write this tutorial and share my approach to hacking Python 3 hash tables since I am not aware of any other other blog posts on this.

Method 1: Exploit identical hash values

This method is similar to the approach described in neal's blog on blowing up unordered_map in C++.

The hash function for non-negative integers $$$x$$$ in Python appears to be $$$f(x) = x \mod m$$$, where $$$m = 2^{31}-1$$$ for 32-bit Python and $$$m = 2^{61}-1$$$ for 64-bit Python. In particular, this method is useful for hacking the hash table where the keys are 64-bit integers.

Hence, by inserting $$$n$$$ multiples of $$$m$$$, we can create a hash collision of $$$n$$$ elements, hence blowing up the time complexity of insertion to $$$O(n^2)$$$.

Method 2: Exploit open addressing weaknesses in CPython

This method only works for CPython, not PyPy. If you manage to make the PyPy hash table TLE for small integer keys ($$$\leq 10^9$$$), please share your approach!

Open addressing is one way in which collisions in hash tables are managed.

Suppose the hash table contains an array, where some elements have been assigned a value from prior insertions while others have not. To insert a new element $$$x$$$ into a hash table, we compute its hash value, and deterministically probe until the element has not been assigned a value. Then, the value $$$x$$$ is assigned to that element. The search uses the same probing function to find the key. If the probing function is not deterministic, there would not be a way to retrieve the element that has been inserted.

For more details about open addressing, see Resource 1 and Resource 2.

The worst case scenario is where we probe $$$n$$$ times before finally finding an empty slot, but the CPython open addressing method is so complex that this is almost impossible. In particular, the hash table automatically expands after being two-thirds full, and the size in terms of the number of elements in the hash table is a power of 2. The bad news is that this open addressing method is hackable.

How to exploit this? There is a greedy heuristic which works quite well for an array of any size $$$N$$$.

To generate each element for the first $$$M$$$ elements (where $$$M = \frac{2^K-1}{3}$$$ for maximum damage in CPython implementation), we generate $$$k$$$ random candidate elements, simulate the number of probing operations required to insert each candidate element, and take the candidate element which gives the worst case each time.

Details on how CPython implements its hash table can be found here.

The remaining elements would be equal to the $$$M$$$ th element that was generated. The intuition behind is that as more elements are added, there will be higher potential for hash collisions. The $$$M$$$ th element was specifically chosen to make the search for the existence of the key to be as painful as possible for the hash table.

The code can be found here.

How to avoid getting hacked

The safest way is to follow this blog. If the input consists of small integers (smaller than hash modulus), then non-deterministically randomising the insertion order should be enough to avoid getting hacked via Method 2.

Practice problem
Div 1 coders can uphack Python 3 submissions for 1625B - Elementary Particles as practice for Method 2 until the end of the uphacking period.

Please correct me if there are any errors. Thanks.

  • Vote: I like it
  • +61
  • Vote: I do not like it