Tutorial: How to hack hash table in Python 3

#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
1	maomao90	161
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
10	TheScrasse	142

After uphacking many Python 3 submissions in 1625B - Elementary Particles, I decided to write this tutorial and share my approach to hacking Python 3 hash tables since I am not aware of any other other blog posts on this.

Method 1: Exploit identical hash values

This method is similar to the approach described in neal's blog on blowing up unordered_map in C++.

The hash function for non-negative integers $$$x$$$ in Python appears to be $$$f(x) = x \mod m$$$, where $$$m = 2^{31}-1$$$ for 32-bit Python and $$$m = 2^{61}-1$$$ for 64-bit Python. In particular, this method is useful for hacking the hash table where the keys are 64-bit integers.

Hence, by inserting $$$n$$$ multiples of $$$m$$$, we can create a hash collision of $$$n$$$ elements, hence blowing up the time complexity of insertion to $$$O(n^2)$$$.

Method 2: Exploit open addressing weaknesses in CPython

This method only works for CPython, not PyPy. If you manage to make the PyPy hash table TLE for small integer keys ($$$\leq 10^9$$$), please share your approach!

Open addressing is one way in which collisions in hash tables are managed.

Suppose the hash table contains an array, where some elements have been assigned a value from prior insertions while others have not. To insert a new element $$$x$$$ into a hash table, we compute its hash value, and deterministically probe until the element has not been assigned a value. Then, the value $$$x$$$ is assigned to that element. The search uses the same probing function to find the key. If the probing function is not deterministic, there would not be a way to retrieve the element that has been inserted.

For more details about open addressing, see Resource 1 and Resource 2.

The worst case scenario is where we probe $$$n$$$ times before finally finding an empty slot, but the CPython open addressing method is so complex that this is almost impossible. In particular, the hash table automatically expands after being two-thirds full, and the size in terms of the number of elements in the hash table is a power of 2. The bad news is that this open addressing method is hackable.

How to exploit this? There is a greedy heuristic which works quite well for an array of any size $$$N$$$.

To generate each element for the first $$$M$$$ elements (where $$$M = \frac{2^K-1}{3}$$$ for maximum damage in CPython implementation), we generate $$$k$$$ random candidate elements, simulate the number of probing operations required to insert each candidate element, and take the candidate element which gives the worst case each time.

Details on how CPython implements its hash table can be found here.

The remaining elements would be equal to the $$$M$$$ th element that was generated. The intuition behind is that as more elements are added, there will be higher potential for hash collisions. The $$$M$$$ th element was specifically chosen to make the search for the existence of the key to be as painful as possible for the hash table.

The code can be found here.

How to avoid getting hacked

The safest way is to follow this blog. If the input consists of small integers (smaller than hash modulus), then non-deterministically randomising the insertion order should be enough to avoid getting hacked via Method 2.

Practice problem
Div 1 coders can uphack Python 3 submissions for 1625B - Elementary Particles as practice for Method 2 until the end of the uphacking period.

Please correct me if there are any errors. Thanks.

user14767553's blog

Method 1: Exploit identical hash values

Method 2: Exploit open addressing weaknesses in CPython

How to avoid getting hacked