### Diego's blog

By Diego, 11 months ago, translation,

# TL;DR

If you code on python, use set or dict, and want to avoid hacks, you can read only the last section.

# Introduction

Most people interested in hacking know about the method of creating tests targeting unordered containers in C++. This method is described in more detail in post. However, creation of similar tests for set and dict in python is covered much less well, which I decided to fix with this post.

# A little theory about hash tables

I will not go into detailed descriptions and proofs, because there are plenty of them in the open sources. Let us consider only what interests us in this case.

An open-addressing hash table is a data structure that stores a set of elements in an array of knownly larger size, defining the position of the element as its hash taken modulo the size of the array. If hashes of several elements give the same cell in the array, the hash table tries to take another cell according to some rule, which depends on the particular implementation of the table. Usually this rule boils down to checking the cells $f(x), f(f(x)), f(f(f(x)))$ and so on, until an empty one is found. The function $f$ is often a linear transformation $(a * x + b) \% size$, where $size$ is the size of the array, and $a$ and $b$ are relatively prime with it.

# Python

Python uses a hash table with an array size equal to a power of two to implement dict, and the transformation is slightly more complex than a simple linear one -- $f(x) = (5x + 1 + pertrub) \% size$, where $pertrub$ is initially equal to hash, but is devided by 32 in each step. For a detailed implementation, see repository.

Also, the hash function for numbers in python is very predictable, it's just the number itself.

Thus, to make a countertest it is enough to find a sequence of indexes that will be searched for a particular number and preoccupy them, and then provoke a search for that number in the dictionary, which is quite easy to do for most problems.

The set implementation in Python3 is a bit more complicated, it does not test a single cell, but 10 consecutive cells, which can be observed here. However, building a test in this case is not very difficult either, just add $x, x + 1, x + 2, ..., x + 9$ instead of one number $x$.

# Tests

I used 153032429 (Thanks to turkids for it) and 153408991 from Codeforces Round 781 (Div. 2) to check. The result can be found as hacks number 796358 and 796362. Judging by the "unknown verdict" result, it went too well and I broke the author's solution at the same time. More details may know shishyando and Kirill22.

# How to protect yourself?

Unlike C++, Python does not provide a way to define its own hash function for an existing type (or I just don't know about it), but nobody prevents you from defining your type with a different hash function:

from random import getrandbits

RANDOM = getrandbits(32)

class Wrapper(int):
def __init__(self, x):
int.__init__(x)
def __hash__(self):
return super(Wrapper, self).__hash__() ^ RANDOM


An example of using this type can be found in 153409562. Unfortunately, while the author's solution breaks on my tests I will not be able to test the robustness of my solution with the Wrapper class. However, locally it works fast enough.

• +178

 » 11 months ago, # |   0 I can't view the hacks Image
•  » » 11 months ago, # ^ |   +1
 » 11 months ago, # |   0 This is very interesting. Did Python developers use SipHash for string keys in order to defend against Hash DoS, but left integer keys unprotected? from Python's configure.ac# str, bytes and memoryview hash algorithm AH_TEMPLATE(Py_HASH_ALGORITHM, [Define hash algorithm for str, bytes and memoryview. SipHash24: 1, FNV: 2, externally defined: 0]) 
•  » » 11 months ago, # ^ | ← Rev. 4 →   0 It's probably due to overhead on short keys. Rust, for example, uses SipHash-1-3 by default on all keys, however it does have a lot of overhead for short keys (like integer keys very common in competitive programming problems), e.g. compare:153429932 Siphash-1-3, 4694 ms150361213 custom hash function based on NASAM, 2167 ms
•  » » » 11 months ago, # ^ |   0 There's also aHash available for Rust: https://github.com/tkaitchuck/aHashAnd an unfinished discussion here about its implementation details when AES instructions are not available: https://github.com/tkaitchuck/aHash/issues/106
•  » » » » 11 months ago, # ^ |   0 Yeah I've noticed itBut there's no way to pull its crate for CF, and when I tried implementing it for myself it's much slower than my custom hash, at least on CF; perhaps it's not configured correctly to do hardware-accelerated AES
•  » » » » » 11 months ago, # ^ |   0 This is interesting, because your write_u64 hasher function doesn't exactly look lightweight with 3 multiplications and a bunch of shifts/rotates. I suspect that the specialized code path for short keys in aHash fallback should do fewer calculations. But this needs to be confirmed. Also AES is another part of the puzzle. Codeforces hardware is modern enough to support AES.
•  » » » » » » 11 months ago, # ^ | ← Rev. 3 →   0 To be clear, what I tried to implement was the AES version (using unsafe to call the AES intrinsic), not the fallback. My function is probably slightly slower than the AHash fallback, however I am somewhat less confident about the fallback's statistical propertiesEdit: Also, only two of those multiplies are actually performed. The update to the dither (self.1) gets optimized away when the hash function only calls write_u64 once per key.
 » 11 months ago, # |   +13 Also check this.
•  » » 11 months ago, # ^ |   +19 Thank you, I haven't seen it before. Also I tested my method for PyPy and it work as well as with cpython. Submissions: 153445268 and 153445556.
•  » » » 11 months ago, # ^ | ← Rev. 2 →   0 Why is Wrapper(i)=i? I checked for a few integers and it gives the same value!! Can I use this method in dictionaries for two different data types?
•  » » » » 11 months ago, # ^ | ← Rev. 2 →   +5 Wrapper is inherited from int, so it can be used as regular integer (but type(Wrapper(...) + Wrapper(...)) == int!). I do not know what would happen if you mix Wrappers and ints in same dict, but it can be combined with other types, like str, tuple, etc.
•  » » » » » 11 months ago, # ^ |   0 Ohh. Now I get it. type(Wrapper(i)) is different. Thanks! What does int.__init__(x) mean in init function in Wrapper Class.
•  » » » » » » 11 months ago, # ^ |   +5 Those two init lines don't do anything and should just be removed. They are nonsense.
 » 11 months ago, # |   +16 I can confirm this hash hack works in both PyPy2 and PyPy3. The hack is simple enough that any Python user should expect to be hacked using this.As for the work around. I personally think a better thing to do is to use from random RANDOM = random.randrange(2**62) def Wrapper(x): return x ^ RANDOM This is arguably not as nice, but it should run a lot faster/use less memory in PyPy than using a custom class. Also wrapping int fails for big integers in PyPy2, I'm not entirely sure as to why big ints are a problem.
•  » » 11 months ago, # ^ |   0 Are non-int (str, tuple, ...) keyed dictionaries safe against this type of hack?If not what would an efficient protection be?
•  » » » 11 months ago, # ^ |   +3 Hash of str is randomized, so it looks quite hard to make predictable collision. But I haven't yet researched how tuple hashes work, so they can be vulnerable.
•  » » » » 11 months ago, # ^ |   0 I see.Thanks.
 » 11 months ago, # |   0 Anyone who was here before me want to comment on how this went for unordered_map? Seems like a pretty short path from esoteric/specific hack that I might not even mind getting hit with individually to... thing that caused FSTs and complaints, and therefore a thing to be pre-empted by inclusion in pretests.I dunno. I grok and mostly agree with notions like "it is the responsibility of the competitor to know what they're using, even if choices are implied/constrained, they're still algorithmic decisions and therefore fair game" hence the 'not even mad' sentiment in the individualized scenario above. I guess maybe that masochism breaks down at the point where it's no longer up to the would-be hacker to execute the hack, and/or it only takes one successful hack to get replicated across the entire contest...?Think I answered my own question. Doooo we like it this way though? I don't know at what point something becomes a bad barrier to entry, feels subjectively like we tend to run towards that point wherever it is though...
•  » » 11 months ago, # ^ |   0 I have seen very few people hack unorederd_map, including me.In the last round, tests against unordered_map were included in the main testset and I belive it is well, because it gives beginners a lesson that they should not blindly rely on the tools of the language, but should be aware of how these tools work. Better that they find out about it in the form of a dropped task than in the form of a DoS attack on the service they will one day create.And if the creators of the round did not provide some types of tests, it can always be done by participants in the form of hacks.
 » 11 months ago, # | ← Rev. 4 →   0 Well, I would suggest this approach, but it slows down the solution, which is already a huge problem for python. import random RANDOM = random.randrange((1 << 31) - 1) class Dict(dict): def __setitem__(self, __k, __v): return super().__setitem__(__k ^ RANDOM, __v) def __delitem__(self, __v): return super().__delitem__(__v ^ RANDOM) def __getitem__(self, __k): return super().__getitem__(__k ^ RANDOM) def __contains__(self, __o: object) -> bool: return super().__contains__(__o ^ RANDOM) 
•  » » 11 months ago, # ^ |   0 I like the idea of offloading mitigations onto the collection/class instead of hoping I'd apply salt(key) in all the right places... could also include an override for d.keys() to return unsalted results (e.g. for sorting), maybe make defaultdict behavior the default as an extra... the hack makes Counter(input_array) an issue too so might as well customize for that too etc.
 » 11 months ago, # | ← Rev. 2 →   +15 Nice hack! Actually this hack can be solved by converting the number into string. Here is my new solution, only 20% slower than my original solution. 153489121I suggest every python user to look at this new post, otherwise your code may be hacked in the future since dictionary is used so frequently in the contest.
•  » » 10 months ago, # ^ | ← Rev. 2 →   0 some pypy64 results on that specific problem:c = Counter(sorted(alist)) worked, hacked original 156687344 and sorted version 156840227 both 202ms (looks artificially good because original solve already sorts downstream of that)directly-applied xor salt on both in and out comprehensions 156840897 265msstr in and int out 156841036 280msrandom.shuffle on input list before Counter slowest 156840377 343msnot tried (yet): various subclassing backflips
 » 9 months ago, # |   0 How does the "solution" help, since it keeps the collisions?hash(a) = hash(b) is the same as hash(a)^RANDOM=hash(b)^RANDOM..
•  » » 9 months ago, # ^ |   0 This solution is not prevent collisions itself, it prevent collision-chains like hash(x) = hash(a_1), f(hash(x)) = hash(a_2), f(f(hash(x))) = hash(a_3), .... Xor with RANDOM changes every element and broke the chain.