Diego's blog

By Diego, 3 months ago, translation, In English

TL;DR

If you code on python, use set or dict, and want to avoid hacks, you can read only the last section.

Introduction

Most people interested in hacking know about the method of creating tests targeting unordered containers in C++. This method is described in more detail in post. However, creation of similar tests for set and dict in python is covered much less well, which I decided to fix with this post.

A little theory about hash tables

I will not go into detailed descriptions and proofs, because there are plenty of them in the open sources. Let us consider only what interests us in this case.

An open-addressing hash table is a data structure that stores a set of elements in an array of knownly larger size, defining the position of the element as its hash taken modulo the size of the array. If hashes of several elements give the same cell in the array, the hash table tries to take another cell according to some rule, which depends on the particular implementation of the table. Usually this rule boils down to checking the cells $$$f(x), f(f(x)), f(f(f(x)))$$$ and so on, until an empty one is found. The function $$$f$$$ is often a linear transformation $$$(a * x + b) \% size$$$, where $$$size$$$ is the size of the array, and $$$a$$$ and $$$b$$$ are relatively prime with it.

Python

Python uses a hash table with an array size equal to a power of two to implement dict, and the transformation is slightly more complex than a simple linear one -- $$$f(x) = (5x + 1 + pertrub) \% size$$$, where $$$pertrub$$$ is initially equal to hash, but is devided by 32 in each step. For a detailed implementation, see repository.

Also, the hash function for numbers in python is very predictable, it's just the number itself.

Thus, to make a countertest it is enough to find a sequence of indexes that will be searched for a particular number and preoccupy them, and then provoke a search for that number in the dictionary, which is quite easy to do for most problems.

The set implementation in Python3 is a bit more complicated, it does not test a single cell, but 10 consecutive cells, which can be observed here. However, building a test in this case is not very difficult either, just add $$$x, x + 1, x + 2, ..., x + 9$$$ instead of one number $$$x$$$.

Tests

I used 153032429 (Thanks to turkids for it) and 153408991 from Codeforces Round #781 (Div. 2) to check. The result can be found as hacks number 796358 and 796362. Judging by the "unknown verdict" result, it went too well and I broke the author's solution at the same time. More details may know shishyando and Kirill22.

How to protect yourself?

Unlike C++, Python does not provide a way to define its own hash function for an existing type (or I just don't know about it), but nobody prevents you from defining your type with a different hash function:

from random import getrandbits

RANDOM = getrandbits(32)

class Wrapper(int):
    def __init__(self, x):
        int.__init__(x)
    def __hash__(self):
        return super(Wrapper, self).__hash__() ^ RANDOM

An example of using this type can be found in 153409562. Unfortunately, while the author's solution breaks on my tests I will not be able to test the robustness of my solution with the Wrapper class. However, locally it works fast enough.

 
 
 
 
  • Vote: I like it
  • +178
  • Vote: I do not like it

»
3 months ago, # |
  Vote: I like it 0 Vote: I do not like it

I can't view the hacks

Image
»
3 months ago, # |
  Vote: I like it 0 Vote: I do not like it

This is very interesting. Did Python developers use SipHash for string keys in order to defend against Hash DoS, but left integer keys unprotected?

from Python's configure.ac
  • »
    »
    3 months ago, # ^ |
    Rev. 4   Vote: I like it 0 Vote: I do not like it

    It's probably due to overhead on short keys. Rust, for example, uses SipHash-1-3 by default on all keys, however it does have a lot of overhead for short keys (like integer keys very common in competitive programming problems), e.g. compare:

    153429932 Siphash-1-3, 4694 ms

    150361213 custom hash function based on NASAM, 2167 ms

    • »
      »
      »
      3 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      There's also aHash available for Rust: https://github.com/tkaitchuck/aHash

      And an unfinished discussion here about its implementation details when AES instructions are not available: https://github.com/tkaitchuck/aHash/issues/106

      • »
        »
        »
        »
        3 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Yeah I've noticed it

        But there's no way to pull its crate for CF, and when I tried implementing it for myself it's much slower than my custom hash, at least on CF; perhaps it's not configured correctly to do hardware-accelerated AES

        • »
          »
          »
          »
          »
          3 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          This is interesting, because your write_u64 hasher function doesn't exactly look lightweight with 3 multiplications and a bunch of shifts/rotates. I suspect that the specialized code path for short keys in aHash fallback should do fewer calculations. But this needs to be confirmed. Also AES is another part of the puzzle. Codeforces hardware is modern enough to support AES.

          • »
            »
            »
            »
            »
            »
            3 months ago, # ^ |
            Rev. 3   Vote: I like it 0 Vote: I do not like it

            To be clear, what I tried to implement was the AES version (using unsafe to call the AES intrinsic), not the fallback. My function is probably slightly slower than the AHash fallback, however I am somewhat less confident about the fallback's statistical properties

            Edit: Also, only two of those multiplies are actually performed. The update to the dither (self.1) gets optimized away when the hash function only calls write_u64 once per key.

»
3 months ago, # |
  Vote: I like it +13 Vote: I do not like it

Also check this.

  • »
    »
    3 months ago, # ^ |
      Vote: I like it +19 Vote: I do not like it

    Thank you, I haven't seen it before. Also I tested my method for PyPy and it work as well as with cpython. Submissions: 153445268 and 153445556.

    • »
      »
      »
      3 months ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      Why is Wrapper(i)=i? I checked for a few integers and it gives the same value!! Can I use this method in dictionaries for two different data types?

      • »
        »
        »
        »
        3 months ago, # ^ |
        Rev. 2   Vote: I like it +5 Vote: I do not like it

        Wrapper is inherited from int, so it can be used as regular integer (but type(Wrapper(...) + Wrapper(...)) == int!). I do not know what would happen if you mix Wrappers and ints in same dict, but it can be combined with other types, like str, tuple, etc.

        • »
          »
          »
          »
          »
          3 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Ohh. Now I get it. type(Wrapper(i)) is different. Thanks! What does int.__init__(x) mean in init function in Wrapper Class.

          • »
            »
            »
            »
            »
            »
            3 months ago, # ^ |
              Vote: I like it +5 Vote: I do not like it

            Those two init lines don't do anything and should just be removed. They are nonsense.

»
3 months ago, # |
  Vote: I like it +16 Vote: I do not like it

I can confirm this hash hack works in both PyPy2 and PyPy3. The hack is simple enough that any Python user should expect to be hacked using this.

As for the work around. I personally think a better thing to do is to use

from random
RANDOM = random.randrange(2**62)

def Wrapper(x):
  return x ^ RANDOM

This is arguably not as nice, but it should run a lot faster/use less memory in PyPy than using a custom class. Also wrapping int fails for big integers in PyPy2, I'm not entirely sure as to why big ints are a problem.

  • »
    »
    3 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Are non-int (str, tuple, ...) keyed dictionaries safe against this type of hack?

    If not what would an efficient protection be?

    • »
      »
      »
      3 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      Hash of str is randomized, so it looks quite hard to make predictable collision. But I haven't yet researched how tuple hashes work, so they can be vulnerable.

»
3 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Anyone who was here before me want to comment on how this went for unordered_map? Seems like a pretty short path from esoteric/specific hack that I might not even mind getting hit with individually to... thing that caused FSTs and complaints, and therefore a thing to be pre-empted by inclusion in pretests.

I dunno. I grok and mostly agree with notions like "it is the responsibility of the competitor to know what they're using, even if choices are implied/constrained, they're still algorithmic decisions and therefore fair game" hence the 'not even mad' sentiment in the individualized scenario above. I guess maybe that masochism breaks down at the point where it's no longer up to the would-be hacker to execute the hack, and/or it only takes one successful hack to get replicated across the entire contest...?

Think I answered my own question. Doooo we like it this way though? I don't know at what point something becomes a bad barrier to entry, feels subjectively like we tend to run towards that point wherever it is though...

  • »
    »
    3 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I have seen very few people hack unorederd_map, including me.

    In the last round, tests against unordered_map were included in the main testset and I belive it is well, because it gives beginners a lesson that they should not blindly rely on the tools of the language, but should be aware of how these tools work. Better that they find out about it in the form of a dropped task than in the form of a DoS attack on the service they will one day create.

    And if the creators of the round did not provide some types of tests, it can always be done by participants in the form of hacks.

»
3 months ago, # |
Rev. 4   Vote: I like it 0 Vote: I do not like it

Well, I would suggest this approach, but it slows down the solution, which is already a huge problem for python.

import random

RANDOM = random.randrange((1 << 31) - 1)

class Dict(dict):
    def __setitem__(self, __k, __v):
        return super().__setitem__(__k ^ RANDOM, __v)

    def __delitem__(self, __v):
        return super().__delitem__(__v ^ RANDOM)

    def __getitem__(self, __k):
        return super().__getitem__(__k ^ RANDOM)

    def __contains__(self, __o: object) -> bool:
        return super().__contains__(__o ^ RANDOM)
  • »
    »
    3 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I like the idea of offloading mitigations onto the collection/class instead of hoping I'd apply salt(key) in all the right places... could also include an override for d.keys() to return unsalted results (e.g. for sorting), maybe make defaultdict behavior the default as an extra... the hack makes Counter(input_array) an issue too so might as well customize for that too etc.

»
3 months ago, # |
Rev. 2   Vote: I like it +15 Vote: I do not like it

Nice hack! Actually this hack can be solved by converting the number into string. Here is my new solution, only 20% slower than my original solution. 153489121

I suggest every python user to look at this new post, otherwise your code may be hacked in the future since dictionary is used so frequently in the contest.

»
3 months ago, # |
  Vote: I like it +7 Vote: I do not like it

I opened up an issue about this over at PyPy. link

»
3 months ago, # |
  Vote: I like it 0 Vote: I do not like it

This means, Every solution that I constructed using either defaultdict or set can be hacked, with a simple enough test?
Imma consider leaving python!!

»
2 months ago, # |
Rev. 2   Vote: I like it -16 Vote: I do not like it

Welp, looks like someone finally pulled the trigger, on a div4. Bravo, hope we're proud of ourselves :P

If anyone can find an actual CVE for this (plenty for the strings case before it was addressed in 2012), I'd be less quick to dismiss this as cp leaning into its lack of real-world relevance... but for now, I'm at "c'mon, duuude... do we really need yet more reasons to be salty?" I'm only assuming this was done by a self-loathing python user, because who else would bother?

I guess my only other thought is that for cses at least, hacks do get applied individually but it's not automatic/guaranteed that they'll get added to system tests. I understand that there are scaling issues in filtering like that, especially with a small admin team and pool of setters/coordinators, but perhaps something can still be said on a competitive level towards making the hacker actually target/execute the hack... but then again, this happened for free with likely no rating reward, so the only net outcome is a downward clubbing of some div4 baby seals... so maybe there isn't a 'game mechanic' way to make this less stupid, I dunno.

Future readers: mitigations are enumerated above, str transform, randomized salt, etc. just beware if you also need sorted results (ie key the sort on original keys). You can also rearrange the input elements if their order doesn't matter to the problem (although short of a full/slow random shuffle it's probably still hackable on an individual level).

fwiw 1676F - Longest Strike is the perfect test of this (even if I wonder if the author meant 'streak' instead of 'strike').

  • »
    »
    2 months ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    some pypy64 results on that specific problem:

    c = Counter(sorted(alist)) worked, hacked original 156687344 and sorted version 156840227 both 202ms (looks artificially good because original solve already sorts downstream of that)

    directly-applied xor salt on both in and out comprehensions 156840897 265ms

    str in and int out 156841036 280ms

    random.shuffle on input list before Counter slowest 156840377 343ms

    not tried (yet): various subclassing backflips

»
3 weeks ago, # |
  Vote: I like it 0 Vote: I do not like it

It happened again in a div 4 problem H (Gambling)

»
3 weeks ago, # |
  Vote: I like it 0 Vote: I do not like it

can we just change int to string while storing it as key in dictionary... ? this can fix this error..

»
3 weeks ago, # |
  Vote: I like it 0 Vote: I do not like it

How does the "solution" help, since it keeps the collisions?

hash(a) = hash(b) is the same as hash(a)^RANDOM=hash(b)^RANDOM..

  • »
    »
    3 weeks ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    This solution is not prevent collisions itself, it prevent collision-chains like hash(x) = hash(a_1), f(hash(x)) = hash(a_2), f(f(hash(x))) = hash(a_3), .... Xor with RANDOM changes every element and broke the chain.