C++ has always had the convenient data structures `std::set`

and `std::map`

, which are tree data structures whose operations take time. With C++11, we finally received a hash set and hash map in `std::unordered_set`

and `std::unordered_map`

. Unfortunately, I've seen a lot of people on Codeforces get hacked or fail system tests when using these. In this post I'll explain how it's possible to break these data structures and what you can do in order to continue using your favorite hash maps without worrying about being hacked .

So how are they hackable? We always assume hash maps are *O*(1) per operation (insert, erase, access, etc.). But this depends on a key assumption, which is that each item *only runs into O(1) collisions on average*. If our input data is completely random, this is a reasonable assumption. But this is no longer a safe bet when the input isn't random, especially so if someone is adversarially designing inputs to our code (a.k.a. hacking phase). In particular, if they know our hash function, they can easily generate a large number of different inputs that all collide, thus causing an

*O*(

*n*

^{2}) blow-up.

We'll prove that now by blowing up `unordered_map`

. In order to do that, we first have to determine exactly how it's implemented. For this we can dig into gcc's implementation on GitHub: https://github.com/gcc-mirror/gcc.

After some searching around we run into unordered_map.h. Inside the file we can quickly see that `unordered_map`

makes use of `__detail::_Mod_range_hashing`

and `__detail::_Prime_rehash_policy`

. From this we can guess that the map first hashes the input value and then mods by a prime number, and the result is used as the appropriate position in the hash table.

Some further searching for `_Prime_rehash_policy`

leads us to hashtable_c++0x.cc. Here we can see that there is an array called `__prime_list`

, and the hash table has a policy to resize itself when it gets too large. So we just need to find this list of primes.

The one include on this file leads us to hashtable-aux.cc. Aha, here is the list we're looking for.

One more thing: we need to know the hash function `unordered_map`

uses before modding by these primes. It turns out to be quite simple: the map uses `std::hash`

, which for integers is simply the identity function. Armed with this knowledge, we can insert lots of multiples of one of these primes to the map in order to get *n* ^{2} blow-up. Not all of the primes work though, due to the resizing policy of the map; in order for a prime to work, we need the map to actually resize to this prime at some point in its set of operations. It turns out the right prime depends on the compiler version: for gcc 6 or earlier, 126271 does the job, and for gcc 7 or later, 107897 will work. Run the code below in Custom Invocation and see what output you get.

```
#include <ctime>
#include <iostream>
#include <unordered_map>
using namespace std;
const int N = 2e5;
void insert_numbers(long long x) {
clock_t begin = clock();
unordered_map<long long, int> numbers;
for (int i = 1; i <= N; i++)
numbers[i * x] = i;
long long sum = 0;
for (auto &entry : numbers)
sum += (entry.first / x) * entry.second;
printf("x = %lld: %.3lf seconds, sum = %lld\n", x, (double) (clock() - begin) / CLOCKS_PER_SEC, sum);
}
int main() {
insert_numbers(107897);
insert_numbers(126271);
}
```

Depending on which compiler version you are using, one of these two numbers will take much longer than the other. There are several other primes that also work; try some more for yourself!

Note that for other hash tables like `cc_hash_table`

or `gp_hash_table`

(see Chilli's helpful post), it's even easier to hack them. These hash tables use a modulo power of two policy, so in order to make a lot of collisions occur we can simply insert a lot of numbers that are equivalent, say, modulo 2^{16}.

### Don't want to be hacked?

Let's look at how to safeguard these hash maps from collision attacks. To do this we can write our own custom hash function which we give to the `unordered_map`

(or `gp_hash_table`

, etc.). The standard hash function looks something like this:

```
struct custom_hash {
size_t operator()(uint64_t x) const {
return x;
}
};
```

However as we mentioned, any predictable / deterministic hash function can be reverse-engineered to produce a large number of collisions, so the first thing we should do is add some non-determinism (via high-precision clock) to make it more difficult to hack:

```
struct custom_hash {
size_t operator()(uint64_t x) const {
static const uint64_t FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count();
return x + FIXED_RANDOM;
}
};
```

See my post on making randomized solutions unhackable for more details. Awesome, so our hash is perfectly safe now, right? Not so fast. All we've done is add the same fixed number to every input to the function. But if two numbers a and b satisfy a = b (mod m), then a + x = b + x (mod m) for every x as well. Similar problems occur for other very simple hash functions: multiplying by a random large odd number (and overflowing mod 2^{64}) is likely effectively modulo p, but will be problematic for `gp_hash_table`

's power of two policy; the same situation occurs for xor-ing with a random number.

A slightly better hash function like the following may look enticing:

```
struct custom_hash {
size_t operator()(uint64_t x) const {
static const uint64_t FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count();
x ^= FIXED_RANDOM;
return x ^ (x >> 16);
}
};
```

However, if you are using a `gp_hash_table`

this actually still leaves you susceptible to hacks from a strong enough adversary. In particular, after inserting the numbers `(1 << 16) + 1`

, `(2 << 16) + 2`

, `(3 << 16) + 3`

, ..., into this hash table, all of the outputs will be equivalent modulo 2^{16}. (Do you see why?)

So we want a better hash function, ideally one where changing any input bit results in a 50-50 chance to change any output bit. Note for example that in the hash function `x + FIXED_RANDOM`

, this property is not satisfied at all; for example, changing a higher bit in `x`

results in a 0% chance of changing a lower bit of the output.

Personally, I like to use splitmix64, which is extremely high-quality and fast; credit goes to Sebastiano Vigna for designing it. Adding all this together, we have our safe custom hash function:

```
struct custom_hash {
static uint64_t splitmix64(uint64_t x) {
// http://xorshift.di.unimi.it/splitmix64.c
x += 0x9e3779b97f4a7c15;
x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
return x ^ (x >> 31);
}
size_t operator()(uint64_t x) const {
static const uint64_t FIXED_RANDOM = chrono::steady_clock::now().time_since_epoch().count();
return splitmix64(x + FIXED_RANDOM);
}
};
```

Now we can simply define our `unordered_map`

or our `gp_hash_table`

as follows:

```
unordered_map<long long, int, custom_hash> safe_map;
gp_hash_table<long long, int, custom_hash> safe_hash_table;
```

Once we use these in our program above, it runs very quickly:

```
x = 107897: 0.035 seconds, sum = 2666686666700000
x = 126271: 0.031 seconds, sum = 2666686666700000
=====
Used: 109 ms, 9204 KB
```

Added to my snippets ;). Amazing Work.

c++ 17 when set with same key has size larger than 8 it will use RBT to store data.

Are you saying

`unordered_set`

transitions to using red-black tree when it encounters 8 collisions in the same location? This isn't true. In the code snippet I posted above,`insert_numbers(107897)`

in G++17 takes about as long as`insert_numbers(126271)`

in G++14.Oh, i am wrong,it was implemented in jdk1.8.

Do you have a link to where it says this? I'm interested in reading the documentation for it.

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/HashMap.java

@neal Thanks for the blog and can you help me know what is the exact difference between set and a Map in C++.

The only difference which I am aware of is that the set is a map having key === value.

Is there any other difference between set and Map?

If no then why set has been made if we already have a Map with us?

I will really appreciate If you could answer my query.

Yes very useful blog

Really very useful for me

How convenient! Thanks a lot!

very cool. keep it up!

Cool! I'm curious how many people actually do anti-hashing hacks in contest. Could you put the standard unordered_map runtimes on the inputs to use as comparisons to the benchmarks you put at the end? Or does it simply take way too much time to even record?

The thing about this specific hack is that if anyone successfully makes this hack on anyone else in the contest, their test will be added to system tests which will leave you in trouble. A few examples of recent problems where you can fail for using unprotected

`unordered_map`

include 1027F - Session in BSU and 1039C - Network Safety.As far as runtime, it gets a bit slower with the custom hash but not too much. Pure

`unordered_map`

gives anywhere between 0.00s and 0.04s on non-adversarial cases when running with Custom Invocation, vs. 0.03s with custom hash. If you're concerned with speed then`gp_hash_table`

with the custom hash is the way to go, since it uses power of two modding and linear probing rather than prime modding and collision chaining.Oh, I wasn't that concerned about the speed of your custom hash. I was curious about the speed of

`std::unordered_map`

on the adversarial case that you've created. However, reading it more closely, you haveN= 10^{5}, so if it really is causing anO(n^{2}) blowup on`std::unordered_map`

, then it's probably too slow to bother recording the time.Ah. Run the code from the post in Custom Invocation :)

I like

`(uintptr_t)main`

. :) This pointer should be random for every run because of OS security issue.Interesting idea. Unfortunately when I tried it on Codeforces just now, it gave the same result every time. :(

Really!? That's too sad. T_T It worked for most other OJ platforms...

You may try

`new char`

Nicely written, as always. Thanks!

Whenever someone talks about hacking hashmaps, I think of this problem: https://ipsc.ksp.sk/2014/real/problems/h.html

Thanks for this helpful blog. For completeness, it should be noted that the last definition

`gp_hash_table<long long, int, custom_hash> safe_hash_table;`

should be preceded by

The following is a slight update to your test program.

Update

How does it compare with alternating max_load_factor of the hash table? because it is runs slower as compared to this trick (Arpa's Blog):

this gives output of:

This doesn't make it unhackable, it just changes the prime number that breaks it. Try calling

`insert_numbers(1056323);`

instead:I am not sure I understand how it "only" changes the prime number because according to the code, you are inserting numbers with same modulo wrt the prime. Running on equal modulo numbers with:

which gives output of:

Also

`reserve`

must change according to the elements to be inserted (upper bound to be a power of two). Maybe it's because of rehash scheme when`max_load_factor`

is achieved in the bucket under consideration.When you call

`.reserve()`

you are changing the internal capacity of the map, which means you are effectively changing the internal prime number modulo it uses out of this list.So yes if you change the capacity again, it will work well on the previous prime number I gave you, but there will be a new number in the list that is problematic.

What if i need unordered_map<pair<int,int> , int> mp; here first is pair . What if more complex such as use (1,2,3,4) as first , i meant for struct data type first .

Read the comment right below this one. https://codeforces.com/blog/entry/62393?#comment-464775

What if x=a^(b>>1) overflow unsigned long long.

Pair is for two numbers a,b. If i want (a,b,c,d) as index. Then what will be the value of x?

like

`splitmix64`

is there a good hash function for pairs too?If you have a pair of integers you'd like to hash, you can use the custom hash function above on each of them to get two values

`a`

and`b`

. Then combine them in any way you like, e.g.,`a + b`

.The one issue with

`a + b`

is that swapping the two elements of the pair will lead to the same hash value. This isn't a problem from a theory point of view since "O(1) collisions on average" is still valid, but to avoid this situation you can switch to a non-symmetric function such as`3 * a + b`

or`a ^ (b >> 1)`

.And how would you go about using unordered_set with strings as keys?

Here is an idea to use a random seed in the MurmurHashUnaligned2 which is the hash function that C++ uses by default for hashing strings: https://stackoverflow.com/a/34976823/10017885 although here it is written that even with using a randomized seed MurmurHash can be hacked: https://en.wikipedia.org/wiki/MurmurHash#Vulnerabilities

sha256(constant random string + desired string) --> never hacked again

But I doubt anyone would care enough to hack your murmurhash solution, if you ever used it.

And what fuction would you recommend for hashing ints?

Read the post.

In your post you provide a function for hashing long longs and I am interested in a good function for hashing ints.

Well, I suppose the same function would also work but maybe for ints we could have a function that is faster and also works.

Using the idea in neal's comment

Man, I envy that you can type that. Much enjoyment.

Thanks for this blog, neal. Can you recommend a fast hash function that is not difficult to remember (for

`gp_hash_table`

)?Can

`xorshift64`

be such function?UPD. Okay, this is bad hash function.Next code:

returns

`cnt = 4095`

.UPD2.I got idea about calculation polinomial hash from s, where`x = s[0]+(s[1]<<16)+(s[2]<<32)+(s[3]<<48)`

. Tested it and it is fast.very helpful !!

you write very good and you need just another blog like this one to be in "Top contributors List".

neal Why use size_t as the return value of operator(), why not int64_t, does it affect the performance of functions

Why does this code take more than 2 seconds in custom invocation with C++17, while the same code with the 1e6 replaced by 1e9 takes less than 100 ms?(also, replacing 1e6 by 1e5 makes the running time over 10 seconds)

Good question. This is actually quite tricky. It's because the default hash function returns a

`size_t`

, and on Codeforces`size_t`

is a 32-bit integer. This means that multiplying by an integer up to 1e9 actually overflows 32 bits when hashed and ends up with a number that is no longer a multiple of our prime.I didn't get it. So if I have an array like [1,1,1,1......,1], your hash function is not deterministic because hash(1) != hash(1) because it uses some FIXED_RANDOM. If the FIXED_RANDOM would be the same for all numbers, then I think we are the begining. Correct me if I am wrong. Thanks.

FIXED_RANDOM is fixed because is constant and static, only the first time get a new value. Then hash(1)==hash(1) but its value might be different for each program execution

Oh, If I see this entry before, then I never got

TLEfor 1234B2 - Social Network (hard version). Here is my submission which gotTLE^{2000 ms+}After changing the hash function as you provided, then itAccepted^{202 ms}SummaryProblem : 1234B2 - Social Network (hard version)

unordered_map :

TLE^{2000 ms+}unordered_map (_custom_hash_) :

AC^{202 ms}map :

AC^{342 ms}Just wanted to ask this, that for largest value possible in long long int x, this x += 0x9e3779b97f4a7c15 expression will overflow bounds of uint64.

Also the argument for hash requires unsigned int64 value, but if we have negative numbers to hash too, then what happens.

Thanks.

Thanks a lot for this post! It was in detail and very useful..

Absolutely beautiful!

What a gem! Thanks, Neal!

Thank you

I have wondered about this and it is nice to see it proved.

Credit to the C++ language designers who made it super easy to specify a custom hash function.

Nice blog !!

Hey Nice Blog, But I have a doubt.... I have submitted same code(both have your custom_hash).

Problem : Social Network

My Solutions : unordered_map , unordered_set

I have a doubt that, i am getting TLE while using custom_hash with unordered set, but got ac while using same custom hash in unordered map. is there any reason for this? does your custom hash works faster on map than set or anything else?

It's not the custom hash. Your

`exist`

function passes the entire set by value instead of by reference.Got it !! And sorry for asking wrong question.

btw, thanks got ac by making it refernce.

Can you please send the improved code?

means??

Can we use this custom hash in unordered set as well??

yes, u can see my code.

Can u please give link to your code

Thank a lot for lending your time.

Using map<vector,int> mp; gives me TLE.

I want to use Unordered_map to avoid TLE.

How can i make it? Any help is appreciated. Thanks.

Question : 1225D - Power Products

My Submission : 77023902

neal Please help

The complexity of your program with map<vector,int> is $$$O(n^2)$$$, assuming that $$$a_i \leq n$$$. Using an unordered_map will just remove a log factor, try improving your complexity by more than that.

testuser@Mohammads-MacBook-Air helloworld % ./helloworld

x = 107897: 0.111 seconds, sum = 2666686666700000

x = 126271: 0.099 seconds, sum = 2666686666700000

Using Clang++17 on VS Code Macbook Air. I didn't find an issue, is there an equivalent here?

@neal Thanks for the blog and can you help me know what is the exact difference between set and a Map in C++.

The only difference which I am aware of is that the set is a map having key === value.

Is there any other difference between set and Map?

If no then why set has been made if we already have a Map with us?

I will really appreciate If you could answer my query.

size_t is 32 bit in 32 bit compilers. Is using 64 bit hash function splitmix64 good then? Or do you know any better hash function for 32 bit?

I think this comment on stack overflow is quite great.

https://stackoverflow.com/a/12996028/4275047