String hashing collision, multiple bases don't work, weird behavior with 1944D

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	161
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	152
8	SecondThread	147
9	orz	146
10	pajenegod	145

During Codeforces Round 934 (Div. 2) I submitted problem D hoping to get the green AC but unfortunately got the red WA (submission 251770538). I spent the rest of the time in the contest trying to figure out the problem in vain. After the contest, I checked the accepted solutions and found that they were exactly like mine which made me more confused.

I stress-tested my solution using a brute-force approach during the contest and using AC solutions after the contest time but couldn't find a single test case. Even after a million random test cases, my solution is still steadfast. I know that a few hundred test cases are enough.

My implementation of string hashing is inspired by Usaco.guide, which uses random bases and a fixed modulus (1LL << 61) -1). I used multiple bases to make sure no collisions happen, but this also didn't help.

Usaco.guide string hashing implementation

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;

// BeginCodeSnip{HashedString}
class HashedString {
  public:
	// change M and B if you want
	static const ll M = (1LL << 61) - 1;
	static const ll B;
	static __int128 mul(ll a, ll b) { return (__int128)a * b; }
	static ll mod_mul(ll a, ll b) { return mul(a, b) % M; }

  private:
	// pow[i] contains P^i % M
	static vector<ll> pow;

	// p_hash[i] is the hash of the first i characters of the given string
	vector<ll> p_hash;

  public:
	HashedString(const string &s) : p_hash(s.size() + 1) {
		while (pow.size() < s.size()) { pow.push_back(mod_mul(pow.back(), B)); }
		p_hash[0] = 0;
		for (int i = 0; i < s.size(); i++) {
			p_hash[i + 1] = (mul(p_hash[i], B) + s[i]) % M;
		}
	}

	ll get_hash(int start, int end) {
		ll raw_val =
		    p_hash[end + 1] - mod_mul(p_hash[start], pow[end - start + 1]);
		return (raw_val + M) % M;
	}
};
mt19937 rng((uint32_t)chrono::steady_clock::now().time_since_epoch().count());
vector<ll> HashedString::pow = {1};
const ll HashedString::B = uniform_int_distribution<ll>(0, M - 1)(rng);
// EndCodeSnip

const auto M = HashedString::M;
const auto B = HashedString::B;
const auto mul = HashedString::mul;
const auto mod_mul = HashedString::mod_mul;
ll inv(ll base, ll MOD) {
	ll ans = 1, expo = MOD - 2;
	while (expo) {
		if (expo & 1) { ans = mod_mul(ans, base); }
		expo >>= 1;
		base = mod_mul(base, base);
	}
	return ans;
}

string n, h;

int main() {
	cin.tie(0)->sync_with_stdio(0);
	cin >> n >> h;
	if (n.size() > h.size()) return cout << 0, 0;
	HashedString hs(h);

	set<ll> good;
	ll h_hsh = 1, n_hsh = 1;
	for (int i = 0; i < n.size(); i++) {
		// Compute product hashes
		h_hsh = mod_mul(h_hsh, B + h[i] - 'a');
		n_hsh = mod_mul(n_hsh, B + n[i] - 'a');
	}

	for (int i = n.size() - 1; i < h.size(); i++) {
		if (i >= n.size()) {
			// Update product hashes using modular inverse
			h_hsh = mod_mul(h_hsh, inv(B + h[i - n.size()] - 'a', M));
			h_hsh = mod_mul(h_hsh, B + h[i] - 'a');
		}
		if (n_hsh == h_hsh) { good.insert(hs.get_hash(i + 1 - n.size(), i)); }
	}

	cout << good.size() << '\n';
}

Unfortunately, I couldn't get AC during the contest but now I'm curious about this weird behavior of string hashing. Why using many random bases doesn't help and the solution is to use a prime modulus (i.e. $$$10^9 + 7$$$)?

Now the situation is more complex, I tried to submit a solution (submission 252010026) with only one base and $$$10^9 + 7$$$ as a modulus and got AC. Isn't the collision probability high for such a case?

Another thing to note is the inability to use GNU C++20 compiler caused all of this chaos because I couldn't use __int128 and (1LL << 61) - 1 as a modulus which I think will get AC.

Comments (26)

Write comment?

donot

6 weeks ago, # |

+107

You use modulo 2^30 — 1 which is not prime. It is the product of 3,3,7,11,31,151,331. This is bad. I will explain further, but you need a level of number theory understanding.

Here is a good blog: https://codeforces.com/blog/entry/100027

When can the string "aaaa...aaab" of length l and its reverse collide with itself, if the hash is taken modulo p?-

Approximately when b^l — 1 = 0 or b^l = 1 (mod p). Maybe off by 1 but doesn't really matter.

So when l is a multiple of the order of b modulo p. Taking l=p-1 will always work, and unless you get unlucky, the order will usually be this high.

So you can force a collsion by making l a multiple of p-1 for all those primes.

Unfortunately LCM of p-1 for the primes above is 1650. That is really small and therefore you go wrong quickly for a test of this type.

→ Reply

6 weeks ago, # ^ |

+79

By the way its a standard result that a^k-1 cannot be prime if k is composite. Since if d divides k then a^d-1 divides a^k-1.

So to sum up, you used a non prime modulus. In general, polynomial hashing can go wrong when the base has small order modulo whatever modulus you choose.

This will typically not happen when you choose a prime, since the order will be big. But here it was quite small.

Heap_OverFlow

← Rev. 2 →

-36

[deleted]

Dominater069

Collosiom probability for 1e9 + 7 and one base is : 10^-9

Over 2e5 queries, chance of being correct is (1 — 10^-9)^200000 and then another power of ~40 for number of test cases

It comes to around 99%, which is safe enough to get ac (however you should use 2 bases instead of 1 to not fall in that 1%)

Also, use random bases so that people cant hack your specific base.

VLamarca

So the main question here is why use modulo prime 10^9+7 instead of 2^64 for example

Even if I map letters a-z to 1-26 (not using 0) in a random order. Use multiple bases like 27, 29 and 31. I still get WA. And the solutions seems to be changing the modulo from 2^64 to 10^9+7

https://codeforces.com/blog/entry/4898

I always thought to myself “these bunch of nerds have an obsession with prime numbers. If I use a a prime base (29) and take it modulo 2^64 (simply use unsigned long long to avoid typing the % operations) it should be fine”

Guess I learned to respect the nerds

← Rev. 3 →

I think from a mathematical point of view these two cases are slightly different. Maybe someone has a better way of looking at it.

At a high level there is an issue with hashing when p^k = 1 (mod M) for small k: it becomes easy to create collisions by swapping characters 0 and k As I explained above this is the case for M=2^30-1.

With M=2^64 this condition is not violated, since the order of 3 mod M is 2^62. (A slightly less classic number theory exercise.)

The issue there is that it is significantly easier to create a string that hashes to 0 modulo a very composite number than a prime number (not too surprising).

Maybe in general the moral is that polynomial hashing is simple to understand, but can be weak in mathematically sophisticated ways.

-17

Well, to speak from another perspective, none of this math matters much. I would guess that there is a sufficiently small pair of strings in which there would be collision (for example even using two modulos like 1e9+7 and 1e9+9).

In practice what happens is you need to think about the meta on how easy is it to break these hashes and how hard working is the problem setter. Im pretty sure 1e9+7 can be broken and many people would be disappointed.

Since the input is not uniformly generated, any hash can be broken. (maybe not, but sort of). Well, should it? Not sure, but it is an interesting discussion.

dorijanlendvaj

+15

"Since the input is not uniformly generated, any hash can be broken."

Any fixed hash can be, but if you randomly generate a base(which you should be doing if hacking exists) it's probably impossible to hack such a solution with reasonable probability.

-10

By base, you mean the variable b here right:

"abc" = 1 + 2*b + 3*b^2

Well, how big can b be? usually it would be 27 (for a-b mapping to 1-26), or 29 if you want a prime number. Can it be around 10^5? At this point, I would think this would actually increase collisions.

Assuming 100 test cases per problem and 10^5 queries per problem, each one breaking a different base paired with the prime 10^9+7 and/or 10^9+9. Maybe randomly generating the base is not enough.

So maybe, you need to randomly generate the modulo too?

Next time I will just stick to the deterministic string algorithms :)

nifeshe

It absolutely does not matter how high you choose $$$b$$$ to be (unless something like $$$mod - 1$$$). People just tend to like smaller bases.

-15

Hmm maybe you are right. But even if it does not matter, with 10^7 cases and using 10^9+7 as modulo, that is 1% probability of failure I guess. (Assuming the problemsetter is an antihash maniac)

Based on the birthday paradox, if you have probability 1/M of collision between two hashes, you only need to hash sqrt(M) strings for the collision probability to be sizable.

The issue is you can use multiple bases or moduli together and the collision probability goes down exponentially.

If you use 4 moduli, then the collision probability becomes 1/M^4, which can be made around 10^-36 feasibly.

Then you don't fail (with high probability) on any reasonable input size. Such a tester could need to spend days/months/years to break each solution.

By the way, as I gain more experience, I've also started to prefer the deterministic algorithms, but hashing is really powerful. For example in this problem it could support character updates.

You can divide it by $$$10^9+7$$$ by using 2 bases.

In practice, yeah, maybe you right, this will work. But think about that: there are 26^(10^5) strings of size 10^5. There is probably a few strings that break all the combinations of pairs of bases even with pairs of modulos (10^9+7,10^9+9).

jeroenodb

+13

You can actually prove the following: Given that you choose a random base, and use a fixed prime number $$$P$$$, the probability of collision is $$$\leq \frac{n}{P}$$$, you can prove it with the fact that a polynomial can only have $$$n$$$ zeroes. This means, for any pair of strings with both having at most $$$10^5$$$ characters, when you choose the base randomly, there's only a collision probability $$$\frac{n}{P}$$$, with $$$n=10^5$$$. Notice that this probability is an upperbound, and less than in practice where you assume it is about $$$\frac{1}{P}$$$. Still, this shows that it is impossible for such a string to exist as you are saying.

I doubt that; just looking at one modulo, you'd need the sum of $$$q$$$ highest amounts of roots of polynomials made by the $$$n^4$$$ pairs of strings to be close to $$$10^9$$$ to hack every base, For $$$n=q=10^6$$$, I think it's reasonable to assume that the probability of that is lower than the probability of $$$q$$$ of those polynomials having at least $$$200$$$ roots. For each of those polynomials, the chance of that is about $$$10^{-375}$$$. Assuming all of the chances are independent(not necessarily true but I think it's pretty close to reality), there is about a $$$10^{-3.6*10^8}$$$ chance for a string to have that property, much lower than $$$\frac{1}{26^{10^5}}$$$.

That makes sense, thank you!

I would just reword the use of the word probability you used. Maybe saying that there are at most n different bases for which the hash can be broken for a given pair of strings of size n (random base and fixed modulo P).

I like to think that it is not a matter of probability, but a matter of counting how many bases can be broken at most, since the input is not randomly generated.

Oh wait, actually they do not have at most n roots. I forgot we are working under modulo P. Polynomials can have more than that.

I still think Im right, for a fixed prime P, assuming problemsetter guesses it. Even choosing 2 bases randomly might not be safe. (In practice is safe, unless problemsetter has imense computational power to find such string).

Actually, the number of roots is bounded by the degree of a polynomial in ANY field: Proof

Wow, didnt know that. Maybe I shouldnt be having math discussions if I dont know math. Thanks!

Rabeeeee

if we do doublehashing and have two fixed different bases with two different modulus, then is it still possible to hack it by taking the multiplication of those two fixed bases? if yes, then will a high value of base solve it? or we have to set random bases even in case of double hashing

With some advanced methods, whenever the number of bases and mods is small and all are constant, and there are a few of them (let's say less than $$$5$$$ or something, from what I've tried), hash collisions can be made really efficiently. For larger amount of bases and mods, it is still possible, but it gets more effort to make more sophisticated methods and more running time also.

+25

Sure, I agree it's enough to know "use prime modulo, random bases". Maybe you also want to know "use base with high order".

Not sure about your other points--I don't think there exists a way to break hash with 1e9+7 and random base. And I think it's fair to try to break such hashes if it were possible.

Nobody gives credit for a tree problem where the solution relies on the tree being O(log n) in height--you have to be correct in the worst case. So when you use a probabilistic technique, you want to be correct with high probability in the worst case. (Here, that means you need to use a random base.)

Igor_Parfenov

5 weeks ago, # |

251760926

I have interesting situation. I use two "random" but constant hashes, which passed. But it was hacked after the contest. How to create such hacks?

MuhammadSawalhy's blog