Zlobober's blog

By Zlobober, 2 years ago, translation, In English

Hi everyone! I want to share an interesting fact about how we can challenge almost every solution that uses polynomial hashes modulo 2^64. We'll hack any solution regardless of its base (!); the only thing we need is that it uses int64 with overflow, the way many coders write hashing.

Keywords: Only today, only for you, ladies and gentlemen: we're gonna challenge Petr's solution in problem 7D - Palindrome Degree from Codeforces Beta Round #7!

Interested? Read on after the cut. First, for the most impatient of you, here's the source of the generator:

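#include <cstdio>

const int Q = 11;
const int N = 1 << Q;

char S[N + 1]; // one extra byte keeps the string null-terminated

int main() {
    for (int i = 0; i < N; i++)
        // __builtin_popcount (a g++ builtin) returns the number of
        // ones in the binary representation of i
        S[i] = 'A' + __builtin_popcount(i) % 2;
    puts(S);
}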

Let's try the solutions of two CFBR #7 winners on this test: Petr Mitrichev's and Vlad Epifanov's. vepifanov's solution doesn't use hashing, so it works correctly: the answer is 6. But Petr's solution returns 8. After a little thought it becomes clear that the answer on such a test for this problem is always ⌊Q / 2⌋ + 1, so Vlad is 100% correct.

Moreover, if we take Q = 20, Vlad's solution returns the correct answer 11, but Petr's returns 2055, which is obviously wrong :-)

I'll prove that, starting from Q = 11, there are lots of collisions in this string.

Let's look at that string. How was it formed? It begins like this:

ABBABAABBAABABBABAABABBAABBABAABBAABABBAABBABAABABBABAABBAABABBA...

It can be formed by the recurrent rule S -> S + (not S), starting from S = "A", where (not S) denotes the string S with every A changed to B and vice versa. Let's denote the string for a fixed Q by $S_Q$.
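A minimal sketch of the two constructions side by side, to make the equivalence concrete. The function names `build_by_doubling` and `build_by_popcount` are hypothetical, chosen only for this illustration, and `__builtin_popcount` assumes g++ or clang.

```cpp
#include <cassert>
#include <string>

// Builds the string by the doubling rule S -> S + not(S), starting from "A".
std::string build_by_doubling(int q) {
    assert(q >= 0 && q < 31);
    std::string s = "A";
    for (int step = 0; step < q; ++step) {
        std::string flipped = s;
        for (char &c : flipped) c = (c == 'A') ? 'B' : 'A';
        s += flipped;
    }
    return s;
}

// Builds the same string from the bit-parity rule used by the generator.
std::string build_by_popcount(int q) {
    assert(q >= 0 && q < 31);
    std::string s(1u << q, 'A');
    for (unsigned i = 0; i < s.size(); ++i)
        s[i] = 'A' + __builtin_popcount(i) % 2;
    return s;
}
```

For example, `build_by_doubling(3)` yields `ABBABAAB`, the first eight characters of the string shown above, and both functions return the same string for any `q` in range.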

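Let's recall what a polynomial hash is. It's a function of a string S of length l, equal to $\mathrm{hash}(S) = \sum_{i=0}^{l-1} S[i] \cdot P^i$ (computed with overflow, that is, modulo $2^{64}$), where P is some odd number.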
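A minimal sketch of such a hash, assuming the coefficient of $P^i$ is the character at position i. It uses `uint64_t` rather than a signed int64, because unsigned wraparound is well defined in C++ and gives exactly the arithmetic modulo 2^64; the name `poly_hash` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// hash(S) = sum of S[i] * P^i, computed in uint64_t so that every
// operation is implicitly reduced modulo 2^64.
std::uint64_t poly_hash(const std::string &s, std::uint64_t p) {
    assert(p % 2 == 1);  // the article assumes an odd base
    std::uint64_t h = 0, power = 1;
    for (unsigned char c : s) {
        h += c * power;  // add S[i] * P^i
        power *= p;      // advance to P^(i+1)
    }
    return h;
}
```

For instance, `poly_hash("AB", 3)` is 65 + 66 * 3 = 263.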

I claim that $\mathrm{hash}(S[0 \ldots 2^k - 1])$ will be equal to $\mathrm{hash}(S[2^k \ldots 2^{k+1} - 1])$ for some sufficiently small k.

That means that for Q = 10, hash($S_Q$) = hash(not $S_Q$). That's very cool, because $S_Q$ and not $S_Q$ appear many, many times in the higher-order strings thanks to the recurrent rule.
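The claim can be checked directly. The sketch below assumes the hash form $\sum S[i] \cdot P^i$ with unsigned 64-bit wraparound, and compares the two halves of the test string; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// The test string of length 2^q: 'A' or 'B' by the parity of popcount(i).
std::string anti_hash_string(int q) {
    assert(q >= 1 && q < 31);
    std::string s(1u << q, 'A');
    for (unsigned i = 0; i < s.size(); ++i)
        s[i] = 'A' + __builtin_popcount(i) % 2;
    return s;
}

// Polynomial hash with implicit reduction modulo 2^64 (unsigned wraparound).
std::uint64_t poly_hash(const std::string &s, std::uint64_t p) {
    std::uint64_t h = 0, power = 1;
    for (unsigned char c : s) {
        h += c * power;
        power *= p;
    }
    return h;
}

// True if the first and second halves of the length-2^q string, which are
// different strings, nevertheless get the same hash for base p.
bool halves_collide(int q, std::uint64_t p) {
    std::string s = anti_hash_string(q);
    std::size_t half = s.size() / 2;
    return poly_hash(s.substr(0, half), p) == poly_hash(s.substr(half), p);
}
```

With q = 12 the halves collide for every odd base tried here (3, 31, 131, 1000003), while for a short string such as q = 3 with base 31 they do not.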

Let's see what the condition hash($S_Q$) = hash(not $S_Q$) means. First, we can use zeros and ones as coefficients instead of ord('A') and ord('B'): the common term ord('A') · ($P^0 + P^1 + \ldots$) can simply be subtracted from both sides of the equation.

What is hash(not $S_Q$) - hash($S_Q$)? It's easy to see that it's

$T = P^0 - P^1 - P^2 + P^3 - P^4 + P^5 + P^6 - P^7 \ldots \pm P^{2^Q - 1}$

This is just a sum of powers of P with signs that follow the same ABBABAAB... rule.

Let's successively factor this expression:

$T = (P^1 - 1)(-P^0 + P^2 + P^4 - P^6 + P^8 - P^{10} - P^{12} + P^{14} \ldots) =$
$= (P^1 - 1)(P^2 - 1)(P^0 - P^4 - P^8 + P^{12} - P^{16} \ldots) = \ldots = (P^1 - 1)(P^2 - 1)(P^4 - 1) \ldots (P^{2^{Q-1}} - 1).$

(It may be multiplied by -1, but that doesn't matter.)
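The factorization can be sanity-checked numerically. The sketch below writes each factor as $(1 - P^{2^j})$ so that no extra sign appears, and evaluates both sides modulo 2^64; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Left side: sum over i < 2^q of (-1)^popcount(i) * P^i, modulo 2^64.
std::uint64_t signed_power_sum(int q, std::uint64_t p) {
    assert(q >= 0 && q <= 20);
    std::uint64_t sum = 0, power = 1;
    for (std::uint32_t i = 0; i < (1u << q); ++i) {
        if (__builtin_popcount(i) % 2 == 0) sum += power;
        else sum -= power;
        power *= p;
    }
    return sum;
}

// Right side: product over j < q of (1 - P^(2^j)), modulo 2^64.
std::uint64_t factored_product(int q, std::uint64_t p) {
    assert(q >= 0 && q <= 20);
    std::uint64_t product = 1, power = p;  // power holds P^(2^j)
    for (int j = 0; j < q; ++j) {
        product *= 1 - power;
        power *= power;  // squaring moves from P^(2^j) to P^(2^(j+1))
    }
    return product;
}
```

For q = 2 and P = 2 both sides give 1 - 2 - 4 + 8 = (1 - 2)(1 - 4) = 3. Since the identity is a polynomial one, the two functions agree for any q and P.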

Now the main point: this value becomes zero modulo 2^64 very, very fast!

Let's work out the maximal power of 2 that divides this value. Let's look at each of the Q - 1 factors. The (i + 1)-st factor $P^{2^{i+1}} - 1 = (P^{2^i} - 1)(P^{2^i} + 1)$ is divisible by the i-th factor and by the even number $P^{2^i} + 1$. That means that if the i-th bracket is divisible by $2^r$, then the (i + 1)-st is divisible by at least $2^{r+1}$.

So $(P^1 - 1)(P^2 - 1)(P^4 - 1) \ldots (P^{2^{Q-1}} - 1)$ is divisible by at least $2 \cdot 2^2 \cdot 2^3 \cdot \ldots = 2^{Q(Q-1)/2}$. That means it's enough to take Q >= 12. Congratulations, this is an anti-hash test!
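To see the powers of two accumulate, one can sum the number of trailing zero bits of each factor. This is a sketch only: it works modulo 2^64, so each per-factor count is exact only while it stays below 64, and `valuation` and `total_valuation` are hypothetical names. The count of 11 factors used in the example is my reading of the Q = 12 case.

```cpp
#include <cassert>
#include <cstdint>

// Number of trailing zero bits of x modulo 2^64, capped at 64 for x == 0.
int valuation(std::uint64_t x) {
    return x == 0 ? 64 : __builtin_ctzll(x);
}

// Sums the powers of two dividing each factor P^(2^j) - 1, j = 0 .. count-1.
int total_valuation(int count, std::uint64_t p) {
    assert(count >= 0 && p % 2 == 1);
    int total = 0;
    std::uint64_t power = p;  // power holds P^(2^j) modulo 2^64
    for (int j = 0; j < count; ++j) {
        total += valuation(power - 1);
        power *= power;
    }
    return total;
}
```

For P = 3 and 11 factors the per-factor counts are 1, 3, 4, ..., 12, for a total of 76, comfortably above both 64 and the bound 12 · 11 / 2 = 66 from the text. The real counts are a little larger than the conservative estimate because $P^2 - 1$ is always divisible by 8 for odd P.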

That's why the test is so short compared with the modulus 2^64. In general, the anti-test has size of order $2^{\sqrt{2x}}$ if we use an x-bit data type, since we need Q(Q - 1) / 2 ≥ x.

Main idea: don't rely on overflow when computing hashes unless you are confident that no test contains this ABBABAABBAABABBA... string.

How did I get that? This test was first used in 2003 at the St. Petersburg school programming contest, in the problem cubes (Russian statements here). The problem was later used in the SIS problemset. Many generations of students got WA27 on it when submitting a hash solution. I was one of them, and nobody could explain to me why it was WA for any hash base. burunduk1 looked at that test for a bit, but couldn't explain it to me either. I have remembered that problem ever since.

And now I've decided to think a bit and understand, what's happening in that test. burunduk1 offered to post it on CF, so here it is :-)

I tried to find any information about anti-hash tests on the web and googled a lot, but couldn't find anything. Does anybody know anything else, maybe some papers? Maybe I'm just not that good at googling?

 
 
 
 

 
»
2 years ago, # |

In POI19 http://main.edu.pl/en/archive/oi/19/pre ... It also contains such a case ...

  •  
    »
    »
    21 month(s) ago, # ^ |

    We found that case by accident during that POI :) We actually submitted a short paper about this to the next IOI journal.

 
»
10 months ago, # |

Sorry to revive this, but I use hashing a lot in online contests and have some questions :).

In short, the way to avoid being targeted by this anti-hash test is to use a different mod (not a power of 2)? For instance, mod 10^9+7. But that mod is kind of small, and with around 50k random strings it is really easy to get a collision. Using a larger mod requires implementing a function to multiply longs, and that function adds unwanted overhead that might easily make the solution time out on some problems (does anyone know an efficient way of doing it? I can only think of an O(bits) way, where bits is the number of bits of our mod, something like this: http://pastebin.com/r5Af5zfp).

Another possibility is using two mods (that fit in an int type) and creating a pair of hash values. My question is: is that enough? I mean, enough for online contests where there won't be more than 10^8 string hashes. In some empirical tests I couldn't find a hash collision for that; my brute-force test code is here: http://pastebin.com/iaVp6ypH (be careful running it, it can eat your computer's memory pretty fast; adjust the LIMIT variable for your purposes). Thanks!

  •  
    »
    »
    10 months ago, # ^ |

    What you say is right. A hash value of order 10^9 + 7 really can cause collisions. But it's useful to understand exactly why a collision can happen.

    Imagine a problem where you are asked to check $10^5$ pairs of substrings of a big string for equality. Suppose that the answer to each query is in fact "NO". Let's assume that, because of the big length, the hash of each substring is a random value uniformly distributed over the segment $[0 .. 10^9 + 6]$, independent of the hashes of the other strings. Then the probability that a check suddenly gives us "YES" instead of "NO" is almost the same as the probability that two uniformly distributed values from that segment are equal. This probability is $10^{-9}$. The probability that none of the $10^5$ checks fails due to a collision is $(1 - 10^{-9})^{10^5} \approx 1 - 10^{-4}$, which is really close to 1. On a test set consisting of 100 tests, the probability that you'll receive "Accepted" is $(1 - 10^{-4})^{100} \approx 99\%$. That's a pretty high probability, so using a single hash modulo $10^9 + 7$ is OK for that type of problem.
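The arithmetic in this estimate can be reproduced with a few lines. This is a sketch under the same independence assumption as the paragraph above; `all_checks_pass` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Probability that `checks` independent comparisons all avoid a false
// "equal", when each one collides with probability 1 / modulus.
double all_checks_pass(double checks, double modulus) {
    assert(checks >= 0 && modulus > 1);
    // (1 - 1/modulus)^checks, evaluated through logarithms for accuracy.
    return std::exp(checks * std::log1p(-1.0 / modulus));
}
```

`all_checks_pass(1e5, 1e9 + 7)` is about 0.9999 for one test, and raising it to the 100th power with `std::pow` gives about 0.99 for a set of 100 tests.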

    Another type of problem involves checks like "check that the hashes of $10^5$ substrings are pairwise different". For example, suppose we count the number of different substrings of length $5 \cdot 10^4$ in a string of length $10^5$, and they are all in fact different. Once again, we can assume that the hash of each of those substrings is a random value uniformly distributed over the segment $[0 .. 10^9 + 6]$, independent of the others. So now we need to calculate another probability: that among $10^5$ random integers from a range of length $10^9$ no two suddenly coincide, which would give a smaller answer than expected and a "Wrong Answer" verdict.

    But here the effect known as the Birthday problem appears: surprisingly, for $k$ independent uniformly distributed values from a range of size $n$, the probability that some pair of them is equal is about $1 - e^{-k^2 / (2n)}$. So for $10^5$ probes our probability of failing on one test will be even greater than 50%!
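The birthday effect is easy to evaluate exactly under the uniform-and-independent assumption: the chance of no coincidence among k values from a range of size n is the product of (1 - i/n) for i from 0 to k - 1. A small sketch, with the hypothetical name `birthday_collision`:

```cpp
#include <cassert>
#include <cmath>

// Probability that at least two of `count` independent uniform values
// from a range of `range` possibilities coincide.
double birthday_collision(long long count, double range) {
    assert(count >= 0 && range >= 1);
    double log_no_collision = 0.0;
    for (long long i = 0; i < count; ++i) {
        if (i >= range) return 1.0;  // pigeonhole: a repeat is forced
        log_no_collision += std::log1p(-static_cast<double>(i) / range);
    }
    // 1 - exp(log_no_collision), computed without cancellation.
    return -std::expm1(log_no_collision);
}
```

For 10^5 values modulo 10^9 + 7 this gives a collision probability above 0.99, while for a value space of size 10^18 the same 10^5 values collide with probability of order 10^-9.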

    One note here: we take the hash modulo a prime because for a prime p, $\mathbb{Z}_p$ is a field, and for finite fields it can be proven that all the values above (which are values of random polynomials at a fixed point) are very close to being truly uniformly distributed over the whole field. If we take a composite number, special effects like zero divisors appear, so those cases need additional probability analysis, and it can easily happen that all such calculations are wrong for them. The main idea of my topic is exactly this surprising property of $\mathbb{Z}_{2^{64}}$: one can easily build a polynomial of small degree (~2048) whose value is zero at any point, so it isn't suitable for solving problems with hashes.

    So, as you correctly said, we need to increase the size of the hash-value space. One way is to use a single hash modulo a larger prime. But there is indeed a problem with multiplying such values. You can multiply them with binary multiplication (like binary exponentiation, but with adding and doubling at each step), or you can use some hacks like this (examples of code in Russian interface).
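A minimal sketch of the add-and-double multiplication mentioned above. It assumes the modulus is below 2^63 so that the sum of two residues never overflows; `mul_mod` is a hypothetical name, and this is the O(bits) approach, not the linked hack.

```cpp
#include <cassert>
#include <cstdint>

// Computes (a * b) % m by add-and-double, never forming the full product.
// Assumes m < 2^63 so that the sum of two residues fits in 64 bits.
std::uint64_t mul_mod(std::uint64_t a, std::uint64_t b, std::uint64_t m) {
    assert(m > 0 && m < (1ULL << 63));
    a %= m;
    b %= m;
    std::uint64_t result = 0;
    while (b > 0) {
        if (b & 1) {           // this bit of b contributes the current a
            result += a;
            if (result >= m) result -= m;
        }
        a += a;                // double a for the next bit
        if (a >= m) a -= m;
        b >>= 1;
    }
    return result;
}
```

The loop runs once per bit of `b`, so one multiplication costs up to 64 iterations. On compilers that provide the `unsigned __int128` extension (g++ and clang on 64-bit targets), casting to it before multiplying is a simpler alternative.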

    Another idea is to use a pair of hashes modulo two different primes of order $10^9$, say p and q. Then we can once again assume that the two components of that hash are independent of each other and uniformly distributed over $\mathbb{Z}_p$ and $\mathbb{Z}_q$ respectively, so in fact we have a point in 2D space that is uniformly distributed over the $p \cdot q \approx 10^{18}$ possible values. Then in every calculation above we can replace $10^9$ with $10^{18}$, and our probability of failing becomes insignificant. In my experience this way is faster than the previous two.
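A sketch of the pair-of-hashes idea with prefix hashes, so that any substring's pair can be read in O(1). The moduli 10^9 + 7 and 10^9 + 9 and the bases 911 and 3571 are illustrative assumptions, not values from the discussion. The polynomial here is evaluated with the highest power on the first character, which is a common variant.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Prefix hashes of a string under two moduli; a substring hash is the pair
// of its two residues. Moduli and bases below are illustrative choices.
struct DoubleHash {
    static constexpr std::uint64_t MOD1 = 1000000007, MOD2 = 1000000009;
    static constexpr std::uint64_t BASE1 = 911, BASE2 = 3571;
    std::vector<std::uint64_t> pre1, pre2, pow1, pow2;

    explicit DoubleHash(const std::string &s)
        : pre1(s.size() + 1, 0), pre2(s.size() + 1, 0),
          pow1(s.size() + 1, 1), pow2(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            std::uint64_t c = static_cast<unsigned char>(s[i]);
            pre1[i + 1] = (pre1[i] * BASE1 + c) % MOD1;
            pre2[i + 1] = (pre2[i] * BASE2 + c) % MOD2;
            pow1[i + 1] = pow1[i] * BASE1 % MOD1;
            pow2[i + 1] = pow2[i] * BASE2 % MOD2;
        }
    }

    // Hash pair of the substring s[l, r), where 0 <= l <= r <= size.
    std::pair<std::uint64_t, std::uint64_t> get(std::size_t l, std::size_t r) const {
        assert(l <= r && r < pre1.size());
        // Adding MOD * MOD keeps the subtraction non-negative; all values
        // stay below 2^64 because both moduli are near 10^9.
        std::uint64_t h1 = (pre1[r] + MOD1 * MOD1 - pre1[l] * pow1[r - l]) % MOD1;
        std::uint64_t h2 = (pre2[r] + MOD2 * MOD2 - pre2[l] * pow2[r - l]) % MOD2;
        return {h1, h2};
    }
};
```

For the string `abracadabra`, `get(0, 4)` and `get(7, 11)` both describe `abra` and return equal pairs, while `get(1, 5)` (`brac`) differs. Because every product fits in 64 bits, no slow multiplication routine is needed, which matches the remark that this approach tends to be the fastest.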

    I hope this will help in understanding, when hashes can cause a collision, and why we can submit correctly proved hash solution and be sure, that it will be Accepted.

    •  
      »
      »
      »
      10 months ago, # ^ |

      Wow, very complete response :). Thank you for the help, It definitely helped in understanding!! We should learn Russian, we miss so much interesting information only available at CF.ru posts :|

      There is also a very interesting article here: http://www.mii.lt/olympiads_in_informatics/pdf/INFOL119.pdf A friend sent it to me; not sure if he found it here on CF or somewhere else.

      •  
        »
        »
        »
        »
        10 months ago, # ^ |

        Look a few posts above and you will see one of the authors of this article saying that they submitted a paper about this :D.

  •  
    »
    »
    10 months ago, # ^ |

    But note one other thing. Exactly as Zlobober said, there are two types of problems where hashes are useful: comparing many pairs of strings, and checking whether all hashes are different. Note that when you just check whether two hashes are equal or not, you probably won't check so many pairs that the probability of a collision becomes high, because you will exceed the time limit before you find a collision. In that case the exponent in (1-p)^k (where p is the probability of a collision) grows linearly with the execution time. When you have 10^6 hashes generated modulo 10^9, the number of colliding hashes will be pretty high, but that doesn't change the fact that if you do no more than 10^6 queries of type "==", the probability of FINDING a collision is really small (the probability of finding none is ~0.999). This holds as long as you use nothing but the operations "==" and "!=". But when you use other operations such as "<", the exponent, and so the probability of a collision, may grow much faster! The best example is the simple question "are all of those hashes different?". You can sort them and check whether every two consecutive hash values are different. But in this case you have used the specific values of the hashes, and that allowed you to obtain O(n^2) pieces of information in O(n log n) time, so the probability of finding a collision in this example is very large.

    So summing up:

    1. I use only "==" and "!=" -> nothing to worry about

    2. I use "<" (maybe hidden in sort or a set or a map or anything) -> use pair of hashes mod 10^9+7 and 10^9+13 or anything else.