### kinoud's blog

By kinoud, history, 2 weeks ago,

I was trying to find some formal analysis on the error rate of Zobrist Hashing (used in 1977D - XORificator, and there's a tutorial in XOR Hashing [TUTORIAL]).

But I couldn't find any. (If you find any, feel free to post it here.)

Luckily, I managed to derive some upper bound on the error rate. I write it here and you can tell me if there's any problem in my derivation.

The problem statement:

You are given a set of $m$-bit integers $A=[a_1,a_2,...,a_N]$, and $n$ sets $[S_1,S_2,...,S_n]$ , each of which is a subset of $A$, and for every $1\le i < j \le n$, $S_i\ne S_j$. Denote $f(S_i)=\bigoplus\limits_{x\in S_i}x$, which is the xor sum of elements in $S_i$. Now, if I randomly assign each $a_1,a_2,...,a_N$ in $A$, what is the (estimated upper bound) probability that there exists some $1\le i < j \le n$ such that $f(S_i)=f(S_j)$?

The analysis:

One observation is that $f(S_i)=f(S_j)$ is true if and only if $f(S_i\cup S_j-S_i\cap S_j) = 0$. So let's denote $T_{ij}=S_i\cup S_j-S_i\cap S_j$. Now we want to estimate the probability that there's some $T_{ij}$, satisfying $f(T_{ij})=0$. For some specific $i$ and $j$, it is obvious that $P(f(T_{ij})=0)=\frac{1}{2^m}$ because elements in $A$ are randomly generated. However, there are $\frac{n(n-1)}{2}$ pairs of $(i,j)$, and all these $T_{ij}$ are actually not independent. For example, if $m=1$ (all integers in $A$ are 1-bit), and $A=[a_1,a_2],S_1=[a_1],S_2=[a_2],S_3=[a_1,a_2]$. Then no matter how you assign $a_1$ and $a_2$, for $f(T_{1,2})=a_1\oplus a_2,f(T_{2,3})=a_1,f(T_{1,3})=a_2$, there must be at least one $0$. So you can not treat the $f(T_{ij})$ as independent random variables for different $ij$.

The linearity property of expectation is used in my derivation. Let's first derive the expectation of how many pairs of $1\le i<j\le n$ satisfying $f(T_{ij})=0$ as

$\mathbf{E}\sum\limits_{i,j}[f(T_{ij})=0] = \sum\limits_{i,j}\mathbf{E}[f(T_{ij})=0] = \sum\limits_{i,j}P(f(T_{ij})=0) = \sum\limits_{i,j}\frac{1}{2^m} = \frac{n(n-1)}{2^{m+1}}$

The expectation of how many pairs of $1\le i<j\le n$ satisfying $f(T_{ij})=0$ can also be written as $P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2}$ , where $P_k$ is the probability that there're exactly $k$ pairs of $1\le i<j\le n$ satisfying $f(T_{ij})=0$.

So, $P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{\frac{n(n-1)}{2}}=\frac{n(n-1)}{2^{m+1}}$.

So, the probability that there exists some $1\le i < j \le n$ such that $f(S_i)=f(S_j)$ is

$P_1 + P_2 + ... + P_k + ... + P_{n(n-1)/2} < P_1 + 2P_2 + ... + kP_k + ... + \frac{n(n-1)}{2}P_{n(n-1)/2} = \frac{n(n-1)}{2^{m+1}}$.

If $n=3\times 10^5$ and $m=64$, the above upper bound is $2.4\times 10^{-9}$.

I'm glad to know if you find something wrong. :)

• +57

 » 2 weeks ago, # |   0 Thanks ,good job how much time write the blog its tooo tall but thanks again
 » 2 weeks ago, # |   0 Awesome! I was wondering about this because I had to use 128 bits to make it pass during the contest (got WA with 64 bits).Given this bound, I guess it was due to using a poor PRNG not close enough to true random (I get 64 bits from two calls to rand(), which is a linear congruential generator AFAIK)In my 128 bit submission I used mt19337.
•  » » 2 weeks ago, # ^ |   +4 Doesn't rand() return a 16-bit number, maximally 32k ? 64 bits should have been enough if you used 4 calls to rand().
•  » » » 2 weeks ago, # ^ |   +2 It does that on Codeforces? I didn't know. It returns a 32-bit number on my computer :(.
•  » » » » 2 weeks ago, # ^ |   +11 Blog post by neal: link
•  » » » » » 2 weeks ago, # ^ |   0 Thanks!
 » 2 weeks ago, # | ← Rev. 3 →   +18 You can actually simplify the analysis quite a bit with the union bound "trick".For any collection of events $A_1, A_2, A_3, \ldots, A_n$, we have the extremely simple upper bound on their union: $\mathbb{P}\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n \mathbb{P}(A_i)$(Can be proven using induction and the two set version of PIE, but it is intuitively obvious)Applying the union bound to the $T_{i,j}$'s gives you the result instantly. Also in order to see that $\mathbb{P}(f(T_{i,j})=0) = \tfrac{1}{2^m}$, assume that $T_{i,j}$ is non-empty, let $a_k\in T_{i,j}$. Then, for any value that the xor of the other elements takes, there is exactly one value that $a_k$ can take which makes the whole hash 0. This argument can be formalized by using conditional probabilities.