I'd like to share an unpopular but awesome algorithm that I have only heard mentioned once. Kapun's algorithm finds a hash collision for moduli as large as $$$10^{18}$$$. I know, we have lots of algorithms that do much better than that (and I'm writing an article on them at the moment, stay tuned), but Kapun's algorithm is really surprising in its simplicity. Even more surprising is how difficult its correctness is to prove.
Here we go.
A polynomial hash of a string $$$s$$$ is defined as
$$$$$$h(s) = \left( \sum_{i=0}^{|s| - 1} s_i b^i \right) \bmod M,$$$$$$

where $$$b$$$ is some base and $$$M$$$ is the modulus.
We want to find two strings with the same polynomial hash using only two characters: $$$0$$$ and $$$1$$$.
We can re-formulate this problem in another way. Let
$$$a_i = b^i \bmod M$$$
, then we want to find two distinct subsets of $$$a$$$ with the same sum modulo $$$M$$$. Now forget about the modulus: let's just find two subsets with the same sum.

Firstly, when is this possible? There are $$$2^n$$$ possible subsets and at most $$$nM$$$ distinct sums. If $$$2^n > nM$$$, that is, $$$n > \log_2 M + \log_2 n$$$, there are certainly two subsets with the same sum. Generating $$$M$$$ or even just $$$\sqrt{M}$$$ strings (if you use the birthday paradox) is impossible for large $$$M$$$, but there's still a way through. It turns out there is a deterministic algorithm that attempts to solve this problem and is very easy to implement!
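To make the pigeonhole argument concrete, here is a tiny brute-force search over subsets (a sketch with assumed toy parameters $$$M = 97$$$, $$$b = 10$$$; real moduli are far too large for enumeration, which is the whole point):

```python
# Pigeonhole in action: n = 8 gives 2^8 = 256 subsets of a = [b^i mod M],
# but only 97 possible sums modulo M, so two subsets must collide.
M, b, n = 97, 10, 8
a = [pow(b, i, M) for i in range(n)]

seen = {}  # sum mod M -> bitmask of the subset that first achieved it
collision = None
for mask in range(1 << n):
    s = sum(a[i] for i in range(n) if mask >> i & 1) % M
    if s in seen:
        collision = (seen[s], mask)  # two different subsets, equal sums mod M
        break
    seen[s] = mask

print(collision)
```

By pigeonhole the loop is guaranteed to stop within the first $$$M + 1$$$ masks, so the enumeration never actually visits all $$$2^n$$$ subsets here.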
Algorithm
Suppose we have a list of $$$n$$$ integers in range
$$$[1; R]$$$
each: $$$[a_1, a_2, \dots, a_n]$$$. Kapun's algorithm attempts to find two subsets of these integers with equal sums.

The algorithm is as follows. Instead of finding two subsets, we will find an expression of kind

$$$$$$\pm a_{i_1} \pm a_{i_2} \pm \dots \pm a_{i_k} = 0$$$$$$

for some distinct indices $$$i_1, i_2, \dots, i_k$$$ and some signs. It's easy to see that this is equivalent to finding two subsets with the same sum: the terms taken with $$$+$$$ form one subset, and the terms taken with $$$-$$$ form the other. Initially, we have $$$n$$$ trivial expressions $$$a_1, a_2, \dots, a_n$$$. The algorithm is iterative; at each step we do the following:

- If we have an expression with value $$$0$$$ at the moment, return this expression.
- Otherwise, sort the expressions by their values.
- Replace the current (sorted) expressions $$$[b_1, b_2, \dots, b_k]$$$ by $$$[b_2 - b_1, b_4 - b_3, \dots]$$$, that is, the expressions are divided into consecutive pairs and each pair is replaced by its difference. If $$$k$$$ is odd, $$$b_k$$$ is simply dropped.
- Repeat from the beginning.
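In code, the iteration might look like the following sketch (`kapun` is an assumed name, and only the values are tracked; the bookkeeping needed to recover the actual signs is omitted for now):

```python
def kapun(values):
    """Run Kapun's iteration on a list of positive integers.

    Returns True if some expression reaches 0 (two equal-sum subsets
    exist and could be reconstructed), False otherwise."""
    exprs = list(values)
    while exprs:
        if 0 in exprs:
            return True
        exprs.sort()
        # pair up neighbours and keep the differences;
        # an unpaired last element is simply dropped
        exprs = [exprs[i + 1] - exprs[i] for i in range(0, len(exprs) - 1, 2)]
    return False

print(kapun([2, 3, 5, 6]))  # True: (6 - 5) - (3 - 2) = 0, i.e. {2, 6} and {3, 5}
```

For $$$[2, 3, 5, 6]$$$ the states are $$$[2, 3, 5, 6] \to [1, 1] \to [0]$$$, and the zero expression unfolds into two subsets with sum $$$8$$$.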
If we end up with a single expression $$$[x]$$$ and $$$x > 0$$$, the algorithm has failed.

Theorem
Kapun's algorithm necessarily succeeds if
$$$R < \alpha \cdot 2^{\beta \log_2^2 n}$$$, where $$$\alpha, \beta$$$ are some small constants.

Proof. It's obvious that if there exists a counterexample for a particular $$$(n, R)$$$ tuple, it also exists for $$$(n, R + 1)$$$. Therefore we can define $$$f(n)$$$ as the smallest $$$R$$$ for which a counterexample exists for a specific $$$n$$$.

Also, let's denote by operator $$$\operatorname{sort} a$$$ the sorted array $$$a$$$, and by $$$\Delta a$$$ the neighbour-difference array of $$$\operatorname{sort} a$$$, i.e. exactly what one iteration of the algorithm produces: if $$$\operatorname{sort} a = [b_1, b_2, \dots, b_k]$$$, then $$$\Delta a = [b_2 - b_1, b_4 - b_3, \dots]$$$ (with $$$b_k$$$ dropped for odd $$$k$$$). For instance, $$$\operatorname{sort} [3, 1, 2] = [1, 2, 3]$$$, and $$$\Delta [3, 1, 2] = [1]$$$.
For a particular $$$n$$$, let us consider the smallest counterexample, the "smallest" meaning the one with the smallest $$$R$$$. Let's say it's some $$$[a_1, a_2, \dots, a_n]$$$. The algorithm sorts this array anyway, so without loss of generality assume $$$a$$$ is sorted.
As long as all elements stay positive, the subsequent behavior of the algorithm only depends on $$$\Delta a$$$, right? So we can subtract $$$a_1 - 1$$$ from all elements of $$$a$$$, because this operation does not affect the difference array, $$$a_1$$$ stays positive, and $$$R = a_n$$$ decreases. Therefore, in the optimal counterexample $$$a_1 = 1$$$.

Notice that an array of kind $$$[a_1, a_2, a_3, a_4, \dots]$$$ has the same difference array as an array of kind $$$[a_1, a_2, a_3 - c, a_4 - c, \dots]$$$: subtracting a constant $$$c$$$ from a suffix that starts at a pair boundary changes no within-pair difference. Taking $$$c = a_3 - a_2$$$, in the optimal array $$$a_3 = a_2$$$, and similarly $$$a_5 = a_4$$$, and so on (and, for odd $$$n$$$, $$$a_n = a_{n-1}$$$, since the last element is dropped anyway). Therefore the optimal counterexample is of kind $$$[1, a_2, a_2, a_4, a_4, \dots]$$$
. Let us denote $$$a' = \Delta a$$$. Then $$$a = [1,\ 1 + a'_1,\ 1 + a'_1,\ 1 + a'_1 + a'_2,\ 1 + a'_1 + a'_2,\ \dots]$$$. In particular, $$$R = a_n = 1 + \sum_i a'_i$$$. Notice that the mapping between $$$a'$$$ and the optimal $$$a$$$ is a bijection! This means that instead of searching for $$$a$$$ of size $$$n$$$ with minimal $$$a_n$$$, we can forget about $$$a$$$ and search for $$$a'$$$ of size $$$\lfloor n/2 \rfloor$$$ with the minimal sum of $$$a'$$$.

What can we say about the optimal $$$a'$$$? We only care about its sum, so without loss of generality assume it's sorted. As long as all elements stay positive, the behavior of the algorithm only depends on $$$\Delta a'$$$, so we can subtract $$$a'_1 - 1$$$ from all elements of $$$a'$$$, because the sum of the elements (non-strictly) decreases. Therefore $$$a'_1 = 1$$$. We can also assume the array satisfies $$$a'_3 = a'_2$$$, $$$a'_5 = a'_4$$$, and so on, because subtracting a constant from a suffix of the array decreases the sum. Therefore the optimal array is of kind $$$[1, a'_2, a'_2, a'_4, a'_4, \dots]$$$
. Let us denote $$$a'' = \Delta a'$$$. Then $$$a' = [1,\ 1 + a''_1,\ 1 + a''_1,\ 1 + a''_1 + a''_2,\ \dots]$$$. The sum of $$$a'$$$ is some weighted sum of $$$a''_1, a''_2, \dots$$$ (plus a constant). Notice that the mapping between $$$a''$$$ and the optimal $$$a'$$$ is a bijection, so we can search for $$$a''$$$ of size $$$\lfloor n/4 \rfloor$$$ with a particular weighted sum of $$$a''$$$ minimized.

You get the drill. Every time we have some weight function
$$$\mu(a) = \sum_i \lambda_i a_i$$$, and we want to find $$$a$$$ with the minimal weight. $$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$$$ holds for all reductions except the first one (when $$$\lambda = (0, 0, \dots, 0, 1)$$$), so due to the rearrangement inequality the optimal $$$a$$$ is sorted. Subtracting $$$a_1 - 1$$$ from all elements of $$$a$$$ decreases $$$\mu(a)$$$ because $$$\lambda_i \ge 0$$$. Subtracting a constant from a suffix of $$$a$$$ (one starting at a pair boundary, which keeps $$$\Delta a$$$ intact) reduces $$$\mu(a)$$$ for the same reason, so the optimal $$$a$$$ is of kind $$$[1, a_2, a_2, a_4, a_4, \dots]$$$.

Minimizing $$$\mu(a)$$$ is equivalent to minimizing $$$\mu'(a') = \sum_j \lambda'_j a'_j$$$ over $$$a' = \Delta a$$$, where

$$$$$$\lambda'_j = \sum_{i \ge 2j} \lambda_i.$$$$$$

It's easy to see from this formula that $$$\lambda'$$$ is always non-increasing, so the argument can be continued by induction.

Now, what do we get at the end? We have $$$n = 1$$$ and $$$\mu(a) = \lambda_1 a_1$$$ with $$$\lambda_1 > 0$$$; it's obvious that the optimal $$$a$$$ is $$$[1]$$$. Treating that as a neighbour-difference array of some $$$a'$$$ we get $$$a' = [1, 2]$$$ or $$$a' = [1, 2, 2]$$$ for the optimal $$$a'$$$, depending on the parity. Treating that as a neighbour-difference array of some $$$a''$$$ we get $$$a'' = [1, 2, 2, 4]$$$ or $$$a'' = [1, 2, 2, 4, 4]$$$ in the former case and $$$a'' = [1, 2, 2, 4, 4, 6]$$$ or $$$a'' = [1, 2, 2, 4, 4, 6, 6]$$$ in the latter case.

It takes a little while to notice the following fact.
Let us consider an infinite sequence $$$G$$$ with the following properties:
- $$$G_1 = 1$$$,
- $$$G$$$ is non-decreasing and $$$\Delta G = G$$$ (equivalently, $$$G_{2k} - G_{2k-1} = G_k$$$ for all $$$k$$$),
- among such $$$G$$$ it is the lexicographically smallest one.
Let us prove that the optimal $$$a$$$ for a given $$$n$$$ (arbitrary) and $$$\mu$$$ (with $$$\lambda$$$ non-increasing) is the prefix of $$$G$$$ of length $$$n$$$, by induction. For $$$n = 1$$$ this is obvious. Induction step: say we have already proven the fact for all lesser $$$n$$$. We have also already derived that the optimal $$$a$$$ is of kind
$$$[1,\ 1 + a'_1,\ 1 + a'_1,\ 1 + a'_1 + a'_2,\ \dots]$$$, where $$$a' = \Delta a$$$ has size $$$n' = \lfloor n/2 \rfloor$$$. Due to induction, $$$a'$$$ is optimal for $$$\mu'$$$ if $$$a'$$$ is a prefix of $$$G$$$ of length $$$n'$$$. Therefore, $$$a$$$ satisfies the following conditions:

- $$$a_1 = 1$$$,
- $$$\Delta a$$$ is the prefix of $$$G$$$ of length $$$\lfloor n/2 \rfloor$$$,
- among such $$$a$$$ it is the lexicographically smallest one.
It's easy to see that this implies that $$$a$$$ is the prefix of $$$G$$$ of length $$$n$$$. The proof by induction is complete.
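The optimal arrays derived along the way can be verified mechanically. The helper below re-implements the iteration from the Algorithm section (a self-contained sketch; `survives` is an assumed name):

```python
def survives(values):
    """True if Kapun's iteration on `values` never produces a 0,
    i.e. the array is a counterexample that defeats the algorithm."""
    exprs = list(values)
    while exprs:
        if 0 in exprs:
            return False
        exprs.sort()
        exprs = [exprs[i + 1] - exprs[i] for i in range(0, len(exprs) - 1, 2)]
    return True

# the optimal counterexamples derived above all defeat the algorithm...
for a in [[1], [1, 2], [1, 2, 2], [1, 2, 2, 4], [1, 2, 2, 4, 4, 6]]:
    assert survives(a)

# ...while capping the largest element any lower lets the algorithm win,
# e.g. [1, 2, 2, 3] collapses as [1, 2, 2, 3] -> [1, 1] -> [0]
assert not survives([1, 2, 2, 3])
```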
We now know that $$$G$$$ gives the smallest counterexample to Kapun's algorithm, for lots of definitions of "smallest" (namely, for all $$$\mu$$$ with non-increasing $$$\lambda$$$, and in particular for the original one, minimizing $$$R = a_n$$$). Therefore $$$f(n) = G_n$$$.

The sequence $$$G$$$ is as follows:

$$$$$$G = [1, 2, 2, 4, 4, 6, 6, 10, 10, 14, 14, 20, 20, 26, 26, 36, 36, \dots]$$$$$$

This series is given by A018819 and is asymptotically

$$$$$$G_n \approx 2^{\frac{1}{2} \log_2^2 \frac{n}{\ln n}}$$$$$$

with high precision. Therefore Kapun's algorithm works for all

$$$$$$R < G_n \approx 2^{\frac{1}{2} \log_2^2 \frac{n}{\ln n}}.$$$$$$
QED.
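The self-similarity $$$\Delta G = G$$$ unrolls into the recurrence $$$G_{2k} = G_{2k-1} + G_k$$$, $$$G_{2k+1} = G_{2k}$$$, which makes tabulating $$$f(n) = G_n$$$, and hence the guaranteed-success threshold, trivial. A sketch (function names are mine, not standard):

```python
def g_prefix(m):
    """First m terms of G, computed via G[2k] = G[2k-1] + G[k]."""
    g = [0, 1]  # 1-indexed; G_1 = 1
    for i in range(2, m + 1):
        g.append(g[i - 1] + (g[i // 2] if i % 2 == 0 else 0))
    return g[1:]

def guaranteed_length(M):
    """Smallest n with G_n > M, i.e. Kapun's algorithm provably
    succeeds on every n-element array with values in [1, M]."""
    g = [0, 1]
    n = 1
    while g[n] <= M:
        n += 1
        g.append(g[n - 1] + (g[n // 2] if n % 2 == 0 else 0))
    return n

print(g_prefix(13))  # [1, 2, 2, 4, 4, 6, 6, 10, 10, 14, 14, 20, 20]
```

As a sanity check, applying one iteration of the algorithm to a prefix of $$$G$$$ yields a shorter prefix of $$$G$$$, which is exactly why these arrays defeat the algorithm.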
Upper bound
The bound obtained above can be seen as a kind of "lower bound", meaning that, if you have an array and you want to find two subsets, it makes no sense to decrease $$$R$$$ any further than that bound.
It is reasonable to attempt to provide an upper bound in the same fashion, i.e. some $$$R$$$ beyond which there is little chance that Kapun's algorithm succeeds anymore.
This bound is, unfortunately, much less rigid than the lower bound. It is derived under the assumption that Kapun's algorithm works best when $$$a$$$ is selected uniformly at random from $$$[1; R]^n$$$, which is of course false. We can only hope that whatever bound we derive is of the right magnitude.

So let's forget about math and all and assume $$$a$$$ is selected randomly. Then, after sorting, the neighbouring elements differ by about $$$\frac{R}{n}$$$ on average. So after running one iteration of Kapun's algorithm we have about $$$\frac{n}{2}$$$ values of magnitude about $$$\frac{R}{n}$$$, which we assume to be distributed uniformly again. After all $$$\log_2 n$$$ operations we have a single value of about

$$$$$$x \approx \frac{R}{n \cdot \frac{n}{2} \cdot \frac{n}{4} \cdots 1} = \frac{R}{2^{\frac{\log_2^2 n + \log_2 n}{2}}}.$$$$$$

We want $$$x \approx 1$$$ or something, so $$$R \approx 2^{\frac{\log_2^2 n + \log_2 n}{2}}$$$. This bound turns out to be really similar to the rigid lower bound, so maybe we weren't that much off after all.

As far as I am aware, this is what people used to call the "proof" of Kapun's algorithm, which is obviously wrong, but at least the correct proof explains why the heuristic worked so well.
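A quick Monte-Carlo sanity check of this heuristic (illustrative only; the parameters are arbitrary toy values I picked). For $$$R = 1000$$$, success is already guaranteed by the theorem at $$$n = 64$$$, since $$$G_{64} = 1828 > 1000$$$, and the simulation agrees:

```python
import random

def kapun_succeeds(values):
    # value-only version of Kapun's iteration from the Algorithm section
    exprs = list(values)
    while exprs:
        if 0 in exprs:
            return True
        exprs.sort()
        exprs = [exprs[i + 1] - exprs[i] for i in range(0, len(exprs) - 1, 2)]
    return False

def success_rate(n, R, trials=200, seed=1):
    """Fraction of uniformly random arrays from [1, R]^n that succeed."""
    rng = random.Random(seed)
    hits = sum(
        kapun_succeeds([rng.randint(1, R) for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials

print(success_rate(64, 1000))  # 1.0: inside the guaranteed region
print(success_rate(4, 1000))   # well below 1: far under the threshold
```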
Application to hashing
In our use case,
$$$R \approx M$$$ and $$$n$$$ is the length of the strings. After rearranging the inequality of the lower bound we get

$$$$$$\frac{n}{\ln n} \gtrsim 2^{\sqrt{2 \log_2 M}},$$$$$$

which translates to $$$n \approx 1.6 \cdot 10^3$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 2 \cdot 10^4$$$ for $$$M \approx 10^{18}$$$. For the upper bound,

$$$$$$n \approx 2^{\sqrt{2 \log_2 M + \frac{1}{4}} - \frac{1}{2}},$$$$$$

which translates to $$$n \approx 1.5 \cdot 10^2$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 1.4 \cdot 10^3$$$ for $$$M \approx 10^{18}$$$.

In reality, the truth is somewhere in the middle. For $$$M \approx 10^{18}$$$, both the 50-th percentile and the 95-th percentile of the length required in practice lie between these two estimates, on the order of several thousand.

Implementation
Licensed under GPLv3 or any later version at your option for any use cases, and under MIT for use in competitive programming tournaments.
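Here is my own sketch of what a full implementation might look like (not the author's original code): it tracks, for every expression, which original indices occur with $$$+$$$ and $$$-$$$ signs, and turns the zero expression into two equal-hash $$$0/1$$$ strings. The parameters $$$M = 97$$$, $$$b = 10$$$ are assumed toy values, with $$$n = 26$$$ chosen so that $$$G_{26} = 114 > M$$$ and success is guaranteed.

```python
def find_equal_subsets(a):
    """Kapun's algorithm with bookkeeping.

    Returns two disjoint index sets (I, J) with equal sums of a over
    them, or None if the algorithm fails."""
    # an expression is (value, {original index: +1 or -1})
    exprs = [(v, {i: 1}) for i, v in enumerate(a)]
    while exprs:
        for value, coef in exprs:
            if value == 0:
                plus = sorted(i for i, c in coef.items() if c > 0)
                minus = sorted(i for i, c in coef.items() if c < 0)
                return plus, minus
        exprs.sort(key=lambda e: e[0])
        nxt = []
        for i in range(0, len(exprs) - 1, 2):
            (x, cx), (y, cy) = exprs[i], exprs[i + 1]
            coef = dict(cy)
            for j, c in cx.items():  # index sets are disjoint, so no clashes
                coef[j] = -c
            nxt.append((y - x, coef))  # the pair becomes its difference
        exprs = nxt
    return None

# Demo: collide a polynomial hash with toy parameters.
M, b, n = 97, 10, 26
a = [pow(b, i, M) for i in range(n)]
result = find_equal_subsets(a)
assert result is not None  # guaranteed: max(a) < M < G_26
I, J = result
s = "".join("1" if i in I else "0" for i in range(n))
t = "".join("1" if i in J else "0" for i in range(n))

def poly_hash(string, b, M):
    return sum(int(c) * pow(b, i, M) for i, c in enumerate(string)) % M

assert s != t and poly_hash(s, b, M) == poly_hash(t, b, M)
```

The two index sets are disjoint, so the strings differ, yet their subset sums of $$$b^i \bmod M$$$ are equal as integers, which is even stronger than equality modulo $$$M$$$.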