Kapun's algorithm

Revision en12, by purplesyringa, 2022-02-15 13:00:43

I'd like to share an unpopular but awesome algorithm that I have only heard mentioned once. Kapun's algorithm finds a hash collision for moduli as large as $$$10^{18}$$$. I know we have lots of algorithms that do much better than that (and I'm writing an article on them at the moment, stay tuned), but Kapun's algorithm is really surprising in its simplicity. Even more surprising is how difficult its correctness is to prove.

Here we go.

A polynomial hash of a string $$$s$$$ is defined as

$$$$H(s) = \left( \sum_{i=1}^{|s|} s_i b^{i-1} \right) \bmod M.$$$$

We want to find two strings with the same polynomial hash using only two characters: $$$0$$$ and $$$1$$$.

We can re-formulate this problem in another way. Let

$$$$a_i = b^{i-1} \bmod M,$$$$

then we want to find two distinct subsets of $$$a$$$ with the same sum modulo $$$M$$$. Now forget about the modulus: let's just find two subsets with the same sum.

Firstly, when is this possible? There are $$$2^n$$$ possible subsets and at most

$$$$n \cdot (M - 1) + 1$$$$

distinct sums. If

$$$$2^n > n \cdot (M - 1) + 1,$$$$

that is,

$$$$n \gtrsim \log_2 M + \log_2 \log_2 M,$$$$

there are certainly two subsets with the same sum. Generating $$$M$$$ or even just $$$\sqrt{M}$$$ strings (if you use the birthday paradox) is impossible for large $$$M$$$, but there's still a way through.
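As a quick sanity check of this pigeonhole bound, here is a tiny script (the helper name is mine) that finds the smallest $$$n$$$ for which the guarantee kicks in:

```python
# Smallest n with 2^n > n * (M - 1) + 1, i.e. more subsets than
# possible subset sums, so two equal sums must exist (pigeonhole).
def pigeonhole_n(m):
    n = 1
    while 2**n <= n * (m - 1) + 1:
        n += 1
    return n

print(pigeonhole_n(10**9 + 7))  # a typical hashing modulus
```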

It turns out there is a deterministic algorithm that attempts to solve this problem that is very easy to implement!

Algorithm

Suppose we have a list of $$$n$$$ integers in range

$$$$[1; R]$$$$

each: $$$[a_1, a_2, \dots, a_n]$$$. Kapun's algorithm attempts to find two subsets of these integers with equal sum.

The algorithm is as follows. Instead of finding two subsets, we will find an expression of kind

$$$$\pm a_{i_1} \pm a_{i_2} \pm \dots \pm a_{i_k} = 0$$$$

for some indices $$$i_1 < \dots < i_k$$$ and some signs. It's easy to see that this is equivalent to finding two subsets with the same sum: the terms with $$$+$$$ form one subset and the terms with $$$-$$$ form the other. Initially, we have $$$n$$$ trivial expressions $$$a_1, a_2, \dots, a_n$$$. The algorithm is iterative; at each step we do the following:
  • If we have an expression with value $$$0$$$ at the moment, return this expression.
  • Otherwise, sort the expressions by their values.
  • Replace the current expressions

    $$$$b_1 \le b_2 \le \dots \le b_k$$$$

    by

    $$$$b_2 - b_1, \quad b_4 - b_3, \quad \dots,$$$$

    that is, the expressions are divided into pairs and each pair is replaced by its difference. If $$$k$$$ is odd, $$$b_k$$$ is simply dropped.
  • Repeat from the beginning.

If we end up with a single remaining expression of value $$$x$$$ and $$$x > 0$$$, the algorithm has failed.
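Here is a minimal sketch of these steps in Python (my code, not the author's; each expression carries its sign assignment so the two subsets can be recovered at the end):

```python
def kapun(a):
    """Try to find two index sets with equal sum using the pairing
    procedure. Returns (plus, minus) with
    sum(a[i] for i in plus) == sum(a[i] for i in minus),
    or None if a single nonzero expression remains."""
    # Each expression is (value, {index: sign}), value = sum of sign * a[i].
    exprs = [(x, {i: +1}) for i, x in enumerate(a)]
    while exprs:
        for value, signs in exprs:
            if value == 0:
                plus = sorted(i for i, s in signs.items() if s == +1)
                minus = sorted(i for i, s in signs.items() if s == -1)
                return plus, minus
        exprs.sort(key=lambda e: e[0])
        nxt = []
        # Pair up neighbours; each pair becomes its difference.
        # With an odd count, the largest expression is dropped.
        for lo, hi in zip(exprs[::2], exprs[1::2]):
            signs = dict(hi[1])
            for i, s in lo[1].items():
                signs[i] = -s  # an index never repeats across expressions
            nxt.append((hi[0] - lo[0], signs))
        exprs = nxt
    return None
```

For example, `kapun([1, 2, 3, 4])` pairs $$$1$$$ with $$$2$$$ and $$$3$$$ with $$$4$$$, then cancels the two differences and reports the index sets of $$$\{2, 3\}$$$ and $$$\{1, 4\}$$$, which indeed have equal sums.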

Theorem

Kapun's algorithm necessarily succeeds if

$$$$n \ge \alpha \cdot 2^{\beta \sqrt{\log_2 R}},$$$$

where $$$\alpha, \beta$$$ are some small constants.

Proof. It's obvious that if there exists a counterexample for a particular $$$(n, R)$$$ tuple, it also exists for $$$(n, R + 1)$$$. Therefore we can define $$$f(n)$$$ as the smallest $$$R$$$ for which a counterexample exists for a specific $$$n$$$.

Also, let's denote by $$$\mathrm{sort}(a)$$$ the sorted array $$$a$$$, and by $$$\Delta a$$$ the neighbour-difference array of $$$\mathrm{sort}(a)$$$. For instance, $$$\mathrm{sort}([3, 1, 2]) = [1, 2, 3]$$$, and $$$\Delta [3, 1, 2] = [2 - 1, 3 - 2] = [1, 1]$$$.

For a particular $$$n$$$, let us consider the smallest counterexample, "smallest" meaning the one with the smallest $$$R$$$. Let's say it's some $$$[a_1, a_2, \dots, a_n]$$$. The algorithm sorts this array anyway, so without loss of generality assume $$$a$$$ is sorted.

As long as $$$a_1 \ge 1$$$, the subsequent behavior of the algorithm only depends on $$$\Delta a$$$, right? So we can subtract $$$a_1 - 1$$$ from all elements of $$$a$$$, because this operation does not affect the difference array, $$$a_1$$$ stays positive and $$$R$$$ decreases. Therefore, in the optimal counterexample $$$a_1 = 1$$$.

Notice that an array of kind

$$$$[a_1, a_2, a_3, a_4, a_5, \dots]$$$$

behaves exactly like an array of kind

$$$$[a_1, a_2, a_3 - c, a_4 - c, a_5 - c, \dots],$$$$

because the differences inside the pairs, which are the only values the algorithm ever uses, stay the same. So in the optimal array $$$a_3 = a_2$$$, and so on. Therefore the optimal counterexample is of kind

$$$$[a_1, a_2, a_2, a_4, a_4, a_6, a_6, \dots].$$$$

Let us denote

$$$$a' = [a_2 - a_1, a_4 - a_3, \dots],$$$$

the array produced by one iteration of the algorithm. Then

$$$$a = [1, \ 1 + a'_1, \ 1 + a'_1, \ 1 + a'_1 + a'_2, \ 1 + a'_1 + a'_2, \ \dots].$$$$

In particular,

$$$$R = a_n = 1 + a'_1 + a'_2 + \dots$$$$

Notice that the mapping between $$$a'$$$ and the optimal $$$a$$$ is a bijection! This means that instead of searching for $$$a$$$ of size $$$n$$$ with minimal $$$R$$$, we can forget about $$$a$$$ and search for $$$a'$$$ of size

$$$$n' = \left\lfloor \frac{n}{2} \right\rfloor$$$$

with the minimal sum of $$$a'$$$.

What can we say about the optimal $$$a'$$$? We only care about its sum, so without loss of generality assume it's sorted. As long as $$$a'_1 \ge 1$$$, the behavior of the algorithm only depends on $$$\Delta a'$$$, so we can subtract $$$a'_1 - 1$$$ from all elements of $$$a'$$$, because the sum of the elements (non-strictly) decreases. Therefore

$$$$a'_1 = 1.$$$$

We can also assume the array satisfies $$$a'_3 = a'_2$$$, $$$a'_5 = a'_4$$$ and so on, because subtracting a constant from a suffix of the array decreases the sum. Therefore the optimal array is of kind

$$$$[1, a'_2, a'_2, a'_4, a'_4, \dots].$$$$

Let us denote

$$$$a'' = [a'_2 - a'_1, a'_4 - a'_3, \dots].$$$$

Then

$$$$a' = [1, \ 1 + a''_1, \ 1 + a''_1, \ 1 + a''_1 + a''_2, \ \dots].$$$$

The sum of $$$a'$$$ is some weighted sum of $$$a''_1, a''_2, \dots$$$ plus a constant. Notice that the mapping between $$$a''$$$ and the optimal $$$a'$$$ is a bijection, so we can search for $$$a''$$$ of size

$$$$n'' = \left\lfloor \frac{n'}{2} \right\rfloor$$$$

with a particular weighted sum of $$$a''$$$ minimized.

You get the drill. Every time we have some weight function

$$$$\mu(a) = \sum_i \lambda_i a_i,$$$$

and we want to find $$$a$$$ with the minimal weight.

$$$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$$$$

holds for all reductions except the first one (when $$$\mu(a) = a_n$$$), so due to the rearrangement inequality the optimal $$$a$$$ is sorted. Subtracting $$$a_1 - 1$$$ from all elements of $$$a$$$ decreases $$$\mu(a)$$$ because $$$\lambda_i \ge 0$$$. Subtracting a constant from a suffix of $$$a$$$ reduces $$$\mu(a)$$$ for the same reason, so the optimal $$$a$$$ is of kind

$$$$[1, a_2, a_2, a_4, a_4, \dots].$$$$

Minimizing $$$\mu(a)$$$ is equivalent to minimizing

$$$$\mu'(a') = \sum_i \lambda'_i a'_i, \qquad \text{where} \qquad \lambda'_i = \sum_{j \ge 2i} \lambda_j.$$$$

It's easy to see from this formula that $$$\lambda'$$$ is always non-increasing, so the argument can be continued by induction.
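To make the bookkeeping concrete, here is the very first reduction written out in this notation (my presentation of the step above, not an extra claim):

```latex
% The first reduction minimizes R = a_n, i.e. uses the weights
\mu(a) = a_n, \qquad \lambda = (0, 0, \dots, 0, 1).
% The reduced weights are then
\lambda'_i = \sum_{j \ge 2i} \lambda_j = 1 \quad \text{for every } i,
% so the next level minimizes the plain, unweighted sum:
\mu'(a') = \sum_i \lambda'_i a'_i = \sum_i a'_i.
```

This matches the "minimal sum of $$$a'$$$" objective that appeared after the first reduction.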

Now, what do we get at the end? We have $$$n = 1$$$ and

$$$$\mu(a) = \lambda_1 a_1;$$$$

it's obvious that the optimal $$$a$$$ is $$$[1]$$$. Treating that as a neighbour-difference array of some $$$a'$$$ we get

$$$$[1, 2] \quad \text{or} \quad [1, 2, 2]$$$$

for the optimal $$$a'$$$, depending on the parity. Treating that as a neighbour-difference array of some $$$a''$$$ we get

$$$$[1, 2, 2, 4] \quad \text{or} \quad [1, 2, 2, 4, 4]$$$$

in the former case and

$$$$[1, 2, 2, 4, 4, 6] \quad \text{or} \quad [1, 2, 2, 4, 4, 6, 6]$$$$

in the latter case.

It takes a little while to notice the following fact.

Let us consider an infinite sequence $$$G$$$ with the following properties:

  • $$$G_1 = 1$$$ and $$$G$$$ is non-decreasing,
  • $$$G_{2k} - G_{2k-1} = G_k$$$ for all $$$k \ge 1$$$,
  • among such $$$G$$$ it is the lexicographically smallest one.

Let us prove that the optimal $$$a$$$ for a given $$$n$$$ (arbitrary) and $$$\mu$$$ (with $$$\lambda$$$ non-increasing) is the prefix of $$$G$$$ of length $$$n$$$, by induction. For $$$n = 1$$$ this is obvious. Induction step: say we have already proven the fact for all lesser $$$n$$$. We have also already derived that the optimal $$$a$$$ is of kind

$$$$[1, a_2, a_2, a_4, a_4, \dots].$$$$

Due to induction,

$$$$a' = [a_2 - a_1, a_4 - a_3, \dots]$$$$

is optimal for $$$\mu'$$$ if $$$a'$$$ is a prefix of $$$G$$$ of length $$$n'$$$. Therefore, $$$a$$$ satisfies the following conditions:
  • $$$a_1 = 1$$$,
  • $$$a_{2k} - a_{2k-1} = G_k$$$ for all $$$k$$$,
  • among such $$$a$$$ it is the lexicographically smallest one.

It's easy to see that this implies that $$$a$$$ is the prefix of $$$G$$$ of length $$$n$$$. The proof by induction is complete.

We now know that $$$G$$$ gives the smallest counterexample to Kapun's algorithm, for lots of definitions of "smallest" (namely, for all $$$\mu$$$ with non-increasing $$$\lambda$$$). Therefore

$$$$f(n) = G_n.$$$$

The sequence $$$G$$$ is as follows:

$$$$1, \ 2, \ 2, \ 4, \ 4, \ 6, \ 6, \ 10, \ 10, \ 14, \ 14, \ 20, \ 20, \ 26, \ 26, \ 36, \ \dots$$$$

This series is given by A018819 and is asymptotically

$$$$G_n \approx 2^{\frac{\log_2^2 n}{2}}$$$$

with high precision. Therefore Kapun's algorithm works for all

$$$$n \gtrsim 2^{\sqrt{2 \log_2 R}}.$$$$

QED.
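As a sanity check of this characterization, here is a small script (mine) that generates $$$G$$$ by the recurrence above and verifies that its prefixes make the pairing procedure fail, while slightly smaller arrays do not:

```python
def survives(values):
    """Run the value part of Kapun's algorithm.
    True means the algorithm fails: no zero ever appears and a single
    nonzero value remains at the end."""
    while len(values) > 1:
        if 0 in values:
            return False
        values = sorted(values)
        # Pairwise differences; the last element is dropped on odd length.
        values = [values[i + 1] - values[i]
                  for i in range(0, len(values) - 1, 2)]
    return bool(values) and values[0] != 0

def g_prefix(n):
    """First n terms of G: G_1 = 1, G_{2k} = G_{2k+1} = 1 + G_1 + ... + G_k."""
    g = [1]
    while len(g) < n:
        k = len(g) // 2 + 1
        g.append(1 + sum(g[:k]))
        g.append(g[-1])
    return g[:n]
```

Assuming $$$f(n) = G_n$$$, decreasing the largest element of a prefix of $$$G$$$ by one must already make the algorithm succeed, which is easy to observe on small cases like $$$[1, 2, 2, 3]$$$.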

Upper bound

The bound obtained above can be seen as a kind of "lower bound", meaning that, if you have an array and you want to find two subsets, it makes no sense to decrease $$$R$$$ any further than that bound.

It is reasonable to attempt to provide an upper bound in the same fashion, i.e. some $$$R$$$ such that there is little chance that Kapun's algorithm succeeds anymore.

This bound is, unfortunately, much less rigid than the lower bound. It is derived on the assumption that Kapun's algorithm works best when $$$a$$$ is selected uniformly at random from

$$$$[1; R]^n,$$$$

which is of course false. We can only hope that whatever bound we derive is of the right order of magnitude.

So let's forget about math and all and assume $$$a$$$ is selected randomly. Then, after sorting,

$$$$a_{i+1} - a_i \approx \frac{R}{n}.$$$$

So after running one iteration of Kapun's algorithm we have

$$$$n' = \frac{n}{2}$$$$

and

$$$$R' \approx \frac{R}{n}.$$$$

After all $$$\log_2 n$$$ operations we have

$$$$R_{\mathrm{final}} \approx \frac{R}{n \cdot \frac{n}{2} \cdot \frac{n}{4} \cdots 2} = \frac{R}{2^{\frac{\log_2^2 n + \log_2 n}{2}}}.$$$$

We want

$$$$R_{\mathrm{final}} \le 1$$$$

or something, so

$$$$n \approx 2^{\sqrt{2 \log_2 R + \frac{1}{4}} - \frac{1}{2}}.$$$$

This bound turns out to be really similar to the rigid lower bound, so maybe we weren't that much off after all.

As far as I am aware, this is what people used to call the "proof" of Kapun's algorithm, which is obviously not rigorous, but at least the correct proof explains why it worked so well.

Application to hashing

In our use case, $$$R = M - 1$$$. After rearranging the inequality of the lower bound we get

$$$$n \gtrsim 2^{\sqrt{2 \log_2 M}},$$$$

which translates to

$$$$n \approx 2 \cdot 10^2$$$$

for $$$M \approx 10^9$$$ and to

$$$$n \approx 2 \cdot 10^3$$$$

for $$$M \approx 10^{18}$$$.

For the upper bound,

$$$$n \approx 2^{\sqrt{2 \log_2 M + \frac{1}{4}} - \frac{1}{2}},$$$$

which translates to

$$$$n \approx 1.5 \cdot 10^2$$$$

for $$$M \approx 10^9$$$ and to

$$$$n \approx 1.4 \cdot 10^3$$$$

for $$$M \approx 10^{18}$$$.
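These estimates are easy to reproduce numerically (a quick check of the two formulas as stated above; the function names are mine):

```python
import math

def n_guaranteed(m):
    """Length from the worst-case lower bound: n ~ 2^sqrt(2 log2 M)."""
    return 2 ** math.sqrt(2 * math.log2(m))

def n_heuristic(m):
    """Length from the random-input upper bound."""
    return 2 ** (math.sqrt(2 * math.log2(m) + 0.25) - 0.5)

for m in (10**9, 10**18):
    print(f"M = {m:.0e}: lower-bound n ~ {n_guaranteed(m):.0f}, "
          f"upper-bound n ~ {n_heuristic(m):.0f}")
```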

In reality, the truth is somewhere in the middle. For $$$M \approx 10^{18}$$$, the 50-th percentile is

Unable to parse markup [type=CF_MATHJAX]

and the 95-th percentile is

Unable to parse markup [type=CF_MATHJAX]

.

Implementation

Licensed under GPLv3 or any later version at your option for any use cases, and under MIT for use in competitive programming tournaments.

Code
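The actual code is behind the spoiler above; as a rough illustration, here is my own self-contained sketch of the whole pipeline (not the author's implementation; the parameters $$$b$$$ and $$$M$$$ are arbitrary examples):

```python
def find_collision(n, b, m):
    """Try to find two distinct 0/1 strings of length n with equal
    polynomial hash sum(s[i] * b^i) mod m. Returns (s, t) or None."""
    a = [pow(b, i, m) for i in range(n)]
    # Each expression is (value, {position: sign}); an expression of
    # value 0 means the '+' positions and '-' positions have equal sums.
    exprs = [(a[i], {i: +1}) for i in range(n)]
    while exprs:
        for value, signs in exprs:
            if value == 0:
                s, t = ["0"] * n, ["0"] * n
                for i, sign in signs.items():
                    (s if sign == +1 else t)[i] = "1"
                return "".join(s), "".join(t)
        exprs.sort(key=lambda e: e[0])
        nxt = []
        for lo, hi in zip(exprs[::2], exprs[1::2]):
            merged = dict(hi[1])
            for i, sign in lo[1].items():
                merged[i] = -sign
            nxt.append((hi[0] - lo[0], merged))
        exprs = nxt
    return None  # a single nonzero expression remained

def poly_hash(s, b, m):
    return sum(int(c) * pow(b, i, m) for i, c in enumerate(s)) % m
```

Note that the differences are computed over the integers, without the modulus: equal exact sums imply equal hashes. Something like `find_collision(512, 37, 10**6 + 3)` should produce two 512-character colliding strings; per the estimates above, larger moduli need proportionally longer strings.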
Tags: hashing, polynomial hash, hacking
