I'd like to share an unpopular but awesome algorithm that I have only heard mentioned once. Kapun's algorithm finds a hash collision for moduli as large as $$$10^{18}$$$. I know, we have lots of algorithms that do much better than that (and I'm writing an article on them at the moment, stay tuned), but Kapun's algorithm is really surprising in its simplicity. Even more surprising is how difficult its correctness is to prove.
Here we go.
A polynomial hash of a string $$$s$$$ is defined as
$$$$$$h(s) = \left( \sum_{i=0}^{|s| - 1} s_i b^i \right) \bmod M,$$$$$$

where $$$b$$$ is some base and $$$M$$$ is the modulus.
We want to find two strings with the same polynomial hash using only two characters: $$$0$$$ and $$$1$$$.
We can re-formulate this problem in another way. Let
$$$a_i = b^i \bmod M$$$
, then we want to find two distinct subsets of $$$a$$$ with the same sum modulo $$$M$$$. Now forget about the modulus: let's just find two subsets with the same sum.

Firstly, when is this possible? There are $$$2^n$$$ possible subsets and at most $$$nM$$$ distinct sums. If $$$2^n > nM$$$, that is, $$$n > \log_2 M + \log_2 n$$$, there are certainly two subsets with the same sum. Generating $$$M$$$ or even just $$$\sqrt{M}$$$ strings (if you use the birthday paradox) is impossible for large $$$M$$$, but there's still a way through. It turns out there is a deterministic algorithm that attempts to solve this problem and is very easy to implement!
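To make the pigeonhole argument concrete, here is a tiny brute-force search over subsets (a sketch with assumed toy parameters $$$M = 97$$$, $$$b = 10$$$; real moduli are far too large for enumeration, which is the whole point):

```python
# Pigeonhole in action: n = 8 gives 2^8 = 256 subsets of a = [b^i mod M],
# but only 97 possible sums modulo M, so two subsets must collide.
M, b, n = 97, 10, 8
a = [pow(b, i, M) for i in range(n)]

seen = {}  # sum mod M -> bitmask of the subset that first achieved it
collision = None
for mask in range(1 << n):
    s = sum(a[i] for i in range(n) if mask >> i & 1) % M
    if s in seen:
        collision = (seen[s], mask)  # two different subsets, equal sums mod M
        break
    seen[s] = mask

print(collision)
```

By pigeonhole the loop is guaranteed to stop within the first $$$M + 1$$$ masks, so the enumeration never actually visits all $$$2^n$$$ subsets here.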
Algorithm
Suppose we have a list of $$$n$$$ integers in range
$$$[1; R]$$$
each: $$$[a_1, a_2, \dots, a_n]$$$. Kapun's algorithm attempts to find two subsets of these integers with equal sums.

The algorithm is as follows. Instead of finding two subsets, we will find an expression of kind

$$$$$$\pm a_{i_1} \pm a_{i_2} \pm \dots \pm a_{i_k} = 0$$$$$$

for some distinct indices $$$i_1, i_2, \dots, i_k$$$ and some signs. It's easy to see that this is equivalent to finding two subsets with the same sum: the terms taken with $$$+$$$ form one subset, and the terms taken with $$$-$$$ form the other. Initially, we have $$$n$$$ trivial expressions $$$a_1, a_2, \dots, a_n$$$. The algorithm is iterative; at each step we do the following:

- If we have an expression with value $$$0$$$ at the moment, return this expression.
- Otherwise, sort the expressions by their values.
- Replace the current (sorted) expressions $$$[b_1, b_2, \dots, b_k]$$$ by $$$[b_2 - b_1, b_4 - b_3, \dots]$$$, that is, the expressions are divided into consecutive pairs and each pair is replaced by its difference. If $$$k$$$ is odd, $$$b_k$$$ is simply dropped.
- Repeat from the beginning.
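In code, the iteration might look like the following sketch (`kapun` is an assumed name, and only the values are tracked; the bookkeeping needed to recover the actual signs is omitted for now):

```python
def kapun(values):
    """Run Kapun's iteration on a list of positive integers.

    Returns True if some expression reaches 0 (two equal-sum subsets
    exist and could be reconstructed), False otherwise."""
    exprs = list(values)
    while exprs:
        if 0 in exprs:
            return True
        exprs.sort()
        # pair up neighbours and keep the differences;
        # an unpaired last element is simply dropped
        exprs = [exprs[i + 1] - exprs[i] for i in range(0, len(exprs) - 1, 2)]
    return False

print(kapun([2, 3, 5, 6]))  # True: (6 - 5) - (3 - 2) = 0, i.e. {2, 6} and {3, 5}
```

For $$$[2, 3, 5, 6]$$$ the states are $$$[2, 3, 5, 6] \to [1, 1] \to [0]$$$, and the zero expression unfolds into two subsets with sum $$$8$$$.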
If we end up with a single expression $$$[x]$$$ and $$$x > 0$$$, the algorithm has failed.

Theorem
Kapun's algorithm necessarily succeeds if
$$$R < \alpha \cdot 2^{\beta \log_2^2 n}$$$, where $$$\alpha, \beta$$$ are some small constants.

Proof. It's obvious that if there exists a counterexample for a particular $$$(n, R)$$$ tuple, it also exists for $$$(n, R + 1)$$$. Therefore we can define $$$f(n)$$$ as the smallest $$$R$$$ for which a counterexample exists for a specific $$$n$$$.

Also, let's denote by operator $$$\operatorname{sort} a$$$ the sorted array $$$a$$$, and by $$$\Delta a$$$ the neighbour-difference array of $$$\operatorname{sort} a$$$, i.e. exactly what one iteration of the algorithm produces: if $$$\operatorname{sort} a = [b_1, b_2, \dots, b_k]$$$, then $$$\Delta a = [b_2 - b_1, b_4 - b_3, \dots]$$$ (with $$$b_k$$$ dropped for odd $$$k$$$). For instance, $$$\operatorname{sort} [3, 1, 2] = [1, 2, 3]$$$, and $$$\Delta [3, 1, 2] = [1]$$$.
For a particular $$$n$$$, let us consider the smallest counterexample, the "smallest" meaning the one with the smallest $$$R$$$. Let's say it's some $$$[a_1, a_2, \dots, a_n]$$$. The algorithm sorts this array anyway, so without loss of generality assume $$$a$$$ is sorted.
As long as all elements stay positive, the subsequent behavior of the algorithm only depends on $$$\Delta a$$$, right? So we can subtract $$$a_1 - 1$$$ from all elements of $$$a$$$, because this operation does not affect the difference array, $$$a_1$$$ stays positive, and $$$R = a_n$$$ decreases. Therefore, in the optimal counterexample $$$a_1 = 1$$$.

Notice that an array of kind $$$[a_1, a_2, a_3, a_4, \dots]$$$ has the same difference array as an array of kind $$$[a_1, a_2, a_3 - c, a_4 - c, \dots]$$$: subtracting a constant $$$c$$$ from a suffix that starts at a pair boundary changes no within-pair difference. Taking $$$c = a_3 - a_2$$$, in the optimal array $$$a_3 = a_2$$$, and similarly $$$a_5 = a_4$$$, and so on (and, for odd $$$n$$$, $$$a_n = a_{n-1}$$$, since the last element is dropped anyway). Therefore the optimal counterexample is of kind $$$[1, a_2, a_2, a_4, a_4, \dots]$$$
. Let us denote $$$a' = \Delta a$$$. Then $$$a = [1,\ 1 + a'_1,\ 1 + a'_1,\ 1 + a'_1 + a'_2,\ 1 + a'_1 + a'_2,\ \dots]$$$. In particular, $$$R = a_n = 1 + \sum_i a'_i$$$. Notice that the mapping between $$$a'$$$ and the optimal $$$a$$$ is a bijection! This means that instead of searching for $$$a$$$ of size $$$n$$$ with minimal $$$a_n$$$, we can forget about $$$a$$$ and search for $$$a'$$$ of size $$$\lfloor n/2 \rfloor$$$ with the minimal sum of $$$a'$$$.

What can we say about the optimal $$$a'$$$? We only care about its sum, so without loss of generality assume it's sorted. As long as all elements stay positive, the behavior of the algorithm only depends on $$$\Delta a'$$$, so we can subtract $$$a'_1 - 1$$$ from all elements of $$$a'$$$, because the sum of the elements (non-strictly) decreases. Therefore $$$a'_1 = 1$$$. We can also assume the array satisfies $$$a'_3 = a'_2$$$, $$$a'_5 = a'_4$$$, and so on, because subtracting a constant from a suffix of the array decreases the sum. Therefore the optimal array is of kind $$$[1, a'_2, a'_2, a'_4, a'_4, \dots]$$$
. Let us denote $$$a'' = \Delta a'$$$. Then $$$a' = [1,\ 1 + a''_1,\ 1 + a''_1,\ 1 + a''_1 + a''_2,\ \dots]$$$. The sum of $$$a'$$$ is some weighted sum of $$$a''_1, a''_2, \dots$$$ (plus a constant). Notice that the mapping between $$$a''$$$ and the optimal $$$a'$$$ is a bijection, so we can search for $$$a''$$$ of size $$$\lfloor n/4 \rfloor$$$ with a particular weighted sum of $$$a''$$$ minimized.

You get the drill. Every time we have some weight function
$$$\mu(a) = \sum_i \lambda_i a_i$$$, and we want to find $$$a$$$ with the minimal weight. $$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$$$ holds for all reductions except the first one (when $$$\lambda = (0, 0, \dots, 0, 1)$$$), so due to the rearrangement inequality the optimal $$$a$$$ is sorted. Subtracting $$$a_1 - 1$$$ from all elements of $$$a$$$ decreases $$$\mu(a)$$$ because $$$\lambda_i \ge 0$$$. Subtracting a constant from a suffix of $$$a$$$ (one starting at a pair boundary, which keeps $$$\Delta a$$$ intact) reduces $$$\mu(a)$$$ for the same reason, so the optimal $$$a$$$ is of kind $$$[1, a_2, a_2, a_4, a_4, \dots]$$$.

Minimizing $$$\mu(a)$$$ is equivalent to minimizing $$$\mu'(a') = \sum_j \lambda'_j a'_j$$$ over $$$a' = \Delta a$$$, where

$$$$$$\lambda'_j = \sum_{i \ge 2j} \lambda_i.$$$$$$

It's easy to see from this formula that $$$\lambda'$$$ is always non-increasing, so the argument can be continued by induction.

Now, what do we get at the end? We have $$$n = 1$$$ and $$$\mu(a) = \lambda_1 a_1$$$ with $$$\lambda_1 > 0$$$; it's obvious that the optimal $$$a$$$ is $$$[1]$$$. Treating that as a neighbour-difference array of some $$$a'$$$ we get $$$a' = [1, 2]$$$ or $$$a' = [1, 2, 2]$$$ for the optimal $$$a'$$$, depending on the parity. Treating that as a neighbour-difference array of some $$$a''$$$ we get $$$a'' = [1, 2, 2, 4]$$$ or $$$a'' = [1, 2, 2, 4, 4]$$$ in the former case and $$$a'' = [1, 2, 2, 4, 4, 6]$$$ or $$$a'' = [1, 2, 2, 4, 4, 6, 6]$$$ in the latter case.

It takes a little while to notice the following fact.
Let us consider an infinite sequence $$$G$$$ with the following properties:
- $$$G_1 = 1$$$,
- $$$G$$$ is non-decreasing and $$$\Delta G = G$$$ (equivalently, $$$G_{2k} - G_{2k-1} = G_k$$$ for all $$$k$$$),
- among such $$$G$$$ it is the lexicographically smallest one.
Let us prove that the optimal $$$a$$$ for a given $$$n$$$ (arbitrary) and $$$\mu$$$ (with $$$\lambda$$$ non-increasing) is the prefix of $$$G$$$ of length $$$n$$$, by induction. For $$$n = 1$$$ this is obvious. Induction step: say we have already proven the fact for all lesser $$$n$$$. We have also already derived that the optimal $$$a$$$ is of kind
$$$[1,\ 1 + a'_1,\ 1 + a'_1,\ 1 + a'_1 + a'_2,\ \dots]$$$, where $$$a' = \Delta a$$$ has size $$$n' = \lfloor n/2 \rfloor$$$. Due to induction, $$$a'$$$ is optimal for $$$\mu'$$$ if $$$a'$$$ is a prefix of $$$G$$$ of length $$$n'$$$. Therefore, $$$a$$$ satisfies the following conditions:

- $$$a_1 = 1$$$,
- $$$\Delta a$$$ is the prefix of $$$G$$$ of length $$$\lfloor n/2 \rfloor$$$,
- among such $$$a$$$ it is the lexicographically smallest one.
It's easy to see that this implies that $$$a$$$ is the prefix of $$$G$$$ of length $$$n$$$. The proof by induction is complete.
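The optimal arrays derived along the way can be verified mechanically. The helper below re-implements the iteration from the Algorithm section (a self-contained sketch; `survives` is an assumed name):

```python
def survives(values):
    """True if Kapun's iteration on `values` never produces a 0,
    i.e. the array is a counterexample that defeats the algorithm."""
    exprs = list(values)
    while exprs:
        if 0 in exprs:
            return False
        exprs.sort()
        exprs = [exprs[i + 1] - exprs[i] for i in range(0, len(exprs) - 1, 2)]
    return True

# the optimal counterexamples derived above all defeat the algorithm...
for a in [[1], [1, 2], [1, 2, 2], [1, 2, 2, 4], [1, 2, 2, 4, 4, 6]]:
    assert survives(a)

# ...while capping the largest element any lower lets the algorithm win,
# e.g. [1, 2, 2, 3] collapses as [1, 2, 2, 3] -> [1, 1] -> [0]
assert not survives([1, 2, 2, 3])
```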
We now know that $$$G$$$ gives the smallest counterexample to Kapun's algorithm, for lots of definitions of "smallest" (namely, for all $$$\mu$$$ with non-increasing $$$\lambda$$$, and in particular for the original one, minimizing $$$R = a_n$$$). Therefore $$$f(n) = G_n$$$.

The sequence $$$G$$$ is as follows:

$$$$$$G = [1, 2, 2, 4, 4, 6, 6, 10, 10, 14, 14, 20, 20, 26, 26, 36, 36, \dots]$$$$$$

This series is given by A018819 and is asymptotically

$$$$$$G_n \approx 2^{\frac{1}{2} \log_2^2 \frac{n}{\ln n}}$$$$$$

with high precision. Therefore Kapun's algorithm works for all

$$$$$$R < G_n \approx 2^{\frac{1}{2} \log_2^2 \frac{n}{\ln n}}.$$$$$$
QED.
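The self-similarity $$$\Delta G = G$$$ unrolls into the recurrence $$$G_{2k} = G_{2k-1} + G_k$$$, $$$G_{2k+1} = G_{2k}$$$, which makes tabulating $$$f(n) = G_n$$$, and hence the guaranteed-success threshold, trivial. A sketch (function names are mine, not standard):

```python
def g_prefix(m):
    """First m terms of G, computed via G[2k] = G[2k-1] + G[k]."""
    g = [0, 1]  # 1-indexed; G_1 = 1
    for i in range(2, m + 1):
        g.append(g[i - 1] + (g[i // 2] if i % 2 == 0 else 0))
    return g[1:]

def guaranteed_length(M):
    """Smallest n with G_n > M, i.e. Kapun's algorithm provably
    succeeds on every n-element array with values in [1, M]."""
    g = [0, 1]
    n = 1
    while g[n] <= M:
        n += 1
        g.append(g[n - 1] + (g[n // 2] if n % 2 == 0 else 0))
    return n

print(g_prefix(13))  # [1, 2, 2, 4, 4, 6, 6, 10, 10, 14, 14, 20, 20]
```

As a sanity check, applying one iteration of the algorithm to a prefix of $$$G$$$ yields a shorter prefix of $$$G$$$, which is exactly why these arrays defeat the algorithm.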
Upper bound
The bound obtained above can be seen as a kind of "lower bound", meaning that, if you have an array and you want to find two subsets, it makes no sense to decrease $$$R$$$ any further than that bound.
It is reasonable to attempt to provide an upper bound in the same fashion, i.e. some $$$R$$$ beyond which there is little chance that Kapun's algorithm succeeds anymore.
This bound is, unfortunately, much less rigid than the lower bound. It is derived under the assumption that Kapun's algorithm works best when $$$a$$$ is selected uniformly at random from $$$[1; R]^n$$$, which is of course false. We can only hope that whatever bound we derive is of the right magnitude.

So let's forget about math and all and assume $$$a$$$ is selected randomly. Then, after sorting, the neighbouring elements differ by about $$$\frac{R}{n}$$$ on average. So after running one iteration of Kapun's algorithm we have about $$$\frac{n}{2}$$$ values of magnitude about $$$\frac{R}{n}$$$, which we assume to be distributed uniformly again. After all $$$\log_2 n$$$ operations we have a single value of about

$$$$$$x \approx \frac{R}{n \cdot \frac{n}{2} \cdot \frac{n}{4} \cdots 1} = \frac{R}{2^{\frac{\log_2^2 n + \log_2 n}{2}}}.$$$$$$

We want $$$x \approx 1$$$ or something, so $$$R \approx 2^{\frac{\log_2^2 n + \log_2 n}{2}}$$$. This bound turns out to be really similar to the rigid lower bound, so maybe we weren't that much off after all.

As far as I am aware, this is what people used to call the "proof" of Kapun's algorithm, which is obviously wrong, but at least the correct proof explains why the heuristic worked so well.
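A quick Monte-Carlo sanity check of this heuristic (illustrative only; the parameters are arbitrary toy values I picked). For $$$R = 1000$$$, success is already guaranteed by the theorem at $$$n = 64$$$, since $$$G_{64} = 1828 > 1000$$$, and the simulation agrees:

```python
import random

def kapun_succeeds(values):
    # value-only version of Kapun's iteration from the Algorithm section
    exprs = list(values)
    while exprs:
        if 0 in exprs:
            return True
        exprs.sort()
        exprs = [exprs[i + 1] - exprs[i] for i in range(0, len(exprs) - 1, 2)]
    return False

def success_rate(n, R, trials=200, seed=1):
    """Fraction of uniformly random arrays from [1, R]^n that succeed."""
    rng = random.Random(seed)
    hits = sum(
        kapun_succeeds([rng.randint(1, R) for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials

print(success_rate(64, 1000))  # 1.0: inside the guaranteed region
print(success_rate(4, 1000))   # well below 1: far under the threshold
```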
Application to hashing
In our use case,
$$$R \approx M$$$ and $$$n$$$ is the length of the strings. After rearranging the inequality of the lower bound we get

$$$$$$\frac{n}{\ln n} \gtrsim 2^{\sqrt{2 \log_2 M}},$$$$$$

which translates to $$$n \approx 1.6 \cdot 10^3$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 2 \cdot 10^4$$$ for $$$M \approx 10^{18}$$$. For the upper bound,

$$$$$$n \approx 2^{\sqrt{2 \log_2 M + \frac{1}{4}} - \frac{1}{2}},$$$$$$

which translates to $$$n \approx 1.5 \cdot 10^2$$$ for $$$M \approx 10^9$$$ and to $$$n \approx 1.4 \cdot 10^3$$$ for $$$M \approx 10^{18}$$$.

In reality, the truth is somewhere in the middle. For $$$M \approx 10^{18}$$$, both the 50-th percentile and the 95-th percentile of the length required in practice lie between these two estimates, on the order of several thousand.

Implementation
Licensed under GPLv3 or any later version at your option for any use cases, and under MIT for use in competitive programming tournaments.
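Here is my own sketch of what a full implementation might look like (not the author's original code): it tracks, for every expression, which original indices occur with $$$+$$$ and $$$-$$$ signs, and turns the zero expression into two equal-hash $$$0/1$$$ strings. The parameters $$$M = 97$$$, $$$b = 10$$$ are assumed toy values, with $$$n = 26$$$ chosen so that $$$G_{26} = 114 > M$$$ and success is guaranteed.

```python
def find_equal_subsets(a):
    """Kapun's algorithm with bookkeeping.

    Returns two disjoint index sets (I, J) with equal sums of a over
    them, or None if the algorithm fails."""
    # an expression is (value, {original index: +1 or -1})
    exprs = [(v, {i: 1}) for i, v in enumerate(a)]
    while exprs:
        for value, coef in exprs:
            if value == 0:
                plus = sorted(i for i, c in coef.items() if c > 0)
                minus = sorted(i for i, c in coef.items() if c < 0)
                return plus, minus
        exprs.sort(key=lambda e: e[0])
        nxt = []
        for i in range(0, len(exprs) - 1, 2):
            (x, cx), (y, cy) = exprs[i], exprs[i + 1]
            coef = dict(cy)
            for j, c in cx.items():  # index sets are disjoint, so no clashes
                coef[j] = -c
            nxt.append((y - x, coef))  # the pair becomes its difference
        exprs = nxt
    return None

# Demo: collide a polynomial hash with toy parameters.
M, b, n = 97, 10, 26
a = [pow(b, i, M) for i in range(n)]
result = find_equal_subsets(a)
assert result is not None  # guaranteed: max(a) < M < G_26
I, J = result
s = "".join("1" if i in I else "0" for i in range(n))
t = "".join("1" if i in J else "0" for i in range(n))

def poly_hash(string, b, M):
    return sum(int(c) * pow(b, i, M) for i, c in enumerate(string)) % M

assert s != t and poly_hash(s, b, M) == poly_hash(t, b, M)
```

The two index sets are disjoint, so the strings differ, yet their subset sums of $$$b^i \bmod M$$$ are equal as integers, which is even stronger than equality modulo $$$M$$$.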