Addendum: Optimized variant of Barrett reduction + proof re: ultimate NTT article

#	User	Rating
1	tourist	3880
2	jiangly	3669
3	ecnerwala	3654
4	Benq	3627
5	orzdevinwang	3612
6	Geothermal	3569
6	cnnfls_csy	3569
8	jqdai0815	3532
9	Radewoosh	3522
10	gyh20	3447

#	User	Contrib.
1	awoo	161
2	maomao90	160
3	adamant	156
4	maroonrk	153
5	-is-this-fft-	148
5	atcoder_official	148
5	SecondThread	148
8	Petr	147
9	nor	144
9	TheScrasse	144

This is an extension of my article proposing the "ultimate" NTT. Here I will present a simplified and optimized variant of the Barrett reduction algorithm on that page. As it relaxes certain bounds of the standard parameters of the algorithm, I will also present a proof of the correctness of the variant, at least for this specific case.

First, the algorithm:

function mulMod(a: Long, b: Long): Long {
    xh = multiplyHighUnsigned(a, b) // high word of product
    xl = a * b // low word of product

    g = multiplyHighUnsigned( (xh << 2) | (xl >>> 62) , BARR_R)

    t = xl - g * MOD - MOD
    t += (t >> 63) & MOD
    return t
}

Notice that the middle routine has been greatly simplified; instead of computing the high and middle words of $$$xr$$$, we simply take the top 64 bits of $$$x$$$, in effect computing $$$\displaystyle \left\lfloor \left\lfloor \frac{x}{2^{62}} \right\rfloor \frac{r}{2^{64}} \right\rfloor$$$ in place of $$$\displaystyle \left\lfloor \frac {xr}{4^{63}} \right\rfloor$$$. Experimentally, this seems to work (and indeed, this variant is how Barrett reduction is normally implemented for multi-word cases), however a naive adjustment of the bounds of the Barrett reduction algorithm would relax the pre-normalization bound of $$$t$$$ to $$$[-m, 2m)$$$. This is bad news, as there is a possibility of overflowing into the sign bit, thus forcing the use of a modulus one bit shorter (e.g. $$$4611686018326724609$$$, $$$g = 3$$$) so that we could easily correct the result with two normalization rounds.

So the following is a proof of correctness for this relaxed algorithm, at least for the very specific use case in the described NTT:

On the Project Nayuki article on Barrett reduction (note that I will use $$$n$$$ instead of $$$m$$$ for the modulus in accordance to the notation on that page), one important lemma is the inequality

$$$\displaystyle\frac{x}{n} - 1 < \frac{xr}{4^k} ≤ \frac{x}{n} \implies \frac{x}{n} - 2 < \left\lfloor \frac{x}{n} - 1 \right\rfloor ≤ \left\lfloor \frac{xr}{4^k} \right\rfloor ≤ \frac{x}{n}$$$

We shall show that this inequality holds even if $$$\dfrac{xr}{4^k}$$$ is replaced by $$$\dfrac{(x-\delta) r}{4^k}$$$, with $$$0 ≤ \delta < 2^{62}$$$, the bits that have been omitted.

First we need to obtain a tighter bound on $$$r$$$. Given that $$$r$$$, $$$k$$$, and $$$n$$$ are fixed in our algorithm, we can verify that $$$\dfrac {4^k} n - r < 2^{-9}$$$. Let this value be $$$\varepsilon$$$. We thus obtain $$$\displaystyle \frac{4^k}{n} - \varepsilon < r < \frac{4^k}{n}$$$

Multiply by $$$x-\delta \geq 0$$$: $$$\displaystyle (x-\delta)\left(\frac{4^k}{n} - \varepsilon\right) ≤ (x-\delta)r ≤ (x-\delta)\frac{4^k}{n}$$$

Given that $$$\delta$$$ is nonnegative we can relax the right bound to $$$x\dfrac{4^k}{n}$$$.

Divide by $$$4^k$$$: $$$\displaystyle (x-\delta)\left(\frac{1}{n} - \frac \varepsilon {4^k}\right) ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

Recompose the leftmost expression: $$$\displaystyle \frac x n - \left(\frac {\varepsilon x} {4^k} + \frac \delta n\right) + \frac{\delta\varepsilon} {4^k} ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

$$$\displaystyle\frac{\delta\varepsilon} {4^k} \geq 0$$$, so relax the bound: $$$\displaystyle \frac x n - \left(\frac {\varepsilon x} {4^k} + \frac \delta n\right) ≤ \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

$$$x < n^2 < 4^k \implies \dfrac{x}{4^k} < 1$$$, so further relax the bound: $$$\displaystyle \frac x n - \left({\varepsilon} + \frac \delta n\right) < \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

$$$\delta < 2^{62}$$$ while $$$n$$$ is very slightly below $$$2^{63}$$$, so it can be verified that $$$\dfrac \delta n$$$ must be $$$< \dfrac34$$$.

$$$\dfrac34 + \varepsilon < 1$$$, therefore $$$\displaystyle \frac x n - 1 < \frac {(x-\delta)r}{4^k} ≤ \frac{x}{n}$$$

We can thus follow the rest of the proof in the Nayuki article to prove that our $$$t$$$ is in $$$[-n, n)$$$.

Samples (Kotlin, so YMMV on benchmarks for other languages)

74853038 for 993E - Nikita and Order Statistics, 328 ms faster

74852214 for 986D - Perfect Encoding, 452 ms faster

Addendum

There are other applications of this proof besides NTT. For example, I found a safe prime $$$9223372036854771239$$$, which is a good property for a rolling-hash modulus to have, as any number in $$$[2, m-2]$$$ will have multiplicative order at least $$$\dfrac{m-1}{2}$$$. It seems that $$$\varepsilon$$$ and $$$\dfrac\delta n - \dfrac 1 2$$$ will both be rather small for moduli so close below a power of 2.

Spheniscine's blog

Samples (Kotlin, so YMMV on benchmarks for other languages)

Addendum