Fast modular multiplication

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	163
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	151
8	SecondThread	147
9	orz	146
10	pajenegod	145

I wrote an article/blog about how to do fast modular multiplication:

https://simonlindholm.github.io/files/bincoef.pdf

tl;dr:

avoid latency-bound loops
dynamic modulus is slow, constant modulus is fast
if you perform many multiplications with the same dynamic modulus, you can do what the compiler does and use Barrett reduction (involves some precomputation)
it is actually possible to beat the compiler if you accept a result in [0, 2*MOD):

uint64_t reduce(uint64_t a) {
  return a - (uint64_t)((__uint128_t(-1ULL / MOD) * a) >> 64) * MOD;
}

Same goes for addition and subtraction: if you can live with a result in [0, 2*MOD), just do a + b or a - b + MOD and skip the range correction that brings the result into [0, MOD). Delay modular reductions far as possible, ideally combining them with multiplications. While being mindful of overflows, of course.

on 32-bit, use Montgomery multiplication instead, to avoid __uint128_t
if really desperate, combine Montgomery multiplication with SIMD; this runs 3x faster than Barrett reduction when AVX2 is available
for larger numbers (e.g. multiplication of 64-bit numbers), use floating-point based methods

simonlindholm's blog