Fast modular multiplication

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

#	User	Contrib.
1	maomao90	174
2	adamant	164
2	awoo	164
4	TheScrasse	160
5	nor	159
6	maroonrk	156
7	SecondThread	150
8	-is-this-fft-	148
9	pajenegod	145
10	BledDest	144

I wrote an article/blog about how to do fast modular multiplication:

https://simonlindholm.github.io/files/bincoef.pdf

tl;dr:

avoid latency-bound loops
dynamic modulus is slow, constant modulus is fast
if you perform many multiplications with the same dynamic modulus, you can do what the compiler does and use Barrett reduction (involves some precomputation)
it is actually possible to beat the compiler if you accept a result in [0, 2*MOD):

uint64_t reduce(uint64_t a) {
  return a - (uint64_t)((__uint128_t(-1ULL / MOD) * a) >> 64) * MOD;
}

Same goes for addition and subtraction: if you can live with a result in [0, 2*MOD), just do a + b or a - b + MOD and skip the range correction that brings the result into [0, MOD). Delay modular reductions far as possible, ideally combining them with multiplications. While being mindful of overflows, of course.

on 32-bit, use Montgomery multiplication instead, to avoid __uint128_t
if really desperate, combine Montgomery multiplication with SIMD; this runs 3x faster than Barrett reduction when AVX2 is available
for larger numbers (e.g. multiplication of 64-bit numbers), use floating-point based methods