Fast modular multiplication

Revision en2, by simonlindholm, 2020-06-07 15:31:34

I wrote an article/blog about how to do fast modular multiplication:

https://simonlindholm.github.io/files/bincoef.pdf

tl;dr:

  • avoid latency-bound loops
  • dynamic modulus is slow, constant modulus is fast
  • if you perform many multiplications with the same dynamic modulus, you can do what the compiler does and use Barrett reduction (involves some precomputation)
  • it is actually possible to beat the compiler if you accept a result in [0, 2*MOD):
uint64_t reduce(uint64_t a) {
  return a - (uint64_t)((__uint128_t(-1ULL / MOD) * a) >> 64) * MOD;
}

Same goes for addition and subtraction: if you can live with a result in [0, 2*MOD), just do a + b or a - b + MOD and skip the range correction that brings the result into [0, MOD). Delay modular reductions far as possible, ideally combining them with multiplications. While being mindful of overflows, of course.

  • on 32-bit, use Montgomery multiplication instead, to avoid __uint128_t
  • if really desperate, combine Montgomery multiplication with SIMD; this runs 3x faster than Barrett reduction when AVX2 is available
  • for larger numbers (e.g. multiplication of 64-bit numbers), use floating-point based methods
Tags #modular arithmetic, #optimization, #simd

History

 
 
 
 
Revisions
 
 
  Rev. Lang. By When Δ Comment
en2 English simonlindholm 2020-06-07 15:31:34 9 modulo -> modulus
en1 English simonlindholm 2020-06-07 13:18:32 1325 Initial revision (published)