sslotin's blog

By sslotin, 6 weeks ago, In English

I've been working on computational number theory for my book and recently wrote a few new articles on:

This is largely inspired by Jakube's tutorials and Nyaan' library implementations. Thank you!

If someone is interested, I may or may not try to optimize the sieve of Eratosthenes next: I believe it is possible to find all primes among the first $$$10^9$$$ numbers in well under a second.

 
 
 
 
  • Vote: I like it
  • +68
  • Vote: I do not like it

»
6 weeks ago, # |
  Vote: I like it +11 Vote: I do not like it

Thank you!

First of all, your factorizations link leads to an incorrect page. Second, you write there that for integers up to $$$2^{64}$$$ it is better to use Pollard's rho, but you are aware of some advanced factorization algorithms, as you mention some of them for larger $$$n$$$. Does it mean that your implementation is faster than, for example, some implementations of SQUFOF? (here in C++ or here in C)

  • »
    »
    6 weeks ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    I filtered out SQUFOF early on because it has the same asymptotic complexity but looks much more complex and arithmetic-intensive.

    I added dacin21's implementation to the benchmark, and it measures 425 factorizations per second, which is 7 times slower than Pollard-Brent with Montgomery multiplication. It can probably be optimized, but not by an order of magnitude.

    • »
      »
      »
      6 weeks ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      The main advantage of SQUFOF was that it mostly uses 32 bit integers. Back in the days, codeforces was still using 32 bit machines, so this was great. On modern 64-bit machines however, SQUFOF greatly suffers from division operations by changing values, which rule out the typical Barrett / Montgomery tricks, and having to compute square roots.

      In all honesty, I'm kinda glad I get to retire my SQUFOF code, as "fast" was just about the only positive thing I could say about it.

      PS: how fast is your code compared to tinyecm.c?

»
6 weeks ago, # |
  Vote: I like it 0 Vote: I do not like it

Finally we can remove this blog from catalog!

»
5 weeks ago, # |
  Vote: I like it +24 Vote: I do not like it

If you're interested, https://loj.ac/p/6466 is a testbed for integer factorization ($$$n\le 10^{30}$$$). There are public implementations of quadratic sieve, elliptic-curve factorization and some pollard rho's :)

»
5 weeks ago, # |
Rev. 4   Vote: I like it +8 Vote: I do not like it

I'm a bit surprised that you found a division-based implementation of the extended Euclid algorithm typically faster than exponentiation-by-squaring for finding an inverse mod $$$10^9 + 7$$$ on your setup, even knowing that general 32-bit divmod isn't that slow on most hardware. But it's close and I expect with more optimizations, exponentiation-by-squaring can retake the lead. The modmuls themselves can be made perhaps 10-20% faster using relaxed Barrett reduction or the variant Montgomery reduction I discuss below, and (if you want) there are several ways to reach $$$10^9 + 5$$$ with 37 modmuls instead of 43.

long Barrett reduction comments

There's also a slightly simpler and faster relative of Montgomery reduction that uses wide multiplications. (So it sacrifices the traditional main advantage of Montgomery reduction, in exchange for being as simple and fast as Lemire reduction.) The idea is simple: wideMul(x * pinv, p) is clearly a multiple of p and its lower word is equal to x. So, its upper word must be $$$-2^{-64} \cdot x$$$.

It's my understanding that the highest-precision multiplication operands typically included in SIMD instruction sets are doubles, providing only 52+1 bits of precision. So I'm curious about your plans for using SIMD to further speed up factorization of 60-bit integers.

I tried my own hand at a somewhat optimized sieve of Eratosthenes in January. The basic optimizations of segmenting the sieve into blocks small enough to fit into the L2 cache, and skipping over all even numbers plus a dumb Haskell-specific workaround/optimization or two were enough to be able to generate and traverse the list of all primes up to a billion in about 1.7 seconds on the Codeforces servers. I expect just applying the wheel-sieve idea with primes 3 and 5 as well, and using a language where the expected output isn't 2GB-large lazy linked list would get under a second with a little room to spare.