How to Optimize C Code for NTT -- POLYMUL on SPOJ

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

#	User	Contrib.
1	maomao90	174
2	awoo	164
2	adamant	164
4	TheScrasse	159
4	nor	159
6	maroonrk	156
7	-is-this-fft-	150
8	SecondThread	147
9	orz	146
10	pajenegod	145

On SPOJ, there is a problem named POLYMUL, which is pretty straightforward: Multiply two polynomials, which have integer coefficients in the interval $$$[0, 1000]$$$ and can each have a degree of up to $$$10000$$$.

Currently, I am trying to solve this problem with NTT and my solution can be found here. As the solution shows, I am doing NTT with the primes of both $$$998244353$$$ and $$$2013265921$$$, and then using Chinese Remainder Theorem in order to find the true coefficients without any modulo (I am assuming the coefficients are less than $$$1000^2\cdot 10000=10^{10}$$$, which is much less than the product of $$$998244353$$$ and $$$2013265921$$$). On my machine, it takes about 3 seconds for this solution to solve a test case with $$$T=10$$$ (i.e. $$$10$$$ pairs of polynomial) where each polynomial in each pair is of degree $$$10000$$$ and the coefficients are integers randomly picked from the interval $$$[0, 1000]$$$.

However, the time limit is 1 second, and I do not know what to do to optimize my solution more. Does any one have any suggestions on how to optimize my NTT algorithm or how to optimize my C code in general? Any solutions would be very welcome. Thanks!

Comments (4)

Show archived | Write comment?

Chilli

5 years ago, # |

← Rev. 3 →

+27

A couple different observations, in decreasing order of "how much does this matter".

First off, you don't even need NTT for this problem (since there are no modulos). You can just use regular FFT.

Second, your NTT is recursive and isn't based off bit reversals. That'll make it much slower.

Third, you have too many mod operations. If you start off with 2 values that have already been modded, you can do a+b%MOD = (a+b>=MOD ? a+b-MOD : a+b). Similarly for subtract.

Fourth, you're using 64 bit integers. Throughout most of the computation, those are unnecessary. The only operation you need 64 bit integers for is multiplication. Only upcast then.

You can see my implementation here: https://gist.github.com/Chillee/4982ba0840745f63f3771bd84f280557#file-ntt-cpp

I might write a blog post sometime about writing a fast FFT for CP.

→ Reply

Rezwan.Arefin01

5 years ago, # ^ |

Adding another one: You are calculating powers of root of unity on the fly, while you can pre-calculate all of them before hand, before calling NTT.

Noble_Mushtak

Thank you both for your suggestions. First, I implemented just Rezwan's suggestion and got accepted in 0.62 seconds: https://pastebin.com/raw/JatXqYJE

Then, I implemented both your second and third suggestions and got accepted in 0.18 seconds: https://pastebin.com/raw/bgLtgqP3

However, why do you suggest using regular FFT over NTT? Personally, I am confused about the advantages of one over the other, and I mostly just used NTT because it seemed easier to implement. Is FFT faster than NTT? When would you choose using FFT over NTT and vice versa?

FFT is faster, but has the potential to run into precision issues with large enough inputs. NTT works entirely in integers so has no precision issues, but is slower and requires certain specific mods. In this case, both work because the coefficients never become too big.

Noble_Mushtak's blog