On SPOJ, there is a problem named POLYMUL, which is pretty straightforward: Multiply two polynomials, which have integer coefficients in the interval $$$[0, 1000]$$$ and can each have a degree of up to $$$10000$$$.

Currently, I am trying to solve this problem with NTT and my solution can be found here. As the solution shows, I am doing NTT with the primes of both $$$998244353$$$ and $$$2013265921$$$, and then using Chinese Remainder Theorem in order to find the true coefficients without any modulo (I am assuming the coefficients are less than $$$1000^2\cdot 10000=10^{10}$$$, which is much less than the product of $$$998244353$$$ and $$$2013265921$$$). On my machine, it takes about 3 seconds for this solution to solve a test case with $$$T=10$$$ (i.e. $$$10$$$ pairs of polynomial) where each polynomial in each pair is of degree $$$10000$$$ and the coefficients are integers randomly picked from the interval $$$[0, 1000]$$$.

However, the time limit is 1 second, and I do not know what to do to optimize my solution more. Does any one have any suggestions on how to optimize my NTT algorithm or how to optimize my C code in general? Any solutions would be very welcome. Thanks!

Instead of giving you a direct answer, I will give you 2 tips:

Try using a NTT library that has been coded by others. Then see what is the difference between yours and theirs.

Try studying the AC code of others (you can find them by googling).

A couple different observations, in decreasing order of "how much does this matter".

First off, you don't even need NTT for this problem (since there are no modulos). You can just use regular FFT.

Second, your NTT is recursive and isn't based off bit reversals. That'll make it much slower.

Third, you have too many mod operations. If you start off with 2 values that have already been modded, you can do

`a+b%MOD = (a+b>=MOD ? a+b-MOD : a+b)`

. Similarly for subtract.Fourth, you're using 64 bit integers. Throughout most of the computation, those are unnecessary. The only operation you need 64 bit integers for is multiplication. Only upcast then.

You can see my implementation here: https://gist.github.com/Chillee/4982ba0840745f63f3771bd84f280557#file-ntt-cpp

I might write a blog post sometime about writing a fast FFT for CP.

Adding another one: You are calculating powers of root of unity on the fly, while you can pre-calculate all of them before hand, before calling NTT.

Thank you both for your suggestions. First, I implemented just Rezwan's suggestion and got accepted in 0.62 seconds: https://pastebin.com/raw/JatXqYJE

Then, I implemented both your second and third suggestions and got accepted in 0.18 seconds: https://pastebin.com/raw/bgLtgqP3

However, why do you suggest using regular FFT over NTT? Personally, I am confused about the advantages of one over the other, and I mostly just used NTT because it seemed easier to implement. Is FFT faster than NTT? When would you choose using FFT over NTT and vice versa?

FFT is faster, but has the potential to run into precision issues with large enough inputs. NTT works entirely in integers so has no precision issues, but is slower and requires certain specific mods. In this case, both work because the coefficients never become too big.