orz's blog

By orz, history, 11 months ago, translation, In English

Consider the following problem: given three integers $$$x$$$, $$$y$$$ и $$$m$$$, $$$0 \leqslant x,y < m < 2^{32}$$$, calculate $$$xy\bmod m$$$. The easy way is just to multiply these numbers and apply the modulo operation:

uint32_t prod(const uint32_t x, const uint32_t y, const uint32_t m)
{
	return x * y % m;
}

As you might have guessed, this solution is wrong. The thing is that an overflow is possible in such a procedure: the operation x * y is performed in the typeuint32_t, and in fact the intermediate result of this operation will not be $$$xy$$$, but $$$xy\bmod2^{32}$$$. If after that we take the result modulo $$$m$$$, it may differ from the correct one:

$$$ \left(xy\bmod2^{32}\right)\bmod m\ne xy\bmod m. $$$

The way out is simple — you need to multiply in a larger type:

uint64_t prod_uint64(const uint64_t x, const uint64_t y, const uint64_t m)
{
	return x * y % m;
}

If you do this, then, since $$$xy<2^{64}$$$, this product will definitely not overflow, and after taking the result modulo, you will get the correct answer.

The question is: what if $$$x$$$, $$$y$$$ and $$$m$$$ can be greater than $$$2^{32}$$$? I suggest the following.

  1. Binary multiplication. Just like binary exponentiation, there is binary multiplication: to calculate $$$xy$$$, count $$$x\left\lfloor\frac y2\right\rfloor$$$, add this number to itself, and possibly add another $$$x$$$. This will spend $$$\mathcal O(\log y)$$$ actions, but among them there will be nothing but addition and subtraction!
	uint64_t sum(const uint64_t x, const uint64_t y, const uint64_t m)
	{
		uint64_t ans = x + y;
		if (ans < x || ans >= m)
			ans -= m;
		return ans;
	}
uint64_t prod_binary(const uint64_t x, const uint64_t y, const uint64_t m)
	{
		if (y <= 1)
			return y ? x : 0;
		uint64_t ans = prod_binary(x, y >> 1, m);
		ans = sum(ans, ans, m);
		if (y & 1)
			ans = sum(ans, x, m);
		return ans;
	}
  1. Multiplication via int128. To multiply two 32-bit numbers, you need a 64-bit intermediate variable. And to multiply two 64-bit numbers, you need a 128-bit variable! Modern 64-bit C++ compilers (except perhaps Microsoft® Visual C++®) have a special type __int128, which allows performing operations on 128-bit numbers.
	int64_t prod_uint128(const uint64_t x, const uint64_t y, const uint64_t m)
	{
		return (unsigned __int128)x * y % m;
	}
  1. Multiplication using real type. What is $$$xy\bmod m$$$? This is actually $$$xy-cm$$$, where $$$c=\left\lfloor\frac{xy}m\right\rfloor$$$. Let's then try to calculate $$$c$$$, and from here we find $$$xy\bmod m$$$. At the same time, note that we do not need to find $$$c$$$ exactly. What happens if we accidentally count, say, $$$c-4$$$? Then, when calculating the remainder, we get $$$xy-(c-4)m=xy-cm+4m=xy\bmod m+4m$$$. At first glance, this is not what we need. But if $$$m$$$ is not too large and $$$ xy\bmod m+4m$$$ did not overflood the 64-bit type, then after that you can honestly take the remainder and get the answer.

    This translates into the following implementation:
	uint64_t prod_double(const uint64_t x, const uint64_t y, const uint64_t m)
	{
		uint64_t c = (double)x * y / m;
		int64_t ans = int64_t(x * y - c * m) % int64_t(m);
		if (ans < 0)
			ans += m;
		return ans;
	}
	uint64_t prod_long_double(const uint64_t x, const uint64_t y, const uint64_t m)
	{
		uint64_t c = (long double)x * y / m;
		int64_t ans = int64_t(x * y - c * m) % int64_t(m);
		if (ans < 0)
			ans += m;
		return ans;
	}

double is accurate enough for this task if $$$x$$$, $$$y$$$ and $$$m$$$ are less than $$$2^{57}$$$. long double is enough for numbers less than $$$2^{63}$$$, but remember that long double must be 80-bit for this, and this is not true on all compilers: for example, in Microsoft® Visual C++® long double is the same as double.

Please note that this method is not applicable if $$$m>2^{63}$$$: in this case, ans cannot be stored in int64_t, because, perhaps, $$$\mathtt{ans}\geqslant2^{63}$$$ and an overflow will occur, due to which the (ans < 0) branch will be executed and we will receive an incorrect answer.

It can be seen that Microsoft® Visual C++® suffers from developmental delay lags behind other compilers in the availability of technical means for multiplying large numbers modulo, so if we want the function to work quickly on all compilers, it needs some fresh idea. Fortunately, such an idea was invented in 1960 by Anatoly Karatsuba. 4. Karatsuba multiplication. The idea was originally used to quickly multiply long numbers. Namely, let $$$x$$$ and $$$y$$$ be two non-negative integers less than $$$N^2$$$. We divide them with the remainder by $$$N$$$: $$$x=Nx_1+x_0$$$, $$$y=Ny_1+y_0$$$. Then the required $$$xy$$$ can be found as $$$N^2x_1y_1+Nx_1y_0+Nx_0y_1+x_0y_0=N\cdot\bigl(N\cdot x_1y_1+\left(x_0+x_1\right)\left(y_0+y_1\right)-x_1y x_0y_0\bigr)+x_0y_0$$$. As you can see, this transformation reduced the multiplication of $$$x$$$ and $$$y$$$ 1) to $$$\mathcal O(1)$$$ additions and subtractions of numbers not exceeding $$$N^4$$$; 2) three multiplications of numbers not exceeding $$$2N$$$ (namely $$$x_0y_0$$$, $$$x_1y_1$$$ and $$$\left(x_0+x_1\right)\left(y_0+y_1\right) $$$); 3) to two multiplications by $$$N$$$ of numbers not exceeding $$$2N^2$$$.

Item 1) is almost always very simple. In the case of long numbers, item 3) is also simple: you can take $$$N$$$, equal to a power of two, and then it is performed as a binary shift (operator << in C++). Therefore, in essence, Karatsuba reduced one multiplication of numbers less than $$$N^2$$$ to three multiplications of numbers less than $$$2N$$$. If these multiplications are also reduced by the Karatsuba method, according to the master theorem, the asymptotics of this method will be $$$\Theta\left(\log^{\log_23}N\right)$$$ instead of the naive $$$\Theta\left(\log^2N\right)$$$.

But we do not need to use the recursion, because if the lengths of $$$x$$$ and $$$y$$$ are halved, we can already use prod_uint64 orprod_double. The difficulty in our case is point 3): choose such $$$N$$$ so that, firstly, it is less than $$$2^{32}$$$ or at least a bit more, and secondly, so that it can be quickly multiplied by numbers of order $$$N^2$$$. Both requirements are met if we take $$$N=\mathrm{round}\left(\sqrt m\right)$$$: indeed, then for $$$m<2^{64}$$$ we have $$$N<2^{32}$$$, and $$$\left|m_0\right|=\left|m-N^2\right|\leqslant N<2^{32}$$$; then $$$xN=\left(x_1N+x_0\right)N=x_1N^2+x_0N\equiv x_0N-x_1m_0\pmod m$$$, and both multiplications here are performed over numbers of order $$$N$$$.

The attentive reader will notice that we have a serious problem here: we actually extract the square root of an integer. If you know how to use the multiplication of Karatsuba in this problem, bypassing this problem (including finding the square root quickly enough, writing faster or shorter code than mine), please write in the comments!

Since finding the product $$$\left(x_0+x_1\right)\left(y_0+y_1\right)$$$ turned out to be extremely unpleasant (remember, prod_double does not work for $$$m>2^{63} $$$), I decided just to calculate $$$x_0y_1$$$ and $$$x_1y_0$$$ separately — so this is not Karatsuba's method in the true sense, since we spend four multiplications of numbers of order $$$N$$$.

	uint64_t dif(const uint64_t x, const uint64_t y, const uint64_t m)
	{
		uint64_t ans = x - y;
		if (ans > x)
			ans += m;
		return ans;
	}
	bool check_ge_rounded_sqrt(const uint64_t m, const uint64_t r)
	{
		return ((r >= 1ull << 32) || r * (r + 1) >= m);
	}
	bool check_le_rounded_sqrt(const uint64_t m, const uint64_t r)
	{
		return (r == 0 || ((r <= 1ull << 32) && r * (r - 1) < m));
	}
	bool check_rounded_sqrt(const uint64_t m, const uint64_t r)
	{
		return check_ge_rounded_sqrt(m, r) && check_le_rounded_sqrt(m, r);
	}
	uint64_t rounded_sqrt(const uint64_t m)
	{
		uint64_t r = floorl(.5 + sqrtl(m));
		if (!check_ge_rounded_sqrt(m, r))
			while (!check_ge_rounded_sqrt(m, ++r));
		else if (!check_le_rounded_sqrt(m, r))
			while (!check_le_rounded_sqrt(m, --r));
		return r;
	}
	uint64_t prod_karatsuba_aux(const uint64_t x, const uint64_t N, const int64_t m0, const uint64_t m)
	{
		uint64_t x1 = x / N;
		uint64_t x0N = (x - N * x1) * N;
		if (m0 >= 0)
			return dif(x0N, x1 * (uint64_t)m0, m);
		else
			return sum(x0N, x1 * (uint64_t)-m0, m);
	}
	uint64_t prod_karatsuba(const test& t)
	{
		uint64_t x = t.x, y = t.y, m = t.modulo;
		uint64_t N = rounded_sqrt(t.modulo);
		int64_t m0 = m - N * N;
		uint64_t x1 = t.x / N;
		uint64_t x0 = t.x - N * x1;
		uint64_t y1 = t.y / N;
		uint64_t y0 = t.y - N * y1;
		uint64_t x0y0 = sum(x0 * y0, 0, m);
		uint64_t x0y1 = sum(x0 * y1, 0, m);
		uint64_t x1y0 = sum(x1 * y0, 0, m);
		uint64_t x1y1 = sum(x1 * y1, 0, m);
		return sum(prod_karatsuba_aux(sum(prod_karatsuba_aux(x1y1, N, m0, m), sum(x0y1, x1y0, m), m), N, m0, m), x0y0, m);
	}

It can be seen that in fact the only thing that the Karatsuba method gives us here is that if you find a large number $$$N$$$, by which you can quickly multiply, then you can multiply any two numbers modulo. In fact, if the modulus $$$m$$$ were fixed, and there were many queries for multiplication by this fixed modulus, then Karatsuba's method would be lightning fast, since the most expensive operation in it is the square root. Thus, I would like to take, for example, $$$N=2^{32}$$$ and do everything the same as in the previous paragraph, but without the square root. Alas, I haven't figured out how to multiply by $$$2^{32}$$$. One could write something like this:

uint64_t prod_double_small(const uint64_t x, const uint64_t y, const uint64_t m)
{
	uint64_t c = (double)x * y / m;
	uint64_t ans = (x * y - c * m) % m;
	return ans;
}

It calculates the product modulo, provided that uint64_t c = (double) x * y / m was calculated absolutely exactly. But it is not possible to guarantee that it will be accurately calculated, since $$$\frac{xy}m $$$ may well be $$$10^{-18}$$$ less than some integer, and the double type is not enough to notice it. This is the problem that the prod_karatsuba_aux function bypasses. If you somehow bypass it more directly, you are welcome in the comments.


Below are three tables for different compilers (may the MikeMirzayanov' name be famous, because thanks to the inappropriate use of his Polygon I completed it), in each table the rows correspond to different functions, the columns correspond to maximum allowed bitness of $$$x$$$, $$$y$$$ and $$$m$$$. If CE is specified, then the program will not compile with this compiler, and if WA, then it may give an incorrect answer. Otherwise, the cell features runtime on Intel® Core™ i3-8100 CPU @ 3.60GHz. The error is approximately equal to one to two nanoseconds, at the slowest functions it can go up to ten nanoseconds.

  1. Microsoft® Visual C++® 2010

    Method 32 bits 57 bits 63 bits 64 bits
    prod_uint64 7 ns WA WA WA
    prod_binary 477 ns 847 ns 889 ns 870 ns
    prod_uint128 CE CE CE CE
    prod_double 66 ns 95 ns WA WA
    prod_long_double 66 ns 98 ns WA WA
    prod_karatsuba 128 ns 125 ns 138 ns 139 ns

  2. GNU G++17

    Method 32 bits 57 bits 63 bits 64 bits
    prod_uint64 4 ns WA WA WA
    prod_binary 455 ns 774 ns 841 ns 845 ns
    prod_uint128 CE CE CE CE
    prod_double 26 ns 36 ns WA WA
    prod_long_double 29 ns 20 ns 19 ns WA
    prod_karatsuba 82 ns 81 ns 91 ns 88 ns

  3. GNU G++17 (64 bit)

    Method 32 bits 57 bits 63 bits 64 bits
    prod_uint64 8 ns WA WA WA
    prod_binary 313 ns 550 ns 604 ns 630 ns
    prod_uint128 17 ns 34 ns 30 ns 30 ns
    prod_double 23 ns 22 ns WA WA
    prod_long_double 23 ns 24 ns 23 ns WA
    prod_karatsuba 65 ns 65 ns 69 ns 66 ns

Therefore, the basic recipe is as follows: if unsigned __int128 is available, then use it, if an 80-bit long double is available, then it should always be enough, and otherwise, if double is enough, use double, else, apply the Karatsuba method.

If you wish, you can try to apply these ideas to the problems of the special contest.

 
 
 
 
  • Vote: I like it
  • +269
  • Vote: I do not like it

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by orz (previous revision, new revision, compare).

»
11 months ago, # |
  Vote: I like it +10 Vote: I do not like it
  • »
    »
    11 months ago, # ^ |
      Vote: I like it +14 Vote: I do not like it

    Yes, KACTL 134725980 is a quarter faster than my implementation 134726237 but has a shorter range: it fails on 63-bit numbers 134725895 while my submission 134726249 doesn't.

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +13 Vote: I do not like it

      I don't have coach mode so I can't see the gym submissions.

      This might be a bug that should be reported to simonlindholm. But looking at the git history, it might also be intentional to only support M < 0.394 * 2^64 = 7.2e18 for the sake of speed: https://github.com/kth-competitive-programming/kactl/pull/174

      I agree that this behavior was a bit surprising since the comment warning about it is written in unrendered latex.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        It's a bit shameful for Codeforces because I granted all the rights that I was able to grant given the contest is public. As far as I understand, you should be able to watch the submissions once you sent an accepted one, so if you copy the snippet from the statement and replace // TODO with return (unsigned __int128)t.x * t.y % t.modulo; it should be OK.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Are you sure you don't have a coach mode? I don't see a checkbox «I trust this user» in your account.

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by orz (previous revision, new revision, compare).

»
11 months ago, # |
  Vote: I like it +27 Vote: I do not like it

Thanks for the great blog! I was curious about how a custom implementation of __int128 compares to your implementation. I'll try to update this comment with updates hopefully soon enough with the results.

Just a minor nitpick: the sample code that's provided has some undefined behaviour issues (for instance, shifting by 64 is undefined, it would be better to use either ~0ULL or uint64_t(-1) for those purposes).

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +8 Vote: I do not like it

    What do you mean by «custom implementation of __int128»? Making a pared down version of it that is able of everything we need and supported by all compilers?

    • »
      »
      »
      11 months ago, # ^ |
      Rev. 3   Vote: I like it +18 Vote: I do not like it

      Apologies for not having clarified earlier.

      I was talking about an implementation along the lines of usual long division, and not the complete implementation of 128 bit arithmetic, so yes.

      A reference to such an algorithm is The Art of Computer Programming, Vol. 2, Algorithm D, where the base of computation can be set to $$$2^{32}$$$ to get what we want.

      An implementation of the mentioned algorithm needs the following kinds of operations:

      • Addition of two one-place integers to get a result and a carry.
      • Multiplication of two one-place integers to get a two-place answer.
      • Division of a two-place integer by a one-place integer, provided that the quotient is a one-place integer, and also yielding a one-place integer.

      Using a "place" as a block of 32 bits would get us what we want to do, on a computer that supports 64-bit arithmetic.

      Unfortunately, the algorithm is quite involved to implement. An (possibly buggy) implementation can be seen here (not my implementation). It uses 16 bit integers rather than 32 bit ones though, but it shouldn't be hard to fix.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Actually I already have this code written for a solution in some old problem, I should definitely make use of it. The only thing missing is the modulo operation although, and writing it faster than $$$\mathcal O(\log(\mathrm{Ans}))$$$ might be laborious.

        • »
          »
          »
          »
          »
          11 months ago, # ^ |
          Rev. 3   Vote: I like it +18 Vote: I do not like it

          Did you mean that you have code for division of 128-bit unsigned integers (possibly stored in a struct) by a 64-bit unsigned integer, analogous to the code in the link?

          Implementing multiplication of two 64-bit integers to get a 128-bit integer and subtraction of two 128-bit integers is fairly trivial (for multiplication, an implementation roughly identical to u256::mult here works, and for subtraction where the result is guaranteed to be non-negative, we only need to maintain a single carry).

          This should lead to at most a marginal overhead for implementing the modulo operation from the division operation.

          • »
            »
            »
            »
            »
            »
            11 months ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            I checked, I have a version with logarithmic division. Maybe it will be rather fast after a bit of optimization.

            • »
              »
              »
              »
              »
              »
              »
              11 months ago, # ^ |
              Rev. 3   Vote: I like it +10 Vote: I do not like it

              I implemented the code as given in their link but for 32 bits (and tried to identify places it can go wrong, but I'm not too sure since I don't understand the algorithm completely).

              It gives a marginal speedup (10-20%) over the Karatsuba implementation.

              UPD: I was able to optimize it to 1716ms (compared to unsigned __int128's 1201ms and Karatsuba's 2496ms). 134767943

              • »
                »
                »
                »
                »
                »
                »
                »
                11 months ago, # ^ |
                  Vote: I like it 0 Vote: I do not like it

                Please try in on a compiler without 128-bit integer.

                • »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  11 months ago, # ^ |
                  Rev. 3   Vote: I like it +3 Vote: I do not like it

                  Karatsuba performs faster in that case (3603ms vs 3432ms), so manually implementing the modulus operation doesn't make sense I guess.

                • »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  11 months ago, # ^ |
                  Rev. 2   Vote: I like it +21 Vote: I do not like it

                  I was able to further optimize the performance by roughly 400ms on 32 bit architectures, by using fast divide instructions like divl and restructuring code for those instructions to work. 134785220 makes me believe that this can be further optimized, if someone tries to actually understand how the algorithm works.

                  UPD: Loop unrolling (and noticing that $$$n$$$ is constant) shaves off about 300ms more, and gives a runtime of about 3057ms 134787237.

                • »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  11 months ago, # ^ |
                    Vote: I like it 0 Vote: I do not like it

                  Very nice job!

  • »
    »
    11 months ago, # ^ |
    Rev. 4   Vote: I like it +16 Vote: I do not like it

    As for the nitpick, I kinda screwed up, and this undefined behavior leaked into every statement and every solution of all four special problems, so I will only be able to fix it after a good sleep. Is it there somewhere in the post?

    • »
      »
      »
      11 months ago, # ^ |
      Rev. 2   Vote: I like it +21 Vote: I do not like it

      I can't find such an issue with the post itself, however it does seem to be the case in the last problem that asks for operations on 64 bit modulos. The sample code uses BITS = 64 and does 1ll << BITS.

      A better way of doing that would be (-uint64_t(1)) >> (64 - BITS), unless BITS = 0, which doesn't seem to be feasible for any positive modulo.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        According to the standard, 1ull << 64 is zero, so I will just replace all 1ll with 1ull and thus handle the undefined behavior.

        • »
          »
          »
          »
          »
          11 months ago, # ^ |
            Vote: I like it +18 Vote: I do not like it

          I just don't get why you went for 1ull << 64 when ~0ull works just fine, is shorter and much clearer: "set all 64 bits of ull to 1".

          • »
            »
            »
            »
            »
            »
            11 months ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            I have four versions of a problem which contain a substring 1ll << BITS. BITS is a constant integer 32, 57, 63, or 64 in different versions of the problem.

            nor noticed that in the case BITS == 64 my code's behavior is undefined. I agreed, but noted that replacing 1ll with 1ull does the job.

            0ull is great only for BITS == 64 whereas my aim is to have the same snippets of code in all four versions of the problem, differing only in the declaration of constant BITS.

            • »
              »
              »
              »
              »
              »
              »
              11 months ago, # ^ |
                Vote: I like it +10 Vote: I do not like it

              Oh right, makes sense. It's still an ugly case where you have to dig in documentation to find out if it's not doing something it shouldn't, and gives annoying warnings in gcc by default.

              • »
                »
                »
                »
                »
                »
                »
                »
                11 months ago, # ^ |
                  Vote: I like it 0 Vote: I do not like it

                Perhaps, but I allow such undesirable elements in pieces of code that are to be learnt by heart and used when necessary. Modular multiplication is one of these cases. If it is thoroughly tested and proven to work, it just fulfills its only purpose. Even if it's formally undefined and Sir Compiler swears, I won't have, for example, to check this function if I have a bug somewhere in the code.

                • »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  11 months ago, # ^ |
                  Rev. 2   Vote: I like it +25 Vote: I do not like it

                  until you upgrade compiler and it teaches you what undefined means

»
11 months ago, # |
  Vote: I like it +37 Vote: I do not like it
»
11 months ago, # |
  Vote: I like it +17 Vote: I do not like it

Why in the sum function of the Binary multiplication part, it's: ~~~~~ if (ans < x || ans >= m) ans -= m; ~~~~~ and not: ~~~~~ if (ans < 0) ans += m; if (ans >= m) ans -= m; ~~~~~ ?

  • »
    »
    11 months ago, # ^ |
    Rev. 2   Vote: I like it +5 Vote: I do not like it

    Because ans is unsigned and instead of ans >= m it can overflow and become some number less than x.

    UPD. It's not about ans being unsigned, it's about its ability to overflow, and it's able to overflow because $$$m$$$ can be greater than $$$2^{63}$$$.

»
11 months ago, # |
  Vote: I like it +10 Vote: I do not like it

What about the non-recursive implementation of binary multiplication and the Karatsuba algorithm?

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Karatsuba is non-recursive, only one iteration is performed.

    I didn't try rewriting the binary multiplication in non-recursive manner, but expect negligible acceleration.

    • »
      »
      »
      11 months ago, # ^ |
      Rev. 2   Vote: I like it +10 Vote: I do not like it

      I wrote it now and the non-recursive version of binary multiplication is about 20-35% faster (dependent on compiler version)

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    And what about sweet rhythm?

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by orz (previous revision, new revision, compare).

»
11 months ago, # |
  Vote: I like it +10 Vote: I do not like it

How about Barrett reduction? I think it might be useful for mass modulo operations

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    As far as I know, even for 32-bit mods, you need 128-bit integer multiplication and shift support, and 64-bit division support. Generalizing to 64-bit mods, it seems like implementing it would be kinda hard too.

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Either hard or impractically slow.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +10 Vote: I do not like it

        I've never done it, but I think implementing the 64-bit version of the Barrett reduction by managing four 64-bit pieces of the intermediate product by hand (see my comment below) isn't that hard (at least compared to the Karatsuba algorithm), and I'm fairly certain that it would still run faster than 20ns on CodeForces (~70 CPU cycles). There is a huge computational advantage in knowing the modulo in advance.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +10 Vote: I do not like it

        We can use Barrett reduction to multiply modulo 16-bit number. For example we can use fft four times (or even less, when answer is small enough) for 16-bit prime, then recover whole answer using Chinese remainder theorem. Since we don't perform any modulo operation (inside fft ofc) it might speed up our fft pretty much. I think you should also consider 16-bit multiplications in your blog (and 32-bit with barrett reduction, why not)

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +10 Vote: I do not like it

      As I said below, on x86_64 there's 128-bit multiplication although the result is still in two 64-bit registers.

»
11 months ago, # |
  Vote: I like it +10 Vote: I do not like it

there are mulq and imulq asm instructions, which store the result of multiplication in 2 64-bit variables. In theory, you can use _umul128 msvc intrinsic (seems to be not working on codeforces, so inline asm is the only way)

»
11 months ago, # |
Rev. 2   Vote: I like it +35 Vote: I do not like it

There are faster methods to do modular multiplication when modulo is constant.

The one currently used in compilers is called Barrett reduction. It uses a similar approach as the floating-point arithmetic one, but it sticks to integer types and uses precomputed magic constants to ensure that the results are always rounded correctly. Compilers do it automatically e. g. when the modulo is a 32-bit compile-time constant such as $$$10^9+7$$$, but in general it needs four times the bits to store its intermediate product, so for larger sizes you'd need to split results in halves and do the arithmetic manually since there is no 196- or 256-bit multiplication in hardware.

Daniel Lemire recently came up with a much simpler algorithm, which is the current state-of-the-art for 32-bit integers. It has been generalized with a similar technique for the 64-bit case, although I'm not sure if it will be faster for the general 128-by-64 reduction.

There is also a way to do it without quadruple-precision at all: Montgomery multiplication. It is not just a pure reduction though: it first transforms input integers into a "space" where it can perform modular multiplication cheaply, and then transforms results back when they are needed. There is some overhead associated with these transforms, and so it is only beneficial when you have multiple modular multiplications you need to perform (e. g. binary exponentiation). This representation is the one used in computational number theory algorithms that deal with big integers — mainly because you'd want to use SIMDized loops for most if not all operations, and it does not have 128-bit multiplication at all.

I've implemented the rho algorithm for factoring a 64-bit number (which involves a lot of 64-by-64 multiplications modulo a 64-bit number) and tried several reduction methods, Montgomery multiplication becoming the fastest. The implementation matches the speed of factorization in Sagemath, so I believe Montgomery multiplication is still the fastest when you need to do a large chain of modular multiplications with a fixed modulo, and Lemire reduction is the fastest when you need to do just one multiplication and have the result immediately.

(By the way, I'm writing a more comprehensive article on the topic, but it isn't ready yet. Stay tuned.)

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +8 Vote: I do not like it

    The algorithm found by Daniel Lemire that you mentioned seems (not sure if this is correct) to have been called Barrett reduction in some competitive programming libraries. For instance, here. Is there a difference between this implementation and the one you mention?

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +16 Vote: I do not like it

    I read the notes by Lemire. He states he is also good in case of fixed modulus. I feel I need to extend the article and the contest with fixed modulo algorithms and special problems.

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +29 Vote: I do not like it

      I'll just add some references here in case you want to extend the article.

      For some algorithms for fixed mods (constexpr or not), this repo could be a super valuable resource IMO. Their implementations are pretty heavily constant-optimized and are pretty educational too.

      Also, while discussing the subject of the post on the AC Discord server, pajenegod mentioned this blog post with the same name. There's also a clever technique they mentioned much earlier while discussing about fast modular operations during matrix multiplication by a given modulus (which gives good speedups when the modulus is small) where you keep accumulating multiples of the modulus if necessary until you reach some threshold, and by looking at one of the highest bits, you can check if you need to subtract a specific large multiple of modulus from it or not. An example for reference.

»
11 months ago, # |
  Vote: I like it +13 Vote: I do not like it

$$$|m_0| = |m-N^2|$$$, you're missing a minus that makes a huge difference

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Sorry, yes, the Russian version of the blog doesn't contain the mistake, so I didn't notice.

»
11 months ago, # |
  Vote: I like it +39 Vote: I do not like it

Btw, I think you're focusing too much on compiler differences instead of CPU differences. On x86_64, we have multiplication of 64-bit inputs into a 128-bit output, stored as 2 registers (%rdx:%rax), and inline assembly means we don't need to give a shit about what types the compiler supports. Implementing the modulo is the second part. For instance, your __int128 example does exactly that: one mulq, move the outputs to correct registers and call a builtin function __umodti3 (or __udivmodti4 if you want to go deeper), which apparently works on 128-bit inputs using __attribute__((mode(TI))).

tl;dr this implementation is pretty low-level but we can go a level or two lower and then implement

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by orz (previous revision, new revision, compare).

»
7 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by orz (previous revision, new revision, compare).

»
7 months ago, # |
Rev. 2   Vote: I like it +6 Vote: I do not like it

If you look into the previous revisions of this plog post you'll find out that the code snippets didn't render (however, I do remember that at some point they rendered well). Now I fixed if at the cost of the correct enumeration: every item in the enumeration is 1.

This happened because the enumeration item must contain in a single paraphaph whereas I had to break the paragraph with a linebreak right before the ~~~~~ of the snippet. Does anyone here know how to make a code snippet without a linebreak? Or how to add a linebreak without breaking the enumeration item?

»
5 weeks ago, # |
Rev. 6   Vote: I like it 0 Vote: I do not like it

Hi, I came across this blogpost and noticed it does not test using compiler intrinsics to achieve this in MSVC. I use a modmul version in my template which looks like this

template <class T>
constexpr int sgn(T x) {
    return (x > 0) - (x < 0);
}

long long mod_mul(long long a, long long b, long long m = MOD) {
#ifdef _MSC_VER
    unsigned long long h, l, r;
    l = _umul128(llabs(a), llabs(b), &h);
    _udiv128(h, l, m, &r);
    return sgn(a) * sgn(b) >= 0 ? r : m - r;
#else
    long long r = a * static_cast<__int128>(b) % m;
    return r >= 0 ? r : r + m;
#endif
}

I didn't try to micro-optimize, but I wonder if something like this would be faster on x64 MSVC than the karatsuba-like implementation.

»
5 weeks ago, # |
  Vote: I like it 0 Vote: I do not like it

I believe there was a 128-bit float type, both normal (quad-float) and decimal, in GCC. Could this work for 64 bit modulos? If that could work, I think it may be faster or almost equal to using the 128-bit integer type.

»
2 weeks ago, # |
  Vote: I like it -26 Vote: I do not like it

Can you plz tell how to define unsigned __int128 without using the template that you have used in this blog ??

  • »
    »
    2 weeks ago, # ^ |
      Vote: I like it -10 Vote: I do not like it

    Sorry for posting this amid the ZCO Mock everyone , I didn't know it wasn't allowed and i thought that i won't get the response from author so soon after the mock , hence posted it right away . Deep Apologies . Won't happen again .

»
2 weeks ago, # |
  Vote: I like it 0 Vote: I do not like it

I dont understand this blog. Why not just do (x mod m)*(y mod m) mod m. This is obviously equal to xy mod m.

But also, given that m < 2^32, (x mod m) and (y mod m) will also be smaller than 2^32. Therefore (x mod m) * (y mod m) will be smaller than 2^64 so you can just use long long in c++?

Can someone pls explain why this is not considered or why I am wrong?

  • »
    »
    2 weeks ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Yes, this blog discusses doing fast modulo arithmetic over a modulo that could be up to 2^64, that could not fit in an int where you either have to use int_128, binary multiplication or anything else.