Note: This technique has been used before in https://codeforces.com/blog/entry/60851 (editorial code of problem F) and https://codeforces.com/problemset/submission/1017/41357847 .

While this solution is faster than using `int64_t` (because Codeforces machines are 32-bit), the time limit should be loose enough for a solution that does not use this trick to pass. However, this trick may be useful if you want to be very fast or implement an unintended/suboptimal brute-force solution. (The same can be said about fast input/output methods: on Codeforces, cin/cout is usually fast enough.)

This is definitely faster than the usual `return (long long)a * b % mod`, but it might be slower than Montgomery multiplication.

Given a positive integer `md` and two non-negative integers `a` and `b` in the range `[0, md-1]`, the following function computes `(int) ((long long) a * b % md)` (the quotient of the division is stored in the variable `d`):

```
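int mul(int a, int b) {
    // md is the modulus, defined elsewhere; requires 0 <= a, b < md.
    unsigned long long x = (long long) a * b;
    unsigned xh = (unsigned) (x >> 32), xl = (unsigned) x, d, m;
    // divl divides the 64-bit value EDX:EAX by its operand:
    // the quotient goes to EAX (d), the remainder to EDX (m).
    asm(
        "divl %4; \n\t"
        : "=a" (d), "=d" (m)
        : "d" (xh), "a" (xl), "r" (md)
    );
    return m;
}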
```
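For reference, here is a self-contained variant of the same idea, as a sketch only. The name `mul_mod`, the explicit `md` parameter, the unsigned types, the `"cc"` clobber and the portable fallback are my assumptions, not part of the code above. The `divl` is safe because `a, b < md` implies the quotient `a * b / md` is less than `md`, so it fits in 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper: the modulus is a parameter instead of a global `md`.
// Preconditions: md > 0, a < md, b < md.
inline uint32_t mul_mod(uint32_t a, uint32_t b, uint32_t md) {
    assert(md > 0 && a < md && b < md);
    uint64_t x = (uint64_t)a * b;  // full 64-bit product, x < md * md
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t xh = (uint32_t)(x >> 32), xl = (uint32_t)x, q, r;
    // divl: EDX:EAX / operand -> quotient in EAX, remainder in EDX.
    // The quotient x / md is < md, so it always fits in 32 bits here.
    asm("divl %4" : "=a"(q), "=d"(r) : "d"(xh), "a"(xl), "r"(md) : "cc");
    (void)q;  // the quotient is not needed
    return r;
#else
    return (uint32_t)(x % md);  // fallback for other compilers/architectures
#endif
}
```

For example, `mul_mod(1000000006u, 1000000006u, 1000000007u)` should give `1`, since `md - 1` is congruent to `-1` modulo `md`.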

x86 assembly has an instruction that divides a 64-bit integer by a 32-bit integer, provided that both the quotient and the remainder fit in 32 bits.

However, it's not a compiler bug that GCC does not optimize code like this

```
uint32_t f(uint64_t a, uint32_t b) { return a % b; }
```

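to use that assembly instruction, because the instruction would not work when the quotient does not fit in 32 bits (that is, when it is 2^32 or larger), while the C++ standard requires the function to return the correct result in that case too. See also 64308 – Missed optimization: 64-bit divide used when 32-bit divide would work.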
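To make the condition concrete, here is a small sketch. The helper name `quotient_fits_32` is hypothetical. The quotient `a / b` fits in 32 bits exactly when `a < b * 2^32`, which is the same as `(a >> 32) < b`.

```cpp
#include <cstdint>

// Hypothetical helper: true if a / b fits in 32 bits, i.e. a < b * 2^32,
// which is equivalent to (a >> 32) < b. Only then could one divl be used.
inline bool quotient_fits_32(uint64_t a, uint32_t b) {
    return b != 0 && (a >> 32) < b;
}

// The function from the text: it must be correct for every a and nonzero b.
inline uint32_t f(uint64_t a, uint32_t b) { return a % b; }
```

For instance, `f(1ULL << 40, 3)` is well defined and equals `1`, but the quotient is 366503875925, which does not fit in 32 bits, so a single `divl` could not compute it. For a product of two values below `md` divided by `md`, the check always passes, which is why the trick is safe for modular multiplication.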

Unfortunately, I think there isn't any GCC intrinsic function that provides this functionality, so it's necessary to use `asm`.

Benchmark code (you can copy-paste it into https://codeforces.com/problemset/customtest ):

**Code**

Because Codeforces caches the result, you may need to change the input or some whitespace in the code to re-run it. When I ran it, the result was about 0.45 vs 0.65, which is roughly a 30% performance gain.
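If you want to time the two versions yourself, a minimal harness could look like the sketch below. This is not the benchmark used for the numbers above. The names `mul_plain`, `mul_divl` and `time_chain` are hypothetical, and the loop (a chain of dependent multiplications) is just one possible workload.

```cpp
#include <chrono>
#include <cstdint>

// Plain version: 64-bit product reduced with the % operator.
inline uint32_t mul_plain(uint32_t a, uint32_t b, uint32_t md) {
    return (uint32_t)((uint64_t)a * b % md);
}

// divl version (x86 with GCC/Clang only; falls back to % elsewhere).
// Preconditions: md > 0, a < md, b < md.
inline uint32_t mul_divl(uint32_t a, uint32_t b, uint32_t md) {
    uint64_t x = (uint64_t)a * b;
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t q, r;
    asm("divl %4" : "=a"(q), "=d"(r)
        : "d"((uint32_t)(x >> 32)), "a"((uint32_t)x), "r"(md) : "cc");
    (void)q;
    return r;
#else
    return (uint32_t)(x % md);
#endif
}

// Hypothetical harness: computes g^iters mod md (g = seed % md) with `iters`
// chained multiplications. If `seconds` is non-null, the elapsed wall-clock
// time is stored there. The result is returned so the loop is not removed.
template <class Mul>
uint32_t time_chain(Mul mul, uint32_t seed, uint32_t md, int iters,
                    double* seconds = nullptr) {
    auto start = std::chrono::steady_clock::now();
    uint32_t g = seed % md, v = 1 % md;
    for (int i = 0; i < iters; ++i) v = mul(v, g, md);
    auto stop = std::chrono::steady_clock::now();
    if (seconds) *seconds = std::chrono::duration<double>(stop - start).count();
    return v;
}
```

When measuring, read `md` from the input rather than hard-coding it. With a compile-time constant modulus the compiler can replace `%` with a multiplication, and the comparison no longer measures the division.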