pajenegod's blog

By pajenegod, history, 12 days ago, In English

I recently had a very interesting idea for how to greatly speed up convolution (a.k.a. polynomial multiplication).

def convolution(A, B):
  C = [0] * (len(A) + len(B) - 1)
  for i in range(len(A)):
    for j in range(len(B)):
      C[i + j] += A[i] * B[j]
  return C

The standard technique for how to do convolution fast is to make use of cyclic convolution (polynomial mult mod $$$x^n - 1$$$).

def cyclic_convolution(A, B):
  n = len(A) # A and B needs to have the same size
  C = [0] * n
  for i in range(n):
    for j in range(n):
      C[(i + j) % n] += A[i] * B[j]
  return C

Cyclic convolution can be calculated in $$$O(n \log n)$$$ using FFT, which is really fast. The issue here is that in order to do convolution using cyclic convolution, we need to pad with a lot of 0s to not be affected by the wrap around. All this 0-padding feels very inefficient.

So here is my idea. What if we do polynomial multiplication mod $$$x^n - i$$$ instead of mod $$$x^n - 1$$$? Then when we get wrap around, it will be multiplied by the imaginary unit, so it wont interfere with the real part! I call this the imaginary cyclic convolution.

def imaginary_cyclic_convolution(A, B):
  n = len(A) # A and B needs to have the same size
  C = [0] * n
  for i in range(n):
    for j in range(n):
      C[(i + j) % n] += (1 if i + j < n else 1j) * A[i] * B[j]
  return C

Imaginary cyclic convolution is the perfect algorithm to use for implementing convolution. Using it, we no longer need to do copious amount of 0 padding, since the imaginary unit will take care of the wrap around. In fact, the size (the value of $$$n$$$) required is exactly half of what we would need if we had used cyclic convolution.

One question still remains, how do we implement imaginary cyclic convolution efficiently?

The trick is rather simple. Let $$$\omega = i^{\frac{1}{n}}$$$. Now note that if $$$C(\omega x) = A(\omega x) B(\omega x) \mod x^n - 1$$$ then $$$C(x) = A(x) B(x) \mod x^n - i$$$. So here is the algorithm

def imaginary_cyclic_convolution(A, B):
  n = len(A) # A and B needs to have the same size
  w = (1j)**(1/n) # n-th root of imaginary unit
  
  # Transform the polynomials A(x) -> A(wx) and B(x) -> B(wx)
  A = [A[i] * w**i for i in range(n)]
  B = [B[i] * w**i for i in range(n)]

  C = cyclic_convolution(A, B)
  
  # Transform the polynomial C(wx) -> C(x)
  C = [C[i] / w**i for i in range(n)]
  return C

Read more »

 
 
 
 
  • Vote: I like it
  • +137
  • Vote: I do not like it

By pajenegod, history, 3 weeks ago, In English
 
 
 
 
  • Vote: I like it
  • +162
  • Vote: I do not like it

By pajenegod, history, 4 months ago, In English

Hi CF! During this past weekend I was reading up on Montgomery transformation, which is a really interesting and useful technique to do fast modular multiplication. However, all of the explanations I could find online felt very unintuitive for me, so I decided to write my own blog on the subject. A big thanks to kostia244, nor, nskybytskyi and -is-this-fft- for reading this blog and giving me some feedback =).

Fast modular multiplication

Let $$$P=10^9+7$$$ and let $$$a$$$ and $$$b$$$ be two numbers in $$$[0,P)$$$. Our goal is to calculate $$$a \cdot b \, \% \, P$$$ without ever actually calling $$$\% \, P$$$. This is because calling $$$\% \, P$$$ is very costly.

If you haven't noticed that calling $$$\% \, P$$$ is really slow, then the reason you haven't noticed it is likely because the compiler automatically optimizes away the $$$\% \, P$$$ call if $$$P$$$ is known at compile time. But if $$$P$$$ is not known at compile time, then the compiler will have to call $$$\% \, P$$$, which is really really slow.

Montgomery reduction of $$$a \cdot b$$$

It turns out that the trick to calculate $$$a \cdot b \, \% \, P$$$ efficently is to calculate $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ efficiently. So the goal for this section will be to figure out how to calculate $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ efficently. $$$a \cdot b \cdot 2^{-32} \, \% \, P$$$ is called the Montgomery reduction of $$$a \cdot b$$$, denoted by $$$\text{m_reduce}(a \cdot b)$$$.

Idea (easy case)

Suppose that $$$a \cdot b$$$ just happens to be divisible by $$$2^{32}$$$. Then $$$(a \cdot b \cdot 2^{-32}) \, \% \, P = (a \cdot b) \gg 32$$$, which runs super fast!

Idea (general case)

Can we do something similar if $$$a \cdot b$$$ is not divisible by $$$2^{32}$$$? The answer is yes! The trick is to find some integer $$$m$$$ such that $$$(a \cdot b + m \cdot P)$$$ is divisible by $$$2^{32}$$$. Then $$$a \cdot b \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \cdot 2^{-32} \, \% \, P = (a \cdot b + m \cdot P) \gg 32$$$.

So how do we find such an integer $$$m$$$? We want $$$ (a \cdot b + m \cdot P) \,\%\, 2^{32} = 0$$$ so $$$m = (-a \cdot b \cdot P^{-1}) \,\%\, 2^{32}$$$. So if we precalculate $$$(-P^{-1}) \,\%\, 2^{32}$$$ then calculating $$$m$$$ can be done blazingly fast.

Montgomery transformation

Since the Montgomery reduction divides $$$a \cdot b$$$ by $$$2^{32}$$$, we would like some some way of multiplying by $$$2^{32}$$$ modulo $$$P$$$. The operation $$$x \cdot 2^{32} \, \% \, P$$$ is called the Montgomery transform of $$$x$$$, denoted by $$$\text{m_transform}(x)$$$.

The trick to implement $$$\text{m_transform}$$$ efficently is to make use of the Montgomery reduction. Note that $$$\text{m_transform}(x) = \text{m_reduce}(x \cdot (2^{64} \, \% \, P))$$$, so if we precalculate $$$2^{64} \, \% \, P$$$, then $$$\text{m_transform}$$$ also runs blazingly fast.

Montgomery multiplication

Using $$$\text{m_reduce}$$$ and $$$\text{m_transform}$$$ there are multiple different ways of calculating $$$a \cdot b \, \% \, P$$$ effectively. One way is to run $$$\text{m_transform}(\text{m_reduce}(a \cdot b))$$$. This results in two calls to $$$\text{m_reduce}$$$ per multiplication.

Another common way to do it is to always keep all integers transformed in the so called Montgomery space. If $$$a' = \text{m_transform}(a)$$$ and $$$b' = \text{m_transform}(b)$$$ then $$$\text{m_transform}(a \cdot b \, \% \, P) = \text{m_reduce}(a' \cdot b')$$$. This effectively results in one call to $$$\text{m_reduce}$$$ per multiplication, however you now have to pay to move integers in to and out of the Montgomery space.

Example implementation

Here is a Python 3.8 implementation of Montgomery multiplication. This implementation is just meant to serve as a basic example. Implement it in C++ if you want it to run fast.

P = 10**9 + 7
r = 2**32
r2 = r * r % P
Pinv = pow(-P, -1, r) # (-P^-1) % r

def m_reduce(ab):
  m = ab * Pinv % r
  return (ab + m * P) // r

def m_transform(a):
  return m_reduce(a * r2)

# Example of how to use it
a = 123456789
b = 35
a_prim = m_transform(a) # mult a by 2^32
b_prim = m_transform(b) # mult b by 2^32
prod_prim = m_reduce(a_prim * b_prim) # divide a' * b' by 2^32
prod = m_reduce(prod_prim) # divide prod' by 2^32
print('%d * %d %% %d = %d' % (a, b, P, prod)) # prints 123456789 * 35 % 1000000007 = 320987587

Final remarks

One important issue that I've so far swept under the rug is that the output of m_reduce is actually in $$$[0, 2 P)$$$ and not $$$[0, P)$$$. I just want end by discussing this issue. I can see two ways of handling this:

  • Alternative 1. You can force $$$\text{m_reduce}(a \cdot b)$$$ to be in $$$[0, P)$$$ for $$$a$$$ and $$$b$$$ in $$$[0, P)$$$ by adding an if-stament to the output of m_reduce. This will work for any odd integer $$$P < 2^{31}$$$.
Fixed implementation of m_reduce
  • Alternative 2. Assuming $$$P$$$ is an odd integer $$$< 2^{30}$$$ then if $$$a$$$ and $$$b$$$ $$$\in [0, 2 P)$$$ you can show that the output of $$$\text{m_reduce}(a \cdot b)$$$ is also in $$$[0,2 P)$$$. So if you are fine working with $$$[0, 2 P) \vphantom]$$$ everywhere then you don't need any if-statements. Nyaan's github has a nice C++ implementation of Montgomery multiplication using this style of implementation.

Read more »

 
 
 
 
  • Vote: I like it
  • +92
  • Vote: I do not like it

By pajenegod, history, 17 months ago, In English

I've always liked using Python (PyPy) for solving problems in competitive programming. And most problems are very doable, even in Python. What I've found is that the most difficult problems to solve in Python are those requiring 64 bit integers.

The reason why 64 bit integers are problematic is because CF runs Windows, and PyPy only supports 32 bit on Windows. So whenever a problem involves integers that cannot fit inside of a signed 32 bit int, PyPy switches to big integers (which runs insanely slow, sometimes a factor of 20 times slower).

What I currently have to do to get around big integers

However with the latest PyPy version (version 7.3.4) PyPy has finally switched to 64 bit on Windows! So upgrading PyPy would mean no more problems with big integers. This would make PyPy far more usable and more beginner friendly. So if possible please update PyPy's version on CF to 7.3.4! MikeMirzayanov

Edit: Reading Results of 2020 [list some changes and improvements] blog I realized that I should probably be tagging geranazavr555, kuviman and cannor147 too.

Read more »

 
 
 
 
  • Vote: I like it
  • +642
  • Vote: I do not like it

By pajenegod, history, 21 month(s) ago, In English

Let me tell you the story of how I made $2200 from doing competitive programming.

Spoiler

Once many many fortnights ago Hackerrank held one of its regular competitions, "World CodeSprint 9". This was back when Hackerrank actually sent out its prizes. The competition was very unusual in that one of its hardest problems was a scored based approximation problem. This competition was also the first time that I would get placed in the top 100s! Using my beloved Python :)

As I recall the prize for getting placed 4 to 100 was a t-shirt and $75. More precisely these $75 were sent either in Bitcoins or Amazon giftcards depending on where the prize winners lived, and in my case I got Bitcoins. I received the $75 in Bitcoins on 21st of March 2017.

Prices 2017

When I got them, I didn't really know how to do anything with them, so I kind of just forgot about them. Turns out that the value of Bitcoin has increased a bit since then:

Prices 2017-2021

30 times more to be precise! So today I just sold them for a bit over $2200 (Sold when the price hit 26640€/btc). Not too shabby for a 34th place finish in a regular competition! =D

Read more »

 
 
 
 
  • Vote: I like it
  • +488
  • Vote: I do not like it

By pajenegod, history, 2 years ago, In English

Introduction

I'm writing this blog because of the large number of blogs asking about why they get strange floating arithmetic behaviour in C++. For example:

"WA using GNU C++17 (64) and AC using GNU C++17" https://codeforces.com/blog/entry/78094

"The curious case of the pow function" https://codeforces.com/blog/entry/21844

"Why does this happen?" https://codeforces.com/blog/entry/51884

"Why can this code work strangely?" https://codeforces.com/blog/entry/18005

and many many more.

Example

Here is a simple example of the kind of weird behaviour I'm talking about

Example showing the issue
Output for 32 bit g++
Output for 64 bit g++

Looking at this example, the output that one would expect from $$$10 * 10 - 10^{-15}$$$ is exactly $$$100$$$ since $$$100$$$ is the closest representable value of a double. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden excess precision causing the output to only sometimes(???) be $$$100$$$.

Explanation

In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see (https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support). You can detect which one is being used by the value of FLT_EVAL_METHOD found in cfloat. In mode 2 (which is what 32 bit g++ uses by default) all floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, this can / will cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.

Detecting and turning on/off excess precision

Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)

Test for detecting excess precision

If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC(C++17), Clang and g++17(64bit) all use mode 0 and round b to 0, while g++11, g++14 and g++17 as expected all use mode 2 and b = 8e-17.

The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to on top of this use the SSE instruction set (version 2 or later), which supports both float and double arithmetic. On GCC you can turn this on with the flags -mfpmath=sse -msse2. This will not change the value of FLT_EVAL_METHOD, but it will effectively turn off excess precision, see 81993714.

It is also possible to effectively turn on excess precision with -mfpmath=387, see 81993724.

Fun exercise

Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this

Try to hack this

Conclusion / TLDR

32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue.

Read more »

 
 
 
 
  • Vote: I like it
  • +187
  • Vote: I do not like it