I would like to preface this blog by saying this is not a tutorial on FWHT but some (maybe obvious) observations on FWHT that I wanted to share.

Thanks to rama_pang and oolimry for proof reading. Love you guys <3

Of course I (and the proof-readers) are not perfect so if you find anything unclear just tell me in the comments.

We will use the symbols $$$\&,|,\oplus$$$ to represent bitwise and, bitwise or, bitwise xor respectively.

## FWHT

Given $$$2$$$ arrays $$$A$$$ and $$$B$$$. We will want to find array $$$C$$$ such that $$$C_i=\sum\limits_{j \star k=i} A_j \cdot B_k$$$. For our normal FFT, the operation $$$\star$$$ is just addition.

When finding fast for FFT, I found library code that does FWHT $$$^{[1]}$$$, that is it does the above operation where $$$\star$$$ can also be any of the $$$3$$$ bitwise operations.

We focus on the case for $$$\oplus$$$. We focus even more specifically on the very simple case of $$$A$$$ and $$$B$$$ being size $$$2$$$. We want to convolute $$$\{p_0,p_1\}$$$ and $$$\{q_0,q_1\}$$$ quickly. The way we do this is find some magic matrix $$$T$$$ such that we can just perform the dot product to "convolute" these transformed vectors then apply $$$T^{-1}$$$ to get the convoluted vector. We want to convolute these arrays into $$$\{p_0q_0 + p_1q_1 , p_0q_1 + p_1q_0\}$$$. So we want to find some invertible matrix $$$T$$$ that satisfies $$$T \Big( \matrix{p_0 \\ p_1} \Big) \cdot T \Big( \matrix{q_0 \\ q_1} \Big)=T \Big( \matrix{p_0q_0+p_1q_1 \\ p_0q_1+p_1q_0} \Big)$$$.

Let $$$T=\Big( \matrix{a & b \\ c& d} \Big)$$$. Expanding these, we obtain $$$\Big( \matrix{ a^2p_0q_0+abp_0q_1+abp_1q_0+b^2p_1q_1 \\ c^2p_0q_0+cdp_0q_1+cdp_1q_0+d^2p_1q_1} \Big)=\Big( \matrix{ ap_0q_0+bp_0q_1+bp_1q_0+ap_1q_1 \\ cp_0q_0+dp_0q_1+dp_1q_0+cp_1q_1} \Big)$$$.

We obtain $$$a^2=a,ab=b,b^2=a$$$ and similar set of equations for $$$c,d$$$. We find that $$$(a,b)={(0,0),(1,1),(1,-1)}$$$. But the only way to make a invertible matrix is $$$T=\Big( \matrix{1 & 1 \\ 1& -1} \Big)$$$. We also find that $$$T^{-1}=\frac 12 \Big( \matrix{1 & 1 \\ 1& -1} \Big)$$$.

We can do a similar thing for $$$\&$$$ and $$$|$$$ and find that $$$T=\Big( \matrix{0 & 1 \\ 1 & 1} \Big)$$$ and $$$T=\Big( \matrix{1 & 0 \\ 1 & 1} \Big)$$$ works.

Then we just apply the matrix $$$T \otimes T \otimes \ldots \otimes T$$$ to our vector to get it's point-coordinate form $$$^{[4]}$$$, where $$$\otimes$$$ is the Kronecker product $$$^{[5],[6]}$$$. Since we are applying the same operation $$$T$$$ to each "layer" of the vector, it makes sense why the Kronecker product is used.

I initially just whacked the math and didn't really pay much attention to it. So here is my attempt at going deeper I suppose.

## Kronecker Product

It turns out that $$$(A \otimes B)(C \otimes D)=AC \otimes BD$$$ $$$^{[5]}$$$. This gives us a fast way to compute $$$T \otimes T \otimes \ldots \otimes T$$$.

We assume that $$$A,B,C$$$ are $$$n \times n$$$ square matrices and $$$I_n$$$ is the $$$n \times n$$$ identity matrix.

Claim: $$$A \otimes B \otimes C = (A \otimes I_n \otimes I_n)(I_n \otimes B \otimes I_n)(I_n \otimes I_n \otimes C)=(I_n \otimes I_n \otimes C)(I_n \otimes B \otimes I_n)(A \otimes I_n \otimes I_n)$$$.

Proof: Trivial

Corollary: we can switch the order we multiply the matrices $$$(A \otimes I_n \otimes I_n),(I_n \otimes B \otimes I_n),(I_n \otimes I_n \otimes C)$$$, (i.e. they are all pairwise commutative).

Corollary: $$$((A \otimes I_n \otimes I_n)(I_n \otimes B \otimes I_n)(I_n \otimes I_n \otimes C))^{-1}=(A \otimes I_n \otimes I_n)^{-1}(I_n \otimes B \otimes I_n)^{-1}(I_n \otimes I_n \otimes C)^{-1}$$$.

Notice that these things works for arbitrary finite length $$$T \otimes T \otimes \ldots \otimes T$$$ but I will not write it down or it will become notation hell. Also, this works if $$$A,B,C$$$ are not of the same size (but are still square matrices).

This gives us a fast way to multiply a matrix by $$$\underbrace{T \otimes T \otimes \ldots \otimes T}_{k \text{ times}}$$$ as each element of $$$(I_n \otimes \ldots \otimes I_n \otimes T \otimes I_n \otimes \ldots \otimes I_n)$$$ has at most $$$n^{k+1}$$$ non-zero element. So multiplying a vector by that matrix only takes $$$O(n^{k+1})$$$ time and we are multiplying by $$$O(k)$$$ matrices. This is exactly the speedup given by FWHT.

## Generalization of Xor

What I did not realize was that $$$\oplus$$$ was literally addition under modulo $$$2$$$. That is, we were literally doing a "small" FFT in each layer of our FWHT.

Recall that in FFT, we when we convert a polynomial to the point-coordinate form, we are multiplying the coefficients of the polynomial by the matrix $$$T=\begin{pmatrix} \omega_n^0 & \omega_n^0 & \omega_n^0 & \ldots & \omega_n^{0} \\ \omega_n^0 & \omega_n^1 & \omega_n^2 & \ldots & \omega_n^{n-1} \\ \vdots & \vdots & \vdots & \ddots & \\ \omega_n^0 & \omega_n^{n-1} & \omega_n^{2n-2} & & \omega_n^{n^2-2n+1} \end{pmatrix}$$$, where $$$w_a^b=e^{i\frac{b\tau}{a}}$$$.

Well, when we are doing convolution with $$$\oplus$$$, our matrix is $$$T=\Big( \matrix{1 & 1 \\ 1& -1} \Big)=\Big( \matrix{\omega_2^0 & \omega_2^0 \\ \omega_2^0 & \omega_2^1} \Big)$$$. That means we can do our "FWHT" with other things such as addition modulo $$$3$$$ with the matrix $$$T=\begin{pmatrix} \omega_3^0 & \omega_3^0 & \omega_3^0 \\ \omega_3^0 & \omega_3^1 & \omega_3^2 \\ \omega_3^0 & \omega_3^2 & \omega_3^4\end{pmatrix}$$$. This idea is used in GP of Tokyo 2021 G with addition modulo $$$7$$$ used. Note that we naively apply the matrix in $$$O(b^2)$$$ if we are dealing with $$$b$$$-bit numbers.

## Generalization of And/Or

A way to think about $$$\&$$$ and $$$|$$$ when we are dealing with $$$0 \leq a,b < 2$$$ (i.e. single-bit numbers) is $$$a \& b=\min(a,b)$$$ and $$$a | b = \max (a,b)$$$. Since these are symmetric we can focus on the case of $$$|$$$.

Let's try to generalize this to each bit having digits from $$$0$$$ to $$$2$$$.

Our matrix will look like $$$T=\begin{pmatrix} a_0 & b_0 & c_0 \\ a_1 & b_1 & c_1 \\ a_2 & b_2 & c_2 \end{pmatrix}$$$. If we apply the math we have used above, we find that we have to satisfy the equations $$$a_i^2=a_i, a_ib_i=b_i^2=b_i,a_ic_i=b_ic_i=c_i^2=c_i$$$.

So we have $$$(a_i,b_i,c_i)=\{(1,1,1),(1,1,0),(1,0,0),(0,0,0)\}$$$. Since we want $$$T$$$ to be invertible we can just have $$$T=\begin{pmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{pmatrix}$$$.

Notice an observation about this $$$T$$$, it is just a "staircase" matrix, where the bottom right triangle is filled with $$$1$$$ and the rest of the elements are $$$0$$$.

Now what happens if we do $$$T \cdot \begin{pmatrix}a \\ b \\ c\end{pmatrix} = \begin{pmatrix}a \\ a+b \\ a+b+c \end{pmatrix}$$$. That is, $$$T$$$ is literally a "do prefix sum matrix".

So what happens if we consider numbers with $$$n$$$ bits, what would multiplying the matrix by $$$\underbrace{T \otimes T \otimes \ldots \otimes T}_{n \text{ times}}$$$ do? (Yes, we are considering the case for operator $$$|$$$).

Let us try to expand out $$$T \otimes T \otimes T$$$.

$$$T \otimes T \otimes T= \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \otimes \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} \otimes \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}= \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 1& 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1& 1 & 1 \end{pmatrix} \otimes \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}=\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$$$.

This may look like garbage but we will try to compute $$$(T \otimes T \otimes T) \cdot \begin{pmatrix} 000 \\ 001 \\ 010 \\ 011 \\ 100 \\ 101 \\ 110 \\ 111 \end{pmatrix}=\begin{pmatrix} 000 \\ 000+001 \\ 000+010 \\ 000+001+010+011 \\ 000+100 \\ 000+001+100+101 \\ 000+010+100+110 \\ 000+001+010+011+100+101+110+111 \end{pmatrix}$$$.

Notice that the effect of $$$T \otimes T \otimes T$$$ is exactly SOS. That means, FWHT for $$$\&$$$ and $$$|$$$ is just applying the zeta transform and the mobius transform in $$$O(n \cdot 2^n)$$$ using SOS dp $$$^{[2],[3]}$$$, which is the same complexity as the $$$O(N \cdot \log N)$$$ we have using FWHT taking $$$N=2^n$$$.

So we can generalize FWHT for the $$$b$$$-bit version of $$$ \&$$$ and $$$|$$$ using (generalized) SOS dp in $$$O(n \cdot b^n)$$$ where the array sizes of the convolution are at most size $$$b^n$$$.

The reason for why FWHT for $$$ \&$$$ and $$$|$$$ can be done using subset sum is actually quite intuitive in hindset, so I am quite confused how I have never made the connection before.

We can think of this as some PIE, by focusing on the case of $$$|$$$ on binary numbers (for simplicity). Define $$$\bar A_{mask}=\sum\limits_{sub \subseteq mask} A_{sub}$$$ and $$$\bar B_{mask}=\sum\limits_{sub \subseteq mask} B_{sub}$$$.

Now, by PIE, we know that $$$C_{mask}=\sum\limits_{sub \subseteq mask} (-1)^{|mask \setminus sub|} \bar A_{sub} \bar B_{sub}$$$. It is clear that $$$C=\mu(\bar A \cdot \bar B)=\mu(\zeta(A)\cdot \zeta(B))$$$.

## Subset Sum Convolution

Now, we can actually use ideas from FWHT to do subset sum convolution, where we want to find $$$C(mask)=\sum\limits_{sub \subseteq mask} A_{sub} \cdot B_{mask \setminus sub}$$$ for all $$$mask$$$.

Define:

- $$$\hat A_{i,mask}\begin{cases}A_{mask} &, \text{if } |mask|=i\\ 0 &, \text{otherwise}\end{cases}$$$
- $$$\hat B_{i,mask}\begin{cases}B_{mask} &, \text{if } |mask|=i\\ 0 &, \text{otherwise}\end{cases}$$$

It can be shown that $$$C(mask)=\mu \Bigg(\sum\limits_{i=0}^{|mask|} \zeta(\hat A_{i}) \cdot \zeta(\hat B_{|mask|-i}) \Bigg)$$$ and it can be calculated in $$$O(n^2 \cdot 2^n)$$$ $$$^{[3]}$$$.

Well, we can reason about it very easily now. Since for convolution with $$$|$$$, FWHT is same as $$$\zeta$$$ and the inverse FWHT is same as $$$\mu$$$. So it is obvious why the above equation holds, the only way that some non-zero element in $$$\hat{A}_{i}$$$ and $$$\hat B_{|mask|-i}$$$ can convolute to an element with $$$|mask|$$$ bits is for those element 's bits to be mutually exclusive.

Also note that we can apply the inverse FWHT on the outside of the summation as $$$\mu(A)+\mu(B)=\mu(A+B)$$$ since $$$\mu$$$ can be thought of as some arbitrary matrix.

Actually, we don't have to restrict ourselves to convolution with $$$|$$$ for this task. Convolution with $$$\oplus$$$ and $$$\&$$$ works fine too.

## More Trash

We can do more weird stuff like performing the operation $$$\oplus$$$ on the $$$0$$$-th bit, performing $$$\&$$$ on the $$$1$$$-th bit and performing $$$|$$$ on the $$$2$$$-th bit.

Recalling that the matrices for these are $$$T_\oplus=\Big( \matrix{1 & 1 \\ 1 & -1} \Big)$$$, $$$T_\&=\Big( \matrix{0 & 1 \\ 1 & 1} \Big)$$$ and $$$T_|=\Big( \matrix{1 & 0 \\ 1 & 1} \Big)$$$ respectively. We just multiply our vector by $$$T_| \otimes T_\& \otimes T_\oplus$$$ for the forward transform and do the inverse for the backward transform.

We can also "mix and match" transformations of various sizes as long as we modify the result given in the Kronecker Product section.

As an example, suppose we have a number system of $$$(a,b)$$$ where $$$0 \leq a < 2$$$ and $$$0 \leq b < 3$$$ and we define $$$(a,b)+(c,d)=(a+c \mod 2, b+d \mod 3)$$$.

We can do convolution on this by noting that we have $$$T_2=\Big( \matrix{\omega_2^0 & \omega_2^0 \\ \omega_2^0 & \omega_2^1} \Big)$$$ and $$$T_3=\begin{pmatrix} \omega_3^0 & \omega_3^0 & \omega_3^0 \\ \omega_3^0 & \omega_3^1 & \omega_3^2 \\ \omega_3^0 & \omega_3^2 & \omega_3^4\end{pmatrix}$$$. So the transformation is given by $$$T_2 \otimes T_3=(T_2 \otimes I_3) (I_2 \otimes T_3)$$$.

So as long as you can define the transformation matrix, you can combine them and do weird stuff.

## References

- [1] https://github.com/kth-competitive-programming/kactl/blob/main/content/numerical/FastSubsetTransform.h
- [2] https://codeforces.com/blog/entry/45223
- [3] https://codeforces.com/blog/entry/72488
- [4] https://codeforces.com/blog/entry/71899
- [5] https://www.math.uwaterloo.ca/~hwolkowi/henry/reports/kronthesisschaecke04.pdf
- [6] https://en.wikipedia.org/wiki/Kronecker_product

Thanks for the amazing blog!

I remember when first trying to solve this problem, I just took the Kronecker product of the three matrices. This gives an 8x8 matrix on which I did FWHT. This runs significantly slower than what I should have done, which was consider the remainder (mod 3) of the current bit I was on. I'm not sure how to explain this, but here's some code that does OR on the first bit, AND on the second bit, XOR on the third bit, OR again on the fourth bit, and so on. I'm only using the original $$$T_{|}$$$, $$$T_{&}$$$, and $$$T_{⊕}$$$ matrices.

I hope this is useful to someone. Thanks again!

This topic had been discussed in the report on China Team Selection 2018 of CommonAnts: 浅谈DFT在信息学竞赛中的应用 (in Chinese).

He focused on the case that the index and its operation is an abelian semigroup $$$G$$$. Actually, the proper transform exists if and only if $$$\forall \, g\in G$$$, $$$\exists \, n > 1$$$ such that $$$g^n = g$$$. You can check the link above for details.

Tks for tutorial! I was recently unable to solve this problem, because I forgot about FWHT. After reading the article, this task turns into a simple exercise.

Seriously insane hard work, bro ! Hats off for this, I have learned something new today !