Tutorial: https://en.algorithmica.org/hpc/algorithms/matmul/

A version you can submit on CodeForces (it is optimized for a specific CPU that is very different from that of CF, so the speedup is smaller): https://github.com/sslotin/amh-code/blob/main/matmul/self-contained.cc

These blocking/vectorization techniques can also be applied to some dynamic programming algorithms, such as the Floyd-Warshall ("for-for-for") algorithm. Do you know any similar DP problems where the number of operations is larger than the number of states?
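The reason Floyd-Warshall fits these techniques: for a fixed $$$k$$$ and $$$i$$$, the updates over $$$j$$$ are independent, so the inner loop can be vectorized. A minimal sketch (the flat-array layout and function name are my own):

```cpp
#include <algorithm>
#include <vector>

// Floyd-Warshall with a vectorization-friendly inner loop: for fixed k and i,
// the j-updates have no loop-carried dependence, so -O3 can turn them into SIMD
void floyd_warshall(int n, std::vector<int>& d) {  // d is a flat n*n matrix
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            int dik = d[i * n + k];  // loop-invariant, hoisted out of the j-loop
            for (int j = 0; j < n; j++)
                d[i * n + j] = std::min(d[i * n + j], dik + d[k * n + j]);
        }
}
```

It is $$$O(n^3)$$$ work over $$$O(n^2)$$$ states, which is exactly the ratio that makes blocking worthwhile.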

What about the Strassen algorithm? Could you place it on this graph too? I'm pretty sure it will not be even close to those on the right, but I think it should be faster than the naive one.

I've mentioned it in the article. It is extremely tedious to implement, and I don't think any BLAS library has done it so far, but there is a paper showing that an efficient (vectorized) implementation can be 10-20% faster for very large matrices.

The speedup mainly comes from performing fewer raw arithmetic operations once the problem size is large enough, so I really doubt that a scalar Strassen implementation will beat the baseline for anything under $$$n=2000$$$.
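For scale: Strassen replaces the 8 half-size multiplications of the divide-and-conquer scheme with 7, giving $$$O(n^{\log_2 7}) \approx O(n^{2.81})$$$ operations, at the cost of many extra additions and worse locality. A bare scalar sketch of the recursion (my own code, power-of-two $$$n$$$ only) shows where that overhead comes from:

```cpp
#include <vector>

using Mat = std::vector<std::vector<double>>;

// element-wise A + sign * B
Mat madd(const Mat& a, const Mat& b, double sign = 1) {
    int n = a.size();
    Mat c(n, std::vector<double>(n));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i][j] = a[i][j] + sign * b[i][j];
    return c;
}

// textbook Strassen for n a power of two; real implementations switch to a
// blocked kernel below some cutoff instead of recursing all the way to n == 1
Mat strassen(const Mat& A, const Mat& B) {
    int n = A.size();
    if (n == 1) return {{A[0][0] * B[0][0]}};
    int m = n / 2;
    auto quad = [&](const Mat& M, int r, int c) {  // extract an m x m quadrant
        Mat Q(m, std::vector<double>(m));
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                Q[i][j] = M[r * m + i][c * m + j];
        return Q;
    };
    Mat A11 = quad(A,0,0), A12 = quad(A,0,1), A21 = quad(A,1,0), A22 = quad(A,1,1);
    Mat B11 = quad(B,0,0), B12 = quad(B,0,1), B21 = quad(B,1,0), B22 = quad(B,1,1);
    // 7 recursive multiplications instead of 8 -- the entire source of the speedup
    Mat M1 = strassen(madd(A11, A22), madd(B11, B22));
    Mat M2 = strassen(madd(A21, A22), B11);
    Mat M3 = strassen(A11, madd(B12, B22, -1));
    Mat M4 = strassen(A22, madd(B21, B11, -1));
    Mat M5 = strassen(madd(A11, A12), B22);
    Mat M6 = strassen(madd(A21, A11, -1), madd(B11, B12));
    Mat M7 = strassen(madd(A12, A22, -1), madd(B21, B22));
    Mat C(n, std::vector<double>(n));
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) {
            C[i][j]         = M1[i][j] + M4[i][j] - M5[i][j] + M7[i][j];
            C[i][j + m]     = M3[i][j] + M5[i][j];
            C[i + m][j]     = M2[i][j] + M4[i][j];
            C[i + m][j + m] = M1[i][j] - M2[i][j] + M3[i][j] + M6[i][j];
        }
    return C;
}
```

All those temporary quadrant copies and 18 matrix additions are the constant factor the crossover point has to amortize.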

The current fastest solution to matrix multiplication modulo $$$998244353$$$ on Library Checker uses Strassen and vectorization, and runs in half the time of the most optimized trivial approaches (albeit without manual vectorization or any other fancy optimizations), so it might be worth trying to beat that solution :)

Would this kind of thing work for any DP that can be optimized using the rolling-array memory trick? Knapsack, for example.
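Not the author, but for subset-sum-style knapsacks you can get SIMD-like throughput even more cheaply with `std::bitset`: processing one item is a single shift-and-or over all states at once, so each instruction updates 64 (or more, with SIMD) states. A sketch, assuming only reachability is needed, not values (the names and the size bound are mine):

```cpp
#include <bitset>
#include <vector>

const int MAXW = 100001;  // assumed upper bound on the total weight

// subset-sum DP: result[w] == true iff some subset of items sums to w;
// the shift-and-or updates 64+ states per machine instruction
std::bitset<MAXW> subset_sums(const std::vector<int>& items) {
    std::bitset<MAXW> reachable;
    reachable[0] = 1;                 // the empty subset sums to 0
    for (int w : items)
        reachable |= reachable << w;  // either skip the item or take it
    return reachable;
}
```

For the value-maximizing knapsack the inner loop over weights is still vectorizable, since `max` has no carried dependence across weights for a fixed item.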

First off, I think it's really cool that you got a `36x` speedup without having to manually call a bunch of intrinsic functions like `_mm256_sub_epi32`. These GCC Vector Extensions sure make vectorization look a lot less daunting.

I suppose the main difficulty in using this on CF is that tasks usually require the answer modulo some prime and avoid floating point altogether. How big would the speedup be if we're working with integers modulo $$$10^9+7$$$?
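Modular arithmetic does vectorize with the same extensions, just with more instructions per useful operation. For instance, a branchless lane-wise addition mod $$$10^9+7$$$ (the typedef and function names are mine, not from the article):

```cpp
typedef int v8si __attribute__((vector_size(32)));  // 8 x int32 (GCC/Clang only)

const int MOD = 1000000007;

// lane-wise (a + b) mod MOD, assuming 0 <= a[i], b[i] < MOD;
// 2 * (MOD - 1) < 2^31, so the plain addition cannot overflow
v8si add_mod(v8si a, v8si b) {
    v8si s = a + b;
    v8si p = MOD + (v8si){};    // broadcast MOD into every lane
    return s - ((s >= p) & p);  // (s >= p) is an all-ones mask where true
}
```

Multiplication is the expensive part: it needs 64-bit intermediate products (or Montgomery/Barrett reduction), which is one reason integer speedups tend to be smaller than floating-point ones.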

Optimizing the naive $$$O(n^2)$$$ polynomial multiplication and comparing to FFT might be interesting...

On a final note, the IJK Floyd-Warshall only needs 3 repetitions: https://arxiv.org/pdf/1904.01210.pdf. Note that this might not work correctly with blocking, but it should at least allow for vectorization of the inner loop.