Tutorial: https://en.algorithmica.org/hpc/algorithms/matmul/

A version you can submit on CodeForces (it is optimized for a specific CPU that is very different from that of CF, so the speedup is smaller): https://github.com/sslotin/amh-code/blob/main/matmul/self-contained.cc

These blocking/vectorization techniques can also be applied to some dynamic programming algorithms, such as the Floyd-Warshall ("for-for-for") algorithm. Do you know any similar DP problems where the number of operations is larger than the number of states?
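The reason Floyd-Warshall fits these techniques: for a fixed $$$k$$$ and $$$i$$$, the updates over $$$j$$$ are independent, so the inner loop can be vectorized. A minimal sketch (the flat-array layout and function name are my own):

```cpp
#include <algorithm>
#include <vector>

// Floyd-Warshall with a vectorization-friendly inner loop: for fixed k and i,
// the j-updates have no loop-carried dependence, so -O3 can turn them into SIMD
void floyd_warshall(int n, std::vector<int>& d) {  // d is a flat n*n matrix
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            int dik = d[i * n + k];  // loop-invariant, hoisted out of the j-loop
            for (int j = 0; j < n; j++)
                d[i * n + j] = std::min(d[i * n + j], dik + d[k * n + j]);
        }
}
```

It is $$$O(n^3)$$$ work over $$$O(n^2)$$$ states, which is exactly the ratio that makes blocking worthwhile.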

What about the Strassen algorithm? Could you place it on this graph too? I'm pretty sure it will not be even close to those on the right, but I think it should be faster than the naive one.

I've mentioned it in the article. It is extremely tedious to implement, and I don't think any BLAS library has done it so far, but there is a paper showing that an efficient (vectorized) implementation can be 10-20% faster for very large matrices.

The speedup mainly comes from performing fewer raw arithmetic operations once the problem size is large enough, so I really doubt that a scalar Strassen implementation will beat the baseline for anything under $$$n=2000$$$.
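For scale: Strassen replaces the 8 half-size multiplications of the divide-and-conquer scheme with 7, giving $$$O(n^{\log_2 7}) \approx O(n^{2.81})$$$ operations, at the cost of many extra additions and worse locality. A bare scalar sketch of the recursion (my own code, power-of-two $$$n$$$ only) shows where that overhead comes from:

```cpp
#include <vector>

using Mat = std::vector<std::vector<double>>;

// element-wise A + sign * B
Mat madd(const Mat& a, const Mat& b, double sign = 1) {
    int n = a.size();
    Mat c(n, std::vector<double>(n));
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i][j] = a[i][j] + sign * b[i][j];
    return c;
}

// textbook Strassen for n a power of two; real implementations switch to a
// blocked kernel below some cutoff instead of recursing all the way to n == 1
Mat strassen(const Mat& A, const Mat& B) {
    int n = A.size();
    if (n == 1) return {{A[0][0] * B[0][0]}};
    int m = n / 2;
    auto quad = [&](const Mat& M, int r, int c) {  // extract an m x m quadrant
        Mat Q(m, std::vector<double>(m));
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                Q[i][j] = M[r * m + i][c * m + j];
        return Q;
    };
    Mat A11 = quad(A,0,0), A12 = quad(A,0,1), A21 = quad(A,1,0), A22 = quad(A,1,1);
    Mat B11 = quad(B,0,0), B12 = quad(B,0,1), B21 = quad(B,1,0), B22 = quad(B,1,1);
    // 7 recursive multiplications instead of 8 -- the entire source of the speedup
    Mat M1 = strassen(madd(A11, A22), madd(B11, B22));
    Mat M2 = strassen(madd(A21, A22), B11);
    Mat M3 = strassen(A11, madd(B12, B22, -1));
    Mat M4 = strassen(A22, madd(B21, B11, -1));
    Mat M5 = strassen(madd(A11, A12), B22);
    Mat M6 = strassen(madd(A21, A11, -1), madd(B11, B12));
    Mat M7 = strassen(madd(A12, A22, -1), madd(B21, B22));
    Mat C(n, std::vector<double>(n));
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++) {
            C[i][j]         = M1[i][j] + M4[i][j] - M5[i][j] + M7[i][j];
            C[i][j + m]     = M3[i][j] + M5[i][j];
            C[i + m][j]     = M2[i][j] + M4[i][j];
            C[i + m][j + m] = M1[i][j] - M2[i][j] + M3[i][j] + M6[i][j];
        }
    return C;
}
```

All those temporary quadrant copies and 18 matrix additions are the constant factor the crossover point has to amortize.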

The current fastest solution to matrix multiplication modulo $$$998244353$$$ on Library Checker uses Strassen and vectorization, and runs in half the time of the most optimized trivial approaches (albeit without manual vectorization or any other fancy optimizations), so it might be worth trying to beat that solution :)

Would this kind of thing work for any DP that can be optimized using the rolling-array memory trick? Knapsack, for example.
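Not the author, but for subset-sum-style knapsacks you can get SIMD-like throughput even more cheaply with `std::bitset`: processing one item is a single shift-and-or over all states at once, so each instruction updates 64 (or more, with SIMD) states. A sketch, assuming only reachability is needed, not values (the names and the size bound are mine):

```cpp
#include <bitset>
#include <vector>

const int MAXW = 100001;  // assumed upper bound on the total weight

// subset-sum DP: result[w] == true iff some subset of items sums to w;
// the shift-and-or updates 64+ states per machine instruction
std::bitset<MAXW> subset_sums(const std::vector<int>& items) {
    std::bitset<MAXW> reachable;
    reachable[0] = 1;                 // the empty subset sums to 0
    for (int w : items)
        reachable |= reachable << w;  // either skip the item or take it
    return reachable;
}
```

For the value-maximizing knapsack the inner loop over weights is still vectorizable, since `max` has no carried dependence across weights for a fixed item.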

First off, I think it's really cool that you got a `36x` speedup without having to manually call a bunch of intrinsic functions like `_mm256_sub_epi32`. These GCC Vector Extensions sure make vectorization look a lot less daunting.

I suppose the main difficulty in using this on CF is that tasks usually require the answer modulo some prime and avoid floating point altogether. How big would the speedup be if we're working with integers modulo $$$10^9+7$$$?
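Modular arithmetic does vectorize with the same extensions, just with more instructions per useful operation. For instance, a branchless lane-wise addition mod $$$10^9+7$$$ (the typedef and function names are mine, not from the article):

```cpp
typedef int v8si __attribute__((vector_size(32)));  // 8 x int32 (GCC/Clang only)

const int MOD = 1000000007;

// lane-wise (a + b) mod MOD, assuming 0 <= a[i], b[i] < MOD;
// 2 * (MOD - 1) < 2^31, so the plain addition cannot overflow
v8si add_mod(v8si a, v8si b) {
    v8si s = a + b;
    v8si p = MOD + (v8si){};    // broadcast MOD into every lane
    return s - ((s >= p) & p);  // (s >= p) is an all-ones mask where true
}
```

Multiplication is the expensive part: it needs 64-bit intermediate products (or Montgomery/Barrett reduction), which is one reason integer speedups tend to be smaller than floating-point ones.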

Optimizing the naive $$$O(n^2)$$$ polynomial multiplication and comparing to FFT might be interesting...

On a final note, the IJK Floyd-Warshall only needs 3 repetitions: https://arxiv.org/pdf/1904.01210.pdf. Note that this might not work correctly with blocking, but it should at least allow for vectorization of the inner loop.