50x faster matrix multiplication in under 40 lines of C

sslotin's blog

By sslotin, 2 years ago, In English

Tutorial: https://en.algorithmica.org/hpc/algorithms/matmul/

A version you can submit on Codeforces (it is tuned for a specific CPU that is very different from the one CF runs on, so the speedup there is smaller): https://github.com/sslotin/amh-code/blob/main/matmul/self-contained.cc

These blocking/vectorization techniques can also be applied to some dynamic programming algorithms, such as the Floyd-Warshall ("for-for-for") algorithm. Do you know any similar DP problems where the number of operations is larger than the number of states?
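As a rough illustration of why Floyd-Warshall is a natural target, here is a minimal C sketch of the textbook loop with k outermost. The size N, the INF sentinel, and the function name are assumptions made for this example and are not taken from the tutorial. The sketch assumes non-negative edge weights, and it shows only the vectorization-friendly loop shape, not the blocking.

```c
#include <assert.h>

#define N 4              // illustrative size (assumption)
#define INF 1000000000   // "no edge" marker; INF + INF still fits in a 32-bit int

// d[i][j] holds the shortest known distance from i to j.
// With k outermost, the innermost loop reads row k and updates row i
// contiguously, which is the access pattern a compiler can auto-vectorize.
void floyd_warshall(int d[N][N]) {
    for (int v = 0; v < N; v++)
        assert(d[v][v] == 0);          // precondition: zero diagonal
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            int dik = d[i][k];         // loop-invariant for the j loop
            for (int j = 0; j < N; j++) {
                int via = dik + d[k][j];
                if (via < d[i][j]) d[i][j] = via;
            }
        }
}
```

Each of the N^2 states is updated N times, which is the "more operations than states" property the question refers to.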

Comments (5)

peltorator

2 years ago

What about the Strassen algorithm? Could you place it on this graph too? I'm pretty sure it will not be even close to those on the right, but I think it should be faster than the naive one.

sslotin

2 years ago

+18

I've mentioned it in the article. It is extremely tedious to implement, and I don't think that any BLAS library has done it so far, but there is a paper showing that an efficient (vectorized) implementation can be up to 10-20% faster for very large matrices.

The speedup mainly comes from having to do fewer raw arithmetic operations starting from a certain problem size, so I really doubt that a scalar Strassen algorithm implementation will beat the baseline for anything under $$$n=2000$$$.

nor

2 years ago

+19

The current fastest solution to matrix multiplication modulo $$$998244353$$$ on Library Checker uses Strassen and vectorization, and runs in half the time of the most optimized trivial approaches (albeit without manual vectorization or any other fancy optimizations), so it might be worth trying to beat that solution :)

tfg

2 years ago

Would this kind of thing work for any DP that can be optimized using the memory trick? Knapsack, for example.

dacin21

2 years ago

First off, I think it's really cool that you got a 36x speedup without having to manually call a bunch of intrinsics like _mm256_sub_epi32. These GCC Vector Extensions sure make vectorization look a lot less daunting.
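For readers who have not seen the extensions, here is a minimal sketch of what they look like in C. The type name vec8 and the axpy helper are made up for this example and are not from the article; it only shows that ordinary operators work element-wise on vector types in GCC and Clang.

```c
#include <assert.h>

// 8 floats packed into one 256-bit value (GCC/Clang vector extension).
typedef float vec8 __attribute__((vector_size(32)));

// Computes y[i] += a * x[i] over n floats, 8 at a time.
// x and y are arrays of vec8, so they carry 32-byte alignment.
void axpy(int n, float a, const vec8 *x, vec8 *y) {
    assert(n % 8 == 0);                   // whole vectors only, no tail handling
    vec8 va = {a, a, a, a, a, a, a, a};   // broadcast the scalar into a vector
    for (int i = 0; i < n / 8; i++)
        y[i] += va * x[i];                // element-wise, no intrinsics needed
}
```

When compiled with AVX enabled (for example -O3 -march=native on a suitable CPU), the loop body can map to 256-bit instructions; without it the compiler falls back to narrower operations.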

I suppose the main difficulty in using this for CF is that tasks usually require the answer modulo some prime and avoid floating point altogether. How big would the speedup be if we're working with integers modulo $$$10^9+7$$$?
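To make the question concrete, a scalar baseline for the modular case might look like the sketch below. It is an assumption about a reasonable starting point, not something from the article: the names matmul_mod and MAXN are invented here, and the sketch makes no claim about the achievable speedup. It delays the expensive % by accumulating in 64 bits: each product is below about 1.0e18, so 16 of them plus a reduced residue stay below 2^64 (about 1.8e19).

```c
#include <assert.h>
#include <stdint.h>

#define MOD  1000000007ULL
#define MAXN 512              // illustrative upper bound on n (assumption)

// c = a * b (mod MOD) for n x n row-major matrices with entries in [0, MOD).
// The i-k-j order keeps the innermost loop contiguous in b and acc.
void matmul_mod(int n, const uint32_t *a, const uint32_t *b, uint32_t *c) {
    assert(n > 0 && n <= MAXN);
    uint64_t acc[MAXN];                       // 64-bit accumulators for one row of c
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) acc[j] = 0;
        for (int k = 0; k < n; k++) {
            uint64_t aik = a[i * n + k];
            for (int j = 0; j < n; j++)
                acc[j] += aik * b[k * n + j];
            if (k % 16 == 15)                 // reduce before 64 bits can overflow
                for (int j = 0; j < n; j++) acc[j] %= MOD;
        }
        for (int j = 0; j < n; j++)
            c[i * n + j] = (uint32_t)(acc[j] % MOD);
    }
}
```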

Optimizing the naive $$$O(n^2)$$$ polynomial multiplication and comparing to FFT might be interesting...

On a final note, the IJK Floyd-Warshall only needs 3 repetitions: https://arxiv.org/pdf/1904.01210.pdf Note that this might not work correctly with blocking, but it should at least allow for vectorization of the inner loop.
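A minimal sketch of that idea, under the assumptions of a zero diagonal and non-negative weights: the function names, N, and INF are invented for this example. It runs the loop with k innermost three times and keeps the standard k-outermost version alongside for comparison.

```c
#include <assert.h>

#define N 5              // illustrative size (assumption)
#define INF 1000000000   // "no edge" marker; INF + INF still fits in a 32-bit int

// Reference: the standard order with k outermost.
void floyd_warshall_kij(int d[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j]) d[i][j] = d[i][k] + d[k][j];
}

// One pass of the "incorrect" IJK order: k is the innermost loop.
static void ijk_pass(int d[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                if (d[i][k] + d[k][j] < d[i][j]) d[i][j] = d[i][k] + d[k][j];
}

// Three IJK passes, following the claim in the linked paper.
void floyd_warshall_ijk3(int d[N][N]) {
    for (int v = 0; v < N; v++)
        assert(d[v][v] == 0);      // precondition: zero diagonal
    for (int rep = 0; rep < 3; rep++)
        ijk_pass(d);
}
```

On a small path-shaped graph such as 0→3→1→4→2 with unit weights, a single IJK pass is not enough, which makes it a convenient case for comparing the two functions.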
