50x faster matrix multiplication in under 40 lines of C

Блог пользователя sslotin

Автор sslotin, 2 года назад, По-английски

Tutorial: https://en.algorithmica.org/hpc/algorithms/matmul/

A version you can submit on CodeForces (it is optimized for a specific CPU that is very different from that of CF, so the speedup is smaller): https://github.com/sslotin/amh-code/blob/main/matmul/self-contained.cc

These blocking/vectorization techniques can also be applied to some dynamic programming algorithms, such as the Floyd-Warshall ("for-for-for") algorithm. Do you know any similar DP problems where the number of operations is larger than the number of states?

+223

sslotin
2 года назад
7

Комментарии (7)

Написать комментарий?

peltorator

2 года назад, # |

What about the Strassen algorithm? Could you place it on this graph too? I'm pretty sure it will not be even close to those on the right, but I think it should be faster than the naive one.

→ Ответить

sslotin

2 года назад, # ^ |

+18

I've mentioned it in the article. It is extremely tedious to implement, and I don't think that any BLAS library has done it so far, but there is a paper showing that an efficient (vectorized) implementation can be up to 10-20% faster for very large matrices.

The speedup mainly comes from having to do fewer raw arithmetic operations starting from a certain problem size, so I really doubt that a scalar Strassen algorithm implementation will beat the baseline for anything under $$$n=2000$$$.

→ Ответить

nor

2 года назад, # ^ |

← Rev. 2 →

+19

The current fastest solution to matrix multiplication modulo $$$998244353$$$ on Library Checker uses Strassen and vectorization, and runs in half the time of the most optimized trivial approaches (albeit without manual vectorization or any other fancy optimizations), so it might be worth trying to beat that solution :)

→ Ответить

tfg

2 года назад, # |

Would this kind of thing work for any DP that can be optimized using memory trick? Knapsack for example.

→ Ответить

dacin21

2 года назад, # |

First off, I think it's really cool that you got a 36x speedup without having to manually call a bunch of intrinsic _mm256_sub_epi32 functions. These GCC Vector Extensions sure make vectorization look a lot less daunting.

I suppose the main difficulty in using this for CF is that tasks usually require the answer modulo some prime and avoid floating point all together. How big would the speedup be if we're working with integers modulo $$$10^9+7$$$?

Optimizing the naive $$$O(n^2)$$$ polynomial multiplication and comparing to FFT might be interesting...

On a final note, the IJK floyd warshall only needs 3 repetitions: https://arxiv.org/pdf/1904.01210.pdf Note that this might not work correctly with blocking, but it should at least allow for vectorization of the inner loop.

→ Ответить

dmitrysurnin

3 месяца назад, # |

Привет! Вы не сравнивали с библиотекой numpy на питоне? Отчего-то у меня данный matmul() на C++ выполняется за 0,25 сек, а numpy выполняется за 0,11 сек. Запускаемый код на питоне таков:

a = numpy.random.rand(1920,1920)
b = numpy.random.rand(1920,1920)
st = datetime.datetime.now()
c = numpy.matmul(a, b)
en = datetime.datetime.now()
print(en - st) # 0:00:00.117674

Hello! Did you compare with numpy on python?

→ Ответить

sslotin

3 месяца назад, # ^ |

"BLAS" на графике и есть numpy. Замедление, скорее всего, из-за того, что моя реализация заточена под конкретный CPU (Zen 2): чтобы выжать максимальную производительность, для разных микроархитектур кернел нужно немного менять (я привожу основные принципы в секции "Designing the kernel").

→ Ответить