*UPD (2021-10-13): I've just discovered that I published this post for Russian audience only :(*

Hi everyone!

I'm writing a book about performance engineering: the art of optimizing algorithms beyond just asymptotic complexity.

The book walks through the main CPU optimization techniques such as caching, SIMD and pipelining while exploring many large case studies where we speed up some classic algorithm or a data structure, closely matching or even improving on the state-of-the-art.

You will probably not learn a single asymptotically faster algorithm there, but it will help you not to get TL when you are supposed to get AC, and sometimes get AC when you are supposed to get TL.

Among the stuff that you are probably most interested in:

- A version of segment tree that can compute prefix sums in 2ns plus the time of the slowest memory read if the array doesn't fit in L2 cache (is larger than ~2^16).
- An implementation of GCD that works 2-3 faster than
`std::gcd`

. - A 4-7x faster binary search (I wrote about it before)
- Integer factorization taking ~0.5ms per 60-bit integer.
- Argmin at the speed of memory (~$$$3 \cdot 10^{10}$$$ integer / float elements per second)
- An algorithm for parsing series of integers ~2x faster than
`scanf("%d")`

does. - An implementation of BLAS-level matrix multiplication that looks like 8 "for" loops wrapping a one-liner and works ~100x faster than the naive "for-for-for" one, and a similar ~50x implementation of the Floyd algorithm.

I've only recently finished the research stage and I'm now writing it all up. I've already completed ~120 pages, which is ~30% of what I intend to. I plan to finish a new article every 2-3 days and post any relevant updates here on CodeForces and on my revived twitter.

I'm not sure what I'm going to do with the book when I'm done, but it is definitely going to be available online for free.

Happy to hear any comments and suggestions.

Good test for matrix multiplication is the problem "I — Matrix God" from contest here. AsGreyWolf have achieved 405ms in his submission. It is matrix multiplication by modulo in $$$O(n^3)$$$ time for $$$n=1000$$$ with low constant given by vectorization.

I want to share interesting Russian problem about elections. Among all 85 pages of submissions only 3 persons solved this problem. Author of problem ch_egor solved it faster than $$$O(n \cdot m)$$$, remain two solvers solved it in $$$O(n\cdot m)$$$ with vectorization.

as a constant factor enthusiast this pleases me

in particular i enjoy simd unhealthily: some practical uses of simd are

the latter two are mainly from simonlindholm, the god of simd — he has written a book on applications of simd in CP in github here, #4 is the "bitops" chapter.