### ivan100sic's blog

By ivan100sic, history, 2 years ago,

The pattern of memory accesses plays a huge role in determining the actual running time of an algorithm. Many of you already know this fact, but do you know how big the difference can be? Take a moment to study this code:

#include <bits/stdc++.h>
using namespace std;
using namespace chrono;

vector<int> generate_random(int n) {
    mt19937 eng;
    vector<int> a(n);
    iota(a.begin(), a.end(), 0);
    shuffle(a.begin(), a.end(), eng);
    return a;
}

vector<int> generate_cycle(int n) {
    vector<int> a(n);
    iota(a.begin(), a.end(), 1);
    a[n-1] = 0;
    return a;
}

int main() {
    int n, t, q, z = 0;
    cin >> n >> t >> q;
    auto a = (t ? generate_cycle(n) : generate_random(n));

    auto start = high_resolution_clock::now();
    while (q--) {
        int x = q % n;
        for (int i = 0; i < n; i++)
            x = a[x];
        z += x;
    }
    duration<double> dur = high_resolution_clock::now() - start;
    cout << "Time: " << dur.count() << '\n';
    cout << z << '\n';
}


The program performs $q$ walks of length $n$ on a permutation graph. With $t=0$, the permutation is randomly generated and with $t=1$ it's a cycle $0\rightarrow 1\rightarrow 2 \rightarrow \ldots \rightarrow (n-1) \rightarrow 0$.

Now try running this code using custom invocation with the following input: 10000000 0 10, and then 10000000 1 10. Can you guess the running time in both cases? Surely it won't take more than one second as there are only 100M memory reads... Also, you can probably guess that the second one will be faster, but exactly how much faster?

In case you're lazy and don't want to try it yourself

By the way, if I leave out printing $z$, the second part runs instantly, because the compiler correctly deduces that the entire while loop is unnecessary!
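To illustrate that point with a minimal sketch of my own (the function name is made up, this is not code from the benchmark above): a loop whose result is never observed can legally be deleted by the optimizer, so timing it would measure nothing.

```cpp
#include <cstdio>

// Hypothetical helper (my own illustration): the loop below is pure
// computation, so if its result never reached any output, a compiler at
// -O2 could remove it entirely. Returning the accumulator and printing
// it, as the post does with z, makes the work observable.
long long busy_sum(int n) {
    long long acc = 0;
    for (int i = 0; i < n; i++)
        acc += i;                 // work the compiler must keep...
    return acc;
}
// ...as long as the caller actually uses the value, e.g.:
//   std::printf("%lld\n", busy_sum(1000000));
```

An alternative sink is writing the result to a `volatile` variable, which also counts as an observable side effect.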

Tips to optimize your memory access patterns:

1. Avoid using multiple parallel arrays to represent complex data structures. Instead, use arrays of structs. In particular, this applies to segment trees where you have to keep multiple numbers per node. (In some cases the difference is insignificant.)

2. Use smaller data types. Using long long everywhere may slow your program down for multiple reasons; one of them is that your memory accesses will be more spread out, i.e. less cache-friendly.

3. Try switching rows/columns of matrices. Make sure that the innermost loop runs on the last dimension of the matrix. In particular, when multiplying matrices, transposing the second matrix significantly reduces the running time.
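As a concrete sketch of tip 3 (my own illustration, the function name is made up): with row-major storage, ordering the loops i-k-j makes the innermost loop run over the last index of both the second input and the output, so every access in the hot loop is sequential, without transposing anything.

```cpp
#include <vector>
using namespace std;

// i-k-j loop order: the innermost loop walks the LAST index of both
// B and C, so B[k][j] and C[i][j] are read/written with stride 1.
// The textbook i-j-k order instead walks B column-wise and takes a
// cache miss on nearly every step for large matrices.
vector<vector<long long>> mul(const vector<vector<long long>>& A,
                              const vector<vector<long long>>& B) {
    int n = A.size(), p = B.size(), m = B[0].size();
    vector<vector<long long>> C(n, vector<long long>(m, 0));
    for (int i = 0; i < n; i++)
        for (int k = 0; k < p; k++) {
            long long a = A[i][k];          // constant across the inner loop
            for (int j = 0; j < m; j++)     // innermost loop: last dimension
                C[i][j] += a * B[k][j];
        }
    return C;
}
```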


 » 2 years ago, # | ← Rev. 2 →   +23 Don't transpose the second matrix in multiplication. Just write down the needed for-loops and order them in such a way that the last one isn't the first dimension of any cell that you use in an addition/product: for(a) for(b) for(c) answer[a][c] += M1[a][b] * M2[b][c]; And yes, I agree that cache is important. I guess the short advice is: try to iterate over the last dimension of a multidimensional array, and try iterating the array from start to end (or the other way) instead of in a random-ish order.
•  » » 2 years ago, # ^ |   0 Why waste memory accesses like this? Isn't it better to store the sum of products in a temporary variable and then assign it to the result matrix element at the end: transpose(M2); for(a) for(b) { t=0; for(c) t += M1[a][c] * M2[b][c]; answer[a][b] = t; } Wanna benchmark these two?
•  » » » 2 years ago, # ^ | ← Rev. 4 →   +8 $M1[a][b]$ is constant in my version (the compiler should do that), so we both have two new reads each time. Well, I have a read and a write to new places, you have two reads. I don't think there should be a big difference. Feel free to prove me wrong with a benchmark. I will be surprised, but also glad that I learned something new.
•  » » » » 2 years ago, # ^ |   0 I like your approach, it's quite good indeed and also a lot easier to code. On Codeforces the difference is insignificant, especially for smaller matrices. Code (collapsed: a benchmark timing either mul_ivan100sic or mul_errichto on $2222 \times 2222$ unsigned matrices filled with random junk). There is a scenario where your approach might end up being significantly slower, and that's when multiplying matrices modulo $M$. In order to use the trick with $M^2$, you'd need the result matrix to hold long longs. Hold on as I make another benchmark...
•  » » » » » 2 years ago, # ^ | ← Rev. 2 →   0 Then I would make (at least temporarily) an array of long longs as the resulting matrix — just FYI. EDIT: @post_below, one more thing is that you can change mod*mod into 16*mod*mod or something like that and it will still work. Again, it won't be a big difference.
•  » » » » » » 2 years ago, # ^ |   +11 I tried to make a fair comparison: Code (collapsed: a benchmark of both approaches on $777 \times 777$ matrices modulo $10^9 + 7$, delaying the reduction with the $M^2$ trick; the second version accumulates into a separate long long result matrix). Your approach is again only slightly slower (around 5-6%), I wouldn't call it a big difference.
•  » » » » » » » 2 years ago, # ^ |   0 Now try matrix multiplication with blocks (look up "blocking matrix multiplication" if you don't know what it is). It improves cache efficiency beyond simple ordering of rows/columns/indices.
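A minimal sketch of the blocked (tiled) multiplication mentioned above, assuming square matrices; the block size and names are my own:

```cpp
#include <algorithm>
#include <vector>
using namespace std;

// Blocked matrix multiplication: compute the product in BxB tiles so
// that the three tiles being combined fit in cache together. Elements
// of the inputs are then reused many times per memory load, instead of
// streaming whole rows of the second matrix for every output row.
const int B = 64; // tune to cache size; 64*64*8 bytes = 32 KB per tile

vector<vector<long long>> mul_blocked(const vector<vector<long long>>& X,
                                      const vector<vector<long long>>& Y) {
    int n = X.size();
    vector<vector<long long>> Z(n, vector<long long>(n, 0));
    for (int i0 = 0; i0 < n; i0 += B)
        for (int k0 = 0; k0 < n; k0 += B)
            for (int j0 = 0; j0 < n; j0 += B)
                // multiply one tile pair; all accesses stay within a
                // small, cache-resident footprint
                for (int i = i0; i < min(i0 + B, n); i++)
                    for (int k = k0; k < min(k0 + B, n); k++) {
                        long long x = X[i][k];
                        for (int j = j0; j < min(j0 + B, n); j++)
                            Z[i][j] += x * Y[k][j];
                    }
    return Z;
}
```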
•  » » » » » » » » 15 months ago, # ^ |   -45 Doesn't that only work with parallel programming, like in the case of CUDA?
 » 2 years ago, # |   +14 I like this article very much. Use smaller data types. Using long long everywhere may slow your program down for multiple reasons, one of them is because your memory accesses will be more spread out = less cache friendly. Many people believe long long is as fast as int on 64-bit judge machines. It's wrong. I wrote an article about this topic in my blog, but it's in Chinese. Maybe I should translate it.
 » 2 years ago, # |   0 Avoid using multiple arrays to represent complex data structures. Instead, use arrays of structs. It's not that simple :) This article actually argues the other way (sorry for the Russian, but Google Translate manages it just fine); in practice it depends on many other things, like the cache size and the number of reads/writes for each particular index/struct.
•  » » 2 years ago, # ^ |   +1 Of course, this can be hard to predict. I had this scenario in mind: you want to make a segment tree to support maximum subsegment sum queries, so you need 4 numbers per node. Some people prefer to spread this information over four arrays; I claim it's a lot better to use an array of structs.
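For instance, the node described above could be a single struct (a sketch of my own; the field and function names are made up), so merging two children touches one contiguous node per side instead of four distant arrays:

```cpp
#include <algorithm>
using namespace std;

// One segment-tree node for maximum subsegment sum: total sum, best
// prefix sum, best suffix sum, and best subsegment sum. Keeping all
// four in one struct means a query reads one cache line per node.
struct Node {
    long long sum, pref, suf, best;
};

// Merge two adjacent segments into one.
Node combine(const Node& l, const Node& r) {
    return {
        l.sum + r.sum,
        max(l.pref, l.sum + r.pref),      // prefix may cross into r
        max(r.suf, r.sum + l.suf),        // suffix may cross into l
        max({l.best, r.best, l.suf + r.pref}) // best may straddle the cut
    };
}

// A leaf segment holding a single value.
Node leaf(long long v) { return {v, v, v, v}; }
```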
 » 2 years ago, # |   +19 You didn't say anything about how caching works, so for people who didn't know about such issues before, your benchmarks look like magic. One can probably infer from your 3rd tip that it is better to access consecutive memory, IF they know how multidimensional arrays are stored in memory. The 2nd tip sounds crazy: I just tried changing ints to long longs in the code from the blog, and the change in runtime is insignificant. Also, there are many more real-life examples, like reordering the vertices of a tree by DFS traversal, but you only talk about arrays without looking into the underlying issues. Thanks for bringing this to public attention, but the blog itself is really bad.
•  » » 2 years ago, # ^ |   +26 I want to write a blog that goes into the practical details of caches, pipelining/stalls, SIMD and similar low-level things that the compiler tries to handle, but sometimes unsuccessfully. Too bad I haven't gotten to it yet. Regarding variable sizes, the change from 64-bit to 32-bit can be useful, but the improvement it brings is much smaller than just going from many cache misses to no cache misses. It can be useful if you have a lookup table that's accessed very often and should just barely fit in your L1 cache, for example, but that's a very rare case in CP. It's also useful if your code gets MLE :^)
•  » » » 2 years ago, # ^ |   +5 Yes, changing long longs to ints can help you get AC (because of lower time and memory consumption), but for the most part that's not due to caches. Your example is possibly the only one (with ints the array fits into the cache fully, otherwise it doesn't), but it wasn't mentioned in the blog. Will wait for your blog :)
 » 14 months ago, # |   +1 Nice tips! Here, if anyone is interested: Problem. This problem can be used to play around with the stats. Changing long long to int improved the execution time from TLE to nearly 1200 ms; I believe using structs could shave off some more time.