On Cache Friendliness

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	163
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	151
8	SecondThread	147
9	orz	146
10	pajenegod	145

The pattern of memory accesses plays a huge role in determining the actual running time of an algorithm. Many of you already know this fact, but do you know how big the difference can be? Take a moment to study this code:

#include <bits/stdc++.h>
using namespace std;
using namespace chrono;

vector<int> generate_random(int n) {
	mt19937 eng;
	vector<int> a(n);
	iota(a.begin(), a.end(), 0);
	shuffle(a.begin(), a.end(), eng);
	return a;
}

vector<int> generate_cycle(int n) {
	vector<int> a(n);
	iota(a.begin(), a.end(), 1);
	a[n-1] = 0;
	return a;
}

int main() {
	int n, t, q, z = 0;
	cin >> n >> t >> q;
	auto a = (t ? generate_cycle(n) : generate_random(n));

	auto start = high_resolution_clock::now();
	while (q--) {
		int x = q % n;
		for (int i=0; i<n; i++)
			x = a[x];
		z += x;
	}
	duration<double> dur = high_resolution_clock::now() - start;
	cout << "Time: " << dur.count() << '\n';
    cout << z << '\n';
}

The program performs $$$q$$$ walks of length $$$n$$$ on a permutation graph. With $$$t=0$$$, the permutation is randomly generated and with $$$t=1$$$ it's a cycle $$$0\rightarrow 1\rightarrow 2 \rightarrow \ldots \rightarrow (n-1) \rightarrow 0$$$.

Now try running this code using custom invocation with the following input: 10000000 0 10, and then 10000000 1 10. Can you guess the running time in both cases? Surely it won't take more than one second as there are only 100M memory reads... Also, you can probably guess that the second one will be faster, but exactly how much faster?

In case you're lazy and don't want to try it yourself

By the way, if I leave out printing $$$z$$$, the second part runs instantly, because the compiler correctly deduces that the entire while loop is unnecessary!

Tips to optimize your memory access patterns:

Avoid using multiple arrays to represent complex data structures. Instead, use arrays of structs. In particular, this applies to segment trees where you have to keep multiple numbers per node. There are cases where the difference is insignificant.
Use smaller data types. Using long long everywhere may slow your program down for multiple reasons, one of them is because your memory accesses will be more spread out = less cache friendly.
Try switching rows/columns of matrices. Make sure that the innermost loop runs on the last dimension of the matrix. In particular, when multiplying matrices, transposing the second matrix significantly reduces the running time.

Feel free to add suggestions!

	Rev.	Lang.	By	When	Δ	Comment
	en1		ivan100sic	2019-09-10 01:43:33	2808	Initial revision (published)

Rev.

Lang.

When

Comment

en1

ivan100sic

2019-09-10 01:43:33

2808

Initial revision (published)

History