On Cache Friendliness

The pattern of memory accesses plays a huge role in determining the actual running time of an algorithm. Many of you already know this fact, but do you know how big the difference can be? Take a moment to study this code:

#include <bits/stdc++.h>
using namespace std;
using namespace chrono;

vector<int> generate_random(int n) {
	mt19937 eng;
	vector<int> a(n);
	iota(a.begin(), a.end(), 0);
	shuffle(a.begin(), a.end(), eng);
	return a;
}

vector<int> generate_cycle(int n) {
	vector<int> a(n);
	iota(a.begin(), a.end(), 1);
	a[n-1] = 0;
	return a;
}

int main() {
	int n, t, q, z = 0;
	cin >> n >> t >> q;
	auto a = (t ? generate_cycle(n) : generate_random(n));

	auto start = high_resolution_clock::now();
	while (q--) {
		int x = q % n;
		for (int i=0; i<n; i++)
			x = a[x];
		z += x;
	}
	duration<double> dur = high_resolution_clock::now() - start;
	cout << "Time: " << dur.count() << '\n';
	cout << z << '\n';
}

The program performs $$$q$$$ walks of length $$$n$$$ on a permutation graph. With $$$t=0$$$, the permutation is randomly generated and with $$$t=1$$$ it's a cycle $$$0\rightarrow 1\rightarrow 2 \rightarrow \ldots \rightarrow (n-1) \rightarrow 0$$$.

Now try running this code using custom invocation with the following input: 10000000 0 10, and then 10000000 1 10. Can you guess the running time in both cases? Surely it won't take more than one second as there are only 100M memory reads... Also, you can probably guess that the second one will be faster, but exactly how much faster?

By the way, if I leave out printing $$$z$$$, the second run finishes instantly, because the compiler correctly deduces that the entire while loop has no observable effect and removes it!

Tips to optimize your memory access patterns:

Avoid using multiple arrays to represent complex data structures. Instead, use arrays of structs, so that values which are used together also sit next to each other in memory. In particular, this applies to segment trees where you have to keep multiple numbers per node. That said, there are cases where the difference is insignificant.
Use smaller data types. Using long long everywhere may slow your program down for multiple reasons; one of them is that the same data takes up more memory, so your accesses are more spread out and less cache friendly.
Try switching rows/columns of matrices. Make sure that the innermost loop runs on the last dimension of the matrix. In particular, when multiplying matrices, transposing the second matrix significantly reduces the running time.
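
To illustrate the matrix multiplication tip, here is a minimal sketch, assuming square matrices of ints whose entries are small enough that the products don't overflow. The names Matrix, multiply_naive and multiply_transposed are made up for this example. Both functions compute the same product; the only difference is the order in which the second matrix is read in the innermost loop.

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Hypothetical alias for this sketch: a square matrix stored row by row.
using Matrix = vector<vector<int>>;

// Straightforward product. In the innermost loop k changes, so b[k][j]
// jumps to a different row of b on every iteration.
Matrix multiply_naive(const Matrix& a, const Matrix& b) {
	int n = a.size();
	assert((int)b.size() == n);
	Matrix c(n, vector<int>(n, 0));
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			for (int k = 0; k < n; k++)
				c[i][j] += a[i][k] * b[k][j];
	return c;
}

// Same product, but b is transposed first. Now the innermost loop runs
// over the last dimension of both a and bt, so both rows are scanned
// sequentially.
Matrix multiply_transposed(const Matrix& a, const Matrix& b) {
	int n = a.size();
	assert((int)b.size() == n);
	Matrix bt(n, vector<int>(n));
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			bt[j][i] = b[i][j];
	Matrix c(n, vector<int>(n, 0));
	for (int i = 0; i < n; i++)
		for (int j = 0; j < n; j++)
			for (int k = 0; k < n; k++)
				c[i][j] += a[i][k] * bt[j][k];
	return c;
}
```

The transposition costs an extra $$$O(n^2)$$$ pass, which is small next to the $$$O(n^3)$$$ multiplication. How much you gain depends on $$$n$$$ and on your machine, so it's worth timing both versions yourself.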

Feel free to add suggestions!