4-5x Faster Drop-in Replacement for std::lower_bound

#	User	Rating
1	ecnerwala	3649
2	Benq	3581
3	orzdevinwang	3570
4	Geothermal	3569
4	cnnfls_csy	3569
6	tourist	3565
7	maroonrk	3531
8	Radewoosh	3521
9	Um_nik	3482
10	jiangly	3468

#	User	Contrib.
1	maomao90	174
2	awoo	164
3	adamant	162
4	TheScrasse	159
5	nor	158
6	maroonrk	156
7	-is-this-fft-	151
8	SecondThread	147
9	orz	146
10	pajenegod	145

This will probably be an hour-long read about the CPU memory model that explains 10 lines of code at the very end.

The following code is 4 to 5 times faster than calling std::lower_bound over a sorted array:

#pragma GCC optimize("O3")
#include <bits/stdc++.h>

using namespace std;

const int n = (1<<20);
const int block_size = 16;
alignas(64) int a[n], b[n+1];

int eytzinger(int i = 0, int k = 1) {
    if (k <= n) {
        i = eytzinger(i, 2 * k);
        b[k] = a[i++];
        i = eytzinger(i, 2 * k + 1);
    }
    return i;
}

int search(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(b + k * block_size);
        k = 2 * k + (b[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return k;
}

tl;dr: it uses a segment tree-like cache-friendly array reordering that allows trading off memory bandwidth for latency via prefetching. Because of this, its running time could be volatile on some online judges due to "noisy neighbor" effects (see comments below).

This is technically not a drop-in replacement, since it requires some preprocessing, but I can't recall a lot of scenarios where you obtain a sorted array but can't spend linear time on preprocessing.

This method could also be used to speedup heaps, segment trees, and other static binary tree structures.

Comments (8)

Write comment?

dmkz

4 years ago, # |

+46

I want to test this method, but maybe I am doing it wrong? Link. Got different check value and got that std::lower_bound is faster. Can you help to fix?

→ Reply

sslotin

4 years ago, # ^ |

+25

The check value is wrong because it returns element's index in its internal numeration, not the original one. I should probably put that in bold in the red somewhere, though it's possible to restore it by spending a few more instructions.

As of running time, I am not sure, but I think that it probably has something to do with the fact that 32-bit random numbers are used to query an array with 20-bit items, so that most searches are only hitting the path to the largest element and the whole caching trick stops working.

I use variations of this code for benchmarking.

MrDindows

← Rev. 3 →

Idk what is k >>= __builtin_ffs(~k); but it should be k - n + 1.

https://ideone.com/IeWTJ1

Upd: nah, this works only for n = power of 2

No problem, we can extend array to power of two by std::numeric_limits<int>::max().

Also I tested eytzinger in comparison with recursive template. Link on ideone

Runtime from ideone

eytzinger = 0.157s, check = 377557
lower_bound = 0.242s, check = 377557
template = 0.138s, check = 377557

+13

Guys I don't think that benchmarking on ideone is a good idea, and I wouldn't benchmark memory-bound algorithms on non-dedicated shared-memory machines in general.

The key trick is to trade off (almost all of) memory bandwidth for latency, which is not going to work well when you have noisy neighbours possibly competing for it.

I am not sure if major online judges provide decent memory isolation, but public ideone definitely doesn't.

Xellos

Memory isolation is an OS feature that's required for security and if you have to switch processes or move pages around too often, performance just tanks. This overhead is possible and could be an explanation, but I'm 99% certain that when you run your code on ideone, it doesn't share physical memory.

← Rev. 2 →

Yes they don't share the memory, but they do share the memory bus, and hence the bandwidth. By "memory isolation" I meant that ideally they shouldn't.

Try running this bad boi in the background on a spare core on your laptop:

#include <cstring>

const int n = (1<<23);
int a[n], b[n];

int main() {
    while (true)
        std::memcpy(a, b, sizeof a);

    return 0;
}

My running time on the benchmarks went 3x.

Ok, so that's what you meant. I'm not sure how it works for servers compared to regular PC CPUs (if there are attempts to mitigate it), but it's a factor.

sslotin's blog