sslotin's blog

By sslotin, 4 years ago, translation, In English

https://algorithmica.org/en/eytzinger

This will probably be an hour-long read about the CPU memory model that explains 10 lines of code at the very end.

The following code is 4 to 5 times faster than calling std::lower_bound over a sorted array:

#pragma GCC optimize("O3")
#include <bits/stdc++.h>

using namespace std;

const int n = (1<<20);
const int block_size = 16;
alignas(64) int a[n], b[n+1];

int eytzinger(int i = 0, int k = 1) {
    if (k <= n) {
        i = eytzinger(i, 2 * k);
        b[k] = a[i++];
        i = eytzinger(i, 2 * k + 1);
    }
    return i;
}

int search(int x) {
    int k = 1;
    while (k <= n) {
        __builtin_prefetch(b + k * block_size);
        k = 2 * k + (b[k] < x);
    }
    k >>= __builtin_ffs(~k);
    return k;
}

tl;dr: it uses a segment tree-like cache-friendly array reordering that allows trading off memory bandwidth for latency via prefetching. Because of this, its running time could be volatile on some online judges due to "noisy neighbor" effects (see comments below).

This is technically not a drop-in replacement, since it requires some preprocessing, but I can't recall a lot of scenarios where you obtain a sorted array but can't spend linear time on preprocessing.

This method could also be used to speedup heaps, segment trees, and other static binary tree structures.

  • Vote: I like it
  • +214
  • Vote: I do not like it

»
4 years ago, # |
  Vote: I like it +46 Vote: I do not like it

I want to test this method, but maybe I am doing it wrong? Link. Got different check value and got that std::lower_bound is faster. Can you help to fix?

  • »
    »
    4 years ago, # ^ |
      Vote: I like it +25 Vote: I do not like it

    The check value is wrong because it returns element's index in its internal numeration, not the original one. I should probably put that in bold in the red somewhere, though it's possible to restore it by spending a few more instructions.

    As of running time, I am not sure, but I think that it probably has something to do with the fact that 32-bit random numbers are used to query an array with 20-bit items, so that most searches are only hitting the path to the largest element and the whole caching trick stops working.

    I use variations of this code for benchmarking.

  • »
    »
    4 years ago, # ^ |
    Rev. 3   Vote: I like it +5 Vote: I do not like it

    Idk what is k >>= __builtin_ffs(~k); but it should be k - n + 1.

    https://ideone.com/IeWTJ1

    Upd: nah, this works only for n = power of 2

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      No problem, we can extend array to power of two by std::numeric_limits<int>::max().

      Also I tested eytzinger in comparison with recursive template. Link on ideone

      Runtime from ideone
      • »
        »
        »
        »
        4 years ago, # ^ |
          Vote: I like it +13 Vote: I do not like it

        Guys I don't think that benchmarking on ideone is a good idea, and I wouldn't benchmark memory-bound algorithms on non-dedicated shared-memory machines in general.

        The key trick is to trade off (almost all of) memory bandwidth for latency, which is not going to work well when you have noisy neighbours possibly competing for it.

        I am not sure if major online judges provide decent memory isolation, but public ideone definitely doesn't.

        • »
          »
          »
          »
          »
          4 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Memory isolation is an OS feature that's required for security and if you have to switch processes or move pages around too often, performance just tanks. This overhead is possible and could be an explanation, but I'm 99% certain that when you run your code on ideone, it doesn't share physical memory.

          • »
            »
            »
            »
            »
            »
            4 years ago, # ^ |
            Rev. 2   Vote: I like it 0 Vote: I do not like it

            Yes they don't share the memory, but they do share the memory bus, and hence the bandwidth. By "memory isolation" I meant that ideally they shouldn't.

            Try running this bad boi in the background on a spare core on your laptop:

            #include <cstring>
            
            const int n = (1<<23);
            int a[n], b[n];
            
            int main() {
                while (true)
                    std::memcpy(a, b, sizeof a);
            
                return 0;
            }
            

            My running time on the benchmarks went 3x.

            • »
              »
              »
              »
              »
              »
              »
              4 years ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              Ok, so that's what you meant. I'm not sure how it works for servers compared to regular PC CPUs (if there are attempts to mitigate it), but it's a factor.