dmkz's blog

By dmkz, history, 4 years ago, In English

Hello everybody!

If you want to use binary search over array with known at compile time size, and this size can be extended to power of two, you can use next simple template:

template<int n, typename T>
int lowerBound(const T* __restrict arr, const T val) {
    if (n == 1) { return val > arr[0]; } 
    else {
        constexpr int mid = n / 2;
        if (val > arr[mid]) { return mid+lowerBound<n-mid, T>(arr+mid, val); }
        else { return lowerBound<mid, T>(arr,val); }
    }
};

Testing on ideone shows, that it is faster in 4 times in comparison with std::lower_bound.

Thanks for reading this.

UPD. Testing for doubles, in 3.5 times faster.

  • Vote: I like it
  • +34
  • Vote: I do not like it

»
4 years ago, # |
  Vote: I like it +22 Vote: I do not like it

There is an issue with how you do your testing.

std::vector<int> queries(1 << 9);
for (auto &it : queries) { it = dist(gen); }
...
for (int i = 0; i < (1 << 15); i++) {
    for (auto q : queries) {
        res += f<1<<20>(&arr[0],q);
    }
}

The compiler could make use of the fact that the inner loop doesn't depend on i. I tried removing the i for loop and switching to $$$2 ^ {23}$$$ queries, see here. The times are now much more similar. 1.56 s vs 1.19 s.

  • »
    »
    4 years ago, # ^ |
    Rev. 2   Vote: I like it 0 Vote: I do not like it

    This testing method was implemented with goal to reduce cost of loading queries from memory to cache. If compiler will use fact, that inner loop doesn't depend on i, then only one iteration will be executed, because there is not JIT-compilation in GCC. We can see, that runtime is a lot, so, I think, that $$$2^{15}$$$ iterations were executed.

    Another test with xorshift32 random generator shows 1.68 s vs 1.08 s.

»
4 years ago, # |
Rev. 3   Vote: I like it +15 Vote: I do not like it

First of all: Your results are very much skewed towards lowerBound. This could be because of several reasons

1) You first execute std::lower_bound thus filling data into caches perhaps allowing better execution for the following iteration. When trying to prove that something is slower than something, isolate things. Probably execute on different CPU cores and such too. I'd suggest making two separate binaries to avoid such effects and run them multiple times

2) Welp, you know this could be because your code invokes undefined behavior? You pretty much attempt to store $$$2^{20} \cdot 2^{23}$$$ in int res. You can test it by adding #pragma GCC optimize ("trapv") so that your code exits on int overflow (but don't use it when comparing performance results)

3) By initializing static variables and other stuff you skew the results towards the compile-based solution. This is pretty easy to isolate in this case by writing a generator that pipes exactly the same data into your program whilst not being compile-time available.

Actually
  • »
    »
    4 years ago, # ^ |
    Rev. 6   Vote: I like it +28 Vote: I do not like it

    couldn't understand any of the optimizations

    I will tell you.

    The first one and the most impressive one you can see on the image below:

    Spoiler with image

    Here compiler copies all 255 array elements, which can be accessed in the first 8 steps of binary search, to the stack memory, so they may be accessed without cache misses. And it's wow! I didn't expect such deep level of optimization logic from the any of compilers!

    But the sad part of this story is that this happened only because many iterations of the same test are performed, and the compiler knows this in advance. In most of real problems, such an optimization cannot be expected. So the test itself is really not so good.

    The second optimization is quite simple and obvious: compiler groups several subsequent calls of template function, so size and indices are known for these steps in compile time, and checks of binary search look like sequence of cmp, jmp, cmp, jmp without any indexing math.

    Spoiler with second image

    PS. My test: https://ideone.com/AL5ZNd

    • »
      »
      »
      4 years ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      Ah, so it is caching after all.

      Think practical implementation of such binary search would look like this (but this particular optimization should cache last N calls instead if first ones) :https://codeforces.com/blog/entry/65278#comment-492839

      But I couldn't prove the idea of cache locality via valgrind, so I didn't think in that direction. Can you elaborate on how to prove that such optimization improves L1 hits by some cpu counting tool?

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      to the stack memory, so they may be accessed without cache misses

      You're right that there's caching, but it's not caused by allocating and copying to the stack, but by the fact that just accessing this memory puts it to L1 cache. The stack is good because it's often faster to compute addresses, memory access speed only depends on locality of the data you're accessing.

      • »
        »
        »
        »
        4 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        You are right and a little bit wrong in the same time =) Accessing the memory really puts it to L1 cache, but there is no guarantee that it will remain there. There is an optimization here exactly because of locality of copied data, so instead of accessing so sparsely located elements of 8 steps of binsearch, we do that on copied data, which is located densely. And it really doesn’t matter to us whether it is a stack memory or not, I just stated the fact of this case.

        Actually we can use such optimization of binary search by ourselves, and it was described in this blog (only RU) 9 years ago. There, the author used three levels of data and got 60 percent speedup.

        • »
          »
          »
          »
          »
          4 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          There is an optimization here exactly because of locality of copied data

          memory access speed only depends on locality of the data you're accessing

          Yes, we're saying the same thing then. It just sounded like you said that the stack helps caching.

          (Technically, data doesn't have to be densely packed to fit in cache. It just has to be packed on distinct cache lines — in practice, this is some requirement on lower bits of their addresses. This is architecture-dependent, so I guess the compiler makes sure it's not a concern.)