ifsmirnov's blog

By ifsmirnov, 8 years ago, In English

Consider a well-known problem: given a static array of size n, answer m queries of the kind "how many numbers on [l, r] have value less than x". The standard solution is to build a segment tree where in every node we store a sorted vector. To answer a query we do a binary search in every corresponding node, achieving O(log^2 n) time per query.
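
For concreteness, here is a minimal top-down sketch of this approach (my illustration, not the exact benchmarked code):

#include <bits/stdc++.h>
using namespace std;

// Merge sort tree: tree[v] holds the sorted values of node v's segment.
// Setup: tree.assign(4 * n, {}); build(a, 1, 0, n - 1);
vector<vector<int>> tree;

void build(const vector<int>& a, int v, int l, int r) {
    if (l == r) { tree[v] = {a[l]}; return; }
    int m = (l + r) / 2;
    build(a, 2 * v, l, m);
    build(a, 2 * v + 1, m + 1, r);
    // Merge the children's sorted vectors: O(n log n) total build time and memory.
    merge(tree[2 * v].begin(), tree[2 * v].end(),
          tree[2 * v + 1].begin(), tree[2 * v + 1].end(),
          back_inserter(tree[v]));
}

// Count elements < x on [ql, qr]: a binary search in each of the O(log n)
// covering nodes, hence O(log^2 n) per query.
int query(int v, int l, int r, int ql, int qr, int x) {
    if (qr < l || r < ql) return 0;
    if (ql <= l && r <= qr)
        return lower_bound(tree[v].begin(), tree[v].end(), x) - tree[v].begin();
    int m = (l + r) / 2;
    return query(2 * v, l, m, ql, qr, x) + query(2 * v + 1, m + 1, r, ql, qr, x);
}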

There is a method to reduce the time complexity to O(log n) per query, called fractional cascading. Instead of doing a binary search in each node, we do it only in the root and then "push" its result to the children in O(1).
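
Here is a sketch of that idea (again my illustration, not necessarily what was benchmarked): while merging the children's arrays during the build, record for every prefix of the merged array how many of its elements came from the left child; that table translates the root's binary search result down the tree in O(1) per node.

#include <bits/stdc++.h>
using namespace std;

// Setup: val.assign(4 * n, {}); toL.assign(4 * n, {}); build(a, 1, 0, n - 1);
int n;
vector<vector<int>> val, toL;

void build(const vector<int>& a, int v, int l, int r) {
    if (l == r) { val[v] = {a[l]}; return; }
    int m = (l + r) / 2;
    build(a, 2 * v, l, m);
    build(a, 2 * v + 1, m + 1, r);
    const vector<int> &L = val[2 * v], &R = val[2 * v + 1];
    toL[v].assign(L.size() + R.size() + 1, 0);
    size_t i = 0, j = 0;
    while (i + j < L.size() + R.size()) {
        if (j == R.size() || (i < L.size() && L[i] <= R[j])) val[v].push_back(L[i++]);
        else val[v].push_back(R[j++]);
        toL[v][i + j] = i;  // first i+j merged elements contain i from the left child
    }
}

// cnt = number of elements < x in val[v]; translated to the children in O(1).
int query(int v, int l, int r, int ql, int qr, int cnt) {
    if (qr < l || r < ql || cnt == 0) return 0;
    if (ql <= l && r <= qr) return cnt;
    int m = (l + r) / 2;
    return query(2 * v, l, m, ql, qr, toL[v][cnt])
         + query(2 * v + 1, m + 1, r, ql, qr, cnt - toL[v][cnt]);
}

int countLess(int ql, int qr, int x) {  // the only binary search, at the root
    int cnt = lower_bound(val[1].begin(), val[1].end(), x) - val[1].begin();
    return query(1, 0, n - 1, ql, qr, cnt);
}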

For years I thought that the second approach was blazingly faster than the first one. Today I made a test: I implemented both approaches in a pretty straightforward way and tested them on random data. The results were quite surprising.

Fractional cascading: 760 ms

Top-down implementation: 670 ms

Bottom-up implementation: 520 ms

The first one is O(log n) per query, others are O(log^2 n)! Time is averaged over several consecutive runs. Test data is generated randomly with n = 100000, m = 200000.

Why doesn't fractional cascading give any improvements? Am I implementing it in an improper way? Anyway, this might be worth taking a look.

+182

»
8 years ago, # | +5

Did you measure the structure build time separately?

  • »
    »
    8 years ago, # ^ | +13

    Yes. They are all about 70-90 ms, with the bottom-up approach being statistically a bit faster.

»
8 years ago, # | +68

A standard solution would be to create a persistent RSQ and say that get(l, r, x) = root[r].get(x) - root[l - 1].get(x)?
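
If I read this suggestion correctly, it means something like the following sketch (a persistent segment tree over the value range [0, V], where version i contains the first i array elements; all names are mine):

#include <bits/stdc++.h>
using namespace std;

struct Node { int cnt; Node *lc, *rc; };

int cnt(Node* t) { return t ? t->cnt : 0; }

// Return a new version of t with one copy of value v added; O(log V) new nodes.
Node* insert(Node* t, int lo, int hi, int v) {
    Node* n = new Node{cnt(t) + 1, t ? t->lc : nullptr, t ? t->rc : nullptr};
    if (lo < hi) {
        int mid = (lo + hi) / 2;
        if (v <= mid) n->lc = insert(n->lc, lo, mid, v);
        else          n->rc = insert(n->rc, mid + 1, hi, v);
    }
    return n;
}

// Number of stored values < x in version t (t covers the value range [lo, hi]).
int countLess(Node* t, int lo, int hi, int x) {
    if (!t || x <= lo) return 0;
    if (hi < x) return t->cnt;
    int mid = (lo + hi) / 2;
    return countLess(t->lc, lo, mid, x) + countLess(t->rc, mid + 1, hi, x);
}

// roots[i] = version with a[0..i-1] inserted; the answer for [l, r] (0-based) is
// countLess(roots[r + 1], 0, V, x) - countLess(roots[l], 0, V, x).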

  • »
    »
    8 years ago, # ^ | 0

    Yes, but can you make updates?

    • »
      »
      »
      8 years ago, # ^ | +51

      Can you make updates to an interval tree with fractional cascading?

»
8 years ago, # | +25

By "partial cascading" you mean fractional cascading, right?

  • »
    »
    8 years ago, # ^ | +15

    Oops. Sure. Thanks. Actually, I thought for a while that the word 'partial' made no sense in this context, but decided that it was just one more stupid term.

»
8 years ago, # | -10

I'm wondering where you ran into this issue: competitive programming, academic research, or some commercial development? If it is the last option, I will not ask for further details, as that might be confidential, but still, the very fact that there exist jobs where one could implement and use advanced data structures would be cool by itself.

»
8 years ago, # | +5

Will something like this really change the performance of the partial cascading implementation?

»
8 years ago, # | +15

Could you explain "Instead of doing binary search in each node we do it only in the root, and then "push" its result to children in O(1)" in more detail? I didn't get the idea.

»
8 years ago, # | Rev. 2 | +6

Not entirely sure, but I think the first one is slower due to cache misses (linked lists).

»
8 years ago, # | +77

As someone else has already pointed out, I think cache misses play an important role here. In the fractional cascading approach, you have to go through 3 "linked lists", so to speak (the lp, the rp, and the node itself), while the other solutions only have two such traversals (the node and the val arrays).

Having a single array of

struct Item { int val, lp, rp; };

should probably eliminate such a problem. It might even be worth it to include pointers to the left and right child nodes' Items directly in every Item. That would make every Item larger, but ensure a single (at most two, in case of unlucky misalignment) memory access per visited node.
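
Something along these lines, I suppose (a guess at the intended details):

#include <vector>

// All nodes' merged arrays concatenated into one flat vector, with lp/rp
// stored as absolute indices into that same vector. One visited node then
// costs a single Item read instead of three separate array lookups.
struct Item {
    int val;     // element value
    int lp, rp;  // absolute indices of the matching Items in the children
};
std::vector<Item> items;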

Also notice that the extra lg N cost in the other solutions is due to a binary search on a contiguous and typically small array, so it will effectively be quite cheap. Thus, although fractional cascading should probably be faster if all implementations are made cache-friendly, the gain will not be huge unless we go to very large N.

My personal rule of thumb is that the difference between "cache-friendly operations" and "cache-unfriendly operations" is between 10 and 20 times in terms of runtime. So a cache-friendly sqrt N typically performs about the same as a cache-unfriendly lg N (very common in problems with cache-friendly sqrt-decomposition vs a treap or similar cache-unfriendly structure).

  • »
    »
    8 years ago, # ^ | +10

    I've just tried your approach. Its running time is between the straightforward fractional cascading and the log^2 top-down approach: still not beating log^2, but a bit faster than the first version of cascading.

    • »
      »
      »
      8 years ago, # ^ | Rev. 2 | +29

      Interesting. Try to count the number of "far away" memory accesses in both cases. If the amount is similar in both implementations (lg^2 and fractional cascading), I can't think of any really obvious reason why it is not faster.

      At this point, if you're still interested in investigating the issue, I strongly suggest using perf to pinpoint exactly where the program is consuming all of its CPU time.

      perf tutorial: https://perf.wiki.kernel.org/index.php/Tutorial

      Excellent talk on benchmark and C++ tuning using perf: https://www.youtube.com/watch?v=nXaxk27zwlk

      perf even goes into the assembly, so it is an extremely powerful tool. Sometimes, unexpected compiler optimizations (or failures to optimize what should obviously be optimized) can be behind such counter-intuitive results.

    • »
      »
      »
      8 years ago, # ^ | Rev. 14 | 0

      You can try to optimize cache misses in a number of ways.
      1) Allocate memory in a different order. For example, allocate the small arrays first. Or, I think, it may be better to start allocating from the top of the recursion tree to the bottom.
      2) Avoid using the "new" operator for memory allocation. Depending on the actual allocator, it may place allocations on different pages depending on the requested size. An allocation like ptr = heap + heap_pos; heap_pos += size; can serve your recursive memory allocation better.
      3) There is a technique to reduce the number of page faults for binary search in a big array: renumber the array indices in a certain order. In binary search, the most commonly read element is n / 2, then n / 4 and 3n / 4, then n / 8, 3n / 8, etc. Renumber the elements in this way, i.e. use a heap-like complete binary search tree layout to search for elements (see the sketch below). In an ordinary array of big size, the most commonly accessed elements are stored on different pages; with the renumbering above, cache misses occur only at the last stages of the search, on the few last pages.
      PS: the last technique can be adapted to your specific set of arrays. It is best to place the most commonly accessed elements in cache.
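
      A sketch of the layout from point 3 (the code is mine):

      #include <bits/stdc++.h>
      using namespace std;

      // Heap-like ("Eytzinger") layout: b[0] is the median and node k has
      // children 2k+1 and 2k+2, so the first few elements touched by every
      // search share the same few cache lines / pages.
      void eytzinger(const vector<int>& a, vector<int>& b, size_t& i, size_t k) {
          if (k >= a.size()) return;
          eytzinger(a, b, i, 2 * k + 1);   // left subtree: smaller elements
          b[k] = a[i++];                   // in-order fill from the sorted array
          eytzinger(a, b, i, 2 * k + 2);   // right subtree: larger elements
      }

      // Smallest element >= x, or INT_MAX if none (lower_bound on this layout).
      int lowerBound(const vector<int>& b, int x) {
          int res = INT_MAX;
          for (size_t k = 0; k < b.size(); )
              if (b[k] >= x) { res = b[k]; k = 2 * k + 1; }
              else k = 2 * k + 2;
          return res;
      }

      int main() {
          vector<int> a = {1, 3, 5, 7, 9, 11, 13};   // sorted input
          vector<int> b(a.size());
          size_t i = 0;
          eytzinger(a, b, i, 0);                     // b = {7, 3, 11, 1, 5, 9, 13}
          assert(lowerBound(b, 6) == 7);
      }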

»
8 years ago, # | +3

Can anybody tell me where I can find problems of this type?

»
8 years ago, # | +2

A similar effect occurs with the rank heuristic for DSU: the O(a(n)) theoretical solution with the rank heuristic works slower than the O(log(n)) solution without this heuristic.
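
For reference, a sketch of the two variants being compared (my reconstruction: union by rank plus path compression vs path compression alone):

#include <bits/stdc++.h>
using namespace std;

struct Dsu {
    vector<int> p, r;
    Dsu(int n) : p(n), r(n, 0) { iota(p.begin(), p.end(), 0); }

    int find(int v) { return p[v] == v ? v : p[v] = find(p[v]); } // path compression

    // With union by rank: O(a(n)) amortized, but touches an extra array.
    void uniteByRank(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return;
        if (r[a] < r[b]) swap(a, b);
        p[b] = a;
        if (r[a] == r[b]) ++r[a];
    }

    // Without it: O(log n) amortized with path compression alone,
    // but fewer memory accesses per operation.
    void unite(int a, int b) { p[find(a)] = find(b); }
};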

»
8 years ago, # | 0

I think such a difference may easily be produced by uArch behavior on different code flows. Basically, this can be a question of the constants in the algorithms' complexity rather than the complexity function itself. It is an interesting question, and if I have spare time this weekend I'll run a uArch/memory analysis on these samples and report back with my findings.

»
8 years ago, # | 0

I'll just back up the others' answers. It's probably the same reason why even the leading prime sieves use the Eratosthenes sieve and not the Atkin sieve, even though the former is slower at O(n log log n) compared to O(n). Atkin uses so many wheel optimizations and skips so many remainders that the cache tends to miss all the time.

Practice is stranger than theory, just as reality is stranger than fiction.

»
8 years ago, # | +3

Any problems and good tutorials for this technique? (for both the standard approach and fractional cascading)