Hello everyone.

Recent codechef long challenge had a task about queries on paths inside a tree: short, given a tree, each vertex contains a linear function, and given $$$q$$$ queries of types "add given linear $$$f$$$ to all vertices of the path between $$$u$$$ and $$$v$$$" and "calculate maximum of $$$f(z)$$$ where $$$z$$$ is given and $$$f$$$ is a function in any of vertices on the path between given $$$u$$$ and $$$v$$$", $$$n$$$ and $$$q$$$ are $$$\leq 10^5$$$, time limit is 4s.

It seems that the intended time complexity is $$$O(n\sqrt{n}\log{n})$$$, but I heard that it required efforts to get AC with this solution. On the other hand, it's easy to get ac with an $$$O(nq)$$$ solution.

Don't get it wrong: if we iterate over a path just, for example, by going from both endpoints towards the root, it gets tl (even with pragmas): submission. On the other hand, one can use heavy-light decomposition (this particular implementation is not necessary, I suppose, but it's still very nice) first (that is, sort every vertex' children by their subtree sizes), then rearrange vertices by their dfs in-order, and voilà, every path is now a union of about $$$\sim\log{n}$$$ consecutive segments. If we now do basically the same stupid implementation of what we are asked for in the statement, we get ac for 1.6s with pragmas and even ac for 2.6s without pragmas.

Hope this trick is not a mutual knowledge (though it's definitely not a common knowledge). I found it so simple and potentially usable that it deserves a post in my opinion.

For other people like me who didn't get the idea behind this when skimming through the post, the main goal of doing HLD is to have the vertices from a path continuous in memory, as opposed to being all over the place like when you're iterating over parent links.

This is also particularly useful when optimizing a tree dp. Usually the solution is built in BFS order and having neighbours' table close to the one you're currently building can be a good improvement. See the fastest solutions here.

Unfortunately it’s my favorite trick too :) Similar thing also goes to centroid decomposition as well.

Suppose we are computing some functions for all number of paths. Naive DFS will have $$$O(n^2)$$$ cache miss. However, say we can express all path passing centroid, as a combination of two path from centroid. Then we can just list them in array and do a quadratic loop. Usually this quadratic loop should be optimized using some data structure, but naive implementatoom also gives $$$O(n\log n)$$$ cache miss. I believe the latter is about 10-20x faster.

And this is why we should not make problems with $$$n^{1.5}\log$$$ time complexity.

That's a first time I see somebody have analysis on number of a cache misses better than "well, we go through all the data in right order so it should be fast(er)" but the interesting part: if I comment out sort part in the solution it still gets AC in 1.67 (and probably may be even a bit more faster if I delete all the code that actually calculates sizes which I am lazy to perform) so this time "by sorting we make paths closer in memory" analysis seems to be enough

By cache miss I just meant the total size of graph traversed by DFS. I think the collected path information will be small enough to fit in L2 or L3, but I’m not an expert on this topic.

Interestingly, I got a similar idea here :)

Even the problem with $$$n\sqrt n$$$ time complexity was accepted by a $$$n^2$$$ solution（problem and webmaster's solution）

Excuse me, I didn't understand why your solution is faster than $$$O(nm)$$$. Do you visit all vertices on simple path for each query?

My solution used this technique with $$$O(\sqrt{n})$$$ for update and $$$O(\sqrt{n}$$$ $$$log$$$ $$$n)$$$ to get answer.

The complexity is still $$$O(nq)$$$, but there is only $$$O(n \log n)$$$ caches miss, which makes it very fast in practice.

Wow, thanks. It's really cool trick. I tried to cut off solutions with bad time complexity (what is hard when main complexity is $$$O(n\sqrt{n}$$$ $$$log n)$$$), but I didn't expect this.

Do the tests include a case in which an $$$O(nq)$$$ approach will incur close to the worst case of $$$10^{10}$$$ function evaluations? It seems like that many operations should be too slow for 4 seconds no matter how basic and cache-friendly they are.

Yep, tests where tree is a bamboo must be so close for maximum test for this solution, and the last subtask also contains this tests.

Recently, I've prepaired $$$O(n^2)$$$ solution that used slow sort on array of $$$10^5$$$ ints and should get TLE. So, I've implemented simple sort that makes $$$\frac{10^{10}}{2}$$$ "swap if less" operations.

And it works ~3 sec on CF on the worst case. So, basic enough and cache-friendly operations are fast enough in practice.

Can you please explain why there are only O(nlogn) caches miss ?

I'm not into HLD especially, but look — a modern memory model has multiple cache levels: registers (in the core), L1, L2, L3 and RAM itself. The "further" our target variable from the core, the slower (waaaaaay too slower) it is to obtain it. The CPU firstly looks for the variable in cache levels and only in case of misses goes to RAM.

During obtaining the variable, cache levels load into itself not only the variable, but also its memory block, where it belongs. The CPU loads inside L3 a bigger block, inside L2 there is a smaller subblock of a block, and so on.

That's why, for example, it's way too faster to have a [2][2][10][1000] array in a DP task instead of [1000][10][2][2].

As for HLD, in each path from a vertex to a vertex, we have here a O(logn) different subpaths, and each subpath is a consequtive segment in memory. So once the CPU captures the segment, it's faster to get values from it.

Hope you get the point, but you may want to read some articles in Wiki or whatever, to understand it deeply.

One more problem for this trick: http://codeforces.com/contest/925/problem/E (1.1s AC with 5s TL)