[Tutorial] Theoretically Faster HLD and Centroid Decomposition

This blog was going to be a translation of $$$[1]$$$ (which is in Chinese), but I ended up deviating so much from the source material that it would not be appropriate to call it a translation. Anyways, I am honestly amazed by how ahead of its time this paper is. In its abstract, it starts by saying that the state of the art for tree path queries is $$$O(n \log n + q \sqrt n \log n)$$$, which I honestly can't fathom how one would arrive at, then proceeds to claim that it has found an $$$O((n+q) \log n)$$$ algorithm. All the more, in 2007.

While I was in the process of writing this blog, smax sniped me and posted $$$[9]$$$. However, we will focus on static trees only.

Thanks to oolimry, Everule and timreizin for proofreading.

Note to reader: I am expecting the reader to be familiar with what splay trees and HLD are. You do not really need to know why splay trees are $$$O(\log n)$$$ amortised but you should know what the zig, zig-zig and zig-zag operations are. Also, when I say some algorithm has some complexity, I usually mean that it has that complexity amortised. You can probably tell from context which ones are amortised complexity. Also, there is basically $$$0$$$ reason to use the techniques described in this blog in practice (except maybe in interactive problems?), as the constant factor is too high. You can scroll to the bottom to see the benchmarks. Then again, my implementation of normal HLD uses a really fast segment tree and my implementation of the $$$O((n+q) \log n)$$$ algorithm might have a really big constant factor.

Splay Trees

First, we want to talk about splay trees $$$[2]$$$, specifically why splay trees have amortised $$$O(\log n)$$$ complexity for splaying some vertex $$$v$$$.

Define:

$$$s(v)$$$ as the size of the subtree of $$$v$$$
$$$r(v)=\log s(v)$$$, the rank function

There are $$$3$$$ types of operations we can perform on a vertex $$$x$$$ when we are splaying it to the root: zig, zig-zig and zig-zag. Let $$$r$$$ and $$$r'$$$ be the rank functions before and after we perform the operation. The main results we want to show are that:

zig has a time complexity of $$$O(r'(x)-r(x)+1)$$$ amortised
zig-zig and zig-zag have a time complexity of $$$O(r'(x)-r(x))$$$ amortised

I will not prove these as you can find the proof in $$$[2]$$$. But the point is that through some black magic, the only operation that has a cost with a constant term is a zig operation, which we only perform once. Somehow, we manage to remove all constant terms from our zig-zig and zig-zag operations.

Let $$$r_i$$$ be the rank function after performing the $$$i$$$-th operation. Suppose that we have performed $$$k$$$ operations to splay some vertex $$$x$$$, then the total time complexity is $$$O((r_{k}(x)-r_{k-1}(x))+(r_{k-1}(x)-r_{k-2}(x))+\ldots+(r_{1}(x)-r_{0}(x))+1)=O(r_k(x)-r_0(x)+1)$$$. The black magic of removing all constant terms from our zig-zig and zig-zag operations is that no matter how many zig-zig and zig-zag operations we perform, they don't have any effect on the final complexity, since all the terms cancel out in a telescoping sum (which also explains why performing zigs only would make the complexity blow up, as we should expect). For the case of splay trees, $$$r_k(x)=\log n$$$, since $$$x$$$ becomes the root of the tree, so we obtain the complexity of $$$O(\log n)$$$.

Link-Cut Trees

Now, let's discuss why link cut trees $$$[3]$$$ have a complexity of $$$O(\log n)$$$. Here is an image of link cut trees from $$$[4]$$$.

We split the tree into disjoint chains. Internally, each chain is a splay tree storing elements in the order of depth (lowest depth to the left and highest depth to the right). This way, when we splay $$$x$$$ in its splay tree, the vertices to its left are those with lower depth and the vertices to its right are those with higher depth. $$$2$$$ disjoint chains which are connected by an edge in the original tree are connected by a path-parent edge in the new tree. If chains $$$a$$$ and $$$b$$$ were connected, with $$$a$$$ being the parent of $$$b$$$, then there would be a path-parent edge from the root of the splay tree containing $$$b$$$ to $$$a$$$. That is, for each splay tree, only its root will have a path-parent edge. In the image, the tree on the left is the initial tree, which is stored internally by the structure of splay trees on the right. It is good to mention now that we can actually view these disjoint splay trees as a single giant splay tree by not caring about which edge types join the vertices internally (and not caring about the fact that vertices can only have at most $$$2$$$ children).

An important operation on link-cut trees is the $$$access(x)$$$ operation, which given some $$$x$$$ that we have chosen, changes the chains in such a way that the chain containing the root of the original tree also contains the vertex $$$x$$$. This is done by repeatedly ascending into the highest vertex in each chain, and switching which child is connected to its parent. So the cost of this operation is related to the number of distinct chains on the path from $$$x$$$ to the root of the whole tree. Also, we will shorten the chain of $$$x$$$ to have the deepest vertex be $$$x$$$ for reasons that will be apparent later. In the above image, we have performed $$$access(l)$$$ to get the state being the middle tree.

In pseudo-code, the access function would look something like this:

1. access(x):
2.     splay(x)
3.     if (right(x)!=NULL):                  //disconnect the current child
4.         path_parent(right(x))=x
5.         right(x)=NULL
6.     while (path_parent(x)!=NULL):        //it is now the root
7.         par=path_parent(x)
8.         splay(par)
9.         if (right(par)!=NULL):            //change both child chain's path_parent
10.            path_parent(right(par))=par      
11.        path_parent(x)=NULL
12.        right(par)=x                      //change child chain to new one
13.        x=par
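
For concreteness, here is a minimal C++ sketch of the pseudo-code above. It is not taken from $$$[3]$$$ or $$$[4]$$$: the struct name LinkCutSketch, the array names, the chain helper and the use of index $$$0$$$ as NULL are all assumptions made for illustration. It only supports access on a static forest whose path-parent edges have been filled in by hand.

```cpp
#include <cassert>
#include <vector>

// Vertices are 1..n; index 0 plays the role of NULL (an assumption of this sketch).
struct LinkCutSketch {
    std::vector<int> lc, rc, par, pp; // left child, right child, splay parent, path-parent

    explicit LinkCutSketch(int n) : lc(n + 1), rc(n + 1), par(n + 1), pp(n + 1) {}

    // Rotate x above its splay parent y, preserving the in-order (depth) order.
    void rotate(int x) {
        int y = par[x], z = par[y];
        assert(y != 0);
        if (lc[y] == x) { lc[y] = rc[x]; if (rc[x]) par[rc[x]] = y; rc[x] = y; }
        else            { rc[y] = lc[x]; if (lc[x]) par[lc[x]] = y; lc[x] = y; }
        par[y] = x; par[x] = z;
        if (z) { if (lc[z] == y) lc[z] = x; else rc[z] = x; }
        pp[x] = pp[y]; pp[y] = 0; // only a splay root carries the path-parent edge
    }

    void splay(int x) {
        while (par[x]) {
            int y = par[x], z = par[y];
            // zig-zig rotates the parent first, zig-zag rotates x twice
            if (z) rotate((lc[z] == y) == (lc[y] == x) ? y : x);
            rotate(x); // this is a plain zig when z == 0
        }
    }

    // Follows the pseudo-code line by line; returns the root of the splay tree
    // that now stores the path from the tree root down to the accessed vertex.
    int access(int x) {
        splay(x);
        if (rc[x]) { pp[rc[x]] = x; par[rc[x]] = 0; rc[x] = 0; }
        while (pp[x]) {
            int p = pp[x];
            splay(p);
            if (rc[p]) { pp[rc[p]] = p; par[rc[p]] = 0; }
            pp[x] = 0;
            rc[p] = x; par[x] = p;
            x = p;
        }
        return x;
    }

    // In-order traversal of one splay tree: the chain from shallowest to deepest.
    void chain(int v, std::vector<int>& out) const {
        if (!v) return;
        chain(lc[v], out);
        out.push_back(v);
        chain(rc[v], out);
    }
};
```

To try it, start with every vertex as its own single-vertex chain by setting pp[v] to the parent of v in the original tree, call access, then read the chain of the returned root with chain.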

$$$O(\log^2 n)$$$ Time Complexity Bound

First, let us show that the time complexity of access is bounded by $$$O(\log^2 n)$$$. We will show this by showing that the while loop (lines 6-13) will only loop $$$O(\log n)$$$ times. We will use ideas from HLD for this. Label each edge as either light or heavy. The idea is that it is rare that light edges will be changed to be in a chain.

Let the potential be the number of heavy edges that are not in a chain. Here is how the potential changes for an operation in our access function.

If we change the edge in the chain to a heavy edge, our potential decreases by $$$1$$$.
If we change the edge in the chain to a light edge, our potential increases by at most $$$1$$$ (at most because the edge previously in the chain could have been another light edge, or there may have been no such edge).

The number of times we change an edge in the chain to a light edge is $$$O(\log n)$$$. The potential can therefore increase by at most $$$O(\log n)$$$ in each access, so the number of times the while loop is run is $$$O(\log n)$$$ amortised. Each time the while loop is run, the most time-consuming part is line 8, which we have shown earlier to have a time complexity of $$$O(\log n)$$$, so the entire access function has a complexity of $$$O(\log^2 n)$$$.

$$$O(\log n)$$$ Time Complexity Bound

Here, we want to show that over all iterations of lines 6-13, the total time complexity is bounded by $$$O(\log n)$$$. Let's stop thinking about the splay trees as individual splay trees and imagine the entire structure as a giant splay tree instead. Recall our rank function from earlier. Since we want to regard the entire structure as a giant splay tree, $$$s(x)$$$ does not only refer to the size of the subtree when only considering the splay tree describing the chain that contains $$$x$$$, but to all nodes that are in the subtree of $$$x$$$ when considering path-parent edges too.

We have established earlier that if we perform a splay operation on $$$x$$$, we have the time complexity of $$$O(r_{final}(x)-r_{initial}(x)+1)$$$. Suppose our while loop runs $$$k$$$ times and on the $$$i$$$-th iteration the variable par is vertex $$$par^i$$$ (for convenience, we let $$$par^0=x$$$). Our time complexity would look something like $$$O(r_{final}(par^k)-r_{initial}(par^k)+1+\ldots+r_{final}(par^1)-r_{initial}(par^1)+1+r_{final}(par^0)-r_{initial}(par^0)+1)$$$. Notice that $$$r_{initial}(par^{i+1}) \geq r_{final}(par^i)$$$, because when we begin splaying $$$par^{i+1}$$$, the final position of $$$par^i$$$ is in the subtree of $$$par^{i+1}$$$, so $$$par^{i+1}$$$ obviously has a bigger subtree size than $$$par^i$$$. Therefore, we are able to cancel quite a few terms, giving us a time complexity of $$$O(r_{final}(par^k)-r_{initial}(par^0)+k)$$$. We know that $$$r_{final}(par^k)=\log n$$$ since $$$par^k$$$ becomes the root of the tree and, from the previous analysis on the $$$O(\log^2 n)$$$ time complexity bound, $$$k = O(\log n)$$$ amortised. Therefore, $$$O(r_{final}(par^k)-r_{initial}(par^0)+k)=O(\log n)$$$.

Tree Path Queries

HLD + Segment Tree

This is the solution that I think everyone and their mother knows. Perform HLD on the tree and make a segment tree on each heavy chain. But there are a few things I want to talk about here.

Worst Case for HLD

There are $$$2$$$ styles of implementing this "segment tree on each heavy chain" part. One way, which is more popular because it is easy to code, is to have a single segment tree that stores all heavy chains. Another way is to construct a new segment tree for each heavy chain.

Now, let's think about how we can actually get HLD to its worst case of $$$O(n+q \log^2 n)$$$ for both implementations. We can easily see that a balanced binary tree forces the implementation with a single segment tree into its worst case of $$$O(n+q\log^2 n)$$$, but for the implementation where we have a new segment tree for each heavy path, it only forces $$$O(n+q \log n)$$$ complexity. To force the worst case $$$O(n+q \log^2 n)$$$ complexity, we have to modify the binary tree. The problem with the balanced binary tree is that the segment trees do not take $$$O(\log n)$$$ time since the chains are so short. Ok, so let's make the chains long instead. Specifically, create a balanced binary tree of $$$\sqrt n$$$ vertices, then replace each vertex with a chain of size $$$\sqrt n$$$.

It seems that our complexity would be something like $$$O(n+q \log(\sqrt n) \log(\sqrt n))$$$, which has quite low constant compared to the implementation using a single segment tree on a normal balanced binary tree since $$$\log^2(\sqrt n) = \frac{1}{4} \log^2(n)$$$. Is there a tree that can make the constant higher?

Really Fast Segment Trees

When looking for fast segment tree implementations, most people would probably direct you to the AtCoder Library $$$[5]$$$, where they have implemented a fast lazy segment tree template which maintains an array with elements in the monoid $$$S$$$ and is able to apply operations $$$F$$$ acting on $$$S \to S$$$ on an interval in the array. Although their code is very fast, we can speed it up by assuming that the functions are commutative. That is, for $$$f,g \in F, x \in S, f(g(x))=g(f(x))$$$, which is usually true for the uses of segment trees in competitive programming.

There is actually a way to handle lazy tags in a way that does not change implementation of iterative segment tree too much. The idea is pretty similar to what is mentioned in $$$[6]$$$. We do not propagate lazy tags. Instead, the true value of a node in the segment tree is $$$val_u + \sum\limits_{v \text{ is ancestor of } u} lazy_v$$$ and we apply the lazy tags in a bottom-up fashion when doing queries.

Below is code for segment tree that handles range increment updates and range max queries.

Code

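// Note: the full sources below use "#define int long long", so int is 64-bit here.
struct node{
	int BUF=1000005; //offset of the leaves; valid positions are 1..BUF-2
	int val[2000010],lazy[2000010]; //val[i] already includes lazy[i]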
	
	void get(int i){ val[i]=max(val[i<<1],val[i<<1|1])+lazy[i]; }
	
	void update(int i,int j,int k){ //[i,j]
		i+=BUF-1,j+=BUF+1; //(i,j)
		while (i|j){
			if (j-i>1){
				if (~i&1) val[i^1]+=k,lazy[i^1]+=k;
				if (j&1) val[j^1]+=k,lazy[j^1]+=k;
			}
			i>>=1,j>>=1;
			get(i),get(j);
		}
	}
	
	int query(int i,int j){ //[i,j]
		i+=BUF-1,j+=BUF+1; //(i,j)
		int resl=-1e18,resr=-1e18;
		while (i|j){
			if (j-i>1){
				if (~i&1) resl=max(resl,val[i^1]);
				if (j&1) resr=max(resr,val[j^1]);
			}
			i>>=1,j>>=1;
			resl+=lazy[i],resr+=lazy[j];
		}
		return max(resl,resr);
	}
};
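
As a sanity check of the idea that the true value is $$$val_u$$$ plus the lazy tags of the ancestors, here is a small self-contained variant of the same scheme with run-time sized buffers, cross-checked against a plain array. The names MaxAddTree, pull and matches_brute_force, as well as the choice of buf = n + 2, are assumptions made for this sketch and are not part of the benchmarked code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct MaxAddTree {
    int buf;                          // leaf offset; positions 1..n are usable
    std::vector<long long> val, lazy; // val[u] already includes lazy[u]

    explicit MaxAddTree(int n) : buf(n + 2), val(2 * (n + 2), 0), lazy(2 * (n + 2), 0) {}

    void pull(int i) { val[i] = std::max(val[2 * i], val[2 * i + 1]) + lazy[i]; }

    void update(int l, int r, long long k) { // add k to every position in [l, r]
        assert(1 <= l && l <= r && r <= buf - 2);
        int i = l + buf - 1, j = r + buf + 1; // open interval (i, j)
        while (i | j) {
            if (j - i > 1) {
                if (~i & 1) val[i ^ 1] += k, lazy[i ^ 1] += k;
                if (j & 1) val[j ^ 1] += k, lazy[j ^ 1] += k;
            }
            i >>= 1, j >>= 1;
            pull(i), pull(j); // tags are never pushed down, only pulled up
        }
    }

    long long query(int l, int r) const { // maximum over [l, r]
        assert(1 <= l && l <= r && r <= buf - 2);
        const long long NEG = -(1LL << 60);
        int i = l + buf - 1, j = r + buf + 1;
        long long resl = NEG, resr = NEG;
        while (i | j) {
            if (j - i > 1) {
                if (~i & 1) resl = std::max(resl, val[i ^ 1]);
                if (j & 1) resr = std::max(resr, val[j ^ 1]);
            }
            i >>= 1, j >>= 1;
            resl += lazy[i], resr += lazy[j]; // add the tags of the ancestors
        }
        return std::max(resl, resr);
    }
};

// Replays pseudo-random updates and queries on both the tree and a plain array.
bool matches_brute_force(int n, int rounds) {
    MaxAddTree t(n);
    std::vector<long long> a(n + 1, 0);
    unsigned long long s = 1;
    auto rnd = [&s](int m) {
        s = s * 6364136223846793005ULL + 1442695040888963407ULL;
        return (int)((s >> 33) % m);
    };
    for (int it = 0; it < rounds; it++) {
        int l = 1 + rnd(n), r = 1 + rnd(n);
        if (l > r) std::swap(l, r);
        if (rnd(2)) {
            long long k = rnd(1000) - 500;
            t.update(l, r, k);
            for (int p = l; p <= r; p++) a[p] += k;
        } else if (t.query(l, r) != *std::max_element(a.begin() + l, a.begin() + r + 1)) {
            return false;
        }
    }
    return true;
}
```

Note that the leaf offset does not have to be a power of two for this to work, which is why buf can simply be n + 2 here.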

Link-Cut Tree

Let's just maintain a link-cut tree with some modifications to the underlying splay tree. Specifically, we additionally store a value which is the composition of values in the subtree of the splay tree (we only consider those vertices in the same chain). To perform a query on $$$u$$$ to $$$v$$$ (WLOG $$$u$$$ has lower depth than $$$v$$$), perform $$$access(u)$$$ then $$$access(v)$$$. Let $$$w=lca(u,v)$$$. It is possible that $$$w=u$$$. The below image shows the state of the tree after performing $$$access(u)$$$ and $$$access(v)$$$ respectively.

After performing both accesses, $$$w$$$ is the root of the entire splay tree. Since we have accessed $$$u$$$ before $$$v$$$, $$$w$$$ would have already been in the same chain as the root before we start to access $$$v$$$. At the same time, we must splay $$$w$$$ when we are accessing $$$v$$$. So $$$w$$$ would be the last thing we splay, making it the root of the entire splay tree. Now, it is actually easy to perform path queries from $$$u$$$ to $$$v$$$. If $$$u=w$$$, then we simply query $$$w$$$ and the right child of $$$w$$$ (which is the path to $$$v$$$). If $$$u \neq w$$$, then we have to additionally include the path from $$$u$$$ to $$$w$$$ (but not including $$$w$$$). Luckily, due to the way we have accessed the vertices, this path would be exactly the chain containing $$$u$$$.

HLD + Splay Tree

Let's replace the segment tree in the HLD solution with a splay tree with the same modification to the underlying splay tree as the earlier section. If we need to query the prefix of a splay tree, just splay the vertex then additionally query the left child. If we need to query the subarray of $$$[l,r]$$$ on the splay tree, splay $$$l$$$ to the root and $$$r$$$ to just below the root, then we only have to additionally query the left child of $$$r$$$ (which holds exactly the elements strictly between $$$l$$$ and $$$r$$$). Remember that for an HLD query, we do some prefix queries and at most $$$1$$$ subarray query.

What would be the complexity? It seems like it would be the same $$$O(n + q \log^2 n)$$$. But no, it can be shown that it is actually $$$O(n+q\log n)$$$. Yes, it is not a typo. I still don't believe it myself even though the justification is below.

In the same way we showed that access works in $$$O(\log n)$$$ in link-cut trees, we will do the same here by imagining that there are fake edges from the root of each chain's splay tree to the parent of the chain, so when we define the rank function and count the number of vertices in a subtree of $$$u$$$, we also take into account those vertices connected via fake edges. It is not too hard to see that the time complexity for the path query between $$$a$$$ and $$$b$$$ would be a similar telescoping sum resulting in $$$O(r_{final}(a)+r_{final}(b)-r_{initial}(a)-r_{initial}(b)+k_a+k_b)=O(\log n)$$$.

Balanced HLD

Although the previous $$$2$$$ algorithms have $$$O(n+q\log n)$$$ complexity, they are extremely impractical as splay trees (or any dynamic tree) carry a super huge constant. Is it possible to transfer the essence of what the splay tree is doing into a static data structure?

What was so special about our HLD+splay tree solution that it is able to cut one log when compared to HLD+any other data structure? It's splaying! That is, if a vertex was recently accessed, it would be pushed near the root of the tree even though it may have many light edges on its path to the root. This property of caring about the entire tree structure is unique to splay trees and isn't found in any other method mentioned here, as other data structures treat each of the heavy chains independently.

So, we need to create a data structure that is similar to HLD+segment tree, but instead of splitting each chain based on the number of unweighted nodes (which is what a segment tree does), let's give each node a weight and split based on those weights. Wait, isn't that centroid decomposition?

Indeed, let us first do HLD then take the heavy chain containing the root of the tree. Give each vertex in this heavy chain a weight, which is the number of vertices in its connected component after removing all edges in the heavy chain. And that's pretty much the entire idea. Divide things based on the weighted case. More specifically, given a heavy chain with total weight $$$W$$$, choose a vertex $$$x$$$ such that the left and right sides of the chain (excluding $$$x$$$) each have total weight $$$\leq \frac{W}{2}$$$. Such a vertex always exists (it's the centroid). We set $$$x$$$ as the root of the binary tree and recurse on the left and right children. For the subtrees that are children of the heavy chain, repeat this process and we have constructed our desired tree.
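
The split of a single heavy chain can be sketched as follows. This is only an illustration under stated assumptions: the chain is given as an array of positive weights indexed by position, the names ChainTree, build_range, build_chain_tree and height are made up for this sketch, and ties are broken by taking the first position whose prefix reaches half of the total weight.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct ChainTree {
    int root = -1;
    std::vector<int> left, right; // children by chain position, -1 if absent
};

// Picks the weighted centroid of positions [lo, hi) and recurses on both sides.
// pre is the prefix sum of the weights: pre[i] = w[0] + ... + w[i-1].
int build_range(int lo, int hi, const std::vector<long long>& pre, ChainTree& t) {
    if (lo >= hi) return -1;
    long long total = pre[hi] - pre[lo];
    // First position x such that the weight of [lo, x] is at least total / 2.
    // Everything before x then weighs less than total / 2, and everything
    // after x weighs at most total / 2.
    auto it = std::lower_bound(pre.begin() + lo + 1, pre.begin() + hi + 1,
                               pre[lo] + (total + 1) / 2);
    int x = int(it - pre.begin()) - 1;
    t.left[x] = build_range(lo, x, pre, t);
    t.right[x] = build_range(x + 1, hi, pre, t);
    return x;
}

ChainTree build_chain_tree(const std::vector<long long>& w) {
    int m = (int)w.size();
    std::vector<long long> pre(m + 1, 0);
    for (int i = 0; i < m; i++) {
        assert(w[i] > 0); // assumption: every chain vertex has positive weight
        pre[i + 1] = pre[i] + w[i];
    }
    ChainTree t;
    t.left.assign(m, -1);
    t.right.assign(m, -1);
    t.root = build_range(0, m, pre, t);
    return t;
}

// Number of vertices on the longest root-to-leaf path below v.
int height(const ChainTree& t, int v) {
    return v < 0 ? 0 : 1 + std::max(height(t, t.left[v]), height(t, t.right[v]));
}
```

With all weights equal this degenerates into the usual midpoint split, while a vertex carrying most of the weight (a vertex with a big light subtree hanging off it) is pulled up to the root of the chain's binary tree.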

Now, we need to show that queries here are indeed $$$O(\log n)$$$. First we need to think about how we actually perform queries in our HLD structure. We know from HLD+segment tree that the query loop for the path between $$$a$$$ and $$$b$$$ is this:

if $$$a$$$ and $$$b$$$ are in the same heavy chain, query $$$in[a]$$$ to $$$in[b]$$$
if $$$a$$$ is deeper than $$$b$$$, query $$$in[hpar[a]]$$$ to $$$in[a]$$$
if $$$b$$$ is deeper than $$$a$$$, query $$$in[hpar[b]]$$$ to $$$in[b]$$$

Of these $$$3$$$ queries, only the first query type is a sub-array query on the heavy chain, the rest of them are queries on prefixes of the heavy chain. Furthermore, the first query type is only done once. Now, what is the time complexity for a prefix query on a binary tree? It may be tempting to say it is "just $$$O(\log n)$$$ duh", but we can improve it.

For example, if we want to query the values on the prefix where the last vertex is the one labelled $$$x$$$, we simply perform a walk from vertex $$$x$$$ to the root of the tree and add the costs of those vertices and their left children where appropriate (if we have walked up from the left child, don't add stuff). Walking from vertex $$$x$$$ to the root of the tree takes $$$O(d_x-d_{root}+1)$$$ time. For the case of sub-array queries on $$$[x,y]$$$ we can see that it is walking from $$$x$$$ and $$$y$$$ respectively to the root of the tree which will take $$$O(d_x+d_y-d_{root}-d_{root}+1+1)$$$.
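
Here is a small sketch of the upward walk for a prefix query, using sums instead of maxima to keep it short. It is an illustration only: the struct PrefixWalkTree and its members are hypothetical names, and the tree is built by a plain midpoint split, since the walk itself does not care how the binary tree was chosen.

```cpp
#include <cassert>
#include <vector>

struct PrefixWalkTree {
    std::vector<int> parent, left, right;
    std::vector<long long> val, sum; // sum[v] = total of val over the subtree of v
    int root = -1;

    explicit PrefixWalkTree(const std::vector<long long>& a)
        : parent(a.size(), -1), left(a.size(), -1), right(a.size(), -1),
          val(a), sum(a.size(), 0) {
        root = build(0, (int)a.size(), -1);
    }

    int build(int lo, int hi, int par) { // midpoint split over positions [lo, hi)
        if (lo >= hi) return -1;
        int x = lo + (hi - lo) / 2;
        parent[x] = par;
        left[x] = build(lo, x, x);
        right[x] = build(x + 1, hi, x);
        sum[x] = val[x] + subtree(left[x]) + subtree(right[x]);
        return x;
    }

    long long subtree(int v) const { return v < 0 ? 0 : sum[v]; }

    // Sum of positions 0..x. If steps is given, it receives the number of
    // vertices visited by the walk, which is d_x - d_root + 1.
    long long prefix(int x, int* steps = nullptr) const {
        long long res = val[x] + subtree(left[x]);
        int visited = 1;
        for (int cur = x; parent[cur] != -1; cur = parent[cur], visited++) {
            int p = parent[cur];
            // Coming up from the right child: p and its left subtree lie before x.
            if (right[p] == cur) res += val[p] + subtree(left[p]);
            // Coming up from the left child adds nothing.
        }
        if (steps) *steps = visited;
        return res;
    }
};
```

A sub-array query on $$$[x,y]$$$ can be handled with two such walks, one from each endpoint.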

Let's change the definition of $$$d_a$$$ slightly to be the depth when we consider the entire structure of our tree (so we consider light edges too when calculating the depth). Let $$$x_{root}$$$ be the root of the heavy chain containing vertex $$$x$$$. The time complexity for prefix or suffix queries becomes $$$O(d_x-d_{x_{root}}+1)$$$ and for sub-array queries it becomes $$$O(d_x+d_y-d_{x_{root}}-d_{y_{root}}+2)$$$. Then we can see that the time complexity is some telescoping sum here too, since when we traverse a light edge, the depth of the current vertex decreases. Actually we don't need the telescoping sum justification here, as we can just prove it by saying that querying the simple path from $$$x$$$ to $$$y$$$ only requires us to move $$$x$$$ and $$$y$$$ upwards (and never downwards). In the end, the time complexity only depends on the depth of the endpoints of our queries. So, the only thing we need to prove is that the depth of the entire tree is bounded by $$$O(\log n)$$$.

But the proof of that is exactly the proof that centroid decomposition gives us a centroid tree of depth at most $$$O(\log n)$$$. Maybe I should elaborate more. Let's consider the new tree structure we have built. There are $$$2$$$ types of edges, heavy edges and light edges. When we traverse down a heavy edge, the size of the subtree would be at least halved due to how we have chosen to split the heavy chain, so there are at most $$$O(\log n)$$$ heavy edges on the path from some vertex to the root on our constructed tree. However, when we traverse down a light edge, there is no guarantee about what happens to the size of the subtree, except that it decreases by at least $$$1$$$, which is pretty useless. Luckily for us, we know that every vertex has at most $$$O(\log n)$$$ light edges on the path to the root, because that's how HLD works. So we can determine that the depth of the tree is $$$O(\log n)+O(\log n)=O(\log n)$$$. We have shown that our complexity for querying is $$$O(\log n)$$$. Also, it is not too hard to show that our complexity of construction of this structure is $$$O(n \log n)$$$, since constructing the tree from a single heavy chain is literally centroid decomposition.

The depth of the tree is $$$O(\log n)$$$, but what is the constant? The numbers of heavy and light edges are both at most $$$\log n$$$, so our analysis from earlier gives us a $$$2 \log n$$$ bound on the height of the tree. Can we do better? It turns out no, this bound is sharp (up to $$$\pm 1$$$ maybe). Here is how we can construct a tree that forces the depth to be $$$2 \log n$$$ by our construction.

Let $$$T(s)$$$ be our constructed tree that has $$$s$$$ vertices. It is defined recursively by the below image.

Let $$$d(s)$$$ denote the depth of $$$T(s)$$$. As a base case, $$$d(0)=0$$$. We also have $$$d(s)=2+d(\frac{s-2}{2})$$$ since the heavy chain of $$$T(s)$$$ would be on the right side of the tree, so $$$a$$$ would be connected to its left child by a light edge in the original tree. We can have the root of the heavy chain be vertex $$$b$$$ in our constructed tree (it can be vertex $$$a$$$ but we want to assume the worst case) so that in our construction tree, $$$a$$$ would have a depth of $$$2$$$, requiring us to traverse $$$2$$$ edges to get to $$$T(\frac{s-2}{2})$$$. Therefore, it is easy to see that we can make $$$T(s)$$$ have a height of about $$$2 \log n$$$.
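
To see the constant concretely, the recurrence can be evaluated directly. The function name worst_depth is made up for this sketch, and it assumes that $$$\frac{s-2}{2}$$$ is rounded down and that sizes below $$$2$$$ are treated as the base case, neither of which is spelled out above.

```cpp
#include <cassert>

// Evaluates d(0) = 0, d(s) = 2 + d((s - 2) / 2) iteratively.
// Assumptions of this sketch: integer division rounds down, and s < 2 is a base case.
int worst_depth(long long s) {
    assert(s >= 0);
    int depth = 0;
    while (s >= 2) {
        depth += 2;        // two edges are traversed to reach the smaller copy
        s = (s - 2) / 2;   // size of the recursive part T((s - 2) / 2)
    }
    return depth;
}
```

For sizes of the form $$$s=2^k-2$$$ the size after one step is $$$2^{k-1}-2$$$, so the loop runs $$$k-1$$$ times and $$$d(s)=2(k-1)=2\log_2(s+2)-2$$$, which matches the $$$2 \log n$$$ bound up to an additive constant.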

Path and Subtree Queries

With our normal HLD+segment tree query, we can easily handle both path and subtree queries $$$[7]$$$.

Can we do it for our new structure? Yes.

Firstly, one of the problems of subtree operations is that if the number of children is very large, it will be hard to compute the aggregate values of children. This is the reason for the difficulty of $$$[9]$$$. But we are not doing operations on a dynamic tree, we can simply augment our tree to make the number of children small for our case.

As in $$$[9]$$$, the idea is to binarize the tree, however since we do not have to care about the depth of the augmented tree, we can simply augment it into a caterpillar graph.

Subtree operations are the same on the original tree and the main tree. The only case we have to handle differently is path operations. For example, the path $$$A \to B$$$ passes through $$$X$$$ in the original tree but not in the augmented tree. However, we can solve this by checking if the lca of $$$A$$$ and $$$B$$$ is a fake vertex. If so, we separately process the actual lca in the original tree.

The original lazy tag we used when we only had path queries only applies the value to the children in the real tree. However with subtree queries, we need a new lazy tag that applies the value to all children, which includes children connected via light edges. Modifying the code to add another lazy tag is not hard, just very tedious.

More specifically, we have $$$2$$$ lazy tags, $$$lazy1$$$ and $$$lazy2$$$. $$$lazy1$$$ is applied to the children in the same heavy chain while $$$lazy2$$$ is applied to all children regardless of whether or not they are in the same heavy chain.

Then, the true value of a vertex $$$u$$$ in the balanced binary tree is $$$val_u + \sum\limits_{\substack{v \text{ is ancestor of } u \\ v \text{ and } u \text{ same heavy chain}} }lazy1_v + \sum\limits_{v \text{ is ancestor of } u }lazy2_v$$$.
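
The formula can be read off as a walk over the ancestors in the balanced binary tree. The sketch below is only meant to illustrate the formula, not the actual implementation: the struct TagNode and the function true_value are hypothetical, and it assumes that $$$val_u$$$ already includes the tags stored at $$$u$$$ itself, so that only proper ancestors contribute.

```cpp
#include <cassert>
#include <vector>

struct TagNode {
    int parent;      // parent in the balanced binary tree, -1 for the root
    int chain;       // id of the heavy chain this vertex belongs to
    long long val;   // stored value (assumed to include this vertex's own tags)
    long long lazy1; // pending add for descendants in the same heavy chain
    long long lazy2; // pending add for all descendants
};

// Adds up the tags of the proper ancestors of u, as in the formula above.
long long true_value(const std::vector<TagNode>& t, int u) {
    assert(0 <= u && u < (int)t.size());
    long long res = t[u].val;
    for (int v = t[u].parent; v != -1; v = t[v].parent) {
        if (t[v].chain == t[u].chain) res += t[v].lazy1; // same heavy chain only
        res += t[v].lazy2;                               // always applies
    }
    return res;
}
```

A path update would then only touch $$$lazy1$$$, while a subtree update would touch $$$lazy2$$$, so that vertices hanging off the chain via light edges are affected only by the latter.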

Applying these changes to the original algorithm is not difficult, just very tedious.

Benchmarks

Here are the benchmarks for the various implementations of the tree path queries, so that you have a better idea of the practical performance of the things I have described and realize that the algorithm is practically pretty useless (except maybe for some interactive problems similar to $$$[8]$$$).

The problem is: given a tree with $$$n=10^6$$$ vertices where all vertices initially have weight $$$0$$$, handle the following $$$q=5 \cdot 10^6$$$ operations of $$$4$$$ types:

1 u v w increase the weights of all vertices on the simple path from $$$u$$$ to $$$v$$$ by $$$w$$$
2 u v find the maximum weight of any vertex on the simple path from $$$u$$$ to $$$v$$$
3 u w increase the weights of all vertices on the subtree of $$$u$$$ by $$$w$$$
4 u find the maximum weight of any vertex on the subtree of $$$u$$$

It is bench-marked on my desktop with Ryzen 7 3700X CPU with compilation command g++ xxx.cpp -O2.

Note that the difference between balanced HLD 1 and 2 is that balanced HLD 2 is able to handle all types of queries while balanced HLD 1 is only able to handle the first $$$2$$$ query types.

Benchmarks when there are only query types $$$1$$$ and $$$2$$$.

	HLD + segment tree (single segment tree)	HLD + segment tree (many segment tree)	HLD + segment tree (many segment tree, ACL)	link-cut tree	HLD + splay tree	Balanced HLD 1	Balanced HLD 2
Time Complexity	$$$O(n + q \log ^2 n)$$$	$$$O(n + q \log ^2 n)$$$	$$$O(n + q \log ^2 n)$$$	$$$O(n+q \log n)$$$	$$$O(n + q \log n)$$$	$$$O((n + q) \log n)$$$	$$$O((n+q) \log n)$$$
Random ($$$wn=0$$$)	$$$14.31~s$$$	$$$8.91~s$$$	$$$10.89~s$$$	$$$8.66~s$$$	$$$9.77~s$$$	$$$7.90~s$$$	$$$13.38~s$$$
Random ($$$wn=-10$$$)	$$$10.84~s$$$	$$$6.65~s$$$	$$$6.97~s$$$	$$$5.78~s$$$	$$$5.18~s$$$	$$$4.54~s$$$	$$$11.17~s$$$
Random ($$$wn=10$$$)	$$$15.14~s$$$	$$$10.73~s$$$	$$$12.62~s$$$	$$$10.69~s$$$	$$$13.25~s$$$	$$$10.04~s$$$	$$$13.74~s$$$
Binary Tree ($$$k=1$$$)	$$$21.45~s$$$	$$$13.09~s$$$	$$$17.49~s$$$	$$$12.40~s$$$	$$$13.62~s$$$	$$$10.59~s$$$	$$$13.96~s$$$
Binary Tree ($$$k=5$$$)	$$$20.48~s$$$	$$$14.64~s$$$	$$$18.12~s$$$	$$$11.82~s$$$	$$$15.16~s$$$	$$$11.80~s$$$	$$$14.90~s$$$

Benchmarks when there are all $$$4$$$ query types.

	HLD + segment tree (single segment tree)	Balanced HLD 2
Time Complexity	$$$O(n + q \log ^2 n)$$$	$$$O((n+q) \log n)$$$
Random ($$$wn=0$$$)	$$$9.06~s$$$	$$$10.61~s$$$
Random ($$$wn=-10$$$)	$$$7.12~s$$$	$$$8.86~s$$$
Random ($$$wn=10$$$)	$$$9.92~s$$$	$$$11.15~s$$$
Binary Tree ($$$k=1$$$)	$$$13.47~s$$$	$$$11.80~s$$$
Binary Tree ($$$k=5$$$)	$$$11.67~s$$$	$$$11.63~s$$$

I am unsure why a value of $$$k$$$ closer to $$$\sqrt n$$$ made the first $$$3$$$ codes all faster. Maybe there is something wrong with my generator or is the segment tree just too fast? Also, IO takes about $$$1~s$$$.

Here are my codes (generators + solutions). They have not been stress tested and are not guaranteed to be correct. They are only here for reference.

gen_random.cpp

#include "testlib.h"

#include <bits/stdc++.h>
using namespace std;

int main(int argc, char* argv[]){
	registerGen(argc, argv, 1);
	
	int n=atoi(argv[1]),q=atoi(argv[2]),wn=atoi(argv[3]),qt=atoi(argv[4]);
	
	vector<pair<int,int> > edges;
	for (int x=1;x<n;x++) edges.push_back({rnd.wnext(x,wn),x});
	
	shuffle(edges.begin(),edges.end());
	vector<int> perm(n);
	iota(perm.begin(),perm.end(),1);
	shuffle(perm.begin(),perm.end());
	
	cout<<n<<" "<<q<<endl;
	for (auto [a,b]:edges){
		if (rnd.next(2)) cout<<perm[a]<<" "<<perm[b]<<endl;
		else cout<<perm[b]<<" "<<perm[a]<<endl;
	}
	
	for (int x=0;x<q;x++){
		int a=rnd.next(1,qt);
		if (a==1){
			int u=rnd.next(1,n),v=rnd.next(1,n),w=rnd.next(1,1000000000);
			cout<<a<<" "<<u<<" "<<v<<" "<<w<<endl;
		}
		else if (a==2){
			int u=rnd.next(1,n),v=rnd.next(1,n);
			cout<<a<<" "<<u<<" "<<v<<endl;
		}
		else if (a==3){
			int u=rnd.next(1,n),w=rnd.next(1,1000000000);
			cout<<a<<" "<<u<<" "<<w<<endl;
		}
		else{
			int u=rnd.next(1,n);
			cout<<a<<" "<<u<<endl;
		}
	}
}

gen_binary.cpp

#include "testlib.h"

#include <bits/stdc++.h>
using namespace std;

int main(int argc, char* argv[]){
	registerGen(argc, argv, 1);
	
	int n=atoi(argv[1]),q=atoi(argv[2]),k=atoi(argv[3]),qt=atoi(argv[4]);
	
	vector<pair<int,int> > edges;
	queue<int> qu;
	qu.push(0),qu.push(0);
	int IDX=0;
	
	while (IDX<n-1){
		int u=qu.front();
		qu.pop();
		
		for (int x=0;x<k;x++){
			edges.push_back({u,++IDX});
			u=IDX;
			if (IDX==n-1) break;
		}
		qu.push(IDX),qu.push(IDX);
	}
	
	shuffle(edges.begin(),edges.end());
	vector<int> perm(n);
	iota(perm.begin(),perm.end(),1);
	shuffle(perm.begin()+1,perm.end()); //keep 1 as the root
	
	cout<<n<<" "<<q<<endl;
	for (auto [a,b]:edges){
		if (rnd.next(2)) cout<<perm[a]<<" "<<perm[b]<<endl;
		else cout<<perm[b]<<" "<<perm[a]<<endl;
	}
	
	for (int x=0;x<q;x++){
		int a=rnd.next(1,qt);
		if (a==1){
			int u=rnd.next(1,n),v=rnd.next(1,n),w=rnd.next(1,1000000000);
			cout<<a<<" "<<u<<" "<<v<<" "<<w<<endl;
		}
		else if (a==2){
			int u=rnd.next(1,n),v=rnd.next(1,n);
			cout<<a<<" "<<u<<" "<<v<<endl;
		}
		else if (a==3){
			int u=rnd.next(1,n),w=rnd.next(1,1000000000);
			cout<<a<<" "<<u<<" "<<w<<endl;
		}
		else{
			int u=rnd.next(1,n);
			cout<<a<<" "<<u<<endl;
		}
	}
}

hld_single.cpp

#include <bits/stdc++.h>
using namespace std;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

struct node{
	int BUF=1000005;
	int val[2000010],lazy[2000010];
	
	void get(int i){ val[i]=max(val[i<<1],val[i<<1|1])+lazy[i]; }
	
	void update(int i,int j,int k){ //[i,j]
		i+=BUF-1,j+=BUF+1; //(i,j)
		while (i|j){
			if (j-i>1){
				if (~i&1) val[i^1]+=k,lazy[i^1]+=k;
				if (j&1) val[j^1]+=k,lazy[j^1]+=k;
			}
			i>>=1,j>>=1;
			get(i),get(j);
		}
	}
	
	int query(int i,int j){ //[i,j]
		i+=BUF-1,j+=BUF+1; //(i,j)
		int resl=-1e18,resr=-1e18;
		while (i|j){
			if (j-i>1){
				if (~i&1) resl=max(resl,val[i^1]);
				if (j&1) resr=max(resr,val[j^1]);
			}
			i>>=1,j>>=1;
			resl+=lazy[i],resr+=lazy[j];
		}
		return max(resl,resr);
	}
} root;

int n,q;
vector<int> al[1000005];

int ss[1000005];
int in[1000005];
int out[1000005];
int d[1000005];
int pp[1000005];
int hp[1000005];

int _TIME=0;

void dfs_ss(int i,int p){
	ss[i]=1;
	for (auto &it:al[i]){
		if (it==p) continue;
		dfs_ss(it,i);
		ss[i]+=ss[it];
		if (al[i][0]==p || ss[al[i][0]]<ss[it]) swap(al[i][0],it);
	}
}

void dfs_hld(int i,int p){
	in[i]=++_TIME;
	
	for (auto it:al[i]){
		if (it==p) continue;
		
		d[it]=d[i]+1;
		if (it==al[i][0]) hp[it]=hp[i];
		else hp[it]=it;
		pp[it]=i;
		
		dfs_hld(it,i);
	}
	
	out[i]=_TIME;
}

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	dfs_ss(1,-1);
	hp[1]=1;
	dfs_hld(1,-1);
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				root.update(in[hp[a]],in[a],c);
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			root.update(in[a],in[b],c);
		}
		else if (a==2){
			cin>>a>>b;
			int res=-1e18;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				res=max(res,root.query(in[hp[a]],in[a]));
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			res=max(res,root.query(in[a],in[b]));
			
			cout<<res<<endl;
		}
		else if (a==3){
			cin>>a>>c;
			root.update(in[a],out[a],c);
		}
		else{
			cin>>a;
			cout<<root.query(in[a],out[a])<<endl;
		}
	}
}

hld_many.cpp

#include <bits/stdc++.h>
using namespace std;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

int __dat[8000005];
int *ptr=__dat;

struct node{
	int BUF;
	int *val,*lazy;
	
	node(){}
	node(int n){
		n++;
		
		BUF=n;
		val=ptr; ptr+=n<<1;
		lazy=ptr; ptr+=n<<1;
	}
	
	void get(int i){ val[i]=max(val[i<<1],val[i<<1|1])+lazy[i]; }
	
	void update(int i,int j,int k){
		i+=BUF-1,j+=BUF+1; //(i,j)
		while (i|j){
			if (j-i>1){
				if (~i&1) val[i^1]+=k,lazy[i^1]+=k;
				if (j&1) val[j^1]+=k,lazy[j^1]+=k;
			}
			i>>=1,j>>=1;
			get(i),get(j);
		}
	}
	
	int query(int i,int j){
		i+=BUF-1,j+=BUF+1; //(i,j)
		int resl=-1e18,resr=-1e18;
		while (i|j){
			if (j-i>1){
				if (~i&1) resl=max(resl,val[i^1]);
				if (j&1) resr=max(resr,val[j^1]);
			}
			i>>=1,j>>=1;
			resl+=lazy[i],resr+=lazy[j];
		}
		return max(resl,resr);
	}
};

int n,q;
vector<int> al[1000005];

int ss[1000005];
int in[1000005];
int d[1000005];
int pp[1000005];
int hp[1000005];
int num[1000005];

void dfs_ss(int i,int p){
	ss[i]=1;
	for (auto &it:al[i]){
		if (it==p) continue;
		dfs_ss(it,i);
		ss[i]+=ss[it];
		if (al[i][0]==p || ss[al[i][0]]<ss[it]) swap(al[i][0],it);
	}
}

void dfs_hld(int i,int p){
	num[hp[i]]++;
	
	for (auto it:al[i]){
		if (it==p) continue;
		
		d[it]=d[i]+1;
		if (it==al[i][0]) hp[it]=hp[i],in[it]=in[i]+1;
		else hp[it]=it;
		pp[it]=i;
		
		dfs_hld(it,i);
	}
}

node root[1000005];

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	dfs_ss(1,-1);
	hp[1]=1;
	dfs_hld(1,-1);
	
	rep(x,1,n+1) if (hp[x]==x) root[x]=node(num[x]);
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				root[hp[a]].update(0,in[a],c);
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			root[hp[a]].update(in[a],in[b],c);
		}
		else{
			cin>>a>>b;
			int res=-1e18;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				res=max(res,root[hp[a]].query(0,in[a]));
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			res=max(res,root[hp[a]].query(in[a],in[b]));
			
			cout<<res<<endl;
		}
	}
}

hld_many_ACL.cpp

#include <bits/stdc++.h>
#include <atcoder/lazysegtree>
using namespace std;
using namespace atcoder;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

using S=int;
using F=int;

S op(S l,S r){ return max(l,r); }
S e(){ return 0; }
S mapping(F l,S r){ return l+r; }
F composition(F l,F r){ return l+r; }
F id(){ return 0; }
using segtree=lazy_segtree<S, op, e, F, mapping, composition, id>;

int n,q;
vector<int> al[1000005];

int ss[1000005];
int in[1000005];
int d[1000005];
int pp[1000005];
int hp[1000005];
int num[1000005];

void dfs_ss(int i,int p){
	ss[i]=1;
	for (auto &it:al[i]){
		if (it==p) continue;
		dfs_ss(it,i);
		ss[i]+=ss[it];
		if (al[i][0]==p || ss[al[i][0]]<ss[it]) swap(al[i][0],it);
	}
}

void dfs_hld(int i,int p){
	num[hp[i]]++;
	
	for (auto it:al[i]){
		if (it==p) continue;
		
		d[it]=d[i]+1;
		if (it==al[i][0]) hp[it]=hp[i],in[it]=in[i]+1;
		else hp[it]=it;
		pp[it]=i;
		
		dfs_hld(it,i);
	}
}

segtree root[1000005];

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	dfs_ss(1,-1);
	hp[1]=1;
	dfs_hld(1,-1);
	
	rep(x,1,n+1) if (hp[x]==x) root[x]=segtree(num[x]);
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				root[hp[a]].apply(0,in[a]+1,c);
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			root[hp[a]].apply(in[a],in[b]+1,c);
		}
		else{
			cin>>a>>b;
			int res=-1e18;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				res=max(res,root[hp[a]].prod(0,in[a]+1));
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			res=max(res,root[hp[a]].prod(in[a],in[b]+1));
			
			cout<<res<<endl;
		}
	}
}

linkcut.cpp

#include <bits/stdc++.h>
using namespace std;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

struct node{
	node *p=nullptr; bool head=true;
	node *l=nullptr,*r=nullptr; //only store real children
	int val=0,mx=0,lazy=0;
	
	node (node *_p){
		p=_p;
	}
	
	bool dir(){
		return this==p->l;
	}
	
	void propo(){
		if (lazy==0) return;
		
		val+=lazy;
		mx+=lazy;
		if (l) l->lazy+=lazy;
		if (r) r->lazy+=lazy;
		lazy=0;
	}
	
	void upd(){
		mx=max({
			val,
			l==nullptr?0:l->mx+l->lazy,
			r==nullptr?0:r->mx+r->lazy
		});
	}
};

void rotate(node *u){
	node *par=u->p;
	bool dir=u->dir();
	
	par->propo();
	u->propo();
	
	u->p=par->p;
	if (!par->head){
		if (par->dir()) par->p->l=u;
		else par->p->r=u;
	}
	swap(u->head,par->head);
	
	if (dir){
		node *child=u->r;
		par->l=child;
		if (child) child->p=par;
		
		u->r=par;
		par->p=u;
	}
	else{
		node *child=u->l;
		par->r=child;
		if (child) child->p=par;
		u->l=par;
		par->p=u;
	}
	
	par->upd();
	u->upd();
}

void splay(node *u){
	while (!u->head){
		if (u->p->head){ //almost root
			rotate(u);
		}
		else if (u->dir()==u->p->dir()){
			rotate(u->p);
			rotate(u);
		}
		else{
			rotate(u);
			rotate(u);
		}
	}
	u->propo();
}

node* access(node *u){
	splay(u);
	if (u->r!=nullptr){
		u->r->head=true;
		u->r=nullptr;
		u->upd();
	}
	while (u->p){
		node *par=u->p;
		splay(par);
		if (par->r) par->r->head=true;
		u->head=false;
		par->r=u;
		par->upd();
		u=par;
	}
	
	return u;
}

int n,q;
vector<int> al[1000005];
int d[1000005];
node *root[1000005];

void dfs(int i,int p){
	root[i]=new node(p==-1?nullptr:root[p]);
	for (auto it:al[i]){
		if (it==p) continue;
		d[it]=d[i]+1;
		dfs(it,i);
	}
}

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	dfs(1,-1);
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			
			if (d[a]>d[b]) swap(a,b);
			access(root[a]);
			node *temp=access(root[b]);
			splay(root[a]);
			
			if (root[a]->p) root[a]->lazy+=c;
			temp->val+=c;
			if (temp->r) temp->r->lazy+=c;
			temp->upd();
		}
		else{
			cin>>a>>b;
			
			if (d[a]>d[b]) swap(a,b);
			access(root[a]);
			node *temp=access(root[b]);
			splay(root[a]);
			
			int ans=max({
				root[a]->p?(root[a]->mx)+(root[a]->lazy):0,
				temp->val+temp->lazy,
				temp->r?temp->r->mx+temp->r->lazy+temp->lazy:0
			});
			
			cout<<ans<<endl;
		}
	}
}

hld_splay.cpp

#include <bits/stdc++.h>
using namespace std;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

struct node{
	node *p=nullptr;
	node *l=nullptr,*r=nullptr; //only store real children
	int val=0,mx=0,lazy=0;
	
	node (node *_p){
		p=_p;
	}
	
	bool dir(){
		return this==p->l;
	}
	
	void propo(){
		if (lazy==0) return;
		
		val+=lazy;
		mx+=lazy;
		if (l) l->lazy+=lazy;
		if (r) r->lazy+=lazy;
		lazy=0;
	}
	
	void upd(){
		mx=max({
			val,
			l==nullptr?0:l->mx+l->lazy,
			r==nullptr?0:r->mx+r->lazy
		});
	}
};

void rotate(node *u){
	node *par=u->p;
	bool dir=u->dir();
	
	par->propo();
	u->propo();
	
	u->p=par->p;
	if (par->p){
		if (par->dir()) par->p->l=u;
		else par->p->r=u;
	}
	
	if (dir){
		node *child=u->r;
		par->l=child;
		if (child) child->p=par;
		
		u->r=par;
		par->p=u;
	}
	else{
		node *child=u->l;
		par->r=child;
		if (child) child->p=par;
		u->l=par;
		par->p=u;
	}
	
	par->upd();
	u->upd();
}

void splay(node *u){
	while (u->p){
		if (u->p->p==nullptr){ //almost root
			rotate(u);
		}
		else if (u->dir()==u->p->dir()){
			rotate(u->p);
			rotate(u);
		}
		else{
			rotate(u);
			rotate(u);
		}
	}
	u->propo();
}

int n,q;
vector<int> al[1000005];

int ss[1000005];
int d[1000005];
int pp[1000005];
int hp[1000005];
int num[1000005];
node* root[1000005];

void dfs_ss(int i,int p){
	ss[i]=1;
	for (auto &it:al[i]){
		if (it==p) continue;
		dfs_ss(it,i);
		ss[i]+=ss[it];
		if (al[i][0]==p || ss[al[i][0]]<ss[it]) swap(al[i][0],it);
	}
}

void dfs_hld(int i,int p){
	num[hp[i]]++;
	
	if (hp[i]==i) root[i]=new node(nullptr);
	else root[i]=new node(root[p]);
	
	for (auto it:al[i]){
		if (it==p) continue;
		
		d[it]=d[i]+1;
		if (it==al[i][0]) hp[it]=hp[i];
		else hp[it]=it;
		pp[it]=i;
		
		dfs_hld(it,i);
	}
}

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	dfs_ss(1,-1);
	hp[1]=1;
	dfs_hld(1,-1);
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				splay(root[a]);
				root[a]->val+=c;
				if (root[a]->l) root[a]->l->lazy+=c;
				root[a]->upd();
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			if (a==b){
				splay(root[a]);
				root[a]->val+=c;
				root[a]->upd();
			}
			else{
				splay(root[b]);
				splay(root[a]);
				if (root[b]->p!=root[a]) rotate(root[b]);
				
				if (root[b]->l) root[b]->l->lazy+=c;
				root[a]->val+=c;
				root[b]->val+=c;
				
				root[b]->upd();
				root[a]->upd();
			}
		}
		else{
			cin>>a>>b;
			int res=-1e18;
			
			while (hp[a]!=hp[b]){
				if (d[hp[a]]<d[hp[b]]) swap(a,b);
				splay(root[a]);
				res=max(res,root[a]->val+root[a]->lazy);
				if (root[a]->l) res=max(res,root[a]->l->mx+root[a]->l->lazy+root[a]->lazy);
				a=pp[hp[a]];
			}
			if (d[a]>d[b]) swap(a,b);
			if (a==b){
				splay(root[a]);
				res=max(res,root[a]->val+root[a]->lazy);
			}
			else{
				splay(root[b]);
				splay(root[a]);
				if (root[b]->p!=root[a]) rotate(root[b]);
				res=max(res,root[a]->val+root[a]->lazy);
				res=max(res,root[b]->val+root[a]->lazy+root[b]->lazy);
				if (root[b]->l) res=max(res,root[b]->l->mx+root[a]->lazy+root[b]->lazy+root[b]->l->lazy);
			}
			
			cout<<res<<endl;
		}
	}
}

binary_tree.cpp

#include <bits/stdc++.h>
using namespace std;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

int n,q;
vector<int> al[1000005];

int ss[1000005];
int pp[1000005];
int hp[1000005];

int ss2[1000005]; //1 + total subtree size of the light children
vector<int> chain[1000005];

void dfs_ss(int i,int p){
	ss[i]=1;
	for (auto &it:al[i]){
		if (it==p) continue;
		dfs_ss(it,i);
		ss[i]+=ss[it];
		if (al[i][0]==p || ss[al[i][0]]<ss[it]) swap(al[i][0],it);
	}
}

int TIME=0;
int in[1000005];

void dfs_hld(int i,int p){
	in[i]=TIME++;
	
	ss2[i]=1;
	chain[hp[i]].pub(i);
	
	for (auto it:al[i]){
		if (it==p) continue;
		
		if (it==al[i][0]) hp[it]=hp[i];
		else hp[it]=it;
		pp[it]=i;
		
		dfs_hld(it,i);
		if (hp[it]!=hp[i]) ss2[i]+=ss[it];
	}
}

int d[1000005];
int hd[1000005];
int par[1000005];
int childs[1000005][2];

int dfs_chain(int i,int l,int r,int p,int depth,int hdepth){
	int tot=0;
	rep(x,l,r+1) tot+=ss2[chain[i][x]];
	
	int curr=0;
	rep(x,l,r+1){
		curr+=ss2[chain[i][x]];
		if (curr>=(tot+1)/2){ //centroid
			int u=chain[i][x];
			d[u]=depth;
			hd[u]=hdepth;
			par[u]=p;
			if (l!=x) childs[u][0]=dfs_chain(i,l,x-1,u,depth+1,hdepth);
			if (r!=x) childs[u][1]=dfs_chain(i,x+1,r,u,depth+1,hdepth);
			for (auto it:al[u]) if (hp[it]!=hp[u] && it!=pp[u]){
				dfs_chain(it,0,sz(chain[it])-1,u,depth+1,depth+1);
			}
			return u;
		}
	}
	return -1; //unreachable: the last index always satisfies the centroid condition
}

int lazy[1000005];
int val[1000005];
int mx[1000005];

void upd(int i){
	mx[i]=val[i];
	rep(z,0,2) if (childs[i][z]!=-1) mx[i]=max(mx[i],mx[childs[i][z]]);
	mx[i]+=lazy[i];
}

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	dfs_ss(1,-1);
	hp[1]=1;
	dfs_hld(1,-1);
	
	memset(par,-1,sizeof(par));
	memset(childs,-1,sizeof(childs));
	dfs_chain(1,0,sz(chain[1])-1,-1,0,0);
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			
			while (hp[a]!=hp[b]){
				if (hd[a]<hd[b]) swap(a,b);
				
				int oa=a;
				while (hp[a]==hp[oa]){
					if (in[a]<=in[oa]){
						val[a]+=c;
						if (childs[a][0]!=-1){
							lazy[childs[a][0]]+=c;
							mx[childs[a][0]]+=c;
						}
					}
					upd(a);
					a=par[a];
				}
			}
			
			if (in[a]>in[b]) swap(a,b);
			int oa=a,ob=b;
			while (a!=b){
				int md=max(d[a],d[b]);
				if (d[a]>=md){
					if (in[oa]<=in[a]){
						val[a]+=c;
						if (childs[a][1]!=-1){
							lazy[childs[a][1]]+=c;
							mx[childs[a][1]]+=c;
						}
					}
					upd(a);
					a=par[a];
				}
				if (d[b]>=md){
					if (in[b]<=in[ob]){
						val[b]+=c;
						if (childs[b][0]!=-1){
							lazy[childs[b][0]]+=c;
							mx[childs[b][0]]+=c;
						}
					}
					upd(b);
					b=par[b];
				}
			}
			
			val[a]+=c;
			while (a!=-1 && hp[oa]==hp[a]){
				upd(a);
				a=par[a];
			}
		}
		else{
			cin>>a>>b;
			
			int ans=0;
			while (hp[a]!=hp[b]){
				if (hd[a]<hd[b]) swap(a,b);
				
				int curr=0;
				int oa=a;
				while (hp[a]==hp[oa]){
					if (in[a]<=in[oa]){
						curr=max(curr,val[a]);
						if (childs[a][0]!=-1) curr=max(curr,mx[childs[a][0]]);
					}
					curr+=lazy[a];
					a=par[a];
				}
				ans=max(ans,curr);
			}
			
			if (in[a]>in[b]) swap(a,b);
			int oa=a,ob=b;
			int currl=0,currr=0;
			while (a!=b){
				int md=max(d[a],d[b]);
				if (d[a]>=md){
					if (in[oa]<=in[a]){
						currl=max(currl,val[a]);
						if (childs[a][1]!=-1) currl=max(currl,mx[childs[a][1]]);
					}
					currl+=lazy[a];
					a=par[a];
				}
				if (d[b]>=md){
					if (in[b]<=in[ob]){
						currr=max(currr,val[b]);
						if (childs[b][0]!=-1) currr=max(currr,mx[childs[b][0]]);
					}
					currr+=lazy[b];
					b=par[b];
				}
			}
			
			currl=max({currl,currr,val[a]});
			
			while (a!=-1 && hp[oa]==hp[a]){
				currl+=lazy[a];
				a=par[a];
			}
			
			ans=max(ans,currl);
			cout<<ans<<endl;
		}
	}
}

binary_tree2.cpp

#include <bits/stdc++.h>
using namespace std;

#define int long long
#define ii pair<int,int>
#define fi first
#define se second
#define endl '\n'

#define pub push_back
#define pob pop_back

#define rep(x,start,end) for(int x=(start)-((start)>(end));x!=(end)-((start)>(end));((start)<(end)?x++:x--))
#define all(x) (x).begin(),(x).end()
#define sz(x) (int)(x).size()

int n,q;
vector<int> al[2000005];

int ss[2000005];
int hp[2000005];

int ss2[2000005]; //1 + total subtree size of the light children
vector<int> chain[2000005];

int FAKE_IDX;
int rp[2000005]; //real vertex for fake vertices

void dfs_ss(int i,int p){
	if (p!=-1 && i<=n){
		for (auto &it:al[i]) if (it==p || it==rp[p]) swap(it,al[i].back());
		al[i].pob();
	}
	
	int curr=i;
	while (sz(al[curr])>2){
		swap(al[curr],al[FAKE_IDX]);
		al[curr]={al[FAKE_IDX].back(),FAKE_IDX};
		al[FAKE_IDX].pob();
		rp[FAKE_IDX]=i;
		curr=FAKE_IDX;
		FAKE_IDX++;
	}
	
	ss[i]=1;
	for (auto &it:al[i]){
		dfs_ss(it,i);
		ss[i]+=ss[it];
		if (ss[al[i][0]]<ss[it]) swap(al[i][0],it);
	}
}

int TIME=0;
int in[2000005];

void dfs_hld(int i,int p){
	in[i]=TIME++;
	
	ss2[i]=1;
	chain[hp[i]].pub(i);
	
	for (auto it:al[i]){
		if (it==al[i][0]) hp[it]=hp[i];
		else hp[it]=it;
		
		dfs_hld(it,i);
		if (hp[it]!=hp[i]) ss2[i]+=ss[it];
	}
}

int d[2000005];
int hd[2000005];
int par[2000005];
int childs[2000005][3]; //(left,right,fake)

int dfs_chain(int i,int l,int r,int p,int depth,int hdepth){
	int tot=0;
	rep(x,l,r+1) tot+=ss2[chain[i][x]];
	
	int curr=0;
	rep(x,l,r+1){
		curr+=ss2[chain[i][x]];
		if (curr>=(tot+1)/2){ //centroid
			int u=chain[i][x];
			d[u]=depth;
			hd[u]=hdepth;
			par[u]=p;
			if (l!=x) childs[u][0]=dfs_chain(i,l,x-1,u,depth+1,hdepth);
			if (r!=x) childs[u][1]=dfs_chain(i,x+1,r,u,depth+1,hdepth);
			for (auto it:al[u]) if (hp[it]!=hp[u]){
				childs[u][2]=dfs_chain(it,0,sz(chain[it])-1,u,depth+1,depth+1);
			}
			return u;
		}
	}
	return -1; //unreachable: the last index always satisfies the centroid condition
}

int lazy1[2000005]; //only propagate on real tree
int lazy2[2000005]; //propagate on both real and fake tree
int val[2000005];
ii mx[2000005]; //(real,fake)

void upd(int i){
	mx[i]={val[i],0};
	rep(z,0,3){
		if (childs[i][z]!=-1){
			if (z!=2) mx[i].fi=max(mx[i].fi,mx[childs[i][z]].fi);
			else mx[i].se=max(mx[i].se,mx[childs[i][z]].fi);
			mx[i].se=max(mx[i].se,mx[childs[i][z]].se);
		}
	}
	mx[i].fi+=lazy1[i]+lazy2[i]; mx[i].se+=lazy2[i];
}

signed main(){
	ios::sync_with_stdio(0);
	cin.tie(0);
	cout.tie(0);
	cin.exceptions(ios::badbit | ios::failbit);
	
	cin>>n>>q;
	
	int a,b,c;
	rep(x,1,n){
		cin>>a>>b;
		al[a].pub(b);
		al[b].pub(a);
	}
	
	memset(rp,-1,sizeof(rp));
	FAKE_IDX=n+1;
	dfs_ss(1,-1);
	hp[1]=1;
	dfs_hld(1,-1);
	
	memset(par,-1,sizeof(par));
	memset(childs,-1,sizeof(childs));
	dfs_chain(1,0,sz(chain[1])-1,-1,0,0);
	
	// rep(x,1,FAKE_IDX) cout<<hp[x]<<" "; cout<<endl;
	// rep(x,1,FAKE_IDX) cout<<childs[x][0]<<" "; cout<<endl;
	// rep(x,1,FAKE_IDX) cout<<childs[x][1]<<" "; cout<<endl;
	// rep(x,1,FAKE_IDX) cout<<childs[x][2]<<" "; cout<<endl;
	
	while (q--){
		cin>>a;
		
		if (a==1){
			cin>>a>>b>>c;
			
			while (hp[a]!=hp[b]){
				if (hd[a]<hd[b]) swap(a,b);
				
				int oa=a;
				while (hp[a]==hp[oa]){
					if (in[a]<=in[oa]){
						val[a]+=c;
						if (childs[a][0]!=-1){
							lazy1[childs[a][0]]+=c;
							mx[childs[a][0]].fi+=c;
						}
					}
					upd(a);
					a=par[a];
				}
			}
			
			if (in[a]>in[b]) swap(a,b);
			
			if (rp[a]!=-1){ //lca is a fake vertex
				int g=rp[a];
				val[g]+=c;
				while (g!=-1){
					upd(g);
					g=par[g];
				}
			}
			
			int oa=a,ob=b;
			while (a!=b){
				int md=max(d[a],d[b]);
				if (d[a]>=md){
					if (in[oa]<=in[a]){
						val[a]+=c;
						if (childs[a][1]!=-1){
							lazy1[childs[a][1]]+=c;
							mx[childs[a][1]].fi+=c;
						}
					}
					upd(a);
					a=par[a];
				}
				if (d[b]>=md){
					if (in[b]<=in[ob]){
						val[b]+=c;
						if (childs[b][0]!=-1){
							lazy1[childs[b][0]]+=c;
							mx[childs[b][0]].fi+=c;
						}
					}
					upd(b);
					b=par[b];
				}
			}
			
			val[a]+=c;
			while (a!=-1){
				upd(a);
				a=par[a];
			}
		}
		else if (a==2){
			cin>>a>>b;
			
			int ans=0,ansl=0,ansr=0;
			while (hp[a]!=hp[b]){
				if (hd[a]<hd[b]) swap(a,b),swap(ansl,ansr);
				
				int curr=0;
				int oa=a;
				while (hp[a]==hp[oa]){
					if (in[a]<=in[oa]){
						curr=max(curr,val[a]);
						if (childs[a][0]!=-1) curr=max(curr,mx[childs[a][0]].fi);
					}
					curr+=lazy1[a]+lazy2[a];
					ansl+=lazy2[a];
					a=par[a];
				}
				ansl=max(ansl,curr);
			}
			
			if (in[a]>in[b]) swap(a,b),swap(ansl,ansr);
			
			if (rp[a]!=-1){ //lca is a fake vertex
				int g=rp[a],curr=val[rp[a]];
				while (g!=-1){
					if (hp[g]==hp[rp[a]]) curr+=lazy1[g];
					curr+=lazy2[g];
					g=par[g];
				}
				ans=max(ans,curr);
			}
			
			int oa=a,ob=b;
			int currl=0,currr=0;
			while (a!=b){
				int md=max(d[a],d[b]);
				if (d[a]>=md){
					if (in[oa]<=in[a]){
						currl=max(currl,val[a]);
						if (childs[a][1]!=-1) currl=max(currl,mx[childs[a][1]].fi);
					}
					currl+=lazy1[a]+lazy2[a];
					ansl+=lazy2[a];
					a=par[a];
				}
				if (d[b]>=md){
					if (in[b]<=in[ob]){
						currr=max(currr,val[b]);
						if (childs[b][0]!=-1) currr=max(currr,mx[childs[b][0]].fi);
					}
					currr+=lazy1[b]+lazy2[b];
					ansr+=lazy2[b];
					b=par[b];
				}
			}
			
			currl=max({currl,currr,val[a]});
			ansl=max(ansl,ansr);
			
			while (a!=-1){
				if (hp[a]==hp[oa]) currl+=lazy1[a];
				currl+=lazy2[a];
				ansl+=lazy2[a];
				a=par[a];
			}
			
			cout<<max({ans,currl,ansl})<<endl;
		}
		else if (a==3){
			cin>>a>>c;
			
			int oa=a;
			while (a!=-1){
				if (in[oa]<=in[a]){
					val[a]+=c;
					rep(z,1,3) if (childs[a][z]!=-1){ //update both right and light children
						lazy2[childs[a][z]]+=c;
						mx[childs[a][z]].fi+=c;
						mx[childs[a][z]].se+=c;
					}
				}
				upd(a);
				a=par[a];
			}
		}
		else{
			cin>>a;
			
			int oa=a,ans=0,curr=0;
			while (a!=-1){
				if (in[oa]<=in[a]){
					curr=max(curr,val[a]);
					rep(z,1,3) if (childs[a][z]!=-1){
						if (z==1) curr=max(curr,mx[childs[a][z]].fi);
						else ans=max(ans,mx[childs[a][z]].fi);
						ans=max(ans,mx[childs[a][z]].se);
					}
				}
				if (hp[a]==hp[oa]) curr+=lazy1[a];
				curr+=lazy2[a];
				ans+=lazy2[a];
				
				a=par[a];
			}
			
			cout<<max(ans,curr)<<endl;
		}
	}
}

script.py

import subprocess

#compile everything
subprocess.run("g++ gen_random.cpp -o bin/gen_random -O2 -std=c++17",shell=True)
subprocess.run("g++ gen_binary.cpp -o bin/gen_binary -O2 -std=c++17",shell=True)

tests=[
	"gen_random 1000000 5000000 0",
	"gen_random 1000000 5000000 -10",
	"gen_random 1000000 5000000 10",
	"gen_binary 1000000 5000000 1",
	"gen_binary 1000000 5000000 5"
]

solutions=[
	"hld_single",
	"hld_many",
	"hld_many_ACL",
	"linkcut",
	"hld_splay",
	"binary_tree",
	"binary_tree2"
]

for i in solutions:
	subprocess.run("g++ {}.cpp -o bin/{} -O2 -std=c++17".format(i,i),shell=True)

print("compilation done")

for test in tests:
	print(test)
	subprocess.run("./bin/{} 2 > in".format(test),shell=True)
	
	for x in range(7):
		print(solutions[x],subprocess.run("time ./bin/{} < in > out".format(solutions[x]),shell=True,capture_output=True).stderr)
	print()

for test in tests:
	print(test)
	subprocess.run("./bin/{} 4 > in".format(test),shell=True)
	
	for x in [0,6]:
		print(solutions[x],subprocess.run("time ./bin/{} < in > out".format(solutions[x]),shell=True,capture_output=True).stderr)
	print()

1/3 Centroid Decomposition

When I was writing this blog, I was wondering whether we could cut the $$$\log$$$ factor from some centroid decomposition problems. Thanks to balbit for telling me about this technique.

Consider the following problem: You are given a weighted tree of size $$$n$$$ whose edges may have negative weights. Each vertex may either be toggled "on" or "off". Handle the following $$$q$$$ operations:

Toggle vertex $$$u$$$.
Given vertex $$$u$$$, find the maximum value of $$$d(u,v)$$$ over all vertices $$$v$$$ that are toggled "on". It is guaranteed that at least one such $$$v$$$ exists.

It is well-known how to solve this in $$$O(n \log n + q \log^2 n)$$$ using centroid decomposition + segment trees. But can we do better?

The reason segment trees have to be used is that when we query for the longest path ending at some centroid parent, we have to ignore the contribution of our own subtree. An obvious way to solve this is to try to decompose the centroid tree in such a way that each vertex has at most $$$2$$$ children. Unfortunately, I do not know a way to do this such that the depth of the tree is bounded by $$$\log_2 n$$$, but there is a way to make the depth of the tree bounded by $$$\log_{\frac{3}{2}} n$$$.

Instead of thinking of doing centroid decomposition on vertices, let us consider doing it on edges. Top image is the usual centroid decomposition, while bottom image is the one we want to use here. That is, the centroid gets passed down to its children when recursing.

Ok, so now we want to figure out what is the largest possible number of edges of the smaller partition. Remember, we want to make this value as large as possible to get a split that is as even as possible.

Firstly, we cannot hope for better than $$$\frac{1}{3}m$$$ in general, where $$$m$$$ is the number of edges. This is seen when the tree is a star graph with $$$3$$$ children. The numbers of edges in the subtrees are $$$[1,1,1]$$$, and it is clear that the best way to partition the subtrees is $$$[1]$$$ and $$$[1,1]$$$, so the smaller partition has exactly $$$\frac{1}{3}m$$$ edges.

Now, we will show that this bound is obtainable. Let $$$A$$$ be an array containing elements in the interval $$$[0,\frac{1}{2}]$$$ such that $$$\sum A=1$$$. This array $$$A$$$ describes the ratio of the number of edges in each subtree to the total number of edges. The elements are bounded above by $$$\frac{1}{2}$$$ due to the properties of the centroid.

Then, the following algorithm surprisingly finds a subset $$$S$$$ whose sum lies in the range $$$[\frac{1}{3},\frac{2}{3}]$$$.

tot=0
S={}
for x in [1,n]:
    if (tot+A[x]<=2/3):
        tot+=A[x]
        S.insert(x)

It is clear that $$$\sum\limits_{s \in S} A[s] \leq \frac{2}{3}$$$, so it suffices to show that $$$\sum\limits_{s \in S} A[s] \geq \frac{1}{3}$$$.

Let $$$P[x]=A[1]+A[2]+\ldots+A[x]$$$, that is $$$P$$$ is the prefix sum array.

Consider the first index $$$x$$$ such that $$$P[x]>\frac{2}{3}$$$. We will split into $$$2$$$ cases.

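$$$A[x]<\frac{1}{3}$$$: every index before $$$x$$$ is taken, so after iteration $$$x-1$$$ we have $$$\sum\limits_{s \in S} A[s] = P[x-1] = P[x]-A[x] > \frac{2}{3}-\frac{1}{3} = \frac{1}{3}$$$, and the sum never decreases afterwards.
$$$A[x] \geq \frac{1}{3}$$$: index $$$x$$$ is skipped, and every later index is taken because all the other elements together sum to $$$1-A[x] \leq \frac{2}{3}$$$, so the final $$$S=[1,n] \setminus \{x\}$$$. So $$$\sum\limits_{s \in S} A[s] = 1-A[x] \geq \frac{1}{2}$$$.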

Therefore, we are able to obtain a centroid tree which is a binary tree and has depth at most $$$\log_{\frac{3}{2}} n$$$.
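
As a rough illustration, the greedy scan above can be written with integer edge counts instead of ratios. This is only a sketch: the function name pick_side is made up for this example, and it assumes that every entry is at most half of the total, as in the argument above. The comparison is multiplied out to avoid floating point.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Hypothetical helper (not from the original post): edges[i] is the number of
// edges in the i-th subtree hanging off the centroid. Returns the indices
// chosen by the greedy scan, whose total is at most 2/3 of all edges.
// Assumption: every entry is at most half of the total.
vector<int> pick_side(const vector<long long>& edges) {
    long long total = accumulate(edges.begin(), edges.end(), 0LL);
    long long taken = 0;
    vector<int> chosen;
    for (int i = 0; i < (int)edges.size(); i++) {
        // Integer form of the test "tot + A[x] <= 2/3".
        if (3 * (taken + edges[i]) <= 2 * total) {
            taken += edges[i];
            chosen.push_back(i);
        }
    }
    return chosen;
}
```

For the star with subtree edge counts $$$[1,1,1]$$$ this picks the first two subtrees, and for $$$[1,1,3,1]$$$ it skips the subtree with $$$3$$$ edges and takes everything else, mirroring the two cases of the proof.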

Returning to the original problem, we are able to solve it in $$$O(n \log n + q_1 \log^2 n + q_2 \log n)$$$ where $$$q_1$$$ and $$$q_2$$$ are the number of queries of type $$$1$$$ and $$$2$$$ respectively.
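
One possible way to get these bounds, sketched under my own assumptions (the struct SplitNode and its members are illustrative names, not something from the original solution): each node of the binary centroid tree splits its component into two parts that share only the centroid $$$c$$$, and keeps one multiset per part with $$$d(c,v)$$$ for every "on" vertex $$$v$$$ in that part. A toggle then costs one multiset operation per level, while a query only reads the maximum of the opposite part.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Hypothetical per-node data for a binary centroid tree. side[s] holds
// d(c, v) for every "on" vertex v in part s, where c is the vertex shared
// by both parts of this node.
struct SplitNode {
    multiset<long long> side[2];

    // A vertex in part s at distance dist from c is switched on or off.
    // Assumes the value is present when switching off.
    void toggle(int s, long long dist, bool turn_on) {
        if (turn_on) side[s].insert(dist);
        else side[s].erase(side[s].find(dist));
    }

    // Best d(u, v) over "on" vertices v in the part opposite to u, where u
    // lies in part s at distance dist_u from c. Every such path passes
    // through c. Returns LLONG_MIN if the opposite part has no "on" vertex.
    long long query(int s, long long dist_u) const {
        const multiset<long long>& other = side[s ^ 1];
        if (other.empty()) return LLONG_MIN;
        return dist_u + *other.rbegin();
    }
};
```

Since there are only two parts, "everything except my own side" is a single container, which is why no segment tree over the children is needed in this sketch.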

References

[1] https://github.com/dawxy/ACM-CODER/blob/master/%E3%80%90%E8%AE%BA%E6%96%87%26%26%E6%95%99%E7%A8%8B%E3%80%91/QTREE%E8%A7%A3%E6%B3%95%E7%9A%84%E4%B8%80%E4%BA%9B%E7%A0%94%E7%A9%B6.pdf
[2] https://ocw.mit.edu/courses/6-854j-advanced-algorithms-fall-2008/921232cb9a69015c50002ff5ea6a9824_lec6.pdf
[3] https://courses.csail.mit.edu/6.851/spring12/scribe/L19.pdf
[4] https://en.wikipedia.org/wiki/Link/cut_tree
[5] https://codeforces.com/blog/entry/82400
[6] https://codeforces.com/blog/entry/72626
[7] https://codeforces.com/blog/entry/53170
[8] https://oj.uz/problem/view/JOI14_secret
[9] https://codeforces.com/blog/entry/103726

Comments (10)

errorgorn

21 month(s) ago, # |

It seems everyone disagrees what the name of this technique should be. So let's settle it.

balanced HLD
Chinese HLD
global HLD
Chinese binary tree (CBT)
Yang Zhe (YZ) tree

ftiasch

21 month(s) ago, # ^ |

+50

The trick is known in literature as Biased Binary Search Tree (BiBST), originated in Sleator's paper as far as I remembered.

Kyou_mo_kawaii

Daniel Sleator is active on codeforces so maybe can just ask him? Ping Darooha

ScarletS

For the acronym alone, CBT.

nor

+16

When I was told about this technique, it was referred to as global BST (and not global HLD), so I prefer that name merely for convenience reasons. Not really sure where the "global" comes from, though.

winterfire

Error-gone or Ara-gorn?

SSRS_

For the case of balanced binary trees, can easily figure that the balanced binary tree easily forces the implementation of having a single segment tree to go into its worse case of O(n+qlog^2n), but for the implementation where we have a new segment tree for each heavy path, it only forces it to have O(n+qlogn) complexity.

I did not understand this part. Why does it become O(n+qlogn) complexity if we have a segment tree for each heavy path?

brunovsky

+13

He's saying it becomes $$$O(n+q\log n)$$$ only when the original tree is a binary tree, which should be pretty obvious since the sum of chain lengths is $$$\log n$$$. In the table further down it gives the general $$$O(n+q\log^2n)$$$

Oh, I missed that the sum of chain lengths is $$$ \log n $$$. Thank you very much.

lrvideckis

4 months ago, # |

for the 1/3 centroid decomposition, is it possible to construct a tree where you hit the 1/3,2/3 worst case for all decompositions?

for example for the 3-star tree, the first split is 1/3,2/3 but then you have 3 linked lists for which you can split 1/2,1/2 all the way down

for example you can solve this problem 1592D - Hemose in ICPC ? with 1/3CD as long as you choose your splits closely enough to 1/2,1/2 240284857
