# A brief introduction into the applications of square root decomposition

Square root decomposition is the process of separating a structure of size $$$O(N)$$$ into $$$O(\sqrt{N})$$$ "blocks" of size $$$O(\sqrt{N})$$$ each, in a way that aids manipulation of the entire structure. Square root decomposition is extremely versatile. Some of its most well-known use cases are:

- answering queries on a static array, with methods such as Mo's algorithm or block precomputation
- "lazy" brute force modifications, where it is easy to query an entire block, but tag pushing isn't obvious
- separating objects based on a threshold, when there exists both an $$$O(x)$$$ algorithm and an $$$O(n/x)$$$ algorithm

In this tutorial we will walk through multiple types of square root decomposition.

## Answering queries on a static array, offline (Mo's algorithm)

Consider a problem where we are asked to find the answer for certain intervals $$$[l,r]$$$. We can't quickly compute the answer for an arbitrary interval, but we know how to transition to $$$[l,r\pm1]$$$ and $$$[l\pm1,r]$$$ fast given some information remaining from $$$[l,r]$$$ answer calculation. The number of transitions we need to do to get from $$$[l_1,r_1]$$$ to $$$[l_2,r_2]$$$ is $$$|l_1-l_2|+|r_1-r_2|$$$.

If there are only two intervals we need to answer such transitioning wouldn't help us. However, if there are *many* intervals, choosing a good transitioning route will drastically reduce the total time needed. Finding the best transition route quick is allegedly NP-hard, so we will focus on estimating a "good enough" route.

Consider the following scheme for 2D Manhattan TSP: We do square root decomposition on the $$$x$$$ coordinate, separating the grid into vertical blocks, $$$n$$$ cells tall and $$$B$$$ cells wide. There are $$$n/B$$$ such blocks.

For an arbitrary block, call the number of points we need to visit in it $$$C$$$. To get to all of them, we can visit them in order from top to bottom: we spend at most $$$O(BC)$$$ transitions moving left and right, and $$$O(n)$$$ transitions going down.

If for each block we need $$$O(BC+n)$$$ transitions, in total we will need $$$O(\sum(BC+n))$$$ = $$$O(Bq+n^2/B)$$$ transitions. Selecting a block size of $$$B=O(n/\sqrt{q})$$$ gives us a total transition count of $$$O(qn/\sqrt{q}+n\sqrt{q})$$$ = $$$O(n\sqrt{q})$$$.

Implementation of sorting these transitions can be done as follows:

```
int B = ???;
struct query {
int l, r, id;
const bool operator<(const query& o) const {
if(l/B == o.l/B) {
// they are in the same block, sort going down
return r < o.r;
} else {
// they are not in the same block, sort by block number
return l < o.l;
}
}
}
```

Observe here that between each block, we need to run from the bottom of the "grid" all the way up. We can prevent this by using a zig-zag pattern, getting a bit of a constant optimization:

```
int B = ???;
struct query {
int l, r, id;
const bool operator<(const query& o) const {
if(l/B == o.l/B) {
if((l/B) % 2 == 0) return r < o.r;
else return r > o.r;
} else {
return l < o.l;
}
}
}
```

In the following we will look at some examples.

### Range distinct query (SPOJ DQUERY)

Statement: Given a static array and queries of the form $$$[l,r]$$$, count the number of different values that occur between $$$[l,r]$$$.

The naive transition is intuitive: we maintain a table $$$v$$$ where $$$v[x]$$$ is the number of times $$$x$$$ occured in the interval that we are maintaining. When adding an element $$$x$$$ such that $$$v[x]=0$$$, increment the answer; when deleting an element $$$x$$$ such that $$$v[x]=1$$$, decrement the answer.

**However**, this needs many random accesses to this table, which is very cache-unfriendly, leading to a massive constant. Can we do better?

Consider arrays $$$pre[i]$$$ and $$$nxt[i]$$$, which equal the last location of the element $$$i$$$ and the next location of the element $$$i$$$, respectively (equal to some infinity if this element doesn't exist.) We can use these arrays to quickly check whether or not some element occurs in an range.

- Case 1: $$$[l,r]\rightarrow[l,r+1]$$$. We add 1 if and only if $$$r+1$$$ has not occured in $$$[l,r]$$$, or equivalently, $$$pre[r+1] < l$$$.
- Case 2: $$$[l,r]\rightarrow[l,r-1]$$$. We subtract 1 iff $$$r$$$ has not occured in $$$[l,r-1]$$$.
- Case 3: $$$[l,r]\rightarrow[l-1,r]$$$. We add 1 iff $$$l-1$$$ has not occurned in $$$[l,r]$$$, or equivalently, $$$r < nxt[l-1]$$$.
- Case 4: $$$[l,r]\rightarrow[l+1,r]$$$. We subtract 1 iff $$$l$$$ has not occured in $$$[l+1,r]$$$.

This leads to a much quicker solution, which runs plausibly fast even for $$$n=10^6$$$ even if we directly implement Mo's.

### Range inversion query (naive version)

Statement: Given a static array and queries of the form $$$[l,r]$$$, count the number of pairs $$$(i,j)$$$ such that $$$a[i]>a[j]$$$. $$$O(n\log n\sqrt n)$$$.

Every time we add or erase an element, we consider how much this element contributes to the answer. If we maintain a Fenwick tree corresponding to the current interval, we can quickly find out how many inversions involve a certain value. Transition is $$$O(\log n)$$$; hence overall it is $$$O(n\log n\sqrt n)$$$.

**Advanced**: Sweepline Mo / Offline Again

This is 617E - XOR and Favorite Number, except there are multiple favorite numbers.

Statement: Given a static array, a static set $$$S$$$ of size $$$C$$$ and queries of the form $$$[l,r]$$$, count the number of pairs $$$(i,j)$$$ such that $$$a[i]\text{ xor }a[j]\in S$$$. $$$O(nC+n\sqrt n)$$$ time; $$$O(n)$$$ memory. Assume $$$q=O(n)$$$.

Notice that $$$a[i]\text{ xor }a[j]\in S$$$ is equivalent to $$$\exists v\in S:a[i]\text{ xor }v=a[j]$$$. We can consider each $$$v$$$ separately and run $$$C$$$ instances of Mo's, but that would be $$$O(nC\sqrt n)$$$. Consider the transitions: we add an element $$$j$$$ and want to know the amount of $$$i$$$ such that $$$a[i]\text{ xor }a[j]\in S$$$ and $$$i\in[l,r]$$$.

Call the transition delta $$$f(l,r,p)$$$ where $$$p$$$ is the element we are adding. Removing can be done by subtracting $$$f(l,r-1,p)$$$ or $$$f(l+1,r,p)$$$ depending on the direction. Sweepline Mo can be used when this transition function is differentiable: i.e. it satisfies $$$f(l,r,p)=f(1,r,p)-f(1,l-1,p)$$$.

The main idea is to first run a "dry" Mo and find all $$$f(1,x,p)$$$ that we need to calculate. Then, we do a sweep-line on $$$x$$$, updating the data structure accordindly, and calculating the $$$f(1,x,p)$$$ that involve this $$$x$$$. We generally needs the "calculation" part to be $$$O(1)$$$, like normal Mo, but we make more wiggle room for the update part.

However, if we store all $$$(x,p)$$$, we get $$$O(n\sqrt n)$$$ memory and a ridiculous constant factor. Notice that because of the Mo transition order, we can separate all $$$p$$$ into several types, given a fixed $$$x$$$:

- $$$p=x$$$
- $$$p=x+1$$$
- $$$p\in[A,B]$$$

For the first and second type, we calculate $$$ansPrev[i]$$$ and $$$ansNext[i]$$$ and don't update anything, leaving these arrays for the second "wet" Mo to use.

For the third type, we find all $$$(p,A,B)$$$ that result from the dry Mo. There are at most $$$4(M-1)$$$ such intervals because the movement of any end of the Mo interval creates exactly one interval. Then, on the corresponding $$$x$$$, we iterate through the interval and add the transition value we found to the respective query.

Example implementation:

```
sort(qs, qs + m);
cl = 1, cr = 0;
for(int i = 0; i < m; i++) {
if (cr < qs[i].r) {
iv[cl - 1].push_back({cr + 1, qs[i].r, i, -1});
cr = qs[i].r;
}
if (qs[i].l < cl) {
iv[cr].push_back({qs[i].l, cl - 1, i, 1});
cl = qs[i].l;
}
if (qs[i].r < cr) {
iv[cl - 1].push_back({qs[i].r + 1, cr, i, 1});
cr = qs[i].r;
}
if (cl<qs[i].l) {
iv[cr].push_back({cl, qs[i].l - 1, i, -1});
cl = qs[i].l;
}
}
for(int i = 1; i <= n; i++) {
ansPrev[i] = ansPrev[i-1] + cgc[ar[i]];
add(ar[i]);
for (auto [l, r, i, c] : iv[i]) {
ll g = 0;
for(int p = l; p <= r; p++)
g += cgc[ar[p]];
ans[i] += c * g;
}
ansThere[i] = ansThere[i - 1] + cgc[ar[i]];
}
for(int i = 0; i < m; i++) {
curans += ans[i];
fans[qs[i].i] = curans + ansPrev[qs[i].r] + ansThere[qs[i].l-1];
}
for(int i = 0; i < m; i++) {
print(fans[i]);
pc('\n');
}
```

Special implementation details:

- Here
`add`

just updates the`cgc`

array, which is a population count of the things in the current prefix xor the things in the set. `ivs`

is the intervals on each`x`

, along with the query and contribution coefficient.- If we do a prefix sum on
`ansPrev`

and`ansThere`

, we don't even need to use a while loop in the second Mo.

### Range inversion query, again

$$$O(n\sqrt n)$$$

Consider sweepline mo. Sweepline mo in general is a method used to reduce the number of *modifications* of the Mo data structure to $$$O(n)$$$, not affecting the query count of $$$O(n\sqrt n)$$$. This means modifications have to be at most $$$O(\sqrt n)$$$ and queries have to be at most $$$O(1)$$$.

For this problem we need to support point increment/decrement in $$$O(\sqrt n)$$$ and prefix sum query in $$$O(1)$$$. Flip it around to get an equivalent problem: suffix increment/decrement in $$$O(\sqrt n)$$$ and point query in $$$O(1)$$$.

Separate this auxillary array into blocks of size $$$O(\sqrt n)$$$. For each block, maintain a value $$$lazy$$$. When we do increment/decrement a suffix, for blocks that intersect with this suffix yet are not completely covered, we change the elements with brute force. Otherwise, we modify the $$$lazy$$$ tag. To query a position, we add the (brute force) value with its corresponding $$$lazy$$$ tag.

For people who think there is not much of a difference between $$$O(n\sqrt n)$$$ and $$$O(n\log n\sqrt n)$$$: When $$$n=100000$$$, the former runs in $$$300ms$$$ while the latter runs in $$$3000ms$$$. Normally time limits are $$$1000ms$$$.

## Answering queries on a static array, online

When the data structure for Mo's is small enough, you can use first do block decomposition into size $$$B$$$ and then calculate the data structure for all intervals of blocks, creating $$$O((N/B)^2)$$$ structures.

Then, assuming the transition time for this data structure is still $$$O(T)$$$, querying an arbitrary interval can be done in $$$O(TB)$$$.

Normally intialization of each structure is $$$O(B)$$$ and transition is $$$O(1)$$$, so it would be best to choose a block size of $$$O(\sqrt n)$$$ to get $$$O(n\sqrt n)$$$ precalculation and $$$O(\sqrt n)$$$ query time. Here we look at some examples:

### 522D - Closest Equals

Statement: Given a static array and queries of the form $$$[l, r]$$$, for each query calculate $$$\min(|i-j|:a[i]=a[j]\wedge i,j\in[l,r])$$$. $$$n\le 500000$$$

Doing Mo's for this problem looks a bit intractable because although insertion is easy using $$$pre$$$ and $$$nxt$$$ arrays, it seems impossible to delete values and "roll back" the $$$\min$$$ value. Hence we consider block decomposition, which would only require insertion.

Decompose this array into contiguous blocks of size $$$O(B)$$$; for each contiguous interval of blocks compute the answer, for a total precomputation complexity of $$$O((N/B)N)$$$.

Note that if we want the minimum distance the only possible $$$j$$$ values when we fix an $$$i$$$ that couuld possibly contribute to the answer are $$$pre[i]$$$ and $$$nxt[i]$$$. As such, when extending a precomputed block, we calculate if $$$pre[i]$$$ and $$$nxt[i]$$$ are in the interval we need to answer, and if so we try to update the minimum, for a query complexity of $$$O(B)$$$.

Total time complexity is $$$O(n^2/B+qB)$$$, selecting $$$B=O(\sqrt n)$$$ is best, getting a total time complexity of $$$O((n+q)\sqrt n)$$$ with tiny constant. Note that input and outout for this problem is a massive portion of the total running time.

### 617E - XOR and Favorite Number online version

Statement: Given a static array, a number $$$k$$$ and queries of the form $$$[l,r]$$$, for each query calculate the number of $$$(i,j)$$$ such that $$$a[i]\text{ xor }a[j]=k$$$.

$$$a[i]\text{ xor }a[j]=k \iff a[i]\text{ xor }k=a[j]$$$; we precalculate answers for each block and now count the contribution of new elements to the block interval. We want to know how many elements with value $$$a[i]\text{ xor }k$$$ there are in the current interval.

First of all the current interval is block chunk + at most $$$O(B)$$$ elements, so we can directly maintain a population array on these $$$O(B)$$$ elements. For the block chunk, notice that the count "# of $$$x$$$ in blocks from $$$l$$$ to $$$r$$$" is differentiable. Precalculate "# of $$$x$$$ in blocks from $$$1$$$ to $$$r$$$", and then we're done.

But not quite; this method takes $$$O(B\max A)$$$ memory for the chunk count array. Notice that there are at most $$$O(n)$$$ possible values for $$$a[i]$$$ and $$$a[i]\text{ xor }k$$$; hash these values and we get $$$O(nB+(n/B)^2)$$$ memory, $$$O((n+q)B+(n/B)^2)$$$. Square root decomposition gives $$$O((n+q)\sqrt n)$$$.

### Range distinct query revisited

Statement: Range distinct query but online

Obviously direct precomputation and utilizing $$$pre[i]$$$ and $$$nxt[i]$$$ arrays lead to an $$$O((n+q)\sqrt n)$$$ running time. But can we do better?

Notice that the bottleneck is in the precomputation $$$O((n/B)n)$$$, because for each block we need to compute the answers for all intervals that have this block as its first block. Yet we only store $$$O((n/B)(n/B))$$$ values. Obviously time complexity will be lower bounded by this if we use decomposition methods. Here we will reach this lower bound.

Consider the definition of the answer for an arbitrary interval $$$[l,r]$$$:

Let's put the points $$$(pre[i]+1, i)$$$ onto the 2D plane. Define a function $$$p(x,y)$$$ as the number of points such that its x-coordinate is smaller than $$$x$$$ and its y-coordinate is smaller than $$$y$$$, or a 2D prefix sum. Our answer is now

but $$$p(x,y)$$$ takes $$$O(n^2)$$$ time to calculate. Consider transforming the $$$A$$$ expression to incorporate decomposition into blocks of size $$$B$$$:

Now this is a prefix sum on the points $$$(pre[i]/B+1,i/B)$$$ where division is rounded down. This can obiously be computed in $$$O((n/B)^2)$$$.

Consider the optimal choice of $$$B$$$. The time complexity is

Selecting $$$B=O(\sqrt[3]{n})$$$ gives us a total time complexity of

This runs faster than the currently best known method of persistent segment trees because of its tiny constant for $$$n=10^6$$$. Its memory usage is also much, much smaller.

## More compilicated examples

### Optimizing brute force: 911G - Mass Change Queries

Statement: Given an array and operations of the form $$$l,r,x,y$$$, set $$$a[i]=y$$$ for all $$$i$$$ such that $$$i\in[l,r]$$$ and $$$a[i]=x$$$. Here $$$1\le a[i],x,y\le10^5$$$.

We observe that we can apply these operations to the entire array pretty quickly by storing the locations each value occur in something simlar to a vector and do small-to-large merging.

Do block decomposition on this array into blocks of size $$$B$$$. For blocks that are completely contained by an operation, do small-to-larger merging; otherwise, reconstruct this block and modify with brute force.

I don't know how to prove the true time complexity rigorously, but we can upper bound it. If there is no brute force changes, for each block it will take a total of $$$O(B\log B)$$$ time, because whenever an element is moved, the size of the vector that it is in at least doubles. Total time would thus be $$$O(NB\log B)$$$. If brute force changes on a block absolutely breaks amortized time, moving would take $$$O((N+Q)B\log B)$$$ time, for a total of $$$O((N+Q)B\log B+QN/B)$$$. Assume $$$Q=O(N)$$$ to simplify; $$$O(NB\log B+N^2/B)$$$. Selecting $$$B=\sqrt{\frac{N}{\log N}}$$$ gives

Note here that the $$$\log$$$ is inside the square root sign.

### Value distance query

Statement: Given a static array and queries of the form $$$x, y$$$, find $$$\min(|i-j|:a[i]=x,a[j]=y)$$$.

There are at most $$$O(\sqrt n)$$$ values that occur more than $$$O(\sqrt n)$$$ times. Consider a pair $$$x,y$$$ such that at least one of $$$x,y$$$ occurs more than $$$O(\sqrt n)$$$ times. We try to precompute answers for all such $$$x, y$$$ as there are at most $$$O(n\sqrt n)$$$ pairs.

For all $$$x$$$ where $$$x$$$ occurs more than $$$O(\sqrt n)$$$ times, we sweep the array using $$$x$$$, once forwards and once backwards. We keep track of the last location where $$$x$$$ occured in the direction that we are sweeping, and then we try to update the answer for $$$x$$$ and $$$a[i]$$$ (i is the index of the sweep pointer.)

As well as that, for each value keep a sorted vector of the indices where said value occurs.

When we answer queries, if we already precomputed the answer, we directly use it; otherwise, we run two-pointers on the corresponding vectors that we stored. The total time complexity is $$$O((n+q)\sqrt n)$$$.

### [Ynoi2008] rplexq

Statement: Given a rooted tree and queries of the form $$$(l,r,x)$$$, for each query find the number of $$$i,j$$$ such that $$$l\le i<j\le r$$$ and the LCA of $$$i$$$ and $$$j$$$ is $$$x$$$.

First of all, we answer these queries offline and put the $$$[l,r]$$$ values on the corresponding node. We do dsu on tree, maintaining an implicit segtree of the indices of nodes that belong to some subtree with small-to-large merging.

If we don't even care about the intervals, the amount of pairs that have LCA at node $$$x$$$ is equal to

where $$$j$$$ is the children of $$$x$$$.

When a node has a pretty small amount of children, we can answer the queries with brute force, as prefix sum queries on an implicit segtree has a small enough constant. But when a node has a lot of children AND has a lot of queries on it, this is much too slow.

First off we do a special check for the heavy child, so as to guarantee that the amount of nodes we parse are on the order of $$$O(n\log n)$$$. We then take the light children and condense the indices. Now, we can run Mo's on these indices, where what we store is $$$\sum\binom{colors}2$$$. Due to the ridiculously high initialization constant for Mo's it might be better to run brute force when queries or children count is small enough.

Note that the complexity for Mo's with length $$$n$$$ queries $$$m$$$ is $$$O(n\sqrt m)$$$. Because of

the total time complexity is upper-bounded by $$$O(n\log n\sqrt m)$$$. ODT claims that the time complexity can be $$$O(n\sqrt{m\log n})$$$ to $$$O(n\sqrt m)$$$, but I do not know how to prove this.

If you have other interesting square root decomposition problems, please share them in the comments below and I'll include them in the next edit!

Auto comment: topic has been updated by box (previous revision, new revision, compare).Wow!!! your username is more complicated than my handle's password.

Interesting side note about Mo's that I learned from the book

Looking for a Challengeby monsoon and friends: There are lots of ways that you can sort your queries to make it faster than O(n^2). This blog mentions using buckets, but you can also do this thing called "Hilbert Mo's" which in practice often runs a bit faster, in which you sort the events by their position on the Hilbert Curve.You can also do some sweeps in order to get an MST in n*log(n) time which will give you within 2x the best possible order of points (and it might be possible to solve TSP for manhattan distances fast too, I'm not totally sure on the details with that though). Manhattan distances are important here because you can convert and [l, r] query to a 2d point (x==l, y==r) and then you need to find a path that visits all points to answer all queries.

SecondThread We are waiting for next episode of Algorithms Thread. It is probably the best contest out there.

Blog on hilbert sort for Mo's: https://codeforces.com/blog/entry/61203.

Can you explain Sweepline Mo in more detail?

Thank you for your feedback! Is the new explanation better?

Auto comment: topic has been updated by box (previous revision, new revision, compare).`If you have other interesting square root decomposition problems, please share them in the comments below and I'll include them in the next edit!`

677D - Vanya and Treasure

IdeaFor each $$$i \geq 2$$$, you can use bfs in $$$O(nm)$$$ if $$$freq[i - 1] \cdot freq[i] > nm$$$, otherwise you can execute updates naively in $$$O(freq[i - 1] \cdot freq[i])$$$.

1093E - Intersection of Permutations

IdeaIt's possible to use a Fenwick tree for each block of size $$$k$$$ and execute swaps naively within each block. The complexity is $$$O(n^2\log{n}/k + nk)$$$. The best $$$k$$$ is $$$\sqrt{n\log{n}}$$$ and the total complexity is $$$O(n\sqrt{n\log{n}})$$$. 93571194

Thank you for the suggestions! I will try to solve these problems and add them to the post.

A bonus technique when applying sqrt decomposition especially on tree is described here, in problem F. It consider nodes with height divides by sqrt(n) special nodes and store some specific data checkpoint, making some path queries possible when binary lifting and HLD are too hard to fit memory/time limit.

Thank you for the problem link. I liked this problem a lot!

Separate this auxillary array into blocks of size $$$O(\sqrt {n})$$$. For each block, maintain a value lazy. When we do increment/decrement a suffix, for blocks that intersect with this suffix yet are not completely covered, we change the elements with brute force. Otherwise, we modify the lazy tag. To query a position, we add the (brute force) value with its corresponding lazy tag.Can you explain this in more detail? Thank you in advance!

Sorry for necroposting but in the "Range distinct query (SPOJ DQUERY)" part, shouldn't the $$$pre[r + 1] < l$$$ be $$$pre[a[r + 1]] < l$$$ (and similarly for $$$r < nxt[l - 1]$$$), where $$$a$$$ is our array? box

no. The arrays $$$pre$$$ and $$$nxt$$$ are indexed by the respective position in the array; this means that $$$pre[x]$$$ holds the index of the last position in $$$a$$$, before $$$x$$$, that is equal to $$$a[x]$$$.

Sorry for necroposting, but for the range inversion queries in $$$O(n \sqrt{n})$$$, the blog does not really mention about it, but I think we need to do both prefix update and suffix update because when we are removing or adding on the left side of the current query boundaries, we need to calculate the elements smaller than that element, and for removing or adding on the right side of the current query boundaries, we need to calculate the element greater than that element. It doesn't change the time complexity, but I thought some people might get confused. My submission on Library Checker. Again, sorry for necroposting, and please let me know if there is a way to just only use the range suffix updates as mentioned in the blog.

A prefix update can be simulated by just a global offset, and an inverse suffix update.

Is $$$O(n\sqrt q)$$$ the lower bound time complexity for Mo's algorithm? I think it's not coincident that the Hilbert curve and the basic bucket approach achieve the same complexity.

In other words, we have to construct $$$q$$$ points in $$$n\times n$$$ table such that the Manhattan shortest path to visit all points is $$$O(n\sqrt q)$$$. For $$$q = 2$$$, the path should be $$$O(n)$$$. For $$$q=O(n^2)$$$, the path should be $$$O(n^2)$$$. Both extreme cases make sense. Can someone prove this hypothesis?

Not sure what exactly are you looking for, but placing points into cells with both coordinates divisible by $$$k$$$ also works. In this case $$$q = \Theta \left(\frac{n^2}{k^2}\right)$$$, so the algorithms' complexity is $$$O\left(\frac{n^2}{k}\right)$$$ and the lower bound for Manhattan shortest path is $$$\Omega(qk)$$$ which is $$$\Omega\left(\frac{n^2}{k}\right)$$$.

Nice proof! Thank you.

Hey, in Range Distinct Value Query, you need to store $$$O((n/B)(n/B))$$$ values, and if $$$B = \sqrt[3]{n}$$$, you need $$$O(\sqrt[3]{n ^ 4})$$$ values. Hence, for $$$n = 10^6$$$, it takes more memories $$$O(100 \cdot n)$$$ than persistent segment tree $$$O(4 \cdot n)$$$