Understanding Disjoint Set Structures

#	User	Rating
1	ecnerwala	3648
2	Benq	3580
3	orzdevinwang	3570
4	cnnfls_csy	3569
5	Geothermal	3568
6	tourist	3565
7	maroonrk	3530
8	Radewoosh	3520
9	Um_nik	3481
10	jiangly	3467

#	User	Contrib.
1	maomao90	174
2	adamant	164
2	awoo	164
4	TheScrasse	160
5	nor	159
6	maroonrk	156
7	SecondThread	150
8	-is-this-fft-	149
9	pajenegod	145
10	BledDest	144

Again, Im writing this both to help others and test myself so I will try to explain everything at a basic level. Any feedback is appreciated.

Disjoint Set(DS) data structures are representations of sets (which are all disjoint, sharing no elements) with certain functions:

FindSet(x): finds the set of element x
UniteSets(x,y): unites the sets x and y
MakeSet(x): makes a set with element x
Disunion(list): removes all elements from other sets and makes a new set with these elements
(Note: the disunion operation is not as commonly used as the other operations and is not implemented efficiently, with an O(n) runtime.)

DS is used in various algorithms, such as Kruskal's minimum spanning tree finder.

This post will proceed with a naive implementation, and then progress to an efficient solution.

The Naive Solution

We construct a forest (group of disjoint trees). Each element has a parent node and a list of children. (The list of children can be omitted if we are not intending to disunite sets.)

Define P(x): a function which maps an element to its "parent" element. In each set there will be exactly one parent, and P(parent) = parent, this is how a set is referred to.

Each element e starts as being in its own set, with P(e)=e. We can make an element easily with this element, and we can unite two sets X,Y with the following method.

Given two elements x,y, iterate P on each until P(x)=x, P(y)=y. Then set P(x)=y and add x to y's children. Now all elements that were children of x are children of y, so we have successfully united the two sets.

To disunite two sets, iterate through every node N (from farthest from root to root) that we are intending to breakaway. Change P of N's children to P(N) unless N=P(N) and N has more than one child or P(N) is going to be removed as well. If N=P(N)& N has more than one child (if it had one it could be removed with no consequence) instead make one of P(N)'s children the root. (Make sure not to select a child that is going to be removed, if all children are going to be removed go one level deeper.) If P(N) is going to be moved, repeat the procedure for P(N), with all of N's children included under P(N). As every node is processed at most once, the runtime is O(N(cost of removing from children + cost of adding to children)+(cost of building the new set, or n^2)).

Lets look at their runtimes:
FindSet : O(n), as in worst case the structure is a linked list.
UniteSets: O(n), as its runtime is determined by two FindSet calls.
MakeSet: O(1).
Disunion: O(n(cost of removing from children + cost of adding to children)+ n^2). Don't get too caught up memorizing the above algorithm, it is unwieldy and slow compared to the actual solution.

Optimization 1: Balanced Trees with log N height

Looking at the first version of our DS one possible optimization becomes appearant. Our DS is a forest, so if all the trees were balanced, with a height of log N (something like a binary tree) then both FindSet and, as a result, UniteSets, will take O(log N) time.

As we build our sets from singular elements, our tree structure is determined by UniteSets. This points to potentially editing UniteSets to improve our runtime. Instead of arbitarily combining two sets, we can be smarter. Notice the height of the resulting tree is max(height(x)+1,height(y)) as height(x) and height(y)-1 are the heights of the two subtrees of root y after unification. Looking at this equation we notice some inefficency, if height(x)>height(y) then the height of the resulting tree is greater than if x became the new root. Thus, if we choose the taller tree's root as the final root, we have a more efficent solution.

It turns out that this makes the tree's have a height of O(log N). Let the current maximum height of the whole DS be h. To increase h, we need to combine a tree with height h with another tree of height h (otherwise the max height wouldn't change.) Thus, a tree of height h+1 requires the sum of the elements in the trees of height h. Thus we get a recurrence, # of nodes in tree of height h+1 = 2*# of nodes in tree of height h. As it takes 2^n nodes to form a tree of height n, the height of a tree with N nodes is log N.

Lets look at the updated runtimes:
FindSet : O(log n), as in worst case the structure is a binary tree.
UniteSets: O(log n), as its runtime is determined by two FindSet calls.
MakeSet: O(1).
Disunion: O(n(cost of removing from children + cost of adding to children)).

Optimization 2: Path Compression

The runtimes with optimization 1 are pretty good (Disunion will always be O(n) at worst as removing n arbitrary nodes requires touching all n nodes), but can we make them better? With path compression, we can make them even faster (and decrease the memory needed.) However, this heuristic makes analyzing the runtimes kind of complicated based on what you are using the DS for.

The inspiration for this optimization comes from inefficency in FindSet, namely that calling FindSet(x) with the same x multiple times still requires iterating up the tree. This inspires a dp solution, where we memoize the final parent of x. Then, realize that we kind of already do that with P. Instead of creating a seperate array, we can just update P.

When we do path compression we do just that, for every node n from x to the root, set P(n)=root. That way, each of the nodes on the root are only 1 step from finding their parent. For these nodes FindSet becomes O(1).

The above image shows path compression after calling findSet(1).

When we unite this set with other sets the distance from these nodes to their root could increase again, and thus our O(1) time progressively becomes ruined. Regardless, allowing path compression results in a much flatter tree, and thus on average reduces the runtime.

Note that with path compression, we can improve our disunion algorithm. We no longer need to store the children of each node. To disunion, run FindSet on every node in order to compress every node. This makes every tree have a height of two. Removing most nodes is now trivial, as they are leaves. Removing the roots is a little harder, but this can be accomplished intelligently. After compressing all the nodes, we again look at every node, and if a we process a node that is not a root whose parent is being disunioned and a root, instead make the non-root node the new root. Then, all the trees have a root that is not being removed (unless the whole tree is being removed, in which case we don't have to worry about it). After compressing all the nodes again, all disunioned nodes can be removed trivially and heights can be recalculated.

Lets look at our final runtimes:
FindSet : O(α(n)) (This and UniteSets' worst case is the reverse Ackermann function, as uniting two sets with path compression shortens the trees every union. This is effectively constant.)
UniteSets: O(α(n))
MakeSet: O(1).
Disunion: O(n).

All with O(n) space!

Implementation

This implementation (especially disunion) isn't the most efficient/consise as readability has been emphasized. Also note that MakeSet has not been included as it doesn't even require a method proper, just setting par[x]=x. All the methods function as described above. par[x]= parent of x, rnk[x] = height of x (aka rank of x, notice that by default a set of height h has a rnk of h-1). Disunion functions by removing all elements with rm[x]=1, and setting their par to nroot.

  #include <bits/stdc++.h>
 #define MAXN 100000
 using namespace std;
 int par[MAXN],rnk[MAXN],N;
 bool rm[MAXN];
int getPar(int x){
  if(par[x]==x)return x;
  par[x]=getPar(par[x]);
  return par[x];
 }
int unitePar(int x,int y){
  int p1=getPar(x),p2=getPar(y);
  if(rnk[p1]>rnk[p2]){
    par[p2]=p1;
    rnk[p1]=rnk[p2]+1;
  }
  else{
    par[p1]=p2;
    rnk[p2]=max(rnk[p2],rnk[p1]+1);
  }
 }
void disunion(int nroot){
  int p;
  for(int i=0;i<N;++i)getPar(i);
  for(int i=0;i<N;++i){
   p=getPar(i);
   if(rm[p]&&!rm[i]){
    par[p]=i;
    par[i]=i;
   }
  }
  for(int i=0;i<N;++i)getPar(i);
  for(int i=0;i<N;++i){
   if(rm[i]){
    par[i]=nroot;
    rm[i]=0;
   }
  }
  memset(rnk,0,sizeof(rnk));
  for(int i=0;i<N;++i){
   p=getPar(i);
   if(p!=i)rnk[p]=1;
  }
 }