Union Find

There are two standard ways to traverse a graph: BFS and DFS. Suppose you are given a graph problem that asks you to count the number of connected components. Here are the steps you can take.
1. Build the graph
2. Loop over every node, skipping nodes that are already visited
3. Run BFS or DFS from each unvisited node
4. The number of BFS or DFS calls equals the number of connected components, because each call marks one entire connected component as visited

Time complexity would be O(V+E), where V is the number of nodes and E is the number of edges. It is linear complexity.
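To make the steps concrete, here is a minimal Python sketch of this approach (the 0-indexed node labels and the edge-list input format are my assumptions, not from the original problem statement):

from collections import defaultdict

def count_components(n, edges):
    # Step 1: build the graph as an adjacency list.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    visited = [False] * n
    components = 0

    # Steps 2-4: loop over every node; each DFS call marks one whole
    # component as visited, so the number of calls is the answer.
    for start in range(n):
        if visited[start]:
            continue
        components += 1
        visited[start] = True
        stack = [start]  # iterative DFS
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append(nxt)
    return components

# Example: 5 nodes forming two components, {0, 1, 2} and {3, 4}.
print(count_components(5, [(0, 1), (1, 2), (3, 4)]))  # -> 2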

This is a simple problem with a simple solution: BFS or DFS works perfectly.

Now, let's tweak the problem a little. Suppose the edges arrive as a stream instead of the entire graph being available from the beginning, and you are asked to report the number of connected components as each new edge comes in. With plain BFS or DFS you have to re-traverse the graph every time a new edge arrives, because the new edge may or may not decrease the number of connected components. Say there is an edge connecting nodes u and v. There are two possibilities for this edge:
1. if u and v belong to the same connected component, the number of connected components is unchanged
2. if u and v belong to different connected components, the number of connected components decreases by one, because the component containing u is merged with the component containing v

The time complexity of this naive approach would be O(V+E) * O(E) = O(VE + E^2). It is quadratic regardless of whether the graph is dense or sparse.
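To see where that cost comes from, here is a rough Python sketch of the naive streaming approach, reusing the count_components sketch above (the input format is again my assumption):

def components_over_stream(n, edge_stream):
    seen = []
    counts = []
    for edge in edge_stream:
        seen.append(edge)
        # Rebuild and re-traverse the whole graph for every new edge:
        # O(V + E) work per edge, O(VE + E^2) overall.
        counts.append(count_components(n, seen))
    return counts

print(components_over_stream(5, [(0, 1), (1, 2), (3, 4)]))  # -> [4, 3, 2]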

The question to ask is: can the problem be solved in a better way?

Naive approach of union find

Let's keep an array that tracks component ids. The size of the array equals the number of nodes, and each node is assigned the id of the component it belongs to. With this setup, when a new edge comes in, we look up the table and perform the following actions.
1. if the two nodes of the edge belong to the same component (have the same component id), skip it.
2. if they belong to different components, update the component ids.

Here is the pseudo code for updating component ids:

componentid = array(size V), initialized to node id (node id doubles as component id)

for each edge(u,v):
    if componentid[u] != componentid[v]: // find operation
        cid = componentid[u]
        for each i in 0 to n-1: // this loop is the merge (union) operation; n is the number of nodes
            if componentid[i] == cid:
                componentid[i] = componentid[v]

For each edge, a union operation may relabel up to n-1 nodes, so a single union costs O(V), and over E edges the total cost of union operations is O(VE), which is O(V^2) assuming a sparse graph (E = O(V)). There are E find operations in total, and each find is a single O(1) array lookup, so the total cost of finds is O(E). All together, the total cost is O(V^2) + O(E) = O(V^2). The code is simpler, yet the complexity is still the same because the union operation is expensive. Is there any way to improve the union operation?
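Here is a runnable Python translation of the pseudo code above, extended with a running component count (the counter is my addition for illustration; the pseudo code only relabels):

def naive_union_find(n, edge_stream):
    component_id = list(range(n))  # each node starts as its own component
    count = n
    counts = []
    for u, v in edge_stream:
        if component_id[u] != component_id[v]:  # find: O(1) lookup
            old_id = component_id[u]
            # Union: relabel every node of u's component, an O(n) scan.
            for i in range(n):
                if component_id[i] == old_id:
                    component_id[i] = component_id[v]
            count -= 1
        counts.append(count)
    return counts

print(naive_union_find(5, [(0, 1), (1, 2), (3, 4)]))  # -> [4, 3, 2]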

Optimized union operation

If you think about it, we don't actually need to iterate over every single node in a component and update its id. If we treat each component as a tree with a designated root node, we only need to update the component id of the root. Picture two such trees, one containing u and one containing v, and a new edge arriving between u and v.

When merging the components, you only need to update the component id of the root of u's tree. Each find operation walks up to the root and looks up its component id. How do you find the root of a component? Keep following parent pointers until you reach a node where node id == parent[node id]. Find is no longer O(1), but union is now O(1), because each union updates exactly one node. Here is the optimized version of the pseudo code.

for each edge(u,v) in edges:
    rootU = find(u)
    rootV = find(v)
    if rootU != rootV:
        componentid[rootV] = rootU // point one root at the other: O(1) union

function find(x):
    while componentid[x] != x: // walk up until a node is its own parent
        x = componentid[x]
    return x

Let's look at the complexity of these operations.
The cost of each union operation is O(1), and there are at most E unions, so the total cost of union operations is O(E).
The cost of a find operation is O(V) in the worst case (when the tree degenerates into a single chain, like a linked list), and there are O(E) finds. The total cost of find operations in the worst case is O(VE), which is O(V^2) assuming a sparse graph.
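To see the worst case concretely, here is a small Python sketch where an adversarial merge order (an input I chose purely for illustration) degenerates the tree into a chain:

def chain_demo(n):
    parent = list(range(n))

    def find(x):
        steps = 0
        while parent[x] != x:
            x = parent[x]
            steps += 1
        return x, steps

    # Each new edge makes the fresh node the root of everything so far,
    # growing the chain 0 -> 1 -> 2 -> ... -> n-1.
    for u in range(1, n):
        root_u, _ = find(u)
        root_prev, _ = find(u - 1)
        if root_u != root_prev:
            parent[root_prev] = root_u

    return find(0)[1]  # parent hops from node 0 up to the root

print(chain_demo(1000))  # -> 999, i.e. a single find costs O(V) here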

We are heading in the right direction, but the time complexity hasn't improved, which means further optimization is needed.

Union find with merging rules

If you blindly merge one tree into another, you might end up with a single chain, i.e., a linked list. The best case for a tree structure is a balanced tree, whose height is O(log n). How can we keep the trees almost balanced so that the find operation is O(log n)?

Here is the claimed rule:
When merging two components, you will achieve O(log n) height if you make the root of the smaller component point to the root of the larger component.

Let's prove that the claim is correct. Say there are two connected components (trees), u and v, with N(u) and N(v) nodes respectively.

With this setting, we need to prove that h <= log(N(u) + N(v)), where h is the height of the merged component (tree).
This will be a proof by induction on the number of union operations.

Base case:
1. All nodes are disconnected: every component is a single node, its height is 0, and 0 <= log 1 holds (only 1 node in the tree).
2. An edge joins two nodes from step 1: then h = 1 and n = 2, and 1 <= log 2 holds, so the claim is still true after the first union operation.

After the above steps, there are two possible cases:
1. two disconnected nodes are joined: same as step 2 above
2. a 2-node component and a 1-node component are joined: then n = 3 and, by the rule, the single node hangs under the root of the larger tree, so h = 1 and 1 <= log 3 holds.

Assume the claim is true after i union operations; we need to prove h <= log n after the (i+1)-th union operation. Write H(u) and H(v) for the heights of trees u and v, and assume tree v is the larger one, N(u) <= N(v), so the rule attaches u's root under v's root. The height of the newly joined tree is h = max(H(v), H(u) + 1). There are two possible cases: H(v) > H(u), or H(v) <= H(u).

Case 1. H(v) > H(u). Attaching tree u under v's root doesn't change the height, so the new height of the resulting tree is still H(v). We know that H(v) <= log N(v) (induction hypothesis), which also makes H(v) <= log(N(v) + N(u)) true. The new height h is H(v), so this guarantees h <= log(N(v) + N(u)).

Case 2. H(v) <= H(u). Then h (the new height) = H(u) + 1. Based on the induction hypothesis, we know H(u) <= log N(u). Adding 1 to both sides gives H(u) + 1 <= log N(u) + 1, and log N(u) + 1 = log 2N(u). Since H(u) + 1 is the new height h, this leads to:
h <= log 2N(u). Given N(u) <= N(v), we have 2N(u) <= N(u) + N(v), so h <= log(N(u) + N(v)), which guarantees the new height is bounded by log(N(u) + N(v)).

Based on the above proof, we know that the trees stay balanced (height O(log n)) as long as merges follow the rule.
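If you want to sanity-check the claim empirically, here is a small Python experiment; the random edge stream and the height bookkeeping are my own scaffolding, not part of the proof:

import math, random

def check_height_claim(n, trials=5):
    for _ in range(trials):
        parent = list(range(n))
        size = [1] * n
        height = [0] * n  # height of the tree rooted at each root

        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x

        for _ in range(3 * n):  # plenty of random edges
            ru, rv = find(random.randrange(n)), find(random.randrange(n))
            if ru == rv:
                continue
            if size[ru] < size[rv]:  # merge smaller (rv) into larger (ru)
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
            height[ru] = max(height[ru], height[rv] + 1)
            assert height[ru] <= math.log2(size[ru])

check_height_claim(1000)
print("height <= log2(size) held on every merge")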

With that said, the total complexity will be
(n-1) * cost of union op + E * cost of find op = (n-1) * O(1) + E * O(log n) = O(E log n), which is O(n log n) for a sparse graph.

Here is the pseudo code with all the optimizations applied:

parents = a 1D array of size n
sizes = a 1D array of size n

for i in 0 to n-1:
    parents[i] = i
    sizes[i] = 1

for (u, v) in edges:
    rootU = find(u)
    rootV = find(v)
    if rootU != rootV:
        // attach the smaller tree under the larger root (union by size)
        if sizes[rootU] < sizes[rootV]:
            parents[rootU] = rootV
            sizes[rootV] += sizes[rootU]
        else:
            parents[rootV] = rootU
            sizes[rootU] += sizes[rootV]
            
function find(x):
    while parents[x] != x:
        x = parents[x]
    return x
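Here is a runnable Python version of the same pseudo code, wrapped in a class and extended with a running component count (the class shape and the counter are my choices for illustration):

class UnionFind:
    def __init__(self, n):
        self.parents = list(range(n))
        self.sizes = [1] * n
        self.count = n  # current number of connected components

    def find(self, x):
        while self.parents[x] != x:
            x = self.parents[x]
        return x

    def union(self, u, v):
        root_u, root_v = self.find(u), self.find(v)
        if root_u == root_v:
            return
        # Attach the smaller tree under the larger root (union by size).
        if self.sizes[root_u] < self.sizes[root_v]:
            root_u, root_v = root_v, root_u
        self.parents[root_v] = root_u
        self.sizes[root_u] += self.sizes[root_v]
        self.count -= 1

uf = UnionFind(5)
for edge in [(0, 1), (1, 2), (3, 4)]:
    uf.union(*edge)
    print(uf.count)  # -> 4, then 3, then 2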

Time complexity is O(n log n), slightly slower than DFS or BFS, but you don't need to build the graph up front for traversal. In addition, when the graph changes dynamically, union find is better than re-running DFS or BFS.

Final optimization – Path Compression

There is one final optimization. You might have noticed that we traverse up to the root every time we merge. What if, after the traversal, we update each visited node's parent to point directly at the root? It is slightly more work up front, but it saves time on the next root search. In other words, we can compress the path (flattening the tree). The updated code uses recursion for simplicity.

function find(x):
    // base case: x is its own parent, so it is the root
    if x == parents[x]:
        return x

    // recursive case: find the root, then point x directly at it
    rootX = find(parents[x])
    parents[x] = rootX
    return rootX

The worst-case time complexity of a single find is still O(log n). However, path compression flattens the tree as you go, and combined with merging by size the amortized cost per operation becomes nearly O(1) in practice (the bound involves the inverse Ackermann function, which is at most 4 for any realistic input size).
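For completeness, here is an iterative two-pass variant of the same idea in Python (a common alternative when recursion depth is a concern; this variant is not in the pseudo code above):

def find(parents, x):
    # First pass: walk up to locate the root.
    root = x
    while parents[root] != root:
        root = parents[root]
    # Second pass: point every node on the path directly at the root.
    while parents[x] != root:
        parents[x], x = root, parents[x]
    return root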

Conclusion

That's it for union find. DFS and BFS are the usual tools for many graph problems, but there are cases where they don't work well, and union find can deliver better performance with simpler code. I hope this helped you understand union find better. Thank you for reading.