Intro to Union-Find Data Structure

Union-Find (also called a disjoint-set structure) resembles a tree data structure, except that each node points to its parent instead of its children.

Union-find is very useful for set operations such as the following.
1. find – find the root of the tree that contains the current node
2. same component – check if two nodes have the same root node (that is, belong to the same set)
3. union (merge) – merge two trees (sets) into one

Such operations are very useful for problems like connected components, where each node belongs to exactly one connected component, and vertex coloring.

Union-find is also useful for algorithms such as Kruskal’s algorithm, which builds a minimum spanning tree: a minimum-weight subset of edges that forms a tree connecting all vertices.

How is the union-find structured?

The union-find data structure represents each subset as a reversed tree in which each node points to its parent instead of its children.
In this example, I used an array to keep track of parent nodes, for both efficiency and simplicity.
But you could use a hashmap to deal with a more complicated node structure.

Let’s take a look at the example below.

[Figure: union-find data structure example]

There are two sets, and each set is stored as its own tree.
The first set has 4 nodes – 2, 1, 0, 3 – where node 2 is the root node.
The other set has 1 node – 4.

As you can see from the above, I have an array to keep track of the parent of each node.
For example, the parent node of node 0 is 1 and the parent node of node 1 is 2.
So the value stored at index 0 is 1 and the value stored at index 1 is 2: each index holds the value of its parent node.

Algorithm

Let’s take a look at the algorithms for each union-find operation.

Find

Again, this operation finds the root node of the provided node.
For the find operation, you just need to recursively follow the parent node until you reach the root.
How do you know when you have reached the root?
In the above structure, the root is the node whose index and value are equal.
If you are using another data structure, most likely one that maps a key to a value, you just need to check whether the key and the value are the same.
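
As a rough illustration of the key/value variant, here is a minimal sketch that assumes node names are strings stored in a std::unordered_map. The ParentMap alias and the findRoot name are made up for this sketch and are not part of the class shown later in this post.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical parent map: each key is a node name, each value is its parent.
// A root is any node whose key and value are the same.
using ParentMap = std::unordered_map<std::string, std::string>;

// Climb from nodeId to its root.
// Assumes every node on the path exists as a key in the map.
std::string findRoot(const ParentMap &parents, const std::string &nodeId)
{
    std::string current = nodeId;
    while (parents.at(current) != current)
    {
        current = parents.at(current);
    }
    return current;
}
```

With a map such as {a: b, b: c, c: c}, calling findRoot(parents, "a") climbs a, b, c and stops at c because its key and value match.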

Same Component

This operation checks if two nodes belong to the same tree, that is, whether they have the same root node.
This one is very easy once the ‘Find’ operation is implemented, as you just need to find the root node of both nodes and check if they are the same.

Merge

The operation merges two trees into one.
However, merging without much thought may produce an unbalanced tree and hurt overall performance.
How should we merge the trees?
You can always link the tree that has the lower height under the higher one.
For example, if you are to merge the two trees in the above picture, there are only two cases.
You can link root node 2 under 4, which increases the height by 1.
Or you can link root node 4 under 2, in which case the height remains the same.
Therefore you should always merge the lower tree into the higher one.
The class in the Code Example section tracks the number of elements in each tree instead of its height and links the smaller tree under the larger one, which keeps the trees shallow in a similar way.
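
The height-based rule described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the HeightUnionFind name and the heights array are hypothetical, and a single node is given height 0.

```cpp
#include <vector>

// Sketch of a height-based merge. 'heights' is a hypothetical array that
// stores, for each root, the height of its tree (0 for a single node).
struct HeightUnionFind
{
    std::vector<int> parents;
    std::vector<int> heights;

    explicit HeightUnionFind(int totalNodes)
        : parents(totalNodes), heights(totalNodes, 0)
    {
        for (int i = 0; i < totalNodes; ++i)
        {
            parents[i] = i;
        }
    }

    int find(int nodeId) const
    {
        while (parents[nodeId] != nodeId)
        {
            nodeId = parents[nodeId];
        }
        return nodeId;
    }

    void merge(int node1, int node2)
    {
        int root1 = find(node1);
        int root2 = find(node2);
        if (root1 == root2)
        {
            return;
        }

        if (heights[root1] < heights[root2])
        {
            // lower tree goes under the higher one, height unchanged
            parents[root1] = root2;
        }
        else if (heights[root1] > heights[root2])
        {
            parents[root2] = root1;
        }
        else
        {
            // equal heights: either choice works, and the result grows by 1
            parents[root2] = root1;
            ++heights[root1];
        }
    }
};
```

The height only grows when two trees of equal height are merged, which is what keeps the trees shallow.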

Code Example

#include <vector>

using namespace std;

class UnionFind
{
    int totalNodes;
    
    // indicates parent nodes
    vector<int> parents;
    
    // indicates number of elements in a subtree
    vector<int> numElts;
    
public:
    UnionFind(int _totalNodes) : totalNodes(_totalNodes), parents(totalNodes), numElts(totalNodes)
    {
        for (int i = 0; i < totalNodes; ++i)
        {
            parents[i] = i;
            numElts[i] = 1;
        }
    }
    
    int find(int nodeId)
    {
        if (parents[nodeId] == nodeId)
        {
            return nodeId;
        }
        else
        {
            return find(parents[nodeId]);
        }
    }
    
    bool sameComponent(int node1, int node2)
    {
        return find(node1) == find(node2);
    }
    
    void merge(int node1, int node2)
    {
        int root1 = find(node1);
        int root2 = find(node2);
        
        // already in the same component. noop
        if (root1 == root2)
        {
            return;
        }
        
        
        // link the root of the smaller tree under the root of the larger one
        if (numElts[root1] >= numElts[root2])
        {
            parents[root2] = root1;
            numElts[root1] += numElts[root2];
        }
        else
        {
            parents[root1] = root2;
            numElts[root2] += numElts[root1];
        }
    }
};


Performance

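Find – O(log n)
It climbs the tree until it reaches the root. Because merge always links the smaller tree under the larger one, the height of a tree stays within log n.

Same Component – O(log n)
It only needs to complete two find operations.

Merge – O(log n)
It completes two find operations and then links one root under the other, which takes constant time.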

Conclusion

Union-find is a very powerful data structure that handles set operations very quickly.
And the data structure itself is very simple too!
It is used in many algorithms, such as Kruskal’s algorithm for building a minimum spanning tree.
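
To give a feel for how Kruskal’s algorithm leans on find and merge, here is a minimal sketch. The Edge struct and the findRoot and kruskalTotalWeight names are assumptions made for this illustration, and the merge step is a plain link without any balancing to keep the code short.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical edge type for this sketch.
struct Edge
{
    int from;
    int to;
    int weight;
};

// Follow parent links until reaching a node that is its own parent.
int findRoot(const std::vector<int> &parents, int nodeId)
{
    while (parents[nodeId] != nodeId)
    {
        nodeId = parents[nodeId];
    }
    return nodeId;
}

// Returns the total weight of the edges Kruskal's algorithm keeps.
// Nodes are assumed to be numbered 0 .. totalNodes - 1.
int kruskalTotalWeight(int totalNodes, std::vector<Edge> edges)
{
    // consider the cheapest edges first
    std::sort(edges.begin(), edges.end(),
              [](const Edge &a, const Edge &b) { return a.weight < b.weight; });

    // every node starts as the root of its own one-node tree
    std::vector<int> parents(totalNodes);
    std::iota(parents.begin(), parents.end(), 0);

    int totalWeight = 0;
    for (const Edge &edge : edges)
    {
        int root1 = findRoot(parents, edge.from);
        int root2 = findRoot(parents, edge.to);

        // keep the edge only if it connects two different components
        if (root1 != root2)
        {
            parents[root2] = root1;
            totalWeight += edge.weight;
        }
    }
    return totalWeight;
}
```

An edge whose endpoints already share a root would create a cycle, so it is skipped. That check is exactly the same component operation.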

Intro to Array data structure Part 2 – 2D Array

Continuing from the last post, I am going to discuss more about the array data structure, this time the 2D array.
In this post, I will focus on the basic concept and usage.

What is 2D Array?

A 2D array is basically an array within an array.
Typically each element of an array holds a value of the specified data type.
However, it is possible for each element to hold another array instead!

This is a 2D array with 3 rows and 5 columns.
In each row, there is an array of size 5, which is the column size.
In other words, there is an array of size 3 (the rows), and each of its elements is another array of size 5.
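
One way to see the "array within array" idea is to ask the compiler for sizes. The sketch below is only an illustration: the grid variable and the numRows and numCols helpers are names I made up for it.

```cpp
#include <cstddef>

// The 3 x 5 array from the description: an array of 3 rows,
// where each row is itself an array of 5 ints.
int grid[3][5] = {};

// Number of rows: size of the whole array divided by the size of one row.
std::size_t numRows()
{
    return sizeof(grid) / sizeof(grid[0]);
}

// Number of columns: size of one row divided by the size of one element.
std::size_t numCols()
{
    return sizeof(grid[0]) / sizeof(grid[0][0]);
}
```

Here grid[0] is a whole row (an array of 5 ints), which is why dividing by its size gives the row count.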

Code Example

Here is a basic example of declaring and accessing a 2D array.

#include <iostream>

using namespace std;

int main()
{
    // declaration of 2D array without initializing elements
    int two_d_arr[3][5];
    
    // you can also initialize 2D array just like 1D array
    // you just need to make sure there is an array for each element.
    // 3 rows. Each row contains an array of size 5
    int two_d_arr2[3][5] = {{10, 15, 23, 31, 3}, {13, 72, 29, 19, 85}, {61, 42, 1, 5, 27}};
    
    // when initializing 2D array you don't need to specify row size
    // compiler will figure out row size for you as long as correct column size is provided
    // int two_d_arr2[][5] = {{10, 15, 23, 31, 3}, {13, 72, 29, 19, 85}, {61, 42, 1, 5, 27}};
    
    // you can use for loop to access 2D array
    // this will print out each column per row first
    for (int i = 0; i < 3; ++i)
    {
        for (int j = 0; j < 5; ++j)
        {
            // first bracket [i] represents row index
            // second bracket [j] represents column index
            cout << two_d_arr2[i][j] << " ";
        }
        
        cout << endl;
    }
    cout << endl;
    
    /*
     * output
     * 10 15 23 31 3
     * 13 72 29 19 85
     * 61 42 1 5 27
     */
    
    // or you can switch row and column if you want
    // this will print out each row per column first
    for (int i = 0; i < 5; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            // here the first bracket [j] is the row index
            // and the second bracket [i] is the column index
            cout << two_d_arr2[j][i] << " ";
        }
        
        cout << endl;
    }
    
    /*
     * output
     * 10 13 61
     * 15 72 42
     * 23 29 1
     * 31 19 5
     * 3 85 27
     */
    
    return 0;
}

Accessing a 2D array is really like accessing a 1D array, except that you need to specify both the row and the column index.

Here is another code example in which a 2D array is used as a function parameter.

#include <iostream>

using namespace std;

// array parameter needs to know column size. 
// but row size is still necessary as another parameter in order to loop it
void printArray(int arr[][5], int rowSize)
{
    for (int i = 0; i < rowSize; ++i)
    {
        for (int j = 0; j < 5; ++j)
        {
            cout << arr[i][j] <<  " ";
        }
        
        cout << endl;
    }
}

int main()
{
    int two_d_arr2[][5] = {{10, 15, 23, 31, 3}, {13, 72, 29, 19, 85}, {61, 42, 1, 5, 27}};
    
    printArray(two_d_arr2, 3);
    return 0;
}

Overall, a 2D array is fairly simple to use: we just need to provide the row and column index accordingly.
Sometimes you can skip the row size, namely when declaring a 2D array with initialization or when using it as a function parameter.

Please note that you can have a 3D array or more!
You just need to provide one index per dimension when you access the elements.
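
As a small sketch of the 3D case, the function below sums every element of an array whose two inner sizes are fixed at 3 and 4. The sumCube name and those sizes are arbitrary choices for this illustration.

```cpp
// Sums every element of a 3D array.
// As with a 2D parameter, only the first size may be left out of the brackets,
// so the number of outer elements is passed separately as 'depth'.
int sumCube(int cube[][3][4], int depth)
{
    int total = 0;
    for (int i = 0; i < depth; ++i)
    {
        for (int j = 0; j < 3; ++j)
        {
            for (int k = 0; k < 4; ++k)
            {
                // one index per dimension
                total += cube[i][j][k];
            }
        }
    }
    return total;
}
```

A caller would declare something like int cube[2][3][4] and pass it as sumCube(cube, 2).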

Conclusion

We have taken a look at the basics of the array data structure.
However, there are still more topics to discuss about arrays!
I will try to have another post about it.
You might also want to take a look at my post about linked lists so you can make an apples-to-apples comparison.

Intro to Array data structure Part 1

Today, we will go over a popular data structure – array.
It is one of the most basic and fundamental data structures in computer science, whatever the programming language.
Even many complex data structures use an array inside their implementation.
In a language like Python, the everyday sequence type is the list rather than a fixed-size array. (Of course, a Python list is much more flexible and easier to use than a C++ array.)
In this post, I will go over some important characteristics of arrays with example code based on C++.

What is Array?

An array is a contiguous block of memory that holds elements of a single data type.
It sounds hard but it will be very clear once you see this picture and example code.

Here, this picture shows an array of integers with size 3.

[Figure: array]

The data type is integer and the array has a total of three elements.
Please note that all three elements are located right next to each other.

The first element of the array contains the integer 10, the second contains 3, and the third contains 99.
In order to access each element, we need to know the index of the element.
This could be a little counter-intuitive, but array indices in C++ always start from 0.

For example, if you would like to access the first element you need to know that it’s located at index 0.

Let’s take a look at the C++ code example for array declaration and basic usage.

#include <iostream>

using namespace std;

int main()
{
    //this is just a normal integer variable
    int temp;
    
    // this is an array of integer with size 3
    // please note that [ ] is necessary in order to declare it as an array.
    // you need to provide size inside [ ] unless you are initializing right away.
    // however, if you are initializing the array like below, the compiler can figure out the size.
    int arr[] = {10, 3, 99};
    
    // 3 inside [ ] means size of the array and it is necessary if you are not initializing like above
    // int arr[3]; 
    
    // you can access each element in the array using index
    // you will see 10 on the screen
    cout << "value of first index of the array:" << arr[0] << endl;
    
    // you can also assign a value to each element using index
    // you will see 11 on the screen
    arr[0] = 11;
    cout << "value of first index of the array after change:" << arr[0] << endl;
    
    // you will see 3 on the screen
    cout << "value of second index of the array:" << arr[1] << endl;
    
    // in this case, arr can only hold int because it's declared as an array of int!
    // this will cause a compilation error!
    // arr[2] = "test";
    
    // implicit conversion from 1.1 to 1. bad!
    // double precision will be dropped!
    // you will see 1 on the screen if you print arr[2]
    // arr[2] = 1.1;
    
    return 0;
}

As you see above, an array is merely a container that can hold many elements of the same data type.
You can have an array of double, short, int or string. (Please note that a C-style string is itself an array of char)

One thing you have to remember is that, unlike a Python list, an array in C++ can contain only a single data type.
In other words, if you have an array of int, then it can only hold int.
Although the compiler lets you assign a double, short or any other numeric type, that doesn’t necessarily mean the result is what you want.

What happens if you mistakenly assign 1.1 to arr[2] like the above example?
It will implicitly convert 1.1 to 1 because an int cannot store the fractional part and keeps only the integer part.
And the user might be surprised to see 1 on the screen instead of 1.1!
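
If a conversion like this is really intended, one option is to make it visible in the code. The sketch below is only a suggestion: storeTruncated and storeRounded are hypothetical helper names, and rounding with std::lround is my own addition rather than something the example above does.

```cpp
#include <cmath>

// Stores a double into an int array slot like the commented-out line above,
// but with an explicit cast so the truncation is visible to the reader.
int storeTruncated(double value)
{
    int arr[] = {10, 3, 99};
    arr[2] = static_cast<int>(value); // the fractional part is discarded
    return arr[2];
}

// Alternative: round to the nearest integer before storing.
int storeRounded(double value)
{
    int arr[] = {10, 3, 99};
    arr[2] = static_cast<int>(std::lround(value));
    return arr[2];
}
```

With these helpers, storeTruncated(1.9) gives 1 while storeRounded(1.9) gives 2, so the choice between dropping and rounding the fraction is explicit.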

Array access with loop

You can use a loop (for, while or do-while) to access an array efficiently and elegantly.
And I will explain it using the example below.

Let’s say you are to write a program that does the following.
1. Take 5 test scores from the user
2. Find the lowest and highest test score
3. Calculate the average of those scores
4. Print the average, lowest and highest score along with individual test scores.

#include <iostream>

using namespace std;

int main()
{
    const int NUM_TESTS = 5;
    int scores[NUM_TESTS];
    
    for (int i = 0; i < NUM_TESTS; ++i)
    {
        cout << "Please enter a test score:";
        
        // you can use for loop to access each element in the array!
        // value of i will be from 0 to NUM_TESTS - 1 which are 0,1,2,3,4
        // please note that the index of the array always starts from 0!
        cin >> scores[i];
    }

    // in order to find the lowest score you need to compare all the elements
    // starting from the element at index 0
    // so far the element at index 0 is the lowest test score.
    int lowestScore = scores[0];
    
    // please note that this loop starts at index 1 since you already got the element at index 0
    // i will be 1,2,3,4
    for (int i = 1; i < NUM_TESTS; ++i)
    {
        // update lowestScore if current element is lower
        if (scores[i] < lowestScore)
        {
            lowestScore = scores[i];
        }
    }
    
    cout << "The lowest score is " << lowestScore << endl;
    
    // please refer to the explanations for the lowest score as this is very similar
    int highestScore = scores[0];
    for (int i = 1; i < NUM_TESTS; ++i)
    {
        if (scores[i] > highestScore)
        {
            highestScore = scores[i];
        }
    }
    
    cout << "The highest score is " << highestScore << endl;
    
    int total = 0;
    for (int i = 0; i < NUM_TESTS; ++i)
    {
        total += scores[i];
    }
    
    double avg = total * 1.0 / NUM_TESTS;
    cout << "Average of test scores:" << avg << endl;
    
    for (int i = 0; i < NUM_TESTS; ++i)
    {
        cout << scores[i] << " ";
    }
    cout << endl;
    
    return 0;
}

As you can see above, it’s much easier to access (read/write) the array using loops, and that is usually the recommended way unless you want to access a specific location.

Array as a function parameter

In the above example, we observed that you can have a variable for the array.
Then you might ask if the array can be used in a function, either as a parameter or as a return type.


The quick answer is that a function can take an array as a parameter but cannot return one.
However, that doesn’t mean returning an array is completely impossible. Instead of returning the array directly, a function can return a pointer to its first element, which can be used much like an array.
A pointer and an array are not exactly the same, but a pointer can still be indexed like an array.
I will discuss the difference between array and pointer in another post.
For now, let’s just focus on arrays used as function parameters.
I rewrote the above example code using functions instead.

#include <iostream>

using namespace std;

// this is basic syntax to have an array parameter
// you only need to specify the parameter is array. don't need to have size inside [ ]
int getHigestTestScore(int scores[], int size)
{
    // -999 is just a custom error code
    if (size <= 0)
    {
        return -999;
    }
    int highestScore = scores[0];
    for (int i = 1; i < size; ++i)
    {
        if (scores[i] > highestScore)
        {
            highestScore = scores[i];
        }
    }
    
    return highestScore;
}

// you could also specify the size like below and it compiles fine.
// but the compiler ignores the 5, so nothing checks the size of the array you pass in
// in the end this function really assumes the array has size 5, which isn't flexible
// and it involves a hard-coded number, which smells
int getHigestTestScore2(int scores[5])
{
    int highestScore = scores[0];
    for (int i = 1; i < 5; ++i)
    {
        if (scores[i] > highestScore)
        {
            highestScore = scores[i];
        }
    }
    
    return highestScore;
}

int main()
{
    int arr[] = {1,2,3,4,5};
    int high1 = getHigestTestScore(arr, 5);
    
    int high2 = getHigestTestScore2(arr);
    return 0;
}

I think the example code makes it pretty clear how to use an array as a function parameter.
Although you can use a pointer instead of an array, I omitted it here to focus on arrays only.
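
For the curious, the pointer form might look like the sketch below. The getHighestTestScorePtr name is made up for this illustration, and I added const because the function only reads the scores.

```cpp
// Same logic as getHigestTestScore, but the parameter is written as a pointer.
// In a parameter list, 'int scores[]' and 'int *scores' declare the same type,
// so the caller can still pass an array name directly.
int getHighestTestScorePtr(const int *scores, int size)
{
    // -999 is the same custom error code used above
    if (size <= 0)
    {
        return -999;
    }

    int highestScore = scores[0];
    for (int i = 1; i < size; ++i)
    {
        // a pointer can be indexed just like an array
        if (scores[i] > highestScore)
        {
            highestScore = scores[i];
        }
    }

    return highestScore;
}
```

Calling it looks the same as before, for example getHighestTestScorePtr(arr, 5) with the arr from main.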

Pros

  • Random access is allowed, so any element can be read or written by its index in constant time
  • No extra space is required, unlike a linked list (which needs next pointers)
  • Memory locality yields better access performance than a linked list since all the elements are located next to each other

Cons

  • Adjusting the size is not flexible. You need to allocate a new array and copy/move the elements to grow or shrink it.
  • Insertion and deletion of elements are quite expensive because you have to copy/move all the elements that come after the affected position.

Performance

Access by index – O(1)
Search by value – O(n) for an unsorted array
Insertion – O(n)
Deletion – O(n)
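
To make the O(n) insertion cost concrete, here is a minimal sketch of inserting into the middle of an array. The insertAt name and its size and capacity parameters are assumptions for this illustration.

```cpp
// Inserts value at position index by shifting later elements one slot right.
// 'size' is the number of slots in use and 'capacity' is the total number of slots.
// Returns false when the array is full or index is out of range.
bool insertAt(int arr[], int &size, int capacity, int index, int value)
{
    if (size >= capacity || index < 0 || index > size)
    {
        return false;
    }

    // move every element from index onward one slot to the right,
    // starting from the end so nothing gets overwritten
    for (int i = size; i > index; --i)
    {
        arr[i] = arr[i - 1];
    }

    arr[index] = value;
    ++size;
    return true;
}
```

Inserting at index 0 shifts every existing element, which is where the O(n) comes from. Deletion works the same way in the opposite direction.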

Conclusion

We have taken a look at the basic concepts and usage of arrays. However, this is not all!
There are many array-related topics, such as 2D arrays and array pointers, that I am going to discuss in the next post.

Intro to Linked List data structure

Today, we will go over a popular data structure – linked list.
Implementing a linked list may not make much sense in languages such as Python and JavaScript, since they already provide a good built-in list.
However, since a linked list is quite useful in languages like C++, it is still worthwhile to go over.
In this post I will go over some important characteristics of linked lists along with their strengths and weaknesses.
Please note that all the explanations will be based on C++.

What is Linked List?

A linked list is a data structure in which nodes are connected like a chain via pointers.
It is one of the fundamental data structures in C++, and the C++ standard library (STL) provides it as std::list and std::forward_list.

I believe a picture is worth a thousand words, so let’s take a look at what a linked list looks like before any explanations.

As you see above there are three nodes in the linked list.
Each node is based on Node struct as below which contains a value and a pointer to the next Node.

struct Node
{
    // value that this Node contains. 
    // It could be any data type you want to contain
    int   val;
    
    // pointer to next node
    Node *next;
    
    // pointer to previous node.
    // this is optional
    // Node *prev;
};

The example code has only one pointer ‘next’ but you can also have another Node * to point to the previous Node.
If you have only one pointer (likely it’s a pointer to the next Node) it’s called a singly linked list.
If you have two pointers one pointing to the next and the other to the previous then it’s called a doubly linked list.

Linked List Operations – insert, search, delete

As you can imagine, there are three basic operations for a linked list.
You need to be able to search for a node in the list, insert a node, and delete a node.
Let’s go over each operation.

Search

How should the search work?
Given a value to search for, you start from the head of the list and check each node in the chain to see whether its value equals the search value.
Return a pointer to the node if you find it, or a null pointer otherwise.
Let’s take a look at a code example for search.

/**
 * @brief  search a node based on search value
 * 
 * @param  head       head of linked list
 * @param  searchVal  value to search in the list
 * 
 * @return pointer to found node. null otherwise
 */
Node *searchNode(Node *head, int searchVal)
{
    // end of the list. value not found
    if (!head)
    {
        return 0;
    }

    if (head->val == searchVal)
    {
        return head;
    }
    else
    {
        return searchNode(head->next, searchVal);
    }
}

I implemented the operation recursively so I don’t need to write a loop. (I find the recursive solution easier than a loop, but it’s your choice.)
For your reference, the insertion operations below use loop-based implementations.
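
If you prefer a loop for search as well, the recursive function above could be rewritten roughly like this. The searchNodeLoop name is made up for this sketch, and Node is repeated here only so the snippet stands on its own.

```cpp
// Same Node layout as in this post, repeated so the sketch is self-contained.
struct Node
{
    int   val;
    Node *next;
};

// Loop-based search: walk the chain from head until the value is found
// or the end of the list is reached.
Node *searchNodeLoop(Node *head, int searchVal)
{
    for (Node *iter = head; iter; iter = iter->next)
    {
        if (iter->val == searchVal)
        {
            return iter;
        }
    }

    // end of the list. value not found
    return nullptr;
}
```

Both versions visit the same nodes in the same order. The loop simply avoids one function call per node.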

Insertion

Inserting a node could be slightly trickier depending on how you want to maintain the list.
There are three possible cases for insertion.

1. Insert the node at the head of the list
This one is the easiest, as you just need to create a node, point it at the current head, and make it the new head.
Here is the code example for head insertion.

/**
 * @brief  node insertion at the beginning of the list
 * 
 * @param  head    double pointer to head since you need to update 'head' after insertion
 * @param  newVal  new value to insert
 */
void insertNode(Node **head, int newVal)
{
    Node *newNode = new Node;
    newNode->val = newVal;
    newNode->next = *head;

    // update head
    *head = newNode;
}

2. Insert the node at the tail of the list
This one is slightly harder than head insertion but still pretty simple.
What you need to do is to traverse to the end of the list and insert the node there.

/**
 * @brief  node insertion at the end of the list
 * 
 * @param  head    double pointer to head since you still might need to update 'head'
 *                 if the list is empty
 * @param  newVal  new value to insert
 */
void insertNode(Node **head, int newVal)
{
    Node *newNode = new Node;
    newNode->val = newVal;
    newNode->next = 0;

    Node *tailNode = *head;
    while (tailNode && tailNode->next)
    {
        tailNode = tailNode->next;
    }

    if (!tailNode)
    {
        *head = newNode;
    }
    else
    {
        tailNode->next = newNode;
    }
}

3. Insert the node based on the sorted order of the list.
Imagine you want to maintain the list in sorted order (decreasing or increasing).
Then you will need to find a proper location to insert.

For example, let’s say you have a list like below and you would like to insert value 4.
1-> 3-> 5

Then you need to traverse the list until you find the first node that is not less than 4, which is 5 (the last one), and insert the new node just before it.
However, there are a couple of other edge cases, such as when the insert position happens to be at the head or the tail of the list.
Although this one is not terribly difficult, it does require a little more thought than the others.

void insertNode(Node **head, int newVal)
{
    // create new Node
    Node *newNode = new Node;
    newNode->val = newVal;
    newNode->next = 0;
    
    // if empty list then just insert it here
    if (!*head)
    {
        *head = newNode;
        return;
    }
    // check if it needs to be inserted at head.
    else if (newNode->val < (*head)->val)
    {
        newNode->next = *head;
        *head = newNode;
        return;
    }
    
    Node *iter = *head;
    
    // loop until you find first Node that is not less than new value
    while (iter && iter->next)
    {
        if (iter->next->val >= newNode->val)
        {
            break;
        }
        
        iter = iter->next;
    }
    
    if (iter->next)
    {
        newNode->next = iter->next;
    }
    
    iter->next = newNode;
}

The tricky part of insertion is that you have to make sure you update the next pointers properly.
1. If the insertion happens at the head, make sure the head pointer is updated and the next pointer of the new node points to the old head
2. If the insertion happens at the tail, make sure the last node of the list points to the new node
3. If the insertion happens in the middle of the list, make sure the previous node’s next points to the new node and the new node points to the proper next one.

Deletion

Deleting a node is very similar to inserting one.
First, you need to find the node to delete, if there is one, and then update the next pointers properly.
Just like insertion, the node to delete could be at the head, at the tail, or in the middle of the list.
Let’s take a look at the code example.

/**
 * @brief  delete a node from the list if there is one
 *         for simplicity I don't delete all the duplicate values here
 * 
 * @param  head       double pointer to head in case you have to delete head node
 * @param  deleteVal  value to delete
 */
void deleteNode(Node **head, int deleteVal)
{
    Node *iter = *head;
    
    // nothing to delete
    if (!*head)
    {
        return;
    }
    // deleting head. need to update head
    else if ((*head)->val == deleteVal)
    {
        *head = (*head)->next;
        delete iter;
        return;
    }

    while (iter && iter->next)
    {
        if (iter->next->val == deleteVal)
        {
            break;
        }

        iter = iter->next;
    }

    if (iter->next)
    {
        Node *delNode = iter->next;
        iter->next = iter->next->next;
        delete delNode;
    }
}

Pros

  • Unlike an array, the size of a linked list is very flexible as long as memory permits
  • Insertion and deletion are much simpler than with an array, where you have to move many other elements after the operation
  • It is a good fit for implementing other data structures such as Stack and Queue, since you only need to maintain the head (or tail) of the list for insertion, search, and deletion
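
As an illustration of the last point, here is a minimal sketch of a stack built on a singly linked list. The IntStack class and its member names are hypothetical. Every operation touches only the head node.

```cpp
// Same Node layout as in this post, repeated so the sketch is self-contained.
struct Node
{
    int   val;
    Node *next;
};

// A minimal stack of ints. Push and pop both work on the head of the list.
class IntStack
{
    Node *head = nullptr;

public:
    IntStack() = default;

    // copying is disabled so two stacks never delete the same nodes
    IntStack(const IntStack &) = delete;
    IntStack &operator=(const IntStack &) = delete;

    // free any nodes that are still in the list
    ~IntStack()
    {
        while (head)
        {
            Node *old = head;
            head = head->next;
            delete old;
        }
    }

    bool empty() const
    {
        return head == nullptr;
    }

    // head insertion: the new node becomes the top of the stack
    void push(int value)
    {
        head = new Node{value, head};
    }

    // head deletion: stores the top value in 'out'.
    // Returns false when the stack is empty.
    bool pop(int &out)
    {
        if (!head)
        {
            return false;
        }

        Node *old = head;
        out = old->val;
        head = old->next;
        delete old;
        return true;
    }
};
```

Pushing 1, 2, 3 and then popping returns 3, 2, 1, and neither operation ever walks the list.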

Cons

  • It requires extra memory (next, prev pointers)
  • It does not allow random access like an array and therefore search could be slower than an array

Performance

Search – O(n)
Insertion – O(1) if you always insert at the head (for cases like Stack), O(n) otherwise
Deletion – O(1) if you always delete the head (for cases like Stack), O(n) otherwise

Conclusion

We have taken a look at some basics of the linked list.
A linked list is a great data structure for some special purposes thanks to its flexible insertion and deletion.
However, it has some weaknesses, so it takes some discretion to use it wisely.

Thank you for reading my post. Please let me know if you have any questions or suggestions, and good luck!