Exceptional Code: 2011

Wednesday, August 17, 2011

A Binary Search Trees Tutorial

This is a short tutorial on Binary Search Trees covering the basic operations that are often done on such trees.

A Binary Search Tree (BST) is a tree data structure that has the following property:

For any node x in the tree, the left subtree rooted at x only contains nodes with values less than or equal to the value stored at x; and the right subtree rooted at x only contains nodes with values greater than the value at x. Also, the left subtree and the right subtree of x must be binary trees themselves.

                              Figure: A binary search tree of size 9 and depth 3, with root 8
                                         and leaves 1, 4, 7 and 13
                                        Image source: Wikipedia

BSTs have sub-linear (logarithmic) average case complexity for element insertion and searching. Once a BST is built, sorting it can be done in linear time. All we need is traverse the tree in inorder.

There are a number of operations that can be done on a BST. We will describe those here in some details with Java code examples.

First let's see how we can represent a tree node in Java:

class TreeNode

{

int data;

TreeNode parent;

TreeNode left;

TreeNode right;

public TreeNode(int data)

{

this.data = data;

parent = left = right = null;

}

public void setData(int data)

{

this.data = data;

}

public void setLeft(TreeNode node)

{

left = node;

}

public void setRight(TreeNode node)

{

right = node;

}

public void setParent(TreeNode node)

{

parent = node;

}

public int getData()

{

return data;

}

public TreeNode getLeft()

{

return left;

}

public TreeNode getRight()

{

return right;

}

public TreeNode getParent()

{

return parent;

}

Inserting a node into a BST:

Inserting a node into a BST is pretty straightforward. We start checking nodes starting from the root node. At each node if the value is greater than the value in the node to be inserted then we move to the left child, otherwise we move to the right child. We repeat this until there is no more node to be traversed. The last node will be the parent node of the node to be inserted. Now if the node to be inserted has a smaller value than the current node then we set it as the left child of the current node otherwise we set it as the right child.

Java code:

public static void insert(TreeNode node)

{

TreeNode x, y;

// root is a global variable and is the root of the tree

x = y = root; // Assume root is initialized to null

while (x != null)

{

if (x.getData() > node.getData())

{

y = x;

x = x.getLeft();

}

else

{

y = x;

x = x.getRight();

}

// y will be the parent of node

if (y == null)

{

root = node;

return;

}

if (y.getData() > node.getData())

y.setLeft(node);

else

y.setRight(node);

node.setParent(y);

}

Deleting a node from a BST:

When deleting a node from a BST, there can be 3 different scenarios:
i) The node does not have any children: This is kind of a trivial case. When there is no children all we need is removing the node from its parent.
ii) The node has only one child: This is also a relatively simple case. Since the node has only one child, we will simply replace the node with this lone child without violating the BST sorted order property.
iii) The node has two children: Now, this is a bit tricky :) Since there are two children we have to be careful not to break the sorted order of the BST. To keep the sorted order intact, we need to replace the node to be deleted with its successor so that the order is maintained in the absence of the node. Once a successor is found, we need to replace the node to be deleted with it. It turns out the successor can have at most one child. So if the successor has a child we will have to replace the successor with that child after the successor already replaced the node to be deleted.

Java code:

public static void delete(TreeNode node)
{
    // Case 1: node does not have a child, just delete it
    if (node.getLeft() == null && node.getRight() == null)
    {
        if (node.getParent() != null && node.getParent().getLeft() == node)
            node.setLeft(null);
        else if (node.getParent() != null && node.getParent().getRight() == node)
            node.setRight(null);
    }
    // Case 2: node has only one child, splice the child with its parent
    else if (node.getLeft() == null || node.getRight() == null)
    {
        if (node.getParent() != null)
        {
            TreeNode x = node.getLeft() == null ? node.getRight() : node.getLeft();
            if (node.getParent().getLeft() == node)
                node.getParent().setLeft(x);
            else
                node.getParent().setRight(x);
        }
    }
    // Case 3: node has both children, set the successor of the node to its parent
    else
    {
        TreeNode x = findSuccessor(node); // x will have at most one child
        // Instead of deleting we can just copy the successor's data over to the node to be deleted
        node.setData(x.getData());
        // Now delete the successor and set its child (if any) to its parent
        TreeNode nodeChild = x.getLeft() == null ? x.getRight() : x.getLeft();
        if (x.getLeft() != null)
        {
            if (x.getParent().getLeft() == x)
                x.getParent().setLeft(nodeChild);
            else
                x.getParent().setRight(nodeChild);
        }
        else

        {
            if (x.getParent().getLeft() == x)
                x.getParent().setLeft(nodeChild);
            else
                x.getParent().setRight(nodeChild);
        }
    }
}

Finding the minimum value in the tree:

In a BST, all nodes to the left of a node contains smaller (or equal) values than the value stored in the node itself. This gives us a clue about how to tackle the problem of finding the minimum value stored in the BST. If you think about it, the node with the minimum value in a BST is actually the leftmost node in the tree. If it wasn't then there would be node(s) in the tree that are on the right of some some node but contain smaller values than the node itself, thus violating the BST contract.

Here is the Java code for finding the minimum value node in a BST:

public TreeNode findMinimum(TreeNode root)

{

if (root == null)

return null;

if (root.getLeft() != null)

return findMinimum(root.getLeft());

return root;

}

Finding the maximum value in the tree:

Similar to the logic as in finding the minimum value, the maximum value node will be the right most node in the tree. The Java code for finding the maximum value node in a BST:

public static TreeNode findMaximum(TreeNode root)
{
    if (root == null)
        return null;

    if (root.getRight() != null)
        return findMaximum(root.getRight());

    return root;
}

One useful property of the BST is that if it is traversed in inorder we get a sorted list of values stored in the tree nodes. There are two operations that are often done in relation to maintaining this sorted order of a BST: finding the successor node and predecessor node of a given node.

Finding the successor node of a given node:

The successor node is the node with the next bigger number in the tree. Since all numbers to the left are smaller than the current node the successor node has to be located in the right subtree. Now, since it is the next bigger number of the current node, it has to be the smallest of all numbers in the right subtree. So, essentially we are looking for the minimum number in the right subtree of the current node. In the case that there is no right child of the node, we need to look up the tree and figure out the successor. The successor is the first ancestor whose left subtree has this node as the largest number. In other words: the first ancestor of this node whose left child is also an ancestor of this node. The intuition is: as we traverse left up the tree we traverse smaller values, the first node on the right is the next larger number.

Here is the Java code:

public static TreeNode findSuccessor(TreeNode node)

{

if (node == null)

return null;

if (node.getRight() != null)

return findMinimum(node.getRight());

TreeNode y = node.getParent();

TreeNode x = node;

while (y != null && x == y.getRight())

{

x = y;

y = y.getParent();

}

return y;

}

Finding the predecessor node of a given node:

The predecessor of a given node is the node containing the next smaller value. Finding the predecessor follows symmetric rules that we used for finding the successor.

Java code for finding the predecessor:

public static TreeNode findPredecessor(TreeNode node)

{

if (node == null)

return null;

if (node.getLeft() != null)

return findMaximum(node.getLeft());

TreeNode y = node.getParent();

TreeNode x = node;

while (y != null && x == y.getLeft())

{

x = y;

y = y.getParent();

}

return y;

}

Determining whether a tree is a BST or not:

Sometimes we already have a binary tree that we need to determine whether it is a BST or not. This is an interesting problem and can be really solved with a simple recursive solution.

The BST property - that every node on the right subtree has to be larger than the current node and every node on the left subtree has to be smaller (or equal) than the current node - is the key to figuring out whether a tree is a BST or not. On a first thought it might look like we can simply traverse the tree and at every node check whether the node contains a value larger than the value at the left child and smaller than the value on the right child, and if this condition holds for all the nodes in the tree then we have a BST. This is the so called Greedy approach, making a decision based on local properties. But this approach clearly won't work for the following tree:

      20
     /    \
10    30
/   \
5 40

In the tree above, at every node the condition that the node contains a value larger than its left child and smaller than its right child hold, still its not a BST: the value 5 is on the right subtree of the node containing 20, a violation of the BST property!

So how do we solve this? It turns out that instead of making a decision based solely on a node and its children's values, we also need information flowing down from the parent as well. In the case of the tree above, if we could remember about the node containing the value 20 we could see that the node with value 5 is violating the BST property contract.

So the condition we need to check at each node is that: a) if the node is the left child of its parent, then it must be smaller (or equal) than the parent and it must pass down the value from its parent to its right subtree to make sure none of the nodes in that subtree is greater the parent, and similarly b) if the node is the right child of its parent, then it must be larger than the parent and it must pass down the value from its parent to its left subtree to make sure none of the nodes in that subtree is greater the parent.

A simple but elegant recursive solution in Java can explain this further:

public static boolean isBST(TreeNode node, int leftData, int rightData)
{
    if (node == null)
        return true;

    if (node.getData() > leftData || node.getData() <= rightData)
        return false;

    return (isBST(node.left, node.getData(), rightData) && isBST(node.right, leftData, node.getData()));
}

The initial call to this function can be something like this:

if (isBST(root, Integer.MAX_VALUE, Integer.MIN_VALUE))
System.out.println("This is a BST.");
else
System.out.println("This is NOT a BST!");

Essentially we keep creating a valid range (starting from [ MIN_VALUE, MAX_VALUE]) and keep shrinking it down foe each node as we go down recursively.

So, here is my short and sweet primer on Binary Search Trees. Hope you found it useful. Let me know if there is a bug or a possible improvement in runtime or space.

Sunday, July 31, 2011

Coding up a Trie (Prefix Tree)

The tree data structure is one of the most important data storage mechanism in programming. It's a natural way to represent essential utilities on a computer like the directory structure in a file system.

Many other objects can be stored in a tree data structure resulting in space and/or time efficiency. For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory we can use a tree structure to efficiently store the words instead of using a plain Array or Vector type that simply stores each word individually in memory. The space needed to store the words in an Array or Vector is simply the number of words times the average length of the words we need to store.

A Trie, also called a Prefix Tree, is a tree structure that stores words with a common prefix under the same sequence of edges in the tree eliminating the need for storing the same prefix each time for each word. From Wikipedia:

A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.

Please read the Wikipedia entry to learn the advantages of a Trie over other data structures in terms of searching and sorting. We will focus on building a Trie in this post.

Today we will build a simple Trie structure that does only two operations - Insert and Search/Find. We will also add a print function for convenience of seeing the tree on screen. For simplicity, let's assume all the words/strings we want to store consist of lowercase letters only.

First, we need a way to represent each Node in the tree. The following simple class represents a Trie node in our case:

class TrieNode

{

    char letter;

    TrieNode[] links;

    boolean fullWord;

    TrieNode(char letter, boolean fullWord)

{

        this.letter = letter;

        links = new TrieNode[26];

        this.fullWord = fullWord;

}

}

The "letter" member field stores the leading letter in the word starting from the root node the the current node. The "links" field contain references to child nodes that contain the next letter of the child words that share a common prefix (consisting of all the letters beginning from the root node the current node). The "fullWord" field indicates whether the current letter marks the end of a valid word in the dictionary and also at the same time is a valid prefix of one or more other words. For example, the letter "n" in the words "pan" and "pant" both is the ending letter of "pan" and marks a prefix of the word "pant". For simplicity we are statically allocating 26 references to child nodes as that is the maximum number of letters in the English alphabet (lowercase only). It can be optimized by only allocating nodes that have real edge in the tree. But let's not complicate things for optimization now.

Image courtesy: a blogspot blog

In the image above the Trie contains the words an, ant, all, allot, alloy, aloe, are, ate, be. The "grounded" sign indicates a full word ending at this node.

Algorithms for inserting a a word into a Trie:

1. Set current node to root node. The root node does not contain any letter (initialized to the null character for convenience).
2. Set the current letter to the first letter in the word.
3. If the current node already has an existing reference to the current letter (through one of the elements in the "links" field) then set current node to that referenced node; else create a new node, set the letter to current letter, and set current node to this new node.
4. Repeat step 3 until all letters in the current word has been processed.

Algorithm for searching for a word in the Trie:

1. Set current node to root node. Set the current letter to the first letter in the word.
2. If the current node is null then the word does not exist in the Trie.
3. If the current node reference to a valid node containing the current letter then set current node to that referenced node and set current letter to the letter in the new current node.
4. Repeat steps 2 and 3 until all letters in the word has been processed.
5. Now there are two possibilities that may indicate the letter is not there in the tree: a) the current letter is the last letter and there is no valid node containing this letter, and b) there is a valid node containing the last letter but the node does not indicate it contains a full word (marked with the "fullWord" boolean field.
6. If step the conditions in step 5 are not met, then we have a match for the word in the Trie.

At this point you should be able to implement a Trie yourself with the above functionalities. If you still need help then look at the code below (but you better write the code yourself :) ).

Java Code:

public class PrefixTree
{
    static TrieNode createTree()
    {
        return(new TrieNode('\0'));
    }

    static void insertWord(TrieNode root, String word)
    {
        int offset = 97;
        int l = word.length();
        char[] letters = word.toCharArray();
        TrieNode curNode = root;

        for (int i = 0; i < l; i++)
        {
            if (curNode.links[letters[i]-offset] == null)
                curNode.links[letters[i]-offset] = new TrieNode(letters[i]);
            curNode = curNode.links[letters[i]-offset];
        }
        curNode.fullWord = true;
    }

    static boolean find(TrieNode root, String word)
    {
        char[] letters = word.toCharArray();
        int l = letters.length;
        int offset = 97;
        TrieNode curNode = root;

        int i;
        for (i = 0; i < l; i++)
        {
            if (curNode == null)
                return false;
            curNode = curNode.links[letters[i]-offset];
        }

        if (i == l && curNode == null)
            return false;

        if (curNode != null && !curNode.fullWord)
            return false;

        return true;
    }

    static void printTree(TrieNode root, int level, char[] branch)
    {
        if (root == null)
            return;

        for (int i = 0; i < root.links.length; i++)
        {
            branch[level] = root.letter;
            printTree(root.links[i], level+1, branch);
        }

        if (root.fullWord)
        {
            for (int j = 1; j <= level; j++)
                System.out.print(branch[j]);
            System.out.println();
        }
    }

    public static void main(String[] args)
    {
        TrieNode tree = createTree();

        String[] words = {"an", "ant", "all", "allot", "alloy", "aloe", "are", "ate", "be".};
        for (int i = 0; i < words.length; i++)
            insertWord(tree, words[i]);

        char[] branch = new char[50];
        printTree(tree, 0, branch);

        String searchWord = "all";
        if (find(tree, searchWord))
        {
            System.out.println("The word was found");
        }
        else
        {
            System.out.println("The word was NOT found");
        }
    }
}

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;

    TrieNode(char letter)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = false;
    }
}

Saturday, July 23, 2011

External Sorting for sorting large files in disk

Sorting is a fundamental programming task. Given the abundance of built-in libraries that perform tasks like sorting and binary search, we often become forgetful of exactly how these tasks are accomplished.

When the data is so large that it cannot be processed in memory at one time we need to resort to the file system to store part or all the data during the sorting process. We then need to perform another layer of disk operations on top of regular sorting algorithms to manage the data as they get sorted.

External Sorting is precisely the technique we described in the previous paragraph.

Let us describe in some detail how external sorting can be done in Java:

First the algorithm:

Say, we have one file (it can be more than one file, but having just one file simplifies the process for illustration purpose) in disk containing N numbers. And suppose the memory in our computer can hold M numbers at a time.

1. Start reading the input file from the beginning.
2. Read M (or less if number of entries remaining in the file is less than M) numbers from the file and store it into a temp buffer.
3. Sort (using any good sorting method - Quicksort, for example) the numbers in the buffer stored in step 2.
4. Create a temp file in disk and write out the sorted numbers from step 3 to this temp file. Save the name of the temp file.
5. Repeat step 2 to 4 until all numbers from the input file has been read, sorted, and written out to temp files.

At this point, we have chunks of numbers of size M sorted and stored in temp files in disk. We need to merge all these sorted files into one single sorted file. We will apply the merging algorithm from Merge Sort to join the numbers from these sorted files together.

6. Open all the temp files (and set the read pointer to the beginning of the files).
7. Find the minimum number from the set of numbers currently pointed to by the file read pointer.
8. Write the number to disk. (To increase efficiency you could write the number to a buffer first and then flush the buffer out to disk when the buffer is full. But modern I/O libraries should be doing this anyway for you).
9. Read another number from the file that contained the minimum number at step 7.
10. Repeat step 7 to 9 until all numbers from all the temp files have been processed, merged, and written out to disk.

The new file in disk now contains a sorted list of the numbers supplied in the initial input file.

Java code:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Random;

public class ExternalSort
{
static int N = 2000000; // size of the file in disk
static int M = 100000; // max items the memory buffer can hold

public static void externalSort(String fileName)
{
String tfile = "temp-file-";
int[] buffer = new int[M < N ? M : N];

try
{
FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr);
int slices = (int) Math.ceil((double) N/M);

int i, j;
i = j = 0;
// Iterate through the elements in the file
for (i = 0; i < slices; i++)
{
// Read M-element chunk at a time from the file
for (j = 0; j < (M < N ? M : N); j++)
{
String t = br.readLine();
if (t != null)
buffer[j] = Integer.parseInt(t);
else
break;
}
// Sort M elements
Arrays.sort(buffer);

// Write the sorted numbers to temp file
FileWriter fw = new FileWriter(tfile + Integer.toString(i) + ".txt");
PrintWriter pw = new PrintWriter(fw);
for (int k = 0; k < j; k++)
pw.println(buffer[k]);

pw.close();
fw.close();
}

br.close();
fr.close();

// Now open each file and merge them, then write back to disk
int[] topNums = new int[slices];
BufferedReader[] brs = new BufferedReader[slices];

for (i = 0; i < slices; i++)
{
brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
String t = brs[i].readLine();
if (t != null)
topNums[i] = Integer.parseInt(t);
else
topNums[i] = Integer.MAX_VALUE;
}

FileWriter fw = new FileWriter("E:\\test\\external-sorted.txt");
PrintWriter pw = new PrintWriter(fw);

for (i = 0; i < N; i++)
{
int min = topNums[0];
int minFile = 0;

for (j = 0; j < slices; j++)
{
if (min > topNums[j])
{
min = topNums[j];
minFile = j;
}
}

pw.println(min);
String t = brs[minFile].readLine();
if (t != null)
topNums[minFile] = Integer.parseInt(t);
else
topNums[minFile] = Integer.MAX_VALUE;

}
for (i = 0; i < slices; i++)
brs[i].close();

pw.close();
fw.close();
}
catch (FileNotFoundException e)
{
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}

static String generateInput(int n)
{
String fileName = "external-sort.txt";
Random rand = new Random();

try
{
FileWriter fw = new FileWriter(fileName);
PrintWriter pw = new PrintWriter(fw);

for (int i = 0; i < n; i++)
pw.println(rand.nextInt(101));

pw.close();
}

catch (IOException e)
{
e.printStackTrace();
}

return fileName;
}

public static void main(String[] args)

{

String fileName = generateInput(N);

externalSort(fileName);

}

Happy external sorting!