Monday, February 20, 2012

Life on a Tree - Creating, Copying, and Pruning Tree Data Structures

The tree is one of the most important data structures in computer programming.



A very common of example of the tree data structure being used in everyday life is a directory structure widely used both in Windows and Unix systems.

Today I will write a brief tutorial on trees, and show a number of common operations done on the structure of a tree. The code examples will be in Java.

First, let's create a tree. A tree is essentially a set of node objects that hold some data and each node has zero or more "child" nodes. The nodes without any child nodes are called leaves and the node that itself is not a child of any other node is called the root. You can read up on trees on the Web, we will focus on the implementation of trees here.

We can define a node in the tree as follows:

class Node
{
    private int data; /// this can be more complex objects
  
    private Node left;
    private Node right;
    private Node parent;
  
    public Node(int value)
    {
        data = value; // we could also use setter/getter for this value
        parent = null; // this is optional, but having this is often useful
        left = null;
        right = null;
    }
   
    public Node(Node node) // copy constructor
    {
        data = node.getData();
        left = null;
        right = null;
        parent = null;
    }

    public void setData(int data)
    {
        this.data = data;
    }
   
    public int getData()
    {
        return data;
    }
   
    public void setLeftNode(Node node)
    {
       left = node;
    }

    public Node getLeftNode()
    {
        return left;
    }

    public void setRightNode(Node node)
    {
        right = node;
    }

    public Node getRightNode()
    {
       return right;
    }

    public void setParentNode(Node node)
    {
         parent = node;
    }

    public Node getParentNode()
    {
        return parent;
    }
}

The class above is a simple example of a node in a tree. It contains a data variable, a reference to left and right sub-trees each, and a reference to the parent node as well. There are also a set of setters and getters to manipulate the data members. Please note that - in practice you might need much more complex nodes loaded with big data objects and more children than just a left and right nodes.

To create a tree using the class above you could do the following:

Node root = new Node(100); // creating a root node with value 100

root.setLeftNode(new Node(200));
root.setRightNode(new Node(50));

root.getLeftNode().setLeftNode(new Node(10));
root.getLeftNode().setRightNode(new Node(1000));

A tree is represented by the root node of the tree. So, the "root" object above represent the entire tree.

Copying a tree:

Say you have a tree that you want to copy to another tree. Now, copying can be done in two ways - i) shallow copy, and ii) deep copy.

We will show deep copy here. Deep copying means the new tree is an entirely new copy of the old tree - both in structure and in data. Everything is allocated again in memory.

The following function does a deep copy of a tree to another tree:

public void deepCopy(Node root, Node copy)
{
        Node left = root.getLeftNode();
        Node right = root.getRightNode();
       
        if (left != null)
        {
            copy.setLeftNode(new Node(left));
            deepCopy(left, copy.getLeftNode());
        }
       
        if (right != null)
        {
            copy.setRightNode(new Node(right));
            deepCopy(right, copy.getRightNode());
        }
}

This function can be called like this:

Node copy = new Node(root); // copy the root
deepCopy(root, copy); // copy the rest of the tree

Now, the tree referenced by "copy" holds an entire deep copy of the tree "root".

Pruning a tree:

Pruning means deleting one or more subtrees of a tree. We will implement a filter-based pruning here. That is, whenever a node will match some criteria described in a filter we will delete that node along with all its children from its parent.

First, we will need a way to represent a filter. We will do that by way of an interface that all filter classes will have to implement.

interface Filter
{
    public boolean check(Node node);
}

Now, we can define a Filter class as follows:

class MyFilter implements Filter
{
    public boolean check(Node node)
    {
        if (node.getData() == 200)
            return false;
        return true;
    }
}

This class indicates that we would like to delete all sub-trees rooted at a node containing the data value 200.

The pruning class will be a lot like the deep copying class. This is because the pruned tree is actually a deep copy of the original tree minus the pruned nodes!

Here is what the pruning method looks like:

public void pruneTree(Node root, Node copy, Filter filter)
{
        Node left = root.getLeftNode();
        if (left != null && filter.check(left))
        {
            copy.setLeftNode(new Node(left));
            pruneTree(left, copy.getLeftNode(), filter);
        }
       
        Node right = root.getRightNode();
        if (right != null && filter.check(right))
        {
            copy.setRightNode(new Node(right));
            pruneTree(right, copy.getRightNode(), filter);
        }
}

This method can be called as follows:

Node copy = new Node(root);
Filter filter = new MyFilter();
pruneTree(root, copy, filter);

At this point, the node object "copy" will contain a pruned version of the original tree rooted at "root".

Finally, here is a small class that can print a tree in pre-order traversal. I used this class in testing out the code in this post:

public void printTree(Node root)
{
        if (root != null)
        {
            System.out.println(root.getData()); 
            printTree(root.getLeftNode());
            printTree(root.getRightNode());
        }
}

I hope you will now be able to create simple trees quickly. It won't be hard to modify these classes and methods to build more complex tree classes to handle more than two children, copying a subtree instead of the whole tree, prune more selectively, etc.

Happy tree coding:)

Friday, February 10, 2012

Solving the Boggle Game - Recursion, Prefix Tree, and Dynamic Programming

I spent this past weekend designing the game of Boggle. Although it looks like a simple game at a high level, implementing it in a programming language was a great experience. I had to use recursion, sorting, searching, prefix trees (also knows as trie's), and dynamic programming as I was improving the run time of the program. So here is a summary of the work I did in trying to write a very optimal solution to the Boggle game:



*Image courtesy: http://stackoverflow.com/questions/746082/how-to-find-list-of-possible-words-from-a-letter-matrix-boggle-solver

The game of Boggle is played on a N x N board (usually made of cubes that has letters engraved on it). Given a dictionary, you will have to construct words on the board following these rules:

i) The letters in the word must be "adjacent" to each other
ii) Two letters on the board are "adjacent" if they are located on the board next to each other horizontally, vertically, or diagonally.
iii) A word must be 3 or more letters long to be valid

The more words you can construct, the more points you get. Longer words get more points.

In computerse: given an N x N board, and a dictionary containing hundreds of thousands of words, construct those words from the board that are found in the dictionary.

There are a number of approaches that one can take to solve this problem. I designed three different algorithms to solve it. Here are they:

First solution - Recursion + Binary Search:

In this approach, we recurse (using backtracking) through the board and generate all possible words. For each word that are three or more letters long we check to see if it's in the dictionary. If it is, we have a match!

Here is the algorithm steps:

1. Read in the dictionary in to a container in memory.
2. Sort the dictionary.
3. Using backtracking, search the board for words.
4. If a word is found and it contains 3 or more letters, do a binary search on the dictionary to see if the word is there.
5. If searching was successful in the previous step, print the letter out.
6. Continue step 3-5 as long as there are more words to construct on the board.

Complexity of this approach:

In this solution, we do a good job on the dictionary search. Using binary search, we are quickly finding out whether a word is in dictionary or not. But the real bottleneck is in searching the board for words. For an N x N board the search space is O((N*N)!). I am not exactly sure about this number, but you can find some discussions of it here: http://www.danvk.org/wp/2007-08-02/how-many-boggle-boards-are-there/.
(N*N)! is a HUGE number even for N = 5. So this approach is is impractical and out of question for any useful implementation.


Second Solution - Pruned Recursion + Prefix Tree (also known as a Trie):

From the previous approach, our major concern was the enormous search space on the board. Fortunately, using a a prefix tree or trie data structure we could significantly cut down on this search space. The reasoning behind this improvement is that, if a word "abc" does not occur as a prefix to any word in the dictionary there is no reason to keep searching for words after we encounter "abc" on the board. This actually cut down the run time a lot.

Here is the algorithm steps:

1. Read a word from the dictionary file.
2. Insert it into a prefix tree data structure in memory.
3. Repeat steps 1-2 until all words in the dictionary have been inserted into the prefix tree.
4. Using backtracking, search the board for words.
5. If a word is found and it contains 3 or more letters, search the prefix tree for the word.
6. If searching was *not* successful in the previous step, return from this branch of the backtracking stage. (There is no point to continue searching in this branch, nothing in the dictionary as the prefix tree says).
7. If searching was successful in step 5, continue searching by constructing more words along this branch of backtracking and stop when the leaf node has been reached in the prefix tree. (at that point there is nothing more to search).
8. Repeat steps 4-7 as long as there are more words to search in the backtracking.

Complexity of this approach:

This approach significantly improves on the first one. Building a prefix tree our of the dictionary words is O(W * L), where W is the number of words in the dictionary and L is the maximum length of a word in the dictionary.
Searching the board will be of the same order as the dictionary since we are not really searching words that are not in the dictionary. But in reality it will be more work than that as we still need to backtrack along the board to construct new words until we can consult the dictionary prefix tree to know whether it exists or not.

Third and Final Solution - No search space + Dynamic Programming:

The 2nd approach mentioned above was good enough until the board size was 5. Unfortunately with a board size of 6, that too was taking forever to complete!



It got me into thinking - "Dang, this search space is still too big to search! Can I just get rid of it entirely?" And then this idea popped into my mind: instead of random constructing word after word in this infinite ocean of words why don't I take a word from the dictionary and somehow magically check whether that's available on the board or not?

It turns out, we can use a nifty dynamic programming technique to quickly check whether a word (from the dictionary in this case) can be constructed from the board or not!

Here is core point of the dynamic programming idea:


For a word of length k to be found (end location) at the [i, j]-th location of the board, the k-1'th letter of that word must be located in one of the adjacent cells of [i, j].



The base case is k = 1.

A letter of length 1 will be found (end location) in the [i, j]-th cell of the board of the only letter in the word matches the letter in the [i, j]-th location of the board.

Once our dynamic programming table is populated with the base case, we can build on top of that for any word of length k, k > 1.

Here is a sample code for this:

for (k = 2; k < MAX_WORD_LENGTH; ++k)
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            if (board[i][j] == word[k])
            {
                 for all the "adjacent" cells of board[i, j]
                     if we table[k-1][adjacent_i][adjacent_j] is true
                         then table[k][i][j] = true;
             }


*I have the C++ code, if you are interested to take a look leave a comment with your email address.

Run Time Complexity:

The run time complexity of this approach is obvious is pretty obvious. It's O (W * N * N * MAX_WORD_LENGTH). Here N is dimension of the board which is usually between 4 to 6. So essentially the algorithm runs in the order of the size of the dictionary!

I solved a 6 X 6 boggle game with a dictionary with 600,000 words in 0.002 seconds! With I/O it as about 0.57 seconds. But with the trie and binary search approaches, it took much longer!

So, here is the summary of my a weekend's worth of adventure in to the world of algorithms and data structures. Feel free to ask me any questions, or any bug in these algorithms:)

Credit where it is due: My friend Satej who is a ninja problem solver and a PhD student at the UCF explained the dynamic programming solution to me first before I refined and finalized it. Thanks Satej!


Update (5/10/2012):

So many people have asked for the source code that I feel I better add the C++ code here:) The final, polished code for this is not in my home computer here, so if there is a bug please let me know!


#include <cstdio>
#include <iostream>

using namespace std;

const int N = 6; // max length of a word in the board

char in[N * N + 1]; // max length of a word
char board[N+1][N+2]; // keep room for a newline and null char at the end
char prev[N * N + 1];
bool dp[N * N + 1][N][N];

// direction X-Y delta pairs for adjacent cells
int dx[] = {0, 1, 1, 1, 0, -1, -1, -1};
int dy[] = {1, 1, 0, -1, -1, -1, 0, 1};
bool visited[N][N];

bool checkBoard(char* word, int curIndex, int r, int c, int wordLen)
{
    if (curIndex == wordLen - 1)
    {
        //cout << "Returned TRUE!!" << endl;
        return true;
    }
   
    int ret = false;
       
    for (int i = 0; i < 8; ++i)
    {
        int newR = r + dx[i];
        int newC = c + dy[i];
       
        if (newR >= 0 && newR < N && newC >= 0 && newC < N && !visited[newR][newC] && word[curIndex+1] == board[newR][newC])
        {
            ++curIndex;
            visited[newR][newC] = true;
           
            ret = checkBoard(word, curIndex, newR, newC, wordLen);
            if (ret)
                break;
               
            --curIndex;
            visited[newR][newC] = false;
        }
    }
   
    return ret;           
}

int main(int argc, char* argv[])
{
   
    int i, j, k, l;
   
    FILE* fDict = fopen("dict.txt","r");
    FILE* fBoard = fopen("tmp2.txt","r");
   
    for(i = 0; i < N; ++i)
        fgets(board[i], N+2, fBoard);
   
    strcpy(prev,"");
    int pLen = 0;
   
    while(fgets(in, N*N + 1, fDict))
    {
        int len = strlen(in);
        if (in[len-1] == '\n')
        {
            in[len-1] = '\0'; // remove the trailing newline
            --len;
        }
       
        if (len < 3)
            continue; // we only want words longer than 3 or more letter
           
        for(i = 0; i < len && i < pLen; ++i)
        {
            if(prev[i] != in[i])
                break;
        }
       
        int firstMismatch = i; // little optimization: building on previous word (will benefit if the word list is sorted)
       
        if(firstMismatch==0)
        {
            for(i = 0; i < 6; ++i)
            {
                for(j = 0; j < 6; ++j)
                {
                    if(board[i][j] == in[0])
                        dp[0][i][j] = true;
                    else
                        dp[0][i][j] = false;
                }
            }
            firstMismatch = 1;
        }
       
        for(k = firstMismatch; k < len; ++k)
        {
            for(i=0;i<6;++i)
            {
                for(j=0;j<6;++j)
                {   
                    dp[k][i][j] = false;
                           
                    if(board[i][j] != in[k])
                        continue;
                       
                    for(l= 0; l < 8 && !dp[k][i][j]; ++l)
                    {
                        int ti = i + dx[l];
                        int tj = j + dy[l];
                       
                        if(ti < 0 || ti >= 6 || tj < 0 || tj >= 6)
                            continue;
                       
                        if (dp[k-1][ti][tj])
                            dp[k][i][j] = true;
                    }
                }
            }  
        }
       
        // check if the word is tagged as found in the dp table
        bool flag = false;
        for(i = 0; i < 6 && !flag; ++i)
        {
            for(j = 0; j < 6 && !flag; ++j)
            {
                if(dp[len-1][i][j])
                    flag =true;
            }
        }
       
        // dp table says its there, but make sure its in the board and it does not repeat a location in the board
        if(flag)
        {
            //cout << "Checking word: " << in << endl;
            bool verified = false;
           
            for (i = 0; i < N && !verified; ++i)
            {
                for (j = 0; j < N && !verified; ++j)
                {
                    if (in[0] != board[i][j])
                        continue;
                       
                    memset(visited, false, sizeof(visited));
                    visited[i][j] = true;
                   
                    if (checkBoard(in, 0, i, j, len))
                    {
                        cout << in << endl;
                        break;
                    }
                }
            }
        }
       
        strcpy(prev,in);
        pLen=len;
           
    }
   
    return 0;
}


Enjoy!

Friday, January 20, 2012

The Power of Perl Regular Expressions

I am not a big scripting language person. C++ and Java (and occasional C#) have been my stronghold so far. But recently I have been strolling around in the Perl world at work. And to my amazement I found Perl so rich a language that I fell in love with it.

Here is an example of the power and succinctness of Perl. This one involves using Perl's powerful regular expressions and file/directory handling features.

For a project I was doing I needed to modify contents of a large number of files in a directory. The actual modification needed was really simple. For all occurrences of a term, say XYZ, replace it with .XYZ (insert a dot before the term). The term XYZ could or could not be preceded by a "-" (minus sign). Other than that the term is guaranteed to be separated by white space before and after it. Also, the term can occur at the beginning or at the end of a line.

Now, I was new to Perl, and regular expressions for that matter. I struggled with the problem for one and half day. There were always one or two cases that I was not covering, or cases I was not covering that I was not supposed to. Finally I gave up and posted this as a question on Stackoverflow:

Overlapping text substitution with Perl regular expression.

Within 10 minutes I had a bunch of answers from some really Perl expert guys. I am so thankful to those people who answered it and saved me from more days of struggling with the problem!

Here is what I ended with for the whole problem:


$^I = ".copy";
my @ARGV = <*.txt>;

while (<>)
{
    s/((?:^|\s)-?)(XYZ)(?=\s|$)/$1.$2/g;
    print;
}


That's it! How cool is that?

If it was in C++ or Java, I would have to deal with opening file streams, buffer streams in Java, open a file one at a time, read a line one by one, store it on a list or something in memory, and then write the whole thing back to file system, so on and so forth. And I would easily end up writing close to a hundred lines of code for that!

But in Perl, it's just those few lines above. The diamond operator (<>) nicely takes care of the directory and file browsing part so succinctly. And the one line regular expression finds all references of the pattern in all the files (denoted by the wildcard *.txt) and replaces the matching patterns with a dot inserted in the beginning.

I admit, Perl's regex and file/directory handling packages are working in the background, but still the expressive power of these tools are so elegant!

I hope to keep exploring Perl more in the coming days. Specially the powerful regular expressions!

Wednesday, August 17, 2011

A Binary Search Trees Tutorial

This is a short tutorial on Binary Search Trees covering the basic operations that are often done on such trees.

A Binary Search Tree (BST) is a tree data structure that has the following property:

For any node x in the tree, the left subtree rooted at x only contains nodes with values less than or equal to the value stored at x; and the right subtree rooted at x only contains nodes with values greater than the value at x. Also, the left subtree and the right subtree of x must be binary trees themselves.


                              Figure: A binary search tree of size 9 and depth 3, with root 8
                                         and leaves 1, 4, 7 and 13
                                        Image source: Wikipedia


BSTs have sub-linear (logarithmic) average case complexity for element insertion and searching. Once a BST is built, sorting it can be done in linear time. All we need is traverse the tree in inorder.

There are a number of operations that can be done on a BST. We will describe those here in some details with Java code examples.

First let's see how we can represent a tree node in Java:

class TreeNode
{
    int data;
    TreeNode parent;
    TreeNode left;
    TreeNode right;
   
    public TreeNode(int data)
    {
        this.data = data;
        parent = left = right = null;
    }
   
    public void setData(int data)
    {
        this.data = data;
    }
   
    public void setLeft(TreeNode node)
    {
        left = node;
    }
   
    public void setRight(TreeNode node)
    {
        right = node;
    }
   
    public void setParent(TreeNode node)
    {
        parent = node;
    }
   
    public int getData()
    {
        return data;
    }
   
    public TreeNode getLeft()
    {
        return left;
    }
   
    public TreeNode getRight()
    {
        return right;
    }
   
    public TreeNode getParent()
    {
        return parent;
    }
}

Inserting a node into a BST:

Inserting a node into a BST is pretty straightforward. We start checking nodes starting from the root node. At each node if the value is greater than the value in the node to be inserted then we move to the left child, otherwise we move to the right child. We repeat this until there is no more node to be traversed. The last node will be the parent node of the node to be inserted. Now if the node to be inserted has a smaller value than the current node then we set it as the left child of the current node otherwise we set it as the right child.

Java code:

public static void insert(TreeNode node)
{
    TreeNode x, y;
          // root is a global variable and is the root of the tree
    x = y = root; // Assume root is initialized to null
   
    while (x != null)
    {
        if (x.getData() > node.getData())
        {
            y = x;
            x = x.getLeft();
        }
        else
        {
            y = x;
            x = x.getRight();
        }
    }
   
    // y will be the parent of node
    if (y == null)
    {
        root = node;
        return;
    }
   
    if (y.getData() > node.getData())
        y.setLeft(node);
    else
        y.setRight(node);
   
    node.setParent(y);
}

Deleting a node from a BST:

When deleting a node from a BST, there can be 3 different scenarios:
i) The node does not have any children: This is kind of a trivial case. When there is no children all we need is removing the node from its parent.
ii) The node has only one child: This is also a relatively simple case. Since the node has only one child, we will simply replace the node with this lone child without violating the BST sorted order property.
iii) The node has two children: Now, this is a bit tricky :) Since there are two children we have to be careful not to break the sorted order of the BST. To keep the sorted order intact, we need to replace the node to be deleted with its successor so that the order is maintained in the absence of the node. Once a successor is found, we need to replace the node to be deleted with it. It turns out the successor can have at most one child. So if the successor has a child we will have to replace the successor with that child after the successor already replaced the node to be deleted.

Java code:

public static void delete(TreeNode node)
{
    // Case 1: node does not have a child, just delete it
    if (node.getLeft() == null && node.getRight() == null)
    {
        if (node.getParent() != null && node.getParent().getLeft() == node)
            node.setLeft(null);
        else if (node.getParent() != null && node.getParent().getRight() == node)
            node.setRight(null);           
    }
    // Case 2: node has only one child, splice the child with its parent
    else if (node.getLeft() == null || node.getRight() == null)
    {
        if (node.getParent() != null)
        {
            TreeNode x = node.getLeft() == null ? node.getRight() : node.getLeft();
            if (node.getParent().getLeft() == node)
                node.getParent().setLeft(x);
            else
                node.getParent().setRight(x);
        }           
    }
    // Case 3: node has both children, set the successor of the node to its parent
    else
    {
        TreeNode x = findSuccessor(node); // x will have at most one child
        // Instead of deleting we can just copy the successor's data over to the node to be deleted
        node.setData(x.getData());
        // Now delete the successor and set its child (if any) to its parent
        TreeNode nodeChild = x.getLeft() == null ? x.getRight() : x.getLeft();
        if (x.getLeft() != null)
        {
            if (x.getParent().getLeft() == x)
                x.getParent().setLeft(nodeChild);
            else
                x.getParent().setRight(nodeChild);
        }
        else
        {
            if (x.getParent().getLeft() == x)
                x.getParent().setLeft(nodeChild);
            else
                x.getParent().setRight(nodeChild);
        }
    }
}

Finding the minimum value in the tree:

In a BST, all nodes to the left of a node contains smaller (or equal) values than the value stored in the node itself. This gives us a clue about how to tackle the problem of finding the minimum value stored in the BST. If you think about it, the node with the minimum value in a BST is actually the leftmost node in the tree. If it wasn't then there would be node(s) in the tree that are on the right of some some node but contain smaller values than the node itself, thus violating the BST contract.

Here is the Java code for finding the minimum value node in a BST:

public TreeNode findMinimum(TreeNode root)
{
    if (root == null)
        return null;
  
    if (root.getLeft() != null)
        return findMinimum(root.getLeft());
  
    return root;
}

Finding the maximum value in the tree:

Similar to the logic as in finding the minimum value,  the maximum value node will be the right most node in the tree. The Java code for finding the maximum value node in a BST:

public static TreeNode findMaximum(TreeNode root)
{
    if (root == null)
        return null;
  
    if (root.getRight() != null)
        return findMaximum(root.getRight());
  
    return root;
}

One useful property of the BST is that if it is traversed in inorder we get a sorted list of values stored in the tree nodes. There are two operations that are often done in relation to maintaining this sorted order of a BST: finding the successor node and predecessor node of a given node.

Finding the successor node of a given node:

The successor node is the node with the next bigger number in the tree. Since all numbers to the left are smaller than the current node the successor node has to be located in the right subtree. Now, since it is the next bigger number of the current node, it has to be the smallest of all numbers in the right subtree. So, essentially we are looking for the minimum number in the right subtree of the current node. In the case that there is no right child of the node, we need to look up the tree and figure out the successor. The successor is the first ancestor whose left subtree has this node as the largest number. In other words: the first ancestor of this node whose left child is also an ancestor of this node. The intuition is: as we traverse left up the tree we traverse smaller values, the first node on the right is the next larger number.

Here is the Java code:

public static TreeNode findSuccessor(TreeNode node)
{
    if (node == null)
        return null;
   
    if (node.getRight() != null)
        return findMinimum(node.getRight());
  
    TreeNode y = node.getParent();
    TreeNode x = node;
    while (y != null && x == y.getRight())
    {
        x = y;
        y = y.getParent();
    }

    return y;
}

Finding the predecessor node of a given node:

The predecessor of a given node is the node containing the next smaller value. If the node has  a left child then the largest value in the subtree rooted at this left child is the predecessor. If the node does not have a left child then the predecessor node is somewhere up the tree. Now, if the node is the left child of its parent then there is no predecessor as smaller values exist only on the left side of a node. If the node is the right child of its parent the parent is the next smaller number in the absence of a left child.

Java code for finding the predecessor:

public static TreeNode findPredecessor(TreeNode node)
{
    if (node == null)
        return null;
    
    TreeNode parent = node.getParent();
    if (parent != null && parent.getLeft() == null)
        return parent;
   
    return findMaximum(node.getLeft());
}

Determining whether a tree is a BST or not:

Sometimes we already have a binary tree that we need to determine whether it is a BST or not. This is an interesting problem and can be really solved with a simple recursive solution.

The BST property - that every node on the right subtree has to be larger than the current node and every node on the left subtree has to be smaller (or equal) than the current node - is the key to figuring out whether a tree is a BST or not. On a first thought it might look like we can simply traverse the tree and at every node check whether the node contains a value larger than the value at the left child and smaller than the value on the right child, and if this condition holds for all the nodes in the tree then we have a BST. This is the so called Greedy approach, making a decision based on local properties. But this approach clearly won't work for the following tree:

      20
     /    \
  10    30
          /   \
        5     40

In the tree above, at every node the condition that the node contains a value larger than its left child and smaller than its right child hold, still its not a BST: the value 5 is on the right subtree of the node containing 20, a violation of the BST property!

So how do we solve this? It turns out that instead of making a decision based solely on a node and its children's values, we also need information flowing down from the parent as well. In the case of the tree above, if we could remember about the node containing the value 20 we could see that the node with value 5 is violating the BST property contract.

So the condition we need to check at each node is that: a) if the node is the left child of its parent, then it must be smaller (or equal) than the parent and it must pass down the value from its parent to its right subtree to make sure none of the nodes in that subtree is greater the parent, and similarly b) if the node is the right child of its parent, then it must be larger than the parent and it must pass down the value from its parent to its left subtree to make sure none of the nodes in that subtree is greater the parent.

A simple but elegant recursive solution in Java can explain this further:

public static boolean isBST(TreeNode node, int leftData, int rightData)
{
    if (node == null)
        return true;
   
    if (node.getData() > leftData || node.getData() <= rightData)
        return false;
   
    return (isBST(node.left, node.getData(), rightData) && isBST(node.right, leftData, node.getData()));
}

The initial call to this function can be something like this:

if (isBST(root, Integer.MAX_VALUE, Integer.MIN_VALUE))
    System.out.println("This is a BST.");
else
    System.out.println("This is NOT a BST!");

Essentially we keep creating a valid range (starting from [ MIN_VALUE, MAX_VALUE]) and keep shrinking it down foe each node as we go down recursively.

So, here is my short and sweet primer on Binary Search Trees. Hope you found it useful. Let me know if there is a bug or a possible improvement in runtime or space.

Sunday, July 31, 2011

Coding up a Trie (Prefix Tree)

The tree data structure is one of the most important data storage mechanism in programming. It's a natural way to represent essential utilities on a computer like the directory structure in a file system.

Many other objects can be stored in a tree data structure resulting in space and/or time efficiency. For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory we can use a tree structure to efficiently store the words instead of using a plain Array or Vector type that simply stores each word individually in memory. The space needed to store the words in an Array or Vector is simply the number of words times the average length of the words we need to store.

A Trie, also called a Prefix Tree, is a tree structure that stores words with a common prefix under the same sequence of edges in the tree eliminating the need for storing the same prefix each time for each word. From Wikipedia:
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.
Please read the Wikipedia entry to learn the advantages of a Trie over other data structures in terms of searching and sorting. We will focus on building a Trie in this post.

Today we will build a simple Trie structure that does only two operations - Insert and Search/Find. We will also add a print function for convenience of seeing the tree on screen. For simplicity, let's assume all the words/strings we want to store consist of lowercase letters only.

First, we need a way to represent each Node in the tree. The following simple class represents a Trie node in our case:

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;
  
    TrieNode(char letter, boolean fullWord)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = fullWord;
    }
}

The "letter" member field stores the leading letter in the word starting from the root node the the current node. The "links" field contain references to child nodes that contain the next letter of the child words that share a common prefix (consisting of all the letters beginning from the root node the current node). The "fullWord" field indicates whether the current letter marks the end of a valid word in the dictionary and also at the same time is a valid prefix of one or more other words. For example, the letter "n" in the words "pan" and "pant" both is the ending letter of "pan" and marks a prefix of the word "pant". For simplicity we are statically allocating 26 references to child nodes as that is the maximum number of letters in the English alphabet (lowercase only). It can be optimized by only allocating nodes that have real edge in the tree. But let's not complicate things for optimization now.


Image courtesy: a blogspot blog

In the image above the Trie contains the words an, ant, all, allot, alloy, aloe, are, ate, be. The "grounded" sign indicates a full word ending at this node.

Algorithms for inserting a a word into a Trie:

1. Set current node to root node. The root node does not contain any letter (initialized to the null character for convenience).
2. Set the current letter to the first letter in the word.
3. If the current node already has an existing reference to the current letter (through one of the elements in the "links" field) then set current node to that referenced node; else create a new node, set the letter to current letter, and set current node to this new node.
4. Repeat step 3 until all letters in the current word has been processed.

Algorithm for searching for a word in the Trie:

1. Set current node to root node. Set the current letter to the first letter in the word.
2. If the current node is null then the word does not exist in the Trie.
3. If the current node reference to a valid node containing the current letter then set current node to that referenced node and set current letter to the letter in the new current node.
4. Repeat steps 2 and 3 until all letters in the word has been processed.
5. Now there are two possibilities that may indicate the letter is not there in the tree: a) the current letter is the last letter and there is no valid node containing this letter, and b) there is a valid node containing the last letter but the node does not indicate it contains a full word (marked with the "fullWord" boolean field.
6. If step the conditions in step 5 are not met, then we have a match for the word in the Trie.

At this point you should be able to implement a Trie yourself with the above functionalities. If you still need help then look at the code below (but you better write the code yourself :) ).

Java Code:

public class PrefixTree
{
    static TrieNode createTree()
    {
        return(new TrieNode('\0', false));
    }
   
    static void insertWord(TrieNode root, String word)
    {
        int offset = 97;
        int l = word.length();
        char[] letters = word.toCharArray();
        TrieNode curNode = root;
       
        for (int i = 0; i < l; i++)
        {
            if (curNode.links[letters[i]-offset] == null)
                curNode.links[letters[i]-offset] = new TrieNode(letters[i], i == l-1 ? true : false);
            curNode = curNode.links[letters[i]-offset];
        }
    }

    static boolean find(TrieNode root, String word)
    {
        char[] letters = word.toCharArray();
        int l = letters.length;
        int offset = 97;
        TrieNode curNode = root;
       
        int i;
        for (i = 0; i < l; i++)
        {
            if (curNode == null)
                return false;
            curNode = curNode.links[letters[i]-offset];
        }
       
        if (i == l && curNode == null)
            return false;
       
        if (curNode != null && !curNode.fullWord)
            return false;
       
        return true;
    }
   
    static void printTree(TrieNode root, int level, char[] branch)
    {
        if (root == null)
            return;
       
        for (int i = 0; i < root.links.length; i++)
        {
            branch[level] = root.letter;
            printTree(root.links[i], level+1, branch);   
        }
       
        if (root.fullWord)
        {
            for (int j = 1; j <= level; j++)
                System.out.print(branch[j]);
            System.out.println();
        }
    }
   
    public static void main(String[] args)
    {
        TrieNode tree = createTree();
       
        String[] words = {"an", "ant", "all", "allot", "alloy", "aloe", "are", "ate", "be".};
        for (int i = 0; i < words.length; i++)
            insertWord(tree, words[i]);
       
        char[] branch = new char[50];
        printTree(tree, 0, branch);
       
        String searchWord = "all";
        if (find(tree, searchWord))
        {
            System.out.println("The word was found");
        }
        else
        {
            System.out.println("The word was NOT found");
        }
    }
}

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;
   
    TrieNode(char letter, boolean fullWord)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = fullWord;
    }
}

Saturday, July 23, 2011

External Sorting for sorting large files in disk

Sorting is a fundamental programming task. Given the abundance of built-in libraries that perform tasks like sorting and binary search, we often become forgetful of exactly how these tasks are accomplished.

When the data is so large that it cannot be processed in memory at one time we need to resort to the file system to store part or all the data during the sorting process. We then need to perform another layer of disk operations on top of regular sorting algorithms to manage the data as they get sorted.

External Sorting is precisely the technique we described in the previous paragraph.

Let us describe in some detail how external sorting can be done in Java:

First the algorithm:

Say, we have one file (it can be more than one file, but having just one file simplifies the process for illustration purpose) in disk containing N numbers. And suppose the memory in our computer can hold M numbers at a time.

1. Start reading the input file from the beginning.
2. Read M (or less if number of entries remaining in the file is less than M) numbers from the file and store it into a temp buffer.
3. Sort (using any good sorting method - Quicksort, for example) the numbers in the buffer stored in step 2.
4. Create a temp file in disk and write out the sorted numbers from step 3 to this temp file. Save the name of the temp file.
5. Repeat step 2 to 4 until all numbers from the input file has been read, sorted, and written out to temp files.

At this point, we have chunks of numbers of size M sorted and stored in temp files in disk. We need to merge all these sorted files into one single sorted file. We will apply the merging algorithm from Merge Sort to join the numbers from these sorted files together.

6. Open all the temp files (and set the read pointer to the beginning of the files).
7. Find the minimum number from the set of numbers currently pointed to by the file read pointer.
8. Write the number to disk. (To increase efficiency you could write the number to a buffer first and then flush the buffer out to disk when the buffer is full. But modern I/O libraries should be doing this anyway for you).
9. Read another number from the file that contained the minimum number at step 7.
10. Repeat step 7 to 9 until all numbers from all the temp files have been processed, merged, and written out to disk.

The new file in disk now contains a sorted list of the numbers supplied in the initial input file.

Java code:



import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Random;


public class ExternalSort
{
static int N = 2000000; // size of the file in disk
static int M = 100000; // max items the memory buffer can hold

public static void externalSort(String fileName)
{
String tfile = "temp-file-";
int[] buffer = new int[M < N ? M : N];

try
{
FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr);
int slices = (int) Math.ceil((double) N/M);

int i, j;
i = j = 0;
// Iterate through the elements in the file
for (i = 0; i < slices; i++)
{
// Read M-element chunk at a time from the file
for (j = 0; j < (M < N ? M : N); j++)
{
String t = br.readLine();
if (t != null)
buffer[j] = Integer.parseInt(t);
else
break;
}
// Sort M elements
Arrays.sort(buffer);


// Write the sorted numbers to temp file
FileWriter fw = new FileWriter(tfile + Integer.toString(i) + ".txt");
PrintWriter pw = new PrintWriter(fw);
for (int k = 0; k < j; k++)
pw.println(buffer[k]);

pw.close();
fw.close();
}

br.close();
fr.close();

// Now open each file and merge them, then write back to disk
int[] topNums = new int[slices];
BufferedReader[] brs = new BufferedReader[slices];

for (i = 0; i < slices; i++)
{
brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
String t = brs[i].readLine();
if (t != null)
topNums[i] = Integer.parseInt(t);
else
topNums[i] = Integer.MAX_VALUE;
}

FileWriter fw = new FileWriter("E:\\test\\external-sorted.txt");
PrintWriter pw = new PrintWriter(fw);

for (i = 0; i < N; i++)
{
int min = topNums[0];
int minFile = 0;

for (j = 0; j < slices; j++)
{
if (min > topNums[j])
{
min = topNums[j];
minFile = j;
}
}

pw.println(min);
String t = brs[minFile].readLine();
if (t != null)
topNums[minFile] = Integer.parseInt(t);
else
topNums[minFile] = Integer.MAX_VALUE;

}
for (i = 0; i < slices; i++)
brs[i].close();

pw.close();
fw.close();
}
catch (FileNotFoundException e)
{
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}


static String generateInput(int n)
{
String fileName = "external-sort.txt";
Random rand = new Random();

try
{
FileWriter fw = new FileWriter(fileName);
PrintWriter pw = new PrintWriter(fw);

for (int i = 0; i < n; i++)
pw.println(rand.nextInt(101));

pw.close();
}

catch (IOException e)
{
e.printStackTrace();
}

return fileName;
}
        
public static void main(String[] args)
{
String fileName = generateInput(N);
externalSort(fileName);
}
}



Happy external sorting!