Monday, February 20, 2012

Life on a Tree - Creating, Copying, and Pruning Tree Data Structures

The tree is one of the most important data structures in computer programming.



A very common of example of the tree data structure being used in everyday life is a directory structure widely used both in Windows and Unix systems.

Today I will write a brief tutorial on trees, and show a number of common operations done on the structure of a tree. The code examples will be in Java.

First, let's create a tree. A tree is essentially a set of node objects that hold some data and each node has zero or more "child" nodes. The nodes without any child nodes are called leaves and the node that itself is not a child of any other node is called the root. You can read up on trees on the Web, we will focus on the implementation of trees here.

We can define a node in the tree as follows:

class Node
{
    private int data; /// this can be more complex objects
  
    private Node left;
    private Node right;
    private Node parent;
  
    public Node(int value)
    {
        data = value; // we could also use setter/getter for this value
        parent = null; // this is optional, but having this is often useful
        left = null;
        right = null;
    }
   
    public Node(Node node) // copy constructor
    {
        data = node.getData();
        left = null;
        right = null;
        parent = null;
    }

    public void setData(int data)
    {
        this.data = data;
    }
   
    public int getData()
    {
        return data;
    }
   
    public void setLeftNode(Node node)
    {
       left = node;
    }

    public Node getLeftNode()
    {
        return left;
    }

    public void setRightNode(Node node)
    {
        right = node;
    }

    public Node getRightNode()
    {
       return right;
    }

    public void setParentNode(Node node)
    {
         parent = node;
    }

    public Node getParentNode()
    {
        return parent;
    }
}

The class above is a simple example of a node in a tree. It contains a data variable, a reference to left and right sub-trees each, and a reference to the parent node as well. There are also a set of setters and getters to manipulate the data members. Please note that - in practice you might need much more complex nodes loaded with big data objects and more children than just a left and right nodes.

To create a tree using the class above you could do the following:

Node root = new Node(100); // creating a root node with value 100

root.setLeftNode(new Node(200));
root.setRightNode(new Node(50));

root.getLeftNode().setLeftNode(new Node(10));
root.getLeftNode().setRightNode(new Node(1000));

A tree is represented by the root node of the tree. So, the "root" object above represent the entire tree.

Copying a tree:

Say you have a tree that you want to copy to another tree. Now, copying can be done in two ways - i) shallow copy, and ii) deep copy.

We will show deep copy here. Deep copying means the new tree is an entirely new copy of the old tree - both in structure and in data. Everything is allocated again in memory.

The following function does a deep copy of a tree to another tree:

public void deepCopy(Node root, Node copy)
{
        Node left = root.getLeftNode();
        Node right = root.getRightNode();
       
        if (left != null)
        {
            copy.setLeftNode(new Node(left));
            deepCopy(left, copy.getLeftNode());
        }
       
        if (right != null)
        {
            copy.setRightNode(new Node(right));
            deepCopy(right, copy.getRightNode());
        }
}

This function can be called like this:

Node copy = new Node(root); // copy the root
deepCopy(root, copy); // copy the rest of the tree

Now, the tree referenced by "copy" holds an entire deep copy of the tree "root".

Pruning a tree:

Pruning means deleting one or more subtrees of a tree. We will implement a filter-based pruning here. That is, whenever a node will match some criteria described in a filter we will delete that node along with all its children from its parent.

First, we will need a way to represent a filter. We will do that by way of an interface that all filter classes will have to implement.

interface Filter
{
    public boolean check(Node node);
}

Now, we can define a Filter class as follows:

class MyFilter implements Filter
{
    public boolean check(Node node)
    {
        if (node.getData() == 200)
            return false;
        return true;
    }
}

This class indicates that we would like to delete all sub-trees rooted at a node containing the data value 200.

The pruning class will be a lot like the deep copying class. This is because the pruned tree is actually a deep copy of the original tree minus the pruned nodes!

Here is what the pruning method looks like:

public void pruneTree(Node root, Node copy, Filter filter)
{
        Node left = root.getLeftNode();
        if (left != null && filter.check(left))
        {
            copy.setLeftNode(new Node(left));
            pruneTree(left, copy.getLeftNode(), filter);
        }
       
        Node right = root.getRightNode();
        if (right != null && filter.check(right))
        {
            copy.setRightNode(new Node(right));
            pruneTree(right, copy.getRightNode(), filter);
        }
}

This method can be called as follows:

Node copy = new Node(root);
Filter filter = new MyFilter();
pruneTree(root, copy, filter);

At this point, the node object "copy" will contain a pruned version of the original tree rooted at "root".

Finally, here is a small class that can print a tree in pre-order traversal. I used this class in testing out the code in this post:

public void printTree(Node root)
{
        if (root != null)
        {
            System.out.println(root.getData()); 
            printTree(root.getLeftNode());
            printTree(root.getRightNode());
        }
}

I hope you will now be able to create simple trees quickly. It won't be hard to modify these classes and methods to build more complex tree classes to handle more than two children, copying a subtree instead of the whole tree, prune more selectively, etc.

Happy tree coding:)

Friday, February 10, 2012

Solving the Boggle Game - Recursion, Prefix Tree, and Dynamic Programming

I spent this past weekend designing the game of Boggle. Although it looks like a simple game at a high level, implementing it in a programming language was a great experience. I had to use recursion, sorting, searching, prefix trees (also knows as trie's), and dynamic programming as I was improving the run time of the program. So here is a summary of the work I did in trying to write a very optimal solution to the Boggle game:



*Image courtesy: http://stackoverflow.com/questions/746082/how-to-find-list-of-possible-words-from-a-letter-matrix-boggle-solver

The game of Boggle is played on a N x N board (usually made of cubes that has letters engraved on it). Given a dictionary, you will have to construct words on the board following these rules:

i) The letters in the word must be "adjacent" to each other
ii) Two letters on the board are "adjacent" if they are located on the board next to each other horizontally, vertically, or diagonally.
iii) A word must be 3 or more letters long to be valid

The more words you can construct, the more points you get. Longer words get more points.

In computerse: given an N x N board, and a dictionary containing hundreds of thousands of words, construct those words from the board that are found in the dictionary.

There are a number of approaches that one can take to solve this problem. I designed three different algorithms to solve it. Here are they:

First solution - Recursion + Binary Search:

In this approach, we recurse (using backtracking) through the board and generate all possible words. For each word that are three or more letters long we check to see if it's in the dictionary. If it is, we have a match!

Here is the algorithm steps:

1. Read in the dictionary in to a container in memory.
2. Sort the dictionary.
3. Using backtracking, search the board for words.
4. If a word is found and it contains 3 or more letters, do a binary search on the dictionary to see if the word is there.
5. If searching was successful in the previous step, print the letter out.
6. Continue step 3-5 as long as there are more words to construct on the board.

Complexity of this approach:

In this solution, we do a good job on the dictionary search. Using binary search, we are quickly finding out whether a word is in dictionary or not. But the real bottleneck is in searching the board for words. For an N x N board the search space is O((N*N)!). I am not exactly sure about this number, but you can find some discussions of it here: http://www.danvk.org/wp/2007-08-02/how-many-boggle-boards-are-there/.
(N*N)! is a HUGE number even for N = 5. So this approach is is impractical and out of question for any useful implementation.


Second Solution - Pruned Recursion + Prefix Tree (also known as a Trie):

From the previous approach, our major concern was the enormous search space on the board. Fortunately, using a a prefix tree or trie data structure we could significantly cut down on this search space. The reasoning behind this improvement is that, if a word "abc" does not occur as a prefix to any word in the dictionary there is no reason to keep searching for words after we encounter "abc" on the board. This actually cut down the run time a lot.

Here is the algorithm steps:

1. Read a word from the dictionary file.
2. Insert it into a prefix tree data structure in memory.
3. Repeat steps 1-2 until all words in the dictionary have been inserted into the prefix tree.
4. Using backtracking, search the board for words.
5. If a word is found and it contains 3 or more letters, search the prefix tree for the word.
6. If searching was *not* successful in the previous step, return from this branch of the backtracking stage. (There is no point to continue searching in this branch, nothing in the dictionary as the prefix tree says).
7. If searching was successful in step 5, continue searching by constructing more words along this branch of backtracking and stop when the leaf node has been reached in the prefix tree. (at that point there is nothing more to search).
8. Repeat steps 4-7 as long as there are more words to search in the backtracking.

Complexity of this approach:

This approach significantly improves on the first one. Building a prefix tree our of the dictionary words is O(W * L), where W is the number of words in the dictionary and L is the maximum length of a word in the dictionary.
Searching the board will be of the same order as the dictionary since we are not really searching words that are not in the dictionary. But in reality it will be more work than that as we still need to backtrack along the board to construct new words until we can consult the dictionary prefix tree to know whether it exists or not.

Third and Final Solution - No search space + Dynamic Programming:

The 2nd approach mentioned above was good enough until the board size was 5. Unfortunately with a board size of 6, that too was taking forever to complete!



It got me into thinking - "Dang, this search space is still too big to search! Can I just get rid of it entirely?" And then this idea popped into my mind: instead of random constructing word after word in this infinite ocean of words why don't I take a word from the dictionary and somehow magically check whether that's available on the board or not?

It turns out, we can use a nifty dynamic programming technique to quickly check whether a word (from the dictionary in this case) can be constructed from the board or not!

Here is core point of the dynamic programming idea:


For a word of length k to be found (end location) at the [i, j]-th location of the board, the k-1'th letter of that word must be located in one of the adjacent cells of [i, j].



The base case is k = 1.

A letter of length 1 will be found (end location) in the [i, j]-th cell of the board of the only letter in the word matches the letter in the [i, j]-th location of the board.

Once our dynamic programming table is populated with the base case, we can build on top of that for any word of length k, k > 1.

Here is a sample code for this:

for (k = 2; k < MAX_WORD_LENGTH; ++k)
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            if (board[i][j] == word[k])
            {
                 for all the "adjacent" cells of board[i, j]
                     if we table[k-1][adjacent_i][adjacent_j] is true
                         then table[k][i][j] = true;
             }


*I have the C++ code, if you are interested to take a look leave a comment with your email address.

Run Time Complexity:

The run time complexity of this approach is obvious is pretty obvious. It's O (W * N * N * MAX_WORD_LENGTH). Here N is dimension of the board which is usually between 4 to 6. So essentially the algorithm runs in the order of the size of the dictionary!

I solved a 6 X 6 boggle game with a dictionary with 600,000 words in 0.002 seconds! With I/O it as about 0.57 seconds. But with the trie and binary search approaches, it took much longer!

So, here is the summary of my a weekend's worth of adventure in to the world of algorithms and data structures. Feel free to ask me any questions, or any bug in these algorithms:)

Credit where it is due: My friend Satej who is a ninja problem solver and a PhD student at the UCF explained the dynamic programming solution to me first before I refined and finalized it. Thanks Satej!


Update (5/10/2012):

So many people have asked for the source code that I feel I better add the C++ code here:) The final, polished code for this is not in my home computer here, so if there is a bug please let me know!


#include <cstdio>
#include <iostream>

using namespace std;

const int N = 6; // max length of a word in the board

char in[N * N + 1]; // max length of a word
char board[N+1][N+2]; // keep room for a newline and null char at the end
char prev[N * N + 1];
bool dp[N * N + 1][N][N];

// direction X-Y delta pairs for adjacent cells
int dx[] = {0, 1, 1, 1, 0, -1, -1, -1};
int dy[] = {1, 1, 0, -1, -1, -1, 0, 1};
bool visited[N][N];

bool checkBoard(char* word, int curIndex, int r, int c, int wordLen)
{
    if (curIndex == wordLen - 1)
    {
        //cout << "Returned TRUE!!" << endl;
        return true;
    }
   
    int ret = false;
       
    for (int i = 0; i < 8; ++i)
    {
        int newR = r + dx[i];
        int newC = c + dy[i];
       
        if (newR >= 0 && newR < N && newC >= 0 && newC < N && !visited[newR][newC] && word[curIndex+1] == board[newR][newC])
        {
            ++curIndex;
            visited[newR][newC] = true;
           
            ret = checkBoard(word, curIndex, newR, newC, wordLen);
            if (ret)
                break;
               
            --curIndex;
            visited[newR][newC] = false;
        }
    }
   
    return ret;           
}

int main(int argc, char* argv[])
{
   
    int i, j, k, l;
   
    FILE* fDict = fopen("dict.txt","r");
    FILE* fBoard = fopen("tmp2.txt","r");
   
    for(i = 0; i < N; ++i)
        fgets(board[i], N+2, fBoard);
   
    strcpy(prev,"");
    int pLen = 0;
   
    while(fgets(in, N*N + 1, fDict))
    {
        int len = strlen(in);
        if (in[len-1] == '\n')
        {
            in[len-1] = '\0'; // remove the trailing newline
            --len;
        }
       
        if (len < 3)
            continue; // we only want words longer than 3 or more letter
           
        for(i = 0; i < len && i < pLen; ++i)
        {
            if(prev[i] != in[i])
                break;
        }
       
        int firstMismatch = i; // little optimization: building on previous word (will benefit if the word list is sorted)
       
        if(firstMismatch==0)
        {
            for(i = 0; i < 6; ++i)
            {
                for(j = 0; j < 6; ++j)
                {
                    if(board[i][j] == in[0])
                        dp[0][i][j] = true;
                    else
                        dp[0][i][j] = false;
                }
            }
            firstMismatch = 1;
        }
       
        for(k = firstMismatch; k < len; ++k)
        {
            for(i=0;i<6;++i)
            {
                for(j=0;j<6;++j)
                {   
                    dp[k][i][j] = false;
                           
                    if(board[i][j] != in[k])
                        continue;
                       
                    for(l= 0; l < 8 && !dp[k][i][j]; ++l)
                    {
                        int ti = i + dx[l];
                        int tj = j + dy[l];
                       
                        if(ti < 0 || ti >= 6 || tj < 0 || tj >= 6)
                            continue;
                       
                        if (dp[k-1][ti][tj])
                            dp[k][i][j] = true;
                    }
                }
            }  
        }
       
        // check if the word is tagged as found in the dp table
        bool flag = false;
        for(i = 0; i < 6 && !flag; ++i)
        {
            for(j = 0; j < 6 && !flag; ++j)
            {
                if(dp[len-1][i][j])
                    flag =true;
            }
        }
       
        // dp table says its there, but make sure its in the board and it does not repeat a location in the board
        if(flag)
        {
            //cout << "Checking word: " << in << endl;
            bool verified = false;
           
            for (i = 0; i < N && !verified; ++i)
            {
                for (j = 0; j < N && !verified; ++j)
                {
                    if (in[0] != board[i][j])
                        continue;
                       
                    memset(visited, false, sizeof(visited));
                    visited[i][j] = true;
                   
                    if (checkBoard(in, 0, i, j, len))
                    {
                        cout << in << endl;
                        break;
                    }
                }
            }
        }
       
        strcpy(prev,in);
        pLen=len;
           
    }
   
    return 0;
}


Enjoy!