Sunday, July 31, 2011

Coding up a Trie (Prefix Tree)

The tree data structure is one of the most important data storage mechanism in programming. It's a natural way to represent essential utilities on a computer like the directory structure in a file system.

Many other objects can be stored in a tree data structure resulting in space and/or time efficiency. For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory we can use a tree structure to efficiently store the words instead of using a plain Array or Vector type that simply stores each word individually in memory. The space needed to store the words in an Array or Vector is simply the number of words times the average length of the words we need to store.

A Trie, also called a Prefix Tree, is a tree structure that stores words with a common prefix under the same sequence of edges in the tree eliminating the need for storing the same prefix each time for each word. From Wikipedia:
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.
Please read the Wikipedia entry to learn the advantages of a Trie over other data structures in terms of searching and sorting. We will focus on building a Trie in this post.

Today we will build a simple Trie structure that does only two operations - Insert and Search/Find. We will also add a print function for convenience of seeing the tree on screen. For simplicity, let's assume all the words/strings we want to store consist of lowercase letters only.

First, we need a way to represent each Node in the tree. The following simple class represents a Trie node in our case:

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;
  
    TrieNode(char letter, boolean fullWord)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = fullWord;
    }
}

The "letter" member field stores the leading letter in the word starting from the root node the the current node. The "links" field contain references to child nodes that contain the next letter of the child words that share a common prefix (consisting of all the letters beginning from the root node the current node). The "fullWord" field indicates whether the current letter marks the end of a valid word in the dictionary and also at the same time is a valid prefix of one or more other words. For example, the letter "n" in the words "pan" and "pant" both is the ending letter of "pan" and marks a prefix of the word "pant". For simplicity we are statically allocating 26 references to child nodes as that is the maximum number of letters in the English alphabet (lowercase only). It can be optimized by only allocating nodes that have real edge in the tree. But let's not complicate things for optimization now.


Image courtesy: a blogspot blog

In the image above the Trie contains the words an, ant, all, allot, alloy, aloe, are, ate, be. The "grounded" sign indicates a full word ending at this node.

Algorithms for inserting a a word into a Trie:

1. Set current node to root node. The root node does not contain any letter (initialized to the null character for convenience).
2. Set the current letter to the first letter in the word.
3. If the current node already has an existing reference to the current letter (through one of the elements in the "links" field) then set current node to that referenced node; else create a new node, set the letter to current letter, and set current node to this new node.
4. Repeat step 3 until all letters in the current word has been processed.

Algorithm for searching for a word in the Trie:

1. Set current node to root node. Set the current letter to the first letter in the word.
2. If the current node is null then the word does not exist in the Trie.
3. If the current node reference to a valid node containing the current letter then set current node to that referenced node and set current letter to the letter in the new current node.
4. Repeat steps 2 and 3 until all letters in the word has been processed.
5. Now there are two possibilities that may indicate the letter is not there in the tree: a) the current letter is the last letter and there is no valid node containing this letter, and b) there is a valid node containing the last letter but the node does not indicate it contains a full word (marked with the "fullWord" boolean field.
6. If step the conditions in step 5 are not met, then we have a match for the word in the Trie.

At this point you should be able to implement a Trie yourself with the above functionalities. If you still need help then look at the code below (but you better write the code yourself :) ).

Java Code:

public class PrefixTree
{
    static TrieNode createTree()
    {
        return(new TrieNode('\0'));
    }
   
    static void insertWord(TrieNode root, String word)
    {
        int offset = 97;
        int l = word.length();
        char[] letters = word.toCharArray();
        TrieNode curNode = root;
       
        for (int i = 0; i < l; i++)
        {
            if (curNode.links[letters[i]-offset] == null)
                curNode.links[letters[i]-offset] = new TrieNode(letters[i]);
            curNode = curNode.links[letters[i]-offset];
        }

        curNode.fullWord = true; 
    }

    static boolean find(TrieNode root, String word)
    {
        char[] letters = word.toCharArray();
        int l = letters.length;
        int offset = 97;
        TrieNode curNode = root;
       
        int i;
        for (i = 0; i < l; i++)
        {
            if (curNode == null)
                return false;
            curNode = curNode.links[letters[i]-offset];
        }
       
        if (i == l && curNode == null)
            return false;
       
        if (curNode != null && !curNode.fullWord)
            return false;
       
        return true;
    }
   
    static void printTree(TrieNode root, int level, char[] branch)
    {
        if (root == null)
            return;
       
        for (int i = 0; i < root.links.length; i++)
        {
            branch[level] = root.letter;
            printTree(root.links[i], level+1, branch);   
        }
       
        if (root.fullWord)
        {
            for (int j = 1; j <= level; j++)
                System.out.print(branch[j]);
            System.out.println();
        }
    }
   
    public static void main(String[] args)
    {
        TrieNode tree = createTree();
       
        String[] words = {"an", "ant", "all", "allot", "alloy", "aloe", "are", "ate", "be".};
        for (int i = 0; i < words.length; i++)
            insertWord(tree, words[i]);
       
        char[] branch = new char[50];
        printTree(tree, 0, branch);
       
        String searchWord = "all";
        if (find(tree, searchWord))
        {
            System.out.println("The word was found");
        }
        else
        {
            System.out.println("The word was NOT found");
        }
    }
}

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;
   
    TrieNode(char letter)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = false;
    }
}

Saturday, July 23, 2011

External Sorting for sorting large files in disk

Sorting is a fundamental programming task. Given the abundance of built-in libraries that perform tasks like sorting and binary search, we often become forgetful of exactly how these tasks are accomplished.

When the data is so large that it cannot be processed in memory at one time we need to resort to the file system to store part or all the data during the sorting process. We then need to perform another layer of disk operations on top of regular sorting algorithms to manage the data as they get sorted.

External Sorting is precisely the technique we described in the previous paragraph.

Let us describe in some detail how external sorting can be done in Java:

First the algorithm:

Say, we have one file (it can be more than one file, but having just one file simplifies the process for illustration purpose) in disk containing N numbers. And suppose the memory in our computer can hold M numbers at a time.

1. Start reading the input file from the beginning.
2. Read M (or less if number of entries remaining in the file is less than M) numbers from the file and store it into a temp buffer.
3. Sort (using any good sorting method - Quicksort, for example) the numbers in the buffer stored in step 2.
4. Create a temp file in disk and write out the sorted numbers from step 3 to this temp file. Save the name of the temp file.
5. Repeat step 2 to 4 until all numbers from the input file has been read, sorted, and written out to temp files.

At this point, we have chunks of numbers of size M sorted and stored in temp files in disk. We need to merge all these sorted files into one single sorted file. We will apply the merging algorithm from Merge Sort to join the numbers from these sorted files together.

6. Open all the temp files (and set the read pointer to the beginning of the files).
7. Find the minimum number from the set of numbers currently pointed to by the file read pointer.
8. Write the number to disk. (To increase efficiency you could write the number to a buffer first and then flush the buffer out to disk when the buffer is full. But modern I/O libraries should be doing this anyway for you).
9. Read another number from the file that contained the minimum number at step 7.
10. Repeat step 7 to 9 until all numbers from all the temp files have been processed, merged, and written out to disk.

The new file in disk now contains a sorted list of the numbers supplied in the initial input file.

Java code:



import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.Random;


public class ExternalSort
{
static int N = 2000000; // size of the file in disk
static int M = 100000; // max items the memory buffer can hold

public static void externalSort(String fileName)
{
String tfile = "temp-file-";
int[] buffer = new int[M < N ? M : N];

try
{
FileReader fr = new FileReader(fileName);
BufferedReader br = new BufferedReader(fr);
int slices = (int) Math.ceil((double) N/M);

int i, j;
i = j = 0;
// Iterate through the elements in the file
for (i = 0; i < slices; i++)
{
// Read M-element chunk at a time from the file
for (j = 0; j < (M < N ? M : N); j++)
{
String t = br.readLine();
if (t != null)
buffer[j] = Integer.parseInt(t);
else
break;
}
// Sort M elements
Arrays.sort(buffer);


// Write the sorted numbers to temp file
FileWriter fw = new FileWriter(tfile + Integer.toString(i) + ".txt");
PrintWriter pw = new PrintWriter(fw);
for (int k = 0; k < j; k++)
pw.println(buffer[k]);

pw.close();
fw.close();
}

br.close();
fr.close();

// Now open each file and merge them, then write back to disk
int[] topNums = new int[slices];
BufferedReader[] brs = new BufferedReader[slices];

for (i = 0; i < slices; i++)
{
brs[i] = new BufferedReader(new FileReader(tfile + Integer.toString(i) + ".txt"));
String t = brs[i].readLine();
if (t != null)
topNums[i] = Integer.parseInt(t);
else
topNums[i] = Integer.MAX_VALUE;
}

FileWriter fw = new FileWriter("E:\\test\\external-sorted.txt");
PrintWriter pw = new PrintWriter(fw);

for (i = 0; i < N; i++)
{
int min = topNums[0];
int minFile = 0;

for (j = 0; j < slices; j++)
{
if (min > topNums[j])
{
min = topNums[j];
minFile = j;
}
}

pw.println(min);
String t = brs[minFile].readLine();
if (t != null)
topNums[minFile] = Integer.parseInt(t);
else
topNums[minFile] = Integer.MAX_VALUE;

}
for (i = 0; i < slices; i++)
brs[i].close();

pw.close();
fw.close();
}
catch (FileNotFoundException e)
{
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}


static String generateInput(int n)
{
String fileName = "external-sort.txt";
Random rand = new Random();

try
{
FileWriter fw = new FileWriter(fileName);
PrintWriter pw = new PrintWriter(fw);

for (int i = 0; i < n; i++)
pw.println(rand.nextInt(101));

pw.close();
}

catch (IOException e)
{
e.printStackTrace();
}

return fileName;
}
        
public static void main(String[] args)
{
String fileName = generateInput(N);
externalSort(fileName);
}
}



Happy external sorting!


Exceptional Code!

Hello there!

So, coding is what I do for a living. And coding is my passion as well.

There has never been a better time to create things so easily with a combination of imagination and a bit of programming skills. Yes, you can build what you want. You don't need a lot of money or a lot of people to build simple software that does virtually anything!

I hope to write my journey as a programmer (or developer/engineer/whatever) in this blog and share code that I believe is "exceptional".

See you until the next post shows up!