Sunday, July 31, 2011

Coding up a Trie (Prefix Tree)

The tree data structure is one of the most important data storage mechanism in programming. It's a natural way to represent essential utilities on a computer like the directory structure in a file system.

Many other objects can be stored in a tree data structure resulting in space and/or time efficiency. For example, when we have a huge number of dictionary (and/or non-dictionary) words or strings that we want to store in memory we can use a tree structure to efficiently store the words instead of using a plain Array or Vector type that simply stores each word individually in memory. The space needed to store the words in an Array or Vector is simply the number of words times the average length of the words we need to store.

A Trie, also called a Prefix Tree, is a tree structure that stores words with a common prefix under the same sequence of edges in the tree eliminating the need for storing the same prefix each time for each word. From Wikipedia:
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string.
Please read the Wikipedia entry to learn the advantages of a Trie over other data structures in terms of searching and sorting. We will focus on building a Trie in this post.

Today we will build a simple Trie structure that does only two operations - Insert and Search/Find. We will also add a print function for convenience of seeing the tree on screen. For simplicity, let's assume all the words/strings we want to store consist of lowercase letters only.

First, we need a way to represent each Node in the tree. The following simple class represents a Trie node in our case:

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;
  
    TrieNode(char letter, boolean fullWord)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = fullWord;
    }
}

The "letter" member field stores the leading letter in the word starting from the root node the the current node. The "links" field contain references to child nodes that contain the next letter of the child words that share a common prefix (consisting of all the letters beginning from the root node the current node). The "fullWord" field indicates whether the current letter marks the end of a valid word in the dictionary and also at the same time is a valid prefix of one or more other words. For example, the letter "n" in the words "pan" and "pant" both is the ending letter of "pan" and marks a prefix of the word "pant". For simplicity we are statically allocating 26 references to child nodes as that is the maximum number of letters in the English alphabet (lowercase only). It can be optimized by only allocating nodes that have real edge in the tree. But let's not complicate things for optimization now.


Image courtesy: a blogspot blog

In the image above the Trie contains the words an, ant, all, allot, alloy, aloe, are, ate, be. The "grounded" sign indicates a full word ending at this node.

Algorithms for inserting a a word into a Trie:

1. Set current node to root node. The root node does not contain any letter (initialized to the null character for convenience).
2. Set the current letter to the first letter in the word.
3. If the current node already has an existing reference to the current letter (through one of the elements in the "links" field) then set current node to that referenced node; else create a new node, set the letter to current letter, and set current node to this new node.
4. Repeat step 3 until all letters in the current word has been processed.

Algorithm for searching for a word in the Trie:

1. Set current node to root node. Set the current letter to the first letter in the word.
2. If the current node is null then the word does not exist in the Trie.
3. If the current node reference to a valid node containing the current letter then set current node to that referenced node and set current letter to the letter in the new current node.
4. Repeat steps 2 and 3 until all letters in the word has been processed.
5. Now there are two possibilities that may indicate the letter is not there in the tree: a) the current letter is the last letter and there is no valid node containing this letter, and b) there is a valid node containing the last letter but the node does not indicate it contains a full word (marked with the "fullWord" boolean field.
6. If step the conditions in step 5 are not met, then we have a match for the word in the Trie.

At this point you should be able to implement a Trie yourself with the above functionalities. If you still need help then look at the code below (but you better write the code yourself :) ).

Java Code:

public class PrefixTree
{
    static TrieNode createTree()
    {
        return(new TrieNode('\0'));
    }
   
    static void insertWord(TrieNode root, String word)
    {
        int offset = 97;
        int l = word.length();
        char[] letters = word.toCharArray();
        TrieNode curNode = root;
       
        for (int i = 0; i < l; i++)
        {
            if (curNode.links[letters[i]-offset] == null)
                curNode.links[letters[i]-offset] = new TrieNode(letters[i]);
            curNode = curNode.links[letters[i]-offset];
        }

        curNode.fullWord = true; 
    }

    static boolean find(TrieNode root, String word)
    {
        char[] letters = word.toCharArray();
        int l = letters.length;
        int offset = 97;
        TrieNode curNode = root;
       
        int i;
        for (i = 0; i < l; i++)
        {
            if (curNode == null)
                return false;
            curNode = curNode.links[letters[i]-offset];
        }
       
        if (i == l && curNode == null)
            return false;
       
        if (curNode != null && !curNode.fullWord)
            return false;
       
        return true;
    }
   
    static void printTree(TrieNode root, int level, char[] branch)
    {
        if (root == null)
            return;
       
        for (int i = 0; i < root.links.length; i++)
        {
            branch[level] = root.letter;
            printTree(root.links[i], level+1, branch);   
        }
       
        if (root.fullWord)
        {
            for (int j = 1; j <= level; j++)
                System.out.print(branch[j]);
            System.out.println();
        }
    }
   
    public static void main(String[] args)
    {
        TrieNode tree = createTree();
       
        String[] words = {"an", "ant", "all", "allot", "alloy", "aloe", "are", "ate", "be".};
        for (int i = 0; i < words.length; i++)
            insertWord(tree, words[i]);
       
        char[] branch = new char[50];
        printTree(tree, 0, branch);
       
        String searchWord = "all";
        if (find(tree, searchWord))
        {
            System.out.println("The word was found");
        }
        else
        {
            System.out.println("The word was NOT found");
        }
    }
}

class TrieNode
{
    char letter;
    TrieNode[] links;
    boolean fullWord;
   
    TrieNode(char letter)
    {
        this.letter = letter;
        links = new TrieNode[26];
        this.fullWord = false;
    }
}

17 comments:

  1. Very good attempt. But I believe one of the main advantage of Trie is space efficiency. But with this implementation each node will have 26 elements and many elements may be empty at the given time.

    ReplyDelete
  2. Thanks Phani for your comment. I mention this in the post: "It can be optimized by only allocating nodes that have real edge in the tree".

    This implementation gives the main idea. We can add additional code to make sure no unnecessary node is allocated in the tree. I skipped that part to keep things simple.

    ReplyDelete
  3. hi i would like to know why you have initialised the offset variable to 97 in the find method

    ReplyDelete
  4. Hi Mariyah,

    For simplicity, I assumed all words will consist of lowercase English letters only. The ASCII value of 'a' is 97, so subtracting 97 from the ASCII value of any lowercase letter in the alphabet gives you the index of that letter (where index of 'a' is 0, and so on...).

    ReplyDelete
  5. Hi,

    I think there is a subtle bug in the code. Correct me if I am wrong. Suppose we have two strings called geeksforgeeks and geeks.
    Addition of first string will be fine. When we will try to add second string we will find that it already exists and we won't update the last char s as true. (I think this is missing)
    So when we will try to search for word "geeks" particularly it will fail.

    Thanks

    ReplyDelete
    Replies
    1. Hi Abhijheet,

      Good catch! It is indeed a very subtle bug. Thanks for finding it out!

      I fixed the code. I now explicitly mark a word completely inserted each time the insertWord method is done with inserting a word.

      Delete
    2. This comment has been removed by the author.

      Delete
  6. Good work! helped me a lot. Thanks

    ReplyDelete
  7. The Tries is just a tree, so the standard recursive DFS works pretty well for it.

    ReplyDelete
  8. Yes, it's a bit waste when you don't use all 26 characters of the alphabet. But my goal was to implement the trie without using any library data structures. The Set itself has cost associated with it.

    ReplyDelete
  9. Amazing explanation but when this code will execute
    if (i == l && curNode == null)
    return false;

    ReplyDelete
  10. Good work, and thanks for this lecture, Bilash

    ReplyDelete
  11. This blog is fully fabulous in all aspects.

    Read More about Uranium Bikes

    ReplyDelete
  12. Hi, great work you have here :) However, i want to ask if its possible to support numeric and uppercase letters as well? Can you show me? Many thanks

    ReplyDelete
  13. I think if we want to implement a trie more effciently, better approach will be suffix trees, isn't it?
    That way we won't allocate any space that isn't used.

    ReplyDelete
  14. Hi, it gave me this error "IndexOutOfRangeException: Array index is out of range." on curNode.links[letters[i]-offset] i dunno why. Thanks

    ReplyDelete