Friday, January 20, 2012

The Power of Perl Regular Expressions

I am not a big scripting language person. C++ and Java (and occasionally C#) have been my stronghold so far. But recently I have been strolling around in the Perl world at work, and to my amazement I found Perl so rich a language that I fell in love with it.

Here is an example of the power and succinctness of Perl. This one involves using Perl's powerful regular expressions and file/directory handling features.

For a project I was working on, I needed to modify the contents of a large number of files in a directory. The modification itself was really simple: replace every occurrence of a term, say XYZ, with .XYZ (insert a dot before the term). The term XYZ may or may not be preceded by a "-" (minus sign). Other than that, the term is guaranteed to be separated from other text by white space before and after it. The term can also occur at the very beginning or at the very end of a line.

Now, I was new to Perl, and to regular expressions for that matter. I struggled with the problem for a day and a half. There were always one or two cases that I was not covering, or cases I was covering that I was not supposed to. Finally I gave up and posted this as a question on Stack Overflow:

Overlapping text substitution with Perl regular expression.

Within 10 minutes I had a bunch of answers from some real Perl experts. I am so thankful to the people who answered and saved me from more days of struggling with the problem!

Here is what I ended up with for the whole problem:


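# Edit files in place, keeping each original as <name>.copy
$^I = ".copy";
# Process every .txt file in the current directory
# (no "my" here: @ARGV is a global and cannot be declared lexically)
@ARGV = <*.txt>;

while (<>)
{
    # Insert a dot before each stand-alone XYZ, optionally preceded by "-"
    s/((?:^|\s)-?)(XYZ)(?=\s|$)/$1.$2/g;
    print;
}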


That's it! How cool is that?

If this were C++ or Java, I would have to deal with opening file streams (and buffered streams in Java), opening the files one at a time, reading each file line by line, storing the lines in a list or something similar in memory, and then writing the whole thing back to the file system, and so on and so forth. I could easily end up writing close to a hundred lines of code for that!

But in Perl, it's just those few lines above. The glob <*.txt> lists the matching files in the directory, the diamond operator (<>) reads through each of those files line by line, and setting $^I turns on in-place editing, so whatever gets printed is written back to the file while the original is kept with a .copy extension. The one-line regular expression then finds every occurrence of the pattern in all of those files and replaces it with the same text with a dot inserted before the term.
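To see what the substitution does case by case, here is a small sketch that applies the same regular expression to a few strings instead of files. The helper name dot_prefix and the sample lines are made up for illustration; only the regular expression itself comes from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): applies the
# same substitution to a single string and returns the result.
sub dot_prefix {
    my ($line) = @_;
    # $1 = start of line or one whitespace character, plus an optional "-"
    # $2 = the term itself
    # (?=\s|$) only checks that whitespace or end of line follows,
    # without consuming it, so the next match can still use it.
    $line =~ s/((?:^|\s)-?)(XYZ)(?=\s|$)/$1.$2/g;
    return $line;
}

# Sample lines invented for this example.
my @samples = (
    "XYZ at the start",
    "ends with -XYZ",
    "XYZ XYZ -XYZ",
    "fooXYZ XYZbar",
);

for my $line (@samples) {
    printf "%-20s => %s\n", $line, dot_prefix($line);
}
```

The first three lines come out as ".XYZ at the start", "ends with -.XYZ" and ".XYZ .XYZ -.XYZ", while "fooXYZ XYZbar" is left untouched because neither XYZ stands alone. The third sample is the "overlapping" case: if the trailing white space were consumed by the match rather than only checked by the lookahead, the second XYZ would have no separator left in front of it and would be skipped.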

I admit, Perl's regex engine and built-in file handling are doing a lot of work in the background, but still, the expressive power of these tools is so elegant!
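For the curious, here is a rough sketch of the work that goes on behind the scenes, written out by hand. It is only an approximation of what in-place editing does, not Perl's actual implementation. The helper name edit_in_place is hypothetical, and this version takes the file names from the command line instead of globbing for *.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrites one file, keeping the original content
# in a backup file named <file><suffix>.
sub edit_in_place {
    my ($file, $suffix) = @_;
    my $backup = $file . $suffix;

    # Move the original out of the way, then read from the backup
    # and write a fresh file under the original name.
    rename $file, $backup or die "Cannot rename $file to $backup: $!";
    open my $in,  '<', $backup or die "Cannot read $backup: $!";
    open my $out, '>', $file   or die "Cannot write $file: $!";

    while (my $line = <$in>) {
        # Same substitution as in the short script.
        $line =~ s/((?:^|\s)-?)(XYZ)(?=\s|$)/$1.$2/g;
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $file: $!";
    return;
}

# Process every file named on the command line.
edit_in_place($_, '.copy') for @ARGV;
```

Saved under a name of your choice, say dotify.pl, it could be run as "perl dotify.pl *.txt". Even spelled out like this it is still fairly short, but the version with $^I and <> says the same thing in a handful of lines.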

I hope to keep exploring Perl more in the coming days, especially its powerful regular expressions!