Regular Expressions, Episode IV

I was asked to (and I intend to) post the materials from my Intro to Regular Expressions panel at Penguicon, but between slackerdom and 2 weeks in Japan without my netbook (upon which are all my materials from that panel), that hasn’t happened yet. So in the meantime, I’ll give you further examples of the practical application of regex. The crappy prequel will come later.

While I was in Japan, I had 2 occasions to use Regex to speed my job along. I present them to you now as a more practical example than “m/Hello World/”:

In the first example, I was configuring data groups in a proprietary application. In some cases, I had multiple instances of an identical group. To save myself some typing, I tried to copy and paste a single instance of the group over and over again, but the end result was sorted by element rather than by group.

For example, if one group has the elements
11-01-a1
11-01-b1
11-01-c1
then when I C&P that group, the paste auto-increments the end number, and the resultant mess is
11-01-a1
11-01-a2
11-01-b1
11-01-b2
11-01-c1
11-01-c2
When you consider that the actual number of elements is 11 rather than 3, and the actual number of groups is 36... that’s quite an unwieldy mess. Fortunately, this app allows me to export and import the data as .csv, and by running a pair of regexes (with some effort, I could probably make it a single regex) I wind up with 36 nice groups, such as this:
11-01-a1
11-01-b1
11-01-c1
11-02-a2
11-02-b2
11-02-c2

The regex are as follows:
s/01(-[a-z](\d))(\D)/0\2\1\3/g (which handles groups 2-9)
s/01(-[a-z](\d\d))(\D)/\2\1\3/g (which handles groups 10-99)
So we match against the “01-” that we know is there, any single lower-case letter, and any one (or 2) digits, followed by any non-digit. The non-digit is important to keep the single-digit regex from matching against the two-digit entries as well (and acting only on the first digit). We have 3 groups taken from the match. 1 is everything from the dash through the digit(s), 2 is the digit(s) alone, and 3 is the trailing non-digit. (edit: as I re-read this, it occurs to me I only needed two groups, that “\D” could have been part of group 1)
Our substitution then is a zero (for the single digit match), the digit(s), and a reiteration of the dash and everything which followed.
The cool thing is, I was able to do this without any code. I used the ConText editor, which has a regex function built into its search/replace.
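If you'd rather script it than use an editor, here is a rough Perl sketch of the same two substitutions. The helper names (regroup_two_pass, regroup_one_pass) and the sample lines are made up for illustration. The single-regex version is only one guess at how the pair could be collapsed, not something I actually ran against the real export: it swaps the trailing \D for a (?!\d) lookahead and lets sprintf do the zero padding.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the two substitutions from the post in order.
sub regroup_two_pass {
    my ($line) = @_;
    # Single-digit suffix (groups 2-9): pad the group number with a zero.
    $line =~ s/01(-[a-z](\d))(\D)/0$2$1$3/g;
    # Two-digit suffix (groups 10-99): the digits are the group number as-is.
    $line =~ s/01(-[a-z](\d\d))(\D)/$2$1$3/g;
    return $line;
}

# Hypothetical helper: one possible single-regex version.
# (?!\d) stands in for the trailing \D, and /e runs sprintf for the padding.
sub regroup_one_pass {
    my ($line) = @_;
    $line =~ s/01(-[a-z](\d{1,2}))(?!\d)/sprintf("%02d%s", $2, $1)/ge;
    return $line;
}

# Made-up sample lines in the shape described above (trailing comma as in a .csv).
for my $line ("11-01-a1,", "11-01-a2,", "11-01-b12,") {
    printf "%s => %s | %s\n", $line, regroup_two_pass($line), regroup_one_pass($line);
}
```

One difference worth knowing: the two-pass version needs some non-digit character after the number (a comma, quote, or newline), while the lookahead version also works when the number is the very last thing in the string.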

For the second example, I actually needed code. My co-worker was performing a similar repetitive task where he had 1 set of 1,416 memory addresses defined, and needed 6 more nearly-identical sets created. But the addresses in each set were offset by 180, which doesn’t lend itself well to a simple search and replace. Furthermore, the addressing was somewhat variable. It comes in 3 formats:
D05517
D05608.01
DSH05551.012

Again, the data was exportable to .csv, so one quick Perl script later (56 lines total, including blank lines and curly-brace lines) I saved him 8,496 lines of manual editing (1,416 addresses times 6 sets), plus an untold quantity of time.

Here’s the relevant bit of the code:
(edit: and, for reference, what the actual .csv lines look like for the above sample addresses)

"COUNT_UCT","D05517",Word,1,RO,1000,,,,,,,,,,"Under Cycle Counter",
"EVENT_WORD0_1","D05608.01",Boolean,1,RO,1000,,,,,,,,,,"",
"BIRTH_PART","DSH05551.012",String,1,RO,1000,,,,,,,,,,"",

# Read the exported .csv one line at a time.
while (<IN820>)
{
    $a820 = $_;
    # Capture the 4 address digits that follow the zero.
    if ($a820 =~ m/".+?"\,"D.*?0(\d\d\d\d).+/)
    {
        # Each new set is offset from the previous one by 180.
        $a830 = $1 + 180;
        $a840 = $a830 + 180;
        $a860 = $a840 + 180;
        $a870 = $a860 + 180;
        $a880 = $a870 + 180;
        $a890 = $a880 + 180;
        # Swap in each new address in turn and write the line to that set's file.
        $a820 =~ s/(".+?"\,"D.*?0)\d\d\d\d(.+)/$1$a830$2/;
        print OUT830 $a820;
        $a820 =~ s/(".+?"\,"D.*?0)\d\d\d\d(.+)/$1$a840$2/;
        print OUT840 $a820;
        $a820 =~ s/(".+?"\,"D.*?0)\d\d\d\d(.+)/$1$a860$2/;
        print OUT860 $a820;
        $a820 =~ s/(".+?"\,"D.*?0)\d\d\d\d(.+)/$1$a870$2/;
        print OUT870 $a820;
        $a820 =~ s/(".+?"\,"D.*?0)\d\d\d\d(.+)/$1$a880$2/;
        print OUT880 $a820;
        $a820 =~ s/(".+?"\,"D.*?0)\d\d\d\d(.+)/$1$a890$2/;
        print OUT890 $a820;
    }
}

So we iterate through the existing config file one line at a time, to see if it matches m/".+?"\,"D.*?0(\d\d\d\d).+/ ...anything between 2 double-quotes followed by a comma and another double-quote, a “D” followed by zero or more of any character (without being greedy), then a zero, then 4 digits (which we capture in a group), and anything else (greedy). Actually, that last “.+” isn’t needed. We can stop after we match and capture the 4 digits after the zero.
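To check that description against the three sample lines from earlier, something like the following should do. It's a small sketch rather than part of the original script: address_digits is a name I made up, and the pattern drops the trailing “.+” since it isn't needed for the match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 4 captured address digits, or undef if
# the line doesn't look like an address line.
sub address_digits {
    my ($line) = @_;
    return $line =~ m/".+?","D.*?0(\d\d\d\d)/ ? $1 : undef;
}

# The three sample .csv lines, one per address format.
my @samples = (
    '"COUNT_UCT","D05517",Word,1,RO,1000,,,,,,,,,,"Under Cycle Counter",',
    '"EVENT_WORD0_1","D05608.01",Boolean,1,RO,1000,,,,,,,,,,"",',
    '"BIRTH_PART","DSH05551.012",String,1,RO,1000,,,,,,,,,,"",',
);

for my $line (@samples) {
    my $digits = address_digits($line);
    print defined $digits ? "captured $digits\n" : "no match\n";
}
```

All three formats should give up their 4 digits (5517, 5608 and 5551), because the non-greedy “.*?” only swallows as much as it has to (nothing for “D”, the “SH” for “DSH”) before it reaches the zero.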

I then assign a variable to the 4 digits we matched, plus 180 (Perl is beautiful in that we can match 4 digits in text, say “+180” and it becomes an integer). I then assign 5 more variables, each equal to the previous variable +180. Then we take the initial string and perform 6 different substitutions (replacing the 4 digits with each of the 6 variables in turn) and outputting it to 6 different files.
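With hindsight, the six near-identical substitutions could be folded into a loop. The sketch below is a possible tidy-up and not the script I actually ran: offset_copies and its parameters are invented for the example, and it returns the rewritten lines instead of printing them to six filehandles. The /e modifier evaluates the replacement as Perl code, so the addition happens inside the substitution, and sprintf with %04d keeps the address 4 digits wide if a sum ever came out shorter.

```perl
use strict;
use warnings;

# Hypothetical helper: return $count copies of $line, with the 4-digit
# address raised by $step, 2*$step, 3*$step, and so on.
# Lines that don't match the address pattern yield an empty list.
sub offset_copies {
    my ($line, $step, $count) = @_;
    my @copies;
    for my $n (1 .. $count) {
        my $copy = $line;
        # $1 is everything up to and including the zero, $2 is the 4 digits.
        my $changed = $copy =~ s/(".+?","D.*?0)(\d\d\d\d)/$1 . sprintf("%04d", $2 + $step * $n)/e;
        push @copies, $copy if $changed;
    }
    return @copies;
}

# One of the sample lines, offset by 180 six times over.
my $sample = '"BIRTH_PART","DSH05551.012",String,1,RO,1000,,,,,,,,,,"",';
print "$_\n" for offset_copies($sample, 180, 6);
```

Each copy starts from the untouched original line, so a substitution never depends on what the previous one left behind.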

It’s no masterpiece of code, but it blew my co-worker’s mind that I was able to do “all that work” so quickly. 🙂
