Perl Module #2

Control Structures and Regular Expressions
K. Yue, copyright © 2001
Revised: September 1, 2001

1. Operators and Comparators

Perl's operators and comparators include nearly all of C's. The usual arithmetic operators and their precedence are the same as in C.
The following operators are valid in Perl but not in C.

Operator	Meaning
**	Exponentiation
**=	Exponentiation assignment
()	Null list
.	String concatenation
.=	String concatenation assignment
eq, ne, ge, gt, le, lt	String comparisons
x	String repetition
..	Range
-f, -x, -d, etc.	Unary file test operators. They test properties of a file, such as whether it is a plain file (-f), is executable (-x) or is a directory (-d).
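
The operators in the table can be tried out with a short script. The following is only a sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $power  = 2 ** 10;                    # exponentiation
my $greet  = "Hello" . ", " . "world";   # string concatenation
my $line   = "-" x 20;                   # string repetition: twenty dashes
my @digits = (0 .. 9);                   # range: the list 0, 1, ..., 9

# String comparison uses lt, eq, etc.; numeric comparison uses <, ==, etc.
my $as_strings = ("10" lt "9") ? "yes" : "no";   # "1" sorts before "9"
my $as_numbers = (10 < 9)      ? "yes" : "no";

# File test: -d is true if the name refers to a directory.
my $dot_is_dir = (-d ".") ? "yes" : "no";

print "$power\n$greet\n$line\n@digits\n";
print "10 lt 9 as strings: $as_strings; 10 < 9 as numbers: $as_numbers\n";
print "'.' is a directory: $dot_is_dir\n";
```

Note that "10" lt "9" is true because the comparison is made character by character, while 10 < 9 is false.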

2. Control Structures

The if-then-else statement in Perl is similar to that of C, except:

Braces {} are always required around the then part and the else part, even for a single statement.
The way the condition is evaluated is different.
Perl supports the keyword elsif.

When Perl needs the truth value of an expression, it does the following:

The expression is evaluated (and converted) to a string.
If the string is either "" or "0", return false. Otherwise, return true.

Example: Boolean values

""         # false
"0"        # false
"00"      # true
$n - $n  # "0": false
undef    # undef (undefined) is converted to "": false.
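
The rule can be checked with a small script. This is a sketch only; truth() is a hypothetical helper written for this illustration, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in a condition.
sub truth {
    my ($value) = @_;
    return $value ? "true" : "false";
}

my $n = 7;
my %result = (
    'empty string' => truth(""),
    'string 0'     => truth("0"),
    'string 00'    => truth("00"),
    'n - n'        => truth($n - $n),   # the number 0 becomes the string "0"
    'undef'        => truth(undef),
    'string 0.0'   => truth("0.0"),     # not "" and not "0", so true
);

foreach my $case (sort keys %result) {
    print "$case: $result{$case}\n";
}
```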

Perl supports the unless statement.

unless (some-condition)
{ action;
}

is equivalent to

if (! some-condition)
{ action;
}

Perl's while statement is similar to that of C, except for the following:

There may be an optional continue block.
There are three loop control statements: next, last and redo.
A label may precede the while statement so that loop control statements can refer to it.

If there is a continue block, it is executed at the end of each iteration, just before the conditional of the while statement is evaluated again.
The loop control statements are:
- next: skip the rest of the current iteration and go on to the next one (the continue block, if any, is still executed); same as continue in C.
- last: exit the loop immediately (the continue block is not executed); same as break in C.
- redo: restart the current iteration without re-evaluating the conditional (the continue block is not executed).
If no label is given, a loop control statement refers to the innermost enclosing loop.
Labels can be added to loops so that a loop control statement may refer to a loop that is not the innermost one.
The until statement is the same as the while statement, except that the test is reversed.

Example:

DAILY_WORK:
while (1)
{ while (! &time_up_for_the_day)
   { last if &boss_let_go_early;
      last DAILY_WORK if &win_lottery;
      &work_a_while;
      redo if &overtime_not_over;
   } continue
   { &play_a_game_secretly_to_relax;
   }
}
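
The example above cannot be run as it stands because its subroutines are not defined. The following self-contained sketch shows the same loop controls on made-up data; all variable names are chosen for illustration only.

```perl
use strict;
use warnings;

# next and last in a while loop with a continue block.
my @visited;
my $i = 0;
while ($i < 6) {
    next if $i % 2;      # odd value: skip the rest of the body
    last if $i == 4;     # leave the loop; the continue block is skipped
    push @visited, $i;
} continue {
    $i++;                # runs after the body and also after next
}
print "visited: @visited, stopped at $i\n";

# redo repeats the current iteration without moving to the next element.
my @attempts;
my $retried = 0;
foreach my $job ("a", "b") {
    push @attempts, $job;
    if ($job eq "a" && !$retried) {
        $retried = 1;
        redo;
    }
}
print "attempts: @attempts\n";

# A label lets next and last refer to the outer loop.
my @pairs;
OUTER:
foreach my $x (1 .. 3) {
    foreach my $y (1 .. 3) {
        next OUTER if $y > $x;
        last OUTER if $x == 3;
        push @pairs, "$x$y";
    }
}
print "pairs: @pairs\n";
```

The first loop visits 0 and 2 and stops with $i equal to 4, the second records the attempts a, a, b, and the third collects the pairs 11, 21 and 22.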

A simple Perl statement may be followed by an if, unless, while or until modifier.

Example:

&work unless &too_tired;
&work if &having_fun;
&work until &too_tired;
&work while &having_fun;

A bare block can be regarded as a loop that is executed once. Thus, loop control statements can be used inside a block.
The for loop is similar to that of C. The loop

for (stmt_1; stmt_2; stmt_3)
{ stmt_4;
}

is equivalent to:

stmt_1;
while (stmt_2)
{ stmt_4;
  stmt_3;
}

The foreach statement allows iteration through a list.

foreach $num (@num_list)
{ print "$num\n";
}

Note: The variable $num is set to each element of @num_list in turn.

If the scalar variable is omitted in the foreach statement, the value of the element is stored in the built-in variable $_.
In fact, $_ is the default argument for many operations.

Example:

foreach $num (@num_list)
{ print "$num\n";
}

is the same as either of the following:

foreach (@num_list)
{ print "$_\n";
}

foreach (@num_list)
{ print;
print "\n";
}

The do operator executes a block statement. When it is followed by a while or until modifier, the block is executed first and the conditional is tested at the end of each iteration.

Example:

do
{ &i_like_it_this_way;
  print "interesting stuff\n";
};

do
{ &work;
} until &tired;

Since || and && use short-circuit evaluation, they are commonly used in Perl programs for control flow.

Example:

if (&error) { die "ay-ya-ya\n"; }
die "ay-ya-ya\n" if &error;
&error && die "ay-ya-ya\n";

unless (&kiss_me) { &leave_me; }
&leave_me unless &kiss_me;
&kiss_me || &leave_me;
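
The short-circuit behaviour can be made visible by recording which subroutines are actually called. The subroutines below (ok, fail, side) are hypothetical and exist only for this sketch.

```perl
use strict;
use warnings;

my @calls;   # names of the subroutines called, in order

sub ok   { push @calls, "ok";   return 1; }
sub fail { push @calls, "fail"; return 0; }
sub side { push @calls, "side"; return 1; }

ok()   && side();   # left side is true, so the right side runs
fail() && side();   # left side is false, so the right side is skipped
ok()   || side();   # left side is true, so the right side is skipped
fail() || side();   # left side is false, so the right side runs

print "@calls\n";
```

The recorded calls are ok, side, fail, ok, fail, side: the right operand runs only when the left operand does not already decide the result.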

There is no switch statement in Perl. The switch statement can be simulated in many ways in Perl.

Exercise 1:

Consider the following C statement:

switch (ch)
{ case 'a': a_ct++; break;
   case 'e': e_ct++; break;
   case 'i': i_ct++; break;
   case 'o': o_ct++; break;
   case 'u': u_ct++; break;
   default : other_ct++; break;
}

Implement the statement in Perl in two different ways.

3. Regular Expressions

A regular expression is a pattern for string matching.

Example:

if (/good/)
{ print;
}
# if $_ contains the pattern "good", then $_ is printed.

Note:

A pattern is usually enclosed by two forward slashes (/).
The string that the pattern is matched against is $_, unless specified otherwise.

A single character pattern matches a single character.
Single character patterns are illustrated below:

.         any character except \n.
[abc]     a character class: matches 'a', 'b' or 'c'.
[a-zA-Z]  a character class: matches any letter.
[^abc]    negation of a character class: matches
          any character except 'a', 'b' or 'c'.

Backslashed characters can be used to match particular single characters.

\.     a period (.)
\n     newline
\r     carriage return
\t     tab
\f     formfeed
\d     a digit: [0-9]
\D     a non-digit: [^0-9]
\w     a word character (alphanumeric or '_'): [0-9a-zA-Z_]
\W     a non-word character: [^0-9a-zA-Z_]
\s     a white-space character: [ \t\f\r\n]
\S     a non-white-space character: [^ \t\f\r\n]
\060   the character with the given octal value: 060 is '0'

Most other backslashed characters match themselves.
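
The single character patterns can be tried against a few sample characters. The following is a sketch with made-up sample data; the labels are there only to make the output readable.

```perl
use strict;
use warnings;

# Label => sample character (illustrative data).
my %sample = (
    "digit 7"    => "7",
    "letter x"   => "x",
    "underscore" => "_",
    "space"      => " ",
    "tab"        => "\t",
    "period"     => ".",
    "hyphen"     => "-",
);

my %classes_of;
foreach my $label (sort keys %sample) {
    my $ch = $sample{$label};
    my @classes;
    push @classes, '\d' if $ch =~ /\d/;
    push @classes, '\w' if $ch =~ /\w/;
    push @classes, '\s' if $ch =~ /\s/;
    push @classes, '\.' if $ch =~ /\./;
    $classes_of{$label} = join(" ", @classes);
    print "$label: $classes_of{$label}\n";
}

# An octal escape matches the character with that value: 060 is '0'.
my $octal_matches = ("0" =~ /\060/) ? 1 : 0;
print "\\060 matches '0': $octal_matches\n";
```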

Exercise 2:

Find the single character pattern that matches the following description.

(a) all vowels,
(b) all non-vowels,
(c) all characters except lower case letters (other than 'a' to 'z'),
(d) the backspace character,
(e) carriage return or form feed,
(f) the character ^,
(g) any character in my name ("kwok-bun Yue").

Single character patterns are atoms. A regular expression enclosed in parentheses is also an atom.

Grouping Patterns

A sequence is a grouping pattern that contains a sequence of atoms.

Example

/ab1/         matches “ab1”
/a[aeiou]c/   matches “aac”, “aec”, “aic”, “aoc” and “auc”
/a.a/         matches an “a”, followed by any character
              and then another “a”.

An atom may be qualified by a quantifier (multiplier):

*         0 or more times.
+        1 or more times.
?      0 or 1 times (i.e., optional)
{5}    exactly 5 times
{3,}    3 or more times
{2,6}    2 to 6 times

The symbol | is used for alternation. Example:

/xy{2,4}/   matches “xyy”, “xyyy” and “xyyyy”
/x+y*x+/    matches one or more “x”, followed by 0
            or more “y”, followed by 1 or more “x”.
/ABC|ace/   matches “ABC” or “ace”
/[abc]{4}/ matches a string of 4 characters
                    of ‘a’, ‘b’ or ‘c’.

Pattern matching in Perl is leftmost and greedy: it finds the leftmost possible match, and each quantifier matches as many characters as possible.

Example:

Consider the string “abccccbaccccba”.

The pattern

/a.*ba/     matches the entire string,
            not “abccccba”.
/a.*ba*/    matches the entire string,
            with “.*” matching “bccccbacccc”.

To match as few characters as possible, put a '?' after the quantifier, as in *? or +?.
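
The difference between the greedy and the minimal quantifier can be seen on the string used above. This is a small sketch; the variable names are made up.

```perl
use strict;
use warnings;

my $text = "abccccbaccccba";

# Parentheses capture what the whole pattern matched.
my ($greedy)  = $text =~ /(a.*ba)/;    # .* takes as much as it can
my ($minimal) = $text =~ /(a.*?ba)/;   # .*? takes as little as it can

print "greedy:  $greedy\n";
print "minimal: $minimal\n";
```

The greedy pattern matches the entire string, while the minimal pattern stops at the first "ba" and matches "abccccba".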

Parentheses can be placed around sub-patterns. This causes the text matched by each sub-pattern to be memorized; what the sub-pattern matches does not change. The memorized text can be referred to by \1, \2, and so on, within the pattern.

Example:

/(.)a\1/    matches “aaa”, “bab”, “xax”, “5a5”,
            etc., but not “5a6”, etc.
/(.*)a\1/   matches “abaab”, “a”, “cidacid”, etc.
/([abc])x([de])y\2x\1/
            matches “axdydxa”, etc.
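
These patterns can be checked against a list of candidate strings. The following is a sketch; the candidate lists are illustrative.

```perl
use strict;
use warnings;

# \1 must repeat exactly what the first pair of parentheses matched.
my @candidates = ("aaa", "bab", "xax", "5a5", "5a6");
my @same_ends  = grep { /(.)a\1/ } @candidates;
print "(.)a\\1 matches: @same_ends\n";

# Two memorized sub-patterns, referred to in reverse order.
my @mirrored = grep { /([abc])x([de])y\2x\1/ } ("axdydxa", "axdyexa", "bxeyexb");
print "mirror pattern matches: @mirrored\n";
```

"5a6" is rejected by the first pattern, and "axdyexa" by the second, because the repeated character differs from the memorized one.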

Anchoring patterns force a pattern to line up with a specific part of the string.

^    matches the beginning of a string.
$    matches the end of a string.
\b   matches at a word boundary (i.e., between \w
     and \W, or between \w and the string’s start or end).
\B   matches anywhere that is not a word boundary.

Example:

/\bair\b/   matches “ air&”, “+air+”, “air”, etc.,
            but not “hair”, “airs”, etc.
/\bair\B/   matches “airs”, “+airing”, etc.,
            but not “air&”, “+air”, “air”, etc.
/^air/      matches “air”, “airs”, etc., but not “hair”.
/^air$/     matches “air” only.
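
The four anchored patterns can be compared side by side on one word list. This is a sketch; the word list is made up from the strings in the example.

```perl
use strict;
use warnings;

my @words = ("air", "airs", "hair", "+air+", "+airing");

my @whole_word = grep { /\bair\b/ } @words;   # "air" as a complete word
my @word_start = grep { /\bair\B/ } @words;   # "air" starting a longer word
my @at_start   = grep { /^air/ }    @words;   # string begins with "air"
my @exact      = grep { /^air$/ }   @words;   # string is exactly "air"

print "\\bair\\b: @whole_word\n";
print "\\bair\\B: @word_start\n";
print "^air:    @at_start\n";
print "^air\$:   @exact\n";
```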

The precedence of pattern operators, from highest to lowest, is as follows:

Parentheses (): highest
Multipliers: +, *, ?, {m,n}
Sequence and anchoring: abc, ^, $, \b, \B
Alternation: | (lowest)

Example:

/a|bc*/ is equivalent to /(a)|((b)(c*))/

Exercise 3:

Give the Perl pattern for each of the following:

(a) either “abcde” or “edcba”.
(b) at least two b's followed by at least seven c's.
(c) any number of *, followed by any number of $, followed by any number of +.
(d) a ^ at the beginning of a string, followed by three to four a's.
(e) any ten characters, including newline, just before the end of the string.
(f) any string in which the same word appears two or more times in a row. A word is defined as a sequence of alphanumeric characters or '_', enclosed by white spaces or the beginning or end of the string.

Variables are interpolated in regular expressions.
The special read-only variables $1, $2, ..., store the values of \1, \2, ..., after a successful pattern match with memorization.
In a list context, a pattern match returns the memorized values, so they can also be assigned to a list variable. $1, $2, etc., are set in this case as well.
The system variables $`, $& and $' store the string before the matched pattern, the matched pattern and the string after the matched pattern respectively.

Example:

if (/life is (.*)\./)
{ print $1;
}

if (@s = /love is (.*) and hatred is (.*)\./)
{ print "$s[0], not $s[1]";
}

# Print all lines in the file example.dat that contain "[n]",
# where n is given by the user.
print "What is the index? ";
$index = <STDIN>;
chomp($index);     # remove the trailing newline
open(IN, "example.dat") || die "cannot open example.dat: $!\n";
while (<IN>)
{ if (/\[$index\]/)
  { print;
  }
}
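
The match variables can be inspected without any input file. The following sketch uses made-up strings in place of $_ and of the lines of example.dat.

```perl
use strict;
use warnings;

# $1 and the before/match/after variables.
my $text = "I think life is short. Really";
my ($before, $matched, $after, $what);
if ($text =~ /life is (.*)\./) {
    ($before, $matched, $after, $what) = ($`, $&, $', $1);
}
print "before=[$before] matched=[$matched] after=[$after] \$1=[$what]\n";

# In a list context the match returns the memorized values.
my $saying = "love is patient and hatred is blind.";
my @s = $saying =~ /love is (.*) and hatred is (.*)\./;
print "$s[0], not $s[1]\n";

# Variable interpolation inside a pattern.
my $index = 3;
my @hits = grep { /\[$index\]/ } ("[1] one", "[3] three", "[13] thirteen");
print "lines with [$index]: @hits\n";
```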

Case can be ignored in a match by appending the switch i to the pattern. Example:

The pattern /yue/i matches "yue", "YUE", "yUe", etc.

Exercise 4:

Write a Perl program that reads the file "a.a" and prints all lines that contain all of the characters ‘a’, ‘c’, ‘e’ and ‘g’.

Matching Operators

Any non-alphanumeric, non-white-space character can be used as the delimiter of a pattern. However, if a character other than / is used, the m operator must be written explicitly.

Example: The following two pattern matches are the same.

/^\/usr\/bin\/perl/
m#^/usr/bin/perl#

By default, the pattern is matched to the predefined variable $_.

Example:

# Print all lines from the standard input file that contain the
# string "perl" somewhere in the line, case ignoring.
while (<STDIN>)
{ if (/perl/i)
{ print;
}
}

The operator =~ can be used to specify a pattern matching target other than $_.

Example: The program above can be rewritten as follows (though this is not the usual Perl style):

while ($line = <STDIN>)
{ if ($line =~ /perl/i)
{ print $line;
}
}
...
print "Do you want to quit? [y/n]";
if (<STDIN> =~ /y/i)
{ die "bye, dear.";
}

The operator !~ is equivalent to the negation of =~.

print "I love you." if $letter !~ /hate/;
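
The two operators split a list of strings into those that match and those that do not. The following sketch uses an illustrative list in place of the standard input.

```perl
use strict;
use warnings;

my @lines = ("Perl is fun", "I prefer C", "PERL rocks", "pearls");

my (@with, @without);
foreach my $line (@lines) {
    push @with,    $line if $line =~ /perl/i;   # contains "perl", any case
    push @without, $line if $line !~ /perl/i;   # does not contain it
}

print "with:    ", join("; ", @with), "\n";
print "without: ", join("; ", @without), "\n";
```

Every line ends up in exactly one of the two lists, since !~ is the negation of =~.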

Substitution and other common operators using regular expressions

The s operator substitutes a string for the text matched by a regular expression.
s/old-reg/new-string/switches replaces the first substring that matches the regular expression old-reg with the string new-string. To replace all matches, use the global switch g.

Example:

$_ = "I love you.";
s/love/hate/;
print;     # print out "I hate you."
$_ = "I love you and you love me.";
s/love/hate/;
print;     # print out "I hate you and you love me."
$_ = "I love you and you love me.";
s/love/hate/g;
print;     # print out "I hate you and you hate me."
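
The s operator also returns the number of substitutions made, a detail not covered above but sometimes handy for testing whether anything was changed. A small sketch, using =~ to apply s to named variables instead of $_:

```perl
use strict;
use warnings;

my $once = "I love you and you love me.";
my $n_once = ($once =~ s/love/hate/);     # replaces the first match only

my $all = "I love you and you love me.";
my $n_all = ($all =~ s/love/hate/g);      # replaces every match

my $none = "Nothing to change here.";
my $n_none = ($none =~ s/love/hate/g);    # no match: a false value is returned

print "$n_once: $once\n";
print "$n_all: $all\n";
print "changed nothing\n" unless $n_none;
```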

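The following are command line executions of Perl. The switch -e supplies the program text on the command line. The switch -n wraps the program in a loop that reads each line of the input files (or of the standard input) into $_. The second command renames the variable $i to $count in a Perl program. The single quotes keep a Unix shell from interpreting $ and \ in the program text.

$ perl -ne 's/love/hate/g; print;' love_letter.dat
$ perl -ne 's/\$i\b/\$count/g; print;' < ex1.pl > ex2.pl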

The split operator uses a regular expression as the delimiter to split a string. It returns a list of strings: the pieces of the original string that lie between the substrings matching the pattern.

Example:

$line = 'kwok-bun Yue,123456789,Computer Science';
($name, $ssnum, $major) = split /,/, $line;

Exercise 5:

Write a piece of Perl code that reads the file "some.file" and breaks down the contents into tokens. A token is a string of characters (other than white spaces) that is separated from other tokens by white spaces. The tokens should be stored in the variable @words.

The join operation is the reverse of the split operation. It does not use regular expressions. It glues together a list of strings by a specified glue string.

Example:

$glue = ":";
@list = ("12", "05","59");
print join($glue, @list); # print "12:05:59"
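
Since join reverses split, splitting a string and joining the pieces with the same delimiter gives back the original string. The following is a sketch of that round trip using the data from the examples above.

```perl
use strict;
use warnings;

my $line = 'kwok-bun Yue,123456789,Computer Science';

my @fields   = split /,/, $line;       # break the line at each comma
my $rejoined = join(",", @fields);     # glue the pieces back together

# The delimiter of split is a pattern: here, one or more white spaces.
my @tokens = split /\s+/, "one  two\tthree";

my $time = join(":", "12", "05", "59");

print scalar(@fields), " fields: @fields\n";
print "round trip ok\n" if $rejoined eq $line;
print "tokens: @tokens\n";
print "$time\n";
```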

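The transliteration operator tr (or y) uses a syntax similar to that of s, but it does not use regular expressions. It replaces each character in the first list with the corresponding character in the second list.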

Example:

The first Perl command swaps x and y. The second command changes all lower case letters to upper case letters.

$ perl -ne 'tr/xy/yx/; print;' < e1.dat > e2.dat
$ perl -ne 'tr/a-z/A-Z/; print;' < emp1.dat > emp2.dat
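
The same transliterations can be applied to a variable inside a program with =~. The sketch below uses a made-up string; it also uses the fact, not covered above, that tr returns the number of characters it matched, which gives a simple way to count characters.

```perl
use strict;
use warnings;

my $text = "xylophone by the bay";

# Copy the string first, then transliterate the copy.
(my $swapped = $text) =~ tr/xy/yx/;     # swap x and y
(my $upper   = $text) =~ tr/a-z/A-Z/;   # lower case to upper case

# With an empty second list the string is unchanged; tr just counts.
my $vowels = ($text =~ tr/aeiou//);

print "$swapped\n$upper\n";
print "$vowels vowels in '$text'\n";
```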

Exercise 6:

Write a Perl program that removes all comments from an Ada program, "ex1.ada". In Ada, anything after -- in a line is discarded by the compiler. Print the Ada program without comments to the standard output file.

4. Suggested Solutions to Classwork Exercises

(1) For example,

{ $ch eq 'a' && ($a_ct++, last);
   $ch eq 'e' && ($e_ct++, last);
   $ch eq 'i' && ($i_ct++, last);
   $ch eq 'o' && ($o_ct++, last);
   $ch eq 'u' && ($u_ct++, last);
   $other_ct++;
}

# or

{ ($a_ct++, last) if ($ch eq 'a');
   ($e_ct++, last) if ($ch eq 'e');
   ($i_ct++, last) if ($ch eq 'i');
   ($o_ct++, last) if ($ch eq 'o');
   ($u_ct++, last) if ($ch eq 'u');
   $other_ct++;
}

# or

S1:
{ $ch eq 'a' && do {$a_ct++; last S1;};
   $ch eq 'e' && do {$e_ct++; last S1;};
   $ch eq 'i' && do {$i_ct++; last S1;};
   $ch eq 'o' && do {$o_ct++; last S1;};
   $ch eq 'u' && do {$u_ct++; last S1;};
   $other_ct++;
}

(2)

(a) [aeiouAEIOU]
(b) [^aeiouAEIOU]
(c) [^a-z]
(d) \010
(e) [\r\f]
(f) \^
(g) [kwo\-bunYe]

(3)

(a) /abcde|edcba/
(b) /b{2,}c{7,}/
(c) /\**\$*\+*/
(d) /^\^a{3,4}/
(e) /(.|\n){10}$/
(f) /(^|\s)(\w+)(\s+\2)+(\s|$)/

(4) For example,

#!/usr/bin/perl
open(IN, "a.a");
while (<IN>)
{ if ((/a/) && (/c/) && (/e/) && (/g/))
{ print;
}
}

(5) For example,

# Decompose a file into tokens with
# white spaces as delimiters.
open(IN, "some.file");
while (<IN>)
{ push(@words, split(' '));   # split(' ') splits $_ on white space and ignores leading white space
}

(6) For example,

#!/usr/bin/perl
# This does not take care of the problem of
# -- inside a string.
open(IN, "ex1.ada");
while (<IN>)
{ while (/(.*)--/)
   { $_ = $1 . "\n";
   }
   print;
}

or simply:

# This does not take care of the problem of
# -- inside a string.
perl -ne 's/--.*//; print;' ex1.ada