Mike Gossland's Perl Tutorial Course for Windows




Chapter 2: Matching and Substitution

Advanced Matching Ideas

Now you have seen the basics of pattern matching and substitution. If you are not too overwhelmed by these regular expressions, you can already see why Perl is so well liked by its proponents. Consider doing these same jobs with a language that lacks regular expressions!

Regular expressions are so useful that most modern languages now offer them. Python has them built in, JavaScript has added regular expressions in newer versions, and Visual Basic has access to the RegExp object when you add Microsoft VBScript Regular Expressions to your project as a reference.

While we have made a decent start on matching and substitution, you are not quite ready to go out into the real world. I've been holding a few more advanced ideas back from you while you got your feet wet.

Greediness in Matching

The match repetition characters, * and +, are as greedy as they can be when trying to make a match. Say you have a string like "The word twice is in this sentence twice", and you want to match all the characters from the beginning of the line to the end of the word twice. So you use the matching expression

m/^.*twice/

Which of the two instances of the word twice would it match? The answer is the second: the match would run all the way to the end of the sentence. This is because, by default, the repetition characters are greedy. They'll take all they can get and extend the match as far as possible within the searched string.

If you want to curb this behaviour and make them match on the first occurrence instead, then you must use the greed-inhibiting character, "?". Putting a "?" right after a repetition character means "don't be greedy". So in this case we'd change our matching statement to

m/^.*?twice/

and it would just match on the part, "The word twice". In practice, managing the greediness of these matching characters is very useful indeed.
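Here is a small sketch of the difference, using the sentence from above. The variable names are my own, and the parentheses are only there so we can capture and print what each pattern matched.

```perl
use strict;
use warnings;

my $sentence = "The word twice is in this sentence twice";

# Greedy: .* grabs as much as it can, so the match ends at the LAST "twice".
my ($greedy) = $sentence =~ m/(^.*twice)/;

# Non-greedy: .*? grabs as little as it can, so the match ends at the FIRST "twice".
my ($lazy) = $sentence =~ m/(^.*?twice)/;

print "Greedy:     '$greedy'\n";
print "Non-greedy: '$lazy'\n";
```

The first line printed shows the whole sentence, and the second shows only "The word twice".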

Case-Insensitive Matching

What if I'm looking for a particular character string, and I don't care whether it is in upper or lower case? Say I'm looking for "student" or "Student" or "STUDENT". Do I have to do this?

/(s|S)tudent/ or even worse:

/(s|S)(t|T)(u|U)(d|D)(e|E)(n|N)(t|T)/

By now, you might start to think this is too awkward for Perl. Indeed, there's a special option that allows you to do case insensitive matches.

/student/i

is all it takes. "i" means "insensitive".
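As a quick illustration, here is a sketch that tries the same pattern against a few made-up words. The word list and the @found array are just assumptions for the example.

```perl
use strict;
use warnings;

my @found;

# Try one pattern against several spellings; /i ignores the case of the letters.
for my $word ("student", "Student", "STUDENT", "teacher") {
    if ($word =~ /student/i) {
        push @found, $word;
        print "'$word' matches\n";
    }
}
```

All three spellings of "student" match, and "teacher" does not.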

Matching and Substitution using Variables

In all the examples I've shown so far, the matching pattern has been hard-coded in, and could not be altered during a program. However, there's no reason why you can't assign a matching pattern to a variable, and then use the variable instead. For example, if you were looking for cats or dogs you could do this:

$text_string = "I've had cats, dogs and birds";
for $animal ( "cat", "dog", "bird" ) {
	if ( $text_string =~ /$animal/ ) {
		print "Found a $animal\n";
	}
}

You can use variables for the pattern match quite freely, but keep in mind they really do act as search patterns, not as literal strings. For instance, if your variable was

$match = "yes?";

then

$question =~ /$match/

would specify a search like /yes?/, and you'd be looking for "ye" followed by 0 or 1 occurrences of "s", not the literal string "yes?" with a question mark.
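If you do want the variable treated as literal text, Perl lets you wrap it in \Q and \E inside the pattern, which escapes any metacharacters in between. This goes a little beyond what we've covered so far, so take it as a sketch; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $match    = "yes?";
my $question = "Did you say yes? I think so.";
my $short    = "ye";

# Interpolated as a pattern: "?" makes the "s" optional, so plain "ye" matches.
my $as_pattern = ($short =~ /$match/) ? 1 : 0;

# \Q...\E escapes the "?", so only the literal text "yes?" can match.
my $literal_short = ($short    =~ /\Q$match\E/) ? 1 : 0;
my $literal_full  = ($question =~ /\Q$match\E/) ? 1 : 0;

print "Pattern vs 'ye':      $as_pattern\n";
print "Literal vs 'ye':      $literal_short\n";
print "Literal vs question:  $literal_full\n";
```

Only the first and last tests succeed: "ye" satisfies the pattern /yes?/, but it does not contain the literal text "yes?".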

Global Matching and Substitution

All of the pattern matching and substitution operations we've done so far have a weakness you'd soon discover in practice. They only operate on the first match found.

To allow you to work your way through the entire text to be searched, Perl offers the "global" option for matching and substitution, specified by putting a "g" behind the final /. For example:

$story =~ m/Harry Potter/g;

would find every instance of "Harry Potter" in the story (all at once in list context, or one per call in scalar context, such as in a while loop), and:

$story =~ s/Harry Potter/Larry Wall/g;

would turn the hero from Harry Potter into the hero from Perl in the whole of the Sorcerer's Stone. The /g option can be used with any matching operation.
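To see /g at work, here is a sketch with a made-up two-sentence story. It shows the list-context form, the while-loop form, and the count that s///g hands back; the variable names are my own.

```perl
use strict;
use warnings;

my $story = "Harry Potter met Ron. Later, Harry Potter met Hermione.";

# In list context, m//g returns every match at once.
my @all = $story =~ m/Harry Potter/g;
print "Found ", scalar(@all), " matches\n";

# In scalar context, each pass finds the next match;
# pos() reports where that match ended.
my @ends;
while ($story =~ m/Harry Potter/g) {
    push @ends, pos($story);
}
print "Matches ended at positions: @ends\n";

# s///g replaces every occurrence and returns how many it replaced.
my $count = ($story =~ s/Harry Potter/Larry Wall/g);
print "Replaced $count: $story\n";
```

Without the g on the substitution, only the first "Harry Potter" would be replaced.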

Matching across newlines

If you were to try these pattern substitutions on large blocks of text that include multiple lines, you'd be in for confusion and disappointment when a number of the matches didn't work. This is because of the behaviour of the period metacharacter "." By default, it matches any character except the newline character. Therefore the far-ranging matching operation ".*" will only match up to the first newline character encountered.

But what if you need to match something that starts on one line and spans several lines before it stops, like a long HTML comment or table? The answer is to use the /s modifier, which mnemonically means to "treat string as a single line". This changes the behaviour of "." to match newline characters as well. For example:

$html = "<!-- Start of section -->
<p>Here is some content</p>
<!-- End of section -->";

In order to match from the start of the opening comment to the end of the closing comment, we add the /s modifier like this:

$html =~ s/<!-- Start.*End of section -->//s;

Without the /s added at the end of the pattern, it wouldn't match at all.
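You can check this for yourself with a sketch like the following, which runs the same substitution on two copies of the string, once without /s and once with it. The copy variables are names I've made up for the comparison.

```perl
use strict;
use warnings;

my $html = "<!-- Start of section -->\n<p>Here is some content</p>\n<!-- End of section -->";

# Without /s, "." cannot cross a newline, so nothing is replaced.
(my $without_s = $html) =~ s/<!-- Start.*End of section -->//;

# With /s, "." also matches newlines, so the whole block is removed.
(my $with_s = $html) =~ s/<!-- Start.*End of section -->//s;

print "Without /s: ", ($without_s eq $html ? "unchanged" : "changed"), "\n";
print "With /s:    ", ($with_s eq "" ? "everything removed" : "something left"), "\n";
```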

Different Delimiters

Another of Perl's strengths is that the language will help out when things get ugly for you. You can use different delimiters in the matching expression when you have a need for it.

For example, file paths typically contain many /'s. (Even on DOS and Windows machines, Perl accepts / as the path separator, so you don't have to use \'s.)

If you are matching on something that contains a "/" such as a path, you could be looking at something ugly. Let's say you wanted to change a file path of a text file in a particular directory, c:/web/cgi-bin/, and move it to the d: drive. You'd have to escape the / characters with \ and use \/ for the path separators. (The dot in ".txt" needs a backslash too, or it would match any character.) So, you'd have to write:

s/c:(\/web\/cgi-bin\/.*?\.txt)/d:$1/

Instead of this mess, you can change the delimiter character to anything you'd like, as long as the same character is used in all three positions. Here we use the # character (also known as the pound sign, number sign, hash, or the octothorp) - like this:

s#c:(/web/cgi-bin/.*?\.txt)#d:$1#

There is another useful variation on this idea. You can put a pair of bracketing delimiters, such as { and }, around each section. This makes it easy to split up your matching and substitution expressions to keep things neater.

s{c:(/web/cgi-bin/.*?\.txt)}{d:$1}

You don't even need to use the same pair of brackets for both parts, and you can put the two parts on separate lines, so you could write:

s{c:(/web/cgi-bin/.*?\.txt)}
  (d:$1)
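To convince yourself that these forms are interchangeable, here is a sketch that applies three of them to the same path. The file name notes.txt is hypothetical, and the dot before "txt" is escaped so that it only matches a real dot.

```perl
use strict;
use warnings;

my $path = "c:/web/cgi-bin/notes.txt";   # hypothetical file name

# The same substitution written with three different delimiter styles.
(my $slashes = $path) =~ s/c:(\/web\/cgi-bin\/.*?\.txt)/d:$1/;
(my $hashes  = $path) =~ s#c:(/web/cgi-bin/.*?\.txt)#d:$1#;
(my $braces  = $path) =~ s{c:(/web/cgi-bin/.*?\.txt)}{d:$1};

print "$slashes\n$hashes\n$braces\n";
```

All three lines of output are the same path, now on the d: drive.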

Negative Matching

In some cases you are more interested in whether a pattern does not match a string than in whether it does. In this case you could write (with the inner parentheses, because ! binds more tightly than =~)

if ( ! ( $string =~ m/search text/ ) ) ...

but as usual, Perl makes it easier for you and offers you more than one way to do it. In this case, there's the "negative" binding operator, !~, so you could write this:

if ( $string !~ m/search text/ ) ...

So if you see this negative operator in Perl code in future, you'll know what it means.
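Here is a short sketch comparing the two spellings on a made-up string. Note that ! binds more tightly than =~, so the explicitly negated form needs its own parentheses around the match.

```perl
use strict;
use warnings;

my $string = "nothing to see here";   # made-up sample text

# The negative binding operator: true when the pattern does NOT match.
my $not_bound = ($string !~ m/search text/) ? 1 : 0;

# The same test written as an explicit negation of a normal match.
my $negated = (!($string =~ m/search text/)) ? 1 : 0;

print "Using !~:       $not_bound\n";
print "Using !( =~ ):  $negated\n";
```

Both tests come out true here, because "search text" does not appear in the string.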

Default Matching

Matching is such a common operation in Perl that you can dispense with the "m" at the front of it, as long as you use / as the delimiter. Everybody, including perl itself, knows what you mean if it's left out. So

m/hello/

is completely equivalent to

/hello/

And in fact, if you leave out the entire variable and binding operator, then perl will assume you mean to bind to the default variable, $_, mentioned before.

So the statement

/hello/

is completely equivalent to

$_ =~ m/hello/;

Isn't that cool? This is very neat when you are searching through files, because as you read through a file, each line of the file is automatically put into the default variable for you. More on this in a later section on reading from files.
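Until we get to files, you can try the same idea with an ordinary list, since a for loop without a loop variable also puts each item into $_. This is only a sketch, and the sample lines are made up.

```perl
use strict;
use warnings;

my @lines = ("hello world", "goodbye", "say hello again");   # made-up sample lines
my @hits;

for (@lines) {          # each element is placed in $_ in turn
    if (/hello/) {      # same as: $_ =~ m/hello/
        push @hits, $_;
        print "Matched: $_\n";
    }
}
```

Two of the three lines contain "hello", so two lines are printed.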

Conclusion

This concludes the section on regular expression pattern matching and substitution. We've spent a lot of time on it because it is a very important concept and operation in the world of Perl programming. I urge you to look over the sections on pattern matching and substitution in the online documentation. You might find it quite daunting at first, but once you get the hang of how the documentation is written, you will be able to pick up ideas for more advanced matching and substitution operations.

In the next lesson, we'll put some of this matching and substitution knowledge to work as we start to manipulate files. See you next time!