Regular Expressions

Pattern Matching

Pattern matching is a powerful computational tool. It can make programs shorter and easier to understand due to its declarative nature. Some programming languages use some form of pattern matching as their primary means of expressing programs; examples include Snobol4 and Prolog.

There are many ways to do pattern matching, and many domains in which to use it. Snobol4 uses a very expressive pattern matching language over strings; Prolog uses a fairly simple pattern matching language over relations (trees). In both Snobol4 and Prolog, pattern matching is expensive, because it can imply backtracking; both languages provide imperative mechanisms for the programmer to override the natural declarative nature of the pattern matching so as to avoid backtracking.

It would be nice if there were a model of pattern matching that would allow us to use a pattern matching language that was as powerful as possible and yet not so powerful as to overcomplicate the mechanism and lead to inefficiency. Consider Unix file globbing patterns: they seem to be simple, but are they? Do they ever imply backtracking? If not, are they as powerful as they could be?

In fact, Unix file globbing patterns form an ad hoc pattern matching language, which, while efficient (non-backtracking), is not as powerful as it could be.

We can find a good model for pattern matching in the field of automata theory.

Machines for Pattern Matching

A pattern can be regarded as a description of a machine which takes a string of symbols as input and which either accepts or rejects that string of symbols. If we consider all possible input strings, we see that the machine defines a language, which it parses or recognizes.

This raises several interesting questions:

A whole theory of computability and computation exists, based on a classification of pattern matching machines. These machines are models of the basic kinds of computers.

Interpreters

It's important, when considering pattern matching, to remember that a modern computer is actually a simple machine that interprets a program constructed from operations that are primitives for the machine. Such an interpreter can also be simulated by another interpreter. This is how modern computer systems actually work: the hardware is an interpreter for a very simple language known as microcode; an interpreter for what we usually call machine language is written in microcode (so when we run machine language programs we're already running a software interpreter); an operating system defines a more powerful language on top of machine language by extending it with system calls; and programming languages are built as interpreters on top of the operating system.

Sometimes programming languages are implemented by compilation or translation in an effort to gain efficiency by skipping a level of interpretation, but usually the levels of interpretation continue up to higher levels: a Lisp interpreter is written in C, for example; an object-oriented language like CLOS is written in Lisp; a CLOS programmer designs an interpreter for an application-oriented language in CLOS; if the application is extensible, a user of the appliation may design yet another interpreter in that language.

Each of these interpreters, except for the bottom-most layer, is a software machine. Pattern matchers are likewise software machines.

The Chomsky Hierarchy

Linguist Noam Chomsky defined a hierarchy of languages, in terms of complexity. This four-level hierarchy, called the Chomsky hierarchy, corresponds precisely to four classes of machines. Each higher level in the hierarchy incorporates the lower levels: that is, anything that can be computed by a machine at the lowest level can also be computed by a machine at the next highest level.

How do we prove this? We prove it by constructing an interpreter for the lower level machine on the higher level one: if we can construct such an interpreter, then clearly the higher level machine is capable of computing anything the lower level machine can compute: it can simply compute it on the interpreter!

The levels of the Chomsky hierarchy are as follows:

MACHINE CLASS            LANGUAGE CLASS
-------------            --------------
finite automata          regular languages
pushdown automata        context-free languages
linear bounded automata  context-sensitive languages
Turing machines          recursively enumerable languages
The simplest kind of machine is the finite automaton, while the most complex machine is the Turing machine.

Each of these language classes are of interest to computer programmers:

LANGUAGE CLASS                    EXAMPLE
--------------                    -------
regular languages                 pattern matching languages
context-free languages            simpler programming languages
context-sensitive languages       most programming languages
recursively enumerable languages  natural languages

Languages are usually defined by grammars, so there are also classes of grammars that correspond to these language classes:

LANGUAGE CLASS                    GRAMMAR CLASS
--------------                    -------------
regular languages                 regular expressions
context-free languages            Backus-Naur forms
context-sensitive languages       Van Wijngaarden (two-level) grammars
recursively enumerable languages  ???

Parsing technology for programming language implementation is based on algorithms that are known to be able to parse particular classes of languages based on particular grammars. Regular expressions are used in compilers to implement the lexical level of language definition. Because of the simplicity and efficiency of the finite automata that interpret regular expressions, they are also used for pattern matching in tools and other languages.

Finite Automata

A finite automaton (or finite state machine) consists of:

To run the automaton on a given input string, we simply iteratively execute the state transition function, calling it with the current state (starting with the initial state) and the current input symbol. Each time we call the function, we advance to the next input symbol, and use that with the new state returned by the function as the input to the next iteration. We terminate when we've exhausted the input string. If the process terminates with the function returning a state that's a member of the set of accepting states, we have recognized the input string; otherwise, we have rejected it.

Finite automata can be modelled with a state transition diagram, a digraph with the states represented by the nodes and the transitions represented by the arcs (labelled with symbols). For any regular language, there are an infinite number of possible finite automata that can be constructed to recognize that language.

Every finite automaton can be represented by a regular expression; regular expressions are the grammars of the regular languages. A regular expression (regexp) is defined as follows:

It's very easy to write an efficient interpreter for a given finite automaton in any programming language. All that's required is to represent that state transition function as a function or an array. It's also easy to automate the construction of efficient finite automata from regular expressions.

Regular Expressions in Unix

Regular expressions are everywhere in Unix. Many applications support regular expressions for searching text: text editors (emacs, vi); pagers (less, more); file searching tools (grep, egrep, sed); programming languages (Tcl, Perl, C, Awk, Lex, etc).

Matching Regular Expressions

Finite automata simply accept or reject a given string. When used as a tool, however, you can do more with regular expressions: you can print lines in a file that contain a match for a regular expression (the grep-like tools do this); you can move the cursor in a text editor or pager to the next occurrence of a matching regular expression; you can substitute new text for matching occurrences, allowing you to change or delete based on regular expressions; and you can extract subparts of a match.

Notation

Any number of notations can be devised for regular expressions. In Unix, several notations are used in different tools; they're all very similar, but just different enough that you need to be careful when using any one of them (this annoyance is due to two facts: 1. regular expression use developed slowly and in different parts of the country, leading to some divergence, and 2. much Unix software is or orignates as free software, and there's no way to require an author to use an existing notation that he may never have heard or that he thinks is stupid).

Real regular expression notations differ from the simple notation above in order to provide shorthands that are easier to write. I will describe Tcl's regular expression notation (which is that of egrep) and explain the shorthands in terms of the basic notation above.

The Alphabet

The input alphabet is ASCII. Since the regular expression is also written in ASCII, we find that we have the need to quote regexp metacharacters. Any of the special metacharacters mentioned below can be quoted with a backslash and thus matches just that character in the alphabet. Here are some regexps from the alphabet:
A
B
7
&
@
\*

Concatenation of Regexps

Regexps are concatenated by juxtaposition: here is the regexp A and the regexp B concatenated:
AB

Sum of Regexps

Regexps are summed with the infix | operator; here are some sums of regexps:
A|B
0|1|2|3|4|5|6|7|8|9

Iteration of Regexps

Regexps are iterated with the postfix Kleene star *; here are some iterated regexps:
A*
9*
\**
Iteration means zero or more occurrences of the preceding regexp.

Parentheses

Parentheses are used to override the precedence of the various regexp metacharacters:
(ABC)
(ABC)*
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

Character Classes

A character class is a shorthand notation for a sum of single-character regexps. It is written as a string of characters within square brackets. The characters form a set, so the order of the characters and any repetition is immaterial. These three regexps are equivalent:
0|1|2|3|4|5|6|7|8|9
[0123456789]
[11987654321111011]
There is a special shorthand for use in a character class: a range of characters can be expressed by separating the lower and upper inclusive bounds with a hyphen; in this case, order matters because ASCII collating order is used to resolve the range. These three regexps are equivalent:
0|1|2|3|4|5|6|7|8|9
[0123456789]
[0-9]

This shorthand makes the hyphen a metacharacter within a character class. How do we get a literal hyphen in a character class? Like this (all three are equivalent):

[-A-Z]
[A-Z-]
A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z|-

There is also a notation for the complement of a character class: if the first character of the character class is a carat (^), the character class is complemented. So [^a-zA-Z] is the regexp that matches any non-alphabetic ASCII character (including all control characters). To use a literal carat in a character class, simply place it anywhere except in the first position.

To include a literal ] in the sequence, make it the first character (following a possible ^).

Optional Regexp

The postfix question mark (?) operator makes the preceding regexp optional (i.e, it can occur zero or one times). This shorthand is equivalent to a common use of summing. These two regexps are equivalent:
a(foo)?
afoo|a

One or More Occurrences

The postfix plus operator (+) is just like the * operator except that it specifies one or more repetitions of the previous regexp. These two regexps are equivalent:
[0-9]+
[0-9][0-9]*

Any Character

The dot (.) metacharacter matches any single ASCII character. It's a very important shorthand for the sum of all the ASCII characters.

Beginning and End of Line

A regular expression that starts with a carat (^) will only match if it occurs at the beginning of the input string. This is the behavior we described for the basic regular expressions above. Without a carat, Tcl regular expressions can actually begin anywhere in the input string: we're not testing whether or not the entire string matches, but whether or not a match occurs as a substring of the input. It turns out that this is more useful, so it's the default. If the anchored approach were the default, then most Tcl regexps would have to start with .*. Carat only has this special interpretation if it's the first character of the regexp.

A regular expression that ends with dollar ($) only matches at the end of the input string. Combining these two anchors allows us to constrain a regexp to match the entire input string, thus simulating the more strict basic regexps described earlier.

Specifying Submatches

In order to support the easy extraction of matching subexpressions, parentheses are used. Any parenthesized subexpression can be extracted from the input string. Since parentheses can be put around any part of the regexp without changing its meaning, this causes no ambiguity.

Regular Expressions in Tcl

Since a regular expression match may occur in several positions in a string, we need a way to decide which one is the match. In common with standard Unix practice, Tcl's regular expression interpreter always chooses the leftmost, longest possible match. For example, the regexp a*b* when matched against the string aabaaabb, matches aab, the first three letters of the input string.

The regexp Command

The regsub Command


Keith Waclena
The University of Chicago Library
This page last updated: Tue Aug 16 14:20:06 CDT 1994
This page was generated from Extended HTML by xhtml.