Tcl's pattern matching facilities test whether a given string matches
a specified pattern. Patterns are described using a syntax
known as regular expressions. For example, the pattern
expression consisting of a single period matches any character. The
pattern a..a matches any four-character string whose first and
last characters are both a.
The regexp command takes a pattern, a string, and an optional match
variable. It tests whether the string matches the pattern, returns 1
if there is a match and 0 otherwise, and sets the match variable to
the part of the string that matched the pattern:
% set something candelabra
candelabra
% regexp a..a $something match
1
% set match
abra
Patterns can also contain subpatterns (delimited by
parentheses) and denote repetition. A star denotes zero or more
occurrences of a pattern, so a(.*)a matches any string of at least
two characters that begins and ends with the character a. Whatever
matched the subpattern between the a's is put into the first
subvariable, if one is supplied after the match variable:
% set something candelabra
candelabra
% regexp a(.*)a $something match
1
% set match
andelabra
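The session above inspects only match. As a minimal sketch of reading the subpattern as well, the script below supplies a second variable, here called between (a name chosen purely for illustration), and braces the pattern so that the Tcl parser passes it to regexp untouched:

```tcl
set something candelabra

# match receives the whole matched substring,
# between receives whatever the subpattern (.*) matched
if {[regexp {a(.*)a} $something match between]} {
    puts "match:   $match"
    puts "between: $between"
}
# prints:
# match:   andelabra
# between: ndelabr
```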
Note that Tcl regexp by default behaves in a greedy fashion.
There are three alternative substrings of "candelabra" that match the
regexp a(.*)a: "andelabra", "andela", and "abra". Tcl chose the
longest substring. This is very painful when trying to pull HTML
pages apart:
% set simple_case "Normal folks might say <i>et cetera</i>"
Normal folks might say <i>et cetera</i>
% regexp {<i>(.+)</i>} $simple_case match italicized_phrase
1
% set italicized_phrase
et cetera
% set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"
Pedants say <i>sui generis</i> and <i>ipso facto</i>
% regexp {<i>(.+)</i>} $some_html match italicized_phrase
1
% set italicized_phrase
sui generis</i> and <i>ipso facto
What you want is a non-greedy regexp, a standard feature of Perl and
an option in Tcl 8.1 and later versions.
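As a sketch of what the non-greedy form looks like, assuming Tcl 8.1 or later for the +? quantifier and a somewhat newer release for the -all and -inline flags, the same HTML string can be taken apart like this:

```tcl
set some_html "Pedants say <i>sui generis</i> and <i>ipso facto</i>"

# .+? is the non-greedy form of .+ : it stops at the first </i> it can reach
regexp {<i>(.+?)</i>} $some_html match italicized_phrase
puts $italicized_phrase
# prints: sui generis

# -all -inline returns every match as a flat list of
# (whole match, subpattern) pairs instead of setting variables
foreach {whole phrase} [regexp -all -inline {<i>(.+?)</i>} $some_html] {
    puts $phrase
}
# prints:
# sui generis
# ipso facto
```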
Lisp systems in the 1970s included elegant ways of returning all possibilities when there were multiple matches for an expression. Java libraries, Perl, and Tcl demonstrate the progress of the field of computer science by ignoring these superior systems of decades past.
regexp {last_visit=([^;]+)} $cookie match last_visit
Note the square brackets inside the regexp. The Tcl interpreter isn't
trying to call a procedure because the entire regexp has been grouped
with braces rather than double quotes. Square brackets denote a range
of acceptable characters:
[A-Z] would match any uppercase letter
[ABC] would match any of the first three characters in the alphabet (uppercase only)
[^ABC] would match any character other than the first three uppercase characters in the alphabet, i.e., the ^ reverses the sense of the brackets
The plus sign after the [^;] says "one or more characters that meet
the preceding spec", i.e., "one or more characters that aren't a
semicolon". It is distinguished from * in that there must be at least
one character for a match.
If successful, the regexp command above will set the match variable
with the complete matching string, starting from "last_visit=". Our
code doesn't make any use of this variable but only looks at the
subvar last_visit that would also have been set.
Pages that use this cookie expect an integer, and this code failed in one case where a user edited his cookies file and corrupted it so that his browser was sending several thousand bytes of garbage after the "last_visit=". A better approach might have been to limit the match to digits:
regexp {last_visit=([0-9]+)} $cookie match last_visit
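A minimal sketch of how the digits-only pattern behaves follows. The two cookie strings are invented for illustration and are not taken from any real browser:

```tcl
# hypothetical cookie strings, for illustration only
set good_cookie "user_id=8432; last_visit=1078328411; theme=plain"
set bad_cookie  "user_id=8432; last_visit=garbage-garbage-garbage"

foreach cookie [list $good_cookie $bad_cookie] {
    # regexp returns 0 when no digit follows "last_visit=",
    # so the corrupted value never reaches the rest of the page
    if {[regexp {last_visit=([0-9]+)} $cookie match last_visit]} {
        puts "last visit: $last_visit"
    } else {
        puts "no usable last_visit value"
    }
}
# prints:
# last visit: 1078328411
# no usable last_visit value
```

Note that a value such as "123garbage" would still match on its leading digits, so a caller that needs a strict check would have to test what follows the number as well.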
regexp allows multiple pattern variables.
The pattern variables after the first are set to the substrings that
matched the subpatterns. Here is an example of matching a credit card
expiration date entered by a user:
% set date_typed_by_user "06/02"
06/02
% regexp {([0-9][0-9])/([0-9][0-9])} $date_typed_by_user match month year
1
% set month
06
% set year
02
%
Each pair of parentheses corresponds to a subpattern variable.
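Because regexp succeeds when the pattern matches anywhere in the string, the pattern above would also accept input such as "1106/0299". One possible way to insist on exactly MM/YY is to anchor the pattern with ^ and $. The proc name parse_expiration below is hypothetical:

```tcl
# hypothetical helper: returns a two-element list {month year},
# or an empty list if the input is not exactly two digits, a slash, two digits
proc parse_expiration {date_typed_by_user} {
    # ^ and $ force the pattern to cover the whole string
    if {![regexp {^([0-9][0-9])/([0-9][0-9])$} $date_typed_by_user match month year]} {
        return [list]
    }
    return [list $month $year]
}

puts [parse_expiration "06/02"]       ;# prints: 06 02
puts [parse_expiration "1106/0299"]   ;# prints an empty line
```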
regexp includes optional flags as well as multiple match variables:
regexp [flags] pattern data matched_result var1 var2 ...
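The various flags are
-nocase makes the match case-insensitive
-indices stores in the match variables the start and end indices of each matched substring, rather than the substring itself
If your pattern begins with a -, put a -- flag at the end of your flags.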
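The flags can be tried out with a short script such as the following sketch. The sample strings are arbitrary:

```tcl
# -nocase: ignore case differences while matching
puts [regexp -nocase {abra} "CANDELABRA"]
# prints: 1

# -indices: match receives the start and end positions instead of the text
regexp -indices {a..a} candelabra match
puts $match
# prints: 6 9

# --: marks the end of the flags, so a pattern that starts with - is
# treated as a pattern rather than as another flag
puts [regexp -- {-all} "once and for-all"]
# prints: 1
```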
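Pattern syntax includes the following elements:
* matches zero or more occurrences of the preceding item
+ matches one or more occurrences of the preceding item
? matches zero or one occurrence of the preceding item
| separates alternatives, e.g., (a|b) matches an a or a b
( ) groups a subpattern
[ ] delimits a set of characters. Ranges are written with a hyphen and follow ASCII order, so [A-z] matches every character from uppercase A through lowercase z. That covers the letters but also the six characters [ \ ] ^ _ ` that lie between Z and a, so use [A-Za-z] to match letters only. If the first character in the set is ^, this complements the set, e.g., [^A-Za-z] matches any non-alphabetic character.
^ anchors the pattern to the beginning of the string. ^ must appear at the beginning of the pattern expression.
$ anchors the pattern to the end of the string. $ must appear last in the pattern expression.
The regsub command performs substitution based on a pattern: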
regsub [flags] pattern data replacements var
regsub matches the pattern against the data. If the match succeeds,
the variable named var is set to data, with various parts modified,
as specified by replacements. If the match fails, var is simply set
to data. The value returned by regsub is the number of replacements
performed.
The flag -all specifies that every occurrence of the pattern should
be replaced. Otherwise only the first occurrence is replaced. Other
flags include -nocase and --, as with regexp.
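The return value and the effect of -all can be seen in a small sketch like this one, where the data string and variable names are arbitrary:

```tcl
set data "banana"

# without -all only the first occurrence is replaced
puts [regsub {a} $data {o} result]        ;# prints: 1
puts $result                              ;# prints: bonana

# with -all every occurrence is replaced and the count is returned
puts [regsub -all {a} $data {o} result]   ;# prints: 3
puts $result                              ;# prints: bonono

# no match: the return value is 0 and result is a copy of data
puts [regsub {z} $data {o} result]        ;# prints: 0
puts $result                              ;# prints: banana
```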
Here's an example from the banner ideas module of the ArsDigita
Community System (see /doc/bannerideas.html).
The goal is that each banner idea contain a linked thumbnail image.
To facilitate cutting and pasting of the image HTML, we don't require
that the publisher include uniform subtags within the IMG. However,
we use regsub to clean up:
# turn "<img align=right hspace=5" into "<img align=left border=0 hspace=8"
regsub -nocase {align=[^ ]+} $picture_html "" without_align
regsub -nocase {hspace=[^ ]+} $without_align "" without_hspace
regsub -nocase {<img} $without_hspace {<img align=left border=0 hspace=8} final_photo_html
In the example above, replacements specified only literal characters
(the empty string "" in the first two calls). Other replacement
directives include:
& inserts the string that matched the pattern
\1 through \9 insert the strings that matched the corresponding sub-patterns in the pattern.
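As a brief sketch of both directives, with made-up input strings and variable names:

```tcl
# & stands for the whole matched string
regsub -all {[0-9]+} "room 12, floor 3" {<&>} marked
puts $marked
# prints: room <12>, floor <3>

# \1 and \2 stand for the first and second subpatterns;
# the replacement is braced so that Tcl leaves the backslashes alone
regsub {([0-9][0-9])/([0-9][0-9])} "06/02" {year \2, month \1} swapped
puts $swapped
# prints: year 02, month 06
```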
Here is an example that replaces every HTML comment (delimited by
<!-- and -->) by the comment text, enclosed in parentheses.
% proc extract_comment_text {html} {
    regsub -all {<!--([^-]*)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}
% extract_comment_text {<!--insert the price below-->
We give the same low price to everyone: $219.99
<!--make sure to query out discount if this is one of our big customers-->}
(insert the price below)
We give the same low price to everyone: $219.99
(make sure to query out discount if this is one of our big customers)
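One caveat, offered as an observation rather than as part of the original code: the character class [^-]* stops at the first hyphen, so a comment whose text contains a hyphen is left untouched. Assuming Tcl 8.1 or later, a non-greedy subpattern is one way around this. The proc name below is hypothetical:

```tcl
# hypothetical variant of extract_comment_text using a non-greedy subpattern
proc extract_comment_text_nongreedy {html} {
    # .*? matches as little as possible, so each match ends at the nearest -->
    regsub -all {<!--(.*?)-->} $html {(\1)} with_exposed_comments
    return $with_exposed_comments
}

puts [extract_comment_text_nongreedy {<!--high-priority note--> Price: 219.99 <!--check discount-->}]
# prints: (high-priority note) Price: 219.99 (check discount)
```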
Also see http://www.tcl.tk/man/tcl8.4/TclCmd/regsub.htm
string match uses "GLOB-style" matching. Here is the syntax:
string match pattern data
It returns 1 if there is a match and 0 otherwise. The only pattern
elements permitted here are ?, which matches any single character;
*, which matches any sequence; and [], which delimits a set of
characters or a range. This differs from regexp in that the pattern
must match the entire string supplied:
% regexp "foo" "foobar"
1
% string match "foo" "foobar"
0
% # here's what we need to do to make the string match
% # work like the regexp
% string match "*foo*" foobar
1
Here's an example of the character range system in use:
string match {*[0-9]*} $text
returns 1 if text contains at least one digit and 0 otherwise.
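The three glob elements can be compared side by side in a short sketch. The sample strings are arbitrary:

```tcl
# ? matches exactly one character
puts [string match {c?t} cat]             ;# prints: 1
puts [string match {c?t} cart]            ;# prints: 0

# [] matches one character from a set or range
puts [string match {[a-c]*} banana]       ;# prints: 1
puts [string match {[a-c]*} zebra]        ;# prints: 0

# * matches any sequence, including the empty one
puts [string match {*.html} index.html]   ;# prints: 1
```

The patterns are braced so that the Tcl parser does not try to evaluate the square brackets as a command.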
[A-z] includes characters other than A-Z and a-z (see the ASCII man page). I ran a test using tclsh 8.3:
% set x ^
^
% regsub {[A-z]} $x {OOPS} x2
1
% puts $x2
OOPS
-- Paul Takemura, January 29, 2005