Regular Expression (RegEx or RE) Primer With Examples
Regular expressions (or REs) are a way
to concisely specify a group of text strings.
The Unix editor ed was about the first Unix program
to provide REs.
Many later commands used this form of RE
and their man pages would "see also
ed(1)".
Over time folks wanted more expressive REs and new
features were
added.
The ed REs became known as basic
REs or BREs, and the others
became known as extended
REs or EREs.
Suppose you needed to find a specific IPv4 address in the files under
/etc.
This is easy to do; just specify the IP address as a string
of text and do a search.
But, what if you didn’t know in advance which IP address you
were looking for,
only that you wanted to see all IP addresses in those files?
Even if you could, you wouldn’t want to specify every possible
IP address to some searching tool!
You need a way to specify all IP addresses in a compact form.
That is, you want to tell your searching tool to show anything that
matches
number.number.number.number.
This is the sort of task we use REs for.
You can specify a pattern (RE) for phone numbers,
dates, credit-card numbers, email addresses, URLs, and so on.
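As a rough sketch of the number.number.number.number idea, the Python snippet below builds such a pattern and pulls the matching strings out of some text. Python's re module is a Perl-style dialect rather than a POSIX one, and the sample text and the name LOOSE_IPV4 are made up for illustration.

```python
import re

# "number.number.number.number": four runs of digits separated by literal dots.
# The dots are escaped because an unescaped . matches any character.
LOOSE_IPV4 = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

# Hypothetical file contents, standing in for something found under /etc.
sample = """\
nameserver 192.168.1.1
127.0.0.1 localhost
# no address on this line
server 10.0.0.254 prefer
"""

found = LOOSE_IPV4.findall(sample)
print(found)   # ['192.168.1.1', '127.0.0.1', '10.0.0.254']
```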
The Good Enough Principle
With REs the concept of "good enough" applies. Consider the pattern used above for IP addresses. It will match any valid IP address, but also strings that look like 7654321.300.0.777 or 5.3.8.12.9.6 (possibly an SNMP OID).
To match only valid IPv4 addresses is possible but rarely worth the effort. It is unlikely your search of /etc files will find such strings, and if a few turn up you can easily eyeball them to determine if they are valid IP addresses.
It is possible to craft a more precise RE but in real life you only need an RE good enough for your purpose at hand. If a few extra matches are caught you can usually deal with them. (Of course, if making global search and replace commands, you will need to be more precise!)
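To see what "good enough" means in practice, here is a small Python sketch that runs a loose dotted-number pattern over the odd strings mentioned above. The helper looks_valid is a hypothetical after-the-fact check, not part of any tool, and Python's re is only one of many dialects.

```python
import re

# Loose pattern: any four dot-separated runs of digits.
LOOSE_IPV4 = re.compile(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+")

def looks_valid(candidate):
    """Return True if every dotted field is in the range 0-255."""
    return all(int(field) <= 255 for field in candidate.split("."))

for text in ["192.168.1.1", "7654321.300.0.777", "5.3.8.12.9.6"]:
    match = LOOSE_IPV4.search(text)
    print(text, "->", match.group(), looks_valid(match.group()))
```

Note that 5.3.8.12.9.6 still yields the plausible-looking 5.3.8.12, which is exactly the kind of stray match you would sort out by eye.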
An RE is a pattern, or template, against which strings can be matched. Strings either match the pattern or they don’t. If they do, parts of the matching string can be saved in named variables, which can be used later to either match more text, or to transform the matching string.
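The idea of saving parts of a match and reusing them can be sketched in Python as follows. The date layout, the group names, and the sample strings are assumptions chosen for illustration; other tools spell these features differently.

```python
import re

# Two or more named groups save parts of the match for later use.
DATE = re.compile(r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})")

match = DATE.search("released on 2024-03-15, patched later")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))

# Use the saved parts to transform the matching string.
reordered = DATE.sub(r"\g<day>/\g<month>/\g<year>", "released on 2024-03-15")
print(reordered)   # released on 15/03/2024

# Use a saved part to match more text: \1 must repeat what group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is broken")
print(doubled.group(1))   # is
```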
Pattern matching for text turns out to be one of the most useful and
common operations to perform on files.
Over the years a large number of tools have been created that use
REs, including all text editors, grep,
sed, sort, and others.
The shell wildcards can be considered a type of RE.
While the idea of REs is standard, different tools may
use slightly different syntaxes, or dialects.
Some of these tools also contain extensions that may be useful.
Perl’s REs are about the most complex and useful
dialect, and are sometimes referred to as
PREs.
(See man pages for perlre(1), also
perlrequick(1), and perlretut(1).)
Eventually POSIX stepped in and standardized
RE syntax; the standard is mostly compatible with the original
ed REs but with many additions.
(POSIX doesn’t use terms like BRE and
ERE).
However, few of the older tools changed to use the new syntax.
See regex(7) for details.
Most RE dialects work this way: one line is read into
some buffer.
Next the RE is matched against it.
In a programming environment such as perl or an editor
(sed), if the RE matches then some
additional steps (such as modification of the buffer) may be
done.
With a tool such as grep a matching line is just
printed.
Finally the cycle repeats.
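The read, match, act cycle can be sketched in a few lines of Python. The functions grep_like and sed_like are hypothetical stand-ins that only mimic the general shape of those tools, not their actual implementations.

```python
import re

def grep_like(pattern, lines):
    """Return the lines that match, as a grep-style tool would print them."""
    regex = re.compile(pattern)
    matched = []
    for line in lines:                 # read one line into the "buffer"
        buffer = line.rstrip("\n")
        if regex.search(buffer):       # match the RE against the buffer
            matched.append(buffer)     # grep-style: just report the line
    return matched                     # the for loop is the repeating cycle

def sed_like(pattern, replacement, lines):
    """Return every line, with the first match on each line replaced."""
    regex = re.compile(pattern)
    return [regex.sub(replacement, line.rstrip("\n"), count=1) for line in lines]

lines = ["ton of bricks\n", "tin can\n", "nothing here\n"]
print(grep_like(r"t.n", lines))          # ['ton of bricks', 'tin can']
print(sed_like(r"t.n", "TXN", lines))    # ['TXN of bricks', 'TXN can', 'nothing here']
```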
Top-down explanation (from the regex(7) man page):
An RE is one or more branches separated with | and matches text if any of the branches match the text. A branch is one or more pieces concatenated together, and matches if the first piece matches, then the next matches from the end of the first match, until all pieces have matched. A piece is an atom optionally followed by a modifier: *, +, ?, or a bound (i.e., a range). An atom is a single character RE or (RE).
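To make the branch, piece, and atom vocabulary concrete, here is a small Python sketch. The pattern itself is invented for illustration; fullmatch is used so that the whole string has to be consumed by one of the branches.

```python
import re

# Two branches separated by |.
#   Branch 1: pieces  t , o+ , n      (the atom o carries the modifier +)
#   Branch 2: pieces  (ab){2} , c?    (the atom (ab) carries a bound, c carries ?)
pattern = re.compile(r"to+n|(ab){2}c?")

samples = ["ton", "tooon", "abab", "ababc", "tn", "ab"]
results = [bool(pattern.fullmatch(text)) for text in samples]

for text, ok in zip(samples, results):
    print(text, ok)
```

The first four strings satisfy one branch or the other; tn fails because + needs at least one o, and ab fails because the bound {2} needs two copies of the group.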
| RE | Meaning | Example | Matches |
|---|---|---|---|
| Basic Regular Expressions | | | |
| c | The character c | J | J |
| . | Any character (except a newline) | . | J z $ |
| \c | The character c literally (escaped) when c is a metacharacter (such as .). Never end an RE with a single backslash. | Mr\. Ed | Mr. Ed |
| seq | A sequence of REs matches a string of text that matches each RE in turn. | t.n | tan tBn t%n |
| [list] [range] | Called a character class; matches any one character in the list. You can only use a range if LC_COLLATE is set to POSIX or C. | t[aeio]n | tan ten tin ton |
| | | T[A-D]N | TAN TBN TCN TDN |
| | | [wW]ayne | wayne Wayne |
| [^list] [^range] | Any one character not in the list or range. | t[^ou0123456789]n | tan ten tin (but not ton or t9n) |
| | | T[^0-9a-z]N | TAN TBN (but not TaN or T9N) |
| RE* | Zero or more of RE | to*n | tn ton toon tooon |
| ^RE | Anchors RE to the beginning of a line (technically the ^ matches the null string at the beginning of the buffer) | ^t.n | tons (but not wanton) |
| | | \^t.n | ^tons |
| | | 2^2=4 | 2^2=4 (^ is only special in front) |
| RE$ | Like ^, anchors RE to match at the end of a line | t.n$ | wanton (but not tons) |
| | | ^ton$ | ton (on a line by itself) |
| | | ^$ | (an empty line) |
| Extended Regular Expressions | | | |
| RE+ | One or more occurrences of RE | to+n | ton toon tooon |
| RE? | Zero or one occurrences of RE | to?n | tn ton |
| RE{m} | Exactly m occurrences of RE | to{2}n | toon |
| RE{m,} | m or more occurrences of RE | to{2,}n | toon tooon toooon |
| RE{m,n} | Between m and n occurrences of RE | to{2,3}n | toon tooon |
| RE\|RE | Either the left or right RE | Mr\.\|Ms\. | Mr. Ms. |
| (RE) | RE (the parens are used for grouping and back-references). | (Mr\.\|Ms\.) Smith | Mr. Smith Ms. Smith |
| (RE)RE\1 | The \1 is a reference to the first group. Groups are numbered by counting the opening parentheses of groups. | (t[ao]n)\1 | tantan tonton (but not tanton) |
| | | (rin)(-tin)\2 | rin-tin-tin |
| \<RE RE\> \bRE RE\b | RE only at the beginning or end of a word. (While \< and \> are common, \b is sometimes used instead.) Like other anchors these match the null string at the boundaries of words. (POSIX doesn't define word boundary matches for either BREs or EREs but these are common extensions with GNU, Perl, and other utilities.) | \<ton \bton | tons (but not wanton) |
| | | ton\> ton\b | wanton (but not tons) |
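A quick way to convince yourself of the table's claims is to run a selection of its rows through a regex engine. The Python sketch below does that with the re module, which is a Perl-style dialect, so only \b is used for word boundaries (\< and \> are not available there). A few of the non-matching strings (Mrs Ed, tun, Mrs Smith) are additions for the test and do not come from the table.

```python
import re

# (pattern, strings that should match, strings that should not match)
TABLE_EXAMPLES = [
    (r"Mr\. Ed",            ["Mr. Ed"],                   ["Mrs Ed"]),
    (r"t[aeio]n",           ["tan", "ten", "tin", "ton"], ["tun"]),
    (r"t[^ou0123456789]n",  ["tan", "ten", "tin"],        ["ton", "t9n"]),
    (r"to*n",               ["tn", "ton", "tooon"],       ["tan"]),
    (r"^t.n",               ["tons"],                     ["wanton"]),
    (r"t.n$",               ["wanton"],                   ["tons"]),
    (r"to{2,3}n",           ["toon", "tooon"],            ["ton"]),
    (r"(Mr\.|Ms\.) Smith",  ["Mr. Smith", "Ms. Smith"],   ["Mrs Smith"]),
    (r"(t[ao]n)\1",         ["tantan", "tonton"],         ["tanton"]),
    (r"\bton",              ["tons"],                     ["wanton"]),
    (r"ton\b",              ["wanton"],                   ["tons"]),
]

def check(examples):
    """Return the (pattern, text) pairs that behaved unexpectedly."""
    surprises = []
    for pattern, should_match, should_not in examples:
        regex = re.compile(pattern)
        for text in should_match:
            if not regex.search(text):
                surprises.append((pattern, text))
        for text in should_not:
            if regex.search(text):
                surprises.append((pattern, text))
    return surprises

print(check(TABLE_EXAMPLES))   # an empty list means every row behaved as listed
```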
Some characters used to express REs
(meta-characters)
are only special if they appear in a specific context.
For instance, the ^ is special only if it is
the first character of some RE, the $
only if the last.
The * is not special if the first character in
an RE.
A { followed by a character other than a digit
is not the beginning of a bound.
A backslash is always special, so it is illegal to end an
RE with one.
Special (or meta-) characters lose their meaning
if escaped, that is, preceded with a backslash
(\) character.
For POSIX these are the characters in the following list:
^.[$()|*+?{\
(In some dialects the reverse is true; meta-characters only have
special meaning when escaped!)
Some special characters (in POSIX, all) lose their special meaning
when used in a character class.
To include a literal ] in the list,
make it the first character (following a possible ^).
To include a literal -, make it the first
character (following a possible ^)
or the last character.
Other metacharacters except \ lose their
meaning in the list.
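The escaping and character-class rules above can be tried out in Python. This is only a sketch in one Perl-style dialect; the sample strings are invented, and re.escape is Python's own helper for adding backslashes, not something POSIX defines.

```python
import re

# Outside a character class, escape a metacharacter to match it literally.
price = re.search(r"\$[0-9]+\.[0-9]{2}", "total: $19.99 today")
print(price.group())                      # $19.99

# re.escape() adds the backslashes when the literal text comes from elsewhere.
literal = re.escape("2^2=4 (approx.)")
found_literal = bool(re.search(literal, "so 2^2=4 (approx.) holds"))
print(found_literal)                      # True

# Inside a class: a literal ] goes first, a literal - goes first or last.
bracket_first = re.findall(r"[]x]", "a]bx")      # [']', 'x']
negated = re.findall(r"[^]x]", "a]bx")           # ['a', 'b']
dash_last = re.findall(r"[a-]", "a-b")           # ['a', '-']

# Other metacharacters are ordinary characters inside the list.
plain = re.findall(r"[.*+?]", "a.b*c")           # ['.', '*']
print(bracket_first, negated, dash_last, plain)
```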
The list can include one or more predefined
character classes.
In POSIX these look like [:name:],
where name is one of:
| Name | Notes |
|---|---|
| alnum | |
| alpha | |
| blank | (space or tab only) |
| cntrl | |
| digit | |
| graph | (any printable character except a space) |
| lower | |
| print | (any printable character) |
| punct | |
| space | (any white-space character) |
| upper | |
| xdigit | (any hexadecimal digit) |
[[:xdigit:]] is the same as [[:digit:]a-fA-F],
which is the same (in the C locale) as
[0-9a-fA-F].
In some RE dialects a backslash-character is used,
for instance \d instead of [:digit:]
for a digit.
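Python's re module is one of the dialects that uses \d and does not understand POSIX [:name:] classes, which makes it a handy place to show the correspondence. The sketch below uses a hypothetical helper, expand_posix_classes, and a small hand-written table of C-locale equivalents; both are assumptions for illustration and cover only a subset of the class names.

```python
import re

# C-locale equivalents for a few POSIX class names (an illustrative subset).
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "upper": "A-Z",
    "lower": "a-z",
    "xdigit": "0-9a-fA-F",
    "blank": " \\t",
}

def expand_posix_classes(pattern):
    """Replace each [:name:] with its C-locale range (hypothetical helper)."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

translated = expand_posix_classes("[[:digit:]]+")
print(translated)                                          # [0-9]+

posix_style = re.findall(translated, "port 8080, pid 42")
# In Python \d can also match non-ASCII digits unless re.ASCII is given.
perl_style = re.findall(r"\d+", "port 8080, pid 42", re.ASCII)
print(posix_style, perl_style)

hex_runs = re.findall(expand_posix_classes("[[:xdigit:]]+"), "0xBEEF")
print(hex_runs)                                            # ['0', 'BEEF']
```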
A back reference (\ followed by a
non-zero decimal digit d) matches the same
sequence of characters matched by the d-th parenthesized
subexpression (numbering subexpressions by the positions of their
opening parentheses, left to right).
For example: \([bc]\)\1 matches
bb or cc
but not bc.
(Note: POSIX defines back references only for BREs, not for EREs,
but they are very commonly supported anyway.)
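The same back reference can be tried in Python, where groups are written with bare parentheses rather than the BRE form \( \). The second pattern is an invented example that shows how nested groups are numbered.

```python
import re

# Python spelling of the BRE \([bc]\)\1
doubled = re.compile(r"([bc])\1")

outcomes = {text: bool(doubled.fullmatch(text)) for text in ["bb", "cc", "bc"]}
print(outcomes)          # {'bb': True, 'cc': True, 'bc': False}

# Groups are numbered by their opening parentheses, left to right:
# group 1 is ((a)(b)), group 2 is (a), group 3 is (b).
nested = re.fullmatch(r"((a)(b))\3\2", "abba")
print(nested.groups())   # ('ab', 'a', 'b')
```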
In the event that an RE could match more than one substring of a given string, the RE matches the one starting earliest in the string. If the RE could match more than one substring starting at that point, it matches the longest. (This is often called greedy matching.) Subexpressions also match the longest possible substrings, subject to the constraint that the whole match be as long as possible. Perl supports both greedy and non-greedy (also known as reluctant, minimal, ungreedy, or generous) modes. For example:
$ echo '12345' | sed 's/\([0-9]*\)\([0-9]*\)/|\1|\2|/'
|12345||
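For comparison, here is a sketch of greedy and non-greedy matching in Python, whose re module follows the Perl style. Python's engine picks matches by backtracking rather than by the leftmost-longest rule, but for this pattern the greedy result agrees with the sed output. The count=1 argument is there to mimic sed's s/// without the g flag.

```python
import re

# Greedy: the first group grabs every digit, leaving nothing for the second.
greedy = re.sub(r"([0-9]*)([0-9]*)", r"|\1|\2|", "12345", count=1)
print(greedy)       # |12345||

# Adding ? after * makes the first group non-greedy (reluctant), so it takes
# as little as possible and the second group gets all the digits.
reluctant = re.sub(r"([0-9]*?)([0-9]*)", r"|\1|\2|", "12345", count=1)
print(reluctant)    # ||12345|
```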