Regular expression (shortened regex or regexp) is a
kind of mathematical expression, plentifully used in programming, that defines simple patterns in
strings of characters (usually text). Regular
expressions are typically used for searching patterns (i.e. not just
exact matches but rather sequences of characters which follow some
rules, e.g. numeric values or web URLs),
substitutions (replacement) of such patterns, describing syntax of computer languages, their parsing etc. (though more creative uses aren't out
of question either, e.g. generating random
strings). Regular expression is itself a string of symbols which however
describes potentially many (even infinitely
many) other strings thanks to containing special symbols that may stand
for repetition, alternative etc. For example a.*.b is a
regular expression describing a string that starts with letter
a, which is followed by a sequence of at least one
character and then ends with b (so e.g. aab,
abbbb, acaccb etc.).
WATCH OUT: be careful not to confuse regular expressions with Unix wildcards used in file
names (e.g. sourse/*.c is a wildcard, not a regexp).
{ A popular online tool for playing around with regular expressions is https://regexr.com/, though it requires JS and is bloated; if you want to stay with Unix, just grep (possibly with -o to see just the matched string). ~drummyfish }
Regular expressions are encountered in many Unix tools, programming languages, editors etc. Especially worthy of mention are grep (searches for patterns in files), sed (text processor, often used for search and replacement of patterns), awk, Perl, Vim etc.
From the viewpoint of theoretical computer science and formal languages regular expressions are computationally weak, they are equivalent to the weakest models of computation such as regular grammars or finite state machines (both deterministic and nondeterministic) -- in fact regular expressions are often implemented as finite state machines. This means that regular expressions can NOT describe any possible pattern (for example they can't capture a math expression with nested brackets), only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a good enough tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
Are there yet simpler pattern describers than regular expressions? Yes, of course, the simplest example is just a string directly describing the pattern, e.g. "abc" matching exactly just the string "abc" -- this is called a fixed string. Next we can think of case-insensitive pattern, so "abc" would match "abc", "ABC", "AbC" etc. Notable subclass of regular expressions are so called star-free languages/expressions which are regular expressions without the star (repetition) operator. Star-free expressions can be used as a simpler variant to regular expressions, they may still describe many patterns and are easier to implement.
OK, let's now dive into how exactly regular expressions work, shall
we? Imagine regexp as an extension of a fixed string pattern (a string
describing exactly itself, i.e. "abc" describes just the string "abc").
The extension is in giving certain characters a special meaning -- the
most common are . (dot) and * (asterisk). Dot
means "any character is allowed here", so "a.c" will describe strings
"aac", "abc", "acc" etc. Asterisk means "the previous character repeated
any number of times (even zero)", so "abc*" describes strings "ab",
"abc", "abcc", "abccc" etc. If we want a regex to contain any special
character as such, without its special meaning, we have to escape it --
for example "a.c" will describe just a string "a.c". This is a super
swift high altitude introduction, more detail will follow.
There exist different standards and de-facto standards for regular
expressions, some using different symbols, some having extra syntactic sugar (which however usually
only make the syntax more comfortable, NOT more computationally
powerful) and features (typically e.g. so called capture groups
that allow to extract specific subparts of given matched pattern). There
are cases where a feature makes regexes more computationally powerful,
namely the backreference \n present in extended regular
expressions (source: Backreferences in practical regular
expressions, 2020). Most relevant standards are probably Posix and Perl (with specific
implementations sometimes adding their own flavor, e.g. GNU, Vim etc.): Posix specifies
basic and extended regular expression
(extended usually turned on with the -E CLI flag). The
following table sums up the most common constructs used in regular
expressions:
| construct | matches | availability | example |
|---|---|---|---|
| char | this exact character | everywhere | a matches a |
. |
any single character | everywhere | . matches a, b,
1 etc. |
expr* |
any number (even 0) of repeating expr | everywhere | a* matches empty, a,
aa, aaa, ... |
^ |
start of expression (usually start of line) | everywhere | ^a matches a at the start of line |
$ |
end of expression (usually end of line) | everywhere | a$ matches a at the end of line |
expr+ |
matches 1 or more repeating expr | escape (\+) in basic |
a+ matches a, aa,
aaa, ... |
expr? |
matches 0 or 1 expr | escape (\?) in basic |
a? matches either empty or a |
[S] |
matches anything character from set S | everywhere | [abc] matches a, b or
c |
(expr) |
marks group (for capt. groups etc.) | escape (\(, \)) in basic |
a(bc)d matches abcd with group
bc |
[A-B] |
like [ ] but specifies a range |
everywhere | [3-5] matches 3, 4 and
5 |
[^S] |
matches any char. NOT from set S | everywhere | [^abc] matches d, e,
A, 1 etc. |
{M,N} |
M to N repetitions of expr | escape (\{, \}) in basic |
a{2,4} matches aa, aaa,
aaaa |
e1|e2 |
e1 or e2 | escape in basic | ab|cd match. ab or cd |
\n |
backref., nth matched group (starts with 1) | extended only | (..).*\1 matches e.g. ABcdefAB |
[:alpha:] |
alphabetic, a to z, A to Z | Posix (GNU has [[ ]]) |
[:alpha:]* matches e.g. abcDEF |
[:alnum:] |
same as above | ||
[:digit:] |
same as above | ||
[:blank:] |
same as above | ||
[:lower:] |
same as above | ||
[:space:] |
same as above | ||
\w |
like [:alnum:] plus also _ char. |
Perl | |
\d |
digit, 0 to 9 | Perl | |
\s |
like [:space:] |
Perl |
Note on case-sensitivity: regular expressions can be
case-insensitive, but besides special character classes (such as
[:alnum:]) there isn't a widely standardized way to express
case sensitivity inside the regular expression itself, it's usually done
by passing a flag (such as -i) to the program that's using
the expression. But there are exceptions, for example in vim adding \c anywhere in a regular
expression makes it case-insensitive.
Here we'll demonstrate some practical uses of regular expressions. Most common Unix tools associated with regular expressions are probably grep (for searching) and sed (for replacing).
The most basic use case is you just wanting to search for some pattern in a file, i.e. for example you are looking for all IP addresses in a file, for a certain exact word inside source code comment etc.
The following uses grep to find and count all occurrences of the word
capitalism or capitalist (disregarding case
with the -i flag) in a plain text version of this wiki and passes them to be counted with
wc.
grep -i -o "capitalis[mt]" ~/Downloads/lrs_wiki.txt | wc -l
We find out there are 829 such occurrences.
Of course, quite frequently you may want to see the lines that match
along with files and line numbers, try also e.g.
grep -m 10 -H -n "Jesus" ~/Downloads/lrs_wiki.txt.
Now let's search for things that suck with (-o prints
out just the matches instead of whole line, -m 10 limits
the output to 10 results at most):
grep -o -m 10 "[^ ]* \(sucks\|is shit\)" ~/Downloads/lrs_wiki.txt
Currently we get the following output:
body sucks
OS sucks
everything sucks
Everything is shit
language sucks
it sucks
D&D sucks
writing sucks
Fediverse sucks
This sucks
Now let's try replacing stuff with sed --
this is done with a very common format (which you should remember as
it's often used in common speech) s/PATTERN/REPLACE/g where
PATTERN is the regular expression to match,
REPLACE is a string with which to replace the pattern
(s stands for substitute and g for global,
i.e. "replace all"). Let's say we are retarded and obsessed with muh privacy and want to censor all names in a
portion of the wiki we want to print, so we'll just replace all words
composed of letters that start with uppercase letter (and continue with
lowercase letters) -- this will also censor other words than names but
let's keep it simple for now. The command may look
something like:
cat ~/Downloads/lrs_wiki.txt | tail -n +5003 | head -n 10 | sed "s/[A-Z][a-z]\+/<BEEP>/g"
Here we may get e.g.:
they typically have a personal and not easy to describe faith. [19]<BEEP>
was a <BEEP>. [20]<BEEP> often used the word "[21]<BEEP>" instead of
"nature" or "universe"; even though he said he didn't believe in the
traditional personal <BEEP>, he also said that the laws of physics were like
books in a library which must have obviously been written by someone or
something we can't comprehend. [22]<BEEP> <BEEP> said he was "deeply
religious, though not in the orthodox sense". <BEEP> are also very hardcore
religious people such as [23]<BEEP> <BEEP>, the inventor of [24]<BEEP>
language, who even planned to be a <BEEP> missionary. <BEEP> "true
atheists" are mostly second grade "scientists" who make career out of the
A more advanced feature is what we call capture
groups that allow us to reuse parts of the matched pattern in
the replacement string -- this is needed in some cases, for example if
you just want to insert some extra characters in the pattern. Capture
groups are parts of the pattern inside brackets (( and
) which sometimes have to be escaped to \( and
\)); in the replacement string we then reference them with
\1 (first group), \2 (second group) etc. Let's
demonstrate this on the following example that will highlight all four
letter words:
cat ~/Downloads/lrs_wiki.txt | tail -n +8080 | head -n 7 | sed "s/ \([^ ]\)\([^ ]\)\([^ ]\)\([^ ]\) / \!\!\!\1-\2-\3-FUCKING-\4\!\!\! /g"
The result may look something like this:
Bootstrapping as a general principle can aid us in creation of extremely
!!!f-r-e-FUCKING-e!!! technology by greatly minimizing all its [7]dependencies, we are able
to create a small amount of !!!c-o-d-FUCKING-e!!! that !!!w-i-l-FUCKING-l!!! self-establish our whole
computing environment !!!w-i-t-FUCKING-h!!! only !!!v-e-r-FUCKING-y!!! small effort during the process. The
topic mostly revolves around designing [8]programming language
[9]compilers, but in general we may be talking about bootstrapping whole
computing environments, operating systems etc.
Now a pretty rare use case is generating random patterns -- we can imagine we
have a regular expression describing for example a valid username in a
game and for some reason we want to generate 1000 random strings that
match this pattern (e.g. for bots). Now going into depth on this topic
could take a long time because e.g. considering probability distributions we may
get into some mathematical rabbit holes (considering that for example
the regex .* matches an arbitrarily long string, what will
be the average length of a string randomly generated by this pattern?
10? 1000? 1 billion?). Anyway let's leave this aside -- if we do, there
is actually a quite simple and natural way of generating random patterns
from regular expressions. We can convert the regexp into nondeterministic finite automaton, the make a random traverse of the graph and generate the
string along the way. There don't even seem to be many Unix tools for
doing this -- at the time of writing this one of the simplest way seems
to be with Perl and one of its libraries, which
currently still has some great limitations (no groups, no special
characters in square brackets, ...), but it's better than nothing. The
command we'll be using is:
perl -e "use String::Random; for (1..20) { print String::Random->new->randregex('REGEX') . \"\n\"; }" 2>/dev/null
Here are some strings generated with different
REGEXes:
....*: \n\"; }",
/dB|^hRR/3za~, ![5q-%ONK,
"oJT.UzSa}0, t"}Yq]sWZjIv,
Fq<xs~_e, H=5yt9q<>29XW,
<EoVf)ORH{m, ...[A-Z][a-z]{3,8}: Ponic,
Xwawo, Hgrtky, Brmcitpsw,
Qrogdze, Olhyb, Gqeelz,
Igppehljz, Azrdava, ...[:;=8B][-o]?[)({}/|PS]: ;o},
:-), ;(, B(, 8o),
=/, :o|, =}, =-(,
...[lL][oue]+lz?: Loeeolz, Luel,
luuuolz, lol, Leelz,
Leoeuoeoueulz, luueeoolz, ...Let's now try to program a very simple regular
expression in C. You can do this in quite fancy ways,
serious regex libraries will typically let you specify arbitrary regular
expression with a string at runtime (for example
char *myRegex = "(abc|ABC).*d+";), then compile it to some
fast, efficient representation like the mentioned state machine and use
that for matching and replacing patterns. We'll do nothing like that
here as that's too complex, we will simply make a program that has one
hard wired regular expression and it will just say if given input string
matches or not. Let's consider the following regular expression:
(<[^<>]*>|[^<>]*)*
It describes an "XML"-like text; the text can
contain tags that start with < and end with
>, but there mustn't e.g. be a tag inside another tag.
For example <hello> what <world> will match,
but hello > world << bruh won't match. OK, so the
first thing to do is to convert the regular expression to a finite state automaton -- this can
be done intuitively but there is also an exact algorithm that can do
this with any regular expression (look it up if you need it). Our
automaton will look like this:
.---. .---.
| | else | | else
____V___|____ '>' ____V___|___
| (accept) |<------| |
start --->| outside_tag |------>| inside_tag |
|_____________| '<' |____________|
| |
| '>' '<' |
__V___ |
.---| | |
any | | fail |<------------'
'-->|______|
Here we start in the outside_tag state and move between
states depending on what characters we read from the input string we are
analyzing (indicated next to the arrows). If we end up in the
outside_tag state state again (marked as accepting
state) when all is read, the input string matched the regular
expression, otherwise it didn't. We'll translate this automaton to a C
program:
#include <stdio.h>
#define STATE_OUTSIDE_TAG 0
#define STATE_INSIDE_TAG 1
#define STATE_FAIL 2
int main(void)
{
int state = STATE_OUTSIDE_TAG;
while (1)
{
int c = getchar();
if (c == EOF)
break;
switch (state)
{
case STATE_OUTSIDE_TAG:
if (c == '<')
state = STATE_INSIDE_TAG;
else if (c == '>')
state = STATE_FAIL;
break;
case STATE_INSIDE_TAG:
if (c == '>')
state = STATE_OUTSIDE_TAG;
else if (c == '<')
state = STATE_FAIL;
break;
case STATE_FAIL:
break;
}
}
puts(state == STATE_OUTSIDE_TAG ? "matches!" : "string didn't match :(");
return 0;
}
Just compile this and pass a string to the standard input (e.g.
echo "<testing> string" | ./program), it will write
out if it matches or not.
Maybe it seems a bit overcomplicated -- you could say you could program the above even without regular expressions and state machines. That's true, however imagine dealing with a more complex regex, one that matches a quite complex real world file format. Consider that in HTML for example there are pair tags, non-pair tags, attributes inside tags, entities, comments and many more things, so here you'd have great difficulties creating such parser intuitively -- the approach we have shown can be completely automated and will work as long as you can describe the format with regular expression.
TODO: regexes in some langs. like Python
Powered by nothing. All content available under CC0 1.0 (public domain). Send comments and corrections to drummyfish at disroot dot org.