Don’t be afraid, it can smell your fear!
The world of Regular Expressions is vast and scary.
We’ll go through the very basics and some cool concepts that might help further along the way.
This post leans heavily on this amazing work.
Still not satisfied? Dive deeper, but at your own risk!
Keep in mind
Not all regex engines are created equal. Their implementations and valid patterns can vary quite a lot, so you’ll have to adapt to whatever you are working with.
These are little more than general concepts and little tricks that should work on most engines.
Basics
Let’s run through the basics of pattern matching.
Ranges
Use []
to match whatever item falls within the given range.
[abc]
➡️ ‘a’ or ‘b’ or ‘c’.
[a-z]
➡️ Any char between ‘a’ and ‘z’. Depending on the engine and its settings, it may or may not include letters with diacritics.
[a-zA-Z0-9]
➡️ Any alphanumeric char either lower or upper case.
Bonus
Negate ranges with ^
!
[^a-z]
➡️ Any char not between ‘a’ and ‘z’.
There is more to negations below
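Here is a quick sketch of the ranges above using Python's built-in re module. The sample string and variable names are made up for illustration, and other engines may behave slightly differently.

```python
import re

text = "Regex 101!"

# [a-z] keeps only lowercase ASCII letters.
lowercase = re.findall(r"[a-z]", text)

# [a-zA-Z0-9] keeps letters and digits, lower or upper case.
alphanumeric = re.findall(r"[a-zA-Z0-9]", text)

# [^a-z] is the negated range: anything that is NOT a lowercase letter.
not_lowercase = re.findall(r"[^a-z]", text)

print(lowercase)      # ['e', 'g', 'e', 'x']
print(alphanumeric)   # ['R', 'e', 'g', 'e', 'x', '1', '0', '1']
print(not_lowercase)  # ['R', ' ', '1', '0', '1', '!']
```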
The Dot
Use it to match any single char except for newlines (usually).
So basically anything.
.
➡️ Any one char.
..
➡️ Any two chars (not necessarily the same ones).
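To see what the "(usually)" means in practice, here is a small sketch in Python. In Python's re module the dot skips newlines unless you pass the re.DOTALL flag; other engines spell that switch differently. The sample strings are made up.

```python
import re

# By default "." matches any single char except a newline.
default_dot = re.findall(r".", "a\nb")

# With re.DOTALL the dot matches the newline as well.
dotall_dot = re.findall(r".", "a\nb", flags=re.DOTALL)

# ".." consumes two chars at a time, whatever they are.
pairs = re.findall(r"..", "abcd")

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
print(pairs)        # ['ab', 'cd']
```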
Multipliers
Use them to match any number of the previous item.
.+
➡️ Any char 1 or more times
.*
➡️ Any char 0 or more times
.?
➡️ Any char 0 or 1 times
We are using .
for simplicity, but you can match any item you want:
ac+
➡️ Would match ‘ac’, ‘acc’, ‘acccc’, …
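A small Python sketch comparing the three multipliers on the same candidates. It uses re.fullmatch so the whole string has to match the pattern; the candidate list is just an illustration.

```python
import re

candidates = ["a", "ac", "acc"]

# "+" needs at least one 'c' after the 'a'.
one_or_more = [s for s in candidates if re.fullmatch(r"ac+", s)]

# "*" is happy with no 'c' at all.
zero_or_more = [s for s in candidates if re.fullmatch(r"ac*", s)]

# "?" allows at most one 'c'.
zero_or_one = [s for s in candidates if re.fullmatch(r"ac?", s)]

print(one_or_more)   # ['ac', 'acc']
print(zero_or_more)  # ['a', 'ac', 'acc']
print(zero_or_one)   # ['a', 'ac']
```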
Beware the Greed
What would you expect to happen if you pass a string like <body>Banana</body>
through a regex like <.*>
?
You’d probably expect it to match <body>
and/or </body>
.
To your (and everyone else’s) surprise, that regex would (most likely) match <body>Banana</body>
instead.
How can this be?
Well, by default most regex engines’ +
and *
multipliers are greedy, which means that given any regex, they will try to match as much as possible.
A lazy match is probably what you want in most cases, and you usually get that by adding ?
to the multiplier:
Given the string <body>Banana</body>
, the regex <.*?>
will match <body>
and/or </body>
.
Now if you want to get real fancy you could also use <[^>]+>
to achieve this (you should be able to understand what’s going on there).
It’s usually more efficient, but honestly if that’s not an issue I wouldn’t even bother with it, as it gets really hard to read really fast.
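If you want to see the greedy, lazy and negated-range versions side by side, here is a sketch in Python's re module. The variable names are mine; the string and patterns are the ones discussed above.

```python
import re

html = "<body>Banana</body>"

# Greedy: .* grabs as much as it can, so we get one big match.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first '>' it can.
lazy = re.findall(r"<.*?>", html)

# Negated range: "anything that is not '>'", then the closing '>'.
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # ['<body>Banana</body>']
print(lazy)     # ['<body>', '</body>']
print(negated)  # ['<body>', '</body>']
```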
TL;DR
If you are having trouble with .*
(or .+
), try using .*?
(or .+?
) instead!
Numbered Multipliers
Much like regular multipliers, these ones match a given number or range of numbers of the previous item.
a{5}
➡️ ‘aaaaa’.
a{1,5}
➡️ Any number of consecutive ‘a’ between 1 and 5.
What’s actually cool about them is that they can behave like a more interesting +
multiplier:
a{3,}
➡️ 3 or more ‘a’.
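A quick Python sketch of the counted forms. Note that in Python (and most engines I know of) the range is written with a comma, as in a{1,5}. The helper lists below are only there to show which lengths match.

```python
import re

# a{5}: exactly five 'a'.
exactly_five = re.fullmatch(r"a{5}", "aaaaa") is not None
only_four = re.fullmatch(r"a{5}", "aaaa") is not None

# a{1,5}: between one and five 'a' (both ends included).
lengths_1_to_5 = [n for n in range(0, 8) if re.fullmatch(r"a{1,5}", "a" * n)]

# a{3,}: three or more 'a', with no upper limit.
lengths_3_plus = [n for n in range(0, 6) if re.fullmatch(r"a{3,}", "a" * n)]

print(exactly_five)    # True
print(only_four)       # False
print(lengths_1_to_5)  # [1, 2, 3, 4, 5]
print(lengths_3_plus)  # [3, 4, 5]
```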
Cool Things
Now for the cool and useful stuff.
Shorthands
Regex can be a bit of a pain to write and read.
Plus, there are certain structures we will probably want to match really often.
So why not use shorthands?
\s
➡️ a whitespace.
\S
➡️ anything but a whitespace (opposite of \s
).
\d
➡️ a digit (0-9).
\D
➡️ anything but a digit (opposite of \d
).
\w
➡️ a ‘word’ char (shorthand for [a-zA-Z0-9_]
).
\W
➡️ anything but a ‘word’ char (opposite of \w
).
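Here is how the shorthands look in Python, as a rough sketch. Keep in mind that for Python 3 text strings \d and \w are Unicode-aware by default, so they can match more than the plain ASCII sets listed above. The sample string is made up.

```python
import re

text = "Room 42, floor_7!"

# \d+ grabs runs of digits.
digits = re.findall(r"\d+", text)

# \w+ grabs runs of 'word' chars (letters, digits and underscore).
words = re.findall(r"\w+", text)

# \S+ grabs runs of anything that is not whitespace.
chunks = re.findall(r"\S+", text)

# \W picks out every single non-'word' char.
non_word = re.findall(r"\W", text)

print(digits)    # ['42', '7']
print(words)     # ['Room', '42', 'floor_7']
print(chunks)    # ['Room', '42,', 'floor_7!']
print(non_word)  # [' ', ',', ' ', '!']
```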
Anchors
Say you are trying to match all commented lines in a bash script.
You want to match all #.*
but only at the beginning of the line.
For that, you would use ^
, like so: ^#.*
!
^
➡️ Start of the line.
$
➡️ End of the line.
\b
➡️ Word boundary (beginning or end of word).
So, for a regex like \bFOO$
:
✅ FOO
in What a nice line of text BAR FOO
would match.
❌ FOO
in What a nice line of text BARFOO
would not.
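The anchors above can be tried out with a short Python sketch. One thing to watch for: in Python's re module, ^ and $ only match at every line when you pass re.MULTILINE; without it they anchor to the whole string. The tiny bash script is invented for the example.

```python
import re

script = "#!/bin/bash\necho hi # trailing comment\n# full-line comment\n"

# ^#.* with re.MULTILINE: lines that START with '#'.
# The trailing comment on the 'echo' line is skipped.
commented_lines = re.findall(r"^#.*", script, flags=re.MULTILINE)

# \bFOO$ : FOO as the start of a word, at the very end of the line.
separate_word = re.search(r"\bFOO$", "What a nice line of text BAR FOO") is not None
glued_word = re.search(r"\bFOO$", "What a nice line of text BARFOO") is not None

print(commented_lines)  # ['#!/bin/bash', '# full-line comment']
print(separate_word)    # True
print(glued_word)       # False
```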
Logical OR
Just like your if
statements!
foo|bar
➡️ Would match either foo
OR bar
.
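A minimal Python sketch of the OR in action. The sample sentence is made up.

```python
import re

# foo|bar matches wherever either alternative shows up.
hits = re.findall(r"foo|bar", "foo, bar and baz")

# Neither alternative matches 'baz' as a whole.
baz_matches = re.fullmatch(r"foo|bar", "baz") is not None

print(hits)         # ['foo', 'bar']
print(baz_matches)  # False
```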
Escaping special chars with \
What if you want your regex to match one (or more) of the special chars we’ve seen (like $
, [
or +
)?
Your safest bet is to escape them by putting a \
in front of them.
If we take our previous example and escape the $
: \bFOO\$
:
✅ FOO
in What a nice line of text BAR FOO$ this is getting pretty long
would match.
❌ FOO
in What a nice line of text BAR FOO
would not.
This is the main reason regexes get so unreadable.
Don’t let the \
scare you.
They are there for the engine but (apart from the occasional shorthand) they have no meaning for humans!
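Here is the escaped example as a Python sketch. As a bonus, Python ships re.escape, which adds the backslashes for you when you want to match a piece of text literally; other languages may or may not have an equivalent. Variable names are just for illustration.

```python
import re

with_dollar = "What a nice line of text BAR FOO$ this is getting pretty long"
without_dollar = "What a nice line of text BAR FOO"

# \$ is a literal dollar sign, not the end-of-line anchor.
escaped = r"\bFOO\$"
matches_with = re.search(escaped, with_dollar) is not None
matches_without = re.search(escaped, without_dollar) is not None

# re.escape backslash-escapes the special chars in a string.
auto_escaped = re.escape("$[+")

print(matches_with)     # True
print(matches_without)  # False
print(auto_escaped)     # \$\[\+
```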
Grouping and References
One neat trick that most regex engines will allow you to do is grouping parts of the match and referencing them later in the regex.
One regex can have multiple groups and these get referenced by their number (starting with 1).
You surround the group in ()
and reference it with \
followed by the group’s number.
This is easy to grasp with an example:
(foo) (bar) \2\1
➡️ Will match foo bar barfoo
(notice the spaces).
How on earth is this useful?
If you know how Sed works, you can probably imagine this can save a lot of headaches.
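To make this a bit more concrete, here is a sketch in Python. The first part checks the example above; the second uses re.sub to swap two words, which is roughly the kind of thing you would do with a sed substitution. The "hello world" input is made up.

```python
import re

# \2 and \1 must repeat whatever groups 2 and 1 captured.
backref_ok = re.fullmatch(r"(foo) (bar) \2\1", "foo bar barfoo") is not None
backref_bad = re.fullmatch(r"(foo) (bar) \2\1", "foo bar foobar") is not None

# In a replacement string, \1 and \2 insert the captured groups,
# so this swaps the two words around.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(backref_ok)   # True
print(backref_bad)  # False
print(swapped)      # world hello
```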
Negations
You can negate parts of your regex using lookarounds.
Say you want to match all instances of foo
followed by anything but bar
, followed by baz
.
So for example, we want foobatbaz
to match but not foobarbaz
.
Using a lookahead like foo(?!bar).+?baz
, you would do just that. We negate the part of the regex that is between parentheses and preceded by ?!
.
It simply means ‘not followed by (?!this)
’.
✅ This foobatbaz is weird
would match.
❌ This foobarbaz is weird
would not.
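The lookahead example can be checked with a few lines of Python. This is only a sketch using Python's re module; the variable names are mine.

```python
import re

# foo, NOT immediately followed by bar, then anything (lazily), then baz.
pattern = re.compile(r"foo(?!bar).+?baz")

good = pattern.search("This foobatbaz is weird")
bad = pattern.search("This foobarbaz is weird")

print(good.group())  # foobatbaz
print(bad)           # None
```

Notice that the lookahead itself does not consume anything: the 'bat' part is matched by .+? afterwards.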
Similarly, you might want to go about this the other way around.
Say you want to match all instances of foo
except when it is preceded by bar
.
You could achieve this by using lookbehinds like (?<!bar)foo
.
It simply means ‘not preceded by (?<!this)
’.
✅ This batfoo is weird
would match.
❌ This barfoo is weird
would not.
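And the lookbehind version, again as a Python sketch. One caveat that applies to Python's re module: the pattern inside a lookbehind has to have a fixed width, so something like (?<!ba+r) is rejected. Other engines have different limits, and some don't support lookbehinds at all.

```python
import re

# foo, as long as it is NOT immediately preceded by bar.
pattern = re.compile(r"(?<!bar)foo")

good = pattern.search("This batfoo is weird")
bad = pattern.search("This barfoo is weird")

print(good.group())  # foo
print(bad)           # None
```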
Both lookaheads and lookbehinds can be used to match a pattern while negating another one.
Which one to use just depends on whether you want to negate something before or after a match.