
Unix Fu - Regex


Don’t be afraid, it can smell your fear!

The world of Regular Expressions is vast and scary.
We’ll go through the very basics and some cool concepts that might help further along the way.

This post leans heavily on this amazing work.
Still not satisfied? Dive deeper, but at your own risk!

Keep in mind

Not all regex engines are created equal. Their implementations and valid patterns may vary quite a lot, so you’ll have to adapt to whatever you are working with.

These are little more than general concepts and little tricks that should work on most engines.

Basics

Let’s run through the basics of pattern matching.

Ranges

Use [] to match whatever item falls within the given range.

[abc] ➡️ ‘a’ or ‘b’ or ‘c’.
[a-z] ➡️ Any char between ‘a’ and ‘z’. It may or may not include diacritics.
[a-zA-Z0-9] ➡️ Any alphanumeric char either lower or upper case.


Bonus

Negate a range by putting ^ right after the opening bracket:
[^a-z] ➡️ Any char not between ‘a’ and ‘z’.

There is more to negations below
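
To see ranges in action, here is a minimal sketch using Python's built-in re module (picking Python is an assumption of this example, and the sample string is made up). Other engines may treat diacritics or case differently.

```python
import re

# Made-up sample text for illustration
text = "Unix Fu 42!"

# [a-z]: any single lowercase ASCII letter
lowercase = re.findall(r"[a-z]", text)

# [a-zA-Z0-9]: any single ASCII letter or digit
alphanumeric = re.findall(r"[a-zA-Z0-9]", text)

# [^a-z]: any single char that is NOT a lowercase ASCII letter
not_lowercase = re.findall(r"[^a-z]", text)

print(lowercase)      # ['n', 'i', 'x', 'u']
print(alphanumeric)   # ['U', 'n', 'i', 'x', 'F', 'u', '4', '2']
print(not_lowercase)  # ['U', ' ', 'F', ' ', '4', '2', '!']
```

Notice that the negated range happily matches spaces and punctuation too: it means "anything that is not in the range", not "any other letter".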


The Dot

Use it to match any item except for new lines (usually).
So basically anything.

. ➡️ Any one char.
.. ➡️ Any two chars (not necessarily the same ones).
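
The "(usually)" above is worth a quick check. This sketch assumes Python's re module, where the dot skips new lines unless you pass the re.DOTALL flag. Other engines have their own switch for this, or none at all.

```python
import re

# Two short lines separated by a newline
text = "ab\ncd"

# By default the dot does not match the newline
any_one = re.findall(r".", text)

# Two dots need two consecutive non-newline chars
any_two = re.findall(r"..", text)

# With re.DOTALL the dot matches the newline as well
with_newlines = re.findall(r".", text, flags=re.DOTALL)

print(any_one)        # ['a', 'b', 'c', 'd']
print(any_two)        # ['ab', 'cd']
print(with_newlines)  # ['a', 'b', '\n', 'c', 'd']
```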

Multipliers

Use them to match any number of the previous item.

.+ ➡️ Any char 1 or more times
.* ➡️ Any char 0 or more times
.? ➡️ Any char 0 or 1 times

We are using . for simplicity, but you can match any item you want:
ac+ ➡️ Would match ‘ac’, ‘acc’, ‘acccc’, …
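
Here is a small sketch of the three multipliers applied to the letter c, again assuming Python's re module. It uses re.fullmatch so that a candidate only counts when the whole string fits the pattern.

```python
import re

candidates = ["a", "ac", "acc", "acccc"]

# c+ : an 'a' followed by one or more 'c'
one_or_more = [s for s in candidates if re.fullmatch(r"ac+", s)]

# c* : an 'a' followed by zero or more 'c'
zero_or_more = [s for s in candidates if re.fullmatch(r"ac*", s)]

# c? : an 'a' followed by at most one 'c'
zero_or_one = [s for s in candidates if re.fullmatch(r"ac?", s)]

print(one_or_more)   # ['ac', 'acc', 'acccc']
print(zero_or_more)  # ['a', 'ac', 'acc', 'acccc']
print(zero_or_one)   # ['a', 'ac']
```

The multiplier only applies to the item right before it, so in ac+ the a is matched exactly once.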


Beware the Greed


What would you expect to happen if you pass a string like <body>Banana</body> through a regex like <.*>?
You’d probably expect it to match <body> and/or </body>.

To your (and everyone else’s) surprise, that regex would (most likely) match <body>Banana</body> instead.

How can this be?
Well, by default the + and * multipliers are greedy in most regex engines, which means they will try to match as much as possible.

A lazy match is probably what you want in most cases, and you usually get that by adding ? to the multiplier:
Given the string <body>Banana</body>, the regex <.*?> will match <body> and/or </body>.

Now if you want to get real fancy you could also use <[^>]+> to achieve this (you should be able to understand what’s going on there).
It’s usually more efficient, but honestly if that’s not an issue I wouldn’t even bother with it, as it gets really hard to read really fast.

TL;DR

If you are having trouble with .* (or .+), try using .*? (or .+?) instead!
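
The greedy, lazy and negated-range versions from this section can be compared side by side. This is a sketch assuming Python's re module. Most engines with lazy multipliers should behave the same way, but check yours.

```python
import re

html = "<body>Banana</body>"

# Greedy: .* grabs as much as it can, up to the LAST '>'
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the FIRST '>' it can
lazy = re.findall(r"<.*?>", html)

# Negated range: one or more chars that are not '>'
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # ['<body>Banana</body>']
print(lazy)     # ['<body>', '</body>']
print(negated)  # ['<body>', '</body>']
```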


Numbered Multipliers

Much like regular multipliers, these ones match a given number or range of numbers of the previous item.

a{5} ➡️ ‘aaaaa’.
a{1,5} ➡️ Any number of consecutive ‘a’ between 1 and 5.

What’s actually cool about them is that they can behave like a more interesting + multiplier:

a{3,} ➡️ 3 or more ‘a’.
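
A quick sketch of the three forms, assuming Python's re module and a made-up list of candidates. Note the comma inside the braces: that is what separates the minimum from the maximum.

```python
import re

# Strings of 1, 3, 5 and 7 'a' chars
candidates = ["a", "aaa", "aaaaa", "aaaaaaa"]

# a{5}: exactly five
exactly_five = [s for s in candidates if re.fullmatch(r"a{5}", s)]

# a{1,5}: between one and five
one_to_five = [s for s in candidates if re.fullmatch(r"a{1,5}", s)]

# a{3,}: three or more, no upper limit
three_or_more = [s for s in candidates if re.fullmatch(r"a{3,}", s)]

print(exactly_five)   # ['aaaaa']
print(one_to_five)    # ['a', 'aaa', 'aaaaa']
print(three_or_more)  # ['aaa', 'aaaaa', 'aaaaaaa']
```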

Cool Things

Now for the cool and useful stuff.

Shorthands

Regex can be a bit of a pain to write and read.
Plus, there are certain structures we will probably want to match really often.

So why not use shorthands?

\s ➡️ a whitespace.
\S ➡️ anything but a whitespace (opposite of \s).
\d ➡️ a digit (0-9).
\D ➡️ anything but a digit (opposite of \d).
\w ➡️ a ‘word’ char (shorthand for [a-zA-Z0-9_]).
\W ➡️ anything but a ‘word’ char (opposite of \w).
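
Here is a short sketch of the shorthands, assuming Python's re module and a made-up log line. One engine-specific catch: in Python 3 these shorthands are Unicode-aware by default, so \w also matches accented letters unless you pass re.ASCII.

```python
import re

# Made-up sample text
text = "user_42 logged in"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of 'word' chars
spaces = re.findall(r"\s", text)       # single whitespace chars
non_spaces = re.findall(r"\S+", text)  # runs of non-whitespace

print(digits)      # ['42']
print(words)       # ['user_42', 'logged', 'in']
print(spaces)      # [' ', ' ']
print(non_spaces)  # ['user_42', 'logged', 'in']

# "caf\u00e9" is 'cafe' with an accented last letter
unicode_words = re.findall(r"\w+", "caf\u00e9")
ascii_words = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)

print(unicode_words)  # ['caf\u00e9'] (accented letter included)
print(ascii_words)    # ['caf'] (only [a-zA-Z0-9_] counts)
```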

Anchors

Say you are trying to match all commented lines in a bash script.
You want to match all #.* but only at the beginning of the line.

For that, you would use ^, like so: ^#.*

^ ➡️ Start of the line.
$ ➡️ End of the line.
\b ➡️ Word boundary (beginning or end of word).

So, for a regex like \bFOO$:
FOO in What a nice line of text BAR FOO would match.
FOO in What a nice line of text BARFOO would not.
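
Both examples from this section can be tried out with the sketch below. It assumes Python's re module, where ^ and $ only work per line if you pass re.MULTILINE (without it they anchor to the whole string). The little bash script is made up.

```python
import re

# Made-up bash script held in a string
script = "#!/bin/bash\n# say hi\necho hi # trailing comment\n"

# ^#.* with re.MULTILINE: only lines that START with '#'
comment_lines = re.findall(r"^#.*", script, flags=re.MULTILINE)
print(comment_lines)  # ['#!/bin/bash', '# say hi']

pattern = r"\bFOO$"

# 'FOO' is its own word at the end of the line, so it matches
separate = re.search(pattern, "What a nice line of text BAR FOO")

# 'FOO' is glued to 'BAR', so there is no word boundary before it
glued = re.search(pattern, "What a nice line of text BARFOO")

print(separate is not None)  # True
print(glued is None)         # True
```

The trailing comment on the echo line is left alone because its # is not at the start of the line.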

Logical OR

Just like your if statements!
foo|bar ➡️ Would match either foo OR bar.
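
A tiny sketch of the OR in practice, assuming Python's re module and a made-up sentence:

```python
import re

text = "foo, bar and baz"

# Either alternative counts as a match
found = re.findall(r"foo|bar", text)
print(found)  # ['foo', 'bar']

# 'baz' is neither 'foo' nor 'bar'
baz_matches = re.fullmatch(r"foo|bar", "baz")
print(baz_matches is None)  # True
```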

Escaping special chars with \

What if you want your regex to match one (or more) of the special chars we’ve seen (like $, [ or +)?

Your safest bet is to escape them by putting a \ in front of them.

If we take our previous example and escape the $: \bFOO\$:
FOO in What a nice line of text BAR FOO$ this is getting pretty long would match.
FOO in What a nice line of text BAR FOO would not.


This is the main reason regexes get so unreadable.
Don’t let the \ scare you.
They are there for the engine but (apart from the occasional shorthand) they have no meaning for humans!
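
The escaped example above can be checked with this sketch, assuming Python's re module. Python also ships a helper, re.escape, that adds the backslashes for you. Other languages may or may not have an equivalent.

```python
import re

# \$ is a literal dollar sign, not the end-of-line anchor
pattern = r"\bFOO\$"

with_dollar = "What a nice line of text BAR FOO$ this is getting pretty long"
without_dollar = "What a nice line of text BAR FOO"

hit = re.search(pattern, with_dollar)
miss = re.search(pattern, without_dollar)

print(hit.group(0))  # FOO$
print(miss is None)  # True

# Let the library escape the special chars for us
escaped = re.escape("$[+")
print(escaped)  # \$\[\+
```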


Grouping and References

One neat trick that most regex engines will allow you to do is grouping parts of the match and referencing them later in the regex.

One regex can have multiple groups and these get referenced by their number (starting with 1).

You surround the group in () and reference it with \ followed by the group’s number.

This is easy to grasp with an example:
(foo) (bar) \2\1 ➡️ Will match foo bar barfoo (notice the spaces).

How on earth is this useful?
If you know how Sed works, you can probably imagine this can save a lot of headaches.
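
As a sketch of both uses (assuming Python's re module; the key=value string is made up), here is a back-reference inside the pattern itself, followed by a Sed-style substitution where the groups are referenced in the replacement:

```python
import re

# \2\1 must repeat whatever groups 2 and 1 captured
pattern = r"(foo) (bar) \2\1"
match = re.search(pattern, "foo bar barfoo")

print(match.group(0))  # foo bar barfoo
print(match.group(1))  # foo
print(match.group(2))  # bar

# Groups can also be referenced in a replacement,
# much like the \1 and \2 in a Sed substitution
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")
print(swapped)  # value=key
```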

Negations

You can negate parts of your regex using lookarounds.

Say you want to match all instances of foo followed by anything but bar, followed by baz.
So for example, we want foobatbaz to match but not foobarbaz.

Using a lookahead like foo(?!bar).+?baz, you would do just that. We negate the part of the regex that is between parentheses and preceded by ?!.

It simply means ‘not followed by (?!this)’.
This foobatbaz is weird would match.
This foobarbaz is weird would not.
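
The lookahead example can be verified with this sketch, assuming Python's re module. Not every engine supports lookarounds, so check yours before relying on them.

```python
import re

# foo, NOT immediately followed by bar, then anything up to baz
pattern = r"foo(?!bar).+?baz"

hit = re.search(pattern, "This foobatbaz is weird")
miss = re.search(pattern, "This foobarbaz is weird")

print(hit.group(0))  # foobatbaz
print(miss is None)  # True
```

The lookahead itself consumes nothing: it only checks what comes next, and the .+? is what actually matches bat.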

Similarly, you might want to go about this the other way around.

Say you want to match all instances of foo except when it is preceded by bar.
You could achieve this by using lookbehinds like (?<!bar)foo.

It simply means ‘not preceded by (?<!this)’.
This batfoo is weird would match.
This barfoo is weird would not.
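
And the lookbehind version, again as a sketch assuming Python's re module. One limitation of this particular engine: the pattern inside a lookbehind has to have a fixed length, so something like (?<!ba+r) is rejected.

```python
import re

# foo, NOT immediately preceded by bar
pattern = r"(?<!bar)foo"

hit = re.search(pattern, "This batfoo is weird")
miss = re.search(pattern, "This barfoo is weird")

print(hit.group(0))  # foo
print(hit.start())   # 8
print(miss is None)  # True
```

Only foo ends up in the match. The bat before it was inspected but is not part of the result.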

Both lookaheads and lookbehinds can be used to match a pattern while negating another one.
Which one to use just depends on whether you want to negate something before or after a match.
