Regex

Regexes (regular expressions) are used throughout computer science to represent text patterns that we want to search for.

Aim: To understand and be able to write regexes.

The pieces of regex syntax we will learn are: ., ?, +, *, {}, |, (), [], [-], [^-] and \, along with how to combine these elements together.

The basics: ., ?, +, * and {}

Let's take a look at some examples.

The simplest of these characters is the wildcard character .. This represents a single character that can be anything (in most regex types, anything except a line break). So a. could represent a", aQ or any other two-character String beginning with a.

The ? character indicates that the preceding character may or may not be present. So a? represents the group of Strings made up of the empty String and a.

The + character indicates that there may be one or more copies of the preceding character. So a+ represents the group of Strings a, aa, aaa, aaaa, ... etc.

The * character represents that the preceding character may occur any number of times (including zero). For example, xy*z represents the group of Strings xz, xyz, xyyz, xyyyz, ... etc.
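The four symbols above can be tried out with a short sketch. We use Python's re module here purely for illustration (the choice of language and the helper name matches are our own assumptions); re.fullmatch checks whether the whole String fits the pattern.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# . stands for exactly one character (any character except a line break by default)
print(matches(r"a.", "aQ"))       # True
print(matches(r"a.", "a"))        # False: there is no second character

# ? makes the preceding character optional
print(matches(r"a?", ""))         # True
print(matches(r"a?", "a"))        # True

# + needs one or more copies of the preceding character
print(matches(r"a+", "aaaa"))     # True
print(matches(r"a+", ""))         # False

# * allows zero or more copies of the preceding character
print(matches(r"xy*z", "xz"))     # True
print(matches(r"xy*z", "xyyyz"))  # True
```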

We can use curly brackets {} after a character to say that we want a certain number of repetitions of that character. p{3} represents the String ppp. We can also specify a range, e.g. p{2,4} representing pp, ppp and pppp (note that there is no space after the comma; many regex types do not treat p{2, 4} as a range). Finally, we can say we want x or more repetitions using p{x,}.
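As a rough illustration of curly brackets, again assuming Python's re module, the sketch below checks the exact, ranged and open-ended forms. The helper name fits is hypothetical.

```python
import re

def fits(pattern, text):
    """Hypothetical helper: True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# {3}: exactly three copies
print(fits(r"p{3}", "ppp"))   # True
print(fits(r"p{3}", "pp"))    # False

# {2,4}: between two and four copies (written without a space after the comma)
for text in ["p", "pp", "ppp", "pppp", "ppppp"]:
    print(text, fits(r"p{2,4}", text))  # True only for pp, ppp and pppp

# {2,}: two or more copies, with no upper limit
print(fits(r"p{2,}", "pppppppp"))  # True
print(fits(r"p{2,}", "p"))         # False
```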

Stating "OR" with: | and []

There are two simple ways to write OR, one with | and one with []. Say you want to search a document for the words "affect" and "effect" to check that you have used them correctly. We can write our regex in any of the following ways.

  • Using | we can write affect|effect or a shortened version (a|e)ffect using parentheses ()

  • Using [] we can write [ae]ffect
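To see that these three ways of writing OR find the same words, here is a small sketch assuming Python's re module. The sample sentence is made up for the illustration.

```python
import re

patterns = [r"affect|effect", r"(a|e)ffect", r"[ae]ffect"]
text = "The effect was large, but it did not affect the result."

results = {}
for pattern in patterns:
    # re.finditer yields one match per occurrence; group(0) is the full matched text
    results[pattern] = [m.group(0) for m in re.finditer(pattern, text)]
    print(pattern, results[pattern])  # every pattern finds ['effect', 'affect']
```

We use re.finditer rather than re.findall because, when a pattern contains parentheses, re.findall returns only the part inside them.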

Ranges in brackets: [-] and [^-]

We might want to specify that a character should be within some range. To do this we write [a-z] to say that it should be a lower case alphabetic character ([A-Z] for upper case).

We can also combine these so that [a-zA-Z0-9] represents that we want any alphanumeric character!

Adding a caret ^ just inside the start of the square brackets indicates that we want any character except those in the brackets. [^0-9] means we want a single character that is not a digit.
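A short sketch of ranges and negated ranges, assuming Python's re module. re.findall returns every single character in the text that fits the pattern, and the sample Strings are made up for the illustration.

```python
import re

lower = re.findall(r"[a-z]", "Regex 101")
upper = re.findall(r"[A-Z]", "Regex 101")
alnum = re.findall(r"[a-zA-Z0-9]", "a-b_C 9!")
non_digits = re.findall(r"[^0-9]", "a1b22c")

print(lower)       # ['e', 'g', 'e', 'x']
print(upper)       # ['R']
print(alnum)       # ['a', 'b', 'C', '9']
print(non_digits)  # ['a', 'b', 'c']
```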

Searching for a special character?

Sometimes we may want to search for a . or a ? character without the special meanings given above. In order to do this we need to escape the character using the backslash \ just before it.

An example of this is the regex www\.aWebAddress\.com, which represents the String www.aWebAddress.com.

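The characters we need to escape like this depend on the regex type you are working with, but the most common ones are: ., ?, +, *, {, }, |, (, ), [, ], -, ^, \ and $. Some of these only need escaping in certain positions; for example, - is only special inside square brackets.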
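The difference that escaping makes can be sketched as follows, assuming Python's re module. The String wwwXaWebAddressXcom is a made-up example of something that only the unescaped pattern accepts, and re.escape is Python's helper for adding the backslashes automatically.

```python
import re

escaped = r"www\.aWebAddress\.com"
unescaped = r"www.aWebAddress.com"

def fits(pattern, text):
    """Hypothetical helper: True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(fits(escaped, "www.aWebAddress.com"))    # True
print(fits(escaped, "wwwXaWebAddressXcom"))    # False: \. only accepts a real dot
print(fits(unescaped, "wwwXaWebAddressXcom"))  # True: . is still a wildcard here

# re.escape inserts the backslashes for us
auto = re.escape("www.aWebAddress.com")
print(auto)  # www\.aWebAddress\.com
```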

Wrapping a Regex

How are regexes usually displayed? In many languages and tools we enclose a regex in two forward slashes, /regex goes here/. This tells your program where the regex begins and ends. It also allows some flags to be added after the closing slash. We look at two of them here:

  • /regex/i is used to indicate that the regex is case insensitive, so it will return results including regex, rEgEX, REGEX etc.

  • /regex/g is used to specify a global search. That means the search does not stop when one matching String is found; it resumes from the end of that match, so that every match in the text is returned.
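Not every language uses the slash notation. As a rough equivalent, assuming Python's re module: the i flag corresponds to re.IGNORECASE, and instead of a g flag we choose between re.search (first match only) and re.findall (all matches). The sample text is made up for the illustration.

```python
import re

text = "regex, rEgEX and REGEX"

# Like /regex/ with no flags: case sensitive, first match only
first = re.search(r"regex", text)
print(first.group(0))  # regex

# Like /regex/i: case insensitive
insensitive = re.search(r"regex", "REGEX", flags=re.IGNORECASE)
print(insensitive is not None)  # True

# Like /regex/gi: case insensitive, and every match is returned
every = re.findall(r"regex", text, flags=re.IGNORECASE)
print(every)  # ['regex', 'rEgEX', 'REGEX']
```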

Examples

  • The following will represent usernames which begin with exactly 10 alphabetic characters, followed by 5 numeric characters: /[a-zA-Z]{10}[0-9]{5}/ (to allow up to 10 alphabetic characters instead, write {1,10} in place of {10})

  • You may want to search for runs of one or more punctuation or space characters in order to use them as delimiters in a String splitting operation, using this regex: /[,.;:' ]+/

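  • This regex represents simple ".com" email addresses where a . character is allowed before the @ only: /[a-zA-Z0-9\.]+@[a-zA-Z0-9]+\.com/ (the + after each pair of square brackets allows more than one character on either side of the @)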
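The three examples can be tried out with the sketch below, assuming Python's re module. All sample usernames, sentences and addresses are made up. Note that the email pattern here has a + after each pair of square brackets, so that more than one character is accepted on either side of the @, and that {10} in the username pattern means exactly 10 letters.

```python
import re

def fits(pattern, text):
    """Hypothetical helper: True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Exactly 10 letters followed by exactly 5 digits
username = r"[a-zA-Z]{10}[0-9]{5}"
print(fits(username, "abcdefghij12345"))  # True
print(fits(username, "abc12345"))         # False: only 3 letters

# Splitting a String on runs of punctuation and spaces
parts = re.split(r"[,.;:' ]+", "one, two;three: four")
print(parts)  # ['one', 'two', 'three', 'four']

# ".com" addresses with dots allowed before the @ only
email = r"[a-zA-Z0-9\.]+@[a-zA-Z0-9]+\.com"
print(fits(email, "jane.doe@example.com"))   # True
print(fits(email, "jane@mail.example.com"))  # False: dot after the @
```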

If you want to test that your regex works, take a look at Regex101!

As said in the introduction, regexes are commonly used to represent a text pattern that we want to search for. You will find an introduction to the command line search tool grep in the article Grep, and an introduction to the search and replace tool sed in Sed. Both of these tools use regexes to represent text patterns.