How To Use Regular Expressions

2023-12-04 | By Maker.io Staff

This article introduces you to the basics of regular expressions and how they facilitate more advanced searches in textual data. For that purpose, it compares regular expressions to full-text search, introduces common search patterns, and gives examples of how the presented patterns function.

What are Regular Expressions?

A regular expression (often abbreviated to regex) is a sequence of characters that lets you define search patterns, often used for matching and manipulating strings in text processing. These expressions facilitate more advanced searches and text manipulation than a simple full-text search alone.

A simple full-text search represents the most basic regular expression. In this case, you search for exact matches of a sequence of characters, such as a word. The results only include precise duplicates of the search term.

In contrast, regular expressions support searching for general patterns, such as email addresses or phone numbers, allowing programs to quickly find such strings in long bodies of text like websites.

Regex Patterns Every Developer Should Know

You may sometimes want to find strings that begin or end with specific character sequences. Two examples include finding all strings that start with “http” or end with “com”. The following two regular expressions accomplish these tasks:

http.*
.*com
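
The two patterns above can be tried out with a minimal Python sketch. The sample strings are hypothetical and only serve as illustration. Because neither pattern is tied to a position by itself, the sketch assumes `re.match` (which only matches at the start of a string) and `re.fullmatch` (which requires the whole string to match) to express "starts with" and "ends with":

```python
import re

# Hypothetical sample strings, chosen only for illustration.
samples = [
    "http://example.org",
    "https://example.com",
    "example.com",
    "ftp://example.net",
]

starts_with_http = re.compile(r"http.*")
ends_with_com = re.compile(r".*com")

# re.match only succeeds if the pattern matches at the start of the string.
http_hits = [s for s in samples if starts_with_http.match(s)]

# re.fullmatch requires the whole string to match, so "com" must come last.
com_hits = [s for s in samples if ends_with_com.fullmatch(s)]

print(http_hits)  # ['http://example.org', 'https://example.com']
print(com_hits)   # ['https://example.com', 'example.com']
```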

In a regular expression, the period states that any single character can be in this position of the string (in most engines, any character except a line break). The asterisk is a so-called quantifier that matches zero or more occurrences of the character preceding it. Together, the period and asterisk mean that there may be any number of any character in this position. Besides the asterisk, you should know about four other vital quantifiers:

[Image: table of regex quantifiers]

As an example, the following regular expression matches strings that start with http or https, then contain at least one arbitrary character followed by two to three lowercase letters, and end with a forward slash:

https?.+[a-z]{2,3}\/
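
The following Python sketch checks this pattern against a few hypothetical candidate strings. It assumes `re.fullmatch`, so each candidate must match from its first to its last character. The pattern combines three quantifiers: `?` (zero or one), `+` (one or more), and `{2,3}` (two to three):

```python
import re

url_pattern = re.compile(r"https?.+[a-z]{2,3}\/")

# Hypothetical candidates, chosen only for illustration.
candidates = [
    "http://example.com/",   # http, no "s"
    "https://example.org/",  # https
    "https://example.org",   # missing the trailing slash
    "ftp://example.org/",    # does not start with http
]

# Map each candidate to True or False depending on whether it matches.
url_results = {c: bool(url_pattern.fullmatch(c)) for c in candidates}

for candidate, matched in url_results.items():
    print(candidate, matched)
```

Only the first two candidates match. The third lacks the final slash, and the fourth does not begin with http.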

Matching Special Characters

Note that the forward slash appears in its escaped form, with a preceding backslash, to tell the regex interpreter that a literal forward slash should be in this position. This matters in environments that use the forward slash to delimit the pattern itself; many other engines treat the escaped and unescaped slash alike. In general, escaping is necessary to differentiate quantifiers and other control characters from the actual printed symbol. As an example, consider the following regular expression that matches zero to arbitrarily many asterisk symbols:

\**

The first asterisk appears with a preceding backslash to indicate that the string should contain a literal asterisk. The second asterisk occurs unescaped and thus represents the quantifier that matches zero to arbitrarily many occurrences of the preceding symbol.
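
A short Python sketch of the escaped asterisk, using `re.fullmatch` so that the entire test string has to consist of asterisks. The call to `re.escape` is an addition to the article's example; it is one way to produce the escaped form of a literal string without typing the backslashes by hand:

```python
import re

asterisks = re.compile(r"\**")

# fullmatch returns a match object only if the entire string fits the pattern.
empty_ok = asterisks.fullmatch("") is not None     # zero asterisks
three_ok = asterisks.fullmatch("***") is not None  # three asterisks
mixed_ok = asterisks.fullmatch("**a") is not None  # contains another character

# re.escape adds the backslash needed to match the string literally.
escaped = re.escape("*")

print(empty_ok, three_ok, mixed_ok)  # True True False
print(escaped)                       # \*
```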

In addition, there are a few escape sequences that encode certain characters or groups of characters:

[Image: table of escape sequences]

As an example, the following regular expression matches every run of one or more word characters, which is useful for extracting the words from a text:

\w+

This example illustrates how the \w+ regex matches any word that contains at least one alphanumeric character or an underscore.
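
The word-matching behavior can be reproduced with a small Python sketch. The sample sentence is hypothetical. Besides `\w`, the sketch assumes the standard `\d` (digit) and `\s` (whitespace) sequences, which correspond to the digits and whitespace mentioned in the summary:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66 was sent to user_42 at 10:30!"

# \w+ matches runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)

# \d+ matches runs of digits only.
numbers = re.findall(r"\d+", text)

# \s+ matches runs of whitespace, used here to split the text.
chunks = re.split(r"\s+", text)

print(words)    # ['Order', '66', 'was', 'sent', 'to', 'user_42', 'at', '10', '30']
print(numbers)  # ['66', '42', '10', '30']
print(chunks)   # ['Order', '66', 'was', 'sent', 'to', 'user_42', 'at', '10:30!']
```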

Defining Character Ranges in Regular Expressions

The previous word-matching example also includes all words with digits as well as numbers without letters. You could use the following regex to define a custom pattern to find all words in a text consisting solely of lowercase letters:

[abcdefghijklmnopqrstuvwxyz]+

The square brackets define that the regex should accept any character enclosed by the brackets in a particular position. Note that the selection in the previous example excludes all uppercase letters. Therefore, the result would exclude the capital letters from the matched words. In addition, defining ranges that contain many characters can quickly become tedious. Luckily, you can specify the start and end of ranges without typing out all the intermediate values:

[a-z]+
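
As a quick check, the Python sketch below runs the spelled-out character set and the range form against the same hypothetical text and confirms that they return identical results. The combined range `[a-zA-Z]` is an addition to the article's examples; it shows one common way to include the capital letters as well:

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Hello World abc123"

explicit = re.findall(r"[abcdefghijklmnopqrstuvwxyz]+", text)
ranged = re.findall(r"[a-z]+", text)

# Several ranges can share one pair of brackets.
any_case = re.findall(r"[a-zA-Z]+", text)

print(explicit == ranged)  # True
print(ranged)              # ['ello', 'orld', 'abc']
print(any_case)            # ['Hello', 'World', 'abc']
```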

Excluding Characters in a Regex Search

Using the range-definition syntax, you can also exclude certain characters from the search, meaning that these characters may not appear in a specific place for the string to match:

[^abc]

This example matches all characters not in the specified set, meaning any except a, b, and c. Additionally, you can utilize the negated form of the escape sequences discussed above:

[Image: table of negated escape sequences]
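
The exclusion syntax can be sketched in Python as follows. The input strings are hypothetical, and the sketch assumes that the negated escape sequences meant here are the standard uppercase forms `\W` (non-word character), `\D` (non-digit), and `\S` (non-whitespace):

```python
import re

# [^abc] matches any single character except a, b, and c.
not_abc = re.findall(r"[^abc]", "abcdef")

# \D matches any non-digit; removing those leaves only the digits.
digits_only = re.sub(r"\D", "", "Tel: 555-0199")

# \W+ matches runs of non-word characters such as punctuation and spaces.
separators = re.findall(r"\W+", "a, b; c")

# \S+ matches runs of non-whitespace characters.
tokens = re.findall(r"\S+", "two  words")

print(not_abc)      # ['d', 'e', 'f']
print(digits_only)  # 5550199
print(separators)   # [', ', '; ']
print(tokens)       # ['two', 'words']
```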

Understanding Regex Anchors

Note that the previous examples all match occurrences irrespective of their position in a string. However, matching entire lines in a body of text is sometimes desired. Similarly, some applications may require matching complete constructs, such as email addresses, without capturing fragments that look similar. To accomplish these tasks, you can utilize so-called anchors — for example, the start-of-string (^) and end-of-string ($) anchors. These control symbols don't match any characters. Instead, they anchor the pattern to a specific position within the string. When used together, the ^ and $ anchors ensure that a pattern matches the entire string from beginning to end.

Consider, for example, the following list of email addresses:

office@company.com
any1@mail.com
some_person@other_company.fr
this line does not contain a valid address
however, this one does: test@mail.uk

The following simplified pattern matches all email addresses in the text regardless of their position:

[\w]+\@[\w]+\.[a-z]{2,3}

Using a regex tester, the result looks as expected:

This image illustrates how the regex without any anchors finds all occurrences of a string in a text.
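
The same result can be reproduced without a regex tester. The following Python sketch applies the simplified pattern to the sample text from above using `re.findall`; the variable names are arbitrary:

```python
import re

text = """office@company.com
any1@mail.com
some_person@other_company.fr
this line does not contain a valid address
however, this one does: test@mail.uk"""

email = re.compile(r"[\w]+\@[\w]+\.[a-z]{2,3}")

# findall returns every non-overlapping match, wherever it occurs.
found = email.findall(text)

for address in found:
    print(address)
```

This prints the four addresses contained in the text, including the one that appears in the middle of the last line.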

However, adding the start-of-string anchor leads to the pattern only matching the first occurrence in the string and only if it appears right at the beginning of the text:

The ^ anchor only finds the first occurrence of a string in a text and only if it appears right at the start of the text.

Similarly, the end-of-string anchor leads to the regex only matching the last address and only if it occurs precisely at the end of the string:

The $ anchor finds the last occurrence of a string in a text and only if it appears right at the end of the text.

Using the two anchors in combination means that the entire string must match the pattern. In this instance, the text would have to contain exactly a single well-formed email address for the pattern to match. Therefore, combining both anchors allows checking whether an entire input sequence obeys specific rules.
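
The effect of the anchors can be sketched in Python with the same sample text. The sketch assumes Python's default mode, in which ^ and $ refer to the start and end of the whole string rather than of each line (the `re.MULTILINE` flag would change that):

```python
import re

text = """office@company.com
any1@mail.com
some_person@other_company.fr
this line does not contain a valid address
however, this one does: test@mail.uk"""

core = r"[\w]+\@[\w]+\.[a-z]{2,3}"

# Only an address right at the start of the text matches.
at_start = re.findall("^" + core, text)

# Only an address right at the end of the text matches.
at_end = re.findall(core + "$", text)

# With both anchors, the whole input must be a single address.
whole_text = re.findall("^" + core + "$", text)
single_address = re.findall("^" + core + "$", "test@mail.uk")

print(at_start)        # ['office@company.com']
print(at_end)          # ['test@mail.uk']
print(whole_text)      # []
print(single_address)  # ['test@mail.uk']
```

The last two calls show the validation use case: the multi-line text fails the check, while an input that consists of exactly one address passes.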

Summary

Regular expressions (regex) serve as powerful tools for advanced text searching and manipulation. They enable you to define search patterns that surpass the capabilities of basic full-text search.

The period symbol represents the most basic regex pattern, matching any character. Similarly, you can build regular expressions that require certain characters to appear in specific positions of a string. When combined with quantifiers, you can define that certain characters must occur a particular number of times for the regex to match.

Some characters, such as the period and asterisk, can represent either the literal character or a control symbol. Therefore, you must escape these characters when you want to match them literally. Several predefined escape sequences allow matching word characters, whitespace, and digits.

You can also define custom ranges of characters if you find those escape sequences to be too general in specific instances. Using custom ranges, you can also exclude certain characters so they may not appear in a string.

Finally, anchors help align the search pattern with the input string to match only occurrences in specific positions of a text, for example, the start or end.
