Capturing text with Regular expressions

Regular expressions define what the text you want to capture looks like. They are a specification of a text format. Compared with the other capture methods available, however, they are more complex and more difficult to use. Because they are widely used in other contexts and very well documented, this help topic gives only a brief explanation of what regular expressions are and how they work in Email Parser. Entire books and websites are devoted to the subject.

The very basics of Regular expressions

A regular expression is a text string that uses tokens to match a given text. For example, the token \d matches any single digit from 0 to 9:

Regular expression	Input text	Captured text
\d\d\d	Hello John, please call me to 788-383-134	788
-\d\d\d-	Hello John, please call me to 788-383-134	-383-
\d\d\d-\d\d\d-\d	Hello John, please call me to 788-383-134	788-383-1
\d\d\d\d\d	Hello John, please call me to 788-383-134	(no match)

As you can see, given a regular expression and an input text, there can be no matches, a single match or multiple matches.
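
The table above can be reproduced outside Email Parser. What follows is a minimal sketch in Python, chosen only for illustration; it is not how Email Parser runs internally. For these simple patterns, Python's re module interprets \d in the same way. re.findall returns every match, so the three outcomes (no match, a single match, several matches) are all visible.

```python
import re

text = "Hello John, please call me to 788-383-134"

# Each pattern from the table above, applied to the same input text.
three_digits = re.findall(r"\d\d\d", text)              # several matches
between_hyphens = re.findall(r"-\d\d\d-", text)         # a single match
partial_number = re.findall(r"\d\d\d-\d\d\d-\d", text)  # a single match
five_digits = re.findall(r"\d\d\d\d\d", text)           # no match

print(three_digits)     # ['788', '383', '134']
print(between_hyphens)  # ['-383-']
print(partial_number)   # ['788-383-1']
print(five_digits)      # []
```

The table lists 788 for \d\d\d, which is the first of the three matches found in this text.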

In Email Parser, regular expressions are one of the available capture methods.

The method “Starts after… continues until…” also accepts regular expressions if needed.

In the first one, we define with a regular expression what the text we want to capture looks like. In the second, we use regular expressions to set the text boundaries, not the text itself.
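
The difference between the two approaches can be sketched in Python. The function capture_between below is hypothetical and only illustrates the idea of using regular expressions as boundaries; the sample text is made up, and returning nothing when a boundary is missing is an assumption, not a description of what Email Parser does.

```python
import re

def capture_between(text, starts_after, continues_until):
    """Return the text found between two boundary patterns, or None.

    Hypothetical helper: the patterns describe the boundaries,
    not the captured text itself.
    """
    start = re.search(starts_after, text)
    if start is None:
        return None  # assumption: no start boundary means no capture
    rest = text[start.end():]
    end = re.search(continues_until, rest)
    if end is None:
        return None  # assumption: no end boundary means no capture
    return rest[:end.start()]

# Made-up input text for illustration.
text = "Order id: A233-531\nThanks for your purchase"

# Describe the text itself with a regular expression.
by_shape = re.search(r"\w\d\d\d-\d\d\d", text).group()

# Describe only what comes before and after the text.
by_boundaries = capture_between(text, r"id:\s", r"\n")

print(by_shape)       # A233-531
print(by_boundaries)  # A233-531
```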

There are many other types of tokens. The most used ones are:

Token	Description
.	Matches any character except the line break (yes, the line break is a character)
\s	Matches a whitespace character, such as a space, a tab or a line break
\w	Matches any word character: a letter, a digit or the underscore
[aeiou]	Matches any vowel. You can replace “aeiou” with any set of characters. For example, [abc] will match a, b or c
\n	Matches a new line character
[a-zA-Z]	Matches any character in the ranges a to z and A to Z

You can combine tokens to build more complex text captures. For example:

Regular expression	Input text	Captured text
\w\d\d\d-\d\d\d	The order id is A233-531	A233-531
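
The tokens listed above can be tried one by one. The following Python sketch is only an illustration; the short sample strings other than the order id text are made up for the purpose.

```python
import re

text = "The order id is A233-531"

# Combined tokens: one word character, three digits, a hyphen, three digits.
order_id = re.search(r"\w\d\d\d-\d\d\d", text).group()

# A character set matches any single character listed inside the brackets.
vowels = re.findall(r"[aeiou]", "order")

# A range matches any single character between its two ends.
letters = re.findall(r"[a-zA-Z]", "A233-531")

# \s matches each whitespace character in the text.
spaces = re.findall(r"\s", text)

# The dot matches every character except the line break.
no_line_break = re.findall(r".", "ab\ncd")

print(order_id)       # A233-531
print(vowels)         # ['o', 'e']
print(letters)        # ['A']
print(len(spaces))    # 4
print(no_line_break)  # ['a', 'b', 'c', 'd']
```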

Quantifiers are combined with the tokens shown above to build more complex regular expressions. The most common ones are:

Quantifier	Meaning
*	0 or more of the previous expression.
+	1 or more of the previous expression.
?	0 or 1 of the previous expression. Placed after another quantifier (as in *? or +?), it forces minimal matching when an expression might match several strings within the search string.

For example:

Regular expression	Input text	Captured text
\d+	Hello John, please call me to 788-383-134	788
-\d+-?	Hello John, please call me to 788-383-134	-383-
J\w*	Hello John, please call me to 788-383-134	John
.*	Hello John, please call me to 788-383-134	Hello John, please call me to 788-383-134
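
The quantifier examples above, plus the minimal matching behavior of ?, can be checked with a short Python sketch. It is only an illustration; re.search returns the first match, which is what the table shows. The last two patterns are not from the table and were added to contrast greedy and minimal matching.

```python
import re

text = "Hello John, please call me to 788-383-134"

first_number = re.search(r"\d+", text).group()     # one or more digits
middle_part = re.search(r"-\d+-?", text).group()   # optional trailing hyphen
name = re.search(r"J\w*", text).group()            # J plus any word characters
whole_line = re.search(r".*", text).group()        # everything up to a line break

# By default a quantifier takes as much text as it can (greedy).
greedy = re.search(r"\d.*-", text).group()
# Adding ? after the quantifier makes it take as little as it can.
minimal = re.search(r"\d.*?-", text).group()

print(first_number)  # 788
print(middle_part)   # -383-
print(name)          # John
print(greedy)        # 788-383-
print(minimal)       # 788-
```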

Capturing text with a capture group

A capture group is a label within a regular expression that gives a name to a part of the matching text. For example, a phone number has a part that we can label “prefix”, and a date has “month”, “year” and “day”. Capture groups are helpful if you want to capture only part of the regular expression match rather than the full match.

We identify a capture group by entering a name inside the regular expression like this:

\d\d-\d\d-(?'year'\d\d\d\d)

We have enclosed the capture group named year in parentheses. In plain English, this means: year is the four-digit text that comes after two digits, a hyphen, another two digits and another hyphen.

If Email Parser finds a capture group with the same name as the field name, it will take that part as the captured text. Otherwise it will take the full match. For example:

Email Parser field name	Regular expression	Input text	Captured text
prefix	(?'prefix'\d+)-\d+-\d+	Hello John, please call me to 788-383-134	788
month	(?'year'\d+)/(?'month'\d+)/(?'day'\d+)	The date is 2017/6/8. Blah blah	6
year	(?'year'\d+)/(?'month'\d+)/(?'day'\d+)	The date is 2017/6/8. Blah blah	2017
address	(?'year'\d+)/(?'month'\d+)/(?'day'\d+)	Hello Carl, some text here 2017/6/8 etc etc	2017/6/8
address	(?'year'\d+)/(?'month'\d+)/(?'day'\d+)	Hello Carl, some text here etc etc	(no match)
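
The rule described above (take the group named like the field if there is one, otherwise take the full match) can be sketched in Python. Two things here are assumptions made for the illustration: Python writes named groups as (?P<year>...) instead of the (?'year'...) form used in Email Parser, and the function capture_field is hypothetical, not part of Email Parser.

```python
import re

def capture_field(field_name, pattern, text):
    """Hypothetical helper mimicking the rule described above."""
    match = re.search(pattern, text)
    if match is None:
        return None  # no match, so nothing is captured
    if field_name in match.groupdict():
        # A capture group has the same name as the field: take that part.
        return match.group(field_name)
    # Otherwise take the full match.
    return match.group()

# Same patterns as in the table, written with Python's (?P<name>...) syntax.
date_pattern = r"(?P<year>\d+)/(?P<month>\d+)/(?P<day>\d+)"
phone_pattern = r"(?P<prefix>\d+)-\d+-\d+"

phone_text = "Hello John, please call me to 788-383-134"
date_text = "The date is 2017/6/8. Blah blah"

print(capture_field("prefix", phone_pattern, phone_text))  # 788
print(capture_field("month", date_pattern, date_text))     # 6
print(capture_field("year", date_pattern, date_text))      # 2017
print(capture_field("address", date_pattern, date_text))   # 2017/6/8
```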