Capturing email body text with regular expressions

December 27, 2013

Here comes another example from a real use case (published with permission from the customer).

We have an email with the following text:

Delivery Type: Residential
QTY ITEM CODE DESCRIPTION PRICE
------- ----------------- ----------------- -------
1 RT32323-7 Arvo Balck $26.00
1 SG43242-95 Arvo Blue $23.00
--------------------------------------------------- ----------
 TOTAL: $49.00

We want to capture the item quantity, code, description and price with regular expressions. The first step is to create a new field with the text capture method “Filtering and replacing”, which removes the surrounding text and leaves only the table with the data we want to capture.

What we do here is remove every line that matches the regular expression:

(Delivery.*|QTY.*|---.*|.*TOTAL.*)

This means the following:

Any line that starts with Delivery, QTY or ---, or that contains the word TOTAL.

The result is stored in a field we have called items_list and is the following:

1 RT32323-7 Arvo Balck $26.00
1 SG43242-95 Arvo Blue $23.00
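
Outside of Email Parser, the filtering step above can be approximated in a few lines of Python. This is only a sketch of the idea, not how the application implements “Filtering and replacing”: the names EMAIL_BODY, UNWANTED_LINE and filter_items are hypothetical, and the sketch assumes that a line is dropped when the whole line matches the expression.

```python
import re

# Sample email body copied from the example above.
EMAIL_BODY = """Delivery Type: Residential
QTY ITEM CODE DESCRIPTION PRICE
------- ----------------- ----------------- -------
1 RT32323-7 Arvo Balck $26.00
1 SG43242-95 Arvo Blue $23.00
--------------------------------------------------- ----------
 TOTAL: $49.00
"""

# Same expression as in the article: header, separator and total lines.
UNWANTED_LINE = re.compile(r"(Delivery.*|QTY.*|---.*|.*TOTAL.*)")


def filter_items(body: str) -> str:
    """Return the body without empty lines and without lines matching UNWANTED_LINE."""
    kept = [
        line
        for line in body.splitlines()
        if line.strip() and not UNWANTED_LINE.fullmatch(line)
    ]
    return "\n".join(kept)


# Hypothetical stand-in for the items_list field.
items_list = filter_items(EMAIL_BODY)
print(items_list)
```

Only the two item lines are left in items_list, which is the input used in the following steps.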

Well, this piece of text is far easier to process. There are no other lines or unwanted text that may confuse us. Now we build a regular expression that matches the format of a full line:

\d+\s+\w+\-\d+\s+.*\s+\$\d+\.\d+

The meaning of this is a bit tricky:

“One or more digits, followed by one or more spaces, followed by one or more alphanumeric characters, followed by “-”, followed by one or more digits, followed by one or more spaces, followed by any string of characters, followed by one or more spaces, followed by a dollar sign, followed by one or more digits, then a dot, and then one or more digits”.
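
A quick way to check this kind of expression before relying on it is to run it against the lines that should match and the lines that should not. The sketch below does this in Python; LINE_PATTERN and is_item_line are hypothetical names, and the dot before the cents is escaped so that it only matches a literal dot.

```python
import re

# Full-line expression from the article, with the decimal dot escaped.
LINE_PATTERN = re.compile(r"\d+\s+\w+\-\d+\s+.*\s+\$\d+\.\d+")


def is_item_line(line: str) -> bool:
    """True when the whole line has the quantity / code / description / price shape."""
    return LINE_PATTERN.fullmatch(line) is not None


# Lines taken from the sample email: two table rows and three other lines.
for line in [
    "1 RT32323-7 Arvo Balck $26.00",
    "1 SG43242-95 Arvo Blue $23.00",
    "Delivery Type: Residential",
    "QTY ITEM CODE DESCRIPTION PRICE",
    " TOTAL: $49.00",
]:
    print(is_item_line(line), repr(line))
```

Only the two table rows should be reported as matching; the header and total lines fail because they do not start with a quantity followed by an item code.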

After testing that this regular expression matches correctly in many emails, and only on the lines of the table, we are ready to capture the specific fields. This is done with the same regular expression, adding a capture group for each field we want to capture. For example, for the field price:

\d+\s+\w+\-\d+\s+.*\s+(?'price'\$\d+\.\d+)
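
The same capture can be tried in Python as a rough check. Note that Python's re module writes named groups as (?P<price>...) rather than (?'price'...), so the expression below is adapted accordingly; PRICE_PATTERN and the items_list string are hypothetical stand-ins for the parser's field.

```python
import re

# Article's expression with the named group rewritten in Python syntax.
PRICE_PATTERN = re.compile(r"\d+\s+\w+\-\d+\s+.*\s+(?P<price>\$\d+\.\d+)")

# Stand-in for the filtered items_list field.
items_list = "1 RT32323-7 Arvo Balck $26.00\n1 SG43242-95 Arvo Blue $23.00"

# finditer returns one match per table line, so the field can appear several times.
prices = [match.group("price") for match in PRICE_PATTERN.finditer(items_list)]
print(prices)
```

Each match still has to satisfy the whole line pattern, but only the text inside the price group is kept.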

Applying the same method to the rest of the fields gives the complete email parser.

Notice that, to get the fields quantity, item_code, description and price, we use the field items_list as input, not the email body. We have also activated the check box “The field can appear multiple times”, as we are processing one or more lines.
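
As an illustration of the end result, the four captures can be combined into a single Python expression with one named group per field. This is an assumption made for the sketch: in Email Parser each field has its own expression with a single capture group, and ITEM_PATTERN and parse_items are hypothetical names.

```python
import re

# One named group per field; group names follow the field names in the article.
ITEM_PATTERN = re.compile(
    r"(?P<quantity>\d+)\s+"
    r"(?P<item_code>\w+\-\d+)\s+"
    r"(?P<description>.*)\s+"
    r"(?P<price>\$\d+\.\d+)"
)


def parse_items(items_list: str) -> list[dict[str, str]]:
    """Return one dict of captured fields per table line found in items_list."""
    return [match.groupdict() for match in ITEM_PATTERN.finditer(items_list)]


# Stand-in for the filtered items_list field, not the full email body.
items_list = "1 RT32323-7 Arvo Balck $26.00\n1 SG43242-95 Arvo Blue $23.00"

for item in parse_items(items_list):
    print(item)
```

Iterating over all matches plays the role of the “The field can appear multiple times” option: every table line yields its own set of values.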

© 2008-2024 Triple Click Software