Email Parser

Extract data from incoming emails and automate your workflow


Capturing an HTML tag with XPath

 

 

 

This is a very brief introduction to how XPath works in Email Parser. XPath is not specific to Email Parser; it is used extensively in other areas such as web development, web crawler programming, and JavaScript and DOM manipulation. We only explain the basics here to give you a broad idea of how XPath expressions work. As with regular expressions, entire websites and books are devoted to this topic alone.

 

What is an XPath expression and how is it used

XPath expressions are used to locate a specific HTML tag in an HTML document. In Email Parser, you can use them to capture areas of text from the HTML version of the email body (the field BodyHTML).

The XPath concept is very similar to the path expressions used to identify folders on a computer (C:\Users\John\My documents…), but HTML tag names are used instead of folder names. For example, given this HTML document:

 

 

<html>
    <body>
        <table>
            <tbody>
                <tr>
                    <td>Name:</td>
                    <td>John Doe</td>
                </tr>
                <tr>
                    <td>Phone:</td>
                    <td>1234567890</td>
                </tr>
            </tbody>
        </table>
    </body>
</html>

 

When rendered, this document is just a table with two rows: “Name:” next to “John Doe”, and “Phone:” next to “1234567890”.

If we use the following XPath expression:

/html/body/table/tbody/tr[1]/td[1]

The captured text will be “Name:”.

If we use this one instead:

/html/body/table/tbody/tr[1]/td[2]

The captured text will be “John Doe”.

In plain language, this second XPath expression reads something like this:

“Take the <html> tag, then find the <body> tag inside it, then within the <body> find <table> and then find <tbody>. Inside <tbody> take the first <tr> tag and then inside it take the second <td> tag.”
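
To see the same lookup outside Email Parser, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. This is an assumption made for illustration, not the engine Email Parser uses: ElementTree supports only a subset of XPath, requires well-formed markup, and expects paths relative to the root element, so the leading `/html` is written as `.`. The helper name `capture` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example document from above.
HTML_DOC = """
<html>
    <body>
        <table>
            <tbody>
                <tr>
                    <td>Name:</td>
                    <td>John Doe</td>
                </tr>
                <tr>
                    <td>Phone:</td>
                    <td>1234567890</td>
                </tr>
            </tbody>
        </table>
    </body>
</html>
"""

# The parsed root element is <html>, so "/html/..." becomes "./...".
root = ET.fromstring(HTML_DOC.strip())


def capture(path):
    """Return the text of the first element matching path, or None."""
    element = root.find(path)
    return element.text if element is not None else None


first_cell = capture("./body/table/tbody/tr[1]/td[1]")
second_cell = capture("./body/table/tbody/tr[1]/td[2]")

print(first_cell)   # Name:
print(second_cell)  # John Doe
```

Changing `tr[1]` to `tr[2]` in the second path selects the phone number row instead.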

 

How to work with XPaths in Email Parser

 

Building an XPath expression to capture text from an email in Email Parser usually involves these steps:

  1. Click on “Add field” on the left panel.
  2. Choose “Capture HTML tag” as capture method.
  3. Choose “Use Xpath”.
  4. Locate an email that you want to use for testing. You can do this by opening your email source and clicking an email on the list.
  5. Go to the “Fields” tab (in the email source email listing) and click on the field BodyHTML.
  6. Copy all the BodyHTML content to the testing area of the field created in step 1.
  7. If the HTML code is badly formatted or difficult to follow (no indentation, etc.), use an HTML formatter like https://htmlformatter.com. The parsing results do not change if you “tidy” the HTML code; it just makes it easier to read.
  8. Type any XPath at the left. As you type, Email Parser evaluates it and shows the results in the testing area. Start with something easy, like a title or something else you can easily locate in the HTML code, then add complexity to the XPath until you get what you want. Expect this to require a lot of trial and error.
  9. Once the XPath expression works for that test BodyHTML, get the BodyHTML of another email and check that the results are as expected with other emails too.
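
The check in the last step, trying one expression against several emails, can also be sketched outside Email Parser. The snippet below is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree` (a subset of XPath, well-formed markup only), the function name `check_xpath` is hypothetical, and the sample bodies are made up rather than taken from real emails.

```python
import xml.etree.ElementTree as ET


def check_xpath(path, bodies):
    """Evaluate path against each HTML body and return the captured texts.

    A body where nothing matches yields None, which makes emails with a
    different layout easy to spot.
    """
    results = []
    for body in bodies:
        root = ET.fromstring(body)
        element = root.find(path)
        results.append(element.text if element is not None else None)
    return results


# Made-up sample bodies standing in for the BodyHTML of three emails.
sample_bodies = [
    "<html><body><table><tbody><tr><td>Name:</td><td>John Doe</td></tr></tbody></table></body></html>",
    "<html><body><table><tbody><tr><td>Name:</td><td>Jane Roe</td></tr></tbody></table></body></html>",
    "<html><body><p>No table in this one</p></body></html>",
]

captured = check_xpath("./body/table/tbody/tr[1]/td[2]", sample_bodies)
for index, text in enumerate(captured, start=1):
    print(f"email {index}: {text!r}")
```

The third sample has no table, so the expression captures nothing there. That is the kind of mismatch this step is meant to reveal.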

 

 

The very basics of the XPath syntax. Quick reference:

 

XPath expression | Meaning
---------------- | -------
/tagname | Captures the <tagname> element at the root of the document (a root tag is a tag not contained in other tags).
//tagname | Captures any <tagname> regardless of where it is located in the tag tree. If you enable “Capture more than one HTML tag” in the Additional options, more than one value can be captured.
//tagname1/tagname2 | Captures the <tagname2> elements that are direct children of a <tagname1>. <tagname1> can be anywhere in the document.
//tagname1/tagname2[2] | Captures the second <tagname2> element that is a direct child of a <tagname1>. <tagname1> can be anywhere in the document.
//tagname1//tagname2 | Captures the <tagname2> elements contained in a <tagname1>. <tagname1> can be anywhere in the document, and <tagname2> must be contained in <tagname1> but does not need to be a direct child of it.
//tagname1/*/tagname2 | Captures all <tagname2> elements that are grandchildren of <tagname1> elements.
//tagname[@theattribute] | Captures any <tagname> with the attribute “theattribute”. For instance, <tagname theattribute=2>.
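
The expressions in the quick reference can be tried on a small document. The sketch below again assumes Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath and needs paths relative to the root, so each leading `//` is written as `.//`. The sample document and the helper name `texts` are hypothetical, and Email Parser's own results may differ in detail.

```python
import xml.etree.ElementTree as ET

# A made-up document: two <p> directly inside <div>, one nested in <span>.
DOC = """
<html>
    <body>
        <div>
            <p>first</p>
            <p id="x">second</p>
            <span><p>nested</p></span>
        </div>
    </body>
</html>
"""

root = ET.fromstring(DOC.strip())


def texts(path):
    """Return the text of every element matching path, in document order."""
    return [element.text for element in root.findall(path)]


results = {
    "//p": texts(".//p"),                # any <p>, anywhere
    "//div/p": texts(".//div/p"),        # <p> that is a direct child of <div>
    "//div/p[2]": texts(".//div/p[2]"),  # second <p> child of <div>
    "//div//p": texts(".//div//p"),      # <p> at any depth inside <div>
    "//div/*/p": texts(".//div/*/p"),    # <p> that is a grandchild of <div>
    "//p[@id]": texts(".//p[@id]"),      # <p> that has an "id" attribute
}

for expression, captured in results.items():
    print(f"{expression:12} -> {captured}")
```

Comparing `//div/p` with `//div//p` shows the difference between a single and a double slash: only the second form reaches the `<p>` nested inside `<span>`.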