How to detect keywords or phrases in the body content of messages
Background
Email messages can contain two sections of text data that an email client renders when a user views a message:
text/html (or just html) section
text/plain (or just plain) section
As the section names imply, the html section can contain HTML markup such as hyperlinks, embedded images, text formatting, and more.
The plain section does not render HTML and is displayed raw. Email clients are typically configured to display the html section by default when it is present, and to display the plain section only as a fallback. Email messages are not required to include both sections.
HTML content in MQL
For HTML bodies, content is stored in two fields:
body.html.raw
body.html.inner_text
The original HTML body is preserved in the raw field, and the decoded inner text is stored in inner_text. Unless you need to match specific HTML elements, prefer body.html.inner_text.
Example HTML body
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" media="all">/* a lot of CSS, totally ignored! */</style>
</head>
<body width="100%">
<span style="color:transparent;visibility:hidden;display:none;opacity:0;height:0;width:0;font-size:0;">This & that are in a hidden span.</span><img
src="https://test.local/img.png"> <!-- Here's a commentInsert ‌ hack after hidden preview text -->
</div>
<p>Some paragraph content, before a table</p>
<table>
<tr>
<td>Row 1, Column 1. <span>Span contents inside R1C1</span></td>
<td>Row 1, Column 2</td>
</tr>
<tr>
<td>Row 2, Column 1</td>
<td>Row 2, Column 2</td>
</tr>
</table>
<!-- comment before an image link -->
<a href="https://test.local"><img src="https://test.local/img.png"></a>
<div>Copyright © 2022</div>
</body>
</html>
The inner text from the parsed HTML is much more compact. Note that newlines are automatically inserted between tags, regardless of whether they display on the same line visually.
This & that are in a hidden span.
Some paragraph content, before a table
Row 1, Column 1
Span contents inside R1C1
Row 1, Column 2
Row 2, Column 1
Row 2, Column 2
Copyright © 2022
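To make the relationship between the raw HTML and its inner text more concrete, here is a rough Python sketch that approximates the behavior described above: it skips CSS and comments, decodes entities, and puts each text node on its own line. It is an illustration only, not the parser that populates body.html.inner_text; the class name InnerTextExtractor and the function inner_text are hypothetical.

```python
from html.parser import HTMLParser


class InnerTextExtractor(HTMLParser):
    """Collects text nodes, one per line (illustrative approximation only)."""

    # Assumption: content of these elements never counts as inner text.
    SKIPPED_TAGS = {"style", "script"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data runs.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        # Collapse runs of whitespace and drop whitespace-only nodes.
        text = " ".join(data.split())
        if text:
            self.lines.append(text)


def inner_text(html: str) -> str:
    """Return the text nodes of an HTML document, joined by newlines."""
    parser = InnerTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = (
    "<style>/* ignored */</style>"
    '<span style="display:none">This &amp; that are in a hidden span.</span>'
    "<!-- a comment -->"
    "<td>Row 1, Column 1. <span>Span contents inside R1C1</span></td>"
)
print(inner_text(sample))
```

As in the example above, text inside a hidden span is still extracted, because the sketch looks only at the markup and not at how the content would be displayed.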
When searching inside HTML contents
Because HTML content can be large, searching inside body.html.raw can be very time-intensive. For better performance, consider writing rules that use body.html.inner_text instead, which contains the unescaped text inside the HTML, with the text of different tags on separate lines. This parsed field is much smaller and significantly faster to search.
Plain content in MQL
For plain bodies, content is stored in body.plain.raw.
Detect specific keywords
We can search both text sections easily to detect specific keywords or phrases:
any([body.plain.raw, body.html.inner_text], strings.ilike(., "*voicemail*", "*password reset*"))
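For readers who want to reason about what the rule above does, the following Python sketch approximates it outside of MQL. It assumes that strings.ilike performs a case-insensitive match of the whole string against each pattern, with * as a wildcard, and that a missing body section simply does not match. The helper names ilike and any_body_matches are hypothetical, and the real MQL semantics may differ in edge cases.

```python
from fnmatch import fnmatchcase


def ilike(text: str, *patterns: str) -> bool:
    """Case-insensitive wildcard match against any of the patterns.

    Note: fnmatch also gives special meaning to ? and [...], which this
    sketch does not try to reconcile with MQL.
    """
    lowered = text.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


def any_body_matches(bodies, *patterns: str) -> bool:
    """True if at least one present body section matches any pattern."""
    return any(body is not None and ilike(body, *patterns) for body in bodies)


plain_body = None  # assumption: this message has no plain section
html_inner_text = "You have a new VOICEMAIL\nfrom an unknown caller"
print(any_body_matches([plain_body, html_inner_text],
                       "*voicemail*", "*password reset*"))
```

Checking both sections matters because, as noted in the background, a message is not required to contain both.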
Detect complex phrases
We can use regular expressions if we're looking for something more complex, like a Social Security Number:
any([body.plain.raw, body.html.inner_text], regex.contains(., '\b(\d\d\d)-(\d\d)-(\d\d\d\d)\b'))
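The pattern in the rule above can be tried against sample text before it goes into a rule. This Python sketch uses the same regular expression with Python's re module, which may differ from the MQL regex engine in edge cases; the function name find_ssn_like is hypothetical. Note that the pattern only checks the NNN-NN-NNNN shape, not whether the number is a valid Social Security Number.

```python
import re

# Same pattern as in the rule: three digits, two digits, four digits,
# separated by hyphens and bounded by word boundaries.
SSN_PATTERN = re.compile(r"\b(\d\d\d)-(\d\d)-(\d\d\d\d)\b")


def find_ssn_like(bodies):
    """Return every SSN-shaped substring found in the present body sections."""
    return [
        match.group(0)
        for body in bodies
        if body
        for match in SSN_PATTERN.finditer(body)
    ]


bodies = ["My SSN is 123-45-6789.", None, "Order 1234-56-7890 shipped"]
print(find_ssn_like(bodies))
```

The word boundaries keep the pattern from matching inside longer digit runs, which is why the order number in the sample is not reported.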
Example detection rules
View example rules for body.html and body.plain.