Yes, I’m aware that using regex for HTML parsing is not the best idea. Still, when I need to quickly extract a small portion of a web page, I find myself applying regex more often than executing an XPath query, and regex lookahead and lookbehind constructions can be quite helpful for that.

Regex Lookaround Syntax

There are four regex syntax constructions for lookaround:

positive lookahead

match(?=expr)

“expr” must be found immediately after the matching value

negative lookahead

match(?!expr)

“expr” must not be found immediately after the matching value

positive lookbehind

(?<=expr)match

“expr” must be found immediately before the matching value

negative lookbehind

(?<!expr)match

“expr” must not be found immediately before the matching value
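
To make the four constructions concrete, here is a minimal Python sketch. The sample string and variable names are made up for illustration. Note the \b anchors in the negative cases: without them the engine can backtrack to a shorter number and still satisfy the negative condition.

```python
import re

# Made-up sample text for illustration only.
text = "price: 100 USD, 200 EUR, cost: 300 USD"

# Positive lookahead: numbers immediately followed by " USD".
usd = re.findall(r"\d+(?= USD)", text)               # ['100', '300']

# Negative lookahead: whole numbers NOT followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", text)       # ['200']

# Positive lookbehind: numbers immediately preceded by "cost: ".
cost = re.findall(r"(?<=cost: )\d+", text)           # ['300']

# Negative lookbehind: whole numbers NOT preceded by "cost: ".
not_cost = re.findall(r"(?<!cost: )\b\d+", text)     # ['100', '200']

print(usd, not_usd, cost, not_cost)
```

In every case the lookaround text itself is checked but never becomes part of the match.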

Usage Example

For example, you need to extract the amount value from the following piece of source HTML:

Note that it’s almost impossible to do this with pure XPath.

You can do it using regex groups:

/AWS Service Charges:\s*\$(\d+)\s*</

or using regex lookaround:

/(?<=AWS Service Charges:\s*\$)\d+(?=\s*<)/

Though the second regex looks more complicated, the two approaches differ in what they actually match:

  • with groups, the whole text block is matched and you need to extract the group separately
  • with lookaround, only the needed value is matched and is ready for use
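
The difference can be sketched in Python. The HTML fragment below is a hypothetical stand-in, not the markup of the real page. Because Python’s re module only accepts fixed-width lookbehinds, the sketch replaces the \s* inside the lookbehind with a single literal space; that is an adaptation for this example, not the regex shown above.

```python
import re

# Hypothetical fragment; the real page's markup is an assumption here.
html = "<p>Summary<br/>AWS Service Charges: $1234<br/>Tax: $0</p>"

# Groups: the whole block is matched, the value lives in group 1.
m = re.search(r"AWS Service Charges:\s*\$(\d+)\s*<", html)
whole_match = m.group(0)         # 'AWS Service Charges: $1234<'
amount_from_group = m.group(1)   # '1234'

# Lookaround: the match itself is the value.
# A single literal space stands in for \s* to keep the lookbehind fixed-width.
m2 = re.search(r"(?<=AWS Service Charges: \$)\d+(?=\s*<)", html)
amount_from_lookaround = m2.group(0)   # '1234'

print(whole_match)
print(amount_from_group, amount_from_lookaround)
```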

Limitations

There are some limitations with using regex lookaround though:

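  • JavaScript has always supported lookahead, but lookbehind was only added in ECMAScript 2018, so older engines reject it
  • In Python’s re module and in PCRE, lookbehinds must have a fixed length. This means that you can’t use unbounded quantifiers such as * or + within a lookbehind. Alternation is allowed only in a restricted form: in Python all alternatives must have the same length, while PCRE lets each top-level alternative have its own fixed length. As a result, the lookbehind regex above, which contains \s*, works in .NET and modern JavaScript but not in Python or PCRE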
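
A short Python sketch of the fixed-length restriction and one possible fallback. The helper name extract_amount and the sample strings are hypothetical. The variable-width lookbehind is rejected at compile time, so the sketch falls back to a capturing group, which has no such restriction.

```python
import re

# Variable-width lookbehind (\s* inside it): Python's re refuses to compile it.
variable_width = r"(?<=AWS Service Charges:\s*\$)\d+"
try:
    re.compile(variable_width)
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)

# Alternation is fine as long as every alternative has the same width.
same_width = re.compile(r"(?<=USD |EUR )\d+")
amounts = same_width.findall("USD 5, EUR 7, GBP 9")   # ['5', '7']

def extract_amount(text):
    """Hypothetical helper: use a group where a lookbehind is not allowed."""
    m = re.search(r"AWS Service Charges:\s*\$(\d+)", text)
    return m.group(1) if m else None

print(compiled, amounts, extract_amount("AWS Service Charges:   $77 <"))
```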