Often, we need to extract HTML elements by their sequential position rather than by their place in the hierarchy.

Suppose we need to extract all the items under the “List:” header:

XPath

To extract item1 through item4, let’s set a bookmark at the <h1> header that precedes them.

At the bottom, let’s set another bookmark at the <h2> header that follows them.

The XPath text() node test selects text nodes. So, just apply it to the snippet and put the two bookmarks into a predicate (the square brackets):

//text()[preceding-sibling::h1[1] = 'List:'
and following-sibling::h2 = 'The End']

This is the result, expressed in separate text nodes:

Regex

This is how to parse the items with a regular expression and get the same results as above:

(?<item>[^<>]+?)\s*(?:(<br/>)|(<h2>))

First, capture the target <item> group: one or more characters other than ‘<’ or ‘>’, matched lazily. Then skip any trailing whitespace. What follows must be either <br/> or <h2>, wrapped in a non-capturing group.
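As a rough illustration, the same pattern can be tried in Python. This is only a sketch under a few assumptions: the sample markup is reconstructed from the description above, Python's re module spells named groups as (?P<name>...) instead of (?<name>...), and the inner capturing parentheses around <br/> and <h2> are dropped because nothing reads them.

```python
import re

# Assumed sample markup, reconstructed from the description in the text.
html = (
    "<h1>List:</h1>\n"
    "item1<br/>\n"
    "item2<br/>\n"
    "item3<br/>\n"
    "item4\n"
    "<h2>The End</h2>"
)

# Python writes named groups as (?P<name>...) rather than (?<name>...).
item_pattern = re.compile(r"(?P<item>[^<>]+?)\s*(?:<br/>|<h2>)")

# The item group can pick up a leading newline from the markup, so strip it.
items = [m.group("item").strip() for m in item_pattern.finditer(html)]
print(items)  # ['item1', 'item2', 'item3', 'item4']
```

Note that this pattern would also pick up any other text followed by <br/> or <h2> elsewhere in a larger page, since nothing ties it to the headers.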

The above Regex is not tied to the headers (<h1>List:</h1>, <h2>The End</h2>). To generalize the extraction by anchoring it to them, use zero-width lookbehind and lookahead assertions (the first and last groups in the following Regex).

(?<=<h1>List:</h1>)\s*((?<item>[^<>]+?)\s*(?:(<br/>)|(<h2>)))+\s*(?=The\sEnd</h2>)

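NOTE: This Regex collects the items in the <item> capture group, but because the group sits inside a quantified group (+), after matching we need to iterate through all the captures stored for it: <item>[i]. This relies on an engine that keeps every capture of a repeated group, such as .NET; many other engines keep only the last one.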
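Python's built-in re module keeps only the last capture of a repeated group, so the <item>[i] style of iteration is not available there. One possible workaround, sketched here as a two-step substitute rather than a direct port of the Regex above, is to use the lookbehind and lookahead only to isolate the block between the headers and then split that block on <br/>. The sample markup is again an assumption reconstructed from the description.

```python
import re

# Assumed sample markup, reconstructed from the description in the text.
html = (
    "<h1>List:</h1>\n"
    "item1<br/>\n"
    "item2<br/>\n"
    "item3<br/>\n"
    "item4\n"
    "<h2>The End</h2>"
)

# Step 1: the zero-width assertions anchor the match to the two headers
# without including them in the matched text.
block_match = re.search(
    r"(?<=<h1>List:</h1>).*?(?=<h2>The\sEnd</h2>)", html, re.DOTALL
)
block = block_match.group(0) if block_match else ""

# Step 2: split the isolated block on <br/> and drop surrounding whitespace.
anchored_items = [part.strip() for part in block.split("<br/>") if part.strip()]
print(anchored_items)  # ['item1', 'item2', 'item3', 'item4']
```

The lookbehind works in Python here because its content has a fixed width; a variable-width lookbehind would be rejected by the re module.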