.NET Regular Expression Hangs in an Apparently Infinite Loop

I'm using .NET Regular Expressions to strip HTML code.

Using something like:

<title>(?<Title>[\w\W]+?)</title>[\w\W]+?<div class="article">(?<Text>[\w\W]+?)</div>

This works 99% of the time, but sometimes, when calling...

Regex.IsMatch(HTML, Pattern)

The call just blocks: execution stays on this line of code for several minutes, or indefinitely.

What's going on?


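Your regex will work just fine when your HTML string actually contains HTML that fits the pattern. But when your HTML does not fit the pattern, e.g. if the last tag is missing, your regex will exhibit what I call "catastrophic backtracking". Each lazy [\w\W]+? is free to run past the delimiter you meant it to stop at, so before it can report failure the engine retries the combinations of stopping points for all of them, and that work grows quickly with the length of the document and the number of such groups. Click that link and scroll down to the "Quickly Matching a Complete HTML File" section. It describes your problem exactly. Also, [\w\W]+? is a complicated way of saying .+? with RegexOptions.Singleline.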
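As a rough sketch of the last point, the pattern from the question can be written with .+? and RegexOptions.Singleline. The match timeout is not part of the advice above; it is an added safety net, assuming .NET 4.5 or later, so that a pathological input throws instead of blocking. The names ArticleMatcher and TryMatch are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ArticleMatcher
{
    // Same pattern as in the question, with [\w\W]+? written as .+?
    // RegexOptions.Singleline makes '.' match newlines as well.
    private const string Pattern =
        @"<title>(?<Title>.+?)</title>.+?<div class=""article"">(?<Text>.+?)</div>";

    // Hypothetical helper: returns false when there is no match
    // or when the engine exceeds the given timeout.
    public static bool TryMatch(string html, TimeSpan timeout, out Match match)
    {
        var regex = new Regex(Pattern, RegexOptions.Singleline, timeout);
        try
        {
            match = regex.Match(html);
            return match.Success;
        }
        catch (RegexMatchTimeoutException)
        {
            // Assumption: a timeout is treated the same as "no match".
            match = Match.Empty;
            return false;
        }
    }

    public static void Main()
    {
        string html =
            "<title>Hello</title>\n<body><div class=\"article\">Some\ntext</div></body>";

        if (TryMatch(html, TimeSpan.FromSeconds(2), out Match m))
        {
            Console.WriteLine(m.Groups["Title"].Value);
            Console.WriteLine(m.Groups["Text"].Value);
        }
    }
}
```

The timeout only bounds how long a failing match can run; it does not remove the backtracking itself, so restructuring the pattern as the linked section describes is still the real fix.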

