Summary: in this tutorial, you will learn how greedy quantifiers work and how to avoid unexpected results by using lazy quantifiers.
Quantifiers are greedy
Quantifiers are metacharacters in regular expressions that specify the quantity of the preceding element. For example, the + quantifier matches one or more occurrences of the preceding element.
By default, quantifiers are greedy, meaning that the quantifiers always try to match as much of the input text as possible while still allowing the whole pattern to match successfully.
They are called greedy because they consume as many characters of the input string as they can, and give characters back only when the rest of the pattern cannot match otherwise.
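The following is a minimal sketch of this behavior. The input string and the `\d+` pattern are made up for illustration only. A single digit would be enough to satisfy `\d+`, yet the greedy quantifier takes every consecutive digit it can:

```csharp
using System.Text.RegularExpressions;
using static System.Console;

// Hypothetical input chosen only for illustration
var input = "Order 12345 shipped";

// \d+ could stop after the first digit, but a greedy quantifier
// keeps going until the next character is no longer a digit
var match = Regex.Match(input, @"\d+");

WriteLine(match.Value);   // 12345
WriteLine(match.Length);  // 5
```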
Let’s take an example to understand how greedy quantifiers work.
Suppose you have a link tag in an HTML fragment:
<a href="https://www.csharptutorial.net/" target="_blank">Click Me</a>
To get attribute values like "https://www.csharptutorial.net/" and "_blank", you may come up with the following pattern:
".+"
This pattern matches text that starts with a quote ("), followed by one or more characters (.+), and ends with a quote (").
using System.Text.RegularExpressions;
using static System.Console;

var html = """<a href="https://www.csharptutorial.net/" target="blank">C# Tutorial</a>""";

// The greedy pattern: a quote, one or more characters, and a quote
var pattern = """
".+"
""";

var matches = Regex.Matches(html, pattern);

foreach (var match in matches)
{
    WriteLine(match);
}
Note that we use raw string literals for both the HTML string and the regular expression pattern because they contain double quotes ("). Raw string literals have been available since C# 11.
Output:
"https://www.csharptutorial.net/" target="blank"
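The result is not what we expected. Instead of two matches, one for each attribute value, the program returns a single match that runs from the first quote to the last quote.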
The following describes how the greedy quantifier works in this example:
- The regex engine scans the input string from left to right, looking for a position where the first element of the pattern, the quote ("), can match. It skips <a href= and matches the opening quote of the href attribute value.
- The dot (.) matches any character except a newline, and the greedy quantifier + allows one or more occurrences, so the engine matches as many characters as possible. It consumes everything up to the end of the input string: https://www.csharptutorial.net/" target="blank">C# Tutorial</a>
- At the end of the input string, the pattern still needs to match the closing quote ("), but no characters are left.
- At this point, the engine backtracks. It gives back the last character matched by .+, which is the closing >, and tries to match the quote (") there. The attempt fails because > is not a quote.
- The backtracking continues, and the engine gives back characters one by one from the end of the input string.
- Finally, the engine gives back the last quote (") in the input string, the one that closes the target attribute value. The closing quote in the pattern matches it, and the engine returns the longest possible match: "https://www.csharptutorial.net/" target="blank"
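The backtracking can also be sketched by hand. The code below is only an illustration, not how the .NET regex engine is implemented: it assumes the input contains at least two quotes, and the variable names (start, pos, givenBack, simulated) are made up for this sketch. It finds the first quote, pretends that .+ has consumed everything up to the end of the input, and then gives characters back one at a time until the closing quote can match. The last lines compare the result with what Regex.Match returns for the same pattern.

```csharp
using System.Text.RegularExpressions;
using static System.Console;

var html = """<a href="https://www.csharptutorial.net/" target="blank">C# Tutorial</a>""";

// The opening quote in the pattern matches the first quote in the input
var start = html.IndexOf('"');

// .+ greedily consumes the rest of the input, so the closing quote
// is first attempted at the very end of the string
var pos = html.Length;
var givenBack = 0;

// Backtrack: give back one character at a time until the character
// at the current position is a quote
while (pos >= html.Length || html[pos] != '"')
{
    pos--;
    givenBack++;
}

var simulated = html[start..(pos + 1)];
WriteLine($"Characters given back: {givenBack}");
WriteLine(simulated);

// Cross-check against the real regex engine
var pattern = """
".+"
""";
var actual = Regex.Match(html, pattern).Value;
WriteLine(simulated == actual);
```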
Turn off the greedy mode
To fix this issue, make the quantifier (+) lazy (non-greedy) by adding a question mark (?) after it. A lazy quantifier matches as few characters as possible while still allowing the whole pattern to match:
".+?"
For example:
using System.Text.RegularExpressions;
using static System.Console;

var html = """<a href="https://www.csharptutorial.net/" target="blank">Click Me</a>""";

// The lazy pattern: +? stops at the first closing quote it can reach
var pattern = """
".+?"
""";

var matches = Regex.Matches(html, pattern);

foreach (var match in matches)
{
    WriteLine(match);
}
Output:
"https://www.csharptutorial.net/"
"blank"
Now, the program returns the expected result.
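The matches still include the surrounding quotes. If only the values themselves are wanted, one possible variation is to wrap the lazy part of the pattern in a capturing group and read the group instead of the whole match. This is a sketch under that assumption; the values list is a name introduced here for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using static System.Console;

var html = """<a href="https://www.csharptutorial.net/" target="blank">Click Me</a>""";

// The parentheses capture the lazily matched text between the quotes
var pattern = """
"(.+?)"
""";

var values = new List<string>();

foreach (Match match in Regex.Matches(html, pattern))
{
    // Groups[0] is the whole match, Groups[1] is the first capturing group
    values.Add(match.Groups[1].Value);
}

foreach (var value in values)
{
    WriteLine(value);
}
```

With the input above, this prints https://www.csharptutorial.net/ and blank, each on its own line and without the quotes.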
Summary
- Quantifiers are greedy by default.
- A greedy quantifier matches as much of the input string as possible while still allowing the overall regular expression pattern to match successfully.
- Add a question mark (?) after a quantifier to make it lazy, so that it matches as little of the input string as possible.