Quantifiers

It's time to get soaked. There's a fascinating world right beneath the surface, the world of quantifiers. Go ahead. Put your facemask on, and take a look around beneath the waves. What you'll see is the real heart of regex.

Zero or More

We've seen that regex let you concatenate multiple patterns that can match sequences of characters. Sometimes, you want to repeat a pattern. For instance, to match all sequences of three digits, you can use the regex /\b\d\d\d\b/ - here we repeat the \d shortcut three times.

Suppose, though, you need to find all sequences of three or more digits. How would you code this? You might try something like:

/\b(\d\d\d\d\d\d|\d\d\d\d\d|\d\d\d\d|\d\d\d)\b/

but this matches 3-6 digits, not three or more, and it's already hard to read. You could keep adding additional sequences of \ds until you reach some maximum level, but it is limiting and may cause problems in the future.

You could try matching /\d\d\d/. It will certainly find every number with three or more digits (though it only matches three digits at a time), but it mixes in results like the 321 in XY321Z.

Regex engines provide a variety of quantifiers that you can use to match sequences. The quantifier that gets used most frequently is *; it matches zero or more occurrences of the pattern to its left. For example, try /\b\d\d\d\d*\b/ against these strings in Rubular:

Four and 20 black birds
365 days in a year, 100 years in a century.
My phone number is 222-555-1212.
My serial number is 345678912.

You should see that this pattern matches 365, 100, 222, 555, 1212, and 345678912, but it does not match 20.

The way you read that regex is that you want to match three consecutive digits beginning at a word boundary, followed by any number of digits, and then another word boundary. The engine reads the regex as six sub-patterns:

Pattern   Explanation
\b        Starting at a word boundary
\d        A single digit followed by ...
\d        a single digit followed by ...
\d        a single digit followed by ...
\d*       Zero or more additional digits
\b        Ending with a word boundary
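
If you would like to check the same pattern outside Rubular, a minimal Ruby sketch along these lines may help. The constant and method names are made up for illustration; the only built-in used is String#scan, which returns every non-overlapping match.

```ruby
# Illustrative names only; the regex itself is the one discussed above.
THREE_OR_MORE_DIGITS = /\b\d\d\d\d*\b/

SAMPLE_LINES = [
  'Four and 20 black birds',
  '365 days in a year, 100 years in a century.',
  'My phone number is 222-555-1212.',
  'My serial number is 345678912.'
].freeze

# Returns every run of three or more digits bounded by word boundaries.
def long_numbers(text)
  text.scan(THREE_OR_MORE_DIGITS)
end

SAMPLE_LINES.each do |line|
  puts "#{line.inspect} => #{long_numbers(line).inspect}"
end
```

The first line should produce an empty array, since 20 has only two digits.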

One thing to watch out for is that "zero or more" truly means zero or more. The regex /x*/ matches every string, even an empty string, or a string that contains no xs anywhere. If you try this pattern in Rubular, it matches between every character.

When talking about regular expressions that match zero-length strings, imagine an arrow that starts out pointing to the beginning of the string, prior to the first character. When the regex engine goes to work, it moves this imaginary arrow to the right one character at a time until it either finds a match or determines that there is no match. The arrow never points directly at a character, but always points between each pair of characters, and matches typically occur against the character to the right of the arrow. (There are a few exceptions that match the character to the left as well, such as \b.)

When you try to match /x/ for instance, the regex engine looks to the character to the right of the arrow position. If it sees an x, it matches. Otherwise, it advances the arrow one position to the right, and again tries to match starting with the next character.

This is why something like /x*/ matches at every position in the string - with the arrow pointing between characters, the regex is free to say "Nope. There are no x's between me and the next character, so it's a match."
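
The arrow model can be observed from Ruby as well. The following is a small sketch, assuming the hypothetical helper names first_match_index and all_matches; it relies only on the built-in String#=~ and String#scan.

```ruby
# String#=~ returns the index where the first match starts, or nil if none.
def first_match_index(text, pattern)
  text =~ pattern
end

# String#scan collects every match. For /x*/ on 'abc', each match is the
# empty string found at an arrow position: before a, before b, before c,
# and at the very end of the string.
def all_matches(text, pattern)
  text.scan(pattern)
end

p first_match_index('abc', /x*/) # /x*/ succeeds immediately, before the a
p first_match_index('abc', /x/)  # a bare x must really be present
p all_matches('abc', /x*/)       # one empty match per arrow position
```

Running this should show that /x*/ succeeds at index 0 on a string with no x in it, while /x/ finds nothing at all.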

Another way to see this is to try the regex /co*t/ against these strings:

ct
cot
coot
cooot

The regex matches every one of these strings, including the one without the letter o.

Note that the quantifier always applies to one pattern: the pattern immediately to its left. If necessary, you can use grouping parentheses to define the pattern to which you want to apply the *. For instance, try /1(234)*5/ against:

15
12345
12342342345
1234235

You should see that the engine treats (234) as a single pattern, so the regex matches anywhere zero or more occurrences of 234 separate 1 and 5.
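
To see what the parentheses change, it may help to compare the grouped pattern with an ungrouped one. This is only a sketch: GROUPED, UNGROUPED, and matched_text are names invented here, and /1234*5/ is an extra pattern added for contrast, not one from the examples above.

```ruby
GROUPED   = /1(234)*5/ # * repeats the whole group 234
UNGROUPED = /1234*5/   # * repeats only the 4

# String#[] with a regex returns the matched text, or nil when nothing matches.
def matched_text(text, pattern)
  text[pattern]
end

%w[15 12345 12342342345 1234235].each do |s|
  grouped = matched_text(s, GROUPED).inspect
  ungrouped = matched_text(s, UNGROUPED).inspect
  puts "#{s}: grouped=#{grouped} ungrouped=#{ungrouped}"
end
```

With the group, 15 matches because zero occurrences of 234 are allowed; without it, the regex insists on 123 followed by any number of 4s.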

The regex * quantifier looks similar to the * wildcard you find in most command line shells, but it is different. The * wildcard from a shell is more like the regex /.*/; it matches any sequence of characters, regardless of what those characters are. Thus, the wildcard blue*doc matches any file whose name begins with blue and ends with doc. /blue*doc/, however, matches only the characters blu, followed by zero or more es, followed immediately by doc.
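
Ruby happens to offer both kinds of matching, which makes the difference easy to check. In this sketch, File.fnmatch applies shell-style wildcard rules and Regexp#match? applies regex rules; the helper names and the file names are hypothetical.

```ruby
# Shell-style wildcard: * stands for any run of characters.
def glob_match?(name)
  File.fnmatch('blue*doc', name)
end

# Regex: * means zero or more of the preceding e.
def regex_match?(name)
  /blue*doc/.match?(name)
end

%w[blueprint.doc blueeedoc bludoc].each do |name|
  puts "#{name}: glob=#{glob_match?(name)} regex=#{regex_match?(name)}"
end
```

The wildcard accepts blueprint.doc but rejects bludoc, while the regex does the reverse; both accept blueeedoc.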

One or More

The + quantifier is nearly identical to the * quantifier, but, instead of matching zero or more occurrences of something, it matches one or more occurrences of that thing. Not all regex engines offer the + quantifier - some older engines do not - but both Ruby and JavaScript provide it.

We can illustrate the + quantifier using our three-or-more digits example from the previous sub-section. In that section, we used /\b\d\d\d\d*\b/ to match three or more digits. If we replace the * with a +, we get /\b\d\d\d\d+\b/, a regex that matches four or more digits. Since we want three or more digits, we can eliminate one of the \d patterns, leaving /\b\d\d\d+\b/. To see that this still works as desired, try it against these strings from above:

Four and 20 black birds
365 days in a year, 100 years in a century.
My phone number is 222-555-1212.
My serial number is 345678912.

We saw earlier that a regex like /x*/ matches any string because it matches between every character. There is no similar subtlety to the + quantifier; /x+/ matches any sequence of one or more xs; it never matches the empty string between characters. Try it:

a single x matches.
As is a string of xxxxx like that.
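
The same checks can be run in Ruby. This is a rough sketch with invented names (THREE_PLUS_DIGITS, runs_of_x); it uses only String#scan.

```ruby
# The three-or-more digits pattern rewritten with + instead of *.
THREE_PLUS_DIGITS = /\b\d\d\d+\b/

# Returns each run of one or more consecutive x characters.
def runs_of_x(text)
  text.scan(/x+/)
end

p runs_of_x('a single x matches.')
p runs_of_x('As is a string of xxxxx like that.')
p runs_of_x('abc') # no x at all, so no matches, not even empty ones
p 'My phone number is 222-555-1212.'.scan(THREE_PLUS_DIGITS)
```

Unlike /x*/, the /x+/ pattern should return an empty array for a string that contains no x.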

Zero or One

Sometimes, you need an optional pattern in a regex; that is, a pattern that either occurs once or doesn't occur at all. For these situations, you need the ? quantifier. As with * and +, ? applies to the pattern to its left.

Suppose you need to test whether a string contains the words cot or coot, but don't want to match against ct or cooot. In this case, you can use /coo?t/, which matches a c followed by an o followed by an optional o followed by a t. Try it:

Scott scoots but doesn't act cooot.

One place you might use a ? would be a pattern where you are trying to match a date whose components may or may not include - separator characters. For instance, suppose you have dates formatted as either 20170111 or 2017-01-11. To match such dates, you can use the regex /\b\d\d\d\d-?\d\d-?\d\d\b/. This matches:

20170111
2017-01-11
2017-0111
201701-11

but not:

2017/01/11
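
A short Ruby sketch of this date check follows. FLEXIBLE_DATE and date_like? are names chosen for illustration. Note that the pattern only checks the shape of the text, so a string such as 2017-99-99 also passes; validating real calendar dates would need more than a regex like this one.

```ruby
# Four digits, optional -, two digits, optional -, two digits.
FLEXIBLE_DATE = /\b\d\d\d\d-?\d\d-?\d\d\b/

# True when the text contains something shaped like one of the date formats.
def date_like?(text)
  FLEXIBLE_DATE.match?(text)
end

%w[20170111 2017-01-11 2017-0111 201701-11 2017/01/11].each do |s|
  puts "#{s}: #{date_like?(s)}"
end
```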

Note that ? shares the subtlety we saw with *: it can match zero occurrences. Thus, /h?/ matches each of these strings:

his
is
ish

The regex ? quantifier looks similar to the ? wildcard you find in most command line shells, but it isn't the same. The ? wildcard means zero or one occurrence of any character, or acts as a placeholder for a single character, depending on what shell you are using. The ? regex quantifier means zero or one occurrence of the pattern to its left. If you allow yourself to become confused by the similarity in appearance, you will have trouble.

Ranges

The *, +, and ? quantifiers match repeated sequences. They may provide all the regex functionality you need. However, sometimes you need to specify the repeat count more precisely. For example, you may want to test a phone number to see if it contains precisely ten digits, or perhaps you want to look at all words that contain at least seven characters, or you want words that are 5-8 characters long. It's possible to do all this with the patterns and quantifiers you've already learned, but it will be tedious and messy. That's where the range quantifier comes in.

The range quantifier consists of a pair of curly braces, {}, with one or two numbers and an optional comma between the braces:

  • p{m} matches precisely m occurrences of the pattern p.
  • p{m,} matches m or more occurrences of p.
  • p{m,n} matches m or more occurrences of p, but not more than n.

Let's go through the examples we talked about above.

If you need to test a string to see if it contains precisely ten consecutive digits (perhaps it represents a US-style phone number), you can try it with the regex /\b\d{10}\b/ and these strings:

2225551212 1234567890 123456789 12345678900

You should see that this regex matches the first two numbers: they have ten digits each.

To match numbers that are at least three digits in length, we can use /\b\d{3,}\b/. Try it with these strings:

Four and 20 black birds
365 days in a year, 100 years in a century.
My phone number is 222-555-1212.
My serial number is 345678912.

This pattern matches the same six numbers that our earlier three-digits-or-more patterns matched.

If you want to match words of 5-8 letters, use /\b[a-z]{5,8}\b/i:

Bizarre
a
one two three four five six seven eight nine
sensitive
dropouts

This pattern matches Bizarre, three, seven, eight, and dropouts.
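
The three forms of the range quantifier can be tried side by side in Ruby. The sketch below uses the patterns from this section; the constant names and the range_matches helper are assumptions made for the example.

```ruby
# Constant and method names are illustrative; the patterns come from the text.
EXACTLY_TEN_DIGITS    = /\b\d{10}\b/      # p{m}
AT_LEAST_THREE_DIGITS = /\b\d{3,}\b/      # p{m,}
FIVE_TO_EIGHT_LETTERS = /\b[a-z]{5,8}\b/i # p{m,n}

# Returns every substring of text matched by pattern.
def range_matches(text, pattern)
  text.scan(pattern)
end

p range_matches('2225551212 1234567890 123456789 12345678900', EXACTLY_TEN_DIGITS)
p range_matches('My phone number is 222-555-1212.', AT_LEAST_THREE_DIGITS)
p range_matches('one two three four five six seven eight nine', FIVE_TO_EIGHT_LETTERS)
```

Because none of these patterns contain grouping parentheses, String#scan returns the matched text itself rather than captured groups.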

Greediness

The quantifiers we've discussed thus far are greedy: they always match the longest possible string they can. For instance, try matching /a[abc]*c/ against xabcbcbacy. You should see that this pattern matches abcbcbac, not abc or abcbc, both of which could match the pattern but are shorter than the final match string. This aspect of regex isn't often a concern, but when it is, it can be highly confusing if you aren't familiar with greediness.

In most cases, greediness is what you want. However, sometimes it isn't, and you need to match the fewest characters possible; we call this a lazy match. In Ruby and JavaScript, you can request a lazy match by adding a ? after the main quantifier. For example, /a[abc]*?c/ matches abc and ac in xabcbcbacy.
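
Here is a small Ruby sketch that puts the greedy and lazy versions next to each other. GREEDY, LAZY, and greedy_and_lazy are hypothetical names; the patterns and the test string are the ones used above.

```ruby
GREEDY = /a[abc]*c/  # [abc]* takes as many characters as it can
LAZY   = /a[abc]*?c/ # [abc]*? takes as few characters as it can

# Returns all matches for both patterns so they can be compared.
def greedy_and_lazy(text)
  { greedy: text.scan(GREEDY), lazy: text.scan(LAZY) }
end

result = greedy_and_lazy('xabcbcbacy')
puts "greedy: #{result[:greedy].inspect}"
puts "lazy:   #{result[:lazy].inspect}"
```

The greedy pattern should report one long match, while the lazy pattern stops at the first c it can reach and so finds two shorter matches.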

See this article for a more visual description of greediness vs. laziness.

Summary

That concludes our overview of regular expressions. You've now seen most of the patterns you need to use regex proficiently, but you haven't put them to use yet. Now it's time to learn how to use regex in real programs. In the next section, we show you the basics of using regex in your applications.

Before taking that plunge, though, take a little while to work the exercises below. In these exercises, use Rubular to write and test your regex. You don't need to write any code.

Exercises

  1. Write a regex that matches any word that begins with b and ends with e, and has any number of letters in between. You may limit your regex to lowercase letters. Test it with these strings.

    To be or not to be
    Be a busy bee
    I brake for animals.
    

    There should be four matches.

    Solution

    /\bb[a-z]*e\b/
    

    This regex should match the words be (both instances), bee, and brake.

  2. Write a regex that matches any line of text that ends with a ?. Test it with these strings.

    What's up, doc?
    Say what? No way.
    ?
    Who? What? Where? When? How?
    

    There should be three matches.

    Solution

    /^.*\?$/
    

    This regex should match the first, third, and fourth lines, but not the second line. Note the use of .*; you'll see this often in regex. It matches any sequence of characters, but, by default, does not match a newline character. It's how you ignore everything between two points when matching.

    Note that the ? must be \-escaped since we want to match a literal ?.

  3. Write a regex that matches any line of text that ends with a ?, but does not match a line that consists entirely of a single ?. Test it with the strings from the previous exercise.

    There should be two matches.

    Solution

    /^.+\?$/
    

    This regex should match the first and fourth lines, but not the second or third. The .+ pattern makes the regex match at least one character before it attempts to match the ?.

  4. Write a regex that matches any line of text that contains nothing but a URL. For this exercise, a URL begins with http:// or https://, and continues until it detects a whitespace character or end of line. Test your regex with these strings:

    http://launchschool.com/
    https://mail.google.com/mail/u/0/#inbox
    htpps://example.com
    Go to http://launchschool.com/
    https://user.example.com/test.cgi?a=p&c=0&t=0&g=0 hello
        http://launchschool.com/
    

    There should be two matches.

    Solution

    /^https?:\/\/\S*$/
    

    This regex should match the first and second text lines, but none of the others. The third line doesn't match because of a misspelling; the fourth and fifth don't match because of extra content, and the last doesn't match because of the leading spaces.

    The regex begins with a line anchor, ^, and then the http part of the URL followed by an optional s. Next, we have the :, and two / characters (both / characters must be \-escaped). We then have the rest of the URL, which we achieve by matching a string of non-whitespace characters. We also require an explicit line anchor, $, to prevent matching a URL that isn't at the end of the line.

  5. Modify your regex from the previous exercise so the URL can have optional leading or trailing whitespace, but is otherwise on a line by itself. To test your regex with trailing whitespace, you must add some spaces to the end of some lines in your sample text.

    There should be three matches.

    Solution

    /^\s*https?:\/\/\S*\s*$/
    

    This regex should match the URLs on the first, second, and last lines.

  6. Modify your regex from the previous exercise so the URL can appear anywhere on each line, so long as it begins at a word boundary.

    There should be five matches.

    Solution

    /\bhttps?:\/\/\S*/
    

    This solution should match all of the URLs above. (Note that the third line is not a URL.)

  7. Write a regex that matches any word that contains at least three occurrences of the letter i. Test your regex against these strings:

    Mississippi
    ziti 0minimize7
    inviting illegal iridium
    

    There should be three matches.

    Solution

    /\b[a-z]*i[a-z]*i[a-z]*i[a-z]*\b/i
    

    Alternate solution

    /\b([a-z]*i){3}[a-z]*\b/i
    

    And one more solution

    /\b([a-z]*i){3,}[a-z]*\b/i
    

    Your solution should match Mississippi, inviting, and iridium. We use word boundary anchors in our solution to guard against strings that aren't words, such as 0minimize7. Each [a-z]*i matches a sequence of 0 or more letters followed by the letter i. Connecting three occurrences of [a-z]*i and then adding one more [a-z]* to the end, we get a regex that matches any word with at least 3 is.

    Our alternate solution is similar, but it uses the {3} quantifier to perform the 3-occurrences part of the match. The quantifier applies to ([a-z]*i), which uses grouping parentheses to treat [a-z]*i as a single pattern for use by {3}.

    The final solution we show uses {3,} instead of {3}. See if you can determine why both solutions work.

  8. Write a regex that matches the last word in each line of text. For this exercise, assume that words are any sequence of non-whitespace characters. Test your regex against these strings:

    What's up, doc?
    I tawt I taw a putty tat!
    Thufferin' thuccotath!
    Oh my darling, Clementine!
    Camptown ladies sing this song, doo dah.
    

    There should be five matches.

    Solution

    /\S+$/
    

    Your solution should match doc?, tat!, thuccotath!, Clementine!, and dah.

  9. Write a regex that matches lines of text that contain at least 3, but no more than 6, consecutive comma separated numbers. You may assume that every number on each line is both preceded by and followed by a comma. Test your regex against these strings:

    ,123,456,789,123,345,
    ,123,456,,789,123,
    ,23,56,7,
    ,13,45,78,23,45,34,
    ,13,45,78,23,45,34,56,
    

    There should be three matches.

    Solution

    /^,(\d+,){3,6}$/
    

    Your solution should match the first, third, and fourth lines.

  10. Write a regex that matches lines of text that contain at least 3, but no more than 6, consecutive comma separated numbers. In this exercise, you can assume that the first number on each line is not preceded by a comma, and the last number is not followed by a comma. Test your regex against these strings:

    123,456,789,123,345
    123,456,,789,123
    23,56,7
    13,45,78,23,45,34
    13,45,78,23,45,34,56
    

    There should be three matches.

    Solution

    /^(\d+,){2,5}\d+$/
    

    Your solution should match the first, third, and fourth lines. In this case, the lack of a comma at each end of the strings complicates our solution slightly - we can't check for 3-6 occurrences of \d+,, but have to check for 2-5 occurrences followed by a final \d+ pattern.

  11. Challenge: Write a regex that matches lines of text that contain either 3 comma separated numbers or 6 or more comma separated numbers. Test your regex against these strings:

    123,456,789,123,345
    123,456,,789,123
    23,56,7
    13,45,78,23,45,34
    13,45,78,23,45,34,56
    

    There should be three matches.

    Solution

    /(^(\d+,){2}\d+$|^(\d+,){5,}\d+$)/
    

    Alternate Solution

    /^(\d+,){2}((\d+,){3,})?\d+$/
    

    Your solution should match the last three lines. Regex provide no simple way to say something like three occurrences, or 6 or more occurrences. We have two approaches we can take instead: either use alternation or use the ? quantifier to make part of the pattern optional.

    Our first solution uses alternation. Let's break it up a bit using "extended" syntax:

    /
      (                  # Grouping for alternation
        ^(\d+,){2}\d+$   # Match precisely 3 numbers on a line
        |                # *or*
        ^(\d+,){5,}\d+$  # Match 6 or more numbers on a line
      )                  # All done
    /x
    

    Our alternate solution uses the ? quantifier instead. Breaking it down once again, we see:

    /
      ^                  # Start of line
      (\d+,){2}          # 2 numbers at start
      (                  # followed by...
        (\d+,){3,}       #    at least 3 more numbers
      )?                 #    that are optional
      \d+                # followed by one last number
      $                  # end of line
    /x
    

    Note the use of the 'x' option on these broken-out patterns. Ruby supports this option, but JavaScript regex literals do not. It is useful when you have a convoluted regex: it lets you write a regex over several lines, and put comments on each line. See the Ruby Regexp documentation for more information.

    In a real program, you may instead choose to use two separate regex:

    if text.match(/^(\d+,){2}\d+$/) || text.match(/^(\d+,){5,}\d+$/)
    

    This code is easier to understand, but not always practical.

  12. Challenge: Write a regex that matches HTML h1 header tags, e.g.,

    <h1>Main Heading</h1>
    <h1>Another Main Heading</h1>
    <h1>ABC</h1> <p>Paragraph</p> <h1>DEF</h1><p>Done</p>
    

    and the content between the opening and closing tags. If multiple header tags appear on one line, your regex should match the opening and closing tags and the text content of the headers, but nothing else. You may assume that there are no nested tags in the text between <h1> and </h1>.

    Solution

    /<h1>.*?<\/h1>/
    

    For this exercise, we need to use a "lazy" quantifier instead of the default greedy quantifier, so we use .*? to match the text in between the <h1> opening tag and its closing tag, </h1>.

    What would happen if you omitted the '?'? Try both the correct regex and the one with a greedy quantifier (/<h1>.*<\/h1>/) against this HTML to see:

    <h1>ABC</h1> <p>Paragraph</p> <h1>DEF</h1><p>Done</p>