Anchors

Let's get wet by moving into chest-deep waters. Keep your feet anchored to the bottom, though. That's what we're about to discuss: anchors. Anchors provide a way to limit how a regex matches a particular string by telling the regex engine where matches can begin and where they can end.

Anchors are a bit strange in the world of regex; they don't match any characters. What they do is ensure that a regex matches a string at a specific place: the beginning or end of a string or line, or on a word or non-word boundary.

Start/End of Line

If you've ever used regex in any other context, there's a pretty good chance that you are familiar with the ^ and $ anchors, so we'll start our exploration of anchors there. Don't skip ahead though! There are some subtleties of which you should be aware.

The ^ and $ meta-characters are anchors that fix a regex match to the beginning (^) or end ($) of a line of text. In Ruby, there's some subtlety to that definition, which we will circle back to in the next subsection; for now, though, you can think of it as meaning that ^ and $ anchor a regex to the beginning or end of a string.

Let's see how the ^ anchor works. Try this regex, /^cat/, against these strings:

cat
catastrophe
wildcat
I love my cat
<cat>

You should find that this regex matches the first two strings, but not the last three. This example demonstrates that ^ forces the cat pattern to match at the beginning of each line.

Similarly, you can see the $ anchor in operation by trying /cat$/ against those same strings. This time, the regex matches the first, third, and fourth lines; those lines all end with cat.

Lastly, you can combine ^ and $. Try /^cat$/ against the five strings shown above. This time, the first string matches, but none of the others do.
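
If you'd rather check these three results in Ruby than on Rubular, here is a minimal sketch. The matching_strings helper and the samples array are names made up for this illustration; each sample is a separate single-line string, so the Ruby subtleties discussed next don't come into play.

```ruby
# Hypothetical helper (not part of the original text): returns the
# strings from +strings+ that +regex+ matches.
def matching_strings(regex, strings)
  strings.select { |str| str.match?(regex) }
end

samples = ['cat', 'catastrophe', 'wildcat', 'I love my cat', '<cat>']

p matching_strings(/^cat/, samples)   # => ["cat", "catastrophe"]
p matching_strings(/cat$/, samples)   # => ["cat", "wildcat", "I love my cat"]
p matching_strings(/^cat$/, samples)  # => ["cat"]
```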

Lines vs Strings

This sub-section is not relevant in JavaScript. Please skip ahead if you're reading this for information on JavaScript regex.

As we mentioned above, there's some subtlety involved with how ^ and $ work in Ruby. This subtlety arises when the string you are attempting to match contains one or more newline characters that aren't the last character in the string. For example, consider this code:

TEXT1 = "red fish\nblue fish"
puts "matched red" if TEXT1.match(/^red/)
puts "matched blue" if TEXT1.match(/^blue/)

If you're using Rubular to test this, put red fish and blue fish on separate lines in the test string box. Rubular doesn't recognize the \n sequence as a newline in the test string.

It may surprise you, but this example outputs both matched red and matched blue, since ^ anchors the regex to the beginning of each line in the string, not the beginning of the string. For Ruby's purposes, each new line begins after a \n character, with the beginning of the string marking the beginning of the first line. A line runs through, and includes, the next \n character. If no more \n characters are available, the last line runs through to the end of the string.

With that in mind, our example using $ shouldn't be too surprising:

TEXT2 = "red fish\nred shirt"
puts "matched fish" if TEXT2.match(/fish$/)
puts "matched shirt" if TEXT2.match(/shirt$/)

As before, we get a match for both regex. Note in particular that even though the first line in the string ends with a \n, fish is still said to occur at the end of the line. In Ruby, $ matches just before a \n character as well as at the very end of the string.
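
One way to see the per-line behavior is to collect every match instead of just the first. This is only a sketch: all_matches is a hypothetical helper that wraps String#scan, and the sample text is invented for the illustration.

```ruby
# Hypothetical helper (not part of the original text): returns every
# match of +regex+ in +text+ as an array of strings.
def all_matches(regex, text)
  text.scan(regex)
end

text = "red fish\nblue fish\n"

# ^ matches at the start of each line, so we get the first
# character of both lines.
p all_matches(/^\w/, text)    # => ["r", "b"]

# $ matches at the end of each line, just before each \n.
p all_matches(/fish$/, text)  # => ["fish", "fish"]
```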

Start/End of String

This sub-section is not relevant in JavaScript. Please skip ahead if you are reading this for information on JavaScript regex.

It's not too often that you'll encounter situations where you need to match multi-line strings as shown in the previous sub-section, but they do arise. More often, though, you must match at the beginning or end of the string, not the line. For these matches, use the \A, \Z, and \z anchors (note that there is no \a anchor).

The \A anchor ensures that a regex matches at the beginning of the string, while \Z and \z match at the end of the string. The difference between \Z and \z is somewhat subtle and seldom of concern: \z always matches at the end of a string, while \Z matches up to, but not including, a newline at the end of the string. As a rule, use \z until you determine that you need \Z.

TEXT3 = "red fish\nblue fish"
TEXT4 = "red fish\nred shirt"
puts "matched red" if TEXT3.match(/\Ared/)
puts "matched blue" if TEXT3.match(/\Ablue/)
puts "matched fish" if TEXT4.match(/fish\z/)
puts "matched shirt" if TEXT4.match(/shirt\z/)

In contrast to the examples in the previous subsection, this prints matched red and matched shirt.
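
The difference between $, \Z, and \z is easiest to see side by side. The sketch below assumes a made-up helper, end_anchor_results, and reports whether fish matches at each kind of "end" for a few sample strings.

```ruby
# Hypothetical helper (not part of the original text): tries the same
# pattern with each of the three end anchors.
def end_anchor_results(text)
  {
    dollar:  text.match?(/fish$/),   # end of any line
    upper_z: text.match?(/fish\Z/),  # end of string, or just before a final \n
    lower_z: text.match?(/fish\z/)   # the very end of the string only
  }
end

p end_anchor_results("red fish")             # all three are true
p end_anchor_results("red fish\n")           # dollar and upper_z are true; lower_z is false
p end_anchor_results("red fish\nred shirt")  # only dollar is true
```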

Even though we recommend using \A and \z for most anchored matches in Ruby, most examples and exercises in this book use ^ and $ instead. It is easier to demonstrate certain behaviors when using ^ and $ on Rubular.

Word Boundaries

The last two anchors tie regex matches to word boundaries (\b) and non-word boundaries (\B). For these anchors, words are sequences of word characters (\w), while non-words are sequences of non-word characters (\W). A word boundary occurs:

  • between any pair of characters, one of which is a word character and one of which is not.
  • at the beginning of a string if the first character is a word character.
  • at the end of a string if the last character is a word character.

A non-word boundary matches any place else:

  • between any pair of characters, both of which are word characters or both of which are not word characters.
  • at the beginning of a string if the first character is a non-word character.
  • at the end of a string if the last character is a non-word character.

For instance:

Eat some food.

Here, word boundaries occur before the E, s, and f at the start of the three words, and after the t, e, and d at their ends. Non-word boundaries occur elsewhere, such as between the o and m in some, and following the . at the end of the sentence.
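
The rules above can be spelled out by hand. The following sketch does not use \b at all; it applies the definition directly to find where \b would match. The method names word_char? and word_boundary_positions are invented for this illustration, and positions are counted as gaps between characters, with 0 being the start of the string.

```ruby
# Hypothetical helper: true when +char+ is a word character (\w).
# nil (used here for "outside the string") counts as a non-word character.
def word_char?(char)
  char.to_s.match?(/\w/)
end

# Hypothetical helper: returns the positions (gaps between characters)
# where a word boundary occurs. Position 0 is the start of the string;
# position text.length is the end of the string.
def word_boundary_positions(text)
  (0..text.length).select do |index|
    before = index.zero? ? nil : text[index - 1]
    after = text[index] # nil once we reach the end of the string
    word_char?(before) != word_char?(after)
  end
end

p word_boundary_positions('Eat some food.')  # => [0, 3, 4, 8, 9, 13]
```

Every position from 0 to 14 that is missing from that list, such as 6 (between the o and m in some) and 14 (after the final .), is a non-word boundary.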

To anchor a regex to a word boundary, use the \b pattern. For example, to match three-character words consisting of "word characters", you can use /\b\w\w\w\b/. Try it with:

One fish,
Two fish,
Red fish,
Blue fish.
123 456 7890
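
If you want to confirm what you see on Rubular, here is a small Ruby sketch. The three_char_words helper and the poem variable are names made up for the example; the text is the same five lines joined with \n.

```ruby
# Hypothetical helper (not part of the original text): returns every
# run of exactly three word characters bounded by word boundaries.
def three_char_words(text)
  text.scan(/\b\w\w\w\b/)
end

poem = "One fish,\nTwo fish,\nRed fish,\nBlue fish.\n123 456 7890"

p three_char_words(poem)  # => ["One", "Two", "Red", "123", "456"]
```

Note that 123 and 456 match because digits are word characters, while fish, Blue, and 7890 are skipped because they are four characters long.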

It's rare that you must use the non-word boundary anchor, \B. Here's a somewhat contrived example: try the regex /\Bjohn/i against these strings:

John Silver
Randy Johnson
Duke Pettijohn
Joe_Johnson

The regex matches john in the last two strings, but not the first two.
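
The same experiment in Ruby might look like the sketch below. The inner_john helper is a made-up name; it returns the matched text, or nil when there is no match.

```ruby
# Hypothetical helper (not part of the original text): returns the text
# matched by /\Bjohn/i, or nil if "john" only appears at a word boundary.
def inner_john(name)
  name[/\Bjohn/i]
end

names = ['John Silver', 'Randy Johnson', 'Duke Pettijohn', 'Joe_Johnson']

p names.map { |name| inner_john(name) }  # => [nil, nil, "john", "John"]
```

The last string matches because _ is a word character, so there is no word boundary between _ and J.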

\b and \B do not work as word boundaries inside of character classes (between square brackets). In fact, \b means something else entirely when inside square brackets: it matches a backspace character.
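
A quick Ruby sketch of the backspace behavior follows; contains_backspace? is a hypothetical helper name. Keep in mind that in a double-quoted Ruby string, "\b" is itself a backspace character.

```ruby
# Hypothetical helper (not part of the original text): inside square
# brackets, \b matches a literal backspace character, not a word boundary.
def contains_backspace?(text)
  text.match?(/[\b]/)
end

p contains_backspace?("a\bc")   # => true  (the string contains a backspace)
p contains_backspace?('a b c')  # => false (word boundaries, but no backspace)
```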

Summary

With the use of anchors, you now have a great deal more flexibility. These simple constructs provide a degree of control over your regex that you didn't have before -- you can tell the regex engine where matches can occur. If you need it, more is available with look-ahead and look-behind assertions, but that topic is beyond the scope of this book.

In the next chapter, we'll get into quantifiers. Quantifiers, more than any other feature, lie at the heart of what makes regex so useful.

But, before you wade out any further, take a little while to work the exercises below. In these exercises, use Rubular to write and test your regex. You don't need to write any code.

Exercises

  1. Write a regex that matches the word The when it occurs at the beginning of a line. Test it with these strings:

    The lazy cat sleeps.
    The number 623 is not a word.
    Then, we went to the movies.
    Ah. The bus has arrived.
    

    There should be two matches.

    Solution

    /^The\b/
    

    This regex should match the word The in the first two lines, but should not match anything on the last two.

    If you tried using /\AThe\b/ on Rubular, the match probably didn't work right. Why not? If you haven't already tried, try it now. In most cases, you should use \A instead of ^ in Ruby, but Rubular treats the test string as a single multi-line string, so you need to use ^ instead.

  2. Write a regex that matches the word cat when it occurs at the end of a line. Test it with these strings:

    The lazy cat sleeps
    The number 623 is not a cat
    The Alaskan drives a snowcat
    

    There should be one match.

    Solution

    /\bcat$/
    

    This regex should match the word cat in the second line, but should not match anything else.

    If you tried using /\bcat\z/ on Rubular, the match probably didn't work right. Why not? If you haven't already tried, try it now. In most cases, you should use \z instead of $ in Ruby, but Rubular treats the test string as a single multi-line string, so you need to use $ instead.

  3. Write a regex that matches any three-letter word; a word is any string composed entirely of letters. You can use these test strings:

    reds and blues
    The lazy cat sleeps.
    The number 623 is not a word. Or is it?
    

    There should be five matches.

    Solution

    /\b[a-z][a-z][a-z]\b/i
    

    As expected, this regex matches and, cat, The (both occurrences), and not. Notice that it does not match 623, which contains no letters, or the it? at the end of the last line.

  4. Challenge: Write a regex that matches an entire line of text that consists of exactly 3 words as follows:

    • The first word is A or The.
    • There is a single space between the first and second words.
    • The second word is any 4-letter word.
    • There is a single space between the second and third words.
    • The third word -- the last word -- is either dog or cat.

    Test your solution with these strings:

    A grey cat
    A blue caterpillar
    The lazy dog
    The white cat
    A loud dog
    --A loud dog
    Go away dog
    The ugly rat
    The lazy, loud dog
    

    There should be three matches.

    Solution

    /^(A|The) [a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z] (dog|cat)$/
    

    The valid matches are A grey cat, The lazy dog, and A loud dog.

    This solution employs alternation from the first chapter in this section to define the words that occur at the beginning and end of each line and includes a match for a four-letter word in the middle. We have assumed that the middle word can contain both uppercase and lowercase letters, so we have to specify [a-zA-Z] for each of the four letters. We don't use \w because the problem explicitly asked for four-letter words.

    As with the other exercises, a proper Ruby solution would use \A and \z instead of ^ and $, but to allow for Rubular limitations, we use ^ and $ instead.
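
For exercise 4, a Ruby version that uses \A and \z might look like the following sketch. It assumes each test line is checked as its own string, which is why the string anchors work here; three_word_line? and lines are names invented for the illustration.

```ruby
# Hypothetical helper (not part of the original text): checks a single
# line against the exercise 4 pattern, anchored to the whole string.
def three_word_line?(line)
  line.match?(/\A(A|The) [a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z] (dog|cat)\z/)
end

lines = [
  'A grey cat', 'A blue caterpillar', 'The lazy dog', 'The white cat',
  'A loud dog', '--A loud dog', 'Go away dog', 'The ugly rat',
  'The lazy, loud dog'
]

p lines.select { |line| three_word_line?(line) }
# => ["A grey cat", "The lazy dog", "A loud dog"]
```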