Let's get wet by moving into chest-deep waters. Keep your feet anchored to the bottom, though. That's what we're about to discuss: anchors. Anchors provide a way to limit how a regex matches a particular string by telling the regex engine where matches can begin and where they can end.
Anchors are a bit strange in the world of regex; they don't match any characters. What they do is ensure that a regex matches a string at a specific place: the beginning or end of the string, the beginning or end of a line, or a word or non-word boundary.
If you've ever used regex in any other context, there's a pretty good chance that you are familiar with the ^ and $ anchors, so we'll start our exploration of anchors there. Don't skip ahead though! There are some subtleties of which you should be aware.
The ^ and $ meta-characters are anchors that fix a regex match to the beginning (^) or ending ($) of a line of text. In Ruby, there's some subtlety to that definition which we will circle back to in the next subsection; for now, though, you can think of it as meaning that ^ and $ anchor a regex to the beginning or end of a string.
Let's see how the ^ anchor works. Try this regex, /^cat/, against these strings:
cat
catastrophe
wildcat
I love my cat
<cat>
You should find that this regex matches the first two strings, but not the last three. This example demonstrates that ^ forces the cat pattern to match at the beginning of each line.
Similarly, you can see the $ anchor in operation by trying /cat$/ against those same strings. This time, the regex matches the first, third, and fourth lines; those lines all end with cat.
Lastly, you can combine ^ and $. Try /^cat$/ against the five strings shown above. This time, the first string matches, but none of the others do.
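If you would rather check these results in code than on Rubular, a minimal Ruby sketch might look like the following. The names CAT_STRINGS and cat_matches are our own, chosen only for illustration; each test string is a single line, so ^ and $ behave as described above.

```ruby
# Illustrative names; the five test strings come from the list above.
CAT_STRINGS = ['cat', 'catastrophe', 'wildcat', 'I love my cat', '<cat>'].freeze

# Returns the strings from CAT_STRINGS that the given regex matches.
def cat_matches(regex)
  CAT_STRINGS.select { |string| string.match?(regex) }
end

p cat_matches(/^cat/)   # strings that start with "cat"
p cat_matches(/cat$/)   # strings that end with "cat"
p cat_matches(/^cat$/)  # strings that are exactly "cat"
```

The three calls should report two, three, and one matching strings respectively, in line with the results described above.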
This sub-section is not relevant in JavaScript. Please skip ahead if you're reading this for information on JavaScript regex.
As we mentioned above, there's some subtlety involved with how ^ and $ work in Ruby. This subtlety arises when the string you are attempting to match contains one or more newline characters that aren't the last character in the string. For example, consider this code:
TEXT1 = "red fish\nblue fish"
puts "matched red" if TEXT1.match(/^red/)
puts "matched blue" if TEXT1.match(/^blue/)
If you're using Rubular to test this, put red fish and blue fish on separate lines in the test string box. Rubular doesn't recognize the \n sequence as a newline in the test string.
It may surprise you, but this example outputs both matched red and matched blue since ^ anchors the regex to the beginning of each line in the string, not the beginning of the string. For Ruby's purposes, each new line occurs after a \n character, with the beginning of the string marking the beginning of the first line. The line runs through - and includes - the next \n character. If no more \n characters are available, the last line runs through to the end of the string.
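One way to see the per-line behavior directly is to collect every match instead of testing for just one. The sketch below uses String#scan; the constant names FISH_TEXT and LINE_STARTS are illustrative only.

```ruby
# FISH_TEXT is an illustrative constant holding a two-line string.
FISH_TEXT = "red fish\nblue fish"

# ^ can match at the start of the string and right after each \n,
# so scan finds the first word of every line.
LINE_STARTS = FISH_TEXT.scan(/^\w+/)

p LINE_STARTS # => ["red", "blue"]
```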
With that in mind, our example using $ shouldn't be too surprising:
TEXT2 = "red fish\nred shirt"
puts "matched fish" if TEXT2.match(/fish$/)
puts "matched shirt" if TEXT2.match(/shirt$/)
As before, we get a match for both regex. Note in particular that even though the first line in the string ends with a \n, fish is still said to occur at the end of the line. $ matches just before the \n that ends a line, so a trailing \n doesn't prevent the match.
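To see how $ treats a line that ends with \n, you could try a short sketch like this one. The constant name TRAILING_NEWLINE is an assumption made for the example.

```ruby
# Illustrative constant: a string whose only line ends with a newline.
TRAILING_NEWLINE = "red fish\n"

# $ matches just before the \n, so the match succeeds ...
p TRAILING_NEWLINE.match?(/fish$/) # => true

# ... and the matched text does not include the newline itself.
p TRAILING_NEWLINE[/fish$/]        # => "fish"
```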
This sub-section is not relevant in JavaScript. Please skip ahead if you are reading this for information on JavaScript regex.
It's not too often that you'll encounter situations where you need to match multi-line strings as shown in the previous sub-section, but they do arise. More often, though, you must match at the beginning or end of the string, not the line. For these matches, use the \A, \Z, and \z anchors (note that there is no \a anchor).
The \A anchor ensures that a regex matches at the beginning of the string, while \Z and \z match at the end of the string. The difference between \Z and \z is somewhat subtle and seldom of concern: \z always matches at the end of a string, while \Z matches up to, but not including, a newline at the end of the string. As a rule, use \z until you determine that you need \Z.
TEXT3 = "red fish\nblue fish"
TEXT4 = "red fish\nred shirt"
puts "matched red" if TEXT3.match(/\Ared/)
puts "matched blue" if TEXT3.match(/\Ablue/)
puts "matched fish" if TEXT4.match(/fish\z/)
puts "matched shirt" if TEXT4.match(/shirt\z/)
In contrast to the examples in the previous subsection, this prints matched red and matched shirt.
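The difference between \Z and \z only shows up when the string ends with a newline. Here is a small sketch that compares the two; the constant names are illustrative.

```ruby
# Illustrative constants: the same text with and without a trailing newline.
WITH_NEWLINE = "red shirt\n"
WITHOUT_NEWLINE = 'red shirt'

# \z insists on the very end of the string, so the trailing \n gets in the way.
p WITH_NEWLINE.match?(/shirt\z/)    # => false

# \Z tolerates a single newline at the end of the string.
p WITH_NEWLINE.match?(/shirt\Z/)    # => true

# Without a trailing newline, both anchors behave the same.
p WITHOUT_NEWLINE.match?(/shirt\z/) # => true
p WITHOUT_NEWLINE.match?(/shirt\Z/) # => true
```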
Even though we recommend using \A and \z for most anchored matches in Ruby, most examples and exercises in this book use ^ and $ instead. It is easier to demonstrate certain behaviors when using ^ and $ on Rubular.
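The last two anchors anchor regex matches to word boundaries (\b) and non-word boundaries (\B). For these anchors, words are sequences of word characters (\w), while non-words are sequences of non-word characters (\W). A word boundary occurs:
- between any pair of characters, one of which is a word character and one of which is not.
- at the beginning of a string if the first character is a word character.
- at the end of a string if the last character is a word character.
A non-word boundary matches any place else:
- between any pair of characters, both of which are word characters or both of which are non-word characters.
- at the beginning of a string if the first character is a non-word character.
- at the end of a string if the last character is a non-word character.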
For instance:
Eat some food.
Here, word boundaries occur before the E, s, and f at the start of the three words, and after the t, e, and d at their ends. Non-word boundaries occur elsewhere, such as between the o and m in some, and following the . at the end of the sentence.
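If you want to see every boundary at once, one option is to insert a visible marker wherever the anchor matches. The sketch below does that with String#gsub; the constant names are our own, and the | marker is an arbitrary choice.

```ruby
# The sample sentence from above; constant names are illustrative.
SENTENCE = 'Eat some food.'

# gsub with a zero-width anchor inserts the marker wherever the anchor matches.
WORD_BOUNDARIES = SENTENCE.gsub(/\b/, '|')
NON_WORD_BOUNDARIES = SENTENCE.gsub(/\B/, '|')

puts WORD_BOUNDARIES     # |Eat| |some| |food|.
puts NON_WORD_BOUNDARIES # E|a|t s|o|m|e f|o|o|d.|
```

Because anchors are zero-width, gsub does not replace any characters here; it only adds the marker at each matching position.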
To anchor a regex to a word boundary, use the \b pattern. For example, to match three-letter words consisting of "word characters", you can use /\b\w\w\w\b/. Try it with:
One fish,
Two fish,
Red fish,
Blue fish.
123 456 7890
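In code, the same experiment might look like the sketch below. The constant names RHYME and THREE_CHAR_WORDS are illustrative, and we assume the five test lines form a single string joined by newlines.

```ruby
# Illustrative constant: the five test lines joined into one string.
RHYME = "One fish,\nTwo fish,\nRed fish,\nBlue fish.\n123 456 7890"

# \w also matches digits, so 123 and 456 count as three-character "words".
THREE_CHAR_WORDS = RHYME.scan(/\b\w\w\w\b/)

p THREE_CHAR_WORDS # => ["One", "Two", "Red", "123", "456"]
```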
It's rare that you must use the non-word boundary anchor, \B. Here's a somewhat contrived example you can try. Try the regex /\Bjohn/i against these strings:
John Silver
Randy Johnson
Duke Pettijohn
Joe_Johnson
The regex matches john in the last two strings, but not the first two.
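A quick way to confirm this in Ruby is sketched below; the constant names are illustrative. Note that the underscore in Joe_Johnson is a word character, which is why there is a non-word boundary before its second J.

```ruby
# Illustrative constant holding the four test strings from above.
NAMES = ['John Silver', 'Randy Johnson', 'Duke Pettijohn', 'Joe_Johnson'].freeze

# Keep only the names where "john" starts somewhere other than a word boundary.
NON_BOUNDARY_JOHNS = NAMES.select { |name| name.match?(/\Bjohn/i) }

p NON_BOUNDARY_JOHNS # => ["Duke Pettijohn", "Joe_Johnson"]
```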
\b and \B do not work as word boundaries inside of character classes (between square brackets). In fact, \b means something else entirely when inside square brackets: it matches a backspace character.
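The following sketch contrasts the two meanings of \b. The constant name WITH_BACKSPACE is an assumption for the example; it relies on the fact that "\b" in a double-quoted Ruby string is the backspace character.

```ruby
# "\b" in a double-quoted Ruby string is the backspace character (0x08).
WITH_BACKSPACE = "cat\b"

# Inside square brackets, \b matches a literal backspace.
p WITH_BACKSPACE.match?(/[\b]/) # => true
p 'cat'.match?(/[\b]/)          # => false

# Outside square brackets, \b is the word boundary anchor.
p 'cat'.match?(/\bcat\b/)       # => true
```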
With the use of anchors, you now have a great deal more flexibility. These simple constructs provide a degree of control over your regex that you didn't have before -- you can tell the regex engine where matches can occur. If you need it, more is available with look-ahead and look-behind assertions, but that topic is beyond the scope of this book.
In the next chapter, we'll get into quantifiers. Quantifiers, more than any other feature, lie at the heart of what makes regex so useful.
But, before you wade out any further, take a little while to work the exercises below. In these exercises, use Rubular to write and test your regex. You don't need to write any code.
Write a regex that matches the word The when it occurs at the beginning of a line. Test it with these strings:
The lazy cat sleeps.
The number 623 is not a word.
Then, we went to the movies.
Ah. The bus has arrived.
There should be two matches.
/^The\b/
This regex should match the word The in the first two lines, but should not match anything on the last two.
If you tried using /\AThe\b/ on Rubular, the match probably didn't work right. Why not? If you haven't already tried, try it now. In most cases, you should use \A instead of ^ in Ruby, but Rubular treats the test string as a single multi-line string, so you need to use ^ instead.
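To reproduce the Rubular behavior in code, you can join the test strings into one multi-line string and count the matches for each anchor. This is a sketch under that assumption; the constant name THE_LINES is illustrative.

```ruby
# Illustrative constant: the four test lines as a single multi-line string,
# which is how Rubular treats its test string box.
THE_LINES = [
  'The lazy cat sleeps.',
  'The number 623 is not a word.',
  'Then, we went to the movies.',
  'Ah. The bus has arrived.'
].join("\n")

# ^ matches at the start of every line; \A only at the start of the string.
p THE_LINES.scan(/^The\b/).size  # => 2
p THE_LINES.scan(/\AThe\b/).size # => 1
```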
Write a regex that matches the word cat when it occurs at the end of a line. Test it with these strings:
The lazy cat sleeps
The number 623 is not a cat
The Alaskan drives a snowcat
There should be one match.
/\bcat$/
This regex should match the word cat in the second line, but should not match anything else.
If you tried using /\bcat\z/ on Rubular, the match probably didn't work right. Why not? If you haven't already tried, try it now. In most cases, you should use \z instead of $ in Ruby, but Rubular treats the test string as a single multi-line string, so you need to use $ instead.
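The effect is easy to reproduce with a multi-line string in Ruby. In this sketch (CAT_LINES is an illustrative name), \z can only match after snowcat at the very end of the string, where the \b requirement fails.

```ruby
# Illustrative constant: the three test lines as a single multi-line string.
CAT_LINES = [
  'The lazy cat sleeps',
  'The number 623 is not a cat',
  'The Alaskan drives a snowcat'
].join("\n")

# $ matches at the end of each line, so the cat on the second line is found.
p CAT_LINES.scan(/\bcat$/)  # => ["cat"]

# \z matches only at the end of the string, where "snowcat" fails the \b test.
p CAT_LINES.scan(/\bcat\z/) # => []
```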
Write a regex that matches any three-letter word; a word is any string composed entirely of letters. You can use these test strings.
reds and blues
The lazy cat sleeps.
The number 623 is not a word. Or is it?
There should be five matches.
/\b[a-z][a-z][a-z]\b/i
As expected, this regex matches and, cat, The (both occurrences), and not. Notice that it does not match 623 or it?.
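If you want to verify the five matches outside of Rubular, a sketch along these lines should do. The constant names are illustrative, and the three test lines are assumed to be joined by newlines.

```ruby
# Illustrative constant: the three test lines joined into one string.
WORD_LINES = "reds and blues\nThe lazy cat sleeps.\nThe number 623 is not a word. Or is it?"

# [a-z] with the i flag matches letters only, so 623 is excluded.
THREE_LETTER_WORDS = WORD_LINES.scan(/\b[a-z][a-z][a-z]\b/i)

p THREE_LETTER_WORDS # => ["and", "The", "cat", "The", "not"]
```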
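Challenge: Write a regex that matches an entire line of text that consists of exactly 3 words as follows:
- The first word is A or The.
- The second word is any four-letter word.
- The third word is dog or cat.
- A single space separates each pair of words.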
Test your solution with these strings:
A grey cat
A blue caterpillar
The lazy dog
The white cat
A loud dog
--A loud dog
Go away dog
The ugly rat
The lazy, loud dog
There should be three matches.
/^(A|The) [a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z] (dog|cat)$/
The valid matches are A grey cat, The lazy dog, and A loud dog.
This solution employs alternation from the first chapter in this section to define the words that occur at the beginning and end of each line and includes a match for a four-letter word in the middle. We have assumed that the middle word can contain both uppercase and lowercase letters, so we have to specify [a-zA-Z] for each of the four letters. We don't use \w because it also matches digits and underscores, and the problem explicitly asked for four-letter words.
As with the other exercises, a proper Ruby solution would use \A and \z instead of ^ and $, but to allow for Rubular limitations, we use ^ and $ instead.
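When each test string is checked on its own, ^ and $ behave the same as \A and \z because no string contains a newline. The sketch below takes that approach; the constant names are illustrative.

```ruby
# Illustrative constants: the nine test strings and the solution regex.
CHALLENGE_LINES = [
  'A grey cat', 'A blue caterpillar', 'The lazy dog',
  'The white cat', 'A loud dog', '--A loud dog',
  'Go away dog', 'The ugly rat', 'The lazy, loud dog'
].freeze

CHALLENGE_REGEX = /^(A|The) [a-zA-Z][a-zA-Z][a-zA-Z][a-zA-Z] (dog|cat)$/

# Test each line separately and keep the ones that match.
VALID_LINES = CHALLENGE_LINES.select { |line| line.match?(CHALLENGE_REGEX) }

p VALID_LINES # => ["A grey cat", "The lazy dog", "A loud dog"]
```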