Word Boundaries and Lookahead Assertions
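While trying to improve the lexer for postcss-calc, I learnt about two regular expression features: the word boundary anchor and lookahead assertions.
Word boundary anchor: \b
In a regular expression, \b matches at a word boundary: a position with a word character (a letter, digit, or underscore) on one side and a non-word character, or the start or end of the string, on the other. It matches a position, not a character.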
For example, in JavaScript,
/123\b/.test('123 456')
returns true because the space after 123 is not a word character, so there is a word boundary between 3 and the space.
/123\b/.test('123456')
returns false because in 123456, 123 is followed by 4, which is another word character, so there is no word boundary after 3.
\b in combination with digits and units can be treacherous, because the . decimal point is not a word character, so there is a word boundary right before it. Say you are expecting only whole numbers, but the input also contains decimal numbers with units. /[0-9]+\b/ matches 123 in 123.45deg, leaving .45deg behind, which can give the illusion that the input matches expectations.
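The pitfall can be reproduced with a small sketch. The variable names are made up for illustration and are not taken from postcss-calc; the idea is simply to look at what is left of the input after the match.

```javascript
// Hypothetical pattern for "whole numbers only"; names are illustrative.
const wholeNumber = /[0-9]+\b/;
const input = '123.45deg';

const pitfall = wholeNumber.exec(input);
// \b succeeds between "3" (word character) and "." (non-word character).
console.log(pitfall[0]); // "123"

// The unmatched remainder reveals that the input was not a whole number.
const rest = input.slice(pitfall.index + pitfall[0].length);
console.log(rest); // ".45deg"
```

Checking the remainder, rather than only whether the pattern matched, is one way to notice this kind of partial match.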
Lookahead assertions in regular expressions
While I was looking for a way to keep \b from matching right before the \. character, I came across lookahead assertions. A lookahead assertion makes a pattern match only if it is (or is not) followed by another pattern, without including what follows in the match.
The syntax for lookahead assertions can be confusing, as it looks like the syntax for non-capturing groups, (?:...).
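To make the difference concrete, here is a minimal comparison. The 10px input is an arbitrary example chosen for this sketch: a non-capturing group consumes the text it matches, while a lookahead only checks it.

```javascript
// Non-capturing group (?:...): "px" is consumed and is part of the match.
const grouped = /[0-9]+(?:px)/.exec('10px');
console.log(grouped[0]); // "10px"

// Positive lookahead (?=...): "px" must follow, but is not part of the match.
const lookedAhead = /[0-9]+(?=px)/.exec('10px');
console.log(lookedAhead[0]); // "10"
```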
For example, appending the negative lookahead assertion (?!\.) to a pattern makes it match only if it is not followed by a decimal point. So
/[0-9]+(?!\.)\b/
does not match any part of 123.45deg.
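The behaviour described above can be checked with a short sketch. The variable name and the extra inputs are only there for illustration.

```javascript
// Digits, not followed by a decimal point, ending at a word boundary.
const wholeNumberOnly = /[0-9]+(?!\.)\b/;

console.log(wholeNumberOnly.test('123.45deg')); // false
console.log(wholeNumberOnly.exec('123 456')[0]); // "123"

// Caveat: without a unit, the fraction digits end at a word boundary,
// so the pattern still finds a match in a bare decimal number.
const bareDecimal = wholeNumberOnly.exec('123.45');
console.log(bareDecimal[0]); // "45"
```

The last case suggests that a lookahead alone may not be enough if bare decimal numbers can appear in the input.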
In a similar fashion, the positive lookahead assertion (?=\.) requires a decimal point after the pattern. Since the 2018 edition of the ECMAScript standard, there are also lookbehind assertions: (?<=\.) requires a decimal point before the pattern.
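As a rough sketch of both assertions, the following splits the digits of 123.45deg around the decimal point. The variable names are illustrative, and the lookbehind line assumes a JavaScript engine with ES2018 support.

```javascript
const value = '123.45deg';

// Positive lookahead: digits that are followed by a decimal point.
const integerPart = /[0-9]+(?=\.)/.exec(value);
console.log(integerPart[0]); // "123"

// Lookbehind (ES2018): digits that are preceded by a decimal point.
const fractionPart = /(?<=\.)[0-9]+/.exec(value);
console.log(fractionPart[0]); // "45"
```

In both cases the decimal point itself is checked but not included in the match.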