Regular Expressions (REs)
Tuesday 4th of December 2012 03:13:46 PM
Definitions, concepts, etcetera
- Full regular expressions are composed of two types of characters:
- Metacharacters (special characters such as ^, $, and |).
- Literals (everything that is not a metacharacter).
Literal text can be thought of as acting as the words and the metacharacters as the grammar. The words are combined with the grammar according to a set of rules to create an expression that communicates an idea.
- Start of line: ^ (caret)
- End of line: $ (dollar)
Both the ^ and the $ metacharacter anchor the match (the rest of the regular expression) to either the start or the end of the line, respectively. The caret and dollar are special in that they match a position in the line rather than any actual text characters themselves.
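As a small illustration of the two anchors, here is a sketch using Python's re module. The choice of Python and the sample lines are assumptions of this example, not part of the original notes.

```python
import re

# Sample lines, made up for this illustration.
lines = ["cat", "concat", "catalog", "the cat sat"]

# ^cat: "cat" must begin at the start of the line.
starts_with_cat = [s for s in lines if re.search(r"^cat", s)]

# cat$: "cat" must end exactly at the end of the line.
ends_with_cat = [s for s in lines if re.search(r"cat$", s)]

# ^cat$: the line consists of "cat" and nothing else.
only_cat = [s for s in lines if re.search(r"^cat$", s)]

# An anchor matches a position, not text, so the matched string is empty.
caret_text = re.search(r"^", "cat").group()

print(starts_with_cat)   # ['cat', 'catalog']
print(ends_with_cat)     # ['cat', 'concat']
print(only_cat)          # ['cat']
print(repr(caret_text))  # ''
```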
The regular-expression construct [...], usually called a character class, lets you list the characters you want to allow at that point in the match.
Notice how outside of a class, literal characters (like the "g" and "r" of "gr[ae]y") have an implied "and then" between them: match "g" and then match "r"... It's the complete opposite inside a character class. The contents of a class are a list of characters that can match at that point, so the implication is "or."
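The "and then" versus "or" distinction can be tried out with a short Python sketch. The test string is made up for this example.

```python
import re

# "g" and then "r" and then ("a" or "e") and then "y".
text = "grey gray griy graey"
class_matches = re.findall(r"gr[ae]y", text)

# The class stands for exactly one character, so "graey" does not match.
print(class_matches)  # ['grey', 'gray']
```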
Character-class metacharacter "-" (dash)
The dash character-class metacharacter indicates a range of characters. For example, "<h[123456]>" will match "<h1>", "<h2>", "<h3>", "<h4>", "<h5>", and "<h6>". "<h[1-6]>" is identical.
The order in which ranges are given doesn't matter.
You can freely combine ranges with literal characters: "[0-9A-Z_!.?]" matches a digit, uppercase letter, underscore, exclamation point, period, or a question mark.
A dash is a metacharacter only within a character class; anywhere else it matches the normal dash character. It is not even always a metacharacter within a character class. If it is the first character listed in the class, it can't possibly indicate a range, so it is not considered a metacharacter.
Consider character classes as their own mini language. The rules regarding which metacharacters are supported (and what they do) are completely different inside and outside of character classes.
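The rules for the dash can be checked with a minimal Python sketch. The sample strings are assumptions chosen for illustration only.

```python
import re

tags = ["<h1>", "<h6>", "<h7>", "<hr>"]

# A range and the spelled-out list describe the same class.
with_range = [t for t in tags if re.search(r"<h[1-6]>", t)]
spelled_out = [t for t in tags if re.search(r"<h[123456]>", t)]

# Ranges mix freely with literals, and the order of the ranges is irrelevant.
sample = "a1B_c!d.e?f-"
digits_first = re.findall(r"[0-9A-Z_!.?]", sample)
letters_first = re.findall(r"[A-Z0-9_!.?]", sample)

# A dash listed first cannot form a range, so it is a literal dash.
leading_dash = re.findall(r"[-0-9]", "call 555-1234")

# Outside a class, a dash is just a dash.
outside_class = re.search(r"a-z", "from a-z") is not None

print(with_range)     # ['<h1>', '<h6>']
print(digits_first)   # ['1', 'B', '_', '!', '.', '?']
print(leading_dash)   # ['5', '5', '5', '-', '1', '2', '3', '4']
print(outside_class)  # True
```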
Negated character classes
If you use "[^...]" instead of "[...]", the class matches any character that isn't listed. The leading ^ in the class negates the list, so rather than listing the characters you want to include in the class, you list the characters you don't want to be included.
^ is a line anchor outside a class, but a class metacharacter inside a class. Even there it is special only when it comes immediately after the class's opening bracket; anywhere else inside a class it is an ordinary character.
Furthermore, it is important to understand that a negated character class means "match a character that's not listed" and not "don't match what is listed."
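The difference between "match a character that's not listed" and "don't match what is listed" shows up when nothing follows the character in question. A small Python sketch, with a word list invented for this example:

```python
import re

words = ["Iraqi", "Iraq", "quit", "qa"]

# q[^u] needs a "q" followed by some character other than "u".
# "Iraq" fails: after its "q" there is no character at all for [^u] to match.
q_not_u = [w for w in words if re.search(r"q[^u]", w)]

# A caret that is not first in the class is an ordinary character.
caret_literal = re.findall(r"[a^]", "a^b")

print(q_not_u)        # ['Iraqi', 'qa']
print(caret_literal)  # ['a', '^']
```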
Matching any character with Dot
The metacharacter "." (usually called dot or point) is a shorthand for a character class that matches any character (in many tools, any character except a newline). It can be convenient when you want to have an "any character here" placeholder in your expression.
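A short Python sketch of dot as a placeholder. The date-like strings are made up for this example, and the newline behaviour shown is specific to Python's re module; other tools may differ.

```python
import re

candidates = ["03/19/76", "03-19-76", "03.19.76", "03x19y76", "031976"]

# Each dot stands for exactly one character, whatever it is.
dot_hits = [s for s in candidates if re.search(r"03.19.76", s)]

# In Python, dot skips newlines unless re.DOTALL is given.
plain = re.search(r"a.b", "a\nb") is not None
dotall = re.search(r"a.b", "a\nb", re.DOTALL) is not None

print(dot_hits)  # ['03/19/76', '03-19-76', '03.19.76', '03x19y76']
print(plain)     # False
print(dotall)    # True
```

Note that "03x19y76" matches too: a dot is often more permissive than intended, and a class such as [-./] is the tighter choice when only separators should be allowed.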
Alternation - matching any one of several subexpressions
A very convenient metacharacter is "|", which means "or." It allows you to combine multiple expressions into a single expression that matches any of the individual ones.
Be careful not to confuse the concept of alternation with that of a character class. A character class can match just a single character in the target text. With alternation, since each alternative can be a full-fledged regular expression in and of itself, each alternative can match an arbitrary amount of text. Character classes are almost like their own special mini-language (with their own ideas about metacharacters, for example), while alternation is part of the "main" regular expression language.
Parentheses can be used to constrain the alternation (parentheses are metacharacters as well). That is, alternation does not reach beyond parentheses.
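How far alternation reaches, with and without parentheses, can be seen in a minimal Python sketch. The header-like lines and addresses are hypothetical and exist only for this example.

```python
import re

text = "grey gray"

# Without parentheses the alternatives are "gra" and "ey".
unconstrained = [m.group(0) for m in re.finditer(r"gra|ey", text)]

# With parentheses the choice is limited to "a" or "e".
constrained = [m.group(0) for m in re.finditer(r"gr(a|e)y", text)]

lines = [
    "From: alice@example.com",
    "Subject: lunch",
    "To: bob@example.com",
    "Re: Subject line missing",
]

# Parentheses keep the anchor and the ": " attached to every alternative.
grouped = [s for s in lines if re.search(r"^(From|Subject|Date): ", s)]

# Here the alternatives are "^From", "Subject", and "Date: ".
ungrouped = [s for s in lines if re.search(r"^From|Subject|Date: ", s)]

print(unconstrained)  # ['ey', 'gra']
print(constrained)    # ['grey', 'gray']
print(grouped)        # ['From: alice@example.com', 'Subject: lunch']
print(len(ungrouped)) # 3
```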
Ignoring differences in capitalization