Debunking RegEx — Part 2b

Cyberspecs
Sep 28, 2023

Hello there,

Hope you enjoyed the previous part and tried out a few RegEx yourself. In this part, we’ll go through some more characters and their meanings. Don’t scratch your head just yet!

TABLE OF CONTENTS:

18. Negative Lookahead (?!)
19. Character Escapes (\n, \t, \r, etc.)
20. Match Beginning and End of String (^ and $)
21. Named Capturing Groups (?P<name>)
22. Case Insensitivity (re.IGNORECASE or re.I)
23. Substitution with re.sub()
24. Using re.compile()
25. Verbose Mode (re.VERBOSE)
26. Whitespace Management with re.VERBOSE
27. Matching Repeated Patterns with Greedy and Lazy Quantifiers
28. Matching Anything Except a Pattern with Negation (^)
29. Using Raw Strings (r"")
30. Matching URLs with Complex Patterns

Let’s get started!

18. Negative Lookahead (?!): This powerful feature can significantly enhance your regex toolkit by allowing you to specify what you don’t want to match, rather than focusing solely on what you do want to match.

Negative lookahead is denoted by (?!...), where the ... represents the pattern you wish to avoid matching.

import re
text = "apple pie, banana split"
pattern = r"\b\w+\b(?!\s)"
matches = re.findall(pattern, text)
print(matches)
# Output: ['pie', 'split']
  1. \b: This is a word boundary anchor, asserting a position between a word character (as defined by \w) and a non-word character. It ensures that we match whole words and not just parts of words.
  2. \w+: This matches one or more word characters (letters, digits, or underscores). It represents the main content of the word.
  3. \b: A second word boundary pins the end of the word, so the engine cannot backtrack and settle for a shorter piece such as "appl".
  4. (?!\s): This is a negative lookahead assertion. It asserts that what follows the current position is not a whitespace character. In other words, it ensures that the matched word is not immediately followed by a space.

The output of the code is ['pie', 'split']. "apple" and "banana" are rejected because each is immediately followed by a space. "pie" is kept because the character after it is a comma, not whitespace, and the comma itself is not part of the match because \w does not match punctuation. "split" is kept because nothing follows it. Without the second \b, the engine would backtrack by one character to satisfy the lookahead and return ['appl', 'pie', 'banan', 'split'].

For example, the pattern \b[a-z]+\b(?! [A-Z]) captures lowercase words that are not followed by a space and an uppercase letter.
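Here is a minimal sketch of that pattern in action. The sample sentence is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only to illustrate the lookahead.
text = "red Apple and green pears"
pattern = r"\b[a-z]+\b(?! [A-Z])"

matches = re.findall(pattern, text)
print(matches)  # ['and', 'green', 'pears']
```

"red" is dropped because it is followed by " A", and "Apple" never matches because it starts with an uppercase letter.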

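When processing code, you might want to skip tokens that sit right before a comment marker. A negative lookahead like \b\w+\b(?!\s*//) rejects any word that is immediately followed by optional whitespace and "//". Keep in mind that a lookahead only inspects what comes after the match, so it cannot exclude names that appear after "//" inside the comment itself; for that, remove the comment text first.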
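The sketch below shows what the comment lookahead really filters out. A lookahead only looks forward, so it skips the token directly before "//" and nothing else. The sample line and the "split first" approach are assumptions for illustration, and the split is naive: it would also cut at a "//" inside a string literal.

```python
import re

# Hypothetical line of C-style source code, used only for illustration.
line = "total = count + 1 // old_total unused"

# The lookahead alone only skips the token directly before "//".
lookahead_only = re.findall(r"\b\w+\b(?!\s*//)", line)
print(lookahead_only)  # ['total', 'count', 'old_total', 'unused']

# One simple alternative: cut the line at "//" first, then collect identifiers.
code_part = line.split("//", 1)[0]
identifiers = re.findall(r"\b[A-Za-z_]\w*\b", code_part)
print(identifiers)  # ['total', 'count']
```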

19. Character Escapes (\n, \t, \r, etc.): Character escapes are written as a backslash (\) followed by a character or code that carries a special meaning. For example, \n represents a newline character, \t represents a tab character, and \r represents a carriage return character.

import re
text = "Line 1\nLine 2"
pattern = r"Line 1\nLine 2"
matches = re.findall(pattern, text)
print(matches) # Output: ['Line 1\nLine 2']

Backslash (\\): To match an actual backslash, you need to escape it with another backslash.
Ex: The pattern http:\\\\example.com would match “http:\\example.com”. Each \\ in the pattern matches one literal backslash.

Escape Sequence in Text: Character escapes are not exclusive to regex patterns. They’re also used in plain text to represent special characters.
In a text string, "Hello\nWorld" would display as:

Hello
World

Escaping Special Characters: Sometimes, you want to match characters that have special meanings in regex. Escaping them ensures they’re treated as literal characters.
Ex: The pattern \d+\. \w+ would match one or more digits, followed by a dot and a space, and then one or more word characters.

Matching Dollar Sign ($): To match a literal dollar sign, you need to escape it.
Ex: The pattern Total: \$\d+ would match “Total: $100”.

Matching Parentheses (( and )): Parentheses have special meanings in regex, so they need to be escaped to match them literally.
Ex: The pattern \(not so\) special would match “(not so) special”.
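The escaped patterns above can be tried out as follows. This is a small sketch; the sentence passed to re.escape() is a made-up example. re.escape() is the standard library helper that escapes special characters for you when the text to match comes from a variable.

```python
import re

# Escaped "$" and parentheses are matched literally.
price = re.findall(r"Total: \$\d+", "Total: $100")
print(price)  # ['Total: $100']

phrase = re.search(r"\(not so\) special", "(not so) special")
print(phrase.group())  # (not so) special

# re.escape() builds an escaped pattern from arbitrary literal text.
literal = "cost (approx.) $5"  # hypothetical text containing special characters
escaped_match = re.search(re.escape(literal), "The cost (approx.) $5 is fine")
print(escaped_match.group())  # cost (approx.) $5
```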

20. Match Beginning and End of String (^ and $): Without multiline mode, ^ matches at the beginning of the entire string and $ matches at the end of the entire string (or just before a trailing newline at the end).

import re  
text = "Hello world!"
pattern = r"^Hello"
matches = re.findall(pattern, text)
print(matches) # Output: ['Hello']
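To see what "without multiline mode" means in practice, here is a short sketch that runs the same anchors with and without the re.MULTILINE flag. The two-line sample text is an assumption for illustration.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ and $ refer to the whole string.
start_default = re.findall(r"^\w+", text)
end_default = re.findall(r"\w+$", text)
print(start_default, end_default)  # ['first'] ['line']

# re.MULTILINE: ^ and $ also match at the start and end of each line.
start_multi = re.findall(r"^\w+", text, re.MULTILINE)
end_multi = re.findall(r"\w+$", text, re.MULTILINE)
print(start_multi, end_multi)  # ['first', 'second'] ['line', 'line']
```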

Matching File Extensions: \.txt$
This pattern matches the string “.txt” only if it appears at the end of the text. It can be used to identify file names with the “.txt” extension.

Matching Email Domains: @\w+\.\w+$
This pattern matches an email address domain (e.g., “@example.com”) only if it occurs at the end of the email address. It ensures that the domain is the final part of the address.

Matching URLs: https?://[\w./-]+/?$
This pattern matches a URL that starts with “http://” or “https://”, followed by a sequence of word characters, dots, slashes, and hyphens. The optional slash (/?) at the end allows for URLs both with and without a trailing slash.
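The three anchored patterns above can be checked with a quick sketch like this one. The file names, email address, and URLs are invented sample data.

```python
import re

# Hypothetical sample inputs.
files = ["notes.txt", "notes.txt.bak", "report.pdf"]
txt_files = [name for name in files if re.search(r"\.txt$", name)]
print(txt_files)  # ['notes.txt']

domain = re.search(r"@\w+\.\w+$", "user@example.com")
print(domain.group())  # @example.com

url_pattern = r"https?://[\w./-]+/?$"
urls = ["https://example.com/docs/", "http://example.com", "ftp://example.com"]
http_urls = [u for u in urls if re.search(url_pattern, u)]
print(http_urls)  # ['https://example.com/docs/', 'http://example.com']
```

Note that "notes.txt.bak" is rejected because ".txt" is not at the very end of the name.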

21. Named Capturing Groups (?P<name>): Named capturing groups allow you to give a name to a captured group for easier access.

import re  
text = "Name: John, Age: 30"
pattern = r"Name: (?P<name>\w+), Age: (?P<age>\d+)"
match = re.search(pattern, text)
if match:
    print(f"Name: {match.group('name')}, Age: {match.group('age')}")

The regular expression captures the name “John” and the age “30” using the named capturing groups.

  • When re.search() is used, it searches the text for a match according to the pattern.
  • The if match: statement checks if a match was found.
  • If there’s a match, the code prints out the extracted name and age using match.group('name') and match.group('age').

So, the output of the code is: “Name: John, Age: 30”
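If you want all named groups at once, the match object also offers groupdict(). The sketch below reuses the same pattern and text; converting the age to int is just one possible follow-up step.

```python
import re

pattern = r"Name: (?P<name>\w+), Age: (?P<age>\d+)"
match = re.search(pattern, "Name: John, Age: 30")

if match:
    # groupdict() returns every named group as a dictionary.
    info = match.groupdict()
    print(info)  # {'name': 'John', 'age': '30'}

    # Named groups are still numbered, so both styles of access work.
    print(match.group(1), match.group("name"))  # John John

    # Captured values are always strings; convert them when needed.
    age = int(info["age"])
    print(age + 1)  # 31
```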

22. Case Insensitivity (re.IGNORECASE or re.I): The re.IGNORECASE flag allows matching regardless of character case.

import re  
text = "Hello World"
pattern = r"hello"
matches = re.findall(pattern, text, re.IGNORECASE)
print(matches) # Output: ['Hello']
  1. "hello": This is the literal string "hello" that the regular expression is looking for. However, note that the actual text in the input is "Hello", and the regular expression is case-sensitive by default.
  2. re.IGNORECASE: This flag is passed as the third argument (flags) to re.findall(). It instructs the regular expression engine to perform case-insensitive matching. This means that the pattern will match both uppercase and lowercase versions of the letters.
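As a small sketch, the flag can be given in its short form re.I or written inline at the start of the pattern as (?i). The sample text is made up for illustration.

```python
import re

text = "Hello HELLO hello"

case_sensitive = re.findall(r"hello", text)
with_flag = re.findall(r"hello", text, re.I)   # re.I is short for re.IGNORECASE
inline_flag = re.findall(r"(?i)hello", text)   # same effect, set inside the pattern

print(case_sensitive)  # ['hello']
print(with_flag)       # ['Hello', 'HELLO', 'hello']
print(inline_flag)     # ['Hello', 'HELLO', 'hello']
```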

23. Substitution with re.sub(): The re.sub() function allows replacing matched patterns with other text.

import re  
text = "Colors: red, green, blue"
pattern = r"red"
new_text = re.sub(pattern, "orange", text)
print(new_text) # Output: "Colors: orange, green, blue"
  1. The regular expression r"red" matches the word "red" in the input text.
  2. The re.sub() function replaces all occurrences of "red" with "orange" in the input text.
  3. After the substitution, the new text becomes “Colors: orange, green, blue”.

The output of the code is "Colors: orange, green, blue". Note that re.sub() returns a new string with every occurrence of "red" replaced; the original text is left unchanged.
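re.sub() can do more than a plain swap. The sketch below shows three common variations: limiting the number of replacements with count, reusing captured groups in the replacement, and passing a function as the replacement. The sample strings are assumptions for illustration.

```python
import re

# Replace only the first occurrence.
first_only = re.sub(r"red", "orange", "red, red, red", count=1)
print(first_only)  # orange, red, red

# Reuse captured groups in the replacement with \1, \2, ...
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-20")
print(swapped)  # 20-10

# Use a function to compute each replacement from the match object.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 5 pears")
print(doubled)  # 6 apples and 10 pears
```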

24. Using re.compile(): You can precompile a RegEx pattern with re.compile() and reuse the resulting pattern object, which can improve performance and readability when the same pattern is used repeatedly.

import re
pattern = re.compile(r"\b\w+\b")
text = "This is a test."
matches = pattern.findall(text)
print(matches)
# Output: ['This', 'is', 'a', 'test']

r"\b\w+\b" is used to find all individual words within the input text "This is a test." The re.compile() function is used to precompile the regular expression pattern for efficiency.

r"\b\w+\b": This regular expression pattern consists of the following components:

  • \b: This is a word boundary anchor. It matches the position between a word character (as defined by \w) and a non-word character. It ensures that we match whole words and not just parts of words.
  • \w+: This matches one or more word characters (letters, digits, or underscores). It represents the main content of the word.
  • \b: Another word boundary anchor to ensure the end of the word.

Let’s apply the regular expression pattern to the input text “This is a test.”:

  1. The regular expression \b\w+\b finds all individual words in the input text, considering word boundaries.
  2. The words “This”, “is”, “a”, and “test” are recognized as separate words by the pattern.
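The benefit of compiling shows up when one pattern is applied to many inputs. Below is a minimal sketch with made-up sample lines. Keep in mind that the re module also caches recently used patterns internally, so the speed gain for small scripts may be modest.

```python
import re

# Compile once, reuse for every line.
word_re = re.compile(r"\b\w+\b")

lines = ["This is a test.", "Another line here"]  # hypothetical input
word_counts = [len(word_re.findall(line)) for line in lines]
print(word_counts)  # [4, 3]

# A compiled pattern offers the same methods as the module-level functions.
first_word = word_re.search(lines[0]).group()
print(first_word)  # This
```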

25. Verbose Mode (re.VERBOSE): Verbose mode allows you to write RegEx patterns with comments and whitespace for better readability.

import re
pattern = re.compile(r"""
    \b   # Word boundary
    \w+  # Match one or more word characters
    \b   # Word boundary
""", re.VERBOSE)
text = "This is a test."
matches = pattern.findall(text)
print(matches)
# Output: ['This', 'is', 'a', 'test']

26. Whitespace Management with re.VERBOSE: In verbose mode, you can add whitespace and comments to your patterns for better organization and readability.

import re
pattern = re.compile(r"""
    \b\d{3}   # Match three digits
    [-.\s]?   # Match optional separator
    \d{2}     # Match two digits
    [-.\s]?   # Match optional separator
    \d{4}     # Match four digits
    \b        # Word boundary
""", re.VERBOSE)
text = "Phone numbers: 123-45-6789 and 987.65.4321"
matches = pattern.findall(text)
print(matches) # Output: ['123-45-6789', '987.65.4321']
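One thing to watch out for: because verbose mode ignores whitespace in the pattern, a space you actually want to match has to be written explicitly, for example inside a character class as [ ] or escaped with a backslash. A small sketch with made-up strings:

```python
import re

# In verbose mode the space in the pattern is ignored,
# so this pattern is equivalent to "helloworld".
spaced = re.compile(r"hello world", re.VERBOSE)
ignored_space = spaced.search("hello world")
joined = spaced.search("helloworld")
print(ignored_space)   # None
print(joined.group())  # helloworld

# Write the space explicitly to match it.
explicit = re.compile(r"""
    hello
    [ ]     # a literal space, written inside a character class
    world
""", re.VERBOSE)
with_space = explicit.search("hello world")
print(with_space.group())  # hello world
```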

27. Matching Repeated Patterns with Greedy and Lazy Quantifiers: Greedy quantifiers (default) match as much as possible, while lazy quantifiers match as little as possible.

import re
text = "<p>Hello</p><p>World</p>"
greedy_pattern = r"<p>(.*)</p>"
lazy_pattern = r"<p>(.*?)</p>"
greedy_matches = re.findall(greedy_pattern, text)
lazy_matches = re.findall(lazy_pattern, text)
print("Greedy:", greedy_matches) # Output: Greedy: ['Hello</p><p>World']
print("Lazy:", lazy_matches) # Output: Lazy: ['Hello', 'World']
  1. text = "<p>Hello</p><p>World</p>": This is the input text containing two pairs of <p> and </p> tags with content inside.
  2. greedy_pattern = r"<p>(.*)</p>": This pattern uses the greedy quantifier (.*) to match as much content as possible between a <p> and a </p> tag.
  3. lazy_pattern = r"<p>(.*?)</p>": This pattern is the same except for the added ?, which turns the quantifier into a lazy one (.*?) that matches as little content as possible.

Now, let’s apply both patterns to the input text:

greedy_pattern:

  • The greedy pattern matches the first <p> tag, then captures everything until the last </p> tag.
  • So, it captures the entire content between the first <p> and the last </p> tag: 'Hello</p><p>World'.

lazy_pattern:

  • The lazy pattern matches the first <p> tag, captures its content, then stops as soon as it encounters the first </p> tag.
  • It captures 'Hello' from the first pair of <p> and </p> tags, and then similarly captures 'World' from the second pair.
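Laziness is not limited to .*?; any quantifier can be made lazy by adding ?. The sketch below also shows a common alternative to the lazy pattern: a negated character class that simply cannot run past the next "<". The digit string is a made-up example.

```python
import re

text = "<p>Hello</p><p>World</p>"

# Alternative to .*?: match anything that is not "<".
negated = re.findall(r"<p>([^<]*)</p>", text)
print(negated)  # ['Hello', 'World']

# Greedy vs lazy with a counted quantifier.
digits = "123456"
greedy_chunks = re.findall(r"\d{2,4}", digits)
lazy_chunks = re.findall(r"\d{2,4}?", digits)
print(greedy_chunks)  # ['1234', '56']
print(lazy_chunks)    # ['12', '34', '56']
```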

28. Matching Anything Except a Pattern with Negation (^): Using the caret (^) inside square brackets negates the character set, matching anything except the specified characters.

import re  
text = "apples and oranges"
pattern = r"[^aeiou\s]+"
matches = re.findall(pattern, text)
print(matches) # Output: ['ppl', 's', 'nd', 'r', 'ng', 's']
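The caret means two different things depending on where it appears: inside square brackets it negates the set, while outside it anchors the match to the start of the string. A short sketch with an invented sample text:

```python
import re

text = "room 42, floor 7"

# Inside brackets: everything that is neither a digit nor whitespace.
non_digits = re.findall(r"[^\d\s]+", text)
print(non_digits)  # ['room', ',', 'floor']

# Outside brackets: ^ anchors to the start of the string.
anchored = re.findall(r"^r", text)
print(anchored)  # ['r']

# Inside brackets again: runs of characters that are not "r".
not_r = re.findall(r"[^r]+", "rare")
print(not_r)  # ['a', 'e']
```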

29. Using Raw Strings (r""): Raw strings are Python string literals written with the prefix r before the opening quote (e.g., r"some_string"). Python leaves the backslashes (\) in them as literal characters instead of interpreting them as the start of an escape sequence.

In other words, when you use a raw string, backslashes are not treated as escape characters, and they are included in the string exactly as they appear.

Raw strings are particularly useful in cases where you want to represent strings containing backslashes without triggering any special escape sequences.

import re  
text = "Escape characters: \\n and \\t"
pattern = r"\\[nt]"
matches = re.findall(pattern, text)
print(matches) # Output: ['\\n', '\\t']
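The difference matters most for regex escapes that also mean something to Python itself. In a normal string, "\b" is a backspace character, so the word boundary silently disappears. A quick sketch:

```python
import re

# A normal string turns \n into one newline character; a raw string keeps both characters.
print(len("\n"), len(r"\n"))  # 1 2

# Without r, "\b" is a backspace character, not a word boundary.
plain = re.findall("\bcat\b", "a cat")
raw = re.findall(r"\bcat\b", "a cat")
print(plain)  # []
print(raw)    # ['cat']
```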

30. Matching URLs with Complex Patterns: RegEx can be used to match complex patterns like URLs, utilizing various symbols for different parts.

import re  
text = "Visit my website at https://www.example.com"
pattern = r"https?://(?:www\.)?([\w.-]+)"
matches = re.findall(pattern, text)
print(matches) # Output: ['example.com']
  • https?://: This matches "http://" or "https://", where the s? allows for both "http" and "https".
  • (?:www\.)?: This optional non-capturing group (?: ... )? matches the "www." portion of the URL if it exists, without capturing it. That is why "www." does not appear in the output.
  • ([\w.-]+): This capturing group (...) captures the domain name. [\w.-]+ matches a sequence of word characters, dots, and hyphens that make up the domain name.
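Building on the pattern above, here is a sketch that pulls out the scheme, host, and path with named groups. The group names and the longer sample URL are assumptions for illustration, and the pattern is deliberately simple: it does not handle ports, query strings, or punctuation that directly follows a URL in a sentence.

```python
import re

# Hypothetical extension of the pattern above with named groups.
url_re = re.compile(
    r"(?P<scheme>https?)://(?:www\.)?(?P<host>[\w.-]+)(?P<path>/\S*)?"
)

text = "Visit my website at https://www.example.com/docs/index.html today"
match = url_re.search(text)

if match:
    parts = match.groupdict()
    print(parts["scheme"])  # https
    print(parts["host"])    # example.com
    print(parts["path"])    # /docs/index.html
```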

If you were able to follow through until here, do try this platform out for RegEx practice: https://regex101.com/

Link for part 1: https://medium.com/@nainasharma899103/debunking-regex-part-1-982171620543

Link for part 2a: https://medium.com/@nainasharma899103/debunking-regex-part-2a-d328cc515e53

See you in the next one!
