Debunking RegEx — Part 1

Cyberspecs
Aug 13, 2023

Catching up on the concepts I need in my hacking arsenal, I came across RegEx. There have been quite a few mentions regarding the complexities of RegEx in my community, and I couldn’t agree more.

I wouldn’t say I’m good at it even now, but we’ll learn as we go, right?

Example of RegEx Usage

Regular expressions (RegEx) are powerful tools for pattern matching and text manipulation, particularly when processing text files. One line of regex can often replace several dozen lines of code.

Regex is supported in most scripting languages (such as Perl, Python, PHP, and JavaScript), in general-purpose programming languages such as Java, and even in word processors such as Word for searching text. Getting started with regex may not be easy because of its geeky syntax, but it is certainly worth the investment of your time.

Let’s break down the meaning of each symbol commonly used in RegEx, along with code snippet examples:

1. Literal Characters: Literal characters in a RegEx match themselves. For example, the pattern hello matches the word "hello" in the text.

import re
text = "Say hello to the world!"
pattern = r"hello"
result = re.search(pattern, text)
print(result.group()) # Output: hello
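
One thing worth keeping in mind with the snippet above: re.search returns None when nothing matches, so calling .group() on the result would raise an AttributeError. Here is a minimal sketch of a guard, where find_first is a hypothetical helper name used only for illustration (it is not part of the re module):

```python
import re
from typing import Optional

def find_first(pattern: str, text: str) -> Optional[str]:
    """Return the first match of pattern in text, or None if there is none."""
    match = re.search(pattern, text)
    # re.search gives back None on failure, so check before calling .group()
    return match.group() if match else None

print(find_first(r"hello", "Say hello to the world!"))    # Output: hello
print(find_first(r"goodbye", "Say hello to the world!"))  # Output: None
```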

2. Dot (.) — Any Character: The dot matches any character except a new line. It’s often used to match a single character in a pattern.

import re  
text = "cat, bat, rat"
pattern = r".at"
matches = re.findall(pattern, text)
print(matches) # Output: ['cat', 'bat', 'rat']

3. Asterisk (*) — Zero or More: The asterisk matches zero or more occurrences of the preceding character or group.

import re  
text = "ab abb abbb"
pattern = r"ab*"
matches = re.findall(pattern, text)
print(matches) # Output: ['ab', 'abb', 'abbb']
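
The "zero" in "zero or more" and the "or group" part are easy to miss, so here is a small sketch of both. The sample strings are made up for illustration:

```python
import re

# "ab*" applies * to the "b" only, so a lone "a" (zero b's) also matches.
zero_or_more = re.findall(r"ab*", "a ab abb")
print(zero_or_more)  # Output: ['a', 'ab', 'abb']

# Wrapping "ab" in parentheses makes * repeat the whole pair instead.
group_repeat = re.fullmatch(r"(ab)*", "ababab")
char_repeat = re.fullmatch(r"ab*", "ababab")
print(group_repeat is not None)  # Output: True
print(char_repeat is not None)   # Output: False
```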

4. Plus (+) — One or More: The plus matches one or more occurrences of the preceding character or group.

import re  
text = "apple appple apppple"
pattern = r"app+le"
matches = re.findall(pattern, text)
print(matches) # Output: ['apple', 'appple', 'apppple']
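
To see how the plus differs from the asterisk, it may help to run both on the same text. This is just a sketch with a made-up sample string:

```python
import re

text = "ggle gogle google"

# + requires at least one "o" between the two g's
plus_matches = re.findall(r"go+gle", text)
# * also accepts zero "o"s
star_matches = re.findall(r"go*gle", text)

print(plus_matches)  # Output: ['gogle', 'google']
print(star_matches)  # Output: ['ggle', 'gogle', 'google']
```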

5. Question Mark (?) — Zero or One: The question mark matches zero or one occurrence of the preceding character or group.

import re  
text = "color colour"
pattern = r"colou?r"
matches = re.findall(pattern, text)
print(matches) # Output: ['color', 'colour']

6. Brackets [] — Character Set: Brackets define a character set, matching any one of the characters inside.

import re  
text = "gray grey"
pattern = r"gr[ae]y"
matches = re.findall(pattern, text)
print(matches) # Output: ['gray', 'grey']
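
Character sets can also hold ranges (with a hyphen) and can be negated (with a leading caret inside the brackets). A short sketch, using a sample string invented for this example:

```python
import re

text = "Room 42b, Floor 7"

# [0-9] is a range: any single digit
digits = re.findall(r"[0-9]", text)
print(digits)  # Output: ['4', '2', '7']

# [a-z]+ matches runs of lowercase letters only
lowercase_runs = re.findall(r"[a-z]+", text)
print(lowercase_runs)  # Output: ['oom', 'b', 'loor']

# A caret right after [ negates the set: anything except letters and spaces
non_letters = re.findall(r"[^a-zA-Z ]", text)
print(non_letters)  # Output: ['4', '2', ',', '7']
```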

7. Caret (^) — Start of Line: The caret at the beginning of a pattern asserts that the match must start at the beginning of the line.

import re  
text = "apple\nbanana"
pattern = r"^apple"
matches = re.findall(pattern, text, re.MULTILINE)
print(matches) # Output: ['apple']

8. Dollar ($) — End of Line: The dollar sign at the end of a pattern asserts that the match must occur at the end of the line.

import re  
text = "apple\nbanana"
pattern = r"banana$"
matches = re.findall(pattern, text, re.MULTILINE)
print(matches) # Output: ['banana']
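
Both anchor examples pass re.MULTILINE, and that flag matters. Without it, ^ and $ only anchor to the start and end of the whole string, not each line. A quick sketch comparing the two behaviours on the same text:

```python
import re

text = "apple\nbanana"

# Without re.MULTILINE, ^ only matches at the very start of the string
start_plain = re.findall(r"^banana", text)
start_multi = re.findall(r"^banana", text, re.MULTILINE)
print(start_plain)  # Output: []
print(start_multi)  # Output: ['banana']

# Likewise, $ only matches at the very end of the string by default
end_plain = re.findall(r"apple$", text)
end_multi = re.findall(r"apple$", text, re.MULTILINE)
print(end_plain)  # Output: []
print(end_multi)  # Output: ['apple']
```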

9. Pipe (|) — Alternation: The pipe allows you to define multiple alternatives in a pattern.

import re  
text = "cat dog"
pattern = r"cat|dog"
matches = re.findall(pattern, text)
print(matches) # Output: ['cat', 'dog']
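
The pipe splits the entire pattern unless you limit it with a group, which can be surprising. The sketch below uses (?:...), a non-capturing group, which is not covered in the list above; the sample text is made up:

```python
import re

text = "cat food, dog food"

# Here the alternatives are "cat" and "dog food", not "cat food" and "dog food"
ungrouped = re.findall(r"cat|dog food", text)
print(ungrouped)  # Output: ['cat', 'dog food']

# (?:...) groups without capturing, so the pipe only applies to cat/dog
grouped = re.findall(r"(?:cat|dog) food", text)
print(grouped)  # Output: ['cat food', 'dog food']
```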

10. Backslash (\) - Escape Character: The backslash escapes special characters, allowing you to match them literally.

import re  
text = "This costs $10."
pattern = r"\$10"
matches = re.findall(pattern, text)
print(matches) # Output: ['$10']
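
When the text you want to match literally comes from a variable, escaping by hand gets tedious. Python's re.escape does it for you. A minimal sketch, with a sample price string invented for the example:

```python
import re

price = "$10.00"
text = "Was $10.00, now $10x00"

# Unescaped, $ means "end of string" and . means "any character",
# so the raw price string finds nothing here
unescaped = re.findall(price, text)
print(unescaped)  # Output: []

# re.escape adds the backslashes so every character is matched literally
escaped = re.findall(re.escape(price), text)
print(escaped)  # Output: ['$10.00']
```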

11. Parentheses () — Grouping and Capturing: Parentheses are used to group parts of a pattern and also capture the matched text for further use.

import re  
text = "apple pie, banana split"
pattern = r"(\w+)\s(\w+)"
matches = re.findall(pattern, text)
print(matches) # Output: [('apple', 'pie'), ('banana', 'split')]
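
As for the "further use" part, captured groups can be read back from a match object by number or by name, and reused in a replacement string. A sketch under the assumption that you want to label the two words; the group names fruit and dish are made up for this example:

```python
import re

text = "apple pie, banana split"

# (?P<name>...) gives a group a name in addition to its number
match = re.search(r"(?P<fruit>\w+)\s(?P<dish>\w+)", text)
whole = match.group(0)        # the entire match
first = match.group(1)        # first group, by number
dish = match.group("dish")    # second group, by name
print(whole, first, dish)  # Output: apple pie apple pie

# \1 and \2 in the replacement refer back to the captured groups
swapped = re.sub(r"(\w+)\s(\w+)", r"\2 \1", text)
print(swapped)  # Output: pie apple, split banana
```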

12. Curly Braces {} - Quantifiers: Curly braces specify how many times the preceding character or group must occur: {n} for an exact count, or {m,n} for a range from m to n.

import re  
text = "hello helllo hellllo"
pattern = r"hel{2,4}o"
matches = re.findall(pattern, text)
print(matches) # Output: ['hello', 'helllo', 'hellllo']
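
The braces come in a few forms, and seeing them side by side on the same text may make the difference clearer. A short sketch:

```python
import re

text = "hello helllo hellllo"

# {3} means exactly three l's
exactly_three = re.findall(r"hel{3}o", text)
print(exactly_three)  # Output: ['helllo']

# {3,} means three or more
three_or_more = re.findall(r"hel{3,}o", text)
print(three_or_more)  # Output: ['helllo', 'hellllo']

# {2,3} means between two and three, inclusive
two_to_three = re.findall(r"hel{2,3}o", text)
print(two_to_three)  # Output: ['hello', 'helllo']
```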

13. Question Mark (?) — Non-greedy Quantifier: Placing a question mark after a quantifier makes it non-greedy, matching the shortest possible string.

import re  
text = "<p>first</p> <p>second</p>"
pattern = r"<p>(.*?)</p>"
matches = re.findall(pattern, text)
print(matches) # Output: ['first', 'second']
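
For comparison, here is what the same pattern does without the question mark. This sketch runs the greedy and non-greedy versions on the same text:

```python
import re

text = "<p>first</p> <p>second</p>"

# Greedy: .* grabs as much as it can, running through to the last </p>
greedy = re.findall(r"<p>(.*)</p>", text)
print(greedy)  # Output: ['first</p> <p>second']

# Non-greedy: .*? stops at the first </p> it can
lazy = re.findall(r"<p>(.*?)</p>", text)
print(lazy)  # Output: ['first', 'second']
```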

14. Backslash (\) - Special Sequences: Backslashes are also used for special sequences like \d (digit), \s (whitespace), \w (word character), etc.

import re  
text = "Age: 25, Height: 5'11\", Weight: 160 lbs"
age_pattern = r"Age: (\d+)"
height_pattern = r"Height: (\d+'\d+\")"
weight_pattern = r"Weight: (\d+) lbs"
age = re.search(age_pattern, text).group(1)
height = re.search(height_pattern, text).group(1)
weight = re.search(weight_pattern, text).group(1)
print(f"Age: {age}, Height: {height}, Weight: {weight}")
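
Each of these sequences on its own is simpler than the example above suggests. Here is a small sketch of each, plus \D, the uppercase form of \d, which matches the opposite (any non-digit). The sample strings are made up:

```python
import re

text = "Order 66 shipped"

numbers = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)        # runs of word characters
non_digits = re.findall(r"\D+", text)   # runs of anything that is not a digit
print(numbers)     # Output: ['66']
print(words)       # Output: ['Order', '66', 'shipped']
print(non_digits)  # Output: ['Order ', ' shipped']

# \s+ matches any run of whitespace: spaces, tabs, newlines
pieces = re.split(r"\s+", "a  b\tc\n d")
print(pieces)  # Output: ['a', 'b', 'c', 'd']
```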

15. Period (.) — Any Character Except Newline: The period matches any character except a newline. Adding the re.DOTALL flag allows it to match newlines too.

import re  
text = "Hello\nWorld"
pattern = r".+"
matches = re.findall(pattern, text, re.DOTALL)
print(matches) # Output: ['Hello\nWorld']

16. Word Boundaries (\b) — Anchors: The \b asserts a word boundary, matching the position between a word character and a non-word character.

import re
text = "The cat's meow is loud."
pattern = r"\bmeow\b"
matches = re.findall(pattern, text)
print(matches) # Output: ['meow']
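
The effect of \b is easier to see on text where the word also appears inside other words. A sketch with a made-up sample string:

```python
import re

text = "cat concatenate cats"

# Without boundaries, "cat" is also found inside "concatenate" and "cats"
anywhere = re.findall(r"cat", text)
print(anywhere)  # Output: ['cat', 'cat', 'cat']

# \b on both sides keeps only the standalone word
whole_word = re.findall(r"\bcat\b", text)
print(whole_word)  # Output: ['cat']

# \b on the left only: words that start with "cat"
starts_with = re.findall(r"\bcat\w*", text)
print(starts_with)  # Output: ['cat', 'cats']
```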

17. Lookahead (?=) and Lookbehind (?<=): Lookaheads and lookbehinds match text based on what follows (lookahead) or precedes (lookbehind) it, without including that surrounding text in the match.

import re
text = "apple pie, banana split"
pattern = r"(?<=apple\s)(\w+)"
matches = re.findall(pattern, text)
print(matches) # Output: ['pie']
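
The snippet above only demonstrates a lookbehind, so here is a matching sketch for the lookahead. The currency string is made up for illustration:

```python
import re

text = "apple pie, banana split"

# (?=\ssplit) checks that " split" follows, but leaves it out of the match
before_split = re.findall(r"\w+(?=\ssplit)", text)
print(before_split)  # Output: ['banana']

# Same idea on numbers: keep only the amounts followed by " USD"
usd_amounts = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR, 30 USD")
print(usd_amounts)  # Output: ['10', '30']
```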

These are just a few common RegEx symbols and their meanings.

If this intrigued you, stay tuned for the upcoming parts of the RegEx series. Remember to experiment with different patterns and test them with various inputs to deepen your understanding. Do try solving the question at the beginning of this article.
