Debunking RegEx — Part 2a

6 min read · Aug 28, 2023

Hope you enjoyed the 1st part and practiced a few RegEx patterns yourself. If not, this part might help!

The following is a compilation of real-world examples where one can use RegEx to grab complex strings out of a large amount of not-so-garbage data. We’ll go step by step and explain why each pattern is built the way it is to match a particular string.

Towards the end, there are a few bonus steps one can follow to craft their own RegEx.

1. Matching Email Addresses:

Pattern: \b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b

Let’s break it down:

\b: Ensures a word boundary.
[\w.%+-]+: Captures the username part of the email address.
@: Matches the "@" symbol.
[\w.-]+: Captures the domain name (without the top-level domain).
\.: Matches the dot separating the domain and top-level domain.
[A-Za-z]{2,}: Captures the top-level domain.
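
To see the email pattern in action, here is a minimal Python sketch. The pattern is the one from this section; the sample sentence and the variable names are made up for illustration.

```python
import re

# Pattern copied from the section above; the sample sentence is invented.
EMAIL_RE = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")

sample = "Contact alice.smith@example.com or bob+news@mail.example.org for details."

# findall returns every non-overlapping match as a string.
emails = EMAIL_RE.findall(sample)
print(emails)
```

Treat this as a practical filter rather than a full address validator: the pattern is deliberately simple and will accept some strings that are not deliverable addresses.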

2. Extracting Dates:

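Pattern: \d{2}/\d{2}/\d{4}

This regex matches dates written in “DD/MM/YYYY” format. It checks only the digit layout, so MM/DD/YYYY dates and impossible values such as 99/99/9999 match as well.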

\d{2}: Matches two digits.
/: Matches the forward slash.
\d{4}: Matches four digits.
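
Since the pattern only checks the shape of the string, one possible follow-up is to let Python's datetime module confirm that each match is a real calendar date. This is a sketch under that assumption; the helper name extract_valid_dates and the sample text are hypothetical.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")

def extract_valid_dates(text):
    """Return matches that are also real calendar dates in DD/MM/YYYY order."""
    valid = []
    for candidate in DATE_RE.findall(text):
        try:
            # strptime raises ValueError for dates such as 31/02/2024.
            datetime.strptime(candidate, "%d/%m/%Y")
        except ValueError:
            continue
        valid.append(candidate)
    return valid

sample = "Invoice dated 28/08/2023, due 31/02/2024, paid 05/09/2023."
print(DATE_RE.findall(sample))      # everything with the right shape
print(extract_valid_dates(sample))  # only the real dates
```

The regex does the cheap first pass over the text, and the stricter check runs only on the handful of candidates it finds.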

3. Matching Hashtags:

Pattern: #\w+

This regex captures hashtags from social media content.

#: Matches the "#" symbol.
\w+: Captures one or more word characters (letters, digits, underscores).

4. Parsing URLs:

Pattern: https?://[^\s/$.?#]+.[^\s]*

This regex extracts http and https URLs from running text.

https?: Matches "http" or "https".
://: Matches the colon and two forward slashes.
[^\s/$.?#]+: Captures characters in the domain.
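.: Matches any single character (the dot is unescaped here); in a typical URL this ends up being the first dot in the domain.
[^\s]*: Matches the rest of the URL, up to the next whitespace character.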
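
A quick Python sketch shows what the URL pattern picks up in practice. The sample sentence is invented, and the trailing-punctuation cleanup is my own addition, not part of the pattern.

```python
import re

URL_RE = re.compile(r"https?://[^\s/$.?#]+.[^\s]*")

sample = "Docs at https://example.com/guide?page=2 and http://test.org, plus ftp://skip.me"

urls = URL_RE.findall(sample)
print(urls)

# [^\s]* stops only at whitespace, so punctuation glued to a URL is included.
# Stripping common trailing punctuation is one simple way to tidy the results.
cleaned = [u.rstrip(".,;:!?)") for u in urls]
print(cleaned)
```

Notice that the second match keeps the comma that follows it in the sentence, which is why the cleanup step is there. The ftp link is skipped because the pattern only allows http and https.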

5. Capturing Phone Numbers:

Pattern: \d{3}-\d{3}-\d{4}

This regex matches US phone numbers in “###-###-####” format.

\d{3}: Matches three digits.
-: Matches the hyphen.
\d{4}: Matches four digits.

6. Identifying IP Addresses:

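Pattern: \b(?:\d{1,3}\.){3}\d{1,3}\b

This regex captures IPv4-style addresses from logs. It does not check that each number is 255 or less, so a string like 999.999.999.999 also matches.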

\b: Matches a word boundary.
(?: ...): Non-capturing group.
\d{1,3}\.: Matches one to three digits followed by a dot.
{3}: Repeats the group three times.
\d{1,3}: Matches one to three digits.
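
Here is a small Python sketch that pairs the pattern with a range check on each octet. The function name find_ipv4 and the log line are hypothetical; only the pattern comes from this section.

```python
import re

IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_ipv4(text):
    """Return matches whose four octets all fall in the range 0..255."""
    return [
        match
        for match in IPV4_RE.findall(text)
        if all(int(octet) <= 255 for octet in match.split("."))
    ]

log_line = "Accepted 192.168.1.10 from gateway 10.0.0.1; rejected 999.300.1.2"
print(IPV4_RE.findall(log_line))  # shape only
print(find_ipv4(log_line))        # shape plus range check
```

Because the group is non-capturing, findall returns the whole address rather than just the last repeated piece.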

Chart depicting different special and literal characters

7. Matching File Extensions:

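Pattern: \.\w+

This regex extracts file extensions from filenames. Because it matches every dot followed by word characters, a name like archive.tar.gz yields both .tar and .gz.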

\.: Matches the dot character.
\w+: Matches one or more word characters.

8. Validating Credit Card Numbers:

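Pattern: \b(?:\d[ -]*?){13,16}\b

This regex finds candidate credit card numbers: 13 to 16 digits, optionally separated by spaces or hyphens. It checks the shape only and does not verify that the number is a valid card number.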

\b: Matches a word boundary.
(?: ... ): Non-capturing group.
\d: Matches a digit.
[ -]*?: Matches zero or more spaces or hyphens.
{13,16}: Matches 13 to 16 repetitions.
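
To go from "looks like a card number" to "passes a checksum", one common companion is the Luhn algorithm. The sketch below is an assumption on my part about how you might combine the two; luhn_ok and the sample text are not from the original pattern.

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

def luhn_ok(number):
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

sample = "Card: 4111 1111 1111 1111 expires soon; ref 1234-5678-9012-3456"

candidates = CARD_RE.findall(sample)
valid = [c for c in candidates if luhn_ok(c)]
print(candidates)
print(valid)
```

Both strings have the right shape, but only the first one (a widely used test number) passes the checksum.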

9. Extracting HTML Tags:

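Pattern: <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>

This regex captures content within HTML tags. As written, [A-Z] matches only uppercase tag names, so use a case-insensitive flag to match lowercase tags such as <p>.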

<: Matches the opening angle bracket.
([A-Z][A-Z0-9]*): Captures the tag name.
\b: Matches a word boundary.
[^>]*: Matches any character except ">".
>: Matches the closing angle bracket.
(.*?): Captures the tag content.
<\/\1>: Matches the closing tag using captured tag name.
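
The case-sensitivity point is easy to miss, so here is a short Python sketch that runs the same pattern with and without re.IGNORECASE. The HTML snippet is invented for the demo.

```python
import re

PATTERN = r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>"
html = '<p class="intro">Hello</p><h1>Title</h1>'

# Without a flag, [A-Z] never matches the lowercase tag names.
case_sensitive = re.findall(PATTERN, html)

# With IGNORECASE, each match yields a (tag name, content) tuple.
case_insensitive = re.findall(PATTERN, html, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

This works for small, flat snippets like the one above. For nested or messy markup, a real HTML parser is usually the safer tool.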

10. Identifying Hex Color Codes:

Pattern: #([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})

This regex matches and captures hex color codes.

#: Matches the "#" symbol.
(...): Capturing group for either 6 or 3 hex characters.
[A-Fa-f0-9]{6}: Matches six hex characters.
|: Alternation for the alternative pattern.
[A-Fa-f0-9]{3}: Matches three hex characters.
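
The order of the two alternatives matters here, and a few lines of Python make that visible. The normalize helper and the CSS string are hypothetical additions for the demo.

```python
import re

HEX_RE = re.compile(r"#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})")

def normalize(code):
    """Expand 3-digit shorthand to 6 digits and lowercase the result."""
    if len(code) == 3:
        code = "".join(ch * 2 for ch in code)
    return "#" + code.lower()

css = "color: #FF5733; background: #fff; border-color: #1a2b3c;"
colors = [normalize(code) for code in HEX_RE.findall(css)]
print(colors)

# Swapping the alternatives shows why the 6-digit branch is listed first:
# the 3-digit branch succeeds early and the match is cut short.
swapped = re.findall(r"#([A-Fa-f0-9]{3}|[A-Fa-f0-9]{6})", "#FF5733")
print(swapped)
```

Regex alternation tries the branches left to right and takes the first one that works, so the longer option has to come first.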

11. Extracting Domain Names from URLs:

Pattern: https?://(?:www\.)?([\w.-]+)

This regex captures domain names from URLs.

https?: Matches "http" or "https".
://: Matches the colon and two forward slashes.
(?: ...): Non-capturing group.
www\.: Matches "www." if present.
([\w.-]+): Captures the domain name.
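
A short Python sketch of the domain pattern, using group(1) to pull out just the captured part. The URLs are made up for illustration.

```python
import re

DOMAIN_RE = re.compile(r"https?://(?:www\.)?([\w.-]+)")

urls = [
    "https://www.example.com/path",
    "http://blog.example.org:8080/post",
    "https://sub-domain.example.co.uk",
]

# group(1) is the capturing group; the optional "www." is matched but not kept.
domains = [m.group(1) for m in map(DOMAIN_RE.search, urls) if m]
print(domains)
```

The character class stops at the first slash or colon, which is why the path and the port number are left out.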

12. Extracting Mentioned Users:

Pattern: @([A-Za-z0-9_]+)

This regex captures usernames mentioned in text.

@: Matches the "@" symbol.
(...): Capturing group for the username.
[A-Za-z0-9_]+: Matches one or more letters, digits, or underscores.

13. Matching Function Calls:

Pattern: (\w+)\s*\(

This regex captures function names in code.

(...): Capturing group for function name.
\w+: Matches one or more word characters (function name).
\s*: Matches zero or more whitespace characters.
\(: Matches the opening parenthesis.
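
One thing to watch with this pattern is that anything followed by an opening parenthesis counts, including keywords. The sketch below filters those out with Python's keyword module; the code string being scanned is invented, and the filtering step is my own assumption about what you would want.

```python
import keyword
import re

CALL_RE = re.compile(r"(\w+)\s*\(")

code = "total = compute_total(items)\nif (total > 0): print (total)"

names = CALL_RE.findall(code)
print(names)

# "if (" has the same shape as a call, so drop Python keywords afterwards.
calls = [name for name in names if not keyword.iskeyword(name)]
print(calls)
```

The \s* part is what lets "print (total)" match even with a space before the parenthesis.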

14. Parsing JSON Keys:

Pattern: \"(\w+)\":

This regex captures keys within JSON objects.

\": Matches a double-quote.
(...): Capturing group for key name.
\w+: Matches one or more word characters (key name).
\":: Matches the closing double-quote followed by a colon.

15. Extracting Data from CSV:

Pattern: "(.*?)"

This regex captures data within double quotes in CSV.

": Matches the opening double-quote.
(...): Capturing group for the quoted data.
.*?: Matches any characters non-greedily.
": Matches the closing double-quote.

16. Matching SQL Keywords:

Pattern: \b(SELECT|FROM|WHERE|JOIN)\b

This regex matches common SQL keywords. The match is case-sensitive unless a case-insensitive flag is set.

\b: Matches a word boundary.
( ... ): Alternation group for keywords.
SELECT|FROM|WHERE|JOIN: Matches any of the listed keywords.

17. Extracting URLs from HTML:

Pattern: <a\s+(?:[^>]*?\s+)?href="([^"]*)"

This regex captures URLs from HTML links.

<a\s+: Matches "<a" followed by one or more whitespace characters.
(?: ... )?: Non-capturing group for optional attributes that come before href.
[^>]*?: Matches any character except ">" non-greedily.
href=": Matches the literal text href= and the opening double-quote.
(...): Capturing group for URL.
[^"]*: Matches any character except double-quote.
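
Here is the link pattern applied to a few anchor tags in Python. The HTML string is invented, and it includes one anchor without an href to show that it is skipped.

```python
import re

HREF_RE = re.compile(r'<a\s+(?:[^>]*?\s+)?href="([^"]*)"')

html = (
    '<a href="https://example.com">Home</a> '
    '<a class="nav" href="/about">About</a> '
    '<a name="top">Top</a>'
)

# Only the capturing group (the URL between the quotes) is returned.
links = HREF_RE.findall(html)
print(links)
```

The optional group is what allows other attributes, like class, to sit between "<a" and href. Links written with single quotes around the URL would not match this pattern.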

18. Parsing Python Docstrings:

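Pattern: ['"](.*?)['"]

This regex captures text within single or double quotes on a single line. It does not handle triple-quoted docstrings, and the closing quote is not required to be the same kind as the opening one.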

['"]: Matches a single or double-quote.
(...): Capturing group for captured text.
.*?: Matches any characters non-greedily.
['"]: Matches the closing single or double-quote.

19. Extracting Protocol Ports:

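Pattern: \b\d{1,5}\b

This regex matches any standalone number of one to five digits. That covers the port range 0 to 65535, but also values such as 99999 and any other short number in the text, so use it where ports are the only numbers expected.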

\b: Matches a word boundary.
\d{1,5}: Matches one to five digits.

20. Matching Dollar Amounts:

Pattern: \$\d+(?:\.\d{2})?

This regex captures dollar amounts, including cents.

\$: Matches a literal dollar sign (the backslash escapes it).
\d+: Matches one or more digits.
(?: ... )?: Non-capturing group for optional cents.
\.\d{2}: Matches a dot and two digits for cents.
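
A small Python sketch of the dollar pattern, with the matches converted to Decimal so they can be added up. The sample sentence is made up, and the summing step is only there to show one thing you might do with the results.

```python
import re
from decimal import Decimal

MONEY_RE = re.compile(r"\$\d+(?:\.\d{2})?")

text = "Coffee costs $4.50, a bagel $3, and the combo is $6.999 today."

amounts = MONEY_RE.findall(text)
print(amounts)

# Drop the "$" and use Decimal to avoid float rounding on money values.
total = sum(Decimal(a.lstrip("$")) for a in amounts)
print(total)
```

The optional cents group takes exactly two digits, so "$6.999" is read as "$6.99". Amounts with thousands separators such as "$1,000" would be cut off at the comma.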

Crafting Your Own Regex Patterns

To create your own regex patterns, follow these steps:

Define Your Objective: Clearly understand what you want to achieve with your regex pattern.
Identify Patterns: Analyze the text and identify the specific patterns you need to match.
Use Anchors and Quantifiers: Choose anchors like ^ (start of string) and $ (end of string) if needed. Apply quantifiers like * (zero or more), + (one or more), ? (zero or one), or numeric quantifiers.
Handle Special Characters: Check whether special characters need escaping with a backslash \.
Use Character Classes: Utilize [...] to match a single character from a set.
Employ Groups: Use parentheses ( ... ) to group patterns together.
Add Alternatives: Use the pipe | for alternatives.
Leverage Lookaheads and Lookbehinds: Use (?= ... ) and (?<= ... ) to require that certain text follows or precedes a match without including it in the match.
Test and Refine: Test your regex against sample text and refine it as needed.
Iterate and Optimize: Tweak your regex for better accuracy and performance.
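
A few of the steps above (anchors, lookaheads and lookbehinds) are easier to grasp with something runnable. The Python sketch below is only an illustration; the three patterns and their names are my own, built from pieces used earlier in this article.

```python
import re

# Lookbehind: digits count only when a "$" sits immediately before them.
AMOUNT_RE = re.compile(r"(?<=\$)\d+(?:\.\d{2})?")

# Lookahead: a name counts only when "(" follows; the "(" is not consumed.
CALLEE_RE = re.compile(r"\w+(?=\()")

# Anchors: ^ and $ force the pattern to cover the whole string.
STRICT_HEX_RE = re.compile(r"^#(?:[A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})$")

amounts = AMOUNT_RE.findall("Total: $19.99 (was 25.00)")
callees = CALLEE_RE.findall("print(len(data))")
checks = {s: bool(STRICT_HEX_RE.match(s)) for s in ["#fff", "#ffff", "color #fff"]}

print(amounts)
print(callees)
print(checks)
```

The lookarounds act as conditions on the surrounding text, so the "$" and the "(" never appear in the results. The anchored pattern is the kind you would use to validate a whole input rather than search inside one.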

Understanding regex and its practical applications empowers one to manipulate textual data with precision. Remember, regex is a skill that grows with practice, so keep experimenting and refining your patterns.

See you in part 2b, where we’ll dive deeper into a few more concepts that remain to complete this series.

To check out the first part go here: https://medium.com/@nainasharma899103/debunking-regex-part-1-982171620543


Written by Cyberspecs