Debunking RegEx — Part 2a
Hope you enjoyed the first part and practiced a few RegEx patterns yourself. If not, this might help!
The following is a compilation of real-world examples where one can use RegEx to grab complex strings out of a large amount of not-so-garbage data. We’ll go step by step and explain why each pattern matches a particular string.
Towards the end, there are a few bonus steps one can follow to craft their own RegEx.
1. Matching Email Addresses:
Pattern: \b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b
Let’s break it down:
\b : Ensures a word boundary.
[\w.%+-]+ : Captures the username part of the email address.
@ : Matches the "@" symbol.
[\w.-]+ : Captures the domain name (without the top-level domain).
\. : Matches the dot separating the domain and top-level domain.
[A-Za-z]{2,} : Captures the top-level domain.
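As a quick check, here is a minimal Python sketch that runs the email pattern through the standard re module. The sample text and variable names are made up for illustration.

```python
import re

# Pattern copied from the section above.
EMAIL_RE = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")

# Hypothetical sample text, invented for this example.
text = "Contact alice.smith@example.com or bob+news@mail.example.org today."

# findall returns every non-overlapping match as a string.
emails = EMAIL_RE.findall(text)
print(emails)
```

The pattern is a practical filter for typical addresses, not a full validator of every address the email standards allow.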
2. Extracting Dates:
Pattern: \d{2}/\d{2}/\d{4}
This regex captures dates in “DD/MM/YYYY” format.
\d{2} : Matches two digits.
/ : Matches the forward slash.
\d{4} : Matches four digits.
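The pattern only checks the shape of a date, so something like 99/99/9999 also matches. One possible way to handle that, sketched below in Python, is to find candidates with the regex and then let datetime decide which ones are real dates. The sample text and the helper name is_real_date are assumptions made for this example.

```python
import re
from datetime import datetime

# Pattern copied from the section above.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")

# Hypothetical sample text.
text = "Invoice issued 05/11/2023, due 31/12/2023; ref 99/99/9999."

candidates = DATE_RE.findall(text)


def is_real_date(value):
    """Return True if value parses as a DD/MM/YYYY calendar date."""
    try:
        datetime.strptime(value, "%d/%m/%Y")
        return True
    except ValueError:
        return False


valid_dates = [d for d in candidates if is_real_date(d)]
print(candidates)
print(valid_dates)
```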
3. Matching Hashtags:
Pattern: #\w+
This regex captures hashtags from social media content.
# : Matches the "#" symbol.
\w+ : Captures one or more word characters (letters, digits, underscores).
4. Parsing URLs:
Pattern: https?://[^\s/$.?#]+.[^\s]*
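This regex extracts URLs that start with either http:// or https://.
https? : Matches "http" or "https".
:// : Matches the colon and two forward slashes.
[^\s/$.?#]+ : Captures the first part of the host name, up to the first dot.
. : Matches any single character. The dot is unescaped here, so it is not limited to a literal dot; write \. to require one.
[^\s]* : Matches the rest of the URL (remaining host, path, query) up to the next whitespace.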
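Here is a small Python sketch of the URL pattern in use. The sample text is invented. The last check illustrates a side effect of the unescaped dot in the pattern: a string with no dot at all can still match.

```python
import re

# Pattern copied from the section above.
URL_RE = re.compile(r"https?://[^\s/$.?#]+.[^\s]*")

# Hypothetical sample text.
text = "Docs live at https://example.com/docs?page=2 and http://test.org today."

urls = URL_RE.findall(text)
print(urls)

# The unescaped "." accepts any character, so a host without a dot also matches.
matches_without_dot = URL_RE.fullmatch("http://ab") is not None
print(matches_without_dot)
```

If a URL is followed directly by punctuation, such as a full stop at the end of a sentence, the final [^\s]* will include it, so some trimming may be needed afterwards.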
5. Capturing Phone Numbers:
Pattern: \d{3}-\d{3}-\d{4}
This regex matches US phone numbers in “###-###-####” format.
\d{3} : Matches three digits.
- : Matches the hyphen.
\d{4} : Matches four digits.
6. Identifying IP Addresses:
Pattern: \b(?:\d{1,3}\.){3}\d{1,3}\b
This regex captures IPv4 addresses from logs.
\b : Matches a word boundary.
(?: ...) : Non-capturing group.
\d{1,3}\. : Matches one to three digits followed by a dot.
{3} : Repeats the group three times.
\d{1,3} : Matches one to three digits.
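Since \d{1,3} accepts values such as 256 or 999, the pattern finds strings shaped like IPv4 addresses rather than strictly valid ones. A minimal Python sketch, assuming a made-up log line and a hypothetical helper octets_in_range, could filter the candidates afterwards:

```python
import re

# Pattern copied from the section above.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical log line.
log = "Accepted connection from 192.168.1.10, rejected 10.0.0.256 and 8.8.8.8"

# The group is non-capturing, so findall returns the full matches.
candidates = IPV4_RE.findall(log)


def octets_in_range(ip):
    """Return True if every dot-separated part is between 0 and 255."""
    return all(0 <= int(part) <= 255 for part in ip.split("."))


valid_ips = [ip for ip in candidates if octets_in_range(ip)]
print(candidates)
print(valid_ips)
```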
7. Matching File Extensions:
Pattern: \.\w+
This regex extracts file extensions from filenames.
\. : Matches the dot character.
\w+ : Matches one or more word characters.
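One thing worth noticing is that this pattern matches every dot-plus-word segment, not only the last one. The Python sketch below shows that on a made-up filename, and then shows one possible variation (an assumption of this example, not part of the original pattern) that adds a $ anchor to keep only the final extension. os.path.splitext from the standard library is included for comparison.

```python
import os.path
import re

# Pattern copied from the section above.
EXT_RE = re.compile(r"\.\w+")

# Hypothetical filename.
filename = "archive.tar.gz"

# Every ".word" segment matches, so a double extension yields two results.
all_parts = EXT_RE.findall(filename)

# Variation for illustration: anchoring with $ keeps only the last segment.
last_only = re.search(r"\.\w+$", filename).group()

# Standard-library alternative that also returns the last extension.
stdlib_ext = os.path.splitext(filename)[1]

print(all_parts, last_only, stdlib_ext)
```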
8. Validating Credit Card Numbers:
Pattern: \b(?:\d[ -]*?){13,16}\b
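This regex matches digit sequences shaped like credit card numbers: 13 to 16 digits, optionally separated by spaces or hyphens. It checks the format only, not whether the number is actually valid.
\b : Matches a word boundary.
(?: ... ) : Non-capturing group.
\d : Matches a digit.
[ -]*? : Matches zero or more spaces or hyphens, as few as possible.
{13,16} : Repeats the group 13 to 16 times.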
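A common follow-up to a format match like this is the Luhn checksum, which most card numbers satisfy. The Python sketch below pairs the pattern with a Luhn check; the helper name luhn_ok and the sample text are assumptions of this example, and 4111 1111 1111 1111 is a widely used test number rather than a real card.

```python
import re

# Pattern copied from the section above.
CARD_RE = re.compile(r"\b(?:\d[ -]*?){13,16}\b")


def luhn_ok(number):
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Hypothetical sample text: the second number has a wrong last digit.
text = "Card on file: 4111 1111 1111 1111, old card: 4111-1111-1111-1112."

candidates = CARD_RE.findall(text)
valid_cards = [c for c in candidates if luhn_ok(c)]
print(candidates)
print(valid_cards)
```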
9. Extracting HTML Tags:
Pattern: <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>
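This regex captures the tag name and the content of paired HTML tags. As written, [A-Z] matches uppercase tag names only; use a case-insensitive flag to match lowercase tags as well.
< : Matches the opening angle bracket.
([A-Z][A-Z0-9]*) : Captures the tag name.
\b : Matches a word boundary.
[^>]* : Matches any attributes (any characters except ">").
> : Matches the closing angle bracket.
(.*?) : Captures the tag content.
<\/\1> : Matches the closing tag, using the captured tag name.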
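Here is a minimal Python sketch of the pattern applied to a made-up HTML snippet. It passes re.IGNORECASE so that lowercase tag names match too, which is an assumption of this example rather than part of the original pattern.

```python
import re

# Pattern copied from the section above; IGNORECASE lets [A-Z] match lowercase tags.
TAG_RE = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>", re.IGNORECASE)

# Hypothetical HTML snippet.
html = '<h1 class="title">Hello</h1><p>Some <b>bold</b> text</p>'

# With two capturing groups, findall returns (tag_name, content) tuples.
pairs = TAG_RE.findall(html)
print(pairs)
```

Note that the inner b tag is returned as part of the p content rather than as its own match, because findall does not return overlapping matches. For deeply nested or malformed markup, a real HTML parser is usually the more robust choice.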
10. Identifying Hex Color Codes:
Pattern: #([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})
This regex matches and captures hex color codes.
# : Matches the "#" symbol.
(...) : Capturing group for either 6 or 3 hex characters.
[A-Fa-f0-9]{6} : Matches six hex characters.
| : Alternation for the alternative pattern.
[A-Fa-f0-9]{3} : Matches three hex characters.
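The following Python sketch tries the pattern on a made-up CSS fragment. It also shows a limitation worth knowing: because nothing anchors the end of the match, a four-character value such as #12ab still produces a three-character match.

```python
import re

# Pattern copied from the section above.
HEX_RE = re.compile(r"#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})")

# Hypothetical CSS fragment.
css = "color: #FF5733; background: #fff;"

# findall returns the contents of the single capturing group.
colors = HEX_RE.findall(css)
print(colors)

# Nothing stops the match after three characters, so "#12ab" matches as "12a".
partial = HEX_RE.search("#12ab").group(1)
print(partial)
```

Adding a trailing \b to the pattern is one possible way to reject such partial matches.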
11. Extracting Domain Names from URLs:
Pattern: https?://(?:www\.)?([\w.-]+)
This regex captures domain names from URLs.
https? : Matches "http" or "https".
:// : Matches the colon and two forward slashes.
(?: ...) : Non-capturing group.
www\. : Matches "www." if present.
([\w.-]+) : Captures the domain name.
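A short Python sketch of the domain pattern on invented URLs follows. The character class stops at the first character that is not a word character, dot, or hyphen, so ports and paths are left out of the captured group.

```python
import re

# Pattern copied from the section above.
DOMAIN_RE = re.compile(r"https?://(?:www\.)?([\w.-]+)")

# Hypothetical sample text.
text = "Visit https://www.example.com/path and http://blog.example.org:8080/x"

# Only the capturing group (the domain) is returned.
domains = DOMAIN_RE.findall(text)
print(domains)
```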
12. Extracting Mentioned Users:
Pattern: @([A-Za-z0-9_]+)
This regex captures usernames mentioned in text.
@ : Matches the "@" symbol.
(...) : Capturing group for the username.
[A-Za-z0-9_]+ : Matches one or more letters, digits, or underscores.
13. Matching Function Calls:
Pattern: (\w+)\s*\(
This regex captures function names in code.
(...) : Capturing group for function name.
\w+ : Matches one or more word characters (function name).
\s* : Matches zero or more whitespace characters.
\( : Matches the opening parenthesis.
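Any word followed by an opening parenthesis matches, which in many languages includes keywords such as if or while. The Python sketch below, using a made-up code snippet, shows that effect and one possible way to filter it with the standard keyword module.

```python
import keyword
import re

# Pattern copied from the section above.
CALL_RE = re.compile(r"(\w+)\s*\(")

# Hypothetical source snippet, stored as a string to be scanned.
code = "result = compute(x) + helper (y)\nif (ready): print(result)"

names = CALL_RE.findall(code)

# Drop Python keywords such as "if", which are not function calls.
calls = [name for name in names if not keyword.iskeyword(name)]
print(names)
print(calls)
```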
14. Parsing JSON Keys:
Pattern: \"(\w+)\":
This regex captures keys within JSON objects.
\" : Matches the opening double-quote.
(...) : Capturing group for the key name.
\w+ : Matches one or more word characters (the key name).
\": : Matches the closing double-quote followed by a colon.
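The Python sketch below compares the pattern with the standard json module on a made-up document. The regex scans the raw text, so it returns keys from every nesting level, while json.loads gives structured access to just the top level.

```python
import json
import re

# Pattern copied from the section above (the escaped quotes are kept as written).
KEY_RE = re.compile(r"\"(\w+)\":")

# Hypothetical JSON document with one nested object.
doc = '{"user": {"id": 7}, "ok": true}'

# The regex sees every key in the text, nested or not.
regex_keys = KEY_RE.findall(doc)

# A real parser distinguishes nesting levels.
top_level_keys = list(json.loads(doc).keys())

print(regex_keys)
print(top_level_keys)
```

The pattern also expects the colon to follow the closing quote directly and the key to contain only word characters, so for anything beyond quick inspection a JSON parser is the safer tool.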
15. Extracting Data from CSV:
Pattern: "(.*?)"
This regex captures data within double quotes in CSV.
" : Matches the opening double-quote.
(...) : Capturing group for the quoted data.
.*? : Matches any characters non-greedily.
" : Matches the closing double-quote.
16. Matching SQL Keywords:
Pattern: \b(SELECT|FROM|WHERE|JOIN)\b
This regex matches common SQL keywords.
\b : Matches a word boundary.
( ... ) : Alternation group for keywords.
SELECT|FROM|WHERE|JOIN : Matches any of the listed keywords.
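Here is a brief Python sketch of the keyword pattern on a made-up query. It adds re.IGNORECASE, an assumption of this example, because SQL keywords are often written in lowercase. The column name fromage is included to show the word boundaries at work.

```python
import re

# Pattern copied from the section above, with case-insensitive matching added.
SQL_KEYWORD_RE = re.compile(r"\b(SELECT|FROM|WHERE|JOIN)\b", re.IGNORECASE)

# Hypothetical query; "fromage" must not count as the keyword FROM.
query = "select name from users join orders on users.id = orders.user_id where fromage = 1"

found = [kw.upper() for kw in SQL_KEYWORD_RE.findall(query)]
print(found)
```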
17. Extracting URLs from HTML:
Pattern: <a\s+(?:[^>]*?\s+)?href="([^"]*)"
This regex captures URLs from HTML links.
<a\s+ : Matches "<a" followed by one or more whitespace characters.
(?: ... )? : Optional non-capturing group for attributes that come before href.
[^>]*? : Matches any character except ">" non-greedily.
href=" : Matches the literal text href=" (attribute name, equals sign, and opening quote).
(...) : Capturing group for the URL.
[^"]* : Matches any character except a double quote.
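A minimal Python sketch of the link pattern follows, using an invented HTML fragment with one link where href comes first and one where another attribute precedes it.

```python
import re

# Pattern copied from the section above.
HREF_RE = re.compile(r'<a\s+(?:[^>]*?\s+)?href="([^"]*)"')

# Hypothetical HTML fragment.
html = '<a href="https://example.com">Home</a> <a class="nav" href="/about">About</a>'

# Only the capturing group (the URL) is returned.
links = HREF_RE.findall(html)
print(links)
```

The pattern assumes double-quoted href values, so links written with single quotes or no quotes would be missed.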
18. Parsing Python Docstrings:
Pattern: ['"](.*?)['"]
This regex captures text within single or double quotes.
['"] : Matches an opening single or double quote.
(...) : Capturing group for the quoted text.
.*? : Matches any characters non-greedily.
['"] : Matches a closing single or double quote. The pattern does not require it to be the same kind as the opening one.
19. Extracting Protocol Ports:
Pattern: \b\d{1,5}\b
This regex captures candidate port numbers: any standalone run of one to five digits. It does not check the valid port range (0 to 65535).
\b : Matches a word boundary.
\d{1,5} : Matches one to five digits.
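Because the pattern matches any short number, a range check afterwards helps. The Python sketch below uses a made-up log line in which one number, 70000, has the right shape but is too large to be a port.

```python
import re

# Pattern copied from the section above.
PORT_RE = re.compile(r"\b\d{1,5}\b")

# Hypothetical log line.
text = "Listening on ports 22, 443 and 8080; pid 70000"

candidates = PORT_RE.findall(text)

# Ports are 16-bit values, so anything above 65535 is discarded.
ports = [int(c) for c in candidates if int(c) <= 65535]
print(candidates)
print(ports)
```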
20. Matching Dollar Amounts:
Pattern: \$\d+(?:\.\d{2})?
This regex captures dollar amounts, including cents.
\$ : Matches a literal dollar sign (the backslash escapes it).
\d+ : Matches one or more digits.
(?: ... )? : Non-capturing group for optional cents.
\.\d{2} : Matches a dot and two digits for cents.
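The Python sketch below applies the pattern to invented text and converts the matches to Decimal values. The last amount, $7.5, has only one digit after the dot, so the optional cents group does not apply and only $7 is matched.

```python
import re
from decimal import Decimal

# Pattern copied from the section above.
DOLLAR_RE = re.compile(r"\$\d+(?:\.\d{2})?")

# Hypothetical sample text.
text = "Coffee $4.50, bagel $3, tip $7.5"

amounts = DOLLAR_RE.findall(text)

# Strip the dollar sign and convert to Decimal to avoid float rounding.
values = [Decimal(a.lstrip("$")) for a in amounts]
print(amounts)
print(values)
```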
Crafting Your Own Regex Patterns
To create your own regex patterns, follow these steps:
- Define Your Objective: Clearly understand what you want to achieve with your regex pattern.
- Identify Patterns: Analyze the text and identify the specific patterns you need to match.
- Use Anchors and Quantifiers: Choose anchors like ^ (start of string) and $ (end of string) if needed. Apply quantifiers like * (zero or more), + (one or more), ? (zero or one), or numeric quantifiers such as {3} or {2,}.
- Handle Special Characters: Consider whether special characters need escaping with a backslash \.
- Use Character Classes: Utilize [...] to match a single character from a set.
- Employ Groups: Use parentheses ( ... ) to group patterns together.
- Add Alternatives: Use the pipe | for alternatives.
- Leverage Lookaheads and Lookbehinds: Use (?= ... ) and (?<= ... ) to match only when certain text follows or precedes, without including that text in the match.
- Test and Refine: Test your regex against sample text and refine it as needed.
- Iterate and Optimize: Tweak your regex for better accuracy and performance.
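A few of the steps above can be sketched in Python as follows. The sample strings are made up, and the lookaround patterns are illustrative assumptions rather than patterns from the list of examples.

```python
import re

# Anchors: ^ and $ force the whole string to be a phone number.
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
is_exact = PHONE_RE.search("555-123-4567") is not None
is_embedded = PHONE_RE.search("call 555-123-4567") is not None

# Lookbehind: digits that follow "$", without including "$" in the match.
prices = re.findall(r"(?<=\$)\d+", "Price: $42, qty 7")

# Lookahead: words that are followed by "(", without consuming it.
called = re.findall(r"\w+(?=\()", "print(len(x))")

print(is_exact, is_embedded)
print(prices)
print(called)
```

Running small experiments like these against sample text is one practical way to carry out the "Test and Refine" step.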
Understanding regex and its practical applications empowers one to manipulate textual data with precision. Remember, regex is a skill that grows with practice, so keep experimenting and refining your patterns.
See you in part 2b, where we’ll dive deeper into a few more concepts that remain to complete this series.
To check out the first part go here: https://medium.com/@nainasharma899103/debunking-regex-part-1-982171620543