Debunking RegEx — Part 2a

Cyberspecs
Aug 28, 2023


Hope you enjoyed the first part and practiced a few patterns yourself. If not, this part might help!

The following is a compilation of real-world examples where one can use RegEx to grab complex strings out of a large amount of not-so-garbage data. We’ll go step by step and follow through with why the given pattern is used to match a particular string.

Towards the end, there are a few bonus steps one can follow to craft their own RegEx.

1. Matching Email Addresses:

Pattern: \b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b
Let’s break it down:

  • \b: Ensures a word boundary.
  • [\w.%+-]+: Captures the username part of the email address.
  • @: Matches the "@" symbol.
  • [\w.-]+: Captures the domain name (without the top-level domain).
  • \.: Matches the dot separating the domain and top-level domain.
  • [A-Za-z]{2,}: Captures the top-level domain.
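
To see the pattern in action, here is a minimal sketch in Python. The choice of Python's built-in re module is an assumption (the article is not tied to a language), and the sample text and addresses are made up for illustration.

```python
import re

# Pattern copied from the text above.
EMAIL = re.compile(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b")

# Hypothetical sample text, invented for this example.
text = "Contact alice.smith+news@example.com or bob@mail.example.org today."

emails = EMAIL.findall(text)
print(emails)
# → ['alice.smith+news@example.com', 'bob@mail.example.org']
```

Note how the domain part backtracks so that the final label ("com", "org") is left for the top-level domain.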

2. Extracting Dates:

Pattern: \d{2}/\d{2}/\d{4}
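This regex matches dates written in “DD/MM/YYYY” format. It only checks the shape (two digits, two digits, four digits), so it cannot tell DD/MM from MM/DD or reject an impossible date like 99/99/9999.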

  • \d{2}: Matches two digits.
  • /: Matches the forward slash.
  • \d{4}: Matches four digits.
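
A small Python sketch of how one might use this pattern, assuming made-up sample text. Because the regex checks only the shape of a date, the sketch adds a second step with datetime.strptime to drop impossible dates; that extra step is an addition for illustration, not something the pattern itself does.

```python
import re
from datetime import datetime

DATE = re.compile(r"\d{2}/\d{2}/\d{4}")

# Hypothetical sample text: the second date has the right shape but is not real.
text = "Invoices dated 28/08/2023 and 31/02/2023 were received."
candidates = DATE.findall(text)


def is_real_date(value):
    """Return True if value parses as an actual DD/MM/YYYY calendar date."""
    try:
        datetime.strptime(value, "%d/%m/%Y")
        return True
    except ValueError:
        return False


valid = [d for d in candidates if is_real_date(d)]
print(candidates)  # → ['28/08/2023', '31/02/2023']
print(valid)       # → ['28/08/2023']
```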

3. Matching Hashtags:

Pattern: #\w+
This regex captures hashtags from social media content.

  • #: Matches the "#" symbol.
  • \w+: Captures one or more word characters (letters, digits, underscores).
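
A quick Python check, with an invented sample post:

```python
import re

# Hypothetical social media text, made up for this example.
post = "Learning #regex with #Python3 is #fun_and_useful!"

tags = re.findall(r"#\w+", post)
print(tags)  # → ['#regex', '#Python3', '#fun_and_useful']
```

The trailing "!" is left out because it is not a word character.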

4. Parsing URLs:

Pattern: https?://[^\s/$.?#]+.[^\s]*
This regex extracts http and https URLs from free text.

  • https?: Matches "http" or "https".
  • ://: Matches the colon and two forward slashes.
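  • [^\s/$.?#]+: Matches one or more characters that are not whitespace, "/", "$", ".", "?", or "#" (the start of the host name).
  • .: Matches any single character, since the dot is not escaped. In practice this is usually the first dot in the domain; write \. to require a literal dot.
  • [^\s]*: Matches the rest of the URL, up to the next whitespace.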
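
Here is a rough Python sketch of the pattern at work. The sample sentence and both URLs are hypothetical. Since the last part of the pattern runs until the next whitespace, punctuation glued to the end of a URL would be swallowed too, so the sample keeps URLs followed by a space.

```python
import re

# Pattern copied from the text above.
URL = re.compile(r"https?://[^\s/$.?#]+.[^\s]*")

# Hypothetical sample text, invented for this example.
text = "Docs live at https://example.com/guide?page=2 and http://test.org today."

urls = URL.findall(text)
print(urls)
# → ['https://example.com/guide?page=2', 'http://test.org']
```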

5. Capturing Phone Numbers:

Pattern: \d{3}-\d{3}-\d{4}
This regex matches US phone numbers in “###-###-####” format.

  • \d{3}: Matches three digits.
  • -: Matches the hyphen.
  • \d{4}: Matches four digits.
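
As a quick illustration in Python (the numbers below are fictional):

```python
import re

PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

# Hypothetical sample text with made-up numbers.
text = "Call 555-123-4567 or 555-987-6543 before noon."

phones = PHONE.findall(text)
print(phones)  # → ['555-123-4567', '555-987-6543']
```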

6. Identifying IP Addresses:

Pattern: \b(?:\d{1,3}\.){3}\d{1,3}\b
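This regex captures IPv4-style addresses from logs. It does not check that each number is in the 0 to 255 range, so 999.999.999.999 also matches.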

  • \b: Matches a word boundary.
  • (?: ...): Non-capturing group.
  • \d{1,3}\.: Matches one to three digits followed by a dot.
  • {3}: Repeats the group three times.
  • \d{1,3}: Matches one to three digits.
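
The sketch below tries the pattern on an invented log line in Python. As a hedge against out-of-range numbers, it then filters the matches through the standard library's ipaddress module; that second step is an assumption about how one might use the regex, not part of the pattern.

```python
import ipaddress
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical log line; the second address has the right shape but is invalid.
log = "Accepted from 192.168.1.10, rejected from 999.300.1.1"
candidates = IPV4.findall(log)


def is_valid_ipv4(value):
    """Return True if value is a real IPv4 address (each octet 0 to 255)."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


valid = [ip for ip in candidates if is_valid_ipv4(ip)]
print(candidates)  # → ['192.168.1.10', '999.300.1.1']
print(valid)       # → ['192.168.1.10']
```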

[Figure: chart depicting different special and literal characters]

7. Matching File Extensions:

Pattern: \.\w+
This regex extracts file extensions from filenames.

  • \.: Matches the dot character.
  • \w+: Matches one or more word characters.
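
One thing worth trying yourself: a filename with several dots produces several matches. A short Python sketch with hypothetical filenames:

```python
import re

# Hypothetical filenames, chosen to show zero, one, and two matches.
names = ["report.pdf", "archive.tar.gz", "README"]

found = {name: re.findall(r"\.\w+", name) for name in names}
print(found)
# → {'report.pdf': ['.pdf'], 'archive.tar.gz': ['.tar', '.gz'], 'README': []}
```

If only the final extension is wanted, taking the last item of each list is one simple option.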

8. Validating Credit Card Numbers:

Pattern: \b(?:\d[ -]*?){13,16}\b
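This regex matches candidate credit card numbers: 13 to 16 digits, optionally separated by spaces or hyphens. It checks the shape only, not whether the number is actually valid.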

  • \b: Matches a word boundary.
  • (?: ... ): Non-capturing group.
  • \d: Matches a digit.
  • [ -]*?: Matches zero or more spaces or hyphens, as few as possible (lazy).
  • {13,16}: Matches 13 to 16 repetitions.
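
A possible way to use this pattern in Python is sketched below. The regex only finds digit runs of the right length, so the sketch pairs it with a Luhn checksum, which is an addition for illustration and not something the article's pattern performs. The first number is a commonly used test number; the second is made up.

```python
import re

CARD = re.compile(r"\b(?:\d[ -]*?){13,16}\b")

# Hypothetical sample text.
text = "Cards 4111 1111 1111 1111 and 1234-5678-9012-3456 on file."
candidates = CARD.findall(text)


def luhn_ok(number):
    """Return True if the digits in number pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


passing = [c for c in candidates if luhn_ok(c)]
print(candidates)  # → ['4111 1111 1111 1111', '1234-5678-9012-3456']
print(passing)     # → ['4111 1111 1111 1111']
```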

9. Extracting HTML Tags:

Pattern: <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>
This regex captures content within HTML tags.

  • <: Matches the opening angle bracket.
  • ([A-Z][A-Z0-9]*): Captures the tag name (uppercase only, unless the case-insensitive flag is set).
  • \b: Matches a word boundary.
  • [^>]*: Matches zero or more characters other than ">", such as attributes.
  • >: Matches the closing angle bracket.
  • (.*?): Captures the tag content.
  • <\/\1>: Matches the closing tag using captured tag name.
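
The following Python sketch runs the pattern on a tiny, invented HTML snippet. It assumes the re.IGNORECASE flag is passed, since the character classes list only uppercase letters and most real-world HTML uses lowercase tags.

```python
import re

PATTERN = r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)<\/\1>"

# Hypothetical HTML snippet with lowercase tags.
html = '<h1 class="title">Hello</h1><p>World</p>'

# Without the flag, the uppercase-only classes find nothing here.
strict = re.findall(PATTERN, html)

# With the flag, each match yields (tag name, tag content).
relaxed = re.findall(PATTERN, html, re.IGNORECASE)

print(strict)   # → []
print(relaxed)  # → [('h1', 'Hello'), ('p', 'World')]
```

This is fine for small, well-formed snippets; nested tags of the same name are where a regex like this starts to struggle.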

10. Identifying Hex Color Codes:

Pattern: #([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})
This regex matches and captures hex color codes.

  • #: Matches the "#" symbol.
  • (...): Capturing group for either 6 or 3 hex characters.
  • [A-Fa-f0-9]{6}: Matches six hex characters.
  • |: Alternation for the alternative pattern.
  • [A-Fa-f0-9]{3}: Matches three hex characters.
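
A brief Python illustration, using a made-up CSS fragment. Because the pattern has one capturing group, re.findall returns just the hex digits without the "#".

```python
import re

HEX_COLOR = re.compile(r"#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})")

# Hypothetical CSS fragment.
css = "color: #1A2B3C; border: #fff;"

colors = HEX_COLOR.findall(css)
print(colors)  # → ['1A2B3C', 'fff']
```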

11. Extracting Domain Names from URLs:

Pattern: https?://(?:www\.)?([\w.-]+)
This regex captures domain names from URLs.

  • https?: Matches "http" or "https".
  • ://: Matches the colon and two forward slashes.
  • (?: ...): Non-capturing group.
  • www\.: Matches "www." if present.
  • ([\w.-]+): Captures the domain name.
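
In Python, the captured domain is available as group 1 of the match. A minimal sketch with hypothetical URLs:

```python
import re

DOMAIN = re.compile(r"https?://(?:www\.)?([\w.-]+)")

# Hypothetical URLs, one with "www." and one without.
urls = ["https://www.example.com/path", "http://blog.example.org"]

hosts = [DOMAIN.search(u).group(1) for u in urls]
print(hosts)  # → ['example.com', 'blog.example.org']
```

Only a leading "www." is dropped; other subdomains such as "blog." stay in the captured text.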

12. Extracting Mentioned Users:

Pattern: @([A-Za-z0-9_]+)
This regex captures usernames mentioned in text.

  • @: Matches the "@" symbol.
  • (...): Capturing group for the username.
  • [A-Za-z0-9_]+: Matches one or more letters, digits, or underscores.
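
Tried out in Python on an invented comment, the capturing group returns the usernames without the "@":

```python
import re

MENTION = re.compile(r"@([A-Za-z0-9_]+)")

# Hypothetical comment text.
comment = "Thanks @alice_01 and @Bob for the review"

users = MENTION.findall(comment)
print(users)  # → ['alice_01', 'Bob']
```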

13. Matching Function Calls:

Pattern: (\w+)\s*\(
This regex captures function names in code.

  • (...): Capturing group for function name.
  • \w+: Matches one or more word characters (function name).
  • \s*: Matches zero or more whitespace characters.
  • \(: Matches the opening parenthesis.
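
Below is a small Python sketch run against a made-up code snippet. Keep in mind that the pattern picks up any word followed by "(", so keywords written like if (x) would be reported as well.

```python
import re

CALL = re.compile(r"(\w+)\s*\(")

# Hypothetical source code, stored as a plain string.
code = "result = compute(x) + helper (y)\nprint(result)"

names = CALL.findall(code)
print(names)  # → ['compute', 'helper', 'print']
```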

14. Parsing JSON Keys:

Pattern: \"(\w+)\":
This regex captures keys within JSON objects.

  • \": Matches a double-quote.
  • (...): Capturing group for key name.
  • \w+: Matches one or more word characters (key name).
  • \":: Matches the closing double-quote followed by a colon.
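
A hedged Python sketch on an invented JSON string. The backslashes before the quotes are not needed inside a Python raw string, so they are dropped here. For comparison, the sketch also parses the same text with the standard json module, which is usually the safer choice when the input is real JSON.

```python
import json
import re

# Hypothetical JSON text.
raw = '{"name": "Ada", "age": 36, "active": true}'

keys = re.findall(r'"(\w+)":', raw)
parsed_keys = list(json.loads(raw))

print(keys)         # → ['name', 'age', 'active']
print(parsed_keys)  # → ['name', 'age', 'active']
```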

15. Extracting Data from CSV:

Pattern: "(.*?)"
This regex captures data within double quotes in CSV.

  • ": Matches the opening double-quote.
  • (...): Capturing group for the quoted data.
  • .*?: Matches any characters non-greedily.
  • ": Matches the closing double-quote.
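
Here is a short Python sketch on a made-up CSV line where every field is quoted. The comparison with the standard csv module is an addition for illustration; for real files with escaped quotes or unquoted fields, a proper CSV parser is the more reliable tool.

```python
import csv
import re

# Hypothetical CSV line; note the comma inside the second field.
line = '"Ada Lovelace","London, UK","1815"'

fields = re.findall(r'"(.*?)"', line)
parsed = next(csv.reader([line]))

print(fields)  # → ['Ada Lovelace', 'London, UK', '1815']
print(parsed)  # → ['Ada Lovelace', 'London, UK', '1815']
```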

16. Matching SQL Keywords:

Pattern: \b(SELECT|FROM|WHERE|JOIN)\b
This regex matches common SQL keywords.

  • \b: Matches a word boundary.
  • ( ... ): Alternation group for keywords.
  • SELECT|FROM|WHERE|JOIN: Matches any of the listed keywords.
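
For a feel of how this behaves, a Python sketch with an invented query follows. The pattern as written is case-sensitive, so the second call passes re.IGNORECASE (an assumption about what one would usually want) to catch lowercase keywords too.

```python
import re

PATTERN = r"\b(SELECT|FROM|WHERE|JOIN)\b"

# Hypothetical queries, one uppercase and one lowercase.
upper_query = "SELECT name FROM users WHERE id IN (SELECT user_id FROM orders)"
lower_query = "select name from users"

upper_hits = re.findall(PATTERN, upper_query)
lower_hits = re.findall(PATTERN, lower_query, re.IGNORECASE)

print(upper_hits)  # → ['SELECT', 'FROM', 'WHERE', 'SELECT', 'FROM']
print(lower_hits)  # → ['select', 'from']
```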

17. Extracting URLs from HTML:

Pattern: <a\s+(?:[^>]*?\s+)?href="([^"]*)"
This regex captures URLs from HTML links.

  • <a\s+: Matches "<a" followed by one or more whitespace characters.
  • (?: ... )?: Non-capturing group for optional attributes.
  • [^>]*?: Matches any character except ">" non-greedily.
  • href=": Matches the literal text href=" (attribute name, equals sign, and opening quote).
  • (...): Capturing group for URL.
  • [^"]*: Matches any character except double-quote.
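
The Python sketch below applies the pattern to a small, hypothetical HTML fragment, once with href as the first attribute and once with another attribute in front of it.

```python
import re

HREF = re.compile(r'<a\s+(?:[^>]*?\s+)?href="([^"]*)"')

# Hypothetical HTML fragment with two links.
html = '<a href="https://example.com">Home</a> <a class="nav" href="/about">About</a>'

links = HREF.findall(html)
print(links)  # → ['https://example.com', '/about']
```

Links that use single quotes around the href value would be missed, since the pattern expects double quotes.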

18. Parsing Python Docstrings:

Pattern: ['"](.*?)['"]
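This regex captures text within single or double quotes. Note that the opening and closing quotes are not required to be the same character, so 'text" also matches.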

  • ['"]: Matches a single or double-quote.
  • (...): Capturing group for captured text.
  • .*?: Matches any characters non-greedily.
  • ['"]: Matches the closing single or double-quote.
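
A minimal Python sketch with invented strings. The second call shows a side effect worth knowing about: an apostrophe can pair up with a later double quote, because the pattern accepts either quote character on both ends.

```python
import re

QUOTED = re.compile(r"""['"](.*?)['"]""")

# Hypothetical inputs: one clean, one with an apostrophe in the way.
clean = "name = 'Ada' and title = \"Countess\""
tricky = "it's a \"test\""

clean_hits = QUOTED.findall(clean)
tricky_hits = QUOTED.findall(tricky)

print(clean_hits)   # → ['Ada', 'Countess']
print(tricky_hits)  # → ['s a ']
```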

19. Extracting Protocol Ports:

Pattern: \b\d{1,5}\b
This regex captures candidate port numbers: any standalone number of one to five digits. It does not check the valid range (0 to 65535).

  • \b: Matches a word boundary.
  • \d{1,5}: Matches one to five digits.
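
Sketched in Python with a made-up log message. The range filter after the regex is an extra step added for illustration, since a five-digit number such as 70000 fits the pattern but is not a usable port.

```python
import re

PORT = re.compile(r"\b\d{1,5}\b")

# Hypothetical log message.
message = "Listening on ports 22, 443 and 70000"

candidates = [int(p) for p in PORT.findall(message)]
ports = [p for p in candidates if 0 <= p <= 65535]

print(candidates)  # → [22, 443, 70000]
print(ports)       # → [22, 443]
```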

20. Matching Dollar Amounts:

Pattern: \$\d+(?:\.\d{2})?
This regex captures dollar amounts, including cents.

  • \$: Matches a literal dollar sign (the backslash escapes it).
  • \d+: Matches one or more digits.
  • (?: ... )?: Non-capturing group for optional cents.
  • \.\d{2}: Matches a dot and two digits for cents.
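
A quick Python run on an invented sentence shows how the optional cents group behaves. When fewer than two digits follow the dot, the group is skipped and only the whole-dollar part is matched.

```python
import re

DOLLARS = re.compile(r"\$\d+(?:\.\d{2})?")

# Hypothetical sample text; the last amount has only one decimal digit.
text = "Lunch was $12.50, coffee $4 and parking $7.5"

amounts = DOLLARS.findall(text)
print(amounts)  # → ['$12.50', '$4', '$7']
```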

Crafting Your Own Regex Patterns

To create your own regex patterns, follow these steps:

  1. Define Your Objective: Clearly understand what you want to achieve with your regex pattern.
  2. Identify Patterns: Analyze the text and identify the specific patterns you need to match.
  3. Use Anchors and Quantifiers: Choose anchors like ^ (start of string) and $ (end of string) if needed. Apply quantifiers like * (zero or more), + (one or more), ? (zero or one), or numeric quantifiers.
  4. Handle Special Characters: Consider whether special characters need escaping with a backslash \.
  5. Use Character Classes: Utilize [...] to match a single character from a set.
  6. Employ Groups: Use parentheses ( ... ) to group patterns together.
  7. Add Alternatives: Use the pipe | for alternatives.
  8. Leverage Lookaheads and Lookbehinds: Use (?= ... ) and (?<= ... ) for conditional matches.
  9. Test and Refine: Test your regex against sample text and refine it as needed.
  10. Iterate and Optimize: Tweak your regex for better accuracy and performance.
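
Some of the steps above can be sketched in Python as follows. The sample strings are hypothetical, and the three patterns are small illustrations of anchoring, lookbehind, and lookahead rather than recommended production patterns.

```python
import re

# Step 3, anchoring: fullmatch requires the whole string to fit the pattern.
exact = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "555-123-4567") is not None
partial = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "call 555-123-4567") is not None

# Step 8, lookbehind: digits only when preceded by a literal "$".
prices = re.findall(r"(?<=\$)\d+", "cost $40 or 40 EUR")

# Step 8, lookahead: a word only when followed by "(".
calls = re.findall(r"\w+(?=\()", "print(x) and y")

print(exact, partial)  # → True False
print(prices)          # → ['40']
print(calls)           # → ['print']
```

Lookarounds check the surrounding text without including it in the match, which is why the "$" and "(" do not appear in the results.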

Understanding regex and its practical applications empowers one to manipulate textual data with precision. Remember, regex is a skill that grows with practice, so keep experimenting and refining your patterns.

See you in part 2b, where we’ll dive deeper into a few more concepts that remain to complete this series.

To check out the first part go here: https://medium.com/@nainasharma899103/debunking-regex-part-1-982171620543
