nerdspice

Regex is Pretty Neat

Published • 5 min read

  • #programming
  • #regex
  • #regular
  • #expression

Have you ever been handed a massive file with zero feedback on why it is failing validation? Yeah, that was me last week.

The file had tens of thousands of lines in it and it was nearly impossible to scroll through it by hand and find the problem.

The Problem

The file in question is a fixed-width file, where certain data must sit in certain positions. For example, picture a customer record laid out as: a 9-digit SSN, a YYYYMMDD date of birth, last name (up to 20 characters, space-padded), first name (up to 20 characters), middle name (up to 20 characters), customer number (up to 20 characters), and then N or E to flag a new or existing customer.

You can imagine why this has drawbacks. Say an employee adds a new customer to the database but instead of putting the suffix in the suffix field, they enter “2ND” as part of the last name.

The file only accepts alphabetical characters for the name fields.

It is almost impossible to scroll through tens of thousands of lines and notice this.

In the procedure that fetches the data from the database, we had added all kinds of string replacements over time to strip out special characters, quotes, backticks, and so on. We had been using this procedure for months and never ran across this issue.

Finding the Needle in the Haystack

As a regulated business, we have deadlines for these files to be submitted and processed, so we needed a quick way to identify what was wrong with the file.

After spending some time trying to find the issue manually, I finally landed on using Regex to parse the file, and I ended up going down a whole rabbit hole learning more about it. Now, obviously, as a software developer I have encountered Regex before, but I was not super familiar with all the neat things you can do, like named capture groups.

Enter Regex

I eventually landed on a Regex that looks like this (note that this simplified version leaves out the customer number field):

^(?P<ssn>\d{9})(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))(?P<last>[A-Z ]{20})(?P<first>[A-Z ]{20})(?P<middle>[A-Z ]{20})(?P<status>[EN])\r?\n?$

There’s also this neat website, Regex 101, where you can enter a regex and it highlights the matches and even shows the grouped information.

Obviously you would not want to paste real customer data into some arbitrary website (I didn’t). The example problem I’m explaining here is not exactly what I was solving for, but it’s substantially similar for illustrative purposes.

Breaking Down the Regex

Here’s what each part of the regex does:

^(?P<ssn>\d{9})
  • ^ → start of the line
  • \d{9} → exactly 9 digits (the SSN)
  • (?P<ssn> … ) → puts those digits into a named capture group called ssn
(?P<dob>
 (?P<year>\d{4})
 (?P<month>0[1-9]|1[0-2])
 (?P<day>0[1-9]|[12]\d|3[01])
)
  • Groups together the date of birth as dob
  • Inside that:
    • (?P<year>\d{4}) → 4 digits for the year
    • (?P<month>0[1-9]|1[0-2]) → valid months 01–12
    • (?P<day>0[1-9]|[12]\d|3[01]) → valid days 01–31
(?P<last>[A-Z ]{20})
(?P<first>[A-Z ]{20})
(?P<middle>[A-Z ]{20})
  • Each name field is exactly 20 characters
  • Only capital letters (A–Z) and spaces allowed
  • Stored in capture groups last, first, and middle
(?P<status>[EN])
  • A single character that must be E (existing) or N (new)
\r?\n?$
  • Allows for optional Windows (\r\n) or Unix (\n) line endings
  • $ → end of the line
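
One caveat on the month and day groups: they only check ranges, so an impossible date like 20000231 (February 31st) still matches. A minimal sketch of a follow-up check, assuming Python and a hypothetical helper I'm calling is_real_date, could hand the captured dob to datetime.strptime:

```python
import re
from datetime import datetime

# Same date sub-pattern as in the breakdown above, pulled out on its own for brevity
dob_pattern = re.compile(
    r'^(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))$'
)

def is_real_date(dob):
    """Hypothetical helper: True only if dob (YYYYMMDD) is a real calendar date."""
    try:
        datetime.strptime(dob, "%Y%m%d")
        return True
    except ValueError:
        return False

for dob in ["20000526", "20000231"]:
    shape_ok = dob_pattern.match(dob) is not None
    print(dob, "regex match:", shape_ok, "real date:", is_real_date(dob))
```

Both sample values pass the regex, but only the first one is a real date. Whether that extra check is worth it depends on how strict the receiving side is.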

Testing the Regex

Using Regex 101, I put in the regular expression and some sample data.

Screenshot of Regex101 showing fixed width validation example

If there is an invalid entry, such as one with the suffix “2ND” in the last name, the regular expression will not match that line, which flags the error:

Screenshot of Regex101 showing fixed width validation example with error

The awesome part of this is that you can pick out specific parts using Regex. The Regex 101 site shows this over on the right side:

Screenshot of Regex101 showing matched capture groups in a fixed-width record

Using Named Capture Groups in Scripting

The named capture groups can be taken a step further by using them in scripting. For example, let’s say you wanted to parse this file. You could pick out each section and even show lines that don’t match.

A simple Python script example:

import re

# Your regex pattern
pattern = re.compile(
    r'^(?P<ssn>\d{9})'
    r'(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))'
    r'(?P<last>[A-Z ]{20})'
    r'(?P<first>[A-Z ]{20})'
    r'(?P<middle>[A-Z ]{20})'
    r'(?P<status>[EN])\r?\n?$'
)

# Example data as a list of lines
data = [
    "66601234520000526SMITH               JOHN                ALBERT              N",
    "66654321020000526DOE                 JANE                SUE                 E",
    "66654321020000526DOE 2ND             JANE                SUE                 E"
]

for line in data:
    match = pattern.match(line)
    if match:
        entry = match.groupdict()
        # Trim space-padded fields
        entry["last"] = entry["last"].rstrip()
        entry["first"] = entry["first"].rstrip()
        entry["middle"] = entry["middle"].rstrip()
        print(entry)
    else:
        print(f"Line did not match: {line}")

Script output:

{'ssn': '666012345', 'dob': '20000526', 'year': '2000', 'month': '05', 'day': '26', 'last': 'SMITH', 'first': 'JOHN', 'middle': 'ALBERT', 'status': 'N'}
{'ssn': '666543210', 'dob': '20000526', 'year': '2000', 'month': '05', 'day': '26', 'last': 'DOE', 'first': 'JANE', 'middle': 'SUE', 'status': 'E'}
Line did not match: 66654321020000526DOE 2ND             JANE                SUE                 E
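
With tens of thousands of lines, it also helps to know which line number failed and which field broke it. Here is a rough sketch of one way to do that, assuming the same layout as the example pattern. The FIELDS table and the make_line and find_bad_fields helpers are hypothetical names for illustration, not part of the script above:

```python
import re

# Hypothetical layout table: (field name, width, pattern the whole field must match)
FIELDS = [
    ("ssn", 9, r"\d{9}"),
    ("dob", 8, r"\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"),
    ("last", 20, r"[A-Z ]{20}"),
    ("first", 20, r"[A-Z ]{20}"),
    ("middle", 20, r"[A-Z ]{20}"),
    ("status", 1, r"[EN]"),
]

def make_line(ssn, dob, last, first, middle, status):
    """Build a sample record, space-padding each name to 20 characters."""
    return ssn + dob + last.ljust(20) + first.ljust(20) + middle.ljust(20) + status

def find_bad_fields(line):
    """Slice the line by fixed widths and return the names of the fields that fail."""
    line = line.rstrip("\r\n")
    bad = []
    start = 0
    for name, width, field_pattern in FIELDS:
        chunk = line[start:start + width]
        if not re.fullmatch(field_pattern, chunk):
            bad.append(name)
        start += width
    # A line that is too short or too long is a problem of its own
    if len(line) != start:
        bad.append("length")
    return bad

data = [
    make_line("666012345", "20000526", "SMITH", "JOHN", "ALBERT", "N"),
    make_line("666543210", "20000526", "DOE 2ND", "JANE", "SUE", "E"),
]

for line_number, line in enumerate(data, start=1):
    bad = find_bad_fields(line)
    if bad:
        print(f"Line {line_number}: problem in {', '.join(bad)}")
```

For the sample data this prints "Line 2: problem in last". To run it against a real file, the data list could be swapped for an open file handle, since iterating a file yields one line at a time.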

Wow! It’s so simple but really powerful.

Other Considerations

When using Regex in a manner like this, it is important to consider all possible inputs. For example, the way this is written now, if any of the name fields are left out and only spaces appear, it will still match the Regex since it looks for A-Z and spaces.

The Regex could be adjusted to make sure at least one alphabetic character appears in each name field.
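
As a sketch of that adjustment, a lookahead can require a letter somewhere inside the 20-character field before the field itself is consumed. The names name_field, strict_pattern, and make_line are mine, and I'm assuming here that a blank middle name is fine while last and first names are required:

```python
import re

# (?=[A-Z ]{0,19}[A-Z]) peeks ahead and requires a letter within the next 20 characters
name_field = r'(?=[A-Z ]{0,19}[A-Z])[A-Z ]{20}'

strict_pattern = re.compile(
    r'^(?P<ssn>\d{9})'
    r'(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))'
    rf'(?P<last>{name_field})'
    rf'(?P<first>{name_field})'
    r'(?P<middle>[A-Z ]{20})'  # assumption: a blank middle name is acceptable
    r'(?P<status>[EN])\r?\n?$'
)

def make_line(last, first, middle):
    """Hypothetical helper that pads the sample names to the fixed widths."""
    return "66601234520000526" + last.ljust(20) + first.ljust(20) + middle.ljust(20) + "N"

print(bool(strict_pattern.match(make_line("SMITH", "JOHN", "ALBERT"))))  # True
print(bool(strict_pattern.match(make_line("", "JOHN", "ALBERT"))))       # False, blank last name
```

The {0,19} bound matters: without it, the lookahead could find a letter in the next field and let an all-spaces name slip through.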


Call me a nerd but I think this is really neat!

That is all.

Peace out.

~ nerdspice