Regex is Pretty Neat
Published • 5 min read
- #programming
- #regex
- #regular
- #expression
Have you ever been handed a massive file with zero feedback on why it is failing validation? Yeah, that was me last week.
The file had tens of thousands of lines in it and it was nearly impossible to scroll through it by hand and find the problem.
The Problem
The file in question is a fixed-width file, where certain data must appear at certain positions. For example, you can think of a customer record formatted as a 9-digit SSN, a YYYYMMDD date of birth, last name (space-padded to 20 characters), first name (20 characters), middle name (20 characters), customer number (20 characters), then N or E to mark a new or existing customer.
You can imagine why this has drawbacks. Say an employee adds a new customer to the database but instead of putting the suffix in the suffix field, they enter “2ND” as part of the last name.
The file only accepts alphabetical characters for the name fields.
It is almost impossible to scroll through tens of thousands of lines and notice this.
Over time, we had added all kinds of string replacements to the procedure that fetches the data from the database, to strip out special characters, quotes, backticks, etc. We had been using this procedure for months and had never run across this issue.
Finding the Needle in the Haystack
As a regulated business, there are certainly deadlines for these files to be submitted and processed and we needed something quick to be able to identify what was wrong with the file.
After some time trying to find the issue manually, I finally landed on using Regex to parse the file. I ended up going down a whole rabbit hole learning more about Regex. Now, obviously, as a software developer, I have encountered Regex before, but I was not super familiar with all the neat things you can do, like named capture groups.
Enter Regex
I eventually landed on a Regex that looks like this:
^(?P<ssn>\d{9})(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))(?P<last>[A-Z ]{20})(?P<first>[A-Z ]{20})(?P<middle>[A-Z ]{20})(?P<status>[EN])\r?\n?$
There’s also this neat website, Regex 101, where you can enter a regex and it highlights the matches and even shows the grouped information.
Obviously you would not want to paste real customer data into some arbitrary website (I didn’t), and the example problem I’m explaining here is not exactly what I was solving for, but it’s substantially similar for illustrative purposes.
Breaking Down the Regex
Here’s what each part of the regex does:
^(?P<ssn>\d{9})
- ^ → start of the line
- \d{9} → exactly 9 digits (the SSN)
- (?P<ssn> … ) → puts those digits into a named capture group called ssn
(?P<dob>
(?P<year>\d{4})
(?P<month>0[1-9]|1[0-2])
(?P<day>0[1-9]|[12]\d|3[01])
)
- Groups together the date of birth as dob
- Inside that:
  - (?P<year>\d{4}) → 4 digits for the year
  - (?P<month>0[1-9]|1[0-2]) → valid months 01–12
  - (?P<day>0[1-9]|[12]\d|3[01]) → valid days 01–31
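Note that the day alternation only bounds the value to 01–31, so a string like 20000230 still passes the pattern. If real calendar validity matters, one option is to hand the captured dob to Python's datetime.strptime, which rejects impossible dates. A minimal sketch; is_real_date is a hypothetical helper name and not part of the script later in this post:

```python
from datetime import datetime


def is_real_date(dob: str) -> bool:
    """Return True if dob (YYYYMMDD) is an actual calendar date."""
    try:
        datetime.strptime(dob, "%Y%m%d")
        return True
    except ValueError:
        # strptime raises ValueError for dates such as February 30
        return False


print(is_real_date("20000526"))  # True
print(is_real_date("20000230"))  # False
```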
(?P<last>[A-Z ]{20})
(?P<first>[A-Z ]{20})
(?P<middle>[A-Z ]{20})
- Each name field is exactly 20 characters
- Only capital letters (A–Z) and spaces allowed
- Stored in capture groups last, first, and middle
(?P<status>[EN])
- A single character that must be E (existing) or N (new)
\r?\n?$
- Allows for optional Windows (\r\n) or Unix (\n) line endings
- $ → end of the line
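The breakdown above can also live inside the pattern itself. Python's re.VERBOSE flag ignores whitespace and allows # comments in a pattern, except inside a character class, so the space in [A-Z ] still counts. A sketch of the same regex written that way; verbose_pattern and sample are names made up for this example:

```python
import re

# Same pattern as above, one commented piece per line.
verbose_pattern = re.compile(
    r"""
    ^(?P<ssn>\d{9})                      # 9-digit SSN
    (?P<dob>
        (?P<year>\d{4})                  # 4-digit year
        (?P<month>0[1-9]|1[0-2])         # month 01-12
        (?P<day>0[1-9]|[12]\d|3[01])     # day 01-31
    )
    (?P<last>[A-Z ]{20})                 # last name, padded to 20
    (?P<first>[A-Z ]{20})                # first name, padded to 20
    (?P<middle>[A-Z ]{20})               # middle name, padded to 20
    (?P<status>[EN])                     # N = new, E = existing
    \r?\n?$                              # optional line ending
    """,
    re.VERBOSE,
)

# Made-up sample record; ljust pads each name out to 20 characters.
sample = (
    "66601234520000526"
    + "SMITH".ljust(20)
    + "JOHN".ljust(20)
    + "ALBERT".ljust(20)
    + "N"
)
match = verbose_pattern.match(sample)
print(match.group("last").rstrip())  # SMITH
```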
Testing the Regex
Using Regex 101, I put in the regular expression and some sample data.
If there is an entry that is invalid, such as one with the suffix “2ND” in the last name, the regular expression will not match, which helps identify the error.
The awesome part of this is that you can pick out specific parts using Regex. The Regex 101 site shows the captured groups over on the right side.
Using Named Capture Groups in Scripting
The named capture groups can be taken a step further by using them in scripting. For example, let’s say you wanted to parse this file. You could pick out each section and even show lines that don’t match.
A simple Python script example:
import re

# Your regex pattern
pattern = re.compile(
    r'^(?P<ssn>\d{9})'
    r'(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))'
    r'(?P<last>[A-Z ]{20})'
    r'(?P<first>[A-Z ]{20})'
    r'(?P<middle>[A-Z ]{20})'
    r'(?P<status>[EN])\r?\n?$'
)

# Example data as a list of lines.
# Each name is space-padded to 20 characters, as the fixed-width layout requires;
# ljust makes that padding explicit.
data = [
    "66601234520000526" + "SMITH".ljust(20) + "JOHN".ljust(20) + "ALBERT".ljust(20) + "N",
    "66654321020000526" + "DOE".ljust(20) + "JANE".ljust(20) + "SUE".ljust(20) + "E",
    "66654321020000526" + "DOE 2ND".ljust(20) + "JANE".ljust(20) + "SUE".ljust(20) + "E",
]

for line in data:
    match = pattern.match(line)
    if match:
        entry = match.groupdict()
        # Trim space-padded fields
        entry["last"] = entry["last"].rstrip()
        entry["first"] = entry["first"].rstrip()
        entry["middle"] = entry["middle"].rstrip()
        print(entry)
    else:
        print(f"Line did not match: {line}")
Script output:
{'ssn': '666012345', 'dob': '20000526', 'year': '2000', 'month': '05', 'day': '26', 'last': 'SMITH', 'first': 'JOHN', 'middle': 'ALBERT', 'status': 'N'}
{'ssn': '666543210', 'dob': '20000526', 'year': '2000', 'month': '05', 'day': '26', 'last': 'DOE', 'first': 'JANE', 'middle': 'SUE', 'status': 'E'}
Line did not match: 66654321020000526DOE 2ND             JANE                SUE                 E
Wow! It’s so simple but really powerful.
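One limitation: when a line fails, the single pattern only says that it failed, not where. Because the layout is fixed-width, each field can be sliced out by position and checked on its own, which points straight at the bad field. A rough sketch under the assumption that the layout matches the regex above; FIELDS, find_bad_fields, and report are hypothetical names for this example:

```python
import re

# Field layout assumed from the regex above: (name, width, per-field pattern).
FIELDS = [
    ("ssn", 9, r"\d{9}"),
    ("dob", 8, r"\d{4}(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])"),
    ("last", 20, r"[A-Z ]{20}"),
    ("first", 20, r"[A-Z ]{20}"),
    ("middle", 20, r"[A-Z ]{20}"),
    ("status", 1, r"[EN]"),
]
RECORD_WIDTH = sum(width for _, width, _ in FIELDS)  # 78 characters


def find_bad_fields(line: str) -> list[str]:
    """Return the names of the fields that fail their own pattern."""
    line = line.rstrip("\r\n")
    if len(line) != RECORD_WIDTH:
        # Offsets are meaningless if the record has the wrong length.
        return [f"length {len(line)} != {RECORD_WIDTH}"]
    bad = []
    offset = 0
    for name, width, field_pattern in FIELDS:
        chunk = line[offset:offset + width]
        if not re.fullmatch(field_pattern, chunk):
            bad.append(name)
        offset += width
    return bad


def report(lines):
    """Print the 1-based line number and failing fields for each bad line."""
    for line_number, line in enumerate(lines, start=1):
        bad = find_bad_fields(line)
        if bad:
            print(f"line {line_number}: {', '.join(bad)}")


# Made-up sample records, padded with ljust.
sample_lines = [
    "66601234520000526" + "SMITH".ljust(20) + "JOHN".ljust(20) + "ALBERT".ljust(20) + "N",
    "66654321020000526" + "DOE 2ND".ljust(20) + "JANE".ljust(20) + "SUE".ljust(20) + "E",
]
report(sample_lines)  # prints: line 2: last
```

Since report accepts any iterable of lines, an open file object can be passed in directly, which gives line numbers for a file with tens of thousands of rows.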
Other Considerations
When using Regex in a manner like this, it is important to consider all possible inputs. For example, the way this is written now, a name field that is left out and contains only spaces will still match, since the pattern accepts any mix of A-Z and spaces.
The Regex could be adjusted to require at least one alphabetic character in each name.
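One way that adjustment might look, assuming names are always left-justified in their field: require the first character to be a letter and let the remaining 19 be letters or spaces. This is only a sketch; strict_pattern is a made-up name, and leaving the middle name optional is my assumption, not a rule from the real file spec.

```python
import re

strict_pattern = re.compile(
    r'^(?P<ssn>\d{9})'
    r'(?P<dob>(?P<year>\d{4})(?P<month>0[1-9]|1[0-2])(?P<day>0[1-9]|[12]\d|3[01]))'
    r'(?P<last>[A-Z][A-Z ]{19})'    # must start with a letter
    r'(?P<first>[A-Z][A-Z ]{19})'   # must start with a letter
    r'(?P<middle>[A-Z ]{20})'       # may be blank in this sketch
    r'(?P<status>[EN])\r?\n?$'
)

# Made-up records: one with a blank last name, one with a blank middle name.
blank_last = "66601234520000526" + " " * 20 + "JOHN".ljust(20) + "ALBERT".ljust(20) + "N"
blank_middle = "66601234520000526" + "SMITH".ljust(20) + "JOHN".ljust(20) + " " * 20 + "N"

print(strict_pattern.match(blank_last))                  # None
print(strict_pattern.match(blank_middle) is not None)    # True
```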
Call me a nerd but I think this is really neat!
That is all.
Peace out.
~ nerdspice