The Feed Doctor on Regexes
This week, “J.” in Winslow, AZ, writes:
Dear Feed Doctor,
I sell model cars, and one of the attributes I want to put in my feed is the year of the car. Unfortunately, I don’t have any separate data field for the year; however, the title DOES have the year in it. Can I write a business rule that will pull the year out of the title? My titles are usually something like “Red 1978 Mercury Cougar.”
Well, at first glance, that doesn’t look very challenging, does it? I mean, if the pattern is just “Color Year etc.,” then we could use a function like GETPART to split the title into single words, and take the second one. But that would be too easy! Besides, it’s dangerous to generalize too much; what if J. has a “Metallic Mint Green 1963 Pontiac Tempest?” Nope, I’m afraid we’re going to have to deploy something a bit more powerful: a regular expression.
I’ve mentioned regular expressions (or “regexes”) before, calling them the “Swiss Army knife of text processing.” Well, in this post, I’m going to show how a regex can be a Deus ex Machina that will save J. from a poorly-plotted, yet certain, doom.
I’m not going to give a regex tutorial in this post; there’s enough material there for a whole series of posts. Let’s just say that if you can think of a pattern of characters—letters, numbers, anything you can type—then somebody can probably think of a regex to describe that pattern. If that pattern can be broken up into sub-patterns, that’s when you can do some real magic: you can re-arrange the characters, remove some, replace some, slice, dice, etc.
Let’s take J’s titles again. As far as we’re concerned, the title could be “blah blah blah 1971 blah blah blah.” In other words, the only thing we care about is a 4-digit year. Here’s a quick run-down of how we can describe this pattern using regex syntax.
- To match any single character (anything at all), use a period: .
- To match any number of any character at all, use a period followed by a plus. The plus means “see that thing to my left? Match that as many times as you can.” Since the period matches any single character, then period-plus matches everything: .+
- To match any single digit, use \d
- To match exactly 4 digits, use \d{4}. The {4} means “see that thing to my left? Match it exactly 4 times, no more, no less.” Of course, you can put any number inside the {}.
Ok, so going back to “blah blah blah 1971 blah blah blah,” I’ll just go ahead and tell you the regex for that is:
.+\d{4}.+
See, the first “.+” matches any sequence of characters, until it runs up against a sequence of 4 digits, which is matched by the “\d{4}”, and then after that the second “.+” matches any sequence of characters.
Ok, but that still doesn’t do much, does it? To save J., we’ve got to pop that 4-digit-sequence out of a trapdoor somewhere. To grab that sequence and throw everything else away, we create a group in the regex, and we use it later with a backreference. To create a group, you just put parentheses around the sub-pattern that you want to capture:
.+(\d{4}).+
Let’s say you had a bunch of groups in your regex; you could (mentally, of course) number them from left to right, starting with number 1. In this case, there’s just 1. The backreference is just a dollar-sign followed by the number of the group; in this case: $1.
Finally, you have to do something with the regex and the backreference. We have a function called REGEXREPLACE that has three inputs:
- The source text (in our case, the title)
- The regex (in our case,
.+(\d{4}).+) describing a pattern to search for - Some text to replace the pattern that is found (in our case, this is the backreference)
So, J’s rule is:
REGEXREPLACE($TITLE,".+(\d{4}).+","$1")
And then J. just assigns this rule to the year field in his template. Here are some examples of the rule in action:
| Example # | Title | Year (From Rule) |
|---|---|---|
| 1 | Red 1978 Mercury Cougar | 1978 |
| 2 | Metallic Mint Green 1963 Pontiac Tempest | 1963 |
| 3 | White 1979 Oldsmobile 98 | 1979 |
