Searching within a string for another string is pretty easy in Python:
[python]
>>> str = 'Hello, world'
>>> print(str.find('wor'))
7
[/python]
This is fine if we know exactly what we're looking for, but what if we're looking for something not so well-defined? For example, if we want to search for a year, we know it's going to be a 4-digit sequence, but we don’t know exactly what those digits are going to be. This is where regular expressions come in. They allow us to search for sub-strings based on a general description of what we're looking for, e.g. "a sequence of 4 consecutive digits".
In the example below, we import the re module, which contains Python's regular expression functionality, then call the search function with our regular expression (\d\d\d\d) and the string we want to search in:
[python]
>>> import re
>>> str = 'Today is 31 May 2012.'
>>> mo = re.search(r'\d\d\d\d', str)
>>> print(mo)
<_sre.SRE_Match object at 0x01D3A870>
>>> print(mo.group())
2012
>>> print('%s %s' % (mo.start(), mo.end()))
16 20
[/python]
In a regular expression, \d means any digit, so \d\d\d\d means any digit, any digit, any digit, any digit, or in plain English, 4 digits in a row. Regular expressions use backslashes a lot, and backslashes have a special meaning in Python string literals, so we put an r in front of the string to make it a raw string, which stops Python from interpreting the backslashes as escape sequences.
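To see what the r prefix changes, here is a minimal sketch comparing ordinary and raw string literals. The variable names are only for illustration.

```python
# In an ordinary string, \n is an escape sequence for a single newline character.
plain = '\n'
# In a raw string, the backslash is kept as a literal character.
raw = r'\n'

print(len(plain))   # 1
print(len(raw))     # 2

# Without the r prefix, each backslash has to be doubled to get the same pattern.
print(r'\d\d\d\d' == '\\d\\d\\d\\d')   # True
```

Both spellings give the re module the same pattern. The raw string is just easier to read.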
If re.search finds something that matches our regular expression, it returns a match object that holds information about what exactly was matched. In the example above, we print out the exact sub-string that was matched, and its start and end position within the string being searched.
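When nothing matches, re.search returns None instead of a match object, so calling group() on the result would raise an error. A minimal sketch of checking for that follows; find_year is a hypothetical helper name, not something the re module provides.

```python
import re

def find_year(text):
    """Return the first run of 4 digits in text, or None if there isn't one."""
    mo = re.search(r'\d\d\d\d', text)
    if mo is None:
        # No match: there is no match object to ask for group().
        return None
    return mo.group()

print(find_year('Today is 31 May 2012.'))   # 2012
print(find_year('Today is 31 May.'))        # None
```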
Note that Python didn't match the day of the month (31). It matched the first two characters, the 3 and the 1, against the first two \d's, but the next character (a space) did not match the third \d, so Python gave up on that position and continued searching the rest of the string.
Matching a Set of Characters
Let's try another example:
[python]
>>> str = 'Today is 2012-MAY-31'
>>> mo = re.search(r'\d\d\d\d-[A-Z][A-Z][A-Z]-\d\d', str)
>>> print(mo.group())
2012-MAY-31
[/python]
This time, our regular expression contains the new element [A-Z]. Square brackets mean "match exactly one of these characters". For example, [abc] means Python will match an a or b or c, but no other letters. Since we want to match any letter between A and Z, we could write out the entire alphabet ([ABCDEFGHIJKLMNOPQRSTUVWXYZ]) but thankfully, Python allows us to shorten this using a hyphen ([A-Z]). So, our regular expression is \d\d\d\d-[A-Z][A-Z][A-Z]-\d\d, which means:
- Look for (or match) a digit (4 times).
- Match a '-' character.
- Match a letter between A and Z (three times).
- Match a '-' character.
- Match a digit (2 times).
And as the example above shows, Python found the date embedded within the string.
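To look at character sets on their own, here is a small sketch. It uses re.findall, which the examples above don't use: it returns a list of every non-overlapping match rather than just the first one. The sample strings are only for illustration.

```python
import re

# [abc] matches a single character that is an a, a b or a c.
print(re.findall(r'[abc]', 'cabbage'))                # ['c', 'a', 'b', 'b', 'a']

# [A-Z] is a range: any single upper-case letter from A to Z.
print(re.findall(r'[A-Z]', 'Today is 2012-MAY-31'))   # ['T', 'M', 'A', 'Y']
```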
Unfortunately, our regular expression currently only handles upper-case month names:
[python]
# The month uses lower-case letters
>>> str = 'Today is 2012-May-31'
>>> mo = re.search(r'\d\d\d\d-[A-Z][A-Z][A-Z]-\d\d', str)
>>> print(mo)
None
[/python]
There are two ways we can fix this. We can pass in a flag that says the search should be case-insensitive:
[python]
>>> str = 'Today is 2012-May-31'
>>> mo = re.search(r'\d\d\d\d-[A-Z][A-Z][A-Z]-\d\d', str, re.IGNORECASE)
>>> print(mo.group())
2012-May-31
[/python]
Alternatively, we can extend the character set to specify more characters: [A-Za-z] means capital A to capital Z, or lower-case a to lower-case z.
[python]
>>> str = 'Today is 2012-May-31'
>>> mo = re.search(r'\d\d\d\d-[A-Za-z][A-Za-z][A-Za-z]-\d\d', str)
>>> print(mo.group())
2012-May-31
[/python]
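The two fixes are not quite interchangeable: re.IGNORECASE applies to the whole pattern, while [A-Za-z] only relaxes the one element it replaces. The sketch below compares them. It uses re.compile to build reusable pattern objects, and the names flag_pattern, set_pattern and samples are only for illustration.

```python
import re

samples = ['2012-MAY-31', '2012-May-31', '2012-may-31']

flag_pattern = re.compile(r'\d\d\d\d-[A-Z][A-Z][A-Z]-\d\d', re.IGNORECASE)
set_pattern = re.compile(r'\d\d\d\d-[A-Za-z][A-Za-z][A-Za-z]-\d\d')

# Both approaches accept every spelling of the month.
results = [(bool(flag_pattern.search(s)), bool(set_pattern.search(s)))
           for s in samples]
print(results)   # [(True, True), (True, True), (True, True)]

# The flag also affects literal letters elsewhere in the pattern...
print(re.search(r'x[A-Z]', 'Xy', re.IGNORECASE).group())   # Xy
# ...whereas a wider character set leaves the literal x case-sensitive.
print(re.search(r'x[A-Za-z]', 'Xy'))                       # None
```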
Repetitions with Regular Expressions
The regular expression in the last example is starting to get a bit unwieldy, so let’s take a look at how we can simplify it.
In a regular expression, {n} (where n is a number) means match the previous element exactly n times. So we could re-write this regular expression:
[python]\d\d\d\d-[A-Za-z][A-Za-z][A-Za-z]-\d\d[/python]
into this:
[python]\d{4}-[A-Za-z]{3}-\d{2}[/python]
This means:
- Match any digit (4 times).
- Match a '-' character.
- Match a letter A-Z or a-z (3 times).
- Match a '-' character.
- Match any digit (2 times).
Here it is in action:
[python]
>>> str = 'Today is 2012-May-31'
>>> mo = re.search(r'\d{4}-[A-Za-z]{3}-\d{2}', str)
>>> print(mo.group())
2012-May-31
[/python]
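One thing worth checking is what {4} does when the digits in the string don't number exactly four. A short sketch, with made-up sample strings:

```python
import re

# The match itself is exactly four digits, even if more digits follow.
print(re.search(r'\d{4}', 'id 123456').group())   # 1234

# Fewer than four digits in a row is not enough for a match.
print(re.search(r'\d{4}', 'id 123'))              # None
```

So \d{4} on its own finds four digits in a row, but doesn't insist that they stand alone.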
We have a lot of flexibility when specifying how many repetitions should be matched.
- We can specify a range, e.g. {2,4} means match 2 to 4 repetitions.
[python]
>>> str = 'abc12345def'
>>> mo = re.search(r'\d{2,4}', str)
>>> print(mo.group())
1234
[/python]
- We can leave out the upper value, e.g. {2,} means "match 2 or more repetitions".
[python]
>>> str = "abc12345def"
>>> mo = re.search(r'\d{2,}', str)
>>> print(mo.group())
12345
[/python]
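In both examples above, Python matched as many digits as the range allowed. The sketch below shows the same behaviour across a whole string, using re.findall (not used in the examples above) to list every match. The sample strings are only for illustration.

```python
import re

# {2,4} takes up to four digits at a time, then carries on with what's left.
print(re.findall(r'\d{2,4}', 'abc1234567def'))   # ['1234', '567']

# {2,} has no upper limit, so it takes the whole run of digits.
print(re.findall(r'\d{2,}', 'abc1234567def'))    # ['1234567']

# A single digit on its own is below the minimum of 2, so it is skipped.
print(re.findall(r'\d{2,4}', 'a1b22c333'))       # ['22', '333']
```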
Shorthand for Common Repetitions
Some types of repetitions are so common, they have their own syntax.
{1,} means match the previous element one or more times, but this can also be written using the special + operator (e.g. \d+).
[python]
>>> str = 'abc12345def'
>>> mo = re.search(r'\d+', str)
>>> print(mo.group())
12345
[/python]
{0,} means match the previous element zero or more times, but this can also be written using the * operator (e.g. \d*).
[python]
>>> str = 'abc12345def'
>>> mo = re.search(r'\d*', str)
>>> print(mo.group())
[/python]
Yikes, what happened?! Why didn’t this print anything? Well, you have to be careful with the * operator because it will match zero or more repetitions. In this case, Python looked at the first character of the string being searched and said to itself: "Is this a digit? No. Have I matched zero or more digits? Yes (zero), so the regular expression has been matched." If we look at what the match object tells us:
[python]
>>> print('%s %s' % (mo.start(), mo.end()))
0 0
[/python]
We can see that this is exactly what's happened: it has matched an empty sub-string at the very start of the string being searched. Let's change our regular expression slightly:
[python]
>>> str = 'abc12345def'
>>> mo = re.search(r'c\d*', str)
>>> print(mo.group())
c12345
[/python]
Now our regular expression says match the letter c, then zero or more digits, and that's what Python then finds.
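Another way to avoid the empty match, when at least one digit is required, is to use + instead of *. A minimal sketch comparing the two on the same string (the variable name text is only for illustration):

```python
import re

text = 'abc12345def'

# \d* is satisfied by zero digits, so it matches the empty string at position 0.
print(re.search(r'\d*', text).span())    # (0, 0)

# \d+ needs at least one digit, so the search moves on until it finds some.
print(re.search(r'\d+', text).span())    # (3, 8)
print(re.search(r'\d+', text).group())   # 12345
```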
{0,1} means match the previous element 0 or 1 times, but this can also be written using the ? operator (e.g. \d?).
[python]
>>> str = 'abc12345def'
>>> mo = re.search(r'c\d?', str)
# Note: the \d was matched 1 time
>>> print(mo.group())
c1
>>> mo = re.search(r'b\d?', str)
# Note: the \d was matched 0 times
>>> print(mo.group())
b
[/python]
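The ? operator is handy whenever part of what we're looking for is optional. A small sketch with made-up sample strings, using re.findall (not used in the examples above) to collect every match:

```python
import re

# The u is optional, so both spellings are found.
print(re.findall(r'colou?r', 'color and colour'))   # ['color', 'colour']

# An optional minus sign followed by one or more digits.
print(re.findall(r'-?\d+', 'from -5 to 12'))        # ['-5', '12']
```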
In the next article, we'll move on to more advanced usage of regular expressions.