Real-World Regular Expressions for Python

We’ve covered a lot of ground in this series of articles, so let’s now put it all together and work through a real-life application.

A common task is parsing a Windows INI file, which holds key/value pairs grouped into sections, something like this:

[python]
[Section 1]
val1=hello world
val2=42
[Section 2]
val1=foo!
[/python]

Let’s first write a bit of Python code that reads in a test file, line by line:

[python]
with open('test.ini', 'r') as iniFile:
    for lineBuf in iniFile:
        print(lineBuf)
[/python]

We will now extend this by writing some regular expressions that figure out what is on each line.

Identifying section headers

The first thing we’ll do is write a regular expression that recognizes a section header, that is, a line that starts and ends with square brackets. We could write such a regular expression like this: ^\[(.+)\]$

In plain English:

Match ^ (the start of the line).
Match a [ character (escaped, since [ normally has a special meaning in a regular expression).
Match one or more characters (the section name), captured in a group.
Match a ] character (it’s actually not necessary to escape this).
Match $ (the end of the line).
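The steps above can be tried out on a few strings before touching a file. This is only a small sketch: the pattern is the one from the text, while the sample strings and the results list are made up for illustration.

```python
import re

# Same pattern as in the text.
sectionRegEx = re.compile(r'^\[(.+)\]$')

# Illustrative inputs (assumed, not taken from a real INI file).
samples = ['[Section 1]', 'val1=hello world', '[]']

results = []
for text in samples:
    mo = sectionRegEx.search(text)
    # group(1) is whatever was captured between the square brackets.
    results.append(mo.group(1) if mo else None)

print(results)  # ['Section 1', None, None]
```

The empty header [] is rejected because .+ insists on at least one character between the brackets.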

If we update our code to use this regular expression:

[python]
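import re

sectionRegEx = re.compile(r'^\[(.+)\]$')
with open('test.ini', 'r') as iniFile:
    for lineBuf in iniFile:
        # search() returns a match object, or None if the line is not a section header
        mo = sectionRegEx.search(lineBuf)
        if mo:
            print('Found a section: [%s]' % mo.group(1))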
[/python]

We get this output:

[shell]
Found a section: [Section 1]
Found a section: [Section 2]
[/shell]

Seems to work fine!

Handling white-space in section headers

It would be handy to cope with stray white-space in section headers, so that if somebody gave us an INI file that looked like this:

[python]
[Section 1]
val1=hello world
val2=42
[ Section 2 ] junk here!
val1=foo!
[/python]

We would be able to handle the oddly-written second section header properly. Right now, our code doesn’t find it, because the $ anchor requires the closing ] to be the last thing on the line. Let’s update the regular expression to handle it: ^\s*\[\s*(.+?)\s*\]

In plain English:

Match ^ (the start of the line).
Match \s* (zero or more white-space characters).
Match a [ character.
Match \s* (zero or more white-space characters).
Match one or more characters (the section name).
Match \s* (zero or more white-space characters).
Match a ] character.

Note that we had to make the + quantifier (in the group that captures the section name) non-greedy by writing +?, to stop it from matching any trailing spaces that might appear before the closing ]. We also stop matching after the closing ] since we don’t care if there’s anything on the line after it.
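To see what the non-greedy quantifier buys us, here is a short comparison of the two variants on the odd header from the test file. The names greedyRegEx and lazyRegEx are made up for this sketch; only the second pattern is the one used in the article.

```python
import re

# Greedy variant (for comparison only) and the non-greedy pattern from the text.
greedyRegEx = re.compile(r'^\s*\[\s*(.+)\s*\]')
lazyRegEx = re.compile(r'^\s*\[\s*(.+?)\s*\]')

header = '[ Section 2 ] junk here!'

greedyName = greedyRegEx.search(header).group(1)
lazyName = lazyRegEx.search(header).group(1)

# repr() makes any trailing space visible.
print(repr(greedyName))  # 'Section 2 '
print(repr(lazyName))    # 'Section 2'
```

With the greedy .+ the group swallows the space before the closing ], leaving nothing for the following \s* to trim.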

Now our code recognizes the weirdly formatted section name:

[python]
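import re

sectionRegEx = re.compile(r'^\s*\[\s*(.+?)\s*\]')
with open('test.ini', 'r') as iniFile:
    for lineBuf in iniFile:
        # group(1) is the section name with surrounding white-space trimmed
        mo = sectionRegEx.search(lineBuf)
        if mo:
            print('Found a section: [%s]' % mo.group(1))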
[/python]

This gives us the following output:

[shell]
Found a section: [Section 1]
Found a section: [Section 2]
[/shell]

The second section header has been found and its name cleaned up.

Identifying key/value pairs

The next step is to write a regular expression that identifies the key/value pairs, maybe something like this: ^(.+)=(.+)$

In plain English:

Match ^ (the start of the line).
Match one or more characters (the key name), captured in a group.
Match the = character.
Match one or more characters (the key value), captured in a group.
Match $ (the end of the line).

Again, we’d like this regular expression to handle extraneous white-space, so let’s re-write it like this: ^\s*(.+?)\s*=\s*(.+?)\s*$
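As a quick check of the difference between the two patterns, the sketch below runs both on the same padded line. The names simpleRegEx and trimRegEx and the sample line are assumptions made for this illustration.

```python
import re

# First attempt and the white-space-tolerant rewrite, both from the text.
simpleRegEx = re.compile(r'^(.+)=(.+)$')
trimRegEx = re.compile(r'^\s*(.+?)\s*=\s*(.+?)\s*$')

# Illustrative input with extra spaces around the key and the value.
line = '  val2 = 42  '

simpleGroups = simpleRegEx.search(line).groups()
trimGroups = trimRegEx.search(line).groups()

print(simpleGroups)  # ('  val2 ', ' 42  ')
print(trimGroups)    # ('val2', '42')
```

The first pattern keeps the padding inside the captured groups, while the second leaves it to the \s* parts outside them.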

And our updated code now looks like this:

[python]
import re

sectionRegEx = re.compile(r'^\s*\[\s*(.+?)\s*\]')
keyValRegEx = re.compile(r'^\s*(.+?)\s*=\s*(.+?)\s*$')
with open('test.ini', 'r') as iniFile:
    for lineBuf in iniFile:
        # Is this line a section header?
        mo = sectionRegEx.search(lineBuf)
        if mo:
            print('Found a section: [%s]' % mo.group(1))
        # Is this line a key/value pair?
        mo = keyValRegEx.search(lineBuf)
        if mo:
            print('{%s} = {%s}' % (mo.group(1), mo.group(2)))
[/python]

We wrap the key name and values in curly braces when we print them out so that we can see if they have been trimmed correctly.

If we give it the following test input:

[shell]
[Section 1]
val1=hello world
val2 = 42 = forty-two
[ Section 2 ] junk here!
val1=foo!
[/shell]

We get the following output:

[shell]
Found a section: [Section 1]
{val1} = {hello world}
{val2} = {42 = forty-two}
Found a section: [Section 2]
{val1} = {foo!}
[/shell]
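Printing is fine for testing, but a program would usually want the values in a data structure. One possible way to do that, sketched here with a hypothetical helper called parseIniLines, is to collect everything into a dictionary of dictionaries. It assumes that key/value lines appearing before the first section header can simply be ignored, and that a repeated key overwrites the earlier value.

```python
import re

sectionRegEx = re.compile(r'^\s*\[\s*(.+?)\s*\]')
keyValRegEx = re.compile(r'^\s*(.+?)\s*=\s*(.+?)\s*$')

def parseIniLines(lines):
    """Hypothetical helper: return {section name: {key: value}}."""
    sections = {}
    current = None  # dictionary for the section we are currently inside
    for lineBuf in lines:
        mo = sectionRegEx.search(lineBuf)
        if mo:
            # Start a new section, or reuse it if the name was seen before.
            current = sections.setdefault(mo.group(1), {})
            continue
        mo = keyValRegEx.search(lineBuf)
        # Assumption: pairs that appear before any section header are skipped.
        if mo and current is not None:
            current[mo.group(1)] = mo.group(2)
    return sections

# The same test input as above, given as a list of lines.
testLines = [
    '[Section 1]\n',
    'val1=hello world\n',
    'val2 = 42 = forty-two\n',
    '[ Section 2 ] junk here!\n',
    'val1=foo!\n',
]
parsed = parseIniLines(testLines)
print(parsed)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly instead of the list.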
