Regular Expressions

Overview

Questions

  • What is a regular expression?
  • When should I use regular expressions?

Objectives:

  • Use regular expressions to pull information from a complex data file.

Keypoints:

  • Regular expressions allow you to define patterns of strings using special metacharacters.

In the first Python and Data Scripting Workshop we learned how to parse files using open and readlines, then looping through the lines to find specific phrases. This works fine for some cases. However, there are times when this approach will be either slow or impossible.

Sometimes, you will want to look for text which resembles a certain pattern.

Regular expressions will allow us to define patterns of text we are looking for rather than hardcoding specific phrases.

For this section, we will be parsing output files from the Spartan software. This dataset was published as supplemental information for this paper where the authors used molecular descriptors obtained from the Spartan software along with machine learning methods to predict reaction yields.

To follow along with this section, you should download this file and put it in your data folder (name it output).

We will see later how to loop over all of the files in this repository. We will be working with files in the rxnpredict/spartan_molecules directory. First, we will test our regular expressions on one file, then we will apply them to all of our files to get our molecular descriptors.

import os

import pandas as pd
file_path = os.path.join("data", "rxnpredict", "output")

Because this file is not tabular or structured, we can’t easily read it using NumPy or pandas. We will use a method from the first Python scripting workshop where we read it in using the open function. Note that there are many ways to open files in Python. If you are reading a file that is only partially tabular, you may still want to use pandas with appropriate arguments to the read functions. However, these files contain a lot of information and are mostly unstructured, so we will first pull out some information using regular expressions.

We use the syntax with open because this allows us to open and automatically close the file. We will read the file inside the open block. The read function in Python is used on an open file object. All of the file contents will be pulled into a string called data.

with open(file_path) as f:
    data = f.read()
# print the first 500 characters of data
print(data[:500])
SPARTAN '14 CONFORMATION SEARCH:   (Win/64b)                      Release  1.1.4


 Conformation Search


 Initializing 4 threads

  Reason for exit: Successful completion 
  Conformer Program CPU Time :          .30
  Conformer Program Wall Time:          .07

SPARTAN '14 Quantum Mechanics Driver:  (Win/64b)         Release  1.1.4

Job type: Geometry optimization.
Method: RB3LYP
Basis set: 6-31G(D)
Number of shells: 56
Number of basis functions: 193
Multiplicity: 1
Parallel Job: 4 threads

SCF 
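
As a small aside on the two reading approaches mentioned above, the sketch below contrasts read, which returns the whole file as one string, with readlines, which returns a list of lines. It writes a throwaway file to a temporary folder so that it does not depend on the workshop data; the file name example.out and its contents are made up for illustration.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on the workshop data.
# The name "example.out" and the two lines of text are hypothetical.
with tempfile.TemporaryDirectory() as folder:
    example_path = os.path.join(folder, "example.out")
    with open(example_path, "w") as f:
        f.write("Job type: Single point.\nMethod: RB3LYP\n")

    with open(example_path) as f:
        text = f.read()          # the whole file as one string
    print(f.closed)              # True: the file was closed when the block ended

    with open(example_path) as f:
        lines = f.readlines()    # a list with one string per line

print(repr(text))
print(lines)
```

A single string is convenient for regular expressions because one call to findall can search the entire file at once.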

Basic Matching

We are going to process this file using regular expressions. Regular expressions will let us define patterns that we want to look for in our text. To use regular expressions, you start by importing the appropriate module - import re.

import re

At its simplest, a regular expression matches the exact pattern you specify. For example, if we wanted to find all the places the word energy occurred in the file, we could do that.

We create a regular expression pattern by using re.compile. Inside this function, you put the pattern you want to look for. Then you use pattern.findall and pass the string you would like to search as an argument. The function will return a list of matches in the text.

pattern = re.compile("energy")
matches = pattern.findall(data)
print(matches)
['energy', 'energy']

Using metacharacters

This is not that interesting, but we can see that the word “energy” is found twice in the text. Notice that the match is exact: only the word energy (lowercase e) is found. This is where the power of regular expressions comes in. Suppose you want to look for either the word energy or the word Energy. You can use special characters in your regular expression to say that either letter is acceptable.

To match something other than literal characters, you use special characters in your regular expressions (referred to as metacharacters). Here are some examples of metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

The first we will use is [ ]. For regular expressions, these indicate a class of characters you want to search for. For example, to look for e or E, we could use [Ee].

pattern = re.compile("[Ee]nergy")
matches = pattern.findall(data)
print(matches)
['Energy', 'energy', 'energy', 'Energy', 'Energy']

Another option is to keep the pattern as energy and pass the re.IGNORECASE flag as a second argument to re.compile, which makes the whole match case insensitive.

pattern = re.compile("energy", re.IGNORECASE)
matches = pattern.findall(data)
print(matches)
['Energy', 'energy', 'energy', 'Energy', 'Energy']

Character classes are more useful for matching a range of characters. For example, the pattern below will find any single uppercase letter surrounded by spaces.

pattern = re.compile(" [A-Z] ")
matches = pattern.findall(data)
print(matches)
[' A ', ' S ', ' X ', ' Y ', ' Z ', ' C ', ' C ', ' C ', ' C ', ' C ', ' C ', ' H ', ' H ', ' H ', ' H ', ' C ', ' F ', ' F ', ' F ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S 
', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' 
S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' X ', ' Y ', ' Z ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', 
' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' Z ', ' X ', ' Y ', ' K ', ' A ', ' B ', ' A ', ' B ', ' A ', ' B ', ' X ', ' Y ', ' X ', ' Y ', ' X ', ' Y ', ' X ', ' Y ']
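
The list above is long because every match is reported separately. As a rough sketch of how such output could be summarized, the example below runs the same pattern on a short made-up string (the sample text is an assumption, not part of the Spartan file) and tallies the matches with collections.Counter.

```python
import re
from collections import Counter

# Made-up sample text for illustration; not taken from the Spartan output.
sample = "Atom C 1 x Atom H 2 y Atom C 3 Z 9"

# [A-Z] matches exactly one uppercase letter; the spaces on either side are literal.
pattern = re.compile(" [A-Z] ")
matches = pattern.findall(sample)
print(matches)   # [' C ', ' H ', ' C ', ' Z ']

# Count how often each letter occurs instead of printing every match.
counts = Counter(match.strip() for match in matches)
print(counts.most_common())   # [('C', 2), ('H', 1), ('Z', 1)]
```

The lowercase x and y are skipped because they fall outside the A-Z range, and the A in Atom is skipped because it is not followed by a space.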

Let’s keep building on one of these and get what follows the word “energy”. Another special character recognized by regex is the period. The period matches any character except a newline. If we add a period after the word energy, the pattern will match the word energy (case insensitive) followed by any one character except a newline.

pattern = re.compile("energy.", re.IGNORECASE)
matches = pattern.findall(data)
print(matches)
['Energy ', 'energy:', 'energy:', 'Energy ', 'Energy ']

After adding the dot, you will see another trailing character has been added to our results.

We can specify how many times we want a special character or group of characters repeated by using {} and specifying the number of times. For example, the pattern below matches the word energy and the two characters that follow it.

pattern = re.compile("energy.{2}", re.IGNORECASE)
matches = pattern.findall(data)
print(matches)
['Energy  ', 'energy: ', 'energy: ', 'Energy D', 'Energy S']

You can also use * to specify 0 or more matches, or + to specify one or more matches. In this case, you do not need the curly braces {}. In general, this is the pattern you will follow when not doing exact matches. You will define something using the special metacharacters, followed by how many times to look for that metacharacter.

pattern = re.compile("energy.+", re.IGNORECASE)
matches = pattern.findall(data)
print(matches)
['Energy          Max Grad.      Max Dist. ', 'energy:   -3142.5663410 hartrees', 'energy:   -3142.5663410 hartrees', 'Energy Due to Solvation', 'Energy SM5.4/A          1.983']
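
To make the difference between the three repetition forms concrete, here is a minimal sketch on a made-up three-line string (the text and the value -1.5 are invented for illustration). The last line ends immediately after the word energy, which is where * and + behave differently.

```python
import re

# Made-up sample text; the last line ends right after the word "energy".
sample = "Energy Due to Solvation\nSCF total energy: -1.5 hartrees\nzero-point energy"

# .{2} : exactly two characters (newline excluded) must follow.
two_chars = re.compile("energy.{2}", re.IGNORECASE).findall(sample)
print(two_chars)      # ['Energy D', 'energy: ']

# .+ : at least one character must follow, so the bare "energy" is skipped.
one_or_more = re.compile("energy.+", re.IGNORECASE).findall(sample)
print(one_or_more)    # ['Energy Due to Solvation', 'energy: -1.5 hartrees']

# .* : zero or more characters may follow, so the bare "energy" is matched too.
zero_or_more = re.compile("energy.*", re.IGNORECASE).findall(sample)
print(zero_or_more)   # ['Energy Due to Solvation', 'energy: -1.5 hartrees', 'energy']
```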

Check your understanding

Modify the regular expression we wrote above to match uppercase words that are between three and six letters long. *Challenge* - Write a regular expression for uppercase words that are at least three characters long (no upper bound on length).

pattern = re.compile(" [A-Z] ")
matches = pattern.findall(data)
print(matches)

Within a pattern, you can use parentheses () to create groups. For example, if we wanted to separate the word energy from what follows it, we could surround the relevant parts with parentheses. In the pattern below, we’ve also allowed zero or more whitespace characters between the two groups with \s*.

# A raw string (r"...") keeps the backslash in \s from being treated as a Python string escape
pattern = re.compile(r"(energy)\s*(.+)", re.IGNORECASE)
matches = pattern.findall(data)
print(matches)
[('Energy', 'Max Grad.      Max Dist. '), ('energy', ':   -3142.5663410 hartrees'), ('energy', ':   -3142.5663410 hartrees'), ('Energy', 'Due to Solvation'), ('Energy', 'SM5.4/A          1.983')]
matches[1]
('energy', ':   -3142.5663410 hartrees')
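
When a pattern contains groups, findall returns one tuple per match. If only the first match is needed, re.search is an alternative: it returns a match object (or None), and each group can be read with the group method. A minimal sketch using one line copied from the output above:

```python
import re

# One line copied from the Spartan output shown above.
line = " SCF total energy:   -3142.5663410 hartrees"

pattern = re.compile(r"(energy)\s*(.+)", re.IGNORECASE)

match = pattern.search(line)   # a match object, or None if nothing matched
if match is not None:
    print(match.group(0))      # the whole match
    print(match.group(1))      # first group: the word itself
    print(match.group(2))      # second group: everything after the word
```

Group 0 always refers to the entire matched text, while groups 1 and up follow the order of the opening parentheses in the pattern.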

You can imagine how this might be more useful if we wanted to pull out all of the key-value pairs, i.e. word : value.

pattern = re.compile(".+:.+")
matches = pattern.findall(data)
print(len(matches))
313

When writing a complicated regular expression, it is useful to be able to explain the pattern within the code. You can pass the flag re.VERBOSE to allow the use of # comments; in this mode, whitespace inside the pattern is ignored. To spread the pattern over multiple lines, use three quotes to start and end your string.

pattern = re.compile("""
                        .+             # One or more of any character
                        :              # A literal colon
                        .+             # One or more of any character excluding newline.
                        """, re.VERBOSE)
matches = pattern.findall(data)
print(len(matches))
313
matches[:25]
["SPARTAN '14 CONFORMATION SEARCH:   (Win/64b)                      Release  1.1.4",
 '  Reason for exit: Successful completion ',
 '  Conformer Program CPU Time :          .30',
 '  Conformer Program Wall Time:          .07',
 "SPARTAN '14 Quantum Mechanics Driver:  (Win/64b)         Release  1.1.4",
 'Job type: Geometry optimization.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 'Number of shells: 56',
 'Number of basis functions: 193',
 'Multiplicity: 1',
 'Parallel Job: 4 threads',
 'Job type: Frequency calculation.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 'Job type: Single point.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 ' SCF total energy:   -3142.5663410 hartrees',
 'Job type: Molecular property calculation.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 ' SCF total energy:   -3142.5663410 hartrees',
 '  Reason for exit: Successful completion ',
 '  Quantum Calculation CPU Time :     50:25.28']
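
One consequence of re.VERBOSE is worth a short sketch: because whitespace in the pattern is ignored, a space that is meant literally has to be escaped with a backslash (or placed in a character class such as [ ]). The sample string below reuses two lines from the output above.

```python
import re

sample = "Basis set: 6-31G(D)\nMethod: RB3LYP"

# In verbose mode the space inside the pattern is ignored, so this looks for "Basisset".
ignored_space = re.compile("""
    Basis set      # the space between the two words is dropped
    """, re.VERBOSE)
print(ignored_space.findall(sample))   # []

# Escaping the space keeps it as a literal character.
literal_space = re.compile(r"""
    Basis\ set     # an escaped space is a literal space
    :              # a literal colon
    .+             # the rest of the line
    """, re.VERBOSE)
print(literal_space.findall(sample))   # ['Basis set: 6-31G(D)']
```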

Check your understanding

Add parentheses in the appropriate places in the regular expression so that each match is grouped into (`key`, `value`).

pattern = re.compile(".+:.+")

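This pattern groups keys and values. It also adds \s* after the colon, which matches zero or more whitespace characters, so the whitespace before the value is discarded. Because \s also matches newlines, a key whose line ends at the colon is paired with the text on the following line, as in the ('Molecular descriptors', 'Molecular volume:       149.53 (Ang**3)') entry in the output below.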

pattern = re.compile(r"""
                        (.+)                    # The key: one or more of any character excluding newline
                        :\s*                    # A literal colon followed by 0 or more whitespace characters
                        (.+)                    # The value: one or more of any character excluding newline
                     """, re.VERBOSE)

matches = pattern.findall(data)
print(len(matches))

for match in matches:
    print(match)
317
("SPARTAN '14 CONFORMATION SEARCH", '(Win/64b)                      Release  1.1.4')
('  Reason for exit', 'Successful completion ')
('  Conformer Program CPU Time ', '.30')
('  Conformer Program Wall Time', '.07')
("SPARTAN '14 Quantum Mechanics Driver", '(Win/64b)         Release  1.1.4')
('Job type', 'Geometry optimization.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
('Number of shells', '56')
('Number of basis functions', '193')
('Multiplicity', '1')
('Parallel Job', '4 threads')
('SCF model', 'A restricted hybrid HF-DFT SCF calculation will be')
('Optimization', 'Step      Energy          Max Grad.      Max Dist. ')
('Job type', 'Frequency calculation.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
('Job type', 'Single point.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
(' SCF total energy', '-3142.5663410 hartrees')
('Job type', 'Molecular property calculation.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
(' SCF total energy', '-3142.5663410 hartrees')
('  Reason for exit', 'Successful completion ')
('  Quantum Calculation CPU Time :     50', '25.28')
('  Quantum Calculation Wall Time:     58', '12.93')
("SPARTAN '14 Semi-Empirical Program", '(Win/64b)           Release  1.1.4        ')
('  Memory Used', '882.13 Kb')
('  Reason for exit', 'Successful completion ')
('  Semi-Empirical Program CPU Time ', '.33')
('  Semi-Empirical Program Wall Time', '.03')
('Surface computation Wall Time: 000:00', '04.5')
('Surface computation CPU Time: 000:00', '04.5')
('QSAR CPU Time:  000:00', '00.7')
('QSAR Wall Time: 000:00', '05.2')
("SPARTAN '14 Properties Program", '(Win/64b)                     Release  1.1.4  ')
('  Model', 'B3LYP/6- ')
('  Number of shells', '56')
('  Number of basis functions', '193')
('  Molecular charge', '0')
('  Spin multiplicity', '1')
('  Electrons', '108')
('Molecular descriptors', 'Molecular volume:       149.53 (Ang**3)')
('Surface area', '171.08 (Ang**2)')
('Ovality', '1.256')
('Atomic weight', '225.012 g')
('E(HOMO)', '-0.2616')
('E(LUMO)', '-0.0413')
('Electronegativity', '0.15')
('Hardness', '0.11')
('Est. polarizability', '52.086')
('LogP (Ghose-Crippen)', '3.78')
('  MO', '1          2          3          4          5    ')
('  Eigenvalues', '-482.87418  -62.53242  -56.35360  -56.34988  -56.34976')
('  MO', '6          7          8          9         10    ')
('  Eigenvalues', '-24.73012  -24.73012  -24.72378  -10.43890  -10.26735')
('  MO', '11         12         13         14         15    ')
('  Eigenvalues', '-10.22875  -10.22411  -10.22408  -10.22282  -10.22282')
('  MO', '16         17         18         19         20    ')
('  Eigenvalues', '-8.71946   -6.54976   -6.53686   -6.53653   -2.64331')
('  MO', '21         22         23         24         25    ')
('  Eigenvalues', '-2.63963   -2.63941   -2.62917   -2.62915   -1.30569')
('  MO', '26         27         28         29         30    ')
('  Eigenvalues', '-1.22058   -1.21529   -0.89659   -0.83068   -0.77898')
('  MO', '31         32         33         34         35    ')
('  Eigenvalues', '-0.76810   -0.69484   -0.64394   -0.62473   -0.58382')
('  MO', '36         37         38         39         40    ')
('  Eigenvalues', '-0.58272   -0.57794   -0.49620   -0.48747   -0.47840')
('  MO', '41         42         43         44         45    ')
('  Eigenvalues', '-0.45875   -0.45851   -0.45487   -0.42628   -0.41857')
('  MO', '46         47         48         49         50    ')
('  Eigenvalues', '-0.40373   -0.40185   -0.40111   -0.38011   -0.36716')
('  MO', '51         52         53         54         55    ')
('  Eigenvalues', '-0.32880   -0.30144   -0.28158   -0.26163   -0.04133')
('  MO', '56         57         58         59         60    ')
('  Eigenvalues', '-0.03083   -0.01484    0.10093    0.10914    0.12276')
('  MO', '61         62         63         64         65    ')
('  Eigenvalues', '0.13613    0.14724    0.18582    0.18715    0.21082')
('  MO', '66         67         68         69         70    ')
('  Eigenvalues', '0.21357    0.24266    0.25096    0.26708    0.30527')
('  MO', '71         72         73         74         75    ')
('  Eigenvalues', '0.31687    0.32036    0.34311    0.35024    0.42380')
('  MO', '76         77         78         79         80    ')
('  Eigenvalues', '0.47277    0.48094    0.50586    0.51801    0.51818')
('  MO', '81         82         83         84         85    ')
('  Eigenvalues', '0.55010    0.55814    0.56030    0.56391    0.57779')
('  MO', '86         87         88         89         90    ')
('  Eigenvalues', '0.58078    0.58427    0.58698    0.60498    0.65053')
('  MO', '91         92         93         94         95    ')
('  Eigenvalues', '0.67861    0.71020    0.72332    0.74136    0.79291')
('  MO', '96         97         98         99        100    ')
('  Eigenvalues', '0.80654    0.81106    0.82592    0.84248    0.86627')
('  MO', '101        102        103        104        105    ')
('  Eigenvalues', '0.87569    0.90676    0.96192    0.96647    0.99502')
('  MO', '106        107        108        109        110    ')
('  Eigenvalues', '1.05052    1.07442    1.08586    1.12076    1.13496')
('  MO', '111        112        113        114        115    ')
('  Eigenvalues', '1.13662    1.16888    1.21469    1.21506    1.24578')
('  MO', '116        117        118        119        120    ')
('  Eigenvalues', '1.32568    1.32818    1.33873    1.36247    1.37563')
('  MO', '121        122        123        124        125    ')
('  Eigenvalues', '1.37838    1.40237    1.42457    1.43372    1.44270')
('  MO', '126        127        128        129        130    ')
('  Eigenvalues', '1.47614    1.51746    1.53819    1.64469    1.65565')
('  MO', '131        132        133        134        135    ')
('  Eigenvalues', '1.66812    1.68761    1.72401    1.76178    1.77351')
('  MO', '136        137        138        139        140    ')
('  Eigenvalues', '1.83747    1.85068    1.89070    1.90841    1.91736')
('  MO', '141        142        143        144        145    ')
('  Eigenvalues', '1.94202    1.96284    1.98688    1.98944    2.00606')
('  MO', '146        147        148        149        150    ')
('  Eigenvalues', '2.03778    2.05113    2.06304    2.08675    2.09427')
('  MO', '151        152        153        154        155    ')
('  Eigenvalues', '2.11188    2.11863    2.15089    2.16447    2.22791')
('  MO', '156        157        158        159        160    ')
('  Eigenvalues', '2.23323    2.28021    2.30949    2.37697    2.47973')
('  MO', '161        162        163        164        165    ')
('  Eigenvalues', '2.49526    2.59063    2.60216    2.62025    2.69276')
('  MO', '166        167        168        169        170    ')
('  Eigenvalues', '2.71251    2.73635    2.73680    2.76192    2.86154')
('  MO', '171        172        173        174        175    ')
('  Eigenvalues', '2.90706    3.02095    3.03225    3.07843    3.09087')
('  MO', '176        177        178        179        180    ')
('  Eigenvalues', '3.36458    4.05458    4.07595    4.08552    4.22065')
('  MO', '181        182        183        184        185    ')
('  Eigenvalues', '4.29576    4.43412    4.43747    4.54613    4.70124')
('  MO', '186        187        188        189        190    ')
('  Eigenvalues', '4.70226    4.73711    4.75946    4.86294    4.98169')
('  MO', '191        192        193    ')
('  Eigenvalues', '5.46282    6.28356   90.36309')
('  Dipole moment', 'X =  -0.747860  Y =  -0.097531  Z =   0.000000')
('  Total Dipole', '0.754193 Debye')
('  Fitted Dipole ', 'x  =   -1.3099,  y  =   -0.0469,  z  =    0.0000  =    1.3107 debye')
('  Total MIN occupancy', '107.783940')
('  Total RYD occupancy', '0.216060')
('  Total occupancy', '108.000000')
('  Fitted Dipole ', 'x  =   -3.3259,  y  =   -0.1022,  z  =    0.0000  =    3.3274 debye')
('   OCC. THRESH. = 1.90', 'ANTIBOND/RYDBERG OCCUPANCY =    3.6726')
('   OCC. THRESH. = 1.80', 'ANTIBOND/RYDBERG OCCUPANCY =    3.6726')
('   OCC. THRESH. = 1.70', 'ANTIBOND/RYDBERG OCCUPANCY =    3.6726')
('   OCC. THRESH. = 1.60', 'ANTIBOND/RYDBERG OCCUPANCY =    2.4536')
('   OCC. THRESH. = 1.50', 'ANTIBOND/RYDBERG OCCUPANCY =    2.4536')
('  Total core, lone pair and bond occupancy', '105.5464 ( 97.73%)')
('  Total Rydberg and antibond occupancy', '2.4536')
('     *C1      C2       ', '1.825      1.3945 [double]')
('     *C1      *C2      ', '0.991      1.3945 [single]')
('     *C1      Br1      ', '0.993      1.9165 [single]')
('     *C4      *C3      ', '1.817      1.3969 [double]')
('     *C4      C3       ', '0.987      1.3969 [single]')
('     *C4      C7       ', '0.993      1.5050 [single]')
('     C2       C3       ', '0.985      1.3938 [single]')
('     C2       H2       ', '0.990      1.0842 [sing-H]')
('     *C2      *C3      ', '0.985      1.3938 [single]')
('     *C2      *H2      ', '0.990      1.0842 [sing-H]')
('     *C3      *H3      ', '0.990      1.0849 [sing-H]')
('     C3       H3       ', '0.990      1.0849 [sing-H]')
('     C7       F1       ', '0.997      1.3512 [single]')
('     C7       F2       ', '0.996      1.3534 [single]')
('     C7       F3       ', '0.997      1.3512 [single]')
('      ... Q0 ', '-0.163538')
('      ... Qx ', '-1.851396')
('      ... Qy ', '0.016292')
('      ... Qz ', '-0.016028')
('      ... Qxx', '1.583762')
('      ... Qyy', '-1.997099')
('      ... Qzz', '0.413337')
('      ... Qxy', '-0.023437')
('      ... Qxz', '-0.001069')
('      ... Qyz', '0.060847')
('      ... Q0 ', '-0.307790')
('      ... Qx ', '1.468323')
('      ... Qy ', '0.065515')
('      ... Qz ', '-0.007300')
('      ... Qxx', '0.344085')
('      ... Qyy', '-0.475027')
('      ... Qzz', '0.130942')
('      ... Qxy', '-0.023749')
('      ... Qxz', '-0.002051')
('      ... Qyz', '-0.082442')
('      ... Q0 ', '0.178586')
('      ... Qx ', '1.411544')
('      ... Qy ', '0.002053')
('      ... Qz ', '1.267567')
('      ... Qxx', '0.516127')
('      ... Qyy', '-1.499046')
('      ... Qzz', '0.982919')
('      ... Qxy', '-0.062188')
('      ... Qxz', '0.076552')
('      ... Qyz', '-0.047541')
('      ... Q0 ', '0.178551')
('      ... Qx ', '1.408304')
('      ... Qy ', '0.006266')
('      ... Qz ', '-1.286343')
('      ... Qxx', '0.519581')
('      ... Qyy', '-1.489739')
('      ... Qzz', '0.970158')
('      ... Qxy', '-0.008215')
('      ... Qxz', '-0.081349')
('      ... Qyz', '0.000064')
('      ... Q0 ', '-0.226184')
('      ... Qx ', '0.264106')
('      ... Qy ', '-0.047609')
('      ... Qz ', '-0.801573')
('      ... Qxx', '0.279018')
('      ... Qyy', '-0.824322')
('      ... Qzz', '0.545304')
('      ... Qxy', '0.063086')
('      ... Qxz', '0.040624')
('      ... Qyz', '0.033288')
('      ... Q0 ', '-0.228808')
('      ... Qx ', '0.273758')
('      ... Qy ', '-0.052234')
('      ... Qz ', '0.818753')
('      ... Qxx', '0.263270')
('      ... Qyy', '-0.813509')
('      ... Qzz', '0.550239')
('      ... Qxy', '-0.036747')
('      ... Qxz', '-0.033769')
('      ... Qyz', '0.041040')
('      ... Q0 ', '1.313248')
('      ... Qx ', '0.646083')
('      ... Qy ', '0.141190')
('      ... Qz ', '-0.049458')
('      ... Qxx', '0.158304')
('      ... Qyy', '-0.073952')
('      ... Qzz', '-0.084352')
('      ... Qxy', '0.000739')
('      ... Qxz', '-0.004590')
('      ... Qyz', '-0.027778')
('      ... Q0 ', '-0.463736')
('      ... Qx ', '0.424394')
('      ... Qy ', '-0.319386')
('      ... Qz ', '-0.475098')
('      ... Qxx', '-0.247750')
('      ... Qyy', '0.133772')
('      ... Qzz', '0.113979')
('      ... Qxy', '0.095114')
('      ... Qxz', '0.047052')
('      ... Qyz', '-0.009976')
('      ... Q0 ', '-0.481187')
('      ... Qx ', '0.318474')
('      ... Qy ', '0.685491')
('      ... Qz ', '0.008669')
('      ... Qxx', '-0.086026')
('      ... Qyy', '-0.038164')
('      ... Qzz', '0.124190')
('      ... Qxy', '-0.022439')
('      ... Qxz', '-0.004279')
('      ... Qyz', '-0.009291')
('      ... Q0 ', '-0.447727')
('      ... Qx ', '0.388831')
('      ... Qy ', '-0.282775')
('      ... Qz ', '0.436155')
('      ... Qxx', '-0.240775')
('      ... Qyy', '0.121889')
('      ... Qzz', '0.118886')
('      ... Qxy', '0.064528')
('      ... Qxz', '-0.006250')
('      ... Qyz', '-0.041871')
('      ... Q0 ', '-0.110462')
('      ... Qx ', '0.288697')
('      ... Qy ', '-0.006979')
('      ... Qz ', '0.003238')
('      ... Qxx', '4.913063')
('      ... Qyy', '-2.410696')
('      ... Qzz', '-2.502367')
('      ... Qxy', '-0.081163')
('      ... Qxz', '0.009367')
('      ... Qyz', '-0.011689')
('  RMS fit', '2.518255')
(' %RMS fit', '503.441352')
('  Fitted Dipole ', 'x  =   -1.3967,  y  =   -0.0902,  z  =    0.0003  =    1.3996 debye')
('  RMS fit', '4.372352')
(' %RMS fit', '420.354455')
('  Fitted Dipole ', 'x  =   -1.2761,  y  =   -0.1164,  z  =    0.0000  =    1.2814 debye')
(' Atomic Charges', 'Electrostatic Mulliken  Natural ')
('   1 *C1      ', '-0.093     +0.018   -0.105 ')
('   2 *C4      ', '+0.130     -0.008   -0.173 ')
('   3 C2       ', '+0.015     -0.149   -0.240 ')
('   4 *C2      ', '+0.015     -0.149   -0.240 ')
('   5 *C3      ', '-0.223     -0.153   -0.195 ')
('   6 C3       ', '-0.223     -0.153   -0.195 ')
('   7 H2       ', '+0.100     +0.168   +0.260 ')
('   8 *H2      ', '+0.100     +0.168   +0.260 ')
('   9 *H3      ', '+0.156     +0.165   +0.258 ')
('  10 H3       ', '+0.156     +0.165   +0.258 ')
('  11 C7       ', '+0.338     +0.786   +1.132 ')
('  12 F1       ', '-0.144     -0.269   -0.364 ')
('  13 F2       ', '-0.149     -0.259   -0.362 ')
('  14 F3       ', '-0.144     -0.269   -0.364 ')
('  15 Br1      ', '-0.036     -0.062   +0.070 ')
('   1 *C1      *C4      ', '0.085    ---  ')
('   2 *C1      C2       ', '1.412    1.825')
('   3 *C1      *C2      ', '1.412    0.991')
('   4 *C1      Br1      ', '0.978    0.993')
('   5 *C4      *C3      ', '1.414    1.817')
('   6 *C4      C3       ', '1.414    0.987')
('   7 *C4      C7       ', '0.973    0.993')
('   8 C2       *C3      ', '0.092    ---  ')
('   9 C2       C3       ', '1.425    0.985')
('  10 C2       H2       ', '0.920    0.990')
('  11 C2       Br1      ', '0.027    ---  ')
('  12 *C2      *C3      ', '1.425    0.985')
('  13 *C2      C3       ', '0.092    ---  ')
('  14 *C2      *H2      ', '0.920    0.990')
('  15 *C2      Br1      ', '0.027    ---  ')
('  16 *C3      *H3      ', '0.916    0.990')
('  17 C3       H3       ', '0.916    0.990')
('  18 C7       F1       ', '1.003    0.997')
('  19 C7       F2       ', '1.035    0.996')
('  20 C7       F3       ', '1.003    0.997')
(' Vibrational(v) Corrections', 'Temp. Correction    Hv     274.6411  ')
('  Reason for exit', 'Successful completion ')
('  Properties CPU Time ', '.78')
('  Properties Wall Time', '.50')
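
Some entries in this output, such as the one keyed 'Molecular descriptors', contain an entire following line as their value. The sketch below reproduces that effect on a small made-up string in the style of the file and compares \s* with [ \t]*, a character class that only skips spaces and tabs. The [ \t]* variant is one possible alternative, not something the original lesson uses.

```python
import re

# Made-up sample: the first key has nothing after its colon on the same line.
sample = "Molecular descriptors:\nMolecular volume:       149.53\nOvality: 1.256"

# \s* may consume the newline, so the first key is paired with the whole next line.
across_lines = re.compile(r"(.+):\s*(.+)").findall(sample)
print(across_lines)
# [('Molecular descriptors', 'Molecular volume:       149.53'), ('Ovality', '1.256')]

# [ \t]* only skips spaces and tabs, so a match cannot run onto the next line.
same_line = re.compile(r"(.+):[ \t]*(.+)").findall(sample)
print(same_line)
# [('Molecular volume', '149.53'), ('Ovality', '1.256')]
```

Which behaviour is preferable depends on the file: the first keeps section headers as keys, while the second recovers the Molecular volume entry that the first one swallowed.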

If we wanted to limit the key-value matches to values that contain digits, we would have to do some additional work. We could use either [0-9] or \d to indicate that we are looking for digits. However, the presence of a decimal point complicates things a bit. Using \d+ alone captures only the digits before any decimal point, while something like \d+\.\d+ (notice the backslash in front of the period, which escapes it so that it matches a literal decimal point) does not match integers.

# A raw string (r"...") keeps Python from interpreting backslashes such as \s and \d.
pattern = re.compile(r"(.+):\s*(\d+)")
matches = pattern.findall(data)
for match in matches[:10]:
    print(match)
('Basis set', '6')
('Number of shells', '56')
('Number of basis functions', '193')
('Multiplicity', '1')
('Parallel Job', '4')
('Basis set', '6')
('Basis set', '6')
('Basis set', '6')
('  Quantum Calculation CPU Time :     50', '25')
('  Quantum Calculation Wall Time:     58', '12')

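You’ll notice that we are only pulling out integers. For decimal values, such as the CPU time, only the digits before the decimal point are captured.

If we modify the pattern to require a decimal point (escaped with \) followed by more digits, we will match only decimal numbers.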

# Require digits, a literal decimal point, then more digits.
pattern = re.compile(r"(.+):\s*(\d+\.\d+)")
matches = pattern.findall(data)
for match in matches[:10]:
    print(match)
('  Quantum Calculation CPU Time :     50', '25.28')
('  Quantum Calculation Wall Time:     58', '12.93')
('  Memory Used', '882.13')
('Surface computation Wall Time: 000:00', '04.5')
('Surface computation CPU Time: 000:00', '04.5')
('QSAR CPU Time:  000:00', '00.7')
('QSAR Wall Time: 000:00', '05.2')
('Molecular volume', '149.53')
('Surface area', '171.08')
('Ovality', '1.256')

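We can modify this regular expression further. Following the first \d with * instead of + allows zero or more digits before the decimal point, so values such as .30 are matched. We also add -? to allow for a negative sign at the start of the number; the question mark makes the preceding character optional (it matches 0 or 1 times). We follow the number with \s+ to require whitespace after it. Finally, starting the first group with [A-Za-z] means the label must begin with a letter, which drops the leading whitespace from the labels. The decimal point itself is still required, so integers are not matched.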

# Label starting with a letter, a colon, then an optionally negative decimal number.
pattern = re.compile(r"([A-Za-z].*):\s*(-?\d*\.\d+)\s+")
matches = pattern.findall(data)

print(f"{len(matches)} matches found!")
for match in matches[:25]:
    print(match)
226 matches found!
('Conformer Program CPU Time ', '.30')
('Conformer Program Wall Time', '.07')
('SCF total energy', '-3142.5663410')
('SCF total energy', '-3142.5663410')
('Quantum Calculation CPU Time :     50', '25.28')
('Quantum Calculation Wall Time:     58', '12.93')
('Memory Used', '882.13')
('Semi-Empirical Program CPU Time ', '.33')
('Semi-Empirical Program Wall Time', '.03')
('Surface computation Wall Time: 000:00', '04.5')
('Surface computation CPU Time: 000:00', '04.5')
('QSAR CPU Time:  000:00', '00.7')
('QSAR Wall Time: 000:00', '05.2')
('Molecular volume', '149.53')
('Surface area', '171.08')
('Ovality', '1.256')
('Atomic weight', '225.012')
('E(HOMO)', '-0.2616')
('E(LUMO)', '-0.0413')
('Electronegativity', '0.15')
('Hardness', '0.11')
('Est. polarizability', '52.086')
('LogP (Ghose-Crippen)', '3.78')
('Eigenvalues', '-482.87418')
('Eigenvalues', '-24.73012')
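
If we wanted a single pattern to capture integers as well as decimals, one option is to make the decimal point optional with a question mark. The sketch below is an assumption on our part rather than the pattern used above, and the sample string is made up for illustration.

```python
import re

# Hypothetical sample text in the same "label: value" style as the output file.
sample = "Multiplicity: 1\nE(HOMO): -0.2616\nCPU Time: .30\n"

# -?   optional minus sign
# \d*  zero or more digits before the decimal point
# \.?  optional decimal point
# \d+  one or more digits
number = re.compile(r"([A-Za-z].*):\s*(-?\d*\.?\d+)")

values = number.findall(sample)
print(values)
# [('Multiplicity', '1'), ('E(HOMO)', '-0.2616'), ('CPU Time', '.30')]
```

On the real file, a pattern this permissive would also pick up values such as the 6 in the basis set name, so some filtering afterwards would likely still be needed.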

“Greedy” and “non-greedy” matching

You’ll notice in the file that there is data between various steps. For example, the results from the step where NMR shifts are calculated are shown below. <step 3> starts this block and <step 4> ends it. Let’s imagine that we want to pull out all text starting with <step n> and ending with <step n+1>. We might think to match all of the characters between two occurrences of <step \d>.

<step 3>
Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 

<step 4>

We might write something like the following. Here we use the flag re.DOTALL so that . matches all characters, including newlines.

pattern = re.compile(r"<step \d>.+<step \d>", re.DOTALL)
matches = pattern.findall(data)
print(matches[0])
<step 2>
Job type: Frequency calculation.
Method: RB3LYP
Basis set: 6-31G(D)

<step 3>
Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 

<step 4>

You’ll notice that this does not quite give us the expected output. Instead of stopping at <step 3>, the match runs all the way to <step 4>. This is because by default regular expressions use something called “greedy matching”. This means that each repetition matches as much text as possible.

We will need to add a modifier to our pattern to make it non-greedy. This means it will match as little text as possible. You can add a question mark after the + (or after another repetition character, such as *) to make the expression “non-greedy”.
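
The difference is easier to see on a short string. The sketch below uses a made-up sample (not the Spartan file) with three step markers and compares .+ with .+?.

```python
import re

# Hypothetical sample with three step markers.
sample = "<step 1>a<step 2>b<step 3>"

# Greedy: .+ grabs as much as it can, so the match ends at the last marker.
greedy = re.findall(r"<step \d>(.+)<step \d>", sample)

# Non-greedy: .+? stops at the first marker that completes the match.
lazy = re.findall(r"<step \d>(.+?)<step \d>", sample)

print(greedy)  # ['a<step 2>b']
print(lazy)    # ['a']
```

Applying the same modifier to the Spartan output gives the following.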

pattern = re.compile(r"<step \d>(.+?)<step \d>", re.DOTALL)
matches = pattern.findall(data)
print(f"Found {len(matches)} matches!")
print(matches[0])
Found 1 matches!

Job type: Frequency calculation.
Method: RB3LYP
Basis set: 6-31G(D)

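This seems better, but it’s still not quite what we want. You’ll notice that we are not getting the data in step 3. This is because by default matches will not overlap. Since <step 3> is used at the end of the first match, it can’t be used at the beginning of the next one. To allow overlapping matches, we can wrap the whole pattern in a lookahead assertion, (?=...). A lookahead checks that the text ahead matches the pattern without consuming any characters, so the search for the next match can begin inside the previous one. Because the lookahead itself matches no text, we add an extra pair of parentheses inside it to capture what it finds.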

pattern = re.compile(r"(?=(<step \d>(.+?)<step \d>))", re.DOTALL)
matches = pattern.findall(data)
print(f"Found {len(matches)} matches!")
print(matches[1][1])
Found 2 matches!

Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 
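
To see why each result now has two parts, the sketch below runs the same lookahead pattern on a short made-up string (a hypothetical sample, not the Spartan file). With two capture groups, findall returns a tuple per match: the full block including the markers, then the text between them.

```python
import re

# Hypothetical sample with three step markers.
sample = "<step 1>a<step 2>b<step 3>"

# The lookahead consumes no characters, so <step 2> can end one match
# and begin the next.
overlapping = re.compile(r"(?=(<step \d>(.+?)<step \d>))", re.DOTALL)

blocks = overlapping.findall(sample)
print(blocks)
# [('<step 1>a<step 2>', 'a'), ('<step 2>b<step 3>', 'b')]
```

This is why matches[1][1] above selects the text between the markers for the second block.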

re.finditer

We started with re.findall here for simplicity, but an alternative which gives you more information about your matches is re.finditer. In this case, an iterator of match objects is returned rather than a list of strings. You can then iterate through the results with a for loop. Each match object provides more information, such as the position of the match in the text (span) and the text captured by each group (group).

pattern = re.compile(r"(?=(<step \d>(.+?)<step \d>))", re.DOTALL)
matches = pattern.finditer(data)
for match in matches:
    print(f"Match span is {match.span(2)}")
    print(match.group(2))
Match span is (1125, 1195)

Job type: Frequency calculation.
Method: RB3LYP
Basis set: 6-31G(D)


Match span is (1203, 2043)

Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 
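
Match objects are also convenient when you want to collect values as you go. The sketch below is a minimal example on a made-up string (the sample and records names are hypothetical) that stores each label, its value converted to a float, and the position of the value in the text.

```python
import re

# Hypothetical sample text in the same "label: value" style as the output file.
sample = "Ovality: 1.256\nHardness: 0.11\n"
pattern = re.compile(r"([A-Za-z].*):\s*(-?\d*\.\d+)")

records = []
for match in pattern.finditer(sample):
    label = match.group(1)         # text captured by the first group
    value = float(match.group(2))  # second group, converted to a number
    records.append((label, value, match.span(2)))

print(records)
# [('Ovality', 1.256, (9, 14)), ('Hardness', 0.11, (25, 29))]
```

The span gives the start and end index of the captured value, so sample[9:14] is the string '1.256'.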

Regular expressions take practice, but they are worth learning. It is useful to try out small regular expressions with online tools such as pythex.