Regular Expressions
=================

<div class="overview">
   <p class="overview-title">Overview</p>
    <p>Questions:</p>
     <ul>
        <li>What is a regular expression?</li>
        <li>When should I use regular expressions?</li>
    </ul>
    <p>Objectives:</p>
        <ul>
            <li>Use regular expressions to pull information from a complex data file.</li>
        </ul>
    <p>Keypoints:</p>
        <ul>
            <li>Regular expressions allow you to define patterns of strings using special metacharacters.</li>
        </ul>
    </div>

In the first Python and Data Scripting Workshop, we learned how to parse files using `open` and `readlines`, then looping through the lines to find specific phrases. This works for some cases, but there are times when it will be either slow or impossible.

Sometimes, you will want to look for text which resembles a certain pattern. 

Regular expressions will allow us to define patterns of text we are looking for rather than hardcoding specific phrases. 

For this section, we will be parsing output files from the Spartan software. This dataset was published as supplemental information for [this paper](https://science.sciencemag.org/content/360/6385/186) where the authors used molecular descriptors obtained from the Spartan software along with machine learning methods to predict reaction yields. 

To follow along with this section, you should download [this file](https://github.com/doylelab/rxnpredict/blob/master/spartan_molecules/1-bromo-4-(trifluoromethyl)benzene.spardir/M0001/output) and put it in your data folder (name it `output`). 

We will see later how to loop over all of the files in [this repository](https://github.com/doylelab/rxnpredict). We will be working with files in the `rxnpredict/spartan_molecules` directory. First, we will test our regular expressions on one file, then we will apply them to all of our files to get our molecular descriptors.

In [1]:
import os

import pandas as pd

In [2]:
file_path = os.path.join("data", "rxnpredict", "output")

Because this file is not tabular or structured, we can't easily read it using NumPy or pandas. We will use a method from the first Python scripting workshop and read it in using the `open` function. Note that there are many ways to open files in Python. If you are reading a file that is only partially tabular, you may still want to use pandas with appropriate arguments to the read functions. However, these files contain a lot of information and are mostly unstructured, so we will first pull out some information using regular expressions.

We use the syntax `with open` because this allows us to open and automatically close the file. We will read the file inside the open block. The `read` function in Python is used on an open file object. All of the file contents will be pulled into a `string` called `data`.

In [3]:
with open(file_path) as f:
    data = f.read()
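
As a minimal sketch of why `with open` is convenient, the example below writes a small stand-in file to a temporary directory and checks that the file object is closed once the block ends. The file contents and the names `sample_text`, `sample_path`, and `sample_data` are made up for illustration; they are not part of the Spartan dataset.

```python
import os
import tempfile

# Hypothetical stand-in content; the real Spartan output is much longer.
sample_text = "Job type: Geometry optimization.\nMethod: RB3LYP\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "output")

    # Write the stand-in file so there is something to read.
    with open(sample_path, "w") as f:
        f.write(sample_text)

    # Read the whole file into a single string.
    with open(sample_path) as f:
        sample_data = f.read()

# The file object still exists, but it was closed when the block ended.
print(f.closed)
print(type(sample_data))
```

Because the file is closed automatically, there is no need to call `f.close()` yourself.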

In [4]:
# print the first 500 characters of data
print(data[:500])

SPARTAN '14 CONFORMATION SEARCH:   (Win/64b)                      Release  1.1.4


 Conformation Search


 Initializing 4 threads

  Reason for exit: Successful completion 
  Conformer Program CPU Time :          .30
  Conformer Program Wall Time:          .07

SPARTAN '14 Quantum Mechanics Driver:  (Win/64b)         Release  1.1.4

Job type: Geometry optimization.
Method: RB3LYP
Basis set: 6-31G(D)
Number of shells: 56
Number of basis functions: 193
Multiplicity: 1
Parallel Job: 4 threads

SCF 


## Basic Matching

We are going to process this file using regular expressions. Regular expressions will let us define patterns that we want to look for in our text. To use regular expressions, you start by importing the appropriate module - `import re`.

In [5]:
import re

At its simplest, a regular expression matches the exact pattern you specify. For example, if we wanted to find all the places the word `energy` occurred in the file, we could do that.

We create a regular expression pattern by using `re.compile`. Inside this function, you put the pattern you want to look for. Then you use `pattern.findall` and pass the string you would like to search as an argument. The function will return a list of matches in the text.

In [6]:
pattern = re.compile("energy")

In [7]:
matches = pattern.findall(data)
print(matches)

['energy', 'energy']
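
The same idea can be tried on a short string without downloading the data file. This is only a sketch: the `sample` text below is invented to resemble a few lines of output.

```python
import re

# Made-up sample text standing in for the Spartan output.
sample = "SCF total energy: -3142.57\nEnergy Due to Solvation\nenergy check"

pattern = re.compile("energy")

# findall returns every non-overlapping match as a list of strings.
lowercase_matches = pattern.findall(sample)
print(lowercase_matches)

# When nothing matches, the result is an empty list rather than an error.
no_matches = re.compile("enthalpy").findall(sample)
print(no_matches)
```

Note that `Energy` on the second line is not matched, because a literal pattern is case sensitive.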


## Using metacharacters

This is not that interesting, but we can see that the word "energy" is found twice in the text. Notice that the match is exact: only the word `energy` with a lowercase `e` is found. This is where the power of regular expressions comes in. Let's imagine that you want to look for either the word `energy` or the word `Energy`. You can modify your regular expression to use special characters that tell it either letter is okay.

To match something other than literal characters, you use special characters in your regular expressions (referred to as `metacharacters`). Here are some examples of metacharacters:

```
. ^ $ * + ? { } [ ] \ | ( )
```

The first we will use is `[ ]`. For regular expressions, these indicate a class of characters you want to search for. For example, to look for `e` or `E`, we could use `[Ee]`.

In [8]:
pattern = re.compile("[Ee]nergy")

In [9]:
matches = pattern.findall(data)
print(matches)

['Energy', 'energy', 'energy', 'Energy', 'Energy']

Because matching regardless of case is so common, you can also pass the `re.IGNORECASE` flag to `re.compile` instead of writing a character class for each letter.


In [10]:
pattern = re.compile("energy", re.IGNORECASE)

In [11]:
matches = pattern.findall(data)
print(matches)

['Energy', 'energy', 'energy', 'Energy', 'Energy']
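
The two approaches give the same result on this file, but they are not equivalent in general. The sketch below uses a made-up `sample` string to show the difference: the character class only relaxes the first letter, while the flag ignores case for the whole pattern.

```python
import re

# Made-up sample with three different capitalizations.
sample = "Energy energy ENERGY"

# The character class only allows an upper- or lowercase first letter.
class_matches = re.compile("[Ee]nergy").findall(sample)
print(class_matches)

# The flag ignores case for every letter in the pattern.
flag_matches = re.compile("energy", re.IGNORECASE).findall(sample)
print(flag_matches)
```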


Character classes are more useful for matching a range of characters. For example, the pattern below will find any single uppercase letter surrounded by spaces.

In [12]:
pattern = re.compile(" [A-Z] ")
matches = pattern.findall(data)
print(matches)

[' A ', ' S ', ' X ', ' Y ', ' Z ', ' C ', ' C ', ' C ', ' C ', ' C ', ' C ', ' H ', ' H ', ' H ', ' H ', ' C ', ' F ', ' F ', ' F ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S ', ' S '

Let's keep building on one of these and get what follows the word "energy". Another special character recognized by regular expressions is the period. The period means any character except a newline. If we add a period after the word `energy`, it will match the word `energy` (case insensitive) followed by any character except a newline.

In [13]:
pattern = re.compile("energy.", re.IGNORECASE)

In [14]:
matches = pattern.findall(data)
print(matches)

['Energy ', 'energy:', 'energy:', 'Energy ', 'Energy ']


After adding the dot, you will see another trailing character has been added to our results. 
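
To see what "except a newline" means in practice, here is a small sketch on a made-up two-line string. It also uses the `re.DOTALL` flag from Python's `re` module, which makes the period match newlines as well.

```python
import re

# Made-up sample: the first "energy" is followed directly by a newline.
sample = "energy\nenergy: -1.5"

# By default the period will not match the newline after the first "energy".
dot_matches = re.compile("energy.").findall(sample)
print(dot_matches)

# With re.DOTALL the period matches any character, including newline.
dotall_matches = re.compile("energy.", re.DOTALL).findall(sample)
print(dotall_matches)
```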

We can specify how many times we want a character or group of characters repeated by using `{}` with the number of repeats. For example, the following pattern matches the two characters that follow `energy`.

In [15]:
pattern = re.compile("energy.{2}", re.IGNORECASE)

In [16]:
matches = pattern.findall(data)
print(matches)

['Energy  ', 'energy: ', 'energy: ', 'Energy D', 'Energy S']


You can also use `*` to specify 0 or more matches, or `+` to specify one or more matches. In this case, you do not need the curly braces `{}`. In general, this is the pattern you will follow when not doing exact matches. You will define something using the special metacharacters, followed by how many times to look for that metacharacter.

In [17]:
pattern = re.compile("energy.+", re.IGNORECASE)
matches = pattern.findall(data)
print(matches)

['Energy          Max Grad.      Max Dist. ', 'energy:   -3142.5663410 hartrees', 'energy:   -3142.5663410 hartrees', 'Energy Due to Solvation', 'Energy SM5.4/A          1.983']
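
The repeat specifiers can be compared side by side on a short string. This is a sketch on an invented `sample`; the first line contains the bare word `energy` so that the difference between `*` (zero or more) and `+` (one or more) is visible.

```python
import re

# Made-up sample: a bare "energy" line, then a line with a value.
sample = "energy\nenergy: -3142.57 hartrees"

# Exactly two characters after "energy".
exactly_two = re.compile("energy.{2}").findall(sample)

# Between one and three characters (as many as possible, up to three).
one_to_three = re.compile("energy.{1,3}").findall(sample)

# Zero or more characters: the bare "energy" line also matches.
zero_or_more = re.compile("energy.*").findall(sample)

# One or more characters: the bare "energy" line does not match.
one_or_more = re.compile("energy.+").findall(sample)

print(exactly_two)
print(one_to_three)
print(zero_or_more)
print(one_or_more)
```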


<div class="exercise-this-is-a-title exercise">
<p class="exercise-title">Check your understanding</p>
    <p>Modify the regular expression we wrote above to match uppercase words that are between three and six letters long. Challenge: write a regular expression for uppercase words that are at least three characters long (no upper bound on length).</p>

```python
pattern = re.compile(" [A-Z] ")
matches = pattern.findall(data)
print(matches)
```
    
````{admonition} Solution
:class: dropdown

```python
pattern = re.compile(" [A-Z]{3,6} ")
challenge_pattern = re.compile(" [A-Z]{3,} ")
```
````
     
</div>

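Within a pattern, you can use parentheses `()` to create groups. For example, if we wanted to separate the word energy from what follows, we could surround the relevant parts with parentheses. In the pattern below, we've also allowed zero or more whitespace characters between the groups with `\s*`. The pattern is written as a raw string (the `r` prefix) so that Python passes backslash sequences such as `\s` to the regular expression engine unchanged.

In [18]:
pattern = re.compile(r"(energy)\s*(.+)", re.IGNORECASE)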
matches = pattern.findall(data)
print(matches)

[('Energy', 'Max Grad.      Max Dist. '), ('energy', ':   -3142.5663410 hartrees'), ('energy', ':   -3142.5663410 hartrees'), ('Energy', 'Due to Solvation'), ('Energy', 'SM5.4/A          1.983')]


In [19]:
matches[1]

('energy', ':   -3142.5663410 hartrees')
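
Groups can also be read one at a time from a match object. The sketch below uses `pattern.search`, which returns the first match (or `None`), on a single line copied from the output above. The pattern is a slight variant that places the colon outside the second group; the variable names are chosen only for illustration.

```python
import re

sample = " SCF total energy:   -3142.5663410 hartrees"

# Two groups: the word "energy" and everything after the colon and whitespace.
pattern = re.compile(r"(energy):\s*(.+)")
match = pattern.search(sample)

whole = match.group(0)  # the entire matched text
label = match.group(1)  # the first parenthesized group
value = match.group(2)  # the second parenthesized group

print(whole)
print(label)
print(value)

# With groups in the pattern, findall returns one tuple per match.
grouped = pattern.findall(sample)
print(grouped)
```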

You can imagine how this might be more useful if we wanted to pull out all of the key-value pairs, i.e. `word : value`.

In [20]:
pattern = re.compile(".+:.+")

In [21]:
matches = pattern.findall(data)
print(len(matches))

313


When writing a complicated regular expression, it is useful to be able to explain the pattern within the code. You can pass the flag `re.VERBOSE` to allow the use of `#` comments; in this mode, whitespace inside the pattern is also ignored. To spread the pattern over multiple lines, use triple quotes to start and end your string.

In [22]:
pattern = re.compile("""
                        .+             # One or more of any character
                        :              # A literal colon
                        .+             # One or more of any character excluding newline.
                        """, re.VERBOSE)

In [23]:
matches = pattern.findall(data)
print(len(matches))

313
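
One consequence of `re.VERBOSE` is worth checking on a small example: because whitespace in the pattern is ignored, a literal space has to be escaped. The sketch below uses a made-up `sample` and shows that the compact and verbose patterns agree, and what happens when a space is left unescaped.

```python
import re

# Made-up sample with two key-value lines and one line without a colon.
sample = "Method: RB3LYP\nNo colon here\nBasis set: 6-31G(D)"

compact = re.compile(".+:.+")
verbose = re.compile(
    """
    .+    # one or more of any character except newline
    :     # a literal colon
    .+    # one or more of any character except newline
    """,
    re.VERBOSE,
)

compact_matches = compact.findall(sample)
verbose_matches = verbose.findall(sample)
print(compact_matches == verbose_matches)

# In verbose mode an unescaped space is ignored, so this looks for "Basisset".
unescaped = re.compile("Basis set", re.VERBOSE).findall(sample)

# Escaping the space with a backslash keeps it as part of the pattern.
escaped = re.compile(r"Basis\ set", re.VERBOSE).findall(sample)

print(unescaped)
print(escaped)
```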


In [24]:
matches[:25]

["SPARTAN '14 CONFORMATION SEARCH:   (Win/64b)                      Release  1.1.4",
 '  Reason for exit: Successful completion ',
 '  Conformer Program CPU Time :          .30',
 '  Conformer Program Wall Time:          .07',
 "SPARTAN '14 Quantum Mechanics Driver:  (Win/64b)         Release  1.1.4",
 'Job type: Geometry optimization.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 'Number of shells: 56',
 'Number of basis functions: 193',
 'Multiplicity: 1',
 'Parallel Job: 4 threads',
 'Job type: Frequency calculation.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 'Job type: Single point.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 ' SCF total energy:   -3142.5663410 hartrees',
 'Job type: Molecular property calculation.',
 'Method: RB3LYP',
 'Basis set: 6-31G(D)',
 ' SCF total energy:   -3142.5663410 hartrees',
 '  Reason for exit: Successful completion ',
 '  Quantum Calculation CPU Time :     50:25.28']

<div class="exercise-this-is-a-title exercise">
<p class="exercise-title">Check your understanding</p>
    <p> Add parentheses in the appropriate places in the regular expression so that each match is grouped into (`key`, `value`). </p>

```python
pattern = re.compile(".+:.+")
```
    
````{admonition} Solution
:class: dropdown

```python
pattern = re.compile("(.+):(.+)")
```
````
     
</div>

The pattern below groups keys and values, and also adds `\s*` after the colon so that any whitespace before the value is discarded.

In [25]:
pattern = re.compile(r"""
                        (.+)                    # The key: one or more of any character excluding newline
                        :\s*                    # A literal colon followed by 0 or more whitespace
                        (.+)                    # The value: one or more of any character excluding newline
                     """, re.VERBOSE)

matches = pattern.findall(data)
print(len(matches))

for match in matches:
    print(match)

317
("SPARTAN '14 CONFORMATION SEARCH", '(Win/64b)                      Release  1.1.4')
('  Reason for exit', 'Successful completion ')
('  Conformer Program CPU Time ', '.30')
('  Conformer Program Wall Time', '.07')
("SPARTAN '14 Quantum Mechanics Driver", '(Win/64b)         Release  1.1.4')
('Job type', 'Geometry optimization.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
('Number of shells', '56')
('Number of basis functions', '193')
('Multiplicity', '1')
('Parallel Job', '4 threads')
('SCF model', 'A restricted hybrid HF-DFT SCF calculation will be')
('Optimization', 'Step      Energy          Max Grad.      Max Dist. ')
('Job type', 'Frequency calculation.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
('Job type', 'Single point.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
(' SCF total energy', '-3142.5663410 hartrees')
('Job type', 'Molecular property calculation.')
('Method', 'RB3LYP')
('Basis set', '6-31G(D)')
(' SCF total energy', '-3142.5663410 hartrees')
('  Reas

If we wanted to limit these results to the values which have digits, we would have to do some additional work. We could use either `[0-9]` or `\d` to indicate that we were looking for digits. However, the presence of a decimal point complicates things a bit. Using `\d+` would only capture integers, while something like `\d+\.\d+` (notice the backslash in front of the period to escape it) wouldn't capture any integers.

In [26]:
pattern = re.compile(r"(.+):\s*(\d+)")
matches = pattern.findall(data)
for i in range(10):
    print(matches[i])

('Basis set', '6')
('Number of shells', '56')
('Number of basis functions', '193')
('Multiplicity', '1')
('Parallel Job', '4')
('Basis set', '6')
('Basis set', '6')
('Basis set', '6')
('  Quantum Calculation CPU Time :     50', '25')
('  Quantum Calculation Wall Time:     58', '12')


You'll notice that we are only pulling out integers.

If we modify the pattern to require a decimal point (escaped with `\`) followed by more digits, we will match only decimal numbers. 

In [27]:
pattern = re.compile(r"(.+):\s*(\d+\.\d+)")
matches = pattern.findall(data)
for i in range(10):
    print(matches[i])

('  Quantum Calculation CPU Time :     50', '25.28')
('  Quantum Calculation Wall Time:     58', '12.93')
('  Memory Used', '882.13')
('Surface computation Wall Time: 000:00', '04.5')
('Surface computation CPU Time: 000:00', '04.5')
('QSAR CPU Time:  000:00', '00.7')
('QSAR Wall Time: 000:00', '05.2')
('Molecular volume', '149.53')
('Surface area', '171.08')
('Ovality', '1.256')


We can refine this regular expression in several ways. Starting the key with `[A-Za-z]` drops leading whitespace from the key. Following the minus sign with a question mark (`-?`) makes it optional, since `?` means the preceding item matches zero or one time. Using `\d*` before the decimal point allows zero or more leading digits, so values such as `.30` are matched. Finally, the trailing `\s+` requires whitespace after the number. Because the decimal point is still required, integer values are not matched by this pattern.

In [28]:
pattern = re.compile(r"([A-Za-z].*):\s*(-?\d*\.\d+)\s+")
matches = pattern.findall(data)

print(f"{len(matches)} matches found!")
for i in range(25):
    print(matches[i])

226 matches found!
('Conformer Program CPU Time ', '.30')
('Conformer Program Wall Time', '.07')
('SCF total energy', '-3142.5663410')
('SCF total energy', '-3142.5663410')
('Quantum Calculation CPU Time :     50', '25.28')
('Quantum Calculation Wall Time:     58', '12.93')
('Memory Used', '882.13')
('Semi-Empirical Program CPU Time ', '.33')
('Semi-Empirical Program Wall Time', '.03')
('Surface computation Wall Time: 000:00', '04.5')
('Surface computation CPU Time: 000:00', '04.5')
('QSAR CPU Time:  000:00', '00.7')
('QSAR Wall Time: 000:00', '05.2')
('Molecular volume', '149.53')
('Surface area', '171.08')
('Ovality', '1.256')
('Atomic weight', '225.012')
('E(HOMO)', '-0.2616')
('E(LUMO)', '-0.0413')
('Electronegativity', '0.15')
('Hardness', '0.11')
('Est. polarizability', '52.086')
('LogP (Ghose-Crippen)', '3.78')
('Eigenvalues', '-482.87418')
('Eigenvalues', '-24.73012')


## "Greedy" and "non-greedy" matching

You'll notice in the file that there is data between various steps. For example, the results from the step where NMR shifts are calculated are shown below. `<step 3>` starts this block and `<step 4>` ends it. Let's imagine that we want to pull out all text starting with `<step n>` and ending with `<step n+1>`. We might think to match all characters between two `<step \d>` markers.


```raw
<step 3>
Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 

<step 4>
```

We might write something like the following. Here we are using the flag `re.DOTALL` to make `.` match all characters (including newline).

In [29]:
pattern = re.compile(r"<step \d>.+<step \d>", re.DOTALL)
matches = pattern.findall(data)

In [30]:
print(matches[0])

<step 2>
Job type: Frequency calculation.
Method: RB3LYP
Basis set: 6-31G(D)

<step 3>
Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 

<step 4>


You'll notice that this does not quite give us the expected output. The match runs from the first step marker all the way to `<step 4>`. This is because, by default, regular expressions use something called "greedy matching": each repeat (here `.+`) matches as much text as possible.

We will need to add a modifier to our pattern to make it non-greedy, meaning it will match as little text as possible. You can add a question mark `?` after the `+` (or after any other repeat character) to make the expression non-greedy.
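
The difference between greedy and non-greedy repeats is easy to see on a short made-up string with three step markers. This is a sketch only; the `sample` text is invented.

```python
import re

# Made-up sample with three step markers on one line.
sample = "<step 1>first<step 2>second<step 3>"

# Greedy: .+ grabs as much as it can, so the match runs to the last marker.
greedy = re.compile(r"<step \d>.+<step \d>").findall(sample)
print(greedy)

# Non-greedy: .+? stops at the first marker that completes the pattern.
lazy = re.compile(r"<step \d>.+?<step \d>").findall(sample)
print(lazy)
```

The non-greedy version finds only one match here, because `<step 2>` has already been used to end it.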

In [31]:
pattern = re.compile(r"<step \d>(.+?)<step \d>", re.DOTALL)
matches = pattern.findall(data)

In [32]:
print(f"Found {len(matches)} matches!")
print(matches[0])

Found 1 matches!

Job type: Frequency calculation.
Method: RB3LYP
Basis set: 6-31G(D)




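This seems better, but it's still not quite what we want. You'll notice that we are not getting the data in `step 3`. This is because, by default, matches do not overlap. Since `<step 3>` is consumed at the end of the first match, it can't be used at the beginning of the next one. To allow overlapping matches, we can wrap the whole pattern in a lookahead assertion, `(?=...)`, which checks that the pattern matches at a position without consuming any characters. The groups inside the lookahead are still captured.

In [33]:
pattern = re.compile(r"(?=(<step \d>(.+?)<step \d>))", re.DOTALL)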
matches = pattern.findall(data)

In [34]:
print(f"Found {len(matches)} matches!")
print(matches[1][1])

Found 2 matches!

Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 
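
The overlapping behavior can be checked on a small made-up string. In this sketch, the plain pattern consumes `<step 2>` and so misses the second block, while the lookahead version consumes nothing and finds both. Only one group is used here, so `findall` returns plain strings rather than tuples.

```python
import re

# Made-up sample with three step markers.
sample = "<step 1>first<step 2>second<step 3>"

# Without a lookahead, "<step 2>" is consumed by the first match.
plain = re.compile(r"<step \d>(.+?)<step \d>").findall(sample)
print(plain)

# Inside a lookahead nothing is consumed, so "<step 2>" can start a new match.
overlapping = re.compile(r"(?=<step \d>(.+?)<step \d>)").findall(sample)
print(overlapping)
```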




## re.finditer

We started with `findall` here for simplicity, but an alternative which gives you more information about your matches is `finditer`. In this case, an iterator of match objects is returned rather than a list of strings. You can then iterate through the results with a `for` loop. Each match object provides more information, such as the span (start and end positions) of the match and the text captured by each group.

In [35]:
pattern = re.compile(r"(?=(<step \d>(.+?)<step \d>))", re.DOTALL)
matches = pattern.finditer(data)
for match in matches:
    print(f"Match span is {match.span(2)}")
    print(match.group(2))

Match span is (1125, 1195)

Job type: Frequency calculation.
Method: RB3LYP
Basis set: 6-31G(D)


Match span is (1203, 2043)

Job type: Single point.
Method: RB3LYP
Basis set: 6-31G(D)
 SCF total energy:   -3142.5663410 hartrees


  NMR shifts (ppm)
       Atom     Isotropic        Rel. Shift 
  ---------------------------------------------------
     1   *C1      50.7941          138.83
     2   *C4      64.1012          125.53
     3    C2      63.9735          125.65
     4   *C2      63.9735          125.65
     5   *C3      68.6792          120.95
     6    C3      68.6792          120.95
     7    H2      24.8868            7.30
     8   *H2      24.8868            7.30
     9   *H3      24.8117            7.37
    10    H3      24.8117            7.37
    11    C7      63.3782          126.25
    12    F1     273.5747          -93.43
    13    F2     248.4779          -68.34
    14    F3     273.5747          -93.43
    15   Br1    2066.9469 
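
Because each match object records where its groups were found, the span can be used to slice the original string. The sketch below collects key, value, and value position from two lines modeled on the output; the `records` list is just one hypothetical way to store the results.

```python
import re

# Two lines modeled on the Spartan output.
sample = "Method: RB3LYP\nBasis set: 6-31G(D)"

pattern = re.compile(r"(.+):\s*(.+)")

records = []
for match in pattern.finditer(sample):
    key = match.group(1)
    value = match.group(2)
    start, end = match.span(2)  # position of the value in the original string

    # Slicing the original string with the span recovers the same text.
    assert sample[start:end] == value

    records.append((key, value, (start, end)))

for record in records:
    print(record)
```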




Regular expressions take practice, but they are worth learning a bit about. It is useful to try out small regexes on online tools like [pythex](https://pythex.org/).