File Parsing#
Overview
Questions:
How do I sort through all the information in a text file and extract particular pieces of information?
Objectives:
Open a file and read in its contents line by line.
Search for a particular string in a file.
Manipulate strings and change data types.
Print to a new file.
Working with files#
One of the most common tasks in research is analyzing data. An enormous amount of data is currently being generated in biochemistry and molecular biology, much of it pertaining to sequence and structure. The PDB file format is commonly used to describe macromolecular structures that have been determined by experimental methods. You may be interested in exploring the text and/or the data in a PDB file. While the PDB web site is very helpful, there are times when it would be handy to extract specific information about one protein (or 1,000) with a few keystrokes. For example, you might be interested in the resolution of a structure, or the small molecules that are bound to the macromolecule. In general, this is called file parsing.
Working with file paths - the os.path module#
For this section, we will be working with the file 4eyr.pdb in the PDB_files directory.
Jupyter notebooks are designed to work on all operating systems. You may know that the operating system commands for organizing and working with files are different in Windows, Mac OS and Linux. To address this, we will use the Python os library. To use this library, we will first import it. Then we'll start with two functions from the library. The first is getcwd(), which stands for get current working directory; executing it tells us where we are in our computer's directory tree. The second is listdir(), which stands for list directory; it returns a list of the files and directories in the current working directory, which the Jupyter notebook then displays.
import os # imports the os library so you can use its functions
os.getcwd() # provides the full file path for the current working directory
'/home/runner/work/python-scripting-biochemistry/python-scripting-biochemistry/biochemist-python/chapters'
os.listdir() # provides a list of all files in the current working directory
['File_Parsing.ipynb',
'biopython_mmcif.ipynb',
'.ipynb_checkpoints',
'images',
'Bradford_plot2.png',
'Linear_Regression.ipynb',
'data.zip',
'rcsb_api.ipynb',
'SmallMolVis.ipynb',
'data',
'ligands',
'nonlinear_regression_part_1.ipynb',
'Bradford_plot.png',
'workshop_schedule.ipynb',
'Creating_Plots_in_Jupyter_Notebooks.ipynb',
'MolVis_with_iCN3D.ipynb',
'binding_site_investigation.ipynb',
'EC_class_ligands_search.ipynb',
'resolutions.txt',
'Processing_Multiple_Files_and_Writing_Files.ipynb',
'setup.ipynb',
'Bradford_plot3.png',
'Working_with_Pandas.ipynb',
'introduction.ipynb',
'Modifying_Ligands_with_rdkit.ipynb',
'nonlinear_regression_part_2.ipynb']
Most of the names in the list end with an extension that tells you what kind of file it is. Here are three examples.
.ipynb is a Jupyter notebook
.csv is a file containing data as comma separated values
.png is a portable network graphics file - an image file that is useful on web pages
There are also directories in this folder, including images and data. You can use the listdir() function to find out what is in a directory by entering the name of the directory, in quotes, inside the parentheses.
os.listdir('data')
['thrombin_with_ligands.csv',
'MM_data.csv',
'MM_data1.csv',
'protein_assay.csv',
'enzyme_kinetics.xlsx',
'chymotrypsin_kinetics.csv',
'protein_samples.csv',
'chymo_MM_data.csv',
'ligand13U-3D.sdf',
'Ground_water.csv',
'ligand13Un-3D.sdf',
'ligand13Uipr-3D.sdf',
'chymotrypsin_kinetics.xlsx',
'ligand13Ume-3D.sdf',
'MM_data_for_NLRpt2.csv',
'AP_kinetics.csv',
'PDB_files',
'AP_kin.csv',
'protein_assay2.csv',
'enzyme_kinetics.csv']
Notice there is a directory called PDB_files. To look at the contents of that directory, just enter 'data/PDB_files' within the parentheses for listdir().
os.listdir('data/PDB_files')
['1ddo.pdb',
'2pkr.pdb',
'4eyr.pdb',
'1pmb.cif',
'6zt7.pdb',
'7tim.pdb',
'3vnd.pdb',
'3iva.pdb',
'5veu.pdb',
'5rsa.cif',
'1a1t.cif',
'1a6n.cif',
'5eu9.pdb',
'1mbn.cif']
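Because listdir() returns an ordinary Python list of names, it can be filtered like any other list. The following is a minimal sketch, assuming we only want the .pdb files. The names are copied from the output above (shortened) so the example runs without the data directory, and file_names and pdb_names are variable names chosen here for illustration.

```python
import os

# A few of the names returned by os.listdir('data/PDB_files') above,
# hard-coded here so the sketch runs without the data directory.
file_names = ['1ddo.pdb', '2pkr.pdb', '4eyr.pdb', '1pmb.cif', '6zt7.pdb', '5rsa.cif']

# os.path.splitext splits a name into (root, extension)
pdb_names = [name for name in file_names
             if os.path.splitext(name)[1] == '.pdb']
print(pdb_names)
```

In a real notebook, file_names could instead be set to os.listdir('data/PDB_files').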
When we write a script, we want it to be usable on any operating system, so we will use a Python function called os.path.join that allows us to define file paths in a general way that works on Windows, Mac OS or Linux.
Now we know where 4eyr.pdb is located. To access this file from a Jupyter notebook on any operating system, we can use os.path.join from the os library to create a variable that holds the path to this file.
protein_file = os.path.join('data', 'PDB_files', '4eyr.pdb')
print(protein_file)
data/PDB_files/4eyr.pdb
Note
Here, we have specified that our filepath contains the 'data' and 'PDB_files' directories, and the os.path module has joined them into a filepath that is usable by our system.
Absolute and relative paths#
File paths can be absolute or relative.
A relative file path gives the location relative to the directory we are in. Thus, if we are in the biochemist-python directory, the relative filepath for the 4eyr.pdb file would be chapters/data/PDB_files/4eyr.pdb.
An absolute filepath gives the complete path to a file. This file path could be used from anywhere on a computer and would access the same file. For example, the absolute filepath to the 4eyr.pdb file on a Mac might be /Users/YOUR_USER_NAME/Desktop/python-scripting-biochemistry/biochemist-python/chapters/data/PDB_files/4eyr.pdb. You can get the absolute path of a file using os.path.abspath(path), where path is the relative path of the file.
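As a small illustration of the difference, the sketch below builds the relative path used in this lesson and converts it to an absolute one. The variable names are chosen here for illustration, and the absolute path you see will depend on where your notebook is running.

```python
import os

relative_path = os.path.join('data', 'PDB_files', '4eyr.pdb')
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))   # False
print(os.path.isabs(absolute_path))   # True
print(absolute_path)                  # depends on your current working directory
```

Note that os.path.abspath() only combines the current working directory with the relative path; it does not check that the file exists. os.path.exists(absolute_path) can be used for that.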
Note
We are working with the os.path module here, and this is how you will see people handle file paths in most Python code. However, since Python 3.4 there is also a pathlib module in the Python standard library that can be used to represent and manipulate filepaths. os.path works with filepaths as strings, while in the pathlib module, paths are objects. A good overview of the pathlib module can be found here.
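For comparison, here is a minimal sketch of the same file path built with both modules. This lesson continues to use os.path; the pathlib lines are shown only to illustrate the "paths are objects" idea, and the variable names are chosen here for illustration.

```python
import os
from pathlib import Path

# Build the same relative path two ways
path_as_string = os.path.join('data', 'PDB_files', '4eyr.pdb')
path_as_object = Path('data') / 'PDB_files' / '4eyr.pdb'

# A Path object carries useful attributes
print(path_as_object.name)     # 4eyr.pdb
print(path_as_object.suffix)   # .pdb

# Converting the object to a string gives the os.path result
print(str(path_as_object) == path_as_string)   # True
```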
Reading a file#
In Python, there are many ways to read in information from a text file. The best method to use depends on the type of data and the type of analysis you are performing. If you have a file with lots of different types of information, text and numbers, with different types of formatting, the most generic way to read in information is the readlines() function. Reading in the file is a two-step process. First, you have to open the file using the file path we defined above. This creates a file object, or filehandle. Then you can read in information with the readlines() function.
We could use the following code to accomplish this task:
outfile = open(protein_file,"r")
data = outfile.readlines()
After you open and read information from a file object, you must close the file. In this example, the command would be
outfile.close()
There is a second option for opening files: using a context manager (that is to say, starting the statement with the word with). This is actually the preferred way to open files, because the file is closed automatically when the indented block ends, so you do not have to remember to close it.
with open(protein_file,"r") as outfile:
    data = outfile.readlines()
This code opens a file for reading and assigns it to the filehandle outfile. The r argument in the function stands for read. Other arguments might be w for write, if we want to write new information to the file (replacing anything already in it), or a for append, if we want to add new information at the end of the file.
In the next line, we use the readlines() function to get our file as a list of strings. Notice the dot notation introduced last lesson; readlines acts on the file object given to the left of the dot. The function returns a list, which we save as data, where each element of the list is a string that is one line of the file. This is always how the readlines() function works.
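The w and a modes mentioned above can be sketched in the same way. The example below is only an illustration: it works in a temporary directory so that no real files are touched, and notes.txt is a made-up file name.

```python
import os
import tempfile

# Work in a temporary directory so no real files are touched.
# 'notes.txt' is a made-up file name for this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    notes_file = os.path.join(tmp_dir, 'notes.txt')

    with open(notes_file, 'w') as outfile:   # 'w' creates or overwrites the file
        outfile.write('4eyr\n')

    with open(notes_file, 'a') as outfile:   # 'a' adds to the end of the file
        outfile.write('1ddo\n')

    with open(notes_file, 'r') as infile:    # 'r' reads the file back
        written_lines = infile.readlines()

print(written_lines)
```

Each string returned by readlines() still ends with the newline character that was written to the file.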
The file we will be analyzing in this example is a file from the PDB web site for an HIV protease complex with the inhibitor ritonavir.
readlines function behavior#
Note that the readlines function can only be used on a file object one time. After the first call, the whole file has been read, so a second call returns an empty list. If you forget to assign outfile.readlines() to a variable, you must open the file again in order to get the contents of the file.
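A short sketch of this behavior, using a small temporary file (demo.txt is a made-up name) rather than the PDB file so that it runs anywhere:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'demo.txt')

    # Create a two-line file to read back
    with open(demo_file, 'w') as handle:
        handle.write('line one\nline two\n')

    with open(demo_file, 'r') as handle:
        first_read = handle.readlines()    # reads everything
        second_read = handle.readlines()   # nothing is left to read

print(len(first_read))    # 2
print(len(second_read))   # 0
```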
Check Your Understanding
Check that your file was read in correctly by determining how many lines are in the file.
Answer
print(len(data))
2232
Searching for a pattern in your file#
The file we opened is the complete PDB file for the Crystal structure of multidrug-resistant clinical isolate 769 HIV-1 protease in complex with ritonavir. As stated previously, the readlines() function put the file contents into a list where each element is a line of the file. You may remember from lesson 1 that a for loop can be used to execute the same code repeatedly. As we learned in the previous lesson, we can use a for loop to iterate through elements in a list.
Here’s the code we could use to take a look at what’s in the file.
for line in data:
    print(line)
Here are the first five lines that would result from running this code on 4eyr.pdb:
HEADER HYDROLASE/HYDROLASE INHIBITOR 01-MAY-12 4EYR
TITLE CRYSTAL STRUCTURE OF MULTIDRUG-RESISTANT CLINICAL ISOLATE 769 HIV-1
TITLE 2 PROTEASE IN COMPLEX WITH RITONAVIR
CAVEAT 4EYR LIGAND RIT HAS LOW CORRELATION AND HIGH REAL SPACE R VALUE.
COMPND MOL_ID: 1;
If you look through 4eyr.pdb in a text editor, you will find one line that starts with HETNAM. This line contains information about the ligand or heterogen (in PDB terms) that is bound to HIV protease in this structure, providing both the abbreviation (RIT) and the full name of the drug (RITONAVIR). We want to search through this file and find that line, and print only that line. We can do this using an if statement.
Returning to our file example,
for line in data:
    if 'HETNAM' in line:
        HETNAM_line = line
        print(HETNAM_line)
HETNAM RIT RITONAVIR
Remember that readlines() saves each line of the file as a string, so HETNAM_line is a string that contains the whole line. For our analysis, if we are most interested in the abbreviation for the drug, we need to split up the line so we can save just that abbreviation under a different variable name. To do this, we can use a function called split. The split function takes a string and divides it into its components using a delimiter.
The delimiter is specified as an argument to the function (put in the parentheses ()). If you do not specify a delimiter, the string is split wherever there is whitespace (spaces, tabs or newlines). Let's try this out.
HETNAM_line.split()
['HETNAM', 'RIT', 'RITONAVIR']
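The default behavior matters for PDB files, where columns are often separated by several spaces. The sketch below compares split() with split(' ') on a hypothetical line that has a run of spaces after the record name; example_line is made up for illustration.

```python
# Hypothetical line with a run of spaces, as in a fixed-width PDB record
example_line = 'HETNAM     RIT RITONAVIR\n'

default_split = example_line.split()     # splits on any run of whitespace
space_split = example_line.split(' ')    # splits at every single space

print(default_split)
print(space_split)
```

With no argument, runs of whitespace are treated as one separator and the trailing newline is dropped. With ' ' as the delimiter, each extra space produces an empty string in the list and the newline stays attached to the last element.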
We can save the output of this function to a variable as a new list. In the example below, we take the line we found in the for loop and split it up into its individual words.
words = HETNAM_line.split()
print(words)
['HETNAM', 'RIT', 'RITONAVIR']
From this print statement, we now see that we have a list called words, where we have split HETNAM_line. The abbreviation is actually the second element of this list, so we can now save it as a new variable.
abbrev = words[1]
print(abbrev)
RIT
Check Your Understanding
Some PDB files contain more than one heterogen. For example, the structure of D-amino acid oxidase found in PDB entry 1ddo contains three heterogens. Can you think of a way to keep all of the lines using syntax we have already learned?
Solution
You will need to create an empty list and append each answer to the list.
# Use os to assign the file path to a variable
import os
protein_file = os.path.join('data', 'PDB_files','1ddo.pdb')
# Create a list to hold the heterogen information
HETNAM_list = []
# Use the with context manager to open the file,
# then a for loop to populate HETNAM_list
with open(protein_file,"r") as outfile:
    data = outfile.readlines()
for line in data:
    if 'HETNAM' in line:
        words = line.split()
        HETNAM_list.append(words[1:3])
print(HETNAM_list)
We might also want to extract the number of atoms in the protein in this file. We will modify the above steps to achieve the desired result. We will be looking for the line that contains the term PROTEIN ATOMS.
for line in data:
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOM_line = line
        words = PROTEIN_ATOM_line.split()
        print(PROTEIN_ATOM_line)
REMARK 3 PROTEIN ATOMS : 1514
When this line is split on whitespace, the fifth element of the resulting list is a colon (:). It is possible to modify the split command to split lines using the colon as the delimiter (':').
for line in data:
    if 'PROTEIN ATOMS' in line:
        PROTEIN_ATOM_line = line
        words = PROTEIN_ATOM_line.split(':')
        print(words)
['REMARK 3 PROTEIN ATOMS ', ' 1514 \n']
From this print statement, we now see that we have a list called words, where we have split PROTEIN_ATOM_line. The number of atoms in the protein is actually the second element of this list, so we can now save it as a new variable.
atoms = words[1]
print(atoms)
1514
The HIV protease structure in 4eyr.pdb is a dimer. Let’s find the number of atoms in each monomer unit. If we now try to do a math operation on atoms, we get an error message. Why do you think that is?
atoms / 2
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 atoms / 2
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Even though atoms looks like a number to us, it is really a string, so we cannot use it for a math operation. We need to change the data type of atoms to a float. This is called casting.
atoms = float(atoms)
atoms / 2
Now it works. If we had thought ahead, we could have changed the data type when we assigned the variable originally.
atoms = float(words[1])
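Since an atom count is a whole number, casting to an integer is another option. The sketch below is an alternative to the float() approach used above, not a replacement for it; the words list is copied from the output of the split(':') example so that the block runs on its own.

```python
# The list produced by PROTEIN_ATOM_line.split(':') above
words = ['REMARK 3 PROTEIN ATOMS ', ' 1514 \n']

# int() ignores the spaces and newline around the digits
atoms = int(words[1])
print(atoms // 2)         # integer division keeps the result a whole number

# strip() removes the surrounding whitespace if you want a clean string
print(words[1].strip())
```

Using // instead of / keeps the result an integer, which may be more natural when counting atoms.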
Exercise on file parsing
Use skills from this lesson and the previous lesson to extract the experimental method and temperature for determining the structure of 4eyr.pdb.
EXPERIMENT TYPE : X-RAY DIFFRACTION
TEMPERATURE (KELVIN) : 298
Hint
Remember that you can only use readlines once on a file object. If you no longer have the list of lines, you will need to reopen the file to read it again.
To find the lines with the keywords, do a search and then print the lines to see their content. Then you can refine your search and split the lines as needed to get the desired output.
Solution
This is one possible solution for the file parsing exercise.
import os
protein_file = os.path.join('data', 'PDB_files','4eyr.pdb')
print(protein_file)
with open(protein_file,"r") as outfile:
    data = outfile.readlines()
for line in data:
    if 'EXPERIMENT TYPE' in line:
        exp_type_line = line
        words = exp_type_line.split()
        # words[2] and words[3] are the label; words[5] and words[6] are the method
        print(words[2], words[3], ':', words[5], words[6])
    if 'KELVIN' in line:
        temp_line = line
        words = temp_line.split()
        # words[2] and words[3] are the label; words[5] is the temperature
        print(words[2], words[3], ':', words[5])
Searching for a particular line number in your file#
There is a lot of other information in the PDB file that might be of interest. For example, we might want to pull out the sequence of the protein. If we look through the file, 4eyr.pdb, in a text editor, we notice that the sequence is given in a series of lines that begin with SEQRES
SEQRES 1 A 99 PRO GLN ILE THR LEU TRP GLN ARG PRO ILE VAL THR ILE
followed by a series of lines that contain the full sequence using three letter abbreviations for the amino acids. In this case, we don’t want to pull something out of this line, as we did in our previous example, but we want to know which line of the file this is so that we can then pull the sequence from the next few lines.
When you use a for loop, it is easy to have Python keep track of the line numbers using the enumerate function. The general syntax is
for linenum, line in enumerate(list_name):
    do things in the loop
In this notation, there are now two variables you can use in your loop commands. The variable linenum (which can be named something else) keeps track of which iteration of the loop you are on, in this case which line of the file you are on. The variable line (which could also be named something else) functions exactly as it did before, holding the actual information from the list. Finally, instead of just giving the list name, you use enumerate(list_name).
This block of code searches our file for the first SEQRES line of chain A and reports the line number.
with open(protein_file,"r") as outfile:
    data = outfile.readlines()
for linenum, line in enumerate(data):
    # PDB records are fixed-width, so the search string keeps the file's spacing
    if 'SEQRES   1 A' in line:
        print(linenum, ':', line, sep = '')
Now we know that this is line 310 in our file (remember that you start counting at zero!).
Check your Understanding
What would be printed if you entered the following?
print(data[311])
print(data[312])
print(data[313])
print(data[314])
print(data[315])
Answer
It prints elements 311 through 315 of the list data. These are the five SEQRES lines that follow the line containing "SEQRES 1 A" (element 310), giving the next part of the sequence from the PDB file for 4eyr.
print(data[311])
print(data[312])
print(data[313])
print(data[314])
print(data[315])
SEQRES 2 A 99 LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASN THR
SEQRES 3 A 99 GLY ALA ASP ASP THR VAL LEU GLU GLU VAL ASN LEU PRO
SEQRES 4 A 99 GLY ARG TRP LYS PRO LYS LEU ILE GLY GLY ILE GLY GLY
SEQRES 5 A 99 PHE VAL LYS VAL ARG GLN TYR ASP GLN VAL PRO ILE GLU
SEQRES 6 A 99 ILE CYS GLY HIS LYS VAL ILE GLY THR VAL LEU VAL GLY
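Once the SEQRES lines have been located, the same split() technique can be used to gather the residues into one list. This is only a sketch: it uses three SEQRES lines copied from this lesson instead of the full file, and it assumes the first four words of each line are the record name, serial number, chain and chain length. The names seqres_lines and residues are chosen here for illustration.

```python
# Three SEQRES lines copied from the lesson (not the whole sequence)
seqres_lines = [
    'SEQRES 1 A 99 PRO GLN ILE THR LEU TRP GLN ARG PRO ILE VAL THR ILE',
    'SEQRES 2 A 99 LYS ILE GLY GLY GLN LEU LYS GLU ALA LEU LEU ASN THR',
    'SEQRES 3 A 99 GLY ALA ASP ASP THR VAL LEU GLU GLU VAL ASN LEU PRO',
]

residues = []
for line in seqres_lines:
    words = line.split()
    # Keep only SEQRES records for chain A
    if words[0] == 'SEQRES' and words[2] == 'A':
        residues.extend(words[4:])   # skip record name, serial, chain, length

print(len(residues))
print(residues[:5])
```

With the full file, the loop could run over data instead of seqres_lines.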
A final note about regular expressions#
Sometimes you will need to match something more complex than just a particular word or phrase in your output file. Sometimes you will need to match a particular word, but only if it is found at the beginning of a line. Or perhaps you will need to match a particular pattern of data, like a capital letter followed by a number, but you won't know the exact letter and number you are looking for. These types of matching situations are handled with something called regular expressions, which are accessed through the Python module re. While using regular expressions is outside the scope of this tutorial, they are very useful and you might want to learn more about them in the future. A tutorial can be found in the book Automate the Boring Stuff with Python. A great test site for regex is here.
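To give a flavor of the two situations described above, here is a very small sketch using the re module. The second line in the list and the sentence passed to re.findall are made up for illustration.

```python
import re

lines = [
    'HETNAM RIT RITONAVIR',
    'REMARK 3 SEE HETNAM RECORDS',   # made-up line for illustration
]

# ^ anchors the match to the beginning of the line
starts_with_hetnam = [line for line in lines if re.search(r'^HETNAM', line)]
print(starts_with_hetnam)

# [A-Z][0-9] matches one capital letter followed by one digit
pattern_matches = re.findall(r'[A-Z][0-9]', 'residues D25 and I50')
print(pattern_matches)
```

A plain 'HETNAM' in line test would match both lines, while the anchored pattern matches only the line that begins with HETNAM.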
Key Points
You should use the os.path module to work with file paths.
One of the most flexible ways to read in the lines of a file is the readlines() function.
An if statement can be used to find a particular string within a file.
The split() function can be used to separate the elements of a string.
You will often need to recast data into a different data type when it was read in as a string.