Processing Multiple Files and Writing Files#
Overview
Questions:
How do I analyze multiple files at once?
Objectives:
Import a Python library.
Use Python library functions.
Process multiple files using a for loop.
Print output to a new text file.
Processing multiple files#
In our previous lesson, we parsed values from output files. While you might have seen the utility of doing such a thing, you might have also wondered why we didn’t just search the file and cut and paste the values we wanted into a spreadsheet. If you only have 1 or 2 files, this might be a very reasonable thing to do. But what if you had 100 files to analyze? What if you had 1000? In such a case the cutting and pasting method would be very tedious and time consuming.
One of the real powers of writing a program to analyze your data is that you can just as easily analyze 100 files as 1 file. In this example, we are going to parse the PDB files for a series of enzyme structures in the Protein Data Bank and extract resolution data for each one. The PDB files are all saved in a folder called PDB_files that you should have downloaded in the setup for this lesson. The code in this lesson looks for that folder at data/PDB_files, relative to the directory where you are writing and executing your code, so make sure it is in that location.
To analyze multiple files, we will need to import a Python library. A library is a set of modules which contain functions. The functions within a library or module are usually related to one another. Using libraries in Python reduces the amount of code you have to write. In the last lesson, we imported os.path, which was a module that handled filepaths for us.
In this lesson, we will be using the glob library, which will help us read in multiple files from our computer. Within a library there are modules and functions which do a specific computational task. Usually a function has some type of input and gives a particular output. To use a function that is in a library, you often use the dot notation introduced earlier.
import library_name
output = library_name.function_name(input)
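As a concrete instance of this pattern, here is a minimal sketch using the os library from the last lesson. The path components are the ones used later in this lesson; nothing needs to exist on disk for this to run.

```python
import os

# library_name.function_name(input), with no input in this case:
# ask the operating system for the current working directory.
current_directory = os.getcwd()
print(current_directory)

# os.path is a module inside the os library, so the dot notation is chained.
pdb_folder = os.path.join(current_directory, 'data', 'PDB_files')
print(pdb_folder)
```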
Importing libraries#
We are going to import two libraries. One is the os library which controls functions related to the operating system of your computer. We used this library in the last lesson to handle filepaths. The other is the glob library which contains functions to help us analyze multiple files. If we are going to analyze multiple files, we first need to specify where those files are located.
Check Your Understanding
How would you use the os.path module to point to the directory where your PDB files are located?
Solution
pdb_directory = os.path.join('data', 'PDB_files')
In order to get all of the files which match a specific pattern, we will use the wildcard character *.
import os
file_location = os.path.join('data', 'PDB_files', '*.pdb')
print(file_location)
data/PDB_files/*.pdb
This specifies that we want to look for all the files in a directory called data/PDB_files that end in “.pdb”. The * is the wildcard character which matches any character or series of characters.
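To see what the wildcard does without touching any files, the sketch below applies the same style of pattern matching to plain strings with the standard-library fnmatch module. The file names in the list are hypothetical and chosen only for illustration.

```python
import fnmatch

# Hypothetical file names, used only to illustrate wildcard matching.
candidates = ['1ddo.pdb', '2pkr.pdb', 'notes.txt', 'pdb_summary.csv']

# Keep only the names that match the pattern '*.pdb'.
matches = fnmatch.filter(candidates, '*.pdb')
print(matches)
```

Only the names ending in “.pdb” are kept; a name that merely contains “pdb” somewhere else does not match.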
Next we are going to use a function called glob in the library called glob. It is a little confusing since the function and the library have the same name, but we will see other examples where this is not the case later. The output of the function glob is a list of all the filenames that fit the pattern specified in the input. The input is the file location.
import glob
filenames = glob.glob(file_location)
print(filenames)
['data/PDB_files/1ddo.pdb', 'data/PDB_files/2pkr.pdb', 'data/PDB_files/4eyr.pdb', 'data/PDB_files/6zt7.pdb', 'data/PDB_files/7tim.pdb', 'data/PDB_files/3vnd.pdb', 'data/PDB_files/3iva.pdb', 'data/PDB_files/5veu.pdb', 'data/PDB_files/5eu9.pdb']
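The order of the list returned by glob.glob can depend on the file system, so your list may be ordered differently from the one shown here. If a predictable order matters, one option is to wrap the call in sorted. The sketch below demonstrates this in a temporary directory with hypothetical, empty files so that it runs anywhere.

```python
import glob
import os
import tempfile

# Build a throwaway directory containing a few empty, hypothetical files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['7tim.pdb', '1ddo.pdb', 'readme.txt']:
        with open(os.path.join(tmp_dir, name), 'w') as handle:
            handle.write('')

    pattern = os.path.join(tmp_dir, '*.pdb')
    # sorted() returns the matching paths in alphabetical order.
    sorted_names = sorted(glob.glob(pattern))
    # Strip the temporary directory so only the file names remain.
    base_names = [os.path.basename(path) for path in sorted_names]

print(base_names)
```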
Reading multiple files with nested for loops#
Now you have a list of all the files which end in .pdb in the PDB_files directory. To parse every file we just found, we will use a for loop to go through each file.
for f in filenames:
    with open(f, 'r') as outfile:
        data = outfile.readlines()
    for line in data:
        if 'RESOLUTION.' in line:
            res_line = line
            words = res_line.split()
            resolution = float(words[3])
            print(resolution)
3.1
2.4
1.8
1.85
1.9
2.6
2.7
2.91
2.05
Notice that in this code we actually used two for loops, one nested inside the other. The outer for loop iterates over the filenames we found earlier. The inner for loop iterates over the lines in each file, just as we did in our previous file parsing lesson.
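The nested structure can be tried without any files on disk. In the sketch below, each inner list stands in for the lines returned by readlines() for one file. The lines are hypothetical and only imitate the layout of a PDB resolution record.

```python
# Each inner list plays the role of the lines read from one file.
# These lines are made up for illustration; they are not real PDB data.
fake_files = [
    ['HEADER    EXAMPLE ONE\n',
     'REMARK   2 RESOLUTION.    3.10 ANGSTROMS.\n'],
    ['HEADER    EXAMPLE TWO\n',
     'REMARK   2 RESOLUTION.    1.85 ANGSTROMS.\n'],
]

resolutions = []
for lines in fake_files:      # outer loop: one pass per "file"
    for line in lines:        # inner loop: one pass per line in that file
        if 'RESOLUTION.' in line:
            words = line.split()
            resolutions.append(float(words[3]))

print(resolutions)
```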
The output our code currently generates is not that useful. It doesn’t show us which file each resolution value came from.
We want to print the name of the molecule with the resolution. We can use os.path.basename, which is another function in os.path, to get just the name of the file.
first_file = filenames[0] # look above to recall the content of filenames
print(first_file)
file_name = os.path.basename(first_file)
print(file_name)
data/PDB_files/1ddo.pdb
1ddo.pdb
Check your understanding
How would you extract the PDB ID from the example above?
Solution
You can use the str.split function introduced in the last lesson, and split at the ‘.’ character.
split_filename = file_name.split('.')
molecule_name = split_filename[0]
print(molecule_name)
Using the solution above, we can modify our loop so that it prints the file name along with each resolution value.
for f in filenames:
    # Get the PDB ID
    file_name = os.path.basename(f)
    split_filename = file_name.split('.')
    molecule_name = split_filename[0]
    # Read the data
    with open(f, "r") as outfile:
        data = outfile.readlines()
    # Loop through the data
    for line in data:
        if 'RESOLUTION.' in line:
            res_line = line
            words = res_line.split()
            resolution = float(words[3])
            print(molecule_name, ": ", resolution, " Angstroms", sep="")
1ddo: 3.1 Angstroms
2pkr: 2.4 Angstroms
4eyr: 1.8 Angstroms
6zt7: 1.85 Angstroms
7tim: 1.9 Angstroms
3vnd: 2.6 Angstroms
3iva: 2.7 Angstroms
5veu: 2.91 Angstroms
5eu9: 2.05 Angstroms
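The loop above gets the PDB ID by splitting the file name at the ‘.’ character. A possible alternative, shown here only as a sketch, is os.path.splitext, which separates the final extension from the rest of the name and so also copes with names that contain more than one period. The path is the first entry from the filenames list shown earlier.

```python
import os

file_path = os.path.join('data', 'PDB_files', '1ddo.pdb')
file_name = os.path.basename(file_path)

# splitext returns a pair: the name without the extension, and the extension.
molecule_name, extension = os.path.splitext(file_name)
print(molecule_name, extension)
```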
Printing to a File#
Finally, it might be useful to print our results in a new file, such that we could share our results with colleagues or e-mail them to our advisors. Much like when we read in a file, the first step to writing output to a file is opening that file for writing. In general, to open a file for writing you have two options. The first uses the open function.
filehandle = open('file_name.txt', 'w+')
# take some actions
filehandle.write('content')  # write content to the file
filehandle.close()
The filehandle.close() command is very important here. Think about a computer as someone who has a very good memory, but is very slow at writing. Therefore, when you tell the computer to write a line, it remembers what you want it to write, but it may not actually write it to the new file until you tell it you are finished. The filehandle.close() command tells the computer you are finished giving it lines to write and that it should go ahead and write the file now. If you are trying to write a file and the file keeps coming up empty, it is probably because you forgot to close the file.
The second (preferred) approach in Python is to use the with context manager that we have already used for reading a file. The advantage of this approach is that all of the steps that generate content for the file are indented under the initial with statement, and the file closes automatically when all of the actions indented beneath the with statement are completed.
with open('file_name.txt', 'w') as filehandle:
    # take some actions
    filehandle.write('content')  # add content to the file
Let’s examine the syntax of the with statement.
with open('file_name.txt', 'w') as filehandle:
The w instructs Python to open the file for writing. If the file does not exist, it is created; if it already exists, its contents are overwritten. If you use w+, the file is opened the same way, but you can also read from it. You can also use a to append to an existing file, or a+ to append and also be able to read. The difference between w+ and a+ is that w+ will overwrite the file if it already exists, whereas a+ will keep what is already there and just add additional text to the end of the file.
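The difference between overwriting and appending is easy to check for yourself. The following sketch writes to a hypothetical file, modes_demo.txt, inside a temporary directory so that nothing in your working directory is touched.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # 'modes_demo.txt' is a hypothetical file name used only for this sketch.
    demo_path = os.path.join(tmp_dir, 'modes_demo.txt')

    with open(demo_path, 'w') as filehandle:
        filehandle.write('first\n')

    # Opening with 'w' again discards the previous contents.
    with open(demo_path, 'w') as filehandle:
        filehandle.write('second\n')

    # Opening with 'a' keeps the contents and adds to the end.
    with open(demo_path, 'a') as filehandle:
        filehandle.write('third\n')

    with open(demo_path, 'r') as filehandle:
        contents = filehandle.read()

print(contents)
```

The file ends up containing “second” and “third” on separate lines; “first” was lost when the file was reopened with w.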
Python can only write strings to a file opened in text mode, as we have done here. In the next cell, we want to print the contents of two variables, molecule_name and resolution. To build a string from them, you place an F (upper or lower case) immediately in front of the opening quote and enclose the content to be printed in quotes. Each Python variable is placed in braces {}. Then you can either print the line (as we have done before) or you can use the filehandle.write() command to write it to a file.
To make the printing neater, we will separate the PDB ID from the resolution using a tab. To insert a tab, we use the special character \t.
with open('resolutions.txt', 'w+') as datafile:
    for f in filenames:
        # Get the PDB ID
        file_name = os.path.basename(f)
        split_filename = file_name.split('.')
        molecule_name = split_filename[0]
        # Read the data
        with open(f, "r") as outfile:
            data = outfile.readlines()
        # Loop through the data
        for line in data:
            if 'RESOLUTION.' in line:
                res_line = line
                words = res_line.split()
                resolution = float(words[3])
                datafile.write(F'{molecule_name} \t {resolution} \n')
After you run this command, look in the directory where you ran your code and find the “resolutions.txt” file. Open it in a text editor and look at the file.
In the file writing line, notice the \n at the end of the line. This is the newline character. Without it, the text in our file would all be run together on one line.
A final note about string formatting#
The F'string' notation (often called an f-string) that you can use with the print or the write command lets you format strings in many ways. You could include other words or whole sentences. For example, we could change the file writing line to
datafile.write(F'For the PDB ID {molecule_name} the resolution is {resolution} in Angstroms.\n')
where anything in the braces is a python variable and it will print the value of that variable.
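The braces can also carry a format specification after a colon, which controls how a value is displayed. As a small sketch, the example below prints a resolution with exactly two digits after the decimal point. The two values are stand-ins for the variables used in the loop, taken from the first line of the output shown earlier.

```python
# Stand-in values for the variables used in the loop above.
molecule_name = '1ddo'
resolution = 3.1

# ':.2f' formats the number with two digits after the decimal point.
formatted_line = f'{molecule_name}\t{resolution:.2f}\n'
print(formatted_line)
```

Here 3.1 is written as 3.10, which keeps the column of resolutions aligned when values have different numbers of decimal places.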
Project
You can complete this project to test your skills. It should be completed when this material is used in a long workshop, or if you are working through this material independently.
The goal of this exercise is to extract the Enzyme Commission Class for a series of enzyme structures in PDB files and write them to a text file. The files are located in the data/PDB_files folder. If you open any of these files in a text editor and search for the term “EC:” you will find a listing that looks like this:
COMPND 6 EC: 1.2.1.13;
You are probably familiar with these numbers, but just in case - the Enzyme Commission class tells you the function of an enzyme in a hierarchical format. You can learn more at the BRENDA EC Explorer.
Your assignment is to parse the files in the data/PDB_files folder and write a new file named EC_class.txt that contains the PDB ID and EC class for each of these enzymes. When you open the file in your text editor, it should look like this:
7tim 5.3.1.1
6zt7 3.2.1.55
5eu9 4.2.1.11
3iva 2.1.1.13
2pkr 1.2.1.13
3vnd 4.2.1.20
5veu 1.14.14.1
Hint
It helps when you are writing code to break up what you have to do into steps. Overall, we want to get information from the file. How do we do that?
If you think about the steps you will need to do this assignment you might come up with a list that is like this:
Open the file for reading.
Read the data in the file.
Loop through the lines in the file.
Read the files to gain access to the information we want.
Extract the desired information and write it to a file.
It can be helpful when you code to write out these steps and work on it in pieces. Try to write the code using these steps. Note that as you write the code, you may come up with other steps!
First, think about what you have to do for step 1, and write the code for that. Next, think about how you would do step 2 and write the code for that. You can troubleshoot each step using print statements.
The steps build on each other, so you can work on getting each piece written before moving on to the next.
Solution
with open('EC_class.txt', 'w+') as datafile:  # This opens the file for writing
    for f in filenames:
        # Get the PDB ID
        file_name = os.path.basename(f)
        split_filename = file_name.split('.')
        molecule_name = split_filename[0]
        # Read the files to gain access to the information we want.
        outfile = open(f, 'r')
        data = outfile.readlines()
        outfile.close()
        # Extract the desired information and write it to a file.
        for line in data:
            if 'EC:' in line:
                ec_line = line
                words1 = ec_line.split(';')
                # print(words1)
                words2 = words1[0].split(':')
                datafile.write(F'{molecule_name} \t {words2[1]} \n')
Notice that the datafile.close() command is not required because this solution employs the with context manager.
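Splitting at ‘:’ leaves the space that follows “EC:” attached to the value. If you want the EC class without surrounding whitespace, one option is the string method strip, sketched below on a single line that imitates the COMPND record shown in the project description.

```python
# A sample line imitating the COMPND record from the project description.
ec_line = 'COMPND   6 EC: 1.2.1.13;\n'

words1 = ec_line.split(';')        # drop the trailing ';' and newline
words2 = words1[0].split(':')      # separate the 'EC' label from the value
raw_value = words2[1]              # still carries a leading space
ec_class = raw_value.strip()       # whitespace removed from both ends

print(repr(raw_value), repr(ec_class))
```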
Key Points
Use the glob function in the Python library glob to find all the files you want to analyze.
You can have multiple for loops nested inside each other.
Python can only print strings to files.