File Processing and Data Cleaning

Session 2 Overview

So far, we have been working with data that is fairly ‘clean’: for the most part, it had few missing values and each column had the correct data type. This is more often the case when the data was prepared to be machine readable in the first place (more ‘modern’ data).

The data you would like to work with, however, may not always be ‘clean’, particularly if it is older. In fact, you may find in some instances that data cleaning is 90% of the work you have to do to solve a problem. For example, you might want to analyze data from published papers. In these cases, the data may only be available as tables inside a PDF.

The first topic in this section is regular expressions. We will learn what regular expressions are and how we can use them to pull items of interest from text files using patterns.
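As a small preview of that idea, here is a minimal sketch using Python's built-in `re` module. The text, the field names (`compound`, `log_kp`), and the numbers are made-up placeholders for illustration; they are not taken from the paper or from any file used in this lesson.

```python
import re

# Hypothetical text, invented for illustration only (not data from the paper).
text = """compound: water, log_kp: -6.3
compound: ethanol, log_kp: -6.1
note: values above are placeholders"""

# The pattern describes the shape of the lines we care about:
#   (\w+)         captures the compound name
#   (-?\d+\.\d+)  captures a decimal number with an optional minus sign
pattern = re.compile(r"compound:\s*(\w+),\s*log_kp:\s*(-?\d+\.\d+)")

# findall returns one (name, value) tuple of strings per matching line;
# lines that do not fit the pattern (such as the note) are skipped.
records = [(name, float(value)) for name, value in pattern.findall(text)]

print(records)
```

The pattern describes what a line of interest looks like, so the same code keeps working if more lines of that shape are added to the text.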

Next, we will use a library called OCRmyPDF to run optical character recognition (OCR) on a PDF. OCR adds a text layer to the PDF, so that the characters on each page can be read by software rather than existing only as an image. Then, we will use another package called tabula to extract the tables from the PDF.

We’ll be grabbing data from the following paper, cleaning it, and fitting the data using Python:

Potts, R.O., Guy, R.H. A Predictive Algorithm for Skin Permeability: The Effects of Molecular Size and Hydrogen Bond Activity. Pharm Res 12, 1628–1633 (1995). https://doi.org/10.1023/A:1016236932339

In this paper, the authors build a model for predicting the skin permeability of molecules based on simple molecular descriptors. The paper was published in 1995, so it is unlikely to be the best method available today. It is, however, a nice example of the kinds of things you can do with Python.

Access and download the paper here. Save it in a directory called pdfs so that the file paths match those used in this notebook.

Regular Expressions

  • What is a regular expression?

  • How can I use regular expressions to pull information from files?

Reading Data from PDFs

  • How can I tell if I can extract data from a PDF?

  • How can I run optical character recognition on a PDF?

  • How can I extract character information from a PDF?

Data Cleaning with Pandas

  • What does “clean data” mean?

  • How can I drop unnecessary data from a dataframe?

  • How can I change column or row names in a dataframe?

  • How can I cast columns to the correct data type?