Reading data from PDFs

Overview

Questions:

  • How can I tell if I can extract data from a pdf?

  • How can I run optical character recognition on a PDF?

  • How can I extract information from a PDF?

Objectives:

  • Use ocrmypdf to make sure our PDF has recognizable characters.

  • Use tabula-py to extract data from a table in a PDF.

You should have the paper we are going to work with in your pdfs folder. The name of the file is Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf. We will be reading the tables on page 3.

Start by checking that you have the pdfs folder and the PDF. We will use the shell command ls for this. We put an exclamation mark at the beginning of this command because it is not Python. In the Jupyter notebook, lines that start with ! are passed to the system shell, just as if you had typed them in a terminal.

! ls pdfs
186.full.pdf	   Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
Delaney_paper.pdf  cyclodextrin.pdf
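
If you would rather stay in Python, the same check can be sketched with the standard library's pathlib module. This is only an illustration: the helper name list_pdfs is hypothetical and not part of the lesson.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the sorted names of the PDF files directly inside `folder`."""
    folder = Path(folder)
    if not folder.is_dir():
        # A missing folder is reported as "no PDFs" rather than as an error.
        return []
    return sorted(p.name for p in folder.glob("*.pdf"))


# "pdfs" is the folder used in this lesson; adjust the path if yours differs.
print(list_pdfs("pdfs"))
```

An empty list means that either the folder is missing or it contains no files ending in .pdf.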

We’re going to use a Python library called tabula-py to read the data in Table 1. However, this pdf doesn’t have any text information in it yet. One way you can tell this is by clicking and dragging your cursor over the text in a pdf viewer like Adobe Acrobat. If the text is not highlighted, the pdf does not contain text information. If we tried to extract the data in the table at this point, we would get an empty table.
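
The click-and-drag test can also be approximated in code. The sketch below assumes the third-party pypdf library is installed; pypdf is not used elsewhere in this lesson, and the helper names has_text_layer and extract_page_texts are hypothetical. The idea is simply that a PDF without text information yields empty strings when you ask for the text of each page.

```python
def has_text_layer(page_texts):
    """Return True if at least one page yielded non-whitespace text."""
    return any(text and text.strip() for text in page_texts)


def extract_page_texts(pdf_path):
    """Extract the text of every page with pypdf (assumed to be installed)."""
    # Imported here so that has_text_layer can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Example use, once pypdf is installed:
# texts = extract_page_texts("pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf")
# print(has_text_layer(texts))
```

A result of False suggests that the file needs OCR before any table can be extracted from it.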

You can get text information in a pdf by performing optical character recognition, or OCR. If you have Adobe Acrobat Pro, it has an OCR tool built in that you can use. Python also has some free libraries which can be used for OCR. We’ll be using one called OCRmyPDF.

Again, this command is not Python. We can tell this because it starts with an exclamation mark !. To use this software, we type the command ocrmypdf followed by the path to the PDF we would like to convert, and then the name we would like the new output file to have.

! ocrmypdf "pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf"  "pdfs/pottsguyocr.pdf"
Scanning contents: 100%|███████████████████████| 6/6 [00:00<00:00, 390.60page/s]
Start processing 6 pages concurrently
OCR: 100%|██████████████████████████████████| 6.0/6.0 [00:09<00:00,  1.53s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 6/6 [00:01<00:00,  5.63page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Image optimization did not improve the file - optimizations will not be used
Optimize ratio: 1.00 savings: -0.2%
Output file is a PDF/A-2B (as expected)
! ls pdfs
186.full.pdf
Delaney_paper.pdf
JANAF-FourthEd-1998-Carbon.pdf
Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
cyclodextrin.pdf
pottsguyocr.pdf
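
The ! shortcut only works inside a notebook. In a plain Python script, one way to run the same command is the standard library's subprocess module. This is a minimal sketch; the helper names build_ocr_command and run_ocr are hypothetical.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf):
    """Build the same command line used above: ocrmypdf INPUT OUTPUT."""
    return ["ocrmypdf", input_pdf, output_pdf]


def run_ocr(input_pdf, output_pdf):
    """Run ocrmypdf if it is on the PATH; return False if it is not installed."""
    if shutil.which("ocrmypdf") is None:
        return False
    # check=True raises CalledProcessError if ocrmypdf exits with an error code.
    subprocess.run(build_ocr_command(input_pdf, output_pdf), check=True)
    return True


# Example use:
# run_ocr("pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf",
#         "pdfs/pottsguyocr.pdf")
```

Passing the command as a list rather than a single string avoids quoting problems with file names that contain spaces.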

Reading Tables with tabula-py

We now have two versions of the paper in the folder. The second one, pottsguyocr.pdf, has text information in it. We can use the library tabula-py to get information from Table 1. The function we will be using is called tabula.read_pdf. We pass this function the file path of the PDF we would like to read. You should also specify the page number of the table; otherwise, the function reads only page 1 by default.

import tabula
pdf_path = "pdfs/pottsguyocr.pdf"

In order to read from pages other than page 1, we need to pass another argument (pages) to the function to specify which pages contain the tables we want to parse. You can give the page numbers as a list of integers, or you can use "all" to read tables from every page of the PDF.

tables = tabula.read_pdf(pdf_path, pages=[3, 4], multiple_tables=True)
print(f"Found {len(tables)} tables.")
Found 3 tables.

The read_pdf function returns a list of pandas dataframes containing data from the tables.

Let’s examine each of these.

tables[0].head()
Unnamed: 0 Compound log P Unnamed: 1 II H, H,.1 MV R, log Koa log Kyex Unnamed: 2 log Kpep
0 NaN water — 6.85 NaN 0.45 0.82 0.35 10.6 0.00 — 1.38 — 4,38 NaN NaN
1 ' methanol — 6.68 NaN 0.44 0.43 0.47 21.7 0.28 —0.73 —2.42 NaN — 2.80
2 NaN methanoic acid — 7.08 NaN 0.60 0.75 0.38 22.3 0.30 —0.54 — 3.93 NaN — 3.63
3 NaN ethanol — 6.66 NaN 0.42 0.37 0.48 31.9 0.25 —0.32 —2.24 NaN —2.10
4 NaN ethanoic acid —7.01 NaN 0.65 0.61 0.45 33.4 0.27 —0.31 — 3.28 NaN —2.90
tables[1].head()
Unnamed: 0 Unnamed: 1 10° - a, Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8
0 Solvent NaN (cm?/mole) NaN NaN ay a3 ay r’ F n
1 Octanol NaN 5.63 NaN NaN nsd? — 4.09 —0.51 0.98 478 37
2 NaN NaN (0.14) NaN NaN NaN (0.20) (0.09) NaN NaN NaN
3 Heptane NaN 6.50 NaN NaN —2.90 — 4,80 — 1.49 0.95 124 33
4 NaN NaN (0.32) NaN NaN (0.39) (0.48) (0.24) NaN NaN NaN
tables[2].head()
implies that hydrogen bond acceptors are Unnamed: 0 not well- Unnamed: 1 Unnamed: 2 Unnamed: 3 ute a). The results in Table 4, on the other hand, imply that
0 accommodated in alkanes as compared to octanol. Whereas NaN NaN NaN NaN NaN NaN the SC lipids accept hydrogen bonds better tha... NaN NaN NaN NaN NaN NaN NaN NaN
1 octanol is capable of donating a hydrogen bond NaN via the hy- NaN NaN NaN but that, like octanol, polar species NaN can be accommodated NaN NaN
2 droxyl group, heptane and hexadecane cannot do so. Parti- NaN NaN NaN more easily NaN in the SC than in alkane solvents NaN NaN (vi. the absence NaN NaN
3 tioning decreased in the hydrocarbon solvents with increas- NaN NaN NaN NaN NaN of 7 dependence again). These conclusions may ... NaN NaN NaN NaN NaN NaN NaN NaN
4 ing solute polarity (7); no significant dependence NaN on 7 was NaN NaN NaN with those NaN of a previous study NaN (3), which reported the free

You can see that the data in some of the tables seems pretty messy.

For this exercise, we will be working with the first two tables. Neither of these tables is usable yet. We’ll save both as CSV files and work on cleaning them in the next section.

tables[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1 non-null      object 
 1   Compound    37 non-null     object 
 2   log P       37 non-null     object 
 3   Unnamed: 1  0 non-null      float64
 4   II          37 non-null     float64
 5   H,          37 non-null     float64
 6   H,.1        37 non-null     float64
 7   MV          37 non-null     float64
 8   R,          37 non-null     float64
 9   log Koa     37 non-null     object 
 10  log Kyex    31 non-null     object 
 11  Unnamed: 2  0 non-null      float64
 12  log Kpep    25 non-null     object 
dtypes: float64(7), object(6)
memory usage: 3.9+ KB
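
import os

# Two equivalent ways to write a path; os.path.join picks the right separator for your system.
output_1 = "data/potts_table1.csv"
output_2 = os.path.join("data", "potts_table2.csv")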

tables[0].to_csv(output_1, index=False)
tables[1].to_csv(output_2, index=False)
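
The info() output above lists log P and the log K columns as object rather than float64. Judging from the head() output, the likely reason is that OCR read the minus signs as long dashes, sometimes followed by a space. As a preview of the kind of cleaning this calls for, here is a minimal sketch in plain Python. The function name parse_ocr_number is hypothetical, and the sketch deliberately leaves doubtful values (such as one containing a comma) unconverted.

```python
def parse_ocr_number(text):
    """Convert a string such as '\u2014 6.85' to -6.85; return None if that fails."""
    # Treat en dashes (\u2013) and em dashes (\u2014) as minus signs and drop spaces.
    cleaned = text.replace("\u2014", "-").replace("\u2013", "-").replace(" ", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_ocr_number("\u2014 6.85"))  # -6.85
print(parse_ocr_number("\u20140.73"))   # -0.73
print(parse_ocr_number("\u2014 4,38"))  # None: left for manual review
```

Returning None instead of guessing keeps OCR mistakes visible, so they can be checked against the original paper.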

Key Points

  • PDFs usually have text associated with them. If they don’t, you can use ocrmypdf to perform optical character recognition.

  • You can use the library tabula-py to extract data from tables in PDFs.