Reading data from PDFs

Overview

Questions:

How can I tell if I can extract data from a pdf?
How can I run optical character recognition on a PDF?
How can I extract information from a PDF?

Objectives:

Use ocrmypdf to make sure our PDF has recognizable characters.
Use tabula-py to extract data from a table in a PDF.

You should have the paper we are going to work with in your pdfs folder. The name of the file is Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf. We will be reading the tables on page 3.

Start by checking that you have the pdfs folder and the PDF. We will use the shell command ls for this. We put an exclamation mark at the beginning of this command because it is not Python. In a Jupyter notebook, lines that start with ! are passed to the system shell, so they are commands you could also run directly in a terminal.

! ls pdfs

186.full.pdf	   Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
Delaney_paper.pdf  cyclodextrin.pdf

We’re going to use a Python library called tabula-py to read the data in Table 1. However, this PDF doesn’t have any text information in it yet. One way you can tell is by clicking and dragging your cursor over the text in a PDF viewer like Adobe Acrobat. If the text is not highlighted, the PDF contains only an image of each page and no text information. If we tried to extract the data in the table at this point, we would get an empty table.
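The click-and-drag check can also be approximated in code. The following is a minimal sketch and not part of the original lesson: it assumes the third-party pypdf package is installed, and the helper names pdf_page_texts and has_extractable_text are made up for illustration. The idea is simply to ask whether any page yields non-whitespace text.

```python
def has_extractable_text(page_texts):
    """Return True if at least one page produced non-whitespace text."""
    return any(text and text.strip() for text in page_texts)


def pdf_page_texts(pdf_path):
    """Return the extracted text of each page as a list of strings.

    Assumption: the pypdf package is available (it is not used elsewhere
    in this lesson). It is imported here so the rest of the sketch runs
    without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


# Demonstration on plain strings, standing in for per-page text.
print(has_extractable_text(["", "  \n"]))     # False
print(has_extractable_text(["", "Table 1"]))  # True
```

With pypdf installed, calling has_extractable_text(pdf_page_texts(path)) on a scanned PDF with no text layer would be expected to return False, though unusual PDFs may behave differently.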

You can get text information in a pdf by performing optical character recognition, or OCR. If you have Adobe Acrobat Pro, it has an OCR tool built in that you can use. Python also has some free libraries which can be used for OCR. We’ll be using one called OCRmyPDF.

Again, this command is not Python. We can tell because it starts with an exclamation mark (!). To use this software, we type the command ocrmypdf, followed by the path to the PDF we would like to convert, followed by the name we would like the new output file to have.

! ocrmypdf "pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf"  "pdfs/pottsguyocr.pdf"

Scanning contents: 100%|███████████████████████| 6/6 [00:00<00:00, 390.60page/s]
Start processing 6 pages concurrently
OCR: 100%|██████████████████████████████████| 6.0/6.0 [00:09<00:00,  1.53s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 6/6 [00:01<00:00,  5.63page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Image optimization did not improve the file - optimizations will not be used
Optimize ratio: 1.00 savings: -0.2%
Output file is a PDF/A-2B (as expected)
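If you are working in a plain Python script rather than a notebook, the ! shortcut is not available. One possible alternative, sketched here under the assumption that the ocrmypdf command is installed and on your PATH, is to call it through the standard subprocess module. The function names build_ocr_command and run_ocr are hypothetical helpers, not part of OCRmyPDF.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(input_pdf, output_pdf):
    """Build the same command used above: ocrmypdf <input> <output>."""
    return ["ocrmypdf", str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf):
    """Run ocrmypdf if it is installed and the input file exists.

    Returns True if the command was run, False if it was skipped.
    """
    if shutil.which("ocrmypdf") is None:
        print("ocrmypdf was not found on PATH; skipping OCR.")
        return False
    if not Path(input_pdf).is_file():
        print(f"Input file not found: {input_pdf}")
        return False
    # check=True raises CalledProcessError if ocrmypdf reports a failure.
    subprocess.run(build_ocr_command(input_pdf, output_pdf), check=True)
    return True


command = build_ocr_command(
    "pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf",
    "pdfs/pottsguyocr.pdf",
)
print(command)
```

Passing the arguments as a list, rather than one long string, avoids quoting problems when a file name contains spaces.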

! ls pdfs

186.full.pdf
Delaney_paper.pdf
JANAF-FourthEd-1998-Carbon.pdf
Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
cyclodextrin.pdf
pottsguyocr.pdf

Reading Tables with `tabula-py`

We now have two versions of the paper in the folder. The second one, pottsguyocr.pdf, contains text information. We can use the library tabula-py to get information from Table 1. The function we will be using is called tabula.read_pdf. We pass the file path of the PDF we would like to read to this function. You should also specify the page number of the table; otherwise, the function reads only page 1 by default.

import tabula

pdf_path = "pdfs/pottsguyocr.pdf"

To read from pages other than page 1, we pass another argument (pages) to the function to specify which pages contain the tables we want to parse. You can give pages as a single integer, a list of integers, or a string describing a page range, or you can use "all" to read tables from every page in the PDF. Here we read pages 3 and 4.

tables = tabula.read_pdf(pdf_path, pages=[3, 4], multiple_tables=True)

print(f"Found {len(tables)} tables.")

Found 3 tables.
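To make the options for the pages argument concrete, here is a small sketch that tries several forms on the same file. It assumes tabula-py and the Java runtime it depends on are installed, and that pdfs/pottsguyocr.pdf exists; the page_options dictionary and the count_tables helper are illustrative names only. The loop is skipped if the PDF is not present.

```python
import os

pdf_path = "pdfs/pottsguyocr.pdf"

# Different ways of telling read_pdf which pages to look at.
page_options = {
    "single page": 3,
    "several pages": [3, 4],
    "page range": "3-4",
    "every page": "all",
}


def count_tables(path, pages):
    """Return how many tables tabula-py reports for the given pages."""
    import tabula  # imported here so the sketch loads without tabula-py

    tables = tabula.read_pdf(path, pages=pages, multiple_tables=True)
    return len(tables)


if os.path.exists(pdf_path):
    for label, pages in page_options.items():
        print(f"{label}: {count_tables(pdf_path, pages)} tables")
else:
    print(f"{pdf_path} not found; skipping the comparison.")
```

The number of tables found can differ between options, because each page is searched for tables separately.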

The read_pdf function returns a list of pandas dataframes containing data from the tables.

Let’s examine each of these.

tables[0].head()

	Unnamed: 0	Compound	log P	Unnamed: 1	II	H,	H,.1	MV	R,	log Koa	log Kyex	Unnamed: 2	log Kpep
0	NaN	water	— 6.85	NaN	0.45	0.82	0.35	10.6	0.00	— 1.38	— 4,38	NaN	NaN
1	'	methanol	— 6.68	NaN	0.44	0.43	0.47	21.7	0.28	—0.73	—2.42	NaN	— 2.80
2	NaN	methanoic acid	— 7.08	NaN	0.60	0.75	0.38	22.3	0.30	—0.54	— 3.93	NaN	— 3.63
3	NaN	ethanol	— 6.66	NaN	0.42	0.37	0.48	31.9	0.25	—0.32	—2.24	NaN	—2.10
4	NaN	ethanoic acid	—7.01	NaN	0.65	0.61	0.45	33.4	0.27	—0.31	— 3.28	NaN	—2.90

tables[1].head()

	Unnamed: 0	Unnamed: 1	10° -	a,	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8
0	Solvent	NaN	(cm?/mole)	NaN	NaN	ay	a3	ay	r’	F	n
1	Octanol	NaN	5.63	NaN	NaN	nsd?	— 4.09	—0.51	0.98	478	37
2	NaN	NaN	(0.14)	NaN	NaN	NaN	(0.20)	(0.09)	NaN	NaN	NaN
3	Heptane	NaN	6.50	NaN	NaN	—2.90	— 4,80	— 1.49	0.95	124	33
4	NaN	NaN	(0.32)	NaN	NaN	(0.39)	(0.48)	(0.24)	NaN	NaN	NaN

tables[2].head()

	implies that	hydrogen bond acceptors are	Unnamed: 0	not	well-	Unnamed: 1	Unnamed: 2	Unnamed: 3	ute a).	The	results	in Table	4,	on	the other hand,	imply	that
0	accommodated	in alkanes as compared to octanol. Whereas	NaN	NaN	NaN	NaN	NaN	NaN	the SC lipids accept hydrogen bonds better tha...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	octanol is capable	of donating a hydrogen bond	NaN	via the	hy-	NaN	NaN	NaN	but that,	like	octanol,	polar	species	NaN	can be accommodated	NaN	NaN
2	droxyl group,	heptane and hexadecane cannot	do	so.	Parti-	NaN	NaN	NaN	more easily	NaN	in the SC than	in alkane solvents	NaN	NaN	(vi. the absence	NaN	NaN
3	tioning decreased	in the hydrocarbon solvents	with increas-	NaN	NaN	NaN	NaN	NaN	of 7 dependence again). These conclusions may ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	ing solute polarity	(7); no significant dependence	NaN	on	7 was	NaN	NaN	NaN	with those	NaN	of a previous	study	NaN	(3),	which reported	the	free

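You can see that the extracted data is messy. The third "table" does not appear to be a table at all: it looks like two-column body text from the paper that tabula-py interpreted as table cells.

For this exercise, we will be working with the first two tables. Neither of these tables is usable yet. We’ll save both as CSV files and work on cleaning them in the next section.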

tables[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1 non-null      object 
 1   Compound    37 non-null     object 
 2   log P       37 non-null     object 
 3   Unnamed: 1  0 non-null      float64
 4   II          37 non-null     float64
 5   H,          37 non-null     float64
 6   H,.1        37 non-null     float64
 7   MV          37 non-null     float64
 8   R,          37 non-null     float64
 9   log Koa     37 non-null     object 
 10  log Kyex    31 non-null     object 
 11  Unnamed: 2  0 non-null      float64
 12  log Kpep    25 non-null     object 
dtypes: float64(7), object(6)
memory usage: 3.9+ KB
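Notice that info() reports columns such as log P as object rather than float64. Judging from the head() output, a likely reason is that OCR produced a long dash (sometimes followed by a space) where the paper has a minus sign, so the values cannot be read as numbers. The sketch below illustrates this with three values copied from the output above; parses_as_float is a hypothetical helper written for this illustration.

```python
# Values copied from the "log P" column shown earlier.
# "\u2014" is the long dash that OCR produced in place of a minus sign.
raw_values = ["\u2014 6.85", "\u2014 6.68", "\u20147.01"]


def parses_as_float(text):
    """Return True if Python can convert the string to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


for value in raw_values:
    print(repr(value), parses_as_float(value))

# The same number written with an ordinary minus sign converts fine.
print(repr("-6.85"), parses_as_float("-6.85"))
```

None of the raw strings convert, which is consistent with pandas keeping the whole column as text.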

import os

output_1 = os.path.join("data", "potts_table1.csv")
output_2 = os.path.join("data", "potts_table2.csv")

tables[0].to_csv(output_1, index=False)
tables[1].to_csv(output_2, index=False)

Key Points

PDFs usually have text associated with them. If they don’t, you can use ocrmypdf to perform optical character recognition.
You can use the library tabula-py to extract data from tables in PDFs.
