Reading data from PDFs
Overview
Questions:
How can I tell if I can extract data from a pdf?
How can I run optical character recognition on a PDF?
How can I extract information from a PDF?
Objectives:
Use ocrmypdf to make sure our PDF has recognizable characters.
Use tabula-py to extract data from a table in a PDF.
You should have the paper we are going to work with in your pdfs folder. The name of the file is Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf. We will be reading the tables on page 3.
Start by checking that you have the pdfs folder and the PDF. We will use the shell command ls for this. We put an exclamation mark at the beginning of this command because it is not Python. In a Jupyter notebook, commands that start with ! are commands you could run in your terminal.
! ls pdfs
186.full.pdf Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
Delaney_paper.pdf cyclodextrin.pdf
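If you would rather stay in Python, the same check can be sketched with the standard library's pathlib module. The helper name list_pdfs is hypothetical and not part of the lesson; it simply lists the PDF files in a folder.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the sorted names of the PDF files in `folder` (empty if the folder is missing)."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.pdf"))


# Check that the paper we need is in the pdfs folder.
paper = "Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf"
found = list_pdfs("pdfs")
print(found)
print("Paper found:", paper in found)
```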
We’re going to use a Python library called tabula-py to read the data in Table 1. However, this PDF doesn’t have any text information in it yet. One way you can tell is by clicking and dragging your cursor over the text in a PDF viewer like Adobe Acrobat. If the text cannot be highlighted, the PDF does not contain text information. If we tried to extract the data in the table at this point, we would get an empty table.
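The click-and-drag check can also be approximated in code. The sketch below assumes the third-party pypdf package is installed (it is not used elsewhere in this lesson) and relies on its PdfReader class and the extract_text method of its pages. The helper names and the 20-character threshold are illustrative choices, not a standard.

```python
def page_texts(pdf_path):
    """Extract the text of every page. Assumes the third-party pypdf package is installed."""
    from pypdf import PdfReader  # imported here so has_text_layer works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def has_text_layer(texts, min_chars=20):
    """Return True if any page holds at least `min_chars` non-whitespace characters.

    The threshold is an arbitrary, illustrative value.
    """
    return any(len("".join(text.split())) >= min_chars for text in texts)


# Example usage (uncomment in a notebook where pypdf is available):
# texts = page_texts("pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf")
# print(has_text_layer(texts))
```

A scanned PDF that has not been through OCR should produce little or no extracted text, so a False result here suggests that OCR is needed.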
You can get text information in a pdf by performing optical character recognition, or OCR. If you have Adobe Acrobat Pro, it has an OCR tool built in that you can use. Python also has some free libraries which can be used for OCR. We’ll be using one called OCRmyPDF.
Again, this command is not Python. We can tell because it starts with an exclamation mark (!). To use this software, we type the command ocrmypdf followed by the path to the PDF we would like to convert, and then the name we would like the new output file to have.
! ocrmypdf "pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf" "pdfs/pottsguyocr.pdf"
Scanning contents: 100%|███████████████████████| 6/6 [00:00<00:00, 390.60page/s]
Start processing 6 pages concurrently
OCR: 100%|██████████████████████████████████| 6.0/6.0 [00:09<00:00, 1.53s/page]
Postprocessing...
PDF/A conversion: 100%|█████████████████████████| 6/6 [00:01<00:00, 5.63page/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Image optimization did not improve the file - optimizations will not be used
Optimize ratio: 1.00 savings: -0.2%
Output file is a PDF/A-2B (as expected)
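The same command can be launched from a regular Python script, where the ! prefix is not available. This is a minimal sketch using the standard library's subprocess module. The function names are hypothetical, and it assumes the ocrmypdf executable is on your PATH.

```python
import shutil
import subprocess


def build_ocr_command(input_pdf, output_pdf):
    """Build the argument list for: ocrmypdf <input> <output>."""
    return ["ocrmypdf", str(input_pdf), str(output_pdf)]


def run_ocr(input_pdf, output_pdf):
    """Run ocrmypdf and return True on success, False if the tool is missing or fails."""
    if shutil.which("ocrmypdf") is None:
        print("ocrmypdf was not found on the PATH.")
        return False
    result = subprocess.run(build_ocr_command(input_pdf, output_pdf))
    return result.returncode == 0


# Example usage, mirroring the notebook command:
# run_ocr("pdfs/Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf",
#         "pdfs/pottsguyocr.pdf")
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces.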
! ls pdfs
186.full.pdf
Delaney_paper.pdf
JANAF-FourthEd-1998-Carbon.pdf
Potts-Guy1995_Article_APredictiveAlgorithmForSkinPer.pdf
cyclodextrin.pdf
pottsguyocr.pdf
Reading Tables with tabula-py
We now have two versions of the paper in the folder. The second one, pottsguyocr.pdf, has text information in it. We can use the library tabula-py to get information from Table 1. The function we will be using is called tabula.read_pdf. We pass the file path of the PDF we would like to read to this function. You should also specify the page number of the table. Otherwise, the function reads page 1 by default.
import tabula
pdf_path = "pdfs/pottsguyocr.pdf"
To read from pages other than page 1, we need to pass another argument (pages) to the function to specify which pages contain the tables we want to parse.
You can specify the pages as a list of integers, or you can use "all" to read data from all tables in the PDF.
tables = tabula.read_pdf(pdf_path, pages=[3, 4], multiple_tables=True)
print(f"Found {len(tables)} tables.")
Found 3 tables.
The read_pdf function returns a list of pandas dataframes containing data from the tables.
Let’s examine each of these.
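A quick first look is the size of each extracted table. The sketch below uses a hypothetical helper, table_shapes, which relies only on the shape attribute that every pandas dataframe has.

```python
def table_shapes(tables):
    """Return a (rows, columns) tuple for each dataframe in `tables`."""
    return [tuple(table.shape) for table in tables]


# Example usage with the list returned by tabula.read_pdf:
# for number, (rows, cols) in enumerate(table_shapes(tables)):
#     print(f"Table {number}: {rows} rows x {cols} columns")
```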
tables[0].head()
Unnamed: 0 | Compound | log P | Unnamed: 1 | II | H, | H,.1 | MV | R, | log Koa | log Kyex | Unnamed: 2 | log Kpep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | water | — 6.85 | NaN | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN | NaN |
1 | ' | methanol | — 6.68 | NaN | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | NaN | — 2.80 |
2 | NaN | methanoic acid | — 7.08 | NaN | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | NaN | — 3.63 |
3 | NaN | ethanol | — 6.66 | NaN | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | NaN | —2.10 |
4 | NaN | ethanoic acid | —7.01 | NaN | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | NaN | —2.90 |
tables[1].head()
Unnamed: 0 | Unnamed: 1 | 10° - | a, | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Solvent | NaN | (cm?/mole) | NaN | NaN | ay | a3 | ay | r’ | F | n |
1 | Octanol | NaN | 5.63 | NaN | NaN | nsd? | — 4.09 | —0.51 | 0.98 | 478 | 37 |
2 | NaN | NaN | (0.14) | NaN | NaN | NaN | (0.20) | (0.09) | NaN | NaN | NaN |
3 | Heptane | NaN | 6.50 | NaN | NaN | —2.90 | — 4,80 | — 1.49 | 0.95 | 124 | 33 |
4 | NaN | NaN | (0.32) | NaN | NaN | (0.39) | (0.48) | (0.24) | NaN | NaN | NaN |
tables[2].head()
implies that | hydrogen bond acceptors are | Unnamed: 0 | not | well- | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | ute a). | The | results | in Table | 4, | on | the other hand, | imply | that | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | accommodated | in alkanes as compared to octanol. Whereas | NaN | NaN | NaN | NaN | NaN | NaN | the SC lipids accept hydrogen bonds better tha... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | octanol is capable | of donating a hydrogen bond | NaN | via the | hy- | NaN | NaN | NaN | but that, | like | octanol, | polar | species | NaN | can be accommodated | NaN | NaN |
2 | droxyl group, | heptane and hexadecane cannot | do | so. | Parti- | NaN | NaN | NaN | more easily | NaN | in the SC than | in alkane solvents | NaN | NaN | (vi. the absence | NaN | NaN |
3 | tioning decreased | in the hydrocarbon solvents | with increas- | NaN | NaN | NaN | NaN | NaN | of 7 dependence again). These conclusions may ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | ing solute polarity | (7); no significant dependence | NaN | on | 7 was | NaN | NaN | NaN | with those | NaN | of a previous | study | NaN | (3), | which reported | the | free |
You can see that the data in some of the tables is pretty messy.
For this exercise, we will be working with the first two tables. Neither of these tables is usable yet. We’ll save both as CSV files and work on cleaning them in the next section.
tables[0].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1 non-null object
1 Compound 37 non-null object
2 log P 37 non-null object
3 Unnamed: 1 0 non-null float64
4 II 37 non-null float64
5 H, 37 non-null float64
6 H,.1 37 non-null float64
7 MV 37 non-null float64
8 R, 37 non-null float64
9 log Koa 37 non-null object
10 log Kyex 31 non-null object
11 Unnamed: 2 0 non-null float64
12 log Kpep 25 non-null object
dtypes: float64(7), object(6)
memory usage: 3.9+ KB
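The info output shows that columns such as log P are stored as object (text) rather than float64. The preview of the first table suggests why: OCR produced a long dash, sometimes followed by a space, where a minus sign is expected, and occasionally a comma where a decimal point is expected. As a rough sketch of the kind of cleaning this calls for, the hypothetical helper below converts such strings to floats using only the standard library. Treating the dash as a minus sign and the comma as a decimal point are assumptions about what the OCR misread, so check the results against the paper.

```python
import re


def parse_ocr_number(text):
    """Convert an OCR'd numeric string such as '— 6.85' to a float.

    Returns None when the value is not a string or cannot be parsed.
    """
    if not isinstance(text, str):
        return None
    # Assumption: a comma inside a number is a misread decimal point.
    cleaned = text.strip().replace(",", ".")
    # Assumption: a leading em dash, en dash, or hyphen is a minus sign.
    cleaned = re.sub(r"^[—–-]\s*", "-", cleaned)
    try:
        return float(cleaned)
    except ValueError:
        return None


# Example usage on one column of the first table:
# cleaned_log_p = tables[0]["log P"].map(parse_ocr_number)
```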
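import os

# Build the output paths; the data folder must already exist.
output_1 = os.path.join("data", "potts_table1.csv")
output_2 = os.path.join("data", "potts_table2.csv")
# index=False leaves the dataframe index out of the CSV files.
tables[0].to_csv(output_1, index=False)
tables[1].to_csv(output_2, index=False)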
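The to_csv calls assume that a data folder already exists next to the notebook. pandas does not create missing folders and fails with an error instead. A small standard-library sketch for creating the folder first is shown below; the helper name ensure_parent_dir is hypothetical.

```python
import os


def ensure_parent_dir(file_path):
    """Create the folder that will contain `file_path` if it does not exist yet."""
    parent = os.path.dirname(file_path)
    if parent:  # an empty string means the current directory
        os.makedirs(parent, exist_ok=True)
    return parent


# Example usage before calling to_csv:
# ensure_parent_dir(os.path.join("data", "potts_table1.csv"))
```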
Key Points
PDFs usually have text associated with them. If they don’t, you can use ocrmypdf to perform optical character recognition.
You can use the library tabula-py to extract data from tables in PDFs.