Session 2 Homework
Exercise 1
In the first class, we examined a dataset containing molecular descriptors and the results from the ESOL solubility model. In this homework, we are going to use the tabula library to extract a table from the associated paper. In your pdfs folder, you have a file called Delaney_paper.pdf. Your task for this homework is to pull information from Table 4 and create plots using seaborn showing the correlation between the various solubility models and the experimental values.
Using the file pdfs/Delaney_paper.pdf and tabula-py, read in the data from Table 4. Save the extracted table in your data folder as delaney_table4.csv.
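A minimal sketch of the extraction step, assuming tabula-py and a Java runtime are installed. The `pages="all"` default is an assumption, because the page that holds Table 4 is not stated here, and the helper names `extract_tables` and `save_table` are hypothetical. Which entry of the returned list corresponds to Table 4 has to be found by inspection.

```python
from pathlib import Path

import pandas as pd


def extract_tables(pdf_path, pages="all"):
    """Return the tables tabula-py detects on the given pages as a list of DataFrames."""
    # Imported here so the rest of the sketch loads even without tabula-py/Java.
    import tabula

    return tabula.read_pdf(str(pdf_path), pages=pages, multiple_tables=True)


def save_table(table, out_path):
    """Write one extracted DataFrame to CSV, creating the folder if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    table.to_csv(out_path, index=False)
    return out_path


pdf_path = Path("pdfs/Delaney_paper.pdf")
if pdf_path.exists():
    tables = extract_tables(pdf_path)
    # Inspect each candidate to find the one that corresponds to Table 4.
    for i, table in enumerate(tables):
        print(i, table.shape)
        print(table.head())
    # After identifying the right index (left for you to determine):
    # save_table(tables[index], "data/delaney_table4.csv")
```

Narrowing `pages` to the page that actually contains the table usually gives a cleaner result than scanning the whole document.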
Exercise 2
Read in your saved data and clean the table using pandas.
Rename the columns to have easy and descriptive names.
Use pd.to_numeric with the appropriate options to cast columns to floats.
Drop rows or columns if necessary.
Your dataframe should have the following column names before you save it:
['common name', 'CAS no.', 'experimental values', 'ESOL', 'Liu',
'Huuskonen', 'Kuhne', 'Wegner', 'Gasteiger', 'Tetko', 'GSE']
Save your cleaned dataframe as delaney_table4_clean.csv
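As an illustration of the casting step, here is a small sketch on a made-up frame. The values and the stray header-like row are invented for demonstration and are not taken from Table 4. With `errors="coerce"`, entries that cannot be parsed become NaN, which can then be dropped.

```python
import pandas as pd

# Toy stand-in for a freshly extracted table; all values are made up.
raw = pd.DataFrame(
    {
        "common name": ["compound A", "compound B", "common name", "compound C"],
        "experimental values": ["-1.5", "-2.25", "expt", "-0.75"],
        "ESOL": ["-1.4", "-2.0", "ESOL", "n/a"],
    }
)

numeric_cols = ["experimental values", "ESOL"]

clean = raw.copy()
for col in numeric_cols:
    # errors="coerce" converts unparseable strings (e.g. repeated headers) to NaN.
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Drop rows where every numeric column failed to parse (here, the stray header row).
clean = clean.dropna(subset=numeric_cols, how="all").reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

Whether to drop rows that are only partly missing (`how="any"`) depends on what the extracted table looks like, so it is worth inspecting the NaN counts before deciding.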
Exercise 3
Use seaborn lmplot to create a plot showing experimental values vs each of the models. You should have one plot per model. Arrange the plots in a grid with two columns per row.
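One way to get a panel per model is to reshape the table to long form and let lmplot facet on the model name. The sketch below uses synthetic random data in place of the cleaned table, and only three of the model columns, so the numbers mean nothing. The column name `predicted` is a hypothetical choice.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (random numbers, not values from the paper).
rng = np.random.default_rng(0)
experimental = rng.uniform(-8, 1, size=30)
df = pd.DataFrame(
    {
        "experimental values": experimental,
        "ESOL": experimental + rng.normal(0, 0.5, size=30),
        "Liu": experimental + rng.normal(0, 1.0, size=30),
        "GSE": experimental + rng.normal(0, 0.8, size=30),
    }
)
models = ["ESOL", "Liu", "GSE"]

# Reshape to long form: one row per (compound, model) pair.
long_df = df.melt(
    id_vars="experimental values",
    value_vars=models,
    var_name="model",
    value_name="predicted",
)

# col="model" draws one panel per model; col_wrap=2 puts two panels on each row.
grid = sns.lmplot(
    data=long_df,
    x="experimental values",
    y="predicted",
    col="model",
    col_wrap=2,
)
```

With the real data, `models` would list every model column from the cleaned table.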
Bonus: Use seaborn to visualize the correlation between experimental values and the models. Which one has the highest correlation? The method df.corr gives you Pearson’s correlation coefficient (R). Table 4 reports \(R^2\). Can you get the values to match? (How?)
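One possible starting point for the bonus, sketched on synthetic data rather than the real table. For a simple linear fit, the coefficient of determination equals the square of Pearson's R, so squaring the correlation matrix element-wise is a natural candidate; whether that reproduces the numbers in Table 4 is for you to check.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (random numbers, not values from the paper).
rng = np.random.default_rng(1)
experimental = rng.uniform(-8, 1, size=40)
df = pd.DataFrame(
    {
        "experimental values": experimental,
        "ESOL": experimental + rng.normal(0, 0.5, size=40),
        "Liu": experimental + rng.normal(0, 1.0, size=40),
        "GSE": experimental + rng.normal(0, 0.8, size=40),
    }
)

# Pearson's R between every pair of numeric columns.
r = df.select_dtypes("number").corr()

# Element-wise square of R.
r_squared = r ** 2

# Annotated heatmap of the squared correlations.
ax = sns.heatmap(r_squared, annot=True, vmin=0, vmax=1)

# Model whose predictions correlate most strongly with experiment.
best = r_squared["experimental values"].drop("experimental values").idxmax()
print(best)
```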