Session 2 Homework

Exercise 1

In the first class, we examined a dataset containing molecular descriptors and the results from the ESOL solubility model. In this homework, you will use the tabula library (tabula-py) to extract a table from the associated paper. Your pdfs folder contains a file called Delaney_paper.pdf. Your task is to pull the data from Table 4 and create plots using seaborn showing the correlation between the various solubility models and the experimental values.

Using tabula-py, read in the data from Table 4 of pdfs/Delaney_paper.pdf. Save the extracted table in your data folder as delaney_table4.csv.
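As a starting point, the sketch below shows one way to list the tables tabula-py finds in the PDF. The helper name `read_tables` and the choice of `pages="all"` are assumptions for illustration. Which pages hold Table 4, and which entry of the returned list corresponds to it, is something you need to check against the paper yourself.

```python
from pathlib import Path

PDF_PATH = Path("pdfs/Delaney_paper.pdf")
OUT_CSV = Path("data/delaney_table4.csv")


def read_tables(pdf_path, pages):
    """Return the tables tabula finds on the given pages as a list of DataFrames.

    Hypothetical helper for this exercise, not part of tabula-py itself.
    """
    import tabula  # installed via the tabula-py package; requires a Java runtime

    return tabula.read_pdf(str(pdf_path), pages=pages, multiple_tables=True)


if __name__ == "__main__" and PDF_PATH.exists():
    tables = read_tables(PDF_PATH, pages="all")
    # Print the shape of each candidate so Table 4 can be identified by eye.
    for i, table in enumerate(tables):
        print(i, table.shape)
```

Once you have identified the right table (or tables, if it spans several pages and needs `pd.concat`), calling `.to_csv(OUT_CSV, index=False)` on it writes the file this exercise asks for.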

Exercise 2

Read in your saved data and clean the table using pandas.

  • Rename the columns to have easy and descriptive names.

  • Use pd.to_numeric with the appropriate options to cast columns to floats.

  • Drop rows or columns if necessary.

Your dataframe should have the following column names before you save it:

['common name', 'CAS no.', 'experimental values', 'ESOL', 'Liu',
       'Huuskonen', 'Kuhne', 'Wegner', 'Gasteiger', 'Tetko', 'GSE']

Save your cleaned dataframe as delaney_table4_clean.csv
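The cleaning steps in this exercise can be sketched on a tiny made-up frame. The raw column names and values below are invented for illustration, and your extracted table will likely look different. The pattern is the same: rename, coerce with `pd.to_numeric(..., errors="coerce")` so that unparseable entries become `NaN`, then drop what is not needed.

```python
import pandas as pd

# Made-up rows standing in for a raw extracted table; not data from the paper.
raw = pd.DataFrame({
    "Unnamed: 0": ["compound A", "compound B", None],
    "exp": ["-1.50", "-2.30", "header junk"],
    "model": ["-1.40", "n/a", "-0.9"],
})

# 1. Rename columns to descriptive names.
df = raw.rename(columns={
    "Unnamed: 0": "common name",
    "exp": "experimental values",
    "model": "ESOL",
})

# 2. Cast to floats; errors="coerce" turns non-numeric strings into NaN.
numeric_cols = ["experimental values", "ESOL"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")

# 3. Drop rows that carry no compound name (for example, stray header rows).
df = df.dropna(subset=["common name"]).reset_index(drop=True)

print(df.dtypes)
```

Coercing rather than raising is a design choice: it keeps the rest of each row intact and leaves you to decide afterwards whether rows with missing values should be dropped.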

Exercise 3

Use seaborn lmplot to plot the experimental values against each of the models, with one plot per model. Arrange the plots in a grid with two columns per row.
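One possible approach, sketched on made-up numbers: `lmplot` draws one panel per value of a categorical column, so the wide table is first reshaped to long form with `melt`, and `col_wrap=2` then limits each row of the grid to two panels. The column names `model_a`, `model_b`, and `model_c`, as well as `predicted`, are placeholders rather than names from Table 4.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration; not data from the paper.
df = pd.DataFrame({
    "experimental values": [-1.0, -2.0, -3.0, -4.0, -5.0],
    "model_a": [-1.1, -2.2, -2.9, -4.3, -4.8],
    "model_b": [-0.7, -2.5, -3.4, -3.6, -5.5],
    "model_c": [-1.4, -1.8, -3.1, -4.1, -5.2],
})

# Long form: one row per (compound, model) pair.
long_df = df.melt(
    id_vars="experimental values",
    var_name="model",
    value_name="predicted",
)

# One panel per model, wrapping to a new row after every two panels.
grid = sns.lmplot(
    data=long_df,
    x="experimental values",
    y="predicted",
    col="model",
    col_wrap=2,
    height=3,
)
```

With the real table, you would pass the list of model columns as `value_vars` (or drop the non-numeric columns first) so that only the model predictions are melted.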

Bonus - Use seaborn to visualize the correlation between the experimental values and the models. Which model has the highest correlation? The method df.corr gives you Pearson’s correlation coefficient (R). Table 4 reports \(R^2\). Can you get the values to match, and if so, how?
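A hedged sketch of one way to approach the bonus, using made-up numbers: for a simple linear fit, squaring Pearson's R gives \(R^2\), so squaring the output of `df.corr` is a natural first thing to try. Whether that reproduces Table 4 exactly depends on how the paper computed its values and on which rows survived your cleaning. The column names `model_a` and `model_b` are placeholders.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration; not data from Table 4.
df = pd.DataFrame({
    "experimental values": [-1.0, -2.0, -3.0, -4.0, -5.0],
    "model_a": [-1.5, -2.5, -3.5, -4.5, -5.5],  # exactly linear in the experiment
    "model_b": [-1.2, -1.7, -3.6, -3.5, -5.4],  # noisier predictions
})

# Pearson's R of every model against the experimental values.
r = df.corr()["experimental values"].drop("experimental values")

# Squaring R gives R^2 for a simple linear relationship.
r_squared = r ** 2
print(r_squared.sort_values(ascending=False))

# Heatmap of the squared correlation matrix.
ax = sns.heatmap(df.corr() ** 2, annot=True, vmin=0, vmax=1)
```

On the real table, select only the numeric columns before calling `corr`, since the name and CAS number columns cannot be correlated.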