Session 3 Homework#

Exercise 1 - Linear Fitting#

Use scikit-learn to perform multiple linear regression for solubility using the dataset delaney-processed.csv. Unfortunately, the dataset doesn't contain all of the molecular descriptors described in the original paper. Use the available descriptors as independent variables in the linear fit.

  • Use your model to predict solubilities for the dataset.

  • Compute the \(R^2\) statistic for the fit.
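The fit itself is only a few lines with scikit-learn. Here is a minimal sketch using synthetic data in place of delaney-processed.csv (the column names and true coefficients below are made up for illustration); for the homework, swap in `pd.read_csv('delaney-processed.csv')` and the real descriptor columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the descriptor matrix and solubilities;
# replace with the columns from delaney-processed.csv.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # e.g. molecular weight, logP, rotatable bonds
y = X @ np.array([1.5, -2.0, 0.3]) + rng.normal(scale=0.1, size=100)

# Fit the multiple linear regression and predict solubilities
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)

# R^2 statistic for the fit
r2 = r2_score(y, predictions)
```

You can also get \(R^2\) directly from `model.score(X, y)`, which is equivalent for a fitted `LinearRegression`.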

Exercise 2 - Regular Expressions#

Parse the file SBE-b-CD-data.sdf in your data folder using regular expressions.

The SDF file contains information for 220 molecules. For each molecule, there is a section which looks like this:

>  <ID>
(-)__Sulpiride

>  <Temperature_K>
293

>  <Kapp>
35
  • Write a regular expression to retrieve the properties in <> along with the values. For example, if you were to parse the text above, the result from your regular expression should look like

    [('ID', '(-)__Sulpiride'),
     ('Temperature_K', '293'),
     ('Kapp', '35')]
    
  • Put the results for all of the molecules into a pandas dataframe with columns ID, Temperature_K, and Kapp. Hint: this may call for a general Python solution rather than something we have talked about in class. One good approach is to construct a Python dictionary with keys ID, Temperature_K, and Kapp, each mapping to a list of values for the molecules. You can then create a pandas dataframe using pd.DataFrame.from_dict.
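One way to sketch both steps, run here on the sample section above rather than the full SDF file (the particular pattern shown is just one of several regexes that work):

```python
import re
import pandas as pd

# Sample section; for the homework, read the whole of SBE-b-CD-data.sdf instead
text = """>  <ID>
(-)__Sulpiride

>  <Temperature_K>
293

>  <Kapp>
35
"""

# Capture the property name inside <...> and the value on the following line
pattern = r">\s+<(\w+)>\n(.+)"
matches = re.findall(pattern, text)

# Collect the values for each property, then build the dataframe
data = {"ID": [], "Temperature_K": [], "Kapp": []}
for name, value in matches:
    data[name].append(value)

df = pd.DataFrame.from_dict(data)
```

On the full file, `re.findall` over the whole text gives you the pairs for all 220 molecules in order, so the same loop fills each list with 220 entries.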

Exercise 3 - Bonus#

This exercise shows you how to retrieve papers from ChemRxiv (a chemistry preprint server) using its REST API. Your task is to do some processing in pandas to retrieve article abstracts, then to look for phrases in the abstracts using regular expressions. This homework requires learning some extra material, so it is a bonus.

First, you can use the Python requests module to query the REST API for ChemRxiv. This is the URL we will use. If you visit it in your browser, you will get a list of the 100 most recent papers uploaded to ChemRxiv.

https://api.figshare.com/v2/articles?institution=259&page_size=100

To retrieve this information using Python, do

import requests

results = requests.get('https://api.figshare.com/v2/articles?institution=259&page_size=100')

The ‘payload’ of the response is stored in results.json(). This is the information you see in your browser. You can convert what you’ve retrieved into a dataframe by doing df = pd.DataFrame(results.json()).

You can retrieve the abstracts by calling requests.get on the URL for each paper. Save the results in a column called detail. This step will take a while to execute.

After retrieving the details, you must get the JSON from each response and retrieve the “description” field. You can write a custom function which does this and apply it to the detail column.

Next, use pandas’ str.contains to search the abstracts for phrases of interest. We suggest trying “machine learning” or “deep learning”. Your results will vary, since the query always grabs the 100 most recently uploaded papers!
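The extraction and search steps can be sketched without hitting the network by standing in mock detail payloads for the JSON you would get from each paper's URL (the titles and abstracts below are invented for illustration; in the real workflow the detail column holds responses from requests.get, so your custom function would first call .json() on each one):

```python
import pandas as pd

# Mock payloads standing in for response.json() from each article's URL;
# in the real workflow you would build this column with something like
#   df["detail"] = df["url"].apply(lambda u: requests.get(u).json())
df = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "detail": [
        {"description": "We apply deep learning to reaction prediction."},
        {"description": "A total synthesis of a natural product."},
        {"description": "Machine learning models for aqueous solubility."},
    ],
})

def get_description(detail):
    # Pull the abstract out of one detail payload
    return detail["description"]

# Apply the custom function to the detail column
df["abstract"] = df["detail"].apply(get_description)

# Search the abstracts for phrases of interest (regex alternation,
# case-insensitive)
mask = df["abstract"].str.contains("machine learning|deep learning", case=False)
hits = df[mask]
```

Because str.contains interprets its argument as a regular expression by default, the `|` alternation matches either phrase in one pass.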