Session 3 Homework
Exercise 1 - Linear Fitting
Use scikit-learn to perform multiple linear regression for solubility using the dataset `delaney-processed.csv`. Unfortunately, the dataset doesn't contain all of the molecular descriptors described in the original paper. Use the available descriptors as independent variables in the linear fit.
Use your model to predict solubilities for the dataset.
Compute the \(R^2\) statistic for the fit.
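As a sketch of the workflow: the descriptor column names below are illustrative stand-ins (a tiny made-up frame replaces the real CSV so the pattern is clear); swap in whichever descriptor columns `delaney-processed.csv` actually provides.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# With the real dataset you would instead load it from disk:
# df = pd.read_csv("data/delaney-processed.csv")
df = pd.DataFrame({
    "Molecular Weight": [46.07, 78.11, 180.16, 32.04],
    "Number of Rotatable Bonds": [0, 0, 5, 0],
    "measured log solubility in mols per litre": [0.0, -1.6, -2.1, 1.1],
})

# Independent variables: the available molecular descriptors.
descriptors = ["Molecular Weight", "Number of Rotatable Bonds"]
X = df[descriptors]
y = df["measured log solubility in mols per litre"]

# Fit the multiple linear regression and predict solubilities.
model = LinearRegression().fit(X, y)
predicted = model.predict(X)
print(f"R^2 = {r2_score(y, predicted):.3f}")
```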
Exercise 2 - Regular Expressions
Parse the file `SBE-b-CD-data.sdf` in your `data` folder using regular expressions.
The SDF file contains information for 220 molecules. For each molecule, there is a section which looks like this:
```
> <ID>
(-)__Sulpiride

> <Temperature_K>
293

> <Kapp>
35
```
Write a regular expression to retrieve the properties in <> along with the values. For example, if you were to parse the text above, the result from your regular expression should look like
[('ID', '(-)__Sulpiride'), ('Temperature_K', '293'), ('Kapp', '35')]
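One way to get pairs like this is a single `re.findall` call with two capture groups, sketched here on the snippet above (the exact whitespace in your file may differ):

```python
import re

# Each property block is "> <NAME>" followed by the value on the next line.
text = """> <ID>
(-)__Sulpiride

> <Temperature_K>
293

> <Kapp>
35
"""

# Group 1 captures the property name, group 2 the value line.
pairs = re.findall(r"> <(\w+)>\n(.+)", text)
print(pairs)
# [('ID', '(-)__Sulpiride'), ('Temperature_K', '293'), ('Kapp', '35')]
```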
Put results for all of the molecules into a pandas dataframe which has columns `ID`, `Temperature_K`, and `Kapp`.
Hint - This may be a general Python solution rather than something we have talked about in class. The best way to do this may be to construct a Python dictionary with keys `ID`, `Temperature_K`, and `Kapp`, each mapped to a list of the values for each molecule. You can then create a pandas dataframe using `pd.DataFrame.from_dict`.
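A minimal sketch of the dictionary-to-dataframe step, using a few hypothetical parsed pairs in place of your real regex output:

```python
import pandas as pd

# One list per column; in the real exercise these come from your regex matches.
data = {"ID": [], "Temperature_K": [], "Kapp": []}

# Hypothetical (name, value) pairs for two molecules:
parsed = [("ID", "(-)__Sulpiride"), ("Temperature_K", "293"), ("Kapp", "35"),
          ("ID", "MoleculeB"), ("Temperature_K", "298"), ("Kapp", "12")]
for name, value in parsed:
    data[name].append(value)

# Build the dataframe from the dictionary of lists.
df = pd.DataFrame.from_dict(data)
print(df.shape)  # (2, 3)
```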
Exercise 3 - Bonus
This exercise shows you how to retrieve papers from ChemRxiv (a chemistry preprint server) using their REST API. Your task is then to do some processing in pandas to retrieve article abstracts, then to look for phrases in the abstracts using regular expressions. This homework requires learning some extra material, so it is a bonus.
First, you can use the Python `requests` module to query the REST API for ChemRxiv. This is the URL we will use. If you visit this URL in your browser, you will get a list of the 100 most recent papers uploaded to ChemRxiv.
https://api.figshare.com/v2/articles?institution=259&page_size=100
To retrieve this information using Python, do

```python
import requests

results = requests.get('https://api.figshare.com/v2/articles?institution=259&page_size=100')
```
The ‘payload’ of this is stored in `results.json()`. This is where the information you see in your browser comes from. You can convert what you've retrieved into a dataframe by doing `df = pd.DataFrame(results.json())`.
You can retrieve the abstracts by calling `requests.get` on the `url` for each paper. Save the result in a column called `detail`. This step will take a while to execute.
After retrieving the details, you must get the JSON and retrieve the "description" field. You can write a custom function which does this and apply it to the `detail` column.
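A sketch of the apply step. The `FakeResponse` class here is a stand-in for a real `requests` response, so the pattern can be shown without network access; the "description" field name follows the text above.

```python
import pandas as pd

# Assumes each entry in 'detail' is a response whose JSON payload
# carries a "description" field (the abstract).
def get_description(response):
    return response.json()["description"]

# Stand-in for requests.Response, used only for offline illustration:
class FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

df = pd.DataFrame({"detail": [
    FakeResponse({"description": "We train a model..."}),
    FakeResponse({"description": "A new catalyst..."}),
]})

# Apply the custom function to the detail column.
df["abstract"] = df["detail"].apply(get_description)
print(df["abstract"].tolist())
```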
Next, use pandas `str.contains` to search the abstracts for phrases of interest. We suggest trying out machine/deep learning. Your results will vary, since the query always grabs the 100 most recently uploaded papers!
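For example, with a few made-up abstracts (`case=False` makes the match case-insensitive, and the pattern uses regex alternation):

```python
import pandas as pd

# Hypothetical abstracts standing in for the real ones you retrieved.
abstracts = pd.Series([
    "We apply machine learning to predict solubility.",
    "A synthesis route for a new MOF.",
    "Deep learning for reaction yield prediction.",
])

# True where the abstract mentions either phrase, ignoring case.
mask = abstracts.str.contains(r"machine learning|deep learning", case=False)
print(mask.tolist())  # [True, False, True]
```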