Linear Fitting with scikit-learn#

Overview

Questions:

  • How can I fit a linear equation using scikit-learn?

  • How can I fit a linear equation with multiple variables using scikit-learn?

Objectives:

  • Slice a pandas dataframe to get X and Y values and convert them to NumPy arrays.

  • Use the LinearRegression model in scikit-learn to perform a linear fit.

In this lesson, we are going to use the scikit-learn library to do a linear fit. When it comes to fitting equations in Python, you will encounter a lot of options. In this workshop, we will work with the library scikit-learn, and later the library statsmodels.

Another library you might encounter when doing fitting is SciPy. While the functionality available from these libraries overlaps in some cases, each has different strengths. Statsmodels provides rigorous statistics, SciPy offers a wide range of scientific and engineering routines, and scikit-learn, which we use in this section, is geared toward machine learning.

For this lesson, we will just be doing linear fits. However, scikit-learn has many different models built in, some of which we will see in the next session. Depending on your use case, scikit-learn might not be the easiest library to use for analysis. However, we start with it here so that we can better understand how to use the library for more complicated examples and models later. You can read more in the scikit-learn API documentation.

! ls data
PubChemElements_all.csv  delaney_table4.csv	   potts_table1_clean.csv
chemical_formulas.txt	 delaney_table4_clean.csv  potts_table2.csv
delaney-processed.csv	 potts_table1.csv	   rxnpredict

We start by reading in the data we worked to clean during the last session. We named this file potts_table1_clean.csv. Now that the data is clean, we should expect that it will load correctly with correct data types, and that we shouldn’t have too many problems working with it.

import pandas as pd
df = pd.read_csv("data/potts_table1_clean.csv")
df.head()
Compound log P pi Hd Ha MV R_2 log K_oct log K_hex log K_hep
0 water -6.85 0.45 0.82 0.35 10.6 0.00 -1.38 NaN NaN
1 methanol -6.68 0.44 0.43 0.47 21.7 0.28 -0.73 -2.42 -2.80
2 methanoicacid -7.08 0.60 0.75 0.38 22.3 0.30 -0.54 -3.93 -3.63
3 ethanol -6.66 0.42 0.37 0.48 31.9 0.25 -0.32 -2.24 -2.10
4 ethanoicacid -7.01 0.65 0.61 0.45 33.4 0.27 -0.31 -3.28 -2.90
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Compound   37 non-null     object 
 1   log P      33 non-null     float64
 2   pi         37 non-null     float64
 3   Hd         37 non-null     float64
 4   Ha         37 non-null     float64
 5   MV         37 non-null     float64
 6   R_2        37 non-null     float64
 7   log K_oct  36 non-null     float64
 8   log K_hex  30 non-null     float64
 9   log K_hep  24 non-null     float64
dtypes: float64(9), object(1)
memory usage: 3.0+ KB

Linear Regression using scikit-learn#

scikit-learn has a number of linear models available. The purpose of this workshop is not to cover which linear models exist or in what context to use them; rather, we are covering how to use these models in Python. We will do a simple linear fit using the ordinary least squares method.
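As a reminder, for a single variable an ordinary least squares fit chooses the slope $m$ and intercept $b$ that minimize the sum of squared residuals between the observed values $y_i$ and the line evaluated at each $x_i$:

$$\min_{m,\,b} \sum_{i=1}^{n} \left(y_i - (m x_i + b)\right)^2$$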

Preparing Data for Fitting with scikit-learn#

Before we do the fit, we will need to make sure our data is ready. Although this data is clean, we still have some missing NaN values. In the paper, the authors perform a multiple linear regression, meaning that they fit a line based on many variables. We will do this, but we will first fit a simple linear model with one variable.

From our initial exploration with seaborn, we are going to fit log P as a function of MV. scikit-learn models require us to pass NumPy arrays as data. When we do this fit, it is also necessary to drop any rows containing NaN values.

Check Your Understanding

Prepare a dataframe which can be used for fitting. One approach would be to first slice the dataframe to keep only the columns of interest, then use dropna on the result to drop the unneeded rows. Save your prepared data in a variable called fit_data.

fit_data = df[["log P", "MV"]]
fit_data = fit_data.dropna(axis=0, how="any")

fit_data.head()
log P MV
0 -6.85 10.6
1 -6.68 21.7
2 -7.08 22.3
3 -6.66 31.9
4 -7.01 33.4

Note

We could have alternatively used the dropna function on the original dataframe with an additional argument of subset which says to only consider our two columns of interest. Then, we would have had a dataframe called fit_data which retained all of the columns, but had a value for every cell in both the log P and MV columns.

fit_data = df.dropna(subset=["log P", "MV"], how="any")

When to use dropna

When you use dropna on data you intend to fit, it is important to drop rows from the X and Y columns at the same time, using the argument how="any". This ensures that the values of X and Y stay matched with one another and remain the same length.
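To see why this matters, here is a small sketch of what would happen if we dropped NaN values from each column separately (for illustration only; don't do this when preparing data to fit):

# Dropping NaN from each column separately gives arrays of different lengths,
# and the remaining values no longer line up row by row.
X_bad = df["MV"].dropna().to_numpy()
Y_bad = df["log P"].dropna().to_numpy()
print(X_bad.shape, Y_bad.shape)   # (37,) vs (33,), based on the non-null counts from df.info()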

X = fit_data["MV"].to_numpy()
Y = fit_data["log P"].to_numpy()

scikit-learn Models#

Now that we have prepared our X and Y variables, let's see how we would do a fit using scikit-learn.

Typically when fitting with scikit-learn, the first thing you will do is import the type of model you want to use. In our case, we are importing a LinearRegression model, which performs ordinary least squares fitting. You will first import the model, then create a model object. After creation, you give the model data and tell it to perform a fit. The fitted model can then be used to make predictions.
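In compact form, the usual pattern looks something like this sketch, where x_example and y_example are made-up arrays used only to show the workflow:

import numpy as np
from sklearn.linear_model import LinearRegression

x_example = np.array([[1.0], [2.0], [3.0]])   # one feature (column), three samples (rows)
y_example = np.array([2.0, 4.0, 6.0])

model = LinearRegression()         # create the model object
model.fit(x_example, y_example)    # give the model data and perform the fit
print(model.predict([[4.0]]))      # use the fitted model to make a prediction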

from sklearn.linear_model import LinearRegression

Now that you have imported the model, you can read more about it either on the scikit-learn website, or by using the built-in Python help function.

help(LinearRegression)
Help on class LinearRegression in module sklearn.linear_model._base:

class LinearRegression(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, LinearModel)
 |  LinearRegression(*, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)
 |  
 |  Ordinary least squares Linear Regression.
 |  
 |  LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
 |  to minimize the residual sum of squares between the observed targets in
 |  the dataset, and the targets predicted by the linear approximation.
 |  
 |  Parameters
 |  ----------
 |  fit_intercept : bool, default=True
 |      Whether to calculate the intercept for this model. If set
 |      to False, no intercept will be used in calculations
 |      (i.e. data is expected to be centered).
 |  
 |  copy_X : bool, default=True
 |      If True, X will be copied; else, it may be overwritten.
 |  
 |  n_jobs : int, default=None
 |      The number of jobs to use for the computation. This will only provide
 |      speedup in case of sufficiently large problems, that is if firstly
 |      `n_targets > 1` and secondly `X` is sparse or if `positive` is set
 |      to `True`. ``None`` means 1 unless in a
 |      :obj:`joblib.parallel_backend` context. ``-1`` means using all
 |      processors. See :term:`Glossary <n_jobs>` for more details.
 |  
 |  positive : bool, default=False
 |      When set to ``True``, forces the coefficients to be positive. This
 |      option is only supported for dense arrays.
 |  
 |      .. versionadded:: 0.24
 |  
 |  Attributes
 |  ----------
 |  coef_ : array of shape (n_features, ) or (n_targets, n_features)
 |      Estimated coefficients for the linear regression problem.
 |      If multiple targets are passed during the fit (y 2D), this
 |      is a 2D array of shape (n_targets, n_features), while if only
 |      one target is passed, this is a 1D array of length n_features.
 |  
 |  rank_ : int
 |      Rank of matrix `X`. Only available when `X` is dense.
 |  
 |  singular_ : array of shape (min(X, y),)
 |      Singular values of `X`. Only available when `X` is dense.
 |  
 |  intercept_ : float or array of shape (n_targets,)
 |      Independent term in the linear model. Set to 0.0 if
 |      `fit_intercept = False`.
 |  
 |  n_features_in_ : int
 |      Number of features seen during :term:`fit`.
 |  
 |      .. versionadded:: 0.24
 |  
 |  feature_names_in_ : ndarray of shape (`n_features_in_`,)
 |      Names of features seen during :term:`fit`. Defined only when `X`
 |      has feature names that are all strings.
 |  
 |      .. versionadded:: 1.0
 |  
 |  See Also
 |  --------
 |  Ridge : Ridge regression addresses some of the
 |      problems of Ordinary Least Squares by imposing a penalty on the
 |      size of the coefficients with l2 regularization.
 |  Lasso : The Lasso is a linear model that estimates
 |      sparse coefficients with l1 regularization.
 |  ElasticNet : Elastic-Net is a linear regression
 |      model trained with both l1 and l2 -norm regularization of the
 |      coefficients.
 |  
 |  Notes
 |  -----
 |  From the implementation point of view, this is just plain Ordinary
 |  Least Squares (scipy.linalg.lstsq) or Non Negative Least Squares
 |  (scipy.optimize.nnls) wrapped as a predictor object.
 |  
 |  Examples
 |  --------
 |  >>> import numpy as np
 |  >>> from sklearn.linear_model import LinearRegression
 |  >>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
 |  >>> # y = 1 * x_0 + 2 * x_1 + 3
 |  >>> y = np.dot(X, np.array([1, 2])) + 3
 |  >>> reg = LinearRegression().fit(X, y)
 |  >>> reg.score(X, y)
 |  1.0
 |  >>> reg.coef_
 |  array([1., 2.])
 |  >>> reg.intercept_
 |  3.0...
 |  >>> reg.predict(np.array([[3, 5]]))
 |  array([16.])
 |  
 |  Method resolution order:
 |      LinearRegression
 |      sklearn.base.MultiOutputMixin
 |      sklearn.base.RegressorMixin
 |      LinearModel
 |      sklearn.base.BaseEstimator
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, *, fit_intercept=True, copy_X=True, n_jobs=None, positive=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  fit(self, X, y, sample_weight=None)
 |      Fit linear model.
 |      
 |      Parameters
 |      ----------
 |      X : {array-like, sparse matrix} of shape (n_samples, n_features)
 |          Training data.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_targets)
 |          Target values. Will be cast to X's dtype if necessary.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Individual weights for each sample.
 |      
 |          .. versionadded:: 0.17
 |             parameter *sample_weight* support to LinearRegression.
 |      
 |      Returns
 |      -------
 |      self : object
 |          Fitted Estimator.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  __annotations__ = {'_parameter_constraints': <class 'dict'>}
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from sklearn.base.MultiOutputMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.RegressorMixin:
 |  
 |  score(self, X, y, sample_weight=None)
 |      Return the coefficient of determination of the prediction.
 |      
 |      The coefficient of determination :math:`R^2` is defined as
 |      :math:`(1 - \frac{u}{v})`, where :math:`u` is the residual
 |      sum of squares ``((y_true - y_pred)** 2).sum()`` and :math:`v`
 |      is the total sum of squares ``((y_true - y_true.mean()) ** 2).sum()``.
 |      The best possible score is 1.0 and it can be negative (because the
 |      model can be arbitrarily worse). A constant model that always predicts
 |      the expected value of `y`, disregarding the input features, would get
 |      a :math:`R^2` score of 0.0.
 |      
 |      Parameters
 |      ----------
 |      X : array-like of shape (n_samples, n_features)
 |          Test samples. For some estimators this may be a precomputed
 |          kernel matrix or a list of generic objects instead with shape
 |          ``(n_samples, n_samples_fitted)``, where ``n_samples_fitted``
 |          is the number of samples used in the fitting for the estimator.
 |      
 |      y : array-like of shape (n_samples,) or (n_samples, n_outputs)
 |          True values for `X`.
 |      
 |      sample_weight : array-like of shape (n_samples,), default=None
 |          Sample weights.
 |      
 |      Returns
 |      -------
 |      score : float
 |          :math:`R^2` of ``self.predict(X)`` wrt. `y`.
 |      
 |      Notes
 |      -----
 |      The :math:`R^2` score used when calling ``score`` on a regressor uses
 |      ``multioutput='uniform_average'`` from version 0.23 to keep consistent
 |      with default value of :func:`~sklearn.metrics.r2_score`.
 |      This influences the ``score`` method of all the multioutput
 |      regressors (except for
 |      :class:`~sklearn.multioutput.MultiOutputRegressor`).
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from LinearModel:
 |  
 |  predict(self, X)
 |      Predict using the linear model.
 |      
 |      Parameters
 |      ----------
 |      X : array-like or sparse matrix, shape (n_samples, n_features)
 |          Samples.
 |      
 |      Returns
 |      -------
 |      C : array, shape (n_samples,)
 |          Returns predicted values.
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from sklearn.base.BaseEstimator:
 |  
 |  __getstate__(self)
 |      Helper for pickle.
 |  
 |  __repr__(self, N_CHAR_MAX=700)
 |      Return repr(self).
 |  
 |  __setstate__(self, state)
 |  
 |  get_params(self, deep=True)
 |      Get parameters for this estimator.
 |      
 |      Parameters
 |      ----------
 |      deep : bool, default=True
 |          If True, will return the parameters for this estimator and
 |          contained subobjects that are estimators.
 |      
 |      Returns
 |      -------
 |      params : dict
 |          Parameter names mapped to their values.
 |  
 |  set_params(self, **params)
 |      Set the parameters of this estimator.
 |      
 |      The method works on simple estimators as well as on nested objects
 |      (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
 |      parameters of the form ``<component>__<parameter>`` so that it's
 |      possible to update each component of a nested object.
 |      
 |      Parameters
 |      ----------
 |      **params : dict
 |          Estimator parameters.
 |      
 |      Returns
 |      -------
 |      self : estimator instance
 |          Estimator instance.

Before we do the fit, we first create the model. When we create the model, we can specify settings such as whether we want the linear model to have an intercept. It will have one by default, but if you wanted to do an ordinary least squares fit without an intercept, you would specify that when creating the model, as in the sketch below.
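For example, a model without an intercept could be created like this (shown only to illustrate the option; we keep the default intercept for the rest of the lesson):

# Passing fit_intercept=False forces the fitted line through the origin.
model_no_intercept = LinearRegression(fit_intercept=False)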

After we create the model, we give it data and call the fit function. Then, the model will contain information about coefficients and an intercept.

We will fit an equation for log P as a function of MV, using the NumPy arrays we prepared above.

Next, we will create the linear model using LinearRegression().

To perform the fit, we use the fit method on the linear model we created. As we have it now, it will not quite work. The error message is shown below for discussion.

linear_model = LinearRegression()
linear_model.fit(X, Y)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 2
      1 linear_model = LinearRegression()
----> 2 linear_model.fit(X, Y)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/sklearn/linear_model/_base.py:649, in LinearRegression.fit(self, X, y, sample_weight)
    645 n_jobs_ = self.n_jobs
    647 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 649 X, y = self._validate_data(
    650     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    651 )
    653 sample_weight = _check_sample_weight(
    654     sample_weight, X, dtype=X.dtype, only_non_negative=True
    655 )
    657 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    658     X,
    659     y,
   (...)
    662     sample_weight=sample_weight,
    663 )

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/sklearn/base.py:554, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    552         y = check_array(y, input_name="y", **check_y_params)
    553     else:
--> 554         X, y = check_X_y(X, y, **check_params)
    555     out = X, y
    557 if not no_val_X and check_params.get("ensure_2d", True):

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/sklearn/utils/validation.py:1104, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1099         estimator_name = _check_estimator_name(estimator)
   1100     raise ValueError(
   1101         f"{estimator_name} requires y to be passed, but the target y is None"
   1102     )
-> 1104 X = check_array(
   1105     X,
   1106     accept_sparse=accept_sparse,
   1107     accept_large_sparse=accept_large_sparse,
   1108     dtype=dtype,
   1109     order=order,
   1110     copy=copy,
   1111     force_all_finite=force_all_finite,
   1112     ensure_2d=ensure_2d,
   1113     allow_nd=allow_nd,
   1114     ensure_min_samples=ensure_min_samples,
   1115     ensure_min_features=ensure_min_features,
   1116     estimator=estimator,
   1117     input_name="X",
   1118 )
   1120 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1122 check_consistent_length(X, y)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/sklearn/utils/validation.py:900, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    898     # If input is 1D raise error
    899     if array.ndim == 1:
--> 900         raise ValueError(
    901             "Expected 2D array, got 1D array instead:\narray={}.\n"
    902             "Reshape your data either using array.reshape(-1, 1) if "
    903             "your data has a single feature or array.reshape(1, -1) "
    904             "if it contains a single sample.".format(array)
    905         )
    907 if dtype_numeric and array.dtype.kind in "USV":
    908     raise ValueError(
    909         "dtype='numeric' is not compatible with arrays of bytes/strings."
    910         "Convert your data to numeric values explicitly instead."
    911     )

ValueError: Expected 2D array, got 1D array instead:
array=[ 10.6  21.7  22.3  31.9  33.4  42.2  43.6  49.4  52.   52.4  53.9  53.9
  60.   60.2  62.6  64.   64.1  66.   66.   67.6  67.6  70.   71.   72.3
  72.9  74.3  79.5  83.1  84.6  93.3  94.8 104.  114. ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

It is at this point that the array’s shape becomes important. Typically if you print the shape of a pandas Series or a one dimensional slice of a NumPy array, you will see something like (n, ). For example,

X.shape
(33,)

This array is one dimensional, meaning that its shape is specified with only one number. You can think of it as a vector. A shape of (33, 1) might not look very different, but a two-dimensional array is required for fitting with scikit-learn.

You will see that scikit-learn even tells us this in the error message.

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The shape of the array is important because when the data is one dimensional, scikit-learn can't tell whether you intend to have many features (that is, many parameters for the linear regression) or many samples. Many features would be represented by many columns, while many samples would be represented by many rows. We must reshape our arrays to specify which we want.

In NumPy, using -1 for a dimension in reshape tells NumPy to fill in whatever number is required so that the other specified dimensions work out. Otherwise, we would have to specify the number of rows in each reshape call. We don't really know or care about the number of rows; what's important for us is that the data be in a single column (one feature with many samples).
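As a small illustration of how the -1 works (using a throwaway array, not our fit data):

import numpy as np

a = np.arange(6)                  # a 1D array with shape (6,)
print(a.reshape(-1, 1).shape)     # (6, 1) -- NumPy infers the 6 rows
print(a.reshape(2, -1).shape)     # (2, 3) -- NumPy infers the 3 columns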

X = X.reshape(-1, 1)
X.shape
(33, 1)
Y = Y.reshape(-1, 1)
linear_model.fit(X, Y)
LinearRegression()

The linear_model variable now contains our linear fit. We can check our fit coefficient and intercept by accessing attributes on the model.

print(f"The coefficient is {linear_model.coef_} and the intercept is {linear_model.intercept_}.")
The coefficient is [[0.02722491]] and the intercept is [-7.23822846].

Check your understanding

Perform a second linear fit for `log P` vs `pi`. Save your variables as `X_pi`, `Y_pi`, and `linear_pi`.

fit_data_pi = df.dropna(subset=["log P", "pi"], how="any")

linear_pi = LinearRegression()

X_pi = fit_data_pi["pi"].to_numpy().reshape(-1, 1)
Y_pi = fit_data_pi["log P"].to_numpy().reshape(-1, 1)

linear_pi.fit(X_pi, Y_pi)

print(f"The intercept is {linear_pi.coef_} and the intercept is {linear_pi.intercept_}.")
The intercept is [[0.20373337]] and the intercept is [-5.67219106].

We might next be interested in how good these fits are. A common fit metric is the R² value, which for scikit-learn linear models is available through the model's score method.

r2 = linear_model.score(X, Y)
print(f"The r2 score for log P vs MV is {r2}")
The r2 score for log P vs MV is -3.3146070112592856
r2_pi = linear_pi.score(X_pi, Y_pi)
print(f"The r2 score for log P vs pi is {r2_pi}")
The r2 score for log P vs pi is 0.003717585103936716
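If you prefer, the same R² value can be computed from the model's predictions with the r2_score function in sklearn.metrics. A small sketch using the two fits from above:

from sklearn.metrics import r2_score

# r2_score compares the observed values directly with the predicted values.
print(r2_score(Y, linear_model.predict(X)))
print(r2_score(Y_pi, linear_pi.predict(X_pi)))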

Making Predictions#

Once the model is fit, its predict method can be used to generate predicted values. Here we add the model's predictions for our data as a new column in the dataframe.

fit_data["model_prediction"] = linear_model.predict(X)

fit_data.head()
log P MV model_prediction
0 -6.85 10.6 -6.949644
1 -6.68 21.7 -6.647448
2 -7.08 22.3 -6.631113
3 -6.66 31.9 -6.369754
4 -7.01 33.4 -6.328916
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib notebook

sns.set_theme(font_scale=1.25)
g = sns.lmplot(x="log P", y="model_prediction", data=fit_data)
g.ax.annotate(rf"$r^2$={r2_pi:.3f}", xy=(-6, -4))
g.tight_layout()
<seaborn.axisgrid.FacetGrid at 0x7fb3ed68d9d0>
import numpy as np

# Predict for single made up value
linear_model.predict(np.array([0.75]).reshape(1, -1))
array([[-7.21780978]])
df
Compound log P pi Hd Ha MV R_2 log K_oct log K_hex log K_hep
0 water -6.85 0.45 0.82 0.35 10.6 0.00 -1.38 NaN NaN
1 methanol -6.68 0.44 0.43 0.47 21.7 0.28 -0.73 -2.42 -2.80
2 methanoicacid -7.08 0.60 0.75 0.38 22.3 0.30 -0.54 -3.93 -3.63
3 ethanol -6.66 0.42 0.37 0.48 31.9 0.25 -0.32 -2.24 -2.10
4 ethanoicacid -7.01 0.65 0.61 0.45 33.4 0.27 -0.31 -3.28 -2.90
5 n-propanol -6.41 0.42 0.37 0.48 42.2 0.24 0.34 -1.48 -1.52
6 n-propanoicacid -7.01 0.65 0.60 0.45 43.6 0.23 0.26 -2.64 -2.14
7 butane-2-one -5.90 0.70 0.00 0.51 49.4 0.17 0.28 -0.25 NaN
8 benzene NaN 0.52 0.00 0.14 50.0 0.61 2.00 2.29 NaN
9 diethylether -5.35 0.25 0.00 0.45 52.0 0.04 0.83 0.66 NaN
10 n-butanol -6.16 0.42 0.37 0.48 52.4 0.22 0.88 -1.08 -0.70
11 n-butanoicacid -6.36 0.62 0.60 0.45 53.9 0.21 0.79 -1.92 -0.96
12 phenol -5.64 0.89 0.52 0.30 53.9 0.81 1.46 -0.70 -0.82
13 toluene -3.56 0.52 0.00 0.14 60.0 0.60 2.70 2.89 NaN
14 styrene -3.75 0.65 0.00 0.16 60.2 0.85 2.95 NaN NaN
15 n-pentanol -5.78 0.42 0.37 0.48 62.6 0.22 1.40 -0.39 NaN
16 benzyl-OH -5.78 0.87 0.33 0.56 64.0 0.80 1.10 -0.62 NaN
17 n-pentanoicacid -6.01 0.60 0.60 0.45 64.1 0.21 1.33 -1.31 0.44
18 2-chlorophenol -5.04 0.88 0.32 0.31 66.0 0.85 2.15 NaN NaN
19 4-chlorophenol -5.00 1.08 0.67 0.20 66.0 0.92 2.39 -1.31 -0.12
20 m-cresol -5.38 0.88 0.57 0.34 67.6 0.82 1.96 NaN NaN
21 o-cresol NaN 0.86 0.52 0.30 67.6 0.84 1.95 NaN NaN
22 p-cresol -5.29 0.87 0.57 0.31 67.6 0.82 1.96 NaN NaN
23 4-bromophenol -5.00 1.17 0.67 0.20 70.0 1.08 2.59 -0.11 -0.20
24 4-nitrophenol NaN 1.72 0.82 0.26 71.0 1.07 1.96 -2.00 -2.15
25 3-nitrophenol -5.81 1.57 0.79 0.23 71.0 1.05 2.00 -1.40 -1.23
26 2-nitrophenol NaN 1.05 0.05 0.37 71.0 1.02 1.80 -1.40 1.04
27 ethylbenzene -3.48 0.51 0.00 0.15 72.3 0.60 3.15 NaN NaN
28 n-hexanol -5.45 0.42 0.37 0.48 72.9 0.21 2.03 0.11 0.45
29 n-hexanoicacid -5.44 0.60 0.60 0.45 74.3 0.17 1.89 -0.85 0.24
30 8-naphthol -5.11 1.08 0.61 0.40 79.5 1.52 2.84 1.77 0.30
31 n-heptanol -5.05 0.42 0.37 0.48 83.1 0.21 2.49 0.77 1.01
32 n-heptanoicacid -5.28 0.60 0.60 0.45 84.6 0.15 2.33 -0.29 1.16
33 n-octanol -4.84 0.42 0.37 0.48 93.3 0.20 3.15 1.62 1.65
34 n-octanoicacid -5.21 0.60 0.60 0.45 94.8 0.15 2.83 0.41 1.95
35 n-nonanol -4.77 0.42 0.37 0.48 104.0 0.19 3.68 1.97 2.28
36 n-decanol -4.66 0.42 0.37 0.48 114.0 0.19 NaN 2.56 2.91

Multiple Linear Regression#

As mentioned earlier, the authors of the paper fit log P as a function of several variables at once. With scikit-learn, a multiple linear regression works the same way as the single-variable fit; the X array simply has one column per feature.

fit_data = df[["log P", "MV", "pi", "Ha", "Hd", "R_2"]].copy()
fit_data.head()
log P MV pi Ha Hd R_2
0 -6.85 10.6 0.45 0.35 0.82 0.00
1 -6.68 21.7 0.44 0.47 0.43 0.28
2 -7.08 22.3 0.60 0.38 0.75 0.30
3 -6.66 31.9 0.42 0.48 0.37 0.25
4 -7.01 33.4 0.65 0.45 0.61 0.27
fit_data.dropna(axis=0, inplace=True)
X = fit_data[["MV", "pi", "Ha", "Hd", "R_2"]].to_numpy()
Y = fit_data["log P"].to_numpy()
multiple_reg = LinearRegression().fit(X, Y)
print(multiple_reg.coef_)
print(multiple_reg.intercept_)
[ 0.02606803 -1.0272469  -4.36496468 -1.28525217  0.54748187]
-4.463985993729431
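Since the order of the entries in coef_ matches the column order we used to build X, we can pair each coefficient with its feature name to make the fitted equation easier to read. A minimal sketch:

# Each coefficient multiplies the feature in the same position (column) in X.
feature_names = ["MV", "pi", "Ha", "Hd", "R_2"]
for name, coefficient in zip(feature_names, multiple_reg.coef_):
    print(f"{name}: {coefficient:.4f}")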
r2_m = multiple_reg.score(X, Y)
fit_data["model_prediction"] = multiple_reg.predict(X)
g = sns.lmplot(x="log P", y="model_prediction", data=fit_data)

g.ax.annotate(rf"$r^2$={r2_m:.3f}", xy=(-6, -4))

g.tight_layout()
g.savefig("session3.png", dpi=250)

Key Points

  • Import the model you want to use from scikit-learn and create a model object.

  • scikit-learn models require a two-dimensional X: one row per sample and one column per feature.

  • Use .reshape on your NumPy arrays to make sure they have the correct shape.

  • Fit SciKitLearn models by giving them data and using the fit method.

  • Use the predict method after fitting to make predictions.