Linear Fitting with statsmodels

Overview

Questions:

How can I fit a linear equation using statsmodels?
How can I fit a linear equation with multiple variables using statsmodels?

Objectives:

Use statsmodels for linear regression.

In this module, we will use statsmodels to fit a linear model. We will use its formula interface, which lets us specify the equation to fit as a text formula that refers to the columns of a dataframe. To import this interface, use import statsmodels.formula.api as smf.

First, we’ll use pandas to load the data we cleaned in session 2.

import os
import pandas as pd

import statsmodels.formula.api as smf

file_path = os.path.join("data", "potts_table1_clean.csv")
df = pd.read_csv(file_path)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Compound   37 non-null     object 
 1   log P      34 non-null     float64
 2   pi         37 non-null     float64
 3   Hd         37 non-null     float64
 4   Ha         37 non-null     float64
 5   MV         36 non-null     float64
 6   R_2        37 non-null     float64
 7   log K_oct  36 non-null     float64
 8   log K_hex  30 non-null     float64
 9   log K_hep  24 non-null     float64
dtypes: float64(9), object(1)
memory usage: 3.0+ KB

Next, we will use ordinary least squares (ols) to fit our equation. When you call ols, you give it the formula you would like to fit. The dependent variable goes on the left side, followed by a ~, and the independent variables go on the right side. To fit log P as a function of MV, we would expect to write log P ~ MV. However, since the name of our dependent variable contains a space, we must quote it using a special syntax: Q('log P'). Finally, we fit the model using .fit().

regression = smf.ols("Q('log P') ~ MV", data=df).fit()

This performs a fit to your equation using ordinary least squares. You can get a summary of your model by calling .summary() on the fit.

regression.summary()

OLS Regression Results
Dep. Variable:	Q('log P')	R-squared:	0.446
Model:	OLS	Adj. R-squared:	0.428
Method:	Least Squares	F-statistic:	24.98
Date:	Wed, 12 May 2021	Prob (F-statistic):	2.16e-05
Time:	13:54:47	Log-Likelihood:	-34.474
No. Observations:	33	AIC:	72.95
Df Residuals:	31	BIC:	75.94
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-7.2469	0.362	-20.002	0.000	-7.986	-6.508
MV	0.0271	0.005	4.998	0.000	0.016	0.038

Omnibus:	21.287	Durbin-Watson:	1.639
Prob(Omnibus):	0.000	Jarque-Bera (JB):	28.520
Skew:	1.802	Prob(JB):	6.41e-07
Kurtosis:	5.783	Cond. No.	196.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
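
The Q('log P') quoting used in the formula for this fit can be tried out on a small example. The sketch below uses made-up data (the dataframe toy and its column names are assumptions for illustration, not part of the lesson data) and shows that quoting a column name with Q() and renaming the column to a valid Python name should give the same fit.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: the response column name contains a space.
toy = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0],
    "my response": [1.0, 3.0, 5.0, 7.0, 9.0],  # exactly 2*x + 1
})

# Q('...') quotes a column name that is not a valid Python identifier.
quoted_fit = smf.ols("Q('my response') ~ x", data=toy).fit()

# Renaming the column first avoids the need for Q().
renamed = toy.rename(columns={"my response": "my_response"})
renamed_fit = smf.ols("my_response ~ x", data=renamed).fit()

print(quoted_fit.params)
print(renamed_fit.params)
```

Which approach to use is a matter of taste: Q() leaves the dataframe untouched, while renaming keeps the formulas shorter.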

The fitted coefficients are also available directly as regression.params.

regression.params

Intercept   -7.246871
MV           0.027056
dtype: float64
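
Other numbers from the summary table can be read from the results object in the same way as the coefficients. A minimal sketch on made-up data (the dataframe toy, the fit toy_fit, and the random noise are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical noisy line: y = 0.5*x - 2 plus a little random noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
toy = pd.DataFrame({"x": x, "y": 0.5 * x - 2.0 + rng.normal(scale=0.3, size=x.size)})

toy_fit = smf.ols("y ~ x", data=toy).fit()

print(toy_fit.params)      # fitted intercept and slope
print(toy_fit.bse)         # standard errors of the coefficients
print(toy_fit.pvalues)     # p-values of the coefficients
print(toy_fit.conf_int())  # 95% confidence intervals by default
print(toy_fit.rsquared)    # R-squared of the fit
print(toy_fit.nobs)        # number of observations used
```

Reading the values this way is convenient when you want to reuse them in later calculations instead of copying them from the printed summary.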

Fitted values are stored automatically in regression.fittedvalues.

regression.fittedvalues

  -6.960075
  -6.659752
  -6.643518
  -6.383779
  -6.343195
  -6.105100
  -6.067222
  -5.910296
  -5.839950
 -5.829127
 -5.788543
 -5.623500
 -5.618089
 -5.553154
 -5.515276
 -5.512570
 -5.461163
 -5.461163
 -5.417873
 -5.417873
 -5.352939
 -5.325882
 -5.325882
 -5.290709
 -5.274476
 -5.236597
 -5.095905
 -4.998503
 -4.957918
 -4.722530
 -4.681945
 -4.433029
 -4.162467
dtype: float64

df["predicted_values"] = regression.fittedvalues

df.head()

	Compound	log P	pi	Hd	Ha	MV	R_2	log K_oct	log K_hex	log K_hep	predicted_values
0	water	-6.85	0.45	0.82	0.35	10.6	0.00	-1.38	NaN	NaN	-6.960075
1	methanol	-6.68	0.44	0.43	0.47	21.7	0.28	-0.73	-2.42	-2.80	-6.659752
2	methanoicacid	-7.08	0.60	0.75	0.38	22.3	0.30	-0.54	-3.93	-3.63	-6.643518
3	ethanol	-6.66	0.42	0.37	0.48	31.9	0.25	-0.32	-2.24	-2.10	-6.383779
4	ethanoicacid	-7.01	0.65	0.61	0.45	33.4	0.27	-0.31	-3.28	-2.90	-6.343195
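
The summary reports 33 observations although the dataframe has 37 rows, which suggests that rows with a missing log P or MV value were left out of the fit. The sketch below, on made-up data (toy and toy_fit are hypothetical names), illustrates this behavior: rows with missing values are dropped by the formula interface, and fittedvalues keeps the row labels of the rows that were used, so assigning it to a new column leaves NaN in the skipped rows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with two missing responses (rows 1 and 4).
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.1, np.nan, 6.2, 7.9, np.nan, 12.1],
})

toy_fit = smf.ols("y ~ x", data=toy).fit()

# Only the complete rows take part in the fit.
print(int(toy_fit.nobs))
print(toy_fit.fittedvalues.index.tolist())

# Assignment aligns on the index, so skipped rows get NaN.
toy["predicted"] = toy_fit.fittedvalues
print(toy)
```

If you need a prediction for every row where the independent variables are known, calling predict on the dataframe is one option, since it does not depend on the response column being present.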

You can make predictions using the fitted model by calling regression.predict and passing in values for which you want a prediction as part of a pandas dataframe with appropriate column names. Your column names must match the column names you performed the fit with.

help(regression.predict)

Help on method predict in module statsmodels.base.model:

predict(exog=None, transform=True, *args, **kwargs) method of statsmodels.regression.linear_model.OLSResults instance
    Call self.model.predict with self.params as the first argument.
    
    Parameters
    ----------
    exog : array_like, optional
        The values for which you want to predict. see Notes below.
    transform : bool, optional
        If the model was fit via a formula, do you want to pass
        exog through the formula. Default is True. E.g., if you fit
        a model y ~ log(x1) + log(x2), and transform is True, then
        you can pass a data structure that contains x1 and x2 in
        their original form. Otherwise, you'd need to log the data
        first.
    *args
        Additional arguments to pass to the model, see the
        predict method of the model for the details.
    **kwargs
        Additional keywords arguments to pass to the model, see the
        predict method of the model for the details.
    
    Returns
    -------
    array_like
        See self.model.predict.
    
    Notes
    -----
    The types of exog that are supported depends on whether a formula
    was used in the specification of the model.
    
    If a formula was used, then exog is processed in the same way as
    the original data. This transformation needs to have key access to the
    same variable names, and can be a pandas DataFrame or a dict like
    object that contains numpy arrays.
    
    If no formula was used, then the provided exog needs to have the
    same number of columns as the original exog in the model. No
    transformation of the data is performed except converting it to
    a numpy array.
    
    Row indices as in pandas data frames are supported, and added to the
    returned prediction.

to_predict = pd.DataFrame()
to_predict["MV"] = [75, 90]

regression.predict(to_predict)

0   -5.217658
1   -4.811815
dtype: float64
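
For a single-variable fit, a prediction is simply the intercept plus the slope times the new value. The sketch below checks this on made-up data (toy, toy_fit, and new_points are hypothetical names used only for illustration) and builds the prediction dataframe in one step.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data, roughly y = 2*x + 1.
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 2.9, 5.1, 7.0]})
toy_fit = smf.ols("y ~ x", data=toy).fit()

# The column name must match the variable name used in the formula.
new_points = pd.DataFrame({"x": [4.0, 5.0]})
predicted = toy_fit.predict(new_points)

# The same numbers computed by hand from the fitted coefficients.
manual = toy_fit.params["Intercept"] + toy_fit.params["x"] * new_points["x"]

print(predicted)
print(manual)
```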

import seaborn as sns

g = sns.lmplot(x="log P", y="predicted_values", data=df)

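Multiple Regression

To fit a model with more than one independent variable, join the column names with + on the right side of the formula. First, check which columns are available.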

df.columns

Index(['Compound', 'log P', 'pi', 'Hd', 'Ha', 'MV', 'R_2', 'log K_oct',
       'log K_hex', 'log K_hep', 'predicted_values'],
      dtype='object')

multiple_regression = smf.ols("Q('log P') ~ pi + Hd + Ha + MV + R_2", data=df).fit()

multiple_regression.summary()

OLS Regression Results
Dep. Variable:	Q('log P')	R-squared:	0.965
Model:	OLS	Adj. R-squared:	0.959
Method:	Least Squares	F-statistic:	150.4
Date:	Wed, 12 May 2021	Prob (F-statistic):	7.87e-19
Time:	13:55:34	Log-Likelihood:	11.249
No. Observations:	33	AIC:	-10.50
Df Residuals:	27	BIC:	-1.518
Df Model:	5
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	-4.4861	0.199	-22.501	0.000	-4.895	-4.077
pi	-0.9319	0.220	-4.237	0.000	-1.383	-0.481
Hd	-1.3037	0.185	-7.038	0.000	-1.684	-0.924
Ha	-4.3756	0.371	-11.784	0.000	-5.138	-3.614
MV	0.0260	0.002	17.193	0.000	0.023	0.029
R_2	0.5135	0.177	2.900	0.007	0.150	0.877

Omnibus:	0.688	Durbin-Watson:	1.300
Prob(Omnibus):	0.709	Jarque-Bera (JB):	0.200
Skew:	0.179	Prob(JB):	0.905
Kurtosis:	3.132	Cond. No.	835.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

df["MR_predicted_values"] = multiple_regression.fittedvalues

g = sns.lmplot(x="log P", y="MR_predicted_values", data=df)
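
Adding the extra descriptors raised the adjusted R-squared from 0.428 to 0.959 and lowered the AIC from 72.95 to -10.50. Such comparisons can also be made in code rather than by reading two summary tables. A minimal sketch on made-up data (the dataframe toy, its columns a and b, and the fits one_var and two_var are assumptions for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data where y depends on two variables, a and b.
rng = np.random.default_rng(1)
n = 40
toy = pd.DataFrame({"a": rng.normal(size=n), "b": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["a"] - 3.0 * toy["b"] + rng.normal(scale=0.2, size=n)

# Fit with one variable, then with both.
one_var = smf.ols("y ~ a", data=toy).fit()
two_var = smf.ols("y ~ a + b", data=toy).fit()

# Higher adjusted R-squared and lower AIC both favor the second model here.
print(one_var.rsquared_adj, two_var.rsquared_adj)
print(one_var.aic, two_var.aic)
print(two_var.params)
```

Adjusted R-squared and AIC both penalize extra parameters, which makes them more suitable than plain R-squared for comparing models with different numbers of variables.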

Key Points

Statsmodels allows you to specify your equations using data frames and column names.
Calling .summary() on the fit gives you a summary of the fit and parameters.
You can use .predict to predict new values using the fitted model.
You must give the predict method a dataframe whose column names match the independent variables used in the fit.

Python for Data Science in Chemistry
