Linear Fitting Absorbance Data#

Overview

Questions:

How can I use Python to fit a linear model?

Objectives:

Fit a linear model using statsmodels

A common exercise in introductory chemistry is fitting a linear model to absorbance measurements at a particular wavelength vs. concentration of an analyte using spectrophotometry. In these cases, the absorbance of a sample is described using Beer-Lambert Law

\[ A = \varepsilon \cdot l \cdot c \]

where \(A\) is the absorbance, \(\varepsilon\) is the absorptivity constant, \(l\) is path length, and \(c\) is the concentration of the sample.

In this notebook, we will read in absorbance data, fit a linear model using a Python library called statsmodels, and calculate the concentrations of unknowns using our model. We will utilize skills learned in our previous lessons, including reading, accessing, and analyzing data using pandas and visualizing data using plotly. We will add a new skill - fitting a linear model with a Python library called statsmodels.

Importing Libraries and Reading Data#

Note that when fitting a model in Python, there are a number of options for the library you might pick including NumPy, SciPy, SciKit-Learn, and statsmodels. The library that you pick might be based on personal preference or features that a particular library offers. In this notebook, we show fitting with statsmodels because of the ease of seeing fit statics and defining a formula for fitting.

For statsmodels, we will use the formula API. This will allow us to define the relationship we’d like to fit using a formula as a string.

To start this notebook, we will import the libraries we need. We will import the following:

Library Name	Purpose
pandas	reading and processing data
plotly	interactive plots and visualization
statsmodels	data fitting

The cell below imports the libraries we will use for our analysis.

# imports 
import pandas as pd # for reading data from a file and putting in a table
import plotly.express as px # for plotting

import statsmodels.formula.api as smf # for fitting

For the example, we will use data stored in the file data/protein_assay.csv. This data represents absorbance data recorded in a Bradford Assay for determining protein concentration.

After we have imported our libraries, we will next use pandas to read in our data. Our data is stored in a comma separated value (CSV) file, though pandas can also read from Excel files.

# Read in file
df = pd.read_csv("data/protein_assay.csv")

# View the first five rows.
df.head()

	concentration	absorbance
0	0.2	0.285
1	0.4	0.485
2	0.6	0.621
3	0.8	0.799
4	1.0	1.010

Visualizing Data#

After reading in our data, we might wish to inspect it visually. We can do this using plotly express (imported as px).

# Create a figure using px.scatter.
fig = px.scatter(df, x='concentration', y="absorbance")

# Show the figure.
fig.show()

Note that you can change the axis labels using the following syntax.

# Create a figure using px.scatter.
fig = px.scatter(df, x='concentration', y="absorbance", 
                 labels={"concentration":"Concentration (mg/mL)", "absorbance":"Absorbance at 595 nm"})

# Show the figure.
fig.show()

Fitting a Linear Model#

To fit a linear equation to our data, we can use a library called statsmodels. statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

While there are numerous options for fitting data in Python, statsmodels is particularly beneficial when dealing with linear models. Note that this encompasses more than just “linear equations”. This library is especially handy when you need straightforward access to fit parameters and in-depth statistical metrics related to the fit.

In particular, we are using a part of statsmodels called the formula API. The formula API lets you define a formula as a string for fitting and is specifically designed to work with dataframes. When defining a formula, you use the column names in a string to define the relationship. Note that the column names should not have spaces, or entering the relationship is a bit more complicated. For example, if we had data representing pressure and temperature at constant volume and expected them to follow a linear relationship such as for an ideal gas

\[ P = \frac{n R T}{V} \]

we would write "P ~ T" when using the statsmodels formula API. This is assuming that our dataframe has columns named P and T

As a slightly more complicatd eexample lone could also fit something like a polynomial using "y ~ np.power(x, 2) + x" if you had imported NumPy (import numpy as n).

To use the formula API, we will use smf (imported in first cell). We will use ordinary least squares (ols) for our fit, though a number of other options are offered.

regression = smf.ols("absorbance ~ concentration", data=df).fit()

We can see a summary of the fit including the R-squared by using the .summary() method.

regression.summary()

/usr/share/miniconda/envs/docs/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning:

omni_normtest is not valid with less than 8 observations; 6 samples were given.

OLS Regression Results
Dep. Variable:	absorbance	R-squared:	0.995
Model:	OLS	Adj. R-squared:	0.993
Method:	Least Squares	F-statistic:	749.4
Date:	Tue, 12 Mar 2024	Prob (F-statistic):	1.06e-05
Time:	15:14:21	Log-Likelihood:	14.638
No. Observations:	6	AIC:	-25.28
Df Residuals:	4	BIC:	-25.69
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.1279	0.024	5.316	0.006	0.061	0.195
concentration	0.8454	0.031	27.374	0.000	0.760	0.931

Omnibus:	nan	Durbin-Watson:	2.852
Prob(Omnibus):	nan	Jarque-Bera (JB):	0.692
Skew:	0.659	Prob(JB):	0.708
Kurtosis:	1.984	Cond. No.	4.48

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

If we would like to force the model to not have an intercept, we use a special statsmodel syntax. Adding -1 to our formula forces the intercept to be 0.

# force the intercept to be 0
regression = smf.ols("absorbance ~ concentration - 1", data=df).fit()
regression.summary()

/usr/share/miniconda/envs/docs/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning:

omni_normtest is not valid with less than 8 observations; 6 samples were given.

OLS Regression Results
Dep. Variable:	absorbance	R-squared (uncentered):	0.994
Model:	OLS	Adj. R-squared (uncentered):	0.993
Method:	Least Squares	F-statistic:	833.3
Date:	Tue, 12 Mar 2024	Prob (F-statistic):	9.35e-07
Time:	15:14:21	Log-Likelihood:	8.3757
No. Observations:	6	AIC:	-14.75
Df Residuals:	5	BIC:	-14.96
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
concentration	0.9930	0.034	28.866	0.000	0.905	1.081

Omnibus:	nan	Durbin-Watson:	0.590
Prob(Omnibus):	nan	Jarque-Bera (JB):	0.344
Skew:	-0.475	Prob(JB):	0.842
Kurtosis:	2.311	Cond. No.	1.00

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model parameters are in .params of the fit variable.

regression.params

concentration    0.992967
dtype: float64

To get the slope, we will get the coefficient in front of the concentration variable.

slope = regression.params["concentration"]
print(slope)

0.9929670329670331

We can see what our model predicts for our input concentration values by using the predict method. In the cell below, we save the results in a new column in our dataframe.

df["predicted"] = regression.predict()

df.head()

	concentration	absorbance	predicted
0	0.2	0.285	0.198593
1	0.4	0.485	0.397187
2	0.6	0.621	0.595780
3	0.8	0.799	0.794374
4	1.0	1.010	0.992967

fig = px.scatter(x= df["concentration"], y=df["absorbance"])
fig.add_scatter(x=df["concentration"], y=df["predicted"], mode="lines", name="model")
fig.show()

Fitting unknowns#

Now that we have our model and slope, we can use it to calculate the protein concentrations for a set of unknowns.

Exercise

Use pandas to read in the file data/protein_samples.csv.

Next, use our calculated slope to predict concentration based on measured absorbance.

Exercise - Perform a Linear Fit#

Exercise

Now that you have completed a simple linear regression exercise with protein assay data, here is a problem with a slightly larger dataset, taken from a ground water survey of wells in Texas kindly provided by Houghton-Mifflin. The data for this exercise is in the file data/ground_water.csv. Using the skills you have learned with pandas and statsmodels, get the linear regression statistics for the relationship between pH (dependent variable) and bicarbonate levels (ppm in well water in Texas; independent variable).