Linear Fitting Absorbance Data#
Overview
Questions:
How can I use Python to fit a linear model?
Objectives:
Fit a linear model using statsmodels
A common exercise in introductory chemistry is fitting a linear model to absorbance measurements at a particular wavelength vs. concentration of an analyte using spectrophotometry. In these cases, the absorbance of a sample is described using Beer-Lambert Law
where \(A\) is the absorbance, \(\varepsilon\) is the absorptivity constant, \(l\) is path length, and \(c\) is the concentration of the sample.
In this notebook, we will read in absorbance data, fit a linear model using a Python library called statsmodels, and calculate the concentrations of unknowns using our model.
We will utilize skills learned in our previous lessons, including reading, accessing, and analyzing data using pandas
and visualizing data using plotly
.
We will add a new skill - fitting a linear model with a Python library called statsmodels
.
Importing Libraries and Reading Data#
Note that when fitting a model in Python, there are a number of options for the library you might pick including NumPy, SciPy, SciKit-Learn, and statsmodels.
The library that you pick might be based on personal preference or features that a particular library offers.
In this notebook, we show fitting with statsmodels
because of the ease of seeing fit statics and defining a formula for fitting.
For statsmodels
, we will use the formula API.
This will allow us to define the relationship we’d like to fit using a formula as a string.
To start this notebook, we will import the libraries we need. We will import the following:
Library Name |
Purpose |
---|---|
reading and processing data |
|
interactive plots and visualization |
|
data fitting |
The cell below imports the libraries we will use for our analysis.
# imports
import pandas as pd # for reading data from a file and putting in a table
import plotly.express as px # for plotting
import statsmodels.formula.api as smf # for fitting
For the example, we will use data stored in the file data/protein_assay.csv
.
This data represents absorbance data recorded in a Bradford Assay for determining protein concentration.
After we have imported our libraries, we will next use pandas
to read in our data.
Our data is stored in a comma separated value (CSV) file, though pandas can also read from Excel files.
# Read in file
df = pd.read_csv("data/protein_assay.csv")
# View the first five rows.
df.head()
concentration | absorbance | |
---|---|---|
0 | 0.2 | 0.285 |
1 | 0.4 | 0.485 |
2 | 0.6 | 0.621 |
3 | 0.8 | 0.799 |
4 | 1.0 | 1.010 |
Visualizing Data#
After reading in our data, we might wish to inspect it visually.
We can do this using plotly express (imported as px
).
# Create a figure using px.scatter.
fig = px.scatter(df, x='concentration', y="absorbance")
# Show the figure.
fig.show()
Note that you can change the axis labels using the following syntax.
# Create a figure using px.scatter.
fig = px.scatter(df, x='concentration', y="absorbance",
labels={"concentration":"Concentration (mg/mL)", "absorbance":"Absorbance at 595 nm"})
# Show the figure.
fig.show()
Fitting a Linear Model#
To fit a linear equation to our data, we can use a library called statsmodels.
statsmodels
is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
While there are numerous options for fitting data in Python, statsmodels
is particularly beneficial when dealing with linear models.
Note that this encompasses more than just “linear equations”.
This library is especially handy when you need straightforward access to fit parameters and in-depth statistical metrics related to the fit.
In particular, we are using a part of statsmodels
called the formula API.
The formula API lets you define a formula as a string for fitting and is specifically designed to work with dataframes.
When defining a formula, you use the column names in a string to define the relationship.
Note that the column names should not have spaces, or entering the relationship is a bit more complicated.
For example, if we had data representing pressure and temperature at constant volume and expected them to follow a linear relationship such as for an ideal gas
we would write "P ~ T"
when using the statsmodels
formula API.
This is assuming that our dataframe has columns named P
and T
As a slightly more complicatd eexample lone could also fit something like a polynomial using "y ~ np.power(x, 2) + x"
if you had imported NumPy (import numpy as n
).
To use the formula API, we will use smf
(imported in first cell).
We will use ordinary least squares (ols
) for our fit, though a number of other options are offered.
regression = smf.ols("absorbance ~ concentration", data=df).fit()
We can see a summary of the fit including the R-squared
by using the .summary()
method.
regression.summary()
/usr/share/miniconda/envs/docs/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning:
omni_normtest is not valid with less than 8 observations; 6 samples were given.
Dep. Variable: | absorbance | R-squared: | 0.995 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.993 |
Method: | Least Squares | F-statistic: | 749.4 |
Date: | Tue, 12 Mar 2024 | Prob (F-statistic): | 1.06e-05 |
Time: | 15:14:21 | Log-Likelihood: | 14.638 |
No. Observations: | 6 | AIC: | -25.28 |
Df Residuals: | 4 | BIC: | -25.69 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 0.1279 | 0.024 | 5.316 | 0.006 | 0.061 | 0.195 |
concentration | 0.8454 | 0.031 | 27.374 | 0.000 | 0.760 | 0.931 |
Omnibus: | nan | Durbin-Watson: | 2.852 |
---|---|---|---|
Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.692 |
Skew: | 0.659 | Prob(JB): | 0.708 |
Kurtosis: | 1.984 | Cond. No. | 4.48 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
If we would like to force the model to not have an intercept, we use a special statsmodel
syntax.
Adding -1
to our formula forces the intercept to be 0.
# force the intercept to be 0
regression = smf.ols("absorbance ~ concentration - 1", data=df).fit()
regression.summary()
/usr/share/miniconda/envs/docs/lib/python3.11/site-packages/statsmodels/stats/stattools.py:74: ValueWarning:
omni_normtest is not valid with less than 8 observations; 6 samples were given.
Dep. Variable: | absorbance | R-squared (uncentered): | 0.994 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.993 |
Method: | Least Squares | F-statistic: | 833.3 |
Date: | Tue, 12 Mar 2024 | Prob (F-statistic): | 9.35e-07 |
Time: | 15:14:21 | Log-Likelihood: | 8.3757 |
No. Observations: | 6 | AIC: | -14.75 |
Df Residuals: | 5 | BIC: | -14.96 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
concentration | 0.9930 | 0.034 | 28.866 | 0.000 | 0.905 | 1.081 |
Omnibus: | nan | Durbin-Watson: | 0.590 |
---|---|---|---|
Prob(Omnibus): | nan | Jarque-Bera (JB): | 0.344 |
Skew: | -0.475 | Prob(JB): | 0.842 |
Kurtosis: | 2.311 | Cond. No. | 1.00 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The model parameters are in .params
of the fit variable.
regression.params
concentration 0.992967
dtype: float64
To get the slope, we will get the coefficient in front of the concentration variable.
slope = regression.params["concentration"]
print(slope)
0.9929670329670331
We can see what our model predicts for our input concentration values by using the predict
method.
In the cell below, we save the results in a new column in our dataframe.
df["predicted"] = regression.predict()
df.head()
concentration | absorbance | predicted | |
---|---|---|---|
0 | 0.2 | 0.285 | 0.198593 |
1 | 0.4 | 0.485 | 0.397187 |
2 | 0.6 | 0.621 | 0.595780 |
3 | 0.8 | 0.799 | 0.794374 |
4 | 1.0 | 1.010 | 0.992967 |
fig = px.scatter(x= df["concentration"], y=df["absorbance"])
fig.add_scatter(x=df["concentration"], y=df["predicted"], mode="lines", name="model")
fig.show()
Fitting unknowns#
Now that we have our model and slope, we can use it to calculate the protein concentrations for a set of unknowns.
Exercise
Use pandas to read in the file data/protein_samples.csv
.
Next, use our calculated slope to predict concentration based on measured absorbance.
Exercise - Perform a Linear Fit#
Exercise
Now that you have completed a simple linear regression exercise with protein assay data, here is a problem with a slightly larger dataset, taken from a ground water survey of wells in Texas kindly provided by Houghton-Mifflin.
The data for this exercise is in the file data/ground_water.csv
.
Using the skills you have learned with pandas and statsmodels
, get the linear regression statistics for the relationship between pH (dependent variable) and bicarbonate levels (ppm in well water in Texas; independent variable).