Linear Regression

Overview

Questions

  • How can I complete linear regression with statistics in Python?

Objectives:

  • Use a pandas dataframe for data analysis
  • Perform linear regression on the data and obtain best fit statistics

Keypoints:

  • Use pandas to create dataframes from csv formatted data.
  • Use SciPy functions to perform linear regression with statistical output.
  • Use results from linear regression for lab calculations.

Why Linear Regression?

When I was a biochemistry grad student, we almost always manipulated our data into a linear format so that we could do linear regression on our handheld calculators. The most prominent example was the manipulation of enzyme kinetic data for Lineweaver-Burke or Eadie-Hofstee plots, so that we could determine the kinetic parameters (Note: we’ll actually do non-linear curve fitting for enzyme kinetics in a future lesson). I also remember doing semi-log plots of enzyme inactivation because they were linear. Now we have many more options by using python in Jupyter notebooks.

However, some data can still be analyzed by simple linear regression. Perhaps the most common case is the protein assay. Whether you use Lowry, Bradford or BCA methods, it is still most common to use a linear regression fit to the results.

In this module, we will explore linear regression in Jupyter notebooks using Python. Please keep in mind - this is just a beginning. If you take a course in data science, you are likely to encounter a much deeper look at linear regression where you explore the relationships among many variables in a dataset.

We’re going to build on the previous lesson, using the pandas library to import the data for this lesson. Then, we will perform the linear regression using the SciPy library.

Libraries you will need

In previous lessons, we have used os, numpy, and pandas. In this lesson, we will add the SciPy library, a collection of numerical tools for solving mathematical, scientific, technical and engineering problems (from the Guru99 Python SciPy Tutorial). Specific subsets of SciPy are useful for linear algebra (scipy.linalg), optimization and fit (scipy.optimize), statistics and random numbers (scipy.stats), and numerical integration (scipy.integrate).

We will use the dot notation we introduced earlier to access the functions in these libraries. I have expanded our table of libraries to help you keep track of our tools. Please note that the abbreviations listed are just the most common. You can actually define any abbreviation you like, but it is best to use the conventional abbreviations. This will help future coders (including yourself six months from now) as they work with the code.

Library

Uses

Abbreviation

os

file management in operating systems

os

numpy

calculations

np

pandas

data management

pd

scipy

calculations and statistics

sc or sp

Stages of this module

  1. Importing the correct libraries

  2. Importing data with pandas

  3. Running simple linear regression to print out the desired statistics

  4. Calculating protein sample concentrations using the linear regression results.

The first part of the module is modeled after an exercise in Charlie Weiss’s excellent online textbook, Scientific Computing for Chemists, which you can find on his GitHub site, SciCompforChemists.

# Import the libraries you need
import os
import numpy as np
import pandas as pd
from scipy import stats

Importing data with pandas

In this lesson, you will continue to work with os and pandas to import your data. The exercises in this chapter represent a very straightforward case, but if you look at the code, you can see that it could be applied to much more complex situations. When using pandas, people often append _df to the end of a variable name as a reminder that this is a dataframe, so the dataframe that contains our results might be called results_df. The code in the next cell will enable you to create the dataframe for this lesson. Please note that DataFrame is a reserved word when working with pandas dataframes, so you should not use that word as a variable name.

The data in the csv file includes the concentrations of the standards, where 100 \(\mu\)L of standard was added to each test tube.

protein_file = os.path.join('data', 'protein_assay.csv') # gives a path to your csv file
results_df = pd.read_csv(protein_file) # use pandas to read the csv file into a dataframe
results_df  # display your dataframe
Protein Concentration (mg/mL) A595
0 0.2 0.285
1 0.4 0.485
2 0.6 0.621
3 0.8 0.799
4 1.0 1.010
5 1.2 1.118

In a dataframe, you can refer to the data in a column just by the column header. These are strings, so you need to put single or double quotes around them when you use them. In pandas, the terms series is often used to refer to the data in a single column in a dataframe. So we are going to use the two column headers to define the data for our linear regression.

xdata = results_df['Protein Concentration (mg/mL)']   # setting the x values  
ydata = results_df['A595']  # setting the y values    
print(xdata,ydata) # checking to make sure everything is in place
0    0.2
1    0.4
2    0.6
3    0.8
4    1.0
5    1.2
Name: Protein Concentration (mg/mL), dtype: float64 0    0.285
1    0.485
2    0.621
3    0.799
4    1.010
5    1.118
Name: A595, dtype: float64

Linear Regression with SciPy

Now that the data are in place, we simply need to import the stats module from SciPy to perform the linear regression.

Notice that the linregress function in scipy.stats accepts the two data series and actually generates five values: slope, intercept, r-value, p-value, and standard error. In the next cell all of those variables are assigned in a single command. Then the results are presented with a series of print statements.

slope, intercept, r_value, p_value, std_err = stats.linregress(xdata, ydata)
print("Slope = ", slope)
print("Intercept = ", intercept)
print("R-squared = ", r_value**2)
print("P value = ", p_value)
print("Standard error = ", std_err)
Slope =  0.8454285714285716
Intercept =  0.12786666666666657
R-squared =  0.994690398528738
P value =  1.0590717448341336e-05
Standard error =  0.030884027089284245

Solving for protein concentrations in samples

In the next lesson, we will look at plotting these data. For now, let’s get some practice using the results of our linear regression to calculate the protein concentrations in typical unknowns. This will be simple math, but it’s good practice with using correct syntax. For this exercise, we are going to determine protein concentration in mg/mL. The data shown for the six protein samples in this table can be found in data/protein_samples.csv. In each case 100 \(\mu\)L of sample was added to the sample tube for analysis.

Sample #

A\(_{595}\)

1

0.183

2

0.682

3

0.759

4

1.340

5

0.935

6

1.013

We begin with the equation for our standard curve

\[ A_{595} = slope * ProtConc + intercept \]

which we can rearrange to solve for protein concentration.

\[ ProtConc = \frac{A_{595} - intercept}{slope} \]

Our final equation will then solve for concentration in mg/mL, based on the units for the slope (/mg/mL) and the unitless intercept.

Now let’s put our unknown sample data in a pandas dataframe, write out the code for the equation and use some neat tricks from pandas to add the concentrations to our table.

Use pandas to import the csv data into a dataframe

This is the exact process we used above. We are just repeating it to gain practice. We start by using the os library to point to the csv file. Then we use pandas to read the csv file and write it to the dataframe.

samples_file = os.path.join('data', 'protein_samples.csv') 
samples_df = pd.read_csv(samples_file) 
samples_df  # display your dataframe
Sample # A-595
0 1 0.183
1 2 0.682
2 3 0.759
3 4 1.340
4 5 0.935
5 6 1.013

Create the Equation

The next step is to implement the equation we described above in python. You can use the column headers from the pandas dataframe to apply the equation to every row (index) of the dataframe.

protein_conc = ((samples_df['A-595'] - intercept) / slope)
print(protein_conc)
0    0.065213
1    0.655447
2    0.746525
3    1.433750
4    0.954703
5    1.046964
Name: A-595, dtype: float64

Add the column to the dataframe

Adding a column to a pandas dataframe is simple - you just use the name of the dataframe followed by the name of the new column in single quotes within square brackets. In this case, we will create the new column and assign the calculated concentration values to that column.

samples_df['ProtConc'] = protein_conc
samples_df
Sample # A-595 ProtConc
0 1 0.183 0.065213
1 2 0.682 0.655447
2 3 0.759 0.746525
3 4 1.340 1.433750
4 5 0.935 0.954703
5 6 1.013 1.046964

Eliminating values outside the calibration curve (optional)

Notice that the calculations cover all absorbance values - python went through each row of the dataframe.

There is one more issue we can address - the absorbance value for two of the unknown samples were outside the absorbance values of the standards. Therefore, these concentrations are not experimentally valid and should be reported as being “Out of Range”.

We can make this change in the dataframe by using loc to look sequentially at the rows in the dataframe. The syntax here tells python to work with the samples_df dataframe, look at one row at a time (loc), perform a conditional analysis of the A\(_{595}\) value on that row, and change the value for ProtConc to NaN if the A\(_{595}\) < 0.285 (the value for the lowest standard) or A\(_{595}\) > 1.118 (the value for the highest standard).

np.nan is a NumPy function that results in a value of NaN (for not a number) as the output. This is typically used when there is no value in a cell, but it still preserves the datatype in that cell as a float. This is preferred to using a string such as ‘Out of Range’ which would change the datatype for those cells in the column to a string.

samples_df.loc[samples_df['A-595'] < 0.285, 'ProtConc'] = np.nan
samples_df.loc[samples_df['A-595'] > 1.118, 'ProtConc'] = np.nan
samples_df
Sample # A-595 ProtConc
0 1 0.183 NaN
1 2 0.682 0.655447
2 3 0.759 0.746525
3 4 1.340 NaN
4 5 0.935 0.954703
5 6 1.013 1.046964

Optional: Rather than hard-coding the minimum and maximum values for the calibration curve, we could also do this programmatically with the .min() and .max() functions.

samples_df.loc[samples_df['A-595'] < results_df['A-595'].min(), 'ProtConc'] = np.nan
samples_df.loc[samples_df['A-595'] > results_df['A-595'].max(), 'ProtConc'] = np.nan

Exercise

Now that you have completed a simple linear regression exercise with protein assay data, here is a problem with a slightly larger dataset, taken from a ground water survey of wells in Texas, kindly provided by Houghton-Mifflin. Using the skills you have learned with pandas and SciPy, get the linear regression statistics for the relationship between pH (dependent variable) and bicarbonate levels (ppm in well water in Texas; independent variable). The data are available in the file, Ground_water.csv in the data folder. Once complete, your output should look like this:

Slope = -0.0030521595419827677
Intercept = 8.097595134597833
R-squared = 0.1152673937227531
P value = 0.04948248037131796
Standard error = 0.0014948066523110296

The next cell is just to show that the solution to the exercise works.