Data Cleaning with Pandas
Overview
Questions:
What does “clean data” mean?
How can I drop unnecessary data from my dataframe?
How can I change column or row names in a dataframe?
How can I cast columns to the correct data type?
Objectives:
Use pandas to drop unnecessary data from our dataframe.
Learn how to rename pandas columns.
Use pandas string methods to replace characters.
Learn how to cast columns to the correct data type.
In this section, we’ll read in the data we extracted in the last lesson. You may have noticed in the last session that the data in these dataframes didn’t look great. There were columns that appeared to have no values. Once we start working with the data, we are going to see some additional problems.
import pandas as pd
table1 = pd.read_csv("data/potts_table1.csv")
table2 = pd.read_csv("data/potts_table2.csv")
table1.head()
Unnamed: 0 | Compound | log P | Unnamed: 1 | II | H, | H,.1 | MV | R, | log Koa | log Kyex | Unnamed: 2 | log Kpep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | water | — 6.85 | NaN | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN | NaN |
1 | ' | methanol | — 6.68 | NaN | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | NaN | — 2.80 |
2 | NaN | methanoic acid | — 7.08 | NaN | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | NaN | — 3.63 |
3 | NaN | ethanol | — 6.66 | NaN | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | NaN | —2.10 |
4 | NaN | ethanoic acid | —7.01 | NaN | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | NaN | —2.90 |
Dropping unnecessary data
In some cases, we might have data in our dataframe that we don’t need. We will want to discard, or “drop”, this data from the dataframe. For the dataframe we just loaded, for example, we can see that columns 0, 3, and 11 appear to have no values.
Check Your Understanding
What pandas method can you use to see how many non-null values you have in each column?
Solution
table1.info()
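Besides `info`, there are a couple of other ways to count missing values; here is a minimal sketch on a made-up dataframe (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up dataframe with some missing entries
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, np.nan]})

print(df.count())       # non-null values per column
print(df.isna().sum())  # NaN values per column
```

`count` gives the non-null tally directly, while `isna().sum()` counts the missing cells.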
There are two methods you might use to drop data from a dataframe: drop and dropna. drop is used when you have specific rows or columns you want to remove from the dataframe, while dropna is used when you want to drop columns or rows which contain NaN (“not a number”) values. These occur when a data cell is empty.
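As a minimal sketch of the difference between the two, on a made-up dataframe:

```python
import numpy as np
import pandas as pd

# Made-up dataframe: column "B" is entirely NaN
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [np.nan] * 3})

dropped = df.drop(columns=["B"])        # drop: remove a column by name
cleaned = df.dropna(axis=1, how="all")  # dropna: remove all-NaN columns

print(dropped.columns.tolist())
print(cleaned.columns.tolist())
```

Both calls return a new dataframe containing only column "A"; the original is unchanged.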
In the output of info above, we can see that there are two columns which contain 0 non-null values. This means that all of the values in these columns are NaN, so we can safely discard them. We’ll use the dropna method to get rid of them.
help(table1.dropna)
Help on method dropna in module pandas.core.frame:
dropna(*, axis: 'Axis' = 0, how: 'str | NoDefault' = <no_default>, thresh: 'int | NoDefault' = <no_default>, subset: 'IndexLabel' = None, inplace: 'bool' = False) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
Remove missing values.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.
.. versionchanged:: 1.0.0
Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values. Cannot be combined with how.
subset : column label or sequence of labels, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
Whether to modify the DataFrame rather than creating a new one.
Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.
See Also
--------
DataFrame.isna: Indicate missing values.
DataFrame.notna : Indicate existing (non-missing) values.
DataFrame.fillna : Replace missing values.
Series.dropna : Drop missing values.
Index.dropna : Drop missing indices.
Examples
--------
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy'])
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
Before saving the dataframe, we’ll look at and discuss the output from this function.
By default, dropna works on axis 0, the rows of the dataframe, and drops any row which contains a NaN. You will see that this results in a dataframe with no data.
table1.dropna()
Unnamed: 0 | Compound | log P | Unnamed: 1 | II | H, | H,.1 | MV | R, | log Koa | log Kyex | Unnamed: 2 | log Kpep |
---|
Notice that dropna returns a new dataframe and does not overwrite the original.
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1 non-null object
1 Compound 37 non-null object
2 log P 37 non-null object
3 Unnamed: 1 0 non-null float64
4 II 37 non-null float64
5 H, 37 non-null float64
6 H,.1 37 non-null float64
7 MV 37 non-null float64
8 R, 37 non-null float64
9 log Koa 37 non-null object
10 log Kyex 31 non-null object
11 Unnamed: 2 0 non-null float64
12 log Kpep 25 non-null object
dtypes: float64(7), object(6)
memory usage: 3.9+ KB
We can switch to dropping columns which have NaN values by adding the argument axis=1.
table1.dropna(axis=1).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 37 non-null object
2 II 37 non-null float64
3 H, 37 non-null float64
4 H,.1 37 non-null float64
5 MV 37 non-null float64
6 R, 37 non-null float64
7 log Koa 37 non-null object
dtypes: float64(5), object(3)
memory usage: 2.4+ KB
This is closer to what we want. However, you’ll notice that this has dropped some columns which do have data. By default, pandas drops a column which contains any NaN values. This may not be what we want, because some values may simply be missing rather than incorrect.
We can add the argument how="all" to drop only columns whose values are all NaN; the default is how="any". Once we are sure we would like to keep the result as our dataframe, we can add inplace=True to the function call to overwrite the dataframe.
table1.dropna(axis=1, how="all")
Unnamed: 0 | Compound | log P | II | H, | H,.1 | MV | R, | log Koa | log Kyex | log Kpep | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | water | — 6.85 | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN |
1 | ' | methanol | — 6.68 | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | — 2.80 |
2 | NaN | methanoic acid | — 7.08 | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | — 3.63 |
3 | NaN | ethanol | — 6.66 | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | —2.10 |
4 | NaN | ethanoic acid | —7.01 | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | —2.90 |
5 | NaN | n-propanol | — 6.41 | 0.42 | 0.37 | 0.48 | 42.2 | 0.24 | 0.34 | — 1.48 | —1.52 |
6 | NaN | n-propanoic acid | —7.01 | 0.65 | 0.60 | 0.45 | 43.6 | 0.23 | 0.26 | — 2.64 | —2.14 |
7 | NaN | butane-2-one | — 5.90 | 0.70 | 0.00 | 0.51 | 49.4 | 0.17 | 0.28 | — 0.25 | —Q.25 |
8 | NaN | benzene | —4,.51 | 0.52 | 0.00 | 0.14 | 50.0 | 0.61 | 2.00 | 2.29 | NaN |
9 | NaN | diethyl ether | —5.35 | 0.25 | 0.00 | 0.45 | 52.0 | 0.04 | 0.83 | 0.66 | NaN |
10 | NaN | n-butanol | — 6.16 | 0.42 | 0.37 | 0.48 | 52.4 | 0.22 | 0.88 | — 1.08 | —0.70 |
11 | NaN | n-butanoic acid | — 6.36 | 0.62 | 0.60 | 0.45 | 53.9 | 0.21 | 0.79 | — 1.92 | — 0.96 |
12 | NaN | phenol | — 5.64 | 0.89 | 0.52 | 0.30 | 53.9 | 0.81 | 1.46 | — 0.70 | —0.82 |
13 | NaN | toluene | — 3.56 | 0.52 | 0.00 | 0.14 | 60.0 | 0.60 | 2.70 | 2.89 | NaN |
14 | NaN | styrene | — 3.75 | 0.65 | 0.00 | 0.16 | 60.2 | 0.85 | 2.95 | NaN | NaN |
15 | NaN | n-pentanol | —5.78 | 0.42 | 0.37 | 0.48 | 62.6 | 0.22 | 1.40 | — 0.39 | NaN |
16 | NaN | benzyl-OH | — 5.78 | 0.87 | 0.33 | 0.56 | 64.0 | 0.80 | 1.10 | — 0.62 | NaN |
17 | NaN | n-pentanoic acid | — 6.01 | 0.60 | 0.60 | 0.45 | 64.1 | 0.21 | 1.33 | — 1.31 | 0.44 |
18 | NaN | 2-chlorophenol | —5.04 | 0.88 | 0.32 | 0.31 | 66.0 | 0.85 | 2.15 | NaN | NaN |
19 | NaN | 4-chlorophenol | — 5.00 | 1.08 | 0.67 | 0.20 | 66.0 | 0.92 | 2.39 | —1.31 | —0.12 |
20 | NaN | m-cresol | — 5.38 | 0.88 | 0.57 | 0.34 | 67.6 | 0.82 | 1.96 | NaN | NaN |
21 | NaN | o-cresol | ~ 5,36 | 0.86 | 0.52 | 0.30 | 67.6 | 0.84 | 1.95 | NaN | NaN |
22 | NaN | p-cresol | — 5.29 | 0.87 | 0.57 | 0.31 | 67.6 | 0.82 | 1.96 | NaN | NaN |
23 | NaN | 4-bromophenol | — 5.00 | 1.17 | 0.67 | 0.20 | 70.0 | 1.08 | 2.59 | —0.11 | —0.20 |
24 | NaN | 4-nitrophenol | ~—5.81 | 1.72 | 0.82 | 0.26 | 71.0 | 1.07 | 1.96 | —2.0 | —2.15 |
25 | NaN | 3-nitrophenol | —5.81 | 1.57 | 0.79 | 0.23 | 71.0 | 1.05 | 2.00 | —1.4 | — 1.23 |
26 | NaN | 2-nitrophenol | — 4,56 | 1.05 | 0.05 | 0.37 | 71.0 | 1.02 | 1.80 | —1.4 | 1.04 |
27 | NaN | ethylbenzene | — 3.48 | 0.51 | 0.00 | 0.15 | 72.3 | 0.60 | 3.15 | NaN | NaN |
28 | NaN | n-hexanol | —5.45 | 0.42 | 0.37 | 0.48 | 72.9 | 0.21 | 2.03 | 0.11 | 0.45 |
29 | NaN | n-hexanoic acid | —5.44 | 0.60 | 0.60 | 0.45 | 74.3 | 0.17 | 1.89 | — 0.85 | 0.24 |
30 | NaN | 8-naphthol | —5.11 | 1.08 | 0.61 | 0.40 | 79.5 | 1.52 | 2.84 | 1.77 | 0.30 |
31 | NaN | n-heptanol | — 5.05 | 0.42 | 0.37 | 0.48 | 83.1 | 0.21 | 2.49 | 0.77 | 1.01 |
32 | NaN | n-heptanoic acid | — 5.28 | 0.60 | 0.60 | 0.45 | 84.6 | 0.15 | 2.33 | — 0.29 | 1.16 |
33 | NaN | n-octanol | — 4.84 | 0.42 | 0.37 | 0.48 | 93.3 | 0.20 | 3.15 | 1.62 | 1.65 |
34 | NaN | n-octanoic acid | —5.21 | 0.60 | 0.60 | 0.45 | 94.8 | 0.15 | 2.83 | 0.41 | 1.95 |
35 | NaN | n-nonanol | —4.77 | 0.42 | 0.37 | 0.48 | 104.0 | 0.19 | 3.68 | 1.97 | 2.28 |
36 | NaN | n-decanol | — 4.66 | 0.42 | 0.37 | 0.48 | 114.0 | 0.19 | 4,24 | 2.56 | 2.91 |
The output above looks like something to keep, so we will add inplace=True to overwrite the original dataframe.
table1.dropna(axis=1, how="all", inplace=True)
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1 non-null object
1 Compound 37 non-null object
2 log P 37 non-null object
3 II 37 non-null float64
4 H, 37 non-null float64
5 H,.1 37 non-null float64
6 MV 37 non-null float64
7 R, 37 non-null float64
8 log Koa 37 non-null object
9 log Kyex 31 non-null object
10 log Kpep 25 non-null object
dtypes: float64(5), object(6)
memory usage: 3.3+ KB
We can drop the final unneeded column using the drop method. Use this when you have specific rows or columns you would like to discard. Again we use axis=1 to drop a column, and we pass the column name.
table1.drop(axis=1, columns=["Unnamed: 0"], inplace=True)
Changing column names
Our column names are still incorrect. You will likely want to change them to make the table more legible.
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 37 non-null object
2 II 37 non-null float64
3 H, 37 non-null float64
4 H,.1 37 non-null float64
5 MV 37 non-null float64
6 R, 37 non-null float64
7 log Koa 37 non-null object
8 log Kyex 31 non-null object
9 log Kpep 25 non-null object
dtypes: float64(5), object(5)
memory usage: 3.0+ KB
We might now want to clean up the column names and make sure they are descriptive. You can see the current names using table1.columns. You can either rename the columns by setting table1.columns to a list of the appropriate length, or you can use the table1.rename method. With .rename, you pass the argument columns set equal to a dictionary (curly brackets) using the syntax
"current_column_name": "new_column_name"
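A minimal sketch of the two approaches, on a made-up two-column dataframe:

```python
import pandas as pd

# Made-up dataframe with placeholder column names
df = pd.DataFrame({"old_a": [1], "old_b": [2]})

# Option 1: assign a full list (length must match the number of columns)
df.columns = ["a", "b"]

# Option 2: rename only selected columns with a dictionary
df = df.rename(columns={"a": "alpha"})

print(df.columns.tolist())
```

The list assignment replaces every name at once, while rename lets you target just the columns that need fixing.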
table1.columns
Index(['Compound', 'log P', 'II', 'H,', 'H,.1', 'MV', 'R,', 'log Koa',
'log Kyex', 'log Kpep'],
dtype='object')
table1.rename(inplace=True, columns={
"II": "pi",
"H,": "Hd",
"H,.1": "Ha",
"R,": "R_2",
"log Koa": "log K_oct",
"log Kyex": "log K_hex",
"log Kpep": "log K_hep"
})
table1.head()
Compound | log P | pi | Hd | Ha | MV | R_2 | log K_oct | log K_hex | log K_hep | |
---|---|---|---|---|---|---|---|---|---|---|
0 | water | — 6.85 | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN |
1 | methanol | — 6.68 | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | — 2.80 |
2 | methanoic acid | — 7.08 | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | — 3.63 |
3 | ethanol | — 6.66 | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | —2.10 |
4 | ethanoic acid | —7.01 | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | —2.90 |
Fixing Data Types
When examining the output of .info, you’ll notice that many of the columns which should be numbers are still ‘objects’, i.e. strings. We would like log P, for example, to be numeric. Typically, if a column looks like it should be numeric but pandas does not automatically cast it as such, it is because there are non-numeric characters in the column which pandas could not decide what to do with. We will need to examine these, decide what to do with them, then cast the column as numeric.
This part of the workflow, data cleaning, often requires us to examine data on a case-by-case basis to decide what to do.
There are a few ways to make pandas columns numeric, but we’ll use the pandas function to_numeric.
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 37 non-null object
2 pi 37 non-null float64
3 Hd 37 non-null float64
4 Ha 37 non-null float64
5 MV 37 non-null float64
6 R_2 37 non-null float64
7 log K_oct 37 non-null object
8 log K_hex 31 non-null object
9 log K_hep 25 non-null object
dtypes: float64(5), object(5)
memory usage: 3.0+ KB
Using the to_numeric function without any additional arguments will fail on this data set.
pd.to_numeric(table1["log P"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "— 6.85"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[21], line 1
----> 1 pd.to_numeric(table1["log P"])
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "— 6.85" at position 0
Scrolling to the bottom of this message and reading the error, you will see it is having a problem parsing the value "— 6.85". It may not be obvious at first what the problem is.
When we run into a problem like this, we have a few options. Pandas lets you choose what it should do when it is unable to cast a value. By default it will fail, which is what we see above. You could instead set errors to be ignored (the column would be returned unchanged, with no error raised) or to “coerce” the values. Choosing “coerce” means that anything which can’t be cast as numeric is set to NaN.
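A small sketch of the errors argument on made-up values (the em-dash entry mimics the problem in our table):

```python
import pandas as pd

s = pd.Series(["1.5", "— 6.85", "3.0"])  # the em-dash entry cannot be parsed

# errors="coerce": unparseable values become NaN, the rest convert normally
converted = pd.to_numeric(s, errors="coerce")
print(converted)

# errors="raise" (the default) would instead raise a ValueError here
```

Note that on well-formed values, coerce leaves the converted numbers intact and only the bad entry becomes NaN.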
Let’s see what happens when we set errors to coerce.
pd.to_numeric(table1["log P"], errors="coerce")
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
32 NaN
33 NaN
34 NaN
35 NaN
36 NaN
Name: log P, dtype: float64
This unfortunately results in no values being recognized as numeric.
We have to do a little more processing for this to work. If you examine the column, you may notice that the negative sign is a little off: it is an em dash (—) when it should be a hyphen-minus (-). This is very slight and might be hard to see, but it is important to change for this data set.
We will want to replace every — with -. We can accomplish this using the string method replace. Strings in Python have a number of methods; the replace method allows us to replace a substring within a string.
test_string = "Hello world."
test_string.replace(".", "!")
'Hello world!'
The split command is another string method you are probably familiar with:
test_string.split()
['Hello', 'world.']
Pandas string methods
If we want to use these on a column in a pandas dataframe, you might think to use apply, which we learned about in the last session. However, pandas dataframes have text processing built in as methods.
When a pandas column contains strings, we can use string methods on the whole column by adding .str before the method name. For example, to replace the minus signs:
table1["log P"].str.replace("—", "-")
0 - 6.85
1 - 6.68
2 - 7.08
3 - 6.66
4 -7.01
5 - 6.41
6 -7.01
7 - 5.90
8 -4,.51
9 -5.35
10 - 6.16
11 - 6.36
12 - 5.64
13 - 3.56
14 - 3.75
15 -5.78
16 - 5.78
17 - 6.01
18 -5.04
19 - 5.00
20 - 5.38
21 ~ 5,36
22 - 5.29
23 - 5.00
24 ~-5.81
25 -5.81
26 - 4,56
27 - 3.48
28 -5.45
29 -5.44
30 -5.11
31 - 5.05
32 - 5.28
33 - 4.84
34 -5.21
35 -4.77
36 - 4.66
Name: log P, dtype: object
table1["log P"] = table1["log P"].str.replace("—", "-")
# We still need to get rid of spaces
table1["log P"] = table1["log P"].str.replace(" ", "")
table1["log P"] = pd.to_numeric(table1["log P"], errors="coerce")
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 33 non-null float64
2 pi 37 non-null float64
3 Hd 37 non-null float64
4 Ha 37 non-null float64
5 MV 37 non-null float64
6 R_2 37 non-null float64
7 log K_oct 37 non-null object
8 log K_hex 31 non-null object
9 log K_hep 25 non-null object
dtypes: float64(6), object(4)
memory usage: 3.0+ KB
We actually need to change this character in all of our columns. However, str methods only work on a pandas Series. If we want to replace a substring across the whole DataFrame, we use the .replace method. In order for it to recognize substrings, we set the option regex=True. We will discuss regex more in the next session, but this is all you need to know about it for the moment.
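A minimal sketch of the Series-versus-DataFrame difference, on a made-up frame containing the problem character:

```python
import pandas as pd

# Made-up dataframe with the em-dash in both columns
df = pd.DataFrame({"a": ["—1.0", "—2.0"], "b": ["—3.0", "—4.0"]})

# Series.str.replace handles one column at a time
col = df["a"].str.replace("—", "-")

# DataFrame.replace(..., regex=True) substitutes the substring in every column
fixed = df.replace("—", "-", regex=True)

print(col.tolist())
print(fixed["b"].tolist())
```

Without regex=True, DataFrame.replace only matches whole cell values, which is why the option is needed here.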
table1.replace("—", "-", regex=True, inplace=True)
table1.replace(" ", "", regex=True, inplace=True)
Changing the data type of multiple columns
To change the data type of multiple columns, we will want to use the pd.to_numeric function on all of those columns. There are several ways you might do this. For example, you might call the function once for each column.
We can also accomplish this with the apply method, which we learned about in the last session. Use apply whenever you want to apply a function to each row or column of a dataframe. In this case, we want to apply pd.to_numeric to every value.
We add the argument axis=1, which applies the function to each row of the dataframe (each row is passed to pd.to_numeric as a Series).
table1.apply(pd.to_numeric, axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "water"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[31], line 1
----> 1 table1.apply(pd.to_numeric, axis=1)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/frame.py:9565, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
9554 from pandas.core.apply import frame_apply
9556 op = frame_apply(
9557 self,
9558 func=func,
(...)
9563 kwargs=kwargs,
9564 )
-> 9565 return op.apply().__finalize__(self, method="apply")
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:746, in FrameApply.apply(self)
743 elif self.raw:
744 return self.apply_raw()
--> 746 return self.apply_standard()
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:873, in FrameApply.apply_standard(self)
872 def apply_standard(self):
--> 873 results, res_index = self.apply_series_generator()
875 # wrap results
876 return self.wrap_results(results, res_index)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:889, in FrameApply.apply_series_generator(self)
886 with option_context("mode.chained_assignment", None):
887 for i, v in enumerate(series_gen):
888 # ignore SettingWithCopy here in case the user mutates
--> 889 results[i] = self.f(v)
890 if isinstance(results[i], ABCSeries):
891 # If we have a view on v, we need to make a copy because
892 # series_generator will swap out the underlying data
893 results[i] = results[i].copy(deep=False)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "water" at position 0
When we try this code, we immediately see an error: we should not try to convert the first column, which holds compound names, to numbers. We can use the iloc indexer to exclude the first column:
table1.iloc[:, 1:].apply(pd.to_numeric, axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "-4,38"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[32], line 1
----> 1 table1.iloc[:, 1:].apply(pd.to_numeric, axis=1)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/frame.py:9565, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
9554 from pandas.core.apply import frame_apply
9556 op = frame_apply(
9557 self,
9558 func=func,
(...)
9563 kwargs=kwargs,
9564 )
-> 9565 return op.apply().__finalize__(self, method="apply")
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:746, in FrameApply.apply(self)
743 elif self.raw:
744 return self.apply_raw()
--> 746 return self.apply_standard()
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:873, in FrameApply.apply_standard(self)
872 def apply_standard(self):
--> 873 results, res_index = self.apply_series_generator()
875 # wrap results
876 return self.wrap_results(results, res_index)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:889, in FrameApply.apply_series_generator(self)
886 with option_context("mode.chained_assignment", None):
887 for i, v in enumerate(series_gen):
888 # ignore SettingWithCopy here in case the user mutates
--> 889 results[i] = self.f(v)
890 if isinstance(results[i], ABCSeries):
891 # If we have a view on v, we need to make a copy because
892 # series_generator will swap out the underlying data
893 results[i] = results[i].copy(deep=False)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "-4,38" at position 7
An error again! This time, the failure is because a string was incorrectly read from the PDF and could not be converted to a number. You could choose to handle this differently, but for this workshop we are just going to discard values like these.
When using to_numeric on a single pandas Series, we would pass the option errors="coerce". It may not be immediately obvious how to use this with apply, but fortunately pandas allows us to pass additional arguments through apply:
table1.iloc[:, 1:] = table1.iloc[:, 1:].apply(pd.to_numeric, axis=1, errors="coerce")
/tmp/ipykernel_3771/2243029040.py:1: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
table1.iloc[:, 1:] = table1.iloc[:, 1:].apply(pd.to_numeric, axis=1, errors="coerce")
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 33 non-null float64
2 pi 37 non-null float64
3 Hd 37 non-null float64
4 Ha 37 non-null float64
5 MV 37 non-null float64
6 R_2 37 non-null float64
7 log K_oct 36 non-null float64
8 log K_hex 30 non-null float64
9 log K_hep 24 non-null float64
dtypes: float64(9), object(1)
memory usage: 3.0+ KB
table1.to_csv("data/potts_table1_clean.csv", index=False)
Key Points
Data cleaning prepares data for analysis. It is a separate step in the data analysis workflow.
Data cleaning is often time intensive.
Pandas has special functions for data cleaning and handling missing information.