Data Cleaning with Pandas#

Overview

Questions:

  • What does “clean data” mean?

  • How can I drop unnecessary data from my dataframe?

  • How can I change column or row names in a dataframe?

  • How can I cast columns to the correct data type?

Objectives:

  • Use pandas to drop unnecessary data from our dataframe.

  • Learn how to rename pandas columns.

  • Use pandas string methods to replace characters.

  • Learn how to cast columns to the correct data type.

In this section, we’ll read in the data we extracted in the last lesson. You may have noticed in the last session that the data in these dataframes didn’t look great. There were columns that appeared to have no values. Once we start working with the data, we are going to see some additional problems.

import pandas as pd
table1 = pd.read_csv("data/potts_table1.csv")
table2 = pd.read_csv("data/potts_table2.csv")
table1.head()
Unnamed: 0 Compound log P Unnamed: 1 II H, H,.1 MV R, log Koa log Kyex Unnamed: 2 log Kpep
0 NaN water — 6.85 NaN 0.45 0.82 0.35 10.6 0.00 — 1.38 — 4,38 NaN NaN
1 ' methanol — 6.68 NaN 0.44 0.43 0.47 21.7 0.28 —0.73 —2.42 NaN — 2.80
2 NaN methanoic acid — 7.08 NaN 0.60 0.75 0.38 22.3 0.30 —0.54 — 3.93 NaN — 3.63
3 NaN ethanol — 6.66 NaN 0.42 0.37 0.48 31.9 0.25 —0.32 —2.24 NaN —2.10
4 NaN ethanoic acid —7.01 NaN 0.65 0.61 0.45 33.4 0.27 —0.31 — 3.28 NaN —2.90

Dropping unnecessary data#

In some cases, we might have data in our dataframe that we don’t need. We will want to discard or “drop” this data from the dataframe. For the dataframe we just loaded, for example, we can see that columns 0, 3, and 11 appear not to have any values.

Check Your Understanding

What pandas method can you use to see how many non-null values you have in each column?

There are two methods you might use to drop data from a dataframe: drop and dropna. Use drop when you have specific rows or columns you want to remove from the dataframe; use dropna when you want to drop columns or rows that contain NaN (“not a number”) values, which appear when a data cell is empty.
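The distinction can be seen on a small made-up dataframe (the column names here are arbitrary):

```python
import numpy as np
import pandas as pd

# Toy dataframe (made up for illustration): "b" has one missing
# value, and "c" is entirely missing.
df = pd.DataFrame({"a": [1, 2],
                   "b": [3.0, np.nan],
                   "c": [np.nan, np.nan]})

# drop removes rows or columns you name explicitly.
print(df.drop(columns=["c"]).columns.tolist())  # ['a', 'b']

# dropna removes rows or columns based on missing values;
# with axis=1 it drops every column containing a NaN.
print(df.dropna(axis=1).columns.tolist())       # ['a']
```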

If you call table1.info() (the method from the exercise above), you will see that two columns contain 0 non-null values. This means that all of the values in those columns are NaN, so we can safely discard them. We’ll use the dropna function to get rid of them.

help(table1.dropna)
Help on method dropna in module pandas.core.frame:

dropna(*, axis: 'Axis' = 0, how: 'str | NoDefault' = <no_default>, thresh: 'int | NoDefault' = <no_default>, subset: 'IndexLabel' = None, inplace: 'bool' = False) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If any NA values are present, drop that row or column.
        * 'all' : If all values are NA, drop that row or column.
    
    thresh : int, optional
        Require that many non-NA values. Cannot be combined with how.
    subset : column label or sequence of labels, optional
        Labels along other axis to consider, e.g. if you are dropping rows
        these would be a list of columns to include.
    inplace : bool, default False
        Whether to modify the DataFrame rather than creating a new one.
    
    Returns
    -------
    DataFrame or None
        DataFrame with NA entries dropped from it or None if ``inplace=True``.
    
    See Also
    --------
    DataFrame.isna: Indicate missing values.
    DataFrame.notna : Indicate existing (non-missing) values.
    DataFrame.fillna : Replace missing values.
    Series.dropna : Drop missing values.
    Index.dropna : Drop missing indices.
    
    Examples
    --------
    >>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
    ...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
    ...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
    ...                             pd.NaT]})
    >>> df
           name        toy       born
    0    Alfred        NaN        NaT
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Drop the rows where at least one element is missing.
    
    >>> df.dropna()
         name        toy       born
    1  Batman  Batmobile 1940-04-25
    
    Drop the columns where at least one element is missing.
    
    >>> df.dropna(axis='columns')
           name
    0    Alfred
    1    Batman
    2  Catwoman
    
    Drop the rows where all elements are missing.
    
    >>> df.dropna(how='all')
           name        toy       born
    0    Alfred        NaN        NaT
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Keep only the rows with at least 2 non-NA values.
    
    >>> df.dropna(thresh=2)
           name        toy       born
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Define in which columns to look for missing values.
    
    >>> df.dropna(subset=['name', 'toy'])
           name        toy       born
    1    Batman  Batmobile 1940-04-25
    2  Catwoman   Bullwhip        NaT
    
    Keep the DataFrame with valid entries in the same variable.
    
    >>> df.dropna(inplace=True)
    >>> df
         name        toy       born
    1  Batman  Batmobile 1940-04-25

Before saving the dataframe, we’ll look at and discuss the output from this function.

By default, the function dropna will work on axis 0 or the rows of the dataframe, and will drop any row which contains a NaN. You will see this results in a dataframe with no data.

table1.dropna()
Unnamed: 0 Compound log P Unnamed: 1 II H, H,.1 MV R, log Koa log Kyex Unnamed: 2 log Kpep

Notice that dropna returns a dataframe and does not overwrite the original.

table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1 non-null      object 
 1   Compound    37 non-null     object 
 2   log P       37 non-null     object 
 3   Unnamed: 1  0 non-null      float64
 4   II          37 non-null     float64
 5   H,          37 non-null     float64
 6   H,.1        37 non-null     float64
 7   MV          37 non-null     float64
 8   R,          37 non-null     float64
 9   log Koa     37 non-null     object 
 10  log Kyex    31 non-null     object 
 11  Unnamed: 2  0 non-null      float64
 12  log Kpep    25 non-null     object 
dtypes: float64(7), object(6)
memory usage: 3.9+ KB

We can switch to dropping columns which have NaN values by adding the argument axis=1.

table1.dropna(axis=1).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Compound  37 non-null     object 
 1   log P     37 non-null     object 
 2   II        37 non-null     float64
 3   H,        37 non-null     float64
 4   H,.1      37 non-null     float64
 5   MV        37 non-null     float64
 6   R,        37 non-null     float64
 7   log Koa   37 non-null     object 
dtypes: float64(5), object(3)
memory usage: 2.4+ KB

This is closer to what we want. However, you’ll notice that this has dropped some columns which have data. By default, pandas will drop a column which contains any NaN values. This may not be what we want in many cases because some values may simply be missing rather than incorrect.

We can add the argument how="all" to drop only columns whose values are all NaN; the default is how="any". Once we are sure we would like to keep the result as our dataframe, we can add inplace=True to the function call to overwrite the dataframe.
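A minimal sketch of the difference between how="any" and how="all", using a made-up dataframe:

```python
import numpy as np
import pandas as pd

# Made-up dataframe: "partial" has one NaN, "empty" is all NaN.
df = pd.DataFrame({"full": [1, 2],
                   "partial": [3.0, np.nan],
                   "empty": [np.nan, np.nan]})

# how="any" (the default) drops every column containing any NaN.
print(df.dropna(axis=1, how="any").columns.tolist())  # ['full']

# how="all" drops only the columns that are entirely NaN.
print(df.dropna(axis=1, how="all").columns.tolist())  # ['full', 'partial']
```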

table1.dropna(axis=1, how="all")
Unnamed: 0 Compound log P II H, H,.1 MV R, log Koa log Kyex log Kpep
0 NaN water — 6.85 0.45 0.82 0.35 10.6 0.00 — 1.38 — 4,38 NaN
1 ' methanol — 6.68 0.44 0.43 0.47 21.7 0.28 —0.73 —2.42 — 2.80
2 NaN methanoic acid — 7.08 0.60 0.75 0.38 22.3 0.30 —0.54 — 3.93 — 3.63
3 NaN ethanol — 6.66 0.42 0.37 0.48 31.9 0.25 —0.32 —2.24 —2.10
4 NaN ethanoic acid —7.01 0.65 0.61 0.45 33.4 0.27 —0.31 — 3.28 —2.90
5 NaN n-propanol — 6.41 0.42 0.37 0.48 42.2 0.24 0.34 — 1.48 —1.52
6 NaN n-propanoic acid —7.01 0.65 0.60 0.45 43.6 0.23 0.26 — 2.64 —2.14
7 NaN butane-2-one — 5.90 0.70 0.00 0.51 49.4 0.17 0.28 — 0.25 —Q.25
8 NaN benzene —4,.51 0.52 0.00 0.14 50.0 0.61 2.00 2.29 NaN
9 NaN diethyl ether —5.35 0.25 0.00 0.45 52.0 0.04 0.83 0.66 NaN
10 NaN n-butanol — 6.16 0.42 0.37 0.48 52.4 0.22 0.88 — 1.08 —0.70
11 NaN n-butanoic acid — 6.36 0.62 0.60 0.45 53.9 0.21 0.79 — 1.92 — 0.96
12 NaN phenol — 5.64 0.89 0.52 0.30 53.9 0.81 1.46 — 0.70 —0.82
13 NaN toluene — 3.56 0.52 0.00 0.14 60.0 0.60 2.70 2.89 NaN
14 NaN styrene — 3.75 0.65 0.00 0.16 60.2 0.85 2.95 NaN NaN
15 NaN n-pentanol —5.78 0.42 0.37 0.48 62.6 0.22 1.40 — 0.39 NaN
16 NaN benzyl-OH — 5.78 0.87 0.33 0.56 64.0 0.80 1.10 — 0.62 NaN
17 NaN n-pentanoic acid — 6.01 0.60 0.60 0.45 64.1 0.21 1.33 — 1.31 0.44
18 NaN 2-chlorophenol —5.04 0.88 0.32 0.31 66.0 0.85 2.15 NaN NaN
19 NaN 4-chlorophenol — 5.00 1.08 0.67 0.20 66.0 0.92 2.39 —1.31 —0.12
20 NaN m-cresol — 5.38 0.88 0.57 0.34 67.6 0.82 1.96 NaN NaN
21 NaN o-cresol ~ 5,36 0.86 0.52 0.30 67.6 0.84 1.95 NaN NaN
22 NaN p-cresol — 5.29 0.87 0.57 0.31 67.6 0.82 1.96 NaN NaN
23 NaN 4-bromophenol — 5.00 1.17 0.67 0.20 70.0 1.08 2.59 —0.11 —0.20
24 NaN 4-nitrophenol ~—5.81 1.72 0.82 0.26 71.0 1.07 1.96 —2.0 —2.15
25 NaN 3-nitrophenol —5.81 1.57 0.79 0.23 71.0 1.05 2.00 —1.4 — 1.23
26 NaN 2-nitrophenol — 4,56 1.05 0.05 0.37 71.0 1.02 1.80 —1.4 1.04
27 NaN ethylbenzene — 3.48 0.51 0.00 0.15 72.3 0.60 3.15 NaN NaN
28 NaN n-hexanol —5.45 0.42 0.37 0.48 72.9 0.21 2.03 0.11 0.45
29 NaN n-hexanoic acid —5.44 0.60 0.60 0.45 74.3 0.17 1.89 — 0.85 0.24
30 NaN 8-naphthol —5.11 1.08 0.61 0.40 79.5 1.52 2.84 1.77 0.30
31 NaN n-heptanol — 5.05 0.42 0.37 0.48 83.1 0.21 2.49 0.77 1.01
32 NaN n-heptanoic acid — 5.28 0.60 0.60 0.45 84.6 0.15 2.33 — 0.29 1.16
33 NaN n-octanol — 4.84 0.42 0.37 0.48 93.3 0.20 3.15 1.62 1.65
34 NaN n-octanoic acid —5.21 0.60 0.60 0.45 94.8 0.15 2.83 0.41 1.95
35 NaN n-nonanol —4.77 0.42 0.37 0.48 104.0 0.19 3.68 1.97 2.28
36 NaN n-decanol — 4.66 0.42 0.37 0.48 114.0 0.19 4,24 2.56 2.91

The output above looks like something to keep, so we will add inplace=True to overwrite the original dataframe.

table1.dropna(axis=1, how="all", inplace=True)
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1 non-null      object 
 1   Compound    37 non-null     object 
 2   log P       37 non-null     object 
 3   II          37 non-null     float64
 4   H,          37 non-null     float64
 5   H,.1        37 non-null     float64
 6   MV          37 non-null     float64
 7   R,          37 non-null     float64
 8   log Koa     37 non-null     object 
 9   log Kyex    31 non-null     object 
 10  log Kpep    25 non-null     object 
dtypes: float64(5), object(6)
memory usage: 3.3+ KB

We can drop the remaining unneeded column, Unnamed: 0, using the drop function. Use drop when you have specific rows or columns you would like to discard. Again, we use axis=1 to indicate that we are dropping a column, and we pass the column name.

table1.drop(axis=1, columns=["Unnamed: 0"], inplace=True)

Changing column names#

Our column names are still incorrect. You will likely want to change them to make the table more legible.

table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Compound  37 non-null     object 
 1   log P     37 non-null     object 
 2   II        37 non-null     float64
 3   H,        37 non-null     float64
 4   H,.1      37 non-null     float64
 5   MV        37 non-null     float64
 6   R,        37 non-null     float64
 7   log Koa   37 non-null     object 
 8   log Kyex  31 non-null     object 
 9   log Kpep  25 non-null     object 
dtypes: float64(5), object(5)
memory usage: 3.0+ KB

We might now want to clean up the column names and make sure they are descriptive. You can see the column names using table1.columns. You can either rename the columns by setting table1.columns to a list of the appropriate length, or you can use table1.rename. With the .rename method, you pass the columns argument a dictionary (curly brackets) whose entries use the syntax

"current_column_name": "new_column_name"
table1.columns
Index(['Compound', 'log P', 'II', 'H,', 'H,.1', 'MV', 'R,', 'log Koa',
       'log Kyex', 'log Kpep'],
      dtype='object')
table1.rename(inplace=True, columns={
    "II": "pi",
    "H,": "Hd",
    "H,.1": "Ha",
    "R,": "R_2",
    "log Koa": "log K_oct",
    "log Kyex": "log K_hex",
    "log Kpep": "log K_hep"
})
table1.head()
Compound log P pi Hd Ha MV R_2 log K_oct log K_hex log K_hep
0 water — 6.85 0.45 0.82 0.35 10.6 0.00 — 1.38 — 4,38 NaN
1 methanol — 6.68 0.44 0.43 0.47 21.7 0.28 —0.73 —2.42 — 2.80
2 methanoic acid — 7.08 0.60 0.75 0.38 22.3 0.30 —0.54 — 3.93 — 3.63
3 ethanol — 6.66 0.42 0.37 0.48 31.9 0.25 —0.32 —2.24 —2.10
4 ethanoic acid —7.01 0.65 0.61 0.45 33.4 0.27 —0.31 — 3.28 —2.90

Fixing Data Types#

When examining .info, you’ll notice that many of the columns which should be numbers are still ‘objects’, i.e. strings. We would like log P, for example, to be numeric. Typically, when a column looks numeric but pandas does not automatically cast it as such, it is because some values in the column contain non-numeric characters that pandas could not interpret. We will need to examine these values, decide what to do with them, then cast the column as numeric.
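To see why a single stray character matters, here is a minimal sketch using made-up file contents; one cell in column x contains an em dash, as in our data:

```python
import io
import pandas as pd

# Made-up CSV contents; the second value of "x" uses an em dash.
csv = "x,y\n1.5,1.5\n— 2.0,2.0\n"
df = pd.read_csv(io.StringIO(csv))

# The unparsable value forces "x" to the generic object dtype,
# while the clean column "y" is inferred as float64.
print(df.dtypes)
```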

This part of our workflow (data cleaning) often requires us to examine data on a case-by-case basis to decide what to do.

There are a few ways to make pandas columns numeric, but we’ll use the pandas function to_numeric.

table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Compound   37 non-null     object 
 1   log P      37 non-null     object 
 2   pi         37 non-null     float64
 3   Hd         37 non-null     float64
 4   Ha         37 non-null     float64
 5   MV         37 non-null     float64
 6   R_2        37 non-null     float64
 7   log K_oct  37 non-null     object 
 8   log K_hex  31 non-null     object 
 9   log K_hep  25 non-null     object 
dtypes: float64(5), object(5)
memory usage: 3.0+ KB

Using the to_numeric function without any additional inputs will fail on this data set.

pd.to_numeric(table1["log P"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "— 6.85"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[21], line 1
----> 1 pd.to_numeric(table1["log P"])

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
    183 coerce_numeric = errors not in ("ignore", "raise")
    184 try:
--> 185     values, _ = lib.maybe_convert_numeric(
    186         values, set(), coerce_numeric=coerce_numeric
    187     )
    188 except (ValueError, TypeError):
    189     if errors == "raise":

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "— 6.85" at position 0

Scrolling to the bottom of this message and reading the error, you will see that pandas is having a problem reading the value "— 6.85". It may not be obvious at first what the problem is.

When we run into a problem like this, we have a few options for handling the errors. Pandas lets you choose what it should do when it is unable to cast a value. By default it raises an error, which is what we see above. You could instead set errors to “ignore” (the column is returned unchanged and no error is raised) or to “coerce”, which sets anything that can’t be cast as numeric to NaN.
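Using a made-up series, the default and “coerce” behaviors look like this (note that errors="ignore" is deprecated in recent pandas versions):

```python
import pandas as pd

s = pd.Series(["1.0", "— 2.0"])  # the second value is unparsable

# errors="raise" is the default: casting fails with a ValueError.
try:
    pd.to_numeric(s)
except ValueError as err:
    print("raise:", err)

# errors="coerce": unparsable values become NaN instead.
print(pd.to_numeric(s, errors="coerce").tolist())  # [1.0, nan]
```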

Let’s see what happens when we set errors to coerce.

pd.to_numeric(table1["log P"], errors="coerce")
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
19   NaN
20   NaN
21   NaN
22   NaN
23   NaN
24   NaN
25   NaN
26   NaN
27   NaN
28   NaN
29   NaN
30   NaN
31   NaN
32   NaN
33   NaN
34   NaN
35   NaN
36   NaN
Name: log P, dtype: float64

Unfortunately, this results in every value being set to NaN; no numeric values were recognized.

We have to do a little more processing to the values for this to work. If you examine the column, you may notice that the negative sign is a little off: it is an em dash (—) when it should be a hyphen-minus (-). The difference is very slight and can be hard to see, but it is important to fix for this data set.

We will want to replace every — with -. We can accomplish this using the string method replace. Strings in Python have a number of methods; the replace method substitutes one substring for another.

test_string = "Hello world."
test_string.replace(".", "!")
'Hello world!'

The split method is another string method you are probably familiar with:

test_string.split()
['Hello', 'world.']

Pandas string methods#

If we want to use these on a column in a pandas dataframe, you might think to use apply, which we learned about in the last session. However, pandas dataframes have text-processing built-in as methods.

When a pandas column contains strings, we can use a string method on the whole column through the .str accessor, as .str.method_name. For example, to replace the minus signs:

table1["log P"].str.replace("—", "-")
0     - 6.85
1     - 6.68
2     - 7.08
3     - 6.66
4      -7.01
5     - 6.41
6      -7.01
7     - 5.90
8     -4,.51
9      -5.35
10    - 6.16
11    - 6.36
12    - 5.64
13    - 3.56
14    - 3.75
15     -5.78
16    - 5.78
17    - 6.01
18     -5.04
19    - 5.00
20    - 5.38
21    ~ 5,36
22    - 5.29
23    - 5.00
24    ~-5.81
25     -5.81
26    - 4,56
27    - 3.48
28     -5.45
29     -5.44
30     -5.11
31    - 5.05
32    - 5.28
33    - 4.84
34     -5.21
35     -4.77
36    - 4.66
Name: log P, dtype: object
table1["log P"] = table1["log P"].str.replace("—", "-")

# We still need to get rid of spaces
table1["log P"] = table1["log P"].str.replace(" ", "")
table1["log P"] =  pd.to_numeric(table1["log P"], errors="coerce")
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Compound   37 non-null     object 
 1   log P      33 non-null     float64
 2   pi         37 non-null     float64
 3   Hd         37 non-null     float64
 4   Ha         37 non-null     float64
 5   MV         37 non-null     float64
 6   R_2        37 non-null     float64
 7   log K_oct  37 non-null     object 
 8   log K_hex  31 non-null     object 
 9   log K_hep  25 non-null     object 
dtypes: float64(6), object(4)
memory usage: 3.0+ KB

We actually need to change this character in all of our columns. However, .str methods only work on a pandas Series. To replace a substring across an entire DataFrame, we use the .replace method with the option regex=True so that it matches substrings. We will discuss regex more in the next session, but this is all you need to know about regex for the moment.

table1.replace("—", "-", regex=True, inplace=True)
table1.replace(" ", "", regex=True, inplace=True)
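To confirm what regex=True does, here is a sketch on a made-up dataframe containing the same artifacts (em dashes and stray spaces):

```python
import pandas as pd

# Made-up dataframe with em dashes and stray spaces in the values.
df = pd.DataFrame({"a": ["— 1.5", "—2.0"],
                   "b": ["  3.5", "4.0"]})

# With regex=True, replace() matches substrings inside each cell
# across the whole dataframe.
df = df.replace("—", "-", regex=True).replace(" ", "", regex=True)
print(df["a"].tolist())  # ['-1.5', '-2.0']
```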

Changing the data type of multiple columns#

To change the data type of multiple columns, we will want to use the pd.to_numeric function on all of those columns. There are several ways you might choose to do this. For example, you might just choose to call the function for each column.

We can also accomplish this by using the apply method, which we learned about in the last session. The apply method should be used whenever you want to apply a function to each row or column of a dataframe. In this case, we want to apply the pd.to_numeric function to our data.

Here we pass axis=1, which applies the function to each row.

table1.apply(pd.to_numeric, axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "water"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[31], line 1
----> 1 table1.apply(pd.to_numeric, axis=1)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/frame.py:9565, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   9554 from pandas.core.apply import frame_apply
   9556 op = frame_apply(
   9557     self,
   9558     func=func,
   (...)
   9563     kwargs=kwargs,
   9564 )
-> 9565 return op.apply().__finalize__(self, method="apply")

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:746, in FrameApply.apply(self)
    743 elif self.raw:
    744     return self.apply_raw()
--> 746 return self.apply_standard()

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:873, in FrameApply.apply_standard(self)
    872 def apply_standard(self):
--> 873     results, res_index = self.apply_series_generator()
    875     # wrap results
    876     return self.wrap_results(results, res_index)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:889, in FrameApply.apply_series_generator(self)
    886 with option_context("mode.chained_assignment", None):
    887     for i, v in enumerate(series_gen):
    888         # ignore SettingWithCopy here in case the user mutates
--> 889         results[i] = self.f(v)
    890         if isinstance(results[i], ABCSeries):
    891             # If we have a view on v, we need to make a copy because
    892             #  series_generator will swap out the underlying data
    893             results[i] = results[i].copy(deep=False)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
    183 coerce_numeric = errors not in ("ignore", "raise")
    184 try:
--> 185     values, _ = lib.maybe_convert_numeric(
    186         values, set(), coerce_numeric=coerce_numeric
    187     )
    188 except (ValueError, TypeError):
    189     if errors == "raise":

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "water" at position 0

When we try this code, we immediately see an error. We do not want to try to convert the first column to a number. We can use the iloc function to exclude the first column:

table1.iloc[:, 1:].apply(pd.to_numeric, axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "-4,38"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[32], line 1
----> 1 table1.iloc[:, 1:].apply(pd.to_numeric, axis=1)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/frame.py:9565, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   9554 from pandas.core.apply import frame_apply
   9556 op = frame_apply(
   9557     self,
   9558     func=func,
   (...)
   9563     kwargs=kwargs,
   9564 )
-> 9565 return op.apply().__finalize__(self, method="apply")

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:746, in FrameApply.apply(self)
    743 elif self.raw:
    744     return self.apply_raw()
--> 746 return self.apply_standard()

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:873, in FrameApply.apply_standard(self)
    872 def apply_standard(self):
--> 873     results, res_index = self.apply_series_generator()
    875     # wrap results
    876     return self.wrap_results(results, res_index)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:889, in FrameApply.apply_series_generator(self)
    886 with option_context("mode.chained_assignment", None):
    887     for i, v in enumerate(series_gen):
    888         # ignore SettingWithCopy here in case the user mutates
--> 889         results[i] = self.f(v)
    890         if isinstance(results[i], ABCSeries):
    891             # If we have a view on v, we need to make a copy because
    892             #  series_generator will swap out the underlying data
    893             results[i] = results[i].copy(deep=False)

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
    183 coerce_numeric = errors not in ("ignore", "raise")
    184 try:
--> 185     values, _ = lib.maybe_convert_numeric(
    186         values, set(), coerce_numeric=coerce_numeric
    187     )
    188 except (ValueError, TypeError):
    189     if errors == "raise":

File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "-4,38" at position 7

An error again! This time, the failure is on the value "-4,38", a string that was read incorrectly from the PDF (a comma where the decimal point should be) and so cannot be converted to a number. You could choose to handle this differently, but for this workshop we are just going to discard values like these.

If we were using to_numeric on a pandas Series, we would use the option errors="coerce". It may not be immediately obvious how to combine this with apply, but fortunately pandas lets us pass additional arguments through apply:

table1.iloc[:, 1:] = table1.iloc[:, 1:].apply(pd.to_numeric, axis=1, errors="coerce")
/tmp/ipykernel_3771/2243029040.py:1: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
  table1.iloc[:, 1:] = table1.iloc[:, 1:].apply(pd.to_numeric, axis=1, errors="coerce")
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Compound   37 non-null     object 
 1   log P      33 non-null     float64
 2   pi         37 non-null     float64
 3   Hd         37 non-null     float64
 4   Ha         37 non-null     float64
 5   MV         37 non-null     float64
 6   R_2        37 non-null     float64
 7   log K_oct  36 non-null     float64
 8   log K_hex  30 non-null     float64
 9   log K_hep  24 non-null     float64
dtypes: float64(9), object(1)
memory usage: 3.0+ KB
table1.to_csv("data/potts_table1_clean.csv", index=False)

Key Points

  • Data cleaning prepares data for analysis. It is a separate step in the data analysis workflow.

  • Data cleaning is often time intensive.

  • Pandas has special functions for data cleaning and for handling missing information.