Data Cleaning with Pandas
Overview
Questions:
What does “clean data” mean?
How can I drop unnecessary data from my dataframe?
How can I change column or row names in a dataframe?
How can I cast columns to the correct data type?
Objectives:
Use pandas to drop unnecessary data from our dataframe.
Learn how to rename pandas columns.
Use pandas string methods to replace characters.
Learn how to cast columns to the correct data type.
In this section, we’ll read in the data we extracted in the last lesson. You may have noticed in the last session that the data in these dataframes didn’t look great. There were columns that appeared to have no values. Once we start working with the data, we are going to see some additional problems.
import pandas as pd
table1 = pd.read_csv("data/potts_table1.csv")
table2 = pd.read_csv("data/potts_table2.csv")
table1.head()
Unnamed: 0 | Compound | log P | Unnamed: 1 | II | H, | H,.1 | MV | R, | log Koa | log Kyex | Unnamed: 2 | log Kpep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | water | — 6.85 | NaN | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN | NaN |
1 | ' | methanol | — 6.68 | NaN | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | NaN | — 2.80 |
2 | NaN | methanoic acid | — 7.08 | NaN | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | NaN | — 3.63 |
3 | NaN | ethanol | — 6.66 | NaN | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | NaN | —2.10 |
4 | NaN | ethanoic acid | —7.01 | NaN | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | NaN | —2.90 |
Dropping unnecessary data
In some cases, we might have data in our dataframe that we don’t need. We will want to discard, or “drop”, this data from the dataframe. For the dataframe we just loaded, for example, we can see that columns 0, 3, and 11 appear to have no values.
Check Your Understanding
What pandas method can you use to see how many non-null values you have in each column?
Solution
table1.info()
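Besides `info`, there are a couple of other ways to count missing values; here is a minimal sketch on a made-up dataframe (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up dataframe with some missing entries
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, np.nan]})

print(df.count())       # non-null values per column
print(df.isna().sum())  # NaN values per column
```

`count` gives the non-null tally directly, while `isna().sum()` counts the missing cells.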
There are two methods you might use to drop data from a dataframe: drop and dropna. drop is used when you have specific rows or columns you want to remove from the dataframe, while dropna is used when you want to drop columns or rows which contain NaN (“not a number”) values. These occur when a data cell is empty.
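As a minimal sketch of the difference between the two, on a made-up dataframe:

```python
import numpy as np
import pandas as pd

# Made-up dataframe: column "B" is entirely NaN
df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [np.nan] * 3})

dropped = df.drop(columns=["B"])        # drop: remove a column by name
cleaned = df.dropna(axis=1, how="all")  # dropna: remove all-NaN columns

print(dropped.columns.tolist())
print(cleaned.columns.tolist())
```

Both calls return a new dataframe containing only column "A"; the original is unchanged.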
In the output of info above, we can see that there are two columns which contain 0 non-null values. This means that all of the values in these columns are NaN, so we can safely discard them. We’ll use the dropna method to get rid of them.
help(table1.dropna)
Help on method dropna in module pandas.core.frame:
dropna(*, axis: 'Axis' = 0, how: 'str | NoDefault' = <no_default>, thresh: 'int | NoDefault' = <no_default>, subset: 'IndexLabel' = None, inplace: 'bool' = False) -> 'DataFrame | None' method of pandas.core.frame.DataFrame instance
Remove missing values.
See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
Determine if rows or columns which contain missing values are
removed.
* 0, or 'index' : Drop rows which contain missing values.
* 1, or 'columns' : Drop columns which contain missing value.
.. versionchanged:: 1.0.0
Pass tuple or list to drop on multiple axes.
Only a single axis is allowed.
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values. Cannot be combined with how.
subset : column label or sequence of labels, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
Whether to modify the DataFrame rather than creating a new one.
Returns
-------
DataFrame or None
DataFrame with NA entries dropped from it or None if ``inplace=True``.
See Also
--------
DataFrame.isna: Indicate missing values.
DataFrame.notna : Indicate existing (non-missing) values.
DataFrame.fillna : Replace missing values.
Series.dropna : Drop missing values.
Index.dropna : Drop missing indices.
Examples
--------
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
... "toy": [np.nan, 'Batmobile', 'Bullwhip'],
... "born": [pd.NaT, pd.Timestamp("1940-04-25"),
... pd.NaT]})
>>> df
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy'])
name toy born
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
name toy born
1 Batman Batmobile 1940-04-25
Before saving the dataframe, we’ll look at and discuss the output from this function.
By default, dropna works on axis 0, the rows of the dataframe, and drops any row which contains a NaN. You will see that this results in a dataframe with no data.
table1.dropna()
Unnamed: 0 | Compound | log P | Unnamed: 1 | II | H, | H,.1 | MV | R, | log Koa | log Kyex | Unnamed: 2 | log Kpep |
---|
Notice that dropna returns a new dataframe and does not overwrite the original.
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1 non-null object
1 Compound 37 non-null object
2 log P 37 non-null object
3 Unnamed: 1 0 non-null float64
4 II 37 non-null float64
5 H, 37 non-null float64
6 H,.1 37 non-null float64
7 MV 37 non-null float64
8 R, 37 non-null float64
9 log Koa 37 non-null object
10 log Kyex 31 non-null object
11 Unnamed: 2 0 non-null float64
12 log Kpep 25 non-null object
dtypes: float64(7), object(6)
memory usage: 3.9+ KB
We can switch to dropping columns which have NaN values by adding the argument axis=1.
table1.dropna(axis=1).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 37 non-null object
2 II 37 non-null float64
3 H, 37 non-null float64
4 H,.1 37 non-null float64
5 MV 37 non-null float64
6 R, 37 non-null float64
7 log Koa 37 non-null object
dtypes: float64(5), object(3)
memory usage: 2.4+ KB
This is closer to what we want. However, you’ll notice that this has dropped some columns which do have data. By default, pandas drops a column which contains any NaN values. This may not be what we want, because some values may simply be missing rather than incorrect.
We can add the argument how="all" to drop only columns whose values are all NaN; the default is how="any". Once we are sure we would like to keep the result as our dataframe, we can add inplace=True to the function call to overwrite the dataframe.
table1.dropna(axis=1, how="all")
Unnamed: 0 | Compound | log P | II | H, | H,.1 | MV | R, | log Koa | log Kyex | log Kpep | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | water | — 6.85 | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN |
1 | ' | methanol | — 6.68 | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | — 2.80 |
2 | NaN | methanoic acid | — 7.08 | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | — 3.63 |
3 | NaN | ethanol | — 6.66 | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | —2.10 |
4 | NaN | ethanoic acid | —7.01 | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | —2.90 |
5 | NaN | n-propanol | — 6.41 | 0.42 | 0.37 | 0.48 | 42.2 | 0.24 | 0.34 | — 1.48 | —1.52 |
6 | NaN | n-propanoic acid | —7.01 | 0.65 | 0.60 | 0.45 | 43.6 | 0.23 | 0.26 | — 2.64 | —2.14 |
7 | NaN | butane-2-one | — 5.90 | 0.70 | 0.00 | 0.51 | 49.4 | 0.17 | 0.28 | — 0.25 | —Q.25 |
8 | NaN | benzene | —4,.51 | 0.52 | 0.00 | 0.14 | 50.0 | 0.61 | 2.00 | 2.29 | NaN |
9 | NaN | diethyl ether | —5.35 | 0.25 | 0.00 | 0.45 | 52.0 | 0.04 | 0.83 | 0.66 | NaN |
10 | NaN | n-butanol | — 6.16 | 0.42 | 0.37 | 0.48 | 52.4 | 0.22 | 0.88 | — 1.08 | —0.70 |
11 | NaN | n-butanoic acid | — 6.36 | 0.62 | 0.60 | 0.45 | 53.9 | 0.21 | 0.79 | — 1.92 | — 0.96 |
12 | NaN | phenol | — 5.64 | 0.89 | 0.52 | 0.30 | 53.9 | 0.81 | 1.46 | — 0.70 | —0.82 |
13 | NaN | toluene | — 3.56 | 0.52 | 0.00 | 0.14 | 60.0 | 0.60 | 2.70 | 2.89 | NaN |
14 | NaN | styrene | — 3.75 | 0.65 | 0.00 | 0.16 | 60.2 | 0.85 | 2.95 | NaN | NaN |
15 | NaN | n-pentanol | —5.78 | 0.42 | 0.37 | 0.48 | 62.6 | 0.22 | 1.40 | — 0.39 | NaN |
16 | NaN | benzyl-OH | — 5.78 | 0.87 | 0.33 | 0.56 | 64.0 | 0.80 | 1.10 | — 0.62 | NaN |
17 | NaN | n-pentanoic acid | — 6.01 | 0.60 | 0.60 | 0.45 | 64.1 | 0.21 | 1.33 | — 1.31 | 0.44 |
18 | NaN | 2-chlorophenol | —5.04 | 0.88 | 0.32 | 0.31 | 66.0 | 0.85 | 2.15 | NaN | NaN |
19 | NaN | 4-chlorophenol | — 5.00 | 1.08 | 0.67 | 0.20 | 66.0 | 0.92 | 2.39 | —1.31 | —0.12 |
20 | NaN | m-cresol | — 5.38 | 0.88 | 0.57 | 0.34 | 67.6 | 0.82 | 1.96 | NaN | NaN |
21 | NaN | o-cresol | ~ 5,36 | 0.86 | 0.52 | 0.30 | 67.6 | 0.84 | 1.95 | NaN | NaN |
22 | NaN | p-cresol | — 5.29 | 0.87 | 0.57 | 0.31 | 67.6 | 0.82 | 1.96 | NaN | NaN |
23 | NaN | 4-bromophenol | — 5.00 | 1.17 | 0.67 | 0.20 | 70.0 | 1.08 | 2.59 | —0.11 | —0.20 |
24 | NaN | 4-nitrophenol | ~—5.81 | 1.72 | 0.82 | 0.26 | 71.0 | 1.07 | 1.96 | —2.0 | —2.15 |
25 | NaN | 3-nitrophenol | —5.81 | 1.57 | 0.79 | 0.23 | 71.0 | 1.05 | 2.00 | —1.4 | — 1.23 |
26 | NaN | 2-nitrophenol | — 4,56 | 1.05 | 0.05 | 0.37 | 71.0 | 1.02 | 1.80 | —1.4 | 1.04 |
27 | NaN | ethylbenzene | — 3.48 | 0.51 | 0.00 | 0.15 | 72.3 | 0.60 | 3.15 | NaN | NaN |
28 | NaN | n-hexanol | —5.45 | 0.42 | 0.37 | 0.48 | 72.9 | 0.21 | 2.03 | 0.11 | 0.45 |
29 | NaN | n-hexanoic acid | —5.44 | 0.60 | 0.60 | 0.45 | 74.3 | 0.17 | 1.89 | — 0.85 | 0.24 |
30 | NaN | 8-naphthol | —5.11 | 1.08 | 0.61 | 0.40 | 79.5 | 1.52 | 2.84 | 1.77 | 0.30 |
31 | NaN | n-heptanol | — 5.05 | 0.42 | 0.37 | 0.48 | 83.1 | 0.21 | 2.49 | 0.77 | 1.01 |
32 | NaN | n-heptanoic acid | — 5.28 | 0.60 | 0.60 | 0.45 | 84.6 | 0.15 | 2.33 | — 0.29 | 1.16 |
33 | NaN | n-octanol | — 4.84 | 0.42 | 0.37 | 0.48 | 93.3 | 0.20 | 3.15 | 1.62 | 1.65 |
34 | NaN | n-octanoic acid | —5.21 | 0.60 | 0.60 | 0.45 | 94.8 | 0.15 | 2.83 | 0.41 | 1.95 |
35 | NaN | n-nonanol | —4.77 | 0.42 | 0.37 | 0.48 | 104.0 | 0.19 | 3.68 | 1.97 | 2.28 |
36 | NaN | n-decanol | — 4.66 | 0.42 | 0.37 | 0.48 | 114.0 | 0.19 | 4,24 | 2.56 | 2.91 |
The output above looks like something to keep, so we will add inplace=True to overwrite the original dataframe.
table1.dropna(axis=1, how="all", inplace=True)
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1 non-null object
1 Compound 37 non-null object
2 log P 37 non-null object
3 II 37 non-null float64
4 H, 37 non-null float64
5 H,.1 37 non-null float64
6 MV 37 non-null float64
7 R, 37 non-null float64
8 log Koa 37 non-null object
9 log Kyex 31 non-null object
10 log Kpep 25 non-null object
dtypes: float64(5), object(6)
memory usage: 3.3+ KB
We can drop the final unneeded column using the drop method. Use this when you have specific rows or columns you would like to discard. Again we use axis=1 to drop a column, and we pass the column name.
table1.drop(axis=1, columns=["Unnamed: 0"], inplace=True)
Changing column names
Our column names are still incorrect. You will likely want to change them to make the table more legible.
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 37 non-null object
2 II 37 non-null float64
3 H, 37 non-null float64
4 H,.1 37 non-null float64
5 MV 37 non-null float64
6 R, 37 non-null float64
7 log Koa 37 non-null object
8 log Kyex 31 non-null object
9 log Kpep 25 non-null object
dtypes: float64(5), object(5)
memory usage: 3.0+ KB
We might now want to clean up the column names and make sure they are descriptive. You can see the current names using table1.columns. You can either rename the columns by setting table1.columns to a list of the appropriate length, or you can use the table1.rename method. With .rename, you pass the argument columns set equal to a dictionary (curly brackets) using the syntax
"current_column_name": "new_column_name"
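A minimal sketch of the two approaches, on a made-up two-column dataframe:

```python
import pandas as pd

# Made-up dataframe with placeholder column names
df = pd.DataFrame({"old_a": [1], "old_b": [2]})

# Option 1: assign a full list (length must match the number of columns)
df.columns = ["a", "b"]

# Option 2: rename only selected columns with a dictionary
df = df.rename(columns={"a": "alpha"})

print(df.columns.tolist())
```

The list assignment replaces every name at once, while rename lets you target just the columns that need fixing.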
table1.columns
Index(['Compound', 'log P', 'II', 'H,', 'H,.1', 'MV', 'R,', 'log Koa',
'log Kyex', 'log Kpep'],
dtype='object')
table1.rename(inplace=True, columns={
"II": "pi",
"H,": "Hd",
"H,.1": "Ha",
"R,": "R_2",
"log Koa": "log K_oct",
"log Kyex": "log K_hex",
"log Kpep": "log K_hep"
})
table1.head()
Compound | log P | pi | Hd | Ha | MV | R_2 | log K_oct | log K_hex | log K_hep | |
---|---|---|---|---|---|---|---|---|---|---|
0 | water | — 6.85 | 0.45 | 0.82 | 0.35 | 10.6 | 0.00 | — 1.38 | — 4,38 | NaN |
1 | methanol | — 6.68 | 0.44 | 0.43 | 0.47 | 21.7 | 0.28 | —0.73 | —2.42 | — 2.80 |
2 | methanoic acid | — 7.08 | 0.60 | 0.75 | 0.38 | 22.3 | 0.30 | —0.54 | — 3.93 | — 3.63 |
3 | ethanol | — 6.66 | 0.42 | 0.37 | 0.48 | 31.9 | 0.25 | —0.32 | —2.24 | —2.10 |
4 | ethanoic acid | —7.01 | 0.65 | 0.61 | 0.45 | 33.4 | 0.27 | —0.31 | — 3.28 | —2.90 |
Fixing Data Types
When examining the output of .info, you’ll notice that many of the columns which should be numbers are still ‘objects’, i.e. strings. We would like log P, for example, to be numeric. Typically, if a column looks like it should be numeric but pandas does not automatically cast it as such, it is because there are non-numeric characters in the column which pandas could not decide what to do with. We will need to examine these, decide what to do with them, then cast the column as numeric.
This part of the workflow, data cleaning, often requires us to examine data on a case-by-case basis to decide what to do.
There are a few ways to make pandas columns numeric, but we’ll use the pandas function to_numeric.
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 37 non-null object
2 pi 37 non-null float64
3 Hd 37 non-null float64
4 Ha 37 non-null float64
5 MV 37 non-null float64
6 R_2 37 non-null float64
7 log K_oct 37 non-null object
8 log K_hex 31 non-null object
9 log K_hep 25 non-null object
dtypes: float64(5), object(5)
memory usage: 3.0+ KB
Using the to_numeric function without any additional arguments will fail on this data set.
pd.to_numeric(table1["log P"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "— 6.85"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[21], line 1
----> 1 pd.to_numeric(table1["log P"])
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "— 6.85" at position 0
Scrolling to the bottom of this message and reading the error, you will see it is having a problem parsing the value "— 6.85". It may not be obvious at first what the problem is.
When we run into a problem like this, we have a few options. Pandas lets you choose what it should do when it is unable to cast a value. By default it will fail, which is what we see above. You could instead set errors to be ignored (the column would be returned unchanged, with no error raised) or to “coerce” the values. Choosing “coerce” means that anything which can’t be cast as numeric is set to NaN.
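A small sketch of the errors argument on made-up values (the em-dash entry mimics the problem in our table):

```python
import pandas as pd

s = pd.Series(["1.5", "— 6.85", "3.0"])  # the em-dash entry cannot be parsed

# errors="coerce": unparseable values become NaN, the rest convert normally
converted = pd.to_numeric(s, errors="coerce")
print(converted)

# errors="raise" (the default) would instead raise a ValueError here
```

Note that on well-formed values, coerce leaves the converted numbers intact and only the bad entry becomes NaN.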
Let’s see what happens when we set errors to coerce.
pd.to_numeric(table1["log P"], errors="coerce")
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 NaN
24 NaN
25 NaN
26 NaN
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
32 NaN
33 NaN
34 NaN
35 NaN
36 NaN
Name: log P, dtype: float64
This unfortunately results in no values being recognized as numeric.
We have to do a little more processing for this to work. If you examine the column, you may notice that the negative sign is a little off: it is an em dash (—) when it should be a hyphen-minus (-). This is very slight and might be hard to see, but it is important to change for this data set.
We will want to replace every — with -. We can accomplish this using the string method replace. Strings in Python have a number of methods; the replace method allows us to replace a substring within a string.
test_string = "Hello world."
test_string.replace(".", "!")
'Hello world!'
The split command is another string method you are probably familiar with:
test_string.split()
['Hello', 'world.']
Pandas string methods
If we want to use these on a column in a pandas dataframe, you might think to use apply, which we learned about in the last session. However, pandas dataframes have text processing built in as methods.
When a pandas column contains strings, we can use string methods on the whole column by adding .str before the method name. For example, to replace the minus signs:
table1["log P"].str.replace("—", "-")
0 - 6.85
1 - 6.68
2 - 7.08
3 - 6.66
4 -7.01
5 - 6.41
6 -7.01
7 - 5.90
8 -4,.51
9 -5.35
10 - 6.16
11 - 6.36
12 - 5.64
13 - 3.56
14 - 3.75
15 -5.78
16 - 5.78
17 - 6.01
18 -5.04
19 - 5.00
20 - 5.38
21 ~ 5,36
22 - 5.29
23 - 5.00
24 ~-5.81
25 -5.81
26 - 4,56
27 - 3.48
28 -5.45
29 -5.44
30 -5.11
31 - 5.05
32 - 5.28
33 - 4.84
34 -5.21
35 -4.77
36 - 4.66
Name: log P, dtype: object
table1["log P"] = table1["log P"].str.replace("—", "-")
# We still need to get rid of spaces
table1["log P"] = table1["log P"].str.replace(" ", "")
table1["log P"] = pd.to_numeric(table1["log P"], errors="coerce")
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 33 non-null float64
2 pi 37 non-null float64
3 Hd 37 non-null float64
4 Ha 37 non-null float64
5 MV 37 non-null float64
6 R_2 37 non-null float64
7 log K_oct 37 non-null object
8 log K_hex 31 non-null object
9 log K_hep 25 non-null object
dtypes: float64(6), object(4)
memory usage: 3.0+ KB
We actually need to change this character in all of our columns. However, str methods only work on a pandas Series. If we want to replace a substring across the whole DataFrame, we use the .replace method. In order for it to recognize substrings, we set the option regex=True. We will discuss regex more in the next session, but this is all you need to know about it for the moment.
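A minimal sketch of the Series-versus-DataFrame difference, on a made-up frame containing the problem character:

```python
import pandas as pd

# Made-up dataframe with the em-dash in both columns
df = pd.DataFrame({"a": ["—1.0", "—2.0"], "b": ["—3.0", "—4.0"]})

# Series.str.replace handles one column at a time
col = df["a"].str.replace("—", "-")

# DataFrame.replace(..., regex=True) substitutes the substring in every column
fixed = df.replace("—", "-", regex=True)

print(col.tolist())
print(fixed["b"].tolist())
```

Without regex=True, DataFrame.replace only matches whole cell values, which is why the option is needed here.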
table1.replace("—", "-", regex=True, inplace=True)
table1.replace(" ", "", regex=True, inplace=True)
Changing the data type of multiple columns
To change the data type of multiple columns, we will want to use the pd.to_numeric function on all of those columns. There are several ways you might do this. For example, you might call the function once for each column.
We can also accomplish this with the apply method, which we learned about in the last session. Use apply whenever you want to apply a function to each row or column of a dataframe. In this case, we want to apply pd.to_numeric to every value.
We add the argument axis=1, which applies the function to each row of the dataframe (each row is passed to pd.to_numeric as a Series).
table1.apply(pd.to_numeric, axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "water"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[31], line 1
----> 1 table1.apply(pd.to_numeric, axis=1)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/frame.py:9565, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
9554 from pandas.core.apply import frame_apply
9556 op = frame_apply(
9557 self,
9558 func=func,
(...)
9563 kwargs=kwargs,
9564 )
-> 9565 return op.apply().__finalize__(self, method="apply")
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:746, in FrameApply.apply(self)
743 elif self.raw:
744 return self.apply_raw()
--> 746 return self.apply_standard()
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:873, in FrameApply.apply_standard(self)
872 def apply_standard(self):
--> 873 results, res_index = self.apply_series_generator()
875 # wrap results
876 return self.wrap_results(results, res_index)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:889, in FrameApply.apply_series_generator(self)
886 with option_context("mode.chained_assignment", None):
887 for i, v in enumerate(series_gen):
888 # ignore SettingWithCopy here in case the user mutates
--> 889 results[i] = self.f(v)
890 if isinstance(results[i], ABCSeries):
891 # If we have a view on v, we need to make a copy because
892 # series_generator will swap out the underlying data
893 results[i] = results[i].copy(deep=False)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "water" at position 0
When we try this code, we immediately see an error: we should not try to convert the first column, which holds compound names, to numbers. We can use the iloc indexer to exclude the first column:
table1.iloc[:, 1:].apply(pd.to_numeric, axis=1)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "-4,38"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Cell In[32], line 1
----> 1 table1.iloc[:, 1:].apply(pd.to_numeric, axis=1)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/frame.py:9565, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
9554 from pandas.core.apply import frame_apply
9556 op = frame_apply(
9557 self,
9558 func=func,
(...)
9563 kwargs=kwargs,
9564 )
-> 9565 return op.apply().__finalize__(self, method="apply")
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:746, in FrameApply.apply(self)
743 elif self.raw:
744 return self.apply_raw()
--> 746 return self.apply_standard()
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:873, in FrameApply.apply_standard(self)
872 def apply_standard(self):
--> 873 results, res_index = self.apply_series_generator()
875 # wrap results
876 return self.wrap_results(results, res_index)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/apply.py:889, in FrameApply.apply_series_generator(self)
886 with option_context("mode.chained_assignment", None):
887 for i, v in enumerate(series_gen):
888 # ignore SettingWithCopy here in case the user mutates
--> 889 results[i] = self.f(v)
890 if isinstance(results[i], ABCSeries):
891 # If we have a view on v, we need to make a copy because
892 # series_generator will swap out the underlying data
893 results[i] = results[i].copy(deep=False)
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
183 coerce_numeric = errors not in ("ignore", "raise")
184 try:
--> 185 values, _ = lib.maybe_convert_numeric(
186 values, set(), coerce_numeric=coerce_numeric
187 )
188 except (ValueError, TypeError):
189 if errors == "raise":
File ~/miniconda3/envs/molssi-training/lib/python3.11/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "-4,38" at position 7
An error again! This time, the failure is because a string was incorrectly read from the PDF and could not be converted to a number. You could choose to handle this differently, but for this workshop we are just going to discard values like these.
When using to_numeric on a single pandas Series, we would pass the option errors="coerce". It may not be immediately obvious how to use this with apply, but fortunately pandas allows us to pass additional arguments through apply:
table1.iloc[:, 1:] = table1.iloc[:, 1:].apply(pd.to_numeric, axis=1, errors="coerce")
/tmp/ipykernel_3771/2243029040.py:1: FutureWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
table1.iloc[:, 1:] = table1.iloc[:, 1:].apply(pd.to_numeric, axis=1, errors="coerce")
table1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Compound 37 non-null object
1 log P 33 non-null float64
2 pi 37 non-null float64
3 Hd 37 non-null float64
4 Ha 37 non-null float64
5 MV 37 non-null float64
6 R_2 37 non-null float64
7 log K_oct 36 non-null float64
8 log K_hex 30 non-null float64
9 log K_hep 24 non-null float64
dtypes: float64(9), object(1)
memory usage: 3.0+ KB
table1.to_csv("data/potts_table1_clean.csv", index=False)
Key Points
Data cleaning prepares data for analysis. It is a separate step in the data analysis workflow.
Data cleaning is often time intensive.
Pandas has special functions for data cleaning and handling missing information.