Introduction to Pandas#

Overview

Questions:

  • How do I import a library?

  • How do I analyze tabular data using Python?

Objectives:

  • Use pandas to examine and analyze data.

Pandas is another Python package which is very popular for data analysis. The key data type of pandas is the dataframe. In this lesson, we will cover pandas dataframes and some basic analysis.

Pandas is capable of handling data of lots of different types. In fact, it is often used to work with data that would usually be in a spreadsheet. It is designed to make working with “relational” or “labeled” data easy and intuitive. Central to the pandas package are the special data structures called pandas Series and DataFrames. Pandas dataframes are 2 dimensional and tabular, and is particularly suited to data which is heterogenous and in columns. In fact, there are even functions which allow you to read data directly from excel spreadsheets or SQL databases.

Importing Libraries#

For this notebook, we will be reading and analyzing data using pandas. These are not part of standard Python, and we must import the package to use it.

import pandas as pd

Now, if we want to use functions from plotly or pandas, we will do pd.function_name or px.function_name

In this notebook, we will be working with a data set that contains information about the elements in the periodic table. The data is a csv (comma separated value) file from PubChem in your data folder called PubChemElements_all.csv.

This is a comma-separated value file, and we will use the pandas function read_csv. You can read more about this function in your Jupyter by typing pd.read_csv in the cell below, then right clicking and choosing “Show Contextual Help”.

Python Documentation

Most popular Python libraries have very good online documentation. You can find the pandas documentation by googling "pandas docs". You will be able to find the same help message you get for `read_csv` as well as tutorials and other types of documentation.
  1. Pandas Documentation

  2. read_csv documentation

It is common to name the variable from the read_csv function df. This is short for DataFrame. Since this file is relatively simple, we do not need any additional arguments to the function. The read_csv function reads in tabular data which is comma delimited by default.

df = pd.read_csv("data/PubChemElements_all.csv")

Examining the data#

The variable df is now a pandas DataFrame with the information contained in the csv file. You can examine the DataFrame using the .head() method. This shows the first 5 rows stored in the DataFrame. The first row of the file was used for column headers.

Methods vs Functions

In this lesson, we use the syntax `variable.SOMETHING` often. When we do this, we are accessing special information or functions that act on variables. If the syntax is `variable.SOMETHING()`, with parenthesis it is a "method", or function that acts on the variable. For example, if variable is a list, you might use the method variable.append(item) to add an item to the end of the list.

If it doesn’t have parenthesis, it is an “attribute” or data associated with the variable.

df.head()
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered
0 1 H Hydrogen 1.008000 FFFFFF 1s1 2.20 120.0 13.598 0.754 +1, -1 Gas 13.81 20.28 0.000090 Nonmetal 1766
1 2 He Helium 4.002600 D9FFFF 1s2 NaN 140.0 24.587 NaN 0 Gas 0.95 4.22 0.000179 Noble gas 1868
2 3 Li Lithium 7.000000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.00 0.534000 Alkali metal 1817
3 4 Be Beryllium 9.012183 C2FF00 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.00 1.850000 Alkaline earth metal 1798
4 5 B Boron 10.810000 FFB5B5 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.00 2.370000 Metalloid 1808

The .info function will give information about the columns and the data type of those columns. The data type will become very important later as we work with more data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AtomicNumber           118 non-null    int64  
 1   Symbol                 118 non-null    object 
 2   Name                   118 non-null    object 
 3   AtomicMass             118 non-null    float64
 4   CPKHexColor            108 non-null    object 
 5   ElectronConfiguration  118 non-null    object 
 6   Electronegativity      95 non-null     float64
 7   AtomicRadius           113 non-null    float64
 8   IonizationEnergy       102 non-null    float64
 9   ElectronAffinity       57 non-null     float64
 10  OxidationStates        103 non-null    object 
 11  StandardState          118 non-null    object 
 12  MeltingPoint           103 non-null    float64
 13  BoilingPoint           93 non-null     float64
 14  Density                96 non-null     float64
 15  GroupBlock             118 non-null    object 
 16  YearDiscovered         118 non-null    object 
dtypes: float64(8), int64(1), object(8)
memory usage: 15.8+ KB

This output will show information about each column of data. Pandas assigns data types to columns, and will do its best to decide the data column for each column based on what is in the column. We can also see how many values are in each column.

We can also see descriptive statisics easily using the .describe() command.

df.describe()
AtomicNumber AtomicMass Electronegativity AtomicRadius IonizationEnergy ElectronAffinity MeltingPoint BoilingPoint Density
count 118.000000 118.000000 95.000000 113.000000 102.000000 57.000000 103.000000 93.000000 96.000000
mean 59.500000 146.607635 1.732316 201.902655 7.997255 1.072140 1273.740553 2536.212473 7.608001
std 34.207699 89.845304 0.635187 42.025707 3.339066 0.879163 888.853859 1588.410919 5.878692
min 1.000000 1.008000 0.700000 120.000000 3.894000 0.079000 0.950000 4.220000 0.000090
25% 30.250000 66.480000 1.290000 180.000000 6.020500 0.470000 516.040000 1180.000000 2.572500
50% 59.500000 142.573850 1.620000 202.000000 6.960000 0.754000 1191.000000 2792.000000 7.072000
75% 88.750000 226.777165 2.170000 229.000000 8.998500 1.350000 1806.500000 3618.000000 10.275250
max 118.000000 294.214000 3.980000 348.000000 24.587000 3.617000 3823.000000 5869.000000 22.570000

The describe function lists the mean, max, min, standard deviation and percentiles for each column excluding NaN (not a number) values.

Accessing Data in DataFrames#

One way to get information in a data frame is by using the headers, or the column names using squre brackets. The synatx for this is

dataframe["column_name"]

For example, to get the column with information about the atomic symbol, we would use the “Symbol” column.

df["Symbol"]
0       H
1      He
2      Li
3      Be
4       B
       ..
113    Fl
114    Mc
115    Lv
116    Ts
117    Og
Name: Symbol, Length: 118, dtype: object

If you want to select multiple columns, you use a list of column names in square brackets.

df[["Symbol", "AtomicNumber"]]
Symbol AtomicNumber
0 H 1
1 He 2
2 Li 3
3 Be 4
4 B 5
... ... ...
113 Fl 114
114 Mc 115
115 Lv 116
116 Ts 117
117 Og 118

118 rows × 2 columns

Get data using number index#

If we want to get information in the dataframe using row and column numbers, we use the iloc function.

The syntax for iloc is

dataframe.iloc[row_number, column_number]

If you specify only a row number, you will get all the columns.

# This will get the first row, all of the columns.
df.iloc[0]
AtomicNumber                    1
Symbol                          H
Name                     Hydrogen
AtomicMass                  1.008
CPKHexColor                FFFFFF
ElectronConfiguration         1s1
Electronegativity             2.2
AtomicRadius                120.0
IonizationEnergy           13.598
ElectronAffinity            0.754
OxidationStates            +1, -1
StandardState                 Gas
MeltingPoint                13.81
BoilingPoint                20.28
Density                   0.00009
GroupBlock               Nonmetal
YearDiscovered               1766
Name: 0, dtype: object
# This will get the first row and the second column.
df.iloc[0, 1]
'H'

Exercise

How would you get the ElectronConfiguration column?

How would you get the value in row index 10 of the ElectronConfiguration column?

Performing Calculations#

You can do mathematical operations on entire columns or rows of pandas dataframes using single lines of code. For example, if we wanted to subtract 273.15 from our melting point column, we could do so with the following like of code.

df["MeltingPoint"] - 273.15
0      -259.34
1      -272.20
2       180.50
3      1286.85
4      2074.85
        ...   
113        NaN
114        NaN
115        NaN
116        NaN
117        NaN
Name: MeltingPoint, Length: 118, dtype: float64

To save your calculation in a new column in your dataframe, use the syntax

df["new_column_name"] = CALCULATION
df["MeltingPointC"] = df["MeltingPoint"] - 273.15

If you have a more complicated action you’d like to peform for every column, you can use the syntax

df["column_name"].apply(FUNCTION)

where FUNCTION is a function that exists (like len) or a function that you’ve defined. Consider a function to calculate the temperature in Farhenheit from Celsius.

def kelvin_to_fahrenheit(kelvin_temp):
    fahrenheit = (kelvin_temp - 273.15) * 9/5 +32
    return fahrenheit

If you wanted to apply this function to every row, your first instinct might be to write a for loop. This would work, but pandas has a built in method called apply to easily allow you to do this.

When you call the apply method, you give it a function name which you would like to apply to every element of whatever you are using it on.

# Calculate the boiling point in fahrenheit
df['BoilingPoint'].apply(kelvin_to_fahrenheit)
0      -423.166
1      -452.074
2      2447.330
3      4479.530
4      7231.730
         ...   
113         NaN
114         NaN
115         NaN
116         NaN
117         NaN
Name: BoilingPoint, Length: 118, dtype: float64

Saving a new DataFrame#

If you wanted to save your data to a csv, you could do it using the method to_csv.

df.to_csv('periodic_data_processed.csv')

Filtering Data#

We can also filter data using signs like greater than >, less than <, or equal to ==.

In the cell below, we check whether or not each boiling point is greater than 500. If it is, True is returned, if not False.

df["BoilingPoint"] > 500
0      False
1      False
2       True
3       True
4       True
       ...  
113    False
114    False
115    False
116    False
117    False
Name: BoilingPoint, Length: 118, dtype: bool

We can use this as an index or slice, similar to how we learned with lists. We can see from the output that this gives us 80 elements.

df[df["BoilingPoint"]>500]
AtomicNumber Symbol Name AtomicMass CPKHexColor ElectronConfiguration Electronegativity AtomicRadius IonizationEnergy ElectronAffinity OxidationStates StandardState MeltingPoint BoilingPoint Density GroupBlock YearDiscovered MeltingPointC
2 3 Li Lithium 7.000000 CC80FF [He]2s1 0.98 182.0 5.392 0.618 +1 Solid 453.65 1615.0 0.534 Alkali metal 1817 180.50
3 4 Be Beryllium 9.012183 C2FF00 [He]2s2 1.57 153.0 9.323 NaN +2 Solid 1560.00 2744.0 1.850 Alkaline earth metal 1798 1286.85
4 5 B Boron 10.810000 FFB5B5 [He]2s2 2p1 2.04 192.0 8.298 0.277 +3 Solid 2348.00 4273.0 2.370 Metalloid 1808 2074.85
5 6 C Carbon 12.011000 909090 [He]2s2 2p2 2.55 170.0 11.260 1.263 +4, +2, -4 Solid 3823.00 4098.0 2.267 Nonmetal Ancient 3549.85
10 11 Na Sodium 22.989769 AB5CF2 [Ne]3s1 0.93 227.0 5.139 0.548 +1 Solid 370.95 1156.0 0.970 Alkali metal 1807 97.80
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
91 92 U Uranium 238.028900 008FFF [Rn]7s2 5f3 6d1 1.38 240.0 6.194 NaN +6, +5, +4, +3 Solid 1408.00 4404.0 18.950 Actinide 1789 1134.85
92 93 Np Neptunium 237.048170 0080FF [Rn]7s2 5f4 6d1 1.36 221.0 6.266 NaN +6, +5, +4, +3 Solid 917.00 4175.0 20.250 Actinide 1940 643.85
93 94 Pu Plutonium 244.064200 006BFF [Rn]7s2 5f6 1.28 243.0 6.060 NaN +6, +5, +4, +3 Solid 913.00 3501.0 19.840 Actinide 1940 639.85
94 95 Am Americium 243.061380 545CF2 [Rn]7s2 5f7 1.30 244.0 5.993 NaN +6, +5, +4, +3 Solid 1449.00 2284.0 13.690 Actinide 1944 1175.85
95 96 Cm Curium 247.070350 785CE3 [Rn]7s2 5f7 6d1 1.30 245.0 6.020 NaN +3 Solid 1618.00 3400.0 13.510 Actinide 1944 1344.85

80 rows × 18 columns

Exercise

Create a filter for elements having a melting point below 100.

Grouping Data#

You can perform operations like grouping data by category using pands. For this analysis, we will group elements by their standard state.

grouped_data = df.groupby(by="StandardState")

This is now essentially a group of dataframes. If we use describe, we will get separate statistics for each group.

grouped_data.describe()
AtomicNumber AtomicMass ... Density MeltingPointC
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
StandardState
Expected to be a Gas 1.0 118.000000 NaN 118.0 118.00 118.0 118.00 118.0 1.0 294.214000 ... NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN
Expected to be a Solid 8.0 113.500000 2.449490 110.0 111.75 113.5 115.25 117.0 8.0 287.687125 ... NaN NaN 0.0 NaN NaN NaN NaN NaN NaN NaN
Gas 11.0 22.545455 26.322476 1.0 7.50 10.0 27.00 86.0 11.0 53.332053 ... 0.003474 0.00973 11.0 -187.244545 67.941536 -272.20 -234.1050 -210.000 -134.5750 -71.15
Liquid 2.0 57.500000 31.819805 35.0 46.25 57.5 68.75 80.0 2.0 140.245000 ... 10.927700 13.53360 2.0 -23.015000 22.365787 -38.83 -30.9225 -23.015 -15.1075 -7.20
Solid 96.0 58.666667 30.275808 3.0 32.75 59.5 84.25 109.0 96.0 144.133827 ... 11.171000 22.57000 90.0 1168.517189 823.705160 26.85 640.8500 1050.350 1552.3875 3549.85

5 rows × 80 columns

Similarly, we can see statistics for each group when using a method like .mean().

grouped_data["BoilingPoint"].mean()
StandardState
Expected to be a Gas              NaN
Expected to be a Solid            NaN
Gas                        102.452727
Liquid                     480.915000
Solid                     2922.236875
Name: BoilingPoint, dtype: float64

Exercise

Group elements by GroupBlock and calculate statistics.

After grouping, select the electronegativity data. Are the trends what you would expect?