Introduction to Pandas#

Overview

Questions:

How do I import a library?
How do I analyze tabular data using Python?

Objectives:

Use pandas to examine and analyze data.

Pandas is another Python package which is very popular for data analysis. The key data type of pandas is the dataframe. In this lesson, we will cover pandas dataframes and some basic analysis.

Pandas is capable of handling data of lots of different types. In fact, it is often used to work with data that would usually be in a spreadsheet. It is designed to make working with “relational” or “labeled” data easy and intuitive. Central to the pandas package are the special data structures called pandas Series and DataFrames. Pandas dataframes are 2 dimensional and tabular, and is particularly suited to data which is heterogenous and in columns. In fact, there are even functions which allow you to read data directly from excel spreadsheets or SQL databases.

Importing Libraries#

For this notebook, we will be reading and analyzing data using pandas. These are not part of standard Python, and we must import the package to use it.

import pandas as pd

Now, if we want to use functions from plotly or pandas, we will do pd.function_name or px.function_name

In this notebook, we will be working with a data set that contains information about the elements in the periodic table. The data is a csv (comma separated value) file from PubChem in your data folder called PubChemElements_all.csv.

This is a comma-separated value file, and we will use the pandas function read_csv. You can read more about this function in your Jupyter by typing pd.read_csv in the cell below, then right clicking and choosing “Show Contextual Help”.

Python Documentation

Most popular Python libraries have very good online documentation. You can find the pandas documentation by googling "pandas docs". You will be able to find the same help message you get for `read_csv` as well as tutorials and other types of documentation.

It is common to name the variable from the read_csv function df. This is short for DataFrame. Since this file is relatively simple, we do not need any additional arguments to the function. The read_csv function reads in tabular data which is comma delimited by default.

df = pd.read_csv("data/PubChemElements_all.csv")

Examining the data#

The variable df is now a pandas DataFrame with the information contained in the csv file. You can examine the DataFrame using the .head() method. This shows the first 5 rows stored in the DataFrame. The first row of the file was used for column headers.

Methods vs Functions

In this lesson, we use the syntax `variable.SOMETHING` often. When we do this, we are accessing special information or functions that act on variables. If the syntax is `variable.SOMETHING()`, with parenthesis it is a "method", or function that acts on the variable. For example, if variable is a list, you might use the method variable.append(item) to add an item to the end of the list.

If it doesn’t have parenthesis, it is an “attribute” or data associated with the variable.

df.head()

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered
0	1	H	Hydrogen	1.008000	FFFFFF	1s1	2.20	120.0	13.598	0.754	+1, -1	Gas	13.81	20.28	0.000090	Nonmetal	1766
1	2	He	Helium	4.002600	D9FFFF	1s2	NaN	140.0	24.587	NaN	0	Gas	0.95	4.22	0.000179	Noble gas	1868
2	3	Li	Lithium	7.000000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.00	0.534000	Alkali metal	1817
3	4	Be	Beryllium	9.012183	C2FF00	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.00	1.850000	Alkaline earth metal	1798
4	5	B	Boron	10.810000	FFB5B5	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.00	2.370000	Metalloid	1808

The .info function will give information about the columns and the data type of those columns. The data type will become very important later as we work with more data.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 AtomicNumber           118 non-null    int64  
 Symbol                 118 non-null    object 
 Name                   118 non-null    object 
 AtomicMass             118 non-null    float64
 CPKHexColor            108 non-null    object 
 ElectronConfiguration  118 non-null    object 
 Electronegativity      95 non-null     float64
 AtomicRadius           113 non-null    float64
 IonizationEnergy       102 non-null    float64
 ElectronAffinity       57 non-null     float64
OxidationStates        103 non-null    object 
StandardState          118 non-null    object 
MeltingPoint           103 non-null    float64
BoilingPoint           93 non-null     float64
Density                96 non-null     float64
GroupBlock             118 non-null    object 
YearDiscovered         118 non-null    object 
dtypes: float64(8), int64(1), object(8)
memory usage: 15.8+ KB

This output will show information about each column of data. Pandas assigns data types to columns, and will do its best to decide the data column for each column based on what is in the column. We can also see how many values are in each column.

We can also see descriptive statisics easily using the .describe() command.

df.describe()

	AtomicNumber	AtomicMass	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	MeltingPoint	BoilingPoint	Density
count	118.000000	118.000000	95.000000	113.000000	102.000000	57.000000	103.000000	93.000000	96.000000
mean	59.500000	146.607635	1.732316	201.902655	7.997255	1.072140	1273.740553	2536.212473	7.608001
std	34.207699	89.845304	0.635187	42.025707	3.339066	0.879163	888.853859	1588.410919	5.878692
min	1.000000	1.008000	0.700000	120.000000	3.894000	0.079000	0.950000	4.220000	0.000090
25%	30.250000	66.480000	1.290000	180.000000	6.020500	0.470000	516.040000	1180.000000	2.572500
50%	59.500000	142.573850	1.620000	202.000000	6.960000	0.754000	1191.000000	2792.000000	7.072000
75%	88.750000	226.777165	2.170000	229.000000	8.998500	1.350000	1806.500000	3618.000000	10.275250
max	118.000000	294.214000	3.980000	348.000000	24.587000	3.617000	3823.000000	5869.000000	22.570000

The describe function lists the mean, max, min, standard deviation and percentiles for each column excluding NaN (not a number) values.

Accessing Data in DataFrames#

One way to get information in a data frame is by using the headers, or the column names using squre brackets. The synatx for this is

dataframe["column_name"]

For example, to get the column with information about the atomic symbol, we would use the “Symbol” column.

df["Symbol"]

     H
    He
    Li
    Be
     B
       ..
  Fl
  Mc
  Lv
  Ts
  Og
Name: Symbol, Length: 118, dtype: object

If you want to select multiple columns, you use a list of column names in square brackets.

df[["Symbol", "AtomicNumber"]]

	Symbol	AtomicNumber
0	H	1
1	He	2
2	Li	3
3	Be	4
4	B	5
...	...	...
113	Fl	114
114	Mc	115
115	Lv	116
116	Ts	117
117	Og	118

118 rows × 2 columns

Get data using number index#

If we want to get information in the dataframe using row and column numbers, we use the iloc function.

The syntax for iloc is

dataframe.iloc[row_number, column_number]

If you specify only a row number, you will get all the columns.

# This will get the first row, all of the columns.
df.iloc[0]

AtomicNumber                    1
Symbol                          H
Name                     Hydrogen
AtomicMass                  1.008
CPKHexColor                FFFFFF
ElectronConfiguration         1s1
Electronegativity             2.2
AtomicRadius                120.0
IonizationEnergy           13.598
ElectronAffinity            0.754
OxidationStates            +1, -1
StandardState                 Gas
MeltingPoint                13.81
BoilingPoint                20.28
Density                   0.00009
GroupBlock               Nonmetal
YearDiscovered               1766
Name: 0, dtype: object

# This will get the first row and the second column.
df.iloc[0, 1]

'H'

Exercise

How would you get the ElectronConfiguration column?

How would you get the value in row index 10 of the ElectronConfiguration column?

Performing Calculations#

You can do mathematical operations on entire columns or rows of pandas dataframes using single lines of code. For example, if we wanted to subtract 273.15 from our melting point column, we could do so with the following like of code.

df["MeltingPoint"] - 273.15

    -259.34
    -272.20
     180.50
    1286.85
    2074.85
        ...   
      NaN
      NaN
      NaN
      NaN
      NaN
Name: MeltingPoint, Length: 118, dtype: float64

To save your calculation in a new column in your dataframe, use the syntax

df["new_column_name"] = CALCULATION

df["MeltingPointC"] = df["MeltingPoint"] - 273.15

If you have a more complicated action you’d like to peform for every column, you can use the syntax

df["column_name"].apply(FUNCTION)

where FUNCTION is a function that exists (like len) or a function that you’ve defined. Consider a function to calculate the temperature in Farhenheit from Celsius.

def kelvin_to_fahrenheit(kelvin_temp):
    fahrenheit = (kelvin_temp - 273.15) * 9/5 +32
    return fahrenheit

If you wanted to apply this function to every row, your first instinct might be to write a for loop. This would work, but pandas has a built in method called apply to easily allow you to do this.

When you call the apply method, you give it a function name which you would like to apply to every element of whatever you are using it on.

# Calculate the boiling point in fahrenheit
df['BoilingPoint'].apply(kelvin_to_fahrenheit)

    -423.166
    -452.074
    2447.330
    4479.530
    7231.730
         ...   
       NaN
       NaN
       NaN
       NaN
       NaN
Name: BoilingPoint, Length: 118, dtype: float64

Saving a new DataFrame#

If you wanted to save your data to a csv, you could do it using the method to_csv.

df.to_csv('periodic_data_processed.csv')

Filtering Data#

We can also filter data using signs like greater than >, less than <, or equal to ==.

In the cell below, we check whether or not each boiling point is greater than 500. If it is, True is returned, if not False.

df["BoilingPoint"] > 500

    False
    False
     True
     True
     True
       ...  
  False
  False
  False
  False
  False
Name: BoilingPoint, Length: 118, dtype: bool

We can use this as an index or slice, similar to how we learned with lists. We can see from the output that this gives us 80 elements.

df[df["BoilingPoint"]>500]

	AtomicNumber	Symbol	Name	AtomicMass	CPKHexColor	ElectronConfiguration	Electronegativity	AtomicRadius	IonizationEnergy	ElectronAffinity	OxidationStates	StandardState	MeltingPoint	BoilingPoint	Density	GroupBlock	YearDiscovered	MeltingPointC
2	3	Li	Lithium	7.000000	CC80FF	[He]2s1	0.98	182.0	5.392	0.618	+1	Solid	453.65	1615.0	0.534	Alkali metal	1817	180.50
3	4	Be	Beryllium	9.012183	C2FF00	[He]2s2	1.57	153.0	9.323	NaN	+2	Solid	1560.00	2744.0	1.850	Alkaline earth metal	1798	1286.85
4	5	B	Boron	10.810000	FFB5B5	[He]2s2 2p1	2.04	192.0	8.298	0.277	+3	Solid	2348.00	4273.0	2.370	Metalloid	1808	2074.85
5	6	C	Carbon	12.011000	909090	[He]2s2 2p2	2.55	170.0	11.260	1.263	+4, +2, -4	Solid	3823.00	4098.0	2.267	Nonmetal	Ancient	3549.85
10	11	Na	Sodium	22.989769	AB5CF2	[Ne]3s1	0.93	227.0	5.139	0.548	+1	Solid	370.95	1156.0	0.970	Alkali metal	1807	97.80
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
91	92	U	Uranium	238.028900	008FFF	[Rn]7s2 5f3 6d1	1.38	240.0	6.194	NaN	+6, +5, +4, +3	Solid	1408.00	4404.0	18.950	Actinide	1789	1134.85
92	93	Np	Neptunium	237.048170	0080FF	[Rn]7s2 5f4 6d1	1.36	221.0	6.266	NaN	+6, +5, +4, +3	Solid	917.00	4175.0	20.250	Actinide	1940	643.85
93	94	Pu	Plutonium	244.064200	006BFF	[Rn]7s2 5f6	1.28	243.0	6.060	NaN	+6, +5, +4, +3	Solid	913.00	3501.0	19.840	Actinide	1940	639.85
94	95	Am	Americium	243.061380	545CF2	[Rn]7s2 5f7	1.30	244.0	5.993	NaN	+6, +5, +4, +3	Solid	1449.00	2284.0	13.690	Actinide	1944	1175.85
95	96	Cm	Curium	247.070350	785CE3	[Rn]7s2 5f7 6d1	1.30	245.0	6.020	NaN	+3	Solid	1618.00	3400.0	13.510	Actinide	1944	1344.85

80 rows × 18 columns

Exercise

Create a filter for elements having a melting point below 100.

Grouping Data#

You can perform operations like grouping data by category using pands. For this analysis, we will group elements by their standard state.

grouped_data = df.groupby(by="StandardState")

This is now essentially a group of dataframes. If we use describe, we will get separate statistics for each group.

grouped_data.describe()

	AtomicNumber								AtomicMass		...	Density		MeltingPointC
	count	mean	std	min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std	min	25%	50%	75%	max
StandardState
Expected to be a Gas	1.0	118.000000	NaN	118.0	118.00	118.0	118.00	118.0	1.0	294.214000	...	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Expected to be a Solid	8.0	113.500000	2.449490	110.0	111.75	113.5	115.25	117.0	8.0	287.687125	...	NaN	NaN	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Gas	11.0	22.545455	26.322476	1.0	7.50	10.0	27.00	86.0	11.0	53.332053	...	0.003474	0.00973	11.0	-187.244545	67.941536	-272.20	-234.1050	-210.000	-134.5750	-71.15
Liquid	2.0	57.500000	31.819805	35.0	46.25	57.5	68.75	80.0	2.0	140.245000	...	10.927700	13.53360	2.0	-23.015000	22.365787	-38.83	-30.9225	-23.015	-15.1075	-7.20
Solid	96.0	58.666667	30.275808	3.0	32.75	59.5	84.25	109.0	96.0	144.133827	...	11.171000	22.57000	90.0	1168.517189	823.705160	26.85	640.8500	1050.350	1552.3875	3549.85

5 rows × 80 columns

Similarly, we can see statistics for each group when using a method like .mean().

grouped_data["BoilingPoint"].mean()

StandardState
Expected to be a Gas              NaN
Expected to be a Solid            NaN
Gas                        102.452727
Liquid                     480.915000
Solid                     2922.236875
Name: BoilingPoint, dtype: float64

Exercise

Group elements by GroupBlock and calculate statistics.

After grouping, select the electronegativity data. Are the trends what you would expect?

	Symbol	AtomicNumber
0	H	1
1	He	2
2	Li	3
3	Be	4
4	B	5
...	...	...
113	Fl	114
114	Mc	115
115	Lv	116
116	Ts	117
117	Og	118

	Symbol	AtomicNumber
0	H	1
1	He	2
2	Li	3
3	Be	4
4	B	5
...	...	...
113	Fl	114
114	Mc	115
115	Lv	116
116	Ts	117
117	Og	118

	Symbol	AtomicNumber
0	H	1
1	He	2
2	Li	3
3	Be	4
4	B	5
...	...	...
113	Fl	114
114	Mc	115
115	Lv	116
116	Ts	117
117	Og	118