{ "cells": [ { "cell_type": "markdown", "id": "1a057f29-e7ec-4732-8b32-3969fc0b95e7", "metadata": {}, "source": [ "# Introduction to Pandas\n", "\n", "
\n", "

Overview

\n", "\n", "Questions:\n", "\n", "* How do I import a library?\n", "\n", "* How do I analyze tabular data using Python?\n", "\n", "\n", "Objectives:\n", "\n", "* Use pandas to examine and analyze data.\n", "\n", "
\n", "\n", "Pandas is another Python package which is very popular for data analysis. The key data type of pandas is the `dataframe`. In this lesson, we will cover pandas dataframes and some basic analysis.\n", "\n", "Pandas is capable of handling data of lots of different types. \n", "In fact, it is often used to work with data that would usually be in a spreadsheet.\n", "It is designed to make working with “relational” or “labeled” data easy and intuitive. \n", "Central to the pandas package are the special data structures called pandas Series and DataFrames. Pandas dataframes are 2 dimensional and tabular, and is particularly suited to data which is heterogenous and in columns. \n", "In fact, there are even functions which allow you to read data directly from excel spreadsheets or SQL databases.\n", "\n", "\n", "## Importing Libraries\n", "\n", "For this notebook, we will be reading and analyzing data using `pandas`. \n", "These are not part of standard Python, and we must import the package to use it.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "bb179bb7-6359-4269-8dd6-977b589aefa1", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "9c3064a1-9727-439d-896c-24647b036f90", "metadata": {}, "source": [ "Now, if we want to use functions from `plotly` or `pandas`, we will do `pd.function_name` or `px.function_name`" ] }, { "cell_type": "markdown", "id": "9785d3d3-0db8-45bc-b9f1-8193da8ef11a", "metadata": {}, "source": [ "In this notebook, we will be working with a data set that contains information about the elements in the periodic table.\n", "The data is a csv (comma separated value) file from PubChem in your data folder called `PubChemElements_all.csv`. \n", "\n", "This is a comma-separated value file, and we will use the pandas function `read_csv`. You can read more about this function in your Jupyter by typing `pd.read_csv` in the cell below, then right clicking and choosing \"Show Contextual Help\"." ] }, { "cell_type": "markdown", "id": "ff83568b-3119-4d1a-b7a8-4001458f8212", "metadata": {}, "source": [ "
\n", "

Python Documentation

\n", "Most popular Python libraries have very good online documentation. \n", "You can find the pandas documentation by googling \"pandas docs\".\n", "You will be able to find the same help message you get for `read_csv` as well as tutorials and other types of documentation.\n", "\n", "1. [Pandas Documentation](https://pandas.pydata.org/docs/)\n", "2. [`read_csv` documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)\n", "
\n", "\n", "\n", "It is common to name the variable from the `read_csv` function `df`. This is short for `DataFrame`.\n", "Since this file is relatively simple, we do not need any additional arguments to the function. The `read_csv` function reads in tabular data which is comma delimited by default." ] }, { "cell_type": "code", "execution_count": null, "id": "90db75c1-b33b-47fa-8717-53b5ab71e578", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/PubChemElements_all.csv\")" ] }, { "cell_type": "markdown", "id": "ca475324-1065-41bd-8e50-fe41203edd1b", "metadata": {}, "source": [ "## Examining the data\n", "\n", "The variable `df` is now a pandas DataFrame with the information contained in the csv file. You can examine the DataFrame using the `.head()` method. \n", "This shows the first 5 rows stored in the DataFrame.\n", "The first row of the file was used for column headers.\n", "\n", "
\n", "

Methods vs Functions

\n", "In this lesson, we use the syntax `variable.SOMETHING` often.\n", "When we do this, we are accessing special information or functions that act on variables. If the syntax is `variable.SOMETHING()`, with parenthesis it is a \"method\", or function that acts on the variable. For example, if variable is a list, you might use the method variable.append(item) to add an item to the end of the list.\n", "\n", "If it doesn't have parenthesis, it is an \"attribute\" or data associated with the variable.\n", "
\n" ] }, { "cell_type": "code", "execution_count": null, "id": "fe87cebf-98fe-4a74-830b-aaf9f699a9f8", "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "2edbcf30-bc71-4c20-a5f1-6c5558ad6324", "metadata": {}, "source": [ "The `.info` function will give information about the columns and the data type of those columns. The data type will become very important later as we work with more data." ] }, { "cell_type": "code", "execution_count": null, "id": "786d61bc-25bf-4e18-b1d4-8c925a2dc1a2", "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "2e71dcd4-1396-46a5-be14-69da5f7863eb", "metadata": {}, "source": [ "This output will show information about each column of data.\n", "Pandas assigns data types to columns, and will do its best to decide the data column for each column based on what is in the column. \n", "We can also see how many values are in each column.\n", "\n", "We can also see descriptive statisics easily using the `.describe()` command." ] }, { "cell_type": "code", "execution_count": null, "id": "f1db1a7f-4383-4ece-a158-c97565751d8d", "metadata": {}, "outputs": [], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "3a71dbf3-b602-482f-a1ef-869d36882859", "metadata": {}, "source": [ "The describe function lists the `mean`, `max`, `min`, standard deviation and percentiles for each column excluding `NaN` (not a number) values." ] }, { "cell_type": "markdown", "id": "1322e933-1b6f-44b3-9676-5d71655d2fcc", "metadata": {}, "source": [ "## Accessing Data in DataFrames\n", "\n", "One way to get information in a data frame is by using the headers, or the column names using squre brackets. The synatx for this is\n", "\n", "```python\n", "dataframe[\"column_name\"]\n", "```\n", "\n", "For example, to get the column with information about the atomic symbol, we would use the \"Symbol\" column." ] }, { "cell_type": "code", "execution_count": null, "id": "686790ed-090a-4e4d-8c97-9e2266405e39", "metadata": {}, "outputs": [], "source": [ "df[\"Symbol\"]" ] }, { "cell_type": "markdown", "id": "8637e4e1-95f7-495a-aa14-b2e679c3d31f", "metadata": {}, "source": [ "If you want to select multiple columns, you use a list of column names in square brackets." ] }, { "cell_type": "code", "execution_count": null, "id": "426458d8-bf22-477f-bd89-02c44d6c22c7", "metadata": {}, "outputs": [], "source": [ "df[[\"Symbol\", \"AtomicNumber\"]]" ] }, { "cell_type": "markdown", "id": "9dbaf1ec-a88d-4c1f-adbe-0724a2160acc", "metadata": {}, "source": [ "## Get data using number index\n", "\n", "If we want to get information in the dataframe using row and column numbers, we use the `iloc` function.\n", "\n", "The syntax for `iloc` is\n", "\n", "```python\n", "dataframe.iloc[row_number, column_number]\n", "```\n", "\n", "If you specify only a row number, you will get all the columns." ] }, { "cell_type": "code", "execution_count": null, "id": "4591d889-41e0-4072-b556-afe747239561", "metadata": {}, "outputs": [], "source": [ "# This will get the first row, all of the columns.\n", "df.iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "1ef5d63c-5804-49d1-8e09-f0d3840a34aa", "metadata": {}, "outputs": [], "source": [ "# This will get the first row and the second column.\n", "df.iloc[0, 1]" ] }, { "cell_type": "markdown", "id": "f75dc5db-1791-477d-919c-ae41f5ed5019", "metadata": {}, "source": [ "
\n", "

Exercise

\n", "\n", "How would you get the `ElectronConfiguration` column?\n", "\n", "How would you get the value in row index 10 of the `ElectronConfiguration` column?\n", "
" ] }, { "cell_type": "markdown", "id": "e7023225-332d-4fcd-a5c7-1a6272d7ad35", "metadata": {}, "source": [ "## Performing Calculations\n", "You can do mathematical operations on entire columns or rows of `pandas` dataframes using single lines of code.\n", "For example, if we wanted to subtract 273.15 from our melting point column, we could do so with the following like of code." ] }, { "cell_type": "code", "execution_count": null, "id": "baabb9e0-9ab1-4c39-80ed-e2542b951ea0", "metadata": {}, "outputs": [], "source": [ "df[\"MeltingPoint\"] - 273.15" ] }, { "cell_type": "markdown", "id": "22bf63b6-5cb9-47d1-9540-6222f9281752", "metadata": {}, "source": [ "To save your calculation in a new column in your dataframe,\n", "use the syntax\n", "\n", "```python\n", "df[\"new_column_name\"] = CALCULATION\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "0f487484-c051-4c4b-849f-dc1b5e124250", "metadata": {}, "outputs": [], "source": [ "df[\"MeltingPointC\"] = df[\"MeltingPoint\"] - 273.15" ] }, { "cell_type": "markdown", "id": "138e168b-fc3a-47c6-b55d-6bd935adcbf5", "metadata": {}, "source": [ "If you have a more complicated action you'd like to peform for every column, you can use the syntax\n", "\n", "```python\n", "df[\"column_name\"].apply(FUNCTION)\n", "```\n", "\n", "where FUNCTION is a function that exists (like `len`) or a function that you've defined.\n", "Consider a function to calculate the temperature in Farhenheit from Celsius." ] }, { "cell_type": "code", "execution_count": null, "id": "c0e53b49-3e1f-439b-a817-c58d9c5e7b57", "metadata": {}, "outputs": [], "source": [ "def kelvin_to_fahrenheit(kelvin_temp):\n", " fahrenheit = (kelvin_temp - 273.15) * 9/5 +32\n", " return fahrenheit" ] }, { "cell_type": "markdown", "id": "cd1906f1-b00b-4bf9-ac3e-f48f5f53bef3", "metadata": {}, "source": [ "If you wanted to apply this function to every row, your first instinct might be to write a for loop. This would work, but pandas has a built in method called `apply` to easily allow you to do this.\n", "\n", "When you call the apply method, you give it a function name which you would like to apply to every element of whatever you are using it on." ] }, { "cell_type": "code", "execution_count": null, "id": "25becb40-1c15-44d5-9f18-9efcfb6fa992", "metadata": {}, "outputs": [], "source": [ "# Calculate the boiling point in fahrenheit\n", "df['BoilingPoint'].apply(kelvin_to_fahrenheit)" ] }, { "cell_type": "markdown", "id": "1d2cdf33-e00e-4944-8b94-38396d64ae99", "metadata": {}, "source": [ "## Saving a new DataFrame\n", "\n", "If you wanted to save your data to a csv, you could do it using the method `to_csv`." ] }, { "cell_type": "code", "execution_count": null, "id": "15a97282-914f-43dd-8526-cfa143c27fa3", "metadata": {}, "outputs": [], "source": [ "df.to_csv('periodic_data_processed.csv')" ] }, { "cell_type": "markdown", "id": "16b51f13-a5af-4112-87f0-a279a743fa08", "metadata": {}, "source": [ "## Filtering Data\n", "\n", "We can also filter data using signs like greater than `>`, less than `<`, or equal to `==`.\n", "\n", "In the cell below, we check whether or not each boiling point is greater than 500.\n", "If it is, `True` is returned, if not `False`.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "98bcb909-6a69-43cf-a59d-8c5b4f3ed843", "metadata": {}, "outputs": [], "source": [ "df[\"BoilingPoint\"] > 500" ] }, { "cell_type": "markdown", "id": "416bf350-ea94-4b04-b11a-2844450eba87", "metadata": {}, "source": [ "We can use this as an index or slice, similar to how we learned with lists.\n", "We can see from the output that this gives us 80 elements." ] }, { "cell_type": "code", "execution_count": null, "id": "9a3c5f03-af97-467d-b4e7-47e3736193df", "metadata": {}, "outputs": [], "source": [ "df[df[\"BoilingPoint\"]>500]" ] }, { "cell_type": "markdown", "id": "00dfee02-cbd3-4e2f-9f85-e952921ed129", "metadata": {}, "source": [ "
\n", "

Exercise

\n", "\n", "Create a filter for elements having a melting point below 100.\n", "
" ] }, { "cell_type": "markdown", "id": "5648a766-9cbc-4928-9cd5-28099e6c49bf", "metadata": {}, "source": [ "## Grouping Data\n", "\n", "You can perform operations like grouping data by category using pands.\n", "For this analysis, we will group elements by their standard state." ] }, { "cell_type": "code", "execution_count": null, "id": "db673821-8bd9-4b48-bd54-73bf2bf71427", "metadata": {}, "outputs": [], "source": [ "grouped_data = df.groupby(by=\"StandardState\")" ] }, { "cell_type": "markdown", "id": "d06941f8-a47d-431c-b44e-7a6fcac1bcb9", "metadata": {}, "source": [ "This is now essentially a group of dataframes.\n", "If we use `describe`, we will get separate statistics for each group." ] }, { "cell_type": "code", "execution_count": null, "id": "e2eca552-3c08-4001-9840-c10991a2e3fb", "metadata": {}, "outputs": [], "source": [ "grouped_data.describe()" ] }, { "cell_type": "markdown", "id": "4f0c06f0-afc6-49e9-8503-e19dc1f893e6", "metadata": {}, "source": [ "Similarly, we can see statistics for each group when using a method like `.mean()`." ] }, { "cell_type": "code", "execution_count": null, "id": "ac98fc2c-9407-4443-9ea0-708ebd2e4813", "metadata": {}, "outputs": [], "source": [ "grouped_data[\"BoilingPoint\"].mean()" ] }, { "cell_type": "markdown", "id": "2770db46-ebcf-43df-a9e5-ca01f0b29eb0", "metadata": {}, "source": [ "
\n", "

Exercise

\n", "\n", "Group elements by GroupBlock and calculate statistics.\n", "\n", "After grouping, select the electronegativity data. Are the trends what you would expect?\n", "\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }