{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e072e018",
   "metadata": {},
   "source": [
    "Working with Pandas DataFrames\n",
    "===============================\n",
    "\n",
    "``````{admonition} Overview\n",
    ":class: overview\n",
    "\n",
    "Questions:\n",
    "\n",
    "* How can I use pandas to examine data?\n",
    "\n",
    "Objectives:\n",
    "\n",
    "* Load a csv file into a pandas dataframe\n",
    "\n",
    "* Learn the different ways to access information in a dataframe.\n",
    "\n",
    "* Learn how to perform calculations on a column of data.\n",
    "\n",
    "* Learn about built-in plotting in a dataframe.\n",
    "\n",
    "``````"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cf44a96a",
   "metadata": {},
   "source": [
    "Pandas is a Python library used for data analysis and manipulation. Within the world of data science, it is a ubiquitous and widely used library. If you are learning how to analyze data in Python, it will be almost impossible to avoid pandas. \n",
    "\n",
    "For this workshop, we will be using two types of data structues - NumPy arrays and pandas DataFrames. Pandas dataframes are specifically for data which is two dimensional (rows and columns). NumPy arrays, while similar in some ways, are meant to work with higher dimensional data. We'll see this in a future lesson.\n",
    "\n",
    "Pandas is very powerful. In this session, we'll be learning how to access information in pandas dataframes and how to do some basic manipulation and analysis. The first step in using pandas is importing it. Typically when pandas is imported, it is shortened to `pd`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 173,
   "id": "853340f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5298ab99",
   "metadata": {},
   "source": [
    "The first data set we are looking at in this workshop is solubility data. \n",
    "This is a data set which gives molecular descriptors along with solubility calculated using the ESOL method. \n",
    "The data was obtained from [moleculenet](https://moleculenet.org/datasets-1) and you can read more about the method [here](https://www.moreisdifferent.com/assets/DIY_drug_discovery/Delaney_paper.pdf). \n",
    "The dataset has been downloaded and placed in your `data` folder.\n",
    "\n",
    "Later in the workshop, we'll be seing how we can fit this model ourselves using Python, but for now we are just going to examine the data using pandas. \n",
    "The data is saved in a csv file. CSV stands for \"comma separated values\". \n",
    "This is a file format where commas separate values. \n",
    "You can also typically read files like this into excel. \n",
    "In fact, initially in this session you may not yet see what advantages pandas offers over Excel. \n",
    "However, functionalites in pandas will be more useful once we start cleaning and fitting data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 174,
   "id": "6f093772",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(\"data/delaney-processed.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1443798a",
   "metadata": {},
   "source": [
    "## Examining the data\n",
    "\n",
    "Now that we have used the pandas `read_csv` function, our data is in a variable called `df`. This variable is a **pandas dataframe**. This means it has lots of special functionalities thanks to pandas. For example, you can preview the first 5 rows using the `.head` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 175,
   "id": "23ed7e5d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>ESOL predicted log solubility in mols per litre</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured log solubility in mols per litre</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Amigdalin</td>\n",
       "      <td>-0.974</td>\n",
       "      <td>1</td>\n",
       "      <td>457.432</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>202.32</td>\n",
       "      <td>-0.77</td>\n",
       "      <td>OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Fenfuram</td>\n",
       "      <td>-2.885</td>\n",
       "      <td>1</td>\n",
       "      <td>201.225</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>42.24</td>\n",
       "      <td>-3.30</td>\n",
       "      <td>Cc1occc1C(=O)Nc2ccccc2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>citral</td>\n",
       "      <td>-2.579</td>\n",
       "      <td>1</td>\n",
       "      <td>152.237</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>17.07</td>\n",
       "      <td>-2.06</td>\n",
       "      <td>CC(C)=CCCC(C)=CC(=O)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Picene</td>\n",
       "      <td>-6.618</td>\n",
       "      <td>2</td>\n",
       "      <td>278.354</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>-7.87</td>\n",
       "      <td>c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Thiophene</td>\n",
       "      <td>-2.232</td>\n",
       "      <td>2</td>\n",
       "      <td>84.143</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>-1.33</td>\n",
       "      <td>c1ccsc1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Compound ID  ESOL predicted log solubility in mols per litre  \\\n",
       "0   Amigdalin                                           -0.974   \n",
       "1    Fenfuram                                           -2.885   \n",
       "2      citral                                           -2.579   \n",
       "3      Picene                                           -6.618   \n",
       "4   Thiophene                                           -2.232   \n",
       "\n",
       "   Minimum Degree  Molecular Weight  Number of H-Bond Donors  Number of Rings  \\\n",
       "0               1           457.432                        7                3   \n",
       "1               1           201.225                        1                2   \n",
       "2               1           152.237                        0                0   \n",
       "3               2           278.354                        0                5   \n",
       "4               2            84.143                        0                1   \n",
       "\n",
       "   Number of Rotatable Bonds  Polar Surface Area  \\\n",
       "0                          7              202.32   \n",
       "1                          2               42.24   \n",
       "2                          4               17.07   \n",
       "3                          0                0.00   \n",
       "4                          0                0.00   \n",
       "\n",
       "   measured log solubility in mols per litre  \\\n",
       "0                                      -0.77   \n",
       "1                                      -3.30   \n",
       "2                                      -2.06   \n",
       "3                                      -7.87   \n",
       "4                                      -1.33   \n",
       "\n",
       "                                              smiles  \n",
       "0  OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...  \n",
       "1                             Cc1occc1C(=O)Nc2ccccc2  \n",
       "2                               CC(C)=CCCC(C)=CC(=O)  \n",
       "3                 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43  \n",
       "4                                            c1ccsc1  "
      ]
     },
     "execution_count": 175,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6769ca04",
   "metadata": {},
   "source": [
    "The `.info` function will give information about the columns and the data type of those columns. The data type will become very important later as we work with more data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 176,
   "id": "67ca8b13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 1128 entries, 0 to 1127\n",
      "Data columns (total 10 columns):\n",
      " #   Column                                           Non-Null Count  Dtype  \n",
      "---  ------                                           --------------  -----  \n",
      " 0   Compound ID                                      1128 non-null   object \n",
      " 1   ESOL predicted log solubility in mols per litre  1128 non-null   float64\n",
      " 2   Minimum Degree                                   1128 non-null   int64  \n",
      " 3   Molecular Weight                                 1128 non-null   float64\n",
      " 4   Number of H-Bond Donors                          1128 non-null   int64  \n",
      " 5   Number of Rings                                  1128 non-null   int64  \n",
      " 6   Number of Rotatable Bonds                        1128 non-null   int64  \n",
      " 7   Polar Surface Area                               1128 non-null   float64\n",
      " 8   measured log solubility in mols per litre        1128 non-null   float64\n",
      " 9   smiles                                           1128 non-null   object \n",
      "dtypes: float64(4), int64(4), object(2)\n",
      "memory usage: 88.2+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5c6505d9",
   "metadata": {},
   "source": [
    "Pandas assigns data types to columns, and will do its best to decide the data column for each column based on what is in the column. \n",
    "You will see that this dataframe has `1128` values in each column. \n",
    "\n",
    "We see that the column `Minimum Degree` has the data type of `int64`. Here, `int` means `integer` and `64` means `64 bit`.  The `64 bit` refers to the amount of computer memory the variable can occupy. It won't really be important for us. Similarly, `float64` means `64 bit floating point`. These are decimal numbers.\n",
    "\n",
    "The other column names which read `object` are not numeric. They might be strings or they might be something else. We'll discuss more later.\n",
    "\n",
    "The `describe` function can be used on a dataframe to quickly see statistics about columns with numerical data. If you look at the columns that statistics are computed for and compare to the data type shown from `info`, you will see that we only get statistics for columns which had `int64` or `float64` data types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 177,
   "id": "9a80a1f3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ESOL predicted log solubility in mols per litre</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured log solubility in mols per litre</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "      <td>1128.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>-2.988192</td>\n",
       "      <td>1.058511</td>\n",
       "      <td>203.937074</td>\n",
       "      <td>0.701241</td>\n",
       "      <td>1.390957</td>\n",
       "      <td>2.177305</td>\n",
       "      <td>34.872881</td>\n",
       "      <td>-3.050102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.683220</td>\n",
       "      <td>0.238560</td>\n",
       "      <td>102.738077</td>\n",
       "      <td>1.089727</td>\n",
       "      <td>1.318286</td>\n",
       "      <td>2.640974</td>\n",
       "      <td>35.383593</td>\n",
       "      <td>2.096441</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-9.702000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>16.043000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-11.600000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-3.948250</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>121.183000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-4.317500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>-2.870000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>182.179000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>26.300000</td>\n",
       "      <td>-2.860000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>-1.843750</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>270.372000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>55.440000</td>\n",
       "      <td>-1.600000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>1.091000</td>\n",
       "      <td>2.000000</td>\n",
       "      <td>780.949000</td>\n",
       "      <td>11.000000</td>\n",
       "      <td>8.000000</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>268.680000</td>\n",
       "      <td>1.580000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       ESOL predicted log solubility in mols per litre  Minimum Degree  \\\n",
       "count                                      1128.000000     1128.000000   \n",
       "mean                                         -2.988192        1.058511   \n",
       "std                                           1.683220        0.238560   \n",
       "min                                          -9.702000        0.000000   \n",
       "25%                                          -3.948250        1.000000   \n",
       "50%                                          -2.870000        1.000000   \n",
       "75%                                          -1.843750        1.000000   \n",
       "max                                           1.091000        2.000000   \n",
       "\n",
       "       Molecular Weight  Number of H-Bond Donors  Number of Rings  \\\n",
       "count       1128.000000              1128.000000      1128.000000   \n",
       "mean         203.937074                 0.701241         1.390957   \n",
       "std          102.738077                 1.089727         1.318286   \n",
       "min           16.043000                 0.000000         0.000000   \n",
       "25%          121.183000                 0.000000         0.000000   \n",
       "50%          182.179000                 0.000000         1.000000   \n",
       "75%          270.372000                 1.000000         2.000000   \n",
       "max          780.949000                11.000000         8.000000   \n",
       "\n",
       "       Number of Rotatable Bonds  Polar Surface Area  \\\n",
       "count                1128.000000         1128.000000   \n",
       "mean                    2.177305           34.872881   \n",
       "std                     2.640974           35.383593   \n",
       "min                     0.000000            0.000000   \n",
       "25%                     0.000000            0.000000   \n",
       "50%                     1.000000           26.300000   \n",
       "75%                     3.000000           55.440000   \n",
       "max                    23.000000          268.680000   \n",
       "\n",
       "       measured log solubility in mols per litre  \n",
       "count                                1128.000000  \n",
       "mean                                   -3.050102  \n",
       "std                                     2.096441  \n",
       "min                                   -11.600000  \n",
       "25%                                    -4.317500  \n",
       "50%                                    -2.860000  \n",
       "75%                                    -1.600000  \n",
       "max                                     1.580000  "
      ]
     },
     "execution_count": 177,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13c61a11",
   "metadata": {},
   "source": [
    "## Accessing Data\n",
    "\n",
    "One great thing about pandas dataframes is how easy it is to access information. Pandas allows you to access information in a dataframe using both data indexes and names.\n",
    "\n",
    "Pandas dataframes have rows and columns, you can see how many rows and columns using `.shape`. This will return the shape as `(num_rows, num_columns)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 178,
   "id": "80d9c4df",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1128, 10)"
      ]
     },
     "execution_count": 178,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e206eb98",
   "metadata": {},
   "source": [
    "### Accessing with rows and column numbers \n",
    "\n",
    "#### The `.iloc` function\n",
    "Access on a particular row and column using `.iloc` followed by square brackets and the row and column numbers you want to access. If you only put one number, it will be assumed to be the row number you want.\n",
    "\n",
    "The following gets row number 35."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 179,
   "id": "b7cff7eb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Compound ID                                        Valeraldehyde\n",
       "ESOL predicted log solubility in mols per litre           -1.103\n",
       "Minimum Degree                                                 1\n",
       "Molecular Weight                                          86.134\n",
       "Number of H-Bond Donors                                        0\n",
       "Number of Rings                                                0\n",
       "Number of Rotatable Bonds                                      3\n",
       "Polar Surface Area                                         17.07\n",
       "measured log solubility in mols per litre                  -0.85\n",
       "smiles                                                   CCCCC=O\n",
       "Name: 35, dtype: object"
      ]
     },
     "execution_count": 179,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[35]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a6149b5",
   "metadata": {},
   "source": [
    "We can also specify a number of columns by adding a second number to the slicing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 180,
   "id": "c6f9c28c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Compound ID                                        Valeraldehyde\n",
       "ESOL predicted log solubility in mols per litre           -1.103\n",
       "Minimum Degree                                                 1\n",
       "Name: 35, dtype: object"
      ]
     },
     "execution_count": 180,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[35, :3]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0f2c40e",
   "metadata": {},
   "source": [
    "### Accessing with names\n",
    "\n",
    "Each dataframe has an index (rows) and columns. The rows and columns have names. For the columns, these are indicated in `.head` by the bold row at the top of the dataframe. The row names are similarly listed in bold in the left of the dataframe. The index can be named, but by default it is usually just numbered. \n",
    "\n",
    "You can see information about the index or the columns using `df.index` or `df.columns`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 181,
   "id": "fe8421c4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RangeIndex(start=0, stop=1128, step=1)"
      ]
     },
     "execution_count": 181,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 182,
   "id": "fea32803",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Compound ID', 'ESOL predicted log solubility in mols per litre',\n",
       "       'Minimum Degree', 'Molecular Weight', 'Number of H-Bond Donors',\n",
       "       'Number of Rings', 'Number of Rotatable Bonds', 'Polar Surface Area',\n",
       "       'measured log solubility in mols per litre', 'smiles'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 182,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ba64a576",
   "metadata": {},
   "source": [
    "Some of these column names are very long, and we might not want to type them \n",
    "when we want to reference data. \n",
    "We will rename some of these columns to make accessing the data more convenient."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 183,
   "id": "34d56e51",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.rename( columns = {\n",
    "    \"ESOL predicted log solubility in mols per litre\": \"ESOL solubility (mol/L)\",\n",
    "    \"measured log solubility in mols per litre\" : \"measured solubility (mol/L)\"\n",
    "}, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1dccc85a",
   "metadata": {},
   "source": [
    "To access part of a dataframe using the index or column names, would use the `loc` function.\n",
    "\n",
    "Because our index names are just numbers, this doesn't look that different than `iloc` when we are accessing rows.\n",
    "\n",
    "#### The `.loc` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 184,
   "id": "0d700d98",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Compound ID                    Valeraldehyde\n",
       "ESOL solubility (mol/L)               -1.103\n",
       "Minimum Degree                             1\n",
       "Molecular Weight                      86.134\n",
       "Number of H-Bond Donors                    0\n",
       "Number of Rings                            0\n",
       "Number of Rotatable Bonds                  3\n",
       "Polar Surface Area                     17.07\n",
       "measured solubility (mol/L)            -0.85\n",
       "smiles                               CCCCC=O\n",
       "Name: 35, dtype: object"
      ]
     },
     "execution_count": 184,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[35]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62e0e533",
   "metadata": {},
   "source": [
    "However, we can now use column names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 185,
   "id": "da288a9b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'CCCCC=O'"
      ]
     },
     "execution_count": 185,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[35, \"smiles\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "54950b86",
   "metadata": {},
   "source": [
    "#### Using column names\n",
    "\n",
    "You can pull a column of several columns of information using the syntax\n",
    "\n",
    "```python\n",
    "df[column_name]\n",
    "```\n",
    "\n",
    "To pull several columns, do\n",
    "\n",
    "```python\n",
    "df[[colname1, colname2]]\n",
    "```\n",
    "\n",
    "For example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 186,
   "id": "a867a5a7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0            Amigdalin\n",
       "1             Fenfuram\n",
       "2               citral\n",
       "3               Picene\n",
       "4            Thiophene\n",
       "             ...      \n",
       "1123         halothane\n",
       "1124            Oxamyl\n",
       "1125         Thiometon\n",
       "1126    2-Methylbutane\n",
       "1127          Stirofos\n",
       "Name: Compound ID, Length: 1128, dtype: object"
      ]
     },
     "execution_count": 186,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Compound ID\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 187,
   "id": "15a99ad9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Amigdalin</td>\n",
       "      <td>OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Fenfuram</td>\n",
       "      <td>Cc1occc1C(=O)Nc2ccccc2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>citral</td>\n",
       "      <td>CC(C)=CCCC(C)=CC(=O)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Picene</td>\n",
       "      <td>c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Thiophene</td>\n",
       "      <td>c1ccsc1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1123</th>\n",
       "      <td>halothane</td>\n",
       "      <td>FC(F)(F)C(Cl)Br</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1124</th>\n",
       "      <td>Oxamyl</td>\n",
       "      <td>CNC(=O)ON=C(SC)C(=O)N(C)C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>Thiometon</td>\n",
       "      <td>CCSCCSP(=S)(OC)OC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1126</th>\n",
       "      <td>2-Methylbutane</td>\n",
       "      <td>CCC(C)C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1127</th>\n",
       "      <td>Stirofos</td>\n",
       "      <td>COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1128 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Compound ID                                             smiles\n",
       "0          Amigdalin  OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...\n",
       "1           Fenfuram                             Cc1occc1C(=O)Nc2ccccc2\n",
       "2             citral                               CC(C)=CCCC(C)=CC(=O)\n",
       "3             Picene                 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43\n",
       "4          Thiophene                                            c1ccsc1\n",
       "...              ...                                                ...\n",
       "1123       halothane                                   FC(F)(F)C(Cl)Br \n",
       "1124          Oxamyl                          CNC(=O)ON=C(SC)C(=O)N(C)C\n",
       "1125       Thiometon                                  CCSCCSP(=S)(OC)OC\n",
       "1126  2-Methylbutane                                            CCC(C)C\n",
       "1127        Stirofos              COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl\n",
       "\n",
       "[1128 rows x 2 columns]"
      ]
     },
     "execution_count": 187,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[[\"Compound ID\", \"smiles\"]]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "185e0dd7",
   "metadata": {},
   "source": [
    "## Filtering Data\n",
    "\n",
    "There are two ways filtering in dataframes are typically done. Both are shown here for completeness, and because you will see both when looking at code others have written.\n",
    "\n",
    "In the first method, the strategy is that you first establish where something is true within a dataframe, then use that to filter the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 188,
   "id": "67e8ba10",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        True\n",
       "1        True\n",
       "2        True\n",
       "3       False\n",
       "4       False\n",
       "        ...  \n",
       "1123    False\n",
       "1124     True\n",
       "1125     True\n",
       "1126    False\n",
       "1127     True\n",
       "Name: Polar Surface Area, Length: 1128, dtype: bool"
      ]
     },
     "execution_count": 188,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Polar Surface Area\"] >17"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c6a280d",
   "metadata": {},
   "source": [
    "This gives us a list of values with either `True` or `False` that can be used to index into the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 189,
   "id": "7bd812f1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>ESOL solubility (mol/L)</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured solubility (mol/L)</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Amigdalin</td>\n",
       "      <td>-0.974</td>\n",
       "      <td>1</td>\n",
       "      <td>457.432</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>202.32</td>\n",
       "      <td>-0.770</td>\n",
       "      <td>OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Fenfuram</td>\n",
       "      <td>-2.885</td>\n",
       "      <td>1</td>\n",
       "      <td>201.225</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>42.24</td>\n",
       "      <td>-3.300</td>\n",
       "      <td>Cc1occc1C(=O)Nc2ccccc2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>citral</td>\n",
       "      <td>-2.579</td>\n",
       "      <td>1</td>\n",
       "      <td>152.237</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>17.07</td>\n",
       "      <td>-2.060</td>\n",
       "      <td>CC(C)=CCCC(C)=CC(=O)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Estradiol</td>\n",
       "      <td>-4.138</td>\n",
       "      <td>1</td>\n",
       "      <td>272.388</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>40.46</td>\n",
       "      <td>-5.030</td>\n",
       "      <td>CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Rotenone</td>\n",
       "      <td>-5.246</td>\n",
       "      <td>1</td>\n",
       "      <td>394.423</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>63.22</td>\n",
       "      <td>-4.420</td>\n",
       "      <td>COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1121</th>\n",
       "      <td>Phthalonitrile</td>\n",
       "      <td>-1.717</td>\n",
       "      <td>1</td>\n",
       "      <td>128.134</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>47.58</td>\n",
       "      <td>-2.380</td>\n",
       "      <td>N#Cc1ccccc1C#N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1122</th>\n",
       "      <td>m-Nitrotoluene</td>\n",
       "      <td>-2.640</td>\n",
       "      <td>1</td>\n",
       "      <td>137.138</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>43.14</td>\n",
       "      <td>-2.440</td>\n",
       "      <td>Cc1cccc(c1)N(=O)=O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1124</th>\n",
       "      <td>Oxamyl</td>\n",
       "      <td>-0.908</td>\n",
       "      <td>1</td>\n",
       "      <td>219.266</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>71.00</td>\n",
       "      <td>0.106</td>\n",
       "      <td>CNC(=O)ON=C(SC)C(=O)N(C)C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>Thiometon</td>\n",
       "      <td>-3.323</td>\n",
       "      <td>1</td>\n",
       "      <td>246.359</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>18.46</td>\n",
       "      <td>-3.091</td>\n",
       "      <td>CCSCCSP(=S)(OC)OC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1127</th>\n",
       "      <td>Stirofos</td>\n",
       "      <td>-4.320</td>\n",
       "      <td>1</td>\n",
       "      <td>365.964</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>44.76</td>\n",
       "      <td>-4.522</td>\n",
       "      <td>COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>758 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Compound ID  ESOL solubility (mol/L)  Minimum Degree  \\\n",
       "0          Amigdalin                   -0.974               1   \n",
       "1           Fenfuram                   -2.885               1   \n",
       "2             citral                   -2.579               1   \n",
       "7          Estradiol                   -4.138               1   \n",
       "9           Rotenone                   -5.246               1   \n",
       "...              ...                      ...             ...   \n",
       "1121  Phthalonitrile                   -1.717               1   \n",
       "1122  m-Nitrotoluene                   -2.640               1   \n",
       "1124          Oxamyl                   -0.908               1   \n",
       "1125       Thiometon                   -3.323               1   \n",
       "1127        Stirofos                   -4.320               1   \n",
       "\n",
       "      Molecular Weight  Number of H-Bond Donors  Number of Rings  \\\n",
       "0              457.432                        7                3   \n",
       "1              201.225                        1                2   \n",
       "2              152.237                        0                0   \n",
       "7              272.388                        2                4   \n",
       "9              394.423                        0                5   \n",
       "...                ...                      ...              ...   \n",
       "1121           128.134                        0                1   \n",
       "1122           137.138                        0                1   \n",
       "1124           219.266                        1                0   \n",
       "1125           246.359                        0                0   \n",
       "1127           365.964                        0                1   \n",
       "\n",
       "      Number of Rotatable Bonds  Polar Surface Area  \\\n",
       "0                             7              202.32   \n",
       "1                             2               42.24   \n",
       "2                             4               17.07   \n",
       "7                             0               40.46   \n",
       "9                             3               63.22   \n",
       "...                         ...                 ...   \n",
       "1121                          0               47.58   \n",
       "1122                          1               43.14   \n",
       "1124                          1               71.00   \n",
       "1125                          7               18.46   \n",
       "1127                          5               44.76   \n",
       "\n",
       "      measured solubility (mol/L)  \\\n",
       "0                          -0.770   \n",
       "1                          -3.300   \n",
       "2                          -2.060   \n",
       "7                          -5.030   \n",
       "9                          -4.420   \n",
       "...                           ...   \n",
       "1121                       -2.380   \n",
       "1122                       -2.440   \n",
       "1124                        0.106   \n",
       "1125                       -3.091   \n",
       "1127                       -4.522   \n",
       "\n",
       "                                                 smiles  \n",
       "0     OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...  \n",
       "1                                Cc1occc1C(=O)Nc2ccccc2  \n",
       "2                                  CC(C)=CCCC(C)=CC(=O)  \n",
       "7                      CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O  \n",
       "9      COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C   \n",
       "...                                                 ...  \n",
       "1121                                     N#Cc1ccccc1C#N  \n",
       "1122                                 Cc1cccc(c1)N(=O)=O  \n",
       "1124                          CNC(=O)ON=C(SC)C(=O)N(C)C  \n",
       "1125                                  CCSCCSP(=S)(OC)OC  \n",
       "1127              COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl  \n",
       "\n",
       "[758 rows x 10 columns]"
      ]
     },
     "execution_count": 189,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[df[\"Polar Surface Area\"] > 17]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6c77d241",
   "metadata": {},
   "source": [
    "This can also be accomplished using the pandas `query` function. When using the `query` function, you write your query as a string. If your column name has a space, you must surround it with backticks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 190,
   "id": "a63eff90",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>ESOL solubility (mol/L)</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured solubility (mol/L)</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Amigdalin</td>\n",
       "      <td>-0.974</td>\n",
       "      <td>1</td>\n",
       "      <td>457.432</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>202.32</td>\n",
       "      <td>-0.770</td>\n",
       "      <td>OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Fenfuram</td>\n",
       "      <td>-2.885</td>\n",
       "      <td>1</td>\n",
       "      <td>201.225</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>42.24</td>\n",
       "      <td>-3.300</td>\n",
       "      <td>Cc1occc1C(=O)Nc2ccccc2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>citral</td>\n",
       "      <td>-2.579</td>\n",
       "      <td>1</td>\n",
       "      <td>152.237</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>17.07</td>\n",
       "      <td>-2.060</td>\n",
       "      <td>CC(C)=CCCC(C)=CC(=O)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Estradiol</td>\n",
       "      <td>-4.138</td>\n",
       "      <td>1</td>\n",
       "      <td>272.388</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>40.46</td>\n",
       "      <td>-5.030</td>\n",
       "      <td>CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Rotenone</td>\n",
       "      <td>-5.246</td>\n",
       "      <td>1</td>\n",
       "      <td>394.423</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>63.22</td>\n",
       "      <td>-4.420</td>\n",
       "      <td>COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1121</th>\n",
       "      <td>Phthalonitrile</td>\n",
       "      <td>-1.717</td>\n",
       "      <td>1</td>\n",
       "      <td>128.134</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>47.58</td>\n",
       "      <td>-2.380</td>\n",
       "      <td>N#Cc1ccccc1C#N</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1122</th>\n",
       "      <td>m-Nitrotoluene</td>\n",
       "      <td>-2.640</td>\n",
       "      <td>1</td>\n",
       "      <td>137.138</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>43.14</td>\n",
       "      <td>-2.440</td>\n",
       "      <td>Cc1cccc(c1)N(=O)=O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1124</th>\n",
       "      <td>Oxamyl</td>\n",
       "      <td>-0.908</td>\n",
       "      <td>1</td>\n",
       "      <td>219.266</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>71.00</td>\n",
       "      <td>0.106</td>\n",
       "      <td>CNC(=O)ON=C(SC)C(=O)N(C)C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>Thiometon</td>\n",
       "      <td>-3.323</td>\n",
       "      <td>1</td>\n",
       "      <td>246.359</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>18.46</td>\n",
       "      <td>-3.091</td>\n",
       "      <td>CCSCCSP(=S)(OC)OC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1127</th>\n",
       "      <td>Stirofos</td>\n",
       "      <td>-4.320</td>\n",
       "      <td>1</td>\n",
       "      <td>365.964</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>44.76</td>\n",
       "      <td>-4.522</td>\n",
       "      <td>COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>758 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         Compound ID  ESOL solubility (mol/L)  Minimum Degree  \\\n",
       "0          Amigdalin                   -0.974               1   \n",
       "1           Fenfuram                   -2.885               1   \n",
       "2             citral                   -2.579               1   \n",
       "7          Estradiol                   -4.138               1   \n",
       "9           Rotenone                   -5.246               1   \n",
       "...              ...                      ...             ...   \n",
       "1121  Phthalonitrile                   -1.717               1   \n",
       "1122  m-Nitrotoluene                   -2.640               1   \n",
       "1124          Oxamyl                   -0.908               1   \n",
       "1125       Thiometon                   -3.323               1   \n",
       "1127        Stirofos                   -4.320               1   \n",
       "\n",
       "      Molecular Weight  Number of H-Bond Donors  Number of Rings  \\\n",
       "0              457.432                        7                3   \n",
       "1              201.225                        1                2   \n",
       "2              152.237                        0                0   \n",
       "7              272.388                        2                4   \n",
       "9              394.423                        0                5   \n",
       "...                ...                      ...              ...   \n",
       "1121           128.134                        0                1   \n",
       "1122           137.138                        0                1   \n",
       "1124           219.266                        1                0   \n",
       "1125           246.359                        0                0   \n",
       "1127           365.964                        0                1   \n",
       "\n",
       "      Number of Rotatable Bonds  Polar Surface Area  \\\n",
       "0                             7              202.32   \n",
       "1                             2               42.24   \n",
       "2                             4               17.07   \n",
       "7                             0               40.46   \n",
       "9                             3               63.22   \n",
       "...                         ...                 ...   \n",
       "1121                          0               47.58   \n",
       "1122                          1               43.14   \n",
       "1124                          1               71.00   \n",
       "1125                          7               18.46   \n",
       "1127                          5               44.76   \n",
       "\n",
       "      measured solubility (mol/L)  \\\n",
       "0                          -0.770   \n",
       "1                          -3.300   \n",
       "2                          -2.060   \n",
       "7                          -5.030   \n",
       "9                          -4.420   \n",
       "...                           ...   \n",
       "1121                       -2.380   \n",
       "1122                       -2.440   \n",
       "1124                        0.106   \n",
       "1125                       -3.091   \n",
       "1127                       -4.522   \n",
       "\n",
       "                                                 smiles  \n",
       "0     OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...  \n",
       "1                                Cc1occc1C(=O)Nc2ccccc2  \n",
       "2                                  CC(C)=CCCC(C)=CC(=O)  \n",
       "7                      CC12CCC3C(CCc4cc(O)ccc34)C2CCC1O  \n",
       "9      COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C   \n",
       "...                                                 ...  \n",
       "1121                                     N#Cc1ccccc1C#N  \n",
       "1122                                 Cc1cccc(c1)N(=O)=O  \n",
       "1124                          CNC(=O)ON=C(SC)C(=O)N(C)C  \n",
       "1125                                  CCSCCSP(=S)(OC)OC  \n",
       "1127              COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl  \n",
       "\n",
       "[758 rows x 10 columns]"
      ]
     },
     "execution_count": 190,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.query(\"`Polar Surface Area` > 17\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f9427cf7",
   "metadata": {},
   "source": [
    "``````{admonition} Check Your Understanding\n",
    ":class: exercise\n",
    "\n",
    "Rewrite the following query to use the mask syntax for filtering data.\n",
    "\n",
    "    \n",
    "```python\n",
    "df.query(\"`Number of H-Bond Donors`>3\")\n",
    "```\n",
    "\n",
    "    \n",
    "````{admonition} Solution\n",
    ":class: dropdown solution\n",
    "\n",
    "```python\n",
    "df[df[\"Number of H-Bond Donors\"] > 3]\n",
    "``` \n",
    "````\n",
    "\n",
    "``````"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "id": "f0d19937",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>ESOL solubility (mol/L)</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured solubility (mol/L)</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Fenfuram</td>\n",
       "      <td>-2.885</td>\n",
       "      <td>1</td>\n",
       "      <td>201.225</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>42.24</td>\n",
       "      <td>-3.300</td>\n",
       "      <td>Cc1occc1C(=O)Nc2ccccc2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>citral</td>\n",
       "      <td>-2.579</td>\n",
       "      <td>1</td>\n",
       "      <td>152.237</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>17.07</td>\n",
       "      <td>-2.060</td>\n",
       "      <td>CC(C)=CCCC(C)=CC(=O)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Dieldrin</td>\n",
       "      <td>-4.533</td>\n",
       "      <td>1</td>\n",
       "      <td>380.913</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>12.53</td>\n",
       "      <td>-6.290</td>\n",
       "      <td>ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Rotenone</td>\n",
       "      <td>-5.246</td>\n",
       "      <td>1</td>\n",
       "      <td>394.423</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>3</td>\n",
       "      <td>63.22</td>\n",
       "      <td>-4.420</td>\n",
       "      <td>COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2-pyrrolidone</td>\n",
       "      <td>0.243</td>\n",
       "      <td>1</td>\n",
       "      <td>85.106</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>29.10</td>\n",
       "      <td>1.070</td>\n",
       "      <td>O=C1CCCN1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1118</th>\n",
       "      <td>Isobutyl acetate</td>\n",
       "      <td>-1.463</td>\n",
       "      <td>1</td>\n",
       "      <td>116.160</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>26.30</td>\n",
       "      <td>-1.210</td>\n",
       "      <td>CC(C)COC(=O)C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1122</th>\n",
       "      <td>m-Nitrotoluene</td>\n",
       "      <td>-2.640</td>\n",
       "      <td>1</td>\n",
       "      <td>137.138</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>43.14</td>\n",
       "      <td>-2.440</td>\n",
       "      <td>Cc1cccc(c1)N(=O)=O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1124</th>\n",
       "      <td>Oxamyl</td>\n",
       "      <td>-0.908</td>\n",
       "      <td>1</td>\n",
       "      <td>219.266</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>71.00</td>\n",
       "      <td>0.106</td>\n",
       "      <td>CNC(=O)ON=C(SC)C(=O)N(C)C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>Thiometon</td>\n",
       "      <td>-3.323</td>\n",
       "      <td>1</td>\n",
       "      <td>246.359</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7</td>\n",
       "      <td>18.46</td>\n",
       "      <td>-3.091</td>\n",
       "      <td>CCSCCSP(=S)(OC)OC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1127</th>\n",
       "      <td>Stirofos</td>\n",
       "      <td>-4.320</td>\n",
       "      <td>1</td>\n",
       "      <td>365.964</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>44.76</td>\n",
       "      <td>-4.522</td>\n",
       "      <td>COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>595 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "           Compound ID  ESOL solubility (mol/L)  Minimum Degree  \\\n",
       "1             Fenfuram                   -2.885               1   \n",
       "2               citral                   -2.579               1   \n",
       "8             Dieldrin                   -4.533               1   \n",
       "9             Rotenone                   -5.246               1   \n",
       "10       2-pyrrolidone                    0.243               1   \n",
       "...                ...                      ...             ...   \n",
       "1118  Isobutyl acetate                   -1.463               1   \n",
       "1122    m-Nitrotoluene                   -2.640               1   \n",
       "1124            Oxamyl                   -0.908               1   \n",
       "1125         Thiometon                   -3.323               1   \n",
       "1127          Stirofos                   -4.320               1   \n",
       "\n",
       "      Molecular Weight  Number of H-Bond Donors  Number of Rings  \\\n",
       "1              201.225                        1                2   \n",
       "2              152.237                        0                0   \n",
       "8              380.913                        0                5   \n",
       "9              394.423                        0                5   \n",
       "10              85.106                        1                1   \n",
       "...                ...                      ...              ...   \n",
       "1118           116.160                        0                0   \n",
       "1122           137.138                        0                1   \n",
       "1124           219.266                        1                0   \n",
       "1125           246.359                        0                0   \n",
       "1127           365.964                        0                1   \n",
       "\n",
       "      Number of Rotatable Bonds  Polar Surface Area  \\\n",
       "1                             2               42.24   \n",
       "2                             4               17.07   \n",
       "8                             0               12.53   \n",
       "9                             3               63.22   \n",
       "10                            0               29.10   \n",
       "...                         ...                 ...   \n",
       "1118                          2               26.30   \n",
       "1122                          1               43.14   \n",
       "1124                          1               71.00   \n",
       "1125                          7               18.46   \n",
       "1127                          5               44.76   \n",
       "\n",
       "      measured solubility (mol/L)  \\\n",
       "1                          -3.300   \n",
       "2                          -2.060   \n",
       "8                          -6.290   \n",
       "9                          -4.420   \n",
       "10                          1.070   \n",
       "...                           ...   \n",
       "1118                       -1.210   \n",
       "1122                       -2.440   \n",
       "1124                        0.106   \n",
       "1125                       -3.091   \n",
       "1127                       -4.522   \n",
       "\n",
       "                                                smiles  \n",
       "1                               Cc1occc1C(=O)Nc2ccccc2  \n",
       "2                                 CC(C)=CCCC(C)=CC(=O)  \n",
       "8       ClC4=C(Cl)C5(Cl)C3C1CC(C2OC12)C3C4(Cl)C5(Cl)Cl  \n",
       "9     COc5cc4OCC3Oc2c1CC(Oc1ccc2C(=O)C3c4cc5OC)C(C)=C   \n",
       "10                                           O=C1CCCN1  \n",
       "...                                                ...  \n",
       "1118                                     CC(C)COC(=O)C  \n",
       "1122                                Cc1cccc(c1)N(=O)=O  \n",
       "1124                         CNC(=O)ON=C(SC)C(=O)N(C)C  \n",
       "1125                                 CCSCCSP(=S)(OC)OC  \n",
       "1127             COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl  \n",
       "\n",
       "[595 rows x 10 columns]"
      ]
     },
     "execution_count": 191,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.query(\"smiles.str.contains('=')\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1208209c",
   "metadata": {},
   "source": [
    "## Sorting Data\n",
    "\n",
    "Pandas allows you to easily sort data using the `sort_values` method. Inside of the function call you list the column you would like to sort by. By default, the values will be sorted from lowest to highest (or `ascending`). This method will `return` a sorted dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "id": "a559acad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>ESOL solubility (mol/L)</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured solubility (mol/L)</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>563</th>\n",
       "      <td>Dimethyl phthalate</td>\n",
       "      <td>-2.347</td>\n",
       "      <td>1</td>\n",
       "      <td>194.186</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>52.60</td>\n",
       "      <td>-1.660</td>\n",
       "      <td>COC(=O)c1ccccc1C(=O)OC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>632</th>\n",
       "      <td>1,5-Dimethlnapthalene</td>\n",
       "      <td>-4.147</td>\n",
       "      <td>1</td>\n",
       "      <td>156.228</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>-4.679</td>\n",
       "      <td>Cc1cccc2c(C)cccc12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>634</th>\n",
       "      <td>Acenapthylene</td>\n",
       "      <td>-3.682</td>\n",
       "      <td>2</td>\n",
       "      <td>152.196</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>-3.960</td>\n",
       "      <td>C1=Cc2cccc3cccc1c23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>635</th>\n",
       "      <td>Ethyl butyrate</td>\n",
       "      <td>-2.254</td>\n",
       "      <td>1</td>\n",
       "      <td>144.214</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>26.30</td>\n",
       "      <td>-1.280</td>\n",
       "      <td>CCCCCOC(=O)CC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>637</th>\n",
       "      <td>Benzo(a)pyrene</td>\n",
       "      <td>-6.007</td>\n",
       "      <td>2</td>\n",
       "      <td>252.316</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>-8.699</td>\n",
       "      <td>c1ccc2c(c1)cc3ccc4cccc5ccc2c3c45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>276</th>\n",
       "      <td>Sorbitol</td>\n",
       "      <td>0.647</td>\n",
       "      <td>1</td>\n",
       "      <td>182.172</td>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>121.38</td>\n",
       "      <td>1.090</td>\n",
       "      <td>OCC(O)C(O)C(O)C(O)CO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Amigdalin</td>\n",
       "      <td>-0.974</td>\n",
       "      <td>1</td>\n",
       "      <td>457.432</td>\n",
       "      <td>7</td>\n",
       "      <td>3</td>\n",
       "      <td>7</td>\n",
       "      <td>202.32</td>\n",
       "      <td>-0.770</td>\n",
       "      <td>OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>441</th>\n",
       "      <td>Lactose</td>\n",
       "      <td>1.071</td>\n",
       "      <td>1</td>\n",
       "      <td>342.297</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>189.53</td>\n",
       "      <td>-0.244</td>\n",
       "      <td>OCC1OC(OC2C(O)C(O)C(O)OC2CO)C(O)C(O)C1O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1025</th>\n",
       "      <td>Sucrose</td>\n",
       "      <td>0.310</td>\n",
       "      <td>1</td>\n",
       "      <td>342.297</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>5</td>\n",
       "      <td>189.53</td>\n",
       "      <td>0.790</td>\n",
       "      <td>OCC2OC(OC1(CO)OC(CO)C(O)C1O)C(O)C(O)C2O</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>298</th>\n",
       "      <td>Raffinose</td>\n",
       "      <td>0.496</td>\n",
       "      <td>1</td>\n",
       "      <td>504.438</td>\n",
       "      <td>11</td>\n",
       "      <td>3</td>\n",
       "      <td>8</td>\n",
       "      <td>268.68</td>\n",
       "      <td>-0.410</td>\n",
       "      <td>OCC1OC(CO)(OC2OC(COC3OC(CO)C(O)C(O)C3O)C(O)C(O...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1128 rows × 10 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                Compound ID  ESOL solubility (mol/L)  Minimum Degree  \\\n",
       "563      Dimethyl phthalate                   -2.347               1   \n",
       "632   1,5-Dimethlnapthalene                   -4.147               1   \n",
       "634           Acenapthylene                   -3.682               2   \n",
       "635          Ethyl butyrate                   -2.254               1   \n",
       "637          Benzo(a)pyrene                   -6.007               2   \n",
       "...                     ...                      ...             ...   \n",
       "276                Sorbitol                    0.647               1   \n",
       "0                 Amigdalin                   -0.974               1   \n",
       "441                 Lactose                    1.071               1   \n",
       "1025                Sucrose                    0.310               1   \n",
       "298               Raffinose                    0.496               1   \n",
       "\n",
       "      Molecular Weight  Number of H-Bond Donors  Number of Rings  \\\n",
       "563            194.186                        0                1   \n",
       "632            156.228                        0                2   \n",
       "634            152.196                        0                3   \n",
       "635            144.214                        0                0   \n",
       "637            252.316                        0                5   \n",
       "...                ...                      ...              ...   \n",
       "276            182.172                        6                0   \n",
       "0              457.432                        7                3   \n",
       "441            342.297                        8                2   \n",
       "1025           342.297                        8                2   \n",
       "298            504.438                       11                3   \n",
       "\n",
       "      Number of Rotatable Bonds  Polar Surface Area  \\\n",
       "563                           2               52.60   \n",
       "632                           0                0.00   \n",
       "634                           0                0.00   \n",
       "635                           5               26.30   \n",
       "637                           0                0.00   \n",
       "...                         ...                 ...   \n",
       "276                           5              121.38   \n",
       "0                             7              202.32   \n",
       "441                           4              189.53   \n",
       "1025                          5              189.53   \n",
       "298                           8              268.68   \n",
       "\n",
       "      measured solubility (mol/L)  \\\n",
       "563                        -1.660   \n",
       "632                        -4.679   \n",
       "634                        -3.960   \n",
       "635                        -1.280   \n",
       "637                        -8.699   \n",
       "...                           ...   \n",
       "276                         1.090   \n",
       "0                          -0.770   \n",
       "441                        -0.244   \n",
       "1025                        0.790   \n",
       "298                        -0.410   \n",
       "\n",
       "                                                 smiles  \n",
       "563                              COC(=O)c1ccccc1C(=O)OC  \n",
       "632                                  Cc1cccc2c(C)cccc12  \n",
       "634                                 C1=Cc2cccc3cccc1c23  \n",
       "635                                       CCCCCOC(=O)CC  \n",
       "637                    c1ccc2c(c1)cc3ccc4cccc5ccc2c3c45  \n",
       "...                                                 ...  \n",
       "276                                OCC(O)C(O)C(O)C(O)CO  \n",
       "0     OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...  \n",
       "441             OCC1OC(OC2C(O)C(O)C(O)OC2CO)C(O)C(O)C1O  \n",
       "1025           OCC2OC(OC1(CO)OC(CO)C(O)C1O)C(O)C(O)C2O   \n",
       "298   OCC1OC(CO)(OC2OC(COC3OC(CO)C(O)C(O)C3O)C(O)C(O...  \n",
       "\n",
       "[1128 rows x 10 columns]"
      ]
     },
     "execution_count": 192,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sort_values(\"Number of H-Bond Donors\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "36c8401e",
   "metadata": {},
   "source": [
    "If you include more than one column name, the dataframe will be sorted by multiple columns. First, it will be sorted by the first column indicated, then that sort will be sorted by the second. Consider the following examples. We first sort by `Number of H-Bond Donors`. Within that sort, we sort by `Molecular Weight`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 193,
   "id": "d22687d5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Compound ID</th>\n",
       "      <th>ESOL solubility (mol/L)</th>\n",
       "      <th>Minimum Degree</th>\n",
       "      <th>Molecular Weight</th>\n",
       "      <th>Number of H-Bond Donors</th>\n",
       "      <th>Number of Rings</th>\n",
       "      <th>Number of Rotatable Bonds</th>\n",
       "      <th>Polar Surface Area</th>\n",
       "      <th>measured solubility (mol/L)</th>\n",
       "      <th>smiles</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>934</th>\n",
       "      <td>Methane</td>\n",
       "      <td>-0.636</td>\n",
       "      <td>0</td>\n",
       "      <td>16.043</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.90</td>\n",
       "      <td>C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>953</th>\n",
       "      <td>Ethyne</td>\n",
       "      <td>-0.252</td>\n",
       "      <td>1</td>\n",
       "      <td>26.038</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.29</td>\n",
       "      <td>C#C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>689</th>\n",
       "      <td>Ethylene</td>\n",
       "      <td>-0.815</td>\n",
       "      <td>1</td>\n",
       "      <td>28.054</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.40</td>\n",
       "      <td>C=C</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>600</th>\n",
       "      <td>Ethane</td>\n",
       "      <td>-1.132</td>\n",
       "      <td>1</td>\n",
       "      <td>30.070</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-1.36</td>\n",
       "      <td>CC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>202</th>\n",
       "      <td>Propyne</td>\n",
       "      <td>-0.672</td>\n",
       "      <td>1</td>\n",
       "      <td>40.065</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.41</td>\n",
       "      <td>CC#C</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Compound ID  ESOL solubility (mol/L)  Minimum Degree  Molecular Weight  \\\n",
       "934     Methane                   -0.636               0            16.043   \n",
       "953      Ethyne                   -0.252               1            26.038   \n",
       "689    Ethylene                   -0.815               1            28.054   \n",
       "600      Ethane                   -1.132               1            30.070   \n",
       "202     Propyne                   -0.672               1            40.065   \n",
       "\n",
       "     Number of H-Bond Donors  Number of Rings  Number of Rotatable Bonds  \\\n",
       "934                        0                0                          0   \n",
       "953                        0                0                          0   \n",
       "689                        0                0                          0   \n",
       "600                        0                0                          0   \n",
       "202                        0                0                          0   \n",
       "\n",
       "     Polar Surface Area  measured solubility (mol/L) smiles  \n",
       "934                 0.0                        -0.90      C  \n",
       "953                 0.0                         0.29    C#C  \n",
       "689                 0.0                        -0.40    C=C  \n",
       "600                 0.0                        -1.36     CC  \n",
       "202                 0.0                        -0.41   CC#C  "
      ]
     },
     "execution_count": 193,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sorted = df.sort_values([\"Number of H-Bond Donors\", \"Molecular Weight\"])\n",
    "df_sorted.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43905fb9",
   "metadata": {},
   "source": [
    "Note the bolded indices to the left of the dataframe. After we perform our sort, these are no longer in order. In the sorted dataframe, `loc` and `iloc` do not return the same values using the same number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 194,
   "id": "4d8ca81f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Compound ID                    Methane\n",
       "ESOL solubility (mol/L)         -0.636\n",
       "Minimum Degree                       0\n",
       "Molecular Weight                16.043\n",
       "Number of H-Bond Donors              0\n",
       "Number of Rings                      0\n",
       "Number of Rotatable Bonds            0\n",
       "Polar Surface Area                 0.0\n",
       "measured solubility (mol/L)       -0.9\n",
       "smiles                               C\n",
       "Name: 934, dtype: object"
      ]
     },
     "execution_count": 194,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sorted.loc[934]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "id": "8fbc2664",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Compound ID                                                   Bensulide\n",
       "ESOL solubility (mol/L)                                           -4.99\n",
       "Minimum Degree                                                        1\n",
       "Molecular Weight                                                397.524\n",
       "Number of H-Bond Donors                                               1\n",
       "Number of Rings                                                       1\n",
       "Number of Rotatable Bonds                                            10\n",
       "Polar Surface Area                                                64.63\n",
       "measured solubility (mol/L)                                        -4.2\n",
       "smiles                         CC(C)OP(=S)(OC(C)C)SCCNS(=O)(=O)c1ccccc1\n",
       "Name: 330, dtype: object"
      ]
     },
     "execution_count": 195,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_sorted.iloc[934]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fc2a7b2d",
   "metadata": {},
   "source": [
    "## Grouping Data\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37f40ab9",
   "metadata": {},
   "source": [
    "## Performing Operations on Multiple Cells\n",
    "\n",
    "### Broadcasting\n",
    "\n",
    "Pandas dataframes have the convenient feature that they use something called `broadcasting`. This means that if you are doing something like subtracting a number, multiplying, etc to a column or dataframe of information, it can be done all at once instead of with a `for` loop. Consider if we wanted to express the molecular surface area in $nm^2$ instead of Å$^2$. To do this, we would need to divide each value by 100.\n",
    "\n",
    "Instead of writing a `for` loop that does this, we can just write the following code. This will return a pandas Series (one dimensional dataframe). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "id": "c0c69c15",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       2.0232\n",
       "1       0.4224\n",
       "2       0.1707\n",
       "3       0.0000\n",
       "4       0.0000\n",
       "         ...  \n",
       "1123    0.0000\n",
       "1124    0.7100\n",
       "1125    0.1846\n",
       "1126    0.0000\n",
       "1127    0.4476\n",
       "Name: Polar Surface Area, Length: 1128, dtype: float64"
      ]
     },
     "execution_count": 196,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Polar Surface Area\"] / 100"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51f6f9a8",
   "metadata": {},
   "source": [
    "To save it as a new column, we need to capture the output in a column. You can create a new column in a dataframe using the following syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 197,
   "id": "85b81987",
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"Polar Surface Area (nm^2)\"] = df[\"Polar Surface Area\"] / 100"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4086f42a",
   "metadata": {},
   "source": [
    "You can also add, subtract, or multiply two columns with one another. For example, we might want to calculate the difference between the predicted and observed values for this solubility model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 198,
   "id": "c3e6d45b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      -0.204\n",
       "1       0.415\n",
       "2      -0.519\n",
       "3       1.252\n",
       "4      -0.902\n",
       "        ...  \n",
       "1123   -0.898\n",
       "1124   -1.014\n",
       "1125   -0.232\n",
       "1126    0.935\n",
       "1127    0.202\n",
       "Length: 1128, dtype: float64"
      ]
     },
     "execution_count": 198,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"ESOL solubility (mol/L)\"] - df[\"measured solubility (mol/L)\"]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2e046192",
   "metadata": {},
   "source": [
    "``````{admonition} Check Your Understanding\n",
    ":class: exercise\n",
    "\n",
    " Save the difference between the predicted and measured solubility in a column named \"difference\".\n",
    "\n",
    "````{admonition} Solution\n",
    ":class: solution dropdown\n",
    "\n",
    "```python\n",
    "df[\"difference\"] = df[\"ESOL solutbility (mol/L)\"] - df[\"measured solubility (mol/L)\"]\n",
    "``` \n",
    "\n",
    "````"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d5530adf",
   "metadata": {},
   "source": [
    "### The `.applymap` method\n",
    "\n",
    "Sometimes the operations you want to do on a dataframe can't be achieved with broadcasting. One might think to use a `for` loop in times like these. However, many modern Python libraries try to make it so that you will rarely need to use a loop. \n",
    "\n",
    "In pandas, if you would like to perform an operation on every cell of a dataframe, you can use the `applymap` command. For example, if we wanted to know the number of letters in a compound name, we would use the python function `len`.  The length function only works on strings, so let's see how we would use it on two columns which are strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "id": "1e9db742",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>smiles</th>\n",
       "      <th>Compound ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>54</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>22</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>20</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>34</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1123</th>\n",
       "      <td>16</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1124</th>\n",
       "      <td>25</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>17</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1126</th>\n",
       "      <td>7</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1127</th>\n",
       "      <td>37</td>\n",
       "      <td>8</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1128 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      smiles  Compound ID\n",
       "0         54            9\n",
       "1         22            8\n",
       "2         20            6\n",
       "3         34            6\n",
       "4          7            9\n",
       "...      ...          ...\n",
       "1123      16            9\n",
       "1124      25            6\n",
       "1125      17            9\n",
       "1126       7           14\n",
       "1127      37            8\n",
       "\n",
       "[1128 rows x 2 columns]"
      ]
     },
     "execution_count": 199,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[[\"smiles\", \"Compound ID\"]].applymap(len)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f0d2fc9",
   "metadata": {},
   "source": [
    "You can also do this with custom functions. Consider this function which counts the number of C's in a string. For a SMILE string we'll consider that it counts the number of carbons."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "id": "3773ce54",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_carbons(smiles):\n",
    "    \n",
    "    carbon_count = 0\n",
    "    \n",
    "    for letter in smiles:\n",
    "        if letter.lower() == \"c\":\n",
    "            carbon_count += 1\n",
    "    \n",
    "    return carbon_count"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "id": "3bc02d3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>smiles</th>\n",
       "      <th>Compound ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>22</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1123</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1124</th>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1125</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1126</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1127</th>\n",
       "      <td>14</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1128 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      smiles  Compound ID\n",
       "0         20            0\n",
       "1         12            0\n",
       "2         10            1\n",
       "3         22            1\n",
       "4          4            0\n",
       "...      ...          ...\n",
       "1123       3            0\n",
       "1124       7            0\n",
       "1125       6            0\n",
       "1126       5            0\n",
       "1127      14            0\n",
       "\n",
       "[1128 rows x 2 columns]"
      ]
     },
     "execution_count": 201,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[[\"smiles\", \"Compound ID\"]].applymap(count_carbons)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a02b745",
   "metadata": {},
   "source": [
    "### The `.apply` method - for rows and columns\n",
    "\n",
    "The apply method is used to apply a function to either the rows or columns of a dataframe.  If you use this on a single column, the function will be applied to every value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "id": "4d98ca22",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        9\n",
       "1        8\n",
       "2        6\n",
       "3        6\n",
       "4        9\n",
       "        ..\n",
       "1123     9\n",
       "1124     6\n",
       "1125     9\n",
       "1126    14\n",
       "1127     8\n",
       "Name: Compound ID, Length: 1128, dtype: int64"
      ]
     },
     "execution_count": 202,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Compound ID\"].apply(len)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "id": "92aaee4c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "smiles         1128\n",
       "Compound ID    1128\n",
       "dtype: int64"
      ]
     },
     "execution_count": 203,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[[\"smiles\", \"Compound ID\"]].apply(len)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8f9760e",
   "metadata": {},
   "source": [
    "## Axes and Shapes\n",
    "\n",
    "We've already discussed the rows and columns in a dataframe. Also relevant here is the idea of an dataframe axis. Most functions in pandas can be used on either the rows or the columns, and can be specified by using the keyword `axis`. Typically, by default this axis will be set to 0 (rows). \n",
    "\n",
    "Axis 0 runs down the rows, while axis 1 runs across the columns. When thinking about this, consider the `iloc` function for indexing. You specify the index for the row in element 0 of this function (axis 0) and the index for the column in element 1.\n",
    "\n",
    "The `apply` method works on either rows or elements. Consider the result of what happens if we apply this function to axis 1, or the columns. Can you explain why we are seeing this answer?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "id": "5b100362",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       2\n",
       "1       2\n",
       "2       2\n",
       "3       2\n",
       "4       2\n",
       "       ..\n",
       "1123    2\n",
       "1124    2\n",
       "1125    2\n",
       "1126    2\n",
       "1127    2\n",
       "Length: 1128, dtype: int64"
      ]
     },
     "execution_count": 204,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[[\"smiles\", \"Compound ID\"]].apply(len, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b774e047",
   "metadata": {},
   "source": [
    "For example, when we calculate the mean, you will get an average for every column. You do not get the average of all values. This is because the function is applied to axis 0, which runs down the dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "id": "a911dc59",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_1674/3698961737.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n",
      "  df.mean()\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "ESOL solubility (mol/L)         -2.988192\n",
       "Minimum Degree                   1.058511\n",
       "Molecular Weight               203.937074\n",
       "Number of H-Bond Donors          0.701241\n",
       "Number of Rings                  1.390957\n",
       "Number of Rotatable Bonds        2.177305\n",
       "Polar Surface Area              34.872881\n",
       "measured solubility (mol/L)     -3.050102\n",
       "Polar Surface Area (nm^2)        0.348729\n",
       "dtype: float64"
      ]
     },
     "execution_count": 205,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02199065",
   "metadata": {},
   "source": [
    "We can switch this to giving an average for each row by setting the axis to 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "id": "6261d20d",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/tmp/ipykernel_1674/3676274908.py:1: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError.  Select only valid columns before calling the reduction.\n",
      "  df.mean(axis=1)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0       75.336800\n",
       "1       27.078044\n",
       "2       18.870967\n",
       "3       30.096222\n",
       "4        9.286778\n",
       "          ...    \n",
       "1123    21.562556\n",
       "1124    32.574889\n",
       "1125    29.621067\n",
       "1126     7.636222\n",
       "1127    45.481067\n",
       "Length: 1128, dtype: float64"
      ]
     },
     "execution_count": 206,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.mean(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0b38b94",
   "metadata": {},
   "source": [
    "You'll notice that when we run a function on axis 0, our returned value has the same number of values as the number of columns in our original dataframe. The converse is true for applying across axis 1. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2142c47e",
   "metadata": {},
   "source": [
    "If we select a column, the mean of all of the values are returned. This makes sense intuitively - it is how we might expect the function to behave. However, the function behaves this way because the function is acting on axis 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "id": "d48cf3fd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "34.87288120567376"
      ]
     },
     "execution_count": 207,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Polar Surface Area\"].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 208,
   "id": "aea26acd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1128,)"
      ]
     },
     "execution_count": 208,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"Polar Surface Area\"].shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f6fb677c",
   "metadata": {},
   "source": [
    "For this example, there is only one number here describing the shape of the dataframe. This is because selecting a single column of data in DataFrame results in a sightly different data type. Instead of a Pandas DataFrame, we hae a **Pandas Series**. Knowing and understanding the difference between a DataFrame and a Series will be important as you work with Pandas. There are many functions which are shared between the two data types, but there are also some that are specific to one data type or another.\n",
    "\n",
    "The concepts covered in this chapter are helpful for data exploration, and will become essential as we continue to work with pandas.\n",
    "\n",
    "``````{admonition} Key Points\n",
    ":class: key\n",
    "\n",
    "* Pandas is used to examine, process, and analyze tabular data.\n",
    "\n",
    "* Pandas has many useful features that allow easy data manipulaiton and extraction\n",
    "\n",
    "``````"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "molssi-training",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]"
  },
  "vscode": {
   "interpreter": {
    "hash": "16d4a7bb199d969b1271ebe46f77414b0d9cd01b3c3983c2b2742fc6cd4503d3"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}