Working with NumPy Arrays

Overview

Questions:

  • When should I use NumPy arrays instead of Pandas dataframes?

  • What does the shape of a NumPy array mean?

  • How can I reshape arrays?

Objectives:

  • Convert data from a Pandas dataframe to a NumPy array.

  • Use the reshape function to reshape a NumPy array.

Pandas dataframes are built on top of a data structure known as the NumPy Array. If you completed the first MolSSI Python scripting workshop, you are already familiar with some properties of NumPy arrays.

In general, you should use a pandas dataframe when working with data that is:

  • Two dimensional (rows and columns).

  • Labeled.

  • Mixed type.

  • Something for which you would like to be able to easily get statistics.

You should work with NumPy arrays when:

  • You have higher-dimensional data (for example, a collection of two-dimensional arrays).

  • You need to perform advanced mathematics like linear algebra.

  • You are using a library which requires NumPy arrays (for example, scikit-learn).
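To make the first two points concrete, here is a minimal sketch using made-up numbers rather than the workshop data. The names `stack` and `matrix` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical data: a stack of three 2x2 matrices, i.e. a three-dimensional array.
stack = np.arange(12, dtype=float).reshape(3, 2, 2)
print(stack.ndim)   # number of dimensions
print(stack.shape)  # size along each dimension

# Linear algebra on a single 2x2 matrix: invert it and multiply back.
matrix = np.array([[2.0, 1.0], [1.0, 3.0]])
inverse = np.linalg.inv(matrix)
identity = matrix @ inverse

# The product should be the identity matrix, up to floating point rounding.
print(np.allclose(identity, np.eye(2)))
```

A dataframe has no direct way to hold the three-dimensional `stack`, which is the kind of situation where a NumPy array is the more natural choice.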

For our discussion of NumPy, we will first load our data into a pandas dataframe, then convert this data into a NumPy array.

import pandas as pd
import numpy as np
df = pd.read_csv("data/delaney-processed.csv")
df.head()
  | Compound ID | ESOL predicted log solubility in mols per litre | Minimum Degree | Molecular Weight | Number of H-Bond Donors | Number of Rings | Number of Rotatable Bonds | Polar Surface Area | measured log solubility in mols per litre | smiles
0 | Amigdalin | -0.974 | 1 | 457.432 | 7 | 3 | 7 | 202.32 | -0.77 | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...
1 | Fenfuram | -2.885 | 1 | 201.225 | 1 | 2 | 2 | 42.24 | -3.30 | Cc1occc1C(=O)Nc2ccccc2
2 | citral | -2.579 | 1 | 152.237 | 0 | 0 | 4 | 17.07 | -2.06 | CC(C)=CCCC(C)=CC(=O)
3 | Picene | -6.618 | 2 | 278.354 | 0 | 5 | 0 | 0.00 | -7.87 | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43
4 | Thiophene | -2.232 | 2 | 84.143 | 0 | 1 | 0 | 0.00 | -1.33 | c1ccsc1

As before, we will rename two of the columns so that they are easier to reference.

df.rename(
    columns={
        "ESOL predicted log solubility in mols per litre": "ESOL solubility (mol/L)",
        "measured log solubility in mols per litre": "measured solubility (mol/L)",
    },
    inplace=True,
)

Converting a Pandas DataFrame to a NumPy Array

To convert a dataframe to a NumPy array, use the dataframe's .to_numpy() method. You will notice that the NumPy array and the pandas dataframe have the same shape after conversion.

np_array = df.to_numpy()

print(np_array)
[['Amigdalin' -0.974 1 ... 202.32 -0.77
  'OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O ']
 ['Fenfuram' -2.885 1 ... 42.24 -3.3 'Cc1occc1C(=O)Nc2ccccc2']
 ['citral' -2.579 1 ... 17.07 -2.06 'CC(C)=CCCC(C)=CC(=O)']
 ...
 ['Thiometon' -3.323 1 ... 18.46 -3.091 'CCSCCSP(=S)(OC)OC']
 ['2-Methylbutane' -2.245 1 ... 0.0 -3.18 'CCC(C)C']
 ['Stirofos' -4.32 1 ... 44.760000000000005 -4.522
  'COP(=O)(OC)OC(=CCl)c1cc(Cl)c(Cl)cc1Cl']]
df.shape
(1128, 10)
np_array.shape
(1128, 10)
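The same conversion can be tried without the data file. The sketch below builds a small stand-in dataframe from three of the rows shown earlier. The name `small_df` and its shortened column names are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the solubility data, with shortened column names.
small_df = pd.DataFrame({
    "name": ["Fenfuram", "citral", "Thiophene"],
    "predicted": [-2.885, -2.579, -2.232],
    "measured": [-3.30, -2.06, -1.33],
})

# Convert the whole dataframe to a NumPy array.
small_array = small_df.to_numpy()

print(small_df.shape)     # rows and columns of the dataframe
print(small_array.shape)  # the array has the same shape
print(type(small_array))
```

Note that the array keeps only the values. The column names and the row index stay behind in the dataframe, so you need to remember which column position holds which quantity.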

If you display the array instead of printing it, you will see dtype=object at the end of the output. When we converted the dataframe to a NumPy array, pandas chose a single type that can hold every value in the dataframe. Since we had a mix of strings and numbers, the assigned type of the NumPy array is object.

Note that we have both strings and numbers in the array.

np_array.dtype
dtype('O')
print(np_array[0, 0])
print(type(np_array[0, 0]))

print(np_array[0, 1])
print(type(np_array[0, 1]))
Amigdalin
<class 'str'>
-0.974
<class 'float'>
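To see how the column selection affects the resulting type, the following sketch compares converting a mixed dataframe with converting only its numeric columns. The dataframe `small_df` and its column names are hypothetical stand-ins for the workshop data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one string column and two float columns.
small_df = pd.DataFrame({
    "name": ["Fenfuram", "citral"],
    "predicted": [-2.885, -2.579],
    "measured": [-3.30, -2.06],
})

# Strings and numbers together: the array falls back to the object type.
mixed = small_df.to_numpy()
print(mixed.dtype)

# Numeric columns only: the array gets a numeric type.
numeric = small_df[["predicted", "measured"]].to_numpy()
print(numeric.dtype)

# Numerical operations work directly on the numeric array.
column_means = numeric.mean(axis=0)
print(column_means)
```

Selecting only the numeric columns before converting is usually the better choice when you plan to do mathematics with the array.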

Reshaping Arrays

NumPy arrays have shapes just like Pandas DataFrames. One major difference is that NumPy arrays are n-dimensional, meaning they can have any number of dimensions rather than just two. Unlike pandas dataframes, NumPy arrays can also be reshaped.

When we fit our data using scikit-learn later in this workshop, we will need to make sure our arrays have the proper shape. Let’s see how we might reshape an array.

numbers = df[["ESOL solubility (mol/L)", "measured solubility (mol/L)"]].to_numpy()
numbers.shape
(1128, 2)
numbers.dtype
dtype('float64')

To change the shape of an array, use the .reshape method. Inside the call, you list the size you would like each dimension of the array to have. We could reshape the numbers array we just created to be three-dimensional rather than two-dimensional. With the shape (1128, 1, 2), the first dimension still runs over the compounds, the second dimension has size one, and the third dimension holds the two values (ESOL and measured) for each compound. Selecting one position along the third dimension then gives that set of values as rows with one column.

If you do not want to work out the size of one of the dimensions when doing a reshape, you can specify -1 for that dimension. NumPy then makes that dimension whatever size it needs to be to fit the other specified dimensions. Only one dimension can be given as -1.

# Reshape so that each category (ESOL or measured) can be sliced out as a two-dimensional column
numbers_reshaped = numbers.reshape(-1, 1, 2)
print(numbers_reshaped.shape)
(1128, 1, 2)
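The behavior of -1 is easier to follow on a small array. This is a minimal sketch using twelve made-up values. The names `values`, `grid`, `auto`, and `cube` are hypothetical.

```python
import numpy as np

values = np.arange(12)  # one-dimensional, 12 elements

# Give every dimension explicitly.
grid = values.reshape(3, 4)

# Let NumPy work out the first dimension: 12 elements / 4 columns = 3 rows.
auto = values.reshape(-1, 4)

# The same pattern as in the lesson: the first dimension is inferred.
cube = values.reshape(-1, 1, 2)

print(grid.shape, auto.shape, cube.shape)

# The total number of elements must stay the same, so a shape that
# does not divide evenly raises an error.
try:
    values.reshape(5, -1)
except ValueError as err:
    print("reshape failed:", err)
```

Reshaping never adds or removes values. It only changes how the same elements are arranged across dimensions.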

We can slice into our reshaped array.

# Get ESOL values
numbers_reshaped[:, :, 0]
array([[-0.974],
       [-2.885],
       [-2.579],
       ...,
       [-3.323],
       [-2.245],
       [-4.32 ]])
numbers[:, 0]
array([-0.974, -2.885, -2.579, ..., -3.323, -2.245, -4.32 ])
# Get measured values
numbers_reshaped[:, :, 1]
array([[-0.77 ],
       [-3.3  ],
       [-2.06 ],
       ...,
       [-3.091],
       [-3.18 ],
       [-4.522]])
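Whether a slice keeps or drops a dimension depends on how you index it. The sketch below uses the first three rows of values from the data shown above. The names `pairs`, `kept`, and `dropped` are hypothetical.

```python
import numpy as np

# First three (ESOL, measured) pairs from the data shown above.
pairs = np.array([[-0.974, -0.77], [-2.885, -3.30], [-2.579, -2.06]])
reshaped = pairs.reshape(-1, 1, 2)

# A slice (:) keeps the dimension, so the result is a column with shape (3, 1).
kept = reshaped[:, :, 0]

# An integer index removes the dimension, so the result has shape (3,).
dropped = reshaped[:, 0, 0]

print(kept.shape)
print(dropped.shape)

# The one-dimensional result matches the first column of the original array.
print(np.array_equal(dropped, pairs[:, 0]))
```

Checking the shape after slicing is a quick way to confirm you have the number of dimensions you expect.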
# A pandas Series converts to a one-dimensional array, so we will sometimes have to reshape it.
names = df["Compound ID"].to_numpy()
print(names)
names = names.reshape(-1, 1)
print(names)
['Amigdalin' 'Fenfuram' 'citral' ... 'Thiometon' '2-Methylbutane'
 'Stirofos']
[['Amigdalin']
 ['Fenfuram']
 ['citral']
 ...
 ['Thiometon']
 ['2-Methylbutane']
 ['Stirofos']]
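Going between one-dimensional and two-dimensional layouts is a common step, so here is a small sketch of both directions. It builds the array directly rather than from the dataframe, and the names `column`, `row`, and `flat` are hypothetical.

```python
import numpy as np

names = np.array(["Amigdalin", "Fenfuram", "citral"])  # shape (3,)

# One column, as many rows as needed.
column = names.reshape(-1, 1)

# One row, as many columns as needed.
row = names.reshape(1, -1)

# ravel flattens back to one dimension.
flat = column.ravel()

print(column.shape, row.shape, flat.shape)
```

The column layout, with one row per sample, is the form produced by reshape(-1, 1) in the lesson above.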

Key Points

  • Pandas dataframes are built on top of NumPy arrays.

  • All elements of a NumPy array share the same data type.

  • NumPy is used for numerical applications.