This lesson is in the early stages of development (Alpha version)

# Working with Numpy Arrays

## Overview

Teaching: 45 min
Exercises: 20 min
Questions
• What are the differences between numpy arrays and lists?

• How can I use NumPy to do calculations?

Objectives
• Be able to name the differences between Python lists and numpy arrays.

• Understand the idea of broadcasting.

Numpy is a widely used Python library for scientific computing. It has a number of useful features, including the a data structure called an array. Compared to the built-in data typles `lists` which we discussed in the Python Data and Scripting Workshop, `numpy` has many features which can help you in your data analysis.

## NumPy Arrays vs. Python Lists

Previously, you have worked with the built-in types of `lists`. NumPy arrays seem similar, but offer some distinct advantages.

Numpy arrays take up less space, are faster, and have more mathematical operations associated with them. However, unlike lists, they elements all have to be the same type.

There are also differences in how lists and numpy arrays behave. Let’s look at some of these.

First open a Jupyter notebook to record your work.

To use the numpy library, we have to import it. When `numpy` is imported, it is often shortened to `np`.

We will start with reading in some data from an xyz file. The following block will read a file called `water.xyz` (from the Python Data and Scripting lesson) and saving two numpy arrays - one called `coordinates` with the molecular coordinates, and another called `symbols` with the element symbols.

``````file_location = os.path.join('data', 'water.xyz')
symbols = xyz_file[:,0]
coordinates = (xyz_file[:,1:])
coordinates = coordinates.astype(np.float)

print(symbols)
print(coordinates)
``````
``````['O' 'H1' 'H2']
[[ 0.       -0.007156  0.965491]
[-0.        0.001486 -0.003471]
[ 0.        0.931026  1.207929]]
``````

## Exercise

Slice the `coordinates` array to create a new array called `oxygen_coord` which has the x, y, and z coordinate for the oxygen atom.

## Solution

For this, we want all of the columns of the first row.

``````oxygen_coord = coordinates
``````

Now that we have the oxygen coordinate, let’s imagine we wanted to do something to it. Let’s imagine that we wanted to translate the position of the oxygen atom. We want to translate it 0.1 units in the x direction and -0.1 units in the y direction.

If we were writing for loops like we did before, we might do this by defining a translation vector and using a for loop.

``````translation_vector = [0.1, -0.1, 0]

oxygen_coord_new = []

for dim in range(3):
new_position = oxygen_coord[dim] + translation_vector[dim]
oxygen_coord_new.append(new_position)

print(oxygen_coord_new)
``````
``````[0.1, -0.107156, 0.965491]
``````

However, since `oxygen_coord` is a numpy array, we could have done this differently. One way numpy arrays and lists are different is that you can easily perform element-wise operations on numpy arrays without loops. You can make your code much faster if you use numpy element-by-element operations instead of loops.

Numpy is smart. If two arrays (or a list and an array), it will guess that you want to do element-wise addition. In the `for` loop we just wrote, we actually wanted an answer that looked like

`[x1 + x2, y1+ y2, z1+z2]`

where [x1, x2, x3] was `oxygen_coord` and [y1, y2, y3] was `translation_vector`. Using the power of numpy arrays, we could have instead written

``````oxygen_coord_new = oxygen_coord + translation_vector
print(oxygen_coord_new)
``````
``````[ 0.1      -0.107156  0.965491]
``````

Numpy was smart - it looked at the shape of both of these variables, saw they were the same shape, and assumed we wanted to do element-wise operation. You could have also subtracted, multiplied, or divided these, and it would have performed element-wise operations.

Note, that this only worked because `oxygen_coord` was a `numpy` array.

``````type(oxygen_coord)
``````
``````numpy.ndarray
``````

What happens if we try this with a list?

``````oxygen_list = list(oxygen_coord)
type(oxygen_list)
``````
``````list
``````
``````oxygen_list + translation_vector
``````
``````[0.0, -0.007156, 0.965491, 0.1, -0.1, 0.0]
``````

As a note, if you wanted to concatenate the two where `oxygen_coordinate` was a numpy array, you could have done so with the `np.concatenate` function.

``````# To concatenate numpy arrays...

np.concatenate((oxygen_coord, translation_vector))
``````

You can add two arrays together, multiply arrays by scalars, or do element-wise multiplcation of arrays.

For example, you can multiply two numpy arrays to get their element-wise product. This means that given two vectors `a = np.array([a0, a1, a2])` and `b = np.array([b0, b1, b2])`, `a * b = [a0*b0, a1*a1, a2*b2]`.

In contrast, if `a` and `b` were lists, you would get an error.

Consider the variable definitions for `a1` and `a2`. What does each print statment result in?

``````a1 = np.array([2, 1, 0])
a2 = np.array([1, 3, 5])

print(a1 * a2)
print(a1 + a2)
``````

The code block results in

``````[2 3 0]
[3 4 5]
``````

For the first line, the first element is `a1*a2`, the second element is `a1*a2`, and the third element is `a1*a2`. For the second line, the first element is `a1+a2`, second is `a1+a2`, third is `a1+a2`.

What happens if `a1` and `a2` are lists? Consider each `print` statement separately.

``````a1 = [2, 1, 0]
a2 = [1, 3, 5]

print(a1 + a2)
print(a1 * a2)
``````

The first print statement results in concatenation of the lists.

``````[2, 1, 0, 1, 3, 5]
``````

Second print statement results in a TypeError. Two lists cannot be multiplied.

``````TypeError: can't multiply sequence by non-int of type 'list
``````

This is because two lists cannot be multiplied. If you wanted to do element-wise multiplication, you would have to use `numpy` arrays (like in the previous exercise.)

Another special thing about numpy is something called broadcasting. Broadcasting occurs when you attempt mathematical operations on arrays that have different shapes. If possible, the smaller array is “broadcast” across the larger array.

Let’s think about what would happen if we wanted to move every atom in our water molecule by our translation vector.

If you were working with Python lists, or you didn’t know about the features of numpy arrays, you might try to do this with a `for` loop.

``````new_coordinates = []

for atom in coordinates:
new_x = atom + translation_vector
new_y = atom + translation_vector
new_z = atom + translation_vector

new_coordinates.append([new_x, new_y, new_z])

print(new_coordinates)
``````
``````[[0.1, -0.107156, 0.965491], [0.1, -0.098514, -0.003471], [0.1, 0.831026, 1.207929]]
``````

Broadcasting in `numpy` allows us to achieve that with one command, rather than a `for` loop.

``````new_coordinates = coordinates + translation_vector

print(new_coordinates)
``````
``````[[ 0.1      -0.107156  0.965491]
[ 0.1      -0.098514 -0.003471]
[ 0.1       0.831026  1.207929]]
``````

For this to work, we have to have two arrays that have a matching dimension. You can see the shape of an array using the function `np.shape`.

``````np.shape(coordinates)
``````
``````(3, 3)
``````
``````np.shape(translation_vector)
``````
``````(3,)
``````

When you typed, `coordinates + translation_vector`, numpy looked at the shapes of both arrays to figure out if they were compatible.

It starts with the dimensions to the right, so when it saw to matching 3’s it assumed you wanted to do element-wise operation this way, and stretched or ‘broadcast’ the smaller array to match the larger one.

``````row_translate = [[0.1], [0.2], [0.3]]
print(np.shape(x_translate))
``````
``````(3, 1)
``````

Think about what will happen if you try this code

``````coordinates + row_translate
``````

## Logical comparisons

We can also do logical comparisons on whole arrays. For example, to find out if values in the array are greater than 0, we can write

``````print(coordinates > 0)
``````

This will print either `True` or `False` for each array element depending on whether the value of that element is greater than 10 or not.

``````array([[False, False,  True],
[False,  True, False],
[False,  True,  True]])
``````

To get every value in the array that is greater than 10, we can use this as a list of indices we want, or a slice.

``````greater_than_0_values = coordinates[coordinates>0]
print(greater_than_0_values)
``````
``````[0.965491 0.001486 0.931026 1.207929]
``````

## Array Axes

Imagine we wanted to calculate the geometric center of our molecule. To do this, we would need to get the average x coordinate, the average y coordinate, and the average z coordinate.

In a previous lesson, we calculated the mean of each column of an array using the `range` function and a `for` loop. This was good because it reminded us of the `range` function and how `for` loops worked. However, the `numpy.mean` function will let us do that without a `for` loop.

Our code for that would look something like this:

``````center = []

for dim in range(len(symbols)):
dim_mean = np.mean(coordinates[:, dim])
center.append(dim_mean)
``````

A `numpy` array can be thought of like a coordinates system. Axis 0 runs along the ROWS, while axis 1 runs along the COLUMNS.

## Exercise

Check out the numpy documentation for the mean function. Can you figure out what argument we might change to get the mean of each column?

## Solution

We will use the `axis` argument. If we do not specify `axis`, the mean of the entire array is computed.

A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1).

For the `axis` argument, rows correspond to axis 1, and columns correspond to axis 0.

``````center = np.mean(data, axis=0)
print(center)
``````
``````[0.         0.308452   0.72331633]
``````

Now, the mean function has returned a `numpy` array to us with the same number of elements as we have columns. The 0th element of the variable `column_averages` corresponds the mean of column index 0, the second number (index 1) corresponds to the mean of column index 1.

## Optional - Returning to the geometry analysis project

In your geometry analysis project, you had to analyze an xyz file, find the bonds, and print bond lengths.

We can rewrite that project using the features of numpy arrays.

Recall that a solution given for a function for calculating distances between two points was the following

``````def calculate_distance(rA, rB):
"""Calculate distance between points A and B"""
x_dist = (rA - rB) ** 2
y_dist = (rA - rB) ** 2
z_dist = (rA - rB) ** 2

distance = np.sqrt(x_dist + y_dist + z_dist)

return distance
``````

This function would work for both lists and numpy arrays, because it does not assume that rA and rB can do something like element-wise subtraction.

## Exercise

Using what you’ve learned about numpy arrays, rewrite the `calculate_distance` function to use the features of numpy arrays.

If rA and rB are both numpy arrays, we can use element-wise subtraction. It is also a good idea to update your function docstring to reflect the restriction on the data type of `rA` and `rB`.

`````` def calculate_distance(rA, rB):
"""Calculate the distance between points A and B. rA and rB must be numpy arrays."""
AB = (rB - rA)**2
distance = np.sqrt(np.sum(AB))
return distance
``````

You might also have used the `numpy` function `np.linalg.norm` which calculates the magnitude of a given vector.

``````def calculate_distance(rA, rB):
dist_vec = (rA-rB)
distance = np.linalg.norm(dist_vec)
return distance
``````

Redefine your original distance function as `calculate_distance_list`. Using both, we see that both functions give the same answer.

``````print(calculate_distance_list(r1, r2))
print(calculate_distance(r1_array, r2_array))
``````

## Key Points

• NumPy arrays which are the same size use element-wise operations when added or subtracted

• NumPy uses something called broadcasting for arrays which are not the same size to allow arrays to be added or multiplied.

• NumPy has extensive documentation online - you should check this out if you need to do a computation.