Deciding Package Structure#

Overview

Questions:

  • How should I organize my code?

  • How can I handle imports in my package?

Objectives:

  • Break code into modules and subpackages based on functionality.

  • Understand how the __init__.py file affects your Python package.

Follow Along with This Lesson

To follow along with this lesson, you can complete the previous lessons, or you can download a pre-made workshop repository that is at the starting point.

You will need to make sure that you have git installed and configured, as described in the set-up instructions.

git clone https://github.com/MolSSI-Education/molecool.git
cd molecool
git checkout deciding-package-structure-start
git switch -c main

You can also download the pre-made workshop repository as a zip file. If downloading as a zip file, you will need to initialize git in the repository and make an initial commit in order to use git.

As new features are implemented in codes, it is natural for new functions and objects to be added. In many projects, this often leads to a large number of functionalities defined within a single module. For small, single developer codes, this is not a major issue, but it can still make code difficult to work with. With large or multi-developer codes, this can slow development progress to a crawl as it is difficult both to understand and work with the code.

In this lesson, we will simulate a developing piece of software. We will start with a single Python module containing all the functions we have developed, and convert it into a well-structured package.

Package Structure#

Let’s start by reviewing the package structure provided to us by the CMS CookieCutter. We have a directory containing our project with a number of additional features. Under our package directory, molecool, we can see our current python module functions.py. For a more detailed explanation of the rest of the package structure, please review the package setup section of the lessons.

.
├── CODE_OF_CONDUCT.md              <- Code of Conduct for developers and users
├── LICENSE                         <- License file
├── MANIFEST.in                     <- Packaging information for pip
├── README.md                       <- Description of project which GitHub will render
├── molecool                        <- Python package directory containing the source code
│   ├── __init__.py                 <- Basic Python Package import file
│   ├── functions.py                <- Starting package module
│   ├── data                        <- Sample additional data (non-code) which can be packaged. Just an example, delete in production
│   │   ├── README.md
│   │   └── look_and_say.dat
│   └── tests                       <- Unit test directory with sample tests
│       ├── __init__.py
│       └── test_molecool.py
├── devtools                        <- Deployment, packaging, and CI helpers directory
│   ├── README.md
│   ├── conda-envs                  <- Conda environments for testing
│   │   └── test_env.yaml
│   └── scripts
│       └── create_conda_env.py     <- OS agnostic Helper script to make conda environments based on simple flags
├── docs                            <- Documentation template folder with many settings already filled in
│   ├── Makefile
│   ├── README.md                   <- Instructions on how to build the docs
│   ├── _static
│   │   └── README.md
│   ├── _templates
│   │   └── README.md
│   ├── api.rst
│   ├── conf.py
│   ├── getting_started.rst
│   ├── index.rst
│   ├── make.bat
│   └── requirements.yaml           <- Documentation building specific requirements. Usually a smaller set than the main program
├── readthedocs.yml
├── pyproject.toml                  <- Generic Python build system configuration (PEP-517).
├── setup.cfg                       <- Near-master config file to house INI-like settings for Coverage, Flake8, YAPF, etc.
├── setup.py                        <- Your package's setup file for installing with additional options that can be set
├── .codecov.yml                    <- Codecov config to help reduce its verbosity to more reasonable levels
├── .github                         <- GitHub hooks for user contribution, pull request guides and GitHub Actions CI
│   ├── CONTRIBUTING.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows
│       └── CI.yaml
├── .gitignore                      <- Stock helper file telling git what file name patterns to ignore when adding files
└── .lgtm.yml

The easiest way to start is to see what we currently have and try to decide which parts are related to one another. Looking through the functions.py file, we see a number of different functions, and for the sake of simplicity we abbreviate and rearrange them here:

atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
    'P': 30.973762,
    'F': 18.998403,
    'Cl': 35.453,
    'Br': 79.904,
}

atom_colors = {
    'H': 'white',
    'C': '#D3D3D3',
    'N': '#add8e6',
    'O': 'red',
    'P': '#FFA500',
    'F': '#FFFFE0',
    'Cl': '#98FB98',
    'Br': '#F4A460',
    'S': 'yellow'
}

def open_pdb(file_location):
    ...
   
def open_xyz(file_location):
    ...

def write_xyz(file_location, symbols, coordinates):
    ...
    
def calculate_distance(rA, rB):
    ...

def calculate_angle(rA, rB, rC, degrees=False):
    ...

def draw_molecule(coordinates, symbols, draw_bonds=None, save_location=None, dpi=300):
    ...

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):
    ...

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    ...
    
def calculate_molecular_mass(symbols):
    ...

def calculate_center_of_mass(symbols, coordinates):
    ...

Right at the start we can see two dictionaries of atom data. Clearly these are related and should probably be grouped together. Looking at the functions, we see two functions that handle opening files, open_pdb and open_xyz, and a function that writes a file, write_xyz. It may make sense to group these three together in a module based on input and output.

Let’s start making new modules to place our related functions into.

Atom Data#

We will take the atomic_weights and atom_colors dictionaries and move them into a separate module called atom_data.py. This keeps the constant data that our package uses in a single place. All the new modules we create can then access the data from that one location, avoiding the need to copy the dictionaries into each module that needs them. If we have any other atom-related data used by many of our functions, adding it to this module would be a good idea.

"""
Data used for the rest of the package.
"""

atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
    'P': 30.973762,
    'F': 18.998403,
    'Cl': 35.453,
    'Br': 79.904,
}

atom_colors = {
    'H': 'white',
    'C': '#D3D3D3',
    'N': '#add8e6',
    'O': 'red',
    'P': '#FFA500',
    'F': '#FFFFE0',
    'Cl': '#98FB98',
    'Br': '#F4A460',
    'S': 'yellow',
}
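
As a rough illustration of why a shared data module helps, the sketch below sums atomic weights for a list of symbols. The helper name molecular_mass and its body are assumptions made for illustration; the lesson only shows the signature of calculate_molecular_mass. A trimmed copy of the dictionary is repeated inline so the example runs on its own; inside the package, the data would come from atom_data.py.

```python
# Trimmed copy of the dictionary so this sketch is self-contained.
# Inside the package, this data would live in atom_data.py instead.
atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'O': 15.999,
}


def molecular_mass(symbols):
    """Hypothetical helper: sum the atomic weight of every symbol."""
    return sum(atomic_weights[symbol] for symbol in symbols)


# Water: one oxygen and two hydrogens.
water_mass = molecular_mass(['O', 'H', 'H'])
print(round(water_mass, 5))
```

Because every function looks up the same dictionary, correcting a weight or adding an element only has to happen in one file.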

Exercise - Grouping into Modules#

Take approximately 10 minutes to look through the rest of the functions in the functions module and group them together. Create a module for each group with a reasonable name.

Measure Module#

Our functions.py file contains two functions that handle taking measurements: calculate_distance and calculate_angle. Similar to atom_data, we will simply place these in a module within the main package. Since both functions are taking measurements, we will call it measure.py.

"""
This module is for functions that perform measurements.
"""

def calculate_distance(rA, rB):
    """Calculate the distance between two points.

    Parameters
    ----------
    rA, rB : np.ndarray
        The coordinates of each point.

    Returns
    -------
    distance : float
        The distance between the two points.

    Examples
    --------
    >>> r1 = np.array([0, 0, 0])
    >>> r2 = np.array([0, 0.1, 0])
    >>> calculate_distance(r1, r2)
    0.1
    """

    dist_vec = rA - rB
    distance = np.linalg.norm(dist_vec)

    return distance

def calculate_angle(rA, rB, rC, degrees=False):
    AB = rB - rA
    BC = rB - rC
    theta=np.arccos(np.dot(AB, BC)/(np.linalg.norm(AB)*np.linalg.norm(BC)))

    if degrees:
        return np.degrees(theta)
    else:
        return theta

Visualize Module#

Similarly, we have two functions that handle visualization of molecules. We will place them into a module called visualize.py.

"""
Functions for visualization of molecules
"""

def draw_molecule(coordinates, symbols, draw_bonds=None, save_location=None, dpi=300):

    # Create figure
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")

    # Get colors - based on atom name
    colors = []
    for atom in symbols:
        colors.append(atom_colors[atom])

    size = np.array(plt.rcParams["lines.markersize"] ** 2) * 200 / (len(coordinates))

    ax.scatter(
        coordinates[:, 0],
        coordinates[:, 1],
        coordinates[:, 2],
        marker="o",
        edgecolors="k",
        facecolors=colors,
        alpha=1,
        s=size,
    )

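    # Draw bonds
    if draw_bonds:
        for atoms, bond_length in draw_bonds.items():
            atom1 = atoms[0]
            atom2 = atoms[1]

            # Draw a line between the two bonded atoms
            ax.plot(
                coordinates[[atom1, atom2], 0],
                coordinates[[atom1, atom2], 1],
                coordinates[[atom1, atom2], 2],
                color="k",
            )

    # Save figure
    if save_location:
        plt.savefig(save_location, dpi=dpi)

    return ax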

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):
    lengths = []
    for atoms, bond_length in bond_list.items():
        lengths.append(bond_length)
    
    bins = np.linspace(graph_min, graph_max)
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    plt.xlabel('Bond Length (angstrom)')
    plt.ylabel('Number of Bonds')
    
    ax.hist(lengths, bins=bins)
    
    # Save figure
    if save_location:
        plt.savefig(save_location, dpi=dpi)
    
    return ax

Molecule Module#

Our last function is build_bond_list, which is not particularly related to any of our other functions. The name functions.py does not really give a lot of information about what is available in the module. We can rename the module to something more fitting, say molecule.py. We also add a docstring.

"""
Functions for molecule analysis
"""

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    """
    Build a list of bonds in a set of coordinates based on a distance criteria.

    Parameters
    ----------
    coordinates: np.ndarray
        The coordinates of the atoms to analyze in an (natoms, ndim) array.

    max_bond: float, optional
        The maximum distance for two atoms to be considered bonded.

    min_bond: float, optional
        The minimum distance for two atoms to be considered bonded.

    Returns
    -------
    bonds: dict
        A dictionary containing bonded atoms with atom pairs as keys and the distance between the atoms as the value.
    """

    # Find the bonds in a molecule (set of coordinates) based on distance criteria.
    bonds = {}
    num_atoms = len(coordinates)

    for atom1 in range(num_atoms):
        for atom2 in range(atom1, num_atoms):
            distance = calculate_distance(coordinates[atom1], coordinates[atom2])
            if distance > min_bond and distance < max_bond:
                bonds[(atom1, atom2)] = distance

    return bonds
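
To see what build_bond_list returns, here is a small self-contained sketch. It repeats the loop from the listing above, but uses math.dist from the standard library as a stand-in for calculate_distance and plain tuples instead of a NumPy array, so it runs without the rest of the package. The water coordinates are approximate values chosen only for illustration.

```python
import math


def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    """Same loop as the listing, with math.dist standing in for calculate_distance."""
    bonds = {}
    num_atoms = len(coordinates)

    for atom1 in range(num_atoms):
        for atom2 in range(atom1, num_atoms):
            distance = math.dist(coordinates[atom1], coordinates[atom2])
            if distance > min_bond and distance < max_bond:
                bonds[(atom1, atom2)] = distance

    return bonds


# Approximate water geometry in angstroms: O, H, H (illustrative values).
water = [
    (0.000, 0.000, 0.0),
    (0.757, 0.586, 0.0),
    (-0.757, 0.586, 0.0),
]

bonds = build_bond_list(water)
for pair, length in bonds.items():
    print(pair, round(length, 3))
```

The two O-H pairs fall inside the default distance window, while the H-H pair is slightly longer than max_bond and is left out. Each atom paired with itself has a distance of zero, so the distance > min_bond check excludes it.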

I/O Subpackage#

When looking at the three I/O functions, it may be tempting to jump ahead and create a single I/O module, as mentioned previously. However, what we really have is two distinct groups of related functions: two functions that handle the input and output of .xyz files, and another function that handles the input of .pdb files. Each group handles input and output, but the groups are still somewhat unrelated because they deal with different file types. Instead of making a single module, we are going to create a subpackage to handle I/O and place a module for each group within it.

Create a new directory called “io” within the package (using the command ‘mkdir directory_name’) and create two new files pdb.py and xyz.py (using the command touch file_name):

"""
Functions for manipulating pdb files.
"""

def open_pdb(file_location):
    """Open and read coordinates and atom symbols from a pdb file.

    The pdb file must specify the atom elements in the last column, and follow
    the conventions outlined in the PDB format specification.

    Parameters
    ----------
    file_location : str
        The location of the pdb file to read in.

    Returns
    -------
    symbols : np.ndarray
        The atomic symbols of the pdb file.
    coords : np.ndarray
        The coordinates of the pdb file.

    """

    with open(file_location) as f:
        data = f.readlines()

    coordinates = []
    symbols = []
    for line in data:
        if "ATOM" in line[0:6] or "HETATM" in line[0:6]:
            symbols.append(line[76:79].strip())
            atom_coords = [float(x) for x in line[30:55].split()]
            coordinates.append(atom_coords)

    coords = np.array(coordinates)
    symbols = np.array(symbols)

    return symbols, coords

The xyz.py module contains the two xyz functions:

"""
Functions for manipulating xyz files.
"""

def open_xyz(file_location):

    # Open an xyz file and return symbols and coordinates.
    xyz_file = np.genfromtxt(fname=file_location, skip_header=2, dtype="unicode")
    symbols = xyz_file[:, 0]
    coords = xyz_file[:, 1:]
    coords = coords.astype(float)
    return symbols, coords


def write_xyz(file_location, symbols, coordinates):

    num_atoms = len(symbols)

    if num_atoms != len(coordinates):
        raise ValueError(
            f"write_xyz : the number of symbols ({num_atoms}) and number of coordinates ({len(coordinates)}) must be the same to write xyz file!"
        )

    with open(file_location, "w+") as f:
        f.write("{}\n".format(num_atoms))
        f.write("XYZ file\n")

        for i in range(num_atoms):
            f.write(
                "{}\t{}\t{}\t{}\n".format(
                    symbols[i], coordinates[i, 0], coordinates[i, 1], coordinates[i, 2]
                )
            )

Now any module that needs to handle input and output can import the needed module from the io subpackage. Since these are currently small modules, it would not be a big deal to import all of them. But consider a large I/O suite covering many file types and functionalities: keeping everything in one module would quickly become unwieldy.

Now that we’ve organized and changed the structure in our project, we should commit our changes and push to GitHub.

git add .
git commit -m "organize molecool into modules and subpackage"
git push origin main

Fixing Imports#

When we first copied the functions from the Jupyter Notebook into functions.py, we were able to import the molecool package and access the functions within functions.py. Now that we have moved the functions out of that file, we can no longer import them in the same way. In fact, we cannot access them at all. Every time we restructure our code or create new folders, we have to be careful to modify the __init__.py accordingly. Let us add the new functions to the __init__.py.

# Add imports here
from .functions import *
from .measure import calculate_distance, calculate_angle
from .molecule import build_bond_list
from .visualize import draw_molecule, bond_histogram

In this way, we should be able to call each of the functions after importing our package.

>>> import molecool
>>> molecool.build_bond_list()

Even with the imports fixed, if you try to run some of these functions, you may find yourself with a NameError. This is because a function can only see the names that have been “loaded” into its own module. Each set of functions now exists in the context of its module “namespace”.
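
The namespace behavior can be reproduced in a few lines. The sketch below is only an illustration: it builds a throwaway module object in memory rather than a file on disk, uses the standard library math module as a stand-in for numpy, and the function name circle_area is hypothetical.

```python
import math
import types

# Source for a stand-in module that uses `math` without importing it,
# much like measure.py uses `np` before we add its import.
source = """
def circle_area(radius):
    return math.pi * radius ** 2
"""

# Create an empty module and run the source inside its namespace.
measure = types.ModuleType("measure")
exec(source, measure.__dict__)

# This script imported math, but the module's own namespace did not.
try:
    measure.circle_area(1.0)
    message = "no error"
except NameError as err:
    message = str(err)
print(message)

# Adding the import to the module's own source fixes the problem.
fixed = types.ModuleType("measure_fixed")
exec("import math\n" + source, fixed.__dict__)
area = fixed.circle_area(1.0)
print(area)
```

Importing math in the calling script does not help the first module, because names are looked up in the module where the function was defined, not where it is called.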

If we look at our original functions.py module, we will see that we had a number of import statements at the top of the file:

import os
import numpy as np
import matplotlib.pyplot as plt

These are modules that are needed by some functions. Now that we have moved the functions into separate modules, we need to add the import statements into each file where they are needed. Let’s start by looking at measure.py. Looking through the functions, we can see that each of them has a reference to np, which is what we imported numpy as in functions.py.

Aside from visual inspection, you could have also seen these missing imports by using flake8 on the modules.

flake8 measure.py

You will see a message which says “undefined name 'np'”.

In order to make these functions work again, we need to add the following import statement to the top of our file.

import numpy as np

As a second example, let us look at the visualize.py module. We can quickly see that there is a reference to np in each of the functions, so we need to add our numpy import statement again. We also see references to plt, which was the name given to matplotlib.pyplot when it was imported. Add imports of the external libraries to the top of the visualize.py module. Of course, don’t forget our “unused import” for 3D axes.

import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

If you use flake8 (or if you carefully inspect), you will also see that atom_colors is missing. The draw_molecule function uses the atom_colors dictionary. When all of our code was in a single module, we could simply reference the dictionary by name and use it. However, we have now moved atom_colors and atomic_weights into a separate module. In order to reference the dictionaries in visualize.py, we need to import them using an import statement. This is an intra-package import, meaning that we are importing from one module in our package into another module in the same package (see the “Intra-package References” section of the Python tutorial).

from .atom_data import atom_colors

This import statement looks a bit different from the other import statements in our code: we have a . before the name. This is because it is a relative import. Just like when using bash, a dot (.) means to look in the current folder.

To think about this more, let’s first look at the dot in a different import statement:

import matplotlib.pyplot as plt

In this case, the . is saying look within the package matplotlib and grab the subpackage (or module) pyplot. In our case, we are not using a name before the . so where is it looking? It is looking within the current package/directory, or in this case molecool for a module or package named atom_data, from which it will import the atom_colors dictionary.
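
To try a relative import without touching molecool, the following sketch writes a tiny throwaway package to a temporary directory and imports it. The package name demo_relative and the helper colors_for are hypothetical names used only for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_relative/{__init__,atom_data,visualize}.py
root = Path(tempfile.mkdtemp())
package_dir = root / "demo_relative"
package_dir.mkdir()

(package_dir / "__init__.py").write_text("")
(package_dir / "atom_data.py").write_text(
    "atom_colors = {'H': 'white', 'O': 'red'}\n"
)
(package_dir / "visualize.py").write_text(
    "from .atom_data import atom_colors  # relative import within the package\n"
    "\n"
    "def colors_for(symbols):\n"
    "    return [atom_colors[symbol] for symbol in symbols]\n"
)

# Make the temporary directory importable, then import the submodule.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
visualize = importlib.import_module("demo_relative.visualize")

colors = visualize.colors_for(["H", "O", "H"])
print(colors)
```

The leading dot resolves against the package that contains visualize.py, so the import keeps working even if the package directory is renamed.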

Check your Understanding - Relative Imports#

The molecule.py module also utilizes functions that are no longer available in the module. Correct the missing import statements in the module.

Using import *#

We have moved all of our functions into modules and we’ve updated our __init__.py file. If you use a Python interpreter in a directory which is not directly above your project, you can see the consequences of this. We can use the dir function to see what is available in a particular module or object:

>>> import molecool
>>> dir(molecool)

You should see something similar to the following

['__builtins__', '__cached__', '__doc__', '__file__', '__git_revision__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_version', 'functions', 'np', 'plt', 'calculate_distance', 'calculate_angle', 'build_bond_list', 'draw_molecule', 'bond_histogram']

These are all the things available to us from importing molecool. You will see your functions module, but you will also see np and plt. This comes from using from .functions import * and is why using import * is usually considered a bad practice.
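
The effect is easy to reproduce on a small scale. The sketch below creates two throwaway packages that differ only in their __init__.py: one uses import *, the other imports a single function by name. The package names demo_star and demo_explicit, the function circle_area, and the use of math in place of numpy are all assumptions made for this illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# A stand-in for functions.py: it imports a library and defines one function.
FUNCTIONS_SOURCE = (
    "import math\n"
    "\n"
    "def circle_area(radius):\n"
    "    return math.pi * radius ** 2\n"
)


def make_package(root, name, init_source):
    """Write a tiny throwaway package to disk and import it."""
    package_dir = root / name
    package_dir.mkdir()
    (package_dir / "functions.py").write_text(FUNCTIONS_SOURCE)
    (package_dir / "__init__.py").write_text(init_source)
    importlib.invalidate_caches()
    return importlib.import_module(name)


def public_names(module):
    """Names from dir() that do not start with an underscore."""
    return [name for name in dir(module) if not name.startswith("_")]


root = Path(tempfile.mkdtemp())
sys.path.insert(0, str(root))

star_pkg = make_package(root, "demo_star", "from .functions import *\n")
explicit_pkg = make_package(
    root, "demo_explicit", "from .functions import circle_area\n"
)

print(public_names(star_pkg))      # ['circle_area', 'functions', 'math']
print(public_names(explicit_pkg))  # ['circle_area', 'functions']
```

With import *, the library imported inside functions.py leaks into the package namespace, just as np and plt do in molecool. Importing names explicitly keeps the namespace limited to what you intend to expose.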

Since we are no longer using any code from functions.py, we will remove the statement importing it from __init__.py.

IO Subpackage#

We haven’t yet included our io subpackage, meaning that users would have to import it explicitly if they wanted to use it. For example, to use the xyz functions,

>>> import molecool.io.xyz
>>> dir(molecool.io.xyz.open_xyz)

This will work. However, the main reason we broke up the modules within the io package was for development convenience. Right now, this has come at the cost of slightly more complicated import statements to get access to any function.

We can, of course, edit our __init__.py file to make this simpler. At this point, the way we actually do this import is going to be stylistic - how do you want people to interact with your package?

The goal we are going for is to call an IO function using

molecool.io.IO_FUNCTION

where IO_FUNCTION is any function relating to IO.

Within the io directory, create a new file called __init__.py. Open that file with your desired editor and add the following two lines.

from .pdb import open_pdb
from .xyz import open_xyz, write_xyz

These lines are relative import statements for the functions within the io package. Think of them as pointers to the functions. When we look at the io package, it directs us to the location of the underlying functions, so we do not need to look within each submodule. This lets us add the following import statement to our top-level __init__.py to access the functions:

from . import io

We can now call our I/O functions using our target syntax.

>>> molecool.io.open_pdb()
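
The sketch below mirrors this layout in a throwaway package so the behavior can be checked in isolation. The package name demo_io_pkg is hypothetical, and the function bodies are placeholders that only return a string; the real functions read and write files.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
io_dir = root / "demo_io_pkg" / "io"
io_dir.mkdir(parents=True)

# Top-level __init__.py: expose the subpackage as demo_io_pkg.io
(root / "demo_io_pkg" / "__init__.py").write_text("from . import io\n")

# io/__init__.py: re-export the functions from each module
(io_dir / "__init__.py").write_text(
    "from .pdb import open_pdb\n"
    "from .xyz import open_xyz, write_xyz\n"
)

# Placeholder modules standing in for the real pdb.py and xyz.py
(io_dir / "pdb.py").write_text(
    "def open_pdb(file_location):\n"
    "    return 'pdb reader called with ' + file_location\n"
)
(io_dir / "xyz.py").write_text(
    "def open_xyz(file_location):\n"
    "    return 'xyz reader called with ' + file_location\n"
    "\n"
    "def write_xyz(file_location, symbols, coordinates):\n"
    "    return 'xyz writer called with ' + file_location\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_io_pkg = importlib.import_module("demo_io_pkg")

# The short path and the long path point at the same function object.
result = demo_io_pkg.io.open_pdb("water.pdb")
same_function = demo_io_pkg.io.open_xyz is demo_io_pkg.io.xyz.open_xyz
print(result)
print(same_function)
```

The longer path through the individual module still works, so the re-export adds a shortcut without taking anything away.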

If we wanted the I/O functions to mimic the imports from the rest of the modules, we could modify our top level __init__.py file to reflect that.

from .measure import calculate_distance, calculate_angle
from .molecule import build_bond_list
from .visualize import draw_molecule, bond_histogram
from .io import open_pdb, open_xyz, write_xyz

Importing the functions directly into the top-level namespace like this removes the need to go through the io subpackage, so we can call them by simply typing the following.

>>> molecool.open_pdb()
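
As a quick check of this flattened style, the sketch below builds another throwaway package whose top-level __init__.py imports a function straight from its io subpackage. The package name demo_flat_pkg and the placeholder function body are assumptions for illustration only.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
io_dir = root / "demo_flat_pkg" / "io"
io_dir.mkdir(parents=True)

# Top-level __init__.py pulls the function into the package namespace.
(root / "demo_flat_pkg" / "__init__.py").write_text(
    "from .io import open_pdb\n"
)
(io_dir / "__init__.py").write_text("from .pdb import open_pdb\n")
(io_dir / "pdb.py").write_text(
    "def open_pdb(file_location):\n"
    "    return 'pdb reader called with ' + file_location\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_flat_pkg = importlib.import_module("demo_flat_pkg")

# Both spellings reach the same function object.
short_call = demo_flat_pkg.open_pdb("water.pdb")
is_same = demo_flat_pkg.open_pdb is demo_flat_pkg.io.open_pdb
print(short_call)
print(is_same)
```

Which spelling to promote is a style decision: the flat form is shorter to type, while the io prefix tells the reader at a glance that a function deals with files.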

You can now appreciate how the __init__.py file plays such an important role in defining how the user imports the functions in the package.

Now that we’ve made some changes to __init__.py, we should commit our changes and push to GitHub.

git add .
git commit -m "fixing imports"
git push origin main

Final Repository State

You can see the final state of the repository after this section here.

You can also download a zip of the repository here.

Key Points#

  • Your package should be broken up into modules and subpackages depending on the amount of code and functionality.

  • You can use the __init__.py file to define what packages are imported with your package, and how the user interacts with it.