Deciding Package Structure

Overview

Teaching: 30 min
Exercises: 0 min
Questions
  • How should I break my code into modules?

  • How can I handle imports in my package?

Objectives
  • Break code into modules and subpackages based on functionality.

  • Understand how the __init__.py file affects your Python package.

As new features are implemented in a code, it is natural for new functions and objects to be added. In many projects, this leads to a large number of functions defined within a single module. For a small, single-developer code this is not a major issue, but it can still make the code difficult to work with. For large or multi-developer codes, it can slow development progress to a crawl, since the code becomes difficult to both understand and navigate.

In this lesson, we will simulate a developing code by starting with a single Python module containing all the methods we have developed, and converting it into a well-structured package.

Package Structure

Let's start by reviewing the package structure provided to us by the CMS CookieCutter. We have a directory containing our project along with a number of additional files. Under our package directory, molecool, we can see our current Python module, functions.py. For a more detailed explanation of the rest of the package structure, please review the package setup section of the lessons.

.
├── CODE_OF_CONDUCT.md              <- Code of Conduct for developers and users
├── LICENSE                         <- License file
├── MANIFEST.in                     <- Packaging information for pip
├── README.md                       <- Description of project which GitHub will render
├── molecool                        <- The main Python package directory, where the code lives
│   ├── __init__.py                 <- Basic Python Package import file
│   ├── _version.py                 <- Automatic version control with Versioneer
│   ├── data                        <- Sample additional data (non-code) which can be packaged. Just an example, delete in production
│   │   ├── README.md
│   │   └── look_and_say.dat
│   └── tests                       <- Unit test directory with sample tests
│       ├── __init__.py
│       └── test_molecool.py
├── devtools                        <- Deployment, packaging, and CI helpers directory
│   ├── README.md
│   ├── conda-envs                  <- Conda environments for testing
│   │   └── test_env.yaml
│   ├── legacy-miniconda-setup      <- Legacy Travis CI helper; will likely be removed in a later version
│   │   └── before_install.sh
│   └── scripts
│       └── create_conda_env.py     <- OS-agnostic helper script to make conda environments based on simple flags
├── docs                            <- Documentation template folder with many settings already filled in
│   ├── Makefile
│   ├── README.md                   <- Instructions on how to build the docs
│   ├── _static
│   │   └── README.md
│   ├── _templates
│   │   └── README.md
│   ├── api.rst
│   ├── conf.py
│   ├── getting_started.rst
│   ├── index.rst
│   ├── make.bat
│   └── requirements.yaml           <- Documentation-building requirements. Usually a smaller set than the main program's
├── readthedocs.yml
├── setup.cfg                       <- Near-master config file to house INI-like settings for Coverage, Flake8, YAPF, etc.
├── setup.py                        <- Your package's setup file for installing with additional options that can be set
├── versioneer.py                   <- Automatic version control with Versioneer
├── .codecov.yml                    <- Codecov config to help reduce its verbosity to more reasonable levels
├── .github                         <- GitHub hooks for user contribution, pull request guides and GitHub Actions CI
│   ├── CONTRIBUTING.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows
│       └── CI.yaml
├── .gitignore                      <- Stock helper file telling git what file name patterns to ignore when adding files
└── .lgtm.yml

The easiest way to start is to look at what we currently have and decide what is related to what. Looking through the functions.py file, we see a number of different functions; for the sake of simplicity, we abbreviate and rearrange them here:

atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
    'P': 30.973762,
    'F': 18.998403,
    'Cl': 35.453,
    'Br': 79.904,
}

atom_colors = {
    'H': 'white',
    'C': '#D3D3D3',
    'N': '#add8e6',
    'O': 'red',
    'P': '#FFA500',
    'F': '#FFFFE0',
    'Cl': '#98FB98',
    'Br': '#F4A460',
    'S': 'yellow'
}

def open_pdb(file_location):
   
def open_xyz(file_location):

def write_xyz(file_location, symbols, coordinates):
    
def calculate_distance(rA, rB):

def calculate_angle(rA, rB, rC, degrees=False):

def draw_molecule(coordinates, symbols, draw_bonds=None, save_location=None, dpi=300):

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    
def calculate_molecular_mass(symbols):

def calculate_center_of_mass(symbols, coordinates):

Right at the start we can see two dictionaries of atom data. Clearly these are related and should probably be grouped together. Looking at the functions, we see two functions that handle opening files, open_pdb and open_xyz, and a function that writes a file, write_xyz. It may make sense to group these three together in a module based on input and output.

Let's start making new modules to hold our related functions.

Atom Data

We will take the atomic_weights and atom_colors dictionaries and move them into a separate module called atom_data.py. This collects the constant data our package uses in a single place, allowing all of the new modules we create to access the data from one location and avoiding the need to copy the dictionaries into each module that needs them. If we have any other atom-related data used by many of our functions, this module would be a good place for it.

"""
Data used for the rest of the package.
"""

atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
    'P': 30.973762,
    'F': 18.998403,
    'Cl': 35.453,
    'Br': 79.904,
}

atom_colors = {
    'H': 'white',
    'C': '#D3D3D3',
    'N': '#add8e6',
    'O': 'red',
    'P': '#FFA500',
    'F': '#FFFFE0',
    'Cl': '#98FB98',
    'Br': '#F4A460',
    'S': 'yellow'
}
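
Other code can now pull these constants from one place. The sketch below inlines a trimmed copy of atomic_weights so it runs standalone; in the package, a function like calculate_molecular_mass (from functions.py, body sketched here) would instead import the dictionary from atom_data:

```python
# Standalone sketch: in the package, this dictionary would be
# imported from atom_data.py rather than redefined here.
atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'O': 15.999,
}

def calculate_molecular_mass(symbols):
    """Sum the atomic weights for a list of element symbols."""
    return sum(atomic_weights[s] for s in symbols)

# Water: two hydrogens and one oxygen, roughly 18.015.
print(calculate_molecular_mass(['H', 'H', 'O']))
```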

Exercise

Take approximately 10 minutes to look through the rest of the functions in the functions module and group them together. Create a module for each group with a reasonable name.

Answer

Here is how we decided to break up the functions:

  • calculate_angle and calculate_distance go together in a measure module.
  • draw_molecule and bond_histogram go into a visualize module.
  • build_bond_list is placed into a molecule module.
  • open_pdb into a pdb module in an io package.
  • open_xyz and write_xyz are placed into an xyz module in an io package.

Measure

Our functions.py file contains two functions that take measurements, calculate_distance and calculate_angle. Similar to atom_data, we will simply place these in a module within the main package. Since both functions take measurements, we will call it measure.py.

"""
This module is for functions which perform measurements.
"""
def calculate_distance(rA, rB):
    dist_vec = (rA - rB)
    distance = np.linalg.norm(dist_vec)
    return distance

def calculate_angle(rA, rB, rC, degrees=False):
    AB = rB - rA
    BC = rB - rC
    theta = np.arccos(np.dot(AB, BC) / (np.linalg.norm(AB) * np.linalg.norm(BC)))

    if degrees:
        return np.degrees(theta)
    else:
        return theta
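
A quick way to convince yourself the module behaves as expected is to exercise both functions on points with known geometry. The two definitions are repeated inline so this sketch runs on its own:

```python
import numpy as np

def calculate_distance(rA, rB):
    """Distance between two points."""
    return np.linalg.norm(rA - rB)

def calculate_angle(rA, rB, rC, degrees=False):
    """Angle at vertex rB formed by points rA, rB, rC."""
    AB = rB - rA
    BC = rB - rC
    theta = np.arccos(np.dot(AB, BC) / (np.linalg.norm(AB) * np.linalg.norm(BC)))
    return np.degrees(theta) if degrees else theta

# Three points forming a right angle at rB.
rA = np.array([0.0, 0.0, 0.0])
rB = np.array([0.0, 1.0, 0.0])
rC = np.array([1.0, 1.0, 0.0])
print(calculate_distance(rA, rB))                 # 1.0
print(calculate_angle(rA, rB, rC, degrees=True))  # 90.0
```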

Visualize

Similarly, we have two functions that handle visualization of molecules. We will place them into a module called visualize.py.

"""
Functions for visualization of molecules
"""
def draw_molecule(coordinates, symbols, draw_bonds=None, save_location=None, dpi=300):
    # Create figure
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    
    # Get colors - based on atom name
    colors = []
    for atom in symbols:
        colors.append(atom_colors[atom])
    
    size = np.array(plt.rcParams['lines.markersize'] ** 2)*200/(len(coordinates))

    ax.scatter(coordinates[:,0], coordinates[:,1], coordinates[:,2], marker="o",
               edgecolors='k', facecolors=colors, alpha=1, s=size)
    
    # Draw bonds
    if draw_bonds:
        for atoms, bond_length in draw_bonds.items():
            atom1 = atoms[0]
            atom2 = atoms[1]
            
            ax.plot(coordinates[[atom1,atom2], 0], coordinates[[atom1,atom2], 1],
                    coordinates[[atom1,atom2], 2], color='k')
            
    plt.axis('square')
    
    # Save figure
    if save_location:
        plt.savefig(save_location, dpi=dpi)
    
    return ax

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):
    lengths = []
    for atoms, bond_length in bond_list.items():
        lengths.append(bond_length)
    
    bins = np.linspace(graph_min, graph_max)
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    plt.xlabel('Bond Length (angstrom)')
    plt.ylabel('Number of Bonds')
    
    
    ax.hist(lengths, bins=bins)
    
    # Save figure
    if save_location:
        plt.savefig(save_location, dpi=dpi)
    
    return ax
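
If you want to verify the module works in a headless environment (for example, on CI), you can force the Agg backend and write the figure to a temporary file. This is a condensed copy of bond_histogram with made-up bond data, assuming matplotlib is installed:

```python
import os
import tempfile
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import matplotlib.pyplot as plt

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):
    """Plot a histogram of bond lengths (condensed copy of the module function)."""
    lengths = [bond_length for atoms, bond_length in bond_list.items()]
    bins = np.linspace(graph_min, graph_max)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    plt.xlabel('Bond Length (angstrom)')
    plt.ylabel('Number of Bonds')
    ax.hist(lengths, bins=bins)
    if save_location:
        plt.savefig(save_location, dpi=dpi)
    return ax

# Histogram a tiny water-like bond dictionary into a temporary PNG.
bonds = {(0, 1): 0.96, (0, 2): 0.96}
out = os.path.join(tempfile.mkdtemp(), 'bonds.png')
bond_histogram(bonds, save_location=out, dpi=100)
print(os.path.exists(out))  # True
```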

Molecule

Our last function is build_bond_list, which is not particularly related to any of our other functions (docstring added). The name functions.py does not give much information about what is available in the module, so we can rename the module to something more fitting, say molecule.py.

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    """
    Build a list of bonds in a set of coordinates based on a distance criteria.

    Parameters
    ----------
    coordinates: np.ndarray
        The coordinates of the atoms to analyze in an (natoms, ndim) array.

    max_bond: float, optional
        The maximum distance for two atoms to be considered bonded.

    min_bond: float, optional
        The minimum distance for two atoms to be considered bonded.

    Returns
    -------
    bonds: dict
        A dictionary containing bonded atoms with atom pairs as keys and the distance between the atoms as the value.
    """

    # Find the bonds in a molecule (set of coordinates) based on distance criteria.
    bonds = {}
    num_atoms = len(coordinates)

    for atom1 in range(num_atoms):
        for atom2 in range(atom1 + 1, num_atoms):
            distance = calculate_distance(coordinates[atom1], coordinates[atom2])
            if distance > min_bond and distance < max_bond:
                bonds[(atom1, atom2)] = distance

    return bonds
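
Here is a small standalone check of the bonding logic; calculate_distance is repeated inline so the snippet runs on its own, whereas in the package it lives in the measure module:

```python
import numpy as np

def calculate_distance(rA, rB):
    """Distance between two points (inline copy of the measure function)."""
    return np.linalg.norm(rA - rB)

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    """Find bonded atom pairs based on a distance criterion."""
    bonds = {}
    num_atoms = len(coordinates)
    for atom1 in range(num_atoms):
        for atom2 in range(atom1 + 1, num_atoms):
            distance = calculate_distance(coordinates[atom1], coordinates[atom2])
            if min_bond < distance < max_bond:
                bonds[(atom1, atom2)] = distance
    return bonds

# Three atoms in a row, 1.0 apart: only adjacent pairs fall within max_bond.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0]])
bonds = build_bond_list(coords)
print(sorted(bonds))  # [(0, 1), (1, 2)]
```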

I/O Package

When looking at the three I/O functions, it may be tempting to jump ahead and create a single I/O module, as mentioned previously. However, what we really have is two distinct groups of related functions: two that handle the input and output of .xyz files, and one that handles the input of .pdb files. Each group handles input and output, but the groups are somewhat unrelated because of their file types. Instead of making a single module, we are going to create a subpackage to handle I/O and place a module for each group within it.

Create a new directory called io within the package and create two new files pdb.py and xyz.py:

pdb.py

"""
Functions for manipulating pdb files.
"""
def open_pdb(file_location):
    with open(file_location) as f:
        data = f.readlines()
    
    coordinates = []
    symbols = []
    
    for line in data:
        if 'ATOM' in line[0:6] or 'HETATM' in line[0:6]:
            symbols.append(line[76:79].strip())
            
            coords = [float(x) for x in line[30:55].split()]
            coordinates.append(coords)
        
    coords = np.array(coordinates)
    symbols = np.array(symbols)

    return symbols, coords

xyz.py

"""
Functions for manipulating xyz files.
"""
def open_xyz(file_location):
    # Open an xyz file and return symbols and coordinates.
    xyz_file = np.genfromtxt(fname=file_location, skip_header=2, dtype='unicode')
    symbols = xyz_file[:,0]
    coords = xyz_file[:, 1:]
    coords = coords.astype(float)
    return symbols, coords

def write_xyz(file_location, symbols, coordinates):
    num_atoms = len(symbols)
    
    with open(file_location, 'w+') as f:
        f.write('{}\n'.format(num_atoms))
        f.write('XYZ file\n')
        
        for i in range(num_atoms):
            f.write('{}\t{}\t{}\t{}\n'.format(symbols[i], 
                                              coordinates[i,0], coordinates[i,1], coordinates[i,2]))
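
A round trip through a temporary file is a quick way to check that write_xyz and open_xyz agree with each other. The definitions are repeated inline (with the coordinate columns cast to plain float) so the sketch runs standalone:

```python
import os
import tempfile
import numpy as np

def write_xyz(file_location, symbols, coordinates):
    """Write symbols and coordinates in xyz format (inline copy)."""
    num_atoms = len(symbols)
    with open(file_location, 'w+') as f:
        f.write('{}\n'.format(num_atoms))
        f.write('XYZ file\n')
        for i in range(num_atoms):
            f.write('{}\t{}\t{}\t{}\n'.format(symbols[i],
                    coordinates[i, 0], coordinates[i, 1], coordinates[i, 2]))

def open_xyz(file_location):
    """Read an xyz file and return symbols and coordinates (inline copy)."""
    xyz_file = np.genfromtxt(fname=file_location, skip_header=2, dtype='unicode')
    symbols = xyz_file[:, 0]
    coords = xyz_file[:, 1:].astype(float)
    return symbols, coords

# Round-trip a tiny water-like geometry through a temporary file.
symbols = np.array(['O', 'H', 'H'])
coords = np.array([[0.0, 0.0, 0.0],
                   [0.757, 0.586, 0.0],
                   [-0.757, 0.586, 0.0]])
path = os.path.join(tempfile.mkdtemp(), 'water.xyz')
write_xyz(path, symbols, coords)
read_symbols, read_coords = open_xyz(path)
print(list(read_symbols))                # ['O', 'H', 'H']
print(np.allclose(read_coords, coords))  # True
```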

Now any module that needs to handle input and output can import the relevant module from the io package. Since these are currently small modules, importing all of them would not be a problem, but consider a large I/O suite containing many file types and functionalities: leaving them all in one module would quickly become unwieldy.

Fixing Imports

When we first copied the functions from the Jupyter Notebook into functions.py, we were able to import the molecool package and access the functions within functions.py. After extracting the functions from that file, we can no longer import them in the same way; in fact, we cannot access them at all. Every time we restructure our code or create new folders, we have to be careful to modify the __init__.py file accordingly. Let us add the new functions to the __init__.py:

# Add imports here
from .functions import *
from .measure import calculate_distance, calculate_angle
from .molecule import build_bond_list
from .visualize import draw_molecule, bond_histogram

With this in place, we should be able to call each of the functions after importing our package:

>>> import molecool
>>> molecool.build_bond_list()

Even with the imports fixed, if you try to run some of these functions, you may find yourself with a NameError. This is because each function can only see the names that have been loaded into its own module; each set of functions now stands alone within its module.

If we look at our original functions.py module, we will see that we had a number of import statements at the top of the file:

import os
import numpy as np
import matplotlib.pyplot as plt

These are modules that some of the functions need in order to run. Now that we have moved the functions into separate modules, we need to add the import statements to each file where they are needed. Let's start by looking at measure.py. Looking through the functions, we can see that each references np, the name we gave numpy when importing it in functions.py.

Besides visual inspection, you could have also seen these missing imports by using flake8 on the modules.

$ flake8 measure.py

You will see a message which says "F821 undefined name 'np'".

In order to make these functions work again, we need to add the import statement

import numpy as np

to the top of our file.

As a second example, let us look at the visualize.py module. We can quickly see that there is a reference to np in each of the functions, so we need our numpy import statement again. We also see references to plt, the name given to matplotlib.pyplot when it was imported. Add the imports of the external libraries to the top of the visualize.py module. Of course, don't forget our "unused import" for 3D axes.

import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

If you use flake8 (or if you carefully inspect), you will also see that atom_colors is missing: the draw_molecule function uses the atom_colors dictionary. When all of our code was in a single module, we could simply reference the dictionary by name and use it. However, we have now moved atom_colors and atomic_weights into a separate module, so to reference the dictionaries in visualize.py we need to import them. This is an intra-package import, meaning that we import names from one module of our package into another (see the Python documentation on intra-package references).

from .atom_data import atom_colors

This import statement looks a bit different from the other import statements in our code: there is a . before the module name. This is because it is a relative import. Just like when using bash, a dot (.) means to look in the current directory.

To think about this more, lets first look at the dot in a different import statement:

import matplotlib.pyplot as plt

In this case, the . is saying look within the package matplotlib and grab the subpackage (or module) pyplot. In our case, we are not using a name before the . so where is it looking? It is looking within the current package/directory, or in this case molecool for a module or package named atom_data, from which it will import the atom_colors dictionary.
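
To see a relative import in action without touching molecool, we can generate a tiny throwaway package on disk. The name mypkg and its contents are made up purely for illustration:

```python
import os
import sys
import tempfile
import textwrap

# Build a hypothetical package "mypkg" in a temporary directory.
root = tempfile.mkdtemp()
pkg = os.path.join(root, 'mypkg')
os.makedirs(pkg)

with open(os.path.join(pkg, '__init__.py'), 'w') as f:
    f.write('from .colors import lookup\n')

with open(os.path.join(pkg, 'atom_data.py'), 'w') as f:
    f.write("atom_colors = {'H': 'white', 'O': 'red'}\n")

with open(os.path.join(pkg, 'colors.py'), 'w') as f:
    f.write(textwrap.dedent('''\
        # Relative import: the leading dot means "look in this package".
        from .atom_data import atom_colors

        def lookup(symbol):
            return atom_colors[symbol]
    '''))

# Make the temporary directory importable and try it out.
sys.path.insert(0, root)
import mypkg
print(mypkg.lookup('O'))  # red
```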

Check your understanding

The molecule.py module also utilizes functions that are no longer available in the module. Correct the missing import statements in the module.

Answer

The build_bond_list functions utilizes the calculate_distance function, which is now in the measure module, so we want to create a relative import from the measure module.

from .measure import calculate_distance

Fixing Package Imports - what is __init__.py?

Currently, your __init__.py file should look like this:

"""
molecool
A python package for visualizing and analyzing molecular files. This is a sample package for a Best Practices Workshop from MolSSI.
"""

# Add imports here
from .functions import *

# Handle versioneer
from ._version import get_versions

versions = get_versions()
__version__ = versions["version"]
__git_revision__ = versions["full-revisionid"]
del get_versions, versions

We have moved all of our functions into modules, but we haven't changed our __init__.py file. If you use a Python interpreter in a directory that is not directly above your project, you can see the consequences of this. We can use the dir function to see what is available in a particular module or object:

>>> import molecool
>>> dir(molecool)

You should see something similar to the following

['__builtins__', '__cached__', '__doc__', '__file__', '__git_revision__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_version', 'functions', 'np', 'plt']

These are all of the things available to us from importing molecool. You will see your functions module, but you will also see np and plt. This comes from using from .functions import * and is why using import * is usually considered a bad practice. You will notice that we cannot call molecool.measure.calculate_distance:

help(molecool.measure.calculate_distance)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'molecool' has no attribute 'measure'

You can, however, do the following without getting an error:

import molecool.measure
help(molecool.measure.calculate_distance)

To change this behavior, we will need to modify our __init__.py file. The __init__.py file contains Python code that is run when the package is imported. We edit it to import our modules when the package loads, so that the functions in, for example, the measure.py module are available as soon as the package is imported.

Modify your __init__.py file:

# Add imports here
from .functions import *
from .measure import calculate_distance, calculate_angle
from .molecule import build_bond_list
from .visualize import draw_molecule, bond_histogram

Now the behavior should be the same as before. You can also delete functions.py if you would like, since we no longer keep functions in that file (you must then also remove its import from your __init__.py, of course).

We haven’t yet included our io subpackage, meaning that a user has to import it explicitly in order to use it. For example, to use the xyz functions:

import molecool.io.xyz
dir(molecool.io.xyz)

This will work; however, the main reason we broke up the modules within the io package was developer convenience, and right now that has come at the cost of more complicated import statements for anyone who wants to use the functions.

We can, of course, edit our __init__.py file to make this simpler. At this point, the way we actually do this import is going to be stylistic - how do you want people to interact with your package?

The goal we are going to go for is to call an IO function using:

molecool.io.IO_FUNCTION

where IO_FUNCTION is any function relating to IO.

Within the io directory, create a new file called __init__.py. Open that file within your desired editor and add the following two lines:

from .pdb import open_pdb
from .xyz import open_xyz, write_xyz

These lines are relative import statements for the functions within the io package. Think of them as pointers to the functions: when we look at the io package, it directs us to the location of the underlying functions, so we do not need to look within each submodule. This allows us to add the following import statement to our top level __init__.py to access the functions:

from . import io

We can now call our IO functions using our target syntax.

>>> molecool.io.open_pdb()
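
The same wiring can be demonstrated end to end with a tiny throwaway package; the name toypkg and its stub open_xyz are made up for illustration. The top level __init__.py runs from . import io, and io/__init__.py re-exports the functions from its modules:

```python
import os
import sys
import tempfile

# Build a hypothetical package "toypkg" with an io subpackage.
root = tempfile.mkdtemp()
io_dir = os.path.join(root, 'toypkg', 'io')
os.makedirs(io_dir)

# Top-level __init__.py pulls in the io subpackage.
with open(os.path.join(root, 'toypkg', '__init__.py'), 'w') as f:
    f.write('from . import io\n')

# io/__init__.py re-exports the functions from its modules.
with open(os.path.join(io_dir, '__init__.py'), 'w') as f:
    f.write('from .xyz import open_xyz\n')

# A stub io/xyz.py standing in for the real file reader.
with open(os.path.join(io_dir, 'xyz.py'), 'w') as f:
    f.write('def open_xyz(file_location):\n    return file_location\n')

sys.path.insert(0, root)
import toypkg
print(toypkg.io.open_xyz('water.xyz'))  # water.xyz
```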

If we wanted the io functions to mimic the imports from the rest of the modules, we could modify our top level __init__.py file to reflect that.

from .functions import *
from .measure import calculate_distance, calculate_angle
from .molecule import build_bond_list
from .visualize import draw_molecule, bond_histogram
from .io import open_pdb, open_xyz, write_xyz

And we could even make these functions more accessible by removing the need for the io prefix, importing them directly in the top level __init__.py:

from .io.pdb import open_pdb
from .io.xyz import open_xyz, write_xyz

This would allow us to call the functions by simply typing

>>> molecool.open_pdb()

You can now appreciate how the __init__.py file plays such an important role in defining how the user imports the functions in the package.

Key Points

  • Your package should be broken up into modules and subpackages depending on the amount of code and functionality.

  • You can use the __init__.py file to define what is imported with your package, and how the user interacts with it.