Deciding Package Structure#

Overview

Questions:

How should I organize my code?
How can I handle imports in my package?

Objectives:

Break code into modules and subpackages based on functionality.
Understand how the __init__.py file affects your Python package.

Follow Along with This Lesson

To follow along with this lesson, you can complete the previous lessons, or you can download a pre-made workshop repository that is at the starting point.

You will need to make sure that you have git installed and configured, as described in the set-up instructions.

SHELL

git clone https://github.com/MolSSI-Education/molecool.git
cd molecool
git checkout deciding-package-structure-start
git switch -c main

You can also download the pre-made workshop repository as a zip file. If downloading as a zip file, you will need to initialize git in the repository and make an initial commit in order to use git.

As new features are implemented, it is natural for new functions and objects to be added. In many projects, this leads to a large amount of functionality defined within a single module. For small, single-developer projects this is not a major issue, but it can still make the code difficult to work with. In large or multi-developer projects, it can slow development to a crawl, because the code becomes difficult both to understand and to modify.

In this lesson, we will simulate a developing piece of software. We will start with a single Python module containing all the functions we have developed, and convert it into a well-structured package.

Package Structure#

Let’s start by reviewing the package structure provided to us by the CMS CookieCutter. We have a directory containing our project with a number of additional features. Under our package directory, molecool, we can see our current Python module, functions.py. For a more detailed explanation of the rest of the package structure, please review the package setup section of the lessons.

.
├── CODE_OF_CONDUCT.md              <- Code of Conduct for developers and users
├── LICENSE                         <- License file
├── MANIFEST.in                     <- Packaging information for pip
├── README.md                       <- Description of project which GitHub will render
├── molecool                        <- Python package directory
│   ├── __init__.py                 <- Basic Python Package import file
│   ├── functions.py                <- Starting package module
│   ├── data                        <- Sample additional data (non-code) which can be packaged. Just an example, delete in production
│   │   ├── README.md
│   │   └── look_and_say.dat
│   └── tests                       <- Unit test directory with sample tests
│       ├── __init__.py
│       └── test_molecool.py
├── devtools                        <- Deployment, packaging, and CI helpers directory
│   ├── README.md
│   ├── conda-envs                  <- Conda environments for testing
│   │   └── test_env.yaml
│   └── scripts
│       └── create_conda_env.py     <- OS agnostic Helper script to make conda environments based on simple flags
├── docs                            <- Documentation template folder with many settings already filled in
│   ├── Makefile
│   ├── README.md                   <- Instructions on how to build the docs
│   ├── _static
│   │   └── README.md
│   ├── _templates
│   │   └── README.md
│   ├── api.rst
│   ├── conf.py
│   ├── getting_started.rst
│   ├── index.rst
│   ├── make.bat
│   └── requirements.yaml           <- Documentation building specific requirements. Usually a smaller set than the main program
├── readthedocs.yml
├── pyproject.toml                  <- Generic Python build system configuration (PEP-517).
├── setup.cfg                       <- Near-master config file to house INI-like settings for Coverage, Flake8, YAPF, etc.
├── setup.py                        <- Your package's setup file for installing with additional options that can be set
├── .codecov.yml                    <- Codecov config to help reduce its verbosity to more reasonable levels
├── .github                         <- GitHub hooks for user contribution, pull request guides and GitHub Actions CI
│   ├── CONTRIBUTING.md
│   ├── PULL_REQUEST_TEMPLATE.md
│   └── workflows
│       └── CI.yaml
├── .gitignore                      <- Stock helper file telling git what file name patterns to ignore when adding files
└── .lgtm.yml

The easiest way to start is to see what we currently have and try to decide which parts are related to one another. Looking through the functions.py file, we see a number of different functions, and for the sake of simplicity we abbreviate and rearrange them here:

FUNCTIONS.PY

atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
    'P': 30.973762,
    'F': 18.998403,
    'Cl': 35.453,
    'Br': 79.904,
}

atom_colors = {
    'H': 'white',
    'C': '#D3D3D3',
    'N': '#add8e6',
    'O': 'red',
    'P': '#FFA500',
    'F': '#FFFFE0',
    'Cl': '#98FB98',
    'Br': '#F4A460',
    'S': 'yellow'
}

def open_pdb(file_location):
    ...
   
def open_xyz(file_location):
    ...

def write_xyz(file_location, symbols, coordinates):
    ...
    
def calculate_distance(rA, rB):
    ...

def calculate_angle(rA, rB, rC, degrees=False):
    ...

def draw_molecule(coordinates, symbols, draw_bonds=None, save_location=None, dpi=300):
    ...

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):
    ...

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    ...
    
def calculate_molecular_mass(symbols):
    ...

def calculate_center_of_mass(symbols, coordinates):
    ...

Right at the start we can see two dictionaries of atom data. Clearly these are related and should probably be grouped together. Looking at the functions, we see two functions that handle opening files, open_pdb and open_xyz, and a function that writes a file, write_xyz. It may make sense to group these three together in a module based on input and output.

Let’s start making new modules to place our related functions into.

Atom Data#

We will take the atomic_weights and atom_colors dictionaries and move them into a separate module called atom_data.py. This keeps the constant data used by our package in a single place, so every new module we create can access it from one location instead of carrying its own copy of the dictionaries. If we later have other atom-related data used by many of our functions, this module would be a good place to add it.

ATOM_DATA.PY

"""
Data used for the rest of the package.
"""

atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
    'P': 30.973762,
    'F': 18.998403,
    'Cl': 35.453,
    'Br': 79.904,
}

atom_colors = {
    'H': 'white',
    'C': '#D3D3D3',
    'N': '#add8e6',
    'O': 'red',
    'P': '#FFA500',
    'F': '#FFFFE0',
    'Cl': '#98FB98',
    'Br': '#F4A460',
    'S': 'yellow',
}
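
To illustrate why a single shared location helps, here is a minimal sketch of a function that looks up values in atomic_weights. The dictionary is repeated inline (and shortened) only so the snippet runs on its own; inside the package it would live only in atom_data.py. The name sum_atomic_weights is a hypothetical helper for illustration, not part of molecool.

```python
# Inline, shortened copy so the sketch is self-contained; inside the package
# this dictionary would be defined once, in atom_data.py.
atomic_weights = {
    'H': 1.00784,
    'C': 12.0107,
    'N': 14.0067,
    'O': 15.999,
}


def sum_atomic_weights(symbols):
    """Hypothetical helper: add up the atomic weight of each symbol."""
    return sum(atomic_weights[symbol] for symbol in symbols)


# Water: two hydrogens and one oxygen
water_mass = sum_atomic_weights(['H', 'H', 'O'])
print(round(water_mass, 5))
```

Any function that needs atomic weights can use the same dictionary, so a corrected or newly added value only has to be changed in one file.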

Exercise - Grouping into Modules#

Take approximately 10 minutes to look through the rest of the functions in the functions module and group them together. Create a module for each group with a reasonable name.

Solution

Here is how we decided to break up the functions:

calculate_angle and calculate_distance go together in a measure module.
draw_molecule and bond_histogram go into a visualize module.
build_bond_list is placed into a molecule module.
open_pdb goes into a pdb module in an io subpackage.
open_xyz and write_xyz are placed into an xyz module in an io subpackage.

Measure Module#

Our functions.py file contains two functions that handle taking measurements: calculate_distance and calculate_angle. Similar to atom_data, we will simply place these in a module within the main package. Since both functions are taking measurements, we will call it measure.py.

MEASURE.PY

"""
This module is for functions that perform measurements.
"""

def calculate_distance(rA, rB):
    """Calculate the distance between two points.

    Parameters
    ----------
    rA, rB : np.ndarray
        The coordinates of each point.

    Returns
    -------
    distance : float
        The distance between the two points.

    Examples
    --------
    >>> r1 = np.array([0, 0, 0])
    >>> r2 = np.array([0, 0.1, 0])
    >>> calculate_distance(r1, r2)
    0.1
    """

    dist_vec = rA - rB
    distance = np.linalg.norm(dist_vec)

    return distance

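def calculate_angle(rA, rB, rC, degrees=False):
    """Calculate the angle between three points, with rB at the vertex.

    The angle is returned in radians unless degrees is True.
    """
    AB = rB - rA
    BC = rB - rC
    theta = np.arccos(np.dot(AB, BC) / (np.linalg.norm(AB) * np.linalg.norm(BC)))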

    if degrees:
        return np.degrees(theta)
    else:
        return theta
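
As a quick check of how these two functions behave, the sketch below repeats them in compact form and applies them to three points chosen for illustration. It assumes NumPy is imported as np, which the module listing above does not yet do.

```python
import numpy as np


def calculate_distance(rA, rB):
    """Distance between two points (compact copy of the function above)."""
    return np.linalg.norm(rA - rB)


def calculate_angle(rA, rB, rC, degrees=False):
    """Angle at rB formed by rA and rC (compact copy of the function above)."""
    AB = rB - rA
    BC = rB - rC
    theta = np.arccos(np.dot(AB, BC) / (np.linalg.norm(AB) * np.linalg.norm(BC)))
    return np.degrees(theta) if degrees else theta


# Three illustrative points forming a right angle at the origin
r1 = np.array([1.0, 0.0, 0.0])
r2 = np.array([0.0, 0.0, 0.0])
r3 = np.array([0.0, 2.0, 0.0])

distance = calculate_distance(r1, r2)
angle = calculate_angle(r1, r2, r3, degrees=True)
print(distance, angle)
```

The second argument of calculate_angle is the vertex of the angle, so the order in which the points are passed matters.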

Visualize Module#

Similarly, we have two functions that handle visualization of molecules. We will place them into a module called visualize.py.

VISUALIZE.PY

"""
Functions for visualization of molecules
"""

def draw_molecule(coordinates, symbols, draw_bonds=None, save_location=None, dpi=300):

    # Create figure
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")

    # Get colors - based on atom name
    colors = []
    for atom in symbols:
        colors.append(atom_colors[atom])

    size = np.array(plt.rcParams["lines.markersize"] ** 2) * 200 / (len(coordinates))

    ax.scatter(
        coordinates[:, 0],
        coordinates[:, 1],
        coordinates[:, 2],
        marker="o",
        edgecolors="k",
        facecolors=colors,
        alpha=1,
        s=size,
    )

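    # Draw bonds
    if draw_bonds:
        for atoms, bond_length in draw_bonds.items():
            atom1 = atoms[0]
            atom2 = atoms[1]

            # Draw a line between the two bonded atoms
            ax.plot(
                coordinates[[atom1, atom2], 0],
                coordinates[[atom1, atom2], 1],
                coordinates[[atom1, atom2], 2],
                color="k",
            )

    # Save figure
    if save_location:
        plt.savefig(save_location, dpi=dpi)

    return ax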

def bond_histogram(bond_list, save_location=None, dpi=300, graph_min=0, graph_max=2):
    lengths = []
    for atoms, bond_length in bond_list.items():
        lengths.append(bond_length)
    
    bins = np.linspace(graph_min, graph_max)
    
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    plt.xlabel('Bond Length (angstrom)')
    plt.ylabel('Number of Bonds')
    
    ax.hist(lengths, bins=bins)
    
    # Save figure
    if save_location:
        plt.savefig(save_location, dpi=dpi)
    
    return ax
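
The binning in bond_histogram relies on np.linspace, which returns 50 evenly spaced values when no count is given, so the histogram has 49 bins between graph_min and graph_max. The sketch below reproduces only that step, using np.histogram in place of a plot. The bond lengths are made-up values for illustration.

```python
import numpy as np

# Made-up bond lengths in angstroms, for illustration only
lengths = [0.96, 0.96, 1.09, 1.09, 1.09, 1.54]

graph_min, graph_max = 0, 2
bins = np.linspace(graph_min, graph_max)  # default num=50, so 50 bin edges

# np.histogram counts how many values fall between consecutive edges
counts, edges = np.histogram(lengths, bins=bins)

print(len(edges), len(counts))
print(counts.sum())
```

Bond lengths outside the range from graph_min to graph_max would not be counted, so the defaults may need adjusting for longer bonds.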

Molecule Module#

Our last function is build_bond_list, which is not particularly related to any of our other functions. The name functions.py does not really give a lot of information about what is available in the module. We can rename the module to something more fitting, say molecule.py. We also add a docstring.

MOLECULE.PY

"""
Functions for molecule analysis
"""

def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    """
    Build a list of bonds in a set of coordinates based on a distance criteria.

    Parameters
    ----------
    coordinates: np.ndarray
        The coordinates of the atoms to analyze in an (natoms, ndim) array.

    max_bond: float, optional
        The maximum distance for two atoms to be considered bonded.

    min_bond: float, optional
        The minimum distance for two atoms to be considered bonded.

    Returns
    -------
    bonds: dict
        A dictionary containing bonded atoms with atom pairs as keys and the distance between the atoms as the value.
    """

    # Find the bonds in a molecule (set of coordinates) based on distance criteria.
    bonds = {}
    num_atoms = len(coordinates)

    for atom1 in range(num_atoms):
        for atom2 in range(atom1, num_atoms):
            distance = calculate_distance(coordinates[atom1], coordinates[atom2])
            if distance > min_bond and distance < max_bond:
                bonds[(atom1, atom2)] = distance

    return bonds
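
To see what build_bond_list returns, here is a minimal sketch that applies it to an approximate water geometry. The coordinates are illustrative values, and the two functions are repeated in compact form so the snippet runs on its own with NumPy imported as np.

```python
import numpy as np


def calculate_distance(rA, rB):
    return np.linalg.norm(rA - rB)


def build_bond_list(coordinates, max_bond=1.5, min_bond=0):
    bonds = {}
    num_atoms = len(coordinates)
    for atom1 in range(num_atoms):
        for atom2 in range(atom1, num_atoms):
            distance = calculate_distance(coordinates[atom1], coordinates[atom2])
            # An atom paired with itself has distance 0, which is not
            # greater than the default min_bond, so it is skipped.
            if distance > min_bond and distance < max_bond:
                bonds[(atom1, atom2)] = distance
    return bonds


# Approximate water geometry in angstroms: O, H, H (illustrative values)
coordinates = np.array([
    [0.000, 0.000, 0.000],
    [0.757, 0.586, 0.000],
    [-0.757, 0.586, 0.000],
])

bonds = build_bond_list(coordinates)
for pair, length in bonds.items():
    print(pair, round(float(length), 3))
```

The keys are pairs of atom indices into the coordinate array. With the default max_bond of 1.5, the two O-H pairs are reported while the H-H pair, at roughly 1.51 angstroms, is not.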

I/O Subpackage#

When looking at the three I/O functions, it may be tempting to jump ahead and create a single I/O module, as mentioned previously. However, what we really have is two distinct groups of related functions: two functions that handle the input and output of .xyz files, and one function that handles the input of .pdb files. Both groups handle input and output, but they are otherwise unrelated because they deal with different file types. Instead of making a single module, we are going to create a subpackage to handle I/O and place a module for each group within it.

Create a new directory called io within the package (using the command mkdir directory_name) and create two new files in it, pdb.py and xyz.py (using the command touch file_name):

IO/PDB.PY

"""
Functions for manipulating pdb files.
"""

def open_pdb(file_location):
    """Open and read coordinates and atom symbols from a pdb file.

    The pdb file must specify the atom elements in the last column, and follow
    the conventions outlined in the PDB format specification.

    Parameters
    ----------
    file_location : str
        The location of the pdb file to read in.

    Returns
    -------
    symbols : np.ndarray
        The atomic symbols of the pdb file.
    coords : np.ndarray
        The coordinates of the pdb file.

    """

    with open(file_location) as f:
        data = f.readlines()

    coordinates = []
    symbols = []
    for line in data:
        if "ATOM" in line[0:6] or "HETATM" in line[0:6]:
            symbols.append(line[76:79].strip())
            atom_coords = [float(x) for x in line[30:55].split()]
            coordinates.append(atom_coords)

    coords = np.array(coordinates)
    symbols = np.array(symbols)

    return symbols, coords
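
The slices in open_pdb depend on the fixed-width layout of PDB records. The sketch below builds one ATOM line with each field at its fixed column position and applies the same slices. The field values are illustrative, and parse_atom_line is a hypothetical helper that is not part of molecool.

```python
# Build a PDB-style ATOM record with fixed-width fields (illustrative values).
# The first 30 columns hold the record name, serial number, atom name,
# residue name, chain and residue number; x, y, z follow in columns 31-54.
line = (
    "{:<6}{:>5} {:<4}{}{:>3} {}{:>4}{}   "
    "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}{:>12}"
).format("ATOM", 1, "O", " ", "HOH", "A", 1, " ",
         0.0, 0.0, 0.117, 1.0, 0.0, "O")  # element right-aligned to column 78


def parse_atom_line(line):
    """Hypothetical helper mirroring the slicing used in open_pdb."""
    symbol = line[76:79].strip()                      # element symbol field
    coords = [float(x) for x in line[30:55].split()]  # x, y and z fields
    return symbol, coords


symbol, coords = parse_atom_line(line)
print(symbol, coords)
```

Because the parsing is purely positional, a file whose columns are shifted or that omits the element field would not be read correctly by this approach.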

IO/XYZ.PY

"""
Functions for manipulating xyz files.
"""

def open_xyz(file_location):

    # Open an xyz file and return symbols and coordinates.
    xyz_file = np.genfromtxt(fname=file_location, skip_header=2, dtype="unicode")
    symbols = xyz_file[:, 0]
    coords = xyz_file[:, 1:]
    coords = coords.astype(float)
    return symbols, coords


def write_xyz(file_location, symbols, coordinates):

    num_atoms = len(symbols)

    if num_atoms != len(coordinates):
        raise ValueError(
            f"write_xyz : the number of symbols ({num_atoms}) and number of coordinates ({len(coordinates)}) must be the same to write xyz file!"
        )

    with open(file_location, "w+") as f:
        f.write("{}\n".format(num_atoms))
        f.write("XYZ file\n")

        for i in range(num_atoms):
            f.write(
                "{}\t{}\t{}\t{}\n".format(
                    symbols[i], coordinates[i, 0], coordinates[i, 1], coordinates[i, 2]
                )
            )
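
To make the file layout concrete, the sketch below writes a small molecule with a compact copy of write_xyz and reads the text back. It assumes NumPy is imported as np and uses a temporary directory; the coordinates, the file name water.xyz and the shortened error message are illustrative.

```python
import os
import tempfile

import numpy as np


def write_xyz(file_location, symbols, coordinates):
    """Compact copy of write_xyz from the listing above."""
    num_atoms = len(symbols)
    if num_atoms != len(coordinates):
        raise ValueError("symbols and coordinates must have the same length")
    with open(file_location, "w+") as f:
        f.write("{}\n".format(num_atoms))
        f.write("XYZ file\n")
        for i in range(num_atoms):
            f.write("{}\t{}\t{}\t{}\n".format(
                symbols[i], coordinates[i, 0], coordinates[i, 1], coordinates[i, 2]))


symbols = ["O", "H", "H"]
coordinates = np.array([
    [0.000, 0.000, 0.000],
    [0.757, 0.586, 0.000],
    [-0.757, 0.586, 0.000],
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "water.xyz")  # illustrative file name
    write_xyz(path, symbols, coordinates)
    with open(path) as f:
        lines = f.read().splitlines()

print(lines[:2])
print(len(lines))
```

The first two lines of the file are the atom count and a comment line, which is why open_xyz skips two header lines before reading the symbols and coordinates.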

Now any module that needs to handle input and output can import just the module it needs from the io subpackage. Since these are currently small modules, it would not be a big deal to import all of them. But consider a large I/O suite covering many file types and a lot of functionality: leaving everything in one module would quickly become inefficient.
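
The following sketch shows the idea on a throwaway package built in a temporary directory. The package name demo_pkg, the placeholder function bodies, and the empty __init__.py placed inside io are all assumptions made for illustration, not the contents of molecool.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "demo_pkg"          # hypothetical package name
    (pkg / "io").mkdir(parents=True)      # plays the role of `mkdir io`
    (pkg / "__init__.py").touch()
    (pkg / "io" / "__init__.py").touch()  # assumption: io is a regular subpackage
    (pkg / "io" / "xyz.py").write_text(
        "def open_xyz(file_location):\n"
        "    return 'would read ' + file_location\n"
    )
    (pkg / "io" / "pdb.py").write_text(
        "def open_pdb(file_location):\n"
        "    return 'would read ' + file_location\n"
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # Import only the module we need from the subpackage
        xyz = importlib.import_module("demo_pkg.io.xyz")
        message = xyz.open_xyz("water.xyz")
    finally:
        sys.path.remove(tmp)

print(message)
print("demo_pkg.io.pdb" in sys.modules)
```

Only the xyz module is loaded here; the pdb module stays untouched until something imports it, which is the benefit of splitting I/O code by file type.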

Now that we’ve organized and changed the structure in our project, we should commit our changes and push to GitHub.

SHELL

git add .
git commit -m "organize molecool into modules and subpackage"
git push origin main

Key Points#

Your package should be broken up into modules and subpackages depending on the amount of code and functionality.
You can use the __init__.py file to define what packages are imported with your package, and how the user interacts with it.
