# Python Coding Style

```{admonition} Overview
:class: overview

Questions:

- How can I write Python code that is readable?

Objectives:

- Learn how to raise exceptions.
- Understand how to follow PEP8 style for Python.
- Understand what docstrings are and why they are important.
- Learn to write docstrings in numpy style.
```

:::{admonition} Follow Along with This Lesson
:class: tip

To follow along with this lesson, you can complete the previous lessons, or you can download a pre-made workshop repository that is at the starting point. You will need to make sure that you have `git` installed and configured, as described in the set-up instructions.

````{tab-set-code}

```{code-block} shell
git clone https://github.com/MolSSI-Education/molecool.git
cd molecool
git checkout python-coding-style-start
git switch -c main
```
````

You can also [download the pre-made workshop repository as a zip file](https://github.com/MolSSI-Education/molecool/archive/refs/tags/git-start.zip). If downloading as a zip file, you will need to initialize `git` in the repository and make an initial commit in order to use git.

:::

## Editing a function in our package

Let's look at one of the functions in our package. Open your `molecool/functions.py` module in a text editor. The function `open_pdb` reads coordinates and atom symbols from a pdb file.

````{tab-set-code}

```{code-block} python
def open_pdb(f_loc):
    with open(f_loc) as f:
        data = f.readlines()
    c = []
    sym = []
    for l in data:
        if 'ATOM' in l[0:6] or 'HETATM' in l[0:6]:
            sym.append(l[76:79].strip())
            c2 = [float(x) for x in l[30:55].split()]
            c.append(c2)
    coords = np.array(c)
    return sym, coords
```
````

If we want to test our function, we require a pdb file. The workshop materials downloaded during the setup include a set of pdb examples. These are found in `molssi_best_practices/starting_material/data/pdb/`. We want to store these files in our `molecool` directory.
Luckily, `cookiecutter` created a folder designed specifically for that purpose. The folder is in `molecool/data/`. This folder can contain any data useful for testing the basic functionality of our code. Be mindful that this folder is also downloaded when installing our package, so do not include large data files. Go ahead and copy the pdb files to a new folder `pdb` inside the data folder.

With the files in our `molecool` folder, we can pass them to our function from the interactive Python interpreter. Test this by opening the interactive Python interpreter and typing the following.

````{tab-set-code}

```{code-block} python
>>> import os
>>> from molecool import open_pdb
>>> pdb_file = os.path.join('molecool', 'data', 'pdb', 'water.pdb')
>>> symbols, coords = open_pdb(pdb_file)
>>> symbols
```
````

````{tab-set-code}

```{code-block} output
['O', 'H', 'H']
```
````

You should get a list of atomic symbols of the water molecule by executing this code. You can also see the atomic coordinates by executing:

````{tab-set-code}

```{code-block} python
>>> coords
```
````

You should get a numpy array of atomic coordinates.

````{tab-set-code}

```{code-block} output
array([[ 9.626,  6.787, 12.673],
       [ 9.626,  8.42 , 12.673],
       [10.203,  7.604, 12.673]])
```
````

### Function Return Type

When we examine our `open_pdb` function, you may notice some inconsistency in the data type of the returned values. Our function returns both atomic symbols and coordinates. However, the symbols are returned as a list, while the coordinates are returned as a numpy array. This is not necessarily a problem, but it is inconsistent. We should make sure that our function returns the same type of data for each variable. This will be clearer to both users of our code and developers who are editing our code.
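As a small illustration of why matching return types can help, the sketch below uses made-up water data. The names `symbols_list`, `symbols_array`, and `hydrogen_mask` are illustrative assumptions and are not part of `molecool`. When the symbols are a NumPy array, comparing them to a string gives one value per atom, which can then be used to select rows of the coordinate array. The same comparison on a list gives a single `False`.

```python
import numpy as np

# Hypothetical data mirroring the water example above.
symbols_list = ['O', 'H', 'H']
symbols_array = np.array(symbols_list)
coords = np.array([[9.626, 6.787, 12.673],
                   [9.626, 8.420, 12.673],
                   [10.203, 7.604, 12.673]])

# Comparing a list to a string gives a single bool, not a per-atom mask.
list_comparison = (symbols_list == 'H')

# Comparing an array to a string gives one bool per atom.
hydrogen_mask = (symbols_array == 'H')

# The mask can select the matching rows of the coordinate array.
hydrogen_coords = coords[hydrogen_mask]

print(list_comparison)
print(hydrogen_mask)
print(hydrogen_coords.shape)
```

This is only one possible motivation. A function that returned two lists would also be consistent, but arrays fit the rest of the package, which already relies on NumPy.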
To change both outputs to NumPy arrays, our function can now look like the following:

````{tab-set-code}

```{code-block} python
def open_pdb(f_loc):
    with open(f_loc) as f:
        data = f.readlines()
    c = []
    sym = []
    for l in data:
        if 'ATOM' in l[0:6] or 'HETATM' in l[0:6]:
            sym.append(l[76:79].strip())
            c2 = [float(x) for x in l[30:55].split()]
            c.append(c2)
    coords = np.array(c)
    sym = np.array(sym)
    return sym, coords
```
````

You can test your function again using the same procedure we applied above, and now you will see that your function returns two NumPy arrays.

### Raising Exceptions

Hooray! It seems like this function works! This should come as no surprise since we are the authors of the function, and we know its internal structure. This is not necessarily true for someone editing our code, and especially not true for someone just using our code. There are instances where unwanted behavior occurs, even though the code executes (i.e. there are no syntax errors). In these cases, our code should be able to stop itself to prevent further malfunction. Take for example the division by zero. If we try to calculate

````{tab-set-code}

```{code-block} python
>>> 1/0
```
````

We would get

````{tab-set-code}

```{code-block} output
ZeroDivisionError: division by zero
```
````

In this example, the code was smart enough to identify the division by zero and halt. This type of feedback is much more helpful than silently returning an ugly `NaN`. This is called an *exception*. There are several built-in exceptions, such as `ZeroDivisionError`. You can choose to raise exceptions yourself when you think a function should fail (instead of the function not failing, or running until it hits some other failure).

Consider our function `write_xyz`.

````{tab-set-code}

```{code-block} python
def write_xyz(file_location, symbols, coordinates):
    # Write an xyz file given a file location, symbols, and coordinates.
    num_atoms = len(symbols)

    with open(file_location, 'w+') as f:
        f.write('{}\n'.format(num_atoms))
        f.write('XYZ file\n')

        for i in range(num_atoms):
            f.write('{}\t{}\t{}\t{}\n'.format(symbols[i],
                                              coordinates[i,0], coordinates[i,1], coordinates[i,2]))
```
````

When examining this function, you may see a few opportunities for failure. For example, a user could supply `symbols` and `coordinates` with different lengths. If the `coordinates` argument is the longer one, we will not see an error. The function will simply ignore the extra coordinates. If `symbols` is the longer argument, we will not have enough `coordinates` and an error will occur partway through writing the file. Neither of these is our intended behavior, and the first would occur without us knowing (some errors are silent)!

Let's try this out. In a Python interpreter, try the following:

````{tab-set-code}

```{code-block} python
>>> import molecool
>>> import numpy as np
>>> test_atoms = ["H", "O"]
>>> test_coords = np.array([[1,0,0],[0,0,0], [0,1,0]])
>>> molecool.write_xyz("test.xyz", test_atoms, test_coords)
```
````

You will see that no error occurs. If we open the written XYZ file, the last coordinate point has been discarded.

We probably intend for these variables to have the same number of elements. When they don't, there's no way to tell what the user wanted, or if they have accidentally passed us incorrect data. We should check the length of these and raise an appropriate exception to halt the program if necessary.

````{tab-set-code}

```{code-block} python
def write_xyz(file_location, symbols, coordinates):
    # Write an xyz file given a file location, symbols, and coordinates.
    num_atoms = len(symbols)

    if num_atoms != len(coordinates):
        raise ValueError(f"write_xyz : the number of symbols ({num_atoms}) and number of coordinates ({len(coordinates)}) must be the same to write xyz file!")

    with open(file_location, 'w+') as f:
        f.write('{}\n'.format(num_atoms))
        f.write('XYZ file\n')

        for i in range(num_atoms):
            f.write('{}\t{}\t{}\t{}\n'.format(symbols[i],
                                              coordinates[i,0], coordinates[i,1], coordinates[i,2]))
```
````

As you can see, custom error messages can be quite descriptive of the problem. Let's try this out with some fake data. Using the same example as before:

````{tab-set-code}

```{code-block} python
>>> import molecool
>>> import numpy as np
>>> test_atoms = ["H", "O"]
>>> test_coords = np.array([[1,0,0],[0,0,0], [0,1,0]])
>>> molecool.write_xyz("test.xyz", test_atoms, test_coords)
```
````

````{tab-set-code}

```{code-block} output
ValueError: write_xyz : the number of symbols (2) and number of coordinates (3) must be the same to write xyz file!
```
````

The built-in exceptions already include errors that are common while programming. For example, our function requires explicit use of NumPy arrays. Nevertheless, a user may be tempted to use plain lists of length 3 to describe the positions of two atoms. Python does not support arithmetic such as subtraction between two lists. In this case we might use the exception type `TypeError`. Other types of common exceptions include undefined variables (`NameError`) and failed assertions that two numbers are the same (`AssertionError`). The latter will be particularly useful when we want to automate testing within our package.

### Coding Style

Our functions are now smarter and will better guide users while using them. However, our function still might be hard to read and understand for others, so we might want to consider styling it better.

As a developer, you spend a lot of time thinking about writing your code. However, code is read much more often than it is written.
Following a style guide will help others (and perhaps you in the future!) to read your code.

For Python, the common convention for code style is called [PEP8]. PEP8 is a document that gives guidelines for best practices in Python coding style. PEP8 is a recommendation, not a rule. However, you should follow this convention when possible.

:::{admonition} Python PEP
:class: note

If you spend a lot of time programming in Python, you will see references to PEPs a lot. PEP stands for "Python Enhancement Proposal". These are design documents which provide information about features. PEPs come from the Python community, meaning anyone can author a PEP (however, there is a strict review process). PEPs are classified into three categories: standards, informational, or process. You can read more about PEPs in [Python's documentation](https://www.python.org/dev/peps/pep-0001/). PEP1 outlines what a PEP is and how they work.

:::

PEP8 tells us several things about styling that will make our code easier to read. Let's consider some of these and how they might change our function.

### Variable names

PEP8 recommends that

> Never use the characters 'l' (lowercase letter el), 'O' (uppercase letter oh), or 'I' (uppercase letter eye) as single character variable names.

> Function names should be lowercase, with words separated by underscores as necessary to improve readability.

Though not specifically referenced in PEP8, we also recommend making all variable names descriptive so that someone reading your code can easily understand what the variable is.

Consider a few variables we have defined in our function (`c`, `sym`, `c2`, `l`). Is it clear what these are or mean? We can change them to be more descriptive and readable.
````{tab-set-code}

```{code-block} python
def open_pdb(file_location):
    with open(file_location) as f:
        data = f.readlines()
    coordinates = []
    symbols = []
    for line in data:
        if 'ATOM' in line[0:6] or 'HETATM' in line[0:6]:
            symbols.append(line[76:79].strip())
            atom_coords = [float(x) for x in line[30:55].split()]
            coordinates.append(atom_coords)
    coords = np.array(coordinates)
    symbols = np.array(symbols)
    return symbols, coords
```
````

For this rewrite of the function, we have made the following changes in variable names.

- `f_loc` ---> `file_location`
- `c` ---> `coordinates`
- `sym` ---> `symbols`
- `l` ---> `line`
- `c2` ---> `atom_coords`

These variable names follow PEP8 convention and are much more descriptive and readable.

### Indentation

PEP8 indicates that indentation should be 4 spaces per indentation level. Our code meets these criteria.

### Whitespace

> Always surround these binary operators with a single space on either side: assignment (=), augmented assignment (+=, -= etc.), comparisons (==, <, >, !=, <>, <=, >=, in, not in, is, is not), Booleans (and, or, not).

This means that every time we have an expression assigning a variable, it should be `variable = value` instead of `variable=value`.

You can also use whitespace to separate ideas within a function or code.

> Use blank lines in functions, sparingly, to indicate logical sections.

Using these rules, our function becomes

````{tab-set-code}

```{code-block} python
def open_pdb(file_location):

    with open(file_location) as f:
        data = f.readlines()

    coordinates = []
    symbols = []

    for line in data:
        if 'ATOM' in line[0:6] or 'HETATM' in line[0:6]:
            symbols.append(line[76:79].strip())

            atom_coords = [float(x) for x in line[30:55].split()]
            coordinates.append(atom_coords)

    coords = np.array(coordinates)
    symbols = np.array(symbols)

    return symbols, coords
```
````

Now that we've written a new function in our project, we should commit our changes and push to GitHub.

````{tab-set-code}

```{code-block} shell
git add .
git commit -m "add open_pdb function to molecool"
git push origin main
```
````

### Exercise - Improving the `calculate_distance` function

``````{admonition} Exercise
:class: exercise

Below is the `calculate_distance` function that takes two points in 3D space and returns the distance between them. Even though it works just fine, it might be hard for others to understand. Take a couple of minutes to reformat this function in the `molecool/functions.py` module.

````{tab-set-code}

```{code-block} python
def calculate_distance(rA, rB):
    d=(rA-rB)
    dist=np.linalg.norm(d)
    return dist
```
````

`````{admonition} Solution
:class: solution dropdown

Here is a better formatted version of `calculate_distance`, which is easier to read and understand.

````{tab-set-code}

```{code-block} python
def calculate_distance(rA, rB):
    dist_vec = (rA - rB)
    distance = np.linalg.norm(dist_vec)
    return distance
```
````
`````
``````

Now we've successfully styled our function according to PEP8!

However, if we compare what we've written above to the sample function, `canvas`, we notice that ours is a little different. Keeping your previous Python interpreter open, type the following:

````{tab-set-code}

```{code-block} python
>>> import molecool as mc
>>> help(mc.canvas)
```
````

**Note: Do not use `help(mc.canvas())`, this will actually execute your code (not what we want).**

The code above calls Python's built-in function, `help`. For our canvas function, it displays the multi-line comment (called a `docstring`) that is written beneath the function definition.
````{tab-set-code}

```{code-block} output
Help on function canvas in module molecool.functions:

canvas(with_attribution=True)
    Placeholder function to show example docstring (NumPy format)

    Replace this function and doc string for your own project

    Parameters
    ----------
    with_attribution : bool, Optional, default: True
        Set whether or not to display who the quote is from

    Returns
    -------
    quote : str
        Compiled string including quote and optional attribution
```
````

If we try the same thing on our `calculate_distance` function, we don't get a helpful message.

We will want to write a docstring for our new `calculate_distance` function. This way, it will be clear to new developers who use our code what the function does, and be accessible to anyone using the function interactively.

Returning to the `functions.py` module file, edit your `calculate_distance` function to look like the following.

````{tab-set-code}

```{code-block} python
def calculate_distance(rA, rB):
    """Calculate the distance between two points.

    Parameters
    ----------
    rA, rB : np.ndarray
        The coordinates of each point.

    Returns
    -------
    distance : float
        The distance between the two points.

    Examples
    --------
    >>> r1 = np.array([0, 0, 0])
    >>> r2 = np.array([0, 0.1, 0])
    >>> calculate_distance(r1, r2)
    0.1
    """

    dist_vec = (rA - rB)
    distance = np.linalg.norm(dist_vec)

    return distance
```
````

### Docstrings

We've now added a multi-line comment (called a `docstring`, short for "documentation string"), to the beginning of our function. Docstrings **are the first statement after a function or module definition** and are opened and closed with three quotes. The docstring should explain what the function or module does (and not how it is done).

[PEP257] provides very basic guidelines for docstrings. There are many ways you could format a docstring (different styles/conventions). We recommend using [numpy style docstrings], which we used for the example above and for the `calculate_distance` function.
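To see the difference between a docstring and an ordinary comment, here is a minimal sketch with two hypothetical functions. The names `with_comment` and `with_docstring` are illustrative assumptions and are not part of `molecool`. Only a string placed as the first statement of the function is stored on the function object, in its `__doc__` attribute, which is the text that `help` displays.

```python
def with_comment(x):
    # Double the input. This is a comment, so Python discards it.
    return 2 * x


def with_docstring(x):
    """Double the input."""
    return 2 * x


# A comment is not stored anywhere on the function object.
print(with_comment.__doc__)

# A docstring is stored in the function's __doc__ attribute.
print(with_docstring.__doc__)
```

This is why `help` had nothing useful to show for `calculate_distance` before we added its docstring.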
:::{admonition} The `__doc__` attribute
:class: note

When you add a docstring to a function or module, Python automatically adds this to the `__doc__` attribute of the object. You can also see an object's docstring by typing `object.__doc__` into the Python interpreter. For example, to see the docstring associated with the canvas function, type `molecool.canvas.__doc__` into the Python interpreter (after importing `molecool`, of course).

:::

#### Sections of a Docstring

Each docstring has a number of sections which are separated by headings. Headings should be underlined with hyphens (`-----`). There are many options for sections, but we will only cover the most relevant here. If you would like to see a full list, check out the documentation for [numpy style docstrings].

##### 1. Short summary

A one-line summary that does not use the variable name or the function name. In our `calculate_distance` function, this corresponds to the following.

````{tab-set-code}

```{code-block} python
"""
Calculate the distance between two points.
"""
```
````

##### 2. Extended summary

A few sentences giving a detailed description of the function or module. This section should be used to clarify *functionality*, not to discuss implementation. We do not have an extended summary in our `calculate_distance` function, since it is relatively straightforward.

##### 3. Parameters

This section contains a description of the function arguments - keywords and expected types. The parameters for our `calculate_distance` function are shown below.

````{tab-set-code}

```{code-block} python
"""
Parameters
----------
rA, rB : np.ndarray
    The coordinates of each point.
"""
```
````

Here, you can see that the parameter section begins with the section title ("Parameters"), followed by a line of hyphens ("----"). On the next line, we have the argument names (`rA, rB`), then a colon (`:`) followed by the input type of the argument. This line says that the arguments `rA` and `rB` should be of type `np.ndarray`.
The next line gives a more detailed description of the parameter. When the input parameters are of different type, or they aren't related to each other, they should be written on separate lines.

##### 4. Returns

This section is very similar to the `Parameters` section above. In contrast to the `Parameters` section, each returned value does not have to be named, but the type of the returned value is required. For our `calculate_distance` function, our `Returns` section looks like the following.

````{tab-set-code}

```{code-block} python
"""
Returns
-------
distance : float
    The distance between the two points.
"""
```
````

##### 5. Examples

This is an optional section to show examples of functionality. This section is meant to illustrate usage. Though this section is optional, its use is strongly encouraged.

Consider the example we have in our docstring

````{tab-set-code}

```{code-block} python
"""
Examples
--------
>>> r1 = np.array([0, 0, 0])
>>> r2 = np.array([0, 0.1, 0])
>>> calculate_distance(r1, r2)
0.1
"""
```
````

It is important that your examples in docstrings are working Python. We will see in the `testing` lesson how we can run automatic tests on our docstrings, and in the `documentation` lesson, we will see how we can display examples in documentation to our users.

We have three lines of code for our example. In examples, lines of code begin with `>>>`. The first two lines define numpy arrays that are used in our `calculate_distance` function. Note that `r1` and `r2` must be numpy arrays (as indicated by our `Parameters` section), or our example will not give valid Python code (our function would error if we ran it). On the last line, you give the output (with no `>>>` in front).

Now that we've written a function in our project, we should commit our changes and push to GitHub.

````{tab-set-code}

```{code-block} shell
git add .
git commit -m "edit style of calculate_distance function"
git push origin main
```
````

### Exercise - Docstrings

``````{admonition} Exercise - Docstrings
:class: exercise

Let's add a docstring to our `open_pdb` function including short summary, extended summary, parameters, and returns sections.

Start with the following docstring. You will need to add the `Parameters` and `Returns` sections and edit the one-line description. We have filled in an extended summary.

````{tab-set-code}

```{code-block} python
def open_pdb(file_location):
    """One line description here.

    The pdb file must specify the atom elements in the last column, and follow
    the conventions outlined in the PDB format specification.
    """
```
````

`````{admonition} Solution
:class: solution dropdown

````{tab-set-code}

```{code-block} python
def open_pdb(file_location):
    """Open and read coordinates and atom symbols from a pdb file.

    The pdb file must specify the atom elements in the last column, and follow
    the conventions outlined in the PDB format specification.

    Parameters
    ----------
    file_location : str
        The location of the pdb file to read in.

    Returns
    -------
    symbols : np.ndarray
        The atomic symbols of the pdb file.
    coords : np.ndarray
        The coordinates of the pdb file.
    """

    with open(file_location) as f:
        data = f.readlines()

    coordinates = []
    symbols = []

    for line in data:
        if 'ATOM' in line[0:6] or 'HETATM' in line[0:6]:
            symbols.append(line[76:79].strip())

            atom_coords = [float(x) for x in line[30:55].split()]
            coordinates.append(atom_coords)

    coords = np.array(coordinates)
    symbols = np.array(symbols)

    return symbols, coords
```
````

Once you're done, make sure to add, commit, and push your changes to GitHub.
``````

### More on Coding Style

If you look at PEP8, you will see that it is quite long. While you should definitely read it if you spend a lot of time programming in Python, there are luckily tools which will help us make sure our code is following PEP8 convention or other styling guidelines.
There are auto-formatting tools such as `yapf` and `Black`, and static code "linters" such as `pylint` or `flake8`.

Automatic code formatters will parse your Python files and format them according to standards defined by that code formatter. It is usually a good idea to use a formatter (of your choice) when working on a Python project. In particular, [Black](https://github.com/psf/black) has gained popularity lately.

We will use [Black](https://github.com/psf/black) in this workshop. Black is an auto-formatter which is almost entirely non-customizable, ensuring all of your files will be uniform.

Install `black` using `pip`. In your terminal, type

````{tab-set-code}

```{code-block} shell
pip install black
```
````

Now we can use `black` on our Python files.

````{tab-set-code}

```{code-block} shell
black molecool/functions.py
```
````

You can see the changes to the `write_xyz` function, for example. You'll notice that Black also has some rules which are in addition to PEP8 formatting. For example, strings are all normalized to use double quotes.

Note that `black` does not always follow PEP8. For example, PEP8 recommends that line lengths be no more than 79 characters. This is a convention which is often not followed. Black defaults to 88 characters per line instead.

When you are working on a project, the exact style you use may be different. However, it is important to choose a consistent style. This will make your code much cleaner and easier to read.

Now that we've changed and formatted some functions in our project, we should commit our changes and push to GitHub.

````{tab-set-code}

```{code-block} shell
git add .
git commit -m "run black on molecool"
git push origin main
```
````

There are other tools, such as [pylint](https://www.pylint.org/) and [flake8](https://flake8.pycqa.org/en/latest/), that are not automatic formatters, but will check your code for adherence to the PEP8 standard.
Pylint, for example, will find your variables which are not `snake_case`, functions which do not have `docstrings`, simple stylistic changes, unused variables, etc. Flake8 is a little less strict in general. We will try `flake8` out here.

If you would like to try `flake8`, first install it.

````{tab-set-code}

```{code-block} shell
pip install flake8
```
````

You can run it on our `functions` module.

````{tab-set-code}

```{code-block} shell
flake8 molecool/functions.py
```
````

Let's examine one of the errors shown by the flake8 command above.

````{tab-set-code}

```{code-block} output
molecool/functions.py:1:1: F401 'os' imported but unused
```
````

This tells us it is looking at line 1 of `molecool/functions.py` (your line number may vary). `F401` is an error code which you can look up. Here, we are importing `os`, but never using it. We should remove this from our file.

You will also see a second "unused import" error:

````{tab-set-code}

```{code-block} output
molecool/functions.py:5:1: F401 'mpl_toolkits.mplot3d.Axes3D' imported but unused
```
````

Although it appears this isn't used, this import is actually necessary for our 3D plot. We can tell `flake8` to ignore this problem by adding a special comment:

````{tab-set-code}

```{code-block} python
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401
```
````

You can run `flake8` again to see that the import error is no longer reported.

:::{admonition} Final Repository State
:class: tip

You can see the final state of the repository after this section [here](https://github.com/MolSSI-Education/molecool/tree/ce11656b043846d6600b989eeba6a37ba541ea0c).

You can also download a zip of the repository [here](https://github.com/MolSSI-Education/molecool/archive/refs/tags/python-coding-style-end.zip).

:::

## Key Points

```{admonition} Key Points
:class: key

* Your project should adopt a consistent style so that others can easily read it.
* The style adopted in Python is called PEP8.
* There are autoformatting tools you can use to ensure that your code meets PEP8 standards.
* All functions and modules should be documented with docstrings.
* Docstrings have a specific format. We recommend the NumPy format for docstrings.
```

[Exceptions]: https://realpython.com/python-exceptions/#the-try-and-except-block-handling-exceptions
[PEP8]: https://www.python.org/dev/peps/pep-0008/
[PEP257]: https://www.python.org/dev/peps/pep-0257/
[YAPF]: https://github.com/google/yapf
[numpy style docstrings]: https://numpydoc.readthedocs.io/en/latest/format.html