Dataclasses In Python#

Starting File: 01_typed_molecule.py

This chapter will start from the 01_typed_molecule.py and end on the 02_dataclass_molecule.py.

We so far have looked at type hints in Python and how to apply them to arguments. Thus far, they have served as a tool for annotating arguments for developers, users, and IDE’s to give some clue as to what types of data can or should be provided to an argument.

This chapter, we will continue with type hints, but also consider data restructuring of our sample Molecule class to reduce duplication of work, and make extension of these data containers (primarily) much easier.

This chapter and the next (Manual Data Validation) will ask you to do a fair amount of coding for the purposes of understanding not only the data structure, but also to think about what needs to be validated and how scientific validation translates to programmatic validation.

Dataclass: Simplified Data Containers#

Typing out attribute assignments in the __init__ of a class to the same name as the argument variable is a very common use case, especially in scientific computing. It is so common to do this argument-to-attribute-of-the-same-name operation that Python provides a built-in library called dataclasses to do it for you. Let us first look again at where our Molecule code was left from the previous chapter. This will serve as our starting point here.

Compatibility with Python 3.8 and below

If you have Python 3.8 or below, you will need to import container type objects such as List, Tuple, Dict, etc. from the typing library instead of their native types of list, tuple, dict, etc. This chapter will assume Python 3.9 or greater, however, both approaches will work in >=Python 3.9 and have 1:1 replacements of the same name.

from typing import Union

class Molecule:
    def __init__(self, 
                 name: str, 
                 charge: Union[float, int], 
                 symbols: list[str], 
                 coordinates: list[list[float]]):
        self.name = name
        self.charge = charge
        self.symbols = symbols
        self.coordinates = coordinates
        self.num_atom = len(symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
water = Molecule("water", 0.0, ["H", "H", "O"], [[0, 0, 0]])
print(water)
name: water
charge: 0.0
symbols: ['H', 'H', 'O']

We’re going to reduce the amount of effort we have to commit to writing out our Molecule constructor. In fact, we’re not going to write it at all! Above is the Molecule class as we left it last chapter, and a quick construct of water with its output, we’ll use that later to compare that we in fact do have the same output. Let’s get rid of the semi-duplicated code entirely now. From the native Python library dataclasses, we are going to import the dataclass object.

from dataclasses import dataclass

The dataclass object is a decorator that decorates an entire class. You might be used to decorators for functions within classes such as the @propery or the @classmethod decorators; or you might have seen the @contextlib.contextmanager decorator for allowing functions to operate from a with statement. This dataclass decorator works the same way, it just wraps around an entire class and we us it like any other decorator.

@dataclass
class Molecule:
    def __init__(self, 
                 name: str, 
                 charge: Union[float, int], 
                 symbols: list[str], 
                 coordinates: list[list[float]]):
        self.name = name
        self.charge = charge
        self.symbols = symbols
        self.coordinates = coordinates
        self.num_atom = len(symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

water = Molecule("water", 0.0, ["H", "H", "O"], [[0, 0, 0]])
print(water)
name: water
charge: 0.0
symbols: ['H', 'H', 'O']

On its face, it does not look like it did anything, and that is because we still have the __init__ defined. @dataclass wraps the class it’s decorating and provides its own __init__ statement that does not need to be written out. What it’s expecting is to read the class attributes of the defined class, and then do the attribute-to-variable assignment under its __init__. If you’re unfamiliar with the difference between class and instance attributes, take a look at the following code:

class ShowAtters:
    # Class attributes defined at equal level as the def statements for class functions.
    class_attr1: str
    class_attr2: int = 5
        
    def __init__(self, inst_attr1: str, inst_attr2: int = 6):
        # Instance attributes defined either during or after initialization (of instance).
        self.inst_attr1 = inst_attr1
        self.inst_attr2 = inst_attr2

# Class attributes accessible without instantiating the class. Instance Attributes are not
print(f"Class Attribute: {ShowAtters.class_attr2}")
try:
    print(f"Instance Attribute: {ShowAtters.inst_attr2}")
except AttributeError:
    print("No instance attribute called inst_attr2")
    
# After instancing, class and instances attributes available
show = ShowAtters("something")
print(f"Class Attribute: {show.class_attr2}")
print(f"Instance Attribute: {show.inst_attr2}")
Class Attribute: 5
No instance attribute called inst_attr2
Class Attribute: 5
Instance Attribute: 6

The dataclass decorator reads the class attributes and assigns those attributes variables of the same name on calling the class, taking the same argument order as the attribute appears. In practice, you mainly just have to take your variable arguments and turn them into class attributes; type hints, defaults and all. This completely takes over the __init__ process and reduces the amount of coding you need. Let’s try it now:

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

And to prove that this does have the same output as before:

water = Molecule("water", 0.0, ["H", "H", "O"], [[0, 0, 0]])
print(water)
name: water
charge: 0.0
symbols: ['H', 'H', 'O']

Beyond the change to the initialization, everything still behaves the same as a normal class. Additional functions, properties, attributes, etc. can all be applied as normal. Let’s apply that principle to bring back the num_atoms property, and bring our Molecule class to the form we want it in for the end of this chapter.

from dataclasses import dataclass
from typing import Union


@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

Sane Input Handling#

You might consider with the dataclass that there is no longer as clear of the delineation between the order of arguments that you feed into the Molecule constructor and the order to the actual how they get assigned in the class. Yes, it does still assign it in the order arguments were provided, but there are only 4 items here, and the order that data are received is ambiguous. Take the case here: representation of molecules in various file formats across different programs may not always be the same. You shouldn’t expect to always get some arbitrary name of the molecule as the first thing out of the file, or the first argument that comes into your program depending on whatever source is providing the data: a file on disk, an API call from the web, some user just typing things in a script to access your program. What happens if someone, for example, makes the first argument charge, the second argument a string, and reverses the 3rd and 4th arguments like so:

water = Molecule(0.0, "water", [[0, 0, 0]], ["H", "H", "O"])
print(water)
name: 0.0
charge: water
symbols: [[0, 0, 0]]

Technically, nothing broke here because Python looked at this and just cast everything to the matching order of variables because of duck typing. The Python interpreter is going to assume the programmer will do something useful with it, or will crash when it hits a syntax error.

We strongly recommend you start relying on keyword arguments to your constructor for a dataclass to remove ambiguity. We especially recommend this as more complex data structures are encountered; maybe there are tens or hundreds of properties you could store here (consider all the data a real molecule could contain). Just like normal keyword arguments for function definitions, the keyword arguments for a dataclass match the name of the variable defined in the class attributes.

water = Molecule(name="water", charge=0.0, symbols=["H", "H", "O"], coordinates=[[0, 0, 0]])
print(water)
name: water
charge: 0.0
symbols: ['H', 'H', 'O']

Now it doesn’t matter what order the information is provided, everything will be assigned correctly.

water = Molecule(coordinates=[[0, 0, 0]], symbols=["H", "H", "O"], charge=0.0, name="water")
print(water)
name: water
charge: 0.0
symbols: ['H', 'H', 'O']

The other major sane input handling conversion you should consider is providing data as a dictionary, that then passes to keyword arguments. This is a very common case as you might expect data much more frequently brought in from external sources and are loaded in as some key-value format like a Python dictionary, JSON, YAML, TOML, or any other type of data that you could possibly think of. In any case, it’s probably not as native Python types.

mol_data = {
    "coordinates": [[0, 0, 0]], 
    "symbols": ["H", "H", "O"], 
    "charge": 0.0, 
    "name": "water"
}
water = Molecule(**mol_data)
print(water)
name: water
charge: 0.0
symbols: ['H', 'H', 'O']