Manual Data Validation#

Starting File: 02_dataclass_molecule.py

This chapter starts from 02_dataclass_molecule.py and ends with 03_manual_valid_molecule.py.

So far we have looked at type hints in Python as annotations on arguments, at the dataclass decorator to reduce code and automatically assign variables to attributes of the same name, and at how to make our inputs more stable. Everything we’ve done up to now has made our code easier to set up, read, and work with. However, we have not done anything to validate the inputs.

No matter your field of work, validation of data is going to be something that happens. There is no getting around it, especially in the scientific field. It may be offloaded, automated, or even trivially simple, but it will happen; so we may as well get better at it.

Hard Work Ahead

This chapter will ask you to do a fair amount of coding for the purposes of understanding not only the data structure, but also to think about what needs to be validated and how scientific validation translates to programmatic validation.

We’ll show a third party library to simplify this later, but understanding the foundation is important for understanding the tools which handle this for you.

Compatibility with Python 3.8 and below

If you have Python 3.8 or below, you will need to import container types such as List, Tuple, Dict, etc. from the typing library instead of using the native types list, tuple, dict, etc. This chapter assumes Python 3.9 or greater; however, both approaches work in Python >= 3.9, and the typing versions are 1:1 replacements for the native types of the same (lowercase) name.
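As a small illustration of the note above, the two spellings can sit side by side. This is only a sketch: Symbols38, Symbols39, and count_atoms are hypothetical names made up for this example, and it assumes Python 3.9 or greater so that both forms run.

```python
from typing import List

# Python 3.8 and below: container hints come from the typing module.
Symbols38 = List[str]

# Python 3.9 and above: the built-in container types accept subscripts directly.
Symbols39 = list[str]


def count_atoms(symbols: Symbols39) -> int:
    # Type hints are not enforced at runtime; they only document intent.
    return len(symbols)


print(count_atoms(["H", "H", "O"]))
```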

Dataclass __post_init__ method#

Let’s take a look at our code as we left it from last chapter.

from dataclasses import dataclass
from typing import Union


@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
mol_data = {
    "coordinates": [[0, 0, 0]], 
    "symbols": ["H", "H", "O"], 
    "charge": 0.0, 
    "name": "water"
}

We’ve seen several examples so far of feeding in data of the wrong type without getting any errors. Now we’re going to do actual validation on our dataclass. Although there are third-party libraries to do some of this, you, the developer, must still understand the scientific use case well enough to know what is considered valid, even beyond the type checking itself.

We first have to know how to access the data on input. The dataclass decorator takes over the normal __init__ process, which is where one might otherwise expect to write validation code or call validation functions. dataclass also provides a hook called __post_init__ which, if defined, is called automatically at the end of the generated __init__. This method is basically free space for the developer to do whatever they want with the dataclass, much like a hand-written __init__, only it runs after the instance attributes have been assigned.

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
    
    def __post_init__(self):
        # Do whatever you want here, all instance attributes will be available.
        print(f"{[getattr(self, thing) for thing in mol_data.keys()]}")
        print("Post Init Ran")
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
water = Molecule(**mol_data)
[[[0, 0, 0]], ['H', 'H', 'O'], 0.0, 'water']
Post Init Ran

What you can see in the above code is that __post_init__ did run, and does have access to all of the attributes we’re working with in this problem. If we wanted to recreate the __init__-like settings we had in 00_base_molecule.py, we could set self.num_atoms = len(self.symbols) in __post_init__ as well (in place of the property), but we’ll leave it as a property. Let’s actually delve into some validation, starting with simple type validation.
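For the curious, here is a rough sketch of what setting num_atoms in __post_init__ might look like. CountedMolecule is a hypothetical, stripped-down class made up for this example (it is not one of the tutorial files), and using field(init=False) is one possible way to declare an attribute that the generated __init__ should not accept.

```python
from dataclasses import dataclass, field


@dataclass
class CountedMolecule:  # hypothetical name, not one of the tutorial files
    symbols: list[str]
    # init=False keeps num_atoms out of the generated __init__ signature.
    num_atoms: int = field(init=False)

    def __post_init__(self):
        # Computed once, right after __init__ has assigned self.symbols.
        self.num_atoms = len(self.symbols)


water = CountedMolecule(symbols=["H", "H", "O"])
print(water.num_atoms)
water.symbols.append("H")
print(water.num_atoms)  # still the value computed at construction
```

The second print shows the trade-off: a value stored in __post_init__ does not follow later changes to symbols, while the property recomputes on every access.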

Manually validating types#

Although there are external libraries to do type and value validation of data, we’re going to go through the manual process in this chapter to show all the nuances that have to be thought of. Even the most sophisticated type-checking libraries still need the programmer to tell them what the correct types are, and whether the values of the incoming data are correct for the application.

Validation rules themselves do not have to be complicated. Setting aside the scientific understanding, let’s start by simply validating the types, beginning with name, which we want to ensure is a string. We’ll handle validation in __post_init__ as that’s where we can intercept the initialization process in a dataclass.

Heads Up Question

Although we said "simply validating the types," there are some types in our current Molecule which might be rather complicated to validate. Can you think of what they might be and how you might validate them?

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
    
    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

Let’s also create a set of bad data for us to feed to the constructor to test our validation. We’ll make a few bad entries here and combine dictionaries on the fly as needed.

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably is a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of string
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                        }  # Coordinates top-level list is not the same length as symbols
# This will work 
water = Molecule(**mol_data)
# This will not
mangle_name = {**mol_data, **bad_name}  # Inject bad name, could be done in 1 line, easier to read this way
water = Molecule(**mangle_name)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [8], in <cell line: 3>()
      1 # This will not
      2 mangle_name = {**mol_data, **bad_name}  # Inject bad name, could be done in 1 line, easier to read this way
----> 3 water = Molecule(**mangle_name)

File <string>:7, in __init__(self, name, charge, symbols, coordinates)

Input In [5], in Molecule.__post_init__(self)
      8 def __post_init__(self):
      9     # We'll validate the inputs here.
     10     if not isinstance(self.name, str):
---> 11         raise ValueError(f"'name' must be a str, was {self.name}")

ValueError: 'name' must be a str, was 789

Congratulations, we have our first piece of validation! Now let’s move onto the charge entry. This one is slightly more complicated because charge is of type Union[float, int]. See if you can do this one yourself first.

Writing your own validator: validating Charge

Write a validator statement for the charge which should be of type Union[float, int] to put in the __post_init__ function.

We're not looking for code efficiency here (minimal loops, fewest lines, minimal calls, etc.), so do what seems simplest here for understanding.

More difficult validation entries#

Let’s take a look at our two validated entries so far. We’re going to use our example from the exercise in the last section as our answer, yours might look different and that’s fine.

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
    
    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
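One edge case worth knowing about for the charge check: in Python, bool is a subclass of int, so True and False pass isinstance(..., int). Whether that matters is a judgment call for your application. The sketch below uses a hypothetical helper, check_charge, to show one way booleans might be rejected; it also uses the fact that isinstance accepts a tuple of types.

```python
def check_charge(charge):
    # Hypothetical helper; isinstance accepts a tuple of types.
    # bool is excluded explicitly because bool is a subclass of int.
    if isinstance(charge, bool) or not isinstance(charge, (float, int)):
        raise ValueError(f"'charge' must be a float or int, was {charge!r}")
    return charge


print(isinstance(True, int))  # bool passes a plain int check
check_charge(0.0)
check_charge(-1)
try:
    check_charge(True)
except ValueError as err:
    print(err)
```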

Next we are going to work on the symbols. Because this is a container-like type, we have to validate not only the outermost type of list, but also the internal items of str.

Let’s start with the Pythonic concept of “it is better to ask forgiveness than permission” and just try to do a loop on the self.symbols. If it cannot be looped over, we’ll catch that and throw a meaningful error.

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]
    
    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
            
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
            
        try:
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

Heads Up Question

We've intentionally written the symbols validator so there are a couple edge cases that will pass validation, but still not be the correct type hint. Can you think of such an edge case?

# This will work 
water = Molecule(**mol_data)
# This will not
mangle_name = {**mol_data, **noniter_symbols}  # Inject non-iterable symbols
water = Molecule(**mangle_name)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [10], in Molecule.__post_init__(self)
     16 try:
---> 17     for content in self.symbols:  # Loop over elements
     18         if not isinstance(content, str):  # Check content

TypeError: 'int' object is not iterable

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 3>()
      1 # This will not
      2 mangle_name = {**mol_data, **noniter_symbols}  # Inject non-iterable symbols
----> 3 water = Molecule(**mangle_name)

File <string>:7, in __init__(self, name, charge, symbols, coordinates)

Input In [10], in Molecule.__post_init__(self)
     19             raise ValueError(content, type(content))
     20 except TypeError as exec:  # Trap not iterable item
     21     # This will throw if you can't iterate over self.symbols
---> 22     raise ValueError(f"'symbols' must be a list, was {type(self.symbols)}") from exec
     23 except ValueError as exec:  # Trap the content error
     24     raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

ValueError: 'symbols' must be a list, was <class 'int'>

Here we have wrapped the error handling through a try...except clause and then run a loop through the elements. Previously, we had only used isinstance, but there is also nothing wrong with a try...except construct so long as the errors are handled gracefully.

The raise Exception() from construct

We’ve used “Exception Chaining” to make the intentionally trapped and re-raised error stack easier to read.
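As a minimal, standalone sketch of what chaining does: the original exception is stored on the new one as its __cause__ attribute, which is what produces the “direct cause” line in the traceback. parse_charge is a hypothetical helper used only for this demonstration.

```python
def parse_charge(text):
    # Hypothetical helper used only to demonstrate chaining.
    try:
        return float(text)
    except ValueError as original:
        # "from original" stores the first error on the new one as __cause__.
        raise ValueError(f"'charge' could not be read from {text!r}") from original


try:
    parse_charge("neutral")
except ValueError as err:
    print(err)
    print(repr(err.__cause__))
```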

We can break down each component part of the validator to see what we have done, starting with the block inside the try statement:

for content in self.symbols:  # Loop over elements
    if not isinstance(content, str):  # Check content
        raise ValueError(content, type(content))

We expect to be able to iterate over a list; Python raises a TypeError if the object cannot be iterated over (such as an int). The inside of the loop then does type checking to ensure every element is a string.

try:
    pass
except TypeError as exec:
    raise ValueError(f"'symbols' must be a list, was {type(self.symbols)}") from exec

This exception in the try...except construct catches the case of not being able to iterate during the for loop, then handles it gracefully with a cleaner exception stack using the Exception Chaining construct (raise Exception() from).

try:
    pass
except ValueError as exec:
    raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

The last except clause catches the case of the elements of the symbols list not being a str. We’re combining two less commonly used ideas here to make a more meaningful error message. First, exception chaining gives a cleaner error stack. Second, we pass multiple arguments to the initial error (raise ValueError(content, type(content))) so that we can write a more informative message by accessing the initial error’s args attribute.
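The args mechanism can be seen in isolation in the short sketch below; the value 8 is just an arbitrary stand-in for a bad element.

```python
try:
    # Every positional argument given to the exception is kept, in order, on .args
    raise ValueError(8, type(8))
except ValueError as err:
    value, value_type = err.args
    print(value, value_type)
```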

Let’s try to run a different type of data through the symbols validator. In theory, this is not the right type, and so it should throw an error too. As the Heads-Up Question above alluded to, it won’t (this is Case A).

mangled_data = {**mol_data, **nonlist_symbols}
water = Molecule(**mangled_data)

This passed validation because, in Python, the str type is iterable, and each element of an iterated str is another instance of str. The other edge case alluded to is symbols being a tuple.

mangled_data = {**mol_data, **tuple_symbols}
water = Molecule(**mangled_data)

This also works fine, and is Case B from the Heads-Up Question. Depending on how symbols is used in code, having it be a tuple might be fine for your purposes. If so, consider expanding the allowed types (or coercing the incoming data to a list).
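To see why both cases slip through the loop-based check, it can help to look at what iteration actually yields. The variable names below are made up for this sketch.

```python
stringified = '["H", "H", "O"]'

# Iterating over a str yields one-character str objects, so every
# element passes an isinstance(..., str) check.
pieces = list(stringified)
print(pieces[:4])
print(all(isinstance(piece, str) for piece in pieces))

# A tuple of str also iterates cleanly, which is why Case B passes.
print(all(isinstance(piece, str) for piece in ("H", "H", "O")))
```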

Because of all the edge cases, this is a case where being precise and “asking permission” can be more helpful than what we originally wrote. Let’s rewrite our validator to handle tuples permissively, and to catch those pesky strings.

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: list[list[float]]
    
    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
            
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
            
        try:
            if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
                raise TypeError
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
        
    @property
    def num_atoms(self):
        return len(self.symbols)
        
    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
# Should now fail correctly
mangled_data = {**mol_data, **nonlist_symbols}
water = Molecule(**mangled_data)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [15], in Molecule.__post_init__(self)
     17 if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
---> 18     raise TypeError
     19 for content in self.symbols:  # Loop over elements

TypeError: 

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Input In [16], in <cell line: 3>()
      1 # Should now fail correctly
      2 mangled_data = {**mol_data, **nonlist_symbols}
----> 3 water = Molecule(**mangled_data)

File <string>:7, in __init__(self, name, charge, symbols, coordinates)

Input In [15], in Molecule.__post_init__(self)
     21             raise ValueError(content, type(content))
     22 except TypeError as exec:  # Trap not iterable item
     23     # This will throw if you can't iterate over self.symbols
---> 24     raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
     25 except ValueError as exec:  # Trap the content error
     26     raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

ValueError: 'symbols' must be a list or tuple of string, was <class 'str'>
# Should still work
mangled_data = {**mol_data, **tuple_symbols}
water = Molecule(**mangled_data)
# Make sure we didn't break expected use cases
water = Molecule(**mol_data)

We kept the same try...except block, but now added one extra check for type. Because we were more permissive to the type, we also modified the type hint for symbols from list[str] to Union[list[str], tuple[str, ...]]. This had to take advantage of one of the “Special Forms” of type hints that tuple has.

In Type Hints in Python, we called out that the tuple is immutable, so multiple type hints fed into its [] annotation match 1-to-1 with the index of items in the tuple. There is a “special form” of tuple annotation for a “variable-length tuple of homogeneous type,” denoted with the literal ellipsis. So, tuple[str, ...] means a tuple of any length whose elements are all str. If you feel this is confusing, consider not allowing the tuple type, or coercing the tuple to a list before processing. Also consider using an automatic type-checking and coercion library like pydantic (Introduction to Pydantic) to simplify your code.
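If coercion sounds appealing, a rough sketch of the idea follows. CoercedSymbols is a hypothetical, stripped-down stand-in for Molecule written just for this example; it is not how the reference file handles symbols.

```python
from dataclasses import dataclass


@dataclass
class CoercedSymbols:  # hypothetical, stripped-down stand-in for Molecule
    symbols: list[str]

    def __post_init__(self):
        # Check the container type before calling list(): list("HHO") would
        # silently split a str into single characters.
        if not isinstance(self.symbols, (list, tuple)):
            raise ValueError(f"'symbols' must be a list or tuple of str, was {type(self.symbols)}")
        # Coerce to a list so later code only ever sees one container type.
        self.symbols = list(self.symbols)
        for content in self.symbols:
            if not isinstance(content, str):
                raise ValueError(f"Each element of 'symbols' must be a str, was {content!r}")


print(CoercedSymbols(("H", "H", "O")).symbols)
```

With coercion, the type hint can stay as the simpler list[str], since a list is what the attribute always holds after construction.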

One-liner Validation

It is possible to do the symbols validation in many other ways, including as a one-liner for the if statement. Because Python’s or operands are evaluated left to right, and the right-hand operand is only evaluated if the left-hand one was false (short-circuiting), you can compress both checks into one larger if...or construct: the element check only runs once the container check has passed. If you’re curious, here is one possible solution (wrapped with a trailing \ for human legibility):

if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)) or \
        any(not isinstance(x, str) for x in self.symbols):
    raise ValueError(f"{self.symbols} must be a list of str")
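The short-circuit behavior is what keeps the one-liner from ever iterating over something like an int. A small sketch, using a hypothetical helper symbols_are_valid that expresses the same two checks in positive form, makes this easy to try out:

```python
def symbols_are_valid(symbols):
    # Hypothetical helper wrapping the same two checks in positive form.
    # "and" short-circuits just like "or": all(...) is only evaluated when
    # the isinstance check on the container has already passed.
    return isinstance(symbols, (list, tuple)) and all(isinstance(x, str) for x in symbols)


print(symbols_are_valid(["H", "H", "O"]))    # True
print(symbols_are_valid(("H", "H", "O")))    # True
print(symbols_are_valid(1234567890))         # False, and no TypeError from iterating an int
print(symbols_are_valid('["H", "H", "O"]'))  # False
print(symbols_are_valid(["H", 1, "O"]))      # False
```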

The even more difficult validation#

By now our validation code has started to get pretty large for only three of our entries. Some efficiency could be gained by de-duplicating with function calls, but not much. Even so, we still have the most complex validator left: a list of lists of numbers! In the grand scheme of schema, this isn’t that complicated of a validator, but let’s check what will need to go into validating this entry. We’ll reuse every concept covered so far, and just apply it with an extra outer loop. Let’s cover the statements of fact and then translate them to a validator.

  • The top most item is a (list or tuple)

  • Each element of the outer (list or tuple) is a (list or tuple)

  • Each element of the inner (list or tuple) is a (float or int).

We’ve expanded on our original specification to handle the edge cases we encountered previously. We’re going to now work through each change we make to the Molecule code, one part at a time, through exercises to see how well you’ve picked up on what all we’ve done so far.

Modifying the coordinates type hint

First we want to change the type hint for coordinates itself to match the listed statements of fact. Let's go through them backwards to put it all together.

You will find it easier and more legible to make these compound type hints variables.

The complexity of the final type hint suggests that coercion (casting the tuples to lists) may be the better call here.

coordinates Inner Loop try...except

The inner loop should look similar to what we did with symbols. Just use a placeholder name for the inner iterable element for now.

coordinates Outer Loop try...except

Focus just on the outer loop, and put a pass where the inner loop would go. Try to use the placeholder name from the previous exercise.

One-liner Coordinates

Just like with symbols, we can validate coordinates with one line as well. We throw out the ability to get much more informative errors, and admittedly loop over the data twice in this example, but it’s much more compact.

if (not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)) or
    any(not (isinstance(y, list) or isinstance(y, tuple)) for y in self.coordinates) or
    any(any(not (isinstance(z, float) or isinstance(z, int)) for z in sub) for sub in self.coordinates)
):
    raise ValueError(f"{self.coordinates} must be a (list or tuple) of (list or tuple) of (float or int)")
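For comparison, here is a sketch of the opposite trade-off: a plain “ask permission” helper that walks the data once and names the offending position in its error. check_coordinates is a hypothetical function written for this illustration and is not part of the reference files.

```python
def check_coordinates(coordinates):
    # Hypothetical helper: one pass, and errors name the offending position.
    if not isinstance(coordinates, (list, tuple)):
        raise ValueError(f"'coordinates' must be a list or tuple, was {type(coordinates)}")
    for i, row in enumerate(coordinates):
        if not isinstance(row, (list, tuple)):
            raise ValueError(f"'coordinates[{i}]' must be a list or tuple, was {type(row)}")
        for j, value in enumerate(row):
            if not isinstance(value, (float, int)):
                raise ValueError(f"'coordinates[{i}][{j}]' must be a float or int, was {value!r}")


check_coordinates([[0, 0, 0], (1.0, 0.0, 0.0)])
try:
    check_coordinates([[0, 0, 0], [1.0, "x", 0.0]])
except ValueError as err:
    print(err)
```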

Constructing our fully validated molecule#

Finally, we can build out our fully, manually validated Molecule object. This is also the final reference code of 03_manual_valid_molecule.py.

from dataclasses import dataclass
from typing import Union

# Type Helpers
fi = Union[float, int]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
lo = list[inner]
tupo = tuple[inner, ...]


@dataclass
class Molecule:
    name: str
    charge: fi
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: Union[lo, tupo]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")

        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")

        try:
            if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
                raise TypeError
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

        try:
            if not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)):
                raise TypeError
            for inner in self.coordinates:  # Loop over elements
                try:
                    if not (isinstance(inner, list) or isinstance(inner, tuple)):
                        raise TypeError
                    for content in inner:  # Loop over elements
                        if not (isinstance(content, int) or isinstance(content, float)):  # Check content
                            raise ValueError(content, type(content))
                except TypeError as exec:  # Trap not iterable item
                    # This will throw if you can't iterate over an inner element of self.coordinates
                    raise ValueError(f"'coordinates' inner elements must be a list or tuple of float/int, was {type(inner)}") from exec
                except ValueError as exec:  # Trap the content error
                    raise ValueError(f"Each inner element of 'coordinates' must be a float or int, was {exec.args[0]} of type {exec.args[1]}") from exec
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.coordinates
            raise ValueError(f"'coordinates' must be a list or tuple of (list or tuple), was {type(self.coordinates)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, however the following error was thrown: {exec}") from exec

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

Let’s check that our code validates and fails correctly

water = Molecule(**mol_data)
for bad in [bad_name, bad_charge, noniter_symbols, nonlist_symbols, bad_coords]:
    try:
        mangle = {**mol_data, **bad}
        water = Molecule(**mangle)
    except ValueError:
        pass
    else:
        raise ValueError(f"All of these should fail, but {bad} did not")

However, despite all our validation efforts, we have so far only validated types; the actual data values are not yet validated. For example, symbols should be the same length as the outer iterable of coordinates, but a test that violates this still passes.

bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                        }  # Coordinates top-level list is not the same length as symbols
mangle = {**mol_data, **bad_symbols_and_cords}
water = Molecule(**mangle)
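A value check for this particular case might look something like the sketch below. check_shapes is a hypothetical helper, and the requirement of exactly three values per atom is an assumption (Cartesian x, y, z) rather than something the reference code enforces.

```python
def check_shapes(symbols, coordinates):
    # Hypothetical helper; assumes the type checks above have already passed.
    if len(symbols) != len(coordinates):
        raise ValueError(
            f"'symbols' has {len(symbols)} entries but 'coordinates' has {len(coordinates)}"
        )
    for i, row in enumerate(coordinates):
        # Assumes three Cartesian components per atom.
        if len(row) != 3:
            raise ValueError(f"'coordinates[{i}]' must have 3 values, had {len(row)}")


check_shapes(["H", "H", "O"], [[0, 0, 0], [0, 0, 1], [0, 1, 0]])
try:
    check_shapes(["H", "H", "O"], [[1, 1, 1], [2.0, 2.0, 2.0]])
except ValueError as err:
    print(err)
```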

Key Takeaway of Manual Validation: Don’t Do It by Hand!#

As of now, you have written a bunch of code, just to validate 4 items of relatively simple types. What about complex dictionaries? Non-native Python types? Data structures with metadata also having their own structures? We also haven’t done anything really with the type hints. None of the validation code actually reads the type hints to infer what the types should be, let alone do any programming with them.

Still, this chapter should have instilled how to think about validation, what edge cases to consider, and how to gracefully handle errors. It should also have shown the implicit advantages of data coercion (or at least all the effort needed to accommodate not coercing), and given you a respect for what will be required should you choose to validate manually.
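On the point that none of our validation code reads the type hints: the standard library does expose them at runtime, which is the raw material a validation library can build on. The sketch below only prints what is available and makes no attempt at validation. Hinted is a hypothetical, stripped-down stand-in for Molecule.

```python
from dataclasses import dataclass
from typing import Union, get_args, get_origin, get_type_hints


@dataclass
class Hinted:  # hypothetical, stripped-down stand-in for Molecule
    name: str
    charge: Union[float, int]
    symbols: list[str]


hints = get_type_hints(Hinted)
for attribute, hint in hints.items():
    # get_origin returns the unsubscripted form (e.g. list or Union), or None
    # for a plain class; get_args returns the subscripted parameters.
    print(attribute, get_origin(hint), get_args(hint))
```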

In the next chapter, we’re going to show pydantic, a powerful schema validation tool which takes the dataclass structure in tandem with Python’s own native type hints to do automatic validation.