Manual Data Validation#
Starting File: 02_dataclass_molecule.py
This chapter will start from the 02_dataclass_molecule.py
and end on the 03_manual_valid_molecule.py
.
So far we have looked at type hints in Python as annotations on arguments, dataclass
decorators to reduce code and automatically assign variables to attributes of the same name, and how to make our inputs more stable. Everything we’ve done up to now has made our code easier to set up, read, and work with. However, we have not done anything to validate the inputs.
No matter your field of work, validation of data is going to be something that happens. There is no getting around it, especially in the scientific field. It may be offloaded, automated, or even trivially simple, but it will happen; so we may as well get better at it.
Hard Work Ahead
This chapter will ask you to do a fair amount of coding for the purposes of understanding not only the data structure, but also to think about what needs to be validated and how scientific validation translates to programmatic validation.
We’ll show a third party library to simplify this later, but understanding the foundation is important for understanding the tools which handle this for you.
Compatibility with Python 3.8 and below
If you have Python 3.8 or below, you will need to import container type objects such as List
, Tuple
, Dict
, etc. from the typing
library instead of their native types of list
, tuple
, dict
, etc. This chapter will assume Python 3.9 or greater; however, both approaches work in Python 3.9 and above, and the two sets of names are 1:1 replacements for each other.
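As a quick illustration of that 1:1 replacement, the sketch below annotates the same function both ways. The function names `centroid_legacy` and `centroid_native` are made up for this example and are not part of the chapter's code.

```python
from typing import List, Tuple


# Python 3.8 and below: container hints come from the typing library
def centroid_legacy(coordinates: List[List[float]]) -> Tuple[float, ...]:
    n = len(coordinates)
    return tuple(sum(axis) / n for axis in zip(*coordinates))


# Python 3.9 and above: the native containers accept the same subscripts
def centroid_native(coordinates: list[list[float]]) -> tuple[float, ...]:
    n = len(coordinates)
    return tuple(sum(axis) / n for axis in zip(*coordinates))
```

Only the annotations differ; the function bodies are identical, and the hints do not change runtime behavior.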
Dataclass __post_init__
method#
Let’s take a look at our code as we left it from last chapter.
from dataclasses import dataclass
from typing import Union

@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

mol_data = {
    "coordinates": [[0, 0, 0]],
    "symbols": ["H", "H", "O"],
    "charge": 0.0,
    "name": "water"
}
We’ve seen several examples so far of feeding in data of the wrong type and not getting any errors. Now we’re going to do actual validation on our dataclass
. Although there are third party libraries to do some of this, you the developer must still have an understanding of the scientific use case for what is considered valid; even beyond the type checking itself.
We first have to know how to access the data on input. The dataclass
decorator takes over the normal __init__
process where someone may expect to write our validation code, or call the validation function(s). dataclass
also provides a secondary function called __post_init__
which is called automatically after the __init__
, if it is defined. This function is basically free space for the developer to do whatever they want with the dataclass
like there was an __init__
, just after instance variables are assigned.
@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]

    def __post_init__(self):
        # Do whatever you want here, all instance attributes will be available.
        print(f"{[getattr(self, thing) for thing in mol_data.keys()]}")
        print("Post Init Ran")

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"

water = Molecule(**mol_data)
[[[0, 0, 0]], ['H', 'H', 'O'], 0.0, 'water']
Post Init Ran
What you can see in the above code is that __post_init__
did run, and does have access to all of the attributes we’re working with in this problem. If we wanted to recreate the __init__
like settings we had from the 00_base_molecule.py
, we could set self.num_atoms = len(self.symbols)
in the __post_init__
as well, but we’ll leave it as a property
. Let’s actually delve into some validation, starting with simple type validation.
Manually validating types#
Although there are external libraries to do type and value validation of data, we’re going to go through the manual process in this chapter to show all the nuances that have to be thought of. Even the most sophisticated type-checking libraries still need the programmer to tell them what the correct types are, and whether the values of the incoming data are correct for the application.
Validation rules themselves do not have to be complicated. Setting aside the scientific understanding, let’s start by simply validating the types. Let’s start with the name
that we want to ensure is a string. We’ll handle validation in __post_init__
as that’s where we can intercept the initialization process in a dataclass
.
Heads Up Question
Although we said "simply validating the types," there are some types in our current Molecule
which might be rather complicated to validate. Can you think of what they might be and how you might validate them?
@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
Let’s also create a set of bad data for us to feed to the constructor to test our validation. We’ll make a few bad entries here and combine dictionaries on the fly as needed.
bad_name = {"name": 789} # Name is not str
bad_charge = {"charge": [1, 0.0]} # Charge is not int or float
noniter_symbols = {"symbols": 1234567890} # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'} # Symbols is a string (notably is a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")} # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]} # Coords is a single list of string
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                         }  # Coordinates top-level list is not the same length as symbols
# This will work
water = Molecule(**mol_data)
# This will not
mangle_name = {**mol_data, **bad_name} # Inject bad name, could be done in 1 line, easier to read this way
water = Molecule(**mangle_name)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [8], in <cell line: 3>()
1 # This will not
2 mangle_name = {**mol_data, **bad_name} # Inject bad name, could be done in 1 line, easier to read this way
----> 3 water = Molecule(**mangle_name)
File <string>:7, in __init__(self, name, charge, symbols, coordinates)
Input In [5], in Molecule.__post_init__(self)
8 def __post_init__(self):
9 # We'll validate the inputs here.
10 if not isinstance(self.name, str):
---> 11 raise ValueError(f"'name' must be a str, was {self.name}")
ValueError: 'name' must be a str, was 789
Congratulations, we have our first piece of validation! Now let’s move onto the charge
entry. This one is slightly more complicated because charge
is of type Union[float, int]
. See if you can do this one yourself first.
Writing your own validator: validating Charge
Write a validator statement for the charge
which should be of type Union[float, int]
to put in the __post_init__
function.
We're not looking for code efficiency here (minimal loops, fewest lines, minimal calls, etc.), so do what seems simplest here for understanding.
Hint
`Union` means "either" in this context.
Solution
This is one possible solution for validating this type
if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
    raise ValueError(f"'charge' must be a float or int, was {self.charge}")
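Another possible form of the same check uses the fact that `isinstance` accepts a tuple of types. The sketch below wraps it in a standalone function, `check_charge`, which is a hypothetical name used only so the example can run outside the class.

```python
def check_charge(charge):
    """Hypothetical standalone version of the charge check."""
    # isinstance accepts a tuple of types and is True if any of them match
    if not isinstance(charge, (float, int)):
        raise ValueError(f"'charge' must be a float or int, was {charge}")
    return charge


check_charge(0.0)   # passes
check_charge(-1)    # passes
check_charge(True)  # also passes, because bool is a subclass of int
```

The last call is worth keeping in mind: if a boolean charge should be rejected in your application, that needs its own explicit check.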
More difficult validation entries#
Let’s take a look at our two validated entries so far. We’re going to use our example from the exercise in the last section as our answer; yours might look different, and that’s fine.
@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
Next we are going to work on the symbols
. Because this is a container-like type, we have to validate not only the outermost type of list
, but also the internal items of str
.
Let’s start with the Pythonic concept of “it is better to ask forgiveness than permission” and just try to do a loop on the self.symbols
. If it cannot be looped over, we’ll catch that and throw a meaningful error.
@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
        try:
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
Heads Up Question
We've intentionally written the symbols
validator so there are a couple edge cases that will pass validation, but still not be the correct type hint. Can you think of such an edge case?
Hint for Case A
- There is a single primitive (non container-like) Python type which will pass the
symbols
validator
- Take a look through the bad entries dictionaries above, the example is in there.
Hint for Case B
- There is a single container-like Python type which will pass the
symbols
validator
- This type may not cause any issues at all if it's missed since
list
and this type work roughly the same for this application as coded.
# This will work
water = Molecule(**mol_data)
# This will not
mangle_name = {**mol_data, **noniter_symbols} # Inject bad name, could be done in 1 line, easier to read this way
water = Molecule(**mangle_name)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [10], in Molecule.__post_init__(self)
16 try:
---> 17 for content in self.symbols: # Loop over elements
18 if not isinstance(content, str): # Check content
TypeError: 'int' object is not iterable
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Input In [12], in <cell line: 3>()
1 # This will not
2 mangle_name = {**mol_data, **noniter_symbols} # Inject bad name, could be done in 1 line, easier to read this way
----> 3 water = Molecule(**mangle_name)
File <string>:7, in __init__(self, name, charge, symbols, coordinates)
Input In [10], in Molecule.__post_init__(self)
19 raise ValueError(content, type(content))
20 except TypeError as exec: # Trap not iterable item
21 # This will throw if you can't iterate over self.symbols
---> 22 raise ValueError(f"'symbols' must be a list, was {type(self.symbols)}") from exec
23 except ValueError as exec: # Trap the content error
24 raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
ValueError: 'symbols' must be a list, was <class 'int'>
Here we have wrapped the error handling through a try...except
clause and then run a loop through the elements. Previously, we had only used isinstance
, but there is also nothing wrong with a try...except
construct so long as the errors are handled gracefully.
The raise Exception() from
construct
We’ve used “Exception Chaining” to make the intentionally trapped and raised error stack easier to read.
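As a minimal, self-contained sketch of exception chaining outside of the Molecule class, the hypothetical helper `parse_charge` below traps one error and raises a different one from it. The original error stays attached to the new one as its `__cause__` attribute, which is what produces the "direct cause" line in the traceback.

```python
def parse_charge(text):
    """Hypothetical helper: convert text to a float with a friendlier error."""
    try:
        return float(text)
    except ValueError as original:
        # "from original" stores the trapped error on the new one as __cause__
        raise TypeError(f"could not read a charge from {text!r}") from original


try:
    parse_charge("neutral")
except TypeError as err:
    cause = err.__cause__
    print(type(cause).__name__)  # ValueError
```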
We can break down each component part of the validator to see what we have done, starting with the block inside the try
statement:
for content in self.symbols:  # Loop over elements
    if not isinstance(content, str):  # Check content
        raise ValueError(content, type(content))
We expect to be able to iterate over a list
, and Python throws a TypeError
if the object cannot be iterated (such as an int
). The inside of the loop then does type checking to ensure everything is a string.
try:
    pass
except TypeError as exec:
    raise ValueError(f"'symbols' must be a list, was {type(self.symbols)}") from exec
This exception in the try...except
construct catches the case of not being able to iterate during the for
loop, then handles it gracefully with a cleaner exception stack using the Exception Chaining construct (raise Exception() from
).
try:
    pass
except ValueError as exec:
    raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
The last except
clause catches the case of the elements of the symbols
list not being a str
. We’re combining two less commonly used ideas here to make a more meaningful error message. First, the exception chaining allows more elegant error messages. Second, we pass multiple arguments to the initial error (raise ValueError(content, type(content))
) so that we can write a more informative error message by accessing the initial error’s args
attribute.
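The `args` mechanism can be seen in isolation in this small sketch, which raises an error with two positional arguments and unpacks them again in the handler. The values used are arbitrary.

```python
try:
    # Every positional argument given to the exception is stored on .args
    raise ValueError(1.5, type(1.5))
except ValueError as err:
    value, value_type = err.args
    message = f"was {value} of type {value_type}"

print(message)  # was 1.5 of type <class 'float'>
```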
Let’s try to run a different type of data through the symbols
validator. In theory, this is not the right type, and so it should throw an error too. As the Heads-Up Question above alluded to, it won’t (this is Case A).
mangled_data = {**mol_data, **nonlist_symbols}
water = Molecule(**mangled_data)
The reason this passed validation is because, in Python, the str
type is iterable, and each element of an iterated str
is another instance of str
. The other alluded to edge case is if symbols
is a tuple
.
mangled_data = {**mol_data, **tuple_symbols}
water = Molecule(**mangled_data)
This also works fine, and is Case B from the Heads-Up Question. Depending on how symbols
is used in code, having it be a tuple
might be fine for your purposes. If so, consider expanding the allowed types (or coercing the incoming data to a list
).
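A minimal sketch of the coercion option might look like the following. The function `coerce_symbols` is a hypothetical helper, not part of the chapter's reference code; it accepts a list or tuple and always hands back a list.

```python
def coerce_symbols(symbols):
    """Hypothetical helper: accept a list or tuple of str, always return a list."""
    # Checking the container type first also rejects a plain str, which would
    # otherwise be split into single characters by list()
    if not isinstance(symbols, (list, tuple)):
        raise ValueError(f"'symbols' must be a list or tuple, was {type(symbols)}")
    for content in symbols:
        if not isinstance(content, str):
            raise ValueError(f"Each element of 'symbols' must be a string, was {content!r}")
    return list(symbols)


print(coerce_symbols(("H", "H", "O")))  # ['H', 'H', 'O']
```

With coercion, the rest of the code only ever has to deal with one container type.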
Because of all the edge cases, this is a case where being precise and “asking permission” can be more helpful than what we originally wrote. Let’s rewrite our validator to handle tuple
s permissively, and to catch those pesky string
s.
@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: list[list[float]]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
        try:
            if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
                raise TypeError
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if you can't iterate over self.symbols
            raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
# Should now fail correctly
mangled_data = {**mol_data, **nonlist_symbols}
water = Molecule(**mangled_data)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [15], in Molecule.__post_init__(self)
17 if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
---> 18 raise TypeError
19 for content in self.symbols: # Loop over elements
TypeError:
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
Input In [16], in <cell line: 3>()
1 # Should now fail correctly
2 mangled_data = {**mol_data, **nonlist_symbols}
----> 3 water = Molecule(**mangled_data)
File <string>:7, in __init__(self, name, charge, symbols, coordinates)
Input In [15], in Molecule.__post_init__(self)
21 raise ValueError(content, type(content))
22 except TypeError as exec: # Trap not iterable item
23 # This will throw if you can't iterate over self.symbols
---> 24 raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
25 except ValueError as exec: # Trap the content error
26 raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
ValueError: 'symbols' must be a list or tuple of string, was <class 'str'>
# Should still work
mangled_data = {**mol_data, **tuple_symbols}
water = Molecule(**mangled_data)
# Make sure we didn't break expected use cases
water = Molecule(**mol_data)
We kept the same try...except
block, but now added one extra check for type. Because we were more permissive to the type, we also modified the type hint for symbols
from list[str]
to Union[list[str], tuple[str, ...]]
. This had to take advantage of one of the “Special Forms” of type hints that tuple
has.
In Type Hints in Python, we called out that the tuple
is immutable, so multiple arguments of type hints fed into its []
annotations matched 1-to-1 with the index of items in the tuple
. There is a “special form” of tuple
annotation for “variable-length tuple of homogeneous type.” That is denoted with the literal ellipsis. So, tuple[str, ...]
means a variable-length tuple whose items are all of type str
. If you feel this is confusing, consider not allowing tuple
type, or coercing tuple
to list
before processing. Also consider using an automatic type checking and coercion library like pydantic (Introduction to Pydantic) to simplify your code.
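To make the difference between the two tuple forms concrete, the short sketch below builds both kinds of hint and inspects them with `typing.get_args`. The variable names are made up for this example, and it assumes Python 3.9 or greater as the rest of the chapter does.

```python
from typing import get_args

fixed = tuple[str, int]     # exactly two items: a str, then an int
variable = tuple[str, ...]  # any number of items, every one a str

print(get_args(fixed))     # (<class 'str'>, <class 'int'>)
print(get_args(variable))  # (<class 'str'>, Ellipsis)
```

As with every other hint in this chapter, neither form is enforced at runtime; they only describe what is expected.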
One-liner Validation
It is possible to do the symbols
validation in many other ways, including as a one-liner for the if
statement. Because Python’s or
operands are evaluated left to right, and each one is evaluated only if the one before it was false, you can compress both checks into a larger if...or
construct. If you’re curious, here is one possible solution (wrapped with \ on a newline char for human legibility):
if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)) or \
        any(not isinstance(x, str) for x in self.symbols):
    raise ValueError(f"{self.symbols} must be a list of str")
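To see the short-circuit behavior on its own, here is a sketch that wraps the same condition in a hypothetical function, `symbols_invalid`, which returns True when the input should be rejected.

```python
def symbols_invalid(symbols):
    """Hypothetical helper: True if symbols is not a list/tuple of str."""
    return (not isinstance(symbols, (list, tuple))
            or any(not isinstance(x, str) for x in symbols))


print(symbols_invalid(["H", "H", "O"]))  # False
print(symbols_invalid(1234567890))       # True
```

The second call does not raise a TypeError even though an int cannot be iterated: the first operand is already True, so the `any(...)` expression is never evaluated.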
The even more difficult validation#
By now our validation code has started to get pretty large for evaluating only three of our entries. Some efficiency could be gained with function calls for de-duplication, but not much. Even so, we still have the most complex validator so far, a list of list of numbers! In the grand scheme of schema, this isn’t that complicated of a validator, but let’s check what will need to go into validating this entry. We’ll reuse every concept covered so far, and just apply it with an extra outer loop. Let’s cover the statements of fact and then translate them to a validator.
The top most item is a (list or tuple)
Each element of the outer (list or tuple) is a (list or tuple)
Each element of the inner (list or tuple) is a (float or int).
We’ve expanded on our original specification to handle the edge cases we encountered previously. We’re going to now work through each change we make to the Molecule
code, one part at a time, through exercises to see how well you’ve picked up on what all we’ve done so far.
Modifying the coordinates
type hint
First we want to change the type hint for coordinates
themselves to match the listed statements of fact. Let's go through them backwards to put it all together.
You will find it easier and more legible to make these compound type hints variables.
“…Is a float or int.” Solution:
Union[float,int]
fi = Union[float,int]
“Each element of the inner (list or tuple) is a (float or int)” Solution:
Union[list[Union[float,int]], tuple[Union[float,int], ...]]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
“The top most item is a (list or tuple)” and “Each element of the outer (list or tuple) is a (list or tuple)” Solution:
Union[list[Union[list[Union[float,int]], tuple[Union[float,int], ...]]], tuple[Union[list[Union[float,int]], tuple[Union[float,int], ...]], ...]]
outer = Union[list[inner], tuple[inner, ...]]
The complexity of the final type hint suggests that maybe coercion is a better call here (casting the tuples to lists).
coordinates
Inner Loop try...except
The inner loop should look similar to what we did with symbols
. Just use a placeholder name for the inner iterable element for now.
Solution:
try:
    if not (isinstance(inner, list) or isinstance(inner, tuple)):
        raise TypeError
    for content in inner:  # Loop over elements
        if not (isinstance(content, int) or isinstance(content, float)):  # Check content
            raise ValueError(content, type(content))
except TypeError as exec:  # Trap not iterable item
    # This will throw if inner is not a list or tuple
    raise ValueError(f"'coordinates' inner elements must be a list or tuple of int/float, was {type(inner)}") from exec
except ValueError as exec:  # Trap the content error
    raise ValueError(f"Each inner element of 'coordinates' must be a float or int, was {exec.args[0]} of type {exec.args[1]}") from exec
coordinates
Outer Loop try...except
Focus just on the outer loop, put a pass
where the inner loop would go. Try to reuse the placeholder name from the previous exercise.
Solution:
try:
    if not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)):
        raise TypeError
    for inner in self.coordinates:  # Loop over elements
        pass  # Inner loop
except TypeError as exec:  # Trap not iterable item
    # This will throw if self.coordinates is not a list or tuple
    raise ValueError(f"'coordinates' must be a list or tuple of int/float, was {type(self.coordinates)}") from exec
except ValueError as exec:  # Trap the content error
    raise ValueError(f"'coordinates' must be a list or tuple of int/float, however the following error was thrown: {exec}") from exec
One-liner Coordinates
Just like with symbols
, we can validate coordinates
with one line as well. We throw out the ability to get much more informative errors, and admittedly double loop in this example, but it’s much more compact.
if (not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)) or
        any(not (isinstance(y, list) or isinstance(y, tuple)) for y in self.coordinates) or
        any(any(not (isinstance(z, float) or isinstance(z, int)) for z in sub) for sub in self.coordinates)
        ):
    raise ValueError(f"{self.coordinates} must be a (list or tuple) of (list or tuple) of (float or int)")
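As noted at the start of this section, some of the repetition between the symbols and coordinates checks can be removed with function calls. The following is one possible sketch, not the chapter's reference solution; `check_sequence` and `check_coordinates` are hypothetical helper names.

```python
def check_sequence(value, item_types, label):
    """Hypothetical helper: value must be a list/tuple whose items match item_types."""
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"{label} must be a list or tuple, was {type(value)}")
    for item in value:
        if not isinstance(item, item_types):
            raise ValueError(
                f"Each element of {label} must be {item_types}, "
                f"was {item!r} of type {type(item)}"
            )


def check_coordinates(coordinates):
    """Hypothetical helper: apply the same check to the outer and inner levels."""
    check_sequence(coordinates, (list, tuple), "'coordinates'")
    for inner in coordinates:
        check_sequence(inner, (float, int), "inner 'coordinates'")


check_sequence(["H", "H", "O"], str, "'symbols'")  # passes
check_coordinates([[0, 0, 0], (1.0, 2.0, 3.0)])    # passes
```

The trade-off is the same one seen with the one-liners: the code is shorter, but the error messages become more generic.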
Constructing our fully validated molecule#
Finally, we can build out our fully, manually validated Molecule
object. This is also the final reference code of 03_manual_valid_molecule.py
.
from dataclasses import dataclass
from typing import Union

# Type Helpers
fi = Union[float, int]
lfi = list[fi]
tfi = tuple[fi, ...]
inner = Union[lfi, tfi]
lo = list[inner]
tupo = tuple[inner, ...]

@dataclass
class Molecule:
    name: str
    charge: fi
    symbols: Union[list[str], tuple[str, ...]]
    coordinates: Union[lo, tupo]

    def __post_init__(self):
        # We'll validate the inputs here.
        if not isinstance(self.name, str):
            raise ValueError(f"'name' must be a str, was {self.name}")
        if not (isinstance(self.charge, float) or isinstance(self.charge, int)):
            raise ValueError(f"'charge' must be a float or int, was {self.charge}")
        try:
            if not (isinstance(self.symbols, list) or isinstance(self.symbols, tuple)):
                raise TypeError
            for content in self.symbols:  # Loop over elements
                if not isinstance(content, str):  # Check content
                    raise ValueError(content, type(content))
        except TypeError as exec:  # Trap not iterable item
            # This will throw if self.symbols is not a list or tuple
            raise ValueError(f"'symbols' must be a list or tuple of string, was {type(self.symbols)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"Each element of 'symbols' must be a string, was {exec.args[0]} of type {exec.args[1]}") from exec
        try:
            if not (isinstance(self.coordinates, list) or isinstance(self.coordinates, tuple)):
                raise TypeError
            for inner in self.coordinates:  # Loop over elements
                try:
                    if not (isinstance(inner, list) or isinstance(inner, tuple)):
                        raise TypeError
                    for content in inner:  # Loop over elements
                        if not (isinstance(content, int) or isinstance(content, float)):  # Check content
                            raise ValueError(content, type(content))
                except TypeError as exec:  # Trap not iterable item
                    # This will throw if inner is not a list or tuple
                    raise ValueError(f"'coordinates' inner elements must be a list or tuple of float/int, was {type(inner)}") from exec
                except ValueError as exec:  # Trap the content error
                    raise ValueError(f"Each inner element of 'coordinates' must be a float or int, was {exec.args[0]} of type {exec.args[1]}") from exec
        except TypeError as exec:  # Trap not iterable item
            # This will throw if self.coordinates is not a list or tuple
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, was {type(self.coordinates)}") from exec
        except ValueError as exec:  # Trap the content error
            raise ValueError(f"'coordinates' must be a list or tuple of int/float, however the following error was thrown: {exec}") from exec

    @property
    def num_atoms(self):
        return len(self.symbols)

    def __str__(self):
        return f"name: {self.name}\ncharge: {self.charge}\nsymbols: {self.symbols}"
Let’s check that our code validates and fails correctly
water = Molecule(**mol_data)
for bad in [bad_name, bad_charge, noniter_symbols, nonlist_symbols, bad_coords]:
    try:
        mangle = {**mol_data, **bad}
        water = Molecule(**mangle)
    except ValueError:  # Every validation failure is raised as a ValueError
        pass
    else:
        raise ValueError(f"All of these should fail, but {bad} did not")
However, despite all our validation efforts, we have so far only validated types; actual data values are not yet validated. E.g. symbols
should be the same length as the outer iterable of coordinates
, but actually testing that still passes.
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                         }  # Coordinates top-level list is not the same length as symbols
mangle = {**mol_data, **bad_symbols_and_cords}
water = Molecule(**mangle)
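A value check for this particular case could be sketched as below. The function `check_lengths` is a hypothetical standalone helper; in practice the same comparison could sit at the end of __post_init__, after the type checks have passed.

```python
def check_lengths(symbols, coordinates):
    """Hypothetical value check: one coordinate entry per symbol."""
    if len(symbols) != len(coordinates):
        raise ValueError(
            f"'symbols' has {len(symbols)} entries but 'coordinates' has "
            f"{len(coordinates)}; they must be the same length"
        )


check_lengths(["H", "H", "O"], [[0, 0, 0], [0, 0, 1], [0, 1, 0]])  # passes
```

Note that the mol_data dictionary used throughout this chapter would be rejected by this check too, since it has three symbols but only one coordinate entry.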
Key Takeaway of Manual Validation: Don’t Do It by Hand!#
As of now, you have written a bunch of code, just to validate 4 items of relatively simple types. What about complex dictionaries? Non-native Python types? Data structures with metadata also having their own structures? We also haven’t done anything really with the type hints. None of the validation code actually reads the type hints to infer what the types should be, let alone do any programming with them.
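To give a sense of what reading the type hints would involve, the sketch below uses `dataclasses.fields` to pull the declared hint for each attribute off a stripped-down Molecule. It only collects the hints; turning them into working validation for nested containers is the hard part, and is not attempted here.

```python
from dataclasses import dataclass, fields
from typing import Union


@dataclass
class Molecule:
    name: str
    charge: Union[float, int]
    symbols: list[str]
    coordinates: list[list[float]]


# Map each attribute name to the type hint it was declared with
expected = {field.name: field.type for field in fields(Molecule)}

for attribute, hint in expected.items():
    print(attribute, hint)
```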
Still, this chapter should have instilled how to think about validation, what edge cases to consider, how to gracefully handle errors, shown the implicit advantages of data coercion (or at least shown all the effort to accommodate not coercing), and given a respect for what will be required should you choose to validate manually.
In the next chapter, we’re going to show pydantic: A powerful schema validation tool which takes the dataclass
structure in tandem with Python’s own native type hints to do automatic validation.