Nested Data Models#
Starting File: 05_valid_pydantic_molecule.py
This chapter will start from the `05_valid_pydantic_molecule.py` and end on the `06_multi_model_molecule.py`.
Data models are often more than flat objects. Many data structures and models can be perceived as a series of nested dictionaries, or “models within models.” We could validate those by hand, but pydantic provides the tools to handle that for us.
In this chapter, we'll cover nesting models within each other. We'll also touch on a very powerful tool for validating strings called Regular Expressions, or "regex."
Check Out Pydantic
We will not be covering all the capabilities of pydantic here, and we highly encourage you to visit the pydantic docs to learn about all the powerful and easy-to-execute things pydantic can do.
Compatibility with Python 3.8 and below
If you have Python 3.8 or below, you will need to import container type objects such as `List`, `Tuple`, `Dict`, etc. from the `typing` library instead of using their native types `list`, `tuple`, `dict`, etc. This chapter will assume Python 3.9 or greater; however, both approaches work in Python >= 3.9 and are 1:1 replacements of the same name.
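For example, the same list-of-strings annotation written both ways:

```python
# Python 3.8 and earlier: container hints come from the typing module
from typing import List
symbols_old: List[str] = ["H", "H", "O"]

# Python 3.9+: the built-in containers can be subscripted directly
symbols_new: list[str] = ["H", "H", "O"]
```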
Nested Models: Just Dictionaries with Some Structure#
Let's start by taking a look at our `Molecule` object once more, along with some sample data.
import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict
class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords

    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)
mol_data = {  # Good data
    "coordinates": [[0, 0, 0], [1, 1, 1], [2, 2, 2]],
    "symbols": ["H", "H", "O"],
    "charge": 0.0,
    "name": "water",
}

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of strings
inner_coords_not3d = {"coordinates": [[1, 2, 3], [4, 5]]}
bad_symbols_and_cords = {
    "symbols": ["H", "H", "O"],
    "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]],
}  # Coordinates top-level list is not the same length as symbols
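Each of these bad dictionaries targets a single field; one way to exercise them is to merge one over the good data. A quick sketch (assuming the `Molecule` class above has been run):

```python
from pydantic import ValidationError

# Overwrite the good coordinates with a flat list of strings
try:
    Molecule(**{**mol_data, **bad_coords})
except ValidationError as e:
    print(e)  # the shape check reports the coordinates are not [Number Symbols, 3]
```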
What exactly is our model? Our model is a `dict` with specific keys `name`, `charge`, `symbols`, and `coordinates`, all of which have some restrictions in the form of type annotations. What if we had another model for additional information that needed to be kept together, and those data did not make sense to flatten into the other attributes? An example of this would be contributor-like metadata: the originator or provider of the data in question.
Let’s make one up. Say the information follows these rules:
A name of the contributor
A contact email or homepage
An optional organization.
The contributor as a whole is optional too.
contributor = {
    "name": "I. Taylor Researcher",
    "url": "https://molssi.org",
    "Organization": "The Molecular Sciences Software Institute",
}
That looks like a good contributor for our `mol_data`. How would we add this entry to the `Molecule`? First, let's understand what an "optional" entry is.
Defining Optional Entries#
For type hints/annotations, "optional" translates to "can be `None`." That is not to say "if undefined, is `None`," since that implies behavior which may not be intended. For now, let's look at how we can set optional fields in Python as a whole, not specifically in pydantic yet. This can be specified in one of two main ways, or three if you are on Python 3.10 or greater.
from typing import Optional, Any
some_attribute: Optional[Any]
some_attribute: Any = None
some_attribute: Any | None # Python 3.10+ only
The first thing to note is the `Any` object from `typing`. This can be used to mean exactly that: any data type is valid here. We'll replace it with our actual model in a moment. Let's go over the ways to specify optional entries now, with the understanding that all three of these mean and do the exact same thing from a pure Python standpoint.
`Optional[Any]` borrows the `Optional` object from the `typing` library.

`Any = None` sets a default value of `None`, which also implies optional. This method can be used in tandem with any other type, not just `None`, to set a default value. E.g. `variable: int = 12` would indicate an `int` type hint with a default value of `12` if it's not set in the input data (see the sketch after this list).

`Any | None` employs Python's union (pipe) operator to treat this as "any OR none." It is equivalent to `Union[Any, None]`. This only works in Python 3.10 or greater, and it should be noted that this will be the preferred way to specify `Union` in the future, removing the need to import it at all.
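Here is a minimal sketch of the default-value behavior, using a small throwaway pydantic model (we will fold this idea into `Molecule` shortly):

```python
from pydantic import BaseModel

class Defaults(BaseModel):
    count: int = 12  # int type hint with a default value of 12

print(Defaults())         # count=12  (field omitted, default used)
print(Defaults(count=3))  # count=3
```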
Understanding the `dict` Type#
At the end of the day, all models are just glorified dictionaries with conditions on what is and is not allowed. You can specify a `dict` type which takes up to two arguments for its type hints: keys and values, in that order. Although the Python dictionary supports any immutable type for a dictionary key, pydantic models accept only strings by default (this can be changed).
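For example, a sketch of a `dict`-typed field with string keys and integer values (the `Inventory` model here is just for illustration):

```python
from pydantic import BaseModel

class Inventory(BaseModel):
    counts: dict[str, int]  # keys validated as str, values as int

print(Inventory(counts={"H": 2, "O": 1}))
# counts={'H': 2, 'O': 1}
```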
If we take our `contributor` rules, we could define this sub-model like so:
class ValidEmail(str):
    pass

class ValidHTML(str):
    pass

from typing import Union

contributor_mockup: Optional[dict[
    str,  # Key Type Annotation
    Union[  # Value Type Annotation
        ValidEmail, ValidHTML, str  # Different accepted string types, overly permissive
    ],
]] = None  # Optional with default value of None
We would need to fill in the rest of the validation logic for `ValidEmail` and `ValidHTML`, write some rather rigorous validation to ensure there are only the correct keys, and ensure the values all adhere to the other rules above, but it can be done. In fact, the values `Union` is overly permissive.
Why is the values `Union` overly permissive?

As written, the `Union` will not actually prevent bad URLs or bad emails. Why?
Solution:

Because a plain `str` is one of the allowed members of the `Union`, any string will validate successfully against that member at the very end, no matter what `ValidEmail` or `ValidHTML` would have enforced.
Nested Models: The Pydantic Way#
Manually writing validators for structured models within our models is made simple with pydantic. Because our `contributor` is just another model, we can treat it as such and inject it into any other pydantic model.
from typing import Optional
from pydantic import BaseModel
class Contributor(BaseModel):
    name: str
    url: str
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable
There it is, our very basic model. Because this is just another pydantic model, we can also write validators that will run for just this model. Pydantic treats optional fields a bit differently: it will NOT set a default value of `None` just because a field is `Optional`; a default has to be assigned explicitly with the `=` operator. It also follows a few basic rules:
Default values are not checked against the type/schema. E.g. `None` here is not checked against `str`.

Unless you specifically assign a value via `=`, the field is required.

Only fields explicitly set with `Optional` or `... | None` can be "nullable," which means they accept `None` as an actual argument.
See the rules for Optional vs. nullable fields here. For our purposes, we will explicitly set the default value of `None` and `Optional` where appropriate to avoid any ambiguity; a short sketch of these rules in action follows the note below.
Optional and default None in Python 3.10

If you have Python 3.10 or greater and you want an optional field that defaults to None, you would write something like `Organization: str | None = None`.
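A small sketch of the required-vs-optional rules above, using a throwaway model:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class OptionalDemo(BaseModel):
    required_nullable: Optional[str]         # no default: must be provided, but may be None
    optional_nullable: Optional[str] = None  # default set: may be omitted entirely

print(OptionalDemo(required_nullable=None))
# required_nullable=None optional_nullable=None

try:
    OptionalDemo()  # required_nullable was never provided
except ValidationError as e:
    print(e)
```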
We'll revisit that concept in a moment; for now, let's inject this model into our existing pydantic model for `Molecule`.
import numpy as np
from typing import Optional
from pydantic import BaseModel, field_validator, ConfigDict
class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
    contributor: Optional[Contributor] = None  # <--- New, nested, optional model

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords

    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)
That one line has now added the entire construct of the `Contributor` model to the `Molecule`. The name of the submodel does NOT have to match the name of the attribute it's representing. Pydantic will handle passing off the nested dictionary of input data to the nested model and construct it according to its own rules.
water = Molecule(**mol_data)
print(water)
name='water' charge=0.0 symbols=['H', 'H', 'O'] coordinates=array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]) contributor=None
mol_data_with_contrib = {**mol_data, **{"contributor": contributor}}
water_with_contributor = Molecule(**mol_data_with_contrib)
print(water_with_contributor)
name='water' charge=0.0 symbols=['H', 'H', 'O'] coordinates=array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]) contributor=Contributor(name='I. Taylor Researcher', url='https://molssi.org', Organization='The Molecular Sciences Software Institute')
And that’s the basics of nested models. Arbitrary levels of nesting and piecewise addition of models can be constructed and inherited to make rich data structures.
We still have the matter of making sure the URL is a valid URL or email link, and for that we'll need to touch on Regular Expressions.
Validating Strings on Patterns: Regular Expressions#
Strings, all strings, have patterns in them. Those patterns can be described with a specialized pattern recognition language called Regular Expressions, or "regex." A full understanding of regex is NOT required nor expected for this workshop. However, we feel it's important to touch on: the more data validation you do, especially on strings, the more likely it is that you will need or encounter regex at some point.
Let's write a validator for email. We're looking for something that looks like `mailto:someemail@fake-location.org`, where the `mailto:` part may be optional. We start by creating a sequence of validators we can reuse by applying `Annotated` to the `str` class. This is the alternate way to create reusable, composable validators in `pydantic`.
from typing_extensions import Annotated
# Python 3.9+ you can just get Annotated from the standard typing module
from pydantic import AfterValidator
import re

def strip_string(v: str):
    return v.strip()

def valid_email(v: str):
    if not re.match(r"(mailto:)?[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+", v):
        raise ValueError("mailto URL is not a valid mailto or email link")
    return v

MailTo = Annotated[str, AfterValidator(strip_string), AfterValidator(valid_email)]
There is a lot going on here, but let's break it down one thing at a time.
First, we have imported the `Annotated` object, which lets us tack additional metadata onto a type. Pydantic will read that metadata to handle its validation of anything annotated as `MailTo`.
Second, we have pulled in the `AfterValidator` helper from `pydantic`, which allows us to define a function to use for validation after any standard pydantic validation is done.
Third, we defined our two validation functions. `strip_string` is just a helper that does some formatting before we apply regular expressions (Python's `re` module) to the string in `valid_email`. Sure, we could have combined the two, but keeping them separate makes for a better illustration.
Fourth, we `Annotated` the `str` type with our two functions in the order we want them to run: pydantic applies `AfterValidator`s from left to right, so `strip_string` runs first and `valid_email` then sees the cleaned-up string.
Overall, what we have here is a custom-validator version of the supplementary material in the last chapter, Validating Data Beyond Types. If you did not go through that section, don't worry; all the important bits were explained in the list above. Each of the validation functions has been set up as a separate condition which has to pass for something of type `MailTo` to be considered valid. The important part to focus on here is the `valid_email` function and the `re.match` method.
`re` is a built-in Python library for doing regex. The `match(pattern, string_to_find_match)` function looks for the `pattern` starting from the first character of `string_to_find_match`. Our pattern can be broken down in the following way:
r" <-- Literal String, bypasses lots of special characters like #
( <-- Grouping
mailto: <-- exactly this, no special characters here
)? <-- ? Optional, operating on Grouping
[ <-- Single character matching any of the characters within
a-z <-- any lowercase a to z character
A-Z <-- any uppercase A to Z character
0-9 <-- any number 0 to 9
._%+- <-- any of these characters, special regex meaning suppressed inside []
]+ <-- + at least one instance, repeated as many times, operating on single character match = "these characters at least once"
@ <-- literal @ sign
[a-zA-Z0-9.-]+ <-- Single character match, of this pattern, at least one of
\. <-- literal "." because \ suppresses the wildcard nature of . in regex
[a-zA-Z]+ <-- Single character match, of this pattern, at least one of
" <-- End pattern
We're not expecting this to be memorized, just understood as a pattern that is being looked for. Pydantic has other ways to define regex pattern searching, in things like `Field` through its Rust backend, but we're not going to focus on that here.
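Before wiring this into `Contributor`, here is a quick sketch of the `MailTo` annotated type doing its job on a throwaway model:

```python
from pydantic import BaseModel, ValidationError

class MailCheck(BaseModel):
    email: MailTo

print(MailCheck(email="  mailto:someemail@fake-location.org  "))  # stripped, then matched
# email='mailto:someemail@fake-location.org'

try:
    MailCheck(email="not-an-email")
except ValidationError as e:
    print(e)  # our "not a valid mailto or email link" message appears here
```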
We can now set this pattern as one of the valid options for the `url` entry in the `contributor` model.
from typing import Union
from pydantic import BaseModel

class Contributor(BaseModel):
    name: str
    url: Union[MailTo, str]
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable
So why did we show this if we were only going to pass in `str` as the second `Union` option? We wanted to show this regex pattern because pydantic provides a number of helper "types" in its `pydantic.types` and `pydantic.networks` modules which function very similarly to our custom `MailTo` type and can be used to shortcut writing manual validators. Their names often say exactly what they do. Some examples include, with the `pydantic.{module}` they come from in parentheses (a couple of these are sketched in action just after the list):
StrictInt (types)
PositiveFloat (types)
AnyUrl (networks)
HttpUrl (networks)
IPvAnyAddress (networks)
and many more
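For instance, here is a small sketch of a couple of these helper types in a throwaway model (`PositiveFloat` and `HttpUrl` are real pydantic types; the model itself is just for illustration):

```python
from pydantic import BaseModel, PositiveFloat, ValidationError
from pydantic.networks import HttpUrl

class TypeShowcase(BaseModel):
    mass: PositiveFloat  # any float greater than 0
    homepage: HttpUrl    # must parse as an http(s) URL

print(TypeShowcase(mass=18.02, homepage="https://molssi.org"))

try:
    TypeShowcase(mass=-1.0, homepage="not a url")
except ValidationError as e:
    print(e)  # both fields report errors
```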
They also have constrained types which you can use to set some boundaries without having to code them yourself. Natively, we can use `AnyUrl` to save us from having to write our own regex validator for matching URLs.
from typing import Union
from pydantic import BaseModel
from pydantic.networks import AnyUrl
class Contributor(BaseModel):
    name: str
    url: Union[MailTo, AnyUrl]
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable
Challenge: URL Regex Validator
Write a custom match string for a URL regex pattern
Do not do this yourself
There are many correct answers, and all of them are extremely difficult regex strings. Put some thought into your answer, understanding that it's best to look up an answer (feel free to do this) or borrow from someone else, with attribution. We did exactly that for this challenge as well.
Answer:
With credit: https://gist.github.com/gruber/8891611#file-liberal-regex-pattern-for-web-urls-L8
r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
Final Code: Bringing it All Together#
Let's combine everything we've built into one final block of code.
import re
from typing import Optional, Union
# Python 3.9+ you can just get Annotated from the standard typing module
from typing_extensions import Annotated
import numpy as np
from pydantic import BaseModel, field_validator, AfterValidator, ConfigDict
from pydantic.networks import AnyUrl
def strip_string(v: str):
    return v.strip()

def valid_email(v: str):
    if not re.match(r"(mailto:)?[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+", v):
        raise ValueError("mailto URL is not a valid mailto or email link")
    return v

MailTo = Annotated[str, AfterValidator(strip_string), AfterValidator(valid_email)]

class Contributor(BaseModel):
    name: str
    url: Union[MailTo, AnyUrl]
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
    contributor: Optional[Contributor] = None  # <--- New, nested, optional model

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords

    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)
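As a closing sketch (reusing the `mol_data` dictionary from earlier in the chapter and a hypothetical email address for the contributor), the finished model accepts a nested contributor whose `url` is either a real URL or an email-style link:

```python
mol_data_with_contrib = {
    **mol_data,
    "contributor": {
        "name": "I. Taylor Researcher",
        "url": "mailto:i.taylor@fake-location.org",  # hypothetical address; an https:// URL works too
        "Organization": "The Molecular Sciences Software Institute",
    },
}
print(Molecule(**mol_data_with_contrib))
```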
Congratulations! You’ve now written a robust data model with automatic type annotations, validation, and complex structure including nested models.
Our `Molecule` has come a long way from being a simple data class with no validation. We learned how to annotate the arguments with built-in Python type hints. We converted our data structure to a Python `dataclass` to simplify repetitive code and make our structure easier to understand. The data were validated through manual checks, which we learned could be handled programmatically. Pydantic was brought in to turn our type hints into enforced annotations and automatically check typing, both for Python-native types and for external/custom types like NumPy arrays. Finally, we created nested models to permit arbitrary complexity and gained a better understanding of what tools are available for validating data.
Finally, we encourage you to go through and visit all the external links in these chapters, especially for pydantic. This workshop only touched on basic pydantic usage, and there is so much more you can do with auto-validating models.
We hope you've found this workshop helpful, and we welcome any comments, feedback, spotted issues, improvements, or suggestions on the material through GitHub (link in the dropdown at the top).