Nested Data Models#
Starting File: 05_valid_pydantic_molecule.py
This chapter will start from the `05_valid_pydantic_molecule.py` and end on the `06_multi_model_molecule.py`.
Data models are often more than flat objects. Many data structures and models can be perceived as a series of nested dictionaries, or “models within models.” We could validate those by hand, but pydantic provides the tools to handle that for us.
In this chapter, we'll cover nesting models within each other. We'll also touch on a very powerful tool for validating strings called Regular Expressions, or "regex."
Check Out Pydantic
We will not be covering all the capabilities of pydantic here, and we highly encourage you to visit the pydantic docs to learn about all the powerful and easy-to-execute things pydantic can do.
Compatibility with Python 3.8 and below
If you have Python 3.8 or below, you will need to import container type objects such as `List`, `Tuple`, `Dict`, etc. from the `typing` library instead of using their native types `list`, `tuple`, `dict`, etc. This chapter will assume Python 3.9 or greater; however, both approaches work in Python >= 3.9 and are 1:1 replacements of the same name.
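For example, the same list-of-strings annotation written both ways:

```python
# Python 3.8 and earlier: container hints come from the typing module
from typing import List
symbols_old: List[str] = ["H", "H", "O"]

# Python 3.9+: the built-in containers can be subscripted directly
symbols_new: list[str] = ["H", "H", "O"]
```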
Nested Models: Just Dictionaries with Some Structure#
Let's start by taking a look at our `Molecule` object once more, along with some sample data.
import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict
class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords

    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)
mol_data = {  # Good data
    "coordinates": [[0, 0, 0], [1, 1, 1], [2, 2, 2]],
    "symbols": ["H", "H", "O"],
    "charge": 0.0,
    "name": "water",
}

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of strings
inner_coords_not3d = {"coordinates": [[1, 2, 3], [4, 5]]}
bad_symbols_and_cords = {
    "symbols": ["H", "H", "O"],
    "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]],
}  # Coordinates top-level list is not the same length as symbols
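Each of these bad dictionaries targets a single field; one way to exercise them is to merge one over the good data. A quick sketch (assuming the `Molecule` class above has been run):

```python
from pydantic import ValidationError

# Overwrite the good coordinates with a flat list of strings
try:
    Molecule(**{**mol_data, **bad_coords})
except ValidationError as e:
    print(e)  # the shape check reports the coordinates are not [Number Symbols, 3]
```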
What exactly is our model? Our model is a `dict` with specific keys `name`, `charge`, `symbols`, and `coordinates`, all of which have some restrictions in the form of type annotations. What if we had another model for additional information that needed to be kept together, and those data did not make sense to flatten into the other attributes? An example of this would be contributor-like metadata: the originator or provider of the data in question.
Let’s make one up. Say the information follows these rules:
A name of the contributor
A contact email or homepage
An optional organization.
The contributor as a whole is optional too.
contributor = {
    "name": "I. Taylor Researcher",
    "url": "https://molssi.org",
    "Organization": "The Molecular Sciences Software Institute",
}
That looks like a good contributor for our `mol_data`. How would we add this entry to the `Molecule`? First, let's understand what an "optional" entry is.
Defining Optional Entries#
For type hints/annotations, "optional" translates to "can be `None`." That is not to say "if undefined, is `None`," since that implies behavior which may not be intended. For now, let's look at how we can set optional fields in Python as a whole, not specifically in pydantic yet. This can be specified in one of two main ways, or three if you are on Python 3.10 or greater.
from typing import Optional, Any
some_attribute: Optional[Any]
some_attribute: Any = None
some_attribute: Any | None # Python 3.10+ only
The first thing to note is the `Any` object from `typing`. This can be used to mean exactly that: any data type is valid here. We'll replace it with our actual model in a moment. Let's go over the ways to specify optional entries now, with the understanding that all three of these mean and do the exact same thing from a pure Python standpoint.
`Optional[Any]` borrows the `Optional` object from the `typing` library.

`Any = None` sets a default value of `None`, which also implies optional. This method can be used in tandem with any other type, not just `None`, to set a default value. E.g. `variable: int = 12` would indicate an `int` type hint with a default value of `12` if it's not set in the input data (see the sketch after this list).

`Any | None` employs Python's union (pipe) operator to treat this as "any OR none." It is equivalent to `Union[Any, None]`. This only works in Python 3.10 or greater, and it should be noted that this will be the preferred way to specify `Union` in the future, removing the need to import it at all.
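Here is a minimal sketch of the default-value behavior, using a small throwaway pydantic model (we will fold this idea into `Molecule` shortly):

```python
from pydantic import BaseModel

class Defaults(BaseModel):
    count: int = 12  # int type hint with a default value of 12

print(Defaults())         # count=12  (field omitted, default used)
print(Defaults(count=3))  # count=3
```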
Understanding the `dict` Type#
At the end of the day, all models are just glorified dictionaries with conditions on what is and is not allowed. You can specify a `dict` type which takes up to two arguments for its type hints: keys and values, in that order. Although the Python dictionary supports any immutable type for a dictionary key, pydantic models accept only strings by default (this can be changed).
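For example, a sketch of a `dict`-typed field with string keys and integer values (the `Inventory` model here is just for illustration):

```python
from pydantic import BaseModel

class Inventory(BaseModel):
    counts: dict[str, int]  # keys validated as str, values as int

print(Inventory(counts={"H": 2, "O": 1}))
# counts={'H': 2, 'O': 1}
```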
If we take our `contributor` rules, we could define this sub-model like so:
class ValidEmail(str):
    pass

class ValidHTML(str):
    pass

from typing import Union

contributor_mockup: Optional[dict[
    str,  # Key Type Annotation
    Union[  # Value Type Annotation
        ValidEmail, ValidHTML, str  # Different accepted string types, overly permissive
    ],
]] = None  # Optional with default value of None
We would need to fill in the rest of the validation logic for `ValidEmail` and `ValidHTML`, write some rather rigorous validation to ensure there are only the correct keys, and ensure the values all adhere to the other rules above, but it can be done. In fact, the values `Union` is overly permissive.
Why is the values `Union` overly permissive?

As written, the `Union` will not actually prevent bad URLs or bad emails. Why?
Solution:

Because a plain `str` is one of the allowed members of the `Union`, any string will validate successfully against that member at the very end, no matter what `ValidEmail` or `ValidHTML` would have enforced.
Nested Models: The Pydantic Way#
Manually writing validators for structured models within our models is made simple with pydantic. Because our `contributor` is just another model, we can treat it as such and inject it into any other pydantic model.
from typing import Optional
from pydantic import BaseModel
class Contributor(BaseModel):
    name: str
    url: str
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable
There it is, our very basic model. Because this is just another pydantic model, we can also write validators that will run for just this model. Pydantic treats optional fields a bit differently: it will NOT set a default value of `None` just because a field is `Optional`; a default has to be assigned explicitly with the `=` operator. It also follows a few basic rules:
Default values are not checked against the type/schema. E.g. `None` here is not checked against `str`.

Unless you specifically assign a value via `=`, the field is required.

Only fields explicitly set with `Optional` or `... | None` can be "nullable," which means they accept `None` as an actual argument.
See the rules for Optional vs. nullable fields here. For our purposes, we will explicitly set the default value of `None` and `Optional` where appropriate to avoid any ambiguity; a short sketch of these rules in action follows the note below.
Optional and default None in Python 3.10

If you have Python 3.10 or greater and you want an optional field that defaults to None, you would write something like `Organization: str | None = None`.
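A small sketch of the required-vs-optional rules above, using a throwaway model:

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class OptionalDemo(BaseModel):
    required_nullable: Optional[str]         # no default: must be provided, but may be None
    optional_nullable: Optional[str] = None  # default set: may be omitted entirely

print(OptionalDemo(required_nullable=None))
# required_nullable=None optional_nullable=None

try:
    OptionalDemo()  # required_nullable was never provided
except ValidationError as e:
    print(e)
```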
We'll revisit that concept in a moment; for now, let's inject this model into our existing pydantic model for `Molecule`.
import numpy as np
from typing import Optional
from pydantic import BaseModel, field_validator, ConfigDict
class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
    contributor: Optional[Contributor] = None  # <--- New, nested, optional model

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords

    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)
That one line has now added the entire construct of the `Contributor` model to the `Molecule`. The name of the submodel does NOT have to match the name of the attribute it's representing. Pydantic will handle passing off the nested dictionary of input data to the nested model and construct it according to its own rules.
water = Molecule(**mol_data)
print(water)
name='water' charge=0.0 symbols=['H', 'H', 'O'] coordinates=array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]) contributor=None
mol_data_with_contrib = {**mol_data, **{"contributor": contributor}}
water_with_contributor = Molecule(**mol_data_with_contrib)
print(water_with_contributor)
name='water' charge=0.0 symbols=['H', 'H', 'O'] coordinates=array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]) contributor=Contributor(name='I. Taylor Researcher', url='https://molssi.org', Organization='The Molecular Sciences Software Institute')
And that’s the basics of nested models. Arbitrary levels of nesting and piecewise addition of models can be constructed and inherited to make rich data structures.
We still have the matter of making sure the URL is a valid URL or email link, and for that we'll need to touch on Regular Expressions.
Validating Strings on Patterns: Regular Expressions#
Strings, all strings, have patterns in them. Those patterns can be described with a specialized pattern recognition language called Regular Expressions, or "regex." A full understanding of regex is NOT required nor expected for this workshop. However, we feel it's important to touch on: the more data validation you do, especially on strings, the more likely it is that you will need or encounter regex at some point.
Let's write a validator for email. We're looking for something that looks like `mailto:someemail@fake-location.org`, where the `mailto:` part may be optional. We start by creating a sequence of validators we can reuse by applying `Annotated` to the `str` class. This is the alternate way to create reusable, composable validators in `pydantic`.
from typing_extensions import Annotated
# Python 3.9+ you can just get Annotated from the standard typing module
from pydantic import AfterValidator
import re

def strip_string(v: str):
    return v.strip()

def valid_email(v: str):
    if not re.match(r"(mailto:)?[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+", v):
        raise ValueError("mailto URL is not a valid mailto or email link")
    return v

MailTo = Annotated[str, AfterValidator(strip_string), AfterValidator(valid_email)]
There is a lot going on here, but let's break it down one thing at a time.
First, we have imported the `Annotated` object, which lets us tack additional metadata onto a type. Pydantic will read that metadata to handle its validation of anything annotated as `MailTo`.
Second, we have pulled in the `AfterValidator` helper from `pydantic`, which allows us to define a function to use for validation after any standard pydantic validation is done.
Third, we defined our two validation functions. `strip_string` is just a helper that does some formatting before we apply regular expressions (Python's `re` module) to the string in `valid_email`. Sure, we could have combined the two, but keeping them separate makes for a better illustration.
Fourth, we `Annotated` the `str` type with our two functions in the order we want them to run: pydantic applies `AfterValidator`s from left to right, so `strip_string` runs first and `valid_email` then sees the cleaned-up string.
Overall, what we have here is a custom-validator version of the supplementary material in the last chapter, Validating Data Beyond Types. If you did not go through that section, don't worry; all the important bits were explained in the list above. Each of the validation functions has been set up as a separate condition which has to pass for something of type `MailTo` to be considered valid. The important part to focus on here is the `valid_email` function and the `re.match` method.
`re` is a built-in Python library for doing regex. The `match(pattern, string_to_find_match)` function looks for the `pattern` starting from the first character of `string_to_find_match`. Our pattern can be broken down in the following way:
r" <-- Literal String, bypasses lots of special characters like #
( <-- Grouping
mailto: <-- exactly this, no special characters here
)? <-- ? Optional, operating on Grouping
[ <-- Single character matching any of the characters within
a-z <-- any lowercase a to z character
A-Z <-- any uppercase A to Z character
0-9 <-- any number 0 to 9
._%+- <-- any of these characters, special regex meaning suppressed inside []
]+ <-- + at least one instance, repeated as many times, operating on single character match = "these characters at least once"
@ <-- literal @ sign
[a-zA-Z0-9.-]+ <-- Single character match, of this pattern, at least one of
\. <-- literal "." because \ suppresses the wildcard nature of . in regex
[a-zA-Z]+ <-- Single character match, of this pattern, at least one of
" <-- End pattern
We're not expecting this to be memorized, just understood as a pattern that is being looked for. Pydantic has other ways to define regex pattern searching, in things like `Field` through its Rust backend, but we're not going to focus on that here.
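Before wiring this into `Contributor`, here is a quick sketch of the `MailTo` annotated type doing its job on a throwaway model:

```python
from pydantic import BaseModel, ValidationError

class MailCheck(BaseModel):
    email: MailTo

print(MailCheck(email="  mailto:someemail@fake-location.org  "))  # stripped, then matched
# email='mailto:someemail@fake-location.org'

try:
    MailCheck(email="not-an-email")
except ValidationError as e:
    print(e)  # our "not a valid mailto or email link" message appears here
```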
We can now set this pattern as one of the valid options for the `url` entry in the `contributor` model.
from typing import Union
from pydantic import BaseModel

class Contributor(BaseModel):
    name: str
    url: Union[MailTo, str]
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable
So why did we show this if we were only going to pass in `str` as the second `Union` option? We wanted to show this regex pattern because pydantic provides a number of helper "types" in its `pydantic.types` and `pydantic.networks` modules which function very similarly to our custom `MailTo` type and can be used to shortcut writing manual validators. Their names often say exactly what they do. Some examples include, with the `pydantic.{module}` they come from in parentheses (a couple of these are sketched in action just after the list):
StrictInt (types)
PositiveFloat (types)
AnyUrl (networks)
HttpUrl (networks)
IPvAnyAddress (networks)
and many more
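For instance, here is a small sketch of a couple of these helper types in a throwaway model (`PositiveFloat` and `HttpUrl` are real pydantic types; the model itself is just for illustration):

```python
from pydantic import BaseModel, PositiveFloat, ValidationError
from pydantic.networks import HttpUrl

class TypeShowcase(BaseModel):
    mass: PositiveFloat  # any float greater than 0
    homepage: HttpUrl    # must parse as an http(s) URL

print(TypeShowcase(mass=18.02, homepage="https://molssi.org"))

try:
    TypeShowcase(mass=-1.0, homepage="not a url")
except ValidationError as e:
    print(e)  # both fields report errors
```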
They also have constrained types which you can use to set some boundaries without having to code them yourself. Natively, we can use `AnyUrl` to save us from having to write our own regex validator for matching URLs.
from typing import Union
from pydantic import BaseModel
from pydantic.networks import AnyUrl
class Contributor(BaseModel):
    name: str
    url: Union[MailTo, AnyUrl]
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable
Challenge: URL Regex Validator
Write a custom match string for a URL regex pattern
Do not do this yourself
There are many correct answers, and all of them are extremely difficult regex strings. Put some thought into your answer, understanding that it's best to look up an answer (feel free to do this) or borrow from someone else, with attribution. We did exactly that for this challenge as well.
Answer:
With credit: https://gist.github.com/gruber/8891611#file-liberal-regex-pattern-for-web-urls-L8
r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
Final Code: Bringing it All Together#
Let's combine everything we've built into one final block of code.
import re
from typing import Optional, Union
# Python 3.9+ you can just get Annotated from the standard typing module
from typing_extensions import Annotated
import numpy as np
from pydantic import BaseModel, field_validator, AfterValidator, ConfigDict
from pydantic.networks import AnyUrl
def strip_string(v: str):
    return v.strip()

def valid_email(v: str):
    if not re.match(r"(mailto:)?[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+", v):
        raise ValueError("mailto URL is not a valid mailto or email link")
    return v

MailTo = Annotated[str, AfterValidator(strip_string), AfterValidator(valid_email)]

class Contributor(BaseModel):
    name: str
    url: Union[MailTo, AnyUrl]
    Organization: Optional[str] = None  # Set default and explicitly make it Nullable

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
    contributor: Optional[Contributor] = None  # <--- New, nested, optional model

    model_config = ConfigDict(arbitrary_types_allowed=True)

    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords

    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)
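As a closing sketch (reusing the `mol_data` dictionary from earlier in the chapter and a hypothetical email address for the contributor), the finished model accepts a nested contributor whose `url` is either a real URL or an email-style link:

```python
mol_data_with_contrib = {
    **mol_data,
    "contributor": {
        "name": "I. Taylor Researcher",
        "url": "mailto:i.taylor@fake-location.org",  # hypothetical address; an https:// URL works too
        "Organization": "The Molecular Sciences Software Institute",
    },
}
print(Molecule(**mol_data_with_contrib))
```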
Congratulations! You’ve now written a robust data model with automatic type annotations, validation, and complex structure including nested models.
Our `Molecule` has come a long way from being a simple data class with no validation. We learned how to annotate the arguments with built-in Python type hints. We converted our data structure to a Python `dataclass` to simplify repetitive code and make our structure easier to understand. The data were validated through manual checks, which we learned could be handled programmatically. Pydantic was brought in to turn our type hints into enforced annotations and automatically check typing, both for Python-native types and for external/custom types like NumPy arrays. Finally, we created nested models to permit arbitrary complexity and gained a better understanding of what tools are available for validating data.
Finally, we encourage you to go through and visit all the external links in these chapters, especially for pydantic. This workshop only touched on basic pydantic usage, and there is so much more you can do with auto-validating models.
We hope you've found this workshop helpful, and we welcome any comments, feedback, spotted issues, improvements, or suggestions on the material through GitHub (link in the dropdown at the top).