Validating Data Beyond Types#

Starting File: 04_pydantic_molecule.py

This chapter starts from 04_pydantic_molecule.py and ends at 05_valid_pydantic_molecule.py.

Data validation goes far beyond just types. Pydantic provides the basic tools for validating data types, but it also provides the tools for writing custom validators that check much more.

We’ll be covering the pydantic validator decorator and applying that to our data to check structure and scientific rigor. We’ll also cover how to validate types not native to Python, such as NumPy arrays.

Check Out Pydantic

We will not be covering all the capabilities of pydantic here, and we highly encourage you to visit the pydantic docs to learn about all the powerful and easy-to-execute things pydantic can do.

Compatibility with Python 3.8 and below

If you have Python 3.8 or below, you will need to import container type objects such as List, Tuple, Dict, etc. from the typing library instead of using their native types list, tuple, dict, etc. This chapter assumes Python 3.9 or greater; however, both approaches work in Python 3.9+ and have 1:1 replacements of the same name.
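For example, the two annotations below are equivalent; the typing-module version is only required on Python 3.8 and earlier (the variable names are purely illustrative).

# Python 3.8 and earlier: container annotations come from the typing module
from typing import List

symbols_38: List[str] = ["H", "H", "O"]

# Python 3.9+: the built-in containers can be subscripted directly
symbols_39: list[str] = ["H", "H", "O"]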

Pydantic’s Validator Decorator#

Let’s start by looking at the state of our code prior to extending the validators. As usual, let’s also define our test data.

from pydantic import BaseModel


class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]

    @property
    def num_atoms(self):
        return len(self.symbols)
mol_data = {  # Good data
    "coordinates": [[0, 0, 0], [1, 1, 1], [2, 2, 2]], 
    "symbols": ["H", "H", "O"], 
    "charge": 0.0, 
    "name": "water"
}

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably, a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of string
inner_coords_not3d = {"coordinates": [[1, 2, 3], [4, 5]]}
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                        }  # Coordinates top-level list is not the same length as symbols

You may notice we have extended our “Good Data” here to have coordinates actually define the Nx3 structure where N = len(symbols). This is important for what we plan to validate.

pydantic allows you to write custom validators in addition to the type validators that run automatically for each type annotation. The field_validator decorator is imported from the pydantic module just like BaseModel and is used to decorate a class function you write. Let’s look at the most basic field_validator we can write and attach it to coordinates.

Field vs Annotated Validators

pydantic allows validators to be defined functionally for reuse, ordering, and much more powerful usage through the Annotated class. We will be showing field_validator for this example to keep the validator more local for learning purposes. Please see [the pydantic docs on validators](https://docs.pydantic.dev/latest/usage/validators/) for more info.

from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)

Here we have defined an additional validator which does nothing, but has the basic structure we can look at. For convenience and reference, I’ve broken the aspects of the field_validator into a list.

  • The field_validator decorator takes as arguments the exact names, as strings, of the attributes you are validating; in this case, coordinates. You can provide multiple attribute names as string arguments if you want to reuse the same validator for several fields.

  • The function name can be whatever you want it to be. We’ve called it ensure_coordinates_is_3D to be meaningful if anyone ever wants to come back and see what this should be doing.

  • The function itself is a class method: the validator is called on the class itself, not an instance, which is why we include the native Python @classmethod decorator. The conventional name for the first argument is therefore cls and not self. You can define validators without the @classmethod decorator, but your IDE may complain, so we add the decorator so we can use cls without IDE issues, at least on that point.

  • The first (non-cls) argument of the function can be named whatever you want. The optional second argument will be given a pydantic metadata object of type FieldValidationInfo and can also be named whatever we want. We’ll use this metadata object later in the chapter.

  • The return value MUST be the validated data to be assigned to the attribute. We’ve done nothing to our variable coords, so we simply return it. If you forget to return something, the function returns None, and None will be treated as the validated value.

  • If the data do not validate correctly, the function must raise either a ValueError or an AssertionError for pydantic to correctly trap the error; anything else will propagate up the Python error stack as normal.

  • field_validator runs after type validation unless specified otherwise (see later in this chapter).

That may seem like lots of rules, but most of them are boilerplate and intuitive. Let’s apply these items to our validator. We want to make sure the inner lists of coordinates are 3D, i.e., of length 3. We don’t have to worry about type checking (that was done before any custom field_validator ran), so we can simply iterate over the outer list and check each inner list. Let’s apply that now.

from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)
good_water = Molecule(**mol_data)
mangled = {**mol_data, **inner_coords_not3d}
water = Molecule(**mangled)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [90], in <cell line: 3>()
      1 good_water = Molecule(**mol_data)
      2 mangled = {**mol_data, **inner_coords_not3d}
----> 3 water = Molecule(**mangled)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/main.py:150, in BaseModel.__init__(__pydantic_self__, **data)
    148 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    149 __tracebackhide__ = True
--> 150 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 1 validation error for Molecule
coordinates
  Value error, Inner coordinates must be 3D, got [4.0, 5.0] of length 2 [type=value_error, input_value=[[1, 2, 3], [4, 5]], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

Here we have checked that the good data still works and that the mangled data raised an error. It’s important to note that because the function raised a ValueError (it could also have been an AssertionError), the error was reported as a ValidationError. We can also see that the error message is the string we provided and that the error type matches what we raised. This is why it’s very important to write meaningful error strings for when your custom validator fails.

With all that said, our validator function really does look like any other function we might call to do a quick check of data, plus some special add-ons to make it work with pydantic. There is no practical limit to the number of field_validators you can have in a given class, so validate to your heart’s content.
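As a small, hedged illustration of that reuse (the model and field names here are made up and are not part of our Molecule), a single validator can be attached to several fields by passing multiple names to field_validator:

from pydantic import BaseModel, field_validator

class LabeledLists(BaseModel):
    symbols: list[str]
    masses: list[float]

    # One validator attached to both fields by passing both names to field_validator
    @field_validator("symbols", "masses")
    @classmethod
    def ensure_not_empty(cls, value):
        if len(value) == 0:
            raise ValueError("Field must not be empty")
        return value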

Python Assignment Expressions “The Walrus Operator” :=

Since Python 3.8, there is a new operator for “assignment expressions” called “The Walrus Operator” which allows variables to be assigned inside other expressions. We’ve used it here to trap the value at time of error and save space. Do not feel compelled to use this yourself, especially if it’s not clear what is happening.
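If the walrus operator reads awkwardly to you, here is the same coordinates validator from the class above, sketched with a plain loop instead; it behaves identically.

@field_validator("coordinates")
@classmethod
def ensure_coordinates_is_3D(cls, coords):
    # Same check as above, written without the walrus operator
    for inner in coords:
        if len(inner) != 3:
            raise ValueError(f"Inner coordinates must be 3D, got {inner} of length {len(inner)}")
    return coords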

Check your knowledge: Validator Basics

How would you validate that symbols entries are at most 2 characters? There is more than one correct solution beyond what we show here.
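One possible sketch (certainly not the only correct one) would be another field_validator added inside the Molecule class alongside the others; the validator name here is our own invention.

@field_validator("symbols")
@classmethod
def ensure_symbols_are_short(cls, symbols):
    if any(len(failure := symbol) > 2 for symbol in symbols):  # Walrus operator again
        raise ValueError(f"Symbols must be at most 2 characters, got '{failure}'")
    return symbols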

Validating against other fields#

pydantic’s validators can check fields beyond their own, which is helpful for cross-referencing dependent data. In our example, we want to make sure there are exactly as many coordinates as there are symbols in our Molecule. To check against other fields in a field_validator, we extend the arguments to include the optional second metadata argument, which we’re going to call info. We are going to leave our initial validator in place for now to show a feature of field_validators, but we could (and later will) combine them.

from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

We’ve added a second validator to our code called ensure_coordinates_match_symbols, and this function also validates coordinates. There are two main things we can see from adding this function:

  1. Multiple functions can be declared to validate against the same field.

  2. We’ve added the additional optional metadata argument to our new validator: info.

The second argument, if it appears in a field_validator, provides metadata about the validation that is currently happening and the validation that has already happened. Adding info as an argument tells the field_validator to also retrieve all previously validated fields of the model. In our case, that is name, charge, and symbols, since those entries appear before coordinates in the list of attributes. Any and all validators applied to those three entries have already run, and we have access to their validated values through the dictionary at info.data. See the pydantic docs for more details about this special argument and its metadata.

Let’s see this in action

good_water = Molecule(**mol_data)
mangled = {**mol_data, **bad_symbols_and_cords}
water = Molecule(**mangled)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [92], in <cell line: 3>()
      1 good_water = Molecule(**mol_data)
      2 mangled = {**mol_data, **bad_symbols_and_cords}
----> 3 water = Molecule(**mangled)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/main.py:150, in BaseModel.__init__(__pydantic_self__, **data)
    148 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    149 __tracebackhide__ = True
--> 150 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 1 validation error for Molecule
coordinates
  Value error, There must be an equal number of XYZ coordinates as there are symbols. There are 2 coordinates and 3 symbols. [type=value_error, input_value=[[1, 1, 1], [2.0, 2.0, 2.0]], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

Non-native Types in Pydantic#

Scientific data is not, and often should not be, confined to native Python types. One of the most common data types, especially in the sciences, is the NumPy array (the ndarray class). The most natural place for it here is coordinates, where we want to simplify the list-of-lists construct. Let’s see what happens when we make the type annotation an ndarray and how pydantic handles the coercion, or rather how it does not.

import numpy as np
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)
---------------------------------------------------------------------------
PydanticSchemaGenerationError             Traceback (most recent call last)
Input In [93], in <cell line: 4>()
      1 import numpy as np
      2 from pydantic import BaseModel, field_validator
----> 4 class Molecule(BaseModel):
      5     name: str
      6     charge: float

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_model_construction.py:174, in ModelMetaclass.__new__(mcs, cls_name, bases, namespace, __pydantic_generic_metadata__, __pydantic_reset_parent_namespace__, **kwargs)
    172 types_namespace = get_cls_types_namespace(cls, parent_namespace)
    173 set_model_fields(cls, bases, config_wrapper, types_namespace)
--> 174 complete_model_class(
    175     cls,
    176     cls_name,
    177     config_wrapper,
    178     raise_errors=False,
    179     types_namespace=types_namespace,
    180 )
    181 # using super(cls, cls) on the next line ensures we only call the parent class's __pydantic_init_subclass__
    182 # I believe the `type: ignore` is only necessary because mypy doesn't realize that this code branch is
    183 # only hit for _proper_ subclasses of BaseModel
    184 super(cls, cls).__pydantic_init_subclass__(**kwargs)  # type: ignore[misc]

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_model_construction.py:431, in complete_model_class(cls, cls_name, config_wrapper, raise_errors, types_namespace)
    424 handler = CallbackGetCoreSchemaHandler(
    425     partial(gen_schema.generate_schema, from_dunder_get_core_schema=False),
    426     gen_schema,
    427     ref_mode='unpack',
    428 )
    430 try:
--> 431     schema = cls.__get_pydantic_core_schema__(cls, handler)
    432 except PydanticUndefinedAnnotation as e:
    433     if raise_errors:

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/main.py:533, in BaseModel.__get_pydantic_core_schema__(cls, _BaseModel__source, _BaseModel__handler)
    530     if not cls.__pydantic_generic_metadata__['origin']:
    531         return cls.__pydantic_core_schema__
--> 533 return __handler(__source)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_schema_generation_shared.py:82, in CallbackGetCoreSchemaHandler.__call__(self, _CallbackGetCoreSchemaHandler__source_type)
     81 def __call__(self, __source_type: Any) -> core_schema.CoreSchema:
---> 82     schema = self._handler(__source_type)
     83     ref = schema.get('ref')
     84     if self._ref_mode == 'to-def':

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:280, in GenerateSchema.generate_schema(self, obj, from_dunder_get_core_schema, from_prepare_args)
    278 if isinstance(obj, type(Annotated[int, 123])):
    279     return self._annotated_schema(obj)
--> 280 return self._generate_schema_for_type(
    281     obj, from_dunder_get_core_schema=from_dunder_get_core_schema, from_prepare_args=from_prepare_args
    282 )

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:301, in GenerateSchema._generate_schema_for_type(self, obj, from_dunder_get_core_schema, from_prepare_args)
    298         schema = from_property
    300 if schema is None:
--> 301     schema = self._generate_schema(obj)
    303 metadata_js_function = _extract_get_pydantic_json_schema(obj, schema)
    304 if metadata_js_function is not None:

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:519, in GenerateSchema._generate_schema(self, obj)
    516 from ..main import BaseModel
    518 if lenient_issubclass(obj, BaseModel):
--> 519     return self._model_schema(obj)
    521 if isinstance(obj, PydanticRecursiveRef):
    522     return core_schema.definition_reference_schema(schema_ref=obj.type_ref)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:370, in GenerateSchema._model_schema(self, cls)
    367 self._config_wrapper_stack.append(config_wrapper)
    368 try:
    369     fields_schema: core_schema.CoreSchema = core_schema.model_fields_schema(
--> 370         {k: self._generate_md_field_schema(k, v, decorators) for k, v in fields.items()},
    371         computed_fields=[self._computed_field_schema(d) for d in decorators.computed_fields.values()],
    372         extra_validator=extra_validator,
    373         model_name=cls.__name__,
    374     )
    375 finally:
    376     self._config_wrapper_stack.pop()

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:370, in <dictcomp>(.0)
    367 self._config_wrapper_stack.append(config_wrapper)
    368 try:
    369     fields_schema: core_schema.CoreSchema = core_schema.model_fields_schema(
--> 370         {k: self._generate_md_field_schema(k, v, decorators) for k, v in fields.items()},
    371         computed_fields=[self._computed_field_schema(d) for d in decorators.computed_fields.values()],
    372         extra_validator=extra_validator,
    373         model_name=cls.__name__,
    374     )
    375 finally:
    376     self._config_wrapper_stack.pop()

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:674, in GenerateSchema._generate_md_field_schema(self, name, field_info, decorators)
    667 def _generate_md_field_schema(
    668     self,
    669     name: str,
    670     field_info: FieldInfo,
    671     decorators: DecoratorInfos,
    672 ) -> core_schema.ModelField:
    673     """Prepare a ModelField to represent a model field."""
--> 674     common_field = self._common_field_schema(name, field_info, decorators)
    675     return core_schema.model_field(
    676         common_field['schema'],
    677         serialization_exclude=common_field['serialization_exclude'],
   (...)
    681         metadata=common_field['metadata'],
    682     )

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:714, in GenerateSchema._common_field_schema(self, name, field_info, decorators)
    712     schema = self._apply_annotations(source_type, annotations, transform_inner_schema=set_discriminator)
    713 else:
--> 714     schema = self._apply_annotations(
    715         source_type,
    716         annotations,
    717     )
    719 # This V1 compatibility shim should eventually be removed
    720 # push down any `each_item=True` validators
    721 # note that this won't work for any Annotated types that get wrapped by a function validator
    722 # but that's okay because that didn't exist in V1
    723 this_field_validators = filter_field_decorator_info_by_field(decorators.validators.values(), name)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:1405, in GenerateSchema._apply_annotations(self, source_type, annotations, transform_inner_schema)
   1400     annotation = annotations[idx]
   1401     get_inner_schema = self._get_wrapped_inner_schema(
   1402         get_inner_schema, annotation, pydantic_js_annotation_functions
   1403     )
-> 1405 schema = get_inner_schema(source_type)
   1406 if pydantic_js_annotation_functions:
   1407     metadata = CoreMetadataHandler(schema).metadata

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_schema_generation_shared.py:82, in CallbackGetCoreSchemaHandler.__call__(self, _CallbackGetCoreSchemaHandler__source_type)
     81 def __call__(self, __source_type: Any) -> core_schema.CoreSchema:
---> 82     schema = self._handler(__source_type)
     83     ref = schema.get('ref')
     84     if self._ref_mode == 'to-def':

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:1366, in GenerateSchema._apply_annotations.<locals>.inner_handler(obj)
   1364 from_property = self._generate_schema_from_property(obj, obj)
   1365 if from_property is None:
-> 1366     schema = self._generate_schema(obj)
   1367 else:
   1368     schema = from_property

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:586, in GenerateSchema._generate_schema(self, obj)
    583     return self._type_alias_type_schema(obj)
    585 if origin is None:
--> 586     return self._arbitrary_type_schema(obj, obj)
    588 # Need to handle generic dataclasses before looking for the schema properties because attribute accesses
    589 # on _GenericAlias delegate to the origin type, so lose the information about the concrete parametrization
    590 # As a result, currently, there is no way to cache the schema for generic dataclasses. This may be possible
    591 # to resolve by modifying the value returned by `Generic.__class_getitem__`, but that is a dangerous game.
    592 if _typing_extra.is_dataclass(origin):

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:638, in GenerateSchema._arbitrary_type_schema(self, obj, type_)
    636     return core_schema.is_instance_schema(type_)
    637 else:
--> 638     raise PydanticSchemaGenerationError(
    639         f'Unable to generate pydantic-core schema for {obj!r}. '
    640         'Set `arbitrary_types_allowed=True` in the model_config to ignore this error'
    641         ' or implement `__get_pydantic_core_schema__` on your type to fully support it.'
    642         '\n\nIf you got this error by calling handler(<some type>) within'
    643         ' `__get_pydantic_core_schema__` then you likely need to call'
    644         ' `handler.generate_schema(<some type>)` since we do not call'
    645         ' `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.'
    646     )

PydanticSchemaGenerationError: Unable to generate pydantic-core schema for <class 'numpy.ndarray'>. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.

If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.

For further information visit https://errors.pydantic.dev/2.0.3/u/schema-for-unknown-type

This error was thrown because pydantic knows how to handle certain types of data, but it cannot handle types it was not programmed to understand. Helpfully, the error message tells us how to fix this.

You can configure a pydantic model’s behavior by adding a class attribute, which must be called model_config, that is an instance of the ConfigDict class provided by pydantic. Within that ConfigDict, you set keywords that serve as settings for the model it is attached to.

More model_config settings

You can see all of the config settings in the pydantic docs

Our particular error says many things, but we are going to focus on the simplest fix: configuring our model to set arbitrary_types_allowed, in this case to True. This tells this particular BaseModel to permit types that it does not naturally understand and to assume the user/programmer will handle them. Let’s see what Molecule looks like with this set. Note: the location of the model_config attribute within the class does not matter, and model_config applies on a per-model basis; it is not a global pydantic configuration.

Better and more powerful ways to do this with pydantic

Pydantic has much more powerful and precise ways to establish custom types than what we show here! Treat this lesson as a rudimentary introduction to custom types and only some of the ways to validate them. Please see the pydantic docs on custom validation, which include examples of how to handle third-party types such as NumPy or Pandas.

import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
    
    model_config = ConfigDict(arbitrary_types_allowed = True)
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

Our model is now configured to allow arbitrary types; no more error. Let’s see what happens when we pass in our data.

water = Molecule(**mol_data)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [95], in <cell line: 1>()
----> 1 water = Molecule(**mol_data)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/main.py:150, in BaseModel.__init__(__pydantic_self__, **data)
    148 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    149 __tracebackhide__ = True
--> 150 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 1 validation error for Molecule
coordinates
  Input should be an instance of ndarray [type=is_instance_of, input_value=[[0, 0, 0], [1, 1, 1], [2, 2, 2]], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/is_instance_of

We’re still getting a validation error, but it’s a different one. pydantic is now telling us that the data given to coordinates must be of type ndarray. Remember there are two default levels of validation in pydantic: type validation, then any manually written validators. With arbitrary_types_allowed configured, a type unknown to pydantic is not coerced; pydantic only checks that the value already is the declared type. Effectively, it is a glorified isinstance check.

So to fix this, either the user has to have already cast the data to the expected type, or the developer has to preempt the type validation somehow.
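As a quick sketch of the first option (assuming the Molecule class defined just above; pre_cast is just an illustrative name), pre-converting the coordinates ourselves satisfies the isinstance check and validation proceeds:

# The caller casts the coordinates to an ndarray first, so the isinstance check passes
pre_cast = {**mol_data, "coordinates": np.array(mol_data["coordinates"])}
water = Molecule(**pre_cast)
print(water.coordinates.shape)  # (3, 3)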

Before-Validators in Pydantic#

Good news! You can make pydantic validators that run before the type validation, effectively adding a third layer to the validation stack. These are called “before validators” and run before any other level of validation. The primary use case for these validators is data coercion, which includes casting incoming data to specific types, e.g., casting a list of lists to a NumPy array because we have arbitrary_types_allowed set.

A before-validator is defined exactly like any other field_validator; it just has the keyword mode='before' in its arguments. We’re going to use this validator to take the incoming coordinates data and cast it to a NumPy array.

import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    model_config = ConfigDict(arbitrary_types_allowed = True)
    
    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

Now we can see what happens when we run our model

water = Molecule(**mol_data)
water.coordinates
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2]])

We now have a NumPy array for our coordinates, which means we can refine the original validators. We’ll condense the two coordinates validators down to a single one that checks the array’s shape.

import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    model_config = ConfigDict(arbitrary_types_allowed = True)
    
    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)
water = Molecule(**mol_data)
mangle = {**mol_data, **bad_charge, **bad_coords}
water = Molecule(**mangle)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [100], in <cell line: 2>()
      1 mangle = {**mol_data, **bad_charge, **bad_coords}
----> 2 water = Molecule(**mangle)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/main.py:150, in BaseModel.__init__(__pydantic_self__, **data)
    148 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    149 __tracebackhide__ = True
--> 150 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 2 validation errors for Molecule
charge
  Input should be a valid number [type=float_type, input_value=[1, 0.0], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/float_type
coordinates
  Value error, Coordinates must be of shape [Number Symbols, 3], was (3,) [type=value_error, input_value=['1', '2', '3'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

We’ve now upgraded our Molecule with more advanced data validation leaning into scientific validity, added in custom types which increase our model’s usability, and configured our model to further expand our capabilities. The code is now at the Lesson Materials labeled 05_valid_pydantic_molecule.py.

Next chapter we’ll look at nesting models to allow more complicated data structures.

Below is a supplementary section on how you can define custom, non-native types without arbitrary_types_allowed, giving you greater control over custom or even shorthand types.

Supplemental: Defining Custom Types with Built-In Validators#

In this chapter’s example, we showed how to combine arbitrary_types_allowed in the model config with field_validator(..., mode='before') to convert incoming data to types not understood by pydantic. There are obvious limitations to this, such as having to write a different set of validators for each model, being limited (or at least confined) in how you can permit types through, and having to accept arbitrary types at all.

pydantic provides a separate way to write a validator for a custom class by extending the class in question. This can even be done with existing known types to augment them with special conditions.

Let’s extend the NumPy array type to be something pydantic can validate without needing arbitrary_types_allowed. There are two ways to do this: as an Annotated type where we overload pydantic’s type logic, or as a custom class schema generator. We’ll look at the Annotated method first, which the pydantic docs indicate is more stable from an API standpoint than the custom class schema generator.

import numpy as np

from typing_extensions import Annotated
from pydantic.functional_validators import PlainValidator


def cast_to_np(v):
    try:
        v = np.asarray(v)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


ValidatableArray = Annotated[np.ndarray, PlainValidator(cast_to_np)]

That’s it.

We’ve first taken the Annotated object from the back-ported typing_extensions module which will work with Python 3.7+ (from typing import Annotated works with Python 3.9+ for identical behavior). This object allows you to augment types with additional metadata information for IDEs and other tools such as pydantic without disrupting normal code behavior.

Next, we’ve augmented the np.ndarray type with pydantic’s PlainValidator and passed it a function that overrides all of pydantic’s normal logic when validating np.ndarray. Otherwise pydantic would have attempted to validate against np.ndarray and we’d be back where we started, with the error asking about arbitrary_types_allowed. Instead, we’ve usurped the normal pydantic logic and effectively said, “The validator for this type is the function cast_to_np; send the data there, and if it doesn’t error, we’re good.”

There is FAR more you can do with the Annotated object and pydantic, including defining multiple Before, After, and Wrap validators for any and all class attributes. For instance, there is a BeforeValidator which also takes a function and can be annotated onto any data field to do the same thing as @field_validator(..., mode='before') (a rough sketch follows). However, advanced usage is best left to the pydantic docs.
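As a rough sketch of that idea (the alias name CoercibleArray is ours, and this reuses the cast_to_np function defined above): note that, unlike PlainValidator, a BeforeValidator only runs ahead of pydantic’s own handling of the annotated type, so a model using this alias would still need arbitrary_types_allowed for np.ndarray itself.

from typing_extensions import Annotated
from pydantic.functional_validators import BeforeValidator

# cast_to_np runs before pydantic's own validation of np.ndarray,
# much like @field_validator("coordinates", mode="before") did.
# The declared type is still np.ndarray, so the model would still need
# model_config = ConfigDict(arbitrary_types_allowed=True).
CoercibleArray = Annotated[np.ndarray, BeforeValidator(cast_to_np)]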

Let’s apply this to our Molecule.

This won’t appear in the next chapter

The main Lesson Materials will not have this modification since this is all supplemental. Next chapter will start with the 05_valid_pydantic_molecule.py Lesson Materials.

import numpy as np
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: ValidatableArray
        
    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)
water = Molecule(**mol_data)
mangle = {**mol_data, **bad_charge, **bad_coords}
water = Molecule(**mangle)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Input In [104], in <cell line: 2>()
      1 mangle = {**mol_data, **bad_charge, **bad_coords}
----> 2 water = Molecule(**mangle)

File ~/miniconda3/envs/pyd-tut/lib/python3.10/site-packages/pydantic/main.py:150, in BaseModel.__init__(__pydantic_self__, **data)
    148 # `__tracebackhide__` tells pytest and some other tools to omit this function from tracebacks
    149 __tracebackhide__ = True
--> 150 __pydantic_self__.__pydantic_validator__.validate_python(data, self_instance=__pydantic_self__)

ValidationError: 2 validation errors for Molecule
charge
  Input should be a valid number [type=float_type, input_value=[1, 0.0], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/float_type
coordinates
  Value error, Coordinates must be of shape [Number Symbols, 3], was (3,) [type=value_error, input_value=['1', '2', '3'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

We removed the model_config since we are no longer handling arbitrary types: we’re handling the explicit type we defined. We also removed the mode='before' validator on coordinates because that work was pushed into ValidatableArray. The new Annotated type runs before our custom coords_length_of_symbols field_validator because it operates as part of the type annotation check, which comes before custom validators in the order of operations.

If we wanted to make a custom schema output for our new type, we would need to add another class method called __get_pydantic_core_schema__. However, please refer to the pydantic docs for more details.

Supplemental: Defining Custom NumPy Type AND Setting Data Type (dtype)#

It is possible to also set the NumPy array dtype as part of the type checking without having to define multiple custom types. This is not related to pydantic per se, but it showcases chaining several very advanced Python topics together.

In the previous supplemental, we showed how to write an augmented type with Annotated to define a NumPy ndarray type in pydantic. We cast the input data to a NumPy array with np.asarray. That function also accepts a dtype=... argument specifying the data type of the array. How would you support setting the dtype arbitrarily?

There are several, equally acceptable and perfectly valid, approaches to this.

Multiple Validators#

One option would be to make multiple validators and use the one you need, and there are several ways to do this. The first is to simply make multiple Annotated types.

def cast_to_np_int(v):
    try:
        v = np.asarray(v, dtype=int)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


def cast_to_np_float(v):
    try:
        v = np.asarray(v, dtype=float)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


IntArray = Annotated[np.ndarray, PlainValidator(cast_to_np_int)]
FloatArray = Annotated[np.ndarray, PlainValidator(cast_to_np_float)]
class IntMolecule(Molecule):
    coordinates: IntArray
        
class FloatMolecule(Molecule):
    coordinates: FloatArray
        
print(IntMolecule(**mol_data).coordinates)
print(FloatMolecule(**mol_data).coordinates)
[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

This is a valid approach and can be dropped in when needed; however, it involves code duplication.

We can cut down on the duplication by defining a single function which accepts a keyword argument and using functools.partial to lock the keyword in.

from functools import partial

def cast_to_np(v, dtype=None):
    try:
        v = np.asarray(v, dtype=dtype)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v

IntArray = Annotated[np.ndarray, PlainValidator(partial(cast_to_np, dtype=int))]
FloatArray = Annotated[np.ndarray, PlainValidator(partial(cast_to_np, dtype=float))]
class IntMolecule(Molecule):
    coordinates: IntArray
        
class FloatMolecule(Molecule):
    coordinates: FloatArray
        
print(IntMolecule(**mol_data).coordinates)
print(FloatMolecule(**mol_data).coordinates)
[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

Make an On-Demand Typer Function#

Another option is to make a function that creates types on demand.

def array_typer(dtype):
    def cast_to_np(v):
        try:
            v = np.asarray(v, dtype=dtype)
        except ValueError:
            raise ValueError(f"Could not cast {v} to NumPy Array!")
        return v
    return Annotated[np.ndarray, PlainValidator(cast_to_np)]
class IntMolecule(Molecule):
    coordinates: array_typer(int)
        
class FloatMolecule(Molecule):
    coordinates: array_typer(float)
        
print(IntMolecule(**mol_data).coordinates)
print(FloatMolecule(**mol_data).coordinates)
[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

But this has the problem of regenerating a new Annotated type each time, and the resulting type schema will always have the same signature. This isn’t a problem most of the time, but it can be a little confusing to suddenly see function calls in type annotations instead of the normal types and square brackets.

Custom Core Schema#

We’re going to take a look at the other way pydantic provides for defining a custom type, the one it specifically suggests for NumPy in its own documentation: a custom core schema. We avoided this in the previous blocks of the lesson because this functionality touches the underlying pydantic-core machinery, and while it does have an API (and follows semantic versioning), it’s also the part most likely to change, according to the pydantic docs.

We do want to look at this approach because we are abusing the PlainValidator a bit to overload pydantic’s internal type checking.

We’re going to build this piece by piece, with the understanding that it won’t work fully until we’ve constructed it. Effectively: writing the instructions for pydantic to handle this with mostly native pydantic-core functions.

class ValidatableArray:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        pass

This is the primary work method for the core schema. The __get_pydantic_core_schema__ is the method that pydantic will look for when validating this type of data.

source is the class we are generating a schema for; this will generally be the same as the cls argument if this is a classmethod.

handler is the callback into pydantic’s own schema generation logic. Since we’re writing our own schema generator for something that pydantic does not natively understand, we likely won’t need source or handler at all for most uses. However, we will be taking advantage of source and some other Python typing tools as well.

We’ll fill in everything we want to do in the function itself.

from pydantic_core import core_schema

def cast_to_np(v):
    try:
        v = np.asarray(v)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


class ValidatableArray:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays with the correct dtype
        * `ndarrays` instances will be parsed as `ndarrays` and cast to the correct dtype
        * Serialization will cast the ndarray to list
        """
        schema = core_schema.no_info_plain_validator_function(cast_to_np)
        return schema

We’ve added back in the actual “cast to NumPy” function we used previously, and then we have used a function from the core_schema module of pydantic_core: no_info_plain_validator_function, which generates the same kind of schema that PlainValidator produced before. Finally, we return the schema generated from that function, although we could further manipulate it as needed.

There are also other calls such as a general_plain_validator_function which supports additional info being fed into the function as a secondary argument, or no_info_after_validator_function which would make an AfterValidator, but we’re not going to cover those topics here.

Thus far, all we have done is cast to a NumPy array, which is good! We have done that before with the Annotated method, but this sets us up for much more powerful manipulation later if we want. We also want to make sure the new type actually works in a model.

class ArrMolecule(Molecule):
    coordinates: ValidatableArray

print(ArrMolecule(**mol_data).coordinates)
print(ArrMolecule(**mol_data).coordinates)
[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]

Great! We’ve made a validatable array. Now we’re going to extend this approach so a dtype can be passed in as part of the array type. To do that, we’re going to expand on this with some of Python’s native typing tools.

from typing import Sequence, TypeVar
from pydantic_core import core_schema

dtype = TypeVar("dtype")

def cast_to_np(v):
    try:
        v = np.asarray(v)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


class ValidatableArray(Sequence[dtype]):
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays with the correct dtype
        * `ndarrays` instances will be parsed as `ndarrays` and cast to the correct dtype
        * Serialization will cast the ndarray to list
        """
        schema = core_schema.no_info_plain_validator_function(cast_to_np)
        return schema

We’ve now established a custom Python type variable we are calling dtype, a common term in the NumPy space, but we’re going to focus on the more general case for now and specialize later.

The new dtype object is now recognized as a valid Python type. Even though nothing in the Python space or any of our modules actually uses it, that’s okay! We’re going to use it as a placeholder for accepting an index/argument to the ValidatableArray class.

Speaking of which, ValidatableArray is now a subclass of Sequence from Python’s typing library, parameterized by our placeholder dtype as an index/argument. Although the syntax uses square brackets, [], we’ll refer to these as “arguments” since that is effectively what they are for types. We chose Sequence instead of Generic from typing because, at their core, NumPy arrays are sequences, just very structured and specialized ones. This approach would have worked with Generic too, but we’re opting to be more verbose.

So far, nothing has changed; everything will continue to run exactly as we designed it previously. However, we can now specify an argument to ValidatableArray. Observe:

class ArrMolecule(Molecule):
    coordinates: ValidatableArray[float]

print(ArrMolecule(**mol_data).coordinates)
print(ArrMolecule(**mol_data).coordinates)
[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]

So now let’s change our code to actually do something with that new argument: use it to specify what the dtype of the arrays should be.

from typing import Sequence, TypeVar
from typing_extensions import get_args
from pydantic_core import core_schema

dtype = TypeVar("dtype")

def generate_caster(dtype_input):
    def cast_to_np(v):
        try:
            v = np.asarray(v, dtype=dtype_input)
        except ValueError:
            raise ValueError(f"Could not cast {v} to NumPy Array!")
        return v
    return cast_to_np


class ValidatableArray(Sequence[dtype]):
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays with the correct dtype
        * `ndarrays` instances will be parsed as `ndarrays` and cast to the correct dtype
        * Serialization will cast the ndarray to list
        """
        dtype_arg = get_args(source)[0]
        validator = generate_caster(dtype_arg)
        schema = core_schema.no_info_plain_validator_function(validator)
        return schema
class FloatArrMolecule(Molecule):
    coordinates: ValidatableArray[float]

class IntArrMolecule(Molecule):
    coordinates: ValidatableArray[int]

print(FloatArrMolecule(**mol_data).coordinates)
print(FloatArrMolecule(**mol_data).coordinates)
print("")
print(IntArrMolecule(**mol_data).coordinates)
print(IntArrMolecule(**mol_data).coordinates)
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]

Ta-da! We’ve now used the get_args function from typing_extensions (available natively in typing since Python 3.8) to get the argument we fed into ValidatableArray, established a generator function for dtype-specific casters in generate_caster, and then used all of that information to make our NumPy arrays of a specific type. All of this shows the power of the customization we can do with pydantic. There are less-boilerplate ways to do this in pydantic, but we leave it to you to read the docs to find out more.

Before we move past this, there are a couple of notes to make about this approach:

  • This is a relatively slow process in that the generator will be made for every validation; that could be made faster.

  • As written, you MUST pass an arg to ValidatableArray, but it could be rewritten to avoid that (one possible sketch follows).
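For instance, one possible sketch (ours, not from the lesson materials) of making the argument optional is to fall back to dtype=None when get_args finds no argument; this would be a drop-in replacement for the method in the ValidatableArray class above and reuses generate_caster from that cell.

# One way to make the [dtype] argument optional: fall back to dtype=None
# (NumPy's default) when no argument was supplied to ValidatableArray
@classmethod
def __get_pydantic_core_schema__(cls, source, handler):
    args = get_args(source)
    dtype_arg = args[0] if args else None
    validator = generate_caster(dtype_arg)
    return core_schema.no_info_plain_validator_function(validator)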

We’ve specifically written this example to use generic Python type objects and methods. However, NumPy has its own native typing support as of versions 1.20 and 1.21 that we can use instead of Generic and Sequence, or instead of defining our own arbitrary type with TypeVar, which also keeps IDEs happy. Below is an example of this.

from typing_extensions import get_args, Annotated
from pydantic_core import core_schema
from numpy.typing import NDArray

def generate_caster(dtype):
    def cast_to_np(v):
        try:
            v = np.asarray(v, dtype=dtype)
        except ValueError:
            raise ValueError(f"Could not cast {v} to NumPy Array!")
        return v
    return cast_to_np


class ValidatableArrayAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays with the correct dtype
        * `ndarrays` instances will be parsed as `ndarrays` and cast to the correct dtype
        * Serialization will cast the ndarray to list
        """
        shape, dtype_alias = get_args(source)
        dtype = get_args(dtype_alias)[0]
        validator = generate_caster(dtype)
        schema = core_schema.no_info_plain_validator_function(validator)
        return schema

ValidatableArray = Annotated[NDArray, ValidatableArrayAnnotation]
class FloatArrMolecule(Molecule):
    coordinates: ValidatableArray[float]

class IntArrMolecule(Molecule):
    coordinates: ValidatableArray[int]

print(FloatArrMolecule(**mol_data).coordinates)
print(FloatArrMolecule(**mol_data).coordinates)
print("")
print(IntArrMolecule(**mol_data).coordinates)
print(IntArrMolecule(**mol_data).coordinates)
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]

This approach annotates NDArray with additional information, as per the pydantic docs, which is then passed to ValidatableArrayAnnotation, and it makes use of NumPy’s own type hint format and behavior for NDArray. This also has problems: in the end we are trying to reverse engineer a type hint into a formal dtype for NumPy, which isn’t exactly clear-cut. E.g.:

print(NDArray)
print(NDArray[int])
default_second = get_args(get_args(NDArray)[1])[0]
print(default_second)
print(type(default_second))
numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]]
numpy.ndarray[typing.Any, numpy.dtype[int]]
+ScalarType
<class 'typing.TypeVar'>

It’s very unclear how, if you provide no arguments, to convert the TypeVar (which is what you get: the +ScalarType TypeVar) into None, which would be the default behavior of a dtype=... style argument. Sure, you could hard-code it, but will that always be the case? That’s up to you and beyond what this example hopes to show you.

Metaclasses of the Past#

In an earlier version of this lesson, back in pydantic v1, we talked about Python metaclasses as a way to define a class generator whose properties are set dynamically and then used by the class. BUT…

Metaclasses be Forbidden Magics

“Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don’t (the people who actually need them know with certainty that they need them, and don’t need an explanation about why).”

— Tim Peters, Author of Zen of Python, PEP 20

Metaclasses are usually not something you want to touch, because you don’t need to. The above methods provide a fine way to generate type hints dynamically. However, if you want to be fancy, you can use a Metaclass. The best primer I, Levi Naden, have found on Metaclasses at the time of writing this section (Fall 2022) was through this Stack Overflow answer.

To be honest: you’re probably better off writing a custom core schema as pydantic suggests as above than messing with a Metaclass.

Do what makes sense, and only if you need to#

All of these methods are equally valid, with upsides and downsides alike. Your use case may not even need dtype specification; you can just accept NumPy’s normal casting to an array plus your own custom validator functions to make sure the data look correct. Hopefully, though, this supplemental section has given you ideas to inspire your own code design and shown you some interesting and helpful things you can do.