(ch05)=
# Validating Data Beyond Types

```{admonition} Starting File: <code>04_pydantic_molecule.py</code>
:class: important
This chapter will start from the <code>04_pydantic_molecule.py</code> and end on the <code>05_valid_pydantic_molecule.py</code>.
```

Data validation goes far beyond checking types. *Pydantic* provides the basic tools for validating data types, and it also provides the tools for writing custom validators that check much more.

We'll be covering the *pydantic* `field_validator` decorator and applying it to our data to check structure and scientific rigor. We'll also cover how to validate types not native to Python, such as NumPy arrays.

```{admonition} Check Out Pydantic
:class: note
We will not be covering all the capabilities of *pydantic* here, and we highly encourage you to visit [the pydantic docs](https://pydantic-docs.helpmanual.io/) to learn about all the powerful and easy-to-execute things *pydantic* can do.
```



```{admonition} Compatibility with Python 3.8 and below
:class: note
If you have Python 3.8 or below, you will need to import container type objects such as `List`, `Tuple`, `Dict`, etc. from the `typing` library instead of using the native types `list`, `tuple`, `dict`, etc. This chapter assumes Python 3.9 or greater; however, both approaches work in Python 3.9+, and each `typing` object is a 1:1 replacement for the native type of the corresponding name.
```
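As a minimal sketch of the equivalence described in the note above, the standard-library `typing.get_origin` helper can be used to confirm that a `typing` alias and the native generic refer to the same container type. The variable names are illustrative only, and the second annotation assumes Python 3.9 or greater.

```python
from typing import List, get_origin

# Python 3.8-compatible spelling, imported from `typing`
legacy_coords = List[List[float]]
# Python 3.9+ spelling, using the built-in container directly
native_coords = list[list[float]]

# Both annotations resolve to the same underlying container type
legacy_origin = get_origin(legacy_coords)
native_origin = get_origin(native_coords)
print(legacy_origin is list, native_origin is list)
```

Either spelling can be used as a field annotation in the models that follow.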

## Pydantic's Validator Decorator

Let's start by looking at the state of our code prior to extending the validators. As usual, let's also define our test data.

In [86]:
from pydantic import BaseModel


class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]

    @property
    def num_atoms(self):
        return len(self.symbols)

In [87]:
mol_data = {  # Good data
    "coordinates": [[0, 0, 0], [1, 1, 1], [2, 2, 2]], 
    "symbols": ["H", "H", "O"], 
    "charge": 0.0, 
    "name": "water"
}

bad_name = {"name": 789}  # Name is not str
bad_charge = {"charge": [1, 0.0]}  # Charge is not int or float
noniter_symbols = {"symbols": 1234567890}  # Symbols is an int
nonlist_symbols = {"symbols": '["H", "H", "O"]'}  # Symbols is a string (notably is a string-ified list)
tuple_symbols = {"symbols": ("H", "H", "O")}  # Symbols as a tuple?
bad_coords = {"coordinates": ["1", "2", "3"]}  # Coords is a single list of string
inner_coords_not3d = {"coordinates": [[1, 2, 3], [4, 5]]}
bad_symbols_and_cords = {"symbols": ["H", "H", "O"],
                         "coordinates": [[1, 1, 1], [2.0, 2.0, 2.0]]
                        }  # Coordinates top-level list is not the same length as symbols

You may notice we have extended our "Good Data" here to have `coordinates` actually define the `Nx3` structure where `N = len(symbols)`. This is important for what we plan to validate.

*pydantic* allows you to write custom validators in addition to the type validators that run automatically for a type annotation. The `field_validator` decorator is imported from the `pydantic` module just like `BaseModel`, and is used to decorate a *class method* you write. Let's look at the most basic `field_validator` we can write and assign it to `coordinates`.

```{admonition} Field vs Annotated Validators
:class: note
`pydantic` allows validators to be defined functionally for reuse, ordering, and much more powerful utilization through the `Annotated` class. We will be showing `field_validator` for this example to keep the validator local to the model for learning purposes. Please see [the *pydantic* docs on validators](https://docs.pydantic.dev/latest/usage/validators/) for more info.
```

In [88]:
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        return coords

    @property
    def num_atoms(self):
        return len(self.symbols)

Here we have defined an additional validator which does nothing, but has the basic structure we can look at. For convenience and reference, I've broken the aspects of the `field_validator` into a list.

* The `field_validator` decorator takes as arguments the *exact* names of the attributes you are validating, as strings. In this case, `coordinates`. You can provide several names if you want to reuse one validator for multiple attributes.
* The function name can be whatever you want it to be. We've called it `ensure_coordinates_is_3D` to be meaningful if anyone ever wants to come back and see what this should be doing.
* The function itself is a *class method*, which is why we have included Python's built-in `@classmethod` decorator: the validator is intended to be called on the class itself, not on an instance. The conventional name for the first argument is therefore `cls` and not `self`. You can define validators without the `@classmethod` decorator, but your IDE may complain about this, so we add it to use `cls` without IDE issues, at least on that point.
* The first (non-`cls`) argument of the function can have whatever name you want. The **optional** second argument will be given a *pydantic* metadata object of type `FieldValidationInfo` (deprecated in favor of `ValidationInfo` in later *pydantic* 2.x releases) and can also be named whatever we want. We'll use this metadata object later in the chapter.
* The return value MUST be the validated data to be stored in the attribute. We've done nothing to our variable `coords`, so we simply return it. If you forget the `return` statement, the function returns `None`, and `None` will be accepted as the validated value.
* If the data are not valid, the function must raise either a `ValueError` or an `AssertionError` for *pydantic* to trap the error correctly; any other exception propagates up the Python stack as normal.
* `field_validator` runs *after* type validation, unless specified (see later in this chapter).
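Two of the rules above can be sketched briefly: reusing one validator across several fields, and the pitfall of a missing `return`. The models `Sample` and `Forgetful` and their fields are hypothetical and exist only for this illustration; the sketch assumes *pydantic* v2.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Sample(BaseModel):  # hypothetical model, not part of the lesson files
    label: str
    tag: str

    # One validator reused for two fields by listing both names
    @field_validator("label", "tag")
    @classmethod
    def not_blank(cls, value):
        if not value.strip():
            raise ValueError("must not be blank")
        return value


class Forgetful(BaseModel):  # hypothetical model showing the missing-return pitfall
    label: str

    @field_validator("label")
    @classmethod
    def forgot_return(cls, value):
        value.strip()  # no return statement, so the validator returns None


try:
    Sample(label=" ", tag="ok")
except ValidationError as err:
    n_errors = err.error_count()  # only `label` is blank

lost_value = Forgetful(label="water").label
print(n_errors, lost_value)
```

Under these assumptions, `lost_value` ends up as `None` even though a valid string was passed in, which is why the `return` matters.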

That may seem like lots of rules, but most of them are boilerplate and intuitive. Let's apply these items to our validator. We want to make sure the inner lists of `coordinates` are 3D, or length 3. We don't have to worry about type checking (that was done before any custom `field_validator` was run), so we can simply iterate over the top-level list and check the length of each inner list. Let's apply that now.

In [89]:
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

In [90]:
good_water = Molecule(**mol_data)
mangled = {**mol_data, **inner_coords_not3d}
water = Molecule(**mangled)

ValidationError: 1 validation error for Molecule
coordinates
  Value error, Inner coordinates must be 3D, got [4.0, 5.0] of length 2 [type=value_error, input_value=[[1, 2, 3], [4, 5]], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

Here we have checked that the good data still works and that the mangled data raises an error. Note that because the function raised a `ValueError` (an `AssertionError` would behave the same way), *pydantic* trapped it and reported a `ValidationError`. We can also see that the error message is the string we supplied and that the reported `type` matches the kind of error we raised. This is why it's very important to have meaningful error strings when your custom validator fails.
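The difference between trapped and untrapped exceptions can be sketched with a small hypothetical model. The `Reading` class, its thresholds, and its messages are invented for illustration; the sketch assumes *pydantic* v2.

```python
from pydantic import BaseModel, ValidationError, field_validator


class Reading(BaseModel):  # hypothetical model for illustration only
    value: float

    @field_validator("value")
    @classmethod
    def check_value(cls, value):
        if value > 10:
            # Trapped by pydantic and reported as a ValidationError
            raise AssertionError("value is too large")
        if value < -10:
            # Not trapped: propagates to the caller unchanged
            raise RuntimeError("value is too small")
        return value


try:
    Reading(value=50)
except ValidationError as err:
    trapped_type = err.errors()[0]["type"]

try:
    Reading(value=-50)
except ValidationError:
    escaped_as = "ValidationError"
except RuntimeError:
    escaped_as = "RuntimeError"

print(trapped_type, escaped_as)
```

Only the first failure is wrapped in a `ValidationError`; the `RuntimeError` reaches the caller as an ordinary exception.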

With all that said, our validator function really does look like any other function we may call to do a quick check of data, and then some special addons to make it work with *pydantic*. There is no practical limit to the number of `field_validator`s you have in a given class, so validate to your heart's content.

```{admonition} Python Assignment Expressions "The Walrus Operator" <code>:=</code>
:class: note
Since Python 3.8, there is a new operator for "assignment expressions" called "[The Walrus Operator](https://peps.python.org/pep-0572/)" which allows variables to be assigned inside other expressions. We've used it here to trap the value at time of error and save space. Do not feel compelled to use this yourself, especially if it's not clear what is happening.
```
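For readers who prefer to avoid the walrus operator, the check can be written as an explicit loop. This is a minimal sketch using only the standard library; the sample data and variable names are illustrative.

```python
coords = [[1, 2, 3], [4, 5]]

# Walrus form: `failure` holds the last item tested when any() stops early
walrus_failure = None
if any(len(failure := inner) != 3 for inner in coords):
    walrus_failure = failure

# Equivalent explicit loop without an assignment expression
loop_failure = None
for inner in coords:
    if len(inner) != 3:
        loop_failure = inner
        break

print(walrus_failure, loop_failure)
```

Both forms stop at the first inner list that is not length 3, so either can be used inside the validator.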

<div class="exercise">
<p class="exercise-title">Check your knowledge: Validator Basics</p>
    <p>How would you validate that <code>symbols</code> entries are at most 2 characters? There is more than one correct solution beyond what we show here.</p>

````{admonition} Possible Solution:
:class: dropdown
```python
@field_validator("symbols")
@classmethod
def symbols_are_possible_element_length(cls, symbs):
    if not all(1 <= len(failure := symb) <= 2 for symb in symbs):
        raise ValueError(f"Symbols must be 1 or 2 characters, got {failure}")
    return symbs
```
````
</div>

## Validating against other fields

*pydantic*'s validators can check fields beyond their own, which is helpful for cross-referencing dependent data. In our example, we want to make sure there are exactly as many `coordinates` as there are `symbols` in our `Molecule`. To check against other fields in a `field_validator`, we add the optional second argument for metadata, which we're going to call `info`. For now, we are going to keep our initial validator alongside the new one to show a feature of `field_validator`s, but we could combine them (and will) later.

In [91]:
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: list[list[float]]
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

We've added a second validator to our code called `ensure_coordinates_match_symbols`, and this function also validates `coordinates`. There are two main things we can see from adding this function:

1. Multiple functions can be declared to validate against the same field.
2. We've added the additional optional metadata argument to our new validator: `info`.

The second argument, if it appears in a `field_validator`, provides metadata about the validation currently happening and the validation that has already happened. Adding `info` as an argument tells the `field_validator` to also retrieve *all previously validated fields for the model*. In our case, that would be `name`, `charge`, and `symbols`, as those entries appear before `coordinates` in the list of attributes. Any and all validators that apply to those three entries have already run, and we have access to their validated values through the metadata object, stored in the dictionary at `.data`. [See the *pydantic* docs](https://docs.pydantic.dev/latest/usage/validators/) for more details about the special argument and metadata for `field_validator`.

Let's see this in action.

In [92]:
good_water = Molecule(**mol_data)
mangled = {**mol_data, **bad_symbols_and_cords}
water = Molecule(**mangled)

ValidationError: 1 validation error for Molecule
coordinates
  Value error, There must be an equal number of XYZ coordinates as there are symbols. There are 2 coordinates and 3 symbols. [type=value_error, input_value=[[1, 1, 1], [2.0, 2.0, 2.0]], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error
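Because `info.data` only holds fields that have already been validated successfully, the order in which fields are declared matters. The sketch below records which keys are visible from the middle field of a hypothetical `Ordered` model; the model and the `visible` list are invented for illustration and assume *pydantic* v2.

```python
from pydantic import BaseModel, ValidationError, field_validator

visible = []  # keys of info.data seen on each validation run


class Ordered(BaseModel):  # hypothetical model for illustration only
    first: int
    second: int
    third: int

    @field_validator("second")
    @classmethod
    def record_visible(cls, value, info):
        # Only fields declared (and validated) before `second` appear here
        visible.append(sorted(info.data))
        return value


# All fields valid: `first` is visible, `third` has not been validated yet
Ordered(first=1, second=2, third=3)

# `first` fails its type check, so it is absent from info.data
try:
    Ordered(first="not an int", second=2, third=3)
except ValidationError:
    pass

print(visible)
```

A validator that indexes `info.data["symbols"]` directly would therefore raise a `KeyError` if `symbols` itself failed validation, so a guard such as `"symbols" in info.data` may be worth considering in production code.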

## Non-native Types in Pydantic

Scientific data does not, and often should not, have to be confined to native Python types. One of the most common data types, especially in the sciences, is the NumPy array (`ndarray` class). The most natural place for this would be `coordinates`, where we want to simplify the list-of-lists construct. Let's see what happens when we simply make the type annotation an `ndarray` and see how *pydantic* handles coercion, or how it does not.

In [93]:
import numpy as np
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

PydanticSchemaGenerationError: Unable to generate pydantic-core schema for <class 'numpy.ndarray'>. Set `arbitrary_types_allowed=True` in the model_config to ignore this error or implement `__get_pydantic_core_schema__` on your type to fully support it.

If you got this error by calling handler(<some type>) within `__get_pydantic_core_schema__` then you likely need to call `handler.generate_schema(<some type>)` since we do not call `__get_pydantic_core_schema__` on `<some type>` otherwise to avoid infinite recursion.

For further information visit https://errors.pydantic.dev/2.0.3/u/schema-for-unknown-type

This error was thrown because *pydantic* is coded to handle certain types of data, but it cannot handle types it was not programmed to understand. However, *pydantic* does provide a useful error message to fix this.

You can modify the behavior of your *pydantic* models by adding a class attribute named exactly `model_config` to your `BaseModel` subclass, built with the `ConfigDict` class provided by *pydantic*. The keyword arguments you pass to `ConfigDict` serve as settings for the model they are attached to.

```{admonition} More model_config settings
:class: note
You can see all of the config settings [in the *pydantic* docs](https://docs.pydantic.dev/latest/usage/model_config/).
```

Our particular error says many things, but we are going to focus on the simplest part, where it says we need to configure our model and set `arbitrary_types_allowed`, in this case to `True`. This tells this particular `BaseModel` to permit types that it does not natively understand and to assume the user/programmer will handle them. Let's see what `Molecule` looks like with this set. Note: The location of the `model_config` attribute within the class does not matter, and `model_config` applies on a per-model basis; it is not a global *pydantic* configuration.

```{admonition} Better and more powerful ways to do this with pydantic
:class: note
Pydantic has much more powerful and precise ways to establish custom types than what we show here! Treat this lesson as a rudimentary introduction to custom types and only *some* of the ways to validate them. Please [see the pydantic docs on custom validation](https://docs.pydantic.dev/latest/usage/types/custom/), which include examples of how to handle third-party types such as NumPy or Pandas.
```


In [94]:
import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
    
    model_config = ConfigDict(arbitrary_types_allowed=True)
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

Our model is now configured to allow arbitrary types; no more error. Let's see what happens when we pass in our data.

In [95]:
water = Molecule(**mol_data)

ValidationError: 1 validation error for Molecule
coordinates
  Input should be an instance of ndarray [type=is_instance_of, input_value=[[0, 0, 0], [1, 1, 1], [2, 2, 2]], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/is_instance_of

We're still getting a validation error, but it's a different one. *pydantic* is now telling us that the data given to `coordinates` must be of type `ndarray`. Remember there are two default levels of validation in *pydantic*: type validation, followed by manually written validators. When we have `arbitrary_types_allowed` configured, a type unknown to *pydantic* is not coerced; it is only checked to be an instance of the declared type. Effectively, a glorified `isinstance` check.

So to fix this, either the user has to have already cast the data to the expected type, or the developer has to preempt the type validation somehow.
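The `isinstance`-style behavior is not specific to NumPy. As a minimal sketch, assuming *pydantic* v2, the hypothetical `Lattice` class below stands in for any type *pydantic* does not know about; `Crystal` and the sample data are likewise invented for illustration.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class Lattice:  # hypothetical stand-in for a type unknown to pydantic
    def __init__(self, cells):
        self.cells = cells


class Crystal(BaseModel):  # hypothetical model for illustration only
    model_config = ConfigDict(arbitrary_types_allowed=True)

    lattice: Lattice


# An existing instance passes the isinstance-style check untouched
accepted = Crystal(lattice=Lattice([1, 2, 3]))

# Raw data is not coerced into a Lattice, so validation fails
try:
    Crystal(lattice=[1, 2, 3])
except ValidationError as err:
    error_type = err.errors()[0]["type"]

print(type(accepted.lattice).__name__, error_type)
```

The reported error type matches the `is_instance_of` entry seen in the `ndarray` error above.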

## Before-Validators in Pydantic

Good news! You can make *pydantic* validators that run before the type validation, effectively adding a third layer to the validation stack. These are called "before validators" and will run before any other level of validator. The primary use case for these validators is data coercion, which includes casting incoming data to specific types. For example, casting a list of lists to a NumPy array, which we need because *pydantic* will not coerce the data for us when we rely on `arbitrary_types_allowed`.

A before-validator is defined exactly like any other `field_validator`; it just has the keyword argument `mode='before'`. We're going to use one to take the `coordinates` data in and cast it to a NumPy array.

In [96]:
import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_match_symbols(cls, coords, info):
        n_symbols = len(info.data["symbols"])
        if (n_coords := len(coords)) != n_symbols:  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"There must be an equal number of XYZ coordinates as there are symbols." 
                             f" There are {n_coords} coordinates and {n_symbols} symbols.")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def ensure_coordinates_is_3D(cls, coords):
        if any(len(failure := inner) != 3 for inner in coords):  # Walrus operator (:=) for Python 3.8+
            raise ValueError(f"Inner coordinates must be 3D, got {failure} of length {len(failure)}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

Now we can see what happens when we run our model.

In [97]:
water = Molecule(**mol_data)
water.coordinates

array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2]])
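The ordering of the validation layers described in this section can be checked with a small sketch that records what each validator receives. The `Measurement` model and the `calls` list are hypothetical, and a plain `float` field is used instead of a NumPy array to keep the example short; *pydantic* v2 is assumed.

```python
from pydantic import BaseModel, field_validator

calls = []  # (mode, type name of the value received), in call order


class Measurement(BaseModel):  # hypothetical model for illustration only
    value: float

    @field_validator("value", mode="before")
    @classmethod
    def see_raw_input(cls, value):
        # Runs before type validation, so it sees the raw input
        calls.append(("before", type(value).__name__))
        return value

    @field_validator("value")
    @classmethod
    def see_coerced_value(cls, value):
        # Runs after type validation, so it sees the coerced float
        calls.append(("after", type(value).__name__))
        return value


measurement = Measurement(value="1.5")
print(calls, measurement.value)
```

The before-validator receives the original string, and the default (after) validator receives the value once the type layer has turned it into a `float`.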

We now have a NumPy array for our `coordinates`, so we can refine the original `field_validator`s. We'll condense our two after-validators on `coordinates` down to a single one.

In [98]:
import numpy as np
from pydantic import BaseModel, field_validator, ConfigDict

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: np.ndarray
        
    model_config = ConfigDict(arbitrary_types_allowed=True)
    
    @field_validator("coordinates", mode='before')
    @classmethod
    def coord_to_numpy(cls, coords):
        try:
            coords = np.asarray(coords)
        except ValueError:
            raise ValueError(f"Could not cast {coords} to numpy array")
        return coords
        
    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

In [99]:
water = Molecule(**mol_data)

In [100]:
mangle = {**mol_data, **bad_charge, **bad_coords}
water = Molecule(**mangle)

ValidationError: 2 validation errors for Molecule
charge
  Input should be a valid number [type=float_type, input_value=[1, 0.0], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/float_type
coordinates
  Value error, Coordinates must be of shape [Number Symbols, 3], was (3,) [type=value_error, input_value=['1', '2', '3'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

We've now upgraded our `Molecule` with more advanced data validation leaning into scientific validity, added in custom types which increase our model's usability, and configured our model to further expand our capabilities. The code is now at the Lesson Materials labeled `05_valid_pydantic_molecule.py`.

Next chapter we'll look at nesting models to allow more complicated data structures. 

Below is a supplementary section on how you can define custom, non-native types without `arbitrary_types_allowed`, giving you greater control over defining custom or even shorthand types.

## Supplemental: Defining Custom Types with Built-In Validators

In the example of this chapter, we showed how to combine `arbitrary_types_allowed` in `model_config` with `field_validator(..., mode='before')` to convert incoming data to types not understood by *pydantic*. There are obvious limitations to this, such as having to write a different set of validators for each model, being limited (or at least confined) in how you can permit types through, and having to accept arbitrary types.

*pydantic* provides a separate way to write your custom class validator by extending the class in question. This can be done even to extend existing known types to augment them to special conditions. 

Let's extend a NumPy array type to be something *pydantic* can validate without needing to use `arbitrary_types_allowed`. There are two ways to do this, either as an `Annotated` type where we overload *pydantic*'s type logic, or as a custom class schema generator. We'll look at the `Annotated` method, which the *pydantic* docs indicate is more [stable than the custom class schema generator from an API standpoint](https://docs.pydantic.dev/latest/usage/types/custom/#customizing-validation-with-__get_pydantic_core_schema__).

In [101]:
import numpy as np

from typing_extensions import Annotated
from pydantic.functional_validators import PlainValidator


def cast_to_np(v):
    try:
        v = np.asarray(v)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


ValidatableArray = Annotated[np.ndarray, PlainValidator(cast_to_np)]

That's it. 

We've first taken the `Annotated` object from the back-ported `typing_extensions` module which will work with Python 3.7+ (`from typing import Annotated` works with Python 3.9+ for identical behavior). This object allows you to augment types with additional metadata information for IDEs and other tools such as *pydantic* without disrupting normal code behavior.

Next, we've augmented the `np.ndarray` type with the *pydantic* `PlainValidator` class and passed it a function that replaces *pydantic*'s normal logic when validating the `np.ndarray`. Otherwise, *pydantic* would have attempted to validate against `np.ndarray` and we'd be back where we started, with the error asking about `arbitrary_types_allowed`. Instead, we've usurped the normal *pydantic* logic and effectively said "The validator for this type is the function `cast_to_np`, send the data there, and if it doesn't error, we're good."

There is FAR more you can do with the `Annotated` object and *pydantic*, including defining multiple Before, After, and Wrap validators for any and all class attributes. For instance, there is a `BeforeValidator` that also takes a function as its argument; annotated onto any data field, it does the same thing as `@field_validator(..., mode='before')`. However, advanced usage is best left to the [*pydantic* docs](https://docs.pydantic.dev/latest/usage/validators/).
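As a hedged sketch of the `BeforeValidator` mentioned above, the example below attaches a coercion function to a native `list[str]` type so that *pydantic*'s own type validation still runs afterward. The function `split_on_whitespace`, the `SymbolList` alias, and the `Formula` model are hypothetical names chosen for illustration; Python 3.9+ and *pydantic* v2 are assumed.

```python
from typing import Annotated  # use typing_extensions on older Pythons

from pydantic import BaseModel
from pydantic.functional_validators import BeforeValidator


def split_on_whitespace(value):
    # Coerce "H H O" into ["H", "H", "O"]; pass other inputs through unchanged
    if isinstance(value, str):
        return value.split()
    return value


# The before-validator runs first, then pydantic validates list[str] as usual
SymbolList = Annotated[list[str], BeforeValidator(split_on_whitespace)]


class Formula(BaseModel):  # hypothetical model for illustration only
    symbols: SymbolList


from_string = Formula(symbols="H H O").symbols
from_list = Formula(symbols=["H", "H", "O"]).symbols
print(from_string, from_list)
```

Unlike `PlainValidator`, which replaces the type validation entirely, `BeforeValidator` only preprocesses the input and leaves the normal type check in place.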

Let's apply this to our `Molecule`.

```{admonition} This won't appear in the next chapter
:class: note
The main Lesson Materials will not have this modification since this is all supplemental. Next chapter will start with the <code>05_valid_pydantic_molecule.py</code> Lesson Materials.
```

In [102]:
import numpy as np
from pydantic import BaseModel, field_validator

class Molecule(BaseModel):
    name: str
    charge: float
    symbols: list[str]
    coordinates: ValidatableArray
        
    @field_validator("coordinates")
    @classmethod
    def coords_length_of_symbols(cls, coords, info):
        symbols = info.data["symbols"]
        if (len(coords.shape) != 2) or (len(symbols) != coords.shape[0]) or (coords.shape[1] != 3):
            raise ValueError(f"Coordinates must be of shape [Number Symbols, 3], was {coords.shape}")
        return coords
    
    @property
    def num_atoms(self):
        return len(self.symbols)

In [103]:
water = Molecule(**mol_data)

In [104]:
mangle = {**mol_data, **bad_charge, **bad_coords}
water = Molecule(**mangle)

ValidationError: 2 validation errors for Molecule
charge
  Input should be a valid number [type=float_type, input_value=[1, 0.0], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/float_type
coordinates
  Value error, Coordinates must be of shape [Number Symbols, 3], was (3,) [type=value_error, input_value=['1', '2', '3'], input_type=list]
    For further information visit https://errors.pydantic.dev/2.0.3/v/value_error

We removed the `model_config` since we no longer are handling arbitrary types: we're handling the explicit type we defined. We also removed the `mode='before'` validator on `coordinates` because that work got pushed to the `ValidatableArray`. That new `Annotated` type we wrote already preempts our custom `coords_length_of_symbols` `field_validator` because it operates at the same time as the type annotation check, which comes before custom validators in order of operations.

If we wanted to make a custom schema output for our new type, we would need to add another class method called `__get_pydantic_core_schema__`. However, please refer to the [*pydantic* docs](https://docs.pydantic.dev/latest/usage/types/custom/#customizing-validation-with-__get_pydantic_core_schema__) for more details.

## Supplemental: Defining Custom NumPy Type AND Setting Data Type (*dtype*)

It is also possible to set the NumPy array `dtype` as part of the type checking without having to define multiple custom types. This approach is not related to *pydantic* per se, but is a showcase of chaining several very advanced Python topics together.

In the previous Supplemental, we showed how to write an augmented type with `Annotated` to define a NumPy `ndarray` type in *pydantic*. We cast the input data to a NumPy array with `np.asarray`. That function can also accept a `dtype=...` argument where you can specify the type of data the array will hold. How would you support arbitrarily setting the `dtype`?

There are several equally valid approaches to this.

### Multiple Validators

One option would be to make multiple validators and call the one you need. There are several ways to do this; the first is to simply make multiple annotated types.

In [105]:
def cast_to_np_int(v):
    try:
        v = np.asarray(v, dtype=int)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


def cast_to_np_float(v):
    try:
        v = np.asarray(v, dtype=float)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


IntArray = Annotated[np.ndarray, PlainValidator(cast_to_np_int)]
FloatArray = Annotated[np.ndarray, PlainValidator(cast_to_np_float)]

In [106]:
class IntMolecule(Molecule):
    coordinates: IntArray
        
class FloatMolecule(Molecule):
    coordinates: FloatArray
        
print(IntMolecule(**mol_data).coordinates)
print(FloatMolecule(**mol_data).coordinates)

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]


This is a valid approach that can be dropped in when needed. However, it involves code duplication.

We can cut down on the work by defining a function which accepts a keyword and use `functools.partial` to lock the keyword in.

In [107]:
from functools import partial

def cast_to_np(v, dtype=None):
    try:
        v = np.asarray(v, dtype=dtype)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v

IntArray = Annotated[np.ndarray, PlainValidator(partial(cast_to_np, dtype=int))]
FloatArray = Annotated[np.ndarray, PlainValidator(partial(cast_to_np, dtype=float))]

In [108]:
class IntMolecule(Molecule):
    coordinates: IntArray
        
class FloatMolecule(Molecule):
    coordinates: FloatArray
        
print(IntMolecule(**mol_data).coordinates)
print(FloatMolecule(**mol_data).coordinates)

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]


### Make an on Demand Typer Function

Another option is to write a function that creates types on demand.

In [109]:
def array_typer(dtype):
    def cast_to_np(v):
        try:
            v = np.asarray(v, dtype=dtype)
        except ValueError:
            raise ValueError(f"Could not cast {v} to NumPy Array!")
        return v
    return Annotated[np.ndarray, PlainValidator(cast_to_np)]

In [110]:
class IntMolecule(Molecule):
    coordinates: array_typer(int)
        
class FloatMolecule(Molecule):
    coordinates: array_typer(float)
        
print(IntMolecule(**mol_data).coordinates)
print(FloatMolecule(**mol_data).coordinates)

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]


But this has the problem of having to generate a new `Annotated` type on every call, and its type schema will always have the same signature. This isn't a problem most of the time, but it can be a little confusing to suddenly see function calls in type annotations instead of the normal types and square brackets.
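
One way to soften the regeneration concern, sketched here with only the standard library, is to cache the factory so that repeated requests for the same dtype return the same `Annotated` object. The names `cached_typer` and `cast` are hypothetical, and a plain `list` with an element-wise cast stands in for `np.ndarray` and `PlainValidator`; treat this as an illustration of the caching idea rather than a drop-in replacement.

```python
from functools import lru_cache
from typing import Annotated, get_args


@lru_cache(maxsize=None)
def cached_typer(dtype):
    """Hypothetical cached variant of array_typer (list-based stand-in)."""
    def cast(v):
        # Stand-in for np.asarray(v, dtype=dtype)
        return [dtype(x) for x in v]
    return Annotated[list, cast]


IntList = cached_typer(int)

# The same dtype gives back the very same type object
print(cached_typer(int) is IntList)    # True
print(cached_typer(float) is IntList)  # False

# The caster is stored as the annotation's metadata
caster = get_args(IntList)[1]
print(caster(["1", "2"]))  # [1, 2]
```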

### Custom Core Schema

We're going to take a look at the other way *pydantic* has for defining a custom type, one it specifically suggests for NumPy in its own documentation: [a custom core schema](https://docs.pydantic.dev/latest/usage/types/custom/#customizing-validation-with-__get_pydantic_core_schema__). We avoided this in the previous blocks of the lesson because the *pydantic* docs say this functionality touches the underlying `pydantic-core` functionality. And while `pydantic-core` *does* have an API (and follows semantic versioning), it's also the section most likely to change, according to them.

We do want to look at this approach because we are abusing the `PlainValidator` a bit to overload *pydantic*'s internal type checking.

We're going to build this piece by piece, with the understanding that it won't work fully until we've constructed it. Effectively, we are writing the instructions for *pydantic* to handle this type with mostly native `pydantic-core` functions.

In [111]:
class ValidatableArray:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        pass

This is the primary work method for the core schema. The `__get_pydantic_core_schema__` is the method that *pydantic* will look for when validating this type of data. 

`source` is the class we are generating a schema for; this will generally be the same as the `cls` argument if this is a classmethod. 

`handler` is the call into *pydantic*'s internal core schema generation logic. Since we're writing our own schema generator for something that *pydantic* does not natively understand, we likely won't need `source` or `handler` at all for most uses. However, we will be taking advantage of `source` and some other Python `typing` tools as well.
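
To see what `source` actually holds, here is a small probing sketch. The `Probe` class, the `ProbeModel` model, and the `seen_sources` list are hypothetical names made up for this illustration; the hook simply records what it was given and returns `core_schema.any_schema()`, which accepts any value. It assumes *pydantic* v2 is installed.

```python
from pydantic import BaseModel
from pydantic_core import core_schema

seen_sources = []  # records every `source` the hook receives


class Probe:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        seen_sources.append(source)
        # `handler` is unused here because we build the schema ourselves
        return core_schema.any_schema()


class ProbeModel(BaseModel):
    value: Probe


# The hook has already run: schemas are built when the model class is defined
calls_at_definition = len(seen_sources)
print(seen_sources)

# Validating data reuses the schema instead of calling the hook again
ProbeModel(value=1)
ProbeModel(value="two")
print(len(seen_sources) == calls_at_definition)
```

For an un-subscripted annotation like this one, `source` is the `Probe` class itself, matching the `cls` argument.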

We'll fill in everything we want to do in the function itself.

In [112]:
from pydantic_core import core_schema

def cast_to_np(v):
    try:
        v = np.asarray(v)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


class ValidatableArray:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays (no dtype is enforced yet)
        * `ndarray` instances pass through `np.asarray` and stay ndarrays
        * No custom serialization is defined in this schema
        """
        schema = core_schema.no_info_plain_validator_function(cast_to_np)
        return schema

We've added back in the "cast to NumPy" function we've used previously, and then we have added a function from the `core_schema` module of `pydantic_core`. The `no_info_plain_validator_function` call is what generates the schema behind a `PlainValidator`, as we have seen before. We finally return the schema generated from the function, although we can further manipulate it later as needed.
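
The schema returned by `no_info_plain_validator_function` can also be exercised directly, without a model, which may help when debugging. The sketch below assumes `pydantic_core` is installed and uses a hypothetical `to_tuple` function in place of the NumPy caster so that it does not depend on NumPy.

```python
from pydantic_core import SchemaValidator, ValidationError, core_schema


def to_tuple(v):
    """Hypothetical stand-in for cast_to_np: cast any iterable to a tuple."""
    try:
        return tuple(v)
    except TypeError:
        raise ValueError(f"Could not cast {v!r} to tuple!")


schema = core_schema.no_info_plain_validator_function(to_tuple)

# A core schema is a plain dictionary describing the validation to perform
print(schema["type"])

validator = SchemaValidator(schema)
print(validator.validate_python([1, 2, 3]))  # (1, 2, 3)

# A ValueError raised inside the function surfaces as a ValidationError
try:
    validator.validate_python(5)
except ValidationError as err:
    print(err.error_count())
```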

There are also other calls, such as [`general_plain_validator_function`](https://docs.pydantic.dev/latest/api/pydantic_core_schema/#pydantic_core.core_schema.general_plain_validator_function) (deprecated in newer `pydantic-core` releases in favor of `with_info_plain_validator_function`), which supports additional info being fed into the function as a secondary argument, or [`no_info_after_validator_function`](https://docs.pydantic.dev/latest/api/pydantic_core_schema/#pydantic_core.core_schema.no_info_after_validator_function), which would make an `AfterValidator`, but we're not going to cover those topics here.

Thus far, all we have done is cast to a NumPy array, which is good! We have done that before with the `Annotated` method, but this sets us up for much more powerful manipulation later if we want. We also want to make sure it works, so let's try it on our test data.

In [113]:
class ArrMolecule(Molecule):
    coordinates: ValidatableArray

print(ArrMolecule(**mol_data).coordinates)
print(ArrMolecule(**mol_data).coordinates)

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]


Great! We've made a validatable array. Now we're going to extend this approach so that a dtype can be passed in as part of the type annotation; at the moment there is no way to specify one. To fix that, we're going to expand on this with some of Python's native `typing` tools.

In [114]:
from typing import Sequence, TypeVar
from pydantic_core import core_schema

dtype = TypeVar("dtype")

def cast_to_np(v):
    try:
        v = np.asarray(v)
    except ValueError:
        raise ValueError(f"Could not cast {v} to NumPy Array!")
    return v


class ValidatableArray(Sequence[dtype]):
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays (no dtype is enforced yet)
        * `ndarray` instances pass through `np.asarray` and stay ndarrays
        * No custom serialization is defined in this schema
        """
        schema = core_schema.no_info_plain_validator_function(cast_to_np)
        return schema

We've now established a type variable we are calling `dtype`, which is a common term in NumPy space, but we're going to focus on the more general case for now and specialize later.

The new object `dtype` is a valid placeholder in Python's type system even though nothing in Python itself or any of our modules uses it, and that's okay! We're going to use it as a placeholder for accepting an index/argument to the `ValidatableArray` class.

Speaking of which, `ValidatableArray` is now a subclass of `Sequence` from Python's `typing` library, parameterized with our placeholder `dtype` as an index/argument. Although it uses square brackets, `[]`, we'll refer to these as "arguments" because that is effectively what they are for types. We chose `Sequence` instead of `Generic` from `typing` because, at their core, NumPy arrays are sequences, just very formatted and specialized ones. This approach would have worked with `Generic` too, but we're opting to be more explicit.

So far, nothing has changed: everything will continue to run exactly as we designed it previously. However, we can now specify an argument to `ValidatableArray`. Observe:

In [115]:
class ArrMolecule(Molecule):
    coordinates: ValidatableArray[float]

print(ArrMolecule(**mol_data).coordinates)
print(ArrMolecule(**mol_data).coordinates)

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]


So now let's change our code to actually do something with that new argument in our function to specify what the dtype should be for the arrays.
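
The key ingredient is that a subscripted generic remembers its argument, and `get_args` can read it back. Here is a minimal sketch using only the standard library; `Box` and `T` are hypothetical stand-ins for `ValidatableArray` and `dtype`, and the class is never instantiated, so it does not need to implement the `Sequence` methods.

```python
from typing import Sequence, TypeVar, get_args, get_origin

T = TypeVar("T")


class Box(Sequence[T]):
    """Hypothetical stand-in for ValidatableArray; only used as a type."""


# Subscripting produces an alias that stores the argument
print(get_args(Box[float]))    # (<class 'float'>,)
print(get_args(Box[int])[0])   # <class 'int'>

# The alias also remembers which class was subscripted
print(get_origin(Box[float]) is Box)  # True
```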

In [116]:
from typing import Sequence, TypeVar
from typing_extensions import get_args
from pydantic_core import core_schema

dtype = TypeVar("dtype")

def generate_caster(dtype_input):
    def cast_to_np(v):
        try:
            v = np.asarray(v, dtype=dtype_input)
        except ValueError:
            raise ValueError(f"Could not cast {v} to NumPy Array!")
        return v
    return cast_to_np


class ValidatableArray(Sequence[dtype]):
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays with the requested dtype
        * `ndarray` instances stay ndarrays and are cast to the requested dtype
        * No custom serialization is defined in this schema
        """
        dtype_arg = get_args(source)[0]
        validator = generate_caster(dtype_arg)
        schema = core_schema.no_info_plain_validator_function(validator)
        return schema

In [117]:
class FloatArrMolecule(Molecule):
    coordinates: ValidatableArray[float]

class IntArrMolecule(Molecule):
    coordinates: ValidatableArray[int]

print(FloatArrMolecule(**mol_data).coordinates)
print(FloatArrMolecule(**mol_data).coordinates)
print("")
print(IntArrMolecule(**mol_data).coordinates)
print(IntArrMolecule(**mol_data).coordinates)

[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]


Ta-da! We've now used the `get_args` function from `typing_extensions` (native in `typing` in Python 3.8+) to get the argument we fed into `ValidatableArray`, established a function that generates dtype-specific casters in `generate_caster`, and then used all of that information to make our NumPy arrays of a specific type. All of this shows the power of the customization we can do with *pydantic*. There are ways to do this in *pydantic* with less boilerplate, but we leave it to you to read the docs to find out more.

Before we move past this, there are a couple of notes to make about this approach:
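* A new caster is generated every time *pydantic* builds the schema for an annotated field (at schema construction, not on each validation call), even when the same dtype has been seen before. Caching the casters could make that faster.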
* As written, you MUST pass an arg to `ValidatableArray`, but it could be rewritten to avoid that.
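
For the second note, one possible way to make the argument optional is to fall back to `None` when nothing was subscripted, since `np.asarray(v, dtype=None)` lets NumPy infer the dtype. The helper `dtype_from_source` is a hypothetical name, and the class here is a stripped-down stand-in with the schema method omitted; this is a sketch of the idea only.

```python
from typing import Sequence, TypeVar, get_args

dtype = TypeVar("dtype")


class ValidatableArray(Sequence[dtype]):
    """Stripped-down stand-in; only the subscript handling is shown."""


def dtype_from_source(source):
    """Hypothetical helper: return the subscripted argument, or None if absent."""
    args = get_args(source)
    return args[0] if args else None


print(dtype_from_source(ValidatableArray[int]))  # <class 'int'>
print(dtype_from_source(ValidatableArray))       # None
```

Inside `__get_pydantic_core_schema__`, a call like this could replace the direct `get_args(source)[0]` lookup, which raises an `IndexError` on an empty tuple.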

We've specifically written this example to use generic Python type objects and methods. However, [NumPy does have its own native types as of 1.20 and 1.21](https://numpy.org/doc/stable/reference/typing.html#numpy.typing.NDArray) that we can use instead of `Generic` or `Sequence`, and instead of defining our own arbitrary type with `TypeVar` to make IDEs happy. Below is an example of this.

In [118]:
from typing_extensions import get_args, Annotated
from pydantic_core import core_schema
from numpy.typing import NDArray

def generate_caster(dtype):
    def cast_to_np(v):
        try:
            v = np.asarray(v, dtype=dtype)
        except ValueError:
            raise ValueError(f"Could not cast {v} to NumPy Array!")
        return v
    return cast_to_np


class ValidatableArrayAnnotation:
    @classmethod
    def __get_pydantic_core_schema__(cls, source, handler):
        """
        We return a pydantic_core.CoreSchema that behaves in the following ways:

        * Data will be cast to ndarrays with the requested dtype
        * `ndarray` instances stay ndarrays and are cast to the requested dtype
        * No custom serialization is defined in this schema
        """
        shape, dtype_alias = get_args(source)
        dtype = get_args(dtype_alias)[0]
        validator = generate_caster(dtype)
        schema = core_schema.no_info_plain_validator_function(validator)
        return schema

ValidatableArray = Annotated[NDArray, ValidatableArrayAnnotation]

In [119]:
class FloatArrMolecule(Molecule):
    coordinates: ValidatableArray[float]

class IntArrMolecule(Molecule):
    coordinates: ValidatableArray[int]

print(FloatArrMolecule(**mol_data).coordinates)
print(FloatArrMolecule(**mol_data).coordinates)
print("")
print(IntArrMolecule(**mol_data).coordinates)
print(IntArrMolecule(**mol_data).coordinates)

[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]
[[0. 0. 0.]
 [1. 1. 1.]
 [2. 2. 2.]]

[[0 0 0]
 [1 1 1]
 [2 2 2]]
[[0 0 0]
 [1 1 1]
 [2 2 2]]


This approach annotates `NDArray` with additional information, as per [the *pydantic* docs](https://docs.pydantic.dev/latest/usage/types/custom/#as-an-annotation), which is then passed to `ValidatableArrayAnnotation`, and it makes use of the NumPy type hint format and behavior for `NDArray`. This *also* has problems: in the end we are trying to reverse engineer a type hint into a formal `dtype` for NumPy, which isn't exactly clear-cut. E.g.:

In [120]:
print(NDArray)
print(NDArray[int])
default_second = get_args(get_args(NDArray)[1])[0]
print(default_second)
print(type(default_second))

numpy.ndarray[typing.Any, numpy.dtype[+ScalarType]]
numpy.ndarray[typing.Any, numpy.dtype[int]]
+ScalarType
<class 'typing.TypeVar'>


It's very unclear how, if you provide no arguments, to convert the `TypeVar` you get back (the "+ScalarType" shown above) into `None`, which would be the default behavior of `dtype=...` style arguments. Sure, you could hard code it, but will that always be the case? That's up to you and beyond what this example hopes to show you.
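
If you did decide to hard code it, one possible shape is to treat any unresolved `TypeVar` as "no dtype requested". The sketch below uses only the standard library: `FakeNDArray` is a hypothetical alias built from `dict` and `list` purely to mirror the two-level `ndarray[Any, dtype[ScalarType]]` layout, and `resolve_dtype` is an illustrative name. Whether this assumption holds across NumPy versions is exactly the open question raised above.

```python
from typing import Any, TypeVar, get_args

# Hypothetical stand-ins mirroring numpy.ndarray[Any, numpy.dtype[ScalarType]]
ScalarType = TypeVar("ScalarType")
FakeNDArray = dict[Any, list[ScalarType]]


def resolve_dtype(source):
    """Return the requested dtype, or None when the slot is still a TypeVar."""
    _shape, dtype_alias = get_args(source)
    dtype = get_args(dtype_alias)[0]
    return None if isinstance(dtype, TypeVar) else dtype


print(resolve_dtype(FakeNDArray))       # None
print(resolve_dtype(FakeNDArray[int]))  # <class 'int'>
```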

### Metaclasses of the Past
At one point, back when this lesson was written for *pydantic* v1, we talked about [Python metaclasses](https://docs.python.org/3/reference/datamodel.html#metaclasses) as a way to define a class generator whose properties are set dynamically and are then usable by the class. BUT...

```{admonition} Metaclasses be Forbidden Magics
:class: warning
“Metaclasses are deeper magic than 99% of users should ever worry about. If you wonder whether you need them, you don’t (the people who actually need them know with certainty that they need them, and don’t need an explanation about why).”

— Tim Peters, Author of [Zen of Python, PEP 20](https://peps.python.org/pep-0020/)
```

Metaclasses are usually not something you want to touch, because you don't need to. The above methods provide a fine way to generate type hints dynamically. However, if you want to be fancy, you can use a Metaclass. The best primer I, Levi Naden, have found on Metaclasses at the time of writing this section (Fall 2022) was through [this Stack Overflow answer](https://stackoverflow.com/a/6581949/10364409).

To be honest: you're probably better off writing [a custom core schema as *pydantic* suggests](https://docs.pydantic.dev/latest/usage/types/custom/#customizing-validation-with-__get_pydantic_core_schema__) as above than messing with a Metaclass.

### Do what makes sense, and only if you need to

All of these methods are equally valid, with upsides and downsides alike. Your use case may not even need `dtype` specification, and you can just accept the normal NumPy handling of casting to an array plus your own custom `validator` functions to make sure the data look correct. Hopefully, though, this supplemental section has given you ideas for interesting and helpful things you can do in your own code design.