Data Validation with MDF

The MDF PropDefinitions section describes properties (slots or variables), along with the data types that consitute valid values for those properties. Using this information, one can validate data that are meant to comply with these conditions.

The MDFDataValidator class uses the Pydantic data validation library to interpret MDF nodes and properties as Python classes which have attributes whose values are automatically validated. This provides several options for performing data validation against an MDF model. Data to be validated simply needs to be expressed as a Python dict or as JSON.

Example: Suppose you have defined a node called sample in MDF, with properties sample_type and amount:

import bento_mdf
from importlib.metadata import version # check package version

version("bento_mdf")
'0.11.5'
# sample-model.yml                                                                              
Handle: test
Nodes:
  sample:
    Props:
      - sample_type
      - amount
Relationships:
  is_subsample_of:
    Mul: many_to_one
    Ends:
      -	Src: sample
        Dst: sample
        Props: ~
PropDefinitions:
  sample_type:
    Enum:
      - normal
      - tumor
  amount:
    Type:
      units:
        - mg
      value_type: number

You can then validate a list of dicts of sample data using validate() as follows:

from bento_mdf import MDFReader, MDFDataValidator
mdf = MDFReader("./sample-model.yml")
val = MDFDataValidator(mdf)
result = val.validate('Sample', 
                      [{"sample_type": "normal", "amount": 0.50},
                       {"sample_type": "tumor", "amount": 1.0},
                       {"sample_type": "wrong", "amount": "fred"}])
assert result is False # at least one record was invalid
assert val.last_validation_errors[2] # the last record has error info

Available validation classes and data fields

The first argument of validate is a string, the class name, that represents a particular model node. Class names are created by CamelCasing the Node handles that appear in the MDF. Properties for Nodes become data fields within the node validation class. These are snake_case strings given by the MDF property handles.

Available node class names are found in the MDFDataValidator node_classes attribute. Available field (property) names for a node class can be retrieved with the fields_of() or props_of() method.

For example, using test-model.yml:

mdf = MDFReader("../tests/samples/test-model.yml")
val = MDFDataValidator(mdf)
print( val.node_classes )
# ['Case', 'Diagnosis', 'File', 'Sample']
print( val.fields_of('Sample'))
# ['sample_type', 'amount']
  0%|          | 0/7 [00:00<?, ?it/s]
100%|██████████| 7/7 [00:00<00:00, 11550.01it/s]
['Case', 'Diagnosis', 'File', 'Sample']
['sample_type', 'amount']

The second argument to validate() is the data to be validated against the given class. It is a dict or a list of dicts, whose keys are names of properties defined in the MDF for the given node, and whose values are actual data values to be validated. If all data records in the list are valid, validate() returns True; otherwise, it retuns False.

if val.validate('Sample', {'sample_type': 'normal', 'amount':1.0}):
    print("Valid!")
else:
    print("Invalid.")
Valid!

The “Model Class”

An additional validation class is created that aggregates all Node classes. This can be used to validate a dict containing a data record for all model Nodes. The model class is named by appending ‘Data’ to the model handle. This name is found in val.model_class.

For example, test-model.yml has handle test and its model class is named testData. An example validation:

data = {
    "case": {"case_id": "CASE-22"},
    "diagnosis": {
        "disease": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102872",
        "date_of_dx": "1965-05-04T00:00:00",
    },
    "file": {"file_size": 150342, "md5sum": "9d4cf66a8472f2f97c4594758a06fbd0"},
    "sample": {"amount": 4.0, "sample_type": "normal"},
    }
assert val.validate('testData', data)

Note that the Node keys for data are in lower case.

Inspecting validation errors

If validate() returns False, the attribute last_validation_errors will contain a dict of error lists emitted by Pydantic. The keys of the dict are the indexes in the data list of the records that errored; the values are a list of Pydantic ValidationError objects detailing the nature of the errors.

import json
data = [
    {"md5sum": "9d4cf66a8472f2f97c4594758a06fbd0",
     "file_name": "grelf.txt",
     "file_size": 50},
    {"md5sum": "9d4cf66a8472f2f97c4594758a06Fbd0",
     "file_name": "grolf.txt",
     "file_size": 50.0},
    {"md5sum": "9d4cf66a8472f2f97c4594758a06Fbd0",
     "file_name": "grilf.txt",
     "file_size": "big"}
    ]
val.validate('File', data)
print(json.dumps(val.last_validation_errors, indent=4))
{
    "1": [
        {
            "type": "predicate_failed",
            "loc": [
                "md5sum"
            ],
            "msg": "Predicate Pattern.fullmatch failed",
            "input": "9d4cf66a8472f2f97c4594758a06Fbd0"
        }
    ],
    "2": [
        {
            "type": "predicate_failed",
            "loc": [
                "md5sum"
            ],
            "msg": "Predicate Pattern.fullmatch failed",
            "input": "9d4cf66a8472f2f97c4594758a06Fbd0"
        },
        {
            "type": "int_parsing",
            "loc": [
                "file_size",
                "int"
            ],
            "msg": "Input should be a valid integer, unable to parse string as an integer",
            "input": "big",
            "url": "https://errors.pydantic.dev/2.10/v/int_parsing"
        },
        {
            "type": "int_parsing",
            "loc": [
                "file_size",
                "int"
            ],
            "msg": "Input should be a valid integer, unable to parse string as an integer",
            "input": "big",
            "url": "https://errors.pydantic.dev/2.10/v/int_parsing"
        }
    ]
}

Generated Validation Classes

MDFDataValidator generates a Python module containing Pydantic classes (generally known as “models”). The module code is contained in val.data_model; it can be printed to a file and used as an independent package in other applications.

The validator object creates this code using a Jinja template and imports it back dynamically with importlib.

There is no need to deal directly with this machinery in the simplest case of data validation (above). However, you can take advantage of Pydantic features available to these classes by accessing them using val.model_of().

# instantiate a validated object:
sample1 = val.model_of('Sample')(sample_type="normal", amount="1.0")
# get more detail on field types and validations (Pydantic BaseModel methods)
pydantic_fields = val.model_of('Sample').model_fields
print(pydantic_fields)
{'sample_type': FieldInfo(annotation=Union[SampleTypeEnum, NoneType], required=True), 'amount': FieldInfo(annotation=Union[Annotated[float, Unit], NoneType], required=True)}

JSON Schema Representations

Pydantic has extensive JSON Schema generation facilities. For any validation class, a JSON Schema representation can be created that may be used for for data validation across many programming environments and languages, including Python and Javascript. For example, data validation schemas can be stored along side MDF models in their repos, and general tools using JSON Schema can be developed to enable external submitters to validate their data prior to submission.

JSON Schema for any available validation class can be generated with the json_schema() method:

import json
print(json.dumps(val.json_schema('Diagnosis'), indent=4))
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "properties": {
        "disease": {
            "anyOf": [
                {
                    "format": "uri",
                    "minLength": 1,
                    "type": "string"
                },
                {
                    "type": "null"
                }
            ],
            "title": "Disease"
        },
        "date_of_dx": {
            "anyOf": [
                {
                    "format": "date-time",
                    "type": "string"
                },
                {
                    "type": "null"
                }
            ],
            "title": "Date Of Dx"
        }
    },
    "required": [
        "disease",
        "date_of_dx"
    ],
    "title": "Diagnosis",
    "type": "object"
}

In Python, this JSON Schema could be used to validate data as follows:

import jsonschema
from jsonschema import Draft202012Validator
try:
    jsonschema.validate(
        {"disease": "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102872",
         "date_of_dx": 1.5},
        val.json_schema('Diagnosis'),
        format_checker=Draft202012Validator.FORMAT_CHECKER
    )
except jsonschema.ValidationError as e:
    print(e)
1.5 is not valid under any of the given schemas

Failed validating 'anyOf' in schema['properties']['date_of_dx']:
    {'anyOf': [{'format': 'date-time', 'type': 'string'}, {'type': 'null'}],
     'title': 'Date Of Dx'}

On instance['date_of_dx']:
    1.5