{ "cells": [ { "cell_type": "markdown", "id": "346c96ae-a4fe-42c3-b3a5-f341d7dabdd1", "metadata": {}, "source": [ "# Data Validation with MDF" ] }, { "cell_type": "markdown", "id": "f4a2b6b8-ce84-4ca4-a64f-03fc4c04fcd0", "metadata": {}, "source": [ "The MDF [PropDefinitions](#property-definitions) section describes properties (slots or variables), along with the data types that consitute valid values for those properties. Using this information, one can validate data that are meant to comply with these conditions.\n", "\n", "The `MDFDataValidator` class uses the [Pydantic](https://docs.pydantic.dev/latest/) data validation library to interpret MDF nodes and properties as Python classes which have attributes whose values are automatically validated. This provides several options for performing data validation against an MDF model. Data to be validated simply needs to be expressed as a Python dict or as JSON. \n", "\n", "Example: Suppose you have defined a node called `sample` in MDF, with properties `sample_type` and `amount`:" ] }, { "cell_type": "code", "execution_count": 1, "id": "a7e5101a-674d-40b4-b436-9e61955bb797", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.11.1'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bento_mdf\n", "from importlib.metadata import version # check package version\n", "\n", "version(\"bento_mdf\")" ] }, { "cell_type": "markdown", "id": "7ba08dc1-deb6-45c0-85af-4b7a42291941", "metadata": {}, "source": [ "```yaml\n", "# sample-model.yml \n", "Handle: test\n", "Nodes:\n", " sample:\n", " Props:\n", " - sample_type\n", " - amount\n", "Relationships:\n", " is_subsample_of:\n", " Mul: many_to_one\n", " Ends:\n", " -\tSrc: sample\n", " Dst: sample\n", " Props: ~\n", "PropDefinitions:\n", " sample_type:\n", " Enum:\n", " - normal\n", " - tumor\n", " amount:\n", " Type:\n", " units:\n", " - mg\n", " value_type: number\n", "```" ] }, { "cell_type": "markdown", "id": "a5e0a49e-124e-40f1-a9f8-bb6ddc245669", "metadata": {}, "source": [ "You can then validate a list of dicts of `sample` data using `validate()` as follows:" ] }, { "cell_type": "code", "execution_count": 11, "id": "7ffc2ac3-c7cf-4401-9119-39bcf991247f", "metadata": {}, "outputs": [], "source": [ "from bento_mdf import MDFReader, MDFDataValidator\n", "mdf = MDFReader(\"./sample-model.yml\")\n", "val = MDFDataValidator(mdf)\n", "result = val.validate('Sample', \n", " [{\"sample_type\": \"normal\", \"amount\": 0.50},\n", " {\"sample_type\": \"tumor\", \"amount\": 1.0},\n", " {\"sample_type\": \"wrong\", \"amount\": \"fred\"}])\n", "assert result is False # at least one record was invalid\n", "assert val.last_validation_errors[2] # the last record has error info\n" ] }, { "cell_type": "markdown", "id": "ae730f52-2de1-4a59-a824-b19ecadce39c", "metadata": {}, "source": [ "## Available validation classes and data fields\n" ] }, { "cell_type": "markdown", "id": "629c5924-1281-47c3-a656-e19c08dc9e2e", "metadata": {}, "source": [ "The first argument of `validate` is a string, the class name, that represents a particular model node. Class names are created by CamelCasing the Node handles that appear in the MDF. Properties for Nodes become data fields within the node validation class. These are snake_case strings given by the MDF property handles.\n", "\n", "Available node class names are found in the MDFDataValidator `node_classes` attribute. Available field (property) names for a node class can be retrieved with the `fields_of()` or `props_of()` method.\n", "\n", "For example, using [test-model.yml](/python/tests/samples/test-model.yml):\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "02caf629-7574-423d-af51-e305b7526f88", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14061.36it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "['Case', 'Diagnosis', 'File', 'Sample']\n", "['sample_type', 'amount']\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "mdf = MDFReader(\"../tests/samples/test-model.yml\")\n", "val = MDFDataValidator(mdf)\n", "print( val.node_classes )\n", "# ['Case', 'Diagnosis', 'File', 'Sample']\n", "print( val.fields_of('Sample'))\n", "# ['sample_type', 'amount']\n" ] }, { "cell_type": "markdown", "id": "51d67434-34c2-448e-a60d-3b1c4c24e01e", "metadata": {}, "source": [ "The second argument to `validate()` is the data to be validated against the given class. It is a dict or a list of dicts, whose keys are names of properties defined in the MDF for the given node, and whose values are actual data values to be validated. If all data records in the list are valid, `validate()` returns True; otherwise, it retuns False.\n" ] }, { "cell_type": "code", "execution_count": 13, "id": "c2c6a73b-be7a-46fc-8bcd-2ee172d8cbdf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Valid!\n" ] } ], "source": [ "if val.validate('Sample', {'sample_type': 'normal', 'amount':1.0}):\n", " print(\"Valid!\")\n", "else:\n", " print(\"Invalid.\")\n" ] }, { "cell_type": "markdown", "id": "6cf0c854-ae77-4af3-b543-d1e915f3684f", "metadata": {}, "source": [ "## The \"Model Class\"" ] }, { "cell_type": "markdown", "id": "d9085b32-c8ca-4545-9ae5-bf63771b87b1", "metadata": {}, "source": [ "An additional validation class is created that aggregates all Node classes. This can be used to validate a dict containing a data record for all model Nodes. The model class is named by appending 'Data' to the model handle. This name is found in `val.model_class`.\n", "\n", "For example, [test-model.yml](/python/tests/samples/test-model.yml) has handle `test` and its model class is named `testData`. An example validation:\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "6980b6b8-c5c1-4646-9232-642d21240385", "metadata": {}, "outputs": [], "source": [ "data = {\n", " \"case\": {\"case_id\": \"CASE-22\"},\n", " \"diagnosis\": {\n", " \"disease\": \"http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102872\",\n", " \"date_of_dx\": \"1965-05-04T00:00:00\",\n", " },\n", " \"file\": {\"file_size\": 150342, \"md5sum\": \"9d4cf66a8472f2f97c4594758a06fbd0\"},\n", " \"sample\": {\"amount\": 4.0, \"sample_type\": \"normal\"},\n", " }\n", "assert val.validate('testData', data)\n" ] }, { "cell_type": "markdown", "id": "e8f5872c-78be-488c-8e58-1ca3164c6eee", "metadata": {}, "source": [ "Note that the Node keys for data are in lower case." ] }, { "cell_type": "markdown", "id": "8b16c549-caa6-49ee-b94b-05c65acff920", "metadata": {}, "source": [ "## Inspecting validation errors" ] }, { "cell_type": "markdown", "id": "64adfeca-ed84-40e9-99a1-089b1ecff38d", "metadata": {}, "source": [ "If `validate()` returns False, the attribute `last_validation_errors` will contain a dict of error lists emitted by Pydantic. The keys of the dict are the indexes in the data list of the records that errored; the values are a list of Pydantic [ValidationError](https://docs.pydantic.dev/latest/api/pydantic_core/#pydantic_core.ValidationError) objects detailing the nature of the errors.\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "a06dc30d-1f22-437a-b9c8-5282603a8f7e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"1\": [\n", " {\n", " \"type\": \"predicate_failed\",\n", " \"loc\": [\n", " \"md5sum\"\n", " ],\n", " \"msg\": \"Predicate Pattern.fullmatch failed\",\n", " \"input\": \"9d4cf66a8472f2f97c4594758a06Fbd0\"\n", " }\n", " ],\n", " \"2\": [\n", " {\n", " \"type\": \"predicate_failed\",\n", " \"loc\": [\n", " \"md5sum\"\n", " ],\n", " \"msg\": \"Predicate Pattern.fullmatch failed\",\n", " \"input\": \"9d4cf66a8472f2f97c4594758a06Fbd0\"\n", " },\n", " {\n", " \"type\": \"int_parsing\",\n", " \"loc\": [\n", " \"file_size\",\n", " \"int\"\n", " ],\n", " \"msg\": \"Input should be a valid integer, unable to parse string as an integer\",\n", " \"input\": \"big\",\n", " \"url\": \"https://errors.pydantic.dev/2.10/v/int_parsing\"\n", " },\n", " {\n", " \"type\": \"int_parsing\",\n", " \"loc\": [\n", " \"file_size\",\n", " \"int\"\n", " ],\n", " \"msg\": \"Input should be a valid integer, unable to parse string as an integer\",\n", " \"input\": \"big\",\n", " \"url\": \"https://errors.pydantic.dev/2.10/v/int_parsing\"\n", " }\n", " ]\n", "}\n" ] } ], "source": [ "import json\n", "data = [\n", " {\"md5sum\": \"9d4cf66a8472f2f97c4594758a06fbd0\",\n", " \"file_name\": \"grelf.txt\",\n", " \"file_size\": 50},\n", " {\"md5sum\": \"9d4cf66a8472f2f97c4594758a06Fbd0\",\n", " \"file_name\": \"grolf.txt\",\n", " \"file_size\": 50.0},\n", " {\"md5sum\": \"9d4cf66a8472f2f97c4594758a06Fbd0\",\n", " \"file_name\": \"grilf.txt\",\n", " \"file_size\": \"big\"}\n", " ]\n", "val.validate('File', data)\n", "print(json.dumps(val.last_validation_errors, indent=4))" ] }, { "cell_type": "markdown", "id": "f499491b-4972-4075-9e2d-d0852dcd37e4", "metadata": {}, "source": [ "## Generated Validation Classes" ] }, { "cell_type": "markdown", "id": "ede27d9e-4fbd-4a5e-8cbd-d2f03f95ecef", "metadata": {}, "source": [ "`MDFDataValidator` generates a Python module containing Pydantic classes (generally known as \"[models](https://docs.pydantic.dev/latest/concepts/models/)\"). The module code is contained in `val.data_model`; it can be printed to a file and used as an independent package in other applications. \n", "\n", "The validator object creates this code using a [Jinja](https://jinja.palletsprojects.com/en/stable/templates/) template and imports it back dynamically with [`importlib`](https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly).\n", "\n", "There is no need to deal directly with this machinery in the simplest case of data validation (above). However, you can take advantage of Pydantic features available to these classes by accessing them using `val.model_of()`.\n" ] }, { "cell_type": "code", "execution_count": 31, "id": "649a9b87-5c43-472f-ba82-cdf3afeeb86c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'sample_type': FieldInfo(annotation=Union[SampleTypeEnum, NoneType], required=True), 'amount': FieldInfo(annotation=Union[Annotated[float, Unit], NoneType], required=True)}\n" ] } ], "source": [ "# instantiate a validated object:\n", "sample1 = val.model_of('Sample')(sample_type=\"normal\", amount=\"1.0\")\n", "# get more detail on field types and validations (Pydantic BaseModel methods)\n", "pydantic_fields = val.model_of('Sample').model_fields\n", "print(pydantic_fields)\n" ] }, { "cell_type": "markdown", "id": "89537422-91eb-43b2-b9f9-d0a973291f6a", "metadata": {}, "source": [ "## JSON Schema Representations" ] }, { "cell_type": "markdown", "id": "ce1855d9-a76a-4ff9-bb10-9557b762554a", "metadata": {}, "source": [ "Pydantic has extensive JSON Schema generation facilities. For any validation class, a JSON Schema representation can be created that may be used for for data validation across many programming environments and languages, including Python and Javascript. For example, data validation schemas can be stored along side MDF models in their repos, and general tools using JSON Schema can be developed to enable external submitters to validate their data prior to submission.\n", "\n", "JSON Schema for any available validation class can be generated with the `json_schema()` method:\n" ] }, { "cell_type": "code", "execution_count": 32, "id": "b5803ba2-c7d6-4164-b7c4-5fa4ca1e9983", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"$schema\": \"https://json-schema.org/draft/2020-12/schema\",\n", " \"properties\": {\n", " \"disease\": {\n", " \"anyOf\": [\n", " {\n", " \"format\": \"uri\",\n", " \"minLength\": 1,\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"type\": \"null\"\n", " }\n", " ],\n", " \"title\": \"Disease\"\n", " },\n", " \"date_of_dx\": {\n", " \"anyOf\": [\n", " {\n", " \"format\": \"date-time\",\n", " \"type\": \"string\"\n", " },\n", " {\n", " \"type\": \"null\"\n", " }\n", " ],\n", " \"title\": \"Date Of Dx\"\n", " }\n", " },\n", " \"required\": [\n", " \"disease\",\n", " \"date_of_dx\"\n", " ],\n", " \"title\": \"Diagnosis\",\n", " \"type\": \"object\"\n", "}\n" ] } ], "source": [ "import json\n", "print(json.dumps(val.json_schema('Diagnosis'), indent=4))\n" ] }, { "cell_type": "markdown", "id": "5b25e2bf-04ce-4e2d-ae29-c4bb38339260", "metadata": {}, "source": [ "In Python, this JSON Schema could be used to validate data as follows:\n" ] }, { "cell_type": "code", "execution_count": 37, "id": "5e89b906-28b9-4a95-b443-1aadd291dbe1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.5 is not valid under any of the given schemas\n", "\n", "Failed validating 'anyOf' in schema['properties']['date_of_dx']:\n", " {'anyOf': [{'format': 'date-time', 'type': 'string'}, {'type': 'null'}],\n", " 'title': 'Date Of Dx'}\n", "\n", "On instance['date_of_dx']:\n", " 1.5\n" ] } ], "source": [ "import jsonschema\n", "from jsonschema import Draft202012Validator\n", "try:\n", " jsonschema.validate(\n", " {\"disease\": \"http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C102872\",\n", " \"date_of_dx\": 1.5},\n", " val.json_schema('Diagnosis'),\n", " format_checker=Draft202012Validator.FORMAT_CHECKER\n", " )\n", "except jsonschema.ValidationError as e:\n", " print(e)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "964b5552-e751-415c-b1c2-59709cf6da60", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "ipykernel" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }