{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Example Usage\n", "\n", "To use `bento-mdf` in a project, start by installing the latest version with `pip install bento-mdf` and importing it into your project." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.11.2'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import bento_mdf\n", "from pathlib import Path # for file paths\n", "from importlib.metadata import version # check package version\n", "\n", "version(\"bento_mdf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the Model from MDF(s)\n", "\n", "The `bento-mdf` package provides functionality for loading, validating, and manipulating MDF file content in Python.\n", "\n", "The `MDFReader` class parses and validates MDF files, creating a [`bento-meta` Model interface](https://cbiit.github.io/bento-meta/the_object_model.html) with convenient features, demonstrated below. An `MDFReader` is initialized with the relevant MDF file(s), filepath(s), or URL(s) pointing to these." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from bento_mdf import MDFReader" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading from File(s)\n", "\n", "First, we can specify the paths to the MDF files we want to load. Then, we provide these to the `MDFReader` class to initialize the model. 
This loads the content of these files into their corresponding `bento-meta` Python object representations, which we can access via the `Model` object found at `MDFReader.model`.\n", "\n", "(Note: if a top-level model `Handle` is not present in the MDFs, it needs to be provided to the MDFReader class's `handle` argument.)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import logging\n", "logging.basicConfig(filename='mdf.log')\n", "mdf_dir = Path.cwd().parent / \"tests\" / \"samples\"\n", "ctdc_model = mdf_dir / \"ctdc_model_file.yaml\"\n", "ctdc_props = mdf_dir / \"ctdc_model_properties_file.yaml\"\n", "\n", "mdf_from_file = MDFReader(ctdc_model, ctdc_props, handle=\"CTDC\")\n", "mdf_from_file.model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading from URL(s)\n", "\n", "Similarly, we can instantiate an MDF from URL(s) pointing to the model file(s):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_url = \"https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model.yml\"\n", "props_url = \"https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model-props.yml\"\n", "\n", "mdf = MDFReader(model_url, props_url, handle=\"ICDC\")\n", "mdf.model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Setting the parameter `raise_error` to `True` in the MDFReader call will raise a RuntimeError if any MDF issues are found. In any case, all issues found will appear in the log." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Model\n", "\n", "Once we've loaded the model, we can start looking at the entities that make it up, including Nodes, Relationships, Properties, and Terms. 
These are conveniently stored in the `bento-meta Model` object. \n", "\n", "Note: This example will use the model created in the previous section from a URL." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Nodes\n", "\n", "Model nodes are stored as dictionaries in `Model.nodes`, where the keys are node handles and the values are `bento-meta Node` objects." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "33" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes = mdf.model.nodes\n", "\n", "len(nodes)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['program', 'study', 'study_site']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(nodes.keys())[:3]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(nodes.values())[:3]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes[\"study\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `get_attr_dict()` method is a convenient way to get a dictionary of a `bento-meta Entity's` set attributes. This will return string versions of the attributes. This can be useful for exploring the entity or for providing parameters to Neo4j Cypher queries.\n", "\n", "Note: this only includes simple attributes and not other bento-meta Entities or collections of Entities. All attributes can be accessed via methods matching their names." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'handle': 'diagnosis',\n", " 'model': 'ICDC',\n", " 'desc': 'The Diagnosis node contains numerous properties which fully characterize the type of cancer with which any given patient/subject/donor was diagnosed, inclusive of stage. This node also contains properties pertaining to comorbidities, and the availability of pathology reports, treatment data and follow-up data.'}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes[\"diagnosis\"].get_attr_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Relationships\n", "\n", "Similarly, Model relationships are stored in `Model.edges`. This is a dictionary where the keys are (edge.handle, src.handle, dst.handle) tuples. The values are `Edge` objects." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "49" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "edges = mdf.model.edges\n", "\n", "len(edges)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('member_of', 'case', 'cohort'),\n", " ('member_of', 'cohort', 'study_arm'),\n", " ('member_of', 'study_arm', 'study')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(edges.keys())[:3]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(edges.values())[:3]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'handle': 'of_case', 'model': 'ICDC', 'multiplicity': 'many_to_one'}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "edges[(\"of_case\", \"diagnosis\", \"case\")].get_attr_dict()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "of_case, diagnosis, case\n", "('of_case', 'diagnosis', 'case')\n" ] } ], "source": [ "edge = edges[(\"of_case\", \"diagnosis\", \"case\")]\n", "print(edge.handle, edge.src.handle, edge.dst.handle, sep=\", \")\n", "\n", "\n", "# TIP: here's a convenient method to get the 3-tuple of an edge\n", "print(edge.triplet)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An `Edge's` `src` and `dst` attributes are `Nodes`" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "diagnosis\n" ] } ], "source": [ "print(edge.src)\n", "\n", "print(edge.src.handle)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Model` object also has some useful methods to work with relationships/edges including:\n", " * `edges_by_src(node)` - get all edges that have a given node as their src attribute\n", " * `edges_by_dst(node)` - get all edges that have a given node as their dst attribute\n", " * `edges_by_type(edge_handle)` - get all edges that have a given edge type (i.e., handle)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('of_case', 'enrollment', 'case'),\n", " ('of_case', 'demographic', 'case'),\n", " ('of_case', 'diagnosis', 'case'),\n", " ('of_case', 'cycle', 'case'),\n", " ('of_case', 'follow_up', 'case'),\n", " ('of_case', 'sample', 'case'),\n", " ('of_case', 'file', 'case'),\n", " ('of_case', 'visit', 'case'),\n", " ('of_case', 'adverse_event', 'case'),\n", " ('of_case', 'registration', 'case')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[e.triplet for e in mdf.model.edges_by_dst(mdf.model.nodes[\"case\"])]" ] }, { "cell_type": "code", 
"execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('of_study', 'study_site', 'study'),\n", " ('of_study', 'principal_investigator', 'study'),\n", " ('of_study', 'file', 'study'),\n", " ('of_study', 'image_collection', 'study'),\n", " ('of_study', 'publication', 'study')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[e.triplet for e in mdf.model.edges_by_type(\"of_study\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Properties\n", "\n", "Model properties are stored in `Model.props`. This is a dictionary where the keys are ({edge|node}.handle, prop.handle) tuples. The values are `Property` objects." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "240" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "props = mdf.model.props\n", "\n", "len(props)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('program', 'program_name'),\n", " ('program', 'program_acronym'),\n", " ('program', 'program_short_description')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(props.keys())[:3]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(props.values())[:3]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'handle': 'primary_disease_site',\n", " 'model': 'ICDC',\n", " 'value_domain': 'value_set',\n", " 'is_required': 'Yes',\n", " 'is_key': 'False',\n", " 'is_nullable': 'False',\n", " 'is_strict': 'True',\n", " 'desc': 'The anatomical location at which the primary disease originated, recorded in relatively general terms at the 
subject level; the anatomical locations from which tumor samples subject to downstream analysis were acquired is recorded in more detailed terms at the sample level.'}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "primary_disease_site = props[(\"diagnosis\", \"primary_disease_site\")]\n", "primary_disease_site.get_attr_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Properties with Value Sets\n", "\n", "Properties with the value_domain \"value_set\" have the `value_set` attribute (`bento-meta ValueSet`), which has a `terms` attribute (`bento-meta Term` dictionary like `{term.value: Term}`)." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "primary_disease_site.value_set" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Bladder': , 'Bladder, Prostate': , 'Bladder, Urethra': , 'Bladder, Urethra, Prostate': , 'Bladder, Urethra, Vagina': , 'Bone': , 'Bone (Appendicular)': , 'Bone (Axial)': , 'Bone Marrow': , 'Brain': , 'Carpus': , 'Chest Wall': , 'Distal Urethra': , 'Kidney': , 'Lung': , 'Lymph Node': , 'Mammary Gland': , 'Mouth': , 'Not Applicable': , 'Pleural Cavity': , 'Shoulder': , 'Skin': , 'Spleen': , 'Subcutis': , 'Thyroid Gland': , 'Unknown': , 'Urethra, Prostate': , 'Urinary Tract': , 'Urogenital Tract': }" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "primary_disease_site.value_set.terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Property` objects with value sets have some useful attributes for getting at those terms and their values, including:\n", " * `.terms` returns a dictionary of `Term` objects from the property's value set, keyed by term value\n", " * `.values` returns a list of the term values from the property's value set" ] }, { 
"cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Bladder': , 'Bladder, Prostate': , 'Bladder, Urethra': , 'Bladder, Urethra, Prostate': , 'Bladder, Urethra, Vagina': , 'Bone': , 'Bone (Appendicular)': , 'Bone (Axial)': , 'Bone Marrow': , 'Brain': , 'Carpus': , 'Chest Wall': , 'Distal Urethra': , 'Kidney': , 'Lung': , 'Lymph Node': , 'Mammary Gland': , 'Mouth': , 'Not Applicable': , 'Pleural Cavity': , 'Shoulder': , 'Skin': , 'Spleen': , 'Subcutis': , 'Thyroid Gland': , 'Unknown': , 'Urethra, Prostate': , 'Urinary Tract': , 'Urogenital Tract': }\n", "True\n" ] } ], "source": [ "print(primary_disease_site.terms)\n", "\n", "# TIP: this is the same object found at the ValueSet's `terms` attribute\n", "print(primary_disease_site.terms is primary_disease_site.value_set.terms)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shoulder\n", "29\n", "True\n" ] } ], "source": [ "print(primary_disease_site.values[20])\n", "\n", "print(len(primary_disease_site.values))\n", "\n", "print(primary_disease_site.values == list(primary_disease_site.terms.keys()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Properties via Parent\n", "\n", "Model properties can also be accessed via their parent node|edge's `props` attribute, which is a dictionary of properties." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "14" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diagnosis_props = nodes[\"diagnosis\"].props\n", "len(diagnosis_props)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['diagnosis_id', 'disease_term', 'primary_disease_site']" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(diagnosis_props.keys())[:3]" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(diagnosis_props.values())[:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Properties accessed via their parents are the same `Property` objects found in `Model.props`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diagnosis_props[\"primary_disease_site\"] is props[(\"diagnosis\", \"primary_disease_site\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Terms\n", "\n", "Terms are used to relate string descriptors in the model, such as permissible values in a property's value set, or semantic concepts from other frameworks that can describe an entity in the model via annotation (e.g. a caDSR Common Data Element/CDE annotating a model property).\n", "\n", "Model terms are stored in `Model.terms`, a dictionary whose keys are (term.handle, term.origin_name) tuples and whose values are `bento-meta` `Term` objects."
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "538" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "terms = mdf.model.terms\n", "\n", "len(terms)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Unrestricted', 'ICDC'), ('Pending', 'ICDC'), ('Under Embargo', 'ICDC')]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(terms.keys())[:3]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(terms.values())[:3]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'handle': 'Shoulder', 'value': 'Shoulder', 'origin_name': 'ICDC'}" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shoulder = terms[(\"Shoulder\", \"ICDC\")]\n", "shoulder.get_attr_dict()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Terms via ValueSet\n", "\n", "Terms that are part of a value set can also be accessed via the owner of that value set. This is the same object found in `Model.terms`." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "primary_disease_site.terms[\"Shoulder\"] is shoulder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Term Annotations\n", "\n", "Terms are also used to annotate model entities with semantic representations from some other framework. For example, a Term from caDSR may be used to annotate a model property with a semantically equivalent CDE. 
In the `MDF`, these annotations are provided under the `Term` key for a given entity. " ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|███████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8665.92it/s]\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mdf_dir = Path.cwd().parent / \"tests\" / \"samples\"\n", "model_with_terms = mdf_dir / \"test-model-with-terms-a.yml\"\n", "# Tip: model 'Handle' key is in the yaml file so we don't need to provide one to MDF()\n", "terms_mdf = MDFReader(model_with_terms)\n", "terms_mdf.model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Terms can annotate nodes, relationships, and properties. The annotating term(s) are linked to the annotated entity via a `bento-meta Concept`, which stores them in a dictionary of the same format found at `Model.terms` (i.e. `{(term.value, term.origin_name): Term}`)." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "case_concept = terms_mdf.model.nodes[\"case\"].concept\n", "case_concept" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('case_term', 'CTDC'): , ('subject', 'caDSR'): }" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "case_concept.terms" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'handle': 'subject', 'value': 'subject', 'origin_name': 'caDSR'}\n" ] } ], "source": [ "# TIP: to find an annotating CDE, we can look for entries where the origin is 'caDSR'\n", "for term_key, term in case_concept.terms.items():\n", "    if term_key[1] == \"caDSR\":\n", "        print(term.get_attr_dict())" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('of_case_term', 'CTDC'): }" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "terms_mdf.model.edges[(\"of_case\", \"sample\", \"case\")].concept.terms" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{('case_id', 'CTDC'): }" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "terms_mdf.model.props[(\"case\", \"case_id\")].concept.terms" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# TIP: terms found in Model.terms are the same objects as those in an entity's concept\n", "case_id_anno = terms_mdf.model.props[(\"case\", \"case_id\")].concept.terms[(\"case_id\", \"CTDC\")]\n", 
"terms_mdf.model.terms[(\"case_id\", \"CTDC\")] is case_id_anno" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tags\n", "\n", "A tags entry can be added to any object in the model. They are used to associated metainformation with an entity for downstream custom processing. Any `bento-meta Entity` except the `Tag` can be tagged with one of these key-value pairs. They are accessible via the `tags` attribute of the entity, where they are stored in a dictionary where the key is the tag's 'key' and the value is a `bento-meta Tag` object." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'Labeled': }" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "icdc_breed_tags = mdf.model.props[(\"demographic\", \"breed\")].tags\n", "icdc_breed_tags" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'key': 'Labeled', 'value': 'Breed'}" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "icdc_breed_tags[\"Labeled\"].get_attr_dict()" ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "## Model Diff\n", "\n", "`bento-mdf` also provides the `diff_models` function, which can be used to compare two models and report on the differences between them. 
This is useful for comparing models that have been updated or modified over time.\n", "\n", "`diff_models()` has two required arguments, both of which are `bento_meta.Model` objects:\n", " * `mdl_a`: The first model to compare.\n", " * `mdl_b`: The second model to compare.\n", "\n", "The function returns a `dict` with keys for nodes, edges, props, and terms, each with a dictionary with keys:\n", " * `\"added\"`: found in `mdl_a` but not in `mdl_b`\n", " * `\"removed\"`: found in `mdl_b` but not in `mdl_a`\n", " * `\"changed\"`: found in both models but with altered attributes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing MDF from the Model " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Schema-valid MDF may be produced from a `bento-meta` Model using the `MDFWriter` class. This can be useful if you wish to make changes to the Model within Python using the [update methods of that interface](https://cbiit.github.io/bento-meta/the_object_model.html#model-as-an-interface), and then write out the updated model in MDF format for sharing." ] }, { "cell_type": "markdown", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "Consider a simple data model in MDF format:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```yaml\n", "# sample-model.yml \n", "Handle: test\n", "Version: 0.01\n", "Nodes:\n", " sample:\n", " Props:\n", " - sample_type\n", " - amount\n", "Relationships:\n", " is_subsample_of:\n", " Mul: many_to_one\n", " Ends:\n", " - Src: sample\n", " Dst: sample\n", " Props: null\n", "PropDefinitions:\n", " sample_type:\n", " Enum:\n", " - normal\n", " - tumor\n", " amount:\n", " Type:\n", " units:\n", " - mg\n", " value_type: number\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we want to add a property from the ICDC model to this simple model, and write out a new MDF. 
We add the property to the model, then we can create an MDFWriter instance from the MDFReader instance. Then the `mdf` attribute of the writer will contain a dict that can be written as YAML. " ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Handle: test\n", "Nodes:\n", " sample:\n", " Props:\n", " - amount\n", " - sample_type\n", " - tumor_sample_origin\n", "PropDefinitions:\n", " amount:\n", " Key: false\n", " Nul: false\n", " Req: false\n", " Strict: true\n", " Type:\n", " units:\n", " - mg\n", " value_type: number\n", " sample_type:\n", " Enum:\n", " - normal\n", " - tumor\n", " Key: false\n", " Nul: false\n", " Req: false\n", " Strict: true\n", " tumor_sample_origin:\n", " Desc: An indication as to whether a tumor sample was derived from a primary\n", " versus a metastatic tumor.\n", " Enum:\n", " - Primary\n", " - Metastatic\n", " - Not Applicable\n", " - Unknown\n", " Key: false\n", " Nul: false\n", " Req: 'Yes'\n", " Strict: true\n", " Tags:\n", " Labeled: Tumor Sample Origin\n", "Relationships:\n", " is_subsample_of:\n", " Ends:\n", " - Dst: sample\n", " Props: null\n", " Src: sample\n", " Mul: many_to_one\n", " Props: null\n", "Terms:\n", " normal:\n", " Origin: test\n", " Value: normal\n", " tumor:\n", " Origin: test\n", " Value: tumor\n", "URI: null\n", "Version: 0.01\n", "\n" ] } ], "source": [ "import yaml\n", "from bento_mdf import MDFReader, MDFWriter\n", "\n", "smodel = MDFReader(\"./sample-model.yml\")\n", "new_prop = mdf.model.props[('sample', 'tumor_sample_origin')]\n", "smodel.model.add_prop( smodel.model.nodes['sample'], new_prop )\n", "print(yaml.dump(MDFWriter(smodel).mdf, indent=4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the new property `tumor_sample_origin` appears in the new MDF." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Make changes to the underlying model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Validating the Model\n", "\n", "As the `MDFReader` class loads the model, it automatically validates it against the MDF schema and will raise an exception if the model is invalid. This will use the [default schema](https://github.com/CBIIT/bento-mdf/blob/main/schema/mdf-schema.yaml) unless one is provided via the `MDFReader` class's `mdf_schema` argument.\n", "\n", "`bento-mdf` also provides the `MDFValidator` class, which can be used to validate a model against the MDF schema directly." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bento_mdf.validator import MDFValidator\n", "\n", "validator = MDFValidator(\n", " None,\n", " *[ctdc_model, ctdc_props],\n", " raise_error=True,\n", ")\n", "validator" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "validator.load_and_validate_schema(); # load and check that JSON schema is valid" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "validator.load_and_validate_yaml().as_dict(); # load and check YAML is valid" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "validator.validate_instance_with_schema(); # check YAML against the schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the schema or yaml instances (from MDF files) are invalid, the validation will fail." 
] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "from jsonschema import SchemaError, ValidationError\n", "from yaml.parser import ParserError\n", "from IPython.display import clear_output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Schema is invalid" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'crobject' is not valid under any of the given schemas\n", "\n", "Failed validating 'anyOf' in metaschema['properties']['properties']['additionalProperties']['properties']['type']:\n", " {'anyOf': [{'$ref': '#/definitions/simpleTypes'},\n", " {'type': 'array',\n", " 'items': {'$ref': '#/definitions/simpleTypes'},\n", " 'minItems': 1,\n", " 'uniqueItems': True}]}\n", "\n", "On schema['properties']['UniversalNodeProperties']['type']:\n", " 'crobject'\n" ] } ], "source": [ "bad_schema = mdf_dir / \"mdf-bad-schema.yaml\"\n", "\n", "try:\n", " MDFValidator(bad_schema, raise_error=True).load_and_validate_schema()\n", "except SchemaError as e:\n", " clear_output()\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### YAML structure is invalid" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "while parsing a block mapping\n", " in \"/Users/jensenma/Code/bento-mdf/python/tests/samples/ctdc_model_bad.yaml\", line 1, column 1\n", "expected , but found ''\n", " in \"/Users/jensenma/Code/bento-mdf/python/tests/samples/ctdc_model_bad.yaml\", line 3, column 3\n" ] } ], "source": [ "bad_yaml = mdf_dir / \"ctdc_model_bad.yaml\"\n", "\n", "try:\n", " MDFValidator(None, bad_yaml, raise_error=True).load_and_validate_yaml()\n", "except ParserError as e:\n", " clear_output()\n", " print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### MDF YAMLs are invalid against the MDF schema" ] }, { "cell_type": "code", 
"execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'case.show_node' does not match '^[A-Za-z_][A-Za-z0-9_]*$'\n", "\n", "Failed validating 'pattern' in schema['properties']['PropDefinitions']['propertyNames']:\n", " {'$id': '#snake_case_id',\n", " 'type': 'string',\n", " 'pattern': '^[A-Za-z_][A-Za-z0-9_]*$'}\n", "\n", "On instance['PropDefinitions']:\n", " 'case.show_node'\n" ] } ], "source": [ "test_schema = mdf_dir / \"mdf-schema.yaml\"\n", "ctdc_bad = mdf_dir / \"ctdc_model_file_invalid.yaml\"\n", "\n", "try:\n", " v = MDFValidator(\n", " test_schema,\n", " *[ctdc_bad, ctdc_props],\n", " raise_error=True\n", " )\n", " v.load_and_validate_schema()\n", " v.load_and_validate_yaml()\n", " v.validate_instance_with_schema()\n", "except ValidationError as e:\n", " clear_output()\n", " print(e)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'nodes': {'changed': {'diagnosis': {'props': {'removed': {'fatal': },\n", " 'added': None}}},\n", " 'removed': None,\n", " 'added': {'outcome': }},\n", " 'edges': {'removed': None,\n", " 'added': {('end_result',\n", " 'diagnosis',\n", " 'outcome'): }},\n", " 'props': {'removed': {('diagnosis',\n", " 'fatal'): },\n", " 'added': {('outcome',\n", " 'fatal'): }}}" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bento_mdf.diff import diff_models\n", "\n", "old_model = mdf_dir / \"test-model-d.yml\"\n", "new_model = mdf_dir / \"test-model-e.yml\"\n", "\n", "old_mdf = MDFReader(old_model, handle=\"TEST\")\n", "new_mdf = MDFReader(new_model, handle=\"TEST\")\n", "\n", "diff_models(mdl_a=old_mdf.model, mdl_b=new_mdf.model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`diff_models` has two optional arguments:\n", " * `objects_as_dicts`: if True, the output will convert `bento-meta Entity` objects like `Node` or `Edge` to dictionaries with 
`get_attr_dict()`\n", " * `include_summary`: if True, the output will include a formatted string summary of the differences between the two models. This can be useful for GitHub changelogs when a model is updated, for example." ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'diagnosis': {'props': {'removed': {'fatal': {'handle': 'fatal',\n", " 'model': 'TEST',\n", " 'value_domain': 'value_set',\n", " 'is_required': 'False',\n", " 'is_key': 'False',\n", " 'is_nullable': 'False',\n", " 'is_strict': 'True'}},\n", " 'added': None}}}" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diff = diff_models(\n", " old_mdf.model,\n", " new_mdf.model,\n", " objects_as_dicts=True, include_summary=True)\n", "\n", "diff[\"nodes\"][\"changed\"]" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 node(s) added; 1 edge(s) added; 1 prop(s) removed; 1 prop(s) added; 1 attribute(s) changed for 1 node(s)\n", "- Added node: 'outcome'\n", "- Added edge: 'end_result' with src: 'diagnosis' and dst: 'outcome'\n", "- Removed prop: 'fatal' with parent: 'diagnosis'\n", "- Added prop: 'fatal' with parent: 'outcome'\n" ] } ], "source": [ "print(diff[\"summary\"], sep=\"\\n\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "ipykernel" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 4 }