Example Usage
To use bento-mdf
in a project, start by installing the latest version with pip install bento-mdf
and importing it into your project.
import bento_mdf
from pathlib import Path # for file paths
from importlib.metadata import version # check package version
version("bento_mdf")
'0.11.10'
Loading the Model from MDF(s)
The bento-mdf
package provides functionality for loading, validating, and manipulating MDF file content in Python.
The MDFReader
class parses and validates MDF files, creating a bento-meta
Model interface with convenient features, demonstrated below. An MDFReader
is initialized with the relevant MDF file(s), filepath(s), or URL pointing to these.
from bento_mdf import MDFReader
Loading from File(s)
First, we can specify the paths to the MDF files we want to load. Then, we provide these to the MDFReader
class to initialize the model. This loads the content of these files into their corresponding bento-meta
Python object representations, which we can access via the Model
object found at MDFReader.model
.
(Note: if a top-level model Handle
is not present in the MDFs, it needs to be provided to the MDFReader class’s handle
argument.)
import logging
logging.basicConfig(filename='mdf.log')
mdf_dir = Path.cwd().parent / "tests" / "samples"
ctdc_model = mdf_dir / "ctdc_model_file.yaml"
ctdc_props = mdf_dir / "ctdc_model_properties_file.yaml"
mdf_from_file = MDFReader(ctdc_model, ctdc_props, handle="CTDC")
mdf_from_file.model
<bento_meta.model.Model at 0x7f86da8641f0>
Loading from URL(s)
Similarly, we can instantiate an MDFReader from URL(s) pointing to the model file(s):
model_url = "https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model.yml"
props_url = "https://cbiit.github.io/icdc-model-tool/model-desc/icdc-model-props.yml"
mdf = MDFReader(model_url, props_url, handle="ICDC")
mdf.model
<bento_meta.model.Model at 0x7f870814e8f0>
Setting the parameter raise_error
to True
in the MDFReader call will raise a RuntimeError if any MDF issues are found. In any case, all issues found will appear in the log.
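When raise_error is enabled, calling code will usually want to catch the RuntimeError and decide what to do next. The following is a minimal sketch under stated assumptions: load_strict, the logger name, and failing_loader are hypothetical names introduced for illustration, and a stand-in loader is used so the snippet runs without network access. In practice the loader would wrap the MDFReader call shown above.

```python
import logging

logger = logging.getLogger("mdf_example")  # hypothetical logger name


def load_strict(loader):
    """Call `loader()` and return its result, or None if it raises RuntimeError.

    `loader` is any zero-argument callable, for example:
    lambda: MDFReader(model_url, props_url, handle="ICDC", raise_error=True)
    """
    try:
        return loader()
    except RuntimeError as err:
        # Record the failure and let the caller decide how to proceed
        logger.error("MDF load failed: %s", err)
        return None


def failing_loader():
    # Stand-in that mimics a reader that found MDF issues
    raise RuntimeError("MDF issues found")


print(load_strict(failing_loader))   # None
print(load_strict(lambda: "model"))  # model
```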
Exploring the Model
Once we’ve loaded the model, we can start looking at the entities that make it up, including Nodes, Relationships, Properties, and Terms. These are conveniently stored in the bento-meta Model
object.
Note: This example will use the model created in the previous section from a URL.
Nodes
Model nodes are stored as dictionaries in Model.nodes
, where the keys are node handles and the values are bento-meta Node
objects.
nodes = mdf.model.nodes
len(nodes)
33
list(nodes.keys())[:3]
['program', 'study', 'study_site']
list(nodes.values())[:3]
[<bento_meta.objects.Node at 0x7f870814dc00>,
<bento_meta.objects.Node at 0x7f86da72ee00>,
<bento_meta.objects.Node at 0x7f86da72c100>]
nodes["study"]
<bento_meta.objects.Node at 0x7f86da72ee00>
The get_attr_dict()
method is a convenient way to get a dictionary of a bento-meta Entity's
set attributes. This will return string versions of the attributes. This can be useful for exploring the entity or for providing parameters to Neo4j Cypher queries.
Note: this only includes simple attributes and not other bento-meta Entities or collections of Entities. All attributes can be accessed via methods matching their names.
nodes["diagnosis"].get_attr_dict()
{'handle': 'diagnosis',
'model': 'ICDC',
'desc': 'The Diagnosis node contains numerous properties which fully characterize the type of cancer with which any given patient/subject/donor was diagnosed, inclusive of stage. This node also contains properties pertaining to comorbidities, and the availability of pathology reports, treatment data and follow-up data.'}
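Because get_attr_dict() returns plain strings, its output can be passed directly as query parameters. The sketch below is one possible way to build a parameterized Cypher statement from such a dictionary. The helper name cypher_merge_node, the label "node", and the shape of the query are assumptions made for illustration; they are not part of bento-mdf or bento-meta.

```python
def cypher_merge_node(label, attrs):
    """Build a parameterized Cypher MERGE statement and its parameter dict.

    `label` and the query shape are illustrative assumptions; only the
    attribute dict mirrors what get_attr_dict() returns.
    """
    # Reference each attribute as a $parameter instead of inlining its value
    assignments = ", ".join(f"{key}: ${key}" for key in attrs)
    query = f"MERGE (n:{label} {{{assignments}}}) RETURN n"
    return query, dict(attrs)


# Subset of the attribute dict shown above
attrs = {"handle": "diagnosis", "model": "ICDC"}
query, params = cypher_merge_node("node", attrs)
print(query)
# MERGE (n:node {handle: $handle, model: $model}) RETURN n
```

Keeping values in the parameter dict rather than in the query string avoids quoting problems with long descriptions such as the desc attribute.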
Relationships
Similarly, Model relationships are stored in Model.edges
. This is a dictionary where the keys are (edge.handle, src.handle, dst.handle) tuples. The values are Edge
objects.
edges = mdf.model.edges
len(edges)
49
list(edges.keys())[:3]
[('member_of', 'case', 'cohort'),
('member_of', 'cohort', 'study_arm'),
('member_of', 'study_arm', 'study')]
list(edges.values())[:3]
[<bento_meta.objects.Edge at 0x7f86da77ab30>,
<bento_meta.objects.Edge at 0x7f86da77ab90>,
<bento_meta.objects.Edge at 0x7f86da77a290>]
edges[("of_case", "diagnosis", "case")].get_attr_dict()
{'handle': 'of_case', 'model': 'ICDC', 'multiplicity': 'many_to_one'}
edge = edges[("of_case", "diagnosis", "case")]
print(edge.handle, edge.src.handle, edge.dst.handle, sep=", ")
# TIP: here's a convenient method to get the 3-tuple of an edge
print(edge.triplet)
of_case, diagnosis, case
('of_case', 'diagnosis', 'case')
An Edge's
src
and dst
attributes are Nodes:
print(edge.src)
print(edge.src.handle)
<bento_meta.objects.Node object at 0x7f86da72e500>
diagnosis
The Model
object also has some useful methods to work with relationships/edges including:
edges_by_src(node)
- get all edges that have a given node as their src attribute
edges_by_dst(node)
- get all edges that have a given node as their dst attribute
edges_by_type(edge_handle)
- get all edges that have a given edge type (i.e., handle)
[e.triplet for e in mdf.model.edges_by_dst(mdf.model.nodes["case"])]
[('of_case', 'enrollment', 'case'),
('of_case', 'demographic', 'case'),
('of_case', 'diagnosis', 'case'),
('of_case', 'cycle', 'case'),
('of_case', 'follow_up', 'case'),
('of_case', 'sample', 'case'),
('of_case', 'file', 'case'),
('of_case', 'visit', 'case'),
('of_case', 'adverse_event', 'case'),
('of_case', 'registration', 'case')]
[e.triplet for e in mdf.model.edges_by_type("of_study")]
[('of_study', 'study_site', 'study'),
('of_study', 'principal_investigator', 'study'),
('of_study', 'file', 'study'),
('of_study', 'image_collection', 'study'),
('of_study', 'publication', 'study')]
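Since edge keys are (edge.handle, src.handle, dst.handle) tuples, the same kind of filtering can be reasoned about with ordinary tuple indexing. The sketch below works on a plain list of triplets copied from the outputs above rather than on a real Model, and the helper names triplets_by_dst and triplets_by_type are hypothetical. It illustrates the idea only and makes no claim about how the Model methods are implemented.

```python
# Stand-in for a few keys of Model.edges, copied from the outputs above
edge_keys = [
    ("member_of", "case", "cohort"),
    ("of_case", "diagnosis", "case"),
    ("of_case", "sample", "case"),
    ("of_study", "file", "study"),
]


def triplets_by_dst(keys, dst_handle):
    """Return the triplets whose destination handle (index 2) matches."""
    return [key for key in keys if key[2] == dst_handle]


def triplets_by_type(keys, edge_handle):
    """Return the triplets whose edge handle (index 0) matches."""
    return [key for key in keys if key[0] == edge_handle]


print(triplets_by_dst(edge_keys, "case"))
# [('of_case', 'diagnosis', 'case'), ('of_case', 'sample', 'case')]
print(triplets_by_type(edge_keys, "of_study"))
# [('of_study', 'file', 'study')]
```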
Properties
Model properties are stored in Model.props
. This is a dictionary where the keys are ({edge|node}.handle, prop.handle) tuples. The values are Property
objects.
props = mdf.model.props
len(props)
240
list(props.keys())[:3]
[('program', 'program_name'),
('program', 'program_acronym'),
('program', 'program_short_description')]
list(props.values())[:3]
[<bento_meta.objects.Property at 0x7f86da7a97b0>,
<bento_meta.objects.Property at 0x7f86da7a9480>,
<bento_meta.objects.Property at 0x7f86da7a9720>]
primary_disease_site = props[("diagnosis", "primary_disease_site")]
primary_disease_site.get_attr_dict()
{'handle': 'primary_disease_site',
'model': 'ICDC',
'value_domain': 'value_set',
'is_required': 'Yes',
'is_key': 'False',
'is_nullable': 'False',
'is_strict': 'True',
'desc': 'The anatomical location at which the primary disease originated, recorded in relatively general terms at the subject level; the anatomical locations from which tumor samples subject to downstream analysis were acquired is recorded in more detailed terms at the sample level.'}
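The (parent.handle, prop.handle) key format makes it straightforward to regroup properties by their parent. As a rough sketch, using a plain list of keys copied from the outputs above in place of Model.props, and a hypothetical helper named props_by_parent:

```python
from collections import defaultdict

# Stand-in for a few keys of Model.props, copied from the outputs above
prop_keys = [
    ("program", "program_name"),
    ("program", "program_acronym"),
    ("diagnosis", "primary_disease_site"),
]


def props_by_parent(keys):
    """Group property handles under the handle of their parent node or edge."""
    grouped = defaultdict(list)
    for parent_handle, prop_handle in keys:
        grouped[parent_handle].append(prop_handle)
    return dict(grouped)


print(props_by_parent(prop_keys))
# {'program': ['program_name', 'program_acronym'], 'diagnosis': ['primary_disease_site']}
```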
Properties with Value Sets
Properties with the value_domain “value_set” have the value_set
attribute (bento-meta ValueSet
), which has a terms
attribute (bento-meta Term
dictionary like {term.value: Term}
).
primary_disease_site.value_set
<bento_meta.objects.ValueSet at 0x7f86da603ee0>
primary_disease_site.value_set.terms
{'Bladder': <bento_meta.objects.Term object at 0x7f86da603e80>, 'Bladder, Prostate': <bento_meta.objects.Term object at 0x7f86da603f70>, 'Bladder, Urethra': <bento_meta.objects.Term object at 0x7f86da603df0>, 'Bladder, Urethra, Prostate': <bento_meta.objects.Term object at 0x7f86da6100d0>, 'Bladder, Urethra, Vagina': <bento_meta.objects.Term object at 0x7f86da6101f0>, 'Bone': <bento_meta.objects.Term object at 0x7f86da610160>, 'Bone (Appendicular)': <bento_meta.objects.Term object at 0x7f86da610340>, 'Bone (Axial)': <bento_meta.objects.Term object at 0x7f86da611300>, 'Bone Marrow': <bento_meta.objects.Term object at 0x7f86da6104f0>, 'Brain': <bento_meta.objects.Term object at 0x7f86da610580>, 'Carpus': <bento_meta.objects.Term object at 0x7f86da610100>, 'Chest Wall': <bento_meta.objects.Term object at 0x7f86da610460>, 'Distal Urethra': <bento_meta.objects.Term object at 0x7f86da6106d0>, 'Kidney': <bento_meta.objects.Term object at 0x7f86da610760>, 'Lung': <bento_meta.objects.Term object at 0x7f86da610640>, 'Lymph Node': <bento_meta.objects.Term object at 0x7f86da6110f0>, 'Mammary Gland': <bento_meta.objects.Term object at 0x7f86da610400>, 'Mouth': <bento_meta.objects.Term object at 0x7f86da6108e0>, 'Not Applicable': <bento_meta.objects.Term object at 0x7f86da6109d0>, 'Pleural Cavity': <bento_meta.objects.Term object at 0x7f86da610a90>, 'Shoulder': <bento_meta.objects.Term object at 0x7f86da610ac0>, 'Skin': <bento_meta.objects.Term object at 0x7f86da610730>, 'Spleen': <bento_meta.objects.Term object at 0x7f86da610be0>, 'Subcutis': <bento_meta.objects.Term object at 0x7f86da610cd0>, 'Thyroid Gland': <bento_meta.objects.Term object at 0x7f86da610d00>, 'Unknown': <bento_meta.objects.Term object at 0x7f86da610bb0>, 'Urethra, Prostate': <bento_meta.objects.Term object at 0x7f86da610e20>, 'Urinary Tract': <bento_meta.objects.Term object at 0x7f86da610f10>, 'Urogenital Tract': <bento_meta.objects.Term object at 0x7f86da610f40>}
Property
objects with value sets have some useful attributes to get to those terms and their values including:
.terms
returns a dictionary of Term
objects from the property’s value set, keyed by term value
.values
returns a list of the term values from the property’s value set
print(primary_disease_site.terms)
# TIP: this is the same object found at the ValueSet's `terms` attribute
print(primary_disease_site.terms is primary_disease_site.value_set.terms)
{'Bladder': <bento_meta.objects.Term object at 0x7f86da603e80>, 'Bladder, Prostate': <bento_meta.objects.Term object at 0x7f86da603f70>, 'Bladder, Urethra': <bento_meta.objects.Term object at 0x7f86da603df0>, 'Bladder, Urethra, Prostate': <bento_meta.objects.Term object at 0x7f86da6100d0>, 'Bladder, Urethra, Vagina': <bento_meta.objects.Term object at 0x7f86da6101f0>, 'Bone': <bento_meta.objects.Term object at 0x7f86da610160>, 'Bone (Appendicular)': <bento_meta.objects.Term object at 0x7f86da610340>, 'Bone (Axial)': <bento_meta.objects.Term object at 0x7f86da611300>, 'Bone Marrow': <bento_meta.objects.Term object at 0x7f86da6104f0>, 'Brain': <bento_meta.objects.Term object at 0x7f86da610580>, 'Carpus': <bento_meta.objects.Term object at 0x7f86da610100>, 'Chest Wall': <bento_meta.objects.Term object at 0x7f86da610460>, 'Distal Urethra': <bento_meta.objects.Term object at 0x7f86da6106d0>, 'Kidney': <bento_meta.objects.Term object at 0x7f86da610760>, 'Lung': <bento_meta.objects.Term object at 0x7f86da610640>, 'Lymph Node': <bento_meta.objects.Term object at 0x7f86da6110f0>, 'Mammary Gland': <bento_meta.objects.Term object at 0x7f86da610400>, 'Mouth': <bento_meta.objects.Term object at 0x7f86da6108e0>, 'Not Applicable': <bento_meta.objects.Term object at 0x7f86da6109d0>, 'Pleural Cavity': <bento_meta.objects.Term object at 0x7f86da610a90>, 'Shoulder': <bento_meta.objects.Term object at 0x7f86da610ac0>, 'Skin': <bento_meta.objects.Term object at 0x7f86da610730>, 'Spleen': <bento_meta.objects.Term object at 0x7f86da610be0>, 'Subcutis': <bento_meta.objects.Term object at 0x7f86da610cd0>, 'Thyroid Gland': <bento_meta.objects.Term object at 0x7f86da610d00>, 'Unknown': <bento_meta.objects.Term object at 0x7f86da610bb0>, 'Urethra, Prostate': <bento_meta.objects.Term object at 0x7f86da610e20>, 'Urinary Tract': <bento_meta.objects.Term object at 0x7f86da610f10>, 'Urogenital Tract': <bento_meta.objects.Term object at 0x7f86da610f40>}
True
print(primary_disease_site.values[20])
print(len(primary_disease_site.values))
print(primary_disease_site.values == list(primary_disease_site.terms.keys()))
Shoulder
29
True
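A common use of the list of permissible values is checking incoming data against it. The sketch below assumes a plain list standing in for primary_disease_site.values (only a subset of the 29 values is reproduced) and a hypothetical helper named invalid_values.

```python
# Subset of the permissible values shown above, standing in for
# primary_disease_site.values
permissible = ["Bladder", "Bone", "Shoulder", "Unknown"]


def invalid_values(values, allowed):
    """Return the entries of `values` that are not permissible, in order."""
    allowed_set = set(allowed)  # set membership avoids rescanning the list
    return [value for value in values if value not in allowed_set]


print(invalid_values(["Bone", "Elbow", "Unknown"], permissible))
# ['Elbow']
```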
Properties via Parent
Model properties can also be accessed via their parent node|edge’s props
attribute, which is a dictionary of properties.
diagnosis_props = nodes["diagnosis"].props
len(diagnosis_props)
14
list(diagnosis_props.keys())[:3]
['diagnosis_id', 'disease_term', 'primary_disease_site']
list(diagnosis_props.values())[:3]
[<bento_meta.objects.Property at 0x7f86da602f80>,
<bento_meta.objects.Property at 0x7f86da602e60>,
<bento_meta.objects.Property at 0x7f86da603e50>]
Properties accessed via their parents are the same Property objects found in Model.props
.
diagnosis_props["primary_disease_site"] is props[("diagnosis", "primary_disease_site")]
True
Terms
Model terms are stored in Model.terms
as a dictionary of Term
objects. Terms are used to relate string descriptors in the model, such as permissible values in a property’s value set, or semantic concepts from other frameworks that can describe an entity in the model via annotation (e.g. a caDSR Common Data Element/CDE annotating a model property).
The keys in Model.terms
are (term.handle, term.origin) tuples and the values are bento-meta
Term
objects.
terms = mdf.model.terms
len(terms)
538
list(terms.keys())[:3]
[('Unrestricted', 'ICDC'), ('Pending', 'ICDC'), ('Under Embargo', 'ICDC')]
list(terms.values())[:3]
[<bento_meta.objects.Term at 0x7f86da7aa650>,
<bento_meta.objects.Term at 0x7f86da7aa7d0>,
<bento_meta.objects.Term at 0x7f86da7a9bd0>]
shoulder = terms[("Shoulder", "ICDC")]
shoulder.get_attr_dict()
{'handle': 'Shoulder', 'value': 'Shoulder', 'origin_name': 'ICDC'}
Terms via ValueSet
Terms that are part of a value set can be accessed via the owner of that value set as well. This is the same object found in Model.terms.
primary_disease_site.terms["Shoulder"] is shoulder
True
Term Annotations
Terms are also used to annotate model entities with semantic representations from some other framework. For example, a Term from caDSR may be used to annotate a model property with a semantically equivalent CDE. In the MDF
, these annotations are provided under the Term
key for a given entity.
mdf_dir = Path.cwd().parent / "tests" / "samples"
model_with_terms = mdf_dir / "test-model-with-terms-a.yml"
# Tip: the model 'Handle' key is in the YAML file, so we don't need to provide one to MDFReader()
terms_mdf = MDFReader(model_with_terms)
terms_mdf.model
<bento_meta.model.Model at 0x7f86da77ad40>
Terms can annotate nodes, relationships, and properties. The annotating term(s) are linked to the annotated entity via a bento-meta Concept
, which stores them in a dictionary of the same format found at Model.terms
(i.e. {(term.value, term.origin_name): Term}
).
case_concept = terms_mdf.model.nodes["case"].concept
case_concept
<bento_meta.objects.Concept at 0x7f86da54d060>
case_concept.terms
{('case_term', 'CTDC'): <bento_meta.objects.Term object at 0x7f86da54d840>, ('subject', 'caDSR'): <bento_meta.objects.Term object at 0x7f86da54d8a0>}
# TIP: to find an annotating CDE, we can look for entries where the origin is 'caDSR'
for term_key, term in case_concept.terms.items():
    if term_key[1] == "caDSR":
        print(term.get_attr_dict())
{'handle': 'subject', 'value': 'subject', 'origin_name': 'caDSR'}
terms_mdf.model.edges[("of_case", "sample", "case")].concept.terms
{('of_case_term', 'CTDC'): <bento_meta.objects.Term object at 0x7f86da54dff0>}
terms_mdf.model.props[("case", "case_id")].concept.terms
{('case_id', 'CTDC'): <bento_meta.objects.Term object at 0x7f86da54e230>}
# TIP: terms found in Model.terms are the same objects as those in an entity's concept
case_id_anno = terms_mdf.model.props[("case", "case_id")].concept.terms[("case_id", "CTDC")]
terms_mdf.model.terms[("case_id", "CTDC")] is case_id_anno
True
Model Diff
bento-mdf
also provides the diff_models
function, which can be used to compare two models and report on the differences between them. This is useful for comparing models that have been updated or modified over time.
diff_models()
has two required arguments, both of which are bento_meta.Model
objects:
mdl_a
: The first model to compare.
mdl_b
: The second model to compare.
The function returns a dict
with keys for nodes, edges, props, and terms, each with a dictionary with keys:
"added"
: found in mdl_b
but not in mdl_a
"removed"
: found in mdl_a
but not in mdl_b
"changed"
: found in both models but with altered attributes
Writing MDF from the Model
Schema-valid MDF may be produced from a bento-meta
Model, using the MDFWriter
class. This can be useful if you wish to make changes to the Model within Python using the update methods of that interface, and then write out the updated model in MDF format for sharing.
Consider a simple data model in MDF format:
# sample-model.yml
Handle: test
Version: 0.01
Nodes:
  sample:
    Props:
      - sample_type
      - amount
Relationships:
  is_subsample_of:
    Mul: many_to_one
    Ends:
      - Src: sample
        Dst: sample
    Props: null
PropDefinitions:
  sample_type:
    Enum:
      - normal
      - tumor
  amount:
    Type:
      units:
        - mg
      value_type: number
Suppose we want to add a property from the ICDC model to this simple model, and write out a new MDF. We add the property to the model, then we can create an MDFWriter instance from the MDFReader instance. Then the mdf
attribute of the writer will contain a dict that can be written as YAML.
import yaml
from bento_mdf import MDFReader, MDFWriter
smodel = MDFReader("./sample-model.yml")
# `mdf` is the ICDC MDFReader created earlier from URLs
new_prop = mdf.model.props[('sample', 'tumor_sample_origin')]
smodel.model.add_prop(smodel.model.nodes['sample'], new_prop)
print(yaml.dump(MDFWriter(smodel).mdf, indent=4))
Handle: test
Nodes:
    sample:
        Props:
        - amount
        - sample_type
        - tumor_sample_origin
PropDefinitions:
    amount:
        Key: false
        Nul: false
        Req: false
        Strict: true
        Type:
            units:
            - mg
            value_type: number
    sample_type:
        Enum:
        - normal
        - tumor
        Key: false
        Nul: false
        Req: false
        Strict: true
    tumor_sample_origin:
        Desc: An indication as to whether a tumor sample was derived from a primary
            versus a metastatic tumor.
        Enum:
        - Primary
        - Metastatic
        - Not Applicable
        - Unknown
        Key: false
        Nul: false
        Req: 'Yes'
        Strict: true
        Tags:
            Labeled: Tumor Sample Origin
Relationships:
    is_subsample_of:
        Ends:
        - Dst: sample
          Props: null
          Src: sample
        Mul: many_to_one
        Props: null
Terms:
    normal:
        Origin: test
        Value: normal
    tumor:
        Origin: test
        Value: tumor
Version: 0.01
Note that the new property tumor_sample_origin
appears in the new MDF.
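The dict held in the writer's mdf attribute is ordinary nested Python data, so it can be inspected before it is dumped to YAML. As an illustration only, the sketch below checks that every property listed under a node has an entry under PropDefinitions. It uses a hand-written dict mirroring part of the output above rather than a real MDFWriter, and undefined_props is a hypothetical helper, not a bento-mdf function.

```python
# Hand-written stand-in mirroring part of the MDF dict printed above
mdf_dict = {
    "Nodes": {
        "sample": {"Props": ["amount", "sample_type", "tumor_sample_origin"]},
    },
    "PropDefinitions": {
        "amount": {"Type": {"units": ["mg"], "value_type": "number"}},
        "sample_type": {"Enum": ["normal", "tumor"]},
        "tumor_sample_origin": {
            "Enum": ["Primary", "Metastatic", "Not Applicable", "Unknown"],
        },
    },
}


def undefined_props(mdf):
    """Return (node_handle, prop_handle) pairs that lack a PropDefinitions entry."""
    defined = set(mdf.get("PropDefinitions") or {})
    missing = []
    for node_handle, node in (mdf.get("Nodes") or {}).items():
        # `Props` may be absent or null in MDF, so fall back to an empty list
        for prop_handle in (node or {}).get("Props") or []:
            if prop_handle not in defined:
                missing.append((node_handle, prop_handle))
    return missing


print(undefined_props(mdf_dict))
# []
```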
Make changes to the underlying model
Validating the Model
As the MDFReader
class loads the model, it automatically validates it against the MDF schema and will raise an exception if the model is invalid. This will use the default schema unless one is provided via the MDFReader
class’s mdf_schema
argument.
bento-mdf
also provides the MDFValidator
class, which can be used to validate a model against the MDF schema directly.
from bento_mdf.validator import MDFValidator
validator = MDFValidator(
    None,
    *[ctdc_model, ctdc_props],
    raise_error=True,
)
validator
<bento_mdf.validator.MDFValidator at 0x7f86da54fc70>
validator.load_and_validate_schema(); # load and check that JSON schema is valid
validator.load_and_validate_yaml().as_dict(); # load and check YAML is valid
validator.validate_instance_with_schema(); # check YAML against the schema
If the schema or yaml instances (from MDF files) are invalid, the validation will fail.
from jsonschema import SchemaError, ValidationError
from yaml.parser import ParserError
from IPython.display import clear_output
Schema is invalid
bad_schema = mdf_dir / "mdf-bad-schema.yaml"
try:
    MDFValidator(bad_schema, raise_error=True).load_and_validate_schema()
except SchemaError as e:
    clear_output()
    print(e)
'crobject' is not valid under any of the given schemas
Failed validating 'anyOf' in metaschema['properties']['properties']['additionalProperties']['properties']['type']:
{'anyOf': [{'$ref': '#/definitions/simpleTypes'},
{'type': 'array',
'items': {'$ref': '#/definitions/simpleTypes'},
'minItems': 1,
'uniqueItems': True}]}
On schema['properties']['UniversalNodeProperties']['type']:
'crobject'
YAML structure is invalid
bad_yaml = mdf_dir / "ctdc_model_bad.yaml"
try:
    MDFValidator(None, bad_yaml, raise_error=True).load_and_validate_yaml()
except ParserError as e:
    clear_output()
    print(e)
while parsing a block mapping
in "/home/runner/work/bento-mdf/bento-mdf/python/tests/samples/ctdc_model_bad.yaml", line 1, column 1
expected <block end>, but found '<block mapping start>'
in "/home/runner/work/bento-mdf/bento-mdf/python/tests/samples/ctdc_model_bad.yaml", line 3, column 3
MDF YAMLs are invalid against the MDF schema
test_schema = mdf_dir / "mdf-schema.yaml"
ctdc_bad = mdf_dir / "ctdc_model_file_invalid.yaml"
try:
    v = MDFValidator(
        test_schema,
        *[ctdc_bad, ctdc_props],
        raise_error=True
    )
    v.load_and_validate_schema()
    v.load_and_validate_yaml()
    v.validate_instance_with_schema()
except ValidationError as e:
    clear_output()
    print(e)
'case.show_node' does not match '^[A-Za-z_][A-Za-z0-9_]*$'
Failed validating 'pattern' in schema['properties']['PropDefinitions']['propertyNames']:
{'$id': '#snake_case_id',
'type': 'string',
'pattern': '^[A-Za-z_][A-Za-z0-9_]*$'}
On instance['PropDefinitions']:
'case.show_node'
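The failure above comes down to a regular expression check on property names. The pattern in the sketch below is copied from the error message; the constant name SNAKE_CASE_ID and the helper is_valid_prop_name are hypothetical and exist only to show why 'case.show_node' is rejected (the dot is not in the allowed character set).

```python
import re

# Pattern copied from the validation error message above
SNAKE_CASE_ID = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def is_valid_prop_name(name):
    """Return True if `name` matches the pattern from the error message."""
    return SNAKE_CASE_ID.match(name) is not None


print(is_valid_prop_name("case.show_node"))  # False
print(is_valid_prop_name("show_node"))       # True
```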
from bento_mdf.diff import diff_models
old_model = mdf_dir / "test-model-d.yml"
new_model = mdf_dir / "test-model-e.yml"
old_mdf = MDFReader(old_model, handle="TEST")
new_mdf = MDFReader(new_model, handle="TEST")
diff_models(mdl_a=old_mdf.model, mdl_b=new_mdf.model)
{'nodes': {'changed': {'diagnosis': {'props': {'removed': {'fatal': <bento_meta.objects.Property at 0x7f86da5a1d50>},
'added': None}}},
'removed': None,
'added': {'outcome': <bento_meta.objects.Node at 0x7f86da54f880>}},
'edges': {'removed': None,
'added': {('end_result',
'diagnosis',
'outcome'): <bento_meta.objects.Edge at 0x7f870814dde0>}},
'props': {'removed': {('diagnosis',
'fatal'): <bento_meta.objects.Property at 0x7f86da5a1d50>},
'added': {('outcome',
'fatal'): <bento_meta.objects.Property at 0x7f86da5f88b0>}}}
diff_models
has two optional arguments:
objects_as_dicts
: if True, the output will convert bento-meta Entity
objects like Node
or Edge
to dictionaries with get_attr_dict()
include_summary
: if True, the output will include a formatted string summary of the differences between the two models. This can be useful for GitHub changelogs when a model is updated, for example.
diff = diff_models(
    old_mdf.model,
    new_mdf.model,
    objects_as_dicts=True,
    include_summary=True,
)
diff["nodes"]["changed"]
{'diagnosis': {'props': {'removed': {'fatal': {'handle': 'fatal',
'model': 'TEST',
'value_domain': 'value_set',
'is_required': 'False',
'is_key': 'False',
'is_nullable': 'False',
'is_strict': 'True'}},
'added': None}}}
print(diff["summary"], sep="\n")
1 node(s) added; 1 edge(s) added; 1 prop(s) removed; 1 prop(s) added; 1 attribute(s) changed for 1 node(s)
- Added node: 'outcome'
- Added edge: 'end_result' with src: 'diagnosis' and dst: 'outcome'
- Removed prop: 'fatal' with parent: 'diagnosis'
- Added prop: 'fatal' with parent: 'outcome'
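The nested diff dictionary can also be post-processed directly, for example to produce counts for a report. The sketch below assumes a hand-written dict with the same shape as the diff_models output shown earlier (entity values replaced by empty dicts) and a hypothetical helper named count_changes; note that empty categories appear as None in that output, so they are treated as empty here.

```python
# Hand-written stand-in with the same shape as the diff output shown earlier
diff_like = {
    "nodes": {
        "changed": {"diagnosis": {}},
        "removed": None,
        "added": {"outcome": {}},
    },
    "edges": {
        "removed": None,
        "added": {("end_result", "diagnosis", "outcome"): {}},
    },
    "props": {
        "removed": {("diagnosis", "fatal"): {}},
        "added": {("outcome", "fatal"): {}},
    },
}


def count_changes(diff):
    """Count entries per entity type and change kind, treating None as empty."""
    counts = {}
    for entity_type, changes in diff.items():
        if not isinstance(changes, dict):
            continue  # skip non-dict entries such as a "summary" string
        counts[entity_type] = {
            kind: len(items or {}) for kind, items in changes.items()
        }
    return counts


print(count_changes(diff_like))
# {'nodes': {'changed': 1, 'removed': 0, 'added': 1}, 'edges': {'removed': 0, 'added': 1}, 'props': {'removed': 1, 'added': 1}}
```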