The Molecule Data Model
The Bsoft molecular object model aims to implement the most general molecular structure that is still practical. Sequences and atomic coordinates are alternative representations of the same molecule. The molecules studied in structural biology are typically proteins or nucleic acids with associated sequence information. In Bsoft, these descriptors of molecules are grouped into one structure, with utilities for reconciling the alternative descriptions.
The molecule group hierarchy
A molecule is generally defined as a collection of covalently bound atoms. Exceptions involve cross-links such as disulfide bonds, or covalent bonding between what are considered different molecules. To deal with multiple molecules in a data set, the notion of a "molecule group" is defined in Bsoft. This loose concept can be used for all the molecules in a quaternary structure, or for all the molecules in a complete system (which may include solvent).
Simple molecules usually consist of a small number of covalently bound atoms. However, polymeric macromolecules (such as proteins and nucleic acids) require a hierarchical level between molecule and atom. In Bsoft this intermediate level is a "residue", whether it is an amino acid residue in a protein, a nucleotide in a nucleic acid, or a sugar monomer in a polysaccharide. For simple monomeric molecules, the residue is considered to be the whole molecule.
The molecule group hierarchy is therefore, from the outermost level to the innermost:
Molecule group
Molecule
Residue
Atom
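The four levels above can be pictured as nested containers. The following is a minimal Python sketch of that idea only; the class and field names (`MoleculeGroup`, `Molecule`, `Residue`, `Atom`, `sequence`, `atom_count`) are hypothetical illustrations and are not Bsoft's own data structures. It also shows the monomer case, where a water molecule holds a single residue that stands for the whole molecule.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Atom:
    element: str
    coord: Tuple[float, float, float]


@dataclass
class Residue:
    name: str  # e.g. "GLY"; for a monomer, the name of the whole molecule
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Molecule:
    name: str
    residues: List[Residue] = field(default_factory=list)

    def sequence(self) -> List[str]:
        """Residue names in order: the sequence view of the same molecule."""
        return [r.name for r in self.residues]


@dataclass
class MoleculeGroup:
    name: str
    molecules: List[Molecule] = field(default_factory=list)

    def atom_count(self) -> int:
        """Total number of atoms over all molecules and residues."""
        return sum(len(r.atoms) for m in self.molecules for r in m.residues)


# A two-residue peptide fragment and one water molecule in the same group
peptide = Molecule("A", [
    Residue("GLY", [Atom("N", (0.0, 0.0, 0.0)), Atom("C", (1.5, 0.0, 0.0))]),
    Residue("ALA", [Atom("N", (3.0, 0.0, 0.0))]),
])
water = Molecule("HOH", [Residue("HOH", [Atom("O", (9.0, 9.0, 9.0))])])
group = MoleculeGroup("system", [peptide, water])
```

In this sketch the sequence is derived from the residue list rather than stored separately, which is one simple way to keep the sequence and coordinate descriptions consistent.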
Sequences
The notion of a sequence is only useful in the context of a polymer. In addition, a protein can be described by either an amino acid residue sequence or a gene sequence.
Atomic and sequence file formats
Format | Extension(s) | Type | Remark |
---|---|---|---|
Clustal | .aln | Sequence | |
EMBL | .embl | Sequence | Minimal: ignore header info |
Fasta | .fasta | Sequence | Minimal: ignore header info |
Genbank | .gb | Sequence | Minimal: ignore header info |
Gromacs | .gro | Coordinates | Molecular dynamics |
PDB | .pdb | Coordinates | Minimal: ignore header info |
Phylip | .phylip | Sequence | Minimal: ignore header info |
PIR | .pir | Sequence | Minimal: ignore header info |
STAR/mmCIF | .star, .cif | Sequence+coordinates | |
Text | .txt | Sequence | Raw sequence, type inferred |
Wayne Hendrickson format | .wah, .wh | Coordinates | Minimal format |
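As an illustration of how the table can be used, the sketch below maps a file extension to the format name and data type listed above. The table contents come from this page; the names `FORMATS` and `guess_format` are hypothetical and do not describe how Bsoft itself selects a reader.

```python
import os

# Extension -> (format name, data type), transcribed from the table above
FORMATS = {
    ".aln": ("Clustal", "sequence"),
    ".embl": ("EMBL", "sequence"),
    ".fasta": ("Fasta", "sequence"),
    ".gb": ("Genbank", "sequence"),
    ".gro": ("Gromacs", "coordinates"),
    ".pdb": ("PDB", "coordinates"),
    ".phylip": ("Phylip", "sequence"),
    ".pir": ("PIR", "sequence"),
    ".star": ("STAR/mmCIF", "sequence+coordinates"),
    ".cif": ("STAR/mmCIF", "sequence+coordinates"),
    ".txt": ("Text", "sequence"),
    ".wah": ("Wayne Hendrickson format", "coordinates"),
    ".wh": ("Wayne Hendrickson format", "coordinates"),
}


def guess_format(filename):
    """Return (format, type) for a file name, or None if the extension is unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return FORMATS.get(ext)
```

For example, `guess_format("1abc.PDB")` returns the PDB entry because the extension is compared in lower case, and an unlisted extension returns `None`.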
The Representation of Models
In Bsoft, a model is an interpretative representation of the data, similar in sense to molecular models, but at a coarser grain and with much larger extent. A model is composed of components, component types, connections between components, and higher order relationships, such as polygons and polyhedra. The aim of models is to describe large assemblies of components, with component types handling the redundancy among repeated components.
Model parameters
The unit parameter is the component, which can be a molecule, a molecule group, part of a molecule, or a density.
Component type
Component
Link
Polygon
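The relationship between these four parameters can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Bsoft's model structure: all class, field, and method names (`Model`, `ComponentType`, `Component`, `Link`, `Polygon`, `add_component`, `add_link`) are hypothetical. The point it shows is that many components share one component type, so the description of what a component is gets stored only once.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ComponentType:
    name: str  # shared description, e.g. one molecule or one density


@dataclass
class Component:
    ident: str
    type_name: str  # refers to a ComponentType by name
    location: Tuple[float, float, float]


@dataclass
class Link:
    first: str   # identifier of one linked component
    second: str  # identifier of the other linked component


@dataclass
class Polygon:
    vertices: List[str]  # ordered component identifiers


@dataclass
class Model:
    types: Dict[str, ComponentType] = field(default_factory=dict)
    components: Dict[str, Component] = field(default_factory=dict)
    links: List[Link] = field(default_factory=list)
    polygons: List[Polygon] = field(default_factory=list)

    def add_component(self, ident, type_name, location):
        # Create the type on first use; later components reuse it
        self.types.setdefault(type_name, ComponentType(type_name))
        self.components[ident] = Component(ident, type_name, location)

    def add_link(self, first, second):
        # A link is only valid between components already in the model
        for ident in (first, second):
            if ident not in self.components:
                raise KeyError(ident)
        self.links.append(Link(first, second))


# Three components of the same type, linked into a triangle
model = Model()
model.add_component("1", "VER", (0.0, 0.0, 0.0))
model.add_component("2", "VER", (10.0, 0.0, 0.0))
model.add_component("3", "VER", (0.0, 10.0, 0.0))
for a, b in [("1", "2"), ("2", "3"), ("3", "1")]:
    model.add_link(a, b)
model.polygons.append(Polygon(["1", "2", "3"]))
```

Here three components and three links are stored, but only one component type, which is the kind of redundancy handling described above.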
Model file formats
Some aspects of models can be encoded in molecular files such as the PDB format, but such formats are too limited to capture the full complexity of a model.
Molecular files (Component coordinates and links)
Chimera marker file (Component coordinates and links)
Model STAR or XML file (Component types, coordinates, orientations and links)