The Molecule Data Model

The Bsoft molecular object model attempts to implement the most general molecular structure that is still practical. The molecules studied in structural biology are typically proteins or nucleic acids with associated sequence information. Sequence and atomic coordinates are alternative representations of the same molecule. In Bsoft, these descriptors are grouped into one structure, with utilities for reconciling the alternative descriptions.

The molecule group hierarchy

A molecule is generally defined as a collection of covalently bound atoms. Exceptions involve cross-links such as disulfide bonds, or covalent bonding between what are considered different molecules. To deal with multiple molecules in a data set, the notion of a "molecule group" is defined in Bsoft. This loose concept can be used for all the molecules in a quaternary structure, or for all the molecules in a complete system (which may include solvent).

Simple molecules usually consist of a small number of covalently bound atoms. Polymeric macromolecules (such as proteins and nucleic acids), however, require a hierarchical level between molecule and atom. In Bsoft this intermediate level is a "residue", whether it is an amino acid residue in a protein, a nucleotide in a nucleic acid, or a sugar monomer in a polysaccharide. For simple monomeric molecules, the residue is considered to be the whole molecule.

The molecule group hierarchy is therefore: molecule group, molecule, residue, atom (from the highest level to the lowest).
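The four levels described above (molecule group, molecule, residue, atom) can be pictured as nested containers. The following is a minimal sketch only: the class and field names are hypothetical and do not reflect the actual Bsoft data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Atom:
    name: str                           # atom name, e.g. "CA"
    coord: Tuple[float, float, float]   # x, y, z coordinates


@dataclass
class Residue:
    name: str                           # e.g. "ALA"; for a monomer, the whole molecule
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Molecule:
    name: str
    residues: List[Residue] = field(default_factory=list)


@dataclass
class MoleculeGroup:
    name: str
    molecules: List[Molecule] = field(default_factory=list)

    def atom_count(self) -> int:
        # Walk the full hierarchy: group -> molecule -> residue -> atom
        return sum(len(r.atoms) for m in self.molecules for r in m.residues)


# A dipeptide fragment plus one water; the water is a single-residue molecule
peptide = Molecule("A", [
    Residue("GLY", [Atom("N", (0.0, 0.0, 0.0)), Atom("CA", (1.5, 0.0, 0.0))]),
    Residue("ALA", [Atom("N", (3.8, 0.0, 0.0)), Atom("CA", (5.3, 0.0, 0.0))]),
])
water = Molecule("W", [Residue("HOH", [Atom("O", (9.0, 0.0, 0.0))])])
group = MoleculeGroup("system", [peptide, water])
```

Here the water molecule shows the monomeric case, where the single residue stands for the whole molecule, and the group holds both the polymer and the solvent.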

Sequences

The notion of a sequence is only useful in the context of a polymer. In addition, a protein can be described by either an amino acid residue sequence or a gene sequence.
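The relationship between the two descriptions of a protein can be illustrated by translating a coding gene sequence into its amino acid sequence. This is a minimal sketch, assuming the standard genetic code and a sequence that starts in frame. The function name `translate` is hypothetical and is not a Bsoft routine.

```python
from itertools import product

# Standard genetic code in the conventional compact form: codons are
# enumerated with bases in the order T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence into one-letter amino acid codes.

    Translation stops at the first stop codon, and a trailing incomplete
    codon is ignored. RNA input (U instead of T) is accepted.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # raises KeyError for ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC and TAA, giving the residues M and A before the stop codon ends translation.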

Atomic and sequence file formats

Table 1. Atomic and sequence file formats

Format                     Extension(s)   Type                   Remark
Clustal                    .aln           Sequence
EMBL                       .embl          Sequence               Minimal: ignore header info
Fasta                      .fasta         Sequence               Minimal: ignore header info
Genbank                    .gb            Sequence               Minimal: ignore header info
Gromacs                    .gro           Coordinates            Molecular dynamics
PDB                        .pdb           Coordinates            Minimal: ignore header info
Phylip                     .phylip        Sequence               Minimal: ignore header info
PIR                        .pir           Sequence               Minimal: ignore header info
STAR/mmCIF                 .star, .cif    Sequence+coordinates
Text                       .txt           Sequence               Raw sequence, type inferred
Wayne Hendrickson format   .wah, .wh      Coordinates            Minimal format
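
The extensions in Table 1 lend themselves to a simple lookup from file name to format and data type. The sketch below only transcribes the table. The names `FORMATS` and `identify_format` are hypothetical, and the way Bsoft itself selects a reader is not described here.

```python
import os

# (format name, data type) keyed by file extension, transcribed from Table 1
FORMATS = {
    ".aln": ("Clustal", "sequence"),
    ".embl": ("EMBL", "sequence"),
    ".fasta": ("Fasta", "sequence"),
    ".gb": ("Genbank", "sequence"),
    ".gro": ("Gromacs", "coordinates"),
    ".pdb": ("PDB", "coordinates"),
    ".phylip": ("Phylip", "sequence"),
    ".pir": ("PIR", "sequence"),
    ".star": ("STAR/mmCIF", "sequence+coordinates"),
    ".cif": ("STAR/mmCIF", "sequence+coordinates"),
    ".txt": ("Text", "sequence"),
    ".wah": ("Wayne Hendrickson format", "coordinates"),
    ".wh": ("Wayne Hendrickson format", "coordinates"),
}


def identify_format(filename):
    """Return (format, type) for a file name, or None if the extension is unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return FORMATS.get(ext)
```

Matching is case-insensitive in this sketch, so `identify_format("1abc.PDB")` resolves to the PDB coordinate format, while an unlisted extension yields `None`.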

The Representation of Models

In Bsoft, a model is an interpretative representation of the data, similar in sense to molecular models, but at a coarser grain and with much larger extent. A model is composed of components, component types, connections between components, and higher order relationships, such as polygons and polyhedra. The aim with models is to provide descriptions of large assemblies of components while handling redundancies with component types.
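
The role of component types in handling redundancy can be sketched as follows: many components refer to one shared type record rather than each carrying its own copy of the description. All class, field and method names here are hypothetical and are not the Bsoft model structures. Higher order relationships such as polygons and polyhedra are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ComponentType:
    name: str                               # shared description, e.g. "hexon"


@dataclass
class Component:
    ident: str                              # unique identifier within the model
    type: ComponentType                     # reference to the shared type
    location: Tuple[float, float, float]    # position of this instance


@dataclass
class Model:
    name: str
    types: Dict[str, ComponentType] = field(default_factory=dict)
    components: List[Component] = field(default_factory=list)
    links: List[Tuple[str, str]] = field(default_factory=list)

    def add_component(self, ident, type_name, location):
        # Reuse an existing type record if one with this name is present
        ctype = self.types.setdefault(type_name, ComponentType(type_name))
        comp = Component(ident, ctype, location)
        self.components.append(comp)
        return comp

    def connect(self, ident_a, ident_b):
        # A connection is stored as a pair of component identifiers
        self.links.append((ident_a, ident_b))


model = Model("capsid")
model.add_component("1", "hexon", (0.0, 0.0, 100.0))
model.add_component("2", "hexon", (50.0, 0.0, 86.6))
model.connect("1", "2")
```

Both components point at the same `ComponentType` object, so a description attached to the type is stored once regardless of how many instances the assembly contains.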

Model parameters

The unit parameter is the component, which can be a molecule, a molecule group, part of a molecule, or a density.

Model file formats

Some aspects of models can be encoded in molecular files such as the PDB format, but these are too limited to encapsulate all the complexity.