Preparing for sequence analysis

The sequence analysis programs in the Bsoft package require aligned sequences. Bsoft has only a very rudimentary sequence alignment capability, so the alignment is better done with a dedicated program such as clustalw (see http://www.clustal.org ).

The Bsoft programs support the EMBL, FASTA, GenBank, PIR and Phylip sequence formats. The format is recognized from the file name extension.

An example aligned sequence file is provided:

vp23.pir

Sequence identity

The "overlap" between two aligned sequences is defined as those positions in the alignment where both sequences have residues. The "identity" is defined as the number of identical residues divided by the overlap, and is thus a fraction.

Example:

bseq -verbose 7 -identity vp23.pir

Part of the output:

Aligned identity analysis:

Seq1  Seq2  Identity  nID  Overlap  Name1       Name2
2     1     0.921     293  318      vp23_hsv2h  VP23_HSV11
3     1     0.427     134  314      VP23_VZVD   VP23_HSV11
3     2     0.417     131  314      VP23_VZVD   vp23_hsv2h
4     1     0.438     137  313      VP23_HSVEB  VP23_HSV11
4     2     0.435     136  313      VP23_HSVEB  vp23_hsv2h
4     3     0.527     164  311      VP23_HSVEB  VP23_VZVD
5     1     0.431     135  313      vp23_ehv4   VP23_HSV11
5     2     0.428     134  313      vp23_ehv4   vp23_hsv2h
5     3     0.527     164  311      vp23_ehv4   VP23_VZVD
5     4     0.946     297  314      vp23_ehv4   VP23_HSVEB
6     1     0.463     146  315      vp23_bhv1   VP23_HSV11
6     2     0.460     145  315      vp23_bhv1   vp23_hsv2h
...

Average identical residues: 81.4238 (54.6525)
Average overlap: 297.99 (7.73172)

The last two lines give the averages and standard deviations of the number of identical residues and overlap in all pairwise comparisons.
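
The overlap and identity definitions above can be sketched in a few lines of Python. This is an illustration of the definitions only, not the code bseq runs: the function name `pairwise_identity` and the choice of "-" and "." as gap symbols are assumptions made for the example.

```python
GAP_CHARS = frozenset("-.")  # assumed gap symbols, not taken from Bsoft


def pairwise_identity(seq_a: str, seq_b: str) -> tuple[float, int, int]:
    """Return (identity, n_identical, overlap) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    overlap = 0
    n_identical = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # a gap in either sequence: position is outside the overlap
        overlap += 1
        if a == b:
            n_identical += 1
    # Identity is the number of identical residues divided by the overlap.
    identity = n_identical / overlap if overlap else 0.0
    return identity, n_identical, overlap


if __name__ == "__main__":
    # Made-up sequences: 5 overlapping positions, 4 of them identical.
    print(pairwise_identity("MKV-LAG", "MRVQL-G"))
```

The same arithmetic applies to the bseq output above: in the first row, 293 identical residues over an overlap of 318 gives 293 / 318 ≈ 0.921.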

Sequence similarity

The "similarity" between two aligned sequences is defined as the sum of residue similarities divided by the overlap. The similarity between two residues is taken from a residue substitution matrix; the default substitution matrix in Bsoft is BLOSUM62. The "fraction similarity" is defined as the number of aligned residue pairs with a similarity above a given threshold divided by the overlap, and is thus a fraction comparable to the identity defined above.

Example:

bseq -verbose 7 -similarity 2 vp23.pir

Part of the output:

Aligned similarity analysis:

Similar residue threshold: 2

Seq1  Seq2  Sim    fracSim  Overlap  Name1       Name2
2     1     4.701  0.934    318      vp23_hsv2h  VP23_HSV11
3     1     2.140  0.535    314      VP23_VZVD   VP23_HSV11
3     2     2.099  0.525    314      VP23_VZVD   vp23_hsv2h
4     1     2.326  0.556    313      VP23_HSVEB  VP23_HSV11
4     2     2.300  0.550    313      VP23_HSVEB  vp23_hsv2h
4     3     2.859  0.650    311      VP23_HSVEB  VP23_VZVD
5     1     2.275  0.550    313      vp23_ehv4   VP23_HSV11
5     2     2.243  0.543    313      vp23_ehv4   vp23_hsv2h
5     3     2.836  0.640    311      vp23_ehv4   VP23_VZVD
5     4     4.783  0.955    314      vp23_ehv4   VP23_HSVEB
6     1     2.248  0.546    315      vp23_bhv1   VP23_HSV11
6     2     2.232  0.537    315      vp23_bhv1   vp23_hsv2h
6     3     2.700  0.629    313      vp23_bhv1   VP23_VZVD
...
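
The similarity and fraction similarity defined in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the bseq implementation: the names `pair_score` and `pairwise_similarity` are hypothetical, the small score table holds toy values (it is not BLOSUM62), gaps are assumed to be "-" or ".", and "above a given threshold" is read as a strict comparison.

```python
GAP_CHARS = frozenset("-.")  # assumed gap symbols, not taken from Bsoft

# Toy symmetric substitution scores for illustration only (NOT BLOSUM62).
TOY_SCORES = {
    ("A", "A"): 4, ("I", "I"): 4, ("L", "L"): 4,
    ("K", "K"): 5, ("R", "R"): 5,
    ("I", "L"): 2, ("K", "R"): 2,
}


def pair_score(a: str, b: str, matrix: dict) -> float:
    """Look up a residue pair in either order; raise KeyError if absent."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def pairwise_similarity(seq_a: str, seq_b: str, matrix: dict,
                        threshold: float) -> tuple[float, float, int]:
    """Return (similarity, fraction_similarity, overlap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    overlap = 0
    total = 0.0
    n_similar = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # outside the overlap
        overlap += 1
        s = pair_score(a, b, matrix)
        total += s
        if s > threshold:  # assumption: strictly above the threshold
            n_similar += 1
    if overlap == 0:
        return 0.0, 0.0, 0
    return total / overlap, n_similar / overlap, overlap


if __name__ == "__main__":
    print(pairwise_similarity("AIK-L", "ALRAL", TOY_SCORES, 2))
```

In this made-up example the four overlapping pairs score 4, 2, 2 and 4, so the similarity is 12 / 4 = 3.0, and two of the four pairs score above the threshold of 2, giving a fraction similarity of 0.5.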

Hydrophobicity analysis

The average hydrophobicity is calculated at each position in the alignment, and a periodicity analysis done with a frequency of 4 to detect helical regions. The default hydrophobicity scale is the GES scale.

A typical command line is:

bseq -verbose 7 -hydrophobicity 0.5 -Postscript vp23_hp.ps vp23.pir

The "-Postscript" option outputs three plots to a postscript file.
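
The per-position averaging step described in this section can be sketched in Python as follows. This covers only the average hydrophobicity at each alignment position, not the periodicity analysis or the plots, and it is not the bseq implementation: the function name `column_hydrophobicity`, the gap symbols, the choice to skip gaps when averaging, and the toy scale values (which are not the GES scale) are all assumptions made for the example.

```python
GAP_CHARS = frozenset("-.")  # assumed gap symbols, not taken from Bsoft

# Hypothetical hydrophobicity values for illustration only (NOT the GES scale).
TOY_SCALE = {"A": 1.0, "L": 3.0, "K": -3.0, "D": -4.0}


def column_hydrophobicity(alignment: list[str], scale: dict) -> list:
    """Average the scale values of the residues in each alignment column.

    Gaps are skipped (an assumption); a column containing only gaps
    yields None.
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have the same length")
    averages = []
    for column in zip(*alignment):
        values = [scale[c.upper()] for c in column if c not in GAP_CHARS]
        averages.append(sum(values) / len(values) if values else None)
    return averages


if __name__ == "__main__":
    # Made-up three-sequence alignment with four columns.
    print(column_hydrophobicity(["AL-K", "LLDK", "-LD-"], TOY_SCALE))
```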

Information content analysis

The information content of each position in an alignment is calculated as:

information = log₂n + sum(p_i * log₂p_i)
p_i = f_i / sum(f_i)

where f_i is the frequency of residue i at this alignment position, and n = sum(f_i) if sum(f_i) < 20, otherwise n = 20. A moving average of the information is calculated over a given window to smooth the resultant data.
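
The calculation can be sketched in Python as follows. This is a hedged illustration, not the bseq code. It takes the information as log₂n minus the Shannon entropy of the residue distribution, which is the form under which a fully conserved position scores highest (an assumption about the intended sign). The function names, the gap symbols, and the centred moving-average window that is truncated at the ends of the alignment are likewise assumptions.

```python
import math
from collections import Counter

GAP_CHARS = frozenset("-.")  # assumed gap symbols, not taken from Bsoft
MAX_N = 20                   # cap on n, as stated in the text


def column_information(column) -> float:
    """Information (bits) of one alignment column: log2(n) - entropy."""
    counts = Counter(c.upper() for c in column if c not in GAP_CHARS)
    total = sum(counts.values())          # sum(f_i)
    if total == 0:
        return 0.0                        # column of gaps only
    n = min(total, MAX_N)                 # n = sum(f_i) if < 20, else 20
    entropy = -sum((f / total) * math.log2(f / total)
                   for f in counts.values())
    return math.log2(n) - entropy


def information_profile(alignment: list[str]) -> list[float]:
    """Information of every column of an alignment."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("aligned sequences must have the same length")
    return [column_information(col) for col in zip(*alignment)]


def moving_average(values: list[float], window: int) -> list[float]:
    """Centred moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


if __name__ == "__main__":
    # Columns: fully conserved, two residues at 50% each, four different.
    profile = information_profile(["AAA", "AAC", "ALD", "ALE"])
    print(profile)
    print(moving_average(profile, 3))
```

With four sequences, n = 4 and the maximum information is log₂4 = 2 bits: the fully conserved first column reaches it, the second column (two residue types at equal frequency) gives 1 bit, and the third column (four different residues) gives 0 bits.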

A typical command line is:

bseq -verbose 7 -info -Postscript vp23_info.ps vp23.pir

The "-Postscript" option outputs three plots and a sequence logo representation to a postscript file. The sequence logo displays the occurrence of every residue type at every position in the alignment, where the combined height at each position is the information content, a measure of conservation. The output file is provided in both postscript and pdf (the pdf was converted from the postscript file with ps2pdf):

vp23_info.ps vp23_info.pdf