Documentation for aln_compare 
	-(04/04/01)
	-(08/10/00)
	-(30/08/00)


0-Recent modifications
1-Program description
2-Installation
3-Flags
4-Example
	4.1 comparing two alignments
	4.2 comparing some pairs of residues within two alignments
	4.3 comparing every pair or every column...
	4.4 output the conserved bits between two alignments
5-Formats
6-known bugs
7-Author, citation




0-Recent modifications
	04/04/01: Added the -output_aln... flags to identify the conserved portions (see 4.4).
	04/10/00: Fixed a bug that caused the similarity measure to include gapped positions
	05/09/00: Made it possible for al1 and st to contain sequences in different order
	05/09/00: read_aln the routine for reading clustalw-like aln is now more flexible, it does not require 
		  a completly empty line after the heading stuff. 		 
	31/08/00: Added reading of the msf format.
	31/08/00: Fixed a bug that prevented usage of fasta_aln (reformat.c).
	30/08/00: First public release.
	
1-Program description

aln_compare is a program meant to compare two multiple sequence alignments.
Given two alignments, it will compare the sequences and residues that are common between the two.
It can also restrict the comparison to some of the residues and idicates which bit of the alignments are effectively conserved.
If you wish to carry out more sophisticated comparisons, please use t_coffee.


2-Installation
unzip and untar the distribution, then source the instal file ( ./install)

alias aln_compare='t_coffee -other_pg aln_compare'
(add this last instruction to your .bashrc)

3-Flag
to get a list of the existing flags, type:

	aln_compare
A full description of the flags is not yet available.


4 Example 

4.1 comparing two alignments

let us consider the two following alignments:
	
1aab_ref1.aln
***********************SNIP*****************************************
CLUSTAL W (1.75) multiple sequence alignment


hmgb_chite      ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD--KSEWEAK
hmgl_wheat      --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSESEKAPYVAK
hmgl_trybr      KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGPEERKVYEEM
hmgt_mouse      -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSPEEKQAYIQL
                      ***. ::: .: ..  .    :  . .      *  .  *: *    :  :   

hmgb_chite      AATAKQNYIRALQEYERNGG-
hmgl_wheat      ANKLKGEYNKAIAAYNKGESA
hmgl_trybr      AEKDKERYKREM---------
hmgt_mouse      AKDDRIRYDNEMKSWEEQMAE
                *   : .* . :         
***********************SNIP*****************************************
and 

1aab_ref1.ref_aln
***********************SNIP*****************************************
TEST ALN

hmgl_trybr -KKDSNAPKRAMTSFMFFSS----DFRSKHSDLSI-VEMSKAAGAAWKEL
hmgt_mouse ------KPKRPRSAYNIYVSESFQEAKDDSAQGKL-----KLVNEAWKNL
hmgb_chite ----ADKPKRPLSAYMLWLNSARESIKRENPDFKV-TEVAKKGGELWRGL
hmgl_wheat ---DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSL

hmgl_trybr GPEERKVYEEMAEKDKERYKREM---------
hmgt_mouse SPEEKQAYIQLAKDDRIRYDNEMKSWEEQMAE
hmgb_chite --KDKSEWEAKAATAKQNYIRALQEYERNGG-
hmgl_wheat SESEKAPYVAKANKLKGEYNKAIAAYNKGESA
***********************SNIP*****************************************

Now type:

	aln_compare -al1 1aab_ref1.aln -al2 1aab_ref1.ref_aln

will produce the following output


	seq1       seq2         Sim   [ALL]           Tot  
	1aab_ref1     4          27    86.8 [100.0]   [  402]


Sim   :Average similarity between all the sequences
[ALL] :Similarity between the two alignments on every pair of the al1 alignment. The number in bracket indicates the percent of the pairs of residues involved. In practice, the program considers each pair of aligned residues in al1, and counts how many of these pairs are found identical in al2.
TOT   :Number of pairs contained in the al1 multiple alignment. 

Note that the comparison is not symetrical, and will not always give the same results if al1 and al2 are interverted. The reason is that al1 and al2 may not contain the same number of aligned pairs.
If you are interested in comparing each pair of sequence, type:

	aln_compare -al1 1aab_ref1.aln -al2 1aab_ref1.ref_aln -io_format htp

	h: header
	t: total
	p: pair

will give as an output:

	*****************************************************
	seq1       seq2         Sim   [ALL]           Tot  
	1aab_ref1     4          27    86.8 [100.0]   [  402]
	hmgb_chite hmgl_wheat    36    95.9 [100.0]   [   74]
	hmgb_chite hmgl_trybr    21    90.3 [100.0]   [   62]
	hmgb_chite hmgt_mouse    24    80.9 [100.0]   [   68]
	hmgl_wheat hmgl_trybr    30    92.3 [100.0]   [   65]
	hmgl_wheat hmgt_mouse    22    84.5 [100.0]   [   71]
	hmgl_trybr hmgt_mouse    31    75.8 [100.0]   [   62]

If you want to know the average identity of each sequence, type:

	ln_compare -al1 1aab_ref1.aln -al2 1aab_ref1.ref_aln -io_format hs
	
	s:sequence

	
will give as an output:

	*****************************************************
	seq1       seq2         Sim   [ALL]           Tot  
	hmgb_chite ..            27    89.2 [100.0]   [  204]
	hmgl_wheat ..            29    91.0 [100.0]   [  210]
	hmgl_trybr ..            27    86.2 [100.0]   [  189]
	hmgt_mouse ..            25    80.6 [100.0]   [  201]



4.2 comparing some pairs of residues within two alignments
	 

If we consider the two previous alignments, and the following 'cache alignment'

cache
***********************SNIP*****************************************
CACHE

hmgl_trybr -xxxxxxhhhhhhhhhhhhh----xxxxxxxxxxx-xxxxhhhhhhhhhh
hmgt_mouse ------xhhhhhhhhhhhhhxxxxxxxxxxxxxxx-----hhhhhhhhhh
hmgb_chite ----xxxhhhhhhhhhhhhhxxxxxxxxxxxxxxx-xxxxhhhhhhhhhh
hmgl_wheat ---xxxxhhhhhhhhhhhhhxxxxxxxxxxxxxxxxxxxxhhhhhhhhhh

hmgl_trybr xxxxhhhhhhhhhhhhhhhhhhh---------
hmgt_mouse xxxxhhhhhhhhhhhhhhhhhhhxxxxxxxxx
hmgb_chite --xxhhhhhhhhhhhhhhhhhhhxxxxxxxx-
hmgl_wheat xxxxhhhhhhhhhhhhhhhhhhhxxxxxxxxx

***********************SNIP*****************************************

type the following instruction:
	aln_compare -al1 1aab_ref1.ref_aln -al2 1aab_ref1.aln -st
	cache aln -io_cat 3d_ali


'-st cache aln' tells the program that the structure is in
cache and formated as an alignment (you can use a fasta format in which
case the structure is threaded onto al1, in this case replace aln with
pep).

'-io_cat 3d_ali' tells the program what are the pairs/columns that wil
need to be taken into account. 3d_ali is a pre-set, and it is equivalent
to:
	-io_cat '[h][h]+[e][e]=[struc];[*][*]=[Tot]'

The result is as follow:

*********************************************************************
seq1       seq2         Sim   [struc]         [tot]           Tot  
1aab_ref1     4          27   100.0 [ 45.0]    86.8 [100.0]   [  402]


45.0 indicates that 45.0 % of the residues belong to the core ( i.e. [h]vs[h] or [e]vs[e]). 100
indicates that there is a perfect identity, in the core between the
reference and the Cw aln.

398 indicates the number of paired redsidues containe in al1.

Note that the results may change when al1 and al2 are interverted. They
may not contain the same number of pairs, column.... The structure is
always threaded onto al1. 

4.3 comparing every pair or every column

if you want a column comparison, add the flag:

	-compare_mode column 

the default is -compare_mode sp

sp means that the comparison will be made by comparing pairs of aligned residues, while column means the comparison will only take into account columns.

4.4  output the conserved bits between two alignments

Given two alignments you may want to identify the bits and pieces that are conserved:

aln_compare -al1 aln1 -al2 aln2 -output_aln\ 
	    -output_aln_modif 'x' -output_aln_format clustalw\ 
	    -output_aln_threshold 60 -output_aln_file myfile

'-output_aln' forces an output of al1 in '-output_aln_format' in the file '-output_aln_file'.
In al1, residues whose sum of pair alignment is more than 60% similar to the equivalent in al2 will be replaced with upercases. The other residues will be replaced with '-output_aln_modif'.

	-output_aln_modif    : any symbol or 'lower'
	-output_aln_threshold: 0-100 (100 <=> column comparison).
	-output_aln_format   : any format supported by seq_reformat
	-output_aln_file     : any file name, stdout, stderr will be recognized.



5-Formats

supported formats are:
	clustalw
	clustalw like ( i.e. interleaved alignment)
	fasta (i.e. like fasta sequences, but using gaps to give all the sequences the same length).
	pearson (id)
	msf

For the structure, two formats only are supported: clustalw-like and fasta

Formats are automatically recognised. Please contact me if you wish to see a new format added.

	

6-Known bugs


	


7-Address, Citation


If you use this program for academic purposes, please cite:

'T-Coffee, a novel method for multiple sequence alignments', Notredame, Higggins and Heringa, Journal of Molecular Biology, 302(1), pp205-217, 2000

If you wish to get in touch with me:

*******************************************
Dr. Cedric Notredame, PhD.                        
Structural and Genetic Information
C.N.R.S UMR 1889
31 Chemin Joseph Aiguier,
13402 Marseille Cedex 20,
France

Email   :  cedric.notredame@europe.com
WWW     :  http://igs-server.cnrs-mrs.fr/~cnotred/
Tel.    :  +33 491 164 606
Tel mob :  +33 661 312 531
Fax     :  +33 491 164 549
*******************************************

