
                        T-Coffee: Tutorial and FAQ

                       (Version 4.67, November 2006)

                                  T-Coffee
                                  3D-Coffee
                                  M-Coffee
                               APDB and iRMSD

Cédric Notredame
Centre National de la Recherche Scientifique
www.tcoffee.org

© Cédric Notredame and Centre National de la Recherche Scientifique, France

Before You Start
Foreword
Pre-Requisite
Getting The Example Files of The Tutorial

What Is T-Coffee?
What is T-Coffee?
   What does it do?
   What can it align?
   How can I use it?
   Is There an Online Server?
   Is T-Coffee different from ClustalW?
What T-Coffee Can and Cannot Do for You
   (NOT) Fetching Sequences
   Aligning Sequences
   Combining Alignments
   Evaluating Alignments
   Combining Sequences and Structures
   Identifying Occurrences of a Motif: Mocca
How Does T-Coffee Work?

Preparing Your Data: Reformatting and Trimming With seq_reformat
Reformatting your data
   Accessing the T-Coffee Reformatting Utility
   Changing MSA formats
   Removing the gaps from an alignment
   Changing the case of your sequences
Protecting Important Sequence Names
Colouring residues in an Alignment
   Overview
   Preparing a Sequence or Alignment Cache
   Preparing a Library Cache
   Coloring an Alignment
   Changing the default colors
Selective Reformatting
   Selectively turn some residues to lower case
   Selectively modifying residues
Extracting Portions of Dataset
   Extracting Sequences According to a Pattern
   Extracting Sequences by Names
   Removing Sequences by Names
   Extracting Blocks Within Alignment
   Concatenating Alignments
Reducing and Improving your dataset
   Extracting the N most informative sequences
   Extracting all the sequences less than X% identical
   Forcing Specific Sequences to be kept
   Identifying and Removing Outliers
   Chaining Important Sequences
Manipulating DNA sequences
   Translating DNA sequences into Proteins
   Back-Translation With the Bona-Fide DNA sequences
   Finding the Bona-Fide Sequences for the Back-Translation
   Guessing Your Back Translation
Fetching a Structure
   Fetching a PDB structure
   Fetching The Sequence of a PDB structure
   Dealing with Non-automatically recognized formats
Dealing With Phylogenetic Trees
   Comparing two phylogenetic trees
   Pruning Phylogenetic Trees

Building Multiple Sequence Alignments
How to generate The Alignment You Need?
   What is a Good Alignment?
   The Main Methods and their Scope
   Choosing The Right Package
Computing Multiple Sequence Alignments With T-Coffee
   A Simple Multiple Sequence Alignment
   Controlling the Output Format
   Computing a Phylogenetic tree
   Using Several Datasets
   How Good is Your Alignment
   Doing it over the WWW
Aligning Many Sequences
   Aligning Very Large Datasets with Muscle
   Aligning Very Large Alignments with Mafft
   Aligning Very Large Alignments with T-Coffee
   Shrinking Large Alignments With T-Coffee
Modifying the default parameters of T-Coffee
   Changing the Substitution Matrix
   Comparing Two Alternative Alignments
   Changing Gap Penalties
   Can You Guess The Optimal Parameters?
Using Many Methods at once
   Using All the Methods at the Same Time: M-Coffee
   Using Selected Methods to Compute your MSA
   Combining pre-Computed Alignments
Aligning Profiles
   Aligning One sequence to a Profile
   Aligning Many Sequences to a Profile
   Accurate/Slow Profile to Profile Alignment
Aligning Other Types of Sequences
   Splicing variants
   Aligning DNA sequences
   Aligning RNA sequences
   Noisy Coding DNA Sequences.

Combining Sequences and Structures
If you are in a Hurry: Expresso
   What is Expresso?
   Using Expresso
Aligning Sequences and Structures
   Mixing Sequences and Structures
   Using Sequences only
Aligning Profile using Structural Information

How Good Is Your Alignment?

Evaluating Alignments with The CORE index
   Computing the Local CORE Index
   Computing the CORE index of any alignment
   Filtering Bad Residues
   Filtering Gap Columns

Evaluating an Alignment Using Structural Information: iRMSD
   What is the iRMSD?
   How to Efficiently Use Structural Information
   Evaluating an Alignment With the irmsd Package
   Evaluating Alternative Alignments
   Identifying the most distantly related sequences in your dataset

Evaluating an Alignment according to your own Criterion
   Establishing Your Own Criterion

Integrating External Methods In T-Coffee
What Are The Methods Already Integrated in T-Coffee
   List of INTERNAL Methods
   Plug-In: Using Methods Integrated in T-Coffee
   Modifying the parameters of Internal and External Methods
Integrating External Methods
   Direct access to external methods
   Customizing an external method (with parameters) for T-Coffee
   Managing a collection of method files
Advanced Method Integration
   The Mother of All method files.
   Weighting your Method
Plug-Out: Using T-Coffee as a Plug-In
Creating Your Own T-Coffee Libraries
   Using Pre-Computed Alignments
   Customizing the Weighting Scheme
   Generating Your Own Libraries

Frequently Asked Questions
Abnormal Terminations and Wrong Results
   Q: The program keeps crashing when I give my sequences
   Q: The default alignment is not good enough
   Q: The alignment contains obvious mistakes
   Q: The program is crashing
   Q: I am running out of memory
Input/Output Control
   Q: How many Sequences can t_coffee handle
   Q: Can I prevent the Output of all the warnings?
   Q: How many ways to pass parameters to t_coffee?
   Q: How can I change the default output format?
   Q: My sequences are slightly different between all the alignments.
   Q: Is it possible to pipe stuff OUT of t_coffee?
   Q: Is it possible to pipe stuff INTO t_coffee?
   Q: Can I read my parameters from a file?
   Q: I want to  decide myself on the name of the output files!!!
   Q: I want to use the sequences in an alignment file
   Q: I only want to produce a library
   Q: I want to turn an alignment into a library
   Q: I want to concatenate two libraries
   Q: What happens to the gaps when an alignment is fed to T-Coffee
   Q: I cannot print the html graphic display!!!
   Q: I want to output an html file and a regular file
   Q: I would like to output more than one alignment format at the same time
Alignment Computation
   Q: Is T-Coffee the best? Why Not Using Muscle, or Mafft, or ProbCons???
   Q: Can t_coffee align Nucleic Acids ???
   Q: I do not want to compute the alignment.
   Q: I would like to force some residues to be aligned.
   Q: I would like to use structural alignments.
   Q: I want to build my own libraries.
   Q: I want to use my own tree
   Q: I want to align coding DNA
   Q: I do not want to use all the possible pairs when computing the library
   Q: I only want to use specific pairs to compute the library
   Q: There are duplicates or quasi-duplicates in my set
Using Structures and Profiles
   Q: Can I align sequences to a profile with T-Coffee?
   Q: Can I align sequences Two or More Profiles?
   Q: Can I align two profiles according to the structures they contain?
   Q: T-Coffee becomes very slow when combining sequences and structures
   Q: Can I use a local installation of PDB?
Alignment Evaluation
   Q: How good is my alignment?
   Q: What is that color index?
   Q: Can I evaluate alignments NOT produced with T-Coffee?
   Q: Can I Compare Two Alignments?
   Q: I am aligning sequences with long regions of very good overlap
   Q: Why is T-Coffee changing the names of my sequences!!!!
Improving Your Alignment
   Q: How Can I Edit my Alignment Manually?
   Q: Have I Improved or Not my Alignment?

Addresses and Contacts
Contributors
Addresses

References
T-Coffee
Mocca
CORE
Other Contributions
Bug Reports and Feedback

                              Before You Start.


Foreword

A lot of the material presented here emanates from two summer schools that
were tentatively called the "Prosite Workshops" and were held in Marseille
in 2001 and 2002. These workshops were mostly an excuse to go rambling and
swimming in the calanques. Yet, when we got tired of lazing in the sun, we
eventually did a bit of work to chill out. Most of our experiments revolved
around the development of sequence analysis tools. Many of the most
advanced ideas in T-Coffee were launched during these fruitful sessions.
Participants included Phillip Bucher, Laurent Falquet, Marco Pagni,
Alexandre Gattiker, Nicolas Hulo, Christian Siegfried, Anne-Lise Veuthey,
Virginie Leseau, Lorenzo Ceruti and Cedric Notredame.

This document contains two main sections. The first one is a tutorial,
where we go from simple things to more complicated ones and show you how to
use all the subtleties of T-Coffee. The second one is a list of frequently
asked questions (FAQ). We have tried to put as many of these
functionalities as possible on the web (www.tcoffee.org), but if you need
to do something special and highly reproducible, the command line is the
only way.


Pre-Requisite

This tutorial assumes that you have installed T-Coffee, version 4.30 or
higher. T-Coffee is free, open-source software that runs on all Unix-like
platforms, including Mac OS X and Cygwin. All the relevant information for
installing T-Coffee is contained in the Technical Documentation
(tcoffee_technical.doc in the doc directory).

T-Coffee cannot run in the Microsoft Windows shell. If you need to run
T-Coffee on Windows, start by installing Cygwin (www.cygwin.com). Cygwin is
free, open-source software that makes it possible to run a Unix-like
command line on your Microsoft Windows PC without having to reboot. Cygwin
is free of charge and very easy to install. Yet, as the first installation
requires downloading substantial amounts of data, you should make sure you
have access to a broadband connection.

In the course of this tutorial, we expect you to  use  a  unix-like  command
line shell. If you work on Cygwin, this means clicking on  the  cygwin  icon
and typing commands in the window that appears. If you don't want to  bother
with command  line  stuff,  try  using  the  online  tcoffee  webserver  at:
www.tcoffee.org


Getting The Example Files of The Tutorial

We encourage you to try all the following examples with your own
sequences/structures. If you want to try with ours, you can get the
material from the example directory of the distribution. If you do not know
where this directory lives or if you do not have access to it, the simplest
thing to do is to:

     1- Go to www.tcoffee.org and follow the link to the T-Coffee Home
        Page

     2- Download the latest distribution

     3- gunzip <distrib>.tar.gz

     4- tar -xvf <distrib>.tar

     5- go into <distrib>/example

This is all you need to  do  to  run  ALL  the  examples  provided  in  this
tutorial.


                              What Is T-Coffee?


What is T-Coffee?

       Before going deep into the core of the matter, here are a  few  words
       to quickly explain some of the things T-Coffee will do for you.


1 What does it do?

       T-Coffee is a multiple sequence alignment program:  given  a  set  of
       sequences previously gathered using  database  search  programs  like
       BLAST, FASTA or Smith and Waterman, T-Coffee will produce a  multiple
       sequence alignment. To  use  T-Coffee  you  must  already  have  your
       sequences ready.

       T-Coffee can also be used to compare  alignments,  reformat  them  or
       evaluate them using structural information.


2 What can it align?

       T-Coffee will align nucleic and protein sequences alike, although  it
       does better at aligning proteins than nucleic acids. It will be  able
       to use structural information for  protein  sequences  with  a  known
       structure. We recently introduced a new mode that makes T-Coffee able
       to accurately align large datasets.


3 How can I use it?

       T-Coffee is not an interactive program. It runs  from  your  UNIX  or
       Linux  command  line  and  you  must  provide  it  with  the  correct
       parameters. If you do not like typing commands, here is the  simplest
       available mode where T-Coffee only needs the  name  of  the  sequence
       file:

                 PROMPT: t_coffee sample_seq1.fasta
       Installing and using T-Coffee requires a  minimum  acquaintance  with
       the Linux/Unix operating system. If you  feel  this  is  beyond  your
       computer skills, we suggest you  use  one  of  the  available  online
       servers.


4 Is There an Online Server?

       Yes, at www.tcoffee.org


5 Is T-Coffee different from ClustalW?

       According to several benchmarks, T-Coffee appears to be more accurate
       than ClustalW. Yet, this increased accuracy comes at a price:
       T-Coffee is slower than ClustalW (about N times for N sequences).

       If you are familiar with ClustalW, or if you run a  ClustalW  server,
       you will find that we have  made  some  efforts  to  ensure  as  much
       compatibility as possible between ClustalW and T-COFFEE. Whenever  it
       was relevant, we have kept the flag names  and  the  flag  syntax  of
       ClustalW. Yet, you will  find  that  T-Coffee  also  has  many  extra
       possibilities.

       If you want to align closely related sequences, T-Coffee can also be
       used in a fast mode (T-Coffee -very_fast) that is much faster than
       ClustalW and about as accurate. This mode is especially useful for
       aligning long sequences.




What T-Coffee Can and Cannot Do for You

IMPORTANT: All the files mentioned here (sample_seq...) can be found in the
example directory of the distribution.

1 (NOT) Fetching Sequences

       T-Coffee will NOT fetch sequences for you: you must select the
       sequences you want to align beforehand. We suggest you use any BLAST
       server and format your sequences in FASTA so that T-Coffee can use
       them easily. The ExPASy BLAST server (www.expasy.ch) provides a nice
       interface for integrating database searches.


2 Aligning Sequences

       T-Coffee will compute (or at least try to compute!) accurate multiple
       alignments of DNA, RNA or Protein sequences.


3 Combining Alignments

       T-Coffee allows you to combine results obtained with several
       alignment methods. For instance, if you have an alignment coming from
       ClustalW, another alignment coming from Dialign, and a structural
       alignment of some of your sequences, T-Coffee will combine all that
       information and produce a new multiple sequence alignment having the
       best agreement with all these methods (see the FAQ for more details).

         PROMPT: t_coffee -aln=sproteases_small.cw_aln,
         sproteases_small.muscle, sproteases_small.tc_aln
         -outfile=combined_aln.aln

4 Evaluating Alignments

       You can use T-Coffee to measure  the  reliability  of  your  Multiple
       Sequence alignment. If you want to find out about that, read the  FAQ
       or the documentation for the -output flag.

         PROMPT: t_coffee -infile=sproteases_small.aln
         -special_mode=evaluate

5 Combining Sequences and Structures

       One of the latest improvements of T-Coffee is to let you combine
       sequences and structures, so that your alignments are of higher
       quality. You need to have the sap package installed to benefit fully
       from this facility.

         PROMPT: t_coffee 3d.fasta -special_mode=3dcoffee
       Using this mode will cause T-Coffee  to  automatically  identify  the
       target corresponding to your sequence as indicated by an NCBI  BLAST.
       T-Coffee then obtains the required PDB sequences from RCSB.  However,
       if you are also  using  -template_file,  the  program  will  use  the
       template you specified and the corresponding files on your disk.

       All these network-based operations are carried out using wget. If
       wget is not installed on your system, you can get it for free from
       (www.wget.org). To make sure wget is installed on your system, type:

         PROMPT: which wget
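
       In a script, the same check can be made without reading the output
       by eye. The following is a minimal sketch for a POSIX shell;
       have_tool is a hypothetical helper name used for illustration and is
       not part of T-Coffee:

```shell
# Return success if the named command can be found on the PATH.
# have_tool is a hypothetical helper, not a T-Coffee utility.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool wget; then
    echo "wget found: $(command -v wget)"
else
    echo "wget not found: network-based modes will probably fail"
fi
```

       The same helper can be reused for any other external program that a
       given mode depends on.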

6 Identifying Occurrences of a Motif: Mocca

       Mocca is a special mode of T-Coffee that  allows  you  to  extract  a
       series of repeats from a single sequence or a set  of  sequences.  In
       other words, if you know the coordinates of one copy of a repeat, you
       can extract all the other occurrences. If  you  want  to  use  Mocca,
       simply type:

         PROMPT: t_coffee -other_pg mocca sample_seq1.fasta
       The program needs some time to compute a library  and  it  will  then
       prompt you with an interactive menu. Follow the instructions.


How Does T-Coffee Work?

       If you only want to make a standard multiple alignment, you may skip
       these explanations. But if you want to do more sophisticated things,
       these few indications may help before you start reading the doc and
       the papers.

       When you run T-Coffee, the first  thing  it  does  is  to  compute  a
       library. The library is a list of pairs of  residues  that  could  be
       aligned. It is like a Xmas list: you can ask anything you fancy,  but
       it is down to Santa to assemble a collection of Toys that  won't  get
       him stuck at the airport, while going through the metal detector.

       Given a standard library, it is not possible to have all the residues
       aligned at the same time because all the lines of the library may not
       agree. For instance, line 1 may say


         Residue 1 of seq A with Residue 5 of seq B,

       and line 100 may say


         Residue 1 of seq A with Residue 29 of seq B,

       Each of these constraints comes with a weight and, in the end, the
       T-Coffee algorithm tries to generate the multiple alignment that
       contains the constraints whose sum of weights yields the highest
       score. In other words, it tries to make as many constraints as
       possible happy (replace the word constraint with friends, family
       members, collaborators... and you will know exactly what we mean).

       You can generate this list of constraints however you like.  You  may
       even provide it yourself, forcing important residues to be aligned by
       giving them high weights (see the  FAQ).  For  your  convenience,  T-
       Coffee can generate (this is the default) its own list by making  all
       the possible global  pairwise  alignments,  and  the  10  best  local
       alignments associated with each  pair  of  sequences.  Each  pair  of
       residues observed aligned in these pairwise alignments becomes a line
       in the library.

       Yet be aware that nothing forces you to use this library and that you
       could build it using other methods (see the FAQ). In protein
       language, T-Coffee is synonymous with freedom, the freedom of being
       aligned however you fancy (I was a Tryptophan in some previous
       life).
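
       To make the idea of weighted, conflicting constraints concrete, here
       is a toy sketch. It is NOT the T-Coffee algorithm: it only keeps,
       for each residue, the partner proposed with the highest weight. The
       five-column layout (seqA resA seqB resB weight) and the weights 80
       and 35 are invented for this illustration, and best_partner is a
       hypothetical helper name:

```shell
# Toy constraint list, one constraint per line: seqA resA seqB resB weight.
# For each (seqA, resA, seqB) keep only the partner with the highest weight.
# This is an illustration of conflicting constraints, not T-Coffee itself.
best_partner() {
    awk '{
        key = $1 " " $2 " " $3
        if (!(key in best) || $5 + 0 > w[key]) {
            best[key] = $4
            w[key] = $5 + 0
        }
    }
    END { for (k in best) print k, best[k], w[k] }' | sort
}

# Two lines that disagree about the partner of residue 1 of sequence A.
printf '%s\n' 'A 1 B 5 80' 'A 1 B 29 35' | best_partner
```

       The real program weighs all the constraints together while building
       the multiple alignment, so a locally lighter constraint can still
       win if it agrees better with the rest of the library.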












                            Preparing Your Data:

                 Reformatting and Trimming With seq_reformat

Nothing is more frustrating than downloading important data and realizing
you need to format it *before* using it. In general, you should avoid
manual reformatting: it is inherently inconsistent and will get you into
trouble. It will also get you depressed when you realize that you have
spent the whole day adding carriage returns to each line in your files.


Reformatting your data


1 Accessing the T-Coffee Reformatting Utility

       T-Coffee comes along with a very powerful reformatting utility  named
       seq_reformat. You can  use  seq_reformat  by  invoking  the  t_coffee
       shell.

         PROMPT: t_coffee -other_pg seq_reformat
       This will output the online flag usage of seq_reformat. Seq_reformat
       automatically recognizes the most common formats. You can use it to:

       Reformat your sequences.

       Extract sub-portions of alignments.

       Extract sequences.

       In this section we give you a few examples of things you can do  with
       seq_reformat:


2 Changing MSA formats

       It can be necessary to change from one MSA format to another. If your
       sequences are in ClustalW format and  you  want  to  turn  them  into
       fasta, while keeping the gaps, try

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -output fasta_aln > sproteases_small.fasta_aln
       If you want to turn a clustalw alignment into an alignment having the
       pileup format (MSF), try:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -output msf > sproteases_small.msf

3 Dealing with Non-automatically recognized formats

       Format recognition is not 100% foolproof. Occasionally you will
       have to inform the program about the nature of the file you are
       trying to reformat, for instance:

             -in_f msf_aln

4 Removing the gaps from an alignment

       If  you  want  to  recover  your  sequences  from  some  pre-computed
       alignment, you can try:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -output fasta_seq > sproteases_small.fasta
       This will remove all the gaps.
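
       For readers who want to see what this operation amounts to, here is
       a minimal sketch using standard tools only. It assumes the input is
       already an aligned FASTA file with "-" as the gap character (unlike
       seq_reformat, it does not recognize other formats), and strip_gaps
       is a hypothetical helper name:

```shell
# Remove every "-" from the sequence lines of an aligned FASTA stream.
# Header lines (starting with ">") are passed through unchanged.
strip_gaps() {
    awk '/^>/ { print; next } { gsub(/-/, ""); print }'
}

printf '%s\n' '>B' 'CTGAGA-AGCCGC---CTGAGG--TCG' | strip_gaps
```

       For real work, seq_reformat remains the safer choice because it
       handles format recognition for you.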


5 Changing the case of your sequences

       If you need to change the case of your sequences, you  can  use  more
       sophisticated functions  embedded  in  seq_reformat.  We  call  these
       modifiers, and they are accessed via the -action flag. For  instance,
       to write our sequences in lower case:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +lower -output clustalw
       No prize for guessing that +upper will do exactly the opposite....

NOTE: It is possible to upper and lower case specific residues. See the
last part of this section for more information.

Protecting Important Sequence Names

Few programs support long sequence  names.  Sometimes,  when  going  through
some pipeline the names of your  sequences  can  be  damaged  (truncated  or
modified).  To  avoid  this,  seq_reformat  contains  a  utility  that   can
automatically rename your  sequences  into  a  form  that  will  be  machine
friendly, while making it easy to return to the human friendly form.

The first thing to do is to generate a list of names that will  be  used  in
place of the long original name of the sequences. For instance:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -output code_name > sproteases_large.code_name
This will create a file where each original name is associated with a coded
name (Cxxxx). You can then use this file to either code or decode your
dataset. For instance, the following command:

         PROMPT: t_coffee -other_pg seq_reformat -code
         sproteases_large.code_name -in sproteases_large.fasta
         >sproteases_large.coded.fasta
will code all the names of the original data. You can work with the file
sproteases_large.coded.fasta, and when you are done, you can decode the
names of your sequences using:

         PROMPT: t_coffee -other_pg seq_reformat -decode
         sproteases_large.code_name -in sproteases_large.coded.fasta
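
The principle behind the coded names can be sketched with standard tools.
This is only an illustration of the idea: the two-column layout printed
below (original name, then coded name) is an assumption made for the
example and is not claimed to be the format written by seq_reformat, and
make_code_names is a hypothetical helper name:

```shell
# Print one "original_name coded_name" pair per FASTA header, numbering
# the sequences C0001, C0002, ... in order of appearance.
# The output layout is an assumption, not the seq_reformat format.
make_code_names() {
    awk '/^>/ { n++; printf "%s C%04d\n", substr($0, 2), n }'
}

# Hypothetical sequence names, chosen to be long and awkward.
printf '%s\n' '>sp|P00001|A_VERY_LONG_NAME' 'AAA' '>seq2' 'CC' | make_code_names
```

Because the mapping is kept in a separate file, the short names can go
through any pipeline and be turned back into the original ones afterwards.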

Colouring residues in an Alignment


1 Overview

       To color an alignment, two files are needed: the alignment (aln) and
       the cache (cache). The cache is a file where residues to be colored
       are declared along with the colors. Nine different colors are
       currently supported. They are set by default but can be modified by
       the user (see Changing the default colors, at the end of this
       section). The cache can either look like a standard sequence or
       alignment file (see below) or like a standard T-Coffee library (see
       next section). In this section we show you how to specifically
       modify your original sequences to turn them into a cache.

       In the cache, the colors of each residue are declared with  a  number
       between 0 and 9.  Undeclared residues will appear without  any  color
       in the final alignment.


2 Preparing a Sequence or Alignment Cache

       Let us consider the following file:


CLUSTAL FORMAT

B               CTGAGA-AGCCGC---CTGAGG--TCG
C               TTAAGG-TCCAGA---TTGCGG--AGC
D               CTTCGT-AGTCGT---TTAAGA--ca-
A               CTCCGTgTCTAGGagtTTACGTggAGT
                 *  *      *     *  *

       The command

         PROMPT: t_coffee -other_pg seq_reformat -in=sample_aln6.aln
         -output=clustalw_aln -out=cache.aln -action +convert 'Aa1' '.--'
         +convert '#0'
         The conversion will proceed as follows:

                 -action indicates the actions that must be carried out on
                 the alignment before it is output into the cache.

                 +convert indicates the filters for character conversion:

                 - will remain -

                 A and a will be turned into 1

                 All the other symbols (#) will be turned into 0.

This command generates the following alignment (called a cache):

CLUSTAL FORMAT for SEQ_REFORMAT Version 1.00, CPU=0.00 sec, SCORE=0, Nseq=4, Len=27

B               000101-100000---000100--000
C               001100-000101---000000--100
D               000000-100000---001101--01-
A               000000000010010000100000100
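
       The effect of these three conversion rules on a single alignment row
       can be reproduced with sed. This is a sketch of the rules as
       described in the text (A and a become 1, gaps are kept, everything
       else becomes 0), not the seq_reformat implementation, and
       to_cache_row is a hypothetical helper name:

```shell
# Apply the three rules to one alignment row:
#   A or a -> 1, "-" stays "-", every other symbol -> 0.
to_cache_row() {
    printf '%s\n' "$1" | sed -e 's/[Aa]/1/g' -e 's/[^1-]/0/g'
}

to_cache_row 'CTGAGA-AGCCGC---CTGAGG--TCG'   # prints 000101-100000---000100--000
```

       The order of the two substitutions matters: the catch-all rule must
       come last, otherwise it would also overwrite the residues that were
       meant to become 1.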

       Other alternatives are possible. For instance, the following command:

         PROMPT: t_coffee -other_pg seq_reformat -in=sample_aln6.aln
         -output=fasta_seq -out=cache.seq -action +convert 'Aa1' '.--'
         +convert '#0'
       will produce the following file (cache.seq):

>B
000101100000000100000
>C
001100000101000000100
>D
00000010000000110101
>A
000000000010010000100000100

       where each residue has been replaced with a number according to what
       was specified by +convert. Note that it is not necessary to replace
       EVERY residue with a code. For instance, the following file would
       also be suitable as a cache:

         PROMPT: t_coffee -other_pg seq_reformat -in=sample_aln6.aln
         -output=fasta_seq -out=cache -action +convert 'Aa1' '.--'

>B
CTG1G11GCCGCCTG1GGTCG
>C
TT11GGTCC1G1TTGCGG1GC
>D
CTTCGT1GTCGTTT11G1c1
>A
CTCCGTgTCT1GG1gtTT1CGTgg1GT


3 Preparing a Library Cache

       The Library is a special format used by T-Coffee to  declare  special
       relationships between pairs of residues. The cache library format can
       also be used  to  declare  the  color  of  specific  residues  in  an
       alignment. For instance, the following file


! TC_LIB_FORMAT_01
4
A 27 CTCCGTgTCTAGGagtTTACGTggAGT
B 21 CTGAGAAGCCGCCTGAGGTCG
C 21 TTAAGGTCCAGATTGCGGAGC
D 20 CTTCGTAGTCGTTTAAGAca
#1 1
    1     1   3
    4     4   5
#3 3
    6     6   1
    9     9   4
! CPU 240
! SEQ_1_TO_N

       The file shown above (sample_lib5.tc_lib) declares that residue 1 of
       sequence 1 will receive color 3 and residue 4 color 5, while
       residues 6 and 9 of sequence 3 will receive colors 1 and 4. Note
       that the sequence number and the residue index are duplicated, owing
       to the recycling of this format from its original usage.

       It is also possible to use  the  BLOCK  operator  when  defining  the
       library (c.f. technical doc, library format). For instance:


! TC_LIB_FORMAT_01
4
A 27 CTCCGTgTCTAGGagtTTACGTggAGT
B 21 CTGAGAAGCCGCCTGAGGTCG
C 21 TTAAGGTCCAGATTGCGGAGC
D 20 CTTCGTAGTCGTTTAAGAca
#1 1
    +BLOCK+ 10 1     1    3
    +BLOCK+ 5  15    15   5
#3 3
    6     6   1
    9     9   4
! CPU 240
! SEQ_1_TO_N

       The number right after BLOCK indicates the block length (10). The two
       next numbers (1 1) indicate the position of the first element in  the
       block. The last value is the color.
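
       One way to read a BLOCK line is as shorthand for a run of ordinary
       lines. The sketch below expands it on that assumption, namely that
       a block of length N starting at position P covers positions P to
       P+N-1 with the same color; this reading is inferred from the
       description above rather than taken from the technical
       documentation, and expand_blocks is a hypothetical helper name:

```shell
# Expand "+BLOCK+ length pos pos color" into one "pos pos color" line
# per position; all other lines are passed through unchanged.
# The expansion rule is an assumption based on the tutorial text.
expand_blocks() {
    awk '$1 == "+BLOCK+" {
        for (i = 0; i < $2; i++) print $3 + i, $4 + i, $5
        next
    }
    { print }'
}

printf '%s\n' '+BLOCK+ 3 1 1 3' '6 6 1' | expand_blocks
```

       With the first BLOCK line of the library shown above, this would
       give color 3 to residues 1 to 10 of the first sequence.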


4 Coloring an Alignment

       If you have a cache alignment or a cache library, you can use it to
       color your alignment and make either a PostScript, html or PDF
       output. For instance, with a sequence cache such as
       sample_aln6.cache:

            PROMPT: t_coffee -other_pg seq_reformat -in=sample_aln6.aln
         -struc_in=sample_aln6.cache -struc_in_f number_fasta
         -output=color_html -out=x.html
       This will produce a colored version readable with  any  standard  web
       browser, while:

            PROMPT: t_coffee -other_pg seq_reformat -in=sample_aln6.aln
         -struc_in=sample_aln6.cache -struc_in_f number_fasta
         -output=color_pdf -out=x.pdf
       will produce a colored version readable with Acrobat Reader.

Warning: ps2pdf must be installed on your system.

       You can also use a cache library like the one shown above
       (sample_lib5.tc_lib):

         PROMPT: t_coffee -other_pg seq_reformat -in=sample_aln6.aln
         -struc_in=sample_lib5.tc_lib -output=color_html -out=x.html

5 Changing the default colors

       Colors are hard-coded in the program, but you can change them if you
       wish. Simply create a file named:

            seq_reformat.color
       that is used to declare the color values:

0 #FFAA00 1 0.2 0

       This indicates that the value 0 in the cache now corresponds to
       #FFAA00 in html, and to 1, 0.2 and 0 in RGB. The name of the file
       (seq_reformat.color) is defined by COLOR_FILE in
       programmes_define.h and can be changed before compilation. By
       default, the file is searched for in the current directory.


Selective Reformatting


1 Selectively turn some residues to lower case

Consider the following alignment (sample_aln7.aln)


CLUSTAL FORMAT for T-COFFEE Version_4.62 [http://www.tcoffee.org], CPU=0.04 sec, SCORE=0, Nseq=4, Len=28

A               CTCCGTGTCTAGGAGT-TTACGTGGAGT
B               CTGAGA----AGCCGCCTGAGGTCG---
D               CTTCGT----AGTCGT-TTAAGACA---
C               -TTAAGGTCC---AGATTGCGGAGC---
                 * ..        .*  * . *:

and the following cache (sample_aln7.cache_aln):


CLUSTAL FORMAT for T-COFFEE Version_4.62 [http://www.tcoffee.org], CPU=0.04 sec, SCORE=0, Nseq=4, Len=28

A               3133212131022021-11032122021
B               312020----023323312022132---
D               311321----021321-11002030---
C               -110022133---020112322023---


You can turn to lower case all the residues having a score between 1 and 2:

         PROMPT: t_coffee -other_pg seq_reformat -in sample_aln7.aln
         -struc_in sample_aln7.cache_aln -struc_in_f number_aln -action
         +lower '[1-2]'

CLUSTAL FORMAT for T-COFFEE Version_4.62 [http://www.tcoffee.org], CPU=0.05 sec, SCORE=0, Nseq=4, Len=28

A               CtCCgtgtCtAggAgt-ttACgtggAgt
B               CtgAgA----AgCCgCCtgAggtCg---
D               CttCgt----AgtCgt-ttAAgACA---
C               -ttAAggtCC---AgAttgCggAgC---
                 * ..        .*  * . *:



Note that residues outside the range keep their original case.
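
       The effect of +lower '[1-2]' can be reproduced on one row of the
       example with a few lines of Python. This is only a sketch of the
       idea (one cache digit per residue, range bounds inclusive), not the
       seq_reformat implementation; lower_in_range is a hypothetical name.

```python
def lower_in_range(seq, cache, low, high):
    """Lowercase every residue whose cache digit lies in [low, high].

    seq and cache are two aligned rows of equal length; gaps ('-') and
    any non-digit cache symbol leave the residue untouched.
    """
    out = []
    for residue, score in zip(seq, cache):
        if score.isdigit() and low <= int(score) <= high:
            out.append(residue.lower())
        else:
            out.append(residue)
    return "".join(out)


# Row A of sample_aln7.aln and of sample_aln7.cache_aln
row_a = lower_in_range("CTCCGTGTCTAGGAGT-TTACGTGGAGT",
                       "3133212131022021-11032122021", 1, 2)
print(row_a)
```

       Run on row A, the sketch gives the same string as the row A shown in
       the output above.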

2 Selectively modifying residues

       The range operator is supported by several other important modifiers:

            +upper: to uppercase your residues

            +lower: to lowercase your residues

            +switchcase: to selectively toggle the case of your residues

            +keep: to only keep the residues within the range

            +remove: to remove the residues within the range

            +convert: to only convert the residues within the range.

       For instance, to selectively turn all the G having a score between  1
       and 2, use:

         PROMPT: t_coffee -other_pg seq_reformat -in sample_aln7.aln
         -struc_in sample_aln7.cache_aln -struc_in_f number_aln -action
         +convert '[1-2]' CX

Extracting Portions of Dataset

Extracting portions of a dataset is something very  frequently  needed.  You
may need to extract all the sequences that contain the word human  in  their
name, or you may want all the sequences containing a simple motif.  We  show
you here how to do a couple of these things.


1 Extracting Sequences According to a Pattern

       You can extract any sequence by requesting a specific pattern to be
       found either in the name, the comment or the sequence. For instance,
       if you want to extract all the sequences whose name contains the word
       HUMAN:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +grep NAME KEEP HUMAN -output clustalw
       The modifier is "+grep". NAME indicates that the extraction  is  made
       according to the sequences names, and KEEP means that you  will  keep
       all the sequences containing the  string  HUMAN.  If  you  wanted  to
       remove all the sequences whose name  contains  the  word  HUMAN,  you
       should have typed:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +grep NAME REMOVE HUMAN -output clustalw
       Note that  HUMAN is case sensitive (Human, HUMAN and hUman  will  not
       yield the same results). You can also select the sequences  according
       to some pattern found in their COMMENT section  or  directly  in  the
       sequence. For instance

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +grep COMMENT KEEP sapiens -output clustalw
       This will keep all the sequences containing the word sapiens in the
       comment section. Last but not least, the pattern can be any legal
       perl regular expression (see www.comp.leeds.ac.uk/Perl/matching.html
       for some background on regular expressions). For instance:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +grep NAME REMOVE '[ILM]K' -output clustalw
       This will remove all the sequences whose name contains the pattern
       [ILM]K.
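
       The KEEP/REMOVE logic of +grep can be pictured with the following
       Python sketch. It is an illustration only: records are modelled as
       (name, comment, sequence) tuples, the field label "SEQ" and the
       function name are assumptions, and the comments and sequences in the
       toy records are made up (only the names come from this tutorial).

```python
import re


def grep_sequences(records, field, action, pattern):
    """Filter (name, comment, sequence) records with a regular expression.

    field  -- "NAME", "COMMENT" or "SEQ": the part of the record searched
    action -- "KEEP" keeps the matching records, "REMOVE" discards them
    """
    index = {"NAME": 0, "COMMENT": 1, "SEQ": 2}[field]
    regex = re.compile(pattern)  # case sensitive, like the pattern above
    keep_matches = action == "KEEP"
    return [rec for rec in records
            if bool(regex.search(rec[index])) == keep_matches]


records = [
    ("sp|P08246|ELNE_HUMAN", "Homo sapiens", "IVGGRRA"),    # toy comment/sequence
    ("sp|P29786|TRY3_AEDAE", "Aedes aegypti", "IVGGFQI"),   # toy comment/sequence
    ("sp|P20160|CAP7_HUMAN", "Homo sapiens", "IVGGRKA"),    # toy comment/sequence
]
for name, _, _ in grep_sequences(records, "NAME", "KEEP", "HUMAN"):
    print(name)
```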


2 Extracting Sequences by Names

       Extracting Two Sequences: If you want to extract several sequences
       in order to make a subset, you can do the following:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_seq 'sp|P29786|TRY3_AEDAE' 'sp|P35037|TRY3_ANOGA'
Note the single quotes ('). They protect the name of your sequence and
prevent the UNIX shell from interpreting it as an instruction.
       Removing Columns of Gaps. Removing intermediate sequences results  in
       columns of gaps appearing here and there. Keeping them is  convenient
       if some features are mapped on your alignment. On the other hand,  if
       you want to remove these columns you can use:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_seq 'sp|P29786|TRY3_AEDAE' 'sp|P35037|TRY3_ANOGA'
         +rm_gap
       Extracting Sub sequences: You may want to extract  portions  of  your
       sequences. This is possible if you specify the coordinates after  the
       sequences name:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_seq 'sp|P29786|TRY3_AEDAE' 20 200
         'sp|P35037|TRY3_ANOGA' 10 150 +rm_gap
       Keeping the original Sequence Names. Note that your sequences are now
       renamed according to the extraction coordinates.  You  can  keep  the
       original names by using the +keep_name modifier:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +keep_name +extract_seq 'sp|P29786|TRY3_AEDAE' 20 200
         'sp|P35037|TRY3_ANOGA' 10 150 +rm_gap
Note: +keep_name must come BEFORE +extract_seq

3 Removing Sequences by Names

       Removing Two Sequences. If you want to remove several sequences, use
       +remove_seq instead of +extract_seq.

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +remove_seq 'sp|P29786|TRY3_AEDAE' 'sp|P35037|TRY3_ANOGA'

4 Extracting Blocks Within Alignment

       Extracting a Block. If you only  want  to  keep  one  block  in  your
       alignment, use

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_block cons 150 200
       In this command line, cons indicates that you are counting the
       positions according to the consensus of the alignment (i.e. the
       positions correspond to the column numbers of the alignment). If you
       want to extract your block relative to a specific sequence, you
       should replace cons with the name of this sequence. For instance:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_block 'sp|Q03238|GRAM_RAT' 10 200
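
       The difference between the two coordinate systems can be sketched in
       Python as follows. The sketch assumes 1-based, inclusive
       coordinates and uses the small alignment sample_aln7.aln shown
       earlier; extract_block here is a hypothetical helper, not the
       seq_reformat code.

```python
def extract_block(aln, start, end, ref=None):
    """Return columns start..end (1-based, inclusive) of an alignment.

    aln -- dict mapping sequence names to gapped rows of equal length
    ref -- None to count alignment columns (the 'cons' case), or the name
           of the sequence whose ungapped residues define the coordinates
    """
    if ref is not None:
        # Columns (1-based) that hold a residue of the reference sequence.
        columns = [i for i, c in enumerate(aln[ref], start=1) if c != "-"]
        start, end = columns[start - 1], columns[end - 1]
    return {name: row[start - 1:end] for name, row in aln.items()}


aln = {
    "A": "CTCCGTGTCTAGGAGT-TTACGTGGAGT",
    "B": "CTGAGA----AGCCGCCTGAGGTCG---",
    "D": "CTTCGT----AGTCGT-TTAAGACA---",
    "C": "-TTAAGGTCC---AGATTGCGGAGC---",
}
print(extract_block(aln, 1, 6))           # columns 1 to 6 of the alignment
print(extract_block(aln, 1, 3, ref="C"))  # residues 1 to 3 of sequence C
```

       Because sequence C starts with a gap, its residues 1 to 3 sit in
       columns 2 to 4, so the two calls do not return the same block.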

5 Concatenating Alignments

       If you have extracted several blocks and you now want  to  glue  them
       together, you can use the cat_aln function

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_block cons 100 120 > block1.aln
         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.aln
         -action +extract_block cons 150 200 > block2.aln
         PROMPT: t_coffee -other_pg seq_reformat -in block1.aln -in2
         block2.aln -action +cat_aln

Note: The alignments do not need to have the same number of sequences and
the sequences do not need to come in the same order.
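
       A minimal Python sketch of what such a concatenation involves is
       given below. Rows are matched by name rather than by order. Padding
       a sequence that is absent from one block with gaps is an assumption
       made for this sketch, and cat_aln is a hypothetical function name.

```python
def cat_aln(aln1, aln2):
    """Concatenate two alignments row by row, matching rows by name."""
    len1 = len(next(iter(aln1.values())))
    len2 = len(next(iter(aln2.values())))
    # Keep the order of the first block, then add names only seen in the second.
    names = list(aln1) + [name for name in aln2 if name not in aln1]
    return {name: aln1.get(name, "-" * len1) + aln2.get(name, "-" * len2)
            for name in names}


block1 = {"A": "CTCC", "B": "CTGA"}
block2 = {"B": "AGCC", "A": "AGGA", "D": "AGTC"}  # different order, one extra row
for name, row in cat_aln(block1, block2).items():
    print(name, row)
```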

Reducing and improving your dataset

Large datasets are problematic because they can  be  difficult  to  analyze.
The problem is that when there are too many sequences, MSA programs tend  to
become very slow and inaccurate.  Furthermore,  you  will  find  that  large
datasets are difficult to display and analyze. In short, the best  size  for
an MSA dataset is between 20 and 40 sequences.  This  way  you  have  enough
sequences to see the effect of evolution, but at the same time  the  dataset
is small enough so that you can visualize your alignment  and  recompute  it
as many times as needed.

Note: If your sequence dataset is very large, seq_reformat will compute the
similarity matrix between your sequences once only. It will then keep it in
its cache and re-use it any time you re-use that dataset. In short this
means that it will take much longer to run the first time.



1 Extracting the N most informative sequences

       To be informative, a sequence  must  contain  information  the  other
       sequences do not contain. The N most informative sequences are the  N
       sequences that are as different as possible to one another, given the
       initial dataset.

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_n10  -output fasta_seq
       The argument to trim includes _seq_, which means that your sequences
       are provided unaligned. If your sequences are already aligned, you do
       not need to provide this parameter. It is generally more accurate to
       use unaligned sequences.

       The argument _n10 means you want to extract the 10 most informative
       sequences. If you would rather extract the 20% most informative
       sequences, use an upper-case N:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_N20  -output fasta_seq



2 Extracting all the sequences less than X% identical

       Removing the most similar sequences is often what people have in mind
       when they talk about removing redundancy. You can  do  so  using  the
       trim option. For instance, to generate a dataset  where  no  pair  of
       sequences has more than 50% identity, use:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_%%50_
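
       To make the notion of a "50% dataset" concrete, here is a small
       Python sketch of one possible way to remove redundancy. It is a
       simple greedy filter over already aligned rows and is not the
       algorithm used by +trim; the identity measure (identical residues
       over the columns where both rows have a residue) is also an
       assumption, and the toy sequences are made up.

```python
def percent_identity(a, b):
    """Percent identity over the columns where both rows hold a residue."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def greedy_trim(aln, max_identity):
    """Keep a row only if it is at most max_identity% identical to every
    row kept so far, so that no kept pair exceeds the threshold."""
    kept = {}
    for name, row in aln.items():
        if all(percent_identity(row, other) <= max_identity
               for other in kept.values()):
            kept[name] = row
    return kept


toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTAACGT"}
print(list(greedy_trim(toy, 50)))
```

       The result of a greedy filter depends on the order in which the
       sequences are visited, which is one reason why a dedicated tool is
       preferable on real data.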




4 Speeding up the process

       If you start from unaligned sequences, the removal of redundancy can
       be slow. If your sequences have already been aligned using a fast
       method, you can take advantage of this by replacing _seq_ with
       _aln_.

       Note the difference in speed between the following two commands:

         PROMPT: t_coffee -other_pg seq_reformat -in kinases.aln -action
         +trim _aln_%%50_




         PROMPT: t_coffee -other_pg seq_reformat -in kinases.fasta -action
         +trim _seq_%%50_
       Of course, using the MSA will mean that you rely on a more
       approximate estimation of sequence similarity.


     Forcing Specific Sequences to be kept

       Sometimes you want to  trim  while  making  sure  specific  important
       sequences remain in your dataset. You can do  so  by  providing  trim
       with a string. Trim will keep all the sequences whose  name  contains
       the string. For instance, if you want to force trim to keep  all  the
       sequences that contain the word HUMAN, no matter how similar they are
       to one another, you can run the following command:



         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_%%50 HUMAN
       When you give this command, the program will first make sure that all
       the HUMAN sequences are kept and it will then assemble your 50%
       dataset while keeping the HUMAN sequences. Note that the string is a
       perl regular expression.

       By default, the string causes all the sequences whose name it matches
       to be kept. You can also make sure that sequences whose COMMENT or
       SEQUENCE matches the string are kept. For instance, the following line

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_%%50_fCOMMENT '.apiens'
       will cause all the sequences containing the regular expression
       '.apiens' in the comment to be kept. The _f symbol before COMMENT
       stands for "_field". If you want to make a selection on the
       sequences:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_%%50_fSEQ '[MLV][RK]'
       You can also specify the sequences you want to keep. To do so, give a
       fasta file containing the names of these sequences via the -in2 flag:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -in2 sproteases_small.fasta -action +trim _seq_%%40

6 Identifying and Removing Outliers

       Sequences that are too distantly related from the  rest  of  the  set
       will sometimes have very negative effects on the  overall  alignment.
       To prevent this, it is advisable not to use them. This  can  be  done
       when trimming the sequences. For instance,

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -action +trim _seq_%%50_O40
       The symbol _O stands for Outliers. It leads to the removal of all the
       sequences that have less than 40% average identity with all the
       other sequences in the dataset.


7 Chaining Important Sequences

       In order to align two  distantly  related  sequences,  most  multiple
       sequence alignment packages perform better when  provided  with  many
       intermediate sequences that make it possible  to  "bridge"  your  two
       sequences. The modifier +chain makes it possible to  extract  from  a
       dataset a subset of intermediate sequences that chain  the  sequences
       you are interested in.

       For instance, let us consider the two sequences:

        sp|P21844|MCPT5_MOUSE      sp|P29786|TRY3_AEDAE

       These sequences have 26% identity. This is high enough to make a case
       for a homology relationship between them, but  this  is  too  low  to
       blindly trust any pairwise alignment.  With  the  names  of  the  two
       sequences  written  in  the  file  sproteases_pair.fasta,   run   the
       following command:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_large.fasta
         -in2 sproteases_pair.fasta -action +chain > sproteases_chain.fasta
       This will generate a dataset of 21 sequences, with the following
       chain of similarity between your two sequences:


N: 21 Lower: 40 Sim: 25 DELTA: 15

#sp|P21844|MCPT5_MOUSE -->93 -->sp|P50339|MCPT3_RAT -->85
-->sp|P50341|MCPT2_MERUN -->72 -->sp|P52195|MCPT1_PAPHA -->98
-->sp|P56435|MCPT1_MACFA -->97 -->sp|P23946|MCPT1_HUMAN -->81
-->sp|P21842|MCPT1_CANFA -->77 -->sp|P79204|MCPT2_SHEEP -->60
-->sp|P21812|MCPT4_MOUSE -->90 -->sp|P09650|MCPT1_RAT -->83
-->sp|P50340|MCPT1_MERUN -->73 -->sp|P11034|MCPT1_MOUSE -->76
-->sp|P00770|MCPT2_RAT -->71 -->sp|P97592|MCPT4_RAT -->66
-->sp|Q00356|MCPTX_MOUSE -->97 -->sp|O35164|MCPT9_MOUSE -->61
-->sp|P15119|MCPT2_MOUSE -->50 -->sp|Q06606|GRZ2_RAT -->54
-->sp|P80931|MCT1A_SHEEP -->40 -->sp|Q90629|TRY3_CHICK -->41
-->sp|P29786|TRY3_AEDAE

       This is probably the best way to generate a high quality alignment of
       your two sequences when using a progressive method like ClustalW,  T-
       Coffee, Muscle or Mafft.
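
       The chain printed by +chain alternates sequence names and percent
       identities, which makes it easy to inspect with a short script. The
       Python sketch below assumes exactly that layout (a leading '#', then
       'name -->identity -->name ...'); parse_chain is a hypothetical
       helper. It reports the weakest link of the chain shown above.

```python
CHAIN = """
#sp|P21844|MCPT5_MOUSE -->93 -->sp|P50339|MCPT3_RAT -->85
-->sp|P50341|MCPT2_MERUN -->72 -->sp|P52195|MCPT1_PAPHA -->98
-->sp|P56435|MCPT1_MACFA -->97 -->sp|P23946|MCPT1_HUMAN -->81
-->sp|P21842|MCPT1_CANFA -->77 -->sp|P79204|MCPT2_SHEEP -->60
-->sp|P21812|MCPT4_MOUSE -->90 -->sp|P09650|MCPT1_RAT -->83
-->sp|P50340|MCPT1_MERUN -->73 -->sp|P11034|MCPT1_MOUSE -->76
-->sp|P00770|MCPT2_RAT -->71 -->sp|P97592|MCPT4_RAT -->66
-->sp|Q00356|MCPTX_MOUSE -->97 -->sp|O35164|MCPT9_MOUSE -->61
-->sp|P15119|MCPT2_MOUSE -->50 -->sp|Q06606|GRZ2_RAT -->54
-->sp|P80931|MCT1A_SHEEP -->40 -->sp|Q90629|TRY3_CHICK -->41
-->sp|P29786|TRY3_AEDAE
"""


def parse_chain(text):
    """Return a list of (name, identity, next_name) links."""
    tokens = text.strip().lstrip("#").replace("-->", " ").split()
    names = tokens[0::2]                       # even positions: sequence names
    identities = [int(t) for t in tokens[1::2]]  # odd positions: identities
    return list(zip(names[:-1], identities, names[1:]))


links = parse_chain(CHAIN)
weakest = min(links, key=lambda link: link[1])
print(len(links) + 1, "sequences, weakest link:", weakest)
```

       The number of sequences and the lowest identity found this way agree
       with the "N: 21" and "Lower: 40" fields of the summary line.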


Manipulating DNA sequences


1 Translating DNA sequences into Proteins

       If your sequences are DNA coding sequences, it  is  always  safer  to
       align them as  proteins.  Seq_reformat  makes  it  easy  for  you  to
       translate your sequences:

         PROMPT: t_coffee -other_pg seq_reformat -in
         sproteases_small_dna.fasta -action +translate -output fasta_seq
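
       For readers who want to see what translation amounts to, here is a
       self-contained Python sketch using the standard genetic code. It is
       an illustration, not the +translate code: the handling of stop
       codons (printed as '*'), of unknown codons (printed as 'X') and of
       trailing incomplete codons (ignored) are choices made for this
       sketch.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a coding sequence read in frame from its first base."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore a trailing incomplete codon
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


print(translate("ATGGCCAAATAA"))
```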

2 Back-Translation With the Bona-Fide DNA sequences

       Once your sequences have been aligned, you  may  want  to  turn  your
       protein alignment back into a DNA alignment, either to do  phylogeny,
       or maybe in order to design PCR probes. To do so, use  the  following
       command:

         PROMPT: t_coffee -other_pg seq_reformat -in
         sproteases_small_dna.fasta -in2 sproteases_small.aln -action
         +thread_dna_on_prot_aln -output clustalw
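
       The principle of threading DNA onto a protein alignment can be
       sketched in a few lines of Python: every residue of the aligned
       protein consumes one codon of the original DNA and every gap becomes
       three gap characters. This sketch assumes that the DNA is in frame
       and does not check that each codon really encodes the residue it is
       paired with; the function name is hypothetical.

```python
def thread_dna(dna, aligned_protein):
    """Spread the codons of dna along a gapped protein row."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    n_residues = sum(aa != "-" for aa in aligned_protein)
    if len(codons) < n_residues:
        raise ValueError("not enough codons for this protein row")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


print(thread_dna("ATGGCCAAA", "M-AK"))
```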

3 Finding the Bona-Fide Sequences for the Back-Translation

       Use the online server Protogene, available from www.tcoffee.org.


4 Guessing Your Back Translation

       Back-translating means turning a protein sequence into a DNA
       sequence. If you do not have the original DNA sequence, this
       operation will not be exact, owing to the fact that the genetic code
       is degenerate. Yet, if a random back-translation is fine with you,
       you can use the following command.

         PROMPT: t_coffee -other_pg seq_reformat -in
         sproteases_small_dna.fasta -in2 sproteases_small.aln -action
         +thread_dna_on_prot_aln  -output clustalw
       In this process, codons are chosen randomly. For instance, if an
       amino acid has four codons, the back-translation process will
       randomly select one of them. If you need more sophisticated
       back-translations that take into account the codon bias, we suggest
       you use more specific tools such as:
       alpha.dmi.unict.it/~ctnyu/bbocushelp.html


Fetching a Structure

There are many reasons why you may need a  structure.  T-Coffee  contains  a
powerful utility named extract_from_pdb that makes it possible to fetch  the
PDB coordinates of a structure or its FASTA  sequence  without  requiring  a
local installation.

By default, extract_from_pdb will start looking for  the  structure  in  the
current directory; it will then look it up locally (PDB_DIR) and  eventually
try to fetch it from the  web  (via  a  wget  to  www.rcsb.org).  All  these
settings can  be  customized  using  environment  variables  (see  the  last
section).


1 Fetching a PDB structure

       If you want to fetch the chain E of the PDB structure 1PPG,  you  can
       use:

         PROMPT: t_coffee -other_pg extract_from_pdb -infile 1PPGE

2 Fetching The Sequence of a PDB structure

       To fetch the FASTA sequence, use:

         PROMPT: t_coffee -other_pg extract_from_pdb -infile 1PPGE -fasta




   Adapting extract_from_pdb to your own environment

       If you have the  PDB  installed  locally,  simply  set  the  variable
       PDB_DIR to the absolute location of the directory in which the PDB is
       installed. The PDB can either be installed in its divided form or  in
       its full form.

       If the file you are looking for is neither in the  current  directory
       nor in the local PDB version, extract_from_pdb will try to  fetch  it
       from rcsb. If you do not want this to happen, you should  either  set
       the  environment  variable  NO_REMOTE_PDB_DIR  to  1   or   use   the
       -no_remote_pdb_dir flag:

         export NO_REMOTE_PDB_DIR=1
         or
         t_coffee -other_pg extract_from_pdb -infile 1PPGE -fasta
         -no_remote_pdb_dir







Dealing With Phylogenetic Trees


1 Comparing two phylogenetic trees

       Consider the following file (sample_tree2.dnd):


   ((   A:0.50000,   C:0.50000):0.00000,(   D:0.00500,   E:0.00500):0.99000,
B:0.50000);

       and the file sample_tree3.dnd.


   ((   E:0.50000,   C:0.50000):0.00000,(   A:0.00500,   B:0.00500):0.99000,
D:0.50000);

       You can compare them using:

         PROMPT: t_coffee -other_pg seq_reformat -in sample_tree2.dnd -in2
         sample_tree3.dnd -action +tree_cmp -output newick

tree_cmp|T: 75 W: 71.43 L: 50.50
tree_cmp|8 Nodes in T1 with 5 Sequences
tree_cmp|T: ratio of identical nodes
tree_cmp|W: ratio of identical nodes weighted with the min Nseq below node
tree_cmp|L: average branch length similarity
(( A:1.00000, C:1.00000):-2.00000,( D:1.00000, E:1.00000):-2.00000, B:1.00000);

       Please consider the following aspects when exploiting these results:

           -The comparison is made on the unrooted trees.

           -T: Fraction of the branches conserved between the two trees.
           This is obtained by considering the split induced by each branch
           and by checking whether that split is found in both trees.

           -W: Fraction of the branches conserved between the two trees,
           where each branch is weighted with MIN, the smaller of the
           numbers of leaves found on its left and on its right.

           -L: Fraction of branch length difference between the two
           considered trees.

       The last portion of the output contains a tree where distances have
       been replaced by the number of leaves under the considered node:

           -Positive values (i.e. 2, 5) indicate a node common to both trees
           and correspond to MIN.

           -Negative values indicate a node found in tree1 but not in tree2.

           -The higher this value, the deeper the node.

You can extract this tree for further usage by typing:

            cat outfile | grep -v "tree_cmp"
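
       The idea of comparing trees through the splits induced by their
       branches can be illustrated with the Python sketch below. It only
       looks at the internal branches (splits with at least two leaves on
       each side) and therefore does not reproduce the T, W and L figures
       printed by +tree_cmp, whose exact counting conventions are not
       described here. The parser is a minimal one that assumes plain
       Newick without quoted names, and all function names are
       hypothetical.

```python
import re


def clades(newick):
    """Return the leaf set found under every pair of parentheses."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", newick))
    stack, found = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.append(frozenset(done))
            if stack:
                stack[-1] |= done  # leaves also belong to the parent node
        elif tok not in (",", ";"):
            name = tok.split(":")[0]  # drop the branch length
            if name:
                stack[-1].add(name)
    return found


def splits(newick):
    """Non-trivial splits of the unrooted tree, each named by the side
    that does not contain the alphabetically first leaf."""
    found = clades(newick)
    leaves = max(found, key=len)
    ref = min(leaves)
    result = set()
    for clade in found:
        side = clade if ref not in clade else leaves - clade
        if 2 <= len(side) <= len(leaves) - 2:
            result.add(frozenset(side))
    return result


def shared_fraction(tree1, tree2):
    """Fraction of the internal splits of tree1 also found in tree2."""
    s1, s2 = splits(tree1), splits(tree2)
    return len(s1 & s2) / len(s1) if s1 else 1.0


TREE2 = "((A:0.50000,C:0.50000):0.00000,(D:0.00500,E:0.00500):0.99000,B:0.50000);"
TREE3 = "((E:0.50000,C:0.50000):0.00000,(A:0.00500,B:0.00500):0.99000,D:0.50000);"
print(shared_fraction(TREE2, TREE3))
```

       For the two sample trees none of the internal splits is shared,
       which is consistent with the negative values attached to both
       internal nodes in the output tree above.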



2 Pruning Phylogenetic Trees

       Pruning removes leaves from an existing tree and recomputes distances
       so that no information is lost.

       Consider the file sample_tree2.dnd:


   ((   A:0.50000,   C:0.50000):0.00000,(   D:0.00500,   E:0.00500):0.99000,
B:0.50000);

       And the file sample_seq8.seq


>A


>B


>C


>D

Note: sample_seq8.seq is merely a FASTA file listing the names to keep. The
sequences themselves can be omitted, but you can also leave them in, at your
convenience.


         PROMPT: t_coffee -other_pg seq_reformat -in sample_tree2.dnd -in2
         sample_seq8.seq -action +tree_prune -output newick

 (( A:0.50000, C:0.50000):0.00000, B:0.50000, D:0.99500);


                    Building Multiple Sequence Alignments


How to generate The Alignment You Need?


1 What is a Good Alignment?

       This is a trick question. A good alignment is an alignment that makes
       it possible to  do  good  biology.  If  you  want  to  reconstruct  a
       phylogeny, a good alignment  will  be  an  alignment  leading  to  an
       accurate reconstruction.

       In practice, the alignment community has become used to measuring the
       accuracy of alignment methods using structures. Structures are
       relatively easy to align correctly, even when the sequences have
       diverged quite a lot. The most common usage is therefore to compare
       structure based alignments with their sequence based counterparts and
       to evaluate the accuracy of the method using this criterion.

       Unfortunately it is not easy to establish structure based standards
       of truth. Several of these exist and they do not necessarily agree.
       To summarize, the situation is roughly as follows:

            Above 40% identity (within  the  reference  datasets),  all  the
       reference collections agree with one another and all the  established
       methods give roughly  the  same  results.  These  alignments  can  be
       trusted blindly.

            Below 40% identity within the reference datasets, the  reference
       collections stop agreeing and the methods do not give consistent
       results. In this range of similarity it is not necessarily easy to
       determine who is right and who is wrong, although most studies seem
       to indicate that consistency based methods (T-Coffee, Mafft-slow and
       ProbCons) have an edge over traditional methods.

       When dealing with distantly related sequences, the only way to
       produce reliable alignments is to use structural information.
       T-Coffee provides many facilities to do so in a seamless fashion.
       Several important factors need to be taken into account when
       selecting an alignment method:

       -The best methods do not always do best. Given a difficult dataset,
       the best method is only more likely to deliver the best alignment,
       but there is no guarantee it will do so. It is very much like betting
       on the horse with the best odds.

       -The difference in accuracy (as measured on reference datasets)
       between all the available methods is not incredibly high. It is
       unclear whether this is an artifact caused by the use of "easy"
       reference alignments, or whether this is a reality. The only thing
       that can dramatically change the accuracy of the alignment is the use
       of structural information.

       Last, but not least, bear in mind that these methods have  only  been
       evaluated by  comparison  with  reference  structure  based  sequence
       alignments. This is merely one criterion among many. In theory, these
       methods should be evaluated for their ability to  produce  alignments
       that  lead  to  accurate  trees,  good  profiles  or   good   models.
       Unfortunately, these evaluation procedures do not yet exist.




2 The Main Methods and their Scope

       There are many MSA packages  around.  The  main  ones  are  ClustalW,
       Muscle, Mafft, T-Coffee and ProbCons. You can almost forget about the
       other packages, as there is virtually nothing you could do with  them
       that you will not be able to do with these packages.

       These packages offer a complex trade-off between speed, accuracy  and
       versatility.


    ClustalW: everywhere you look

       ClustalW is still the most widely used  multiple  sequence  alignment
       package. Yet things are  gradually  changing  as  recent  tests  have
       consistently shown that ClustalW is neither the most accurate nor the
       fastest package around. This being said, ClustalW is  everywhere  and
       if your sequences are similar enough,  it  should  deliver  a  fairly
       reasonable alignment.


    Mafft and Muscle: Aligning Many Sequences

       If you have many sequences to align, Muscle or Mafft are the obvious
       choice. Mafft is often described as the fastest and the most
       efficient. This is not entirely true. In its  fast  mode  (FFT-NS-1),
       Mafft is similar to Muscle and although it is fairly accurate  it  is
       about 5 points less accurate  than  the  consistency  based  packages
       (ProbCons and T-Coffee). In its most accurate  mode  (L-INS-i)  Mafft
       uses local alignments and consistency. It becomes much more  accurate
       but also slower, and more sensitive to the number of sequences.

       The alignments generated using the fast modes of these programs  will
       be very suitable for several important applications such as:

            -Distance based phylogenetic reconstruction (NJ trees)

            -Secondary structure predictions

       However, they may not be suitable for more refined applications such as:

            -Profile construction

            -Structure Modeling

            -3D structure prediction

            -Function analysis

       In that case you may need to use more accurate methods.


    T-Coffee and ProbCons: Slow and Accurate

       T-Coffee works by first assembling a library and then by turning this
       library into an alignment. The library is a list of potential pairs
       of residues. Not all of them are compatible, and the job of the
       algorithm is to make sure that as many constraints as possible find
       their way into the final alignment. Each library line is a constraint
       and the purpose is to assemble the alignment that accommodates as
       many of the constraints as possible.

       It is very much like building a high school schedule, where each
       teacher says something like "I need my Monday morning", "I can't come
       on Thursday afternoon", and so on. In the end you want a schedule that
       makes everybody happy, if possible. The nice thing about the library
       is that it can be used as a medium to combine as many methods as one
       wishes. It is just a matter of generating the right constraints with
       the right method and compiling them into the library.

       ProbCons and Mafft (L-INS-i) use a similar algorithm, but with a
       Bayesian twist in the case of ProbCons. In practice, however,
       ProbCons and T-Coffee give very similar results and have similar
       running times. Mafft is significantly faster.

       All these packages are ideal for the following applications:

            -Profile reconstruction

            -Function analysis

            -3D Prediction


3 Choosing The Right Package

       Each available package has something going for it. It is just a
       matter of knowing what you want to do. T-Coffee is probably the most
       versatile, but this comes at a price: it is currently slower than
       many alternative packages.

       In the rest of this tutorial we give some hints on how to  carry  out
       each of these applications with T-Coffee.

|Method         |File                     |Score         |
|Expresso       |sproteases_small.expresso|1.33          |
|T-Coffee       |sproteases_small.tc_aln  |1.35          |
|ClustalW       |sproteases_small.cw_aln  |1.52          |
|Mafft          |sproteases_small.mafft   |1.36          |
|Muscle         |sproteases_small.muscle  |1.34          |


       As expected, Expresso delivers the best alignment from a structural
       point of view. This makes sense, since Expresso explicitly USES
       structural information. The other figures show us that the structure
       based alignment is only marginally better than most sequence based
       alignments. Muscle seems to have a small edge here, although the
       reality is that all these figures are impossible to distinguish, with
       the notable exception of ClustalW.


4 Identifying the most distantly related sequences in your dataset

       In order to identify the most distantly related sequences in a
       dataset, you can use the seq_reformat utility to compare all the
       sequences two by two and pick the two having the lowest level of
       identity:

         PROMPT: t_coffee -other_pg seq_reformat -in sproteases_small.fasta
         -output sim_idscore | grep TOP |sort -rnk3
       The sim_idscore output mode indicates that every pair of sequences
       will need to be aligned when estimating the similarity. The output
       (below) indicates that the two sequences having the lowest level of
       identity are AEDAE and MOUSE. It may not be a bad idea to choose
       these sequences (if possible) for evaluating your MSA.


.


TOP   16   10   28.00   sp|P29786|TRY3_AEDAE   sp|Q6H321|KLK2_HORSE    28.00
TOP   16    7   28.00   sp|P29786|TRY3_AEDAE   sp|P08246|ELNE_HUMAN    28.00
TOP   16    1   28.00   sp|P29786|TRY3_AEDAE   sp|P08884|GRAE_MOUSE    28.00
TOP   15   14   27.00   sp|P80015|CAP7_PIG     sp|P00757|KLKB4_MOUSE   27.00
TOP   12    9   27.00   sp|P20160|CAP7_HUMAN   sp|Q91VE3|KLK7_MOUSE    27.00
TOP    9    7   27.00   sp|Q91VE3|KLK7_MOUSE   sp|P08246|ELNE_HUMAN    27.00
TOP   16    2   26.00   sp|P29786|TRY3_AEDAE   sp|P21844|MCPT5_MOUSE   26.00
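
       If you prefer not to read the list by eye, the least similar pair can
       be picked with a short Python sketch. It assumes the column layout
       seen in the TOP lines above (the identity in the fourth field and
       the two names in the fifth and sixth fields);
       lowest_identity_pair is a hypothetical helper.

```python
def lowest_identity_pair(lines):
    """Return (identity, name1, name2) for the least similar pair."""
    best = None
    for line in lines:
        fields = line.split()
        if len(fields) < 6 or fields[0] != "TOP":
            continue  # ignore anything that is not a complete TOP record
        record = (float(fields[3]), fields[4], fields[5])
        if best is None or record[0] < best[0]:
            best = record
    return best


output = [
    "TOP 16 1 28.00 sp|P29786|TRY3_AEDAE sp|P08884|GRAE_MOUSE 28.00",
    "TOP 15 14 27.00 sp|P80015|CAP7_PIG sp|P00757|KLKB4_MOUSE 27.00",
    "TOP 16 2 26.00 sp|P29786|TRY3_AEDAE sp|P21844|MCPT5_MOUSE 26.00",
]
print(lowest_identity_pair(output))
```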


Evaluating an Alignment according to your own Criterion


1 Establishing Your Own Criterion

       Any kind of feature can easily be turned into an evaluation grid. For
       instance, the protease sequences we have been using here have a well
       characterized binding site. A possible evaluation can be made as
       follows. Let us consider the Swissprot annotation of the two most
       distantly related sequences. These two sequences contain the charge
       relay system of the proteases. We can use it to build an evaluation
       library: in P29786, the first Histidine is at position 68, while in
       P21844 this Histidine is at position 66. We can therefore build a
       library that will check whether these residues are properly aligned
       in any MSA. The library will look like this:


! TC_LIB_FORMAT_01
2
sp|P21844|MCPT5_MOUSE 247 MHLLTLHLLLLLLGSSTKAGEIIGGTECIPHSRPYMAYLEIVTSENYLSACSGFLIRRNFVLTAAHCAGRSITVLLGAHNKTSKEDTWQKLEVEKQFLHPKYDENLVVHDIMLLKLKEKAKLTLGVGTLPLSANFNFIPPGRMCRAVGWGRTNVNEPASDTLQEVKMRLQEPQACKHFTSFRHNSQLCVGNPKKMQNVYKGDSGGPLLCAGIAQGIASYVHRNAKPPAVFTRISHYRPWINKILREN
sp|P29786|TRY3_AEDAE 254 MNQFLFVSFCALLDSAKVSAATLSSGRIVGGFQIDIAEVPHQVSLQRSGRHFCGGSIISPRWVLTRAHCTTNTDPAAYTIRAGSTDRTNGGIIVKVKSVIPHPQYNGDTYNYDFSLLELDESIGFSRSIEAIALPDASETVADGAMCTVSGWGDTKNVFEMNTLLRAVNVPSYNQAECAAALVNVVPVTEQMICAGYAAGGKDSCQGDSGGPLVSGDKLVGVVSWGKGCALPNLPGVYARVSTVRQWIREVSEV
#1 2
    66  68     100
! SEQ_1_TO_N

       You simply need to cut and paste this library into a file and use this
       file as a library to measure the consistency between your alignment
       and the correspondences declared in your library. The following
       command line also makes it possible to visually display the agreement
       between your sequences and the library.

         PROMPT: t_coffee -infile sproteases_small.aln -lib
         charge_relay_lib.tc_lib  -score -output html
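
       The idea behind this evaluation can be sketched in a few lines of
       Python. This is only an unweighted illustration, not the scoring
       that t_coffee -score actually performs: the function names
       (column_of, library_agreement) and the toy alignment are
       hypothetical, and each constraint simply counts as satisfied when
       both residues fall in the same column.

```python
def column_of(aligned_row, residue):
    """Return the 0-based column holding the given 1-based residue, or None."""
    seen = 0
    for col, char in enumerate(aligned_row):
        if char != "-":
            seen += 1
            if seen == residue:
                return col
    return None


def library_agreement(msa, constraints):
    """Fraction of (name_a, res_a, name_b, res_b) constraints found in one column."""
    if not constraints:
        return 0.0
    satisfied = 0
    for name_a, res_a, name_b, res_b in constraints:
        col_a = column_of(msa[name_a], res_a)
        col_b = column_of(msa[name_b], res_b)
        if col_a is not None and col_a == col_b:
            satisfied += 1
    return satisfied / len(constraints)


# Hypothetical toy alignment (not the protease data set).
msa = {"seqA": "MK-HLT", "seqB": "MKAH-T"}
constraints = [("seqA", 3, "seqB", 4),   # H of seqA with H of seqB: same column
               ("seqA", 4, "seqB", 5)]   # L of seqA with T of seqB: different columns
print(library_agreement(msa, constraints))
```

       With the real data, the single constraint would pair residue 66 of
       P21844 with residue 68 of P29786.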



                  Integrating External Methods In T-Coffee

The real power of T-Coffee is its ability to seamlessly combine many
methods into one. While we try to integrate as many methods as we can in
the default distribution, we do not have the means to be exhaustive, and if
you desperately need your favourite method to be integrated, you will need
to bite the bullet.


What Are The Methods Already Integrated in T-Coffee

       Although it does not necessarily do so explicitly, T-Coffee always
       ends up combining libraries. Libraries are collections of pairs of
       residues. Given a set of libraries, T-Coffee makes an attempt to
       assemble the alignment with the highest level of consistency. You can
       think of the alignment as a timetable. Each library pair would be a
       request from students or teachers, and the job of T-Coffee would be
       to assemble the timetable that makes as many people as possible
       happy.

       In T-Coffee, methods replace the students/professors  as  constraints
       generators. These methods can be any standard/non standard  alignment
       methods that can be used to generate alignments  (pairwise,  most  of
       the  time).  These  alignments  can  be  viewed  as  collections   of
       constraints that must be fit within the final alignment.  Of  course,
       the constraints do not have to agree with one another.
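
       To make the notion of an alignment as a collection of constraints
       more concrete, here is a minimal Python sketch. It is an
       illustration only: the function name alignment_to_pairs and the toy
       sequences are hypothetical, and no weights are attached to the
       pairs.

```python
def alignment_to_pairs(aligned_a, aligned_b):
    """List the (residue_in_a, residue_in_b) pairs matched by a pairwise alignment.

    Residue indices are 1-based; columns containing a gap produce no pair.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    pos_a = pos_b = 0
    for char_a, char_b in zip(aligned_a, aligned_b):
        if char_a != "-":
            pos_a += 1
        if char_b != "-":
            pos_b += 1
        if char_a != "-" and char_b != "-":
            pairs.append((pos_a, pos_b))
    return pairs


# Hypothetical toy pairwise alignment.
print(alignment_to_pairs("MK-HLT", "MKAH-T"))
# [(1, 1), (2, 2), (3, 4), (5, 5)]
```

       Two methods aligning the same sequences may return different lists
       of pairs, which is why the constraints do not have to agree.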

       This section shows you which methods are available in T-Coffee, and
       how you can add your own methods, either through direct
       parameterization or via a perl script. There are two kinds of
       methods: internal and external. For the internal methods, you
       simply need to have T-Coffee up and running. The external methods
       require you to install a package.


1 List of INTERNAL Methods

       Built-in methods can be requested using the following names.

         fast_pair     Makes a global fasta style pairwise alignment. For
                    proteins, matrix=blosum62mt, gep=-1, gop=-10, ktup=2.
                    For DNA, matrix=idmat (id=10), gep=-1, gop=-20, ktup=5.
                    Each pair of residues is given a score that is a
                    function of the weighting mode defined by -weight.

         slow_pair     Identical to fast_pair, but does a full dynamic
                    programming, using the Myers and Miller algorithm. This
                    method is recommended if your sequences are distantly
                    related.

         ifast_pair

         islow_pair

                    Makes a global fasta alignment using the previously
                    computed pairs as a library. `i` stands for iterative.
                    Each pair of residues is given a score that is a
                    function of the weighting mode defined by -weight. The
                    library used for the computation is the one computed
                    before the method is used. The result therefore depends
                    on the order in which methods and libraries are set via
                    the -in flag.

         align_pdb_pair      Uses the align_pdb routine to align two
                    structures. The pairwise scores are those returned by
                    the align_pdb program. If a structure is missing,
                    fast_pair is used instead. Each pair of residues is
                    given a score defined by align_pdb.
                    [UNSUPPORTED]

         lalign_id_pair     Same as lalign_rs_pair, but using the level of
                    identity as a weight.

         lalign_s_pair Same as above but also does the self comparison (s
                    stands for self). This is needed when extracting
                    repeats. The weights used that way are based on
                    identity.

         lalign_rs_s_pair   Same as above but also does the self comparison
                    (s stands for self). This is needed when extracting
                    repeats.

         Matrix  Any matrix can be requested. Simply indicate as a method
                    the name of the matrix preceded with an X (e.g.
                    Xpam250mt). If you indicate such a matrix, all the
                    other methods will simply be ignored, and a standard
                    fast progressive alignment will be computed. If you
                    want to change the substitution matrix used by the
                    methods, use the -matrix flag.

         cdna_fast_pair     This method computes the pairwise alignment of
                    two cDNA sequences. It is a fast_pair alignment that
                    only takes into account the amino-acid similarity and
                    uses different penalties for amino-acid insertions and
                    frameshifts. This alignment is turned into a library
                    where matched nucleotides receive a score equal to the
                    average level of identity at the amino-acid level. This
                    mode is intended to clean cDNA obtained from ESTs, or
                    to align pseudo-genes.

         WARNING: This method is currently unsupported.


    PLUG-INs:List OF EXTERNAL METHODS


2 Plug-In: Using Methods Integrated in T-Coffee



       The following methods are external. They correspond to packages
       developed by other groups that you may want to run within T-Coffee.
       We are very open to extending these options and we welcome any
       request to add an extra interface. The following table lists the
       methods that can be used as plug-ins:


Package          Where From
==========================================================
ClustalW         can interact with t_coffee
----------------------------------------------------------
Poa              http://www.bioinformatics.ucla.edu/poa/
----------------------------------------------------------
Muscle           http://www.drive5.com/muscle/
----------------------------------------------------------
ProbCons         http://probcons.stanford.edu/
----------------------------------------------------------
MAFFT            http://www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/
----------------------------------------------------------
Dialign-T        http://dialign-t.gobics.de/
----------------------------------------------------------
PCMA             ftp://iole.swmed.edu/pub/PCMA/
----------------------------------------------------------
sap              structure/structure comparisons
                 (obtain it from W. Taylor, NIMR-MRC).
----------------------------------------------------------
Blast            www.ncbi.nlm.nih.gov
----------------------------------------------------------
Fugue            protein to structure alignment program
                 http://www-cryst.bioc.cam.ac.uk/fugue/download.html






       Once installed, most of these methods can be invoked as either
       pairwise or multiple alignment methods:

         clustalw_pair Uses clustalw (default parameters) to align two
                    sequences. Each pair of residues is given a score that
                    is a function of the weighting mode defined by -weight.

         clustalw_msa  Makes a multiple alignment using ClustalW and adds it
                    to the library. Each pair of residues is given a score
                    that is a function of the weighting mode defined by
                    -weight.

         probcons_pair Probcons package: probcons.stanford.edu/.

         probcons_msa  idem.

         muscle_pair   Muscle package: www.drive5.com/muscle/.

         muscle_msa    idem.

         mafft_pair    MAFFT package:
                    www.biophys.kyoto-u.ac.jp/~katoh/programs/align/mafft/.

         mafft_msa     idem.

         pcma_msa      pcma package

         pcma_pair     pcma package

         poa_msa       poa package

         poa_pair      poa package

         dialignt_pair dialignt package

         dialignt_msa  dialignt package

         sap_pair      Uses sap to align two structures. Each pair of
                    residues is given a score defined by sap. You
                    must have sap installed on your system to use this
                    method.

         fugue_pair    Uses a standard fugue installation to make a
                    sequence/structure alignment. Fugue installation must
                    be standard. It does not have to include all the fugue
                    packages but only:

         1- joy, melody, fugueali, sstruc, hbond

         2-copy fugue/classdef.dat     /data/fugue/SUBST/classdef.dat

         OR

         Setenv MELODY_CLASSDEF=<location>

         Setenv MELODY_SUBST=fugue/allmat.dat

          All the configuration files must be in the right location.



       To request a method, see the -in or the -method flag.  For  instance,
       if you wish to request the use of fast_pair and  lalign_id_pair  (the
       current default):

         PROMPT: t_coffee -seq sample_seq1.fasta -method
         fast_pair,lalign_id_pair

3 Modifying the parameters of Internal and External Methods

       It is possible to modify on the fly  the  parameters  of  hard  coded
       methods:

         PROMPT: t_coffee sample_seq1.fasta -method
         slow_pair@EP@MATRIX@pam250mt@GOP@-10@GEP@-1
       EP stands for Extra Parameters. These parameters will supersede any
       other parameters.
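
       When such method strings are generated from a script, a small
       helper avoids typing mistakes in the @-separated syntax. The Python
       sketch below only assembles the string shown above; the helper name
       method_with_parameters is hypothetical and it does not check that
       the parameter names are accepted by T-Coffee.

```python
def method_with_parameters(method, **params):
    """Build '<method>@EP@<NAME>@<value>...' from keyword arguments."""
    if not params:
        return method
    fields = [method, "EP"]
    for name, value in params.items():
        # Parameter names are upper-cased, as in the example command line.
        fields.extend([name.upper(), str(value)])
    return "@".join(fields)


print(method_with_parameters("slow_pair", matrix="pam250mt", gop=-10, gep=-1))
# slow_pair@EP@MATRIX@pam250mt@GOP@-10@GEP@-1
```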


Integrating External Methods

If the method you need is not already included in T-Coffee,  you  will  need
to integrate it yourself. We give you here some guidelines on how to do so.


1 Direct access to external methods

       A special method exists in T-Coffee that can be used  to  invoke  any
       existing program:

         PROMPT: t_coffee sample_seq1.fasta -method=em@clustalw@pairwise
       In this context, ClustalW is a method that can be run with the
       following command line:


         method -infile=<infile> -outfile=<outfile>

       Clustalw can be replaced with any method using a similar  syntax.  If
       the program you want to use cannot be run this way,  you  can  either
       write a perl wrapper that fits the bill or  write  a  tc_method  file
       adapted to your program (cf next section).

       This special method (em, external method) uses the following syntax:


         em@<method>@<aln_mode:pairwise|s_pairwise|multiple>


2 Customizing an external method (with parameters) for T-Coffee

       T-Coffee can run external methods, using a tc_method file that can be
       used in place of an established method. Two such files are
       incorporated in T-Coffee. You can dump them and customize them
       according to your needs:

         PROMPT: t_coffee -other_pg unpack_clustalw_method.tc_method
         PROMPT: t_coffee -other_pg unpack_generic_method.tc_method
       The second file (generic_method.tc_method) contains many hints on how
       to customize your new method. The first file is a very
       straightforward example of how to have t_coffee run ClustalW
       (provided you have it installed) with a set of parameters you may be
       interested in:





*TC_METHOD_FORMAT_01
***************clustalw_method.tc_method*********
EXECUTABLE  clustalw
ALN_MODE    pairwise
IN_FLAG     -INFILE=
OUT_FLAG    -OUTFILE=
OUT_MODE    aln
PARAM       -gapopen=-10
SEQ_TYPE    S
*************************************************

       This configuration file will cause T-Coffee  to  emit  the  following
       system call:


         clustalw -INFILE=tmpfile1 -OUTFILE=tmpfile2 -gapopen=-10

       Note that ALN_MODE instructs t_coffee to run clustalw on  every  pair
       of sequences (cf generic_method.tc_method for more details).
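
       The way such a file is turned into a system call can be sketched in
       Python as follows. This is a simplified illustration and not
       T-Coffee's own parser: the function names (parse_method_file,
       build_call) are hypothetical, the fields are joined with single
       spaces, and the no_name and &nbsp conventions of the generic method
       file are not handled.

```python
METHOD_FILE = """\
*TC_METHOD_FORMAT_01
EXECUTABLE  clustalw
ALN_MODE    pairwise
IN_FLAG     -INFILE=
OUT_FLAG    -OUTFILE=
OUT_MODE    aln
PARAM       -gapopen=-10
SEQ_TYPE    S
"""


def parse_method_file(text):
    """Collect 'KEY VALUE' lines; lines starting with '*' are skipped."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "PARAM":
            # Several PARAM lines are concatenated into one parameter string.
            fields["PARAM"] = (fields.get("PARAM", "") + " " + value).strip()
        else:
            fields[key] = value
    return fields


def build_call(fields, infile, outfile):
    """Assemble EXECUTABLE, IN_FLAG+infile, OUT_FLAG+outfile and PARAM."""
    parts = [
        fields["EXECUTABLE"],
        fields["IN_FLAG"] + infile,
        fields["OUT_FLAG"] + outfile,
        fields.get("PARAM", ""),
    ]
    return " ".join(part for part in parts if part)


fields = parse_method_file(METHOD_FILE)
print(build_call(fields, "tmpfile1", "tmpfile2"))
# clustalw -INFILE=tmpfile1 -OUTFILE=tmpfile2 -gapopen=-10
```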

       The tc_method files are treated like any standard established  method
       in T-Coffee. For instance, if the file  clustalw_method.tc_method  is
       in your current directory, run:

         PROMPT: t_coffee sample_seq1.fasta -method
         clustalw_method.tc_method

3 Managing a collection of method files

       It may be convenient to store all the method files in a single
       location on your system. By default, t_coffee will look in
       the directory ~/.t_coffee/methods/. You can change this either by
       modifying METHODS_4_TCOFFEE in define_headers.h (and recompiling)
       or by setting the environment variable METHODS_4_TCOFFEE.
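
       A script that prepares method files may want to resolve the same
       location. The Python sketch below mirrors the rule just described
       (environment variable first, default directory otherwise); the
       function name methods_directory is hypothetical and the code is not
       taken from T-Coffee itself.

```python
import os


def methods_directory(environ=None):
    """Return METHODS_4_TCOFFEE if set, else the default ~/.t_coffee/methods."""
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".t_coffee", "methods")
    return environ.get("METHODS_4_TCOFFEE", default)


# Example with an explicit (hypothetical) environment instead of os.environ.
print(methods_directory({"METHODS_4_TCOFFEE": "/opt/tc_methods"}))
# /opt/tc_methods
```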


Advanced Method Integration

       It may sometimes be difficult to customize the program you want to
       use through a tc_method file. In that case, you may prefer to use an
       external perl script to run your external application. This can
       easily be achieved using the generic_method.tc_method file.


*TC_METHOD_FORMAT_01
***************generic_method.tc_method*********
EXECUTABLE  tc_generic_method.pl
ALN_MODE    pairwise
IN_FLAG     -infile=
OUT_FLAG    -outfile=
OUT_MODE    aln
PARAM       -method clustalw
PARAM       -gapopen=-10
SEQ_TYPE    S
*************************************************
* Note: &nbsp can be used for white spaces

       When you run this method:

         PROMPT: t_coffee -other_pg unpack_generic_method.tc_method
         PROMPT: t_coffee sample_seq1.fasta -method generic_method.tc_method

       T-Coffee runs the script tc_generic_method.pl on your data. It also
       provides the script with parameters. In this case -method clustalw
       indicates that the script should run clustalw on your data. The
       script tc_generic_method.pl is a perl file, incorporated in t_coffee
       and automatically generated by it. Over time, this script will be
       the place where novel methods are integrated, making it possible to
       run all available methods. You can dump the script using the
       following command:

         PROMPT: t_coffee -other_pg=unpack_tc_generic_method.pl

       Note: If there is a copy of that script in your local directory, that
       copy will be used in place of the internal copy of T-Coffee.


1 The Mother of All method files.


*TC_METHOD_FORMAT_01
******************generic_method.tc_method*************
*
*       Incorporating new methods in T-Coffee
*       Cedric Notredame 17/04/05
*
*******************************************************
*This file is a method file
*Copy it and adapt it to your need so that the method
*you want to use can be incorporated within T-Coffee
*******************************************************
*                  USAGE                              *
*******************************************************
*This file is passed to t_coffee via -in:
*
*     t_coffee -in Mgeneric_method.method
*
*     The method is passed to the shell using the following
*call:
*<EXECUTABLE><IN_FLAG><seq_file><OUT_FLAG><outname><PARAM>
*
*Conventions:
*<FLAG_NAME>     <TYPE>           <VALUE>
*<VALUE>:   no_name    <=> Replaced with a space
*<VALUE>:   &nbsp      <=> Replaced with a space
*
*******************************************************
*                  EXECUTABLE                         *
*******************************************************
*name of the executable
*passed to the shell: executable
*
EXECUTABLE  tc_generic_method.pl
*
*******************************************************
*                  ALN_MODE                           *
*******************************************************
*pairwise   ->all Vs all (no self )[(n^2-n)/2 aln]
*m_pairwise ->all Vs all (no self)[n^2-n]^2
*s_pairwise ->all Vs all (self): [n^2-n]/2 + n
*multiple   ->All the sequences in one go
*
ALN_MODE         pairwise
*
*******************************************************
*                  OUT_MODE                           *
*******************************************************
* mode for the output:
*External methods:
* aln -> alignmnent File (Fasta or ClustalW Format)
* lib-> Library file (TC_LIB_FORMAT_01)
*Internal Methods:
* fL -> Internal Function returning a Lib (Librairie)
* fA -> Internal Function returning an Alignmnent
*
OUT_MODE         aln
*
*******************************************************
*                  IN_FLAG                             *
*******************************************************
*IN_FLAG
*flag indicating the name of the in coming sequences
*IN_FLAG S no_name ->no flag
*IN_FLAG S &nbsp-in&nbsp -> " -in "
*
IN_FLAG          -infile=
*
*******************************************************
*                  OUT_FLAG                           *
*******************************************************
*OUT_FLAG
*flag indicating the name of the out-coming data
*same conventions as IN_FLAG
*OUT_FLAG   S no_name ->no flag
*
OUT_FLAG         -outfile=
*
*******************************************************
*                  SEQ_TYPE                           *
*******************************************************
*G: Genomic, S: Sequence, P: PDB, R: Profile
*Examples:
*SEQTYPE    S    sequences against sequences (default)
*SEQTYPE    S_P  sequence against structure
*SEQTYPE    P_P  structure against structure
*SEQTYPE    PS   mix of sequences and structure
*
SEQ_TYPE    S
*
*******************************************************
*                  PARAM                              *
*******************************************************
*Parameters sent to the EXECUTABLE
*If there is more than 1 PARAM line, the lines are
*concatenated
*
PARAM -method clustalw
PARAM   -OUTORDER=INPUT -NEWTREE=core -align -gapopen=-15
*
*******************************************************
*                  END                                *
*******************************************************
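
       The ALN_MODE comments of this file give the number of alignments
       requested from the method. The Python sketch below enumerates the
       jobs for three of the modes so that the counts can be checked on a
       small example. It is an illustration only: the function name
       jobs_for_mode is hypothetical, and m_pairwise is left out because
       its count is not spelled out unambiguously in the file.

```python
from itertools import combinations, combinations_with_replacement


def jobs_for_mode(names, mode):
    """Return the groups of sequences passed to the method for an ALN_MODE."""
    if mode == "pairwise":
        # All against all, no self comparison: (n^2-n)/2 jobs.
        return list(combinations(names, 2))
    if mode == "s_pairwise":
        # All against all, self comparisons included: (n^2-n)/2 + n jobs.
        return list(combinations_with_replacement(names, 2))
    if mode == "multiple":
        # All the sequences in one go: a single job.
        return [tuple(names)]
    raise ValueError("mode not covered by this sketch: " + mode)


names = ["A", "B", "C", "D"]
for mode in ("pairwise", "s_pairwise", "multiple"):
    print(mode, len(jobs_for_mode(names, mode)))
# pairwise 6
# s_pairwise 10
# multiple 1
```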


2 Weighting your Method

       By default, the alignment produced by your method will be weighted
       according to its percent identity. However, this can be
       customized via the WEIGHT parameter.

       The WEIGHT parameter supports all the values of the -weight flag. The
       only difference is that the -weight value thus declared will only be
       applied to your method.

       If needed, you can also modify the WEIGHT value of your method on
       the fly:

         PROMPT: t_coffee sample_seq1.fasta -method slow_pair@WEIGHT@OW2
       This will overweight slow_pair by a factor of 2 (exactly as if
       you had specified slow_pair twice).

         PROMPT: t_coffee sample_seq1.fasta -method slow_pair@WEIGHT@250
       This will cause every pair of slow_pair to have a weight equal to 250.




Plug-Out: Using T-Coffee as a Plug-In

       Just because it enjoys enslaving other methods as plug-ins does not
       mean that T-Coffee does not enjoy being incorporated within other
       packages. We try to give as much support as possible to anyone who
       wishes to incorporate T-Coffee in an alignment pipeline.

       If you want to do so, please work out some way to incorporate
       T-Coffee in your script. If you need some help along the way, do not
       hesitate to ask, as we will always be happy to either give
       assistance, or even modify the package so that it accommodates as many
       needs as possible.

       Once that procedure is over, set aside a couple of input files with
       the correct parameterisation and send them to us. These will be
       included as a distribution test, to ensure that any further
       distribution remains compliant with your application.

       We currently support:


Package          Where From
==========================================================
Marna            www.bio.inf.unijena.de/Software/MARNA/download
----------------------------------------------------------


Creating Your Own T-Coffee Libraries

       If the method you want to use is not  integrated,  or  impossible  to
       integrate, you can generate your own libraries, either directly or by
       turning existing alignments into libraries.  You  may  also  want  to
       precompute  your  libraries,  in  order  to  combine  them  at   your
       convenience.


1 Using Pre-Computed Alignments

       If the method you wish to use is not supported, or if you simply have
       the alignments, the simplest thing to do is to generate the
       pairwise/multiple alignments yourself, in FASTA, ClustalW, msf or Pir
       format, and feed them into t_coffee using the -aln flag:

         PROMPT: t_coffee -aln=sample_aln1_1.aln,sample_aln1_2.aln
         -outfile=combined_aln.aln

2 Customizing the Weighting Scheme

       The previous integration method forces you to use the same weighting
       scheme for each alignment and the rest of the libraries generated on
       the fly. This weighting scheme is based on global pairwise sequence
       identity. If you want to use a more specific weighting scheme with a
       given method, you should either:

       1- generate your own library (cf next section)

       2- convert your aln into a lib, using the -weight flag:

         PROMPT: t_coffee -aln sample_aln1.aln -out_lib=test_lib.tc_lib
         -lib_only -weight=sim_pam250mt
         PROMPT: t_coffee -aln sample_aln1.aln -lib test_lib.tc_lib
         -outfile=outaln
         PROMPT: t_coffee -aln=sample_aln1_1.aln,sample_aln1_2.aln -method=
         fast_pair,lalign_id_pair -outfile=out_aln

3 Generating Your Own Libraries

       This is suitable if you  have  local  alignments,  or  very  detailed
       information about your potential residue pairs, or if you want to use
       a very specific weighting scheme. You will need to generate your  own
       libraries, using the format described in the last section.
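
       As an illustration, the Python sketch below writes a small library
       whose layout is modelled on the charge relay example library shown
       earlier in this tutorial (header line, number of sequences, one
       name/length/sequence record per sequence, a #i j line for each pair
       of sequences, residue/residue/weight lines, and the SEQ_1_TO_N
       footer). The function name format_library, the toy sequences and
       the exact column widths are assumptions; check the result against
       the format description before relying on it.

```python
def format_library(sequences, pairs):
    """Return the text of a small library.

    sequences: list of (name, sequence) tuples.
    pairs: dict mapping (i, j), 1-based sequence indices, to a list of
           (residue_i, residue_j, weight) tuples with 1-based residues.
    """
    lines = ["! TC_LIB_FORMAT_01", str(len(sequences))]
    for name, seq in sequences:
        lines.append(f"{name} {len(seq)} {seq}")
    for (i, j), matches in sorted(pairs.items()):
        lines.append(f"#{i} {j}")
        for res_i, res_j, weight in matches:
            # Refuse residue indices that fall outside the sequences.
            if not 1 <= res_i <= len(sequences[i - 1][1]):
                raise ValueError(f"residue {res_i} not in sequence {i}")
            if not 1 <= res_j <= len(sequences[j - 1][1]):
                raise ValueError(f"residue {res_j} not in sequence {j}")
            lines.append(f"{res_i:>6}{res_j:>4}{weight:>8}")
    lines.append("! SEQ_1_TO_N")
    return "\n".join(lines) + "\n"


# Hypothetical toy data: residue 3 of seqA matched with residue 4 of seqB.
toy = [("seqA", "MKHLT"), ("seqB", "MKAHT")]
text = format_library(toy, {(1, 2): [(3, 4, 100)]})
print(text)
```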

       You may also want to pre-compute your libraries in order to save them
       for further use. For instance, in the following example, we  generate
       the local  and  the  global  libraries  and  later  re-use  them  for
       combination into a multiple alignment.



         PROMPT: t_coffee sample_seq1.fasta -method slow_pair -out_lib
         slow_pair_seq1.tc_lib -lib_only
         PROMPT: t_coffee sample_seq1.fasta -method lalign_id_pair -out_lib
         lalign_id_pair_seq1.tc_lib -lib_only


       Once these libraries have been computed, you can then combine them at
       your convenience in a single MSA. Of course, you can decide to only
       use the local or the global library:

         PROMPT: t_coffee sample_seq1.fasta -lib lalign_id_pair_seq1.tc_lib,
         slow_pair_seq1.tc_lib



                         Frequently Asked Questions

IMPORTANT: All the files mentioned here (sample_seq...) can be found in
the example directory of the distribution.

Abnormal Terminations and Wrong Results


    Q: The program keeps crashing when I give my sequences

       A: This may be a format problem. Try to reformat your sequences using
       any utility (readseq...). We  recommend  the  Fasta  format.  If  the
       problem persists, contact us.

       A: Your sequences may not be recognized for  what  they  really  are.
       Normally   T-Coffee   recognizes   the   type   of   your   sequences
       automatically, but if it fails, use:

         PROMPT: t_coffee sample_seq1.fasta -type=PROTEIN
       A: Costly computation or data gathered over the net is stored  by  T-
       Coffee in a cache directory. Sometimes, some of these  files  can  be
       corrupted and cause an abnormal termination. You can either empty the
       cache ( ~/.t_coffee/cache/) or request T-Coffee to run without  using
       it:

         PROMPT: t_coffee -pdb=struc1.pdb,struc2.pdb,struc3.pdb -method
         sap_pair -cache=no
       If you do not want to empty your cache, you may also use
       -cache=update, which will only update the files corresponding to
       your data:

         PROMPT: t_coffee -pdb=struc1.pdb,struc2.pdb,struc3.pdb -method
         sap_pair -cache=update


    Q: The default alignment is not good enough

       A: see next question


    Q: The alignment contains obvious mistakes

       A: This happens with most  multiple  alignment  procedures.  However,
       wrong alignments are sometimes caused by bugs  or  an  implementation
       mistake. Please report the most unexpected results to the authors.


    Q: The program is crashing

       A: If you get the message:

   FAILED TO ALLOCATE REQUIRED MEMORY
       See the next question.

       If the program crashes for some other reason,  please  check  whether
       you are using the right syntax and if the  problem  persists  get  in
       touch with the authors.


    Q: I am running out of memory

       A: You can use a more accurate, slower and less memory hungry dynamic
       programming mode called myers_miller_pair_wise. Simply  indicate  the
       flag:

         PROMPT: t_coffee sample_seq1.fasta -special_mode low_memory
       Note that this mode will be much less time efficient than the
       default, although it may be slightly more accurate. In practice, the
       parameterization associated with this special mode turns off every
       memory expensive heuristic within T-Coffee. For version 2.11 this
       amounts to:

         PROMPT: t_coffee  sample_seq1.fasta
         -method=slow_pair,lalign_id_pair -distance_matrix_mode=idscore
         -dp_mode=myers_miller_pair_wise
       If you keep running out  of  memory,  you  may  also  want  to  lower
       -maxnseq, to ensure that t_coffee_dpa will be used.


Input/Output Control


    Q: How many Sequences can t_coffee handle

       A: T-Coffee is limited to a  maximum  of  50  sequences.  Above  this
       number, the program automatically switches to a heuristic mode, named
       DPA, where DPA stands for Double Progressive Alignment.

       DPA is still in development and the version currently shipped with T-
       Coffee is only a beta version.


    Q: Can I prevent the Output of all the warnings?

       A: Yes, by setting  -no_warning


    Q: How many ways to pass parameters to t_coffee?

       A: See the section well behaved parameters


    Q: How can I change the default output format?

       A: See the -output option, common output formats are:

         PROMPT: t_coffee sample_seq1.fasta -output=msf,fasta_aln

    Q: My sequences are slightly different between all the alignments.

       A: It does not matter. T-Coffee will reconstruct a set  of  sequences
       that incorporates all the residues potentially missing in some of the
       sequences ( see flag -in).


    Q: Is it possible to pipe stuff OUT of t_coffee?

       A: Specify stderr or stdout as output filename, the  output  will  be
       redirected accordingly. For instance

         PROMPT: t_coffee sample_seq1.fasta -outfile=stdout -out_lib=stdout
       This instruction will output the tree (in new hampshire  format)  and
       the alignment to stdout.


    Q: Is it possible to pipe stuff INTO t_coffee?

       A: If you specify stdin as a file name, the content of this file
       will be expected through a pipe:

         PROMPT: cat sample_seq1.fasta | t_coffee -infile=stdin
       will be equivalent to

         PROMPT: t_coffee sample_seq1.fasta
       If you do not give any argument to t_coffee, they will be expected to
       come from pipe:

         PROMPT: cat sample_param_file.param  | t_coffee -parameters=stdin
       For instance:

         PROMPT: echo -seq=sample_seq1.fasta -method=clustalw_pair |
         t_coffee -parameters=stdin

    Q: Can I read my parameters from a file?

       A: See the well behaved parameters section.


    Q: I want to  decide myself on the name of the output files!!!

       A: Use the -run_name flag.

         PROMPT: t_coffee sample_seq1.fasta -run_name=guacamole

    Q: I want to use the sequences in an alignment file

       A: Simply feed your alignment, any way you like, but do not forget to
       append the prefix S for sequence:

         PROMPT: t_coffee Ssample_aln1.aln
         PROMPT: t_coffee -infile=Ssample_aln1.aln
         PROMPT: t_coffee -seq=sample_aln1.aln
         -method=slow_pair,lalign_id_pair -outfile=outaln
       This means that the gaps will be reset and  that  the  alignment  you
       provide will not be considered as an  alignment,  but  as  a  set  of
       sequences.
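
       What "resetting the gaps" means can be shown with a short Python
       sketch. It only illustrates the effect on the data and is not the
       code T-Coffee uses; the function name ungap and the toy alignment
       are hypothetical, and both '-' and '.' are assumed to denote gaps.

```python
def ungap(aligned):
    """Remove gap characters so that aligned rows become plain sequences."""
    return {name: row.replace("-", "").replace(".", "")
            for name, row in aligned.items()}


# Hypothetical toy alignment.
print(ungap({"seqA": "MK-HLT", "seqB": "MKAH-T"}))
# {'seqA': 'MKHLT', 'seqB': 'MKAHT'}
```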


    Q: I only want to produce a library

       A: use the -lib_only flag

         PROMPT: t_coffee sample_seq1.fasta -out_lib=sample_lib1.tc_lib
         -lib_only
       Please, note that the  previous  usage  supersedes  the  use  of  the
       -convert flag. Its main advantage is to restrict computation time  to
       the actual library computation.


    Q: I want to turn an alignment into a library

       A: use the -lib_only flag

         PROMPT: t_coffee -in=Asample_aln1.aln -out_lib=sample_lib1.tc_lib
         -lib_only
       It is also possible  to  control  the  weight  associated  with  this
       alignment (see the -weight section).

         PROMPT: t_coffee -aln=sample_aln1.aln -out_lib=sample_lib1.tc_lib
         -lib_only -weight=1000

    Q: I want to concatenate two libraries

       A: You cannot concatenate these files on their own. You will have to
       use t_coffee. Assume you want to combine sample_lib1.tc_lib and
       sample_lib2.tc_lib:

         PROMPT: t_coffee -lib=sample_lib1.tc_lib,sample_lib2.tc_lib
         -lib_only -out_lib=sample_lib3.tc_lib

    Q: What happens to the gaps when an alignment is fed to T-Coffee

       A: An alignment is ALWAYS considered as a library AND a set of
       sequences. If you want your alignment to be considered as a set of
       sequences only, use the S identifier.

         PROMPT: t_coffee Ssample_aln1.aln -outfile=outaln
       It will be seen as a sequence file,  even  if  it  has  an  alignment
       format (gaps will be removed).


    Q: I cannot print the html graphic display!!!

       A: This is a problem that has to do with  your  browser.  Instead  of
       requesting the score_html output, request the  score_ps  output  that
       can be read using ghostview:

         PROMPT: t_coffee sample_seq1.fasta -output=score_ps
       or

         PROMPT: t_coffee sample_seq2.fasta -output=score_pdf

    Q: I want to output an html file and a regular file

       A: see the next question


    Q: I would like to output more than one alignment format  at  the  same
    time

       A: The flag -output accepts more than one parameter. For instance,

         PROMPT: t_coffee sample_seq1.fasta
         -output=clustalw,html,score_ps,msf
       This will output four alignment files in the corresponding formats.
       Each file name will have the format name as its extension.


       Note: the score_pdf output requires the converter ps2pdf to be
       installed on your system (standard under Linux and cygwin). The
       latest versions of Internet Explorer and Netscape now allow the user
       to print the HTML display. Do not forget to request background
       printing.


Alignment Computation


    Q: Is T-Coffee the best? Why not use Muscle, or Mafft, or ProbCons???

       A: All these packages are good packages and they sometimes outperform
       T-Coffee. They also claim to outperform one another... If you have
       them installed locally, you can have T-Coffee generate a consensus
       alignment:

         PROMPT: t_coffee sample_seq1.fasta -method
         muscle_msa,probcons_msa,mafft_msa,lalign_id_pair,slow_pair

    Q: Can t_coffee align Nucleic Acids ???

       A: Normally it can, but check in the log that the program recognises
       the right type (in the INPUT SEQ section, Type: xxxx). If this
       fails, you will need to set the type manually:



         PROMPT: t_coffee sample_dnaseq1.fasta -type dna

    Q: I do not want to compute the alignment.

       A: use the -convert flag

         PROMPT: t_coffee sample_aln1.aln -convert -output=gcg
       This command will read the  .aln  file  and  turn  it  into  an  .msf
       alignment.


    Q: I would like to force some residues to be aligned.

       A: If you want to brutally force some residues to be aligned, you may
       use the force_aln function of seq_reformat as a post-processing step:

         PROMPT: t_coffee -other_pg seq_reformat -in sample_aln4.aln -action
         +force_aln seq1 10 seq2 15
         PROMPT: t_coffee -other_pg seq_reformat -in sample_aln4.aln -action
         +force_aln sample_lib4.tc_lib02
       sample_lib4.tc_lib02 is a T-Coffee library using the tc_lib02 format:


         *TC_LIB_FORMAT_02


         SeqX ResX ResX_index      SeqZ ResZ ResZ_index

The TC_LIB_FORMAT_02 is still experimental and unsupported. It can only be
used in the context of the force_aln function described here.
       Given more than one constraint, these will be applied one  after  the
       other, in the order they are provided. This  greedy  procedure  means
       that the Nth constraint may disrupt the  (N-1)th  previously  imposed
       constraint, hence the importance of forcing the  constraints  in  the
       right order, with the most important coming last.

       We do not recommend imposing hard constraints on an alignment. It is
       much more advisable to use the soft constraints provided by standard
       t_coffee libraries (cf. the question I want to build my own
       libraries).



    Q: I would like to use structural alignments.

       A: See the section Using structures in Multiple Sequence Alignments,
       or see the question I want to build my own libraries.


    Q: I want to build my own libraries.

       A: Turn your alignment (for instance a structure-based one) into a
       library, forcing its aligned residues to have a very high weight:

         PROMPT: t_coffee -aln=sample_seq1.aln -weight=1000
         -out_lib=sample_seq1.tc_lib -lib_only
       The value 1000 is simply a high value that should make it more likely
       for the substitutions found in your alignment to reoccur in the final
       alignment. This will produce the library sample_seq1.tc_lib that you
       can later use when aligning all the sequences:

         PROMPT: t_coffee -seq=sample_seq1.fasta -lib=sample_seq1.tc_lib
         -outfile sample_seq1.aln
       If you only want some of these residues to be aligned, or want to
       give them individual weights, you will have to edit the library file
       yourself or use the force_aln function (cf. the question I would like
       to force some residues to be aligned). A weight of N*N*1000 (N being
       the number of sequences) usually ensures that a constraint is
       respected.
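
       The N*N*1000 rule of thumb can be computed from a sequence file
       before editing a library. The following is only a minimal sketch in
       Python, not part of T-Coffee: the function names are hypothetical and
       the toy sequences are made up for illustration.

```python
def count_fasta_sequences(fasta_text: str) -> int:
    """Count records in FASTA-formatted text (one '>' header per sequence)."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def constraint_weight(n_sequences: int, base: int = 1000) -> int:
    """Return N * N * base, the rule of thumb quoted in the text."""
    if n_sequences < 1:
        raise ValueError("need at least one sequence")
    return n_sequences * n_sequences * base


# Toy input: four made-up sequences (hypothetical data).
example = ">a\nMKV\n>b\nMKI\n>c\nMRV\n>d\nMKL\n"
n = count_fasta_sequences(example)
weight = constraint_weight(n)
print(f"{n} sequences, suggested weight: {weight}")
```

       With four sequences the suggested weight is 4*4*1000. The value is
       only a starting point, since the text says it "usually" suffices.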


    Q: I want to use my own tree

       A: Use the -usetree=<your own tree> flag.

         PROMPT: t_coffee sample_seq1.fasta -usetree=sample_tree.dnd

    Q: I want to align coding DNA

       A: use the cdna_fast_pair method that compares two cDNA sequences
       using the best reading frame and taking frameshifts into account.

         PROMPT: t_coffee sample_seq4.fasta -method=cdna_fast_pair
       Notice that in the resulting alignment, all the gap lengths are
       multiples of 3, except one small gap in the first line of sequence
       hmgl_trybr. This is a frameshift, made on purpose. You can realign
       the same sequences while ignoring their coding potential and treating
       them like standard DNA:

         PROMPT: t_coffee sample_seq4.fasta
Note: This method has not yet been fully tested and is only provided "as-
is" with no warranty. Any feedback will be much appreciated.
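
       Gaps whose length is not a multiple of 3 can be spotted
       automatically in an aligned row. The sketch below is a hypothetical
       Python helper, not a T-Coffee feature; it assumes "-" is the gap
       character and uses a made-up aligned sequence.

```python
import re


def frameshift_gaps(aligned_seq: str, gap_char: str = "-"):
    """Return (start, length) for each gap run whose length is not a multiple of 3."""
    pattern = re.escape(gap_char) + "+"
    return [
        (match.start(), len(match.group()))
        for match in re.finditer(pattern, aligned_seq)
        if len(match.group()) % 3 != 0
    ]


# Made-up aligned row: two codon-sized gaps and one single-column gap.
row = "ATGGCT---AAAGG-TCTGAA------TAA"
for start, length in frameshift_gaps(row):
    print(f"possible frameshift: gap of length {length} at column {start}")
```

       Only the single-column gap is reported here; the gaps of length 3
       and 6 are compatible with the reading frame.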

    Q: I do not want to use all the possible pairs when computing the
    library

       A: see the next question

    Q: I only want to use specific pairs to compute the library

       A: Simply write in a file the list of sequence  groups  you  want  to
       use:

         PROMPT: t_coffee sample_seq1.fasta
         -method=clustalw_pair,clustalw_msa -lib_list=sample_list1.lib_list
         -outfile=test

***************sample_list1.lib_list****
2 hmgl_trybr hmgt_mouse
2 hmgl_trybr hmgb_chite
2 hmgl_trybr hmgl_wheat
3 hmgl_trybr hmgl_wheat hmgl_mouse
***************sample_list1.lib_list****


       Note: Pairwise methods (e.g. slow_pair) will only be applied to the
       groups that list exactly two sequences, while multiple methods (e.g.
       clustalw_msa) will be applied to any group having more than two
       sequences.
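
       A list file like the one above can be generated from groups of
       sequence names. This is a minimal Python sketch under one assumption
       inferred from the sample file: each line starts with the number of
       names that follow. The function name is hypothetical.

```python
def lib_list_lines(groups):
    """Format each group of sequence names as '<count> <name1> <name2> ...'."""
    lines = []
    for names in groups:
        if len(names) < 2:
            raise ValueError("each group needs at least two sequence names")
        lines.append(f"{len(names)} {' '.join(names)}")
    return lines


# Groups taken from the sample_list1.lib_list example in the text.
groups = [
    ["hmgl_trybr", "hmgt_mouse"],
    ["hmgl_trybr", "hmgb_chite"],
    ["hmgl_trybr", "hmgl_wheat"],
    ["hmgl_trybr", "hmgl_wheat", "hmgl_mouse"],
]
text = "\n".join(lib_list_lines(groups)) + "\n"
print(text, end="")
```

       The printed text can be saved to a file and passed to the -lib_list
       flag shown in the command above.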


    Q: There are duplicates or quasi-duplicates in my set

       A: If you can remove them, this will make  the  program  run  faster,
       otherwise, the t_coffee scoring scheme should be able to avoid  over-
       weighting of over-represented sequences.


Using Structures and Profiles


    Q: Can I align sequences to a profile with T-Coffee?

       A: Yes, you simply need to indicate that your alignment is a profile,
       either with the R tag or with the -profile flag:

         PROMPT: t_coffee sample_seq1.fasta -profile=sample_aln2.aln
         -outfile tacos

    Q: Can I align two or more profiles?

       A: Yes, simply tag your profiles with the letter R (or list them with
       the -profile flag) and the program will treat them like standard
       sequences.

         PROMPT: t_coffee -profile=sample_aln1.fasta,sample_aln2.aln
         -outfile tacos

    Q: Can I align two profiles according to the structures they contain?

       A: Yes, as long as the structure sequences are named according to
       their PDB identifier:

         PROMPT: t_coffee  -profile=sample_profile1.aln,sample_profile2.aln
         -special_mode=3dcoffee -outfile=aligne_prf.aln

    Q: T-Coffee becomes very slow when combining sequences and structures

       A: This is true. By default the structures are fetched from the net,
       using RCSB. The problem arises when T-Coffee looks for the structure
       of sequences WITHOUT structures. One solution is to install PDB
       locally. In that case you will need to set two environment variables:



         setenv (or export)  PDB_DIR="directory containing the pdb
         structures"

         setenv (or export)  NO_REMOTE_PDB_DIR=1
       Interestingly, the observation that sequences without structures are
       those that take the most time to be checked is a reminder of the
       strongest rational argument that I know of against torture: any
       innocent would require the maximum amount of torture to establish
       his/her innocence, which sounds... ahem... strange, and at least
       inefficient. Then again I was never struck by the efficiency of the
       Bush administration.


    Q: Can I use a local installation of PDB?

       A: Yes, T-Coffee supports three types of installations:

       -An ad hoc installation where all your structures are in a
       directory, under the form pdbid.pdb or pdbid.id.Z or pdbid.pdb.gz. In
       that case, all you need to do is set the environment variables
       correctly:

         setenv (or export)  PDB_DIR="directory containing the pdb
         structures"

         setenv (or export)  NO_REMOTE_PDB_DIR=1

       -A standard pdb installation using the all section of pdb. In that
       case, you must set the variables to:

         setenv (or export)  PDB_DIR="<some absolute
         path>/data/structures/all/pdb/"

         setenv (or export)  NO_REMOTE_PDB_DIR=1

       -A standard pdb installation using the divided section of pdb:

         setenv (or export)  PDB_DIR="<some absolute
         path>/data/structures/divided/pdb/"

         setenv (or export)  NO_REMOTE_PDB_DIR=1
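
       When t_coffee is launched from a script rather than from an
       interactive shell, the two variables can be set for the child process
       only. The following Python sketch is an illustration, not part of
       T-Coffee; the function name and the directory path are hypothetical.

```python
import os


def pdb_environment(pdb_dir, base_env=None):
    """Return a copy of the environment with the two PDB variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["PDB_DIR"] = pdb_dir          # directory containing the pdb structures
    env["NO_REMOTE_PDB_DIR"] = "1"    # value used in the examples above
    return env


# Hypothetical local directory; replace it with your own installation path.
env = pdb_environment("/data/pdb/")
# The dictionary can then be handed to a child process, for instance with
# subprocess.run(command, env=env) from the standard library.
print(env["PDB_DIR"], env["NO_REMOTE_PDB_DIR"])
```

       Copying the environment rather than modifying os.environ keeps the
       setting local to the t_coffee call.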
       If you need to do more clever things, you should know  that  all  the
       PDB  manipulation  is  made  in  T-Coffee  by  a  perl  script  named
       extract_from_pdb. You can extract this script from T-Coffee:

         t_coffee -other_pg unpack_extract_from_pdb

         chmod u+x extract_from_pdb
       You can then edit the script to suit your needs.  T-Coffee  will  use
       your edited version if it is in the current directory. It will  issue
       a warning that it used a local version.

       If you make extensive modifications, I would appreciate it if you
       sent me the corrected file so that I can incorporate it in the next
       distribution.




Alignment Evaluation


    Q: How good is my alignment?

       A: see the question What is that color index?


    Q: What is that color index?

       A: T-Coffee can provide you with a measure of consistency among all
       the methods used. You can produce such an output using:

         PROMPT: t_coffee sample_seq1.fasta -output=score_html
       This will compute sample_seq1.score_html, which you can view in a web
       browser. Alternatively, score_ps and score_pdf can be viewed using
       ghostview or acroread, and score_ascii will give you an alignment
       that can be parsed as a text file.

       A book chapter describing the CORE index is available on:

       http://igs-server.cnrs-mrs.fr/~cnotred/Publications/Pdf/core.pp.pdf


    Q: Can I evaluate alignments NOT produced with T-Coffee?

       A: Yes. You may have an alignment produced from any source you  like.
       To evaluate it do:

         PROMPT: t_coffee -infile=sample_aln1.aln -lib=sample_aln1.tc_lib
         -special_mode=evaluate
       If you have no library available, the default library will be
       computed on the fly, which can take some time depending on your
       sample size. In that case, simply omit the -lib flag:

         PROMPT: t_coffee -infile=sample_aln1.aln -special_mode evaluate

    Q: Can I Compare Two Alignments?

       A: Yes. You can treat one of your alignments as a library and compare
       it with the second alignment:

         PROMPT: t_coffee -infile=sample_aln1_1.aln -aln=sample_aln1_2.aln
         -special_mode=evaluate

    Q: I am aligning sequences with long regions of very good overlap

       A: Increase the ktuple size (up to 4 or 5 for DNA, and up to 3 for
       proteins).

         PROMPT: t_coffee sample_seq1.fasta -ktuple=3
       This will speed up the program. It can  be  very  useful,  especially
       when aligning ESTs.


    Q: Why is T-Coffee changing the names of my sequences!!!!

       A: If there is no duplicated name in your sequence set, T-Coffee's
       handling of names is consistent with ClustalW (cf. Sequence Name
       Handling in the Format section). If your dataset contains sequences
       with identical names, these will automatically be renamed, as in the
       example below (input first, renamed output second):


************************
>seq1
>seq1
************************
>seq1
>seq1_1
************************

Warning: The behaviour is undefined when this renaming creates two sequences
with identical names.
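
       Because of this warning, it can be worth checking the names before
       running T-Coffee. The Python sketch below is a hypothetical helper,
       not part of T-Coffee. It assumes duplicates are renamed with the
       suffixes _1, _2 and so on; the tutorial only shows the _1 case.

```python
from collections import Counter


def check_names(names):
    """Return (duplicated names, existing names a renamed duplicate could clash with)."""
    counts = Counter(names)
    duplicated = sorted(name for name, count in counts.items() if count > 1)
    # Assumption: the k-th extra copy of "name" becomes "name_k".
    risky = sorted(
        f"{name}_{i}"
        for name in duplicated
        for i in range(1, counts[name])
        if f"{name}_{i}" in counts
    )
    return duplicated, risky


# Toy name list: seq1 is duplicated and seq1_1 already exists.
names = ["seq1", "seq1", "seq1_1", "seq2"]
duplicated, risky = check_names(names)
print("duplicated:", duplicated)
print("possible clashes after renaming:", risky)
```

       Renaming the sequences yourself before the run avoids relying on the
       undefined behaviour.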

Improving Your Alignment


    Q: How Can I Edit my Alignment Manually?

       A: Use jalview, a Java online MSA editor: www.jalview.org


    Q: Have I Improved or Not my Alignment?

       A: Using structural information is the only way to establish whether
       or not you have improved your alignment. The CORE index can also give
       you some information.










                           Addresses and Contacts


Contributors

T-Coffee is developed, maintained, monitored, used and debugged by a
dedicated team that includes:

      Cédric Notredame

      Fabrice Armougom

      Des Higgins

      Sebastien Moretti

      Orla O'Sullivan

      Eamon O'Toole

      Olivier Poirot

      Karsten Suhre

      Vladimir Keduas

      Iain Wallace


Addresses

       We are always very eager to get some user feedback. Please do not
       hesitate to drop us a line at: cedric.notredame@europe.com. The
       latest updates of T-Coffee are always available at: www.tcoffee.org.
       At this address you will also find a link to some of the online
       T-Coffee servers, including Tcoffee@igs.



       T-Coffee can be used to automatically check if an updated version  is
       available, however the program will not update automatically, as this
       can cause endless reproducibility problems.

         PROMPT: t_coffee -update



                                 References

       It is important that you cite T-Coffee when you use it. Citing us is
       (almost) like giving us money: it helps us convince our institutions
       that what we do is useful and that they should keep paying our
       salaries and deliver donuts to our offices from time to time (not
       that they ever did, but it would be nice anyway).



       Cite the server if you used it, otherwise, cite  the  original  paper
       from 2000 (No, it was never named "T-Coffee 2000").

       Notredame C, Higgins DG, Heringa J.
       T-Coffee: A novel method for fast and accurate multiple sequence
       alignment.
       J Mol Biol. 2000 Sep 8;302(1):205-17.
       PMID: 10964570


       Other useful publications include:


T-Coffee

       Claude JB, Suhre K, Notredame C, Claverie JM, Abergel C.
       CaspR: a web server for automated molecular replacement using
       homology modelling.
       Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W606-9.
       PMID: 15215460

       Poirot O, Suhre K, Abergel C, O'Toole E, Notredame C.
       3DCoffee@igs: a web server for combining sequences and structures
       into a multiple sequence alignment.
       Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W37-40.
       PMID: 15215345

       O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C.
       3DCoffee: combining protein sequences and structures within multiple
       sequence alignments.
       J Mol Biol. 2004 Jul 2;340(2):385-95.
       PMID: 15201059

       Poirot O, O'Toole E, Notredame C.
       Tcoffee@igs: A web server for computing, evaluating and combining
       multiple sequence alignments.
       Nucleic Acids Res. 2003 Jul 1;31(13):3503-6.
       PMID: 12824354

       Notredame C.
       Mocca: semi-automatic method for domain hunting.
       Bioinformatics. 2001 Apr;17(4):373-4.
       PMID: 11301309

       Notredame C, Higgins DG, Heringa J.
       T-Coffee: A novel method for fast and accurate multiple sequence
       alignment.
       J Mol Biol. 2000 Sep 8;302(1):205-17.
       PMID: 10964570

       Notredame C, Holm L, Higgins DG.
       COFFEE: an objective function for multiple sequence alignments.
       Bioinformatics. 1998 Jun;14(5):407-22.
       PMID: 9682054



Mocca

       Notredame C.
       Mocca: semi-automatic method for domain hunting.
       Bioinformatics. 2001 Apr;17(4):373-4.
       PMID: 11301309


CORE

       http://igs-server.cnrs-mrs.fr/~cnotred/Publications/Pdf/core.pp.pdf


Other Contributions

       We do not mean to steal code, but we will always try to re-use
       pre-existing code whenever that code exists, free of copyright, just
       like we expect people to do with our code. However, whenever this
       happens, we make a point of properly citing the source of the
       original contribution. If you ever recognize a piece of your code
       improperly cited, please drop us a note and we will be happy to
       correct that.

       In the meantime, here are some important pieces of code from other
       packages that have been incorporated within the T-Coffee package.
       These include:

       -The Sim algorithm of Huang and Miller that, given two sequences,
       computes the N best scoring local alignments.

       -The tree reading/computing routines are taken from the ClustalW
       package, courtesy of Julie Thompson, Des Higgins and Toby Gibson
       (Thompson, Higgins, Gibson, 1994, Nucleic Acids Research, vol. 22,
       4673-4680).

       -The implementation of the algorithm for aligning two sequences in
       linear space was adapted from Myers and Miller (CABIOS, 1988, vol. 4,
       11-17).

       -Various techniques and algorithms have been implemented. Whenever
       relevant, the source of the code/algorithm/idea is indicated in the
       corresponding function.

       -64-bit compliance was implemented by Benjamin Sohn, Performance
       Computing Center Stuttgart (HLRS), Germany.


Bug Reports and Feedback

       -Prof David Jones (UCL) reported and corrected the PDB1K bug (now
       t_coffee/sap can align PDB sequences longer than 1000 AA).

       -Johan Leckner reported several bugs related to the treatment of PDB
       structures, ensuring consistent behavior between version 1.37 and
       current ones.







