source: trunk/GDE/PHYML20130708/phyml/doc/phyml-manual.tex

Last change on this file was 10307, checked in by aboeckma, 11 years ago

added most recent version of phyml

File size: 140.6 KB
Line 
1\documentclass[a4paper,12pt]{article}
2\usepackage{fancyvrb}
3\usepackage{graphicx}
4\usepackage{tabularx}
5% \usepackage{color}
6\usepackage{xcolor}
7\usepackage{psfrag}
8\usepackage[fleqn]{amsmath}
9\usepackage[hyphens]{url}
10% \usepackage{vmargin}
11\usepackage{cite}
12\usepackage{caption2}
13\usepackage{hyperref}
14\usepackage{makeidx}
15\renewcommand{\captionlabeldelim}{.}
16\def\thesection{\arabic{section}}
17\renewcommand{\thefigure}{\arabic{figure}}
18\renewcommand{\thetable}{\arabic{table}}
19\newcommand{\hl}{\noalign{\vskip3pt}\hline\noalign{\vskip3pt}}
20\newcommand{\hrf}{\hrulefill}
21\newcommand{\tc}[1]{\textcolor{black}{#1}}
22\newcommand{\dc}[1]{\textcolor{green}{#1}}
23% \usepackage[none]{hyphenat}
24\newcommand{\rep}[3][1]
25{
26  \psfrag{#2}[c][c][#1]{#3}
27}
28\newcommand{\x}[1]{\texttt{#1}}
29
30% \setpapersize{A4}
31% \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red,linkbordercolor=000}
32\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red}
33\renewcommand{\baselinestretch}{1.}
34\makeindex
35\begin{document}
36\sloppy
37\begin{center}
38\thispagestyle{empty}
39\vfill\vfill
40% \rule{\linewidth}{0.02cm}\\
41{\Huge \textbf{ P~h~y~M~L~~--~~M~a~n~u~a~l}}
42\vspace{-0.4cm}\\
43% \rule{\linewidth}{0.02cm}\\
44\vfill
45{\huge Version 3.0 \\
46\today
47\vfill
48\normalsize
49\href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml}\\
50\vspace{0.4cm}
51\href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}}
52\end{center}
53\clearpage
54\tableofcontents
55\clearpage
56
57{\par
58\small 
59\noindent
60\copyright Copyright 1999 - 2008 by PhyML Development Team.\\
61\noindent The software PhyML is provided ``as is''  without warranty of any kind.  In no event shall
62the authors  or his  employer be  held responsible  for any damage  resulting from  the use  of this
63software, including but not limited to the frustration that you may experience in using the package.
64All parts of the source and documentation except where indicated are distributed under
65the GNU public licence. See http://www.opensource.org for details.
66
67}
68
69{
70\noindent
71\setlength{\baselineskip}{0.5\baselineskip}
72\section{Availability}
73\begin{itemize}
74\item Binaries: \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}
75\item Sources: \href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml}
76\item Discussion forum: \href{http://groups.google.com/group/phyml-forum}{http://groups.google.com/group/phyml-forum}
77\end{itemize}
78}
79
80{
81\noindent
82\setlength{\baselineskip}{0.7\baselineskip}
83\section{Authors}
84\begin{itemize}
85\item { St\'ephane Guindon} and { Olivier Gascuel} conceived the original PhyML algorithm.
86\item { St\'ephane Guindon} conceived the PhyTime method.
87\item { St\'ephane Guindon, Wim Hordjik} and { Olivier Gascuel} conceived the SPR-based tree search algorithm.
88\item { Maria Anisimova} and { Olivier Gascuel} conceived the aLRT method for branch support.
89\item { St\'ephane Guindon, Franck Lethiec}, Jean-Francois Dufayard and Vincent Lefort implemented PhyML.
90\item { Jean-Francois Dufayard} created the benchmark and implemented the tools that are used to check
91  PhyML accuracy and performances.
92\item { Vincent Lefort, St\'ephane Guindon, Patrice Duroux} and { Olivier Gascuel} conceived and
93  implemented PhyML web server.
94\item St\'ephane Guindon wrote this document.
95\end{itemize}
96}
97\clearpage
98
99% \section{ML in phylogenetics: the basics.}
100\section{Overview}
101
102PhyML  \cite{guindon03} is  a  software  package which  primary  task that  is  to estimate  maximum
103likelihood phylogenies  from alignments of nucleotide or  amino-acid sequences.  It  provides a wide
104range  of  options that  were  designed  to facilitate  standard  phylogenetic  analyses.  The  main
105strength of  PhyML lies in the  large number of substitution  models coupled to  various options to
106search the  space of phylogenetic  tree topologies,  going from very  fast and efficient  methods to
107slower but  generally more accurate approaches.  It  also implements two methods  to evaluate branch
108supports  in  a  sound statistical  framework  (the  non-parametric  bootstrap and  the  approximate
109likelihood ratio test).
110
111PhyML was designed to  process moderate to large data sets.  In theory,  alignments with up to 4,000
112sequences 2,000,000 character-long can analyzed.  In practice however, the amount of memory required
113to process  a data set is proportional  of the product of  the number of sequences  by their length.
114Hence, a large number  of sequences can only be processed provided that  they are short. Also, PhyML
115can  handle long  sequences  provided  that they  are  not numerous.   With  most standard  personal
116computers, the ``comfort  zone'' for PhyML generally lies around 100-500  sequences less than 10,000
117character  long.   For  larger  data  sets,  we  recommend  using  other  softwares  such  as  RAxML
118\cite{raxml}\index{RAxML}      or       GARLI      \cite{garli}\index{GARLI}      or      Treefinder
119(\href{http://www.treefinder.de}{http://www.treefinder.de}).
120
121
122\section{Bug report}\index{bug}
123
124While PhyML is, of  course, bug-free (!) (please read the disclaimer  carefuly...), if you ever come
125across an  issue, please feel free to  report it using the  discuss group web site  at the following
126address:  \url{https://groups.google.com/forum/?fromgroups#!forum/phyml-forum}.  Alternatively,  you
127can send an email  to \url{s.guindon@auckland.ac.nz}. Do not forget to mention  the version of PhyML
128and program options you are using.
129
130
131\section{Installing PhyML}
132
133\subsection{Sources and compilation}\index{compilation}
134
135The sources of the  program are available free of charge by sending  an e-mail to St\'ephane Guindon
136at                      \href{mailto:guindon@lirmm.fr}{\x{guindon@lirmm.fr}}                      or
137\href{mailto:s.guindon@auckland.ac.nz}{\x{s.guindon@auckland.ac.nz}}.  The  compilation on UNIX-like
138systems  is  fairly standard.  It  is  described  in the  `\x{INSTALL}'  file  that comes  with  the
139sources. In a command-line window, go to the directory that contains the sources and type:
140
141{\setlength{\baselineskip}{0.5\baselineskip}
142\begin{verbatim}
143./configure;
144make clean;
145make V=0;
146\end{verbatim}
147}
148
149By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a
150version of PhyML that can run through a debugging tool (such as \x{ddd}\label{ddd}) or a profiling
151tool (such as \x{gprof}\label{gprof}) using the following instructions:
152
153{\setlength{\baselineskip}{0.5\baselineskip}
154\begin{verbatim}
155./configure --enable-debug;
156make clean;
157make V=0;
158\end{verbatim}
159}
160
161% {\em Note} -- when PhyML  is going to be used mostly of exclusively in  batch mode, it is preferable
162% to turn on the batch mode option in the  Makefile. In order to do so, the file \x{Makefile.am} needs
163% to be modified: add \x{-DBATCH} to the line with \x{DEFS=-DUNIX -D\$(PROG) -DDEBUG}.
164
165
166
167\subsection{Installing PhyML on UNIX-like systems (including Mac OS)}
168
169Copy PhyML binary file in the directory you like.  For the operating system to be able to locate the
170program, this directory must be specified in the global variable \x{PATH}. In order to achieve this,
171you  will   have  to  add  \x{export   PATH="/your\_path/:\${PATH}"}  to  the   \x{.bashrc}  or  the
172\x{.bash\_profile} located in your home directory  (\x{your\_path} is the path to the directory that
173contains PhyML binary).
174
175
176\subsection{Installing PhyML on Microsoft Windows}\label{sec:install_windows}
177
178Copy the files \x{phyml.exe} and \x{phyml.bat} in  the same directory. To launch PhyML, click on the
179icon  corresponding to \x{phyml.bat}.   Clicking on  the icon  for \x{phyml.exe}  works too  but the
180dimensions of the window will not fit PhyML PHYLIP-like interface.
181
182\subsection{Installing the parallel version of PhyML}\label{sec:MPI}\index{MPI}\index{bootstrap!parallel}
183
184Bootstrap analysis can run on multiple  processors. Each processor analyses one bootstraped dataset.
185Therefore, the computing time needed to perform $R$ bootstrap replicates is divided by the number of
186processors available.
187
188This  feature of  PhyML relies  on the  MPI (Message  Passing Interface)  library. To  use  it, your
189computer must  have MPI  installed on  it. In case  MPI is  not installed, you  can dowload  it from
190\href{http://www.mcs.anl.gov/research/projects/mpich2/}{http://www.mcs.anl.gov/research/projects/mpich2/}.
191Once MPI is  installed, it is necessary to launch  the MPI daemon. This can be  done by entering the
192following instruction: \x{mpd \&}.  Note however that in most cases, the  MPI daemon will already be
193running on your server so  that you most likely do not need to worry  about this. You can then just
194go in the \x{phyml/} directory (the directory that contains the \x{src/}, \x{examples/} and \x{doc/}
195folders) and enter the commands below:
196
197{\setlength{\baselineskip}{0.5\baselineskip}
198\begin{verbatim}
199./configure --enable-mpi;
200make clean;
201make;
202\end{verbatim}
203}
204
205A binary file named \x{phyml-mpi} has now been created in the \x{src/} directory and is ready to use
206with MPI. A typical MPI command-line which uses 4 CPUs is given below:
207
208{\setlength{\baselineskip}{0.5\baselineskip}
209\begin{verbatim}
210mpirun -n 4 ./phyml-mpi -i myseq -b 100
211\end{verbatim}
212}
213
214\noindent Please read section \ref{sec:parallel_bootstrap} of this document for more information.
215
216\section{Program usage.}\label{sec:phyml_new}
217
218PhyML has  three distinct user-interfaces.   The first corresponds  to a PHYLIP-like  text interface
219that makes the choice of the options self-explanatory. The command-line interface is well-suited for
220people that are familiar with PhyML options or for running PhyML in batch mode. The XML interface is
221more sophisticated. It allows the user to analyse partitionned data using flexible mixture models of evolution.
222
223\subsection{PHYLIP-like interface}
224
225The default is to use the PHYLIP-like  text interface by simply typing `\x{phyml}' in a command-line
226window or by clicking on the PhyML icon (see Section \ref{sec:install_windows}).  After entering the
227name of the input sequence file, a list of  sub-menus helps the users set up the analysis.  There
228are currently four distinct sub-menus:
229
230% \begin{figure}
231% \resizebox{15cm}{9cm}{\includegraphics{./fig/interface.eps}}
232% \caption{PHYLIP-like interface to PhyML.}
233% \label{fig:interface}
234% \end{figure}
235
236\begin{enumerate}
237
238\item  {\em  Input  Data}:  specify  whether  the  input  file  contains  amino-acid  or  nucleotide
239sequences. What the  sequence format is (see Section \ref{sec:input_output}) and  how many data sets
240should be analysed.
241
242\item  {\em  Substitution  Model}: selection  of  the  Markov  model  of substitution.
243
244\item  {\em  Tree Searching}:  selection  of  the tree  topology  searching  algorithm.
245
246\item {\em  Branch Support}: selection  of the method  that is used  to measure branch  support.
247
248\end{enumerate}
249\noindent `\x{+}' and `\x{-}' keys are used to  move forward and backward in the sub-menu list. Once
250the model parameters  have been defined, typing `\x{Y}' (or `\x{y}')  launches the calculations. The
251meaning of  some options may not  be obvious to users  that are not familiar  with phylogenetics. In
252such situation, we strongly recommend to use the default options. As long as the format of the input
253sequence file is  correctly specified (sub-menu {\em Input data}), the  safest option for non-expert
254users  is to  use the  default settings.  The different  options provided  within each  sub-menu are
255described in what follows.
256
257
258\subsubsection{Input  Data sub-menu}
259
260\begin{center}\framebox{\x{[D] ............................... Data type (DNA/AA)}} \end{center} 
261Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see
262Section \ref{sec:input_output}). Type \x{D} to change settings.
263
264\vspace{0.7cm}
265\begin{center} \framebox{\x{[I] ...... Input sequences interleaved (or sequential)}} \end{center}
266PHYLIP format comes in two flavours: interleaved or sequential (see Section
267\ref{sec:input_output}). Type \x{I} to selected among the two formats.
268
269\vspace{0.7cm}
270\begin{center} \framebox{\x{[M] ....................... Analyze multiple data sets}} \end{center}
271If the input sequence file contains more than one data sets, PhyML can analyse each of them
272in a single run of the program. Type \x{M} to change settings.
273
274\vspace{0.7cm}
275\begin{center}  \framebox{\x{[R] ............................................ Run  ID}} \end{center}
276This option allows  you to append a string  that identifies the current PhyML run.  Say for instance
277that you want to analyse  the same data set with two models. You can  then `tag' the first PhyML run
278with the name of the first model while the second run is tagged with the name of the second model.\index{run ID}
279
280
281
282\subsubsection{Substitution model sub-menu}\label{sec:submenus}
283
284\begin{center} \framebox{\x{[M] ................. Model of nucleotide substitution}} \end{center}
285\begin{center}  \framebox{\x{[M] ................ Model  of amino-acids  substitution}} \end{center}
286PhyML implements a wide range of  substitution models: JC69 \cite{jukes69}, K80 \cite{kimura80}, F81
287\cite{felsenstein81a},  F84  \cite{phylip2},   HKY85  \cite{hasegawa85},  TN93  \cite{tamura93}  GTR
288\cite{lanave84,tavare86} and  custom for nucleotides; LG \cite{le08},  WAG \cite{whelan01b}, Dayhoff
289\cite{dayhoff78},  JTT  \cite{jones92},  Blosum62  \cite{henikoff92}, mtREV  \cite{adachi96},  rtREV
290\cite{dimmic02},  cpREV  \cite{adachi00},  DCMut   \cite{kosiol04},  VT  \cite{muller00}  and  mtMAM
291\cite{cao98}  and custom  for amino  acids.  Cycle  through the  list of  nucleotide  or amino-acids
292substitution models by typing \x{M}. Both  nucleotide and amino-acid lists include a `custom' model.
293The custom option provides the most flexible  way to specify the nucleotide substitution model.  The
294model is defined  by a string made of  six digits.  The default string is  `\x{000000}', which means
295that the six relative rates of nucleotide  changes: $A \leftrightarrow C$, $A \leftrightarrow G$, $A
296\leftrightarrow T$$C \leftrightarrow  G$, $\leftrightarrow T$ and  $G \leftrightarrow  T$, are
297equal.   The   string  `\x{010010}'  indicates  that   the  rates  $\leftrightarrow   G$  and  $C
298\leftrightarrow  T$ are equal  and distinct  from $\leftrightarrow C  = A  \leftrightarrow T  = C
299\leftrightarrow G = G  \leftrightarrow T$.  This model corresponds to HKY85  (default) or K80 if the
300nucleotide frequencies  are all set to 0.25.   `\x{010020}' and `\x{012345}' correspond  to TN93 and
301GTR models respectively.  The digit string  therefore defines groups of relative substitution rates.
302The  initial rate within  each group  is set  to 1.0,  which corresponds  to F81  (JC69 if  the base
303frequencies are equal).   Users also have the  opportunity to define their own  initial rate values.
304These rates are  then optimised afterwards (option  `\x{O}') or fixed to their  initial values.  The
305custom option can be used to implement all substitution models that are special cases of
306GTR. Table \ref{tab:modelcode} on page \pageref{tab:modelcode} gives the correspondence between the `standard' name of the model
307(see \url{http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg}) and the custom model code.
308The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid
309substitution model that is  not hard-coded in PhyML. The symmetric part of  the rate matrix, as well
310as the equilibrium amino-acid  frequencies, are given in a file which name is  given as input of the
311program. The format of this file is described in the section \ref{sec:customaa}.
312
313\vspace{0.7cm}
314\begin{center} \framebox{\x{[F] ................. Optimise equilibrium frequencies}} \end{center}
315\begin{center} \framebox{\x{[E] ......... Equilibrium frequencies (empirical/user)}} \end{center}
316\begin{center} \framebox{\x{[F]  . Amino  acid frequencies (empirical/model  defined)}} \end{center}
317For  nucleotide  sequences,  optimising  nucleotide  frequencies  means that  the  values  of  these
318parameters  are estimated in  the maximum  likelihood framework.   When the  custom model  option is
319selected, it is  also possible to give the program a  user-defined nucleotide frequency distribution
320at equilibrium  (option \x{E}).   For protein sequences,  the stationary amino-acid  frequencies are
321either  those defined  by  the substitution  model  or those  estimated by  counting  the number  of
322different amino-acids observed  in the data. Hence, the  meaning of the \x{F} option  depends on the
323type of the data to be processed.
324
325\vspace{0.7cm}
326\begin{center}          \framebox{\x{[T]          ....................          Ts/tv          ratio
327(fixed/estimated)}}   \end{center}\index{$\kappa$}\index{ts/tv ratio}    Fix   or   estimate   the
328transition/transversion ratio  in the maximum likelihood  framework.  This option  is only available
329when DNA sequences are to be analysed under  K80, HKY85 or TN93 models. The definition given to this
330parameter by PhyML  is the same as  PAML's\index{PAML} one.  Therefore, the value  of this parameter
331does {\it not} correspond  to the ratio between the expected number  of transitions and the expected
332number  of  transversions  during  a unit  of  time.   This  last  definition  is the  one  used  in
333PHYLIP\index{PHYLIP}.   PAML's  manual gives  more  detail about  the  distinction  between the  two
334definitions (\url{http://abacus.gene.ucl.ac.uk/software/paml.html}).
335
336\vspace{0.7cm}
337\begin{center}       \framebox{\x{[V]       .        Proportion      of       invariable       sites
338(fixed/estimated)}}   \end{center}\index{invariable sites}\index{proportion of invariants}  The
339proportion of  invariable sites, i.e., the  expected frequency of sites  that do not  evolve, can be
340fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each
341site in  the sequence may accumulate  substitutions at some point  during its evolution,  even if no
342differences across sequences are actually observed at  that site.  Users can also fix this parameter
343to  any value  in the  $[0.0,1.0]$ range  or estimate  it from  the data  in  the maximum-likelihood
344framework.
345
346\vspace{0.7cm} 
347\index{gamma distribution (discrete)!mean vs. median}
348\index{gamma distribution (discrete)!number of categories}
349\index{gamma distribution (discrete)!shape parameter}
350\begin{center} \framebox{\x{[R]  ....... One category  of substitution rate  (yes/no)}} \end{center}
351\begin{center} \framebox{\x{[C] ........... Number of substitution rate categories}} \end{center}
352\begin{center} \framebox{\x{[A] ... Gamma distribution parameter (fixed/estimated)}} \end{center}
353\begin{center} \framebox{\x{[G]  .........`Middle' of  each rate class  (mean/median)}} \end{center}
354Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete
355gamma distribution.  Type \x{R} to switch  this option on or  off. The different  categories of this
356discrete  distribution  correspond  to  different  (relative)  rates of  evolution.  The  number  of
357categories of this  distribution is set to 4 by  default.  It is probably not wise  to go below this
358number.   Larger values  are  generally preferred.  However,  the computational  burden involved  is
359proportional to the  number of categories (i.e.,  an analysis with 8 categories  will generally take
360twice the  time of  the same analysis  with only 4  categories). Note  that the likelihood  will not
361necessarily increase as  the number of categories increases. Hence, the  number of categories should
362be kept below a  ``reasonable'' number, say 20.  The default number of  categories can be changed by
363typing \x{C}.
364
365The middle  of each  discretized substitution rate  class can  be determined using  the mean  or the
366median. PAML,  MrBayes and RAxML  use the  mean.  However, the  median is generally  associated with
367greater likelihoods than the  mean.  This conclusion is based on our  analysis of several real-world
368data sets extracted from TreeBase.  Despite this, the  default option in PhyML is to use the mean in
369order to make PhyML  likelihoods comparable to those of other phylogenetic  software.  One must bare
370in  mind that  {\color{red}{likelihoods  calculated with  the  mean approximation  are not  directly
371comparable to the likelihoods calculated using the median approximation}}.
372
373The shape  of the  gamma distribution  determines the range  of rate  variation across  sites. Small
374values,  typically  in  the $[0.1,1.0]$  range,  correspond  to  large variability.   Larger  values
375correspond to moderate to  low heterogeneity. The gamma shape parameter can be  fixed by the user or
376estimated via maximum-likelihood. Type \x{A} to select one or the other option.
377
378
379\subsubsection{Tree searching sub-menu}
380
381\begin{center} \framebox{\x{[O] ........................... Optimise tree topology}} \end{center} By
382default the  tree topology is  optimised in order  to maximise the  likelihood. However, it  is also
383possible to avoid  any topological alteration. This option  is useful when one wants  to compute the
384likelihood of a tree given as input (see below). Type \x{O} to select among these two options.
385
386\vspace{0.7cm}
387\begin{center}  \framebox{\x{[S] .................. Tree  topology search  operations}} \end{center}
388PhyML proposes three different  methods to estimate tree topologies. The default  approach is to use
389simultaneous  NNI. This option  corresponds to  the original  PhyML algorithm  \cite{guindon03}. The
390second approach  relies on  subtree pruning and  regrafting (SPR).   It generally finds  better tree
391topologies  compared to NNI  but is  also significantly  slower.  The  third approach,  termed BEST,
392simply estimates the phylogeny using both methods  and returns the best solution among the two. Type
393\x{S} to choose among these three choices.
394
395\vspace{0.7cm}
396\begin{center}  \framebox{\x{[R] ......................... Use  random starting  tree}} \end{center}
397\begin{center} \framebox{\x{[N]  .................. Number  of random starting  trees}} \end{center}
398When the SPR or the BEST options are selected,  is is possible to use random trees rather than BioNJ
399or a user-defined tree,  as starting tree. If this option is turned on  (type \x{R} to change), five
400trees, corresponding to five random starts, will be estimated. The output tree file will contain the
401best  tree  found  among  those five.  The  number  of  random  starts  can be  modified  by  typing
402\x{N}.  Setting the  number of  random  starting trees  to $N$  means  that the  analysis will  take
403(slightly more than) $N$ times the time required for a standard analysis where only one (BioNJ)
404starting tree is used.  However, the analysis of real data sets shows  that the best trees estimated
405using the  random start  option almost  systematically have higher  likelihoods than  those inferred
406using a single starting tree.
407
408\vspace{0.7cm}
409\begin{center}      \framebox{\x{[U]     ........       Starting      tree     (BioNJ/parsimony/user
410tree)}} \end{center}\index{BioNJ}  When the  tree topology optimisation  option is turned  on, PhyML
411proceeds  by  refining an  input  tree.   By  default, this  input  tree  is estimated  using  BioNJ
412\cite{gascuelNJ}. The alternative option is to use  a parsimony tree. We found this option specially
413useful when analysing  large data sets with NNI  moves as it generally leads  to greater likelihoods
414than  those obtained  when starting  from a  BioNJ trees.  The user  can also  to input  her/his own
415tree. This  tree should  be in Newick  format (see  Section \ref{sec:input_output}). This  option is
416useful when  one wants  to evaluate  the likelihood  of a given  tree with  a fixed  topology, using
417PhyML. Type \x{U} to choose among these two options.
418
419\subsubsection{Branch support sub-menu}
420
421\begin{center}  \framebox{\x{[B] ................ Non  parametric bootstrap  analysis}} \end{center}
422The  support  of the  data  for  each  internal branch  of  the  phylogeny  can be  estimated  using
423non-parametric bootstrap.   By default, this option is  switched off.  Typing \x{B}  switches on the
424bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The largest this
425number the more precise the bootstrap  support estimates are.  However, for each bootstrap replicate
426a phylogeny is estimated. Hence, the time  needed to analyse $N$ bootstrap replicates corresponds to
427$N$-times the time spent  on the analysis of the original data  set. $N=100$ is generally considered
428as a reasonable number of replicates.
429
430\begin{center}  \framebox{\x{[A] ................ Approximate  likelihood ratio  test}} \end{center}
431When the  bootstrap option is switched off  (see above), approximate likelihood  branch supports are
432estimated. This approach is considerably faster than the bootstrap one. However, both methods intend
433to  estimate different quantities  and conducting  a fair  comparison between  both criteria  is not
434straightforward.  The  estimation  of  approximate  likelihood  branch  support  comes  in  multiple
435flavours. The default is  set to aBayes, corresponding to the approximate  Bayes method described in
436\cite{anisimova11}.   The    approximate   likelihood   ratio    test   (aLRT)   \cite{anisimova06},
437Shimodaira–Hasegawa aLRT (SH-aLRT) statistics are the other available options.
438
439
440\subsection{Command-line interface}
441
442An alternative to the PHYLIP-like interface is the command-line interface. Users that do not need to
443modify  the default  parameters  can launch  the  program with  the  `\x{phyml -i  seq\_file\_name}'
444command.  The list of all command line arguments and  how to use them is given in the `Help' section
445which is displayed when entering the `\x{phyml --help}' command.  The available command-line options
446are described in what follows.
447
448\begin{itemize}
449
450\item \x{-i} (or \x{--input}) \x{seq\_file\_name}\index{command-line options!\x{--input}} \\
451\x{seq\_file\_name} is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
452
453\item \x{-d} (or \x{--datatype}) \x{data\_type}\index{command-line options!\x{--data\_type}}\\
454\x{data\_type} is \x{nt} for nucleotide (default) and \x{aa} for amino-acid sequences.
455
456\item \x{-q} (or \x{--sequential})\index{sequence format!interleaved}\index{sequence format!sequential}\index{command-line options!\x{--sequential}} \\
457Changes interleaved format (default) to sequential format.
458
459\item \x{-n} (or \x{--multiple}) \x{nb\_data\_sets}\index{multiple data sets}\index{command-line options!\x{--multiple}}\\
460\x{nb\_data\_sets} is an integer giving the number of data sets to analyse.
461
462\item \x{-p} (or \x{--pars})\index{command-line options!\x{--pars}}\\
463Use a minimum parsimony starting tree. This option is taken into account when the `-u' option
464is absent and when tree topology modifications are to be done.
465
466\item \x{-b} (or \x{--bootstrap}) \x{int}\index{bootstrap}\index{command-line options!\x{--bootstrap}} 
467\begin{itemize} 
468\item \x{int} $>$  0: \x{int} is the number of bootstrap replicates.
469\item \x{int} =  0: neither approximate likelihood ratio test nor bootstrap values are computed.
470\item \x{int} = -1: approximate likelihood ratio test returning aLRT statistics.
471\item \x{int} = -2: approximate likelihood ratio test returning Chi2-based parametric branch supports.
472% \item \x{int} = -3: minimum of Chi2-based parametric and SH-like branch supports.
473\item \x{int} = -4: SH-like branch supports alone.
474\item \x{int} = -5: (default) approximate Bayes branch supports.
475\end{itemize}
476
477\item \x{-m} (or \x{--model}) \x{model\_name}\index{substitution models!DNA}\index{substitution models!amino acids}\index{command-line options!\x{--model}} \\
478\x{model\_name} : substitution model name.
479\begin{itemize}
480\item {\it Nucleotide-based models}: \x{HKY85} (default) \x{| JC69 |  K80 | F81 | F84 | TN93 | GTR |
481custom} \\  The \x{custom} option can be  used to define a  new substitution model. A  string of six
482digits  identifies  the model.  For  instance,  000000 corresponds  to  F81  (or  JC69 provided  the
483distribution of nucleotide  frequencies is uniform).  012345 corresponds to GTR.  This option can be
484used for encoding any model that is a nested within GTR. See Section \ref{sec:submenus} and Table \ref{tab:modelcode}. {\em NOTE:}
485the  substitution  parameters  of  the  custom  model  will be  optimised  so  as  to  maximise  the
486likelihood.  It is  possible  to  specify and  fix  (i.e., avoid  optimisation)  the  values of  the
487substitution rates only through the PHYLIP-like interface.
488
489\item {\it Amino-acid based models}: \x{LG} (default) \x{| WAG | JTT | MtREV | Dayhoff | DCMut | RtREV
490 | CpREV | VT | Blosum62 | MtMam | MtArt | HIVw |  HIVb | custom} \\
491The \x{custom} option is  useful when one wants to use an amino-acid  substitution model that is not
492available by  default in PhyML. The  symmetric part of the  rate matrix, as well  as the equilibrium
493amino-acid frequencies, are  given in a file which name  is asked for by the  program. The format of
494this file is described in section \ref{sec:customaa}.
495\end{itemize}
496
497\begin{table}\index{custom models}
498\begin{center}
499\begin{tabular}{ll}
500\hline
501Name & Command-line option \\
502\hline
503JC69 &  \x{-m 000000 -f 0.25,0.25,0.25,0.25} \\
504F81 &   \x{-m 000000}\\
505K80 &   \x{-m 010010 -f 0.25,0.25,0.25,0.25} \\
506HKY85 & \x{-m 010010}\\
507TrNef & \x{-m 010020 -f 0.25,0.25,0.25,0.25} \\
508TrN &   \x{-m 010020}\\
509K81 &   \x{-m 123321 -f 0.25,0.25,0.25,0.25}\\
510K81uf & \x{-m 123321}\\
511TIMef & \x{-m 132241 -f 0.25,0.25,0.25,0.25}\\
512TIM &   \x{-m 132241}\\
513TVMef & \x{-m 102304 -f 0.25,0.25,0.25,0.25}\\
514TVM &   \x{-m 102304}\\
515SYM &   \x{-m 123456 -f 0.25,0.25,0.25,0.25}\\
516GTR &   \x{-m 123456}\\
517\hline
518\end{tabular}
519\caption{Nucleotide substitution model names (as defined in \cite{posada01}) and the corresponding
520  custom model code used in PhyML.}\label{tab:modelcode}
521\end{center}
522\end{table}
523
524\item \x{--aa\_rate\_file file\_name}\index{command-line options!\x{--aa\_rate\_file}} \\
525This option is compulsory when analysing amino-acid sequences under a `custom' model (see above). \x{file\_name}
526should provide a rate matrix and equilibrium amino acid in PAML format (see Section \ref{sec:customaa}).
527
528\item \x{-f e}, \x{m}, or ``\x{fA,fC,fG,fT}" \index{frequencies!nucleotide}\index{frequencies!amino-acid}\index{stationary frequencies}\index{command-line options!\x{-f}}\\
529Nucleotide or amino-acid frequencies.
530\begin{itemize}
531\item \x{e} : the character frequencies are determined as follows :
532\begin{itemize}
533\item {\it Nucleotide sequences}: (Empirical) the equilibrium base frequencies are estimated by counting
534                 the occurence of the different bases in the alignment.
535\item {\it Amino-acid sequences}: (Empirical) the equilibrium amino-acid frequencies are estimated by counting
536                 the occurence of the different amino-acids in the alignment.
537\end{itemize} 
538
539\item \x{m} : the character frequencies are determined as follows :
540\begin{itemize}
541\item {\it Nucleotide sequences}: (ML) the equilibrium base frequencies are estimated using maximum
542  likelihood.
543\item {\it Amino-acid sequences}: (Model) the equilibrium amino-acid frequencies are estimated using
544                 the frequencies defined by the substitution model.
545\end{itemize}
546
547\item ``\x{fA,fC,fG,fT}": only valid for nucleotide-based models. \x{fA}, \x{fC}, \x{fG} and \x{fT} are floating numbers that
548correspond to the frequencies of A, C, G and T respectively.
549\end{itemize} 
550
551\item \x{-t} (or \x{--ts/tv}) \x{ts/tv\_ratio} \index{$\kappa$}\index{ts/tv ratio}\index{command-line options!\x{--ts/tv}}\\
552\x{ts/tv\_ratio}: transition/transversion ratio. DNA sequences only. Can be a fixed positive value
553(e.g., 4.0) or type \x{e} to get the maximum likelihood estimate.
554
555\item \x{-v} (or \x{--pinv}) \x{prop\_invar}\index{proportion of invariants}\index{invariable sites} \index{command-line options!\x{--pinv}}\\
556\x{prop\_invar}: proportion of invariable sites. Can be a fixed value in the [0,1] range or type \x{e} to get the maximum likelihood estimate.
557
558\item \x{-c} (or \x{--nclasses}) \x{nb\_subst\_cat}\index{gamma distribution (discrete)!number of categories} \index{command-line options!\x{--nclasses}}\\
559\x{nb\_subst\_cat}: number of relative substitution rate categories. Default: \x{nb\_subst\_cat=4}. Must be a positive integer.
560
561\item \x{-a} (or \x{--alpha}) \x{gamma} \index{gamma distribution (discrete)!shape parameter}\index{command-line options!\x{--alpha}} \\
562\x{gamma}: value of the gamma shape parameter. Can be a fixed positive value or e to get the maximum
563likelihood estimate. The value of this parameter is estimated in the maximum likelihood framework by default.
564
565\item \x{--use\_median} \index{gamma distribution (discrete)!mean vs. median} \index{command-line options!\x{--use\_median}}\\
566The middle of each substitution rate class in the discrete gamma distribution is taken as the
567median. The mean is used by default.
568
569\item \x{--free\_rates} \index{command-line options!\x{--free\_rates}}\\
570As an alternative to the discrete gamma model, it is possible to estimate the (relative) rate in
571each class of the (mixture) model and the corresponding frequencies directly from the data. This
572model, called the FreeRate model, has more parameters
573than the discrete gamma one but usually provides a significantly better fit to the data. See
574\cite{soubrier12} for more information about this model and an illustration of its use.
575
576\item \x{--codpos} \x{1,2 or 3} \index{command-line options!\x{--codpos}}\\
577When analysing an alignment of coding sequences, use this option to consider only the first, second
578or the third coding position for the estimation.
579
580\item \x{-s} (or \x{--search}) \x{move}\index{NNI}\index{SPR} \index{command-line options!\x{--search}}\\
581Tree topology search operation option. Can be either \x{NNI} (default, fast) or \x{SPR} (usually
582slower than \x{NNI} but more accurate) or \x{BEST} (best of NNI and SPR search).
583
584\item \x{-u} (or \x{--inputtree}) \x{user\_tree\_file}\index{input tree}\index{user tree} \index{command-line options!\x{--inputtree}}\\
585\x{user\_tree\_file}: starting tree filename. The tree must be in Newick format.
586
587\item \x{-o params}\index{optimisation!topology}\index{optimisation!substitution parameters} \index{command-line options!\x{-o}}\\
588This option focuses on specific parameter optimisation.
589\begin{itemize}
590\item \x{params=tlr}: tree topology (\x{t}), branch length (\x{l}) and substitution rate parameters (\x{r}) are optimised.
591\item \x{params=tl}: tree topology and branch lengths are optimised.
592\item \x{params=lr}: branch lengths and substitution rate parameters are optimised.
593\item \x{params=l}: branch lengths are optimised.
594\item \x{params=r}: substitution rate parameters are optimised.
595\item \x{params=n}: no parameter is optimised.
596\end{itemize}
597
598\item \x{--rand\_start}\index{random tree} \index{command-line options!\x{--rand\_start}}\\
599This option sets the initial tree to random. It is only valid if SPR searches are to be performed.
600
601\item \x{--n\_rand\_starts num} \index{command-line options!\x{--n\_rand\_starts}}\\
602\x{num} is the number of initial random trees to be used. It is only valid if SPR searches are to be performed.
603
604\item \x{--r\_seed num}\index{random number} \index{command-line options!\x{--r\_seed}}\\
605\x{num} is the seed used to initiate the random number generator. Must be an integer.
606
607\item \x{--print\_site\_lnl}\index{likelihood!print site likelihood} \index{command-line options!\x{--print\_site\_lk}}\\
608Print the likelihood for each site in file *\_phyml\_lk.txt. For $\Gamma$ or $\Gamma$+I or
609FreeRate models, this option returns the posterior probability of each relative rate class at
610each site. Such information can then be used to identify fast- and slow-evolving regions of the alignment.
611
612\item \x{--print\_trace}\index{command-line options!\x{--print\_trace}}\\
613Print each phylogeny explored during the tree search process in file *\_phyml\_trace.txt. This
614option can be useful for monitoring the progress of the analysis for very large data sets and have
615an approximate idea of what the final phylogeny will look like.
616
617\item \x{--run\_id ID\_string}\index{run ID} \index{command-line options!\x{--run\_id}}\\
618Append the string ID\_string at the end of each PhyML output file. This option may be useful when
619running simulations involving PhyML. It can also be used to `tag' multiple analysis of the same data
620set with various program settings.
621
622\item \x{--no\_memory\_check} \index{command-line options!\x{--no\_memory\_check}}\\
623By default, when processing a large data set, PhyML will pause and ask the user to confirm that
624she/he wants to continue with the execution of the analysis despite the large amount of memory
625required. The \x{--no\_memory\_check}  skips this question. It is especially useful when running
626PhyML in batch mode.
627
628\item \x{--no\_colalias} \index{command-line options!\x{--no\_colalias}}\\
629By default, PhyML preprocesses each alignment by putting together (or aliasing) the columns that are
630identical. Use this option to skip this step but be aware that the analysis might then take more
631time to complete.
632
633\item \x{--constrained\_lens} \index{command-line options!\x{--constrained\_lens}}\\
634When an input tree with branch lengths is provided, this option will find the branch multiplier that
635maximises the likelihood (i.e., the relative branch lengths remain constant)
636
637\item  \x{--constraint\_file}  \x{file\_name} \index{command-line  options!\x{--constraint\_file}}\\
638\x{file\_name}  lists  the  topological  constraints   under  which  the  tree  topology  search  is
639conducted.  This option  should  be used  in  conjunction with  \x{-u}  \x{file\_name}. See  Section
640\ref{sec:topoconstraints} for more information.
641
642\item \x{--quiet} \index{command-line options!\x{--quiet}}\\
643Runs PhyML in quiet mode. The program will not pause if the memory required to run the analysis
644exceeds 256MB and will not output the log-likelihood score to the output.
645
646\end{itemize}
647
648\subsection{XML interface}
649\begin{itemize}
650\item \x{--xml=xml\_file\_name}\index{command-line options!\x{--xml}} \\
651\x{xml\_file\_name} is the name of the XML file containing the information required to run the
652analysis. More details about this type of file is given in the section \ref{sec:xmlio}.
653\end{itemize}
654 
655\subsection{Parallel  bootstrap}\label{sec:parallel_bootstrap}\index{MPI}\index{bootstrap!parallel}
656
657Bootstrapping is  a highly  parallelizable task. Indeed,  bootstrap replicates are  independent from
658one another.   Each bootstrap replicate can then  be analysed separately. Modern  computers often have
659more than one CPU. Each CPU can therefore be used to process a bootstrap sample. Using this parallel
660strategy, performing  $R$ bootstrap replicates  on $C$ CPUs  `costs' the same amount  of computation
661time as processing $\times C$ bootstrap replicates on a single CPU.  In  other words, for a given
662number of replicates, the computation time is divided by $R$ compared to the non-parallel approach.
663
664PhyML sources  must be compiled with  specific options to turn  on the parallel  option (see Section
665\ref{sec:MPI}). Once  the binary file (\x{phyml})  has been generated, running  a bootstrap analysis
666with, say 100 replicates on 2 CPUs, can be done by typing the following command-line:
667\begin{verbatim}
668mpd &;
669mpirun -np 2 ./phyml -i seqfile -b 100;
670\end{verbatim} 
671The  first command  launches  the mpi  daemon  while the  second launches  the  analysis. Note  that
672launching the daemon needs to be done only once.  The output files are similar to the ones generated
673using the standard, non-parallel, analysis (see Section \ref{sec:input_output}). Note that running
674the program in batch mode, i.e.:
675\begin{verbatim}
676mpirun -np 2 ./phyml -i seqfile -b 100 &
677\end{verbatim} 
678will probably NOT work. I do not know how to run a mpi process in batch mode yet. Suggestions welcome...
679Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs
680required in the mpirun command.
681
682\section{Inputs \& outputs for command-line and PHYLIP interface }\label{sec:input_output}
683
684PhyML reads data from standard text files,  without the need for any particular file name extension.
685
686\subsection{Sequence formats}
687
688\begin{figure}
689\begin{small}
690\begin{Verbatim}[frame=single, label=PHYLIP interleaved, samepage=true, baselinestretch=0.5]
691
6925 80
693seq1  CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
694seq2  CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
695seq3  RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT
696seq4  RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
697seq5  RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT
698
699ATCKGCTTTTGGCAGGAAAT
700ATCKGCTTTTGGCGGGAAAT
701AGCKGCTGTTG?????????
702ATCTGCTTTTGGCGGGAAAT
703ATCTGCTTTTGGCGGGAAAT
704
705\end{Verbatim}
706\begin{Verbatim}[frame=single, label=PHYLIP sequential, samepage=true, baselinestretch=0.5]
707
7085 40
709seq1  CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG
710seq2  CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG
711seq3  RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX
712seq4  RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX
713seq5  RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx
714
715\end{Verbatim}
716\end{small}
717\label{fig:align_tree}
718\caption{\bf PHYLIP interleaved and sequential formats.}
719\end{figure}
720
721
722
723\begin{figure}
724\begin{small}
725\begin{Verbatim}[frame=single, label=Nexus nucleotides, samepage=true, baselinestretch=0.5]
726
727[ This is a comment ]
728#NEXUS
729BEGIN DATA;
730DIMENSIONS NTAX=10 NCHAR=20;
731FORMAT DATATYPE=DNA;
732MATRIX
733tax1       ?ATGATTTCCTTAGTAGCGG
734tax2       CAGGATTTCCTTAGTAGCGG
735tax3       ?AGGATTTCCTTAGTAGCGG
736tax4       ?????????????GTAGCGG
737tax5       CAGGATTTCCTTAGTAGCGG
738tax6       CAGGATTTCCTTAGTAGCGG
739tax7       ???GATTTCCTTAGTAGCGG
740tax8       ????????????????????
741tax9       ???GGATTTCTTCGTAGCGG
742tax10      ???????????????AGCGG;
743END;                                                                                                                             
744
745\end{Verbatim}
746\end{small}
747
748\begin{small}
749\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5]
750
751[ This is a comment ]
752#NEXUS
753BEGIN DATA;
754DIMENSIONS NTAX=10 NCHAR=20;
755FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3";
756MATRIX
757tax1       ?0320333113302302122
758tax2       10220333113302302122
759tax3       ?0220333113302302122
760tax4       ?????????????2302122
761tax5       10220333113302302122
762tax6       10220333113302302122
763tax7       ???20333113302302122
764tax8       ????????????????????
765tax9       ???22033313312302122
766tax10      ???????????????02122;
767END;                                                                                                                             
768
769\end{Verbatim}
770\end{small}
771
772\begin{small}
773\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5]
774
775[ This is a comment ]
776#NEXUS
777BEGIN DATA;
778DIMENSIONS NTAX=10 NCHAR=20;
779FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03";
780MATRIX
781tax1       ??00030200030303010103030002030002010202
782tax2       0100020200030303010103030002030002010202
783tax3       ??00020200030303010103030002030002010202
784tax4       ??????????????????????????02030002010202
785tax5       0100020200030303010103030002030002010202
786tax6       0100020200030303010103030002030002010202
787tax7       ??????0200030303010103030002030002010202
788tax8       ????????????????????????????????????????
789tax9       ??????0202000303030103030102030002010202
790tax10      ??????????????????????????????0002010202;
791END;                                                                                                                             
792
793\end{Verbatim}
794\end{small}
795\caption{\bf NEXUS formats.}\label{fig:nexus}
796\end{figure}
797
798
799Alignments   of   DNA   or   protein   sequences   must  be   in   PHYLIP\index{PHYLIP}   or   NEXUS
800\cite{maddison97}\index{NEXUS} sequential\index{sequential} or interleaved\index{interleaved} format
801(Figures \ref{fig:align_tree}  and \ref{fig:nexus}).  For  PHYLIP formated sequence  alignments, the
802first line of  the input file contains the number  of species and the number  of characters, in free
803format, separated by blank characters.  One slight difference with PHYLIP format deals with sequence
804name  lengths.  While PHYLIP  format limits  this length  to ten  characters, PhyML  can read  up to
805hundred  character long sequence  names.  Blanks  and the  symbols ``(),:''  are not  allowed within
806sequence names because  the Newick tree format  makes special use of these  symbols.  Another slight
807difference with  PHYLIP format is  that actual sequences  must be separated  from their names  by at
808least one blank character.
809
810A PHYLIP input sequence file  may also display more than a single data set.  Each of these data sets
811must  be  in  PHYLIP   format  and  two  successive  alignments  must  be   separated  by  an  empty
812line. Processing multiple  data sets requires to toggle  the `\x{M}' option in the  {\em Input Data}
813sub-menu or use the `\x{-n}' command line option  and enter the number of data sets to analyse.  The
814multiple  data set  option can  be  used to  process re-sampled  data  that were  generated using  a
815non-parametric  procedure such  as  cross-validation or  jackknife  (a bootstrap  option is  already
816included in PhyML).  This  option is also useful in multiple gene studies,  even if fitting the same
817substitution model to all data sets may not be suitable.
818
819PhyML can  also process alignments in  NEXUS format. Although not  all the options provided  by this
820format are supported  by PhyML, a few specific  features are exploited.  Of course,  this format can
821handle nucleotide and protein  sequence alignments in sequential or interleaved  format.  It is also
822possible  to  use custom  alphabets,  replacing  the standard  4-state  and  20-state alphabets  for
823nucleotides and amino-acids respectively. Examples of a  4-state custom alphabet are given in Figure
824\ref{fig:nexus}. Each state must here  correspond to one digit or more. The set  of states must be a
825list of consecutive  digits starting from 0.  For instance,  the list ``0, 1, 3, 4''  is not a valid
826alphabet. Each  state in the  symbol list must  be separated  from the next  one by a  space. Hence,
827alphabets with large number of states can be easily defined by using two-digit number (starting with
82800 up  to 19  for a 20  state alphabet).  Most  importantly, this  feature gives the  opportunity to
829analyse data sets made of presence/absence character states (use the \texttt{symbols=``0 1''} option
830for such data).\index{binary characters} Alignments made  of custom-defined states will be processed
831using the Jukes and Cantor model.  Other options  of the program (e.g., number of rate classes, tree
832topology search algorithm) are freely configurable. Note  that, at the moment, the maximum number of
833different states is  set to 22 in order to  save memory space.  It is however  possible to lift this
834threshold   by  modifiying   the   value  of   the  variable   \x{T\_MAX\_ALPHABET}   in  the   file
835`\x{utilities.h}'. The program will then have to be re-compiled.
836
837
838
839\subsubsection{Gaps and ambiguous characters}
840
841Gaps correspond to  the `\x{-}' symbol.  They are systematically treated  as unknown characters ``on
842the grounds  that we  don't know what  would be  there if something  were there''  (J.  Felsenstein,
843PHYLIP main documentation).   The likelihood at these  sites is summed over all  the possible states
844(i.e.,  nucleotides  or   amino  acids)  that  could  actually  be   observed  at  these  particular
845positions. Note however that  columns of the alignment that display only  gaps or unknown characters
846are simply discarded because  they do not carry any phylogenetic information  (they are equally well
847explained  by any  model).  PhyML  also handles  ambiguous characters  such as  $R$ for  $A$  or $G$
848(purines) and $Y$ for $C$  or $T$ (pyrimidines).  Tables \ref{tab:ambigu_nt} and \ref{tab:ambigu_aa}
849give the list of valid characters/symbols and the corresponding nucleotides or amino acids.
850
851\begin{table}
852\begin{center}
853\begin{tabular}{lr|lr}
854\hline
855Character & Nucleotide &   Character & Nucleotide \\               
856\hline                                   
857$A$       & Adenosine &     $Y$       & $C$ or $T$ \\                 
858$G$       & Guanine &       $K$       & $G$ or $T$ \\               
859$C$       & Cytosine &      $B$       & $C$ or $G$ or $T$\\         
860$T$       & Thymine &       $D$       & $A$ or $G$ or $T$ \\       
861$U$       & Uracil (=$T$) & $H$       & $A$ or $C$ or $T$ \\       
862$M$       & $A$ or $C$ &    $V$       & $A$ or $C$ or $G$ \\       
863$R$       & $A$ or $G$ &    $-$ or $N$ or $X$ or $?$ & unknown  \\ 
864$W$       & $A$ or $T$ &    & (=$A$ or $C$ or $G$ or $T$)\\         
865$S$       & $C$ or $G$ &   & \\
866\hline
867\end{tabular}
868\end{center}
869\caption{{\bf List of valid characters in DNA sequences and the corresponding nucleotides.}}\label{tab:ambigu_nt}
870\end{table}
871\begin{table}
872\begin{center}
873\begin{tabular}{lr|lr}
874\hline
875Character & Amino-Acid & Character & Amino-Acid \\
876\hline
877$A$       & Alanine &         $L$       & Leucine \\             
878$R$       & Arginine &        $K$       & Lysine \\             
879$N$ or $B$& Asparagine &      $M$       & Methionine \\         
880$D$       & Aspartic acid &   $F$       & Phenylalanine \\       
881$C$       & Cysteine &        $P$       & Proline \\             
882$Q$ or $Z$& Glutamine &       $S$       & Serine \\             
883$E$       & Glutamic acid &   $T$       & Threonine \\           
884$G$       & Glycine &         $W$       & Tryptophan \\         
885$H$       & Histidine &       $Y$       & Tyrosine \\           
886$I$       & Isoleucine &      $V$       & Valine \\             
887$L$       & Leucine &         $-$ or $X$ or $?$ & unknown \\     
888$K$       & Lysine &          & (can be any amino acid) \\       
889\hline
890\end{tabular}
891\end{center}
892\caption{{\bf List of valid characters in protein sequences and the corresponding amino acids.}}\label{tab:ambigu_aa}
893\end{table}
894
895\subsubsection{Specifying outgroup sequences}\label{sec:outgroupspecify}
896
897PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In
898order to do so, sequence names that display a `*' character will be automatically considered as
899belonging to the outgroup.
900
901The topology of  the rooted tree is  exactly the same as the  unrooted version of the  same tree. In
902other words,  PhyML first ignores the distinction  between ingroup and outgroup  sequences, builds a
903maximum likelihood unrooted tree  and then tries to add the root. If the  outgroup has more than one
904sequence, the position  of the root might be  ambiguous. In such situation, PhyML  tries to identify
905the  most relevant  position of  the root  by considering  which edge  provides the  best separation
906between ingroup  and outgroup taxa (i.e.,  we are trying to  make the outgroup  ``as monophyletic as
907possible'').
908
909\subsection{Tree format}
910
911PhyML can  read one or  several phylogenetic trees  from an input  file.  This option  is accessible
912through the  {\em Tree Searching} sub  menu or the `\x{-u}'  argument from the  command line.  Input
913trees are generally used as initial maximum  likelihood estimates to be subsequently adjusted by the
914tree searching algorithm.   Trees can be either rooted or unrooted  and multifurcations are allowed.
915Taxa names must, of course, match the corresponding sequence names.
916
917\begin{figure}[h]
918\begin{small}
919\begin{minipage}{\textwidth}
920\begin{verbatim}
921((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01);
922((seq3,seq2),seq1,(seq4,seq5));
923\end{verbatim}
924\end{minipage}
925\end{small}
926\caption{{\bf Input trees}. The first tree (top) is rooted and has branch lengths. The second tree
927  (bottom) is unrooted and does not have branch lengths.}
928\label{fig:trees}\index{Newick format}
929\end{figure}
930
931
932\subsection{Multiple alignments and trees}\index{multiple data sets}
933
934Single or  multiple sequence  data sets may  be used  in combination with  single or  multiple input
935trees. When the number of data sets is one ($n_D = 1$) and there is only one input tree ($n_T = 1$),
936then this tree is simply  used as input for the single data set analysis. When  $n_D = 1$ and $n_T >
9371$,  each input tree  is used  successively for  the analysis  of the  single alignment.  PhyML then
938outputs the tree  with the highest likelihood.  If $n_D > 1$ and  $n_T = 1$, the same  input tree is
939used for the analysis  of each data set.  The last  combination is $n_D > 1$ and $n_T  > 1$. In this
940situation, the  $i$-th tree in the input  tree file is used  to analyse the $i$-th  data set. Hence,
941$n_D$ and $n_T$ must be equal here.
942
943
944\subsection{Custom amino-acid rate model}\label{sec:customaa}
945
946The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded
947in  PhyML.   This model  must  be  time-reversible.  Hence,  the  matrix  of  substitution rates  is
948symmetrical. The format  of the rate matrix with the associated  stationary frequencies is identical
949to the one used in PAML\index{PAML}. An example is given below:
950
951\begin{center}
952{\tiny
953\begin{tabular}{p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}}
954% Ala & Arg & Asn & Asp & Cys & Gln & Glu & Gly & His & Ile & Leu & Lys & Met & Phe & Pro & Ser & Thr & Trp & Tyr & Val \\
955  &&&&&&&&&&&&&&&&&&& \\
9560.55 &  &&&&&&&&&&&&&&&&&& \\
9570.51 & 0.64 &  &&&&&&&&&&&&&&&& \\
9580.74 & 0.15 & 5.43 &  &&&&&&&&&&&&&&&& \\
9591.03 & 0.53 & 0.27 & 0.03 &  &&&&&&&&&&&&&&& \\
9600.91 & 3.04 & 1.54 & 0.62 & 0.10 &   &&&&&&&&&&&&&& \\
9611.58 & 0.44 & 0.95 & 6.17 & 0.02 & 5.47 &  &&&&&&&&&&&&& \\
9621.42 & 0.58 & 1.13 & 0.87 & 0.31 & 0.33 & 0.57 &  &&&&&&&&&&&& \\
9630.32 & 2.14 & 3.96 & 0.93 & 0.25 & 4.29 & 0.57 & 0.25 &  &&&&&&&&&&& \\
9640.19 & 0.19 & 0.55 & 0.04 & 0.17 & 0.11 & 0.13 & 0.03 & 0.14 &  &&&&&&&&&& \\
9650.40 & 0.50 & 0.13 & 0.08 & 0.38 & 0.87 & 0.15 & 0.06 & 0.50 & 3.17 &  &&&&&&&&& \\
9660.91 & 5.35 & 3.01 & 0.48 & 0.07 & 3.89 & 2.58 & 0.37 & 0.89 & 0.32 & 0.26 &  &&&&&&&& \\
9670.89 & 0.68 & 0.20 & 0.10 & 0.39 & 1.55 & 0.32 & 0.17 & 0.40 & 4.26 & 4.85 & 0.93 &  &&&&&&& \\
9680.21 & 0.10 & 0.10 & 0.05 & 0.40 & 0.10 & 0.08 & 0.05 & 0.68 & 1.06 & 2.12 & 0.09 & 1.19 &  &&&&&& \\
9691.44 & 0.68 & 0.20 & 0.42 & 0.11 & 0.93 & 0.68 & 0.24 & 0.70 & 0.10 & 0.42 & 0.56 & 0.17 & 0.16 &  &&&&& \\
9703.37 & 1.22 & 3.97 & 1.07 & 1.41 & 1.03 & 0.70 & 1.34 & 0.74 & 0.32 & 0.34 & 0.97 & 0.49 & 0.55 & 1.61 &  &&&& \\
9712.12 & 0.55 & 2.03 & 0.37 & 0.51 & 0.86 & 0.82 & 0.23 & 0.47 & 1.46 & 0.33 & 1.39 & 1.52 & 0.17 & 0.80 & 4.38 &  &&& \\
9720.11 & 1.16 & 0.07 & 0.13 & 0.72 & 0.22 & 0.16 & 0.34 & 0.26 & 0.21 & 0.67 & 0.14 & 0.52 & 1.53 & 0.14 & 0.52 & 0.11 &  && \\
9730.24 & 0.38 & 1.09 & 0.33 & 0.54 & 0.23 & 0.20 & 0.10 & 3.87 & 0.42 & 0.40 & 0.13 & 0.43 & 6.45 & 0.22 & 0.79 & 0.29 & 2.49 &   & \\
9742.01 & 0.25 & 0.20 & 0.15 & 1.00 & 0.30 & 0.59 & 0.19 & 0.12 & 7.82 & 1.80 & 0.31 & 2.06 & 0.65 & 0.31 & 0.23 & 1.39 & 0.37 & 0.31 &   \\
975\\
9768.66 & 4.40 & 3.91 & 5.70 & 1.93 & 3.67 & 5.81 & 8.33 & 2.44 & 4.85 & 8.62 & 6.20 & 1.95 & 3.84 & 4.58 & 6.95 & 6.10 & 1.44 & 3.53 & 7.09  \\
977\end{tabular}
978}
979\end{center}
980
981The  entry  on the  $i$-th  row  and  $j$-th  column of  this  matrix  corresponds  to the  rate  of
982substitutions between  amino-acids $i$  and $j$.   The last line  in the  file gives  the stationary
983frequencies and must be separated from the rate  matrix by one line. The ordering of the amino-acids
984is alphabetical,  i.e, Ala, Arg, Asn, Asp,  Cys, Gln, Glu, Gly,  His, Ile, Leu, Lys,  Met, Phe, Pro,
985Ser, Thr, Trp, Tyr and Val.
986
987
988\subsection{Topological constraint file}\label{sec:topoconstraints}
989
990PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In
991order to do so, one should use the \x{--constraint\_file}  \x{file\_name} command-line option where
992\x{file\_name} lists the topological constraints. Such constraints are  straightforward to
993define. For instance, the following constraints:
994\vspace{0.2cm}
995\begin{Verbatim}
996((A,B,C),(D,E,F));
997\end{Verbatim} 
998indicate that taxa A, B and C belong to the same clade. D, E and F also belong to the
999same  clade and the  two clades  hence defined  should not  overlap. Under  these two
1000constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that
1001the constraints are defined  using a multifurcating tree in NEWICK format.  Note that this tree does
1002not need to display the whole list of taxa. For instance, while the only taxa involved in specifying
1003topological constraints above  are A, B, C, D, E  \& F, the actual data set  could include more than
1004these six taxa only.
1005
1006PhyML tree topology  search algorithms all rely on  improving a starting tree. By  default, BioNJ is
1007the method  of choice  for building this  tree. However,  there is no  guarantee that  the phylogeny
1008estimated with PhyML does comply with the  topological constraints. While it is probably possible to
1009implement  BioNJ  with  topological constraints,  we  have  not  done  so yet.   Instead,  the  same
1010multifurcating tree that  defines the topological constraints  should also be used  as starting tree
1011using  the \x{-u}  (\x{--inputtree})  option. Altogether,  the  command line  should  look like  the
1012following: \x{-u}=\x{file\_name} \x{--constraint\_file}=\x{file\_name}. It is not possible to use as
1013input  tree a  non-binary phylogeny  that is  distinct  from that  provided in  the constraint  tree
1014file. However, any binary tree compatible with the constraint one can be used as input tree.
1015
1016
1017\subsection{Output files}
1018
1019\begin{table}
1020Sequence file name~: `{\x seq}'\\
1021\begin{center}
1022\begin{tabular}{ll}
1023\hline
1024Output file name & Content \\
1025\hline
1026\x{seq\_phyml\_tree.txt} & ML tree\\
1027\x{seq\_phyml\_stats.txt} &  ML model parameters\\
1028\x{seq\_phyml\_boot\_trees.txt} & ML trees -- bootstrap replicates\\
1029\x{seq\_phyml\_boot\_stats.txt} & ML model parameters -- bootstrap replicates \\
1030\x{seq\_phyml\_rand\_trees.txt} & ML trees -- multiple random starts\\
1031\hline
1032\end{tabular}
1033\end{center}
1034\caption{{\bf Standard output files}}\label{tab:output}
1035\end{table}
1036
1037Table  \ref{tab:output} presents  the list  of files  resulting from  an analysis.   Basically, each
1038output file  name can be divided into  three parts.  The first  part is the sequence  file name, the
1039second part corresponds to  the extension `\x{\_phyml\_}' and the third part  is related to the file
1040content.  When launched with the default options,  PhyML only generates two files: the tree file and
1041the model parameter file.   The estimated maximum likelihood tree is in  standard Newick format (see
1042Figure  \ref{fig:trees}).  The  model  parameters file,  or  statistics file,  displays the  maximum
1043likelihood estimates of the substitution model  parameters, the likelihood of the maximum likelihood
1044phylogenetic model, and  other important information concerning the settings  of the analysis (e.g.,
1045type of data, name of the substitution model, starting tree, etc.).  Two additional output files are
1046created if  bootstrap supports were  evaluated.  These files  simply contain the  maximum likelihood
1047trees  and  the  substitution  model  parameters  estimated from  each  bootstrap  replicate.   Such
1048information can be used to estimate sampling errors around each parameter of the phylogenetic model.
1049When the random  tree option is turned on,  the maximum likelihood trees estimated  from each random
1050starting trees are printed in a separate tree file (see last row of Table \ref{tab:output}).
1051
1052\subsection{Treatment of invariable sites with fixed branch lengths}
1053
1054PhyML  allows users  to give  an input  tree with  fixed topology  and branch  lengths and  find the
1055proportion of invariable sites that maximise the likelihood (option \x{-o r}). These two options can
1056be considered  as conflicting since  branch lengths depend  on the proportion of  invariants. Hence,
1057changing the proportion  of invariants implies that branch lengths are  changing too. More formally,
1058let $l$ denote the length of a branch,  i.e., the expected number of substitutions per site, and $p$
1059be  the proportion  of invariants.  We have  $l =  (1-p)l'$, where  $l'$ is  the expected  number of
1060substitutions at variable sites.  When  asked to optimize  $p$ but leave $l$  unchanged, PhyML
1061does the following:
1062\begin{enumerate}
1063\item Calculate $l' = l/(1-p)$ and leave $l'$ unchanged throughout the optimization.
1064\item Find the value of the proportion of invariants that maximises the likelihood. Let $p^{*}$ denote this value.
1065\item Set $l^{*} = (1-p^{*})l'$ and print out the tree with $l^{*}$ (instead of $l$).
1066\end{enumerate} 
1067
1068PhyML therefore  assumes that the  users wants  to fix the  branch lengths measured  at variable
1069sites only  (i.e., $l^{*}$ is  fixed). This is the  reason why the  branch lengths in the  input and
1070output trees  do differ  despite the  use of the  the \x{-o  r} option. While  we believe  that this
1071approach relies on a sound rationale, it  is not perfect. In particular, the original transformation
1072of  branch lengths  ($l' =  l/(1-p)$) relies  on a  default  value for  $p$ with  is set  to 0.2  in
1073practice. It is difficult  to justify the use of this value rather  than another one. One suggestion
1074proposed by  Bart Hazes is  to avoid fixing  the branch lengths  altogether and rather  estimate the
1075value  of   a  scaling  factor   applied  to   each  branch  length   in  the  input   tree  (option
1076\x{--contrained\_lens}).  We  agree  that  this  solution  probably matches  very  well  most  users
1077expectation, i.e., ``find the  best value of $p$ while constraining the ratio  of branch lengths to be
1078that given in the input tree''. Please feel free to send us your suggestions regarding this problem
1079by posting on forum (\url{http://groups.google.com/group/phyml-forum}).
1080
1081
1082\section{Inputs \& outputs for the XML interface }\label{sec:xmlio}\index{XML}
1083
1084\subsection{Mixture models in PhyML}\index{mixture models}\label{sec:mixtures}
1085
1086PhyML implements a wide range of mixture models. The discrete gamma model \cite{yang94b} is arguably
1087the  most popular  of these  models in  phylogenetics. However,  in theory,  mixture models  are not
1088restricted to the description of the variation  of substitution rates across sites. For instance, if
1089there are good reasons  to believe that the relative rates of  substitution between nucleotides vary
1090along the  sequence alignments, it  makes sense to  use a mixture of  GTR models. Consider  the case
1091where substitutions between $A$ and $C$ occur at  high rate in some regions of the alignment and low
1092rate elsewhere,  a mixture with two  classes, each class having  its own GTR rate  matrix, would be
1093suitable. The likelihood at any site of  the alignment is then obtained by averaging the likelihoods
1094obtained for each GTR rate matrix, with the same weight given to each of these matrices.
1095
1096PhyML implements  a generic framework  that allows users to  define mixtures on  substitution rates,
1097rate matrices and nucleotide or amino-acid equilibrium  frequencies. Each class of the mixture model
1098is built by assembling a substitution rate,  a rate matrix\footnote{the rate matrix corresponds here
1099the symmetrical  matrix giving the so-called  ``echangeability rates''} and a  vector of equilibrium
1100frequencies.  For  instance, let $\{R_1,R_2,R_3\}$ be  a set of substitution  rates, $\{M_1,M_2\}$ a
1101set of rate matrices and $\{F_1,F_2\}$ a set  of vectors of equilibrium frequencies.  One could then
1102define the first class of the mixture model  as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as
1103$\mathcal{C}_2  = \{R_2,M_1,F_1\}$,  and a  third class  as $\mathcal{C}_3  = \{R_3,M_2,F_2\}$.   If
1104$R_1$, $R_2$  and $R_3$ correspond to  slow, medium and  fast substitution rates, then  this mixture
1105model allows the  fast evolving rates to have  their own vector of equilibrium  frequencies and rate
1106matrix, distinct from that found at the medium  or slow evolving sites.  The likelihood at any given
1107site $D_s$ of the alignment is then:
1108\begin{eqnarray}\label{equ:mixtlk}
1109\Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s | \mathcal{C}_s=c) \Pr(\mathcal{C}_s=c),
1110\end{eqnarray}
1111where  $\Pr(\mathcal{C}_s=c)$ is  obtained by  multiplying the  probability (density)  of  the three
1112components (i.e., rate, matrix, frequencies). For instance, $\Pr(\mathcal{C}_1=\{R_1,M_1,F_1\}) =
1113 \Pr(R_1)\times \Pr(M_1) \times \Pr(F_1)$.
1114We therefore assume here that substitution rates, rate
1115matrices and equilibrium frequencies are independent from one another.
1116
1117Note that, using the  same substitution rates, rate matrices and  vector of equilibrium frequencies,
1118it is  possible to construct  many other mixture  models. For instance,  the mixture model  with the
1119largest  number of  classes  can be  created  by considering  all the  combinations  of these  three
1120components.  We would  then get a mixture of  $3\times 2 \times 2=12$ classes,  corresponding to all
1121the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies.
1122
1123
1124% :  $\mathcal{C}_1 =
1125% \{R_1,M_1,F_1\}$   $\mathcal{C}_2    =   \{R_1,M_1,F_2\}$,   $\mathcal{C}_3    =   \{R_1,M_2,F_1\}$,
1126% $\mathcal{C}_4   =   \{R_1,M_2,F_2\}$,   $\mathcal{C}_5   =   \{R_2,M_1,F_1\}$,   $\mathcal{C}_6   =
1127% \{R_2,M_1,F_2\}$,   $\mathcal{C}_7   =    \{R_2,M_2,F_1\}$,   $\mathcal{C}_8   =   \{R_2,M_2,F_2\}$,
1128% $\mathcal{C}_9  =  \{R_3,M_1,F_1\}$,  $\mathcal{C}_{10}   =  \{R_3,M_1,F_2\}$,  $\mathcal{C}_{11}  =
1129% \{R_3,M_2,F_1\}$ and $\mathcal{C}_{12} = \{R_3,M_2,F_2\}$.
1130 
1131
1132\subsection{Partitions}\index{partitionned analysis}\index{data partitions}
1133
1134We first introduce some terms of vocabulary that have not been presented before. A partitionned data
1135set, also referred to as partition, is a  set of partition elements.  Typically, a partitionned data
1136set will be made of  a set of distinct gene alignments. A partition  element will then correspond to
1137one (or  several) of these gene  alignments. Note that the  biology litterature often uses  the term
1138partition to refer to an element of a  partitionned data.  We thus use here instead the mathematical
1139definition of the terms `partition' and `partition element'.
1140
1141Phylogenetics models usually assume individual columns  of an alignment to evolve independently from
1142one  another. Codon-based  models (e.g.,  \cite{yang98,yang00b,yang02,guindon04}) are  exceptions to
1143this rule  since the substitution process  applies here to  triplets of consecutive sites  of coding
1144sequences.  The non-independence of  the substitution process at the three  coding positions (due to
1145the specificities of the genetic code), can  therefore be accounted for.  Assuming that sites evolve
1146independently  does not  mean  that a  distinct  model is  fitted  to each  site  of the  alignment.
1147Estimating the  parameters of these  models would not  make much sense in  practice due to  the very
1148limited amount of phylogenetic signal conveyed by individual sites.  Site independence means instead
1149that the  columns of  the observed  alignment were sampled  randomly from  the same  ``population of
1150columns''.   The  stochasticity  of the  substitution  process  running  along  the tree  is  deemed
1151responsible to the variability of site patterns.
1152
1153Some parameters  of the  phylogenetic model  are considered  to be common  to all  the sites  in the
1154alignment. The tree topology is typically  one such parameter.  The transition/transversion ratio is
1155also generally assumed to be the same for all columns.  Other parameters can vary from site to site.
1156The rate at  which substitutions accumulate is  one of these parameters. Hence,  different sites can
1157have distinct rates. However,  such rates are all ``drawn'' from  the same probabilitic distribution
1158(generally a  discrete Gamma  density).  Hence,  while different  sites may  have distinct  rates of
1159evolution, they all share the same {\em distribution} of rates.
1160
1161This reasonning  also applies on a  larger scale. When  analysing multiple genes, one  can indeed
1162assume that the same  mechanism generated the different site patterns observed  for every gene. Here
1163again, we can assume that all the genes share the same underlying tree topology (commonly refered to
1164as the ``species  tree'').  Other parameters of  the phylogenetic model, such as  branch lengths for
1165instance, might  be shared across  genes. However,  due to the  specificities of the  gene evolution
1166processes, some  model parameters  need to be  adjusted for  each gene separately.   To sum  up, the
1167phylogenetic analysis of partitionned data requires flexible models with parameters, or distribution
1168of parameters,  shared across several partition  elements and other parameters  estimated separately
1169for each element of the partition.
1170
1171The likelihood of a  data set made of the concatenation of $n$  sequence alignments noted $D^{(1)}$,
1172$D^{(2)}, \ldots, D^{(n)}$ is then obtained as follow:
1173\begin{eqnarray}
1174\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n}  \Pr(D^{(i)}) \nonumber\\
1175&=& \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \Pr(D^{(i)}_s),
1176\end{eqnarray} where $L_i$ is the number of site columns in partition element $i$$\Pr(D^{(i)}_s)$
1177is then obtained using Equation \ref{equ:mixtlk}, i.e., by summing over the different classes of the
1178mixture model that  applies to site $s$ for  partition element $i$. Hence, the  joint probability of
1179all the partition elements is here broken down into the product of likelihood of every site for each
1180partition  element. As  noted just  above,  any given  component of  the  mixture model  at a  given
1181particular site is shared by the other sites that belong to the same partition element and, for some
1182of them, by  sites in other partition  elements (e.g., the same  tree topology is shared  by all the
1183sites, throughout all the partition elements).
1184
1185PhyML implements a wide  variety of partition models.  The only parameter that  is constrained to be
1186shared  by all  the  partition elements  is  the tree  topology. This  constraint  makes sense  when
1187considering distantly  related taxa, typically inter-species  data. For closely related  taxa, i.e.,
1188when  analysing intra-species  or population-level  data,  not all  the  genes might  have the  same
1189evolutionary history.   Recombination events combined  to the incomplete lineage  sorting phenomenon
1190can  generate  discrepancies   between  the  gene  trees  and  the   underlying  species  tree  (see
1191\cite{degnan09}  for  a review).   The  phylogenetic  softwares BEST  \cite{best}\index{BEST},  STEM
1192\cite{stem}\index{STEM} and  *BEAST \cite{startbeast}\index{*BEAST} are dedicated  to the estimation
1193of species tree phylogenies  from the analysis of multi-gene data and  allow gene-tree topologies to
1194vary across genes.
1195
1196Aside from the tree topology  that is common to all the sites and  all the partition elements, other
1197parameters of  the phylogenetic model  can be either shared  across partition elements  or estimated
1198separately  for each  of  these. When  analysing  three partition  elements, $A$$B$  and $C$  for
1199instance, PhyML can  fit a model where the same  set of branch lengths applies to  $A$ and $B$ while
1200$C$ has its  own estimated lengths.  The same  goes for the substitution model: the  same GTR model,
1201with identical parameter values, can be fitted to $A$  and $C$ and JC69 for instance can be used for
1202$B$. The sections below  give more detailed information on the range of  models available and how to
1203set up the corresponding XML configuration files to implement them.
1204
1205
1206\subsection{Combining mixture and partitions in PhyML: the  theory}
1207
1208The rationale behind mixture  models as implemented in PhyML lies in (1)  the definition of suitable
1209rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly
1210of these  components so as  to create the  classes of a mixture.  The main idea  behind partitionned
1211analysis in  PhyML lies  in (1)  the hypothesis of  statistical independance  of the  different data
1212partition elements and (2) distinct data partition can share model components such as rate matrices,
1213equilibrium frequencies or  distribution of rates across  sites. More formally, the  likelihood of a
1214data set made of $n$ partition elements is written as follows:
1215
1216\begin{eqnarray*}
1217\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \Pr(D^{(i)}_s) \\
1218&=& \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c),
1219\end{eqnarray*} 
1220
1221where $L_i$ is  the number of sites in partition  element $i$ and $K_i$ is the  number of classes in
1222the mixture model that applies to this same partition  element. Each class of a mixture is made of a
1223rate  matrix $M$,  a vector  of equilibrium  frequencies  $F$ and  a relative  rate of  substitution
1224$R$. Branch  lengths, $L$ and  tree topology $\tau$  are also required  for the calculation  of the
1225likelihood. Hence we have:
1226
1227\begin{eqnarray*}
1228&& \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}\\
1229&&\prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c) \\
1230&&= \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i}  \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) \mathcal{I}(m,f,r,i)
1231% &&= \prod_{i=1}^{n}  \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i}
1232% \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})
1233\end{eqnarray*} 
1234where $\mathcal{M}_i$, $\mathcal{F}_i$ and $\mathcal{R}_i$ are the number of rate matrices,
1235vector of equilibrium frequencies and relative rates that apply to partition element $i$
1236respectively. $\mathcal{I}(m,f,r,i)$ is an indicator function that takes value 1 if the combination
1237$M_m$, $F_f$ and $R_r$ is acually defined in the model for this particular partition element
1238$i$. Its value is 0 otherwise. In the example given in section \ref{sec:mixtures} $\{R_1,R_2,R_3\}$
1239is the  set of substitution  rates, $\{M_1,M_2\}$ the set of rate matrices and $\{F_1,F_2\}$ the set  of vectors of equilibrium frequencies.  We then
1240define the first class of the mixture model  as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as
1241$\mathcal{C}_2  = \{R_2,M_1,F_1\}$ and the third as $\mathcal{C}_3  = \{R_3,M_2,F_2\}$. Hence, we
1242have $\mathcal{I}(1,1,1,i)$, $\mathcal{I}(1,1,2,i)$ and $\mathcal{I}(2,2,3,i)$ equal to one while
1243the nine other values that this indicator function takes, corresponding to the possible combinations of
1244two vectors of frequencies, two  matrices and three rates, are all zero.
1245
1246As stated before, our implementation assumes that the different components of a mixture are
1247independant. In other words, we have $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \Pr(M_m^{(i)}) \times
1248\Pr(F_f^{(i)}) \times \Pr(R_r^{(i)})$. In practice, the joint probability
1249$\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)})$ is obtained as follows:
1250\begin{eqnarray}\label{equ:weights}
1251\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \frac{\Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})}{
1252  \sum_{m,f,r} \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) \mathcal{I}(m,f,r,i)}
1253\end{eqnarray}
1254The probabilities $\Pr(M_m^{(i)})$, $\Pr(F_f^{(i)})$ and $\Pr(R_r^{(i)})$, also called `weights', can be fixed or estimated
1255from the data.
1256
1257\subsection{The XML format and its use in PhyML}\label{sec:XML format}
1258
1259The few paragraphs below are largely inspired from  the Wikipedia page that describes the XML format
1260(\url{http://en.wikipedia.org/wiki/XML}). XML (eXtensible Markup Language) is a markup language that
1261defines  a set  of  rules  for encoding  documents  in  a format  that  is  both human-readable  and
1262machine-readable.  An  XML document is  divided into  {\em markup} and  {\em content}, which  may be
1263distinguished  by the  application of  simple syntactic  rules. Generally,  strings that  constitute
1264markup either begin  with the character `\x{<}' and  end with a `\x{>}'. Strings  of characters that
1265are not markup are content:
1266
1267\vspace{0.2cm}
1268\begin{Verbatim}[frame=single, label=XML markup and content example, samepage=true,
1269  baselinestretch=0.5, fontsize=\small]
1270<markup> 
1271 content
1272</markup>
1273\end{Verbatim}
1274
1275A markup construct that begins  with `\x{<}' and ends with `\x{>}' is called  a {\em tag}. Tags come
1276in  three  flavors:  (1)  start-tags  (e.g,  \x{<section>}),  end-tags  (e.g.,  \x{</section>})  and
1277empty-element tags (e.g., \x{<line-break />}). A {\em  component} either begins with a start-tag and
1278ends with a  matching end-tag or consists only  of an empty-element tag. The  characters between the
1279start- and  end-tags, if any,  are the  element's content, and  may contain markup,  including other
1280elements, which are  called child elements.  In  the following example, the element  \x{img} has two
1281{\em  attributes}\x{src}   and  \x{alt}\x{<img  src="madonna.jpg"   alt="Foligno  Madonna,  by
1282Raphael"/>}. Another example would be \x{<step number="3">Connect  A to B.</step>} where the name of
1283the attribute is ``\x{number}" and the value is ``\x{3}".
1284
1285In  practice,  building   a  mixture  model  in   a  XML  file  readable  by   PhyML  is  relatively
1286straightforward.  The first  step  is  to define  the  different components  of  each  class of  the
1287mixture.  Consider for  instance that  the fitted  model will  have a  Gamma distribution  with four
1288classes plus  a proportion of invariants.  The rate component of  the mixture can then  be specified
1289using the following XML code:
1290
1291\vspace{0.2cm}
1292\begin{Verbatim}[frame=single, label=$\Gamma4$+I rates, samepage=true, baselinestretch=0.5,
1293  fontsize=\small, numbers=left]
1294
1295<siterates id="SiteRates1">
1296  <weights  id="Distrib" family="gamma+inv" alpha=".1" \
1297  optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1298  </weights>
1299  <instance id="R1" init.value="1.0"/>
1300  <instance id="R2" init.value="1.0"/>
1301  <instance id="R5" init.value="0.0"/>
1302  <instance id="R3" init.value="1.0"/>
1303  <instance id="R4" init.value="1.0"/>
1304</siterates>
1305
1306\end{Verbatim}
1307
1308In the example above, the \x{<siterates>} component  completely defines a model of substitution rate
1309variation across  sites. This component  has a  particular identity, i.e.,  a name associated  to it
1310(``\x{SiteRates1}''  here),  which  is  not  mandatory.   This  \x{<siterates>}  component  has  six
1311sub-components.   The  first  is  the  \x{<weights>}  component,  followed  by  five  \x{<instance>}
1312components. The  \x{<weights>} component  defines the  type of  distribution that  characterizes the
1313variation of  rates across sites.  A discrete  Gamma plus invariants  is used here.   Two parameters
1314specify this distribution: the gamma shape and the proportion of invariant parameters. Their initial
1315values  are set  by using  the corresponding  attributes and  attribute values  (\x{alpha="0.1"} and
1316\x{pinv="0.4"}). Also, PhyML can  optimise these parameters so as to maximise  the likelihood of the
1317whole phylogenetic model (\x{optimise.pinv="yes"}  and \x{optimise.alpha="yes"}). The following five
1318\x{<instance>}  components  define  the  rate  classes themselves.  The  \x{id}  attribute  is  here
1319mandatory  and must  be  unique  to each  class.   Note  that one  of  the  initial (relative)  rate
1320(\x{init.value} attribute) is set to zero. The  corresponding rate class (the third in this example)
1321will then correspond to the invariant site category.
1322
1323Having specified the  part of the phylogenetic  model that describes the variation  of rates across
1324sites,  we can  now move  on  to build  the rest  of the  model.   The component  below defines  two
1325substitution models:
1326
1327\vspace{0.2cm}
1328\begin{Verbatim}[frame=single, label=Rate matrices, samepage=true, baselinestretch=0.5,
1329  fontsize=\small, numbers=left]
1330
1331<ratematrices id="RateMatrices">
1332  <instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/>
1333  <instance id="M2" model="GTR" optimise.rr="yes"/>
1334</ratematrices>
1335\end{Verbatim}
1336
1337This \x{<ratematrices>} component sets out a list  of substitution models (HKY85 and GTR here). Here
1338again, the  different elements in  this list correspond  to the \x{<instance>}  sub-components. Each
1339instance must  have a unique \x{id}  attribute for a reason  that will become obvious  shortly.  The
1340remaining attributes and their functions are described in Section \ref{sec:xmlratematrices}.
1341
1342The next ``ingredient'' in our phylogenetic model are vectors of nucleotide frequencies. The
1343\x{<equfreqs>} component below specifies two of such vectors:
1344
1345
1346\vspace{0.2cm}
1347\begin{Verbatim}[frame=single, label=Equilibrium frequencies, samepage=true, baselinestretch=0.5,
1348  fontsize=\small, numbers=left]
1349
1350<equfreqs id="EquFreq">
1351  <instance id="F1"/>
1352  <instance id="F2"/>
1353</equfreqs>
1354
1355\end{Verbatim}
1356
1357Now, we need to assemble these three components (rate variation across sites, rate matrices and
1358vectors of equilibrium frequencies) into a mixture model. The \x{<partitionelem>} component below
1359defines one such model:
1360
1361\vspace{0.2cm}
1362\begin{Verbatim}[frame=single, label=Mixture model, samepage=true, baselinestretch=0.5,
1363  fontsize=\small, numbers=left]
1364
1365<partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt">
1366  <mixtureelem list="R1, R2, R3, R4, R5"/>
1367  <mixtureelem list="M1, M1, M1, M2, M2"/>
1368  <mixtureelem list="F1, F2, F1, F2, F2"/>
1369</partitionelem>
1370
1371\end{Verbatim}
1372
1373The  \x{<partitionelem>} component  defines a  particular partition  element. In  this example,  the
1374partition element corresponds to the sequence file  called \x{nucleic.txt}, which is an alignment of
1375nucleotide  sequences   (see  the   \x{data.type}  attribute   value).   The   \x{<mixtureelem>}  are
1376sub-components  of  the  \x{<partitionelem>}  component.   Each  \x{<mixtureelem>}  has  a  \x{list}
1377atrribute.   Each such  \x{list} gives  the ID  of components  that have  been defined  before.  For
1378instance,  the  first   \x{<mixtureelem>}  refers  to  the  five  classes   of  the  \x{<siterates>}
1379component. The  ordering of  the different term  in these list  matters a  lot since it  is directly
1380related  to the  elements in  each class  of  the mixture  model. Hence,  the first  element in  the
1381\x{<list>} attribute  of the first  \x{<mixtureelem>} added to the  first element in  the \x{<list>}
1382attribute of the second \x{<mixtureelem>} plus the  the first element in \x{<list>} attribute of the
1383third \x{<mixtureelem>} defines the  first class of the mixture model.  Therefore, the mixture model
1384defined   above   has   five   classes:  $\mathcal{C}_1   =   \{R_1,M_1,F_1\}$,   $\mathcal{C}_2   =
1385\{R_2,M_1,F_2\}$$\mathcal{C}_3   =  \{R_3,M_1,F_1\}$,   $\mathcal{C}_4  =   \{R_4,M_2,F_2\}$  and
1386$\mathcal{C}_5 = \{R_5,M_2,F_2\}$.
1387
1388% Going back  to the different components  of this model, the  XML code dealing  with the substitution
1389% rates defines five  classes with names {\tt R1} to  {\tt R5}. The initial values  of these rates are
1390% set to 1.0, except for  {\tt R5}, which is set to 0 and will  therefore correspond to the invariable
1391% site class. The {\tt  <weight>} tag that follows indicate that these  rates define a $\Gamma 4$+Inv
1392% model, with  initial gamma shape parameter  set to 0.1 and  initial proportion of  invariants set to
1393% 0.4.  These  two parameters  will  be  estimated in  the  analysis  ({\tt  optimise.alpha} and  {\tt
1394% optimise.pinv} attributes  set to {\tt yes}).   The two rate matrices  have names {\tt  M1} and {\tt
1395% M2}.  {\tt M1} corresponds  to a HKY85 model, with transition/transversion ratio  set to 4.0 and set
1396% to be  optimised in the  analysis.  {\tt M2}  is a GTR  model, which parameters  are also set  to be
1397% optimised. {\tt F1} and {\tt F2} are two vectors of nucleotide frequencies at equilibrium. These two
1398% sets of frequencies will therefore be estimated during the analysis.
1399
1400
1401
1402\subsection{Setting up mixture and partition models in PhyML: the basics}\index{mixture
1403  models}\index{partitionned analysis}\index{data partitions}
1404
1405Mixture models are particularly relevant to the analysis of partitionned data. Indeed, some features
1406of  evolution are  gene-specific (e.g.,  substitution  rates vary  across genes).   Models that  can
1407accomodate  for such  variation,  as mixture  models do,  are  therefore relevant  in this  context.
1408However, other evolutionary features are shared across loci (e.g., genes located in the same genomic
1409region usually have similar  GC contents). As a consequence, some components  of mixture models need
1410to be  estimated separately for each  partition element while  others should be shared  by different
1411partition elements.
1412
1413Below is a simple example with a partitionned data set made of two elements:
1414
1415\vspace{0.2cm}
1416\begin{Verbatim}[frame=single, label=Two sets of branch lengths (one per partition element),
1417  samepage=true, baselinestretch=0.5, fontsize=\small, numbers=left]
1418
1419<branchlengths id="BranchLens">
1420  <instance id="L1"/>
1421  <instance id="L2"/>
1422</branchlengths>
1423
1424<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1425  <mixtureelem list="R1, R2, R3, R4, R5"/>
1426  <mixtureelem list="L1, L1, L1, L1, L1"/>
1427</partitionelem>
1428
1429<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
1430  <mixtureelem list="R1, R2, R3, R4, R5"/>
1431  <mixtureelem list="L2, L2, L2, L2, L2"/>
1432</partitionelem>
1433
1434\end{Verbatim}
1435
1436Mixture elements with names  \x{R1},$\ldots$, \x{R5} refer to the $\Gamma4+$I model defined
1437previsouly  (see Section  \ref{sec:XML format}).   The \x{<branchlengths>}  XML component  defines a
1438mixture element  that had not  been introduced  before.  It defined  vectors of branch  lengths that
1439apply to the estimated phylogeny. Two instances of  such vectors are defined: \x{L1} and \x{L2}.
1440When examining the  two partition elements (\x{<partitionelem>} component), it  appears that \x{L1}
1441is associated with \x{Part1} while \x{L2} is associated with \x{Part2}.  Hence, branch lengths
1442will be estimated separately for these two partition elements. 
1443
1444Note that  a given partition element  can only have  one {\tt branchlengths} instance  associated to
1445it. For instance, the example given below is not valid:
1446
1447\vspace{0.2cm}
1448\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5,
1449  fontsize=\small, numbers=left]
1450
1451<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1452  <mixtureelem list="R1, R2, R3, R4, R5"/>
1453  <mixtureelem list="L1, L1, L1, L2, L2"/>
1454</partitionelem>
1455
1456\end{Verbatim}
1457
1458In other words, mixture of branch lengths are forbidden. One reason for this restriction is that
1459mixture of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets
1460of branch lengths have the same likelihood) \cite{matsen07}. But mostly, combining mixture of branch
1461lengths with mixture of rates appears like a deadly combination. Consider for instance the following model:
1462
1463\vspace{0.2cm}
1464\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5,
1465  fontsize=\small, numbers=left]
1466
1467<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1468  <mixtureelem list="R1, R2, R3"/>
1469  <mixtureelem list="L1, L2, L3"/>
1470</partitionelem>
1471
1472\end{Verbatim}
1473
1474It is here impossible to tell apart  branch lengths and substitution rates. Such model is strongly
1475non-identifiable and therefore not relevant.
1476
1477In the example given above, the same $\Gamma4+$I model (i.e. the same gamma shape parameter and
1478proportion of invariant ) applies to the two partition elements. It is possible to use two distinct
1479$\Gamma4+$I models instead using the following XML code:
1480
1481
1482
1483\vspace{0.2cm}
1484\begin{Verbatim}[frame=single, label=Two distinct $\Gamma4+$I models, samepage=true,
1485  baselinestretch=0.5, fontsize=\small, numbers=left]
1486
1487<siterates id="SiteRates1">
1488  <weights  id="Distrib1" family="gamma+inv" alpha=".1" \
1489  optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1490  </weights>
1491  <instance id="R1" init.value="1.0"/>
1492  <instance id="R2" init.value="1.0"/>
1493  <instance id="R5" init.value="0.0"/>
1494  <instance id="R3" init.value="1.0"/>
1495  <instance id="R4" init.value="1.0"/>
1496</siterates>
1497
1498<siterates id="SiteRates2">
1499  <weights  id="Distrib2" family="gamma+inv" alpha=".1" \
1500  optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1501  </weights>
1502  <instance id="R6"  init.value="1.0"/>
1503  <instance id="R7"  init.value="1.0"/>
1504  <instance id="R8"  init.value="0.0"/>
1505  <instance id="R9"  init.value="1.0"/>
1506  <instance id="R10" init.value="1.0"/>
1507</siterates>
1508
1509<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1510  <mixtureelem list="R1, R2, R3, R4, R5"/>
1511  <mixtureelem list="L1, L1, L1, L1, L1"/>
1512</partitionelem>
1513
1514<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
1515  <mixtureelem list="R6, R7, R8, R9, R10"/>
1516  <mixtureelem list="L2, L2, L2, L2, L2"/>
1517</partitionelem>
1518
1519\end{Verbatim}
1520
1521\x{SiteRates1} and \x{SiteRates2} here define two distinct $\Gamma4+$I models. Each of these models apply to
1522one of the two partition elements (\x{nucleic1.txt} and \x{nucleic2.txt}), allowing them to display
1523different patterns of rate variation across sites.
1524
1525
1526\subsection{XML options}
1527\subsubsection{{\tt phyml} component}\index{XML options!{\tt phyml} component}
1528Options:
1529\begin{itemize}
1530\item \x{output.file="filename"}. The main output files of PhyML analysis will be named
1531  \x{filename\_phyml\_tree} and \x{filename\_phyml\_stats}.
1532\item \x{bootstrap="nreplicates"}. Run \x{nreplicates} replicates for the non-parametric bootstrap analysis.
1533\item \x{run.id="idstring"}. PhyML will append the string \x{idstring} to each output file.
1534\item \x{print.trace="yes|true|no|false"}. PhyML will print the estimated trees (and the
1535  corresponding loglikelihoods) at multiple stages of the estimation process. This option is useful
1536  for monitoring the progress of the analysis when processing large data sets.
1537\item \x{branch.test="aBayes|aLRT|SH|no"}. Calculate fast branch support using the aBayes method
1538  \cite{anisimova11}, aLRT \cite{anisimova06} or SH \cite{shimodaira99} tests. These branch
1539  statistics are much faster to estimate than the bootrap proportions and usually provide good
1540  estimates of the probabilities that the corresponding edges are correctly inferred (see Anisimova et
1541  al. 2011 for more precision). By default and if no bootstrap analysis is performed, branch supports
1542  are estimated using the aBayes approach.
1543
1544\end{itemize}
1545\subsubsection{{\tt topology} component}\index{XML options!{\tt topology} component}
1546Options:
1547\begin{itemize}
1548\item \x{init.tree="bionj"|"user"|"random"}.  Starting tree. Default is \x{bionj}.
1549\item \x{file.name="name\_of\_tree\_file"}. In case \x{init.tree="user"}, this
1550  attribute  is mandatory. \x{name\_of\_tree\_file} is a
1551  text file containing a tree in NEWICK format.
1552\item \x{optimise.tree="yes"|"true"|"no"|"false"}. The starting tree topology as defined by
1553  \x{init.tree} is to be optimised (or not) so as to maximise the likelihood function.
1554\item \x{search="nni"|"spr"|"none"}. Tree topology search is conducted using NNI (fast), SPR (a bit
1555  slower but more accurate) or no moves.
1556\end{itemize}
1557\vspace{0.2cm}
1558\begin{Verbatim}[frame=single, label=Example of `topology' component, samepage=true,
1559  baselinestretch=0.5, fontsize=\small, numbers=left]
1560
1561<topology>
1562  <instance id="T1" init.tree="bionj" optimise.tree="true" \
1563   search="spr"/>
1564</topology>
1565
1566\end{Verbatim}
1567
1568\subsubsection{{\tt ratematrices} component}\index{XML options!{\tt ratematrices} component}\label{sec:xmlratematrices}
1569Options:
1570\begin{itemize}
1571\item \x{model="JC69"|"K80"|"F81"|"F84"|"HKY85"|"TN93"|"GTR"|"custom"} for nucleotide data. The default is \x{"HKY85"}.\\
1572\x{model="LG"|"WAG"|"JTT"|"MtREV"|"Dayhoff"|"DCMut"|"RtREV"|"CpREV"|"VT"}\\\x{|"Blosum62"|"MtMam"|"MtArt"|"HIVw"|"HIVb"|"customaa"}
1573for amino-acid sequences. The default is \x{"LG"}.
1574\item \x{model.code="012345"}. For \x{custom} model applied to nucleotide sequences: set the
1575  string of digits that define a custom substitution model. See Table \ref{tab:modelcode} on page
1576  \pageref{tab:modelcode} for more
1577  information about the model codes.
1578\item \x{ratematrix.code="filename"}. When used in conjunction with \x{model="customaa"},
1579  \x{filename} is the name of the file that gives the rates of substitution between amino-acids as
1580  well as their frequences at equilibrium using PAML rate matrix format. An example of such file is
1581  provided in {phyml/examples/X1.mat}.
1582\item \x{optimise.rr="yes"|"true"|"no"|"false"}. For \x{custom} and \x{GTR} nucleotide models only:
1583  optimise the substitution rate model parameters.
1584\item \x{optimise.tstv="yes"|"true"|"no"|"false"}. For \x{K80}, \x{F84}, \x{HKY85} and \x{TN93}
1585  models only: optimise the transition/transversion rate ratio.
1586\item \x{tstv="value"}. For \x{K80}, \x{HKY85} and \x{TN93} models only: set the transition/transversion to a
1587  given value. 
1588\end{itemize}
1589
1590The {\tt ratematrices} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt
1591  no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each matrix in the
1592set of matrices defined by this component (see Equation \ref{equ:weights}), will be estimated from the data.
1593
1594\vspace{0.2cm}
1595\begin{Verbatim}[frame=single, label=Example of `ratematrices' component, samepage=true,
1596  baselinestretch=0.5, fontsize=\small, numbers=left]
1597
1598  <ratematrices id="RM1" optimise.weights="yes">
1599    <instance id="M1" model="custom" model.code="000000"/>
1600    <instance id="M2" model="GTR" optimise.rr="yes"/>
1601    <instance id="M3" model="WAG"/>
1602  </ratematrices>
1603
1604\end{Verbatim}
1605
1606\subsubsection{{\tt equfreqs} component}\index{XML options!{\tt equfreqs} component}
1607Options:
1608\begin{itemize}
1609\item \x{base.freqs="a,b,c,d"} where \x{a-d} are nucleotide frequencies. Make sure that these
1610  frequencies are separated by comas and no space character is inserted.
1611\item \x{aa.freqs="empirical|model"}. Amino-acid frequencies are derived from counting the number of
1612  occurence of each amino-acid in the alignment (\x{aa.freqs="empirical"}) or given by the
1613  substitution model (\x{aa.freqs="model"}).
1614\item \x{optimise.freqs="true|yes|false|no"}. Nucleotide frequencies can be optimised so as to maximise
1615  the likelihood (\x{optimise.freqs="yes|true"}).
1616\end{itemize}
1617
1618The {\tt equfreqs} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt
1619  no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each vector of
1620equilibrium frequencies in the
1621set of vectors defined by this component (see Equation \ref{equ:weights}), will be estimated from the data.
1622
1623\vspace{0.2cm}
1624\begin{Verbatim}[frame=single, label=Example of `equfreqs' component, samepage=true,
1625  baselinestretch=0.5, fontsize=\small, numbers=left]
1626
1627  <equfreqs id="EF1" optimise.weights="yes">
1628    <instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/>
1629    <instance id="F2" aa.freqs="empirical"/>
1630    <instance id="F3" optimise.freq="yes"/>
1631  </equfreqs>
1632
1633\end{Verbatim}
1634
1635\subsubsection{{\tt branchlengths} component}\index{XML options!{\tt branchlengths} component}
1636Options:
1637\begin{itemize}
1638\item \x{optimise.lens="yes"|"true"|"no"|"false"}: branch lengths are optimised or not. The default
1639  is set to \x{"yes"}.
1640\end{itemize}
1641\vspace{0.2cm}
1642\begin{Verbatim}[frame=single, label=Example of `branchlengths' component, baselinestretch=0.5,
1643  fontsize=\small, numbers=left]
1644
1645  <branchlengths id="BL1">
1646    <instance id="L1" optimise.lens="yes"/>
1647    <instance id="L2"/>
1648    <instance id="L3" optimise.lens="false"/>
1649  </branchlengths>
1650
1651\end{Verbatim}
1652
1653\subsubsection{{\tt siterates} component}\index{XML options!{\tt siterates} component}
1654Options:
1655\begin{itemize}
1656\item \x{value="val"}, where \x{"val"} is the relative substitution rate for the corresponding class.
1657\end{itemize}
1658A \x{siterate} component generally includes a \x{weights} element that specifies the probabilitic
1659distribution of the relative rates. The available options for such element are:
1660\begin{itemize}
1661\item \x{family="gamma|gamma+inv|freerates"}. \x{gamma} indicates that the distribution of the
1662  relative rates is set to be a discrete Gamma density. \x{gamma+inv} indicates that the relative rate model
1663  is a mixture of Gamma and invariant sites (this is the common $\Gamma+$I model). FreeRate is
1664  a model that does not use any parametric function to describe the distribution of the relative
1665  rates (see \cite{soubrier12}). Under this option, relative rates and the corresponding frequencies of these classes are
1666  directly estimated from the data. While such approach is slightly more computationally demanding
1667  than the $\Gamma$ (or $\Gamma$+I) model, it often provides a significantly better fit to the data.
1668\item \x{alpha="value|optimised"}, where \x{value} is a real positive number. Use this option to set
1669   the gamma shape parameter to the selected value. \x{optimised}: the parameter is estimated from
1670   the data (see also next option).
1671\item \x{optimise.alpha="yes|true|no|false"}. Optimise the shape of the Gamma distribution of
1672  relative rates (or not).
1673\item \x{pinv="value|optimised"}, where \x{value} is in $[0,1]$. Use this option to set
1674   the proportion of invariants to the selected value. \x{optimised}: the parameter is estimated from
1675   the data (see also next option).
1676\item \x{optimise.pinv="yes|true|no|false"}. Optimise the proportion of invariable sites (or not).
1677\item \x{optimise.freerates="yes|true|no|false"}. Optimise the parameters of the FreeRate model,
1678  i.e., the relative rates and the corresponding frequencies.
1679\end{itemize} 
1680\vspace{0.2cm}
1681\begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true,
1682  baselinestretch=0.5, fontsize=\small, numbers=left]
1683
1684  <siterates id="SR1">
1685    <instance id="R1" init.value="1.0"/>
1686    <instance id="R2" init.value="1.0"/>
1687    <instance id="R3" init.value="1.0"/>
1688    <instance id="R4" init.value="1.0"/>
1689    <weights  id="D1" family="gamma" optimise.alpha="yes" \
1690    optimise.pinv="no">
1691    </weights>
1692  </siterates>
1693\end{Verbatim}
1694
1695
1696
1697\subsubsection{{\tt partitionelem} and {\tt mixtureelem} components}\index{XML options!{\tt partitionelem}
1698  component}\index{XML options!{\tt mixtureelem} component}
1699
1700Options:
1701\begin{itemize}
1702\item \x{file.name="inputfilename"}, where \x{inputfilename} is the name of the input sequence file
1703  (in PHYLIP format) to be analysed.
1704\item \x{data.type="nt|aa"}. Specify the type of sequences to be processed (nucleotide of amino-acid sequences).
1705\item \x{interleaved="yes|true|no|false"}. Interleaved (\x{yes|true}) or sequential format
1706  (\x{no|false}) for the sequence alignment.
1707\end{itemize}
1708
1709Each \x{partitionelem} element should include exactly four \x{mixtureelem} elements, corresponding to
1710branch lengths, equilibrium frequencies, substitution rate model and tree topology. The ordering of
1711in which the \x{mixtureelem} elements are given does not matter, though exceptions apply for the
1712$\Gamma+I$ model (see below). The $n$-th element in the \x{list}
1713attribute of each \x{mixtureelem} defines the $n$-th class of the mixture model. In the example given
1714below, the first class of the mixture is made of the following elements: \x{T1}, \x{F1}, \x{R1} and
1715\x{L1}, the second class is made of \x{T1}, \x{F1}, \x{R2} and \x{L1}, etc.
1716
1717
1718\vspace{0.2cm}
1719\begin{Verbatim}[frame=single, label=Example of `partitionelem' component, samepage=true,
1720  baselinestretch=0.5, fontsize=\small, numbers=left]
1721
1722  <partitionelem id="partition1" file.name="./small_p1.nxs" \
1723   data.type="nt" interleaved="yes">
1724    <mixtureelem list="T1, T1, T1, T1"/>
1725    <mixtureelem list="F1, F1, F1, F1"/>
1726    <mixtureelem list="R1, R2, R3, R4"/>
1727    <mixtureelem list="L1, L1, L1, L1"/>
1728  </partitionelem>
1729
1730\end{Verbatim}
1731
1732In general, the  ordering of the \x{mixtureelem}  elements does not matter. However,  when the model
1733has invariable sites, then  the corresponding class should be first in  the list of classes provided
1734by \x{mixtureelem}. For instance, in the example above, if the rates are defined as follows:
1735
1736\vspace{0.2cm}
1737\begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true,
1738  baselinestretch=0.5, fontsize=\small, numbers=left]
1739
1740  <siterates id="SR1">
1741    <instance id="R1" init.value="0.0"/>
1742    <instance id="R2" init.value="1.0"/>
1743    <instance id="R3" init.value="1.0"/>
1744    <instance id="R4" init.value="1.0"/>
1745    <weights  id="D1" family="gamma+inv" optimise.alpha="yes" \
1746    optimise.pinv="no">
1747    </weights>
1748  </siterates>
1749\end{Verbatim}
1750
1751then \x{R1} corresponds to the invariable  rate class (as \x{init.value="0.0"}).  As \x{R1} is first
1752in the  \x{mixtureelem} (see line  6 in  the example of  \x{`partionelem'} given above),  PhyML will
1753print out an  explicit error message and bail out.   One way to avoid this  shortcoming is to define
1754\x{mixtureelem} as \x{R4, R2, R3, R1} instead.
1755
1756
1757\subsection{A simple example: GTR + $\Gamma$4 + I}
1758
1759The example below provides all the required options to fit a $\Gamma$4+I model to a single alignment
1760of nucleotide  sequences under the GTR  model of substitution using  a SPR search for  the best tree
1761topology. The \x{phyml} component sets the name for the analysis to \x{simple.example}, meaning that
1762each output file  will display this particular  string of characters. Also, the  tree and statistics
1763file names will begin with \x{p1.output}. The tree  topology will be estimated so as to maximise the
1764likelihood and  the topology search  algorithm used here  is SPR, as indicated  by the value  of the
1765corresponding attribute  (i.e., \x{search="spr"}). Only  one vector of  branch lengths will  be used
1766here since  only one partition element  will be processed. Hence,  the \x{<branchlengths>} component
1767only has one  \x{<instance>} sub-component. Also, a single  GTR model will apply to  all the classes
1768for the mixture model -- the \x{<ratematrices>} component has only one \x{<instance>} sub-component,
1769corresponding to this  particular substitution model. The next  component, \x{<equfreqs>}, indicates
1770that a single vector of equilibrium frequencies will apply here. Next, the \x{<siterates>} component
1771has five \x{<instance>} sub-components.  Four of these  correspond to the non-zero relative rates of
1772evolution  a  defined  by  a  discrete  Gamma distribution.   The  last  one  (\x{<instance  id="R5"
1773value="0.0"/>})  defines  the  class  of  the   mixture  corresponding  to  invariable  sites.   The
1774\x{<weight>} component indicates that a $\Gamma+$I model will be fitted here. The shape parameter of
1775the  Gamma distribution  and the  proportion of  invariants will  be estimated  from the  data.  The
1776\x{<partitionelem>} gives information about the sequence alignment (the corresponding file name, the
1777type of  data and the  alignment format). The \x{<mixtureelem>}  components next define  the mixture
1778model. Each class of the  fitted model corresponds to one column, with the  first column made of the
1779following elements: \x{T1, M1, F1, R1} and \x{L1}. The second class of the mixture is made of \x{T1,
1780M1, F1, R2, L1} and so forth.
1781
1782\vspace{0.2cm}
1783\begin{Verbatim}[frame=single, label=Simple PhyML XML example, samepage=true, baselinestretch=0.5,
1784  fontsize=\small, numbers=left]
1785
1786<phyml runid="simple.example" output.file="p1.output">
1787
1788  <topology>
1789    <instance id="T1" init.tree="bionj" optimise.tree="yes" \
1790    search="spr"/>
1791  </topology>
1792 
1793  <branchlengths id="BL1">
1794    <instance id="L1" optimise.lens="yes"/>
1795  </branchlengths>
1796
1797  <ratematrices id="RM1">
1798    <instance id="M1" model="GTR"/>
1799  </ratematrices>
1800 
1801  <equfreqs id="EF1">
1802    <instance id="F1"/>
1803  </equfreqs>
1804
1805  <siterates id="SR1">
1806    <instance id="R1" value="1.0"/>
1807    <instance id="R2" value="1.0"/>
1808    <instance id="R3" value="1.0"/>
1809    <instance id="R4" value="1.0"/>
1810    <instance id="R5" value="0.0"/>
1811    <weights  id="D1" family="gamma+inv" optimise.alpha="yes" \
1812    optimise.pinv="yes">
1813    </weights>
1814  </siterates>
1815
1816  <partitionelem id="partition_elem1" file.name=\
1817  "./p1.seq" data.type="nt" interleaved="yes">
1818    <mixtureelem list="T1, T1, T1, T1, T1"/>
1819    <mixtureelem list="M1, M1, M1, M1, M1"/>
1820    <mixtureelem list="F1, F1, F1, F1, F1"/>
1821    <mixtureelem list="R1, R2, R3, R4, R5"/>
1822    <mixtureelem list="L1, L1, L1, L1, L1"/>
1823  </partitionelem>
1824 
1825</phyml>
1826
1827\end{Verbatim}
1828
1829
1830\subsection{A second example: LG4X}\index{lg4x}
1831
1832The example below shows how to fit the LG4X model \cite{lg4x} to a given alignment of amino-acid
1833sequences (file \x{M587.nex.Phy}). LG4X is a mixture model with four classes. Each class has its own
1834rate and corresponding frequencies (hence the use of the FreeRate model below, see the
1835\x{<siterates>} component). In the particular example given here, the rate values and frequencies
1836are set by the users. These parameters will then be optimized by PhyML (\x{optimise.freerates="yes"}).
1837Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by
1838the user (Note that these matrices can be downloaded from the following web address:
1839\url{http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt}. They are also
1840provided in the PhyML package \x{example/lg4x/} directory.)
1841
1842\vspace{0.2cm}
1843\begin{Verbatim}[frame=single, label=LG4X, samepage=true, baselinestretch=0.5,
1844  fontsize=\small, numbers=left]
1845<phyml run.id="lg4x" output.file="M587.tests" branch.test="no">
1846
1847  <!-- Tree topology: start with BioNJ and then SPRs -->
1848  <topology>
1849    <instance id="T1" init.tree="user" file.name="user_tree.txt" \
1850    search="spr" optimise.tree="no"/>
1851  </topology>
1852 
1853
1854  <!-- Four rate matrices, read from files -->
1855  <ratematrices id="RM1">
1856    <instance id="M1" model="customaa" ratematrix.file="X1.mat"/>
1857    <instance id="M2" model="customaa" ratematrix.file="X2.mat"/>
1858    <instance id="M3" model="customaa" ratematrix.file="X3.mat"/>
1859    <instance id="M4" model="customaa" ratematrix.file="X4.mat"/>
1860  </ratematrices>
1861 
1862  <!-- Freerate model of variation of rates across sites -->
1863  <siterates id="SR1">
1864    <instance id="R1" init.value="0.197063"/>
1865    <instance id="R2" init.value="0.750275"/>
1866    <instance id="R3" init.value="1.951569"/>
1867    <instance id="R4" init.value="5.161586"/>
1868    <weights  id="D1" family="freerates" optimise.freerates="yes">
1869      <instance appliesto="R1" value="0.422481"/>
1870      <instance appliesto="R2" value="0.336848"/>
1871      <instance appliesto="R3" value="0.180132"/>
1872      <instance appliesto="R4" value="0.060539"/>
1873    </weights>
1874  </siterates>
1875 
1876  <!-- Amino-acid equilibrium freqs. are given by the models -->
1877  <equfreqs id="EF1">
1878    <instance id="F1" aa.freqs="model"/>
1879    <instance id="F2" aa.freqs="model"/>
1880    <instance id="F3" aa.freqs="model"/>
1881    <instance id="F4" aa.freqs="model"/>
1882  </equfreqs>
1883
1884
1885  <!-- One vector of branch lengths -->
1886  <branchlengths id="BL1" >
1887    <instance id="L1" optimise.lens="yes"/>
1888  </branchlengths>
1889
1890
1891  <!-- Mixture model assemblage -->
1892  <partitionelem id="partition1" file.name="M587.nex.Phy" \
1893  data.type="aa" interleaved="yes">
1894    <mixtureelem list="T1, T1, T1, T1"/>
1895    <mixtureelem list="M1, M2, M3, M4"/>
1896    <mixtureelem list="F1, F2, F3, F4"/>
1897    <mixtureelem list="R1, R2, R3, R4"/>
1898    <mixtureelem list="L1, L1, L1, L1"/>
1899  </partitionelem>
1900 
1901</phyml>
1902\end{Verbatim}
1903
1904In order to fit the LG4X model to the \x{proteic} sequence file provided in the \x{examples/}
1905directory, simply type \x{./phyml --xml=../examples/lg4x/lg4x.xml} (assuming the PhyML binary is installed
1906in the \x{src/} directory). You can of course slightly tweak the file \x{../examples/lg4x/lg4x.xml}
1907and use it as a template to fit this model to another data set.
1908
1909
1910\subsection{An example with multiple partition elements}
1911
1912The example below gives  the complete XML file to specify the analysis  of three partition elements,
1913corresponding to the nucleotide  sequence files \x{small\_p1\_pos1.seq}, \x{small\_p1\_pos2.seq} and
1914\x{small\_p1\_pos3.seq} in  interleaved PHYLIP  format. \x{small\_p1\_pos1.seq}  is fitted  with the
1915HKY85 model of substitution (with the  transition/transversion ratio being estimated from the data),
1916combined to a $\Gamma4$  model of rate variation across sites (with the  gamma shape parameter being
1917estimated from the data).  \x{small\_p1\_pos2.seq} is fitted to a custom substitution model with the
1918constraint  $A\leftrightarrow  G$=$C\leftrightarrow  T$.  The  nucleotide  frequencies  are  set  to
1919$\frac{1}{4}$   here.    The   model   does   not  allow   substitution   rates   to   vary   across
1920sites. \x{small\_p1\_pos3.seq} is fitted  using a GTR model conbined to a  $\Gamma4+$I model of rate
1921variation across sites.  Note that the equilibrium  nucleotide frequencies for the  fourth and fifth
1922class of the mixture are  set to be equal to that estimated from  the first partition element (i.e.,
1923\x{F1}) . The initial phylogeny is built using BioNJ  and the tree topology is to be estimated using
1924a NNI search algorithm.
1925
1926\vspace{0.2cm}
1927\begin{Verbatim}[frame=single, label=Example of PhyML XML file, samepage=true, baselinestretch=0.5,
1928  fontsize=\small, numbers=left]
1929
1930<phyml runid="nnisearch" output.file="small_p1_output">
1931
1932  <topology>
1933    <instance id="T1" init.tree="bionj" optimise.tree="yes" \
1934    search="nni"/>
1935  </topology>
1936 
1937  <branchlengths id="BL1">
1938    <instance id="L1" optimise.lens="yes"/>
1939    <instance id="L2"/>
1940    <instance id="L3"/>
1941  </branchlengths>
1942
1943  <ratematrices id="RM1">
1944    <instance id="M1" model="HKY85" optimise.tstv="yes"/>
1945    <instance id="M2" model="custom" model.code="102304" \
1946    optimise.rr="yes"/>
1947    <instance id="M3" model="GTR"/>
1948  </ratematrices>
1949 
1950  <equfreqs id="EF1">
1951    <instance id="F1"/>
1952    <instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/>
1953    <instance id="F3"/>
1954  </equfreqs>
1955
1956  <siterates id="SR1">
1957    <instance id="R1" value="1.0"/>
1958    <instance id="R2" value="1.0"/>
1959    <instance id="R3" value="1.0"/>
1960    <instance id="R4" value="1.0"/>
1961    <weights  id="D1" family="gamma" optimise.alpha="yes" \
1962    optimise.pinv="no">
1963    </weights>
1964  </siterates>
1965
1966  <siterates id="SR2">
1967    <instance id="R8" value="1.0"/>
1968    <weights  id="D2" family="gamma" optimise.alpha="yes" \
1969    optimise.pinv="yes">
1970    </weights>
1971  </siterates>
1972 
1973  <siterates id="SR3">
1974    <instance id="R10" value="1.0"/>
1975    <instance id="R11" value="1.0"/>
1976    <instance id="R12" value="1.0"/>
1977    <instance id="R13" value="1.0"/>
1978    <instance id="R14" value="1.0"/>
1979    <weights  id="D3" family="gamma" optimise.alpha="yes" \
1980    optimise.pinv="yes">
1981    </weights>
1982  </siterates>
1983
1984\end{Verbatim}
1985
1986\vspace{0.2cm}
1987\begin{Verbatim}[frame=single, label=Example of PhyML XML file (ctnd), samepage=true,
1988  baselinestretch=0.5, fontsize=\small, numbers=left]
1989
1990  <partitionelem id="partition_elem1" file.name=\
1991  "./small_p1_pos1.seq" data.type="nt" interleaved="yes">
1992    <mixtureelem list="T1, T1, T1, T1"/>
1993    <mixtureelem list="M1, M1, M1, M1"/>
1994    <mixtureelem list="F1, F1, F1, F1"/>
1995    <mixtureelem list="R1, R2, R3, R4"/>
1996    <mixtureelem list="L1, L1, L1, L1"/>
1997  </partitionelem>
1998
1999  <partitionelem id="partition_elem2" file.name=\
2000  "./small_p1_pos2.seq" data.type="nt" interleaved="yes">
2001    <mixtureelem list="T1"/>
2002    <mixtureelem list="M2"/>
2003    <mixtureelem list="R8"/>
2004    <mixtureelem list="F2"/>
2005    <mixtureelem list="L2"/>
2006  </partitionelem>
2007
2008  <partitionelem id="partition_elem3" file.name=\
2009  "./small_p1_pos3.seq" data.type="nt" interleaved="yes">
2010    <mixtureelem list="T1, T1, T1, T1, T1"/>
2011    <mixtureelem list="M3, M3, M3, M3, M3"/>
2012    <mixtureelem list="R10, R11, R12, R13, R14"/>
2013    <mixtureelem list="F3, F3, F3, F1, F1"/>
2014    <mixtureelem list="L3, L3, L3, L3, L3"/>
2015  </partitionelem>
2016 
2017</phyml>
2018
2019\end{Verbatim}
2020
2021
2022\subsection{Branch lengths with invariants and partionned data}
2023
2024Accommodating for models with invariable sites applying to some elements of a partitioned data, with
2025these elements  sharing the  same set  of edge lengths  can lead  to inconsistencies.   Consider for
2026instance a partitioned  data set with two elements.   Assume that these two elements  share the same
2027set of edge lengths.  Also, consider that GTR+I  applies to the first element and HKY applies to the
2028second. Now, the expected number of substitutions per site for the first element of the partition is
2029equal  to  $(1-p)l$,  where  $p$  is  the   estimated  proportion  of  invariants  and  $l$  is  the
2030maximum-likelihood estimate  for the length  of that  specific edge. For  the second element  of the
2031partition, the  expected number of  substitutions per  site is equal  to $l$, rather  than $(1-p)l$.
2032While $l$ are common to the two elements,  matching the specification of the input model, the actual
2033edge  lengths do  differ across  the  two partition  elements.  Please  be  aware that,  due to  the
2034programming structure implemented in PhyML, the program  will only return one value here, which will
2035be equal to $(1-p)l$.
2036
2037\section{Citing PhyML}
2038The ``default citation'' for PhyML is:
2039\begin{itemize}
2040\item 
2041``New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance
2042of PhyML 3.0''. Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010, {\it Systematic
2043  Biology}, 59(3):307-321
2044
2045\end{itemize}
2046The ``historic citation'' for PhyML is:
2047\begin{itemize}
2048\item ``A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood''
2049  Guindon S., Gascuel O. 2003, {\it Systematic Biology}, 52(5):696-704
2050\end{itemize}
2051
2052
2053
2054\section{Other programs in the PhyML package}
2055
2056PhyML  is software package  that provides  tools to  tackle problems  other than  estimating maximum
2057likelihood  phylogenies.  Installing these  tools  and  processing data  sets  is  explained is  the
2058following sections.
2059
2060\subsection{PhyTime}\index{PhyTime} PhyTime is  a program that estimates node  ages and substitution
2061rates using a Bayesian  approach.  The performance and main features of  this software are described
2062in two article (see Section \ref{sec:citephytime}).
2063
2064It  relies on  a  Gibbs  sampler which  outperforms  the
2065``standard'' Metropolis-Hastings algorithm  implemented in a number of  phylogenetic softwares.  The
2066details  and performance  of  this  approach are  described  in the  following  article:
2067
2068\subsubsection{Installing PhyTime}
2069
2070Compiling PhyTime is straightforward on Unix-like  machines (i.e., linux and MacOS systems). PhyTime
2071is not readily available for Windows machines but  compilation should be easy on this system too. In
2072the `phyml' directory, where the `src/'  and `doc/' directories stand, enter the following commands:
2073{\setlength{\baselineskip}{0.5\baselineskip}
2074\begin{verbatim}
2075./configure --enable-phytime;
2076make clean;
2077make;
2078\end{verbatim} } This set of commands generates  a binary file called \x{phytime} which can be found
2079  in the `src/' directory.
2080
2081\subsubsection{Running  PhyTime} Passing  options and  running  PhyTime on  your data  set is  quite
2082similar to running  PhyML in commmand-line mode.  The main differences between the  two programs are
2083explained below:
2084\begin{itemize}
2085\item PhyTime takes as mandatory input a {\em rooted} phylogenetic tree.  Hence, the `\x{-u}' option
2086must be used. Also, unlike PhyML, PhyTime does not modify the tree topology. Hence, the options that
2087go with the `\x{-s}' command do not alter the input tree topology.
2088\item  PhyTime  needs  an input  file  giving  information  about  calibration nodes.   The  command
2089`\x{--calibration=}' followed by the name of  the file containing the calibration node information is
2090mandatory. The content of that file should look as follows:
2091
2092\begin{figure}[h]
2093\begin{small}
2094\begin{Verbatim}[frame=single, label=Calibration node file, samepage=true, baselinestretch=0.5, fontsize=\tiny]
2095Dugong_dugon Procavia_capensis Elephantidae | -65 -54
2096Equus_sp. Ceratomorpha | -58 -54
2097Cercopithecus_solatus Macaca_mulatta Hylobates_lar Homo_sapiens | -35 -25
2098Lepus_crawshayi Oryctolagus_cuniculus Ochotona_princeps | -90 -37
2099Marmota_monax Aplodontia_rufa | -120 -37
2100Dryomys_nitedula Glis_glis | -120 -28.5
2101@root@ | -100 -120
2102\end{Verbatim}
2103\end{small}
2104\end{figure}
2105
2106Every row in  this file lists a  set of taxa that belong  to the same subtree (i.e.,  a clade). This
2107list of taxa  is followed by the character  `\x{|}' and two real numbers corresponding  to the lower
2108and upper bounds of the  calibration interval for the node at the root of  the clade. In the example
2109given  here,  the  clade  grouping  the three  taxa  ``Dugong\_dugon'',  ``Procavia\_capensis''  and
2110``Elephantida'' has  -65 as lower  bound and -54  as upper bound.  Node  ages (or node  heights) are
2111relative to the most recent tip node in the phylogeny, which age is set to 0. It is also possible to
2112define a  clade using  only two taxon  names. PhyTime will  then search  for the most  recent common
2113ancestor  of these  two  taxa  in the  user-defined  phylogeny and  assign  time  boundaries to  the
2114corresponding node. For serially-sampled data, the calibration nodes correspond to tips in the tree.
2115A calibration file will then look as follows:
2116\begin{figure}[h]
2117\begin{small}
2118\begin{Verbatim}[frame=single, label=Calibration node file (serially-sampled sequences), samepage=true, baselinestretch=0.5]
2119taxaA | -65 -65
2120taxaB | -65 -65
2121taxaC | -20 -20
2122taxaD | -30 -30
2123taxaE | -30 -30
2124taxaF | -60 -60
2125taxaG | -61 -51
2126@root@ | -100 -120
2127\end{Verbatim}
2128\end{small}
2129\end{figure}
2130
2131
2132% The maximum  value the average substitution rate  along a branch of  a phylogeny can take  is set to
2133% $10^{-2}$ substitution per site per time unit. It  is therefore important that the time unit used to
2134% specify the calibration intervals  are realistic with respect to this upper  bound. For instance, it
2135% one time unit corresponds to one month,  one assumes that the maximum (average) substitution rate is
2136% $10^{-3}$/site/month, which  is very high  (substitution rate in  introns is close  to 3/site/$10^9$
2137% years).  Consider that in the example give above, one time unit is $10^6$ years.  Hence, the maximum
2138% value the substitution  rate can take is $10^{-2}$/site/$10^6$  years, i.e., $10$/site/$10^9$ years.
2139% Multiplying the calibration values by 10 amounts  to considering that the time unit is $10^5$ years,
2140% implying a maximum  substitution rate equal to $10^{-2}$/site/$10^5$  years, i.e., $100$/site/$10^9$
2141% years.
2142
2143Note   that  the  node corresponding  to  the  root of  the  whole tree  has  a specific  label:
2144`\x{@root@}'.  {\color{red}{It is important  to specify upper and lower bounds for  the root node in
2145order to ensure convergence  of the Gibbs sampler. If the prior interval for  the root height is not
2146specified, the  upper bound will be set  to the upper bound  of the oldest calibration  node and the
2147lower bound  will be set to twice  this age.}}  As a  consequence, leaving the prior  on root height
2148interval unspecified may produce inaccurate estimates of node ages, especially if there are only few
2149otherwise calibration nodes available.
2150
2151A notable exception to this rule comes from the analysis of serial sample \index{serial sample}
2152data, i.e., alignments in which sequences were not sampled at the same time point.  For such data,
2153the estimated number of substitutions accumulated between successive time points is used to estimate
2154the substitution rate averaged over lineages. Because the time of collection of the sequences is
2155generally known without ambiguity, this extra piece of data is translated into very informative
2156calibration intervals for the tip nodes (i.e., calibration interval of zero width), which in turn
2157results in substitution rate estimates with descreased variances.  Posterior distribution of
2158substitution rates with small variances then allows one to get good estimates of the root age.
2159
2160\end{itemize}
2161
2162A typical PhyTime command-line should look like the following\index{command-line options!\x{--calibration}}:
2163
2164\begin{Verbatim}[fontsize=\small]
2165./phytime -i seqname -u treename --calibration=calibration_file -m GTR -c 8
2166\end{Verbatim}
2167
2168Assuming the file `\x{seqname}' contains DNA  sequences in PHYLIP or NEXUS format, `\x{treename}' is
2169the rooted  input tree in Newick  format and `\x{calibration\_file}'  is a set of  calibration nodes,
2170PhyTime will  estimate the  posterior distribution of  node times  and substitution rates  under the
2171assumption that the  substitution process follows a GTR  model with 8 classes of rates  in the Gamma
2172distribution of  rates across sites. The  model parameter values  are estimated by a  Gibbs sampling
2173technique. This algorithm tries diferent values of the model parameters and record the most probable
2174ones. By default,  $10^6$ values for each  parameter are collected. These values  are recorded every
2175$10^3$  sample. These  settings can  be  modified using  the appropriate  command-line options  (see
2176below).
2177
2178
2179\subsubsection{Upper bounds of model parameters}
2180
2181The  maximum expected  number  of substitutions  per  along a  given  branch is  set  to 1.0.  Since
2182calibration times provide prior  information about the time scale considered, it  is possible to use
2183that information to  define an upper bound for  the substitution rate. This upper bound  is equal to
2184the ratio of  the maximum value for  a branch length (1.0) by  the amount of time  elapsed since the
2185oldest  calibration point  (i.e., the  minimum  of the  lower bounds  taken  over the  whole set  of
2186calibration points)\footnote{The actual  formula involves an extra parameter which  does not need to
2187be  introduced  here}. It  is  important  to keep  in  mind  that the  upper  bound  of the  average
2188substitution rate depends  on the time unit used  in the calibration priors. The value  of the upper
2189bound is printed on screen at the start of the execution.
2190
2191PhyTime  implements  two  models  that  authorize  rates  to  be  autocorrelated.  The  strength  of
2192autocorrelation is governed  by a parameter which value  is estimated from the data.  However, it is
2193necessary to  set an appropriate  upper bound  for this parameter  prior running the  analysis.  The
2194maximum value is set such that the correlation between the rate at the beginning and at the end of a
2195branch of length 1.0 calendar time unit is not  different from 0. Here again the upper bound for the
2196model parameter depends  on the time unit. It is  important to choose this unit so  that a branch of
2197length 1.0 calendar unit can be considered as short. For this reason, {\color{red}{we recommend to select a time
2198unit so that the calibration times take values between -10 and -1000}}\index{time scale}.
2199
2200\subsubsection{PhyTime specific options}
2201
2202Beside the \x{--calibration} option, there are other command line options that are specific to PhyTime:
2203\begin{itemize}
2204\item \x{--chain\_len=num}\index{command-line options!\x{--chain\_len}} \\
2205\x{num} is the number of iterations required to estimate the joint posterior density of all the model
2206parameters, i.e., the length of the MCMC chain. Its default is set to 1E+6.
2207% \item \x{--burnin=num} \\
2208% \x{num} is the number of iterations for the ``burnin'' period, i.e., the number of iterations
2209% required before actually starting to get valid samples from the joint posterior density of all the
2210% model parameters. Its default is set to 1E+5, i.e., 0.1 times the default value of the \x{chain\_len}
2211% parameter (see above).
2212\item \x{--sample\_freq=num}\index{command-line options!\x{--sample\_freq}}  \\  \x{num} is the number of  generations between successive collection
2213of   the   model   parameter   values   throughout   the   MCMC   algorithm.   For   instance,   the
2214\x{--sample\_freq=1E+2} option will make PhyTime sample the model parameter every 100th iteration of
2215the MCMC algorithm. Its default is set to 1E+3.
2216\item \x{--fastlk=yes (no)}\index{command-line options!\x{--fastlk}} [Default: no]\\
2217The option is used to turn on (off) the approximation of the likelihood function using a
2218multivariate normal density. By default, the exact likelihood is used. Using the normal
2219approximation considerably speeds up the calculation. However, it is necessary to ensure that this
2220approximation is appropriate by looking at the correlation between the exact and approximated
2221likelihood values that are sampled. Please read Section \ref{sec:recomphytime} for a description of
2222the appropriate steps to take.
2223\item \x{--no\_sequences}\index{command-line options!\x{--no\_sequences}}\\
2224Use this option to run the sampler without sequence data. This option can be useful when one wants
2225to compare the marginal posterior density of model parameters to those derived when ignoring the
2226information conveyed by the sequences. Such comparison should be conducted on a systematic basis  so as to determine whether the
2227parameters estimates are mostly determined by the prior of driven by the sequence data.
2228\item \x{--rate\_model=gbs/gbd/gamma/clock}\index{command-line options!\x{--rate\_model}}\\
2229This option is to select the model of evolution of the rate of evolution. \x{gbs} (default) stands
2230for Geometric Brownian + Stochastic. This model considers that rates evolve along the tree
2231according to a geometric Brownian process \cite{kishino01} and the average rate of substitution
2232along a branch is a gamma distributed random variable. This model is described in \cite{guindon12}.
2233The \x{gbd} model (Geometric Browninan with Deterministic calculation of edge lengths) assumes the
2234same Geometric Brownian model of rates. However, as opposed to \x{gbs}, this model uses a
2235deterministic approximation to calculate the average rates of evolution along edges. This model
2236corresponds to the one described in \cite{kishino01} and implemented in the program Multidivtime\index{Multidivtime}.
2237\x{gamma} is a less sophisticated model that assumes that average rates along edges are distributed a
2238priori according to a gamma distribution. It is analogous to the uncorrelated clock model
2239implemented in BEAST\index{BEAST} with a gamma distribution replacing the exponential one. The \x{clock} option
2240corresponds to the strict clock model where all the lineages in the tree evolve at the same pace.
2241\end{itemize}
2242
2243\subsubsection{PhyTime output}
2244
2245The     program     PhyTime     generates     two     output     files.      The     file     called
2246`\x{your\_seqfile\_phytime.XXXX.stats}', where XXXX is a  randomly generated integer, lists the node
2247times and branch  relative rates sampled during  the estimation process.  It also  gives the sampled
2248values for other parameters, such as the  autocorrelation of rates (parameter `Nu'), and the rate of
2249evolution (parameter `EvolRate') amongst others.  This  output file can be analysed with the program
2250Tracer\index{Tracer}            from            the            BEAST\index{BEAST}            package
2251(\url{http://beast.bio.ed.ac.uk/Main_Page}).       The       second       file       is       called
2252`\x{your\_seqfile\_phytime.XXXX.trees}'.  It is the  list of  trees that  were collected  during the
2253estimation process, i.e., phylogenies sampled from the posterior density of trees.  This file can be
2254processed   using   the   software   TreeAnnotator,   also   part  of   the   BEAST   package   (see
2255\url{http://beast.bio.ed.ac.uk/Main_Page}) in  order to generate  confidence sets for the  node time
2256estimates.
2257
2258Important  information is also  displayed on  the standard  output of  PhyTime (the  standard output
2259generally corresponds to the terminal window from  which PhyTime was launched).  The first column of
2260this output gives the current  generation, or run, of the chain. It starts at  1 and goes up to 1E+6
2261by default (use \x{--chain\_len} to change this value, see above).  The second column gives the time
2262elapsed  in seconds since  the sampling  began. The  third column  gives the  log likelihood  of the
2263phylogenetic model (i.e., `Felsenstein's likelihood'). The  fourth column gives the logarithm of the
2264joint prior  probability of substitution  rates along  the tree and  node heights. The  fifth column
2265gives the  current sampled value  of the EvolRate  parameter along with the  corresponding Effective
2266Sample Size (ESS) (see  Section \ref{sec:ess}) for this parameter.  The sixth  column gives the tree
2267height  and the  corresponding  ESS.  The seventh  column  gives the  value  of the  autocorrelation
2268parameter followed by the  corresponding ESS. The eightth column gives the  values of the birth rate
2269parameter that  governs the birth-rate model  of species divergence  dates.  The last column  of the
2270standard  output gives  the minimum  of the  ESS  values taken  over the  whole set  of node  height
2271estimates.  It provides useful information when one has  to decide whether or not the sample size is
2272large enough to draw  valid conclusion, i.e., decide whether the chain was  run for long enough (see
2273Section \ref{sec:recomphytime} for more detail about adequate chain length).
2274
2275\subsubsection{ClockRate vs. EvolRate}
2276
2277The average rate of evolution along a branch  is broken into two components. One is called ClockRate
2278and is  the same throughout  the tree. The  other is called EvolRate  and corresponds to  a weighted
2279average of  branch-specific rates.  The  model of rate  evolution implemented in PhyTime  forces the
2280branch-specific rate values to  be greater than one. As a consequence,  ClockRate is usually smaller
2281EvolRate.
2282
2283In more mathematical terms, let $\mu$ be the value of ClockRate, $r_i$ be the value of the relative
2284rate along branch $i$ and $\Delta_i$ the time elapsed along branch $i$. The value of EvolRate is
2285then given by:
2286\begin{eqnarray*}
2287\mathrm{EvolRate} = \mu \frac{\sum_{i}^{2n-3} r_i \Delta_i}{\sum_{i}^{2n-3} \Delta_i}.
2288\end{eqnarray*}  It is  clear from  this equation  that  multiplying each  $r_i$ by  a constant  and
2289dividing $\mu$ by the same constant does not  change the value of EvolRate. The $r_i$s and $\mu$
2290are then confounded,  or non-identifiable, and only  the value of EvolRate can  be estimated from
2291the data.   {\color{red}{Please make sure  that you use  the value of  EvolRate rather than  that of
2292ClockRate when referring to the estimate of the substitution rate}}.
2293
2294\subsubsection{Effective sample size}\label{sec:ess}
2295
2296The MCMC technique  generates samples from a  target distribution (in our case,  the joint posterior
2297density  of  parameters).  Due  to  the  Markovian  nature of  the  method,  these samples  are  not
2298independent.  The  ESS is the estimated  number of independent  measurements obtained from a  set of
2299(usually dependent) measurements. It is calculated using the following formula:
2300\begin{eqnarray*}
2301\mathrm{ESS} = N\left(\frac{1-r}{1+r}\right),
2302\end{eqnarray*}
2303where  $N$  is the  length  of  the  chain  (i.e., the  `raw'  or `correlated' sample  size)  and $r$  is  the
2304autocorrelation value, which is obtained using the following formula:
2305\begin{eqnarray*}
2306r = \frac{1}{(N-k)\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k}-\mu_x),
2307\end{eqnarray*} where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the $X_i$ values
2308respectively and $k$ is the lag. The value of $r$ that is used in PhyTime corresponds to the case where $k=1$,
2309which therefore gives a first order  approximation of the `average' autocorrelation value (i.e., the
2310autocorrelation averaged over the set of possible values of the lag).
2311
2312
2313\subsubsection{Prior distributions of model parameters}\label{sec:prior}
2314
2315Any Bayesian analysis requires  specifying a prior distribution of model  parameters. The outcome of
2316the data analysis, i.e., the posterior distribution,  is influenced by the priors.  It is especially
2317true if the signal conveyed  by the data is weak.  While some have  argued that the specification of
2318priors relies more  on arbitrary decisions than sound scientific  reasoning, choosing relevant prior
2319distributions  is in  fact fully  integrated in  the process  of building  model that  generates the
2320observed data.  In particular, the problem of  estimating divergence times naturally lends itself to
2321hierarchical Bayesian modelling.  Based on the hypothesis  that rates of evolution are conserved (to
2322some extant)  throughout the  course of  evolution, the hierarchical  Bayesian approach  provides an
2323adequate framework for inferring substitution rates  and divergence dates separately. Hence, in this
2324situation, it makes good  sense to use what is known about a  relatively well-defined feature of the
2325evolution of genetic sequences (the ``molecular  clock'' hypothesis combined to stochastic variations of
2326rates across lineages) to build a prior distribution on rates along edges.
2327 
2328
2329
2330
2331\subsubsection{Citing PhyTime}\label{sec:citephytime}
2332
2333The ``default citation'' is:
2334
2335\begin{itemize}
2336\item Guindon S. ``From trajectories to averages: an improved description of the heterogeneity of
2337substitution rates along lineages'', {\it Systematic Biology}, 2012. doi: 10.1093/sysbio/sys063.
2338\end{itemize}
2339
2340An earlier article also described some of the methods implemented in PhyTime:
2341
2342\begin{itemize}
2343\item Guindon  S. ``Bayesian estimation of divergence  times from large data  sets'', {\it Molecular
2344    Biology and Evolution}, 2010,
234527(8):1768:81.
2346\end{itemize}
2347
2348
2349\section{Recommendations on program usage}\label{sec:progusage}
2350
2351\subsection{PhyML}
2352
2353The choice of the  tree searching algorithm among those provided by PhyML  is generally a tough one.
2354The  fastest option  relies  on local  and simultaneous  modifications  of the  phylogeny using  NNI
2355moves. More  thorough explorations of  the space  of topologies are  also available through  the SPR
2356options.  As these  two classes of tree topology moves involve  different computational burdens, it
2357is important to determine which option is the most suitable for the type of data set or analysis one
2358wants to perform. Below is a list of recommendations for typical phylogenetic analyses.
2359
2360\begin{enumerate}
2361\item {\em Single data set, unlimited computing time.} The best option here is probably to use a SPR
2362search (i.e., straight SPR of best of SPR and NNI).  If the focus is on estimating the relationships
2363between species,  it is a good  idea to use  more than one starting  tree to decrease the  chance of
2364getting stuck  in a  local maximum of  the likelihood  function.  Using NNIs  is appropriate  if the
2365analysis does not mainly focus on  estimating the evolutionary relationships between species (e.g. a
2366tree is needed to  estimate the parameters of codon-based models later  on).  Branch supports can be
2367estimated using bootstrap and approximate likelihood ratios.
2368
2369\item {\em  Single data set, restricted  computing time.}  The  three tree searching options  can be
2370used depending on  the computing time available and the  size of the data set.   For small data sets
2371(i.e., $<$ 50 sequences),  NNI will generally perform well provided that  the phylogenetic signal is
2372strong.  It  is relevant  to estimate a  first tree  using NNI moves  and examine  the reconstructed
2373phylogeny in order to have a rough idea  of the strength of the phylogenetic signal (the presence of
2374small internal  branch lengths  is generally  considered as a  sign of  a weak  phylogenetic signal,
2375specially when  sequences are  short).  For larger  data sets  ($>$ 50 sequences),  a SPR  search is
2376recommended if there  are good evidence of  a lack of phylogenetic signal.   Bootstrap analysis will
2377generally  involve  large  computational  burdens.   Estimating branch  supports  using  approximate
2378likelihood ratios therefore provides an interesting alternative here.
2379
2380\item {\em  Multiple data  sets, unlimited computing  time.} Comparative genomic  analyses sometimes
2381rely on building phylogenies from the analysis of  a large number of gene families.  Here again, the
2382NNI option is the most  relevant if the focus is not on recovering the  most accurate picture of the
2383evolutionary relationships  between species.   Slower SPR-based heuristics  should be used  when the
2384topology of the tree is an important parameter of the analysis (e.g., identification of horizontally
2385transferred genes using phylogenetic tree comparisons).   Internal branch support is generally not a
2386crucial parameter of the multiple data  set analyses. Using approximate likelihood ratio is probably
2387the best choice here.
2388
2389\item {\em Multiple data sets, limited computing time.}  The large amount of data to be processed in
2390a  limited time  generally  requires  the use  of  the fastest  tree  searching  and branch  support
2391estimation methods Hence,  NNI and approximate likelihood ratios rather  than SPR and non-parametric
2392bootstrap are generally the most appropriate here.
2393\end{enumerate}
2394 
2395Another important  point is the  choice of the  substitution model. While default  options generally
2396provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the
2397best-fit substitution model.  This pre-analysis can be done using popular software such as Modeltest
2398\cite{posada98} or ProtTest  \cite{abascal05} for instance.  These programs  generally recommend the
2399use of a discrete gamma distribution to model the substitution process as variability of rates among
2400sites is a common  feature of molecular evolution.  The choice of the number  of rate classes to use
2401for this  distribution is  also an important  one. While the  default is  set to four  categories in
2402PhyML, it is  recommended to use larger number  of classes if possible in order  to best approximate
2403the  patterns of rate  variation across  sites \cite{galtier04}.   Note however  that run  times are
2404directly proportional to  the number of classes  of the discrete gamma distribution.   Here again, a
2405pre-analysis with the  simplest model should help the  user to determine the number  of rate classes
2406that represents the best trade-off between computing time and fit of the model to the data.
2407
2408
2409\subsection{PhyTime}\label{sec:recomphytime}
2410
2411Analysing a data set using PhyTime should  involve three steps based on the following questions: (1)
2412do the priors seem to be adequate (2) can I use the fast approximation of the likelihood and (3) how
2413long shall I run the program for? I explain below how to provide answers to these questions.
2414
2415\begin{itemize} 
2416\item {\em  Are the priors adequate?} Bayesian  analysis relies on  specifiying the joint
2417prior density  of model parameters.  In  the case of  node age estimation, these  priors essentially
2418describe how rates of substitution vary across lineages and the probabilistic distribution that node
2419ages have when  ignoring the information provided  by the genetic sequences. These  priors vary from
2420tree to tree. It is therefore essential to  check the adequacy of priors for each user-defined input
2421tree. In order to do so, PhyTime needs to be run with the \x{--no\_data} option. When this option is
2422required, the  sequence data provided  as input will  be ignored and the  rest of the  analysis will
2423proceed  normally. The  prior distribution  of  model parameters,  essentially edge  rates and  node
2424heights, can then be  checked using the program Tracer as one would  do for the standard `posterior'
2425analysis.
2426
2427\item {\em  Can I use the  fast approximation to the  likelihood?} The suface  of the log-likelihood
2428function can  be approximated using  a multivariate normal  density.  This technique is  saving very
2429substantial amounts  of computation  time. However, like  most approximations, there  are situations
2430where it does not provide a good fit to the actual function. This usually happens when the phylogeny
2431displays  a lot  of short  branches, i.e.,  the  signal conveyed  by the  sequences is  weak. It  is
2432therefore important to first check whether  using the approximate likelihood is reasonable. In order
2433to do  so, it is  recommended to first  run the program without  the approximation, i.e.,  using the
2434default settings. Once  the minimum value of the ESS  of node ages (the last column  on the right of
2435the  standard output)  has reached  40-50, open  the \x{phytime.XXXX}  output file  with  Tracer and
2436examine the correlation  between the exact and approximate likelihood values.  If the correlation is
2437deemed to be good enough, PhyTime can be re-run using the \x{--fast\_lk} option, which uses the fast
2438normal approximation to the likelihood function.
2439
2440
2441
2442% Figure \ref{fig:approxbad} gives an example where  the correlation is too weak and the approximation
2443% of  the  likelihood  should be  avoided.  Figure  \ref{fig:approxbad}  gives  an example  where  the
2444% approximation  is  good enough.   The  current  execution of  PhyTime  can  be  terminated and  then
2445% re-launched using the \x{--fast\_lk} option.
2446
2447% \begin{figure}
2448% \begin{center}
2449% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_bad.eps}}
2450% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated
2451%   (Y-axis) and the exact (X-axis) likelihoods is weak here. The exact likelihood should be used (option \x{fastlk=no}).}
2452% \label{fig:approxbad}
2453% \end{center}
2454% \end{figure}
2455
2456
2457% \begin{figure}
2458% \begin{center}
2459% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_good.eps}}
2460% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated
2461%   (Y-axis) and the exact (X-axis) likelihoods is good. The approximation of the  likelihood can be used (option \x{fastlk=yes}).}
2462% \label{fig:approxgood}
2463% \end{center}
2464% \end{figure}
2465
2466\item {\em How  long shall I run the program  for?} PhyTime should be run long  enough such that the
2467ESS of  each parameter  is `large enough'.  The last column  on the  right handside of  the standard
2468output gives the minimum ESS across all internal  node heights. It is recommended to run the program
2469so that this number reaches at least 100.
2470
2471\end{itemize}
2472
2473
2474\section{Frequently asked questions} 
2475
2476\begin{enumerate}
2477\item {\it PhyML crashes before reading the sequences. What's wrong ?}\\
2478\begin{itemize}
2479\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output}
2480\item The carriage return characters in your sequence files are not recognized by PhyML. You must
2481  make sure that your sequence file is a plain text file, with standard carriage return characters (i.e.,
2482  corresponding to ``$\backslash$\x{n}'', or ``$\backslash$\x{r}'')
2483\end{itemize}
2484\item {\it The program crashes after reading the sequences. What's wrong ?}\\
2485\begin{itemize}
2486\item You analyse protein sequences and did not enter the \x{-d aa} option in the command-line.
2487\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output}
2488\end{itemize}
2489\item {\it Does PhyML handle outgroup sequences ?}\\
2490\begin{itemize}
2491\item Yes, it does. Outgroup taxa are identified by adding the `*' sign at the end of each
2492  corresponding sequence name (see Section \ref{sec:outgroupspecify})
2493\end{itemize}
2494\item {\it Does PhyML estimate clock-constrained trees ?}\\
2495\begin{itemize}
2496\item No, the PhyML program does not estimate clock-contrained trees. One can however use the
2497  program PhyTime to perform such analysis but the tree topology will not be estimated.
2498\end{itemize}
2499\item {\it Can PhyML analyse partitioned data, such as multiple gene sequences ?}\\
2500\begin{itemize}
2501\item We are currently  working on this topic.  Future releases of  the program will provide options
2502to estimate  trees from phylogenomic data sets,  with the opportunity to  use different substitution
2503models on  the different data partitions (e.g.,  different genes). PhyML will  also include specific
2504algorithms to search the space of tree topologies for this type of data.
2505\end{itemize}
2506\end{enumerate}
2507
2508
2509
2510\section{Acknowledgements} 
2511The development of PhyML since 2000 has been supported by the Centre National de la Recherche
2512Scientifique (CNRS) and the Minist\`ere de l'\'Education Nationale.
2513
2514% \bibliographystyle{/home/guindon/latex/biblio/OUPnum}
2515\bibliographystyle{/home/guindon/latex/biblio/nature/naturemag}
2516\bibliography{/home/guindon/latex/biblio/ref}
2517
2518\printindex
2519\end{document}
Note: See TracBrowser for help on using the repository browser.