Context Navigation

phyml-manual.tex

Visit:

Last change on this file was 10307, checked in by aboeckma, 12 years ago
added most recent version of phyml
File size: 140.6 KB

Line
1	\documentclass[a4paper,12pt]{article}
2	\usepackage{fancyvrb}
3	\usepackage{graphicx}
4	\usepackage{tabularx}
5	% \usepackage{color}
6	\usepackage{xcolor}
7	\usepackage{psfrag}
8	\usepackage[fleqn]{amsmath}
9	\usepackage[hyphens]{url}
10	% \usepackage{vmargin}
11	\usepackage{cite}
12	\usepackage{caption2}
13	\usepackage{hyperref}
14	\usepackage{makeidx}
15	\renewcommand{\captionlabeldelim}{.}
16	\def\thesection{\arabic{section}}
17	\renewcommand{\thefigure}{\arabic{figure}}
18	\renewcommand{\thetable}{\arabic{table}}
19	\newcommand{\hl}{\noalign{\vskip3pt}\hline\noalign{\vskip3pt}}
20	\newcommand{\hrf}{\hrulefill}
21	\newcommand{\tc}[1]{\textcolor{black}{#1}}
22	\newcommand{\dc}[1]{\textcolor{green}{#1}}
23	% \usepackage[none]{hyphenat}
24	\newcommand{\rep}[3][1]
25	{
26	\psfrag{#2}[c][c][#1]{#3}
27	}
28	\newcommand{\x}[1]{\texttt{#1}}
29
30	% \setpapersize{A4}
31	% \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red,linkbordercolor=000}
32	\hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red}
33	\renewcommand{\baselinestretch}{1.}
34	\makeindex
35	\begin{document}
36	\sloppy
37	\begin{center}
38	\thispagestyle{empty}
39	\vfill\vfill
40	% \rule{\linewidth}{0.02cm}\\
41	{\Huge \textbf{ P~h~y~M~L~~--~~M~a~n~u~a~l}}
42	\vspace{-0.4cm}\\
43	% \rule{\linewidth}{0.02cm}\\
44	\vfill
45	{\huge Version 3.0 \\
46	\today
47	\vfill
48	\normalsize
49	\href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml}\\
50	\vspace{0.4cm}
51	\href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}}
52	\end{center}
53	\clearpage
54	\tableofcontents
55	\clearpage
56
57	{\par
58	\small
59	\noindent
60	\copyright Copyright 1999 - 2008 by PhyML Development Team.\\
61	\noindent The software PhyML is provided ``as is'' without warranty of any kind. In no event shall
62	the authors or his employer be held responsible for any damage resulting from the use of this
63	software, including but not limited to the frustration that you may experience in using the package.
64	All parts of the source and documentation except where indicated are distributed under
65	the GNU public licence. See http://www.opensource.org for details.
66
67	}
68
69	{
70	\noindent
71	\setlength{\baselineskip}{0.5\baselineskip}
72	\section{Availability}
73	\begin{itemize}
74	\item Binaries: \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}
75	\item Sources: \href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml}
76	\item Discussion forum: \href{http://groups.google.com/group/phyml-forum}{http://groups.google.com/group/phyml-forum}
77	\end{itemize}
78	}
79
80	{
81	\noindent
82	\setlength{\baselineskip}{0.7\baselineskip}
83	\section{Authors}
84	\begin{itemize}
85	\item { St\'ephane Guindon} and { Olivier Gascuel} conceived the original PhyML algorithm.
86	\item { St\'ephane Guindon} conceived the PhyTime method.
87	\item { St\'ephane Guindon, Wim Hordjik} and { Olivier Gascuel} conceived the SPR-based tree search algorithm.
88	\item { Maria Anisimova} and { Olivier Gascuel} conceived the aLRT method for branch support.
89	\item { St\'ephane Guindon, Franck Lethiec}, Jean-Francois Dufayard and Vincent Lefort implemented PhyML.
90	\item { Jean-Francois Dufayard} created the benchmark and implemented the tools that are used to check
91	PhyML accuracy and performances.
92	\item { Vincent Lefort, St\'ephane Guindon, Patrice Duroux} and { Olivier Gascuel} conceived and
93	implemented PhyML web server.
94	\item St\'ephane Guindon wrote this document.
95	\end{itemize}
96	}
97	\clearpage
98
99	% \section{ML in phylogenetics: the basics.}
100	\section{Overview}
101
102	PhyML \cite{guindon03} is a software package which primary task that is to estimate maximum
103	likelihood phylogenies from alignments of nucleotide or amino-acid sequences. It provides a wide
104	range of options that were designed to facilitate standard phylogenetic analyses. The main
105	strength of PhyML lies in the large number of substitution models coupled to various options to
106	search the space of phylogenetic tree topologies, going from very fast and efficient methods to
107	slower but generally more accurate approaches. It also implements two methods to evaluate branch
108	supports in a sound statistical framework (the non-parametric bootstrap and the approximate
109	likelihood ratio test).
110
111	PhyML was designed to process moderate to large data sets. In theory, alignments with up to 4,000
112	sequences 2,000,000 character-long can analyzed. In practice however, the amount of memory required
113	to process a data set is proportional of the product of the number of sequences by their length.
114	Hence, a large number of sequences can only be processed provided that they are short. Also, PhyML
115	can handle long sequences provided that they are not numerous. With most standard personal
116	computers, the ``comfort zone'' for PhyML generally lies around 100-500 sequences less than 10,000
117	character long. For larger data sets, we recommend using other softwares such as RAxML
118	\cite{raxml}\index{RAxML} or GARLI \cite{garli}\index{GARLI} or Treefinder
119	(\href{http://www.treefinder.de}{http://www.treefinder.de}).
120
121
122	\section{Bug report}\index{bug}
123
124	While PhyML is, of course, bug-free (!) (please read the disclaimer carefuly...), if you ever come
125	across an issue, please feel free to report it using the discuss group web site at the following
126	address: \url{https://groups.google.com/forum/?fromgroups#!forum/phyml-forum}. Alternatively, you
127	can send an email to \url{s.guindon@auckland.ac.nz}. Do not forget to mention the version of PhyML
128	and program options you are using.
129
130
131	\section{Installing PhyML}
132
133	\subsection{Sources and compilation}\index{compilation}
134
135	The sources of the program are available free of charge by sending an e-mail to St\'ephane Guindon
136	at \href{mailto:guindon@lirmm.fr}{\x{guindon@lirmm.fr}} or
137	\href{mailto:s.guindon@auckland.ac.nz}{\x{s.guindon@auckland.ac.nz}}. The compilation on UNIX-like
138	systems is fairly standard. It is described in the `\x{INSTALL}' file that comes with the
139	sources. In a command-line window, go to the directory that contains the sources and type:
140
141	{\setlength{\baselineskip}{0.5\baselineskip}
142	\begin{verbatim}
143	./configure;
144	make clean;
145	make V=0;
146	\end{verbatim}
147	}
148
149	By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a
150	version of PhyML that can run through a debugging tool (such as \x{ddd}\label{ddd}) or a profiling
151	tool (such as \x{gprof}\label{gprof}) using the following instructions:
152
153	{\setlength{\baselineskip}{0.5\baselineskip}
154	\begin{verbatim}
155	./configure --enable-debug;
156	make clean;
157	make V=0;
158	\end{verbatim}
159	}
160
161	% {\em Note} -- when PhyML is going to be used mostly of exclusively in batch mode, it is preferable
162	% to turn on the batch mode option in the Makefile. In order to do so, the file \x{Makefile.am} needs
163	% to be modified: add \x{-DBATCH} to the line with \x{DEFS=-DUNIX -D\$(PROG) -DDEBUG}.
164
165
166
167	\subsection{Installing PhyML on UNIX-like systems (including Mac OS)}
168
169	Copy PhyML binary file in the directory you like. For the operating system to be able to locate the
170	program, this directory must be specified in the global variable \x{PATH}. In order to achieve this,
171	you will have to add \x{export PATH="/your\_path/:\${PATH}"} to the \x{.bashrc} or the
172	\x{.bash\_profile} located in your home directory (\x{your\_path} is the path to the directory that
173	contains PhyML binary).
174
175
176	\subsection{Installing PhyML on Microsoft Windows}\label{sec:install_windows}
177
178	Copy the files \x{phyml.exe} and \x{phyml.bat} in the same directory. To launch PhyML, click on the
179	icon corresponding to \x{phyml.bat}. Clicking on the icon for \x{phyml.exe} works too but the
180	dimensions of the window will not fit PhyML PHYLIP-like interface.
181
182	\subsection{Installing the parallel version of PhyML}\label{sec:MPI}\index{MPI}\index{bootstrap!parallel}
183
184	Bootstrap analysis can run on multiple processors. Each processor analyses one bootstraped dataset.
185	Therefore, the computing time needed to perform $R$ bootstrap replicates is divided by the number of
186	processors available.
187
188	This feature of PhyML relies on the MPI (Message Passing Interface) library. To use it, your
189	computer must have MPI installed on it. In case MPI is not installed, you can dowload it from
190	\href{http://www.mcs.anl.gov/research/projects/mpich2/}{http://www.mcs.anl.gov/research/projects/mpich2/}.
191	Once MPI is installed, it is necessary to launch the MPI daemon. This can be done by entering the
192	following instruction: \x{mpd \&}. Note however that in most cases, the MPI daemon will already be
193	running on your server so that you most likely do not need to worry about this. You can then just
194	go in the \x{phyml/} directory (the directory that contains the \x{src/}, \x{examples/} and \x{doc/}
195	folders) and enter the commands below:
196
197	{\setlength{\baselineskip}{0.5\baselineskip}
198	\begin{verbatim}
199	./configure --enable-mpi;
200	make clean;
201	make;
202	\end{verbatim}
203	}
204
205	A binary file named \x{phyml-mpi} has now been created in the \x{src/} directory and is ready to use
206	with MPI. A typical MPI command-line which uses 4 CPUs is given below:
207
208	{\setlength{\baselineskip}{0.5\baselineskip}
209	\begin{verbatim}
210	mpirun -n 4 ./phyml-mpi -i myseq -b 100
211	\end{verbatim}
212	}
213
214	\noindent Please read section \ref{sec:parallel_bootstrap} of this document for more information.
215
216	\section{Program usage.}\label{sec:phyml_new}
217
218	PhyML has three distinct user-interfaces. The first corresponds to a PHYLIP-like text interface
219	that makes the choice of the options self-explanatory. The command-line interface is well-suited for
220	people that are familiar with PhyML options or for running PhyML in batch mode. The XML interface is
221	more sophisticated. It allows the user to analyse partitionned data using flexible mixture models of evolution.
222
223	\subsection{PHYLIP-like interface}
224
225	The default is to use the PHYLIP-like text interface by simply typing `\x{phyml}' in a command-line
226	window or by clicking on the PhyML icon (see Section \ref{sec:install_windows}). After entering the
227	name of the input sequence file, a list of sub-menus helps the users set up the analysis. There
228	are currently four distinct sub-menus:
229
230	% \begin{figure}
231	% \resizebox{15cm}{9cm}{\includegraphics{./fig/interface.eps}}
232	% \caption{PHYLIP-like interface to PhyML.}
233	% \label{fig:interface}
234	% \end{figure}
235
236	\begin{enumerate}
237
238	\item {\em Input Data}: specify whether the input file contains amino-acid or nucleotide
239	sequences. What the sequence format is (see Section \ref{sec:input_output}) and how many data sets
240	should be analysed.
241
242	\item {\em Substitution Model}: selection of the Markov model of substitution.
243
244	\item {\em Tree Searching}: selection of the tree topology searching algorithm.
245
246	\item {\em Branch Support}: selection of the method that is used to measure branch support.
247
248	\end{enumerate}
249	\noindent `\x{+}' and `\x{-}' keys are used to move forward and backward in the sub-menu list. Once
250	the model parameters have been defined, typing `\x{Y}' (or `\x{y}') launches the calculations. The
251	meaning of some options may not be obvious to users that are not familiar with phylogenetics. In
252	such situation, we strongly recommend to use the default options. As long as the format of the input
253	sequence file is correctly specified (sub-menu {\em Input data}), the safest option for non-expert
254	users is to use the default settings. The different options provided within each sub-menu are
255	described in what follows.
256
257
258	\subsubsection{Input Data sub-menu}
259
260	\begin{center}\framebox{\x{[D] ............................... Data type (DNA/AA)}} \end{center}
261	Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see
262	Section \ref{sec:input_output}). Type \x{D} to change settings.
263
264	\vspace{0.7cm}
265	\begin{center} \framebox{\x{[I] ...... Input sequences interleaved (or sequential)}} \end{center}
266	PHYLIP format comes in two flavours: interleaved or sequential (see Section
267	\ref{sec:input_output}). Type \x{I} to selected among the two formats.
268
269	\vspace{0.7cm}
270	\begin{center} \framebox{\x{[M] ....................... Analyze multiple data sets}} \end{center}
271	If the input sequence file contains more than one data sets, PhyML can analyse each of them
272	in a single run of the program. Type \x{M} to change settings.
273
274	\vspace{0.7cm}
275	\begin{center} \framebox{\x{[R] ............................................ Run ID}} \end{center}
276	This option allows you to append a string that identifies the current PhyML run. Say for instance
277	that you want to analyse the same data set with two models. You can then `tag' the first PhyML run
278	with the name of the first model while the second run is tagged with the name of the second model.\index{run ID}
279
280
281
282	\subsubsection{Substitution model sub-menu}\label{sec:submenus}
283
284	\begin{center} \framebox{\x{[M] ................. Model of nucleotide substitution}} \end{center}
285	\begin{center} \framebox{\x{[M] ................ Model of amino-acids substitution}} \end{center}
286	PhyML implements a wide range of substitution models: JC69 \cite{jukes69}, K80 \cite{kimura80}, F81
287	\cite{felsenstein81a}, F84 \cite{phylip2}, HKY85 \cite{hasegawa85}, TN93 \cite{tamura93} GTR
288	\cite{lanave84,tavare86} and custom for nucleotides; LG \cite{le08}, WAG \cite{whelan01b}, Dayhoff
289	\cite{dayhoff78}, JTT \cite{jones92}, Blosum62 \cite{henikoff92}, mtREV \cite{adachi96}, rtREV
290	\cite{dimmic02}, cpREV \cite{adachi00}, DCMut \cite{kosiol04}, VT \cite{muller00} and mtMAM
291	\cite{cao98} and custom for amino acids. Cycle through the list of nucleotide or amino-acids
292	substitution models by typing \x{M}. Both nucleotide and amino-acid lists include a `custom' model.
293	The custom option provides the most flexible way to specify the nucleotide substitution model. The
294	model is defined by a string made of six digits. The default string is `\x{000000}', which means
295	that the six relative rates of nucleotide changes: $A \leftrightarrow C$, $A \leftrightarrow G$, $A
296	\leftrightarrow T$, $C \leftrightarrow G$, $C \leftrightarrow T$ and $G \leftrightarrow T$, are
297	equal. The string `\x{010010}' indicates that the rates $A \leftrightarrow G$ and $C
298	\leftrightarrow T$ are equal and distinct from $A \leftrightarrow C = A \leftrightarrow T = C
299	\leftrightarrow G = G \leftrightarrow T$. This model corresponds to HKY85 (default) or K80 if the
300	nucleotide frequencies are all set to 0.25. `\x{010020}' and `\x{012345}' correspond to TN93 and
301	GTR models respectively. The digit string therefore defines groups of relative substitution rates.
302	The initial rate within each group is set to 1.0, which corresponds to F81 (JC69 if the base
303	frequencies are equal). Users also have the opportunity to define their own initial rate values.
304	These rates are then optimised afterwards (option `\x{O}') or fixed to their initial values. The
305	custom option can be used to implement all substitution models that are special cases of
306	GTR. Table \ref{tab:modelcode} on page \pageref{tab:modelcode} gives the correspondence between the `standard' name of the model
307	(see \url{http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg}) and the custom model code.
308	The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid
309	substitution model that is not hard-coded in PhyML. The symmetric part of the rate matrix, as well
310	as the equilibrium amino-acid frequencies, are given in a file which name is given as input of the
311	program. The format of this file is described in the section \ref{sec:customaa}.
312
313	\vspace{0.7cm}
314	\begin{center} \framebox{\x{[F] ................. Optimise equilibrium frequencies}} \end{center}
315	\begin{center} \framebox{\x{[E] ......... Equilibrium frequencies (empirical/user)}} \end{center}
316	\begin{center} \framebox{\x{[F] . Amino acid frequencies (empirical/model defined)}} \end{center}
317	For nucleotide sequences, optimising nucleotide frequencies means that the values of these
318	parameters are estimated in the maximum likelihood framework. When the custom model option is
319	selected, it is also possible to give the program a user-defined nucleotide frequency distribution
320	at equilibrium (option \x{E}). For protein sequences, the stationary amino-acid frequencies are
321	either those defined by the substitution model or those estimated by counting the number of
322	different amino-acids observed in the data. Hence, the meaning of the \x{F} option depends on the
323	type of the data to be processed.
324
325	\vspace{0.7cm}
326	\begin{center} \framebox{\x{[T] .................... Ts/tv ratio
327	(fixed/estimated)}} \end{center}\index{$\kappa$}\index{ts/tv ratio} Fix or estimate the
328	transition/transversion ratio in the maximum likelihood framework. This option is only available
329	when DNA sequences are to be analysed under K80, HKY85 or TN93 models. The definition given to this
330	parameter by PhyML is the same as PAML's\index{PAML} one. Therefore, the value of this parameter
331	does {\it not} correspond to the ratio between the expected number of transitions and the expected
332	number of transversions during a unit of time. This last definition is the one used in
333	PHYLIP\index{PHYLIP}. PAML's manual gives more detail about the distinction between the two
334	definitions (\url{http://abacus.gene.ucl.ac.uk/software/paml.html}).
335
336	\vspace{0.7cm}
337	\begin{center} \framebox{\x{[V] . Proportion of invariable sites
338	(fixed/estimated)}} \end{center}\index{invariable sites}\index{proportion of invariants} The
339	proportion of invariable sites, i.e., the expected frequency of sites that do not evolve, can be
340	fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each
341	site in the sequence may accumulate substitutions at some point during its evolution, even if no
342	differences across sequences are actually observed at that site. Users can also fix this parameter
343	to any value in the $[0.0,1.0]$ range or estimate it from the data in the maximum-likelihood
344	framework.
345
346	\vspace{0.7cm}
347	\index{gamma distribution (discrete)!mean vs. median}
348	\index{gamma distribution (discrete)!number of categories}
349	\index{gamma distribution (discrete)!shape parameter}
350	\begin{center} \framebox{\x{[R] ....... One category of substitution rate (yes/no)}} \end{center}
351	\begin{center} \framebox{\x{[C] ........... Number of substitution rate categories}} \end{center}
352	\begin{center} \framebox{\x{[A] ... Gamma distribution parameter (fixed/estimated)}} \end{center}
353	\begin{center} \framebox{\x{[G] .........`Middle' of each rate class (mean/median)}} \end{center}
354	Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete
355	gamma distribution. Type \x{R} to switch this option on or off. The different categories of this
356	discrete distribution correspond to different (relative) rates of evolution. The number of
357	categories of this distribution is set to 4 by default. It is probably not wise to go below this
358	number. Larger values are generally preferred. However, the computational burden involved is
359	proportional to the number of categories (i.e., an analysis with 8 categories will generally take
360	twice the time of the same analysis with only 4 categories). Note that the likelihood will not
361	necessarily increase as the number of categories increases. Hence, the number of categories should
362	be kept below a ``reasonable'' number, say 20. The default number of categories can be changed by
363	typing \x{C}.
364
365	The middle of each discretized substitution rate class can be determined using the mean or the
366	median. PAML, MrBayes and RAxML use the mean. However, the median is generally associated with
367	greater likelihoods than the mean. This conclusion is based on our analysis of several real-world
368	data sets extracted from TreeBase. Despite this, the default option in PhyML is to use the mean in
369	order to make PhyML likelihoods comparable to those of other phylogenetic software. One must bare
370	in mind that {\color{red}{likelihoods calculated with the mean approximation are not directly
371	comparable to the likelihoods calculated using the median approximation}}.
372
373	The shape of the gamma distribution determines the range of rate variation across sites. Small
374	values, typically in the $[0.1,1.0]$ range, correspond to large variability. Larger values
375	correspond to moderate to low heterogeneity. The gamma shape parameter can be fixed by the user or
376	estimated via maximum-likelihood. Type \x{A} to select one or the other option.
377
378
379	\subsubsection{Tree searching sub-menu}
380
381	\begin{center} \framebox{\x{[O] ........................... Optimise tree topology}} \end{center} By
382	default the tree topology is optimised in order to maximise the likelihood. However, it is also
383	possible to avoid any topological alteration. This option is useful when one wants to compute the
384	likelihood of a tree given as input (see below). Type \x{O} to select among these two options.
385
386	\vspace{0.7cm}
387	\begin{center} \framebox{\x{[S] .................. Tree topology search operations}} \end{center}
388	PhyML proposes three different methods to estimate tree topologies. The default approach is to use
389	simultaneous NNI. This option corresponds to the original PhyML algorithm \cite{guindon03}. The
390	second approach relies on subtree pruning and regrafting (SPR). It generally finds better tree
391	topologies compared to NNI but is also significantly slower. The third approach, termed BEST,
392	simply estimates the phylogeny using both methods and returns the best solution among the two. Type
393	\x{S} to choose among these three choices.
394
395	\vspace{0.7cm}
396	\begin{center} \framebox{\x{[R] ......................... Use random starting tree}} \end{center}
397	\begin{center} \framebox{\x{[N] .................. Number of random starting trees}} \end{center}
398	When the SPR or the BEST options are selected, is is possible to use random trees rather than BioNJ
399	or a user-defined tree, as starting tree. If this option is turned on (type \x{R} to change), five
400	trees, corresponding to five random starts, will be estimated. The output tree file will contain the
401	best tree found among those five. The number of random starts can be modified by typing
402	\x{N}. Setting the number of random starting trees to $N$ means that the analysis will take
403	(slightly more than) $N$ times the time required for a standard analysis where only one (BioNJ)
404	starting tree is used. However, the analysis of real data sets shows that the best trees estimated
405	using the random start option almost systematically have higher likelihoods than those inferred
406	using a single starting tree.
407
408	\vspace{0.7cm}
409	\begin{center} \framebox{\x{[U] ........ Starting tree (BioNJ/parsimony/user
410	tree)}} \end{center}\index{BioNJ} When the tree topology optimisation option is turned on, PhyML
411	proceeds by refining an input tree. By default, this input tree is estimated using BioNJ
412	\cite{gascuelNJ}. The alternative option is to use a parsimony tree. We found this option specially
413	useful when analysing large data sets with NNI moves as it generally leads to greater likelihoods
414	than those obtained when starting from a BioNJ trees. The user can also to input her/his own
415	tree. This tree should be in Newick format (see Section \ref{sec:input_output}). This option is
416	useful when one wants to evaluate the likelihood of a given tree with a fixed topology, using
417	PhyML. Type \x{U} to choose among these two options.
418
419	\subsubsection{Branch support sub-menu}
420
421	\begin{center} \framebox{\x{[B] ................ Non parametric bootstrap analysis}} \end{center}
422	The support of the data for each internal branch of the phylogeny can be estimated using
423	non-parametric bootstrap. By default, this option is switched off. Typing \x{B} switches on the
424	bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The largest this
425	number the more precise the bootstrap support estimates are. However, for each bootstrap replicate
426	a phylogeny is estimated. Hence, the time needed to analyse $N$ bootstrap replicates corresponds to
427	$N$-times the time spent on the analysis of the original data set. $N=100$ is generally considered
428	as a reasonable number of replicates.
429
430	\begin{center} \framebox{\x{[A] ................ Approximate likelihood ratio test}} \end{center}
431	When the bootstrap option is switched off (see above), approximate likelihood branch supports are
432	estimated. This approach is considerably faster than the bootstrap one. However, both methods intend
433	to estimate different quantities and conducting a fair comparison between both criteria is not
434	straightforward. The estimation of approximate likelihood branch support comes in multiple
435	flavours. The default is set to aBayes, corresponding to the approximate Bayes method described in
436	\cite{anisimova11}. The approximate likelihood ratio test (aLRT) \cite{anisimova06},
437	ShimodairaâHasegawa aLRT (SH-aLRT) statistics are the other available options.
438
439
440	\subsection{Command-line interface}
441
442	An alternative to the PHYLIP-like interface is the command-line interface. Users that do not need to
443	modify the default parameters can launch the program with the `\x{phyml -i seq\_file\_name}'
444	command. The list of all command line arguments and how to use them is given in the `Help' section
445	which is displayed when entering the `\x{phyml --help}' command. The available command-line options
446	are described in what follows.
447
448	\begin{itemize}
449
450	\item \x{-i} (or \x{--input}) \x{seq\_file\_name}\index{command-line options!\x{--input}} \\
451	\x{seq\_file\_name} is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
452
453	\item \x{-d} (or \x{--datatype}) \x{data\_type}\index{command-line options!\x{--data\_type}}\\
454	\x{data\_type} is \x{nt} for nucleotide (default) and \x{aa} for amino-acid sequences.
455
456	\item \x{-q} (or \x{--sequential})\index{sequence format!interleaved}\index{sequence format!sequential}\index{command-line options!\x{--sequential}} \\
457	Changes interleaved format (default) to sequential format.
458
459	\item \x{-n} (or \x{--multiple}) \x{nb\_data\_sets}\index{multiple data sets}\index{command-line options!\x{--multiple}}\\
460	\x{nb\_data\_sets} is an integer giving the number of data sets to analyse.
461
462	\item \x{-p} (or \x{--pars})\index{command-line options!\x{--pars}}\\
463	Use a minimum parsimony starting tree. This option is taken into account when the `-u' option
464	is absent and when tree topology modifications are to be done.
465
466	\item \x{-b} (or \x{--bootstrap}) \x{int}\index{bootstrap}\index{command-line options!\x{--bootstrap}}
467	\begin{itemize}
468	\item \x{int} $>$ 0: \x{int} is the number of bootstrap replicates.
469	\item \x{int} = 0: neither approximate likelihood ratio test nor bootstrap values are computed.
470	\item \x{int} = -1: approximate likelihood ratio test returning aLRT statistics.
471	\item \x{int} = -2: approximate likelihood ratio test returning Chi2-based parametric branch supports.
472	% \item \x{int} = -3: minimum of Chi2-based parametric and SH-like branch supports.
473	\item \x{int} = -4: SH-like branch supports alone.
474	\item \x{int} = -5: (default) approximate Bayes branch supports.
475	\end{itemize}
476
477	\item \x{-m} (or \x{--model}) \x{model\_name}\index{substitution models!DNA}\index{substitution models!amino acids}\index{command-line options!\x{--model}} \\
478	\x{model\_name} : substitution model name.
479	\begin{itemize}
480	\item {\it Nucleotide-based models}: \x{HKY85} (default) \x{\| JC69 \| K80 \| F81 \| F84 \| TN93 \| GTR \|
481	custom} \\ The \x{custom} option can be used to define a new substitution model. A string of six
482	digits identifies the model. For instance, 000000 corresponds to F81 (or JC69 provided the
483	distribution of nucleotide frequencies is uniform). 012345 corresponds to GTR. This option can be
484	used for encoding any model that is a nested within GTR. See Section \ref{sec:submenus} and Table \ref{tab:modelcode}. {\em NOTE:}
485	the substitution parameters of the custom model will be optimised so as to maximise the
486	likelihood. It is possible to specify and fix (i.e., avoid optimisation) the values of the
487	substitution rates only through the PHYLIP-like interface.
488
489	\item {\it Amino-acid based models}: \x{LG} (default) \x{\| WAG \| JTT \| MtREV \| Dayhoff \| DCMut \| RtREV
490	\| CpREV \| VT \| Blosum62 \| MtMam \| MtArt \| HIVw \| HIVb \| custom} \\
491	The \x{custom} option is useful when one wants to use an amino-acid substitution model that is not
492	available by default in PhyML. The symmetric part of the rate matrix, as well as the equilibrium
493	amino-acid frequencies, are given in a file which name is asked for by the program. The format of
494	this file is described in section \ref{sec:customaa}.
495	\end{itemize}
496
497	\begin{table}\index{custom models}
498	\begin{center}
499	\begin{tabular}{ll}
500	\hline
501	Name & Command-line option \\
502	\hline
503	JC69 & \x{-m 000000 -f 0.25,0.25,0.25,0.25} \\
504	F81 & \x{-m 000000}\\
505	K80 & \x{-m 010010 -f 0.25,0.25,0.25,0.25} \\
506	HKY85 & \x{-m 010010}\\
507	TrNef & \x{-m 010020 -f 0.25,0.25,0.25,0.25} \\
508	TrN & \x{-m 010020}\\
509	K81 & \x{-m 123321 -f 0.25,0.25,0.25,0.25}\\
510	K81uf & \x{-m 123321}\\
511	TIMef & \x{-m 132241 -f 0.25,0.25,0.25,0.25}\\
512	TIM & \x{-m 132241}\\
513	TVMef & \x{-m 102304 -f 0.25,0.25,0.25,0.25}\\
514	TVM & \x{-m 102304}\\
515	SYM & \x{-m 123456 -f 0.25,0.25,0.25,0.25}\\
516	GTR & \x{-m 123456}\\
517	\hline
518	\end{tabular}
519	\caption{Nucleotide substitution model names (as defined in \cite{posada01}) and the corresponding
520	custom model code used in PhyML.}\label{tab:modelcode}
521	\end{center}
522	\end{table}
523
524	\item \x{--aa\_rate\_file file\_name}\index{command-line options!\x{--aa\_rate\_file}} \\
525	This option is compulsory when analysing amino-acid sequences under a `custom' model (see above). \x{file\_name}
526	should provide a rate matrix and equilibrium amino acid in PAML format (see Section \ref{sec:customaa}).
527
528	\item \x{-f e}, \x{m}, or ``\x{fA,fC,fG,fT}" \index{frequencies!nucleotide}\index{frequencies!amino-acid}\index{stationary frequencies}\index{command-line options!\x{-f}}\\
529	Nucleotide or amino-acid frequencies.
530	\begin{itemize}
531	\item \x{e} : the character frequencies are determined as follows :
532	\begin{itemize}
533	\item {\it Nucleotide sequences}: (Empirical) the equilibrium base frequencies are estimated by counting
534	the occurence of the different bases in the alignment.
535	\item {\it Amino-acid sequences}: (Empirical) the equilibrium amino-acid frequencies are estimated by counting
536	the occurence of the different amino-acids in the alignment.
537	\end{itemize}
538
539	\item \x{m} : the character frequencies are determined as follows :
540	\begin{itemize}
541	\item {\it Nucleotide sequences}: (ML) the equilibrium base frequencies are estimated using maximum
542	likelihood.
543	\item {\it Amino-acid sequences}: (Model) the equilibrium amino-acid frequencies are estimated using
544	the frequencies defined by the substitution model.
545	\end{itemize}
546
547	\item ``\x{fA,fC,fG,fT}": only valid for nucleotide-based models. \x{fA}, \x{fC}, \x{fG} and \x{fT} are floating numbers that
548	correspond to the frequencies of A, C, G and T respectively.
549	\end{itemize}
550
551	\item \x{-t} (or \x{--ts/tv}) \x{ts/tv\_ratio} \index{$\kappa$}\index{ts/tv ratio}\index{command-line options!\x{--ts/tv}}\\
552	\x{ts/tv\_ratio}: transition/transversion ratio. DNA sequences only. Can be a fixed positive value
553	(e.g., 4.0) or type \x{e} to get the maximum likelihood estimate.
554
555	\item \x{-v} (or \x{--pinv}) \x{prop\_invar}\index{proportion of invariants}\index{invariable sites} \index{command-line options!\x{--pinv}}\\
556	\x{prop\_invar}: proportion of invariable sites. Can be a fixed value in the [0,1] range or type \x{e} to get the maximum likelihood estimate.
557
558	\item \x{-c} (or \x{--nclasses}) \x{nb\_subst\_cat}\index{gamma distribution (discrete)!number of categories} \index{command-line options!\x{--nclasses}}\\
559	\x{nb\_subst\_cat}: number of relative substitution rate categories. Default: \x{nb\_subst\_cat=4}. Must be a positive integer.
560
561	\item \x{-a} (or \x{--alpha}) \x{gamma} \index{gamma distribution (discrete)!shape parameter}\index{command-line options!\x{--alpha}} \\
562	\x{gamma}: value of the gamma shape parameter. Can be a fixed positive value or e to get the maximum
563	likelihood estimate. The value of this parameter is estimated in the maximum likelihood framework by default.
564
565	\item \x{--use\_median} \index{gamma distribution (discrete)!mean vs. median} \index{command-line options!\x{--use\_median}}\\
566	The middle of each substitution rate class in the discrete gamma distribution is taken as the
567	median. The mean is used by default.
568
569	\item \x{--free\_rates} \index{command-line options!\x{--free\_rates}}\\
570	As an alternative to the discrete gamma model, it is possible to estimate the (relative) rate in
571	each class of the (mixture) model and the corresponding frequencies directly from the data. This
572	model, called the FreeRate model, has more parameters
573	than the discrete gamma one but usually provides a significantly better fit to the data. See
574	\cite{soubrier12} for more information about this model and an illustration of its use.
575
576	\item \x{--codpos} \x{1,2 or 3} \index{command-line options!\x{--codpos}}\\
577	When analysing an alignment of coding sequences, use this option to consider only the first, second
578	or the third coding position for the estimation.
579
580	\item \x{-s} (or \x{--search}) \x{move}\index{NNI}\index{SPR} \index{command-line options!\x{--search}}\\
581	Tree topology search operation option. Can be either \x{NNI} (default, fast) or \x{SPR} (usually
582	slower than \x{NNI} but more accurate) or \x{BEST} (best of NNI and SPR search).
583
584	\item \x{-u} (or \x{--inputtree}) \x{user\_tree\_file}\index{input tree}\index{user tree} \index{command-line options!\x{--inputtree}}\\
585	\x{user\_tree\_file}: starting tree filename. The tree must be in Newick format.
586
587	\item \x{-o params}\index{optimisation!topology}\index{optimisation!substitution parameters} \index{command-line options!\x{-o}}\\
588	This option focuses on specific parameter optimisation.
589	\begin{itemize}
590	\item \x{params=tlr}: tree topology (\x{t}), branch length (\x{l}) and substitution rate parameters (\x{r}) are optimised.
591	\item \x{params=tl}: tree topology and branch lengths are optimised.
592	\item \x{params=lr}: branch lengths and substitution rate parameters are optimised.
593	\item \x{params=l}: branch lengths are optimised.
594	\item \x{params=r}: substitution rate parameters are optimised.
595	\item \x{params=n}: no parameter is optimised.
596	\end{itemize}
597
598	\item \x{--rand\_start}\index{random tree} \index{command-line options!\x{--rand\_start}}\\
599	This option sets the initial tree to random. It is only valid if SPR searches are to be performed.
600
601	\item \x{--n\_rand\_starts num} \index{command-line options!\x{--n\_rand\_starts}}\\
602	\x{num} is the number of initial random trees to be used. It is only valid if SPR searches are to be performed.
603
604	\item \x{--r\_seed num}\index{random number} \index{command-line options!\x{--r\_seed}}\\
605	\x{num} is the seed used to initiate the random number generator. Must be an integer.
606
607	\item \x{--print\_site\_lnl}\index{likelihood!print site likelihood} \index{command-line options!\x{--print\_site\_lk}}\\
608	Print the likelihood for each site in file *\_phyml\_lk.txt. For $\Gamma$ or $\Gamma$+I or
609	FreeRate models, this option returns the posterior probability of each relative rate class at
610	each site. Such information can then be used to identify fast- and slow-evolving regions of the alignment.
611
612	\item \x{--print\_trace}\index{command-line options!\x{--print\_trace}}\\
613	Print each phylogeny explored during the tree search process in file *\_phyml\_trace.txt. This
614	option can be useful for monitoring the progress of the analysis for very large data sets and have
615	an approximate idea of what the final phylogeny will look like.
616
617	\item \x{--run\_id ID\_string}\index{run ID} \index{command-line options!\x{--run\_id}}\\
618	Append the string ID\_string at the end of each PhyML output file. This option may be useful when
619	running simulations involving PhyML. It can also be used to `tag' multiple analysis of the same data
620	set with various program settings.
621
622	\item \x{--no\_memory\_check} \index{command-line options!\x{--no\_memory\_check}}\\
623	By default, when processing a large data set, PhyML will pause and ask the user to confirm that
624	she/he wants to continue with the execution of the analysis despite the large amount of memory
625	required. The \x{--no\_memory\_check} skips this question. It is especially useful when running
626	PhyML in batch mode.
627
628	\item \x{--no\_colalias} \index{command-line options!\x{--no\_colalias}}\\
629	By default, PhyML preprocesses each alignment by putting together (or aliasing) the columns that are
630	identical. Use this option to skip this step but be aware that the analysis might then take more
631	time to complete.
632
633	\item \x{--constrained\_lens} \index{command-line options!\x{--constrained\_lens}}\\
634	When an input tree with branch lengths is provided, this option will find the branch multiplier that
635	maximises the likelihood (i.e., the relative branch lengths remain constant)
636
637	\item \x{--constraint\_file} \x{file\_name} \index{command-line options!\x{--constraint\_file}}\\
638	\x{file\_name} lists the topological constraints under which the tree topology search is
639	conducted. This option should be used in conjunction with \x{-u} \x{file\_name}. See Section
640	\ref{sec:topoconstraints} for more information.
641
642	\item \x{--quiet} \index{command-line options!\x{--quiet}}\\
643	Runs PhyML in quiet mode. The program will not pause if the memory required to run the analysis
644	exceeds 256MB and will not output the log-likelihood score to the output.
645
646	\end{itemize}
647
648	\subsection{XML interface}
649	\begin{itemize}
650	\item \x{--xml=xml\_file\_name}\index{command-line options!\x{--xml}} \\
651	\x{xml\_file\_name} is the name of the XML file containing the information required to run the
652	analysis. More details about this type of file is given in the section \ref{sec:xmlio}.
653	\end{itemize}
654
655	\subsection{Parallel bootstrap}\label{sec:parallel_bootstrap}\index{MPI}\index{bootstrap!parallel}
656
657	Bootstrapping is a highly parallelizable task. Indeed, bootstrap replicates are independent from
658	one another. Each bootstrap replicate can then be analysed separately. Modern computers often have
659	more than one CPU. Each CPU can therefore be used to process a bootstrap sample. Using this parallel
660	strategy, performing $R$ bootstrap replicates on $C$ CPUs `costs' the same amount of computation
661	time as processing $R \times C$ bootstrap replicates on a single CPU. In other words, for a given
662	number of replicates, the computation time is divided by $R$ compared to the non-parallel approach.
663
664	PhyML sources must be compiled with specific options to turn on the parallel option (see Section
665	\ref{sec:MPI}). Once the binary file (\x{phyml}) has been generated, running a bootstrap analysis
666	with, say 100 replicates on 2 CPUs, can be done by typing the following command-line:
667	\begin{verbatim}
668	mpd &;
669	mpirun -np 2 ./phyml -i seqfile -b 100;
670	\end{verbatim}
671	The first command launches the mpi daemon while the second launches the analysis. Note that
672	launching the daemon needs to be done only once. The output files are similar to the ones generated
673	using the standard, non-parallel, analysis (see Section \ref{sec:input_output}). Note that running
674	the program in batch mode, i.e.:
675	\begin{verbatim}
676	mpirun -np 2 ./phyml -i seqfile -b 100 &
677	\end{verbatim}
678	will probably NOT work. I do not know how to run a mpi process in batch mode yet. Suggestions welcome...
679	Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs
680	required in the mpirun command.
681
682	\section{Inputs \& outputs for command-line and PHYLIP interface }\label{sec:input_output}
683
684	PhyML reads data from standard text files, without the need for any particular file name extension.
685
686	\subsection{Sequence formats}
687
688	\begin{figure}
689	\begin{small}
690	\begin{Verbatim}[frame=single, label=PHYLIP interleaved, samepage=true, baselinestretch=0.5]
691
692	5 80
693	seq1 CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
694	seq2 CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
695	seq3 RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT
696	seq4 RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
697	seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT
698
699	ATCKGCTTTTGGCAGGAAAT
700	ATCKGCTTTTGGCGGGAAAT
701	AGCKGCTGTTG?????????
702	ATCTGCTTTTGGCGGGAAAT
703	ATCTGCTTTTGGCGGGAAAT
704
705	\end{Verbatim}
706	\begin{Verbatim}[frame=single, label=PHYLIP sequential, samepage=true, baselinestretch=0.5]
707
708	5 40
709	seq1 CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG
710	seq2 CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG
711	seq3 RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX
712	seq4 RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX
713	seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx
714
715	\end{Verbatim}
716	\end{small}
717	\label{fig:align_tree}
718	\caption{\bf PHYLIP interleaved and sequential formats.}
719	\end{figure}
720
721
722
723	\begin{figure}
724	\begin{small}
725	\begin{Verbatim}[frame=single, label=Nexus nucleotides, samepage=true, baselinestretch=0.5]
726
727	[ This is a comment ]
728	#NEXUS
729	BEGIN DATA;
730	DIMENSIONS NTAX=10 NCHAR=20;
731	FORMAT DATATYPE=DNA;
732	MATRIX
733	tax1 ?ATGATTTCCTTAGTAGCGG
734	tax2 CAGGATTTCCTTAGTAGCGG
735	tax3 ?AGGATTTCCTTAGTAGCGG
736	tax4 ?????????????GTAGCGG
737	tax5 CAGGATTTCCTTAGTAGCGG
738	tax6 CAGGATTTCCTTAGTAGCGG
739	tax7 ???GATTTCCTTAGTAGCGG
740	tax8 ????????????????????
741	tax9 ???GGATTTCTTCGTAGCGG
742	tax10 ???????????????AGCGG;
743	END;
744
745	\end{Verbatim}
746	\end{small}
747
748	\begin{small}
749	\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5]
750
751	[ This is a comment ]
752	#NEXUS
753	BEGIN DATA;
754	DIMENSIONS NTAX=10 NCHAR=20;
755	FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3";
756	MATRIX
757	tax1 ?0320333113302302122
758	tax2 10220333113302302122
759	tax3 ?0220333113302302122
760	tax4 ?????????????2302122
761	tax5 10220333113302302122
762	tax6 10220333113302302122
763	tax7 ???20333113302302122
764	tax8 ????????????????????
765	tax9 ???22033313312302122
766	tax10 ???????????????02122;
767	END;
768
769	\end{Verbatim}
770	\end{small}
771
772	\begin{small}
773	\begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5]
774
775	[ This is a comment ]
776	#NEXUS
777	BEGIN DATA;
778	DIMENSIONS NTAX=10 NCHAR=20;
779	FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03";
780	MATRIX
781	tax1 ??00030200030303010103030002030002010202
782	tax2 0100020200030303010103030002030002010202
783	tax3 ??00020200030303010103030002030002010202
784	tax4 ??????????????????????????02030002010202
785	tax5 0100020200030303010103030002030002010202
786	tax6 0100020200030303010103030002030002010202
787	tax7 ??????0200030303010103030002030002010202
788	tax8 ????????????????????????????????????????
789	tax9 ??????0202000303030103030102030002010202
790	tax10 ??????????????????????????????0002010202;
791	END;
792
793	\end{Verbatim}
794	\end{small}
795	\caption{\bf NEXUS formats.}\label{fig:nexus}
796	\end{figure}
797
798
799	Alignments of DNA or protein sequences must be in PHYLIP\index{PHYLIP} or NEXUS
800	\cite{maddison97}\index{NEXUS} sequential\index{sequential} or interleaved\index{interleaved} format
801	(Figures \ref{fig:align_tree} and \ref{fig:nexus}). For PHYLIP formated sequence alignments, the
802	first line of the input file contains the number of species and the number of characters, in free
803	format, separated by blank characters. One slight difference with PHYLIP format deals with sequence
804	name lengths. While PHYLIP format limits this length to ten characters, PhyML can read up to
805	hundred character long sequence names. Blanks and the symbols ``(),:'' are not allowed within
806	sequence names because the Newick tree format makes special use of these symbols. Another slight
807	difference with PHYLIP format is that actual sequences must be separated from their names by at
808	least one blank character.
809
810	A PHYLIP input sequence file may also display more than a single data set. Each of these data sets
811	must be in PHYLIP format and two successive alignments must be separated by an empty
812	line. Processing multiple data sets requires to toggle the `\x{M}' option in the {\em Input Data}
813	sub-menu or use the `\x{-n}' command line option and enter the number of data sets to analyse. The
814	multiple data set option can be used to process re-sampled data that were generated using a
815	non-parametric procedure such as cross-validation or jackknife (a bootstrap option is already
816	included in PhyML). This option is also useful in multiple gene studies, even if fitting the same
817	substitution model to all data sets may not be suitable.
818
819	PhyML can also process alignments in NEXUS format. Although not all the options provided by this
820	format are supported by PhyML, a few specific features are exploited. Of course, this format can
821	handle nucleotide and protein sequence alignments in sequential or interleaved format. It is also
822	possible to use custom alphabets, replacing the standard 4-state and 20-state alphabets for
823	nucleotides and amino-acids respectively. Examples of a 4-state custom alphabet are given in Figure
824	\ref{fig:nexus}. Each state must here correspond to one digit or more. The set of states must be a
825	list of consecutive digits starting from 0. For instance, the list ``0, 1, 3, 4'' is not a valid
826	alphabet. Each state in the symbol list must be separated from the next one by a space. Hence,
827	alphabets with large number of states can be easily defined by using two-digit number (starting with
828	00 up to 19 for a 20 state alphabet). Most importantly, this feature gives the opportunity to
829	analyse data sets made of presence/absence character states (use the \texttt{symbols=``0 1''} option
830	for such data).\index{binary characters} Alignments made of custom-defined states will be processed
831	using the Jukes and Cantor model. Other options of the program (e.g., number of rate classes, tree
832	topology search algorithm) are freely configurable. Note that, at the moment, the maximum number of
833	different states is set to 22 in order to save memory space. It is however possible to lift this
834	threshold by modifiying the value of the variable \x{T\_MAX\_ALPHABET} in the file
835	`\x{utilities.h}'. The program will then have to be re-compiled.
836
837
838
839	\subsubsection{Gaps and ambiguous characters}
840
841	Gaps correspond to the `\x{-}' symbol. They are systematically treated as unknown characters ``on
842	the grounds that we don't know what would be there if something were there'' (J. Felsenstein,
843	PHYLIP main documentation). The likelihood at these sites is summed over all the possible states
844	(i.e., nucleotides or amino acids) that could actually be observed at these particular
845	positions. Note however that columns of the alignment that display only gaps or unknown characters
846	are simply discarded because they do not carry any phylogenetic information (they are equally well
847	explained by any model). PhyML also handles ambiguous characters such as $R$ for $A$ or $G$
848	(purines) and $Y$ for $C$ or $T$ (pyrimidines). Tables \ref{tab:ambigu_nt} and \ref{tab:ambigu_aa}
849	give the list of valid characters/symbols and the corresponding nucleotides or amino acids.
850
851	\begin{table}
852	\begin{center}
853	\begin{tabular}{lr\|lr}
854	\hline
855	Character & Nucleotide & Character & Nucleotide \\
856	\hline
857	$A$ & Adenosine & $Y$ & $C$ or $T$ \\
858	$G$ & Guanine & $K$ & $G$ or $T$ \\
859	$C$ & Cytosine & $B$ & $C$ or $G$ or $T$\\
860	$T$ & Thymine & $D$ & $A$ or $G$ or $T$ \\
861	$U$ & Uracil (=$T$) & $H$ & $A$ or $C$ or $T$ \\
862	$M$ & $A$ or $C$ & $V$ & $A$ or $C$ or $G$ \\
863	$R$ & $A$ or $G$ & $-$ or $N$ or $X$ or $?$ & unknown \\
864	$W$ & $A$ or $T$ & & (=$A$ or $C$ or $G$ or $T$)\\
865	$S$ & $C$ or $G$ & & \\
866	\hline
867	\end{tabular}
868	\end{center}
869	\caption{{\bf List of valid characters in DNA sequences and the corresponding nucleotides.}}\label{tab:ambigu_nt}
870	\end{table}
871	\begin{table}
872	\begin{center}
873	\begin{tabular}{lr\|lr}
874	\hline
875	Character & Amino-Acid & Character & Amino-Acid \\
876	\hline
877	$A$ & Alanine & $L$ & Leucine \\
878	$R$ & Arginine & $K$ & Lysine \\
879	$N$ or $B$& Asparagine & $M$ & Methionine \\
880	$D$ & Aspartic acid & $F$ & Phenylalanine \\
881	$C$ & Cysteine & $P$ & Proline \\
882	$Q$ or $Z$& Glutamine & $S$ & Serine \\
883	$E$ & Glutamic acid & $T$ & Threonine \\
884	$G$ & Glycine & $W$ & Tryptophan \\
885	$H$ & Histidine & $Y$ & Tyrosine \\
886	$I$ & Isoleucine & $V$ & Valine \\
887	$L$ & Leucine & $-$ or $X$ or $?$ & unknown \\
888	$K$ & Lysine & & (can be any amino acid) \\
889	\hline
890	\end{tabular}
891	\end{center}
892	\caption{{\bf List of valid characters in protein sequences and the corresponding amino acids.}}\label{tab:ambigu_aa}
893	\end{table}
894
895	\subsubsection{Specifying outgroup sequences}\label{sec:outgroupspecify}
896
897	PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In
898	order to do so, sequence names that display a `*' character will be automatically considered as
899	belonging to the outgroup.
900
901	The topology of the rooted tree is exactly the same as the unrooted version of the same tree. In
902	other words, PhyML first ignores the distinction between ingroup and outgroup sequences, builds a
903	maximum likelihood unrooted tree and then tries to add the root. If the outgroup has more than one
904	sequence, the position of the root might be ambiguous. In such situation, PhyML tries to identify
905	the most relevant position of the root by considering which edge provides the best separation
906	between ingroup and outgroup taxa (i.e., we are trying to make the outgroup ``as monophyletic as
907	possible'').
908
909	\subsection{Tree format}
910
911	PhyML can read one or several phylogenetic trees from an input file. This option is accessible
912	through the {\em Tree Searching} sub menu or the `\x{-u}' argument from the command line. Input
913	trees are generally used as initial maximum likelihood estimates to be subsequently adjusted by the
914	tree searching algorithm. Trees can be either rooted or unrooted and multifurcations are allowed.
915	Taxa names must, of course, match the corresponding sequence names.
916
917	\begin{figure}[h]
918	\begin{small}
919	\begin{minipage}{\textwidth}
920	\begin{verbatim}
921	((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01);
922	((seq3,seq2),seq1,(seq4,seq5));
923	\end{verbatim}
924	\end{minipage}
925	\end{small}
926	\caption{{\bf Input trees}. The first tree (top) is rooted and has branch lengths. The second tree
927	(bottom) is unrooted and does not have branch lengths.}
928	\label{fig:trees}\index{Newick format}
929	\end{figure}
930
931
932	\subsection{Multiple alignments and trees}\index{multiple data sets}
933
934	Single or multiple sequence data sets may be used in combination with single or multiple input
935	trees. When the number of data sets is one ($n_D = 1$) and there is only one input tree ($n_T = 1$),
936	then this tree is simply used as input for the single data set analysis. When $n_D = 1$ and $n_T >
937	1$, each input tree is used successively for the analysis of the single alignment. PhyML then
938	outputs the tree with the highest likelihood. If $n_D > 1$ and $n_T = 1$, the same input tree is
939	used for the analysis of each data set. The last combination is $n_D > 1$ and $n_T > 1$. In this
940	situation, the $i$-th tree in the input tree file is used to analyse the $i$-th data set. Hence,
941	$n_D$ and $n_T$ must be equal here.
942
943
944	\subsection{Custom amino-acid rate model}\label{sec:customaa}
945
946	The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded
947	in PhyML. This model must be time-reversible. Hence, the matrix of substitution rates is
948	symmetrical. The format of the rate matrix with the associated stationary frequencies is identical
949	to the one used in PAML\index{PAML}. An example is given below:
950
951	\begin{center}
952	{\tiny
953	\begin{tabular}{p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}}
954	% Ala & Arg & Asn & Asp & Cys & Gln & Glu & Gly & His & Ile & Leu & Lys & Met & Phe & Pro & Ser & Thr & Trp & Tyr & Val \\
955	&&&&&&&&&&&&&&&&&&& \\
956	0.55 & &&&&&&&&&&&&&&&&&& \\
957	0.51 & 0.64 & &&&&&&&&&&&&&&&& \\
958	0.74 & 0.15 & 5.43 & &&&&&&&&&&&&&&&& \\
959	1.03 & 0.53 & 0.27 & 0.03 & &&&&&&&&&&&&&&& \\
960	0.91 & 3.04 & 1.54 & 0.62 & 0.10 & &&&&&&&&&&&&&& \\
961	1.58 & 0.44 & 0.95 & 6.17 & 0.02 & 5.47 & &&&&&&&&&&&&& \\
962	1.42 & 0.58 & 1.13 & 0.87 & 0.31 & 0.33 & 0.57 & &&&&&&&&&&&& \\
963	0.32 & 2.14 & 3.96 & 0.93 & 0.25 & 4.29 & 0.57 & 0.25 & &&&&&&&&&&& \\
964	0.19 & 0.19 & 0.55 & 0.04 & 0.17 & 0.11 & 0.13 & 0.03 & 0.14 & &&&&&&&&&& \\
965	0.40 & 0.50 & 0.13 & 0.08 & 0.38 & 0.87 & 0.15 & 0.06 & 0.50 & 3.17 & &&&&&&&&& \\
966	0.91 & 5.35 & 3.01 & 0.48 & 0.07 & 3.89 & 2.58 & 0.37 & 0.89 & 0.32 & 0.26 & &&&&&&&& \\
967	0.89 & 0.68 & 0.20 & 0.10 & 0.39 & 1.55 & 0.32 & 0.17 & 0.40 & 4.26 & 4.85 & 0.93 & &&&&&&& \\
968	0.21 & 0.10 & 0.10 & 0.05 & 0.40 & 0.10 & 0.08 & 0.05 & 0.68 & 1.06 & 2.12 & 0.09 & 1.19 & &&&&&& \\
969	1.44 & 0.68 & 0.20 & 0.42 & 0.11 & 0.93 & 0.68 & 0.24 & 0.70 & 0.10 & 0.42 & 0.56 & 0.17 & 0.16 & &&&&& \\
970	3.37 & 1.22 & 3.97 & 1.07 & 1.41 & 1.03 & 0.70 & 1.34 & 0.74 & 0.32 & 0.34 & 0.97 & 0.49 & 0.55 & 1.61 & &&&& \\
971	2.12 & 0.55 & 2.03 & 0.37 & 0.51 & 0.86 & 0.82 & 0.23 & 0.47 & 1.46 & 0.33 & 1.39 & 1.52 & 0.17 & 0.80 & 4.38 & &&& \\
972	0.11 & 1.16 & 0.07 & 0.13 & 0.72 & 0.22 & 0.16 & 0.34 & 0.26 & 0.21 & 0.67 & 0.14 & 0.52 & 1.53 & 0.14 & 0.52 & 0.11 & && \\
973	0.24 & 0.38 & 1.09 & 0.33 & 0.54 & 0.23 & 0.20 & 0.10 & 3.87 & 0.42 & 0.40 & 0.13 & 0.43 & 6.45 & 0.22 & 0.79 & 0.29 & 2.49 & & \\
974	2.01 & 0.25 & 0.20 & 0.15 & 1.00 & 0.30 & 0.59 & 0.19 & 0.12 & 7.82 & 1.80 & 0.31 & 2.06 & 0.65 & 0.31 & 0.23 & 1.39 & 0.37 & 0.31 & \\
975	\\
976	8.66 & 4.40 & 3.91 & 5.70 & 1.93 & 3.67 & 5.81 & 8.33 & 2.44 & 4.85 & 8.62 & 6.20 & 1.95 & 3.84 & 4.58 & 6.95 & 6.10 & 1.44 & 3.53 & 7.09 \\
977	\end{tabular}
978	}
979	\end{center}
980
981	The entry on the $i$-th row and $j$-th column of this matrix corresponds to the rate of
982	substitutions between amino-acids $i$ and $j$. The last line in the file gives the stationary
983	frequencies and must be separated from the rate matrix by one line. The ordering of the amino-acids
984	is alphabetical, i.e, Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro,
985	Ser, Thr, Trp, Tyr and Val.
986
987
988	\subsection{Topological constraint file}\label{sec:topoconstraints}
989
990	PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In
991	order to do so, one should use the \x{--constraint\_file} \x{file\_name} command-line option where
992	\x{file\_name} lists the topological constraints. Such constraints are straightforward to
993	define. For instance, the following constraints:
994	\vspace{0.2cm}
995	\begin{Verbatim}
996	((A,B,C),(D,E,F));
997	\end{Verbatim}
998	indicate that taxa A, B and C belong to the same clade. D, E and F also belong to the
999	same clade and the two clades hence defined should not overlap. Under these two
1000	constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that
1001	the constraints are defined using a multifurcating tree in NEWICK format. Note that this tree does
1002	not need to display the whole list of taxa. For instance, while the only taxa involved in specifying
1003	topological constraints above are A, B, C, D, E \& F, the actual data set could include more than
1004	these six taxa only.
1005
1006	PhyML tree topology search algorithms all rely on improving a starting tree. By default, BioNJ is
1007	the method of choice for building this tree. However, there is no guarantee that the phylogeny
1008	estimated with PhyML does comply with the topological constraints. While it is probably possible to
1009	implement BioNJ with topological constraints, we have not done so yet. Instead, the same
1010	multifurcating tree that defines the topological constraints should also be used as starting tree
1011	using the \x{-u} (\x{--inputtree}) option. Altogether, the command line should look like the
1012	following: \x{-u}=\x{file\_name} \x{--constraint\_file}=\x{file\_name}. It is not possible to use as
1013	input tree a non-binary phylogeny that is distinct from that provided in the constraint tree
1014	file. However, any binary tree compatible with the constraint one can be used as input tree.
1015
1016
1017	\subsection{Output files}
1018
1019	\begin{table}
1020	Sequence file name~: `{\x seq}'\\
1021	\begin{center}
1022	\begin{tabular}{ll}
1023	\hline
1024	Output file name & Content \\
1025	\hline
1026	\x{seq\_phyml\_tree.txt} & ML tree\\
1027	\x{seq\_phyml\_stats.txt} & ML model parameters\\
1028	\x{seq\_phyml\_boot\_trees.txt} & ML trees -- bootstrap replicates\\
1029	\x{seq\_phyml\_boot\_stats.txt} & ML model parameters -- bootstrap replicates \\
1030	\x{seq\_phyml\_rand\_trees.txt} & ML trees -- multiple random starts\\
1031	\hline
1032	\end{tabular}
1033	\end{center}
1034	\caption{{\bf Standard output files}}\label{tab:output}
1035	\end{table}
1036
1037	Table \ref{tab:output} presents the list of files resulting from an analysis. Basically, each
1038	output file name can be divided into three parts. The first part is the sequence file name, the
1039	second part corresponds to the extension `\x{\_phyml\_}' and the third part is related to the file
1040	content. When launched with the default options, PhyML only generates two files: the tree file and
1041	the model parameter file. The estimated maximum likelihood tree is in standard Newick format (see
1042	Figure \ref{fig:trees}). The model parameters file, or statistics file, displays the maximum
1043	likelihood estimates of the substitution model parameters, the likelihood of the maximum likelihood
1044	phylogenetic model, and other important information concerning the settings of the analysis (e.g.,
1045	type of data, name of the substitution model, starting tree, etc.). Two additional output files are
1046	created if bootstrap supports were evaluated. These files simply contain the maximum likelihood
1047	trees and the substitution model parameters estimated from each bootstrap replicate. Such
1048	information can be used to estimate sampling errors around each parameter of the phylogenetic model.
1049	When the random tree option is turned on, the maximum likelihood trees estimated from each random
1050	starting trees are printed in a separate tree file (see last row of Table \ref{tab:output}).
1051
1052	\subsection{Treatment of invariable sites with fixed branch lengths}
1053
1054	PhyML allows users to give an input tree with fixed topology and branch lengths and find the
1055	proportion of invariable sites that maximise the likelihood (option \x{-o r}). These two options can
1056	be considered as conflicting since branch lengths depend on the proportion of invariants. Hence,
1057	changing the proportion of invariants implies that branch lengths are changing too. More formally,
1058	let $l$ denote the length of a branch, i.e., the expected number of substitutions per site, and $p$
1059	be the proportion of invariants. We have $l = (1-p)l'$, where $l'$ is the expected number of
1060	substitutions at variable sites. When asked to optimize $p$ but leave $l$ unchanged, PhyML
1061	does the following:
1062	\begin{enumerate}
1063	\item Calculate $l' = l/(1-p)$ and leave $l'$ unchanged throughout the optimization.
1064	\item Find the value of the proportion of invariants that maximises the likelihood. Let $p^{*}$ denote this value.
1065	\item Set $l^{} = (1-p^{})l'$ and print out the tree with $l^{*}$ (instead of $l$).
1066	\end{enumerate}
1067
1068	PhyML therefore assumes that the users wants to fix the branch lengths measured at variable
1069	sites only (i.e., $l^{*}$ is fixed). This is the reason why the branch lengths in the input and
1070	output trees do differ despite the use of the the \x{-o r} option. While we believe that this
1071	approach relies on a sound rationale, it is not perfect. In particular, the original transformation
1072	of branch lengths ($l' = l/(1-p)$) relies on a default value for $p$ with is set to 0.2 in
1073	practice. It is difficult to justify the use of this value rather than another one. One suggestion
1074	proposed by Bart Hazes is to avoid fixing the branch lengths altogether and rather estimate the
1075	value of a scaling factor applied to each branch length in the input tree (option
1076	\x{--contrained\_lens}). We agree that this solution probably matches very well most users
1077	expectation, i.e., ``find the best value of $p$ while constraining the ratio of branch lengths to be
1078	that given in the input tree''. Please feel free to send us your suggestions regarding this problem
1079	by posting on forum (\url{http://groups.google.com/group/phyml-forum}).
1080
1081
1082	\section{Inputs \& outputs for the XML interface }\label{sec:xmlio}\index{XML}
1083
1084	\subsection{Mixture models in PhyML}\index{mixture models}\label{sec:mixtures}
1085
1086	PhyML implements a wide range of mixture models. The discrete gamma model \cite{yang94b} is arguably
1087	the most popular of these models in phylogenetics. However, in theory, mixture models are not
1088	restricted to the description of the variation of substitution rates across sites. For instance, if
1089	there are good reasons to believe that the relative rates of substitution between nucleotides vary
1090	along the sequence alignments, it makes sense to use a mixture of GTR models. Consider the case
1091	where substitutions between $A$ and $C$ occur at high rate in some regions of the alignment and low
1092	rate elsewhere, a mixture with two classes, each class having its own GTR rate matrix, would be
1093	suitable. The likelihood at any site of the alignment is then obtained by averaging the likelihoods
1094	obtained for each GTR rate matrix, with the same weight given to each of these matrices.
1095
1096	PhyML implements a generic framework that allows users to define mixtures on substitution rates,
1097	rate matrices and nucleotide or amino-acid equilibrium frequencies. Each class of the mixture model
1098	is built by assembling a substitution rate, a rate matrix\footnote{the rate matrix corresponds here
1099	the symmetrical matrix giving the so-called ``echangeability rates''} and a vector of equilibrium
1100	frequencies. For instance, let $\{R_1,R_2,R_3\}$ be a set of substitution rates, $\{M_1,M_2\}$ a
1101	set of rate matrices and $\{F_1,F_2\}$ a set of vectors of equilibrium frequencies. One could then
1102	define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as
1103	$\mathcal{C}_2 = \{R_2,M_1,F_1\}$, and a third class as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. If
1104	$R_1$, $R_2$ and $R_3$ correspond to slow, medium and fast substitution rates, then this mixture
1105	model allows the fast evolving rates to have their own vector of equilibrium frequencies and rate
1106	matrix, distinct from that found at the medium or slow evolving sites. The likelihood at any given
1107	site $D_s$ of the alignment is then:
1108	\begin{eqnarray}\label{equ:mixtlk}
1109	\Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s \| \mathcal{C}_s=c) \Pr(\mathcal{C}_s=c),
1110	\end{eqnarray}
1111	where $\Pr(\mathcal{C}_s=c)$ is obtained by multiplying the probability (density) of the three
1112	components (i.e., rate, matrix, frequencies). For instance, $\Pr(\mathcal{C}_1=\{R_1,M_1,F_1\}) =
1113	\Pr(R_1)\times \Pr(M_1) \times \Pr(F_1)$.
1114	We therefore assume here that substitution rates, rate
1115	matrices and equilibrium frequencies are independent from one another.
1116
1117	Note that, using the same substitution rates, rate matrices and vector of equilibrium frequencies,
1118	it is possible to construct many other mixture models. For instance, the mixture model with the
1119	largest number of classes can be created by considering all the combinations of these three
1120	components. We would then get a mixture of $3\times 2 \times 2=12$ classes, corresponding to all
1121	the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies.
1122
1123
1124	% : $\mathcal{C}_1 =
1125	% \{R_1,M_1,F_1\}$ $\mathcal{C}_2 = \{R_1,M_1,F_2\}$, $\mathcal{C}_3 = \{R_1,M_2,F_1\}$,
1126	% $\mathcal{C}_4 = \{R_1,M_2,F_2\}$, $\mathcal{C}_5 = \{R_2,M_1,F_1\}$, $\mathcal{C}_6 =
1127	% \{R_2,M_1,F_2\}$, $\mathcal{C}_7 = \{R_2,M_2,F_1\}$, $\mathcal{C}_8 = \{R_2,M_2,F_2\}$,
1128	% $\mathcal{C}_9 = \{R_3,M_1,F_1\}$, $\mathcal{C}_{10} = \{R_3,M_1,F_2\}$, $\mathcal{C}_{11} =
1129	% \{R_3,M_2,F_1\}$ and $\mathcal{C}_{12} = \{R_3,M_2,F_2\}$.
1130
1131
1132	\subsection{Partitions}\index{partitionned analysis}\index{data partitions}
1133
1134	We first introduce some terms of vocabulary that have not been presented before. A partitionned data
1135	set, also referred to as partition, is a set of partition elements. Typically, a partitionned data
1136	set will be made of a set of distinct gene alignments. A partition element will then correspond to
1137	one (or several) of these gene alignments. Note that the biology litterature often uses the term
1138	partition to refer to an element of a partitionned data. We thus use here instead the mathematical
1139	definition of the terms `partition' and `partition element'.
1140
1141	Phylogenetics models usually assume individual columns of an alignment to evolve independently from
1142	one another. Codon-based models (e.g., \cite{yang98,yang00b,yang02,guindon04}) are exceptions to
1143	this rule since the substitution process applies here to triplets of consecutive sites of coding
1144	sequences. The non-independence of the substitution process at the three coding positions (due to
1145	the specificities of the genetic code), can therefore be accounted for. Assuming that sites evolve
1146	independently does not mean that a distinct model is fitted to each site of the alignment.
1147	Estimating the parameters of these models would not make much sense in practice due to the very
1148	limited amount of phylogenetic signal conveyed by individual sites. Site independence means instead
1149	that the columns of the observed alignment were sampled randomly from the same ``population of
1150	columns''. The stochasticity of the substitution process running along the tree is deemed
1151	responsible to the variability of site patterns.
1152
1153	Some parameters of the phylogenetic model are considered to be common to all the sites in the
1154	alignment. The tree topology is typically one such parameter. The transition/transversion ratio is
1155	also generally assumed to be the same for all columns. Other parameters can vary from site to site.
1156	The rate at which substitutions accumulate is one of these parameters. Hence, different sites can
1157	have distinct rates. However, such rates are all ``drawn'' from the same probabilitic distribution
1158	(generally a discrete Gamma density). Hence, while different sites may have distinct rates of
1159	evolution, they all share the same {\em distribution} of rates.
1160
1161	This reasonning also applies on a larger scale. When analysing multiple genes, one can indeed
1162	assume that the same mechanism generated the different site patterns observed for every gene. Here
1163	again, we can assume that all the genes share the same underlying tree topology (commonly refered to
1164	as the ``species tree''). Other parameters of the phylogenetic model, such as branch lengths for
1165	instance, might be shared across genes. However, due to the specificities of the gene evolution
1166	processes, some model parameters need to be adjusted for each gene separately. To sum up, the
1167	phylogenetic analysis of partitionned data requires flexible models with parameters, or distribution
1168	of parameters, shared across several partition elements and other parameters estimated separately
1169	for each element of the partition.
1170
1171	The likelihood of a data set made of the concatenation of $n$ sequence alignments noted $D^{(1)}$,
1172	$D^{(2)}, \ldots, D^{(n)}$ is then obtained as follow:
1173	\begin{eqnarray}
1174	\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \Pr(D^{(i)}) \nonumber\\
1175	&=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s),
1176	\end{eqnarray} where $L_i$ is the number of site columns in partition element $i$. $\Pr(D^{(i)}_s)$
1177	is then obtained using Equation \ref{equ:mixtlk}, i.e., by summing over the different classes of the
1178	mixture model that applies to site $s$ for partition element $i$. Hence, the joint probability of
1179	all the partition elements is here broken down into the product of likelihood of every site for each
1180	partition element. As noted just above, any given component of the mixture model at a given
1181	particular site is shared by the other sites that belong to the same partition element and, for some
1182	of them, by sites in other partition elements (e.g., the same tree topology is shared by all the
1183	sites, throughout all the partition elements).
1184
1185	PhyML implements a wide variety of partition models. The only parameter that is constrained to be
1186	shared by all the partition elements is the tree topology. This constraint makes sense when
1187	considering distantly related taxa, typically inter-species data. For closely related taxa, i.e.,
1188	when analysing intra-species or population-level data, not all the genes might have the same
1189	evolutionary history. Recombination events combined to the incomplete lineage sorting phenomenon
1190	can generate discrepancies between the gene trees and the underlying species tree (see
1191	\cite{degnan09} for a review). The phylogenetic softwares BEST \cite{best}\index{BEST}, STEM
1192	\cite{stem}\index{STEM} and BEAST \cite{startbeast}\index{BEAST} are dedicated to the estimation
1193	of species tree phylogenies from the analysis of multi-gene data and allow gene-tree topologies to
1194	vary across genes.
1195
1196	Aside from the tree topology that is common to all the sites and all the partition elements, other
1197	parameters of the phylogenetic model can be either shared across partition elements or estimated
1198	separately for each of these. When analysing three partition elements, $A$, $B$ and $C$ for
1199	instance, PhyML can fit a model where the same set of branch lengths applies to $A$ and $B$ while
1200	$C$ has its own estimated lengths. The same goes for the substitution model: the same GTR model,
1201	with identical parameter values, can be fitted to $A$ and $C$ and JC69 for instance can be used for
1202	$B$. The sections below give more detailed information on the range of models available and how to
1203	set up the corresponding XML configuration files to implement them.
1204
1205
1206	\subsection{Combining mixture and partitions in PhyML: the theory}
1207
1208	The rationale behind mixture models as implemented in PhyML lies in (1) the definition of suitable
1209	rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly
1210	of these components so as to create the classes of a mixture. The main idea behind partitionned
1211	analysis in PhyML lies in (1) the hypothesis of statistical independance of the different data
1212	partition elements and (2) distinct data partition can share model components such as rate matrices,
1213	equilibrium frequencies or distribution of rates across sites. More formally, the likelihood of a
1214	data set made of $n$ partition elements is written as follows:
1215
1216	\begin{eqnarray*}
1217	\Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s) \\
1218	&=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s\|\mathcal{C}=c) \Pr(\mathcal{C}=c),
1219	\end{eqnarray*}
1220
1221	where $L_i$ is the number of sites in partition element $i$ and $K_i$ is the number of classes in
1222	the mixture model that applies to this same partition element. Each class of a mixture is made of a
1223	rate matrix $M$, a vector of equilibrium frequencies $F$ and a relative rate of substitution
1224	$R$. Branch lengths, $L$ and tree topology $\tau$ are also required for the calculation of the
1225	likelihood. Hence we have:
1226
1227	\begin{eqnarray*}
1228	&& \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) \\
1229	&&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s\|\mathcal{C}=c) \Pr(\mathcal{C}=c) \\
1230	&&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} \Pr(D^{(i)}_s\|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) \mathcal{I}(m,f,r,i)
1231	% &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i}
1232	% \Pr(D^{(i)}_s\|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})
1233	\end{eqnarray*}
1234	where $\mathcal{M}_i$, $\mathcal{F}_i$ and $\mathcal{R}_i$ are the number of rate matrices,
1235	vector of equilibrium frequencies and relative rates that apply to partition element $i$
1236	respectively. $\mathcal{I}(m,f,r,i)$ is an indicator function that takes value 1 if the combination
1237	$M_m$, $F_f$ and $R_r$ is acually defined in the model for this particular partition element
1238	$i$. Its value is 0 otherwise. In the example given in section \ref{sec:mixtures} $\{R_1,R_2,R_3\}$
1239	is the set of substitution rates, $\{M_1,M_2\}$ the set of rate matrices and $\{F_1,F_2\}$ the set of vectors of equilibrium frequencies. We then
1240	define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as
1241	$\mathcal{C}_2 = \{R_2,M_1,F_1\}$ and the third as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. Hence, we
1242	have $\mathcal{I}(1,1,1,i)$, $\mathcal{I}(1,1,2,i)$ and $\mathcal{I}(2,2,3,i)$ equal to one while
1243	the nine other values that this indicator function takes, corresponding to the possible combinations of
1244	two vectors of frequencies, two matrices and three rates, are all zero.
1245
1246	As stated before, our implementation assumes that the different components of a mixture are
1247	independant. In other words, we have $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \Pr(M_m^{(i)}) \times
1248	\Pr(F_f^{(i)}) \times \Pr(R_r^{(i)})$. In practice, the joint probability
1249	$\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)})$ is obtained as follows:
1250	\begin{eqnarray}\label{equ:weights}
1251	\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \frac{\Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})}{
1252	\sum_{m,f,r} \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) \mathcal{I}(m,f,r,i)}
1253	\end{eqnarray}
1254	The probabilities $\Pr(M_m^{(i)})$, $\Pr(F_f^{(i)})$ and $\Pr(R_r^{(i)})$, also called `weights', can be fixed or estimated
1255	from the data.
1256
1257	\subsection{The XML format and its use in PhyML}\label{sec:XML format}
1258
1259	The few paragraphs below are largely inspired from the Wikipedia page that describes the XML format
1260	(\url{http://en.wikipedia.org/wiki/XML}). XML (eXtensible Markup Language) is a markup language that
1261	defines a set of rules for encoding documents in a format that is both human-readable and
1262	machine-readable. An XML document is divided into {\em markup} and {\em content}, which may be
1263	distinguished by the application of simple syntactic rules. Generally, strings that constitute
1264	markup either begin with the character `\x{<}' and end with a `\x{>}'. Strings of characters that
1265	are not markup are content:
1266
1267	\vspace{0.2cm}
1268	\begin{Verbatim}[frame=single, label=XML markup and content example, samepage=true,
1269	baselinestretch=0.5, fontsize=\small]
1270	<markup>
1271	content
1272	</markup>
1273	\end{Verbatim}
1274
1275	A markup construct that begins with `\x{<}' and ends with `\x{>}' is called a {\em tag}. Tags come
1276	in three flavors: (1) start-tags (e.g, \x{<section>}), end-tags (e.g., \x{</section>}) and
1277	empty-element tags (e.g., \x{<line-break />}). A {\em component} either begins with a start-tag and
1278	ends with a matching end-tag or consists only of an empty-element tag. The characters between the
1279	start- and end-tags, if any, are the element's content, and may contain markup, including other
1280	elements, which are called child elements. In the following example, the element \x{img} has two
1281	{\em attributes}, \x{src} and \x{alt}: \x{<img src="madonna.jpg" alt="Foligno Madonna, by
1282	Raphael"/>}. Another example would be \x{<step number="3">Connect A to B.</step>} where the name of
1283	the attribute is ``\x{number}" and the value is ``\x{3}".
1284
1285	In practice, building a mixture model in a XML file readable by PhyML is relatively
1286	straightforward. The first step is to define the different components of each class of the
1287	mixture. Consider for instance that the fitted model will have a Gamma distribution with four
1288	classes plus a proportion of invariants. The rate component of the mixture can then be specified
1289	using the following XML code:
1290
1291	\vspace{0.2cm}
1292	\begin{Verbatim}[frame=single, label=$\Gamma4$+I rates, samepage=true, baselinestretch=0.5,
1293	fontsize=\small, numbers=left]
1294
1295	<siterates id="SiteRates1">
1296	<weights id="Distrib" family="gamma+inv" alpha=".1" \
1297	optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1298	</weights>
1299	<instance id="R1" init.value="1.0"/>
1300	<instance id="R2" init.value="1.0"/>
1301	<instance id="R5" init.value="0.0"/>
1302	<instance id="R3" init.value="1.0"/>
1303	<instance id="R4" init.value="1.0"/>
1304	</siterates>
1305
1306	\end{Verbatim}
1307
1308	In the example above, the \x{<siterates>} component completely defines a model of substitution rate
1309	variation across sites. This component has a particular identity, i.e., a name associated to it
1310	(``\x{SiteRates1}'' here), which is not mandatory. This \x{<siterates>} component has six
1311	sub-components. The first is the \x{<weights>} component, followed by five \x{<instance>}
1312	components. The \x{<weights>} component defines the type of distribution that characterizes the
1313	variation of rates across sites. A discrete Gamma plus invariants is used here. Two parameters
1314	specify this distribution: the gamma shape and the proportion of invariant parameters. Their initial
1315	values are set by using the corresponding attributes and attribute values (\x{alpha="0.1"} and
1316	\x{pinv="0.4"}). Also, PhyML can optimise these parameters so as to maximise the likelihood of the
1317	whole phylogenetic model (\x{optimise.pinv="yes"} and \x{optimise.alpha="yes"}). The following five
1318	\x{<instance>} components define the rate classes themselves. The \x{id} attribute is here
1319	mandatory and must be unique to each class. Note that one of the initial (relative) rate
1320	(\x{init.value} attribute) is set to zero. The corresponding rate class (the third in this example)
1321	will then correspond to the invariant site category.
1322
1323	Having specified the part of the phylogenetic model that describes the variation of rates across
1324	sites, we can now move on to build the rest of the model. The component below defines two
1325	substitution models:
1326
1327	\vspace{0.2cm}
1328	\begin{Verbatim}[frame=single, label=Rate matrices, samepage=true, baselinestretch=0.5,
1329	fontsize=\small, numbers=left]
1330
1331	<ratematrices id="RateMatrices">
1332	<instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/>
1333	<instance id="M2" model="GTR" optimise.rr="yes"/>
1334	</ratematrices>
1335	\end{Verbatim}
1336
1337	This \x{<ratematrices>} component sets out a list of substitution models (HKY85 and GTR here). Here
1338	again, the different elements in this list correspond to the \x{<instance>} sub-components. Each
1339	instance must have a unique \x{id} attribute for a reason that will become obvious shortly. The
1340	remaining attributes and their functions are described in Section \ref{sec:xmlratematrices}.
1341
1342	The next ``ingredient'' in our phylogenetic model are vectors of nucleotide frequencies. The
1343	\x{<equfreqs>} component below specifies two of such vectors:
1344
1345
1346	\vspace{0.2cm}
1347	\begin{Verbatim}[frame=single, label=Equilibrium frequencies, samepage=true, baselinestretch=0.5,
1348	fontsize=\small, numbers=left]
1349
1350	<equfreqs id="EquFreq">
1351	<instance id="F1"/>
1352	<instance id="F2"/>
1353	</equfreqs>
1354
1355	\end{Verbatim}
1356
1357	Now, we need to assemble these three components (rate variation across sites, rate matrices and
1358	vectors of equilibrium frequencies) into a mixture model. The \x{<partitionelem>} component below
1359	defines one such model:
1360
1361	\vspace{0.2cm}
1362	\begin{Verbatim}[frame=single, label=Mixture model, samepage=true, baselinestretch=0.5,
1363	fontsize=\small, numbers=left]
1364
1365	<partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt">
1366	<mixtureelem list="R1, R2, R3, R4, R5"/>
1367	<mixtureelem list="M1, M1, M1, M2, M2"/>
1368	<mixtureelem list="F1, F2, F1, F2, F2"/>
1369	</partitionelem>
1370
1371	\end{Verbatim}
1372
1373	The \x{<partitionelem>} component defines a particular partition element. In this example, the
1374	partition element corresponds to the sequence file called \x{nucleic.txt}, which is an alignment of
1375	nucleotide sequences (see the \x{data.type} attribute value). The \x{<mixtureelem>} are
1376	sub-components of the \x{<partitionelem>} component. Each \x{<mixtureelem>} has a \x{list}
1377	atrribute. Each such \x{list} gives the ID of components that have been defined before. For
1378	instance, the first \x{<mixtureelem>} refers to the five classes of the \x{<siterates>}
1379	component. The ordering of the different term in these list matters a lot since it is directly
1380	related to the elements in each class of the mixture model. Hence, the first element in the
1381	\x{<list>} attribute of the first \x{<mixtureelem>} added to the first element in the \x{<list>}
1382	attribute of the second \x{<mixtureelem>} plus the the first element in \x{<list>} attribute of the
1383	third \x{<mixtureelem>} defines the first class of the mixture model. Therefore, the mixture model
1384	defined above has five classes: $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, $\mathcal{C}_2 =
1385	\{R_2,M_1,F_2\}$, $\mathcal{C}_3 = \{R_3,M_1,F_1\}$, $\mathcal{C}_4 = \{R_4,M_2,F_2\}$ and
1386	$\mathcal{C}_5 = \{R_5,M_2,F_2\}$.
1387
1388	% Going back to the different components of this model, the XML code dealing with the substitution
1389	% rates defines five classes with names {\tt R1} to {\tt R5}. The initial values of these rates are
1390	% set to 1.0, except for {\tt R5}, which is set to 0 and will therefore correspond to the invariable
1391	% site class. The {\tt <weight>} tag that follows indicate that these rates define a $\Gamma 4$+Inv
1392	% model, with initial gamma shape parameter set to 0.1 and initial proportion of invariants set to
1393	% 0.4. These two parameters will be estimated in the analysis ({\tt optimise.alpha} and {\tt
1394	% optimise.pinv} attributes set to {\tt yes}). The two rate matrices have names {\tt M1} and {\tt
1395	% M2}. {\tt M1} corresponds to a HKY85 model, with transition/transversion ratio set to 4.0 and set
1396	% to be optimised in the analysis. {\tt M2} is a GTR model, which parameters are also set to be
1397	% optimised. {\tt F1} and {\tt F2} are two vectors of nucleotide frequencies at equilibrium. These two
1398	% sets of frequencies will therefore be estimated during the analysis.
1399
1400
1401
1402	\subsection{Setting up mixture and partition models in PhyML: the basics}\index{mixture
1403	models}\index{partitionned analysis}\index{data partitions}
1404
1405	Mixture models are particularly relevant to the analysis of partitionned data. Indeed, some features
1406	of evolution are gene-specific (e.g., substitution rates vary across genes). Models that can
1407	accomodate for such variation, as mixture models do, are therefore relevant in this context.
1408	However, other evolutionary features are shared across loci (e.g., genes located in the same genomic
1409	region usually have similar GC contents). As a consequence, some components of mixture models need
1410	to be estimated separately for each partition element while others should be shared by different
1411	partition elements.
1412
1413	Below is a simple example with a partitionned data set made of two elements:
1414
1415	\vspace{0.2cm}
1416	\begin{Verbatim}[frame=single, label=Two sets of branch lengths (one per partition element),
1417	samepage=true, baselinestretch=0.5, fontsize=\small, numbers=left]
1418
1419	<branchlengths id="BranchLens">
1420	<instance id="L1"/>
1421	<instance id="L2"/>
1422	</branchlengths>
1423
1424	<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1425	<mixtureelem list="R1, R2, R3, R4, R5"/>
1426	<mixtureelem list="L1, L1, L1, L1, L1"/>
1427	</partitionelem>
1428
1429	<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
1430	<mixtureelem list="R1, R2, R3, R4, R5"/>
1431	<mixtureelem list="L2, L2, L2, L2, L2"/>
1432	</partitionelem>
1433
1434	\end{Verbatim}
1435
1436	Mixture elements with names \x{R1},$\ldots$, \x{R5} refer to the $\Gamma4+$I model defined
1437	previsouly (see Section \ref{sec:XML format}). The \x{<branchlengths>} XML component defines a
1438	mixture element that had not been introduced before. It defined vectors of branch lengths that
1439	apply to the estimated phylogeny. Two instances of such vectors are defined: \x{L1} and \x{L2}.
1440	When examining the two partition elements (\x{<partitionelem>} component), it appears that \x{L1}
1441	is associated with \x{Part1} while \x{L2} is associated with \x{Part2}. Hence, branch lengths
1442	will be estimated separately for these two partition elements.
1443
1444	Note that a given partition element can only have one {\tt branchlengths} instance associated to
1445	it. For instance, the example given below is not valid:
1446
1447	\vspace{0.2cm}
1448	\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5,
1449	fontsize=\small, numbers=left]
1450
1451	<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1452	<mixtureelem list="R1, R2, R3, R4, R5"/>
1453	<mixtureelem list="L1, L1, L1, L2, L2"/>
1454	</partitionelem>
1455
1456	\end{Verbatim}
1457
1458	In other words, mixture of branch lengths are forbidden. One reason for this restriction is that
1459	mixture of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets
1460	of branch lengths have the same likelihood) \cite{matsen07}. But mostly, combining mixture of branch
1461	lengths with mixture of rates appears like a deadly combination. Consider for instance the following model:
1462
1463	\vspace{0.2cm}
1464	\begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5,
1465	fontsize=\small, numbers=left]
1466
1467	<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1468	<mixtureelem list="R1, R2, R3"/>
1469	<mixtureelem list="L1, L2, L3"/>
1470	</partitionelem>
1471
1472	\end{Verbatim}
1473
1474	It is here impossible to tell apart branch lengths and substitution rates. Such model is strongly
1475	non-identifiable and therefore not relevant.
1476
1477	In the example given above, the same $\Gamma4+$I model (i.e. the same gamma shape parameter and
1478	proportion of invariant ) applies to the two partition elements. It is possible to use two distinct
1479	$\Gamma4+$I models instead using the following XML code:
1480
1481
1482
1483	\vspace{0.2cm}
1484	\begin{Verbatim}[frame=single, label=Two distinct $\Gamma4+$I models, samepage=true,
1485	baselinestretch=0.5, fontsize=\small, numbers=left]
1486
1487	<siterates id="SiteRates1">
1488	<weights id="Distrib1" family="gamma+inv" alpha=".1" \
1489	optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1490	</weights>
1491	<instance id="R1" init.value="1.0"/>
1492	<instance id="R2" init.value="1.0"/>
1493	<instance id="R5" init.value="0.0"/>
1494	<instance id="R3" init.value="1.0"/>
1495	<instance id="R4" init.value="1.0"/>
1496	</siterates>
1497
1498	<siterates id="SiteRates2">
1499	<weights id="Distrib2" family="gamma+inv" alpha=".1" \
1500	optimise.alpha="yes" pinv="0.4" optimise.pinv="yes">
1501	</weights>
1502	<instance id="R6" init.value="1.0"/>
1503	<instance id="R7" init.value="1.0"/>
1504	<instance id="R8" init.value="0.0"/>
1505	<instance id="R9" init.value="1.0"/>
1506	<instance id="R10" init.value="1.0"/>
1507	</siterates>
1508
1509	<partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt">
1510	<mixtureelem list="R1, R2, R3, R4, R5"/>
1511	<mixtureelem list="L1, L1, L1, L1, L1"/>
1512	</partitionelem>
1513
1514	<partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt">
1515	<mixtureelem list="R6, R7, R8, R9, R10"/>
1516	<mixtureelem list="L2, L2, L2, L2, L2"/>
1517	</partitionelem>
1518
1519	\end{Verbatim}
1520
1521	\x{SiteRates1} and \x{SiteRates2} here define two distinct $\Gamma4+$I models. Each of these models apply to
1522	one of the two partition elements (\x{nucleic1.txt} and \x{nucleic2.txt}), allowing them to display
1523	different patterns of rate variation across sites.
1524
1525
1526	\subsection{XML options}
1527	\subsubsection{{\tt phyml} component}\index{XML options!{\tt phyml} component}
1528	Options:
1529	\begin{itemize}
1530	\item \x{output.file="filename"}. The main output files of PhyML analysis will be named
1531	\x{filename\_phyml\_tree} and \x{filename\_phyml\_stats}.
1532	\item \x{bootstrap="nreplicates"}. Run \x{nreplicates} replicates for the non-parametric bootstrap analysis.
1533	\item \x{run.id="idstring"}. PhyML will append the string \x{idstring} to each output file.
1534	\item \x{print.trace="yes\|true\|no\|false"}. PhyML will print the estimated trees (and the
1535	corresponding loglikelihoods) at multiple stages of the estimation process. This option is useful
1536	for monitoring the progress of the analysis when processing large data sets.
1537	\item \x{branch.test="aBayes\|aLRT\|SH\|no"}. Calculate fast branch support using the aBayes method
1538	\cite{anisimova11}, aLRT \cite{anisimova06} or SH \cite{shimodaira99} tests. These branch
1539	statistics are much faster to estimate than the bootrap proportions and usually provide good
1540	estimates of the probabilities that the corresponding edges are correctly inferred (see Anisimova et
1541	al. 2011 for more precision). By default and if no bootstrap analysis is performed, branch supports
1542	are estimated using the aBayes approach.
1543
1544	\end{itemize}
1545	\subsubsection{{\tt topology} component}\index{XML options!{\tt topology} component}
1546	Options:
1547	\begin{itemize}
1548	\item \x{init.tree="bionj"\|"user"\|"random"}. Starting tree. Default is \x{bionj}.
1549	\item \x{file.name="name\_of\_tree\_file"}. In case \x{init.tree="user"}, this
1550	attribute is mandatory. \x{name\_of\_tree\_file} is a
1551	text file containing a tree in NEWICK format.
1552	\item \x{optimise.tree="yes"\|"true"\|"no"\|"false"}. The starting tree topology as defined by
1553	\x{init.tree} is to be optimised (or not) so as to maximise the likelihood function.
1554	\item \x{search="nni"\|"spr"\|"none"}. Tree topology search is conducted using NNI (fast), SPR (a bit
1555	slower but more accurate) or no moves.
1556	\end{itemize}
1557	\vspace{0.2cm}
1558	\begin{Verbatim}[frame=single, label=Example of `topology' component, samepage=true,
1559	baselinestretch=0.5, fontsize=\small, numbers=left]
1560
1561	<topology>
1562	<instance id="T1" init.tree="bionj" optimise.tree="true" \
1563	search="spr"/>
1564	</topology>
1565
1566	\end{Verbatim}
1567
1568	\subsubsection{{\tt ratematrices} component}\index{XML options!{\tt ratematrices} component}\label{sec:xmlratematrices}
1569	Options:
1570	\begin{itemize}
1571	\item \x{model="JC69"\|"K80"\|"F81"\|"F84"\|"HKY85"\|"TN93"\|"GTR"\|"custom"} for nucleotide data. The default is \x{"HKY85"}.\\
1572	\x{model="LG"\|"WAG"\|"JTT"\|"MtREV"\|"Dayhoff"\|"DCMut"\|"RtREV"\|"CpREV"\|"VT"}\\\x{\|"Blosum62"\|"MtMam"\|"MtArt"\|"HIVw"\|"HIVb"\|"customaa"}
1573	for amino-acid sequences. The default is \x{"LG"}.
1574	\item \x{model.code="012345"}. For \x{custom} model applied to nucleotide sequences: set the
1575	string of digits that define a custom substitution model. See Table \ref{tab:modelcode} on page
1576	\pageref{tab:modelcode} for more
1577	information about the model codes.
1578	\item \x{ratematrix.code="filename"}. When used in conjunction with \x{model="customaa"},
1579	\x{filename} is the name of the file that gives the rates of substitution between amino-acids as
1580	well as their frequences at equilibrium using PAML rate matrix format. An example of such file is
1581	provided in {phyml/examples/X1.mat}.
1582	\item \x{optimise.rr="yes"\|"true"\|"no"\|"false"}. For \x{custom} and \x{GTR} nucleotide models only:
1583	optimise the substitution rate model parameters.
1584	\item \x{optimise.tstv="yes"\|"true"\|"no"\|"false"}. For \x{K80}, \x{F84}, \x{HKY85} and \x{TN93}
1585	models only: optimise the transition/transversion rate ratio.
1586	\item \x{tstv="value"}. For \x{K80}, \x{HKY85} and \x{TN93} models only: set the transition/transversion to a
1587	given value.
1588	\end{itemize}
1589
1590	The {\tt ratematrices} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt
1591	no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each matrix in the
1592	set of matrices defined by this component (see Equation \ref{equ:weights}), will be estimated from the data.
1593
1594	\vspace{0.2cm}
1595	\begin{Verbatim}[frame=single, label=Example of `ratematrices' component, samepage=true,
1596	baselinestretch=0.5, fontsize=\small, numbers=left]
1597
1598	<ratematrices id="RM1" optimise.weights="yes">
1599	<instance id="M1" model="custom" model.code="000000"/>
1600	<instance id="M2" model="GTR" optimise.rr="yes"/>
1601	<instance id="M3" model="WAG"/>
1602	</ratematrices>
1603
1604	\end{Verbatim}
1605
1606	\subsubsection{{\tt equfreqs} component}\index{XML options!{\tt equfreqs} component}
1607	Options:
1608	\begin{itemize}
1609	\item \x{base.freqs="a,b,c,d"} where \x{a-d} are nucleotide frequencies. Make sure that these
1610	frequencies are separated by comas and no space character is inserted.
1611	\item \x{aa.freqs="empirical\|model"}. Amino-acid frequencies are derived from counting the number of
1612	occurence of each amino-acid in the alignment (\x{aa.freqs="empirical"}) or given by the
1613	substitution model (\x{aa.freqs="model"}).
1614	\item \x{optimise.freqs="true\|yes\|false\|no"}. Nucleotide frequencies can be optimised so as to maximise
1615	the likelihood (\x{optimise.freqs="yes\|true"}).
1616	\end{itemize}
1617
1618	The {\tt equfreqs} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt
1619	no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each vector of
1620	equilibrium frequencies in the
1621	set of vectors defined by this component (see Equation \ref{equ:weights}), will be estimated from the data.
1622
1623	\vspace{0.2cm}
1624	\begin{Verbatim}[frame=single, label=Example of `equfreqs' component, samepage=true,
1625	baselinestretch=0.5, fontsize=\small, numbers=left]
1626
1627	<equfreqs id="EF1" optimise.weights="yes">
1628	<instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/>
1629	<instance id="F2" aa.freqs="empirical"/>
1630	<instance id="F3" optimise.freq="yes"/>
1631	</equfreqs>
1632
1633	\end{Verbatim}
1634
1635	\subsubsection{{\tt branchlengths} component}\index{XML options!{\tt branchlengths} component}
1636	Options:
1637	\begin{itemize}
1638	\item \x{optimise.lens="yes"\|"true"\|"no"\|"false"}: branch lengths are optimised or not. The default
1639	is set to \x{"yes"}.
1640	\end{itemize}
1641	\vspace{0.2cm}
1642	\begin{Verbatim}[frame=single, label=Example of `branchlengths' component, baselinestretch=0.5,
1643	fontsize=\small, numbers=left]
1644
1645	<branchlengths id="BL1">
1646	<instance id="L1" optimise.lens="yes"/>
1647	<instance id="L2"/>
1648	<instance id="L3" optimise.lens="false"/>
1649	</branchlengths>
1650
1651	\end{Verbatim}
1652
1653	\subsubsection{{\tt siterates} component}\index{XML options!{\tt siterates} component}
1654	Options:
1655	\begin{itemize}
1656	\item \x{value="val"}, where \x{"val"} is the relative substitution rate for the corresponding class.
1657	\end{itemize}
1658	A \x{siterate} component generally includes a \x{weights} element that specifies the probabilitic
1659	distribution of the relative rates. The available options for such element are:
1660	\begin{itemize}
1661	\item \x{family="gamma\|gamma+inv\|freerates"}. \x{gamma} indicates that the distribution of the
1662	relative rates is set to be a discrete Gamma density. \x{gamma+inv} indicates that the relative rate model
1663	is a mixture of Gamma and invariant sites (this is the common $\Gamma+$I model). FreeRate is
1664	a model that does not use any parametric function to describe the distribution of the relative
1665	rates (see \cite{soubrier12}). Under this option, relative rates and the corresponding frequencies of these classes are
1666	directly estimated from the data. While such approach is slightly more computationally demanding
1667	than the $\Gamma$ (or $\Gamma$+I) model, it often provides a significantly better fit to the data.
1668	\item \x{alpha="value\|optimised"}, where \x{value} is a real positive number. Use this option to set
1669	the gamma shape parameter to the selected value. \x{optimised}: the parameter is estimated from
1670	the data (see also next option).
1671	\item \x{optimise.alpha="yes\|true\|no\|false"}. Optimise the shape of the Gamma distribution of
1672	relative rates (or not).
1673	\item \x{pinv="value\|optimised"}, where \x{value} is in $[0,1]$. Use this option to set
1674	the proportion of invariants to the selected value. \x{optimised}: the parameter is estimated from
1675	the data (see also next option).
1676	\item \x{optimise.pinv="yes\|true\|no\|false"}. Optimise the proportion of invariable sites (or not).
1677	\item \x{optimise.freerates="yes\|true\|no\|false"}. Optimise the parameters of the FreeRate model,
1678	i.e., the relative rates and the corresponding frequencies.
1679	\end{itemize}
1680	\vspace{0.2cm}
1681	\begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true,
1682	baselinestretch=0.5, fontsize=\small, numbers=left]
1683
1684	<siterates id="SR1">
1685	<instance id="R1" init.value="1.0"/>
1686	<instance id="R2" init.value="1.0"/>
1687	<instance id="R3" init.value="1.0"/>
1688	<instance id="R4" init.value="1.0"/>
1689	<weights id="D1" family="gamma" optimise.alpha="yes" \
1690	optimise.pinv="no">
1691	</weights>
1692	</siterates>
1693	\end{Verbatim}
1694
1695
1696
1697	\subsubsection{{\tt partitionelem} and {\tt mixtureelem} components}\index{XML options!{\tt partitionelem}
1698	component}\index{XML options!{\tt mixtureelem} component}
1699
1700	Options:
1701	\begin{itemize}
1702	\item \x{file.name="inputfilename"}, where \x{inputfilename} is the name of the input sequence file
1703	(in PHYLIP format) to be analysed.
1704	\item \x{data.type="nt\|aa"}. Specify the type of sequences to be processed (nucleotide of amino-acid sequences).
1705	\item \x{interleaved="yes\|true\|no\|false"}. Interleaved (\x{yes\|true}) or sequential format
1706	(\x{no\|false}) for the sequence alignment.
1707	\end{itemize}
1708
1709	Each \x{partitionelem} element should include exactly four \x{mixtureelem} elements, corresponding to
1710	branch lengths, equilibrium frequencies, substitution rate model and tree topology. The ordering of
1711	in which the \x{mixtureelem} elements are given does not matter, though exceptions apply for the
1712	$\Gamma+I$ model (see below). The $n$-th element in the \x{list}
1713	attribute of each \x{mixtureelem} defines the $n$-th class of the mixture model. In the example given
1714	below, the first class of the mixture is made of the following elements: \x{T1}, \x{F1}, \x{R1} and
1715	\x{L1}, the second class is made of \x{T1}, \x{F1}, \x{R2} and \x{L1}, etc.
1716
1717
1718	\vspace{0.2cm}
1719	\begin{Verbatim}[frame=single, label=Example of `partitionelem' component, samepage=true,
1720	baselinestretch=0.5, fontsize=\small, numbers=left]
1721
1722	<partitionelem id="partition1" file.name="./small_p1.nxs" \
1723	data.type="nt" interleaved="yes">
1724	<mixtureelem list="T1, T1, T1, T1"/>
1725	<mixtureelem list="F1, F1, F1, F1"/>
1726	<mixtureelem list="R1, R2, R3, R4"/>
1727	<mixtureelem list="L1, L1, L1, L1"/>
1728	</partitionelem>
1729
1730	\end{Verbatim}
1731
1732	In general, the ordering of the \x{mixtureelem} elements does not matter. However, when the model
1733	has invariable sites, then the corresponding class should be first in the list of classes provided
1734	by \x{mixtureelem}. For instance, in the example above, if the rates are defined as follows:
1735
1736	\vspace{0.2cm}
1737	\begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true,
1738	baselinestretch=0.5, fontsize=\small, numbers=left]
1739
1740	<siterates id="SR1">
1741	<instance id="R1" init.value="0.0"/>
1742	<instance id="R2" init.value="1.0"/>
1743	<instance id="R3" init.value="1.0"/>
1744	<instance id="R4" init.value="1.0"/>
1745	<weights id="D1" family="gamma+inv" optimise.alpha="yes" \
1746	optimise.pinv="no">
1747	</weights>
1748	</siterates>
1749	\end{Verbatim}
1750
1751	then \x{R1} corresponds to the invariable rate class (as \x{init.value="0.0"}). As \x{R1} is first
1752	in the \x{mixtureelem} (see line 6 in the example of \x{`partionelem'} given above), PhyML will
1753	print out an explicit error message and bail out. One way to avoid this shortcoming is to define
1754	\x{mixtureelem} as \x{R4, R2, R3, R1} instead.
1755
1756
1757	\subsection{A simple example: GTR + $\Gamma$4 + I}
1758
1759	The example below provides all the required options to fit a $\Gamma$4+I model to a single alignment
1760	of nucleotide sequences under the GTR model of substitution using a SPR search for the best tree
1761	topology. The \x{phyml} component sets the name for the analysis to \x{simple.example}, meaning that
1762	each output file will display this particular string of characters. Also, the tree and statistics
1763	file names will begin with \x{p1.output}. The tree topology will be estimated so as to maximise the
1764	likelihood and the topology search algorithm used here is SPR, as indicated by the value of the
1765	corresponding attribute (i.e., \x{search="spr"}). Only one vector of branch lengths will be used
1766	here since only one partition element will be processed. Hence, the \x{<branchlengths>} component
1767	only has one \x{<instance>} sub-component. Also, a single GTR model will apply to all the classes
1768	for the mixture model -- the \x{<ratematrices>} component has only one \x{<instance>} sub-component,
1769	corresponding to this particular substitution model. The next component, \x{<equfreqs>}, indicates
1770	that a single vector of equilibrium frequencies will apply here. Next, the \x{<siterates>} component
1771	has five \x{<instance>} sub-components. Four of these correspond to the non-zero relative rates of
1772	evolution a defined by a discrete Gamma distribution. The last one (\x{<instance id="R5"
1773	value="0.0"/>}) defines the class of the mixture corresponding to invariable sites. The
1774	\x{<weight>} component indicates that a $\Gamma+$I model will be fitted here. The shape parameter of
1775	the Gamma distribution and the proportion of invariants will be estimated from the data. The
1776	\x{<partitionelem>} gives information about the sequence alignment (the corresponding file name, the
1777	type of data and the alignment format). The \x{<mixtureelem>} components next define the mixture
1778	model. Each class of the fitted model corresponds to one column, with the first column made of the
1779	following elements: \x{T1, M1, F1, R1} and \x{L1}. The second class of the mixture is made of \x{T1,
1780	M1, F1, R2, L1} and so forth.
1781
1782	\vspace{0.2cm}
1783	\begin{Verbatim}[frame=single, label=Simple PhyML XML example, samepage=true, baselinestretch=0.5,
1784	fontsize=\small, numbers=left]
1785
1786	<phyml runid="simple.example" output.file="p1.output">
1787
1788	<topology>
1789	<instance id="T1" init.tree="bionj" optimise.tree="yes" \
1790	search="spr"/>
1791	</topology>
1792
1793	<branchlengths id="BL1">
1794	<instance id="L1" optimise.lens="yes"/>
1795	</branchlengths>
1796
1797	<ratematrices id="RM1">
1798	<instance id="M1" model="GTR"/>
1799	</ratematrices>
1800
1801	<equfreqs id="EF1">
1802	<instance id="F1"/>
1803	</equfreqs>
1804
1805	<siterates id="SR1">
1806	<instance id="R1" value="1.0"/>
1807	<instance id="R2" value="1.0"/>
1808	<instance id="R3" value="1.0"/>
1809	<instance id="R4" value="1.0"/>
1810	<instance id="R5" value="0.0"/>
1811	<weights id="D1" family="gamma+inv" optimise.alpha="yes" \
1812	optimise.pinv="yes">
1813	</weights>
1814	</siterates>
1815
1816	<partitionelem id="partition_elem1" file.name=\
1817	"./p1.seq" data.type="nt" interleaved="yes">
1818	<mixtureelem list="T1, T1, T1, T1, T1"/>
1819	<mixtureelem list="M1, M1, M1, M1, M1"/>
1820	<mixtureelem list="F1, F1, F1, F1, F1"/>
1821	<mixtureelem list="R1, R2, R3, R4, R5"/>
1822	<mixtureelem list="L1, L1, L1, L1, L1"/>
1823	</partitionelem>
1824
1825	</phyml>
1826
1827	\end{Verbatim}
1828
1829
1830	\subsection{A second example: LG4X}\index{lg4x}
1831
1832	The example below shows how to fit the LG4X model \cite{lg4x} to a given alignment of amino-acid
1833	sequences (file \x{M587.nex.Phy}). LG4X is a mixture model with four classes. Each class has its own
1834	rate and corresponding frequencies (hence the use of the FreeRate model below, see the
1835	\x{<siterates>} component). In the particular example given here, the rate values and frequencies
1836	are set by the users. These parameters will then be optimized by PhyML (\x{optimise.freerates="yes"}).
1837	Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by
1838	the user (Note that these matrices can be downloaded from the following web address:
1839	\url{http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt}. They are also
1840	provided in the PhyML package \x{example/lg4x/} directory.)
1841
1842	\vspace{0.2cm}
1843	\begin{Verbatim}[frame=single, label=LG4X, samepage=true, baselinestretch=0.5,
1844	fontsize=\small, numbers=left]
1845	<phyml run.id="lg4x" output.file="M587.tests" branch.test="no">
1846
1847	<!-- Tree topology: start with BioNJ and then SPRs -->
1848	<topology>
1849	<instance id="T1" init.tree="user" file.name="user_tree.txt" \
1850	search="spr" optimise.tree="no"/>
1851	</topology>
1852
1853
1854	<!-- Four rate matrices, read from files -->
1855	<ratematrices id="RM1">
1856	<instance id="M1" model="customaa" ratematrix.file="X1.mat"/>
1857	<instance id="M2" model="customaa" ratematrix.file="X2.mat"/>
1858	<instance id="M3" model="customaa" ratematrix.file="X3.mat"/>
1859	<instance id="M4" model="customaa" ratematrix.file="X4.mat"/>
1860	</ratematrices>
1861
1862	<!-- Freerate model of variation of rates across sites -->
1863	<siterates id="SR1">
1864	<instance id="R1" init.value="0.197063"/>
1865	<instance id="R2" init.value="0.750275"/>
1866	<instance id="R3" init.value="1.951569"/>
1867	<instance id="R4" init.value="5.161586"/>
1868	<weights id="D1" family="freerates" optimise.freerates="yes">
1869	<instance appliesto="R1" value="0.422481"/>
1870	<instance appliesto="R2" value="0.336848"/>
1871	<instance appliesto="R3" value="0.180132"/>
1872	<instance appliesto="R4" value="0.060539"/>
1873	</weights>
1874	</siterates>
1875
1876	<!-- Amino-acid equilibrium freqs. are given by the models -->
1877	<equfreqs id="EF1">
1878	<instance id="F1" aa.freqs="model"/>
1879	<instance id="F2" aa.freqs="model"/>
1880	<instance id="F3" aa.freqs="model"/>
1881	<instance id="F4" aa.freqs="model"/>
1882	</equfreqs>
1883
1884
1885	<!-- One vector of branch lengths -->
1886	<branchlengths id="BL1" >
1887	<instance id="L1" optimise.lens="yes"/>
1888	</branchlengths>
1889
1890
1891	<!-- Mixture model assemblage -->
1892	<partitionelem id="partition1" file.name="M587.nex.Phy" \
1893	data.type="aa" interleaved="yes">
1894	<mixtureelem list="T1, T1, T1, T1"/>
1895	<mixtureelem list="M1, M2, M3, M4"/>
1896	<mixtureelem list="F1, F2, F3, F4"/>
1897	<mixtureelem list="R1, R2, R3, R4"/>
1898	<mixtureelem list="L1, L1, L1, L1"/>
1899	</partitionelem>
1900
1901	</phyml>
1902	\end{Verbatim}
1903
1904	In order to fit the LG4X model to the \x{proteic} sequence file provided in the \x{examples/}
1905	directory, simply type \x{./phyml --xml=../examples/lg4x/lg4x.xml} (assuming the PhyML binary is installed
1906	in the \x{src/} directory). You can of course slightly tweak the file \x{../examples/lg4x/lg4x.xml}
1907	and use it as a template to fit this model to another data set.
1908
1909
1910	\subsection{An example with multiple partition elements}
1911
1912	The example below gives the complete XML file to specify the analysis of three partition elements,
1913	corresponding to the nucleotide sequence files \x{small\_p1\_pos1.seq}, \x{small\_p1\_pos2.seq} and
1914	\x{small\_p1\_pos3.seq} in interleaved PHYLIP format. \x{small\_p1\_pos1.seq} is fitted with the
1915	HKY85 model of substitution (with the transition/transversion ratio being estimated from the data),
1916	combined to a $\Gamma4$ model of rate variation across sites (with the gamma shape parameter being
1917	estimated from the data). \x{small\_p1\_pos2.seq} is fitted to a custom substitution model with the
1918	constraint $A\leftrightarrow G$=$C\leftrightarrow T$. The nucleotide frequencies are set to
1919	$\frac{1}{4}$ here. The model does not allow substitution rates to vary across
1920	sites. \x{small\_p1\_pos3.seq} is fitted using a GTR model conbined to a $\Gamma4+$I model of rate
1921	variation across sites. Note that the equilibrium nucleotide frequencies for the fourth and fifth
1922	class of the mixture are set to be equal to that estimated from the first partition element (i.e.,
1923	\x{F1}) . The initial phylogeny is built using BioNJ and the tree topology is to be estimated using
1924	a NNI search algorithm.
1925
1926	\vspace{0.2cm}
1927	\begin{Verbatim}[frame=single, label=Example of PhyML XML file, samepage=true, baselinestretch=0.5,
1928	fontsize=\small, numbers=left]
1929
1930	<phyml runid="nnisearch" output.file="small_p1_output">
1931
1932	<topology>
1933	<instance id="T1" init.tree="bionj" optimise.tree="yes" \
1934	search="nni"/>
1935	</topology>
1936
1937	<branchlengths id="BL1">
1938	<instance id="L1" optimise.lens="yes"/>
1939	<instance id="L2"/>
1940	<instance id="L3"/>
1941	</branchlengths>
1942
1943	<ratematrices id="RM1">
1944	<instance id="M1" model="HKY85" optimise.tstv="yes"/>
1945	<instance id="M2" model="custom" model.code="102304" \
1946	optimise.rr="yes"/>
1947	<instance id="M3" model="GTR"/>
1948	</ratematrices>
1949
1950	<equfreqs id="EF1">
1951	<instance id="F1"/>
1952	<instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/>
1953	<instance id="F3"/>
1954	</equfreqs>
1955
1956	<siterates id="SR1">
1957	<instance id="R1" value="1.0"/>
1958	<instance id="R2" value="1.0"/>
1959	<instance id="R3" value="1.0"/>
1960	<instance id="R4" value="1.0"/>
1961	<weights id="D1" family="gamma" optimise.alpha="yes" \
1962	optimise.pinv="no">
1963	</weights>
1964	</siterates>
1965
1966	<siterates id="SR2">
1967	<instance id="R8" value="1.0"/>
1968	<weights id="D2" family="gamma" optimise.alpha="yes" \
1969	optimise.pinv="yes">
1970	</weights>
1971	</siterates>
1972
1973	<siterates id="SR3">
1974	<instance id="R10" value="1.0"/>
1975	<instance id="R11" value="1.0"/>
1976	<instance id="R12" value="1.0"/>
1977	<instance id="R13" value="1.0"/>
1978	<instance id="R14" value="1.0"/>
1979	<weights id="D3" family="gamma" optimise.alpha="yes" \
1980	optimise.pinv="yes">
1981	</weights>
1982	</siterates>
1983
1984	\end{Verbatim}
1985
1986	\vspace{0.2cm}
1987	\begin{Verbatim}[frame=single, label=Example of PhyML XML file (ctnd), samepage=true,
1988	baselinestretch=0.5, fontsize=\small, numbers=left]
1989
1990	<partitionelem id="partition_elem1" file.name=\
1991	"./small_p1_pos1.seq" data.type="nt" interleaved="yes">
1992	<mixtureelem list="T1, T1, T1, T1"/>
1993	<mixtureelem list="M1, M1, M1, M1"/>
1994	<mixtureelem list="F1, F1, F1, F1"/>
1995	<mixtureelem list="R1, R2, R3, R4"/>
1996	<mixtureelem list="L1, L1, L1, L1"/>
1997	</partitionelem>
1998
1999	<partitionelem id="partition_elem2" file.name=\
2000	"./small_p1_pos2.seq" data.type="nt" interleaved="yes">
2001	<mixtureelem list="T1"/>
2002	<mixtureelem list="M2"/>
2003	<mixtureelem list="R8"/>
2004	<mixtureelem list="F2"/>
2005	<mixtureelem list="L2"/>
2006	</partitionelem>
2007
2008	<partitionelem id="partition_elem3" file.name=\
2009	"./small_p1_pos3.seq" data.type="nt" interleaved="yes">
2010	<mixtureelem list="T1, T1, T1, T1, T1"/>
2011	<mixtureelem list="M3, M3, M3, M3, M3"/>
2012	<mixtureelem list="R10, R11, R12, R13, R14"/>
2013	<mixtureelem list="F3, F3, F3, F1, F1"/>
2014	<mixtureelem list="L3, L3, L3, L3, L3"/>
2015	</partitionelem>
2016
2017	</phyml>
2018
2019	\end{Verbatim}
2020
2021
2022	\subsection{Branch lengths with invariants and partionned data}
2023
2024	Accommodating for models with invariable sites applying to some elements of a partitioned data, with
2025	these elements sharing the same set of edge lengths can lead to inconsistencies. Consider for
2026	instance a partitioned data set with two elements. Assume that these two elements share the same
2027	set of edge lengths. Also, consider that GTR+I applies to the first element and HKY applies to the
2028	second. Now, the expected number of substitutions per site for the first element of the partition is
2029	equal to $(1-p)l$, where $p$ is the estimated proportion of invariants and $l$ is the
2030	maximum-likelihood estimate for the length of that specific edge. For the second element of the
2031	partition, the expected number of substitutions per site is equal to $l$, rather than $(1-p)l$.
2032	While $l$ are common to the two elements, matching the specification of the input model, the actual
2033	edge lengths do differ across the two partition elements. Please be aware that, due to the
2034	programming structure implemented in PhyML, the program will only return one value here, which will
2035	be equal to $(1-p)l$.
2036
2037	\section{Citing PhyML}
2038	The ``default citation'' for PhyML is:
2039	\begin{itemize}
2040	\item
2041	``New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance
2042	of PhyML 3.0''. Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010, {\it Systematic
2043	Biology}, 59(3):307-321
2044
2045	\end{itemize}
2046	The ``historic citation'' for PhyML is:
2047	\begin{itemize}
2048	\item ``A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood''
2049	Guindon S., Gascuel O. 2003, {\it Systematic Biology}, 52(5):696-704
2050	\end{itemize}
2051
2052
2053
2054	\section{Other programs in the PhyML package}
2055
2056	PhyML is software package that provides tools to tackle problems other than estimating maximum
2057	likelihood phylogenies. Installing these tools and processing data sets is explained is the
2058	following sections.
2059
2060	\subsection{PhyTime}\index{PhyTime} PhyTime is a program that estimates node ages and substitution
2061	rates using a Bayesian approach. The performance and main features of this software are described
2062	in two article (see Section \ref{sec:citephytime}).
2063
2064	It relies on a Gibbs sampler which outperforms the
2065	``standard'' Metropolis-Hastings algorithm implemented in a number of phylogenetic softwares. The
2066	details and performance of this approach are described in the following article:
2067
2068	\subsubsection{Installing PhyTime}
2069
2070	Compiling PhyTime is straightforward on Unix-like machines (i.e., linux and MacOS systems). PhyTime
2071	is not readily available for Windows machines but compilation should be easy on this system too. In
2072	the `phyml' directory, where the `src/' and `doc/' directories stand, enter the following commands:
2073	{\setlength{\baselineskip}{0.5\baselineskip}
2074	\begin{verbatim}
2075	./configure --enable-phytime;
2076	make clean;
2077	make;
2078	\end{verbatim} } This set of commands generates a binary file called \x{phytime} which can be found
2079	in the `src/' directory.
2080
2081	\subsubsection{Running PhyTime} Passing options and running PhyTime on your data set is quite
2082	similar to running PhyML in commmand-line mode. The main differences between the two programs are
2083	explained below:
2084	\begin{itemize}
2085	\item PhyTime takes as mandatory input a {\em rooted} phylogenetic tree. Hence, the `\x{-u}' option
2086	must be used. Also, unlike PhyML, PhyTime does not modify the tree topology. Hence, the options that
2087	go with the `\x{-s}' command do not alter the input tree topology.
2088	\item PhyTime needs an input file giving information about calibration nodes. The command
2089	`\x{--calibration=}' followed by the name of the file containing the calibration node information is
2090	mandatory. The content of that file should look as follows:
2091
2092	\begin{figure}[h]
2093	\begin{small}
2094	\begin{Verbatim}[frame=single, label=Calibration node file, samepage=true, baselinestretch=0.5, fontsize=\tiny]
2095	Dugong_dugon Procavia_capensis Elephantidae \| -65 -54
2096	Equus_sp. Ceratomorpha \| -58 -54
2097	Cercopithecus_solatus Macaca_mulatta Hylobates_lar Homo_sapiens \| -35 -25
2098	Lepus_crawshayi Oryctolagus_cuniculus Ochotona_princeps \| -90 -37
2099	Marmota_monax Aplodontia_rufa \| -120 -37
2100	Dryomys_nitedula Glis_glis \| -120 -28.5
2101	@root@ \| -100 -120
2102	\end{Verbatim}
2103	\end{small}
2104	\end{figure}
2105
2106	Every row in this file lists a set of taxa that belong to the same subtree (i.e., a clade). This
2107	list of taxa is followed by the character `\x{\|}' and two real numbers corresponding to the lower
2108	and upper bounds of the calibration interval for the node at the root of the clade. In the example
2109	given here, the clade grouping the three taxa ``Dugong\_dugon'', ``Procavia\_capensis'' and
2110	``Elephantida'' has -65 as lower bound and -54 as upper bound. Node ages (or node heights) are
2111	relative to the most recent tip node in the phylogeny, which age is set to 0. It is also possible to
2112	define a clade using only two taxon names. PhyTime will then search for the most recent common
2113	ancestor of these two taxa in the user-defined phylogeny and assign time boundaries to the
2114	corresponding node. For serially-sampled data, the calibration nodes correspond to tips in the tree.
2115	A calibration file will then look as follows:
2116	\begin{figure}[h]
2117	\begin{small}
2118	\begin{Verbatim}[frame=single, label=Calibration node file (serially-sampled sequences), samepage=true, baselinestretch=0.5]
2119	taxaA \| -65 -65
2120	taxaB \| -65 -65
2121	taxaC \| -20 -20
2122	taxaD \| -30 -30
2123	taxaE \| -30 -30
2124	taxaF \| -60 -60
2125	taxaG \| -61 -51
2126	@root@ \| -100 -120
2127	\end{Verbatim}
2128	\end{small}
2129	\end{figure}
2130
2131
2132	% The maximum value the average substitution rate along a branch of a phylogeny can take is set to
2133	% $10^{-2}$ substitution per site per time unit. It is therefore important that the time unit used to
2134	% specify the calibration intervals are realistic with respect to this upper bound. For instance, it
2135	% one time unit corresponds to one month, one assumes that the maximum (average) substitution rate is
2136	% $10^{-3}$/site/month, which is very high (substitution rate in introns is close to 3/site/$10^9$
2137	% years). Consider that in the example give above, one time unit is $10^6$ years. Hence, the maximum
2138	% value the substitution rate can take is $10^{-2}$/site/$10^6$ years, i.e., $10$/site/$10^9$ years.
2139	% Multiplying the calibration values by 10 amounts to considering that the time unit is $10^5$ years,
2140	% implying a maximum substitution rate equal to $10^{-2}$/site/$10^5$ years, i.e., $100$/site/$10^9$
2141	% years.
2142
2143	Note that the node corresponding to the root of the whole tree has a specific label:
2144	`\x{@root@}'. {\color{red}{It is important to specify upper and lower bounds for the root node in
2145	order to ensure convergence of the Gibbs sampler. If the prior interval for the root height is not
2146	specified, the upper bound will be set to the upper bound of the oldest calibration node and the
2147	lower bound will be set to twice this age.}} As a consequence, leaving the prior on root height
2148	interval unspecified may produce inaccurate estimates of node ages, especially if there are only few
2149	otherwise calibration nodes available.
2150
2151	A notable exception to this rule comes from the analysis of serial sample \index{serial sample}
2152	data, i.e., alignments in which sequences were not sampled at the same time point. For such data,
2153	the estimated number of substitutions accumulated between successive time points is used to estimate
2154	the substitution rate averaged over lineages. Because the time of collection of the sequences is
2155	generally known without ambiguity, this extra piece of data is translated into very informative
2156	calibration intervals for the tip nodes (i.e., calibration interval of zero width), which in turn
2157	results in substitution rate estimates with descreased variances. Posterior distribution of
2158	substitution rates with small variances then allows one to get good estimates of the root age.
2159
2160	\end{itemize}
2161
2162	A typical PhyTime command-line should look like the following\index{command-line options!\x{--calibration}}:
2163
2164	\begin{Verbatim}[fontsize=\small]
2165	./phytime -i seqname -u treename --calibration=calibration_file -m GTR -c 8
2166	\end{Verbatim}
2167
2168	Assuming the file `\x{seqname}' contains DNA sequences in PHYLIP or NEXUS format, `\x{treename}' is
2169	the rooted input tree in Newick format and `\x{calibration\_file}' is a set of calibration nodes,
2170	PhyTime will estimate the posterior distribution of node times and substitution rates under the
2171	assumption that the substitution process follows a GTR model with 8 classes of rates in the Gamma
2172	distribution of rates across sites. The model parameter values are estimated by a Gibbs sampling
2173	technique. This algorithm tries diferent values of the model parameters and record the most probable
2174	ones. By default, $10^6$ values for each parameter are collected. These values are recorded every
2175	$10^3$ sample. These settings can be modified using the appropriate command-line options (see
2176	below).
2177
2178
2179	\subsubsection{Upper bounds of model parameters}
2180
2181	The maximum expected number of substitutions per along a given branch is set to 1.0. Since
2182	calibration times provide prior information about the time scale considered, it is possible to use
2183	that information to define an upper bound for the substitution rate. This upper bound is equal to
2184	the ratio of the maximum value for a branch length (1.0) by the amount of time elapsed since the
2185	oldest calibration point (i.e., the minimum of the lower bounds taken over the whole set of
2186	calibration points)\footnote{The actual formula involves an extra parameter which does not need to
2187	be introduced here}. It is important to keep in mind that the upper bound of the average
2188	substitution rate depends on the time unit used in the calibration priors. The value of the upper
2189	bound is printed on screen at the start of the execution.
2190
2191	PhyTime implements two models that authorize rates to be autocorrelated. The strength of
2192	autocorrelation is governed by a parameter which value is estimated from the data. However, it is
2193	necessary to set an appropriate upper bound for this parameter prior running the analysis. The
2194	maximum value is set such that the correlation between the rate at the beginning and at the end of a
2195	branch of length 1.0 calendar time unit is not different from 0. Here again the upper bound for the
2196	model parameter depends on the time unit. It is important to choose this unit so that a branch of
2197	length 1.0 calendar unit can be considered as short. For this reason, {\color{red}{we recommend to select a time
2198	unit so that the calibration times take values between -10 and -1000}}\index{time scale}.
2199
2200	\subsubsection{PhyTime specific options}
2201
2202	Beside the \x{--calibration} option, there are other command line options that are specific to PhyTime:
2203	\begin{itemize}
2204	\item \x{--chain\_len=num}\index{command-line options!\x{--chain\_len}} \\
2205	\x{num} is the number of iterations required to estimate the joint posterior density of all the model
2206	parameters, i.e., the length of the MCMC chain. Its default is set to 1E+6.
2207	% \item \x{--burnin=num} \\
2208	% \x{num} is the number of iterations for the ``burnin'' period, i.e., the number of iterations
2209	% required before actually starting to get valid samples from the joint posterior density of all the
2210	% model parameters. Its default is set to 1E+5, i.e., 0.1 times the default value of the \x{chain\_len}
2211	% parameter (see above).
2212	\item \x{--sample\_freq=num}\index{command-line options!\x{--sample\_freq}} \\ \x{num} is the number of generations between successive collection
2213	of the model parameter values throughout the MCMC algorithm. For instance, the
2214	\x{--sample\_freq=1E+2} option will make PhyTime sample the model parameter every 100th iteration of
2215	the MCMC algorithm. Its default is set to 1E+3.
2216	\item \x{--fastlk=yes (no)}\index{command-line options!\x{--fastlk}} [Default: no]\\
2217	The option is used to turn on (off) the approximation of the likelihood function using a
2218	multivariate normal density. By default, the exact likelihood is used. Using the normal
2219	approximation considerably speeds up the calculation. However, it is necessary to ensure that this
2220	approximation is appropriate by looking at the correlation between the exact and approximated
2221	likelihood values that are sampled. Please read Section \ref{sec:recomphytime} for a description of
2222	the appropriate steps to take.
2223	\item \x{--no\_sequences}\index{command-line options!\x{--no\_sequences}}\\
2224	Use this option to run the sampler without sequence data. This option can be useful when one wants
2225	to compare the marginal posterior density of model parameters to those derived when ignoring the
2226	information conveyed by the sequences. Such comparison should be conducted on a systematic basis so as to determine whether the
2227	parameters estimates are mostly determined by the prior of driven by the sequence data.
2228	\item \x{--rate\_model=gbs/gbd/gamma/clock}\index{command-line options!\x{--rate\_model}}\\
2229	This option is to select the model of evolution of the rate of evolution. \x{gbs} (default) stands
2230	for Geometric Brownian + Stochastic. This model considers that rates evolve along the tree
2231	according to a geometric Brownian process \cite{kishino01} and the average rate of substitution
2232	along a branch is a gamma distributed random variable. This model is described in \cite{guindon12}.
2233	The \x{gbd} model (Geometric Browninan with Deterministic calculation of edge lengths) assumes the
2234	same Geometric Brownian model of rates. However, as opposed to \x{gbs}, this model uses a
2235	deterministic approximation to calculate the average rates of evolution along edges. This model
2236	corresponds to the one described in \cite{kishino01} and implemented in the program Multidivtime\index{Multidivtime}.
2237	\x{gamma} is a less sophisticated model that assumes that average rates along edges are distributed a
2238	priori according to a gamma distribution. It is analogous to the uncorrelated clock model
2239	implemented in BEAST\index{BEAST} with a gamma distribution replacing the exponential one. The \x{clock} option
2240	corresponds to the strict clock model where all the lineages in the tree evolve at the same pace.
2241	\end{itemize}
2242
2243	\subsubsection{PhyTime output}
2244
2245	The program PhyTime generates two output files. The file called
2246	`\x{your\_seqfile\_phytime.XXXX.stats}', where XXXX is a randomly generated integer, lists the node
2247	times and branch relative rates sampled during the estimation process. It also gives the sampled
2248	values for other parameters, such as the autocorrelation of rates (parameter `Nu'), and the rate of
2249	evolution (parameter `EvolRate') amongst others. This output file can be analysed with the program
2250	Tracer\index{Tracer} from the BEAST\index{BEAST} package
2251	(\url{http://beast.bio.ed.ac.uk/Main_Page}). The second file is called
2252	`\x{your\_seqfile\_phytime.XXXX.trees}'. It is the list of trees that were collected during the
2253	estimation process, i.e., phylogenies sampled from the posterior density of trees. This file can be
2254	processed using the software TreeAnnotator, also part of the BEAST package (see
2255	\url{http://beast.bio.ed.ac.uk/Main_Page}) in order to generate confidence sets for the node time
2256	estimates.
2257
2258	Important information is also displayed on the standard output of PhyTime (the standard output
2259	generally corresponds to the terminal window from which PhyTime was launched). The first column of
2260	this output gives the current generation, or run, of the chain. It starts at 1 and goes up to 1E+6
2261	by default (use \x{--chain\_len} to change this value, see above). The second column gives the time
2262	elapsed in seconds since the sampling began. The third column gives the log likelihood of the
2263	phylogenetic model (i.e., `Felsenstein's likelihood'). The fourth column gives the logarithm of the
2264	joint prior probability of substitution rates along the tree and node heights. The fifth column
2265	gives the current sampled value of the EvolRate parameter along with the corresponding Effective
2266	Sample Size (ESS) (see Section \ref{sec:ess}) for this parameter. The sixth column gives the tree
2267	height and the corresponding ESS. The seventh column gives the value of the autocorrelation
2268	parameter followed by the corresponding ESS. The eightth column gives the values of the birth rate
2269	parameter that governs the birth-rate model of species divergence dates. The last column of the
2270	standard output gives the minimum of the ESS values taken over the whole set of node height
2271	estimates. It provides useful information when one has to decide whether or not the sample size is
2272	large enough to draw valid conclusion, i.e., decide whether the chain was run for long enough (see
2273	Section \ref{sec:recomphytime} for more detail about adequate chain length).
2274
2275	\subsubsection{ClockRate vs. EvolRate}
2276
2277	The average rate of evolution along a branch is broken into two components. One is called ClockRate
2278	and is the same throughout the tree. The other is called EvolRate and corresponds to a weighted
2279	average of branch-specific rates. The model of rate evolution implemented in PhyTime forces the
2280	branch-specific rate values to be greater than one. As a consequence, ClockRate is usually smaller
2281	EvolRate.
2282
2283	In more mathematical terms, let $\mu$ be the value of ClockRate, $r_i$ be the value of the relative
2284	rate along branch $i$ and $\Delta_i$ the time elapsed along branch $i$. The value of EvolRate is
2285	then given by:
2286	\begin{eqnarray*}
2287	\mathrm{EvolRate} = \mu \frac{\sum_{i}^{2n-3} r_i \Delta_i}{\sum_{i}^{2n-3} \Delta_i}.
2288	\end{eqnarray*} It is clear from this equation that multiplying each $r_i$ by a constant and
2289	dividing $\mu$ by the same constant does not change the value of EvolRate. The $r_i$s and $\mu$
2290	are then confounded, or non-identifiable, and only the value of EvolRate can be estimated from
2291	the data. {\color{red}{Please make sure that you use the value of EvolRate rather than that of
2292	ClockRate when referring to the estimate of the substitution rate}}.
2293
2294	\subsubsection{Effective sample size}\label{sec:ess}
2295
2296	The MCMC technique generates samples from a target distribution (in our case, the joint posterior
2297	density of parameters). Due to the Markovian nature of the method, these samples are not
2298	independent. The ESS is the estimated number of independent measurements obtained from a set of
2299	(usually dependent) measurements. It is calculated using the following formula:
2300	\begin{eqnarray*}
2301	\mathrm{ESS} = N\left(\frac{1-r}{1+r}\right),
2302	\end{eqnarray*}
2303	where $N$ is the length of the chain (i.e., the `raw' or `correlated' sample size) and $r$ is the
2304	autocorrelation value, which is obtained using the following formula:
2305	\begin{eqnarray*}
2306	r = \frac{1}{(N-k)\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k}-\mu_x),
2307	\end{eqnarray*} where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the $X_i$ values
2308	respectively and $k$ is the lag. The value of $r$ that is used in PhyTime corresponds to the case where $k=1$,
2309	which therefore gives a first order approximation of the `average' autocorrelation value (i.e., the
2310	autocorrelation averaged over the set of possible values of the lag).
2311
2312
2313	\subsubsection{Prior distributions of model parameters}\label{sec:prior}
2314
2315	Any Bayesian analysis requires specifying a prior distribution of model parameters. The outcome of
2316	the data analysis, i.e., the posterior distribution, is influenced by the priors. It is especially
2317	true if the signal conveyed by the data is weak. While some have argued that the specification of
2318	priors relies more on arbitrary decisions than sound scientific reasoning, choosing relevant prior
2319	distributions is in fact fully integrated in the process of building model that generates the
2320	observed data. In particular, the problem of estimating divergence times naturally lends itself to
2321	hierarchical Bayesian modelling. Based on the hypothesis that rates of evolution are conserved (to
2322	some extant) throughout the course of evolution, the hierarchical Bayesian approach provides an
2323	adequate framework for inferring substitution rates and divergence dates separately. Hence, in this
2324	situation, it makes good sense to use what is known about a relatively well-defined feature of the
2325	evolution of genetic sequences (the ``molecular clock'' hypothesis combined to stochastic variations of
2326	rates across lineages) to build a prior distribution on rates along edges.
2327
2328
2329
2330
2331	\subsubsection{Citing PhyTime}\label{sec:citephytime}
2332
2333	The ``default citation'' is:
2334
2335	\begin{itemize}
2336	\item Guindon S. ``From trajectories to averages: an improved description of the heterogeneity of
2337	substitution rates along lineages'', {\it Systematic Biology}, 2012. doi: 10.1093/sysbio/sys063.
2338	\end{itemize}
2339
2340	An earlier article also described some of the methods implemented in PhyTime:
2341
2342	\begin{itemize}
2343	\item Guindon S. ``Bayesian estimation of divergence times from large data sets'', {\it Molecular
2344	Biology and Evolution}, 2010,
2345	27(8):1768:81.
2346	\end{itemize}
2347
2348
2349	\section{Recommendations on program usage}\label{sec:progusage}
2350
2351	\subsection{PhyML}
2352
2353	The choice of the tree searching algorithm among those provided by PhyML is generally a tough one.
2354	The fastest option relies on local and simultaneous modifications of the phylogeny using NNI
2355	moves. More thorough explorations of the space of topologies are also available through the SPR
2356	options. As these two classes of tree topology moves involve different computational burdens, it
2357	is important to determine which option is the most suitable for the type of data set or analysis one
2358	wants to perform. Below is a list of recommendations for typical phylogenetic analyses.
2359
2360	\begin{enumerate}
2361	\item {\em Single data set, unlimited computing time.} The best option here is probably to use a SPR
2362	search (i.e., straight SPR of best of SPR and NNI). If the focus is on estimating the relationships
2363	between species, it is a good idea to use more than one starting tree to decrease the chance of
2364	getting stuck in a local maximum of the likelihood function. Using NNIs is appropriate if the
2365	analysis does not mainly focus on estimating the evolutionary relationships between species (e.g. a
2366	tree is needed to estimate the parameters of codon-based models later on). Branch supports can be
2367	estimated using bootstrap and approximate likelihood ratios.
2368
2369	\item {\em Single data set, restricted computing time.} The three tree searching options can be
2370	used depending on the computing time available and the size of the data set. For small data sets
2371	(i.e., $<$ 50 sequences), NNI will generally perform well provided that the phylogenetic signal is
2372	strong. It is relevant to estimate a first tree using NNI moves and examine the reconstructed
2373	phylogeny in order to have a rough idea of the strength of the phylogenetic signal (the presence of
2374	small internal branch lengths is generally considered as a sign of a weak phylogenetic signal,
2375	specially when sequences are short). For larger data sets ($>$ 50 sequences), a SPR search is
2376	recommended if there are good evidence of a lack of phylogenetic signal. Bootstrap analysis will
2377	generally involve large computational burdens. Estimating branch supports using approximate
2378	likelihood ratios therefore provides an interesting alternative here.
2379
2380	\item {\em Multiple data sets, unlimited computing time.} Comparative genomic analyses sometimes
2381	rely on building phylogenies from the analysis of a large number of gene families. Here again, the
2382	NNI option is the most relevant if the focus is not on recovering the most accurate picture of the
2383	evolutionary relationships between species. Slower SPR-based heuristics should be used when the
2384	topology of the tree is an important parameter of the analysis (e.g., identification of horizontally
2385	transferred genes using phylogenetic tree comparisons). Internal branch support is generally not a
2386	crucial parameter of the multiple data set analyses. Using approximate likelihood ratio is probably
2387	the best choice here.
2388
2389	\item {\em Multiple data sets, limited computing time.} The large amount of data to be processed in
2390	a limited time generally requires the use of the fastest tree searching and branch support
2391	estimation methods Hence, NNI and approximate likelihood ratios rather than SPR and non-parametric
2392	bootstrap are generally the most appropriate here.
2393	\end{enumerate}
2394
2395	Another important point is the choice of the substitution model. While default options generally
2396	provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the
2397	best-fit substitution model. This pre-analysis can be done using popular software such as Modeltest
2398	\cite{posada98} or ProtTest \cite{abascal05} for instance. These programs generally recommend the
2399	use of a discrete gamma distribution to model the substitution process as variability of rates among
2400	sites is a common feature of molecular evolution. The choice of the number of rate classes to use
2401	for this distribution is also an important one. While the default is set to four categories in
2402	PhyML, it is recommended to use larger number of classes if possible in order to best approximate
2403	the patterns of rate variation across sites \cite{galtier04}. Note however that run times are
2404	directly proportional to the number of classes of the discrete gamma distribution. Here again, a
2405	pre-analysis with the simplest model should help the user to determine the number of rate classes
2406	that represents the best trade-off between computing time and fit of the model to the data.
2407
2408
2409	\subsection{PhyTime}\label{sec:recomphytime}
2410
2411	Analysing a data set using PhyTime should involve three steps based on the following questions: (1)
2412	do the priors seem to be adequate (2) can I use the fast approximation of the likelihood and (3) how
2413	long shall I run the program for? I explain below how to provide answers to these questions.
2414
2415	\begin{itemize}
2416	\item {\em Are the priors adequate?} Bayesian analysis relies on specifiying the joint
2417	prior density of model parameters. In the case of node age estimation, these priors essentially
2418	describe how rates of substitution vary across lineages and the probabilistic distribution that node
2419	ages have when ignoring the information provided by the genetic sequences. These priors vary from
2420	tree to tree. It is therefore essential to check the adequacy of priors for each user-defined input
2421	tree. In order to do so, PhyTime needs to be run with the \x{--no\_data} option. When this option is
2422	required, the sequence data provided as input will be ignored and the rest of the analysis will
2423	proceed normally. The prior distribution of model parameters, essentially edge rates and node
2424	heights, can then be checked using the program Tracer as one would do for the standard `posterior'
2425	analysis.
2426
2427	\item {\em Can I use the fast approximation to the likelihood?} The suface of the log-likelihood
2428	function can be approximated using a multivariate normal density. This technique is saving very
2429	substantial amounts of computation time. However, like most approximations, there are situations
2430	where it does not provide a good fit to the actual function. This usually happens when the phylogeny
2431	displays a lot of short branches, i.e., the signal conveyed by the sequences is weak. It is
2432	therefore important to first check whether using the approximate likelihood is reasonable. In order
2433	to do so, it is recommended to first run the program without the approximation, i.e., using the
2434	default settings. Once the minimum value of the ESS of node ages (the last column on the right of
2435	the standard output) has reached 40-50, open the \x{phytime.XXXX} output file with Tracer and
2436	examine the correlation between the exact and approximate likelihood values. If the correlation is
2437	deemed to be good enough, PhyTime can be re-run using the \x{--fast\_lk} option, which uses the fast
2438	normal approximation to the likelihood function.
2439
2440
2441
2442	% Figure \ref{fig:approxbad} gives an example where the correlation is too weak and the approximation
2443	% of the likelihood should be avoided. Figure \ref{fig:approxbad} gives an example where the
2444	% approximation is good enough. The current execution of PhyTime can be terminated and then
2445	% re-launched using the \x{--fast\_lk} option.
2446
2447	% \begin{figure}
2448	% \begin{center}
2449	% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_bad.eps}}
2450	% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated
2451	% (Y-axis) and the exact (X-axis) likelihoods is weak here. The exact likelihood should be used (option \x{fastlk=no}).}
2452	% \label{fig:approxbad}
2453	% \end{center}
2454	% \end{figure}
2455
2456
2457	% \begin{figure}
2458	% \begin{center}
2459	% \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_good.eps}}
2460	% \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated
2461	% (Y-axis) and the exact (X-axis) likelihoods is good. The approximation of the likelihood can be used (option \x{fastlk=yes}).}
2462	% \label{fig:approxgood}
2463	% \end{center}
2464	% \end{figure}
2465
2466	\item {\em How long shall I run the program for?} PhyTime should be run long enough such that the
2467	ESS of each parameter is `large enough'. The last column on the right handside of the standard
2468	output gives the minimum ESS across all internal node heights. It is recommended to run the program
2469	so that this number reaches at least 100.
2470
2471	\end{itemize}
2472
2473
2474	\section{Frequently asked questions}
2475
2476	\begin{enumerate}
2477	\item {\it PhyML crashes before reading the sequences. What's wrong ?}\\
2478	\begin{itemize}
2479	\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output}
2480	\item The carriage return characters in your sequence files are not recognized by PhyML. You must
2481	make sure that your sequence file is a plain text file, with standard carriage return characters (i.e.,
2482	corresponding to ``$\backslash$\x{n}'', or ``$\backslash$\x{r}'')
2483	\end{itemize}
2484	\item {\it The program crashes after reading the sequences. What's wrong ?}\\
2485	\begin{itemize}
2486	\item You analyse protein sequences and did not enter the \x{-d aa} option in the command-line.
2487	\item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output}
2488	\end{itemize}
2489	\item {\it Does PhyML handle outgroup sequences ?}\\
2490	\begin{itemize}
2491	\item Yes, it does. Outgroup taxa are identified by adding the `*' sign at the end of each
2492	corresponding sequence name (see Section \ref{sec:outgroupspecify})
2493	\end{itemize}
2494	\item {\it Does PhyML estimate clock-constrained trees ?}\\
2495	\begin{itemize}
2496	\item No, the PhyML program does not estimate clock-contrained trees. One can however use the
2497	program PhyTime to perform such analysis but the tree topology will not be estimated.
2498	\end{itemize}
2499	\item {\it Can PhyML analyse partitioned data, such as multiple gene sequences ?}\\
2500	\begin{itemize}
2501	\item We are currently working on this topic. Future releases of the program will provide options
2502	to estimate trees from phylogenomic data sets, with the opportunity to use different substitution
2503	models on the different data partitions (e.g., different genes). PhyML will also include specific
2504	algorithms to search the space of tree topologies for this type of data.
2505	\end{itemize}
2506	\end{enumerate}
2507
2508
2509
2510	\section{Acknowledgements}
2511	The development of PhyML since 2000 has been supported by the Centre National de la Recherche
2512	Scientifique (CNRS) and the Minist\`ere de l'\'Education Nationale.
2513
2514	% \bibliographystyle{/home/guindon/latex/biblio/OUPnum}
2515	\bibliographystyle{/home/guindon/latex/biblio/nature/naturemag}
2516	\bibliography{/home/guindon/latex/biblio/ref}
2517
2518	\printindex
2519	\end{document}

Note: See TracBrowser for help on using the repository browser.

Context Navigation

source: branches/ali/GDE/PHYML20130708/phyml/doc/phyml-manual.tex

Download in other formats: