1 | \documentclass[a4paper,12pt]{article} |
---|
2 | \usepackage{fancyvrb} |
---|
3 | \usepackage{graphicx} |
---|
4 | \usepackage{tabularx} |
---|
5 | % \usepackage{color} |
---|
6 | \usepackage{xcolor} |
---|
7 | \usepackage{psfrag} |
---|
8 | \usepackage[fleqn]{amsmath} |
---|
9 | \usepackage[hyphens]{url} |
---|
10 | % \usepackage{vmargin} |
---|
11 | \usepackage{cite} |
---|
12 | \usepackage{caption2} |
---|
13 | \usepackage{hyperref} |
---|
14 | \usepackage{makeidx} |
---|
15 | \renewcommand{\captionlabeldelim}{.} |
---|
16 | \def\thesection{\arabic{section}} |
---|
17 | \renewcommand{\thefigure}{\arabic{figure}} |
---|
18 | \renewcommand{\thetable}{\arabic{table}} |
---|
19 | \newcommand{\hl}{\noalign{\vskip3pt}\hline\noalign{\vskip3pt}} |
---|
20 | \newcommand{\hrf}{\hrulefill} |
---|
21 | \newcommand{\tc}[1]{\textcolor{black}{#1}} |
---|
22 | \newcommand{\dc}[1]{\textcolor{green}{#1}} |
---|
23 | % \usepackage[none]{hyphenat} |
---|
24 | \newcommand{\rep}[3][1] |
---|
25 | { |
---|
26 | \psfrag{#2}[c][c][#1]{#3} |
---|
27 | } |
---|
28 | \newcommand{\x}[1]{\texttt{#1}} |
---|
29 | |
---|
30 | % \setpapersize{A4} |
---|
31 | % \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red,linkbordercolor=000} |
---|
32 | \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red} |
---|
33 | \renewcommand{\baselinestretch}{1.} |
---|
34 | \makeindex |
---|
35 | \begin{document} |
---|
36 | \sloppy |
---|
37 | \begin{center} |
---|
38 | \thispagestyle{empty} |
---|
39 | \vfill\vfill |
---|
40 | % \rule{\linewidth}{0.02cm}\\ |
---|
41 | {\Huge \textbf{ P~h~y~M~L~~--~~M~a~n~u~a~l}} |
---|
42 | \vspace{-0.4cm}\\ |
---|
43 | % \rule{\linewidth}{0.02cm}\\ |
---|
44 | \vfill |
---|
45 | {\huge Version 3.0 \\ |
---|
46 | \today |
---|
47 | \vfill |
---|
48 | \normalsize |
---|
49 | \href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml}\\ |
---|
50 | \vspace{0.4cm} |
---|
51 | \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}} |
---|
52 | \end{center} |
---|
53 | \clearpage |
---|
54 | \tableofcontents |
---|
55 | \clearpage |
---|
56 | |
---|
57 | {\par |
---|
58 | \small |
---|
59 | \noindent |
---|
60 | \copyright Copyright 1999 - 2008 by PhyML Development Team.\\ |
---|
61 | \noindent The software PhyML is provided ``as is'' without warranty of any kind. In no event shall |
---|
62 | the authors or his employer be held responsible for any damage resulting from the use of this |
---|
63 | software, including but not limited to the frustration that you may experience in using the package. |
---|
64 | All parts of the source and documentation except where indicated are distributed under |
---|
65 | the GNU public licence. See http://www.opensource.org for details. |
---|
66 | |
---|
67 | } |
---|
68 | |
---|
69 | { |
---|
70 | \noindent |
---|
71 | \setlength{\baselineskip}{0.5\baselineskip} |
---|
72 | \section{Availability} |
---|
73 | \begin{itemize} |
---|
74 | \item Binaries: \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml} |
---|
75 | \item Sources: \href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml} |
---|
76 | \item Discussion forum: \href{http://groups.google.com/group/phyml-forum}{http://groups.google.com/group/phyml-forum} |
---|
77 | \end{itemize} |
---|
78 | } |
---|
79 | |
---|
80 | { |
---|
81 | \noindent |
---|
82 | \setlength{\baselineskip}{0.7\baselineskip} |
---|
83 | \section{Authors} |
---|
84 | \begin{itemize} |
---|
85 | \item { St\'ephane Guindon} and { Olivier Gascuel} conceived the original PhyML algorithm. |
---|
86 | \item { St\'ephane Guindon} conceived the PhyTime method. |
---|
87 | \item { St\'ephane Guindon, Wim Hordjik} and { Olivier Gascuel} conceived the SPR-based tree search algorithm. |
---|
88 | \item { Maria Anisimova} and { Olivier Gascuel} conceived the aLRT method for branch support. |
---|
89 | \item { St\'ephane Guindon, Franck Lethiec}, Jean-Francois Dufayard and Vincent Lefort implemented PhyML. |
---|
90 | \item { Jean-Francois Dufayard} created the benchmark and implemented the tools that are used to check |
---|
91 | PhyML accuracy and performances. |
---|
92 | \item { Vincent Lefort, St\'ephane Guindon, Patrice Duroux} and { Olivier Gascuel} conceived and |
---|
93 | implemented PhyML web server. |
---|
94 | \item St\'ephane Guindon wrote this document. |
---|
95 | \end{itemize} |
---|
96 | } |
---|
97 | \clearpage |
---|
98 | |
---|
99 | % \section{ML in phylogenetics: the basics.} |
---|
100 | \section{Overview} |
---|
101 | |
---|
102 | PhyML \cite{guindon03} is a software package which primary task that is to estimate maximum |
---|
103 | likelihood phylogenies from alignments of nucleotide or amino-acid sequences. It provides a wide |
---|
104 | range of options that were designed to facilitate standard phylogenetic analyses. The main |
---|
105 | strength of PhyML lies in the large number of substitution models coupled to various options to |
---|
106 | search the space of phylogenetic tree topologies, going from very fast and efficient methods to |
---|
107 | slower but generally more accurate approaches. It also implements two methods to evaluate branch |
---|
108 | supports in a sound statistical framework (the non-parametric bootstrap and the approximate |
---|
109 | likelihood ratio test). |
---|
110 | |
---|
111 | PhyML was designed to process moderate to large data sets. In theory, alignments with up to 4,000 |
---|
112 | sequences 2,000,000 character-long can analyzed. In practice however, the amount of memory required |
---|
113 | to process a data set is proportional of the product of the number of sequences by their length. |
---|
114 | Hence, a large number of sequences can only be processed provided that they are short. Also, PhyML |
---|
115 | can handle long sequences provided that they are not numerous. With most standard personal |
---|
116 | computers, the ``comfort zone'' for PhyML generally lies around 100-500 sequences less than 10,000 |
---|
117 | character long. For larger data sets, we recommend using other softwares such as RAxML |
---|
118 | \cite{raxml}\index{RAxML} or GARLI \cite{garli}\index{GARLI} or Treefinder |
---|
119 | (\href{http://www.treefinder.de}{http://www.treefinder.de}). |
---|
120 | |
---|
121 | |
---|
122 | \section{Bug report}\index{bug} |
---|
123 | |
---|
124 | While PhyML is, of course, bug-free (!) (please read the disclaimer carefuly...), if you ever come |
---|
125 | across an issue, please feel free to report it using the discuss group web site at the following |
---|
126 | address: \url{https://groups.google.com/forum/?fromgroups#!forum/phyml-forum}. Alternatively, you |
---|
127 | can send an email to \url{s.guindon@auckland.ac.nz}. Do not forget to mention the version of PhyML |
---|
128 | and program options you are using. |
---|
129 | |
---|
130 | |
---|
131 | \section{Installing PhyML} |
---|
132 | |
---|
133 | \subsection{Sources and compilation}\index{compilation} |
---|
134 | |
---|
135 | The sources of the program are available free of charge by sending an e-mail to St\'ephane Guindon |
---|
136 | at \href{mailto:guindon@lirmm.fr}{\x{guindon@lirmm.fr}} or |
---|
137 | \href{mailto:s.guindon@auckland.ac.nz}{\x{s.guindon@auckland.ac.nz}}. The compilation on UNIX-like |
---|
138 | systems is fairly standard. It is described in the `\x{INSTALL}' file that comes with the |
---|
139 | sources. In a command-line window, go to the directory that contains the sources and type: |
---|
140 | |
---|
141 | {\setlength{\baselineskip}{0.5\baselineskip} |
---|
142 | \begin{verbatim} |
---|
143 | ./configure; |
---|
144 | make clean; |
---|
145 | make V=0; |
---|
146 | \end{verbatim} |
---|
147 | } |
---|
148 | |
---|
149 | By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a |
---|
150 | version of PhyML that can run through a debugging tool (such as \x{ddd}\label{ddd}) or a profiling |
---|
151 | tool (such as \x{gprof}\label{gprof}) using the following instructions: |
---|
152 | |
---|
153 | {\setlength{\baselineskip}{0.5\baselineskip} |
---|
154 | \begin{verbatim} |
---|
155 | ./configure --enable-debug; |
---|
156 | make clean; |
---|
157 | make V=0; |
---|
158 | \end{verbatim} |
---|
159 | } |
---|
160 | |
---|
161 | % {\em Note} -- when PhyML is going to be used mostly of exclusively in batch mode, it is preferable |
---|
162 | % to turn on the batch mode option in the Makefile. In order to do so, the file \x{Makefile.am} needs |
---|
163 | % to be modified: add \x{-DBATCH} to the line with \x{DEFS=-DUNIX -D\$(PROG) -DDEBUG}. |
---|
164 | |
---|
165 | |
---|
166 | |
---|
167 | \subsection{Installing PhyML on UNIX-like systems (including Mac OS)} |
---|
168 | |
---|
169 | Copy PhyML binary file in the directory you like. For the operating system to be able to locate the |
---|
170 | program, this directory must be specified in the global variable \x{PATH}. In order to achieve this, |
---|
171 | you will have to add \x{export PATH="/your\_path/:\${PATH}"} to the \x{.bashrc} or the |
---|
172 | \x{.bash\_profile} located in your home directory (\x{your\_path} is the path to the directory that |
---|
173 | contains PhyML binary). |
---|
174 | |
---|
175 | |
---|
176 | \subsection{Installing PhyML on Microsoft Windows}\label{sec:install_windows} |
---|
177 | |
---|
178 | Copy the files \x{phyml.exe} and \x{phyml.bat} in the same directory. To launch PhyML, click on the |
---|
179 | icon corresponding to \x{phyml.bat}. Clicking on the icon for \x{phyml.exe} works too but the |
---|
180 | dimensions of the window will not fit PhyML PHYLIP-like interface. |
---|
181 | |
---|
182 | \subsection{Installing the parallel version of PhyML}\label{sec:MPI}\index{MPI}\index{bootstrap!parallel} |
---|
183 | |
---|
184 | Bootstrap analysis can run on multiple processors. Each processor analyses one bootstraped dataset. |
---|
185 | Therefore, the computing time needed to perform $R$ bootstrap replicates is divided by the number of |
---|
186 | processors available. |
---|
187 | |
---|
188 | This feature of PhyML relies on the MPI (Message Passing Interface) library. To use it, your |
---|
189 | computer must have MPI installed on it. In case MPI is not installed, you can dowload it from |
---|
190 | \href{http://www.mcs.anl.gov/research/projects/mpich2/}{http://www.mcs.anl.gov/research/projects/mpich2/}. |
---|
191 | Once MPI is installed, it is necessary to launch the MPI daemon. This can be done by entering the |
---|
192 | following instruction: \x{mpd \&}. Note however that in most cases, the MPI daemon will already be |
---|
193 | running on your server so that you most likely do not need to worry about this. You can then just |
---|
194 | go in the \x{phyml/} directory (the directory that contains the \x{src/}, \x{examples/} and \x{doc/} |
---|
195 | folders) and enter the commands below: |
---|
196 | |
---|
197 | {\setlength{\baselineskip}{0.5\baselineskip} |
---|
198 | \begin{verbatim} |
---|
199 | ./configure --enable-mpi; |
---|
200 | make clean; |
---|
201 | make; |
---|
202 | \end{verbatim} |
---|
203 | } |
---|
204 | |
---|
205 | A binary file named \x{phyml-mpi} has now been created in the \x{src/} directory and is ready to use |
---|
206 | with MPI. A typical MPI command-line which uses 4 CPUs is given below: |
---|
207 | |
---|
208 | {\setlength{\baselineskip}{0.5\baselineskip} |
---|
209 | \begin{verbatim} |
---|
210 | mpirun -n 4 ./phyml-mpi -i myseq -b 100 |
---|
211 | \end{verbatim} |
---|
212 | } |
---|
213 | |
---|
214 | \noindent Please read section \ref{sec:parallel_bootstrap} of this document for more information. |
---|
215 | |
---|
216 | \section{Program usage.}\label{sec:phyml_new} |
---|
217 | |
---|
218 | PhyML has three distinct user-interfaces. The first corresponds to a PHYLIP-like text interface |
---|
219 | that makes the choice of the options self-explanatory. The command-line interface is well-suited for |
---|
220 | people that are familiar with PhyML options or for running PhyML in batch mode. The XML interface is |
---|
221 | more sophisticated. It allows the user to analyse partitionned data using flexible mixture models of evolution. |
---|
222 | |
---|
223 | \subsection{PHYLIP-like interface} |
---|
224 | |
---|
225 | The default is to use the PHYLIP-like text interface by simply typing `\x{phyml}' in a command-line |
---|
226 | window or by clicking on the PhyML icon (see Section \ref{sec:install_windows}). After entering the |
---|
227 | name of the input sequence file, a list of sub-menus helps the users set up the analysis. There |
---|
228 | are currently four distinct sub-menus: |
---|
229 | |
---|
230 | % \begin{figure} |
---|
231 | % \resizebox{15cm}{9cm}{\includegraphics{./fig/interface.eps}} |
---|
232 | % \caption{PHYLIP-like interface to PhyML.} |
---|
233 | % \label{fig:interface} |
---|
234 | % \end{figure} |
---|
235 | |
---|
236 | \begin{enumerate} |
---|
237 | |
---|
238 | \item {\em Input Data}: specify whether the input file contains amino-acid or nucleotide |
---|
239 | sequences. What the sequence format is (see Section \ref{sec:input_output}) and how many data sets |
---|
240 | should be analysed. |
---|
241 | |
---|
242 | \item {\em Substitution Model}: selection of the Markov model of substitution. |
---|
243 | |
---|
244 | \item {\em Tree Searching}: selection of the tree topology searching algorithm. |
---|
245 | |
---|
246 | \item {\em Branch Support}: selection of the method that is used to measure branch support. |
---|
247 | |
---|
248 | \end{enumerate} |
---|
249 | \noindent `\x{+}' and `\x{-}' keys are used to move forward and backward in the sub-menu list. Once |
---|
250 | the model parameters have been defined, typing `\x{Y}' (or `\x{y}') launches the calculations. The |
---|
251 | meaning of some options may not be obvious to users that are not familiar with phylogenetics. In |
---|
252 | such situation, we strongly recommend to use the default options. As long as the format of the input |
---|
253 | sequence file is correctly specified (sub-menu {\em Input data}), the safest option for non-expert |
---|
254 | users is to use the default settings. The different options provided within each sub-menu are |
---|
255 | described in what follows. |
---|
256 | |
---|
257 | |
---|
258 | \subsubsection{Input Data sub-menu} |
---|
259 | |
---|
260 | \begin{center}\framebox{\x{[D] ............................... Data type (DNA/AA)}} \end{center} |
---|
261 | Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see |
---|
262 | Section \ref{sec:input_output}). Type \x{D} to change settings. |
---|
263 | |
---|
264 | \vspace{0.7cm} |
---|
265 | \begin{center} \framebox{\x{[I] ...... Input sequences interleaved (or sequential)}} \end{center} |
---|
266 | PHYLIP format comes in two flavours: interleaved or sequential (see Section |
---|
267 | \ref{sec:input_output}). Type \x{I} to selected among the two formats. |
---|
268 | |
---|
269 | \vspace{0.7cm} |
---|
270 | \begin{center} \framebox{\x{[M] ....................... Analyze multiple data sets}} \end{center} |
---|
271 | If the input sequence file contains more than one data sets, PhyML can analyse each of them |
---|
272 | in a single run of the program. Type \x{M} to change settings. |
---|
273 | |
---|
274 | \vspace{0.7cm} |
---|
275 | \begin{center} \framebox{\x{[R] ............................................ Run ID}} \end{center} |
---|
276 | This option allows you to append a string that identifies the current PhyML run. Say for instance |
---|
277 | that you want to analyse the same data set with two models. You can then `tag' the first PhyML run |
---|
278 | with the name of the first model while the second run is tagged with the name of the second model.\index{run ID} |
---|
279 | |
---|
280 | |
---|
281 | |
---|
282 | \subsubsection{Substitution model sub-menu}\label{sec:submenus} |
---|
283 | |
---|
284 | \begin{center} \framebox{\x{[M] ................. Model of nucleotide substitution}} \end{center} |
---|
285 | \begin{center} \framebox{\x{[M] ................ Model of amino-acids substitution}} \end{center} |
---|
286 | PhyML implements a wide range of substitution models: JC69 \cite{jukes69}, K80 \cite{kimura80}, F81 |
---|
287 | \cite{felsenstein81a}, F84 \cite{phylip2}, HKY85 \cite{hasegawa85}, TN93 \cite{tamura93} GTR |
---|
288 | \cite{lanave84,tavare86} and custom for nucleotides; LG \cite{le08}, WAG \cite{whelan01b}, Dayhoff |
---|
289 | \cite{dayhoff78}, JTT \cite{jones92}, Blosum62 \cite{henikoff92}, mtREV \cite{adachi96}, rtREV |
---|
290 | \cite{dimmic02}, cpREV \cite{adachi00}, DCMut \cite{kosiol04}, VT \cite{muller00} and mtMAM |
---|
291 | \cite{cao98} and custom for amino acids. Cycle through the list of nucleotide or amino-acids |
---|
292 | substitution models by typing \x{M}. Both nucleotide and amino-acid lists include a `custom' model. |
---|
293 | The custom option provides the most flexible way to specify the nucleotide substitution model. The |
---|
294 | model is defined by a string made of six digits. The default string is `\x{000000}', which means |
---|
295 | that the six relative rates of nucleotide changes: $A \leftrightarrow C$, $A \leftrightarrow G$, $A |
---|
296 | \leftrightarrow T$, $C \leftrightarrow G$, $C \leftrightarrow T$ and $G \leftrightarrow T$, are |
---|
297 | equal. The string `\x{010010}' indicates that the rates $A \leftrightarrow G$ and $C |
---|
298 | \leftrightarrow T$ are equal and distinct from $A \leftrightarrow C = A \leftrightarrow T = C |
---|
299 | \leftrightarrow G = G \leftrightarrow T$. This model corresponds to HKY85 (default) or K80 if the |
---|
300 | nucleotide frequencies are all set to 0.25. `\x{010020}' and `\x{012345}' correspond to TN93 and |
---|
301 | GTR models respectively. The digit string therefore defines groups of relative substitution rates. |
---|
302 | The initial rate within each group is set to 1.0, which corresponds to F81 (JC69 if the base |
---|
303 | frequencies are equal). Users also have the opportunity to define their own initial rate values. |
---|
304 | These rates are then optimised afterwards (option `\x{O}') or fixed to their initial values. The |
---|
305 | custom option can be used to implement all substitution models that are special cases of |
---|
306 | GTR. Table \ref{tab:modelcode} on page \pageref{tab:modelcode} gives the correspondence between the `standard' name of the model |
---|
307 | (see \url{http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg}) and the custom model code. |
---|
308 | The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid |
---|
309 | substitution model that is not hard-coded in PhyML. The symmetric part of the rate matrix, as well |
---|
310 | as the equilibrium amino-acid frequencies, are given in a file which name is given as input of the |
---|
311 | program. The format of this file is described in the section \ref{sec:customaa}. |
---|
312 | |
---|
313 | \vspace{0.7cm} |
---|
314 | \begin{center} \framebox{\x{[F] ................. Optimise equilibrium frequencies}} \end{center} |
---|
315 | \begin{center} \framebox{\x{[E] ......... Equilibrium frequencies (empirical/user)}} \end{center} |
---|
316 | \begin{center} \framebox{\x{[F] . Amino acid frequencies (empirical/model defined)}} \end{center} |
---|
317 | For nucleotide sequences, optimising nucleotide frequencies means that the values of these |
---|
318 | parameters are estimated in the maximum likelihood framework. When the custom model option is |
---|
319 | selected, it is also possible to give the program a user-defined nucleotide frequency distribution |
---|
320 | at equilibrium (option \x{E}). For protein sequences, the stationary amino-acid frequencies are |
---|
321 | either those defined by the substitution model or those estimated by counting the number of |
---|
322 | different amino-acids observed in the data. Hence, the meaning of the \x{F} option depends on the |
---|
323 | type of the data to be processed. |
---|
324 | |
---|
325 | \vspace{0.7cm} |
---|
326 | \begin{center} \framebox{\x{[T] .................... Ts/tv ratio |
---|
327 | (fixed/estimated)}} \end{center}\index{$\kappa$}\index{ts/tv ratio} Fix or estimate the |
---|
328 | transition/transversion ratio in the maximum likelihood framework. This option is only available |
---|
329 | when DNA sequences are to be analysed under K80, HKY85 or TN93 models. The definition given to this |
---|
330 | parameter by PhyML is the same as PAML's\index{PAML} one. Therefore, the value of this parameter |
---|
331 | does {\it not} correspond to the ratio between the expected number of transitions and the expected |
---|
332 | number of transversions during a unit of time. This last definition is the one used in |
---|
333 | PHYLIP\index{PHYLIP}. PAML's manual gives more detail about the distinction between the two |
---|
334 | definitions (\url{http://abacus.gene.ucl.ac.uk/software/paml.html}). |
---|
335 | |
---|
336 | \vspace{0.7cm} |
---|
337 | \begin{center} \framebox{\x{[V] . Proportion of invariable sites |
---|
338 | (fixed/estimated)}} \end{center}\index{invariable sites}\index{proportion of invariants} The |
---|
339 | proportion of invariable sites, i.e., the expected frequency of sites that do not evolve, can be |
---|
340 | fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each |
---|
341 | site in the sequence may accumulate substitutions at some point during its evolution, even if no |
---|
342 | differences across sequences are actually observed at that site. Users can also fix this parameter |
---|
343 | to any value in the $[0.0,1.0]$ range or estimate it from the data in the maximum-likelihood |
---|
344 | framework. |
---|
345 | |
---|
346 | \vspace{0.7cm} |
---|
347 | \index{gamma distribution (discrete)!mean vs. median} |
---|
348 | \index{gamma distribution (discrete)!number of categories} |
---|
349 | \index{gamma distribution (discrete)!shape parameter} |
---|
350 | \begin{center} \framebox{\x{[R] ....... One category of substitution rate (yes/no)}} \end{center} |
---|
351 | \begin{center} \framebox{\x{[C] ........... Number of substitution rate categories}} \end{center} |
---|
352 | \begin{center} \framebox{\x{[A] ... Gamma distribution parameter (fixed/estimated)}} \end{center} |
---|
353 | \begin{center} \framebox{\x{[G] .........`Middle' of each rate class (mean/median)}} \end{center} |
---|
354 | Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete |
---|
355 | gamma distribution. Type \x{R} to switch this option on or off. The different categories of this |
---|
356 | discrete distribution correspond to different (relative) rates of evolution. The number of |
---|
357 | categories of this distribution is set to 4 by default. It is probably not wise to go below this |
---|
358 | number. Larger values are generally preferred. However, the computational burden involved is |
---|
359 | proportional to the number of categories (i.e., an analysis with 8 categories will generally take |
---|
360 | twice the time of the same analysis with only 4 categories). Note that the likelihood will not |
---|
361 | necessarily increase as the number of categories increases. Hence, the number of categories should |
---|
362 | be kept below a ``reasonable'' number, say 20. The default number of categories can be changed by |
---|
363 | typing \x{C}. |
---|
364 | |
---|
365 | The middle of each discretized substitution rate class can be determined using the mean or the |
---|
366 | median. PAML, MrBayes and RAxML use the mean. However, the median is generally associated with |
---|
367 | greater likelihoods than the mean. This conclusion is based on our analysis of several real-world |
---|
368 | data sets extracted from TreeBase. Despite this, the default option in PhyML is to use the mean in |
---|
369 | order to make PhyML likelihoods comparable to those of other phylogenetic software. One must bare |
---|
370 | in mind that {\color{red}{likelihoods calculated with the mean approximation are not directly |
---|
371 | comparable to the likelihoods calculated using the median approximation}}. |
---|
372 | |
---|
373 | The shape of the gamma distribution determines the range of rate variation across sites. Small |
---|
374 | values, typically in the $[0.1,1.0]$ range, correspond to large variability. Larger values |
---|
375 | correspond to moderate to low heterogeneity. The gamma shape parameter can be fixed by the user or |
---|
376 | estimated via maximum-likelihood. Type \x{A} to select one or the other option. |
---|
377 | |
---|
378 | |
---|
379 | \subsubsection{Tree searching sub-menu} |
---|
380 | |
---|
381 | \begin{center} \framebox{\x{[O] ........................... Optimise tree topology}} \end{center} By |
---|
382 | default the tree topology is optimised in order to maximise the likelihood. However, it is also |
---|
383 | possible to avoid any topological alteration. This option is useful when one wants to compute the |
---|
384 | likelihood of a tree given as input (see below). Type \x{O} to select among these two options. |
---|
385 | |
---|
386 | \vspace{0.7cm} |
---|
387 | \begin{center} \framebox{\x{[S] .................. Tree topology search operations}} \end{center} |
---|
388 | PhyML proposes three different methods to estimate tree topologies. The default approach is to use |
---|
389 | simultaneous NNI. This option corresponds to the original PhyML algorithm \cite{guindon03}. The |
---|
390 | second approach relies on subtree pruning and regrafting (SPR). It generally finds better tree |
---|
391 | topologies compared to NNI but is also significantly slower. The third approach, termed BEST, |
---|
392 | simply estimates the phylogeny using both methods and returns the best solution among the two. Type |
---|
393 | \x{S} to choose among these three choices. |
---|
394 | |
---|
395 | \vspace{0.7cm} |
---|
396 | \begin{center} \framebox{\x{[R] ......................... Use random starting tree}} \end{center} |
---|
397 | \begin{center} \framebox{\x{[N] .................. Number of random starting trees}} \end{center} |
---|
398 | When the SPR or the BEST options are selected, is is possible to use random trees rather than BioNJ |
---|
399 | or a user-defined tree, as starting tree. If this option is turned on (type \x{R} to change), five |
---|
400 | trees, corresponding to five random starts, will be estimated. The output tree file will contain the |
---|
401 | best tree found among those five. The number of random starts can be modified by typing |
---|
402 | \x{N}. Setting the number of random starting trees to $N$ means that the analysis will take |
---|
403 | (slightly more than) $N$ times the time required for a standard analysis where only one (BioNJ) |
---|
404 | starting tree is used. However, the analysis of real data sets shows that the best trees estimated |
---|
405 | using the random start option almost systematically have higher likelihoods than those inferred |
---|
406 | using a single starting tree. |
---|
407 | |
---|
408 | \vspace{0.7cm} |
---|
409 | \begin{center} \framebox{\x{[U] ........ Starting tree (BioNJ/parsimony/user |
---|
410 | tree)}} \end{center}\index{BioNJ} When the tree topology optimisation option is turned on, PhyML |
---|
411 | proceeds by refining an input tree. By default, this input tree is estimated using BioNJ |
---|
412 | \cite{gascuelNJ}. The alternative option is to use a parsimony tree. We found this option specially |
---|
413 | useful when analysing large data sets with NNI moves as it generally leads to greater likelihoods |
---|
414 | than those obtained when starting from a BioNJ trees. The user can also to input her/his own |
---|
415 | tree. This tree should be in Newick format (see Section \ref{sec:input_output}). This option is |
---|
416 | useful when one wants to evaluate the likelihood of a given tree with a fixed topology, using |
---|
417 | PhyML. Type \x{U} to choose among these two options. |
---|
418 | |
---|
419 | \subsubsection{Branch support sub-menu} |
---|
420 | |
---|
421 | \begin{center} \framebox{\x{[B] ................ Non parametric bootstrap analysis}} \end{center} |
---|
422 | The support of the data for each internal branch of the phylogeny can be estimated using |
---|
423 | non-parametric bootstrap. By default, this option is switched off. Typing \x{B} switches on the |
---|
424 | bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The largest this |
---|
425 | number the more precise the bootstrap support estimates are. However, for each bootstrap replicate |
---|
426 | a phylogeny is estimated. Hence, the time needed to analyse $N$ bootstrap replicates corresponds to |
---|
427 | $N$-times the time spent on the analysis of the original data set. $N=100$ is generally considered |
---|
428 | as a reasonable number of replicates. |
---|
429 | |
---|
430 | \begin{center} \framebox{\x{[A] ................ Approximate likelihood ratio test}} \end{center} |
---|
431 | When the bootstrap option is switched off (see above), approximate likelihood branch supports are |
---|
432 | estimated. This approach is considerably faster than the bootstrap one. However, both methods intend |
---|
433 | to estimate different quantities and conducting a fair comparison between both criteria is not |
---|
434 | straightforward. The estimation of approximate likelihood branch support comes in multiple |
---|
435 | flavours. The default is set to aBayes, corresponding to the approximate Bayes method described in |
---|
436 | \cite{anisimova11}. The approximate likelihood ratio test (aLRT) \cite{anisimova06}, |
---|
437 | ShimodairaâHasegawa aLRT (SH-aLRT) statistics are the other available options. |
---|
438 | |
---|
439 | |
---|
440 | \subsection{Command-line interface} |
---|
441 | |
---|
442 | An alternative to the PHYLIP-like interface is the command-line interface. Users that do not need to |
---|
443 | modify the default parameters can launch the program with the `\x{phyml -i seq\_file\_name}' |
---|
444 | command. The list of all command line arguments and how to use them is given in the `Help' section |
---|
445 | which is displayed when entering the `\x{phyml --help}' command. The available command-line options |
---|
446 | are described in what follows. |
---|
447 | |
---|
448 | \begin{itemize} |
---|
449 | |
---|
450 | \item \x{-i} (or \x{--input}) \x{seq\_file\_name}\index{command-line options!\x{--input}} \\ |
---|
451 | \x{seq\_file\_name} is the name of the nucleotide or amino-acid sequence file in PHYLIP format. |
---|
452 | |
---|
453 | \item \x{-d} (or \x{--datatype}) \x{data\_type}\index{command-line options!\x{--data\_type}}\\ |
---|
454 | \x{data\_type} is \x{nt} for nucleotide (default) and \x{aa} for amino-acid sequences. |
---|
455 | |
---|
456 | \item \x{-q} (or \x{--sequential})\index{sequence format!interleaved}\index{sequence format!sequential}\index{command-line options!\x{--sequential}} \\ |
---|
457 | Changes interleaved format (default) to sequential format. |
---|
458 | |
---|
459 | \item \x{-n} (or \x{--multiple}) \x{nb\_data\_sets}\index{multiple data sets}\index{command-line options!\x{--multiple}}\\ |
---|
460 | \x{nb\_data\_sets} is an integer giving the number of data sets to analyse. |
---|
461 | |
---|
462 | \item \x{-p} (or \x{--pars})\index{command-line options!\x{--pars}}\\ |
---|
463 | Use a minimum parsimony starting tree. This option is taken into account when the `-u' option |
---|
464 | is absent and when tree topology modifications are to be done. |
---|
465 | |
---|
466 | \item \x{-b} (or \x{--bootstrap}) \x{int}\index{bootstrap}\index{command-line options!\x{--bootstrap}} |
---|
467 | \begin{itemize} |
---|
468 | \item \x{int} $>$ 0: \x{int} is the number of bootstrap replicates. |
---|
469 | \item \x{int} = 0: neither approximate likelihood ratio test nor bootstrap values are computed. |
---|
470 | \item \x{int} = -1: approximate likelihood ratio test returning aLRT statistics. |
---|
471 | \item \x{int} = -2: approximate likelihood ratio test returning Chi2-based parametric branch supports. |
---|
472 | % \item \x{int} = -3: minimum of Chi2-based parametric and SH-like branch supports. |
---|
473 | \item \x{int} = -4: SH-like branch supports alone. |
---|
474 | \item \x{int} = -5: (default) approximate Bayes branch supports. |
---|
475 | \end{itemize} |
---|
476 | |
---|
477 | \item \x{-m} (or \x{--model}) \x{model\_name}\index{substitution models!DNA}\index{substitution models!amino acids}\index{command-line options!\x{--model}} \\ |
---|
478 | \x{model\_name} : substitution model name. |
---|
479 | \begin{itemize} |
---|
480 | \item {\it Nucleotide-based models}: \x{HKY85} (default) \x{| JC69 | K80 | F81 | F84 | TN93 | GTR | |
---|
481 | custom} \\ The \x{custom} option can be used to define a new substitution model. A string of six |
---|
482 | digits identifies the model. For instance, 000000 corresponds to F81 (or JC69 provided the |
---|
483 | distribution of nucleotide frequencies is uniform). 012345 corresponds to GTR. This option can be |
---|
484 | used for encoding any model that is a nested within GTR. See Section \ref{sec:submenus} and Table \ref{tab:modelcode}. {\em NOTE:} |
---|
485 | the substitution parameters of the custom model will be optimised so as to maximise the |
---|
486 | likelihood. It is possible to specify and fix (i.e., avoid optimisation) the values of the |
---|
487 | substitution rates only through the PHYLIP-like interface. |
---|
488 | |
---|
489 | \item {\it Amino-acid based models}: \x{LG} (default) \x{| WAG | JTT | MtREV | Dayhoff | DCMut | RtREV |
---|
490 | | CpREV | VT | Blosum62 | MtMam | MtArt | HIVw | HIVb | custom} \\ |
---|
491 | The \x{custom} option is useful when one wants to use an amino-acid substitution model that is not |
---|
492 | available by default in PhyML. The symmetric part of the rate matrix, as well as the equilibrium |
---|
493 | amino-acid frequencies, are given in a file which name is asked for by the program. The format of |
---|
494 | this file is described in section \ref{sec:customaa}. |
---|
495 | \end{itemize} |
---|
496 | |
---|
497 | \begin{table}\index{custom models} |
---|
498 | \begin{center} |
---|
499 | \begin{tabular}{ll} |
---|
500 | \hline |
---|
501 | Name & Command-line option \\ |
---|
502 | \hline |
---|
503 | JC69 & \x{-m 000000 -f 0.25,0.25,0.25,0.25} \\ |
---|
504 | F81 & \x{-m 000000}\\ |
---|
505 | K80 & \x{-m 010010 -f 0.25,0.25,0.25,0.25} \\ |
---|
506 | HKY85 & \x{-m 010010}\\ |
---|
507 | TrNef & \x{-m 010020 -f 0.25,0.25,0.25,0.25} \\ |
---|
508 | TrN & \x{-m 010020}\\ |
---|
509 | K81 & \x{-m 123321 -f 0.25,0.25,0.25,0.25}\\ |
---|
510 | K81uf & \x{-m 123321}\\ |
---|
511 | TIMef & \x{-m 132241 -f 0.25,0.25,0.25,0.25}\\ |
---|
512 | TIM & \x{-m 132241}\\ |
---|
513 | TVMef & \x{-m 102304 -f 0.25,0.25,0.25,0.25}\\ |
---|
514 | TVM & \x{-m 102304}\\ |
---|
515 | SYM & \x{-m 123456 -f 0.25,0.25,0.25,0.25}\\ |
---|
516 | GTR & \x{-m 123456}\\ |
---|
517 | \hline |
---|
518 | \end{tabular} |
---|
519 | \caption{Nucleotide substitution model names (as defined in \cite{posada01}) and the corresponding |
---|
520 | custom model code used in PhyML.}\label{tab:modelcode} |
---|
521 | \end{center} |
---|
522 | \end{table} |
---|
523 | |
---|
524 | \item \x{--aa\_rate\_file file\_name}\index{command-line options!\x{--aa\_rate\_file}} \\ |
---|
525 | This option is compulsory when analysing amino-acid sequences under a `custom' model (see above). \x{file\_name} |
---|
526 | should provide a rate matrix and equilibrium amino acid in PAML format (see Section \ref{sec:customaa}). |
---|
527 | |
---|
528 | \item \x{-f e}, \x{m}, or ``\x{fA,fC,fG,fT}" \index{frequencies!nucleotide}\index{frequencies!amino-acid}\index{stationary frequencies}\index{command-line options!\x{-f}}\\ |
---|
529 | Nucleotide or amino-acid frequencies. |
---|
530 | \begin{itemize} |
---|
531 | \item \x{e} : the character frequencies are determined as follows : |
---|
532 | \begin{itemize} |
---|
533 | \item {\it Nucleotide sequences}: (Empirical) the equilibrium base frequencies are estimated by counting |
---|
534 | the occurence of the different bases in the alignment. |
---|
535 | \item {\it Amino-acid sequences}: (Empirical) the equilibrium amino-acid frequencies are estimated by counting |
---|
536 | the occurence of the different amino-acids in the alignment. |
---|
537 | \end{itemize} |
---|
538 | |
---|
539 | \item \x{m} : the character frequencies are determined as follows : |
---|
540 | \begin{itemize} |
---|
541 | \item {\it Nucleotide sequences}: (ML) the equilibrium base frequencies are estimated using maximum |
---|
542 | likelihood. |
---|
543 | \item {\it Amino-acid sequences}: (Model) the equilibrium amino-acid frequencies are estimated using |
---|
544 | the frequencies defined by the substitution model. |
---|
545 | \end{itemize} |
---|
546 | |
---|
547 | \item ``\x{fA,fC,fG,fT}": only valid for nucleotide-based models. \x{fA}, \x{fC}, \x{fG} and \x{fT} are floating numbers that |
---|
548 | correspond to the frequencies of A, C, G and T respectively. |
---|
549 | \end{itemize} |
---|
550 | |
---|
551 | \item \x{-t} (or \x{--ts/tv}) \x{ts/tv\_ratio} \index{$\kappa$}\index{ts/tv ratio}\index{command-line options!\x{--ts/tv}}\\ |
---|
552 | \x{ts/tv\_ratio}: transition/transversion ratio. DNA sequences only. Can be a fixed positive value |
---|
553 | (e.g., 4.0) or type \x{e} to get the maximum likelihood estimate. |
---|
554 | |
---|
555 | \item \x{-v} (or \x{--pinv}) \x{prop\_invar}\index{proportion of invariants}\index{invariable sites} \index{command-line options!\x{--pinv}}\\ |
---|
556 | \x{prop\_invar}: proportion of invariable sites. Can be a fixed value in the [0,1] range or type \x{e} to get the maximum likelihood estimate. |
---|
557 | |
---|
558 | \item \x{-c} (or \x{--nclasses}) \x{nb\_subst\_cat}\index{gamma distribution (discrete)!number of categories} \index{command-line options!\x{--nclasses}}\\ |
---|
559 | \x{nb\_subst\_cat}: number of relative substitution rate categories. Default: \x{nb\_subst\_cat=4}. Must be a positive integer. |
---|
560 | |
---|
561 | \item \x{-a} (or \x{--alpha}) \x{gamma} \index{gamma distribution (discrete)!shape parameter}\index{command-line options!\x{--alpha}} \\ |
---|
562 | \x{gamma}: value of the gamma shape parameter. Can be a fixed positive value or e to get the maximum |
---|
563 | likelihood estimate. The value of this parameter is estimated in the maximum likelihood framework by default. |
---|
564 | |
---|
565 | \item \x{--use\_median} \index{gamma distribution (discrete)!mean vs. median} \index{command-line options!\x{--use\_median}}\\ |
---|
566 | The middle of each substitution rate class in the discrete gamma distribution is taken as the |
---|
567 | median. The mean is used by default. |
---|
568 | |
---|
569 | \item \x{--free\_rates} \index{command-line options!\x{--free\_rates}}\\ |
---|
570 | As an alternative to the discrete gamma model, it is possible to estimate the (relative) rate in |
---|
571 | each class of the (mixture) model and the corresponding frequencies directly from the data. This |
---|
572 | model, called the FreeRate model, has more parameters |
---|
573 | than the discrete gamma one but usually provides a significantly better fit to the data. See |
---|
574 | \cite{soubrier12} for more information about this model and an illustration of its use. |
---|
575 | |
---|
576 | \item \x{--codpos} \x{1,2 or 3} \index{command-line options!\x{--codpos}}\\ |
---|
577 | When analysing an alignment of coding sequences, use this option to consider only the first, second |
---|
578 | or the third coding position for the estimation. |
---|
579 | |
---|
580 | \item \x{-s} (or \x{--search}) \x{move}\index{NNI}\index{SPR} \index{command-line options!\x{--search}}\\ |
---|
581 | Tree topology search operation option. Can be either \x{NNI} (default, fast) or \x{SPR} (usually |
---|
582 | slower than \x{NNI} but more accurate) or \x{BEST} (best of NNI and SPR search). |
---|
583 | |
---|
584 | \item \x{-u} (or \x{--inputtree}) \x{user\_tree\_file}\index{input tree}\index{user tree} \index{command-line options!\x{--inputtree}}\\ |
---|
585 | \x{user\_tree\_file}: starting tree filename. The tree must be in Newick format. |
---|
586 | |
---|
587 | \item \x{-o params}\index{optimisation!topology}\index{optimisation!substitution parameters} \index{command-line options!\x{-o}}\\ |
---|
588 | This option focuses on specific parameter optimisation. |
---|
589 | \begin{itemize} |
---|
590 | \item \x{params=tlr}: tree topology (\x{t}), branch length (\x{l}) and substitution rate parameters (\x{r}) are optimised. |
---|
591 | \item \x{params=tl}: tree topology and branch lengths are optimised. |
---|
592 | \item \x{params=lr}: branch lengths and substitution rate parameters are optimised. |
---|
593 | \item \x{params=l}: branch lengths are optimised. |
---|
594 | \item \x{params=r}: substitution rate parameters are optimised. |
---|
595 | \item \x{params=n}: no parameter is optimised. |
---|
596 | \end{itemize} |
---|
597 | |
---|
598 | \item \x{--rand\_start}\index{random tree} \index{command-line options!\x{--rand\_start}}\\ |
---|
599 | This option sets the initial tree to random. It is only valid if SPR searches are to be performed. |
---|
600 | |
---|
601 | \item \x{--n\_rand\_starts num} \index{command-line options!\x{--n\_rand\_starts}}\\ |
---|
602 | \x{num} is the number of initial random trees to be used. It is only valid if SPR searches are to be performed. |
---|
603 | |
---|
604 | \item \x{--r\_seed num}\index{random number} \index{command-line options!\x{--r\_seed}}\\ |
---|
605 | \x{num} is the seed used to initiate the random number generator. Must be an integer. |
---|
606 | |
---|
607 | \item \x{--print\_site\_lnl}\index{likelihood!print site likelihood} \index{command-line options!\x{--print\_site\_lk}}\\ |
---|
608 | Print the likelihood for each site in file *\_phyml\_lk.txt. For $\Gamma$ or $\Gamma$+I or |
---|
609 | FreeRate models, this option returns the posterior probability of each relative rate class at |
---|
610 | each site. Such information can then be used to identify fast- and slow-evolving regions of the alignment. |
---|
611 | |
---|
612 | \item \x{--print\_trace}\index{command-line options!\x{--print\_trace}}\\ |
---|
613 | Print each phylogeny explored during the tree search process in file *\_phyml\_trace.txt. This |
---|
614 | option can be useful for monitoring the progress of the analysis for very large data sets and have |
---|
615 | an approximate idea of what the final phylogeny will look like. |
---|
616 | |
---|
617 | \item \x{--run\_id ID\_string}\index{run ID} \index{command-line options!\x{--run\_id}}\\ |
---|
618 | Append the string ID\_string at the end of each PhyML output file. This option may be useful when |
---|
619 | running simulations involving PhyML. It can also be used to `tag' multiple analysis of the same data |
---|
620 | set with various program settings. |
---|
621 | |
---|
622 | \item \x{--no\_memory\_check} \index{command-line options!\x{--no\_memory\_check}}\\ |
---|
623 | By default, when processing a large data set, PhyML will pause and ask the user to confirm that |
---|
624 | she/he wants to continue with the execution of the analysis despite the large amount of memory |
---|
625 | required. The \x{--no\_memory\_check} skips this question. It is especially useful when running |
---|
626 | PhyML in batch mode. |
---|
627 | |
---|
628 | \item \x{--no\_colalias} \index{command-line options!\x{--no\_colalias}}\\ |
---|
629 | By default, PhyML preprocesses each alignment by putting together (or aliasing) the columns that are |
---|
630 | identical. Use this option to skip this step but be aware that the analysis might then take more |
---|
631 | time to complete. |
---|
632 | |
---|
633 | \item \x{--constrained\_lens} \index{command-line options!\x{--constrained\_lens}}\\ |
---|
634 | When an input tree with branch lengths is provided, this option will find the branch multiplier that |
---|
635 | maximises the likelihood (i.e., the relative branch lengths remain constant) |
---|
636 | |
---|
637 | \item \x{--constraint\_file} \x{file\_name} \index{command-line options!\x{--constraint\_file}}\\ |
---|
638 | \x{file\_name} lists the topological constraints under which the tree topology search is |
---|
639 | conducted. This option should be used in conjunction with \x{-u} \x{file\_name}. See Section |
---|
640 | \ref{sec:topoconstraints} for more information. |
---|
641 | |
---|
642 | \item \x{--quiet} \index{command-line options!\x{--quiet}}\\ |
---|
643 | Runs PhyML in quiet mode. The program will not pause if the memory required to run the analysis |
---|
644 | exceeds 256MB and will not output the log-likelihood score to the output. |
---|
645 | |
---|
646 | \end{itemize} |
---|
647 | |
---|
648 | \subsection{XML interface} |
---|
649 | \begin{itemize} |
---|
650 | \item \x{--xml=xml\_file\_name}\index{command-line options!\x{--xml}} \\ |
---|
651 | \x{xml\_file\_name} is the name of the XML file containing the information required to run the |
---|
652 | analysis. More details about this type of file is given in the section \ref{sec:xmlio}. |
---|
653 | \end{itemize} |
---|
654 | |
---|
655 | \subsection{Parallel bootstrap}\label{sec:parallel_bootstrap}\index{MPI}\index{bootstrap!parallel} |
---|
656 | |
---|
657 | Bootstrapping is a highly parallelizable task. Indeed, bootstrap replicates are independent from |
---|
658 | one another. Each bootstrap replicate can then be analysed separately. Modern computers often have |
---|
659 | more than one CPU. Each CPU can therefore be used to process a bootstrap sample. Using this parallel |
---|
660 | strategy, performing $R$ bootstrap replicates on $C$ CPUs `costs' the same amount of computation |
---|
661 | time as processing $R \times C$ bootstrap replicates on a single CPU. In other words, for a given |
---|
662 | number of replicates, the computation time is divided by $R$ compared to the non-parallel approach. |
---|
663 | |
---|
664 | PhyML sources must be compiled with specific options to turn on the parallel option (see Section |
---|
665 | \ref{sec:MPI}). Once the binary file (\x{phyml}) has been generated, running a bootstrap analysis |
---|
666 | with, say 100 replicates on 2 CPUs, can be done by typing the following command-line: |
---|
667 | \begin{verbatim} |
---|
668 | mpd &; |
---|
669 | mpirun -np 2 ./phyml -i seqfile -b 100; |
---|
670 | \end{verbatim} |
---|
671 | The first command launches the mpi daemon while the second launches the analysis. Note that |
---|
672 | launching the daemon needs to be done only once. The output files are similar to the ones generated |
---|
673 | using the standard, non-parallel, analysis (see Section \ref{sec:input_output}). Note that running |
---|
674 | the program in batch mode, i.e.: |
---|
675 | \begin{verbatim} |
---|
676 | mpirun -np 2 ./phyml -i seqfile -b 100 & |
---|
677 | \end{verbatim} |
---|
678 | will probably NOT work. I do not know how to run a mpi process in batch mode yet. Suggestions welcome... |
---|
679 | Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs |
---|
680 | required in the mpirun command. |
---|
681 | |
---|
682 | \section{Inputs \& outputs for command-line and PHYLIP interface }\label{sec:input_output} |
---|
683 | |
---|
684 | PhyML reads data from standard text files, without the need for any particular file name extension. |
---|
685 | |
---|
686 | \subsection{Sequence formats} |
---|
687 | |
---|
688 | \begin{figure} |
---|
689 | \begin{small} |
---|
690 | \begin{Verbatim}[frame=single, label=PHYLIP interleaved, samepage=true, baselinestretch=0.5] |
---|
691 | |
---|
692 | 5 80 |
---|
693 | seq1 CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT |
---|
694 | seq2 CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT |
---|
695 | seq3 RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT |
---|
696 | seq4 RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT |
---|
697 | seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT |
---|
698 | |
---|
699 | ATCKGCTTTTGGCAGGAAAT |
---|
700 | ATCKGCTTTTGGCGGGAAAT |
---|
701 | AGCKGCTGTTG????????? |
---|
702 | ATCTGCTTTTGGCGGGAAAT |
---|
703 | ATCTGCTTTTGGCGGGAAAT |
---|
704 | |
---|
705 | \end{Verbatim} |
---|
706 | \begin{Verbatim}[frame=single, label=PHYLIP sequential, samepage=true, baselinestretch=0.5] |
---|
707 | |
---|
708 | 5 40 |
---|
709 | seq1 CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG |
---|
710 | seq2 CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG |
---|
711 | seq3 RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX |
---|
712 | seq4 RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX |
---|
713 | seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx |
---|
714 | |
---|
715 | \end{Verbatim} |
---|
716 | \end{small} |
---|
717 | \label{fig:align_tree} |
---|
718 | \caption{\bf PHYLIP interleaved and sequential formats.} |
---|
719 | \end{figure} |
---|
720 | |
---|
721 | |
---|
722 | |
---|
723 | \begin{figure} |
---|
724 | \begin{small} |
---|
725 | \begin{Verbatim}[frame=single, label=Nexus nucleotides, samepage=true, baselinestretch=0.5] |
---|
726 | |
---|
727 | [ This is a comment ] |
---|
728 | #NEXUS |
---|
729 | BEGIN DATA; |
---|
730 | DIMENSIONS NTAX=10 NCHAR=20; |
---|
731 | FORMAT DATATYPE=DNA; |
---|
732 | MATRIX |
---|
733 | tax1 ?ATGATTTCCTTAGTAGCGG |
---|
734 | tax2 CAGGATTTCCTTAGTAGCGG |
---|
735 | tax3 ?AGGATTTCCTTAGTAGCGG |
---|
736 | tax4 ?????????????GTAGCGG |
---|
737 | tax5 CAGGATTTCCTTAGTAGCGG |
---|
738 | tax6 CAGGATTTCCTTAGTAGCGG |
---|
739 | tax7 ???GATTTCCTTAGTAGCGG |
---|
740 | tax8 ???????????????????? |
---|
741 | tax9 ???GGATTTCTTCGTAGCGG |
---|
742 | tax10 ???????????????AGCGG; |
---|
743 | END; |
---|
744 | |
---|
745 | \end{Verbatim} |
---|
746 | \end{small} |
---|
747 | |
---|
748 | \begin{small} |
---|
749 | \begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5] |
---|
750 | |
---|
751 | [ This is a comment ] |
---|
752 | #NEXUS |
---|
753 | BEGIN DATA; |
---|
754 | DIMENSIONS NTAX=10 NCHAR=20; |
---|
755 | FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3"; |
---|
756 | MATRIX |
---|
757 | tax1 ?0320333113302302122 |
---|
758 | tax2 10220333113302302122 |
---|
759 | tax3 ?0220333113302302122 |
---|
760 | tax4 ?????????????2302122 |
---|
761 | tax5 10220333113302302122 |
---|
762 | tax6 10220333113302302122 |
---|
763 | tax7 ???20333113302302122 |
---|
764 | tax8 ???????????????????? |
---|
765 | tax9 ???22033313312302122 |
---|
766 | tax10 ???????????????02122; |
---|
767 | END; |
---|
768 | |
---|
769 | \end{Verbatim} |
---|
770 | \end{small} |
---|
771 | |
---|
772 | \begin{small} |
---|
773 | \begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5] |
---|
774 | |
---|
775 | [ This is a comment ] |
---|
776 | #NEXUS |
---|
777 | BEGIN DATA; |
---|
778 | DIMENSIONS NTAX=10 NCHAR=20; |
---|
779 | FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03"; |
---|
780 | MATRIX |
---|
781 | tax1 ??00030200030303010103030002030002010202 |
---|
782 | tax2 0100020200030303010103030002030002010202 |
---|
783 | tax3 ??00020200030303010103030002030002010202 |
---|
784 | tax4 ??????????????????????????02030002010202 |
---|
785 | tax5 0100020200030303010103030002030002010202 |
---|
786 | tax6 0100020200030303010103030002030002010202 |
---|
787 | tax7 ??????0200030303010103030002030002010202 |
---|
788 | tax8 ???????????????????????????????????????? |
---|
789 | tax9 ??????0202000303030103030102030002010202 |
---|
790 | tax10 ??????????????????????????????0002010202; |
---|
791 | END; |
---|
792 | |
---|
793 | \end{Verbatim} |
---|
794 | \end{small} |
---|
795 | \caption{\bf NEXUS formats.}\label{fig:nexus} |
---|
796 | \end{figure} |
---|
797 | |
---|
798 | |
---|
799 | Alignments of DNA or protein sequences must be in PHYLIP\index{PHYLIP} or NEXUS |
---|
800 | \cite{maddison97}\index{NEXUS} sequential\index{sequential} or interleaved\index{interleaved} format |
---|
801 | (Figures \ref{fig:align_tree} and \ref{fig:nexus}). For PHYLIP formated sequence alignments, the |
---|
802 | first line of the input file contains the number of species and the number of characters, in free |
---|
803 | format, separated by blank characters. One slight difference with PHYLIP format deals with sequence |
---|
804 | name lengths. While PHYLIP format limits this length to ten characters, PhyML can read up to |
---|
805 | hundred character long sequence names. Blanks and the symbols ``(),:'' are not allowed within |
---|
806 | sequence names because the Newick tree format makes special use of these symbols. Another slight |
---|
807 | difference with PHYLIP format is that actual sequences must be separated from their names by at |
---|
808 | least one blank character. |
---|
809 | |
---|
810 | A PHYLIP input sequence file may also display more than a single data set. Each of these data sets |
---|
811 | must be in PHYLIP format and two successive alignments must be separated by an empty |
---|
812 | line. Processing multiple data sets requires to toggle the `\x{M}' option in the {\em Input Data} |
---|
813 | sub-menu or use the `\x{-n}' command line option and enter the number of data sets to analyse. The |
---|
814 | multiple data set option can be used to process re-sampled data that were generated using a |
---|
815 | non-parametric procedure such as cross-validation or jackknife (a bootstrap option is already |
---|
816 | included in PhyML). This option is also useful in multiple gene studies, even if fitting the same |
---|
817 | substitution model to all data sets may not be suitable. |
---|
818 | |
---|
819 | PhyML can also process alignments in NEXUS format. Although not all the options provided by this |
---|
820 | format are supported by PhyML, a few specific features are exploited. Of course, this format can |
---|
821 | handle nucleotide and protein sequence alignments in sequential or interleaved format. It is also |
---|
822 | possible to use custom alphabets, replacing the standard 4-state and 20-state alphabets for |
---|
823 | nucleotides and amino-acids respectively. Examples of a 4-state custom alphabet are given in Figure |
---|
824 | \ref{fig:nexus}. Each state must here correspond to one digit or more. The set of states must be a |
---|
825 | list of consecutive digits starting from 0. For instance, the list ``0, 1, 3, 4'' is not a valid |
---|
826 | alphabet. Each state in the symbol list must be separated from the next one by a space. Hence, |
---|
827 | alphabets with large number of states can be easily defined by using two-digit number (starting with |
---|
828 | 00 up to 19 for a 20 state alphabet). Most importantly, this feature gives the opportunity to |
---|
829 | analyse data sets made of presence/absence character states (use the \texttt{symbols=``0 1''} option |
---|
830 | for such data).\index{binary characters} Alignments made of custom-defined states will be processed |
---|
831 | using the Jukes and Cantor model. Other options of the program (e.g., number of rate classes, tree |
---|
832 | topology search algorithm) are freely configurable. Note that, at the moment, the maximum number of |
---|
833 | different states is set to 22 in order to save memory space. It is however possible to lift this |
---|
834 | threshold by modifiying the value of the variable \x{T\_MAX\_ALPHABET} in the file |
---|
835 | `\x{utilities.h}'. The program will then have to be re-compiled. |
---|
836 | |
---|
837 | |
---|
838 | |
---|
839 | \subsubsection{Gaps and ambiguous characters} |
---|
840 | |
---|
841 | Gaps correspond to the `\x{-}' symbol. They are systematically treated as unknown characters ``on |
---|
842 | the grounds that we don't know what would be there if something were there'' (J. Felsenstein, |
---|
843 | PHYLIP main documentation). The likelihood at these sites is summed over all the possible states |
---|
844 | (i.e., nucleotides or amino acids) that could actually be observed at these particular |
---|
845 | positions. Note however that columns of the alignment that display only gaps or unknown characters |
---|
846 | are simply discarded because they do not carry any phylogenetic information (they are equally well |
---|
847 | explained by any model). PhyML also handles ambiguous characters such as $R$ for $A$ or $G$ |
---|
848 | (purines) and $Y$ for $C$ or $T$ (pyrimidines). Tables \ref{tab:ambigu_nt} and \ref{tab:ambigu_aa} |
---|
849 | give the list of valid characters/symbols and the corresponding nucleotides or amino acids. |
---|
850 | |
---|
851 | \begin{table} |
---|
852 | \begin{center} |
---|
853 | \begin{tabular}{lr|lr} |
---|
854 | \hline |
---|
855 | Character & Nucleotide & Character & Nucleotide \\ |
---|
856 | \hline |
---|
857 | $A$ & Adenosine & $Y$ & $C$ or $T$ \\ |
---|
858 | $G$ & Guanine & $K$ & $G$ or $T$ \\ |
---|
859 | $C$ & Cytosine & $B$ & $C$ or $G$ or $T$\\ |
---|
860 | $T$ & Thymine & $D$ & $A$ or $G$ or $T$ \\ |
---|
861 | $U$ & Uracil (=$T$) & $H$ & $A$ or $C$ or $T$ \\ |
---|
862 | $M$ & $A$ or $C$ & $V$ & $A$ or $C$ or $G$ \\ |
---|
863 | $R$ & $A$ or $G$ & $-$ or $N$ or $X$ or $?$ & unknown \\ |
---|
864 | $W$ & $A$ or $T$ & & (=$A$ or $C$ or $G$ or $T$)\\ |
---|
865 | $S$ & $C$ or $G$ & & \\ |
---|
866 | \hline |
---|
867 | \end{tabular} |
---|
868 | \end{center} |
---|
869 | \caption{{\bf List of valid characters in DNA sequences and the corresponding nucleotides.}}\label{tab:ambigu_nt} |
---|
870 | \end{table} |
---|
871 | \begin{table} |
---|
872 | \begin{center} |
---|
873 | \begin{tabular}{lr|lr} |
---|
874 | \hline |
---|
875 | Character & Amino-Acid & Character & Amino-Acid \\ |
---|
876 | \hline |
---|
877 | $A$ & Alanine & $L$ & Leucine \\ |
---|
878 | $R$ & Arginine & $K$ & Lysine \\ |
---|
879 | $N$ or $B$& Asparagine & $M$ & Methionine \\ |
---|
880 | $D$ & Aspartic acid & $F$ & Phenylalanine \\ |
---|
881 | $C$ & Cysteine & $P$ & Proline \\ |
---|
882 | $Q$ or $Z$& Glutamine & $S$ & Serine \\ |
---|
883 | $E$ & Glutamic acid & $T$ & Threonine \\ |
---|
884 | $G$ & Glycine & $W$ & Tryptophan \\ |
---|
885 | $H$ & Histidine & $Y$ & Tyrosine \\ |
---|
886 | $I$ & Isoleucine & $V$ & Valine \\ |
---|
887 | $L$ & Leucine & $-$ or $X$ or $?$ & unknown \\ |
---|
888 | $K$ & Lysine & & (can be any amino acid) \\ |
---|
889 | \hline |
---|
890 | \end{tabular} |
---|
891 | \end{center} |
---|
892 | \caption{{\bf List of valid characters in protein sequences and the corresponding amino acids.}}\label{tab:ambigu_aa} |
---|
893 | \end{table} |
---|
894 | |
---|
895 | \subsubsection{Specifying outgroup sequences}\label{sec:outgroupspecify} |
---|
896 | |
---|
897 | PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In |
---|
898 | order to do so, sequence names that display a `*' character will be automatically considered as |
---|
899 | belonging to the outgroup. |
---|
900 | |
---|
901 | The topology of the rooted tree is exactly the same as the unrooted version of the same tree. In |
---|
902 | other words, PhyML first ignores the distinction between ingroup and outgroup sequences, builds a |
---|
903 | maximum likelihood unrooted tree and then tries to add the root. If the outgroup has more than one |
---|
904 | sequence, the position of the root might be ambiguous. In such situation, PhyML tries to identify |
---|
905 | the most relevant position of the root by considering which edge provides the best separation |
---|
906 | between ingroup and outgroup taxa (i.e., we are trying to make the outgroup ``as monophyletic as |
---|
907 | possible''). |
---|
908 | |
---|
909 | \subsection{Tree format} |
---|
910 | |
---|
911 | PhyML can read one or several phylogenetic trees from an input file. This option is accessible |
---|
912 | through the {\em Tree Searching} sub menu or the `\x{-u}' argument from the command line. Input |
---|
913 | trees are generally used as initial maximum likelihood estimates to be subsequently adjusted by the |
---|
914 | tree searching algorithm. Trees can be either rooted or unrooted and multifurcations are allowed. |
---|
915 | Taxa names must, of course, match the corresponding sequence names. |
---|
916 | |
---|
917 | \begin{figure}[h] |
---|
918 | \begin{small} |
---|
919 | \begin{minipage}{\textwidth} |
---|
920 | \begin{verbatim} |
---|
921 | ((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01); |
---|
922 | ((seq3,seq2),seq1,(seq4,seq5)); |
---|
923 | \end{verbatim} |
---|
924 | \end{minipage} |
---|
925 | \end{small} |
---|
926 | \caption{{\bf Input trees}. The first tree (top) is rooted and has branch lengths. The second tree |
---|
927 | (bottom) is unrooted and does not have branch lengths.} |
---|
928 | \label{fig:trees}\index{Newick format} |
---|
929 | \end{figure} |
---|
930 | |
---|
931 | |
---|
932 | \subsection{Multiple alignments and trees}\index{multiple data sets} |
---|
933 | |
---|
934 | Single or multiple sequence data sets may be used in combination with single or multiple input |
---|
935 | trees. When the number of data sets is one ($n_D = 1$) and there is only one input tree ($n_T = 1$), |
---|
936 | then this tree is simply used as input for the single data set analysis. When $n_D = 1$ and $n_T > |
---|
937 | 1$, each input tree is used successively for the analysis of the single alignment. PhyML then |
---|
938 | outputs the tree with the highest likelihood. If $n_D > 1$ and $n_T = 1$, the same input tree is |
---|
939 | used for the analysis of each data set. The last combination is $n_D > 1$ and $n_T > 1$. In this |
---|
940 | situation, the $i$-th tree in the input tree file is used to analyse the $i$-th data set. Hence, |
---|
941 | $n_D$ and $n_T$ must be equal here. |
---|
942 | |
---|
943 | |
---|
944 | \subsection{Custom amino-acid rate model}\label{sec:customaa} |
---|
945 | |
---|
946 | The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded |
---|
947 | in PhyML. This model must be time-reversible. Hence, the matrix of substitution rates is |
---|
948 | symmetrical. The format of the rate matrix with the associated stationary frequencies is identical |
---|
949 | to the one used in PAML\index{PAML}. An example is given below: |
---|
950 | |
---|
951 | \begin{center} |
---|
952 | {\tiny |
---|
953 | \begin{tabular}{p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}} |
---|
954 | % Ala & Arg & Asn & Asp & Cys & Gln & Glu & Gly & His & Ile & Leu & Lys & Met & Phe & Pro & Ser & Thr & Trp & Tyr & Val \\ |
---|
955 | &&&&&&&&&&&&&&&&&&& \\ |
---|
956 | 0.55 & &&&&&&&&&&&&&&&&&& \\ |
---|
957 | 0.51 & 0.64 & &&&&&&&&&&&&&&&& \\ |
---|
958 | 0.74 & 0.15 & 5.43 & &&&&&&&&&&&&&&&& \\ |
---|
959 | 1.03 & 0.53 & 0.27 & 0.03 & &&&&&&&&&&&&&&& \\ |
---|
960 | 0.91 & 3.04 & 1.54 & 0.62 & 0.10 & &&&&&&&&&&&&&& \\ |
---|
961 | 1.58 & 0.44 & 0.95 & 6.17 & 0.02 & 5.47 & &&&&&&&&&&&&& \\ |
---|
962 | 1.42 & 0.58 & 1.13 & 0.87 & 0.31 & 0.33 & 0.57 & &&&&&&&&&&&& \\ |
---|
963 | 0.32 & 2.14 & 3.96 & 0.93 & 0.25 & 4.29 & 0.57 & 0.25 & &&&&&&&&&&& \\ |
---|
964 | 0.19 & 0.19 & 0.55 & 0.04 & 0.17 & 0.11 & 0.13 & 0.03 & 0.14 & &&&&&&&&&& \\ |
---|
965 | 0.40 & 0.50 & 0.13 & 0.08 & 0.38 & 0.87 & 0.15 & 0.06 & 0.50 & 3.17 & &&&&&&&&& \\ |
---|
966 | 0.91 & 5.35 & 3.01 & 0.48 & 0.07 & 3.89 & 2.58 & 0.37 & 0.89 & 0.32 & 0.26 & &&&&&&&& \\ |
---|
967 | 0.89 & 0.68 & 0.20 & 0.10 & 0.39 & 1.55 & 0.32 & 0.17 & 0.40 & 4.26 & 4.85 & 0.93 & &&&&&&& \\ |
---|
968 | 0.21 & 0.10 & 0.10 & 0.05 & 0.40 & 0.10 & 0.08 & 0.05 & 0.68 & 1.06 & 2.12 & 0.09 & 1.19 & &&&&&& \\ |
---|
969 | 1.44 & 0.68 & 0.20 & 0.42 & 0.11 & 0.93 & 0.68 & 0.24 & 0.70 & 0.10 & 0.42 & 0.56 & 0.17 & 0.16 & &&&&& \\ |
---|
970 | 3.37 & 1.22 & 3.97 & 1.07 & 1.41 & 1.03 & 0.70 & 1.34 & 0.74 & 0.32 & 0.34 & 0.97 & 0.49 & 0.55 & 1.61 & &&&& \\ |
---|
971 | 2.12 & 0.55 & 2.03 & 0.37 & 0.51 & 0.86 & 0.82 & 0.23 & 0.47 & 1.46 & 0.33 & 1.39 & 1.52 & 0.17 & 0.80 & 4.38 & &&& \\ |
---|
972 | 0.11 & 1.16 & 0.07 & 0.13 & 0.72 & 0.22 & 0.16 & 0.34 & 0.26 & 0.21 & 0.67 & 0.14 & 0.52 & 1.53 & 0.14 & 0.52 & 0.11 & && \\ |
---|
973 | 0.24 & 0.38 & 1.09 & 0.33 & 0.54 & 0.23 & 0.20 & 0.10 & 3.87 & 0.42 & 0.40 & 0.13 & 0.43 & 6.45 & 0.22 & 0.79 & 0.29 & 2.49 & & \\ |
---|
974 | 2.01 & 0.25 & 0.20 & 0.15 & 1.00 & 0.30 & 0.59 & 0.19 & 0.12 & 7.82 & 1.80 & 0.31 & 2.06 & 0.65 & 0.31 & 0.23 & 1.39 & 0.37 & 0.31 & \\ |
---|
975 | \\ |
---|
976 | 8.66 & 4.40 & 3.91 & 5.70 & 1.93 & 3.67 & 5.81 & 8.33 & 2.44 & 4.85 & 8.62 & 6.20 & 1.95 & 3.84 & 4.58 & 6.95 & 6.10 & 1.44 & 3.53 & 7.09 \\ |
---|
977 | \end{tabular} |
---|
978 | } |
---|
979 | \end{center} |
---|
980 | |
---|
981 | The entry on the $i$-th row and $j$-th column of this matrix corresponds to the rate of |
---|
982 | substitutions between amino-acids $i$ and $j$. The last line in the file gives the stationary |
---|
983 | frequencies and must be separated from the rate matrix by one line. The ordering of the amino-acids |
---|
984 | is alphabetical, i.e, Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, |
---|
985 | Ser, Thr, Trp, Tyr and Val. |
---|
986 | |
---|
987 | |
---|
988 | \subsection{Topological constraint file}\label{sec:topoconstraints} |
---|
989 | |
---|
990 | PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In |
---|
991 | order to do so, one should use the \x{--constraint\_file} \x{file\_name} command-line option where |
---|
992 | \x{file\_name} lists the topological constraints. Such constraints are straightforward to |
---|
993 | define. For instance, the following constraints: |
---|
994 | \vspace{0.2cm} |
---|
995 | \begin{Verbatim} |
---|
996 | ((A,B,C),(D,E,F)); |
---|
997 | \end{Verbatim} |
---|
998 | indicate that taxa A, B and C belong to the same clade. D, E and F also belong to the |
---|
999 | same clade and the two clades hence defined should not overlap. Under these two |
---|
1000 | constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that |
---|
1001 | the constraints are defined using a multifurcating tree in NEWICK format. Note that this tree does |
---|
1002 | not need to display the whole list of taxa. For instance, while the only taxa involved in specifying |
---|
1003 | topological constraints above are A, B, C, D, E \& F, the actual data set could include more than |
---|
1004 | these six taxa only. |
---|
1005 | |
---|
1006 | PhyML tree topology search algorithms all rely on improving a starting tree. By default, BioNJ is |
---|
1007 | the method of choice for building this tree. However, there is no guarantee that the phylogeny |
---|
1008 | estimated with PhyML does comply with the topological constraints. While it is probably possible to |
---|
1009 | implement BioNJ with topological constraints, we have not done so yet. Instead, the same |
---|
1010 | multifurcating tree that defines the topological constraints should also be used as starting tree |
---|
1011 | using the \x{-u} (\x{--inputtree}) option. Altogether, the command line should look like the |
---|
1012 | following: \x{-u}=\x{file\_name} \x{--constraint\_file}=\x{file\_name}. It is not possible to use as |
---|
1013 | input tree a non-binary phylogeny that is distinct from that provided in the constraint tree |
---|
1014 | file. However, any binary tree compatible with the constraint one can be used as input tree. |
---|
1015 | |
---|
1016 | |
---|
1017 | \subsection{Output files} |
---|
1018 | |
---|
1019 | \begin{table} |
---|
1020 | Sequence file name~: `{\x seq}'\\ |
---|
1021 | \begin{center} |
---|
1022 | \begin{tabular}{ll} |
---|
1023 | \hline |
---|
1024 | Output file name & Content \\ |
---|
1025 | \hline |
---|
1026 | \x{seq\_phyml\_tree.txt} & ML tree\\ |
---|
1027 | \x{seq\_phyml\_stats.txt} & ML model parameters\\ |
---|
1028 | \x{seq\_phyml\_boot\_trees.txt} & ML trees -- bootstrap replicates\\ |
---|
1029 | \x{seq\_phyml\_boot\_stats.txt} & ML model parameters -- bootstrap replicates \\ |
---|
1030 | \x{seq\_phyml\_rand\_trees.txt} & ML trees -- multiple random starts\\ |
---|
1031 | \hline |
---|
1032 | \end{tabular} |
---|
1033 | \end{center} |
---|
1034 | \caption{{\bf Standard output files}}\label{tab:output} |
---|
1035 | \end{table} |
---|
1036 | |
---|
1037 | Table \ref{tab:output} presents the list of files resulting from an analysis. Basically, each |
---|
1038 | output file name can be divided into three parts. The first part is the sequence file name, the |
---|
1039 | second part corresponds to the extension `\x{\_phyml\_}' and the third part is related to the file |
---|
1040 | content. When launched with the default options, PhyML only generates two files: the tree file and |
---|
1041 | the model parameter file. The estimated maximum likelihood tree is in standard Newick format (see |
---|
1042 | Figure \ref{fig:trees}). The model parameters file, or statistics file, displays the maximum |
---|
1043 | likelihood estimates of the substitution model parameters, the likelihood of the maximum likelihood |
---|
1044 | phylogenetic model, and other important information concerning the settings of the analysis (e.g., |
---|
1045 | type of data, name of the substitution model, starting tree, etc.). Two additional output files are |
---|
1046 | created if bootstrap supports were evaluated. These files simply contain the maximum likelihood |
---|
1047 | trees and the substitution model parameters estimated from each bootstrap replicate. Such |
---|
1048 | information can be used to estimate sampling errors around each parameter of the phylogenetic model. |
---|
1049 | When the random tree option is turned on, the maximum likelihood trees estimated from each random |
---|
1050 | starting trees are printed in a separate tree file (see last row of Table \ref{tab:output}). |
---|
1051 | |
---|
1052 | \subsection{Treatment of invariable sites with fixed branch lengths} |
---|
1053 | |
---|
1054 | PhyML allows users to give an input tree with fixed topology and branch lengths and find the |
---|
1055 | proportion of invariable sites that maximise the likelihood (option \x{-o r}). These two options can |
---|
1056 | be considered as conflicting since branch lengths depend on the proportion of invariants. Hence, |
---|
1057 | changing the proportion of invariants implies that branch lengths are changing too. More formally, |
---|
1058 | let $l$ denote the length of a branch, i.e., the expected number of substitutions per site, and $p$ |
---|
1059 | be the proportion of invariants. We have $l = (1-p)l'$, where $l'$ is the expected number of |
---|
1060 | substitutions at variable sites. When asked to optimize $p$ but leave $l$ unchanged, PhyML |
---|
1061 | does the following: |
---|
1062 | \begin{enumerate} |
---|
1063 | \item Calculate $l' = l/(1-p)$ and leave $l'$ unchanged throughout the optimization. |
---|
1064 | \item Find the value of the proportion of invariants that maximises the likelihood. Let $p^{*}$ denote this value. |
---|
1065 | \item Set $l^{*} = (1-p^{*})l'$ and print out the tree with $l^{*}$ (instead of $l$). |
---|
1066 | \end{enumerate} |
---|
1067 | |
---|
1068 | PhyML therefore assumes that the users wants to fix the branch lengths measured at variable |
---|
1069 | sites only (i.e., $l^{*}$ is fixed). This is the reason why the branch lengths in the input and |
---|
1070 | output trees do differ despite the use of the the \x{-o r} option. While we believe that this |
---|
1071 | approach relies on a sound rationale, it is not perfect. In particular, the original transformation |
---|
1072 | of branch lengths ($l' = l/(1-p)$) relies on a default value for $p$ with is set to 0.2 in |
---|
1073 | practice. It is difficult to justify the use of this value rather than another one. One suggestion |
---|
1074 | proposed by Bart Hazes is to avoid fixing the branch lengths altogether and rather estimate the |
---|
1075 | value of a scaling factor applied to each branch length in the input tree (option |
---|
1076 | \x{--contrained\_lens}). We agree that this solution probably matches very well most users |
---|
1077 | expectation, i.e., ``find the best value of $p$ while constraining the ratio of branch lengths to be |
---|
1078 | that given in the input tree''. Please feel free to send us your suggestions regarding this problem |
---|
1079 | by posting on forum (\url{http://groups.google.com/group/phyml-forum}). |
---|
1080 | |
---|
1081 | |
---|
1082 | \section{Inputs \& outputs for the XML interface }\label{sec:xmlio}\index{XML} |
---|
1083 | |
---|
1084 | \subsection{Mixture models in PhyML}\index{mixture models}\label{sec:mixtures} |
---|
1085 | |
---|
1086 | PhyML implements a wide range of mixture models. The discrete gamma model \cite{yang94b} is arguably |
---|
1087 | the most popular of these models in phylogenetics. However, in theory, mixture models are not |
---|
1088 | restricted to the description of the variation of substitution rates across sites. For instance, if |
---|
1089 | there are good reasons to believe that the relative rates of substitution between nucleotides vary |
---|
1090 | along the sequence alignments, it makes sense to use a mixture of GTR models. Consider the case |
---|
1091 | where substitutions between $A$ and $C$ occur at high rate in some regions of the alignment and low |
---|
1092 | rate elsewhere, a mixture with two classes, each class having its own GTR rate matrix, would be |
---|
1093 | suitable. The likelihood at any site of the alignment is then obtained by averaging the likelihoods |
---|
1094 | obtained for each GTR rate matrix, with the same weight given to each of these matrices. |
---|
1095 | |
---|
1096 | PhyML implements a generic framework that allows users to define mixtures on substitution rates, |
---|
1097 | rate matrices and nucleotide or amino-acid equilibrium frequencies. Each class of the mixture model |
---|
1098 | is built by assembling a substitution rate, a rate matrix\footnote{the rate matrix corresponds here |
---|
1099 | the symmetrical matrix giving the so-called ``echangeability rates''} and a vector of equilibrium |
---|
1100 | frequencies. For instance, let $\{R_1,R_2,R_3\}$ be a set of substitution rates, $\{M_1,M_2\}$ a |
---|
1101 | set of rate matrices and $\{F_1,F_2\}$ a set of vectors of equilibrium frequencies. One could then |
---|
1102 | define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as |
---|
1103 | $\mathcal{C}_2 = \{R_2,M_1,F_1\}$, and a third class as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. If |
---|
1104 | $R_1$, $R_2$ and $R_3$ correspond to slow, medium and fast substitution rates, then this mixture |
---|
1105 | model allows the fast evolving rates to have their own vector of equilibrium frequencies and rate |
---|
1106 | matrix, distinct from that found at the medium or slow evolving sites. The likelihood at any given |
---|
1107 | site $D_s$ of the alignment is then: |
---|
1108 | \begin{eqnarray}\label{equ:mixtlk} |
---|
1109 | \Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s | \mathcal{C}_s=c) \Pr(\mathcal{C}_s=c), |
---|
1110 | \end{eqnarray} |
---|
1111 | where $\Pr(\mathcal{C}_s=c)$ is obtained by multiplying the probability (density) of the three |
---|
1112 | components (i.e., rate, matrix, frequencies). For instance, $\Pr(\mathcal{C}_1=\{R_1,M_1,F_1\}) = |
---|
1113 | \Pr(R_1)\times \Pr(M_1) \times \Pr(F_1)$. |
---|
1114 | We therefore assume here that substitution rates, rate |
---|
1115 | matrices and equilibrium frequencies are independent from one another. |
---|
1116 | |
---|
1117 | Note that, using the same substitution rates, rate matrices and vector of equilibrium frequencies, |
---|
1118 | it is possible to construct many other mixture models. For instance, the mixture model with the |
---|
1119 | largest number of classes can be created by considering all the combinations of these three |
---|
1120 | components. We would then get a mixture of $3\times 2 \times 2=12$ classes, corresponding to all |
---|
1121 | the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies. |
---|
1122 | |
---|
1123 | |
---|
1124 | % : $\mathcal{C}_1 = |
---|
1125 | % \{R_1,M_1,F_1\}$ $\mathcal{C}_2 = \{R_1,M_1,F_2\}$, $\mathcal{C}_3 = \{R_1,M_2,F_1\}$, |
---|
1126 | % $\mathcal{C}_4 = \{R_1,M_2,F_2\}$, $\mathcal{C}_5 = \{R_2,M_1,F_1\}$, $\mathcal{C}_6 = |
---|
1127 | % \{R_2,M_1,F_2\}$, $\mathcal{C}_7 = \{R_2,M_2,F_1\}$, $\mathcal{C}_8 = \{R_2,M_2,F_2\}$, |
---|
1128 | % $\mathcal{C}_9 = \{R_3,M_1,F_1\}$, $\mathcal{C}_{10} = \{R_3,M_1,F_2\}$, $\mathcal{C}_{11} = |
---|
1129 | % \{R_3,M_2,F_1\}$ and $\mathcal{C}_{12} = \{R_3,M_2,F_2\}$. |
---|
1130 | |
---|
1131 | |
---|
1132 | \subsection{Partitions}\index{partitionned analysis}\index{data partitions} |
---|
1133 | |
---|
1134 | We first introduce some terms of vocabulary that have not been presented before. A partitionned data |
---|
1135 | set, also referred to as partition, is a set of partition elements. Typically, a partitionned data |
---|
1136 | set will be made of a set of distinct gene alignments. A partition element will then correspond to |
---|
1137 | one (or several) of these gene alignments. Note that the biology litterature often uses the term |
---|
1138 | partition to refer to an element of a partitionned data. We thus use here instead the mathematical |
---|
1139 | definition of the terms `partition' and `partition element'. |
---|
1140 | |
---|
1141 | Phylogenetics models usually assume individual columns of an alignment to evolve independently from |
---|
1142 | one another. Codon-based models (e.g., \cite{yang98,yang00b,yang02,guindon04}) are exceptions to |
---|
1143 | this rule since the substitution process applies here to triplets of consecutive sites of coding |
---|
1144 | sequences. The non-independence of the substitution process at the three coding positions (due to |
---|
1145 | the specificities of the genetic code), can therefore be accounted for. Assuming that sites evolve |
---|
1146 | independently does not mean that a distinct model is fitted to each site of the alignment. |
---|
1147 | Estimating the parameters of these models would not make much sense in practice due to the very |
---|
1148 | limited amount of phylogenetic signal conveyed by individual sites. Site independence means instead |
---|
1149 | that the columns of the observed alignment were sampled randomly from the same ``population of |
---|
1150 | columns''. The stochasticity of the substitution process running along the tree is deemed |
---|
1151 | responsible to the variability of site patterns. |
---|
1152 | |
---|
1153 | Some parameters of the phylogenetic model are considered to be common to all the sites in the |
---|
1154 | alignment. The tree topology is typically one such parameter. The transition/transversion ratio is |
---|
1155 | also generally assumed to be the same for all columns. Other parameters can vary from site to site. |
---|
1156 | The rate at which substitutions accumulate is one of these parameters. Hence, different sites can |
---|
1157 | have distinct rates. However, such rates are all ``drawn'' from the same probabilitic distribution |
---|
1158 | (generally a discrete Gamma density). Hence, while different sites may have distinct rates of |
---|
1159 | evolution, they all share the same {\em distribution} of rates. |
---|
1160 | |
---|
1161 | This reasonning also applies on a larger scale. When analysing multiple genes, one can indeed |
---|
1162 | assume that the same mechanism generated the different site patterns observed for every gene. Here |
---|
1163 | again, we can assume that all the genes share the same underlying tree topology (commonly refered to |
---|
1164 | as the ``species tree''). Other parameters of the phylogenetic model, such as branch lengths for |
---|
1165 | instance, might be shared across genes. However, due to the specificities of the gene evolution |
---|
1166 | processes, some model parameters need to be adjusted for each gene separately. To sum up, the |
---|
1167 | phylogenetic analysis of partitionned data requires flexible models with parameters, or distribution |
---|
1168 | of parameters, shared across several partition elements and other parameters estimated separately |
---|
1169 | for each element of the partition. |
---|
1170 | |
---|
1171 | The likelihood of a data set made of the concatenation of $n$ sequence alignments noted $D^{(1)}$, |
---|
1172 | $D^{(2)}, \ldots, D^{(n)}$ is then obtained as follow: |
---|
1173 | \begin{eqnarray} |
---|
1174 | \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \Pr(D^{(i)}) \nonumber\\ |
---|
1175 | &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s), |
---|
1176 | \end{eqnarray} where $L_i$ is the number of site columns in partition element $i$. $\Pr(D^{(i)}_s)$ |
---|
1177 | is then obtained using Equation \ref{equ:mixtlk}, i.e., by summing over the different classes of the |
---|
1178 | mixture model that applies to site $s$ for partition element $i$. Hence, the joint probability of |
---|
1179 | all the partition elements is here broken down into the product of likelihood of every site for each |
---|
1180 | partition element. As noted just above, any given component of the mixture model at a given |
---|
1181 | particular site is shared by the other sites that belong to the same partition element and, for some |
---|
1182 | of them, by sites in other partition elements (e.g., the same tree topology is shared by all the |
---|
1183 | sites, throughout all the partition elements). |
---|
1184 | |
---|
1185 | PhyML implements a wide variety of partition models. The only parameter that is constrained to be |
---|
1186 | shared by all the partition elements is the tree topology. This constraint makes sense when |
---|
1187 | considering distantly related taxa, typically inter-species data. For closely related taxa, i.e., |
---|
1188 | when analysing intra-species or population-level data, not all the genes might have the same |
---|
1189 | evolutionary history. Recombination events combined to the incomplete lineage sorting phenomenon |
---|
1190 | can generate discrepancies between the gene trees and the underlying species tree (see |
---|
1191 | \cite{degnan09} for a review). The phylogenetic softwares BEST \cite{best}\index{BEST}, STEM |
---|
1192 | \cite{stem}\index{STEM} and *BEAST \cite{startbeast}\index{*BEAST} are dedicated to the estimation |
---|
1193 | of species tree phylogenies from the analysis of multi-gene data and allow gene-tree topologies to |
---|
1194 | vary across genes. |
---|
1195 | |
---|
1196 | Aside from the tree topology that is common to all the sites and all the partition elements, other |
---|
1197 | parameters of the phylogenetic model can be either shared across partition elements or estimated |
---|
1198 | separately for each of these. When analysing three partition elements, $A$, $B$ and $C$ for |
---|
1199 | instance, PhyML can fit a model where the same set of branch lengths applies to $A$ and $B$ while |
---|
1200 | $C$ has its own estimated lengths. The same goes for the substitution model: the same GTR model, |
---|
1201 | with identical parameter values, can be fitted to $A$ and $C$ and JC69 for instance can be used for |
---|
1202 | $B$. The sections below give more detailed information on the range of models available and how to |
---|
1203 | set up the corresponding XML configuration files to implement them. |
---|
1204 | |
---|
1205 | |
---|
1206 | \subsection{Combining mixture and partitions in PhyML: the theory} |
---|
1207 | |
---|
1208 | The rationale behind mixture models as implemented in PhyML lies in (1) the definition of suitable |
---|
1209 | rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly |
---|
1210 | of these components so as to create the classes of a mixture. The main idea behind partitionned |
---|
1211 | analysis in PhyML lies in (1) the hypothesis of statistical independance of the different data |
---|
1212 | partition elements and (2) distinct data partition can share model components such as rate matrices, |
---|
1213 | equilibrium frequencies or distribution of rates across sites. More formally, the likelihood of a |
---|
1214 | data set made of $n$ partition elements is written as follows: |
---|
1215 | |
---|
1216 | \begin{eqnarray*} |
---|
1217 | \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s) \\ |
---|
1218 | &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c), |
---|
1219 | \end{eqnarray*} |
---|
1220 | |
---|
1221 | where $L_i$ is the number of sites in partition element $i$ and $K_i$ is the number of classes in |
---|
1222 | the mixture model that applies to this same partition element. Each class of a mixture is made of a |
---|
1223 | rate matrix $M$, a vector of equilibrium frequencies $F$ and a relative rate of substitution |
---|
1224 | $R$. Branch lengths, $L$ and tree topology $\tau$ are also required for the calculation of the |
---|
1225 | likelihood. Hence we have: |
---|
1226 | |
---|
1227 | \begin{eqnarray*} |
---|
1228 | && \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) \\ |
---|
1229 | &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c) \\ |
---|
1230 | &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) \mathcal{I}(m,f,r,i) |
---|
1231 | % &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} |
---|
1232 | % \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) |
---|
1233 | \end{eqnarray*} |
---|
1234 | where $\mathcal{M}_i$, $\mathcal{F}_i$ and $\mathcal{R}_i$ are the number of rate matrices, |
---|
1235 | vector of equilibrium frequencies and relative rates that apply to partition element $i$ |
---|
1236 | respectively. $\mathcal{I}(m,f,r,i)$ is an indicator function that takes value 1 if the combination |
---|
1237 | $M_m$, $F_f$ and $R_r$ is acually defined in the model for this particular partition element |
---|
1238 | $i$. Its value is 0 otherwise. In the example given in section \ref{sec:mixtures} $\{R_1,R_2,R_3\}$ |
---|
1239 | is the set of substitution rates, $\{M_1,M_2\}$ the set of rate matrices and $\{F_1,F_2\}$ the set of vectors of equilibrium frequencies. We then |
---|
1240 | define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as |
---|
1241 | $\mathcal{C}_2 = \{R_2,M_1,F_1\}$ and the third as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. Hence, we |
---|
1242 | have $\mathcal{I}(1,1,1,i)$, $\mathcal{I}(1,1,2,i)$ and $\mathcal{I}(2,2,3,i)$ equal to one while |
---|
1243 | the nine other values that this indicator function takes, corresponding to the possible combinations of |
---|
1244 | two vectors of frequencies, two matrices and three rates, are all zero. |
---|
1245 | |
---|
1246 | As stated before, our implementation assumes that the different components of a mixture are |
---|
1247 | independant. In other words, we have $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \Pr(M_m^{(i)}) \times |
---|
1248 | \Pr(F_f^{(i)}) \times \Pr(R_r^{(i)})$. In practice, the joint probability |
---|
1249 | $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)})$ is obtained as follows: |
---|
1250 | \begin{eqnarray}\label{equ:weights} |
---|
1251 | \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \frac{\Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})}{ |
---|
1252 | \sum_{m,f,r} \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) \mathcal{I}(m,f,r,i)} |
---|
1253 | \end{eqnarray} |
---|
1254 | The probabilities $\Pr(M_m^{(i)})$, $\Pr(F_f^{(i)})$ and $\Pr(R_r^{(i)})$, also called `weights', can be fixed or estimated |
---|
1255 | from the data. |
---|
1256 | |
---|
1257 | \subsection{The XML format and its use in PhyML}\label{sec:XML format} |
---|
1258 | |
---|
1259 | The few paragraphs below are largely inspired from the Wikipedia page that describes the XML format |
---|
1260 | (\url{http://en.wikipedia.org/wiki/XML}). XML (eXtensible Markup Language) is a markup language that |
---|
1261 | defines a set of rules for encoding documents in a format that is both human-readable and |
---|
1262 | machine-readable. An XML document is divided into {\em markup} and {\em content}, which may be |
---|
1263 | distinguished by the application of simple syntactic rules. Generally, strings that constitute |
---|
1264 | markup either begin with the character `\x{<}' and end with a `\x{>}'. Strings of characters that |
---|
1265 | are not markup are content: |
---|
1266 | |
---|
1267 | \vspace{0.2cm} |
---|
1268 | \begin{Verbatim}[frame=single, label=XML markup and content example, samepage=true, |
---|
1269 | baselinestretch=0.5, fontsize=\small] |
---|
1270 | <markup> |
---|
1271 | content |
---|
1272 | </markup> |
---|
1273 | \end{Verbatim} |
---|
1274 | |
---|
1275 | A markup construct that begins with `\x{<}' and ends with `\x{>}' is called a {\em tag}. Tags come |
---|
1276 | in three flavors: (1) start-tags (e.g, \x{<section>}), end-tags (e.g., \x{</section>}) and |
---|
1277 | empty-element tags (e.g., \x{<line-break />}). A {\em component} either begins with a start-tag and |
---|
1278 | ends with a matching end-tag or consists only of an empty-element tag. The characters between the |
---|
1279 | start- and end-tags, if any, are the element's content, and may contain markup, including other |
---|
1280 | elements, which are called child elements. In the following example, the element \x{img} has two |
---|
1281 | {\em attributes}, \x{src} and \x{alt}: \x{<img src="madonna.jpg" alt="Foligno Madonna, by |
---|
1282 | Raphael"/>}. Another example would be \x{<step number="3">Connect A to B.</step>} where the name of |
---|
1283 | the attribute is ``\x{number}" and the value is ``\x{3}". |
---|
1284 | |
---|
1285 | In practice, building a mixture model in a XML file readable by PhyML is relatively |
---|
1286 | straightforward. The first step is to define the different components of each class of the |
---|
1287 | mixture. Consider for instance that the fitted model will have a Gamma distribution with four |
---|
1288 | classes plus a proportion of invariants. The rate component of the mixture can then be specified |
---|
1289 | using the following XML code: |
---|
1290 | |
---|
1291 | \vspace{0.2cm} |
---|
1292 | \begin{Verbatim}[frame=single, label=$\Gamma4$+I rates, samepage=true, baselinestretch=0.5, |
---|
1293 | fontsize=\small, numbers=left] |
---|
1294 | |
---|
1295 | <siterates id="SiteRates1"> |
---|
1296 | <weights id="Distrib" family="gamma+inv" alpha=".1" \ |
---|
1297 | optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> |
---|
1298 | </weights> |
---|
1299 | <instance id="R1" init.value="1.0"/> |
---|
1300 | <instance id="R2" init.value="1.0"/> |
---|
1301 | <instance id="R5" init.value="0.0"/> |
---|
1302 | <instance id="R3" init.value="1.0"/> |
---|
1303 | <instance id="R4" init.value="1.0"/> |
---|
1304 | </siterates> |
---|
1305 | |
---|
1306 | \end{Verbatim} |
---|
1307 | |
---|
1308 | In the example above, the \x{<siterates>} component completely defines a model of substitution rate |
---|
1309 | variation across sites. This component has a particular identity, i.e., a name associated to it |
---|
1310 | (``\x{SiteRates1}'' here), which is not mandatory. This \x{<siterates>} component has six |
---|
1311 | sub-components. The first is the \x{<weights>} component, followed by five \x{<instance>} |
---|
1312 | components. The \x{<weights>} component defines the type of distribution that characterizes the |
---|
1313 | variation of rates across sites. A discrete Gamma plus invariants is used here. Two parameters |
---|
1314 | specify this distribution: the gamma shape and the proportion of invariant parameters. Their initial |
---|
1315 | values are set by using the corresponding attributes and attribute values (\x{alpha="0.1"} and |
---|
1316 | \x{pinv="0.4"}). Also, PhyML can optimise these parameters so as to maximise the likelihood of the |
---|
1317 | whole phylogenetic model (\x{optimise.pinv="yes"} and \x{optimise.alpha="yes"}). The following five |
---|
1318 | \x{<instance>} components define the rate classes themselves. The \x{id} attribute is here |
---|
1319 | mandatory and must be unique to each class. Note that one of the initial (relative) rate |
---|
1320 | (\x{init.value} attribute) is set to zero. The corresponding rate class (the third in this example) |
---|
1321 | will then correspond to the invariant site category. |
---|
1322 | |
---|
1323 | Having specified the part of the phylogenetic model that describes the variation of rates across |
---|
1324 | sites, we can now move on to build the rest of the model. The component below defines two |
---|
1325 | substitution models: |
---|
1326 | |
---|
1327 | \vspace{0.2cm} |
---|
1328 | \begin{Verbatim}[frame=single, label=Rate matrices, samepage=true, baselinestretch=0.5, |
---|
1329 | fontsize=\small, numbers=left] |
---|
1330 | |
---|
1331 | <ratematrices id="RateMatrices"> |
---|
1332 | <instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/> |
---|
1333 | <instance id="M2" model="GTR" optimise.rr="yes"/> |
---|
1334 | </ratematrices> |
---|
1335 | \end{Verbatim} |
---|
1336 | |
---|
1337 | This \x{<ratematrices>} component sets out a list of substitution models (HKY85 and GTR here). Here |
---|
1338 | again, the different elements in this list correspond to the \x{<instance>} sub-components. Each |
---|
1339 | instance must have a unique \x{id} attribute for a reason that will become obvious shortly. The |
---|
1340 | remaining attributes and their functions are described in Section \ref{sec:xmlratematrices}. |
---|
1341 | |
---|
1342 | The next ``ingredient'' in our phylogenetic model are vectors of nucleotide frequencies. The |
---|
1343 | \x{<equfreqs>} component below specifies two of such vectors: |
---|
1344 | |
---|
1345 | |
---|
1346 | \vspace{0.2cm} |
---|
1347 | \begin{Verbatim}[frame=single, label=Equilibrium frequencies, samepage=true, baselinestretch=0.5, |
---|
1348 | fontsize=\small, numbers=left] |
---|
1349 | |
---|
1350 | <equfreqs id="EquFreq"> |
---|
1351 | <instance id="F1"/> |
---|
1352 | <instance id="F2"/> |
---|
1353 | </equfreqs> |
---|
1354 | |
---|
1355 | \end{Verbatim} |
---|
1356 | |
---|
1357 | Now, we need to assemble these three components (rate variation across sites, rate matrices and |
---|
1358 | vectors of equilibrium frequencies) into a mixture model. The \x{<partitionelem>} component below |
---|
1359 | defines one such model: |
---|
1360 | |
---|
1361 | \vspace{0.2cm} |
---|
1362 | \begin{Verbatim}[frame=single, label=Mixture model, samepage=true, baselinestretch=0.5, |
---|
1363 | fontsize=\small, numbers=left] |
---|
1364 | |
---|
1365 | <partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt"> |
---|
1366 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
---|
1367 | <mixtureelem list="M1, M1, M1, M2, M2"/> |
---|
1368 | <mixtureelem list="F1, F2, F1, F2, F2"/> |
---|
1369 | </partitionelem> |
---|
1370 | |
---|
1371 | \end{Verbatim} |
---|
1372 | |
---|
1373 | The \x{<partitionelem>} component defines a particular partition element. In this example, the |
---|
1374 | partition element corresponds to the sequence file called \x{nucleic.txt}, which is an alignment of |
---|
1375 | nucleotide sequences (see the \x{data.type} attribute value). The \x{<mixtureelem>} are |
---|
1376 | sub-components of the \x{<partitionelem>} component. Each \x{<mixtureelem>} has a \x{list} |
---|
1377 | atrribute. Each such \x{list} gives the ID of components that have been defined before. For |
---|
1378 | instance, the first \x{<mixtureelem>} refers to the five classes of the \x{<siterates>} |
---|
1379 | component. The ordering of the different term in these list matters a lot since it is directly |
---|
1380 | related to the elements in each class of the mixture model. Hence, the first element in the |
---|
1381 | \x{<list>} attribute of the first \x{<mixtureelem>} added to the first element in the \x{<list>} |
---|
1382 | attribute of the second \x{<mixtureelem>} plus the the first element in \x{<list>} attribute of the |
---|
1383 | third \x{<mixtureelem>} defines the first class of the mixture model. Therefore, the mixture model |
---|
1384 | defined above has five classes: $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, $\mathcal{C}_2 = |
---|
1385 | \{R_2,M_1,F_2\}$, $\mathcal{C}_3 = \{R_3,M_1,F_1\}$, $\mathcal{C}_4 = \{R_4,M_2,F_2\}$ and |
---|
1386 | $\mathcal{C}_5 = \{R_5,M_2,F_2\}$. |
---|
1387 | |
---|
1388 | % Going back to the different components of this model, the XML code dealing with the substitution |
---|
1389 | % rates defines five classes with names {\tt R1} to {\tt R5}. The initial values of these rates are |
---|
1390 | % set to 1.0, except for {\tt R5}, which is set to 0 and will therefore correspond to the invariable |
---|
1391 | % site class. The {\tt <weight>} tag that follows indicate that these rates define a $\Gamma 4$+Inv |
---|
1392 | % model, with initial gamma shape parameter set to 0.1 and initial proportion of invariants set to |
---|
1393 | % 0.4. These two parameters will be estimated in the analysis ({\tt optimise.alpha} and {\tt |
---|
1394 | % optimise.pinv} attributes set to {\tt yes}). The two rate matrices have names {\tt M1} and {\tt |
---|
1395 | % M2}. {\tt M1} corresponds to a HKY85 model, with transition/transversion ratio set to 4.0 and set |
---|
1396 | % to be optimised in the analysis. {\tt M2} is a GTR model, which parameters are also set to be |
---|
1397 | % optimised. {\tt F1} and {\tt F2} are two vectors of nucleotide frequencies at equilibrium. These two |
---|
1398 | % sets of frequencies will therefore be estimated during the analysis. |
---|
1399 | |
---|
1400 | |
---|
1401 | |
---|
1402 | \subsection{Setting up mixture and partition models in PhyML: the basics}\index{mixture |
---|
1403 | models}\index{partitionned analysis}\index{data partitions} |
---|
1404 | |
---|
1405 | Mixture models are particularly relevant to the analysis of partitionned data. Indeed, some features |
---|
1406 | of evolution are gene-specific (e.g., substitution rates vary across genes). Models that can |
---|
1407 | accomodate for such variation, as mixture models do, are therefore relevant in this context. |
---|
1408 | However, other evolutionary features are shared across loci (e.g., genes located in the same genomic |
---|
1409 | region usually have similar GC contents). As a consequence, some components of mixture models need |
---|
1410 | to be estimated separately for each partition element while others should be shared by different |
---|
1411 | partition elements. |
---|
1412 | |
---|
1413 | Below is a simple example with a partitionned data set made of two elements: |
---|
1414 | |
---|
1415 | \vspace{0.2cm} |
---|
1416 | \begin{Verbatim}[frame=single, label=Two sets of branch lengths (one per partition element), |
---|
1417 | samepage=true, baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1418 | |
---|
1419 | <branchlengths id="BranchLens"> |
---|
1420 | <instance id="L1"/> |
---|
1421 | <instance id="L2"/> |
---|
1422 | </branchlengths> |
---|
1423 | |
---|
1424 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
---|
1425 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
---|
1426 | <mixtureelem list="L1, L1, L1, L1, L1"/> |
---|
1427 | </partitionelem> |
---|
1428 | |
---|
1429 | <partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt"> |
---|
1430 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
---|
1431 | <mixtureelem list="L2, L2, L2, L2, L2"/> |
---|
1432 | </partitionelem> |
---|
1433 | |
---|
1434 | \end{Verbatim} |
---|
1435 | |
---|
1436 | Mixture elements with names \x{R1},$\ldots$, \x{R5} refer to the $\Gamma4+$I model defined |
---|
1437 | previsouly (see Section \ref{sec:XML format}). The \x{<branchlengths>} XML component defines a |
---|
1438 | mixture element that had not been introduced before. It defined vectors of branch lengths that |
---|
1439 | apply to the estimated phylogeny. Two instances of such vectors are defined: \x{L1} and \x{L2}. |
---|
1440 | When examining the two partition elements (\x{<partitionelem>} component), it appears that \x{L1} |
---|
1441 | is associated with \x{Part1} while \x{L2} is associated with \x{Part2}. Hence, branch lengths |
---|
1442 | will be estimated separately for these two partition elements. |
---|
1443 | |
---|
1444 | Note that a given partition element can only have one {\tt branchlengths} instance associated to |
---|
1445 | it. For instance, the example given below is not valid: |
---|
1446 | |
---|
1447 | \vspace{0.2cm} |
---|
1448 | \begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5, |
---|
1449 | fontsize=\small, numbers=left] |
---|
1450 | |
---|
1451 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
---|
1452 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
---|
1453 | <mixtureelem list="L1, L1, L1, L2, L2"/> |
---|
1454 | </partitionelem> |
---|
1455 | |
---|
1456 | \end{Verbatim} |
---|
1457 | |
---|
1458 | In other words, mixture of branch lengths are forbidden. One reason for this restriction is that |
---|
1459 | mixture of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets |
---|
1460 | of branch lengths have the same likelihood) \cite{matsen07}. But mostly, combining mixture of branch |
---|
1461 | lengths with mixture of rates appears like a deadly combination. Consider for instance the following model: |
---|
1462 | |
---|
1463 | \vspace{0.2cm} |
---|
1464 | \begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5, |
---|
1465 | fontsize=\small, numbers=left] |
---|
1466 | |
---|
1467 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
---|
1468 | <mixtureelem list="R1, R2, R3"/> |
---|
1469 | <mixtureelem list="L1, L2, L3"/> |
---|
1470 | </partitionelem> |
---|
1471 | |
---|
1472 | \end{Verbatim} |
---|
1473 | |
---|
1474 | It is here impossible to tell apart branch lengths and substitution rates. Such model is strongly |
---|
1475 | non-identifiable and therefore not relevant. |
---|
1476 | |
---|
1477 | In the example given above, the same $\Gamma4+$I model (i.e. the same gamma shape parameter and |
---|
1478 | proportion of invariant ) applies to the two partition elements. It is possible to use two distinct |
---|
1479 | $\Gamma4+$I models instead using the following XML code: |
---|
1480 | |
---|
1481 | |
---|
1482 | |
---|
1483 | \vspace{0.2cm} |
---|
1484 | \begin{Verbatim}[frame=single, label=Two distinct $\Gamma4+$I models, samepage=true, |
---|
1485 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1486 | |
---|
1487 | <siterates id="SiteRates1"> |
---|
1488 | <weights id="Distrib1" family="gamma+inv" alpha=".1" \ |
---|
1489 | optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> |
---|
1490 | </weights> |
---|
1491 | <instance id="R1" init.value="1.0"/> |
---|
1492 | <instance id="R2" init.value="1.0"/> |
---|
1493 | <instance id="R5" init.value="0.0"/> |
---|
1494 | <instance id="R3" init.value="1.0"/> |
---|
1495 | <instance id="R4" init.value="1.0"/> |
---|
1496 | </siterates> |
---|
1497 | |
---|
1498 | <siterates id="SiteRates2"> |
---|
1499 | <weights id="Distrib2" family="gamma+inv" alpha=".1" \ |
---|
1500 | optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> |
---|
1501 | </weights> |
---|
1502 | <instance id="R6" init.value="1.0"/> |
---|
1503 | <instance id="R7" init.value="1.0"/> |
---|
1504 | <instance id="R8" init.value="0.0"/> |
---|
1505 | <instance id="R9" init.value="1.0"/> |
---|
1506 | <instance id="R10" init.value="1.0"/> |
---|
1507 | </siterates> |
---|
1508 | |
---|
1509 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
---|
1510 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
---|
1511 | <mixtureelem list="L1, L1, L1, L1, L1"/> |
---|
1512 | </partitionelem> |
---|
1513 | |
---|
1514 | <partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt"> |
---|
1515 | <mixtureelem list="R6, R7, R8, R9, R10"/> |
---|
1516 | <mixtureelem list="L2, L2, L2, L2, L2"/> |
---|
1517 | </partitionelem> |
---|
1518 | |
---|
1519 | \end{Verbatim} |
---|
1520 | |
---|
1521 | \x{SiteRates1} and \x{SiteRates2} here define two distinct $\Gamma4+$I models. Each of these models apply to |
---|
1522 | one of the two partition elements (\x{nucleic1.txt} and \x{nucleic2.txt}), allowing them to display |
---|
1523 | different patterns of rate variation across sites. |
---|
1524 | |
---|
1525 | |
---|
1526 | \subsection{XML options} |
---|
1527 | \subsubsection{{\tt phyml} component}\index{XML options!{\tt phyml} component} |
---|
1528 | Options: |
---|
1529 | \begin{itemize} |
---|
1530 | \item \x{output.file="filename"}. The main output files of PhyML analysis will be named |
---|
1531 | \x{filename\_phyml\_tree} and \x{filename\_phyml\_stats}. |
---|
1532 | \item \x{bootstrap="nreplicates"}. Run \x{nreplicates} replicates for the non-parametric bootstrap analysis. |
---|
1533 | \item \x{run.id="idstring"}. PhyML will append the string \x{idstring} to each output file. |
---|
1534 | \item \x{print.trace="yes|true|no|false"}. PhyML will print the estimated trees (and the |
---|
1535 | corresponding loglikelihoods) at multiple stages of the estimation process. This option is useful |
---|
1536 | for monitoring the progress of the analysis when processing large data sets. |
---|
1537 | \item \x{branch.test="aBayes|aLRT|SH|no"}. Calculate fast branch support using the aBayes method |
---|
1538 | \cite{anisimova11}, aLRT \cite{anisimova06} or SH \cite{shimodaira99} tests. These branch |
---|
1539 | statistics are much faster to estimate than the bootrap proportions and usually provide good |
---|
1540 | estimates of the probabilities that the corresponding edges are correctly inferred (see Anisimova et |
---|
1541 | al. 2011 for more precision). By default and if no bootstrap analysis is performed, branch supports |
---|
1542 | are estimated using the aBayes approach. |
---|
1543 | |
---|
1544 | \end{itemize} |
---|
1545 | \subsubsection{{\tt topology} component}\index{XML options!{\tt topology} component} |
---|
1546 | Options: |
---|
1547 | \begin{itemize} |
---|
1548 | \item \x{init.tree="bionj"|"user"|"random"}. Starting tree. Default is \x{bionj}. |
---|
1549 | \item \x{file.name="name\_of\_tree\_file"}. In case \x{init.tree="user"}, this |
---|
1550 | attribute is mandatory. \x{name\_of\_tree\_file} is a |
---|
1551 | text file containing a tree in NEWICK format. |
---|
1552 | \item \x{optimise.tree="yes"|"true"|"no"|"false"}. The starting tree topology as defined by |
---|
1553 | \x{init.tree} is to be optimised (or not) so as to maximise the likelihood function. |
---|
1554 | \item \x{search="nni"|"spr"|"none"}. Tree topology search is conducted using NNI (fast), SPR (a bit |
---|
1555 | slower but more accurate) or no moves. |
---|
1556 | \end{itemize} |
---|
1557 | \vspace{0.2cm} |
---|
1558 | \begin{Verbatim}[frame=single, label=Example of `topology' component, samepage=true, |
---|
1559 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1560 | |
---|
1561 | <topology> |
---|
1562 | <instance id="T1" init.tree="bionj" optimise.tree="true" \ |
---|
1563 | search="spr"/> |
---|
1564 | </topology> |
---|
1565 | |
---|
1566 | \end{Verbatim} |
---|
1567 | |
---|
1568 | \subsubsection{{\tt ratematrices} component}\index{XML options!{\tt ratematrices} component}\label{sec:xmlratematrices} |
---|
1569 | Options: |
---|
1570 | \begin{itemize} |
---|
1571 | \item \x{model="JC69"|"K80"|"F81"|"F84"|"HKY85"|"TN93"|"GTR"|"custom"} for nucleotide data. The default is \x{"HKY85"}.\\ |
---|
1572 | \x{model="LG"|"WAG"|"JTT"|"MtREV"|"Dayhoff"|"DCMut"|"RtREV"|"CpREV"|"VT"}\\\x{|"Blosum62"|"MtMam"|"MtArt"|"HIVw"|"HIVb"|"customaa"} |
---|
1573 | for amino-acid sequences. The default is \x{"LG"}. |
---|
1574 | \item \x{model.code="012345"}. For \x{custom} model applied to nucleotide sequences: set the |
---|
1575 | string of digits that define a custom substitution model. See Table \ref{tab:modelcode} on page |
---|
1576 | \pageref{tab:modelcode} for more |
---|
1577 | information about the model codes. |
---|
1578 | \item \x{ratematrix.code="filename"}. When used in conjunction with \x{model="customaa"}, |
---|
1579 | \x{filename} is the name of the file that gives the rates of substitution between amino-acids as |
---|
1580 | well as their frequences at equilibrium using PAML rate matrix format. An example of such file is |
---|
1581 | provided in {phyml/examples/X1.mat}. |
---|
1582 | \item \x{optimise.rr="yes"|"true"|"no"|"false"}. For \x{custom} and \x{GTR} nucleotide models only: |
---|
1583 | optimise the substitution rate model parameters. |
---|
1584 | \item \x{optimise.tstv="yes"|"true"|"no"|"false"}. For \x{K80}, \x{F84}, \x{HKY85} and \x{TN93} |
---|
1585 | models only: optimise the transition/transversion rate ratio. |
---|
1586 | \item \x{tstv="value"}. For \x{K80}, \x{HKY85} and \x{TN93} models only: set the transition/transversion to a |
---|
1587 | given value. |
---|
1588 | \end{itemize} |
---|
1589 | |
---|
1590 | The {\tt ratematrices} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt |
---|
1591 | no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each matrix in the |
---|
1592 | set of matrices defined by this component (see Equation \ref{equ:weights}), will be estimated from the data. |
---|
1593 | |
---|
1594 | \vspace{0.2cm} |
---|
1595 | \begin{Verbatim}[frame=single, label=Example of `ratematrices' component, samepage=true, |
---|
1596 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1597 | |
---|
1598 | <ratematrices id="RM1" optimise.weights="yes"> |
---|
1599 | <instance id="M1" model="custom" model.code="000000"/> |
---|
1600 | <instance id="M2" model="GTR" optimise.rr="yes"/> |
---|
1601 | <instance id="M3" model="WAG"/> |
---|
1602 | </ratematrices> |
---|
1603 | |
---|
1604 | \end{Verbatim} |
---|
1605 | |
---|
1606 | \subsubsection{{\tt equfreqs} component}\index{XML options!{\tt equfreqs} component} |
---|
1607 | Options: |
---|
1608 | \begin{itemize} |
---|
1609 | \item \x{base.freqs="a,b,c,d"} where \x{a-d} are nucleotide frequencies. Make sure that these |
---|
1610 | frequencies are separated by comas and no space character is inserted. |
---|
1611 | \item \x{aa.freqs="empirical|model"}. Amino-acid frequencies are derived from counting the number of |
---|
1612 | occurence of each amino-acid in the alignment (\x{aa.freqs="empirical"}) or given by the |
---|
1613 | substitution model (\x{aa.freqs="model"}). |
---|
1614 | \item \x{optimise.freqs="true|yes|false|no"}. Nucleotide frequencies can be optimised so as to maximise |
---|
1615 | the likelihood (\x{optimise.freqs="yes|true"}). |
---|
1616 | \end{itemize} |
---|
1617 | |
---|
1618 | The {\tt equfreqs} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt |
---|
1619 | no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each vector of |
---|
1620 | equilibrium frequencies in the |
---|
1621 | set of vectors defined by this component (see Equation \ref{equ:weights}), will be estimated from the data. |
---|
1622 | |
---|
1623 | \vspace{0.2cm} |
---|
1624 | \begin{Verbatim}[frame=single, label=Example of `equfreqs' component, samepage=true, |
---|
1625 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1626 | |
---|
1627 | <equfreqs id="EF1" optimise.weights="yes"> |
---|
1628 | <instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/> |
---|
1629 | <instance id="F2" aa.freqs="empirical"/> |
---|
1630 | <instance id="F3" optimise.freq="yes"/> |
---|
1631 | </equfreqs> |
---|
1632 | |
---|
1633 | \end{Verbatim} |
---|
1634 | |
---|
1635 | \subsubsection{{\tt branchlengths} component}\index{XML options!{\tt branchlengths} component} |
---|
1636 | Options: |
---|
1637 | \begin{itemize} |
---|
1638 | \item \x{optimise.lens="yes"|"true"|"no"|"false"}: branch lengths are optimised or not. The default |
---|
1639 | is set to \x{"yes"}. |
---|
1640 | \end{itemize} |
---|
1641 | \vspace{0.2cm} |
---|
1642 | \begin{Verbatim}[frame=single, label=Example of `branchlengths' component, baselinestretch=0.5, |
---|
1643 | fontsize=\small, numbers=left] |
---|
1644 | |
---|
1645 | <branchlengths id="BL1"> |
---|
1646 | <instance id="L1" optimise.lens="yes"/> |
---|
1647 | <instance id="L2"/> |
---|
1648 | <instance id="L3" optimise.lens="false"/> |
---|
1649 | </branchlengths> |
---|
1650 | |
---|
1651 | \end{Verbatim} |
---|
1652 | |
---|
1653 | \subsubsection{{\tt siterates} component}\index{XML options!{\tt siterates} component} |
---|
1654 | Options: |
---|
1655 | \begin{itemize} |
---|
1656 | \item \x{value="val"}, where \x{"val"} is the relative substitution rate for the corresponding class. |
---|
1657 | \end{itemize} |
---|
1658 | A \x{siterate} component generally includes a \x{weights} element that specifies the probabilitic |
---|
1659 | distribution of the relative rates. The available options for such element are: |
---|
1660 | \begin{itemize} |
---|
1661 | \item \x{family="gamma|gamma+inv|freerates"}. \x{gamma} indicates that the distribution of the |
---|
1662 | relative rates is set to be a discrete Gamma density. \x{gamma+inv} indicates that the relative rate model |
---|
1663 | is a mixture of Gamma and invariant sites (this is the common $\Gamma+$I model). FreeRate is |
---|
1664 | a model that does not use any parametric function to describe the distribution of the relative |
---|
1665 | rates (see \cite{soubrier12}). Under this option, relative rates and the corresponding frequencies of these classes are |
---|
1666 | directly estimated from the data. While such approach is slightly more computationally demanding |
---|
1667 | than the $\Gamma$ (or $\Gamma$+I) model, it often provides a significantly better fit to the data. |
---|
1668 | \item \x{alpha="value|optimised"}, where \x{value} is a real positive number. Use this option to set |
---|
1669 | the gamma shape parameter to the selected value. \x{optimised}: the parameter is estimated from |
---|
1670 | the data (see also next option). |
---|
1671 | \item \x{optimise.alpha="yes|true|no|false"}. Optimise the shape of the Gamma distribution of |
---|
1672 | relative rates (or not). |
---|
1673 | \item \x{pinv="value|optimised"}, where \x{value} is in $[0,1]$. Use this option to set |
---|
1674 | the proportion of invariants to the selected value. \x{optimised}: the parameter is estimated from |
---|
1675 | the data (see also next option). |
---|
1676 | \item \x{optimise.pinv="yes|true|no|false"}. Optimise the proportion of invariable sites (or not). |
---|
1677 | \item \x{optimise.freerates="yes|true|no|false"}. Optimise the parameters of the FreeRate model, |
---|
1678 | i.e., the relative rates and the corresponding frequencies. |
---|
1679 | \end{itemize} |
---|
1680 | \vspace{0.2cm} |
---|
1681 | \begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true, |
---|
1682 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1683 | |
---|
1684 | <siterates id="SR1"> |
---|
1685 | <instance id="R1" init.value="1.0"/> |
---|
1686 | <instance id="R2" init.value="1.0"/> |
---|
1687 | <instance id="R3" init.value="1.0"/> |
---|
1688 | <instance id="R4" init.value="1.0"/> |
---|
1689 | <weights id="D1" family="gamma" optimise.alpha="yes" \ |
---|
1690 | optimise.pinv="no"> |
---|
1691 | </weights> |
---|
1692 | </siterates> |
---|
1693 | \end{Verbatim} |
---|
1694 | |
---|
1695 | |
---|
1696 | |
---|
1697 | \subsubsection{{\tt partitionelem} and {\tt mixtureelem} components}\index{XML options!{\tt partitionelem} |
---|
1698 | component}\index{XML options!{\tt mixtureelem} component} |
---|
1699 | |
---|
1700 | Options: |
---|
1701 | \begin{itemize} |
---|
1702 | \item \x{file.name="inputfilename"}, where \x{inputfilename} is the name of the input sequence file |
---|
1703 | (in PHYLIP format) to be analysed. |
---|
1704 | \item \x{data.type="nt|aa"}. Specify the type of sequences to be processed (nucleotide of amino-acid sequences). |
---|
1705 | \item \x{interleaved="yes|true|no|false"}. Interleaved (\x{yes|true}) or sequential format |
---|
1706 | (\x{no|false}) for the sequence alignment. |
---|
1707 | \end{itemize} |
---|
1708 | |
---|
1709 | Each \x{partitionelem} element should include exactly four \x{mixtureelem} elements, corresponding to |
---|
1710 | branch lengths, equilibrium frequencies, substitution rate model and tree topology. The ordering of |
---|
1711 | in which the \x{mixtureelem} elements are given does not matter, though exceptions apply for the |
---|
1712 | $\Gamma+I$ model (see below). The $n$-th element in the \x{list} |
---|
1713 | attribute of each \x{mixtureelem} defines the $n$-th class of the mixture model. In the example given |
---|
1714 | below, the first class of the mixture is made of the following elements: \x{T1}, \x{F1}, \x{R1} and |
---|
1715 | \x{L1}, the second class is made of \x{T1}, \x{F1}, \x{R2} and \x{L1}, etc. |
---|
1716 | |
---|
1717 | |
---|
1718 | \vspace{0.2cm} |
---|
1719 | \begin{Verbatim}[frame=single, label=Example of `partitionelem' component, samepage=true, |
---|
1720 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1721 | |
---|
1722 | <partitionelem id="partition1" file.name="./small_p1.nxs" \ |
---|
1723 | data.type="nt" interleaved="yes"> |
---|
1724 | <mixtureelem list="T1, T1, T1, T1"/> |
---|
1725 | <mixtureelem list="F1, F1, F1, F1"/> |
---|
1726 | <mixtureelem list="R1, R2, R3, R4"/> |
---|
1727 | <mixtureelem list="L1, L1, L1, L1"/> |
---|
1728 | </partitionelem> |
---|
1729 | |
---|
1730 | \end{Verbatim} |
---|
1731 | |
---|
1732 | In general, the ordering of the \x{mixtureelem} elements does not matter. However, when the model |
---|
1733 | has invariable sites, then the corresponding class should be first in the list of classes provided |
---|
1734 | by \x{mixtureelem}. For instance, in the example above, if the rates are defined as follows: |
---|
1735 | |
---|
1736 | \vspace{0.2cm} |
---|
1737 | \begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true, |
---|
1738 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1739 | |
---|
1740 | <siterates id="SR1"> |
---|
1741 | <instance id="R1" init.value="0.0"/> |
---|
1742 | <instance id="R2" init.value="1.0"/> |
---|
1743 | <instance id="R3" init.value="1.0"/> |
---|
1744 | <instance id="R4" init.value="1.0"/> |
---|
1745 | <weights id="D1" family="gamma+inv" optimise.alpha="yes" \ |
---|
1746 | optimise.pinv="no"> |
---|
1747 | </weights> |
---|
1748 | </siterates> |
---|
1749 | \end{Verbatim} |
---|
1750 | |
---|
1751 | then \x{R1} corresponds to the invariable rate class (as \x{init.value="0.0"}). As \x{R1} is first |
---|
1752 | in the \x{mixtureelem} (see line 6 in the example of \x{`partionelem'} given above), PhyML will |
---|
1753 | print out an explicit error message and bail out. One way to avoid this shortcoming is to define |
---|
1754 | \x{mixtureelem} as \x{R4, R2, R3, R1} instead. |
---|
1755 | |
---|
1756 | |
---|
1757 | \subsection{A simple example: GTR + $\Gamma$4 + I} |
---|
1758 | |
---|
1759 | The example below provides all the required options to fit a $\Gamma$4+I model to a single alignment |
---|
1760 | of nucleotide sequences under the GTR model of substitution using a SPR search for the best tree |
---|
1761 | topology. The \x{phyml} component sets the name for the analysis to \x{simple.example}, meaning that |
---|
1762 | each output file will display this particular string of characters. Also, the tree and statistics |
---|
1763 | file names will begin with \x{p1.output}. The tree topology will be estimated so as to maximise the |
---|
1764 | likelihood and the topology search algorithm used here is SPR, as indicated by the value of the |
---|
1765 | corresponding attribute (i.e., \x{search="spr"}). Only one vector of branch lengths will be used |
---|
1766 | here since only one partition element will be processed. Hence, the \x{<branchlengths>} component |
---|
1767 | only has one \x{<instance>} sub-component. Also, a single GTR model will apply to all the classes |
---|
1768 | for the mixture model -- the \x{<ratematrices>} component has only one \x{<instance>} sub-component, |
---|
1769 | corresponding to this particular substitution model. The next component, \x{<equfreqs>}, indicates |
---|
1770 | that a single vector of equilibrium frequencies will apply here. Next, the \x{<siterates>} component |
---|
1771 | has five \x{<instance>} sub-components. Four of these correspond to the non-zero relative rates of |
---|
1772 | evolution a defined by a discrete Gamma distribution. The last one (\x{<instance id="R5" |
---|
1773 | value="0.0"/>}) defines the class of the mixture corresponding to invariable sites. The |
---|
1774 | \x{<weight>} component indicates that a $\Gamma+$I model will be fitted here. The shape parameter of |
---|
1775 | the Gamma distribution and the proportion of invariants will be estimated from the data. The |
---|
1776 | \x{<partitionelem>} gives information about the sequence alignment (the corresponding file name, the |
---|
1777 | type of data and the alignment format). The \x{<mixtureelem>} components next define the mixture |
---|
1778 | model. Each class of the fitted model corresponds to one column, with the first column made of the |
---|
1779 | following elements: \x{T1, M1, F1, R1} and \x{L1}. The second class of the mixture is made of \x{T1, |
---|
1780 | M1, F1, R2, L1} and so forth. |
---|
1781 | |
---|
1782 | \vspace{0.2cm} |
---|
1783 | \begin{Verbatim}[frame=single, label=Simple PhyML XML example, samepage=true, baselinestretch=0.5, |
---|
1784 | fontsize=\small, numbers=left] |
---|
1785 | |
---|
1786 | <phyml runid="simple.example" output.file="p1.output"> |
---|
1787 | |
---|
1788 | <topology> |
---|
1789 | <instance id="T1" init.tree="bionj" optimise.tree="yes" \ |
---|
1790 | search="spr"/> |
---|
1791 | </topology> |
---|
1792 | |
---|
1793 | <branchlengths id="BL1"> |
---|
1794 | <instance id="L1" optimise.lens="yes"/> |
---|
1795 | </branchlengths> |
---|
1796 | |
---|
1797 | <ratematrices id="RM1"> |
---|
1798 | <instance id="M1" model="GTR"/> |
---|
1799 | </ratematrices> |
---|
1800 | |
---|
1801 | <equfreqs id="EF1"> |
---|
1802 | <instance id="F1"/> |
---|
1803 | </equfreqs> |
---|
1804 | |
---|
1805 | <siterates id="SR1"> |
---|
1806 | <instance id="R1" value="1.0"/> |
---|
1807 | <instance id="R2" value="1.0"/> |
---|
1808 | <instance id="R3" value="1.0"/> |
---|
1809 | <instance id="R4" value="1.0"/> |
---|
1810 | <instance id="R5" value="0.0"/> |
---|
1811 | <weights id="D1" family="gamma+inv" optimise.alpha="yes" \ |
---|
1812 | optimise.pinv="yes"> |
---|
1813 | </weights> |
---|
1814 | </siterates> |
---|
1815 | |
---|
1816 | <partitionelem id="partition_elem1" file.name=\ |
---|
1817 | "./p1.seq" data.type="nt" interleaved="yes"> |
---|
1818 | <mixtureelem list="T1, T1, T1, T1, T1"/> |
---|
1819 | <mixtureelem list="M1, M1, M1, M1, M1"/> |
---|
1820 | <mixtureelem list="F1, F1, F1, F1, F1"/> |
---|
1821 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
---|
1822 | <mixtureelem list="L1, L1, L1, L1, L1"/> |
---|
1823 | </partitionelem> |
---|
1824 | |
---|
1825 | </phyml> |
---|
1826 | |
---|
1827 | \end{Verbatim} |
---|
1828 | |
---|
1829 | |
---|
1830 | \subsection{A second example: LG4X}\index{lg4x} |
---|
1831 | |
---|
1832 | The example below shows how to fit the LG4X model \cite{lg4x} to a given alignment of amino-acid |
---|
1833 | sequences (file \x{M587.nex.Phy}). LG4X is a mixture model with four classes. Each class has its own |
---|
1834 | rate and corresponding frequencies (hence the use of the FreeRate model below, see the |
---|
1835 | \x{<siterates>} component). In the particular example given here, the rate values and frequencies |
---|
1836 | are set by the users. These parameters will then be optimized by PhyML (\x{optimise.freerates="yes"}). |
---|
1837 | Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by |
---|
1838 | the user (Note that these matrices can be downloaded from the following web address: |
---|
1839 | \url{http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt}. They are also |
---|
1840 | provided in the PhyML package \x{example/lg4x/} directory.) |
---|
1841 | |
---|
1842 | \vspace{0.2cm} |
---|
1843 | \begin{Verbatim}[frame=single, label=LG4X, samepage=true, baselinestretch=0.5, |
---|
1844 | fontsize=\small, numbers=left] |
---|
1845 | <phyml run.id="lg4x" output.file="M587.tests" branch.test="no"> |
---|
1846 | |
---|
1847 | <!-- Tree topology: start with BioNJ and then SPRs --> |
---|
1848 | <topology> |
---|
1849 | <instance id="T1" init.tree="user" file.name="user_tree.txt" \ |
---|
1850 | search="spr" optimise.tree="no"/> |
---|
1851 | </topology> |
---|
1852 | |
---|
1853 | |
---|
1854 | <!-- Four rate matrices, read from files --> |
---|
1855 | <ratematrices id="RM1"> |
---|
1856 | <instance id="M1" model="customaa" ratematrix.file="X1.mat"/> |
---|
1857 | <instance id="M2" model="customaa" ratematrix.file="X2.mat"/> |
---|
1858 | <instance id="M3" model="customaa" ratematrix.file="X3.mat"/> |
---|
1859 | <instance id="M4" model="customaa" ratematrix.file="X4.mat"/> |
---|
1860 | </ratematrices> |
---|
1861 | |
---|
1862 | <!-- Freerate model of variation of rates across sites --> |
---|
1863 | <siterates id="SR1"> |
---|
1864 | <instance id="R1" init.value="0.197063"/> |
---|
1865 | <instance id="R2" init.value="0.750275"/> |
---|
1866 | <instance id="R3" init.value="1.951569"/> |
---|
1867 | <instance id="R4" init.value="5.161586"/> |
---|
1868 | <weights id="D1" family="freerates" optimise.freerates="yes"> |
---|
1869 | <instance appliesto="R1" value="0.422481"/> |
---|
1870 | <instance appliesto="R2" value="0.336848"/> |
---|
1871 | <instance appliesto="R3" value="0.180132"/> |
---|
1872 | <instance appliesto="R4" value="0.060539"/> |
---|
1873 | </weights> |
---|
1874 | </siterates> |
---|
1875 | |
---|
1876 | <!-- Amino-acid equilibrium freqs. are given by the models --> |
---|
1877 | <equfreqs id="EF1"> |
---|
1878 | <instance id="F1" aa.freqs="model"/> |
---|
1879 | <instance id="F2" aa.freqs="model"/> |
---|
1880 | <instance id="F3" aa.freqs="model"/> |
---|
1881 | <instance id="F4" aa.freqs="model"/> |
---|
1882 | </equfreqs> |
---|
1883 | |
---|
1884 | |
---|
1885 | <!-- One vector of branch lengths --> |
---|
1886 | <branchlengths id="BL1" > |
---|
1887 | <instance id="L1" optimise.lens="yes"/> |
---|
1888 | </branchlengths> |
---|
1889 | |
---|
1890 | |
---|
1891 | <!-- Mixture model assemblage --> |
---|
1892 | <partitionelem id="partition1" file.name="M587.nex.Phy" \ |
---|
1893 | data.type="aa" interleaved="yes"> |
---|
1894 | <mixtureelem list="T1, T1, T1, T1"/> |
---|
1895 | <mixtureelem list="M1, M2, M3, M4"/> |
---|
1896 | <mixtureelem list="F1, F2, F3, F4"/> |
---|
1897 | <mixtureelem list="R1, R2, R3, R4"/> |
---|
1898 | <mixtureelem list="L1, L1, L1, L1"/> |
---|
1899 | </partitionelem> |
---|
1900 | |
---|
1901 | </phyml> |
---|
1902 | \end{Verbatim} |
---|
1903 | |
---|
1904 | In order to fit the LG4X model to the \x{proteic} sequence file provided in the \x{examples/} |
---|
1905 | directory, simply type \x{./phyml --xml=../examples/lg4x/lg4x.xml} (assuming the PhyML binary is installed |
---|
1906 | in the \x{src/} directory). You can of course slightly tweak the file \x{../examples/lg4x/lg4x.xml} |
---|
1907 | and use it as a template to fit this model to another data set. |
---|
1908 | |
---|
1909 | |
---|
1910 | \subsection{An example with multiple partition elements} |
---|
1911 | |
---|
1912 | The example below gives the complete XML file to specify the analysis of three partition elements, |
---|
1913 | corresponding to the nucleotide sequence files \x{small\_p1\_pos1.seq}, \x{small\_p1\_pos2.seq} and |
---|
1914 | \x{small\_p1\_pos3.seq} in interleaved PHYLIP format. \x{small\_p1\_pos1.seq} is fitted with the |
---|
1915 | HKY85 model of substitution (with the transition/transversion ratio being estimated from the data), |
---|
1916 | combined to a $\Gamma4$ model of rate variation across sites (with the gamma shape parameter being |
---|
1917 | estimated from the data). \x{small\_p1\_pos2.seq} is fitted to a custom substitution model with the |
---|
1918 | constraint $A\leftrightarrow G$=$C\leftrightarrow T$. The nucleotide frequencies are set to |
---|
1919 | $\frac{1}{4}$ here. The model does not allow substitution rates to vary across |
---|
1920 | sites. \x{small\_p1\_pos3.seq} is fitted using a GTR model conbined to a $\Gamma4+$I model of rate |
---|
1921 | variation across sites. Note that the equilibrium nucleotide frequencies for the fourth and fifth |
---|
1922 | class of the mixture are set to be equal to that estimated from the first partition element (i.e., |
---|
1923 | \x{F1}) . The initial phylogeny is built using BioNJ and the tree topology is to be estimated using |
---|
1924 | a NNI search algorithm. |
---|
1925 | |
---|
1926 | \vspace{0.2cm} |
---|
1927 | \begin{Verbatim}[frame=single, label=Example of PhyML XML file, samepage=true, baselinestretch=0.5, |
---|
1928 | fontsize=\small, numbers=left] |
---|
1929 | |
---|
1930 | <phyml runid="nnisearch" output.file="small_p1_output"> |
---|
1931 | |
---|
1932 | <topology> |
---|
1933 | <instance id="T1" init.tree="bionj" optimise.tree="yes" \ |
---|
1934 | search="nni"/> |
---|
1935 | </topology> |
---|
1936 | |
---|
1937 | <branchlengths id="BL1"> |
---|
1938 | <instance id="L1" optimise.lens="yes"/> |
---|
1939 | <instance id="L2"/> |
---|
1940 | <instance id="L3"/> |
---|
1941 | </branchlengths> |
---|
1942 | |
---|
1943 | <ratematrices id="RM1"> |
---|
1944 | <instance id="M1" model="HKY85" optimise.tstv="yes"/> |
---|
1945 | <instance id="M2" model="custom" model.code="102304" \ |
---|
1946 | optimise.rr="yes"/> |
---|
1947 | <instance id="M3" model="GTR"/> |
---|
1948 | </ratematrices> |
---|
1949 | |
---|
1950 | <equfreqs id="EF1"> |
---|
1951 | <instance id="F1"/> |
---|
1952 | <instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/> |
---|
1953 | <instance id="F3"/> |
---|
1954 | </equfreqs> |
---|
1955 | |
---|
1956 | <siterates id="SR1"> |
---|
1957 | <instance id="R1" value="1.0"/> |
---|
1958 | <instance id="R2" value="1.0"/> |
---|
1959 | <instance id="R3" value="1.0"/> |
---|
1960 | <instance id="R4" value="1.0"/> |
---|
1961 | <weights id="D1" family="gamma" optimise.alpha="yes" \ |
---|
1962 | optimise.pinv="no"> |
---|
1963 | </weights> |
---|
1964 | </siterates> |
---|
1965 | |
---|
1966 | <siterates id="SR2"> |
---|
1967 | <instance id="R8" value="1.0"/> |
---|
1968 | <weights id="D2" family="gamma" optimise.alpha="yes" \ |
---|
1969 | optimise.pinv="yes"> |
---|
1970 | </weights> |
---|
1971 | </siterates> |
---|
1972 | |
---|
1973 | <siterates id="SR3"> |
---|
1974 | <instance id="R10" value="1.0"/> |
---|
1975 | <instance id="R11" value="1.0"/> |
---|
1976 | <instance id="R12" value="1.0"/> |
---|
1977 | <instance id="R13" value="1.0"/> |
---|
1978 | <instance id="R14" value="1.0"/> |
---|
1979 | <weights id="D3" family="gamma" optimise.alpha="yes" \ |
---|
1980 | optimise.pinv="yes"> |
---|
1981 | </weights> |
---|
1982 | </siterates> |
---|
1983 | |
---|
1984 | \end{Verbatim} |
---|
1985 | |
---|
1986 | \vspace{0.2cm} |
---|
1987 | \begin{Verbatim}[frame=single, label=Example of PhyML XML file (ctnd), samepage=true, |
---|
1988 | baselinestretch=0.5, fontsize=\small, numbers=left] |
---|
1989 | |
---|
1990 | <partitionelem id="partition_elem1" file.name=\ |
---|
1991 | "./small_p1_pos1.seq" data.type="nt" interleaved="yes"> |
---|
1992 | <mixtureelem list="T1, T1, T1, T1"/> |
---|
1993 | <mixtureelem list="M1, M1, M1, M1"/> |
---|
1994 | <mixtureelem list="F1, F1, F1, F1"/> |
---|
1995 | <mixtureelem list="R1, R2, R3, R4"/> |
---|
1996 | <mixtureelem list="L1, L1, L1, L1"/> |
---|
1997 | </partitionelem> |
---|
1998 | |
---|
1999 | <partitionelem id="partition_elem2" file.name=\ |
---|
2000 | "./small_p1_pos2.seq" data.type="nt" interleaved="yes"> |
---|
2001 | <mixtureelem list="T1"/> |
---|
2002 | <mixtureelem list="M2"/> |
---|
2003 | <mixtureelem list="R8"/> |
---|
2004 | <mixtureelem list="F2"/> |
---|
2005 | <mixtureelem list="L2"/> |
---|
2006 | </partitionelem> |
---|
2007 | |
---|
2008 | <partitionelem id="partition_elem3" file.name=\ |
---|
2009 | "./small_p1_pos3.seq" data.type="nt" interleaved="yes"> |
---|
2010 | <mixtureelem list="T1, T1, T1, T1, T1"/> |
---|
2011 | <mixtureelem list="M3, M3, M3, M3, M3"/> |
---|
2012 | <mixtureelem list="R10, R11, R12, R13, R14"/> |
---|
2013 | <mixtureelem list="F3, F3, F3, F1, F1"/> |
---|
2014 | <mixtureelem list="L3, L3, L3, L3, L3"/> |
---|
2015 | </partitionelem> |
---|
2016 | |
---|
2017 | </phyml> |
---|
2018 | |
---|
2019 | \end{Verbatim} |
---|
2020 | |
---|
2021 | |
---|
2022 | \subsection{Branch lengths with invariants and partionned data} |
---|
2023 | |
---|
2024 | Accommodating for models with invariable sites applying to some elements of a partitioned data, with |
---|
2025 | these elements sharing the same set of edge lengths can lead to inconsistencies. Consider for |
---|
2026 | instance a partitioned data set with two elements. Assume that these two elements share the same |
---|
2027 | set of edge lengths. Also, consider that GTR+I applies to the first element and HKY applies to the |
---|
2028 | second. Now, the expected number of substitutions per site for the first element of the partition is |
---|
2029 | equal to $(1-p)l$, where $p$ is the estimated proportion of invariants and $l$ is the |
---|
2030 | maximum-likelihood estimate for the length of that specific edge. For the second element of the |
---|
2031 | partition, the expected number of substitutions per site is equal to $l$, rather than $(1-p)l$. |
---|
2032 | While $l$ are common to the two elements, matching the specification of the input model, the actual |
---|
2033 | edge lengths do differ across the two partition elements. Please be aware that, due to the |
---|
2034 | programming structure implemented in PhyML, the program will only return one value here, which will |
---|
2035 | be equal to $(1-p)l$. |
---|
2036 | |
---|
2037 | \section{Citing PhyML} |
---|
2038 | The ``default citation'' for PhyML is: |
---|
2039 | \begin{itemize} |
---|
2040 | \item |
---|
2041 | ``New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance |
---|
2042 | of PhyML 3.0''. Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010, {\it Systematic |
---|
2043 | Biology}, 59(3):307-321 |
---|
2044 | |
---|
2045 | \end{itemize} |
---|
2046 | The ``historic citation'' for PhyML is: |
---|
2047 | \begin{itemize} |
---|
2048 | \item ``A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood'' |
---|
2049 | Guindon S., Gascuel O. 2003, {\it Systematic Biology}, 52(5):696-704 |
---|
2050 | \end{itemize} |
---|
2051 | |
---|
2052 | |
---|
2053 | |
---|
2054 | \section{Other programs in the PhyML package} |
---|
2055 | |
---|
2056 | PhyML is software package that provides tools to tackle problems other than estimating maximum |
---|
2057 | likelihood phylogenies. Installing these tools and processing data sets is explained is the |
---|
2058 | following sections. |
---|
2059 | |
---|
2060 | \subsection{PhyTime}\index{PhyTime} PhyTime is a program that estimates node ages and substitution |
---|
2061 | rates using a Bayesian approach. The performance and main features of this software are described |
---|
2062 | in two article (see Section \ref{sec:citephytime}). |
---|
2063 | |
---|
2064 | It relies on a Gibbs sampler which outperforms the |
---|
2065 | ``standard'' Metropolis-Hastings algorithm implemented in a number of phylogenetic softwares. The |
---|
2066 | details and performance of this approach are described in the following article: |
---|
2067 | |
---|
2068 | \subsubsection{Installing PhyTime} |
---|
2069 | |
---|
2070 | Compiling PhyTime is straightforward on Unix-like machines (i.e., linux and MacOS systems). PhyTime |
---|
2071 | is not readily available for Windows machines but compilation should be easy on this system too. In |
---|
2072 | the `phyml' directory, where the `src/' and `doc/' directories stand, enter the following commands: |
---|
2073 | {\setlength{\baselineskip}{0.5\baselineskip} |
---|
2074 | \begin{verbatim} |
---|
2075 | ./configure --enable-phytime; |
---|
2076 | make clean; |
---|
2077 | make; |
---|
2078 | \end{verbatim} } This set of commands generates a binary file called \x{phytime} which can be found |
---|
2079 | in the `src/' directory. |
---|
2080 | |
---|
2081 | \subsubsection{Running PhyTime} Passing options and running PhyTime on your data set is quite |
---|
2082 | similar to running PhyML in commmand-line mode. The main differences between the two programs are |
---|
2083 | explained below: |
---|
2084 | \begin{itemize} |
---|
2085 | \item PhyTime takes as mandatory input a {\em rooted} phylogenetic tree. Hence, the `\x{-u}' option |
---|
2086 | must be used. Also, unlike PhyML, PhyTime does not modify the tree topology. Hence, the options that |
---|
2087 | go with the `\x{-s}' command do not alter the input tree topology. |
---|
2088 | \item PhyTime needs an input file giving information about calibration nodes. The command |
---|
2089 | `\x{--calibration=}' followed by the name of the file containing the calibration node information is |
---|
2090 | mandatory. The content of that file should look as follows: |
---|
2091 | |
---|
2092 | \begin{figure}[h] |
---|
2093 | \begin{small} |
---|
2094 | \begin{Verbatim}[frame=single, label=Calibration node file, samepage=true, baselinestretch=0.5, fontsize=\tiny] |
---|
2095 | Dugong_dugon Procavia_capensis Elephantidae | -65 -54 |
---|
2096 | Equus_sp. Ceratomorpha | -58 -54 |
---|
2097 | Cercopithecus_solatus Macaca_mulatta Hylobates_lar Homo_sapiens | -35 -25 |
---|
2098 | Lepus_crawshayi Oryctolagus_cuniculus Ochotona_princeps | -90 -37 |
---|
2099 | Marmota_monax Aplodontia_rufa | -120 -37 |
---|
2100 | Dryomys_nitedula Glis_glis | -120 -28.5 |
---|
2101 | @root@ | -100 -120 |
---|
2102 | \end{Verbatim} |
---|
2103 | \end{small} |
---|
2104 | \end{figure} |
---|
2105 | |
---|
2106 | Every row in this file lists a set of taxa that belong to the same subtree (i.e., a clade). This |
---|
2107 | list of taxa is followed by the character `\x{|}' and two real numbers corresponding to the lower |
---|
2108 | and upper bounds of the calibration interval for the node at the root of the clade. In the example |
---|
2109 | given here, the clade grouping the three taxa ``Dugong\_dugon'', ``Procavia\_capensis'' and |
---|
2110 | ``Elephantida'' has -65 as lower bound and -54 as upper bound. Node ages (or node heights) are |
---|
2111 | relative to the most recent tip node in the phylogeny, which age is set to 0. It is also possible to |
---|
2112 | define a clade using only two taxon names. PhyTime will then search for the most recent common |
---|
2113 | ancestor of these two taxa in the user-defined phylogeny and assign time boundaries to the |
---|
2114 | corresponding node. For serially-sampled data, the calibration nodes correspond to tips in the tree. |
---|
2115 | A calibration file will then look as follows: |
---|
2116 | \begin{figure}[h] |
---|
2117 | \begin{small} |
---|
2118 | \begin{Verbatim}[frame=single, label=Calibration node file (serially-sampled sequences), samepage=true, baselinestretch=0.5] |
---|
2119 | taxaA | -65 -65 |
---|
2120 | taxaB | -65 -65 |
---|
2121 | taxaC | -20 -20 |
---|
2122 | taxaD | -30 -30 |
---|
2123 | taxaE | -30 -30 |
---|
2124 | taxaF | -60 -60 |
---|
2125 | taxaG | -61 -51 |
---|
2126 | @root@ | -100 -120 |
---|
2127 | \end{Verbatim} |
---|
2128 | \end{small} |
---|
2129 | \end{figure} |
---|
2130 | |
---|
2131 | |
---|
2132 | % The maximum value the average substitution rate along a branch of a phylogeny can take is set to |
---|
2133 | % $10^{-2}$ substitution per site per time unit. It is therefore important that the time unit used to |
---|
2134 | % specify the calibration intervals are realistic with respect to this upper bound. For instance, it |
---|
2135 | % one time unit corresponds to one month, one assumes that the maximum (average) substitution rate is |
---|
2136 | % $10^{-3}$/site/month, which is very high (substitution rate in introns is close to 3/site/$10^9$ |
---|
2137 | % years). Consider that in the example give above, one time unit is $10^6$ years. Hence, the maximum |
---|
2138 | % value the substitution rate can take is $10^{-2}$/site/$10^6$ years, i.e., $10$/site/$10^9$ years. |
---|
2139 | % Multiplying the calibration values by 10 amounts to considering that the time unit is $10^5$ years, |
---|
2140 | % implying a maximum substitution rate equal to $10^{-2}$/site/$10^5$ years, i.e., $100$/site/$10^9$ |
---|
2141 | % years. |
---|
2142 | |
---|
2143 | Note that the node corresponding to the root of the whole tree has a specific label: |
---|
2144 | `\x{@root@}'. {\color{red}{It is important to specify upper and lower bounds for the root node in |
---|
2145 | order to ensure convergence of the Gibbs sampler. If the prior interval for the root height is not |
---|
2146 | specified, the upper bound will be set to the upper bound of the oldest calibration node and the |
---|
2147 | lower bound will be set to twice this age.}} As a consequence, leaving the prior on root height |
---|
2148 | interval unspecified may produce inaccurate estimates of node ages, especially if there are only few |
---|
2149 | otherwise calibration nodes available. |
---|
2150 | |
---|
2151 | A notable exception to this rule comes from the analysis of serial sample \index{serial sample} |
---|
2152 | data, i.e., alignments in which sequences were not sampled at the same time point. For such data, |
---|
2153 | the estimated number of substitutions accumulated between successive time points is used to estimate |
---|
2154 | the substitution rate averaged over lineages. Because the time of collection of the sequences is |
---|
2155 | generally known without ambiguity, this extra piece of data is translated into very informative |
---|
2156 | calibration intervals for the tip nodes (i.e., calibration interval of zero width), which in turn |
---|
2157 | results in substitution rate estimates with descreased variances. Posterior distribution of |
---|
2158 | substitution rates with small variances then allows one to get good estimates of the root age. |
---|
2159 | |
---|
2160 | \end{itemize} |
---|
2161 | |
---|
2162 | A typical PhyTime command-line should look like the following\index{command-line options!\x{--calibration}}: |
---|
2163 | |
---|
2164 | \begin{Verbatim}[fontsize=\small] |
---|
2165 | ./phytime -i seqname -u treename --calibration=calibration_file -m GTR -c 8 |
---|
2166 | \end{Verbatim} |
---|
2167 | |
---|
2168 | Assuming the file `\x{seqname}' contains DNA sequences in PHYLIP or NEXUS format, `\x{treename}' is |
---|
2169 | the rooted input tree in Newick format and `\x{calibration\_file}' is a set of calibration nodes, |
---|
2170 | PhyTime will estimate the posterior distribution of node times and substitution rates under the |
---|
2171 | assumption that the substitution process follows a GTR model with 8 classes of rates in the Gamma |
---|
2172 | distribution of rates across sites. The model parameter values are estimated by a Gibbs sampling |
---|
2173 | technique. This algorithm tries diferent values of the model parameters and record the most probable |
---|
2174 | ones. By default, $10^6$ values for each parameter are collected. These values are recorded every |
---|
2175 | $10^3$ sample. These settings can be modified using the appropriate command-line options (see |
---|
2176 | below). |
---|
2177 | |
---|
2178 | |
---|
2179 | \subsubsection{Upper bounds of model parameters} |
---|
2180 | |
---|
2181 | The maximum expected number of substitutions per along a given branch is set to 1.0. Since |
---|
2182 | calibration times provide prior information about the time scale considered, it is possible to use |
---|
2183 | that information to define an upper bound for the substitution rate. This upper bound is equal to |
---|
2184 | the ratio of the maximum value for a branch length (1.0) by the amount of time elapsed since the |
---|
2185 | oldest calibration point (i.e., the minimum of the lower bounds taken over the whole set of |
---|
2186 | calibration points)\footnote{The actual formula involves an extra parameter which does not need to |
---|
2187 | be introduced here}. It is important to keep in mind that the upper bound of the average |
---|
2188 | substitution rate depends on the time unit used in the calibration priors. The value of the upper |
---|
2189 | bound is printed on screen at the start of the execution. |
---|
2190 | |
---|
2191 | PhyTime implements two models that authorize rates to be autocorrelated. The strength of |
---|
2192 | autocorrelation is governed by a parameter which value is estimated from the data. However, it is |
---|
2193 | necessary to set an appropriate upper bound for this parameter prior running the analysis. The |
---|
2194 | maximum value is set such that the correlation between the rate at the beginning and at the end of a |
---|
2195 | branch of length 1.0 calendar time unit is not different from 0. Here again the upper bound for the |
---|
2196 | model parameter depends on the time unit. It is important to choose this unit so that a branch of |
---|
2197 | length 1.0 calendar unit can be considered as short. For this reason, {\color{red}{we recommend to select a time |
---|
2198 | unit so that the calibration times take values between -10 and -1000}}\index{time scale}. |
---|
2199 | |
---|
2200 | \subsubsection{PhyTime specific options} |
---|
2201 | |
---|
2202 | Beside the \x{--calibration} option, there are other command line options that are specific to PhyTime: |
---|
2203 | \begin{itemize} |
---|
2204 | \item \x{--chain\_len=num}\index{command-line options!\x{--chain\_len}} \\ |
---|
2205 | \x{num} is the number of iterations required to estimate the joint posterior density of all the model |
---|
2206 | parameters, i.e., the length of the MCMC chain. Its default is set to 1E+6. |
---|
2207 | % \item \x{--burnin=num} \\ |
---|
2208 | % \x{num} is the number of iterations for the ``burnin'' period, i.e., the number of iterations |
---|
2209 | % required before actually starting to get valid samples from the joint posterior density of all the |
---|
2210 | % model parameters. Its default is set to 1E+5, i.e., 0.1 times the default value of the \x{chain\_len} |
---|
2211 | % parameter (see above). |
---|
2212 | \item \x{--sample\_freq=num}\index{command-line options!\x{--sample\_freq}} \\ \x{num} is the number of generations between successive collection |
---|
2213 | of the model parameter values throughout the MCMC algorithm. For instance, the |
---|
2214 | \x{--sample\_freq=1E+2} option will make PhyTime sample the model parameter every 100th iteration of |
---|
2215 | the MCMC algorithm. Its default is set to 1E+3. |
---|
2216 | \item \x{--fastlk=yes (no)}\index{command-line options!\x{--fastlk}} [Default: no]\\ |
---|
2217 | The option is used to turn on (off) the approximation of the likelihood function using a |
---|
2218 | multivariate normal density. By default, the exact likelihood is used. Using the normal |
---|
2219 | approximation considerably speeds up the calculation. However, it is necessary to ensure that this |
---|
2220 | approximation is appropriate by looking at the correlation between the exact and approximated |
---|
2221 | likelihood values that are sampled. Please read Section \ref{sec:recomphytime} for a description of |
---|
2222 | the appropriate steps to take. |
---|
2223 | \item \x{--no\_sequences}\index{command-line options!\x{--no\_sequences}}\\ |
---|
2224 | Use this option to run the sampler without sequence data. This option can be useful when one wants |
---|
2225 | to compare the marginal posterior density of model parameters to those derived when ignoring the |
---|
2226 | information conveyed by the sequences. Such comparison should be conducted on a systematic basis so as to determine whether the |
---|
2227 | parameters estimates are mostly determined by the prior of driven by the sequence data. |
---|
2228 | \item \x{--rate\_model=gbs/gbd/gamma/clock}\index{command-line options!\x{--rate\_model}}\\ |
---|
2229 | This option is to select the model of evolution of the rate of evolution. \x{gbs} (default) stands |
---|
2230 | for Geometric Brownian + Stochastic. This model considers that rates evolve along the tree |
---|
2231 | according to a geometric Brownian process \cite{kishino01} and the average rate of substitution |
---|
2232 | along a branch is a gamma distributed random variable. This model is described in \cite{guindon12}. |
---|
2233 | The \x{gbd} model (Geometric Browninan with Deterministic calculation of edge lengths) assumes the |
---|
2234 | same Geometric Brownian model of rates. However, as opposed to \x{gbs}, this model uses a |
---|
2235 | deterministic approximation to calculate the average rates of evolution along edges. This model |
---|
2236 | corresponds to the one described in \cite{kishino01} and implemented in the program Multidivtime\index{Multidivtime}. |
---|
2237 | \x{gamma} is a less sophisticated model that assumes that average rates along edges are distributed a |
---|
2238 | priori according to a gamma distribution. It is analogous to the uncorrelated clock model |
---|
2239 | implemented in BEAST\index{BEAST} with a gamma distribution replacing the exponential one. The \x{clock} option |
---|
2240 | corresponds to the strict clock model where all the lineages in the tree evolve at the same pace. |
---|
2241 | \end{itemize} |
---|
2242 | |
---|
2243 | \subsubsection{PhyTime output} |
---|
2244 | |
---|
2245 | The program PhyTime generates two output files. The file called |
---|
2246 | `\x{your\_seqfile\_phytime.XXXX.stats}', where XXXX is a randomly generated integer, lists the node |
---|
2247 | times and branch relative rates sampled during the estimation process. It also gives the sampled |
---|
2248 | values for other parameters, such as the autocorrelation of rates (parameter `Nu'), and the rate of |
---|
2249 | evolution (parameter `EvolRate') amongst others. This output file can be analysed with the program |
---|
2250 | Tracer\index{Tracer} from the BEAST\index{BEAST} package |
---|
2251 | (\url{http://beast.bio.ed.ac.uk/Main_Page}). The second file is called |
---|
2252 | `\x{your\_seqfile\_phytime.XXXX.trees}'. It is the list of trees that were collected during the |
---|
2253 | estimation process, i.e., phylogenies sampled from the posterior density of trees. This file can be |
---|
2254 | processed using the software TreeAnnotator, also part of the BEAST package (see |
---|
2255 | \url{http://beast.bio.ed.ac.uk/Main_Page}) in order to generate confidence sets for the node time |
---|
2256 | estimates. |
---|
2257 | |
---|
2258 | Important information is also displayed on the standard output of PhyTime (the standard output |
---|
2259 | generally corresponds to the terminal window from which PhyTime was launched). The first column of |
---|
2260 | this output gives the current generation, or run, of the chain. It starts at 1 and goes up to 1E+6 |
---|
2261 | by default (use \x{--chain\_len} to change this value, see above). The second column gives the time |
---|
2262 | elapsed in seconds since the sampling began. The third column gives the log likelihood of the |
---|
2263 | phylogenetic model (i.e., `Felsenstein's likelihood'). The fourth column gives the logarithm of the |
---|
2264 | joint prior probability of substitution rates along the tree and node heights. The fifth column |
---|
2265 | gives the current sampled value of the EvolRate parameter along with the corresponding Effective |
---|
2266 | Sample Size (ESS) (see Section \ref{sec:ess}) for this parameter. The sixth column gives the tree |
---|
2267 | height and the corresponding ESS. The seventh column gives the value of the autocorrelation |
---|
2268 | parameter followed by the corresponding ESS. The eightth column gives the values of the birth rate |
---|
2269 | parameter that governs the birth-rate model of species divergence dates. The last column of the |
---|
2270 | standard output gives the minimum of the ESS values taken over the whole set of node height |
---|
2271 | estimates. It provides useful information when one has to decide whether or not the sample size is |
---|
2272 | large enough to draw valid conclusion, i.e., decide whether the chain was run for long enough (see |
---|
2273 | Section \ref{sec:recomphytime} for more detail about adequate chain length). |
---|
2274 | |
---|
2275 | \subsubsection{ClockRate vs. EvolRate} |
---|
2276 | |
---|
2277 | The average rate of evolution along a branch is broken into two components. One is called ClockRate |
---|
2278 | and is the same throughout the tree. The other is called EvolRate and corresponds to a weighted |
---|
2279 | average of branch-specific rates. The model of rate evolution implemented in PhyTime forces the |
---|
2280 | branch-specific rate values to be greater than one. As a consequence, ClockRate is usually smaller |
---|
2281 | EvolRate. |
---|
2282 | |
---|
2283 | In more mathematical terms, let $\mu$ be the value of ClockRate, $r_i$ be the value of the relative |
---|
2284 | rate along branch $i$ and $\Delta_i$ the time elapsed along branch $i$. The value of EvolRate is |
---|
2285 | then given by: |
---|
2286 | \begin{eqnarray*} |
---|
2287 | \mathrm{EvolRate} = \mu \frac{\sum_{i}^{2n-3} r_i \Delta_i}{\sum_{i}^{2n-3} \Delta_i}. |
---|
2288 | \end{eqnarray*} It is clear from this equation that multiplying each $r_i$ by a constant and |
---|
2289 | dividing $\mu$ by the same constant does not change the value of EvolRate. The $r_i$s and $\mu$ |
---|
2290 | are then confounded, or non-identifiable, and only the value of EvolRate can be estimated from |
---|
2291 | the data. {\color{red}{Please make sure that you use the value of EvolRate rather than that of |
---|
2292 | ClockRate when referring to the estimate of the substitution rate}}. |
---|
2293 | |
---|
2294 | \subsubsection{Effective sample size}\label{sec:ess} |
---|
2295 | |
---|
2296 | The MCMC technique generates samples from a target distribution (in our case, the joint posterior |
---|
2297 | density of parameters). Due to the Markovian nature of the method, these samples are not |
---|
2298 | independent. The ESS is the estimated number of independent measurements obtained from a set of |
---|
2299 | (usually dependent) measurements. It is calculated using the following formula: |
---|
2300 | \begin{eqnarray*} |
---|
2301 | \mathrm{ESS} = N\left(\frac{1-r}{1+r}\right), |
---|
2302 | \end{eqnarray*} |
---|
2303 | where $N$ is the length of the chain (i.e., the `raw' or `correlated' sample size) and $r$ is the |
---|
2304 | autocorrelation value, which is obtained using the following formula: |
---|
2305 | \begin{eqnarray*} |
---|
2306 | r = \frac{1}{(N-k)\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k}-\mu_x), |
---|
2307 | \end{eqnarray*} where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the $X_i$ values |
---|
2308 | respectively and $k$ is the lag. The value of $r$ that is used in PhyTime corresponds to the case where $k=1$, |
---|
2309 | which therefore gives a first order approximation of the `average' autocorrelation value (i.e., the |
---|
2310 | autocorrelation averaged over the set of possible values of the lag). |
---|
2311 | |
---|
2312 | |
---|
2313 | \subsubsection{Prior distributions of model parameters}\label{sec:prior} |
---|
2314 | |
---|
2315 | Any Bayesian analysis requires specifying a prior distribution of model parameters. The outcome of |
---|
2316 | the data analysis, i.e., the posterior distribution, is influenced by the priors. It is especially |
---|
2317 | true if the signal conveyed by the data is weak. While some have argued that the specification of |
---|
2318 | priors relies more on arbitrary decisions than sound scientific reasoning, choosing relevant prior |
---|
2319 | distributions is in fact fully integrated in the process of building model that generates the |
---|
2320 | observed data. In particular, the problem of estimating divergence times naturally lends itself to |
---|
2321 | hierarchical Bayesian modelling. Based on the hypothesis that rates of evolution are conserved (to |
---|
2322 | some extant) throughout the course of evolution, the hierarchical Bayesian approach provides an |
---|
2323 | adequate framework for inferring substitution rates and divergence dates separately. Hence, in this |
---|
2324 | situation, it makes good sense to use what is known about a relatively well-defined feature of the |
---|
2325 | evolution of genetic sequences (the ``molecular clock'' hypothesis combined to stochastic variations of |
---|
2326 | rates across lineages) to build a prior distribution on rates along edges. |
---|
2327 | |
---|
2328 | |
---|
2329 | |
---|
2330 | |
---|
2331 | \subsubsection{Citing PhyTime}\label{sec:citephytime} |
---|
2332 | |
---|
2333 | The ``default citation'' is: |
---|
2334 | |
---|
2335 | \begin{itemize} |
---|
2336 | \item Guindon S. ``From trajectories to averages: an improved description of the heterogeneity of |
---|
2337 | substitution rates along lineages'', {\it Systematic Biology}, 2012. doi: 10.1093/sysbio/sys063. |
---|
2338 | \end{itemize} |
---|
2339 | |
---|
2340 | An earlier article also described some of the methods implemented in PhyTime: |
---|
2341 | |
---|
2342 | \begin{itemize} |
---|
2343 | \item Guindon S. ``Bayesian estimation of divergence times from large data sets'', {\it Molecular |
---|
2344 | Biology and Evolution}, 2010, |
---|
2345 | 27(8):1768:81. |
---|
2346 | \end{itemize} |
---|
2347 | |
---|
2348 | |
---|
2349 | \section{Recommendations on program usage}\label{sec:progusage} |
---|
2350 | |
---|
2351 | \subsection{PhyML} |
---|
2352 | |
---|
2353 | The choice of the tree searching algorithm among those provided by PhyML is generally a tough one. |
---|
2354 | The fastest option relies on local and simultaneous modifications of the phylogeny using NNI |
---|
2355 | moves. More thorough explorations of the space of topologies are also available through the SPR |
---|
2356 | options. As these two classes of tree topology moves involve different computational burdens, it |
---|
2357 | is important to determine which option is the most suitable for the type of data set or analysis one |
---|
2358 | wants to perform. Below is a list of recommendations for typical phylogenetic analyses. |
---|
2359 | |
---|
2360 | \begin{enumerate} |
---|
2361 | \item {\em Single data set, unlimited computing time.} The best option here is probably to use a SPR |
---|
2362 | search (i.e., straight SPR of best of SPR and NNI). If the focus is on estimating the relationships |
---|
2363 | between species, it is a good idea to use more than one starting tree to decrease the chance of |
---|
2364 | getting stuck in a local maximum of the likelihood function. Using NNIs is appropriate if the |
---|
2365 | analysis does not mainly focus on estimating the evolutionary relationships between species (e.g. a |
---|
2366 | tree is needed to estimate the parameters of codon-based models later on). Branch supports can be |
---|
2367 | estimated using bootstrap and approximate likelihood ratios. |
---|
2368 | |
---|
2369 | \item {\em Single data set, restricted computing time.} The three tree searching options can be |
---|
2370 | used depending on the computing time available and the size of the data set. For small data sets |
---|
2371 | (i.e., $<$ 50 sequences), NNI will generally perform well provided that the phylogenetic signal is |
---|
2372 | strong. It is relevant to estimate a first tree using NNI moves and examine the reconstructed |
---|
2373 | phylogeny in order to have a rough idea of the strength of the phylogenetic signal (the presence of |
---|
2374 | small internal branch lengths is generally considered as a sign of a weak phylogenetic signal, |
---|
2375 | specially when sequences are short). For larger data sets ($>$ 50 sequences), a SPR search is |
---|
2376 | recommended if there are good evidence of a lack of phylogenetic signal. Bootstrap analysis will |
---|
2377 | generally involve large computational burdens. Estimating branch supports using approximate |
---|
2378 | likelihood ratios therefore provides an interesting alternative here. |
---|
2379 | |
---|
2380 | \item {\em Multiple data sets, unlimited computing time.} Comparative genomic analyses sometimes |
---|
2381 | rely on building phylogenies from the analysis of a large number of gene families. Here again, the |
---|
2382 | NNI option is the most relevant if the focus is not on recovering the most accurate picture of the |
---|
2383 | evolutionary relationships between species. Slower SPR-based heuristics should be used when the |
---|
2384 | topology of the tree is an important parameter of the analysis (e.g., identification of horizontally |
---|
2385 | transferred genes using phylogenetic tree comparisons). Internal branch support is generally not a |
---|
2386 | crucial parameter of the multiple data set analyses. Using approximate likelihood ratio is probably |
---|
2387 | the best choice here. |
---|
2388 | |
---|
2389 | \item {\em Multiple data sets, limited computing time.} The large amount of data to be processed in |
---|
2390 | a limited time generally requires the use of the fastest tree searching and branch support |
---|
2391 | estimation methods Hence, NNI and approximate likelihood ratios rather than SPR and non-parametric |
---|
2392 | bootstrap are generally the most appropriate here. |
---|
2393 | \end{enumerate} |
---|
2394 | |
---|
2395 | Another important point is the choice of the substitution model. While default options generally |
---|
2396 | provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the |
---|
2397 | best-fit substitution model. This pre-analysis can be done using popular software such as Modeltest |
---|
2398 | \cite{posada98} or ProtTest \cite{abascal05} for instance. These programs generally recommend the |
---|
2399 | use of a discrete gamma distribution to model the substitution process as variability of rates among |
---|
2400 | sites is a common feature of molecular evolution. The choice of the number of rate classes to use |
---|
2401 | for this distribution is also an important one. While the default is set to four categories in |
---|
2402 | PhyML, it is recommended to use larger number of classes if possible in order to best approximate |
---|
2403 | the patterns of rate variation across sites \cite{galtier04}. Note however that run times are |
---|
2404 | directly proportional to the number of classes of the discrete gamma distribution. Here again, a |
---|
2405 | pre-analysis with the simplest model should help the user to determine the number of rate classes |
---|
2406 | that represents the best trade-off between computing time and fit of the model to the data. |
---|
2407 | |
---|
2408 | |
---|
2409 | \subsection{PhyTime}\label{sec:recomphytime} |
---|
2410 | |
---|
2411 | Analysing a data set using PhyTime should involve three steps based on the following questions: (1) |
---|
2412 | do the priors seem to be adequate (2) can I use the fast approximation of the likelihood and (3) how |
---|
2413 | long shall I run the program for? I explain below how to provide answers to these questions. |
---|
2414 | |
---|
2415 | \begin{itemize} |
---|
2416 | \item {\em Are the priors adequate?} Bayesian analysis relies on specifiying the joint |
---|
2417 | prior density of model parameters. In the case of node age estimation, these priors essentially |
---|
2418 | describe how rates of substitution vary across lineages and the probabilistic distribution that node |
---|
2419 | ages have when ignoring the information provided by the genetic sequences. These priors vary from |
---|
2420 | tree to tree. It is therefore essential to check the adequacy of priors for each user-defined input |
---|
2421 | tree. In order to do so, PhyTime needs to be run with the \x{--no\_data} option. When this option is |
---|
2422 | required, the sequence data provided as input will be ignored and the rest of the analysis will |
---|
2423 | proceed normally. The prior distribution of model parameters, essentially edge rates and node |
---|
2424 | heights, can then be checked using the program Tracer as one would do for the standard `posterior' |
---|
2425 | analysis. |
---|
2426 | |
---|
2427 | \item {\em Can I use the fast approximation to the likelihood?} The suface of the log-likelihood |
---|
2428 | function can be approximated using a multivariate normal density. This technique is saving very |
---|
2429 | substantial amounts of computation time. However, like most approximations, there are situations |
---|
2430 | where it does not provide a good fit to the actual function. This usually happens when the phylogeny |
---|
2431 | displays a lot of short branches, i.e., the signal conveyed by the sequences is weak. It is |
---|
2432 | therefore important to first check whether using the approximate likelihood is reasonable. In order |
---|
2433 | to do so, it is recommended to first run the program without the approximation, i.e., using the |
---|
2434 | default settings. Once the minimum value of the ESS of node ages (the last column on the right of |
---|
2435 | the standard output) has reached 40-50, open the \x{phytime.XXXX} output file with Tracer and |
---|
2436 | examine the correlation between the exact and approximate likelihood values. If the correlation is |
---|
2437 | deemed to be good enough, PhyTime can be re-run using the \x{--fast\_lk} option, which uses the fast |
---|
2438 | normal approximation to the likelihood function. |
---|
2439 | |
---|
2440 | |
---|
2441 | |
---|
2442 | % Figure \ref{fig:approxbad} gives an example where the correlation is too weak and the approximation |
---|
2443 | % of the likelihood should be avoided. Figure \ref{fig:approxbad} gives an example where the |
---|
2444 | % approximation is good enough. The current execution of PhyTime can be terminated and then |
---|
2445 | % re-launched using the \x{--fast\_lk} option. |
---|
2446 | |
---|
2447 | % \begin{figure} |
---|
2448 | % \begin{center} |
---|
2449 | % \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_bad.eps}} |
---|
2450 | % \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated |
---|
2451 | % (Y-axis) and the exact (X-axis) likelihoods is weak here. The exact likelihood should be used (option \x{fastlk=no}).} |
---|
2452 | % \label{fig:approxbad} |
---|
2453 | % \end{center} |
---|
2454 | % \end{figure} |
---|
2455 | |
---|
2456 | |
---|
2457 | % \begin{figure} |
---|
2458 | % \begin{center} |
---|
2459 | % \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_good.eps}} |
---|
2460 | % \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated |
---|
2461 | % (Y-axis) and the exact (X-axis) likelihoods is good. The approximation of the likelihood can be used (option \x{fastlk=yes}).} |
---|
2462 | % \label{fig:approxgood} |
---|
2463 | % \end{center} |
---|
2464 | % \end{figure} |
---|
2465 | |
---|
2466 | \item {\em How long shall I run the program for?} PhyTime should be run long enough such that the |
---|
2467 | ESS of each parameter is `large enough'. The last column on the right handside of the standard |
---|
2468 | output gives the minimum ESS across all internal node heights. It is recommended to run the program |
---|
2469 | so that this number reaches at least 100. |
---|
2470 | |
---|
2471 | \end{itemize} |
---|
2472 | |
---|
2473 | |
---|
2474 | \section{Frequently asked questions} |
---|
2475 | |
---|
2476 | \begin{enumerate} |
---|
2477 | \item {\it PhyML crashes before reading the sequences. What's wrong ?}\\ |
---|
2478 | \begin{itemize} |
---|
2479 | \item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output} |
---|
2480 | \item The carriage return characters in your sequence files are not recognized by PhyML. You must |
---|
2481 | make sure that your sequence file is a plain text file, with standard carriage return characters (i.e., |
---|
2482 | corresponding to ``$\backslash$\x{n}'', or ``$\backslash$\x{r}'') |
---|
2483 | \end{itemize} |
---|
2484 | \item {\it The program crashes after reading the sequences. What's wrong ?}\\ |
---|
2485 | \begin{itemize} |
---|
2486 | \item You analyse protein sequences and did not enter the \x{-d aa} option in the command-line. |
---|
2487 | \item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output} |
---|
2488 | \end{itemize} |
---|
2489 | \item {\it Does PhyML handle outgroup sequences ?}\\ |
---|
2490 | \begin{itemize} |
---|
2491 | \item Yes, it does. Outgroup taxa are identified by adding the `*' sign at the end of each |
---|
2492 | corresponding sequence name (see Section \ref{sec:outgroupspecify}) |
---|
2493 | \end{itemize} |
---|
2494 | \item {\it Does PhyML estimate clock-constrained trees ?}\\ |
---|
2495 | \begin{itemize} |
---|
2496 | \item No, the PhyML program does not estimate clock-contrained trees. One can however use the |
---|
2497 | program PhyTime to perform such analysis but the tree topology will not be estimated. |
---|
2498 | \end{itemize} |
---|
2499 | \item {\it Can PhyML analyse partitioned data, such as multiple gene sequences ?}\\ |
---|
2500 | \begin{itemize} |
---|
2501 | \item We are currently working on this topic. Future releases of the program will provide options |
---|
2502 | to estimate trees from phylogenomic data sets, with the opportunity to use different substitution |
---|
2503 | models on the different data partitions (e.g., different genes). PhyML will also include specific |
---|
2504 | algorithms to search the space of tree topologies for this type of data. |
---|
2505 | \end{itemize} |
---|
2506 | \end{enumerate} |
---|
2507 | |
---|
2508 | |
---|
2509 | |
---|
2510 | \section{Acknowledgements} |
---|
2511 | The development of PhyML since 2000 has been supported by the Centre National de la Recherche |
---|
2512 | Scientifique (CNRS) and the Minist\`ere de l'\'Education Nationale. |
---|
2513 | |
---|
2514 | % \bibliographystyle{/home/guindon/latex/biblio/OUPnum} |
---|
2515 | \bibliographystyle{/home/guindon/latex/biblio/nature/naturemag} |
---|
2516 | \bibliography{/home/guindon/latex/biblio/ref} |
---|
2517 | |
---|
2518 | \printindex |
---|
2519 | \end{document} |
---|