| 1 | \documentclass[a4paper,12pt]{article} |
|---|
| 2 | \usepackage{fancyvrb} |
|---|
| 3 | \usepackage{graphicx} |
|---|
| 4 | \usepackage{tabularx} |
|---|
| 5 | % \usepackage{color} |
|---|
| 6 | \usepackage{xcolor} |
|---|
| 7 | \usepackage{psfrag} |
|---|
| 8 | \usepackage[fleqn]{amsmath} |
|---|
| 9 | \usepackage[hyphens]{url} |
|---|
| 10 | % \usepackage{vmargin} |
|---|
| 11 | \usepackage{cite} |
|---|
| 12 | \usepackage{caption2} |
|---|
| 13 | \usepackage{hyperref} |
|---|
| 14 | \usepackage{makeidx} |
|---|
| 15 | \renewcommand{\captionlabeldelim}{.} |
|---|
| 16 | \def\thesection{\arabic{section}} |
|---|
| 17 | \renewcommand{\thefigure}{\arabic{figure}} |
|---|
| 18 | \renewcommand{\thetable}{\arabic{table}} |
|---|
| 19 | \newcommand{\hl}{\noalign{\vskip3pt}\hline\noalign{\vskip3pt}} |
|---|
| 20 | \newcommand{\hrf}{\hrulefill} |
|---|
| 21 | \newcommand{\tc}[1]{\textcolor{black}{#1}} |
|---|
| 22 | \newcommand{\dc}[1]{\textcolor{green}{#1}} |
|---|
| 23 | % \usepackage[none]{hyphenat} |
|---|
| 24 | \newcommand{\rep}[3][1] |
|---|
| 25 | { |
|---|
| 26 | \psfrag{#2}[c][c][#1]{#3} |
|---|
| 27 | } |
|---|
| 28 | \newcommand{\x}[1]{\texttt{#1}} |
|---|
| 29 | |
|---|
| 30 | % \setpapersize{A4} |
|---|
| 31 | % \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red,linkbordercolor=000} |
|---|
| 32 | \hypersetup{colorlinks=true,linkcolor=blue,urlcolor=red} |
|---|
| 33 | \renewcommand{\baselinestretch}{1.} |
|---|
| 34 | \makeindex |
|---|
| 35 | \begin{document} |
|---|
| 36 | \sloppy |
|---|
| 37 | \begin{center} |
|---|
| 38 | \thispagestyle{empty} |
|---|
| 39 | \vfill\vfill |
|---|
| 40 | % \rule{\linewidth}{0.02cm}\\ |
|---|
| 41 | {\Huge \textbf{ P~h~y~M~L~~--~~M~a~n~u~a~l}} |
|---|
| 42 | \vspace{-0.4cm}\\ |
|---|
| 43 | % \rule{\linewidth}{0.02cm}\\ |
|---|
| 44 | \vfill |
|---|
| 45 | {\huge Version 3.0 \\ |
|---|
| 46 | \today |
|---|
| 47 | \vfill |
|---|
| 48 | \normalsize |
|---|
| 49 | \href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml}\\ |
|---|
| 50 | \vspace{0.4cm} |
|---|
| 51 | \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml}} |
|---|
| 52 | \end{center} |
|---|
| 53 | \clearpage |
|---|
| 54 | \tableofcontents |
|---|
| 55 | \clearpage |
|---|
| 56 | |
|---|
| 57 | {\par |
|---|
| 58 | \small |
|---|
| 59 | \noindent |
|---|
| 60 | \copyright Copyright 1999 - 2008 by PhyML Development Team.\\ |
|---|
| 61 | \noindent The software PhyML is provided ``as is'' without warranty of any kind. In no event shall |
|---|
| 62 | the authors or his employer be held responsible for any damage resulting from the use of this |
|---|
| 63 | software, including but not limited to the frustration that you may experience in using the package. |
|---|
| 64 | All parts of the source and documentation except where indicated are distributed under |
|---|
| 65 | the GNU public licence. See http://www.opensource.org for details. |
|---|
| 66 | |
|---|
| 67 | } |
|---|
| 68 | |
|---|
| 69 | { |
|---|
| 70 | \noindent |
|---|
| 71 | \setlength{\baselineskip}{0.5\baselineskip} |
|---|
| 72 | \section{Availability} |
|---|
| 73 | \begin{itemize} |
|---|
| 74 | \item Binaries: \href{http://www.atgc-montpellier.fr/phyml}{http://www.atgc-montpellier.fr/phyml} |
|---|
| 75 | \item Sources: \href{http://code.google.com/p/phyml}{http://code.google.com/p/phyml} |
|---|
| 76 | \item Discussion forum: \href{http://groups.google.com/group/phyml-forum}{http://groups.google.com/group/phyml-forum} |
|---|
| 77 | \end{itemize} |
|---|
| 78 | } |
|---|
| 79 | |
|---|
| 80 | { |
|---|
| 81 | \noindent |
|---|
| 82 | \setlength{\baselineskip}{0.7\baselineskip} |
|---|
| 83 | \section{Authors} |
|---|
| 84 | \begin{itemize} |
|---|
| 85 | \item { St\'ephane Guindon} and { Olivier Gascuel} conceived the original PhyML algorithm. |
|---|
| 86 | \item { St\'ephane Guindon} conceived the PhyTime method. |
|---|
| 87 | \item { St\'ephane Guindon, Wim Hordjik} and { Olivier Gascuel} conceived the SPR-based tree search algorithm. |
|---|
| 88 | \item { Maria Anisimova} and { Olivier Gascuel} conceived the aLRT method for branch support. |
|---|
| 89 | \item { St\'ephane Guindon, Franck Lethiec}, Jean-Francois Dufayard and Vincent Lefort implemented PhyML. |
|---|
| 90 | \item { Jean-Francois Dufayard} created the benchmark and implemented the tools that are used to check |
|---|
| 91 | PhyML accuracy and performances. |
|---|
| 92 | \item { Vincent Lefort, St\'ephane Guindon, Patrice Duroux} and { Olivier Gascuel} conceived and |
|---|
| 93 | implemented PhyML web server. |
|---|
| 94 | \item St\'ephane Guindon wrote this document. |
|---|
| 95 | \end{itemize} |
|---|
| 96 | } |
|---|
| 97 | \clearpage |
|---|
| 98 | |
|---|
| 99 | % \section{ML in phylogenetics: the basics.} |
|---|
| 100 | \section{Overview} |
|---|
| 101 | |
|---|
| 102 | PhyML \cite{guindon03} is a software package which primary task that is to estimate maximum |
|---|
| 103 | likelihood phylogenies from alignments of nucleotide or amino-acid sequences. It provides a wide |
|---|
| 104 | range of options that were designed to facilitate standard phylogenetic analyses. The main |
|---|
| 105 | strength of PhyML lies in the large number of substitution models coupled to various options to |
|---|
| 106 | search the space of phylogenetic tree topologies, going from very fast and efficient methods to |
|---|
| 107 | slower but generally more accurate approaches. It also implements two methods to evaluate branch |
|---|
| 108 | supports in a sound statistical framework (the non-parametric bootstrap and the approximate |
|---|
| 109 | likelihood ratio test). |
|---|
| 110 | |
|---|
| 111 | PhyML was designed to process moderate to large data sets. In theory, alignments with up to 4,000 |
|---|
| 112 | sequences 2,000,000 character-long can analyzed. In practice however, the amount of memory required |
|---|
| 113 | to process a data set is proportional of the product of the number of sequences by their length. |
|---|
| 114 | Hence, a large number of sequences can only be processed provided that they are short. Also, PhyML |
|---|
| 115 | can handle long sequences provided that they are not numerous. With most standard personal |
|---|
| 116 | computers, the ``comfort zone'' for PhyML generally lies around 100-500 sequences less than 10,000 |
|---|
| 117 | character long. For larger data sets, we recommend using other softwares such as RAxML |
|---|
| 118 | \cite{raxml}\index{RAxML} or GARLI \cite{garli}\index{GARLI} or Treefinder |
|---|
| 119 | (\href{http://www.treefinder.de}{http://www.treefinder.de}). |
|---|
| 120 | |
|---|
| 121 | |
|---|
| 122 | \section{Bug report}\index{bug} |
|---|
| 123 | |
|---|
| 124 | While PhyML is, of course, bug-free (!) (please read the disclaimer carefuly...), if you ever come |
|---|
| 125 | across an issue, please feel free to report it using the discuss group web site at the following |
|---|
| 126 | address: \url{https://groups.google.com/forum/?fromgroups#!forum/phyml-forum}. Alternatively, you |
|---|
| 127 | can send an email to \url{s.guindon@auckland.ac.nz}. Do not forget to mention the version of PhyML |
|---|
| 128 | and program options you are using. |
|---|
| 129 | |
|---|
| 130 | |
|---|
| 131 | \section{Installing PhyML} |
|---|
| 132 | |
|---|
| 133 | \subsection{Sources and compilation}\index{compilation} |
|---|
| 134 | |
|---|
| 135 | The sources of the program are available free of charge by sending an e-mail to St\'ephane Guindon |
|---|
| 136 | at \href{mailto:guindon@lirmm.fr}{\x{guindon@lirmm.fr}} or |
|---|
| 137 | \href{mailto:s.guindon@auckland.ac.nz}{\x{s.guindon@auckland.ac.nz}}. The compilation on UNIX-like |
|---|
| 138 | systems is fairly standard. It is described in the `\x{INSTALL}' file that comes with the |
|---|
| 139 | sources. In a command-line window, go to the directory that contains the sources and type: |
|---|
| 140 | |
|---|
| 141 | {\setlength{\baselineskip}{0.5\baselineskip} |
|---|
| 142 | \begin{verbatim} |
|---|
| 143 | ./configure; |
|---|
| 144 | make clean; |
|---|
| 145 | make V=0; |
|---|
| 146 | \end{verbatim} |
|---|
| 147 | } |
|---|
| 148 | |
|---|
| 149 | By default, PhyML will be compiled with optimization flags turned on. It is possible to generate a |
|---|
| 150 | version of PhyML that can run through a debugging tool (such as \x{ddd}\label{ddd}) or a profiling |
|---|
| 151 | tool (such as \x{gprof}\label{gprof}) using the following instructions: |
|---|
| 152 | |
|---|
| 153 | {\setlength{\baselineskip}{0.5\baselineskip} |
|---|
| 154 | \begin{verbatim} |
|---|
| 155 | ./configure --enable-debug; |
|---|
| 156 | make clean; |
|---|
| 157 | make V=0; |
|---|
| 158 | \end{verbatim} |
|---|
| 159 | } |
|---|
| 160 | |
|---|
| 161 | % {\em Note} -- when PhyML is going to be used mostly of exclusively in batch mode, it is preferable |
|---|
| 162 | % to turn on the batch mode option in the Makefile. In order to do so, the file \x{Makefile.am} needs |
|---|
| 163 | % to be modified: add \x{-DBATCH} to the line with \x{DEFS=-DUNIX -D\$(PROG) -DDEBUG}. |
|---|
| 164 | |
|---|
| 165 | |
|---|
| 166 | |
|---|
| 167 | \subsection{Installing PhyML on UNIX-like systems (including Mac OS)} |
|---|
| 168 | |
|---|
| 169 | Copy PhyML binary file in the directory you like. For the operating system to be able to locate the |
|---|
| 170 | program, this directory must be specified in the global variable \x{PATH}. In order to achieve this, |
|---|
| 171 | you will have to add \x{export PATH="/your\_path/:\${PATH}"} to the \x{.bashrc} or the |
|---|
| 172 | \x{.bash\_profile} located in your home directory (\x{your\_path} is the path to the directory that |
|---|
| 173 | contains PhyML binary). |
|---|
| 174 | |
|---|
| 175 | |
|---|
| 176 | \subsection{Installing PhyML on Microsoft Windows}\label{sec:install_windows} |
|---|
| 177 | |
|---|
| 178 | Copy the files \x{phyml.exe} and \x{phyml.bat} in the same directory. To launch PhyML, click on the |
|---|
| 179 | icon corresponding to \x{phyml.bat}. Clicking on the icon for \x{phyml.exe} works too but the |
|---|
| 180 | dimensions of the window will not fit PhyML PHYLIP-like interface. |
|---|
| 181 | |
|---|
| 182 | \subsection{Installing the parallel version of PhyML}\label{sec:MPI}\index{MPI}\index{bootstrap!parallel} |
|---|
| 183 | |
|---|
| 184 | Bootstrap analysis can run on multiple processors. Each processor analyses one bootstraped dataset. |
|---|
| 185 | Therefore, the computing time needed to perform $R$ bootstrap replicates is divided by the number of |
|---|
| 186 | processors available. |
|---|
| 187 | |
|---|
| 188 | This feature of PhyML relies on the MPI (Message Passing Interface) library. To use it, your |
|---|
| 189 | computer must have MPI installed on it. In case MPI is not installed, you can dowload it from |
|---|
| 190 | \href{http://www.mcs.anl.gov/research/projects/mpich2/}{http://www.mcs.anl.gov/research/projects/mpich2/}. |
|---|
| 191 | Once MPI is installed, it is necessary to launch the MPI daemon. This can be done by entering the |
|---|
| 192 | following instruction: \x{mpd \&}. Note however that in most cases, the MPI daemon will already be |
|---|
| 193 | running on your server so that you most likely do not need to worry about this. You can then just |
|---|
| 194 | go in the \x{phyml/} directory (the directory that contains the \x{src/}, \x{examples/} and \x{doc/} |
|---|
| 195 | folders) and enter the commands below: |
|---|
| 196 | |
|---|
| 197 | {\setlength{\baselineskip}{0.5\baselineskip} |
|---|
| 198 | \begin{verbatim} |
|---|
| 199 | ./configure --enable-mpi; |
|---|
| 200 | make clean; |
|---|
| 201 | make; |
|---|
| 202 | \end{verbatim} |
|---|
| 203 | } |
|---|
| 204 | |
|---|
| 205 | A binary file named \x{phyml-mpi} has now been created in the \x{src/} directory and is ready to use |
|---|
| 206 | with MPI. A typical MPI command-line which uses 4 CPUs is given below: |
|---|
| 207 | |
|---|
| 208 | {\setlength{\baselineskip}{0.5\baselineskip} |
|---|
| 209 | \begin{verbatim} |
|---|
| 210 | mpirun -n 4 ./phyml-mpi -i myseq -b 100 |
|---|
| 211 | \end{verbatim} |
|---|
| 212 | } |
|---|
| 213 | |
|---|
| 214 | \noindent Please read section \ref{sec:parallel_bootstrap} of this document for more information. |
|---|
| 215 | |
|---|
| 216 | \section{Program usage.}\label{sec:phyml_new} |
|---|
| 217 | |
|---|
| 218 | PhyML has three distinct user-interfaces. The first corresponds to a PHYLIP-like text interface |
|---|
| 219 | that makes the choice of the options self-explanatory. The command-line interface is well-suited for |
|---|
| 220 | people that are familiar with PhyML options or for running PhyML in batch mode. The XML interface is |
|---|
| 221 | more sophisticated. It allows the user to analyse partitionned data using flexible mixture models of evolution. |
|---|
| 222 | |
|---|
| 223 | \subsection{PHYLIP-like interface} |
|---|
| 224 | |
|---|
| 225 | The default is to use the PHYLIP-like text interface by simply typing `\x{phyml}' in a command-line |
|---|
| 226 | window or by clicking on the PhyML icon (see Section \ref{sec:install_windows}). After entering the |
|---|
| 227 | name of the input sequence file, a list of sub-menus helps the users set up the analysis. There |
|---|
| 228 | are currently four distinct sub-menus: |
|---|
| 229 | |
|---|
| 230 | % \begin{figure} |
|---|
| 231 | % \resizebox{15cm}{9cm}{\includegraphics{./fig/interface.eps}} |
|---|
| 232 | % \caption{PHYLIP-like interface to PhyML.} |
|---|
| 233 | % \label{fig:interface} |
|---|
| 234 | % \end{figure} |
|---|
| 235 | |
|---|
| 236 | \begin{enumerate} |
|---|
| 237 | |
|---|
| 238 | \item {\em Input Data}: specify whether the input file contains amino-acid or nucleotide |
|---|
| 239 | sequences. What the sequence format is (see Section \ref{sec:input_output}) and how many data sets |
|---|
| 240 | should be analysed. |
|---|
| 241 | |
|---|
| 242 | \item {\em Substitution Model}: selection of the Markov model of substitution. |
|---|
| 243 | |
|---|
| 244 | \item {\em Tree Searching}: selection of the tree topology searching algorithm. |
|---|
| 245 | |
|---|
| 246 | \item {\em Branch Support}: selection of the method that is used to measure branch support. |
|---|
| 247 | |
|---|
| 248 | \end{enumerate} |
|---|
| 249 | \noindent `\x{+}' and `\x{-}' keys are used to move forward and backward in the sub-menu list. Once |
|---|
| 250 | the model parameters have been defined, typing `\x{Y}' (or `\x{y}') launches the calculations. The |
|---|
| 251 | meaning of some options may not be obvious to users that are not familiar with phylogenetics. In |
|---|
| 252 | such situation, we strongly recommend to use the default options. As long as the format of the input |
|---|
| 253 | sequence file is correctly specified (sub-menu {\em Input data}), the safest option for non-expert |
|---|
| 254 | users is to use the default settings. The different options provided within each sub-menu are |
|---|
| 255 | described in what follows. |
|---|
| 256 | |
|---|
| 257 | |
|---|
| 258 | \subsubsection{Input Data sub-menu} |
|---|
| 259 | |
|---|
| 260 | \begin{center}\framebox{\x{[D] ............................... Data type (DNA/AA)}} \end{center} |
|---|
| 261 | Type of data in the input file. It can be either DNA or amino-acid sequences in PHYLIP format (see |
|---|
| 262 | Section \ref{sec:input_output}). Type \x{D} to change settings. |
|---|
| 263 | |
|---|
| 264 | \vspace{0.7cm} |
|---|
| 265 | \begin{center} \framebox{\x{[I] ...... Input sequences interleaved (or sequential)}} \end{center} |
|---|
| 266 | PHYLIP format comes in two flavours: interleaved or sequential (see Section |
|---|
| 267 | \ref{sec:input_output}). Type \x{I} to selected among the two formats. |
|---|
| 268 | |
|---|
| 269 | \vspace{0.7cm} |
|---|
| 270 | \begin{center} \framebox{\x{[M] ....................... Analyze multiple data sets}} \end{center} |
|---|
| 271 | If the input sequence file contains more than one data sets, PhyML can analyse each of them |
|---|
| 272 | in a single run of the program. Type \x{M} to change settings. |
|---|
| 273 | |
|---|
| 274 | \vspace{0.7cm} |
|---|
| 275 | \begin{center} \framebox{\x{[R] ............................................ Run ID}} \end{center} |
|---|
| 276 | This option allows you to append a string that identifies the current PhyML run. Say for instance |
|---|
| 277 | that you want to analyse the same data set with two models. You can then `tag' the first PhyML run |
|---|
| 278 | with the name of the first model while the second run is tagged with the name of the second model.\index{run ID} |
|---|
| 279 | |
|---|
| 280 | |
|---|
| 281 | |
|---|
| 282 | \subsubsection{Substitution model sub-menu}\label{sec:submenus} |
|---|
| 283 | |
|---|
| 284 | \begin{center} \framebox{\x{[M] ................. Model of nucleotide substitution}} \end{center} |
|---|
| 285 | \begin{center} \framebox{\x{[M] ................ Model of amino-acids substitution}} \end{center} |
|---|
| 286 | PhyML implements a wide range of substitution models: JC69 \cite{jukes69}, K80 \cite{kimura80}, F81 |
|---|
| 287 | \cite{felsenstein81a}, F84 \cite{phylip2}, HKY85 \cite{hasegawa85}, TN93 \cite{tamura93} GTR |
|---|
| 288 | \cite{lanave84,tavare86} and custom for nucleotides; LG \cite{le08}, WAG \cite{whelan01b}, Dayhoff |
|---|
| 289 | \cite{dayhoff78}, JTT \cite{jones92}, Blosum62 \cite{henikoff92}, mtREV \cite{adachi96}, rtREV |
|---|
| 290 | \cite{dimmic02}, cpREV \cite{adachi00}, DCMut \cite{kosiol04}, VT \cite{muller00} and mtMAM |
|---|
| 291 | \cite{cao98} and custom for amino acids. Cycle through the list of nucleotide or amino-acids |
|---|
| 292 | substitution models by typing \x{M}. Both nucleotide and amino-acid lists include a `custom' model. |
|---|
| 293 | The custom option provides the most flexible way to specify the nucleotide substitution model. The |
|---|
| 294 | model is defined by a string made of six digits. The default string is `\x{000000}', which means |
|---|
| 295 | that the six relative rates of nucleotide changes: $A \leftrightarrow C$, $A \leftrightarrow G$, $A |
|---|
| 296 | \leftrightarrow T$, $C \leftrightarrow G$, $C \leftrightarrow T$ and $G \leftrightarrow T$, are |
|---|
| 297 | equal. The string `\x{010010}' indicates that the rates $A \leftrightarrow G$ and $C |
|---|
| 298 | \leftrightarrow T$ are equal and distinct from $A \leftrightarrow C = A \leftrightarrow T = C |
|---|
| 299 | \leftrightarrow G = G \leftrightarrow T$. This model corresponds to HKY85 (default) or K80 if the |
|---|
| 300 | nucleotide frequencies are all set to 0.25. `\x{010020}' and `\x{012345}' correspond to TN93 and |
|---|
| 301 | GTR models respectively. The digit string therefore defines groups of relative substitution rates. |
|---|
| 302 | The initial rate within each group is set to 1.0, which corresponds to F81 (JC69 if the base |
|---|
| 303 | frequencies are equal). Users also have the opportunity to define their own initial rate values. |
|---|
| 304 | These rates are then optimised afterwards (option `\x{O}') or fixed to their initial values. The |
|---|
| 305 | custom option can be used to implement all substitution models that are special cases of |
|---|
| 306 | GTR. Table \ref{tab:modelcode} on page \pageref{tab:modelcode} gives the correspondence between the `standard' name of the model |
|---|
| 307 | (see \url{http://mbe.oxfordjournals.org/content/18/6/897/F2.large.jpg}) and the custom model code. |
|---|
| 308 | The custom model also exists for protein sequences. It is useful when one wants to use an amino-acid |
|---|
| 309 | substitution model that is not hard-coded in PhyML. The symmetric part of the rate matrix, as well |
|---|
| 310 | as the equilibrium amino-acid frequencies, are given in a file which name is given as input of the |
|---|
| 311 | program. The format of this file is described in the section \ref{sec:customaa}. |
|---|
| 312 | |
|---|
| 313 | \vspace{0.7cm} |
|---|
| 314 | \begin{center} \framebox{\x{[F] ................. Optimise equilibrium frequencies}} \end{center} |
|---|
| 315 | \begin{center} \framebox{\x{[E] ......... Equilibrium frequencies (empirical/user)}} \end{center} |
|---|
| 316 | \begin{center} \framebox{\x{[F] . Amino acid frequencies (empirical/model defined)}} \end{center} |
|---|
| 317 | For nucleotide sequences, optimising nucleotide frequencies means that the values of these |
|---|
| 318 | parameters are estimated in the maximum likelihood framework. When the custom model option is |
|---|
| 319 | selected, it is also possible to give the program a user-defined nucleotide frequency distribution |
|---|
| 320 | at equilibrium (option \x{E}). For protein sequences, the stationary amino-acid frequencies are |
|---|
| 321 | either those defined by the substitution model or those estimated by counting the number of |
|---|
| 322 | different amino-acids observed in the data. Hence, the meaning of the \x{F} option depends on the |
|---|
| 323 | type of the data to be processed. |
|---|
| 324 | |
|---|
| 325 | \vspace{0.7cm} |
|---|
| 326 | \begin{center} \framebox{\x{[T] .................... Ts/tv ratio |
|---|
| 327 | (fixed/estimated)}} \end{center}\index{$\kappa$}\index{ts/tv ratio} Fix or estimate the |
|---|
| 328 | transition/transversion ratio in the maximum likelihood framework. This option is only available |
|---|
| 329 | when DNA sequences are to be analysed under K80, HKY85 or TN93 models. The definition given to this |
|---|
| 330 | parameter by PhyML is the same as PAML's\index{PAML} one. Therefore, the value of this parameter |
|---|
| 331 | does {\it not} correspond to the ratio between the expected number of transitions and the expected |
|---|
| 332 | number of transversions during a unit of time. This last definition is the one used in |
|---|
| 333 | PHYLIP\index{PHYLIP}. PAML's manual gives more detail about the distinction between the two |
|---|
| 334 | definitions (\url{http://abacus.gene.ucl.ac.uk/software/paml.html}). |
|---|
| 335 | |
|---|
| 336 | \vspace{0.7cm} |
|---|
| 337 | \begin{center} \framebox{\x{[V] . Proportion of invariable sites |
|---|
| 338 | (fixed/estimated)}} \end{center}\index{invariable sites}\index{proportion of invariants} The |
|---|
| 339 | proportion of invariable sites, i.e., the expected frequency of sites that do not evolve, can be |
|---|
| 340 | fixed or estimated. The default is to fix this proportion to 0.0. By doing so, we consider that each |
|---|
| 341 | site in the sequence may accumulate substitutions at some point during its evolution, even if no |
|---|
| 342 | differences across sequences are actually observed at that site. Users can also fix this parameter |
|---|
| 343 | to any value in the $[0.0,1.0]$ range or estimate it from the data in the maximum-likelihood |
|---|
| 344 | framework. |
|---|
| 345 | |
|---|
| 346 | \vspace{0.7cm} |
|---|
| 347 | \index{gamma distribution (discrete)!mean vs. median} |
|---|
| 348 | \index{gamma distribution (discrete)!number of categories} |
|---|
| 349 | \index{gamma distribution (discrete)!shape parameter} |
|---|
| 350 | \begin{center} \framebox{\x{[R] ....... One category of substitution rate (yes/no)}} \end{center} |
|---|
| 351 | \begin{center} \framebox{\x{[C] ........... Number of substitution rate categories}} \end{center} |
|---|
| 352 | \begin{center} \framebox{\x{[A] ... Gamma distribution parameter (fixed/estimated)}} \end{center} |
|---|
| 353 | \begin{center} \framebox{\x{[G] .........`Middle' of each rate class (mean/median)}} \end{center} |
|---|
| 354 | Rates of evolution often vary from site to site. This heterogeneity can be modelled using a discrete |
|---|
| 355 | gamma distribution. Type \x{R} to switch this option on or off. The different categories of this |
|---|
| 356 | discrete distribution correspond to different (relative) rates of evolution. The number of |
|---|
| 357 | categories of this distribution is set to 4 by default. It is probably not wise to go below this |
|---|
| 358 | number. Larger values are generally preferred. However, the computational burden involved is |
|---|
| 359 | proportional to the number of categories (i.e., an analysis with 8 categories will generally take |
|---|
| 360 | twice the time of the same analysis with only 4 categories). Note that the likelihood will not |
|---|
| 361 | necessarily increase as the number of categories increases. Hence, the number of categories should |
|---|
| 362 | be kept below a ``reasonable'' number, say 20. The default number of categories can be changed by |
|---|
| 363 | typing \x{C}. |
|---|
| 364 | |
|---|
| 365 | The middle of each discretized substitution rate class can be determined using the mean or the |
|---|
| 366 | median. PAML, MrBayes and RAxML use the mean. However, the median is generally associated with |
|---|
| 367 | greater likelihoods than the mean. This conclusion is based on our analysis of several real-world |
|---|
| 368 | data sets extracted from TreeBase. Despite this, the default option in PhyML is to use the mean in |
|---|
| 369 | order to make PhyML likelihoods comparable to those of other phylogenetic software. One must bare |
|---|
| 370 | in mind that {\color{red}{likelihoods calculated with the mean approximation are not directly |
|---|
| 371 | comparable to the likelihoods calculated using the median approximation}}. |
|---|
| 372 | |
|---|
| 373 | The shape of the gamma distribution determines the range of rate variation across sites. Small |
|---|
| 374 | values, typically in the $[0.1,1.0]$ range, correspond to large variability. Larger values |
|---|
| 375 | correspond to moderate to low heterogeneity. The gamma shape parameter can be fixed by the user or |
|---|
| 376 | estimated via maximum-likelihood. Type \x{A} to select one or the other option. |
|---|
| 377 | |
|---|
| 378 | |
|---|
| 379 | \subsubsection{Tree searching sub-menu} |
|---|
| 380 | |
|---|
| 381 | \begin{center} \framebox{\x{[O] ........................... Optimise tree topology}} \end{center} By |
|---|
| 382 | default the tree topology is optimised in order to maximise the likelihood. However, it is also |
|---|
| 383 | possible to avoid any topological alteration. This option is useful when one wants to compute the |
|---|
| 384 | likelihood of a tree given as input (see below). Type \x{O} to select among these two options. |
|---|
| 385 | |
|---|
| 386 | \vspace{0.7cm} |
|---|
| 387 | \begin{center} \framebox{\x{[S] .................. Tree topology search operations}} \end{center} |
|---|
| 388 | PhyML proposes three different methods to estimate tree topologies. The default approach is to use |
|---|
| 389 | simultaneous NNI. This option corresponds to the original PhyML algorithm \cite{guindon03}. The |
|---|
| 390 | second approach relies on subtree pruning and regrafting (SPR). It generally finds better tree |
|---|
| 391 | topologies compared to NNI but is also significantly slower. The third approach, termed BEST, |
|---|
| 392 | simply estimates the phylogeny using both methods and returns the best solution among the two. Type |
|---|
| 393 | \x{S} to choose among these three choices. |
|---|
| 394 | |
|---|
| 395 | \vspace{0.7cm} |
|---|
| 396 | \begin{center} \framebox{\x{[R] ......................... Use random starting tree}} \end{center} |
|---|
| 397 | \begin{center} \framebox{\x{[N] .................. Number of random starting trees}} \end{center} |
|---|
| 398 | When the SPR or the BEST options are selected, is is possible to use random trees rather than BioNJ |
|---|
| 399 | or a user-defined tree, as starting tree. If this option is turned on (type \x{R} to change), five |
|---|
| 400 | trees, corresponding to five random starts, will be estimated. The output tree file will contain the |
|---|
| 401 | best tree found among those five. The number of random starts can be modified by typing |
|---|
| 402 | \x{N}. Setting the number of random starting trees to $N$ means that the analysis will take |
|---|
| 403 | (slightly more than) $N$ times the time required for a standard analysis where only one (BioNJ) |
|---|
| 404 | starting tree is used. However, the analysis of real data sets shows that the best trees estimated |
|---|
| 405 | using the random start option almost systematically have higher likelihoods than those inferred |
|---|
| 406 | using a single starting tree. |
|---|
| 407 | |
|---|
| 408 | \vspace{0.7cm} |
|---|
| 409 | \begin{center} \framebox{\x{[U] ........ Starting tree (BioNJ/parsimony/user |
|---|
| 410 | tree)}} \end{center}\index{BioNJ} When the tree topology optimisation option is turned on, PhyML |
|---|
| 411 | proceeds by refining an input tree. By default, this input tree is estimated using BioNJ |
|---|
| 412 | \cite{gascuelNJ}. The alternative option is to use a parsimony tree. We found this option specially |
|---|
| 413 | useful when analysing large data sets with NNI moves as it generally leads to greater likelihoods |
|---|
| 414 | than those obtained when starting from a BioNJ trees. The user can also to input her/his own |
|---|
| 415 | tree. This tree should be in Newick format (see Section \ref{sec:input_output}). This option is |
|---|
| 416 | useful when one wants to evaluate the likelihood of a given tree with a fixed topology, using |
|---|
| 417 | PhyML. Type \x{U} to choose among these two options. |
|---|
| 418 | |
|---|
| 419 | \subsubsection{Branch support sub-menu} |
|---|
| 420 | |
|---|
| 421 | \begin{center} \framebox{\x{[B] ................ Non parametric bootstrap analysis}} \end{center} |
|---|
| 422 | The support of the data for each internal branch of the phylogeny can be estimated using |
|---|
| 423 | non-parametric bootstrap. By default, this option is switched off. Typing \x{B} switches on the |
|---|
| 424 | bootstrap analysis. The user is then prompted for a number of bootstrap replicates. The largest this |
|---|
| 425 | number the more precise the bootstrap support estimates are. However, for each bootstrap replicate |
|---|
| 426 | a phylogeny is estimated. Hence, the time needed to analyse $N$ bootstrap replicates corresponds to |
|---|
| 427 | $N$-times the time spent on the analysis of the original data set. $N=100$ is generally considered |
|---|
| 428 | as a reasonable number of replicates. |
|---|
| 429 | |
|---|
| 430 | \begin{center} \framebox{\x{[A] ................ Approximate likelihood ratio test}} \end{center} |
|---|
| 431 | When the bootstrap option is switched off (see above), approximate likelihood branch supports are |
|---|
| 432 | estimated. This approach is considerably faster than the bootstrap one. However, both methods intend |
|---|
| 433 | to estimate different quantities and conducting a fair comparison between both criteria is not |
|---|
| 434 | straightforward. The estimation of approximate likelihood branch support comes in multiple |
|---|
| 435 | flavours. The default is set to aBayes, corresponding to the approximate Bayes method described in |
|---|
| 436 | \cite{anisimova11}. The approximate likelihood ratio test (aLRT) \cite{anisimova06}, |
|---|
| 437 | ShimodairaâHasegawa aLRT (SH-aLRT) statistics are the other available options. |
|---|
| 438 | |
|---|
| 439 | |
|---|
| 440 | \subsection{Command-line interface} |
|---|
| 441 | |
|---|
| 442 | An alternative to the PHYLIP-like interface is the command-line interface. Users that do not need to |
|---|
| 443 | modify the default parameters can launch the program with the `\x{phyml -i seq\_file\_name}' |
|---|
| 444 | command. The list of all command line arguments and how to use them is given in the `Help' section |
|---|
| 445 | which is displayed when entering the `\x{phyml --help}' command. The available command-line options |
|---|
| 446 | are described in what follows. |
|---|
| 447 | |
|---|
| 448 | \begin{itemize} |
|---|
| 449 | |
|---|
| 450 | \item \x{-i} (or \x{--input}) \x{seq\_file\_name}\index{command-line options!\x{--input}} \\ |
|---|
| 451 | \x{seq\_file\_name} is the name of the nucleotide or amino-acid sequence file in PHYLIP format. |
|---|
| 452 | |
|---|
| 453 | \item \x{-d} (or \x{--datatype}) \x{data\_type}\index{command-line options!\x{--data\_type}}\\ |
|---|
| 454 | \x{data\_type} is \x{nt} for nucleotide (default) and \x{aa} for amino-acid sequences. |
|---|
| 455 | |
|---|
| 456 | \item \x{-q} (or \x{--sequential})\index{sequence format!interleaved}\index{sequence format!sequential}\index{command-line options!\x{--sequential}} \\ |
|---|
| 457 | Changes interleaved format (default) to sequential format. |
|---|
| 458 | |
|---|
| 459 | \item \x{-n} (or \x{--multiple}) \x{nb\_data\_sets}\index{multiple data sets}\index{command-line options!\x{--multiple}}\\ |
|---|
| 460 | \x{nb\_data\_sets} is an integer giving the number of data sets to analyse. |
|---|
| 461 | |
|---|
| 462 | \item \x{-p} (or \x{--pars})\index{command-line options!\x{--pars}}\\ |
|---|
| 463 | Use a minimum parsimony starting tree. This option is taken into account when the `-u' option |
|---|
| 464 | is absent and when tree topology modifications are to be done. |
|---|
| 465 | |
|---|
| 466 | \item \x{-b} (or \x{--bootstrap}) \x{int}\index{bootstrap}\index{command-line options!\x{--bootstrap}} |
|---|
| 467 | \begin{itemize} |
|---|
| 468 | \item \x{int} $>$ 0: \x{int} is the number of bootstrap replicates. |
|---|
| 469 | \item \x{int} = 0: neither approximate likelihood ratio test nor bootstrap values are computed. |
|---|
| 470 | \item \x{int} = -1: approximate likelihood ratio test returning aLRT statistics. |
|---|
| 471 | \item \x{int} = -2: approximate likelihood ratio test returning Chi2-based parametric branch supports. |
|---|
| 472 | % \item \x{int} = -3: minimum of Chi2-based parametric and SH-like branch supports. |
|---|
| 473 | \item \x{int} = -4: SH-like branch supports alone. |
|---|
| 474 | \item \x{int} = -5: (default) approximate Bayes branch supports. |
|---|
| 475 | \end{itemize} |
|---|
| 476 | |
|---|
| 477 | \item \x{-m} (or \x{--model}) \x{model\_name}\index{substitution models!DNA}\index{substitution models!amino acids}\index{command-line options!\x{--model}} \\ |
|---|
| 478 | \x{model\_name} : substitution model name. |
|---|
| 479 | \begin{itemize} |
|---|
| 480 | \item {\it Nucleotide-based models}: \x{HKY85} (default) \x{| JC69 | K80 | F81 | F84 | TN93 | GTR | |
|---|
| 481 | custom} \\ The \x{custom} option can be used to define a new substitution model. A string of six |
|---|
| 482 | digits identifies the model. For instance, 000000 corresponds to F81 (or JC69 provided the |
|---|
| 483 | distribution of nucleotide frequencies is uniform). 012345 corresponds to GTR. This option can be |
|---|
| 484 | used for encoding any model that is a nested within GTR. See Section \ref{sec:submenus} and Table \ref{tab:modelcode}. {\em NOTE:} |
|---|
| 485 | the substitution parameters of the custom model will be optimised so as to maximise the |
|---|
| 486 | likelihood. It is possible to specify and fix (i.e., avoid optimisation) the values of the |
|---|
| 487 | substitution rates only through the PHYLIP-like interface. |
|---|
| 488 | |
|---|
| 489 | \item {\it Amino-acid based models}: \x{LG} (default) \x{| WAG | JTT | MtREV | Dayhoff | DCMut | RtREV |
|---|
| 490 | | CpREV | VT | Blosum62 | MtMam | MtArt | HIVw | HIVb | custom} \\ |
|---|
| 491 | The \x{custom} option is useful when one wants to use an amino-acid substitution model that is not |
|---|
| 492 | available by default in PhyML. The symmetric part of the rate matrix, as well as the equilibrium |
|---|
| 493 | amino-acid frequencies, are given in a file which name is asked for by the program. The format of |
|---|
| 494 | this file is described in section \ref{sec:customaa}. |
|---|
| 495 | \end{itemize} |
|---|
| 496 | |
|---|
| 497 | \begin{table}\index{custom models} |
|---|
| 498 | \begin{center} |
|---|
| 499 | \begin{tabular}{ll} |
|---|
| 500 | \hline |
|---|
| 501 | Name & Command-line option \\ |
|---|
| 502 | \hline |
|---|
| 503 | JC69 & \x{-m 000000 -f 0.25,0.25,0.25,0.25} \\ |
|---|
| 504 | F81 & \x{-m 000000}\\ |
|---|
| 505 | K80 & \x{-m 010010 -f 0.25,0.25,0.25,0.25} \\ |
|---|
| 506 | HKY85 & \x{-m 010010}\\ |
|---|
| 507 | TrNef & \x{-m 010020 -f 0.25,0.25,0.25,0.25} \\ |
|---|
| 508 | TrN & \x{-m 010020}\\ |
|---|
| 509 | K81 & \x{-m 123321 -f 0.25,0.25,0.25,0.25}\\ |
|---|
| 510 | K81uf & \x{-m 123321}\\ |
|---|
| 511 | TIMef & \x{-m 132241 -f 0.25,0.25,0.25,0.25}\\ |
|---|
| 512 | TIM & \x{-m 132241}\\ |
|---|
| 513 | TVMef & \x{-m 102304 -f 0.25,0.25,0.25,0.25}\\ |
|---|
| 514 | TVM & \x{-m 102304}\\ |
|---|
| 515 | SYM & \x{-m 123456 -f 0.25,0.25,0.25,0.25}\\ |
|---|
| 516 | GTR & \x{-m 123456}\\ |
|---|
| 517 | \hline |
|---|
| 518 | \end{tabular} |
|---|
| 519 | \caption{Nucleotide substitution model names (as defined in \cite{posada01}) and the corresponding |
|---|
| 520 | custom model code used in PhyML.}\label{tab:modelcode} |
|---|
| 521 | \end{center} |
|---|
| 522 | \end{table} |
|---|
| 523 | |
|---|
| 524 | \item \x{--aa\_rate\_file file\_name}\index{command-line options!\x{--aa\_rate\_file}} \\ |
|---|
| 525 | This option is compulsory when analysing amino-acid sequences under a `custom' model (see above). \x{file\_name} |
|---|
| 526 | should provide a rate matrix and equilibrium amino acid in PAML format (see Section \ref{sec:customaa}). |
|---|
| 527 | |
|---|
| 528 | \item \x{-f e}, \x{m}, or ``\x{fA,fC,fG,fT}" \index{frequencies!nucleotide}\index{frequencies!amino-acid}\index{stationary frequencies}\index{command-line options!\x{-f}}\\ |
|---|
| 529 | Nucleotide or amino-acid frequencies. |
|---|
| 530 | \begin{itemize} |
|---|
| 531 | \item \x{e} : the character frequencies are determined as follows : |
|---|
| 532 | \begin{itemize} |
|---|
| 533 | \item {\it Nucleotide sequences}: (Empirical) the equilibrium base frequencies are estimated by counting |
|---|
| 534 | the occurence of the different bases in the alignment. |
|---|
| 535 | \item {\it Amino-acid sequences}: (Empirical) the equilibrium amino-acid frequencies are estimated by counting |
|---|
| 536 | the occurence of the different amino-acids in the alignment. |
|---|
| 537 | \end{itemize} |
|---|
| 538 | |
|---|
| 539 | \item \x{m} : the character frequencies are determined as follows : |
|---|
| 540 | \begin{itemize} |
|---|
| 541 | \item {\it Nucleotide sequences}: (ML) the equilibrium base frequencies are estimated using maximum |
|---|
| 542 | likelihood. |
|---|
| 543 | \item {\it Amino-acid sequences}: (Model) the equilibrium amino-acid frequencies are estimated using |
|---|
| 544 | the frequencies defined by the substitution model. |
|---|
| 545 | \end{itemize} |
|---|
| 546 | |
|---|
| 547 | \item ``\x{fA,fC,fG,fT}": only valid for nucleotide-based models. \x{fA}, \x{fC}, \x{fG} and \x{fT} are floating numbers that |
|---|
| 548 | correspond to the frequencies of A, C, G and T respectively. |
|---|
| 549 | \end{itemize} |
|---|
| 550 | |
|---|
| 551 | \item \x{-t} (or \x{--ts/tv}) \x{ts/tv\_ratio} \index{$\kappa$}\index{ts/tv ratio}\index{command-line options!\x{--ts/tv}}\\ |
|---|
| 552 | \x{ts/tv\_ratio}: transition/transversion ratio. DNA sequences only. Can be a fixed positive value |
|---|
| 553 | (e.g., 4.0) or type \x{e} to get the maximum likelihood estimate. |
|---|
| 554 | |
|---|
| 555 | \item \x{-v} (or \x{--pinv}) \x{prop\_invar}\index{proportion of invariants}\index{invariable sites} \index{command-line options!\x{--pinv}}\\ |
|---|
| 556 | \x{prop\_invar}: proportion of invariable sites. Can be a fixed value in the [0,1] range or type \x{e} to get the maximum likelihood estimate. |
|---|
| 557 | |
|---|
| 558 | \item \x{-c} (or \x{--nclasses}) \x{nb\_subst\_cat}\index{gamma distribution (discrete)!number of categories} \index{command-line options!\x{--nclasses}}\\ |
|---|
| 559 | \x{nb\_subst\_cat}: number of relative substitution rate categories. Default: \x{nb\_subst\_cat=4}. Must be a positive integer. |
|---|
| 560 | |
|---|
| 561 | \item \x{-a} (or \x{--alpha}) \x{gamma} \index{gamma distribution (discrete)!shape parameter}\index{command-line options!\x{--alpha}} \\ |
|---|
| 562 | \x{gamma}: value of the gamma shape parameter. Can be a fixed positive value or e to get the maximum |
|---|
| 563 | likelihood estimate. The value of this parameter is estimated in the maximum likelihood framework by default. |
|---|
| 564 | |
|---|
| 565 | \item \x{--use\_median} \index{gamma distribution (discrete)!mean vs. median} \index{command-line options!\x{--use\_median}}\\ |
|---|
| 566 | The middle of each substitution rate class in the discrete gamma distribution is taken as the |
|---|
| 567 | median. The mean is used by default. |
|---|
| 568 | |
|---|
| 569 | \item \x{--free\_rates} \index{command-line options!\x{--free\_rates}}\\ |
|---|
| 570 | As an alternative to the discrete gamma model, it is possible to estimate the (relative) rate in |
|---|
| 571 | each class of the (mixture) model and the corresponding frequencies directly from the data. This |
|---|
| 572 | model, called the FreeRate model, has more parameters |
|---|
| 573 | than the discrete gamma one but usually provides a significantly better fit to the data. See |
|---|
| 574 | \cite{soubrier12} for more information about this model and an illustration of its use. |
|---|
| 575 | |
|---|
| 576 | \item \x{--codpos} \x{1,2 or 3} \index{command-line options!\x{--codpos}}\\ |
|---|
| 577 | When analysing an alignment of coding sequences, use this option to consider only the first, second |
|---|
| 578 | or the third coding position for the estimation. |
|---|
| 579 | |
|---|
| 580 | \item \x{-s} (or \x{--search}) \x{move}\index{NNI}\index{SPR} \index{command-line options!\x{--search}}\\ |
|---|
| 581 | Tree topology search operation option. Can be either \x{NNI} (default, fast) or \x{SPR} (usually |
|---|
| 582 | slower than \x{NNI} but more accurate) or \x{BEST} (best of NNI and SPR search). |
|---|
| 583 | |
|---|
| 584 | \item \x{-u} (or \x{--inputtree}) \x{user\_tree\_file}\index{input tree}\index{user tree} \index{command-line options!\x{--inputtree}}\\ |
|---|
| 585 | \x{user\_tree\_file}: starting tree filename. The tree must be in Newick format. |
|---|
| 586 | |
|---|
| 587 | \item \x{-o params}\index{optimisation!topology}\index{optimisation!substitution parameters} \index{command-line options!\x{-o}}\\ |
|---|
| 588 | This option focuses on specific parameter optimisation. |
|---|
| 589 | \begin{itemize} |
|---|
| 590 | \item \x{params=tlr}: tree topology (\x{t}), branch length (\x{l}) and substitution rate parameters (\x{r}) are optimised. |
|---|
| 591 | \item \x{params=tl}: tree topology and branch lengths are optimised. |
|---|
| 592 | \item \x{params=lr}: branch lengths and substitution rate parameters are optimised. |
|---|
| 593 | \item \x{params=l}: branch lengths are optimised. |
|---|
| 594 | \item \x{params=r}: substitution rate parameters are optimised. |
|---|
| 595 | \item \x{params=n}: no parameter is optimised. |
|---|
| 596 | \end{itemize} |
|---|
| 597 | |
|---|
| 598 | \item \x{--rand\_start}\index{random tree} \index{command-line options!\x{--rand\_start}}\\ |
|---|
| 599 | This option sets the initial tree to random. It is only valid if SPR searches are to be performed. |
|---|
| 600 | |
|---|
| 601 | \item \x{--n\_rand\_starts num} \index{command-line options!\x{--n\_rand\_starts}}\\ |
|---|
| 602 | \x{num} is the number of initial random trees to be used. It is only valid if SPR searches are to be performed. |
|---|
| 603 | |
|---|
| 604 | \item \x{--r\_seed num}\index{random number} \index{command-line options!\x{--r\_seed}}\\ |
|---|
| 605 | \x{num} is the seed used to initiate the random number generator. Must be an integer. |
|---|
| 606 | |
|---|
| 607 | \item \x{--print\_site\_lnl}\index{likelihood!print site likelihood} \index{command-line options!\x{--print\_site\_lk}}\\ |
|---|
| 608 | Print the likelihood for each site in file *\_phyml\_lk.txt. For $\Gamma$ or $\Gamma$+I or |
|---|
| 609 | FreeRate models, this option returns the posterior probability of each relative rate class at |
|---|
| 610 | each site. Such information can then be used to identify fast- and slow-evolving regions of the alignment. |
|---|
| 611 | |
|---|
| 612 | \item \x{--print\_trace}\index{command-line options!\x{--print\_trace}}\\ |
|---|
| 613 | Print each phylogeny explored during the tree search process in file *\_phyml\_trace.txt. This |
|---|
| 614 | option can be useful for monitoring the progress of the analysis for very large data sets and have |
|---|
| 615 | an approximate idea of what the final phylogeny will look like. |
|---|
| 616 | |
|---|
| 617 | \item \x{--run\_id ID\_string}\index{run ID} \index{command-line options!\x{--run\_id}}\\ |
|---|
| 618 | Append the string ID\_string at the end of each PhyML output file. This option may be useful when |
|---|
| 619 | running simulations involving PhyML. It can also be used to `tag' multiple analysis of the same data |
|---|
| 620 | set with various program settings. |
|---|
| 621 | |
|---|
| 622 | \item \x{--no\_memory\_check} \index{command-line options!\x{--no\_memory\_check}}\\ |
|---|
| 623 | By default, when processing a large data set, PhyML will pause and ask the user to confirm that |
|---|
| 624 | she/he wants to continue with the execution of the analysis despite the large amount of memory |
|---|
| 625 | required. The \x{--no\_memory\_check} skips this question. It is especially useful when running |
|---|
| 626 | PhyML in batch mode. |
|---|
| 627 | |
|---|
| 628 | \item \x{--no\_colalias} \index{command-line options!\x{--no\_colalias}}\\ |
|---|
| 629 | By default, PhyML preprocesses each alignment by putting together (or aliasing) the columns that are |
|---|
| 630 | identical. Use this option to skip this step but be aware that the analysis might then take more |
|---|
| 631 | time to complete. |
|---|
| 632 | |
|---|
| 633 | \item \x{--constrained\_lens} \index{command-line options!\x{--constrained\_lens}}\\ |
|---|
| 634 | When an input tree with branch lengths is provided, this option will find the branch multiplier that |
|---|
| 635 | maximises the likelihood (i.e., the relative branch lengths remain constant) |
|---|
| 636 | |
|---|
| 637 | \item \x{--constraint\_file} \x{file\_name} \index{command-line options!\x{--constraint\_file}}\\ |
|---|
| 638 | \x{file\_name} lists the topological constraints under which the tree topology search is |
|---|
| 639 | conducted. This option should be used in conjunction with \x{-u} \x{file\_name}. See Section |
|---|
| 640 | \ref{sec:topoconstraints} for more information. |
|---|
| 641 | |
|---|
| 642 | \item \x{--quiet} \index{command-line options!\x{--quiet}}\\ |
|---|
| 643 | Runs PhyML in quiet mode. The program will not pause if the memory required to run the analysis |
|---|
| 644 | exceeds 256MB and will not output the log-likelihood score to the output. |
|---|
| 645 | |
|---|
| 646 | \end{itemize} |
|---|
| 647 | |
|---|
| 648 | \subsection{XML interface} |
|---|
| 649 | \begin{itemize} |
|---|
| 650 | \item \x{--xml=xml\_file\_name}\index{command-line options!\x{--xml}} \\ |
|---|
| 651 | \x{xml\_file\_name} is the name of the XML file containing the information required to run the |
|---|
| 652 | analysis. More details about this type of file is given in the section \ref{sec:xmlio}. |
|---|
| 653 | \end{itemize} |
|---|
| 654 | |
|---|
| 655 | \subsection{Parallel bootstrap}\label{sec:parallel_bootstrap}\index{MPI}\index{bootstrap!parallel} |
|---|
| 656 | |
|---|
| 657 | Bootstrapping is a highly parallelizable task. Indeed, bootstrap replicates are independent from |
|---|
| 658 | one another. Each bootstrap replicate can then be analysed separately. Modern computers often have |
|---|
| 659 | more than one CPU. Each CPU can therefore be used to process a bootstrap sample. Using this parallel |
|---|
| 660 | strategy, performing $R$ bootstrap replicates on $C$ CPUs `costs' the same amount of computation |
|---|
| 661 | time as processing $R \times C$ bootstrap replicates on a single CPU. In other words, for a given |
|---|
| 662 | number of replicates, the computation time is divided by $R$ compared to the non-parallel approach. |
|---|
| 663 | |
|---|
| 664 | PhyML sources must be compiled with specific options to turn on the parallel option (see Section |
|---|
| 665 | \ref{sec:MPI}). Once the binary file (\x{phyml}) has been generated, running a bootstrap analysis |
|---|
| 666 | with, say 100 replicates on 2 CPUs, can be done by typing the following command-line: |
|---|
| 667 | \begin{verbatim} |
|---|
| 668 | mpd &; |
|---|
| 669 | mpirun -np 2 ./phyml -i seqfile -b 100; |
|---|
| 670 | \end{verbatim} |
|---|
| 671 | The first command launches the mpi daemon while the second launches the analysis. Note that |
|---|
| 672 | launching the daemon needs to be done only once. The output files are similar to the ones generated |
|---|
| 673 | using the standard, non-parallel, analysis (see Section \ref{sec:input_output}). Note that running |
|---|
| 674 | the program in batch mode, i.e.: |
|---|
| 675 | \begin{verbatim} |
|---|
| 676 | mpirun -np 2 ./phyml -i seqfile -b 100 & |
|---|
| 677 | \end{verbatim} |
|---|
| 678 | will probably NOT work. I do not know how to run a mpi process in batch mode yet. Suggestions welcome... |
|---|
| 679 | Also, at the moment, the number of bootstrap replicates must be a multiple of the number of CPUs |
|---|
| 680 | required in the mpirun command. |
|---|
| 681 | |
|---|
| 682 | \section{Inputs \& outputs for command-line and PHYLIP interface }\label{sec:input_output} |
|---|
| 683 | |
|---|
| 684 | PhyML reads data from standard text files, without the need for any particular file name extension. |
|---|
| 685 | |
|---|
| 686 | \subsection{Sequence formats} |
|---|
| 687 | |
|---|
| 688 | \begin{figure} |
|---|
| 689 | \begin{small} |
|---|
| 690 | \begin{Verbatim}[frame=single, label=PHYLIP interleaved, samepage=true, baselinestretch=0.5] |
|---|
| 691 | |
|---|
| 692 | 5 80 |
|---|
| 693 | seq1 CCATCTCACGGTCGGTACGATACACCKGCTTTTGGCAGGAAATGGTCAATATTACAAGGT |
|---|
| 694 | seq2 CCATCTCACGGTCAG---GATACACCKGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT |
|---|
| 695 | seq3 RCATCTCCCGCTCAG---GATACCCCKGCTGTTG????????????????ATTAAAAGGT |
|---|
| 696 | seq4 RCATCTCATGGTCAA---GATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT |
|---|
| 697 | seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAAT????????GT |
|---|
| 698 | |
|---|
| 699 | ATCKGCTTTTGGCAGGAAAT |
|---|
| 700 | ATCKGCTTTTGGCGGGAAAT |
|---|
| 701 | AGCKGCTGTTG????????? |
|---|
| 702 | ATCTGCTTTTGGCGGGAAAT |
|---|
| 703 | ATCTGCTTTTGGCGGGAAAT |
|---|
| 704 | |
|---|
| 705 | \end{Verbatim} |
|---|
| 706 | \begin{Verbatim}[frame=single, label=PHYLIP sequential, samepage=true, baselinestretch=0.5] |
|---|
| 707 | |
|---|
| 708 | 5 40 |
|---|
| 709 | seq1 CCATCTCANNNNNNNNACGATACACCKGCTTTTGGCAGG |
|---|
| 710 | seq2 CCATCTCANNNNNNNNGGGATACACCKGCTTTTGGCGGG |
|---|
| 711 | seq3 RCATCTCCCGCTCAGTGAGATACCCCKGCTGTTGXXXXX |
|---|
| 712 | seq4 RCATCTCATGGTCAATG-AATACTCCTGCTTTTGXXXXX |
|---|
| 713 | seq5 RCATCTCACGGTCGGTAAGATACACCTGCTTTTGxxxxx |
|---|
| 714 | |
|---|
| 715 | \end{Verbatim} |
|---|
| 716 | \end{small} |
|---|
| 717 | \label{fig:align_tree} |
|---|
| 718 | \caption{\bf PHYLIP interleaved and sequential formats.} |
|---|
| 719 | \end{figure} |
|---|
| 720 | |
|---|
| 721 | |
|---|
| 722 | |
|---|
| 723 | \begin{figure} |
|---|
| 724 | \begin{small} |
|---|
| 725 | \begin{Verbatim}[frame=single, label=Nexus nucleotides, samepage=true, baselinestretch=0.5] |
|---|
| 726 | |
|---|
| 727 | [ This is a comment ] |
|---|
| 728 | #NEXUS |
|---|
| 729 | BEGIN DATA; |
|---|
| 730 | DIMENSIONS NTAX=10 NCHAR=20; |
|---|
| 731 | FORMAT DATATYPE=DNA; |
|---|
| 732 | MATRIX |
|---|
| 733 | tax1 ?ATGATTTCCTTAGTAGCGG |
|---|
| 734 | tax2 CAGGATTTCCTTAGTAGCGG |
|---|
| 735 | tax3 ?AGGATTTCCTTAGTAGCGG |
|---|
| 736 | tax4 ?????????????GTAGCGG |
|---|
| 737 | tax5 CAGGATTTCCTTAGTAGCGG |
|---|
| 738 | tax6 CAGGATTTCCTTAGTAGCGG |
|---|
| 739 | tax7 ???GATTTCCTTAGTAGCGG |
|---|
| 740 | tax8 ???????????????????? |
|---|
| 741 | tax9 ???GGATTTCTTCGTAGCGG |
|---|
| 742 | tax10 ???????????????AGCGG; |
|---|
| 743 | END; |
|---|
| 744 | |
|---|
| 745 | \end{Verbatim} |
|---|
| 746 | \end{small} |
|---|
| 747 | |
|---|
| 748 | \begin{small} |
|---|
| 749 | \begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5] |
|---|
| 750 | |
|---|
| 751 | [ This is a comment ] |
|---|
| 752 | #NEXUS |
|---|
| 753 | BEGIN DATA; |
|---|
| 754 | DIMENSIONS NTAX=10 NCHAR=20; |
|---|
| 755 | FORMAT DATATYPE=STANDARD SYMBOLS="0 1 2 3"; |
|---|
| 756 | MATRIX |
|---|
| 757 | tax1 ?0320333113302302122 |
|---|
| 758 | tax2 10220333113302302122 |
|---|
| 759 | tax3 ?0220333113302302122 |
|---|
| 760 | tax4 ?????????????2302122 |
|---|
| 761 | tax5 10220333113302302122 |
|---|
| 762 | tax6 10220333113302302122 |
|---|
| 763 | tax7 ???20333113302302122 |
|---|
| 764 | tax8 ???????????????????? |
|---|
| 765 | tax9 ???22033313312302122 |
|---|
| 766 | tax10 ???????????????02122; |
|---|
| 767 | END; |
|---|
| 768 | |
|---|
| 769 | \end{Verbatim} |
|---|
| 770 | \end{small} |
|---|
| 771 | |
|---|
| 772 | \begin{small} |
|---|
| 773 | \begin{Verbatim}[frame=single, label=Nexus digits, samepage=true, baselinestretch=0.5] |
|---|
| 774 | |
|---|
| 775 | [ This is a comment ] |
|---|
| 776 | #NEXUS |
|---|
| 777 | BEGIN DATA; |
|---|
| 778 | DIMENSIONS NTAX=10 NCHAR=20; |
|---|
| 779 | FORMAT DATATYPE=STANDARD SYMBOLS="00 01 02 03"; |
|---|
| 780 | MATRIX |
|---|
| 781 | tax1 ??00030200030303010103030002030002010202 |
|---|
| 782 | tax2 0100020200030303010103030002030002010202 |
|---|
| 783 | tax3 ??00020200030303010103030002030002010202 |
|---|
| 784 | tax4 ??????????????????????????02030002010202 |
|---|
| 785 | tax5 0100020200030303010103030002030002010202 |
|---|
| 786 | tax6 0100020200030303010103030002030002010202 |
|---|
| 787 | tax7 ??????0200030303010103030002030002010202 |
|---|
| 788 | tax8 ???????????????????????????????????????? |
|---|
| 789 | tax9 ??????0202000303030103030102030002010202 |
|---|
| 790 | tax10 ??????????????????????????????0002010202; |
|---|
| 791 | END; |
|---|
| 792 | |
|---|
| 793 | \end{Verbatim} |
|---|
| 794 | \end{small} |
|---|
| 795 | \caption{\bf NEXUS formats.}\label{fig:nexus} |
|---|
| 796 | \end{figure} |
|---|
| 797 | |
|---|
| 798 | |
|---|
| 799 | Alignments of DNA or protein sequences must be in PHYLIP\index{PHYLIP} or NEXUS |
|---|
| 800 | \cite{maddison97}\index{NEXUS} sequential\index{sequential} or interleaved\index{interleaved} format |
|---|
| 801 | (Figures \ref{fig:align_tree} and \ref{fig:nexus}). For PHYLIP formated sequence alignments, the |
|---|
| 802 | first line of the input file contains the number of species and the number of characters, in free |
|---|
| 803 | format, separated by blank characters. One slight difference with PHYLIP format deals with sequence |
|---|
| 804 | name lengths. While PHYLIP format limits this length to ten characters, PhyML can read up to |
|---|
| 805 | hundred character long sequence names. Blanks and the symbols ``(),:'' are not allowed within |
|---|
| 806 | sequence names because the Newick tree format makes special use of these symbols. Another slight |
|---|
| 807 | difference with PHYLIP format is that actual sequences must be separated from their names by at |
|---|
| 808 | least one blank character. |
|---|
| 809 | |
|---|
| 810 | A PHYLIP input sequence file may also display more than a single data set. Each of these data sets |
|---|
| 811 | must be in PHYLIP format and two successive alignments must be separated by an empty |
|---|
| 812 | line. Processing multiple data sets requires to toggle the `\x{M}' option in the {\em Input Data} |
|---|
| 813 | sub-menu or use the `\x{-n}' command line option and enter the number of data sets to analyse. The |
|---|
| 814 | multiple data set option can be used to process re-sampled data that were generated using a |
|---|
| 815 | non-parametric procedure such as cross-validation or jackknife (a bootstrap option is already |
|---|
| 816 | included in PhyML). This option is also useful in multiple gene studies, even if fitting the same |
|---|
| 817 | substitution model to all data sets may not be suitable. |
|---|
| 818 | |
|---|
| 819 | PhyML can also process alignments in NEXUS format. Although not all the options provided by this |
|---|
| 820 | format are supported by PhyML, a few specific features are exploited. Of course, this format can |
|---|
| 821 | handle nucleotide and protein sequence alignments in sequential or interleaved format. It is also |
|---|
| 822 | possible to use custom alphabets, replacing the standard 4-state and 20-state alphabets for |
|---|
| 823 | nucleotides and amino-acids respectively. Examples of a 4-state custom alphabet are given in Figure |
|---|
| 824 | \ref{fig:nexus}. Each state must here correspond to one digit or more. The set of states must be a |
|---|
| 825 | list of consecutive digits starting from 0. For instance, the list ``0, 1, 3, 4'' is not a valid |
|---|
| 826 | alphabet. Each state in the symbol list must be separated from the next one by a space. Hence, |
|---|
| 827 | alphabets with large number of states can be easily defined by using two-digit number (starting with |
|---|
| 828 | 00 up to 19 for a 20 state alphabet). Most importantly, this feature gives the opportunity to |
|---|
| 829 | analyse data sets made of presence/absence character states (use the \texttt{symbols=``0 1''} option |
|---|
| 830 | for such data).\index{binary characters} Alignments made of custom-defined states will be processed |
|---|
| 831 | using the Jukes and Cantor model. Other options of the program (e.g., number of rate classes, tree |
|---|
| 832 | topology search algorithm) are freely configurable. Note that, at the moment, the maximum number of |
|---|
| 833 | different states is set to 22 in order to save memory space. It is however possible to lift this |
|---|
| 834 | threshold by modifiying the value of the variable \x{T\_MAX\_ALPHABET} in the file |
|---|
| 835 | `\x{utilities.h}'. The program will then have to be re-compiled. |
|---|
| 836 | |
|---|
| 837 | |
|---|
| 838 | |
|---|
| 839 | \subsubsection{Gaps and ambiguous characters} |
|---|
| 840 | |
|---|
| 841 | Gaps correspond to the `\x{-}' symbol. They are systematically treated as unknown characters ``on |
|---|
| 842 | the grounds that we don't know what would be there if something were there'' (J. Felsenstein, |
|---|
| 843 | PHYLIP main documentation). The likelihood at these sites is summed over all the possible states |
|---|
| 844 | (i.e., nucleotides or amino acids) that could actually be observed at these particular |
|---|
| 845 | positions. Note however that columns of the alignment that display only gaps or unknown characters |
|---|
| 846 | are simply discarded because they do not carry any phylogenetic information (they are equally well |
|---|
| 847 | explained by any model). PhyML also handles ambiguous characters such as $R$ for $A$ or $G$ |
|---|
| 848 | (purines) and $Y$ for $C$ or $T$ (pyrimidines). Tables \ref{tab:ambigu_nt} and \ref{tab:ambigu_aa} |
|---|
| 849 | give the list of valid characters/symbols and the corresponding nucleotides or amino acids. |
|---|
| 850 | |
|---|
| 851 | \begin{table} |
|---|
| 852 | \begin{center} |
|---|
| 853 | \begin{tabular}{lr|lr} |
|---|
| 854 | \hline |
|---|
| 855 | Character & Nucleotide & Character & Nucleotide \\ |
|---|
| 856 | \hline |
|---|
| 857 | $A$ & Adenosine & $Y$ & $C$ or $T$ \\ |
|---|
| 858 | $G$ & Guanine & $K$ & $G$ or $T$ \\ |
|---|
| 859 | $C$ & Cytosine & $B$ & $C$ or $G$ or $T$\\ |
|---|
| 860 | $T$ & Thymine & $D$ & $A$ or $G$ or $T$ \\ |
|---|
| 861 | $U$ & Uracil (=$T$) & $H$ & $A$ or $C$ or $T$ \\ |
|---|
| 862 | $M$ & $A$ or $C$ & $V$ & $A$ or $C$ or $G$ \\ |
|---|
| 863 | $R$ & $A$ or $G$ & $-$ or $N$ or $X$ or $?$ & unknown \\ |
|---|
| 864 | $W$ & $A$ or $T$ & & (=$A$ or $C$ or $G$ or $T$)\\ |
|---|
| 865 | $S$ & $C$ or $G$ & & \\ |
|---|
| 866 | \hline |
|---|
| 867 | \end{tabular} |
|---|
| 868 | \end{center} |
|---|
| 869 | \caption{{\bf List of valid characters in DNA sequences and the corresponding nucleotides.}}\label{tab:ambigu_nt} |
|---|
| 870 | \end{table} |
|---|
| 871 | \begin{table} |
|---|
| 872 | \begin{center} |
|---|
| 873 | \begin{tabular}{lr|lr} |
|---|
| 874 | \hline |
|---|
| 875 | Character & Amino-Acid & Character & Amino-Acid \\ |
|---|
| 876 | \hline |
|---|
| 877 | $A$ & Alanine & $L$ & Leucine \\ |
|---|
| 878 | $R$ & Arginine & $K$ & Lysine \\ |
|---|
| 879 | $N$ or $B$& Asparagine & $M$ & Methionine \\ |
|---|
| 880 | $D$ & Aspartic acid & $F$ & Phenylalanine \\ |
|---|
| 881 | $C$ & Cysteine & $P$ & Proline \\ |
|---|
| 882 | $Q$ or $Z$& Glutamine & $S$ & Serine \\ |
|---|
| 883 | $E$ & Glutamic acid & $T$ & Threonine \\ |
|---|
| 884 | $G$ & Glycine & $W$ & Tryptophan \\ |
|---|
| 885 | $H$ & Histidine & $Y$ & Tyrosine \\ |
|---|
| 886 | $I$ & Isoleucine & $V$ & Valine \\ |
|---|
| 887 | $L$ & Leucine & $-$ or $X$ or $?$ & unknown \\ |
|---|
| 888 | $K$ & Lysine & & (can be any amino acid) \\ |
|---|
| 889 | \hline |
|---|
| 890 | \end{tabular} |
|---|
| 891 | \end{center} |
|---|
| 892 | \caption{{\bf List of valid characters in protein sequences and the corresponding amino acids.}}\label{tab:ambigu_aa} |
|---|
| 893 | \end{table} |
|---|
| 894 | |
|---|
| 895 | \subsubsection{Specifying outgroup sequences}\label{sec:outgroupspecify} |
|---|
| 896 | |
|---|
| 897 | PhyML can return rooted trees provided outgroup taxa are identified from the sequence file. In |
|---|
| 898 | order to do so, sequence names that display a `*' character will be automatically considered as |
|---|
| 899 | belonging to the outgroup. |
|---|
| 900 | |
|---|
| 901 | The topology of the rooted tree is exactly the same as the unrooted version of the same tree. In |
|---|
| 902 | other words, PhyML first ignores the distinction between ingroup and outgroup sequences, builds a |
|---|
| 903 | maximum likelihood unrooted tree and then tries to add the root. If the outgroup has more than one |
|---|
| 904 | sequence, the position of the root might be ambiguous. In such situation, PhyML tries to identify |
|---|
| 905 | the most relevant position of the root by considering which edge provides the best separation |
|---|
| 906 | between ingroup and outgroup taxa (i.e., we are trying to make the outgroup ``as monophyletic as |
|---|
| 907 | possible''). |
|---|
| 908 | |
|---|
| 909 | \subsection{Tree format} |
|---|
| 910 | |
|---|
| 911 | PhyML can read one or several phylogenetic trees from an input file. This option is accessible |
|---|
| 912 | through the {\em Tree Searching} sub menu or the `\x{-u}' argument from the command line. Input |
|---|
| 913 | trees are generally used as initial maximum likelihood estimates to be subsequently adjusted by the |
|---|
| 914 | tree searching algorithm. Trees can be either rooted or unrooted and multifurcations are allowed. |
|---|
| 915 | Taxa names must, of course, match the corresponding sequence names. |
|---|
| 916 | |
|---|
| 917 | \begin{figure}[h] |
|---|
| 918 | \begin{small} |
|---|
| 919 | \begin{minipage}{\textwidth} |
|---|
| 920 | \begin{verbatim} |
|---|
| 921 | ((seq1:0.03,seq2:0.01):0.04,(seq3:0.01,(seq4:0.2,seq5:0.05):0.2):0.01); |
|---|
| 922 | ((seq3,seq2),seq1,(seq4,seq5)); |
|---|
| 923 | \end{verbatim} |
|---|
| 924 | \end{minipage} |
|---|
| 925 | \end{small} |
|---|
| 926 | \caption{{\bf Input trees}. The first tree (top) is rooted and has branch lengths. The second tree |
|---|
| 927 | (bottom) is unrooted and does not have branch lengths.} |
|---|
| 928 | \label{fig:trees}\index{Newick format} |
|---|
| 929 | \end{figure} |
|---|
| 930 | |
|---|
| 931 | |
|---|
| 932 | \subsection{Multiple alignments and trees}\index{multiple data sets} |
|---|
| 933 | |
|---|
| 934 | Single or multiple sequence data sets may be used in combination with single or multiple input |
|---|
| 935 | trees. When the number of data sets is one ($n_D = 1$) and there is only one input tree ($n_T = 1$), |
|---|
| 936 | then this tree is simply used as input for the single data set analysis. When $n_D = 1$ and $n_T > |
|---|
| 937 | 1$, each input tree is used successively for the analysis of the single alignment. PhyML then |
|---|
| 938 | outputs the tree with the highest likelihood. If $n_D > 1$ and $n_T = 1$, the same input tree is |
|---|
| 939 | used for the analysis of each data set. The last combination is $n_D > 1$ and $n_T > 1$. In this |
|---|
| 940 | situation, the $i$-th tree in the input tree file is used to analyse the $i$-th data set. Hence, |
|---|
| 941 | $n_D$ and $n_T$ must be equal here. |
|---|
| 942 | |
|---|
| 943 | |
|---|
| 944 | \subsection{Custom amino-acid rate model}\label{sec:customaa} |
|---|
| 945 | |
|---|
| 946 | The custom amino-acid model of substitutions can be used to implement a model that is not hard-coded |
|---|
| 947 | in PhyML. This model must be time-reversible. Hence, the matrix of substitution rates is |
|---|
| 948 | symmetrical. The format of the rate matrix with the associated stationary frequencies is identical |
|---|
| 949 | to the one used in PAML\index{PAML}. An example is given below: |
|---|
| 950 | |
|---|
| 951 | \begin{center} |
|---|
| 952 | {\tiny |
|---|
| 953 | \begin{tabular}{p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}p{0.33cm}} |
|---|
| 954 | % Ala & Arg & Asn & Asp & Cys & Gln & Glu & Gly & His & Ile & Leu & Lys & Met & Phe & Pro & Ser & Thr & Trp & Tyr & Val \\ |
|---|
| 955 | &&&&&&&&&&&&&&&&&&& \\ |
|---|
| 956 | 0.55 & &&&&&&&&&&&&&&&&&& \\ |
|---|
| 957 | 0.51 & 0.64 & &&&&&&&&&&&&&&&& \\ |
|---|
| 958 | 0.74 & 0.15 & 5.43 & &&&&&&&&&&&&&&&& \\ |
|---|
| 959 | 1.03 & 0.53 & 0.27 & 0.03 & &&&&&&&&&&&&&&& \\ |
|---|
| 960 | 0.91 & 3.04 & 1.54 & 0.62 & 0.10 & &&&&&&&&&&&&&& \\ |
|---|
| 961 | 1.58 & 0.44 & 0.95 & 6.17 & 0.02 & 5.47 & &&&&&&&&&&&&& \\ |
|---|
| 962 | 1.42 & 0.58 & 1.13 & 0.87 & 0.31 & 0.33 & 0.57 & &&&&&&&&&&&& \\ |
|---|
| 963 | 0.32 & 2.14 & 3.96 & 0.93 & 0.25 & 4.29 & 0.57 & 0.25 & &&&&&&&&&&& \\ |
|---|
| 964 | 0.19 & 0.19 & 0.55 & 0.04 & 0.17 & 0.11 & 0.13 & 0.03 & 0.14 & &&&&&&&&&& \\ |
|---|
| 965 | 0.40 & 0.50 & 0.13 & 0.08 & 0.38 & 0.87 & 0.15 & 0.06 & 0.50 & 3.17 & &&&&&&&&& \\ |
|---|
| 966 | 0.91 & 5.35 & 3.01 & 0.48 & 0.07 & 3.89 & 2.58 & 0.37 & 0.89 & 0.32 & 0.26 & &&&&&&&& \\ |
|---|
| 967 | 0.89 & 0.68 & 0.20 & 0.10 & 0.39 & 1.55 & 0.32 & 0.17 & 0.40 & 4.26 & 4.85 & 0.93 & &&&&&&& \\ |
|---|
| 968 | 0.21 & 0.10 & 0.10 & 0.05 & 0.40 & 0.10 & 0.08 & 0.05 & 0.68 & 1.06 & 2.12 & 0.09 & 1.19 & &&&&&& \\ |
|---|
| 969 | 1.44 & 0.68 & 0.20 & 0.42 & 0.11 & 0.93 & 0.68 & 0.24 & 0.70 & 0.10 & 0.42 & 0.56 & 0.17 & 0.16 & &&&&& \\ |
|---|
| 970 | 3.37 & 1.22 & 3.97 & 1.07 & 1.41 & 1.03 & 0.70 & 1.34 & 0.74 & 0.32 & 0.34 & 0.97 & 0.49 & 0.55 & 1.61 & &&&& \\ |
|---|
| 971 | 2.12 & 0.55 & 2.03 & 0.37 & 0.51 & 0.86 & 0.82 & 0.23 & 0.47 & 1.46 & 0.33 & 1.39 & 1.52 & 0.17 & 0.80 & 4.38 & &&& \\ |
|---|
| 972 | 0.11 & 1.16 & 0.07 & 0.13 & 0.72 & 0.22 & 0.16 & 0.34 & 0.26 & 0.21 & 0.67 & 0.14 & 0.52 & 1.53 & 0.14 & 0.52 & 0.11 & && \\ |
|---|
| 973 | 0.24 & 0.38 & 1.09 & 0.33 & 0.54 & 0.23 & 0.20 & 0.10 & 3.87 & 0.42 & 0.40 & 0.13 & 0.43 & 6.45 & 0.22 & 0.79 & 0.29 & 2.49 & & \\ |
|---|
| 974 | 2.01 & 0.25 & 0.20 & 0.15 & 1.00 & 0.30 & 0.59 & 0.19 & 0.12 & 7.82 & 1.80 & 0.31 & 2.06 & 0.65 & 0.31 & 0.23 & 1.39 & 0.37 & 0.31 & \\ |
|---|
| 975 | \\ |
|---|
| 976 | 8.66 & 4.40 & 3.91 & 5.70 & 1.93 & 3.67 & 5.81 & 8.33 & 2.44 & 4.85 & 8.62 & 6.20 & 1.95 & 3.84 & 4.58 & 6.95 & 6.10 & 1.44 & 3.53 & 7.09 \\ |
|---|
| 977 | \end{tabular} |
|---|
| 978 | } |
|---|
| 979 | \end{center} |
|---|
| 980 | |
|---|
| 981 | The entry on the $i$-th row and $j$-th column of this matrix corresponds to the rate of |
|---|
| 982 | substitutions between amino-acids $i$ and $j$. The last line in the file gives the stationary |
|---|
| 983 | frequencies and must be separated from the rate matrix by one line. The ordering of the amino-acids |
|---|
| 984 | is alphabetical, i.e, Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, |
|---|
| 985 | Ser, Thr, Trp, Tyr and Val. |
|---|
| 986 | |
|---|
| 987 | |
|---|
| 988 | \subsection{Topological constraint file}\label{sec:topoconstraints} |
|---|
| 989 | |
|---|
| 990 | PhyML can perform phylogenetic tree estimation under user-specified topological constraints. In |
|---|
| 991 | order to do so, one should use the \x{--constraint\_file} \x{file\_name} command-line option where |
|---|
| 992 | \x{file\_name} lists the topological constraints. Such constraints are straightforward to |
|---|
| 993 | define. For instance, the following constraints: |
|---|
| 994 | \vspace{0.2cm} |
|---|
| 995 | \begin{Verbatim} |
|---|
| 996 | ((A,B,C),(D,E,F)); |
|---|
| 997 | \end{Verbatim} |
|---|
| 998 | indicate that taxa A, B and C belong to the same clade. D, E and F also belong to the |
|---|
| 999 | same clade and the two clades hence defined should not overlap. Under these two |
|---|
| 1000 | constraints, the tree ((A,B),D,((E,F),C)) is not valid. From the example above, you will notice that |
|---|
| 1001 | the constraints are defined using a multifurcating tree in NEWICK format. Note that this tree does |
|---|
| 1002 | not need to display the whole list of taxa. For instance, while the only taxa involved in specifying |
|---|
| 1003 | topological constraints above are A, B, C, D, E \& F, the actual data set could include more than |
|---|
| 1004 | these six taxa only. |
|---|
| 1005 | |
|---|
| 1006 | PhyML tree topology search algorithms all rely on improving a starting tree. By default, BioNJ is |
|---|
| 1007 | the method of choice for building this tree. However, there is no guarantee that the phylogeny |
|---|
| 1008 | estimated with PhyML does comply with the topological constraints. While it is probably possible to |
|---|
| 1009 | implement BioNJ with topological constraints, we have not done so yet. Instead, the same |
|---|
| 1010 | multifurcating tree that defines the topological constraints should also be used as starting tree |
|---|
| 1011 | using the \x{-u} (\x{--inputtree}) option. Altogether, the command line should look like the |
|---|
| 1012 | following: \x{-u}=\x{file\_name} \x{--constraint\_file}=\x{file\_name}. It is not possible to use as |
|---|
| 1013 | input tree a non-binary phylogeny that is distinct from that provided in the constraint tree |
|---|
| 1014 | file. However, any binary tree compatible with the constraint one can be used as input tree. |
|---|
| 1015 | |
|---|
| 1016 | |
|---|
| 1017 | \subsection{Output files} |
|---|
| 1018 | |
|---|
| 1019 | \begin{table} |
|---|
| 1020 | Sequence file name~: `{\x seq}'\\ |
|---|
| 1021 | \begin{center} |
|---|
| 1022 | \begin{tabular}{ll} |
|---|
| 1023 | \hline |
|---|
| 1024 | Output file name & Content \\ |
|---|
| 1025 | \hline |
|---|
| 1026 | \x{seq\_phyml\_tree.txt} & ML tree\\ |
|---|
| 1027 | \x{seq\_phyml\_stats.txt} & ML model parameters\\ |
|---|
| 1028 | \x{seq\_phyml\_boot\_trees.txt} & ML trees -- bootstrap replicates\\ |
|---|
| 1029 | \x{seq\_phyml\_boot\_stats.txt} & ML model parameters -- bootstrap replicates \\ |
|---|
| 1030 | \x{seq\_phyml\_rand\_trees.txt} & ML trees -- multiple random starts\\ |
|---|
| 1031 | \hline |
|---|
| 1032 | \end{tabular} |
|---|
| 1033 | \end{center} |
|---|
| 1034 | \caption{{\bf Standard output files}}\label{tab:output} |
|---|
| 1035 | \end{table} |
|---|
| 1036 | |
|---|
| 1037 | Table \ref{tab:output} presents the list of files resulting from an analysis. Basically, each |
|---|
| 1038 | output file name can be divided into three parts. The first part is the sequence file name, the |
|---|
| 1039 | second part corresponds to the extension `\x{\_phyml\_}' and the third part is related to the file |
|---|
| 1040 | content. When launched with the default options, PhyML only generates two files: the tree file and |
|---|
| 1041 | the model parameter file. The estimated maximum likelihood tree is in standard Newick format (see |
|---|
| 1042 | Figure \ref{fig:trees}). The model parameters file, or statistics file, displays the maximum |
|---|
| 1043 | likelihood estimates of the substitution model parameters, the likelihood of the maximum likelihood |
|---|
| 1044 | phylogenetic model, and other important information concerning the settings of the analysis (e.g., |
|---|
| 1045 | type of data, name of the substitution model, starting tree, etc.). Two additional output files are |
|---|
| 1046 | created if bootstrap supports were evaluated. These files simply contain the maximum likelihood |
|---|
| 1047 | trees and the substitution model parameters estimated from each bootstrap replicate. Such |
|---|
| 1048 | information can be used to estimate sampling errors around each parameter of the phylogenetic model. |
|---|
| 1049 | When the random tree option is turned on, the maximum likelihood trees estimated from each random |
|---|
| 1050 | starting trees are printed in a separate tree file (see last row of Table \ref{tab:output}). |
|---|
| 1051 | |
|---|
| 1052 | \subsection{Treatment of invariable sites with fixed branch lengths} |
|---|
| 1053 | |
|---|
| 1054 | PhyML allows users to give an input tree with fixed topology and branch lengths and find the |
|---|
| 1055 | proportion of invariable sites that maximise the likelihood (option \x{-o r}). These two options can |
|---|
| 1056 | be considered as conflicting since branch lengths depend on the proportion of invariants. Hence, |
|---|
| 1057 | changing the proportion of invariants implies that branch lengths are changing too. More formally, |
|---|
| 1058 | let $l$ denote the length of a branch, i.e., the expected number of substitutions per site, and $p$ |
|---|
| 1059 | be the proportion of invariants. We have $l = (1-p)l'$, where $l'$ is the expected number of |
|---|
| 1060 | substitutions at variable sites. When asked to optimize $p$ but leave $l$ unchanged, PhyML |
|---|
| 1061 | does the following: |
|---|
| 1062 | \begin{enumerate} |
|---|
| 1063 | \item Calculate $l' = l/(1-p)$ and leave $l'$ unchanged throughout the optimization. |
|---|
| 1064 | \item Find the value of the proportion of invariants that maximises the likelihood. Let $p^{*}$ denote this value. |
|---|
| 1065 | \item Set $l^{*} = (1-p^{*})l'$ and print out the tree with $l^{*}$ (instead of $l$). |
|---|
| 1066 | \end{enumerate} |
|---|
| 1067 | |
|---|
| 1068 | PhyML therefore assumes that the users wants to fix the branch lengths measured at variable |
|---|
| 1069 | sites only (i.e., $l^{*}$ is fixed). This is the reason why the branch lengths in the input and |
|---|
| 1070 | output trees do differ despite the use of the the \x{-o r} option. While we believe that this |
|---|
| 1071 | approach relies on a sound rationale, it is not perfect. In particular, the original transformation |
|---|
| 1072 | of branch lengths ($l' = l/(1-p)$) relies on a default value for $p$ with is set to 0.2 in |
|---|
| 1073 | practice. It is difficult to justify the use of this value rather than another one. One suggestion |
|---|
| 1074 | proposed by Bart Hazes is to avoid fixing the branch lengths altogether and rather estimate the |
|---|
| 1075 | value of a scaling factor applied to each branch length in the input tree (option |
|---|
| 1076 | \x{--contrained\_lens}). We agree that this solution probably matches very well most users |
|---|
| 1077 | expectation, i.e., ``find the best value of $p$ while constraining the ratio of branch lengths to be |
|---|
| 1078 | that given in the input tree''. Please feel free to send us your suggestions regarding this problem |
|---|
| 1079 | by posting on forum (\url{http://groups.google.com/group/phyml-forum}). |
|---|
| 1080 | |
|---|
| 1081 | |
|---|
| 1082 | \section{Inputs \& outputs for the XML interface }\label{sec:xmlio}\index{XML} |
|---|
| 1083 | |
|---|
| 1084 | \subsection{Mixture models in PhyML}\index{mixture models}\label{sec:mixtures} |
|---|
| 1085 | |
|---|
| 1086 | PhyML implements a wide range of mixture models. The discrete gamma model \cite{yang94b} is arguably |
|---|
| 1087 | the most popular of these models in phylogenetics. However, in theory, mixture models are not |
|---|
| 1088 | restricted to the description of the variation of substitution rates across sites. For instance, if |
|---|
| 1089 | there are good reasons to believe that the relative rates of substitution between nucleotides vary |
|---|
| 1090 | along the sequence alignments, it makes sense to use a mixture of GTR models. Consider the case |
|---|
| 1091 | where substitutions between $A$ and $C$ occur at high rate in some regions of the alignment and low |
|---|
| 1092 | rate elsewhere, a mixture with two classes, each class having its own GTR rate matrix, would be |
|---|
| 1093 | suitable. The likelihood at any site of the alignment is then obtained by averaging the likelihoods |
|---|
| 1094 | obtained for each GTR rate matrix, with the same weight given to each of these matrices. |
|---|
| 1095 | |
|---|
| 1096 | PhyML implements a generic framework that allows users to define mixtures on substitution rates, |
|---|
| 1097 | rate matrices and nucleotide or amino-acid equilibrium frequencies. Each class of the mixture model |
|---|
| 1098 | is built by assembling a substitution rate, a rate matrix\footnote{the rate matrix corresponds here |
|---|
| 1099 | the symmetrical matrix giving the so-called ``echangeability rates''} and a vector of equilibrium |
|---|
| 1100 | frequencies. For instance, let $\{R_1,R_2,R_3\}$ be a set of substitution rates, $\{M_1,M_2\}$ a |
|---|
| 1101 | set of rate matrices and $\{F_1,F_2\}$ a set of vectors of equilibrium frequencies. One could then |
|---|
| 1102 | define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as |
|---|
| 1103 | $\mathcal{C}_2 = \{R_2,M_1,F_1\}$, and a third class as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. If |
|---|
| 1104 | $R_1$, $R_2$ and $R_3$ correspond to slow, medium and fast substitution rates, then this mixture |
|---|
| 1105 | model allows the fast evolving rates to have their own vector of equilibrium frequencies and rate |
|---|
| 1106 | matrix, distinct from that found at the medium or slow evolving sites. The likelihood at any given |
|---|
| 1107 | site $D_s$ of the alignment is then: |
|---|
| 1108 | \begin{eqnarray}\label{equ:mixtlk} |
|---|
| 1109 | \Pr(D_s) = \sum_{c=1}^{3} \Pr(D_s | \mathcal{C}_s=c) \Pr(\mathcal{C}_s=c), |
|---|
| 1110 | \end{eqnarray} |
|---|
| 1111 | where $\Pr(\mathcal{C}_s=c)$ is obtained by multiplying the probability (density) of the three |
|---|
| 1112 | components (i.e., rate, matrix, frequencies). For instance, $\Pr(\mathcal{C}_1=\{R_1,M_1,F_1\}) = |
|---|
| 1113 | \Pr(R_1)\times \Pr(M_1) \times \Pr(F_1)$. |
|---|
| 1114 | We therefore assume here that substitution rates, rate |
|---|
| 1115 | matrices and equilibrium frequencies are independent from one another. |
|---|
| 1116 | |
|---|
| 1117 | Note that, using the same substitution rates, rate matrices and vector of equilibrium frequencies, |
|---|
| 1118 | it is possible to construct many other mixture models. For instance, the mixture model with the |
|---|
| 1119 | largest number of classes can be created by considering all the combinations of these three |
|---|
| 1120 | components. We would then get a mixture of $3\times 2 \times 2=12$ classes, corresponding to all |
|---|
| 1121 | the possible combinations of 3 rates, 2 matrices and 2 vectors of frequencies. |
|---|
| 1122 | |
|---|
| 1123 | |
|---|
| 1124 | % : $\mathcal{C}_1 = |
|---|
| 1125 | % \{R_1,M_1,F_1\}$ $\mathcal{C}_2 = \{R_1,M_1,F_2\}$, $\mathcal{C}_3 = \{R_1,M_2,F_1\}$, |
|---|
| 1126 | % $\mathcal{C}_4 = \{R_1,M_2,F_2\}$, $\mathcal{C}_5 = \{R_2,M_1,F_1\}$, $\mathcal{C}_6 = |
|---|
| 1127 | % \{R_2,M_1,F_2\}$, $\mathcal{C}_7 = \{R_2,M_2,F_1\}$, $\mathcal{C}_8 = \{R_2,M_2,F_2\}$, |
|---|
| 1128 | % $\mathcal{C}_9 = \{R_3,M_1,F_1\}$, $\mathcal{C}_{10} = \{R_3,M_1,F_2\}$, $\mathcal{C}_{11} = |
|---|
| 1129 | % \{R_3,M_2,F_1\}$ and $\mathcal{C}_{12} = \{R_3,M_2,F_2\}$. |
|---|
| 1130 | |
|---|
| 1131 | |
|---|
| 1132 | \subsection{Partitions}\index{partitionned analysis}\index{data partitions} |
|---|
| 1133 | |
|---|
| 1134 | We first introduce some terms of vocabulary that have not been presented before. A partitionned data |
|---|
| 1135 | set, also referred to as partition, is a set of partition elements. Typically, a partitionned data |
|---|
| 1136 | set will be made of a set of distinct gene alignments. A partition element will then correspond to |
|---|
| 1137 | one (or several) of these gene alignments. Note that the biology litterature often uses the term |
|---|
| 1138 | partition to refer to an element of a partitionned data. We thus use here instead the mathematical |
|---|
| 1139 | definition of the terms `partition' and `partition element'. |
|---|
| 1140 | |
|---|
| 1141 | Phylogenetics models usually assume individual columns of an alignment to evolve independently from |
|---|
| 1142 | one another. Codon-based models (e.g., \cite{yang98,yang00b,yang02,guindon04}) are exceptions to |
|---|
| 1143 | this rule since the substitution process applies here to triplets of consecutive sites of coding |
|---|
| 1144 | sequences. The non-independence of the substitution process at the three coding positions (due to |
|---|
| 1145 | the specificities of the genetic code), can therefore be accounted for. Assuming that sites evolve |
|---|
| 1146 | independently does not mean that a distinct model is fitted to each site of the alignment. |
|---|
| 1147 | Estimating the parameters of these models would not make much sense in practice due to the very |
|---|
| 1148 | limited amount of phylogenetic signal conveyed by individual sites. Site independence means instead |
|---|
| 1149 | that the columns of the observed alignment were sampled randomly from the same ``population of |
|---|
| 1150 | columns''. The stochasticity of the substitution process running along the tree is deemed |
|---|
| 1151 | responsible to the variability of site patterns. |
|---|
| 1152 | |
|---|
| 1153 | Some parameters of the phylogenetic model are considered to be common to all the sites in the |
|---|
| 1154 | alignment. The tree topology is typically one such parameter. The transition/transversion ratio is |
|---|
| 1155 | also generally assumed to be the same for all columns. Other parameters can vary from site to site. |
|---|
| 1156 | The rate at which substitutions accumulate is one of these parameters. Hence, different sites can |
|---|
| 1157 | have distinct rates. However, such rates are all ``drawn'' from the same probabilitic distribution |
|---|
| 1158 | (generally a discrete Gamma density). Hence, while different sites may have distinct rates of |
|---|
| 1159 | evolution, they all share the same {\em distribution} of rates. |
|---|
| 1160 | |
|---|
| 1161 | This reasonning also applies on a larger scale. When analysing multiple genes, one can indeed |
|---|
| 1162 | assume that the same mechanism generated the different site patterns observed for every gene. Here |
|---|
| 1163 | again, we can assume that all the genes share the same underlying tree topology (commonly refered to |
|---|
| 1164 | as the ``species tree''). Other parameters of the phylogenetic model, such as branch lengths for |
|---|
| 1165 | instance, might be shared across genes. However, due to the specificities of the gene evolution |
|---|
| 1166 | processes, some model parameters need to be adjusted for each gene separately. To sum up, the |
|---|
| 1167 | phylogenetic analysis of partitionned data requires flexible models with parameters, or distribution |
|---|
| 1168 | of parameters, shared across several partition elements and other parameters estimated separately |
|---|
| 1169 | for each element of the partition. |
|---|
| 1170 | |
|---|
| 1171 | The likelihood of a data set made of the concatenation of $n$ sequence alignments noted $D^{(1)}$, |
|---|
| 1172 | $D^{(2)}, \ldots, D^{(n)}$ is then obtained as follow: |
|---|
| 1173 | \begin{eqnarray} |
|---|
| 1174 | \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \Pr(D^{(i)}) \nonumber\\ |
|---|
| 1175 | &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s), |
|---|
| 1176 | \end{eqnarray} where $L_i$ is the number of site columns in partition element $i$. $\Pr(D^{(i)}_s)$ |
|---|
| 1177 | is then obtained using Equation \ref{equ:mixtlk}, i.e., by summing over the different classes of the |
|---|
| 1178 | mixture model that applies to site $s$ for partition element $i$. Hence, the joint probability of |
|---|
| 1179 | all the partition elements is here broken down into the product of likelihood of every site for each |
|---|
| 1180 | partition element. As noted just above, any given component of the mixture model at a given |
|---|
| 1181 | particular site is shared by the other sites that belong to the same partition element and, for some |
|---|
| 1182 | of them, by sites in other partition elements (e.g., the same tree topology is shared by all the |
|---|
| 1183 | sites, throughout all the partition elements). |
|---|
| 1184 | |
|---|
| 1185 | PhyML implements a wide variety of partition models. The only parameter that is constrained to be |
|---|
| 1186 | shared by all the partition elements is the tree topology. This constraint makes sense when |
|---|
| 1187 | considering distantly related taxa, typically inter-species data. For closely related taxa, i.e., |
|---|
| 1188 | when analysing intra-species or population-level data, not all the genes might have the same |
|---|
| 1189 | evolutionary history. Recombination events combined to the incomplete lineage sorting phenomenon |
|---|
| 1190 | can generate discrepancies between the gene trees and the underlying species tree (see |
|---|
| 1191 | \cite{degnan09} for a review). The phylogenetic softwares BEST \cite{best}\index{BEST}, STEM |
|---|
| 1192 | \cite{stem}\index{STEM} and *BEAST \cite{startbeast}\index{*BEAST} are dedicated to the estimation |
|---|
| 1193 | of species tree phylogenies from the analysis of multi-gene data and allow gene-tree topologies to |
|---|
| 1194 | vary across genes. |
|---|
| 1195 | |
|---|
| 1196 | Aside from the tree topology that is common to all the sites and all the partition elements, other |
|---|
| 1197 | parameters of the phylogenetic model can be either shared across partition elements or estimated |
|---|
| 1198 | separately for each of these. When analysing three partition elements, $A$, $B$ and $C$ for |
|---|
| 1199 | instance, PhyML can fit a model where the same set of branch lengths applies to $A$ and $B$ while |
|---|
| 1200 | $C$ has its own estimated lengths. The same goes for the substitution model: the same GTR model, |
|---|
| 1201 | with identical parameter values, can be fitted to $A$ and $C$ and JC69 for instance can be used for |
|---|
| 1202 | $B$. The sections below give more detailed information on the range of models available and how to |
|---|
| 1203 | set up the corresponding XML configuration files to implement them. |
|---|
| 1204 | |
|---|
| 1205 | |
|---|
| 1206 | \subsection{Combining mixture and partitions in PhyML: the theory} |
|---|
| 1207 | |
|---|
| 1208 | The rationale behind mixture models as implemented in PhyML lies in (1) the definition of suitable |
|---|
| 1209 | rate matrices, equilibrium frequency vectors and relative rates of substitution and (2) the assembly |
|---|
| 1210 | of these components so as to create the classes of a mixture. The main idea behind partitionned |
|---|
| 1211 | analysis in PhyML lies in (1) the hypothesis of statistical independance of the different data |
|---|
| 1212 | partition elements and (2) distinct data partition can share model components such as rate matrices, |
|---|
| 1213 | equilibrium frequencies or distribution of rates across sites. More formally, the likelihood of a |
|---|
| 1214 | data set made of $n$ partition elements is written as follows: |
|---|
| 1215 | |
|---|
| 1216 | \begin{eqnarray*} |
|---|
| 1217 | \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \Pr(D^{(i)}_s) \\ |
|---|
| 1218 | &=& \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c), |
|---|
| 1219 | \end{eqnarray*} |
|---|
| 1220 | |
|---|
| 1221 | where $L_i$ is the number of sites in partition element $i$ and $K_i$ is the number of classes in |
|---|
| 1222 | the mixture model that applies to this same partition element. Each class of a mixture is made of a |
|---|
| 1223 | rate matrix $M$, a vector of equilibrium frequencies $F$ and a relative rate of substitution |
|---|
| 1224 | $R$. Branch lengths, $L$ and tree topology $\tau$ are also required for the calculation of the |
|---|
| 1225 | likelihood. Hence we have: |
|---|
| 1226 | |
|---|
| 1227 | \begin{eqnarray*} |
|---|
| 1228 | && \Pr(D^{(1)},D^{(2)},\ldots,D^{(n)}) \\ |
|---|
| 1229 | &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{c=1}^{K_i} \Pr(D^{(i)}_s|\mathcal{C}=c) \Pr(\mathcal{C}=c) \\ |
|---|
| 1230 | &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) \mathcal{I}(m,f,r,i) |
|---|
| 1231 | % &&= \prod_{i=1}^{n} \prod_{s=1}^{L_i} \sum_{m}^{\mathcal{M}_i} \sum_{f}^{\mathcal{F}_i} \sum_{r}^{\mathcal{R}_i} |
|---|
| 1232 | % \Pr(D^{(i)}_s|M_m^{(i)},F_f^{(i)},R_r^{(i)},L^{(i)},\tau) \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) |
|---|
| 1233 | \end{eqnarray*} |
|---|
| 1234 | where $\mathcal{M}_i$, $\mathcal{F}_i$ and $\mathcal{R}_i$ are the number of rate matrices, |
|---|
| 1235 | vector of equilibrium frequencies and relative rates that apply to partition element $i$ |
|---|
| 1236 | respectively. $\mathcal{I}(m,f,r,i)$ is an indicator function that takes value 1 if the combination |
|---|
| 1237 | $M_m$, $F_f$ and $R_r$ is acually defined in the model for this particular partition element |
|---|
| 1238 | $i$. Its value is 0 otherwise. In the example given in section \ref{sec:mixtures} $\{R_1,R_2,R_3\}$ |
|---|
| 1239 | is the set of substitution rates, $\{M_1,M_2\}$ the set of rate matrices and $\{F_1,F_2\}$ the set of vectors of equilibrium frequencies. We then |
|---|
| 1240 | define the first class of the mixture model as $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, a second class as |
|---|
| 1241 | $\mathcal{C}_2 = \{R_2,M_1,F_1\}$ and the third as $\mathcal{C}_3 = \{R_3,M_2,F_2\}$. Hence, we |
|---|
| 1242 | have $\mathcal{I}(1,1,1,i)$, $\mathcal{I}(1,1,2,i)$ and $\mathcal{I}(2,2,3,i)$ equal to one while |
|---|
| 1243 | the nine other values that this indicator function takes, corresponding to the possible combinations of |
|---|
| 1244 | two vectors of frequencies, two matrices and three rates, are all zero. |
|---|
| 1245 | |
|---|
| 1246 | As stated before, our implementation assumes that the different components of a mixture are |
|---|
| 1247 | independant. In other words, we have $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \Pr(M_m^{(i)}) \times |
|---|
| 1248 | \Pr(F_f^{(i)}) \times \Pr(R_r^{(i)})$. In practice, the joint probability |
|---|
| 1249 | $\Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)})$ is obtained as follows: |
|---|
| 1250 | \begin{eqnarray}\label{equ:weights} |
|---|
| 1251 | \Pr(M_m^{(i)},F_f^{(i)},R_r^{(i)}) = \frac{\Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)})}{ |
|---|
| 1252 | \sum_{m,f,r} \Pr(M_m^{(i)}) \Pr(F_f^{(i)}) \Pr(R_r^{(i)}) \mathcal{I}(m,f,r,i)} |
|---|
| 1253 | \end{eqnarray} |
|---|
| 1254 | The probabilities $\Pr(M_m^{(i)})$, $\Pr(F_f^{(i)})$ and $\Pr(R_r^{(i)})$, also called `weights', can be fixed or estimated |
|---|
| 1255 | from the data. |
|---|
| 1256 | |
|---|
| 1257 | \subsection{The XML format and its use in PhyML}\label{sec:XML format} |
|---|
| 1258 | |
|---|
| 1259 | The few paragraphs below are largely inspired from the Wikipedia page that describes the XML format |
|---|
| 1260 | (\url{http://en.wikipedia.org/wiki/XML}). XML (eXtensible Markup Language) is a markup language that |
|---|
| 1261 | defines a set of rules for encoding documents in a format that is both human-readable and |
|---|
| 1262 | machine-readable. An XML document is divided into {\em markup} and {\em content}, which may be |
|---|
| 1263 | distinguished by the application of simple syntactic rules. Generally, strings that constitute |
|---|
| 1264 | markup either begin with the character `\x{<}' and end with a `\x{>}'. Strings of characters that |
|---|
| 1265 | are not markup are content: |
|---|
| 1266 | |
|---|
| 1267 | \vspace{0.2cm} |
|---|
| 1268 | \begin{Verbatim}[frame=single, label=XML markup and content example, samepage=true, |
|---|
| 1269 | baselinestretch=0.5, fontsize=\small] |
|---|
| 1270 | <markup> |
|---|
| 1271 | content |
|---|
| 1272 | </markup> |
|---|
| 1273 | \end{Verbatim} |
|---|
| 1274 | |
|---|
| 1275 | A markup construct that begins with `\x{<}' and ends with `\x{>}' is called a {\em tag}. Tags come |
|---|
| 1276 | in three flavors: (1) start-tags (e.g, \x{<section>}), end-tags (e.g., \x{</section>}) and |
|---|
| 1277 | empty-element tags (e.g., \x{<line-break />}). A {\em component} either begins with a start-tag and |
|---|
| 1278 | ends with a matching end-tag or consists only of an empty-element tag. The characters between the |
|---|
| 1279 | start- and end-tags, if any, are the element's content, and may contain markup, including other |
|---|
| 1280 | elements, which are called child elements. In the following example, the element \x{img} has two |
|---|
| 1281 | {\em attributes}, \x{src} and \x{alt}: \x{<img src="madonna.jpg" alt="Foligno Madonna, by |
|---|
| 1282 | Raphael"/>}. Another example would be \x{<step number="3">Connect A to B.</step>} where the name of |
|---|
| 1283 | the attribute is ``\x{number}" and the value is ``\x{3}". |
|---|
| 1284 | |
|---|
| 1285 | In practice, building a mixture model in a XML file readable by PhyML is relatively |
|---|
| 1286 | straightforward. The first step is to define the different components of each class of the |
|---|
| 1287 | mixture. Consider for instance that the fitted model will have a Gamma distribution with four |
|---|
| 1288 | classes plus a proportion of invariants. The rate component of the mixture can then be specified |
|---|
| 1289 | using the following XML code: |
|---|
| 1290 | |
|---|
| 1291 | \vspace{0.2cm} |
|---|
| 1292 | \begin{Verbatim}[frame=single, label=$\Gamma4$+I rates, samepage=true, baselinestretch=0.5, |
|---|
| 1293 | fontsize=\small, numbers=left] |
|---|
| 1294 | |
|---|
| 1295 | <siterates id="SiteRates1"> |
|---|
| 1296 | <weights id="Distrib" family="gamma+inv" alpha=".1" \ |
|---|
| 1297 | optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> |
|---|
| 1298 | </weights> |
|---|
| 1299 | <instance id="R1" init.value="1.0"/> |
|---|
| 1300 | <instance id="R2" init.value="1.0"/> |
|---|
| 1301 | <instance id="R5" init.value="0.0"/> |
|---|
| 1302 | <instance id="R3" init.value="1.0"/> |
|---|
| 1303 | <instance id="R4" init.value="1.0"/> |
|---|
| 1304 | </siterates> |
|---|
| 1305 | |
|---|
| 1306 | \end{Verbatim} |
|---|
| 1307 | |
|---|
| 1308 | In the example above, the \x{<siterates>} component completely defines a model of substitution rate |
|---|
| 1309 | variation across sites. This component has a particular identity, i.e., a name associated to it |
|---|
| 1310 | (``\x{SiteRates1}'' here), which is not mandatory. This \x{<siterates>} component has six |
|---|
| 1311 | sub-components. The first is the \x{<weights>} component, followed by five \x{<instance>} |
|---|
| 1312 | components. The \x{<weights>} component defines the type of distribution that characterizes the |
|---|
| 1313 | variation of rates across sites. A discrete Gamma plus invariants is used here. Two parameters |
|---|
| 1314 | specify this distribution: the gamma shape and the proportion of invariant parameters. Their initial |
|---|
| 1315 | values are set by using the corresponding attributes and attribute values (\x{alpha="0.1"} and |
|---|
| 1316 | \x{pinv="0.4"}). Also, PhyML can optimise these parameters so as to maximise the likelihood of the |
|---|
| 1317 | whole phylogenetic model (\x{optimise.pinv="yes"} and \x{optimise.alpha="yes"}). The following five |
|---|
| 1318 | \x{<instance>} components define the rate classes themselves. The \x{id} attribute is here |
|---|
| 1319 | mandatory and must be unique to each class. Note that one of the initial (relative) rate |
|---|
| 1320 | (\x{init.value} attribute) is set to zero. The corresponding rate class (the third in this example) |
|---|
| 1321 | will then correspond to the invariant site category. |
|---|
| 1322 | |
|---|
| 1323 | Having specified the part of the phylogenetic model that describes the variation of rates across |
|---|
| 1324 | sites, we can now move on to build the rest of the model. The component below defines two |
|---|
| 1325 | substitution models: |
|---|
| 1326 | |
|---|
| 1327 | \vspace{0.2cm} |
|---|
| 1328 | \begin{Verbatim}[frame=single, label=Rate matrices, samepage=true, baselinestretch=0.5, |
|---|
| 1329 | fontsize=\small, numbers=left] |
|---|
| 1330 | |
|---|
| 1331 | <ratematrices id="RateMatrices"> |
|---|
| 1332 | <instance id="M1" model="HKY85" tstv="4.0" optimise.tstv="no"/> |
|---|
| 1333 | <instance id="M2" model="GTR" optimise.rr="yes"/> |
|---|
| 1334 | </ratematrices> |
|---|
| 1335 | \end{Verbatim} |
|---|
| 1336 | |
|---|
| 1337 | This \x{<ratematrices>} component sets out a list of substitution models (HKY85 and GTR here). Here |
|---|
| 1338 | again, the different elements in this list correspond to the \x{<instance>} sub-components. Each |
|---|
| 1339 | instance must have a unique \x{id} attribute for a reason that will become obvious shortly. The |
|---|
| 1340 | remaining attributes and their functions are described in Section \ref{sec:xmlratematrices}. |
|---|
| 1341 | |
|---|
| 1342 | The next ``ingredient'' in our phylogenetic model are vectors of nucleotide frequencies. The |
|---|
| 1343 | \x{<equfreqs>} component below specifies two of such vectors: |
|---|
| 1344 | |
|---|
| 1345 | |
|---|
| 1346 | \vspace{0.2cm} |
|---|
| 1347 | \begin{Verbatim}[frame=single, label=Equilibrium frequencies, samepage=true, baselinestretch=0.5, |
|---|
| 1348 | fontsize=\small, numbers=left] |
|---|
| 1349 | |
|---|
| 1350 | <equfreqs id="EquFreq"> |
|---|
| 1351 | <instance id="F1"/> |
|---|
| 1352 | <instance id="F2"/> |
|---|
| 1353 | </equfreqs> |
|---|
| 1354 | |
|---|
| 1355 | \end{Verbatim} |
|---|
| 1356 | |
|---|
| 1357 | Now, we need to assemble these three components (rate variation across sites, rate matrices and |
|---|
| 1358 | vectors of equilibrium frequencies) into a mixture model. The \x{<partitionelem>} component below |
|---|
| 1359 | defines one such model: |
|---|
| 1360 | |
|---|
| 1361 | \vspace{0.2cm} |
|---|
| 1362 | \begin{Verbatim}[frame=single, label=Mixture model, samepage=true, baselinestretch=0.5, |
|---|
| 1363 | fontsize=\small, numbers=left] |
|---|
| 1364 | |
|---|
| 1365 | <partitionelem id="Part1" file.name="./nucleic.txt" data.type="nt"> |
|---|
| 1366 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
|---|
| 1367 | <mixtureelem list="M1, M1, M1, M2, M2"/> |
|---|
| 1368 | <mixtureelem list="F1, F2, F1, F2, F2"/> |
|---|
| 1369 | </partitionelem> |
|---|
| 1370 | |
|---|
| 1371 | \end{Verbatim} |
|---|
| 1372 | |
|---|
| 1373 | The \x{<partitionelem>} component defines a particular partition element. In this example, the |
|---|
| 1374 | partition element corresponds to the sequence file called \x{nucleic.txt}, which is an alignment of |
|---|
| 1375 | nucleotide sequences (see the \x{data.type} attribute value). The \x{<mixtureelem>} are |
|---|
| 1376 | sub-components of the \x{<partitionelem>} component. Each \x{<mixtureelem>} has a \x{list} |
|---|
| 1377 | atrribute. Each such \x{list} gives the ID of components that have been defined before. For |
|---|
| 1378 | instance, the first \x{<mixtureelem>} refers to the five classes of the \x{<siterates>} |
|---|
| 1379 | component. The ordering of the different term in these list matters a lot since it is directly |
|---|
| 1380 | related to the elements in each class of the mixture model. Hence, the first element in the |
|---|
| 1381 | \x{<list>} attribute of the first \x{<mixtureelem>} added to the first element in the \x{<list>} |
|---|
| 1382 | attribute of the second \x{<mixtureelem>} plus the the first element in \x{<list>} attribute of the |
|---|
| 1383 | third \x{<mixtureelem>} defines the first class of the mixture model. Therefore, the mixture model |
|---|
| 1384 | defined above has five classes: $\mathcal{C}_1 = \{R_1,M_1,F_1\}$, $\mathcal{C}_2 = |
|---|
| 1385 | \{R_2,M_1,F_2\}$, $\mathcal{C}_3 = \{R_3,M_1,F_1\}$, $\mathcal{C}_4 = \{R_4,M_2,F_2\}$ and |
|---|
| 1386 | $\mathcal{C}_5 = \{R_5,M_2,F_2\}$. |
|---|
| 1387 | |
|---|
| 1388 | % Going back to the different components of this model, the XML code dealing with the substitution |
|---|
| 1389 | % rates defines five classes with names {\tt R1} to {\tt R5}. The initial values of these rates are |
|---|
| 1390 | % set to 1.0, except for {\tt R5}, which is set to 0 and will therefore correspond to the invariable |
|---|
| 1391 | % site class. The {\tt <weight>} tag that follows indicate that these rates define a $\Gamma 4$+Inv |
|---|
| 1392 | % model, with initial gamma shape parameter set to 0.1 and initial proportion of invariants set to |
|---|
| 1393 | % 0.4. These two parameters will be estimated in the analysis ({\tt optimise.alpha} and {\tt |
|---|
| 1394 | % optimise.pinv} attributes set to {\tt yes}). The two rate matrices have names {\tt M1} and {\tt |
|---|
| 1395 | % M2}. {\tt M1} corresponds to a HKY85 model, with transition/transversion ratio set to 4.0 and set |
|---|
| 1396 | % to be optimised in the analysis. {\tt M2} is a GTR model, which parameters are also set to be |
|---|
| 1397 | % optimised. {\tt F1} and {\tt F2} are two vectors of nucleotide frequencies at equilibrium. These two |
|---|
| 1398 | % sets of frequencies will therefore be estimated during the analysis. |
|---|
| 1399 | |
|---|
| 1400 | |
|---|
| 1401 | |
|---|
| 1402 | \subsection{Setting up mixture and partition models in PhyML: the basics}\index{mixture |
|---|
| 1403 | models}\index{partitionned analysis}\index{data partitions} |
|---|
| 1404 | |
|---|
| 1405 | Mixture models are particularly relevant to the analysis of partitionned data. Indeed, some features |
|---|
| 1406 | of evolution are gene-specific (e.g., substitution rates vary across genes). Models that can |
|---|
| 1407 | accomodate for such variation, as mixture models do, are therefore relevant in this context. |
|---|
| 1408 | However, other evolutionary features are shared across loci (e.g., genes located in the same genomic |
|---|
| 1409 | region usually have similar GC contents). As a consequence, some components of mixture models need |
|---|
| 1410 | to be estimated separately for each partition element while others should be shared by different |
|---|
| 1411 | partition elements. |
|---|
| 1412 | |
|---|
| 1413 | Below is a simple example with a partitionned data set made of two elements: |
|---|
| 1414 | |
|---|
| 1415 | \vspace{0.2cm} |
|---|
| 1416 | \begin{Verbatim}[frame=single, label=Two sets of branch lengths (one per partition element), |
|---|
| 1417 | samepage=true, baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1418 | |
|---|
| 1419 | <branchlengths id="BranchLens"> |
|---|
| 1420 | <instance id="L1"/> |
|---|
| 1421 | <instance id="L2"/> |
|---|
| 1422 | </branchlengths> |
|---|
| 1423 | |
|---|
| 1424 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
|---|
| 1425 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
|---|
| 1426 | <mixtureelem list="L1, L1, L1, L1, L1"/> |
|---|
| 1427 | </partitionelem> |
|---|
| 1428 | |
|---|
| 1429 | <partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt"> |
|---|
| 1430 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
|---|
| 1431 | <mixtureelem list="L2, L2, L2, L2, L2"/> |
|---|
| 1432 | </partitionelem> |
|---|
| 1433 | |
|---|
| 1434 | \end{Verbatim} |
|---|
| 1435 | |
|---|
| 1436 | Mixture elements with names \x{R1},$\ldots$, \x{R5} refer to the $\Gamma4+$I model defined |
|---|
| 1437 | previsouly (see Section \ref{sec:XML format}). The \x{<branchlengths>} XML component defines a |
|---|
| 1438 | mixture element that had not been introduced before. It defined vectors of branch lengths that |
|---|
| 1439 | apply to the estimated phylogeny. Two instances of such vectors are defined: \x{L1} and \x{L2}. |
|---|
| 1440 | When examining the two partition elements (\x{<partitionelem>} component), it appears that \x{L1} |
|---|
| 1441 | is associated with \x{Part1} while \x{L2} is associated with \x{Part2}. Hence, branch lengths |
|---|
| 1442 | will be estimated separately for these two partition elements. |
|---|
| 1443 | |
|---|
| 1444 | Note that a given partition element can only have one {\tt branchlengths} instance associated to |
|---|
| 1445 | it. For instance, the example given below is not valid: |
|---|
| 1446 | |
|---|
| 1447 | \vspace{0.2cm} |
|---|
| 1448 | \begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5, |
|---|
| 1449 | fontsize=\small, numbers=left] |
|---|
| 1450 | |
|---|
| 1451 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
|---|
| 1452 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
|---|
| 1453 | <mixtureelem list="L1, L1, L1, L2, L2"/> |
|---|
| 1454 | </partitionelem> |
|---|
| 1455 | |
|---|
| 1456 | \end{Verbatim} |
|---|
| 1457 | |
|---|
| 1458 | In other words, mixture of branch lengths are forbidden. One reason for this restriction is that |
|---|
| 1459 | mixture of edge lengths sometimes lead to non-identifiable models (i.e., models with distinct sets |
|---|
| 1460 | of branch lengths have the same likelihood) \cite{matsen07}. But mostly, combining mixture of branch |
|---|
| 1461 | lengths with mixture of rates appears like a deadly combination. Consider for instance the following model: |
|---|
| 1462 | |
|---|
| 1463 | \vspace{0.2cm} |
|---|
| 1464 | \begin{Verbatim}[frame=single, label=Invalid mixture, samepage=true, baselinestretch=0.5, |
|---|
| 1465 | fontsize=\small, numbers=left] |
|---|
| 1466 | |
|---|
| 1467 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
|---|
| 1468 | <mixtureelem list="R1, R2, R3"/> |
|---|
| 1469 | <mixtureelem list="L1, L2, L3"/> |
|---|
| 1470 | </partitionelem> |
|---|
| 1471 | |
|---|
| 1472 | \end{Verbatim} |
|---|
| 1473 | |
|---|
| 1474 | It is here impossible to tell apart branch lengths and substitution rates. Such model is strongly |
|---|
| 1475 | non-identifiable and therefore not relevant. |
|---|
| 1476 | |
|---|
| 1477 | In the example given above, the same $\Gamma4+$I model (i.e. the same gamma shape parameter and |
|---|
| 1478 | proportion of invariant ) applies to the two partition elements. It is possible to use two distinct |
|---|
| 1479 | $\Gamma4+$I models instead using the following XML code: |
|---|
| 1480 | |
|---|
| 1481 | |
|---|
| 1482 | |
|---|
| 1483 | \vspace{0.2cm} |
|---|
| 1484 | \begin{Verbatim}[frame=single, label=Two distinct $\Gamma4+$I models, samepage=true, |
|---|
| 1485 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1486 | |
|---|
| 1487 | <siterates id="SiteRates1"> |
|---|
| 1488 | <weights id="Distrib1" family="gamma+inv" alpha=".1" \ |
|---|
| 1489 | optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> |
|---|
| 1490 | </weights> |
|---|
| 1491 | <instance id="R1" init.value="1.0"/> |
|---|
| 1492 | <instance id="R2" init.value="1.0"/> |
|---|
| 1493 | <instance id="R5" init.value="0.0"/> |
|---|
| 1494 | <instance id="R3" init.value="1.0"/> |
|---|
| 1495 | <instance id="R4" init.value="1.0"/> |
|---|
| 1496 | </siterates> |
|---|
| 1497 | |
|---|
| 1498 | <siterates id="SiteRates2"> |
|---|
| 1499 | <weights id="Distrib2" family="gamma+inv" alpha=".1" \ |
|---|
| 1500 | optimise.alpha="yes" pinv="0.4" optimise.pinv="yes"> |
|---|
| 1501 | </weights> |
|---|
| 1502 | <instance id="R6" init.value="1.0"/> |
|---|
| 1503 | <instance id="R7" init.value="1.0"/> |
|---|
| 1504 | <instance id="R8" init.value="0.0"/> |
|---|
| 1505 | <instance id="R9" init.value="1.0"/> |
|---|
| 1506 | <instance id="R10" init.value="1.0"/> |
|---|
| 1507 | </siterates> |
|---|
| 1508 | |
|---|
| 1509 | <partitionelem id="Part1" file.name="./nucleic1.txt" data.type="nt"> |
|---|
| 1510 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
|---|
| 1511 | <mixtureelem list="L1, L1, L1, L1, L1"/> |
|---|
| 1512 | </partitionelem> |
|---|
| 1513 | |
|---|
| 1514 | <partitionelem id="Part2" file.name="./nucleic2.txt" data.type="nt"> |
|---|
| 1515 | <mixtureelem list="R6, R7, R8, R9, R10"/> |
|---|
| 1516 | <mixtureelem list="L2, L2, L2, L2, L2"/> |
|---|
| 1517 | </partitionelem> |
|---|
| 1518 | |
|---|
| 1519 | \end{Verbatim} |
|---|
| 1520 | |
|---|
| 1521 | \x{SiteRates1} and \x{SiteRates2} here define two distinct $\Gamma4+$I models. Each of these models apply to |
|---|
| 1522 | one of the two partition elements (\x{nucleic1.txt} and \x{nucleic2.txt}), allowing them to display |
|---|
| 1523 | different patterns of rate variation across sites. |
|---|
| 1524 | |
|---|
| 1525 | |
|---|
| 1526 | \subsection{XML options} |
|---|
| 1527 | \subsubsection{{\tt phyml} component}\index{XML options!{\tt phyml} component} |
|---|
| 1528 | Options: |
|---|
| 1529 | \begin{itemize} |
|---|
| 1530 | \item \x{output.file="filename"}. The main output files of PhyML analysis will be named |
|---|
| 1531 | \x{filename\_phyml\_tree} and \x{filename\_phyml\_stats}. |
|---|
| 1532 | \item \x{bootstrap="nreplicates"}. Run \x{nreplicates} replicates for the non-parametric bootstrap analysis. |
|---|
| 1533 | \item \x{run.id="idstring"}. PhyML will append the string \x{idstring} to each output file. |
|---|
| 1534 | \item \x{print.trace="yes|true|no|false"}. PhyML will print the estimated trees (and the |
|---|
| 1535 | corresponding loglikelihoods) at multiple stages of the estimation process. This option is useful |
|---|
| 1536 | for monitoring the progress of the analysis when processing large data sets. |
|---|
| 1537 | \item \x{branch.test="aBayes|aLRT|SH|no"}. Calculate fast branch support using the aBayes method |
|---|
| 1538 | \cite{anisimova11}, aLRT \cite{anisimova06} or SH \cite{shimodaira99} tests. These branch |
|---|
| 1539 | statistics are much faster to estimate than the bootrap proportions and usually provide good |
|---|
| 1540 | estimates of the probabilities that the corresponding edges are correctly inferred (see Anisimova et |
|---|
| 1541 | al. 2011 for more precision). By default and if no bootstrap analysis is performed, branch supports |
|---|
| 1542 | are estimated using the aBayes approach. |
|---|
| 1543 | |
|---|
| 1544 | \end{itemize} |
|---|
| 1545 | \subsubsection{{\tt topology} component}\index{XML options!{\tt topology} component} |
|---|
| 1546 | Options: |
|---|
| 1547 | \begin{itemize} |
|---|
| 1548 | \item \x{init.tree="bionj"|"user"|"random"}. Starting tree. Default is \x{bionj}. |
|---|
| 1549 | \item \x{file.name="name\_of\_tree\_file"}. In case \x{init.tree="user"}, this |
|---|
| 1550 | attribute is mandatory. \x{name\_of\_tree\_file} is a |
|---|
| 1551 | text file containing a tree in NEWICK format. |
|---|
| 1552 | \item \x{optimise.tree="yes"|"true"|"no"|"false"}. The starting tree topology as defined by |
|---|
| 1553 | \x{init.tree} is to be optimised (or not) so as to maximise the likelihood function. |
|---|
| 1554 | \item \x{search="nni"|"spr"|"none"}. Tree topology search is conducted using NNI (fast), SPR (a bit |
|---|
| 1555 | slower but more accurate) or no moves. |
|---|
| 1556 | \end{itemize} |
|---|
| 1557 | \vspace{0.2cm} |
|---|
| 1558 | \begin{Verbatim}[frame=single, label=Example of `topology' component, samepage=true, |
|---|
| 1559 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1560 | |
|---|
| 1561 | <topology> |
|---|
| 1562 | <instance id="T1" init.tree="bionj" optimise.tree="true" \ |
|---|
| 1563 | search="spr"/> |
|---|
| 1564 | </topology> |
|---|
| 1565 | |
|---|
| 1566 | \end{Verbatim} |
|---|
| 1567 | |
|---|
| 1568 | \subsubsection{{\tt ratematrices} component}\index{XML options!{\tt ratematrices} component}\label{sec:xmlratematrices} |
|---|
| 1569 | Options: |
|---|
| 1570 | \begin{itemize} |
|---|
| 1571 | \item \x{model="JC69"|"K80"|"F81"|"F84"|"HKY85"|"TN93"|"GTR"|"custom"} for nucleotide data. The default is \x{"HKY85"}.\\ |
|---|
| 1572 | \x{model="LG"|"WAG"|"JTT"|"MtREV"|"Dayhoff"|"DCMut"|"RtREV"|"CpREV"|"VT"}\\\x{|"Blosum62"|"MtMam"|"MtArt"|"HIVw"|"HIVb"|"customaa"} |
|---|
| 1573 | for amino-acid sequences. The default is \x{"LG"}. |
|---|
| 1574 | \item \x{model.code="012345"}. For \x{custom} model applied to nucleotide sequences: set the |
|---|
| 1575 | string of digits that define a custom substitution model. See Table \ref{tab:modelcode} on page |
|---|
| 1576 | \pageref{tab:modelcode} for more |
|---|
| 1577 | information about the model codes. |
|---|
| 1578 | \item \x{ratematrix.code="filename"}. When used in conjunction with \x{model="customaa"}, |
|---|
| 1579 | \x{filename} is the name of the file that gives the rates of substitution between amino-acids as |
|---|
| 1580 | well as their frequences at equilibrium using PAML rate matrix format. An example of such file is |
|---|
| 1581 | provided in {phyml/examples/X1.mat}. |
|---|
| 1582 | \item \x{optimise.rr="yes"|"true"|"no"|"false"}. For \x{custom} and \x{GTR} nucleotide models only: |
|---|
| 1583 | optimise the substitution rate model parameters. |
|---|
| 1584 | \item \x{optimise.tstv="yes"|"true"|"no"|"false"}. For \x{K80}, \x{F84}, \x{HKY85} and \x{TN93} |
|---|
| 1585 | models only: optimise the transition/transversion rate ratio. |
|---|
| 1586 | \item \x{tstv="value"}. For \x{K80}, \x{HKY85} and \x{TN93} models only: set the transition/transversion to a |
|---|
| 1587 | given value. |
|---|
| 1588 | \end{itemize} |
|---|
| 1589 | |
|---|
| 1590 | The {\tt ratematrices} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt |
|---|
| 1591 | no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each matrix in the |
|---|
| 1592 | set of matrices defined by this component (see Equation \ref{equ:weights}), will be estimated from the data. |
|---|
| 1593 | |
|---|
| 1594 | \vspace{0.2cm} |
|---|
| 1595 | \begin{Verbatim}[frame=single, label=Example of `ratematrices' component, samepage=true, |
|---|
| 1596 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1597 | |
|---|
| 1598 | <ratematrices id="RM1" optimise.weights="yes"> |
|---|
| 1599 | <instance id="M1" model="custom" model.code="000000"/> |
|---|
| 1600 | <instance id="M2" model="GTR" optimise.rr="yes"/> |
|---|
| 1601 | <instance id="M3" model="WAG"/> |
|---|
| 1602 | </ratematrices> |
|---|
| 1603 | |
|---|
| 1604 | \end{Verbatim} |
|---|
| 1605 | |
|---|
| 1606 | \subsubsection{{\tt equfreqs} component}\index{XML options!{\tt equfreqs} component} |
|---|
| 1607 | Options: |
|---|
| 1608 | \begin{itemize} |
|---|
| 1609 | \item \x{base.freqs="a,b,c,d"} where \x{a-d} are nucleotide frequencies. Make sure that these |
|---|
| 1610 | frequencies are separated by comas and no space character is inserted. |
|---|
| 1611 | \item \x{aa.freqs="empirical|model"}. Amino-acid frequencies are derived from counting the number of |
|---|
| 1612 | occurence of each amino-acid in the alignment (\x{aa.freqs="empirical"}) or given by the |
|---|
| 1613 | substitution model (\x{aa.freqs="model"}). |
|---|
| 1614 | \item \x{optimise.freqs="true|yes|false|no"}. Nucleotide frequencies can be optimised so as to maximise |
|---|
| 1615 | the likelihood (\x{optimise.freqs="yes|true"}). |
|---|
| 1616 | \end{itemize} |
|---|
| 1617 | |
|---|
| 1618 | The {\tt equfreqs} component has the attribute {\tt optimise.weights=yes/no} (default is {\tt |
|---|
| 1619 | no}). If {\tt optimise.weights=yes}, then the probabilities (or weights) or each vector of |
|---|
| 1620 | equilibrium frequencies in the |
|---|
| 1621 | set of vectors defined by this component (see Equation \ref{equ:weights}), will be estimated from the data. |
|---|
| 1622 | |
|---|
| 1623 | \vspace{0.2cm} |
|---|
| 1624 | \begin{Verbatim}[frame=single, label=Example of `equfreqs' component, samepage=true, |
|---|
| 1625 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1626 | |
|---|
| 1627 | <equfreqs id="EF1" optimise.weights="yes"> |
|---|
| 1628 | <instance id="F1" base.freqs="0.25,0.25,0.25,0.25"/> |
|---|
| 1629 | <instance id="F2" aa.freqs="empirical"/> |
|---|
| 1630 | <instance id="F3" optimise.freq="yes"/> |
|---|
| 1631 | </equfreqs> |
|---|
| 1632 | |
|---|
| 1633 | \end{Verbatim} |
|---|
| 1634 | |
|---|
| 1635 | \subsubsection{{\tt branchlengths} component}\index{XML options!{\tt branchlengths} component} |
|---|
| 1636 | Options: |
|---|
| 1637 | \begin{itemize} |
|---|
| 1638 | \item \x{optimise.lens="yes"|"true"|"no"|"false"}: branch lengths are optimised or not. The default |
|---|
| 1639 | is set to \x{"yes"}. |
|---|
| 1640 | \end{itemize} |
|---|
| 1641 | \vspace{0.2cm} |
|---|
| 1642 | \begin{Verbatim}[frame=single, label=Example of `branchlengths' component, baselinestretch=0.5, |
|---|
| 1643 | fontsize=\small, numbers=left] |
|---|
| 1644 | |
|---|
| 1645 | <branchlengths id="BL1"> |
|---|
| 1646 | <instance id="L1" optimise.lens="yes"/> |
|---|
| 1647 | <instance id="L2"/> |
|---|
| 1648 | <instance id="L3" optimise.lens="false"/> |
|---|
| 1649 | </branchlengths> |
|---|
| 1650 | |
|---|
| 1651 | \end{Verbatim} |
|---|
| 1652 | |
|---|
| 1653 | \subsubsection{{\tt siterates} component}\index{XML options!{\tt siterates} component} |
|---|
| 1654 | Options: |
|---|
| 1655 | \begin{itemize} |
|---|
| 1656 | \item \x{value="val"}, where \x{"val"} is the relative substitution rate for the corresponding class. |
|---|
| 1657 | \end{itemize} |
|---|
| 1658 | A \x{siterate} component generally includes a \x{weights} element that specifies the probabilitic |
|---|
| 1659 | distribution of the relative rates. The available options for such element are: |
|---|
| 1660 | \begin{itemize} |
|---|
| 1661 | \item \x{family="gamma|gamma+inv|freerates"}. \x{gamma} indicates that the distribution of the |
|---|
| 1662 | relative rates is set to be a discrete Gamma density. \x{gamma+inv} indicates that the relative rate model |
|---|
| 1663 | is a mixture of Gamma and invariant sites (this is the common $\Gamma+$I model). FreeRate is |
|---|
| 1664 | a model that does not use any parametric function to describe the distribution of the relative |
|---|
| 1665 | rates (see \cite{soubrier12}). Under this option, relative rates and the corresponding frequencies of these classes are |
|---|
| 1666 | directly estimated from the data. While such approach is slightly more computationally demanding |
|---|
| 1667 | than the $\Gamma$ (or $\Gamma$+I) model, it often provides a significantly better fit to the data. |
|---|
| 1668 | \item \x{alpha="value|optimised"}, where \x{value} is a real positive number. Use this option to set |
|---|
| 1669 | the gamma shape parameter to the selected value. \x{optimised}: the parameter is estimated from |
|---|
| 1670 | the data (see also next option). |
|---|
| 1671 | \item \x{optimise.alpha="yes|true|no|false"}. Optimise the shape of the Gamma distribution of |
|---|
| 1672 | relative rates (or not). |
|---|
| 1673 | \item \x{pinv="value|optimised"}, where \x{value} is in $[0,1]$. Use this option to set |
|---|
| 1674 | the proportion of invariants to the selected value. \x{optimised}: the parameter is estimated from |
|---|
| 1675 | the data (see also next option). |
|---|
| 1676 | \item \x{optimise.pinv="yes|true|no|false"}. Optimise the proportion of invariable sites (or not). |
|---|
| 1677 | \item \x{optimise.freerates="yes|true|no|false"}. Optimise the parameters of the FreeRate model, |
|---|
| 1678 | i.e., the relative rates and the corresponding frequencies. |
|---|
| 1679 | \end{itemize} |
|---|
| 1680 | \vspace{0.2cm} |
|---|
| 1681 | \begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true, |
|---|
| 1682 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1683 | |
|---|
| 1684 | <siterates id="SR1"> |
|---|
| 1685 | <instance id="R1" init.value="1.0"/> |
|---|
| 1686 | <instance id="R2" init.value="1.0"/> |
|---|
| 1687 | <instance id="R3" init.value="1.0"/> |
|---|
| 1688 | <instance id="R4" init.value="1.0"/> |
|---|
| 1689 | <weights id="D1" family="gamma" optimise.alpha="yes" \ |
|---|
| 1690 | optimise.pinv="no"> |
|---|
| 1691 | </weights> |
|---|
| 1692 | </siterates> |
|---|
| 1693 | \end{Verbatim} |
|---|
| 1694 | |
|---|
| 1695 | |
|---|
| 1696 | |
|---|
| 1697 | \subsubsection{{\tt partitionelem} and {\tt mixtureelem} components}\index{XML options!{\tt partitionelem} |
|---|
| 1698 | component}\index{XML options!{\tt mixtureelem} component} |
|---|
| 1699 | |
|---|
| 1700 | Options: |
|---|
| 1701 | \begin{itemize} |
|---|
| 1702 | \item \x{file.name="inputfilename"}, where \x{inputfilename} is the name of the input sequence file |
|---|
| 1703 | (in PHYLIP format) to be analysed. |
|---|
| 1704 | \item \x{data.type="nt|aa"}. Specify the type of sequences to be processed (nucleotide of amino-acid sequences). |
|---|
| 1705 | \item \x{interleaved="yes|true|no|false"}. Interleaved (\x{yes|true}) or sequential format |
|---|
| 1706 | (\x{no|false}) for the sequence alignment. |
|---|
| 1707 | \end{itemize} |
|---|
| 1708 | |
|---|
| 1709 | Each \x{partitionelem} element should include exactly four \x{mixtureelem} elements, corresponding to |
|---|
| 1710 | branch lengths, equilibrium frequencies, substitution rate model and tree topology. The ordering of |
|---|
| 1711 | in which the \x{mixtureelem} elements are given does not matter, though exceptions apply for the |
|---|
| 1712 | $\Gamma+I$ model (see below). The $n$-th element in the \x{list} |
|---|
| 1713 | attribute of each \x{mixtureelem} defines the $n$-th class of the mixture model. In the example given |
|---|
| 1714 | below, the first class of the mixture is made of the following elements: \x{T1}, \x{F1}, \x{R1} and |
|---|
| 1715 | \x{L1}, the second class is made of \x{T1}, \x{F1}, \x{R2} and \x{L1}, etc. |
|---|
| 1716 | |
|---|
| 1717 | |
|---|
| 1718 | \vspace{0.2cm} |
|---|
| 1719 | \begin{Verbatim}[frame=single, label=Example of `partitionelem' component, samepage=true, |
|---|
| 1720 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1721 | |
|---|
| 1722 | <partitionelem id="partition1" file.name="./small_p1.nxs" \ |
|---|
| 1723 | data.type="nt" interleaved="yes"> |
|---|
| 1724 | <mixtureelem list="T1, T1, T1, T1"/> |
|---|
| 1725 | <mixtureelem list="F1, F1, F1, F1"/> |
|---|
| 1726 | <mixtureelem list="R1, R2, R3, R4"/> |
|---|
| 1727 | <mixtureelem list="L1, L1, L1, L1"/> |
|---|
| 1728 | </partitionelem> |
|---|
| 1729 | |
|---|
| 1730 | \end{Verbatim} |
|---|
| 1731 | |
|---|
| 1732 | In general, the ordering of the \x{mixtureelem} elements does not matter. However, when the model |
|---|
| 1733 | has invariable sites, then the corresponding class should be first in the list of classes provided |
|---|
| 1734 | by \x{mixtureelem}. For instance, in the example above, if the rates are defined as follows: |
|---|
| 1735 | |
|---|
| 1736 | \vspace{0.2cm} |
|---|
| 1737 | \begin{Verbatim}[frame=single, label=Example of `siterates' component, samepage=true, |
|---|
| 1738 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1739 | |
|---|
| 1740 | <siterates id="SR1"> |
|---|
| 1741 | <instance id="R1" init.value="0.0"/> |
|---|
| 1742 | <instance id="R2" init.value="1.0"/> |
|---|
| 1743 | <instance id="R3" init.value="1.0"/> |
|---|
| 1744 | <instance id="R4" init.value="1.0"/> |
|---|
| 1745 | <weights id="D1" family="gamma+inv" optimise.alpha="yes" \ |
|---|
| 1746 | optimise.pinv="no"> |
|---|
| 1747 | </weights> |
|---|
| 1748 | </siterates> |
|---|
| 1749 | \end{Verbatim} |
|---|
| 1750 | |
|---|
| 1751 | then \x{R1} corresponds to the invariable rate class (as \x{init.value="0.0"}). As \x{R1} is first |
|---|
| 1752 | in the \x{mixtureelem} (see line 6 in the example of \x{`partionelem'} given above), PhyML will |
|---|
| 1753 | print out an explicit error message and bail out. One way to avoid this shortcoming is to define |
|---|
| 1754 | \x{mixtureelem} as \x{R4, R2, R3, R1} instead. |
|---|
| 1755 | |
|---|
| 1756 | |
|---|
| 1757 | \subsection{A simple example: GTR + $\Gamma$4 + I} |
|---|
| 1758 | |
|---|
| 1759 | The example below provides all the required options to fit a $\Gamma$4+I model to a single alignment |
|---|
| 1760 | of nucleotide sequences under the GTR model of substitution using a SPR search for the best tree |
|---|
| 1761 | topology. The \x{phyml} component sets the name for the analysis to \x{simple.example}, meaning that |
|---|
| 1762 | each output file will display this particular string of characters. Also, the tree and statistics |
|---|
| 1763 | file names will begin with \x{p1.output}. The tree topology will be estimated so as to maximise the |
|---|
| 1764 | likelihood and the topology search algorithm used here is SPR, as indicated by the value of the |
|---|
| 1765 | corresponding attribute (i.e., \x{search="spr"}). Only one vector of branch lengths will be used |
|---|
| 1766 | here since only one partition element will be processed. Hence, the \x{<branchlengths>} component |
|---|
| 1767 | only has one \x{<instance>} sub-component. Also, a single GTR model will apply to all the classes |
|---|
| 1768 | for the mixture model -- the \x{<ratematrices>} component has only one \x{<instance>} sub-component, |
|---|
| 1769 | corresponding to this particular substitution model. The next component, \x{<equfreqs>}, indicates |
|---|
| 1770 | that a single vector of equilibrium frequencies will apply here. Next, the \x{<siterates>} component |
|---|
| 1771 | has five \x{<instance>} sub-components. Four of these correspond to the non-zero relative rates of |
|---|
| 1772 | evolution a defined by a discrete Gamma distribution. The last one (\x{<instance id="R5" |
|---|
| 1773 | value="0.0"/>}) defines the class of the mixture corresponding to invariable sites. The |
|---|
| 1774 | \x{<weight>} component indicates that a $\Gamma+$I model will be fitted here. The shape parameter of |
|---|
| 1775 | the Gamma distribution and the proportion of invariants will be estimated from the data. The |
|---|
| 1776 | \x{<partitionelem>} gives information about the sequence alignment (the corresponding file name, the |
|---|
| 1777 | type of data and the alignment format). The \x{<mixtureelem>} components next define the mixture |
|---|
| 1778 | model. Each class of the fitted model corresponds to one column, with the first column made of the |
|---|
| 1779 | following elements: \x{T1, M1, F1, R1} and \x{L1}. The second class of the mixture is made of \x{T1, |
|---|
| 1780 | M1, F1, R2, L1} and so forth. |
|---|
| 1781 | |
|---|
| 1782 | \vspace{0.2cm} |
|---|
| 1783 | \begin{Verbatim}[frame=single, label=Simple PhyML XML example, samepage=true, baselinestretch=0.5, |
|---|
| 1784 | fontsize=\small, numbers=left] |
|---|
| 1785 | |
|---|
| 1786 | <phyml runid="simple.example" output.file="p1.output"> |
|---|
| 1787 | |
|---|
| 1788 | <topology> |
|---|
| 1789 | <instance id="T1" init.tree="bionj" optimise.tree="yes" \ |
|---|
| 1790 | search="spr"/> |
|---|
| 1791 | </topology> |
|---|
| 1792 | |
|---|
| 1793 | <branchlengths id="BL1"> |
|---|
| 1794 | <instance id="L1" optimise.lens="yes"/> |
|---|
| 1795 | </branchlengths> |
|---|
| 1796 | |
|---|
| 1797 | <ratematrices id="RM1"> |
|---|
| 1798 | <instance id="M1" model="GTR"/> |
|---|
| 1799 | </ratematrices> |
|---|
| 1800 | |
|---|
| 1801 | <equfreqs id="EF1"> |
|---|
| 1802 | <instance id="F1"/> |
|---|
| 1803 | </equfreqs> |
|---|
| 1804 | |
|---|
| 1805 | <siterates id="SR1"> |
|---|
| 1806 | <instance id="R1" value="1.0"/> |
|---|
| 1807 | <instance id="R2" value="1.0"/> |
|---|
| 1808 | <instance id="R3" value="1.0"/> |
|---|
| 1809 | <instance id="R4" value="1.0"/> |
|---|
| 1810 | <instance id="R5" value="0.0"/> |
|---|
| 1811 | <weights id="D1" family="gamma+inv" optimise.alpha="yes" \ |
|---|
| 1812 | optimise.pinv="yes"> |
|---|
| 1813 | </weights> |
|---|
| 1814 | </siterates> |
|---|
| 1815 | |
|---|
| 1816 | <partitionelem id="partition_elem1" file.name=\ |
|---|
| 1817 | "./p1.seq" data.type="nt" interleaved="yes"> |
|---|
| 1818 | <mixtureelem list="T1, T1, T1, T1, T1"/> |
|---|
| 1819 | <mixtureelem list="M1, M1, M1, M1, M1"/> |
|---|
| 1820 | <mixtureelem list="F1, F1, F1, F1, F1"/> |
|---|
| 1821 | <mixtureelem list="R1, R2, R3, R4, R5"/> |
|---|
| 1822 | <mixtureelem list="L1, L1, L1, L1, L1"/> |
|---|
| 1823 | </partitionelem> |
|---|
| 1824 | |
|---|
| 1825 | </phyml> |
|---|
| 1826 | |
|---|
| 1827 | \end{Verbatim} |
|---|
| 1828 | |
|---|
| 1829 | |
|---|
| 1830 | \subsection{A second example: LG4X}\index{lg4x} |
|---|
| 1831 | |
|---|
| 1832 | The example below shows how to fit the LG4X model \cite{lg4x} to a given alignment of amino-acid |
|---|
| 1833 | sequences (file \x{M587.nex.Phy}). LG4X is a mixture model with four classes. Each class has its own |
|---|
| 1834 | rate and corresponding frequencies (hence the use of the FreeRate model below, see the |
|---|
| 1835 | \x{<siterates>} component). In the particular example given here, the rate values and frequencies |
|---|
| 1836 | are set by the users. These parameters will then be optimized by PhyML (\x{optimise.freerates="yes"}). |
|---|
| 1837 | Each class also has its own rate matrix and vector of equilibrium frequencies, which need to be provided by |
|---|
| 1838 | the user (Note that these matrices can be downloaded from the following web address: |
|---|
| 1839 | \url{http://www.atgc-montpellier.fr/download/datasets/models/lg4x/LG4X_4M.txt}. They are also |
|---|
| 1840 | provided in the PhyML package \x{example/lg4x/} directory.) |
|---|
| 1841 | |
|---|
| 1842 | \vspace{0.2cm} |
|---|
| 1843 | \begin{Verbatim}[frame=single, label=LG4X, samepage=true, baselinestretch=0.5, |
|---|
| 1844 | fontsize=\small, numbers=left] |
|---|
| 1845 | <phyml run.id="lg4x" output.file="M587.tests" branch.test="no"> |
|---|
| 1846 | |
|---|
| 1847 | <!-- Tree topology: start with BioNJ and then SPRs --> |
|---|
| 1848 | <topology> |
|---|
| 1849 | <instance id="T1" init.tree="user" file.name="user_tree.txt" \ |
|---|
| 1850 | search="spr" optimise.tree="no"/> |
|---|
| 1851 | </topology> |
|---|
| 1852 | |
|---|
| 1853 | |
|---|
| 1854 | <!-- Four rate matrices, read from files --> |
|---|
| 1855 | <ratematrices id="RM1"> |
|---|
| 1856 | <instance id="M1" model="customaa" ratematrix.file="X1.mat"/> |
|---|
| 1857 | <instance id="M2" model="customaa" ratematrix.file="X2.mat"/> |
|---|
| 1858 | <instance id="M3" model="customaa" ratematrix.file="X3.mat"/> |
|---|
| 1859 | <instance id="M4" model="customaa" ratematrix.file="X4.mat"/> |
|---|
| 1860 | </ratematrices> |
|---|
| 1861 | |
|---|
| 1862 | <!-- Freerate model of variation of rates across sites --> |
|---|
| 1863 | <siterates id="SR1"> |
|---|
| 1864 | <instance id="R1" init.value="0.197063"/> |
|---|
| 1865 | <instance id="R2" init.value="0.750275"/> |
|---|
| 1866 | <instance id="R3" init.value="1.951569"/> |
|---|
| 1867 | <instance id="R4" init.value="5.161586"/> |
|---|
| 1868 | <weights id="D1" family="freerates" optimise.freerates="yes"> |
|---|
| 1869 | <instance appliesto="R1" value="0.422481"/> |
|---|
| 1870 | <instance appliesto="R2" value="0.336848"/> |
|---|
| 1871 | <instance appliesto="R3" value="0.180132"/> |
|---|
| 1872 | <instance appliesto="R4" value="0.060539"/> |
|---|
| 1873 | </weights> |
|---|
| 1874 | </siterates> |
|---|
| 1875 | |
|---|
| 1876 | <!-- Amino-acid equilibrium freqs. are given by the models --> |
|---|
| 1877 | <equfreqs id="EF1"> |
|---|
| 1878 | <instance id="F1" aa.freqs="model"/> |
|---|
| 1879 | <instance id="F2" aa.freqs="model"/> |
|---|
| 1880 | <instance id="F3" aa.freqs="model"/> |
|---|
| 1881 | <instance id="F4" aa.freqs="model"/> |
|---|
| 1882 | </equfreqs> |
|---|
| 1883 | |
|---|
| 1884 | |
|---|
| 1885 | <!-- One vector of branch lengths --> |
|---|
| 1886 | <branchlengths id="BL1" > |
|---|
| 1887 | <instance id="L1" optimise.lens="yes"/> |
|---|
| 1888 | </branchlengths> |
|---|
| 1889 | |
|---|
| 1890 | |
|---|
| 1891 | <!-- Mixture model assemblage --> |
|---|
| 1892 | <partitionelem id="partition1" file.name="M587.nex.Phy" \ |
|---|
| 1893 | data.type="aa" interleaved="yes"> |
|---|
| 1894 | <mixtureelem list="T1, T1, T1, T1"/> |
|---|
| 1895 | <mixtureelem list="M1, M2, M3, M4"/> |
|---|
| 1896 | <mixtureelem list="F1, F2, F3, F4"/> |
|---|
| 1897 | <mixtureelem list="R1, R2, R3, R4"/> |
|---|
| 1898 | <mixtureelem list="L1, L1, L1, L1"/> |
|---|
| 1899 | </partitionelem> |
|---|
| 1900 | |
|---|
| 1901 | </phyml> |
|---|
| 1902 | \end{Verbatim} |
|---|
| 1903 | |
|---|
| 1904 | In order to fit the LG4X model to the \x{proteic} sequence file provided in the \x{examples/} |
|---|
| 1905 | directory, simply type \x{./phyml --xml=../examples/lg4x/lg4x.xml} (assuming the PhyML binary is installed |
|---|
| 1906 | in the \x{src/} directory). You can of course slightly tweak the file \x{../examples/lg4x/lg4x.xml} |
|---|
| 1907 | and use it as a template to fit this model to another data set. |
|---|
| 1908 | |
|---|
| 1909 | |
|---|
| 1910 | \subsection{An example with multiple partition elements} |
|---|
| 1911 | |
|---|
| 1912 | The example below gives the complete XML file to specify the analysis of three partition elements, |
|---|
| 1913 | corresponding to the nucleotide sequence files \x{small\_p1\_pos1.seq}, \x{small\_p1\_pos2.seq} and |
|---|
| 1914 | \x{small\_p1\_pos3.seq} in interleaved PHYLIP format. \x{small\_p1\_pos1.seq} is fitted with the |
|---|
| 1915 | HKY85 model of substitution (with the transition/transversion ratio being estimated from the data), |
|---|
| 1916 | combined to a $\Gamma4$ model of rate variation across sites (with the gamma shape parameter being |
|---|
| 1917 | estimated from the data). \x{small\_p1\_pos2.seq} is fitted to a custom substitution model with the |
|---|
| 1918 | constraint $A\leftrightarrow G$=$C\leftrightarrow T$. The nucleotide frequencies are set to |
|---|
| 1919 | $\frac{1}{4}$ here. The model does not allow substitution rates to vary across |
|---|
| 1920 | sites. \x{small\_p1\_pos3.seq} is fitted using a GTR model conbined to a $\Gamma4+$I model of rate |
|---|
| 1921 | variation across sites. Note that the equilibrium nucleotide frequencies for the fourth and fifth |
|---|
| 1922 | class of the mixture are set to be equal to that estimated from the first partition element (i.e., |
|---|
| 1923 | \x{F1}) . The initial phylogeny is built using BioNJ and the tree topology is to be estimated using |
|---|
| 1924 | a NNI search algorithm. |
|---|
| 1925 | |
|---|
| 1926 | \vspace{0.2cm} |
|---|
| 1927 | \begin{Verbatim}[frame=single, label=Example of PhyML XML file, samepage=true, baselinestretch=0.5, |
|---|
| 1928 | fontsize=\small, numbers=left] |
|---|
| 1929 | |
|---|
| 1930 | <phyml runid="nnisearch" output.file="small_p1_output"> |
|---|
| 1931 | |
|---|
| 1932 | <topology> |
|---|
| 1933 | <instance id="T1" init.tree="bionj" optimise.tree="yes" \ |
|---|
| 1934 | search="nni"/> |
|---|
| 1935 | </topology> |
|---|
| 1936 | |
|---|
| 1937 | <branchlengths id="BL1"> |
|---|
| 1938 | <instance id="L1" optimise.lens="yes"/> |
|---|
| 1939 | <instance id="L2"/> |
|---|
| 1940 | <instance id="L3"/> |
|---|
| 1941 | </branchlengths> |
|---|
| 1942 | |
|---|
| 1943 | <ratematrices id="RM1"> |
|---|
| 1944 | <instance id="M1" model="HKY85" optimise.tstv="yes"/> |
|---|
| 1945 | <instance id="M2" model="custom" model.code="102304" \ |
|---|
| 1946 | optimise.rr="yes"/> |
|---|
| 1947 | <instance id="M3" model="GTR"/> |
|---|
| 1948 | </ratematrices> |
|---|
| 1949 | |
|---|
| 1950 | <equfreqs id="EF1"> |
|---|
| 1951 | <instance id="F1"/> |
|---|
| 1952 | <instance id="F2" base.freqs="0.25,0.25,0.25,0.25"/> |
|---|
| 1953 | <instance id="F3"/> |
|---|
| 1954 | </equfreqs> |
|---|
| 1955 | |
|---|
| 1956 | <siterates id="SR1"> |
|---|
| 1957 | <instance id="R1" value="1.0"/> |
|---|
| 1958 | <instance id="R2" value="1.0"/> |
|---|
| 1959 | <instance id="R3" value="1.0"/> |
|---|
| 1960 | <instance id="R4" value="1.0"/> |
|---|
| 1961 | <weights id="D1" family="gamma" optimise.alpha="yes" \ |
|---|
| 1962 | optimise.pinv="no"> |
|---|
| 1963 | </weights> |
|---|
| 1964 | </siterates> |
|---|
| 1965 | |
|---|
| 1966 | <siterates id="SR2"> |
|---|
| 1967 | <instance id="R8" value="1.0"/> |
|---|
| 1968 | <weights id="D2" family="gamma" optimise.alpha="yes" \ |
|---|
| 1969 | optimise.pinv="yes"> |
|---|
| 1970 | </weights> |
|---|
| 1971 | </siterates> |
|---|
| 1972 | |
|---|
| 1973 | <siterates id="SR3"> |
|---|
| 1974 | <instance id="R10" value="1.0"/> |
|---|
| 1975 | <instance id="R11" value="1.0"/> |
|---|
| 1976 | <instance id="R12" value="1.0"/> |
|---|
| 1977 | <instance id="R13" value="1.0"/> |
|---|
| 1978 | <instance id="R14" value="1.0"/> |
|---|
| 1979 | <weights id="D3" family="gamma" optimise.alpha="yes" \ |
|---|
| 1980 | optimise.pinv="yes"> |
|---|
| 1981 | </weights> |
|---|
| 1982 | </siterates> |
|---|
| 1983 | |
|---|
| 1984 | \end{Verbatim} |
|---|
| 1985 | |
|---|
| 1986 | \vspace{0.2cm} |
|---|
| 1987 | \begin{Verbatim}[frame=single, label=Example of PhyML XML file (ctnd), samepage=true, |
|---|
| 1988 | baselinestretch=0.5, fontsize=\small, numbers=left] |
|---|
| 1989 | |
|---|
| 1990 | <partitionelem id="partition_elem1" file.name=\ |
|---|
| 1991 | "./small_p1_pos1.seq" data.type="nt" interleaved="yes"> |
|---|
| 1992 | <mixtureelem list="T1, T1, T1, T1"/> |
|---|
| 1993 | <mixtureelem list="M1, M1, M1, M1"/> |
|---|
| 1994 | <mixtureelem list="F1, F1, F1, F1"/> |
|---|
| 1995 | <mixtureelem list="R1, R2, R3, R4"/> |
|---|
| 1996 | <mixtureelem list="L1, L1, L1, L1"/> |
|---|
| 1997 | </partitionelem> |
|---|
| 1998 | |
|---|
| 1999 | <partitionelem id="partition_elem2" file.name=\ |
|---|
| 2000 | "./small_p1_pos2.seq" data.type="nt" interleaved="yes"> |
|---|
| 2001 | <mixtureelem list="T1"/> |
|---|
| 2002 | <mixtureelem list="M2"/> |
|---|
| 2003 | <mixtureelem list="R8"/> |
|---|
| 2004 | <mixtureelem list="F2"/> |
|---|
| 2005 | <mixtureelem list="L2"/> |
|---|
| 2006 | </partitionelem> |
|---|
| 2007 | |
|---|
| 2008 | <partitionelem id="partition_elem3" file.name=\ |
|---|
| 2009 | "./small_p1_pos3.seq" data.type="nt" interleaved="yes"> |
|---|
| 2010 | <mixtureelem list="T1, T1, T1, T1, T1"/> |
|---|
| 2011 | <mixtureelem list="M3, M3, M3, M3, M3"/> |
|---|
| 2012 | <mixtureelem list="R10, R11, R12, R13, R14"/> |
|---|
| 2013 | <mixtureelem list="F3, F3, F3, F1, F1"/> |
|---|
| 2014 | <mixtureelem list="L3, L3, L3, L3, L3"/> |
|---|
| 2015 | </partitionelem> |
|---|
| 2016 | |
|---|
| 2017 | </phyml> |
|---|
| 2018 | |
|---|
| 2019 | \end{Verbatim} |
|---|
| 2020 | |
|---|
| 2021 | |
|---|
| 2022 | \subsection{Branch lengths with invariants and partionned data} |
|---|
| 2023 | |
|---|
| 2024 | Accommodating for models with invariable sites applying to some elements of a partitioned data, with |
|---|
| 2025 | these elements sharing the same set of edge lengths can lead to inconsistencies. Consider for |
|---|
| 2026 | instance a partitioned data set with two elements. Assume that these two elements share the same |
|---|
| 2027 | set of edge lengths. Also, consider that GTR+I applies to the first element and HKY applies to the |
|---|
| 2028 | second. Now, the expected number of substitutions per site for the first element of the partition is |
|---|
| 2029 | equal to $(1-p)l$, where $p$ is the estimated proportion of invariants and $l$ is the |
|---|
| 2030 | maximum-likelihood estimate for the length of that specific edge. For the second element of the |
|---|
| 2031 | partition, the expected number of substitutions per site is equal to $l$, rather than $(1-p)l$. |
|---|
| 2032 | While $l$ are common to the two elements, matching the specification of the input model, the actual |
|---|
| 2033 | edge lengths do differ across the two partition elements. Please be aware that, due to the |
|---|
| 2034 | programming structure implemented in PhyML, the program will only return one value here, which will |
|---|
| 2035 | be equal to $(1-p)l$. |
|---|
| 2036 | |
|---|
| 2037 | \section{Citing PhyML} |
|---|
| 2038 | The ``default citation'' for PhyML is: |
|---|
| 2039 | \begin{itemize} |
|---|
| 2040 | \item |
|---|
| 2041 | ``New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance |
|---|
| 2042 | of PhyML 3.0''. Guindon S., Dufayard J.F., Lefort V., Anisimova M., Hordijk W., Gascuel O. 2010, {\it Systematic |
|---|
| 2043 | Biology}, 59(3):307-321 |
|---|
| 2044 | |
|---|
| 2045 | \end{itemize} |
|---|
| 2046 | The ``historic citation'' for PhyML is: |
|---|
| 2047 | \begin{itemize} |
|---|
| 2048 | \item ``A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood'' |
|---|
| 2049 | Guindon S., Gascuel O. 2003, {\it Systematic Biology}, 52(5):696-704 |
|---|
| 2050 | \end{itemize} |
|---|
| 2051 | |
|---|
| 2052 | |
|---|
| 2053 | |
|---|
| 2054 | \section{Other programs in the PhyML package} |
|---|
| 2055 | |
|---|
| 2056 | PhyML is software package that provides tools to tackle problems other than estimating maximum |
|---|
| 2057 | likelihood phylogenies. Installing these tools and processing data sets is explained is the |
|---|
| 2058 | following sections. |
|---|
| 2059 | |
|---|
| 2060 | \subsection{PhyTime}\index{PhyTime} PhyTime is a program that estimates node ages and substitution |
|---|
| 2061 | rates using a Bayesian approach. The performance and main features of this software are described |
|---|
| 2062 | in two article (see Section \ref{sec:citephytime}). |
|---|
| 2063 | |
|---|
| 2064 | It relies on a Gibbs sampler which outperforms the |
|---|
| 2065 | ``standard'' Metropolis-Hastings algorithm implemented in a number of phylogenetic softwares. The |
|---|
| 2066 | details and performance of this approach are described in the following article: |
|---|
| 2067 | |
|---|
| 2068 | \subsubsection{Installing PhyTime} |
|---|
| 2069 | |
|---|
| 2070 | Compiling PhyTime is straightforward on Unix-like machines (i.e., linux and MacOS systems). PhyTime |
|---|
| 2071 | is not readily available for Windows machines but compilation should be easy on this system too. In |
|---|
| 2072 | the `phyml' directory, where the `src/' and `doc/' directories stand, enter the following commands: |
|---|
| 2073 | {\setlength{\baselineskip}{0.5\baselineskip} |
|---|
| 2074 | \begin{verbatim} |
|---|
| 2075 | ./configure --enable-phytime; |
|---|
| 2076 | make clean; |
|---|
| 2077 | make; |
|---|
| 2078 | \end{verbatim} } This set of commands generates a binary file called \x{phytime} which can be found |
|---|
| 2079 | in the `src/' directory. |
|---|
| 2080 | |
|---|
| 2081 | \subsubsection{Running PhyTime} Passing options and running PhyTime on your data set is quite |
|---|
| 2082 | similar to running PhyML in commmand-line mode. The main differences between the two programs are |
|---|
| 2083 | explained below: |
|---|
| 2084 | \begin{itemize} |
|---|
| 2085 | \item PhyTime takes as mandatory input a {\em rooted} phylogenetic tree. Hence, the `\x{-u}' option |
|---|
| 2086 | must be used. Also, unlike PhyML, PhyTime does not modify the tree topology. Hence, the options that |
|---|
| 2087 | go with the `\x{-s}' command do not alter the input tree topology. |
|---|
| 2088 | \item PhyTime needs an input file giving information about calibration nodes. The command |
|---|
| 2089 | `\x{--calibration=}' followed by the name of the file containing the calibration node information is |
|---|
| 2090 | mandatory. The content of that file should look as follows: |
|---|
| 2091 | |
|---|
| 2092 | \begin{figure}[h] |
|---|
| 2093 | \begin{small} |
|---|
| 2094 | \begin{Verbatim}[frame=single, label=Calibration node file, samepage=true, baselinestretch=0.5, fontsize=\tiny] |
|---|
| 2095 | Dugong_dugon Procavia_capensis Elephantidae | -65 -54 |
|---|
| 2096 | Equus_sp. Ceratomorpha | -58 -54 |
|---|
| 2097 | Cercopithecus_solatus Macaca_mulatta Hylobates_lar Homo_sapiens | -35 -25 |
|---|
| 2098 | Lepus_crawshayi Oryctolagus_cuniculus Ochotona_princeps | -90 -37 |
|---|
| 2099 | Marmota_monax Aplodontia_rufa | -120 -37 |
|---|
| 2100 | Dryomys_nitedula Glis_glis | -120 -28.5 |
|---|
| 2101 | @root@ | -100 -120 |
|---|
| 2102 | \end{Verbatim} |
|---|
| 2103 | \end{small} |
|---|
| 2104 | \end{figure} |
|---|
| 2105 | |
|---|
| 2106 | Every row in this file lists a set of taxa that belong to the same subtree (i.e., a clade). This |
|---|
| 2107 | list of taxa is followed by the character `\x{|}' and two real numbers corresponding to the lower |
|---|
| 2108 | and upper bounds of the calibration interval for the node at the root of the clade. In the example |
|---|
| 2109 | given here, the clade grouping the three taxa ``Dugong\_dugon'', ``Procavia\_capensis'' and |
|---|
| 2110 | ``Elephantida'' has -65 as lower bound and -54 as upper bound. Node ages (or node heights) are |
|---|
| 2111 | relative to the most recent tip node in the phylogeny, which age is set to 0. It is also possible to |
|---|
| 2112 | define a clade using only two taxon names. PhyTime will then search for the most recent common |
|---|
| 2113 | ancestor of these two taxa in the user-defined phylogeny and assign time boundaries to the |
|---|
| 2114 | corresponding node. For serially-sampled data, the calibration nodes correspond to tips in the tree. |
|---|
| 2115 | A calibration file will then look as follows: |
|---|
| 2116 | \begin{figure}[h] |
|---|
| 2117 | \begin{small} |
|---|
| 2118 | \begin{Verbatim}[frame=single, label=Calibration node file (serially-sampled sequences), samepage=true, baselinestretch=0.5] |
|---|
| 2119 | taxaA | -65 -65 |
|---|
| 2120 | taxaB | -65 -65 |
|---|
| 2121 | taxaC | -20 -20 |
|---|
| 2122 | taxaD | -30 -30 |
|---|
| 2123 | taxaE | -30 -30 |
|---|
| 2124 | taxaF | -60 -60 |
|---|
| 2125 | taxaG | -61 -51 |
|---|
| 2126 | @root@ | -100 -120 |
|---|
| 2127 | \end{Verbatim} |
|---|
| 2128 | \end{small} |
|---|
| 2129 | \end{figure} |
|---|
| 2130 | |
|---|
| 2131 | |
|---|
| 2132 | % The maximum value the average substitution rate along a branch of a phylogeny can take is set to |
|---|
| 2133 | % $10^{-2}$ substitution per site per time unit. It is therefore important that the time unit used to |
|---|
| 2134 | % specify the calibration intervals are realistic with respect to this upper bound. For instance, it |
|---|
| 2135 | % one time unit corresponds to one month, one assumes that the maximum (average) substitution rate is |
|---|
| 2136 | % $10^{-3}$/site/month, which is very high (substitution rate in introns is close to 3/site/$10^9$ |
|---|
| 2137 | % years). Consider that in the example give above, one time unit is $10^6$ years. Hence, the maximum |
|---|
| 2138 | % value the substitution rate can take is $10^{-2}$/site/$10^6$ years, i.e., $10$/site/$10^9$ years. |
|---|
| 2139 | % Multiplying the calibration values by 10 amounts to considering that the time unit is $10^5$ years, |
|---|
| 2140 | % implying a maximum substitution rate equal to $10^{-2}$/site/$10^5$ years, i.e., $100$/site/$10^9$ |
|---|
| 2141 | % years. |
|---|
| 2142 | |
|---|
| 2143 | Note that the node corresponding to the root of the whole tree has a specific label: |
|---|
| 2144 | `\x{@root@}'. {\color{red}{It is important to specify upper and lower bounds for the root node in |
|---|
| 2145 | order to ensure convergence of the Gibbs sampler. If the prior interval for the root height is not |
|---|
| 2146 | specified, the upper bound will be set to the upper bound of the oldest calibration node and the |
|---|
| 2147 | lower bound will be set to twice this age.}} As a consequence, leaving the prior on root height |
|---|
| 2148 | interval unspecified may produce inaccurate estimates of node ages, especially if there are only few |
|---|
| 2149 | otherwise calibration nodes available. |
|---|
| 2150 | |
|---|
| 2151 | A notable exception to this rule comes from the analysis of serial sample \index{serial sample} |
|---|
| 2152 | data, i.e., alignments in which sequences were not sampled at the same time point. For such data, |
|---|
| 2153 | the estimated number of substitutions accumulated between successive time points is used to estimate |
|---|
| 2154 | the substitution rate averaged over lineages. Because the time of collection of the sequences is |
|---|
| 2155 | generally known without ambiguity, this extra piece of data is translated into very informative |
|---|
| 2156 | calibration intervals for the tip nodes (i.e., calibration interval of zero width), which in turn |
|---|
| 2157 | results in substitution rate estimates with descreased variances. Posterior distribution of |
|---|
| 2158 | substitution rates with small variances then allows one to get good estimates of the root age. |
|---|
| 2159 | |
|---|
| 2160 | \end{itemize} |
|---|
| 2161 | |
|---|
| 2162 | A typical PhyTime command-line should look like the following\index{command-line options!\x{--calibration}}: |
|---|
| 2163 | |
|---|
| 2164 | \begin{Verbatim}[fontsize=\small] |
|---|
| 2165 | ./phytime -i seqname -u treename --calibration=calibration_file -m GTR -c 8 |
|---|
| 2166 | \end{Verbatim} |
|---|
| 2167 | |
|---|
| 2168 | Assuming the file `\x{seqname}' contains DNA sequences in PHYLIP or NEXUS format, `\x{treename}' is |
|---|
| 2169 | the rooted input tree in Newick format and `\x{calibration\_file}' is a set of calibration nodes, |
|---|
| 2170 | PhyTime will estimate the posterior distribution of node times and substitution rates under the |
|---|
| 2171 | assumption that the substitution process follows a GTR model with 8 classes of rates in the Gamma |
|---|
| 2172 | distribution of rates across sites. The model parameter values are estimated by a Gibbs sampling |
|---|
| 2173 | technique. This algorithm tries diferent values of the model parameters and record the most probable |
|---|
| 2174 | ones. By default, $10^6$ values for each parameter are collected. These values are recorded every |
|---|
| 2175 | $10^3$ sample. These settings can be modified using the appropriate command-line options (see |
|---|
| 2176 | below). |
|---|
| 2177 | |
|---|
| 2178 | |
|---|
| 2179 | \subsubsection{Upper bounds of model parameters} |
|---|
| 2180 | |
|---|
| 2181 | The maximum expected number of substitutions per along a given branch is set to 1.0. Since |
|---|
| 2182 | calibration times provide prior information about the time scale considered, it is possible to use |
|---|
| 2183 | that information to define an upper bound for the substitution rate. This upper bound is equal to |
|---|
| 2184 | the ratio of the maximum value for a branch length (1.0) by the amount of time elapsed since the |
|---|
| 2185 | oldest calibration point (i.e., the minimum of the lower bounds taken over the whole set of |
|---|
| 2186 | calibration points)\footnote{The actual formula involves an extra parameter which does not need to |
|---|
| 2187 | be introduced here}. It is important to keep in mind that the upper bound of the average |
|---|
| 2188 | substitution rate depends on the time unit used in the calibration priors. The value of the upper |
|---|
| 2189 | bound is printed on screen at the start of the execution. |
|---|
| 2190 | |
|---|
| 2191 | PhyTime implements two models that authorize rates to be autocorrelated. The strength of |
|---|
| 2192 | autocorrelation is governed by a parameter which value is estimated from the data. However, it is |
|---|
| 2193 | necessary to set an appropriate upper bound for this parameter prior running the analysis. The |
|---|
| 2194 | maximum value is set such that the correlation between the rate at the beginning and at the end of a |
|---|
| 2195 | branch of length 1.0 calendar time unit is not different from 0. Here again the upper bound for the |
|---|
| 2196 | model parameter depends on the time unit. It is important to choose this unit so that a branch of |
|---|
| 2197 | length 1.0 calendar unit can be considered as short. For this reason, {\color{red}{we recommend to select a time |
|---|
| 2198 | unit so that the calibration times take values between -10 and -1000}}\index{time scale}. |
|---|
| 2199 | |
|---|
| 2200 | \subsubsection{PhyTime specific options} |
|---|
| 2201 | |
|---|
| 2202 | Beside the \x{--calibration} option, there are other command line options that are specific to PhyTime: |
|---|
| 2203 | \begin{itemize} |
|---|
| 2204 | \item \x{--chain\_len=num}\index{command-line options!\x{--chain\_len}} \\ |
|---|
| 2205 | \x{num} is the number of iterations required to estimate the joint posterior density of all the model |
|---|
| 2206 | parameters, i.e., the length of the MCMC chain. Its default is set to 1E+6. |
|---|
| 2207 | % \item \x{--burnin=num} \\ |
|---|
| 2208 | % \x{num} is the number of iterations for the ``burnin'' period, i.e., the number of iterations |
|---|
| 2209 | % required before actually starting to get valid samples from the joint posterior density of all the |
|---|
| 2210 | % model parameters. Its default is set to 1E+5, i.e., 0.1 times the default value of the \x{chain\_len} |
|---|
| 2211 | % parameter (see above). |
|---|
| 2212 | \item \x{--sample\_freq=num}\index{command-line options!\x{--sample\_freq}} \\ \x{num} is the number of generations between successive collection |
|---|
| 2213 | of the model parameter values throughout the MCMC algorithm. For instance, the |
|---|
| 2214 | \x{--sample\_freq=1E+2} option will make PhyTime sample the model parameter every 100th iteration of |
|---|
| 2215 | the MCMC algorithm. Its default is set to 1E+3. |
|---|
| 2216 | \item \x{--fastlk=yes (no)}\index{command-line options!\x{--fastlk}} [Default: no]\\ |
|---|
| 2217 | The option is used to turn on (off) the approximation of the likelihood function using a |
|---|
| 2218 | multivariate normal density. By default, the exact likelihood is used. Using the normal |
|---|
| 2219 | approximation considerably speeds up the calculation. However, it is necessary to ensure that this |
|---|
| 2220 | approximation is appropriate by looking at the correlation between the exact and approximated |
|---|
| 2221 | likelihood values that are sampled. Please read Section \ref{sec:recomphytime} for a description of |
|---|
| 2222 | the appropriate steps to take. |
|---|
| 2223 | \item \x{--no\_sequences}\index{command-line options!\x{--no\_sequences}}\\ |
|---|
| 2224 | Use this option to run the sampler without sequence data. This option can be useful when one wants |
|---|
| 2225 | to compare the marginal posterior density of model parameters to those derived when ignoring the |
|---|
| 2226 | information conveyed by the sequences. Such comparison should be conducted on a systematic basis so as to determine whether the |
|---|
| 2227 | parameters estimates are mostly determined by the prior of driven by the sequence data. |
|---|
| 2228 | \item \x{--rate\_model=gbs/gbd/gamma/clock}\index{command-line options!\x{--rate\_model}}\\ |
|---|
| 2229 | This option is to select the model of evolution of the rate of evolution. \x{gbs} (default) stands |
|---|
| 2230 | for Geometric Brownian + Stochastic. This model considers that rates evolve along the tree |
|---|
| 2231 | according to a geometric Brownian process \cite{kishino01} and the average rate of substitution |
|---|
| 2232 | along a branch is a gamma distributed random variable. This model is described in \cite{guindon12}. |
|---|
| 2233 | The \x{gbd} model (Geometric Browninan with Deterministic calculation of edge lengths) assumes the |
|---|
| 2234 | same Geometric Brownian model of rates. However, as opposed to \x{gbs}, this model uses a |
|---|
| 2235 | deterministic approximation to calculate the average rates of evolution along edges. This model |
|---|
| 2236 | corresponds to the one described in \cite{kishino01} and implemented in the program Multidivtime\index{Multidivtime}. |
|---|
| 2237 | \x{gamma} is a less sophisticated model that assumes that average rates along edges are distributed a |
|---|
| 2238 | priori according to a gamma distribution. It is analogous to the uncorrelated clock model |
|---|
| 2239 | implemented in BEAST\index{BEAST} with a gamma distribution replacing the exponential one. The \x{clock} option |
|---|
| 2240 | corresponds to the strict clock model where all the lineages in the tree evolve at the same pace. |
|---|
| 2241 | \end{itemize} |
|---|
| 2242 | |
|---|
| 2243 | \subsubsection{PhyTime output} |
|---|
| 2244 | |
|---|
| 2245 | The program PhyTime generates two output files. The file called |
|---|
| 2246 | `\x{your\_seqfile\_phytime.XXXX.stats}', where XXXX is a randomly generated integer, lists the node |
|---|
| 2247 | times and branch relative rates sampled during the estimation process. It also gives the sampled |
|---|
| 2248 | values for other parameters, such as the autocorrelation of rates (parameter `Nu'), and the rate of |
|---|
| 2249 | evolution (parameter `EvolRate') amongst others. This output file can be analysed with the program |
|---|
| 2250 | Tracer\index{Tracer} from the BEAST\index{BEAST} package |
|---|
| 2251 | (\url{http://beast.bio.ed.ac.uk/Main_Page}). The second file is called |
|---|
| 2252 | `\x{your\_seqfile\_phytime.XXXX.trees}'. It is the list of trees that were collected during the |
|---|
| 2253 | estimation process, i.e., phylogenies sampled from the posterior density of trees. This file can be |
|---|
| 2254 | processed using the software TreeAnnotator, also part of the BEAST package (see |
|---|
| 2255 | \url{http://beast.bio.ed.ac.uk/Main_Page}) in order to generate confidence sets for the node time |
|---|
| 2256 | estimates. |
|---|
| 2257 | |
|---|
| 2258 | Important information is also displayed on the standard output of PhyTime (the standard output |
|---|
| 2259 | generally corresponds to the terminal window from which PhyTime was launched). The first column of |
|---|
| 2260 | this output gives the current generation, or run, of the chain. It starts at 1 and goes up to 1E+6 |
|---|
| 2261 | by default (use \x{--chain\_len} to change this value, see above). The second column gives the time |
|---|
| 2262 | elapsed in seconds since the sampling began. The third column gives the log likelihood of the |
|---|
| 2263 | phylogenetic model (i.e., `Felsenstein's likelihood'). The fourth column gives the logarithm of the |
|---|
| 2264 | joint prior probability of substitution rates along the tree and node heights. The fifth column |
|---|
| 2265 | gives the current sampled value of the EvolRate parameter along with the corresponding Effective |
|---|
| 2266 | Sample Size (ESS) (see Section \ref{sec:ess}) for this parameter. The sixth column gives the tree |
|---|
| 2267 | height and the corresponding ESS. The seventh column gives the value of the autocorrelation |
|---|
| 2268 | parameter followed by the corresponding ESS. The eightth column gives the values of the birth rate |
|---|
| 2269 | parameter that governs the birth-rate model of species divergence dates. The last column of the |
|---|
| 2270 | standard output gives the minimum of the ESS values taken over the whole set of node height |
|---|
| 2271 | estimates. It provides useful information when one has to decide whether or not the sample size is |
|---|
| 2272 | large enough to draw valid conclusion, i.e., decide whether the chain was run for long enough (see |
|---|
| 2273 | Section \ref{sec:recomphytime} for more detail about adequate chain length). |
|---|
| 2274 | |
|---|
| 2275 | \subsubsection{ClockRate vs. EvolRate} |
|---|
| 2276 | |
|---|
| 2277 | The average rate of evolution along a branch is broken into two components. One is called ClockRate |
|---|
| 2278 | and is the same throughout the tree. The other is called EvolRate and corresponds to a weighted |
|---|
| 2279 | average of branch-specific rates. The model of rate evolution implemented in PhyTime forces the |
|---|
| 2280 | branch-specific rate values to be greater than one. As a consequence, ClockRate is usually smaller |
|---|
| 2281 | EvolRate. |
|---|
| 2282 | |
|---|
| 2283 | In more mathematical terms, let $\mu$ be the value of ClockRate, $r_i$ be the value of the relative |
|---|
| 2284 | rate along branch $i$ and $\Delta_i$ the time elapsed along branch $i$. The value of EvolRate is |
|---|
| 2285 | then given by: |
|---|
| 2286 | \begin{eqnarray*} |
|---|
| 2287 | \mathrm{EvolRate} = \mu \frac{\sum_{i}^{2n-3} r_i \Delta_i}{\sum_{i}^{2n-3} \Delta_i}. |
|---|
| 2288 | \end{eqnarray*} It is clear from this equation that multiplying each $r_i$ by a constant and |
|---|
| 2289 | dividing $\mu$ by the same constant does not change the value of EvolRate. The $r_i$s and $\mu$ |
|---|
| 2290 | are then confounded, or non-identifiable, and only the value of EvolRate can be estimated from |
|---|
| 2291 | the data. {\color{red}{Please make sure that you use the value of EvolRate rather than that of |
|---|
| 2292 | ClockRate when referring to the estimate of the substitution rate}}. |
|---|
| 2293 | |
|---|
| 2294 | \subsubsection{Effective sample size}\label{sec:ess} |
|---|
| 2295 | |
|---|
| 2296 | The MCMC technique generates samples from a target distribution (in our case, the joint posterior |
|---|
| 2297 | density of parameters). Due to the Markovian nature of the method, these samples are not |
|---|
| 2298 | independent. The ESS is the estimated number of independent measurements obtained from a set of |
|---|
| 2299 | (usually dependent) measurements. It is calculated using the following formula: |
|---|
| 2300 | \begin{eqnarray*} |
|---|
| 2301 | \mathrm{ESS} = N\left(\frac{1-r}{1+r}\right), |
|---|
| 2302 | \end{eqnarray*} |
|---|
| 2303 | where $N$ is the length of the chain (i.e., the `raw' or `correlated' sample size) and $r$ is the |
|---|
| 2304 | autocorrelation value, which is obtained using the following formula: |
|---|
| 2305 | \begin{eqnarray*} |
|---|
| 2306 | r = \frac{1}{(N-k)\sigma_x^2} \sum_{i=1}^{N-k} (X_i - \mu_x)(X_{i+k}-\mu_x), |
|---|
| 2307 | \end{eqnarray*} where $\mu_x$ and $\sigma_x$ are the mean and standard deviation of the $X_i$ values |
|---|
| 2308 | respectively and $k$ is the lag. The value of $r$ that is used in PhyTime corresponds to the case where $k=1$, |
|---|
| 2309 | which therefore gives a first order approximation of the `average' autocorrelation value (i.e., the |
|---|
| 2310 | autocorrelation averaged over the set of possible values of the lag). |
|---|
| 2311 | |
|---|
| 2312 | |
|---|
| 2313 | \subsubsection{Prior distributions of model parameters}\label{sec:prior} |
|---|
| 2314 | |
|---|
| 2315 | Any Bayesian analysis requires specifying a prior distribution of model parameters. The outcome of |
|---|
| 2316 | the data analysis, i.e., the posterior distribution, is influenced by the priors. It is especially |
|---|
| 2317 | true if the signal conveyed by the data is weak. While some have argued that the specification of |
|---|
| 2318 | priors relies more on arbitrary decisions than sound scientific reasoning, choosing relevant prior |
|---|
| 2319 | distributions is in fact fully integrated in the process of building model that generates the |
|---|
| 2320 | observed data. In particular, the problem of estimating divergence times naturally lends itself to |
|---|
| 2321 | hierarchical Bayesian modelling. Based on the hypothesis that rates of evolution are conserved (to |
|---|
| 2322 | some extant) throughout the course of evolution, the hierarchical Bayesian approach provides an |
|---|
| 2323 | adequate framework for inferring substitution rates and divergence dates separately. Hence, in this |
|---|
| 2324 | situation, it makes good sense to use what is known about a relatively well-defined feature of the |
|---|
| 2325 | evolution of genetic sequences (the ``molecular clock'' hypothesis combined to stochastic variations of |
|---|
| 2326 | rates across lineages) to build a prior distribution on rates along edges. |
|---|
| 2327 | |
|---|
| 2328 | |
|---|
| 2329 | |
|---|
| 2330 | |
|---|
| 2331 | \subsubsection{Citing PhyTime}\label{sec:citephytime} |
|---|
| 2332 | |
|---|
| 2333 | The ``default citation'' is: |
|---|
| 2334 | |
|---|
| 2335 | \begin{itemize} |
|---|
| 2336 | \item Guindon S. ``From trajectories to averages: an improved description of the heterogeneity of |
|---|
| 2337 | substitution rates along lineages'', {\it Systematic Biology}, 2012. doi: 10.1093/sysbio/sys063. |
|---|
| 2338 | \end{itemize} |
|---|
| 2339 | |
|---|
| 2340 | An earlier article also described some of the methods implemented in PhyTime: |
|---|
| 2341 | |
|---|
| 2342 | \begin{itemize} |
|---|
| 2343 | \item Guindon S. ``Bayesian estimation of divergence times from large data sets'', {\it Molecular |
|---|
| 2344 | Biology and Evolution}, 2010, |
|---|
| 2345 | 27(8):1768:81. |
|---|
| 2346 | \end{itemize} |
|---|
| 2347 | |
|---|
| 2348 | |
|---|
| 2349 | \section{Recommendations on program usage}\label{sec:progusage} |
|---|
| 2350 | |
|---|
| 2351 | \subsection{PhyML} |
|---|
| 2352 | |
|---|
| 2353 | The choice of the tree searching algorithm among those provided by PhyML is generally a tough one. |
|---|
| 2354 | The fastest option relies on local and simultaneous modifications of the phylogeny using NNI |
|---|
| 2355 | moves. More thorough explorations of the space of topologies are also available through the SPR |
|---|
| 2356 | options. As these two classes of tree topology moves involve different computational burdens, it |
|---|
| 2357 | is important to determine which option is the most suitable for the type of data set or analysis one |
|---|
| 2358 | wants to perform. Below is a list of recommendations for typical phylogenetic analyses. |
|---|
| 2359 | |
|---|
| 2360 | \begin{enumerate} |
|---|
| 2361 | \item {\em Single data set, unlimited computing time.} The best option here is probably to use a SPR |
|---|
| 2362 | search (i.e., straight SPR of best of SPR and NNI). If the focus is on estimating the relationships |
|---|
| 2363 | between species, it is a good idea to use more than one starting tree to decrease the chance of |
|---|
| 2364 | getting stuck in a local maximum of the likelihood function. Using NNIs is appropriate if the |
|---|
| 2365 | analysis does not mainly focus on estimating the evolutionary relationships between species (e.g. a |
|---|
| 2366 | tree is needed to estimate the parameters of codon-based models later on). Branch supports can be |
|---|
| 2367 | estimated using bootstrap and approximate likelihood ratios. |
|---|
| 2368 | |
|---|
| 2369 | \item {\em Single data set, restricted computing time.} The three tree searching options can be |
|---|
| 2370 | used depending on the computing time available and the size of the data set. For small data sets |
|---|
| 2371 | (i.e., $<$ 50 sequences), NNI will generally perform well provided that the phylogenetic signal is |
|---|
| 2372 | strong. It is relevant to estimate a first tree using NNI moves and examine the reconstructed |
|---|
| 2373 | phylogeny in order to have a rough idea of the strength of the phylogenetic signal (the presence of |
|---|
| 2374 | small internal branch lengths is generally considered as a sign of a weak phylogenetic signal, |
|---|
| 2375 | specially when sequences are short). For larger data sets ($>$ 50 sequences), a SPR search is |
|---|
| 2376 | recommended if there are good evidence of a lack of phylogenetic signal. Bootstrap analysis will |
|---|
| 2377 | generally involve large computational burdens. Estimating branch supports using approximate |
|---|
| 2378 | likelihood ratios therefore provides an interesting alternative here. |
|---|
| 2379 | |
|---|
| 2380 | \item {\em Multiple data sets, unlimited computing time.} Comparative genomic analyses sometimes |
|---|
| 2381 | rely on building phylogenies from the analysis of a large number of gene families. Here again, the |
|---|
| 2382 | NNI option is the most relevant if the focus is not on recovering the most accurate picture of the |
|---|
| 2383 | evolutionary relationships between species. Slower SPR-based heuristics should be used when the |
|---|
| 2384 | topology of the tree is an important parameter of the analysis (e.g., identification of horizontally |
|---|
| 2385 | transferred genes using phylogenetic tree comparisons). Internal branch support is generally not a |
|---|
| 2386 | crucial parameter of the multiple data set analyses. Using approximate likelihood ratio is probably |
|---|
| 2387 | the best choice here. |
|---|
| 2388 | |
|---|
| 2389 | \item {\em Multiple data sets, limited computing time.} The large amount of data to be processed in |
|---|
| 2390 | a limited time generally requires the use of the fastest tree searching and branch support |
|---|
| 2391 | estimation methods Hence, NNI and approximate likelihood ratios rather than SPR and non-parametric |
|---|
| 2392 | bootstrap are generally the most appropriate here. |
|---|
| 2393 | \end{enumerate} |
|---|
| 2394 | |
|---|
| 2395 | Another important point is the choice of the substitution model. While default options generally |
|---|
| 2396 | provide acceptable results, it is often warranted to perform a pre-analysis in order to identify the |
|---|
| 2397 | best-fit substitution model. This pre-analysis can be done using popular software such as Modeltest |
|---|
| 2398 | \cite{posada98} or ProtTest \cite{abascal05} for instance. These programs generally recommend the |
|---|
| 2399 | use of a discrete gamma distribution to model the substitution process as variability of rates among |
|---|
| 2400 | sites is a common feature of molecular evolution. The choice of the number of rate classes to use |
|---|
| 2401 | for this distribution is also an important one. While the default is set to four categories in |
|---|
| 2402 | PhyML, it is recommended to use larger number of classes if possible in order to best approximate |
|---|
| 2403 | the patterns of rate variation across sites \cite{galtier04}. Note however that run times are |
|---|
| 2404 | directly proportional to the number of classes of the discrete gamma distribution. Here again, a |
|---|
| 2405 | pre-analysis with the simplest model should help the user to determine the number of rate classes |
|---|
| 2406 | that represents the best trade-off between computing time and fit of the model to the data. |
|---|
| 2407 | |
|---|
| 2408 | |
|---|
| 2409 | \subsection{PhyTime}\label{sec:recomphytime} |
|---|
| 2410 | |
|---|
| 2411 | Analysing a data set using PhyTime should involve three steps based on the following questions: (1) |
|---|
| 2412 | do the priors seem to be adequate (2) can I use the fast approximation of the likelihood and (3) how |
|---|
| 2413 | long shall I run the program for? I explain below how to provide answers to these questions. |
|---|
| 2414 | |
|---|
| 2415 | \begin{itemize} |
|---|
| 2416 | \item {\em Are the priors adequate?} Bayesian analysis relies on specifiying the joint |
|---|
| 2417 | prior density of model parameters. In the case of node age estimation, these priors essentially |
|---|
| 2418 | describe how rates of substitution vary across lineages and the probabilistic distribution that node |
|---|
| 2419 | ages have when ignoring the information provided by the genetic sequences. These priors vary from |
|---|
| 2420 | tree to tree. It is therefore essential to check the adequacy of priors for each user-defined input |
|---|
| 2421 | tree. In order to do so, PhyTime needs to be run with the \x{--no\_data} option. When this option is |
|---|
| 2422 | required, the sequence data provided as input will be ignored and the rest of the analysis will |
|---|
| 2423 | proceed normally. The prior distribution of model parameters, essentially edge rates and node |
|---|
| 2424 | heights, can then be checked using the program Tracer as one would do for the standard `posterior' |
|---|
| 2425 | analysis. |
|---|
| 2426 | |
|---|
| 2427 | \item {\em Can I use the fast approximation to the likelihood?} The suface of the log-likelihood |
|---|
| 2428 | function can be approximated using a multivariate normal density. This technique is saving very |
|---|
| 2429 | substantial amounts of computation time. However, like most approximations, there are situations |
|---|
| 2430 | where it does not provide a good fit to the actual function. This usually happens when the phylogeny |
|---|
| 2431 | displays a lot of short branches, i.e., the signal conveyed by the sequences is weak. It is |
|---|
| 2432 | therefore important to first check whether using the approximate likelihood is reasonable. In order |
|---|
| 2433 | to do so, it is recommended to first run the program without the approximation, i.e., using the |
|---|
| 2434 | default settings. Once the minimum value of the ESS of node ages (the last column on the right of |
|---|
| 2435 | the standard output) has reached 40-50, open the \x{phytime.XXXX} output file with Tracer and |
|---|
| 2436 | examine the correlation between the exact and approximate likelihood values. If the correlation is |
|---|
| 2437 | deemed to be good enough, PhyTime can be re-run using the \x{--fast\_lk} option, which uses the fast |
|---|
| 2438 | normal approximation to the likelihood function. |
|---|
| 2439 | |
|---|
| 2440 | |
|---|
| 2441 | |
|---|
| 2442 | % Figure \ref{fig:approxbad} gives an example where the correlation is too weak and the approximation |
|---|
| 2443 | % of the likelihood should be avoided. Figure \ref{fig:approxbad} gives an example where the |
|---|
| 2444 | % approximation is good enough. The current execution of PhyTime can be terminated and then |
|---|
| 2445 | % re-launched using the \x{--fast\_lk} option. |
|---|
| 2446 | |
|---|
| 2447 | % \begin{figure} |
|---|
| 2448 | % \begin{center} |
|---|
| 2449 | % \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_bad.eps}} |
|---|
| 2450 | % \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated |
|---|
| 2451 | % (Y-axis) and the exact (X-axis) likelihoods is weak here. The exact likelihood should be used (option \x{fastlk=no}).} |
|---|
| 2452 | % \label{fig:approxbad} |
|---|
| 2453 | % \end{center} |
|---|
| 2454 | % \end{figure} |
|---|
| 2455 | |
|---|
| 2456 | |
|---|
| 2457 | % \begin{figure} |
|---|
| 2458 | % \begin{center} |
|---|
| 2459 | % \resizebox{14cm}{8cm}{\includegraphics{./fig/approx_good.eps}} |
|---|
| 2460 | % \caption{{\bf Exact vs. approximate likelihoods.} The correlation between the normally approximated |
|---|
| 2461 | % (Y-axis) and the exact (X-axis) likelihoods is good. The approximation of the likelihood can be used (option \x{fastlk=yes}).} |
|---|
| 2462 | % \label{fig:approxgood} |
|---|
| 2463 | % \end{center} |
|---|
| 2464 | % \end{figure} |
|---|
| 2465 | |
|---|
| 2466 | \item {\em How long shall I run the program for?} PhyTime should be run long enough such that the |
|---|
| 2467 | ESS of each parameter is `large enough'. The last column on the right handside of the standard |
|---|
| 2468 | output gives the minimum ESS across all internal node heights. It is recommended to run the program |
|---|
| 2469 | so that this number reaches at least 100. |
|---|
| 2470 | |
|---|
| 2471 | \end{itemize} |
|---|
| 2472 | |
|---|
| 2473 | |
|---|
| 2474 | \section{Frequently asked questions} |
|---|
| 2475 | |
|---|
| 2476 | \begin{enumerate} |
|---|
| 2477 | \item {\it PhyML crashes before reading the sequences. What's wrong ?}\\ |
|---|
| 2478 | \begin{itemize} |
|---|
| 2479 | \item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output} |
|---|
| 2480 | \item The carriage return characters in your sequence files are not recognized by PhyML. You must |
|---|
| 2481 | make sure that your sequence file is a plain text file, with standard carriage return characters (i.e., |
|---|
| 2482 | corresponding to ``$\backslash$\x{n}'', or ``$\backslash$\x{r}'') |
|---|
| 2483 | \end{itemize} |
|---|
| 2484 | \item {\it The program crashes after reading the sequences. What's wrong ?}\\ |
|---|
| 2485 | \begin{itemize} |
|---|
| 2486 | \item You analyse protein sequences and did not enter the \x{-d aa} option in the command-line. |
|---|
| 2487 | \item The format of your sequence file is not recognized by PhyML. See Section \ref{sec:input_output} |
|---|
| 2488 | \end{itemize} |
|---|
| 2489 | \item {\it Does PhyML handle outgroup sequences ?}\\ |
|---|
| 2490 | \begin{itemize} |
|---|
| 2491 | \item Yes, it does. Outgroup taxa are identified by adding the `*' sign at the end of each |
|---|
| 2492 | corresponding sequence name (see Section \ref{sec:outgroupspecify}) |
|---|
| 2493 | \end{itemize} |
|---|
| 2494 | \item {\it Does PhyML estimate clock-constrained trees ?}\\ |
|---|
| 2495 | \begin{itemize} |
|---|
| 2496 | \item No, the PhyML program does not estimate clock-contrained trees. One can however use the |
|---|
| 2497 | program PhyTime to perform such analysis but the tree topology will not be estimated. |
|---|
| 2498 | \end{itemize} |
|---|
| 2499 | \item {\it Can PhyML analyse partitioned data, such as multiple gene sequences ?}\\ |
|---|
| 2500 | \begin{itemize} |
|---|
| 2501 | \item We are currently working on this topic. Future releases of the program will provide options |
|---|
| 2502 | to estimate trees from phylogenomic data sets, with the opportunity to use different substitution |
|---|
| 2503 | models on the different data partitions (e.g., different genes). PhyML will also include specific |
|---|
| 2504 | algorithms to search the space of tree topologies for this type of data. |
|---|
| 2505 | \end{itemize} |
|---|
| 2506 | \end{enumerate} |
|---|
| 2507 | |
|---|
| 2508 | |
|---|
| 2509 | |
|---|
| 2510 | \section{Acknowledgements} |
|---|
| 2511 | The development of PhyML since 2000 has been supported by the Centre National de la Recherche |
|---|
| 2512 | Scientifique (CNRS) and the Minist\`ere de l'\'Education Nationale. |
|---|
| 2513 | |
|---|
| 2514 | % \bibliographystyle{/home/guindon/latex/biblio/OUPnum} |
|---|
| 2515 | \bibliographystyle{/home/guindon/latex/biblio/nature/naturemag} |
|---|
| 2516 | \bibliography{/home/guindon/latex/biblio/ref} |
|---|
| 2517 | |
|---|
| 2518 | \printindex |
|---|
| 2519 | \end{document} |
|---|