| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"> |
|---|
| 2 | <HTML> |
|---|
| 3 | <HEAD> |
|---|
| 4 | <TITLE>factor</TITLE> |
|---|
| 5 | <META NAME="description" CONTENT="factor"> |
|---|
| 6 | <META NAME="keywords" CONTENT="factor"> |
|---|
| 7 | <META NAME="resource-type" CONTENT="document"> |
|---|
| 8 | <META NAME="distribution" CONTENT="global"> |
|---|
| 9 | <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> |
|---|
| 10 | </HEAD> |
|---|
| 11 | <BODY BGCOLOR="#ccffff"> |
|---|
| 12 | <DIV ALIGN=RIGHT> |
|---|
| 13 | version 3.6 |
|---|
| 14 | </DIV> |
|---|
| 15 | <P> |
|---|
| 16 | <DIV ALIGN=CENTER> |
|---|
| 17 | <H1>FACTOR - Program to factor multistate characters.</H1> |
|---|
| 18 | </DIV> |
|---|
| 19 | <P> |
|---|
| 20 | © Copyright 1986-2002 by The University of Washington. Written by |
|---|
| 21 | Christopher Meacham and Joseph Felsenstein. Permission is granted |
|---|
| 22 | to copy this document provided that no fee is charged for it and that this |
|---|
| 23 | copyright notice is not removed. |
|---|
| 24 | <P> |
|---|
| 25 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 26 | <EM><B>Note:</B> Factor is an Old Style program. |
|---|
| 27 | This means that it takes some of its options information, notably the |
|---|
| 28 | Ancestral states and Factors |
|---|
| 29 | options from the input file rather than from separate files of their own |
|---|
| 30 | as the New Style programs in this version of PHYLIP do. |
|---|
| 31 | </EM> |
|---|
| 32 | </TD></TR></TABLE> |
|---|
| 35 | <P> |
|---|
| 36 | Programmed by C. Meacham, Botany, Univ. of Georgia, Athens, Georgia |
|---|
| 38 | (current address: University of California, Berkeley, California 94720) |
|---|
| 40 | additional code and documentation by Joe Felsenstein |
|---|
| 41 | <P> |
|---|
| 42 | This program factors a data set that contains multistate |
|---|
| 43 | characters, creating a data set consisting entirely of binary (0,1) |
|---|
| 44 | characters that, in turn, can be used as input to any of the other |
|---|
| 45 | discrete character programs in this package, except for PARS. |
|---|
| 46 | Besides this primary |
|---|
| 47 | function, FACTOR also provides an easy way of deleting characters from a |
|---|
| 48 | data set. The input format for FACTOR is very similar to the input |
|---|
| 49 | format for the other discrete character programs except for the |
|---|
| 50 | addition of character-state tree descriptions. |
|---|
| 51 | <P> |
|---|
| 52 | Note that this program has no way of converting an unordered multistate |
|---|
| 53 | character into binary characters. This is a weakness of the Old Style |
|---|
| 54 | discrete characters programs in this package. |
|---|
| 55 | Fortunately, PARS has joined the package, and it enables unordered |
|---|
| 56 | multistate characters, in which any state can change to any other in |
|---|
| 57 | one step, to be analyzed with parsimony. |
|---|
| 58 | <P> |
|---|
| 59 | FACTOR is really for a different case, that in which there are |
|---|
| 60 | multiple states related on a "character state tree", which specifies |
|---|
| 61 | for each state which other states it can change to. That graph of |
|---|
| 62 | states is assumed to be a tree, with no loops in it. |
|---|
| 63 | <P> |
|---|
| 64 | The first line of the input file should contain the number of |
|---|
| 65 | species and the number of multistate characters. This |
|---|
| 66 | first line is followed by the lines describing the character-state |
|---|
| 67 | trees, one description per line. The species information constitutes |
|---|
| 68 | the last part of the file. Any number of lines may be used for a single |
|---|
| 69 | species. |
|---|
| 70 | <P> |
|---|
| 71 | <H2>FIRST LINE</H2> |
|---|
| 72 | <P> |
|---|
| 73 | The first line is free format with the number of species first, |
|---|
| 74 | separated by at least one blank (space) from the number of multistate |
|---|
| 75 | characters, which in turn is separated by at least one blank from the |
|---|
| 76 | options, if present. |
|---|
| 77 | <P> |
|---|
| 78 | <H2>OPTIONS</H2> |
|---|
| 79 | <P> |
|---|
| 80 | The options are selected from a menu that looks like this: |
|---|
| 81 | <P> |
|---|
| 82 | <TABLE><TR><TD BGCOLOR=white> |
|---|
| 83 | <PRE> |
|---|
| 84 | |
|---|
| 85 | Factor -- multistate to binary recoding program, version 3.6a3 |
|---|
| 86 | |
|---|
| 87 | Settings for this run: |
|---|
| 88 | A put ancestral states in output file? No |
|---|
| 89 | F put factors information in output file? No |
|---|
| 90 | 0 Terminal type (IBM PC, ANSI, none)? (none) |
|---|
| 91 | 1 Print indications of progress of run Yes |
|---|
| 92 | |
|---|
| 93 | Are these settings correct? (type Y or the letter for one to change) |
|---|
| 94 | |
|---|
| 95 | </PRE> |
|---|
| 96 | </TD></TR></TABLE> |
|---|
| 97 | <P> |
|---|
| 98 | The options particular to this program are: |
|---|
| 99 | <P> |
|---|
| 100 | <DL COMPACT> |
|---|
| 101 | <DT>A</DT> <DD>Choosing the A (Ancestors) option toggles on and off the setting |
|---|
| 102 | that causes a line to be written in the output that |
|---|
| 103 | describes the states of the ancestor as indicated by the |
|---|
| 104 | character-state tree descriptions (see below). If the ancestral |
|---|
| 105 | state is not specified by a particular character-state tree, |
|---|
| 106 | a "?" signifying an unknown character state will be written. |
|---|
| 107 | The multistate characters are factored in such a way that the |
|---|
| 108 | ancestral state in the factored data set will always be "0". |
|---|
| 109 | The ancestor line does not get counted as a species.</DD> |
|---|
| 110 | <P> |
|---|
| 111 | <DT>F</DT> <DD>Choosing the F (Factors) option toggles on and off |
|---|
| 112 | a setting that will cause a "FACTORS" line to |
|---|
| 113 | be written in the output. |
|---|
| 114 | This line will indicate to other programs which factors came |
|---|
| 115 | from the same multistate character. Of the programs currently in |
|---|
| 116 | the package only SEQBOOT, MOVE, and DOLMOVE use this information.</DD> |
|---|
| 117 | </DL> |
|---|
| 118 | <P> |
|---|
| 119 | <H2>CHARACTER-STATE TREE DESCRIPTIONS</H2> |
|---|
| 120 | <P> |
|---|
| 121 | The character-state trees are described in free format. The |
|---|
| 122 | character number of the multistate character is given first followed |
|---|
| 123 | by the description of the tree itself. Each description must be |
|---|
| 124 | completed on a single line. Each character that is to be factored must |
|---|
| 125 | have a description, and the characters must be described in the order |
|---|
| 126 | that they occur in the input, that is, in numerical order. |
|---|
| 127 | <P> |
|---|
| 128 | The tree is described by listing the pairs of character states that |
|---|
| 129 | are adjacent to each other in the character-state tree. The two |
|---|
| 130 | character states in each adjacent pair are separated by a colon (":"). |
|---|
| 131 | If character fifteen has this character state tree for possible states |
|---|
| 132 | "A", "B", "C", and "D": |
|---|
| 133 | <P> |
|---|
| 134 | <PRE> |
|---|
| 135 | A ---- B ---- C |
|---|
| 136 | | |
|---|
| 137 | | |
|---|
| 138 | | |
|---|
| 139 | D |
|---|
| 140 | </PRE> |
|---|
| 141 | <P> |
|---|
| 142 | then the character-state tree description would be |
|---|
| 143 | <P> |
|---|
| 144 | <PRE> |
|---|
| 145 | 15 A:B B:C D:B |
|---|
| 146 | </PRE> |
|---|
| 147 | <P> |
|---|
| 148 | Note that either symbol may appear first. The ancestral state is |
|---|
| 149 | identified, if desired, by putting it "adjacent" to a period. If we |
|---|
| 150 | wanted to root character fifteen at state C: |
|---|
| 151 | <P> |
|---|
| 152 | <PRE> |
|---|
| 153 | A <--- B <--- C |
|---|
| 154 | | |
|---|
| 155 | | |
|---|
| 156 | V |
|---|
| 157 | D |
|---|
| 158 | </PRE> |
|---|
| 159 | <P> |
|---|
| 160 | we could write |
|---|
| 161 | <P> |
|---|
| 162 | <PRE> |
|---|
| 163 | 15 B:D A:B C:B .:C |
|---|
| 164 | </PRE> |
|---|
| 165 | <P> |
|---|
| 166 | Both the order in which the pairs are listed and the order of the |
|---|
| 167 | symbols in each pair are arbitrary. However, each pair may only appear |
|---|
| 168 | once in the list. Any symbols may be used for a character state in the |
|---|
| 169 | input except the character that signals the connection between two states (in |
|---|
| 170 | the distribution copy this is set to ":"), ".", and, of course, a |
|---|
| 171 | blank. Blanks are ignored |
|---|
| 172 | completely in the tree description so that even B:DA:BC:B.:C or |
|---|
| 173 | B : DA : BC : B. : C would be equivalent to the above example. |
|---|
| 174 | However, at least one blank must separate the character number from the |
|---|
| 175 | tree description. |
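<P>
The rules above can be sketched as a small parser. This is an illustrative Python sketch, not FACTOR's own code: the function name <code>parse_tree_description</code> and its return shape are hypothetical, and it assumes that every state is a single symbol, as in the examples above.

```python
def parse_tree_description(line, factchar=":"):
    """Parse one character-state tree description line.

    Returns (character_number, pairs, root). pairs is a list of
    (state, state) tuples; root is the ancestral state or None.
    ASSUMPTION: each state is exactly one symbol.
    """
    parts = line.split(None, 1)       # character number, then the rest
    number = int(parts[0])
    body = parts[1] if len(parts) > 1 else ""
    body = "".join(body.split())      # blanks inside the description are ignored
    if len(body) % 3 != 0:
        raise ValueError("malformed description: %r" % line)
    pairs, root = [], None
    for i in range(0, len(body), 3):
        a, sep, b = body[i], body[i + 1], body[i + 2]
        if sep != factchar:
            raise ValueError("expected %r inside %r" % (factchar, body[i:i + 3]))
        if a == "." or b == ".":
            state = b if a == "." else a
            if state == "." or root is not None:
                raise ValueError("bad or repeated root in %r" % line)
            root = state              # the state "adjacent" to a period
        else:
            pairs.append((a, b))
    return number, pairs, root


print(parse_tree_description("15 B:D A:B C:B .:C"))
# (15, [('B', 'D'), ('A', 'B'), ('C', 'B')], 'C')
```

A line that carries only a character number, such as "4", yields an empty pair list and no root.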
|---|
| 176 | <P> |
|---|
| 177 | <H2>DELETING CHARACTERS FROM A DATA SET</H2> |
|---|
| 178 | <P> |
|---|
| 179 | If no description line appears in the input for a particular |
|---|
| 180 | character, then that character will be omitted from the output. If the |
|---|
| 181 | character number is given on the line, but no character-state tree is |
|---|
| 182 | provided, then the symbol for the character in the input will be copied |
|---|
| 183 | directly to the output without change. This is useful for characters |
|---|
| 184 | that are already coded "0" and "1". Characters can be deleted from a |
|---|
| 185 | data set simply by listing only those that are to appear in the output. |
|---|
| 186 | <P> |
|---|
| 187 | <H2>TERMINATING THE LIST OF TREE DESCRIPTIONS</H2> |
|---|
| 188 | <P> |
|---|
| 189 | The last character-state tree description should be followed by a |
|---|
| 190 | line containing the number "999". This terminates processing of the |
|---|
| 191 | trees and indicates the beginning of the species information. |
|---|
| 192 | <P> |
|---|
| 193 | <H2>SPECIES INFORMATION</H2> |
|---|
| 194 | <P> |
|---|
| 195 | The format for the species information is essentially the same as that of |
|---|
| 196 | the other discrete character programs. The first ten character positions |
|---|
| 197 | are allotted to the species name (this value may be changed by altering |
|---|
| 198 | the value of the constant nmlngth at the beginning of the program). The |
|---|
| 199 | character states follow and may be continued to as many lines as |
|---|
| 200 | desired. There is currently no method for indicating polymorphisms. |
|---|
| 201 | Blanks between characters are optional. |
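<P>
As a rough illustration of the overall layout (first line, tree descriptions, the "999" line, then species), the following Python sketch splits an input file into its parts. The function name <code>split_factor_input</code> is hypothetical, and the sketch assumes that a species continues onto further lines until it has one symbol per multistate character.

```python
def split_factor_input(text, nmlngth=10):
    """Split FACTOR-style input into header values, tree lines and species.

    ASSUMPTION: a species continues onto following lines until it has
    as many symbols as there are multistate characters.
    """
    lines = text.splitlines()
    fields = lines[0].split()
    nspecies, nchars = int(fields[0]), int(fields[1])
    options = "".join(fields[2:])     # e.g. "A", or "" if no options given
    # The line holding only "999" ends the tree descriptions.
    end = next(i for i, ln in enumerate(lines) if i > 0 and ln.strip() == "999")
    tree_lines = lines[1:end]
    species, i = [], end + 1
    while len(species) < nspecies:
        name = lines[i][:nmlngth].strip()              # first ten positions
        states = "".join(lines[i][nmlngth:].split())   # blanks are optional
        i += 1
        while len(states) < nchars:                    # continuation lines
            states += "".join(lines[i].split())
            i += 1
        species.append((name, states))
    return nspecies, nchars, options, tree_lines, species


sample = "\n".join([
    "4 6 A", "1 A:B B:C", "2 A:B B:.", "4", "5 0:1 1:2 .:0", "6 .:# #:$ #:%",
    "999",
    "Alpha".ljust(10) + "CAW00#",
    "Beta".ljust(10) + "BBX 01%",
    "Gamma".ljust(10) + "ABY",
    "12#",
    "Epsilon".ljust(10) + "CAZ01$",
])
print(split_factor_input(sample)[4])
```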
|---|
| 202 | <P> |
|---|
| 203 | There is a method for indicating uncertainty about states. There is |
|---|
| 204 | one character value that stands for "unknown". If this appears in |
|---|
| 205 | the input data then "?" is written out in all the corresponding |
|---|
| 206 | positions in the output file. The character value that designates |
|---|
| 207 | "unknown" is given in the constant unkchar at the beginning of the |
|---|
| 208 | program, and can be changed by changing that constant. It is set to |
|---|
| 209 | "?" in the distribution copy. |
|---|
| 210 | <P> |
|---|
| 211 | <H2>OUTPUT</H2> |
|---|
| 212 | <P> |
|---|
| 213 | The first line of output will contain the number of species and |
|---|
| 214 | the number of binary characters in the factored data set followed by |
|---|
| 215 | the letter "A" if the A option was specified in the input. If option |
|---|
| 216 | F was specified, the next line will begin "FACTORS". If option A was |
|---|
| 217 | specified, the line describing the ancestor will follow next. Finally, |
|---|
| 218 | the factored characters will be written for each species in the format |
|---|
| 219 | required for input by the other discrete programs in the package. The |
|---|
| 220 | maximum length of the output lines is 80 characters, but this maximum |
|---|
| 221 | length can be changed prior to compilation. |
|---|
| 222 | <P> |
|---|
| 223 | In fact, the format of the output file for the A and F options is not |
|---|
| 224 | correct for the current release of PHYLIP. We need to change their |
|---|
| 225 | output to write a factors file and an ancestors file instead of |
|---|
| 226 | putting the Factors and Ancestors information into the data file. |
|---|
| 227 | <P> |
|---|
| 228 | <H2>ERRORS</H2> |
|---|
| 229 | <P> |
|---|
| 230 | The output should be checked for error messages. Errors will occur |
|---|
| 231 | in the character-state tree descriptions if the format is incorrect |
|---|
| 232 | (colons in the wrong place, etc.), if more than one root is specified, |
|---|
| 233 | if the tree contains loops (and hence is not a tree), and if the tree is |
|---|
| 234 | not connected, e.g. |
|---|
| 235 | <P> |
|---|
| 236 | <PRE> |
|---|
| 237 | A:B B:C D:E |
|---|
| 238 | </PRE> |
|---|
| 239 | <P> |
|---|
| 240 | describes |
|---|
| 241 | <P> |
|---|
| 242 | <PRE> |
|---|
| 243 | A ---- B ---- C D ---- E |
|---|
| 244 | </PRE> |
|---|
| 245 | <P> |
|---|
| 246 | This "tree" is in two unconnected pieces. An error will also occur if a symbol |
|---|
| 247 | appears in the data set that is not in the tree description for that |
|---|
| 248 | character. Blanks at the end of lines when the species information |
|---|
| 249 | is continued to a new line will cause this kind of error. |
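<P>
The two structural errors, loops and unconnected pieces, can be detected before running FACTOR. The Python sketch below uses a union-find structure over the listed pairs. It is an independent illustration rather than the check FACTOR performs internally, and the name <code>check_state_tree</code> and the wording of its messages are hypothetical.

```python
def check_state_tree(pairs):
    """Return a list of problems found in a list of (state, state) pairs."""
    parent = {}

    def find(s):
        parent.setdefault(s, s)
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s

    problems = []
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra == rb:
            # a and b were already connected, so this pair closes a loop
            problems.append("loop closed by %s:%s" % (a, b))
        else:
            parent[ra] = rb
    pieces = {find(s) for s in list(parent)}
    if len(pieces) > 1:
        problems.append("tree is in %d unconnected pieces" % len(pieces))
    return problems


print(check_state_tree([("A", "B"), ("B", "C"), ("D", "E")]))
# ['tree is in 2 unconnected pieces']
```

A pair listed twice is reported as a loop as well, which matches the rule that each pair may appear only once.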
|---|
| 250 | <P> |
|---|
| 251 | <H2>CONSTANTS AVAILABLE TO BE CHANGED</H2> |
|---|
| 252 | <P> |
|---|
| 253 | At the beginning of the program a number of constants |
|---|
| 254 | are available to be changed to accommodate larger data sets. These are |
|---|
| 255 | "maxstates", "maxoutput", "sizearray", "factchar" and "unkchar". The |
|---|
| 256 | constant "maxstates" |
|---|
| 257 | gives the maximum number of states per character (set at 20 in the |
|---|
| 258 | distribution copy). The constant "maxoutput" |
|---|
| 259 | gives the maximum width of a line in the output file (80 in the |
|---|
| 260 | distribution copy). The constant "sizearray" |
|---|
| 261 | must be at least as large as the sum of squares |
|---|
| 262 | of the numbers of states in the characters. It is initially |
|---|
| 263 | set to 2000, so that although 20 states are allowed (at the initial |
|---|
| 264 | setting of maxstates) per character, there cannot be 20 states in all |
|---|
| 265 | of 100 characters. |
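<P>
The arithmetic behind that remark can be checked directly. The Python sketch below reads the constraint as the sum of squared state counts having to fit within "sizearray", which is what the 100-character example implies. Whether the exact boundary value is allowed is an assumption, and the function name <code>fits_sizearray</code> is hypothetical.

```python
def fits_sizearray(state_counts, sizearray=2000):
    """True if the sum of squared state counts fits within sizearray.

    ASSUMPTION: a sum exactly equal to sizearray is still acceptable.
    """
    return sum(n * n for n in state_counts) <= sizearray


print(fits_sizearray([20] * 100))   # 100 * 400 = 40000, prints False
print(fits_sizearray([20] * 5))     # 5 * 400 = 2000, prints True
print(fits_sizearray([4] * 100))    # 100 * 16 = 1600, prints True
```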
|---|
| 266 | <P> |
|---|
| 267 | Particularly important constants are "factchar" and "unkchar" |
|---|
| 268 | which are not numerical |
|---|
| 269 | values but characters. Initially set to the colon ":", |
|---|
| 270 | "factchar" is the character that will be used to separate states in the input of character |
|---|
| 271 | state trees. It can be changed by changing this |
|---|
| 272 | constant. (We could have used a hyphen ("-") but didn't because that would make the |
|---|
| 273 | minus-sign ("-") unavailable as a character state in +/- characters). |
|---|
| 274 | The constant "unkchar" |
|---|
| 275 | is the character value in the input data that |
|---|
| 276 | indicates that the state is unknown. It is set to "?" in the |
|---|
| 277 | distribution copy. If your computer is one that lacks the colon ":" in its |
|---|
| 278 | character set or uses a nonstandard character code such as EBCDIC, you |
|---|
| 279 | will want to change the constant "factchar". |
|---|
| 280 | <P> |
|---|
| 281 | <H2>INPUT AND OUTPUT FILES</H2> |
|---|
| 282 | <P> |
|---|
| 283 | The input file for the program has the default file name "infile" |
|---|
| 284 | and the output file, the one that has the binary character state data, |
|---|
| 285 | has the name "outfile". |
|---|
| 286 | <P> |
|---|
| 287 | <TABLE> |
|---|
| 288 | <TR> |
|---|
| 289 | <TD>----SAMPLE INPUT-----</TD> <TD> -----Comments (not part of input file) -----</TD> |
|---|
| 290 | </TR> |
|---|
| 291 | <TR> |
|---|
| 292 | <TD BGCOLOR=white> |
|---|
| 293 | <PRE> |
|---|
| 294 | 4 6 A |
|---|
| 295 | 1 A:B B:C |
|---|
| 296 | 2 A:B B:. |
|---|
| 297 | 4 |
|---|
| 298 | 5 0:1 1:2 .:0 |
|---|
| 299 | 6 .:# #:$ #:% |
|---|
| 300 | 999 |
|---|
| 301 | Alpha CAW00# |
|---|
| 302 | Beta BBX01% |
|---|
| 303 | Gamma ABY12# |
|---|
| 304 | Epsilon CAZ01$ |
|---|
| 305 | </PRE> |
|---|
| 306 | </TD> |
|---|
| 307 | <TD> |
|---|
| 308 | <PRE> |
|---|
| 309 | |
|---|
| 310 | 4 species; 6 characters; A option on |
|---|
| 311 | A ---- B ---- C |
|---|
| 312 | B ---> A |
|---|
| 313 | Character 3 deleted; 4 unchanged |
|---|
| 314 | 0 ---> 1 ---> 2 |
|---|
| 315 | % <--- # ---> $ |
|---|
| 316 | Signals end of trees |
|---|
| 317 | Species information begins |
|---|
| 318 | |
|---|
| 319 | |
|---|
| 320 | |
|---|
| 321 | </PRE> |
|---|
| 322 | </TD> |
|---|
| 323 | </TR> |
|---|
| 324 | <TR> |
|---|
| 325 | <TD> ---SAMPLE OUTPUT-----</TD> <TD> -----Comments (not part of output file) -----</TD> |
|---|
| 326 | </TR> |
|---|
| 327 | <TR> |
|---|
| 328 | <TD BGCOLOR=white> |
|---|
| 329 | <PRE> |
|---|
| 330 | 5 8 A |
|---|
| 331 | ANCESTOR ??0?0000 |
|---|
| 332 | Alpha 11100000 |
|---|
| 333 | Beta 10001001 |
|---|
| 334 | Gamma 00011100 |
|---|
| 335 | Epsilon 11101010 |
|---|
| 336 | </PRE> |
|---|
| 337 | </TD> |
|---|
| 338 | <TD> |
|---|
| 339 | <PRE> |
|---|
| 340 | 5 species (incl. anc.); 8 factors |
|---|
| 341 | Chars. 1 and 2 come from old number 1 |
|---|
| 342 | Char. 3 comes from old number 2 |
|---|
| 343 | Char. 4 is old number 4 |
|---|
| 344 | Chars. 5 and 6 come from old number 5 |
|---|
| 345 | Chars. 7 and 8 come from old number 6 |
|---|
| 346 | </PRE> |
|---|
| 347 | </TD> |
|---|
| 348 | </TR> |
|---|
| 349 | </TABLE> |
|---|
| 350 | </BODY> |
|---|
| 351 | </HTML> |
|---|