source: trunk/GDE/PHYLIP/doc/factor.html

Last change on this file was 2176, checked in by westram, 21 years ago

1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
2<HTML>
3<HEAD>
4<TITLE>factor</TITLE>
5<META NAME="description" CONTENT="factor">
6<META NAME="keywords" CONTENT="factor">
7<META NAME="resource-type" CONTENT="document">
8<META NAME="distribution" CONTENT="global">
9<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
10</HEAD>
11<BODY BGCOLOR="#ccffff">
12<DIV ALIGN=RIGHT>
13version 3.6
14</DIV>
15<P>
16<DIV ALIGN=CENTER>
17<H1>FACTOR - Program to factor multistate characters.</H1>
18</DIV>
19<P>
20&#169; Copyright 1986-2002 by The University of Washington. Written by
21Christopher Meacham and Joseph Felsenstein.  Permission is granted
22to copy this document provided that no fee is charged for it and that this
23copyright notice is not removed.
24<P>
25<TABLE><TR><TD BGCOLOR=white>
26<EM><B>Note:</B> Factor is an Old Style program.
27This means that it takes some of its options information, notably the
28Ancestral states and Factors
29options from the input file rather than from separate files of their own
30as the New Style programs in this version of PHYLIP do.
31</EM>
32</TD></TR></TABLE>
33<P>
36Programmed by C. Meacham, Botany, Univ. of Georgia, Athens, Georgia
37<BR>
38(current address: University of California, Berkeley, California  94720)
39<BR>
40additional code and documentation by Joe Felsenstein
41<P>
42This program factors a data set that contains multistate
43characters, creating a data set consisting entirely of binary (0,1)
44characters that, in turn, can be used as input to any of the other
45discrete character programs in this package, except for PARS. 
46Besides this primary
47function, FACTOR also provides an easy way of deleting characters from a
48data set.  The input format for FACTOR is very similar to the input
49format for the other discrete character programs except for the
50addition of character-state tree descriptions.
51<P>
52Note that this program has no way of converting an unordered multistate
53character into binary characters.  This is a weakness of the Old Style
54discrete characters programs in this package.
55Fortunately, PARS has joined the package, and it enables unordered
56multistate characters, in which any state can change to any other in
57one step, to be analyzed with parsimony.
58<P>
59FACTOR is really for a different case, that in which there are
60multiple states related on a "character state tree", which specifies
61for each state which other states it can change to.  That graph of
62states is assumed to be a tree, with no loops in it.
63<P>
64The first line of the input file should contain the number of
65species and the number of multistate characters.  This
66first line is followed by the lines describing the character-state
67trees, one description per line.  The species information constitutes
68the last part of the file.  Any number of lines may be used for a single
69species.
70<P>
71<H2>FIRST LINE</H2>
72<P>
73The first line is free format with the number of species first,
74separated by at least one blank (space) from the number of multistate
75characters, which in turn is separated by at least one blank from the
76options, if present.
77<P>
78<H2>OPTIONS</H2>
79<P>
80The options are selected from a menu that looks like this:
81<P>
82<TABLE><TR><TD BGCOLOR=white>
83<PRE>
84
85Factor -- multistate to binary recoding program, version 3.6a3
86
87Settings for this run:
88  A      put ancestral states in output file?  No
89  F   put factors information in output file?  No
90  0       Terminal type (IBM PC, ANSI, none)?  (none)
91  1      Print indications of progress of run  Yes
92
93Are these settings correct? (type Y or the letter for one to change)
94
95</PRE>
96</TD></TR></TABLE>
97<P>
98The options particular to this program are:
99<P>
100<DL COMPACT>
101<DT>A</DT> <DD>Choosing the A (Ancestors) option toggles on and off the setting
102that causes a line to be written in the output that
103describes the states of the ancestor as indicated by the
104character-state tree descriptions (see below).  If the ancestral
105state is not specified by a particular character-state tree,
106a "?" signifying an unknown character state will be written.
107The multistate characters are factored in such a way that the
108ancestral state in the factored data set will always be "0".
109The ancestor line does not get counted as a species.</DD>
110<P>
111<DT>F</DT> <DD>Choosing the F (Factors) option toggles on and off
112a setting that will cause a "FACTORS" line to
113be written in the output.
114This line will indicate to other programs which factors came
115from the same multistate character.  Of the programs currently in
116the package only SEQBOOT, MOVE, and DOLMOVE use this information.</DD>
117</DL>
118<P>
119<H2>CHARACTER-STATE TREE DESCRIPTIONS</H2>
120<P>
121The character-state trees are described in free format.  The
122character number of the multistate character is given first followed
123by the description of the tree itself.  Each description must be
124completed on a single line.  Each character that is to be factored must
125have a description, and the characters must be described in the order
126that they occur in the input, that is, in numerical order.
127<P>
128The tree is described by listing the pairs of character states that
129are adjacent to each other in the character-state tree.  The two
130character states in each adjacent pair are separated by a colon (":").
131If character fifteen has this character state tree for possible states
132"A", "B", "C", and "D":
133<P>
134<PRE>
135                         A ---- B ---- C
136                                |
137                                |
138                                |
139                                D
140</PRE>
141<P>
142then the character-state tree description would be
143<P>
144<PRE>
145                        15  A:B B:C D:B
146</PRE>
147<P>
148Note that either symbol may appear first.  The ancestral state is
149identified, if desired, by putting it "adjacent" to a period.  If we
150wanted to root character fifteen at state C:
151<P>
152<PRE>
153                         A <--- B <--- C
154                                |
155                                |
156                                V
157                                D
158</PRE>
159<P>
160we could write
161<P>
162<PRE>
163                      15  B:D A:B C:B .:C
164</PRE>
165<P>
166Both the order in which the pairs are listed and the order of the
167symbols in each pair are arbitrary.  However, each pair may only appear
168once in the list.  Any symbols may be used for a character state in the
169input except the character that signals the connection between two states (in
170the distribution copy this is set to ":"), ".", and, of course, a
171blank.  Blanks are ignored
172completely in the tree description so that even  B:DA:BC:B.:C  or
173B : DA : BC : B. : C  would be equivalent to the above example.
174However, at least one blank must separate the character number from the
175tree description.
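<P>
The free-format rules above can be illustrated with a short Python sketch.
It is not code from FACTOR itself: the function name "parse_tree_description"
and the shape of its return value are assumptions made for this example.
It relies on each state being a single character, so that every pair occupies
exactly three characters once the blanks have been removed.

```python
def parse_tree_description(line, factchar=":"):
    """Parse one line such as '15  B:D A:B C:B .:C'.

    Returns (character_number, edges, root).  root is None when no
    state is paired with a period.  (Hypothetical helper, not FACTOR code.)
    """
    fields = line.split(None, 1)           # character number, then the rest
    number = int(fields[0])
    # Blanks inside the tree description are ignored completely.
    body = "".join(fields[1].split()) if len(fields) > 1 else ""
    if len(body) % 3 != 0:
        raise ValueError("each pair must look like X%sY" % factchar)
    edges, root = [], None
    for i in range(0, len(body), 3):
        a, sep, b = body[i], body[i + 1], body[i + 2]
        if sep != factchar:
            raise ValueError("expected %r between two states" % factchar)
        if a == "." or b == ".":
            # A state "adjacent" to a period is the ancestral state.
            if root is not None:
                raise ValueError("more than one root specified")
            root = b if a == "." else a
        else:
            edges.append((a, b))
    return number, edges, root


print(parse_tree_description("15  B:D A:B C:B .:C"))
print(parse_tree_description("15 B : DA : BC : B. : C"))
```

Both calls describe the same rooted tree for character fifteen, which is the
equivalence stated in the paragraph above.  A line holding only a character
number yields an empty list of pairs and no root.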
176<P>
177<H2>DELETING CHARACTERS FROM A DATA SET</H2>
178<P>
179If no description line appears in the input for a particular
180character, then that character will be omitted from the output.  If the
181character number is given on the line, but no character-state tree is
182provided, then the symbol for the character in the input will be copied
183directly to the output without change.  This is useful for characters
184that are already coded "0" and "1".  Characters can be deleted from a
185data set simply by listing only those that are to appear in the output.
186<P>
187<H2>TERMINATING THE LIST OF TREE DESCRIPTIONS</H2>
188<P>
189The last character-state tree description should be followed by a
190line containing the number "999".  This terminates processing of the
191trees and indicates the beginning of the species information.
192<P>
193<H2>SPECIES INFORMATION</H2>
194<P>
195The format for the species information is basically identical to
196that of the other discrete character programs.  The first ten character positions
197are allotted to the species name (this value may be changed by altering
198the value of the constant nmlngth at the beginning of the program).  The
199character states follow and may be continued to as many lines as
200desired.  There is no current method for indicating polymorphisms.  It is
201possible to either put blanks between characters or not.
202<P>
203There is a method for indicating uncertainty about states.  There is
204one character value that stands for "unknown".  If this appears in
205the input data then "?" is written out in all the corresponding
206positions in the output file.  The character value that designates
207"unknown" is given in the constant unkchar at the beginning of the
208program, and can be changed by changing that constant.  It is set to
209"?" in the distribution copy.
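<P>
The layout of the species information can be sketched in Python as follows.
The function name "read_species" is hypothetical, and the sketch simply
discards every blank in the state data.  That is an assumption made for
brevity and is more forgiving than the program itself, which can report an
error for blanks at the end of a continued line.

```python
def read_species(lines, nchars, nmlngth=10):
    """Yield (name, states) pairs from the species part of the input.

    nchars is the number of multistate characters given on the first
    line of the file; nmlngth is the width of the species name field.
    (Hypothetical helper, not FACTOR code.)
    """
    it = iter(lines)
    for line in it:
        if not line.strip():
            continue                          # skip empty lines
        name = line[:nmlngth].strip()
        states = [c for c in line[nmlngth:] if not c.isspace()]
        # A species may be continued over as many lines as desired.
        while len(states) < nchars:
            more = next(it, None)
            if more is None:
                raise ValueError("data for %s ends early" % name)
            states.extend(c for c in more if not c.isspace())
        yield name, states


sample = ["Alpha     CAW00#", "Beta      BBX", "01%"]
for name, states in read_species(sample, 6):
    print(name, "".join(states))
```

An unknown state needs no special treatment at this stage: a "?" is read like
any other symbol and would be recognised later, when the states are recoded.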
210<P>
211<H2>OUTPUT</H2>
212<P>
213The first line of output will contain the number of species and
214the number of binary characters in the factored data set followed by
215the letter "A" if the A option was specified in the input.  If option
216F was specified, the next line will begin "FACTORS".  If option A was
217specified, the line describing the ancestor will follow next.  Finally,
218the factored characters will be written for each species in the format
219required for input by the other discrete programs in the package.  The
220maximum length of the output lines is 80 characters, but this maximum
221length can be changed prior to compilation.
222<P>
223In fact, the format of the output file for the A and F options is not
224correct for the current release of PHYLIP.  We need to change their
225output to write a factors file and an ancestors file instead of
226putting the Factors and Ancestors information into the data file.
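<P>
One way to picture the recoding is additive binary coding: every state other
than the ancestral one gets its own binary factor, and a species scores "1"
for each factor on the path from the ancestral state to its own state.  The
Python sketch below follows that idea.  It is not FACTOR's source code.  The
function names, the use of the first-mentioned state as the starting point
when no ancestor is given, and the ordering of factors by first appearance
are all assumptions, chosen because they reproduce the sample output at the
end of this document.  Unknown states ("?") are not handled here.

```python
def factor_codes(edges, root=None):
    """Map each state of one character to its string of binary factors."""
    states, neighbours = [], {}
    for a, b in edges:
        for s in (a, b):
            if s not in neighbours:
                neighbours[s] = []
                states.append(s)          # remember order of first appearance
        neighbours[a].append(b)
        neighbours[b].append(a)
    start = root if root is not None else states[0]   # assumption, see above
    factors = [s for s in states if s != start]       # one factor per other state
    on = {start: frozenset()}             # the ancestral state is all zeros
    stack = [start]
    while stack:
        s = stack.pop()
        for t in neighbours[s]:
            if t not in on:
                on[t] = on[s] | {t}       # inherit the path, add own factor
                stack.append(t)
    return {s: "".join("1" if f in path else "0" for f in factors)
            for s, path in on.items()}


def recode(row, trees):
    """Recode one species; characters absent from `trees` are dropped."""
    out = []
    for number, state in enumerate(row, start=1):
        if number not in trees:
            continue                      # no description line: deleted
        if trees[number] is None:
            out.append(state)             # number only: copied unchanged
        else:
            edges, root = trees[number]
            out.append(factor_codes(edges, root)[state])
    return "".join(out)


# Trees from the sample input: character 3 is missing, character 4 has no tree.
sample_trees = {
    1: ([("A", "B"), ("B", "C")], None),
    2: ([("A", "B")], "B"),
    4: None,
    5: ([("0", "1"), ("1", "2")], "0"),
    6: ([("#", "$"), ("#", "%")], "#"),
}
print(recode("BBX01%", sample_trees))     # species Beta
```

Under these assumptions the species Beta is recoded as 10001001, the same
string that appears for Beta in the sample output.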
227<P>
228ERRORS
229<P>
230The output should be checked for error messages.  Errors will occur
231in the character-state tree descriptions if the format is incorrect
232(colons in the wrong place, etc.), if more than one root is specified,
233if the tree contains loops (and hence is not a tree), and if the tree is
234not connected, e.g.
235<P>
236<PRE>
237                             A:B B:C D:E
238</PRE>
239<P>
240describes
241<P>
242<PRE>
243                  A ---- B ---- C          D ---- E
244</PRE>
245<P>
246This "tree" is in two unconnected pieces.  An error will also occur if a symbol
247appears in the data set that is not in the tree description for that
248character.  Blanks at the end of lines when the species information
249is continued to a new line will cause this kind of error.
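<P>
The structural checks described above (repeated pairs, loops, and unconnected
pieces) can be sketched with a small union-find routine in Python.  The
function name "check_tree" and the wording of the messages are inventions for
this example and are not the messages that FACTOR prints.

```python
def check_tree(edges):
    """Return a list of problems found in one character-state tree.

    edges is a list of (state, state) pairs, with the root pair left out.
    (Hypothetical helper, not FACTOR code.)
    """
    problems = []
    parent = {}                           # union-find forest over the states

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    seen = set()
    for a, b in edges:
        key = frozenset((a, b))           # A:B and B:A are the same pair
        if key in seen:
            problems.append("pair %s:%s appears more than once" % (a, b))
            continue
        seen.add(key)
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra == rb:
            problems.append("pair %s:%s closes a loop" % (a, b))
        else:
            parent[ra] = rb               # join the two pieces
    pieces = len({find(s) for s in parent})
    if pieces > 1:
        problems.append("tree is in %d unconnected pieces" % pieces)
    return problems


print(check_tree([("A", "B"), ("B", "C"), ("D", "E")]))
```

For the example A:B B:C D:E this reports two unconnected pieces, while a
valid description such as A:B B:C D:B gives an empty list.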
250<P>
251<H2>CONSTANTS AVAILABLE TO BE CHANGED</H2>
252<P>
253At the beginning of the program a number of
254are available to be changed to accomodate larger data sets.  These are
255"maxstates", "maxoutput", "sizearray", "factchar" and "unkchar".  The
256constant "maxstates"
257gives the maximum number of states per character (set at 20 in the
258distribution copy).  The constant "maxoutput"
259gives the maximum width of a line in the output file (80 in the
260distribution copy).  The constant "sizearray"
261must be less than the sum of squares
262of the numbers of states in the characters.  It is initially set to
263set to 2000, so that although 20 states are allowed (at the initial
264setting of maxstates) per character, there cannot be 20 states in all
265of 100 characters.
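<P>
The arithmetic behind the "sizearray" limit can be checked with a few lines
of Python.  The sketch reads the constraint as "the sum of squares must fit
within sizearray", which is what the remark about 20 states in 100 characters
implies.  Whether the program's comparison is strict is not stated here, so
the use of "less than or equal" is an assumption, as is the function name.

```python
def fits_sizearray(states_per_character, sizearray=2000):
    """Return (needed, ok) for a list of state counts, one per character."""
    needed = sum(n * n for n in states_per_character)
    return needed, needed <= sizearray    # non-strict comparison is assumed


print(fits_sizearray([20] * 100))         # 100 characters of 20 states each
print(fits_sizearray([3, 2, 3, 3]))       # the four trees in the sample input
```

A hundred characters with 20 states each would need 100 x 400 = 40000, far
more than the initial 2000, whereas the four trees in the sample input need
only 9 + 4 + 9 + 9 = 31.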
266<P>
267Particularly important constants are "factchar" and "unkchar"
268which are not numerical
269values but a character.  Initially set to the colon ":",
270"factchar" is the character that will be used to separate states in the input of character
271state trees.  It can be changed by changing this
272constant.  (We could have used a hyphen ("-") but didn't because that would make the
273minus-sign ("-") unavailable as a character state in +/- characters).
274The constant "unkchar"
275is the character value in the input data that
276indicates that the state is unknown.  It is set to "?" in the
277distribution copy.  If your computer is one that lacks the colon ":" in its
278character set or uses a nonstandard character code such as EBCDIC, you
279will want to change the constant "factchar".
280<P>
281<H2>INPUT AND OUTPUT FILES</H2>
282<P>
283The input file for the program has the default file name "infile"
284and the output file, the one that has the binary character state data,
285has the name "outfile".
286<P>
287<TABLE>
288<TR>
289<TD>----SAMPLE INPUT-----</TD> <TD> -----Comments (not part of input file) -----</TD>
290</TR>
291<TR>
292<TD BGCOLOR=white>
293<PRE> 
294   4   6  A
2951 A:B B:C       
2962 A:B B:.       
2974               
2985 0:1 1:2 .:0   
2996 .:# #:$ #:%   
300999             
301Alpha     CAW00#
302Beta      BBX01%
303Gamma     ABY12#
304Epsilon   CAZ01$
305</PRE>
306</TD>
307<TD>
308<PRE>
309
310     4 species; 6 characters; A option on
311     A ---- B ---- C
312     B ---> A
313     Character 3 deleted; 4 unchanged
314     0 ---> 1 ---> 2
315     % <--- # ---> $
316     Signals end of trees
317     Species information begins
318
319     
320   
321</PRE>
322</TD>
323</TR>
324<TR>
325<TD> ---SAMPLE OUTPUT-----</TD> <TD>  -----Comments (not part of output file) -----</TD>
326</TR>
327<TR>
328<TD BGCOLOR=white>
329<PRE>
330    5    8    A
331ANCESTOR  ??0?0000
332Alpha     11100000
333Beta      10001001
334Gamma     00011100
335Epsilon   11101010
336</PRE>
337</TD>
338<TD>
339<PRE> 
340     5 species (incl. anc.); 8 factors
341     Chars. 1 and 2 come from old number 1
342     Char. 3 comes from old number 2
343     Char. 4 is old number 4
344     Chars. 5 and 6 come from old number 5
345     Chars. 7 and 8 come from old number 6
346</PRE>
347</TD>
348</TR>
349</TABLE>
350</BODY>
351</HTML>