1 | #Please insert up references in the next lines (line starts with keyword UP) |
---|
2 | UP arb.hlp |
---|
3 | UP glossary.hlp |
---|
4 | |
---|
5 | #Please insert subtopic references (line starts with keyword SUB) |
---|
6 | SUB pos_var_pars.hlp |
---|
7 | |
---|
8 | # Hypertext links in helptext can be added like this: LINK{ref.hlp|http://add|bla@domain} |
---|
9 | |
---|
10 | #************* Title of helpfile !! and start of real helpfile strunk******** |
---|
11 | TITLE Estimate Parameters from Column Statistics |
---|
12 | |
---|
13 | OCCURRENCE ARB_DIST |
---|
14 | |
---|
15 | DESCRIPTION In a standard RNA, base frequencies are not equally |
---|
16 | distributed. Especially in the archaea subclass we find |
---|
17 | extremely G+C rich sequences. |
---|
18 | This led to a couple of new rate corrections, algorithms |
---|
19 | and programs which: |
---|
20 | |
---|
21 | - calculate the average G+C content of all/two sequences |
---|
22 | - correct the distance. |
---|
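A minimal sketch of the first step, the average G+C content of a set of sequences. The function names are hypothetical and not part of ARB, and gap handling (ignoring every character that is not A, C, G, T or U) is an assumption. The distance correction itself is not shown, because this help text does not specify it.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the nucleotide characters of one sequence."""
    # Assumption: anything that is not A, C, G, T or U (gaps, ambiguity
    # codes) is ignored rather than counted.
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no nucleotides")
    return sum(b in "GC" for b in bases) / len(bases)


def mean_gc_content(seqs) -> float:
    """Average G+C content over several sequences (two or all)."""
    values = [gc_content(s) for s in seqs]
    if not values:
        raise ValueError("no sequences given")
    return sum(values) / len(values)
```

Passing two sequences gives the pairwise average, passing the whole alignment gives the global average.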
23 | |
---|
24 | But further research showed us that the G+C frequencies are |
---|
25 | not equally distributed within a sequence. Especially helical |
---|
26 | parts have a significantly higher G+C content than |
---|
27 | non-helical parts. |
---|
28 | A straightforward algorithm would calculate each frequency |
---|
29 | independently for each column. |
---|
30 | Especially for small datasets the resulting frequencies would |
---|
31 | look like random data, as too few examples are analyzed. |
---|
32 | |
---|
33 | In ARB we implemented a combination of the 2 approaches. |
---|
34 | Let's say we want to estimate a parameter 'P' with |
---|
35 | a maximum variance 'maxvar', so we need a minimum |
---|
36 | number of samples 'minsap'. |
---|
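The relation between 'maxvar' and 'minsap' can be illustrated as follows. This is only a sketch under an assumption that this help text does not state: 'P' is treated as a frequency estimated from n independent observations, so its variance is p*(1-p)/n. The function name is hypothetical and the formula ARB actually uses may differ.

```python
import math


def min_samples(maxvar: float, p: float = 0.5) -> int:
    """Smallest n for which p*(1-p)/n does not exceed maxvar.

    Assumption: binomial variance of a frequency estimate. The default
    p = 0.5 is the worst case, because p*(1-p) is largest there.
    """
    if maxvar <= 0:
        raise ValueError("maxvar must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return max(1, math.ceil(p * (1.0 - p) / maxvar))
```

Under this assumption, halving 'maxvar' doubles the number of samples needed, which is why single columns of a small dataset rarely provide enough observations on their own.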
37 | |
---|
38 | - All sequence positions are clustered according to |
---|
39 | |
---|
40 | - helical/non-helical region |
---|
41 | - variability |
---|
42 | |
---|
43 | The size of the cluster is chosen with respect |
---|
44 | to the variability of the sequences to get a |
---|
45 | minimum number of independent events. |
---|
46 | |
---|
47 | - The final parameter estimate for a column is a |
---|
48 | weighted sum of the estimate for the |
---|
49 | cluster and the estimate for the single position. |
---|
50 | |
---|
51 | You can give your favorite method a higher weight by |
---|
52 | controlling the smoothing parameter: |
---|
53 | |
---|
54 | Less smoothing -> independent parameter estimates |
---|
55 | |
---|
56 | Much smoothing -> clustered parameter estimates |
---|
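The weighted sum and the effect of the smoothing parameter can be sketched as below. The linear blend, the range 0..1 for the smoothing value and the function name are assumptions made for illustration. This help text does not give the exact weighting ARB applies.

```python
def blended_estimate(column_est: float, cluster_est: float,
                     smoothing: float) -> float:
    """Weighted sum of the single-column and the cluster estimate.

    Assumption: smoothing lies in [0, 1]. A value of 0 keeps the
    independent (per-column) estimate, a value of 1 keeps the
    clustered estimate, values in between mix the two linearly.
    """
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing must lie in [0, 1]")
    return smoothing * cluster_est + (1.0 - smoothing) * column_est
```

With little smoothing a noisy column keeps its own (possibly unreliable) value, with much smoothing it is pulled towards the value shared by its cluster.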
57 | |
---|
58 | To get a good tree we recommend that you try all selections. |
---|
59 | |
---|
60 | NOTES To get parameters from a column statistic you first have |
---|
61 | to create one. |
---|
62 | Do this with <ARB_NT/SAI/Positional Variability (Parsimony M.)> |
---|
63 | |
---|
64 | WARNINGS Problems may occur when |
---|
65 | |
---|
66 | 1. 'independent parameter estimates' is selected and |
---|
67 | 2. your dataset is quite small (<100 sequences) and |
---|
68 | 3. one sequence is bad or badly aligned |
---|
69 | |
---|
70 | or |
---|
71 | |
---|
72 | 1. Much smoothing of parameters is selected and |
---|
73 | 2. you are analyzing ribosomal RNA and |
---|
74 | 3. 'Use Helix Information' is turned off |
---|
75 | |
---|
76 | |
---|
77 | BUGS No bugs known |
---|