#       main topics:
UP      arb.hlp
UP      glossary.hlp

#       sub topics:
SUB     pos_var_pars.hlp

# format described in ../help.readme


TITLE           Estimate Parameters from Column Statistics

OCCURRENCE      ARB_DIST

DESCRIPTION     In a typical RNA, the base frequencies are not equally
                distributed. In the Archaea in particular we find
                extremely G+C rich sequences.
                This has led to a number of new rate corrections,
                algorithms and programs which:

                        - calculate the average G+C content of all
                          sequences or of two sequences
                        - correct the distance accordingly.

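                As a rough illustration of the first step, the average
                G+C content could be computed as sketched below. The
                function names are hypothetical and this is not ARB's
                code; the sketch assumes that gaps and ambiguity codes
                are simply ignored and that the average is pooled over
                all counted bases.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of one sequence.

    Assumption: gaps ('-', '.') and ambiguity codes are skipped.
    """
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def average_gc_content(seqs):
    """G+C fraction pooled over all counted bases of the given sequences."""
    return gc_content("".join(seqs))
```

                Passing two sequences gives the pairwise average, passing
                the whole alignment gives the global average.
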
                Further research showed, however, that the G+C content is
                not equally distributed within a sequence. Helical
                parts in particular have a significantly higher G+C
                content than non-helical parts.
                A straightforward algorithm would calculate each frequency
                independently for each column.
                For small datasets in particular, the resulting frequencies
                would look like random data, because too few samples are
                analyzed.

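                The per-column approach can be sketched as follows. This
                is only an illustration under stated assumptions: the
                function name is hypothetical, the alignment is a list of
                equally long strings, and gaps are not counted.

```python
def column_gc_frequencies(alignment):
    """Return the G+C fraction of every alignment column.

    A column without any counted base yields None.
    Assumption: all sequences in `alignment` have the same length.
    """
    freqs = []
    for column in zip(*alignment):
        bases = [b for b in (c.upper() for c in column) if b in "ACGTU"]
        if bases:
            freqs.append(sum(b in "GC" for b in bases) / len(bases))
        else:
            freqs.append(None)
    return freqs
```

                With only a handful of sequences each column contributes
                very few bases, so the possible values are coarse (with
                three sequences only 0, 1/3, 2/3 or 1), which is why the
                estimates look noisy for small datasets.
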
                In ARB we implemented a combination of the two approaches.
                Let's say we want to estimate a parameter 'P' with
                a maximum variance 'maxvar'; for this we need a minimum
                number of samples 'minsap'.

                - All sequence positions are clustered according to

                        - helical/non-helical region
                        - variability

                  The size of each cluster is chosen with respect
                  to the variability of the sequences, so that it
                  contains a minimum number of independent events.

                - The final parameter estimate for a column is a
                  weighted sum of the estimate for the cluster
                  and the estimate for the single position.

                You can give your preferred method a higher weight by
                adjusting the smoothing parameter:

                        Less smoothing -> independent parameter estimates

                        Much smoothing -> clustered parameter estimates

                To get a good tree we recommend that you try all settings.

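                The effect of the smoothing parameter on the weighted sum
                can be sketched as below. This is a minimal illustration,
                not ARB's implementation: the function name is
                hypothetical, and a linear blend with a weight between
                0 and 1 is an assumption, since the exact weighting is
                not specified here.

```python
def smoothed_estimate(column_estimate, cluster_estimate, smoothing):
    """Blend a single-position estimate with its cluster estimate.

    Assumption: `smoothing` is a weight in [0, 1].
    0.0 keeps the independent (per-column) estimate,
    1.0 uses the clustered estimate only.
    """
    if not 0.0 <= smoothing <= 1.0:
        raise ValueError("smoothing must lie between 0 and 1")
    return (1.0 - smoothing) * column_estimate + smoothing * cluster_estimate
```

                A small weight trusts each column on its own, a large
                weight pulls every column towards the value of its
                cluster.
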
NOTES           To get parameters from a column statistic, you first have
                to create one.
                Do this with <ARB_NT/SAI/Positional Variability (Parsimony M.)>

WARNINGS        Problems may occur when

                        1. 'independent parameter estimates' is selected and
                        2. your dataset is quite small (<100 sequences) and
                        3. one sequence is bad or badly aligned

                or

                        1. much smoothing of parameters is selected and
                        2. you are analyzing ribosomal RNA and
                        3. 'Use Helix Information' is turned off


BUGS            No bugs known