| 1 | # main topics: |
|---|
| 2 | UP arb.hlp |
|---|
| 3 | UP glossary.hlp |
|---|
| 4 | |
|---|
| 5 | # sub topics: |
|---|
| 6 | #SUB subtopic.hlp |
|---|
| 7 | |
|---|
| 8 | # format described in ../help.readme |
|---|
| 9 | |
|---|
| 10 | |
|---|
| 11 | TITLE Cluster detection |
|---|
| 12 | |
|---|
| 13 | OCCURRENCE ARB_DIST |
|---|
| 14 | |
|---|
| 15 | DESCRIPTION Cluster detection searches for subtrees of the tree selected |
|---|
| 16 | in the ARB_DIST main window that form homologous groups of sequences. |
|---|
| 17 | |
|---|
| 18 | The main prerequisite for cluster detection to work well is a good tree, |
|---|
| 19 | preferably a tree optimized with ARB_PARSIMONY (since cluster detection uses |
|---|
| 20 | the same distance function as ARB_PARSIMONY does). |
|---|
| 21 | |
|---|
| 22 | You may control the distance calculation by selecting a filter and/or weights |
|---|
| 23 | in the ARB_DIST main window. |
|---|
| 24 | |
|---|
| 25 | The following parameters define which subtrees will be reported as clusters: |
|---|
| 26 | - Max. distance inside each cluster (no two sequences in a cluster are |
|---|
| 27 | farther apart than the specified distance). Specify the distance as a percentage of |
|---|
| 28 | mutations: 100 means every base differs, 0 means no base differs. |
|---|
| 29 | - Min. cluster size (clusters with fewer species are ignored) |
|---|
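The two parameters above can be read as a simple acceptance test for a candidate subtree. The following is a minimal Python sketch of that test, not the ARB implementation: ARB_DIST uses the ARB_PARSIMONY distance function together with the selected filter and weights, whereas the hypothetical percent_distance() below simply counts differing positions in two equally long aligned sequences. The function names are assumptions made for illustration.

```python
from itertools import combinations

def percent_distance(a: str, b: str) -> float:
    """Percentage of aligned positions at which two sequences differ
    (illustrative stand-in for the real distance function)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * mismatches / len(a)

def is_cluster(seqs: list[str], max_dist: float, min_size: int) -> bool:
    """True if the group is large enough and no two members are
    farther apart than max_dist (given in percent)."""
    if len(seqs) < min_size:
        return False  # clusters below the minimum size are ignored
    return all(percent_distance(a, b) <= max_dist
               for a, b in combinations(seqs, 2))
```

For example, three sequences that pairwise differ in at most one of four positions pass with a max. distance of 25, but fail with a max. distance of 10.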
| 30 | |
|---|
| 31 | |
|---|
| 32 | Press 'Detect clusters' to start the cluster detection. |
|---|
| 33 | |
|---|
| 34 | The clusters matching the given |
|---|
| 35 | parameters will be displayed in the list below. |
|---|
| 36 | Each line contains the following information: |
|---|
| 37 | |
|---|
| 38 | - number of species in cluster |
|---|
| 39 | - mean distance [min. distance - max. distance] |
|---|
| 40 | - minimal bases used for distance calculation (weighted) |
|---|
| 41 | - a generated cluster description |
|---|
| 42 | |
|---|
| 43 | Each cluster contains one so-called 'representative'. |
|---|
| 44 | The representative is the species in the cluster with the smallest |
|---|
| 45 | mean distance to all other cluster members. |
|---|
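The definition of the representative can be sketched as follows. This is an illustrative Python sketch only: the function names and the way distances are stored (a dict keyed by unordered pairs of species names) are assumptions, not the data structures ARB uses. It expects a cluster with at least two members.

```python
def mean_distance_to_others(member, members, dist):
    """Mean distance from one member to all other cluster members.
    dist maps frozenset({a, b}) to the distance between a and b."""
    others = [m for m in members if m != member]
    return sum(dist[frozenset((member, o))] for o in others) / len(others)

def pick_representative(members, dist):
    """Member with the smallest mean distance to all other members."""
    return min(members,
               key=lambda m: mean_distance_to_others(m, members, dist))

# Hypothetical three-species cluster: B lies between A and C.
members = ["A", "B", "C"]
dist = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 3.0,
    frozenset(("B", "C")): 1.0,
}
representative = pick_representative(members, dist)
```

In this hypothetical cluster the mean distances are 2.0 for A, 1.0 for B and 2.0 for C, so B is chosen.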
| 46 | |
|---|
| 47 | SECTION Working with found clusters |
|---|
| 48 | |
|---|
| 49 | Marking |
|---|
| 50 | |
|---|
| 51 | You can mark the members of the currently selected cluster by clicking |
|---|
| 52 | on the 'Mark' button. Below that button you may select whether to mark |
|---|
| 53 | - all species in the cluster, |
|---|
| 54 | - all species in the cluster except the representative or |
|---|
| 55 | - only the representative. |
|---|
| 56 | |
|---|
| 57 | The second mode is useful when you plan to remove all but the |
|---|
| 58 | representative from the tree. |
|---|
| 59 | |
|---|
| 60 | You may also mark ALL clusters by clicking on the 'Mark all' button. |
|---|
| 61 | This is handy to expand all clusters in the tree or to load all clusters into the |
|---|
| 62 | sequence editor. |
|---|
| 63 | |
|---|
| 64 | Auto mark |
|---|
| 65 | |
|---|
| 66 | If you enable the 'Auto mark' toggle, ARB will automatically mark the cluster |
|---|
| 67 | as soon as you select it in the list. |
|---|
| 68 | |
|---|
| 69 | Selecting representative |
|---|
| 70 | |
|---|
| 71 | If this option is checked, the representative species of the selected cluster |
|---|
| 72 | will become the LINK{selected.hlp}. |
|---|
| 73 | |
|---|
| 74 | Storing intermediate results |
|---|
| 75 | |
|---|
| 76 | You may store the displayed clusters by either pressing |
|---|
| 77 | - 'Store selected' or |
|---|
| 78 | - 'Store all' |
|---|
| 79 | |
|---|
| 80 | The number of currently stored clusters will be displayed on |
|---|
| 81 | the restore button. By pressing that button, you can restore |
|---|
| 82 | these clusters. |
|---|
| 83 | |
|---|
| 84 | Press 'Swap stored' to exchange stored clusters with displayed |
|---|
| 85 | clusters. |
|---|
| 86 | |
|---|
| 87 | Storing results is useful for comparing the results of two cluster detections |
|---|
| 88 | with different parameters. |
|---|
| 89 | |
|---|
| 90 | Delete results |
|---|
| 91 | |
|---|
| 92 | You can delete results using 'Delete selected' or 'Clear list'. |
|---|
| 93 | |
|---|
| 94 | Cluster groups |
|---|
| 95 | |
|---|
| 96 | Create groups for found clusters |
|---|
| 97 | |
|---|
| 98 | NOTES The performance of the cluster detection is very sensitive to the parameters: |
|---|
| 99 | |
|---|
| 100 | - In short: big min. cluster size + small max. distance => faster calculation |
|---|
| 101 | - A minimum cluster size of 2 forces all sequences to be loaded. This consumes time and memory and |
|---|
| 102 | may render the calculation impossible. |
|---|
| 103 | - In contrast, a minimum cluster size of 10 loads only about 20% of the sequence |
|---|
| 104 | data (in the best case); a size of 20 loads only about 10% of the data. |
|---|
| 105 | |
|---|
| 106 | - The bigger the maximum allowed distance is, the more clusters will be found, |
|---|
| 107 | hence the more has to be calculated. |
|---|
| 108 | - So if you have no idea what distance to use, start with a low |
|---|
| 109 | distance (e.g. 0.01) and, if you don't find any clusters, increase |
|---|
| 110 | the distance stepwise. |
|---|
| 111 | |
|---|
| 112 | EXAMPLES One use case is to reduce a given tree by removing clones or very closely |
|---|
| 113 | related species, keeping only one of them as the representative of |
|---|
| 114 | the resulting OTU. |
|---|
| 115 | |
|---|
| 116 | Steps: |
|---|
| 117 | |
|---|
| 118 | - search clusters |
|---|
| 119 | - examine found clusters and delete from the list those you'd like to keep unchanged in the tree |
|---|
| 120 | - uncheck 'Mark representative' and click 'Mark all' |
|---|
| 121 | - in ARB_NTREE call 'Tree/Remove species from tree/Remove marked' |
|---|
| 122 | |
|---|
| 123 | Another use case is to create groups. |
|---|
| 124 | |
|---|
| 125 | If you choose higher values for the maximum distance allowed |
|---|
| 126 | in found clusters and for the minimum cluster size, the found |
|---|
| 127 | clusters might be good candidates to create groups. |
|---|
| 128 | |
|---|
| 129 | WARNINGS Be careful when the minimum distance reported for a cluster is zero. |
|---|
| 130 | This may have two reasons: |
|---|
| 131 | |
|---|
| 132 | - two sequences are identical (in filtered region) |
|---|
| 133 | - one sequence is empty (in filtered region) |
|---|
| 134 | |
|---|
| 135 | In the second case, the results are meaningless and the empty sequence |
|---|
| 136 | will be used as representative (which makes no sense). |
|---|
| 137 | |
|---|
| 138 | As a second indicator, the min. number of base positions used for distance |
|---|
| 139 | calculation is listed for each cluster. When this gets low or zero, the results |
|---|
| 140 | become more and more random. |
|---|
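The two warning signs above (a minimum distance of zero and a low number of base positions used) can be illustrated with a small Python sketch. It is not ARB code: it assumes that positions where either sequence has a gap character ('-' or '.') are skipped, and the hypothetical distance_with_support() returns the number of positions actually compared alongside the distance.

```python
GAP_CHARS = set("-.")  # assumption: these characters mark missing data

def distance_with_support(a: str, b: str) -> tuple[float, int]:
    """Return (percent distance, number of positions compared).
    Positions where either sequence has a gap are skipped."""
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        # Nothing to compare: the distance of 0 carries no information.
        return 0.0, 0
    return 100.0 * mismatches / compared, compared
```

Under these assumptions an empty (all-gap) sequence has distance 0 to every other sequence, with 0 positions compared. That is why such a sequence would end up as the representative, and why a low position count should be treated as a sign of an unreliable distance.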
| 141 | |
|---|
| 142 | BUGS No bugs known |
|---|