source: branches/profile/GDEHELP/HELP_PLAIN/CAP2.help

Last change on this file was 10842, checked in by westram, 11 years ago
  • Property svn:eol-style set to native
  • Property svn:keywords set to Author Date Id Revision
File size: 7.6 KB
Line 
1
2
3 CONTIG ASSEMBLY PROGRAM (CAP)
4
5   copyright (c) 1991  Xiaoqiu Huang
6
7   The distribution of the program is granted provided no charge
8   is made and the copyright notice is included.
9
10   Proper attribution of the author as the source of the software
11   would be appreciated:
12
13   "A Contig Assembly Program Based on Sensitive Detection of
14   Fragment Overlaps" (submitted to Genomics, 1991)
15
16        Xiaoqiu Huang
17        Department of Computer Science
18        Michigan Technological University
19        Houghton, MI 49931
20        E-mail:  huang@cs.mtu.edu
21
22   The CAP program uses a dynamic programming algorithm to compute
23   the maximal-scoring overlapping alignment between two fragments.
24   Fragments in random orientations are assembled into contigs by a
25   greedy approach in order of the overlap scores. CAP is efficient
26   in computer memory: a large number of arbitrarily long fragments
27   can be assembled. The time requirement is acceptable; for example,
28   CAP took 4 hours to assemble 1015 fragments of a total of 252 kb
29   nucleotides on a Sun SPARCstation SLC. The program is written in C
30   and runs on Sun workstations.
31
32   Below is a description of the parameters in the #define section of CAP.
33   Two specially chosen sets of substitution scores and indel penalties
34   are used by the dynamic programming algorithm: heavy set for regions
35   of low sequencing error rates and light set for fragment ends of high
36   sequencing error rates. (Use integers only.)
37
38        Heavy set:                       Light set:
39
40        MATCH     =  2                   MATCH     =  2
41        MISMAT    = -6                   LTMISM    = -3
42        EXTEND    =  4                   LTEXTEN   =  2
43
44    In the initial assembly, any overlap must be of length at least OVERLEN,
45    and any overlap/containment must be of identity percentage at least
46    PERCENT. After the initial assembly, the program attempts to join
47    contigs together using weak overlaps. Two contigs are merged if the
48    score of the overlapping alignment is at least CUTOFF. The value for
49    CUTOFF is chosen according to the value for MATCH.
50
51    DELTA is a parameter in necessary conditions for overlap/containment.
52    Those conditions are used to quickly reject pairs of fragments that
53    could not possibly have an overlap/containment relationship.
54    The dynamic programming algorithm is only applied to pairs of fragments
55    that pass the screening. A large value for DELTA means stringent
56    conditions, where the value for DELTA is a real number at least 8.0.
57
58    POS5 and POS3 are fragment positions such that the 5' end between base 1
59    and base POS5, and the 3' end after base POS3 are of high sequencing
60    error rates, say more than 5%. For mismatches and indels occurring in
61    the two ends, light penalties are used.
62
63    A file of input fragments looks like:
64
65      |  >G019uabh
66      |  ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
67      |  GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
68      |  TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG
69      |  GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT
70      |  AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT
71      |  ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
72      |  AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT
73      |  TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
74      |  >G028uaah
75      |  CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
76      |  TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
77      |  TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC
78      |  TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC
79      |  ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT
80      |  TCCCCATCCCATCAGTCT
81      |  >G022uabh
82      |  TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG
83      |  TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC
84      |  CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA
85      |  GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC
86      |  ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT
87      |  GATTGAT
88      |  >G023uabh
89      |  AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA
90      |  ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA
91      |  AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC
92      |  GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT
93      |  GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT
94      |  TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT
95      |  ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC
96      |  >G006uaah
97      |  ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG
98      |  CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT
99      |  AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG
100      |  GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT
101      |  GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC
102
103   A string after ">" is the name of the following fragment.
104   Only the five upper-case letters A, C, G, T and N are allowed
105   to appear in fragment data. No other characters are allowed.
106   A common mistake is the use of lower case letters in a fragment.
107
108   To run the program, type a command of form
109
110        cap  file_of_fragments
111
112   The output goes to the terminal screen. So redirection of the
113   output into a file is necessary. The output consists of three parts:
114   overview of contigs at fragment level, detailed display of contigs
115   at nucleotide level, and consensus sequences.
116   '+' = direct orientation; '-' = reverse complement
117   The output of CAP on the sample input data looks like:
118
119      |  #Contig 1
120      | 
121      |  #G022uabh+(0)
122      |  TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
123      |  GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
124      |  ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
125      |  AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
126      |  ATTGATTGATTGATTGAT
127      |  #G028uaah+(145)
128      |  CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
129      |  A-TTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTAC
130      |  AGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATAC
131      |  ATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTC
132      |  TATAGCCTCCTTCCCCATCCC-ATCAGTCT
133      |  #G019uabh+(120)
134      |  ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
135      |  AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
136      |  ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
137      |  AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
138      |  GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
139      |  AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
140      |  AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC
141      |  #G023uabh-(426)
142      |  GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGCAAAAAC
143      |  GAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAG
144      |  ATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTT
145      |  TTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAG
146      |  CATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAAAACAGT
147      |  TTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
148      |  #G006uaah-(496)
149      |  GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAGATCAATTCTG
150      |  AGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTTTTTCCTATTT
151      |  GTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAGCATGACATCT
152      |  TAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAAAACAGTTTATTTTATG
153      |  T
154
155
156
157   Slight modifications by S. Smith on Mon Feb 17 10:18:34 EST 1992.
158   These changes allow for command line arguments for several
159   of the hard coded parameters, as well as a slight modification to
160   the output routine to support GDE format.  Changes are commented
161   as: Mod by S.S.
Note: See TracBrowser for help on using the repository browser.