1 | |
---|
2 | |
---|
3 | CONTIG ASSEMBLY PROGRAM (CAP) |
---|
4 | |
---|
5 | copyright (c) 1991 Xiaoqiu Huang |
---|
6 | |
---|
7 | The distribution of the program is granted provided no charge |
---|
8 | is made and the copyright notice is included. |
---|
9 | |
---|
10 | Proper attribution of the author as the source of the software |
---|
11 | would be appreciated: |
---|
12 | |
---|
13 | "A Contig Assembly Program Based on Sensitive Detection of |
---|
14 | Fragment Overlaps" (submitted to Genomics, 1991) |
---|
15 | |
---|
16 | Xiaoqiu Huang |
---|
17 | Department of Computer Science |
---|
18 | Michigan Technological University |
---|
19 | Houghton, MI 49931 |
---|
20 | E-mail: huang@cs.mtu.edu |
---|
21 | |
---|
22 | The CAP program uses a dynamic programming algorithm to compute |
---|
23 | the maximal-scoring overlapping alignment between two fragments. |
---|
24 | Fragments in random orientations are assembled into contigs by a |
---|
25 | greedy approach in order of the overlap scores. CAP is efficient |
---|
26 | in computer memory: a large number of arbitrarily long fragments |
---|
27 | can be assembled. The time requirement is acceptable; for example, |
---|
28 | CAP took 4 hours to assemble 1015 fragments of a total of 252 kb |
---|
29 | nucleotides on a Sun SPARCstation SLC. The program is written in C |
---|
30 | and runs on Sun workstations. |
---|
31 | |
---|
32 | Below is a description of the parameters in the #define section of CAP. |
---|
33 | Two specially chosen sets of substitution scores and indel penalties |
---|
34 | are used by the dynamic programming algorithm: heavy set for regions |
---|
35 | of low sequencing error rates and light set for fragment ends of high |
---|
36 | sequencing error rates. (Use integers only.) |
---|
37 | |
---|
38 | Heavy set: Light set: |
---|
39 | |
---|
40 | MATCH = 2 MATCH = 2 |
---|
41 | MISMAT = -6 LTMISM = -3 |
---|
42 | EXTEND = 4 LTEXTEN = 2 |
---|
43 | |
---|
44 | In the initial assembly, any overlap must be of length at least OVERLEN, |
---|
45 | and any overlap/containment must be of identity percentage at least |
---|
46 | PERCENT. After the initial assembly, the program attempts to join |
---|
47 | contigs together using weak overlaps. Two contigs are merged if the |
---|
48 | score of the overlapping alignment is at least CUTOFF. The value for |
---|
49 | CUTOFF is chosen according to the value for MATCH. |
---|
50 | |
---|
51 | DELTA is a parameter in necessary conditions for overlap/containment. |
---|
52 | Those conditions are used to quickly reject pairs of fragments that |
---|
53 | could not possibly have an overlap/containment relationship. |
---|
54 | The dynamic programming algorithm is only applied to pairs of fragments |
---|
55 | that pass the screening. A large value for DELTA means stringent |
---|
56 | conditions, where the value for DELTA is a real number at least 8.0. |
---|
57 | |
---|
58 | POS5 and POS3 are fragment positions such that the 5' end between base 1 |
---|
59 | and base POS5, and the 3' end after base POS3 are of high sequencing |
---|
60 | error rates, say more than 5%. For mismatches and indels occurring in |
---|
61 | the two ends, light penalties are used. |
---|
62 | |
---|
63 | A file of input fragments looks like: |
---|
64 | |
---|
65 | >G019uabh |
---|
66 | ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA |
---|
67 | GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC |
---|
68 | TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG |
---|
69 | GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT |
---|
70 | AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT |
---|
71 | ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC |
---|
72 | AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT |
---|
73 | TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC |
---|
74 | >G028uaah |
---|
75 | CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT |
---|
76 | TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT |
---|
77 | TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC |
---|
78 | TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC |
---|
79 | ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT |
---|
80 | TCCCCATCCCATCAGTCT |
---|
81 | >G022uabh |
---|
82 | TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG |
---|
83 | TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC |
---|
84 | CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA |
---|
85 | GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC |
---|
86 | ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT |
---|
87 | GATTGAT |
---|
88 | >G023uabh |
---|
89 | AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA |
---|
90 | ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA |
---|
91 | AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC |
---|
92 | GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT |
---|
93 | GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT |
---|
94 | TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT |
---|
95 | ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC |
---|
96 | >G006uaah |
---|
97 | ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG |
---|
98 | CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT |
---|
99 | AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG |
---|
100 | GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT |
---|
101 | GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC |
---|
102 | |
---|
103 | A string after ">" is the name of the following fragment. |
---|
104 | Only the five upper-case letters A, C, G, T and N are allowed |
---|
105 | to appear in fragment data. No other characters are allowed. |
---|
106 | A common mistake is the use of lower case letters in a fragment. |
---|
107 | |
---|
108 | To run the program, type a command of form |
---|
109 | |
---|
110 | cap file_of_fragments |
---|
111 | |
---|
112 | The output goes to the terminal screen. So redirection of the |
---|
113 | output into a file is necessary. The output consists of three parts: |
---|
114 | overview of contigs at fragment level, detailed display of contigs |
---|
115 | at nucleotide level, and consensus sequences. |
---|
116 | '+' = direct orientation; '-' = reverse complement |
---|
117 | The output of CAP on the sample input data looks like: |
---|
118 | |
---|
119 | #Contig 1 |
---|
120 | |
---|
121 | #G022uabh+(0) |
---|
122 | TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG |
---|
123 | GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC |
---|
124 | ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG |
---|
125 | AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG |
---|
126 | ATTGATTGATTGATTGAT |
---|
127 | #G028uaah+(145) |
---|
128 | CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA |
---|
129 | A-TTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTAC |
---|
130 | AGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATAC |
---|
131 | ATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTC |
---|
132 | TATAGCCTCCTTCCCCATCCC-ATCAGTCT |
---|
133 | #G019uabh+(120) |
---|
134 | ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG |
---|
135 | AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG |
---|
136 | ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC |
---|
137 | AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT |
---|
138 | GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC |
---|
139 | AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC |
---|
140 | AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC |
---|
141 | #G023uabh-(426) |
---|
142 | GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGCAAAAAC |
---|
143 | GAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAG |
---|
144 | ATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTT |
---|
145 | TTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAG |
---|
146 | CATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAAAACAGT |
---|
147 | TTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT |
---|
148 | #G006uaah-(496) |
---|
149 | GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAGATCAATTCTG |
---|
150 | AGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTTTTTCCTATTT |
---|
151 | GTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAGCATGACATCT |
---|
152 | TAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAAAACAGTTTATTTTATG |
---|
153 | T |
---|
154 | |
---|
155 | |
---|
156 | |
---|
157 | Slight modifications by S. Smith on Mon Feb 17 10:18:34 EST 1992. |
---|
158 | These changes allow for command line arguments for several |
---|
159 | of the hard coded parameters, as well as a slight modification to |
---|
160 | the output routine to support GDE format. Changes are commented |
---|
161 | as: Mod by S.S. |
---|