CONTIG ASSEMBLY PROGRAM (CAP)

  copyright (c) 1991  Xiaoqiu Huang

  The distribution of the program is granted provided no charge
  is made and the copyright notice is included.

  Proper attribution of the author as the source of the software
  would be appreciated:

  "A Contig Assembly Program Based on Sensitive Detection of
  Fragment Overlaps" (submitted to Genomics, 1991)

       Xiaoqiu Huang
       Department of Computer Science
       Michigan Technological University
       Houghton, MI 49931
       E-mail:  huang@cs.mtu.edu

  The CAP program uses a dynamic programming algorithm to compute
  the maximal-scoring overlapping alignment between two fragments.
  Fragments in random orientations are assembled into contigs by a
  greedy approach in order of the overlap scores. CAP is
  memory-efficient: a large number of arbitrarily long fragments
  can be assembled. The time requirement is acceptable; for example,
  CAP took 4 hours to assemble 1015 fragments totalling 252 kb
  on a Sun SPARCstation SLC. The program is written in C
  and runs on Sun workstations.
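
  To make the idea of an overlapping alignment concrete, here is a
  minimal Python sketch (not CAP's C code). It scores the best
  alignment between a suffix of one fragment and a prefix of another,
  keeping only two rows of the table in memory. The function name and
  the linear per-base indel cost are assumptions made for illustration;
  the default scores are taken from the heavy set described in this file.

```python
def overlap_score(a, b, match=2, mismatch=-6, indel=-4):
    """Best score of aligning some suffix of `a` with some prefix of `b`.

    Illustrative only: a linear indel cost is assumed, which may differ
    from the gap model used inside CAP.
    """
    # Row for the empty suffix of `a`: consuming j bases of `b` costs j indels.
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # cur[0] = 0: skipping a prefix of `a` is free in an overlap alignment.
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                         prev[j] + indel,     # a[i-1] against a gap
                         cur[j - 1] + indel)  # b[j-1] against a gap
        prev = cur
    # The alignment must reach the end of `a`; the unused tail of `b` is free.
    return max(prev)


if __name__ == "__main__":
    print(overlap_score("GGGGACGT", "ACGTTTTT"))
```

  Only two rows are held at any time, so memory grows with the length
  of one fragment rather than with the product of both lengths.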

  Below is a description of the parameters in the #define section of CAP.
  Two specially chosen sets of substitution scores and indel penalties
  are used by the dynamic programming algorithm: a heavy set for regions
  with low sequencing error rates and a light set for fragment ends with
  high sequencing error rates. (Use integers only.)

       Heavy set:                       Light set:

       MATCH     =  2                   MATCH     =  2
       MISMAT    = -6                   LTMISM    = -3
       EXTEND    =  4                   LTEXTEN   =  2

  In the initial assembly, any overlap must have a length of at least
  OVERLEN, and any overlap/containment must have an identity percentage
  of at least PERCENT. After the initial assembly, the program attempts
  to join contigs together using weak overlaps. Two contigs are merged
  if the score of the overlapping alignment is at least CUTOFF. The value
  for CUTOFF is chosen according to the value for MATCH.
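
  The two acceptance rules above can be sketched in Python as follows.
  This is an illustration, not CAP's code: the function names are
  hypothetical, identity is assumed to be the share of matching columns
  in the alignment, and the threshold values in the example are made up
  because this file does not state the defaults.

```python
def passes_initial_filters(align_len, matches, overlen, percent,
                           is_containment=False):
    """Initial-assembly test: OVERLEN applies to overlaps only,
    PERCENT applies to overlaps and containments alike."""
    if align_len <= 0:
        return False
    identity = 100.0 * matches / align_len  # assumed definition of identity
    if not is_containment and align_len < overlen:
        return False
    return identity >= percent


def should_merge_contigs(score, cutoff):
    """Second phase: contigs are merged on alignment score alone."""
    return score >= cutoff


if __name__ == "__main__":
    # Hypothetical thresholds, chosen only for this example.
    print(passes_initial_filters(50, 48, overlen=20, percent=85))
    print(should_merge_contigs(35, cutoff=50))
```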

  DELTA is a parameter in necessary conditions for overlap/containment.
  Those conditions are used to quickly reject pairs of fragments that
  could not possibly have an overlap/containment relationship.
  The dynamic programming algorithm is applied only to pairs of fragments
  that pass the screening. A large value for DELTA means stringent
  conditions; the value for DELTA must be a real number of at least 8.0.

  POS5 and POS3 are fragment positions such that the 5' end, between
  base 1 and base POS5, and the 3' end, after base POS3, have high
  sequencing error rates, say more than 5%. Light penalties are used
  for mismatches and indels occurring in these two ends.
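
  As an illustration of how POS5 and POS3 select between the two score
  sets, the Python sketch below looks up the set that applies at a given
  1-based position of a single fragment. The function name, the
  dictionaries, the treatment of EXTEND and LTEXTEN as negative per-base
  indel costs, and the example values for POS5 and POS3 are assumptions;
  how CAP combines the positions of two aligned fragments is not
  described in this file and is not modelled here.

```python
# Scores from the parameter table; indel penalties are stored as negatives.
HEAVY = {"match": 2, "mismatch": -6, "indel": -4}
LIGHT = {"match": 2, "mismatch": -3, "indel": -2}


def penalties_at(pos, pos5, pos3):
    """Return the score set for 1-based position `pos` of a fragment.

    Bases 1..pos5 (5' end) and bases after pos3 (3' end) are treated as
    error-prone and get the light set; everything else gets the heavy set.
    """
    if pos <= pos5 or pos > pos3:
        return LIGHT
    return HEAVY


if __name__ == "__main__":
    # Hypothetical values: POS5 = 30, POS3 = 500.
    for p in (1, 30, 31, 500, 501):
        print(p, penalties_at(p, 30, 500)["mismatch"])
```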

  A file of input fragments looks like:

>G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG
GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT
AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT
ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT
TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
>G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC
TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC
ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT
TCCCCATCCCATCAGTCT
>G022uabh
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG
TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC
CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA
GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC
ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT
GATTGAT
>G023uabh
AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA
ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA
AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC
GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT
GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT
TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT
ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC
>G006uaah
ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG
CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT
AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG
GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT
GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC

  A string after ">" is the name of the following fragment.
  Only the five upper-case letters A, C, G, T and N are allowed
  to appear in fragment data. No other characters are allowed.
  A common mistake is the use of lower-case letters in a fragment.
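
  Because a stray lower-case letter is such a common mistake, it can be
  worth checking a fragment file before running CAP. The Python sketch
  below is one possible pre-check and is not part of CAP; the function
  name and the error messages are hypothetical. It reads ">name" headers
  and sequence lines and rejects any character other than A, C, G, T
  and N.

```python
ALLOWED = set("ACGTN")


def read_fragments(lines):
    """Parse '>name' headers and sequence lines into {name: sequence}.

    Raises ValueError on any character other than upper-case A, C, G, T, N.
    """
    parts = {}
    name = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in parts:
                raise ValueError(f"line {lineno}: duplicate name {name!r}")
            parts[name] = []
        else:
            if name is None:
                raise ValueError(f"line {lineno}: sequence before any '>' line")
            bad = sorted(set(line) - ALLOWED)
            if bad:
                raise ValueError(
                    f"line {lineno}: illegal characters {bad} in {name!r}")
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


if __name__ == "__main__":
    sample = [">f1", "ACGT", "NNAC", ">f2", "TTTT"]
    print(read_fragments(sample))
```

  To check a real file, pass an open file object instead of a list,
  for example read_fragments(open("file_of_fragments")).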

  To run the program, type a command of the form

       cap  file_of_fragments

  The output goes to the terminal screen, so redirection of the
  output into a file is necessary. The output consists of three parts:
  an overview of contigs at the fragment level, a detailed display of
  contigs at the nucleotide level, and consensus sequences.
  '+' = direct orientation; '-' = reverse complement
  The output of CAP on the sample input data looks like:

#Contig 1

#G022uabh+(0)
TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGAT
#G028uaah+(145)
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
A-TTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTAC
AGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATAC
ATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTC
TATAGCCTCCTTCCCCATCCC-ATCAGTCT
#G019uabh+(120)
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC
#G023uabh-(426)
GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGCAAAAAC
GAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAG
ATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTT
TTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAG
CATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAAAACAGT
TTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
#G006uaah-(496)
GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAGATCAATTCTG
AGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTTTTTCCTATTT
GTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAGCATGACATCT
TAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAAAACAGTTTATTTTATG
T
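
  When post-processing output like the sample above, the "#name+(n)"
  lines can be split into their parts, and the '-' orientation can be
  checked by reverse-complementing the input fragment. The Python sketch
  below is an illustration only: the function names and the regular
  expression are assumptions based on the sample, and the number in
  parentheses is returned as-is because this file does not define its
  meaning.

```python
import re

# Matches lines such as "#G023uabh-(426)"; "#Contig 1" does not match.
FRAGMENT_HEADER = re.compile(r"^#(\S+?)([+-])\((\d+)\)$")

# '-' (alignment gap) is not in the table, so it passes through unchanged.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def parse_fragment_header(line):
    """Return (name, orientation, number) or None if the line is not
    a fragment header in the sample's format."""
    m = FRAGMENT_HEADER.match(line.strip())
    if m is None:
        return None
    name, orientation, number = m.groups()
    return name, orientation, int(number)


def reverse_complement(seq):
    """Reverse complement of a fragment written with A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(parse_fragment_header("#G023uabh-(426)"))
    # First 17 bases of input fragment G006uaah; the result matches the
    # last 17 bases displayed for G006uaah- in the sample output.
    print(reverse_complement("ACATAAAATAAACTGTT"))
```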

  Slight modifications by S. Smith on Mon Feb 17 10:18:34 EST 1992.
  These changes allow for command line arguments for several
  of the hard coded parameters, as well as a slight modification to
  the output routine to support GDE format.  Changes are commented
  as: Mod by S.S.