debian/source/format
3.0 (quilt)

debian/copyright
Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: soapdenovo
Source: http://soap.genomics.org.cn/soapdenovo.html

Files: *
Copyright: 2008-2010 BGI-Shenzhen
License: GPL-3+
 This package is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 3 of the License, or
 (at your option) any later version.
 .
 This package is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 GNU General Public License for more details.
 .
 You should have received a copy of the GNU General Public License
 along with this program. If not, see <http://www.gnu.org/licenses/>
 .
 On Debian systems, the complete text of the GNU General
 Public License version 3 can be found in "/usr/share/common-licenses/GPL-3".

Files: debian/*
Copyright: 2012 Olivier Sallou
License: GPL-2+
 This package is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
 the Free Software Foundation; either version 2 of the License, or
 (at your option) any later version.
 .
 This package is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 GNU General Public License for more details.
 .
 You should have received a copy of the GNU General Public License
 along with this program. If not, see <http://www.gnu.org/licenses/>
 .
 On Debian systems, the complete text of the GNU General
 Public License version 2 can be found in "/usr/share/common-licenses/GPL-2".

debian/soapdenovo-63mer.1
.TH soapdenovo 1 "July 30, 2012" "version 1.1.0" "USER COMMANDS"
.SH NAME
soapdenovo \- Short-read assembly method that can build a de novo draft assembly
.SH SYNOPSIS
.B soapdenovo\-31mer soapdenovo\-63mer soapdenovo\-127mer
.SH Introduction
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
.P
1) Support for large K-mers, up to 127, to make use of long reads. Three versions are provided.
.br
I. The 31mer version supports only K-mers <=31.
.br
II. The 63mer version supports only K-mers <=63 and consumes twice as much memory as the 31mer version, even when used with K-mers <=31.
.br
III. The 127mer version supports only K-mers <=127 and consumes twice as much memory as the 63mer version, even when used with K-mers <=63.
.br
Note that with a longer K-mer the number of nodes decreases significantly, so memory consumption usually less than doubles when moving up to the next version.
.P
2) A new parameter, \-a, was added to the "pregraph" module. It sets the amount of memory, in GB, to allocate up front so that no further reallocation is needed. Without reallocation SOAPdenovo runs faster, and it can be given all the memory of the machine. For example, if the workstation has 50 GB of free memory, use \-a 50 in the pregraph step; a static 50 GB is then allocated before the reads are processed. This also keeps the run from being interrupted by other users sharing the same machine.
.P
3) Gap-filled bases are now represented by lowercase characters in the 'scafSeq' file.
.P
4) SIMD instructions were introduced to boost performance.
.SH Configuration file
For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and gives the relevant information about them. “example.config” (Appendix A) is an example of such a file.
.P
The configuration file has a section for global information, followed by multiple library sections. Right now only “max_rd_len” is included in the global information section. Any read longer than max_rd_len will be cut to this length.
.P
The library information and the information about the sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with the tag [LIB] and includes the following items:
.TP
avg_ins
This value indicates the average insert size of this library, or the peak position in the insert size distribution.
.TP
reverse_seq
This option takes the value 0 or 1. It tells the assembler whether the read sequences need to be reverse-complemented.
Illumina GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with a typical insert size of less than 500 bp; b) forward-forward, generated from circularizing libraries with a typical insert size greater than 2 Kb. Set “reverse_seq” accordingly: 0 for forward-reverse, 1 for forward-forward.
.TP
asm_flags
This indicator decides in which part(s) of the assembly the reads are used. It takes the value 1 (only contig assembly), 2 (only scaffold assembly), 3 (both contig and scaffold assembly), or 4 (only gap closure).
.TP
rd_len_cutoff
The assembler will cut the reads from the current library to this length.
.TP
rank
It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same “rank” are used at the same time during scaffold assembly.
.TP
pair_num_cutoff
This parameter is the minimum number of read pairs required for a reliable connection between two contigs or pre-scaffolds.
.TP
map_len
This takes effect in the “map” step and is the minimum alignment length between a read and a contig required for a reliable read location.
.P
The assembler accepts read files in two formats: FASTA or FASTQ. A mate-pair relationship can be indicated in two ways: two sequence files with the reads of a pair in the same order, or two adjacent reads in a single file (FASTA only).
.P
In the configuration file, single-end files are indicated by “f=/path/filename” or “q=/path/filename” for FASTA or FASTQ format, respectively. Paired reads in two FASTA files are indicated by “f1=” and “f2=”, and paired reads in two FASTQ files by “q1=” and “q2=”. Paired reads in a single FASTA file are indicated by the “p=” item.
.P
All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
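.P
The layout described above (one global item, then one [LIB] section per library) can be sketched in a few lines of Python. This is only an illustration, not part of SOAPdenovo: the helper name render_config and the file paths are assumptions made up for the example, while the item names and values follow Appendix A.
.nf

```python
# Hypothetical helper for illustration; it is not part of SOAPdenovo.
def render_config(max_rd_len, libraries):
    """Return the lines of a configuration file: the global item, then one [LIB] per library."""
    lines = [f"max_rd_len={max_rd_len}"]
    for library in libraries:
        lines.append("[LIB]")
        # Items are written in the order given; items left out are simply not written.
        lines.extend(f"{key}={value}" for key, value in library.items())
    return lines


# Item values follow Appendix A; the paths are placeholders, not real files.
short_insert = {
    "avg_ins": 200,
    "reverse_seq": 0,
    "asm_flags": 3,
    "rank": 1,
    "q1": "/path/LIBNAMEA/fastq_read_1.fq",
    "q2": "/path/LIBNAMEA/fastq_read_2.fq",
}
long_insert = {
    "avg_ins": 2000,
    "reverse_seq": 1,
    "asm_flags": 2,
    "rank": 2,
    "q1": "/path/LIBNAMEB/fastq_read_1.fq",
    "q2": "/path/LIBNAMEB/fastq_read_2.fq",
}

config_lines = render_config(50, [short_insert, long_insert])
for line in config_lines:
    print(line)
```

.fi
Redirecting the printed lines to a file gives a configuration that can be passed with \-s. Keeping each library as its own mapping mirrors the rule that every item in a library section is optional.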
.SH Get it started
Once the configuration file is available, a typical way to run the assembler is shown below, where ${bin} is one of the binaries listed in SYNOPSIS:
.P
${bin} all \-s config_file \-K 63 \-R \-o graph_prefix
.P
The assembly process can also be run step by step:
.P
${bin} pregraph \-s config_file \-K 63 [\-R \-d \-p \-a] \-o graph_prefix
.br
${bin} contig \-g graph_prefix [\-R \-M 1 \-D]
.br
${bin} map \-s config_file \-g graph_prefix [\-p]
.br
${bin} scaff \-g graph_prefix [\-F \-u \-G \-p]
.SH Options
.TP
\-a INT
amount of memory (GB) to allocate up front to avoid further reallocation
.TP
\-s STR
configuration file
.TP
\-o STR
output graph file prefix
.TP
\-g STR
input graph file prefix
.TP
\-K INT
K-mer size [default 23, min 13, max 127]
.TP
\-p INT
number of threads [default 8]
.TP
\-R
use reads to resolve tiny repeats [default no]
.TP
\-d INT
remove low-frequency K-mers with frequency no larger than INT [default 0]
.TP
\-D INT
remove edges with coverage no larger than INT [default 1]
.TP
\-M INT
strength of merging similar sequences during contiging [default 1, min 0, max 3]
.TP
\-F
intra-scaffold gap closure [default no]
.TP
\-u
un-mask high coverage contigs before scaffolding [default mask]
.TP
\-G INT
allowed length difference between estimated and filled gap
.TP
\-L
minimum contig length used for scaffolding
.SH Output files
These files are output as assembly results:
.IP "a. *.contig"
.IP "contig sequences without using mate pair information"
.IP "b. *.scafSeq"
.IP "scaffold sequences (final contig sequences can be extracted by breaking down scaffold sequences at gap regions)"
.P
Other files that provide useful information for advanced users are listed in Appendix B.
.SH FAQ
.SS "How to set K-mer size?"
The program accepts odd numbers from 13 up to the limit of the binary in use (31, 63 or 127). Larger K-mers have a higher rate of uniqueness in the genome and make the graph simpler, but they require greater sequencing depth and longer reads to guarantee overlap at every genomic location.
.SS "How to set library rank?"
SOAPdenovo uses the paired-end libraries in order of insert size, from smaller to larger, to construct scaffolds. Libraries with the same rank are used at the same time. For example, for a human genome dataset we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. The pairs in each rank should provide adequate physical coverage of the genome.
.SH "APPENDIX A: an example.config"
#maximal read length
.br
max_rd_len=50
.br
[LIB]
.br
#average insert size
.br
avg_ins=200
.br
#if sequence needs to be reversed
.br
reverse_seq=0
.br
#in which part(s) the reads are used
.br
asm_flags=3
.br
#use only first 50 bps of each read
.br
rd_len_cutoff=50
.br
#in which order the reads are used while scaffolding
.br
rank=1
.br
# cutoff of pair number for a reliable connection (default 3)
.br
pair_num_cutoff=3
.br
#minimum aligned length to contigs for a reliable read location (default 32)
.br
map_len=32
.br
#fastq file for read 1
.br
q1=/path/**LIBNAMEA**/fastq_read_1.fq
.br
#fastq file for read 2 always follows fastq file for read 1
.br
q2=/path/**LIBNAMEA**/fastq_read_2.fq
.br
#fasta file for read 1
.br
f1=/path/**LIBNAMEA**/fasta_read_1.fa
.br
#fasta file for read 2 always follows fasta file for read 1
.br
f2=/path/**LIBNAMEA**/fasta_read_2.fa
.br
#fastq file for single reads
.br
q=/path/**LIBNAMEA**/fastq_read_single.fq
.br
#fasta file for single reads
.br
f=/path/**LIBNAMEA**/fasta_read_single.fa
.br
#a single fasta file for paired reads
.br
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
.br
[LIB]
.br
avg_ins=2000
.br
reverse_seq=1
.br
asm_flags=2
.br
rank=2
.br
# cutoff of pair number for a reliable connection
.br
#(default 5 for large insert size)
.br
pair_num_cutoff=5
.br
#minimum aligned length to contigs for a reliable read location
.br
#(default 35 for large insert size)
.br
map_len=35
.br
q1=/path/**LIBNAMEB**/fastq_read_1.fq
.br
q2=/path/**LIBNAMEB**/fastq_read_2.fq
.br
q=/path/**LIBNAMEB**/fastq_read_single.fq
.br
f=/path/**LIBNAMEB**/fasta_read_single.fa
.br
.SH "Appendix B: output files"
1. Output files from the command “pregraph”
.IP "a. *.kmerFreq"
.IP "Each row shows the number of K-mers whose frequency equals the row number."
.IP "b. *.edge"
.IP "Each record gives the information of an edge in the pre-graph: length, K-mers at both ends, average K-mer coverage, whether it is identical to its reverse complement, and the sequence."
.IP "c. *.markOnEdge & *.path"
.IP "These two files are used when reads are used to resolve small repeats."
.IP "d. *.preArc"
.IP "Connections between edges that are established by the read paths."
.IP "e. *.vertex"
.IP "K-mers at the ends of edges."
.IP "f. *.preGraphBasic"
.IP "Some basic information about the pre-graph: number of vertices, K value, number of edges, maximum read length, etc."
.P
2. Output files from the command “contig”
.IP "a. *.contig"
.IP "Contig information: corresponding edge index, length, K-mer coverage, whether it is a tip, and the sequence. Only one of a contig and its reverse-complementary counterpart is included. The index of each reverse-complementary contig is given in the *.ContigIndex file."
.IP "b. *.Arc"
.IP "Arcs coming out of each edge and their corresponding coverage by reads."
.IP "c. *.updated.edge"
.IP "Some information for each edge in the graph: length, K-mers at both ends, and the index difference between the reverse-complementary edge and this one."
.IP "d. *.ContigIndex"
.IP "Each record gives information about a contig in the *.contig file: its edge index, its length, and the index difference between its reverse-complementary counterpart and itself."
.P
3. Output files from the command “map”
.IP "a. *.peGrads"
.IP "Information for each clone library: insert size, read index upper bound, rank, and pair number cutoff for a reliable link."
.IP "This file can be revised manually to tune scaffolding."
.IP "b. *.readOnContig"
.IP "Read locations on contigs. Here contigs are referred to by their edge index. However, about half of them are not listed in the *.contig file, because their reverse-complementary counterparts are already included."
.IP "c. *.readInGap"
.IP "This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds."
.P
4. Output files from the command “scaff”
.IP "a. *.newContigIndex"
.IP "Contigs are sorted by length before scaffolding. Their new indices are listed in this file. This is useful for matching contigs in *.contig with those in *.links."
.IP "b. *.links"
.IP "Links between contigs that are established by read pairs. New indices are used."
.IP "c. *.scaf_gap"
.IP "Contigs in gaps, found through the contig graph output by the contiging procedure. New indices are used here."
.IP "d. *.scaf"
.IP "Contigs for each scaffold: contig index (concordant with the index in *.contig), approximate start position on the scaffold, orientation, contig length, and its links to others."
.IP "e. *.gapSeq"
.IP "Gap sequences between contigs."
.IP "f. *.scafSeq"
.IP "Sequence of each scaffold."
.SH AUTHOR
Olivier Sallou (olivier.sallou (at) irisa.fr) - Man page and packaging

debian/soapdenovo.manpages
debian/soapdenovo-31mer.1
debian/soapdenovo-63mer.1
debian/soapdenovo-127mer.1

debian/compat
9

debian/watch
# Compulsory line, this is a version 3 file
version=3
opts=filenamemangle=s/down\/SOAPdenovo-V/soapdenovo-/ http://soap.genomics.org.cn/soapdenovo.html#down2 down/SOAPdenovo-V(.*)\.src\.tgz

debian/docs
MANUAL

debian/README.Debian
* SOAPdenovo

Binaries are soapdenovo-31mer, soapdenovo-63mer and soapdenovo-127mer.
See the manpage for further explanations.

SOAPdenovo 1.x is no longer maintained by upstream.
Please consider using soapdenovo2.

 -- Olivier Sallou  Mon, 30 Jul 2012 15:47:51 +0200

debian/soapdenovo-31mer.1
.TH soapdenovo 1 "July 30, 2012" "version 1.1.0" "USER COMMANDS"
.SH NAME
soapdenovo \- Short-read assembly method that can build a de novo draft assembly
.SH SYNOPSIS
.B soapdenovo\-31mer soapdenovo\-63mer soapdenovo\-127mer
.SH Introduction
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
.P
1) Support for large K-mers, up to 127, to make use of long reads. Three versions are provided.
.br
I. The 31mer version supports only K-mers <=31.
.br
II. The 63mer version supports only K-mers <=63 and consumes twice as much memory as the 31mer version, even when used with K-mers <=31.
.br
III. The 127mer version supports only K-mers <=127 and consumes twice as much memory as the 63mer version, even when used with K-mers <=63.
.br
Note that with a longer K-mer the number of nodes decreases significantly, so memory consumption usually less than doubles when moving up to the next version.
.P
2) A new parameter, \-a, was added to the "pregraph" module. It sets the amount of memory, in GB, to allocate up front so that no further reallocation is needed. Without reallocation SOAPdenovo runs faster, and it can be given all the memory of the machine. For example, if the workstation has 50 GB of free memory, use \-a 50 in the pregraph step; a static 50 GB is then allocated before the reads are processed. This also keeps the run from being interrupted by other users sharing the same machine.
.P
3) Gap-filled bases are now represented by lowercase characters in the 'scafSeq' file.
.P
4) SIMD instructions were introduced to boost performance.
.SH Configuration file
For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and gives the relevant information about them. “example.config” (Appendix A) is an example of such a file.
.P
The configuration file has a section for global information, followed by multiple library sections. Right now only “max_rd_len” is included in the global information section. Any read longer than max_rd_len will be cut to this length.
.P
The library information and the information about the sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with the tag [LIB] and includes the following items:
.TP
avg_ins
This value indicates the average insert size of this library, or the peak position in the insert size distribution.
.TP
reverse_seq
This option takes the value 0 or 1. It tells the assembler whether the read sequences need to be reverse-complemented.
Illumina GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with a typical insert size of less than 500 bp; b) forward-forward, generated from circularizing libraries with a typical insert size greater than 2 Kb. Set “reverse_seq” accordingly: 0 for forward-reverse, 1 for forward-forward.
.TP
asm_flags
This indicator decides in which part(s) of the assembly the reads are used. It takes the value 1 (only contig assembly), 2 (only scaffold assembly), 3 (both contig and scaffold assembly), or 4 (only gap closure).
.TP
rd_len_cutoff
The assembler will cut the reads from the current library to this length.
.TP
rank
It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same “rank” are used at the same time during scaffold assembly.
.TP
pair_num_cutoff
This parameter is the minimum number of read pairs required for a reliable connection between two contigs or pre-scaffolds.
.TP
map_len
This takes effect in the “map” step and is the minimum alignment length between a read and a contig required for a reliable read location.
.P
The assembler accepts read files in two formats: FASTA or FASTQ. A mate-pair relationship can be indicated in two ways: two sequence files with the reads of a pair in the same order, or two adjacent reads in a single file (FASTA only).
.P
In the configuration file, single-end files are indicated by “f=/path/filename” or “q=/path/filename” for FASTA or FASTQ format, respectively. Paired reads in two FASTA files are indicated by “f1=” and “f2=”, and paired reads in two FASTQ files by “q1=” and “q2=”. Paired reads in a single FASTA file are indicated by the “p=” item.
.P
All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
.SH Get it started
Once the configuration file is available, a typical way to run the assembler is shown below, where ${bin} is one of the binaries listed in SYNOPSIS:
.P
${bin} all \-s config_file \-K 63 \-R \-o graph_prefix
.P
The assembly process can also be run step by step:
.P
${bin} pregraph \-s config_file \-K 63 [\-R \-d \-p \-a] \-o graph_prefix
.br
${bin} contig \-g graph_prefix [\-R \-M 1 \-D]
.br
${bin} map \-s config_file \-g graph_prefix [\-p]
.br
${bin} scaff \-g graph_prefix [\-F \-u \-G \-p]
.SH Options
.TP
\-a INT
amount of memory (GB) to allocate up front to avoid further reallocation
.TP
\-s STR
configuration file
.TP
\-o STR
output graph file prefix
.TP
\-g STR
input graph file prefix
.TP
\-K INT
K-mer size [default 23, min 13, max 127]
.TP
\-p INT
number of threads [default 8]
.TP
\-R
use reads to resolve tiny repeats [default no]
.TP
\-d INT
remove low-frequency K-mers with frequency no larger than INT [default 0]
.TP
\-D INT
remove edges with coverage no larger than INT [default 1]
.TP
\-M INT
strength of merging similar sequences during contiging [default 1, min 0, max 3]
.TP
\-F
intra-scaffold gap closure [default no]
.TP
\-u
un-mask high coverage contigs before scaffolding [default mask]
.TP
\-G INT
allowed length difference between estimated and filled gap
.TP
\-L
minimum contig length used for scaffolding
.SH Output files
These files are output as assembly results:
.IP "a. *.contig"
.IP "contig sequences without using mate pair information"
.IP "b. *.scafSeq"
.IP "scaffold sequences (final contig sequences can be extracted by breaking down scaffold sequences at gap regions)"
.P
Other files that provide useful information for advanced users are listed in Appendix B.
.SH FAQ
.SS "How to set K-mer size?"
The program accepts odd numbers from 13 up to the limit of the binary in use (31, 63 or 127). Larger K-mers have a higher rate of uniqueness in the genome and make the graph simpler, but they require greater sequencing depth and longer reads to guarantee overlap at every genomic location.
.SS "How to set library rank?"
SOAPdenovo uses the paired-end libraries in order of insert size, from smaller to larger, to construct scaffolds. Libraries with the same rank are used at the same time. For example, for a human genome dataset we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. The pairs in each rank should provide adequate physical coverage of the genome.
.SH "APPENDIX A: an example.config"
#maximal read length
.br
max_rd_len=50
.br
[LIB]
.br
#average insert size
.br
avg_ins=200
.br
#if sequence needs to be reversed
.br
reverse_seq=0
.br
#in which part(s) the reads are used
.br
asm_flags=3
.br
#use only first 50 bps of each read
.br
rd_len_cutoff=50
.br
#in which order the reads are used while scaffolding
.br
rank=1
.br
# cutoff of pair number for a reliable connection (default 3)
.br
pair_num_cutoff=3
.br
#minimum aligned length to contigs for a reliable read location (default 32)
.br
map_len=32
.br
#fastq file for read 1
.br
q1=/path/**LIBNAMEA**/fastq_read_1.fq
.br
#fastq file for read 2 always follows fastq file for read 1
.br
q2=/path/**LIBNAMEA**/fastq_read_2.fq
.br
#fasta file for read 1
.br
f1=/path/**LIBNAMEA**/fasta_read_1.fa
.br
#fasta file for read 2 always follows fasta file for read 1
.br
f2=/path/**LIBNAMEA**/fasta_read_2.fa
.br
#fastq file for single reads
.br
q=/path/**LIBNAMEA**/fastq_read_single.fq
.br
#fasta file for single reads
.br
f=/path/**LIBNAMEA**/fasta_read_single.fa
.br
#a single fasta file for paired reads
.br
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
.br
[LIB]
.br
avg_ins=2000
.br
reverse_seq=1
.br
asm_flags=2
.br
rank=2
.br
# cutoff of pair number for a reliable connection
.br
#(default 5 for large insert size)
.br
pair_num_cutoff=5
.br
#minimum aligned length to contigs for a reliable read location
.br
#(default 35 for large insert size)
.br
map_len=35
.br
q1=/path/**LIBNAMEB**/fastq_read_1.fq
.br
q2=/path/**LIBNAMEB**/fastq_read_2.fq
.br
q=/path/**LIBNAMEB**/fastq_read_single.fq
.br
f=/path/**LIBNAMEB**/fasta_read_single.fa
.br
.SH "Appendix B: output files"
1. Output files from the command “pregraph”
.IP "a. *.kmerFreq"
.IP "Each row shows the number of K-mers whose frequency equals the row number."
.IP "b. *.edge"
.IP "Each record gives the information of an edge in the pre-graph: length, K-mers at both ends, average K-mer coverage, whether it is identical to its reverse complement, and the sequence."
.IP "c. *.markOnEdge & *.path"
.IP "These two files are used when reads are used to resolve small repeats."
.IP "d. *.preArc"
.IP "Connections between edges that are established by the read paths."
.IP "e. *.vertex"
.IP "K-mers at the ends of edges."
.IP "f. *.preGraphBasic"
.IP "Some basic information about the pre-graph: number of vertices, K value, number of edges, maximum read length, etc."
.P
2. Output files from the command “contig”
.IP "a. *.contig"
.IP "Contig information: corresponding edge index, length, K-mer coverage, whether it is a tip, and the sequence. Only one of a contig and its reverse-complementary counterpart is included. The index of each reverse-complementary contig is given in the *.ContigIndex file."
.IP "b. *.Arc"
.IP "Arcs coming out of each edge and their corresponding coverage by reads."
.IP "c. *.updated.edge"
.IP "Some information for each edge in the graph: length, K-mers at both ends, and the index difference between the reverse-complementary edge and this one."
.IP "d. *.ContigIndex"
.IP "Each record gives information about a contig in the *.contig file: its edge index, its length, and the index difference between its reverse-complementary counterpart and itself."
.P
3. Output files from the command “map”
.IP "a. *.peGrads"
.IP "Information for each clone library: insert size, read index upper bound, rank, and pair number cutoff for a reliable link."
.IP "This file can be revised manually to tune scaffolding."
.IP "b. *.readOnContig"
.IP "Read locations on contigs. Here contigs are referred to by their edge index. However, about half of them are not listed in the *.contig file, because their reverse-complementary counterparts are already included."
.IP "c. *.readInGap"
.IP "This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds."
.P
4. Output files from the command “scaff”
.IP "a. *.newContigIndex"
.IP "Contigs are sorted by length before scaffolding. Their new indices are listed in this file. This is useful for matching contigs in *.contig with those in *.links."
.IP "b. *.links"
.IP "Links between contigs that are established by read pairs. New indices are used."
.IP "c. *.scaf_gap"
.IP "Contigs in gaps, found through the contig graph output by the contiging procedure. New indices are used here."
.IP "d. *.scaf"
.IP "Contigs for each scaffold: contig index (concordant with the index in *.contig), approximate start position on the scaffold, orientation, contig length, and its links to others."
.IP "e. *.gapSeq"
.IP "Gap sequences between contigs."
.IP "f. *.scafSeq"
.IP "Sequence of each scaffold."
.SH AUTHOR
Olivier Sallou (olivier.sallou (at) irisa.fr) - Man page and packaging

debian/install
bin/SOAPdenovo-31mer usr/bin/
bin/SOAPdenovo-63mer usr/bin/
bin/SOAPdenovo-127mer usr/bin/

debian/control
Source: soapdenovo
Section: science
Priority: optional
Build-Depends: debhelper (>= 9)
Maintainer: Debian Med Packaging Team
Uploaders: Olivier Sallou
Vcs-Svn: svn://anonscm.debian.org/debian-med/trunk/packages/soap/soapdenovo/trunk/
Vcs-Browser: http://anonscm.debian.org/viewvc/debian-med/trunk/packages/soap/soapdenovo/
Standards-Version: 3.9.4
Homepage: http://soap.genomics.org.cn/soapdenovo.html

Package: soapdenovo
Architecture: any-amd64 any-ppc64 any-ia64
#Architecture: any
Depends: ${shlibs:Depends}, ${misc:Depends}
Description: short-read assembly method to build de novo draft assembly
 SOAPdenovo is a novel short-read assembly method that can build a
 de novo draft assembly for the human-sized genomes.
 The program is specially designed to assemble Illumina GA short reads.
 .
 It creates new opportunities for building reference sequences and
 carrying out accurate analyses of unexplored genomes in a cost
 effective way.
 This version is not maintained anymore, consider using soapdenovo2.

debian/rules
#!/usr/bin/make -f
# -*- makefile -*-

# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1

%:
	dh $@

override_dh_install:
	dh_install
	cd debian/soapdenovo/usr/bin/;rename 's/SOAP/soap/' SOAP*

debian/upstream
Name: SOAPdenovo
Homepage: http://soap.genomics.org.cn/soapdenovo.html
Reference:
  Author: Ruiqiang Li and Hongmei Zhu and Jue Ruan and Wubin Qian and Xiaodong Fang and Zhongbin Shi and Yingrui Li and Shengting Li and Gao Shan and Karsten Kristiansen and Songgang Li and Huanming Yang and Jian Wang and Jun Wang
  Title: De novo assembly of human genomes with massively parallel short read sequencing
  Journal: Genome Research
  Year: 2009
  Volume: 20
  Number: 2
  Pages: 265-72
  PMID: 20019144
  DOI: 10.1101/gr.097261.109
  URL: http://genome.cshlp.org/content/20/2/265.abstract
  eprint: http://genome.cshlp.org/content/20/2/265.full.pdf+html

debian/soapdenovo-127mer.1
.TH soapdenovo 1 "July 30, 2012" "version 1.1.0" "USER COMMANDS"
.SH NAME
soapdenovo \- Short-read assembly method that can build a de novo draft assembly
.SH SYNOPSIS
.B soapdenovo\-31mer soapdenovo\-63mer soapdenovo\-127mer
.SH Introduction
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.
.P
1) Support for large K-mers, up to 127, to make use of long reads. Three versions are provided.
.br
I. The 31mer version supports only K-mers <=31.
.br
II. The 63mer version supports only K-mers <=63 and consumes twice as much memory as the 31mer version, even when used with K-mers <=31.
.br
III. The 127mer version supports only K-mers <=127 and consumes twice as much memory as the 63mer version, even when used with K-mers <=63.
.br
Note that with a longer K-mer the number of nodes decreases significantly, so memory consumption usually less than doubles when moving up to the next version.
.P
2) A new parameter, \-a, was added to the "pregraph" module. It sets the amount of memory, in GB, to allocate up front so that no further reallocation is needed. Without reallocation SOAPdenovo runs faster, and it can be given all the memory of the machine. For example, if the workstation has 50 GB of free memory, use \-a 50 in the pregraph step; a static 50 GB is then allocated before the reads are processed. This also keeps the run from being interrupted by other users sharing the same machine.
.P
3) Gap-filled bases are now represented by lowercase characters in the 'scafSeq' file.
.P
4) SIMD instructions were introduced to boost performance.
.SH Configuration file
For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and gives the relevant information about them. “example.config” (Appendix A) is an example of such a file.
.P
The configuration file has a section for global information, followed by multiple library sections. Right now only “max_rd_len” is included in the global information section. Any read longer than max_rd_len will be cut to this length.
.P
The library information and the information about the sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with the tag [LIB] and includes the following items:
.TP
avg_ins
This value indicates the average insert size of this library, or the peak position in the insert size distribution.
.TP
reverse_seq
This option takes the value 0 or 1. It tells the assembler whether the read sequences need to be reverse-complemented.
Illumina GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with a typical insert size of less than 500 bp; b) forward-forward, generated from circularizing libraries with a typical insert size greater than 2 Kb. Set “reverse_seq” accordingly: 0 for forward-reverse, 1 for forward-forward.
.TP
asm_flags
This indicator decides in which part(s) of the assembly the reads are used. It takes the value 1 (only contig assembly), 2 (only scaffold assembly), 3 (both contig and scaffold assembly), or 4 (only gap closure).
.TP
rd_len_cutoff
The assembler will cut the reads from the current library to this length.
.TP
rank
It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same “rank” are used at the same time during scaffold assembly.
.TP
pair_num_cutoff
This parameter is the minimum number of read pairs required for a reliable connection between two contigs or pre-scaffolds.
.TP
map_len
This takes effect in the “map” step and is the minimum alignment length between a read and a contig required for a reliable read location.
.P
The assembler accepts read files in two formats: FASTA or FASTQ. A mate-pair relationship can be indicated in two ways: two sequence files with the reads of a pair in the same order, or two adjacent reads in a single file (FASTA only).
.P
In the configuration file, single-end files are indicated by “f=/path/filename” or “q=/path/filename” for FASTA or FASTQ format, respectively. Paired reads in two FASTA files are indicated by “f1=” and “f2=”, and paired reads in two FASTQ files by “q1=” and “q2=”. Paired reads in a single FASTA file are indicated by the “p=” item.
.P
All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
.SH Get it started
Once the configuration file is available, a typical way to run the assembler is:
.P
${bin} all \-s config_file \-K 63 \-R \-o graph_prefix
.P
Users can also choose to run the assembly process step by step:
.P
${bin} pregraph \-s config_file \-K 63 [\-R \-d \-p \-a] \-o graph_prefix
.br
${bin} contig \-g graph_prefix [\-R \-M 1 \-D]
.br
${bin} map \-s config_file \-g graph_prefix [\-p]
.br
${bin} scaff \-g graph_prefix [\-F \-u \-G \-p]
.SH Options
.TP
\-a INT
initial memory allocation (GB), reserved up front to avoid further reallocation
.TP
\-s STR
configuration file
.TP
\-o STR
output graph file prefix
.TP
\-g STR
input graph file prefix
.TP
\-K INT
K-mer size [default 23, min 13, max 127]
.TP
\-p INT
number of threads [default 8]
.TP
\-R
use reads to solve tiny repeats [default no]
.TP
\-d INT
remove low-frequency K-mers with frequency no larger than INT [default 0]
.TP
\-D INT
remove edges with coverage no larger than INT [default 1]
.TP
\-M INT
strength of merging similar sequences during contiging [default 1, min 0, max 3]
.TP
\-F
intra-scaffold gap closure [default no]
.TP
\-u
un-mask high coverage contigs before scaffolding [default mask]
.TP
\-G INT
allowed length difference between estimated and filled gap
.TP
\-L
minimum contig length used for scaffolding
.SH Output files
These files are output as assembly results:
.IP "a. *.contig"
.IP "contig sequences without using mate pair information"
.IP "b. *.scafSeq"
.IP "scaffold sequences (final contig sequences can be extracted by breaking down scaffold sequences at gap regions)"
.P
Some other files provide useful information for advanced users; they are listed in Appendix B.
.SH FAQ
.SS "How to set K-mer size?"
The program accepts odd numbers from 13 up to the limit of the version in use (31, 63 or 127). Larger K-mers are more likely to be unique in the genome and make the graph simpler, but they require deeper sequencing and longer reads to guarantee overlap at every genomic location.
.SS "How to set library rank?"
SOAPdenovo uses the paired-end libraries in order of increasing insert size to construct scaffolds. Libraries with the same rank are used at the same time. For example, in a human genome dataset, we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb, respectively. The pairs in each rank should provide adequate physical coverage of the genome.
.SH "APPENDIX A: an example.config"
#maximal read length
.br
max_rd_len=50
.br
[LIB]
.br
#average insert size
.br
avg_ins=200
.br
#if sequence needs to be reversed
.br
reverse_seq=0
.br
#in which part(s) the reads are used
.br
asm_flags=3
.br
#use only first 50 bps of each read
.br
rd_len_cutoff=50
.br
#in which order the reads are used while scaffolding
.br
rank=1
.br
# cutoff of pair number for a reliable connection (default 3)
.br
pair_num_cutoff=3
.br
#minimum aligned length to contigs for a reliable read location (default 32)
.br
map_len=32
.br
#fastq file for read 1
.br
q1=/path/**LIBNAMEA**/fastq_read_1.fq
.br
#fastq file for read 2 always follows fastq file for read 1
.br
q2=/path/**LIBNAMEA**/fastq_read_2.fq
.br
#fasta file for read 1
.br
f1=/path/**LIBNAMEA**/fasta_read_1.fa
.br
#fasta file for read 2 always follows fasta file for read 1
.br
f2=/path/**LIBNAMEA**/fasta_read_2.fa
.br
#fastq file for single reads
.br
q=/path/**LIBNAMEA**/fastq_read_single.fq
.br
#fasta file for single reads
.br
f=/path/**LIBNAMEA**/fasta_read_single.fa
.br
#a single fasta file for paired reads
.br
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
.br
[LIB]
.br
avg_ins=2000
.br
reverse_seq=1
.br
asm_flags=2
.br
rank=2
.br
# cutoff of pair number for a reliable connection
.br
#(default 5 for large insert size)
.br
pair_num_cutoff=5
.br
#minimum aligned length to contigs for a reliable read location
.br
#(default 35 for large insert size)
.br
map_len=35
.br
q1=/path/**LIBNAMEB**/fastq_read_1.fq
.br
q2=/path/**LIBNAMEB**/fastq_read_2.fq
.br
q=/path/**LIBNAMEB**/fastq_read_single.fq
.br
f=/path/**LIBNAMEB**/fasta_read_single.fa
.br
.SH "Appendix B: output files"
1. Output files from the command “pregraph”
.IP "a. *.kmerFreq"
.IP "Each row shows the number of Kmers whose frequency equals the row number."
.IP "b. *.edge"
.IP "Each record gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer coverage, whether it is identical to its reverse complement, and the sequence."
.IP "c. *.markOnEdge & *.path"
.IP "These two files are for using reads to solve small repeats."
.IP "e. *.preArc"
.IP "Connections between edges which are established by the read paths."
.IP "f. *.vertex"
.IP "Kmers at the ends of edges."
.IP "g. *.preGraphBasic"
.IP "Some basic information about the pre-graph: number of vertices, K value, number of edges, maximum read length, etc."
.P
2. Output files from the command “contig”
.IP "a. *.contig"
.IP "Contig information: corresponding edge index, length, kmer coverage, whether it is a tip, and the sequence. Either a contig or its reverse-complementary counterpart is included. Each reverse-complementary contig index is indicated in the *.ContigIndex file."
.IP "b. *.Arc"
.IP "Arcs coming out of each edge and their corresponding coverage by reads."
.IP "c. *.updated.edge"
.IP "Some information for each edge in the graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one."
.IP "d. *.ContigIndex"
.IP "Each record gives information about a contig in the *.contig file: its edge index, length, and the index difference between its reverse-complementary counterpart and itself."
.P
3. Output files from the command “map”
.IP "a. *.peGrads"
.IP "Information for each clone library: insert size, read index upper bound, rank and pair number cutoff for a reliable link."
.IP "This file can be revised manually to tune scaffolding."
.IP "b. *.readOnContig"
.IP "Read locations on contigs. Here contigs are referred to by their edge index. However, about half of them are not listed in the *.contig file because their reverse-complementary counterparts are already included."
.IP "c. *.readInGap"
.IP "This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds."
.P
4. Output files from the command “scaff”
.IP "a. *.newContigIndex"
.IP "Contigs are sorted according to their length before scaffolding. Their new indices are listed in this file. This is useful if one wants to match contigs in *.contig with those in *.links."
.IP "b. *.links"
.IP "Links between contigs which are established by read pairs. New indices are used."
.IP "c. *.scaf_gap"
.IP "Contigs in gaps, found using the contig graph output by the contiging procedure. New indices are used here."
.IP "d. *.scaf"
.IP "Contigs for each scaffold: contig index (concordant with the index in *.contig), approximate start position on the scaffold, orientation, contig length, and its links to others."
.IP "e. *.gapSeq"
.IP "Gap sequences between contigs."
.IP "f. *.scafSeq"
.IP "Sequence of each scaffold."
.SH AUTHOR
Olivier Sallou (olivier.sallou (at) irisa.fr) - Man page and packaging
debian/changelog0000644000000000000000000000055212254304035011040 0ustar
soapdenovo (1.05-2) unstable; urgency=low

  * soapdenovo is not maintained anymore, please use soapdenovo2 (Closes: #732281).

 -- Olivier Sallou  Wed, 18 Dec 2013 12:32:59 +0100

soapdenovo (1.05-1) unstable; urgency=low

  * Initial release (Closes: #683291)

 -- Olivier Sallou  Mon, 30 Jul 2012 15:47:51 +0200