Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The GNU General Public License is a free, copyleft license for software and other kinds of works. The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. 
For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. 
Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. 
"Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. 
Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. 
When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. 
This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. 
b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. 
A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. 
But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. 
If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. 
Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. 
The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. 
"Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. 
If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. 
If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. 
If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: Copyright (C) This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. 
The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box".

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see .

The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read .

SOAPdenovo-V1.05/Makefile
.NOTPARALLEL:
.PHONY: all compile mer31 mer63 mer127 install clean

all: compile install

compile: mer31 mer63 mer127

mer31:
	@cd ./src/31mer; make
mer63:
	@cd ./src/63mer; make
mer127:
	@cd ./src/127mer; make

install:
	@cp ./src/*/SOAPdenovo* ./bin/
	@printf "Installation done.\n"

clean:
	@cd ./src/31mer; make clean
	@cd ./src/63mer; make clean
	@cd ./src/127mer; make clean
	@cd ./bin; rm -f SOAPdenovo*

SOAPdenovo-V1.05/MANUAL
Manual of SOAPdenovo-V1.05
Ruibang Luo, 2011-2-22
**********************************************************

Introduction
SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost-effective way.

System Requirement
SOAPdenovo aims for large plant and animal genomes, although it also works well on bacterial and fungal genomes.
It runs on 64-bit Linux systems with a minimum of 5 GB of physical memory. For big genomes such as human, about 150 GB of memory is required.

Download
Current release: 1.05

Update Log:
1) Support for large kmers up to 127 to utilize long reads. Three versions are provided.
I. The 31mer version supports only kmer ≤31.
II. The 63mer version supports only kmer ≤63 and consumes twice the memory of the 31mer version, even when used with kmer ≤31.
III. The 127mer version supports only kmer ≤127 and consumes twice the memory of the 63mer version, even when used with kmer ≤63.
Please note that with a longer kmer the number of nodes decreases significantly, so moving to a larger version together with a longer kmer usually costs less than double the memory.
2) New parameter -a added to the "pregraph" module. It sets the amount of memory assumed up front so that later reallocation is avoided. The unit of the parameter is GB. Without further reallocation, SOAPdenovo runs faster and can use up all the memory of the machine. For example, if the workstation has 50 GB of free memory, use -a 50 in the pregraph step; a static 50 GB of memory is then allocated before the reads are processed. This also keeps the run from being interrupted by other users sharing the same machine.
3) Gap-filled bases are now represented by lowercase characters in the 'scafSeq' file.
4) SIMD instructions introduced to boost performance.
5) Several bugs fixed.
6) The 32-bit version will not be supported in the future.

Installation
1. You can download the pre-compiled binary for your platform, unpack it using “tar -zxf download.tgz -C ${destination folder}” and execute it directly.
2. Or download the source code, unpack it to ${destination folder} as above, and compile it with GNU make by running “make” in ${destination folder}/SOAPdenovo-V1.05. Then install the executables to ${destination folder}/SOAPdenovo-V1.05/bin using “make install”.

3. How to use it
1.
Configuration file
For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and the relevant information. “example.config” is an example of such a file.
The configuration file has a section for global information, followed by multiple library sections. Right now only “max_rd_len” is included in the global information section. Any read longer than max_rd_len will be cut to this length.
The library information and the information of sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with the tag [LIB] and includes the following items:
• avg_ins
This value indicates the average insert size of this library or the peak value position in the insert size distribution figure.
• reverse_seq
This option takes value 0 or 1. It tells the assembler whether the read sequences need to be reverse-complemented. Illumina GA produces two types of paired-end libraries: a) forward-reverse, generated from fragmented DNA ends with typical insert size less than 500 bp; b) forward-forward, generated from circularizing libraries with typical insert size greater than 2 Kb. The parameter “reverse_seq” should be set to indicate this: 0, forward-reverse; 1, forward-forward.
• asm_flags=3
This indicator decides in which part(s) the reads are used. It takes value 1 (only contig assembly), 2 (only scaffold assembly), 3 (both contig and scaffold assembly), or 4 (only gap closure).
• rd_len_cutoff
The assembler will cut the reads from the current library to this length.
• rank
It takes integer values and decides in which order the reads are used for scaffold assembly. Libraries with the same “rank” are used at the same time during scaffold assembly.
• pair_num_cutoff
This parameter is the cutoff value of pair number for a reliable connection between two contigs or pre-scaffolds.
• map_len
This takes effect in the “map” step and is the minimum alignment length between a read and a contig required for a reliable read location.
The assembler accepts read files in two formats: FASTA or FASTQ. Mate-pair relationships can be indicated in two ways: two sequence files with reads in the same order belonging to a pair, or two adjacent reads in a single file (FASTA only) belonging to a pair.
In the configuration file, single-end files are indicated by “f=/path/filename” or “q=/path/filename” for FASTA or FASTQ format respectively. Paired reads in two FASTA sequence files are indicated by “f1=” and “f2=”, while paired reads in two FASTQ sequence files are indicated by “q1=” and “q2=”. Paired reads in a single FASTA sequence file are indicated by the “p=” item.
All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.
2. Get it started
Once the configuration file is available, a typical way to run the assembler is:
${bin} all -s config_file -K 63 -R -o graph_prefix
Users can also choose to run the assembly process step by step:
${bin} pregraph -s config_file -K 63 [-R -d -p -a] -o graph_prefix
${bin} contig -g graph_prefix [-R -M 1 -D]
${bin} map -s config_file -g graph_prefix [-p]
${bin} scaff -g graph_prefix [-F -u -G -p]
3.
Options:
-a INT Initiate the memory assumption (GB) to avoid further reallocation
-s STR configuration file
-o STR output graph file prefix
-g STR input graph file prefix
-K INT K-mer size [default 23, min 13, max 127]
-p INT multithreads, n threads [default 8]
-R use reads to solve tiny repeats [default no]
-d INT remove low-frequency K-mers with frequency no larger than INT [default 0]
-D INT remove edges with coverage no larger than INT [default 1]
-M INT strength of merging similar sequences during contiging [default 1, min 0, max 3]
-F intra-scaffold gap closure [default no]
-u un-mask high coverage contigs before scaffolding [default mask]
-G INT allowed length difference between estimated and filled gap
-L minimum contig length used for scaffolding

4. Output files
4.1 These files are output as assembly results:
a. *.contig
contig sequences without using mate pair information
b. *.scafSeq
scaffold sequences (final contig sequences can be extracted by breaking down scaffold sequences at gap regions)
4.2 There are some other files that provide useful information for advanced users, which are listed in Appendix B.

5. FAQ
5.1 How to set K-mer size?
The program accepts odd numbers from 13 up to the maximum supported by the binary in use (31, 63 or 127, see the Update Log). Larger K-mers have a higher rate of uniqueness in the genome and make the graph simpler, but they require deeper sequencing and longer reads to guarantee the overlap at any genomic location.
5.2 How to set library rank?
SOAPdenovo uses the paired-end libraries in order of increasing insert size to construct scaffolds. Libraries with the same rank are used at the same time. For example, for a human genome dataset we set five ranks for five libraries with insert sizes of 200 bp, 500 bp, 2 Kb, 5 Kb and 10 Kb respectively. The pairs in each rank should provide adequate physical coverage of the genome.
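The ranking rule in 5.2 can be sketched in a few lines of C. The sketch below is an illustration, not the shipped code: it follows the fallback in loadPEgrads (src/63mer/attachPEinfo.c), which groups libraries into insert-size classes with boundaries at 300, 800, 3000 and 7000 bp when the *.peGrads file does not carry a positive rank for every library. The function names insert_size_bin and assign_default_ranks are hypothetical, and the libraries are assumed to be listed in ascending insert-size order.

```c
/* Hypothetical helper: map an insert size (bp) to a size class.
 * The boundaries are the ones used by the fallback in loadPEgrads. */
static int insert_size_bin(int insert_size)
{
    if (insert_size < 300)  return 0;
    if (insert_size < 800)  return 1;
    if (insert_size < 3000) return 2;
    if (insert_size < 7000) return 3;
    return 4;
}

/* Hypothetical helper: give each of the n libraries a rank.
 * The first library gets rank 1; the rank goes up by one each time a
 * library falls into a larger size class than the one before it.
 * Assumes insert_sizes[] is in ascending order. */
static void assign_default_ranks(const int *insert_sizes, int *ranks, int n)
{
    int i;
    int rank = 0;

    for (i = 0; i < n; i++) {
        if (i == 0 ||
            insert_size_bin(insert_sizes[i]) > insert_size_bin(insert_sizes[i - 1])) {
            rank++;
        }
        ranks[i] = rank;
    }
}
```

Called with the five insert sizes from the example above (200, 500, 2000, 5000, 10000), this sketch yields five distinct ranks, while two libraries of 200 bp and 250 bp would share rank 1. Setting “rank” explicitly in the configuration file remains the more predictable choice.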
APPENDIX A: an example.config
#maximal read length
max_rd_len=50
[LIB]
#average insert size
avg_ins=200
#if sequence needs to be reversed
reverse_seq=0
#in which part(s) the reads are used
asm_flags=3
#use only first 50 bps of each read
rd_len_cutoff=50
#in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (default 3)
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location (default 32)
map_len=32
#fastq file for read 1
q1=/path/**LIBNAMEA**/fastq_read_1.fq
#fastq file for read 2 always follows fastq file for read 1
q2=/path/**LIBNAMEA**/fastq_read_2.fq
#fasta file for read 1
f1=/path/**LIBNAMEA**/fasta_read_1.fa
#fasta file for read 2 always follows fasta file for read 1
f2=/path/**LIBNAMEA**/fasta_read_2.fa
#fastq file for single reads
q=/path/**LIBNAMEA**/fastq_read_single.fq
#fasta file for single reads
f=/path/**LIBNAMEA**/fasta_read_single.fa
#a single fasta file for paired reads
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
# cutoff of pair number for a reliable connection
#(default 5 for large insert size)
pair_num_cutoff=5
#minimum aligned length to contigs for a reliable read location
#(default 35 for large insert size)
map_len=35
q1=/path/**LIBNAMEB**/fastq_read_1.fq
q2=/path/**LIBNAMEB**/fastq_read_2.fq
q=/path/**LIBNAMEB**/fastq_read_single.fq
f=/path/**LIBNAMEB**/fasta_read_single.fa

Appendix B: output files
1. Output files from the command “pregraph”
a. *.kmerFreq
Each row shows the number of Kmers with a frequency equal to the row number.
b. *.edge
Each record gives the information of an edge in the pre-graph: length, Kmers on both ends, average kmer coverage, whether it’s reverse-complementarily identical and the sequence.
c. *.markOnEdge & *.path
These two files are for using reads to solve small repeats.
e. *.preArc
Connections between edges which are established by the read paths.
f. *.vertex
Kmers at the ends of edges.
g. *.preGraphBasic
Some basic information about the pre-graph: number of vertices, K value, number of edges, maximum read length etc.
2. Output files from the command “contig”
a. *.contig
Contig information: corresponding edge index, length, kmer coverage, whether it’s a tip and the sequence. Either a contig or its reverse-complementary counterpart is included. Each reverse-complementary contig index is indicated in the *.ContigIndex file.
b. *.Arc
Arcs coming out of each edge and their corresponding coverage by reads.
c. *.updated.edge
Some information for each edge in the graph: length, Kmers at both ends, index difference between the reverse-complementary edge and this one.
d. *.ContigIndex
Each record gives information about each contig in the *.contig file: its edge index, length, the index difference between its reverse-complementary counterpart and itself.
3. Output files from the command “map”
a. *.peGrads
Information for each clone library: insert size, read index upper bound, rank and pair number cutoff for a reliable link. This file can be revised manually for scaffolding tuning.
b. *.readOnContig
Read locations on contigs. Here contigs are referred to by their edge index. However, about half of them are not listed in the *.contig file because their reverse-complementary counterparts are already included.
c. *.readInGap
This file includes reads that could be located in gaps between contigs. This information will be used to close gaps in scaffolds.
4. Output files from the command “scaff”
a. *.newContigIndex
Contigs are sorted according to their length before scaffolding. Their new indices are listed in this file. This is useful if one wants to match contigs in *.contig with those in *.links.
b. *.links
Links between contigs which are established by read pairs. New indices are used.
c. *.scaf_gap
Contigs in gaps found by the contig graph output by the contiging procedure. Here new indices are used.
d.
*.scaf
Contigs for each scaffold: contig index (concordant with the index in *.contig), approximate start position on scaffold, orientation, contig length, and its links to others.
e. *.gapSeq
Gap sequences between contigs.
f. *.scafSeq
Sequence of each scaffold.

SOAPdenovo-V1.05/src/

SOAPdenovo-V1.05/VERSION
1.05

SOAPdenovo-V1.05/src/127mer/
SOAPdenovo-V1.05/src/31mer/
SOAPdenovo-V1.05/src/63mer/

SOAPdenovo-V1.05/src/63mer/arc.c
/*
 * 63mer/arc.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define preARCBLOCKSIZE 100000 void createPreArcMemManager() { prearc_mem_manager = createMem_manager ( preARCBLOCKSIZE, sizeof ( preARC ) ); } void prlDestroyPreArcMem() { if ( !preArc_mem_managers ) { return; } int i; for ( i = 0; i < thrd_num; i++ ) { freeMem_manager ( preArc_mem_managers[i] ); } free ( ( void * ) preArc_mem_managers ); preArc_mem_managers = NULL; } void destroyPreArcMem() { freeMem_manager ( prearc_mem_manager ); prearc_mem_manager = NULL; } preARC * prlAllocatePreArc ( unsigned int edgeid, MEM_MANAGER * manager ) { preARC * newArc; newArc = ( preARC * ) getItem ( manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->next = NULL; return newArc; } preARC * allocatePreArc ( unsigned int edgeid ) { arcCounter++; preARC * newArc; newArc = ( preARC * ) getItem ( prearc_mem_manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->next = NULL; return newArc; } void output_arcGVZ ( char * outfile, boolean IsContig ) { ARC * pArc; preARC * pPreArc; char name[256]; FILE * fp; unsigned int i; sprintf ( name, "%s.arc.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = 1; i <= num_ed; i++ ) { if ( IsContig ) { pPreArc = contig_array[i].arcs; while ( pPreArc ) { fprintf ( fp, "\tC%d -> C%d[label =\"%d\"];\n", i, pPreArc->to_ed, pPreArc->multiplicity ); pPreArc = pPreArc->next; } } else { pArc = edge_array[i].arcs; while ( pArc ) { fprintf ( fp, "\tC%d -> C%d[label =\"%d\"];\n", i, pArc->to_ed, pArc->multiplicity ); pArc = pArc->next; } } } fprintf ( fp, "}\n" ); fclose ( fp ); } /**************** below this line all codes are about ARC ****************/ #define ARCBLOCKSIZE 100000 void createArcMemo() { if ( !arc_mem_manager ) { arc_mem_manager = createMem_manager ( ARCBLOCKSIZE, sizeof ( ARC ) ); } else { printf ( "Warning from createArcMemo: arc_mem_manager is active 
pointer\n" ); } } void destroyArcMem() { freeMem_manager ( arc_mem_manager ); arc_mem_manager = NULL; } ARC * allocateArc ( unsigned int edgeid ) { arcCounter++; ARC * newArc; newArc = ( ARC * ) getItem ( arc_mem_manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->prev = NULL; newArc->next = NULL; return newArc; } void dismissArc ( ARC * arc ) { returnItem ( arc_mem_manager, arc ); } /***************** below this line all codes are about lookup table *****************/ void createArcLookupTable() { if ( !arcLookupTable ) { arcLookupTable = ( ARC ** ) ckalloc ( ( 3 * num_ed + 1 ) * sizeof ( ARC * ) ); } } void deleteArcLookupTable() { if ( arcLookupTable ) { free ( ( void * ) arcLookupTable ); arcLookupTable = NULL; } } void putArc2LookupTable ( unsigned int from_ed, ARC * arc ) { if ( !arc || !arcLookupTable ) { return; } unsigned int index = 2 * from_ed + arc->to_ed; arc->nextInLookupTable = arcLookupTable[index]; arcLookupTable[index] = arc; } static ARC * getArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ) { unsigned int index = 2 * from_ed + to_ed; ARC * ite_arc = arcLookupTable[index]; while ( ite_arc ) { if ( ite_arc->to_ed == to_ed ) { return ite_arc; } ite_arc = ite_arc->nextInLookupTable; } return NULL; } void removeArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ) { unsigned int index = 2 * from_ed + to_ed; ARC * ite_arc = arcLookupTable[index]; ARC * arc; if ( !ite_arc ) { printf ( "removeArcInLookupTable: not found A\n" ); return; } if ( ite_arc->to_ed == to_ed ) { arcLookupTable[index] = ite_arc->nextInLookupTable; return; } while ( ite_arc->nextInLookupTable && ite_arc->nextInLookupTable->to_ed != to_ed ) { ite_arc = ite_arc->nextInLookupTable; } if ( ite_arc->nextInLookupTable ) { arc = ite_arc->nextInLookupTable; ite_arc->nextInLookupTable = arc->nextInLookupTable; return; } printf ( "removeArcInLookupTable: not found B\n" ); return; } void recordArcsInLookupTable() { unsigned int i; ARC * ite_arc; for ( i = 
1; i <= num_ed; i++ )
	{
		ite_arc = edge_array[i].arcs;

		while ( ite_arc )
		{
			putArc2LookupTable ( i, ite_arc );
			ite_arc = ite_arc->next;
		}
	}
}

ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed )
{
	ARC * parc;

	if ( arcLookupTable )
	{
		parc = getArcInLookupTable ( from_ed, to_ed );
		return parc;
	}

	parc = edge_array[from_ed].arcs;

	while ( parc )
	{
		if ( parc->to_ed == to_ed )
		{
			return parc;
		}

		parc = parc->next;
	}

	return parc;
}

SOAPdenovo-V1.05/src/63mer/attachPEinfo.c
/*
 * 63mer/attachPEinfo.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "stack.h" #define CNBLOCKSIZE 10000 static STACK * isStack; static int ignorePE1, ignorePE2, ignorePE3, static_flag; static int onsameCtgPE; static unsigned long long peSUM; static boolean staticF; static int existCounter; static int calcuIS ( STACK * intStack ); static int cmp_pe ( const void * a, const void * b ) { PE_INFO * A, *B; A = ( PE_INFO * ) a; B = ( PE_INFO * ) b; if ( A->rank > B->rank ) { return 1; } else if ( A->rank == B->rank ) { return 0; } else { return -1; } } void loadPEgrads ( char * infile ) { FILE * fp; char name[256], line[1024]; int i; boolean rankSet = 1; sprintf ( name, "%s.peGrads", infile ); fp = fopen ( name, "r" ); if ( !fp ) { printf ( "can not open file %s\n", name ); gradsCounter = 0; return; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'g' ) { sscanf ( line + 10, "%d %lld %d", &gradsCounter, &n_solexa, &maxReadLen ); printf ( "there're %d grads, %lld reads, max read len %d\n", gradsCounter, n_solexa, maxReadLen ); break; } } alloc_pe_mem ( gradsCounter ); for ( i = 0; i < gradsCounter; i++ ) { fgets ( line, sizeof ( line ), fp ); pes[i].rank = 0; sscanf ( line, "%d %lld %d %d", & ( pes[i].insertS ), & ( pes[i].PE_bound ), & ( pes[i].rank ), & ( pes[i].pair_num_cut ) ); if ( pes[i].rank < 1 ) { rankSet = 0; } } fclose ( fp ); if ( rankSet ) { qsort ( &pes[0], gradsCounter, sizeof ( PE_INFO ), cmp_pe ); return; } int lastRank = 0; for ( i = 0; i < gradsCounter; i++ ) { if ( i == 0 ) { pes[i].rank = ++lastRank; } else if ( pes[i].insertS < 300 ) { pes[i].rank = lastRank; } else if ( pes[i].insertS < 800 ) { if ( pes[i - 1].insertS < 300 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else if ( pes[i].insertS < 3000 ) { if ( pes[i - 1].insertS < 800 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else if ( pes[i].insertS < 7000 ) { if ( pes[i - 1].insertS < 3000 ) { 
pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else { if ( pes[i - 1].insertS < 7000 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } } } CONNECT * add1Connect ( unsigned int e1, unsigned int e2, int gap, int weight, boolean inherit ) { if ( e1 == e2 || e1 == getTwinCtg ( e2 ) ) { return NULL; } CONNECT * connect = NULL; long long sum; if ( weight > 255 ) { weight = 255; } connect = getCntBetween ( e1, e2 ); if ( connect ) { if ( !weight ) { return connect; } existCounter++; if ( !inherit ) { sum = connect->weightNotInherit * connect->gapLen + gap * weight; connect->gapLen = sum / ( connect->weightNotInherit + weight ); if ( connect->weightNotInherit + weight <= 255 ) { connect->weightNotInherit += weight; } else if ( connect->weightNotInherit < 255 ) { connect->weightNotInherit = 255; } } else { sum = connect->weight * connect->gapLen + gap * weight; connect->gapLen = sum / ( connect->weight + weight ); if ( !connect->inherit ) { connect->maxSingleWeight = connect->weightNotInherit; } connect->inherit = 1; connect->maxSingleWeight = connect->maxSingleWeight > weight ? 
connect->maxSingleWeight : weight; } if ( connect->weight + weight <= 255 ) { connect->weight += weight; } else if ( connect->weight < 255 ) { connect->weight = 255; } } else { newCntCounter++; connect = allocateCN ( e2, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( e1, connect ); } connect->weight = weight; if ( contig_array[e1].mask || contig_array[e2].mask ) { connect->mask = 1; } connect->next = contig_array[e1].downwardConnect; contig_array[e1].downwardConnect = connect; if ( !inherit ) { connect->weightNotInherit = weight; } else { connect->weightNotInherit = 0; connect->inherit = 1; connect->maxSingleWeight = weight; } } return connect; } int attach1PE ( unsigned int e1, int pre_pos, unsigned int bal_e2, int pos, int insert_size ) { int gap, realpeSize; unsigned int bal_e1, e2; if ( e1 == bal_e2 ) { ignorePE1++; return -1; //orientation wrong } bal_e1 = getTwinCtg ( e1 ); e2 = getTwinCtg ( bal_e2 ); if ( e1 == e2 ) { realpeSize = contig_array[e1].length + overlaplen - pre_pos - pos; if ( realpeSize > 0 ) { peSUM += realpeSize; onsameCtgPE++; if ( ( int ) contig_array[e1].length > insert_size ) { int * item = ( int * ) stackPush ( isStack ); ( *item ) = realpeSize; } } return 2; } gap = insert_size - overlaplen + pre_pos + pos - contig_array[e1].length - contig_array[e2].length; if ( gap < - ( insert_size / 10 ) ) { ignorePE2++; return 0; } if ( gap > insert_size ) { ignorePE3++; return 0; } add1Connect ( e1, e2, gap, 1, 0 ); add1Connect ( bal_e2, bal_e1, gap, 1, 0 ); return 1; } int connectByPE_grad ( FILE * fp, int peGrad, char * line ) { long long pre_readno, readno, minno, maxno; int pre_pos, pos, flag, PE, count = 0; unsigned int pre_contigno, contigno, newIndex; if ( peGrad < 0 || peGrad > gradsCounter ) { printf ( "specified pe grad is out of bound\n" ); return 0; } maxno = pes[peGrad].PE_bound; if ( peGrad == 0 ) { minno = 0; } else { minno = pes[peGrad - 1].PE_bound; } onsameCtgPE = peSUM = 0; PE = pes[peGrad].insertS; if ( strlen ( line ) ) { 
sscanf ( line, "%lld %d %d", &pre_readno, &pre_contigno, &pre_pos ); //printf("first record %d %d %d\n",pre_readno,pre_contigno,pre_pos); if ( pre_readno <= minno ) { pre_readno = -1; } } else { pre_readno = -1; } ignorePE1 = ignorePE2 = ignorePE3 = 0; static_flag = 1; isStack = ( STACK * ) createStack ( CNBLOCKSIZE, sizeof ( int ) ); while ( fgets ( line, lineLen, fp ) != NULL ) { sscanf ( line, "%lld %d %d", &readno, &contigno, &pos ); if ( readno > maxno ) { break; } if ( readno <= minno ) { continue; } newIndex = index_array[contigno]; //if(contig_array[newIndex].bal_edge==0) if ( isSameAsTwin ( newIndex ) ) { continue; } if ( PE && ( readno % 2 == 0 ) && ( pre_readno == readno - 1 ) ) // they are a pair of reads { flag = attach1PE ( pre_contigno, pre_pos, newIndex, pos, PE ); if ( flag == 1 ) { count++; } } pre_readno = readno; pre_contigno = newIndex; pre_pos = pos; } printf ( "%d PEs with insert size %d attached, %d + %d + %d ignored\n", count, PE, ignorePE1, ignorePE2, ignorePE3 ); if ( onsameCtgPE > 0 ) { printf ( "estimated PE size %lli, by %d pairs\n", peSUM / onsameCtgPE, onsameCtgPE ); } printf ( "on contigs longer than %d, %d pairs found,", PE, isStack->item_c ); printf ( "insert_size estimated: %d\n", calcuIS ( isStack ) ); freeStack ( isStack ); return count; } static int calcuIS ( STACK * intStack ) { long long sum = 0; int avg = 0; int * item; int num = intStack->item_c; if ( num < 100 ) { return avg; } stackBackup ( intStack ); while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL ) { sum += *item; } stackRecover ( intStack ); num = intStack->item_c; avg = sum / num; sum = 0; stackBackup ( intStack ); while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL ) { sum += ( *item - avg ) * ( *item - avg ); } int SD = sqrt ( sum / ( num - 1 ) ); if ( SD == 0 ) { printf ( "SD=%d, ", SD ); return avg; } stackRecover ( intStack ); sum = num = 0; while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL ) if ( abs ( *item - avg ) < 3 * SD ) { sum 
+= *item;
			num++;
		}

	avg = sum / num;
	printf ( "SD=%d, ", SD );
	return avg;
}

unsigned int getTwinCtg ( unsigned int ctg )
{
	return ctg + contig_array[ctg].bal_edge - 1;
}

boolean isSmallerThanTwin ( unsigned int ctg )
{
	return contig_array[ctg].bal_edge > 1;
}

boolean isLargerThanTwin ( unsigned int ctg )
{
	return contig_array[ctg].bal_edge < 1;
}

boolean isSameAsTwin ( unsigned int ctg )
{
	return contig_array[ctg].bal_edge == 1;
}

SOAPdenovo-V1.05/src/63mer/bubble.c
/*
 * 63mer/bubble.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "dfibHeap.h" #include "fibHeap.h" #define false 0 #define true 1 #define SLOW_TO_FAST 1 #define FAST_TO_SLOW 0 #define MAXREADLENGTH 100 #define MAXCONNECTION 100 static int MAXNODELENGTH; static int DIFF; static unsigned int outNodeArray[MAXCONNECTION]; static ARC * outArcArray[MAXCONNECTION]; static boolean HasChanged; //static boolean staticFlag = 0; //static boolean staticFlag2 = 0; static const int INDEL = 0; static const int SIM[4][4] = { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }; //static variables static READINTERVAL * fastPath; static READINTERVAL * slowPath; static char fastSequence[MAXREADLENGTH]; static char slowSequence[MAXREADLENGTH]; static int fastSeqLength; static int slowSeqLength; static Time * times; static unsigned int * previous; static unsigned int expCounter; static unsigned int * expanded; static double cutoff; static int Fmatrix[MAXREADLENGTH + 1][MAXREADLENGTH + 1]; static int slowToFastMapping[MAXREADLENGTH + 1]; static int fastToSlowMapping[MAXREADLENGTH + 1]; static DFibHeapNode ** dheapNodes; static DFibHeap * dheap; static unsigned int activeNode; //static ARC *activeArc; static unsigned int startingNode; static int progress; static unsigned int * eligibleStartingPoints; // DEBUG static long long caseA, caseB, caseC, caseD; static long long dnodeCounter; static long long rnodeCounter; static long long btCounter; static long long cmpCounter; static long long simiCounter; static long long pinCounter; static long long replaceCounter; static long long getArcCounter; // END OF DEBUG static void output_seq ( char * seq, int length, FILE * fp, unsigned int from_vt, unsigned int dest ) { int i; Kmer kmer; kmer = vt_array[from_vt].kmer; printKmerSeq ( fp, kmer ); fprintf ( fp, " " ); for ( i = 0; i < length; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) seq[i] ) ); } if ( edge_array[dest].seq ) { fprintf ( fp, " %c\n", int2base ( ( 
int ) getCharInTightString ( edge_array[dest].seq, 0 ) ) ); } else { fprintf ( fp, " N\n" ); } } static void print_path ( FILE * fp ) { READINTERVAL * marker; marker = fastPath->nextInRead; while ( marker->nextInRead ) { fprintf ( fp, "%u ", marker->edgeid ); marker = marker->nextInRead; } fprintf ( fp, "\n" ); marker = slowPath->nextInRead; while ( marker->nextInRead ) { fprintf ( fp, "%u ", marker->edgeid ); marker = marker->nextInRead; } fprintf ( fp, "\n" ); } static void output_pair ( int lengthF, int lengthS, FILE * fp, int nodeF, int nodeS, boolean merged, unsigned int source, unsigned int destination ) { fprintf ( fp, "$$ %d vs %d $$ %d\n", nodeF, nodeS, merged ); //print_path(fp); output_seq ( fastSequence, lengthF, fp, edge_array[source].to_vt, destination ); output_seq ( slowSequence, lengthS, fp, edge_array[source].to_vt, destination ); //fprintf(fp,"\n"); } static void resetNodeStatus() { unsigned int index; ARC * arc; unsigned int bal_ed; for ( index = 1; index <= num_ed; index++ ) { if ( EdSameAsTwin ( index ) ) { edge_array[index].multi = 1; continue; } arc = edge_array[index].arcs; bal_ed = getTwinEdge ( index ); while ( arc ) { if ( arc->to_ed == bal_ed ) { break; } arc = arc->next; } if ( arc ) { edge_array[index].multi = 1; edge_array[bal_ed].multi = 1; index++; continue; } arc = edge_array[bal_ed].arcs; while ( arc ) { if ( arc->to_ed == index ) { break; } arc = arc->next; } if ( arc ) { edge_array[index].multi = 1; edge_array[bal_ed].multi = 1; } else { edge_array[index].multi = 0; edge_array[bal_ed].multi = 0; } index++; } } /* static void determineEligibleStartingPoints() { long long index,counter=0; unsigned int node; unsigned int maxmult; ARC *parc; FibHeap *heap = newFibHeap(); for(index=1;index<=num_ed;index++){ if(edge_array[index].deleted||edge_array[index].length<1) continue; maxmult = counter = 0; parc = edge_array[index].arcs; while(parc){ if(parc->multiplicity > maxmult) maxmult = parc->multiplicity; parc = parc->next; } 
if(maxmult<1){ continue; } insertNodeIntoHeap(heap,-maxmult,index); } counter = 0; while((index=removeNextNodeFromHeap(heap))!=0){ eligibleStartingPoints[counter++] = index; } destroyHeap(heap); printf("%lld edges out of %d are eligible starting points\n",counter,num_ed); } */ static unsigned int nextStartingPoint() { unsigned int index = 1; unsigned int result = 0; for ( index = progress + 1; index < num_ed; index++ ) { //result = eligibleStartingPoints[index]; result = index; if ( edge_array[index].deleted || edge_array[index].length < 1 ) { continue; } if ( result == 0 ) { return 0; } if ( edge_array[result].multi > 0 ) { continue; } progress = index; return result; } return 0; } static void updateNodeStatus() { unsigned int i, node; for ( i = 0; i < expCounter; i++ ) { node = expanded[i]; edge_array[node].multi = 1; edge_array[getTwinEdge ( node )].multi = 1; } } unsigned int getNodePrevious ( unsigned int node ) { return previous[node]; } static boolean isPreviousToNode ( unsigned int previous, unsigned int target ) { unsigned int currentNode = target; unsigned int previousNode = 0; Time targetTime = times[target]; while ( currentNode ) { if ( currentNode == previous ) { return 1; } if ( currentNode == previousNode ) { return 0; } if ( times[currentNode] != targetTime ) { return 0; } previousNode = currentNode; currentNode = getNodePrevious ( currentNode ); } return 0; } static void copySeq ( char * targetS, char * sourceS, int pos, int length ) { char ch; int i, index; index = pos; for ( i = 0; i < length; i++ ) { ch = getCharInTightString ( sourceS, i ); targetS[index++] = ch; //writeChar2tightString(ch,targetS,index++); } } // return the length of sequence static int extractSequence ( READINTERVAL * path, char * sequence ) { READINTERVAL * marker; int seqLength, writeIndex; seqLength = writeIndex = 0; path->start = -10; marker = path->nextInRead; while ( marker->nextInRead ) { marker->start = seqLength; seqLength += edge_array[marker->edgeid].length; marker 
= marker->nextInRead; } marker->start = seqLength; if ( seqLength > MAXREADLENGTH ) { return 0; } marker = path->nextInRead; while ( marker->nextInRead ) { if ( edge_array[marker->edgeid].length && edge_array[marker->edgeid].seq ) { copySeq ( sequence, edge_array[marker->edgeid].seq, writeIndex, edge_array[marker->edgeid].length ); writeIndex += edge_array[marker->edgeid].length; } /* else if(edge_array[marker->edgeid].length==0) printf("node %d with length 0 in this path\n",marker->edgeid); else if(edge_array[marker->edgeid].seq==NULL) printf("node %d without seq in this path\n",marker->edgeid); */ marker = marker->nextInRead; } return seqLength; } static int max ( int A, int B, int C ) { A = A >= B ? A : B; return ( A >= C ? A : C ); } static boolean compareSequences ( char * sequence1, char * sequence2, int length1, int length2 ) { int i, j; int maxLength; int Choice1, Choice2, Choice3; int maxScore; if ( length1 == 0 || length2 == 0 ) { caseA++; return 0; } if ( abs ( ( int ) length1 - ( int ) length2 ) > 2 ) { caseB++; return 0; } if ( length1 < overlaplen - 1 || length2 < overlaplen - 1 ) { caseB++; return 0; } /* if (length1 < overlaplen || length2 < overlaplen){ if(abs((int)length1 - (int)length2) > 3){ caseB++; return 0; } } */ //printf("length %d vs %d\n",length1,length2); for ( i = 0; i <= length1; i++ ) { Fmatrix[i][0] = 0; } for ( j = 0; j <= length2; j++ ) { Fmatrix[0][j] = 0; } for ( i = 1; i <= length1; i++ ) { for ( j = 1; j <= length2; j++ ) { Choice1 = Fmatrix[i - 1][j - 1] + SIM[ ( int ) sequence1[i - 1]] [ ( int ) sequence2[j - 1]]; Choice2 = Fmatrix[i - 1][j] + INDEL; Choice3 = Fmatrix[i][j - 1] + INDEL; Fmatrix[i][j] = max ( Choice1, Choice2, Choice3 ); } } maxScore = Fmatrix[length1][length2]; maxLength = ( length1 > length2 ? 
                  length1 : length2 );
    if ( maxScore < maxLength - DIFF ) {
        caseC++;
        return 0;
    }
    if ( ( 1 - ( double ) maxScore / maxLength ) > cutoff ) {
        caseD++;
        return 0;
    }
    //printf("\niTOTO %i / %li\n", maxScore, maxLength);
    return 1;
}

static void mapSlowOntoFast()
{
    int slowIndex = slowSeqLength;
    int fastIndex = fastSeqLength;
    int fastn, slown;

    if ( slowIndex == 0 ) {
        slowToFastMapping[0] = fastIndex;
        while ( fastIndex >= 0 ) {
            fastToSlowMapping[fastIndex--] = 0;
        }
        return;
    }
    if ( fastIndex == 0 ) {
        //record the end-to-end mapping first: the loop below drives slowIndex to -1
        fastToSlowMapping[0] = slowIndex;
        while ( slowIndex >= 0 ) {
            slowToFastMapping[slowIndex--] = 0;
        }
        return;
    }
    //trace the alignment back through Fmatrix, which is indexed [fast][slow]
    while ( slowIndex > 0 && fastIndex > 0 ) {
        fastn = ( int ) fastSequence[fastIndex - 1]; //getCharInTightString(fastSequence,fastIndex-1);
        slown = ( int ) slowSequence[slowIndex - 1]; //getCharInTightString(slowSequence,slowIndex-1);
        if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex - 1] + SIM[fastn][slown] ) {
            fastToSlowMapping[--fastIndex] = --slowIndex;
            slowToFastMapping[slowIndex] = fastIndex;
        } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex] + INDEL ) {
            fastToSlowMapping[--fastIndex] = slowIndex - 1;
        } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex][slowIndex - 1] + INDEL ) {
            slowToFastMapping[--slowIndex] = fastIndex - 1;
        } else {
            //printPaths();
            printf ( "Error" );
            fflush ( stdout );
            abort();
        }
    }
    while ( slowIndex > 0 ) {
        slowToFastMapping[--slowIndex] = -1;
    }
    while ( fastIndex > 0 ) {
        fastToSlowMapping[--fastIndex] = -1;
    }
    slowToFastMapping[slowSeqLength] = fastSeqLength;
    fastToSlowMapping[fastSeqLength] = slowSeqLength;
}

/*
//add an arc to the head of an arc list
static ARC *addArc(ARC *arc_list, ARC *arc)
{
    arc->prev = NULL;
    arc->next = arc_list;
    if(arc_list)
        arc_list->prev = arc;
    arc_list = arc;
    return arc_list;
}
*/

//remove an arc from the doubly linked list and return the updated list
ARC * deleteArc ( ARC * arc_list, ARC * arc )
{
    if ( arc->prev ) {
        arc->prev->next = arc->next;
    } else {
        arc_list =
arc->next; } if ( arc->next ) { arc->next->prev = arc->prev; } /* if(checkActiveArc&&arc==activeArc){ activeArc = arc->next; } */ dismissArc ( arc ); return arc_list; } //add a rv to the head of a rv list static READINTERVAL * addRv ( READINTERVAL * rv_list, READINTERVAL * rv ) { rv->prevOnEdge = NULL; rv->nextOnEdge = rv_list; if ( rv_list ) { rv_list->prevOnEdge = rv; } rv_list = rv; return rv_list; } //remove a rv from the double linked list and return the updated list static READINTERVAL * deleteRv ( READINTERVAL * rv_list, READINTERVAL * rv ) { if ( rv->prevOnEdge ) { rv->prevOnEdge->nextOnEdge = rv->nextOnEdge; } else { rv_list = rv->nextOnEdge; } if ( rv->nextOnEdge ) { rv->nextOnEdge->prevOnEdge = rv->prevOnEdge; } return rv_list; } /* static void disconnect(unsigned int from_ed, unsigned int to_ed) { READINTERVAL *rv_temp; rv_temp = edge_array[from_ed].rv; while(rv_temp){ if(!rv_temp->nextInRead||rv_temp->nextInRead->edgeid!=to_ed){ rv_temp = rv_temp->nextOnEdge; continue; } rv_temp->nextInRead->prevInRead = NULL; rv_temp->nextInRead = NULL; rv_temp = rv_temp->nextOnEdge; } } */ static int mapDistancesOntoPaths() { READINTERVAL * marker; int totalDistance = 0; marker = slowPath; while ( marker->nextInRead ) { marker = marker->nextInRead; marker->start = totalDistance; totalDistance += edge_array[marker->edgeid].length; marker->bal_rv->start = totalDistance; } totalDistance = 0; marker = fastPath; while ( marker->nextInRead ) { marker = marker->nextInRead; marker->start = totalDistance; totalDistance += edge_array[marker->edgeid].length; marker->bal_rv->start = totalDistance; } return totalDistance; } //attach a path to the graph and mean while make the reverse complementary path of it static void attachPath ( READINTERVAL * path ) { READINTERVAL * marker, *bal_marker; unsigned int ed, bal_ed; marker = path; while ( marker ) { ed = marker->edgeid; edge_array[ed].rv = addRv ( edge_array[ed].rv, marker ); bal_ed = getTwinEdge ( ed ); bal_marker = allocateRV ( 
-marker->readid, bal_ed ); edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_marker ); if ( marker->prevInRead ) { marker->prevInRead->bal_rv->prevInRead = bal_marker; bal_marker->nextInRead = marker->prevInRead->bal_rv; } bal_marker->bal_rv = marker; marker->bal_rv = bal_marker; marker = marker->nextInRead; } } static void detachPathSingle ( READINTERVAL * path ) { READINTERVAL * marker, *nextMarker; unsigned int ed; marker = path; while ( marker ) { nextMarker = marker->nextInRead; ed = marker->edgeid; edge_array[ed].rv = deleteRv ( edge_array[ed].rv, marker ); dismissRV ( marker ); marker = nextMarker; } } static void detachPath ( READINTERVAL * path ) { READINTERVAL * marker, *bal_marker, *nextMarker; unsigned int ed, bal_ed; marker = path; while ( marker ) { nextMarker = marker->nextInRead; bal_marker = marker->bal_rv; ed = marker->edgeid; edge_array[ed].rv = deleteRv ( edge_array[ed].rv, marker ); dismissRV ( marker ); //printf("%d (%d),",ed,edge_array[ed].deleted); bal_ed = getTwinEdge ( ed ); edge_array[bal_ed].rv = deleteRv ( edge_array[bal_ed].rv, bal_marker ); dismissRV ( bal_marker ); marker = nextMarker; } // printf("\n"); fflush ( stdout ); } static void remapNodeMarkersOntoNeighbour ( unsigned int source, unsigned int target ) { READINTERVAL * marker, *bal_marker; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); while ( ( marker = edge_array[source].rv ) != NULL ) { edge_array[source].rv = deleteRv ( edge_array[source].rv, marker ); marker->edgeid = target; edge_array[target].rv = addRv ( edge_array[target].rv, marker ); bal_marker = marker->bal_rv; edge_array[bal_source].rv = deleteRv ( edge_array[bal_source].rv, bal_marker ); bal_marker->edgeid = bal_target; edge_array[bal_target].rv = addRv ( edge_array[bal_target].rv, bal_marker ); } } static void remapNodeInwardReferencesOntoNode ( unsigned int source, unsigned int target ) { ARC * arc; unsigned int destination; for ( arc = 
edge_array[source].arcs; arc != NULL; arc = arc->next ) { destination = arc->to_ed; if ( destination == target || destination == source ) { continue; } if ( previous[destination] == source ) { previous[destination] = target; } } } static void remapNodeTimesOntoTargetNode ( unsigned int source, unsigned int target ) { Time nodeTime = times[source]; unsigned int prevNode = previous[source]; Time targetTime = times[target]; if ( nodeTime == -1 ) { return; } if ( prevNode == source ) { times[target] = nodeTime; previous[target] = target; } else if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, prevNode ) ) ) { times[target] = nodeTime; if ( prevNode != getTwinEdge ( source ) ) { previous[target] = prevNode; } else { previous[target] = getTwinEdge ( target ); } } remapNodeInwardReferencesOntoNode ( source, target ); previous[source] = 0; } static void remapNodeTimesOntoNeighbour ( unsigned int source, unsigned int target ) { remapNodeTimesOntoTargetNode ( source, target ); remapNodeTimesOntoTargetNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); //questionable } static void destroyArc ( unsigned int from_ed, ARC * arc ) { unsigned int bal_dest; ARC * twinArc; if ( !arc ) { return; } bal_dest = getTwinEdge ( arc->to_ed ); twinArc = arc->bal_arc; removeArcInLookupTable ( from_ed, arc->to_ed ); edge_array[from_ed].arcs = deleteArc ( edge_array[from_ed].arcs, arc ); if ( bal_dest != from_ed ) { removeArcInLookupTable ( bal_dest, getTwinEdge ( from_ed ) ); edge_array[bal_dest].arcs = deleteArc ( edge_array[bal_dest].arcs, twinArc ); } } static void createAnalogousArc ( unsigned int originNode, unsigned int destinationNode, ARC * refArc ) { ARC * arc, *twinArc; unsigned int destinationTwin; arc = getArcBetween ( originNode, destinationNode ); if ( arc ) { if ( refArc->bal_arc != refArc ) { arc->multiplicity += refArc->multiplicity; arc->bal_arc->multiplicity += refArc->multiplicity; } else { arc->multiplicity += 
refArc->multiplicity / 2; arc->bal_arc->multiplicity += refArc->multiplicity / 2; } return; } arc = allocateArc ( destinationNode ); arc->multiplicity = refArc->multiplicity; arc->prev = NULL; arc->next = edge_array[originNode].arcs; if ( edge_array[originNode].arcs ) { edge_array[originNode].arcs->prev = arc; } edge_array[originNode].arcs = arc; putArc2LookupTable ( originNode, arc ); destinationTwin = getTwinEdge ( destinationNode ); if ( destinationTwin == originNode ) { //printf("arc from A to A'\n"); arc->bal_arc = arc; if ( refArc->bal_arc != refArc ) { arc->multiplicity += refArc->multiplicity; } return; } twinArc = allocateArc ( getTwinEdge ( originNode ) ); arc->bal_arc = twinArc; twinArc->bal_arc = arc; twinArc->multiplicity = refArc->multiplicity; twinArc->prev = NULL; twinArc->next = edge_array[destinationTwin].arcs; if ( edge_array[destinationTwin].arcs ) { edge_array[destinationTwin].arcs->prev = twinArc; } edge_array[destinationTwin].arcs = twinArc; putArc2LookupTable ( destinationTwin, twinArc ); } static void remapNodeArcsOntoTarget ( unsigned int source, unsigned int target ) { ARC * arc; if ( source == activeNode ) { activeNode = target; } arc = edge_array[source].arcs; if ( !arc ) { return; } while ( arc != NULL ) { createAnalogousArc ( target, arc->to_ed, arc ); destroyArc ( source, arc ); arc = edge_array[source].arcs; } } static void remapNodeArcsOntoNeighbour ( unsigned int source, unsigned int target ) { remapNodeArcsOntoTarget ( source, target ); remapNodeArcsOntoTarget ( getTwinEdge ( source ), getTwinEdge ( target ) ); } static DFibHeapNode * getNodeDHeapNode ( unsigned int node ) { return dheapNodes[node]; } static void setNodeDHeapNode ( unsigned int node, DFibHeapNode * dheapNode ) { dheapNodes[node] = dheapNode; } static void remapNodeFibHeapReferencesOntoNode ( unsigned int source, unsigned int target ) { DFibHeapNode * sourceDHeapNode = getNodeDHeapNode ( source ); DFibHeapNode * targetDHeapNode = getNodeDHeapNode ( target ); if ( 
sourceDHeapNode == NULL ) { return; } if ( targetDHeapNode == NULL ) { setNodeDHeapNode ( target, sourceDHeapNode ); replaceValueInDHeap ( sourceDHeapNode, target ); } else if ( getKey ( targetDHeapNode ) > getKey ( sourceDHeapNode ) ) { setNodeDHeapNode ( target, sourceDHeapNode ); replaceValueInDHeap ( sourceDHeapNode, target ); destroyNodeInDHeap ( targetDHeapNode, dheap ); } else { destroyNodeInDHeap ( sourceDHeapNode, dheap ); } setNodeDHeapNode ( source, NULL ); } static void combineCOV ( unsigned int source, int len_s, unsigned int target, int len_t ) { if ( len_s < 1 || len_t < 1 ) { return; } int cov = ( len_s * edge_array[source].cvg + len_t * edge_array[target].cvg ) / len_t; edge_array[target].cvg = cov > MaxEdgeCov ? MaxEdgeCov : cov; edge_array[getTwinEdge ( target )].cvg = cov > MaxEdgeCov ? MaxEdgeCov : cov; } static void remapNodeOntoNeighbour ( unsigned int source, unsigned int target ) { combineCOV ( source, edge_array[source].length, target, edge_array[target].length ); remapNodeMarkersOntoNeighbour ( source, target ); remapNodeTimesOntoNeighbour ( source, target ); //questionable remapNodeArcsOntoNeighbour ( source, target ); remapNodeFibHeapReferencesOntoNode ( source, target ); remapNodeFibHeapReferencesOntoNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); edge_array[source].deleted = 1; edge_array[getTwinEdge ( source )].deleted = 1; if ( startingNode == source ) { startingNode = target; } if ( startingNode == getTwinEdge ( source ) ) { startingNode = getTwinEdge ( target ); } edge_array[source].length = 0; edge_array[getTwinEdge ( source )].length = 0; } static void connectInRead ( READINTERVAL * previous, READINTERVAL * next ) { if ( previous ) { previous->nextInRead = next; if ( next ) { previous->bal_rv->prevInRead = next->bal_rv; } else { previous->bal_rv->prevInRead = NULL; } } if ( next ) { next->prevInRead = previous; if ( previous ) { next->bal_rv->nextInRead = previous->bal_rv; } else { next->bal_rv->nextInRead = NULL; } } } 
static int remapBackOfNodeMarkersOntoNeighbour ( unsigned int source, READINTERVAL * sourceMarker, unsigned int target, READINTERVAL * targetMarker, boolean slowToFast ) { READINTERVAL * marker, *newMarker, *bal_new, *previousMarker; int halfwayPoint, halfwayPointOffset, breakpoint; int * targetToSourceMapping, *sourceToTargetMapping; unsigned int bal_ed; int targetFinish = targetMarker->bal_rv->start; int sourceStart = sourceMarker->start; int sourceFinish = sourceMarker->bal_rv->start; int alignedSourceLength = sourceFinish - sourceStart; int realSourceLength = edge_array[source].length; if ( slowToFast ) { sourceToTargetMapping = slowToFastMapping; targetToSourceMapping = fastToSlowMapping; } else { sourceToTargetMapping = fastToSlowMapping; targetToSourceMapping = slowToFastMapping; } if ( alignedSourceLength > 0 && targetFinish > 0 ) { halfwayPoint = targetToSourceMapping[targetFinish - 1] - sourceStart + 1; halfwayPoint *= realSourceLength; halfwayPoint /= alignedSourceLength; } else { halfwayPoint = 0; } if ( halfwayPoint < 0 ) { halfwayPoint = 0; } if ( halfwayPoint > realSourceLength ) { halfwayPoint = realSourceLength; } halfwayPointOffset = realSourceLength - halfwayPoint; bal_ed = getTwinEdge ( target ); for ( marker = edge_array[source].rv; marker != NULL; marker = marker->nextOnEdge ) { if ( marker->prevInRead && marker->prevInRead->edgeid == target ) { continue; } newMarker = allocateRV ( marker->readid, target ); edge_array[target].rv = addRv ( edge_array[target].rv, newMarker ); bal_new = allocateRV ( -marker->readid, bal_ed ); edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_new ); newMarker->bal_rv = bal_new; bal_new->bal_rv = newMarker; newMarker->start = marker->start; if ( realSourceLength > 0 ) { breakpoint = halfwayPoint + marker->start; } else { breakpoint = marker->start; } bal_new->start = breakpoint; marker->start = breakpoint; previousMarker = marker->prevInRead; connectInRead ( previousMarker, newMarker ); connectInRead ( 
newMarker, marker ); } return halfwayPointOffset; } static void printKmer ( Kmer kmer ) { printKmerSeq ( stdout, kmer ); printf ( "\n" ); } static void splitNodeDescriptor ( unsigned int source, unsigned int target, int offset ) { int originalLength = edge_array[source].length; int backLength = originalLength - offset; int index, seqLen; char * tightSeq, nt, *newSeq; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); edge_array[source].length = offset; edge_array[bal_source].length = offset; edge_array[source].flag = 1; edge_array[bal_source].flag = 1; if ( target != 0 ) { edge_array[target].length = backLength; edge_array[bal_target].length = backLength; free ( ( void * ) edge_array[target].seq ); edge_array[target].seq = NULL; free ( ( void * ) edge_array[bal_target].seq ); edge_array[bal_target].seq = NULL; } if ( backLength == 0 ) { return; } tightSeq = edge_array[source].seq; seqLen = backLength / 4 + 1; if ( target != 0 ) { edge_array[target].flag = 1; edge_array[bal_target].flag = 1; newSeq = ( char * ) ckalloc ( seqLen * sizeof ( char ) ); edge_array[target].seq = newSeq; for ( index = 0; index < backLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, newSeq, index ); } } //source node for ( index = backLength; index < originalLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, tightSeq, index - backLength ); } if ( target == 0 ) { return; } //target twin tightSeq = edge_array[bal_source].seq; newSeq = ( char * ) ckalloc ( seqLen * sizeof ( char ) ); edge_array[bal_target].seq = newSeq; for ( index = offset; index < originalLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, newSeq, index - offset ); } } static void remapBackOfNodeDescriptorOntoNeighbour ( unsigned int source, unsigned int target, boolean slowToFast, int offset ) { unsigned int bal_source = getTwinEdge ( source ); 
unsigned int bal_target = getTwinEdge ( target ); if ( slowToFast ) { splitNodeDescriptor ( source, 0, offset ); } else { splitNodeDescriptor ( source, target, offset ); } //printf("%d vs %d\n",edge_array[source].from_vt,edge_array[target].to_vt); edge_array[source].from_vt = edge_array[target].to_vt; edge_array[bal_source].to_vt = edge_array[bal_target].from_vt; } static void remapBackOfNodeTimesOntoNeighbour ( unsigned int source, unsigned int target ) { Time targetTime = times[target]; Time nodeTime = times[source]; unsigned int twinTarget = getTwinEdge ( target ); unsigned int twinSource = getTwinEdge ( source ); unsigned int previousNode; if ( nodeTime != -1 ) { previousNode = previous[source]; if ( previousNode == source ) { times[target] = nodeTime; previous[target] = target; } else if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) { times[target] = nodeTime; if ( previousNode != twinSource ) { previous[target] = previousNode; } else { previous[target] = twinTarget; } } previous[source] = target; } targetTime = times[twinTarget]; nodeTime = times[twinSource]; if ( nodeTime != -1 ) { if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( twinTarget, twinSource ) ) ) { times[twinTarget] = nodeTime; previous[twinTarget] = twinSource; } } remapNodeInwardReferencesOntoNode ( twinSource, twinTarget ); } static void remapBackOfNodeArcsOntoNeighbour ( unsigned int source, unsigned int target ) { ARC * arc; remapNodeArcsOntoTarget ( getTwinEdge ( source ), getTwinEdge ( target ) ); for ( arc = edge_array[source].arcs; arc != NULL; arc = arc->next ) { createAnalogousArc ( target, source, arc ); } } static void remapBackOfNodeOntoNeighbour ( unsigned int source, READINTERVAL * sourceMarker, unsigned int target, READINTERVAL * targetMarker, boolean slowToFast ) { int offset; offset = remapBackOfNodeMarkersOntoNeighbour ( source, sourceMarker, target, 
             targetMarker, slowToFast );
    remapBackOfNodeDescriptorOntoNeighbour ( source, target, slowToFast, offset );
    combineCOV ( source, edge_array[target].length, target, edge_array[target].length );
    remapBackOfNodeTimesOntoNeighbour ( source, target );
    remapBackOfNodeArcsOntoNeighbour ( source, target );
    remapNodeFibHeapReferencesOntoNode ( getTwinEdge ( source ), getTwinEdge ( target ) );
    //why not "remapNodeFibHeapReferencesOntoNode(source,target);"?
    //because the downstream part of source is retained and can still serve as previousNode
    if ( getTwinEdge ( source ) == startingNode ) {
        startingNode = getTwinEdge ( target );
    }
}

static boolean markerLeadsToNode ( READINTERVAL * marker, unsigned int node )
{
    READINTERVAL * currentMarker;
    for ( currentMarker = marker; currentMarker != NULL; currentMarker = currentMarker->nextInRead ) {
        if ( currentMarker->edgeid == node ) {
            return true;
        }
    }
    return false;
}

static void reduceNode ( unsigned int node )
{
    unsigned int bal_ed = getTwinEdge ( node );
    edge_array[node].length = 0;
    edge_array[bal_ed].length = 0;
}

static void reduceSlowNodes ( READINTERVAL * slowMarker, unsigned int finish )
{
    READINTERVAL * marker;
    for ( marker = slowMarker; marker->edgeid != finish; marker = marker->nextInRead ) {
        reduceNode ( marker->edgeid );
    }
}

static boolean markerLeadsToArc ( READINTERVAL * marker, unsigned int nodeA, unsigned int nodeB )
{
    READINTERVAL * current, *next;
    unsigned int twinA = getTwinEdge ( nodeA );
    unsigned int twinB = getTwinEdge ( nodeB );
    current = marker;
    while ( current != NULL ) {
        next = current->nextInRead;
        //the last marker in a read has no successor, so it cannot start an arc
        if ( next != NULL ) {
            if ( current->edgeid == nodeA && next->edgeid == nodeB ) {
                return true;
            }
            if ( current->edgeid == twinB && next->edgeid == twinA ) {
                return true;
            }
        }
        current = next;
    }
    return false;
}

static void remapEmptyPathArcsOntoMiddlePathSimple ( READINTERVAL * emptyPath, READINTERVAL * targetPath )
{
    READINTERVAL * pathMarker, *marker;
    unsigned int start = emptyPath->prevInRead->edgeid;
    unsigned int finish =
                          emptyPath->edgeid;
    unsigned int previousNode = start;
    unsigned int currentNode;
    ARC * originalArc = getArcBetween ( start, finish );
    if ( !originalArc ) {
        printf ( "remapEmptyPathArcsOntoMiddlePathSimple: no arc between %u and %u\n", start, finish );
        marker = fastPath;
        printf ( "fast path: " );
        while ( marker ) {
            printf ( "%u,", marker->edgeid );
            marker = marker->nextInRead;
        }
        printf ( "\n" );
        marker = slowPath;
        printf ( "slow path: " );
        while ( marker ) {
            printf ( "%u,", marker->edgeid );
            marker = marker->nextInRead;
        }
        printf ( "\n" );
        fflush ( stdout );
        //without a reference arc there is nothing to remap; createAnalogousArc would dereference NULL
        return;
    }
    for ( pathMarker = targetPath; pathMarker->edgeid != finish; pathMarker = pathMarker->nextInRead ) {
        currentNode = pathMarker->edgeid;
        createAnalogousArc ( previousNode, currentNode, originalArc );
        previousNode = currentNode;
    }
    createAnalogousArc ( previousNode, finish, originalArc );
    destroyArc ( start, originalArc );
}

static void remapEmptyPathMarkersOntoMiddlePathSimple ( READINTERVAL * emptyPath, READINTERVAL * targetPath, boolean slowToFast )
{
    READINTERVAL * marker, *newMarker, *previousMarker, *pathMarker, *bal_marker;
    unsigned int start = emptyPath->prevInRead->edgeid;
    unsigned int finish = emptyPath->edgeid;
    unsigned int markerStart, bal_ed;
    READINTERVAL * oldMarker = edge_array[finish].rv;
    while ( oldMarker ) {
        marker = oldMarker;
        oldMarker = marker->nextOnEdge;
        newMarker = marker->prevInRead;
        //skip markers that start a read (no predecessor) or do not come from start
        if ( !newMarker || newMarker->edgeid != start ) {
            continue;
        }
        //only remap the temporary path markers: read 2 is the slow path, read 1 the fast path
        if ( ( slowToFast && marker->readid != 2 ) || ( !slowToFast && marker->readid != 1 ) ) {
            continue;
        }
        markerStart = marker->start;
        for ( pathMarker = targetPath; pathMarker->edgeid != finish; pathMarker = pathMarker->nextInRead ) {
            previousMarker = newMarker;
            //make a new marker
            newMarker = allocateRV ( marker->readid, pathMarker->edgeid );
            newMarker->start = markerStart;
            edge_array[pathMarker->edgeid].rv = addRv ( edge_array[pathMarker->edgeid].rv, newMarker );
            //make the twin marker
            bal_ed = getTwinEdge ( pathMarker->edgeid );
            bal_marker = allocateRV (
-marker->readid, bal_ed ); bal_marker->start = markerStart; edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_marker ); newMarker->bal_rv = bal_marker; bal_marker->bal_rv = newMarker; connectInRead ( previousMarker, newMarker ); } connectInRead ( newMarker, marker ); } } static void remapNodeTimesOntoForwardMiddlePath ( unsigned int source, READINTERVAL * path ) { READINTERVAL * marker; unsigned int target; Time nodeTime = times[source]; unsigned int previousNode = previous[source]; Time targetTime; for ( marker = path; marker->edgeid != source; marker = marker->nextInRead ) { target = marker->edgeid; targetTime = times[target]; if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) { times[target] = nodeTime; previous[target] = previousNode; } previousNode = target; } previous[source] = previousNode; } static void remapNodeTimesOntoTwinMiddlePath ( unsigned int source, READINTERVAL * path ) { READINTERVAL * marker; unsigned int target; unsigned int previousNode = getTwinEdge ( source ); Time targetTime; READINTERVAL * limit = path->prevInRead->bal_rv; Time nodeTime = times[limit->edgeid]; marker = path; while ( marker->edgeid != source ) { marker = marker->nextInRead; } marker = marker->bal_rv; while ( marker != limit ) { marker = marker->nextInRead; target = marker->edgeid; targetTime = times[target]; if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) { times[target] = nodeTime; previous[target] = previousNode; } previousNode = target; } } static void remapEmptyPathOntoMiddlePath ( READINTERVAL * emptyPath, READINTERVAL * targetPath, boolean slowToFast ) { unsigned int start = emptyPath->prevInRead->edgeid; unsigned int finish = emptyPath->edgeid; // Remapping markers if ( !markerLeadsToArc ( targetPath, start, finish ) ) { remapEmptyPathArcsOntoMiddlePathSimple ( emptyPath, targetPath ); } 
remapEmptyPathMarkersOntoMiddlePathSimple ( emptyPath, targetPath, slowToFast ); //Remap times and previous(if necessary) if ( getNodePrevious ( finish ) == start ) { remapNodeTimesOntoForwardMiddlePath ( finish, targetPath ); } if ( getNodePrevious ( getTwinEdge ( start ) ) == getTwinEdge ( finish ) ) { remapNodeTimesOntoTwinMiddlePath ( finish, targetPath ); } } static boolean cleanUpRedundancy() { READINTERVAL * slowMarker = slowPath->nextInRead, *fastMarker = fastPath->nextInRead; unsigned int slowNode, fastNode; int slowLength, fastLength; int fastConstraint = 0; int slowConstraint = 0; int finalLength; attachPath ( slowPath ); attachPath ( fastPath ); mapSlowOntoFast(); finalLength = mapDistancesOntoPaths(); slowLength = fastLength = 0; while ( slowMarker != NULL && fastMarker != NULL ) { //modifCounter++; if ( !slowMarker->nextInRead ) { slowLength = finalLength; } else { slowLength = slowToFastMapping[slowMarker->bal_rv->start - 1]; if ( slowLength < slowConstraint ) { slowLength = slowConstraint; } } fastLength = fastMarker->bal_rv->start - 1; if ( fastLength < fastConstraint ) { fastLength = fastConstraint; } slowNode = slowMarker->edgeid; fastNode = fastMarker->edgeid; if ( false ) printf ( "Slow %d\tFast %d\n", slowLength, fastLength ); if ( slowNode == fastNode ) { //if (false) if ( false ) printf ( "0/ Already merged together %d == %d\n", slowNode, fastNode ); if ( fastLength > slowLength ) { slowConstraint = fastLength; } //else if (fastLength < slowLength); fastConstraint = slowLength; slowMarker = slowMarker->nextInRead; fastMarker = fastMarker->nextInRead; } else if ( slowNode == getTwinEdge ( fastNode ) ) { //if (false) if ( false ) printf ( "1/ Creme de la hairpin %d $$ %d\n", slowNode, fastNode ); if ( fastLength > slowLength ) { slowConstraint = fastLength; } //else if (fastLength < slowLength); fastConstraint = slowLength; slowMarker = slowMarker->nextInRead; fastMarker = fastMarker->nextInRead; //foldSymmetricalNode(fastNode); } else if ( 
markerLeadsToNode ( slowMarker, fastNode ) ) { //if (false) if ( false ) { printf ( "2/ Remapping empty fast arc onto slow nodes\n" ); } reduceSlowNodes ( slowMarker, fastNode ); remapEmptyPathOntoMiddlePath ( fastMarker, slowMarker, FAST_TO_SLOW ); while ( slowMarker->edgeid != fastNode ) { slowMarker = slowMarker->nextInRead; } } else if ( markerLeadsToNode ( fastMarker, slowNode ) ) { //if (false) if ( false ) { printf ( "3/ Remapping empty slow arc onto fast nodes\n" ); } remapEmptyPathOntoMiddlePath ( slowMarker, fastMarker, SLOW_TO_FAST ); while ( fastMarker->edgeid != slowNode ) { fastMarker = fastMarker->nextInRead; } } else if ( slowLength == fastLength ) { if ( false ) { printf ( "A/ Mapped equivalent nodes together %d <=> %d\n", slowNode, fastNode ); } remapNodeOntoNeighbour ( slowNode, fastNode ); slowMarker = slowMarker->nextInRead; fastMarker = fastMarker->nextInRead; } else if ( slowLength < fastLength ) { if ( false ) { printf ( "B/ Mapped back of fast node into slow %d -> %d\n", fastNode, slowNode ); } remapBackOfNodeOntoNeighbour ( fastNode, fastMarker, slowNode, slowMarker, FAST_TO_SLOW ); slowMarker = slowMarker->nextInRead; } else { if ( false ) { printf ( "C/ Mapped back of slow node into fast %d -> %d\n", slowNode, fastNode ); } remapBackOfNodeOntoNeighbour ( slowNode, slowMarker, fastNode, fastMarker, SLOW_TO_FAST ); fastMarker = fastMarker->nextInRead; } fflush ( stdout ); } detachPath ( fastPath ); detachPath ( slowPath ); return 1; } static void comparePaths ( unsigned int destination, unsigned int origin ) { int slowLength, fastLength, i; unsigned int fastNode, slowNode; READINTERVAL * marker; slowLength = fastLength = 0; fastNode = destination; slowNode = origin; btCounter++; while ( fastNode != slowNode ) { if ( times[fastNode] > times[slowNode] ) { fastLength++; fastNode = previous[fastNode]; } else if ( times[fastNode] < times[slowNode] ) { slowLength++; slowNode = previous[slowNode]; } else if ( isPreviousToNode ( slowNode, fastNode 
) ) { while ( fastNode != slowNode ) { fastLength++; fastNode = previous[fastNode]; } } else if ( isPreviousToNode ( fastNode, slowNode ) ) { while ( slowNode != fastNode ) { slowLength++; slowNode = previous[slowNode]; } } else { fastLength++; fastNode = previous[fastNode]; slowLength++; slowNode = previous[slowNode]; } if ( slowLength > MAXNODELENGTH || fastLength > MAXNODELENGTH ) { return; } } if ( fastLength == 0 ) { return; } marker = allocateRV ( 1, destination ); fastPath = marker; for ( i = 0; i < fastLength; i++ ) { marker = allocateRV ( 1, previous[fastPath->edgeid] ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); marker->nextInRead = fastPath; fastPath->prevInRead = marker; fastPath = marker; } marker = allocateRV ( 2, destination ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); slowPath = marker; marker = allocateRV ( 2, origin ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); marker->nextInRead = slowPath; slowPath->prevInRead = marker; slowPath = marker; for ( i = 0; i < slowLength; i++ ) { marker = allocateRV ( 2, previous[slowPath->edgeid] ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); marker->nextInRead = slowPath; slowPath->prevInRead = marker; slowPath = marker; } //printf("node num %d vs %d\n",fastLength,slowLength); fastSeqLength = extractSequence ( fastPath, fastSequence ); slowSeqLength = extractSequence ( slowPath, slowSequence ); /* if(destination==6359){ printf("destination %d, slowLength %d, fastLength %d\n",destination,slowLength,fastLength); printf("fastSeqLength %d, slowSeqLength %d\n",fastSeqLength,slowSeqLength); } */ if ( !fastSeqLength || !slowSeqLength ) { detachPathSingle ( slowPath ); detachPathSingle ( fastPath ); return; } cmpCounter++; if ( !compareSequences ( fastSequence, slowSequence, fastSeqLength, slowSeqLength ) ) { //output_pair(fastSeqLength,slowSeqLength,ftemp,fastLength-1,slowLength, 
0,slowPath->edgeid,destination); detachPathSingle ( slowPath ); detachPathSingle ( fastPath ); return; } simiCounter++; //output_pair(fastSeqLength,slowSeqLength,ftemp,fastLength-1,slowLength, 1,slowPath->edgeid,destination); //pinCounter++; pinCounter += cleanUpRedundancy(); if ( pinCounter % 100000 == 0 ) { printf ( ".............%lld\n", pinCounter ); } HasChanged = 1; } static void tourBusArc ( unsigned int origin, unsigned int destination, unsigned int arcMulti, Time originTime ) { Time arcTime, totalTime, destinationTime; unsigned int oldPrevious = previous[destination]; if ( oldPrevious == origin || edge_array[destination].multi == 1 ) { return; } arcCounter++; if ( arcMulti > 0 ) { arcTime = ( ( Time ) edge_array[origin].length ) / ( ( Time ) arcMulti ); } else { arcTime = 0.0; printf ( "arc from %d to %d with flags %d originTime %f, arc %d\n", origin, destination, edge_array[destination].multi, originTime, arcMulti ); fflush ( stdout ); } totalTime = originTime + arcTime; /* if(destination==289129||destination==359610){ printf("arc from %d to %d with flags %d time %f originTime %f, arc %d\n", origin,destination,edge_array[destination].multi,totalTime,originTime,arcMulti); fflush(stdout); } */ destinationTime = times[destination]; if ( destinationTime == -1 ) { times[destination] = totalTime; dheapNodes[destination] = insertNodeIntoDHeap ( dheap, totalTime, destination ); dnodeCounter++; previous[destination] = origin; return; } else if ( destinationTime > totalTime ) { if ( dheapNodes[destination] == NULL ) { //printf("node %d Already expanded though\n",destination); return; } replaceCounter++; times[destination] = totalTime; replaceKeyInDHeap ( dheap, dheapNodes[destination], totalTime ); previous[destination] = origin; comparePaths ( destination, oldPrevious ); return; } else { if ( destinationTime == times[origin] && isPreviousToNode ( destination, origin ) ) { return; } comparePaths ( destination, origin ); } } static void tourBusNode ( unsigned int 
node ) { ARC * parc; int index = 0, outNodeNum; /* if(node==745) printf("to expand %d\n",node); */ expanded[expCounter++] = node; //edge_array[node].multi = 2; activeNode = node; parc = edge_array[activeNode].arcs; while ( parc ) { outArcArray[index] = parc; outNodeArray[index++] = parc->to_ed; if ( index >= MAXCONNECTION ) { //printf("node %d has more than MAXCONNECTION arcs\n",node); break; } parc = parc->next; } outNodeNum = index; HasChanged = 0; for ( index = 0; index < outNodeNum; index++ ) { if ( HasChanged ) { parc = getArcBetween ( activeNode, outNodeArray[index] ); getArcCounter++; } else { parc = outArcArray[index]; } if ( !parc ) { continue; } tourBusArc ( activeNode, outNodeArray[index], parc->multiplicity, times[activeNode] ); } } /* static void dumpNodeFromDHeap() { unsigned int currentNode; while((currentNode = removeNextNodeFromDHeap(dheap))!=0){ rnodeCounter++; times[currentNode] = -1; previous[currentNode] = 0; dheapNodes[currentNode] = NULL; if(dnodeCounter-rnodeCounter<250) break; } } */ static void tourBus ( unsigned int startingPoint ) { unsigned int currentNode = startingPoint; times[startingPoint] = 0; previous[startingPoint] = currentNode; while ( currentNode > 0 ) { dheapNodes[currentNode] = NULL; tourBusNode ( currentNode ); currentNode = removeNextNodeFromDHeap ( dheap ); if ( currentNode > 0 ) { rnodeCounter++; } } } void bubblePinch ( double simiCutoff, char * outfile, int M ) { unsigned int index, counter = 0; unsigned int startingNode; char temp[256]; sprintf ( temp, "%s.pathpair", outfile ); //ftemp = ckopen(temp,"w"); //linearConcatenate(); //initiator caseA = caseB = caseC = caseD = 0; progress = 0; arcCounter = 0; dnodeCounter = 0; rnodeCounter = 0; btCounter = 0; cmpCounter = 0; simiCounter = 0; pinCounter = 0; replaceCounter = 0; getArcCounter = 0; cutoff = 1.0 - simiCutoff; if ( M <= 1 ) { MAXNODELENGTH = 3; DIFF = 2; } else if ( M == 2 ) { MAXNODELENGTH = 9; DIFF = 3; } else { MAXNODELENGTH = 30; DIFF = 10; } printf ( "start 
to pinch bubbles, cutoff %f, MAX NODE NUM %d, MAX DIFF %d\n", cutoff, MAXNODELENGTH, DIFF );
    createRVmemo();
    times = ( Time * ) ckalloc ( ( num_ed + 1 ) * sizeof ( Time ) );
    previous = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
    expanded = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
    dheapNodes = ( DFibHeapNode ** ) ckalloc ( ( num_ed + 1 ) * sizeof ( DFibHeapNode * ) );
    WORDFILTER = createFilter ( overlaplen );

    for ( index = 1; index <= num_ed; index++ )
    {
        times[index] = -1;
        previous[index] = 0;
        dheapNodes[index] = NULL;
    }

    dheap = newDFibHeap();
    eligibleStartingPoints = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
    resetNodeStatus();
    //determineEligibleStartingPoints();
    createArcLookupTable();
    recordArcsInLookupTable();

    while ( ( startingNode = nextStartingPoint() ) > 0 )
    {
        counter++;
        //printf("starting point %d with length %d\n",startingNode,edge_array[startingNode].length);
        expCounter = 0;
        tourBus ( startingNode );
        updateNodeStatus();
    }

    resetNodeStatus();
    deleteArcLookupTable();
    destroyReadIntervMem();
    printf ( "%u startingPoints, %lld dheap nodes\n", counter, dnodeCounter );
    //printf("%lld times getArcBetween for tourBusNode\n",getArcCounter);
    printf ( "%lld pairs found, %lld pairs of paths compared, %lld pairs merged\n", btCounter, cmpCounter, pinCounter );
    printf ( "sequence compare failure: %lld %lld %lld %lld\n", caseA, caseB, caseC, caseD );
    //fclose(ftemp);
    free ( ( void * ) eligibleStartingPoints );
    destroyDHeap ( dheap );
    free ( ( void * ) dheapNodes );
    free ( ( void * ) times );
    free ( ( void * ) previous );
    free ( ( void * ) expanded );
    linearConcatenate();
}

/*
 * 63mer/check.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include <stdio.h>    /* FILE, fopen, printf, fflush */
#include <stdlib.h>   /* calloc, realloc, free, exit */
#include <strings.h>  /* bcopy */

void * ckalloc ( unsigned long long amount );
FILE * ckopen ( char * name, char * mode );

/* ckopen - open a file; exit on failure */
FILE * ckopen ( char * name, char * mode )
{
    FILE * fp;

    if ( ( fp = fopen ( name, mode ) ) == NULL )
    {
        printf ( "Cannot open %s. Now exit to system...\n", name );
        exit ( -1 );
    }

    return ( fp );
}

/* ckalloc - allocate zero-initialized space; check for success */
void * ckalloc ( unsigned long long amount )
{
    void * p;

    if ( ( p = ( void * ) calloc ( 1, ( unsigned long long ) amount ) ) == NULL && amount != 0 )
    {
        printf ( "Ran out of memory while allocating %llu bytes\n", amount );
        printf ( "Possible causes:\n" );
        printf ( "1) Not enough memory.\n" );
        printf ( "2) An array may have been overrun.\n" );
        printf ( "3) Wild pointers.\n" );
        fflush ( stdout );
        exit ( -1 );
    }

    return ( p );
}

/* ckrealloc - reallocate memory; fall back to ckalloc plus copy if realloc fails */
void * ckrealloc ( void * p, size_t new_size, size_t old_size )
{
    void * q;
    q = realloc ( ( void * ) p, new_size );

    if ( new_size == 0 || q != ( void * ) 0 )
    {
        return q;
    }

    /* manually reallocate space */
    q = ckalloc ( new_size );
    /* move old memory to new space */
    bcopy ( p, q, old_size );
    free ( p );
    return q;
}

/*
 * 63mer/compactEdge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" void copyEdge ( unsigned int source, unsigned int target ) { edge_array[target].from_vt = edge_array[source].from_vt; edge_array[target].to_vt = edge_array[source].to_vt; edge_array[target].length = edge_array[source].length; edge_array[target].cvg = edge_array[source].cvg; edge_array[target].multi = edge_array[source].multi; if ( edge_array[target].seq ) { free ( ( void * ) edge_array[target].seq ); } edge_array[target].seq = edge_array[source].seq; edge_array[source].seq = NULL; edge_array[target].arcs = edge_array[source].arcs; edge_array[source].arcs = NULL; edge_array[target].deleted = edge_array[source].deleted; } //move edge from source to target void edgeMove ( unsigned int source, unsigned int target ) { unsigned int bal_source, bal_target; ARC * arc; copyEdge ( source, target ); bal_source = getTwinEdge ( source ); //bal_edge if ( bal_source != source ) { bal_target = target + 1; copyEdge ( bal_source, bal_target ); edge_array[target].bal_edge = 2; edge_array[bal_target].bal_edge = 0; } else { edge_array[target].bal_edge = 1; bal_target = target; } //take care of the arcs arc = edge_array[target].arcs; while ( arc ) { arc->bal_arc->to_ed = bal_target; arc = arc->next; } if ( bal_target == target ) { return; } arc = edge_array[bal_target].arcs; while ( arc ) { arc->bal_arc->to_ed = 
target;
        arc = arc->next;
    }
}

//remove deleted edges from edge_array by moving the surviving edges to the front
void compactEdgeArray()
{
    unsigned int i;
    unsigned int validCounter = 0;
    unsigned int bal_ed;
    printf ( "there are %d edges\n", num_ed );

    for ( i = 1; i <= num_ed; i++ )
    {
        if ( edge_array[i].deleted )
        {
            continue;
        }

        validCounter++;

        if ( i == validCounter )
        {
            continue;
        }

        bal_ed = getTwinEdge ( i );
        edgeMove ( i, validCounter );

        //an edge and its twin occupy two consecutive slots and are moved together
        if ( bal_ed != i )
        {
            i++;
            validCounter++;
        }
    }

    num_ed = validCounter;
    printf ( "after compacting %d edges left\n", num_ed );
}

/*
 * 63mer/concatenateEdge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

void copySeq ( char * targetS, char * sourceS, int pos, int length )
{
    char ch;
    int i, index;
    index = pos;

    for ( i = 0; i < length; i++ )
    {
        ch = getCharInTightString ( sourceS, i );
        writeChar2tightString ( ch, targetS, index++ );
    }
}

//a path from e1 to e2 is merged into e1 (indicate=0) or e2 (indicate=1); update graph topology
void linearUpdateConnection ( unsigned int e1, unsigned int e2, int indicate )
{
    unsigned int bal_ed;
    ARC * parc;

    if ( !indicate )
    {
        edge_array[e1].to_vt = edge_array[e2].to_vt;
        bal_ed = getTwinEdge ( e1 );
        parc = edge_array[e2].arcs;

        while ( parc )
        {
            parc->bal_arc->to_ed = bal_ed;
            parc = parc->next;
        }

        edge_array[e1].arcs = edge_array[e2].arcs;
        edge_array[e2].arcs = NULL;

        if ( edge_array[e1].length || edge_array[e2].length )
            edge_array[e1].cvg = ( edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length ) / ( edge_array[e1].length + edge_array[e2].length );

        edge_array[e2].deleted = 1;
    }
    else
    {
        //all the arcs pointing to e1 switch to e2
        parc = edge_array[getTwinEdge ( e1 )].arcs;

        while ( parc )
        {
            parc->bal_arc->to_ed = e2;
            parc = parc->next;
        }

        edge_array[e1].arcs = NULL;
        edge_array[e2].from_vt = edge_array[e1].from_vt;

        if ( edge_array[e1].length || edge_array[e2].length )
            edge_array[e2].cvg = ( edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length ) / ( edge_array[e1].length + edge_array[e2].length );

        edge_array[e1].deleted = 1;
    }
}

static void printEdgeSeq ( FILE * fp, char * tightSeq, int len )
{
    int i;

    for ( i = 0; i < len; i++ )
    {
        fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( tightSeq, i ) ) );

        if ( ( i + overlaplen + 1 ) % 100 == 0 )
        {
            fprintf ( fp, "\n" );
        }
    }

    fprintf ( fp, "\n" );
}

void allpathUpdateEdge ( unsigned int e1, unsigned int e2, int indicate )
{
    int tightLen;
    char * tightSeq = NULL;

    if ( edge_array[e1].cvg == 0 )
    {
        edge_array[e1].cvg = edge_array[e2].cvg;
    }
    if ( edge_array[e2].cvg == 0 )
    {
        edge_array[e2].cvg = edge_array[e1].cvg;
    }

    /*
    if(edge_array[e1].length&&edge_array[e2].length){
        fprintf(stderr,">e1\n");
        printEdgeSeq(stderr,edge_array[e1].seq,edge_array[e1].length);
        fprintf(stderr,">e2\n");
        printEdgeSeq(stderr,edge_array[e2].seq,edge_array[e2].length);
    }
    */
    unsigned int cvgsum = edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length;
    tightLen = edge_array[e1].length + edge_array[e2].length;

    if ( tightLen )
    {
        tightSeq = ( char * ) ckalloc ( ( tightLen / 4 + 1 ) * sizeof ( char ) );
    }

    tightLen = 0;

    if ( edge_array[e1].length )
    {
        copySeq ( tightSeq, edge_array[e1].seq, 0, edge_array[e1].length );
        tightLen = edge_array[e1].length;

        if ( edge_array[e1].seq )
        {
            free ( ( void * ) edge_array[e1].seq );
            edge_array[e1].seq = NULL;
        }
        else
        {
            printf ( "allpathUpdateEdge: edge %d with length %d, but without seq\n", e1, edge_array[e1].length );
        }
    }

    if ( edge_array[e2].length )
    {
        copySeq ( tightSeq, edge_array[e2].seq, tightLen, edge_array[e2].length );
        tightLen += edge_array[e2].length;

        if ( edge_array[e2].seq )
        {
            free ( ( void * ) edge_array[e2].seq );
            edge_array[e2].seq = NULL;
        }
        else
        {
            printf ( "allpathUpdateEdge: edge %d with length %d, but without seq\n", e2, edge_array[e2].length );
        }
    }

    /*
    if(edge_array[e1].length&&edge_array[e2].length){
        fprintf(stderr,">e1+e2\n");
        printEdgeSeq(stderr,tightSeq,tightLen);
    }
    */
    //edge_array[e2].extend_len = tightLen-edge_array[e2].length;
    //the sequence of e1 is to be updated
    if ( !indicate )
    {
        edge_array[e2].length = 0;    //e2 is removed from the graph
        edge_array[e1].to_vt = edge_array[e2].to_vt;    //e2 is part of e1 now
        edge_array[e1].length = tightLen;
        edge_array[e1].seq = tightSeq;

        if ( tightLen )
        {
            edge_array[e1].cvg = cvgsum / tightLen;
        }

        edge_array[e1].cvg = edge_array[e1].cvg > 0 ?
edge_array[e1].cvg : 1; } else { edge_array[e1].length = 0; //e1 is removed from the graph edge_array[e2].from_vt = edge_array[e1].from_vt; //e1 is part of e2 now edge_array[e2].length = tightLen; edge_array[e2].seq = tightSeq; if ( tightLen ) { edge_array[e2].cvg = cvgsum / tightLen; } edge_array[e2].cvg = edge_array[e2].cvg > 0 ? edge_array[e2].cvg : 1; } } static void debugging ( unsigned int i ) { ARC * parc; parc = edge_array[i].arcs; if ( !parc ) { printf ( "no downward connection for %d\n", i ); } while ( parc ) { printf ( "%d -> %d\n", i, parc->to_ed ); parc = parc->next; } } //concatenate two edges if they are linearly linked void linearConcatenate() { unsigned int i; int conc_c = 1; int counter; unsigned int from_ed, to_ed, bal_ed; ARC * parc, *parc2; unsigned int bal_fe; //debugging(30514); while ( conc_c ) { conc_c = 0; counter = 0; for ( i = 1; i <= num_ed; i++ ) //num_ed { if ( edge_array[i].deleted || EdSameAsTwin ( i ) ) { continue; } if ( edge_array[i].length > 0 ) { counter++; } parc = edge_array[i].arcs; if ( !parc || parc->next ) { continue; } to_ed = parc->to_ed; bal_ed = getTwinEdge ( to_ed ); parc2 = edge_array[bal_ed].arcs; if ( bal_ed == to_ed || !parc2 || parc2->next ) { continue; } from_ed = i; if ( from_ed == to_ed || from_ed == bal_ed ) { continue; } //linear connection found conc_c++; linearUpdateConnection ( from_ed, to_ed, 0 ); allpathUpdateEdge ( from_ed, to_ed, 0 ); bal_fe = getTwinEdge ( from_ed ); linearUpdateConnection ( bal_ed, bal_fe, 1 ); allpathUpdateEdge ( bal_ed, bal_fe, 1 ); /* if(from_ed==6589||to_ed==6589) printf("%d <- %d (%d)\n",from_ed,to_ed,i); if(bal_fe==6589||bal_ed==6589) printf("%d <- %d (%d)\n",bal_fe,bal_ed,i); */ } printf ( "a linear concatenation lap, %d concatenated\n", conc_c ); } printf ( "%d edges in graph\n", counter ); } SOAPdenovo-V1.05/src/63mer/connect.c000644 000765 000024 00000011602 11530651532 017115 0ustar00Aquastaff000000 000000 /* * 63mer/connect.c * * Copyright (c) 2008-2010 BGI-Shenzhen . 
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define CNBLOCKSIZE 100000

void createCntMemManager()
{
    if ( !cn_mem_manager )
    {
        cn_mem_manager = createMem_manager ( CNBLOCKSIZE, sizeof ( CONNECT ) );
    }
    else
    {
        printf ( "cn_mem_manager was already created\n" );
    }
}

void destroyConnectMem()
{
    freeMem_manager ( cn_mem_manager );
    cn_mem_manager = NULL;
}

//allocate a CONNECT to contigId with the given gap; all other fields get their defaults
CONNECT * allocateCN ( unsigned int contigId, int gap )
{
    CONNECT * newCN;
    newCN = ( CONNECT * ) getItem ( cn_mem_manager );
    newCN->contigID = contigId;
    newCN->gapLen = gap;
    newCN->minGap = 0;
    newCN->maxGap = 0;
    newCN->bySmall = 0;
    newCN->weakPoint = 0;
    newCN->weight = 1;
    newCN->weightNotInherit = 0;
    newCN->mask = 0;
    newCN->used = 0;
    newCN->checking = 0;
    newCN->deleted = 0;
    newCN->prevInScaf = 0;
    newCN->inherit = 0;
    newCN->singleInScaf = 0;
    newCN->nextInScaf = NULL;
    return newCN;
}

//write the contig connections as a Graphviz digraph to <outfile>.scaffold.gvz
void output_cntGVZ ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i;
    CONNECT * connect;
    boolean flag;
    //bounded write: outfile may be longer than the buffer
    snprintf ( name, sizeof ( name ), "%s.scaffold.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- )
    {
        //if(contig_array[i].mask||!contig_array[i].downwardConnect)
        if ( !contig_array[i].downwardConnect )
        {
            continue;
        }

        connect = contig_array[i].downwardConnect;

        while ( connect )
        {
//if(connect->mask||connect->deleted){ if ( connect->deleted ) { connect = connect->next; continue; } if ( connect->prevInScaf || connect->nextInScaf ) { flag = 1; } else { flag = 0; } if ( !connect->mask ) fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\"];\n" , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length, connect->gapLen, flag, connect->weight ); else fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\", color = red];\n" , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length, connect->gapLen, flag, connect->weight ); connect = connect->next; } } fprintf ( fp, "}\n" ); fclose ( fp ); } /***************** below this line all codes are about lookup table *****************/ void createCntLookupTable() { if ( !cntLookupTable ) { cntLookupTable = ( CONNECT ** ) ckalloc ( ( 3 * num_ctg + 1 ) * sizeof ( CONNECT * ) ); } } void deleteCntLookupTable() { if ( cntLookupTable ) { free ( ( void * ) cntLookupTable ); cntLookupTable = NULL; } } void putCnt2LookupTable ( unsigned int from_c, CONNECT * cnt ) { if ( !cnt || !cntLookupTable ) { return; } unsigned int index = 2 * from_c + cnt->contigID; cnt->nextInLookupTable = cntLookupTable[index]; cntLookupTable[index] = cnt; } static CONNECT * getCntInLookupTable ( unsigned int from_c, unsigned int to_c ) { unsigned int index = 2 * from_c + to_c; CONNECT * ite_cnt = cntLookupTable[index]; while ( ite_cnt ) { if ( ite_cnt->contigID == to_c ) { return ite_cnt; } ite_cnt = ite_cnt->nextInLookupTable; } return NULL; } CONNECT * getCntBetween ( unsigned int from_c, unsigned int to_c ) { CONNECT * pcnt; if ( cntLookupTable ) { pcnt = getCntInLookupTable ( from_c, to_c ); return pcnt; } pcnt = contig_array[from_c].downwardConnect; while ( pcnt ) { if ( pcnt->contigID == to_c ) { return pcnt; } pcnt = pcnt->next; } return pcnt; } /* void removeCntInLookupTable(unsigned int from_c,unsigned int to_c) { unsigned int index = 2*from_c + to_c; CONNECT *ite_cnt 
= cntLookupTable[index]; CONNECT *cnt; if(!ite_cnt){ printf("removeCntInLookupTable: not found A\n"); return; } if(ite_cnt->contigID==to_c){ cntLookupTable[index] = ite_cnt->nextInLookupTable; return; } while(ite_cnt->nextInLookupTable&&ite_cnt->nextInLookupTable->contigID!=to_c) ite_cnt = ite_cnt->nextInLookupTable; if(ite_cnt->nextInLookupTable){ cnt = ite_cnt->nextInLookupTable; ite_cnt->nextInLookupTable = cnt->nextInLookupTable; return; } printf("removeCntInLookupTable: not found B\n"); return; } */ SOAPdenovo-V1.05/src/63mer/contig.c000644 000765 000024 00000007153 11530651532 016755 0ustar00Aquastaff000000 000000 /* * 63mer/contig.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static void display_contig_usage(); char shortrdsfile[256], graphfile[256]; static boolean repeatSolve; static int M = 1; int call_heavygraph ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); boolean ret; initenv ( argc, argv ); loadVertex ( graphfile ); loadEdge ( graphfile ); if ( repeatSolve ) { time ( &time_bef ); ret = loadPathBin ( graphfile ); if ( ret ) { solveReps(); } else { printf ( "repeat solving can't be done...\n" ); } time ( &time_aft ); printf ( "time spent on solving repeat: %ds\n", ( int ) ( time_aft - time_bef ) ); } //edgecvg_bar(edge_array,num_ed,graphfile,100); //0531 if ( M > 0 ) { time ( &time_bef ); bubblePinch ( 0.90, graphfile, M ); time ( &time_aft ); printf ( "time spent on bubblePinch: %ds\n", ( int ) ( time_aft - time_bef ) ); } removeWeakEdges ( 2 * overlaplen, 1 ); removeLowCovEdges ( 2 * overlaplen, 1 ); cutTipsInGraph ( 0, 0 ); //output_graph(graphfile); output_contig ( edge_array, num_ed, graphfile, overlaplen + 1 ); output_updated_edges ( graphfile ); output_heavyArcs ( graphfile ); if ( vt_array ) { free ( ( void * ) vt_array ); vt_array = NULL; } if ( edge_array ) { free_edge_array ( edge_array, num_ed_limit ); edge_array = NULL; } destroyArcMem(); time ( &stop_t ); printf ( "time elapsed: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 ); return 0; } /***************************************************************************** * Parse command line switches *****************************************************************************/ void initenv ( int argc, char ** argv ) { int copt; int inpseq, outseq; extern char * optarg; char temp[100]; inpseq = outseq = repeatSolve = 0; optind = 1; while ( ( copt = getopt ( argc, argv, "g:M:R" ) ) != EOF ) { switch ( copt ) { case 'M': sscanf ( optarg, "%s", temp ); // M = atoi ( temp ); continue; case 'g': inGraph 
= 1; sscanf ( optarg, "%s", graphfile ); // continue; case 'R': repeatSolve = 1; continue; default: if ( inGraph == 0 ) // { display_contig_usage(); exit ( -1 ); } } } if ( inGraph == 0 ) // { display_contig_usage(); exit ( -1 ); } } static void display_contig_usage() { printf ( "\ncontig -g InputGraph [-M mergeLevel -R]\n" ); printf ( " -g InputFile: prefix of graph file names\n" ); printf ( " -M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during contiging\n" ); printf ( " -R solve_repeats (optional): solve repeats by read paths(default: no)\n" ); } SOAPdenovo-V1.05/src/63mer/cutTip_graph.c000644 000765 000024 00000016661 11530651532 020127 0ustar00Aquastaff000000 000000 /* * 63mer/cutTip_graph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int caseA, caseB, caseC, caseD, caseE; void destroyEdge ( unsigned int edgeid ) { unsigned int bal_ed = getTwinEdge ( edgeid ); ARC * arc; if ( bal_ed == edgeid ) { edge_array[edgeid].length = 0; return; } arc = edge_array[edgeid].arcs; while ( arc ) { arc->bal_arc->to_ed = 0; arc = arc->next; } arc = edge_array[bal_ed].arcs; while ( arc ) { arc->bal_arc->to_ed = 0; arc = arc->next; } edge_array[edgeid].arcs = NULL; edge_array[bal_ed].arcs = NULL; edge_array[edgeid].length = 0; edge_array[bal_ed].length = 0; edge_array[edgeid].deleted = 1; edge_array[bal_ed].deleted = 1; //printf("Destroyed %d and %d\n",edgeid,bal_ed); } ARC * arcCount ( unsigned int edgeid, unsigned int * num ) { ARC * arc; ARC * firstValidArc = NULL; unsigned int count = 0; arc = edge_array[edgeid].arcs; while ( arc ) { if ( arc->to_ed > 0 ) { count++; if ( count == 1 ) { firstValidArc = arc; } else if ( count > 1 ) { *num = count; return firstValidArc; } } arc = arc->next; } *num = count; return firstValidArc; } void removeWeakEdges ( int lenCutoff, unsigned int multiCutoff ) { unsigned int bal_ed; unsigned int arcRight_n, arcLeft_n; ARC * arcLeft, *arcRight; unsigned int i; int counter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted || edge_array[i].length > lenCutoff || EdSameAsTwin ( i ) ) { continue; } bal_ed = getTwinEdge ( i ); arcRight = arcCount ( i, &arcRight_n ); if ( arcRight_n > 1 || !arcRight || arcRight->multiplicity > multiCutoff ) { continue; } arcLeft = arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 1 || !arcLeft || arcLeft->multiplicity > multiCutoff ) { continue; } destroyEdge ( i ); counter++; } printf ( "%d weak inner edges destroyed\n", counter ); removeDeadArcs(); /* linearConcatenate(); compactEdgeArray(); */ } void removeLowCovEdges ( int lenCutoff, unsigned short covCutoff ) { unsigned int bal_ed; unsigned int arcRight_n, arcLeft_n; ARC * arcLeft, 
*arcRight; unsigned int i; int counter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted || edge_array[i].cvg == 0 || edge_array[i].cvg > covCutoff * 10 || edge_array[i].length >= lenCutoff || EdSameAsTwin ( i ) || edge_array[i].length == 0 ) { continue; } bal_ed = getTwinEdge ( i ); arcRight = arcCount ( i, &arcRight_n ); arcLeft = arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n < 1 || arcRight_n < 1 ) { continue; } destroyEdge ( i ); counter++; } printf ( "Remove low coverage(%d): %d inner edges destroyed\n", covCutoff, counter ); removeDeadArcs(); linearConcatenate(); compactEdgeArray(); } boolean isUnreliableTip ( unsigned int edgeid, int cutLen, boolean strict ) { unsigned int arcRight_n, arcLeft_n; unsigned int bal_ed; unsigned int currentEd = edgeid; int length = 0; unsigned int mult = 0; ARC * arc, *activeArc = NULL, *tempArc; if ( edgeid == 0 ) { return 0; } bal_ed = getTwinEdge ( edgeid ); if ( bal_ed == edgeid ) { return 0; } arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 0 ) { return 0; } while ( currentEd ) { arcCount ( bal_ed, &arcLeft_n ); tempArc = arcCount ( currentEd, &arcRight_n ); if ( arcLeft_n > 1 || arcRight_n > 1 ) { break; } length += edge_array[currentEd].length; if ( tempArc ) { activeArc = tempArc; currentEd = activeArc->to_ed; bal_ed = getTwinEdge ( currentEd ); } else { currentEd = 0; } } if ( length >= cutLen ) { return 0; } if ( currentEd == 0 ) { caseB++; return 1; } if ( !strict ) { if ( arcLeft_n < 2 ) { length += edge_array[currentEd].length; } if ( length >= cutLen ) { return 0; } else { caseC++; return 1; } } if ( arcLeft_n < 2 ) { return 0; } if ( !activeArc ) { printf ( "no activeArc while checking edge %d\n", edgeid ); } if ( activeArc->multiplicity == 1 ) { caseD++; return 1; } for ( arc = edge_array[bal_ed].arcs; arc != NULL; arc = arc->next ) if ( arc->multiplicity > mult ) { mult = arc->multiplicity; } if ( mult > activeArc->multiplicity ) { caseE++; } return mult > activeArc->multiplicity; } boolean 
isUnreliableTip_strict ( unsigned int edgeid, int cutLen ) { unsigned int arcRight_n, arcLeft_n; unsigned int bal_ed; unsigned int currentEd = edgeid; int length = 0; unsigned int mult = 0; ARC * arc, *activeArc = NULL, *tempArc; if ( edgeid == 0 ) { return 0; } bal_ed = getTwinEdge ( edgeid ); if ( bal_ed == edgeid ) { return 0; } arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 0 ) { return 0; } while ( currentEd ) { arcCount ( bal_ed, &arcLeft_n ); tempArc = arcCount ( currentEd, &arcRight_n ); if ( arcLeft_n > 1 || arcRight_n > 1 ) { if ( arcLeft_n == 0 || length == 0 ) { return 0; } else { break; } } length += edge_array[currentEd].length; if ( length >= cutLen ) { return 0; } if ( tempArc ) { activeArc = tempArc; currentEd = activeArc->to_ed; bal_ed = getTwinEdge ( currentEd ); } else { currentEd = 0; } } if ( currentEd == 0 ) { caseA++; return 1; } if ( !activeArc ) { printf ( "no activeArc while checking edge %d\n", edgeid ); } if ( activeArc->multiplicity == 1 ) { caseB++; return 1; } for ( arc = edge_array[bal_ed].arcs; arc != NULL; arc = arc->next ) if ( arc->multiplicity > mult ) { mult = arc->multiplicity; } if ( mult > activeArc->multiplicity ) { caseC++; } return mult > activeArc->multiplicity; } void removeDeadArcs() { unsigned int i, count = 0; ARC * arc, *arc_temp; for ( i = 1; i <= num_ed; i++ ) { arc = edge_array[i].arcs; while ( arc ) { arc_temp = arc; arc = arc->next; if ( arc_temp->to_ed == 0 ) { count++; edge_array[i].arcs = deleteArc ( edge_array[i].arcs, arc_temp ); } } } printf ( "%d dead arcs removed\n", count ); } void cutTipsInGraph ( int cutLen, boolean strict ) { int flag = 1; unsigned int i; if ( !cutLen ) { cutLen = 2 * overlaplen; } printf ( "strict %d, cutLen %d\n", strict, cutLen ); if ( strict ) { linearConcatenate(); } caseA = caseB = caseC = caseD = caseE = 0; while ( flag ) { flag = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted ) { continue; } if ( isUnreliableTip ( i, cutLen, strict ) ) { destroyEdge ( i 
); flag++; } } printf ( "a cutTipsInGraph lap, %d tips cut\n", flag ); } removeDeadArcs(); if ( strict ) { printf ( "case A %d, B %d C %d D %d E %d\n", caseA, caseB, caseC, caseD, caseE ); } linearConcatenate(); compactEdgeArray(); } SOAPdenovo-V1.05/src/63mer/cutTipPreGraph.c000644 000765 000024 00000023656 11530651532 020401 0ustar00Aquastaff000000 000000 /* * 63mer/cutTipPreGraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int tip_c; static long long * linearCounter; static void Mark1in1outNode(); static void thread_mark ( KmerSet * set, unsigned char thrdID ); /* static void printKmer(Kmer kmer) { printKmerSeq(stdout,kmer); printf("\n"); } */ static int clipTipFromNode ( kmer_t * node1, int cut_len, boolean THIN ) { unsigned char ret = 0, in_num, out_num, link; int sum, count; kmer_t * out_node; Kmer tempKmer, pre_word, word, bal_word; ubyte8 hash_ban; char ch1, ch; boolean smaller, found; int setPicker; unsigned int max_links, singleCvg; if ( node1->linear || node1->deleted ) { return ret; } if ( THIN && !node1->single ) { return ret; } in_num = count_branch2prev ( node1 ); out_num = count_branch2next ( node1 ); if ( in_num == 0 && out_num == 1 ) { pre_word = node1->seq; for ( ch1 = 0; ch1 < 4; ch1++ ) { link = get_kmer_right_cov ( *node1, ch1 ); if ( link ) { break; } } word = nextKmer ( pre_word, ch1 ); } else if ( in_num == 1 && out_num == 0 ) { pre_word = reverseComplement ( node1->seq, overlaplen ); for ( ch1 = 0; ch1 < 4; ch1++ ) { link = get_kmer_left_cov ( *node1, ch1 ); if ( link ) { break; } } word = nextKmer ( pre_word, int_comp ( ch1 ) ); } else { return ret; } count = 1; bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx%llx not found, node1 %llx%llx\n", word.high, word.low, node1->seq.high, node1->seq.low ); exit ( 1 ); } while ( out_node->linear ) { count++; if ( THIN && !out_node->single ) { break; } if ( count > cut_len ) { return ret; } if ( smaller ) { pre_word = word; for ( ch = 0; ch < 4; ch++ ) { link = get_kmer_right_cov ( *out_node, ch ); if ( link ) { break; } } word = 
nextKmer ( pre_word, ch ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx%llx not found, node1 %llx%llx\n", word.high, word.low, node1->seq.high, node1->seq.low ); printf ( "pre_word %llx%llx with %d(smaller)\n", pre_word.high, pre_word.low, ch ); exit ( 1 ); } } else { pre_word = bal_word; for ( ch = 0; ch < 4; ch++ ) { link = get_kmer_left_cov ( *out_node, ch ); if ( link ) { break; } } word = nextKmer ( pre_word, int_comp ( ch ) ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx%llx not found, node1 %llx%llx\n", word.high, word.low, node1->seq.high, node1->seq.low ); printf ( "pre_word %llx%llx with %d(larger)\n", reverseComplement ( pre_word, overlaplen ).high, reverseComplement ( pre_word, overlaplen ).low, int_comp ( ch ) ); exit ( 1 ); } } } if ( ( sum = count_branch2next ( out_node ) + count_branch2prev ( out_node ) ) == 1 ) { tip_c++; node1->deleted = 1; out_node->deleted = 1; return 1; } else { ch = firstCharInKmer ( pre_word ); if ( THIN ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( out_node, ch, smaller ); out_node->linear = 0; return 1; } // make sure this tip doesn't provide most links to out_node max_links = 0; for ( ch1 = 0; ch1 < 4; ch1++ ) { if ( smaller ) { singleCvg = get_kmer_left_cov ( *out_node, ch1 ); if ( singleCvg > max_links ) { max_links = singleCvg; } } else { singleCvg = get_kmer_right_cov ( *out_node, ch1 ); if ( singleCvg > max_links ) 
{ max_links = singleCvg; } } } if ( smaller && get_kmer_left_cov ( *out_node, ch ) < max_links ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( out_node, ch, smaller ); if ( count_branch2prev ( out_node ) == 1 && count_branch2next ( out_node ) == 1 ) { out_node->linear = 1; } return 1; } if ( !smaller && get_kmer_right_cov ( *out_node, int_comp ( ch ) ) < max_links ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( out_node, ch, smaller ); if ( count_branch2prev ( out_node ) == 1 && count_branch2next ( out_node ) == 1 ) { out_node->linear = 1; } return 1; } } return 0; } void removeSingleTips() { int i, flag = 0, cut_len_tip; kmer_t * rs; KmerSet * set; //count_ends(hash_table); cut_len_tip = 2 * overlaplen; // >= maxReadLen4all-overlaplen+1 ? 2*overlaplen : maxReadLen4all-overlaplen+1; printf ( "Start to remove tips of single frequency kmers short than %d\n", cut_len_tip ); tip_c = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; flag += clipTipFromNode ( rs, cut_len_tip, 1 ); } set->iter_ptr ++; } } printf ( "%d tips off\n", tip_c ); Mark1in1outNode(); } void removeMinorTips() { int i, flag = 0, cut_len_tip; kmer_t * rs; KmerSet * set; //count_ends(hash_table); //cut_len_tip = 2*overlaplen >= maxReadLen4all-overlaplen+1 ? 
2*overlaplen : maxReadLen4all-overlaplen+1; cut_len_tip = 2 * overlaplen; printf ( "Start to remove tips which don't contribute the most links\n" ); tip_c = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; flag = 1; while ( flag ) { flag = 0; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; flag += clipTipFromNode ( rs, cut_len_tip, 0 ); } set->iter_ptr ++; } } printf ( "kmer set %d done\n", i ); } printf ( "%d tips off\n", tip_c ); Mark1in1outNode(); } static void threadRoutine ( void * para ) { PARAMETER * prm; unsigned char id; prm = ( PARAMETER * ) para; id = prm->threadID; //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 2 ) { * ( prm->selfSignal ) = 0; break; } else if ( * ( prm->selfSignal ) == 1 ) { thread_mark ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } usleep ( 1 ); } } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_mark ( KmerSet * set, unsigned char thrdID ) { int in_num, out_num; kmer_t * rs; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; if ( rs->deleted || rs->linear ) { set->iter_ptr ++; continue;; } in_num = count_branch2prev ( rs ); out_num = count_branch2next ( rs ); if ( in_num == 1 && out_num == 1 ) { rs->linear = 1; linearCounter[thrdID]++; } } set->iter_ptr ++; } //printf("%lld more linear\n",linearCounter[thrdID]); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join 
( threads[i], NULL ); } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void Mark1in1outNode() { int i; long long counter = 0; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; for ( i = 0; i < thrd_num; i++ ) { thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); thrdSignal[0] = 0; linearCounter = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) ); for ( i = 0; i < thrd_num; i++ ) { linearCounter[i] = 0; } sendWorkSignal ( 1, thrdSignal ); //mark linear nodes sendWorkSignal ( 2, thrdSignal ); thread_wait ( threads ); for ( i = 0; i < thrd_num; i++ ) { counter += linearCounter[i]; } free ( ( void * ) linearCounter ); printf ( "%lld linear nodes\n", counter ); } SOAPdenovo-V1.05/src/63mer/darray.c000644 000765 000024 00000004254 11530651532 016753 0ustar00Aquastaff000000 000000 /* * 63mer/darray.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
 *
 */

#include "darray.h"
#include "check.h"

DARRAY * createDarray ( int num_items, size_t unit_size )
{
    DARRAY * newDarray = ( DARRAY * ) malloc ( 1 * sizeof ( DARRAY ) );

    /* stop here instead of dereferencing a failed allocation */
    if ( !newDarray )
    {
        printf ( "createDarray: out of memory\n" );
        exit ( 1 );
    }

    newDarray->array_size = num_items;
    newDarray->item_size = unit_size;
    newDarray->item_c = 0;
    newDarray->array = ( void * ) ckalloc ( num_items * unit_size );
    return newDarray;
}

void * darrayPut ( DARRAY * darray, long long index )
{
    int i = 2;

    if ( index + 1 > darray->item_c )
    {
        darray->item_c = index + 1;
    }

    if ( index < darray->array_size )
    {
        /* byte offsets are computed on a char pointer; arithmetic on void pointers is a GNU extension */
        return ( void * ) ( ( char * ) darray->array + darray->item_size * index );
    }

    /* grow by the smallest factor i with index < i * array_size;
     * ">=" keeps index in bounds when it equals i * array_size */
    while ( index >= i * darray->array_size )
    {
        i++;
    }

    darray->array = ( void * ) ckrealloc ( darray->array, i * darray->array_size * darray->item_size, darray->array_size * darray->item_size );
    darray->array_size *= i;
    return ( void * ) ( ( char * ) darray->array + darray->item_size * index );
}

void * darrayGet ( DARRAY * darray, long long index )
{
    if ( index < darray->array_size )
    {
        return ( void * ) ( ( char * ) darray->array + darray->item_size * index );
    }

    printf ( "array read index %lld out of range %lld\n", index, ( long long ) darray->array_size );
    return NULL;
}

void emptyDarray ( DARRAY * darray )
{
    darray->item_c = 0;
}

void freeDarray ( DARRAY * darray )
{
    if ( !darray )
    {
        return;
    }

    if ( darray->array )
    {
        free ( ( void * ) darray->array );
    }

    free ( ( void * ) darray );
}

SOAPdenovo-V1.05/src/63mer/dfib.c000644 000765 000024 00000026444 11530651532 016402 0ustar00Aquastaff000000 000000
/*
 * 63mer/dfib.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
 *
 * $Id: dfib.c,v 1.12 2007/10/19 13:09:26 zerbino Exp $
 *
 */

#include <limits.h>  /* INT_MIN */
#include <stdlib.h>  /* malloc, realloc, free, abort */
#include "dfib.h"
#include "dfibpriv.h"
#include "extfunc2.h"

#define HEAPBLOCKSIZE 1000

static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b );

static DFibHeapNode * allocateDFibHeapNode ( DFibHeap * heap )
{
    return ( DFibHeapNode * ) getItem ( heap->nodeMemory );
}

static void deallocateDFibHeapNode ( DFibHeapNode * a, DFibHeap * heap )
{
    returnItem ( heap->nodeMemory, a );
}

IDnum dfibheap_getSize ( DFibHeap * heap )
{
    return heap->dfh_n;
}

#define swap(type, a, b) \
    do { \
        type c; \
        c = a; \
        a = b; \
        b = c; \
    } while (0)

#define INT_BITS (sizeof(IDnum) * 8)

/* returns the smallest i such that ( 1 << i ) >= a */
static inline IDnum ceillog2 ( IDnum a )
{
    IDnum oa;
    IDnum i;
    IDnum b;
    IDnum cons;
    oa = a;
    b = INT_BITS / 2;
    i = 0;

    while ( b )
    {
        i = ( i << 1 );
        cons = ( ( IDnum ) 1 ) << b;

        if ( a >= cons )
        {
            a /= cons;
            i = i | 1;
        }
        else
        {
            a &= cons - 1;
        }

        b /= 2;
    }

    if ( ( ( ( IDnum ) 1 << i ) ) == oa )
    {
        return i;
    }
    else
    {
        return i + 1;
    }
}

/*
 * Public Heap Functions
 */
DFibHeap * dfh_makekeyheap()
{
    DFibHeap * n;

    if ( ( n = malloc ( sizeof * n ) ) == NULL )
    {
        return NULL;
    }

    n->nodeMemory = createMem_manager ( HEAPBLOCKSIZE, sizeof ( DFibHeapNode ) );
    n->dfh_n = 0;
    n->dfh_Dl = -1;
    n->dfh_cons = NULL;
    n->dfh_min = NULL;
    n->dfh_root = NULL;
    return n;
}

void dfh_deleteheap ( DFibHeap * h )
{
    printf ( "DFibHeap: %lld Nodes allocated\n", h->nodeMemory->counter );
    freeMem_manager ( h->nodeMemory );
    h->nodeMemory = NULL;

    if ( h->dfh_cons != NULL )
    {
        free ( h->dfh_cons );
    }

    free ( h );
}

/*
 * Public Key Heap Functions
 */
DFibHeapNode * dfh_insertkey ( DFibHeap * h, Time key, unsigned int data )
{
    DFibHeapNode * x;

    if ( ( x = dfhe_newelem ( h ) ) == NULL )
    {
        return NULL;
    }

    /* just insert on root list, and make sure it's not the new min */
    x->dfhe_data = data;
    x->dfhe_key = key;
    dfh_insertel ( h, x );
    return x;
}

Time dfh_replacekey ( DFibHeap * h, DFibHeapNode * x, Time key )
{
    Time ret;
    ret = x->dfhe_key;
    ( void
) dfh_replacekeydata ( h, x, key, x->dfhe_data ); return ret; } unsigned int minInDHeap ( DFibHeap * h ) { if ( h->dfh_min ) { return h->dfh_min->dfhe_data; } else { return 0; } } boolean HasMin ( DFibHeap * h ) { if ( h->dfh_min ) { return 1; } else { return 0; } } unsigned int dfh_replacekeydata ( DFibHeap * h, DFibHeapNode * x, Time key, unsigned int data ) { unsigned int odata; Time okey; DFibHeapNode * y; int r; odata = x->dfhe_data; okey = x->dfhe_key; /* * we can increase a key by deleting and reinserting, that * requires O(lgn) time. */ if ( ( r = dfh_comparedata ( h, key, data, x ) ) > 0 ) { /* XXX - bad code! */ abort(); } x->dfhe_data = data; x->dfhe_key = key; /* because they are equal, we don't have to do anything */ if ( r == 0 ) { return odata; } y = x->dfhe_p; if ( okey == key ) { return odata; } if ( y != NULL && dfh_compare ( h, x, y ) <= 0 ) { dfh_cut ( h, x, y ); dfh_cascading_cut ( h, y ); } /* * the = is so that the call from dfh_delete will delete the proper * element. 
     */
    if ( dfh_compare ( h, x, h->dfh_min ) <= 0 )
    {
        h->dfh_min = x;
    }

    return odata;
}

/*
 * Public void * Heap Functions
 */

/*
 * returns the data stored in the minimum node and releases that node,
 * or returns 0 when the heap is empty
 */
unsigned int dfh_extractmin ( DFibHeap * h )
{
    DFibHeapNode * z;
    unsigned int ret;
    ret = 0;

    if ( h->dfh_min != NULL )
    {
        z = dfh_extractminel ( h );
        ret = z->dfhe_data;
        deallocateDFibHeapNode ( z, h );
    }

    return ret;
}

unsigned int dfh_replacedata ( DFibHeapNode * x, unsigned int data )
{
    unsigned int odata = x->dfhe_data;
    //printf("replace node value %d with %d\n",x->dfhe_data,data);
    x->dfhe_data = data;
    return odata;
}

unsigned int dfh_delete ( DFibHeap * h, DFibHeapNode * x )
{
    unsigned int k;
    //printf("destroy node %d in dheap\n",x->dfhe_data);
    k = x->dfhe_data;
    dfh_replacekey ( h, x, INT_MIN );
    dfh_extractmin ( h );
    return k;
}

/*
 * beginning of private element functions
 */
static DFibHeapNode * dfh_extractminel ( DFibHeap * h )
{
    DFibHeapNode * ret;
    DFibHeapNode * x, *y, *orig;
    ret = h->dfh_min;
    orig = NULL;

    /* put all the children on the root list */
    /* for true consistency, we should use dfhe_remove */
    for ( x = ret->dfhe_child; x != orig && x != NULL; )
    {
        if ( orig == NULL )
        {
            orig = x;
        }

        y = x->dfhe_right;
        x->dfhe_p = NULL;
        dfh_insertrootlist ( h, x );
        x = y;
    }

    /* remove minimum from root list */
    dfh_removerootlist ( h, ret );
    h->dfh_n--;

    /* if we aren't empty, consolidate the heap */
    if ( h->dfh_n == 0 )
    {
        h->dfh_min = NULL;
    }
    else
    {
        h->dfh_min = ret->dfhe_right;
        dfh_consolidate ( h );
    }

    return ret;
}

static void dfh_insertrootlist ( DFibHeap * h, DFibHeapNode * x )
{
    if ( h->dfh_root == NULL )
    {
        h->dfh_root = x;
        x->dfhe_left = x;
        x->dfhe_right = x;
        return;
    }

    dfhe_insertafter ( h->dfh_root, x );
}

static void dfh_removerootlist ( DFibHeap * h, DFibHeapNode * x )
{
    if ( x->dfhe_left == x )
    {
        h->dfh_root = NULL;
    }
    else
    {
        h->dfh_root = dfhe_remove ( x );
    }
}

static void dfh_consolidate ( DFibHeap * h )
{
    DFibHeapNode ** a;
    DFibHeapNode * w;
    DFibHeapNode * y;
    DFibHeapNode * x;
    IDnum i;
    IDnum d;
    IDnum D;
    dfh_checkcons ( h );
    /* assign a the value of h->dfh_cons so I don't have to rewrite code */
    D = h->dfh_Dl + 1;
    a = h->dfh_cons;

    for ( i = 0; i < D; i++ )
    {
        a[i] = NULL;
    }

    while ( ( w = h->dfh_root ) != NULL )
    {
        x = w;
        dfh_removerootlist ( h, w );
        d = x->dfhe_degree;

        /* XXX - assert that d < D */
        while ( a[d] != NULL )
        {
            y = a[d];

            if ( dfh_compare ( h, x, y ) > 0 )
            {
                swap ( DFibHeapNode *, x, y );
            }

            dfh_heaplink ( h, y, x );
            a[d] = NULL;
            d++;
        }

        a[d] = x;
    }

    h->dfh_min = NULL;

    for ( i = 0; i < D; i++ )
        if ( a[i] != NULL )
        {
            dfh_insertrootlist ( h, a[i] );

            if ( h->dfh_min == NULL || dfh_compare ( h, a[i], h->dfh_min ) < 0 )
            {
                h->dfh_min = a[i];
            }
        }
}

static void dfh_heaplink ( DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x )
{
    /* make y a child of x */
    if ( x->dfhe_child == NULL )
    {
        x->dfhe_child = y;
    }
    else
    {
        dfhe_insertbefore ( x->dfhe_child, y );
    }

    y->dfhe_p = x;
    x->dfhe_degree++;
    y->dfhe_mark = 0;
}

static void dfh_cut ( DFibHeap * h, DFibHeapNode * x, DFibHeapNode * y )
{
    dfhe_remove ( x );
    y->dfhe_degree--;
    dfh_insertrootlist ( h, x );
    x->dfhe_p = NULL;
    x->dfhe_mark = 0;
}

static void dfh_cascading_cut ( DFibHeap * h, DFibHeapNode * y )
{
    DFibHeapNode * z;

    while ( ( z = y->dfhe_p ) != NULL )
    {
        if ( y->dfhe_mark == 0 )
        {
            y->dfhe_mark = 1;
            return;
        }
        else
        {
            dfh_cut ( h, y, z );
            y = z;
        }
    }
}

/*
 * beginning of handling elements of dfibheap
 */
static DFibHeapNode * dfhe_newelem ( DFibHeap * h )
{
    DFibHeapNode * e;

    if ( ( e = allocateDFibHeapNode ( h ) ) == NULL )
    {
        return NULL;
    }

    e->dfhe_degree = 0;
    e->dfhe_mark = 0;
    e->dfhe_p = NULL;
    e->dfhe_child = NULL;
    e->dfhe_left = e;
    e->dfhe_right = e;
    e->dfhe_data = 0;
    return e;
}

static void dfhe_insertafter ( DFibHeapNode * a, DFibHeapNode * b )
{
    if ( a == a->dfhe_right )
    {
        a->dfhe_right = b;
        a->dfhe_left = b;
        b->dfhe_right = a;
        b->dfhe_left = a;
    }
    else
    {
        b->dfhe_right = a->dfhe_right;
        a->dfhe_right->dfhe_left = b;
        a->dfhe_right = b;
b->dfhe_left = a; } } static inline void dfhe_insertbefore ( DFibHeapNode * a, DFibHeapNode * b ) { dfhe_insertafter ( a->dfhe_left, b ); } static DFibHeapNode * dfhe_remove ( DFibHeapNode * x ) { DFibHeapNode * ret; if ( x == x->dfhe_left ) { ret = NULL; } else { ret = x->dfhe_left; } /* fix the parent pointer */ if ( x->dfhe_p != NULL && x->dfhe_p->dfhe_child == x ) { x->dfhe_p->dfhe_child = ret; } x->dfhe_right->dfhe_left = x->dfhe_left; x->dfhe_left->dfhe_right = x->dfhe_right; /* clear out hanging pointers */ x->dfhe_p = NULL; x->dfhe_left = x; x->dfhe_right = x; return ret; } static void dfh_checkcons ( DFibHeap * h ) { IDnum oDl; /* make sure we have enough memory allocated to "reorganize" */ if ( h->dfh_Dl == -1 || h->dfh_n > ( 1 << h->dfh_Dl ) ) { oDl = h->dfh_Dl; if ( ( h->dfh_Dl = ceillog2 ( h->dfh_n ) + 1 ) < 8 ) { h->dfh_Dl = 8; } if ( oDl != h->dfh_Dl ) h->dfh_cons = ( DFibHeapNode ** ) realloc ( h->dfh_cons, sizeof * h-> dfh_cons * ( h->dfh_Dl + 1 ) ); if ( h->dfh_cons == NULL ) { abort(); } } } static int dfh_compare ( DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b ) { if ( a->dfhe_key < b->dfhe_key ) { return -1; } if ( a->dfhe_key == b->dfhe_key ) { return 0; } return 1; } static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b ) { DFibHeapNode a; a.dfhe_key = key; a.dfhe_data = data; return dfh_compare ( h, &a, b ); } static void dfh_insertel ( DFibHeap * h, DFibHeapNode * x ) { dfh_insertrootlist ( h, x ); if ( h->dfh_min == NULL || x->dfhe_key < h->dfh_min->dfhe_key ) { h->dfh_min = x; } h->dfh_n++; } Time dfibheap_el_getKey ( DFibHeapNode * node ) { return node->dfhe_key; } SOAPdenovo-V1.05/src/63mer/dfibHeap.c000644 000765 000024 00000004232 11530651532 017167 0ustar00Aquastaff000000 000000 /* * 63mer/dfibHeap.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include #include "def2.h" #include "dfib.h" // Return number of elements stored in heap IDnum getDFibHeapSize ( DFibHeap * heap ) { return dfibheap_getSize ( heap ); } // Constructor // Memory allocated DFibHeap * newDFibHeap() { return dfh_makekeyheap(); } // Add new node into heap with a key, and a pointer to the specified node DFibHeapNode * insertNodeIntoDHeap ( DFibHeap * heap, Time key, unsigned int node ) { DFibHeapNode * res; res = dfh_insertkey ( heap, key, node ); return res; } // Replaces the key for a given node Time replaceKeyInDHeap ( DFibHeap * heap, DFibHeapNode * node, Time newKey ) { Time res; res = dfh_replacekey ( heap, node, newKey ); return res; } // Removes the node with the shortest key, then returns it. 
unsigned int removeNextNodeFromDHeap ( DFibHeap * heap ) { unsigned int node; node = ( unsigned int ) dfh_extractmin ( heap ); return node; } // Destructor void destroyDHeap ( DFibHeap * heap ) { dfh_deleteheap ( heap ); } // Replace the node pointed to by a heap node void replaceValueInDHeap ( DFibHeapNode * node, unsigned int newValue ) { dfh_replacedata ( node, newValue ); } // Remove unwanted node void destroyNodeInDHeap ( DFibHeapNode * node, DFibHeap * heap ) { dfh_delete ( heap, node ); } Time getKey ( DFibHeapNode * node ) { return dfibheap_el_getKey ( node ); } SOAPdenovo-V1.05/src/63mer/fib.c000644 000765 000024 00000031767 11530651532 016242 0ustar00Aquastaff000000 000000 /* * 63mer/fib.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $Id: fib.c,v 1.10 2007/10/19 13:09:26 zerbino Exp $
 *
 */

#include <limits.h>  /* INT_MIN */
#include <stdlib.h>  /* malloc, realloc, free, abort */
#include "fib.h"
#include "fibpriv.h"
#include "extfunc2.h"

#define HEAPBLOCKSIZE 10000

static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b );
unsigned int fh_replacekeydata ( FibHeap * h, FibHeapNode * x, Coordinate key, unsigned int data );

static FibHeapNode * allocateFibHeapEl ( FibHeap * heap )
{
    return ( FibHeapNode * ) getItem ( heap->nodeMemory );
}

static void deallocateFibHeapEl ( FibHeapNode * a, FibHeap * heap )
{
    returnItem ( heap->nodeMemory, a );
}

#define swap(type, a, b) \
    do { \
        type c; \
        c = a; \
        a = b; \
        b = c; \
    } while (0)

#define INT_BITS (sizeof(IDnum) * 8)

/* returns the smallest i such that ( 1 << i ) >= a */
static inline IDnum ceillog2 ( IDnum a )
{
    IDnum oa;
    IDnum i;
    IDnum b;
    IDnum cons;
    oa = a;
    b = INT_BITS / 2;
    i = 0;

    while ( b )
    {
        i = ( i << 1 );
        cons = ( ( IDnum ) 1 ) << b;

        if ( a >= cons )
        {
            a /= cons;
            i = i | 1;
        }
        else
        {
            a &= cons - 1;
        }

        b /= 2;
    }

    if ( ( ( ( IDnum ) 1 << i ) ) == oa )
    {
        return i;
    }
    else
    {
        return i + 1;
    }
}

/*
 * Private Heap Functions
 */
static void fh_initheap ( FibHeap * new )
{
    new->fh_cmp_fnct = NULL;
    new->nodeMemory = createMem_manager ( sizeof ( FibHeapNode ), HEAPBLOCKSIZE );
    new->fh_neginf = 0;
new->fh_n = 0; new->fh_Dl = -1; new->fh_cons = NULL; new->fh_min = NULL; new->fh_root = NULL; new->fh_keys = 0; } static void fh_destroyheap ( FibHeap * h ) { h->fh_cmp_fnct = NULL; h->fh_neginf = 0; if ( h->fh_cons != NULL ) { free ( h->fh_cons ); } h->fh_cons = NULL; free ( h ); } /* * Public Heap Functions */ FibHeap * fh_makekeyheap() { FibHeap * n; if ( ( n = malloc ( sizeof * n ) ) == NULL ) { return NULL; } fh_initheap ( n ); n->fh_keys = 1; return n; } FibHeap * fh_makeheap() { FibHeap * n; if ( ( n = malloc ( sizeof * n ) ) == NULL ) { return NULL; } fh_initheap ( n ); return n; } voidcmp fh_setcmp ( FibHeap * h, voidcmp fnct ) { voidcmp oldfnct; oldfnct = h->fh_cmp_fnct; h->fh_cmp_fnct = fnct; return oldfnct; } unsigned int fh_setneginf ( FibHeap * h, unsigned int data ) { unsigned int old; old = h->fh_neginf; h->fh_neginf = data; return old; } FibHeap * fh_union ( FibHeap * ha, FibHeap * hb ) { FibHeapNode * x; if ( ha->fh_root == NULL || hb->fh_root == NULL ) { /* either one or both are empty */ if ( ha->fh_root == NULL ) { fh_destroyheap ( ha ); return hb; } else { fh_destroyheap ( hb ); return ha; } } ha->fh_root->fhe_left->fhe_right = hb->fh_root; hb->fh_root->fhe_left->fhe_right = ha->fh_root; x = ha->fh_root->fhe_left; ha->fh_root->fhe_left = hb->fh_root->fhe_left; hb->fh_root->fhe_left = x; ha->fh_n += hb->fh_n; /* * we probably should also keep stats on number of unions */ /* set fh_min if necessary */ if ( fh_compare ( ha, hb->fh_min, ha->fh_min ) < 0 ) { ha->fh_min = hb->fh_min; } fh_destroyheap ( hb ); return ha; } void fh_deleteheap ( FibHeap * h ) { freeMem_manager ( h->nodeMemory ); h->nodeMemory = NULL; fh_destroyheap ( h ); } /* * Public Key Heap Functions */ FibHeapNode * fh_insertkey ( FibHeap * h, Coordinate key, unsigned int data ) { FibHeapNode * x; if ( ( x = fhe_newelem ( h ) ) == NULL ) { return NULL; } /* just insert on root list, and make sure it's not the new min */ x->fhe_data = data; x->fhe_key = key; fh_insertel ( h, x ); 
return x; } boolean fh_isempty ( FibHeap * h ) { if ( h->fh_min == NULL ) { return 1; } else { return 0; } } Coordinate fh_minkey ( FibHeap * h ) { if ( h->fh_min == NULL ) { return INT_MIN; } return h->fh_min->fhe_key; } unsigned int fh_replacekeydata ( FibHeap * h, FibHeapNode * x, Coordinate key, unsigned int data ) { unsigned int odata; Coordinate okey; FibHeapNode * y; int r; odata = x->fhe_data; okey = x->fhe_key; /* * we can increase a key by deleting and reinserting, that * requires O(lgn) time. */ if ( ( r = fh_comparedata ( h, key, data, x ) ) > 0 ) { /* XXX - bad code! */ abort(); } x->fhe_data = data; x->fhe_key = key; /* because they are equal, we don't have to do anything */ if ( r == 0 ) { return odata; } y = x->fhe_p; if ( h->fh_keys && okey == key ) { return odata; } if ( y != NULL && fh_compare ( h, x, y ) <= 0 ) { fh_cut ( h, x, y ); fh_cascading_cut ( h, y ); } /* * the = is so that the call from fh_delete will delete the proper * element. */ if ( fh_compare ( h, x, h->fh_min ) <= 0 ) { h->fh_min = x; } return odata; } Coordinate fh_replacekey ( FibHeap * h, FibHeapNode * x, Coordinate key ) { Coordinate ret; ret = x->fhe_key; ( void ) fh_replacekeydata ( h, x, key, x->fhe_data ); return ret; } /* * Public void * Heap Functions */ /* * this will return these values: * NULL failed for some reason * ptr token to use for manipulation of data */ FibHeapNode * fh_insert ( FibHeap * h, unsigned int data ) { FibHeapNode * x; if ( ( x = fhe_newelem ( h ) ) == NULL ) { return NULL; } /* just insert on root list, and make sure it's not the new min */ x->fhe_data = data; fh_insertel ( h, x ); return x; } unsigned int fh_min ( FibHeap * h ) { if ( h->fh_min == NULL ) { return 0; } return h->fh_min->fhe_data; } unsigned int fh_extractmin ( FibHeap * h ) { FibHeapNode * z; unsigned int ret = 0; if ( h->fh_min != NULL ) { z = fh_extractminel ( h ); ret = z->fhe_data; #ifndef NO_FREE deallocateFibHeapEl ( z, h ); #endif } return ret; } unsigned int 
fh_replacedata ( FibHeapNode * x, unsigned int data )
{
    unsigned int odata = x->fhe_data;
    x->fhe_data = data;
    return odata;
}

unsigned int fh_delete ( FibHeap * h, FibHeapNode * x )
{
    unsigned int k;
    k = x->fhe_data;

    if ( !h->fh_keys )
    {
        fh_replacedata ( x, h->fh_neginf );
    }
    else
    {
        fh_replacekey ( h, x, INT_MIN );
    }

    fh_extractmin ( h );
    return k;
}

/*
 * beginning of private element functions
 */
static FibHeapNode * fh_extractminel ( FibHeap * h )
{
    FibHeapNode * ret;
    FibHeapNode * x, *y, *orig;
    ret = h->fh_min;
    orig = NULL;

    /* put all the children on the root list */
    /* for true consistency, we should use fhe_remove */
    for ( x = ret->fhe_child; x != orig && x != NULL; )
    {
        if ( orig == NULL )
        {
            orig = x;
        }

        y = x->fhe_right;
        x->fhe_p = NULL;
        fh_insertrootlist ( h, x );
        x = y;
    }

    /* remove minimum from root list */
    fh_removerootlist ( h, ret );
    h->fh_n--;

    /* if we aren't empty, consolidate the heap */
    if ( h->fh_n == 0 )
    {
        h->fh_min = NULL;
    }
    else
    {
        h->fh_min = ret->fhe_right;
        fh_consolidate ( h );
    }

    return ret;
}

static void fh_insertrootlist ( FibHeap * h, FibHeapNode * x )
{
    if ( h->fh_root == NULL )
    {
        h->fh_root = x;
        x->fhe_left = x;
        x->fhe_right = x;
        return;
    }

    fhe_insertafter ( h->fh_root, x );
}

static void fh_removerootlist ( FibHeap * h, FibHeapNode * x )
{
    if ( x->fhe_left == x )
    {
        h->fh_root = NULL;
    }
    else
    {
        h->fh_root = fhe_remove ( x );
    }
}

static void fh_consolidate ( FibHeap * h )
{
    FibHeapNode ** a;
    FibHeapNode * w;
    FibHeapNode * y;
    FibHeapNode * x;
    IDnum i;
    IDnum d;
    IDnum D;
    fh_checkcons ( h );
    /* assign a the value of h->fh_cons so I don't have to rewrite code */
    D = h->fh_Dl + 1;
    a = h->fh_cons;

    for ( i = 0; i < D; i++ )
    {
        a[i] = NULL;
    }

    while ( ( w = h->fh_root ) != NULL )
    {
        x = w;
        fh_removerootlist ( h, w );
        d = x->fhe_degree;

        /* XXX - assert that d < D */
        while ( a[d] != NULL )
        {
            y = a[d];

            if ( fh_compare ( h, x, y ) > 0 )
            {
                swap ( FibHeapNode *, x, y );
            }

            fh_heaplink ( h, y, x );
            a[d] = NULL;
            d++;
        }

        a[d] = x;
    }

    h->fh_min = NULL;

    for ( i = 0; i < D; i++
    )
        if ( a[i] != NULL ) {
            fh_insertrootlist ( h, a[i] );

            if ( h->fh_min == NULL || fh_compare ( h, a[i], h->fh_min ) < 0 ) {
                h->fh_min = a[i];
            }
        }
}

static void fh_heaplink ( FibHeap * h, FibHeapNode * y, FibHeapNode * x )
{
    /* make y a child of x */
    if ( x->fhe_child == NULL ) {
        x->fhe_child = y;
    } else {
        fhe_insertbefore ( x->fhe_child, y );
    }

    y->fhe_p = x;
    x->fhe_degree++;
    y->fhe_mark = 0;
}

static void fh_cut ( FibHeap * h, FibHeapNode * x, FibHeapNode * y )
{
    fhe_remove ( x );
    y->fhe_degree--;
    fh_insertrootlist ( h, x );
    x->fhe_p = NULL;
    x->fhe_mark = 0;
}

static void fh_cascading_cut ( FibHeap * h, FibHeapNode * y )
{
    FibHeapNode * z;

    while ( ( z = y->fhe_p ) != NULL ) {
        if ( y->fhe_mark == 0 ) {
            y->fhe_mark = 1;
            return;
        } else {
            fh_cut ( h, y, z );
            y = z;
        }
    }
}

/*
 * beginning of handling elements of fibheap
 */
static FibHeapNode * fhe_newelem ( FibHeap * h )
{
    FibHeapNode * e;

    if ( ( e = allocateFibHeapEl ( h ) ) == NULL ) {
        return NULL;
    }

    fhe_initelem ( e );
    return e;
}

static void fhe_initelem ( FibHeapNode * e )
{
    e->fhe_degree = 0;
    e->fhe_mark = 0;
    e->fhe_p = NULL;
    e->fhe_child = NULL;
    e->fhe_left = e;
    e->fhe_right = e;
    e->fhe_data = 0;
}

static void fhe_insertafter ( FibHeapNode * a, FibHeapNode * b )
{
    if ( a == a->fhe_right ) {
        a->fhe_right = b;
        a->fhe_left = b;
        b->fhe_right = a;
        b->fhe_left = a;
    } else {
        b->fhe_right = a->fhe_right;
        a->fhe_right->fhe_left = b;
        a->fhe_right = b;
        b->fhe_left = a;
    }
}

static inline void fhe_insertbefore ( FibHeapNode * a, FibHeapNode * b )
{
    fhe_insertafter ( a->fhe_left, b );
}

static FibHeapNode * fhe_remove ( FibHeapNode * x )
{
    FibHeapNode * ret;

    if ( x == x->fhe_left ) {
        ret = NULL;
    } else {
        ret = x->fhe_left;
    }

    /* fix the parent pointer */
    if ( x->fhe_p != NULL && x->fhe_p->fhe_child == x ) {
        x->fhe_p->fhe_child = ret;
    }

    x->fhe_right->fhe_left = x->fhe_left;
    x->fhe_left->fhe_right = x->fhe_right;
    /* clear out hanging pointers */
    x->fhe_p = NULL;
    x->fhe_left = x;
    x->fhe_right = x;
    return ret;
}

static void
fh_checkcons ( FibHeap * h )
{
    IDnum oDl;

    /* make sure we have enough memory allocated to "reorganize" */
    if ( h->fh_Dl == -1 || h->fh_n > ( 1 << h->fh_Dl ) ) {
        oDl = h->fh_Dl;

        if ( ( h->fh_Dl = ceillog2 ( h->fh_n ) + 1 ) < 8 ) {
            h->fh_Dl = 8;
        }

        if ( oDl != h->fh_Dl )
            h->fh_cons = ( FibHeapNode ** ) realloc ( h->fh_cons, sizeof * h->fh_cons * ( h->fh_Dl + 1 ) );

        if ( h->fh_cons == NULL ) {
            abort();
        }
    }
}

static int fh_compare ( FibHeap * h, FibHeapNode * a, FibHeapNode * b )
{
    if ( a->fhe_key < b->fhe_key ) {
        return -1;
    }

    if ( a->fhe_key == b->fhe_key ) {
        return 0;
    }

    return 1;
}

static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b )
{
    FibHeapNode a;
    a.fhe_key = key;
    a.fhe_data = data;
    return fh_compare ( h, &a, b );
}

static void fh_insertel ( FibHeap * h, FibHeapNode * x )
{
    fh_insertrootlist ( h, x );

    if ( h->fh_min == NULL || ( h->fh_keys ? x->fhe_key < h->fh_min->fhe_key : h->fh_cmp_fnct ( x->fhe_data, h->fh_min->fhe_data ) < 0 ) ) {
        h->fh_min = x;
    }

    h->fh_n++;
}

/*
 * 63mer/fibHeap.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "fib.h"

// Constructor
// Memory allocated
FibHeap * newFibHeap()
{
    return fh_makekeyheap();
}

// Add new node into heap with a key, and a pointer to the specified node
FibHeapNode * insertNodeIntoHeap ( FibHeap * heap, Coordinate key, unsigned int node )
{
    return fh_insertkey ( heap, key, node );
}

// Returns smallest key in heap
Coordinate minKeyOfHeap ( FibHeap * heap )
{
    return fh_minkey ( heap );
}

// Replaces the key for a given node
Coordinate replaceKeyInHeap ( FibHeap * heap, FibHeapNode * node, Coordinate newKey )
{
    return fh_replacekey ( heap, node, newKey );
}

// Removes the node with the shortest key, then returns it.
unsigned int removeNextNodeFromHeap ( FibHeap * heap )
{
    return ( unsigned int ) fh_extractmin ( heap );
}

boolean IsHeapEmpty ( FibHeap * heap )
{
    return fh_isempty ( heap );
}

// Destructor
void destroyHeap ( FibHeap * heap )
{
    fh_deleteheap ( heap );
}

// Replace the node pointed to by a heap node
void replaceValueInHeap ( FibHeapNode * node, unsigned int newValue )
{
    fh_replacedata ( node, newValue );
}

// Remove unwanted node
void destroyNodeInHeap ( FibHeapNode * node, FibHeap * heap )
{
    fh_delete ( heap, node );
}

/*
 * 63mer/hashFunction.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.
If not, see . * */ #include #define KMER_HASH_MASK 0x0000000000ffffffL #define KMER_HASH_BUCKETS 16777216 // 4^12 static int crc_table[256] = { 0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7, 0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59, 0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433, 0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f, 0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1, 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb, 0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 
0x10da7a5a, 0x67dd4acc, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b, 0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55, 0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777, 0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc, 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94, 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d }; static int crc32 ( int crc, const char * buf, int len ) { if ( buf == NULL ) { return 0; } crc = crc ^ 0xffffffff; while ( len-- ) { crc = crc_table[ ( ( int ) crc ^ ( *buf++ ) ) & 0xff] ^ ( crc >> 8 ); } return crc ^ 0xffffffff; } ubyte8 hash_kmer ( Kmer kmer ) { ubyte8 hash = kmer.low; hash = crc32 ( 0, ( char * ) &kmer, sizeof ( Kmer ) ); hash &= KMER_HASH_MASK; return hash; } SOAPdenovo-V1.05/src/63mer/inc/000755 000765 000024 00000000000 11530651532 016071 5ustar00Aquastaff000000 000000 SOAPdenovo-V1.05/src/63mer/kmer.c000644 000765 000024 00000013324 11530651532 016425 0ustar00Aquastaff000000 000000 /* * 63mer/kmer.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

__uint128_t Kmer2int128 ( Kmer seq )
{
    __uint128_t temp;
    temp = seq.high;
    temp <<= 64;
    temp |= seq.low;
    return temp;
}

boolean KmerSmaller ( Kmer kmer1, Kmer kmer2 )
{
    if ( kmer1.high < kmer2.high ) {
        return 1;
    } else if ( kmer1.high == kmer2.high ) {
        if ( kmer1.low < kmer2.low ) {
            return 1;
        } else {
            return 0;
        }
    } else {
        return 0;
    }
}

boolean KmerLarger ( Kmer kmer1, Kmer kmer2 )
{
    if ( kmer1.high > kmer2.high ) {
        return 1;
    } else if ( kmer1.high == kmer2.high ) {
        if ( kmer1.low > kmer2.low ) {
            return 1;
        } else {
            return 0;
        }
    } else {
        return 0;
    }
}

boolean KmerEqual ( Kmer kmer1, Kmer kmer2 )
{
    if ( kmer1.high == kmer2.high && kmer1.low == kmer2.low ) {
        return 1;
    } else {
        return 0;
    }
}

// kmer1 = kmer1 & kmer2
Kmer KmerAnd ( Kmer kmer1, Kmer kmer2 )
{
    kmer1.high &= kmer2.high;
    kmer1.low &= kmer2.low;
    return kmer1;
}

// kmer <<= 2
Kmer KmerLeftBitMoveBy2 ( Kmer word )
{
    ubyte8 temp = word.low >> 62;
    word.high <<= 2;
    word.high |= temp;
    word.low <<= 2;
    return word;
}

// kmer >>= 2
Kmer KmerRightBitMoveBy2 ( Kmer word )
{
    ubyte8 temp = ( word.high & 0x3 ) << 62;
    word.high >>= 2;
    word.low >>= 2;
    word.low |= temp;
    return word;
}

Kmer KmerPlus ( Kmer prev, char ch )
{
    Kmer word = KmerLeftBitMoveBy2 ( prev );
    word.low |= ch;
    return word;
}

Kmer nextKmer ( Kmer prev, char ch )
{
    Kmer word =
        KmerLeftBitMoveBy2 ( prev );
    word = KmerAnd ( word, WORDFILTER );
    word.low |= ch;
    return word;
}

Kmer prevKmer ( Kmer next, char ch )
{
    Kmer word = KmerRightBitMoveBy2 ( next );

    if ( 2 * ( overlaplen - 1 ) < 64 ) {
        word.low |= ( ( ( ubyte8 ) ch ) << 2 * ( overlaplen - 1 ) );
    } else {
        word.high |= ( ( ubyte8 ) ch ) << ( 2 * ( overlaplen - 1 ) - 64 );
    }

    return word;
}

char lastCharInKmer ( Kmer kmer )
{
    return ( char ) ( kmer.low & 0x3 );
}

char firstCharInKmer ( Kmer kmer )
{
    if ( 2 * ( overlaplen - 1 ) < 64 ) {
        kmer.low >>= 2 * ( overlaplen - 1 );
        return kmer.low; // & 3;
    } else {
        kmer.high >>= 2 * ( overlaplen - 1 ) - 64;
        return kmer.high; // & 3;
    }
}

Kmer createFilter ( int overlaplen )
{
    Kmer word;
    word.high = word.low = 0;

    if ( 2 * overlaplen < 64 ) {
        word.low = ( ( ( ubyte8 ) 1 ) << ( 2 * overlaplen ) ) - 1;
    } else {
        word.low = ~word.low;

        if ( 2 * overlaplen > 64 ) {
            word.high = ( ( ( ubyte8 ) 1 ) << ( 2 * overlaplen - 64 ) ) - 1;
        }
    }

    return word;
}

Kmer KmerRightBitMove ( Kmer word, int dis )
{
    if ( dis < 64 ) {
        ubyte8 mask = ( ( ( ubyte8 ) 1 ) << dis ) - 1;
        ubyte8 temp = ( word.high & mask ) << ( 64 - dis );
        word.high >>= dis;
        word.low >>= dis;
        word.low |= temp;
        return word;
    }

    word.high >>= ( dis - 64 );
    word.low = word.high;
    word.high = 0;
    return word;
}

void printKmerSeq ( FILE * fp, Kmer kmer )
{
    int i, bit1, bit2;
    char ch;
    char kmerSeq[64];
    bit2 = overlaplen > 32 ? 32 : overlaplen;
    bit1 = overlaplen > 32 ?
           overlaplen - 32 : 0;

    for ( i = bit1 - 1; i >= 0; i-- ) {
        ch = kmer.high & 0x3;
        kmer.high >>= 2;
        kmerSeq[i] = ch;
    }

    for ( i = bit2 - 1; i >= 0; i-- ) {
        ch = kmer.low & 0x3;
        kmer.low >>= 2;
        kmerSeq[i + bit1] = ch;
    }

    for ( i = 0; i < overlaplen; i++ ) {
        fprintf ( fp, "%c", int2base ( ( int ) kmerSeq[i] ) );
    }
}

void print_kmer ( FILE * fp, Kmer kmer, char c )
{
    fprintf ( fp, "%llx %llx", kmer.high, kmer.low );
    fprintf ( fp, "%c", c );
}

static Kmer fastReverseComp ( Kmer seq, char seq_size )
{
    seq.low ^= 0xAAAAAAAAAAAAAAAALLU;
    seq.low = ( ( seq.low & 0x3333333333333333LLU ) << 2 ) | ( ( seq.low & 0xCCCCCCCCCCCCCCCCLLU ) >> 2 );
    seq.low = ( ( seq.low & 0x0F0F0F0F0F0F0F0FLLU ) << 4 ) | ( ( seq.low & 0xF0F0F0F0F0F0F0F0LLU ) >> 4 );
    seq.low = ( ( seq.low & 0x00FF00FF00FF00FFLLU ) << 8 ) | ( ( seq.low & 0xFF00FF00FF00FF00LLU ) >> 8 );
    seq.low = ( ( seq.low & 0x0000FFFF0000FFFFLLU ) << 16 ) | ( ( seq.low & 0xFFFF0000FFFF0000LLU ) >> 16 );
    seq.low = ( ( seq.low & 0x00000000FFFFFFFFLLU ) << 32 ) | ( ( seq.low & 0xFFFFFFFF00000000LLU ) >> 32 );

    if ( seq_size < 32 ) {
        seq.low >>= ( 64 - ( seq_size << 1 ) );
        return seq;
    }

    seq.high ^= 0xAAAAAAAAAAAAAAAALLU;
    seq.high = ( ( seq.high & 0x3333333333333333LLU ) << 2 ) | ( ( seq.high & 0xCCCCCCCCCCCCCCCCLLU ) >> 2 );
    seq.high = ( ( seq.high & 0x0F0F0F0F0F0F0F0FLLU ) << 4 ) | ( ( seq.high & 0xF0F0F0F0F0F0F0F0LLU ) >> 4 );
    seq.high = ( ( seq.high & 0x00FF00FF00FF00FFLLU ) << 8 ) | ( ( seq.high & 0xFF00FF00FF00FF00LLU ) >> 8 );
    seq.high = ( ( seq.high & 0x0000FFFF0000FFFFLLU ) << 16 ) | ( ( seq.high & 0xFFFF0000FFFF0000LLU ) >> 16 );
    seq.high = ( ( seq.high & 0x00000000FFFFFFFFLLU ) << 32 ) | ( ( seq.high & 0xFFFFFFFF00000000LLU ) >> 32 );
    ubyte8 temp = seq.high;
    seq.high = seq.low;
    seq.low = temp;
    seq = KmerRightBitMove ( seq, 128 - ( seq_size << 1 ) );
    return seq;
}

Kmer reverseComplementVerbose ( Kmer word, int overlap )
{
    return fastReverseComp ( word, overlap );
}

Kmer reverseComplement ( Kmer word, int overlap )
{
    return
           fastReverseComp ( word, overlap );
}

/*
 * 63mer/lib.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static char tabs[2][1024];

int getMaxLongReadLen ( int num_libs )
{
    int i;
    int maxLong = 0;
    boolean Has = 0;

    for ( i = 0; i < num_libs; i++ ) {
        if ( lib_array[i].asm_flag != 4 ) {
            continue;
        }

        Has = 1;
        maxLong = maxLong < lib_array[i].rd_len_cutoff ? lib_array[i].rd_len_cutoff : maxLong;
    }

    if ( !Has ) {
        return maxLong;
    } else {
        return maxLong > 0 ?
maxLong : maxReadLen; } } static boolean splitColumn ( char * line ) { int len = strlen ( line ); int i = 0, j; int tabs_n = 0; while ( i < len ) { if ( line[i] >= 32 && line[i] <= 126 && line[i] != '=' ) { j = 0; while ( i < len && line[i] >= 32 && line[i] <= 126 && line[i] != '=' ) { tabs[tabs_n][j++] = line[i]; i++; } tabs[tabs_n][j] = '\0'; tabs_n++; if ( tabs_n == 2 ) { return 1; } } i++; } if ( tabs_n == 2 ) { return 1; } else { return 0; } } static int cmp_lib ( const void * a, const void * b ) { LIB_INFO * A, *B; A = ( LIB_INFO * ) a; B = ( LIB_INFO * ) b; if ( A->avg_ins > B->avg_ins ) { return 1; } else if ( A->avg_ins == B->avg_ins ) { return 0; } else { return -1; } } void scan_libInfo ( char * libfile ) { FILE * fp; char line[1024], ch; int i, j, index; int libCounter; boolean flag; fp = ckopen ( libfile, "r" ); num_libs = 0; while ( fgets ( line, 1024, fp ) ) { ch = line[5]; line[5] = '\0'; if ( strcmp ( line, "[LIB]" ) == 0 ) { num_libs++; } if ( !num_libs ) { line[5] = ch; flag = splitColumn ( line ); if ( !flag ) { continue; } if ( strcmp ( tabs[0], "max_rd_len" ) == 0 ) { maxReadLen = atoi ( tabs[1] ); } } } //count file numbers of each type lib_array = ( LIB_INFO * ) ckalloc ( num_libs * sizeof ( LIB_INFO ) ); for ( i = 0; i < num_libs; i++ ) { lib_array[i].asm_flag = 3; lib_array[i].rank = 0; lib_array[i].pair_num_cut = 0; lib_array[i].rd_len_cutoff = 0; lib_array[i].map_len = 0; lib_array[i].num_s_a_file = 0; lib_array[i].num_s_q_file = 0; lib_array[i].num_p_file = 0; lib_array[i].num_a1_file = 0; lib_array[i].num_a2_file = 0; lib_array[i].num_q1_file = 0; lib_array[i].num_q2_file = 0; } libCounter = -1; rewind ( fp ); i = -1; while ( fgets ( line, 1024, fp ) ) { ch = line[5]; line[5] = '\0'; if ( strcmp ( line, "[LIB]" ) == 0 ) { i++; continue; } line[5] = ch; flag = splitColumn ( line ); if ( !flag ) { continue; } if ( strcmp ( tabs[0], "f1" ) == 0 ) { lib_array[i].num_a1_file++; } else if ( strcmp ( tabs[0], "q1" ) == 0 ) { 
lib_array[i].num_q1_file++; } else if ( strcmp ( tabs[0], "f2" ) == 0 ) { lib_array[i].num_a2_file++; } else if ( strcmp ( tabs[0], "q2" ) == 0 ) { lib_array[i].num_q2_file++; } else if ( strcmp ( tabs[0], "f" ) == 0 ) { lib_array[i].num_s_a_file++; } else if ( strcmp ( tabs[0], "q" ) == 0 ) { lib_array[i].num_s_q_file++; } else if ( strcmp ( tabs[0], "p" ) == 0 ) { lib_array[i].num_p_file++; } } //allocate memory for filenames for ( i = 0; i < num_libs; i++ ) { if ( lib_array[i].num_s_a_file ) { lib_array[i].s_a_fname = ( char ** ) ckalloc ( lib_array[i].num_s_a_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_s_a_file; j++ ) { lib_array[i].s_a_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_s_q_file ) { lib_array[i].s_q_fname = ( char ** ) ckalloc ( lib_array[i].num_s_q_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_s_q_file; j++ ) { lib_array[i].s_q_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_p_file ) { lib_array[i].p_fname = ( char ** ) ckalloc ( lib_array[i].num_p_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_p_file; j++ ) { lib_array[i].p_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_a1_file ) { lib_array[i].a1_fname = ( char ** ) ckalloc ( lib_array[i].num_a1_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_a1_file; j++ ) { lib_array[i].a1_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_a2_file ) { lib_array[i].a2_fname = ( char ** ) ckalloc ( lib_array[i].num_a2_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_a2_file; j++ ) { lib_array[i].a2_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_q1_file ) { lib_array[i].q1_fname = ( char ** ) ckalloc ( lib_array[i].num_q1_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_q1_file; j++ ) { lib_array[i].q1_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( 
char ) ); } } if ( lib_array[i].num_q2_file ) { lib_array[i].q2_fname = ( char ** ) ckalloc ( lib_array[i].num_q2_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_q2_file; j++ ) { lib_array[i].q2_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } } // get file names for ( i = 0; i < num_libs; i++ ) { lib_array[i].curr_type = 1; lib_array[i].curr_index = 0; lib_array[i].fp1 = NULL; lib_array[i].fp2 = NULL; lib_array[i].num_s_a_file = 0; lib_array[i].num_s_q_file = 0; lib_array[i].num_p_file = 0; lib_array[i].num_a1_file = 0; lib_array[i].num_a2_file = 0; lib_array[i].num_q1_file = 0; lib_array[i].num_q2_file = 0; } libCounter = -1; rewind ( fp ); i = -1; while ( fgets ( line, 1024, fp ) ) { ch = line[5]; line[5] = '\0'; if ( strcmp ( line, "[LIB]" ) == 0 ) { i++; continue; } line[5] = ch; flag = splitColumn ( line ); if ( !flag ) { continue; } if ( strcmp ( tabs[0], "f1" ) == 0 ) { index = lib_array[i].num_a1_file++; strcpy ( lib_array[i].a1_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "q1" ) == 0 ) { index = lib_array[i].num_q1_file++; strcpy ( lib_array[i].q1_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "f2" ) == 0 ) { index = lib_array[i].num_a2_file++; strcpy ( lib_array[i].a2_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "q2" ) == 0 ) { index = lib_array[i].num_q2_file++; strcpy ( lib_array[i].q2_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "f" ) == 0 ) { index = lib_array[i].num_s_a_file++; strcpy ( lib_array[i].s_a_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "q" ) == 0 ) { index = lib_array[i].num_s_q_file++; strcpy ( lib_array[i].s_q_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "p" ) == 0 ) { index = lib_array[i].num_p_file++; strcpy ( lib_array[i].p_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "min_ins" ) == 0 ) { lib_array[i].min_ins = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "max_ins" ) == 0 ) { lib_array[i].max_ins = atoi ( tabs[1] ); } else if ( strcmp ( 
tabs[0], "avg_ins" ) == 0 ) { lib_array[i].avg_ins = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "rd_len_cutoff" ) == 0 ) { lib_array[i].rd_len_cutoff = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "reverse_seq" ) == 0 ) { lib_array[i].reverse = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "asm_flags" ) == 0 ) { lib_array[i].asm_flag = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "rank" ) == 0 ) { lib_array[i].rank = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "pair_num_cutoff" ) == 0 ) { lib_array[i].pair_num_cut = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "rd_len_cutoff" ) == 0 ) { lib_array[i].rd_len_cutoff = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "map_len" ) == 0 ) { lib_array[i].map_len = atoi ( tabs[1] ); } } fclose ( fp ); qsort ( &lib_array[0], num_libs, sizeof ( LIB_INFO ), cmp_lib ); } void free_libs() { if ( !lib_array ) { return; } int i, j; for ( i = 0; i < num_libs; i++ ) { printf ( "[LIB] %d, avg_ins %d, reverse %d \n", i, lib_array[i].avg_ins, lib_array[i].reverse ); if ( lib_array[i].num_s_a_file ) { //printf("%d single fasta files\n",lib_array[i].num_s_a_file); for ( j = 0; j < lib_array[i].num_s_a_file; j++ ) { free ( ( void * ) lib_array[i].s_a_fname[j] ); } free ( ( void * ) lib_array[i].s_a_fname ); } if ( lib_array[i].num_s_q_file ) { //printf("%d single fastq files\n",lib_array[i].num_s_q_file); for ( j = 0; j < lib_array[i].num_s_q_file; j++ ) { free ( ( void * ) lib_array[i].s_q_fname[j] ); } free ( ( void * ) lib_array[i].s_q_fname ); } if ( lib_array[i].num_p_file ) { //printf("%d paired fasta files\n",lib_array[i].num_p_file); for ( j = 0; j < lib_array[i].num_p_file; j++ ) { free ( ( void * ) lib_array[i].p_fname[j] ); } free ( ( void * ) lib_array[i].p_fname ); } if ( lib_array[i].num_a1_file ) { //printf("%d read1 fasta files\n",lib_array[i].num_a1_file); for ( j = 0; j < lib_array[i].num_a1_file; j++ ) { free ( ( void * ) lib_array[i].a1_fname[j] ); } free ( ( void * ) lib_array[i].a1_fname ); } if ( 
lib_array[i].num_a2_file ) { //printf("%d read2 fasta files\n",lib_array[i].num_a2_file); for ( j = 0; j < lib_array[i].num_a2_file; j++ ) { free ( ( void * ) lib_array[i].a2_fname[j] ); } free ( ( void * ) lib_array[i].a2_fname ); } if ( lib_array[i].num_q1_file ) { //printf("%d read1 fastq files\n",lib_array[i].num_q1_file); for ( j = 0; j < lib_array[i].num_q1_file; j++ ) { free ( ( void * ) lib_array[i].q1_fname[j] ); } free ( ( void * ) lib_array[i].q1_fname ); } if ( lib_array[i].num_q2_file ) { //printf("%d read2 fastq files\n",lib_array[i].num_q2_file); for ( j = 0; j < lib_array[i].num_q2_file; j++ ) { free ( ( void * ) lib_array[i].q2_fname[j] ); } free ( ( void * ) lib_array[i].q2_fname ); } } num_libs = 0; free ( ( void * ) lib_array ); } void alloc_pe_mem ( int gradsCounter ) { if ( gradsCounter ) { pes = ( PE_INFO * ) ckalloc ( gradsCounter * sizeof ( PE_INFO ) ); } } void free_pe_mem() { if ( pes ) { free ( ( void * ) pes ); pes = NULL; } } SOAPdenovo-V1.05/src/63mer/loadGraph.c000644 000765 000024 00000025531 11530651532 017373 0ustar00Aquastaff000000 000000 /* * 63mer/loadGraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define preARCBLOCKSIZE 100000 static void loadArcs ( char * graphfile ); static void loadContig ( char * graphfile ); /* void loadUpdatedVertex(char *graphfile) { char name[256],line[256]; FILE *fp; Kmer word,bal_word; int num_kmer,i; char ch; sprintf(name,"%s.updated.vertex",graphfile); fp = ckopen(name,"r"); while(fgets(line,sizeof(line),fp)!=NULL){ if(line[0] == 'V'){ sscanf(line+6, "%d %c %d",&num_kmer,&ch,&overlaplen); printf("there're %d kmers in vertex file\n",num_kmer); break; } } vt_array = (VERTEX *)ckalloc((2*num_kmer)*sizeof(VERTEX)); for(i=0;i len_array[mid] ) { low = mid + 1; } else { high = mid - 1; } } if ( low > high ) { return -1; } //locate the first same length unflaged return flag_array[mid]++; } int lengthSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target ) { int mid, low, high, i; low = 1; high = num; while ( low <= high ) { mid = ( low + high ) / 2; if ( len_array[mid] == target ) { break; } else if ( target > len_array[mid] ) { low = mid + 1; } else { high = mid - 1; } } if ( low > high ) { return -1; } //locate the first same length unflaged if ( !flag_array[mid] ) { for ( i = mid - 1; i > 0; i-- ) { if ( len_array[i] != len_array[mid] || flag_array[i] ) { break; } } flag_array[i + 1] = 1; return i + 1; } else { for ( i = mid + 1; i <= num; i++ ) { if ( !flag_array[i] ) { break; } } flag_array[i] = 1; return i; } } void quick_sort_int ( unsigned int * length_array, int low, int high ) { int i, j; unsigned int pivot; if ( low < high ) { pivot = length_array[low]; i = low; j = high; while ( i < j ) { while ( i < j && length_array[j] >= pivot ) { j--; } if ( i < j ) { length_array[i++] = length_array[j]; } while ( i < j && length_array[i] <= pivot ) { i++; } if ( i < j ) { length_array[j--] = length_array[i]; } } length_array[i] = pivot; quick_sort_int ( length_array, low, i - 1 ); quick_sort_int ( length_array, i + 
1, high ); } } void loadUpdatedEdges ( char * graphfile ) { char c, name[256], line[1024]; int bal_ed, cvg; FILE * fp, *out_fp; unsigned int num_ctgge, length, index = 0, num_kmer; unsigned int i = 0, j; int newIndex; unsigned int * length_array, *flag_array, diff_len; char * outfile = graphfile; long long cvgSum = 0; long long counter = 0; //get overlaplen from *.preGraphBasic sprintf ( name, "%s.preGraphBasic", graphfile ); fp = ckopen ( name, "r" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'V' ) { sscanf ( line + 6, "%d %c %d", &num_kmer, &c, &overlaplen ); printf ( "K = %d\n", overlaplen ); break; } } if ( ctg_short == 0 ) { ctg_short = overlaplen + 2; } fclose ( fp ); sprintf ( name, "%s.updated.edge", graphfile ); fp = ckopen ( name, "r" ); sprintf ( name, "%s.newContigIndex", outfile ); out_fp = ckopen ( name, "w" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'E' ) { sscanf ( line + 5, "%d", &num_ctgge ); printf ( "there're %d edge in edge file\n", num_ctgge ); break; } } index_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) ); length_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) ); flag_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { sscanf ( line + 7, "%d", &length ); index_array[++index] = length; length_array[++i] = length; } } num_ctg = index; orig2new = 1; //quick_sort_int(length_array,1,num_ctg); qsort ( & ( length_array[1] ), num_ctg, sizeof ( length_array[0] ), cmp_int ); //extract unique length diff_len = 0; for ( i = 1; i <= num_ctg; i++ ) { for ( j = i + 1; j <= num_ctg; j++ ) if ( length_array[j] != length_array[i] ) { break; } length_array[++diff_len] = length_array[i]; flag_array[diff_len] = i; i = j - 1; } /* for(i=1;i<=num_ctg;i++) flag_array[i] = 0; */ contig_array = ( CONTIG * ) ckalloc ( ( num_ctg + 
1 ) * sizeof ( CONTIG ) ); //load edges index = 0; rewind ( fp ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { sscanf ( line, ">length %u,%d,%d", &length, &bal_ed, &cvg ); newIndex = uniqueLenSearch ( length_array, flag_array, diff_len, length ); index_array[++index] = newIndex; contig_array[newIndex].length = length; contig_array[newIndex].bal_edge = bal_ed + 1; contig_array[newIndex].downwardConnect = NULL; contig_array[newIndex].mask = 0; contig_array[newIndex].flag = 0; contig_array[newIndex].arcs = NULL; contig_array[newIndex].seq = NULL; contig_array[newIndex].multi = 0; contig_array[newIndex].inSubGraph = 0; contig_array[newIndex].cvg = cvg / 10; if ( cvg ) { counter += length; cvgSum += cvg * length; } fprintf ( out_fp, "%d %d %d\n", index, newIndex, contig_array[newIndex].bal_edge ); } } if ( counter ) { cvgAvg = cvgSum / counter / 10 > 2 ? cvgSum / counter / 10 : 3; } //mark repeats int bal_i; if ( maskRep ) { counter = 0; for ( i = 1; i <= num_ctg; i++ ) { bal_i = getTwinCtg ( i ); if ( ( contig_array[i].cvg + contig_array[bal_i].cvg ) > 4 * cvgAvg ) { contig_array[i].mask = 1; contig_array[bal_i].mask = 1; counter += 2; } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "average contig coverage is %d, %lld contig masked\n", cvgAvg, counter ); } counter = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } bal_i = getTwinCtg ( i ); if ( contig_array[i].length < ctg_short ) { contig_array[i].mask = 1; contig_array[bal_i].mask = 1; counter += 2; } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "Mask contigs shorter than %d, %lld contig masked\n", ctg_short, counter ); loadArcs ( graphfile ); //tipsCount(); loadContig ( graphfile ); printf ( "done loading updated edges\n" ); fflush ( stdout ); free ( ( void * ) length_array ); free ( ( void * ) flag_array ); fclose ( fp ); fclose ( out_fp ); } static void add1Arc ( unsigned int from_ed, unsigned int to_ed, unsigned int weight ) { preARC * 
parc; unsigned int from_c = index_array[from_ed]; unsigned int to_c = index_array[to_ed]; parc = allocatePreArc ( to_c ); parc->multiplicity = weight; parc->next = contig_array[from_c].arcs; contig_array[from_c].arcs = parc; } void loadArcs ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int target, weight; unsigned int from_ed; char * seg; sprintf ( name, "%s.Arc", graphfile ); fp = ckopen ( name, "r" ); createPreArcMemManager(); arcCounter = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { seg = strtok ( line, " " ); from_ed = atoi ( seg ); //printf("%d\n",from_ed); while ( ( seg = strtok ( NULL, " " ) ) != NULL ) { target = atoi ( seg ); seg = strtok ( NULL, " " ); /* a target without a weight means a malformed line; stop instead of passing NULL to atoi */ if ( !seg ) { break; } weight = atoi ( seg ); add1Arc ( from_ed, target, weight ); } } printf ( "%lld arcs loaded\n", arcCounter ); fclose ( fp ); } void loadContig ( char * graphfile ) { char c, name[256], line[1024], *tightSeq = NULL; FILE * fp; int n = 0, length, index = -1, edgeno; unsigned int i; unsigned int newIndex; sprintf ( name, "%s.contig", graphfile ); fp = ckopen ( name, "r" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index >= 0 ) { newIndex = index_array[edgeno]; contig_array[newIndex].seq = tightSeq; } n = 0; index++; sscanf ( line + 1, "%d %s %d", &edgeno, name, &length ); //printf("contig %d, length %d\n",edgeno,length); tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); } else { for ( i = 0; i < strlen ( line ); i++ ) { if ( line[i] >= 'a' && line[i] <= 'z' ) { c = base2int ( line[i] - 'a' + 'A' ); writeChar2tightString ( c, tightSeq, n++ ); } else if ( line[i] >= 'A' && line[i] <= 'Z' ) { c = base2int ( line[i] ); writeChar2tightString ( c, tightSeq, n++ ); } } } } if ( index >= 0 ) { newIndex = index_array[edgeno]; contig_array[newIndex].seq = tightSeq; } printf ( "input %d contigs\n", index + 1 ); fclose ( fp ); //printf("the %dth contig with index 107\n",index); } void freeContig_array() { if ( 
!contig_array ) { return; } unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].seq ) { free ( ( void * ) contig_array[i].seq ); } if ( contig_array[i].closeReads ) { freeStack ( contig_array[i].closeReads ); } } free ( ( void * ) contig_array ); contig_array = NULL; } /* void loadCvg(char *graphfile) { char name[256],line[1024]; FILE *fp; int cvg; unsigned int newIndex,edgeno,bal_ctg; sprintf(name,"%s.contigCVG",graphfile); fp = fopen(name,"r"); if(!fp){ printf("contig coverage file %s is not found!\n",name); return; } while(fgets(line,sizeof(line),fp)!=NULL){ if(line[0]=='>'){ sscanf(line+1,"%d %d",&edgeno,&cvg); newIndex = index_array[edgeno]; cvg = cvg <= 255 ? cvg:255; contig_array[newIndex].multi = cvg; bal_ctg = getTwinCtg(newIndex); contig_array[bal_ctg].multi= cvg; } } fclose(fp); } */ SOAPdenovo-V1.05/src/63mer/loadPath.c000644 000765 000024 00000012222 11530651532 017217 0ustar00Aquastaff000000 000000 /* * 63mer/loadPath.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void add1marker2edge ( unsigned int edgeno, long long readid ) { if ( edge_array[edgeno].multi == 255 ) { return; } unsigned int bal_ed = getTwinEdge ( edgeno ); unsigned char counter = edge_array[edgeno].multi++; edge_array[edgeno].markers[counter] = readid; counter = edge_array[bal_ed].multi++; edge_array[bal_ed].markers[counter] = -readid; } boolean loadPath ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int i, bal_ed, num1, edgeno, num2; long long markCounter = 0, readid = 0; char * seg; sprintf ( name, "%s.markOnEdge", graphfile ); fp = fopen ( name, "r" ); if ( !fp ) { return 0; } for ( i = 1; i <= num_ed; i++ ) { edge_array[i].multi = 0; } for ( i = 1; i <= num_ed; i++ ) { fscanf ( fp, "%d", &num1 ); if ( EdSmallerThanTwin ( i ) ) { fscanf ( fp, "%d", &num2 ); bal_ed = getTwinEdge ( i ); if ( num1 + num2 >= 255 ) { edge_array[i].multi = 255; edge_array[bal_ed].multi = 255; } else { edge_array[i].multi = num1 + num2; edge_array[bal_ed].multi = num1 + num2; markCounter += 2 * ( num1 + num2 ); } i++; } else { if ( 2 * num1 >= 255 ) { edge_array[i].multi = 255; } else { edge_array[i].multi = 2 * num1; markCounter += 2 * num1; } } } fclose ( fp ); printf ( "%lld markers overall\n", markCounter ); markersArray = ( long long * ) ckalloc ( markCounter * sizeof ( long long ) ); markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } edge_array[i].markers = markersArray + markCounter; markCounter += edge_array[i].multi; edge_array[i].multi = 0; } sprintf ( name, "%s.path", graphfile ); fp = fopen ( name, "r" ); if ( !fp ) { return 0; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { //printf("%s",line); readid++; seg = strtok ( line, " " ); while ( seg ) { edgeno = atoi ( seg ); //printf("%s, %d\n",seg,edgeno); add1marker2edge ( edgeno, readid ); seg = strtok ( NULL, " " ); } } fclose ( fp ); 
markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } markCounter += edge_array[i].multi; } printf ( "%lld marks loaded\n", markCounter ); return 1; } boolean loadPathBin ( char * graphfile ) { FILE * fp; char name[256]; unsigned int i, bal_ed, num1, num2; long long markCounter = 0, readid = 0; unsigned char seg, ch; unsigned int * freadBuf; sprintf ( name, "%s.markOnEdge", graphfile ); fp = fopen ( name, "r" ); if ( !fp ) { return 0; } for ( i = 1; i <= num_ed; i++ ) { edge_array[i].multi = 0; edge_array[i].markers = NULL; } for ( i = 1; i <= num_ed; i++ ) { fscanf ( fp, "%d", &num1 ); if ( EdSmallerThanTwin ( i ) ) { fscanf ( fp, "%d", &num2 ); bal_ed = getTwinEdge ( i ); if ( num1 + num2 >= 255 ) { edge_array[i].multi = 255; edge_array[bal_ed].multi = 255; } else { edge_array[i].multi = num1 + num2; edge_array[bal_ed].multi = num1 + num2; markCounter += 2 * ( num1 + num2 ); } i++; } else { if ( 2 * num1 >= 255 ) { edge_array[i].multi = 255; } else { edge_array[i].multi = 2 * num1; markCounter += 2 * num1; } } } fclose ( fp ); printf ( "%lld markers overall\n", markCounter ); markersArray = ( long long * ) ckalloc ( markCounter * sizeof ( long long ) ); markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } edge_array[i].markers = markersArray + markCounter; markCounter += edge_array[i].multi; edge_array[i].multi = 0; } sprintf ( name, "%s.path", graphfile ); fp = fopen ( name, "rb" ); if ( !fp ) { return 0; } freadBuf = ( unsigned int * ) ckalloc ( ( maxReadLen - overlaplen + 1 ) * sizeof ( unsigned int ) ); while ( fread ( &ch, sizeof ( char ), 1, fp ) == 1 ) { //printf("%s",line); if ( fread ( freadBuf, sizeof ( unsigned int ), ch, fp ) != ch ) { break; } readid++; for ( seg = 0; seg < ch; seg++ ) { add1marker2edge ( freadBuf[seg], readid ); } } fclose ( fp ); markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } markCounter += 
edge_array[i].multi; } printf ( "%lld markers loaded\n", markCounter ); free ( ( void * ) freadBuf ); return 1; } SOAPdenovo-V1.05/src/63mer/loadPreGraph.c000644 000765 000024 00000023323 11530651532 020037 0ustar00Aquastaff000000 000000 /* * 63mer/loadPreGraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void loadPreArcs ( char * graphfile ); int cmp_vertex ( const void * a, const void * b ) { VERTEX * A, *B; A = ( VERTEX * ) a; B = ( VERTEX * ) b; if ( KmerLarger ( A->kmer, B->kmer ) ) { return 1; } else if ( KmerEqual ( A->kmer, B->kmer ) ) { return 0; } else { return -1; } } void loadVertex ( char * graphfile ) { char name[256], line[256]; FILE * fp; Kmer word, bal_word, temp; int num_kmer, i; char ch; sprintf ( name, "%s.preGraphBasic", graphfile ); fp = ckopen ( name, "r" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'V' ) { sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen ); printf ( "there're %d kmers in vertex file\n", num_kmer ); } else if ( line[0] == 'E' ) { sscanf ( line + 5, "%d", &num_ed ); printf ( "there're %d edge in edge file\n", num_ed ); } else if ( line[0] == 'M' ) { sscanf ( line, "MaxReadLen %d MinReadLen %d MaxNameLen %d", &maxReadLen, &minReadLen, &maxNameLen ); } } fclose ( fp ); 
vt_array = ( VERTEX * ) ckalloc ( ( 2 * num_kmer ) * sizeof ( VERTEX ) ); sprintf ( name, "%s.vertex", graphfile ); fp = ckopen ( name, "r" ); for ( i = 0; i < num_kmer; i++ ) { fscanf ( fp, "%llx %llx", & ( word.high ), & ( word.low ) ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerSmaller ( word, bal_word ) ) { vt_array[i].kmer = word; } else { vt_array[i].kmer = bal_word; } } temp = vt_array[num_kmer - 1].kmer; qsort ( &vt_array[0], num_kmer, sizeof ( vt_array[0] ), cmp_vertex ); printf ( "done sort\n" ); fclose ( fp ); for ( i = 0; i < num_kmer; i++ ) { bal_word = reverseComplement ( vt_array[i].kmer, overlaplen ); vt_array[i + num_kmer].kmer = bal_word; } num_vt = num_kmer; } int bisearch ( VERTEX * vts, int num, Kmer target ) { int mid, low, high; low = 0; high = num - 1; while ( low <= high ) { mid = ( low + high ) / 2; if ( KmerEqual ( vts[mid].kmer, target ) ) { break; } else if ( KmerLarger ( target, vts[mid].kmer ) ) { low = mid + 1; } else { high = mid - 1; } } if ( low <= high ) { return mid; } else { return -1; } } int kmer2vt ( Kmer kmer ) { Kmer bal_word; int vt_id; bal_word = reverseComplement ( kmer, overlaplen ); if ( KmerSmaller ( kmer, bal_word ) ) { vt_id = bisearch ( &vt_array[0], num_vt, kmer ); if ( vt_id < 0 ) { printf ( "no vt found for kmer %llx %llx\n", kmer.high, kmer.low ); } return vt_id; } else { vt_id = bisearch ( &vt_array[0], num_vt, bal_word ); if ( vt_id >= 0 ) { vt_id += num_vt; } else { printf ( "no vt found for kmer %llx %llx\n", kmer.high, kmer.low ); } return vt_id; } } /* create an edge with index edgeno+1 that is the reverse complement of the edge with index edgeno */ static void buildReverseComplementEdge ( unsigned int edgeno ) { int length = edge_array[edgeno].length; int i, index = 0; char * sequence, ch, *tightSeq; Kmer kmer = vt_array[edge_array[edgeno].from_vt].kmer; sequence = ( char * ) ckalloc ( ( overlaplen + length ) * sizeof ( char ) ); int bit2 = overlaplen > 32 ? 
32 : overlaplen; int bit1 = overlaplen > 32 ? overlaplen - 32 : 0; for ( i = bit1 - 1; i >= 0; i-- ) { ch = kmer.high & 0x3; kmer.high >>= 2; sequence[i] = ch; } for ( i = bit2 - 1; i >= 0; i-- ) { ch = kmer.low & 0x3; kmer.low >>= 2; sequence[i + bit1] = ch; } for ( i = 0; i < length; i++ ) { sequence[i + overlaplen] = getCharInTightString ( edge_array[edgeno].seq, i ); } tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( i = length - 1; i >= 0; i-- ) { writeChar2tightString ( int_comp ( sequence[i] ), tightSeq, index++ ); } edge_array[edgeno + 1].length = length; edge_array[edgeno + 1].cvg = edge_array[edgeno].cvg; kmer = vt_array[edge_array[edgeno].from_vt].kmer; edge_array[edgeno + 1].to_vt = kmer2vt ( reverseComplement ( kmer, overlaplen ) ); kmer = vt_array[edge_array[edgeno].to_vt].kmer; edge_array[edgeno + 1].from_vt = kmer2vt ( reverseComplement ( kmer, overlaplen ) ); edge_array[edgeno + 1].seq = tightSeq; edge_array[edgeno + 1].bal_edge = 0; edge_array[edgeno + 1].rv = NULL; edge_array[edgeno + 1].arcs = NULL; edge_array[edgeno + 1].flag = 0; edge_array[edgeno + 1].deleted = 0; free ( ( void * ) sequence ); } void loadEdge ( char * graphfile ) { char c, name[256], line[1024], str[32]; char * tightSeq = NULL; FILE * fp; Kmer from_kmer, to_kmer; int n = 0, i, length, cvg, index = -1, bal_ed, edgeno; int linelen; unsigned int j; sprintf ( name, "%s.edge", graphfile ); fp = ckopen ( name, "r" ); num_ed_limit = 1.2 * num_ed; edge_array = ( EDGE * ) ckalloc ( ( num_ed_limit + 1 ) * sizeof ( EDGE ) ); for ( j = num_ed + 1; j <= num_ed_limit; j++ ) { edge_array[j].seq = NULL; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index >= 0 ) { edgeno = index + 1; edge_array[edgeno].length = length; edge_array[edgeno].cvg = cvg; edge_array[edgeno].from_vt = kmer2vt ( from_kmer ); edge_array[edgeno].to_vt = kmer2vt ( to_kmer ); edge_array[edgeno].seq = tightSeq; edge_array[edgeno].bal_edge = bal_ed + 1; 
edge_array[edgeno].rv = NULL; edge_array[edgeno].arcs = NULL; edge_array[edgeno].flag = 0; edge_array[edgeno].deleted = 0; if ( bal_ed ) { buildReverseComplementEdge ( edgeno ); index++; } } n = 0; index++; sscanf ( line + 7, "%d,%llx %llx,%llx %llx,%s %d,%d", &length, & ( from_kmer.high ), & ( from_kmer.low ), & ( to_kmer.high ), & ( to_kmer.low ), str, &cvg, &bal_ed ); tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); } else { linelen = strlen ( line ); for ( i = 0; i < linelen; i++ ) { if ( line[i] >= 'a' && line[i] <= 'z' ) { c = base2int ( line[i] - 'a' + 'A' ); writeChar2tightString ( c, tightSeq, n++ ); } else if ( line[i] >= 'A' && line[i] <= 'Z' ) { c = base2int ( line[i] ); writeChar2tightString ( c, tightSeq, n++ ); } } } } if ( index >= 0 ) { edgeno = index + 1; edge_array[edgeno].length = length; edge_array[edgeno].cvg = cvg; edge_array[edgeno].from_vt = kmer2vt ( from_kmer ); edge_array[edgeno].to_vt = kmer2vt ( to_kmer ); edge_array[edgeno].seq = tightSeq; edge_array[edgeno].bal_edge = bal_ed + 1; if ( bal_ed ) { buildReverseComplementEdge ( edgeno ); index++; } } printf ( "input %d edges\n", index + 1 ); fclose ( fp ); createArcMemo(); loadPreArcs ( graphfile ); } unsigned int getTwinEdge ( unsigned int edgeno ) { return edgeno + edge_array[edgeno].bal_edge - 1; } boolean EdSmallerThanTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge > 1; } boolean EdLargerThanTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge < 1; } boolean EdSameAsTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge == 1; } static void add1Arc ( unsigned int from_ed, unsigned int to_ed, unsigned int weight ) { unsigned int bal_fe = getTwinEdge ( from_ed ); unsigned int bal_te = getTwinEdge ( to_ed ); if ( from_ed > num_ed || to_ed > num_ed || bal_fe > num_ed || bal_te > num_ed ) { return; } ARC * parc, *bal_parc; //both arcs already exist parc = getArcBetween ( from_ed, to_ed ); if ( parc ) { bal_parc = 
parc->bal_arc; parc->multiplicity += weight; bal_parc->multiplicity += weight; return; } //create new arcs parc = allocateArc ( to_ed ); parc->multiplicity = weight; parc->prev = NULL; if ( edge_array[from_ed].arcs ) { edge_array[from_ed].arcs->prev = parc; } parc->next = edge_array[from_ed].arcs; edge_array[from_ed].arcs = parc; // A->A' if ( bal_te == from_ed ) { //printf("preArc from A to A'\n"); parc->bal_arc = parc; parc->multiplicity += weight; return; } bal_parc = allocateArc ( bal_fe ); bal_parc->multiplicity = weight; bal_parc->prev = NULL; if ( edge_array[bal_te].arcs ) { edge_array[bal_te].arcs->prev = bal_parc; } bal_parc->next = edge_array[bal_te].arcs; edge_array[bal_te].arcs = bal_parc; //link them to each other parc->bal_arc = bal_parc; bal_parc->bal_arc = parc; } void loadPreArcs ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int target, weight; unsigned int from_ed; char * seg; sprintf ( name, "%s.preArc", graphfile ); fp = ckopen ( name, "r" ); arcCounter = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { seg = strtok ( line, " " ); from_ed = atoi ( seg ); while ( ( seg = strtok ( NULL, " " ) ) != NULL ) { target = atoi ( seg ); seg = strtok ( NULL, " " ); /* a target without a weight means a malformed line; stop instead of passing NULL to atoi */ if ( !seg ) { break; } weight = atoi ( seg ); add1Arc ( from_ed, target, weight ); } } printf ( "%lli pre-arcs loaded\n", arcCounter ); fclose ( fp ); } void free_edge_array ( EDGE * ed_array, int ed_num ) { int i; for ( i = 1; i <= ed_num; i++ ) if ( ed_array[i].seq ) { free ( ( void * ) ed_array[i].seq ); } free ( ( void * ) ed_array ); } SOAPdenovo-V1.05/src/63mer/localAsm.c000644 000765 000024 00000154227 11530651532 017232 0ustar00Aquastaff000000 000000 /* * 63mer/localAsm.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define CTGendLen 35 /* should not be larger than max_read_len */ #define UPlimit 5000 #define MaxRouteNum 10 static void kmerSet_mark ( KmerSet * set ); static void trace4Repeat ( Kmer currW, int steps, int min, int max, int * num_route, KmerSet * kset, Kmer kmerDest, int overlap, Kmer WORDF, int * traceCounter, int maxRoute, kmer_t ** soFarNode, short * multiOccu1, short * multiOccu2, int * routeLens, char ** foundRoutes, char * soFarSeq, long long * soFarLinks, double * avgLinks ); static Kmer prevKmerLocal ( Kmer next, char ch, int overlap ) { Kmer word = KmerRightBitMoveBy2 ( next ); if ( 2 * ( overlap - 1 ) < 64 ) { word.low |= ( ( ( ubyte8 ) ch ) << 2 * ( overlap - 1 ) ); } else { word.high |= ( ( ubyte8 ) ch ) << ( 2 * ( overlap - 1 ) - 64 ); } return word; } static Kmer nextKmerLocal ( Kmer prev, char ch, Kmer WordFilter ) { Kmer word = KmerLeftBitMoveBy2 ( prev ); word = KmerAnd ( word, WordFilter ); word.low |= ch; return word; } static void singleKmer ( int t, KmerSet * kset, int flag, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer ) { kmer_t * pos; put_kmerset ( kset, kmerBuffer[t], prevcBuffer[t], nextcBuffer[t], &pos ); if ( pos->inEdge == flag ) { return; } else if ( pos->inEdge == 0 ) { pos->inEdge = flag; } else if ( pos->inEdge == 1 && flag == 2 ) { pos->inEdge = 3; } 
else if ( pos->inEdge == 2 && flag == 1 ) { pos->inEdge = 3; } } static void putKmer2DBgraph ( KmerSet * kset, int flag, int kmer_c, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer ) { int t; for ( t = 0; t < kmer_c; t++ ) { singleKmer ( t, kset, flag, kmerBuffer, prevcBuffer, nextcBuffer ); } } static void getSeqFromRead ( READNEARBY read, char * src_seq ) { int len_seq = read.len; int j; char * tightSeq = ( char * ) darrayGet ( readSeqInGap, read.seqStarter ); for ( j = 0; j < len_seq; j++ ) { src_seq[j] = getCharInTightString ( tightSeq, j ); } } static void chopKmer4Ctg ( Kmer * kmerCtg, int lenCtg, int overlap, char * src_seq, Kmer WORDF ) { int index, j; Kmer word; word.high = word.low = 0; for ( index = 0; index < overlap; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= src_seq[index]; } index = 0; kmerCtg[index++] = word; for ( j = 1; j <= lenCtg - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); kmerCtg[index++] = word; } } static void chopKmer4read ( int len_seq, int overlap, char * src_seq, char * bal_seq, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer, int * kmer_c, Kmer WORDF ) { int j, bal_j; Kmer word, bal_word; int index; char InvalidCh = 4; if ( len_seq < overlap + 1 ) { *kmer_c = 0; return; } word.high = word.low = 0; for ( index = 0; index < overlap; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlap ); bal_j = len_seq - 0 - overlap; // 0; index = 0; if ( KmerSmaller ( word, bal_word ) ) { kmerBuffer[index] = word; prevcBuffer[index] = InvalidCh; nextcBuffer[index++] = src_seq[0 + overlap]; } else { kmerBuffer[index] = bal_word; prevcBuffer[index] = bal_seq[bal_j - 1]; nextcBuffer[index++] = InvalidCh; } for ( j = 1; j <= len_seq - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); bal_j = len_seq - j - 
overlap; // j; bal_word = prevKmerLocal ( bal_word, bal_seq[bal_j], overlap ); if ( KmerSmaller ( word, bal_word ) ) { kmerBuffer[index] = word; prevcBuffer[index] = src_seq[j - 1]; if ( j < len_seq - overlap ) { nextcBuffer[index++] = src_seq[j + overlap]; } else { nextcBuffer[index++] = InvalidCh; } //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node kmerBuffer[index] = bal_word; if ( bal_j > 0 ) { prevcBuffer[index] = bal_seq[bal_j - 1]; } else { prevcBuffer[index] = InvalidCh; } nextcBuffer[index++] = bal_seq[bal_j + overlap]; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } *kmer_c = index; } static void headTightStr ( char * tightStr, int length, int start, int headLen, int revS, char * src_seq ) { int i, index = 0; if ( !revS ) { for ( i = start; i < start + headLen; i++ ) { src_seq[index++] = getCharInTightString ( tightStr, i ); } } else { for ( i = length - 1 - start; i >= length - headLen - start; i-- ) { src_seq[index++] = int_comp ( getCharInTightString ( tightStr, i ) ); } } } static int getSeqFromCtg ( CTGinSCAF * ctg, boolean fromHead, unsigned int len, int originOverlap, char * src_seq ) { unsigned int ctgId = ctg->ctgID; unsigned int bal_ctg = getTwinCtg ( ctgId ); if ( contig_array[ctgId].length < 1 ) { return 0; } unsigned int length = contig_array[ctgId].length + originOverlap; len = len < length ? 
len : length; if ( fromHead ) { if ( contig_array[ctgId].seq ) { headTightStr ( contig_array[ctgId].seq, length, 0, len, 0, src_seq ); } else { headTightStr ( contig_array[bal_ctg].seq, length, 0, len, 1, src_seq ); } } else { if ( contig_array[ctgId].seq ) { headTightStr ( contig_array[ctgId].seq, length, length - len, len, 0, src_seq ); } else { headTightStr ( contig_array[bal_ctg].seq, length, length - len, len, 1, src_seq ); } } return len; } static KmerSet * readsInGap2DBgraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int originOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, Kmer WordFilter ) { int kmer_c; Kmer * kmerBuffer; char * nextcBuffer, *prevcBuffer; int i; int buffer_size = maxReadLen > CTGendLen ? maxReadLen : CTGendLen; KmerSet * kmerS = NULL; int lenCtg1; int lenCtg2; char * bal_seq; char * src_seq; src_seq = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); bal_seq = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); lenCtg1 = getSeqFromCtg ( ctg1, 0, CTGendLen, originOverlap, src_seq ); lenCtg2 = getSeqFromCtg ( ctg2, 1, CTGendLen, originOverlap, src_seq ); if ( lenCtg1 <= overlap || lenCtg2 <= overlap ) { free ( ( void * ) src_seq ); free ( ( void * ) bal_seq ); return kmerS; } kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); prevcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); nextcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); kmerS = init_kmerset ( 1024, 0.77f ); for ( i = 0; i < num; i++ ) { getSeqFromRead ( rdArray[i], src_seq ); chopKmer4read ( rdArray[i].len, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 0, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); } lenCtg1 = getSeqFromCtg ( ctg1, 0, CTGendLen, originOverlap, src_seq ); chopKmer4Ctg ( kmerCtg1, lenCtg1, overlap, src_seq, WordFilter ); chopKmer4read ( lenCtg1, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, 
&kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 1, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); lenCtg2 = getSeqFromCtg ( ctg2, 1, CTGendLen, originOverlap, src_seq ); chopKmer4Ctg ( kmerCtg2, lenCtg2, overlap, src_seq, WordFilter ); chopKmer4read ( lenCtg2, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 2, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); /* if(ctg1->ctgID==3733&&ctg2->ctgID==3067){ for(i=0;i 32 ? 32 : overlap; bit1 = overlap > 32 ? overlap - 32 : 0; for ( i = bit1 - 1; i >= 0; i-- ) { ch = kmer.high & 0x3; kmer.high >>= 2; kmerSeq[i] = ch; } for ( i = bit2 - 1; i >= 0; i-- ) { ch = kmer.low & 0x3; kmer.low >>= 2; kmerSeq[i + bit1] = ch; } for ( i = 0; i < overlap; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) kmerSeq[i] ) ); } } static void kmerSet_mark ( KmerSet * set ) { int i, in_num, out_num, cvgSingle; kmer_t * rs; long long counter = 0, linear = 0; Kmer word; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { in_num = out_num = 0; rs = set->array + set->iter_ptr; word = rs->seq; for ( i = 0; i < 4; i++ ) { cvgSingle = get_kmer_left_cov ( *rs, i ); if ( cvgSingle > 0 ) { in_num++; } cvgSingle = get_kmer_right_cov ( *rs, i ); if ( cvgSingle > 0 ) { out_num++; } } if ( rs->single ) { counter++; } if ( in_num == 1 && out_num == 1 ) { rs->linear = 1; linear++; } } set->iter_ptr ++; } //printf("Allocated %ld node, %ld single nodes, %ld linear\n",(long)count_kmerset(set),counter,linear); } static kmer_t * searchNode ( Kmer word, KmerSet * kset, int overlap ) { Kmer bal_word = reverseComplement ( word, overlap ); kmer_t * node; boolean found; if ( KmerSmaller ( word, bal_word ) ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( found ) { return node; } else { return NULL; } } static int searchKmerOnCtg ( Kmer currW, Kmer * kmerDest, int num ) { int i; 
for ( i = 0; i < num; i++ ) { if ( KmerEqual ( currW, kmerDest[i] ) ) { return i; } } return -1; } /* pick one of n items; the random choice is disabled, so the last item is always taken */ static int nPick1 ( int * array, int n ) { int m, i; m = n - 1; /* was (int)(drand48()*n) */ int value = array[m]; for ( i = m; i < n - 1; i++ ) { array[i] = array[i + 1]; } return value; } static void traceAlongDBgraph ( Kmer currW, int steps, int min, int max, int * num_route, KmerSet * kset, Kmer * kmerDest, int num, int overlap, Kmer WORDF, char ** foundRoutes, int * routeEndOnCtg2, int * routeLens, char * soFarSeq, int * traceCounter, int maxRoute, kmer_t ** soFarNode, boolean * multiOccu, long long * soFarLinks, double * avgLinks ) { ( *traceCounter ) ++; if ( *traceCounter > UPlimit ) { /* if(overlap==19&&kmerDest[0]==pubKmer) printf("UPlimit\n"); */ return; } if ( steps > max || *num_route >= maxRoute ) { /* if(overlap==19&&kmerDest[0]==pubKmer) printf("max steps/maxRoute\n"); */ return; } Kmer word = reverseComplement ( currW, overlap ); boolean isSmaller = KmerSmaller ( currW, word ); int i; char ch; unsigned char links; if ( isSmaller ) { word = currW; } kmer_t * node; boolean found = search_kmerset ( kset, word, &node ); if ( !found ) { printf ( "Trace: can't find kmer %llx %llx (input %llx %llx) at step %d\n", word.high, word.low, currW.high, currW.low, steps ); return; } if ( node->twin > 1 ) { return; } if ( soFarNode ) { soFarNode[steps] = node; } if ( steps > 0 ) { soFarSeq[steps - 1] = lastCharInKmer ( currW ); } int index, end; int linkCounter = *soFarLinks; if ( steps >= min && node->inEdge > 1 && ( end = searchKmerOnCtg ( currW, kmerDest, num ) ) >= 0 ) { index = *num_route; if ( steps > 0 ) { avgLinks[index] = ( double ) linkCounter / steps; } else { avgLinks[index] = 0; } /* find nodes that appear more than once in the path */ multiOccu[index] = 0; for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } for ( i = 0; i < steps + 1; i++ ) { if ( soFarNode[i]->deleted ) { multiOccu[index] = 1; break; } 
soFarNode[i]->deleted = 1; } routeEndOnCtg2[index] = end; routeLens[index] = steps; char * array = foundRoutes[index]; for ( i = 0; i < steps; i++ ) { array[i] = soFarSeq[i]; } if ( i < max ) { array[i] = 4; } //indicate the end of the sequence *num_route = ++index; return; } steps++; if ( isSmaller ) { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_right_cov ( *node, ch ); if ( !links ) { continue; } *soFarLinks = linkCounter + links; word = nextKmerLocal ( currW, ch, WORDF ); traceAlongDBgraph ( word, steps, min, max, num_route, kset, kmerDest, num, overlap, WORDF, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, traceCounter, maxRoute, soFarNode, multiOccu, soFarLinks, avgLinks ); } } else { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_left_cov ( *node, ch ); if ( !links ) { continue; } *soFarLinks = linkCounter + links; word = nextKmerLocal ( currW, int_comp ( ch ), WORDF ); traceAlongDBgraph ( word, steps, min, max, num_route, kset, kmerDest, num, overlap, WORDF, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, traceCounter, maxRoute, soFarNode, multiOccu, soFarLinks, avgLinks ); } } } static int searchFgap ( KmerSet * kset, CTGinSCAF * ctg1, CTGinSCAF * ctg2, Kmer * kmerCtg1, Kmer * kmerCtg2, unsigned int origOverlap, int overlap, DARRAY * gapSeqArray, int len1, int len2, Kmer WordFilter, int * offset1, int * offset2, char * seqGap, int * cut1, int * cut2 ) { int i; int ret = 0; kmer_t * node, **soFarNode; int num_route; int gapLen = ctg2->start - ctg1->end - origOverlap + overlap; int min = gapLen - GLDiff > 0 ? gapLen - GLDiff : 0; //0531 int max = gapLen + GLDiff < 10 ? 
10 : gapLen + GLDiff; char ** foundRoutes; char * soFarSeq; int traceCounter; int * routeEndOnCtg2; int * routeLens; boolean * multiOccu; long long soFarLinks; double * avgLinks; //mask linear internal linear kmer on contig1 end routeEndOnCtg2 = ( int * ) ckalloc ( MaxRouteNum * sizeof ( int ) ); routeLens = ( int * ) ckalloc ( MaxRouteNum * sizeof ( int ) ); multiOccu = ( boolean * ) ckalloc ( MaxRouteNum * sizeof ( boolean ) ); short * MULTI1 = ( short * ) ckalloc ( MaxRouteNum * sizeof ( short ) ); short * MULTI2 = ( short * ) ckalloc ( MaxRouteNum * sizeof ( short ) ); soFarSeq = ( char * ) ckalloc ( max * sizeof ( char ) ); soFarNode = ( kmer_t ** ) ckalloc ( ( max + 1 ) * sizeof ( kmer_t * ) ); foundRoutes = ( char ** ) ckalloc ( MaxRouteNum * sizeof ( char * ) );; avgLinks = ( double * ) ckalloc ( MaxRouteNum * sizeof ( double ) );; for ( i = 0; i < MaxRouteNum; i++ ) { foundRoutes[i] = ( char * ) ckalloc ( max * sizeof ( char ) ); } for ( i = len1 - 1; i >= 0; i-- ) { num_route = traceCounter = soFarLinks = 0; int steps = 0; traceAlongDBgraph ( kmerCtg1[i], steps, min, max, &num_route, kset, kmerCtg2, len2, overlap, WordFilter, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, &traceCounter, MaxRouteNum, soFarNode, multiOccu, &soFarLinks, avgLinks ); if ( num_route > 0 ) { int m, minEnd = routeEndOnCtg2[0]; for ( m = 0; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( routeEndOnCtg2[m] < minEnd ) { minEnd = routeEndOnCtg2[m]; } } /* else if(minFreq>1){ for(m=0;m3) break; printf("%c",int2base((int)foundRoutes[m][j])); } printf(": %4.2f\n",avgLinks[m]); } } */ num_route = traceCounter = soFarLinks = 0; steps = 0; trace4Repeat ( kmerCtg1[i], steps, min, max, &num_route, kset, kmerCtg2[minEnd], overlap, WordFilter, &traceCounter, MaxRouteNum, soFarNode, MULTI1, MULTI2, routeLens, foundRoutes, soFarSeq, &soFarLinks, avgLinks ); int j, best = 0; int maxLen = routeLens[0]; double maxLink = avgLinks[0]; char * pt; boolean repeat = 0, sameLen = 1; 
		int leftMost = max, rightMost = max;

		if ( num_route < 1 )
		{
			fprintf ( stderr, "After trace4Repeat: no route was found\n" );
			continue;
		}

		if ( num_route > 1 )
		{
			// if multiple paths are found, check for repetitive occurrences and compare links/length
			for ( m = 0; m < num_route; m++ )
			{
				if ( routeLens[m] < 0 )
				{
					continue;
				}

				if ( MULTI1[m] >= 0 && MULTI2[m] >= 0 )
				{
					repeat = 1;
					leftMost = leftMost > MULTI1[m] ? MULTI1[m] : leftMost;
					rightMost = rightMost > MULTI2[m] ? MULTI2[m] : rightMost;
				}

				if ( routeLens[m] != maxLen )
				{
					sameLen = 0;
				}

				if ( routeLens[m] < maxLen )
				{
					maxLen = routeLens[m];
				}

				if ( avgLinks[m] > maxLink )
				{
					maxLink = avgLinks[m];
					best = m;
				}
			}
		}

		if ( repeat )
		{
			*offset1 = *offset2 = *cut1 = *cut2 = 0;
			int index = 0;
			char ch;

			// collect the prefix shared by all routes, up to the leftmost repeat
			for ( j = 0; j < leftMost; j++ )
			{
				if ( routeLens[0] < j + overlap + 1 )
				{
					break;
				}
				else
				{
					ch = foundRoutes[0][j];
				}

				for ( m = 1; m < num_route; m++ )
				{
					if ( routeLens[m] < 0 )
					{
						continue;
					}

					if ( ch != foundRoutes[m][j] )
					{
						break;
					}
				}

				if ( m == num_route )
				{
					seqGap[index++] = ch;
				}
				else
				{
					break;
				}
			}

			*offset1 = index;
			index = 0;

			// count the suffix shared by all routes, up to the rightmost repeat
			for ( j = 0; j < rightMost; j++ )
			{
				if ( routeLens[0] - overlap - 1 < j )
				{
					break;
				}
				else
				{
					ch = foundRoutes[0][routeLens[0] - overlap - 1 - j];
				}

				for ( m = 1; m < num_route; m++ )
				{
					if ( routeLens[m] < 0 )
					{
						continue;
					}

					if ( ch != foundRoutes[m][routeLens[m] - overlap - 1 - j] )
					{
						break;
					}
				}

				if ( m == num_route )
				{
					index++;
				}
				else
				{
					break;
				}
			}

			*offset2 = index;

			for ( j = 0; j < *offset2; j++ )
			{
				seqGap[*offset1 + *offset2 - 1 - j] = foundRoutes[0][routeLens[0] - overlap - 1 - j];
			}

			if ( *offset1 > 0 || *offset2 > 0 )
			{
				*cut1 = len1 - i - 1;
				*cut2 = minEnd;

				//fprintf(stderr,"\n");
				for ( m = 0; m < num_route; m++ )
				{
					for ( j = 0; j < max; j++ )
					{
						if ( foundRoutes[m][j] > 3 )
						{
							break;
						}

						//fprintf(stderr,"%c",int2base((int)foundRoutes[m][j]));
					}

					//fprintf(stderr,": %4.2f\n",avgLinks[m]);
				}

				/*
				fprintf(stderr,">Gap (%d + %d) (%d + %d)\n",*offset1,*offset2,*cut1,*cut2);
				for(index=0;index<*offset1+*offset2;index++)
					fprintf(stderr,"%c",int2base(seqGap[index]));
				fprintf(stderr,"\n");
				*/
			}

			ret = 3;
			break;
		}

		if ( overlap + ( len1 - i - 1 ) + minEnd - routeLens[best] > ( int ) origOverlap )
		{
			continue;
		}

		ctg1->gapSeqOffset = gapSeqArray->item_c;
		ctg1->gapSeqLen = routeLens[best];

		if ( !darrayPut ( gapSeqArray, ctg1->gapSeqOffset + maxLen / 4 ) )
		{
			continue;
		}

		pt = ( char * ) darrayPut ( gapSeqArray, ctg1->gapSeqOffset );

		/*
		printKmerSeqLocal(stderr,kmerCtg1[i],overlap);
		fprintf(stderr,"-");
		*/
		for ( j = 0; j < max; j++ )
		{
			if ( foundRoutes[best][j] > 3 )
			{
				break;
			}

			writeChar2tightString ( foundRoutes[best][j], pt, j );
			//fprintf(stderr,"%c",int2base((int)foundRoutes[best][j]));
		}

		//fprintf(stderr,": GAPSEQ %d + %d, avglink %4.2f\n",len1-i-1,minEnd,avgLinks[best]);
		ctg1->cutTail = len1 - i - 1;
		ctg2->cutHead = overlap + minEnd;
		ctg2->scaftig_start = 0;
		ret = 1;
		break;
		/*
		}if(num_route>1){
		    ret = 2;
		    break;
		    */
	}
	else //mark node which leads to dead end
	{
		node = searchNode ( kmerCtg1[i], kset, overlap );

		if ( node )
		{
			node->twin = 2;
		}
	}
}

for ( i = 0; i < MaxRouteNum; i++ )
{
	free ( ( void * ) foundRoutes[i] );
}

free ( ( void * ) soFarSeq );
free ( ( void * ) soFarNode );
free ( ( void * ) multiOccu );
free ( ( void * ) MULTI1 );
free ( ( void * ) MULTI2 );
free ( ( void * ) foundRoutes );
free ( ( void * ) routeEndOnCtg2 );
free ( ( void * ) routeLens );
free ( ( void * ) avgLinks ); // allocated above with ckalloc; release it with the other buffers
return ret;
}

static void trace4Repeat ( Kmer currW, int steps, int min, int max, int * num_route,
                           KmerSet * kset, Kmer kmerDest, int overlap, Kmer WORDF,
                           int * traceCounter, int maxRoute, kmer_t ** soFarNode, short * multiOccu1, short * multiOccu2,
                           int * routeLens, char ** foundRoutes, char * soFarSeq,
                           long long * soFarLinks, double * avgLinks )
{
	( *traceCounter ) ++;

	if ( *traceCounter > UPlimit )
	{
		return;
	}

	if ( steps > max || *num_route >= maxRoute )
	{
		return;
	}

	Kmer word = reverseComplement ( currW, overlap );
	boolean isSmaller = KmerSmaller ( currW, word );
	char ch;
	unsigned char links;
	int index, i;

	if ( isSmaller )
	{
		word
= currW; } kmer_t * node; boolean found = search_kmerset ( kset, word, &node ); if ( !found ) { printf ( "Trace: can't find kmer %llx %llx (input %llx %llx) at step %d\n", word.high, word.low, currW.high, currW.low, steps ); return; } if ( soFarNode ) { soFarNode[steps] = node; } if ( soFarSeq && steps > 0 ) { soFarSeq[steps - 1] = lastCharInKmer ( currW ); } int linkCounter; if ( soFarLinks ) { linkCounter = *soFarLinks; } if ( steps >= min && KmerEqual ( currW, kmerDest ) ) { index = *num_route; if ( avgLinks && steps > 0 ) { avgLinks[index] = ( double ) linkCounter / steps; } else if ( avgLinks ) { avgLinks[index] = 0; } //find node that appears more than once in the path if ( multiOccu1 && multiOccu2 ) { for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } int rightMost = 0; boolean MULTI = 0; for ( i = 0; i < steps + 1; i++ ) { if ( soFarNode[i]->deleted ) { rightMost = rightMost < i - 1 ? i - 1 : rightMost; MULTI = 1; } soFarNode[i]->deleted = 1; } if ( !MULTI ) { multiOccu1[index] = multiOccu2[index] = -1; } else { multiOccu2[index] = steps - 2 - rightMost < 0 ? 0 : steps - 2 - rightMost; //[0 steps-2] for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } int leftMost = steps - 2; for ( i = steps; i >= 0; i-- ) { if ( soFarNode[i]->deleted ) { leftMost = leftMost > i - 1 ? i - 1 : leftMost; } soFarNode[i]->deleted = 1; } multiOccu1[index] = leftMost < 0 ? 
0 : leftMost; //[0 steps-2] } } if ( routeLens ) { routeLens[index] = steps; } if ( soFarSeq ) { char * array = foundRoutes[index]; for ( i = 0; i < steps; i++ ) { array[i] = soFarSeq[i]; } if ( i < max ) { array[i] = 4; } //indicate the end of the sequence } *num_route = ++index; } steps++; if ( isSmaller ) { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_right_cov ( *node, ch ); if ( !links ) { continue; } if ( soFarLinks ) { *soFarLinks = linkCounter + links; } word = nextKmerLocal ( currW, ch, WORDF ); trace4Repeat ( word, steps, min, max, num_route, kset, kmerDest, overlap, WORDF, traceCounter, maxRoute, soFarNode, multiOccu1, multiOccu2, routeLens, foundRoutes, soFarSeq, soFarLinks, avgLinks ); } } else { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_left_cov ( *node, ch ); if ( !links ) { continue; } if ( soFarLinks ) { *soFarLinks = linkCounter + links; } word = nextKmerLocal ( currW, int_comp ( ch ), WORDF ); trace4Repeat ( word, steps, min, max, num_route, kset, kmerDest, overlap, WORDF, traceCounter, maxRoute, soFarNode, multiOccu1, multiOccu2, routeLens, foundRoutes, soFarSeq, soFarLinks, avgLinks ); } } } //found repeat node on contig ends static void maskRepeatNode ( KmerSet * kset, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, int len1, int len2, int max, Kmer WordFilter ) { int i; int num_route, steps; int min = 1, maxRoute = 1; int traceCounter; Kmer word, bal_word; kmer_t * node; boolean found; int counter = 0; for ( i = 0; i < len1; i++ ) { word = kmerCtg1[i]; bal_word = reverseComplement ( word, overlap ); if ( KmerLarger ( word, bal_word ) ) { word = bal_word; } found = search_kmerset ( kset, word, &node ); if ( !found || node->linear ) { //printf("Found no node for kmer %llx\n",word); continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( word, steps, min, max, &num_route, kset, word, overlap, WordFilter, &traceCounter, 
maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { continue; } counter++; node->checked = 1; } for ( i = 0; i < len2; i++ ) { word = kmerCtg2[i]; bal_word = reverseComplement ( word, overlap ); if ( KmerLarger ( word, bal_word ) ) { word = bal_word; } found = search_kmerset ( kset, word, &node ); if ( !found || node->linear ) { //printf("Found no node for kmer %llx\n",word); continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( word, steps, min, max, &num_route, kset, word, overlap, WordFilter, &traceCounter, maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { continue; } counter++; node->checked = 1; } //printf("MR: %d(%d)\n",counter,len1+len2); } /* static boolean chopReadFillGap(int len_seq,int overlap,char *src_seq, char *bal_seq, KmerSet *kset,Kmer WORDF,int *start,int *end,boolean *bal, Kmer *KmerCtg1,int len1,Kmer *KmerCtg2,int len2,int *index1,int *index2) { int index,j=0,bal_j; Kmer word,bal_word; int flag=0,bal_flag=0; int ctg1start,bal_ctg1start,ctg2end,bal_ctg2end; int seqStart,bal_start,seqEnd,bal_end; kmer_t *node; boolean found; if(len_seqlinear&&!node->checked){ if(!flag&&node->inEdge==1){ ctg1start = searchKmerOnCtg(word,KmerCtg1,len1); if(ctg1start>0){ flag = 1; seqStart = j + overlap-1; } } if(!bal_flag&&node->inEdge==2){ bal_ctg2end = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(bal_ctg2end>0){ bal_flag = 2; bal_end = bal_j+overlap-1; } } } for(j = 1; j <= len_seq - overlap; j ++) { word = nextKmerLocal(word,src_seq[j-1+overlap],WORDF); bal_j = len_seq-j-overlap; // j; bal_word = prevKmerLocal(bal_word,bal_seq[bal_j],overlap); if(wordlinear&&!node->checked){ if(!flag&&node->inEdge==1){ ctg1start = searchKmerOnCtg(word,KmerCtg1,len1); if(ctg1start>0){ flag = 1; seqStart = j + overlap-1; } }else if(flag==1&&node->inEdge==1){ index = searchKmerOnCtg(word,KmerCtg1,len1); if(index>ctg1start){ // choose hit closer to gap ctg1start = index; seqStart = j + overlap-1; } }else 
if(flag==1&&node->inEdge==2){ ctg2end = searchKmerOnCtg(word,KmerCtg2,len2); if(ctg2end>0){ flag = 3; seqEnd = j+overlap-1; break; } } if(!bal_flag&&node->inEdge==2){ bal_ctg2end = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(bal_ctg2end>0){ bal_flag = 2; bal_end = bal_j+overlap-1; } }else if(bal_flag==2&&node->inEdge==2){ index = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(indexinEdge==1){ bal_ctg1start = searchKmerOnCtg(bal_word,KmerCtg1,len1); if(bal_ctg1start>0){ bal_flag = 3; bal_start = bal_j+overlap-1; break; } } } } if(flag==3){ *start = seqStart; *end = seqEnd; *bal = 0; *index1 = ctg1start; *index2 = ctg2end; return 1; }else if(bal_flag==3){ *start = bal_start; *end = bal_end; *bal = 1; *index1 = bal_ctg1start; *index2 = bal_ctg2end; return 1; } return 0; } static boolean readsCrossGap(READNEARBY *rdArray, int num, int originOverlap,DARRAY *gapSeqArray, Kmer *kmerCtg1,Kmer *kmerCtg2,int overlap,int len1,int len2, CTGinSCAF *ctg1,CTGinSCAF *ctg2,KmerSet *kmerS,Kmer WordFilter,int min,int max) { int i,j,start,end,startOnCtg1,endOnCtg2; char *bal_seq; char *src_seq; char *pt; boolean bal,ret=0,FILL; src_seq = (char *)ckalloc(maxReadLen*sizeof(char)); bal_seq = (char *)ckalloc(maxReadLen*sizeof(char)); for(i=0;imax) continue; fprintf(stderr,"Read across\n"); //printf("Filled: K %d, ctg1 %d ctg2 %d,start %d end %d\n",overlap,startOnCtg1,endOnCtg2,start,end); if(overlap+(len1-startOnCtg1-1)+endOnCtg2-(end-start)>(int)originOverlap) continue; // contig1 and contig2 could not overlap more than origOverlap bases ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = end-start; if(!darrayPut(gapSeqArray,ctg1->gapSeqOffset+(end-start)/4)) continue; pt = (char *)darrayPut(gapSeqArray,ctg1->gapSeqOffset); for(j=start+1;j<=end;j++){ if(bal) writeChar2tightString(bal_seq[j],pt,j-start-1); else writeChar2tightString(src_seq[j],pt,j-start-1); } ctg1->cutTail = len1-startOnCtg1-1; ctg2->cutHead = overlap + endOnCtg2; ctg2->scaftig_start = 0; ret = 1; break; } 
free((void*)src_seq); free((void*)bal_seq); return ret; } */ static void kmerSet_markTandem ( KmerSet * set, Kmer WordFilter, int overlap ); static boolean readsCrossGap ( READNEARBY * rdArray, int num, int originOverlap, DARRAY * gapSeqArray, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, CTGinSCAF * ctg1, CTGinSCAF * ctg2, KmerSet * kmerS, Kmer WordFilter, int min, int max, int offset1, int offset2, char * seqGap, char * seqCtg1, char * seqCtg2, int cut1, int cut2 ); int localGraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int origOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, DARRAY * gapSeqArray, char * seqCtg1, char * seqCtg2, char * seqGap ) { /**************** put kmer in DBgraph ****************/ KmerSet * kmerSet; Kmer WordFilter = createFilter ( overlap ); /* if(ctg1->ctgID==56410&&ctg2->ctgID==61741) printf("Extract %d reads for gap [%d %d]\n",num,ctg1->ctgID,ctg2->ctgID); */ kmerSet = readsInGap2DBgraph ( rdArray, num, ctg1, ctg2, origOverlap, kmerCtg1, kmerCtg2, overlap, WordFilter ); if ( !kmerSet ) { //printf("no kmer found\n"); return 0; } time_t tt; time ( &tt ); srand48 ( ( int ) tt ); /* int i,j; for(i=0;i<2;i++){ int array[] = {0,1,2,3}; for(j=4;j>0;j--) fprintf(stderr,"%d ", nPick1(array,j)); } fprintf(stderr,"\n"); */ /***************** search path to connect contig ends ********/ int gapLen = ctg2->start - ctg1->end - origOverlap + overlap; int min = gapLen - GLDiff > 0 ? gapLen - GLDiff : 0; int max = gapLen + GLDiff < 10 ? 10 : gapLen + GLDiff; //count kmer number for contig1 and contig2 ends int len1, len2; len1 = CTGendLen < contig_array[ctg1->ctgID].length + origOverlap ? CTGendLen : contig_array[ctg1->ctgID].length + origOverlap; len2 = CTGendLen < contig_array[ctg2->ctgID].length + origOverlap ? 
CTGendLen : contig_array[ctg2->ctgID].length + origOverlap; len1 -= overlap - 1; len2 -= overlap - 1; //int pathNum = 2; int offset1 = 0, offset2 = 0, cut1 = 0, cut2 = 0; int pathNum = searchFgap ( kmerSet, ctg1, ctg2, kmerCtg1, kmerCtg2, origOverlap, overlap, gapSeqArray, len1, len2, WordFilter, &offset1, &offset2, seqGap, &cut1, &cut2 ); //printf("SF: %d K %d\n",pathNum,overlap); if ( pathNum == 0 ) { free_kmerset ( kmerSet ); return 0; } else if ( pathNum == 1 ) { free_kmerset ( kmerSet ); return 1; }/* else{ printf("ret %d\n",pathNum); free_kmerset(kmerSet); return 0; } */ /******************* cross the gap by single reads *********/ //kmerSet_markTandem(kmerSet,WordFilter,overlap); maskRepeatNode ( kmerSet, kmerCtg1, kmerCtg2, overlap, len1, len2, max, WordFilter ); boolean found = readsCrossGap ( rdArray, num, origOverlap, gapSeqArray, kmerCtg1, kmerCtg2, overlap, ctg1, ctg2, kmerSet, WordFilter, min, max, offset1, offset2, seqGap, seqCtg1, seqCtg2, cut1, cut2 ); if ( found ) { //fprintf(stderr,"read across\n"); free_kmerset ( kmerSet ); return found; } else { free_kmerset ( kmerSet ); return 0; } } static void kmerSet_markTandem ( KmerSet * set, Kmer WordFilter, int overlap ) { kmer_t * rs; long long counter = 0; int num_route, steps; int min = 1, max = overlap, maxRoute = 1; int traceCounter; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; if ( rs->inEdge > 0 ) { set->iter_ptr ++; continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( rs->seq, steps, min, max, &num_route, set, rs->seq, overlap, WordFilter, &traceCounter, maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { set->iter_ptr ++; continue; } /* printKmerSeqLocal(stderr,rs->seq,overlap); fprintf(stderr, "\n"); */ rs->checked = 1; counter++; } set->iter_ptr ++; } } /******************* the following is for read-crossing gaps *************************/ 
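/*
 * Illustrative sketch, not part of the SOAPdenovo sources. The read-crossing
 * code in this section compares a read with a contig end using a small
 * dynamic-programming table in which a matching base scores 1 and both
 * mismatches and indels score 0. With that scoring, the final cell holds the
 * length of the longest common subsequence of the two inputs. The block
 * below shows the same recurrence in isolation. The names SKETCH_MAX_LEN,
 * sketch_max3 and sketch_lcs_score are hypothetical and exist only for this
 * example; bases are assumed to be encoded as the integers 0..3.
 */

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical upper bound on sequence length for this sketch. */
#define SKETCH_MAX_LEN 100

/* Return the largest of three ints. */
static int sketch_max3 ( int a, int b, int c )
{
	int m = a >= b ? a : b;
	return m >= c ? m : c;
}

/*
 * Score two base sequences (each base encoded as 0..3).
 * match = 1, mismatch = 0, indel = 0, so the result is the length of the
 * longest common subsequence. Returns 0 when a length is out of range.
 */
static int sketch_lcs_score ( const char * s1, int len1, const char * s2, int len2 )
{
	static int F[SKETCH_MAX_LEN + 1][SKETCH_MAX_LEN + 1];
	int i, j;

	assert ( s1 != NULL && s2 != NULL );

	if ( len1 < 1 || len2 < 1 || len1 > SKETCH_MAX_LEN || len2 > SKETCH_MAX_LEN )
	{
		return 0;
	}

	/* Row 0 and column 0 stand for an empty prefix and score nothing. */
	for ( i = 0; i <= len1; i++ )
	{
		F[i][0] = 0;
	}

	for ( j = 0; j <= len2; j++ )
	{
		F[0][j] = 0;
	}

	for ( i = 1; i <= len1; i++ )
	{
		for ( j = 1; j <= len2; j++ )
		{
			/* Diagonal step: align s1[i-1] with s2[j-1]. */
			int diag = F[i - 1][j - 1] + ( s1[i - 1] == s2[j - 1] ? 1 : 0 );
			/* Vertical and horizontal steps: skip one base at no cost. */
			F[i][j] = sketch_max3 ( diag, F[i - 1][j], F[i][j - 1] );
		}
	}

	return F[len1][len2];
}
```

/*
 * A caller could divide the returned score by the compared length to obtain
 * an identity ratio, which is roughly how the surrounding code appears to
 * rank candidate reads before accepting the best one.
 */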
#define MAXREADLENGTH 100
static const int INDEL = 0;
static const int SIM[4][4] =
{
	{1, 0, 0, 0},
	{0, 1, 0, 0},
	{0, 0, 1, 0},
	{0, 0, 0, 1}
};
static char fastSequence[MAXREADLENGTH];
static char slowSequence[MAXREADLENGTH];
static int Fmatrix[MAXREADLENGTH + 1][MAXREADLENGTH + 1];
static int slowToFastMapping[MAXREADLENGTH + 1];
static int fastToSlowMapping[MAXREADLENGTH + 1];

static int max ( int A, int B, int C )
{
	A = A >= B ? A : B;
	return ( A >= C ? A : C );
}

static int compareSequences ( char * sequence1, char * sequence2, int length1, int length2 )
{
	if ( length1 < 1 || length2 < 1 || length1 > MAXREADLENGTH || length2 > MAXREADLENGTH )
	{
		return 0;
	}

	int i, j;
	int Choice1, Choice2, Choice3;
	int maxScore;

	for ( i = 0; i <= length1; i++ )
	{
		Fmatrix[i][0] = 0;
	}

	for ( j = 0; j <= length2; j++ )
	{
		Fmatrix[0][j] = 0;
	}

	for ( i = 1; i <= length1; i++ )
	{
		for ( j = 1; j <= length2; j++ )
		{
			Choice1 = Fmatrix[i - 1][j - 1] + SIM[ ( int ) sequence1[i - 1]] [ ( int ) sequence2[j - 1]];
			Choice2 = Fmatrix[i - 1][j] + INDEL;
			Choice3 = Fmatrix[i][j - 1] + INDEL;
			Fmatrix[i][j] = max ( Choice1, Choice2, Choice3 );
		}
	}

	maxScore = Fmatrix[length1][length2];
	return maxScore;
}

static void mapSlowOntoFast ( int slowSeqLength, int fastSeqLength )
{
	int slowIndex = slowSeqLength;
	int fastIndex = fastSeqLength;
	int fastn, slown;

	if ( slowIndex == 0 )
	{
		slowToFastMapping[0] = fastIndex;

		while ( fastIndex >= 0 )
		{
			fastToSlowMapping[fastIndex--] = 0;
		}

		return;
	}

	if ( fastIndex == 0 )
	{
		while ( slowIndex >= 0 )
		{
			slowToFastMapping[slowIndex--] = 0;
		}

		fastToSlowMapping[0] = slowIndex;
		return;
	}

	while ( slowIndex > 0 && fastIndex > 0 )
	{
		fastn = ( int ) fastSequence[fastIndex - 1]; //getCharInTightString(fastSequence,fastIndex-1);
		slown = ( int ) slowSequence[slowIndex - 1]; //getCharInTightString(slowSequence,slowIndex-1);

		if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex - 1] + SIM[fastn][slown] )
		{
			fastToSlowMapping[--fastIndex] = --slowIndex;
slowToFastMapping[slowIndex] = fastIndex; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex] + INDEL ) { fastToSlowMapping[--fastIndex] = slowIndex - 1; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex][slowIndex - 1] + INDEL ) { slowToFastMapping[--slowIndex] = fastIndex - 1; } else { printf ( "compareSequence: Error trace\n" ); fflush ( stdout ); abort(); } } while ( slowIndex > 0 ) { slowToFastMapping[--slowIndex] = -1; } while ( fastIndex > 0 ) { fastToSlowMapping[--fastIndex] = -1; } slowToFastMapping[slowSeqLength] = fastSeqLength; fastToSlowMapping[fastSeqLength] = slowSeqLength; } static boolean chopReadFillGap ( int len_seq, int overlap, char * src_seq, char * bal_seq, KmerSet * kset, Kmer WORDF, int * start, int * end, boolean * bal, Kmer * KmerCtg1, int len1, Kmer * KmerCtg2, int len2, int * index1, int * index2 ) { int index, j = 0, bal_j; Kmer word, bal_word; int flag = 0, bal_flag = 0; int ctg1start, bal_ctg1start, ctg2end, bal_ctg2end; int seqStart, bal_start, seqEnd, bal_end; kmer_t * node; boolean found; if ( len_seq < overlap + 1 ) { return 0; } word.high = word.low = 0; for ( index = 0; index < overlap; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlap ); bal_j = len_seq - 0 - overlap; // 0; flag = bal_flag = 0; if ( KmerSmaller ( word, bal_word ) ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( found && !node->linear && !node->checked ) { if ( !flag && node->inEdge == 1 ) { ctg1start = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( ctg1start >= 0 ) { flag = 1; seqStart = j + overlap - 1; } } if ( !bal_flag && node->inEdge == 2 ) { bal_ctg2end = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( bal_ctg2end >= 0 ) { bal_flag = 2; bal_end = bal_j + overlap - 1; } } } for ( j = 1; j <= len_seq - 
overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); bal_j = len_seq - j - overlap; // j; bal_word = prevKmerLocal ( bal_word, bal_seq[bal_j], overlap ); if ( KmerSmaller ( word, bal_word ) ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( found && !node->linear && !node->checked ) { if ( !flag && node->inEdge == 1 ) { ctg1start = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( ctg1start >= 0 ) { flag = 1; seqStart = j + overlap - 1; } } else if ( flag == 1 && node->inEdge == 1 ) { index = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( index >= 0 && index > ctg1start ) // choose hit closer to gap { ctg1start = index; seqStart = j + overlap - 1; } } else if ( flag == 1 && node->inEdge == 2 ) { ctg2end = searchKmerOnCtg ( word, KmerCtg2, len2 ); if ( ctg2end >= 0 ) { flag = 3; seqEnd = j + overlap - 1; break; } } if ( !bal_flag && node->inEdge == 2 ) { bal_ctg2end = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( bal_ctg2end >= 0 ) { bal_flag = 2; bal_end = bal_j + overlap - 1; } } else if ( bal_flag == 2 && node->inEdge == 2 ) { index = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( index >= 0 && index < bal_ctg2end ) // choose hit closer to gap { bal_ctg2end = index; bal_end = bal_j + overlap - 1; } } else if ( bal_flag == 2 && node->inEdge == 1 ) { bal_ctg1start = searchKmerOnCtg ( bal_word, KmerCtg1, len1 ); if ( bal_ctg1start >= 0 ) { bal_flag = 3; bal_start = bal_j + overlap - 1; break; } } } } if ( flag == 3 ) { *start = seqStart; *end = seqEnd; *bal = 0; *index1 = ctg1start; *index2 = ctg2end; return 1; } else if ( bal_flag == 3 ) { *start = bal_start; *end = bal_end; *bal = 1; *index1 = bal_ctg1start; *index2 = bal_ctg2end; return 1; } return 0; } static int cutSeqFromTightStr ( char * tightStr, int length, int start, int end, int revS, char * src_seq ) { int i, index = 0; end = end < length ? end : length - 1; start = start >= 0 ? 
start : 0; if ( !revS ) { for ( i = start; i <= end; i++ ) { src_seq[index++] = getCharInTightString ( tightStr, i ); } } else { for ( i = length - 1 - start; i >= length - end - 1; i-- ) { src_seq[index++] = int_comp ( getCharInTightString ( tightStr, i ) ); } } return end - start + 1; } static int cutSeqFromCtg ( unsigned int ctgID, int start, int end, char * sequence, int originOverlap ) { unsigned int bal_ctg = getTwinCtg ( ctgID ); if ( contig_array[ctgID].length < 1 ) { return 0; } int length = contig_array[ctgID].length + originOverlap; if ( contig_array[ctgID].seq ) { return cutSeqFromTightStr ( contig_array[ctgID].seq, length, start, end, 0, sequence ); } else { return cutSeqFromTightStr ( contig_array[bal_ctg].seq, length, start, end, 1, sequence ); } } static int cutSeqFromRead ( char * src_seq, int length, int start, int end, char * sequence ) { if ( end >= length ) { printf ( "******: end %d length %d\n", end, length ); } end = end < length ? end : length - 1; start = start >= 0 ? start : 0; int i; for ( i = start; i <= end; i++ ) { sequence[i - start] = src_seq[i]; } return end - start + 1; } static void printSeq ( FILE * fo, char * seq, int len ) { int i; for ( i = 0; i < len; i++ ) { fprintf ( fo, "%c", int2base ( ( int ) seq[i] ) ); } fprintf ( fo, "\n" ); } static boolean readsCrossGap ( READNEARBY * rdArray, int num, int originOverlap, DARRAY * gapSeqArray, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, CTGinSCAF * ctg1, CTGinSCAF * ctg2, KmerSet * kmerS, Kmer WordFilter, int min, int max, int offset1, int offset2, char * seqGap, char * seqCtg1, char * seqCtg2, int cut1, int cut2 ) { int i, j, start, end, startOnCtg1, endOnCtg2; char * bal_seq; char * src_seq; char * pt; boolean bal, ret = 0, FILL; double maxScore = 0.0; int maxIndex; int lenCtg1, lenCtg2; //build sequences on left and right of the uncertain region int buffer_size = maxReadLen > 100 ? 
maxReadLen : 100; int length = contig_array[ctg1->ctgID].length + originOverlap; if ( buffer_size > offset1 ) { lenCtg1 = cutSeqFromCtg ( ctg1->ctgID, length - cut1 - ( buffer_size - offset1 ), length - 1 - cut1, seqCtg1, originOverlap ); for ( i = 0; i < offset1; i++ ) { seqCtg1[lenCtg1 + i] = seqGap[i]; } lenCtg1 += offset1; } else { for ( i = offset1 - buffer_size; i < offset1; i++ ) { seqCtg1[i + buffer_size - offset1] = seqGap[i]; } lenCtg1 = buffer_size; } length = contig_array[ctg2->ctgID].length + originOverlap; if ( buffer_size > offset2 ) { lenCtg2 = cutSeqFromCtg ( ctg2->ctgID, cut2, buffer_size - offset2 - 1 + cut2, & ( seqCtg2[offset2] ), originOverlap ); for ( i = 0; i < offset2; i++ ) { seqCtg2[i] = seqGap[i + offset1]; } lenCtg2 += offset2; } else { for ( i = 0; i < buffer_size; i++ ) { seqCtg2[i] = seqGap[i + offset1]; } lenCtg2 = buffer_size; } /* if(offset1>0||offset2>0){ for(i=0;i max ) { continue; } if ( overlap + ( len1 - startOnCtg1 - 1 ) + endOnCtg2 - ( end - start ) > ( int ) originOverlap ) { continue; } // contig1 and contig2 could not overlap more than origOverlap bases START[i] = start; END[i] = end; INDEX1[i] = startOnCtg1; INDEX2[i] = endOnCtg2; BAL[i] = bal; int matchLen = 2 * overlap < ( end - start + overlap ) ? 2 * overlap : ( end - start + overlap ); int match; int alignLen = matchLen; //compare the left of hit kmer on ctg1 //int ctgLeft = (contig_array[ctg1->ctgID].length+originOverlap)-(len1+overlap-1)+startOnCtg1; int ctgLeft = ( lenCtg1 ) - ( len1 + overlap - 1 ) + startOnCtg1; int readLeft = start - overlap + 1; int cmpLen = ctgLeft < readLeft ? ctgLeft : readLeft; cmpLen = cmpLen <= MAXREADLENGTH ? 
cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg1->ctgID,ctgLeft-cmpLen,ctgLeft-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg1, lenCtg1, ctgLeft - cmpLen, ctgLeft - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; //compare the right of hit kmer on ctg1 int ctgRight = len1 - startOnCtg1 - 1; cmpLen = ctgRight < ( rdArray[i].len - start - 1 ) ? ctgRight : ( rdArray[i].len - start - 1 ); cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg1->ctgID,ctgLeft+overlap,ctgLeft+overlap+cmpLen-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg1, lenCtg1, ctgLeft + overlap, ctgLeft + overlap + cmpLen - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, start + 1, start + cmpLen, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, start + 1, start + cmpLen, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); //fprintf(stderr,"%d -- %d\n",match,cmpLen); alignLen += cmpLen; matchLen += match; //compare the left of hit kmer on ctg2 ctgLeft = endOnCtg2; readLeft = end - overlap + 1; cmpLen = ctgLeft < readLeft ? ctgLeft : readLeft; cmpLen = ctgLeft <= MAXREADLENGTH ? 
ctgLeft : MAXREADLENGTH; //cutSeqFromCtg(ctg2->ctgID,endOnCtg2-cmpLen,endOnCtg2-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg2, lenCtg2, endOnCtg2 - cmpLen, endOnCtg2 - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; //compare the right of hit kmer on ctg2 //ctgRight = contig_array[ctg2->ctgID].length+originOverlap-endOnCtg2-overlap; ctgRight = lenCtg2 - endOnCtg2 - overlap; cmpLen = ctgRight < ( rdArray[i].len - end - 1 ) ? ctgRight : ( rdArray[i].len - end - 1 ); cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg2->ctgID,endOnCtg2+overlap,endOnCtg2+overlap+cmpLen-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg2, lenCtg2, endOnCtg2 + overlap, endOnCtg2 + overlap + cmpLen - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, end + 1, end + cmpLen, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, end + 1, end + cmpLen, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; /* if(cmpLen>0&&match!=cmpLen+overlap){ printSeq(stderr,fastSequence,cmpLen+overlap); printSeq(stderr,slowSequence,cmpLen+overlap); printKmerSeqLocal(stderr,kmerCtg2[endOnCtg2],overlap); fprintf(stderr,": %d(%d)\n",bal,endOnCtg2); }else if(cmpLen>0&&match==cmpLen+overlap) fprintf(stderr,"Perfect\n"); */ double score = ( double ) matchLen / alignLen; if ( maxScore < score ) { maxScore = score; //fprintf(stderr,"%4.2f (%d/%d)\n",maxScore,matchLen,alignLen); maxIndex = i; } SCORE[i] = score; } /* if(maxScore>0.0) fprintf(stderr,"SCORE: %4.2f\n",maxScore); */ if ( maxScore > 0.9 ) { /* for(i=0;i 0 ? 
offset1 - ( len1 - INDEX1[maxIndex] - 1 ) : 0; int rightRemain = offset2 - ( overlap + INDEX2[maxIndex] ) > 0 ? offset2 - ( overlap + INDEX2[maxIndex] ) : 0; ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = END[maxIndex] - START[maxIndex] + leftRemain + rightRemain; if ( darrayPut ( gapSeqArray, ctg1->gapSeqOffset + ( END[maxIndex] - START[maxIndex] + leftRemain + rightRemain ) / 4 ) ) { pt = ( char * ) darrayPut ( gapSeqArray, ctg1->gapSeqOffset ); for ( j = 0; j < leftRemain; j++ ) //get the left side of the gap region from search { writeChar2tightString ( seqGap[j], pt, j ); //fprintf(stderr,"%c",int2base(seqGap[j])); } for ( j = START[maxIndex] + 1; j <= END[maxIndex]; j++ ) { if ( BAL[maxIndex] ) { writeChar2tightString ( bal_seq[j], pt, j - START[maxIndex] - 1 + leftRemain ); //fprintf(stderr,"%c",int2base(bal_seq[j])); } else { writeChar2tightString ( src_seq[j], pt, j - START[maxIndex] - 1 + leftRemain ); //fprintf(stderr,"%c",int2base(src_seq[j])); } } for ( j = offset2 - rightRemain; j < offset2; j++ ) //get the right side of the gap region from search { writeChar2tightString ( seqGap[j + leftRemain], pt, j + END[maxIndex] - START[maxIndex] + leftRemain ); //fprintf(stderr,"%c",int2base(seqGap[j+leftRemain])); } /* fprintf(stderr,": GAPSEQ (%d+%d)(%d+%d)(%d+%d)(%d+%d) B %d\n",offset1,offset2,cut1,cut2, len1-INDEX1[maxIndex]-1,INDEX2[maxIndex],START[maxIndex],END[maxIndex],BAL[maxIndex]); */ ctg1->cutTail = len1 - INDEX1[maxIndex] - 1 - offset1 + cut1 > cut1 ? len1 - INDEX1[maxIndex] - 1 - offset1 + cut1 : cut1; ctg2->cutHead = overlap + INDEX2[maxIndex] - offset2 + cut2 > cut2 ? 
			                overlap + INDEX2[maxIndex] - offset2 + cut2 : cut2;
			ctg2->scaftig_start = 0;
			ret = 1;
		}
	}

	free ( ( void * ) START );
	free ( ( void * ) END );
	free ( ( void * ) INDEX1 );
	free ( ( void * ) INDEX2 );
	free ( ( void * ) SCORE );
	free ( ( void * ) BAL );
	free ( ( void * ) src_seq );
	free ( ( void * ) bal_seq );
	return ret;
}

/*
 * 63mer/main.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "global.h"

extern int call_pregraph ( int arc, char ** argv );
extern int call_heavygraph ( int arc, char ** argv );
extern int call_map2contig ( int arc, char ** argv );
extern int call_scaffold ( int arc, char ** argv );
extern int call_align ( int arc, char ** argv );

static void display_usage();
static void display_all_usage();
static void pipeline ( int argc, char ** argv );

int main ( int argc, char ** argv )
{
	printf ( "\nVersion 1.05: released on July 29th, 2010\n\n" );
	argc--;
	argv++;

	/*
	printf("Size of Kmer is %d\n",sizeof(Kmer));
	printf("Size of kmer_t is %d\n",sizeof(kmer_t));
	*/
	if ( argc == 0 )
	{
		display_usage();
		return 0;
	}

	if ( strcmp ( "pregraph", argv[0] ) == 0 )
	{
		call_pregraph ( argc, argv );
	}
	else if ( strcmp ( "contig", argv[0] ) == 0 )
	{
		call_heavygraph ( argc, argv );
	}
	else if ( strcmp ( "map", argv[0] ) == 0 )
	{
		call_align ( argc, argv );
	}
	//call_map2contig(argc,argv);
	else if ( strcmp ( "scaff", argv[0] ) == 0 )
	{
		call_scaffold ( argc, argv );
	}
	else if ( strcmp ( "all", argv[0] ) == 0 )
	{
		pipeline ( argc, argv );
	}
	else
	{
		display_usage();
	}

	return 0;
}

static void display_usage()
{
	printf ( "\nUsage: SOAPdenovo [option]\n" );
	printf ( " pregraph construction kmer-graph\n" );
	printf ( " contig eliminate errors and output contigs\n" );
	printf ( " map map reads to contigs\n" );
	printf ( " scaff scaffolding\n" );
	printf ( " all doing all the above in turn\n" );
}

static void pipeline ( int argc, char ** argv )
{
	char * options[16];
	unsigned char getK, getRfile, getOfile, getD, getDD, getL, getR, getP, getF;
	char readfile[256], outfile[256];
	char temp[128];
	char * name;
	int kmer = 0, cutoff_len = 0, ncpu = 0;
	char kmer_s[16], len_s[16], ncpu_s[16], M_s[16];
	int i, copt, index, M = 1;
	extern char * optarg;
	time_t start_t, stop_t;
	time ( &start_t );
	getK = getRfile = getOfile = getD = getDD = getL = getR = getP = getF = 0;

	while ( ( copt = getopt ( argc, argv,
"s:o:K:M:L:p:G:dDRua:" ) ) != EOF ) { switch ( copt ) { case 's': getRfile = 1; sscanf ( optarg, "%s", readfile ); continue; case 'o': getOfile = 1; sscanf ( optarg, "%s", outfile ); // continue; case 'K': getK = 1; sscanf ( optarg, "%s", temp ); // kmer = atoi ( temp ); continue; case 'G': sscanf ( optarg, "%s", temp ); // GLDiff = atoi ( temp ); continue; case 'M': sscanf ( optarg, "%s", temp ); // M = atoi ( temp ); continue; case 'p': getP = 1; sscanf ( optarg, "%s", temp ); // ncpu = atoi ( temp ); continue; case 'L': getL = 1; sscanf ( optarg, "%s", temp ); // cutoff_len = atoi ( temp ); continue; case 'R': getR = 1; continue; case 'u': maskRep = 0; continue; case 'd': getD = 1; continue; case 'D': getDD = 1; continue; case 'a': initKmerSetSize = atoi ( optarg ); break; case 'F': getF = 1; break; default: if ( getRfile == 0 || getOfile == 0 ) // { display_all_usage(); exit ( -1 ); } } } if ( getRfile == 0 || getOfile == 0 ) // { display_all_usage(); exit ( -1 ); } if ( thrd_num < 1 ) { thrd_num = 1; } // getK = getRfile = getOfile = getD = getL = getR = 0; name = "pregraph"; index = 0; options[index++] = name; options[index++] = "-s"; options[index++] = readfile; if ( getK ) { options[index++] = "-K"; sprintf ( kmer_s, "%d", kmer ); options[index++] = kmer_s; } if ( getP ) { options[index++] = "-p"; sprintf ( ncpu_s, "%d", ncpu ); options[index++] = ncpu_s; } if ( getD ) { options[index++] = "-d"; } if ( getDD ) { options[index++] = "-D"; } if ( getR ) { options[index++] = "-R"; } options[index++] = "-o"; options[index++] = outfile; for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_pregraph ( index, options ); name = "contig"; index = 0; options[index++] = name; options[index++] = "-g"; options[index++] = outfile; options[index++] = "-M"; sprintf ( M_s, "%d", M ); options[index++] = M_s; if ( getR ) { options[index++] = "-R"; } for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_heavygraph 
( index, options ); name = "map"; index = 0; options[index++] = name; options[index++] = "-s"; options[index++] = readfile; options[index++] = "-g"; options[index++] = outfile; if ( getP ) { options[index++] = "-p"; sprintf ( ncpu_s, "%d", ncpu ); options[index++] = ncpu_s; } for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_align ( index, options ); name = "scaff"; index = 0; options[index++] = name; options[index++] = "-g"; options[index++] = outfile; if ( getF ) { options[index++] = "-F"; } if ( getP ) { options[index++] = "-p"; sprintf ( ncpu_s, "%d", ncpu ); options[index++] = ncpu_s; } if ( getL ) { options[index++] = "-L"; sprintf ( len_s, "%d", cutoff_len ); options[index++] = len_s; } for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_scaffold ( index, options ); time ( &stop_t ); printf ( "time for the whole pipeline: %dm\n", ( int ) ( stop_t - start_t ) / 60 ); } static void display_all_usage() { printf ( "\nSOAPdenovo all -s configFile [-K kmer -d -D -M mergeLevel -R -u -G gapLenDiff -L minContigLen -p n_cpu] -o Output\n" ); printf ( " -s ShortSeqFile: The input file name of solexa reads\n" ); printf ( " -a initKmerSetSize: define the initial KmerSet size(unit: GB)\n" ); printf ( " -K kmer(default 23): k value in kmer\n" ); printf ( " -p n_cpu(default 8): number of cpu for use\n" ); printf ( " -F (optional) fill gaps in scaffold\n" ); printf ( " -M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during contiging\n" ); printf ( " -d (optional): delete kmers with frequency one (default no)\n" ); printf ( " -D (optional): delete edges with coverage one (default no)\n" ); printf ( " -R (optional): unsolve repeats by reads (default no)\n" ); printf ( " -G gapLenDiff(default 50): allowed length difference between estimated and filled gap\n" ); printf ( " -L minLen(default K+2): shortest contig for scaffolding\n" ); printf ( " -u (optional): un-mask contigs 
with high coverage before scaffolding (default mask)\n" );
	printf ( " -o Output: prefix of output file name\n" );
}
SOAPdenovo-V1.05/src/63mer/Makefile000644 000765 000024 00000003562 11534064011 016760 0ustar00Aquastaff000000 000000
CC=             gcc
CFLAGS=         -O3 -fomit-frame-pointer
DFLAGS=
OBJS=		arc.o attachPEinfo.o bubble.o check.o compactEdge.o \
		concatenateEdge.o connect.o contig.o cutTipPreGraph.o cutTip_graph.o \
		darray.o dfib.o dfibHeap.o fib.o fibHeap.o \
		hashFunction.o kmer.o lib.o loadGraph.o loadPath.o \
		loadPreGraph.o localAsm.o main.o map.o mem_manager.o \
		newhash.o node2edge.o orderContig.o output_contig.o output_pregraph.o \
		output_scaffold.o pregraph.o prlHashCtg.o prlHashReads.o prlRead2Ctg.o \
		prlRead2path.o prlReadFillGap.o read2scaf.o readInterval.o stack.o \
		readseq1by1.o scaffold.o searchPath.o seq.o splitReps.o
PROG=           SOAPdenovo-63mer
INCLUDES=	-Iinc
SUBDIRS=    .
LIBPATH=
LIBS=       -pthread -lm
EXTRA_FLAGS=

BIT_ERR = 0
ifeq (,$(findstring $(shell uname -m), x86_64 ppc64 ia64))
BIT_ERR = 1
endif

LINUX = 0
ifneq (,$(findstring Linux,$(shell uname)))
LINUX = 1
EXTRA_FLAGS += -Wl,--hash-style=both
endif

ifneq (,$(findstring $(shell uname -m), x86_64))
CFLAGS += -m64
endif

ifneq (,$(findstring $(shell uname -m), ia64))
CFLAGS +=
endif

ifneq (,$(findstring $(shell uname -m), ppc64))
CFLAGS += -m64 -mpowerpc64
endif

.SUFFIXES:.c .o

.c.o:
	@printf "Compiling $<...                             \r"; \
	$(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $< || echo "Error in command: $(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $<"

all:		SOAPdenovo

.PHONY:all clean install

envTest:
	@test $(BIT_ERR) != 1 || sh -c 'echo "Fatal: 64bit CPU and Operating System required!";false;'

# EXTRA_FLAGS is the variable defined above; the link line previously named $(ENTRAFLAGS), which is never set
SOAPdenovo:	envTest $(OBJS)
	@$(CC) $(CFLAGS) -o $(PROG) $(OBJS) $(LIBPATH) $(LIBS) $(EXTRA_FLAGS)
	@printf "Linking...                                                       \r"
	@printf "$(PROG) compilation done.\n";

clean:
	@rm -fr gmon.out *.o a.out *.exe *.dSYM $(PROG) *~ *.a *.so.* *.so *.dylib
	@printf "$(PROG) cleaning done.\n";

install:
	@cp $(PROG) ../../bin/
	@printf "$(PROG) installed at ../../bin/$(PROG)\n"
SOAPdenovo-V1.05/src/63mer/map.c000644 000765 000024 00000010044 11530651532 016240 0ustar00Aquastaff000000 000000
/*
 * 63mer/map.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static void initenv ( int argc, char ** argv );
static char shortrdsfile[256];
static char graphfile[256];
static void display_map_usage();

/* Read K (the minimum overlap) and the read-length limits from <gfile>.preGraphBasic; fall back to K = 23 */
static int getMinOverlap ( char * gfile )
{
	char name[256], ch;
	FILE * fp;
	int num_kmer, overlaplen = 23;
	char line[1024];
	sprintf ( name, "%s.preGraphBasic", gfile );
	fp = fopen ( name, "r" );

	if ( !fp )
	{
		return overlaplen;
	}

	while ( fgets ( line, sizeof ( line ), fp ) != NULL )
	{
		if ( line[0] == 'V' )
		{
			sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen );
		}
		else if ( line[0] == 'M' )
		{
			sscanf ( line, "MaxReadLen %d MinReadLen %d MaxNameLen %d", &maxReadLen, &minReadLen, &maxNameLen );
		}
	}

	fclose ( fp );
	return overlaplen;
}

int call_align ( int argc, char ** argv )
{
	time_t start_t, stop_t, time_bef, time_aft;
	time ( &start_t );
	initenv ( argc, argv );
	overlaplen = getMinOverlap ( graphfile );
	printf ( "K = %d\n", overlaplen );
	time ( &time_bef );
	ctg_short = overlaplen + 2;
	printf ( "contig len cutoff: %d\n", ctg_short );
	prlContig2nodes ( graphfile, ctg_short );
	time ( &time_aft );
	printf ( "time spent on De bruijn graph construction: %ds\n\n", ( int ) ( time_aft - time_bef ) );
	//map long read to edge one by one
	time ( &time_bef );
	prlLongRead2Ctg ( shortrdsfile, graphfile );
	time ( &time_aft );
	printf ( "time spent on mapping long reads: %ds\n\n", ( int ) ( time_aft - time_bef ) );
	//map read to edge one by one
	time ( &time_bef );
	prlRead2Ctg ( shortrdsfile, graphfile );
	time ( &time_aft );
	printf ( "time spent on mapping reads: %ds\n\n", ( int ) ( time_aft - time_bef ) );
	free_Sets ( KmerSets, thrd_num );
	time ( &stop_t );
	printf ( "overall time for alignment: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 );
	return 0;
}

/*****************************************************************************
 * Parse command line switches
 *****************************************************************************/
void initenv ( int argc, char ** argv )
{
	int copt;
	int inpseq, outseq;
	extern char * optarg;
	char temp[100];
	optind = 1;
	inpseq = outseq = 0;

	while ( ( copt = getopt ( argc, argv, "s:g:K:p:" ) ) != EOF )
	{
		//printf("get option\n");
		switch ( copt )
		{
			case 's':
				inpseq = 1;
				sscanf ( optarg, "%255s", shortrdsfile );	/* field width bounded by sizeof(shortrdsfile) - 1 */
				continue;
			case 'g':
				outseq = 1;
				sscanf ( optarg, "%255s", graphfile );
				continue;
			case 'K':
				sscanf ( optarg, "%99s", temp );	/* field width bounded by sizeof(temp) - 1 */
				overlaplen = atoi ( temp );
				continue;
			case 'p':
				sscanf ( optarg, "%99s", temp );
				thrd_num = atoi ( temp );
				continue;
			default:

				if ( inpseq == 0 || outseq == 0 )
				{
					display_map_usage();
					exit ( 1 );
				}
		}
	}

	if ( inpseq == 0 || outseq == 0 )
	{
		//printf("need more\n");
		display_map_usage();
		exit ( 1 );
	}
}

static void display_map_usage()
{
	printf ( "\nmap -s readsInfoFile -g graphfile [-p n_cpu]\n" );
	printf ( " -s readsInfoFile: The file contains information of solexa reads\n" );
	printf ( " -p n_cpu(default 8): number of cpu for use\n" );
	printf ( " -g graphfile: prefix of graph files\n" );
}
SOAPdenovo-V1.05/src/63mer/mem_manager.c000644 000765 000024 00000005353 11530651532 017742 0ustar00Aquastaff000000 000000
/*
 * 63mer/mem_manager.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

MEM_MANAGER * createMem_manager ( int num_items, size_t unit_size )
{
	MEM_MANAGER * mem_Manager = ( MEM_MANAGER * ) ckalloc ( 1 * sizeof ( MEM_MANAGER ) );
	mem_Manager->block_list = NULL;
	mem_Manager->items_per_block = num_items;
	mem_Manager->item_size = unit_size;
	mem_Manager->recycle_list = NULL;
	mem_Manager->counter = 0;
	mem_Manager->index_in_block = 0;	/* set explicitly rather than relying on the allocator to zero it */
	return mem_Manager;
}

void freeMem_manager ( MEM_MANAGER * mem_Manager )
{
	BLOCK_START * ite_block, *temp_block;

	if ( !mem_Manager )
	{
		return;
	}

	ite_block = mem_Manager->block_list;

	while ( ite_block )
	{
		temp_block = ite_block;
		ite_block = ite_block->next;
		free ( ( void * ) temp_block );
	}

	free ( ( void * ) mem_Manager );
}

void * getItem ( MEM_MANAGER * mem_Manager )
{
	RECYCLE_MARK * mark;	//this is the type of return value
	BLOCK_START * block;

	if ( !mem_Manager )
	{
		return NULL;
	}

	/* reuse a previously returned item if one is available */
	if ( mem_Manager->recycle_list )
	{
		mark = mem_Manager->recycle_list;
		mem_Manager->recycle_list = mark->next;
		return mark;
	}

	mem_Manager->counter++;

	/* no block yet, or the current block is full: allocate a new one and hand out its first item */
	if ( !mem_Manager->block_list || mem_Manager->index_in_block == mem_Manager->items_per_block )
	{
		//pthread_mutex_lock(&gmutex);
		block = ckalloc ( sizeof ( BLOCK_START ) + mem_Manager->items_per_block * mem_Manager->item_size );
		//mem_Manager->counter += sizeof(BLOCK_START)+mem_Manager->items_per_block*mem_Manager->item_size;
		//pthread_mutex_unlock(&gmutex);
		block->next = mem_Manager->block_list;
		mem_Manager->block_list = block;
		mem_Manager->index_in_block = 1;
		/* byte offsets are computed on char *, since arithmetic on void * is a compiler extension */
		return ( RECYCLE_MARK * ) ( ( char * ) block + sizeof ( BLOCK_START ) );
	}

	block = mem_Manager->block_list;
	return ( RECYCLE_MARK * ) ( ( char * ) block + sizeof ( BLOCK_START ) + mem_Manager->item_size * ( mem_Manager->index_in_block++ ) );
}

void returnItem ( MEM_MANAGER * mem_Manager, void * item )
{
	RECYCLE_MARK * mark;
	mark = item;
	mark->next = mem_Manager->recycle_list;
	mem_Manager->recycle_list = mark;
}
SOAPdenovo-V1.05/src/63mer/newhash.c000644 000765 000024 00000017154
11530651532 017131 0ustar00Aquastaff000000 000000
/*
 * 63mer/newhash.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define PUBLIC_FUNC
#define PROTECTED_FUNC

static const kmer_t empty_kmer = {{0, 0}, 0, 0, 0, 0, 0, 1, 0, 0};

/* Increment the left/right link coverage of an existing kmer, saturating at MAX_KMER_COV */
static inline void update_kmer ( kmer_t * mer, ubyte left, ubyte right )
{
	ubyte4 cov;

	if ( left < 4 )
	{
		cov = get_kmer_left_cov ( *mer, left );

		if ( cov < MAX_KMER_COV )
		{
			set_kmer_left_cov ( *mer, left, cov + 1 );
		}
	}

	if ( right < 4 )
	{
		cov = get_kmer_right_cov ( *mer, right );

		if ( cov < MAX_KMER_COV )
		{
			set_kmer_right_cov ( *mer, right, cov + 1 );
		}
	}
}

static inline void set_new_kmer ( kmer_t * mer, Kmer seq, ubyte left, ubyte right )
{
	*mer = empty_kmer;
	set_kmer_seq ( *mer, seq );

	if ( left < 4 )
	{
		set_kmer_left_cov ( *mer, left, 1 );
	}

	if ( right < 4 )
	{
		set_kmer_right_cov ( *mer, right, 1 );
	}
}

static inline int is_prime_kh ( ubyte8 num )
{
	ubyte8 i, max;

	if ( num < 4 )
	{
		return 1;
	}

	if ( num % 2 == 0 )
	{
		return 0;
	}

	max = ( ubyte8 ) sqrt ( ( double ) num );

	/* the bound is inclusive so that odd perfect squares such as 9 or 25 are rejected */
	for ( i = 3; i <= max; i += 2 )
	{
		if ( num % i == 0 )
		{
			return 0;
		}
	}

	return 1;
}

static inline ubyte8 find_next_prime_kh ( ubyte8 num )
{
	if ( num % 2 == 0 )
	{
		num ++;
	}

	while ( 1 )
	{
		if ( is_prime_kh ( num ) )
		{
			return num;
		}

		num += 2;
	}
}

PUBLIC_FUNC KmerSet * init_kmerset ( ubyte8 init_size, float load_factor )
{
	KmerSet * set;

	if ( init_size < 3 )
	{
		init_size = 3;
	}
	else
	{
		init_size = find_next_prime_kh ( init_size );
	}

	set = ( KmerSet * ) malloc ( sizeof ( KmerSet ) );
	set->size = init_size;
	set->count = 0;

	if ( load_factor <= 0 )
	{
		load_factor = 0.25f;
	}
	else if ( load_factor >= 1 )
	{
		load_factor = 0.75f;
	}

	set->load_factor = load_factor;
	/* computed after clamping, so that an out-of-range load factor cannot yield a bogus capacity */
	set->max = set->size * load_factor;
	set->iter_ptr = 0;
	set->array = calloc ( set->size, sizeof ( kmer_t ) );
	set->flags = malloc ( ( set->size + 15 ) / 16 * 4 );
	memset ( set->flags, 0x55, ( set->size + 15 ) / 16 * 4 );
	return set;
}

/* Linear probing: return the slot holding seq, or the first empty slot on its probe path */
PROTECTED_FUNC static inline ubyte8 get_kmerset ( KmerSet * set, Kmer seq )
{
	ubyte8 hc;
	__uint128_t temp;
	temp = Kmer2int128 ( seq );
	hc = temp % set->size;

	while ( 1 )
	{
		if ( is_kmer_entity_null ( set->flags, hc ) )
		{
			return hc;
		}
		else
		{
			if ( KmerEqual ( get_kmer_seq ( set->array[hc] ), seq ) )
			{
				return hc;
			}
		}

		hc ++;

		if ( hc == set->size )
		{
			hc = 0;
		}
	}

	return set->size;
}

PUBLIC_FUNC int search_kmerset ( KmerSet * set, Kmer seq, kmer_t ** rs )
{
	ubyte8 hc;
	__uint128_t temp;
	temp = Kmer2int128 ( seq );
	hc = temp % set->size;

	while ( 1 )
	{
		if ( is_kmer_entity_null ( set->flags, hc ) )
		{
			return 0;
		}
		else
		{
			if ( KmerEqual ( get_kmer_seq ( set->array[hc] ), seq ) )
			{
				*rs = set->array + hc;
				return 1;
			}
		}

		hc ++;

		if ( hc == set->size )
		{
			hc = 0;
		}
	}

	return 0;
}

PUBLIC_FUNC static inline int exists_kmerset ( KmerSet * set, Kmer seq )
{
	ubyte8 idx;
	idx = get_kmerset ( set, seq );
	return !is_kmer_entity_null ( set->flags, idx );
}

/* Make room for num more entries: grow the table to a larger prime size and rehash in place */
PROTECTED_FUNC static inline void encap_kmerset ( KmerSet * set, ubyte8 num )
{
	ubyte4 * flags, *f;
	ubyte8 i, n, size, hc;
	kmer_t key, tmp;

	if ( set->count + num <= set->max )
	{
		return;
	}

	if ( initKmerSetSize != 0 )
	{
		if ( set->load_factor < 0.88 )
		{
			set->load_factor = 0.88;
			set->max = set->size * set->load_factor;
			return;
		}
		else
		{
			fprintf ( stderr, "-- Static memory pool exploded, please define a larger value. --\n" );
			abort();
		}
	}

	n = set->size;

	do
	{
		if ( n < 0xFFFFFFFU )
		{
			n <<= 1;
		}
		else
		{
			n += 0xFFFFFFU;
		}

		n = find_next_prime_kh ( n );
	}
	while ( n * set->load_factor < set->count + num );

	set->array = realloc ( set->array, n * sizeof ( kmer_t ) );
	//printf("Allocate Mem %lld(%d*%lld*%d)bytes\n",thrd_num*n*sizeof(kmer_t),thrd_num,n,sizeof(kmer_t));
	fflush ( stdout );

	if ( set->array == NULL )
	{
		fprintf ( stderr, "-- Out of memory --\n" );
		abort();
	}

	flags = malloc ( ( n + 15 ) / 16 * 4 );
	memset ( flags, 0x55, ( n + 15 ) / 16 * 4 );
	size = set->size;
	set->size = n;
	set->max = n * set->load_factor;
	f = set->flags;
	set->flags = flags;
	flags = f;
	__uint128_t temp;

	for ( i = 0; i < size; i++ )
	{
		if ( !exists_kmer_entity ( flags, i ) )
		{
			continue;
		}

		key = set->array[i];
		set_kmer_entity_del ( flags, i );

		while ( 1 )
		{
			temp = Kmer2int128 ( get_kmer_seq ( key ) );
			hc = temp % set->size;

			while ( !is_kmer_entity_null ( set->flags, hc ) )
			{
				hc ++;

				if ( hc == set->size )
				{
					hc = 0;
				}
			}

			clear_kmer_entity_null ( set->flags, hc );

			if ( hc < size && exists_kmer_entity ( flags, hc ) )
			{
				tmp = key;
				key = set->array[hc];
				set->array[hc] = tmp;
				set_kmer_entity_del ( flags, hc );
			}
			else
			{
				set->array[hc] = key;
				break;
			}
		}
	}

	free ( flags );
}

/* Insert seq or update its coverage; returns 0 if a new entry was created, 1 if it already existed */
PUBLIC_FUNC int put_kmerset ( KmerSet * set, Kmer seq, ubyte left, ubyte right, kmer_t ** kmer_p )
{
	ubyte8 hc;
	encap_kmerset ( set, 1 );
	__uint128_t temp;
	temp = Kmer2int128 ( seq );
	hc = temp % set->size;

	do
	{
		if ( is_kmer_entity_null ( set->flags, hc ) )
		{
			clear_kmer_entity_null ( set->flags, hc );
			set_new_kmer ( set->array + hc, seq, left, right );
			set->count ++;
			*kmer_p = set->array + hc;
			return 0;
		}
		else
		{
			if ( KmerEqual ( get_kmer_seq ( set->array[hc] ), seq ) )
			{
				update_kmer ( set->array + hc, left, right );
				set->array[hc].single = 0;
				*kmer_p = set->array + hc;
				return 1;
			}
		}

		hc ++;

		if ( hc == set->size )
		{
			hc = 0;
		}
	}
	while ( 1 );

	*kmer_p = NULL;
	return 0;
}

PUBLIC_FUNC byte8 count_kmerset ( KmerSet * set )
{
	return set->count;
}

PUBLIC_FUNC
static inline void reset_iter_kmerset ( KmerSet * set ) { set->iter_ptr = 0; } PUBLIC_FUNC static inline ubyte8 iter_kmerset ( KmerSet * set, kmer_t ** rs ) { while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { *rs = set->array + set->iter_ptr; set->iter_ptr ++; return 1; } set->iter_ptr ++; } return 0; } PUBLIC_FUNC void free_kmerset ( KmerSet * set ) { free ( set->array ); free ( set->flags ); free ( set ); } PUBLIC_FUNC void free_Sets ( KmerSet ** sets, int num ) { int i; for ( i = 0; i < num; i++ ) { free_kmerset ( sets[i] ); } free ( ( void * ) sets ); } int count_branch2prev ( kmer_t * node ) { int num = 0, i; for ( i = 0; i < 4; i++ ) { if ( get_kmer_left_cov ( *node, i ) > 0 ) { num++; } } return num; } int count_branch2next ( kmer_t * node ) { int num = 0, i; for ( i = 0; i < 4; i++ ) { if ( get_kmer_right_cov ( *node, i ) > 0 ) { num++; } } return num; } void dislink2prevUncertain ( kmer_t * node, char ch, boolean smaller ) { if ( smaller ) { set_kmer_left_cov ( *node, ch, 0 ); } else { set_kmer_right_cov ( *node, int_comp ( ch ), 0 ); } } void dislink2nextUncertain ( kmer_t * node, char ch, boolean smaller ) { if ( smaller ) { set_kmer_right_cov ( *node, ch, 0 ); } else { set_kmer_left_cov ( *node, int_comp ( ch ), 0 ); } } SOAPdenovo-V1.05/src/63mer/node2edge.c000644 000765 000024 00000027315 11530651532 017330 0ustar00Aquastaff000000 000000 /* * 63mer/node2edge.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "stack.h" #define KMERPTBLOCKSIZE 1000 //static int loopCounter; static int nodeCounter; static int edge_length_limit = 100000; static int edge_c, edgeCounter; static preEDGE temp_edge; static char edge_seq[100000]; static void make_edge ( FILE * fp ); static void merge_linearV2 ( char bal_edge, STACK * nStack, int count, FILE * fp ); static int check_iden_kmerList ( STACK * stack1, STACK * stack2 ); //for stack static STACK * nodeStack; static STACK * bal_nodeStack; void kmer2edges ( char * outfile ) { FILE * fp; char temp[256]; sprintf ( temp, "%s.edge", outfile ); fp = ckopen ( temp, "w" ); make_edge ( fp ); fclose ( fp ); num_ed = edge_c; } static void stringBeads ( KMER_PT * firstBead, char nextch, int * node_c ) { boolean smaller, found; Kmer tempKmer, bal_word; Kmer word = firstBead->kmer; ubyte8 hash_ban; kmer_t * outgoing_node; int nodeCounter = 1, setPicker; char ch; unsigned char flag; KMER_PT * temp_pt, *prev_pt = firstBead; word = prev_pt->kmer; nodeCounter = 1; word = nextKmer ( word, nextch ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); while ( found && ( outgoing_node->linear ) ) // for every node in this line { nodeCounter++; temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = outgoing_node; temp_pt->isSmaller = smaller; if ( smaller ) { temp_pt->kmer = word; } else { temp_pt->kmer = bal_word; } prev_pt = temp_pt; if ( smaller ) { for ( ch = 0; ch < 4; ch++ ) { flag = get_kmer_right_cov ( *outgoing_node, ch ); if 
( flag ) { break; } } word = nextKmer ( prev_pt->kmer, ch ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); } else { for ( ch = 0; ch < 4; ch++ ) { flag = get_kmer_left_cov ( *outgoing_node, ch ); if ( flag ) { break; } } word = nextKmer ( prev_pt->kmer, int_comp ( ch ) ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); } } if ( outgoing_node ) //this is always true { nodeCounter++; temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = outgoing_node; temp_pt->isSmaller = smaller; if ( smaller ) { temp_pt->kmer = word; } else { temp_pt->kmer = bal_word; } } *node_c = nodeCounter; } //search linear structure starting with the root of a tree static int startEdgeFromNode ( kmer_t * node1, FILE * fp ) { int node_c, palindrome; unsigned char flag; KMER_PT * ite_pt, *temp_pt; Kmer word1, bal_word1; char ch1; if ( node1->linear || node1->deleted ) { return 0; } // ignore floating loop word1 = node1->seq; bal_word1 = reverseComplement ( word1, overlaplen ); // linear structure for ( ch1 = 0; ch1 < 4; ch1++ ) // for every node on outgoing list { flag = get_kmer_right_cov ( *node1, ch1 ); if ( !flag ) { continue; } emptyStack ( nodeStack ); temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = node1; temp_pt->isSmaller = 1; temp_pt->kmer = word1; stringBeads ( temp_pt, ch1, &node_c ); //printf("%d nodes\n",node_c); if ( node_c < 2 ) { printf ( "%d nodes in this line!!!!!!!!!!!\n", node_c ); } else { //make a reverse complement 
node list stackBackup ( nodeStack ); emptyStack ( bal_nodeStack ); while ( ( ite_pt = ( KMER_PT * ) stackPop ( nodeStack ) ) != NULL ) { temp_pt = ( KMER_PT * ) stackPush ( bal_nodeStack ); temp_pt->kmer = reverseComplement ( ite_pt->kmer, overlaplen ); } stackRecover ( nodeStack ); palindrome = check_iden_kmerList ( nodeStack, bal_nodeStack ); stackRecover ( nodeStack ); if ( palindrome ) { merge_linearV2 ( 0, nodeStack, node_c, fp ); } else { merge_linearV2 ( 1, nodeStack, node_c, fp ); } } } //every possible outgoing edges for ( ch1 = 0; ch1 < 4; ch1++ ) // for every node on incoming list { flag = get_kmer_left_cov ( *node1, ch1 ); if ( !flag ) { continue; } emptyStack ( nodeStack ); temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = node1; temp_pt->isSmaller = 0; temp_pt->kmer = bal_word1; stringBeads ( temp_pt, int_comp ( ch1 ), &node_c ); if ( node_c < 2 ) { printf ( "%d nodes in this line!!!!!!!!!!!\n", node_c ); } else { //make a reverse complement node list stackBackup ( nodeStack ); emptyStack ( bal_nodeStack ); while ( ( ite_pt = ( KMER_PT * ) stackPop ( nodeStack ) ) != NULL ) { temp_pt = ( KMER_PT * ) stackPush ( bal_nodeStack ); temp_pt->kmer = reverseComplement ( ite_pt->kmer, overlaplen ); } stackRecover ( nodeStack ); palindrome = check_iden_kmerList ( nodeStack, bal_nodeStack ); stackRecover ( nodeStack ); if ( palindrome ) { merge_linearV2 ( 0, nodeStack, node_c, fp ); //printf("edge is palindrome with length %d\n",temp_edge.length); } else { merge_linearV2 ( 1, nodeStack, node_c, fp ); } } } //every possible incoming edges return 0; } void make_edge ( FILE * fp ) { int i = 0; kmer_t * node1; KmerSet * set; KmerSetsPatch = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); for ( i = 0; i < thrd_num; i++ ) { KmerSetsPatch[i] = init_kmerset ( 1000, K_LOAD_FACTOR ); } nodeStack = ( STACK * ) createStack ( KMERPTBLOCKSIZE, sizeof ( KMER_PT ) ); bal_nodeStack = ( STACK * ) createStack ( KMERPTBLOCKSIZE, sizeof ( KMER_PT ) ); 
edge_c = nodeCounter = 0; edgeCounter = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { node1 = set->array + set->iter_ptr; startEdgeFromNode ( node1, fp ); } set->iter_ptr ++; } } printf ( "%d (%d) edges %d extra nodes\n", edge_c, edgeCounter, nodeCounter ); freeStack ( nodeStack ); freeStack ( bal_nodeStack ); } static void merge_linearV2 ( char bal_edge, STACK * nStack, int count, FILE * fp ) { int length, char_index; preEDGE * newedge; kmer_t * del_node, *longNode; char * tightSeq, firstCh; long long symbol = 0; int len_tSeq; Kmer wordplus, bal_wordplus; ubyte8 hash_ban; KMER_PT * last_np = ( KMER_PT * ) stackPop ( nStack ); KMER_PT * second_last_np = ( KMER_PT * ) stackPop ( nStack ); KMER_PT * first_np, *second_np = NULL; KMER_PT * temp; boolean found, lastOne = 1, single = 1; int setPicker; length = count - 1; len_tSeq = length; if ( len_tSeq >= edge_length_limit ) { tightSeq = ( char * ) ckalloc ( len_tSeq * sizeof ( char ) ); } else { tightSeq = edge_seq; } char_index = length - 1; newedge = &temp_edge; newedge->to_node = last_np->kmer; newedge->length = length; newedge->bal_edge = bal_edge; tightSeq[char_index--] = lastCharInKmer ( last_np->kmer ); firstCh = firstCharInKmer ( second_last_np->kmer ); dislink2prevUncertain ( last_np->node, firstCh, last_np->isSmaller ); stackRecover ( nStack ); while ( nStack->item_c > 1 ) { second_np = ( KMER_PT * ) stackPop ( nStack ); } first_np = ( KMER_PT * ) stackPop ( nStack ); //unlink first node to the second one dislink2nextUncertain ( first_np->node, lastCharInKmer ( second_np->kmer ), first_np->isSmaller ); //printf("from %llx, to %llx\n",first_np->node->seq,last_np->node->seq); //now temp is the last node in line, out_node is the second last node in line newedge->from_node = first_np->kmer; //create a long kmer for edge with length 1 if ( length == 1 ) { nodeCounter++; wordplus = 
KmerPlus ( newedge->from_node, lastCharInKmer ( newedge->to_node ) ); bal_wordplus = reverseComplement ( wordplus, overlaplen + 1 ); /* Kmer temp = KmerPlus(reverseComplement(newedge->to_node,overlaplen), lastCharInKmer(reverseComplement(newedge->from_node,overlaplen))); fprintf(stderr,"(%llx %llx) (%llx %llx) (%llx %llx)\n", wordplus.high,wordplus.low,temp.high,temp.low, bal_wordplus.high,bal_wordplus.low); */ edge_c++; edgeCounter++; if ( KmerSmaller ( wordplus, bal_wordplus ) ) { hash_ban = hash_kmer ( wordplus ); setPicker = hash_ban % thrd_num; found = put_kmerset ( KmerSetsPatch[setPicker], wordplus, 4, 4, &longNode ); if ( found ) { printf ( "longNode %llx %llx already exist\n", wordplus.high, wordplus.low ); } longNode->l_links = edge_c; longNode->twin = ( unsigned char ) ( bal_edge + 1 ); } else { hash_ban = hash_kmer ( bal_wordplus ); setPicker = hash_ban % thrd_num; found = put_kmerset ( KmerSetsPatch[setPicker], bal_wordplus, 4, 4, &longNode ); if ( found ) { printf ( "longNode %llx %llx already exist\n", bal_wordplus.high, bal_wordplus.low ); } longNode->l_links = edge_c + bal_edge; longNode->twin = ( unsigned char ) ( -bal_edge + 1 ); } } else { edge_c++; edgeCounter++; } stackRecover ( nStack ); //mark all the internal nodes temp = ( KMER_PT * ) stackPop ( nStack ); while ( nStack->item_c > 1 ) { temp = ( KMER_PT * ) stackPop ( nStack ); del_node = temp->node; del_node->inEdge = 1; symbol += get_kmer_left_covs ( *del_node ); if ( temp->isSmaller ) { del_node->l_links = edge_c; del_node->twin = ( unsigned char ) ( bal_edge + 1 ); } else { del_node->l_links = edge_c + bal_edge; del_node->twin = ( unsigned char ) ( -bal_edge + 1 ); } tightSeq[char_index--] = lastCharInKmer ( temp->kmer ); } newedge->seq = tightSeq; if ( length > 1 ) { newedge->cvg = symbol / ( length - 1 ) * 10 > MaxEdgeCov ? 
MaxEdgeCov : symbol / ( length - 1 ) * 10; } else { newedge->cvg = 0; } output_1edge ( newedge, fp ); if ( len_tSeq >= edge_length_limit ) { free ( ( void * ) tightSeq ); } edge_c += bal_edge; if ( edge_c % 10000000 == 0 ) { printf ( "--- %d edges built\n", edge_c ); } return; } static int check_iden_kmerList ( STACK * stack1, STACK * stack2 ) { KMER_PT * ite1, *ite2; if ( !stack1->item_c || !stack2->item_c ) // one of them is empty { return 0; } while ( ( ite1 = ( KMER_PT * ) stackPop ( stack1 ) ) != NULL && ( ite2 = ( KMER_PT * ) stackPop ( stack2 ) ) != NULL ) { if ( !KmerEqual ( ite1->kmer, ite2->kmer ) ) { return 0; } } if ( stack1->item_c || stack2->item_c ) // one of them is not empty { return 0; } else { return 1; } } SOAPdenovo-V1.05/src/63mer/orderContig.c000644 000765 000024 00000314476 11530651532 017762 0ustar00Aquastaff000000 000000 /* * 63mer/orderContig.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"
#include "dfibHeap.h"
#include "fibHeap.h"
#include "darray.h"

#define CNBLOCKSIZE 10000
#define MAXC 10000
#define MAXCinBetween 200
#define MaxNodeInSub 10000
#define GapLowerBound (-2000)	// parenthesized so the sign survives macro expansion
#define GapUpperBound 300000

static boolean static_f = 0;
static double OverlapPercent = 0.05;
static double ConflPercent = 0.05;
static int gapCounter;
static int orienCounter;
static int throughCounter;

static DARRAY * solidArray;
static DARRAY * tempArray;
static int solidCounter;

static CTGinHEAP ctg4heapArray[MaxNodeInSub + 1]; // indices into this array are put into heaps, starting from 1
static unsigned int nodesInSub[MaxNodeInSub];
static int nodeDistance[MaxNodeInSub];
static int nodeCounter;

static unsigned int nodesInSubInOrder[MaxNodeInSub];
static int nodeDistanceInOrder[MaxNodeInSub];

static DARRAY * scaf3, *scaf5;
static DARRAY * gap3, *gap5;

static unsigned int downstreamCTG[MAXCinBetween];
static unsigned int upstreamCTG[MAXCinBetween];
static int dsCtgCounter;
static int usCtgCounter;

static CONNECT * checkConnect ( unsigned int from_c, unsigned int to_c );
static int maskPuzzle ( int num_connect, unsigned int contigLen );
static void freezing();
static boolean checkOverlapInBetween ( double tolerance );
static int setConnectDelete ( unsigned int from_c, unsigned int to_c, char flag, boolean cleanBinding );
static int setConnectWP ( unsigned int from_c, unsigned int to_c, char flag );
static void general_linearization ( boolean strict );
static void debugging2 ( unsigned int ctg );	// matches the definition later in this file
static void smallScaf();
static void detectBreakScaf();
static boolean checkSimple ( DARRAY * ctgArray, int count );
static void checkCircle();

//find the only connection involved in connection binding
static CONNECT * getBindCnt ( unsigned int ctg )
{
	CONNECT * ite_cnt;
	CONNECT * bindCnt = NULL;
	CONNECT * temp_cnt = NULL;
	CONNECT * temp3_cnt = NULL;
	int count = 0;
	int count2 = 0;
	int count3 = 0;
	ite_cnt =
		contig_array[ctg].downwardConnect;

	while ( ite_cnt )
	{
		if ( ite_cnt->nextInScaf )
		{
			count++;
			bindCnt = ite_cnt;
		}

		if ( ite_cnt->prevInScaf )
		{
			temp_cnt = ite_cnt;
			count2++;
		}

		if ( ite_cnt->singleInScaf )
		{
			temp3_cnt = ite_cnt;
			count3++;
		}

		ite_cnt = ite_cnt->next;
	}

	if ( count == 1 )
	{
		return bindCnt;
	}

	if ( count == 0 && count2 == 1 )
	{
		return temp_cnt;
	}

	if ( count == 0 && count2 == 0 && count3 == 1 )
	{
		return temp3_cnt;
	}

	return NULL;
}

static void createAnalogousCnt ( unsigned int sourceStart, CONNECT * originCnt, int gap,
                                 unsigned int targetStart, unsigned int targetStop )
{
	CONNECT * temp_cnt;
	unsigned int balTargetStart = getTwinCtg ( targetStart );
	unsigned int balTargetStop = getTwinCtg ( targetStop );
	unsigned int balSourceStart = getTwinCtg ( sourceStart );
	unsigned int balSourceStop = getTwinCtg ( originCnt->contigID );
	originCnt->deleted = 1;
	temp_cnt = getCntBetween ( balSourceStop, balSourceStart );

	// the twin connection may be missing; guard against a NULL dereference
	if ( temp_cnt )
	{
		temp_cnt->deleted = 1;
	}

	if ( gap < GapLowerBound )
	{
		gapCounter++;
		return;
	}

	temp_cnt = add1Connect ( targetStart, targetStop, gap, originCnt->weight, 1 );

	if ( temp_cnt )
	{
		temp_cnt->inherit = 1;
	}

	temp_cnt = add1Connect ( balTargetStop, balTargetStart, gap, originCnt->weight, 1 );

	if ( temp_cnt )
	{
		temp_cnt->inherit = 1;
	}
}

// increase #long_pe_support for a connect by 1
static void add1LongPEcov ( unsigned int fromCtg, unsigned int toCtg, int weight )
{
	//check if they are on the same scaff
	if ( contig_array[fromCtg].from_vt != contig_array[toCtg].from_vt ||
	        contig_array[fromCtg].to_vt != contig_array[toCtg].to_vt )
	{
		printf ( "Warning from add1LongPEcov: contig %d and %d not on the same scaffold\n", fromCtg, toCtg );
		return;
	}

	if ( contig_array[fromCtg].indexInScaf >= contig_array[toCtg].indexInScaf )
	{
		printf ( "Warning from add1LongPEcov: wrong about order between contig %d and %d\n", fromCtg, toCtg );
		return;
	}

	CONNECT * bindCnt;
	unsigned int prevCtg = fromCtg;
	bindCnt = getBindCnt ( fromCtg );

	while ( bindCnt )
	{
		if ( bindCnt->maxGap + weight <=
1000 ) { bindCnt->maxGap += weight; } else { bindCnt->maxGap = 1000; } if ( fromCtg == 0 && toCtg == 0 ) printf ( "link (%d %d ) covered by link (%d %d), wt %d\n", prevCtg, bindCnt->contigID, fromCtg, toCtg, weight ); if ( bindCnt->contigID == toCtg ) { break; } prevCtg = bindCnt->contigID; bindCnt = bindCnt->nextInScaf; } unsigned int bal_fc = getTwinCtg ( fromCtg ); unsigned int bal_tc = getTwinCtg ( toCtg ); bindCnt = getBindCnt ( bal_tc ); prevCtg = bal_tc; while ( bindCnt ) { if ( bindCnt->maxGap + weight <= 1000 ) { bindCnt->maxGap += weight; } else { bindCnt->maxGap = 1000; } if ( fromCtg == 0 && toCtg == 0 ) printf ( "link (%d %d ) covered by link (%d %d), wt %d\n", prevCtg, bindCnt->contigID, fromCtg, toCtg, weight ); if ( bindCnt->contigID == bal_fc ) { return; } prevCtg = bindCnt->contigID; bindCnt = bindCnt->nextInScaf; } printf ( "Warning from add1LongPEcov: not reach the end (%d %d) (B)\n", bal_tc, bal_fc ); } // for long pair ends, move the connections along scaffolds established by shorter pair ends till reach the ends static void downSlide() { int len = 0, gap; unsigned int i; CONNECT * ite_cnt, *bindCnt, *temp_cnt; unsigned int bottomCtg, topCtg, bal_i; unsigned int targetCtg, bal_target; boolean getThrough, orienConflict; int slideLen, slideLen2; orienCounter = throughCounter = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } bal_i = getTwinCtg ( i ); len = slideLen = 0; bottomCtg = i; //find the last unmasked contig in this binding while ( bindCnt->nextInScaf ) { len += bindCnt->gapLen + contig_array[bindCnt->contigID].length; if ( contig_array[bindCnt->contigID].mask == 0 ) { bottomCtg = bindCnt->contigID; slideLen = len; } bindCnt = bindCnt->nextInScaf; } len += bindCnt->gapLen + contig_array[bindCnt->contigID].length; if ( contig_array[bindCnt->contigID].mask == 0 || bottomCtg == 0 ) { bottomCtg = bindCnt->contigID; 
			slideLen = len;
		}

		//check each connection from long pair ends
		ite_cnt = contig_array[i].downwardConnect;

		while ( ite_cnt )
		{
			if ( ite_cnt->deleted || ite_cnt->mask || ite_cnt->singleInScaf
			        || ite_cnt->nextInScaf || ite_cnt->prevInScaf || ite_cnt->inherit )
			{
				ite_cnt = ite_cnt->next;
				continue;
			}

			targetCtg = ite_cnt->contigID;
			//set the twin of the target here so the warnings below never read an uninitialized or stale value
			bal_target = getTwinCtg ( targetCtg );

			if ( contig_array[i].from_vt == contig_array[targetCtg].from_vt ) // on the same scaff
			{
				if ( contig_array[i].indexInScaf > contig_array[targetCtg].indexInScaf )
				{
					orienCounter++;
				}
				else
				{
					throughCounter++;
				}

				setConnectDelete ( i, ite_cnt->contigID, 1, 0 );
				ite_cnt = ite_cnt->next;
				continue;
			}

			//check if this connection conflicts in orientation with the previous scaffold
			temp_cnt = getBindCnt ( targetCtg );
			orienConflict = 0;

			if ( temp_cnt )
			{
				while ( temp_cnt->nextInScaf )
				{
					if ( temp_cnt->contigID == i )
					{
						orienConflict = 1;
						printf ( "Warning from downSlide: still on the same scaff: %d and %d\n"
						         , i, targetCtg );
						printf ( "on scaff %d and %d\n",
						         contig_array[i].from_vt, contig_array[targetCtg].from_vt );
						printf ( "on bal_scaff %d and %d\n",
						         contig_array[bal_target].to_vt, contig_array[bal_i].to_vt );
						break;
					}

					temp_cnt = temp_cnt->nextInScaf;
				}

				if ( temp_cnt->contigID == i )
				{
					orienConflict = 1;
				}
			}

			if ( orienConflict )
			{
				orienCounter++;
				setConnectDelete ( i, ite_cnt->contigID, 1, 0 );
				ite_cnt = ite_cnt->next;
				continue;
			}

			//find the topmost contig along the previous scaffold, starting with the target contig of this connection
			slideLen2 = 0;

			if ( contig_array[targetCtg].mask == 0 )
			{
				topCtg = bal_target;
			}
			else
			{
				topCtg = 0;
			}

			temp_cnt = getBindCnt ( bal_target );
			getThrough = len = 0;

			if ( temp_cnt )
			{
				//find the last contig in this binding
				while ( temp_cnt->nextInScaf )
				{
					//check if this route reaches bal_i
					if ( temp_cnt->contigID == bal_i )
					{
						printf ( "Warning from downSlide: (B) still on the same scaff: %d and %d (%d and %d)\n",
						         i, targetCtg, bal_target, bal_i );
						printf ( "on scaff %d and %d\n",
contig_array[i].from_vt, contig_array[targetCtg].from_vt ); printf ( "on bal_scaff %d and %d\n", contig_array[bal_target].to_vt, contig_array[bal_i].to_vt ); getThrough = 1; break; } len += temp_cnt->gapLen + contig_array[temp_cnt->contigID].length; if ( contig_array[temp_cnt->contigID].mask == 0 ) { topCtg = temp_cnt->contigID; slideLen2 = len; } temp_cnt = temp_cnt->nextInScaf; } len += temp_cnt->gapLen + contig_array[temp_cnt->contigID].length; if ( contig_array[temp_cnt->contigID].mask == 0 || topCtg == 0 ) { topCtg = temp_cnt->contigID; slideLen2 = len; } if ( temp_cnt->contigID == bal_i ) { getThrough = 1; } else { topCtg = getTwinCtg ( topCtg ); } } else { topCtg = targetCtg; } if ( getThrough ) { throughCounter++; setConnectDelete ( i, ite_cnt->contigID, 1, 0 ); ite_cnt = ite_cnt->next; continue; } //add a connection between bottomCtg and topCtg gap = ite_cnt->gapLen - slideLen - slideLen2; if ( bottomCtg != topCtg && ! ( i == bottomCtg && targetCtg == topCtg ) ) { createAnalogousCnt ( i, ite_cnt, gap, bottomCtg, topCtg ); if ( contig_array[bottomCtg].mask || contig_array[topCtg].mask ) { printf ( "downSlide to masked contig\n" ); } } ite_cnt = ite_cnt->next; } //for each connect } // for each contig printf ( "downSliding is done...orienConflict %d, fall inside %d\n", orienCounter, throughCounter ); } static boolean setNextInScaf ( CONNECT * cnt, CONNECT * nextCnt ) { if ( !cnt ) { printf ( "setNextInScaf: empty pointer\n" ); return 0; } if ( !nextCnt ) { cnt->nextInScaf = nextCnt; return 1; } if ( cnt->mask || cnt->deleted ) { printf ( "setNextInScaf: cnt is masked or deleted\n" ); return 0; } if ( nextCnt->deleted || nextCnt->mask ) { printf ( "setNextInScaf: nextCnt is masked or deleted\n" ); return 0; } cnt->nextInScaf = nextCnt; return 1; } static boolean setPrevInScaf ( CONNECT * cnt, boolean flag ) { if ( !cnt ) { printf ( "setPrevInScaf: empty pointer\n" ); return 0; } if ( !flag ) { cnt->prevInScaf = flag; return 1; } if ( cnt->mask || cnt->deleted 
) { printf ( "setPrevInScaf: cnt is masked or deleted\n" ); return 0; } cnt->prevInScaf = flag; return 1; } /* connect A is upstream to B, replace A with C from_c > branch_c - to_c from_c_new */ static void substitueUSinScaf ( CONNECT * origin, unsigned int from_c_new ) { if ( !origin || !origin->nextInScaf ) { return; } unsigned int branch_c, to_c; unsigned int bal_branch_c, bal_to_c; unsigned int bal_from_c_new = getTwinCtg ( from_c_new ); CONNECT * bal_origin, *bal_nextCNT, *prevCNT, *bal_prevCNT; branch_c = origin->contigID; to_c = origin->nextInScaf->contigID; bal_branch_c = getTwinCtg ( branch_c ); bal_to_c = getTwinCtg ( to_c ); prevCNT = checkConnect ( from_c_new, branch_c ); bal_nextCNT = checkConnect ( bal_to_c, bal_branch_c ); if ( !bal_nextCNT ) { printf ( "substitueUSinScaf: no connect between %d and %d\n", bal_to_c, bal_branch_c ); return; } bal_origin = bal_nextCNT->nextInScaf; bal_prevCNT = checkConnect ( bal_branch_c, bal_from_c_new ); setPrevInScaf ( bal_nextCNT->nextInScaf, 0 ); setNextInScaf ( prevCNT, origin->nextInScaf ); setNextInScaf ( bal_nextCNT, bal_prevCNT ); setPrevInScaf ( bal_prevCNT, 1 ); setNextInScaf ( origin, NULL ); setPrevInScaf ( bal_origin, 0 ); } /* connect B is downstream to C, replace B with A to_c from_c - branch_c < to_c_new */ static void substitueDSinScaf ( CONNECT * origin, unsigned int branch_c, unsigned int to_c_new ) { if ( !origin || !origin->prevInScaf ) { return; } unsigned int to_c; unsigned int bal_branch_c, bal_to_c, bal_to_c_new; unsigned int from_c, bal_from_c; CONNECT * bal_origin, *prevCNT, *bal_prevCNT; CONNECT * nextCNT, *bal_nextCNT; to_c = origin->contigID; bal_branch_c = getTwinCtg ( branch_c ); bal_to_c = getTwinCtg ( to_c ); bal_origin = getCntBetween ( bal_to_c, bal_branch_c ); if ( !bal_origin ) { printf ( "substitueDSinScaf: no connect between %d and %d\n", bal_to_c, bal_branch_c ); return; } bal_from_c = bal_origin->nextInScaf->contigID; from_c = getTwinCtg ( bal_from_c ); bal_to_c_new = 
getTwinCtg ( to_c_new ); prevCNT = checkConnect ( from_c, branch_c ); nextCNT = checkConnect ( branch_c, to_c_new ); setNextInScaf ( prevCNT, nextCNT ); setPrevInScaf ( nextCNT, 1 ); bal_nextCNT = checkConnect ( bal_to_c_new, bal_branch_c ); bal_prevCNT = checkConnect ( bal_branch_c, bal_from_c ); setNextInScaf ( bal_nextCNT, bal_prevCNT ); setPrevInScaf ( origin, 0 ); setNextInScaf ( bal_origin, NULL ); } static int validConnect ( unsigned int ctg, CONNECT * preCNT ) { if ( preCNT && preCNT->nextInScaf ) { return 1; } CONNECT * cn_temp; int count = 0; if ( !contig_array[ctg].downwardConnect ) { return count; } cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( !cn_temp->deleted && !cn_temp->mask ) { count++; } cn_temp = cn_temp->next; } return count; } static CONNECT * getNextContig ( unsigned int ctg, CONNECT * preCNT, boolean * exception ) { CONNECT * cn_temp, *retCNT = NULL; int count = 0, valid_in; unsigned int nextCtg, bal_ctg; *exception = 0; if ( preCNT && preCNT->nextInScaf ) { if ( preCNT->contigID != ctg ) { printf ( "pre cnt does not lead to %d\n", ctg ); } nextCtg = preCNT->nextInScaf->contigID; cn_temp = getCntBetween ( ctg, nextCtg ); if ( cn_temp && ( cn_temp->mask || cn_temp->deleted ) ) { printf ( "getNextContig: arc(%d %d) twin (%d %d) with mask %d deleted %d\n" , ctg, nextCtg, getTwinCtg ( nextCtg ), getTwinCtg ( ctg ) , cn_temp->mask, cn_temp->deleted ); if ( !cn_temp->prevInScaf ) { printf ( "not even has a prevInScaf\n" ); } cn_temp = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( ctg ) ); if ( !cn_temp->nextInScaf ) { printf ( "its twin cnt not has a nextInScaf\n" ); } fflush ( stdout ); *exception = 1; } else { return preCNT->nextInScaf; } } bal_ctg = getTwinCtg ( ctg ); valid_in = validConnect ( bal_ctg, NULL ); if ( valid_in > 1 ) { return NULL; } if ( !contig_array[ctg].downwardConnect ) { return NULL; } cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->deleted ) { 
cn_temp = cn_temp->next; continue; } count++; if ( count == 1 ) { retCNT = cn_temp; } else if ( count == 2 ) { return NULL; } cn_temp = cn_temp->next; } return retCNT; } // get the valid connect between 2 given ctgs static CONNECT * checkConnect ( unsigned int from_c, unsigned int to_c ) { CONNECT * cn_temp = getCntBetween ( from_c, to_c ); if ( !cn_temp ) { return NULL; } if ( !cn_temp->mask && !cn_temp->deleted ) { return cn_temp; } return NULL; } static int setConnectMask ( unsigned int from_c, unsigned int to_c, char mask ) { CONNECT * cn_temp, *cn_bal, *cn_ds, *cn_us; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); unsigned int ctg3, bal_ctg3; cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->mask = mask; cn_bal->mask = mask; if ( !mask ) { return 1; } if ( cn_temp->nextInScaf ) //undo the binding { setPrevInScaf ( cn_temp->nextInScaf, 0 ); ctg3 = cn_temp->nextInScaf->contigID; setNextInScaf ( cn_temp, NULL ); bal_ctg3 = getTwinCtg ( ctg3 ); cn_ds = getCntBetween ( bal_ctg3, bal_tc ); setNextInScaf ( cn_ds, NULL ); setPrevInScaf ( cn_bal, 0 ); } // ctg3 -> from_c -> to_c // bal_ctg3 <- bal_fc <- bal_tc if ( cn_bal->nextInScaf ) { setPrevInScaf ( cn_bal->nextInScaf, 0 ); bal_ctg3 = cn_bal->nextInScaf->contigID; setNextInScaf ( cn_bal, NULL ); ctg3 = getTwinCtg ( bal_ctg3 ); cn_us = getCntBetween ( ctg3, from_c ); setNextInScaf ( cn_us, NULL ); setPrevInScaf ( cn_temp, 0 ); } return 1; } static boolean setConnectUsed ( unsigned int from_c, unsigned int to_c, char flag ) { CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->used = flag; cn_bal->used = flag; return 1; } static int setConnectWP ( unsigned int from_c, unsigned int to_c, char flag ) 
{ CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->weakPoint = flag; cn_bal->weakPoint = flag; //fprintf(stderr,"contig %d and %d, weakPoint %d\n",from_c,to_c,cn_temp->weakPoint); //fprintf(stderr,"contig %d and %d, weakPoint %d\n",bal_tc,bal_fc,cn_bal->weakPoint); return 1; } static int setConnectDelete ( unsigned int from_c, unsigned int to_c, char flag, boolean cleanBinding ) { CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->deleted = flag; cn_bal->deleted = flag; if ( !flag ) { return 1; } if ( cleanBinding ) { cn_temp->prevInScaf = 0; cn_temp->nextInScaf = NULL; cn_bal->prevInScaf = 0; cn_bal->nextInScaf = NULL; } return 1; } static void maskContig ( unsigned int ctg, boolean flag ) { unsigned int bal_ctg, ctg2, bal_ctg2; CONNECT * cn_temp; bal_ctg = getTwinCtg ( ctg ); cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->prevInScaf || cn_temp->nextInScaf || cn_temp->singleInScaf ) { cn_temp = cn_temp->next; continue; } ctg2 = cn_temp->contigID; setConnectMask ( ctg, ctg2, flag ); cn_temp = cn_temp->next; } // bal_ctg2 <- bal_ctg cn_temp = contig_array[bal_ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->prevInScaf || cn_temp->nextInScaf || cn_temp->singleInScaf ) { cn_temp = cn_temp->next; continue; } bal_ctg2 = cn_temp->contigID; setConnectMask ( bal_ctg, bal_ctg2, flag ); cn_temp = cn_temp->next; } contig_array[ctg].mask = flag; contig_array[bal_ctg].mask = flag; } static int maskPuzzle ( int num_connect, unsigned int contigLen ) { int in_num, out_num, flag = 0, puzzleCounter = 0; unsigned int 
		i, bal_i;

	for ( i = 1; i <= num_ctg; i++ )
	{
		if ( contigLen && contig_array[i].length > contigLen )
		{
			break;
		}

		if ( contig_array[i].mask )
		{
			continue;
		}

		bal_i = getTwinCtg ( i );
		in_num = validConnect ( bal_i, NULL );
		out_num = validConnect ( i, NULL );

		if ( ( in_num > 1 || out_num > 1 ) && ( in_num + out_num >= num_connect ) )
		{
			flag++;
			maskContig ( i, 1 );
		}

		in_num = validConnect ( bal_i, NULL );
		out_num = validConnect ( i, NULL );

		if ( in_num > 1 || out_num > 1 )
		{
			puzzleCounter++;
			//debugging2(i);
		}

		if ( isSmallerThanTwin ( i ) )
		{
			i++;
		}
	}

	printf ( "Masked %d contigs, %d puzzles left\n", flag, puzzleCounter );
	return flag;
}

static void deleteWeakCnt ( int cut_off )
{
	unsigned int i;
	CONNECT * cn_temp1;
	int weaks = 0, counter = 0;

	for ( i = 1; i <= num_ctg; i++ )
	{
		cn_temp1 = contig_array[i].downwardConnect;

		while ( cn_temp1 )
		{
			if ( !cn_temp1->mask && !cn_temp1->deleted && !cn_temp1->nextInScaf
			        && !cn_temp1->singleInScaf && !cn_temp1->prevInScaf )
			{
				counter++;
			}

			if ( cn_temp1->weak && cn_temp1->deleted && cn_temp1->weight >= cut_off )
			{
				cn_temp1->deleted = 0;
				cn_temp1->weak = 0;
			}
			else if ( !cn_temp1->deleted && cn_temp1->weight > 0 && cn_temp1->weight < cut_off
			          && !cn_temp1->nextInScaf && !cn_temp1->prevInScaf )
			{
				cn_temp1->deleted = 1;
				cn_temp1->weak = 1;

				if ( cn_temp1->singleInScaf )
				{
					cn_temp1->singleInScaf = 0;
				}

				if ( !cn_temp1->mask )
				{
					weaks++;
				}
			}

			cn_temp1 = cn_temp1->next;
		}
	}

	printf ( "%d weak connects removed (there were %d active connects)\n", weaks, counter );
	checkCircle();
}

//check if one contig is linearly connected to the other ->C1->C2...
static int linearC2C ( unsigned int starter, CONNECT * cnt2c1, unsigned int c2, int min_dis, int max_dis ) { int out_num, in_num; CONNECT * prevCNT, *cnt, *cn_temp; unsigned int c1, bal_c1, ctg, bal_c2; int len = 0; unsigned int bal_start = getTwinCtg ( starter ); boolean excep; c1 = cnt2c1->contigID; if ( c1 == c2 ) { printf ( "linearC2C: c1(%d) and c2(%d) are the same contig\n", c1, c2 ); return -1; } bal_c1 = getTwinCtg ( c1 ); in_num = validConnect ( bal_c1, NULL ); if ( in_num > 1 ) { return 0; } dsCtgCounter = 1; usCtgCounter = 0; downstreamCTG[dsCtgCounter++] = c1; bal_c2 = getTwinCtg ( c2 ); upstreamCTG[usCtgCounter++] = bal_c2; // check if c1 is linearly connected to c2 by pe connections cnt = prevCNT = cnt2c1; while ( ( cnt = getNextContig ( c1, prevCNT, &excep ) ) != NULL ) { c1 = cnt->contigID; len += cnt->gapLen + contig_array[c1].length; if ( c1 == c2 ) { return 1; } if ( len > max_dis || c1 == starter || c1 == bal_start ) { return 0; } downstreamCTG[dsCtgCounter++] = c1; if ( dsCtgCounter >= MAXCinBetween ) { printf ( "%d downstream contigs, start at %d, max_dis %d, current dis %d\n" , dsCtgCounter, starter, max_dis, len ); return 0; } prevCNT = cnt; } out_num = validConnect ( c1, NULL ); if ( out_num ) { return 0; } //find the most upstream contig to c2 cnt = prevCNT = NULL; ctg = bal_c2; while ( ( cnt = getNextContig ( ctg, prevCNT, &excep ) ) != NULL ) { ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; if ( len > max_dis || ctg == starter || ctg == bal_start ) { return 0; } prevCNT = cnt; upstreamCTG[usCtgCounter++] = ctg; if ( usCtgCounter >= MAXCinBetween ) { printf ( "%d upstream contigs, start at %d, max_dis %d, current dis %d\n" , usCtgCounter, starter, max_dis, len ); return 0; } } if ( dsCtgCounter + usCtgCounter > MAXCinBetween ) { printf ( "%d downstream and %d upstream contigs\n", dsCtgCounter, usCtgCounter ); return 0; } out_num = validConnect ( ctg, NULL ); if ( out_num ) { return 0; } c2 = getTwinCtg ( ctg ); 
min_dis -= len; max_dis -= len; if ( c1 == c2 || c1 == ctg || max_dis < 0 ) { return 0; } cn_temp = getCntBetween ( c1, c2 ); if ( cn_temp ) { setConnectMask ( c1, c2, 0 ); setConnectDelete ( c1, c2, 0, 0 ); return 1; } len = ( min_dis + max_dis ) / 2 >= 0 ? ( min_dis + max_dis ) / 2 : 0; cn_temp = allocateCN ( c2, len ); if ( cntLookupTable ) { putCnt2LookupTable ( c1, cn_temp ); } cn_temp->weight = 0; // special connect from the original graph cn_temp->next = contig_array[c1].downwardConnect; contig_array[c1].downwardConnect = cn_temp; bal_c1 = getTwinCtg ( c1 ); bal_c2 = getTwinCtg ( c2 ); cn_temp = allocateCN ( bal_c1, len ); if ( cntLookupTable ) { putCnt2LookupTable ( bal_c2, cn_temp ); } cn_temp->weight = 0; // special connect from the original graph cn_temp->next = contig_array[bal_c2].downwardConnect; contig_array[bal_c2].downwardConnect = cn_temp; return 1; } //catenate upstream contig array and downstream contig array to solidArray static void catUsDsContig() { int i; for ( i = 0; i < dsCtgCounter; i++ ) { * ( unsigned int * ) darrayPut ( solidArray, i ) = downstreamCTG[i]; } for ( i = usCtgCounter - 1; i >= 0; i-- ) { * ( unsigned int * ) darrayPut ( solidArray, dsCtgCounter++ ) = getTwinCtg ( upstreamCTG[i] ); } solidCounter = dsCtgCounter; } //binding the connections between contigs in solidArray static void consolidate() { int i, j; CONNECT * prevCNT = NULL; CONNECT * cnt; unsigned int to_ctg; unsigned int from_ctg = * ( unsigned int * ) darrayGet ( solidArray, 0 ); for ( i = 1; i < solidCounter; i++ ) { to_ctg = * ( unsigned int * ) darrayGet ( solidArray, i ); cnt = checkConnect ( from_ctg, to_ctg ); if ( !cnt ) { printf ( "consolidate A: no connect from %d to %d\n", from_ctg, to_ctg ); for ( j = 0; j < solidCounter; j++ ) { printf ( "%d-->", * ( unsigned int * ) darrayGet ( solidArray, j ) ); } printf ( "\n" ); return; } cnt->singleInScaf = solidCounter == 2 ? 
1 : 0; if ( prevCNT ) { setNextInScaf ( prevCNT, cnt ); setPrevInScaf ( cnt, 1 ); } prevCNT = cnt; from_ctg = to_ctg; } //the reverse complementary path from_ctg = getTwinCtg ( * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 1 ) ); prevCNT = NULL; for ( i = solidCounter - 2; i >= 0; i-- ) { to_ctg = getTwinCtg ( * ( unsigned int * ) darrayGet ( solidArray, i ) ); cnt = checkConnect ( from_ctg, to_ctg ); if ( !cnt ) { printf ( "consolidate B: no connect from %d to %d\n", from_ctg, to_ctg ); return; } cnt->singleInScaf = solidCounter == 2 ? 1 : 0; if ( prevCNT ) { setNextInScaf ( prevCNT, cnt ); setPrevInScaf ( cnt, 1 ); } prevCNT = cnt; from_ctg = to_ctg; } } static void debugging1 ( unsigned int ctg1, unsigned int ctg2 ) { CONNECT * cn1; cn1 = getCntBetween ( ctg1, ctg2 ); if ( cn1 ) { printf ( "(%d,%d) mask %d deleted %d w %d,singleInScaf %d\n", ctg1, ctg2, cn1->mask, cn1->deleted, cn1->weight, cn1->singleInScaf ); if ( cn1->nextInScaf ) { printf ( "%d->%d->%d\n", ctg1, ctg2, cn1->nextInScaf->contigID ); } if ( cn1->prevInScaf ) { printf ( "*->%d->%d\n", ctg1, ctg2 ); } else if ( !cn1->nextInScaf ) { printf ( "NULL->%d->%d->NULL\n", ctg1, ctg2 ); } } else { printf ( "%d -X- %d\n", ctg1, ctg2 ); } } //remove transitive connections which cross linear paths (these paths may be broken) //if a->b->c and a->c, mask a->c static void removeTransitive() { unsigned int i, bal_ctg; int flag = 1, out_num, in_num, count, min, max, linear; CONNECT * cn_temp, *cn1 = NULL, *cn2 = NULL; while ( flag ) { flag = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } out_num = validConnect ( i, NULL ); if ( out_num != 2 ) { continue; } cn_temp = contig_array[i].downwardConnect; count = 0; while ( cn_temp ) { if ( cn_temp->deleted || cn_temp->mask ) { cn_temp = cn_temp->next; continue; } count++; if ( count == 1 ) { cn1 = cn_temp; } else if ( count == 2 ) { cn2 = cn_temp; } else // count > 2 { break; } cn_temp = cn_temp->next; } if ( count > 2 ) { 
printf ( "%d valid connections from ctg %d\n", count, i ); continue; } if ( cn1->gapLen > cn2->gapLen ) { cn_temp = cn1; cn1 = cn2; cn2 = cn_temp; } //make sure cn1 is closer to contig i than cn2 if ( cn1->prevInScaf && cn2->prevInScaf ) { continue; } bal_ctg = getTwinCtg ( cn2->contigID ); in_num = validConnect ( bal_ctg, NULL ); if ( in_num > 2 ) { continue; } min = cn2->gapLen - cn1->gapLen - contig_array[cn1->contigID].length - ins_size_var / 2; max = cn2->gapLen - cn1->gapLen - contig_array[cn1->contigID].length + ins_size_var / 2; if ( max < 0 ) { continue; } //temprarily delete cn2 setConnectDelete ( i, cn2->contigID, 1, 0 ); linear = linearC2C ( i, cn1, cn2->contigID, min, max ); if ( linear != 1 ) { setConnectDelete ( i, cn2->contigID, 0, 0 ); continue; } else { downstreamCTG[0] = i; catUsDsContig(); if ( !checkSimple ( solidArray, solidCounter ) ) { continue; } cn1 = getCntBetween ( * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 2 ), * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 1 ) ); if ( cn1 && cn1->nextInScaf && cn2->nextInScaf ) { setConnectDelete ( i, cn2->contigID, 0, 0 ); continue; } consolidate(); if ( cn2->prevInScaf ) substitueDSinScaf ( cn2, * ( unsigned int * ) darrayGet ( solidArray, 0 ), * ( unsigned int * ) darrayGet ( solidArray, 1 ) ); if ( cn2->nextInScaf ) { substitueUSinScaf ( cn2, * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 2 ) ); } flag++; } } //for each contig printf ( "a remove transitive lag, %d connections removed\n", flag ); } } //get repeat contigs back into the scaffold according to connected unique contigs on both sides /* A ------ D > [i] < B E */ static void debugging2 ( unsigned int ctg ) { CONNECT * cn1 = contig_array[ctg].downwardConnect; while ( cn1 ) { if ( cn1->nextInScaf ) { fprintf ( stderr, "with nextInScaf," ); } if ( cn1->prevInScaf ) { fprintf ( stderr, "with prevInScaf," ); } fprintf ( stderr, "%u >> %d, mask %d deleted %d, inherit %d, singleInScaf %d\n", ctg, 
		          cn1->contigID, cn1->mask, cn1->deleted, cn1->inherit, cn1->singleInScaf );
		cn1 = cn1->next;
	}
}

static void debugging()
{
	/*
	debugging1(1777,1468);
	debugging2(8065);
	debugging2(8066);
	*/
}

static void simplifyCnt()
{
	removeTransitive();
	debugging();
	general_linearization ( 1 );
	debugging();
}

static int getIndexInArray ( unsigned int node )
{
	int index;

	for ( index = 0; index < nodeCounter; index++ )
	{
		if ( nodesInSub[index] == node )
		{
			return index;
		}
	}

	return -1;
}

static boolean putNodeIntoSubgraph ( FibHeap * heap, int distance, unsigned int node, int index )
{
	int pos = getIndexInArray ( node );

	if ( pos > 0 )
	{
		//printf("exists\n");
		return 0;
	}

	if ( index >= MaxNodeInSub )
	{
		return -1;
	}

	insertNodeIntoHeap ( heap, distance, node );
	nodesInSub[index] = node;
	nodeDistance[index] = distance;
	return 1;
}

static boolean putChainIntoSubgraph ( FibHeap * heap, int distance, unsigned int node, int * index, CONNECT * prevC )
{
	unsigned int ctg = node;
	CONNECT * nextCnt;
	boolean excep, flag;
	int counter = *index;

	while ( 1 )
	{
		nextCnt = getNextContig ( ctg, prevC, &excep );

		if ( excep || !nextCnt )
		{
			*index = counter;
			return 1;
		}

		ctg = nextCnt->contigID;
		//advance by the gap plus the contig's length (not its ID), as checkUnique does
		distance += nextCnt->gapLen + contig_array[ctg].length;
		flag = putNodeIntoSubgraph ( heap, distance, ctg, counter );

		if ( flag < 0 )
		{
			return 0;
		}

		if ( flag > 0 )
		{
			counter++;
		}

		prevC = nextCnt;
	}
}

// check if a contig is unique by trying to line its downstream/upstream nodes together
static boolean checkUnique ( unsigned int node, double tolerance )
{
	CONNECT * ite_cnt;
	unsigned int currNode;
	int distance;
	int popCounter = 0;
	boolean flag;
	currNode = node;
	FibHeap * heap = newFibHeap();
	putNodeIntoSubgraph ( heap, 0, currNode, 0 );
	nodeCounter = 1;
	ite_cnt = contig_array[currNode].downwardConnect;

	while ( ite_cnt )
	{
		if ( ite_cnt->deleted || ite_cnt->mask )
		{
			ite_cnt = ite_cnt->next;
			continue;
		}

		currNode = ite_cnt->contigID;
		distance = ite_cnt->gapLen + contig_array[currNode].length;
		flag = putNodeIntoSubgraph ( heap, distance, currNode,
nodeCounter ); if ( flag < 0 ) { destroyHeap ( heap ); return 0; } if ( flag > 0 ) { nodeCounter++; } flag = putChainIntoSubgraph ( heap, distance, currNode, &nodeCounter, ite_cnt ); if ( !flag ) { destroyHeap ( heap ); return 0; } ite_cnt = ite_cnt->next; } if ( nodeCounter <= 2 ) // no more than 2 valid connections { destroyHeap ( heap ); return 1; } while ( ( currNode = removeNextNodeFromHeap ( heap ) ) != 0 ) { nodesInSubInOrder[popCounter++] = currNode; } destroyHeap ( heap ); flag = checkOverlapInBetween ( tolerance ); return flag; } //mask contigs with downstream and/or upstream can not be lined static void maskRepeat() { int in_num, out_num, flagA, flagB; int counter = 0; int puzzleCounter = 0; unsigned int i, bal_i; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } bal_i = getTwinCtg ( i ); in_num = validConnect ( bal_i, NULL ); out_num = validConnect ( i, NULL ); if ( in_num > 1 || out_num > 1 ) { puzzleCounter++; } else { if ( isSmallerThanTwin ( i ) ) { i++; } continue; } if ( contig_array[i].cvg > 2 * cvgAvg ) { counter++; maskContig ( i, 1 ); //printf("thick mask contig %d and %d\n",i,bal_i); if ( isSmallerThanTwin ( i ) ) { i++; } continue; } if ( in_num > 1 ) { flagA = checkUnique ( bal_i, OverlapPercent ); } else { flagA = 1; } if ( out_num > 1 ) { flagB = checkUnique ( i, OverlapPercent ); } else { flagB = 1; } if ( !flagA || !flagB ) { counter++; maskContig ( i, 1 ); } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "maskRepeat: %d contigs masked from %d puzzles\n", counter, puzzleCounter ); } static void ordering ( boolean deWeak, boolean downS, boolean nonlinear, char * infile ) { debugging(); if ( downS ) { downSlide(); debugging(); if ( deWeak ) { deleteWeakCnt ( weakPE ); } } else { if ( deWeak ) { deleteWeakCnt ( weakPE ); } } //output_scaf(infile); debugging(); printf ( "variance for insert size %d\n", ins_size_var ); simplifyCnt(); debugging(); maskRepeat(); debugging(); simplifyCnt(); if ( nonlinear ) { 
printf ( "non-strict linearization\n" ); general_linearization ( 0 ); //linearization(0,0); } maskPuzzle ( 2, 0 ); debugging(); freezing(); debugging(); } //check if contigs next to each other have reasonable overlap boolean checkOverlapInBetween ( double tolerance ) { int i, gap; int index; unsigned int node; int lenSum, lenOlp; lenSum = lenOlp = 0; for ( i = 0; i < nodeCounter; i++ ) { node = nodesInSubInOrder[i]; lenSum += contig_array[node].length; index = getIndexInArray ( node ); nodeDistanceInOrder[i] = nodeDistance[index]; } if ( lenSum < 1 ) { return 1; } for ( i = 0; i < nodeCounter - 1; i++ ) { gap = nodeDistanceInOrder[i + 1] - nodeDistanceInOrder[i] - contig_array[nodesInSubInOrder[i + 1]].length; if ( -gap > 0 ) { lenOlp += -gap; } //if(-gap>ins_size_var) if ( ( double ) lenOlp / lenSum > tolerance ) { return 0; } } return 1; } /********* the following codes are for freezing current scaffolds ****************/ //set connections between contigs in a array to used or not //meanwhile set mask to the opposite value static boolean setUsed ( unsigned int start, unsigned int * array, int max_steps, boolean flag ) { unsigned int prevCtg = start; unsigned int twinA, twinB; int j; CONNECT * cnt; boolean usedFlag = 0; // save 'used' to 'checking' prevCtg = start; for ( j = 0; j < max_steps; j++ ) { if ( array[j] == 0 ) { break; } cnt = getCntBetween ( prevCtg, array[j] ); if ( !cnt ) { printf ( "setUsed: no connect between %d and %d\n", prevCtg, array[j] ); prevCtg = array[j]; continue; } if ( cnt->used == flag || cnt->nextInScaf || cnt->prevInScaf || cnt->singleInScaf ) { return 1; } cnt->checking = cnt->used; twinA = getTwinCtg ( prevCtg ); twinB = getTwinCtg ( array[j] ); cnt = getCntBetween ( twinB, twinA ); if ( cnt ) { cnt->checking = cnt->used; } prevCtg = array[j]; } // set used to flag prevCtg = start; for ( j = 0; j < max_steps; j++ ) { if ( array[j] == 0 ) { break; } cnt = getCntBetween ( prevCtg, array[j] ); if ( !cnt ) { prevCtg = array[j]; 
            continue;
        }

        if ( cnt->used == flag ) {
            usedFlag = 1;
            break;
        }

        cnt->used = flag;
        twinA = getTwinCtg ( prevCtg );
        twinB = getTwinCtg ( array[j] );
        cnt = getCntBetween ( twinB, twinA );

        if ( cnt ) {
            cnt->used = flag;
        }

        prevCtg = array[j];
    }

    // set mask to 'NOT flag' or set used to original value
    prevCtg = start;

    for ( j = 0; j < max_steps; j++ ) {
        if ( array[j] == 0 ) {
            break;
        }

        cnt = getCntBetween ( prevCtg, array[j] );

        if ( !cnt ) {
            prevCtg = array[j];
            continue;
        }

        if ( !usedFlag ) {
            cnt->mask = 1 - flag;
        } else {
            cnt->used = cnt->checking;
        }

        twinA = getTwinCtg ( prevCtg );
        twinB = getTwinCtg ( array[j] );
        cnt = getCntBetween ( twinB, twinA );

        // the twin connection may not exist; the earlier loops guard against this too
        if ( cnt ) {
            cnt->used = 1 - flag;

            if ( !usedFlag ) {
                cnt->mask = 1 - flag;
            } else {
                cnt->used = cnt->checking;
            }
        }

        prevCtg = array[j];
    }

    return usedFlag;
}

// break down scaffolds poorly supported by longer PE
static void recoverMask()
{
    unsigned int i, ctg, bal_ctg, start, finish;
    int num3, num5, j, t;
    CONNECT * bindCnt, *cnt;
    int min, max, max_steps = 5, num_route, length;
    int tempCounter, recoverCounter = 0;
    boolean multiUSE, change;

    for ( i = 1; i <= num_ctg; i++ ) {
        contig_array[i].flag = 0;
    }

    so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) );
    found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) );

    for ( j = 0; j < max_n_routes; j++ ) {
        found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) );
    }

    for ( i = 1; i <= num_ctg; i++ ) {
        if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) {
            continue;
        }

        bindCnt = getBindCnt ( i );

        if ( !bindCnt ) {
            continue;
        }

        // first scan: get the average coverage by longer PE
        num5 = num3 = 0;
        ctg = i;
        * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i;
        contig_array[i].flag = 1;
        contig_array[getTwinCtg ( i )].flag = 1;

        while ( bindCnt ) {
            if ( bindCnt->used ) {
                break;
            }

            setConnectUsed ( ctg, bindCnt->contigID, 1 );
            ctg = bindCnt->contigID;
            * ( unsigned int * ) darrayPut ( scaf5, num5++
) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( i ); bindCnt = getBindCnt ( ctg ); while ( bindCnt ) { if ( bindCnt->used ) { break; } setConnectUsed ( ctg, bindCnt->contigID, 1 ); ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; bindCnt = bindCnt->nextInScaf; } if ( num5 + num3 < 2 ) { continue; } tempCounter = solidCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); change = 0; for ( t = 0; t < tempCounter - 1; t++ ) { * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( tempArray, t ); start = * ( unsigned int * ) darrayGet ( tempArray, t ); finish = * ( unsigned int * ) darrayGet ( tempArray, t + 1 ); num_route = num_trace = 0; cnt = checkConnect ( start, finish ); if ( !cnt ) { printf ( "Warning from recoverMask: no connection (%d %d), start at %d\n", start, finish, i ); cnt = getCntBetween ( start, finish ); if ( cnt ) { debugging1 ( start, finish ); } continue; } length = cnt->gapLen + contig_array[finish].length; min = length - 1.5 * ins_size_var; max = length + 1.5 * ins_size_var; traceAlongMaskedCnt ( finish, start, max_steps, min, max, 0, 0, &num_route ); if ( finish == start ) { for ( j = 0; j < tempCounter; j++ ) { printf ( "->%d", * ( unsigned int * ) darrayGet ( tempArray, j ) ); } printf ( ": start at %d\n", i ); } if ( num_route == 1 ) { for ( j = 0; j < max_steps; j++ ) if ( found_routes[0][j] == 0 ) { break; } if ( j < 1 ) { continue; } //check if connects have been used more than once multiUSE = setUsed ( start, found_routes[0], max_steps, 1 ); 
if ( multiUSE ) { continue; } for ( j = 0; j < max_steps; j++ ) { if ( j + 1 == max_steps || found_routes[0][j + 1] == 0 ) { break; } * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = found_routes[0][j]; contig_array[found_routes[0][j]].flag = 1; contig_array[getTwinCtg ( found_routes[0][j] )].flag = 1; } recoverCounter += j; setConnectDelete ( start, finish, 1, 1 ); change = 1; } //end if num_route=1 } // for each gap * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( tempArray, tempCounter - 1 ); if ( change ) { consolidate(); } } printf ( "%d contigs recovered\n", recoverCounter ); fflush ( stdout ); for ( i = 1; i <= num_ctg; i++ ) { cnt = contig_array[i].downwardConnect; while ( cnt ) { cnt->used = 0; cnt->checking = 0; cnt = cnt->next; } } for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } // A -> B -> C -> D un-bind link B->C to link A->B and B->C // A' <- B' <- C' <- D' static void unBindLink ( unsigned int CB, unsigned int CC ) { //fprintf(stderr,"Unbind link (%d %d) to others...\n",CB,CC); CONNECT * cnt1 = getCntBetween ( CB, CC ); if ( !cnt1 ) { return; } if ( cnt1->singleInScaf ) { cnt1->singleInScaf = 0; } CONNECT * cnt2 = getCntBetween ( getTwinCtg ( CC ), getTwinCtg ( CB ) ); if ( !cnt2 ) { return; } if ( cnt2->singleInScaf ) { cnt2->singleInScaf = 0; } if ( cnt1->nextInScaf ) { unsigned int CD = cnt1->nextInScaf->contigID; cnt1->nextInScaf->prevInScaf = 0; cnt1->nextInScaf = NULL; CONNECT * cnt3 = getCntBetween ( getTwinCtg ( CD ), getTwinCtg ( CC ) ); if ( cnt3 ) { cnt3->nextInScaf = NULL; } cnt2->prevInScaf = 0; } if ( cnt2->nextInScaf ) { unsigned int bal_CA = cnt2->nextInScaf->contigID; cnt2->nextInScaf->prevInScaf = 0; cnt2->nextInScaf = NULL; CONNECT * cnt4 = getCntBetween ( getTwinCtg ( bal_CA ), CB ); if ( cnt4 ) { cnt4->nextInScaf = NULL; } cnt1->prevInScaf = 0; } } static void freezing() { 
int num5, num3; unsigned int ctg, bal_ctg; unsigned int i; int j, t; CONNECT * cnt, *prevCNT, *nextCnt; boolean excep; for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; contig_array[i].from_vt = 0; contig_array[i].to_vt = 0; cnt = contig_array[i].downwardConnect; while ( cnt ) { cnt->used = 0; cnt->checking = 0; cnt->singleInScaf = 0; cnt = cnt->next; } } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask ) { continue; } if ( !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { if ( contig_array[cnt->contigID].flag ) { unBindLink ( ctg, cnt->contigID ); break; } nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; } ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { if ( contig_array[cnt->contigID].flag ) { unBindLink ( ctg, cnt->contigID ); break; } nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } if ( num5 + num3 < 2 ) { continue; } solidCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) 
* ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); unsigned int firstCtg = 0; unsigned int lastCtg = 0; unsigned int firstTwin = 0; unsigned int lastTwin = 0; for ( t = 0; t < solidCounter; t++ ) if ( !contig_array[* ( unsigned int * ) darrayGet ( solidArray, t )].mask ) { firstCtg = * ( unsigned int * ) darrayGet ( solidArray, t ); break; } for ( t = solidCounter - 1; t >= 0; t-- ) if ( !contig_array[* ( unsigned int * ) darrayGet ( solidArray, t )].mask ) { lastCtg = * ( unsigned int * ) darrayGet ( solidArray, t ); break; } if ( firstCtg == 0 || lastCtg == 0 ) { printf ( "scaffold start at %d, stop at %d, freezing began with %d\n", firstCtg, lastCtg, i ); for ( j = 0; j < solidCounter; j++ ) printf ( "->%d(%d %d)", * ( unsigned int * ) darrayGet ( solidArray, j ) , contig_array[* ( unsigned int * ) darrayGet ( solidArray, j )].mask , contig_array[* ( unsigned int * ) darrayGet ( solidArray, j )].flag ); printf ( "\n" ); } else { firstTwin = getTwinCtg ( firstCtg ); lastTwin = getTwinCtg ( lastCtg ); } for ( t = 0; t < solidCounter; t++ ) { unsigned int ctg = * ( unsigned int * ) darrayGet ( solidArray, t ); if ( contig_array[ctg].from_vt > 0 ) { contig_array[ctg].mask = 1; contig_array[getTwinCtg ( ctg )].mask = 1; printf ( "Repeat: contig %d (%d) appears more than once\n", ctg, getTwinCtg ( ctg ) ); } else { contig_array[ctg].from_vt = firstCtg; contig_array[ctg].to_vt = lastCtg; contig_array[ctg].indexInScaf = t + 1; contig_array[getTwinCtg ( ctg )].from_vt = lastTwin; contig_array[getTwinCtg ( ctg )].to_vt = firstTwin; contig_array[getTwinCtg ( ctg )].indexInScaf = solidCounter - t; } } consolidate(); } printf ( "Freezing is done....\n" ); fflush ( stdout ); for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag ) { contig_array[i].flag = 0; } if ( contig_array[i].from_vt == 0 ) { contig_array[i].from_vt = i; contig_array[i].to_vt = i; } cnt = contig_array[i].downwardConnect; while ( cnt ) { 
cnt->used = 0; cnt->checking = 0; cnt = cnt->next; } } } /************** codes below this line are for pulling the scaffolds out ************/ void output1gap ( FILE * fo, int max_steps ) { int i, len, seg; len = seg = 0; for ( i = 0; i < max_steps - 1; i++ ) { if ( found_routes[0][i + 1] == 0 ) { break; } len += contig_array[found_routes[0][i]].length; seg++; } fprintf ( fo, "GAP %d %d", len, seg ); for ( i = 0; i < max_steps - 1; i++ ) { if ( found_routes[0][i + 1] == 0 ) { break; } fprintf ( fo, " %d", found_routes[0][i] ); } fprintf ( fo, "\n" ); } static int weakCounter; static boolean printCnts ( FILE * fp, unsigned int ctg ) { CONNECT * cnt = contig_array[ctg].downwardConnect; boolean flag = 0, ret = 0; unsigned int bal_ctg = getTwinCtg ( ctg ); unsigned int linkCtg; if ( isSameAsTwin ( ctg ) ) { return ret; } CONNECT * bindCnt = getBindCnt ( ctg ); if ( bindCnt && bindCnt->bySmall && bindCnt->weakPoint ) { weakCounter++; fprintf ( fp, "\tWP" ); ret = 1; } while ( cnt ) { if ( cnt->weight && !cnt->inherit ) { if ( !flag ) { flag = 1; fprintf ( fp, "\t#DOWN " ); } linkCtg = cnt->contigID; if ( isLargerThanTwin ( linkCtg ) ) { linkCtg = getTwinCtg ( linkCtg ); } fprintf ( fp, "%d:%d:%d ", index_array[linkCtg], cnt->weight, cnt->gapLen ); } cnt = cnt->next; } flag = 0; cnt = contig_array[bal_ctg].downwardConnect; while ( cnt ) { if ( cnt->weight && !cnt->inherit ) { if ( !flag ) { flag = 1; fprintf ( fp, "\t#UP " ); } linkCtg = cnt->contigID; if ( isLargerThanTwin ( linkCtg ) ) { linkCtg = getTwinCtg ( linkCtg ); } fprintf ( fp, "%d:%d:%d ", index_array[linkCtg], cnt->weight, cnt->gapLen ); } cnt = cnt->next; } fprintf ( fp, "\n" ); return ret; } void scaffolding ( unsigned int len_cut, char * outfile ) { unsigned int prev_ctg, ctg, bal_ctg, *length_array, count = 0, num_lctg = 0; unsigned int i, max_steps = 5; int num5, num3, j, len, flag, num_route, gap_c = 0; short gap = 0; long long sum = 0, N50, N90; FILE * fp, *fo = NULL; char name[256]; CONNECT * cnt, 
*prevCNT, *nextCnt; boolean excep, weak; weakCounter = 0; so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) ); found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) ); for ( j = 0; j < max_n_routes; j++ ) { found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) ); } length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with original index: index_array[i] orig2new = 0; sprintf ( name, "%s.scaf", outfile ); fp = ckopen ( name, "w" ); sprintf ( name, "%s.scaf_gap", outfile ); fo = ckopen ( name, "w" ); scaf3 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen >= len_cut ) { num_lctg++; } else { continue; } if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; //printf("%d",i); * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; bal_ctg = getTwinCtg ( ctg ); contig_array[bal_ctg].flag = 1; len = contig_array[i].length; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception --- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { 
break; } setConnectUsed ( ctg, cnt->contigID, 1 ); * ( int * ) darrayPut ( gap5, num5 - 1 ) = cnt->gapLen; ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; //printf("->%d",ctg); } //printf("\n"); ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } //printf("%d",i); //fflush(stdout); cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception -- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; //printf("<-%d",bal_ctg); //fflush(stdout); * ( int * ) darrayPut ( gap3, num3 ) = cnt->gapLen; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } //printf("\n"); len += overlaplen; sum += len; length_array[count++] = len; if ( num5 + num3 < 1 ) { printf ( "no scaffold created for contig %d\n", i ); continue; } fprintf ( fp, ">scaffold%d %d %d\n", count, num5 + num3, len ); fprintf ( fo, ">scaffold%d %d %d\n", count, num5 + num3, len ); len = prev_ctg = 0; for ( j = num3 - 1; j >= 0; j-- ) { if ( !isLargerThanTwin ( * ( unsigned int * ) darrayGet ( scaf3, j ) ) ) { fprintf ( fp, "%-10d %-10d + %d " , index_array[* ( unsigned int * ) darrayGet ( scaf3, j )], len, contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf3, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } else { fprintf ( fp, 
"%-10d %-10d - %d " , index_array[getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf3, j ) )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf3, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf3, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { output1gap ( fo, max_steps ); gap_c++; } } fprintf ( fo, "%-10d %-10d\n", * ( unsigned int * ) darrayGet ( scaf3, j ), len ); len += contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + * ( int * ) darrayGet ( gap3, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf3, j ); gap = * ( int * ) darrayGet ( gap3, j ) > 0 ? * ( int * ) darrayGet ( gap3, j ) : 0; } for ( j = 0; j < num5; j++ ) { if ( !isLargerThanTwin ( * ( unsigned int * ) darrayGet ( scaf5, j ) ) ) { fprintf ( fp, "%-10d %-10d + %d " , index_array[* ( unsigned int * ) darrayGet ( scaf5, j )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf5, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } else { fprintf ( fp, "%-10d %-10d - %d " , index_array[getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, j ) )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf5, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf5, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { output1gap ( fo, max_steps ); gap_c++; } } fprintf ( fo, "%-10d %-10d\n", * ( unsigned int * ) darrayGet ( scaf5, j ), len ); if ( j < num5 - 1 ) { 
len += contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + * ( int * ) darrayGet ( gap5, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf5, j ); gap = * ( int * ) darrayGet ( gap5, j ) > 0 ? * ( int * ) darrayGet ( gap5, j ) : 0; } } } freeDarray ( scaf3 ); freeDarray ( scaf5 ); freeDarray ( gap3 ); freeDarray ( gap5 ); fclose ( fp ); fclose ( fo ); printf ( "\nthe final rank" ); printf ( "\n%d scaffolds from %d contigs sum up %lldbp, with average length %lld, %d gaps filled\n" , count, num_lctg / 2, sum, sum / count, gap_c ); //output singleton for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen < len_cut || contig_array[i].flag ) { continue; } length_array[count++] = contig_array[i].length; sum += contig_array[i].length; if ( isSmallerThanTwin ( i ) ) { i++; } } // calculate N50/N90 printf ( "%d scaffolds&singleton sum up %lldbp, with average length %lld\n" , count, sum, sum / count ); qsort ( length_array, count, sizeof ( length_array[0] ), cmp_int ); printf ( "the longest is %dbp,", length_array[count - 1] ); N50 = sum * 0.5; N90 = sum * 0.9; sum = flag = 0; for ( j = count - 1; j >= 0; j-- ) { sum += length_array[j]; if ( !flag && sum >= N50 ) { printf ( "scaffold N50 is %d bp, ", length_array[j] ); flag++; } if ( sum >= N90 ) { printf ( "scaffold N90 is %d bp\n", length_array[j] ); break; } } printf ( "Found %d weak points in scaffolds\n", weakCounter ); fflush ( stdout ); free ( ( void * ) length_array ); for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } void scaffold_count ( unsigned int len_cut ) { static DARRAY * scaf3, *scaf5; static DARRAY * gap3, *gap5; unsigned int prev_ctg, ctg, bal_ctg, *length_array, count = 0, num_lctg = 0; unsigned int i, max_steps = 5; int num5, num3, j, len, flag, num_route, gap_c = 0; short gap = 0; long long sum = 0, N50, N90; CONNECT * cnt, *prevCNT, *nextCnt; boolean excep; 
so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) ); found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) ); for ( j = 0; j < max_n_routes; j++ ) { found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) ); } length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with original index: index_array[i] orig2new = 0; scaf3 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen >= len_cut ) { num_lctg++; } else { continue; } if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; //printf("%d",i); * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; bal_ctg = getTwinCtg ( ctg ); contig_array[bal_ctg].flag = 1; len = contig_array[i].length; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception --- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); * ( int * ) darrayPut ( gap5, num5 - 1 ) = cnt->gapLen; ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; len += 
cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; //printf("->%d",ctg); } //printf("\n"); ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } //printf("%d",i); //fflush(stdout); cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception -- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; //printf("<-%d",bal_ctg); //fflush(stdout); * ( int * ) darrayPut ( gap3, num3 ) = cnt->gapLen; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } //printf("\n"); len += overlaplen; sum += len; length_array[count++] = len; if ( num5 + num3 < 1 ) { printf ( "no scaffold created for contig %d\n", i ); continue; } len = prev_ctg = 0; for ( j = num3 - 1; j >= 0; j-- ) { if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf3, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { gap_c++; } } len += contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + * ( int * ) darrayGet ( gap3, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf3, j ); gap = * ( int * ) darrayGet ( gap3, j ) > 0 ? 
* ( int * ) darrayGet ( gap3, j ) : 0; } for ( j = 0; j < num5; j++ ) { if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf5, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { gap_c++; } } if ( j < num5 - 1 ) { len += contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + * ( int * ) darrayGet ( gap5, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf5, j ); gap = * ( int * ) darrayGet ( gap5, j ) > 0 ? * ( int * ) darrayGet ( gap5, j ) : 0; } } } freeDarray ( scaf3 ); freeDarray ( scaf5 ); freeDarray ( gap3 ); freeDarray ( gap5 ); printf ( "\n%d scaffolds from %d contigs sum up %lldbp, with average length %lld, %d gaps filled\n" , count, num_lctg / 2, sum, sum / count, gap_c ); //output singleton for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen < len_cut || contig_array[i].flag ) { continue; } length_array[count++] = contig_array[i].length; sum += contig_array[i].length; if ( isSmallerThanTwin ( i ) ) { i++; } } // calculate N50/N90 printf ( "%d scaffolds&singleton sum up %lldbp, with average length %lld\n" , count, sum, sum / count ); qsort ( length_array, count, sizeof ( length_array[0] ), cmp_int ); printf ( "the longest is %dbp,", length_array[count - 1] ); N50 = sum * 0.5; N90 = sum * 0.9; sum = flag = 0; for ( j = count - 1; j >= 0; j-- ) { sum += length_array[j]; if ( !flag && sum >= N50 ) { printf ( "scaffold N50 is %d bp, ", length_array[j] ); flag++; } if ( sum >= N90 ) { printf ( "scaffold N90 is %d bp\n", length_array[j] ); break; } } fflush ( stdout ); free ( ( void * ) length_array ); for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } static void outputLinks ( FILE * fp, int insertS ) { unsigned int i, bal_ctg, bal_toCtg; CONNECT * cnts, *temp_cnt; //printf("outputLinks, %d contigs\n",num_ctg); 
for ( i = 1; i <= num_ctg; i++ ) { cnts = contig_array[i].downwardConnect; bal_ctg = getTwinCtg ( i ); while ( cnts ) { if ( cnts->weight < 1 ) { cnts = cnts->next; continue; } fprintf ( fp, "%-10d %-10d\t%d\t%d\t%d\n" , i, cnts->contigID, cnts->gapLen, cnts->weight, insertS ); cnts->weight = 0; bal_toCtg = getTwinCtg ( cnts->contigID ); temp_cnt = getCntBetween ( bal_toCtg, bal_ctg ); if ( temp_cnt ) { temp_cnt->weight = 0; } cnts = cnts->next; } } } //use pe info in ascent order void PE2Links ( char * infile ) { char name[256], *line; FILE * fp, *linkF; int i; int flag = 0; unsigned int j; sprintf ( name, "%s.links", infile ); /*linkF = fopen(name,"r"); if(linkF){ printf("file %s exists, skip creating the links...\n",name); fclose(linkF); return; }*/ linkF = ckopen ( name, "w" ); if ( !pes ) { loadPEgrads ( infile ); } sprintf ( name, "%s.readOnContig", infile ); fp = ckopen ( name, "r" ); lineLen = 1024; line = ( char * ) ckalloc ( lineLen * sizeof ( char ) ); fgets ( line, lineLen, fp ); line[0] = '\0'; printf ( "\n" ); for ( i = 0; i < gradsCounter; i++ ) { createCntMemManager(); createCntLookupTable(); newCntCounter = 0; flag += connectByPE_grad ( fp, i, line ); printf ( "%lld new connections\n", newCntCounter / 2 ); if ( !flag ) { destroyConnectMem(); deleteCntLookupTable(); for ( j = 1; j <= num_ctg; j++ ) { contig_array[j].downwardConnect = NULL; } printf ( "\n" ); continue; } flag = 0; outputLinks ( linkF, pes[i].insertS ); destroyConnectMem(); deleteCntLookupTable(); for ( j = 1; j <= num_ctg; j++ ) { contig_array[j].downwardConnect = NULL; } } free ( ( void * ) line ); fclose ( fp ); fclose ( linkF ); printf ( "all PEs attached\n" ); } static int inputLinks ( FILE * fp, int insertS, char * line ) { unsigned int ctg, bal_ctg, toCtg, bal_toCtg; int gap, wt, ins; unsigned int counter = 0, onScafCounter = 0; unsigned int maskCounter = 0; if ( strlen ( line ) ) { sscanf ( line, "%d %d %d %d %d", &ctg, &toCtg, &gap, &wt, &ins ); if ( ins != insertS ) { return 
counter;
        }

        //if(contig_array[ctg].length>=ctg_short&&contig_array[toCtg].length>=ctg_short){
        if ( 1 ) {
            bal_ctg = getTwinCtg ( ctg );
            bal_toCtg = getTwinCtg ( toCtg );
            add1Connect ( ctg, toCtg, gap, wt, 0 );
            add1Connect ( bal_toCtg, bal_ctg, gap, wt, 0 );
            counter++;

            if ( contig_array[ctg].mask || contig_array[toCtg].mask ) {
                maskCounter++;
            }

            if ( insertS > 1000 &&
                    contig_array[ctg].from_vt == contig_array[toCtg].from_vt && // on the same scaff
                    contig_array[ctg].indexInScaf < contig_array[toCtg].indexInScaf ) {
                add1LongPEcov ( ctg, toCtg, wt );
                onScafCounter++;
            }
        }
    }

    while ( fgets ( line, lineLen, fp ) != NULL ) {
        sscanf ( line, "%d %d %d %d %d", &ctg, &toCtg, &gap, &wt, &ins );

        if ( ins > insertS ) {
            break;
        }

        /*
        if(contig_array[ctg].length<ctg_short||contig_array[toCtg].length<ctg_short)
            continue;
        */
        if ( insertS > 1000 &&
                contig_array[ctg].from_vt == contig_array[toCtg].from_vt && // on the same scaff
                contig_array[ctg].indexInScaf < contig_array[toCtg].indexInScaf ) {
            add1LongPEcov ( ctg, toCtg, wt );
            onScafCounter++;
        }

        bal_ctg = getTwinCtg ( ctg );
        bal_toCtg = getTwinCtg ( toCtg );
        add1Connect ( ctg, toCtg, gap, wt, 0 );
        add1Connect ( bal_toCtg, bal_ctg, gap, wt, 0 );
        counter++;

        if ( contig_array[ctg].mask || contig_array[toCtg].mask ) {
            maskCounter++;
        }
    }

    printf ( "%d link to masked contigs, %d links on a single scaff\n", maskCounter, onScafCounter );
    return counter;
}

// use linkage info in ascending order
void Links2Scaf ( char * infile )
{
    char name[256], *line;
    FILE * fp;
    int i, j = 1, lib_n = 0, cutoff_sum = 0;
    int flag = 0, flag2;
    boolean downS, nonLinear = 0, smallPE = 0, isPrevSmall = 0, markSmall;

    if ( !pes ) {
        loadPEgrads ( infile );
    }

    sprintf ( name, "%s.links", infile );
    fp = ckopen ( name, "r" );
    createCntMemManager();
    createCntLookupTable();
    lineLen = 1024;
    line = ( char * ) ckalloc ( lineLen * sizeof ( char ) );
    fgets ( line, lineLen, fp );
    line[0] = '\0';
    solidArray = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
    tempArray = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
    scaf3 = ( DARRAY * )
createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); weakPE = 3; //0531 printf ( "\n" ); for ( i = 0; i < gradsCounter; i++ ) { if ( pes[i].insertS < 1000 ) { isPrevSmall = 1; } else if ( pes[i].insertS > 1000 && isPrevSmall ) { smallScaf(); isPrevSmall = 0; } flag2 = inputLinks ( fp, pes[i].insertS, line ); printf ( "Insert size %d: %d links input\n", pes[i].insertS, flag2 ); if ( flag2 ) { lib_n++; cutoff_sum += pes[i].pair_num_cut; } flag += flag2; if ( !flag ) { printf ( "\n" ); continue; } if ( i == gradsCounter - 1 || pes[i + 1].rank != pes[i].rank ) { flag = nonLinear = downS = markSmall = 0; if ( pes[i].insertS > 1000 && pes[i].rank > 1 ) { downS = 1; } if ( pes[i].insertS <= 1000 ) { smallPE = 1; } if ( pes[i].insertS >= 1000 ) { ins_size_var = 50; OverlapPercent = 0.05; } else if ( pes[i].insertS >= 300 ) { ins_size_var = 30; OverlapPercent = 0.05; } else { ins_size_var = 20; OverlapPercent = 0.05; } if ( pes[i].insertS > 1000 ) { weakPE = 5; } //static_f = 1; if ( lib_n > 0 ) { weakPE = weakPE < cutoff_sum / lib_n ? 
cutoff_sum / lib_n : weakPE; lib_n = cutoff_sum = 0; } printf ( "Cutoff for number of pairs to make a reliable connection: %d\n", weakPE ); if ( i == gradsCounter - 1 ) { nonLinear = 1; } if ( i == gradsCounter - 1 && !isPrevSmall && smallPE ) { detectBreakScaf(); } ordering ( 1, downS, nonLinear, infile ); if ( i == gradsCounter - 1 ) { recoverMask(); } else { printf ( "\nthe %d rank", j++ ); scaffold_count ( 100 ); printf ( "\n" ); } } } freeDarray ( tempArray ); freeDarray ( solidArray ); freeDarray ( scaf3 ); freeDarray ( scaf5 ); freeDarray ( gap3 ); freeDarray ( gap5 ); free ( ( void * ) line ); fclose ( fp ); printf ( "all links loaded\n" ); } /* below for picking up a subgraph (with at most one node has upstream connections to the rest and at most one downstream connections) in general */ // static int nodeCounter static boolean putNodeInArray ( unsigned int node, int maxNodes, int dis ) { if ( contig_array[node].inSubGraph ) { return 1; } int index = nodeCounter; if ( index > maxNodes ) { return 0; } if ( contig_array[getTwinCtg ( node )].inSubGraph ) { return 0; } ctg4heapArray[index].ctgID = node; ctg4heapArray[index].dis = dis; contig_array[node].inSubGraph = 1; ctg4heapArray[index].ds_shut4dheap = 0; ctg4heapArray[index].us_shut4dheap = 0; ctg4heapArray[index].ds_shut4uheap = 0; ctg4heapArray[index].us_shut4uheap = 0; return 1; } static void setInGraph ( boolean flag ) { int i; int node; nodeCounter = nodeCounter > MaxNodeInSub ? 
                  MaxNodeInSub : nodeCounter;

    for ( i = 1; i <= nodeCounter; i++ )
    {
        node = ctg4heapArray[i].ctgID;

        if ( node > 0 )
        {
            contig_array[node].inSubGraph = flag;
        }
    }
}

static boolean dispatch1node ( int dis, unsigned int tempNode, int maxNodes, FibHeap * dheap, FibHeap * uheap, int * DmaxDis, int * UmaxDis )
{
    boolean ret;

    if ( dis >= 0 ) // put it to Dheap
    {
        nodeCounter++;
        ret = putNodeInArray ( tempNode, maxNodes, dis );

        if ( !ret )
        {
            return 0;
        }

        insertNodeIntoHeap ( dheap, dis, nodeCounter );

        if ( dis > *DmaxDis )
        {
            *DmaxDis = dis;
        }

        return 1;
    }
    else // put it to Uheap
    {
        nodeCounter++;
        ret = putNodeInArray ( tempNode, maxNodes, dis );

        if ( !ret )
        {
            return 0;
        }

        insertNodeIntoHeap ( uheap, -dis, nodeCounter );
        int temp_len = contig_array[tempNode].length;

        if ( -dis + temp_len > *UmaxDis )
        {
            *UmaxDis = -dis + contig_array[tempNode].length;
        }

        return -1;
    }

    return 0;
}

static boolean canDheapWait ( unsigned int currNode, int dis, int DmaxDis )
{
    if ( dis < DmaxDis )
    {
        return 0;
    }
    else
    {
        return 1;
    }
}

static boolean workOnDheap ( FibHeap * dheap, FibHeap * uheap, boolean * Dwait, boolean * Uwait, int * DmaxDis, int * UmaxDis, int maxNodes )
{
    if ( *Dwait )
    {
        return 1;
    }

    unsigned int currNode, twin, tempNode;
    CTGinHEAP * ctgInHeap;
    int indexInArray;
    CONNECT * us_cnt, *ds_cnt;
    int dis0, dis;
    boolean ret, isEmpty;

    while ( ( indexInArray = removeNextNodeFromHeap ( dheap ) ) != 0 )
    {
        ctgInHeap = &ctg4heapArray[indexInArray];
        currNode = ctgInHeap->ctgID;
        dis0 = ctgInHeap->dis;
        isEmpty = IsHeapEmpty ( dheap );
        twin = getTwinCtg ( currNode );
        us_cnt = ctgInHeap->us_shut4dheap ?
NULL : contig_array[twin].downwardConnect; while ( us_cnt ) { if ( us_cnt->deleted || us_cnt->mask || contig_array[getTwinCtg ( us_cnt->contigID )].inSubGraph ) { us_cnt = us_cnt->next; continue; } tempNode = getTwinCtg ( us_cnt->contigID ); if ( contig_array[tempNode].inSubGraph ) { us_cnt = us_cnt->next; continue; } dis = dis0 - us_cnt->gapLen - ( int ) contig_array[twin].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret < 0 ) { *Uwait = 0; } us_cnt = us_cnt->next; } if ( nodeCounter > 1 && isEmpty ) { *Dwait = canDheapWait ( currNode, dis0, *DmaxDis ); if ( *Dwait ) { isEmpty = IsHeapEmpty ( dheap ); insertNodeIntoHeap ( dheap, dis0, indexInArray ); ctg4heapArray[indexInArray].us_shut4dheap = 1; if ( isEmpty ) { return 1; } else { continue; } } } ds_cnt = ctgInHeap->ds_shut4dheap ? NULL : contig_array[currNode].downwardConnect; while ( ds_cnt ) { if ( ds_cnt->deleted || ds_cnt->mask || contig_array[ds_cnt->contigID].inSubGraph ) { ds_cnt = ds_cnt->next; continue; } tempNode = ds_cnt->contigID; dis = dis0 + ds_cnt->gapLen + ( int ) contig_array[tempNode].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret < 0 ) { *Uwait = 0; } } // for each downstream connections } // for each node comes off the heap *Dwait = 1; return 1; } static boolean canUheapWait ( unsigned int currNode, int dis, int UmaxDis ) { int temp_len = contig_array[currNode].length; if ( -dis + temp_len < UmaxDis ) { return 0; } else { return 1; } } static boolean workOnUheap ( FibHeap * dheap, FibHeap * uheap, boolean * Dwait, boolean * Uwait, int * DmaxDis, int * UmaxDis, int maxNodes ) { if ( *Uwait ) { return 1; } unsigned int currNode, twin, tempNode; CTGinHEAP * ctgInHeap; int indexInArray; CONNECT * us_cnt, *ds_cnt; int dis0, dis; boolean ret, isEmpty; while ( ( indexInArray = removeNextNodeFromHeap ( uheap ) ) != 0 ) { ctgInHeap = 
&ctg4heapArray[indexInArray]; currNode = ctgInHeap->ctgID; dis0 = ctgInHeap->dis; isEmpty = IsHeapEmpty ( uheap ); ds_cnt = ctgInHeap->ds_shut4uheap ? NULL : contig_array[currNode].downwardConnect; while ( ds_cnt ) { if ( ds_cnt->deleted || ds_cnt->mask || contig_array[ds_cnt->contigID].inSubGraph ) { ds_cnt = ds_cnt->next; continue; } tempNode = ds_cnt->contigID; dis = dis0 + ds_cnt->gapLen + contig_array[tempNode].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret > 0 ) { *Dwait = 0; } } // for each downstream connections if ( nodeCounter > 1 && isEmpty ) { *Uwait = canUheapWait ( currNode, dis0, *UmaxDis ); if ( *Uwait ) { isEmpty = IsHeapEmpty ( uheap ); insertNodeIntoHeap ( uheap, dis0, indexInArray ); ctg4heapArray[indexInArray].ds_shut4uheap = 1; if ( isEmpty ) { return 1; } else { continue; } } } twin = getTwinCtg ( currNode ); us_cnt = ctgInHeap->us_shut4uheap ? NULL : contig_array[twin].downwardConnect; while ( us_cnt ) { if ( us_cnt->deleted || us_cnt->mask || contig_array[getTwinCtg ( us_cnt->contigID )].inSubGraph ) { us_cnt = us_cnt->next; continue; } tempNode = getTwinCtg ( us_cnt->contigID ); if ( contig_array[tempNode].inSubGraph ) { us_cnt = us_cnt->next; continue; } dis = dis0 - us_cnt->gapLen - contig_array[twin].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret > 0 ) { *Dwait = 1; } us_cnt = us_cnt->next; } } // for each node comes off the heap *Uwait = 1; return 1; } static boolean pickUpGeneralSubgraph ( unsigned int node1, int maxNodes ) { FibHeap * Uheap = newFibHeap(); // heap for upstream contigs to node1 FibHeap * Dheap = newFibHeap(); int UmaxDis; // max distance upstream to node1 int DmaxDis; boolean Uwait; // wait signal for Uheap boolean Dwait; int dis; boolean ret; //initiate: node1 is put to array once, and to both Dheap and Uheap dis = 0; nodeCounter = 1; 
putNodeInArray ( node1, maxNodes, dis ); insertNodeIntoHeap ( Dheap, dis, nodeCounter ); ctg4heapArray[nodeCounter].us_shut4dheap = 1; Dwait = 0; DmaxDis = 0; insertNodeIntoHeap ( Uheap, dis, nodeCounter ); ctg4heapArray[nodeCounter].ds_shut4uheap = 1; Uwait = 1; UmaxDis = contig_array[node1].length; while ( 1 ) { ret = workOnDheap ( Dheap, Uheap, &Dwait, &Uwait, &DmaxDis, &UmaxDis, maxNodes ); if ( !ret ) { setInGraph ( 0 ); destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 0; } ret = workOnUheap ( Dheap, Uheap, &Dwait, &Uwait, &DmaxDis, &UmaxDis, maxNodes ); if ( !ret ) { setInGraph ( 0 ); destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 0; } if ( Uwait && Dwait ) { destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 1; } } } static int cmp_ctg ( const void * a, const void * b ) { CTGinHEAP * A, *B; A = ( CTGinHEAP * ) a; B = ( CTGinHEAP * ) b; if ( A->dis > B->dis ) { return 1; } else if ( A->dis == B->dis ) { return 0; } else { return -1; } } static boolean checkEligible() { unsigned int firstNode = ctg4heapArray[1].ctgID; unsigned int twin; int i; boolean flag = 0; //check if the first node has incoming link from twin of any node in subgraph // or it has multi outgoing links bound to incoming links twin = getTwinCtg ( firstNode ); CONNECT * ite_cnt = contig_array[twin].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( contig_array[ite_cnt->contigID].inSubGraph ) { /* if(firstNode==3693) printf("eligible link %d -> %d\n",twin,ite_cnt->contigID); */ return 0; } if ( ite_cnt->prevInScaf ) { if ( flag ) { return 0; } flag = 1; } ite_cnt = ite_cnt->next; } //check if the last node has outgoing link to twin of any node in subgraph // or it has multi outgoing links bound to incoming links unsigned int lastNode = ctg4heapArray[nodeCounter].ctgID; ite_cnt = contig_array[lastNode].downwardConnect; flag = 0; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = 
ite_cnt->next; continue; } twin = getTwinCtg ( ite_cnt->contigID ); if ( contig_array[twin].inSubGraph ) { /* if(firstNode==3693) printf("eligible link %d -> %d\n",lastNode,ite_cnt->contigID); */ return 0; } if ( ite_cnt->prevInScaf ) { if ( flag ) { return 0; } flag = 1; } ite_cnt = ite_cnt->next; } //check if any node has outgoing link to node outside the subgraph for ( i = 1; i < nodeCounter; i++ ) { ite_cnt = contig_array[ctg4heapArray[i].ctgID].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( !contig_array[ite_cnt->contigID].inSubGraph ) { /* printf("eligible check: ctg %d links to ctg %d\n", ctg4heapArray[i].ctgID,ite_cnt->contigID); */ return 0; } ite_cnt = ite_cnt->next; } } //check if any node has incoming link from node outside the subgraph for ( i = 2; i <= nodeCounter; i++ ) { twin = getTwinCtg ( ctg4heapArray[i].ctgID ); ite_cnt = contig_array[twin].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( !contig_array[getTwinCtg ( ite_cnt->contigID )].inSubGraph ) { /* printf("eligible check: ctg %d links to ctg %d\n", ctg4heapArray[i].ctgID,ite_cnt->contigID); */ return 0; } ite_cnt = ite_cnt->next; } } return 1; } //put nodes in sub-graph in a line static void arrangeNodes_general() { int i, gap; CONNECT * ite_cnt, *temp_cnt, *bal_cnt, *prev_cnt, *next_cnt; unsigned int node1, node2; unsigned int bal_nd1, bal_nd2; //delete original connections for ( i = 1; i <= nodeCounter; i++ ) { node1 = ctg4heapArray[i].ctgID; ite_cnt = contig_array[node1].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->mask || ite_cnt->deleted || !contig_array[ite_cnt->contigID].inSubGraph ) { ite_cnt = ite_cnt->next; continue; } ite_cnt->deleted = 1; setNextInScaf ( ite_cnt, NULL ); setPrevInScaf ( ite_cnt, 0 ); ite_cnt = ite_cnt->next; } bal_nd1 = getTwinCtg ( node1 ); ite_cnt = contig_array[bal_nd1].downwardConnect; while ( ite_cnt ) { 
if ( ite_cnt->mask || ite_cnt->deleted || !contig_array[getTwinCtg ( ite_cnt->contigID )].inSubGraph ) { ite_cnt = ite_cnt->next; continue; } ite_cnt->deleted = 1; setNextInScaf ( ite_cnt, NULL ); setPrevInScaf ( ite_cnt, 0 ); ite_cnt = ite_cnt->next; } } //create new connections prev_cnt = next_cnt = NULL; for ( i = 1; i < nodeCounter; i++ ) { node1 = ctg4heapArray[i].ctgID; node2 = ctg4heapArray[i + 1].ctgID; bal_nd1 = getTwinCtg ( node1 ); bal_nd2 = getTwinCtg ( node2 ); gap = ctg4heapArray[i + 1].dis - ctg4heapArray[i].dis - contig_array[node2].length; temp_cnt = getCntBetween ( node1, node2 ); if ( temp_cnt ) { temp_cnt->deleted = 0; temp_cnt->mask = 0; //temp_cnt->gapLen = gap; bal_cnt = getCntBetween ( bal_nd2, bal_nd1 ); bal_cnt->deleted = 0; bal_cnt->mask = 0; //bal_cnt->gapLen = gap; } else { temp_cnt = allocateCN ( node2, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( node1, temp_cnt ); } temp_cnt->next = contig_array[node1].downwardConnect; contig_array[node1].downwardConnect = temp_cnt; bal_cnt = allocateCN ( bal_nd1, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( bal_nd2, bal_cnt ); } bal_cnt->next = contig_array[bal_nd2].downwardConnect; contig_array[bal_nd2].downwardConnect = bal_cnt; } if ( prev_cnt ) { setNextInScaf ( prev_cnt, temp_cnt ); setPrevInScaf ( temp_cnt, 1 ); } if ( next_cnt ) { setNextInScaf ( bal_cnt, next_cnt ); setPrevInScaf ( next_cnt, 1 ); } prev_cnt = temp_cnt; next_cnt = bal_cnt; } //re-binding connection at both ends bal_nd2 = getTwinCtg ( ctg4heapArray[1].ctgID ); ite_cnt = contig_array[bal_nd2].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( ite_cnt->prevInScaf ) { break; } ite_cnt = ite_cnt->next; } if ( ite_cnt ) { bal_nd1 = ite_cnt->contigID; node1 = getTwinCtg ( bal_nd1 ); node2 = ctg4heapArray[1].ctgID; temp_cnt = checkConnect ( node1, node2 ); bal_cnt = ite_cnt; next_cnt = checkConnect ( ctg4heapArray[1].ctgID, ctg4heapArray[2].ctgID ); 
prev_cnt = checkConnect ( getTwinCtg ( ctg4heapArray[2].ctgID ), getTwinCtg ( ctg4heapArray[1].ctgID ) ); if ( temp_cnt ) { setNextInScaf ( temp_cnt, next_cnt ); setPrevInScaf ( temp_cnt->nextInScaf, 0 ); setPrevInScaf ( next_cnt, 1 ); setNextInScaf ( prev_cnt, bal_cnt ); } } node1 = ctg4heapArray[nodeCounter].ctgID; ite_cnt = contig_array[node1].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( ite_cnt->prevInScaf ) { break; } ite_cnt = ite_cnt->next; } if ( ite_cnt ) { node2 = ite_cnt->contigID; bal_nd1 = getTwinCtg ( node1 ); bal_nd2 = getTwinCtg ( node2 ); temp_cnt = ite_cnt; bal_cnt = checkConnect ( bal_nd2, bal_nd1 ); next_cnt = checkConnect ( getTwinCtg ( ctg4heapArray[nodeCounter].ctgID ), getTwinCtg ( ctg4heapArray[nodeCounter - 1].ctgID ) ); prev_cnt = checkConnect ( ctg4heapArray[nodeCounter - 1].ctgID, ctg4heapArray[nodeCounter].ctgID ); setNextInScaf ( prev_cnt, temp_cnt ); setNextInScaf ( bal_cnt, next_cnt ); setPrevInScaf ( next_cnt, 1 ); } } //check if contigs next to each other have reasonable overlap boolean checkOverlapInBetween_general ( double tolerance ) { int i, gap; unsigned int node; int lenSum, lenOlp; lenSum = lenOlp = 0; for ( i = 1; i <= nodeCounter; i++ ) { node = ctg4heapArray[i].ctgID; lenSum += contig_array[node].length; } if ( lenSum < 1 ) { return 1; } for ( i = 1; i < nodeCounter; i++ ) { gap = ctg4heapArray[i + 1].dis - ctg4heapArray[i].dis - contig_array[ctg4heapArray[i + 1].ctgID].length; if ( -gap > 0 ) { lenOlp += -gap; } //if(-gap>ins_size_var) if ( ( double ) lenOlp / lenSum > tolerance ) { return 0; } } return 1; } //check if there's any connect indicates the opposite order between nodes in sub-graph static boolean checkConflictCnt_general ( double tolerance ) { int i, j; int supportCounter = 0; int objectCounter = 0; CONNECT * cnt; for ( i = 1; i < nodeCounter; i++ ) { for ( j = i + 1; j <= nodeCounter; j++ ) { 
            //cnt=getCntBetween(nodesInSubInOrder[j],nodesInSubInOrder[i]);
            cnt = checkConnect ( ctg4heapArray[i].ctgID, ctg4heapArray[j].ctgID );

            if ( cnt )
            {
                supportCounter += cnt->weight;
            }

            cnt = checkConnect ( ctg4heapArray[j].ctgID, ctg4heapArray[i].ctgID );

            if ( cnt )
            {
                objectCounter += cnt->weight;
            }

            //return 1;
        }
    }

    if ( supportCounter < 1 )
    {
        return 1;
    }

    if ( ( double ) objectCounter / supportCounter < tolerance )
    {
        return 0;
    }

    return 1;
}

// turn a sub-graph into a linear structure
static void general_linearization ( boolean strict )
{
    unsigned int i;
    int subCounter = 0;
    int out_num;
    boolean flag;
    int conflCounter = 0, overlapCounter = 0, eligibleCounter = 0;
    double overlapTolerance, conflTolerance;

    for ( i = num_ctg; i > 0; i-- )
    {
        if ( contig_array[i].mask )
        {
            continue;
        }

        out_num = validConnect ( i, NULL );

        if ( out_num < 2 )
        {
            continue;
        }

        //flag = pickSubGraph(i,strict);
        flag = pickUpGeneralSubgraph ( i, MaxNodeInSub );

        if ( !flag )
        {
            continue;
        }

        subCounter++;
        qsort ( &ctg4heapArray[1], nodeCounter, sizeof ( CTGinHEAP ), cmp_ctg );
        flag = checkEligible();

        if ( !flag )
        {
            // eligibleCounter counts the subgraphs that FAIL the eligibility check
            eligibleCounter++;
            setInGraph ( 0 );
            continue;
        }

        if ( strict )
        {
            overlapTolerance = OverlapPercent;
            conflTolerance = ConflPercent;
        }
        else
        {
            overlapTolerance = 2 * OverlapPercent;
            conflTolerance = 2 * ConflPercent;
        }

        flag = checkOverlapInBetween_general ( overlapTolerance );

        if ( !flag )
        {
            overlapCounter++;
            setInGraph ( 0 );
            continue;
        }

        flag = checkConflictCnt_general ( conflTolerance );

        if ( flag )
        {
            conflCounter++;
            setInGraph ( 0 );
            continue;
        }

        arrangeNodes_general();
        setInGraph ( 0 );
    }

    printf ( "Picked %d subgraphs, %d have conflicting connections, %d have significant overlapping, %d not eligible\n", subCounter, conflCounter, overlapCounter, eligibleCounter );
}

/**** the following code detects and breaks down scaffolds at weak points **********/

// mark connections in scaffolds made by small pe
static void smallScaf()
{
    unsigned int i, ctg, bal_ctg, prevCtg;
    int counter = 0;
    CONNECT * bindCnt, *cnt;

    for (
i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } counter++; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; prevCtg = getTwinCtg ( i ); while ( bindCnt ) { ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); bindCnt->bySmall = 1; cnt = getCntBetween ( bal_ctg, prevCtg ); if ( cnt ) { cnt->bySmall = 1; } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCtg = bal_ctg; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( i ); bindCnt = getBindCnt ( ctg ); prevCtg = i; while ( bindCnt ) { ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); bindCnt->bySmall = 1; cnt = getCntBetween ( bal_ctg, prevCtg ); if ( cnt ) { cnt->bySmall = 1; } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCtg = bal_ctg; bindCnt = bindCnt->nextInScaf; } } printf ( "Report from smallScaf: %d scaffolds by smallPE\n", counter ); } static boolean putItem2Sarray ( unsigned int scaf, int wt, DARRAY * SCAF, DARRAY * WT, int counter ) { int i; unsigned int * scafP, *wtP; for ( i = 0; i < counter; i++ ) { scafP = ( unsigned int * ) darrayGet ( SCAF, i ); if ( ( *scafP ) == scaf ) { wtP = ( unsigned int * ) darrayGet ( WT, i ); *wtP = ( *wtP + wt ); return 0; } } scafP = ( unsigned int * ) darrayPut ( SCAF, counter ); wtP = ( unsigned int * ) darrayPut ( WT, counter ); *scafP = scaf; *wtP = wt; return 1; } static int getDSLink2Scaf ( STACK * scafStack, DARRAY * SCAF, DARRAY * WT ) { CONNECT * ite_cnt; unsigned int ctg, targetCtg, *pt; int counter = 0; boolean inc; stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; if ( contig_array[ctg].mask || !contig_array[ctg].downwardConnect ) { continue; } ite_cnt = contig_array[ctg].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || 
ite_cnt->mask || ite_cnt->singleInScaf || ite_cnt->nextInScaf || ite_cnt->prevInScaf || ite_cnt->inherit ) { ite_cnt = ite_cnt->next; continue; } targetCtg = ite_cnt->contigID; if ( contig_array[ctg].from_vt == contig_array[targetCtg].from_vt ) // on the same scaff { ite_cnt = ite_cnt->next; continue; } inc = putItem2Sarray ( contig_array[targetCtg].from_vt, ite_cnt->weight, SCAF, WT, counter ); if ( inc ) { counter++; } ite_cnt = ite_cnt->next; } } return counter; } static int getScaffold ( unsigned int start, STACK * scafStack ) { int len = contig_array[start].length; unsigned int * pt, ctg; emptyStack ( scafStack ); pt = ( unsigned int * ) stackPush ( scafStack ); *pt = start; CONNECT * bindCnt = getBindCnt ( start ); while ( bindCnt ) { ctg = bindCnt->contigID; pt = ( unsigned int * ) stackPush ( scafStack ); *pt = ctg; len += contig_array[ctg].length; bindCnt = bindCnt->nextInScaf; } stackBackup ( scafStack ); return len; } static boolean isLinkReliable ( DARRAY * WT, int count ) { int i; for ( i = 0; i < count; i++ ) if ( * ( int * ) darrayGet ( WT, i ) >= weakPE ) { return 1; } return 0; } static int getWtFromSarray ( DARRAY * SCAF, DARRAY * WT, int count, unsigned int scaf ) { int i; for ( i = 0; i < count; i++ ) if ( * ( unsigned int * ) darrayGet ( SCAF, i ) == scaf ) { return * ( int * ) darrayGet ( WT, i ); } return 0; } static void switch2twin ( STACK * scafStack ) { unsigned int * pt; stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { *pt = getTwinCtg ( *pt ); } } /* ------> scaf1 --- --- -- -- --- scaf2 -- --- --- -- ----> */ static boolean checkScafConsist ( STACK * scafStack1, STACK * scafStack2 ) { DARRAY * downwardTo1 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); // scaf links to those scaffolds DARRAY * downwardTo2 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); DARRAY * downwardWt1 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); // scaf links to 
scaffolds with those wt DARRAY * downwardWt2 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); int linkCount1 = getDSLink2Scaf ( scafStack1, downwardTo1, downwardWt1 ); int linkCount2 = getDSLink2Scaf ( scafStack2, downwardTo2, downwardWt2 ); if ( !linkCount1 || !linkCount2 ) { freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return 1; } boolean flag1 = isLinkReliable ( downwardWt1, linkCount1 ); boolean flag2 = isLinkReliable ( downwardWt2, linkCount2 ); if ( !flag1 || !flag2 ) { freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return 1; } unsigned int scaf; int i, wt1, wt2, ret = 1; for ( i = 0; i < linkCount1; i++ ) { wt1 = * ( int * ) darrayGet ( downwardWt1, i ); if ( wt1 < weakPE ) { continue; } scaf = * ( unsigned int * ) darrayGet ( downwardTo1, i ); wt2 = getWtFromSarray ( downwardTo2, downwardWt2, linkCount2, scaf ); if ( wt2 < 1 ) { //fprintf(stderr,"Inconsistant link to %d\n",scaf); ret = 0; break; } } freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return ret; } static void setBreakPoints ( DARRAY * ctgArray, int count, int weakest, int * start, int * finish ) { int index = weakest - 1; unsigned int thisCtg; unsigned int nextCtg = * ( unsigned int * ) darrayGet ( ctgArray, weakest ); CONNECT * cnt; *start = weakest; while ( index >= 0 ) { thisCtg = * ( unsigned int * ) darrayGet ( ctgArray, index ); cnt = getCntBetween ( thisCtg, nextCtg ); if ( cnt->maxGap > 2 ) { break; } else { *start = index; } nextCtg = thisCtg; index--; } unsigned int prevCtg = * ( unsigned int * ) darrayGet ( ctgArray, weakest + 1 ); *finish = weakest + 1; index = weakest + 2; while ( index < count ) { thisCtg = * ( unsigned int * ) darrayGet ( ctgArray, index ); cnt = getCntBetween ( prevCtg, thisCtg ); if ( cnt->maxGap > 2 ) { break; } else { *finish = index; } prevCtg = 
thisCtg; index++; } } static void changeScafEnd ( STACK * scafStack, unsigned int end ) { unsigned int ctg, *pt; unsigned int start = getTwinCtg ( end ); stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; contig_array[ctg].to_vt = end; contig_array[getTwinCtg ( ctg )].from_vt = start; } } static void changeScafBegin ( STACK * scafStack, unsigned int start ) { unsigned int ctg, *pt; unsigned int end = getTwinCtg ( start ); stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; contig_array[ctg].from_vt = start; contig_array[getTwinCtg ( ctg )].to_vt = end; } } // break down scaffolds poorly supported by longer PE static void detectBreakScaf() { unsigned int i, avgPE, scafLen, len, ctg, bal_ctg, prevCtg, thisCtg; long long peCounter, linkCounter; int num3, num5, weakPoint, tempCounter, j, t, counter = 0; CONNECT * bindCnt, *cnt, *weakCnt; STACK * scafStack1 = ( STACK * ) createStack ( 1000, sizeof ( unsigned int ) ); STACK * scafStack2 = ( STACK * ) createStack ( 1000, sizeof ( unsigned int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } //first scan get the average coverage by longer pe num5 = num3 = peCounter = linkCounter = 0; scafLen = contig_array[i].length; ctg = i; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; while ( bindCnt ) { if ( !bindCnt->bySmall ) { break; } linkCounter++; peCounter += bindCnt->maxGap; ctg = bindCnt->contigID; scafLen += contig_array[ctg].length; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( 
i ); bindCnt = getBindCnt ( ctg ); while ( bindCnt ) { if ( !bindCnt->bySmall ) { break; } linkCounter++; peCounter += bindCnt->maxGap; ctg = bindCnt->contigID; scafLen += contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; bindCnt = bindCnt->nextInScaf; } if ( linkCounter < 1 || scafLen < 5000 ) { continue; } avgPE = peCounter / linkCounter; if ( avgPE < 10 ) { continue; } tempCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); prevCtg = * ( unsigned int * ) darrayGet ( tempArray, 0 ); weakCnt = NULL; weakPoint = 0; len = contig_array[prevCtg].length; for ( t = 1; t < tempCounter; t++ ) { thisCtg = * ( unsigned int * ) darrayGet ( tempArray, t ); if ( len < 2000 ) { len += contig_array[thisCtg].length; prevCtg = thisCtg; continue; } else if ( len > scafLen - 2000 ) { break; } len += contig_array[thisCtg].length; if ( contig_array[prevCtg].from_vt != contig_array[thisCtg].from_vt || contig_array[prevCtg].indexInScaf > contig_array[thisCtg].indexInScaf ) { prevCtg = thisCtg; continue; } cnt = getCntBetween ( prevCtg, thisCtg ); if ( !weakCnt || weakCnt->maxGap > cnt->maxGap ) { weakCnt = cnt; weakPoint = t; } prevCtg = thisCtg; } if ( !weakCnt || ( weakCnt->maxGap > 2 && weakCnt->maxGap > avgPE / 5 ) ) { continue; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, weakPoint - 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, weakPoint ); if ( contig_array[prevCtg].from_vt != contig_array[thisCtg].from_vt || contig_array[prevCtg].indexInScaf > contig_array[thisCtg].indexInScaf ) { printf ( "contig %d and %d not on the same scaff\n", prevCtg, thisCtg ); continue; } setConnectWP ( prevCtg, thisCtg, 1 
); /* fprintf(stderr,"scaffold len %d, avg long pe cov %d (%ld/%ld)\n", scafLen,avgPE,peCounter,linkCounter); fprintf(stderr,"Weak connect (%d) between %d(%dth of %d) and %d\n" ,weakCnt->maxGap,prevCtg,weakPoint-1,tempCounter,thisCtg); */ // set start and end to break down the scaffold int index1, index2; setBreakPoints ( tempArray, tempCounter, weakPoint - 1, &index1, &index2 ); //fprintf(stderr,"break %d ->...-> %d\n",index1,index2); unsigned int start = * ( unsigned int * ) darrayGet ( tempArray, index1 ); unsigned int finish = * ( unsigned int * ) darrayGet ( tempArray, index2 ); int len1 = getScaffold ( getTwinCtg ( start ), scafStack1 ); int len2 = getScaffold ( finish, scafStack2 ); if ( len1 < 2000 || len2 < 2000 ) { continue; } switch2twin ( scafStack1 ); int flag1 = checkScafConsist ( scafStack1, scafStack2 ); switch2twin ( scafStack1 ); switch2twin ( scafStack2 ); int flag2 = checkScafConsist ( scafStack2, scafStack1 ); if ( !flag1 || !flag2 ) { changeScafBegin ( scafStack1, getTwinCtg ( start ) ); changeScafEnd ( scafStack2, getTwinCtg ( finish ) ); //unbind links unsigned int nextCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 + 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 ); cnt = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( thisCtg ) ); if ( cnt->nextInScaf ) { prevCtg = getTwinCtg ( cnt->nextInScaf->contigID ); cnt->nextInScaf->prevInScaf = 0; cnt = getCntBetween ( prevCtg, thisCtg ); cnt->nextInScaf = NULL; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, index2 - 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, index2 ); cnt = getCntBetween ( prevCtg, thisCtg ); if ( cnt->nextInScaf ) { nextCtg = cnt->nextInScaf->contigID; cnt->nextInScaf->prevInScaf = 0; cnt = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( thisCtg ) ); cnt->nextInScaf = NULL; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 ); for ( t = index1 + 1; t <= index2; t++ ) { thisCtg = * ( unsigned int * ) 
                          darrayGet ( tempArray, t );
                cnt = getCntBetween ( prevCtg, thisCtg );
                cnt->mask = 1;
                cnt->nextInScaf = NULL;
                cnt->prevInScaf = 0;
                cnt = getCntBetween ( getTwinCtg ( thisCtg ), getTwinCtg ( prevCtg ) );
                cnt->mask = 1;
                cnt->nextInScaf = NULL;
                cnt->prevInScaf = 0;
                /*
                fprintf(stderr,"(%d %d)/(%d %d) ",
                    prevCtg,thisCtg,getTwinCtg(thisCtg),getTwinCtg(prevCtg));
                */
                prevCtg = thisCtg;
            }

            //fprintf(stderr,": BREAKING\n");
            counter++;
        }
    }

    freeStack ( scafStack1 );
    freeStack ( scafStack2 );
    printf ( "Report from checkScaf: %d scaffold segments broken\n", counter );
}

static boolean checkSimple ( DARRAY * ctgArray, int count )
{
    int i;
    unsigned int ctg;

    for ( i = 0; i < count; i++ )
    {
        ctg = * ( unsigned int * ) darrayGet ( ctgArray, i );
        contig_array[ctg].flag = 0;
        contig_array[getTwinCtg ( ctg )].flag = 0;
    }

    for ( i = 0; i < count; i++ )
    {
        ctg = * ( unsigned int * ) darrayGet ( ctgArray, i );

        if ( contig_array[ctg].flag )
        {
            return 0;
        }

        contig_array[ctg].flag = 1;
        contig_array[getTwinCtg ( ctg )].flag = 1;
    }

    return 1;
}

static void checkCircle()
{
    unsigned int i, ctg;
    CONNECT * cn_temp1;
    int counter = 0;

    for ( i = 1; i <= num_ctg; i++ )
    {
        cn_temp1 = contig_array[i].downwardConnect;

        while ( cn_temp1 )
        {
            if ( cn_temp1->weak || cn_temp1->deleted )
            {
                cn_temp1 = cn_temp1->next;
                continue;
            }

            ctg = cn_temp1->contigID;

            if ( checkConnect ( ctg, i ) )
            {
                counter++;
                maskContig ( i, 1 );
                maskContig ( ctg, 1 );
            }

            cn_temp1 = cn_temp1->next;
        }
    }

    printf ( "%d circles removed \n", counter );
}

/*
 * 63mer/output_contig.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static char * kmerSeq;

void output_graph ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i, bal_i;
    sprintf ( name, "%s.edge.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ed; i > 0; i-- )
    {
        if ( edge_array[i].deleted )
        {
            continue;
        }

        /*
        arcCount(i,&arcNum);
        if(arcNum<1)
            continue;
        */
        bal_i = getTwinEdge ( i );
        /*
        arcCount(bal_i,&arcNum);
        if(arcNum<1)
            continue;
        */
        fprintf ( fp, "\tV%d -> V%d[label =\"%d(%d)\"];\n", edge_array[i].from_vt, edge_array[i].to_vt, i, edge_array[i].length );
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

static void output_1contig ( int id, EDGE * edge, FILE * fp, boolean tip )
{
    int i;
    Kmer kmer;
    fprintf ( fp, ">%d length %d cvg_%.1f_tip_%d\n", id, edge->length + overlaplen, ( double ) edge->cvg / 10, tip );
    //fprintf(fp,">%d\n",id);
    kmer = vt_array[edge->from_vt].kmer;
    printKmerSeq ( fp, kmer );

    for ( i = 0; i < edge->length; i++ )
    {
        fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( edge->seq, i ) ) );

        if ( ( i + overlaplen + 1 ) % 100 == 0 )
        {
            fprintf ( fp, "\n" );
        }
    }

    fprintf ( fp, "\n" );
}

int cmp_int ( const void * a, const void * b )
{
    int * A, *B;
    A = ( int * ) a;
    B = ( int * ) b;

    if ( *A > *B )
    {
        return 1;
    }
    else if ( *A == *B )
    {
        return 0;
    }
    else
    {
        return -1;
    }
}

int cmp_edge ( const void * a, const void * b )
{
    EDGE * A, *B;
    A = ( EDGE * ) a;
    B = ( EDGE * ) b;

    if ( A->length > B->length )
    {
        return 1;
    }
    else if ( A->length == B->length )
    {
        return 0;
    }
    else
    {
        return -1;
    }
}

void output_contig ( EDGE *
ed_array, unsigned int ed_num, char * outfile, int cut_len ) { char temp[256]; FILE * fp, *fp_contig; int flag, count, len_c; int signI; unsigned int i; long long sum = 0, N90, N50; unsigned int * length_array; boolean tip; sprintf ( temp, "%s.contig", outfile ); fp = ckopen ( temp, "w" ); qsort ( &ed_array[1], ed_num, sizeof ( EDGE ), cmp_edge ); length_array = ( unsigned int * ) ckalloc ( ed_num * sizeof ( unsigned int ) ); kmerSeq = ( char * ) ckalloc ( overlaplen * sizeof ( char ) ); //first scan for number counting count = len_c = 0; for ( i = 1; i <= ed_num; i++ ) { if ( ( ed_array[i].length + overlaplen ) >= len_bar ) { length_array[len_c++] = ed_array[i].length + overlaplen; } if ( ed_array[i].length < 1 || ed_array[i].deleted ) { continue; } count++; if ( EdSmallerThanTwin ( i ) ) { i++; } } sum = 0; for ( signI = len_c - 1; signI >= 0; signI-- ) { sum += length_array[signI]; } qsort ( length_array, len_c, sizeof ( length_array[0] ), cmp_int ); if ( len_c > 0 ) { printf ( "%d ctgs longer than %d, sum up %lldbp, with average length %lld\n", len_c, len_bar, sum, sum / len_c ); printf ( "the longest is %dbp, ", length_array[len_c - 1] ); } N50 = sum * 0.5; N90 = sum * 0.9; sum = flag = 0; for ( signI = len_c - 1; signI >= 0; signI-- ) { sum += length_array[signI]; if ( !flag && sum >= N50 ) { printf ( "contig N50 is %d bp,", length_array[signI] ); flag = 1; } if ( sum >= N90 ) { printf ( "contig N90 is %d bp\n", length_array[signI] ); break; } } //fprintf(fp,"Number %d\n",count); for ( i = 1; i <= ed_num; i++ ) { //if(ed_array[i].multi!=1||ed_array[i].length<1||(ed_array[i].length+overlaplen)length %d,", edge->length ); if ( EdSmallerThanTwin ( i ) ) { fprintf ( fp, "1," ); } else if ( EdLargerThanTwin ( i ) ) { fprintf ( fp, "-1," ); } else { fprintf ( fp, "0," ); } fprintf ( fp, "%d ", edge->cvg ); print_kmer ( fp, vt_array[edge->from_vt].kmer, ',' ); print_kmer ( fp, vt_array[edge->to_vt].kmer, ',' ); fprintf ( fp, "\n" ); } fclose ( fp ); } void 
output_heavyArcs ( char * outfile )
{
    unsigned int i, j;
    char name[256];
    FILE * outfp;
    ARC * parc;
    sprintf ( name, "%s.Arc", outfile );
    outfp = ckopen ( name, "w" );

    for ( i = 1; i <= num_ed; i++ )
    {
        parc = edge_array[i].arcs;

        if ( !parc )
        {
            continue;
        }

        j = 0;
        fprintf ( outfp, "%u", i );

        // at most 10 "to_ed multiplicity" pairs per line; each continuation line repeats the edge id
        while ( parc )
        {
            fprintf ( outfp, " %u %u", parc->to_ed, parc->multiplicity );

            if ( ( ++j ) % 10 == 0 )
            {
                fprintf ( outfp, "\n%u", i );
            }

            parc = parc->next;
        }

        fprintf ( outfp, "\n" );
    }

    fclose ( outfp );
}

/*
 * 63mer/output_pregraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include <stdinc.h>
#include "newhash.h"
#include <extfunc.h>
#include <extvab.h>

static int outvCounter = 0;

//after this LINKFLAGFILTER in the Kmer is destroyed
static void output1vt ( kmer_t * node1, FILE * fp )
{
    if ( !node1 )
    {
        return;
    }

    if ( ! ( node1->linear ) && !
( node1->deleted ) )
    {
        outvCounter++;
        print_kmer ( fp, node1->seq, ' ' );

        //printKmerSeq(stdout,node1->seq);
        //printf("\n");
        if ( outvCounter % 8 == 0 )
        {
            fprintf ( fp, "\n" );
        }
    }
}

void output_vertex ( char * outfile )
{
    char temp[256];
    FILE * fp;
    int i;
    kmer_t * node;
    KmerSet * set;
    sprintf ( temp, "%s.vertex", outfile );
    fp = ckopen ( temp, "w" );

    // walk every per-thread kmer set; output1vt keeps only non-linear, non-deleted kmers
    for ( i = 0; i < thrd_num; i++ )
    {
        set = KmerSets[i];
        set->iter_ptr = 0;

        while ( set->iter_ptr < set->size )
        {
            if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) )
            {
                node = set->array + set->iter_ptr;
                output1vt ( node, fp );
            }

            set->iter_ptr ++;
        }
    }

    fprintf ( fp, "\n" );
    printf ( "%d vertices output\n", outvCounter );
    fclose ( fp );
    sprintf ( temp, "%s.preGraphBasic", outfile );
    fp = ckopen ( temp, "w" );
    fprintf ( fp, "VERTEX %d K %d\n", outvCounter, overlaplen );
    fprintf ( fp, "\nEDGEs %d\n", num_ed );
    fprintf ( fp, "\nMaxReadLen %d MinReadLen %d MaxNameLen %d\n", maxReadLen4all, minReadLen, maxNameLen );
    fclose ( fp );
}

void output_1edge ( preEDGE * edge, FILE * fp )
{
    int i;
    fprintf ( fp, ">length %d,", edge->length );
    print_kmer ( fp, edge->from_node, ',' );
    print_kmer ( fp, edge->to_node, ',' );
    fprintf ( fp, "cvg %d, %d\n", edge->cvg, edge->bal_edge );

    for ( i = 0; i < edge->length; i++ )
    {
        fprintf ( fp, "%c", int2base ( ( int ) edge->seq[i] ) );

        if ( ( i + 1 ) % 100 == 0 )
        {
            fprintf ( fp, "\n" );
        }
    }

    fprintf ( fp, "\n" );
}

/*
 * 63mer/output_scaffold.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

void output_contig_graph ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i;
    sprintf ( name, "%s.contig.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- )
    {
        fprintf ( fp, "\tV%d -> V%d[label =\"%d(%d)\"];\n", contig_array[i].from_vt, contig_array[i].to_vt, i, contig_array[i].length );
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

void output_scaf ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i;
    CONNECT * connect;
    boolean flag;
    sprintf ( name, "%s.scaffold.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- )
    {
        //if(contig_array[i].mask||!contig_array[i].downwardConnect)
        if ( !contig_array[i].downwardConnect )
        {
            continue;
        }

        connect = contig_array[i].downwardConnect;

        while ( connect )
        {
            //if(connect->mask||connect->deleted){
            if ( connect->deleted )
            {
                connect = connect->next;
                continue;
            }

            if ( connect->prevInScaf || connect->nextInScaf )
            {
                flag = 1;
            }
            else
            {
                flag = 0;
            }

            // masked connections are drawn in red
            if ( !connect->mask )
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\"];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );
            else
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\", color = red];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );

            connect = connect->next;
        }
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

/*
 * 63mer/pregraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static void initenv ( int argc, char ** argv );
static char shortrdsfile[256];
static char graphfile[256];
static int cutTips = 1;

static void display_pregraph_usage();

/* pregraph step: hash reads into kmers, trim tips, build edges, map reads to edges, write vertex files */
int call_pregraph ( int argc, char ** argv )
{
    time_t start_t, stop_t, time_bef, time_aft;
    time ( &start_t );
    initenv ( argc, argv );

    if ( overlaplen % 2 == 0 )
    {
        overlaplen++;
        printf ( "K should be an odd number\n" );
    }

    if ( overlaplen < 13 )
    {
        overlaplen = 13;
        printf ( "K should not be less than 13\n" );
    }
    else if ( overlaplen > 63 )
    {
        overlaplen = 63;
        printf ( "K should not be greater than 63\n" );
    }

    time ( &time_bef );
    prlRead2HashTable ( shortrdsfile, graphfile );
    time ( &time_aft );
    printf ( "time spent on pre-graph construction: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    printf ( "deLowKmer %d, deLowEdge %d\n", deLowKmer, deLowEdge );

    //analyzeTips(hash_table, graphfile);
    if ( !deLowKmer && cutTips )
    {
        time ( &time_bef );
        removeSingleTips();
        removeMinorTips();
        time ( &time_aft );
        printf ( "time spent on cutTips: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    }
    else
    {
        time ( &time_bef );
        removeMinorTips();
        time ( &time_aft );
        printf ( "time spent on cutTips: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    }

    initKmerSetSize = 0;
    //combine each linear part to an edge
    time ( &time_bef );
    kmer2edges ( graphfile );
    time ( &time_aft );
    printf ( "time spent on making edges: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    //map read to edge one by one
    time ( &time_bef );
    prlRead2edge ( shortrdsfile, graphfile );
    time ( &time_aft );
    printf ( "time spent on mapping reads: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    output_vertex ( graphfile );
    free_Sets ( KmerSets, thrd_num );
    free_Sets ( KmerSetsPatch, thrd_num );
    time ( &stop_t );
    printf ( "overall time for lightgraph: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 );
    return 0;
}

/*****************************************************************************
 * Parse command line switches
 *****************************************************************************/
void initenv ( int argc, char ** argv )
{
    int copt;
    int inpseq, outseq;
    extern char * optarg;
    char temp[100];
    optind = 1;
    inpseq = outseq = 0;

    while ( ( copt = getopt ( argc, argv, "s:o:K:p:a:dDR" ) ) != EOF )
    {
        //printf("get option\n");
        switch ( copt )
        {
            case 's':
                inpseq = 1;
                sscanf ( optarg, "%s", shortrdsfile );
                continue;
            case 'o':
                outseq = 1;
                sscanf ( optarg, "%s", graphfile );
                continue;
            case 'K':
                sscanf ( optarg, "%s", temp );
                overlaplen = atoi ( temp );
                continue;
            case 'p':
                sscanf ( optarg, "%s", temp );
                thrd_num = atoi ( temp );
                continue;
            case 'R':
                repsTie = 1;
                continue;
            case 'd':
                deLowKmer = 1;
                continue;
            case 'D':
                deLowEdge = 1;
                continue;
            case 'a':
                initKmerSetSize = atoi ( optarg );
                break;
            default:

                if ( inpseq == 0 || outseq == 0 )
                {
                    display_pregraph_usage();
                    exit ( -1 );
                }
        }
    }

    if ( inpseq == 0 || outseq == 0 )
    {
        //printf("need more\n");
        display_pregraph_usage();
        exit ( -1 );
    }
}

static void display_pregraph_usage()
{
    printf ( "\npregraph -s readsInfoFile [-d -D -R -K kmer -p n_cpu -a initKmerSetSize] -o OutputFile\n"
);
    printf ( " -s readsInfoFile: The file contains information of solexa reads\n" );
    printf ( " -p n_cpu(default 8): number of cpu for use\n" );
    printf ( " -K kmer(default 21): k value in kmer\n" );
    printf ( " -a initKmerSetSize: define the initial KmerSet size(unit: GB)\n" );
    printf ( " -d (optional): delete kmers with frequency one (default no)\n" );
    printf ( " -D (optional): delete edges with coverage one (default no)\n" );
    printf ( " -R (optional): resolve repeats by reads (default no)\n" );
    printf ( " -o OutputFile: prefix of output file name\n" );
}

/*
 * 63mer/prlHashCtg.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//debugging variables
static long long * kmerCounter;

//buffer related variables for chop kmer
static unsigned int read_c;
static char ** rcSeq;
static char * seqBuffer;
static int * lenBuffer;
static unsigned int * indexArray;
static unsigned int * seqBreakers;
static int * ctgIdArray;

//static Kmer *firstKmers;

//buffer related variables for splay tree
static unsigned int buffer_size = 100000000;
static unsigned int seq_buffer_size;
static unsigned int max_read_c;
static volatile unsigned int kmer_c;
static Kmer * kmerBuffer;
static ubyte8 * hashBanBuffer;
static boolean * smallerBuffer;

static void singleKmer ( int t, KmerSet * kset, unsigned int seq_index, unsigned int pos );
static void chopKmer4read ( int t, int threadID );

/* worker loop: polls selfSignal; 1 = insert buffered kmers, 2 = chop sequences into kmers, 3 = exit */
static void threadRoutine ( void * para )
{
    PARAMETER * prm;
    unsigned int i;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            unsigned int seq_index = 0;
            unsigned int pos = 0;

            for ( i = 0; i < kmer_c; i++ )
            {
                if ( seq_index < read_c && indexArray[seq_index + 1] == i )
                {
                    seq_index++; // which sequence this kmer belongs to
                    pos = 0;
                }

                //if((unsigned char)(hashBanBuffer[i]&taskMask)!=id){
                if ( ( unsigned char ) ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    pos++;
                    continue;
                }

                kmerCounter[id + 1]++;
                singleKmer ( i, KmerSets[id], seq_index, pos++ );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num != id )
                {
                    continue;
                }

                chopKmer4read ( i, id + 1 );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 3 )
        {
            * ( prm->selfSignal ) = 0;
            break;
        }

        usleep ( 1 );
    }
}

static void singleKmer ( int t, KmerSet * kset, unsigned int seq_index, unsigned int pos )
{
    boolean flag;
    kmer_t * node;
    flag = put_kmerset ( kset, kmerBuffer[t], 4, 4,
&node ); //printf("singleKmer: kmer %llx\n",kmerBuffer[t]); if ( !flag ) { if ( smallerBuffer[t] ) { node->twin = 0; } else { node->twin = 1; }; node->l_links = ctgIdArray[seq_index]; node->r_links = pos; } else { node->deleted = 1; } } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer + seqBreakers[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; ubyte8 hash_ban, bal_hash_ban; Kmer word, bal_word; int index; word.high = word.low = 0; for ( index = 0; index < overlaplen; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; hashBanBuffer[index] = hash_ban; smallerBuffer[index++] = 1; } else { bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; hashBanBuffer[index] = bal_hash_ban; smallerBuffer[index++] = 0; } //printf("%dth: %p with %p\n",kmer_c-1,bal_word,bal_hash_ban); for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); 
kmerBuffer[index] = word; hashBanBuffer[index] = hash_ban; smallerBuffer[index++] = 1; //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; hashBanBuffer[index] = bal_hash_ban; smallerBuffer[index++] = 0; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static int getID ( char * name ) { if ( name[0] >= '0' && name[0] <= '9' ) { return atoi ( & ( name[0] ) ); } else { return 0; } } boolean prlContig2nodes ( char * grapfile, int len_cut ) { long long i, num_seq; char name[256], *next_name; FILE * fp; pthread_t threads[thrd_num]; time_t start_t, stop_t; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; int maxCtgLen, minCtgLen, nameLen; unsigned int lenSum, contigId; WORDFILTER = createFilter ( overlaplen ); time ( &start_t ); sprintf ( name, "%s.contig", grapfile ); fp = ckopen ( name, "r" ); maxCtgLen = nameLen = 10; minCtgLen = 1000; num_seq = readseqpar ( &maxCtgLen, &minCtgLen, &nameLen, fp ); printf ( "\nthere're %lld contigs in file: %s, max seq len %d, min seq len %d, max name len %d\n", num_seq, grapfile, maxCtgLen, minCtgLen, nameLen ); maxReadLen = maxCtgLen; fclose ( fp ); time ( &stop_t ); printf ( "time spent on parse contigs file %ds\n", ( int ) ( stop_t - start_t ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); // extract all the EDONs seq_buffer_size = buffer_size * 2; max_read_c = seq_buffer_size / 20; kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); smallerBuffer = ( boolean * ) ckalloc ( 
buffer_size * sizeof ( boolean ) ); seqBuffer = ( char * ) ckalloc ( seq_buffer_size * sizeof ( char ) ); lenBuffer = ( int * ) ckalloc ( max_read_c * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( max_read_c + 1 ) * sizeof ( unsigned int ) ); seqBreakers = ( unsigned int * ) ckalloc ( ( max_read_c + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( int * ) ckalloc ( max_read_c * sizeof ( int ) ); fp = ckopen ( name, "r" ); //node_mem_manager = createMem_manager(EDONBLOCKSIZE,sizeof(EDON)); rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( 1 ) { kmerCounter = ( long long * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( long long ) ); KmerSets = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); for ( i = 0; i < thrd_num; i++ ) { KmerSets[i] = init_kmerset ( 1024, 0.77f ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; kmerCounter[i + 1] = 0; rcSeq[i + 1] = ( char * ) ckalloc ( maxCtgLen * sizeof ( char ) ); } creatThrds ( threads, paras ); } kmer_c = thrdSignal[0] = kmerCounter[0] = 0; time ( &start_t ); read_c = lenSum = i = seqBreakers[0] = indexArray[0] = 0; readseq1by1 ( seqBuffer + seqBreakers[read_c], next_name, & ( lenBuffer[read_c] ), fp, -1 ); while ( !feof ( fp ) ) { contigId = getID ( next_name ); readseq1by1 ( seqBuffer + seqBreakers[read_c], next_name, & ( lenBuffer[read_c] ), fp, 1 ); if ( ( ++i ) % 10000000 == 0 ) { printf ( "--- %lldth contigs\n", i ); } if ( lenBuffer[read_c] < overlaplen + 1 || lenBuffer[read_c] < len_cut ) { contigId = getID ( next_name ); continue; } //printf("len of seq %d is %d, ID %d\n",read_c,lenBuffer[read_c],contigId); ctgIdArray[read_c] = contigId > 0 ? 
contigId : i;
        lenSum += lenBuffer[read_c];
        kmer_c += lenBuffer[read_c] - overlaplen + 1;
        read_c++;
        seqBreakers[read_c] = lenSum;
        indexArray[read_c] = kmer_c;

        //printf("seq %d start at %d\n",read_c,seqBreakers[read_c]);
        if ( read_c == max_read_c || ( lenSum + maxCtgLen ) > seq_buffer_size || ( kmer_c + maxCtgLen - overlaplen + 1 ) > buffer_size )
        {
            kmerCounter[0] += kmer_c;
            sendWorkSignal ( 2, thrdSignal );
            sendWorkSignal ( 1, thrdSignal );
            kmer_c = read_c = lenSum = 0;
        }
    }

    if ( read_c )
    {
        kmerCounter[0] += kmer_c;
        sendWorkSignal ( 2, thrdSignal );
        sendWorkSignal ( 1, thrdSignal );
    }

    sendWorkSignal ( 3, thrdSignal );
    thread_wait ( threads );
    time ( &stop_t );
    printf ( "time spent on hash reads: %ds\n", ( int ) ( stop_t - start_t ) );

    if ( 1 )
    {
        unsigned long long alloCounter = 0;
        unsigned long long allKmerCounter = 0;

        for ( i = 0; i < thrd_num; i++ )
        {
            alloCounter += count_kmerset ( ( KmerSets[i] ) );
            allKmerCounter += kmerCounter[i + 1];
            free ( ( void * ) rcSeq[i + 1] );
        }

        printf ( "%lli nodes allocated, %lli kmer in reads, %lli kmer processed\n"
                 , alloCounter, kmerCounter[0], allKmerCounter );
    }

    free ( ( void * ) rcSeq );
    free ( ( void * ) kmerCounter );
    free ( ( void * ) seqBuffer );
    free ( ( void * ) lenBuffer );
    free ( ( void * ) indexArray );
    free ( ( void * ) seqBreakers );
    free ( ( void * ) ctgIdArray );
    free ( ( void * ) kmerBuffer );
    free ( ( void * ) hashBanBuffer );
    free ( ( void * ) smallerBuffer );
    free ( ( void * ) next_name );
    fclose ( fp );
    return 1;
}

/*
 * 63mer/prlHashReads.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//debugging variables
static long long * tips;
static long long * kmerCounter;
static long long ** kmerFreq;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;
static int * indexArray;

//buffer related variables for splay tree
static int buffer_size = 100000000;
static volatile int kmer_c;
static Kmer * kmerBuffer;
static ubyte8 * hashBanBuffer;
static char * nextcBuffer, *prevcBuffer;

static void thread_mark ( KmerSet * set, unsigned char thrdID );
static void Mark1in1outNode ( unsigned char * thrdSignal );
static void thread_delow ( KmerSet * set, unsigned char thrdID );
static void deLowCov ( unsigned char * thrdSignal );

static void singleKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void freqStat ( char * outfile );

static void threadRoutine ( void * para )
{
    PARAMETER * prm;
    int i;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            for ( i = 0; i < kmer_c; i++ )
            {
                //if((unsigned char)(magic_seq(hashBanBuffer[i])%thrd_num)!=id)
                //if((kmerBuffer[i]%thrd_num)!=id)
                if ( ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    continue;
                }

                kmerCounter[id + 1]++;
                singleKmer ( i, KmerSets[id] );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num != id )
                {
                    continue;
                }

                chopKmer4read ( i, id + 1 );
            }

            * (
prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { * ( prm->selfSignal ) = 0; break; } else if ( * ( prm->selfSignal ) == 4 ) { thread_mark ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { thread_delow ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } usleep ( 1 ); } } static void singleKmer ( int t, KmerSet * kset ) { kmer_t * pos; put_kmerset ( kset, kmerBuffer[t], prevcBuffer[t], nextcBuffer[t], &pos ); } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; ubyte8 hash_ban, bal_hash_ban; Kmer word, bal_word; int index; char InvalidCh = 4; word.high = word.low = 0; for ( index = 0; index < overlaplen; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); hashBanBuffer[index] = hash_ban; kmerBuffer[index] = word; prevcBuffer[index] = InvalidCh; nextcBuffer[index++] = src_seq[0 + overlaplen]; } else { bal_hash_ban = hash_kmer ( bal_word ); hashBanBuffer[index] = bal_hash_ban; kmerBuffer[index] = bal_word; prevcBuffer[index] = bal_seq[bal_j - 1]; nextcBuffer[index++] = InvalidCh; } for 
( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); hashBanBuffer[index] = hash_ban; kmerBuffer[index] = word; prevcBuffer[index] = src_seq[j - 1]; if ( j < len_seq - overlaplen ) { nextcBuffer[index++] = src_seq[j + overlaplen]; } else { nextcBuffer[index++] = InvalidCh; } //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); hashBanBuffer[index] = bal_hash_ban; kmerBuffer[index] = bal_word; if ( bal_j > 0 ) { prevcBuffer[index] = bal_seq[bal_j - 1]; } else { prevcBuffer[index] = InvalidCh; } nextcBuffer[index++] = bal_seq[bal_j + overlaplen]; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } boolean prlRead2HashTable ( char * libfile, char * outfile ) { long long i; char * next_name, name[256]; FILE * fo; time_t start_t, stop_t; int maxReadNum; int libNo; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; boolean flag, pairs = 0; WORDFILTER = createFilter ( overlaplen ); maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } maxReadLen4all = maxReadLen; printf ( "In %s, %d libs, max seq len %d, max name len %d\n\n", libfile, num_libs, maxReadLen, maxNameLen ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * 
sizeof ( ubyte8 ) ); prevcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); nextcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); //printf("buffer size %d, max read len %d, max read num %d\n",buffer_size,maxReadLen,maxReadNum); seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( 1 ) { kmerCounter = ( long long * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( long long ) ); KmerSets = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); ubyte8 init_size = 1024; ubyte8 k = 0; if ( initKmerSetSize ) { init_size = ( ubyte8 ) ( ( double ) initKmerSetSize * 1024.0f * 1024.0f * 1024.0f / ( double ) thrd_num / 24 ); //is it true? 
do { ++k; } while ( k * 0xFFFFFFLLU < init_size ); } for ( i = 0; i < thrd_num; i++ ) { KmerSets[i] = init_kmerset ( k * 0xFFFFFFLLU, 0.77f ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; kmerCounter[i + 1] = 0; rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } creatThrds ( threads, paras ); } thrdSignal[0] = kmerCounter[0] = 0; time ( &start_t ); kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter = 0; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 1 ) ) != 0 ) { if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } if ( lenBuffer[read_c] < 0 ) { printf ( "read len %d\n", lenBuffer[read_c] ); } if ( lenBuffer[read_c] < overlaplen + 1 ) { continue; } /* if(lenBuffer[read_c]>70) lenBuffer[read_c] = 50; else if(lenBuffer[read_c]>40) lenBuffer[read_c] = 40; */ indexArray[read_c] = kmer_c; kmer_c += lenBuffer[read_c] - overlaplen + 1; read_c++; if ( read_c == maxReadNum ) { kmerCounter[0] += kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); kmer_c = read_c = 0; } } if ( read_c ) { kmerCounter[0] += kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); } time ( &stop_t ); printf ( "time spent on hash reads: %ds, %lld reads processed\n", ( int ) ( stop_t - start_t ), i ); //record insert size info if ( pairs ) { if ( gradsCounter ) printf ( "%d pe insert size, the largest boundary is %lld\n\n", gradsCounter, pes[gradsCounter - 1].PE_bound ); else { printf ( "no paired reads found\n" ); } sprintf ( name, "%s.peGrads", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "grads&num: %d\t%lld\n", gradsCounter, n_solexa ); for ( i = 0; i < gradsCounter; i++ ) { fprintf ( fo, "%d\t%lld\t%d\n", pes[i].insertS, pes[i].PE_bound, pes[i].rank ); } fclose ( fo ); } free_pe_mem(); free_libs(); if ( 1 ) { unsigned long long alloCounter = 0; unsigned 
long long allKmerCounter = 0; for ( i = 0; i < thrd_num; i++ ) { alloCounter += count_kmerset ( ( KmerSets[i] ) ); allKmerCounter += kmerCounter[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } printf ( "%lli nodes allocated, %lli kmer in reads, %lli kmer processed\n" , alloCounter, kmerCounter[0], allKmerCounter ); } free ( ( void * ) rcSeq ); free ( ( void * ) kmerCounter ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nextcBuffer ); free ( ( void * ) prevcBuffer ); free ( ( void * ) next_name ); //printf("done hashing nodes\n"); if ( deLowKmer ) { time ( &start_t ); deLowCov ( thrdSignal ); time ( &stop_t ); printf ( "time spent on delowcvgNode %ds\n", ( int ) ( stop_t - start_t ) ); } time ( &start_t ); Mark1in1outNode ( thrdSignal ); freqStat ( outfile ); time ( &stop_t ); printf ( "time spent on marking linear nodes %ds\n", ( int ) ( stop_t - start_t ) ); fflush ( stdout ); sendWorkSignal ( 3, thrdSignal ); thread_wait ( threads ); /* Kmer word = 0x21c3ca82c734c8d0; Kmer hash_ban = hash_kmer(word); int setPicker = hash_ban%thrd_num; kmer_t *node; boolean found = search_kmerset(KmerSets[setPicker], word, &node); if(!found) printf("kmer %llx not found,\n",word); else{ printf("kmer %llx, linear %d\n",word,node->linear); for(i=0;i<4;i++){ if(get_kmer_right_cov(*node,i)>0) printf("right %d, kmer %llx\n",i,nextKmer(node->seq,i)); if(get_kmer_left_cov(*node,i)>0) printf("left %d, kmer %llx\n",i,prevKmer(node->seq,i)); } } */ return 1; } static void thread_delow ( KmerSet * set, unsigned char thrdID ) { int i, in_num, out_num, cvgSingle; int l_cvg, r_cvg; kmer_t * rs; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { in_num = out_num = l_cvg = r_cvg = 0; rs = set->array + set->iter_ptr; for ( i = 0; i < 
4; i++ ) { cvgSingle = get_kmer_left_cov ( *rs, i ); if ( cvgSingle > 0 && cvgSingle == 1 ) { set_kmer_left_cov ( *rs, i, 0 ); } cvgSingle = get_kmer_right_cov ( *rs, i ); if ( cvgSingle > 0 && cvgSingle == 1 ) { set_kmer_right_cov ( *rs, i, 0 ); } } if ( rs->l_links == 0 && rs->r_links == 0 ) { rs->deleted = 1; tips[thrdID]++; } } set->iter_ptr ++; } //printf("%lld single nodes, %lld linear\n",counter,tips[thrdID]); } static void deLowCov ( unsigned char * thrdSignal ) { int i; long long counter = 0; tips = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) ); for ( i = 0; i < thrd_num; i++ ) { tips[i] = 0; } sendWorkSignal ( 5, thrdSignal ); //mark linear nodes for ( i = 0; i < thrd_num; i++ ) { counter += tips[i]; } free ( ( void * ) tips ); printf ( "%lld kmer removed\n", counter ); } static void thread_mark ( KmerSet * set, unsigned char thrdID ) { int i, in_num, out_num, cvgSingle; int l_cvg, r_cvg; kmer_t * rs; long long counter = 0; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { in_num = out_num = l_cvg = r_cvg = 0; rs = set->array + set->iter_ptr; for ( i = 0; i < 4; i++ ) { cvgSingle = get_kmer_left_cov ( *rs, i ); if ( cvgSingle > 0 ) { in_num++; l_cvg += cvgSingle; } cvgSingle = get_kmer_right_cov ( *rs, i ); if ( cvgSingle > 0 ) { out_num++; r_cvg += cvgSingle; } } if ( rs->single ) { kmerFreq[thrdID][1]++; counter++; } else { kmerFreq[thrdID][ ( l_cvg > r_cvg ? 
l_cvg : r_cvg )]++;
            }

            if ( in_num == 1 && out_num == 1 )
            {
                rs->linear = 1;
                tips[thrdID]++;
            }
        }

        set->iter_ptr ++;
    }

    //printf("%lld single nodes, %lld linear\n",counter,tips[thrdID]);
}

static void Mark1in1outNode ( unsigned char * thrdSignal )
{
    int i;
    long long counter = 0;
    tips = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) );
    kmerFreq = ( long long ** ) ckalloc ( thrd_num * sizeof ( long long * ) );

    for ( i = 0; i < thrd_num; i++ )
    {
        kmerFreq[i] = ( long long * ) ckalloc ( 257 * sizeof ( long long ) );
        memset ( kmerFreq[i], 0, 257 * sizeof ( long long ) );
        tips[i] = 0;
    }

    sendWorkSignal ( 4, thrdSignal ); //mark linear nodes

    for ( i = 0; i < thrd_num; i++ )
    {
        counter += tips[i];
    }

    free ( ( void * ) tips );
    printf ( "%lld linear nodes\n", counter );
}

static void freqStat ( char * outfile )
{
    FILE * fo;
    char name[256];
    int i, j;
    long long sum;
    sprintf ( name, "%s.kmerFreq", outfile );
    fo = ckopen ( name, "w" );

    for ( i = 1; i < 256; i++ )
    {
        sum = 0;

        for ( j = 0; j < thrd_num; j++ )
        {
            sum += kmerFreq[j][i];
        }

        fprintf ( fo, "%lld\n", sum );
    }

    for ( i = 0; i < thrd_num; i++ )
    {
        free ( ( void * ) kmerFreq[i] );
    }

    free ( ( void * ) kmerFreq );
    fclose ( fo );
}

SOAPdenovo-V1.05/src/63mer/prlRead2Ctg.c

/*
 * 63mer/prlRead2Ctg.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.
 * If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static long long readsInGap = 0;

static int buffer_size = 100000000;
static long long readCounter;
static long long mapCounter;
static int ALIGNLEN = 0;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;
static unsigned int * ctgIdArray;
static int * posArray;
static char * orienArray;
static char * footprint; // flag indicates whether the read should leave markers on contigs

// kmer related variables
static int kmer_c;
static Kmer * kmerBuffer;
static ubyte8 * hashBanBuffer;
static kmer_t ** nodeBuffer;
static boolean * smallerBuffer;
static unsigned int * indexArray;

static int * deletion;

static void parse1read ( int t );
static void threadRoutine ( void * thrdID );
static void searchKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void thread_wait ( pthread_t * threads );

static void creatThrds ( pthread_t * threads, PARAMETER * paras )
{
    unsigned char i;
    int temp;

    for ( i = 0; i < thrd_num; i++ )
    {
        //printf("to create %dth thread\n",(*(char *)&(threadID[i])));
        if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 )
        {
            printf ( "create threads failed\n" );
            exit ( 1 );
        }
    }

    printf ( "%d thread created\n", thrd_num );
}

static void threadRoutine ( void * para )
{
    PARAMETER * prm;
    int i, t;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with task %d, hash_table %p\n",id,prm.task,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            for ( i = 0; i < kmer_c; i++ )
            {
                //if((hashBanBuffer[i]&taskMask)!=prm.threadID)
                if ( ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    continue;
                }

                searchKmer ( i, KmerSets[id] );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num != id )
                {
                    continue;
                }

chopKmer4read ( i, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { // parse reads for ( t = 0; t < read_c; t++ ) { if ( t % thrd_num != id ) { continue; } parse1read ( t ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } /* static void chopReads() { int i; for(i=0;ideleted ) { nodeBuffer[t] = node; } else { nodeBuffer[t] = NULL; } } static void parse1read ( int t ) { unsigned int j, i, s; unsigned int contigID; int counter = 0, counter2 = 0; unsigned int ctgLen, pos; kmer_t * node; boolean isSmaller; int flag, maxOcc = 0; kmer_t * maxNode = NULL; int alldgnLen = lenBuffer[t] > ALIGNLEN ? ALIGNLEN : lenBuffer[t]; int multi = alldgnLen - overlaplen + 1 < 2 ? 2 : alldgnLen - overlaplen + 1; unsigned int start, finish; footprint[t] = 0; start = indexArray[t]; finish = indexArray[t + 1]; if ( finish == start ) //too short { ctgIdArray[t] = 0; return; } for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; if ( !node ) //same as previous { continue; } flag = 1; for ( s = j + 1; s < finish; s++ ) { if ( !nodeBuffer[s] ) { continue; } if ( nodeBuffer[s]->l_links == node->l_links ) { flag++; nodeBuffer[s] = NULL; } } if ( ( overlaplen < 32 && flag >= 2 ) || overlaplen > 32 ) { counter2++; } if ( flag >= multi ) { counter++; } else { continue; } if ( flag > maxOcc ) { pos = j; maxOcc = flag; maxNode = node; } } if ( !counter ) //no match { ctgIdArray[t] = 0; return; } if ( counter2 > 1 ) { footprint[t] = 1; } //use as a flag j = pos; i = pos - start + 1; node = nodeBuffer[j]; isSmaller = smallerBuffer[j]; contigID = node->l_links; ctgLen = contig_array[contigID].length; pos = node->r_links; if ( node->twin == isSmaller ) { orienArray[t] = '-'; ctgIdArray[t] = getTwinCtg ( contigID ); posArray[t] = ctgLen - pos - overlaplen - i + 1; } else { orienArray[t] = '+'; ctgIdArray[t] = contigID; posArray[t] = pos - i + 1; } } static void sendWorkSignal ( unsigned char 
SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void locate1read ( int t ) { int i, j, start, finish; kmer_t * node; unsigned int contigID; int pos, ctgLen; boolean isSmaller; start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; if ( !node ) //same as previous { continue; } i = j - start + 1; isSmaller = smallerBuffer[j]; contigID = node->l_links; ctgLen = contig_array[contigID].length; pos = node->r_links; if ( node->twin == isSmaller ) { ctgIdArray[t] = getTwinCtg ( contigID ); posArray[t] = ctgLen - pos - overlaplen - i + 1; } else { ctgIdArray[t] = contigID; posArray[t] = pos - i + 1; } } } static void output1read ( int t, FILE * outfp ) { int len = lenBuffer[t]; int index; readsInGap++; /* if(ctgIdArray[t]==735||ctgIdArray[t]==getTwinCtg(735)){ printf("%d\t%d\t%d\t",t+1,ctgIdArray[t],posArray[t]); int j; for(j=0;j R2 <-- R1 output1read ( read1, outfp ); } else { read2 = t; read1 = t - 1; ctgIdArray[read2] = ctgIdArray[read1]; posArray[read2] = posArray[read1] + insSize - lenBuffer[read2]; // --> R1 <-- R2 output1read ( read2, outfp ); } } static void recordLongRead ( FILE * outfp ) { int t; for ( t = 0; t < read_c; t++ ) { readCounter++; if ( footprint[t] ) { output1read ( t, outfp ); } } } static void recordAlldgn ( FILE * outfp, int insSize, FILE * outfp2 ) { int t, ctgId; boolean rd1gap, rd2gap; for ( t = 0; t < read_c; t++ ) { readCounter++; rd1gap = rd2gap = 0; ctgId = ctgIdArray[t]; if ( outfp2 && t % 2 == 1 ) //make sure this is read2 in a pair { if ( ctgIdArray[t] < 1 && ctgIdArray[t - 1] > 0 ) { getReadIngap ( t, insSize, outfp2, 0 ); //read 2 in gap rd2gap = 1; } else if ( ctgIdArray[t] > 0 && ctgIdArray[t - 1] < 1 ) { getReadIngap ( t - 1, insSize, outfp2, 1 ); //read 1 in gap rd1gap = 1; } } if 
( ctgId < 1 ) { continue; } mapCounter++; fprintf ( outfp, "%lld\t%u\t%d\t%c\n", readCounter, ctgIdArray[t], posArray[t], orienArray[t] ); if ( t % 2 == 0 ) { continue; } // reads are not located by pe info but across edges if ( outfp2 && footprint[t - 1] && !rd1gap ) { if ( ctgIdArray[t - 1] < 1 ) { locate1read ( t - 1 ); } output1read ( t - 1, outfp2 ); } if ( outfp2 && footprint[t] && !rd2gap ) { if ( ctgIdArray[t] < 1 ) { locate1read ( t ); } output1read ( t, outfp2 ); } } } //load contig index and length void basicContigInfo ( char * infile ) { char name[256], lldne[1024]; FILE * fp; int length, bal_ed, num_all, num_long, index; sprintf ( name, "%s.ContigIndex", infile ); fp = ckopen ( name, "r" ); fgets ( lldne, sizeof ( lldne ), fp ); sscanf ( lldne + 8, "%d %d", &num_all, &num_long ); printf ( "%d edges in graph\n", num_all ); num_ctg = num_all; contig_array = ( CONTIG * ) ckalloc ( ( num_all + 1 ) * sizeof ( CONTIG ) ); fgets ( lldne, sizeof ( lldne ), fp ); num_long = 0; while ( fgets ( lldne, sizeof ( lldne ), fp ) != NULL ) { sscanf ( lldne, "%d %d %d", &index, &length, &bal_ed ); contig_array[++num_long].length = length; contig_array[num_long].bal_edge = bal_ed + 1; if ( index != num_long ) { printf ( "basicContigInfo: %d vs %d\n", index, num_long ); } if ( bal_ed == 0 ) { continue; } contig_array[++num_long].length = length; contig_array[num_long].bal_edge = -bal_ed + 1; } fclose ( fp ); } void prlRead2Ctg ( char * libfile, char * outfile ) { long long i; char * src_name, *next_name, name[256]; FILE * fo, *outfp2 = NULL; int maxReadNum, libNo, prevLibNo, insSize; boolean flag, pairs = 1; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } printf ( "In file: %s, max seq len %d, max name len %d\n\n", libfile, maxReadLen, maxNameLen ); if ( maxReadLen > maxReadLen4all ) { 
maxReadLen4all = maxReadLen; } src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); maxReadNum = maxReadNum % 2 == 0 ? maxReadNum : maxReadNum - 1; //make sure paired reads are processed at the same batch seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); posArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); orienArray = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); footprint = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); thrdSignal[0] = 0; if ( 1 ) { for ( i = 0; i < thrd_num; i++ ) { rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( !contig_array ) { basicContigInfo ( outfile ); } sprintf ( name, "%s.readInGap", outfile ); outfp2 = ckopen ( name, "wb" ); sprintf ( name, "%s.readOnContig", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "read\tcontig\tpos\n" ); readCounter = mapCounter = readsInGap = 0; kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter 
= 0; prevLibNo = -1; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 0 ) ) != 0 ) { if ( libNo != prevLibNo ) { prevLibNo = libNo; insSize = lib_array[libNo].avg_ins; ALIGNLEN = lib_array[libNo].map_len; printf ( "current insert size %d, map_len %d\n", insSize, ALIGNLEN ); if ( insSize > 1000 ) { ALIGNLEN = ALIGNLEN < 35 ? 35 : ALIGNLEN; } else { ALIGNLEN = ALIGNLEN < 32 ? 32 : ALIGNLEN; } } if ( insSize > 1000 ) { ALIGNLEN = ALIGNLEN < ( lenBuffer[read_c] / 2 + 1 ) ? ( lenBuffer[read_c] / 2 + 1 ) : ALIGNLEN; } if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } indexArray[read_c] = kmer_c; if ( lenBuffer[read_c] >= overlaplen + 1 ) { kmer_c += lenBuffer[read_c] - overlaplen + 1; } read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordAlldgn ( fo, insSize, outfp2 ); kmer_c = 0; read_c = 0; } } if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordAlldgn ( fo, insSize, outfp2 ); printf ( "Output %lld out of %lld (%.1f)%% reads in gaps\n", readsInGap, readCounter, ( float ) readsInGap / readCounter * 100 ); } printf ( "%lld out of %lld (%.1f)%% reads mapped to contigs\n", mapCounter, readCounter, ( float ) mapCounter / readCounter * 100 ); sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); fclose ( fo ); sprintf ( name, "%s.peGrads", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "grads&num: %d\t%lld\t%d\n", gradsCounter, n_solexa, maxReadLen4all ); if ( pairs ) { if ( gradsCounter ) printf ( "%d pe insert size, the largest boundary is %lld\n\n", gradsCounter, pes[gradsCounter - 1].PE_bound ); else { printf ( "no paired reads found\n" ); } for ( i = 0; i < gradsCounter; i++ ) { fprintf ( fo, "%d\t%lld\t%d\t%d\n", pes[i].insertS, pes[i].PE_bound, pes[i].rank, 
pes[i].pair_num_cut ); } fclose ( fo ); } fclose ( outfp2 ); free_pe_mem(); free_libs(); if ( 1 ) // multi-threads { for ( i = 0; i < thrd_num; i++ ) { free ( ( void * ) rcSeq[i + 1] ); } } free ( ( void * ) rcSeq ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) ctgIdArray ); free ( ( void * ) posArray ); free ( ( void * ) orienArray ); free ( ( void * ) footprint ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); if ( contig_array ) { free ( ( void * ) contig_array ); contig_array = NULL; } } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } /********************* map long reads for gap filling ************************/ void prlLongRead2Ctg ( char * libfile, char * outfile ) { long long i; char * src_name, *next_name, name[256]; FILE * outfp2; int maxReadNum, libNo, prevLibNo; boolean flag, pairs = 0; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); if ( !maxReadLen ) { maxReadLen = 100; } int longReadLen = getMaxLongReadLen ( num_libs ); if ( longReadLen < 1 ) // no long reads { return; } maxReadLen4all = maxReadLen < longReadLen ? 
longReadLen : maxReadLen; printf ( "In file: %s, long read len %d, max name len %d\n\n", libfile, longReadLen, maxNameLen ); maxReadLen = longReadLen; src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); maxReadNum = maxReadNum % 2 == 0 ? maxReadNum : maxReadNum - 1; //make sure paired reads are processed at the same batch seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); posArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); orienArray = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); footprint = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); deletion = ( int * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( int ) ); thrdSignal[0] = 0; deletion[0] = 0; if ( 1 ) { for ( i = 0; i < thrd_num; i++ ) { rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); deletion[i + 1] = 0; thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( !contig_array ) { basicContigInfo ( outfile ); } sprintf ( name, "%s.longReadInGap", outfile ); outfp2 = ckopen ( name, 
"wb" ); readCounter = 0; kmer_c = n_solexa = read_c = i = libNo = 0; prevLibNo = -1; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 4 ) ) != 0 ) { if ( libNo != prevLibNo ) { prevLibNo = libNo; ALIGNLEN = lib_array[libNo].map_len; ALIGNLEN = ALIGNLEN < 35 ? 35 : ALIGNLEN; printf ( "Map_len %d\n", ALIGNLEN ); } if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } indexArray[read_c] = kmer_c; if ( lenBuffer[read_c] >= overlaplen + 1 ) { kmer_c += lenBuffer[read_c] - overlaplen + 1; } read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordLongRead ( outfp2 ); kmer_c = 0; read_c = 0; } } if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordLongRead ( outfp2 ); printf ( "Output %lld out of %lld (%.1f)%% reads in gaps\n", readsInGap, readCounter, ( float ) readsInGap / readCounter * 100 ); } sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); fclose ( outfp2 ); free_libs(); if ( 1 ) // multi-threads { for ( i = 0; i < thrd_num; i++ ) { deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } printf ( "%d reads deleted\n", deletion[0] ); free ( ( void * ) rcSeq ); free ( ( void * ) deletion ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) ctgIdArray ); free ( ( void * ) posArray ); free ( ( void * ) orienArray ); free ( ( void * ) footprint ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); } SOAPdenovo-V1.05/src/63mer/prlRead2path.c000644 000765 000024 00000045112 11530651532 
020017 0ustar00Aquastaff000000 000000 /* * 63mer/prlRead2path.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include "newhash.h" #include #include #define preARCBLOCKSIZE 100000 static const Kmer kmerZero = {0, 0}; static unsigned int * arcCounters; static int buffer_size = 100000000; static long long markCounter = 0; static unsigned int * fwriteBuf; static unsigned char * markerOnEdge; //buffer related varibles for chop kmer static int read_c; static char ** rcSeq; static char ** seqBuffer; static int * lenBuffer; //edge and (K+1)mer related variables static preARC ** preArc_array; static Kmer * mixBuffer; static boolean * flagArray; //indicate each item in mixBuffer where it's a (K+1)mer // kmer related variables static char ** flags; static int kmer_c; static Kmer * kmerBuffer; static ubyte8 * hashBanBuffer; static kmer_t ** nodeBuffer; static boolean * smallerBuffer; static int * indexArray; static int * deletion; static void parse1read ( int t, int threadID ); static void search1kmerPlus ( int j, unsigned char thrdID ); static void threadRoutine ( void * thrdID ); static void searchKmer ( int t, KmerSet * kset ); static void chopKmer4read ( int t, int threadID ); static void thread_wait ( pthread_t * threads ); static void thread_add1preArc ( unsigned int from_ed, unsigned int to_ed, unsigned int thrdID ); static void 
creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void threadRoutine ( void * para ) { PARAMETER * prm; int i, t, j, start, finish; unsigned char id; prm = ( PARAMETER * ) para; id = prm->threadID; //printf("%dth thread with task %d, hash_table %p\n",id,prm.task,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { for ( i = 0; i < kmer_c; i++ ) { //if((hashBanBuffer[i]&taskMask)!=prm.threadID) if ( ( hashBanBuffer[i] % thrd_num ) != id ) { continue; } searchKmer ( i, KmerSets[id] ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { for ( i = 0; i < read_c; i++ ) { if ( i % thrd_num != id ) { continue; } chopKmer4read ( i, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { // parse reads for ( t = 0; t < read_c; t++ ) { if ( t % thrd_num != id ) { continue; } parse1read ( t, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 4 ) { //printf("thread %d, reads %d splay kmerplus\n",id,read_c); for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish; j++ ) { if ( flagArray[j] == 0 ) { if ( mixBuffer[j].low == 0 ) { break; } } else if ( hashBanBuffer[j] % thrd_num == id ) { //fprintf(stderr,"thread %d search for ban %lld\n",id,hashBanBuffer[j]); search1kmerPlus ( j, id ); } /* if(flagArray[j]==0&&mixBuffer[j]==0) break; if(!flagArray[j]||(hashBanBuffer[j]%thrd_num)!=id) continue; search1kmerPlus(j,id); */ } } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 6 ) { for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish - 
1; j++ ) { if ( mixBuffer[j].low == 0 || mixBuffer[j + 1].low == 0 ) { break; } if ( mixBuffer[j].low % thrd_num != id ) { continue; } thread_add1preArc ( mixBuffer[j].low, mixBuffer[j + 1].low, id ); } } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; ubyte8 hash_ban, bal_hash_ban; Kmer word, bal_word; int index; word = kmerZero; for ( index = 0; index < overlaplen; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; smallerBuffer[index] = 1; hashBanBuffer[index++] = hash_ban; } else { bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; smallerBuffer[index] = 0; hashBanBuffer[index++] = bal_hash_ban; } //printf("%dth: %p with %p\n",kmer_c-1,bal_word,bal_hash_ban); for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; smallerBuffer[index] = 1; hashBanBuffer[index++] = hash_ban; //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; smallerBuffer[index] = 0; hashBanBuffer[index++] = bal_hash_ban; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } //splay for one kmer in buffer and save the node to nodeBuffer static void searchKmer ( int t, 
KmerSet * kset ) { kmer_t * node; boolean found = search_kmerset ( kset, kmerBuffer[t], &node ); if ( !found ) { printf ( "searchKmer: kmer %llx %llx is not found\n", kmerBuffer[t].high, kmerBuffer[t].low ); } nodeBuffer[t] = node; } static preARC * getPreArcBetween ( unsigned int from_ed, unsigned int to_ed ) { preARC * parc; parc = preArc_array[from_ed]; while ( parc ) { if ( parc->to_ed == to_ed ) { return parc; } parc = parc->next; } return parc; } static void thread_add1preArc ( unsigned int from_ed, unsigned int to_ed, unsigned int thrdID ) { preARC * parc = getPreArcBetween ( from_ed, to_ed ); if ( parc ) { parc->multiplicity++; } else { parc = prlAllocatePreArc ( to_ed, preArc_mem_managers[thrdID] ); arcCounters[thrdID]++; parc->next = preArc_array[from_ed]; preArc_array[from_ed] = parc; } } static void memoAlloc4preArc() { unsigned int i; preArc_array = ( preARC ** ) ckalloc ( ( num_ed + 1 ) * sizeof ( preARC * ) ); for ( i = 0; i <= num_ed; i++ ) { preArc_array[i] = NULL; } } static void memoFree4preArc() { prlDestroyPreArcMem(); if ( preArc_array ) { free ( ( void * ) preArc_array ); } } static void output_arcs ( char * outfile ) { unsigned int i; char name[256]; FILE * outfp, *outfp2 = NULL; preARC * parc; sprintf ( name, "%s.preArc", outfile ); outfp = ckopen ( name, "w" ); if ( repsTie ) { sprintf ( name, "%s.markOnEdge", outfile ); outfp2 = ckopen ( name, "w" ); } markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( repsTie ) { markCounter += markerOnEdge[i]; fprintf ( outfp2, "%d\n", markerOnEdge[i] ); } parc = preArc_array[i]; if ( !parc ) { continue; } fprintf ( outfp, "%u", i ); while ( parc ) { fprintf ( outfp, " %u %u", parc->to_ed, parc->multiplicity ); parc = parc->next; } fprintf ( outfp, "\n" ); } fclose ( outfp ); if ( repsTie ) { fclose ( outfp2 ); printf ( "%lld markers counted\n", markCounter ); } } static void recordPathBin ( FILE * outfp ) { int t, j, start, finish; unsigned char counter; for ( t = 0; t < read_c; t++ ) { start = 
indexArray[t]; finish = indexArray[t + 1]; if ( finish - start < 3 || mixBuffer[start].low == 0 || mixBuffer[start + 1].low == 0 || mixBuffer[start + 2].low == 0 ) { continue; } counter = 0; for ( j = start; j < finish; j++ ) { if ( mixBuffer[j].low == 0 ) { break; } fwriteBuf[counter++] = ( unsigned int ) mixBuffer[j].low; if ( markerOnEdge[mixBuffer[j].low] < 255 ) { markerOnEdge[mixBuffer[j].low]++; } markCounter++; } fwrite ( &counter, sizeof ( char ), 1, outfp ); fwrite ( fwriteBuf, sizeof ( unsigned int ), ( int ) counter, outfp ); } } static void search1kmerPlus ( int j, unsigned char thrdID ) { kmer_t * node; boolean found = search_kmerset ( KmerSetsPatch[thrdID], mixBuffer[j], &node ); if ( !found ) { /* fprintf(stderr,"kmerPlus %llx %llx (hashban %lld) not found, flag %d!\n", mixBuffer[j].high,mixBuffer[j].low,hashBanBuffer[j],flagArray[j]); */ mixBuffer[j] = kmerZero; return; } //else fprintf(stderr,"kmerPlus found\n"); if ( smallerBuffer[j] ) { mixBuffer[j].low = node->l_links; } else { mixBuffer[j].low = node->l_links + node->twin - 1; } } static void parse1read ( int t, int threadID ) { unsigned int j, retain = 0; unsigned int edge_index = 0; kmer_t * node; boolean isSmaller; Kmer wordplus, bal_wordplus; unsigned int start, finish, pos; Kmer prevKmer, currentKmer; boolean IsPrevKmer = 0; start = indexArray[t]; finish = indexArray[t + 1]; pos = start; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; //extract edges or keep kmers if ( ( node->deleted ) || ( node->linear && !node->inEdge ) ) // deleted or in a floating loop { if ( retain < 2 ) { retain = 0; pos = start; } else { break; } continue; } isSmaller = smallerBuffer[j]; if ( node->linear ) { if ( isSmaller ) { edge_index = node->l_links; } else { edge_index = node->l_links + node->twin - 1; } if ( retain == 0 || IsPrevKmer ) { retain++; mixBuffer[pos].low = edge_index; flagArray[pos++] = 0; IsPrevKmer = 0; } else if ( edge_index != mixBuffer[pos - 1].low ) { retain++; 
mixBuffer[pos].low = edge_index; flagArray[pos++] = 0; } } else { if ( isSmaller ) { currentKmer = node->seq; } else { currentKmer = reverseComplement ( node->seq, overlaplen ); } if ( IsPrevKmer ) { retain++; wordplus = KmerPlus ( prevKmer, lastCharInKmer ( currentKmer ) ); bal_wordplus = reverseComplement ( wordplus, overlaplen + 1 ); if ( KmerSmaller ( wordplus, bal_wordplus ) ) { smallerBuffer[pos] = 1; hashBanBuffer[pos] = hash_kmer ( wordplus ); mixBuffer[pos] = wordplus; } else { smallerBuffer[pos] = 0; hashBanBuffer[pos] = hash_kmer ( bal_wordplus ); mixBuffer[pos] = bal_wordplus; } // fprintf(stderr,"%lld\n",hashBanBuffer[pos]); flagArray[pos++] = 1; } IsPrevKmer = 1; prevKmer = currentKmer; } } /* for(j=start;j70) lenBuffer[read_c] = 70; else if(lenBuffer[read_c]>40) lenBuffer[read_c] = 40; */ indexArray[read_c] = kmer_c; kmer_c += lenBuffer[read_c] - overlaplen + 1; read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; time ( &read_end ); t0 += read_end - read_start; time ( &time_bef ); sendWorkSignal ( 2, thrdSignal ); time ( &time_aft ); t1 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 1, thrdSignal ); time ( &time_aft ); t2 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 3, thrdSignal ); time ( &time_aft ); t3 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 4, thrdSignal ); time ( &time_aft ); t4 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 6, thrdSignal ); time ( &time_aft ); t5 += time_aft - time_bef; time ( &time_bef ); //recordPreArc(); if ( repsTie ) { recordPathBin ( outfp ); } time ( &time_aft ); t6 += time_aft - time_bef; kmer_c = 0; read_c = 0; time ( &read_start ); } } printf ( "%lld reads processed\n", i ); printf ( "time %d,%d,%d,%d,%d,%d,%d\n", t0, t1, t2, t3, t4, t5, t6 ); if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); sendWorkSignal ( 4, thrdSignal ); sendWorkSignal ( 
6, thrdSignal ); //recordPreArc(); if ( repsTie ) { recordPathBin ( outfp ); } } printf ( "%lld markers outputed\n", markCounter ); sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); output_arcs ( outfile ); memoFree4preArc(); if ( 1 ) // multi-threads { arcCounter = 0; for ( i = 0; i < thrd_num; i++ ) { arcCounter += arcCounters[i]; free ( ( void * ) flags[i + 1] ); deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } if ( 1 ) { free ( ( void * ) flags[0] ); free ( ( void * ) rcSeq[0] ); } printf ( "done mapping reads, %d reads deleted, %lld arcs created\n", deletion[0], arcCounter ); if ( repsTie ) { free ( ( void * ) markerOnEdge ); free ( ( void * ) fwriteBuf ); } free ( ( void * ) arcCounters ); free ( ( void * ) rcSeq ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) flags ); free ( ( void * ) deletion ); free ( ( void * ) kmerBuffer ); free ( ( void * ) mixBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) flagArray ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); if ( repsTie ) { fclose ( outfp ); } free_pe_mem(); free_libs(); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } SOAPdenovo-V1.05/src/63mer/prlReadFillGap.c000644 000765 000024 00000072244 11530651532 020325 0ustar00Aquastaff000000 000000 /* * 63mer/prlReadFillGap.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define RDBLOCKSIZE 50 #define CTGappend 50 static Kmer MAXKMER; static int Ncounter; static int allGaps; // for multi threads static int * counters; static pthread_mutex_t mutex; static int scafBufSize = 100; static boolean * flagBuf; static unsigned char * thrdNoBuf; static STACK ** ctgStackBuffer; static int scafCounter; static int scafInBuf; static void MarkCtgOccu ( unsigned int ctg ); /* static void printRead(int len,char *seq) { int j; fprintf(stderr,">read\n"); for(j=0;jlen = len; rd->dis = pos; rd->seqStarter = starter; } static void convertIndex() { int * length_array = ( int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( int ) ); unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with new index: index_array[i] free ( ( void * ) length_array ); } static long long getRead1by1 ( FILE * fp, DARRAY * readSeqInGap ) { long long readCounter = 0; if ( !fp ) { return readCounter; } int len, ctgID, pos; long long starter; char * pt; char * freadBuf = ( char * ) ckalloc ( ( maxReadLen / 4 + 1 ) * sizeof ( char ) ); while ( fread ( &len, sizeof ( int ), 1, fp ) == 1 ) { if ( fread ( &ctgID, sizeof ( int ), 1, fp ) != 1 ) { break; } if ( fread ( &pos, sizeof ( int ), 1, fp ) != 1 ) { break; } if ( fread ( freadBuf, sizeof ( char ), len / 4 + 1, fp ) != ( unsigned ) ( len / 4 + 1 ) ) { break; } //put seq to dynamic array starter = 
readSeqInGap->item_c; if ( !darrayPut ( readSeqInGap, starter + len / 4 ) ) /* make sure there's room for this seq */ { break; } pt = ( char * ) darrayPut ( readSeqInGap, starter ); bcopy ( freadBuf, pt, len / 4 + 1 ); attach1read2contig ( ctgID, len, pos, starter ); readCounter++; } free ( ( void * ) freadBuf ); return readCounter; } /* Darray *readSeqInGap */ static boolean loadReads4gap ( char * graphfile ) { FILE * fp, *fp2; char name[1024]; long long readCounter; sprintf ( name, "%s.readInGap", graphfile ); fp = fopen ( name, "rb" ); sprintf ( name, "%s.longReadInGap", graphfile ); fp2 = fopen ( name, "rb" ); if ( !fp && !fp2 ) { return 0; } if ( !orig2new ) { convertIndex(); orig2new = 1; } readSeqInGap = ( DARRAY * ) createDarray ( 1000000, sizeof ( char ) ); if ( fp ) { readCounter = getRead1by1 ( fp, readSeqInGap ); printf ( "Loaded %lld reads from %s.readInGap\n", readCounter, graphfile ); fclose ( fp ); } if ( fp2 ) { readCounter = getRead1by1 ( fp2, readSeqInGap ); printf ( "Loaded %lld reads from %s.longReadInGap\n", readCounter, graphfile ); fclose ( fp2 ); } return 1; } static void debugging1() { unsigned int i; if ( orig2new ) { unsigned int * length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); /* use length_array to change info in index_array */ for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } /* contig i with original index: index_array[i] */ free ( ( void * ) length_array ); orig2new = 0; } READNEARBY * rd; int j; char * pt; for ( i = 1; i <= num_ctg; i++ ) { if ( !contig_array[i].closeReads ) { continue; } if ( index_array[i] != 735 ) { continue; } printf ( "contig %d, len %d: \n", index_array[i], contig_array[i].length ); stackBackup ( contig_array[i].closeReads ); while ( ( rd = ( READNEARBY * ) stackPop ( contig_array[i].closeReads ) ) != NULL ) { printf ( "%d\t%d\t%lld\t", 
rd->dis, rd->len, rd->seqStarter ); pt = ( char * ) darrayGet ( readSeqInGap, rd->seqStarter ); for ( j = 0; j < rd->len; j++ ) { printf ( "%c", int2base ( ( int ) getCharInTightString ( pt, j ) ) ); } printf ( "\n" ); } stackRecover ( contig_array[i].closeReads ); } } static void initiateCtgInScaf ( CTGinSCAF * actg ) { actg->cutTail = 0; actg->cutHead = overlaplen; actg->gapSeqLen = 0; } static int procGap ( char * line, STACK * ctgsStack ) { char * tp; int length, i, seg; unsigned int ctg; CTGinSCAF * ctgPt; tp = strtok ( line, " " ); tp = strtok ( NULL, " " ); //length length = atoi ( tp ); tp = strtok ( NULL, " " ); //seg seg = atoi ( tp ); if ( !seg ) { return length; } for ( i = 0; i < seg; i++ ) { tp = strtok ( NULL, " " ); ctg = atoi ( tp ); MarkCtgOccu ( ctg ); ctgPt = ( CTGinSCAF * ) stackPush ( ctgsStack ); initiateCtgInScaf ( ctgPt ); ctgPt->ctgID = ctg; ctgPt->start = 0; ctgPt->end = 0; ctgPt->scaftig_start = 0; ctgPt->mask = 1; } return length; } static void debugging2 ( int index, STACK * ctgsStack ) { CTGinSCAF * actg; stackBackup ( ctgsStack ); printf ( ">scaffold%d\t%d 0.0\n", index, ctgsStack->item_c ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { printf ( "%d\t%d\t%d\t%d\n", actg->ctgID, actg->start, actg->end, actg->scaftig_start ); } stackRecover ( ctgsStack ); } static int cmp_reads ( const void * a, const void * b ) { READNEARBY * A, *B; A = ( READNEARBY * ) a; B = ( READNEARBY * ) b; if ( A->dis > B->dis ) { return 1; } else if ( A->dis == B->dis ) { return 0; } else { return -1; } } static void cutRdArray ( READNEARBY * rdArray, int gapStart, int gapEnd, int * count, int arrayLen, READNEARBY * cutArray ) { int i; int num = 0; for ( i = 0; i < arrayLen; i++ ) { if ( rdArray[i].dis > gapEnd ) { break; } if ( ( rdArray[i].dis + rdArray[i].len ) >= gapStart ) { cutArray[num].dis = rdArray[i].dis; cutArray[num].len = rdArray[i].len; cutArray[num++].seqStarter = rdArray[i].seqStarter; } } *count = num; } static void outputTightStr ( 
FILE * fp, char * tightStr, int start, int length, int outputlen, int revS, int * col ) { int i; int end; int column = *col; if ( !revS ) { end = start + outputlen <= length ? start + outputlen : length; for ( i = start; i < end; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( tightStr, i ) ) ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } } else { end = length - start - outputlen - 1 >= 0 ? length - start - outputlen : 0; for ( i = length - 1 - start; i >= end; i-- ) { fprintf ( fp, "%c", int2compbase ( ( int ) getCharInTightString ( tightStr, i ) ) ); if ( ( ++column ) % 100 == 0 ) { fprintf ( fp, "\n" ); //column = 0; } } } *col = column; } static void outputTightStrLowerCase ( FILE * fp, char * tightStr, int start, int length, int outputlen, int revS, int * col ) { int i; int end; int column = *col; if ( !revS ) { end = start + outputlen <= length ? start + outputlen : length; for ( i = start; i < end; i++ ) { fprintf ( fp, "%c", "actg"[ ( int ) getCharInTightString ( tightStr, i )] ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } } else { end = length - start - outputlen - 1 >= 0 ? 
length - start - outputlen : 0; for ( i = length - 1 - start; i >= end; i-- ) { fprintf ( fp, "%c", "tgac"[ ( int ) getCharInTightString ( tightStr, i )] ); if ( ( ++column ) % 100 == 0 ) { fprintf ( fp, "\n" ); //column = 0; } } } *col = column; } static void outputNs ( FILE * fp, int gapN, int * col ) { int i, column = *col; for ( i = 0; i < gapN; i++ ) { fprintf ( fp, "N" ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } *col = column; } static void outputGapInfo ( unsigned int ctg1, unsigned int ctg2 ) { unsigned int bal_ctg1 = getTwinCtg ( ctg1 ); unsigned int bal_ctg2 = getTwinCtg ( ctg2 ); if ( isLargerThanTwin ( ctg1 ) ) { fprintf ( stderr, "%d\t", index_array[bal_ctg1] ); } else { fprintf ( stderr, "%d\t", index_array[ctg1] ); } if ( isLargerThanTwin ( ctg2 ) ) { fprintf ( stderr, "%d\n", index_array[bal_ctg2] ); } else { fprintf ( stderr, "%d\n", index_array[ctg2] ); } } static void output1gap ( FILE * fo, int scafIndex, CTGinSCAF * prevCtg, CTGinSCAF * actg, DARRAY * gapSeqArray ) { unsigned int ctg1, bal_ctg1, length1; int start1, outputlen1; unsigned int ctg2, bal_ctg2, length2; int start2, outputlen2; char * pt; int column = 0; ctg1 = prevCtg->ctgID; bal_ctg1 = getTwinCtg ( ctg1 ); start1 = prevCtg->cutHead; length1 = contig_array[ctg1].length + overlaplen; if ( length1 - prevCtg->cutTail - start1 > CTGappend ) { outputlen1 = CTGappend; start1 = length1 - prevCtg->cutTail - outputlen1; } else { outputlen1 = length1 - prevCtg->cutTail - start1; } ctg2 = actg->ctgID; bal_ctg2 = getTwinCtg ( ctg2 ); start2 = actg->cutHead; length2 = contig_array[ctg2].length + overlaplen; if ( length2 - actg->cutTail - start2 > CTGappend ) { outputlen2 = CTGappend; } else { outputlen2 = length2 - actg->cutTail - start2; } if ( isLargerThanTwin ( ctg1 ) ) { fprintf ( fo, ">S%d_C%d_L%d_G%d", scafIndex, index_array[bal_ctg1], outputlen1, prevCtg->gapSeqLen ); } else { fprintf ( fo, ">S%d_C%d_L%d_G%d", scafIndex, index_array[ctg1], outputlen1, 
prevCtg->gapSeqLen ); } if ( isLargerThanTwin ( ctg2 ) ) { fprintf ( fo, "_C%d_L%d\n", index_array[bal_ctg2], outputlen2 ); } else { fprintf ( fo, "_C%d_L%d\n", index_array[ctg2], outputlen2 ); } if ( contig_array[ctg1].seq ) { outputTightStr ( fo, contig_array[ctg1].seq, start1, length1, outputlen1, 0, &column ); } else if ( contig_array[bal_ctg1].seq ) { outputTightStr ( fo, contig_array[bal_ctg1].seq, start1, length1, outputlen1, 1, &column ); } pt = ( char * ) darrayPut ( gapSeqArray, prevCtg->gapSeqOffset ); outputTightStrLowerCase ( fo, pt, 0, prevCtg->gapSeqLen, prevCtg->gapSeqLen, 0, &column ); if ( contig_array[ctg2].seq ) { outputTightStr ( fo, contig_array[ctg2].seq, start2, length2, outputlen2, 0, &column ); } else if ( contig_array[bal_ctg2].seq ) { outputTightStr ( fo, contig_array[bal_ctg2].seq, start2, length2, outputlen2, 1, &column ); } fprintf ( fo, "\n" ); } static void outputGapSeq ( FILE * fo, int index, STACK * ctgsStack, DARRAY * gapSeqArray ) { CTGinSCAF * actg, *prevCtg = NULL; stackRecover ( ctgsStack ); fprintf ( fo, ">scaffold%d\n", index ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( prevCtg ) { if ( actg->scaftig_start ) { fprintf ( fo, "0\t%d\t%d\n", prevCtg->mask, actg->mask ); } else { fprintf ( fo, "1\t%d\t%d\n", prevCtg->mask, actg->mask ); } } /* if(prevCtg&&prevCtg->gapSeqLen>0) output1gap(fo,index,prevCtg,actg,gapSeqArray); */ prevCtg = actg; } } static void outputScafSeq ( FILE * fo, int index, STACK * ctgsStack, DARRAY * gapSeqArray ) { CTGinSCAF * actg, *prevCtg = NULL; unsigned int ctg, bal_ctg, length; int start, outputlen, gapN; char * pt; int column = 0; long long cvgSum = 0; int lenSum = 0; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( ! 
( contig_array[actg->ctgID].cvg > 0 ) ) { continue; } lenSum += contig_array[actg->ctgID].length; cvgSum += contig_array[actg->ctgID].length * contig_array[actg->ctgID].cvg; } if ( lenSum > 0 ) { fprintf ( fo, ">scaffold%d %4.1f\n", index, ( double ) cvgSum / lenSum ); } else { fprintf ( fo, ">scaffold%d 0.0\n", index ); } stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); length = contig_array[ctg].length + overlaplen; if ( prevCtg && actg->scaftig_start ) { gapN = actg->start - prevCtg->start - contig_array[prevCtg->ctgID].length; gapN = gapN > 0 ? gapN : 1; outputNs ( fo, gapN, &column ); //outputGapInfo(prevCtg->ctgID,ctg); Ncounter++; } if ( !prevCtg ) { start = 0; } else { start = actg->cutHead; } outputlen = length - start - actg->cutTail; if ( contig_array[ctg].seq ) { outputTightStr ( fo, contig_array[ctg].seq, start, length, outputlen, 0, &column ); } else if ( contig_array[bal_ctg].seq ) { outputTightStr ( fo, contig_array[bal_ctg].seq, start, length, outputlen, 1, &column ); } if ( actg->gapSeqLen < 1 ) { prevCtg = actg; continue; } pt = ( char * ) darrayPut ( gapSeqArray, actg->gapSeqOffset ); outputTightStrLowerCase ( fo, pt, 0, actg->gapSeqLen, actg->gapSeqLen, 0, &column ); prevCtg = actg; } fprintf ( fo, "\n" ); } static void fill1scaf ( int index, STACK * ctgsStack, int thrdID ); static void check1scaf ( int t, int thrdID ) { if ( flagBuf[t] ) { return; } boolean late = 0; pthread_mutex_lock ( &mutex ); if ( !flagBuf[t] ) { flagBuf[t] = 1; thrdNoBuf[t] = thrdID; } else { late = 1; } pthread_mutex_unlock ( &mutex ); if ( late ) { return; } counters[thrdID]++; fill1scaf ( scafCounter + t + 1, ctgStackBuffer[t], thrdID ); } static void fill1scaf ( int index, STACK * ctgsStack, int thrdID ) { CTGinSCAF * actg, *prevCtg = NULL; READNEARBY * rdArray, *rdArray4gap, *rd; int numRd = 0, count, maxGLen = 0; unsigned int ctg, bal_ctg; STACK * rdStack; while ( ( actg = stackPop ( 
ctgsStack ) ) != NULL ) { if ( prevCtg ) { maxGLen = maxGLen < ( actg->start - prevCtg->end ) ? ( actg->start - prevCtg->end ) : maxGLen; } ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( actg->mask ) { prevCtg = actg; continue; } if ( contig_array[ctg].closeReads ) { numRd += contig_array[ctg].closeReads->item_c; } else if ( contig_array[bal_ctg].closeReads ) { numRd += contig_array[bal_ctg].closeReads->item_c; } prevCtg = actg; } if ( numRd < 1 ) { return; } rdArray = ( READNEARBY * ) ckalloc ( numRd * sizeof ( READNEARBY ) ); rdArray4gap = ( READNEARBY * ) ckalloc ( numRd * sizeof ( READNEARBY ) ); /* fprintf(stderr,"scaffold%d reads4gap %d\n",index,numRd); */ /* collect reads appended to contigs in this scaffold */ int numRd2 = 0; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( actg->mask ) { continue; } if ( contig_array[ctg].closeReads ) { rdStack = contig_array[ctg].closeReads; } else if ( contig_array[bal_ctg].closeReads ) { rdStack = contig_array[bal_ctg].closeReads; } else { continue; } stackBackup ( rdStack ); while ( ( rd = ( READNEARBY * ) stackPop ( rdStack ) ) != NULL ) { rdArray[numRd2].len = rd->len; rdArray[numRd2].seqStarter = rd->seqStarter; if ( isSmallerThanTwin ( ctg ) ) { rdArray[numRd2++].dis = actg->start - overlaplen + rd->dis; } else { rdArray[numRd2++].dis = actg->start - overlaplen + contig_array[ctg].length - rd->len - rd->dis; } } stackRecover ( rdStack ); } if ( numRd2 != numRd ) { printf ( "## read counts don't match, %d vs %d in scaffold %d\n", numRd, numRd2, index ); } qsort ( rdArray, numRd, sizeof ( READNEARBY ), cmp_reads ); /* fill gap one by one */ int gapStart, gapEnd; int numIn = 0; boolean flag; int buffer_size = maxReadLen > 100 ? maxReadLen : 100; int maxGSLen = maxGLen + GLDiff < 10 ? 
10 : maxGLen + GLDiff; //fprintf(stderr,"maxGlen %d, maxGSlen %d\n",maxGLen,maxGSLen); char * seqGap = ( char * ) ckalloc ( maxGSLen * sizeof ( char ) ); // temp array for gap sequence Kmer * kmerCtg1 = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); Kmer * kmerCtg2 = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); char * seqCtg1 = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); char * seqCtg2 = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); prevCtg = NULL; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( !prevCtg || !actg->scaftig_start ) { prevCtg = actg; continue; } gapStart = prevCtg->end - 100; gapEnd = actg->start - overlaplen + 100; cutRdArray ( rdArray, gapStart, gapEnd, &count, numRd, rdArray4gap ); numIn += count; /* if(!count){ prevCtg = actg; continue; } */ int overlap; for ( overlap = overlaplen; overlap > 14; overlap -= 2 ) { flag = localGraph ( rdArray4gap, count, prevCtg, actg, overlaplen, kmerCtg1, kmerCtg2, overlap, darrayBuf[thrdID], seqCtg1, seqCtg2, seqGap ); //free_kmerset(kmerSet); if ( flag == 1 ) { /* fprintf(stderr,"Between ctg %d and %d, Found with %d\n",prevCtg->ctgID ,actg->ctgID,overlap); */ break; } } /* if(count==0) printf("Gap closed without reads\n"); if(!flag) fprintf(stderr,"Between ctg %d and %d, NO routes found\n",prevCtg->ctgID,actg->ctgID); */ prevCtg = actg; } //fprintf(stderr,"____scaffold%d reads in gap %d\n",index,numIn); free ( ( void * ) seqGap ); free ( ( void * ) kmerCtg1 ); free ( ( void * ) kmerCtg2 ); free ( ( void * ) seqCtg1 ); free ( ( void * ) seqCtg2 ); free ( ( void * ) rdArray ); free ( ( void * ) rdArray4gap ); } static void reverseStack ( STACK * dStack, STACK * sStack ) { CTGinSCAF * actg, *ctgPt; emptyStack ( dStack ); while ( ( actg = ( CTGinSCAF * ) stackPop ( sStack ) ) != NULL ) { ctgPt = ( CTGinSCAF * ) stackPush ( dStack ); ctgPt->ctgID = actg->ctgID; ctgPt->start = actg->start; ctgPt->end = actg->end; ctgPt->scaftig_start = 
actg->scaftig_start; ctgPt->mask = actg->mask; ctgPt->cutHead = actg->cutHead; ctgPt->cutTail = actg->cutTail; ctgPt->gapSeqLen = actg->gapSeqLen; ctgPt->gapSeqOffset = actg->gapSeqOffset; } stackBackup ( dStack ); } static Kmer tightStr2Kmer ( char * tightStr, int start, int length, int revS ) { int i; Kmer word; word.high = word.low = 0; if ( !revS ) { if ( start + overlaplen > length ) { printf ( "tightStr2Kmer A: not enough bases for kmer\n" ); return word; } for ( i = start; i < start + overlaplen; i++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= getCharInTightString ( tightStr, i ); } } else { if ( length - start - overlaplen < 0 ) { printf ( "tightStr2Kmer B: not enough bases for kmer\n" ); return word; } for ( i = length - 1 - start; i > length - 1 - start - overlaplen; i-- ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= int_comp ( getCharInTightString ( tightStr, i ) ); } } return word; } static Kmer maxKmer() { Kmer word; word.high = word.low = 0; int i; for ( i = 0; i < overlaplen; i++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low |= 0x3; } return word; } static int contigCatch ( unsigned int prev_ctg, unsigned int ctg ) { if ( contig_array[prev_ctg].length == 0 || contig_array[ctg].length == 0 ) { return 0; } Kmer kmerAtEnd, kmerAtStart; Kmer MaxKmer; unsigned int bal_ctg1 = getTwinCtg ( prev_ctg ); unsigned int bal_ctg2 = getTwinCtg ( ctg ); int i, start; int len1 = contig_array[prev_ctg].length + overlaplen; int len2 = contig_array[ctg].length + overlaplen; start = contig_array[prev_ctg].length; if ( contig_array[prev_ctg].seq ) { kmerAtEnd = tightStr2Kmer ( contig_array[prev_ctg].seq, start, len1, 0 ); } else { kmerAtEnd = tightStr2Kmer ( contig_array[bal_ctg1].seq, start, len1, 1 ); } start = 0; if ( contig_array[ctg].seq ) { kmerAtStart = tightStr2Kmer ( contig_array[ctg].seq, start, len2, 0 ); } else { kmerAtStart = tightStr2Kmer ( contig_array[bal_ctg2].seq, start, len2, 1 ); } MaxKmer = MAXKMER; for ( i = 0; i < 10; i++ ) { if ( 
KmerEqual ( kmerAtStart, kmerAtEnd ) ) { break; } MaxKmer = KmerRightBitMoveBy2 ( MaxKmer ); kmerAtEnd = KmerAnd ( kmerAtEnd, MaxKmer ); kmerAtStart = KmerRightBitMoveBy2 ( kmerAtStart ); } if ( i < 10 ) { return overlaplen - i; } else { return 0; } } static void initStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { flagBuf[i] = 1; ctgStackBuffer[i] = ( STACK * ) createStack ( 100, sizeof ( CTGinSCAF ) ); } } static void freeStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { freeStack ( ctgStackBuffer[i] ); } } /* pthread start routines must have the signature void *(*)(void *) */ static void * threadRoutine ( void * para ) { PARAMETER * prm; int i; prm = ( PARAMETER * ) para; /* printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table); */ while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { emptyDarray ( darrayBuf[prm->threadID] ); for ( i = 0; i < scafInBuf; i++ ) { check1scaf ( i, prm->threadID ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } return NULL; } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { if ( ( temp = pthread_create ( &threads[i], NULL, threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d threads created\n...\n", thrd_num ); } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void outputSeqs ( FILE * fo, FILE * fo2, int scafInBuf ) { int i, thrdID; for ( i = 0; i < scafInBuf; i++ ) { thrdID = 
thrdNoBuf[i]; outputScafSeq ( fo, scafCounter + i + 1, ctgStackBuffer[i], darrayBuf[thrdID] ); outputGapSeq ( fo2, scafCounter + i + 1, ctgStackBuffer[i], darrayBuf[thrdID] ); } } static void MaskContig ( unsigned int ctg ) { contig_array[ctg].mask = 1; contig_array[getTwinCtg ( ctg )].mask = 1; } static void MarkCtgOccu ( unsigned int ctg ) { contig_array[ctg].flag = 1; contig_array[getTwinCtg ( ctg )].flag = 1; } static void output_ctg ( unsigned int ctg, FILE * fo ) { if ( contig_array[ctg].length < 1 ) { return; } int len; unsigned int bal_ctg = getTwinCtg ( ctg ); len = contig_array[ctg].length + overlaplen; int col = 0; if ( contig_array[ctg].seq ) { fprintf ( fo, ">C%d %4.1f\n", ctg, ( double ) contig_array[ctg].cvg ); outputTightStr ( fo, contig_array[ctg].seq, 0, len, len, 0, &col ); } else if ( contig_array[bal_ctg].seq ) { fprintf ( fo, ">C%d %4.1f\n", bal_ctg, ( double ) contig_array[ctg].cvg ); outputTightStr ( fo, contig_array[bal_ctg].seq, 0, len, len, 0, &col ); } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; fprintf ( fo, "\n" ); } void prlReadsCloseGap ( char * graphfile ) { //thrd_num=1; if ( fillGap ) { boolean flag; printf ( "\nStart to load reads for gap filling. 
%d length discrepancy is allowed\n", GLDiff ); printf ( "...\n" ); flag = loadReads4gap ( graphfile ); if ( !flag ) { return; } } if ( orig2new ) { convertIndex(); orig2new = 0; } FILE * fp, *fo, *fo2; char line[1024]; CTGinSCAF * actg; STACK * ctgStack, *aStack; int index = 0, offset = 0, counter, overallLen; int i, starter, prev_start, gapLen, catchable; unsigned int ctg, prev_ctg = 0; boolean IsPrevGap; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; for ( ctg = 1; ctg <= num_ctg; ctg++ ) { contig_array[ctg].flag = 0; } MAXKMER = maxKmer(); ctgStack = ( STACK * ) createStack ( 1000, sizeof ( CTGinSCAF ) ); sprintf ( line, "%s.scaf_gap", graphfile ); fp = ckopen ( line, "r" ); sprintf ( line, "%s.scafSeq", graphfile ); fo = ckopen ( line, "w" ); sprintf ( line, "%s.gapSeq", graphfile ); fo2 = ckopen ( line, "w" ); pthread_mutex_init ( &mutex, NULL ); flagBuf = ( boolean * ) ckalloc ( scafBufSize * sizeof ( boolean ) );; thrdNoBuf = ( unsigned char * ) ckalloc ( scafBufSize * sizeof ( unsigned char ) );; memset ( thrdNoBuf, 0, scafBufSize * sizeof ( char ) ); ctgStackBuffer = ( STACK ** ) ckalloc ( scafBufSize * sizeof ( STACK * ) ); initStackBuf ( ctgStackBuffer, scafBufSize ); darrayBuf = ( DARRAY ** ) ckalloc ( thrd_num * sizeof ( DARRAY * ) ); counters = ( int * ) ckalloc ( thrd_num * sizeof ( int ) ); for ( i = 0; i < thrd_num; i++ ) { counters[i] = 0; darrayBuf[i] = ( DARRAY * ) createDarray ( 100000, sizeof ( char ) ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } if ( fillGap ) { creatThrds ( threads, paras ); } Ncounter = scafCounter = scafInBuf = allGaps = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index ) { aStack = ctgStackBuffer[scafInBuf]; flagBuf[scafInBuf++] = 0; reverseStack ( aStack, ctgStack ); if ( scafInBuf == scafBufSize ) { if ( fillGap ) { sendWorkSignal ( 1, 
thrdSignal ); } outputSeqs ( fo, fo2, scafInBuf ); scafCounter += scafInBuf; scafInBuf = 0; } if ( index % 1000 == 0 ) { printf ( "Processed %d scaffolds\n", index ); } } //read next scaff emptyStack ( ctgStack ); IsPrevGap = offset = prev_ctg = 0; sscanf ( line + 9, "%d %d %d", &index, &counter, &overallLen ); continue; } if ( line[0] == 'G' ) // gap appears { if ( fillGap ) { gapLen = procGap ( line, ctgStack ); IsPrevGap = 1; } continue; } if ( line[0] >= '0' && line[0] <= '9' ) // a contig line { sscanf ( line, "%d %d", &ctg, &starter ); actg = ( CTGinSCAF * ) stackPush ( ctgStack ); actg->ctgID = ctg; if ( contig_array[ctg].flag ) { MaskContig ( ctg ); } else { MarkCtgOccu ( ctg ); } initiateCtgInScaf ( actg ); if ( !prev_ctg ) { actg->cutHead = 0; } else if ( !IsPrevGap ) { allGaps++; } if ( !IsPrevGap ) { if ( prev_ctg && ( starter - prev_start - ( int ) contig_array[prev_ctg].length ) < ( ( int ) overlaplen * 4 ) ) { /* if(fillGap) catchable = contigCatch(prev_ctg,ctg); else */ catchable = 0; if ( catchable ) // prev_ctg and ctg overlap **bp { allGaps--; /* if(isLargerThanTwin(prev_ctg)) fprintf(stderr,"%d ####### by_overlap\n",getTwinCtg(prev_ctg)); else fprintf(stderr,"%d ####### by_overlap\n",prev_ctg); */ actg->scaftig_start = 0; actg->cutHead = catchable; offset += - ( starter - prev_start - contig_array[prev_ctg].length ) + ( overlaplen - catchable ); } else { actg->scaftig_start = 1; } } else { actg->scaftig_start = 1; } } else { offset += - ( starter - prev_start - contig_array[prev_ctg].length ) + gapLen; actg->scaftig_start = 0; } actg->start = starter + offset; actg->end = actg->start + contig_array[ctg].length - 1; actg->mask = contig_array[ctg].mask; IsPrevGap = 0; prev_ctg = ctg; prev_start = starter; } } if ( index ) { aStack = ctgStackBuffer[scafInBuf]; flagBuf[scafInBuf++] = 0; reverseStack ( aStack, ctgStack ); if ( fillGap ) { sendWorkSignal ( 1, thrdSignal ); } outputSeqs ( fo, fo2, scafInBuf ); } if ( fillGap ) { sendWorkSignal ( 2, 
thrdSignal ); thread_wait ( threads ); } for ( ctg = 1; ctg <= num_ctg; ctg++ ) { if ( ( contig_array[ctg].length + overlaplen ) < 100 || contig_array[ctg].flag ) { continue; } output_ctg ( ctg, fo ); } printf ( "Done with %d scaffolds, %d gaps finished, %d gaps overall\n", index, allGaps - Ncounter, allGaps ); index = 0; for ( i = 0; i < thrd_num; i++ ) { freeDarray ( darrayBuf[i] ); index += counters[i]; } if ( fillGap ) { printf ( "Threads processed %d scaffolds\n", index ); } free ( ( void * ) darrayBuf ); free ( ( void * ) counters ); if ( readSeqInGap ) { freeDarray ( readSeqInGap ); } fclose ( fp ); fclose ( fo ); fclose ( fo2 ); freeStack ( ctgStack ); freeStackBuf ( ctgStackBuffer, scafBufSize ); free ( ( void * ) flagBuf ); free ( ( void * ) thrdNoBuf ); free ( ( void * ) ctgStackBuffer ); pthread_mutex_destroy ( &mutex ); } SOAPdenovo-V1.05/src/63mer/read2scaf.c000644 000765 000024 00000016445 11530651532 017330 0ustar00Aquastaff000000 000000 /* * 63mer/read2scaf.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int Ncounter; static int allGaps; // for multi threads static int scafBufSize = 100; static STACK ** ctgStackBuffer; static int scafCounter; static int scafInBuf; static void convertIndex() { int * length_array = ( int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( int ) ); unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with new index: index_array[i] free ( ( void * ) length_array ); } static void reverseStack ( STACK * dStack, STACK * sStack ) { CTGinSCAF * actg, *ctgPt; emptyStack ( dStack ); while ( ( actg = ( CTGinSCAF * ) stackPop ( sStack ) ) != NULL ) { ctgPt = ( CTGinSCAF * ) stackPush ( dStack ); ctgPt->ctgID = actg->ctgID; ctgPt->start = actg->start; ctgPt->end = actg->end; } stackBackup ( dStack ); } static void initStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { ctgStackBuffer[i] = ( STACK * ) createStack ( 100, sizeof ( CTGinSCAF ) ); } } static void freeStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { freeStack ( ctgStackBuffer[i] ); } } static void mapCtg2Scaf ( int scafInBuf ) { int i, scafID; CTGinSCAF * actg; STACK * ctgsStack; unsigned int ctg, bal_ctg; for ( i = 0; i < scafInBuf; i++ ) { scafID = scafCounter + i + 1; ctgsStack = ctgStackBuffer[i]; while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( contig_array[ctg].from_vt != 0 ) { contig_array[ctg].multi = 1; contig_array[bal_ctg].multi = 1; continue; } contig_array[ctg].from_vt = scafID; contig_array[ctg].to_vt = actg->start; contig_array[ctg].flag = 0; //ctg and scaf on the same strand contig_array[bal_ctg].from_vt = scafID; 
contig_array[bal_ctg].to_vt = actg->start; contig_array[bal_ctg].flag = 1; } } } static void locateContigOnscaff ( char * graphfile ) { FILE * fp; char line[1024]; CTGinSCAF * actg; STACK * ctgStack, *aStack; int index = 0, counter, overallLen; int starter, prev_start, gapN, scafLen; unsigned int ctg, prev_ctg = 0; for ( ctg = 1; ctg <= num_ctg; ctg++ ) { contig_array[ctg].from_vt = 0; contig_array[ctg].multi = 0; } ctgStack = ( STACK * ) createStack ( 1000, sizeof ( CTGinSCAF ) ); sprintf ( line, "%s.scaf_gap", graphfile ); fp = ckopen ( line, "r" ); ctgStackBuffer = ( STACK ** ) ckalloc ( scafBufSize * sizeof ( STACK * ) ); initStackBuf ( ctgStackBuffer, scafBufSize ); Ncounter = scafCounter = scafInBuf = allGaps = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index ) { aStack = ctgStackBuffer[scafInBuf++]; reverseStack ( aStack, ctgStack ); if ( scafInBuf == scafBufSize ) { mapCtg2Scaf ( scafInBuf ); scafCounter += scafInBuf; scafInBuf = 0; } if ( index % 1000 == 0 ) { printf ( "Processed %d scaffolds\n", index ); } } //read next scaff scafLen = prev_ctg = 0; emptyStack ( ctgStack ); sscanf ( line + 9, "%d %d %d", &index, &counter, &overallLen ); //fprintf(stderr,">%d\n",index); continue; } if ( line[0] == 'G' ) // gap appears { continue; } if ( line[0] >= '0' && line[0] <= '9' ) // a contig line { sscanf ( line, "%d %d", &ctg, &starter ); actg = ( CTGinSCAF * ) stackPush ( ctgStack ); actg->ctgID = ctg; if ( !prev_ctg ) { actg->start = scafLen; actg->end = actg->start + overlaplen + contig_array[ctg].length - 1; } else { gapN = starter - prev_start - ( int ) contig_array[prev_ctg].length; gapN = gapN < 1 ? 
1 : gapN; actg->start = scafLen + gapN; actg->end = actg->start + contig_array[ctg].length - 1; } /* fprintf(stderr,"%d\t%d\n",actg->start,actg->end); */ scafLen = actg->end + 1; prev_ctg = ctg; prev_start = starter; } } if ( index ) { aStack = ctgStackBuffer[scafInBuf++]; reverseStack ( aStack, ctgStack ); mapCtg2Scaf ( scafInBuf ); } gapN = 0; for ( ctg = 1; ctg <= num_ctg; ctg++ ) { if ( contig_array[ctg].from_vt == 0 || contig_array[ctg].multi == 1 ) { continue; } gapN++; } printf ( "\nDone with %d scaffolds, %d contigs in scaffolds\n", index, gapN ); /* if(readSeqInGap) freeDarray(readSeqInGap); */ fclose ( fp ); freeStack ( ctgStack ); freeStackBuf ( ctgStackBuffer, scafBufSize ); free ( ( void * ) ctgStackBuffer ); } static boolean contigElligible ( unsigned int contigno ) { unsigned int ctg = index_array[contigno]; if ( contig_array[ctg].from_vt == 0 || contig_array[ctg].multi == 1 ) { return 0; } else { return 1; } } static void output1read ( FILE * fo, long long readno, unsigned int contigno, int pos ) { unsigned int ctg = index_array[contigno]; int posOnScaf; char orien; pos = pos < 0 ? 
0 : pos; if ( contig_array[ctg].flag == 0 ) { posOnScaf = contig_array[ctg].to_vt + pos - overlaplen; orien = '+'; } else { posOnScaf = contig_array[ctg].to_vt + contig_array[ctg].length - pos; orien = '-'; } /* if(readno==676) printf("Read %lld in region from %d, extend %d, pos %d, orien %c\n", readno,contig_array[ctg].to_vt,contig_array[ctg].length,posOnScaf,orien); */ fprintf ( fo, "%lld\t%d\t%d\t%c\n", readno, contig_array[ctg].from_vt, posOnScaf, orien ); } void locateReadOnScaf ( char * graphfile ) { char name[1024], line[1024]; FILE * fp, *fo; long long readno, counter = 0, pre_readno = 0; unsigned int contigno, pre_contigno; int pre_pos, pos; locateContigOnscaff ( graphfile ); sprintf ( name, "%s.readOnContig", graphfile ); fp = ckopen ( name, "r" ); sprintf ( name, "%s.readOnScaf", graphfile ); fo = ckopen ( name, "w" ); if ( !orig2new ) { convertIndex(); orig2new = 1; } fgets ( line, 1024, fp ); while ( fgets ( line, 1024, fp ) != NULL ) { sscanf ( line, "%lld %d %d", &readno, &contigno, &pos ); if ( ( readno % 2 == 0 ) && ( pre_readno == readno - 1 ) // they are a pair of reads && contigElligible ( pre_contigno ) && contigElligible ( contigno ) ) { output1read ( fo, pre_readno, pre_contigno, pre_pos ); output1read ( fo, readno, contigno, pos ); counter++; } pre_readno = readno; pre_contigno = contigno; pre_pos = pos; } printf ( "%lld pairs on contig\n", counter ); fclose ( fp ); fclose ( fo ); } SOAPdenovo-V1.05/src/63mer/readInterval.c000644 000765 000024 00000003112 11530651532 020101 0ustar00Aquastaff000000 000000 /* * 63mer/readInterval.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define RVBLOCKSIZE 1000 void destroyReadIntervMem() { freeMem_manager ( rv_mem_manager ); rv_mem_manager = NULL; } READINTERVAL * allocateRV ( int readid, int edgeid ) { READINTERVAL * newRV; newRV = ( READINTERVAL * ) getItem ( rv_mem_manager ); newRV->readid = readid; newRV->edgeid = edgeid; newRV->nextInRead = NULL; newRV->prevInRead = NULL; newRV->nextOnEdge = NULL; newRV->prevOnEdge = NULL; return newRV; } void dismissRV ( READINTERVAL * rv ) { returnItem ( rv_mem_manager, rv ); } void createRVmemo() { if ( !rv_mem_manager ) { rv_mem_manager = createMem_manager ( RVBLOCKSIZE, sizeof ( READINTERVAL ) ); } else { printf ( "Warning from createRVmemo: rv_mem_manager is an active pointer\n" ); } } SOAPdenovo-V1.05/src/63mer/readseq1by1.c000644 000765 000024 00000034127 11530651532 017614 0ustar00Aquastaff000000 000000 /* * 63mer/readseq1by1.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static char src_rc_seq[1024]; void readseq1by1 ( char * src_seq, char * src_name, int * len_seq, FILE * fp, long long num_seq ) { int i, k, n, strL; char c; char str[5000]; n = 0; k = num_seq; while ( fgets ( str, 4950, fp ) ) { if ( str[0] == '#' ) { continue; } if ( str[0] == '>' ) { /* if(k >= 0) { // if this isn't the first '>' in the file *len_seq = n; } */ *len_seq = n; n = 0; sscanf ( &str[1], "%s", src_name ); return; } else { strL = strlen ( str ); if ( strL + n > maxReadLen ) { strL = maxReadLen - n; } for ( i = 0; i < strL; i ++ ) { if ( str[i] >= 'a' && str[i] <= 'z' ) { c = base2int ( str[i] - 'a' + 'A' ); src_seq[n ++] = c; } else if ( str[i] >= 'A' && str[i] <= 'Z' ) { c = base2int ( str[i] ); src_seq[n ++] = c; // after pre-process all the symbles would be a,g,c,t,n in lower or upper case. } else if ( str[i] == '.' ) { c = base2int ( 'A' ); src_seq[n ++] = c; } // after pre-process all the symbles would be a,g,c,t,n in lower or upper case. 
} //printf("%d: %d\n",k,n); } } if ( k >= 0 ) { *len_seq = n; return; } *len_seq = 0; } void read_one_sequence ( FILE * fp, long long * T, char ** X ) { char * fasta, *src_name; //point to fasta array int num_seq, len, name_len, min_len; num_seq = readseqpar ( &len, &min_len, &name_len, fp ); if ( num_seq < 1 ) { printf ( "no fasta sequence in file\n" ); *T = 0; return; } fasta = ( char * ) ckalloc ( len * sizeof ( char ) ); src_name = ( char * ) ckalloc ( ( name_len + 1 ) * sizeof ( char ) ); rewind ( fp ); readseq1by1 ( fasta, src_name, &len, fp, -1 ); readseq1by1 ( fasta, src_name, &len, fp, 0 ); *X = fasta; *T = len; free ( ( void * ) src_name ); } long long multiFileParse ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp ) { char str[5000]; FILE * freads; int slen; long long counter = 0; *max_name_leg = *max_leg = 1; *min_leg = 1000; while ( fgets ( str, 4950, fp ) ) { slen = strlen ( str ); str[slen - 1] = str[slen]; freads = ckopen ( str, "r" ); counter += readseqpar ( max_leg, min_leg, max_name_leg, freads ); fclose ( freads ); } return counter; } long long readseqpar ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp ) { int l, n; long long k; char str[5000], src_name[5000]; n = 0; k = -1; while ( fgets ( str, 4950, fp ) ) { if ( str[0] == '>' ) { if ( k >= 0 ) { if ( n > *max_leg ) { *max_leg = n; } if ( n < *min_leg ) { *min_leg = n; } } n = 0; k ++; sscanf ( &str[1], "%s", src_name ); if ( ( l = strlen ( src_name ) ) > *max_name_leg ) { *max_name_leg = l; } } else { n += strlen ( str ) - 1; } } if ( n > *max_leg ) { *max_leg = n; } if ( n < *min_leg ) { *min_leg = n; } k ++; return ( k ); } void read1seqfq ( char * src_seq, char * src_name, int * len_seq, FILE * fp ) { int i, n, strL; char c; char str[5000]; boolean flag = 0; while ( fgets ( str, 4950, fp ) ) { if ( str[0] == '@' ) { flag = 1; sscanf ( &str[1], "%s", src_name ); break; } } if ( !flag ) //last time reading fq file get this { *len_seq = 0; return; } n = 0; while ( 
fgets ( str, 4950, fp ) ) { if ( str[0] == '+' ) { fgets ( str, 4950, fp ); // pass quality value line *len_seq = n; return; } else { strL = strlen ( str ); if ( strL + n > maxReadLen ) { strL = maxReadLen - n; } for ( i = 0; i < strL; i ++ ) { if ( str[i] >= 'a' && str[i] <= 'z' ) { c = base2int ( str[i] - 'a' + 'A' ); src_seq[n ++] = c; } else if ( str[i] >= 'A' && str[i] <= 'Z' ) { c = base2int ( str[i] ); src_seq[n ++] = c; // after pre-process all the symbles would be a,g,c,t,n in lower or upper case. } else if ( str[i] == '.' ) { c = base2int ( 'A' ); src_seq[n ++] = c; } // after pre-process all the symbles would be a,g,c,t,n in lower or upper case. } //printf("%d: %d\n",k,n); } } *len_seq = n; return; } // find the next file to open in libs static int nextValidIndex ( int libNo, boolean pair, unsigned char asm_ctg ) { int i = libNo; while ( i < num_libs ) { if ( asm_ctg == 1 && ( lib_array[i].asm_flag != 1 && lib_array[i].asm_flag != 3 ) ) { i++; continue; } else if ( asm_ctg == 0 && ( lib_array[i].asm_flag != 2 && lib_array[i].asm_flag != 3 ) ) { i++; continue; } else if ( asm_ctg > 1 && lib_array[i].asm_flag != asm_ctg ) // reads for other purpose { i++; continue; } if ( lib_array[i].curr_type == 1 && lib_array[i].curr_index < lib_array[i].num_a1_file ) { return i; } if ( lib_array[i].curr_type == 2 && lib_array[i].curr_index < lib_array[i].num_q1_file ) { return i; } if ( lib_array[i].curr_type == 3 && lib_array[i].curr_index < lib_array[i].num_p_file ) { return i; } if ( pair ) { if ( lib_array[i].curr_type < 3 ) { lib_array[i].curr_type++; lib_array[i].curr_index = 0; } else { i++; } continue; } if ( lib_array[i].curr_type == 4 && lib_array[i].curr_index < lib_array[i].num_s_a_file ) { return i; } if ( lib_array[i].curr_type == 5 && lib_array[i].curr_index < lib_array[i].num_s_q_file ) { return i; } if ( lib_array[i].curr_type < 5 ) { lib_array[i].curr_type++; lib_array[i].curr_index = 0; } else { i++; } }//for each lib return i; } static FILE * 
openFile4read ( char * fname ) { FILE * fp; if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { char * cmd = ( char * ) ckalloc ( ( strlen ( fname ) + 20 ) * sizeof ( char ) ); sprintf ( cmd, "gzip -dc %s", fname ); fp = popen ( cmd, "r" ); free ( cmd ); return fp; } else { return ckopen ( fname, "r" ); } } void openFileInLib ( int libNo ) { int i = libNo; if ( lib_array[i].curr_type == 1 ) { printf ( "read from file:\n %s\n", lib_array[i].a1_fname[lib_array[i].curr_index] ); printf ( "read from file:\n %s\n", lib_array[i].a2_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].a1_fname[lib_array[i].curr_index] ); lib_array[i].fp2 = openFile4read ( lib_array[i].a2_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 1; } else if ( lib_array[i].curr_type == 2 ) { printf ( "read from file:\n %s\n", lib_array[i].q1_fname[lib_array[i].curr_index] ); printf ( "read from file:\n %s\n", lib_array[i].q2_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].q1_fname[lib_array[i].curr_index] ); lib_array[i].fp2 = openFile4read ( lib_array[i].q2_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 1; } else if ( lib_array[i].curr_type == 3 ) { printf ( "read from file:\n %s\n", lib_array[i].p_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].p_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } else if ( lib_array[i].curr_type == 4 ) { printf ( "read from file:\n %s\n", lib_array[i].s_a_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].s_a_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } else if ( lib_array[i].curr_type == 5 ) { printf ( "read from file:\n %s\n", lib_array[i].s_q_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( 
lib_array[i].s_q_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } } static void reverse2k ( char * src_seq, int len_seq ) { if ( !len_seq ) { return; } int i; reverseComplementSeq ( src_seq, len_seq, src_rc_seq ); for ( i = 0; i < len_seq; i++ ) { src_seq[i] = src_rc_seq[i]; } } static void closeFp1InLab ( int libNo ) { int ftype = lib_array[libNo].curr_type; int index = lib_array[libNo].curr_index - 1; char * fname; if ( ftype == 1 ) { fname = lib_array[libNo].a1_fname[index]; } else if ( ftype == 2 ) { fname = lib_array[libNo].q1_fname[index]; } else if ( ftype == 3 ) { fname = lib_array[libNo].p_fname[index]; } else if ( ftype == 4 ) { fname = lib_array[libNo].s_a_fname[index]; } else if ( ftype == 5 ) { fname = lib_array[libNo].s_q_fname[index]; } else { return; } if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { pclose ( lib_array[libNo].fp1 ); } else { fclose ( lib_array[libNo].fp1 ); } } static void closeFp2InLab ( int libNo ) { int ftype = lib_array[libNo].curr_type; int index = lib_array[libNo].curr_index - 1; char * fname; if ( ftype == 1 ) { fname = lib_array[libNo].a2_fname[index]; } else if ( ftype == 2 ) { fname = lib_array[libNo].q2_fname[index]; } else { return; } if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { pclose ( lib_array[libNo].fp2 ); } else { fclose ( lib_array[libNo].fp2 ); } } boolean read1seqInLib ( char * src_seq, char * src_name, int * len_seq, int * libNo, boolean pair, unsigned char asm_ctg ) { int i = *libNo; int prevLib = i; if ( !lib_array[i].fp1 // file1 does not exist || ( lib_array[i].curr_type != 1 && feof ( lib_array[i].fp1 ) ) // file1 reaches end and not type1 || ( lib_array[i].curr_type == 1 && feof ( lib_array[i].fp1 ) && feof ( lib_array[i].fp2 ) ) ) //f1&f2 reaches end { if ( lib_array[i].fp1 && feof ( lib_array[i].fp1 ) ) { closeFp1InLab ( i ); } if ( lib_array[i].fp2 && feof ( lib_array[i].fp2 ) ) { 
closeFp2InLab ( i ); } *libNo = nextValidIndex ( i, pair, asm_ctg ); i = *libNo; if ( lib_array[i].rd_len_cutoff > 0 ) maxReadLen = lib_array[i].rd_len_cutoff < maxReadLen4all ? lib_array[i].rd_len_cutoff : maxReadLen4all; else { maxReadLen = maxReadLen4all; } //record insert size info //printf("from lib %d to %d, read %lld to %ld\n",prevLib,i,readNumBack,n_solexa); if ( pair && i != prevLib ) { if ( readNumBack < n_solexa ) { pes[gradsCounter].PE_bound = n_solexa; pes[gradsCounter].rank = lib_array[prevLib].rank; pes[gradsCounter].pair_num_cut = lib_array[prevLib].pair_num_cut; pes[gradsCounter++].insertS = lib_array[prevLib].avg_ins; readNumBack = n_solexa; } } if ( i >= num_libs ) { return 0; } openFileInLib ( i ); if ( lib_array[i].curr_type == 1 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, -1 ); readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp2, -1 ); } else if ( lib_array[i].curr_type == 3 || lib_array[i].curr_type == 4 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, -1 ); } } if ( lib_array[i].curr_type == 1 ) { if ( lib_array[i].paired == 1 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, 1 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 2; if ( *len_seq > 0 || !feof ( lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } else { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp2, 1 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 1; n_solexa++; return 1; //can't fail to read a read2 } } if ( lib_array[i].curr_type == 2 ) { if ( lib_array[i].paired == 1 ) { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp1 ); /* if(*len_seq>0){ for(j=0;j<*len_seq;j++) printf("%c",int2base(src_seq[j])); printf("\n"); } */ if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 2; if ( *len_seq > 0 || !feof ( 
lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } else { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp2 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 1; n_solexa++; return 1; //can't fail to read a read2 } } if ( lib_array[i].curr_type == 5 ) { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp1 ); } else { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, 1 ); } /* int t; for(t=0;t<*len_seq;t++) printf("%d",src_seq[t]); printf("\n"); */ if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } if ( *len_seq > 0 || !feof ( lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } SOAPdenovo-V1.05/src/63mer/scaffold.c000644 000765 000024 00000007400 11530651532 017246 0ustar00Aquastaff000000 000000 /* * 63mer/scaffold.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static void display_scaff_usage(); static boolean LINK, SCAFF; static char graphfile[256]; int call_scaffold ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); initenv ( argc, argv ); loadPEgrads ( graphfile ); time ( &time_bef ); loadUpdatedEdges ( graphfile ); time ( &time_aft ); printf ( "time spent on loading edges %ds\n", ( int ) ( time_aft - time_bef ) ); if ( !SCAFF ) { time ( &time_bef ); PE2Links ( graphfile ); time ( &time_aft ); printf ( "time spent on loading pair end info %ds\n", ( int ) ( time_aft - time_bef ) ); time ( &time_bef ); Links2Scaf ( graphfile ); time ( &time_aft ); printf ( "time spent on creating scaffolds %ds\n", ( int ) ( time_aft - time_bef ) ); scaffolding ( 100, graphfile ); } prlReadsCloseGap ( graphfile ); // locateReadOnScaf(graphfile); free_pe_mem(); if ( index_array ) { free ( ( void * ) index_array ); } freeContig_array(); destroyPreArcMem(); destroyConnectMem(); deleteCntLookupTable(); time ( &stop_t ); printf ( "time elapsed: %dm\n", ( int ) ( stop_t - start_t ) / 60 ); return 0; } /***************************************************************************** * Parse command line switches *****************************************************************************/ void initenv ( int argc, char ** argv ) { int copt; int inpseq; extern char * optarg; char temp[256]; inpseq = 0; LINK = 0; SCAFF = 0; optind = 1; while ( ( copt = getopt ( argc, argv, "g:L:p:G:FuS" ) ) != EOF ) { switch ( copt ) { case 'g': inGraph = 1; sscanf ( optarg, "%s", graphfile ); // continue; case 'G': sscanf ( optarg, "%s", temp ); // GLDiff = atoi ( temp ); continue; case 'L': sscanf ( optarg, "%s", temp ); ctg_short = atoi ( temp ); continue; case 'F': fillGap = 1; continue; case 'S': SCAFF = 1; continue; case 'u': maskRep = 0; continue; case 'p': sscanf ( optarg, "%s", temp ); 
// thrd_num = atoi ( temp ); continue; default: if ( inGraph == 0 ) // { display_scaff_usage(); exit ( -1 ); } } } if ( inGraph == 0 ) // { display_scaff_usage(); exit ( -1 ); } } static void display_scaff_usage() { printf ( "\nscaff -g InputGraph [-F -u -S] [-G gapLenDiff -L minContigLen] [-p n_cpu]\n" ); printf ( " -g InputFile: prefix of graph file names\n" ); printf ( " -F (optional) fill gaps in scaffold\n" ); printf ( " -S (optional) scaffold structure exists(default: NO)\n" ); printf ( " -G gapLenDiff(default 50): allowed length difference between estimated and filled gap\n" ); printf ( " -u (optional): un-mask contigs with high coverage before scaffolding (default mask)\n" ); printf ( " -p n_cpu(default 8): number of cpu for use\n" ); printf ( " -L minLen(default K+2): shortest contig for scaffolding\n" ); } SOAPdenovo-V1.05/src/63mer/searchPath.c000644 000765 000024 00000013453 11530651532 017554 0ustar00Aquastaff000000 000000 /* * 63mer/searchPath.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int trace_limit = 5000; // maximum number of calls to a trace function allowed in one search /* search connection paths which were masked along related contigs start from one contig, end with another path length includes the length of the last contig */ void traceAlongMaskedCnt ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route ) { num_trace++; if ( num_trace > trace_limit || *num_route >= max_n_routes ) { return; } unsigned int * array; int num, i, length; CONNECT * ite_cnt; if ( index > 0 ) // there are at most max_steps edges stored in this array including the destination edge { length = len + contig_array[currE].length; } else { length = 0; } if ( index > max_steps || length > max ) { return; } // this is the only situation we stop if ( index > 0 ) // there are at most max_steps edges stored in this array including the destination edge { so_far[index - 1] = currE; } if ( currE == destE && index == 0 ) { printf ( "traceAlongMaskedCnt: start and destination are the same\n" ); return; } if ( currE == destE && length >= min && length <= max ) { num = *num_route; array = found_routes[num]; for ( i = 0; i < index; i++ ) { array[i] = so_far[i]; } if ( index < max_steps ) { array[index] = 0; } //indicate the end of the route *num_route = ++num; } // one route is extracted, but we don't terminate searching ite_cnt = contig_array[currE].downwardConnect; while ( ite_cnt ) { if ( !ite_cnt->mask || ite_cnt->deleted ) { ite_cnt = ite_cnt->next; continue; } traceAlongMaskedCnt ( destE, ite_cnt->contigID, max_steps, min, max, index + 1, length + ite_cnt->gapLen, num_route ); ite_cnt = ite_cnt->next; } } // search connection paths from one connect to a contig // path length includes the length of the last contig void traceAlongConnect ( unsigned int destE, CONNECT * currCNT, int max_steps, int min, int max, int index, int len, int * num_route ) {
num_trace++; if ( num_trace > trace_limit || *num_route >= max_n_routes ) { return; } unsigned int * array, currE; int num, i, length; CONNECT * ite_cnt; currE = currCNT->contigID; length = len + currCNT->gapLen; length += contig_array[currE].length; if ( index > max_steps || length > max ) { return; } // this is the only situation we stop /* if(globalFlag) printf("B: step %d, ctg %d, length %d\n",index,currCNT->contigID,length); */ if ( currE == destE && index == 1 ) { printf ( "traceAlongConnect: start and destination are the same\n" ); return; } so_far[index - 1] = currE; // there are at most max_steps edges stored in this array including the destination edge if ( currE == destE && length >= min && length <= max ) { num = *num_route; array = found_routes[num]; for ( i = 0; i < index; i++ ) { array[i] = so_far[i]; } if ( index < max_steps ) { array[index] = 0; } //indicate the end of the route *num_route = ++num; } // one route is extracted, but we don't terminate searching if ( currCNT->nextInScaf ) { traceAlongConnect ( destE, currCNT->nextInScaf, max_steps, min, max, index + 1, length, num_route ); return; } ite_cnt = contig_array[currE].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->mask || ite_cnt->deleted ) { ite_cnt = ite_cnt->next; continue; } traceAlongConnect ( destE, ite_cnt, max_steps, min, max, index + 1, length, num_route ); ite_cnt = ite_cnt->next; } } // find paths in the graph from currE to destE; the path length excludes the lengths of both end contigs void traceAlongArc ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route ) { num_trace++; if ( num_trace > trace_limit || *num_route >= max_n_routes ) { return; } unsigned int * array, out_ed, vt; int num, i, pos, length; preARC * parc; pos = index; if ( pos > max_steps || len > max ) { return; } // this is the only situation we stop if ( currE == destE && pos == 0 ) { printf ( "traceAlongArc: start and destination are the same\n" ); return;
} if ( pos > 0 ) // pos starts with 0 for the starting edge { so_far[pos - 1] = currE; } // there're at most max_steps edges stored in this array including the destination edge if ( currE == destE && len >= min ) { num = *num_route; array = found_routes[num]; for ( i = 0; i < pos; i++ ) { array[i] = so_far[i]; } if ( pos < max_steps ) { array[pos] = 0; } //indicate the end of the route *num_route = ++num; } // one route is extrated, but we don't terminate searching if ( pos == max_steps || len == max ) { return; } if ( pos++ > 0 ) //not the starting edge { length = len + contig_array[currE].length; } else { length = len; } vt = contig_array[currE].to_vt; parc = contig_array[currE].arcs; while ( parc ) { out_ed = parc->to_ed; traceAlongArc ( destE, out_ed, max_steps, min, max, pos, length, num_route ); parc = parc->next; } } SOAPdenovo-V1.05/src/63mer/seq.c000644 000765 000024 00000005712 11530651532 016261 0ustar00Aquastaff000000 000000 /* * 63mer/seq.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" void printTightString ( char * tightSeq, int len ) { int i; for ( i = 0; i < len; i++ ) { printf ( "%c", int2base ( ( int ) getCharInTightString ( tightSeq, i ) ) ); if ( ( i + 1 ) % 100 == 0 ) { printf ( "\n" ); } } printf ( "\n" ); } void writeChar2tightString ( char nt, char * tightSeq, int pos ) { char * byte = tightSeq + pos / 4; switch ( pos % 4 ) { case 0: *byte &= 63; *byte += nt << 6; return; case 1: *byte &= 207; *byte += nt << 4; return; case 2: *byte &= 243; *byte += nt << 2; return; case 3: *byte &= 252; *byte += nt; return; } } char getCharInTightString ( char * tightSeq, int pos ) { char * byte = tightSeq + pos / 4; switch ( pos % 4 ) { case 3: return ( *byte & 3 ); case 2: return ( *byte & 12 ) >> 2; case 1: return ( *byte & 48 ) >> 4; case 0: return ( *byte & 192 ) >> 6; } return 0; } // complement of sequence denoted 0, 1, 2, 3 void reverseComplementSeq ( char * seq, int len, char * bal_seq ) { int i, index = 0; if ( len < 1 ) { return; } for ( i = len - 1; i >= 0; i-- ) { bal_seq[index++] = int_comp ( seq[i] ); } return; } // complement of sequence denoted 0, 1, 2, 3 char * compl_int_seq ( char * seq, int len ) { char * bal_seq = NULL, c, bal_c; int i, index; if ( len < 1 ) { return bal_seq; } bal_seq = ( char * ) ckalloc ( len * sizeof ( char ) ); index = 0; for ( i = len - 1; i >= 0; i-- ) { c = seq[i]; if ( c < 4 ) { bal_c = int_comp ( c ); } //3-c; else { bal_c = c; } bal_seq[index++] = bal_c; } return bal_seq; } long long trans_seq ( char * seq, int len ) { int i; long long res; res = 0; for ( i = 0; i < len; i ++ ) { res = res * 4 + seq[i]; } return ( res ); } /* char *kmer2seq(Kmer word) { int i; char *seq; Kmer charMask = 3; seq = (char *)ckalloc(overlaplen*sizeof(char)); for(i=overlaplen-1;i>=0;i--){ seq[i] = charMask&word; word >>= 2; } return seq; } */ SOAPdenovo-V1.05/src/63mer/splitReps.c000644 000765 000024 00000023065 11530651532 017457 
0ustar00Aquastaff000000 000000 /* * 63mer/splitReps.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static unsigned int involved[9]; static unsigned int lefts[4]; static unsigned int rights[4]; static unsigned char gothrough[4][4]; static boolean interferingCheck ( unsigned int edgeno, int repTimes ) { int i, j, t; unsigned int bal_ed; involved[0] = edgeno; i = 1; for ( j = 0; j < repTimes; j++ ) { involved[i++] = lefts[j]; } for ( j = 0; j < repTimes; j++ ) { involved[i++] = rights[j]; } for ( j = 0; j < i - 1; j++ ) for ( t = j + 1; t < i; t++ ) if ( involved[j] == involved[t] ) { return 1; } for ( j = 0; j < i; j++ ) { bal_ed = getTwinEdge ( involved[j] ); for ( t = 0; t < i; t++ ) if ( bal_ed == involved[t] ) { return 1; } } return 0; } static ARC * arcCounts ( unsigned int edgeid, unsigned int * num ) { ARC * arc; ARC * firstValidArc = NULL; unsigned int count = 0; arc = edge_array[edgeid].arcs; while ( arc ) { if ( arc->to_ed > 0 ) { count++; } if ( count == 1 ) { firstValidArc = arc; } arc = arc->next; } *num = count; return firstValidArc; } static boolean readOnEdge ( long long readid, unsigned int edge ) { int index; int markNum; long long * marklist; if ( edge_array[edge].markers ) { markNum = edge_array[edge].multi; marklist = 
edge_array[edge].markers; } else { return 0; } for ( index = 0; index < markNum; index++ ) { if ( readid == marklist[index] ) { return 1; } } return 0; } static long long cntByReads ( unsigned int left, unsigned int middle , unsigned int right ) { int markNum; long long * marklist; if ( edge_array[left].markers ) { markNum = edge_array[left].multi; marklist = edge_array[left].markers; } else { return 0; } int index; long long readid; /* if(middle==8553) printf("%d markers on %d\n",markNum,left); */ for ( index = 0; index < markNum; index++ ) { readid = marklist[index]; if ( readOnEdge ( readid, middle ) && readOnEdge ( readid, right ) ) { return readid; } } return 0; } /* - - > - < - - */ unsigned int solvable ( unsigned int edgeno ) { if ( EdSameAsTwin ( edgeno ) || edge_array[edgeno].multi == 255 ) { return 0; } unsigned int bal_ed = getTwinEdge ( edgeno ); unsigned int arcRight_n, arcLeft_n; unsigned int counter; unsigned int i, j; unsigned int branch, bal_branch; ARC * parcL, *parcR; parcL = arcCounts ( bal_ed, &arcLeft_n ); if ( arcLeft_n < 2 ) { return 0; } parcR = arcCounts ( edgeno, &arcRight_n ); if ( arcLeft_n != arcRight_n ) { return 0; } // check each right branch only has one upstream connection arcRight_n = 0; while ( parcR ) { if ( parcR->to_ed == 0 ) { parcR = parcR->next; continue; } branch = parcR->to_ed; if ( EdSameAsTwin ( branch ) || edge_array[branch].multi == 255 ) { return 0; } rights[arcRight_n++] = branch; bal_branch = getTwinEdge ( branch ); arcCounts ( bal_branch, &counter ); if ( counter != 1 ) { return 0; } parcR = parcR->next; } // check if each left branch only has one downstream connection arcLeft_n = 0; while ( parcL ) { if ( parcL->to_ed == 0 ) { parcL = parcL->next; continue; } branch = parcL->to_ed; if ( EdSameAsTwin ( branch ) || edge_array[branch].multi == 255 ) { return 0; } bal_branch = getTwinEdge ( branch ); lefts[arcLeft_n++] = bal_branch; arcCounts ( bal_branch, &counter ); if ( counter != 1 ) { return 0; } parcL = parcL->next; } //check
if reads indicate one to one connection between upsteam and downstream edges for ( i = 0; i < arcLeft_n; i++ ) { counter = 0; for ( j = 0; j < arcRight_n; j++ ) { gothrough[i][j] = cntByReads ( lefts[i], edgeno, rights[j] ) == 0 ? 0 : 1; counter += gothrough[i][j]; if ( counter > 1 ) { return 0; } } if ( counter != 1 ) { return 0; } } for ( j = 0; j < arcRight_n; j++ ) { counter = 0; for ( i = 0; i < arcLeft_n; i++ ) { counter += gothrough[i][j]; } if ( counter != 1 ) { return 0; } } return arcLeft_n; } static unsigned int cp1edge ( unsigned int source, unsigned int target ) { int length = edge_array[source].length; char * tightSeq; int index; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target; if ( bal_source > source ) { bal_target = target + 1; } else { bal_target = target; target = target + 1; } tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( index = 0; index < length / 4 + 1; index++ ) { tightSeq[index] = edge_array[source].seq[index]; } edge_array[target].length = length; edge_array[target].cvg = edge_array[source].cvg; edge_array[target].to_vt = edge_array[source].to_vt; edge_array[target].from_vt = edge_array[source].from_vt; edge_array[target].seq = tightSeq; edge_array[target].bal_edge = edge_array[source].bal_edge; edge_array[target].rv = NULL; edge_array[target].arcs = NULL; edge_array[target].markers = NULL; edge_array[target].flag = 0; edge_array[target].deleted = 0; tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( index = 0; index < length / 4 + 1; index++ ) { tightSeq[index] = edge_array[bal_source].seq[index]; } edge_array[bal_target].length = length; edge_array[bal_target].cvg = edge_array[bal_source].cvg; edge_array[bal_target].to_vt = edge_array[bal_source].to_vt; edge_array[bal_target].from_vt = edge_array[bal_source].from_vt; edge_array[bal_target].seq = tightSeq; edge_array[bal_target].bal_edge = edge_array[bal_source].bal_edge; edge_array[bal_target].rv = NULL; 
    edge_array[bal_target].arcs = NULL;
    edge_array[bal_target].markers = NULL;
    edge_array[bal_target].flag = 0;
    edge_array[bal_target].deleted = 0;
    return target;
}

static void moveArc2cp ( unsigned int leftEd, unsigned int rightEd, unsigned int source, unsigned int target )
{
    unsigned int bal_left = getTwinEdge ( leftEd );
    unsigned int bal_right = getTwinEdge ( rightEd );
    unsigned int bal_source = getTwinEdge ( source );
    unsigned int bal_target = getTwinEdge ( target );
    ARC * arc;
    ARC * newArc, *twinArc;
    //between left and source
    arc = getArcBetween ( leftEd, source );
    arc->to_ed = 0;
    newArc = allocateArc ( target );
    newArc->multiplicity = arc->multiplicity;
    newArc->prev = NULL;
    newArc->next = edge_array[leftEd].arcs;
    if ( edge_array[leftEd].arcs ) {
        edge_array[leftEd].arcs->prev = newArc;
    }
    edge_array[leftEd].arcs = newArc;
    arc = getArcBetween ( bal_source, bal_left );
    arc->to_ed = 0;
    twinArc = allocateArc ( bal_left );
    twinArc->multiplicity = arc->multiplicity;
    twinArc->prev = NULL;
    twinArc->next = NULL;
    edge_array[bal_target].arcs = twinArc;
    newArc->bal_arc = twinArc;
    twinArc->bal_arc = newArc;
    //between source and right
    arc = getArcBetween ( source, rightEd );
    arc->to_ed = 0;
    newArc = allocateArc ( rightEd );
    newArc->multiplicity = arc->multiplicity;
    newArc->prev = NULL;
    newArc->next = NULL;
    edge_array[target].arcs = newArc;
    arc = getArcBetween ( bal_right, bal_source );
    arc->to_ed = 0;
    twinArc = allocateArc ( bal_target );
    twinArc->multiplicity = arc->multiplicity;
    twinArc->prev = NULL;
    twinArc->next = edge_array[bal_right].arcs;
    if ( edge_array[bal_right].arcs ) {
        edge_array[bal_right].arcs->prev = twinArc;
    }
    edge_array[bal_right].arcs = twinArc;
    newArc->bal_arc = twinArc;
    twinArc->bal_arc = newArc;
}

static void split1edge ( unsigned int edgeno, int repTimes )
{
    int i, j;
    unsigned int target;
    for ( i = 1; i < repTimes; i++ ) {
        for ( j = 0; j < repTimes; j++ )
            if ( gothrough[i][j] > 0 ) {
                break;
            }
        target = cp1edge ( edgeno, extraEdgeNum );
        moveArc2cp ( lefts[i],
                     rights[j], edgeno, target );
        extraEdgeNum += 2;
    }
}

static void debugging ( unsigned int i )
{
    ARC * parc;
    parc = edge_array[i].arcs;
    if ( !parc ) {
        printf ( "no downward connection for %u\n", i );
    }
    while ( parc ) {
        printf ( "%u -> %u\n", i, parc->to_ed );
        parc = parc->next;
    }
}

void solveReps()
{
    unsigned int i;
    unsigned int repTime;
    int counter = 0;
    boolean flag;
    //debugging(30514);
    extraEdgeNum = num_ed + 1;
    for ( i = 1; i <= num_ed; i++ ) {
        repTime = solvable ( i );
        if ( repTime == 0 ) {
            continue;
        }
        flag = interferingCheck ( i, repTime );
        if ( flag ) {
            continue;
        }
        split1edge ( i, repTime );
        counter ++; //+= 2*(repTime-1);
        if ( EdSmallerThanTwin ( i ) ) {
            i++;
        }
    }
    printf ( "%d repeats solvable, %d more edges\n", counter, extraEdgeNum - 1 - num_ed );
    num_ed = extraEdgeNum - 1;
    removeDeadArcs();
    if ( markersArray ) {
        free ( ( void * ) markersArray );
        markersArray = NULL;
    }
}
SOAPdenovo-V1.05/src/63mer/stack.c000644 000765 000024 00000007160 11530651532 016575 0ustar00Aquastaff000000 000000
/*
 * 63mer/stack.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include <stdlib.h>
#include "stack.h"

STACK * createStack ( int num_items, size_t unit_size )
{
    STACK * newStack = ( STACK * ) malloc ( 1 * sizeof ( STACK ) );
    if ( !newStack ) {
        return NULL;
    }
    newStack->block_list = NULL;
    newStack->items_per_block = num_items;
    newStack->item_size = unit_size;
    newStack->item_c = 0;
    // start with defined cursor and backup values so stackBackup never copies garbage
    newStack->index_in_block = 0;
    newStack->block_backup = NULL;
    newStack->index_backup = 0;
    newStack->item_c_backup = 0;
    return newStack;
}

void emptyStack ( STACK * astack )
{
    BLOCK_STARTER * block;
    if ( !astack || !astack->block_list ) {
        return;
    }
    block = astack->block_list;
    if ( block->next ) {
        block = block->next;
    }
    astack->block_list = block;
    astack->item_c = 0;
    astack->index_in_block = 0;
}

void freeStack ( STACK * astack )
{
    BLOCK_STARTER * ite_block, *temp_block;
    if ( !astack ) {
        return;
    }
    ite_block = astack->block_list;
    // walk to the last block, then free every block back along the prev links
    if ( ite_block ) {
        while ( ite_block->next ) {
            ite_block = ite_block->next;
        }
    }
    while ( ite_block ) {
        temp_block = ite_block;
        ite_block = ite_block->prev;
        free ( ( void * ) temp_block );
    }
    free ( ( void * ) astack );
}

void stackBackup ( STACK * astack )
{
    astack->block_backup = astack->block_list;
    astack->index_backup = astack->index_in_block;
    astack->item_c_backup = astack->item_c;
}

void stackRecover ( STACK * astack )
{
    astack->block_list = astack->block_backup;
    astack->index_in_block = astack->index_backup;
    astack->item_c = astack->item_c_backup;
}

void * stackPop ( STACK * astack )
{
    BLOCK_STARTER * block;
    if ( !astack || !astack->block_list || !astack->item_c ) {
        return NULL;
    }
    astack->item_c--;
    block = astack->block_list;
    if ( astack->index_in_block == 1 ) {
        // popping the first slot of this block: step to the next (older) block if there is one
        if ( block->next ) {
            astack->block_list = block->next;
            astack->index_in_block = astack->items_per_block;
        } else {
            astack->index_in_block = 0;
            astack->item_c = 0;
        }
        return ( void * ) ( ( char * ) block + sizeof ( BLOCK_STARTER ) );
    }
    return ( void * ) ( ( char * ) block + sizeof ( BLOCK_STARTER ) + astack->item_size * ( --astack->index_in_block ) );
}

void * stackPush ( STACK * astack )
{
    BLOCK_STARTER * block;
    if ( !astack ) {
        return NULL;
    }
    astack->item_c++;
    if ( !astack->block_list || ( astack->index_in_block == astack->items_per_block && !astack->block_list->prev ) ) {
        // no block yet, or the current block is full and no spare block is linked: allocate one
        block = ( BLOCK_STARTER * ) malloc ( sizeof ( BLOCK_STARTER ) + astack->items_per_block * astack->item_size );
        if ( !block ) {
            astack->item_c--;
            return NULL;
        }
        block->prev = NULL;
        if ( astack->block_list ) {
            astack->block_list->prev = block;
        }
        block->next = astack->block_list;
        astack->block_list = block;
        astack->index_in_block = 1;
        return ( void * ) ( ( char * ) block + sizeof ( BLOCK_STARTER ) );
    } else if ( astack->index_in_block == astack->items_per_block && astack->block_list->prev ) {
        // the current block is full but a previously allocated block can be reused
        astack->block_list = astack->block_list->prev;
        astack->index_in_block = 1;
        return ( void * ) ( ( char * ) astack->block_list + sizeof ( BLOCK_STARTER ) );
    }
    block = astack->block_list;
    return ( void * ) ( ( char * ) block + sizeof ( BLOCK_STARTER ) + astack->item_size * astack->index_in_block++ );
}
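The push and pop routines of stack.c hand out pointers to fixed-size slots stored in chained blocks, and the caller writes the item through the returned pointer. The sketch below illustrates that pattern in a simplified, self-contained form. It is not the SOAPdenovo implementation: MiniStack, MiniBlock, mini_push, mini_pop, mini_free and mini_roundtrip_ok are hypothetical names, the link fields are named differently from BLOCK_STARTER, and the sketch assumes items_per_block is at least 1 and that items need no stricter alignment than a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for BLOCK_STARTER: a header placed in
 * front of items_per_block item slots. */
typedef struct mini_block {
    struct mini_block *below;   /* block holding older items */
    struct mini_block *above;   /* spare block kept for reuse after pops */
} MiniBlock;

/* Hypothetical, simplified stand-in for STACK. */
typedef struct {
    MiniBlock *top;         /* block currently being filled */
    int used_in_top;        /* number of slots used in the top block */
    int items_per_block;    /* assumed >= 1 */
    size_t item_size;
    long item_c;            /* total number of items on the stack */
} MiniStack;

/* Reserve one slot and return its address, or NULL on failure. */
static void *mini_push(MiniStack *s)
{
    if (!s) {
        return NULL;
    }
    if (!s->top || s->used_in_top == s->items_per_block) {
        /* reuse a spare block if one is linked, otherwise allocate a new one */
        MiniBlock *b = s->top ? s->top->above : NULL;
        if (!b) {
            b = malloc(sizeof *b + (size_t)s->items_per_block * s->item_size);
            if (!b) {
                return NULL;
            }
            b->above = NULL;
            b->below = s->top;
            if (s->top) {
                s->top->above = b;
            }
        }
        s->top = b;
        s->used_in_top = 0;
    }
    s->item_c++;
    return (char *)(s->top + 1) + s->item_size * (size_t)s->used_in_top++;
}

/* Return the address of the most recently pushed item, or NULL if empty.
 * The slot stays valid until the next push overwrites it. */
static void *mini_pop(MiniStack *s)
{
    if (!s || s->item_c == 0) {
        return NULL;
    }
    if (s->used_in_top == 0) {
        /* top block is drained: step down to the full block below it */
        s->top = s->top->below;
        s->used_in_top = s->items_per_block;
    }
    s->item_c--;
    return (char *)(s->top + 1) + s->item_size * (size_t)--s->used_in_top;
}

/* Free every block, including spare blocks above the current top. */
static void mini_free(MiniStack *s)
{
    MiniBlock *b = s->top;
    while (b && b->above) {
        b = b->above;
    }
    while (b) {
        MiniBlock *below = b->below;
        free(b);
        b = below;
    }
    s->top = NULL;
    s->used_in_top = 0;
    s->item_c = 0;
}

/* Push 0..n-1, pop everything, and report whether the values came back in
 * last-in-first-out order. */
static int mini_roundtrip_ok(int n, int items_per_block)
{
    MiniStack s = { NULL, 0, items_per_block, sizeof(int), 0 };
    int ok = 1;
    int i;
    for (i = 0; i < n; i++) {
        int *slot = mini_push(&s);
        if (!slot) {
            mini_free(&s);
            return 0;
        }
        *slot = i;
    }
    for (i = n - 1; i >= 0; i--) {
        int *slot = mini_pop(&s);
        if (!slot || *slot != i) {
            ok = 0;
        }
    }
    if (mini_pop(&s) != NULL) {
        ok = 0;
    }
    mini_free(&s);
    return ok;
}
```

Keeping drained blocks linked instead of freeing them means a stack that repeatedly grows and shrinks does not call malloc on every refill, which appears to be the reason stackPush checks for an existing block before allocating.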
SOAPdenovo-V1.05/src/63mer/inc/check.h000644 000765 000024 00000001714 11530651532 017322 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/check.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

extern void * ckalloc ( unsigned long long amount );
extern void * ckrealloc ( void * p, size_t new_size, size_t old_size );
extern FILE * ckopen ( char * name, char * mode );
SOAPdenovo-V1.05/src/63mer/inc/darray.h000644 000765 000024 00000002361 11530651532 017526 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/darray.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef __DARRAY__
#define __DARRAY__

#include
#include
#include

typedef struct dynamic_array {
    void * array;
    long long array_size;
    size_t item_size;
    long long item_c;
} DARRAY;

void * darrayPut ( DARRAY * darray, long long index );
void * darrayGet ( DARRAY * darray, long long index );
DARRAY * createDarray ( int num_items, size_t unit_size );
void freeDarray ( DARRAY * darray );
void emptyDarray ( DARRAY * darray );

#endif
SOAPdenovo-V1.05/src/63mer/inc/def.h000644 000765 000024 00000015406 11530651532 017006 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/def.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/* this file provides some datatype definitions */
#ifndef _DEF
#define _DEF

#include "def2.h"
#include "types.h"
#include "stack.h"
#include "darray.h"

#define EDGE_BIT_SIZE 6
#define word_len 12
#define taskMask 0xf //the last 4 bits
#define MaxEdgeCov 16000
#define base2int(base) (char)(((base)&0x06)>>1)
#define int2base(seq) "ACTG"[seq]
#define int2compbase(seq) "TGAC"[seq]
#define int_comp(seq) (char)((seq)^0x02) //(char)((0x4E>>((seq)<<1))&0x03)

int b_ban;

typedef struct kmer {
    unsigned long long high, low;
} Kmer;

typedef struct preedge {
    Kmer from_node;
    Kmer to_node;
    char * seq;
    int length;
    unsigned short cvg: 14;
    unsigned bal_edge: 2; //indicate whether its bal_edge is the previous edge, next edge or itself
} preEDGE;

typedef struct readinterval {
    int readid;
    unsigned int edgeid;
    int start;
    struct readinterval * bal_rv;
    struct readinterval * nextOnEdge;
    struct readinterval * prevOnEdge;
    struct readinterval * nextInRead;
    struct readinterval * prevInRead;
} READINTERVAL;

struct arc;

typedef struct edge {
    unsigned int from_vt;
    unsigned int to_vt;
    int length;
    unsigned short cvg: 14;
    unsigned short bal_edge: 2;
    unsigned short multi: 14;
    unsigned short deleted : 1;
    unsigned short flag : 1;
    char * seq;
    READINTERVAL * rv;
    struct arc * arcs;
    long long * markers;
} EDGE;

typedef struct edge_pt {
    EDGE * edge;
    struct edge_pt * next;
} EDGE_PT;

typedef struct vertex {
    Kmer kmer;
} VERTEX;

/*
typedef struct connection
{
    unsigned int contigID;
    int gapLen;
    short maxGap;
    unsigned char minGap;
    unsigned char bySmall:1;
    unsigned char weakPoint:1;
    unsigned char weightNotInherit;
    unsigned char weight;
    unsigned char maxSingleWeight;
    unsigned char mask : 1;
    unsigned char used : 1;
    unsigned char weak : 1;
    unsigned char deleted : 1;
    unsigned char prevInScaf : 1;
    unsigned char inherit : 1;
    unsigned char checking : 1;
    unsigned char singleInScaf : 1;
    struct connection *nextInScaf;
    struct connection *next;
    struct connection *nextInLookupTable;
}CONNECT;
*/

typedef struct connection {
    unsigned int contigID;
    int gapLen;
    unsigned short maxGap;
    unsigned char minGap;
    unsigned char bySmall: 1;
    unsigned char weakPoint: 1;
    unsigned char weightNotInherit;
    unsigned char weight;
    unsigned char maxSingleWeight;
    unsigned char mask : 1;
    unsigned char used : 1;
    unsigned char weak : 1;
    unsigned char deleted : 1;
    unsigned char prevInScaf : 1;
    unsigned char inherit : 1;
    unsigned char checking : 1;
    unsigned char singleInScaf : 1;
    struct connection * nextInScaf;
    struct connection * next;
    struct connection * nextInLookupTable;
} CONNECT;

typedef struct prearc {
    unsigned int to_ed;
    unsigned int multiplicity;
    struct prearc * next;
} preARC;

/*
typedef struct contig
{
    unsigned int from_vt;
    unsigned int to_vt;
    unsigned int length;
    int to_right;
    unsigned short indexInScaf;
    unsigned char cvg;
    unsigned char bal_edge:2; // 0, 1 or 2
    unsigned char mask : 1;
    unsigned char flag : 1;
    unsigned char multi: 1;
    unsigned char inSubGraph: 1;
    char *seq;
    CONNECT *downwardConnect;
    preARC *arcs;
    STACK *closeReads;
}CONTIG;
*/

typedef struct contig {
    unsigned int from_vt;
    unsigned int to_vt;
    unsigned int length;
    unsigned short indexInScaf;
    unsigned char cvg;
    unsigned char bal_edge: 2; // 0, 1 or 2
    unsigned char mask : 1;
    unsigned char flag : 1;
    unsigned char multi: 1;
    unsigned char inSubGraph: 1;
    char * seq;
    CONNECT * downwardConnect;
    preARC * arcs;
    STACK * closeReads;
} CONTIG;

typedef struct read_nearby {
    int len;
    int dis; // dis to nearby contig or scaffold's start position
    long long seqStarter; //sequence start position in dynamic array
} READNEARBY;

typedef struct annotation {
    unsigned long long readID;
    unsigned int contigID;
    int pos;
} ANNOTATION;

typedef struct parameter {
    unsigned char threadID;
    void ** hash_table;
    unsigned char * mainSignal;
    unsigned char * selfSignal;
} PARAMETER;

typedef struct lightannot {
    int contigID;
    int pos;
} LIGHTANNOT;

typedef struct edgepatch {
    Kmer from_kmer, to_kmer;
    unsigned int length;
    char bal_edge;
} EDGEPATCH;

typedef struct lightctg {
    unsigned int index;
    int length;
    char * seq;
} LIGHTCTG;

typedef struct arc {
    unsigned int to_ed;
    unsigned int multiplicity;
    struct arc * prev;
    struct arc * next;
    struct arc * bal_arc;
    struct arc * nextInLookupTable;
} ARC;

typedef struct arcexist {
    Kmer kmer;
    struct arcexist * left;
    struct arcexist * right;
} ARCEXIST;

typedef struct lib_info {
    int min_ins;
    int max_ins;
    int avg_ins;
    int rd_len_cutoff;
    int reverse;
    int asm_flag;
    int map_len;
    int pair_num_cut;
    int rank;
    //indicate which file is next to be read
    int curr_type;
    int curr_index;
    //file handlers to opened files
    FILE * fp1;
    FILE * fp2;
    boolean f1_start;
    boolean f2_start;
    //whether last read is read1 in pair
    int paired; // 0 -- single; 1 -- read1; 2 -- read2;
    //type1
    char ** a1_fname;
    char ** a2_fname;
    int num_a1_file;
    int num_a2_file;
    //type2
    char ** q1_fname;
    char ** q2_fname;
    int num_q1_file;
    int num_q2_file;
    //type3
    char ** p_fname;
    int num_p_file; //fasta only
    //type4 &5
    char ** s_a_fname;
    int num_s_a_file;
    char ** s_q_fname;
    int num_s_q_file;
} LIB_INFO;

typedef struct ctg4heap {
    unsigned int ctgID;
    int dis;
    unsigned char ds_shut4dheap: 1; // ignore downstream connections
    unsigned char us_shut4dheap: 1; // ignore upstream connections
    unsigned char ds_shut4uheap: 1; // ignore downstream connections
    unsigned char us_shut4uheap: 1; // ignore upstream connections
} CTGinHEAP;

typedef struct ctg4scaf {
    unsigned int ctgID;
    int start;
    int end; //position in scaff
    unsigned int cutHead : 8; //
    unsigned int cutTail : 7; //
    unsigned int scaftig_start : 1; //is it a scaftig starter
    unsigned int mask : 1; // is it masked for further operations
    unsigned int gapSeqLen: 15;
    int gapSeqOffset;
} CTGinSCAF;

typedef struct pe_info {
    int insertS;
    long long PE_bound;
    int rank;
    int pair_num_cut;
} PE_INFO;

#endif
SOAPdenovo-V1.05/src/63mer/inc/def2.h000644 000765 000024 00000003246 11530651532 017067 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/def2.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef _DEF2
#define _DEF2

typedef char boolean;
typedef long long IDnum;
typedef double Time;
typedef long long Coordinate;

// Fibonacci heaps used mainly in Tour Bus
typedef struct fibheap FibHeap;
typedef struct fibheap_el FibHeapNode;
typedef struct dfibheap DFibHeap;
typedef struct dfibheap_el DFibHeapNode;

//Memory manager
typedef struct block_start {
    struct block_start * next;
} BLOCK_START;

typedef struct recycle_mark {
    struct recycle_mark * next;
} RECYCLE_MARK;

typedef struct mem_manager {
    BLOCK_START * block_list;
    int index_in_block;
    int items_per_block;
    size_t item_size;
    RECYCLE_MARK * recycle_list;
    unsigned long long counter;
} MEM_MANAGER;

struct dfibheap_el {
    int dfhe_degree;
    boolean dfhe_mark;
    DFibHeapNode * dfhe_p;
    DFibHeapNode * dfhe_child;
    DFibHeapNode * dfhe_left;
    DFibHeapNode * dfhe_right;
    Time dfhe_key;
    unsigned int dfhe_data;//void *dfhe_data;
};

#endif
SOAPdenovo-V1.05/src/63mer/inc/dfib.h000644 000765 000024 00000005565 11530651532 017161 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/dfib.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/*-
 * Copyright 1997, 1998-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.
 * IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $Id: dfib.h,v 1.8 2007/04/24 12:16:41 zerbino Exp $
 *
 */

#ifndef _DFIB_H_
#define _DFIB_H_

#include
#include "def2.h"
//#include "globals.h"

/* functions for key heaps */
DFibHeap * dfh_makekeyheap ( void );
DFibHeapNode * dfh_insertkey ( DFibHeap *, Time, unsigned int );
Time dfh_replacekey ( DFibHeap *, DFibHeapNode *, Time );
unsigned int dfh_replacekeydata ( DFibHeap *, DFibHeapNode *, Time, unsigned int );

unsigned int dfh_extractmin ( DFibHeap * );
unsigned int dfh_replacedata ( DFibHeapNode *, unsigned int );
unsigned int dfh_delete ( DFibHeap *, DFibHeapNode * );
void dfh_deleteheap ( DFibHeap * );

// DEBUG
IDnum dfibheap_getSize ( DFibHeap * );
Time dfibheap_el_getKey ( DFibHeapNode * );
// END DEBUG

#endif /* _DFIB_H_ */
SOAPdenovo-V1.05/src/63mer/inc/dfibHeap.h000644 000765 000024 00000002564 11530651532 017753 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/dfibHeap.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef _DFIBHEAP_H_
#define _DFIBHEAP_H_

DFibHeap * newDFibHeap();
DFibHeapNode * insertNodeIntoDHeap ( DFibHeap * heap, Time key, unsigned int node );
Time replaceKeyInDHeap ( DFibHeap * heap, DFibHeapNode * node, Time newKey );
unsigned int removeNextNodeFromDHeap ( DFibHeap * heap );
void destroyDHeap ( DFibHeap * heap );
boolean HasMin ( DFibHeap * h );
void replaceValueInDHeap ( DFibHeapNode * node, unsigned int newValue );
void * destroyNodeInDHeap ( DFibHeapNode * node, DFibHeap * heap );
IDnum getDFibHeapSize ( DFibHeap * heap );
Time getKey ( DFibHeapNode * node );

#endif
SOAPdenovo-V1.05/src/63mer/inc/dfibpriv.h000644 000765 000024 00000007072 11530651532 020055 0ustar00Aquastaff000000 000000
/*
 * 63mer/inc/dfibpriv.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/*-
 * Copyright 1997, 1999-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2.
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $Id: dfibpriv.h,v 1.8 2007/10/09 09:56:46 zerbino Exp $ * */ #ifndef _DFIBPRIV_H_ #define _DFIBPRIV_H_ //#include "globals.h" #include "def2.h" /* * specific node operations */ static DFibHeapNode * dfhe_newelem ( DFibHeap * ); static void dfhe_insertafter ( DFibHeapNode * a, DFibHeapNode * b ); static inline void dfhe_insertbefore ( DFibHeapNode * a, DFibHeapNode * b ); static DFibHeapNode * dfhe_remove ( DFibHeapNode * a ); /* * global heap operations */ struct dfibheap { MEM_MANAGER * nodeMemory; IDnum dfh_n; IDnum dfh_Dl; DFibHeapNode ** dfh_cons; DFibHeapNode * dfh_min; DFibHeapNode * dfh_root; }; static void dfh_insertrootlist ( DFibHeap *, DFibHeapNode * ); static void dfh_removerootlist ( DFibHeap *, DFibHeapNode * ); static void dfh_consolidate ( DFibHeap * ); static void dfh_heaplink ( DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x ); static void dfh_cut ( DFibHeap *, DFibHeapNode *, DFibHeapNode * ); static void dfh_cascading_cut ( DFibHeap *, DFibHeapNode * ); static DFibHeapNode * dfh_extractminel 
( DFibHeap * ); static void dfh_checkcons ( DFibHeap * h ); static int dfh_compare ( DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b ); static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b ); static void dfh_insertel ( DFibHeap * h, DFibHeapNode * x ); /* * general functions */ static inline IDnum ceillog2 ( IDnum a ); #endif /* _DFIBPRIV_H_ */ SOAPdenovo-V1.05/src/63mer/inc/extfunc.h000644 000765 000024 00000024143 11530651532 017722 0ustar00Aquastaff000000 000000 /* * 63mer/inc/extfunc.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see .
* */ #include "check.h" #include "extfunc2.h" extern void readseq1by1 ( char * src_seq, char * src_name, int * len_seq, FILE * fp, long long num_seq ); extern void readseqPbyP ( char * src_seq, char * src_name, int * insertS, int * len_seq, FILE * fp, long long num_seq ); extern long long readseqpar ( int * max_len, int * min_leg, int * max_name_len, FILE * fp ); extern void free_edge_list ( EDGE_PT * el ); extern void reverseComplementSeq ( char * seq, int len, char * bal_seq ); extern void free_edge_array ( EDGE * ed_array, int ed_num ); extern void free_lightctg_array ( LIGHTCTG * ed_array, int ed_num ); extern char getCharInTightString ( char * tightSeq, int pos ); extern void writeChar2tightSting ( char nt, char * tightSeq, int pos ); extern void short_reads_sum(); extern void read_one_sequence ( FILE * fp, long long * T, char ** X ); extern void output_edges ( preEDGE * ed_array, int ed_num, char * outfile ); extern void loadVertex ( char * graphfile ); extern void loadEdge ( char * graphfile ); extern boolean loadPath ( char * graphfile ); extern READINTERVAL * allocateRV ( int readid, int edgeid ); extern void createRVmemo(); extern void dismissRV ( READINTERVAL * rv ); extern void destroyReadIntervMem(); extern void destroyConnectMem(); extern void u2uConcatenate(); extern void output_contig ( EDGE * ed_array, unsigned int ed_num, char * outfile, int cut_len ); extern void printTightString ( char * tightSeq, int len ); extern int roughUniqueness ( unsigned int edgeno, char ignore_cvg, char * ignored ); extern void outputReadPos ( char * graphfile, int min_len ); extern void testSearch(); extern void allpathConcatenate(); extern void output_updated_edges ( char * outfile ); extern void output_updated_vertex ( char * outfile ); extern void loadUpdatedEdges ( char * graphfile ); extern void loadUpdatedVertex ( char * graphfile ); extern void connectByPE ( char * infile ); extern void output_cntGVZ ( char * outfile ); extern void output_graph ( char * outfile 
); extern void testLinearC2C(); extern void output_contig_graph ( char * outfile ); extern void scaffolding ( unsigned int cut_len, char * outfile ); extern int cmp_int ( const void * a, const void * b ); extern CONNECT * allocateCN ( unsigned int contigId, int gap ); extern int recoverRep(); extern void loadPEgrads ( char * infile ); extern int putInsertS ( long long readid, int size, int * currGrads ); extern int getInsertS ( long long readid, int * readlen ); extern int connectByPE_grad ( FILE * fp, int peGrad, char * line ); extern void PEgradsScaf ( char * infile ); extern void reorderAnnotation ( char * infile, char * outfile ); extern void output_1edge ( preEDGE * edge, FILE * fp ); extern void prlRead2edge ( char * libfile, char * outfile ); extern void annotFileTrans ( char * infile, char * outfile ); extern void prlLoadPath ( char * graphfile ); extern void misCheck ( char * infile, char * outfile ); extern int uniqueLenSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target ); extern int cmp_vertex ( const void * a, const void * b ); extern void linkContig2Vts(); extern int connectByPE_gradPatch ( FILE * fp1, FILE * fp2, int peGrad, char * line1, char * line2 ); extern void scaftiging ( char * graphfile, int len_cut ); extern void gapFilling ( char * graphfile, int cut_len ); extern ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed ); extern void bubblePinch ( double simiCutoff, char * outfile, int M ); extern void linearConcatenate(); extern unsigned char setArcMulti ( unsigned int from_ed, unsigned int to_ed, unsigned char value ); extern ARC * allocateArc ( unsigned int edgeid ); extern void cutTipsInGraph ( int cutLen, boolean strict ); extern ARC * deleteArc ( ARC * arc_list, ARC * arc ); extern void compactEdgeArray(); extern void dismissArc ( ARC * arc ); extern void createArcMemo(); extern ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed ); extern ARC * allocateArc ( unsigned int 
edgeid ); extern void writeChar2tightString ( char nt, char * tightSeq, int pos ); extern void output_heavyArcs ( char * outfile ); extern preARC * allocatePreArc ( unsigned int edgeid ); extern void destroyPreArcMem(); extern void traceAlongArc ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route ); extern void freeContig_array(); extern void output_scafSeq ( char * graphfile, int len_cut ); extern void putArcInHash ( unsigned int from_ed, unsigned int to_ed ); extern boolean DoesArcExist ( unsigned int from_ed, unsigned int to_ed ); extern void recordArcInHash(); extern void destroyArcHash(); extern void removeWeakEdges ( int lenCutoff, unsigned int multiCutoff ); extern void createArcLookupTable(); extern void deleteArcLookupTable(); extern void putArc2LookupTable ( unsigned int from_ed, ARC * arc ); extern void removeArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ); extern ARC * arcCount ( unsigned int edgeid, unsigned int * num ); extern void mapFileTrans ( char * infile ); extern void solveReps(); extern void removeDeadArcs(); extern void destroyArcMem(); extern void getCntsInFile ( char * infile ); extern void scafByCntInfo ( char * infile ); extern CONNECT * add1Connect ( unsigned int e1, unsigned int e2, int gap, int weight, boolean inherit ); extern void getScaff ( char * infile ); extern void traceAlongMaskedCnt ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route ); extern void createPreArcMemManager(); extern boolean loadPathBin ( char * graphfile ); extern void recordArcsInLookupTable(); extern FILE * multiFileRead1seq ( char * src_seq, char * src_name, int * len_seq, FILE * fp, FILE * freads ); extern void multiFileSeqpar ( FILE * fp ); extern long long multiFileParse ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp ); extern CONNECT * getCntBetween ( unsigned int from_ed, unsigned int to_ed ); extern void 
createCntMemManager(); extern void destroyConnectMem(); extern void createCntLookupTable(); extern void deleteCntLookupTable(); extern void putCnt2LookupTable ( unsigned int from_c, CONNECT * cnt ); extern void prlRead2Ctg ( char * seqfile, char * outfile ); extern boolean prlContig2nodes ( char * grapfile, int len_cut ); extern void scan_libInfo ( char * libfile ); extern void free_libs(); extern boolean read1seqInLib ( char * src_seq, char * src_name, int * len_seq, int * libNo, boolean pair, unsigned char asm_ctg ); extern void save4laterSolve(); extern void solveRepsAfter(); extern void free_pe_mem(); extern void alloc_pe_mem ( int gradsCounter ); extern void prlDestroyPreArcMem(); extern preARC * prlAllocatePreArc ( unsigned int edgeid, MEM_MANAGER * manager ); extern boolean prlRead2HashTable ( char * libfile, char * outfile ); extern void free_allSets(); extern void removeSingleTips(); extern void removeMinorTips(); extern void kmer2edges ( char * outfile ); extern void output_vertex ( char * outfile ); extern boolean prlRead2HashTable ( char * libfile, char * outfile ); extern void Links2Scaf ( char * infile ); extern void PE2Links ( char * infile ); extern unsigned int getTwinCtg ( unsigned int ctg ); extern void basicContigInfo ( char * infile ); extern boolean isSmallerThanTwin ( unsigned int ctg ); extern boolean isLargerThanTwin ( unsigned int ctg ); extern boolean isSameAsTwin ( unsigned int ctg ); extern boolean loadMarkerBin ( char * graphfile ); extern void readsCloseGap ( char * graphfile ); extern void prlReadsCloseGap ( char * graphfile ); extern void locateReadOnScaf ( char * graphfile ); /*********** Kmer related *************/ extern Kmer createFilter ( int overlaplen ); extern void printKmerSeq ( FILE * fp, Kmer kmer ); extern __uint128_t Kmer2int128 ( Kmer seq ); extern boolean KmerLarger ( Kmer kmer1, Kmer kmer2 ); extern boolean KmerSmaller ( Kmer kmer1, Kmer kmer2 ); extern boolean KmerEqual ( Kmer kmer1, Kmer kmer2 ); extern Kmer 
KmerAnd ( Kmer kmer1, Kmer kmer2 ); extern Kmer KmerLeftBitMoveBy2 ( Kmer word ); extern Kmer KmerRightBitMoveBy2 ( Kmer word ); extern Kmer KmerPlus ( Kmer prev, char ch ); extern Kmer nextKmer ( Kmer prev, char ch ); extern Kmer prevKmer ( Kmer next, char ch ); extern char firstCharInKmer ( Kmer kmer ); extern Kmer KmerRightBitMove ( Kmer word, int dis ); extern Kmer reverseComplement ( Kmer word, int overlap ); extern ubyte8 hash_kmer ( Kmer kmer ); extern int kmer2vt ( Kmer kmer ); extern void print_kmer ( FILE * fp, Kmer kmer, char c ); extern int bisearch ( VERTEX * vts, int num, Kmer target ); extern void printKmerSeq ( FILE * fp, Kmer kmer ); extern char lastCharInKmer ( Kmer kmer ); int localGraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int origOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, DARRAY * gapSeqArray, char * seqCtg1, char * seqCtg2, char * seqGap ); extern unsigned int getTwinEdge ( unsigned int edgeno ); extern boolean EdSmallerThanTwin ( unsigned int edgeno ); extern boolean EdLargerThanTwin ( unsigned int edgeno ); extern boolean EdSameAsTwin ( unsigned int edgeno ); extern void removeLowCovEdges ( int lenCutoff, unsigned short covCutoff ); extern int getMaxLongReadLen ( int num_libs ); extern void prlLongRead2Ctg ( char * libfile, char * outfile ); SOAPdenovo-V1.05/src/63mer/inc/extfunc2.h000644 000765 000024 00000002110 11530651532 017772 0ustar00Aquastaff000000 000000 /* * 63mer/inc/extfunc2.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef _MEM_MANAGER #define _MEM_MANAGER extern MEM_MANAGER * createMem_manager ( int num_items, size_t unit_size ); extern void * getItem ( MEM_MANAGER * mem_Manager ); extern void returnItem ( MEM_MANAGER * mem_Manager, void * ); extern void freeMem_manager ( MEM_MANAGER * mem_Manager ); #endif SOAPdenovo-V1.05/src/63mer/inc/extvab.h000644 000765 000024 00000005352 11530651532 017540 0ustar00Aquastaff000000 000000 /* * 63mer/inc/extvab.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ /*** global variables ****/ extern int overlaplen; extern int inGraph; extern long long n_ban; extern Kmer WORDFILTER; extern boolean globalFlag; extern int thrd_num; /**** reads info *****/ extern long long n_solexa; extern long long prevNum; extern int ins_size_var; extern PE_INFO * pes; extern int maxReadLen; extern int maxReadLen4all; extern int minReadLen; extern int maxNameLen; extern int num_libs; extern LIB_INFO * lib_array; extern int libNo; extern long long readNumBack; extern int gradsCounter; /*** used for pregraph *****/ extern MEM_MANAGER * prearc_mem_manager; //also used in scaffolding extern MEM_MANAGER ** preArc_mem_managers; extern boolean deLowKmer; extern boolean deLowEdge; extern KmerSet ** KmerSets; // also used in mapping extern KmerSet ** KmerSetsPatch; /**** used for contiging ****/ extern boolean repsTie; extern long long arcCounter; extern unsigned int num_ed; extern unsigned int num_ed_limit; extern unsigned int extraEdgeNum; extern EDGE * edge_array; extern VERTEX * vt_array; extern MEM_MANAGER * rv_mem_manager; extern MEM_MANAGER * arc_mem_manager; extern unsigned int num_vt; extern int len_bar; extern ARC ** arcLookupTable; extern long long * markersArray; /***** used for scaffolding *****/ extern MEM_MANAGER * cn_mem_manager; extern unsigned int num_ctg; extern unsigned int * index_array; extern CONTIG * contig_array; extern int lineLen; extern int weakPE; extern long long newCntCounter; extern CONNECT ** cntLookupTable; extern unsigned int ctg_short; extern int cvgAvg; extern boolean orig2new; /**** used for gapFilling ****/ extern DARRAY * readSeqInGap; extern DARRAY * gapSeqDarray; extern DARRAY ** darrayBuf; extern int fillGap; /**** used for searchPath *****/ extern int maxSteps; extern int num_trace; extern unsigned int ** found_routes; extern unsigned int * so_far; extern int max_n_routes; extern boolean maskRep; extern int GLDiff; extern int initKmerSetSize; SOAPdenovo-V1.05/src/63mer/inc/fib.h000644 000765 000024 
00000006240 11530651532 017004 0ustar00Aquastaff000000 000000 /* * 63mer/inc/fib.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997, 1998-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #ifndef _FIB_H_ #define _FIB_H_ //#include "globals.h" #include #include "def2.h" typedef Coordinate ( *voidcmp ) ( unsigned int , unsigned int ); /* functions for key heaps */ boolean fh_isempty ( FibHeap * ); FibHeap * fh_makekeyheap ( void ); FibHeapNode * fh_insertkey ( FibHeap *, Coordinate, unsigned int ); Coordinate fh_minkey ( FibHeap * ); Coordinate fh_replacekey ( FibHeap *, FibHeapNode *, Coordinate ); unsigned int fh_replacekeydata ( FibHeap *, FibHeapNode *, Coordinate, unsigned int ); /* functions for unsigned int * heaps */ FibHeap * fh_makeheap ( void ); voidcmp fh_setcmp ( FibHeap *, voidcmp ); unsigned int fh_setneginf ( FibHeap *, unsigned int ); FibHeapNode * fh_insert ( FibHeap *, unsigned int ); /* shared functions */ unsigned int fh_extractmin ( FibHeap * ); unsigned int fh_min ( FibHeap * ); unsigned int fh_replacedata ( FibHeapNode *, unsigned int ); unsigned int fh_delete ( FibHeap *, FibHeapNode * ); void fh_deleteheap ( FibHeap * ); FibHeap * fh_union ( FibHeap *, FibHeap * ); #endif /* _FIB_H_ */ SOAPdenovo-V1.05/src/63mer/inc/fibHeap.h000644 000765 000024 00000002625 11530651532 017605 0ustar00Aquastaff000000 000000 /* * 63mer/inc/fibHeap.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef _FIBHEAP_H_ #define _FIBHEAP_H_ FibHeap * newFibHeap(); FibHeapNode * insertNodeIntoHeap ( FibHeap * heap, Coordinate key, unsigned int node ); Coordinate minKeyOfHeap ( FibHeap * heap ); Coordinate replaceKeyInHeap ( FibHeap * heap, FibHeapNode * node, Coordinate newKey ); void replaceValueInHeap ( FibHeapNode * node, unsigned int newValue ); unsigned int removeNextNodeFromHeap ( FibHeap * heap ); void * destroyNodeInHeap ( FibHeapNode * node, FibHeap * heap ); void destroyHeap ( FibHeap * heap ); boolean IsHeapEmpty ( FibHeap * heap ); #endif SOAPdenovo-V1.05/src/63mer/inc/fibpriv.h000644 000765 000024 00000007544 11530651532 017715 0ustar00Aquastaff000000 000000 /* * 63mer/inc/fibpriv.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ /*- * Copyright 1997, 1999-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE.
* * $Id: fibpriv.h,v 1.10 2007/10/09 09:56:46 zerbino Exp $ * */ #ifndef _FIBPRIV_H_ #define _FIBPRIV_H_ #include "def2.h" /* * specific node operations */ struct fibheap_el { int fhe_degree; boolean fhe_mark; FibHeapNode * fhe_p; FibHeapNode * fhe_child; FibHeapNode * fhe_left; FibHeapNode * fhe_right; Coordinate fhe_key; unsigned int fhe_data; }; static FibHeapNode * fhe_newelem ( struct fibheap * ); static void fhe_initelem ( FibHeapNode * ); static void fhe_insertafter ( FibHeapNode * a, FibHeapNode * b ); static inline void fhe_insertbefore ( FibHeapNode * a, FibHeapNode * b ); static FibHeapNode * fhe_remove ( FibHeapNode * a ); /* * global heap operations */ struct fibheap { Coordinate ( *fh_cmp_fnct ) ( unsigned int, unsigned int ); MEM_MANAGER * nodeMemory; IDnum fh_n; IDnum fh_Dl; FibHeapNode ** fh_cons; FibHeapNode * fh_min; FibHeapNode * fh_root; unsigned int fh_neginf; boolean fh_keys: 1; }; static void fh_initheap ( FibHeap * ); static void fh_insertrootlist ( FibHeap *, FibHeapNode * ); static void fh_removerootlist ( FibHeap *, FibHeapNode * ); static void fh_consolidate ( FibHeap * ); static void fh_heaplink ( FibHeap * h, FibHeapNode * y, FibHeapNode * x ); static void fh_cut ( FibHeap *, FibHeapNode *, FibHeapNode * ); static void fh_cascading_cut ( FibHeap *, FibHeapNode * ); static FibHeapNode * fh_extractminel ( FibHeap * ); static void fh_checkcons ( FibHeap * h ); static void fh_destroyheap ( FibHeap * h ); static int fh_compare ( FibHeap * h, FibHeapNode * a, FibHeapNode * b ); static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b ); static void fh_insertel ( FibHeap * h, FibHeapNode * x ); /* * general functions */ static inline IDnum ceillog2 ( IDnum a ); #endif /* _FIBPRIV_H_ */ SOAPdenovo-V1.05/src/63mer/inc/global.h000644 000765 000024 00000004355 11530651532 017511 0ustar00Aquastaff000000 000000 /* * 63mer/inc/global.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ int overlaplen = 23; int inGraph; long long n_ban; long long n_solexa = 0; long long prevNum = 0; int ins_size_var = 20; PE_INFO * pes = NULL; MEM_MANAGER * rv_mem_manager = NULL; MEM_MANAGER * cn_mem_manager = NULL; MEM_MANAGER * arc_mem_manager = NULL; unsigned int num_vt = 0; unsigned int ** found_routes = NULL; unsigned int * so_far = NULL; int max_n_routes = 10; int num_trace; Kmer WORDFILTER; unsigned int num_ed = 0; unsigned int num_ctg = 0; unsigned int num_ed_limit; unsigned int extraEdgeNum; EDGE * edge_array = NULL; VERTEX * vt_array = NULL; unsigned int * index_array = NULL; CONTIG * contig_array = NULL; int lineLen; int len_bar = 100; int weakPE = 3; int fillGap = 0; boolean globalFlag; long long arcCounter; MEM_MANAGER * prearc_mem_manager = NULL; MEM_MANAGER ** preArc_mem_managers = NULL; int maxReadLen = 0; int maxReadLen4all = 0; int minReadLen = 0; int maxNameLen = 0; ARC ** arcLookupTable = NULL; long long * markersArray = NULL; boolean deLowKmer = 0; boolean deLowEdge = 0; long long newCntCounter; boolean repsTie = 0; CONNECT ** cntLookupTable = NULL; int num_libs = 0; LIB_INFO * lib_array = NULL; int libNo = 0; long long readNumBack; int gradsCounter; unsigned int ctg_short = 0; int thrd_num = 8; int cvgAvg = 0; KmerSet ** KmerSets = NULL; KmerSet ** KmerSetsPatch = NULL; DARRAY * readSeqInGap = NULL; DARRAY * gapSeqDarray = NULL; DARRAY ** darrayBuf; 
boolean orig2new; int maxSteps; boolean maskRep = 1; int GLDiff = 50; int initKmerSetSize = 0; SOAPdenovo-V1.05/src/63mer/inc/newhash.h000644 000765 000024 00000007270 11530651532 017705 0ustar00Aquastaff000000 000000 /* * 63mer/inc/newhash.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef __NEW_HASH_RJ #define __NEW_HASH_RJ #ifndef K_LOAD_FACTOR #define K_LOAD_FACTOR 0.75 #endif #define MAX_KMER_COV 63 #define EDGE_BIT_SIZE 6 #define EDGE_XOR_MASK 0x3FU #define LINKS_BITS 0x00FFFFFFU #define get_kmer_seq(mer) ((mer).seq) #define set_kmer_seq(mer, val) ((mer).seq = val) #define get_kmer_left_cov(mer, idx) (((mer).l_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK) #define set_kmer_left_cov(mer, idx, val) ((mer).l_links = ((mer).l_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) ) #define get_kmer_left_covs(mer) (get_kmer_left_cov(mer, 0) + get_kmer_left_cov(mer, 1) + get_kmer_left_cov(mer, 2) + get_kmer_left_cov(mer, 3)) #define get_kmer_right_cov(mer, idx) (((mer).r_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK) #define set_kmer_right_cov(mer, idx, val) ((mer).r_links = ((mer).r_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) ) #define get_kmer_right_covs(mer) (get_kmer_right_cov(mer, 0) + get_kmer_right_cov(mer, 1) + get_kmer_right_cov(mer, 
2) + get_kmer_right_cov(mer, 3)) #define is_kmer_entity_null(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x01) #define is_kmer_entity_del(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x02) #define set_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] |= (0x01u<<(((idx)&0x0f)<<1))) #define set_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] |= (0x02u<<(((idx)&0x0f)<<1))) #define clear_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] &= ~(0x01u<<(((idx)&0x0f)<<1))) #define clear_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] &= ~(0x02u<<(((idx)&0x0f)<<1))) #define exists_kmer_entity(flags, idx) (!((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x03)) typedef struct kmer_st { Kmer seq; ubyte4 l_links; /* serves as edgeID after make_edge */ ubyte4 r_links: 4 * EDGE_BIT_SIZE; ubyte4 linear: 1; ubyte4 deleted: 1; ubyte4 checked: 1; ubyte4 single: 1; ubyte4 twin: 2; ubyte4 inEdge: 2; } kmer_t; typedef struct kmerSet_st { kmer_t * array; ubyte4 * flags; ubyte8 size; ubyte8 count; ubyte8 max; double load_factor; ubyte8 iter_ptr; } KmerSet; typedef struct kmer_pt { kmer_t * node; Kmer kmer; boolean isSmaller; struct kmer_pt * next; } KMER_PT; extern KmerSet * init_kmerset ( ubyte8 init_size, float load_factor ); extern int search_kmerset ( KmerSet * set, Kmer seq, kmer_t ** rs ); extern int put_kmerset ( KmerSet * set, Kmer seq, ubyte left, ubyte right, kmer_t ** kmer_p ); extern byte8 count_kmerset ( KmerSet * set ); extern void free_Sets ( KmerSet ** KmerSets, int num ); extern void free_kmerset ( KmerSet * set ); extern void dislink2nextUncertain ( kmer_t * node, char ch, boolean smaller ); extern void dislink2prevUncertain ( kmer_t * node, char ch, boolean smaller ); extern int count_branch2prev ( kmer_t * node ); extern int count_branch2next ( kmer_t * node ); extern char firstCharInKmer ( Kmer kmer ); #endif SOAPdenovo-V1.05/src/63mer/inc/nuc.h000644 000765 000024 00000001675 11530651532 017040 0ustar00Aquastaff000000 000000 /* * 63mer/inc/nuc.h * * Copyright (c) 2008-2010
BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ int total_nuc = 16; char na_name[17] = {'g', 'a', 't', 'c', 'n', 'r', 'y', 'w', 's', 'm', 'k', 'h', 'b', 'v', 'd', 'x' }; SOAPdenovo-V1.05/src/63mer/inc/stack.h000644 000765 000024 00000002724 11530651532 017354 0ustar00Aquastaff000000 000000 /* * 63mer/inc/stack.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #ifndef __STACK__ #define __STACK__ #include #include #include typedef struct block_starter { struct block_starter * prev; struct block_starter * next; } BLOCK_STARTER; typedef struct stack { BLOCK_STARTER * block_list; int index_in_block; int items_per_block; int item_c; size_t item_size; BLOCK_STARTER * block_backup; int index_backup; int item_c_backup; } STACK; void stackBackup ( STACK * astack ); void stackRecover ( STACK * astack ); void * stackPush ( STACK * astack ); void * stackPop ( STACK * astack ); void freeStack ( STACK * astack ); void emptyStack ( STACK * astack ); STACK * createStack ( int num_items, size_t unit_size ); #endif SOAPdenovo-V1.05/src/63mer/inc/stdinc.h000644 000765 000024 00000001721 11530651532 017527 0ustar00Aquastaff000000 000000 /* * 63mer/inc/stdinc.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include #include #include #include #include #include #include #include "def.h" SOAPdenovo-V1.05/src/63mer/inc/types.h000644 000765 000024 00000002124 11530651532 017405 0ustar00Aquastaff000000 000000 /* * 63mer/inc/types.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef __TYPES_RJ #define __TYPES_RJ typedef unsigned long long ubyte8; typedef unsigned int ubyte4; typedef unsigned short ubyte2; typedef unsigned char ubyte; typedef long long byte8; typedef int byte4; typedef short byte2; typedef char byte; #endif SOAPdenovo-V1.05/src/31mer/arc.c000644 000765 000024 00000012563 11530651532 016233 0ustar00Aquastaff000000 000000 /* * 31mer/arc.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define preARCBLOCKSIZE 100000 void createPreArcMemManager() { prearc_mem_manager = createMem_manager ( preARCBLOCKSIZE, sizeof ( preARC ) ); } void prlDestroyPreArcMem() { if ( !preArc_mem_managers ) { return; } int i; for ( i = 0; i < thrd_num; i++ ) { freeMem_manager ( preArc_mem_managers[i] ); } free ( ( void * ) preArc_mem_managers ); preArc_mem_managers = NULL; } void destroyPreArcMem() { freeMem_manager ( prearc_mem_manager ); prearc_mem_manager = NULL; } preARC * prlAllocatePreArc ( unsigned int edgeid, MEM_MANAGER * manager ) { preARC * newArc; newArc = ( preARC * ) getItem ( manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->next = NULL; return newArc; } preARC * allocatePreArc ( unsigned int edgeid ) { arcCounter++; preARC * newArc; newArc = ( preARC * ) getItem ( prearc_mem_manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->next = NULL; return newArc; } void output_arcGVZ ( char * outfile, boolean IsContig ) { ARC * pArc; preARC * pPreArc; char name[256]; FILE * fp; unsigned int i; sprintf ( name, "%s.arc.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = 1; i <= num_ed; i++ ) { if ( IsContig ) { pPreArc = contig_array[i].arcs; while ( pPreArc ) { fprintf ( fp, "\tC%d -> C%d[label =\"%d\"];\n", i, pPreArc->to_ed, pPreArc->multiplicity ); pPreArc = pPreArc->next; } } else { pArc = edge_array[i].arcs; while ( pArc ) { fprintf ( fp, "\tC%d -> C%d[label =\"%d\"];\n", i, pArc->to_ed, pArc->multiplicity ); pArc = pArc->next; } } } fprintf ( fp, "}\n" ); fclose ( fp ); } /**************** below this line all codes are about ARC ****************/ #define ARCBLOCKSIZE 100000 void createArcMemo() { if ( !arc_mem_manager ) { arc_mem_manager = createMem_manager ( ARCBLOCKSIZE, sizeof ( ARC ) ); } else { printf ( "Warning from createArcMemo: arc_mem_manager is active 
pointer\n" ); } } void destroyArcMem() { freeMem_manager ( arc_mem_manager ); arc_mem_manager = NULL; } ARC * allocateArc ( unsigned int edgeid ) { arcCounter++; ARC * newArc; newArc = ( ARC * ) getItem ( arc_mem_manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->prev = NULL; newArc->next = NULL; return newArc; } void dismissArc ( ARC * arc ) { returnItem ( arc_mem_manager, arc ); } /***************** below this line all codes are about lookup table *****************/ void createArcLookupTable() { if ( !arcLookupTable ) { arcLookupTable = ( ARC ** ) ckalloc ( ( 3 * num_ed + 1 ) * sizeof ( ARC * ) ); } } void deleteArcLookupTable() { if ( arcLookupTable ) { free ( ( void * ) arcLookupTable ); arcLookupTable = NULL; } } void putArc2LookupTable ( unsigned int from_ed, ARC * arc ) { if ( !arc || !arcLookupTable ) { return; } unsigned int index = 2 * from_ed + arc->to_ed; arc->nextInLookupTable = arcLookupTable[index]; arcLookupTable[index] = arc; } static ARC * getArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ) { unsigned int index = 2 * from_ed + to_ed; ARC * ite_arc = arcLookupTable[index]; while ( ite_arc ) { if ( ite_arc->to_ed == to_ed ) { return ite_arc; } ite_arc = ite_arc->nextInLookupTable; } return NULL; } void removeArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ) { unsigned int index = 2 * from_ed + to_ed; ARC * ite_arc = arcLookupTable[index]; ARC * arc; if ( !ite_arc ) { printf ( "removeArcInLookupTable: not found A\n" ); return; } if ( ite_arc->to_ed == to_ed ) { arcLookupTable[index] = ite_arc->nextInLookupTable; return; } while ( ite_arc->nextInLookupTable && ite_arc->nextInLookupTable->to_ed != to_ed ) { ite_arc = ite_arc->nextInLookupTable; } if ( ite_arc->nextInLookupTable ) { arc = ite_arc->nextInLookupTable; ite_arc->nextInLookupTable = arc->nextInLookupTable; return; } printf ( "removeArcInLookupTable: not found B\n" ); return; } void recordArcsInLookupTable() { unsigned int i; ARC * ite_arc; for ( i = 
1; i <= num_ed; i++ ) { ite_arc = edge_array[i].arcs; while ( ite_arc ) { putArc2LookupTable ( i, ite_arc ); ite_arc = ite_arc->next; } } } ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed ) { ARC * parc; if ( arcLookupTable ) { parc = getArcInLookupTable ( from_ed, to_ed ); return parc; } parc = edge_array[from_ed].arcs; while ( parc ) { if ( parc->to_ed == to_ed ) { return parc; } parc = parc->next; } return parc; } SOAPdenovo-V1.05/src/31mer/attachPEinfo.c000644 000765 000024 00000021401 11530651532 020022 0ustar00Aquastaff000000 000000 /* * 31mer/attachPEinfo.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "stack.h" #define CNBLOCKSIZE 10000 static STACK * isStack; static int ignorePE1, ignorePE2, ignorePE3, static_flag; static int onsameCtgPE; static unsigned long long peSUM; static boolean staticF; static int existCounter; static int calcuIS ( STACK * intStack ); static int cmp_pe ( const void * a, const void * b ) { PE_INFO * A, *B; A = ( PE_INFO * ) a; B = ( PE_INFO * ) b; if ( A->rank > B->rank ) { return 1; } else if ( A->rank == B->rank ) { return 0; } else { return -1; } } void loadPEgrads ( char * infile ) { FILE * fp; char name[256], line[1024]; int i; boolean rankSet = 1; sprintf ( name, "%s.peGrads", infile ); fp = fopen ( name, "r" ); if ( !fp ) { printf ( "can not open file %s\n", name ); gradsCounter = 0; return; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'g' ) { sscanf ( line + 10, "%d %lld %d", &gradsCounter, &n_solexa, &maxReadLen ); printf ( "there're %d grads, %lld reads, max read len %d\n", gradsCounter, n_solexa, maxReadLen ); break; } } alloc_pe_mem ( gradsCounter ); for ( i = 0; i < gradsCounter; i++ ) { fgets ( line, sizeof ( line ), fp ); pes[i].rank = 0; sscanf ( line, "%d %lld %d %d", & ( pes[i].insertS ), & ( pes[i].PE_bound ), & ( pes[i].rank ), & ( pes[i].pair_num_cut ) ); if ( pes[i].rank < 1 ) { rankSet = 0; } } fclose ( fp ); if ( rankSet ) { qsort ( &pes[0], gradsCounter, sizeof ( PE_INFO ), cmp_pe ); return; } int lastRank = 0; for ( i = 0; i < gradsCounter; i++ ) { if ( i == 0 ) { pes[i].rank = ++lastRank; } else if ( pes[i].insertS < 300 ) { pes[i].rank = lastRank; } else if ( pes[i].insertS < 800 ) { if ( pes[i - 1].insertS < 300 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else if ( pes[i].insertS < 3000 ) { if ( pes[i - 1].insertS < 800 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else if ( pes[i].insertS < 7000 ) { if ( pes[i - 1].insertS < 3000 ) { 
pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else { if ( pes[i - 1].insertS < 7000 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } } } CONNECT * add1Connect ( unsigned int e1, unsigned int e2, int gap, int weight, boolean inherit ) { if ( e1 == e2 || e1 == getTwinCtg ( e2 ) ) { return NULL; } CONNECT * connect = NULL; long long sum; if ( weight > 255 ) { weight = 255; } connect = getCntBetween ( e1, e2 ); if ( connect ) { if ( !weight ) { return connect; } existCounter++; if ( !inherit ) { sum = connect->weightNotInherit * connect->gapLen + gap * weight; connect->gapLen = sum / ( connect->weightNotInherit + weight ); if ( connect->weightNotInherit + weight <= 255 ) { connect->weightNotInherit += weight; } else if ( connect->weightNotInherit < 255 ) { connect->weightNotInherit = 255; } } else { sum = connect->weight * connect->gapLen + gap * weight; connect->gapLen = sum / ( connect->weight + weight ); if ( !connect->inherit ) { connect->maxSingleWeight = connect->weightNotInherit; } connect->inherit = 1; connect->maxSingleWeight = connect->maxSingleWeight > weight ? 
                                       connect->maxSingleWeight : weight;
        }

        if ( connect->weight + weight <= 255 )
        {
            connect->weight += weight;
        }
        else if ( connect->weight < 255 )
        {
            connect->weight = 255;
        }
    }
    else
    {
        newCntCounter++;
        connect = allocateCN ( e2, gap );

        if ( cntLookupTable )
        {
            putCnt2LookupTable ( e1, connect );
        }

        connect->weight = weight;

        if ( contig_array[e1].mask || contig_array[e2].mask )
        {
            connect->mask = 1;
        }

        connect->next = contig_array[e1].downwardConnect;
        contig_array[e1].downwardConnect = connect;

        if ( !inherit )
        {
            connect->weightNotInherit = weight;
        }
        else
        {
            connect->weightNotInherit = 0;
            connect->inherit = 1;
            connect->maxSingleWeight = weight;
        }
    }

    return connect;
}

int attach1PE ( unsigned int e1, int pre_pos, unsigned int bal_e2, int pos, int insert_size )
{
    int gap, realpeSize;
    unsigned int bal_e1, e2;

    if ( e1 == bal_e2 )
    {
        ignorePE1++;
        return -1; //orientation wrong
    }

    bal_e1 = getTwinCtg ( e1 );
    e2 = getTwinCtg ( bal_e2 );

    if ( e1 == e2 )
    {
        realpeSize = contig_array[e1].length + overlaplen - pre_pos - pos;

        if ( realpeSize > 0 )
        {
            peSUM += realpeSize;
            onsameCtgPE++;

            if ( ( int ) contig_array[e1].length > insert_size )
            {
                int * item = ( int * ) stackPush ( isStack );
                ( *item ) = realpeSize;
            }
        }

        return 2;
    }

    gap = insert_size - overlaplen + pre_pos + pos - contig_array[e1].length - contig_array[e2].length;

    if ( gap < - ( insert_size / 10 ) )
    {
        ignorePE2++;
        return 0;
    }

    if ( gap > insert_size )
    {
        ignorePE3++;
        return 0;
    }

    add1Connect ( e1, e2, gap, 1, 0 );
    add1Connect ( bal_e2, bal_e1, gap, 1, 0 );
    return 1;
}

int connectByPE_grad ( FILE * fp, int peGrad, char * line )
{
    long long pre_readno, readno, minno, maxno;
    int pre_pos, pos, flag, PE, count = 0;
    unsigned int pre_contigno, contigno, newIndex;

    // pes[] holds gradsCounter entries, so the valid indices are 0 .. gradsCounter - 1
    if ( peGrad < 0 || peGrad >= gradsCounter )
    {
        printf ( "specified pe grad is out of bound\n" );
        return 0;
    }

    maxno = pes[peGrad].PE_bound;

    if ( peGrad == 0 )
    {
        minno = 0;
    }
    else
    {
        minno = pes[peGrad - 1].PE_bound;
    }

    onsameCtgPE = peSUM = 0;
    PE = pes[peGrad].insertS;

    if ( strlen ( line ) )
    {
        // contig numbers are unsigned int, so they are read with %u
        sscanf ( line, "%lld %u %d", &pre_readno, &pre_contigno, &pre_pos );

        //printf("first record %d %d %d\n",pre_readno,pre_contigno,pre_pos);
        if ( pre_readno <= minno )
        {
            pre_readno = -1;
        }
    }
    else
    {
        pre_readno = -1;
    }

    ignorePE1 = ignorePE2 = ignorePE3 = 0;
    static_flag = 1;
    isStack = ( STACK * ) createStack ( CNBLOCKSIZE, sizeof ( int ) );

    while ( fgets ( line, lineLen, fp ) != NULL )
    {
        sscanf ( line, "%lld %u %d", &readno, &contigno, &pos );

        if ( readno > maxno )
        {
            break;
        }

        if ( readno <= minno )
        {
            continue;
        }

        newIndex = index_array[contigno];

        //if(contig_array[newIndex].bal_edge==0)
        if ( isSameAsTwin ( newIndex ) )
        {
            continue;
        }

        if ( PE && ( readno % 2 == 0 ) && ( pre_readno == readno - 1 ) ) // they are a pair of reads
        {
            flag = attach1PE ( pre_contigno, pre_pos, newIndex, pos, PE );

            if ( flag == 1 )
            {
                count++;
            }
        }

        pre_readno = readno;
        pre_contigno = newIndex;
        pre_pos = pos;
    }

    printf ( "%d PEs with insert size %d attached, %d + %d + %d ignored\n", count, PE, ignorePE1, ignorePE2, ignorePE3 );

    if ( onsameCtgPE > 0 )
    {
        printf ( "estimated PE size %lli, by %d pairs\n", peSUM / onsameCtgPE, onsameCtgPE );
    }

    printf ( "on contigs longer than %d, %d pairs found,", PE, isStack->item_c );
    printf ( "insert_size estimated: %d\n", calcuIS ( isStack ) );
    freeStack ( isStack );
    return count;
}

static int calcuIS ( STACK * intStack )
{
    long long sum = 0;
    int avg = 0;
    int * item;
    int num = intStack->item_c;

    if ( num < 100 )
    {
        return avg;
    }

    stackBackup ( intStack );

    while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL )
    {
        sum += *item;
    }

    stackRecover ( intStack );
    num = intStack->item_c;
    avg = sum / num;
    sum = 0;
    stackBackup ( intStack );

    while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL )
    {
        sum += ( *item - avg ) * ( *item - avg );
    }

    int SD = sqrt ( sum / ( num - 1 ) );

    if ( SD == 0 )
    {
        printf ( "SD=%d, ", SD );
        return avg;
    }

    stackRecover ( intStack );
    sum = num = 0;

    while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL )
        if ( abs ( *item - avg ) < 3 * SD )
        {
            sum
+= *item; num++; } avg = sum / num; printf ( "SD=%d, ", SD ); return avg; } unsigned int getTwinCtg ( unsigned int ctg ) { return ctg + contig_array[ctg].bal_edge - 1; } boolean isSmallerThanTwin ( unsigned int ctg ) { return contig_array[ctg].bal_edge > 1; } boolean isLargerThanTwin ( unsigned int ctg ) { return contig_array[ctg].bal_edge < 1; } boolean isSameAsTwin ( unsigned int ctg ) { return contig_array[ctg].bal_edge == 1; } SOAPdenovo-V1.05/src/31mer/bubble.c000644 000765 000024 00000140545 11530651532 016723 0ustar00Aquastaff000000 000000 /* * 31mer/bubble.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "dfibHeap.h" #include "fibHeap.h" #define false 0 #define true 1 #define SLOW_TO_FAST 1 #define FAST_TO_SLOW 0 #define MAXREADLENGTH 100 #define MAXCONNECTION 100 static int MAXNODELENGTH; static int DIFF; static unsigned int outNodeArray[MAXCONNECTION]; static ARC * outArcArray[MAXCONNECTION]; static boolean HasChanged; //static boolean staticFlag = 0; //static boolean staticFlag2 = 0; static const int INDEL = 0; static const int SIM[4][4] = { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }; //static variables static READINTERVAL * fastPath; static READINTERVAL * slowPath; static char fastSequence[MAXREADLENGTH]; static char slowSequence[MAXREADLENGTH]; static int fastSeqLength; static int slowSeqLength; static Time * times; static unsigned int * previous; static unsigned int expCounter; static unsigned int * expanded; static double cutoff; static int Fmatrix[MAXREADLENGTH + 1][MAXREADLENGTH + 1]; static int slowToFastMapping[MAXREADLENGTH + 1]; static int fastToSlowMapping[MAXREADLENGTH + 1]; static DFibHeapNode ** dheapNodes; static DFibHeap * dheap; static unsigned int activeNode; //static ARC *activeArc; static unsigned int startingNode; static int progress; static unsigned int * eligibleStartingPoints; // DEBUG static long long caseA, caseB, caseC, caseD; static long long dnodeCounter; static long long rnodeCounter; static long long btCounter; static long long cmpCounter; static long long simiCounter; static long long pinCounter; static long long replaceCounter; static long long getArcCounter; // END OF DEBUG static void output_seq ( char * seq, int length, FILE * fp, unsigned int from_vt, unsigned int dest ) { int i; Kmer kmer; char ch; char kmerSeq[32]; kmer = vt_array[from_vt].kmer; for ( i = overlaplen - 1; i >= 0; i-- ) { ch = kmer & 0x3; kmer >>= 2; kmerSeq[i] = ch; } for ( i = 0; i < overlaplen; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) 
kmerSeq[i] ) ); } fprintf ( fp, " " ); for ( i = 0; i < length; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) seq[i] ) ); } if ( edge_array[dest].seq ) { fprintf ( fp, " %c\n", int2base ( ( int ) getCharInTightString ( edge_array[dest].seq, 0 ) ) ); } else { fprintf ( fp, " N\n" ); } } static void print_path ( FILE * fp ) { READINTERVAL * marker; marker = fastPath->nextInRead; while ( marker->nextInRead ) { fprintf ( fp, "%u ", marker->edgeid ); marker = marker->nextInRead; } fprintf ( fp, "\n" ); marker = slowPath->nextInRead; while ( marker->nextInRead ) { fprintf ( fp, "%u ", marker->edgeid ); marker = marker->nextInRead; } fprintf ( fp, "\n" ); } static void output_pair ( int lengthF, int lengthS, FILE * fp, int nodeF, int nodeS, boolean merged, unsigned int source, unsigned int destination ) { fprintf ( fp, "$$ %d vs %d $$ %d\n", nodeF, nodeS, merged ); //print_path(fp); output_seq ( fastSequence, lengthF, fp, edge_array[source].to_vt, destination ); output_seq ( slowSequence, lengthS, fp, edge_array[source].to_vt, destination ); //fprintf(fp,"\n"); } static void resetNodeStatus() { unsigned int index; ARC * arc; unsigned int bal_ed; for ( index = 1; index <= num_ed; index++ ) { if ( EdSameAsTwin ( index ) ) { edge_array[index].multi = 1; continue; } arc = edge_array[index].arcs; bal_ed = getTwinEdge ( index ); while ( arc ) { if ( arc->to_ed == bal_ed ) { break; } arc = arc->next; } if ( arc ) { edge_array[index].multi = 1; edge_array[bal_ed].multi = 1; index++; continue; } arc = edge_array[bal_ed].arcs; while ( arc ) { if ( arc->to_ed == index ) { break; } arc = arc->next; } if ( arc ) { edge_array[index].multi = 1; edge_array[bal_ed].multi = 1; } else { edge_array[index].multi = 0; edge_array[bal_ed].multi = 0; } index++; } } /* static void determineEligibleStartingPoints() { long long index,counter=0; unsigned int node; unsigned int maxmult; ARC *parc; FibHeap *heap = newFibHeap(); for(index=1;index<=num_ed;index++){ 
if(edge_array[index].deleted||edge_array[index].length<1) continue; maxmult = counter = 0; parc = edge_array[index].arcs; while(parc){ if(parc->multiplicity > maxmult) maxmult = parc->multiplicity; parc = parc->next; } if(maxmult<1){ continue; } insertNodeIntoHeap(heap,-maxmult,index); } counter = 0; while((index=removeNextNodeFromHeap(heap))!=0){ eligibleStartingPoints[counter++] = index; } destroyHeap(heap); printf("%lld edges out of %d are eligible starting points\n",counter,num_ed); } */ static unsigned int nextStartingPoint() { unsigned int index = 1; unsigned int result = 0; for ( index = progress + 1; index < num_ed; index++ ) { //result = eligibleStartingPoints[index]; result = index; if ( edge_array[index].deleted || edge_array[index].length < 1 ) { continue; } if ( result == 0 ) { return 0; } if ( edge_array[result].multi > 0 ) { continue; } progress = index; return result; } return 0; } static void updateNodeStatus() { unsigned int i, node; for ( i = 0; i < expCounter; i++ ) { node = expanded[i]; edge_array[node].multi = 1; edge_array[getTwinEdge ( node )].multi = 1; } } unsigned int getNodePrevious ( unsigned int node ) { return previous[node]; } static boolean isPreviousToNode ( unsigned int previous, unsigned int target ) { unsigned int currentNode = target; unsigned int previousNode = 0; Time targetTime = times[target]; while ( currentNode ) { if ( currentNode == previous ) { return 1; } if ( currentNode == previousNode ) { return 0; } if ( times[currentNode] != targetTime ) { return 0; } previousNode = currentNode; currentNode = getNodePrevious ( currentNode ); } return 0; } static void copySeq ( char * targetS, char * sourceS, int pos, int length ) { char ch; int i, index; index = pos; for ( i = 0; i < length; i++ ) { ch = getCharInTightString ( sourceS, i ); targetS[index++] = ch; //writeChar2tightString(ch,targetS,index++); } } // return the length of sequence static int extractSequence ( READINTERVAL * path, char * sequence ) { READINTERVAL * 
marker; int seqLength, writeIndex; seqLength = writeIndex = 0; path->start = -10; marker = path->nextInRead; while ( marker->nextInRead ) { marker->start = seqLength; seqLength += edge_array[marker->edgeid].length; marker = marker->nextInRead; } marker->start = seqLength; if ( seqLength > MAXREADLENGTH ) { return 0; } marker = path->nextInRead; while ( marker->nextInRead ) { if ( edge_array[marker->edgeid].length && edge_array[marker->edgeid].seq ) { copySeq ( sequence, edge_array[marker->edgeid].seq, writeIndex, edge_array[marker->edgeid].length ); writeIndex += edge_array[marker->edgeid].length; } /* else if(edge_array[marker->edgeid].length==0) printf("node %d with length 0 in this path\n",marker->edgeid); else if(edge_array[marker->edgeid].seq==NULL) printf("node %d without seq in this path\n",marker->edgeid); */ marker = marker->nextInRead; } return seqLength; } static int max ( int A, int B, int C ) { A = A >= B ? A : B; return ( A >= C ? A : C ); } static boolean compareSequences ( char * sequence1, char * sequence2, int length1, int length2 ) { int i, j; int maxLength; int Choice1, Choice2, Choice3; int maxScore; if ( length1 == 0 || length2 == 0 ) { caseA++; return 0; } if ( abs ( ( int ) length1 - ( int ) length2 ) > 2 ) { caseB++; return 0; } if ( length1 < overlaplen - 1 || length2 < overlaplen - 1 ) { caseB++; return 0; } /* if (length1 < overlaplen || length2 < overlaplen){ if(abs((int)length1 - (int)length2) > 3){ caseB++; return 0; } } */ //printf("length %d vs %d\n",length1,length2); for ( i = 0; i <= length1; i++ ) { Fmatrix[i][0] = 0; } for ( j = 0; j <= length2; j++ ) { Fmatrix[0][j] = 0; } for ( i = 1; i <= length1; i++ ) { for ( j = 1; j <= length2; j++ ) { Choice1 = Fmatrix[i - 1][j - 1] + SIM[ ( int ) sequence1[i - 1]] [ ( int ) sequence2[j - 1]]; Choice2 = Fmatrix[i - 1][j] + INDEL; Choice3 = Fmatrix[i][j - 1] + INDEL; Fmatrix[i][j] = max ( Choice1, Choice2, Choice3 ); } } maxScore = Fmatrix[length1][length2]; maxLength = ( length1 > 
                  length2 ? length1 : length2 );

    if ( maxScore < maxLength - DIFF )
    {
        caseC++;
        return 0;
    }

    if ( ( 1 - ( double ) maxScore / maxLength ) > cutoff )
    {
        caseD++;
        return 0;
    }

    //printf("\niTOTO %i / %li\n", maxScore, maxLength);
    return 1;
}

static void mapSlowOntoFast()
{
    int slowIndex = slowSeqLength;
    int fastIndex = fastSeqLength;
    int fastn, slown;

    if ( slowIndex == 0 )
    {
        slowToFastMapping[0] = fastIndex;

        while ( fastIndex >= 0 )
        {
            fastToSlowMapping[fastIndex--] = 0;
        }

        return;
    }

    if ( fastIndex == 0 )
    {
        // record the end-to-end mapping before the loop below drives slowIndex to -1
        fastToSlowMapping[0] = slowIndex;

        while ( slowIndex >= 0 )
        {
            slowToFastMapping[slowIndex--] = 0;
        }

        return;
    }

    while ( slowIndex > 0 && fastIndex > 0 )
    {
        fastn = ( int ) fastSequence[fastIndex - 1]; //getCharInTightString(fastSequence,fastIndex-1);
        slown = ( int ) slowSequence[slowIndex - 1]; //getCharInTightString(slowSequence,slowIndex-1);

        if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex - 1] + SIM[fastn][slown] )
        {
            fastToSlowMapping[--fastIndex] = --slowIndex;
            slowToFastMapping[slowIndex] = fastIndex;
        }
        else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex] + INDEL )
        {
            fastToSlowMapping[--fastIndex] = slowIndex - 1;
        }
        else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex][slowIndex - 1] + INDEL )
        {
            slowToFastMapping[--slowIndex] = fastIndex - 1;
        }
        else
        {
            //printPaths();
            printf ( "Error" );
            fflush ( stdout );
            abort();
        }
    }

    while ( slowIndex > 0 )
    {
        slowToFastMapping[--slowIndex] = -1;
    }

    while ( fastIndex > 0 )
    {
        fastToSlowMapping[--fastIndex] = -1;
    }

    slowToFastMapping[slowSeqLength] = fastSeqLength;
    fastToSlowMapping[fastSeqLength] = slowSeqLength;
}

/*
//add an arc to the head of an arc list
static ARC *addArc(ARC *arc_list, ARC *arc)
{
    arc->prev = NULL;
    arc->next = arc_list;
    if(arc_list)
        arc_list->prev = arc;
    arc_list = arc;
    return arc_list;
}
*/

//remove an arc from the doubly linked list and return the updated list
ARC * deleteArc ( ARC * arc_list, ARC * arc )
{
    if ( arc->prev )
    {
        arc->prev->next = arc->next;
    }
    else
    {
        arc_list = arc->next;
    }

    if ( arc->next )
    {
        arc->next->prev = arc->prev;
    }

    /*
    if(checkActiveArc&&arc==activeArc){
        activeArc = arc->next;
    }
    */
    dismissArc ( arc );
    return arc_list;
}

//add an rv to the head of an rv list
static READINTERVAL * addRv ( READINTERVAL * rv_list, READINTERVAL * rv )
{
    rv->prevOnEdge = NULL;
    rv->nextOnEdge = rv_list;

    if ( rv_list )
    {
        rv_list->prevOnEdge = rv;
    }

    rv_list = rv;
    return rv_list;
}

//remove an rv from the doubly linked list and return the updated list
static READINTERVAL * deleteRv ( READINTERVAL * rv_list, READINTERVAL * rv )
{
    if ( rv->prevOnEdge )
    {
        rv->prevOnEdge->nextOnEdge = rv->nextOnEdge;
    }
    else
    {
        rv_list = rv->nextOnEdge;
    }

    if ( rv->nextOnEdge )
    {
        rv->nextOnEdge->prevOnEdge = rv->prevOnEdge;
    }

    return rv_list;
}

/*
static void disconnect(unsigned int from_ed, unsigned int to_ed)
{
    READINTERVAL *rv_temp;

    rv_temp = edge_array[from_ed].rv;
    while(rv_temp){
        if(!rv_temp->nextInRead||rv_temp->nextInRead->edgeid!=to_ed){
            rv_temp = rv_temp->nextOnEdge;
            continue;
        }
        rv_temp->nextInRead->prevInRead = NULL;
        rv_temp->nextInRead = NULL;
        rv_temp = rv_temp->nextOnEdge;
    }
}
*/

static int mapDistancesOntoPaths()
{
    READINTERVAL * marker;
    int totalDistance = 0;
    marker = slowPath;

    while ( marker->nextInRead )
    {
        marker = marker->nextInRead;
        marker->start = totalDistance;
        totalDistance += edge_array[marker->edgeid].length;
        marker->bal_rv->start = totalDistance;
    }

    totalDistance = 0;
    marker = fastPath;

    while ( marker->nextInRead )
    {
        marker = marker->nextInRead;
        marker->start = totalDistance;
        totalDistance += edge_array[marker->edgeid].length;
        marker->bal_rv->start = totalDistance;
    }

    return totalDistance;
}

//attach a path to the graph and, at the same time, build its reverse-complement path
static void attachPath ( READINTERVAL * path )
{
    READINTERVAL * marker, *bal_marker;
    unsigned int ed, bal_ed;
    marker = path;

    while ( marker )
    {
        ed = marker->edgeid;
        edge_array[ed].rv = addRv ( edge_array[ed].rv, marker );
        bal_ed = getTwinEdge ( ed );
        bal_marker =
allocateRV ( -marker->readid, bal_ed ); edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_marker ); if ( marker->prevInRead ) { marker->prevInRead->bal_rv->prevInRead = bal_marker; bal_marker->nextInRead = marker->prevInRead->bal_rv; } bal_marker->bal_rv = marker; marker->bal_rv = bal_marker; marker = marker->nextInRead; } } static void detachPathSingle ( READINTERVAL * path ) { READINTERVAL * marker, *nextMarker; unsigned int ed; marker = path; while ( marker ) { nextMarker = marker->nextInRead; ed = marker->edgeid; edge_array[ed].rv = deleteRv ( edge_array[ed].rv, marker ); dismissRV ( marker ); marker = nextMarker; } } static void detachPath ( READINTERVAL * path ) { READINTERVAL * marker, *bal_marker, *nextMarker; unsigned int ed, bal_ed; marker = path; while ( marker ) { nextMarker = marker->nextInRead; bal_marker = marker->bal_rv; ed = marker->edgeid; edge_array[ed].rv = deleteRv ( edge_array[ed].rv, marker ); dismissRV ( marker ); //printf("%d (%d),",ed,edge_array[ed].deleted); bal_ed = getTwinEdge ( ed ); edge_array[bal_ed].rv = deleteRv ( edge_array[bal_ed].rv, bal_marker ); dismissRV ( bal_marker ); marker = nextMarker; } // printf("\n"); fflush ( stdout ); } static void remapNodeMarkersOntoNeighbour ( unsigned int source, unsigned int target ) { READINTERVAL * marker, *bal_marker; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); while ( ( marker = edge_array[source].rv ) != NULL ) { edge_array[source].rv = deleteRv ( edge_array[source].rv, marker ); marker->edgeid = target; edge_array[target].rv = addRv ( edge_array[target].rv, marker ); bal_marker = marker->bal_rv; edge_array[bal_source].rv = deleteRv ( edge_array[bal_source].rv, bal_marker ); bal_marker->edgeid = bal_target; edge_array[bal_target].rv = addRv ( edge_array[bal_target].rv, bal_marker ); } } static void remapNodeInwardReferencesOntoNode ( unsigned int source, unsigned int target ) { ARC * arc; unsigned int destination; for ( arc = 
edge_array[source].arcs; arc != NULL; arc = arc->next ) { destination = arc->to_ed; if ( destination == target || destination == source ) { continue; } if ( previous[destination] == source ) { previous[destination] = target; } } } static void remapNodeTimesOntoTargetNode ( unsigned int source, unsigned int target ) { Time nodeTime = times[source]; unsigned int prevNode = previous[source]; Time targetTime = times[target]; if ( nodeTime == -1 ) { return; } if ( prevNode == source ) { times[target] = nodeTime; previous[target] = target; } else if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, prevNode ) ) ) { times[target] = nodeTime; if ( prevNode != getTwinEdge ( source ) ) { previous[target] = prevNode; } else { previous[target] = getTwinEdge ( target ); } } remapNodeInwardReferencesOntoNode ( source, target ); previous[source] = 0; } static void remapNodeTimesOntoNeighbour ( unsigned int source, unsigned int target ) { remapNodeTimesOntoTargetNode ( source, target ); remapNodeTimesOntoTargetNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); //questionable } static void destroyArc ( unsigned int from_ed, ARC * arc ) { unsigned int bal_dest; ARC * twinArc; if ( !arc ) { return; } bal_dest = getTwinEdge ( arc->to_ed ); twinArc = arc->bal_arc; removeArcInLookupTable ( from_ed, arc->to_ed ); edge_array[from_ed].arcs = deleteArc ( edge_array[from_ed].arcs, arc ); if ( bal_dest != from_ed ) { removeArcInLookupTable ( bal_dest, getTwinEdge ( from_ed ) ); edge_array[bal_dest].arcs = deleteArc ( edge_array[bal_dest].arcs, twinArc ); } } static void createAnalogousArc ( unsigned int originNode, unsigned int destinationNode, ARC * refArc ) { ARC * arc, *twinArc; unsigned int destinationTwin; arc = getArcBetween ( originNode, destinationNode ); if ( arc ) { if ( refArc->bal_arc != refArc ) { arc->multiplicity += refArc->multiplicity; arc->bal_arc->multiplicity += refArc->multiplicity; } else { arc->multiplicity += 
refArc->multiplicity / 2; arc->bal_arc->multiplicity += refArc->multiplicity / 2; } return; } arc = allocateArc ( destinationNode ); arc->multiplicity = refArc->multiplicity; arc->prev = NULL; arc->next = edge_array[originNode].arcs; if ( edge_array[originNode].arcs ) { edge_array[originNode].arcs->prev = arc; } edge_array[originNode].arcs = arc; putArc2LookupTable ( originNode, arc ); destinationTwin = getTwinEdge ( destinationNode ); if ( destinationTwin == originNode ) { //printf("arc from A to A'\n"); arc->bal_arc = arc; if ( refArc->bal_arc != refArc ) { arc->multiplicity += refArc->multiplicity; } return; } twinArc = allocateArc ( getTwinEdge ( originNode ) ); arc->bal_arc = twinArc; twinArc->bal_arc = arc; twinArc->multiplicity = refArc->multiplicity; twinArc->prev = NULL; twinArc->next = edge_array[destinationTwin].arcs; if ( edge_array[destinationTwin].arcs ) { edge_array[destinationTwin].arcs->prev = twinArc; } edge_array[destinationTwin].arcs = twinArc; putArc2LookupTable ( destinationTwin, twinArc ); } static void remapNodeArcsOntoTarget ( unsigned int source, unsigned int target ) { ARC * arc; if ( source == activeNode ) { activeNode = target; } arc = edge_array[source].arcs; if ( !arc ) { return; } while ( arc != NULL ) { createAnalogousArc ( target, arc->to_ed, arc ); destroyArc ( source, arc ); arc = edge_array[source].arcs; } } static void remapNodeArcsOntoNeighbour ( unsigned int source, unsigned int target ) { remapNodeArcsOntoTarget ( source, target ); remapNodeArcsOntoTarget ( getTwinEdge ( source ), getTwinEdge ( target ) ); } static DFibHeapNode * getNodeDHeapNode ( unsigned int node ) { return dheapNodes[node]; } static void setNodeDHeapNode ( unsigned int node, DFibHeapNode * dheapNode ) { dheapNodes[node] = dheapNode; } static void remapNodeFibHeapReferencesOntoNode ( unsigned int source, unsigned int target ) { DFibHeapNode * sourceDHeapNode = getNodeDHeapNode ( source ); DFibHeapNode * targetDHeapNode = getNodeDHeapNode ( target ); if ( 
sourceDHeapNode == NULL ) { return; } if ( targetDHeapNode == NULL ) { setNodeDHeapNode ( target, sourceDHeapNode ); replaceValueInDHeap ( sourceDHeapNode, target ); } else if ( getKey ( targetDHeapNode ) > getKey ( sourceDHeapNode ) ) { setNodeDHeapNode ( target, sourceDHeapNode ); replaceValueInDHeap ( sourceDHeapNode, target ); destroyNodeInDHeap ( targetDHeapNode, dheap ); } else { destroyNodeInDHeap ( sourceDHeapNode, dheap ); } setNodeDHeapNode ( source, NULL ); } static void combineCOV ( unsigned int source, int len_s, unsigned int target, int len_t ) { if ( len_s < 1 || len_t < 1 ) { return; } int cov = ( len_s * edge_array[source].cvg + len_t * edge_array[target].cvg ) / len_t; edge_array[target].cvg = cov > MaxEdgeCov ? MaxEdgeCov : cov; edge_array[getTwinEdge ( target )].cvg = cov > MaxEdgeCov ? MaxEdgeCov : cov; } static void remapNodeOntoNeighbour ( unsigned int source, unsigned int target ) { combineCOV ( source, edge_array[source].length, target, edge_array[target].length ); remapNodeMarkersOntoNeighbour ( source, target ); remapNodeTimesOntoNeighbour ( source, target ); //questionable remapNodeArcsOntoNeighbour ( source, target ); remapNodeFibHeapReferencesOntoNode ( source, target ); remapNodeFibHeapReferencesOntoNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); edge_array[source].deleted = 1; edge_array[getTwinEdge ( source )].deleted = 1; if ( startingNode == source ) { startingNode = target; } if ( startingNode == getTwinEdge ( source ) ) { startingNode = getTwinEdge ( target ); } edge_array[source].length = 0; edge_array[getTwinEdge ( source )].length = 0; } static void connectInRead ( READINTERVAL * previous, READINTERVAL * next ) { if ( previous ) { previous->nextInRead = next; if ( next ) { previous->bal_rv->prevInRead = next->bal_rv; } else { previous->bal_rv->prevInRead = NULL; } } if ( next ) { next->prevInRead = previous; if ( previous ) { next->bal_rv->nextInRead = previous->bal_rv; } else { next->bal_rv->nextInRead = NULL; } } } 
static int remapBackOfNodeMarkersOntoNeighbour ( unsigned int source, READINTERVAL * sourceMarker, unsigned int target, READINTERVAL * targetMarker, boolean slowToFast ) { READINTERVAL * marker, *newMarker, *bal_new, *previousMarker; int halfwayPoint, halfwayPointOffset, breakpoint; int * targetToSourceMapping, *sourceToTargetMapping; unsigned int bal_ed; int targetFinish = targetMarker->bal_rv->start; int sourceStart = sourceMarker->start; int sourceFinish = sourceMarker->bal_rv->start; int alignedSourceLength = sourceFinish - sourceStart; int realSourceLength = edge_array[source].length; if ( slowToFast ) { sourceToTargetMapping = slowToFastMapping; targetToSourceMapping = fastToSlowMapping; } else { sourceToTargetMapping = fastToSlowMapping; targetToSourceMapping = slowToFastMapping; } if ( alignedSourceLength > 0 && targetFinish > 0 ) { halfwayPoint = targetToSourceMapping[targetFinish - 1] - sourceStart + 1; halfwayPoint *= realSourceLength; halfwayPoint /= alignedSourceLength; } else { halfwayPoint = 0; } if ( halfwayPoint < 0 ) { halfwayPoint = 0; } if ( halfwayPoint > realSourceLength ) { halfwayPoint = realSourceLength; } halfwayPointOffset = realSourceLength - halfwayPoint; bal_ed = getTwinEdge ( target ); for ( marker = edge_array[source].rv; marker != NULL; marker = marker->nextOnEdge ) { if ( marker->prevInRead && marker->prevInRead->edgeid == target ) { continue; } newMarker = allocateRV ( marker->readid, target ); edge_array[target].rv = addRv ( edge_array[target].rv, newMarker ); bal_new = allocateRV ( -marker->readid, bal_ed ); edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_new ); newMarker->bal_rv = bal_new; bal_new->bal_rv = newMarker; newMarker->start = marker->start; if ( realSourceLength > 0 ) { breakpoint = halfwayPoint + marker->start; } else { breakpoint = marker->start; } bal_new->start = breakpoint; marker->start = breakpoint; previousMarker = marker->prevInRead; connectInRead ( previousMarker, newMarker ); connectInRead ( 
newMarker, marker ); } return halfwayPointOffset; } static void printKmer ( Kmer kmer ) { int i; char kmerSeq[32], ch; for ( i = overlaplen - 1; i >= 0; i-- ) { ch = kmer & 3; kmer >>= 2; kmerSeq[i] = ch; } for ( i = 0; i < overlaplen; i++ ) { printf ( "%c", int2base ( ( int ) kmerSeq[i] ) ); } printf ( "\n" ); } static void splitNodeDescriptor ( unsigned int source, unsigned int target, int offset ) { int originalLength = edge_array[source].length; int backLength = originalLength - offset; int index, seqLen; char * tightSeq, nt, *newSeq; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); edge_array[source].length = offset; edge_array[bal_source].length = offset; edge_array[source].flag = 1; edge_array[bal_source].flag = 1; if ( target != 0 ) { edge_array[target].length = backLength; edge_array[bal_target].length = backLength; free ( ( void * ) edge_array[target].seq ); edge_array[target].seq = NULL; free ( ( void * ) edge_array[bal_target].seq ); edge_array[bal_target].seq = NULL; } if ( backLength == 0 ) { return; } tightSeq = edge_array[source].seq; seqLen = backLength / 4 + 1; if ( target != 0 ) { edge_array[target].flag = 1; edge_array[bal_target].flag = 1; /* Kmer word; int pos; if(backLength=pos) word = nextKmer(word,nt); */ } /* printKmer(vt_array[edge_array[source].from_vt].kmer); printKmer(vt_array[edge_array[bal_source].to_vt].kmer); printf("-----\n"); edge_array[source].from_vt = num_vt; vt_array[num_vt++].kmer = word; word = reverseComplement(word); edge_array[bal_source].to_vt = num_vt; vt_array[num_vt++].kmer = word; printKmer(vt_array[edge_array[source].from_vt].kmer); printKmer(vt_array[edge_array[bal_source].to_vt].kmer); */ } //source node for ( index = backLength; index < originalLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, tightSeq, index - backLength ); } if ( target == 0 ) { return; } //target twin tightSeq = edge_array[bal_source].seq; newSeq = 
( char * ) ckalloc ( seqLen * sizeof ( char ) ); edge_array[bal_target].seq = newSeq; for ( index = offset; index < originalLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, newSeq, index - offset ); } } static void remapBackOfNodeDescriptorOntoNeighbour ( unsigned int source, unsigned int target, boolean slowToFast, int offset ) { unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); if ( slowToFast ) { splitNodeDescriptor ( source, 0, offset ); } else { splitNodeDescriptor ( source, target, offset ); } //printf("%d vs %d\n",edge_array[source].from_vt,edge_array[target].to_vt); edge_array[source].from_vt = edge_array[target].to_vt; edge_array[bal_source].to_vt = edge_array[bal_target].from_vt; } static void remapBackOfNodeTimesOntoNeighbour ( unsigned int source, unsigned int target ) { Time targetTime = times[target]; Time nodeTime = times[source]; unsigned int twinTarget = getTwinEdge ( target ); unsigned int twinSource = getTwinEdge ( source ); unsigned int previousNode; if ( nodeTime != -1 ) { previousNode = previous[source]; if ( previousNode == source ) { times[target] = nodeTime; previous[target] = target; } else if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) { times[target] = nodeTime; if ( previousNode != twinSource ) { previous[target] = previousNode; } else { previous[target] = twinTarget; } } previous[source] = target; } targetTime = times[twinTarget]; nodeTime = times[twinSource]; if ( nodeTime != -1 ) { if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( twinTarget, twinSource ) ) ) { times[twinTarget] = nodeTime; previous[twinTarget] = twinSource; } } remapNodeInwardReferencesOntoNode ( twinSource, twinTarget ); } static void remapBackOfNodeArcsOntoNeighbour ( unsigned int source, unsigned int target ) { ARC * arc; remapNodeArcsOntoTarget 
( getTwinEdge ( source ), getTwinEdge ( target ) ); for ( arc = edge_array[source].arcs; arc != NULL; arc = arc->next ) { createAnalogousArc ( target, source, arc ); } } static void remapBackOfNodeOntoNeighbour ( unsigned int source, READINTERVAL * sourceMarker, unsigned int target, READINTERVAL * targetMarker, boolean slowToFast ) { int offset; offset = remapBackOfNodeMarkersOntoNeighbour ( source, sourceMarker, target, targetMarker, slowToFast ); remapBackOfNodeDescriptorOntoNeighbour ( source, target, slowToFast, offset ); combineCOV ( source, edge_array[target].length, target, edge_array[target].length ); remapBackOfNodeTimesOntoNeighbour ( source, target ); remapBackOfNodeArcsOntoNeighbour ( source, target ); remapNodeFibHeapReferencesOntoNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); //why not "remapNodeFibHeapReferencesOntoNode(source,target);" //because the downstream part of source still retains, which can serve as previousNode as before if ( getTwinEdge ( source ) == startingNode ) { startingNode = getTwinEdge ( target ); } } static boolean markerLeadsToNode ( READINTERVAL * marker, unsigned int node ) { READINTERVAL * currentMarker; for ( currentMarker = marker; currentMarker != NULL; currentMarker = currentMarker->nextInRead ) if ( currentMarker->edgeid == node ) { return true; } return false; } static void reduceNode ( unsigned int node ) { unsigned int bal_ed = getTwinEdge ( node ); edge_array[node].length = 0; edge_array[bal_ed].length = 0; } static void reduceSlowNodes ( READINTERVAL * slowMarker, unsigned int finish ) { READINTERVAL * marker; for ( marker = slowMarker; marker->edgeid != finish; marker = marker->nextInRead ) { reduceNode ( marker->edgeid ); } } static boolean markerLeadsToArc ( READINTERVAL * marker, unsigned int nodeA, unsigned int nodeB ) { READINTERVAL * current, *next; unsigned int twinA = getTwinEdge ( nodeA ); unsigned int twinB = getTwinEdge ( nodeB ); current = marker; while ( current != NULL ) { next = 
current->nextInRead;

		/* The last marker in a read has no successor, so it cannot start an arc. */
		if ( next == NULL )
		{
			break;
		}

		if ( current->edgeid == nodeA && next->edgeid == nodeB )
		{
			return true;
		}

		if ( current->edgeid == twinB && next->edgeid == twinA )
		{
			return true;
		}

		current = next;
	}

	return false;
}

static void remapEmptyPathArcsOntoMiddlePathSimple ( READINTERVAL * emptyPath, READINTERVAL * targetPath )
{
	READINTERVAL * pathMarker, *marker;
	unsigned int start = emptyPath->prevInRead->edgeid;
	unsigned int finish = emptyPath->edgeid;
	unsigned int previousNode = start;
	unsigned int currentNode;
	ARC * originalArc = getArcBetween ( start, finish );

	if ( !originalArc )
	{
		printf ( "remapEmptyPathArcsOntoMiddlePathSimple: no arc between %u and %u\n", start, finish );
		marker = fastPath;
		printf ( "fast path: " );

		while ( marker )
		{
			printf ( "%u,", marker->edgeid );
			marker = marker->nextInRead;
		}

		printf ( "\n" );
		marker = slowPath;
		printf ( "slow path: " );

		while ( marker )
		{
			printf ( "%u,", marker->edgeid );
			marker = marker->nextInRead;
		}

		printf ( "\n" );
		fflush ( stdout );
		/* Without a reference arc there is nothing to copy; createAnalogousArc would dereference NULL. */
		return;
	}

	for ( pathMarker = targetPath; pathMarker->edgeid != finish; pathMarker = pathMarker->nextInRead )
	{
		currentNode = pathMarker->edgeid;
		createAnalogousArc ( previousNode, currentNode, originalArc );
		previousNode = currentNode;
	}

	createAnalogousArc ( previousNode, finish, originalArc );
	destroyArc ( start, originalArc );
}

static void remapEmptyPathMarkersOntoMiddlePathSimple ( READINTERVAL * emptyPath, READINTERVAL * targetPath, boolean slowToFast )
{
	READINTERVAL * marker, *newMarker, *previousMarker, *pathMarker, *bal_marker;
	unsigned int start = emptyPath->prevInRead->edgeid;
	unsigned int finish = emptyPath->edgeid;
	unsigned int markerStart, bal_ed;
	READINTERVAL * oldMarker = edge_array[finish].rv;

	while ( oldMarker )
	{
		marker = oldMarker;
		oldMarker = marker->nextOnEdge;
		newMarker = marker->prevInRead;

		/* Skip markers that start a read or that do not arrive from the start node. */
		if ( !newMarker || newMarker->edgeid != start )
		{
			continue;
		}

		/* Only remap the marker of the path being moved: read 2 is the slow path, read 1 the fast path. */
		if ( ( slowToFast && marker->readid != 2 ) || ( !slowToFast && marker->readid != 1 ) )
		{
			continue;
		}

		markerStart = marker->start;

		for ( pathMarker = 
targetPath; pathMarker->edgeid != finish; pathMarker = pathMarker->nextInRead ) { previousMarker = newMarker; //maker a new marker newMarker = allocateRV ( marker->readid, pathMarker->edgeid ); newMarker->start = markerStart; edge_array[pathMarker->edgeid].rv = addRv ( edge_array[pathMarker->edgeid].rv, newMarker ); //maker the twin marker bal_ed = getTwinEdge ( pathMarker->edgeid ); bal_marker = allocateRV ( -marker->readid, bal_ed ); bal_marker->start = markerStart; edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_marker ); newMarker->bal_rv = bal_marker; bal_marker->bal_rv = newMarker; connectInRead ( previousMarker, newMarker ); } connectInRead ( newMarker, marker ); } } static void remapNodeTimesOntoForwardMiddlePath ( unsigned int source, READINTERVAL * path ) { READINTERVAL * marker; unsigned int target; Time nodeTime = times[source]; unsigned int previousNode = previous[source]; Time targetTime; for ( marker = path; marker->edgeid != source; marker = marker->nextInRead ) { target = marker->edgeid; targetTime = times[target]; if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) { times[target] = nodeTime; previous[target] = previousNode; } previousNode = target; } previous[source] = previousNode; } static void remapNodeTimesOntoTwinMiddlePath ( unsigned int source, READINTERVAL * path ) { READINTERVAL * marker; unsigned int target; unsigned int previousNode = getTwinEdge ( source ); Time targetTime; READINTERVAL * limit = path->prevInRead->bal_rv; Time nodeTime = times[limit->edgeid]; marker = path; while ( marker->edgeid != source ) { marker = marker->nextInRead; } marker = marker->bal_rv; while ( marker != limit ) { marker = marker->nextInRead; target = marker->edgeid; targetTime = times[target]; if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) { times[target] = nodeTime; previous[target] = 
previousNode; } previousNode = target; } } static void remapEmptyPathOntoMiddlePath ( READINTERVAL * emptyPath, READINTERVAL * targetPath, boolean slowToFast ) { unsigned int start = emptyPath->prevInRead->edgeid; unsigned int finish = emptyPath->edgeid; // Remapping markers if ( !markerLeadsToArc ( targetPath, start, finish ) ) { remapEmptyPathArcsOntoMiddlePathSimple ( emptyPath, targetPath ); } remapEmptyPathMarkersOntoMiddlePathSimple ( emptyPath, targetPath, slowToFast ); //Remap times and previous(if necessary) if ( getNodePrevious ( finish ) == start ) { remapNodeTimesOntoForwardMiddlePath ( finish, targetPath ); } if ( getNodePrevious ( getTwinEdge ( start ) ) == getTwinEdge ( finish ) ) { remapNodeTimesOntoTwinMiddlePath ( finish, targetPath ); } } static boolean cleanUpRedundancy() { READINTERVAL * slowMarker = slowPath->nextInRead, *fastMarker = fastPath->nextInRead; unsigned int slowNode, fastNode; int slowLength, fastLength; int fastConstraint = 0; int slowConstraint = 0; int finalLength; attachPath ( slowPath ); attachPath ( fastPath ); mapSlowOntoFast(); finalLength = mapDistancesOntoPaths(); slowLength = fastLength = 0; while ( slowMarker != NULL && fastMarker != NULL ) { //modifCounter++; if ( !slowMarker->nextInRead ) { slowLength = finalLength; } else { slowLength = slowToFastMapping[slowMarker->bal_rv->start - 1]; if ( slowLength < slowConstraint ) { slowLength = slowConstraint; } } fastLength = fastMarker->bal_rv->start - 1; if ( fastLength < fastConstraint ) { fastLength = fastConstraint; } slowNode = slowMarker->edgeid; fastNode = fastMarker->edgeid; if ( false ) printf ( "Slow %d\tFast %d\n", slowLength, fastLength ); if ( slowNode == fastNode ) { //if (false) if ( false ) printf ( "0/ Already merged together %d == %d\n", slowNode, fastNode ); if ( fastLength > slowLength ) { slowConstraint = fastLength; } //else if (fastLength < slowLength); fastConstraint = slowLength; slowMarker = slowMarker->nextInRead; fastMarker = 
fastMarker->nextInRead; } else if ( slowNode == getTwinEdge ( fastNode ) ) { //if (false) if ( false ) printf ( "1/ Creme de la hairpin %d $$ %d\n", slowNode, fastNode ); if ( fastLength > slowLength ) { slowConstraint = fastLength; } //else if (fastLength < slowLength); fastConstraint = slowLength; slowMarker = slowMarker->nextInRead; fastMarker = fastMarker->nextInRead; //foldSymmetricalNode(fastNode); } else if ( markerLeadsToNode ( slowMarker, fastNode ) ) { //if (false) if ( false ) { printf ( "2/ Remapping empty fast arc onto slow nodes\n" ); } reduceSlowNodes ( slowMarker, fastNode ); remapEmptyPathOntoMiddlePath ( fastMarker, slowMarker, FAST_TO_SLOW ); while ( slowMarker->edgeid != fastNode ) { slowMarker = slowMarker->nextInRead; } } else if ( markerLeadsToNode ( fastMarker, slowNode ) ) { //if (false) if ( false ) { printf ( "3/ Remapping empty slow arc onto fast nodes\n" ); } remapEmptyPathOntoMiddlePath ( slowMarker, fastMarker, SLOW_TO_FAST ); while ( fastMarker->edgeid != slowNode ) { fastMarker = fastMarker->nextInRead; } } else if ( slowLength == fastLength ) { if ( false ) { printf ( "A/ Mapped equivalent nodes together %d <=> %d\n", slowNode, fastNode ); } remapNodeOntoNeighbour ( slowNode, fastNode ); slowMarker = slowMarker->nextInRead; fastMarker = fastMarker->nextInRead; } else if ( slowLength < fastLength ) { if ( false ) { printf ( "B/ Mapped back of fast node into slow %d -> %d\n", fastNode, slowNode ); } remapBackOfNodeOntoNeighbour ( fastNode, fastMarker, slowNode, slowMarker, FAST_TO_SLOW ); slowMarker = slowMarker->nextInRead; } else { if ( false ) { printf ( "C/ Mapped back of slow node into fast %d -> %d\n", slowNode, fastNode ); } remapBackOfNodeOntoNeighbour ( slowNode, slowMarker, fastNode, fastMarker, SLOW_TO_FAST ); fastMarker = fastMarker->nextInRead; } fflush ( stdout ); } detachPath ( fastPath ); detachPath ( slowPath ); return 1; } static void comparePaths ( unsigned int destination, unsigned int origin ) { int slowLength, 
fastLength, i; unsigned int fastNode, slowNode; READINTERVAL * marker; slowLength = fastLength = 0; fastNode = destination; slowNode = origin; btCounter++; while ( fastNode != slowNode ) { if ( times[fastNode] > times[slowNode] ) { fastLength++; fastNode = previous[fastNode]; } else if ( times[fastNode] < times[slowNode] ) { slowLength++; slowNode = previous[slowNode]; } else if ( isPreviousToNode ( slowNode, fastNode ) ) { while ( fastNode != slowNode ) { fastLength++; fastNode = previous[fastNode]; } } else if ( isPreviousToNode ( fastNode, slowNode ) ) { while ( slowNode != fastNode ) { slowLength++; slowNode = previous[slowNode]; } } else { fastLength++; fastNode = previous[fastNode]; slowLength++; slowNode = previous[slowNode]; } if ( slowLength > MAXNODELENGTH || fastLength > MAXNODELENGTH ) { return; } } if ( fastLength == 0 ) //originally fastNode is previous to slowNode { //printf("fastLength is %d\n",fastLength); return; } marker = allocateRV ( 1, destination ); fastPath = marker; for ( i = 0; i < fastLength; i++ ) { marker = allocateRV ( 1, previous[fastPath->edgeid] ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); marker->nextInRead = fastPath; fastPath->prevInRead = marker; fastPath = marker; } marker = allocateRV ( 2, destination ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); slowPath = marker; marker = allocateRV ( 2, origin ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); marker->nextInRead = slowPath; slowPath->prevInRead = marker; slowPath = marker; for ( i = 0; i < slowLength; i++ ) { marker = allocateRV ( 2, previous[slowPath->edgeid] ); //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid); marker->nextInRead = slowPath; slowPath->prevInRead = marker; slowPath = marker; } //printf("node num %d vs %d\n",fastLength,slowLength); fastSeqLength = extractSequence ( fastPath, fastSequence ); slowSeqLength = extractSequence ( slowPath, 
slowSequence ); /* if(destination==6359){ printf("destination %d, slowLength %d, fastLength %d\n",destination,slowLength,fastLength); printf("fastSeqLength %d, slowSeqLength %d\n",fastSeqLength,slowSeqLength); } */ if ( !fastSeqLength || !slowSeqLength ) { detachPathSingle ( slowPath ); detachPathSingle ( fastPath ); return; } cmpCounter++; if ( !compareSequences ( fastSequence, slowSequence, fastSeqLength, slowSeqLength ) ) { //output_pair(fastSeqLength,slowSeqLength,ftemp,fastLength-1,slowLength, 0,slowPath->edgeid,destination); detachPathSingle ( slowPath ); detachPathSingle ( fastPath ); return; } simiCounter++; //output_pair(fastSeqLength,slowSeqLength,ftemp,fastLength-1,slowLength, 1,slowPath->edgeid,destination); //pinCounter++; pinCounter += cleanUpRedundancy(); if ( pinCounter % 100000 == 0 ) { printf ( ".............%lld\n", pinCounter ); } HasChanged = 1; } static void tourBusArc ( unsigned int origin, unsigned int destination, unsigned int arcMulti, Time originTime ) { Time arcTime, totalTime, destinationTime; unsigned int oldPrevious = previous[destination]; if ( oldPrevious == origin || edge_array[destination].multi == 1 ) { return; } arcCounter++; if ( arcMulti > 0 ) { arcTime = ( ( Time ) edge_array[origin].length ) / ( ( Time ) arcMulti ); } else { arcTime = 0.0; printf ( "arc from %d to %d with flags %d originTime %f, arc %d\n", origin, destination, edge_array[destination].multi, originTime, arcMulti ); fflush ( stdout ); } totalTime = originTime + arcTime; /* if(destination==289129||destination==359610){ printf("arc from %d to %d with flags %d time %f originTime %f, arc %d\n", origin,destination,edge_array[destination].multi,totalTime,originTime,arcMulti); fflush(stdout); } */ destinationTime = times[destination]; if ( destinationTime == -1 ) { times[destination] = totalTime; dheapNodes[destination] = insertNodeIntoDHeap ( dheap, totalTime, destination ); dnodeCounter++; previous[destination] = origin; return; } else if ( destinationTime > 
totalTime ) { if ( dheapNodes[destination] == NULL ) { //printf("node %d Already expanded though\n",destination); return; } replaceCounter++; times[destination] = totalTime; replaceKeyInDHeap ( dheap, dheapNodes[destination], totalTime ); previous[destination] = origin; comparePaths ( destination, oldPrevious ); return; } else { if ( destinationTime == times[origin] && isPreviousToNode ( destination, origin ) ) { return; } comparePaths ( destination, origin ); } } static void tourBusNode ( unsigned int node ) { ARC * parc; int index = 0, outNodeNum; /* if(node==745) printf("to expand %d\n",node); */ expanded[expCounter++] = node; //edge_array[node].multi = 2; activeNode = node; parc = edge_array[activeNode].arcs; while ( parc ) { outArcArray[index] = parc; outNodeArray[index++] = parc->to_ed; if ( index >= MAXCONNECTION ) { //printf("node %d has more than MAXCONNECTION arcs\n",node); break; } parc = parc->next; } outNodeNum = index; HasChanged = 0; for ( index = 0; index < outNodeNum; index++ ) { if ( HasChanged ) { parc = getArcBetween ( activeNode, outNodeArray[index] ); getArcCounter++; } else { parc = outArcArray[index]; } if ( !parc ) { continue; } tourBusArc ( activeNode, outNodeArray[index], parc->multiplicity, times[activeNode] ); } } /* static void dumpNodeFromDHeap() { unsigned int currentNode; while((currentNode = removeNextNodeFromDHeap(dheap))!=0){ rnodeCounter++; times[currentNode] = -1; previous[currentNode] = 0; dheapNodes[currentNode] = NULL; if(dnodeCounter-rnodeCounter<250) break; } } */ static void tourBus ( unsigned int startingPoint ) { unsigned int currentNode = startingPoint; times[startingPoint] = 0; previous[startingPoint] = currentNode; while ( currentNode > 0 ) { dheapNodes[currentNode] = NULL; tourBusNode ( currentNode ); currentNode = removeNextNodeFromDHeap ( dheap ); if ( currentNode > 0 ) { rnodeCounter++; } } } void bubblePinch ( double simiCutoff, char * outfile, int M ) { unsigned int index, counter = 0; unsigned int 
startingNode;
	char temp[256];
	/* Bounded formatting: outfile may be longer than the 256-byte buffer. */
	snprintf ( temp, sizeof ( temp ), "%s.pathpair", outfile );
	//ftemp = ckopen(temp,"w");
	//linearConcatenate();
	//initiator
	caseA = caseB = caseC = caseD = 0;
	progress = 0;
	arcCounter = 0;
	dnodeCounter = 0;
	rnodeCounter = 0;
	btCounter = 0;
	cmpCounter = 0;
	simiCounter = 0;
	pinCounter = 0;
	replaceCounter = 0;
	getArcCounter = 0;
	cutoff = 1.0 - simiCutoff;

	/* M selects how aggressively bubbles are merged: larger M allows longer paths and more differences. */
	if ( M <= 1 )
	{
		MAXNODELENGTH = 3;
		DIFF = 2;
	}
	else if ( M == 2 )
	{
		MAXNODELENGTH = 9;
		DIFF = 3;
	}
	else
	{
		MAXNODELENGTH = 30;
		DIFF = 10;
	}

	printf ( "start to pinch bubbles, cutoff %f, MAX NODE NUM %d, MAX DIFF %d\n", cutoff, MAXNODELENGTH, DIFF );
	createRVmemo();
	/* Per-edge working arrays; index 0 is unused because edge ids start at 1. */
	times = ( Time * ) ckalloc ( ( num_ed + 1 ) * sizeof ( Time ) );
	previous = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
	expanded = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
	dheapNodes = ( DFibHeapNode ** ) ckalloc ( ( num_ed + 1 ) * sizeof ( DFibHeapNode * ) );
	WORDFILTER = ( ( ( Kmer ) 1 ) << ( 2 * overlaplen ) ) - 1;

	for ( index = 1; index <= num_ed; index++ )
	{
		times[index] = -1;
		previous[index] = 0;
		dheapNodes[index] = NULL;
	}

	dheap = newDFibHeap();
	eligibleStartingPoints = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
	resetNodeStatus();
	//determineEligibleStartingPoints();
	createArcLookupTable();
	recordArcsInLookupTable();

	while ( ( startingNode = nextStartingPoint() ) > 0 )
	{
		counter++;
		//printf("starting point %d with length %d\n",startingNode,edge_array[startingNode].length);
		expCounter = 0;
		tourBus ( startingNode );
		updateNodeStatus();
	}

	resetNodeStatus();
	deleteArcLookupTable();
	destroyReadIntervMem();
	printf ( "%u startingPoints, %lld dheap nodes\n", counter, dnodeCounter );
	//printf("%lld times getArcBetween for tourBusNode\n",getArcCounter);
	printf ( "%lld pairs found, %lld pairs of paths compared, %lld pairs merged\n", btCounter, cmpCounter, pinCounter );
	printf ( "sequence compare failure: %lld %lld %lld %lld\n", caseA, caseB, caseC, caseD );
//fclose(ftemp); free ( ( void * ) eligibleStartingPoints ); destroyDHeap ( dheap ); free ( ( void * ) dheapNodes ); free ( ( void * ) times ); free ( ( void * ) previous ); free ( ( void * ) expanded ); linearConcatenate(); } SOAPdenovo-V1.05/src/31mer/check.c000644 000765 000024 00000003757 11530651532 016550 0ustar00Aquastaff000000 000000 /* * 31mer/check.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include void * ckalloc ( unsigned long long amount ); FILE * ckopen ( char * name, char * mode ); FILE * ckopen ( char * name, char * mode ) { FILE * fp; if ( ( fp = fopen ( name, mode ) ) == NULL ) { printf ( "Cannot open %s. 
Now exit to system...\n", name );
		exit ( -1 );
	}

	return ( fp );
}

/* ckalloc - allocate zero-initialised space; check for success */
void * ckalloc ( unsigned long long amount )
{
	void * p;

	if ( ( p = ( void * ) calloc ( 1, ( unsigned long long ) amount ) ) == NULL && amount != 0 )
	{
		/* amount is unsigned, so it is printed with %llu rather than %lld */
		printf ( "Ran out of memory while requesting %llu bytes\n", amount );
		printf ( "Possible causes:\n" );
		printf ( "1) Not enough memory.\n" );
		printf ( "2) An array may have been overrun.\n" );
		printf ( "3) Wild pointers.\n" );
		fflush ( stdout );
		exit ( -1 );
	}

	return ( p );
}

/* reallocate memory */
void * ckrealloc ( void * p, size_t new_size, size_t old_size )
{
	void * q;
	q = realloc ( ( void * ) p, new_size );

	if ( new_size == 0 || q != ( void * ) 0 )
	{
		return q;
	}

	/* realloc failed and left p untouched: manually reallocate space */
	q = ckalloc ( new_size );
	/* move old memory to new space, never copying more than the new block can hold */
	bcopy ( p, q, old_size < new_size ? old_size : new_size );
	free ( p );
	return q;
}
SOAPdenovo-V1.05/src/31mer/compactEdge.c000644 000765 000024 00000005336 11530651532 017701 0ustar00Aquastaff000000 000000
/*
 * 31mer/compactEdge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" void copyEdge ( unsigned int source, unsigned int target ) { edge_array[target].from_vt = edge_array[source].from_vt; edge_array[target].to_vt = edge_array[source].to_vt; edge_array[target].length = edge_array[source].length; edge_array[target].cvg = edge_array[source].cvg; edge_array[target].multi = edge_array[source].multi; if ( edge_array[target].seq ) { free ( ( void * ) edge_array[target].seq ); } edge_array[target].seq = edge_array[source].seq; edge_array[source].seq = NULL; edge_array[target].arcs = edge_array[source].arcs; edge_array[source].arcs = NULL; edge_array[target].deleted = edge_array[source].deleted; } //move edge from source to target void edgeMove ( unsigned int source, unsigned int target ) { unsigned int bal_source, bal_target; ARC * arc; copyEdge ( source, target ); bal_source = getTwinEdge ( source ); //bal_edge if ( bal_source != source ) { bal_target = target + 1; copyEdge ( bal_source, bal_target ); edge_array[target].bal_edge = 2; edge_array[bal_target].bal_edge = 0; } else { edge_array[target].bal_edge = 1; bal_target = target; } //take care of the arcs arc = edge_array[target].arcs; while ( arc ) { arc->bal_arc->to_ed = bal_target; arc = arc->next; } if ( bal_target == target ) { return; } arc = edge_array[bal_target].arcs; while ( arc ) { arc->bal_arc->to_ed = target; arc = arc->next; } } void compactEdgeArray() { unsigned int i; unsigned int validCounter = 0; unsigned int bal_ed; printf ( "there're %d edges\n", num_ed ); for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted ) { continue; } validCounter++; if ( i == validCounter ) { continue; } bal_ed = getTwinEdge ( i ); edgeMove ( i, validCounter ); if ( bal_ed != i ) { i++; validCounter++; } } num_ed = validCounter; printf ( "after compacting %d edges left\n", num_ed ); } SOAPdenovo-V1.05/src/31mer/concatenateEdge.c000644 000765 000024 00000014172 11530651532 020535 0ustar00Aquastaff000000 
/*
 * 31mer/concatenateEdge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

void copySeq ( char * targetS, char * sourceS, int pos, int length ) {
    char ch;
    int i, index;
    index = pos;

    for ( i = 0; i < length; i++ ) {
        ch = getCharInTightString ( sourceS, i );
        writeChar2tightString ( ch, targetS, index++ );
    }
}

//a path from e1 to e2 is merged into e1 (indicate=0) or e2 (indicate=1), update graph topology
void linearUpdateConnection ( unsigned int e1, unsigned int e2, int indicate ) {
    unsigned int bal_ed;
    ARC * parc;

    if ( !indicate ) {
        edge_array[e1].to_vt = edge_array[e2].to_vt;
        bal_ed = getTwinEdge ( e1 );
        parc = edge_array[e2].arcs;

        while ( parc ) {
            parc->bal_arc->to_ed = bal_ed;
            parc = parc->next;
        }

        edge_array[e1].arcs = edge_array[e2].arcs;
        edge_array[e2].arcs = NULL;

        if ( edge_array[e1].length || edge_array[e2].length )
            edge_array[e1].cvg = ( edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length ) / ( edge_array[e1].length + edge_array[e2].length );

        edge_array[e2].deleted = 1;
    } else {
        //all the arcs pointing to e1 switch to e2
        parc = edge_array[getTwinEdge ( e1 )].arcs;

        while ( parc ) {
            parc->bal_arc->to_ed = e2;
            parc = parc->next;
        }

        edge_array[e1].arcs = NULL;
        edge_array[e2].from_vt = edge_array[e1].from_vt;

        if (
            edge_array[e1].length || edge_array[e2].length )
            edge_array[e2].cvg = ( edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length ) / ( edge_array[e1].length + edge_array[e2].length );

        edge_array[e1].deleted = 1;
    }
}

void allpathUpdateEdge ( unsigned int e1, unsigned int e2, int indicate ) {
    int tightLen;
    char * tightSeq = NULL;

    if ( edge_array[e1].cvg == 0 ) {
        edge_array[e1].cvg = edge_array[e2].cvg;
    }

    if ( edge_array[e2].cvg == 0 ) {
        edge_array[e2].cvg = edge_array[e1].cvg;
    }

    unsigned int cvgsum = edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length;
    tightLen = edge_array[e1].length + edge_array[e2].length;

    if ( tightLen ) {
        tightSeq = ( char * ) ckalloc ( ( tightLen / 4 + 1 ) * sizeof ( char ) );
    }

    tightLen = 0;

    if ( edge_array[e1].length ) {
        copySeq ( tightSeq, edge_array[e1].seq, 0, edge_array[e1].length );
        tightLen = edge_array[e1].length;

        if ( edge_array[e1].seq ) {
            free ( ( void * ) edge_array[e1].seq );
            edge_array[e1].seq = NULL;
        } else {
            printf ( "allpathUpdateEdge: edge %d with length %d, but without seq\n", e1, edge_array[e1].length );
        }
    }

    if ( edge_array[e2].length ) {
        copySeq ( tightSeq, edge_array[e2].seq, tightLen, edge_array[e2].length );
        tightLen += edge_array[e2].length;

        if ( edge_array[e2].seq ) {
            free ( ( void * ) edge_array[e2].seq );
            edge_array[e2].seq = NULL;
        } else {
            printf ( "allpathUpdateEdge: edge %d with length %d, but without seq\n", e2, edge_array[e2].length );
        }
    }

    //edge_array[e2].extend_len = tightLen-edge_array[e2].length;
    //the sequence of e1 is to be updated
    if ( !indicate ) {
        edge_array[e2].length = 0;    //e2 is removed from the graph
        edge_array[e1].to_vt = edge_array[e2].to_vt;    //e2 is part of e1 now
        edge_array[e1].length = tightLen;
        edge_array[e1].seq = tightSeq;

        if ( tightLen ) {
            edge_array[e1].cvg = cvgsum / tightLen;
        }

        edge_array[e1].cvg = edge_array[e1].cvg > 0 ?
                             edge_array[e1].cvg : 1;
    } else {
        edge_array[e1].length = 0;    //e1 is removed from the graph
        edge_array[e2].from_vt = edge_array[e1].from_vt;    //e1 is part of e2 now
        edge_array[e2].length = tightLen;
        edge_array[e2].seq = tightSeq;

        if ( tightLen ) {
            edge_array[e2].cvg = cvgsum / tightLen;
        }

        edge_array[e2].cvg = edge_array[e2].cvg > 0 ? edge_array[e2].cvg : 1;
    }
}

static void debugging ( unsigned int i ) {
    ARC * parc;
    parc = edge_array[i].arcs;

    if ( !parc ) {
        printf ( "no downward connection for %d\n", i );
    }

    while ( parc ) {
        printf ( "%d -> %d\n", i, parc->to_ed );
        parc = parc->next;
    }
}

//concatenate two edges if they are linearly linked
void linearConcatenate() {
    unsigned int i;
    int conc_c = 1;
    int counter;
    unsigned int from_ed, to_ed, bal_ed;
    ARC * parc, *parc2;
    unsigned int bal_fe;

    //debugging(30514);
    while ( conc_c ) {
        conc_c = 0;
        counter = 0;

        for ( i = 1; i <= num_ed; i++ ) { //num_ed
            if ( edge_array[i].deleted || EdSameAsTwin ( i ) ) {
                continue;
            }

            if ( edge_array[i].length > 0 ) {
                counter++;
            }

            parc = edge_array[i].arcs;

            if ( !parc || parc->next ) {
                continue;
            }

            to_ed = parc->to_ed;
            bal_ed = getTwinEdge ( to_ed );
            parc2 = edge_array[bal_ed].arcs;

            if ( bal_ed == to_ed || !parc2 || parc2->next ) {
                continue;
            }

            from_ed = i;

            if ( from_ed == to_ed || from_ed == bal_ed ) {
                continue;
            }

            //linear connection found
            conc_c++;
            linearUpdateConnection ( from_ed, to_ed, 0 );
            allpathUpdateEdge ( from_ed, to_ed, 0 );
            bal_fe = getTwinEdge ( from_ed );
            linearUpdateConnection ( bal_ed, bal_fe, 1 );
            allpathUpdateEdge ( bal_ed, bal_fe, 1 );
            /*
            if(from_ed==6589||to_ed==6589)
                printf("%d <- %d (%d)\n",from_ed,to_ed,i);
            if(bal_fe==6589||bal_ed==6589)
                printf("%d <- %d (%d)\n",bal_fe,bal_ed,i);
            */
        }

        printf ( "a linear concatenation lap, %d concatenated\n", conc_c );
    }

    printf ( "%d edges in graph\n", counter );
}

/*
 * 31mer/connect.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define CNBLOCKSIZE 100000

void createCntMemManager() {
    if ( !cn_mem_manager ) {
        cn_mem_manager = createMem_manager ( CNBLOCKSIZE, sizeof ( CONNECT ) );
    } else {
        printf ( "cn_mem_manager was already created\n" );
    }
}

void destroyConnectMem() {
    freeMem_manager ( cn_mem_manager );
    cn_mem_manager = NULL;
}

CONNECT * allocateCN ( unsigned int contigId, int gap ) {
    CONNECT * newCN;
    newCN = ( CONNECT * ) getItem ( cn_mem_manager );
    newCN->contigID = contigId;
    newCN->gapLen = gap;
    newCN->minGap = 0;
    newCN->maxGap = 0;
    newCN->bySmall = 0;
    newCN->weakPoint = 0;
    newCN->weight = 1;
    newCN->weightNotInherit = 0;
    newCN->mask = 0;
    newCN->used = 0;
    newCN->checking = 0;
    newCN->deleted = 0;
    newCN->prevInScaf = 0;
    newCN->inherit = 0;
    newCN->singleInScaf = 0;
    newCN->nextInScaf = NULL;
    return newCN;
}

void output_cntGVZ ( char * outfile ) {
    char name[256];
    FILE * fp;
    unsigned int i;
    CONNECT * connect;
    boolean flag;
    sprintf ( name, "%s.scaffold.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- ) {
        //if(contig_array[i].mask||!contig_array[i].downwardConnect)
        if ( !contig_array[i].downwardConnect ) {
            continue;
        }

        connect = contig_array[i].downwardConnect;

        while ( connect ) {
            //if(connect->mask||connect->deleted){
            if ( connect->deleted ) {
                connect = connect->next;
                continue;
            }

            if ( connect->prevInScaf || connect->nextInScaf ) {
                flag = 1;
            } else {
                flag = 0;
            }

            if ( !connect->mask )
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\"];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );
            else
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\", color = red];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );

            connect = connect->next;
        }
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

/***************** below this line all code is about the lookup table *****************/

void createCntLookupTable() {
    if ( !cntLookupTable ) {
        cntLookupTable = ( CONNECT ** ) ckalloc ( ( 3 * num_ctg + 1 ) * sizeof ( CONNECT * ) );
    }
}

void deleteCntLookupTable() {
    if ( cntLookupTable ) {
        free ( ( void * ) cntLookupTable );
        cntLookupTable = NULL;
    }
}

void putCnt2LookupTable ( unsigned int from_c, CONNECT * cnt ) {
    if ( !cnt || !cntLookupTable ) {
        return;
    }

    unsigned int index = 2 * from_c + cnt->contigID;
    cnt->nextInLookupTable = cntLookupTable[index];
    cntLookupTable[index] = cnt;
}

static CONNECT * getCntInLookupTable ( unsigned int from_c, unsigned int to_c ) {
    unsigned int index = 2 * from_c + to_c;
    CONNECT * ite_cnt = cntLookupTable[index];

    while ( ite_cnt ) {
        if ( ite_cnt->contigID == to_c ) {
            return ite_cnt;
        }

        ite_cnt = ite_cnt->nextInLookupTable;
    }

    return NULL;
}

CONNECT * getCntBetween ( unsigned int from_c, unsigned int to_c ) {
    CONNECT * pcnt;

    if ( cntLookupTable ) {
        pcnt = getCntInLookupTable ( from_c, to_c );
        return pcnt;
    }

    pcnt = contig_array[from_c].downwardConnect;

    while ( pcnt ) {
        if ( pcnt->contigID == to_c ) {
            return pcnt;
        }

        pcnt = pcnt->next;
    }

    return pcnt;
}

/*
void removeCntInLookupTable(unsigned int from_c,unsigned int to_c)
{
    unsigned int index = 2*from_c + to_c;
    CONNECT *ite_cnt
        = cntLookupTable[index];
    CONNECT *cnt;
    if(!ite_cnt){
        printf("removeCntInLookupTable: not found A\n");
        return;
    }
    if(ite_cnt->contigID==to_c){
        cntLookupTable[index] = ite_cnt->nextInLookupTable;
        return;
    }
    while(ite_cnt->nextInLookupTable&&ite_cnt->nextInLookupTable->contigID!=to_c)
        ite_cnt = ite_cnt->nextInLookupTable;
    if(ite_cnt->nextInLookupTable){
        cnt = ite_cnt->nextInLookupTable;
        ite_cnt->nextInLookupTable = cnt->nextInLookupTable;
        return;
    }
    printf("removeCntInLookupTable: not found B\n");
    return;
}
*/

/*
 * 31mer/contig.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static void initenv ( int argc, char ** argv );
static void display_contig_usage();

char shortrdsfile[256], graphfile[256];
static boolean repeatSolve;
static int M = 1;

int call_heavygraph ( int argc, char ** argv ) {
    time_t start_t, stop_t, time_bef, time_aft;
    time ( &start_t );
    boolean ret;
    initenv ( argc, argv );
    loadVertex ( graphfile );
    loadEdge ( graphfile );

    if ( repeatSolve ) {
        time ( &time_bef );
        ret = loadPathBin ( graphfile );

        if ( ret ) {
            solveReps();
        } else {
            printf ( "repeat solving can't be done...\n" );
        }

        time ( &time_aft );
        printf ( "time spent on solving repeat: %ds\n", ( int ) ( time_aft - time_bef ) );
    }

    //edgecvg_bar(edge_array,num_ed,graphfile,100);
    //0531
    if ( M > 0 ) {
        time ( &time_bef );
        bubblePinch ( 0.90, graphfile, M );
        time ( &time_aft );
        printf ( "time spent on bubblePinch: %ds\n", ( int ) ( time_aft - time_bef ) );
    }

    if ( deLowEdge ) {
        removeWeakEdges ( 2 * overlaplen, 1 );
        removeLowCovEdges ( 2 * overlaplen, deLowEdge );
    }

    cutTipsInGraph ( 0, 0 );
    //output_graph(graphfile);
    output_contig ( edge_array, num_ed, graphfile, overlaplen + 1 );
    output_updated_edges ( graphfile );
    output_heavyArcs ( graphfile );

    if ( vt_array ) {
        free ( ( void * ) vt_array );
        vt_array = NULL;
    }

    if ( edge_array ) {
        free_edge_array ( edge_array, num_ed_limit );
        edge_array = NULL;
    }

    destroyArcMem();
    time ( &stop_t );
    printf ( "time elapsed: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 );
    return 0;
}

/*****************************************************************************
 * Parse command line switches
 *****************************************************************************/
void initenv ( int argc, char ** argv ) {
    int copt;
    int inpseq, outseq;
    extern char * optarg;
    char temp[100];
    inpseq = outseq = repeatSolve = 0;
    optind = 1;

    while ( ( copt = getopt ( argc, argv, "g:M:D:R" ) ) != EOF ) {
        switch ( copt ) {
        case 'M':
            sscanf ( optarg, "%99s", temp );    //width bounded to fit temp[100]
            M = atoi ( temp
                          );
            continue;
        case 'D':
            sscanf ( optarg, "%99s", temp );    //width bounded to fit temp[100]
            deLowEdge = atoi ( temp ) >= 0 ? atoi ( temp ) : 0;
            continue;
        case 'g':
            inGraph = 1;
            sscanf ( optarg, "%255s", graphfile );    //width bounded to fit graphfile[256]
            continue;
        case 'R':
            repeatSolve = 1;
            continue;
        default:
            if ( inGraph == 0 ) {
                display_contig_usage();
                exit ( -1 );
            }
        }
    }

    if ( inGraph == 0 ) {
        display_contig_usage();
        exit ( -1 );
    }
}

static void display_contig_usage() {
    printf ( "\ncontig -g InputGraph [-M mergeLevel -D EdgeCovCutoff -R]\n" );
    printf ( " -g InputFile: prefix of graph file names\n" );
    printf ( " -D EdgeCovCutoff(optional): delete edges with coverage no larger than EdgeCovCutoff (default 1)\n" );
    printf ( " -M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during contiging\n" );
    printf ( " -R solve_repeats (optional): solve repeats by read paths(default: no)\n" );
}

/*
 * 31mer/cutTip_graph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int caseA, caseB, caseC, caseD, caseE; void destroyEdge ( unsigned int edgeid ) { unsigned int bal_ed = getTwinEdge ( edgeid ); ARC * arc; if ( bal_ed == edgeid ) { edge_array[edgeid].length = 0; return; } arc = edge_array[edgeid].arcs; while ( arc ) { arc->bal_arc->to_ed = 0; arc = arc->next; } arc = edge_array[bal_ed].arcs; while ( arc ) { arc->bal_arc->to_ed = 0; arc = arc->next; } edge_array[edgeid].arcs = NULL; edge_array[bal_ed].arcs = NULL; edge_array[edgeid].length = 0; edge_array[bal_ed].length = 0; edge_array[edgeid].deleted = 1; edge_array[bal_ed].deleted = 1; } ARC * arcCount ( unsigned int edgeid, unsigned int * num ) { ARC * arc; ARC * firstValidArc = NULL; unsigned int count = 0; arc = edge_array[edgeid].arcs; while ( arc ) { if ( arc->to_ed > 0 ) { count++; if ( count == 1 ) { firstValidArc = arc; } else if ( count > 1 ) { *num = count; return firstValidArc; } } arc = arc->next; } *num = count; return firstValidArc; } void removeWeakEdges ( int lenCutoff, unsigned int multiCutoff ) { unsigned int bal_ed; unsigned int arcRight_n, arcLeft_n; ARC * arcLeft, *arcRight; unsigned int i; int counter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted || edge_array[i].length == 0 || edge_array[i].length > lenCutoff || EdSameAsTwin ( i ) ) { continue; } bal_ed = getTwinEdge ( i ); arcRight = arcCount ( i, &arcRight_n ); if ( arcRight_n > 1 || !arcRight || arcRight->multiplicity > multiCutoff ) { continue; } arcLeft = arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 1 || !arcLeft || arcLeft->multiplicity > multiCutoff ) { continue; } destroyEdge ( i ); counter++; } printf ( "Remove weakly linked edges: %d weak inner edges destroyed\n", counter ); removeDeadArcs(); //linearConcatenate(); //compactEdgeArray(); } void removeLowCovEdges ( int lenCutoff, unsigned short covCutoff ) { unsigned int bal_ed; unsigned int arcRight_n, arcLeft_n; ARC * arcLeft, 
*arcRight; unsigned int i; int counter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted || edge_array[i].cvg == 0 || edge_array[i].cvg > covCutoff * 10 || edge_array[i].length >= lenCutoff || EdSameAsTwin ( i ) || edge_array[i].length == 0 ) { continue; } bal_ed = getTwinEdge ( i ); arcRight = arcCount ( i, &arcRight_n ); arcLeft = arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n < 1 || arcRight_n < 1 ) { continue; } destroyEdge ( i ); counter++; } printf ( "Remove low coverage(%d): %d inner edges destroyed\n", covCutoff, counter ); removeDeadArcs(); linearConcatenate(); compactEdgeArray(); } boolean isUnreliableTip ( unsigned int edgeid, int cutLen, boolean strict ) { unsigned int arcRight_n, arcLeft_n; unsigned int bal_ed; unsigned int currentEd = edgeid; int length = 0; unsigned int mult = 0; ARC * arc, *activeArc = NULL, *tempArc; if ( edgeid == 0 ) { return 0; } bal_ed = getTwinEdge ( edgeid ); if ( bal_ed == edgeid ) { return 0; } arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 0 ) { return 0; } while ( currentEd ) { arcCount ( bal_ed, &arcLeft_n ); tempArc = arcCount ( currentEd, &arcRight_n ); if ( arcLeft_n > 1 || arcRight_n > 1 ) { break; } length += edge_array[currentEd].length; if ( tempArc ) { activeArc = tempArc; currentEd = activeArc->to_ed; bal_ed = getTwinEdge ( currentEd ); } else { currentEd = 0; } } if ( length >= cutLen ) { return 0; } if ( currentEd == 0 ) { caseB++; return 1; } if ( !strict ) { if ( arcLeft_n < 2 ) { length += edge_array[currentEd].length; } if ( length >= cutLen ) { return 0; } else { caseC++; return 1; } } if ( arcLeft_n < 2 ) { return 0; } if ( !activeArc ) { printf ( "no activeArc while checking edge %d\n", edgeid ); } if ( activeArc->multiplicity == 1 ) { caseD++; return 1; } for ( arc = edge_array[bal_ed].arcs; arc != NULL; arc = arc->next ) if ( arc->multiplicity > mult ) { mult = arc->multiplicity; } if ( mult > activeArc->multiplicity ) { caseE++; } return mult > activeArc->multiplicity; } boolean 
isUnreliableTip_strict ( unsigned int edgeid, int cutLen ) { unsigned int arcRight_n, arcLeft_n; unsigned int bal_ed; unsigned int currentEd = edgeid; int length = 0; unsigned int mult = 0; ARC * arc, *activeArc = NULL, *tempArc; if ( edgeid == 0 ) { return 0; } bal_ed = getTwinEdge ( edgeid ); if ( bal_ed == edgeid ) { return 0; } arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 0 ) { return 0; } while ( currentEd ) { arcCount ( bal_ed, &arcLeft_n ); tempArc = arcCount ( currentEd, &arcRight_n ); if ( arcLeft_n > 1 || arcRight_n > 1 ) { if ( arcLeft_n == 0 || length == 0 ) { return 0; } else { break; } } length += edge_array[currentEd].length; if ( length >= cutLen ) { return 0; } if ( tempArc ) { activeArc = tempArc; currentEd = activeArc->to_ed; bal_ed = getTwinEdge ( currentEd ); } else { currentEd = 0; } } if ( currentEd == 0 ) { caseA++; return 1; } if ( !activeArc ) { printf ( "no activeArc while checking edge %d\n", edgeid ); } if ( activeArc->multiplicity == 1 ) { caseB++; return 1; } for ( arc = edge_array[bal_ed].arcs; arc != NULL; arc = arc->next ) if ( arc->multiplicity > mult ) { mult = arc->multiplicity; } if ( mult > activeArc->multiplicity ) { caseC++; } return mult > activeArc->multiplicity; } void removeDeadArcs() { unsigned int i, count = 0; ARC * arc, *arc_temp; for ( i = 1; i <= num_ed; i++ ) { arc = edge_array[i].arcs; while ( arc ) { arc_temp = arc; arc = arc->next; if ( arc_temp->to_ed == 0 ) { count++; edge_array[i].arcs = deleteArc ( edge_array[i].arcs, arc_temp ); } } } printf ( "%d dead arcs removed\n", count ); } void cutTipsInGraph ( int cutLen, boolean strict ) { int flag = 1; unsigned int i; if ( !cutLen ) { cutLen = 2 * overlaplen; } printf ( "strict %d, cutLen %d\n", strict, cutLen ); if ( strict ) { linearConcatenate(); } caseA = caseB = caseC = caseD = caseE = 0; while ( flag ) { flag = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted ) { continue; } /* if(strict){ if(isUnreliableTip_strict(i,cutLen)){ 
                    destroyEdge(i);
                    flag++;
                }
            }else
            */
            if ( isUnreliableTip ( i, cutLen, strict ) ) {
                destroyEdge ( i );
                flag++;
            }
        }

        printf ( "a cutTipsInGraph lap, %d tips cut\n", flag );
    }

    removeDeadArcs();

    if ( strict ) {
        printf ( "case A %d, B %d C %d D %d E %d\n", caseA, caseB, caseC, caseD, caseE );
    }

    linearConcatenate();
    compactEdgeArray();
}

/*
 * 31mer/cutTipPreGraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int tip_c; static long long * linearCounter; static void Mark1in1outNode(); static void thread_mark ( KmerSet * set, unsigned char thrdID ); static void printKmer ( Kmer kmer ) { int i; char kmerSeq[32], ch; for ( i = overlaplen - 1; i >= 0; i-- ) { ch = kmer & 3; kmer >>= 2; kmerSeq[i] = ch; } for ( i = 0; i < overlaplen; i++ ) { printf ( "%c", int2base ( ( int ) kmerSeq[i] ) ); } printf ( "\n" ); } static int clipTipFromNode ( kmer_t * node1, int cut_len, boolean THIN ) { unsigned char ret = 0, in_num, out_num, link; int sum, count; kmer_t * out_node; Kmer tempKmer, pre_word, word, bal_word, hash_ban; char ch1, ch; boolean smaller, found; int setPicker; unsigned int max_links, singleCvg; if ( node1->linear || node1->deleted ) { return ret; } if ( THIN && !node1->single ) { return ret; } in_num = count_branch2prev ( node1 ); out_num = count_branch2next ( node1 ); if ( in_num == 0 && out_num == 1 ) { pre_word = node1->seq; for ( ch1 = 0; ch1 < 4; ch1++ ) { link = get_kmer_right_cov ( *node1, ch1 ); if ( link ) { break; } } word = nextKmer ( pre_word, ch1 ); } else if ( in_num == 1 && out_num == 0 ) { pre_word = reverseComplement ( node1->seq, overlaplen ); for ( ch1 = 0; ch1 < 4; ch1++ ) { link = get_kmer_left_cov ( *node1, ch1 ); if ( link ) { break; } } word = nextKmer ( pre_word, int_comp ( ch1 ) ); } else { return ret; } count = 1; bal_word = reverseComplement ( word, overlaplen ); if ( word > bal_word ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx not found, node1 %llx\n", word, node1->seq ); exit ( 1 ); } while ( out_node->linear ) { count++; if ( THIN && !out_node->single ) { break; } if ( count > cut_len ) { return ret; } if ( smaller ) { pre_word = word; 
for ( ch = 0; ch < 4; ch++ ) { link = get_kmer_right_cov ( *out_node, ch ); if ( link ) { break; } } word = nextKmer ( pre_word, ch ); bal_word = reverseComplement ( word, overlaplen ); if ( word > bal_word ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx not found, node1 %llx\n", word, node1->seq ); printf ( "pre_word %llx with %d(smaller)\n", pre_word, ch ); exit ( 1 ); } } else { pre_word = bal_word; for ( ch = 0; ch < 4; ch++ ) { link = get_kmer_left_cov ( *out_node, ch ); if ( link ) { break; } } word = nextKmer ( pre_word, int_comp ( ch ) ); bal_word = reverseComplement ( word, overlaplen ); if ( word > bal_word ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx not found, node1 %llx\n", word, node1->seq ); printf ( "pre_word %llx with %d(larger)\n", reverseComplement ( pre_word, overlaplen ), int_comp ( ch ) ); exit ( 1 ); } } } if ( ( sum = count_branch2next ( out_node ) + count_branch2prev ( out_node ) ) == 1 ) { tip_c++; node1->deleted = 1; out_node->deleted = 1; return 1; } else { ch = firstCharInKmer ( pre_word ); if ( THIN ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( out_node, ch, smaller ); out_node->linear = 0; return 1; } // make sure this tip doesn't provide most links to out_node max_links = 0; for ( ch1 = 0; ch1 < 4; ch1++ ) { if ( smaller ) { singleCvg = get_kmer_left_cov ( *out_node, ch1 ); if ( singleCvg > max_links ) { max_links = singleCvg; } } else { singleCvg = get_kmer_right_cov ( *out_node, ch1 ); if ( singleCvg > max_links ) { max_links = singleCvg; } } } if ( smaller && get_kmer_left_cov ( *out_node, ch ) < 
             max_links ) {
            tip_c++;
            node1->deleted = 1;
            dislink2prevUncertain ( out_node, ch, smaller );

            if ( count_branch2prev ( out_node ) == 1 && count_branch2next ( out_node ) == 1 ) {
                out_node->linear = 1;
            }

            return 1;
        }

        if ( !smaller && get_kmer_right_cov ( *out_node, int_comp ( ch ) ) < max_links ) {
            tip_c++;
            node1->deleted = 1;
            dislink2prevUncertain ( out_node, ch, smaller );

            if ( count_branch2prev ( out_node ) == 1 && count_branch2next ( out_node ) == 1 ) {
                out_node->linear = 1;
            }

            return 1;
        }
    }

    return 0;
}

void removeSingleTips() {
    int i, flag = 0, cut_len_tip;
    kmer_t * rs;
    KmerSet * set;
    //count_ends(hash_table);
    //cut_len_tip = 2*overlaplen >= maxReadLen4all-overlaplen+1 ? 2*overlaplen : maxReadLen4all-overlaplen+1;
    cut_len_tip = 2 * overlaplen;
    printf ( "Start to remove tips of single frequency kmers shorter than %d\n", cut_len_tip );
    tip_c = 0;

    for ( i = 0; i < thrd_num; i++ ) {
        set = KmerSets[i];
        set->iter_ptr = 0;

        while ( set->iter_ptr < set->size ) {
            if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) {
                rs = set->array + set->iter_ptr;
                flag += clipTipFromNode ( rs, cut_len_tip, 1 );
            }

            set->iter_ptr ++;
        }
    }

    printf ( "%d tips off\n", tip_c );
    Mark1in1outNode();
}

void removeMinorTips() {
    int i, flag = 0, cut_len_tip;
    kmer_t * rs;
    KmerSet * set;
    //count_ends(hash_table);
    //cut_len_tip = 2*overlaplen >= maxReadLen4all-overlaplen+1 ?
    //    2*overlaplen : maxReadLen4all-overlaplen+1;
    cut_len_tip = 2 * overlaplen;
    printf ( "Start to remove tips which don't contribute the most links\n" );
    tip_c = 0;

    for ( i = 0; i < thrd_num; i++ ) {
        set = KmerSets[i];
        flag = 1;

        while ( flag ) {
            flag = 0;
            set->iter_ptr = 0;

            while ( set->iter_ptr < set->size ) {
                if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) {
                    rs = set->array + set->iter_ptr;
                    flag += clipTipFromNode ( rs, cut_len_tip, 0 );
                }

                set->iter_ptr ++;
            }
        }

        printf ( "kmer set %d done\n", i );
    }

    printf ( "%d tips off\n", tip_c );
    Mark1in1outNode();
}

static void threadRoutine ( void * para ) {
    PARAMETER * prm;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
    while ( 1 ) {
        if ( * ( prm->selfSignal ) == 2 ) {
            * ( prm->selfSignal ) = 0;
            break;
        } else if ( * ( prm->selfSignal ) == 1 ) {
            thread_mark ( KmerSets[id], id );
            * ( prm->selfSignal ) = 0;
        }

        usleep ( 1 );
    }
}

static void creatThrds ( pthread_t * threads, PARAMETER * paras ) {
    unsigned char i;
    int temp;

    for ( i = 0; i < thrd_num; i++ ) {
        if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) {
            printf ( "create threads failed\n" );
            exit ( 1 );
        }
    }

    printf ( "%d thread created\n", thrd_num );
}

static void thread_mark ( KmerSet * set, unsigned char thrdID ) {
    int in_num, out_num;
    kmer_t * rs;
    set->iter_ptr = 0;

    while ( set->iter_ptr < set->size ) {
        if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) {
            rs = set->array + set->iter_ptr;

            if ( rs->deleted || rs->linear ) {
                set->iter_ptr ++;
                continue;
            }

            in_num = count_branch2prev ( rs );
            out_num = count_branch2next ( rs );

            if ( in_num == 1 && out_num == 1 ) {
                rs->linear = 1;
                linearCounter[thrdID]++;
            }
        }

        set->iter_ptr ++;
    }

    //printf("%lld more linear\n",linearCounter[thrdID]);
}

static void thread_wait ( pthread_t * threads ) {
    int i;

    for ( i = 0; i < thrd_num; i++ )
        if ( threads[i] != 0 ) {
            pthread_join
                         ( threads[i], NULL );
        }
}

static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) {
    int t;

    for ( t = 0; t < thrd_num; t++ ) {
        thrdSignals[t + 1] = SIG;
    }

    while ( 1 ) {
        usleep ( 10 );

        for ( t = 0; t < thrd_num; t++ )
            if ( thrdSignals[t + 1] ) {
                break;
            }

        if ( t == thrd_num ) {
            break;
        }
    }
}

static void Mark1in1outNode() {
    int i;
    long long counter = 0;
    pthread_t threads[thrd_num];
    unsigned char thrdSignal[thrd_num + 1];
    PARAMETER paras[thrd_num];

    for ( i = 0; i < thrd_num; i++ ) {
        thrdSignal[i + 1] = 0;
        paras[i].threadID = i;
        paras[i].mainSignal = &thrdSignal[0];
        paras[i].selfSignal = &thrdSignal[i + 1];
    }

    creatThrds ( threads, paras );
    thrdSignal[0] = 0;
    linearCounter = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) );

    for ( i = 0; i < thrd_num; i++ ) {
        linearCounter[i] = 0;
    }

    sendWorkSignal ( 1, thrdSignal ); //mark linear nodes
    sendWorkSignal ( 2, thrdSignal );
    thread_wait ( threads );

    for ( i = 0; i < thrd_num; i++ ) {
        counter += linearCounter[i];
    }

    free ( ( void * ) linearCounter );
    printf ( "%lld linear nodes\n", counter );
}

/*
 * 31mer/darray.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "darray.h"
#include "check.h"

DARRAY * createDarray ( int num_items, size_t unit_size ) {
    DARRAY * newDarray = ( DARRAY * ) malloc ( 1 * sizeof ( DARRAY ) );
    newDarray->array_size = num_items;
    newDarray->item_size = unit_size;
    newDarray->item_c = 0;
    newDarray->array = ( void * ) ckalloc ( num_items * unit_size );
    return newDarray;
}

void * darrayPut ( DARRAY * darray, long long index ) {
    int i = 2;

    if ( index + 1 > darray->item_c ) {
        darray->item_c = index + 1;
    }

    if ( index < darray->array_size ) {
        return darray->array + darray->item_size * index;
    }

    //grow by the smallest factor i (at least 2) such that index < i * array_size;
    //the comparison must be >=, otherwise index == i * array_size lands one slot past the end
    while ( index >= i * darray->array_size ) {
        i++;
    }

    darray->array = ( void * ) ckrealloc ( darray->array, i * darray->array_size * darray->item_size
                                           , darray->array_size * darray->item_size );
    darray->array_size *= i;
    return ( void * ) ( ( void * ) darray->array + darray->item_size * index );
}

void * darrayGet ( DARRAY * darray, long long index ) {
    if ( index < darray->array_size ) {
        return ( void * ) ( ( void * ) darray->array + darray->item_size * index );
    }

    printf ( "array read index %lld out of range %lld\n", index, darray->array_size );
    return NULL;
}

void emptyDarray ( DARRAY * darray ) {
    darray->item_c = 0;
}

void freeDarray ( DARRAY * darray ) {
    if ( !darray ) {
        return;
    }

    if ( darray->array ) {
        free ( ( void * ) darray->array );
    }

    free ( ( void * ) darray );
}

/*
 * 31mer/dfib.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
 *
 * $Id: dfib.c,v 1.12 2007/10/19 13:09:26 zerbino Exp $
 *
 */

#include <limits.h>   /* INT_MIN */
#include <stdlib.h>   /* malloc, realloc, free, abort */

#include "dfib.h"
#include "dfibpriv.h"
#include "extfunc2.h"

#define HEAPBLOCKSIZE 1000

static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b );

static DFibHeapNode * allocateDFibHeapNode ( DFibHeap * heap )
{
    return ( DFibHeapNode * ) getItem ( heap->nodeMemory );
}

static void deallocateDFibHeapNode ( DFibHeapNode * a, DFibHeap * heap )
{
    returnItem ( heap->nodeMemory, a );
}

IDnum dfibheap_getSize ( DFibHeap * heap )
{
    return heap->dfh_n;
}

#define swap(type, a, b)    \
    do {                    \
        type c;             \
        c = a;              \
        a = b;              \
        b = c;              \
    } while (0)

#define INT_BITS        (sizeof(IDnum) * 8)

/* returns ceil(log2(a)) */
static inline IDnum ceillog2 ( IDnum a )
{
    IDnum oa;
    IDnum i;
    IDnum b;
    IDnum cons;
    oa = a;
    b = INT_BITS / 2;
    i = 0;

    while ( b ) {
        i = ( i << 1 );
        cons = ( ( IDnum ) 1 ) << b;

        if ( a >= cons ) {
            a /= cons;
            i = i | 1;
        } else {
            a &= cons - 1;
        }

        b /= 2;
    }

    if ( ( ( ( IDnum ) 1 << i ) ) == oa ) {
        return i;
    } else {
        return i + 1;
    }
}

/*
 * Public Heap Functions
 */
DFibHeap * dfh_makekeyheap()
{
    DFibHeap * n;

    if ( ( n = malloc ( sizeof * n ) ) == NULL ) {
        return NULL;
    }

    n->nodeMemory = createMem_manager ( HEAPBLOCKSIZE, sizeof ( DFibHeapNode ) );
    n->dfh_n = 0;
    n->dfh_Dl = -1;
    n->dfh_cons = NULL;
    n->dfh_min = NULL;
    n->dfh_root = NULL;
    return n;
}

void dfh_deleteheap ( DFibHeap * h )
{
    printf ( "DFibHeap: %lld Nodes allocated\n", h->nodeMemory->counter );
    freeMem_manager ( h->nodeMemory );
    h->nodeMemory = NULL;

    if ( h->dfh_cons != NULL ) {
        free ( h->dfh_cons );
    }

    free ( h );
}

/*
 * Public Key Heap Functions
 */
DFibHeapNode * dfh_insertkey ( DFibHeap * h, Time key, unsigned int data )
{
    DFibHeapNode * x;

    if ( ( x = dfhe_newelem ( h ) ) == NULL ) {
        return NULL;
    }

    /* just insert on root list, and make sure it's not the new min */
    x->dfhe_data = data;
    x->dfhe_key = key;
    dfh_insertel ( h, x );
    return x;
}

Time dfh_replacekey ( DFibHeap * h, DFibHeapNode * x, Time key )
{
    Time ret;
    ret = x->dfhe_key;
    ( void
) dfh_replacekeydata ( h, x, key, x->dfhe_data ); return ret; } unsigned int minInDHeap ( DFibHeap * h ) { if ( h->dfh_min ) { return h->dfh_min->dfhe_data; } else { return 0; } } boolean HasMin ( DFibHeap * h ) { if ( h->dfh_min ) { return 1; } else { return 0; } } unsigned int dfh_replacekeydata ( DFibHeap * h, DFibHeapNode * x, Time key, unsigned int data ) { unsigned int odata; Time okey; DFibHeapNode * y; int r; odata = x->dfhe_data; okey = x->dfhe_key; /* * we can increase a key by deleting and reinserting, that * requires O(lgn) time. */ if ( ( r = dfh_comparedata ( h, key, data, x ) ) > 0 ) { /* XXX - bad code! */ abort(); } x->dfhe_data = data; x->dfhe_key = key; /* because they are equal, we don't have to do anything */ if ( r == 0 ) { return odata; } y = x->dfhe_p; if ( okey == key ) { return odata; } if ( y != NULL && dfh_compare ( h, x, y ) <= 0 ) { dfh_cut ( h, x, y ); dfh_cascading_cut ( h, y ); } /* * the = is so that the call from dfh_delete will delete the proper * element. 
     */
    if ( dfh_compare ( h, x, h->dfh_min ) <= 0 ) {
        h->dfh_min = x;
    }

    return odata;
}

/*
 * Public void * Heap Functions
 */

/*
 * Remove the minimum element from the heap and return its data.
 * Returns 0 if the heap is empty, so a stored data value of 0
 * cannot be told apart from an empty heap by the return value alone.
 */
unsigned int dfh_extractmin ( DFibHeap * h )
{
    DFibHeapNode * z;
    unsigned int ret;
    ret = 0;

    if ( h->dfh_min != NULL ) {
        z = dfh_extractminel ( h );
        ret = z->dfhe_data;
        deallocateDFibHeapNode ( z, h );
    }

    return ret;
}

unsigned int dfh_replacedata ( DFibHeapNode * x, unsigned int data )
{
    unsigned int odata = x->dfhe_data;
    //printf("replace node value %d with %d\n",x->dfhe_data,data);
    x->dfhe_data = data;
    return odata;
}

unsigned int dfh_delete ( DFibHeap * h, DFibHeapNode * x )
{
    unsigned int k;
    //printf("destroy node %d in dheap\n",x->dfhe_data);
    k = x->dfhe_data;
    /* force x to become the minimum, then extract it */
    dfh_replacekey ( h, x, INT_MIN );
    dfh_extractmin ( h );
    return k;
}

/*
 * beginning of private element functions
 */
static DFibHeapNode * dfh_extractminel ( DFibHeap * h )
{
    DFibHeapNode * ret;
    DFibHeapNode * x, *y, *orig;
    ret = h->dfh_min;
    orig = NULL;

    /* put all the children on the root list */
    /* for true consistency, we should use dfhe_remove */
    for ( x = ret->dfhe_child; x != orig && x != NULL; ) {
        if ( orig == NULL ) {
            orig = x;
        }

        y = x->dfhe_right;
        x->dfhe_p = NULL;
        dfh_insertrootlist ( h, x );
        x = y;
    }

    /* remove minimum from root list */
    dfh_removerootlist ( h, ret );
    h->dfh_n--;

    /* if we aren't empty, consolidate the heap */
    if ( h->dfh_n == 0 ) {
        h->dfh_min = NULL;
    } else {
        h->dfh_min = ret->dfhe_right;
        dfh_consolidate ( h );
    }

    return ret;
}

static void dfh_insertrootlist ( DFibHeap * h, DFibHeapNode * x )
{
    if ( h->dfh_root == NULL ) {
        h->dfh_root = x;
        x->dfhe_left = x;
        x->dfhe_right = x;
        return;
    }

    dfhe_insertafter ( h->dfh_root, x );
}

static void dfh_removerootlist ( DFibHeap * h, DFibHeapNode * x )
{
    if ( x->dfhe_left == x ) {
        h->dfh_root = NULL;
    } else {
        h->dfh_root = dfhe_remove ( x );
    }
}

static void dfh_consolidate ( DFibHeap * h )
{
    DFibHeapNode ** a;
DFibHeapNode * w; DFibHeapNode * y; DFibHeapNode * x; IDnum i; IDnum d; IDnum D; dfh_checkcons ( h ); /* assign a the value of h->dfh_cons so I don't have to rewrite code */ D = h->dfh_Dl + 1; a = h->dfh_cons; for ( i = 0; i < D; i++ ) { a[i] = NULL; } while ( ( w = h->dfh_root ) != NULL ) { x = w; dfh_removerootlist ( h, w ); d = x->dfhe_degree; /* XXX - assert that d < D */ while ( a[d] != NULL ) { y = a[d]; if ( dfh_compare ( h, x, y ) > 0 ) { swap ( DFibHeapNode *, x, y ); } dfh_heaplink ( h, y, x ); a[d] = NULL; d++; } a[d] = x; } h->dfh_min = NULL; for ( i = 0; i < D; i++ ) if ( a[i] != NULL ) { dfh_insertrootlist ( h, a[i] ); if ( h->dfh_min == NULL || dfh_compare ( h, a[i], h->dfh_min ) < 0 ) { h->dfh_min = a[i]; } } } static void dfh_heaplink ( DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x ) { /* make y a child of x */ if ( x->dfhe_child == NULL ) { x->dfhe_child = y; } else { dfhe_insertbefore ( x->dfhe_child, y ); } y->dfhe_p = x; x->dfhe_degree++; y->dfhe_mark = 0; } static void dfh_cut ( DFibHeap * h, DFibHeapNode * x, DFibHeapNode * y ) { dfhe_remove ( x ); y->dfhe_degree--; dfh_insertrootlist ( h, x ); x->dfhe_p = NULL; x->dfhe_mark = 0; } static void dfh_cascading_cut ( DFibHeap * h, DFibHeapNode * y ) { DFibHeapNode * z; while ( ( z = y->dfhe_p ) != NULL ) { if ( y->dfhe_mark == 0 ) { y->dfhe_mark = 1; return; } else { dfh_cut ( h, y, z ); y = z; } } } /* * begining of handling elements of dfibheap */ static DFibHeapNode * dfhe_newelem ( DFibHeap * h ) { DFibHeapNode * e; if ( ( e = allocateDFibHeapNode ( h ) ) == NULL ) { return NULL; } e->dfhe_degree = 0; e->dfhe_mark = 0; e->dfhe_p = NULL; e->dfhe_child = NULL; e->dfhe_left = e; e->dfhe_right = e; e->dfhe_data = 0; return e; } static void dfhe_insertafter ( DFibHeapNode * a, DFibHeapNode * b ) { if ( a == a->dfhe_right ) { a->dfhe_right = b; a->dfhe_left = b; b->dfhe_right = a; b->dfhe_left = a; } else { b->dfhe_right = a->dfhe_right; a->dfhe_right->dfhe_left = b; a->dfhe_right = b; 
b->dfhe_left = a; } } static inline void dfhe_insertbefore ( DFibHeapNode * a, DFibHeapNode * b ) { dfhe_insertafter ( a->dfhe_left, b ); } static DFibHeapNode * dfhe_remove ( DFibHeapNode * x ) { DFibHeapNode * ret; if ( x == x->dfhe_left ) { ret = NULL; } else { ret = x->dfhe_left; } /* fix the parent pointer */ if ( x->dfhe_p != NULL && x->dfhe_p->dfhe_child == x ) { x->dfhe_p->dfhe_child = ret; } x->dfhe_right->dfhe_left = x->dfhe_left; x->dfhe_left->dfhe_right = x->dfhe_right; /* clear out hanging pointers */ x->dfhe_p = NULL; x->dfhe_left = x; x->dfhe_right = x; return ret; } static void dfh_checkcons ( DFibHeap * h ) { IDnum oDl; /* make sure we have enough memory allocated to "reorganize" */ if ( h->dfh_Dl == -1 || h->dfh_n > ( 1 << h->dfh_Dl ) ) { oDl = h->dfh_Dl; if ( ( h->dfh_Dl = ceillog2 ( h->dfh_n ) + 1 ) < 8 ) { h->dfh_Dl = 8; } if ( oDl != h->dfh_Dl ) h->dfh_cons = ( DFibHeapNode ** ) realloc ( h->dfh_cons, sizeof * h-> dfh_cons * ( h->dfh_Dl + 1 ) ); if ( h->dfh_cons == NULL ) { abort(); } } } static int dfh_compare ( DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b ) { if ( a->dfhe_key < b->dfhe_key ) { return -1; } if ( a->dfhe_key == b->dfhe_key ) { return 0; } return 1; } static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b ) { DFibHeapNode a; a.dfhe_key = key; a.dfhe_data = data; return dfh_compare ( h, &a, b ); } static void dfh_insertel ( DFibHeap * h, DFibHeapNode * x ) { dfh_insertrootlist ( h, x ); if ( h->dfh_min == NULL || x->dfhe_key < h->dfh_min->dfhe_key ) { h->dfh_min = x; } h->dfh_n++; } Time dfibheap_el_getKey ( DFibHeapNode * node ) { return node->dfhe_key; } SOAPdenovo-V1.05/src/31mer/dfibHeap.c000644 000765 000024 00000004232 11530651532 017162 0ustar00Aquastaff000000 000000 /* * 31mer/dfibHeap.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include #include "def2.h" #include "dfib.h" // Return number of elements stored in heap IDnum getDFibHeapSize ( DFibHeap * heap ) { return dfibheap_getSize ( heap ); } // Constructor // Memory allocated DFibHeap * newDFibHeap() { return dfh_makekeyheap(); } // Add new node into heap with a key, and a pointer to the specified node DFibHeapNode * insertNodeIntoDHeap ( DFibHeap * heap, Time key, unsigned int node ) { DFibHeapNode * res; res = dfh_insertkey ( heap, key, node ); return res; } // Replaces the key for a given node Time replaceKeyInDHeap ( DFibHeap * heap, DFibHeapNode * node, Time newKey ) { Time res; res = dfh_replacekey ( heap, node, newKey ); return res; } // Removes the node with the shortest key, then returns it. 
unsigned int removeNextNodeFromDHeap ( DFibHeap * heap ) { unsigned int node; node = ( unsigned int ) dfh_extractmin ( heap ); return node; } // Destructor void destroyDHeap ( DFibHeap * heap ) { dfh_deleteheap ( heap ); } // Replace the node pointed to by a heap node void replaceValueInDHeap ( DFibHeapNode * node, unsigned int newValue ) { dfh_replacedata ( node, newValue ); } // Remove unwanted node void destroyNodeInDHeap ( DFibHeapNode * node, DFibHeap * heap ) { dfh_delete ( heap, node ); } Time getKey ( DFibHeapNode * node ) { return dfibheap_el_getKey ( node ); } SOAPdenovo-V1.05/src/31mer/fib.c000644 000765 000024 00000031767 11530651532 016235 0ustar00Aquastaff000000 000000 /* * 31mer/fib.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $Id: fib.c,v 1.10 2007/10/19 13:09:26 zerbino Exp $ * */ #include #include #include "fib.h" #include "fibpriv.h" #include "extfunc2.h" #define HEAPBLOCKSIZE 10000 static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b ); unsigned int fh_replacekeydata ( FibHeap * h, FibHeapNode * x, Coordinate key, unsigned int data ); static FibHeapNode * allocateFibHeapEl ( FibHeap * heap ) { return ( FibHeapNode * ) getItem ( heap->nodeMemory ); }; static void deallocateFibHeapEl ( FibHeapNode * a, FibHeap * heap ) { returnItem ( heap->nodeMemory, a ); } #define swap(type, a, b) \ do { \ type c; \ c = a; \ a = b; \ b = c; \ } while (0) \ #define INT_BITS (sizeof(IDnum) * 8) static inline IDnum ceillog2 ( IDnum a ) { IDnum oa; IDnum i; IDnum b; IDnum cons; oa = a; b = INT_BITS / 2; i = 0; while ( b ) { i = ( i << 1 ); cons = ( ( IDnum ) 1 ) << b; if ( a >= cons ) { a /= cons; i = i | 1; } else { a &= cons - 1; } b /= 2; } if ( ( ( ( IDnum ) 1 << i ) ) == oa ) { return i; } else { return i + 1; } } /* * Private Heap Functions */ static void fh_initheap ( FibHeap * new ) { new->fh_cmp_fnct = NULL; new->nodeMemory = createMem_manager ( sizeof ( FibHeapNode ), HEAPBLOCKSIZE ); new->fh_neginf = 0; 
new->fh_n = 0; new->fh_Dl = -1; new->fh_cons = NULL; new->fh_min = NULL; new->fh_root = NULL; new->fh_keys = 0; } static void fh_destroyheap ( FibHeap * h ) { h->fh_cmp_fnct = NULL; h->fh_neginf = 0; if ( h->fh_cons != NULL ) { free ( h->fh_cons ); } h->fh_cons = NULL; free ( h ); } /* * Public Heap Functions */ FibHeap * fh_makekeyheap() { FibHeap * n; if ( ( n = malloc ( sizeof * n ) ) == NULL ) { return NULL; } fh_initheap ( n ); n->fh_keys = 1; return n; } FibHeap * fh_makeheap() { FibHeap * n; if ( ( n = malloc ( sizeof * n ) ) == NULL ) { return NULL; } fh_initheap ( n ); return n; } voidcmp fh_setcmp ( FibHeap * h, voidcmp fnct ) { voidcmp oldfnct; oldfnct = h->fh_cmp_fnct; h->fh_cmp_fnct = fnct; return oldfnct; } unsigned int fh_setneginf ( FibHeap * h, unsigned int data ) { unsigned int old; old = h->fh_neginf; h->fh_neginf = data; return old; } FibHeap * fh_union ( FibHeap * ha, FibHeap * hb ) { FibHeapNode * x; if ( ha->fh_root == NULL || hb->fh_root == NULL ) { /* either one or both are empty */ if ( ha->fh_root == NULL ) { fh_destroyheap ( ha ); return hb; } else { fh_destroyheap ( hb ); return ha; } } ha->fh_root->fhe_left->fhe_right = hb->fh_root; hb->fh_root->fhe_left->fhe_right = ha->fh_root; x = ha->fh_root->fhe_left; ha->fh_root->fhe_left = hb->fh_root->fhe_left; hb->fh_root->fhe_left = x; ha->fh_n += hb->fh_n; /* * we probably should also keep stats on number of unions */ /* set fh_min if necessary */ if ( fh_compare ( ha, hb->fh_min, ha->fh_min ) < 0 ) { ha->fh_min = hb->fh_min; } fh_destroyheap ( hb ); return ha; } void fh_deleteheap ( FibHeap * h ) { freeMem_manager ( h->nodeMemory ); h->nodeMemory = NULL; fh_destroyheap ( h ); } /* * Public Key Heap Functions */ FibHeapNode * fh_insertkey ( FibHeap * h, Coordinate key, unsigned int data ) { FibHeapNode * x; if ( ( x = fhe_newelem ( h ) ) == NULL ) { return NULL; } /* just insert on root list, and make sure it's not the new min */ x->fhe_data = data; x->fhe_key = key; fh_insertel ( h, x ); 
return x; } boolean fh_isempty ( FibHeap * h ) { if ( h->fh_min == NULL ) { return 1; } else { return 0; } } Coordinate fh_minkey ( FibHeap * h ) { if ( h->fh_min == NULL ) { return INT_MIN; } return h->fh_min->fhe_key; } unsigned int fh_replacekeydata ( FibHeap * h, FibHeapNode * x, Coordinate key, unsigned int data ) { unsigned int odata; Coordinate okey; FibHeapNode * y; int r; odata = x->fhe_data; okey = x->fhe_key; /* * we can increase a key by deleting and reinserting, that * requires O(lgn) time. */ if ( ( r = fh_comparedata ( h, key, data, x ) ) > 0 ) { /* XXX - bad code! */ abort(); } x->fhe_data = data; x->fhe_key = key; /* because they are equal, we don't have to do anything */ if ( r == 0 ) { return odata; } y = x->fhe_p; if ( h->fh_keys && okey == key ) { return odata; } if ( y != NULL && fh_compare ( h, x, y ) <= 0 ) { fh_cut ( h, x, y ); fh_cascading_cut ( h, y ); } /* * the = is so that the call from fh_delete will delete the proper * element. */ if ( fh_compare ( h, x, h->fh_min ) <= 0 ) { h->fh_min = x; } return odata; } Coordinate fh_replacekey ( FibHeap * h, FibHeapNode * x, Coordinate key ) { Coordinate ret; ret = x->fhe_key; ( void ) fh_replacekeydata ( h, x, key, x->fhe_data ); return ret; } /* * Public void * Heap Functions */ /* * this will return these values: * NULL failed for some reason * ptr token to use for manipulation of data */ FibHeapNode * fh_insert ( FibHeap * h, unsigned int data ) { FibHeapNode * x; if ( ( x = fhe_newelem ( h ) ) == NULL ) { return NULL; } /* just insert on root list, and make sure it's not the new min */ x->fhe_data = data; fh_insertel ( h, x ); return x; } unsigned int fh_min ( FibHeap * h ) { if ( h->fh_min == NULL ) { return 0; } return h->fh_min->fhe_data; } unsigned int fh_extractmin ( FibHeap * h ) { FibHeapNode * z; unsigned int ret = 0; if ( h->fh_min != NULL ) { z = fh_extractminel ( h ); ret = z->fhe_data; #ifndef NO_FREE deallocateFibHeapEl ( z, h ); #endif } return ret; } unsigned int 
fh_replacedata ( FibHeapNode * x, unsigned int data )
{
    unsigned int odata = x->fhe_data;
    x->fhe_data = data;
    return odata;
}

unsigned int fh_delete ( FibHeap * h, FibHeapNode * x )
{
    unsigned int k;
    k = x->fhe_data;

    /* force x to become the minimum, then extract it */
    if ( !h->fh_keys ) {
        fh_replacedata ( x, h->fh_neginf );
    } else {
        fh_replacekey ( h, x, INT_MIN );
    }

    fh_extractmin ( h );
    return k;
}

/*
 * beginning of private element functions
 */
static FibHeapNode * fh_extractminel ( FibHeap * h )
{
    FibHeapNode * ret;
    FibHeapNode * x, *y, *orig;
    ret = h->fh_min;
    orig = NULL;

    /* put all the children on the root list */
    /* for true consistency, we should use fhe_remove */
    for ( x = ret->fhe_child; x != orig && x != NULL; ) {
        if ( orig == NULL ) {
            orig = x;
        }

        y = x->fhe_right;
        x->fhe_p = NULL;
        fh_insertrootlist ( h, x );
        x = y;
    }

    /* remove minimum from root list */
    fh_removerootlist ( h, ret );
    h->fh_n--;

    /* if we aren't empty, consolidate the heap */
    if ( h->fh_n == 0 ) {
        h->fh_min = NULL;
    } else {
        h->fh_min = ret->fhe_right;
        fh_consolidate ( h );
    }

    return ret;
}

static void fh_insertrootlist ( FibHeap * h, FibHeapNode * x )
{
    if ( h->fh_root == NULL ) {
        h->fh_root = x;
        x->fhe_left = x;
        x->fhe_right = x;
        return;
    }

    fhe_insertafter ( h->fh_root, x );
}

static void fh_removerootlist ( FibHeap * h, FibHeapNode * x )
{
    if ( x->fhe_left == x ) {
        h->fh_root = NULL;
    } else {
        h->fh_root = fhe_remove ( x );
    }
}

static void fh_consolidate ( FibHeap * h )
{
    FibHeapNode ** a;
    FibHeapNode * w;
    FibHeapNode * y;
    FibHeapNode * x;
    IDnum i;
    IDnum d;
    IDnum D;
    fh_checkcons ( h );
    /* assign a the value of h->fh_cons so I don't have to rewrite code */
    D = h->fh_Dl + 1;
    a = h->fh_cons;

    for ( i = 0; i < D; i++ ) {
        a[i] = NULL;
    }

    while ( ( w = h->fh_root ) != NULL ) {
        x = w;
        fh_removerootlist ( h, w );
        d = x->fhe_degree;

        /* XXX - assert that d < D */
        while ( a[d] != NULL ) {
            y = a[d];

            if ( fh_compare ( h, x, y ) > 0 ) {
                swap ( FibHeapNode *, x, y );
            }

            fh_heaplink ( h, y, x );
            a[d] = NULL;
            d++;
        }

        a[d] = x;
    }

    h->fh_min = NULL;

    for ( i = 0; i < D; i++
) if ( a[i] != NULL ) { fh_insertrootlist ( h, a[i] ); if ( h->fh_min == NULL || fh_compare ( h, a[i], h->fh_min ) < 0 ) { h->fh_min = a[i]; } } } static void fh_heaplink ( FibHeap * h, FibHeapNode * y, FibHeapNode * x ) { /* make y a child of x */ if ( x->fhe_child == NULL ) { x->fhe_child = y; } else { fhe_insertbefore ( x->fhe_child, y ); } y->fhe_p = x; x->fhe_degree++; y->fhe_mark = 0; } static void fh_cut ( FibHeap * h, FibHeapNode * x, FibHeapNode * y ) { fhe_remove ( x ); y->fhe_degree--; fh_insertrootlist ( h, x ); x->fhe_p = NULL; x->fhe_mark = 0; } static void fh_cascading_cut ( FibHeap * h, FibHeapNode * y ) { FibHeapNode * z; while ( ( z = y->fhe_p ) != NULL ) { if ( y->fhe_mark == 0 ) { y->fhe_mark = 1; return; } else { fh_cut ( h, y, z ); y = z; } } } /* * begining of handling elements of fibheap */ static FibHeapNode * fhe_newelem ( FibHeap * h ) { FibHeapNode * e; if ( ( e = allocateFibHeapEl ( h ) ) == NULL ) { return NULL; } fhe_initelem ( e ); return e; } static void fhe_initelem ( FibHeapNode * e ) { e->fhe_degree = 0; e->fhe_mark = 0; e->fhe_p = NULL; e->fhe_child = NULL; e->fhe_left = e; e->fhe_right = e; e->fhe_data = 0; } static void fhe_insertafter ( FibHeapNode * a, FibHeapNode * b ) { if ( a == a->fhe_right ) { a->fhe_right = b; a->fhe_left = b; b->fhe_right = a; b->fhe_left = a; } else { b->fhe_right = a->fhe_right; a->fhe_right->fhe_left = b; a->fhe_right = b; b->fhe_left = a; } } static inline void fhe_insertbefore ( FibHeapNode * a, FibHeapNode * b ) { fhe_insertafter ( a->fhe_left, b ); } static FibHeapNode * fhe_remove ( FibHeapNode * x ) { FibHeapNode * ret; if ( x == x->fhe_left ) { ret = NULL; } else { ret = x->fhe_left; } /* fix the parent pointer */ if ( x->fhe_p != NULL && x->fhe_p->fhe_child == x ) { x->fhe_p->fhe_child = ret; } x->fhe_right->fhe_left = x->fhe_left; x->fhe_left->fhe_right = x->fhe_right; /* clear out hanging pointers */ x->fhe_p = NULL; x->fhe_left = x; x->fhe_right = x; return ret; } static void 
fh_checkcons ( FibHeap * h ) { IDnum oDl; /* make sure we have enough memory allocated to "reorganize" */ if ( h->fh_Dl == -1 || h->fh_n > ( 1 << h->fh_Dl ) ) { oDl = h->fh_Dl; if ( ( h->fh_Dl = ceillog2 ( h->fh_n ) + 1 ) < 8 ) { h->fh_Dl = 8; } if ( oDl != h->fh_Dl ) h->fh_cons = ( FibHeapNode ** ) realloc ( h->fh_cons, sizeof * h-> fh_cons * ( h->fh_Dl + 1 ) ); if ( h->fh_cons == NULL ) { abort(); } } } static int fh_compare ( FibHeap * h, FibHeapNode * a, FibHeapNode * b ) { if ( a->fhe_key < b->fhe_key ) { return -1; } if ( a->fhe_key == b->fhe_key ) { return 0; } return 1; } static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b ) { FibHeapNode a; a.fhe_key = key; a.fhe_data = data; return fh_compare ( h, &a, b ); } static void fh_insertel ( FibHeap * h, FibHeapNode * x ) { fh_insertrootlist ( h, x ); if ( h->fh_min == NULL || ( h->fh_keys ? x->fhe_key < h->fh_min->fhe_key : h->fh_cmp_fnct ( x->fhe_data, h->fh_min->fhe_data ) < 0 ) ) { h->fh_min = x; } h->fh_n++; } SOAPdenovo-V1.05/src/31mer/fibHeap.c000644 000765 000024 00000004002 11530651532 017011 0ustar00Aquastaff000000 000000 /* * 31mer/fibHeap.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "fib.h" // Constructor // Memory allocated FibHeap * newFibHeap() { return fh_makekeyheap(); } // Add new node into heap with a key, and a pointer to the specified node FibHeapNode * insertNodeIntoHeap ( FibHeap * heap, Coordinate key, unsigned int node ) { return fh_insertkey ( heap, key, node ); } // Returns smallest key in heap Coordinate minKeyOfHeap ( FibHeap * heap ) { return fh_minkey ( heap ); } // Replaces the key for a given node Coordinate replaceKeyInHeap ( FibHeap * heap, FibHeapNode * node, Coordinate newKey ) { return fh_replacekey ( heap, node, newKey ); } // Removes the node with the shortest key, then returns it. unsigned int removeNextNodeFromHeap ( FibHeap * heap ) { return ( unsigned int ) fh_extractmin ( heap ); } boolean IsHeapEmpty ( FibHeap * heap ) { return fh_isempty ( heap ); } // Destructor void destroyHeap ( FibHeap * heap ) { fh_deleteheap ( heap ); } // Replace the node pointed to by a heap node void replaceValueInHeap ( FibHeapNode * node, unsigned int newValue ) { fh_replacedata ( node, newValue ); } // Remove unwanted node void destroyNodeInHeap ( FibHeapNode * node, FibHeap * heap ) { fh_delete ( heap, node ); } SOAPdenovo-V1.05/src/31mer/hashFunction.c000644 000765 000024 00000010572 11530651532 020115 0ustar00Aquastaff000000 000000 /* * 31mer/hashFunction.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. 
If not, see . * */ #include #define KMER_HASH_MASK 0x0000000000ffffffL #define KMER_HASH_BUCKETS 16777216 // 4^12 static int crc_table[256] = { 0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7, 0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59, 0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433, 0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f, 0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1, 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb, 0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 
0x10da7a5a, 0x67dd4acc, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b, 0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55, 0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777, 0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc, 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94, 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d }; static int crc32 ( int crc, const char * buf, int len ) { if ( buf == NULL ) { return 0; } crc = crc ^ 0xffffffff; while ( len-- ) { crc = crc_table[ ( ( int ) crc ^ ( *buf++ ) ) & 0xff] ^ ( crc >> 8 ); } return crc ^ 0xffffffff; } Kmer hash_kmer ( Kmer kmer ) { Kmer hash; hash = kmer; hash = crc32 ( 0, ( char * ) &kmer, sizeof ( Kmer ) ); hash &= KMER_HASH_MASK; return hash; } SOAPdenovo-V1.05/src/31mer/inc/000755 000765 000024 00000000000 11530651532 016064 5ustar00Aquastaff000000 000000 SOAPdenovo-V1.05/src/31mer/kmer.c000644 000765 000024 00000006501 11530651532 016417 0ustar00Aquastaff000000 000000 /* * 31mer/kmer.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static unsigned char filter_array[8] = { ( unsigned char ) 1, ( ( unsigned char ) 1 ) << 1, ( ( unsigned char ) 1 ) << 2, ( ( unsigned char ) 1 ) << 3, ( ( unsigned char ) 1 ) << 4, ( ( unsigned char ) 1 ) << 5, ( ( unsigned char ) 1 ) << 6, ( ( unsigned char ) 1 ) << 7}; void link2next ( NODE * node, char ch ) { if ( node->links & filter_array[ ( int ) ch] ) { node->linksB = node->linksB | filter_array[ ( int ) ch]; } else { node->links = node->links | filter_array[ ( int ) ch]; } } unsigned char check_linkB2next ( NODE * node, char ch ) { return filter_array[ ( int ) ch] & node->linksB; } unsigned char check_link2next ( NODE * node, char ch ) { return filter_array[ ( int ) ch] & node->links; } void unlink2next ( NODE * node, char ch ) { node->links = node->links & ( ~filter_array[ ( int ) ch] ); } void link2prev ( NODE * node, char ch ) { if ( node->links & filter_array[ch + 4] ) { node->linksB = node->linksB | filter_array[ch + 4]; } else { node->links = node->links | filter_array[ch + 4]; } } unsigned char check_linkB2prev ( NODE * node, char ch ) { return filter_array[ch + 4] & node->linksB; } unsigned char check_link2prev ( NODE * node, char ch ) { return filter_array[ch + 4] & node->links; } void unlink2prev ( NODE * node, char ch ) { node->links = node->links & ( ~filter_array[ch + 4] 
);
}

int count_link2next ( NODE * node )
{
    int num = 0, i;
    unsigned char ch = node->links;

    for ( i = 0; i < 4; i++ )
    {
        num += ch & 0x01;
        ch >>= 1;
    }

    return num;
}

int count_link2nextB ( NODE * node )
{
    int num = 0, i;
    unsigned char ch = node->linksB;

    for ( i = 0; i < 4; i++ )
    {
        num += ch & 0x01;
        ch >>= 1;
    }

    return num;
}

int count_link2prevB ( NODE * node )
{
    int num = 0, i;
    unsigned char ch = node->linksB;
    ch >>= 4;

    for ( i = 0; i < 4; i++ )
    {
        num += ch & 0x01;
        ch >>= 1;
    }

    return num;
}

int count_link2prev ( NODE * node )
{
    int num = 0, i;
    unsigned char ch = node->links;
    ch >>= 4;

    for ( i = 0; i < 4; i++ )
    {
        num += ch & 0x01;
        ch >>= 1;
    }

    return num;
}

Kmer KmerPlus ( Kmer prev, char ch )
{
    Kmer word = prev;
    word <<= 2;
    word += ch;
    return word;
}

Kmer nextKmer ( Kmer prev, char ch )
{
    Kmer word = prev;
    word <<= 2;
    word &= WORDFILTER;
    word += ch;
    return word;
}

Kmer prevKmer ( Kmer next, char ch )
{
    Kmer word = next;
    word >>= 2;
    word += ( ( Kmer ) ch ) << 2 * ( overlaplen - 1 );
    return word;
}

char firstCharInKmer ( Kmer kmer )
{
    return ( char ) ( kmer >> 2 * ( overlaplen - 1 ) ); // & 3;
}

/*
 * 31mer/lib.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static char tabs[2][1024];

/* Split a "key=value" line into tabs[0] and tabs[1]; returns 1 when both fields are found. */
static boolean splitColumn ( char * line )
{
    int len = strlen ( line );
    int i = 0, j;
    int tabs_n = 0;

    while ( i < len )
    {
        if ( line[i] >= 32 && line[i] <= 126 && line[i] != '=' )
        {
            j = 0;

            while ( i < len && line[i] >= 32 && line[i] <= 126 && line[i] != '=' )
            {
                tabs[tabs_n][j++] = line[i];
                i++;
            }

            tabs[tabs_n][j] = '\0';
            tabs_n++;

            if ( tabs_n == 2 )
            {
                return 1;
            }
        }

        i++;
    }

    if ( tabs_n == 2 )
    {
        return 1;
    }
    else
    {
        return 0;
    }
}

/* qsort comparator: orders libraries by ascending average insert size. */
static int cmp_lib ( const void * a, const void * b )
{
    LIB_INFO * A, *B;
    A = ( LIB_INFO * ) a;
    B = ( LIB_INFO * ) b;

    if ( A->avg_ins > B->avg_ins )
    {
        return 1;
    }
    else if ( A->avg_ins == B->avg_ins )
    {
        return 0;
    }
    else
    {
        return -1;
    }
}

void scan_libInfo ( char * libfile )
{
    FILE * fp;
    char line[1024], ch;
    int i, j, index;
    int libCounter;
    boolean flag;
    fp = ckopen ( libfile, "r" );
    num_libs = 0;

    while ( fgets ( line, 1024, fp ) )
    {
        ch = line[5];
        line[5] = '\0';

        if ( strcmp ( line, "[LIB]" ) == 0 )
        {
            num_libs++;
        }

        if ( !num_libs )
        {
            line[5] = ch;
            flag = splitColumn ( line );

            if ( !flag )
            {
                continue;
            }

            if ( strcmp ( tabs[0], "max_rd_len" ) == 0 )
            {
                maxReadLen = atoi ( tabs[1] );
            }
        }
    }

    //count file numbers of each type
    lib_array = ( LIB_INFO * ) ckalloc ( num_libs * sizeof ( LIB_INFO ) );

    for ( i = 0; i < num_libs; i++ )
    {
        lib_array[i].asm_flag = 3;
        lib_array[i].rd_len_cutoff = 0;
        lib_array[i].rank = 0;
        lib_array[i].pair_num_cut = 0;
        lib_array[i].map_len = 0;
        lib_array[i].num_s_a_file = 0;
        lib_array[i].num_s_q_file = 0;
        lib_array[i].num_p_file = 0;
        lib_array[i].num_a1_file = 0;
        lib_array[i].num_a2_file = 0;
        lib_array[i].num_q1_file = 0;
        lib_array[i].num_q2_file = 0;
    }

    libCounter = -1;
    rewind ( fp );
    i = -1;

    while ( fgets ( line, 1024, fp ) )
    {
        ch = line[5];
        line[5] = '\0';

        if ( strcmp ( line, "[LIB]" ) == 0 )
        {
            i++;
            continue;
        }

        line[5] = ch;
        flag = splitColumn ( line );

        if ( !flag )
        {
            continue;
        }

        if ( strcmp ( tabs[0], "f1" ) == 0 )
        {
            lib_array[i].num_a1_file++;
        }
        else if ( strcmp ( tabs[0], "q1" ) == 0 )
        {
            lib_array[i].num_q1_file++;
        }
        else if ( strcmp ( tabs[0], "f2" ) == 0 )
        {
            lib_array[i].num_a2_file++;
        }
        else if ( strcmp ( tabs[0], "q2" ) == 0 )
        {
            lib_array[i].num_q2_file++;
        }
        else if ( strcmp ( tabs[0], "f" ) == 0 )
        {
            lib_array[i].num_s_a_file++;
        }
        else if ( strcmp ( tabs[0], "q" ) == 0 )
        {
            lib_array[i].num_s_q_file++;
        }
        else if ( strcmp ( tabs[0], "p" ) == 0 )
        {
            lib_array[i].num_p_file++;
        }
    }

    //allocate memory for filenames
    for ( i = 0; i < num_libs; i++ )
    {
        if ( lib_array[i].num_s_a_file )
        {
            lib_array[i].s_a_fname = ( char ** ) ckalloc ( lib_array[i].num_s_a_file * sizeof ( char * ) );

            for ( j = 0; j < lib_array[i].num_s_a_file; j++ )
            {
                lib_array[i].s_a_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }

        if ( lib_array[i].num_s_q_file )
        {
            lib_array[i].s_q_fname = ( char ** ) ckalloc ( lib_array[i].num_s_q_file * sizeof ( char * ) );

            for ( j = 0; j < lib_array[i].num_s_q_file; j++ )
            {
                lib_array[i].s_q_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }

        if ( lib_array[i].num_p_file )
        {
            lib_array[i].p_fname = ( char ** ) ckalloc ( lib_array[i].num_p_file * sizeof ( char * ) );

            for ( j = 0; j < lib_array[i].num_p_file; j++ )
            {
                lib_array[i].p_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }

        if ( lib_array[i].num_a1_file )
        {
            lib_array[i].a1_fname = ( char ** ) ckalloc ( lib_array[i].num_a1_file * sizeof ( char * ) );

            for ( j = 0; j < lib_array[i].num_a1_file; j++ )
            {
                lib_array[i].a1_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }

        if ( lib_array[i].num_a2_file )
        {
            lib_array[i].a2_fname = ( char ** ) ckalloc ( lib_array[i].num_a2_file * sizeof ( char * ) );

            for ( j = 0; j < lib_array[i].num_a2_file; j++ )
            {
                lib_array[i].a2_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }

        if ( lib_array[i].num_q1_file )
        {
            lib_array[i].q1_fname = ( char ** ) ckalloc ( lib_array[i].num_q1_file * sizeof ( char * ) );

            for ( j = 0; j <
lib_array[i].num_q1_file; j++ )
            {
                lib_array[i].q1_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }

        if ( lib_array[i].num_q2_file )
        {
            lib_array[i].q2_fname = ( char ** ) ckalloc ( lib_array[i].num_q2_file * sizeof ( char * ) );

            for ( j = 0; j < lib_array[i].num_q2_file; j++ )
            {
                lib_array[i].q2_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) );
            }
        }
    }

    // get file names
    for ( i = 0; i < num_libs; i++ )
    {
        lib_array[i].curr_type = 1;
        lib_array[i].curr_index = 0;
        lib_array[i].fp1 = NULL;
        lib_array[i].fp2 = NULL;
        lib_array[i].num_s_a_file = 0;
        lib_array[i].num_s_q_file = 0;
        lib_array[i].num_p_file = 0;
        lib_array[i].num_a1_file = 0;
        lib_array[i].num_a2_file = 0;
        lib_array[i].num_q1_file = 0;
        lib_array[i].num_q2_file = 0;
    }

    libCounter = -1;
    rewind ( fp );
    i = -1;

    while ( fgets ( line, 1024, fp ) )
    {
        ch = line[5];
        line[5] = '\0';

        if ( strcmp ( line, "[LIB]" ) == 0 )
        {
            i++;
            continue;
        }

        line[5] = ch;
        flag = splitColumn ( line );

        if ( !flag )
        {
            continue;
        }

        if ( strcmp ( tabs[0], "f1" ) == 0 )
        {
            index = lib_array[i].num_a1_file++;
            strcpy ( lib_array[i].a1_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "q1" ) == 0 )
        {
            index = lib_array[i].num_q1_file++;
            strcpy ( lib_array[i].q1_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "f2" ) == 0 )
        {
            index = lib_array[i].num_a2_file++;
            strcpy ( lib_array[i].a2_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "q2" ) == 0 )
        {
            index = lib_array[i].num_q2_file++;
            strcpy ( lib_array[i].q2_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "f" ) == 0 )
        {
            index = lib_array[i].num_s_a_file++;
            strcpy ( lib_array[i].s_a_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "q" ) == 0 )
        {
            index = lib_array[i].num_s_q_file++;
            strcpy ( lib_array[i].s_q_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "p" ) == 0 )
        {
            index = lib_array[i].num_p_file++;
            strcpy ( lib_array[i].p_fname[index], tabs[1] );
        }
        else if ( strcmp ( tabs[0], "min_ins" ) == 0 )
        {
            lib_array[i].min_ins = atoi ( tabs[1] );
        }
        else if (
strcmp ( tabs[0], "max_ins" ) == 0 )
        {
            lib_array[i].max_ins = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "avg_ins" ) == 0 )
        {
            lib_array[i].avg_ins = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "rd_len_cutoff" ) == 0 )
        {
            lib_array[i].rd_len_cutoff = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "reverse_seq" ) == 0 )
        {
            lib_array[i].reverse = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "asm_flags" ) == 0 )
        {
            lib_array[i].asm_flag = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "rank" ) == 0 )
        {
            lib_array[i].rank = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "pair_num_cutoff" ) == 0 )
        {
            lib_array[i].pair_num_cut = atoi ( tabs[1] );
        }
        else if ( strcmp ( tabs[0], "map_len" ) == 0 )
        {
            lib_array[i].map_len = atoi ( tabs[1] );
        }
    }

    fclose ( fp );
    //sort libraries by ascending average insert size
    qsort ( &lib_array[0], num_libs, sizeof ( LIB_INFO ), cmp_lib );
}

int getMaxLongReadLen ( int num_libs )
{
    int i;
    int maxLong = 0;
    boolean Has = 0;

    for ( i = 0; i < num_libs; i++ )
    {
        if ( lib_array[i].asm_flag != 4 )
        {
            continue;
        }

        Has = 1;
        maxLong = maxLong < lib_array[i].rd_len_cutoff ? lib_array[i].rd_len_cutoff : maxLong;
    }

    if ( !Has )
    {
        return maxLong;
    }
    else
    {
        return maxLong > 0 ?
maxLong : maxReadLen;
    }
}

void free_libs()
{
    if ( !lib_array )
    {
        return;
    }

    int i, j;

    for ( i = 0; i < num_libs; i++ )
    {
        printf ( "[LIB] %d, avg_ins %d, reverse %d \n", i, lib_array[i].avg_ins, lib_array[i].reverse );

        if ( lib_array[i].num_s_a_file )
        {
            //printf("%d single fasta files\n",lib_array[i].num_s_a_file);
            for ( j = 0; j < lib_array[i].num_s_a_file; j++ )
            {
                free ( ( void * ) lib_array[i].s_a_fname[j] );
            }

            free ( ( void * ) lib_array[i].s_a_fname );
        }

        if ( lib_array[i].num_s_q_file )
        {
            //printf("%d single fastq files\n",lib_array[i].num_s_q_file);
            for ( j = 0; j < lib_array[i].num_s_q_file; j++ )
            {
                free ( ( void * ) lib_array[i].s_q_fname[j] );
            }

            free ( ( void * ) lib_array[i].s_q_fname );
        }

        if ( lib_array[i].num_p_file )
        {
            //printf("%d paired fasta files\n",lib_array[i].num_p_file);
            for ( j = 0; j < lib_array[i].num_p_file; j++ )
            {
                free ( ( void * ) lib_array[i].p_fname[j] );
            }

            free ( ( void * ) lib_array[i].p_fname );
        }

        if ( lib_array[i].num_a1_file )
        {
            //printf("%d read1 fasta files\n",lib_array[i].num_a1_file);
            for ( j = 0; j < lib_array[i].num_a1_file; j++ )
            {
                free ( ( void * ) lib_array[i].a1_fname[j] );
            }

            free ( ( void * ) lib_array[i].a1_fname );
        }

        if ( lib_array[i].num_a2_file )
        {
            //printf("%d read2 fasta files\n",lib_array[i].num_a2_file);
            for ( j = 0; j < lib_array[i].num_a2_file; j++ )
            {
                free ( ( void * ) lib_array[i].a2_fname[j] );
            }

            free ( ( void * ) lib_array[i].a2_fname );
        }

        if ( lib_array[i].num_q1_file )
        {
            //printf("%d read1 fastq files\n",lib_array[i].num_q1_file);
            for ( j = 0; j < lib_array[i].num_q1_file; j++ )
            {
                free ( ( void * ) lib_array[i].q1_fname[j] );
            }

            free ( ( void * ) lib_array[i].q1_fname );
        }

        if ( lib_array[i].num_q2_file )
        {
            //printf("%d read2 fastq files\n",lib_array[i].num_q2_file);
            for ( j = 0; j < lib_array[i].num_q2_file; j++ )
            {
                free ( ( void * ) lib_array[i].q2_fname[j] );
            }

            free ( ( void * ) lib_array[i].q2_fname );
        }
    }

    num_libs = 0;
    free ( ( void * ) lib_array );
}

void alloc_pe_mem ( int gradsCounter )
{
    if (
gradsCounter )
    {
        pes = ( PE_INFO * ) ckalloc ( gradsCounter * sizeof ( PE_INFO ) );
    }
}

void free_pe_mem()
{
    if ( pes )
    {
        free ( ( void * ) pes );
        pes = NULL;
    }
}

/*
 * 31mer/loadGraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define preARCBLOCKSIZE 100000

static void loadArcs ( char * graphfile );
static void loadContig ( char * graphfile );

void loadUpdatedVertex ( char * graphfile )
{
    char name[256], line[256];
    FILE * fp;
    Kmer word, bal_word;
    int num_kmer, i;
    char ch;
    sprintf ( name, "%s.updated.vertex", graphfile );
    fp = ckopen ( name, "r" );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == 'V' )
        {
            sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen );
            printf ( "there're %d kmers in vertex file\n", num_kmer );
            break;
        }
    }

    vt_array = ( VERTEX * ) ckalloc ( ( 2 * num_kmer ) * sizeof ( VERTEX ) );

    for ( i = 0; i < num_kmer; i++ )
    {
        fscanf ( fp, "%llx ", &word );
        vt_array[i].kmer = word;
    }

    fclose ( fp );

    //second half of vt_array holds the reverse complements
    for ( i = 0; i < num_kmer; i++ )
    {
        bal_word = reverseComplement ( vt_array[i].kmer, overlaplen );
        vt_array[i + num_kmer].kmer = bal_word;
    }

    num_vt = num_kmer;
}

int uniqueLenSearch ( unsigned int *
len_array, unsigned int * flag_array, int num, unsigned int target )
{
    int mid, low, high;
    low = 1;
    high = num;

    while ( low <= high )
    {
        mid = ( low + high ) / 2;

        if ( len_array[mid] == target )
        {
            break;
        }
        else if ( target > len_array[mid] )
        {
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }

    if ( low > high )
    {
        return -1;
    }

    //locate the first same length unflagged
    return flag_array[mid]++;
}

int lengthSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target )
{
    int mid, low, high, i;
    low = 1;
    high = num;

    while ( low <= high )
    {
        mid = ( low + high ) / 2;

        if ( len_array[mid] == target )
        {
            break;
        }
        else if ( target > len_array[mid] )
        {
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }

    if ( low > high )
    {
        return -1;
    }

    //locate the first same length unflagged
    if ( !flag_array[mid] )
    {
        for ( i = mid - 1; i > 0; i-- )
        {
            if ( len_array[i] != len_array[mid] || flag_array[i] )
            {
                break;
            }
        }

        flag_array[i + 1] = 1;
        return i + 1;
    }
    else
    {
        for ( i = mid + 1; i <= num; i++ )
        {
            if ( !flag_array[i] )
            {
                break;
            }
        }

        flag_array[i] = 1;
        return i;
    }
}

void quick_sort_int ( unsigned int * length_array, int low, int high )
{
    int i, j;
    unsigned int pivot;

    if ( low < high )
    {
        pivot = length_array[low];
        i = low;
        j = high;

        while ( i < j )
        {
            while ( i < j && length_array[j] >= pivot )
            {
                j--;
            }

            if ( i < j )
            {
                length_array[i++] = length_array[j];
            }

            while ( i < j && length_array[i] <= pivot )
            {
                i++;
            }

            if ( i < j )
            {
                length_array[j--] = length_array[i];
            }
        }

        length_array[i] = pivot;
        quick_sort_int ( length_array, low, i - 1 );
        quick_sort_int ( length_array, i + 1, high );
    }
}

void loadUpdatedEdges ( char * graphfile )
{
    char c, name[256], line[1024];
    int bal_ed, cvg;
    FILE * fp, *out_fp;
    Kmer from_kmer, to_kmer;
    unsigned int num_ctgge, length, index = 0, num_kmer;
    unsigned int i = 0, j;
    int newIndex;
    unsigned int * length_array, *flag_array, diff_len;
    char * outfile = graphfile;
    long long cvgSum = 0;
    long long counter = 0;
    //get overlaplen from *.preGraphBasic
    sprintf ( name, "%s.preGraphBasic", graphfile );
    fp = ckopen ( name, "r" );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == 'V' )
        {
            //num_kmer is unsigned, so it is read with %u
            sscanf ( line + 6, "%u %c %d", &num_kmer, &c, &overlaplen );
            printf ( "K = %d\n", overlaplen );
            break;
        }
    }

    if ( ctg_short == 0 )
    {
        ctg_short = overlaplen + 2;
    }

    fclose ( fp );
    sprintf ( name, "%s.updated.edge", graphfile );
    fp = ckopen ( name, "r" );
    sprintf ( name, "%s.newContigIndex", outfile );
    out_fp = ckopen ( name, "w" );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == 'E' )
        {
            sscanf ( line + 5, "%u", &num_ctgge );
            printf ( "there're %u edge in edge file\n", num_ctgge );
            break;
        }
    }

    index_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) );
    length_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) );
    flag_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == '>' )
        {
            sscanf ( line + 7, "%u", &length );
            index_array[++index] = length;
            length_array[++i] = length;
        }
    }

    num_ctg = index;
    orig2new = 1;
    //quick_sort_int(length_array,1,num_ctg);
    qsort ( & ( length_array[1] ), num_ctg, sizeof ( length_array[0] ), cmp_int );
    //extract unique length
    diff_len = 0;

    for ( i = 1; i <= num_ctg; i++ )
    {
        for ( j = i + 1; j <= num_ctg; j++ )
            if ( length_array[j] != length_array[i] )
            {
                break;
            }

        length_array[++diff_len] = length_array[i];
        flag_array[diff_len] = i;
        i = j - 1;
    }

    /*
    for(i=1;i<=num_ctg;i++)
        flag_array[i] = 0;
    */
    contig_array = ( CONTIG * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( CONTIG ) );
    //load edges
    index = 0;
    rewind ( fp );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == '>' )
        {
            sscanf ( line, ">length %u,%llx,%llx,%d,%d", &length, &from_kmer, &to_kmer, &bal_ed, &cvg );
            newIndex = uniqueLenSearch ( length_array, flag_array, diff_len, length );
            index_array[++index] = newIndex;
            contig_array[newIndex].length = length;
            contig_array[newIndex].bal_edge = bal_ed + 1;
            contig_array[newIndex].downwardConnect = NULL;
            contig_array[newIndex].mask = 0;
            contig_array[newIndex].flag = 0;
            contig_array[newIndex].arcs = NULL;
            contig_array[newIndex].seq = NULL;
            contig_array[newIndex].multi = 0;
            contig_array[newIndex].inSubGraph = 0;
            contig_array[newIndex].cvg = cvg / 10;

            if ( cvg )
            {
                counter += length;
                cvgSum += cvg * length;
            }

            fprintf ( out_fp, "%d %d %d\n", index, newIndex, contig_array[newIndex].bal_edge );
        }
    }

    if ( counter ) //cvgAvg = cvgSum/counter > 2 ? cvgSum/counter : 3;
    {
        cvgAvg = cvgSum / counter / 10 > 2 ? cvgSum / counter / 10 : 3;
    }

    //mark repeats
    int bal_i;

    if ( maskRep )
    {
        counter = 0;

        for ( i = 1; i <= num_ctg; i++ )
        {
            bal_i = getTwinCtg ( i );

            if ( ( contig_array[i].cvg + contig_array[bal_i].cvg ) > 4 * cvgAvg )
            {
                contig_array[i].mask = 1;
                contig_array[bal_i].mask = 1;
                counter += 2;
            }

            if ( isSmallerThanTwin ( i ) )
            {
                i++;
            }
        }

        printf ( "average contig coverage is %d, %lld contig masked\n", cvgAvg, counter );
    }

    counter = 0;

    for ( i = 1; i <= num_ctg; i++ )
    {
        if ( contig_array[i].mask )
        {
            continue;
        }

        bal_i = getTwinCtg ( i );

        if ( contig_array[i].length < ctg_short )
        {
            contig_array[i].mask = 1;
            contig_array[bal_i].mask = 1;
            counter += 2;
        }

        if ( isSmallerThanTwin ( i ) )
        {
            i++;
        }
    }

    printf ( "Mask contigs shorter than %d, %lld contig masked\n", ctg_short, counter );
    loadArcs ( graphfile );
    //tipsCount();
    loadContig ( graphfile );
    printf ( "done loading updated edges\n" );
    fflush ( stdout );
    free ( ( void * ) length_array );
    free ( ( void * ) flag_array );
    fclose ( fp );
    fclose ( out_fp );
}

static void add1Arc ( unsigned int from_ed, unsigned int to_ed, unsigned int weight )
{
    preARC * parc;
    unsigned int from_c = index_array[from_ed];
    unsigned int to_c = index_array[to_ed];
    parc = allocatePreArc ( to_c );
    parc->multiplicity = weight;
    parc->next = contig_array[from_c].arcs;
    contig_array[from_c].arcs = parc;
}

void loadArcs ( char * graphfile )
{
    FILE * fp;
    char name[256], line[1024];
    unsigned int target, weight;
    unsigned int from_ed;
    char * seg;
    sprintf ( name, "%s.Arc", graphfile );
    fp = ckopen ( name, "r" );
    createPreArcMemManager();
    arcCounter = 0;

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        seg = strtok ( line, " " );
        from_ed = atoi ( seg );

        //printf("%d\n",from_ed);
        while ( ( seg = strtok ( NULL, " " ) ) != NULL )
        {
            target = atoi ( seg );
            seg = strtok ( NULL, " " );

            //a target without a weight (e.g. a trailing token) ends the line
            if ( !seg )
            {
                break;
            }

            weight = atoi ( seg );
            add1Arc ( from_ed, target, weight );
        }
    }

    printf ( "%lld arcs loaded\n", arcCounter );
    fclose ( fp );
}

void loadContig ( char * graphfile )
{
    char c, name[256], line[1024], *tightSeq = NULL;
    FILE * fp;
    int n = 0, length, index = -1, edgeno;
    unsigned int i;
    unsigned int newIndex;
    sprintf ( name, "%s.contig", graphfile );
    fp = ckopen ( name, "r" );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == '>' )
        {
            if ( index >= 0 )
            {
                newIndex = index_array[edgeno];
                contig_array[newIndex].seq = tightSeq;
            }

            n = 0;
            index++;
            sscanf ( line + 1, "%d %s %d", &edgeno, name, &length );
            //printf("contig %d, length %d\n",edgeno,length);
            tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) );
        }
        else
        {
            for ( i = 0; i < strlen ( line ); i++ )
            {
                if ( line[i] >= 'a' && line[i] <= 'z' )
                {
                    c = base2int ( line[i] - 'a' + 'A' );
                    writeChar2tightString ( c, tightSeq, n++ );
                }
                else if ( line[i] >= 'A' && line[i] <= 'Z' )
                {
                    c = base2int ( line[i] );
                    writeChar2tightString ( c, tightSeq, n++ );
                }
            }
        }
    }

    if ( index >= 0 )
    {
        newIndex = index_array[edgeno];
        contig_array[newIndex].seq = tightSeq;
    }

    printf ( "input %d contigs\n", index + 1 );
    fclose ( fp );
    //printf("the %dth contig with index 107\n",index);
}

void freeContig_array()
{
    if ( !contig_array )
    {
        return;
    }

    unsigned int i;

    for ( i = 1; i <= num_ctg; i++ )
    {
        if ( contig_array[i].seq )
        {
            free ( ( void * ) contig_array[i].seq );
        }

        if ( contig_array[i].closeReads )
        {
            freeStack ( contig_array[i].closeReads );
        }
    }

    free ( ( void * ) contig_array );
    contig_array = NULL;
}

/*
void loadCvg(char *graphfile)
{
    char name[256],line[1024];
    FILE *fp;
    int cvg;
    unsigned int newIndex,edgeno,bal_ctg;

    sprintf(name,"%s.contigCVG",graphfile);
    fp = fopen(name,"r");
    if(!fp){
        printf("contig coverage file %s is not found!\n",name);
        return;
    }

    while(fgets(line,sizeof(line),fp)!=NULL){
        if(line[0]=='>'){
            sscanf(line+1,"%d %d",&edgeno,&cvg);
            newIndex = index_array[edgeno];
            cvg = cvg <= 255 ? cvg:255;
            contig_array[newIndex].multi = cvg;
            bal_ctg = getTwinCtg(newIndex);
            contig_array[bal_ctg].multi= cvg;
        }
    }
    fclose(fp);
}
*/

/*
 * 31mer/loadPath.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

/* Record a read on an edge and, with negated id, on its twin edge. */
static void add1marker2edge ( unsigned int edgeno, long long readid )
{
    if ( edge_array[edgeno].multi == 255 )
    {
        return;
    }

    unsigned int bal_ed = getTwinEdge ( edgeno );
    unsigned char counter = edge_array[edgeno].multi++;
    edge_array[edgeno].markers[counter] = readid;
    counter = edge_array[bal_ed].multi++;
    edge_array[bal_ed].markers[counter] = -readid;
}

boolean loadPath ( char * graphfile )
{
    FILE * fp;
    char name[256], line[1024];
    unsigned int i, bal_ed, num1, edgeno, num2;
    long long markCounter = 0, readid = 0;
    char * seg;
    sprintf ( name, "%s.markOnEdge", graphfile );
    fp = fopen ( name, "r" );

    if ( !fp )
    {
        return 0;
    }

    for ( i = 1; i <= num_ed; i++ )
    {
        edge_array[i].multi = 0;
    }

    for ( i = 1; i <= num_ed; i++ )
    {
        //num1 and num2 are unsigned, so they are read with %u
        fscanf ( fp, "%u", &num1 );

        if ( EdSmallerThanTwin ( i ) )
        {
            fscanf ( fp, "%u", &num2 );
            bal_ed = getTwinEdge ( i );

            if ( num1 + num2 >= 255 )
            {
                edge_array[i].multi = 255;
                edge_array[bal_ed].multi = 255;
            }
            else
            {
                edge_array[i].multi = num1 + num2;
                edge_array[bal_ed].multi = num1 + num2;
                markCounter += 2 * ( num1 + num2 );
            }

            i++;
        }
        else
        {
            if ( 2 * num1 >= 255 )
            {
                edge_array[i].multi = 255;
            }
            else
            {
                edge_array[i].multi = 2 * num1;
                markCounter += 2 * num1;
            }
        }
    }

    fclose ( fp );
    printf ( "%lld markers overall\n", markCounter );
    markersArray = ( long long * ) ckalloc ( markCounter * sizeof ( long long ) );
    markCounter = 0;

    for ( i = 1; i <= num_ed; i++ )
    {
        if ( edge_array[i].multi == 255 )
        {
            continue;
        }

        edge_array[i].markers = markersArray + markCounter;
        markCounter += edge_array[i].multi;
        edge_array[i].multi = 0;
    }

    sprintf ( name, "%s.path", graphfile );
    fp = fopen ( name, "r" );

    if ( !fp )
    {
        return 0;
    }

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        //printf("%s",line);
        readid++;
        seg = strtok ( line, " " );

        while ( seg )
        {
            edgeno = atoi ( seg );
            //printf("%s, %d\n",seg,edgeno);
            add1marker2edge ( edgeno, readid );
            seg = strtok ( NULL, " " );
        }
    }

    fclose ( fp );
    markCounter = 0;

    for ( i = 1; i <= num_ed; i++ )
    {
        if ( edge_array[i].multi == 255 )
        {
            continue;
        }

        markCounter += edge_array[i].multi;
    }

    printf ( "%lld marks loaded\n", markCounter );
    return 1;
}

boolean loadPathBin ( char * graphfile )
{
    FILE * fp;
    char name[256];
    unsigned int i, bal_ed, num1, num2;
    long long markCounter = 0, readid = 0;
    unsigned char seg, ch;
    unsigned int * freadBuf;
    sprintf ( name, "%s.markOnEdge", graphfile );
    fp = fopen ( name, "r" );

    if ( !fp )
    {
        return 0;
    }

    for ( i = 1; i <= num_ed; i++ )
    {
        edge_array[i].multi = 0;
        edge_array[i].markers = NULL;
    }

    for ( i = 1; i <= num_ed; i++ )
    {
        //num1 and num2 are unsigned, so they are read with %u
        fscanf ( fp, "%u", &num1 );

        if ( EdSmallerThanTwin ( i ) )
        {
            fscanf ( fp, "%u", &num2 );
            bal_ed = getTwinEdge ( i );

            if ( num1 + num2 >= 255 )
            {
                edge_array[i].multi = 255;
                edge_array[bal_ed].multi = 255;
            }
            else
            {
                edge_array[i].multi = num1 + num2;
                edge_array[bal_ed].multi = num1 + num2;
                markCounter += 2 * ( num1 + num2 );
            }

            i++;
        }
        else
        {
            if ( 2 * num1 >= 255 )
            {
                edge_array[i].multi = 255;
            }
            else
            {
                edge_array[i].multi = 2 * num1;
                markCounter += 2 * num1;
            }
        }
    }

    fclose ( fp );
    printf ( "%lld markers overall\n", markCounter );
    markersArray = ( long long * ) ckalloc ( markCounter * sizeof ( long long ) );
    markCounter = 0;

    for ( i = 1; i <= num_ed; i++ )
    {
        if ( edge_array[i].multi == 255 )
        {
            continue;
        }

        edge_array[i].markers = markersArray + markCounter;
        markCounter += edge_array[i].multi;
        edge_array[i].multi = 0;
    }

    sprintf ( name, "%s.path", graphfile );
    fp = fopen ( name, "rb" );

    if ( !fp )
    {
        return 0;
    }

    freadBuf = ( unsigned int * ) ckalloc ( ( maxReadLen - overlaplen + 1 ) * sizeof ( unsigned int ) );

    //each record: one byte giving the edge count, then that many edge ids
    while ( fread ( &ch, sizeof ( char ), 1, fp ) == 1 )
    {
        //printf("%s",line);
        if ( fread ( freadBuf, sizeof ( unsigned int ), ch, fp ) != ch )
        {
            break;
        }

        readid++;

        for ( seg = 0; seg < ch; seg++ )
        {
            add1marker2edge ( freadBuf[seg], readid );
        }
    }

    fclose ( fp );
    markCounter = 0;

    for ( i = 1; i <= num_ed; i++ )
    {
        if ( edge_array[i].multi == 255 )
        {
            continue;
        }

        markCounter +=
edge_array[i].multi;
    }

    printf ( "%lld markers loaded\n", markCounter );
    free ( ( void * ) freadBuf );
    return 1;
}

/*
 * 31mer/loadPreGraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static void loadPreArcs ( char * graphfile );

/* qsort comparator: orders vertices by ascending kmer value. */
int cmp_vertex ( const void * a, const void * b )
{
    VERTEX * A, *B;
    A = ( VERTEX * ) a;
    B = ( VERTEX * ) b;

    if ( A->kmer > B->kmer )
    {
        return 1;
    }
    else if ( A->kmer == B->kmer )
    {
        return 0;
    }
    else
    {
        return -1;
    }
}

void loadVertex ( char * graphfile )
{
    char name[256], line[256];
    FILE * fp;
    Kmer word, bal_word, temp;
    int num_kmer, i;
    char ch;
    sprintf ( name, "%s.preGraphBasic", graphfile );
    fp = ckopen ( name, "r" );

    while ( fgets ( line, sizeof ( line ), fp ) != NULL )
    {
        if ( line[0] == 'V' )
        {
            sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen );
            printf ( "there're %d kmers in vertex file\n", num_kmer );
        }
        else if ( line[0] == 'E' )
        {
            sscanf ( line + 5, "%d", &num_ed );
            printf ( "there're %d edge in edge file\n", num_ed );
        }
        else if ( line[0] == 'M' )
        {
            sscanf ( line, "MaxReadLen %d MinReadLen %d MaxNameLen %d", &maxReadLen, &minReadLen, &maxNameLen );
        }
    }

    fclose ( fp );
    vt_array = ( VERTEX * ) ckalloc
( ( 2 * num_kmer ) * sizeof ( VERTEX ) );
    sprintf ( name, "%s.vertex", graphfile );
    fp = ckopen ( name, "r" );

    //keep the smaller of each kmer and its reverse complement
    for ( i = 0; i < num_kmer; i++ )
    {
        fscanf ( fp, "%llx ", &word );
        bal_word = reverseComplement ( word, overlaplen );

        if ( word < bal_word )
        {
            vt_array[i].kmer = word;
        }
        else
        {
            vt_array[i].kmer = bal_word;
        }
    }

    temp = vt_array[num_kmer - 1].kmer;
    qsort ( &vt_array[0], num_kmer, sizeof ( vt_array[0] ), cmp_vertex );
    printf ( "done sort\n" );
    fclose ( fp );

    for ( i = 0; i < num_kmer; i++ )
    {
        bal_word = reverseComplement ( vt_array[i].kmer, overlaplen );
        vt_array[i + num_kmer].kmer = bal_word;
    }

    num_vt = num_kmer;
}

/* Binary search over vts[0..num-1], sorted by kmer; returns the index or -1. */
int bisearch ( VERTEX * vts, int num, Kmer target )
{
    int mid, low, high;
    low = 0;
    high = num - 1;

    while ( low <= high )
    {
        mid = ( low + high ) / 2;

        if ( vts[mid].kmer == target )
        {
            break;
        }
        else if ( target > vts[mid].kmer )
        {
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }

    if ( low <= high )
    {
        return mid;
    }
    else
    {
        return -1;
    }
}

int kmer2vt ( Kmer kmer )
{
    Kmer bal_word;
    int vt_id;
    bal_word = reverseComplement ( kmer, overlaplen );

    if ( kmer < bal_word )
    {
        vt_id = bisearch ( &vt_array[0], num_vt, kmer );

        if ( vt_id < 0 )
        {
            printf ( "no vt found for kmer %llx, its twin %llx\n", kmer, reverseComplement ( kmer, overlaplen ) );
        }

        return vt_id;
    }
    else
    {
        vt_id = bisearch ( &vt_array[0], num_vt, bal_word );

        if ( vt_id >= 0 )
        {
            vt_id += num_vt;
        }
        else
        {
            printf ( "no vt found for kmer %llx, its twin %llx\n", kmer, reverseComplement ( kmer, overlaplen ) );
        }

        return vt_id;
    }
}

// create an edge with index edgeno+1 reverse complementary to edge with index edgeno
static void buildReverseComplementEdge ( unsigned int edgeno )
{
    int length = edge_array[edgeno].length;
    int i, index = 0;
    char * sequence, ch, *tightSeq;
    Kmer kmer = vt_array[edge_array[edgeno].from_vt].kmer;
    sequence = ( char * ) ckalloc ( ( overlaplen + length ) * sizeof ( char ) );

    for ( i = overlaplen - 1; i >= 0; i-- )
    {
        ch = kmer & 3;
        kmer >>= 2;
        sequence[i] = ch;
    }

    for ( i = 0; i < length; i++ )
    {
sequence[i + overlaplen] = getCharInTightString ( edge_array[edgeno].seq, i ); } tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( i = length - 1; i >= 0; i-- ) { writeChar2tightString ( int_comp ( sequence[i] ), tightSeq, index++ ); } edge_array[edgeno + 1].length = length; edge_array[edgeno + 1].cvg = edge_array[edgeno].cvg; kmer = vt_array[edge_array[edgeno].from_vt].kmer; edge_array[edgeno + 1].to_vt = kmer2vt ( reverseComplement ( kmer, overlaplen ) ); kmer = vt_array[edge_array[edgeno].to_vt].kmer; edge_array[edgeno + 1].from_vt = kmer2vt ( reverseComplement ( kmer, overlaplen ) ); edge_array[edgeno + 1].seq = tightSeq; edge_array[edgeno + 1].bal_edge = 0; edge_array[edgeno + 1].rv = NULL; edge_array[edgeno + 1].arcs = NULL; edge_array[edgeno + 1].flag = 0; edge_array[edgeno + 1].deleted = 0; free ( ( void * ) sequence ); } void loadEdge ( char * graphfile ) { char c, name[256], line[1024], str[32]; char * tightSeq = NULL; FILE * fp; Kmer from_kmer, to_kmer; int n = 0, i, length, cvg, index = -1, bal_ed, edgeno; int linelen; unsigned int j; sprintf ( name, "%s.edge", graphfile ); fp = ckopen ( name, "r" ); num_ed_limit = 1.2 * num_ed; edge_array = ( EDGE * ) ckalloc ( ( num_ed_limit + 1 ) * sizeof ( EDGE ) ); for ( j = num_ed + 1; j <= num_ed_limit; j++ ) { edge_array[j].seq = NULL; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index >= 0 ) { edgeno = index + 1; edge_array[edgeno].length = length; edge_array[edgeno].cvg = cvg; edge_array[edgeno].from_vt = kmer2vt ( from_kmer ); edge_array[edgeno].to_vt = kmer2vt ( to_kmer ); edge_array[edgeno].seq = tightSeq; edge_array[edgeno].bal_edge = bal_ed + 1; edge_array[edgeno].rv = NULL; edge_array[edgeno].arcs = NULL; edge_array[edgeno].flag = 0; edge_array[edgeno].deleted = 0; if ( bal_ed ) { buildReverseComplementEdge ( edgeno ); index++; } } n = 0; index++; sscanf ( line + 7, "%d,%llx,%llx,%s %d,%d", &length, &from_kmer, &to_kmer, str, &cvg, 
&bal_ed ); tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); } else { linelen = strlen ( line ); for ( i = 0; i < linelen; i++ ) { if ( line[i] >= 'a' && line[i] <= 'z' ) { c = base2int ( line[i] - 'a' + 'A' ); writeChar2tightString ( c, tightSeq, n++ ); } else if ( line[i] >= 'A' && line[i] <= 'Z' ) { c = base2int ( line[i] ); writeChar2tightString ( c, tightSeq, n++ ); } } } } if ( index >= 0 ) { edgeno = index + 1; edge_array[edgeno].length = length; edge_array[edgeno].cvg = cvg; edge_array[edgeno].from_vt = kmer2vt ( from_kmer ); edge_array[edgeno].to_vt = kmer2vt ( to_kmer ); edge_array[edgeno].seq = tightSeq; edge_array[edgeno].bal_edge = bal_ed + 1; edge_array[edgeno].rv = NULL; edge_array[edgeno].arcs = NULL; edge_array[edgeno].flag = 0; edge_array[edgeno].deleted = 0; if ( bal_ed ) { buildReverseComplementEdge ( edgeno ); index++; } } printf ( "input %d edges\n", index + 1 ); fclose ( fp ); createArcMemo(); loadPreArcs ( graphfile ); } unsigned int getTwinEdge ( unsigned int edgeno ) { return edgeno + edge_array[edgeno].bal_edge - 1; } boolean EdSmallerThanTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge > 1; } boolean EdLargerThanTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge < 1; } boolean EdSameAsTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge == 1; } static void add1Arc ( unsigned int from_ed, unsigned int to_ed, unsigned int weight ) { if ( edge_array[from_ed].to_vt != edge_array[to_ed].from_vt ) { /* fprintf(stderr,"add1Arc: inconsistent joins\n"); */ return; } unsigned int bal_fe = getTwinEdge ( from_ed ); unsigned int bal_te = getTwinEdge ( to_ed ); if ( from_ed > num_ed || to_ed > num_ed || bal_fe > num_ed || bal_te > num_ed ) { return; } ARC * parc, *bal_parc; /* both arcs already exist */ parc = getArcBetween ( from_ed, to_ed ); if ( parc ) { bal_parc = parc->bal_arc; parc->multiplicity += weight; bal_parc->multiplicity += weight; return; } /* create new arcs */ parc = allocateArc ( to_ed ); parc->multiplicity = weight; parc->prev = NULL; if ( edge_array[from_ed].arcs ) { edge_array[from_ed].arcs->prev 
= parc; } parc->next = edge_array[from_ed].arcs; edge_array[from_ed].arcs = parc; /* A->A' */ if ( bal_te == from_ed ) { /* printf("preArc from A to A'\n"); */ parc->bal_arc = parc; parc->multiplicity += weight; return; } bal_parc = allocateArc ( bal_fe ); bal_parc->multiplicity = weight; bal_parc->prev = NULL; if ( edge_array[bal_te].arcs ) { edge_array[bal_te].arcs->prev = bal_parc; } bal_parc->next = edge_array[bal_te].arcs; edge_array[bal_te].arcs = bal_parc; /* link them to each other */ parc->bal_arc = bal_parc; bal_parc->bal_arc = parc; } void loadPreArcs ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int target, weight; unsigned int from_ed; char * seg; sprintf ( name, "%s.preArc", graphfile ); fp = ckopen ( name, "r" ); arcCounter = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { seg = strtok ( line, " " ); if ( seg == NULL ) { continue; } /* skip lines without any token */ from_ed = atoi ( seg ); while ( ( seg = strtok ( NULL, " " ) ) != NULL ) { target = atoi ( seg ); seg = strtok ( NULL, " " ); if ( seg == NULL ) { break; } /* target without a weight: stop parsing this line */ weight = atoi ( seg ); add1Arc ( from_ed, target, weight ); } } printf ( "%lli pre-arcs loaded\n", arcCounter ); fclose ( fp ); } void free_edge_array ( EDGE * ed_array, int ed_num ) { int i; for ( i = 1; i <= ed_num; i++ ) if ( ed_array[i].seq ) { free ( ( void * ) ed_array[i].seq ); } free ( ( void * ) ed_array ); } /* * 31mer/localAsm.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */
#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"
#define CTGendLen 35 /* should not be larger than max_read_len */
#define UPlimit 5000
#define MaxRouteNum 10
static Kmer pubKmer = 0x1b4d65165b; static void kmerSet_mark ( KmerSet * set ); static void trace4Repeat ( Kmer currW, int steps, int min, int max, int * num_route, KmerSet * kset, Kmer kmerDest, int overlap, Kmer WORDF, int * traceCounter, int maxRoute, kmer_t ** soFarNode, short * multiOccu1, short * multiOccu2, int * routeLens, char ** foundRoutes, char * soFarSeq, long long * soFarLinks, double * avgLinks ); static Kmer prevKmerLocal ( Kmer next, char ch, int overlap ) { Kmer word = next; word >>= 2; word += ( ( Kmer ) ch ) << 2 * ( overlap - 1 ); return word; } static Kmer nextKmerLocal ( Kmer prev, char ch, Kmer WordFilter ) { Kmer word = prev; word <<= 2; word &= WordFilter; word += ch; return word; } static void singleKmer ( int t, KmerSet * kset, int flag, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer ) { kmer_t * pos; put_kmerset ( kset, kmerBuffer[t], prevcBuffer[t], nextcBuffer[t], &pos ); if ( pos->inEdge == flag ) { return; } else if ( pos->inEdge == 0 ) { pos->inEdge = flag; } else if ( pos->inEdge == 1 && flag == 2 ) { pos->inEdge = 3; } else if ( pos->inEdge == 2 && flag == 1 ) { pos->inEdge = 3; } } static void putKmer2DBgraph ( KmerSet * kset, int flag, int kmer_c, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer ) { int t; for ( t = 0; t < kmer_c; t++ ) { singleKmer ( t, kset, flag, kmerBuffer, prevcBuffer, nextcBuffer ); } } static void getSeqFromRead ( READNEARBY read, char * src_seq ) { int len_seq = read.len; int j; char * tightSeq = ( char * ) darrayGet ( readSeqInGap, read.seqStarter ); for ( j = 0; j < len_seq; j++ ) { src_seq[j] = getCharInTightString ( tightSeq, j ); } 
} static void chopKmer4Ctg ( Kmer * kmerCtg, int lenCtg, int overlap, char * src_seq, Kmer WORDF ) { int index, j; Kmer word = 0; for ( index = 0; index < overlap; index++ ) { word <<= 2; word += src_seq[index]; } index = 0; kmerCtg[index++] = word; for ( j = 1; j <= lenCtg - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); kmerCtg[index++] = word; } } static void chopKmer4read ( int len_seq, int overlap, char * src_seq, char * bal_seq, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer, int * kmer_c, Kmer WORDF ) { int j, bal_j; Kmer word, bal_word; int index; char InvalidCh = 4; if ( len_seq < overlap + 1 ) { *kmer_c = 0; return; } word = 0; for ( index = 0; index < overlap; index++ ) { word <<= 2; word += src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlap ); bal_j = len_seq - 0 - overlap; // 0; index = 0; if ( word < bal_word ) { kmerBuffer[index] = word; prevcBuffer[index] = InvalidCh; nextcBuffer[index++] = src_seq[0 + overlap]; } else { kmerBuffer[index] = bal_word; prevcBuffer[index] = bal_seq[bal_j - 1]; nextcBuffer[index++] = InvalidCh; } for ( j = 1; j <= len_seq - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); bal_j = len_seq - j - overlap; // j; bal_word = prevKmerLocal ( bal_word, bal_seq[bal_j], overlap ); if ( word < bal_word ) { kmerBuffer[index] = word; prevcBuffer[index] = src_seq[j - 1]; if ( j < len_seq - overlap ) { nextcBuffer[index++] = src_seq[j + overlap]; } else { nextcBuffer[index++] = InvalidCh; } //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node kmerBuffer[index] = bal_word; if ( bal_j > 0 ) { prevcBuffer[index] = bal_seq[bal_j - 1]; } else { prevcBuffer[index] = InvalidCh; } nextcBuffer[index++] = bal_seq[bal_j + overlap]; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } *kmer_c = index; } static 
void headTightStr ( char * tightStr, int length, int start, int headLen, int revS, char * src_seq ) { int i, index = 0; if ( !revS ) { for ( i = start; i < start + headLen; i++ ) { src_seq[index++] = getCharInTightString ( tightStr, i ); } } else { for ( i = length - 1 - start; i >= length - headLen - start; i-- ) { src_seq[index++] = int_comp ( getCharInTightString ( tightStr, i ) ); } } } static int getSeqFromCtg ( CTGinSCAF * ctg, boolean fromHead, unsigned int len, int originOverlap, char * src_seq ) { unsigned int ctgId = ctg->ctgID; unsigned int bal_ctg = getTwinCtg ( ctgId ); if ( contig_array[ctgId].length < 1 ) { return 0; } unsigned int length = contig_array[ctgId].length + originOverlap; len = len < length ? len : length; if ( fromHead ) { if ( contig_array[ctgId].seq ) { headTightStr ( contig_array[ctgId].seq, length, 0, len, 0, src_seq ); } else { headTightStr ( contig_array[bal_ctg].seq, length, 0, len, 1, src_seq ); } } else { if ( contig_array[ctgId].seq ) { headTightStr ( contig_array[ctgId].seq, length, length - len, len, 0, src_seq ); } else { headTightStr ( contig_array[bal_ctg].seq, length, length - len, len, 1, src_seq ); } } return len; } static KmerSet * readsInGap2DBgraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int originOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, Kmer WordFilter ) { int kmer_c; Kmer * kmerBuffer; char * nextcBuffer, *prevcBuffer; int i; int buffer_size = maxReadLen > CTGendLen ? 
maxReadLen : CTGendLen; KmerSet * kmerS = NULL; int lenCtg1; int lenCtg2; char * bal_seq; char * src_seq; src_seq = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); bal_seq = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); prevcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); nextcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); kmerS = init_kmerset ( 1024, 0.77f ); for ( i = 0; i < num; i++ ) { getSeqFromRead ( rdArray[i], src_seq ); chopKmer4read ( rdArray[i].len, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 0, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); } lenCtg1 = getSeqFromCtg ( ctg1, 0, CTGendLen, originOverlap, src_seq ); chopKmer4Ctg ( kmerCtg1, lenCtg1, overlap, src_seq, WordFilter ); chopKmer4read ( lenCtg1, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 1, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); lenCtg2 = getSeqFromCtg ( ctg2, 1, CTGendLen, originOverlap, src_seq ); chopKmer4Ctg ( kmerCtg2, lenCtg2, overlap, src_seq, WordFilter ); chopKmer4read ( lenCtg2, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 2, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); /* if(ctg1->ctgID==3733&&ctg2->ctgID==3067){ for(i=0;i= 0; i-- ) { ch = kmer & 3; kmer >>= 2; kmerSeq[i] = ch; } for ( i = 0; i < overlap; i++ ) { fprintf ( fo, "%c", int2base ( ( int ) kmerSeq[i] ) ); } } static void kmerSet_mark ( KmerSet * set ) { int i, in_num, out_num, cvgSingle; kmer_t * rs; long long counter = 0, linear = 0; Kmer word; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { in_num = out_num = 0; rs = set->array + set->iter_ptr; word = rs->seq; for ( i = 0; i < 4; i++ ) { cvgSingle = get_kmer_left_cov ( *rs, i ); 
if ( cvgSingle > 0 ) { in_num++; } cvgSingle = get_kmer_right_cov ( *rs, i ); if ( cvgSingle > 0 ) { out_num++; } } if ( rs->single ) { counter++; } if ( in_num == 1 && out_num == 1 ) { rs->linear = 1; linear++; } } set->iter_ptr ++; } //printf("Allocated %ld node, %ld single nodes, %ld linear\n",(long)count_kmerset(set),counter,linear); } static kmer_t * searchNode ( Kmer word, KmerSet * kset, int overlap ) { Kmer bal_word = reverseComplement ( word, overlap ); kmer_t * node; boolean found; if ( word < bal_word ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( found ) { return node; } else { return NULL; } } static int searchKmerOnCtg ( Kmer currW, Kmer * kmerDest, int num ) { int i; for ( i = 0; i < num; i++ ) { if ( currW == kmerDest[i] ) { return i; } } return -1; } // pick on from n items randomly static int nPick1 ( int * array, int n ) { int m, i; m = n - 1; //(int)(drand48()*n); int value = array[m]; for ( i = m; i < n - 1; i++ ) { array[i] = array[i + 1]; } return value; } static void traceAlongDBgraph ( Kmer currW, int steps, int min, int max, int * num_route, KmerSet * kset, Kmer * kmerDest, int num, int overlap, Kmer WORDF, char ** foundRoutes, int * routeEndOnCtg2, int * routeLens, char * soFarSeq, int * traceCounter, int maxRoute, kmer_t ** soFarNode, boolean * multiOccu, long long * soFarLinks, double * avgLinks ) { ( *traceCounter ) ++; if ( *traceCounter > UPlimit ) { /* if(overlap==19&&kmerDest[0]==pubKmer) printf("UPlimit\n"); */ return; } if ( steps > max || *num_route >= maxRoute ) { /* if(overlap==19&&kmerDest[0]==pubKmer) printf("max steps/maxRoute\n"); */ return; } Kmer word = reverseComplement ( currW, overlap ); boolean isSmaller = currW < word; int i; char ch; unsigned char links; if ( isSmaller ) { word = currW; } kmer_t * node; boolean found = search_kmerset ( kset, word, &node ); if ( !found ) { printf ( "Trace: can't find kmer %llx (rc %llx, input %llx) at step %d\n", 
word, reverseComplement ( word, overlap ), currW, steps ); return; } if ( node->twin > 1 ) { return; } if ( soFarNode ) { soFarNode[steps] = node; } if ( steps > 0 ) { soFarSeq[steps - 1] = currW & 0x03; } int index, end; int linkCounter = *soFarLinks; if ( steps >= min && node->inEdge > 1 && ( end = searchKmerOnCtg ( currW, kmerDest, num ) ) >= 0 ) { index = *num_route; if ( steps > 0 ) { avgLinks[index] = ( double ) linkCounter / steps; } else { avgLinks[index] = 0; } //find node that appears more than once in the path multiOccu[index] = 0; for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } for ( i = 0; i < steps + 1; i++ ) { if ( soFarNode[i]->deleted ) { multiOccu[index] = 1; break; } soFarNode[i]->deleted = 1; } routeEndOnCtg2[index] = end; routeLens[index] = steps; char * array = foundRoutes[index]; for ( i = 0; i < steps; i++ ) { array[i] = soFarSeq[i]; } if ( i < max ) { array[i] = 4; } //indicate the end of the sequence *num_route = ++index; return; } steps++; if ( isSmaller ) { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_right_cov ( *node, ch ); if ( !links ) { continue; } *soFarLinks = linkCounter + links; word = nextKmerLocal ( currW, ch, WORDF ); traceAlongDBgraph ( word, steps, min, max, num_route, kset, kmerDest, num, overlap, WORDF, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, traceCounter, maxRoute, soFarNode, multiOccu, soFarLinks, avgLinks ); } } else { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_left_cov ( *node, ch ); if ( !links ) { continue; } *soFarLinks = linkCounter + links; word = nextKmerLocal ( currW, int_comp ( ch ), WORDF ); traceAlongDBgraph ( word, steps, min, max, num_route, kset, kmerDest, num, overlap, WORDF, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, traceCounter, maxRoute, soFarNode, multiOccu, soFarLinks, avgLinks ); } } } static int searchFgap ( KmerSet * kset, CTGinSCAF * ctg1, CTGinSCAF * 
ctg2, Kmer * kmerCtg1, Kmer * kmerCtg2, unsigned int origOverlap, int overlap, DARRAY * gapSeqArray, int len1, int len2, Kmer WordFilter, int * offset1, int * offset2, char * seqGap, int * cut1, int * cut2 ) { int i; int ret = 0; kmer_t * node, **soFarNode; int num_route; int gapLen = ctg2->start - ctg1->end - origOverlap + overlap; int min = gapLen - GLDiff > 0 ? gapLen - GLDiff : 0; //0531 int max = gapLen + GLDiff < 10 ? 10 : gapLen + GLDiff; char ** foundRoutes; char * soFarSeq; int traceCounter; int * routeEndOnCtg2; int * routeLens; boolean * multiOccu; long long soFarLinks; double * avgLinks; //mask linear internal linear kmer on contig1 end routeEndOnCtg2 = ( int * ) ckalloc ( MaxRouteNum * sizeof ( int ) ); routeLens = ( int * ) ckalloc ( MaxRouteNum * sizeof ( int ) ); multiOccu = ( boolean * ) ckalloc ( MaxRouteNum * sizeof ( boolean ) ); short * MULTI1 = ( short * ) ckalloc ( MaxRouteNum * sizeof ( short ) ); short * MULTI2 = ( short * ) ckalloc ( MaxRouteNum * sizeof ( short ) ); soFarSeq = ( char * ) ckalloc ( max * sizeof ( char ) ); soFarNode = ( kmer_t ** ) ckalloc ( ( max + 1 ) * sizeof ( kmer_t * ) ); foundRoutes = ( char ** ) ckalloc ( MaxRouteNum * sizeof ( char * ) );; avgLinks = ( double * ) ckalloc ( MaxRouteNum * sizeof ( double ) );; for ( i = 0; i < MaxRouteNum; i++ ) { foundRoutes[i] = ( char * ) ckalloc ( max * sizeof ( char ) ); } for ( i = len1 - 1; i >= 0; i-- ) { num_route = traceCounter = soFarLinks = 0; int steps = 0; traceAlongDBgraph ( kmerCtg1[i], steps, min, max, &num_route, kset, kmerCtg2, len2, overlap, WordFilter, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, &traceCounter, MaxRouteNum, soFarNode, multiOccu, &soFarLinks, avgLinks ); if ( num_route > 0 ) { int m, minEnd = routeEndOnCtg2[0]; for ( m = 0; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( routeEndOnCtg2[m] < minEnd ) { minEnd = routeEndOnCtg2[m]; } } /* else if(minFreq>1){ for(m=0;m3) break; printf("%c",int2base((int)foundRoutes[m][j])); } 
printf(": %4.2f\n",avgLinks[m]); } } */ num_route = traceCounter = soFarLinks = 0; steps = 0; trace4Repeat ( kmerCtg1[i], steps, min, max, &num_route, kset, kmerCtg2[minEnd], overlap, WordFilter, &traceCounter, MaxRouteNum, soFarNode, MULTI1, MULTI2, routeLens, foundRoutes, soFarSeq, &soFarLinks, avgLinks ); int j, best = 0; int maxLen = routeLens[0]; double maxLink = avgLinks[0]; char * pt; boolean repeat = 0, sameLen = 1; int leftMost = max, rightMost = max; if ( num_route < 1 ) { fprintf ( stderr, "After trace4Repeat: no route was found\n" ); continue; } if ( num_route > 1 ) { /* if multiple paths are found, check for repetitive occurrences and compare links/length */ for ( m = 0; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( MULTI1[m] >= 0 && MULTI2[m] >= 0 ) { repeat = 1; leftMost = leftMost > MULTI1[m] ? MULTI1[m] : leftMost; rightMost = rightMost > MULTI2[m] ? MULTI2[m] : rightMost; } if ( routeLens[m] != maxLen ) { sameLen = 0; } if ( routeLens[m] < maxLen ) { maxLen = routeLens[m]; } if ( avgLinks[m] > maxLink ) { maxLink = avgLinks[m]; best = m; } } } if ( repeat ) { *offset1 = *offset2 = *cut1 = *cut2 = 0; int index = 0; char ch; for ( j = 0; j < leftMost; j++ ) { if ( routeLens[0] < j + overlap + 1 ) { break; } else { ch = foundRoutes[0][j]; } for ( m = 1; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( ch != foundRoutes[m][j] ) { break; } } if ( m == num_route ) { seqGap[index++] = ch; } else { break; } } *offset1 = index; index = 0; for ( j = 0; j < rightMost; j++ ) { if ( routeLens[0] - overlap - 1 < j ) { break; } else { ch = foundRoutes[0][routeLens[0] - overlap - 1 - j]; } for ( m = 1; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( ch != foundRoutes[m][routeLens[m] - overlap - 1 - j] ) { break; } } if ( m == num_route ) { index++; } else { break; } } *offset2 = index; for ( j = 0; j < *offset2; j++ ) { seqGap[*offset1 + *offset2 - 1 - j] = foundRoutes[0][routeLens[0] - overlap - 1 - j]; } if ( 
*offset1 > 0 || *offset2 > 0 ) { *cut1 = len1 - i - 1; *cut2 = minEnd; /* fprintf(stderr,"\n"); */ for ( m = 0; m < num_route; m++ ) { for ( j = 0; j < max; j++ ) { if ( foundRoutes[m][j] > 3 ) { break; } /* fprintf(stderr,"%c",int2base((int)foundRoutes[m][j])); */ } /* fprintf(stderr,": %4.2f\n",avgLinks[m]); */ } /* fprintf(stderr,">Gap (%d + %d) (%d + %d)\n",*offset1,*offset2,*cut1,*cut2); for(index=0;index<*offset1+*offset2;index++) fprintf(stderr,"%c",int2base(seqGap[index])); fprintf(stderr,"\n"); */ } ret = 3; break; } if ( overlap + ( len1 - i - 1 ) + minEnd - routeLens[best] > ( int ) origOverlap ) { continue; } ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = routeLens[best]; if ( !darrayPut ( gapSeqArray, ctg1->gapSeqOffset + maxLen / 4 ) ) { continue; } pt = ( char * ) darrayPut ( gapSeqArray, ctg1->gapSeqOffset ); /* printKmer(stderr,kmerCtg1[i],overlap); fprintf(stderr,"-"); */ for ( j = 0; j < max; j++ ) { if ( foundRoutes[best][j] > 3 ) { break; } writeChar2tightString ( foundRoutes[best][j], pt, j ); /* fprintf(stderr,"%c",int2base((int)foundRoutes[best][j])); */ } /* fprintf(stderr,": GAPSEQ %d + %d, avglink %4.2f\n",len1-i-1,minEnd,avgLinks[best]); */ ctg1->cutTail = len1 - i - 1; ctg2->cutHead = overlap + minEnd; ctg2->scaftig_start = 0; ret = 1; break; /* }if(num_route>1){ ret = 2; break; */ } else /* mark the node that leads to a dead end */ { node = searchNode ( kmerCtg1[i], kset, overlap ); if ( node ) { node->twin = 2; } } } for ( i = 0; i < MaxRouteNum; i++ ) { free ( ( void * ) foundRoutes[i] ); } free ( ( void * ) soFarSeq ); free ( ( void * ) soFarNode ); free ( ( void * ) multiOccu ); free ( ( void * ) MULTI1 ); free ( ( void * ) MULTI2 ); free ( ( void * ) foundRoutes ); free ( ( void * ) routeEndOnCtg2 ); free ( ( void * ) routeLens ); free ( ( void * ) avgLinks ); return ret; } static void trace4Repeat ( Kmer currW, int steps, int min, int max, int * num_route, KmerSet * kset, Kmer kmerDest, int overlap, Kmer WORDF, int * traceCounter, int maxRoute, kmer_t ** soFarNode, short * 
multiOccu1, short * multiOccu2, int * routeLens, char ** foundRoutes, char * soFarSeq, long long * soFarLinks, double * avgLinks ) { ( *traceCounter ) ++; if ( *traceCounter > UPlimit ) { return; } if ( steps > max || *num_route >= maxRoute ) { return; } Kmer word = reverseComplement ( currW, overlap ); boolean isSmaller = currW < word; char ch; unsigned char links; int index, i; if ( isSmaller ) { word = currW; } kmer_t * node; boolean found = search_kmerset ( kset, word, &node ); if ( !found ) { printf ( "Trace: can't find kmer %llx (rc %llx, input %llx) at step %d\n", word, reverseComplement ( word, overlap ), currW, steps ); return; } if ( soFarNode ) { soFarNode[steps] = node; } if ( soFarSeq && steps > 0 ) { soFarSeq[steps - 1] = currW & 0x03; } int linkCounter; if ( soFarLinks ) { linkCounter = *soFarLinks; } if ( steps >= min && currW == kmerDest ) { index = *num_route; if ( avgLinks && steps > 0 ) { avgLinks[index] = ( double ) linkCounter / steps; } else if ( avgLinks ) { avgLinks[index] = 0; } //find node that appears more than once in the path if ( multiOccu1 && multiOccu2 ) { for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } int rightMost = 0; boolean MULTI = 0; for ( i = 0; i < steps + 1; i++ ) { if ( soFarNode[i]->deleted ) { rightMost = rightMost < i - 1 ? i - 1 : rightMost; MULTI = 1; } soFarNode[i]->deleted = 1; } if ( !MULTI ) { multiOccu1[index] = multiOccu2[index] = -1; } else { multiOccu2[index] = steps - 2 - rightMost < 0 ? 0 : steps - 2 - rightMost; //[0 steps-2] for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } int leftMost = steps - 2; for ( i = steps; i >= 0; i-- ) { if ( soFarNode[i]->deleted ) { leftMost = leftMost > i - 1 ? i - 1 : leftMost; } soFarNode[i]->deleted = 1; } multiOccu1[index] = leftMost < 0 ? 
0 : leftMost; //[0 steps-2] } } if ( routeLens ) { routeLens[index] = steps; } if ( soFarSeq ) { char * array = foundRoutes[index]; for ( i = 0; i < steps; i++ ) { array[i] = soFarSeq[i]; } if ( i < max ) { array[i] = 4; } //indicate the end of the sequence } *num_route = ++index; } steps++; if ( isSmaller ) { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_right_cov ( *node, ch ); if ( !links ) { continue; } if ( soFarLinks ) { *soFarLinks = linkCounter + links; } word = nextKmerLocal ( currW, ch, WORDF ); trace4Repeat ( word, steps, min, max, num_route, kset, kmerDest, overlap, WORDF, traceCounter, maxRoute, soFarNode, multiOccu1, multiOccu2, routeLens, foundRoutes, soFarSeq, soFarLinks, avgLinks ); } } else { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_left_cov ( *node, ch ); if ( !links ) { continue; } if ( soFarLinks ) { *soFarLinks = linkCounter + links; } word = nextKmerLocal ( currW, int_comp ( ch ), WORDF ); trace4Repeat ( word, steps, min, max, num_route, kset, kmerDest, overlap, WORDF, traceCounter, maxRoute, soFarNode, multiOccu1, multiOccu2, routeLens, foundRoutes, soFarSeq, soFarLinks, avgLinks ); } } } //found repeat node on contig ends static void maskRepeatNode ( KmerSet * kset, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, int len1, int len2, int max, Kmer WordFilter ) { int i; int num_route, steps; int min = 1, maxRoute = 1; int traceCounter; Kmer word, bal_word; kmer_t * node; boolean found; int counter = 0; for ( i = 0; i < len1; i++ ) { word = kmerCtg1[i]; bal_word = reverseComplement ( word, overlap ); if ( word > bal_word ) { word = bal_word; } found = search_kmerset ( kset, word, &node ); if ( !found || node->linear ) { //printf("Found no node for kmer %llx\n",word); continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( word, steps, min, max, &num_route, kset, word, overlap, WordFilter, &traceCounter, maxRoute, NULL, 
NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { continue; } counter++; node->checked = 1; } for ( i = 0; i < len2; i++ ) { word = kmerCtg2[i]; bal_word = reverseComplement ( word, overlap ); if ( word > bal_word ) { word = bal_word; } found = search_kmerset ( kset, word, &node ); if ( !found || node->linear ) { //printf("Found no node for kmer %llx\n",word); continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( word, steps, min, max, &num_route, kset, word, overlap, WordFilter, &traceCounter, maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { continue; } counter++; node->checked = 1; } //printf("MR: %d(%d)\n",counter,len1+len2); } /* static boolean chopReadFillGap(int len_seq,int overlap,char *src_seq, char *bal_seq, KmerSet *kset,Kmer WORDF,int *start,int *end,boolean *bal, Kmer *KmerCtg1,int len1,Kmer *KmerCtg2,int len2,int *index1,int *index2) { int index,j=0,bal_j; Kmer word,bal_word; int flag=0,bal_flag=0; int ctg1start,bal_ctg1start,ctg2end,bal_ctg2end; int seqStart,bal_start,seqEnd,bal_end; kmer_t *node; boolean found; if(len_seqlinear&&!node->checked){ if(!flag&&node->inEdge==1){ ctg1start = searchKmerOnCtg(word,KmerCtg1,len1); if(ctg1start>0){ flag = 1; seqStart = j + overlap-1; } } if(!bal_flag&&node->inEdge==2){ bal_ctg2end = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(bal_ctg2end>0){ bal_flag = 2; bal_end = bal_j+overlap-1; } } } for(j = 1; j <= len_seq - overlap; j ++) { word = nextKmerLocal(word,src_seq[j-1+overlap],WORDF); bal_j = len_seq-j-overlap; // j; bal_word = prevKmerLocal(bal_word,bal_seq[bal_j],overlap); if(wordlinear&&!node->checked){ if(!flag&&node->inEdge==1){ ctg1start = searchKmerOnCtg(word,KmerCtg1,len1); if(ctg1start>0){ flag = 1; seqStart = j + overlap-1; } }else if(flag==1&&node->inEdge==1){ index = searchKmerOnCtg(word,KmerCtg1,len1); if(index>ctg1start){ // choose hit closer to gap ctg1start = index; seqStart = j + overlap-1; } }else if(flag==1&&node->inEdge==2){ 
ctg2end = searchKmerOnCtg(word,KmerCtg2,len2); if(ctg2end>0){ flag = 3; seqEnd = j+overlap-1; break; } } if(!bal_flag&&node->inEdge==2){ bal_ctg2end = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(bal_ctg2end>0){ bal_flag = 2; bal_end = bal_j+overlap-1; } }else if(bal_flag==2&&node->inEdge==2){ index = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(indexinEdge==1){ bal_ctg1start = searchKmerOnCtg(bal_word,KmerCtg1,len1); if(bal_ctg1start>0){ bal_flag = 3; bal_start = bal_j+overlap-1; break; } } } } if(flag==3){ *start = seqStart; *end = seqEnd; *bal = 0; *index1 = ctg1start; *index2 = ctg2end; return 1; }else if(bal_flag==3){ *start = bal_start; *end = bal_end; *bal = 1; *index1 = bal_ctg1start; *index2 = bal_ctg2end; return 1; } return 0; } static boolean readsCrossGap(READNEARBY *rdArray, int num, int originOverlap,DARRAY *gapSeqArray, Kmer *kmerCtg1,Kmer *kmerCtg2,int overlap,int len1,int len2, CTGinSCAF *ctg1,CTGinSCAF *ctg2,KmerSet *kmerS,Kmer WordFilter,int min,int max) { int i,j,start,end,startOnCtg1,endOnCtg2; char *bal_seq; char *src_seq; char *pt; boolean bal,ret=0,FILL; src_seq = (char *)ckalloc(maxReadLen*sizeof(char)); bal_seq = (char *)ckalloc(maxReadLen*sizeof(char)); for(i=0;imax) continue; fprintf(stderr,"Read across\n"); //printf("Filled: K %d, ctg1 %d ctg2 %d,start %d end %d\n",overlap,startOnCtg1,endOnCtg2,start,end); if(overlap+(len1-startOnCtg1-1)+endOnCtg2-(end-start)>(int)originOverlap) continue; // contig1 and contig2 could not overlap more than origOverlap bases ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = end-start; if(!darrayPut(gapSeqArray,ctg1->gapSeqOffset+(end-start)/4)) continue; pt = (char *)darrayPut(gapSeqArray,ctg1->gapSeqOffset); for(j=start+1;j<=end;j++){ if(bal) writeChar2tightString(bal_seq[j],pt,j-start-1); else writeChar2tightString(src_seq[j],pt,j-start-1); } ctg1->cutTail = len1-startOnCtg1-1; ctg2->cutHead = overlap + endOnCtg2; ctg2->scaftig_start = 0; ret = 1; break; } free((void*)src_seq); 
free((void*)bal_seq); return ret; } */ static void kmerSet_markTandem ( KmerSet * set, Kmer WordFilter, int overlap ); static boolean readsCrossGap ( READNEARBY * rdArray, int num, int originOverlap, DARRAY * gapSeqArray, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, CTGinSCAF * ctg1, CTGinSCAF * ctg2, KmerSet * kmerS, Kmer WordFilter, int min, int max, int offset1, int offset2, char * seqGap, char * seqCtg1, char * seqCtg2, int cut1, int cut2 ); int localGraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int origOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, DARRAY * gapSeqArray, char * seqCtg1, char * seqCtg2, char * seqGap ) { /**************** put kmer in DBgraph ****************/ KmerSet * kmerSet; Kmer WordFilter = ( ( ( Kmer ) 1 ) << ( 2 * overlap ) ) - 1; /* if(ctg1->ctgID==56410&&ctg2->ctgID==61741) printf("Extract %d reads for gap [%d %d]\n",num,ctg1->ctgID,ctg2->ctgID); */ kmerSet = readsInGap2DBgraph ( rdArray, num, ctg1, ctg2, origOverlap, kmerCtg1, kmerCtg2, overlap, WordFilter ); time_t tt; time ( &tt ); // srand48((int)tt); /* int i,j; for(i=0;i<2;i++){ int array[] = {0,1,2,3}; for(j=4;j>0;j--) fprintf(stderr,"%d ", nPick1(array,j)); } fprintf(stderr,"\n"); */ /***************** search path to connect contig ends ********/ int gapLen = ctg2->start - ctg1->end - origOverlap + overlap; int min = gapLen - GLDiff > 0 ? gapLen - GLDiff : 0; int max = gapLen + GLDiff < 10 ? 10 : gapLen + GLDiff; //count kmer number for contig1 and contig2 ends int len1, len2; len1 = CTGendLen < contig_array[ctg1->ctgID].length + origOverlap ? CTGendLen : contig_array[ctg1->ctgID].length + origOverlap; len2 = CTGendLen < contig_array[ctg2->ctgID].length + origOverlap ? 
CTGendLen : contig_array[ctg2->ctgID].length + origOverlap; len1 -= overlap - 1; len2 -= overlap - 1; //int pathNum = 2; int offset1 = 0, offset2 = 0, cut1 = 0, cut2 = 0; int pathNum = searchFgap ( kmerSet, ctg1, ctg2, kmerCtg1, kmerCtg2, origOverlap, overlap, gapSeqArray, len1, len2, WordFilter, &offset1, &offset2, seqGap, &cut1, &cut2 ); //printf("SF: %d K %d\n",pathNum,overlap); if ( pathNum == 0 ) { free_kmerset ( kmerSet ); return 0; } else if ( pathNum == 1 ) { free_kmerset ( kmerSet ); return 1; }/* else{ printf("ret %d\n",pathNum); free_kmerset(kmerSet); return 0; } */ /******************* cross the gap by single reads *********/ //kmerSet_markTandem(kmerSet,WordFilter,overlap); maskRepeatNode ( kmerSet, kmerCtg1, kmerCtg2, overlap, len1, len2, max, WordFilter ); boolean found = readsCrossGap ( rdArray, num, origOverlap, gapSeqArray, kmerCtg1, kmerCtg2, overlap, ctg1, ctg2, kmerSet, WordFilter, min, max, offset1, offset2, seqGap, seqCtg1, seqCtg2, cut1, cut2 ); if ( found ) { //fprintf(stderr,"read across\n"); free_kmerset ( kmerSet ); return found; } else { free_kmerset ( kmerSet ); return 0; } } static void kmerSet_markTandem ( KmerSet * set, Kmer WordFilter, int overlap ) { kmer_t * rs; long long counter = 0; int num_route, steps; int min = 1, max = overlap, maxRoute = 1; int traceCounter; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; if ( rs->inEdge > 0 ) { set->iter_ptr ++; continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( rs->seq, steps, min, max, &num_route, set, rs->seq, overlap, WordFilter, &traceCounter, maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { set->iter_ptr ++; continue; } /* printKmer(stderr,rs->seq,overlap); fprintf(stderr, "\n"); */ rs->checked = 1; counter++; } set->iter_ptr ++; } } /******************* the following is for read-crossing gaps *************************/ #define 
MAXREADLENGTH 100 static const int INDEL = 0; static const int SIM[4][4] = { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }; static char fastSequence[MAXREADLENGTH]; static char slowSequence[MAXREADLENGTH]; static int Fmatrix[MAXREADLENGTH + 1][MAXREADLENGTH + 1]; static int slowToFastMapping[MAXREADLENGTH + 1]; static int fastToSlowMapping[MAXREADLENGTH + 1]; static int max ( int A, int B, int C ) { A = A >= B ? A : B; return ( A >= C ? A : C ); } static int compareSequences ( char * sequence1, char * sequence2, int length1, int length2 ) { if ( length1 < 1 || length2 < 1 || length1 > MAXREADLENGTH || length2 > MAXREADLENGTH ) { return 0; } int i, j; int Choice1, Choice2, Choice3; int maxScore; for ( i = 0; i <= length1; i++ ) { Fmatrix[i][0] = 0; } for ( j = 0; j <= length2; j++ ) { Fmatrix[0][j] = 0; } for ( i = 1; i <= length1; i++ ) { for ( j = 1; j <= length2; j++ ) { Choice1 = Fmatrix[i - 1][j - 1] + SIM[ ( int ) sequence1[i - 1]] [ ( int ) sequence2[j - 1]]; Choice2 = Fmatrix[i - 1][j] + INDEL; Choice3 = Fmatrix[i][j - 1] + INDEL; Fmatrix[i][j] = max ( Choice1, Choice2, Choice3 ); } } maxScore = Fmatrix[length1][length2]; return maxScore; } static void mapSlowOntoFast ( int slowSeqLength, int fastSeqLength ) { int slowIndex = slowSeqLength; int fastIndex = fastSeqLength; int fastn, slown; if ( slowIndex == 0 ) { slowToFastMapping[0] = fastIndex; while ( fastIndex >= 0 ) { fastToSlowMapping[fastIndex--] = 0; } return; } if ( fastIndex == 0 ) { while ( slowIndex >= 0 ) { slowToFastMapping[slowIndex--] = 0; } fastToSlowMapping[0] = slowIndex; return; } while ( slowIndex > 0 && fastIndex > 0 ) { fastn = ( int ) fastSequence[fastIndex - 1]; //getCharInTightString(fastSequence,fastIndex-1); slown = ( int ) slowSequence[slowIndex - 1]; //getCharInTightString(slowSequence,slowIndex-1); if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex - 1] + SIM[fastn][slown] ) { fastToSlowMapping[--fastIndex] = --slowIndex; 
slowToFastMapping[slowIndex] = fastIndex; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex] + INDEL ) { fastToSlowMapping[--fastIndex] = slowIndex - 1; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex][slowIndex - 1] + INDEL ) { slowToFastMapping[--slowIndex] = fastIndex - 1; } else { printf ( "compareSequence: Error trace\n" ); fflush ( stdout ); abort(); } } while ( slowIndex > 0 ) { slowToFastMapping[--slowIndex] = -1; } while ( fastIndex > 0 ) { fastToSlowMapping[--fastIndex] = -1; } slowToFastMapping[slowSeqLength] = fastSeqLength; fastToSlowMapping[fastSeqLength] = slowSeqLength; } static boolean chopReadFillGap ( int len_seq, int overlap, char * src_seq, char * bal_seq, KmerSet * kset, Kmer WORDF, int * start, int * end, boolean * bal, Kmer * KmerCtg1, int len1, Kmer * KmerCtg2, int len2, int * index1, int * index2 ) { int index, j = 0, bal_j; Kmer word, bal_word; int flag = 0, bal_flag = 0; int ctg1start, bal_ctg1start, ctg2end, bal_ctg2end; int seqStart, bal_start, seqEnd, bal_end; kmer_t * node; boolean found; if ( len_seq < overlap + 1 ) { return 0; } word = 0; for ( index = 0; index < overlap; index++ ) { word <<= 2; word += src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlap ); bal_j = len_seq - 0 - overlap; // 0; flag = bal_flag = 0; if ( word < bal_word ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( found && !node->linear && !node->checked ) { if ( !flag && node->inEdge == 1 ) { ctg1start = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( ctg1start >= 0 ) { flag = 1; seqStart = j + overlap - 1; } } if ( !bal_flag && node->inEdge == 2 ) { bal_ctg2end = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( bal_ctg2end >= 0 ) { bal_flag = 2; bal_end = bal_j + overlap - 1; } } } for ( j = 1; j <= len_seq - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 
+ overlap], WORDF ); bal_j = len_seq - j - overlap; // j; bal_word = prevKmerLocal ( bal_word, bal_seq[bal_j], overlap ); if ( word < bal_word ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( found && !node->linear && !node->checked ) { if ( !flag && node->inEdge == 1 ) { ctg1start = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( ctg1start >= 0 ) { flag = 1; seqStart = j + overlap - 1; } } else if ( flag == 1 && node->inEdge == 1 ) { index = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( index >= 0 && index > ctg1start ) // choose hit closer to gap { ctg1start = index; seqStart = j + overlap - 1; } } else if ( flag == 1 && node->inEdge == 2 ) { ctg2end = searchKmerOnCtg ( word, KmerCtg2, len2 ); if ( ctg2end >= 0 ) { flag = 3; seqEnd = j + overlap - 1; break; } } if ( !bal_flag && node->inEdge == 2 ) { bal_ctg2end = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( bal_ctg2end >= 0 ) { bal_flag = 2; bal_end = bal_j + overlap - 1; } } else if ( bal_flag == 2 && node->inEdge == 2 ) { index = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( index >= 0 && index < bal_ctg2end ) // choose hit closer to gap { bal_ctg2end = index; bal_end = bal_j + overlap - 1; } } else if ( bal_flag == 2 && node->inEdge == 1 ) { bal_ctg1start = searchKmerOnCtg ( bal_word, KmerCtg1, len1 ); if ( bal_ctg1start >= 0 ) { bal_flag = 3; bal_start = bal_j + overlap - 1; break; } } } } if ( flag == 3 ) { *start = seqStart; *end = seqEnd; *bal = 0; *index1 = ctg1start; *index2 = ctg2end; return 1; } else if ( bal_flag == 3 ) { *start = bal_start; *end = bal_end; *bal = 1; *index1 = bal_ctg1start; *index2 = bal_ctg2end; return 1; } return 0; } static int cutSeqFromTightStr ( char * tightStr, int length, int start, int end, int revS, char * src_seq ) { int i, index = 0; end = end < length ? end : length - 1; start = start >= 0 ? 
start : 0; if ( !revS ) { for ( i = start; i <= end; i++ ) { src_seq[index++] = getCharInTightString ( tightStr, i ); } } else { for ( i = length - 1 - start; i >= length - end - 1; i-- ) { src_seq[index++] = int_comp ( getCharInTightString ( tightStr, i ) ); } } return end - start + 1; } static int cutSeqFromCtg ( unsigned int ctgID, int start, int end, char * sequence, int originOverlap ) { unsigned int bal_ctg = getTwinCtg ( ctgID ); if ( contig_array[ctgID].length < 1 ) { return 0; } int length = contig_array[ctgID].length + originOverlap; if ( contig_array[ctgID].seq ) { return cutSeqFromTightStr ( contig_array[ctgID].seq, length, start, end, 0, sequence ); } else { return cutSeqFromTightStr ( contig_array[bal_ctg].seq, length, start, end, 1, sequence ); } } static int cutSeqFromRead ( char * src_seq, int length, int start, int end, char * sequence ) { if ( end >= length ) { printf ( "******: end %d length %d\n", end, length ); } end = end < length ? end : length - 1; start = start >= 0 ? start : 0; int i; for ( i = start; i <= end; i++ ) { sequence[i - start] = src_seq[i]; } return end - start + 1; } static void printSeq ( FILE * fo, char * seq, int len ) { int i; for ( i = 0; i < len; i++ ) { fprintf ( fo, "%c", int2base ( ( int ) seq[i] ) ); } fprintf ( fo, "\n" ); } static boolean readsCrossGap ( READNEARBY * rdArray, int num, int originOverlap, DARRAY * gapSeqArray, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, CTGinSCAF * ctg1, CTGinSCAF * ctg2, KmerSet * kmerS, Kmer WordFilter, int min, int max, int offset1, int offset2, char * seqGap, char * seqCtg1, char * seqCtg2, int cut1, int cut2 ) { int i, j, start, end, startOnCtg1, endOnCtg2; char * bal_seq; char * src_seq; char * pt; boolean bal, ret = 0, FILL; double maxScore = 0.0; int maxIndex; int lenCtg1, lenCtg2; //build sequences on left and right of the uncertain region int buffer_size = maxReadLen > 100 ? 
maxReadLen : 100; int length = contig_array[ctg1->ctgID].length + originOverlap; if ( buffer_size > offset1 ) { lenCtg1 = cutSeqFromCtg ( ctg1->ctgID, length - cut1 - ( buffer_size - offset1 ), length - 1 - cut1, seqCtg1, originOverlap ); for ( i = 0; i < offset1; i++ ) { seqCtg1[lenCtg1 + i] = seqGap[i]; } lenCtg1 += offset1; } else { for ( i = offset1 - buffer_size; i < offset1; i++ ) { seqCtg1[i + buffer_size - offset1] = seqGap[i]; } lenCtg1 = buffer_size; } length = contig_array[ctg2->ctgID].length + originOverlap; if ( buffer_size > offset2 ) { lenCtg2 = cutSeqFromCtg ( ctg2->ctgID, cut2, buffer_size - offset2 - 1 + cut2, & ( seqCtg2[offset2] ), originOverlap ); for ( i = 0; i < offset2; i++ ) { seqCtg2[i] = seqGap[i + offset1]; } lenCtg2 += offset2; } else { for ( i = 0; i < buffer_size; i++ ) { seqCtg2[i] = seqGap[i + offset1]; } lenCtg2 = buffer_size; } /* if(offset1>0||offset2>0){ for(i=0;i max ) { continue; } if ( overlap + ( len1 - startOnCtg1 - 1 ) + endOnCtg2 - ( end - start ) > ( int ) originOverlap ) { continue; } // contig1 and contig2 could not overlap more than origOverlap bases START[i] = start; END[i] = end; INDEX1[i] = startOnCtg1; INDEX2[i] = endOnCtg2; BAL[i] = bal; int matchLen = 2 * overlap < ( end - start + overlap ) ? 2 * overlap : ( end - start + overlap ); int match; int alignLen = matchLen; //compare the left of hit kmer on ctg1 //int ctgLeft = (contig_array[ctg1->ctgID].length+originOverlap)-(len1+overlap-1)+startOnCtg1; int ctgLeft = ( lenCtg1 ) - ( len1 + overlap - 1 ) + startOnCtg1; int readLeft = start - overlap + 1; int cmpLen = ctgLeft < readLeft ? ctgLeft : readLeft; cmpLen = cmpLen <= MAXREADLENGTH ? 
cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg1->ctgID,ctgLeft-cmpLen,ctgLeft-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg1, lenCtg1, ctgLeft - cmpLen, ctgLeft - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; //compare the right of hit kmer on ctg1 int ctgRight = len1 - startOnCtg1 - 1; cmpLen = ctgRight < ( rdArray[i].len - start - 1 ) ? ctgRight : ( rdArray[i].len - start - 1 ); cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg1->ctgID,ctgLeft+overlap,ctgLeft+overlap+cmpLen-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg1, lenCtg1, ctgLeft + overlap, ctgLeft + overlap + cmpLen - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, start + 1, start + cmpLen, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, start + 1, start + cmpLen, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); //fprintf(stderr,"%d -- %d\n",match,cmpLen); alignLen += cmpLen; matchLen += match; //compare the left of hit kmer on ctg2 ctgLeft = endOnCtg2; readLeft = end - overlap + 1; cmpLen = ctgLeft < readLeft ? ctgLeft : readLeft; cmpLen = ctgLeft <= MAXREADLENGTH ? 
ctgLeft : MAXREADLENGTH; cmpLen = cmpLen < readLeft ? cmpLen : readLeft; /* also bound by the read side, as in the other three comparisons */ /* cutSeqFromCtg(ctg2->ctgID,endOnCtg2-cmpLen,endOnCtg2-1,fastSequence,originOverlap); */ cutSeqFromRead ( seqCtg2, lenCtg2, endOnCtg2 - cmpLen, endOnCtg2 - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; /* compare the right of hit kmer on ctg2 */ /* ctgRight = contig_array[ctg2->ctgID].length+originOverlap-endOnCtg2-overlap; */ ctgRight = lenCtg2 - endOnCtg2 - overlap; cmpLen = ctgRight < ( rdArray[i].len - end - 1 ) ? ctgRight : ( rdArray[i].len - end - 1 ); cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH; /* cutSeqFromCtg(ctg2->ctgID,endOnCtg2+overlap,endOnCtg2+overlap+cmpLen-1,fastSequence,originOverlap); */ cutSeqFromRead ( seqCtg2, lenCtg2, endOnCtg2 + overlap, endOnCtg2 + overlap + cmpLen - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, end + 1, end + cmpLen, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, end + 1, end + cmpLen, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; /* if(cmpLen>0&&match!=cmpLen+overlap){ printSeq(stderr,fastSequence,cmpLen+overlap); printSeq(stderr,slowSequence,cmpLen+overlap); printKmer(stderr,kmerCtg2[endOnCtg2],overlap); fprintf(stderr,": %d(%d)\n",bal,endOnCtg2); }else if(cmpLen>0&&match==cmpLen+overlap) fprintf(stderr,"Perfect\n"); */ double score = ( double ) matchLen / alignLen; if ( maxScore < score ) { maxScore = score; /* fprintf(stderr,"%4.2f (%d/%d)\n",maxScore,matchLen,alignLen); */ maxIndex = i; } SCORE[i] = score; } /* if(maxScore>0.0) fprintf(stderr,"SCORE: %4.2f\n",maxScore); */ if ( maxScore > 0.9 ) { /* for(i=0;i 0 ? 
offset1 - ( len1 - INDEX1[maxIndex] - 1 ) : 0; int rightRemain = offset2 - ( overlap + INDEX2[maxIndex] ) > 0 ? offset2 - ( overlap + INDEX2[maxIndex] ) : 0; ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = END[maxIndex] - START[maxIndex] + leftRemain + rightRemain; if ( darrayPut ( gapSeqArray, ctg1->gapSeqOffset + ( END[maxIndex] - START[maxIndex] + leftRemain + rightRemain ) / 4 ) ) { pt = ( char * ) darrayPut ( gapSeqArray, ctg1->gapSeqOffset ); for ( j = 0; j < leftRemain; j++ ) //get the left side of the gap region from search { writeChar2tightString ( seqGap[j], pt, j ); fprintf ( stderr, "%c", int2base ( seqGap[j] ) ); } for ( j = START[maxIndex] + 1; j <= END[maxIndex]; j++ ) { if ( BAL[maxIndex] ) { writeChar2tightString ( bal_seq[j], pt, j - START[maxIndex] - 1 + leftRemain ); fprintf ( stderr, "%c", int2base ( bal_seq[j] ) ); } else { writeChar2tightString ( src_seq[j], pt, j - START[maxIndex] - 1 + leftRemain ); fprintf ( stderr, "%c", int2base ( src_seq[j] ) ); } } for ( j = offset2 - rightRemain; j < offset2; j++ ) //get the right side of the gap region from search { writeChar2tightString ( seqGap[j + leftRemain], pt, j + END[maxIndex] - START[maxIndex] + leftRemain ); fprintf ( stderr, "%c", int2base ( seqGap[j + leftRemain] ) ); } fprintf ( stderr, ": GAPSEQ (%d+%d)(%d+%d)(%d+%d)(%d+%d) B %d\n", offset1, offset2, cut1, cut2, len1 - INDEX1[maxIndex] - 1, INDEX2[maxIndex], START[maxIndex], END[maxIndex], BAL[maxIndex] ); ctg1->cutTail = len1 - INDEX1[maxIndex] - 1 - offset1 + cut1 > cut1 ? len1 - INDEX1[maxIndex] - 1 - offset1 + cut1 : cut1; ctg2->cutHead = overlap + INDEX2[maxIndex] - offset2 + cut2 > cut2 ? 
overlap + INDEX2[maxIndex] - offset2 + cut2 : cut2; ctg2->scaftig_start = 0; ret = 1; } } free ( ( void * ) START ); free ( ( void * ) END ); free ( ( void * ) INDEX1 ); free ( ( void * ) INDEX2 ); free ( ( void * ) SCORE ); free ( ( void * ) BAL ); free ( ( void * ) src_seq ); free ( ( void * ) bal_seq ); return ret; } SOAPdenovo-V1.05/src/31mer/main.c000644 000765 000024 00000017416 11530651532 016414 0ustar00Aquastaff000000 000000 /* * 31mer/main.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "global.h" extern int call_pregraph ( int argc, char ** argv ); extern int call_heavygraph ( int argc, char ** argv ); extern int call_map2contig ( int argc, char ** argv ); extern int call_scaffold ( int argc, char ** argv ); extern int call_align ( int argc, char ** argv ); static void display_usage(); static void display_all_usage(); static void pipeline ( int argc, char ** argv ); int main ( int argc, char ** argv ) { printf ( "\nVersion 1.05: released on July 29th, 2010\n\n" ); argc--; argv++; /* __uint128_t temp; ubyte8 long1=0x5ad6c7ef8a; ubyte8 long2=0x87a3c27a2b; temp = long1; temp <<= 64; temp |=long2; long2 = (ubyte8)temp; long1 = (ubyte8)(temp>>64); printf("%p,%p,%p\n",long1,long2,temp); */ if ( argc == 0 ) { display_usage(); return 0; } if ( strcmp ( "pregraph", argv[0] ) == 0 ) { call_pregraph ( argc, argv ); } else if ( strcmp ( "contig", argv[0] ) == 0 ) { call_heavygraph ( argc, argv ); } else if ( strcmp ( "map", argv[0] ) == 0 ) { call_align ( argc, argv ); } /* call_map2contig(argc,argv); */ else if ( strcmp ( "scaff", argv[0] ) == 0 ) { call_scaffold ( argc, argv ); } else if ( strcmp ( "all", argv[0] ) == 0 ) { pipeline ( argc, argv ); } else { display_usage(); } return 0; } static void display_usage() { printf ( "\nUsage: SOAPdenovo [option]\n" ); printf ( " pregraph construct kmer-graph\n" ); printf ( " contig eliminate errors and output contigs\n" ); printf ( " map map reads to contigs\n" ); printf ( " scaff scaffolding\n" ); printf ( " all do all of the above in turn\n" ); } static void pipeline ( int argc, char ** argv ) { char * options[16]; unsigned char getK, getRfile, getOfile, getD, getDD, getL, getR, getP, getF; char readfile[256], outfile[256]; char temp[128]; char * name; int kmer = 0, cutoff_len = 0, ncpu = 0, lowK = 0, lowC = 0; char kmer_s[16], len_s[16], ncpu_s[16], M_s[16], lowK_s[16], lowC_s[16]; int i, copt, index, M = 1; extern char * optarg; time_t 
start_t, stop_t; time ( &start_t ); getK = getRfile = getOfile = getD = getDD = getL = getR = getP = getF = 0; /* 'F' must appear in the option string, otherwise case 'F' below is unreachable */ while ( ( copt = getopt ( argc, argv, "a:s:o:K:M:L:p:G:d:D:RuF" ) ) != EOF ) { switch ( copt ) { case 's': getRfile = 1; sscanf ( optarg, "%255s", readfile ); continue; case 'o': getOfile = 1; sscanf ( optarg, "%255s", outfile ); continue; case 'K': getK = 1; sscanf ( optarg, "%127s", temp ); kmer = atoi ( temp ); continue; case 'G': sscanf ( optarg, "%127s", temp ); GLDiff = atoi ( temp ); continue; case 'M': sscanf ( optarg, "%127s", temp ); M = atoi ( temp ); continue; case 'p': getP = 1; sscanf ( optarg, "%127s", temp ); ncpu = atoi ( temp ); continue; case 'L': getL = 1; sscanf ( optarg, "%127s", temp ); cutoff_len = atoi ( temp ); continue; case 'R': getR = 1; continue; case 'u': maskRep = 0; continue; case 'd': getD = 1; sscanf ( optarg, "%127s", temp ); lowK = atoi ( temp ); continue; case 'D': getDD = 1; sscanf ( optarg, "%127s", temp ); lowC = atoi ( temp ); continue; case 'a': initKmerSetSize = atoi ( optarg ); break; case 'F': getF = 1; break; default: if ( getRfile == 0 || getOfile == 0 ) { display_all_usage(); exit ( -1 ); } } } if ( getRfile == 0 || getOfile == 0 ) { display_all_usage(); exit ( -1 ); } if ( thrd_num < 1 ) { thrd_num = 1; } /* getK = getRfile = getOfile = getD = getL = getR = 0; */ name = "pregraph"; index = 0; options[index++] = name; options[index++] = "-s"; options[index++] = readfile; if ( getK ) { options[index++] = "-K"; sprintf ( kmer_s, "%d", kmer ); options[index++] = kmer_s; } if ( getP ) { options[index++] = "-p"; sprintf ( ncpu_s, "%d", ncpu ); options[index++] = ncpu_s; } if ( getD ) { options[index++] = "-d"; sprintf ( lowK_s, "%d", lowK ); options[index++] = lowK_s; } if ( getR ) { options[index++] = "-R"; } options[index++] = "-o"; options[index++] = outfile; for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_pregraph ( index, options ); name = "contig"; index = 0; options[index++] 
= name; options[index++] = "-g"; options[index++] = outfile; options[index++] = "-M"; sprintf ( M_s, "%d", M ); options[index++] = M_s; if ( getR ) { options[index++] = "-R"; } if ( getDD ) { options[index++] = "-D"; sprintf ( lowC_s, "%d", lowC ); options[index++] = lowC_s; } for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_heavygraph ( index, options ); name = "map"; index = 0; options[index++] = name; options[index++] = "-s"; options[index++] = readfile; options[index++] = "-g"; options[index++] = outfile; if ( getP ) { options[index++] = "-p"; sprintf ( ncpu_s, "%d", ncpu ); options[index++] = ncpu_s; } for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_align ( index, options ); name = "scaff"; index = 0; options[index++] = name; options[index++] = "-g"; options[index++] = outfile; if ( getF ) { options[index++] = "-F"; } if ( getP ) { options[index++] = "-p"; sprintf ( ncpu_s, "%d", ncpu ); options[index++] = ncpu_s; } if ( getL ) { options[index++] = "-L"; sprintf ( len_s, "%d", cutoff_len ); options[index++] = len_s; } for ( i = 0; i < index; i++ ) { printf ( "%s ", options[i] ); } printf ( "\n" ); call_scaffold ( index, options ); time ( &stop_t ); printf ( "time for the whole pipeline: %dm\n", ( int ) ( stop_t - start_t ) / 60 ); } static void display_all_usage() { printf ( "\nSOAPdenovo all -s configFile [-a initMemoryAssumption -K kmer -d KmerFreqCutOff -D EdgeCovCutoff -M mergeLevel -R -u -G gapLenDiff -L minContigLen -p n_cpu] -o Output\n" ); printf ( " -s ShortSeqFile: The input file name of solexa reads\n" ); printf ( " -a initMemoryAssumption: Initiate the memory assumption to avoid further reallocation\n" ); printf ( " -K kmer(default 23): k value in kmer\n" ); printf ( " -p n_cpu(default 8): number of cpu for use\n" ); printf ( " -F (optional) fill gaps in scaffold\n" ); printf ( " -M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during 
contiging\n" ); printf ( " -d KmerFreqCutoff(optional): delete kmers with frequency no larger than (default 0)\n" ); printf ( " -D EdgeCovCutoff(optional): delete edges with coverage no larger than (default 1)\n" ); printf ( " -R (optional): resolve repeats by reads (default no)\n" ); printf ( " -G gapLenDiff(default 50): allowed length difference between estimated and filled gap\n" ); printf ( " -L minLen(default K+2): shortest contig for scaffolding\n" ); printf ( " -u (optional): un-mask contigs with high coverage before scaffolding (default mask)\n" ); printf ( " -o Output: prefix of output file name\n" ); } SOAPdenovo-V1.05/src/31mer/Makefile000644 000765 000024 00000003561 11534064043 016757 0ustar00Aquastaff000000 000000 CC= gcc CFLAGS= -O3 -fomit-frame-pointer DFLAGS= OBJS= arc.o attachPEinfo.o bubble.o check.o compactEdge.o \ concatenateEdge.o connect.o contig.o cutTipPreGraph.o cutTip_graph.o \ darray.o dfib.o dfibHeap.o fib.o fibHeap.o \ hashFunction.o kmer.o lib.o loadGraph.o loadPath.o \ loadPreGraph.o localAsm.o main.o map.o mem_manager.o \ newhash.o node2edge.o orderContig.o output_contig.o output_pregraph.o \ output_scaffold.o pregraph.o prlHashCtg.o prlHashReads.o prlRead2Ctg.o \ prlRead2path.o prlReadFillGap.o read2scaf.o readInterval.o stack.o\ readseq1by1.o scaffold.o searchPath.o seq.o splitReps.o PROG= SOAPdenovo-31mer INCLUDES= -Iinc SUBDIRS= . LIBPATH= LIBS= -pthread -lm EXTRA_FLAGS= BIT_ERR = 0 ifeq (,$(findstring $(shell uname -m), x86_64 ppc64 ia64)) BIT_ERR = 1 endif LINUX = 0 ifneq (,$(findstring Linux,$(shell uname))) LINUX = 1 EXTRA_FLAGS += -Wl,--hash-style=both endif ifneq (,$(findstring $(shell uname -m), x86_64)) CFLAGS += -m64 endif ifneq (,$(findstring $(shell uname -m), ia64)) CFLAGS += endif ifneq (,$(findstring $(shell uname -m), ppc64)) CFLAGS += -m64 -mpowerpc64 endif .SUFFIXES:.c .o .c.o: @printf "Compiling $<... 
\r"; \ $(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $< || echo "Error in command: $(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $<" all: SOAPdenovo .PHONY:all clean install envTest: @test $(BIT_ERR) != 1 || sh -c 'echo "Fatal: 64bit CPU and Operating System required!";false;' SOAPdenovo: envTest $(OBJS) @$(CC) $(CFLAGS) -o $(PROG) $(OBJS) $(LIBPATH) $(LIBS) $(EXTRA_FLAGS) @printf "Linking...\r" @printf "$(PROG) compilation done.\n"; clean: @rm -fr gmon.out *.o a.out *.exe *.dSYM $(PROG) *~ *.a *.so.* *.so *.dylib @printf "$(PROG) cleaning done.\n"; install: @cp $(PROG) ../../bin/ @printf "$(PROG) installed at ../../bin/$(PROG)\n" SOAPdenovo-V1.05/src/31mer/map.c000644 000765 000024 00000007777 11530651532 016256 0ustar00Aquastaff000000 000000 /* * 31mer/map.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static char shortrdsfile[256]; static char graphfile[256]; static void display_map_usage(); static int getMinOverlap ( char * gfile ) { char name[256], ch; FILE * fp; int num_kmer, overlaplen = 23; char line[1024]; sprintf ( name, "%s.preGraphBasic", gfile ); fp = fopen ( name, "r" ); if ( !fp ) { return overlaplen; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'V' ) { sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen ); } else if ( line[0] == 'M' ) { sscanf ( line, "MaxReadLen %d MinReadLen %d MaxNameLen %d", &maxReadLen, &minReadLen, &maxNameLen ); } } fclose ( fp ); return overlaplen; } int call_align ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); initenv ( argc, argv ); overlaplen = getMinOverlap ( graphfile ); printf ( "K = %d\n", overlaplen ); time ( &time_bef ); ctg_short = overlaplen + 2; printf ( "contig len cutoff: %d\n", ctg_short ); prlContig2nodes ( graphfile, ctg_short ); time ( &time_aft ); printf ( "time spent on De bruijn graph construction: %ds\n\n", ( int ) ( time_aft - time_bef ) ); //map read to edge one by one time ( &time_bef ); prlLongRead2Ctg ( shortrdsfile, graphfile ); time ( &time_aft ); printf ( "time spent on mapping long reads: %ds\n\n", ( int ) ( time_aft - time_bef ) ); time ( &time_bef ); prlRead2Ctg ( shortrdsfile, graphfile ); time ( &time_aft ); printf ( "time spent on mapping reads: %ds\n\n", ( int ) ( time_aft - time_bef ) ); free_Sets ( KmerSets, thrd_num ); time ( &stop_t ); printf ( "overall time for alignment: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 ); return 0; } /***************************************************************************** * Parse command line switches *****************************************************************************/ void initenv ( int argc, char ** argv ) { int copt; int 
inpseq, outseq; extern char * optarg; char temp[100]; optind = 1; inpseq = outseq = 0; while ( ( copt = getopt ( argc, argv, "s:g:K:p:" ) ) != EOF ) { /* printf("get option\n"); */ switch ( copt ) { case 's': inpseq = 1; sscanf ( optarg, "%255s", shortrdsfile ); continue; case 'g': outseq = 1; sscanf ( optarg, "%255s", graphfile ); continue; case 'K': sscanf ( optarg, "%99s", temp ); overlaplen = atoi ( temp ); continue; case 'p': sscanf ( optarg, "%99s", temp ); thrd_num = atoi ( temp ); continue; default: if ( inpseq == 0 || outseq == 0 ) { display_map_usage(); exit ( 1 ); } } } if ( inpseq == 0 || outseq == 0 ) { /* printf("need more\n"); */ display_map_usage(); exit ( 1 ); } } static void display_map_usage() { printf ( "\nmap -s readsInfoFile -g graphfile [-p n_cpu]\n" ); printf ( " -s readsInfoFile: file containing information about the Solexa reads\n" ); printf ( " -p n_cpu(default 8): number of cpu for use\n" ); printf ( " -g graphfile: prefix of graph files\n" ); } SOAPdenovo-V1.05/src/31mer/mem_manager.c000644 000765 000024 00000006000 11530651532 017723 0ustar00Aquastaff000000 000000 /* * 31mer/mem_manager.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" MEM_MANAGER * createMem_manager ( int num_items, size_t unit_size ) { MEM_MANAGER * mem_Manager = ( MEM_MANAGER * ) ckalloc ( 1 * sizeof ( MEM_MANAGER ) ); mem_Manager->block_list = NULL; mem_Manager->items_per_block = num_items; mem_Manager->item_size = unit_size; mem_Manager->recycle_list = NULL; mem_Manager->counter = 0; mem_Manager->index_in_block = 0; /* set explicitly rather than relying on the allocator */ return mem_Manager; } void freeMem_manager ( MEM_MANAGER * mem_Manager ) { BLOCK_START * ite_block, *temp_block; if ( !mem_Manager ) { return; } ite_block = mem_Manager->block_list; while ( ite_block ) { temp_block = ite_block; ite_block = ite_block->next; free ( ( void * ) temp_block ); } free ( ( void * ) mem_Manager ); } void * getItem ( MEM_MANAGER * mem_Manager ) { RECYCLE_MARK * mark; /* this is the type of return value */ BLOCK_START * block; if ( !mem_Manager ) { return NULL; } if ( mem_Manager->recycle_list ) { mark = mem_Manager->recycle_list; mem_Manager->recycle_list = mark->next; return mark; } mem_Manager->counter++; if ( !mem_Manager->block_list || mem_Manager->index_in_block == mem_Manager->items_per_block ) { /* pthread_mutex_lock(&gmutex); */ block = ckalloc ( sizeof ( BLOCK_START ) + mem_Manager->items_per_block * mem_Manager->item_size ); /* mem_Manager->counter += sizeof(BLOCK_START)+mem_Manager->items_per_block*mem_Manager->item_size; */ /* pthread_mutex_unlock(&gmutex); */ block->next = mem_Manager->block_list; mem_Manager->block_list = block; mem_Manager->index_in_block = 1; /* byte offsets go through char *; arithmetic on void * is a GNU extension */ return ( RECYCLE_MARK * ) ( ( char * ) block + sizeof ( BLOCK_START ) ); } block = mem_Manager->block_list; return ( RECYCLE_MARK * ) ( ( char * ) block + sizeof ( BLOCK_START ) + mem_Manager->item_size * ( mem_Manager->index_in_block++ ) ); } void returnItem ( MEM_MANAGER * mem_Manager, void * item ) { RECYCLE_MARK * mark; mark = item; mark->next = mem_Manager->recycle_list; mem_Manager->recycle_list = mark; } /* void test_mem_manager() { MEM_MANAGER *test_manager; NODE 
*temp_node; test_manager = createMem_manager(NODEBLOCKSIZE,sizeof(NODE)); temp_node = (NODE *)getItem(test_manager); returnItem(test_manager,temp_node); freeMem_manager(test_manager); } */ SOAPdenovo-V1.05/src/31mer/newhash.c000644 000765 000024 00000016535 11530651532 017126 0ustar00Aquastaff000000 000000 /* * 31mer/newhash.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define PUBLIC_FUNC #define PROTECTED_FUNC static const kmer_t empty_kmer = {0, 0, 0, 0, 0, 0, 1, 0, 0}; static inline void update_kmer ( kmer_t * mer, ubyte left, ubyte right ) { ubyte4 cov; if ( left < 4 ) { cov = get_kmer_left_cov ( *mer, left ); if ( cov < MAX_KMER_COV ) { set_kmer_left_cov ( *mer, left, cov + 1 ); } } if ( right < 4 ) { cov = get_kmer_right_cov ( *mer, right ); if ( cov < MAX_KMER_COV ) { set_kmer_right_cov ( *mer, right, cov + 1 ); } } } static inline void set_new_kmer ( kmer_t * mer, ubyte8 seq, ubyte left, ubyte right ) { *mer = empty_kmer; set_kmer_seq ( *mer, seq ); if ( left < 4 ) { set_kmer_left_cov ( *mer, left, 1 ); } if ( right < 4 ) { set_kmer_right_cov ( *mer, right, 1 ); } } static inline int is_prime_kh ( ubyte8 num ) { ubyte8 i, max; if ( num < 4 ) { return 1; } if ( num % 2 == 0 ) { return 0; } max = ( ubyte8 ) sqrt ( ( double ) num ); /* i <= max, so that odd perfect squares such as 9 or 25 are rejected */ for ( i = 3; i <= max; i += 2 ) { if ( num % i == 0 ) { return 0; } } return 1; } static inline ubyte8 find_next_prime_kh ( ubyte8 num ) { if ( num % 2 == 0 ) { num ++; } while ( 1 ) { if ( is_prime_kh ( num ) ) { return num; } num += 2; } } PUBLIC_FUNC KmerSet * init_kmerset ( ubyte8 init_size, float load_factor ) { KmerSet * set; if ( init_size < 3 ) { init_size = 3; } else { init_size = find_next_prime_kh ( init_size ); } set = ( KmerSet * ) malloc ( sizeof ( KmerSet ) ); set->size = init_size; set->count = 0; if ( load_factor <= 0 ) { load_factor = 0.25f; } else if ( load_factor >= 1 ) { load_factor = 0.75f; } set->load_factor = load_factor; /* compute the resize threshold from the clamped load factor */ set->max = set->size * load_factor; set->iter_ptr = 0; set->array = calloc ( set->size, sizeof ( kmer_t ) ); set->flags = malloc ( ( set->size + 15 ) / 16 * 4 ); memset ( set->flags, 0x55, ( set->size + 15 ) / 16 * 4 ); return set; } PROTECTED_FUNC static inline ubyte8 get_kmerset ( KmerSet * set, ubyte8 seq ) { ubyte8 hc; hc = seq % set->size; while ( 1 ) { if ( 
is_kmer_entity_null ( set->flags, hc ) ) { return hc; } else { if ( get_kmer_seq ( set->array[hc] ) == seq ) { return hc; } } hc ++; if ( hc == set->size ) { hc = 0; } } return set->size; } PUBLIC_FUNC int search_kmerset ( KmerSet * set, ubyte8 seq, kmer_t ** rs ) { ubyte8 hc; hc = seq % set->size; while ( 1 ) { if ( is_kmer_entity_null ( set->flags, hc ) ) { return 0; } else { if ( get_kmer_seq ( set->array[hc] ) == seq ) { *rs = set->array + hc; return 1; } } hc ++; if ( hc == set->size ) { hc = 0; } } return 0; } PUBLIC_FUNC static inline int exists_kmerset ( KmerSet * set, ubyte8 seq ) { ubyte8 idx; idx = get_kmerset ( set, seq ); return !is_kmer_entity_null ( set->flags, idx ); } PROTECTED_FUNC static inline void encap_kmerset ( KmerSet * set, ubyte8 num ) { ubyte4 * flags, *f; ubyte8 i, n, size, hc; kmer_t key, tmp; if ( set->count + num <= set->max ) { return; } if ( initKmerSetSize != 0 ) { if ( set->load_factor < 0.88 ) { set->load_factor = 0.88; set->max = set->size * set->load_factor; return; } else { fprintf ( stderr, "-- Static memory pool exploded, please define a larger value. 
--\nloadFactor\t%f\nsize\t%llu\ncnt\t%llu\n", set->load_factor, set->size, set->count ); abort(); } } n = set->size; do { if ( n < 0xFFFFFFFU ) { n <<= 1; } else { n += 0xFFFFFFU; } n = find_next_prime_kh ( n ); } while ( n * set->load_factor < set->count + num ); set->array = realloc ( set->array, n * sizeof ( kmer_t ) ); if ( set->array == NULL ) { fprintf ( stderr, "-- Out of memory --\n" ); abort(); } flags = malloc ( ( n + 15 ) / 16 * 4 ); memset ( flags, 0x55, ( n + 15 ) / 16 * 4 ); size = set->size; set->size = n; set->max = n * set->load_factor; f = set->flags; set->flags = flags; flags = f; for ( i = 0; i < size; i++ ) { if ( !exists_kmer_entity ( flags, i ) ) { continue; } key = set->array[i]; set_kmer_entity_del ( flags, i ); while ( 1 ) { hc = get_kmer_seq ( key ) % set->size; while ( !is_kmer_entity_null ( set->flags, hc ) ) { hc ++; if ( hc == set->size ) { hc = 0; } } clear_kmer_entity_null ( set->flags, hc ); if ( hc < size && exists_kmer_entity ( flags, hc ) ) { tmp = key; key = set->array[hc]; set->array[hc] = tmp; set_kmer_entity_del ( flags, hc ); } else { set->array[hc] = key; break; } } } free ( flags ); } PUBLIC_FUNC int put_kmerset ( KmerSet * set, ubyte8 seq, ubyte left, ubyte right, kmer_t ** kmer_p ) { ubyte8 hc; encap_kmerset ( set, 1 ); hc = seq % set->size; do { if ( is_kmer_entity_null ( set->flags, hc ) ) { clear_kmer_entity_null ( set->flags, hc ); set_new_kmer ( set->array + hc, seq, left, right ); set->count ++; *kmer_p = set->array + hc; return 0; } else { if ( get_kmer_seq ( set->array[hc] ) == seq ) { update_kmer ( set->array + hc, left, right ); set->array[hc].single = 0; *kmer_p = set->array + hc; return 1; } } hc ++; if ( hc == set->size ) { hc = 0; } } while ( 1 ); *kmer_p = NULL; return 0; } PUBLIC_FUNC byte8 count_kmerset ( KmerSet * set ) { return set->count; } PUBLIC_FUNC static inline void reset_iter_kmerset ( KmerSet * set ) { set->iter_ptr = 0; } PUBLIC_FUNC static inline ubyte8 iter_kmerset ( KmerSet * set, kmer_t 
** rs ) { while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { *rs = set->array + set->iter_ptr; set->iter_ptr ++; return 1; } set->iter_ptr ++; } return 0; } PUBLIC_FUNC void free_kmerset ( KmerSet * set ) { free ( set->array ); free ( set->flags ); free ( set ); } PUBLIC_FUNC void free_Sets ( KmerSet ** sets, int num ) { int i; for ( i = 0; i < num; i++ ) { free_kmerset ( sets[i] ); } free ( ( void * ) sets ); } int count_branch2prev ( kmer_t * node ) { int num = 0, i; for ( i = 0; i < 4; i++ ) { if ( get_kmer_left_cov ( *node, i ) > 0 ) { num++; } } return num; } int count_branch2next ( kmer_t * node ) { int num = 0, i; for ( i = 0; i < 4; i++ ) { if ( get_kmer_right_cov ( *node, i ) > 0 ) { num++; } } return num; } void dislink2prevUncertain ( kmer_t * node, char ch, boolean smaller ) { if ( smaller ) { set_kmer_left_cov ( *node, ch, 0 ); } else { set_kmer_right_cov ( *node, int_comp ( ch ), 0 ); } } void dislink2nextUncertain ( kmer_t * node, char ch, boolean smaller ) { if ( smaller ) { set_kmer_right_cov ( *node, ch, 0 ); } else { set_kmer_left_cov ( *node, int_comp ( ch ), 0 ); } }
SOAPdenovo-V1.05/src/31mer/node2edge.c
/* * 31mer/node2edge.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "stack.h" #define KMERPTBLOCKSIZE 1000 //static int loopCounter; static int nodeCounter; static int edge_length_limit = 100000; static int edge_c, edgeCounter; static preEDGE temp_edge; static char edge_seq[100000]; static void make_edge ( FILE * fp ); static void merge_linearV2 ( char bal_edge, STACK * nStack, int count, FILE * fp ); static int check_iden_kmerList ( STACK * stack1, STACK * stack2 ); //for stack static STACK * nodeStack; static STACK * bal_nodeStack; void kmer2edges ( char * outfile ) { FILE * fp; char temp[256]; sprintf ( temp, "%s.edge", outfile ); fp = ckopen ( temp, "w" ); make_edge ( fp ); fclose ( fp ); num_ed = edge_c; } static void stringBeads ( KMER_PT * firstBead, char nextch, int * node_c ) { boolean smaller, found; Kmer tempKmer, bal_word; Kmer word = firstBead->kmer; Kmer hash_ban; kmer_t * outgoing_node; int nodeCounter = 1, setPicker; char ch; unsigned char flag; KMER_PT * temp_pt, *prev_pt = firstBead; word = prev_pt->kmer; nodeCounter = 1; word = nextKmer ( word, nextch ); bal_word = reverseComplement ( word, overlaplen ); if ( word > bal_word ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); while ( found && ( outgoing_node->linear ) ) // for every node in this line { nodeCounter++; temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = outgoing_node; temp_pt->isSmaller = smaller; if ( smaller ) { temp_pt->kmer = word; } else { temp_pt->kmer = bal_word; } prev_pt = temp_pt; if ( smaller ) { for ( ch = 0; ch < 4; ch++ ) { flag = get_kmer_right_cov ( *outgoing_node, ch ); if ( flag ) { break; } } word = nextKmer ( prev_pt->kmer, ch ); bal_word = reverseComplement ( word, overlaplen ); if ( word > bal_word ) { tempKmer = bal_word; bal_word = word; word 
= tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); } else { for ( ch = 0; ch < 4; ch++ ) { flag = get_kmer_left_cov ( *outgoing_node, ch ); if ( flag ) { break; } } word = nextKmer ( prev_pt->kmer, int_comp ( ch ) ); bal_word = reverseComplement ( word, overlaplen ); if ( word > bal_word ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); } } if ( outgoing_node ) //this is always true { nodeCounter++; temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = outgoing_node; temp_pt->isSmaller = smaller; if ( smaller ) { temp_pt->kmer = word; } else { temp_pt->kmer = bal_word; } } *node_c = nodeCounter; } //search linear structure starting with the root of a tree static int startEdgeFromNode ( kmer_t * node1, FILE * fp ) { int node_c, palindrome; unsigned char flag; KMER_PT * ite_pt, *temp_pt; Kmer word1, bal_word1; char ch1; if ( node1->linear || node1->deleted ) { return 0; } // ignore floating loop word1 = node1->seq; bal_word1 = reverseComplement ( word1, overlaplen ); // linear structure for ( ch1 = 0; ch1 < 4; ch1++ ) // for every node on outgoing list { flag = get_kmer_right_cov ( *node1, ch1 ); if ( !flag ) { continue; } emptyStack ( nodeStack ); temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = node1; temp_pt->isSmaller = 1; temp_pt->kmer = word1; stringBeads ( temp_pt, ch1, &node_c ); if ( node_c < 2 ) { printf ( "%d nodes in this line!!!!!!!!!!!\n", node_c ); } else { //make a reverse complement node list stackBackup ( nodeStack ); emptyStack ( bal_nodeStack ); while ( ( ite_pt = ( KMER_PT * ) stackPop ( nodeStack ) ) != NULL ) { temp_pt = ( KMER_PT * ) stackPush ( bal_nodeStack ); temp_pt->kmer = reverseComplement ( ite_pt->kmer, 
overlaplen ); } stackRecover ( nodeStack ); palindrome = check_iden_kmerList ( nodeStack, bal_nodeStack ); stackRecover ( nodeStack ); if ( palindrome ) { merge_linearV2 ( 0, nodeStack, node_c, fp ); } else { merge_linearV2 ( 1, nodeStack, node_c, fp ); } } } //every possible outgoing edges for ( ch1 = 0; ch1 < 4; ch1++ ) // for every node on incoming list { flag = get_kmer_left_cov ( *node1, ch1 ); if ( !flag ) { continue; } emptyStack ( nodeStack ); temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = node1; temp_pt->isSmaller = 0; temp_pt->kmer = bal_word1; stringBeads ( temp_pt, int_comp ( ch1 ), &node_c ); if ( node_c < 2 ) { printf ( "%d nodes in this line!!!!!!!!!!!\n", node_c ); } else { //make a reverse complement node list stackBackup ( nodeStack ); emptyStack ( bal_nodeStack ); while ( ( ite_pt = ( KMER_PT * ) stackPop ( nodeStack ) ) != NULL ) { temp_pt = ( KMER_PT * ) stackPush ( bal_nodeStack ); temp_pt->kmer = reverseComplement ( ite_pt->kmer, overlaplen ); } stackRecover ( nodeStack ); palindrome = check_iden_kmerList ( nodeStack, bal_nodeStack ); stackRecover ( nodeStack ); if ( palindrome ) { merge_linearV2 ( 0, nodeStack, node_c, fp ); //printf("edge is palindrome with length %d\n",temp_edge.length); } else { merge_linearV2 ( 1, nodeStack, node_c, fp ); } } } //every possible incoming edges return 0; } void make_edge ( FILE * fp ) { int i = 0; kmer_t * node1; KmerSet * set; KmerSetsPatch = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); for ( i = 0; i < thrd_num; i++ ) { KmerSetsPatch[i] = init_kmerset ( 1000, K_LOAD_FACTOR ); } nodeStack = ( STACK * ) createStack ( KMERPTBLOCKSIZE, sizeof ( KMER_PT ) ); bal_nodeStack = ( STACK * ) createStack ( KMERPTBLOCKSIZE, sizeof ( KMER_PT ) ); edge_c = nodeCounter = 0; edgeCounter = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { node1 = set->array + 
set->iter_ptr; startEdgeFromNode ( node1, fp ); } set->iter_ptr ++; } } printf ( "%d (%d) edges %d extra nodes\n", edge_c, edgeCounter, nodeCounter ); freeStack ( nodeStack ); freeStack ( bal_nodeStack ); } static void merge_linearV2 ( char bal_edge, STACK * nStack, int count, FILE * fp ) { int length, char_index; preEDGE * newedge; kmer_t * del_node, *longNode; char * tightSeq, firstCh; long long symbol = 0; int len_tSeq; Kmer wordplus, bal_wordplus, hash_ban; KMER_PT * last_np = ( KMER_PT * ) stackPop ( nStack ); KMER_PT * second_last_np = ( KMER_PT * ) stackPop ( nStack ); KMER_PT * first_np, *second_np = NULL; KMER_PT * temp; boolean found; int setPicker; length = count - 1; len_tSeq = length; if ( len_tSeq >= edge_length_limit ) { tightSeq = ( char * ) ckalloc ( len_tSeq * sizeof ( char ) ); } else { tightSeq = edge_seq; } char_index = length - 1; newedge = &temp_edge; newedge->to_node = last_np->kmer; newedge->length = length; newedge->bal_edge = bal_edge; tightSeq[char_index--] = last_np->kmer & 3; firstCh = firstCharInKmer ( second_last_np->kmer ); dislink2prevUncertain ( last_np->node, firstCh, last_np->isSmaller ); stackRecover ( nStack ); while ( nStack->item_c > 1 ) { second_np = ( KMER_PT * ) stackPop ( nStack ); } first_np = ( KMER_PT * ) stackPop ( nStack ); //unlink first node to the second one dislink2nextUncertain ( first_np->node, second_np->kmer & 3, first_np->isSmaller ); //printf("from %llx, to %llx\n",first_np->node->seq,last_np->node->seq); //now temp is the last node in line, out_node is the second last node in line newedge->from_node = first_np->kmer; //create a long kmer for edge with length 1 if ( length == 1 ) { nodeCounter++; wordplus = KmerPlus ( newedge->from_node, newedge->to_node & 3 ); bal_wordplus = reverseComplementVerbose ( wordplus, overlaplen + 1 ); edge_c++; edgeCounter++; if ( wordplus <= bal_wordplus ) { hash_ban = hash_kmer ( wordplus ); setPicker = hash_ban % thrd_num; found = put_kmerset ( KmerSetsPatch[setPicker], 
wordplus, 4, 4, &longNode ); if ( found ) { printf ( "longNode %llx already exist\n", wordplus ); } longNode->l_links = edge_c; longNode->twin = ( unsigned char ) ( bal_edge + 1 ); } else { hash_ban = hash_kmer ( bal_wordplus ); setPicker = hash_ban % thrd_num; found = put_kmerset ( KmerSetsPatch[setPicker], bal_wordplus, 4, 4, &longNode ); if ( found ) { printf ( "longNode %llx already exist\n", bal_wordplus ); } longNode->l_links = edge_c + bal_edge; longNode->twin = ( unsigned char ) ( -bal_edge + 1 ); } } else { edge_c++; edgeCounter++; } stackRecover ( nStack ); //mark all the internal nodes temp = ( KMER_PT * ) stackPop ( nStack ); while ( nStack->item_c > 1 ) { temp = ( KMER_PT * ) stackPop ( nStack ); del_node = temp->node; del_node->inEdge = 1; symbol += get_kmer_left_covs ( *del_node ); if ( temp->isSmaller ) { del_node->l_links = edge_c; del_node->twin = ( unsigned char ) ( bal_edge + 1 ); } else { del_node->l_links = edge_c + bal_edge; del_node->twin = ( unsigned char ) ( -bal_edge + 1 ); } tightSeq[char_index--] = temp->kmer & 3; } newedge->seq = tightSeq; if ( length > 1 ) { newedge->cvg = symbol / ( length - 1 ) * 10 > MaxEdgeCov ? 
MaxEdgeCov : symbol / ( length - 1 ) * 10; } else { newedge->cvg = 0; } output_1edge ( newedge, fp ); if ( len_tSeq >= edge_length_limit ) { free ( ( void * ) tightSeq ); } edge_c += bal_edge; if ( edge_c % 10000000 == 0 ) { printf ( "--- %d edges built\n", edge_c ); } return; } static int check_iden_kmerList ( STACK * stack1, STACK * stack2 ) { KMER_PT * ite1, *ite2; if ( !stack1->item_c || !stack2->item_c ) // one of them is empty { return 0; } while ( ( ite1 = ( KMER_PT * ) stackPop ( stack1 ) ) != NULL && ( ite2 = ( KMER_PT * ) stackPop ( stack2 ) ) != NULL ) { if ( ite1->kmer != ite2->kmer ) { return 0; } } if ( stack1->item_c || stack2->item_c ) // one of them is not empty { return 0; } else { return 1; } }
SOAPdenovo-V1.05/src/31mer/orderContig.c
/* * 31mer/orderContig.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "dfibHeap.h" #include "fibHeap.h" #include "darray.h" #define CNBLOCKSIZE 10000 #define MAXC 10000 #define MAXCinBetween 200 #define MaxNodeInSub 10000 #define GapLowerBound -2000 #define GapUpperBound 300000 static boolean static_f = 0; static double OverlapPercent = 0.05; static double ConflPercent = 0.05; static int gapCounter; static int orienCounter; static int throughCounter; static DARRAY * solidArray; static DARRAY * tempArray; static int solidCounter; static CTGinHEAP ctg4heapArray[MaxNodeInSub + 1]; // index in this array are put to heaps, start from 1 static unsigned int nodesInSub[MaxNodeInSub]; static int nodeDistance[MaxNodeInSub]; static int nodeCounter; static unsigned int nodesInSubInOrder[MaxNodeInSub]; static int nodeDistanceInOrder[MaxNodeInSub]; static DARRAY * scaf3, *scaf5; static DARRAY * gap3, *gap5; static unsigned int downstreamCTG[MAXCinBetween]; static unsigned int upstreamCTG[MAXCinBetween]; static int dsCtgCounter; static int usCtgCounter; static CONNECT * checkConnect ( unsigned int from_c, unsigned int to_c ); static int maskPuzzle ( int num_connect, unsigned int contigLen ); static void freezing(); static boolean checkOverlapInBetween ( double tolerance ); static int setConnectDelete ( unsigned int from_c, unsigned int to_c, char flag, boolean cleanBinding ); static int setConnectWP ( unsigned int from_c, unsigned int to_c, char flag ); static void general_linearization ( boolean strict ); static void debugging2(); static void smallScaf(); static void detectBreakScaf(); static boolean checkSimple ( DARRAY * ctgArray, int count ); static void checkCircle(); //find the only connection involved in connection binding static CONNECT * getBindCnt ( unsigned int ctg ) { CONNECT * ite_cnt; CONNECT * bindCnt = NULL; CONNECT * temp_cnt = NULL; CONNECT * temp3_cnt = NULL; int count = 0; int count2 = 0; int count3 = 0; ite_cnt = 
contig_array[ctg].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->nextInScaf ) { count++; bindCnt = ite_cnt; } if ( ite_cnt->prevInScaf ) { temp_cnt = ite_cnt; count2++; } if ( ite_cnt->singleInScaf ) { temp3_cnt = ite_cnt; count3++; } ite_cnt = ite_cnt->next; } if ( count == 1 ) { return bindCnt; } if ( count == 0 && count2 == 1 ) { return temp_cnt; } if ( count == 0 && count2 == 0 && count3 == 1 ) { return temp3_cnt; } return NULL; } static void createAnalogousCnt ( unsigned int sourceStart, CONNECT * originCnt, int gap, unsigned int targetStart, unsigned int targetStop ) { CONNECT * temp_cnt; unsigned int balTargetStart = getTwinCtg ( targetStart ); unsigned int balTargetStop = getTwinCtg ( targetStop ); unsigned int balSourceStart = getTwinCtg ( sourceStart ); unsigned int balSourceStop = getTwinCtg ( originCnt->contigID ); originCnt->deleted = 1; temp_cnt = getCntBetween ( balSourceStop, balSourceStart ); temp_cnt->deleted = 1; if ( gap < GapLowerBound ) { gapCounter++; return; } temp_cnt = add1Connect ( targetStart, targetStop, gap, originCnt->weight, 1 ); if ( temp_cnt ) { temp_cnt->inherit = 1; } temp_cnt = add1Connect ( balTargetStop, balTargetStart, gap, originCnt->weight, 1 ); if ( temp_cnt ) { temp_cnt->inherit = 1; } } // increase #long_pe_support for a connect by 1 static void add1LongPEcov ( unsigned int fromCtg, unsigned int toCtg, int weight ) { //check if they are on the same scaff if ( contig_array[fromCtg].from_vt != contig_array[toCtg].from_vt || contig_array[fromCtg].to_vt != contig_array[toCtg].to_vt ) { printf ( "Warning from add1LongPEcov: contig %d and %d not on the same scaffold\n", fromCtg, toCtg ); return; } if ( contig_array[fromCtg].indexInScaf >= contig_array[toCtg].indexInScaf ) { printf ( "Warning from add1LongPEcov: wrong about order between contig %d and %d\n", fromCtg, toCtg ); return; } CONNECT * bindCnt; unsigned int prevCtg = fromCtg; bindCnt = getBindCnt ( fromCtg ); while ( bindCnt ) { if ( bindCnt->maxGap + weight <= 
1000 ) { bindCnt->maxGap += weight; } else { bindCnt->maxGap = 1000; } if ( fromCtg == 0 && toCtg == 0 ) printf ( "link (%d %d ) covered by link (%d %d), wt %d\n", prevCtg, bindCnt->contigID, fromCtg, toCtg, weight ); if ( bindCnt->contigID == toCtg ) { break; } prevCtg = bindCnt->contigID; bindCnt = bindCnt->nextInScaf; } unsigned int bal_fc = getTwinCtg ( fromCtg ); unsigned int bal_tc = getTwinCtg ( toCtg ); bindCnt = getBindCnt ( bal_tc ); prevCtg = bal_tc; while ( bindCnt ) { if ( bindCnt->maxGap + weight <= 1000 ) { bindCnt->maxGap += weight; } else { bindCnt->maxGap = 1000; } if ( fromCtg == 0 && toCtg == 0 ) printf ( "link (%d %d ) covered by link (%d %d), wt %d\n", prevCtg, bindCnt->contigID, fromCtg, toCtg, weight ); if ( bindCnt->contigID == bal_fc ) { return; } prevCtg = bindCnt->contigID; bindCnt = bindCnt->nextInScaf; } printf ( "Warning from add1LongPEcov: not reach the end (%d %d) (B)\n", bal_tc, bal_fc ); } // for long pair ends, move the connections along scaffolds established by shorter pair ends till reach the ends static void downSlide() { int len = 0, gap; unsigned int i; CONNECT * ite_cnt, *bindCnt, *temp_cnt; unsigned int bottomCtg, topCtg, bal_i; unsigned int targetCtg, bal_target; boolean getThrough, orienConflict; int slideLen, slideLen2; orienCounter = throughCounter = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } bal_i = getTwinCtg ( i ); len = slideLen = 0; bottomCtg = i; //find the last unmasked contig in this binding while ( bindCnt->nextInScaf ) { len += bindCnt->gapLen + contig_array[bindCnt->contigID].length; if ( contig_array[bindCnt->contigID].mask == 0 ) { bottomCtg = bindCnt->contigID; slideLen = len; } bindCnt = bindCnt->nextInScaf; } len += bindCnt->gapLen + contig_array[bindCnt->contigID].length; if ( contig_array[bindCnt->contigID].mask == 0 || bottomCtg == 0 ) { bottomCtg = bindCnt->contigID; 
slideLen = len; } //check each connection from long pair ends ite_cnt = contig_array[i].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask || ite_cnt->singleInScaf || ite_cnt->nextInScaf || ite_cnt->prevInScaf || ite_cnt->inherit ) { ite_cnt = ite_cnt->next; continue; } targetCtg = ite_cnt->contigID; /* set bal_target here so the warning below never prints a stale or uninitialized value */ bal_target = getTwinCtg ( targetCtg ); if ( contig_array[i].from_vt == contig_array[targetCtg].from_vt ) // on the same scaff { if ( contig_array[i].indexInScaf > contig_array[targetCtg].indexInScaf ) { orienCounter++; } else { throughCounter++; } setConnectDelete ( i, ite_cnt->contigID, 1, 0 ); ite_cnt = ite_cnt->next; continue; } //check if this connection conflicts with previous scaffold orientationally temp_cnt = getBindCnt ( targetCtg ); orienConflict = 0; if ( temp_cnt ) { while ( temp_cnt->nextInScaf ) { if ( temp_cnt->contigID == i ) { orienConflict = 1; printf ( "Warning from downSlide: still on the same scaff: %d and %d\n" , i, targetCtg ); printf ( "on scaff %d and %d\n", contig_array[i].from_vt, contig_array[targetCtg].from_vt ); printf ( "on bal_scaff %d and %d\n", contig_array[bal_target].to_vt, contig_array[bal_i].to_vt ); break; } temp_cnt = temp_cnt->nextInScaf; } if ( temp_cnt->contigID == i ) { orienConflict = 1; } } if ( orienConflict ) { orienCounter++; setConnectDelete ( i, ite_cnt->contigID, 1, 0 ); ite_cnt = ite_cnt->next; continue; } //find the most top contig along previous scaffold starting with the target contig of this connection bal_target = getTwinCtg ( targetCtg ); slideLen2 = 0; if ( contig_array[targetCtg].mask == 0 ) { topCtg = bal_target; } else { topCtg = 0; } temp_cnt = getBindCnt ( bal_target ); getThrough = len = 0; if ( temp_cnt ) { //find the last contig in this binding while ( temp_cnt->nextInScaf ) { //check if this route reaches bal_i if ( temp_cnt->contigID == bal_i ) { printf ( "Warning from downSlide: (B) still on the same scaff: %d and %d (%d and %d)\n", i, targetCtg, bal_target, bal_i ); printf ( "on scaff %d and %d\n", 
contig_array[i].from_vt, contig_array[targetCtg].from_vt ); printf ( "on bal_scaff %d and %d\n", contig_array[bal_target].to_vt, contig_array[bal_i].to_vt ); getThrough = 1; break; } len += temp_cnt->gapLen + contig_array[temp_cnt->contigID].length; if ( contig_array[temp_cnt->contigID].mask == 0 ) { topCtg = temp_cnt->contigID; slideLen2 = len; } temp_cnt = temp_cnt->nextInScaf; } len += temp_cnt->gapLen + contig_array[temp_cnt->contigID].length; if ( contig_array[temp_cnt->contigID].mask == 0 || topCtg == 0 ) { topCtg = temp_cnt->contigID; slideLen2 = len; } if ( temp_cnt->contigID == bal_i ) { getThrough = 1; } else { topCtg = getTwinCtg ( topCtg ); } } else { topCtg = targetCtg; } if ( getThrough ) { throughCounter++; setConnectDelete ( i, ite_cnt->contigID, 1, 0 ); ite_cnt = ite_cnt->next; continue; } //add a connection between bottomCtg and topCtg gap = ite_cnt->gapLen - slideLen - slideLen2; if ( bottomCtg != topCtg && ! ( i == bottomCtg && targetCtg == topCtg ) ) { createAnalogousCnt ( i, ite_cnt, gap, bottomCtg, topCtg ); if ( contig_array[bottomCtg].mask || contig_array[topCtg].mask ) { printf ( "downSlide to masked contig\n" ); } } ite_cnt = ite_cnt->next; } //for each connect } // for each contig printf ( "downSliding is done...orienConflict %d, fall inside %d\n", orienCounter, throughCounter ); } static boolean setNextInScaf ( CONNECT * cnt, CONNECT * nextCnt ) { if ( !cnt ) { printf ( "setNextInScaf: empty pointer\n" ); return 0; } if ( !nextCnt ) { cnt->nextInScaf = nextCnt; return 1; } if ( cnt->mask || cnt->deleted ) { printf ( "setNextInScaf: cnt is masked or deleted\n" ); return 0; } if ( nextCnt->deleted || nextCnt->mask ) { printf ( "setNextInScaf: nextCnt is masked or deleted\n" ); return 0; } cnt->nextInScaf = nextCnt; return 1; } static boolean setPrevInScaf ( CONNECT * cnt, boolean flag ) { if ( !cnt ) { printf ( "setPrevInScaf: empty pointer\n" ); return 0; } if ( !flag ) { cnt->prevInScaf = flag; return 1; } if ( cnt->mask || cnt->deleted 
) { printf ( "setPrevInScaf: cnt is masked or deleted\n" ); return 0; } cnt->prevInScaf = flag; return 1; } /* connect A is upstream to B, replace A with C from_c > branch_c - to_c from_c_new */ static void substitueUSinScaf ( CONNECT * origin, unsigned int from_c_new ) { if ( !origin || !origin->nextInScaf ) { return; } unsigned int branch_c, to_c; unsigned int bal_branch_c, bal_to_c; unsigned int bal_from_c_new = getTwinCtg ( from_c_new ); CONNECT * bal_origin, *bal_nextCNT, *prevCNT, *bal_prevCNT; branch_c = origin->contigID; to_c = origin->nextInScaf->contigID; bal_branch_c = getTwinCtg ( branch_c ); bal_to_c = getTwinCtg ( to_c ); prevCNT = checkConnect ( from_c_new, branch_c ); bal_nextCNT = checkConnect ( bal_to_c, bal_branch_c ); if ( !bal_nextCNT ) { printf ( "substitueUSinScaf: no connect between %d and %d\n", bal_to_c, bal_branch_c ); return; } bal_origin = bal_nextCNT->nextInScaf; bal_prevCNT = checkConnect ( bal_branch_c, bal_from_c_new ); setPrevInScaf ( bal_nextCNT->nextInScaf, 0 ); setNextInScaf ( prevCNT, origin->nextInScaf ); setNextInScaf ( bal_nextCNT, bal_prevCNT ); setPrevInScaf ( bal_prevCNT, 1 ); setNextInScaf ( origin, NULL ); setPrevInScaf ( bal_origin, 0 ); } /* connect B is downstream to C, replace B with A to_c from_c - branch_c < to_c_new */ static void substitueDSinScaf ( CONNECT * origin, unsigned int branch_c, unsigned int to_c_new ) { if ( !origin || !origin->prevInScaf ) { return; } unsigned int to_c; unsigned int bal_branch_c, bal_to_c, bal_to_c_new; unsigned int from_c, bal_from_c; CONNECT * bal_origin, *prevCNT, *bal_prevCNT; CONNECT * nextCNT, *bal_nextCNT; to_c = origin->contigID; bal_branch_c = getTwinCtg ( branch_c ); bal_to_c = getTwinCtg ( to_c ); bal_origin = getCntBetween ( bal_to_c, bal_branch_c ); if ( !bal_origin ) { printf ( "substitueDSinScaf: no connect between %d and %d\n", bal_to_c, bal_branch_c ); return; } bal_from_c = bal_origin->nextInScaf->contigID; from_c = getTwinCtg ( bal_from_c ); bal_to_c_new = 
getTwinCtg ( to_c_new ); prevCNT = checkConnect ( from_c, branch_c ); nextCNT = checkConnect ( branch_c, to_c_new ); setNextInScaf ( prevCNT, nextCNT ); setPrevInScaf ( nextCNT, 1 ); bal_nextCNT = checkConnect ( bal_to_c_new, bal_branch_c ); bal_prevCNT = checkConnect ( bal_branch_c, bal_from_c ); setNextInScaf ( bal_nextCNT, bal_prevCNT ); setPrevInScaf ( origin, 0 ); setNextInScaf ( bal_origin, NULL ); } static int validConnect ( unsigned int ctg, CONNECT * preCNT ) { if ( preCNT && preCNT->nextInScaf ) { return 1; } CONNECT * cn_temp; int count = 0; if ( !contig_array[ctg].downwardConnect ) { return count; } cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( !cn_temp->deleted && !cn_temp->mask ) { count++; } cn_temp = cn_temp->next; } return count; } static CONNECT * getNextContig ( unsigned int ctg, CONNECT * preCNT, boolean * exception ) { CONNECT * cn_temp, *retCNT = NULL; int count = 0, valid_in; unsigned int nextCtg, bal_ctg; *exception = 0; if ( preCNT && preCNT->nextInScaf ) { if ( preCNT->contigID != ctg ) { printf ( "pre cnt does not lead to %d\n", ctg ); } nextCtg = preCNT->nextInScaf->contigID; cn_temp = getCntBetween ( ctg, nextCtg ); if ( cn_temp && ( cn_temp->mask || cn_temp->deleted ) ) { printf ( "getNextContig: arc(%d %d) twin (%d %d) with mask %d deleted %d\n" , ctg, nextCtg, getTwinCtg ( nextCtg ), getTwinCtg ( ctg ) , cn_temp->mask, cn_temp->deleted ); if ( !cn_temp->prevInScaf ) { printf ( "not even has a prevInScaf\n" ); } cn_temp = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( ctg ) ); if ( !cn_temp->nextInScaf ) { printf ( "its twin cnt not has a nextInScaf\n" ); } fflush ( stdout ); *exception = 1; } else { return preCNT->nextInScaf; } } bal_ctg = getTwinCtg ( ctg ); valid_in = validConnect ( bal_ctg, NULL ); if ( valid_in > 1 ) { return NULL; } if ( !contig_array[ctg].downwardConnect ) { return NULL; } cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->deleted ) { 
cn_temp = cn_temp->next; continue; } count++; if ( count == 1 ) { retCNT = cn_temp; } else if ( count == 2 ) { return NULL; } cn_temp = cn_temp->next; } return retCNT; } // get the valid connect between 2 given ctgs static CONNECT * checkConnect ( unsigned int from_c, unsigned int to_c ) { CONNECT * cn_temp = getCntBetween ( from_c, to_c ); if ( !cn_temp ) { return NULL; } if ( !cn_temp->mask && !cn_temp->deleted ) { return cn_temp; } return NULL; } static int setConnectMask ( unsigned int from_c, unsigned int to_c, char mask ) { CONNECT * cn_temp, *cn_bal, *cn_ds, *cn_us; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); unsigned int ctg3, bal_ctg3; cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->mask = mask; cn_bal->mask = mask; if ( !mask ) { return 1; } if ( cn_temp->nextInScaf ) //undo the binding { setPrevInScaf ( cn_temp->nextInScaf, 0 ); ctg3 = cn_temp->nextInScaf->contigID; setNextInScaf ( cn_temp, NULL ); bal_ctg3 = getTwinCtg ( ctg3 ); cn_ds = getCntBetween ( bal_ctg3, bal_tc ); setNextInScaf ( cn_ds, NULL ); setPrevInScaf ( cn_bal, 0 ); } // ctg3 -> from_c -> to_c // bal_ctg3 <- bal_fc <- bal_tc if ( cn_bal->nextInScaf ) { setPrevInScaf ( cn_bal->nextInScaf, 0 ); bal_ctg3 = cn_bal->nextInScaf->contigID; setNextInScaf ( cn_bal, NULL ); ctg3 = getTwinCtg ( bal_ctg3 ); cn_us = getCntBetween ( ctg3, from_c ); setNextInScaf ( cn_us, NULL ); setPrevInScaf ( cn_temp, 0 ); } return 1; } static boolean setConnectUsed ( unsigned int from_c, unsigned int to_c, char flag ) { CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->used = flag; cn_bal->used = flag; return 1; } static int setConnectWP ( unsigned int from_c, unsigned int to_c, char flag ) 
{ CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->weakPoint = flag; cn_bal->weakPoint = flag; //fprintf(stderr,"contig %d and %d, weakPoint %d\n",from_c,to_c,cn_temp->weakPoint); //fprintf(stderr,"contig %d and %d, weakPoint %d\n",bal_tc,bal_fc,cn_bal->weakPoint); return 1; } static int setConnectDelete ( unsigned int from_c, unsigned int to_c, char flag, boolean cleanBinding ) { CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->deleted = flag; cn_bal->deleted = flag; if ( !flag ) { return 1; } if ( cleanBinding ) { cn_temp->prevInScaf = 0; cn_temp->nextInScaf = NULL; cn_bal->prevInScaf = 0; cn_bal->nextInScaf = NULL; } return 1; } static void maskContig ( unsigned int ctg, boolean flag ) { unsigned int bal_ctg, ctg2, bal_ctg2; CONNECT * cn_temp; bal_ctg = getTwinCtg ( ctg ); cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->prevInScaf || cn_temp->nextInScaf || cn_temp->singleInScaf ) { cn_temp = cn_temp->next; continue; } ctg2 = cn_temp->contigID; setConnectMask ( ctg, ctg2, flag ); cn_temp = cn_temp->next; } // bal_ctg2 <- bal_ctg cn_temp = contig_array[bal_ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->prevInScaf || cn_temp->nextInScaf || cn_temp->singleInScaf ) { cn_temp = cn_temp->next; continue; } bal_ctg2 = cn_temp->contigID; setConnectMask ( bal_ctg, bal_ctg2, flag ); cn_temp = cn_temp->next; } contig_array[ctg].mask = flag; contig_array[bal_ctg].mask = flag; } static int maskPuzzle ( int num_connect, unsigned int contigLen ) { int in_num, out_num, flag = 0, puzzleCounter = 0; unsigned int 
    i, bal_i;

    for ( i = 1; i <= num_ctg; i++ )
    {
        if ( contigLen && contig_array[i].length > contigLen )
        {
            break;
        }

        if ( contig_array[i].mask )
        {
            continue;
        }

        bal_i = getTwinCtg ( i );
        in_num = validConnect ( bal_i, NULL );
        out_num = validConnect ( i, NULL );

        if ( ( in_num > 1 || out_num > 1 ) && ( in_num + out_num >= num_connect ) )
        {
            flag++;
            maskContig ( i, 1 );
        }

        in_num = validConnect ( bal_i, NULL );
        out_num = validConnect ( i, NULL );

        if ( in_num > 1 || out_num > 1 )
        {
            puzzleCounter++;
            //debugging2(i);
        }

        if ( isSmallerThanTwin ( i ) )
        {
            i++;
        }
    }

    printf ( "Masked %d contigs, %d puzzle left\n", flag, puzzleCounter );
    return flag;
}

static void deleteWeakCnt ( int cut_off )
{
    unsigned int i;
    CONNECT * cn_temp1;
    int weaks = 0, counter = 0;

    for ( i = 1; i <= num_ctg; i++ )
    {
        cn_temp1 = contig_array[i].downwardConnect;

        while ( cn_temp1 )
        {
            if ( !cn_temp1->mask && !cn_temp1->deleted && !cn_temp1->nextInScaf
                    && !cn_temp1->singleInScaf && !cn_temp1->prevInScaf )
            {
                counter++;
            }

            if ( cn_temp1->weak && cn_temp1->deleted && cn_temp1->weight >= cut_off )
            {
                cn_temp1->deleted = 0;
                cn_temp1->weak = 0;
            }
            else if ( !cn_temp1->deleted && cn_temp1->weight > 0 && cn_temp1->weight < cut_off
                      && !cn_temp1->nextInScaf && !cn_temp1->prevInScaf )
            {
                cn_temp1->deleted = 1;
                cn_temp1->weak = 1;

                if ( cn_temp1->singleInScaf )
                {
                    cn_temp1->singleInScaf = 0;
                }

                if ( !cn_temp1->mask )
                {
                    weaks++;
                }
            }

            cn_temp1 = cn_temp1->next;
        }
    }

    printf ( "%d weak connects removed (there were %d active connects)\n", weaks, counter );
    checkCircle();
}

//check if one contig is linearly connected to the other ->C1->C2...
static int linearC2C ( unsigned int starter, CONNECT * cnt2c1, unsigned int c2, int min_dis, int max_dis ) { int out_num, in_num; CONNECT * prevCNT, *cnt, *cn_temp; unsigned int c1, bal_c1, ctg, bal_c2; int len = 0; unsigned int bal_start = getTwinCtg ( starter ); boolean excep; c1 = cnt2c1->contigID; if ( c1 == c2 ) { printf ( "linearC2C: c1(%d) and c2(%d) are the same contig\n", c1, c2 ); return -1; } bal_c1 = getTwinCtg ( c1 ); in_num = validConnect ( bal_c1, NULL ); if ( in_num > 1 ) { return 0; } dsCtgCounter = 1; usCtgCounter = 0; downstreamCTG[dsCtgCounter++] = c1; bal_c2 = getTwinCtg ( c2 ); upstreamCTG[usCtgCounter++] = bal_c2; // check if c1 is linearly connected to c2 by pe connections cnt = prevCNT = cnt2c1; while ( ( cnt = getNextContig ( c1, prevCNT, &excep ) ) != NULL ) { c1 = cnt->contigID; len += cnt->gapLen + contig_array[c1].length; if ( c1 == c2 ) { return 1; } if ( len > max_dis || c1 == starter || c1 == bal_start ) { return 0; } downstreamCTG[dsCtgCounter++] = c1; if ( dsCtgCounter >= MAXCinBetween ) { printf ( "%d downstream contigs, start at %d, max_dis %d, current dis %d\n" , dsCtgCounter, starter, max_dis, len ); return 0; } prevCNT = cnt; } out_num = validConnect ( c1, NULL ); if ( out_num ) { return 0; } //find the most upstream contig to c2 cnt = prevCNT = NULL; ctg = bal_c2; while ( ( cnt = getNextContig ( ctg, prevCNT, &excep ) ) != NULL ) { ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; if ( len > max_dis || ctg == starter || ctg == bal_start ) { return 0; } prevCNT = cnt; upstreamCTG[usCtgCounter++] = ctg; if ( usCtgCounter >= MAXCinBetween ) { printf ( "%d upstream contigs, start at %d, max_dis %d, current dis %d\n" , usCtgCounter, starter, max_dis, len ); return 0; } } if ( dsCtgCounter + usCtgCounter > MAXCinBetween ) { printf ( "%d downstream and %d upstream contigs\n", dsCtgCounter, usCtgCounter ); return 0; } out_num = validConnect ( ctg, NULL ); if ( out_num ) { return 0; } c2 = getTwinCtg ( ctg ); 
min_dis -= len; max_dis -= len; if ( c1 == c2 || c1 == ctg || max_dis < 0 ) { return 0; } cn_temp = getCntBetween ( c1, c2 ); if ( cn_temp ) { setConnectMask ( c1, c2, 0 ); setConnectDelete ( c1, c2, 0, 0 ); return 1; } len = ( min_dis + max_dis ) / 2 >= 0 ? ( min_dis + max_dis ) / 2 : 0; cn_temp = allocateCN ( c2, len ); if ( cntLookupTable ) { putCnt2LookupTable ( c1, cn_temp ); } cn_temp->weight = 0; // special connect from the original graph cn_temp->next = contig_array[c1].downwardConnect; contig_array[c1].downwardConnect = cn_temp; bal_c1 = getTwinCtg ( c1 ); bal_c2 = getTwinCtg ( c2 ); cn_temp = allocateCN ( bal_c1, len ); if ( cntLookupTable ) { putCnt2LookupTable ( bal_c2, cn_temp ); } cn_temp->weight = 0; // special connect from the original graph cn_temp->next = contig_array[bal_c2].downwardConnect; contig_array[bal_c2].downwardConnect = cn_temp; return 1; } //catenate upstream contig array and downstream contig array to solidArray static void catUsDsContig() { int i; for ( i = 0; i < dsCtgCounter; i++ ) { * ( unsigned int * ) darrayPut ( solidArray, i ) = downstreamCTG[i]; } for ( i = usCtgCounter - 1; i >= 0; i-- ) { * ( unsigned int * ) darrayPut ( solidArray, dsCtgCounter++ ) = getTwinCtg ( upstreamCTG[i] ); } solidCounter = dsCtgCounter; } //binding the connections between contigs in solidArray static void consolidate() { int i, j; CONNECT * prevCNT = NULL; CONNECT * cnt; unsigned int to_ctg; unsigned int from_ctg = * ( unsigned int * ) darrayGet ( solidArray, 0 ); for ( i = 1; i < solidCounter; i++ ) { to_ctg = * ( unsigned int * ) darrayGet ( solidArray, i ); cnt = checkConnect ( from_ctg, to_ctg ); if ( !cnt ) { printf ( "consolidate A: no connect from %d to %d\n", from_ctg, to_ctg ); for ( j = 0; j < solidCounter; j++ ) { printf ( "%d-->", * ( unsigned int * ) darrayGet ( solidArray, j ) ); } printf ( "\n" ); return; } cnt->singleInScaf = solidCounter == 2 ? 
1 : 0; if ( prevCNT ) { setNextInScaf ( prevCNT, cnt ); setPrevInScaf ( cnt, 1 ); } prevCNT = cnt; from_ctg = to_ctg; } //the reverse complementary path from_ctg = getTwinCtg ( * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 1 ) ); prevCNT = NULL; for ( i = solidCounter - 2; i >= 0; i-- ) { to_ctg = getTwinCtg ( * ( unsigned int * ) darrayGet ( solidArray, i ) ); cnt = checkConnect ( from_ctg, to_ctg ); if ( !cnt ) { printf ( "consolidate B: no connect from %d to %d\n", from_ctg, to_ctg ); return; } cnt->singleInScaf = solidCounter == 2 ? 1 : 0; if ( prevCNT ) { setNextInScaf ( prevCNT, cnt ); setPrevInScaf ( cnt, 1 ); } prevCNT = cnt; from_ctg = to_ctg; } } static void debugging1 ( unsigned int ctg1, unsigned int ctg2 ) { CONNECT * cn1; cn1 = getCntBetween ( ctg1, ctg2 ); if ( cn1 ) { printf ( "(%d,%d) mask %d deleted %d w %d,singleInScaf %d\n", ctg1, ctg2, cn1->mask, cn1->deleted, cn1->weight, cn1->singleInScaf ); if ( cn1->nextInScaf ) { printf ( "%d->%d->%d\n", ctg1, ctg2, cn1->nextInScaf->contigID ); } if ( cn1->prevInScaf ) { printf ( "*->%d->%d\n", ctg1, ctg2 ); } else if ( !cn1->nextInScaf ) { printf ( "NULL->%d->%d->NULL\n", ctg1, ctg2 ); } } else { printf ( "%d -X- %d\n", ctg1, ctg2 ); } } //remove transitive connections which cross linear paths (these paths may be broken) //if a->b->c and a->c, mask a->c static void removeTransitive() { unsigned int i, bal_ctg; int flag = 1, out_num, in_num, count, min, max, linear; CONNECT * cn_temp, *cn1 = NULL, *cn2 = NULL; while ( flag ) { flag = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } out_num = validConnect ( i, NULL ); if ( out_num != 2 ) { continue; } cn_temp = contig_array[i].downwardConnect; count = 0; while ( cn_temp ) { if ( cn_temp->deleted || cn_temp->mask ) { cn_temp = cn_temp->next; continue; } count++; if ( count == 1 ) { cn1 = cn_temp; } else if ( count == 2 ) { cn2 = cn_temp; } else // count > 2 { break; } cn_temp = cn_temp->next; } if ( count > 2 ) { 
                printf ( "%d valid connections from ctg %d\n", count, i );
                continue;
            }

            if ( cn1->gapLen > cn2->gapLen )
            {
                cn_temp = cn1;
                cn1 = cn2;
                cn2 = cn_temp;
            } //make sure cn1 is closer to contig i than cn2

            if ( cn1->prevInScaf && cn2->prevInScaf )
            {
                continue;
            }

            bal_ctg = getTwinCtg ( cn2->contigID );
            in_num = validConnect ( bal_ctg, NULL );

            if ( in_num > 2 )
            {
                continue;
            }

            min = cn2->gapLen - cn1->gapLen - contig_array[cn1->contigID].length - ins_size_var / 2;
            max = cn2->gapLen - cn1->gapLen - contig_array[cn1->contigID].length + ins_size_var / 2;

            if ( max < 0 )
            {
                continue;
            }

            //temporarily delete cn2
            setConnectDelete ( i, cn2->contigID, 1, 0 );
            linear = linearC2C ( i, cn1, cn2->contigID, min, max );

            if ( linear != 1 )
            {
                setConnectDelete ( i, cn2->contigID, 0, 0 );
                continue;
            }
            else
            {
                downstreamCTG[0] = i;
                catUsDsContig();

                if ( !checkSimple ( solidArray, solidCounter ) )
                {
                    continue;
                }

                cn1 = getCntBetween ( * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 2 ),
                                      * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 1 ) );

                if ( cn1 && cn1->nextInScaf && cn2->nextInScaf )
                {
                    setConnectDelete ( i, cn2->contigID, 0, 0 );
                    continue;
                }

                consolidate();

                if ( cn2->prevInScaf )
                    substitueDSinScaf ( cn2, * ( unsigned int * ) darrayGet ( solidArray, 0 ),
                                        * ( unsigned int * ) darrayGet ( solidArray, 1 ) );

                if ( cn2->nextInScaf )
                {
                    substitueUSinScaf ( cn2, * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 2 ) );
                }

                flag++;
            }
        } //for each contig

        printf ( "a remove transitive lag, %d connections removed\n", flag );
    }
}

//get repeat contigs back into the scaffold according to connected unique contigs on both sides
/*
     A ------ D
      > [i] <
     B        E
*/
static void debugging2 ( unsigned int ctg )
{
    CONNECT * cn1 = contig_array[ctg].downwardConnect;

    while ( cn1 )
    {
        if ( cn1->nextInScaf )
        {
            fprintf ( stderr, "with nextInScaf," );
        }

        if ( cn1->prevInScaf )
        {
            fprintf ( stderr, "with prevInScaf," );
        }

        fprintf ( stderr, "%u >> %d, mask %d deleted %d, inherit %d, singleInScaf %d\n", ctg,
                  cn1->contigID, cn1->mask, cn1->deleted, cn1->inherit, cn1->singleInScaf );
        cn1 = cn1->next;
    }
}

static void debugging()
{
    /*
    debugging1(1777,1468);
    debugging2(8065);
    debugging2(8066);
    */
}

static void simplifyCnt()
{
    removeTransitive();
    debugging();
    general_linearization ( 1 );
    debugging();
}

static int getIndexInArray ( unsigned int node )
{
    int index;

    for ( index = 0; index < nodeCounter; index++ )
        if ( nodesInSub[index] == node )
        {
            return index;
        }

    return -1;
}

static boolean putNodeIntoSubgraph ( FibHeap * heap, int distance, unsigned int node, int index )
{
    int pos = getIndexInArray ( node );

    if ( pos > 0 )
    {
        //printf("exists\n");
        return 0;
    }

    if ( index >= MaxNodeInSub )
    {
        return -1;
    }

    insertNodeIntoHeap ( heap, distance, node );
    nodesInSub[index] = node;
    nodeDistance[index] = distance;
    return 1;
}

static boolean putChainIntoSubgraph ( FibHeap * heap, int distance, unsigned int node, int * index, CONNECT * prevC )
{
    unsigned int ctg = node;
    CONNECT * nextCnt;
    boolean excep, flag;
    int counter = *index;

    while ( 1 )
    {
        nextCnt = getNextContig ( ctg, prevC, &excep );

        if ( excep || !nextCnt )
        {
            *index = counter;
            return 1;
        }

        ctg = nextCnt->contigID;
        // advance by the gap plus the length of the contig just reached,
        // matching how checkUnique computes the first distance
        distance += nextCnt->gapLen + contig_array[ctg].length;
        flag = putNodeIntoSubgraph ( heap, distance, ctg, counter );

        if ( flag < 0 )
        {
            return 0;
        }

        if ( flag > 0 )
        {
            counter++;
        }

        prevC = nextCnt;
    }
}

// check if a contig is unique by trying to line its downstream/upstream nodes together
static boolean checkUnique ( unsigned int node, double tolerance )
{
    CONNECT * ite_cnt;
    unsigned int currNode;
    int distance;
    int popCounter = 0;
    boolean flag;
    currNode = node;
    FibHeap * heap = newFibHeap();
    putNodeIntoSubgraph ( heap, 0, currNode, 0 );
    nodeCounter = 1;
    ite_cnt = contig_array[currNode].downwardConnect;

    while ( ite_cnt )
    {
        if ( ite_cnt->deleted || ite_cnt->mask )
        {
            ite_cnt = ite_cnt->next;
            continue;
        }

        currNode = ite_cnt->contigID;
        distance = ite_cnt->gapLen + contig_array[currNode].length;
        flag = putNodeIntoSubgraph ( heap, distance, currNode,
nodeCounter ); if ( flag < 0 ) { destroyHeap ( heap ); return 0; } if ( flag > 0 ) { nodeCounter++; } flag = putChainIntoSubgraph ( heap, distance, currNode, &nodeCounter, ite_cnt ); if ( !flag ) { destroyHeap ( heap ); return 0; } ite_cnt = ite_cnt->next; } if ( nodeCounter <= 2 ) // no more than 2 valid connections { destroyHeap ( heap ); return 1; } while ( ( currNode = removeNextNodeFromHeap ( heap ) ) != 0 ) { nodesInSubInOrder[popCounter++] = currNode; } destroyHeap ( heap ); flag = checkOverlapInBetween ( tolerance ); return flag; } //mask contigs with downstream and/or upstream can not be lined static void maskRepeat() { int in_num, out_num, flagA, flagB; int counter = 0; int puzzleCounter = 0; unsigned int i, bal_i; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } bal_i = getTwinCtg ( i ); in_num = validConnect ( bal_i, NULL ); out_num = validConnect ( i, NULL ); if ( in_num > 1 || out_num > 1 ) { puzzleCounter++; } else { if ( isSmallerThanTwin ( i ) ) { i++; } continue; } if ( contig_array[i].cvg > 2 * cvgAvg ) { counter++; maskContig ( i, 1 ); //printf("thick mask contig %d and %d\n",i,bal_i); if ( isSmallerThanTwin ( i ) ) { i++; } continue; } if ( in_num > 1 ) { flagA = checkUnique ( bal_i, OverlapPercent ); } else { flagA = 1; } if ( out_num > 1 ) { flagB = checkUnique ( i, OverlapPercent ); } else { flagB = 1; } if ( !flagA || !flagB ) { counter++; maskContig ( i, 1 ); } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "maskRepeat: %d contigs masked from %d puzzles\n", counter, puzzleCounter ); } static void ordering ( boolean deWeak, boolean downS, boolean nonlinear, char * infile ) { debugging(); if ( downS ) { downSlide(); debugging(); if ( deWeak ) { deleteWeakCnt ( weakPE ); } } else { if ( deWeak ) { deleteWeakCnt ( weakPE ); } } //output_scaf(infile); debugging(); printf ( "variance for insert size %d\n", ins_size_var ); simplifyCnt(); debugging(); maskRepeat(); debugging(); simplifyCnt(); if ( nonlinear ) { 
printf ( "non-strict linearization\n" ); general_linearization ( 0 ); //linearization(0,0); } maskPuzzle ( 2, 0 ); debugging(); freezing(); debugging(); } //check if contigs next to each other have reasonable overlap boolean checkOverlapInBetween ( double tolerance ) { int i, gap; int index; unsigned int node; int lenSum, lenOlp; lenSum = lenOlp = 0; for ( i = 0; i < nodeCounter; i++ ) { node = nodesInSubInOrder[i]; lenSum += contig_array[node].length; index = getIndexInArray ( node ); nodeDistanceInOrder[i] = nodeDistance[index]; } if ( lenSum < 1 ) { return 1; } for ( i = 0; i < nodeCounter - 1; i++ ) { gap = nodeDistanceInOrder[i + 1] - nodeDistanceInOrder[i] - contig_array[nodesInSubInOrder[i + 1]].length; if ( -gap > 0 ) { lenOlp += -gap; } //if(-gap>ins_size_var) if ( ( double ) lenOlp / lenSum > tolerance ) { return 0; } } return 1; } /********* the following codes are for freezing current scaffolds ****************/ //set connections between contigs in a array to used or not //meanwhile set mask to the opposite value static boolean setUsed ( unsigned int start, unsigned int * array, int max_steps, boolean flag ) { unsigned int prevCtg = start; unsigned int twinA, twinB; int j; CONNECT * cnt; boolean usedFlag = 0; // save 'used' to 'checking' prevCtg = start; for ( j = 0; j < max_steps; j++ ) { if ( array[j] == 0 ) { break; } cnt = getCntBetween ( prevCtg, array[j] ); if ( !cnt ) { printf ( "setUsed: no connect between %d and %d\n", prevCtg, array[j] ); prevCtg = array[j]; continue; } if ( cnt->used == flag || cnt->nextInScaf || cnt->prevInScaf || cnt->singleInScaf ) { return 1; } cnt->checking = cnt->used; twinA = getTwinCtg ( prevCtg ); twinB = getTwinCtg ( array[j] ); cnt = getCntBetween ( twinB, twinA ); if ( cnt ) { cnt->checking = cnt->used; } prevCtg = array[j]; } // set used to flag prevCtg = start; for ( j = 0; j < max_steps; j++ ) { if ( array[j] == 0 ) { break; } cnt = getCntBetween ( prevCtg, array[j] ); if ( !cnt ) { prevCtg = array[j]; 
            continue;
        }

        if ( cnt->used == flag )
        {
            usedFlag = 1;
            break;
        }

        cnt->used = flag;
        twinA = getTwinCtg ( prevCtg );
        twinB = getTwinCtg ( array[j] );
        cnt = getCntBetween ( twinB, twinA );

        if ( cnt )
        {
            cnt->used = flag;
        }

        prevCtg = array[j];
    }

    // set mask to 'NOT flag' or set used to original value
    prevCtg = start;

    for ( j = 0; j < max_steps; j++ )
    {
        if ( array[j] == 0 )
        {
            break;
        }

        cnt = getCntBetween ( prevCtg, array[j] );

        if ( !cnt )
        {
            prevCtg = array[j];
            continue;
        }

        if ( !usedFlag )
        {
            cnt->mask = 1 - flag;
        }
        else
        {
            cnt->used = cnt->checking;
        }

        twinA = getTwinCtg ( prevCtg );
        twinB = getTwinCtg ( array[j] );
        cnt = getCntBetween ( twinB, twinA );

        // the twin connection may be missing; the two loops above guard
        // against that, so guard here as well instead of dereferencing NULL
        if ( cnt )
        {
            cnt->used = 1 - flag;

            if ( !usedFlag )
            {
                cnt->mask = 1 - flag;
            }
            else
            {
                cnt->used = cnt->checking;
            }
        }

        prevCtg = array[j];
    }

    return usedFlag;
}

// break down scaffolds poorly supported by longer PE
static void recoverMask()
{
    unsigned int i, ctg, bal_ctg, start, finish;
    int num3, num5, j, t;
    CONNECT * bindCnt, *cnt;
    int min, max, max_steps = 5, num_route, length;
    int tempCounter, recoverCounter = 0;
    boolean multiUSE, change;

    for ( i = 1; i <= num_ctg; i++ )
    {
        contig_array[i].flag = 0;
    }

    so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) );
    found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) );

    for ( j = 0; j < max_n_routes; j++ )
    {
        found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) );
    }

    for ( i = 1; i <= num_ctg; i++ )
    {
        if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect )
        {
            continue;
        }

        bindCnt = getBindCnt ( i );

        if ( !bindCnt )
        {
            continue;
        }

        //first scan get the average coverage by longer pe
        num5 = num3 = 0;
        ctg = i;
        * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i;
        contig_array[i].flag = 1;
        contig_array[getTwinCtg ( i )].flag = 1;

        while ( bindCnt )
        {
            if ( bindCnt->used )
            {
                break;
            }

            setConnectUsed ( ctg, bindCnt->contigID, 1 );
            ctg = bindCnt->contigID;
            * ( unsigned int * ) darrayPut ( scaf5, num5++
) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( i ); bindCnt = getBindCnt ( ctg ); while ( bindCnt ) { if ( bindCnt->used ) { break; } setConnectUsed ( ctg, bindCnt->contigID, 1 ); ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; bindCnt = bindCnt->nextInScaf; } if ( num5 + num3 < 2 ) { continue; } tempCounter = solidCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); change = 0; for ( t = 0; t < tempCounter - 1; t++ ) { * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( tempArray, t ); start = * ( unsigned int * ) darrayGet ( tempArray, t ); finish = * ( unsigned int * ) darrayGet ( tempArray, t + 1 ); num_route = num_trace = 0; cnt = checkConnect ( start, finish ); if ( !cnt ) { printf ( "Warning from recoverMask: no connection (%d %d), start at %d\n", start, finish, i ); cnt = getCntBetween ( start, finish ); if ( cnt ) { debugging1 ( start, finish ); } continue; } length = cnt->gapLen + contig_array[finish].length; min = length - 1.5 * ins_size_var; max = length + 1.5 * ins_size_var; traceAlongMaskedCnt ( finish, start, max_steps, min, max, 0, 0, &num_route ); if ( finish == start ) { for ( j = 0; j < tempCounter; j++ ) { printf ( "->%d", * ( unsigned int * ) darrayGet ( tempArray, j ) ); } printf ( ": start at %d\n", i ); } if ( num_route == 1 ) { for ( j = 0; j < max_steps; j++ ) if ( found_routes[0][j] == 0 ) { break; } if ( j < 1 ) { continue; } //check if connects have been used more than once multiUSE = setUsed ( start, found_routes[0], max_steps, 1 ); 
if ( multiUSE ) { continue; } for ( j = 0; j < max_steps; j++ ) { if ( j + 1 == max_steps || found_routes[0][j + 1] == 0 ) { break; } * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = found_routes[0][j]; contig_array[found_routes[0][j]].flag = 1; contig_array[getTwinCtg ( found_routes[0][j] )].flag = 1; } recoverCounter += j; setConnectDelete ( start, finish, 1, 1 ); change = 1; } //end if num_route=1 } // for each gap * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( tempArray, tempCounter - 1 ); if ( change ) { consolidate(); } } printf ( "%d contigs recovered\n", recoverCounter ); fflush ( stdout ); for ( i = 1; i <= num_ctg; i++ ) { cnt = contig_array[i].downwardConnect; while ( cnt ) { cnt->used = 0; cnt->checking = 0; cnt = cnt->next; } } for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } // A -> B -> C -> D un-bind link B->C to link A->B and B->C // A' <- B' <- C' <- D' static void unBindLink ( unsigned int CB, unsigned int CC ) { //fprintf(stderr,"Unbind link (%d %d) to others...\n",CB,CC); CONNECT * cnt1 = getCntBetween ( CB, CC ); if ( !cnt1 ) { return; } if ( cnt1->singleInScaf ) { cnt1->singleInScaf = 0; } CONNECT * cnt2 = getCntBetween ( getTwinCtg ( CC ), getTwinCtg ( CB ) ); if ( !cnt2 ) { return; } if ( cnt2->singleInScaf ) { cnt2->singleInScaf = 0; } if ( cnt1->nextInScaf ) { unsigned int CD = cnt1->nextInScaf->contigID; cnt1->nextInScaf->prevInScaf = 0; cnt1->nextInScaf = NULL; CONNECT * cnt3 = getCntBetween ( getTwinCtg ( CD ), getTwinCtg ( CC ) ); if ( cnt3 ) { cnt3->nextInScaf = NULL; } cnt2->prevInScaf = 0; } if ( cnt2->nextInScaf ) { unsigned int bal_CA = cnt2->nextInScaf->contigID; cnt2->nextInScaf->prevInScaf = 0; cnt2->nextInScaf = NULL; CONNECT * cnt4 = getCntBetween ( getTwinCtg ( bal_CA ), CB ); if ( cnt4 ) { cnt4->nextInScaf = NULL; } cnt1->prevInScaf = 0; } } static void freezing() { 
int num5, num3; unsigned int ctg, bal_ctg; unsigned int i; int j, t; CONNECT * cnt, *prevCNT, *nextCnt; boolean excep; for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; contig_array[i].from_vt = 0; contig_array[i].to_vt = 0; cnt = contig_array[i].downwardConnect; while ( cnt ) { cnt->used = 0; cnt->checking = 0; cnt->singleInScaf = 0; cnt = cnt->next; } } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask ) { continue; } if ( !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { if ( contig_array[cnt->contigID].flag ) { unBindLink ( ctg, cnt->contigID ); break; } nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; } ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { if ( contig_array[cnt->contigID].flag ) { unBindLink ( ctg, cnt->contigID ); break; } nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } if ( num5 + num3 < 2 ) { continue; } solidCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) 
* ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); unsigned int firstCtg = 0; unsigned int lastCtg = 0; unsigned int firstTwin = 0; unsigned int lastTwin = 0; for ( t = 0; t < solidCounter; t++ ) if ( !contig_array[* ( unsigned int * ) darrayGet ( solidArray, t )].mask ) { firstCtg = * ( unsigned int * ) darrayGet ( solidArray, t ); break; } for ( t = solidCounter - 1; t >= 0; t-- ) if ( !contig_array[* ( unsigned int * ) darrayGet ( solidArray, t )].mask ) { lastCtg = * ( unsigned int * ) darrayGet ( solidArray, t ); break; } if ( firstCtg == 0 || lastCtg == 0 ) { printf ( "scaffold start at %d, stop at %d, freezing began with %d\n", firstCtg, lastCtg, i ); for ( j = 0; j < solidCounter; j++ ) printf ( "->%d(%d %d)", * ( unsigned int * ) darrayGet ( solidArray, j ) , contig_array[* ( unsigned int * ) darrayGet ( solidArray, j )].mask , contig_array[* ( unsigned int * ) darrayGet ( solidArray, j )].flag ); printf ( "\n" ); } else { firstTwin = getTwinCtg ( firstCtg ); lastTwin = getTwinCtg ( lastCtg ); } for ( t = 0; t < solidCounter; t++ ) { unsigned int ctg = * ( unsigned int * ) darrayGet ( solidArray, t ); if ( contig_array[ctg].from_vt > 0 ) { contig_array[ctg].mask = 1; contig_array[getTwinCtg ( ctg )].mask = 1; printf ( "Repeat: contig %d (%d) appears more than once\n", ctg, getTwinCtg ( ctg ) ); } else { contig_array[ctg].from_vt = firstCtg; contig_array[ctg].to_vt = lastCtg; contig_array[ctg].indexInScaf = t + 1; contig_array[getTwinCtg ( ctg )].from_vt = lastTwin; contig_array[getTwinCtg ( ctg )].to_vt = firstTwin; contig_array[getTwinCtg ( ctg )].indexInScaf = solidCounter - t; } } consolidate(); } printf ( "Freezing is done....\n" ); fflush ( stdout ); for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag ) { contig_array[i].flag = 0; } if ( contig_array[i].from_vt == 0 ) { contig_array[i].from_vt = i; contig_array[i].to_vt = i; } cnt = contig_array[i].downwardConnect; while ( cnt ) { 
cnt->used = 0; cnt->checking = 0; cnt = cnt->next; } } } /************** codes below this line are for pulling the scaffolds out ************/ void output1gap ( FILE * fo, int max_steps ) { int i, len, seg; len = seg = 0; for ( i = 0; i < max_steps - 1; i++ ) { if ( found_routes[0][i + 1] == 0 ) { break; } len += contig_array[found_routes[0][i]].length; seg++; } fprintf ( fo, "GAP %d %d", len, seg ); for ( i = 0; i < max_steps - 1; i++ ) { if ( found_routes[0][i + 1] == 0 ) { break; } fprintf ( fo, " %d", found_routes[0][i] ); } fprintf ( fo, "\n" ); } static int weakCounter; static boolean printCnts ( FILE * fp, unsigned int ctg ) { CONNECT * cnt = contig_array[ctg].downwardConnect; boolean flag = 0, ret = 0; unsigned int bal_ctg = getTwinCtg ( ctg ); unsigned int linkCtg; if ( isSameAsTwin ( ctg ) ) { return ret; } CONNECT * bindCnt = getBindCnt ( ctg ); if ( bindCnt && bindCnt->bySmall && bindCnt->weakPoint ) { weakCounter++; fprintf ( fp, "\tWP" ); ret = 1; } while ( cnt ) { if ( cnt->weight && !cnt->inherit ) { if ( !flag ) { flag = 1; fprintf ( fp, "\t#DOWN " ); } linkCtg = cnt->contigID; if ( isLargerThanTwin ( linkCtg ) ) { linkCtg = getTwinCtg ( linkCtg ); } fprintf ( fp, "%d:%d:%d ", index_array[linkCtg], cnt->weight, cnt->gapLen ); } cnt = cnt->next; } flag = 0; cnt = contig_array[bal_ctg].downwardConnect; while ( cnt ) { if ( cnt->weight && !cnt->inherit ) { if ( !flag ) { flag = 1; fprintf ( fp, "\t#UP " ); } linkCtg = cnt->contigID; if ( isLargerThanTwin ( linkCtg ) ) { linkCtg = getTwinCtg ( linkCtg ); } fprintf ( fp, "%d:%d:%d ", index_array[linkCtg], cnt->weight, cnt->gapLen ); } cnt = cnt->next; } fprintf ( fp, "\n" ); return ret; } void scaffolding ( unsigned int len_cut, char * outfile ) { unsigned int prev_ctg, ctg, bal_ctg, *length_array, count = 0, num_lctg = 0; unsigned int i, max_steps = 5; int num5, num3, j, len, flag, num_route, gap_c = 0; short gap = 0; long long sum = 0, N50, N90; FILE * fp, *fo = NULL; char name[256]; CONNECT * cnt, 
*prevCNT, *nextCnt; boolean excep, weak; weakCounter = 0; so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) ); found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) ); for ( j = 0; j < max_n_routes; j++ ) { found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) ); } length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with original index: index_array[i] orig2new = 0; sprintf ( name, "%s.scaf", outfile ); fp = ckopen ( name, "w" ); sprintf ( name, "%s.scaf_gap", outfile ); fo = ckopen ( name, "w" ); scaf3 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen >= len_cut ) { num_lctg++; } else { continue; } if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; //printf("%d",i); * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; bal_ctg = getTwinCtg ( ctg ); contig_array[bal_ctg].flag = 1; len = contig_array[i].length; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception --- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { 
break; } setConnectUsed ( ctg, cnt->contigID, 1 ); * ( int * ) darrayPut ( gap5, num5 - 1 ) = cnt->gapLen; ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; //printf("->%d",ctg); } //printf("\n"); ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } //printf("%d",i); //fflush(stdout); cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception -- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; //printf("<-%d",bal_ctg); //fflush(stdout); * ( int * ) darrayPut ( gap3, num3 ) = cnt->gapLen; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } //printf("\n"); len += overlaplen; sum += len; length_array[count++] = len; if ( num5 + num3 < 1 ) { printf ( "no scaffold created for contig %d\n", i ); continue; } fprintf ( fp, ">scaffold%d %d %d\n", count, num5 + num3, len ); fprintf ( fo, ">scaffold%d %d %d\n", count, num5 + num3, len ); len = prev_ctg = 0; for ( j = num3 - 1; j >= 0; j-- ) { if ( !isLargerThanTwin ( * ( unsigned int * ) darrayGet ( scaf3, j ) ) ) { fprintf ( fp, "%-10d %-10d + %d " , index_array[* ( unsigned int * ) darrayGet ( scaf3, j )], len, contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf3, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } else { fprintf ( fp, 
"%-10d %-10d - %d " , index_array[getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf3, j ) )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf3, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf3, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { output1gap ( fo, max_steps ); gap_c++; } } fprintf ( fo, "%-10d %-10d\n", * ( unsigned int * ) darrayGet ( scaf3, j ), len ); len += contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + * ( int * ) darrayGet ( gap3, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf3, j ); gap = * ( int * ) darrayGet ( gap3, j ) > 0 ? * ( int * ) darrayGet ( gap3, j ) : 0; } for ( j = 0; j < num5; j++ ) { if ( !isLargerThanTwin ( * ( unsigned int * ) darrayGet ( scaf5, j ) ) ) { fprintf ( fp, "%-10d %-10d + %d " , index_array[* ( unsigned int * ) darrayGet ( scaf5, j )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf5, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } else { fprintf ( fp, "%-10d %-10d - %d " , index_array[getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, j ) )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf5, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf5, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { output1gap ( fo, max_steps ); gap_c++; } } fprintf ( fo, "%-10d %-10d\n", * ( unsigned int * ) darrayGet ( scaf5, j ), len ); if ( j < num5 - 1 ) { 
len += contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + * ( int * ) darrayGet ( gap5, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf5, j ); gap = * ( int * ) darrayGet ( gap5, j ) > 0 ? * ( int * ) darrayGet ( gap5, j ) : 0; } } } freeDarray ( scaf3 ); freeDarray ( scaf5 ); freeDarray ( gap3 ); freeDarray ( gap5 ); fclose ( fp ); fclose ( fo ); printf ( "\nthe final rank" ); printf ( "\n%d scaffolds from %d contigs sum up %lldbp, with average length %lld, %d gaps filled\n" , count, num_lctg / 2, sum, sum / count, gap_c ); //output singleton for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen < len_cut || contig_array[i].flag ) { continue; } length_array[count++] = contig_array[i].length; sum += contig_array[i].length; if ( isSmallerThanTwin ( i ) ) { i++; } } // calculate N50/N90 printf ( "%d scaffolds&singleton sum up %lldbp, with average length %lld\n" , count, sum, sum / count ); qsort ( length_array, count, sizeof ( length_array[0] ), cmp_int ); printf ( "the longest is %dbp,", length_array[count - 1] ); N50 = sum * 0.5; N90 = sum * 0.9; sum = flag = 0; for ( j = count - 1; j >= 0; j-- ) { sum += length_array[j]; if ( !flag && sum >= N50 ) { printf ( "scaffold N50 is %d bp, ", length_array[j] ); flag++; } if ( sum >= N90 ) { printf ( "scaffold N90 is %d bp\n", length_array[j] ); break; } } printf ( "Found %d weak points in scaffolds\n", weakCounter ); fflush ( stdout ); free ( ( void * ) length_array ); for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } void scaffold_count ( unsigned int len_cut ) { static DARRAY * scaf3, *scaf5; static DARRAY * gap3, *gap5; unsigned int prev_ctg, ctg, bal_ctg, *length_array, count = 0, num_lctg = 0; unsigned int i, max_steps = 5; int num5, num3, j, len, flag, num_route, gap_c = 0; short gap = 0; long long sum = 0, N50, N90; CONNECT * cnt, *prevCNT, *nextCnt; boolean excep; 
so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) ); found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) ); for ( j = 0; j < max_n_routes; j++ ) { found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) ); } length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with original index: index_array[i] orig2new = 0; scaf3 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen >= len_cut ) { num_lctg++; } else { continue; } if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; //printf("%d",i); * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; bal_ctg = getTwinCtg ( ctg ); contig_array[bal_ctg].flag = 1; len = contig_array[i].length; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception --- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); * ( int * ) darrayPut ( gap5, num5 - 1 ) = cnt->gapLen; ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; len += 
cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; //printf("->%d",ctg); } //printf("\n"); ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } //printf("%d",i); //fflush(stdout); cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception -- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; //printf("<-%d",bal_ctg); //fflush(stdout); * ( int * ) darrayPut ( gap3, num3 ) = cnt->gapLen; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } //printf("\n"); len += overlaplen; sum += len; length_array[count++] = len; if ( num5 + num3 < 1 ) { printf ( "no scaffold created for contig %d\n", i ); continue; } len = prev_ctg = 0; for ( j = num3 - 1; j >= 0; j-- ) { if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf3, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { gap_c++; } } len += contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + * ( int * ) darrayGet ( gap3, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf3, j ); gap = * ( int * ) darrayGet ( gap3, j ) > 0 ? 
* ( int * ) darrayGet ( gap3, j ) : 0; } for ( j = 0; j < num5; j++ ) { if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf5, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { gap_c++; } } if ( j < num5 - 1 ) { len += contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + * ( int * ) darrayGet ( gap5, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf5, j ); gap = * ( int * ) darrayGet ( gap5, j ) > 0 ? * ( int * ) darrayGet ( gap5, j ) : 0; } } } freeDarray ( scaf3 ); freeDarray ( scaf5 ); freeDarray ( gap3 ); freeDarray ( gap5 ); printf ( "\n%d scaffolds from %d contigs sum up %lldbp, with average length %lld, %d gaps filled\n" , count, num_lctg / 2, sum, sum / count, gap_c ); //output singleton for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen < len_cut || contig_array[i].flag ) { continue; } length_array[count++] = contig_array[i].length; sum += contig_array[i].length; if ( isSmallerThanTwin ( i ) ) { i++; } } // calculate N50/N90 printf ( "%d scaffolds&singleton sum up %lldbp, with average length %lld\n" , count, sum, sum / count ); qsort ( length_array, count, sizeof ( length_array[0] ), cmp_int ); printf ( "the longest is %dbp,", length_array[count - 1] ); N50 = sum * 0.5; N90 = sum * 0.9; sum = flag = 0; for ( j = count - 1; j >= 0; j-- ) { sum += length_array[j]; if ( !flag && sum >= N50 ) { printf ( "scaffold N50 is %d bp, ", length_array[j] ); flag++; } if ( sum >= N90 ) { printf ( "scaffold N90 is %d bp\n", length_array[j] ); break; } } fflush ( stdout ); free ( ( void * ) length_array ); for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } static void outputLinks ( FILE * fp, int insertS ) { unsigned int i, bal_ctg, bal_toCtg; CONNECT * cnts, *temp_cnt; //printf("outputLinks, %d contigs\n",num_ctg); 
for ( i = 1; i <= num_ctg; i++ ) { cnts = contig_array[i].downwardConnect; bal_ctg = getTwinCtg ( i ); while ( cnts ) { if ( cnts->weight < 1 ) { cnts = cnts->next; continue; } fprintf ( fp, "%-10d %-10d\t%d\t%d\t%d\n" , i, cnts->contigID, cnts->gapLen, cnts->weight, insertS ); cnts->weight = 0; bal_toCtg = getTwinCtg ( cnts->contigID ); temp_cnt = getCntBetween ( bal_toCtg, bal_ctg ); if ( temp_cnt ) { temp_cnt->weight = 0; } cnts = cnts->next; } } } //use pe info in ascent order void PE2Links ( char * infile ) { char name[256], *line; FILE * fp, *linkF; int i; int flag = 0; unsigned int j; sprintf ( name, "%s.links", infile ); /*linkF = fopen(name,"r"); if(linkF){ printf("file %s exists, skip creating the links...\n",name); fclose(linkF); return; }*/ linkF = ckopen ( name, "w" ); if ( !pes ) { loadPEgrads ( infile ); } sprintf ( name, "%s.readOnContig", infile ); fp = ckopen ( name, "r" ); lineLen = 1024; line = ( char * ) ckalloc ( lineLen * sizeof ( char ) ); fgets ( line, lineLen, fp ); line[0] = '\0'; printf ( "\n" ); for ( i = 0; i < gradsCounter; i++ ) { createCntMemManager(); createCntLookupTable(); newCntCounter = 0; flag += connectByPE_grad ( fp, i, line ); printf ( "%lld new connections\n", newCntCounter / 2 ); if ( !flag ) { destroyConnectMem(); deleteCntLookupTable(); for ( j = 1; j <= num_ctg; j++ ) { contig_array[j].downwardConnect = NULL; } printf ( "\n" ); continue; } flag = 0; outputLinks ( linkF, pes[i].insertS ); destroyConnectMem(); deleteCntLookupTable(); for ( j = 1; j <= num_ctg; j++ ) { contig_array[j].downwardConnect = NULL; } } free ( ( void * ) line ); fclose ( fp ); fclose ( linkF ); printf ( "all PEs attached\n" ); } static int inputLinks ( FILE * fp, int insertS, char * line ) { unsigned int ctg, bal_ctg, toCtg, bal_toCtg; int gap, wt, ins; unsigned int counter = 0, onScafCounter = 0; unsigned int maskCounter = 0; if ( strlen ( line ) ) { sscanf ( line, "%d %d %d %d %d", &ctg, &toCtg, &gap, &wt, &ins ); if ( ins != insertS ) { return 
counter;
		}

		//if(contig_array[ctg].length>=ctg_short&&contig_array[toCtg].length>=ctg_short){
		if ( 1 )
		{
			bal_ctg = getTwinCtg ( ctg );
			bal_toCtg = getTwinCtg ( toCtg );
			add1Connect ( ctg, toCtg, gap, wt, 0 );
			add1Connect ( bal_toCtg, bal_ctg, gap, wt, 0 );
			counter++;

			if ( contig_array[ctg].mask || contig_array[toCtg].mask )
			{
				maskCounter++;
			}

			if ( insertS > 1000 &&
			        contig_array[ctg].from_vt == contig_array[toCtg].from_vt && // on the same scaff
			        contig_array[ctg].indexInScaf < contig_array[toCtg].indexInScaf )
			{
				add1LongPEcov ( ctg, toCtg, wt );
				onScafCounter++;
			}
		}
	}

	while ( fgets ( line, lineLen, fp ) != NULL )
	{
		sscanf ( line, "%d %d %d %d %d", &ctg, &toCtg, &gap, &wt, &ins );

		if ( ins != insertS )
			//if(ins>insertS)
		{
			break;
		}

		/*
		if(contig_array[ctg].length<ctg_short||contig_array[toCtg].length<ctg_short)
		    continue;
		*/
		if ( insertS > 1000 &&
		        contig_array[ctg].from_vt == contig_array[toCtg].from_vt && // on the same scaff
		        contig_array[ctg].indexInScaf < contig_array[toCtg].indexInScaf )
		{
			add1LongPEcov ( ctg, toCtg, wt );
			onScafCounter++;
		}

		bal_ctg = getTwinCtg ( ctg );
		bal_toCtg = getTwinCtg ( toCtg );
		add1Connect ( ctg, toCtg, gap, wt, 0 );
		add1Connect ( bal_toCtg, bal_ctg, gap, wt, 0 );
		counter++;

		if ( contig_array[ctg].mask || contig_array[toCtg].mask )
		{
			maskCounter++;
		}
	}

	printf ( "%d link to masked contigs, %d links on a single scaff\n", maskCounter, onScafCounter );
	return counter;
}

//use linkage info in ascending order
void Links2Scaf ( char * infile )
{
	char name[256], *line;
	FILE * fp;
	int i, j = 1, lib_n = 0, cutoff_sum = 0;
	int flag = 0, flag2;
	boolean downS, nonLinear = 0, smallPE = 0, isPrevSmall = 0, markSmall;

	if ( !pes )
	{
		loadPEgrads ( infile );
	}

	sprintf ( name, "%s.links", infile );
	fp = ckopen ( name, "r" );
	createCntMemManager();
	createCntLookupTable();
	lineLen = 1024;
	line = ( char * ) ckalloc ( lineLen * sizeof ( char ) );
	fgets ( line, lineLen, fp );
	line[0] = '\0';
	solidArray = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
	tempArray = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
	scaf3 =
( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); weakPE = 3; //0531 printf ( "\n" ); for ( i = 0; i < gradsCounter; i++ ) { if ( pes[i].insertS < 1000 ) { isPrevSmall = 1; } else if ( pes[i].insertS > 1000 && isPrevSmall ) { smallScaf(); isPrevSmall = 0; } flag2 = inputLinks ( fp, pes[i].insertS, line ); printf ( "Insert size %d: %d links input\n", pes[i].insertS, flag2 ); if ( flag2 ) { lib_n++; cutoff_sum += pes[i].pair_num_cut; } flag += flag2; if ( !flag ) { printf ( "\n" ); continue; } if ( i == gradsCounter - 1 || pes[i + 1].rank != pes[i].rank ) { flag = nonLinear = downS = markSmall = 0; if ( pes[i].insertS > 1000 && pes[i].rank > 1 ) { downS = 1; } if ( pes[i].insertS <= 1000 ) { smallPE = 1; } if ( pes[i].insertS >= 1000 ) { ins_size_var = 50; OverlapPercent = 0.05; } else if ( pes[i].insertS >= 300 ) { ins_size_var = 30; OverlapPercent = 0.05; } else { ins_size_var = 20; OverlapPercent = 0.05; } if ( pes[i].insertS > 1000 ) { weakPE = 5; } //static_f = 1; if ( lib_n > 0 ) { weakPE = weakPE < cutoff_sum / lib_n ? 
cutoff_sum / lib_n : weakPE;
				lib_n = cutoff_sum = 0;
			}

			printf ( "Cutoff for number of pairs to make a reliable connection: %d\n", weakPE );

			if ( i == gradsCounter - 1 )
			{
				nonLinear = 1;
			}

			if ( i == gradsCounter - 1 && !isPrevSmall && smallPE )
			{
				detectBreakScaf();
			}

			ordering ( 1, downS, nonLinear, infile );

			if ( i == gradsCounter - 1 )
			{
				recoverMask();
			}
			else
			{
				printf ( "\nthe %d rank", j++ );
				scaffold_count ( 100 );
				printf ( "\n" );
			}
		}
	}

	freeDarray ( tempArray );
	freeDarray ( solidArray );
	freeDarray ( scaf3 );
	freeDarray ( scaf5 );
	freeDarray ( gap3 );
	freeDarray ( gap5 );
	free ( ( void * ) line );
	fclose ( fp );
	printf ( "all links loaded\n" );
}

/*
   The code below picks up a general subgraph in which at most one node has
   upstream connections to the rest of the graph and at most one node has
   downstream connections.
*/

// static int nodeCounter

static boolean putNodeInArray ( unsigned int node, int maxNodes, int dis )
{
	if ( contig_array[node].inSubGraph )
	{
		return 1;
	}

	int index = nodeCounter;

	if ( index > maxNodes )
	{
		return 0;
	}

	if ( contig_array[getTwinCtg ( node )].inSubGraph )
	{
		return 0;
	}

	ctg4heapArray[index].ctgID = node;
	ctg4heapArray[index].dis = dis;
	contig_array[node].inSubGraph = 1;
	ctg4heapArray[index].ds_shut4dheap = 0;
	ctg4heapArray[index].us_shut4dheap = 0;
	ctg4heapArray[index].ds_shut4uheap = 0;
	ctg4heapArray[index].us_shut4uheap = 0;
	return 1;
}

static void setInGraph ( boolean flag )
{
	int i;
	int node;
	nodeCounter = nodeCounter > MaxNodeInSub ?
MaxNodeInSub : nodeCounter; for ( i = 1; i <= nodeCounter; i++ ) { node = ctg4heapArray[i].ctgID; if ( node > 0 ) { contig_array[node].inSubGraph = flag; } } } static boolean dispatch1node ( int dis, unsigned int tempNode, int maxNodes, FibHeap * dheap, FibHeap * uheap, int * DmaxDis, int * UmaxDis ) { boolean ret; if ( dis >= 0 ) // put it to Dheap { nodeCounter++; ret = putNodeInArray ( tempNode, maxNodes, dis ); if ( !ret ) { return 0; } insertNodeIntoHeap ( dheap, dis, nodeCounter ); if ( dis > *DmaxDis ) { *DmaxDis = dis; } return 1; } else // put it to Uheap { nodeCounter++; ret = putNodeInArray ( tempNode, maxNodes, dis ); if ( !ret ) { return 0; } insertNodeIntoHeap ( uheap, -dis, nodeCounter ); int temp_len = contig_array[tempNode].length; if ( -dis + temp_len > *UmaxDis ) { *UmaxDis = -dis + contig_array[tempNode].length; } return -1; } return 0; } static boolean canDheapWait ( unsigned int currNode, int dis, int DmaxDis ) { if ( dis < DmaxDis ) { return 0; } else { return 1; } } static boolean workOnDheap ( FibHeap * dheap, FibHeap * uheap, boolean * Dwait, boolean * Uwait, int * DmaxDis, int * UmaxDis, int maxNodes ) { if ( *Dwait ) { return 1; } unsigned int currNode, twin, tempNode; CTGinHEAP * ctgInHeap; int indexInArray; CONNECT * us_cnt, *ds_cnt; int dis0, dis; boolean ret, isEmpty; while ( ( indexInArray = removeNextNodeFromHeap ( dheap ) ) != 0 ) { ctgInHeap = &ctg4heapArray[indexInArray]; currNode = ctgInHeap->ctgID; dis0 = ctgInHeap->dis; isEmpty = IsHeapEmpty ( dheap ); twin = getTwinCtg ( currNode ); us_cnt = ctgInHeap->us_shut4dheap ? 
NULL : contig_array[twin].downwardConnect; while ( us_cnt ) { if ( us_cnt->deleted || us_cnt->mask || contig_array[getTwinCtg ( us_cnt->contigID )].inSubGraph ) { us_cnt = us_cnt->next; continue; } tempNode = getTwinCtg ( us_cnt->contigID ); if ( contig_array[tempNode].inSubGraph ) { us_cnt = us_cnt->next; continue; } dis = dis0 - us_cnt->gapLen - ( int ) contig_array[twin].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret < 0 ) { *Uwait = 0; } us_cnt = us_cnt->next; } if ( nodeCounter > 1 && isEmpty ) { *Dwait = canDheapWait ( currNode, dis0, *DmaxDis ); if ( *Dwait ) { isEmpty = IsHeapEmpty ( dheap ); insertNodeIntoHeap ( dheap, dis0, indexInArray ); ctg4heapArray[indexInArray].us_shut4dheap = 1; if ( isEmpty ) { return 1; } else { continue; } } } ds_cnt = ctgInHeap->ds_shut4dheap ? NULL : contig_array[currNode].downwardConnect; while ( ds_cnt ) { if ( ds_cnt->deleted || ds_cnt->mask || contig_array[ds_cnt->contigID].inSubGraph ) { ds_cnt = ds_cnt->next; continue; } tempNode = ds_cnt->contigID; dis = dis0 + ds_cnt->gapLen + ( int ) contig_array[tempNode].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret < 0 ) { *Uwait = 0; } } // for each downstream connections } // for each node comes off the heap *Dwait = 1; return 1; } static boolean canUheapWait ( unsigned int currNode, int dis, int UmaxDis ) { int temp_len = contig_array[currNode].length; if ( -dis + temp_len < UmaxDis ) { return 0; } else { return 1; } } static boolean workOnUheap ( FibHeap * dheap, FibHeap * uheap, boolean * Dwait, boolean * Uwait, int * DmaxDis, int * UmaxDis, int maxNodes ) { if ( *Uwait ) { return 1; } unsigned int currNode, twin, tempNode; CTGinHEAP * ctgInHeap; int indexInArray; CONNECT * us_cnt, *ds_cnt; int dis0, dis; boolean ret, isEmpty; while ( ( indexInArray = removeNextNodeFromHeap ( uheap ) ) != 0 ) { ctgInHeap = 
&ctg4heapArray[indexInArray]; currNode = ctgInHeap->ctgID; dis0 = ctgInHeap->dis; isEmpty = IsHeapEmpty ( uheap ); ds_cnt = ctgInHeap->ds_shut4uheap ? NULL : contig_array[currNode].downwardConnect; while ( ds_cnt ) { if ( ds_cnt->deleted || ds_cnt->mask || contig_array[ds_cnt->contigID].inSubGraph ) { ds_cnt = ds_cnt->next; continue; } tempNode = ds_cnt->contigID; dis = dis0 + ds_cnt->gapLen + contig_array[tempNode].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret > 0 ) { *Dwait = 0; } } // for each downstream connections if ( nodeCounter > 1 && isEmpty ) { *Uwait = canUheapWait ( currNode, dis0, *UmaxDis ); if ( *Uwait ) { isEmpty = IsHeapEmpty ( uheap ); insertNodeIntoHeap ( uheap, dis0, indexInArray ); ctg4heapArray[indexInArray].ds_shut4uheap = 1; if ( isEmpty ) { return 1; } else { continue; } } } twin = getTwinCtg ( currNode ); us_cnt = ctgInHeap->us_shut4uheap ? NULL : contig_array[twin].downwardConnect; while ( us_cnt ) { if ( us_cnt->deleted || us_cnt->mask || contig_array[getTwinCtg ( us_cnt->contigID )].inSubGraph ) { us_cnt = us_cnt->next; continue; } tempNode = getTwinCtg ( us_cnt->contigID ); if ( contig_array[tempNode].inSubGraph ) { us_cnt = us_cnt->next; continue; } dis = dis0 - us_cnt->gapLen - contig_array[twin].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret > 0 ) { *Dwait = 1; } us_cnt = us_cnt->next; } } // for each node comes off the heap *Uwait = 1; return 1; } static boolean pickUpGeneralSubgraph ( unsigned int node1, int maxNodes ) { FibHeap * Uheap = newFibHeap(); // heap for upstream contigs to node1 FibHeap * Dheap = newFibHeap(); int UmaxDis; // max distance upstream to node1 int DmaxDis; boolean Uwait; // wait signal for Uheap boolean Dwait; int dis; boolean ret; //initiate: node1 is put to array once, and to both Dheap and Uheap dis = 0; nodeCounter = 1; 
putNodeInArray ( node1, maxNodes, dis ); insertNodeIntoHeap ( Dheap, dis, nodeCounter ); ctg4heapArray[nodeCounter].us_shut4dheap = 1; Dwait = 0; DmaxDis = 0; insertNodeIntoHeap ( Uheap, dis, nodeCounter ); ctg4heapArray[nodeCounter].ds_shut4uheap = 1; Uwait = 1; UmaxDis = contig_array[node1].length; while ( 1 ) { ret = workOnDheap ( Dheap, Uheap, &Dwait, &Uwait, &DmaxDis, &UmaxDis, maxNodes ); if ( !ret ) { setInGraph ( 0 ); destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 0; } ret = workOnUheap ( Dheap, Uheap, &Dwait, &Uwait, &DmaxDis, &UmaxDis, maxNodes ); if ( !ret ) { setInGraph ( 0 ); destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 0; } if ( Uwait && Dwait ) { destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 1; } } } static int cmp_ctg ( const void * a, const void * b ) { CTGinHEAP * A, *B; A = ( CTGinHEAP * ) a; B = ( CTGinHEAP * ) b; if ( A->dis > B->dis ) { return 1; } else if ( A->dis == B->dis ) { return 0; } else { return -1; } } static boolean checkEligible() { unsigned int firstNode = ctg4heapArray[1].ctgID; unsigned int twin; int i; boolean flag = 0; //check if the first node has incoming link from twin of any node in subgraph // or it has multi outgoing links bound to incoming links twin = getTwinCtg ( firstNode ); CONNECT * ite_cnt = contig_array[twin].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( contig_array[ite_cnt->contigID].inSubGraph ) { /* if(firstNode==3693) printf("eligible link %d -> %d\n",twin,ite_cnt->contigID); */ return 0; } if ( ite_cnt->prevInScaf ) { if ( flag ) { return 0; } flag = 1; } ite_cnt = ite_cnt->next; } //check if the last node has outgoing link to twin of any node in subgraph // or it has multi outgoing links bound to incoming links unsigned int lastNode = ctg4heapArray[nodeCounter].ctgID; ite_cnt = contig_array[lastNode].downwardConnect; flag = 0; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = 
ite_cnt->next; continue; } twin = getTwinCtg ( ite_cnt->contigID ); if ( contig_array[twin].inSubGraph ) { /* if(firstNode==3693) printf("eligible link %d -> %d\n",lastNode,ite_cnt->contigID); */ return 0; } if ( ite_cnt->prevInScaf ) { if ( flag ) { return 0; } flag = 1; } ite_cnt = ite_cnt->next; } //check if any node has outgoing link to node outside the subgraph for ( i = 1; i < nodeCounter; i++ ) { ite_cnt = contig_array[ctg4heapArray[i].ctgID].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( !contig_array[ite_cnt->contigID].inSubGraph ) { /* printf("eligible check: ctg %d links to ctg %d\n", ctg4heapArray[i].ctgID,ite_cnt->contigID); */ return 0; } ite_cnt = ite_cnt->next; } } //check if any node has incoming link from node outside the subgraph for ( i = 2; i <= nodeCounter; i++ ) { twin = getTwinCtg ( ctg4heapArray[i].ctgID ); ite_cnt = contig_array[twin].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( !contig_array[getTwinCtg ( ite_cnt->contigID )].inSubGraph ) { /* printf("eligible check: ctg %d links to ctg %d\n", ctg4heapArray[i].ctgID,ite_cnt->contigID); */ return 0; } ite_cnt = ite_cnt->next; } } return 1; } //put nodes in sub-graph in a line static void arrangeNodes_general() { int i, gap; CONNECT * ite_cnt, *temp_cnt, *bal_cnt, *prev_cnt, *next_cnt; unsigned int node1, node2; unsigned int bal_nd1, bal_nd2; //delete original connections for ( i = 1; i <= nodeCounter; i++ ) { node1 = ctg4heapArray[i].ctgID; ite_cnt = contig_array[node1].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->mask || ite_cnt->deleted || !contig_array[ite_cnt->contigID].inSubGraph ) { ite_cnt = ite_cnt->next; continue; } ite_cnt->deleted = 1; setNextInScaf ( ite_cnt, NULL ); setPrevInScaf ( ite_cnt, 0 ); ite_cnt = ite_cnt->next; } bal_nd1 = getTwinCtg ( node1 ); ite_cnt = contig_array[bal_nd1].downwardConnect; while ( ite_cnt ) { 
if ( ite_cnt->mask || ite_cnt->deleted || !contig_array[getTwinCtg ( ite_cnt->contigID )].inSubGraph ) { ite_cnt = ite_cnt->next; continue; } ite_cnt->deleted = 1; setNextInScaf ( ite_cnt, NULL ); setPrevInScaf ( ite_cnt, 0 ); ite_cnt = ite_cnt->next; } } //create new connections prev_cnt = next_cnt = NULL; for ( i = 1; i < nodeCounter; i++ ) { node1 = ctg4heapArray[i].ctgID; node2 = ctg4heapArray[i + 1].ctgID; bal_nd1 = getTwinCtg ( node1 ); bal_nd2 = getTwinCtg ( node2 ); gap = ctg4heapArray[i + 1].dis - ctg4heapArray[i].dis - contig_array[node2].length; temp_cnt = getCntBetween ( node1, node2 ); if ( temp_cnt ) { temp_cnt->deleted = 0; temp_cnt->mask = 0; //temp_cnt->gapLen = gap; bal_cnt = getCntBetween ( bal_nd2, bal_nd1 ); bal_cnt->deleted = 0; bal_cnt->mask = 0; //bal_cnt->gapLen = gap; } else { temp_cnt = allocateCN ( node2, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( node1, temp_cnt ); } temp_cnt->next = contig_array[node1].downwardConnect; contig_array[node1].downwardConnect = temp_cnt; bal_cnt = allocateCN ( bal_nd1, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( bal_nd2, bal_cnt ); } bal_cnt->next = contig_array[bal_nd2].downwardConnect; contig_array[bal_nd2].downwardConnect = bal_cnt; } if ( prev_cnt ) { setNextInScaf ( prev_cnt, temp_cnt ); setPrevInScaf ( temp_cnt, 1 ); } if ( next_cnt ) { setNextInScaf ( bal_cnt, next_cnt ); setPrevInScaf ( next_cnt, 1 ); } prev_cnt = temp_cnt; next_cnt = bal_cnt; } //re-binding connection at both ends bal_nd2 = getTwinCtg ( ctg4heapArray[1].ctgID ); ite_cnt = contig_array[bal_nd2].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( ite_cnt->prevInScaf ) { break; } ite_cnt = ite_cnt->next; } if ( ite_cnt ) { bal_nd1 = ite_cnt->contigID; node1 = getTwinCtg ( bal_nd1 ); node2 = ctg4heapArray[1].ctgID; temp_cnt = checkConnect ( node1, node2 ); bal_cnt = ite_cnt; next_cnt = checkConnect ( ctg4heapArray[1].ctgID, ctg4heapArray[2].ctgID ); 
prev_cnt = checkConnect ( getTwinCtg ( ctg4heapArray[2].ctgID ), getTwinCtg ( ctg4heapArray[1].ctgID ) ); if ( temp_cnt ) { setNextInScaf ( temp_cnt, next_cnt ); setPrevInScaf ( temp_cnt->nextInScaf, 0 ); setPrevInScaf ( next_cnt, 1 ); setNextInScaf ( prev_cnt, bal_cnt ); } } node1 = ctg4heapArray[nodeCounter].ctgID; ite_cnt = contig_array[node1].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( ite_cnt->prevInScaf ) { break; } ite_cnt = ite_cnt->next; } if ( ite_cnt ) { node2 = ite_cnt->contigID; bal_nd1 = getTwinCtg ( node1 ); bal_nd2 = getTwinCtg ( node2 ); temp_cnt = ite_cnt; bal_cnt = checkConnect ( bal_nd2, bal_nd1 ); next_cnt = checkConnect ( getTwinCtg ( ctg4heapArray[nodeCounter].ctgID ), getTwinCtg ( ctg4heapArray[nodeCounter - 1].ctgID ) ); prev_cnt = checkConnect ( ctg4heapArray[nodeCounter - 1].ctgID, ctg4heapArray[nodeCounter].ctgID ); setNextInScaf ( prev_cnt, temp_cnt ); setNextInScaf ( bal_cnt, next_cnt ); setPrevInScaf ( next_cnt, 1 ); } } //check if contigs next to each other have reasonable overlap boolean checkOverlapInBetween_general ( double tolerance ) { int i, gap; unsigned int node; int lenSum, lenOlp; lenSum = lenOlp = 0; for ( i = 1; i <= nodeCounter; i++ ) { node = ctg4heapArray[i].ctgID; lenSum += contig_array[node].length; } if ( lenSum < 1 ) { return 1; } for ( i = 1; i < nodeCounter; i++ ) { gap = ctg4heapArray[i + 1].dis - ctg4heapArray[i].dis - contig_array[ctg4heapArray[i + 1].ctgID].length; if ( -gap > 0 ) { lenOlp += -gap; } //if(-gap>ins_size_var) if ( ( double ) lenOlp / lenSum > tolerance ) { return 0; } } return 1; } //check if there's any connect indicates the opposite order between nodes in sub-graph static boolean checkConflictCnt_general ( double tolerance ) { int i, j; int supportCounter = 0; int objectCounter = 0; CONNECT * cnt; for ( i = 1; i < nodeCounter; i++ ) { for ( j = i + 1; j <= nodeCounter; j++ ) { 
			//cnt=getCntBetween(nodesInSubInOrder[j],nodesInSubInOrder[i]);
			cnt = checkConnect ( ctg4heapArray[i].ctgID, ctg4heapArray[j].ctgID );

			if ( cnt )
			{
				supportCounter += cnt->weight;
			}

			cnt = checkConnect ( ctg4heapArray[j].ctgID, ctg4heapArray[i].ctgID );

			if ( cnt )
			{
				objectCounter += cnt->weight;
			}

			//return 1;
		}
	}

	if ( supportCounter < 1 )
	{
		return 1;
	}

	if ( ( double ) objectCounter / supportCounter < tolerance )
	{
		return 0;
	}

	return 1;
}

// turn a sub-graph into a linear structure
static void general_linearization ( boolean strict )
{
	unsigned int i;
	int subCounter = 0;
	int out_num;
	boolean flag;
	int conflCounter = 0, overlapCounter = 0, eligibleCounter = 0;
	double overlapTolerance, conflTolerance;

	for ( i = num_ctg; i > 0; i-- )
	{
		if ( contig_array[i].mask )
		{
			continue;
		}

		out_num = validConnect ( i, NULL );

		if ( out_num < 2 )
		{
			continue;
		}

		//flag = pickSubGraph(i,strict);
		flag = pickUpGeneralSubgraph ( i, MaxNodeInSub );

		if ( !flag )
		{
			continue;
		}

		subCounter++;
		qsort ( &ctg4heapArray[1], nodeCounter, sizeof ( CTGinHEAP ), cmp_ctg );
		flag = checkEligible();

		if ( !flag )
		{
			// note: eligibleCounter counts subgraphs that FAILED checkEligible()
			eligibleCounter++;
			setInGraph ( 0 );
			continue;
		}

		if ( strict )
		{
			overlapTolerance = OverlapPercent;
			conflTolerance = ConflPercent;
		}
		else
		{
			overlapTolerance = 2 * OverlapPercent;
			conflTolerance = 2 * ConflPercent;
		}

		flag = checkOverlapInBetween_general ( overlapTolerance );

		if ( !flag )
		{
			overlapCounter++;
			setInGraph ( 0 );
			continue;
		}

		flag = checkConflictCnt_general ( conflTolerance );

		if ( flag )
		{
			conflCounter++;
			setInGraph ( 0 );
			continue;
		}

		arrangeNodes_general();
		setInGraph ( 0 );
	}

	printf ( "Picked %d subgraphs,%d have conflicting connections,%d have significant overlapping, %d eligible\n",
	         subCounter, conflCounter, overlapCounter, eligibleCounter );
}

/**** the following code detects and breaks down scaffolds at weak points **********/

// mark connections in scaffolds made by small PE
static void smallScaf()
{
	unsigned int i, ctg, bal_ctg, prevCtg;
	int counter = 0;
	CONNECT * bindCnt, *cnt;

	for (
i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } counter++; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; prevCtg = getTwinCtg ( i ); while ( bindCnt ) { ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); bindCnt->bySmall = 1; cnt = getCntBetween ( bal_ctg, prevCtg ); if ( cnt ) { cnt->bySmall = 1; } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCtg = bal_ctg; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( i ); bindCnt = getBindCnt ( ctg ); prevCtg = i; while ( bindCnt ) { ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); bindCnt->bySmall = 1; cnt = getCntBetween ( bal_ctg, prevCtg ); if ( cnt ) { cnt->bySmall = 1; } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCtg = bal_ctg; bindCnt = bindCnt->nextInScaf; } } printf ( "Report from smallScaf: %d scaffolds by smallPE\n", counter ); } static boolean putItem2Sarray ( unsigned int scaf, int wt, DARRAY * SCAF, DARRAY * WT, int counter ) { int i; unsigned int * scafP, *wtP; for ( i = 0; i < counter; i++ ) { scafP = ( unsigned int * ) darrayGet ( SCAF, i ); if ( ( *scafP ) == scaf ) { wtP = ( unsigned int * ) darrayGet ( WT, i ); *wtP = ( *wtP + wt ); return 0; } } scafP = ( unsigned int * ) darrayPut ( SCAF, counter ); wtP = ( unsigned int * ) darrayPut ( WT, counter ); *scafP = scaf; *wtP = wt; return 1; } static int getDSLink2Scaf ( STACK * scafStack, DARRAY * SCAF, DARRAY * WT ) { CONNECT * ite_cnt; unsigned int ctg, targetCtg, *pt; int counter = 0; boolean inc; stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; if ( contig_array[ctg].mask || !contig_array[ctg].downwardConnect ) { continue; } ite_cnt = contig_array[ctg].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || 
ite_cnt->mask || ite_cnt->singleInScaf || ite_cnt->nextInScaf || ite_cnt->prevInScaf || ite_cnt->inherit ) { ite_cnt = ite_cnt->next; continue; } targetCtg = ite_cnt->contigID; if ( contig_array[ctg].from_vt == contig_array[targetCtg].from_vt ) // on the same scaff { ite_cnt = ite_cnt->next; continue; } inc = putItem2Sarray ( contig_array[targetCtg].from_vt, ite_cnt->weight, SCAF, WT, counter ); if ( inc ) { counter++; } ite_cnt = ite_cnt->next; } } return counter; } static int getScaffold ( unsigned int start, STACK * scafStack ) { int len = contig_array[start].length; unsigned int * pt, ctg; emptyStack ( scafStack ); pt = ( unsigned int * ) stackPush ( scafStack ); *pt = start; CONNECT * bindCnt = getBindCnt ( start ); while ( bindCnt ) { ctg = bindCnt->contigID; pt = ( unsigned int * ) stackPush ( scafStack ); *pt = ctg; len += contig_array[ctg].length; bindCnt = bindCnt->nextInScaf; } stackBackup ( scafStack ); return len; } static boolean isLinkReliable ( DARRAY * WT, int count ) { int i; for ( i = 0; i < count; i++ ) if ( * ( int * ) darrayGet ( WT, i ) >= weakPE ) { return 1; } return 0; } static int getWtFromSarray ( DARRAY * SCAF, DARRAY * WT, int count, unsigned int scaf ) { int i; for ( i = 0; i < count; i++ ) if ( * ( unsigned int * ) darrayGet ( SCAF, i ) == scaf ) { return * ( int * ) darrayGet ( WT, i ); } return 0; } static void switch2twin ( STACK * scafStack ) { unsigned int * pt; stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { *pt = getTwinCtg ( *pt ); } } /* ------> scaf1 --- --- -- -- --- scaf2 -- --- --- -- ----> */ static boolean checkScafConsist ( STACK * scafStack1, STACK * scafStack2 ) { DARRAY * downwardTo1 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); // scaf links to those scaffolds DARRAY * downwardTo2 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); DARRAY * downwardWt1 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); // scaf links to 
scaffolds with those wt DARRAY * downwardWt2 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); int linkCount1 = getDSLink2Scaf ( scafStack1, downwardTo1, downwardWt1 ); int linkCount2 = getDSLink2Scaf ( scafStack2, downwardTo2, downwardWt2 ); if ( !linkCount1 || !linkCount2 ) { freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return 1; } boolean flag1 = isLinkReliable ( downwardWt1, linkCount1 ); boolean flag2 = isLinkReliable ( downwardWt2, linkCount2 ); if ( !flag1 || !flag2 ) { freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return 1; } unsigned int scaf; int i, wt1, wt2, ret = 1; for ( i = 0; i < linkCount1; i++ ) { wt1 = * ( int * ) darrayGet ( downwardWt1, i ); if ( wt1 < weakPE ) { continue; } scaf = * ( unsigned int * ) darrayGet ( downwardTo1, i ); wt2 = getWtFromSarray ( downwardTo2, downwardWt2, linkCount2, scaf ); if ( wt2 < 1 ) { //fprintf(stderr,"Inconsistant link to %d\n",scaf); ret = 0; break; } } freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return ret; } static void setBreakPoints ( DARRAY * ctgArray, int count, int weakest, int * start, int * finish ) { int index = weakest - 1; unsigned int thisCtg; unsigned int nextCtg = * ( unsigned int * ) darrayGet ( ctgArray, weakest ); CONNECT * cnt; *start = weakest; while ( index >= 0 ) { thisCtg = * ( unsigned int * ) darrayGet ( ctgArray, index ); cnt = getCntBetween ( thisCtg, nextCtg ); if ( cnt->maxGap > 2 ) { break; } else { *start = index; } nextCtg = thisCtg; index--; } unsigned int prevCtg = * ( unsigned int * ) darrayGet ( ctgArray, weakest + 1 ); *finish = weakest + 1; index = weakest + 2; while ( index < count ) { thisCtg = * ( unsigned int * ) darrayGet ( ctgArray, index ); cnt = getCntBetween ( prevCtg, thisCtg ); if ( cnt->maxGap > 2 ) { break; } else { *finish = index; } prevCtg = 
thisCtg; index++; } } static void changeScafEnd ( STACK * scafStack, unsigned int end ) { unsigned int ctg, *pt; unsigned int start = getTwinCtg ( end ); stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; contig_array[ctg].to_vt = end; contig_array[getTwinCtg ( ctg )].from_vt = start; } } static void changeScafBegin ( STACK * scafStack, unsigned int start ) { unsigned int ctg, *pt; unsigned int end = getTwinCtg ( start ); stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; contig_array[ctg].from_vt = start; contig_array[getTwinCtg ( ctg )].to_vt = end; } } // break down scaffolds poorly supported by longer PE static void detectBreakScaf() { unsigned int i, avgPE, scafLen, len, ctg, bal_ctg, prevCtg, thisCtg; long long peCounter, linkCounter; int num3, num5, weakPoint, tempCounter, j, t, counter = 0; CONNECT * bindCnt, *cnt, *weakCnt; STACK * scafStack1 = ( STACK * ) createStack ( 1000, sizeof ( unsigned int ) ); STACK * scafStack2 = ( STACK * ) createStack ( 1000, sizeof ( unsigned int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } //first scan get the average coverage by longer pe num5 = num3 = peCounter = linkCounter = 0; scafLen = contig_array[i].length; ctg = i; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; while ( bindCnt ) { if ( !bindCnt->bySmall ) { break; } linkCounter++; peCounter += bindCnt->maxGap; ctg = bindCnt->contigID; scafLen += contig_array[ctg].length; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( 
i ); bindCnt = getBindCnt ( ctg ); while ( bindCnt ) { if ( !bindCnt->bySmall ) { break; } linkCounter++; peCounter += bindCnt->maxGap; ctg = bindCnt->contigID; scafLen += contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; bindCnt = bindCnt->nextInScaf; } if ( linkCounter < 1 || scafLen < 5000 ) { continue; } avgPE = peCounter / linkCounter; if ( avgPE < 10 ) { continue; } tempCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); prevCtg = * ( unsigned int * ) darrayGet ( tempArray, 0 ); weakCnt = NULL; weakPoint = 0; len = contig_array[prevCtg].length; for ( t = 1; t < tempCounter; t++ ) { thisCtg = * ( unsigned int * ) darrayGet ( tempArray, t ); if ( len < 2000 ) { len += contig_array[thisCtg].length; prevCtg = thisCtg; continue; } else if ( len > scafLen - 2000 ) { break; } len += contig_array[thisCtg].length; if ( contig_array[prevCtg].from_vt != contig_array[thisCtg].from_vt || contig_array[prevCtg].indexInScaf > contig_array[thisCtg].indexInScaf ) { prevCtg = thisCtg; continue; } cnt = getCntBetween ( prevCtg, thisCtg ); if ( !weakCnt || weakCnt->maxGap > cnt->maxGap ) { weakCnt = cnt; weakPoint = t; } prevCtg = thisCtg; } if ( !weakCnt || ( weakCnt->maxGap > 2 && weakCnt->maxGap > avgPE / 5 ) ) { continue; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, weakPoint - 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, weakPoint ); if ( contig_array[prevCtg].from_vt != contig_array[thisCtg].from_vt || contig_array[prevCtg].indexInScaf > contig_array[thisCtg].indexInScaf ) { printf ( "contig %d and %d not on the same scaff\n", prevCtg, thisCtg ); continue; } setConnectWP ( prevCtg, thisCtg, 1 
); /* fprintf(stderr,"scaffold len %d, avg long pe cov %d (%ld/%ld)\n", scafLen,avgPE,peCounter,linkCounter); fprintf(stderr,"Weak connect (%d) between %d(%dth of %d) and %d\n" ,weakCnt->maxGap,prevCtg,weakPoint-1,tempCounter,thisCtg); */ // set start and end to break down the scaffold int index1, index2; setBreakPoints ( tempArray, tempCounter, weakPoint - 1, &index1, &index2 ); //fprintf(stderr,"break %d ->...-> %d\n",index1,index2); unsigned int start = * ( unsigned int * ) darrayGet ( tempArray, index1 ); unsigned int finish = * ( unsigned int * ) darrayGet ( tempArray, index2 ); int len1 = getScaffold ( getTwinCtg ( start ), scafStack1 ); int len2 = getScaffold ( finish, scafStack2 ); if ( len1 < 2000 || len2 < 2000 ) { continue; } switch2twin ( scafStack1 ); int flag1 = checkScafConsist ( scafStack1, scafStack2 ); switch2twin ( scafStack1 ); switch2twin ( scafStack2 ); int flag2 = checkScafConsist ( scafStack2, scafStack1 ); if ( !flag1 || !flag2 ) { changeScafBegin ( scafStack1, getTwinCtg ( start ) ); changeScafEnd ( scafStack2, getTwinCtg ( finish ) ); //unbind links unsigned int nextCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 + 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 ); cnt = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( thisCtg ) ); if ( cnt->nextInScaf ) { prevCtg = getTwinCtg ( cnt->nextInScaf->contigID ); cnt->nextInScaf->prevInScaf = 0; cnt = getCntBetween ( prevCtg, thisCtg ); cnt->nextInScaf = NULL; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, index2 - 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, index2 ); cnt = getCntBetween ( prevCtg, thisCtg ); if ( cnt->nextInScaf ) { nextCtg = cnt->nextInScaf->contigID; cnt->nextInScaf->prevInScaf = 0; cnt = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( thisCtg ) ); cnt->nextInScaf = NULL; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 ); for ( t = index1 + 1; t <= index2; t++ ) { thisCtg = * ( unsigned int * ) 
darrayGet ( tempArray, t ); cnt = getCntBetween ( prevCtg, thisCtg ); cnt->mask = 1; cnt->nextInScaf = NULL; cnt->prevInScaf = 0; cnt = getCntBetween ( getTwinCtg ( thisCtg ), getTwinCtg ( prevCtg ) ); cnt->mask = 1; cnt->nextInScaf = NULL; cnt->prevInScaf = 0; /* fprintf(stderr,"(%d %d)/(%d %d) ", prevCtg,thisCtg,getTwinCtg(thisCtg),getTwinCtg(prevCtg)); */ prevCtg = thisCtg; } //fprintf(stderr,": BREAKING\n"); counter++; } } freeStack ( scafStack1 ); freeStack ( scafStack2 ); printf ( "Report from checkScaf: %d scaffold segments broken\n", counter ); } static boolean checkSimple ( DARRAY * ctgArray, int count ) { int i; unsigned int ctg; for ( i = 0; i < count; i++ ) { ctg = * ( unsigned int * ) darrayGet ( ctgArray, i ); contig_array[ctg].flag = 0; contig_array[getTwinCtg ( ctg )].flag = 0; } for ( i = 0; i < count; i++ ) { ctg = * ( unsigned int * ) darrayGet ( ctgArray, i ); if ( contig_array[ctg].flag ) { return 0; } contig_array[ctg].flag = 1; contig_array[getTwinCtg ( ctg )].flag = 1; } return 1; } static void checkCircle() { unsigned int i, ctg; CONNECT * cn_temp1; int counter = 0; for ( i = 1; i <= num_ctg; i++ ) { cn_temp1 = contig_array[i].downwardConnect; while ( cn_temp1 ) { if ( cn_temp1->weak || cn_temp1->deleted ) { cn_temp1 = cn_temp1->next; continue; } ctg = cn_temp1->contigID; if ( checkConnect ( ctg, i ) ) { counter++; maskContig ( i, 1 ); maskContig ( ctg, 1 ); } cn_temp1 = cn_temp1->next; } } printf ( "%d circles removed \n", counter ); } SOAPdenovo-V1.05/src/31mer/output_contig.c000644 000765 000024 00000015644 11530651532 020374 0ustar00Aquastaff000000 000000 /* * 31mer/output_contig.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static char * kmerSeq; void output_graph ( char * outfile ) { char name[256]; FILE * fp; unsigned int i, bal_i; sprintf ( name, "%s.edge.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = num_ed; i > 0; i-- ) { if ( edge_array[i].deleted ) { continue; } /* arcCount(i,&arcNum); if(arcNum<1) continue; */ bal_i = getTwinEdge ( i ); /* arcCount(bal_i,&arcNum); if(arcNum<1) continue; */ fprintf ( fp, "\tV%d -> V%d[label =\"%d(%d)\"];\n", edge_array[i].from_vt, edge_array[i].to_vt, i, edge_array[i].length ); } fprintf ( fp, "}\n" ); fclose ( fp ); } static void output_1contig ( int id, EDGE * edge, FILE * fp, boolean tip ) { int i; Kmer kmer; char ch; fprintf ( fp, ">%d length %d cvg_%.1f_tip_%d\n", id, edge->length + overlaplen, ( double ) edge->cvg / 10, tip ); //fprintf(fp,">%d\n",id); kmer = vt_array[edge->from_vt].kmer; for ( i = overlaplen - 1; i >= 0; i-- ) { ch = kmer & 3; kmer >>= 2; kmerSeq[i] = ch; } for ( i = 0; i < overlaplen; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) kmerSeq[i] ) ); } for ( i = 0; i < edge->length; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( edge->seq, i ) ) ); if ( ( i + overlaplen + 1 ) % 100 == 0 ) { fprintf ( fp, "\n" ); } } fprintf ( fp, "\n" ); } int cmp_int ( const void * a, const void * b ) { int * A, *B; A = ( int * ) a; B = ( int * ) b; if ( *A > *B ) { return 1; } else if ( *A == *B ) { return 0; } else { return -1; } } int cmp_edge ( const void * a, const void * b ) { EDGE * A, *B; A = ( EDGE * 
) a;
    B = ( EDGE * ) b;

    if ( A->length > B->length )
    {
        return 1;
    }
    else if ( A->length == B->length )
    {
        return 0;
    }
    else
    {
        return -1;
    }
}

void output_contig ( EDGE * ed_array, unsigned int ed_num, char * outfile, int cut_len )
{
    char temp[256];
    FILE * fp, *fp_contig;
    int flag, count, len_c;
    int signI;
    unsigned int i;
    long long sum = 0, N90, N50;
    unsigned int * length_array;
    boolean tip;
    sprintf ( temp, "%s.contig", outfile );
    fp = ckopen ( temp, "w" );
    qsort ( &ed_array[1], ed_num, sizeof ( EDGE ), cmp_edge );
    length_array = ( unsigned int * ) ckalloc ( ed_num * sizeof ( unsigned int ) );
    kmerSeq = ( char * ) ckalloc ( overlaplen * sizeof ( char ) );
    //first scan for number counting
    count = len_c = 0;

    for ( i = 1; i <= ed_num; i++ )
    {
        if ( ( ed_array[i].length + overlaplen ) >= len_bar )
        {
            length_array[len_c++] = ed_array[i].length + overlaplen;
        }

        if ( ed_array[i].length < 1 || ed_array[i].deleted )
        {
            continue;
        }

        count++;

        if ( EdSmallerThanTwin ( i ) )
        {
            i++;
        }
    }

    sum = 0;

    for ( signI = len_c - 1; signI >= 0; signI-- )
    {
        sum += length_array[signI];
    }

    if ( len_c > 0 )
    {
        printf ( "%d ctgs longer than %d, sum up %lldbp, with average length %lld\n", len_c, len_bar, sum, sum / len_c );
    }

    qsort ( length_array, len_c, sizeof ( length_array[0] ), cmp_int );

    //length_array[len_c - 1] is out of bounds when no contig reaches len_bar
    if ( len_c > 0 )
    {
        printf ( "the longest is %dbp, ", length_array[len_c - 1] );
    }

    N50 = sum * 0.5;
    N90 = sum * 0.9;
    sum = flag = 0;

    for ( signI = len_c - 1; signI >= 0; signI-- )
    {
        sum += length_array[signI];

        if ( !flag && sum >= N50 )
        {
            printf ( "contig N50 is %d bp,", length_array[signI] );
            flag = 1;
        }

        if ( sum >= N90 )
        {
            printf ( "contig N90 is %d bp\n", length_array[signI] );
            break;
        }
    }

    //fprintf(fp,"Number %d\n",count);
    for ( i = 1; i <= ed_num; i++ )
    {
        //if(ed_array[i].multi!=1||ed_array[i].length<1||(ed_array[i].length+overlaplen)length %d,", edge->length );
        print_kmer ( fp, vt_array[edge->from_vt].kmer, ',' );
        print_kmer ( fp, vt_array[edge->to_vt].kmer, ',' );

        if ( EdSmallerThanTwin ( i ) )
        {
            fprintf ( fp, "1," );
        }
        else if (
EdLargerThanTwin ( i ) ) { fprintf ( fp, "-1," ); } else { fprintf ( fp, "0," ); } fprintf ( fp, "%d\n", edge->cvg ); } fclose ( fp ); } void output_heavyArcs ( char * outfile ) { unsigned int i, j; char name[256]; FILE * outfp; ARC * parc; sprintf ( name, "%s.Arc", outfile ); outfp = ckopen ( name, "w" ); for ( i = 1; i <= num_ed; i++ ) { parc = edge_array[i].arcs; if ( !parc ) { continue; } j = 0; fprintf ( outfp, "%u", i ); while ( parc ) { fprintf ( outfp, " %u %u", parc->to_ed, parc->multiplicity ); if ( ( ++j ) % 10 == 0 ) { fprintf ( outfp, "\n%u", i ); } parc = parc->next; } fprintf ( outfp, "\n" ); } fclose ( outfp ); } SOAPdenovo-V1.05/src/31mer/output_pregraph.c000644 000765 000024 00000005060 11530651532 020710 0ustar00Aquastaff000000 000000 /* * 31mer/output_pregraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include "newhash.h" #include #include static int outvCounter = 0; //after this LINKFLAGFILTER in the Kmer is destroyed static void output1vt ( kmer_t * node1, FILE * fp ) { if ( !node1 ) { return; } if ( ! ( node1->linear ) && ! 
( node1->deleted ) ) { outvCounter++; print_kmer ( fp, node1->seq, ' ' ); if ( outvCounter % 8 == 0 ) { fprintf ( fp, "\n" ); } } } void output_vertex ( char * outfile ) { char temp[256]; FILE * fp; int i; kmer_t * node; KmerSet * set; sprintf ( temp, "%s.vertex", outfile ); fp = ckopen ( temp, "w" ); for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { node = set->array + set->iter_ptr; output1vt ( node, fp ); } set->iter_ptr ++; } } fprintf ( fp, "\n" ); printf ( "%d vertex outputed\n", outvCounter ); fclose ( fp ); sprintf ( temp, "%s.preGraphBasic", outfile ); fp = ckopen ( temp, "w" ); fprintf ( fp, "VERTEX %d K %d\n", outvCounter, overlaplen ); fprintf ( fp, "\nEDGEs %d\n", num_ed ); fprintf ( fp, "\nMaxReadLen %d MinReadLen %d MaxNameLen %d\n", maxReadLen4all, minReadLen, maxNameLen ); fclose ( fp ); } void output_1edge ( preEDGE * edge, FILE * fp ) { int i; fprintf ( fp, ">length %d,", edge->length ); print_kmer ( fp, edge->from_node, ',' ); print_kmer ( fp, edge->to_node, ',' ); fprintf ( fp, "cvg %d, %d\n", edge->cvg, edge->bal_edge ); for ( i = 0; i < edge->length; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) edge->seq[i] ) ); if ( ( i + 1 ) % 100 == 0 ) { fprintf ( fp, "\n" ); } } fprintf ( fp, "\n" ); } SOAPdenovo-V1.05/src/31mer/output_scaffold.c000644 000765 000024 00000005107 11530651532 020663 0ustar00Aquastaff000000 000000 /* * 31mer/output_scaffold.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" void output_contig_graph ( char * outfile ) { char name[256]; FILE * fp; unsigned int i; sprintf ( name, "%s.contig.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = num_ctg; i > 0; i-- ) { fprintf ( fp, "\tV%d -> V%d[label =\"%d(%d)\"];\n", contig_array[i].from_vt, contig_array[i].to_vt, i, contig_array[i].length ); } fprintf ( fp, "}\n" ); fclose ( fp ); } void output_scaf ( char * outfile ) { char name[256]; FILE * fp; unsigned int i; CONNECT * connect; boolean flag; sprintf ( name, "%s.scaffold.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = num_ctg; i > 0; i-- ) { //if(contig_array[i].mask||!contig_array[i].downwardConnect) if ( !contig_array[i].downwardConnect ) { continue; } connect = contig_array[i].downwardConnect; while ( connect ) { //if(connect->mask||connect->deleted){ if ( connect->deleted ) { connect = connect->next; continue; } if ( connect->prevInScaf || connect->nextInScaf ) { flag = 1; } else { flag = 0; } if ( !connect->mask ) fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\"];\n" , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length, connect->gapLen, flag, connect->weight ); else fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\", color = red];\n" , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length, connect->gapLen, flag, connect->weight ); connect = connect->next; } } fprintf ( fp, 
"}\n" ); fclose ( fp ); } SOAPdenovo-V1.05/src/31mer/pregraph.c000644 000765 000024 00000011537 11530651532 017276 0ustar00Aquastaff000000 000000 /* * 31mer/pregraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static char shortrdsfile[256]; static char graphfile[256]; static int cutTips = 1; static void display_pregraph_usage(); int call_pregraph ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); initenv ( argc, argv ); if ( overlaplen % 2 == 0 ) { overlaplen++; printf ( "K should be an odd number\n" ); } if ( overlaplen < 13 ) { overlaplen = 13; printf ( "K should not be less than 13\n" ); } else if ( overlaplen > 31 ) { overlaplen = 31; printf ( "K should not be greater than 31\n" ); } time ( &time_bef ); prlRead2HashTable ( shortrdsfile, graphfile ); time ( &time_aft ); printf ( "time spent on pre-graph construction: %ds\n\n", ( int ) ( time_aft - time_bef ) ); printf ( "deLowKmer %d, deLowEdge %d\n", deLowKmer, deLowEdge ); //analyzeTips(hash_table, graphfile); if ( !deLowKmer && cutTips ) { time ( &time_bef ); removeSingleTips(); removeMinorTips(); time ( &time_aft ); printf ( "time spent on cutTipe: %ds\n\n", ( int ) ( time_aft - time_bef ) ); } else { time ( 
&time_bef );
        removeMinorTips();
        time ( &time_aft );
        printf ( "time spent on cutTipe: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    }

    initKmerSetSize = 0;
    //combine each linear part to an edge
    time ( &time_bef );
    kmer2edges ( graphfile );
    time ( &time_aft );
    printf ( "time spent on making edges: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    //map read to edge one by one
    time ( &time_bef );
    prlRead2edge ( shortrdsfile, graphfile );
    time ( &time_aft );
    printf ( "time spent on mapping reads: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    output_vertex ( graphfile );
    free_Sets ( KmerSets, thrd_num );
    free_Sets ( KmerSetsPatch, thrd_num );
    time ( &stop_t );
    printf ( "overall time for lightgraph: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 );
    return 0;
}

/*****************************************************************************
 * Parse command line switches
 *****************************************************************************/
void initenv ( int argc, char ** argv )
{
    int copt;
    int inpseq, outseq;
    extern char * optarg;
    char temp[100];
    optind = 1;
    inpseq = outseq = 0;

    while ( ( copt = getopt ( argc, argv, "a:s:o:K:p:d:DR" ) ) != EOF )
    {
        //printf("get option\n");
        switch ( copt )
        {
            case 's':
                inpseq = 1;
                //field width keeps the copy inside shortrdsfile[256]
                sscanf ( optarg, "%255s", shortrdsfile );
                continue;
            case 'o':
                outseq = 1;
                //field width keeps the copy inside graphfile[256]
                sscanf ( optarg, "%255s", graphfile );
                continue;
            case 'K':
                //field width keeps the copy inside temp[100]
                sscanf ( optarg, "%99s", temp );
                overlaplen = atoi ( temp );
                continue;
            case 'p':
                sscanf ( optarg, "%99s", temp );
                thrd_num = atoi ( temp );
                continue;
            case 'R':
                repsTie = 1;
                continue;
            case 'd':
                sscanf ( optarg, "%99s", temp );
                deLowKmer = atoi ( temp ) >= 0 ?
                            atoi ( temp ) : 0;
                continue;
            case 'D':
                deLowEdge = 1;
                continue;
            case 'a':
                initKmerSetSize = atoi ( optarg );
                break;
            default:

                if ( inpseq == 0 || outseq == 0 )
                {
                    display_pregraph_usage();
                    exit ( -1 );
                }
        }
    }

    if ( inpseq == 0 || outseq == 0 )
    {
        //printf("need more\n");
        display_pregraph_usage();
        exit ( -1 );
    }
}

static void display_pregraph_usage()
{
    printf ( "\npregraph -s readsInfoFile [-d KmerFreqCutoff -R -K kmer -p n_cpu] -o OutputFile\n" );
    printf ( " -s readsInfoFile: file containing information about the Solexa reads\n" );
    printf ( " -a initMemoryAssumption: initial memory assumption, to avoid further reallocation\n" );
    printf ( " -p n_cpu(default 8): number of CPUs to use\n" );
    printf ( " -K kmer(default 21): k value in kmer\n" );
    printf ( " -d KmerFreqCutoff(optional): delete kmers with frequency no larger than KmerFreqCutoff (default 0)\n" );
    printf ( " -R (optional): resolve repeats by reads (default no)\n" );
    printf ( " -o OutputFile: prefix of output file name\n" );
}
SOAPdenovo-V1.05/src/31mer/prlHashCtg.c000644 000765 000024 00000024463 11530651532 017527 0ustar00Aquastaff000000 000000
/*
 * 31mer/prlHashCtg.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//debugging variables
static long long * kmerCounter;

//buffer related variables for chop kmer
static unsigned int read_c;
static char ** rcSeq;
static char * seqBuffer;
static int * lenBuffer;
static unsigned int * indexArray;
static unsigned int * seqBreakers;
static int * ctgIdArray;
static Kmer * firstKmers;

//buffer related variables for splay tree
static unsigned int buffer_size = 10000000;
static unsigned int seq_buffer_size;
static unsigned int max_read_c;
static volatile unsigned int kmer_c;
static Kmer * kmerBuffer, *hashBanBuffer;
static boolean * smallerBuffer;

static void singleKmer ( int t, KmerSet * kset, unsigned int seq_index, unsigned int pos );
static void chopKmer4read ( int t, int threadID );

//worker loop: polls its own signal byte and resets it to 0 when the task is done
static void threadRoutine ( void * para )
{
    PARAMETER * prm;
    unsigned int i;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            //signal 1: insert the buffered kmers that hash to this thread into its KmerSet
            unsigned int seq_index = 0;
            unsigned int pos = 0;

            for ( i = 0; i < kmer_c; i++ )
            {
                if ( seq_index < read_c && indexArray[seq_index + 1] == i )
                {
                    seq_index++; // which sequence this kmer belongs to
                    pos = 0;
                }

                //if((unsigned char)(hashBanBuffer[i]&taskMask)!=id){
                if ( ( unsigned char ) ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    pos++;
                    continue;
                }

                kmerCounter[id + 1]++;
                singleKmer ( i, KmerSets[id], seq_index, pos++ );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            //signal 2: chop every thrd_num-th buffered sequence into kmers
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num != id )
                {
                    continue;
                }

                chopKmer4read ( i, id + 1 );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 3 )
        {
            //signal 3: acknowledge and leave the loop
            * ( prm->selfSignal ) = 0;
            break;
        }

        usleep ( 1 );
    }
}

static void singleKmer ( int t, KmerSet * kset, unsigned int seq_index, unsigned int pos )
{
    boolean flag;
    kmer_t * node;
    flag = put_kmerset ( kset, kmerBuffer[t], 4, 4, &node );
//printf("singleKmer: kmer %llx\n",kmerBuffer[t]); if ( !flag ) { if ( smallerBuffer[t] ) { node->twin = 0; } else { node->twin = 1; }; node->l_links = ctgIdArray[seq_index]; node->r_links = pos; } else { node->deleted = 1; } } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer + seqBreakers[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; Kmer hash_ban, bal_hash_ban; Kmer word, bal_word; int index; word = 0; for ( index = 0; index < overlaplen; index++ ) { word <<= 2; word += src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( word < bal_word ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; hashBanBuffer[index] = hash_ban; smallerBuffer[index++] = 1; } else { bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; hashBanBuffer[index] = bal_hash_ban; smallerBuffer[index++] = 0; } //printf("%dth: %p with %p\n",kmer_c-1,bal_word,bal_hash_ban); for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( word < bal_word ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; hashBanBuffer[index] = hash_ban; smallerBuffer[index++] = 1; 
//printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; hashBanBuffer[index] = bal_hash_ban; smallerBuffer[index++] = 0; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static int getID ( char * name ) { if ( name[0] >= '0' && name[0] <= '9' ) { return atoi ( & ( name[0] ) ); } else { return 0; } } boolean prlContig2nodes ( char * grapfile, int len_cut ) { long long i, num_seq; char name[256], *next_name; FILE * fp; pthread_t threads[thrd_num]; time_t start_t, stop_t; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; int maxCtgLen, minCtgLen, nameLen; unsigned int lenSum, contigId; WORDFILTER = ( ( ( Kmer ) 1 ) << ( 2 * overlaplen ) ) - 1; time ( &start_t ); sprintf ( name, "%s.contig", grapfile ); fp = ckopen ( name, "r" ); maxCtgLen = nameLen = 10; minCtgLen = 1000; num_seq = readseqpar ( &maxCtgLen, &minCtgLen, &nameLen, fp ); printf ( "\nthere're %lld contigs in file: %s, max seq len %d, min seq len %d, max name len %d\n", num_seq, grapfile, maxCtgLen, minCtgLen, nameLen ); maxReadLen = maxCtgLen; fclose ( fp ); time ( &stop_t ); printf ( "time spent on parse contigs file %ds\n", ( int ) ( stop_t - start_t ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); // extract all the EDONs seq_buffer_size = buffer_size * 2; max_read_c = seq_buffer_size / 20; kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); seqBuffer = ( char * ) ckalloc ( 
seq_buffer_size * sizeof ( char ) ); lenBuffer = ( int * ) ckalloc ( max_read_c * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( max_read_c + 1 ) * sizeof ( unsigned int ) ); seqBreakers = ( unsigned int * ) ckalloc ( ( max_read_c + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( int * ) ckalloc ( max_read_c * sizeof ( int ) ); fp = ckopen ( name, "r" ); //node_mem_manager = createMem_manager(EDONBLOCKSIZE,sizeof(EDON)); rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( 1 ) { kmerCounter = ( long long * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( long long ) ); KmerSets = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); for ( i = 0; i < thrd_num; i++ ) { KmerSets[i] = init_kmerset ( 1024, 0.77f ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; kmerCounter[i + 1] = 0; rcSeq[i + 1] = ( char * ) ckalloc ( maxCtgLen * sizeof ( char ) ); } creatThrds ( threads, paras ); } kmer_c = thrdSignal[0] = kmerCounter[0] = 0; time ( &start_t ); read_c = lenSum = i = seqBreakers[0] = indexArray[0] = 0; readseq1by1 ( seqBuffer + seqBreakers[read_c], next_name, & ( lenBuffer[read_c] ), fp, -1 ); while ( !feof ( fp ) ) { contigId = getID ( next_name ); readseq1by1 ( seqBuffer + seqBreakers[read_c], next_name, & ( lenBuffer[read_c] ), fp, 1 ); if ( ( ++i ) % 10000000 == 0 ) { printf ( "--- %lldth contigs\n", i ); } if ( lenBuffer[read_c] < overlaplen + 1 || lenBuffer[read_c] < len_cut ) { contigId = getID ( next_name ); continue; } //printf("len of seq %d is %d, ID %d\n",read_c,lenBuffer[read_c],contigId); ctgIdArray[read_c] = contigId > 0 ? 
contigId : i;
        lenSum += lenBuffer[read_c];
        kmer_c += lenBuffer[read_c] - overlaplen + 1;
        read_c++;
        seqBreakers[read_c] = lenSum;
        indexArray[read_c] = kmer_c;

        //printf("seq %d start at %d\n",read_c,seqBreakers[read_c]);
        if ( read_c == max_read_c || ( lenSum + maxCtgLen ) > seq_buffer_size || ( kmer_c + maxCtgLen - overlaplen + 1 ) > buffer_size )
        {
            kmerCounter[0] += kmer_c;
            sendWorkSignal ( 2, thrdSignal );
            sendWorkSignal ( 1, thrdSignal );
            kmer_c = read_c = lenSum = 0;
        }
    }

    if ( read_c )
    {
        kmerCounter[0] += kmer_c;
        sendWorkSignal ( 2, thrdSignal );
        sendWorkSignal ( 1, thrdSignal );
    }

    sendWorkSignal ( 3, thrdSignal );
    thread_wait ( threads );
    time ( &stop_t );
    printf ( "time spent on hash reads: %ds\n", ( int ) ( stop_t - start_t ) );

    if ( 1 )
    {
        unsigned long long alloCounter = 0;
        unsigned long long allKmerCounter = 0;

        for ( i = 0; i < thrd_num; i++ )
        {
            alloCounter += count_kmerset ( ( KmerSets[i] ) );
            allKmerCounter += kmerCounter[i + 1];
            free ( ( void * ) rcSeq[i + 1] );
        }

        printf ( "%lli nodes allocated, %lli kmer in reads, %lli kmer processed\n"
                 , alloCounter, kmerCounter[0], allKmerCounter );
    }

    free ( ( void * ) rcSeq );
    free ( ( void * ) kmerCounter );
    free ( ( void * ) seqBuffer );
    free ( ( void * ) lenBuffer );
    free ( ( void * ) indexArray );
    free ( ( void * ) seqBreakers );
    free ( ( void * ) ctgIdArray );
    free ( ( void * ) kmerBuffer );
    free ( ( void * ) hashBanBuffer );
    free ( ( void * ) smallerBuffer );
    free ( ( void * ) next_name );
    fclose ( fp );
    return 1;
}

/*
 * 31mer/prlHashReads.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//debugging variables
static long long * tips;
static long long * kmerCounter;
static long long ** kmerFreq;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;
static int * indexArray;

//buffer related variables for splay tree
static int buffer_size = 10000000;
static volatile int kmer_c;
static Kmer * kmerBuffer, *hashBanBuffer;
static char * nextcBuffer, *prevcBuffer;

static void thread_mark ( KmerSet * set, unsigned char thrdID );
static void Mark1in1outNode ( unsigned char * thrdSignal );
static void thread_delow ( KmerSet * set, unsigned char thrdID );
static void deLowCov ( unsigned char * thrdSignal );
static void singleKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void freqStat ( char * outfile );

static void threadRoutine ( void * para )
{
    PARAMETER * prm;
    int i;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            for ( i = 0; i < kmer_c; i++ )
            {
                //if((unsigned char)(magic_seq(hashBanBuffer[i])%thrd_num)!=id)
                //if((kmerBuffer[i]%thrd_num)!=id)
                if ( ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    continue;
                }

                kmerCounter[id + 1]++;
                singleKmer ( i, KmerSets[id] );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num != id )
                {
                    continue;
                }

                chopKmer4read ( i, id + 1 );
            }

            * ( prm->selfSignal ) = 0;
        }
else if ( * ( prm->selfSignal ) == 3 ) { * ( prm->selfSignal ) = 0; break; } else if ( * ( prm->selfSignal ) == 4 ) { thread_mark ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { thread_delow ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } usleep ( 1 ); } } static void singleKmer ( int t, KmerSet * kset ) { kmer_t * pos; put_kmerset ( kset, kmerBuffer[t], prevcBuffer[t], nextcBuffer[t], &pos ); } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; Kmer hash_ban, bal_hash_ban; Kmer word, bal_word; int index; char InvalidCh = 4; word = 0; for ( index = 0; index < overlaplen; index++ ) { word <<= 2; word += src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( word < bal_word ) { hash_ban = hash_kmer ( word ); hashBanBuffer[index] = hash_ban; kmerBuffer[index] = word; prevcBuffer[index] = InvalidCh; nextcBuffer[index++] = src_seq[0 + overlaplen]; } else { bal_hash_ban = hash_kmer ( bal_word ); hashBanBuffer[index] = bal_hash_ban; kmerBuffer[index] = bal_word; prevcBuffer[index] = bal_seq[bal_j - 1]; nextcBuffer[index++] = InvalidCh; } for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + 
overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( word < bal_word ) { hash_ban = hash_kmer ( word ); hashBanBuffer[index] = hash_ban; kmerBuffer[index] = word; prevcBuffer[index] = src_seq[j - 1]; if ( j < len_seq - overlaplen ) { nextcBuffer[index++] = src_seq[j + overlaplen]; } else { nextcBuffer[index++] = InvalidCh; } //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); hashBanBuffer[index] = bal_hash_ban; kmerBuffer[index] = bal_word; if ( bal_j > 0 ) { prevcBuffer[index] = bal_seq[bal_j - 1]; } else { prevcBuffer[index] = InvalidCh; } nextcBuffer[index++] = bal_seq[bal_j + overlaplen]; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } boolean prlRead2HashTable ( char * libfile, char * outfile ) { long long i; char * next_name, name[256]; FILE * fo; time_t start_t, stop_t; int maxReadNum; int libNo; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; boolean flag, pairs = 0; WORDFILTER = ( ( ( Kmer ) 1 ) << ( 2 * overlaplen ) ) - 1; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } maxReadLen4all = maxReadLen; printf ( "In %s, %d libs, max seq len %d, max name len %d\n\n", libfile, num_libs, maxReadLen, maxNameLen ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); prevcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) 
); nextcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); //printf("buffer size %d, max read len %d, max read num %d\n",buffer_size,maxReadLen,maxReadNum); seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( 1 ) { kmerCounter = ( long long * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( long long ) ); KmerSets = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); ubyte8 init_size = 1024; ubyte8 k = 0; if ( initKmerSetSize ) { init_size = ( ubyte8 ) ( ( double ) initKmerSetSize * 1024.0f * 1024.0f * 1024.0f / ( double ) thrd_num / 16 ); do { ++k; } while ( k * 0xFFFFFFLLU < init_size ); } for ( i = 0; i < thrd_num; i++ ) { KmerSets[i] = init_kmerset ( ( ( initKmerSetSize ) ? ( k * 0xFFFFFFLLU ) : ( init_size ) ), 0.77f ); ubyte8 tmp = ( initKmerSetSize ) ? 
( k * 0xFFFFFFLLU ) : ( init_size ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; kmerCounter[i + 1] = 0; rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } creatThrds ( threads, paras ); } thrdSignal[0] = kmerCounter[0] = 0; time ( &start_t ); kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter = 0; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 1 ) ) != 0 ) { if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } if ( lenBuffer[read_c] < 0 ) { printf ( "read len %d\n", lenBuffer[read_c] ); } if ( lenBuffer[read_c] < overlaplen + 1 ) { continue; } /* if(lenBuffer[read_c]>70) lenBuffer[read_c] = 50; else if(lenBuffer[read_c]>40) lenBuffer[read_c] = 40; */ indexArray[read_c] = kmer_c; kmer_c += lenBuffer[read_c] - overlaplen + 1; read_c++; if ( read_c == maxReadNum ) { kmerCounter[0] += kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); kmer_c = read_c = 0; } } if ( read_c ) { kmerCounter[0] += kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); } time ( &stop_t ); printf ( "time spent on hash reads: %ds, %lld reads processed\n", ( int ) ( stop_t - start_t ), i ); //record insert size info if ( pairs ) { if ( gradsCounter ) printf ( "%d pe insert size, the largest boundary is %lld\n\n", gradsCounter, pes[gradsCounter - 1].PE_bound ); else { printf ( "no paired reads found\n" ); } sprintf ( name, "%s.peGrads", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "grads&num: %d\t%lld\n", gradsCounter, n_solexa ); for ( i = 0; i < gradsCounter; i++ ) { fprintf ( fo, "%d\t%lld\t%d\n", pes[i].insertS, pes[i].PE_bound, pes[i].rank ); } fclose ( fo ); } free_pe_mem(); free_libs(); if ( 1 ) { unsigned long long alloCounter = 0; unsigned long long allKmerCounter = 0; for ( i = 0; i < thrd_num; i++ ) { alloCounter += count_kmerset ( ( 
KmerSets[i] ) ); allKmerCounter += kmerCounter[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } printf ( "%lli nodes allocated, %lli kmer in reads, %lli kmer processed\n" , alloCounter, kmerCounter[0], allKmerCounter ); fflush ( stdout ); } free ( ( void * ) rcSeq ); free ( ( void * ) kmerCounter ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nextcBuffer ); free ( ( void * ) prevcBuffer ); free ( ( void * ) next_name ); //printf("done hashing nodes\n"); if ( deLowKmer ) { time ( &start_t ); deLowCov ( thrdSignal ); time ( &stop_t ); printf ( "time spent on delowcvgNode %ds\n", ( int ) ( stop_t - start_t ) ); } time ( &start_t ); Mark1in1outNode ( thrdSignal ); freqStat ( outfile ); time ( &stop_t ); printf ( "time spent on marking linear nodes %ds\n", ( int ) ( stop_t - start_t ) ); fflush ( stdout ); sendWorkSignal ( 3, thrdSignal ); thread_wait ( threads ); /* Kmer word = 0x21c3ca82c734c8d0; Kmer hash_ban = hash_kmer(word); int setPicker = hash_ban%thrd_num; kmer_t *node; boolean found = search_kmerset(KmerSets[setPicker], word, &node); if(!found) printf("kmer %llx not found,\n",word); else{ printf("kmer %llx, linear %d\n",word,node->linear); for(i=0;i<4;i++){ if(get_kmer_right_cov(*node,i)>0) printf("right %d, kmer %llx\n",i,nextKmer(node->seq,i)); if(get_kmer_left_cov(*node,i)>0) printf("left %d, kmer %llx\n",i,prevKmer(node->seq,i)); } } */ return 1; } static void thread_delow ( KmerSet * set, unsigned char thrdID ) { int i, in_num, out_num, cvgSingle; int l_cvg, r_cvg; kmer_t * rs; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { in_num = out_num = l_cvg = r_cvg = 0; rs = set->array + set->iter_ptr; for ( i = 0; i < 4; i++ ) { cvgSingle = get_kmer_left_cov ( *rs, i ); if ( cvgSingle > 0 && 
cvgSingle <= deLowKmer )
                {
                    set_kmer_left_cov ( *rs, i, 0 );
                }

                cvgSingle = get_kmer_right_cov ( *rs, i );

                if ( cvgSingle > 0 && cvgSingle <= deLowKmer )
                {
                    set_kmer_right_cov ( *rs, i, 0 );
                }
            }

            if ( rs->l_links == 0 && rs->r_links == 0 )
            {
                rs->deleted = 1;
                tips[thrdID]++;
            }
        }

        set->iter_ptr ++;
    }

    //printf("%lld single nodes, %lld linear\n",counter,tips[thrdID]);
}

static void deLowCov ( unsigned char * thrdSignal )
{
    int i;
    long long counter = 0;
    tips = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) );

    for ( i = 0; i < thrd_num; i++ )
    {
        tips[i] = 0;
    }

    sendWorkSignal ( 5, thrdSignal ); //signal 5 runs thread_delow: clear low-coverage links, delete kmers left with no links

    for ( i = 0; i < thrd_num; i++ )
    {
        counter += tips[i];
    }

    free ( ( void * ) tips );
    printf ( "%lld kmer removed\n", counter );
}

static void printKmer ( Kmer kmer )
{
    int i;
    char kmerSeq[32], ch;

    for ( i = overlaplen - 1; i >= 0; i-- )
    {
        ch = kmer & 3;
        kmer >>= 2;
        kmerSeq[i] = ch;
    }

    for ( i = 0; i < overlaplen; i++ )
    {
        printf ( "%c", int2base ( ( int ) kmerSeq[i] ) );
    }

    printf ( "\n" );
}

static void thread_mark ( KmerSet * set, unsigned char thrdID )
{
    int i, in_num, out_num, cvgSingle;
    int l_cvg, r_cvg;
    kmer_t * rs;
    long long counter = 0;
    set->iter_ptr = 0;

    while ( set->iter_ptr < set->size )
    {
        if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) )
        {
            in_num = out_num = l_cvg = r_cvg = 0;
            rs = set->array + set->iter_ptr;

            for ( i = 0; i < 4; i++ )
            {
                cvgSingle = get_kmer_left_cov ( *rs, i );

                if ( cvgSingle > 0 )
                {
                    in_num++;
                    l_cvg += cvgSingle;
                }

                cvgSingle = get_kmer_right_cov ( *rs, i );

                if ( cvgSingle > 0 )
                {
                    out_num++;
                    r_cvg += cvgSingle;
                }
            }

            if ( rs->single )
            {
                kmerFreq[thrdID][1]++;
                counter++;
            }
            else
            {
                kmerFreq[thrdID][ ( l_cvg > r_cvg ?
l_cvg : r_cvg )]++;
            }

            if ( in_num == 1 && out_num == 1 )
            {
                rs->linear = 1;
                tips[thrdID]++;
            }
        }

        set->iter_ptr ++;
    }

    //printf("%lld single nodes, %lld linear\n",counter,tips[thrdID]);
}

static void Mark1in1outNode ( unsigned char * thrdSignal )
{
    int i;
    long long counter = 0;
    kmerFreq = ( long long ** ) ckalloc ( thrd_num * sizeof ( long long * ) );
    tips = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) );

    for ( i = 0; i < thrd_num; i++ )
    {
        kmerFreq[i] = ( long long * ) ckalloc ( 257 * sizeof ( long long ) );
        memset ( kmerFreq[i], 0, 257 * sizeof ( long long ) );
        tips[i] = 0;
    }

    sendWorkSignal ( 4, thrdSignal ); //mark linear nodes

    for ( i = 0; i < thrd_num; i++ )
    {
        counter += tips[i];
    }

    free ( ( void * ) tips );
    printf ( "%lld linear nodes\n", counter );
}

static void freqStat ( char * outfile )
{
    FILE * fo;
    char name[256];
    int i, j;
    long long sum;
    sprintf ( name, "%s.kmerFreq", outfile );
    fo = ckopen ( name, "w" );

    for ( i = 1; i < 256; i++ )
    {
        sum = 0;

        for ( j = 0; j < thrd_num; j++ )
        {
            sum += kmerFreq[j][i];
        }

        fprintf ( fo, "%lld\n", sum );
    }

    for ( i = 0; i < thrd_num; i++ )
    {
        free ( ( void * ) kmerFreq[i] );
    }

    free ( ( void * ) kmerFreq );
    fclose ( fo );
}

/*
 * 31mer/prlRead2Ctg.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.
 * If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static boolean staticFlag = 1;
static long long readsInGap = 0;

static int buffer_size = 10000000;
static long long readCounter;
static long long mapCounter;
static int ALIGNLEN = 0;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;
static unsigned int * ctgIdArray;
static int * posArray;
static char * orienArray;
static char * footprint; // flag indicates whether the read should leave markers on contigs

// kmer related variables
static int kmer_c;
static Kmer * kmerBuffer, *hashBanBuffer;
static kmer_t ** nodeBuffer;
static boolean * smallerBuffer;
static unsigned int * indexArray;

static int * deletion;

static void parse1read ( int t, int threadID );
static void threadRoutine ( void * thrdID );
static void searchKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void thread_wait ( pthread_t * threads );

static void creatThrds ( pthread_t * threads, PARAMETER * paras )
{
    unsigned char i;
    int temp;

    for ( i = 0; i < thrd_num; i++ )
    {
        //printf("to create %dth thread\n",(*(char *)&(threadID[i])));
        if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 )
        {
            printf ( "create threads failed\n" );
            exit ( 1 );
        }
    }

    printf ( "%d thread created\n", thrd_num );
}

static void threadRoutine ( void * para )
{
    PARAMETER * prm;
    int i, t;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with task %d, hash_table %p\n",id,prm.task,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            for ( i = 0; i < kmer_c; i++ )
            {
                //if((hashBanBuffer[i]&taskMask)!=prm.threadID)
                if ( ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    continue;
                }

                searchKmer ( i, KmerSets[id] );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num
!= id ) { continue; } chopKmer4read ( i, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { // parse reads for ( t = 0; t < read_c; t++ ) { if ( t % thrd_num != id ) { continue; } parse1read ( t, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } /* static void chopReads() { int i; for(i=0;ideleted ) { nodeBuffer[t] = node; } else { nodeBuffer[t] = NULL; } } static void parse1read ( int t, int threadID ) { unsigned int j, i, s; unsigned int contigID; int counter2 = 0, counter; unsigned int ctgLen, pos; kmer_t * node; boolean isSmaller; int flag, maxOcc = 0; kmer_t * maxNode = NULL; int alldgnLen = lenBuffer[t] > ALIGNLEN ? ALIGNLEN : lenBuffer[t]; int multi = alldgnLen - overlaplen + 1 < 5 ? 5 : alldgnLen - overlaplen + 1; unsigned int start, finish; footprint[t] = 0; start = indexArray[t]; finish = indexArray[t + 1]; if ( finish == start ) //too short { ctgIdArray[t] = 0; deletion[threadID]++; return; } for ( j = start; j < finish; j++ ) if ( nodeBuffer[j] ) { counter2++; } if ( counter2 < 2 ) { deletion[threadID]++; } counter = counter2 = 0; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; if ( !node ) //same as previous { continue; } flag = 1; for ( s = j + 1; s < finish; s++ ) { if ( !nodeBuffer[s] ) { continue; } if ( nodeBuffer[s]->l_links == node->l_links ) { flag++; nodeBuffer[s] = NULL; } } if ( flag >= 2 ) { counter2++; } //a loose alignment if ( flag >= multi ) { counter++; } else { continue; } if ( flag > maxOcc ) { pos = j; maxOcc = flag; maxNode = node; } } if ( !counter ) //no match { ctgIdArray[t] = 0; return; } if ( counter2 > 1 ) { footprint[t] = 1; } //aligned to multi contigs j = pos; i = pos - start + 1; node = nodeBuffer[j]; isSmaller = smallerBuffer[j]; contigID = node->l_links; ctgLen = contig_array[contigID].length; pos = node->r_links; if ( node->twin == isSmaller ) { orienArray[t] = '-'; ctgIdArray[t] = getTwinCtg 
( contigID ); posArray[t] = ctgLen - pos - overlaplen - i + 1; } else { orienArray[t] = '+'; ctgIdArray[t] = contigID; posArray[t] = pos - i + 1; } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void locate1read ( int t ) { int i, j, start, finish; kmer_t * node; unsigned int contigID; int pos, ctgLen; boolean isSmaller; start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; if ( !node ) //same as previous { continue; } i = j - start + 1; isSmaller = smallerBuffer[j]; contigID = node->l_links; ctgLen = contig_array[contigID].length; pos = node->r_links; if ( node->twin == isSmaller ) { ctgIdArray[t] = getTwinCtg ( contigID ); posArray[t] = ctgLen - pos - overlaplen - i + 1; } else { ctgIdArray[t] = contigID; posArray[t] = pos - i + 1; } } } static void output1read ( int t, FILE * outfp ) { int len = lenBuffer[t]; int index; readsInGap++; /* if(ctgIdArray[t]==735||ctgIdArray[t]==getTwinCtg(735)){ printf("%d\t%d\t%d\t",t+1,ctgIdArray[t],posArray[t]); int j; for(j=0;j R2 <-- R1 output1read ( read1, outfp ); } else { read2 = t; read1 = t - 1; ctgIdArray[read2] = ctgIdArray[read1]; posArray[read2] = posArray[read1] + insSize - lenBuffer[read2]; // --> R1 <-- R2 output1read ( read2, outfp ); } } static void recordLongRead ( FILE * outfp ) { int t; for ( t = 0; t < read_c; t++ ) { readCounter++; if ( footprint[t] ) { output1read ( t, outfp ); } } } static void recordAlldgn ( FILE * outfp, int insSize, FILE * outfp2 ) { int t, ctgId; boolean rd1gap, rd2gap; for ( t = 0; t < read_c; t++ ) { readCounter++; rd1gap = rd2gap = 0; ctgId = ctgIdArray[t]; if ( outfp2 && t % 2 == 1 ) //make sure this is read2 in a pair { if ( ctgIdArray[t] < 1 && ctgIdArray[t - 1] > 0 ) { getReadIngap 
( t, insSize, outfp2, 0 ); //read 2 in gap rd2gap = 1; } else if ( ctgIdArray[t] > 0 && ctgIdArray[t - 1] < 1 ) { getReadIngap ( t - 1, insSize, outfp2, 1 ); //read 1 in gap rd1gap = 1; } } if ( ctgId < 1 ) { continue; } mapCounter++; fprintf ( outfp, "%lld\t%u\t%d\t%c\n", readCounter, ctgIdArray[t], posArray[t], orienArray[t] ); if ( t % 2 == 0 ) { continue; } if ( outfp2 && footprint[t - 1] && !rd1gap ) { output1read ( t - 1, outfp2 ); } if ( outfp2 && footprint[t] && !rd2gap ) { output1read ( t, outfp2 ); } } } //load contig index and length void basicContigInfo ( char * infile ) { char name[256], lldne[1024]; FILE * fp; int length, bal_ed, num_all, num_long, index; sprintf ( name, "%s.ContigIndex", infile ); fp = ckopen ( name, "r" ); fgets ( lldne, sizeof ( lldne ), fp ); sscanf ( lldne + 8, "%d %d", &num_all, &num_long ); printf ( "%d edges in graph\n", num_all ); num_ctg = num_all; contig_array = ( CONTIG * ) ckalloc ( ( num_all + 1 ) * sizeof ( CONTIG ) ); fgets ( lldne, sizeof ( lldne ), fp ); num_long = 0; while ( fgets ( lldne, sizeof ( lldne ), fp ) != NULL ) { sscanf ( lldne, "%d %d %d", &index, &length, &bal_ed ); contig_array[++num_long].length = length; contig_array[num_long].bal_edge = bal_ed + 1; if ( index != num_long ) { printf ( "basicContigInfo: %d vs %d\n", index, num_long ); } if ( bal_ed == 0 ) { continue; } contig_array[++num_long].length = length; contig_array[num_long].bal_edge = -bal_ed + 1; } fclose ( fp ); } void prlRead2Ctg ( char * libfile, char * outfile ) { long long i; char * src_name, *next_name, name[256]; FILE * fo, *outfp2 = NULL; int maxReadNum, libNo, prevLibNo, insSize; boolean flag, pairs = 1; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } printf ( "In file: %s, max seq len %d, max name len %d\n\n", libfile, maxReadLen, maxNameLen ); if ( 
maxReadLen > maxReadLen4all ) { maxReadLen4all = maxReadLen; } src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); maxReadNum = maxReadNum % 2 == 0 ? maxReadNum : maxReadNum - 1; //make sure paired reads are processed at the same batch seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); posArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); orienArray = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); footprint = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); deletion = ( int * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( int ) ); thrdSignal[0] = 0; deletion[0] = 0; if ( 1 ) { for ( i = 0; i < thrd_num; i++ ) { rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); deletion[i + 1] = 0; thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( !contig_array ) { basicContigInfo ( outfile ); } sprintf ( name, "%s.readInGap", outfile ); outfp2 = ckopen ( name, "wb" ); sprintf ( name, "%s.readOnContig", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, 
"read\tcontig\tpos\n" ); readCounter = mapCounter = 0; kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter = readsInGap = 0; prevLibNo = -1; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 0 ) ) != 0 ) { if ( libNo != prevLibNo ) { prevLibNo = libNo; insSize = lib_array[libNo].avg_ins; ALIGNLEN = lib_array[libNo].map_len; if ( insSize > 1000 ) { ALIGNLEN = ALIGNLEN < 35 ? 35 : ALIGNLEN; } else { ALIGNLEN = ALIGNLEN < 32 ? 32 : ALIGNLEN; } printf ( "current insert size %d, map_len %d\n", insSize, ALIGNLEN ); } if ( insSize > 1000 ) { ALIGNLEN = ALIGNLEN < ( lenBuffer[read_c] / 2 + 1 ) ? ( lenBuffer[read_c] / 2 + 1 ) : ALIGNLEN; } if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } indexArray[read_c] = kmer_c; if ( lenBuffer[read_c] >= overlaplen + 1 ) { kmer_c += lenBuffer[read_c] - overlaplen + 1; } read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordAlldgn ( fo, insSize, outfp2 ); kmer_c = 0; read_c = 0; } } if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordAlldgn ( fo, insSize, outfp2 ); printf ( "Output %lld out of %lld (%.1f)%% reads in gaps\n", readsInGap, readCounter, ( float ) readsInGap / readCounter * 100 ); } if ( readCounter ) printf ( "%lld out of %lld (%.1f)%% reads mapped to contigs\n", mapCounter, readCounter, ( float ) mapCounter / readCounter * 100 ); sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); fclose ( fo ); sprintf ( name, "%s.peGrads", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "grads&num: %d\t%lld\t%d\n", gradsCounter, n_solexa, maxReadLen4all ); if ( pairs ) { if ( gradsCounter ) printf ( "%d pe insert size, the largest boundary is %lld\n\n", gradsCounter, pes[gradsCounter - 1].PE_bound ); else { printf ( 
"no paired reads found\n" ); } for ( i = 0; i < gradsCounter; i++ ) { fprintf ( fo, "%d\t%lld\t%d\t%d\n", pes[i].insertS, pes[i].PE_bound, pes[i].rank, pes[i].pair_num_cut ); } } fclose ( fo ); fclose ( outfp2 ); free_pe_mem(); free_libs(); if ( 1 ) // multi-threads { for ( i = 0; i < thrd_num; i++ ) { deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } printf ( "%d reads deleted\n", deletion[0] ); free ( ( void * ) rcSeq ); free ( ( void * ) deletion ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) ctgIdArray ); free ( ( void * ) posArray ); free ( ( void * ) orienArray ); free ( ( void * ) footprint ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); if ( contig_array ) { free ( ( void * ) contig_array ); contig_array = NULL; } } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } /********************* map long reads for gap filling ************************/ void prlLongRead2Ctg ( char * libfile, char * outfile ) { long long i; char * src_name, *next_name, name[256]; FILE * outfp2; int maxReadNum, libNo, prevLibNo; boolean flag, pairs = 0; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); if ( !maxReadLen ) { maxReadLen = 100; } int longReadLen = getMaxLongReadLen ( num_libs ); if ( longReadLen < 1 ) // no long reads { return; } maxReadLen4all = maxReadLen < longReadLen ? 
longReadLen : maxReadLen; printf ( "In file: %s, long read len %d, max name len %d\n\n", libfile, longReadLen, maxNameLen ); maxReadLen = longReadLen; src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); maxReadNum = maxReadNum % 2 == 0 ? maxReadNum : maxReadNum - 1; //make sure paired reads are processed at the same batch seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); posArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); orienArray = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); footprint = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); deletion = ( int * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( int ) ); thrdSignal[0] = 0; deletion[0] = 0; if ( 1 ) { for ( i = 0; i < thrd_num; i++ ) { rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); deletion[i + 1] = 0; thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( !contig_array ) { basicContigInfo ( outfile ); } sprintf ( name, "%s.longReadInGap", outfile ); outfp2 = ckopen ( name, "wb" 
); readCounter = 0; kmer_c = n_solexa = read_c = i = libNo = 0; prevLibNo = -1; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 4 ) ) != 0 ) { if ( libNo != prevLibNo ) { prevLibNo = libNo; ALIGNLEN = lib_array[libNo].map_len; ALIGNLEN = ALIGNLEN < 35 ? 35 : ALIGNLEN; printf ( "Map_len %d\n", ALIGNLEN ); } if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } indexArray[read_c] = kmer_c; if ( lenBuffer[read_c] >= overlaplen + 1 ) { kmer_c += lenBuffer[read_c] - overlaplen + 1; } read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordLongRead ( outfp2 ); kmer_c = 0; read_c = 0; } } if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordLongRead ( outfp2 ); printf ( "Output %lld out of %lld (%.1f)%% reads in gaps\n", readsInGap, readCounter, ( float ) readsInGap / readCounter * 100 ); } sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); fclose ( outfp2 ); free_libs(); if ( 1 ) /* multi-threads */ { for ( i = 0; i < thrd_num; i++ ) { deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } printf ( "%d reads deleted\n", deletion[0] ); free ( ( void * ) rcSeq ); free ( ( void * ) deletion ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) ctgIdArray ); free ( ( void * ) posArray ); free ( ( void * ) orienArray ); free ( ( void * ) footprint ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); } SOAPdenovo-V1.05/src/31mer/prlRead2path.c000644 000765 000024 00000044623 11530651532 020020 
0ustar00Aquastaff000000 000000
/*
 * 31mer/prlRead2path.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */
#include <stdinc.h>
#include "newhash.h"
#include <extfunc.h>
#include <extvab.h>

#define preARCBLOCKSIZE 100000

static unsigned int * arcCounters;
static int buffer_size = 10000000;
static long long markCounter = 0;
static unsigned int * fwriteBuf;
static unsigned char * markerOnEdge;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;

//edge and (K+1)mer related variables
static preARC ** preArc_array;
static Kmer * mixBuffer;
static boolean * flagArray; //indicate for each item in mixBuffer whether it's a (K+1)mer

// kmer related variables
static Kmer ** edge_no;
static char ** flags;
static int kmer_c;
static Kmer * kmerBuffer, *hashBanBuffer;
static kmer_t ** nodeBuffer;
static boolean * smallerBuffer;
static int * indexArray;
static int * deletion;

static void parse1read ( int t, int threadID );
static void search1kmerPlus ( int j, unsigned char thrdID );
static void threadRoutine ( void * thrdID );
static void searchKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void thread_wait ( pthread_t * threads );
static void thread_add1preArc ( unsigned int from_ed, unsigned int to_ed, unsigned int thrdID );

static void creatThrds ( pthread_t * threads,
PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void threadRoutine ( void * para ) { PARAMETER * prm; int i, t, j, start, finish; unsigned char id; prm = ( PARAMETER * ) para; id = prm->threadID; //printf("%dth thread with task %d, hash_table %p\n",id,prm.task,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { for ( i = 0; i < kmer_c; i++ ) { //if((hashBanBuffer[i]&taskMask)!=prm.threadID) if ( ( hashBanBuffer[i] % thrd_num ) != id ) { continue; } searchKmer ( i, KmerSets[id] ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { for ( i = 0; i < read_c; i++ ) { if ( i % thrd_num != id ) { continue; } chopKmer4read ( i, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { // parse reads for ( t = 0; t < read_c; t++ ) { if ( t % thrd_num != id ) { continue; } parse1read ( t, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 4 ) { //printf("thread %d, reads %d splay kmerplus\n",id,read_c); for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish; j++ ) { if ( flagArray[j] == 0 ) { if ( mixBuffer[j] == 0 ) { break; } } else if ( hashBanBuffer[j] % thrd_num == id ) { search1kmerPlus ( j, id ); } /* if(flagArray[j]==0&&mixBuffer[j]==0) break; if(!flagArray[j]||(hashBanBuffer[j]%thrd_num)!=id) continue; search1kmerPlus(j,id); */ } } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 6 ) { for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish - 1; j++ ) { if ( mixBuffer[j] == 0 || mixBuffer[j + 1] == 0 ) { break; } if ( mixBuffer[j] % thrd_num != id ) { 
continue; } thread_add1preArc ( mixBuffer[j], mixBuffer[j + 1], id ); } } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; Kmer hash_ban, bal_hash_ban; Kmer word, bal_word; int index; word = 0; for ( index = 0; index < overlaplen; index++ ) { word <<= 2; word += src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( word < bal_word ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; smallerBuffer[index] = 1; hashBanBuffer[index++] = hash_ban; } else { bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; smallerBuffer[index] = 0; hashBanBuffer[index++] = bal_hash_ban; } //printf("%dth: %p with %p\n",kmer_c-1,bal_word,bal_hash_ban); for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( word < bal_word ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; smallerBuffer[index] = 1; hashBanBuffer[index++] = hash_ban; //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; smallerBuffer[index] = 0; hashBanBuffer[index++] = bal_hash_ban; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } //splay for one kmer in buffer and save the node to nodeBuffer static void searchKmer ( int t, KmerSet * kset ) { kmer_t * node; boolean found = search_kmerset ( kset, kmerBuffer[t], &node ); if ( !found ) { printf ( "searchKmer: kmer %llx is not found\n", kmerBuffer[t] ); } nodeBuffer[t] = 
node; } static preARC * getPreArcBetween ( unsigned int from_ed, unsigned int to_ed ) { preARC * parc; parc = preArc_array[from_ed]; while ( parc ) { if ( parc->to_ed == to_ed ) { return parc; } parc = parc->next; } return parc; } static void thread_add1preArc ( unsigned int from_ed, unsigned int to_ed, unsigned int thrdID ) { preARC * parc = getPreArcBetween ( from_ed, to_ed ); if ( parc ) { parc->multiplicity++; } else { parc = prlAllocatePreArc ( to_ed, preArc_mem_managers[thrdID] ); arcCounters[thrdID]++; parc->next = preArc_array[from_ed]; preArc_array[from_ed] = parc; } } static void memoAlloc4preArc() { unsigned int i; preArc_array = ( preARC ** ) ckalloc ( ( num_ed + 1 ) * sizeof ( preARC * ) ); for ( i = 0; i <= num_ed; i++ ) { preArc_array[i] = NULL; } } static void memoFree4preArc() { prlDestroyPreArcMem(); if ( preArc_array ) { free ( ( void * ) preArc_array ); } } static void output_arcs ( char * outfile ) { unsigned int i; char name[256]; FILE * outfp, *outfp2 = NULL; preARC * parc; sprintf ( name, "%s.preArc", outfile ); outfp = ckopen ( name, "w" ); if ( repsTie ) { sprintf ( name, "%s.markOnEdge", outfile ); outfp2 = ckopen ( name, "w" ); } markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( repsTie ) { markCounter += markerOnEdge[i]; fprintf ( outfp2, "%d\n", markerOnEdge[i] ); } parc = preArc_array[i]; if ( !parc ) { continue; } fprintf ( outfp, "%u", i ); while ( parc ) { fprintf ( outfp, " %u %u", parc->to_ed, parc->multiplicity ); parc = parc->next; } fprintf ( outfp, "\n" ); } fclose ( outfp ); if ( repsTie ) { fclose ( outfp2 ); printf ( "%lld markers counted\n", markCounter ); } } static void recordPathBin ( FILE * outfp ) { int t, j, start, finish; unsigned char counter; for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; if ( finish - start < 3 || mixBuffer[start] == 0 || mixBuffer[start + 1] == 0 || mixBuffer[start + 2] == 0 ) { continue; } counter = 0; for ( j = start; j < finish; j++ ) { if ( 
mixBuffer[j] == 0 ) { break; } fwriteBuf[counter++] = ( unsigned int ) mixBuffer[j]; if ( markerOnEdge[mixBuffer[j]] < 255 ) { markerOnEdge[mixBuffer[j]]++; } markCounter++; } fwrite ( &counter, sizeof ( char ), 1, outfp ); fwrite ( fwriteBuf, sizeof ( unsigned int ), ( int ) counter, outfp ); } } static void search1kmerPlus ( int j, unsigned char thrdID ) { kmer_t * node; boolean found = search_kmerset ( KmerSetsPatch[thrdID], mixBuffer[j], &node ); if ( !found ) { /* printf("kmerPlus %llx (hashban %llx) not found, flag %d!\n", mixBuffer[j],hashBanBuffer[j],flagArray[j]); */ mixBuffer[j] = 0; return; } if ( smallerBuffer[j] ) { mixBuffer[j] = node->l_links; } else { mixBuffer[j] = node->l_links + node->twin - 1; } } static void parse1read ( int t, int threadID ) { unsigned int j, retain = 0; unsigned int edge_index = 0; kmer_t * node; boolean isSmaller; Kmer wordplus, bal_wordplus; unsigned int start, finish, pos; Kmer prevKmer, currentKmer; boolean IsPrevKmer = 0; start = indexArray[t]; finish = indexArray[t + 1]; pos = start; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; //extract edges or keep kmers if ( ( node->deleted ) || ( node->linear && !node->inEdge ) ) // deleted or in a floating loop { if ( retain < 2 ) { retain = 0; pos = start; } else { break; } continue; } isSmaller = smallerBuffer[j]; if ( node->linear ) { if ( isSmaller ) { edge_index = node->l_links; } else { edge_index = node->l_links + node->twin - 1; } if ( retain == 0 || IsPrevKmer ) { retain++; mixBuffer[pos] = edge_index; flagArray[pos++] = 0; IsPrevKmer = 0; } else if ( edge_index != mixBuffer[pos - 1] ) { retain++; mixBuffer[pos] = edge_index; flagArray[pos++] = 0; } } else { if ( isSmaller ) { currentKmer = node->seq; } else { currentKmer = reverseComplement ( node->seq, overlaplen ); } if ( IsPrevKmer ) { retain++; wordplus = KmerPlus ( prevKmer, currentKmer & 3 ); bal_wordplus = reverseComplementVerbose ( wordplus, overlaplen + 1 ); if ( wordplus <= bal_wordplus ) { 
smallerBuffer[pos] = 1; hashBanBuffer[pos] = hash_kmer ( wordplus ); mixBuffer[pos] = wordplus; } else { smallerBuffer[pos] = 0; hashBanBuffer[pos] = hash_kmer ( bal_wordplus ); mixBuffer[pos] = bal_wordplus; } flagArray[pos++] = 1; } IsPrevKmer = 1; prevKmer = currentKmer; } } if ( retain < 1 ) { deletion[threadID]++; } if ( retain < 2 ) { flagArray[start] = 0; mixBuffer[start] = 0; return; } if ( ( pos - start ) != retain ) { printf ( "read %d, %u vs %u\n", t, retain, pos - start ); } if ( pos < finish ) { flagArray[pos] = 0; mixBuffer[pos] = 0; } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } void prlRead2edge ( char * libfile, char * outfile ) { long long i; char name[256], *src_name, *next_name; FILE * outfp = NULL; int maxReadNum, libNo; boolean flag, pairs = 0; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } maxReadLen4all = maxReadLen; printf ( "In file: %s, max seq len %d, max name len %d\n\n", libfile, maxReadLen, maxNameLen ); if ( repsTie ) { sprintf ( name, "%s.path", outfile ); outfp = ckopen ( name, "wb" ); } src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); mixBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); flagArray = ( boolean * ) ckalloc ( 
buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); //printf("buffer for at most %d reads\n",maxReadNum); seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } memoAlloc4preArc(); flags = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); edge_no = ( Kmer ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( Kmer * ) ); deletion = ( int * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( int ) ); rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( repsTie ) { markerOnEdge = ( unsigned char * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned char ) ); for ( i = 1; i <= num_ed; i++ ) { markerOnEdge[i] = 0; } fwriteBuf = ( unsigned int * ) ckalloc ( ( maxReadLen - overlaplen + 1 ) * sizeof ( unsigned int ) ); } thrdSignal[0] = 0; if ( 1 ) { preArc_mem_managers = ( MEM_MANAGER ** ) ckalloc ( thrd_num * sizeof ( MEM_MANAGER * ) ); arcCounters = ( unsigned int * ) ckalloc ( thrd_num * sizeof ( unsigned int ) ); for ( i = 0; i < thrd_num; i++ ) { arcCounters[i] = 0; preArc_mem_managers[i] = createMem_manager ( preARCBLOCKSIZE, sizeof ( preARC ) ); deletion[i + 1] = 0; flags[i + 1] = ( char * ) ckalloc ( 2 * maxReadLen * sizeof ( char ) ); edge_no[i + 1] = ( Kmer * ) ckalloc ( 2 * maxReadLen * sizeof ( Kmer ) ); rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( 1 ) { deletion[0] = 0; flags[0] = ( char * ) ckalloc ( 2 * maxReadLen * sizeof ( char ) ); edge_no[0] = ( Kmer * ) ckalloc ( 2 * maxReadLen * sizeof ( Kmer ) ); rcSeq[0] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } 
kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter = 0; int t0, t1, t2, t3, t4, t5, t6; t0 = t1 = t2 = t3 = t4 = t5 = t6 = 0; time_t read_start, read_end, time_bef, time_aft; time ( &read_start ); while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 1 ) ) != 0 ) { if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } if ( lenBuffer[read_c] < overlaplen + 1 ) { continue; } /* if(lenBuffer[read_c]>70) lenBuffer[read_c] = 70; else if(lenBuffer[read_c]>40) lenBuffer[read_c] = 40; */ indexArray[read_c] = kmer_c; kmer_c += lenBuffer[read_c] - overlaplen + 1; read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; time ( &read_end ); t0 += read_end - read_start; time ( &time_bef ); sendWorkSignal ( 2, thrdSignal ); time ( &time_aft ); t1 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 1, thrdSignal ); time ( &time_aft ); t2 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 3, thrdSignal ); time ( &time_aft ); t3 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 4, thrdSignal ); time ( &time_aft ); t4 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 6, thrdSignal ); time ( &time_aft ); t5 += time_aft - time_bef; time ( &time_bef ); /* recordPreArc(); */ if ( repsTie ) { recordPathBin ( outfp ); } time ( &time_aft ); t6 += time_aft - time_bef; /* output_path(read_c,edge_no,flags,outfp); */ kmer_c = 0; read_c = 0; time ( &read_start ); } } printf ( "%lld reads processed\n", i ); printf ( "time %d,%d,%d,%d,%d,%d,%d\n", t0, t1, t2, t3, t4, t5, t6 ); if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); sendWorkSignal ( 4, thrdSignal ); sendWorkSignal ( 6, thrdSignal ); /* recordPreArc(); */ if ( repsTie ) { recordPathBin ( outfp ); } } printf ( "%lld markers output\n", markCounter ); sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); output_arcs 
( outfile ); memoFree4preArc(); if ( 1 ) // multi-threads { arcCounter = 0; for ( i = 0; i < thrd_num; i++ ) { arcCounter += arcCounters[i]; free ( ( void * ) flags[i + 1] ); free ( ( void * ) edge_no[i + 1] ); deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } if ( 1 ) { free ( ( void * ) flags[0] ); free ( ( void * ) edge_no[0] ); free ( ( void * ) rcSeq[0] ); } printf ( "done mapping reads, %d reads deleted, %lld arcs created\n", deletion[0], arcCounter ); if ( repsTie ) { free ( ( void * ) markerOnEdge ); free ( ( void * ) fwriteBuf ); } free ( ( void * ) arcCounters ); free ( ( void * ) rcSeq ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) flags ); free ( ( void * ) deletion ); free ( ( void * ) edge_no ); free ( ( void * ) kmerBuffer ); free ( ( void * ) mixBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) flagArray ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); if ( repsTie ) { fclose ( outfp ); } free_pe_mem(); free_libs(); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } SOAPdenovo-V1.05/src/31mer/prlReadFillGap.c000644 000765 000024 00000071345 11530651532 020321 0ustar00Aquastaff000000 000000 /* * 31mer/prlReadFillGap.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define RDBLOCKSIZE 50 #define CTGappend 50 static Kmer MAXKMER; static int Ncounter; static int allGaps; // for multi threads static int * counters; static pthread_mutex_t mutex; static int scafBufSize = 100; static boolean * flagBuf; static unsigned char * thrdNoBuf; static STACK ** ctgStackBuffer; static int scafCounter; static int scafInBuf; static void MarkCtgOccu ( unsigned int ctg ); /* static void printRead(int len,char *seq) { int j; fprintf(stderr,">read\n"); for(j=0;jlen = len; rd->dis = pos; rd->seqStarter = starter; } static void convertIndex() { int * length_array = ( int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( int ) ); unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with new index: index_array[i] free ( ( void * ) length_array ); } static long long getRead1by1 ( FILE * fp, DARRAY * readSeqInGap ) { long long readCounter = 0; if ( !fp ) { return readCounter; } int len, ctgID, pos; long long starter; char * pt; char * freadBuf = ( char * ) ckalloc ( ( maxReadLen / 4 + 1 ) * sizeof ( char ) ); while ( fread ( &len, sizeof ( int ), 1, fp ) == 1 ) { if ( fread ( &ctgID, sizeof ( int ), 1, fp ) != 1 ) { break; } if ( fread ( &pos, sizeof ( int ), 1, fp ) != 1 ) { break; } if ( fread ( freadBuf, sizeof ( char ), len / 4 + 1, fp ) != ( unsigned ) ( len / 4 + 1 ) ) { break; } //put seq to dynamic array starter = 
readSeqInGap->item_c; if ( !darrayPut ( readSeqInGap, starter + len / 4 ) ) /* make sure there's room for this seq */ { break; } pt = ( char * ) darrayPut ( readSeqInGap, starter ); bcopy ( freadBuf, pt, len / 4 + 1 ); attach1read2contig ( ctgID, len, pos, starter ); readCounter++; } free ( ( void * ) freadBuf ); return readCounter; } /* Darray *readSeqInGap */ static boolean loadReads4gap ( char * graphfile ) { FILE * fp, *fp2; char name[1024]; long long readCounter; sprintf ( name, "%s.readInGap", graphfile ); fp = fopen ( name, "rb" ); sprintf ( name, "%s.longReadInGap", graphfile ); fp2 = fopen ( name, "rb" ); if ( !fp && !fp2 ) { return 0; } if ( !orig2new ) { convertIndex(); orig2new = 1; } readSeqInGap = ( DARRAY * ) createDarray ( 1000000, sizeof ( char ) ); if ( fp ) { readCounter = getRead1by1 ( fp, readSeqInGap ); printf ( "Loaded %lld reads from %s.readInGap\n", readCounter, graphfile ); fclose ( fp ); } if ( fp2 ) { readCounter = getRead1by1 ( fp2, readSeqInGap ); printf ( "Loaded %lld reads from %s.longReadInGap\n", readCounter, graphfile ); fclose ( fp2 ); } return 1; } static void debugging1() { unsigned int i; if ( orig2new ) { unsigned int * length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); /* use length_array to change info in index_array */ for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } /* contig i with original index: index_array[i] */ free ( ( void * ) length_array ); orig2new = 0; } READNEARBY * rd; int j; char * pt; for ( i = 1; i <= num_ctg; i++ ) { if ( !contig_array[i].closeReads ) { continue; } if ( index_array[i] != 735 ) { continue; } printf ( "contig %d, len %d: \n", index_array[i], contig_array[i].length ); stackBackup ( contig_array[i].closeReads ); while ( ( rd = ( READNEARBY * ) stackPop ( contig_array[i].closeReads ) ) != NULL ) { printf ( "%d\t%d\t%lld\t", 
rd->dis, rd->len, rd->seqStarter ); pt = ( char * ) darrayGet ( readSeqInGap, rd->seqStarter ); for ( j = 0; j < rd->len; j++ ) { printf ( "%c", int2base ( ( int ) getCharInTightString ( pt, j ) ) ); } printf ( "\n" ); } stackRecover ( contig_array[i].closeReads ); } } static void initiateCtgInScaf ( CTGinSCAF * actg ) { actg->cutTail = 0; actg->cutHead = overlaplen; actg->gapSeqLen = 0; } static int procGap ( char * line, STACK * ctgsStack ) { char * tp; int length, i, seg; unsigned int ctg; CTGinSCAF * ctgPt; tp = strtok ( line, " " ); tp = strtok ( NULL, " " ); //length length = atoi ( tp ); tp = strtok ( NULL, " " ); //seg seg = atoi ( tp ); if ( !seg ) { return length; } for ( i = 0; i < seg; i++ ) { tp = strtok ( NULL, " " ); ctg = atoi ( tp ); MarkCtgOccu ( ctg ); ctgPt = ( CTGinSCAF * ) stackPush ( ctgsStack ); initiateCtgInScaf ( ctgPt ); ctgPt->ctgID = ctg; ctgPt->start = 0; ctgPt->end = 0; ctgPt->scaftig_start = 0; ctgPt->mask = 1; } return length; } static void debugging2 ( int index, STACK * ctgsStack ) { CTGinSCAF * actg; stackBackup ( ctgsStack ); printf ( ">scaffold%d\t%d 0.0\n", index, ctgsStack->item_c ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { printf ( "%d\t%d\t%d\t%d\n", actg->ctgID, actg->start, actg->end, actg->scaftig_start ); } stackRecover ( ctgsStack ); } static int cmp_reads ( const void * a, const void * b ) { READNEARBY * A, *B; A = ( READNEARBY * ) a; B = ( READNEARBY * ) b; if ( A->dis > B->dis ) { return 1; } else if ( A->dis == B->dis ) { return 0; } else { return -1; } } static void cutRdArray ( READNEARBY * rdArray, int gapStart, int gapEnd, int * count, int arrayLen, READNEARBY * cutArray ) { int i; int num = 0; for ( i = 0; i < arrayLen; i++ ) { if ( rdArray[i].dis > gapEnd ) { break; } if ( ( rdArray[i].dis + rdArray[i].len ) >= gapStart ) { cutArray[num].dis = rdArray[i].dis; cutArray[num].len = rdArray[i].len; cutArray[num++].seqStarter = rdArray[i].seqStarter; } } *count = num; } static void outputTightStr ( 
FILE * fp, char * tightStr, int start, int length, int outputlen, int revS, int * col ) { int i; int end; int column = *col; if ( !revS ) { end = start + outputlen <= length ? start + outputlen : length; for ( i = start; i < end; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( tightStr, i ) ) ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } } else { end = length - start - outputlen - 1 >= 0 ? length - start - outputlen : 0; for ( i = length - 1 - start; i >= end; i-- ) { fprintf ( fp, "%c", int2compbase ( ( int ) getCharInTightString ( tightStr, i ) ) ); if ( ( ++column ) % 100 == 0 ) { fprintf ( fp, "\n" ); //column = 0; } } } *col = column; } static void outputTightStrLowerCase ( FILE * fp, char * tightStr, int start, int length, int outputlen, int revS, int * col ) { int i; int end; int column = *col; if ( !revS ) { end = start + outputlen <= length ? start + outputlen : length; for ( i = start; i < end; i++ ) { fprintf ( fp, "%c", "actg"[ ( int ) getCharInTightString ( tightStr, i )] ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } } else { end = length - start - outputlen - 1 >= 0 ? 
length - start - outputlen : 0; for ( i = length - 1 - start; i >= end; i-- ) { fprintf ( fp, "%c", "tgac"[ ( int ) getCharInTightString ( tightStr, i )] ); if ( ( ++column ) % 100 == 0 ) { fprintf ( fp, "\n" ); //column = 0; } } } *col = column; } static void outputNs ( FILE * fp, int gapN, int * col ) { int i, column = *col; for ( i = 0; i < gapN; i++ ) { fprintf ( fp, "N" ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } *col = column; } static void outputGapInfo ( unsigned int ctg1, unsigned int ctg2 ) { unsigned int bal_ctg1 = getTwinCtg ( ctg1 ); unsigned int bal_ctg2 = getTwinCtg ( ctg2 ); if ( isLargerThanTwin ( ctg1 ) ) { fprintf ( stderr, "%d\t", index_array[bal_ctg1] ); } else { fprintf ( stderr, "%d\t", index_array[ctg1] ); } if ( isLargerThanTwin ( ctg2 ) ) { fprintf ( stderr, "%d\n", index_array[bal_ctg2] ); } else { fprintf ( stderr, "%d\n", index_array[ctg2] ); } } static void output1gap ( FILE * fo, int scafIndex, CTGinSCAF * prevCtg, CTGinSCAF * actg, DARRAY * gapSeqArray ) { unsigned int ctg1, bal_ctg1, length1; int start1, outputlen1; unsigned int ctg2, bal_ctg2, length2; int start2, outputlen2; char * pt; int column = 0; ctg1 = prevCtg->ctgID; bal_ctg1 = getTwinCtg ( ctg1 ); start1 = prevCtg->cutHead; length1 = contig_array[ctg1].length + overlaplen; if ( length1 - prevCtg->cutTail - start1 > CTGappend ) { outputlen1 = CTGappend; start1 = length1 - prevCtg->cutTail - outputlen1; } else { outputlen1 = length1 - prevCtg->cutTail - start1; } ctg2 = actg->ctgID; bal_ctg2 = getTwinCtg ( ctg2 ); start2 = actg->cutHead; length2 = contig_array[ctg2].length + overlaplen; if ( length2 - actg->cutTail - start2 > CTGappend ) { outputlen2 = CTGappend; } else { outputlen2 = length2 - actg->cutTail - start2; } if ( isLargerThanTwin ( ctg1 ) ) { fprintf ( fo, ">S%d_C%d_L%d_G%d", scafIndex, index_array[bal_ctg1], outputlen1, prevCtg->gapSeqLen ); } else { fprintf ( fo, ">S%d_C%d_L%d_G%d", scafIndex, index_array[ctg1], outputlen1, 
prevCtg->gapSeqLen ); } if ( isLargerThanTwin ( ctg2 ) ) { fprintf ( fo, "_C%d_L%d\n", index_array[bal_ctg2], outputlen2 ); } else { fprintf ( fo, "_C%d_L%d\n", index_array[ctg2], outputlen2 ); } if ( contig_array[ctg1].seq ) { outputTightStr ( fo, contig_array[ctg1].seq, start1, length1, outputlen1, 0, &column ); } else if ( contig_array[bal_ctg1].seq ) { outputTightStr ( fo, contig_array[bal_ctg1].seq, start1, length1, outputlen1, 1, &column ); } pt = ( char * ) darrayPut ( gapSeqArray, prevCtg->gapSeqOffset ); outputTightStrLowerCase ( fo, pt, 0, prevCtg->gapSeqLen, prevCtg->gapSeqLen, 0, &column ); if ( contig_array[ctg2].seq ) { outputTightStr ( fo, contig_array[ctg2].seq, start2, length2, outputlen2, 0, &column ); } else if ( contig_array[bal_ctg2].seq ) { outputTightStr ( fo, contig_array[bal_ctg2].seq, start2, length2, outputlen2, 1, &column ); } fprintf ( fo, "\n" ); } static void outputGapSeq ( FILE * fo, int index, STACK * ctgsStack, DARRAY * gapSeqArray ) { CTGinSCAF * actg, *prevCtg = NULL; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( prevCtg && prevCtg->gapSeqLen > 0 ) { output1gap ( fo, index, prevCtg, actg, gapSeqArray ); } prevCtg = actg; } } static void outputScafSeq ( FILE * fo, int index, STACK * ctgsStack, DARRAY * gapSeqArray ) { CTGinSCAF * actg, *prevCtg = NULL; unsigned int ctg, bal_ctg, length; int start, outputlen, gapN; char * pt; int column = 0; long long cvgSum = 0; int lenSum = 0; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( ! 
( contig_array[actg->ctgID].cvg > 0 ) ) { continue; } lenSum += contig_array[actg->ctgID].length; cvgSum += contig_array[actg->ctgID].length * contig_array[actg->ctgID].cvg; } if ( lenSum > 0 ) { fprintf ( fo, ">scaffold%d %4.1f\n", index, ( double ) cvgSum / lenSum ); } else { fprintf ( fo, ">scaffold%d 0.0\n", index ); } stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); length = contig_array[ctg].length + overlaplen; if ( prevCtg && actg->scaftig_start ) { gapN = actg->start - prevCtg->start - contig_array[prevCtg->ctgID].length; gapN = gapN > 0 ? gapN : 1; outputNs ( fo, gapN, &column ); //outputGapInfo(prevCtg->ctgID,ctg); Ncounter++; } if ( !prevCtg ) { start = 0; } else { start = actg->cutHead; } outputlen = length - start - actg->cutTail; if ( contig_array[ctg].seq ) { outputTightStr ( fo, contig_array[ctg].seq, start, length, outputlen, 0, &column ); } else if ( contig_array[bal_ctg].seq ) { outputTightStr ( fo, contig_array[bal_ctg].seq, start, length, outputlen, 1, &column ); } if ( actg->gapSeqLen < 1 ) { prevCtg = actg; continue; } pt = ( char * ) darrayPut ( gapSeqArray, actg->gapSeqOffset ); outputTightStrLowerCase ( fo, pt, 0, actg->gapSeqLen, actg->gapSeqLen, 0, &column ); prevCtg = actg; } fprintf ( fo, "\n" ); } static void fill1scaf ( int index, STACK * ctgsStack, int thrdID ); static void check1scaf ( int t, int thrdID ) { if ( flagBuf[t] ) { return; } boolean late = 0; pthread_mutex_lock ( &mutex ); if ( !flagBuf[t] ) { flagBuf[t] = 1; thrdNoBuf[t] = thrdID; } else { late = 1; } pthread_mutex_unlock ( &mutex ); if ( late ) { return; } counters[thrdID]++; fill1scaf ( scafCounter + t + 1, ctgStackBuffer[t], thrdID ); } static void fill1scaf ( int index, STACK * ctgsStack, int thrdID ) { CTGinSCAF * actg, *prevCtg = NULL; READNEARBY * rdArray, *rdArray4gap, *rd; int numRd = 0, count, maxGLen = 0; unsigned int ctg, bal_ctg; STACK * rdStack; while ( ( actg = stackPop ( 
ctgsStack ) ) != NULL ) { if ( prevCtg ) { maxGLen = maxGLen < ( actg->start - prevCtg->end ) ? ( actg->start - prevCtg->end ) : maxGLen; } ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( actg->mask ) { prevCtg = actg; continue; } if ( contig_array[ctg].closeReads ) { numRd += contig_array[ctg].closeReads->item_c; } else if ( contig_array[bal_ctg].closeReads ) { numRd += contig_array[bal_ctg].closeReads->item_c; } prevCtg = actg; } if ( numRd < 1 ) { return; } rdArray = ( READNEARBY * ) ckalloc ( numRd * sizeof ( READNEARBY ) ); rdArray4gap = ( READNEARBY * ) ckalloc ( numRd * sizeof ( READNEARBY ) ); //fprintf(stderr,"scaffold%d reads4gap %d\n",index,numRd); // collect reads appended to contigs in this scaffold int numRd2 = 0; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( actg->mask ) { continue; } if ( contig_array[ctg].closeReads ) { rdStack = contig_array[ctg].closeReads; } else if ( contig_array[bal_ctg].closeReads ) { rdStack = contig_array[bal_ctg].closeReads; } else { continue; } stackBackup ( rdStack ); while ( ( rd = ( READNEARBY * ) stackPop ( rdStack ) ) != NULL ) { rdArray[numRd2].len = rd->len; rdArray[numRd2].seqStarter = rd->seqStarter; if ( isSmallerThanTwin ( ctg ) ) { rdArray[numRd2++].dis = actg->start - overlaplen + rd->dis; } else rdArray[numRd2++].dis = actg->start - overlaplen + contig_array[ctg].length - rd->len - rd->dis; } stackRecover ( rdStack ); } if ( numRd2 != numRd ) { printf ( "##reads numbers doesn't match, %d vs %d when scaffold %d\n", numRd, numRd2, index ); } qsort ( rdArray, numRd, sizeof ( READNEARBY ), cmp_reads ); //fill gap one by one int gapStart, gapEnd; int numIn = 0; boolean flag; int buffer_size = maxReadLen > 100 ? maxReadLen : 100; int maxGSLen = maxGLen + GLDiff < 10 ? 
10 : maxGLen + GLDiff; //fprintf(stderr,"maxGlen %d, maxGSlen %d\n",maxGLen,maxGSLen); char * seqGap = ( char * ) ckalloc ( maxGSLen * sizeof ( char ) ); // temp array for gap sequence Kmer * kmerCtg1 = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); Kmer * kmerCtg2 = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); char * seqCtg1 = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); char * seqCtg2 = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); prevCtg = NULL; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( !prevCtg || !actg->scaftig_start ) { prevCtg = actg; continue; } gapStart = prevCtg->end - 100; gapEnd = actg->start - overlaplen + 100; cutRdArray ( rdArray, gapStart, gapEnd, &count, numRd, rdArray4gap ); numIn += count; /* if(!count){ prevCtg = actg; continue; } */ int overlap; for ( overlap = overlaplen; overlap > 14; overlap -= 2 ) { flag = localGraph ( rdArray4gap, count, prevCtg, actg, overlaplen, kmerCtg1, kmerCtg2, overlap, darrayBuf[thrdID], seqCtg1, seqCtg2, seqGap ); //free_kmerset(kmerSet); if ( flag == 1 ) { /* fprintf(stderr,"Between ctg %d and %d, Found with %d\n",prevCtg->ctgID ,actg->ctgID,overlap); */ break; } } /* if(count==0) printf("Gap closed without reads\n"); if(!flag) fprintf(stderr,"Between ctg %d and %d, NO routes found\n",prevCtg->ctgID,actg->ctgID); */ prevCtg = actg; } //fprintf(stderr,"____scaffold%d reads in gap %d\n",index,numIn); free ( ( void * ) seqGap ); free ( ( void * ) kmerCtg1 ); free ( ( void * ) kmerCtg2 ); free ( ( void * ) seqCtg1 ); free ( ( void * ) seqCtg2 ); free ( ( void * ) rdArray ); free ( ( void * ) rdArray4gap ); } static void reverseStack ( STACK * dStack, STACK * sStack ) { CTGinSCAF * actg, *ctgPt; emptyStack ( dStack ); while ( ( actg = ( CTGinSCAF * ) stackPop ( sStack ) ) != NULL ) { ctgPt = ( CTGinSCAF * ) stackPush ( dStack ); ctgPt->ctgID = actg->ctgID; ctgPt->start = actg->start; ctgPt->end = actg->end; ctgPt->scaftig_start = 
actg->scaftig_start; ctgPt->mask = actg->mask; ctgPt->cutHead = actg->cutHead; ctgPt->cutTail = actg->cutTail; ctgPt->gapSeqLen = actg->gapSeqLen; ctgPt->gapSeqOffset = actg->gapSeqOffset; } stackBackup ( dStack ); } static Kmer tightStr2Kmer ( char * tightStr, int start, int length, int revS ) { int i; Kmer word = 0; if ( !revS ) { if ( start + overlaplen > length ) { printf ( "tightStr2Kmer A: no enough bases for kmer\n" ); return word; } for ( i = start; i < start + overlaplen; i++ ) { word <<= 2; word += getCharInTightString ( tightStr, i ); } } else { if ( length - start - overlaplen < 0 ) { printf ( "tightStr2Kmer B: no enough bases for kmer\n" ); return word; } for ( i = length - 1 - start; i > length - 1 - start - overlaplen; i-- ) { word <<= 2; word += int_comp ( getCharInTightString ( tightStr, i ) ); } } return word; } static Kmer maxKmer() { Kmer word = 0; int i; for ( i = 0; i < overlaplen; i++ ) { word <<= 2; word += 0x3; } return word; } static int contigCatch ( unsigned int prev_ctg, unsigned int ctg ) { if ( contig_array[prev_ctg].length == 0 || contig_array[ctg].length == 0 ) { return 0; } Kmer kmerAtEnd, kmerAtStart; Kmer MaxKmer; unsigned int bal_ctg1 = getTwinCtg ( prev_ctg ); unsigned int bal_ctg2 = getTwinCtg ( ctg ); int i, start; int len1 = contig_array[prev_ctg].length + overlaplen; int len2 = contig_array[ctg].length + overlaplen; start = contig_array[prev_ctg].length; if ( contig_array[prev_ctg].seq ) { kmerAtEnd = tightStr2Kmer ( contig_array[prev_ctg].seq, start, len1, 0 ); } else { kmerAtEnd = tightStr2Kmer ( contig_array[bal_ctg1].seq, start, len1, 1 ); } start = 0; if ( contig_array[ctg].seq ) { kmerAtStart = tightStr2Kmer ( contig_array[ctg].seq, start, len2, 0 ); } else { kmerAtStart = tightStr2Kmer ( contig_array[bal_ctg2].seq, start, len2, 1 ); } MaxKmer = MAXKMER; for ( i = 0; i < 10; i++ ) { if ( ( kmerAtStart ^ kmerAtEnd ) == 0 ) { break; } MaxKmer >>= 2; kmerAtEnd &= MaxKmer; kmerAtStart >>= 2; } if ( i < 10 ) { return 
overlaplen - i; } else { return 0; } } static void initStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { flagBuf[i] = 1; ctgStackBuffer[i] = ( STACK * ) createStack ( 100, sizeof ( CTGinSCAF ) ); } } static void freeStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { freeStack ( ctgStackBuffer[i] ); } } static void threadRoutine ( void * para ) { PARAMETER * prm; int i; prm = ( PARAMETER * ) para; //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { emptyDarray ( darrayBuf[prm->threadID] ); for ( i = 0; i < scafInBuf; i++ ) { check1scaf ( i, prm->threadID ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n...\n", thrd_num ); } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void outputSeqs ( FILE * fo, FILE * fo2, int scafInBuf ) { int i, thrdID; for ( i = 0; i < scafInBuf; i++ ) { thrdID = thrdNoBuf[i]; outputScafSeq ( fo, scafCounter + i + 1, ctgStackBuffer[i], darrayBuf[thrdID] ); outputGapSeq ( fo2, scafCounter + i + 1, ctgStackBuffer[i], darrayBuf[thrdID] ); } } static void MaskContig ( unsigned int 
ctg ) { contig_array[ctg].mask = 1; contig_array[getTwinCtg ( ctg )].mask = 1; } static void MarkCtgOccu ( unsigned int ctg ) { contig_array[ctg].flag = 1; contig_array[getTwinCtg ( ctg )].flag = 1; } static void output_ctg ( unsigned int ctg, FILE * fo ) { if ( contig_array[ctg].length < 1 ) { return; } int len; unsigned int bal_ctg = getTwinCtg ( ctg ); len = contig_array[ctg].length + overlaplen; int col = 0; if ( contig_array[ctg].seq ) { fprintf ( fo, ">C%d %4.1f\n", ctg, ( double ) contig_array[ctg].cvg ); outputTightStr ( fo, contig_array[ctg].seq, 0, len, len, 0, &col ); } else if ( contig_array[bal_ctg].seq ) { fprintf ( fo, ">C%d %4.1f\n", bal_ctg, ( double ) contig_array[ctg].cvg ); outputTightStr ( fo, contig_array[bal_ctg].seq, 0, len, len, 0, &col ); } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; fprintf ( fo, "\n" ); } void prlReadsCloseGap ( char * graphfile ) { //thrd_num=1; if ( fillGap ) { boolean flag; printf ( "\nStart to load reads for gap filling. %d length discrepancy is allowed\n", GLDiff ); printf ( "...\n" ); flag = loadReads4gap ( graphfile ); if ( !flag ) { return; } } if ( orig2new ) { convertIndex(); orig2new = 0; } FILE * fp, *fo, *fo2; char line[1024]; CTGinSCAF * actg; STACK * ctgStack, *aStack; int index = 0, offset = 0, counter, overallLen; int i, starter, prev_start, gapLen, catchable; unsigned int ctg, prev_ctg = 0; boolean IsPrevGap; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; for ( ctg = 1; ctg <= num_ctg; ctg++ ) { contig_array[ctg].flag = 0; } MAXKMER = maxKmer(); ctgStack = ( STACK * ) createStack ( 1000, sizeof ( CTGinSCAF ) ); sprintf ( line, "%s.scaf_gap", graphfile ); fp = ckopen ( line, "r" ); sprintf ( line, "%s.scafSeq", graphfile ); fo = ckopen ( line, "w" ); sprintf ( line, "%s.gapSeq", graphfile ); fo2 = ckopen ( line, "w" ); pthread_mutex_init ( &mutex, NULL ); flagBuf = ( boolean * ) ckalloc ( scafBufSize * sizeof ( boolean ) );; thrdNoBuf = ( 
unsigned char * ) ckalloc ( scafBufSize * sizeof ( unsigned char ) );; memset ( thrdNoBuf, 0, scafBufSize * sizeof ( char ) ); ctgStackBuffer = ( STACK ** ) ckalloc ( scafBufSize * sizeof ( STACK * ) ); initStackBuf ( ctgStackBuffer, scafBufSize ); darrayBuf = ( DARRAY ** ) ckalloc ( thrd_num * sizeof ( DARRAY * ) ); counters = ( int * ) ckalloc ( thrd_num * sizeof ( int ) ); for ( i = 0; i < thrd_num; i++ ) { counters[i] = 0; darrayBuf[i] = ( DARRAY * ) createDarray ( 100000, sizeof ( char ) ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } if ( fillGap ) { creatThrds ( threads, paras ); } Ncounter = scafCounter = scafInBuf = allGaps = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index ) { aStack = ctgStackBuffer[scafInBuf]; flagBuf[scafInBuf++] = 0; reverseStack ( aStack, ctgStack ); if ( scafInBuf == scafBufSize ) { if ( fillGap ) { sendWorkSignal ( 1, thrdSignal ); } outputSeqs ( fo, fo2, scafInBuf ); scafCounter += scafInBuf; scafInBuf = 0; } if ( index % 1000 == 0 ) { printf ( "Processed %d scaffolds\n", index ); } } //read next scaff emptyStack ( ctgStack ); IsPrevGap = offset = prev_ctg = 0; sscanf ( line + 9, "%d %d %d", &index, &counter, &overallLen ); continue; } if ( line[0] == 'G' ) // gap appears { if ( fillGap ) { gapLen = procGap ( line, ctgStack ); IsPrevGap = 1; } continue; } if ( line[0] >= '0' && line[0] <= '9' ) // a contig line { sscanf ( line, "%d %d", &ctg, &starter ); actg = ( CTGinSCAF * ) stackPush ( ctgStack ); actg->ctgID = ctg; if ( contig_array[ctg].flag ) { MaskContig ( ctg ); } else { MarkCtgOccu ( ctg ); } initiateCtgInScaf ( actg ); if ( !prev_ctg ) { actg->cutHead = 0; } else if ( !IsPrevGap ) { allGaps++; } if ( !IsPrevGap ) { if ( prev_ctg && ( starter - prev_start - ( int ) contig_array[prev_ctg].length ) < ( ( int ) overlaplen * 4 ) ) { /* if(fillGap) catchable = contigCatch(prev_ctg,ctg); else */ 
catchable = 0; if ( catchable ) // prev_ctg and ctg overlap **bp { allGaps--; /* if(isLargerThanTwin(prev_ctg)) fprintf(stderr,"%d ####### by_overlap\n",getTwinCtg(prev_ctg)); else fprintf(stderr,"%d ####### by_overlap\n",prev_ctg); */ actg->scaftig_start = 0; actg->cutHead = catchable; offset += - ( starter - prev_start - contig_array[prev_ctg].length ) + ( overlaplen - catchable ); } else { actg->scaftig_start = 1; } } else { actg->scaftig_start = 1; } } else { offset += - ( starter - prev_start - contig_array[prev_ctg].length ) + gapLen; actg->scaftig_start = 0; } actg->start = starter + offset; actg->end = actg->start + contig_array[ctg].length - 1; actg->mask = contig_array[ctg].mask; IsPrevGap = 0; prev_ctg = ctg; prev_start = starter; } } if ( index ) { aStack = ctgStackBuffer[scafInBuf]; flagBuf[scafInBuf++] = 0; reverseStack ( aStack, ctgStack ); if ( fillGap ) { sendWorkSignal ( 1, thrdSignal ); } outputSeqs ( fo, fo2, scafInBuf ); } if ( fillGap ) { sendWorkSignal ( 2, thrdSignal ); thread_wait ( threads ); } for ( ctg = 1; ctg <= num_ctg; ctg++ ) { if ( ( contig_array[ctg].length + overlaplen ) < 100 || contig_array[ctg].flag ) { continue; } output_ctg ( ctg, fo ); } printf ( "Done with %d scaffolds, %d gaps finished, %d gaps overall\n", index, allGaps - Ncounter, allGaps ); index = 0; for ( i = 0; i < thrd_num; i++ ) { freeDarray ( darrayBuf[i] ); index += counters[i]; } if ( fillGap ) { printf ( "Threads processed %d scaffolds\n", index ); } free ( ( void * ) darrayBuf ); if ( readSeqInGap ) { freeDarray ( readSeqInGap ); } fclose ( fp ); fclose ( fo ); fclose ( fo2 ); freeStack ( ctgStack ); freeStackBuf ( ctgStackBuffer, scafBufSize ); free ( ( void * ) flagBuf ); free ( ( void * ) thrdNoBuf ); free ( ( void * ) ctgStackBuffer ); } SOAPdenovo-V1.05/src/31mer/read2scaf.c000644 000765 000024 00000016353 11530651532 017321 0ustar00Aquastaff000000 000000 /* * 31mer/read2scaf.c * * Copyright (c) 2008-2010 BGI-Shenzhen . 
* * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int Ncounter; static int allGaps; // for multi threads static int scafBufSize = 100; static STACK ** ctgStackBuffer; static int scafCounter; static int scafInBuf; static void convertIndex() { int * length_array = ( int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( int ) ); unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with new index: index_array[i] free ( ( void * ) length_array ); } static void reverseStack ( STACK * dStack, STACK * sStack ) { CTGinSCAF * actg, *ctgPt; emptyStack ( dStack ); while ( ( actg = ( CTGinSCAF * ) stackPop ( sStack ) ) != NULL ) { ctgPt = ( CTGinSCAF * ) stackPush ( dStack ); ctgPt->ctgID = actg->ctgID; ctgPt->start = actg->start; ctgPt->end = actg->end; } stackBackup ( dStack ); } static void initStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { ctgStackBuffer[i] = ( STACK * ) createStack ( 100, sizeof ( CTGinSCAF ) ); } } static void freeStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { freeStack ( ctgStackBuffer[i] ); 
} } static void mapCtg2Scaf ( int scafInBuf ) { int i, scafID; CTGinSCAF * actg; STACK * ctgsStack; unsigned int ctg, bal_ctg; for ( i = 0; i < scafInBuf; i++ ) { scafID = scafCounter + i + 1; ctgsStack = ctgStackBuffer[i]; while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( contig_array[ctg].from_vt != 0 ) { contig_array[ctg].multi = 1; contig_array[bal_ctg].multi = 1; continue; } contig_array[ctg].from_vt = scafID; contig_array[ctg].to_vt = actg->start; contig_array[ctg].flag = 0; //ctg and scaf on the same strand contig_array[bal_ctg].from_vt = scafID; contig_array[bal_ctg].to_vt = actg->start; contig_array[bal_ctg].flag = 1; } } } static void locateContigOnscaff ( char * graphfile ) { FILE * fp; char line[1024]; CTGinSCAF * actg; STACK * ctgStack, *aStack; int index = 0, counter, overallLen; int starter, prev_start, gapN, scafLen; unsigned int ctg, prev_ctg = 0; for ( ctg = 1; ctg <= num_ctg; ctg++ ) { contig_array[ctg].from_vt = 0; contig_array[ctg].multi = 0; } ctgStack = ( STACK * ) createStack ( 1000, sizeof ( CTGinSCAF ) ); sprintf ( line, "%s.scaf_gap", graphfile ); fp = ckopen ( line, "r" ); ctgStackBuffer = ( STACK ** ) ckalloc ( scafBufSize * sizeof ( STACK * ) ); initStackBuf ( ctgStackBuffer, scafBufSize ); Ncounter = scafCounter = scafInBuf = allGaps = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index ) { aStack = ctgStackBuffer[scafInBuf++]; reverseStack ( aStack, ctgStack ); if ( scafInBuf == scafBufSize ) { mapCtg2Scaf ( scafInBuf ); scafCounter += scafInBuf; scafInBuf = 0; } if ( index % 1000 == 0 ) { printf ( "Processed %d scaffolds\n", index ); } } //read next scaff scafLen = prev_ctg = 0; emptyStack ( ctgStack ); sscanf ( line + 9, "%d %d %d", &index, &counter, &overallLen ); fprintf ( stderr, ">%d\n", index ); continue; } if ( line[0] == 'G' ) // gap appears { continue; } if ( line[0] >= '0' && line[0] <= '9' ) // a contig line { sscanf ( 
                     line, "%d %d", &ctg, &starter );
            actg = ( CTGinSCAF * ) stackPush ( ctgStack );
            actg->ctgID = ctg;

            if ( !prev_ctg ) {
                actg->start = scafLen;
                actg->end = actg->start + overlaplen + contig_array[ctg].length - 1;
            } else {
                gapN = starter - prev_start - ( int ) contig_array[prev_ctg].length;
                gapN = gapN < 1 ? 1 : gapN;
                actg->start = scafLen + gapN;
                actg->end = actg->start + contig_array[ctg].length - 1;
            }

            fprintf ( stderr, "%d\t%d\n", actg->start, actg->end );
            scafLen = actg->end + 1;
            prev_ctg = ctg;
            prev_start = starter;
        }
    }

    if ( index ) {
        aStack = ctgStackBuffer[scafInBuf++];
        reverseStack ( aStack, ctgStack );
        mapCtg2Scaf ( scafInBuf );
    }

    gapN = 0;

    for ( ctg = 1; ctg <= num_ctg; ctg++ ) {
        if ( contig_array[ctg].from_vt == 0 || contig_array[ctg].multi == 1 ) {
            continue;
        }

        gapN++;
    }

    printf ( "\nDone with %d scaffolds, %d contigs in scaffolds\n", index, gapN );
    fclose ( fp );
    freeStack ( ctgStack );
    freeStackBuf ( ctgStackBuffer, scafBufSize );
    free ( ( void * ) ctgStackBuffer );
}

static boolean contigElligible ( unsigned int contigno )
{
    unsigned int ctg = index_array[contigno];

    if ( contig_array[ctg].from_vt == 0 || contig_array[ctg].multi == 1 ) {
        return 0;
    } else {
        return 1;
    }
}

static void output1read ( FILE * fo, long long readno, unsigned int contigno, int pos )
{
    unsigned int ctg = index_array[contigno];
    int posOnScaf;
    char orien;
    pos = pos < 0 ?
          0 : pos;

    if ( contig_array[ctg].flag == 0 ) {
        posOnScaf = contig_array[ctg].to_vt + pos - overlaplen;
        orien = '+';
    } else {
        posOnScaf = contig_array[ctg].to_vt + contig_array[ctg].length - pos;
        orien = '-';
    }

    /*
    if(readno==676)
        printf("Read %lld in region from %d, extend %d, pos %d, orien %c\n",
            readno,contig_array[ctg].to_vt,contig_array[ctg].length,posOnScaf,orien);
    */
    fprintf ( fo, "%lld\t%d\t%d\t%c\n", readno, contig_array[ctg].from_vt, posOnScaf, orien );
}

void locateReadOnScaf ( char * graphfile )
{
    char name[1024], line[1024];
    FILE * fp, *fo;
    long long readno, counter = 0, pre_readno = 0;
    unsigned int contigno, pre_contigno;
    int pre_pos, pos;
    locateContigOnscaff ( graphfile );
    sprintf ( name, "%s.readOnContig", graphfile );
    fp = ckopen ( name, "r" );
    sprintf ( name, "%s.readOnScaf", graphfile );
    fo = ckopen ( name, "w" );

    if ( !orig2new ) {
        convertIndex();
        orig2new = 1;
    }

    // discard the first line of the file before parsing records
    fgets ( line, 1024, fp );

    while ( fgets ( line, 1024, fp ) != NULL ) {
        // contigno is unsigned, so it is scanned with %u rather than %d
        sscanf ( line, "%lld %u %d", &readno, &contigno, &pos );

        // an even read number preceded by its odd partner: they are a pair of reads
        if ( ( readno % 2 == 0 ) && ( pre_readno == readno - 1 )
                && contigElligible ( pre_contigno ) && contigElligible ( contigno ) ) {
            output1read ( fo, pre_readno, pre_contigno, pre_pos );
            output1read ( fo, readno, contigno, pos );
            counter++;
        }

        pre_readno = readno;
        pre_contigno = contigno;
        pre_pos = pos;
    }

    printf ( "%lld pairs on contig\n", counter );
    fclose ( fp );
    fclose ( fo );
}

/*
 * 31mer/readInterval.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */
#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define RVBLOCKSIZE 1000

void destroyReadIntervMem()
{
    freeMem_manager ( rv_mem_manager );
    rv_mem_manager = NULL;
}

READINTERVAL * allocateRV ( int readid, int edgeid )
{
    READINTERVAL * newRV;
    newRV = ( READINTERVAL * ) getItem ( rv_mem_manager );
    newRV->readid = readid;
    newRV->edgeid = edgeid;
    newRV->nextInRead = NULL;
    newRV->prevInRead = NULL;
    newRV->nextOnEdge = NULL;
    newRV->prevOnEdge = NULL;
    return newRV;
}

void dismissRV ( READINTERVAL * rv )
{
    returnItem ( rv_mem_manager, rv );
}

void createRVmemo()
{
    if ( !rv_mem_manager ) {
        rv_mem_manager = createMem_manager ( RVBLOCKSIZE, sizeof ( READINTERVAL ) );
    } else {
        printf ( "Warning from createRVmemo: rv_mem_manager is an active pointer\n" );
    }
}

/*
 * 31mer/readseq1by1.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */
#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static char src_rc_seq[1024];

void readseq1by1 ( char * src_seq, char * src_name, int * len_seq, FILE * fp, long long num_seq )
{
    int i, k, n, strL;
    char c;
    char str[5000];
    n = 0;
    k = num_seq;

    while ( fgets ( str, 4950, fp ) ) {
        if ( str[0] == '#' ) {
            continue;
        }

        if ( str[0] == '>' ) {
            /*
            if(k >= 0) { // if this isn't the first '>' in the file
                *len_seq = n;
            }
            */
            *len_seq = n;
            n = 0;
            sscanf ( &str[1], "%s", src_name );
            return;
        } else {
            strL = strlen ( str );

            if ( strL + n > maxReadLen ) {
                strL = maxReadLen - n;
            }

            for ( i = 0; i < strL; i ++ ) {
                if ( str[i] >= 'a' && str[i] <= 'z' ) {
                    c = base2int ( str[i] - 'a' + 'A' );
                    src_seq[n ++] = c;
                } else if ( str[i] >= 'A' && str[i] <= 'Z' ) {
                    c = base2int ( str[i] );
                    src_seq[n ++] = c;
                    // after pre-processing, all the symbols would be a,g,c,t,n in lower or upper case.
                } else if ( str[i] == '.' ) {
                    c = base2int ( 'A' );
                    src_seq[n ++] = c;
                }

                // after pre-processing, all the symbols would be a,g,c,t,n in lower or upper case.
            }

            //printf("%d: %d\n",k,n);
        }
    }

    if ( k >= 0 ) {
        *len_seq = n;
        return;
    }

    *len_seq = 0;
}

void read_one_sequence ( FILE * fp, long long * T, char ** X )
{
    char * fasta, *src_name; //point to fasta array
    // readseqpar compares against these values, so they must start initialized
    int num_seq, len = 0, name_len = 0, min_len = 0;
    num_seq = readseqpar ( &len, &min_len, &name_len, fp );

    if ( num_seq < 1 ) {
        printf ( "no fasta sequence in file\n" );
        *T = 0;
        return;
    }

    fasta = ( char * ) ckalloc ( len * sizeof ( char ) );
    src_name = ( char * ) ckalloc ( ( name_len + 1 ) * sizeof ( char ) );
    rewind ( fp );
    readseq1by1 ( fasta, src_name, &len, fp, -1 );
    readseq1by1 ( fasta, src_name, &len, fp, 0 );
    *X = fasta;
    *T = len;
    free ( ( void * ) src_name );
}

long long multiFileParse ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp )
{
    char str[5000];
    FILE * freads;
    int slen;
    long long counter = 0;
    *max_name_leg = *max_leg = 1;
    *min_leg = 1000;

    while ( fgets ( str, 4950, fp ) ) {
        slen = strlen ( str );

        // strip the trailing newline only if present, so a final line
        // without one does not lose the last character of the file name
        if ( slen > 0 && str[slen - 1] == '\n' ) {
            str[slen - 1] = '\0';
        }

        freads = ckopen ( str, "r" );
        counter += readseqpar ( max_leg, min_leg, max_name_leg, freads );
        fclose ( freads );
    }

    return counter;
}

long long readseqpar ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp )
{
    int l, n;
    long long k;
    char str[5000], src_name[5000];
    n = 0;
    k = -1;

    while ( fgets ( str, 4950, fp ) ) {
        if ( str[0] == '>' ) {
            if ( k >= 0 ) {
                if ( n > *max_leg ) {
                    *max_leg = n;
                }

                if ( n < *min_leg ) {
                    *min_leg = n;
                }
            }

            n = 0;
            k ++;
            sscanf ( &str[1], "%s", src_name );

            if ( ( l = strlen ( src_name ) ) > *max_name_leg ) {
                *max_name_leg = l;
            }
        } else {
            n += strlen ( str ) - 1;
        }
    }

    if ( n > *max_leg ) {
        *max_leg = n;
    }

    if ( n < *min_leg ) {
        *min_leg = n;
    }

    k ++;
    return ( k );
}

void read1seqfq ( char * src_seq, char * src_name, int * len_seq, FILE * fp )
{
    int i, n, strL;
    char c;
    char str[5000];
    boolean flag = 0;

    while ( fgets ( str, 4950, fp ) ) {
        if ( str[0] == '@' ) {
            flag = 1;
            sscanf ( &str[1], "%s", src_name );
            break;
        }
    }

    // no '@' header found: the previous call already consumed the last fq record
    if ( !flag ) {
        *len_seq = 0;
        return;
    }

    n = 0;

    while (
fgets ( str, 4950, fp ) ) { if ( str[0] == '+' ) { fgets ( str, 4950, fp ); // pass quality value line *len_seq = n; return; } else { strL = strlen ( str ); if ( strL + n > maxReadLen ) { strL = maxReadLen - n; } for ( i = 0; i < strL; i ++ ) { if ( str[i] >= 'a' && str[i] <= 'z' ) { c = base2int ( str[i] - 'a' + 'A' ); src_seq[n ++] = c; } else if ( str[i] >= 'A' && str[i] <= 'Z' ) { c = base2int ( str[i] ); src_seq[n ++] = c; // after pre-process all the symbles would be a,g,c,t,n in lower or upper case. } else if ( str[i] == '.' ) { c = base2int ( 'A' ); src_seq[n ++] = c; } // after pre-process all the symbles would be a,g,c,t,n in lower or upper case. } //printf("%d: %d\n",k,n); } } *len_seq = n; return; } // find the next file to open in libs static int nextValidIndex ( int libNo, boolean pair, unsigned char asm_ctg ) { int i = libNo; while ( i < num_libs ) { if ( asm_ctg == 1 && ( lib_array[i].asm_flag != 1 && lib_array[i].asm_flag != 3 ) ) { i++; continue; } else if ( asm_ctg == 0 && ( lib_array[i].asm_flag != 2 && lib_array[i].asm_flag != 3 ) ) { i++; continue; } else if ( asm_ctg > 1 && lib_array[i].asm_flag != asm_ctg ) // reads for other purpose { i++; continue; } if ( lib_array[i].curr_type == 1 && lib_array[i].curr_index < lib_array[i].num_a1_file ) { return i; } if ( lib_array[i].curr_type == 2 && lib_array[i].curr_index < lib_array[i].num_q1_file ) { return i; } if ( lib_array[i].curr_type == 3 && lib_array[i].curr_index < lib_array[i].num_p_file ) { return i; } if ( pair ) { if ( lib_array[i].curr_type < 3 ) { lib_array[i].curr_type++; lib_array[i].curr_index = 0; } else { i++; } continue; } if ( lib_array[i].curr_type == 4 && lib_array[i].curr_index < lib_array[i].num_s_a_file ) { return i; } if ( lib_array[i].curr_type == 5 && lib_array[i].curr_index < lib_array[i].num_s_q_file ) { return i; } if ( lib_array[i].curr_type < 5 ) { lib_array[i].curr_type++; lib_array[i].curr_index = 0; } else { i++; } }//for each lib return i; } static FILE * 
openFile4read ( char * fname ) { FILE * fp; if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { char * cmd = ( char * ) ckalloc ( ( strlen ( fname ) + 20 ) * sizeof ( char ) ); sprintf ( cmd, "gzip -dc %s", fname ); fp = popen ( cmd, "r" ); free ( cmd ); return fp; } else { return ckopen ( fname, "r" ); } } void openFileInLib ( int libNo ) { int i = libNo; if ( lib_array[i].curr_type == 1 ) { printf ( "read from file:\n %s\n", lib_array[i].a1_fname[lib_array[i].curr_index] ); printf ( "read from file:\n %s\n", lib_array[i].a2_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].a1_fname[lib_array[i].curr_index] ); lib_array[i].fp2 = openFile4read ( lib_array[i].a2_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 1; } else if ( lib_array[i].curr_type == 2 ) { printf ( "read from file:\n %s\n", lib_array[i].q1_fname[lib_array[i].curr_index] ); printf ( "read from file:\n %s\n", lib_array[i].q2_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].q1_fname[lib_array[i].curr_index] ); lib_array[i].fp2 = openFile4read ( lib_array[i].q2_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 1; } else if ( lib_array[i].curr_type == 3 ) { printf ( "read from file:\n %s\n", lib_array[i].p_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].p_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } else if ( lib_array[i].curr_type == 4 ) { printf ( "read from file:\n %s\n", lib_array[i].s_a_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].s_a_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } else if ( lib_array[i].curr_type == 5 ) { printf ( "read from file:\n %s\n", lib_array[i].s_q_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( 
lib_array[i].s_q_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } } static void reverse2k ( char * src_seq, int len_seq ) { if ( !len_seq ) { return; } int i; reverseComplementSeq ( src_seq, len_seq, src_rc_seq ); for ( i = 0; i < len_seq; i++ ) { src_seq[i] = src_rc_seq[i]; } } static void closeFp1InLab ( int libNo ) { int ftype = lib_array[libNo].curr_type; int index = lib_array[libNo].curr_index - 1; char * fname; if ( ftype == 1 ) { fname = lib_array[libNo].a1_fname[index]; } else if ( ftype == 2 ) { fname = lib_array[libNo].q1_fname[index]; } else if ( ftype == 3 ) { fname = lib_array[libNo].p_fname[index]; } else if ( ftype == 4 ) { fname = lib_array[libNo].s_a_fname[index]; } else if ( ftype == 5 ) { fname = lib_array[libNo].s_q_fname[index]; } else { return; } if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { pclose ( lib_array[libNo].fp1 ); } else { fclose ( lib_array[libNo].fp1 ); } } static void closeFp2InLab ( int libNo ) { int ftype = lib_array[libNo].curr_type; int index = lib_array[libNo].curr_index - 1; char * fname; if ( ftype == 1 ) { fname = lib_array[libNo].a2_fname[index]; } else if ( ftype == 2 ) { fname = lib_array[libNo].q2_fname[index]; } else { return; } if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { pclose ( lib_array[libNo].fp2 ); } else { fclose ( lib_array[libNo].fp2 ); } } boolean read1seqInLib ( char * src_seq, char * src_name, int * len_seq, int * libNo, boolean pair, unsigned char asm_ctg ) { int i = *libNo; int prevLib = i; if ( !lib_array[i].fp1 // file1 does not exist || ( lib_array[i].curr_type != 1 && feof ( lib_array[i].fp1 ) ) // file1 reaches end and not type1 || ( lib_array[i].curr_type == 1 && feof ( lib_array[i].fp1 ) && feof ( lib_array[i].fp2 ) ) ) //f1&f2 reaches end { if ( lib_array[i].fp1 && feof ( lib_array[i].fp1 ) ) { closeFp1InLab ( i ); } if ( lib_array[i].fp2 && feof ( lib_array[i].fp2 ) ) { 
closeFp2InLab ( i ); } *libNo = nextValidIndex ( i, pair, asm_ctg ); i = *libNo; if ( lib_array[i].rd_len_cutoff > 0 ) maxReadLen = lib_array[i].rd_len_cutoff < maxReadLen4all ? lib_array[i].rd_len_cutoff : maxReadLen4all; else { maxReadLen = maxReadLen4all; } //record insert size info //printf("from lib %d to %d, read %lld to %ld\n",prevLib,i,readNumBack,n_solexa); if ( pair && i != prevLib ) { if ( readNumBack < n_solexa ) { pes[gradsCounter].PE_bound = n_solexa; pes[gradsCounter].rank = lib_array[prevLib].rank; pes[gradsCounter].pair_num_cut = lib_array[prevLib].pair_num_cut; pes[gradsCounter++].insertS = lib_array[prevLib].avg_ins; readNumBack = n_solexa; } } if ( i >= num_libs ) { return 0; } openFileInLib ( i ); if ( lib_array[i].curr_type == 1 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, -1 ); readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp2, -1 ); } else if ( lib_array[i].curr_type == 3 || lib_array[i].curr_type == 4 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, -1 ); } } if ( lib_array[i].curr_type == 1 ) { if ( lib_array[i].paired == 1 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, 1 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 2; if ( *len_seq > 0 || !feof ( lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } else { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp2, 1 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 1; n_solexa++; return 1; //can't fail to read a read2 } } if ( lib_array[i].curr_type == 2 ) { if ( lib_array[i].paired == 1 ) { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp1 ); /* if(*len_seq>0){ for(j=0;j<*len_seq;j++) printf("%c",int2base(src_seq[j])); printf("\n"); } */ if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 2; if ( *len_seq > 0 || !feof ( 
lib_array[i].fp1 ) )
            {
                n_solexa++;
                return 1;
            }
            else
            {
                return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg );
            }
        }
        else
        {
            read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp2 );

            if ( lib_array[i].reverse )
            {
                reverse2k ( src_seq, *len_seq );
            }

            lib_array[i].paired = 1;
            n_solexa++;
            return 1; //can't fail to read a read2
        }
    }

    if ( lib_array[i].curr_type == 5 )
    {
        read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp1 );
    }
    else
    {
        readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, 1 );
    }

    /*
    int t;
    for(t=0;t<*len_seq;t++)
        printf("%d",src_seq[t]);
    printf("\n");
    */
    if ( lib_array[i].reverse )
    {
        reverse2k ( src_seq, *len_seq );
    }

    if ( *len_seq > 0 || !feof ( lib_array[i].fp1 ) )
    {
        n_solexa++;
        return 1;
    }
    else
    {
        return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg );
    }
}

SOAPdenovo-V1.05/src/31mer/scaffold.c

/*
 * 31mer/scaffold.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static void initenv ( int argc, char ** argv );
static void display_scaff_usage();

static boolean LINK, SCAFF;
static char graphfile[256];

int call_scaffold ( int argc, char ** argv )
{
    time_t start_t, stop_t, time_bef, time_aft;
    time ( &start_t );
    initenv ( argc, argv );
    loadPEgrads ( graphfile );
    time ( &time_bef );
    loadUpdatedEdges ( graphfile );
    time ( &time_aft );
    printf ( "time spent on loading edges %ds\n", ( int ) ( time_aft - time_bef ) );

    if ( !SCAFF )
    {
        time ( &time_bef );
        PE2Links ( graphfile );
        time ( &time_aft );
        printf ( "time spent on loading pair end info %ds\n", ( int ) ( time_aft - time_bef ) );
        time ( &time_bef );
        Links2Scaf ( graphfile );
        time ( &time_aft );
        printf ( "time spent on creating scaffolds %ds\n", ( int ) ( time_aft - time_bef ) );
        scaffolding ( 100, graphfile );
    }

    prlReadsCloseGap ( graphfile );
    // locateReadOnScaf(graphfile);
    free_pe_mem();

    if ( index_array )
    {
        free ( ( void * ) index_array );
    }

    freeContig_array();
    destroyPreArcMem();
    destroyConnectMem();
    deleteCntLookupTable();
    time ( &stop_t );
    printf ( "time elapsed: %dm\n", ( int ) ( stop_t - start_t ) / 60 );
    return 0;
}

/*****************************************************************************
 * Parse command line switches
 *****************************************************************************/
void initenv ( int argc, char ** argv )
{
    int copt;
    int inpseq;
    extern char * optarg;
    char temp[256];
    inpseq = 0;
    LINK = 0;
    SCAFF = 0;
    optind = 1;

    while ( ( copt = getopt ( argc, argv, "g:L:p:G:FuS" ) ) != EOF )
    {
        switch ( copt )
        {
            case 'g':
                inGraph = 1;
                // the field width keeps the copy within graphfile[256]
                sscanf ( optarg, "%255s", graphfile );
                continue;
            case 'G':
                // the field width keeps the copy within temp[256]
                sscanf ( optarg, "%255s", temp );
                GLDiff = atoi ( temp );
                continue;
            case 'L':
                sscanf ( optarg, "%255s", temp );
                ctg_short = atoi ( temp );
                continue;
            case 'F':
                fillGap = 1;
                continue;
            case 'S':
                SCAFF = 1;
                continue;
            case 'u':
                maskRep = 0;
                continue;
            case 'p':
                sscanf ( optarg, "%255s", temp );
                thrd_num = atoi ( temp );
                continue;
            default:

                if ( inGraph == 0 )
                {
                    display_scaff_usage();
                    exit ( -1 );
                }
        }
    }

    if ( inGraph == 0 )
    {
        display_scaff_usage();
        exit ( -1 );
    }
}

static void display_scaff_usage()
{
    printf ( "\nscaff -g InputGraph [-F -u -S] [-G gapLenDiff -L minContigLen] [-p n_cpu]\n" );
    printf ( " -g InputFile: prefix of graph file names\n" );
    printf ( " -F (optional) fill gaps in scaffold\n" );
    printf ( " -S (optional) scaffold structure exists(default: NO)\n" );
    printf ( " -G gapLenDiff(default 50): allowed length difference between estimated and filled gap\n" );
    printf ( " -u (optional): un-mask contigs with high coverage before scaffolding (default mask)\n" );
    printf ( " -p n_cpu(default 8): number of cpu for use\n" );
    printf ( " -L minLen(default K+2): shortest contig for scaffolding\n" );
}

SOAPdenovo-V1.05/src/31mer/searchPath.c

/*
 * 31mer/searchPath.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static int trace_limit = 5000; // maximum number of times a trace function may be called in one search

/*
   search connection paths which were masked along related contigs
   start from one contig, end with another
   path length includes the length of the last contig
*/
void traceAlongMaskedCnt ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route )
{
    num_trace++;

    if ( num_trace > trace_limit || *num_route >= max_n_routes )
    {
        return;
    }

    unsigned int * array;
    int num, i, length;
    CONNECT * ite_cnt;

    if ( index > 0 ) // there are at most max_steps edges stored in this array, including the destination edge
    {
        length = len + contig_array[currE].length;
    }
    else
    {
        length = 0;
    }

    if ( index > max_steps || length > max )
    {
        return;
    } // this is the only situation in which we stop

    if ( index > 0 ) // there are at most max_steps edges stored in this array, including the destination edge
    {
        so_far[index - 1] = currE;
    }

    if ( currE == destE && index == 0 )
    {
        printf ( "traceAlongMaskedCnt: start and destination are the same\n" );
        return;
    }

    if ( currE == destE && length >= min && length <= max )
    {
        num = *num_route;
        array = found_routes[num];

        for ( i = 0; i < index; i++ )
        {
            array[i] = so_far[i];
        }

        if ( index < max_steps )
        {
            array[index] = 0;
        } // indicate the end of the route

        *num_route = ++num;
    } // one route is extracted, but we don't terminate searching

    ite_cnt = contig_array[currE].downwardConnect;

    while ( ite_cnt )
    {
        if ( !ite_cnt->mask || ite_cnt->deleted )
        {
            ite_cnt = ite_cnt->next;
            continue;
        }

        traceAlongMaskedCnt ( destE, ite_cnt->contigID, max_steps, min, max, index + 1, length + ite_cnt->gapLen, num_route );
        ite_cnt = ite_cnt->next;
    }
}

// search connection paths from one connect to a contig
// path length includes the length of the last contig
void traceAlongConnect ( unsigned int destE, CONNECT * currCNT, int max_steps, int min, int max, int index, int len, int * num_route )
{
    num_trace++;

    if ( num_trace > trace_limit || *num_route >= max_n_routes )
    {
        return;
    }

    unsigned int * array, currE;
    int num, i, length;
    CONNECT * ite_cnt;
    currE = currCNT->contigID;
    length = len + currCNT->gapLen;
    length += contig_array[currE].length;

    if ( index > max_steps || length > max )
    {
        return;
    } // this is the only situation in which we stop

    /*
    if(globalFlag)
        printf("B: step %d, ctg %d, length %d\n",index,currCNT->contigID,length);
    */
    if ( currE == destE && index == 1 )
    {
        printf ( "traceAlongConnect: start and destination are the same\n" );
        return;
    }

    so_far[index - 1] = currE; // there are at most max_steps edges stored in this array, including the destination edge

    if ( currE == destE && length >= min && length <= max )
    {
        num = *num_route;
        array = found_routes[num];

        for ( i = 0; i < index; i++ )
        {
            array[i] = so_far[i];
        }

        if ( index < max_steps )
        {
            array[index] = 0;
        } // indicate the end of the route

        *num_route = ++num;
    } // one route is extracted, but we don't terminate searching

    if ( currCNT->nextInScaf )
    {
        traceAlongConnect ( destE, currCNT->nextInScaf, max_steps, min, max, index + 1, length, num_route );
        return;
    }

    ite_cnt = contig_array[currE].downwardConnect;

    while ( ite_cnt )
    {
        if ( ite_cnt->mask || ite_cnt->deleted )
        {
            ite_cnt = ite_cnt->next;
            continue;
        }

        traceAlongConnect ( destE, ite_cnt, max_steps, min, max, index + 1, length, num_route );
        ite_cnt = ite_cnt->next;
    }
}

// find paths in the graph from currE to destE; the path length does not include the lengths of the two end contigs
void traceAlongArc ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route )
{
    num_trace++;

    if ( num_trace > trace_limit || *num_route >= max_n_routes )
    {
        return;
    }

    unsigned int * array, out_ed, vt;
    int num, i, pos, length;
    preARC * parc;
    pos = index;

    if ( pos > max_steps || len > max )
    {
        return;
    } // this is the only situation in which we stop

    if ( currE == destE && pos == 0 )
    {
        printf ( "traceAlongArc: start and destination are the same\n" );
        return;
    }

    if ( pos > 0 ) // pos starts with 0 for the starting edge
    {
        so_far[pos - 1] = currE;
    } // there are at most max_steps edges stored in this array, including the destination edge

    if ( currE == destE && len >= min )
    {
        num = *num_route;
        array = found_routes[num];

        for ( i = 0; i < pos; i++ )
        {
            array[i] = so_far[i];
        }

        if ( pos < max_steps )
        {
            array[pos] = 0;
        } // indicate the end of the route

        *num_route = ++num;
    } // one route is extracted, but we don't terminate searching

    if ( pos == max_steps || len == max )
    {
        return;
    }

    if ( pos++ > 0 ) // not the starting edge
    {
        length = len + contig_array[currE].length;
    }
    else
    {
        length = len;
    }

    vt = contig_array[currE].to_vt;
    parc = contig_array[currE].arcs;

    while ( parc )
    {
        out_ed = parc->to_ed;
        traceAlongArc ( destE, out_ed, max_steps, min, max, pos, length, num_route );
        parc = parc->next;
    }
}

SOAPdenovo-V1.05/src/31mer/seq.c

/*
 * 31mer/seq.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

/*
   put an insertSize in the grads array;
   if all grads have been entered and all the boundaries have been set, return 0
*/
void print_kmer ( FILE * fp, Kmer kmer, char c )
{
    if ( kmer )
    {
        fprintf ( fp, "%llx", kmer );
    }
    else
    {
        fprintf ( fp, "0x0" );
    }

    fprintf ( fp, "%c", c );
}

void printTightString ( char * tightSeq, int len )
{
    int i;

    for ( i = 0; i < len; i++ )
    {
        printf ( "%c", int2base ( ( int ) getCharInTightString ( tightSeq, i ) ) );

        if ( ( i + 1 ) % 100 == 0 )
        {
            printf ( "\n" );
        }
    }

    printf ( "\n" );
}

static Kmer fastReverseComp ( Kmer seq, char seq_size )
{
    // complement every 2-bit base at once (XOR each base with binary 10)
    seq ^= 0xAAAAAAAAAAAAAAAALLU;
    // reverse the order of the 32 two-bit bases by swapping ever larger groups
    seq = ( ( seq & 0x3333333333333333LLU ) << 2 ) | ( ( seq & 0xCCCCCCCCCCCCCCCCLLU ) >> 2 );
    seq = ( ( seq & 0x0F0F0F0F0F0F0F0FLLU ) << 4 ) | ( ( seq & 0xF0F0F0F0F0F0F0F0LLU ) >> 4 );
    seq = ( ( seq & 0x00FF00FF00FF00FFLLU ) << 8 ) | ( ( seq & 0xFF00FF00FF00FF00LLU ) >> 8 );
    seq = ( ( seq & 0x0000FFFF0000FFFFLLU ) << 16 ) | ( ( seq & 0xFFFF0000FFFF0000LLU ) >> 16 );
    seq = ( ( seq & 0x00000000FFFFFFFFLLU ) << 32 ) | ( ( seq & 0xFFFFFFFF00000000LLU ) >> 32 );
    // the seq_size bases now sit in the high bits; shift them back down
    return seq >> ( 64 - ( seq_size << 1 ) );
}

Kmer reverseComplementVerbose ( Kmer word, int overlap )
{
    return fastReverseComp ( word, overlap );
    /*
    int index;
    Kmer revComp = 0;
    Kmer copy = word;
    unsigned char nucleotide;

    for (index = 0; index < overlap; index++) {
        nucleotide = copy & 3;
        revComp <<= 2;
        revComp += int_comp(nucleotide);//3 - nucleotide;
        copy >>= 2;
    }

    return revComp;
    */
}

Kmer reverseComplement ( Kmer word, int overlap )
{
    return fastReverseComp ( word, overlap );
}

// store the 2-bit base nt at position pos; four bases are packed per byte, first base in the high bits
void writeChar2tightString ( char nt, char * tightSeq, int pos )
{
    char * byte = tightSeq + pos / 4;

    switch ( pos % 4 )
    {
        case 0:
            *byte &= 63;
            *byte += nt << 6;
            return;
        case 1:
            *byte &= 207;
            *byte += nt << 4;
            return;
        case 2:
            *byte &= 243;
            *byte += nt << 2;
            return;
        case 3:
            *byte &= 252;
            *byte += nt;
            return;
    }
}

char getCharInTightString ( char * tightSeq, int pos )
{
    char * byte = tightSeq + pos / 4;

    switch ( pos % 4 )
    {
        case 3:
            return ( *byte & 3 );
        case 2:
            return ( *byte & 12 ) >> 2;
        case 1:
            return ( *byte & 48 ) >> 4;
        case 0:
            return ( *byte & 192 ) >> 6;
    }

    return 0;
}

// reverse complement of a sequence encoded as 0, 1, 2, 3; the result is written to bal_seq
void reverseComplementSeq ( char * seq, int len, char * bal_seq )
{
    int i, index = 0;

    if ( len < 1 )
    {
        return;
    }

    for ( i = len - 1; i >= 0; i-- )
    {
        bal_seq[index++] = int_comp ( seq[i] );
    }

    return;
}

// reverse complement of a sequence encoded as 0, 1, 2, 3; returns a newly allocated buffer
char * compl_int_seq ( char * seq, int len )
{
    char * bal_seq = NULL, c, bal_c;
    int i, index;

    if ( len < 1 )
    {
        return bal_seq;
    }

    bal_seq = ( char * ) ckalloc ( len * sizeof ( char ) );
    index = 0;

    for ( i = len - 1; i >= 0; i-- )
    {
        c = seq[i];

        if ( c < 4 )
        {
            bal_c = int_comp ( c );
        } //3-c;
        else
        {
            bal_c = c;
        }

        bal_seq[index++] = bal_c;
    }

    return bal_seq;
}

long long trans_seq ( char * seq, int len )
{
    int i;
    long long res;
    res = 0;

    for ( i = 0; i < len; i ++ )
    {
        res = res * 4 + seq[i];
    }

    return ( res );
}

char * kmer2seq ( Kmer word )
{
    int i;
    char * seq;
    Kmer charMask = 3;
    seq = ( char * ) ckalloc ( overlaplen * sizeof ( char ) );

    for ( i = overlaplen - 1; i >= 0; i-- )
    {
        seq[i] = charMask & word;
        word >>= 2;
    }

    return seq;
}

SOAPdenovo-V1.05/src/31mer/splitReps.c

/*
 * 31mer/splitReps.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
* * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static unsigned int involved[9]; static unsigned int lefts[4]; static unsigned int rights[4]; static unsigned char gothrough[4][4]; static boolean interferingCheck ( unsigned int edgeno, int repTimes ) { int i, j, t; unsigned int bal_ed; involved[0] = edgeno; i = 1; for ( j = 0; j < repTimes; j++ ) { involved[i++] = lefts[j]; } for ( j = 0; j < repTimes; j++ ) { involved[i++] = rights[j]; } for ( j = 0; j < i - 1; j++ ) for ( t = j + 1; t < i; t++ ) if ( involved[j] == involved[t] ) { return 1; } for ( j = 0; j < i; j++ ) { bal_ed = getTwinEdge ( involved[j] ); for ( t = 0; t < i; t++ ) if ( bal_ed == involved[t] ) { return 1; } } return 0; } static ARC * arcCounts ( unsigned int edgeid, unsigned int * num ) { ARC * arc; ARC * firstValidArc = NULL; unsigned int count = 0; arc = edge_array[edgeid].arcs; while ( arc ) { if ( arc->to_ed > 0 ) { count++; } if ( count == 1 ) { firstValidArc = arc; } arc = arc->next; } *num = count; return firstValidArc; } static boolean readOnEdge ( long long readid, unsigned int edge ) { int index; int markNum; long long * marklist; if ( edge_array[edge].markers ) { markNum = edge_array[edge].multi; marklist = edge_array[edge].markers; } else { return 0; } for ( index = 0; index < markNum; index++ ) { if ( readid == marklist[index] ) { return 1; } } return 0; } static long long cntByReads ( unsigned int left, unsigned int middle , unsigned int right ) { int markNum; long long * marklist; if ( edge_array[left].markers ) { markNum = edge_array[left].multi; marklist = edge_array[left].markers; } else { return 0; } int index; long long readid; /* if(middle==8553) printf("%d markers on %d\n",markNum,left); */ for ( index = 0; index < markNum; index++ ) { readid = marklist[index]; if ( readOnEdge ( readid, middle ) && readOnEdge ( readid, right ) ) 
        {
            return readid;
        }
    }

    return 0;
}

/*
        -       -
          > - <
        -       -
*/
unsigned int solvable ( unsigned int edgeno )
{
    if ( EdSameAsTwin ( edgeno ) || edge_array[edgeno].multi == 255 )
    {
        return 0;
    }

    unsigned int bal_ed = getTwinEdge ( edgeno );
    unsigned int arcRight_n, arcLeft_n;
    unsigned int counter;
    unsigned int i, j;
    unsigned int branch, bal_branch;
    ARC * parcL, *parcR;
    parcL = arcCounts ( bal_ed, &arcLeft_n );

    if ( arcLeft_n < 2 )
    {
        return 0;
    }

    parcR = arcCounts ( edgeno, &arcRight_n );

    if ( arcLeft_n != arcRight_n )
    {
        return 0;
    }

    // check that each right branch has only one upstream connection
    arcRight_n = 0;

    while ( parcR )
    {
        if ( parcR->to_ed == 0 )
        {
            parcR = parcR->next;
            continue;
        }

        branch = parcR->to_ed;

        if ( EdSameAsTwin ( branch ) || edge_array[branch].multi == 255 )
        {
            return 0;
        }

        rights[arcRight_n++] = branch;
        bal_branch = getTwinEdge ( branch );
        arcCounts ( bal_branch, &counter );

        if ( counter != 1 )
        {
            return 0;
        }

        parcR = parcR->next;
    }

    // check that each left branch has only one downstream connection
    arcLeft_n = 0;

    while ( parcL )
    {
        if ( parcL->to_ed == 0 )
        {
            parcL = parcL->next;
            continue;
        }

        branch = parcL->to_ed;

        if ( EdSameAsTwin ( branch ) || edge_array[branch].multi == 255 )
        {
            return 0;
        }

        bal_branch = getTwinEdge ( branch );
        lefts[arcLeft_n++] = bal_branch;
        arcCounts ( bal_branch, &counter );

        if ( counter != 1 )
        {
            return 0;
        }

        parcL = parcL->next;
    }

    // check that reads indicate a one-to-one connection between upstream and downstream edges
    for ( i = 0; i < arcLeft_n; i++ )
    {
        counter = 0;

        for ( j = 0; j < arcRight_n; j++ )
        {
            gothrough[i][j] = cntByReads ( lefts[i], edgeno, rights[j] ) == 0 ?
0 : 1; counter += gothrough[i][j]; if ( counter > 1 ) { return 0; } } if ( counter != 1 ) { return 0; } } for ( j = 0; j < arcRight_n; j++ ) { counter = 0; for ( i = 0; i < arcLeft_n; i++ ) { counter += gothrough[i][j]; } if ( counter != 1 ) { return 0; } } return arcLeft_n; } static unsigned int cp1edge ( unsigned int source, unsigned int target ) { int length = edge_array[source].length; char * tightSeq; int index; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target; if ( bal_source > source ) { bal_target = target + 1; } else { bal_target = target; target = target + 1; } tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( index = 0; index < length / 4 + 1; index++ ) { tightSeq[index] = edge_array[source].seq[index]; } edge_array[target].length = length; edge_array[target].cvg = edge_array[source].cvg; edge_array[target].to_vt = edge_array[source].to_vt; edge_array[target].from_vt = edge_array[source].from_vt; edge_array[target].seq = tightSeq; edge_array[target].bal_edge = edge_array[source].bal_edge; edge_array[target].rv = NULL; edge_array[target].arcs = NULL; edge_array[target].markers = NULL; edge_array[target].flag = 0; edge_array[target].deleted = 0; tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( index = 0; index < length / 4 + 1; index++ ) { tightSeq[index] = edge_array[bal_source].seq[index]; } edge_array[bal_target].length = length; edge_array[bal_target].cvg = edge_array[bal_source].cvg; edge_array[bal_target].to_vt = edge_array[bal_source].to_vt; edge_array[bal_target].from_vt = edge_array[bal_source].from_vt; edge_array[bal_target].seq = tightSeq; edge_array[bal_target].bal_edge = edge_array[bal_source].bal_edge; edge_array[bal_target].rv = NULL; edge_array[bal_target].arcs = NULL; edge_array[bal_target].markers = NULL; edge_array[bal_target].flag = 0; edge_array[bal_target].deleted = 0; return target; } static void moveArc2cp ( unsigned int leftEd, unsigned int rightEd, 
unsigned int source, unsigned int target ) { unsigned int bal_left = getTwinEdge ( leftEd ); unsigned int bal_right = getTwinEdge ( rightEd ); unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); ARC * arc; ARC * newArc, *twinArc; //between left and source arc = getArcBetween ( leftEd, source ); arc->to_ed = 0; newArc = allocateArc ( target ); newArc->multiplicity = arc->multiplicity; newArc->prev = NULL; newArc->next = edge_array[leftEd].arcs; if ( edge_array[leftEd].arcs ) { edge_array[leftEd].arcs->prev = newArc; } edge_array[leftEd].arcs = newArc; arc = getArcBetween ( bal_source, bal_left ); arc->to_ed = 0; twinArc = allocateArc ( bal_left ); twinArc->multiplicity = arc->multiplicity; twinArc->prev = NULL; twinArc->next = NULL; edge_array[bal_target].arcs = twinArc; newArc->bal_arc = twinArc; twinArc->bal_arc = newArc; //between source and right arc = getArcBetween ( source, rightEd ); arc->to_ed = 0; newArc = allocateArc ( rightEd ); newArc->multiplicity = arc->multiplicity; newArc->prev = NULL; newArc->next = NULL; edge_array[target].arcs = newArc; arc = getArcBetween ( bal_right, bal_source ); arc->to_ed = 0; twinArc = allocateArc ( bal_target ); twinArc->multiplicity = arc->multiplicity; twinArc->prev = NULL; twinArc->next = edge_array[bal_right].arcs; if ( edge_array[bal_right].arcs ) { edge_array[bal_right].arcs->prev = twinArc; } edge_array[bal_right].arcs = twinArc; newArc->bal_arc = twinArc; twinArc->bal_arc = newArc; } static void split1edge ( unsigned int edgeno, int repTimes ) { int i, j; unsigned int target; for ( i = 1; i < repTimes; i++ ) { for ( j = 0; j < repTimes; j++ ) if ( gothrough[i][j] > 0 ) { break; } target = cp1edge ( edgeno, extraEdgeNum ); moveArc2cp ( lefts[i], rights[j], edgeno, target ); extraEdgeNum += 2; } } static void debugging ( unsigned int i ) { ARC * parc; parc = edge_array[i].arcs; if ( !parc ) { printf ( "no downward connection for %d\n", i ); } while ( parc ) { printf ( 
"%d -> %d\n", i, parc->to_ed ); parc = parc->next; } } void solveReps() { unsigned int i; unsigned int repTime; int counter = 0; boolean flag; //debugging(30514); extraEdgeNum = num_ed + 1; for ( i = 1; i <= num_ed; i++ ) { repTime = solvable ( i ); if ( repTime == 0 ) { continue; } flag = interferingCheck ( i, repTime ); if ( flag ) { continue; } split1edge ( i, repTime ); counter ++; //+= 2*(repTime-1); if ( EdSmallerThanTwin ( i ) ) { i++; } } printf ( "%d repeats solvable, %d more edges\n", counter, extraEdgeNum - 1 - num_ed ); num_ed = extraEdgeNum - 1; removeDeadArcs(); if ( markersArray ) { free ( ( void * ) markersArray ); markersArray = NULL; } } SOAPdenovo-V1.05/src/31mer/stack.c000644 000765 000024 00000007160 11530651532 016570 0ustar00Aquastaff000000 000000 /* * 31mer/stack.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stack.h" STACK * createStack ( int num_items, size_t unit_size ) { STACK * newStack = ( STACK * ) malloc ( 1 * sizeof ( STACK ) ); newStack->block_list = NULL; newStack->items_per_block = num_items; newStack->item_size = unit_size; newStack->item_c = 0; return newStack; } void emptyStack ( STACK * astack ) { BLOCK_STARTER * block; if ( !astack || !astack->block_list ) { return; } block = astack->block_list; if ( block->next ) { block = block->next; } astack->block_list = block; astack->item_c = 0; astack->index_in_block = 0; } void freeStack ( STACK * astack ) { BLOCK_STARTER * ite_block, *temp_block; if ( !astack ) { return; } ite_block = astack->block_list; if ( ite_block ) { while ( ite_block->next ) { ite_block = ite_block->next; } } while ( ite_block ) { temp_block = ite_block; ite_block = ite_block->prev; free ( ( void * ) temp_block ); } free ( ( void * ) astack ); } void stackBackup ( STACK * astack ) { astack->block_backup = astack->block_list; astack->index_backup = astack->index_in_block; astack->item_c_backup = astack->item_c; } void stackRecover ( STACK * astack ) { astack->block_list = astack->block_backup; astack->index_in_block = astack->index_backup; astack->item_c = astack->item_c_backup; } void * stackPop ( STACK * astack ) { BLOCK_STARTER * block; if ( !astack || !astack->block_list || !astack->item_c ) { return NULL; } astack->item_c--; block = astack->block_list; if ( astack->index_in_block == 1 ) { if ( block->next ) { astack->block_list = block->next; astack->index_in_block = astack->items_per_block; } else { astack->index_in_block = 0; astack->item_c = 0; } return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) ); } return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) + astack->item_size * ( --astack->index_in_block ) ); } void * stackPush ( STACK * astack ) { BLOCK_STARTER * block; if ( !astack ) { return NULL; } astack->item_c++; if ( !astack->block_list || ( astack->index_in_block == 
astack->items_per_block && !astack->block_list->prev ) ) { block = malloc ( sizeof ( BLOCK_STARTER ) + astack->items_per_block * astack->item_size ); block->prev = NULL; if ( astack->block_list ) { astack->block_list->prev = block; } block->next = astack->block_list; astack->block_list = block; astack->index_in_block = 1; return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) ); } else if ( astack->index_in_block == astack->items_per_block && astack->block_list->prev ) { astack->block_list = astack->block_list->prev; astack->index_in_block = 1; return ( void * ) ( ( void * ) astack->block_list + sizeof ( BLOCK_STARTER ) ); } block = astack->block_list; return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) + astack->item_size * astack->index_in_block++ ); }
Q @ * ) (    vcQA/%$ trcWC,"!ARQ%#endifextern char firstCharInKmer(Kmer kmer);extern int count_branch2next(kmer_t *node);extern int count_branch2prev(kmer_t *node);extern void dislink2prevUncertain(kmer_t *node,char ch,boolean smaller);extern void dislink2nextUncertain(kmer_t *node,char ch,boolean smaller);extern void free_Sets(KmerSet **KmerSets,int num);extern byte8 count_kmerset(KmerSet *set);extern int put_kmerset(KmerSet *set, ubyte8 seq, ubyte left, ubyte right,kmer_t **kmer_p);extern int search_kmerset(KmerSet *set, ubyte8 seq, kmer_t **rs);extern KmerSet* init_kmerset(ubyte8 init_size, float load_factor);}KMER_PT; struct kmer_pt *next; boolean isSmaller; Kmer kmer; kmer_t *node;{typedef struct kmer_pt} KmerSet; ubyte8 iter_ptr; double load_factor; ubyte8 max; ubyte8 count; ubyte8 size; ubyte4 *flags; kmer_t *array;{typedef struct kmerSet_st} kmer_t; ubyte4 inEdge:2; ubyte4 twin:2; ubyte4 single:1; ubyte4 checked:1; ubyte4 deleted:1; ubyte4 linear:1; ubyte4 r_links:4*EDGE_BIT_SIZE; ubyte4 l_links; // sever as edgeID since make_edge Kmer seq;{typedef struct kmer_st#define exists_kmer_entity(flags, idx) (!((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x03))#define clear_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] &= ~(0x02u<<(((idx)&0x0f)<<1)))#define clear_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] &= ~(0x01u<<(((idx)&0x0f)<<1)))#define set_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] |= (0x02u<<(((idx)&0x0f)<<1)))#define set_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] |= (0x01u<<(((idx)&0x0f)<<1)))#define is_kmer_entity_del(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x02)#define is_kmer_entity_null(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x01)#define get_kmer_right_covs(mer) (get_kmer_right_cov(mer, 0) + get_kmer_right_cov(mer, 1) + get_kmer_right_cov(mer, 2) + get_kmer_right_cov(mer, 3))#define set_kmer_right_cov(mer, idx, val) ((mer).r_links = ((mer).r_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | 
SOAPdenovo-V1.05/src/31mer/inc/check.h000644 000765 000024 00000001714 11530651532 017315 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/check.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

extern void * ckalloc ( unsigned long long amount );
extern void * ckrealloc ( void * p, size_t new_size, size_t old_size );
extern FILE * ckopen ( char * name, char * mode );
SOAPdenovo-V1.05/src/31mer/inc/darray.h000644 000765 000024 00000002361 11530651532 017521 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/darray.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef __DARRAY__
#define __DARRAY__

#include
#include
#include

typedef struct dynamic_array
{
    void * array;
    long long array_size;
    size_t item_size;
    long long item_c;
} DARRAY;

void * darrayPut ( DARRAY * darray, long long index );
void * darrayGet ( DARRAY * darray, long long index );
DARRAY * createDarray ( int num_items, size_t unit_size );
void freeDarray ( DARRAY * darray );
void emptyDarray ( DARRAY * darray );

#endif
SOAPdenovo-V1.05/src/31mer/inc/def.h000644 000765 000024 00000014400 11530651532 016772 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/def.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/* this file provides some datatype definitions */

#ifndef _DEF
#define _DEF

#include "def2.h"
#include "types.h"
#include "stack.h"
#include "darray.h"

#define EDGE_BIT_SIZE 6
#define word_len 12
#define taskMask 0xf //the lowest 4 bits
#define MaxEdgeCov 16000

#define base2int(base) (char)(((base)&0x06)>>1)
#define int2base(seq) "ACTG"[seq]
#define int2compbase(seq) "TGAC"[seq]
#define int_comp(seq) (char)((seq)^0x02) //(char)((0x4E>>((seq)<<1))&0x03)

int b_ban;

typedef unsigned long long Kmer;

typedef struct edon
{
    Kmer kmer;
    unsigned int ctgLen: 1;
    unsigned int twin: 1;
    unsigned int pos: 30;
    unsigned int ctgID;
    struct edon * left;
    struct edon * right;
} EDON;

struct node_pt;

typedef struct node
{
    Kmer kmer;
    unsigned char links;
    unsigned char linksB;
    unsigned char cvg;
    unsigned char linear: 1;
    unsigned char deleted: 1;
    unsigned char mark: 1;
    unsigned int to_end; // the edge no. it belongs to
    struct node * left;
    struct node * right;
} NODE;

typedef struct node_pt
{
    NODE * node;
    Kmer kmer;
    boolean isSmaller;
    struct node_pt * next;
} NODE_PT;

typedef struct preedge
{
    Kmer from_node;
    Kmer to_node;
    char * seq;
    int length;
    unsigned short cvg;
    unsigned short bal_edge: 2; //indicates whether its bal_edge is the previous edge, the next edge or itself
} preEDGE;

typedef struct readinterval
{
    int readid;
    unsigned int edgeid;
    int start;
    struct readinterval * bal_rv;
    struct readinterval * nextOnEdge;
    struct readinterval * prevOnEdge;
    struct readinterval * nextInRead;
    struct readinterval * prevInRead;
} READINTERVAL;

struct arc;

typedef struct edge
{
    unsigned int from_vt;
    unsigned int to_vt;
    int length;
    unsigned short cvg: 14;
    unsigned short bal_edge: 2;
    unsigned short multi: 14;
    unsigned short deleted : 1;
    unsigned short flag : 1;
    char * seq;
    READINTERVAL * rv;
    struct arc * arcs;
    long long * markers;
} EDGE;

typedef struct edge_pt
{
    EDGE * edge;
    struct edge_pt * next;
} EDGE_PT;

typedef struct vertex
{
    Kmer kmer;
} VERTEX;

typedef struct connection
{
    unsigned int contigID;
    int gapLen;
    unsigned short maxGap;
    unsigned char minGap;
    unsigned char bySmall: 1;
    unsigned char weakPoint: 1;
    unsigned char weightNotInherit;
    unsigned char weight;
    unsigned char maxSingleWeight;
    unsigned char mask : 1;
    unsigned char used : 1;
    unsigned char weak : 1;
    unsigned char deleted : 1;
    unsigned char prevInScaf : 1;
    unsigned char inherit : 1;
    unsigned char checking : 1;
    unsigned char singleInScaf : 1;
    struct connection * nextInScaf;
    struct connection * next;
    struct connection * nextInLookupTable;
} CONNECT;

typedef struct prearc
{
    unsigned int to_ed;
    unsigned int multiplicity;
    struct prearc * next;
} preARC;

typedef struct contig
{
    unsigned int from_vt;
    unsigned int to_vt;
    unsigned int length;
    unsigned short indexInScaf;
    unsigned char cvg;
    unsigned char bal_edge: 2; // 0, 1 or 2
    unsigned char mask : 1;
    unsigned char flag : 1;
    unsigned char multi: 1;
    unsigned char inSubGraph: 1;
    char * seq;
    CONNECT * downwardConnect;
    preARC * arcs;
    STACK * closeReads;
} CONTIG;

typedef struct read_nearby
{
    int len;
    int dis; // distance to the start position of the nearby contig or scaffold
    long long seqStarter; //sequence start position in dynamic array
} READNEARBY;

typedef struct annotation
{
    unsigned long long readID;
    unsigned int contigID;
    int pos;
} ANNOTATION;

typedef struct parameter
{
    unsigned char threadID;
    void ** hash_table;
    unsigned char * mainSignal;
    unsigned char * selfSignal;
} PARAMETER;

typedef struct lightannot
{
    int contigID;
    int pos;
} LIGHTANNOT;

typedef struct edgepatch
{
    Kmer from_kmer, to_kmer;
    unsigned int length;
    char bal_edge;
} EDGEPATCH;

typedef struct lightctg
{
    unsigned int index;
    int length;
    char * seq;
} LIGHTCTG;

typedef struct arc
{
    unsigned int to_ed;
    unsigned int multiplicity;
    struct arc * prev;
    struct arc * next;
    struct arc * bal_arc;
    struct arc * nextInLookupTable;
} ARC;

typedef struct arcexist
{
    Kmer kmer;
    struct arcexist * left;
    struct arcexist * right;
} ARCEXIST;

typedef struct lib_info
{
    int min_ins;
    int max_ins;
    int avg_ins;
    int rd_len_cutoff;
    int reverse;
    int asm_flag;
    int map_len;
    int pair_num_cut;
    int rank;
    //indicate which file is next to be read
    int curr_type;
    int curr_index;

    //file handles of the opened files
    FILE * fp1;
    FILE * fp2;
    boolean f1_start;
    boolean f2_start;
    //whether last read is read1 in pair
    int paired; // 0 -- single; 1 -- read1; 2 -- read2;

    //type1
    char ** a1_fname;
    char ** a2_fname;
    int num_a1_file;
    int num_a2_file;

    //type2
    char ** q1_fname;
    char ** q2_fname;
    int num_q1_file;
    int num_q2_file;

    //type3
    char ** p_fname;
    int num_p_file; //fasta only

    //type4 &5
    char ** s_a_fname;
    int num_s_a_file;
    char ** s_q_fname;
    int num_s_q_file;
} LIB_INFO;

typedef struct ctg4heap
{
    unsigned int ctgID;
    int dis;
    unsigned char ds_shut4dheap: 1; // ignore downstream connections
    unsigned char us_shut4dheap: 1; // ignore upstream connections
    unsigned char ds_shut4uheap: 1; // ignore downstream connections
    unsigned char us_shut4uheap: 1; // ignore upstream connections
} CTGinHEAP;

typedef struct ctg4scaf
{
    unsigned int ctgID;
    int start;
    int end; //position in scaff
    unsigned int cutHead : 8; //
    unsigned int cutTail : 7; //
    unsigned int scaftig_start : 1; //is it a scaftig starter
    unsigned int mask : 1; // is it masked for further operations
    unsigned int gapSeqLen: 15;
    int gapSeqOffset;
} CTGinSCAF;

typedef struct pe_info
{
    int insertS;
    long long PE_bound;
    int rank;
    int pair_num_cut;
} PE_INFO;

#endif
SOAPdenovo-V1.05/src/31mer/inc/def2.h000644 000765 000024 00000003246 11530651532 017062 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/def2.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef _DEF2
#define _DEF2

typedef char boolean;
typedef long long IDnum;
typedef double Time;
typedef long long Coordinate;

// Fibonacci heaps used mainly in Tour Bus
typedef struct fibheap FibHeap;
typedef struct fibheap_el FibHeapNode;
typedef struct dfibheap DFibHeap;
typedef struct dfibheap_el DFibHeapNode;

//Memory manager
typedef struct block_start
{
    struct block_start * next;
} BLOCK_START;

typedef struct recycle_mark
{
    struct recycle_mark * next;
} RECYCLE_MARK;

typedef struct mem_manager
{
    BLOCK_START * block_list;
    int index_in_block;
    int items_per_block;
    size_t item_size;
    RECYCLE_MARK * recycle_list;
    unsigned long long counter;
} MEM_MANAGER;

struct dfibheap_el
{
    int dfhe_degree;
    boolean dfhe_mark;
    DFibHeapNode * dfhe_p;
    DFibHeapNode * dfhe_child;
    DFibHeapNode * dfhe_left;
    DFibHeapNode * dfhe_right;

    Time dfhe_key;
    unsigned int dfhe_data;//void *dfhe_data;
};

#endif
SOAPdenovo-V1.05/src/31mer/inc/dfib.h000644 000765 000024 00000005565 11530651532 017154 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/dfib.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/*-
 * Copyright 1997, 1998-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $Id: dfib.h,v 1.8 2007/04/24 12:16:41 zerbino Exp $
 *
 */

#ifndef _DFIB_H_
#define _DFIB_H_

#include
#include "def2.h"
//#include "globals.h"

/* functions for key heaps */
DFibHeap * dfh_makekeyheap ( void );
DFibHeapNode * dfh_insertkey ( DFibHeap *, Time, unsigned int );
Time dfh_replacekey ( DFibHeap *, DFibHeapNode *, Time );
unsigned int dfh_replacekeydata ( DFibHeap *, DFibHeapNode *, Time, unsigned int );

unsigned int dfh_extractmin ( DFibHeap * );
unsigned int dfh_replacedata ( DFibHeapNode *, unsigned int );
unsigned int dfh_delete ( DFibHeap *, DFibHeapNode * );
void dfh_deleteheap ( DFibHeap * );

// DEBUG
IDnum dfibheap_getSize ( DFibHeap * );
Time dfibheap_el_getKey ( DFibHeapNode * );
// END DEBUG

#endif /* _DFIB_H_ */
SOAPdenovo-V1.05/src/31mer/inc/dfibHeap.h000644 000765 000024 00000002564 11530651532 017746 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/dfibHeap.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef _DFIBHEAP_H_
#define _DFIBHEAP_H_

DFibHeap * newDFibHeap();
DFibHeapNode * insertNodeIntoDHeap ( DFibHeap * heap, Time key, unsigned int node );
Time replaceKeyInDHeap ( DFibHeap * heap, DFibHeapNode * node, Time newKey );
unsigned int removeNextNodeFromDHeap ( DFibHeap * heap );
void destroyDHeap ( DFibHeap * heap );
boolean HasMin ( DFibHeap * h );
void replaceValueInDHeap ( DFibHeapNode * node, unsigned int newValue );
void * destroyNodeInDHeap ( DFibHeapNode * node, DFibHeap * heap );
IDnum getDFibHeapSize ( DFibHeap * heap );
Time getKey ( DFibHeapNode * node );

#endif
SOAPdenovo-V1.05/src/31mer/inc/dfibpriv.h000644 000765 000024 00000007072 11530651532 020050 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/dfibpriv.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/*-
 * Copyright 1997, 1999-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $Id: dfibpriv.h,v 1.8 2007/10/09 09:56:46 zerbino Exp $
 *
 */

#ifndef _DFIBPRIV_H_
#define _DFIBPRIV_H_

//#include "globals.h"
#include "def2.h"

/*
 * specific node operations
 */

static DFibHeapNode * dfhe_newelem ( DFibHeap * );
static void dfhe_insertafter ( DFibHeapNode * a, DFibHeapNode * b );
static inline void dfhe_insertbefore ( DFibHeapNode * a, DFibHeapNode * b );
static DFibHeapNode * dfhe_remove ( DFibHeapNode * a );

/*
 * global heap operations
 */

struct dfibheap
{
    MEM_MANAGER * nodeMemory;
    IDnum dfh_n;
    IDnum dfh_Dl;
    DFibHeapNode ** dfh_cons;
    DFibHeapNode * dfh_min;
    DFibHeapNode * dfh_root;
};

static void dfh_insertrootlist ( DFibHeap *, DFibHeapNode * );
static void dfh_removerootlist ( DFibHeap *, DFibHeapNode * );
static void dfh_consolidate ( DFibHeap * );
static void dfh_heaplink ( DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x );
static void dfh_cut ( DFibHeap *, DFibHeapNode *, DFibHeapNode * );
static void dfh_cascading_cut ( DFibHeap *, DFibHeapNode * );
static DFibHeapNode * dfh_extractminel ( DFibHeap * );
static void dfh_checkcons ( DFibHeap * h );
static int dfh_compare ( DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b );
static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b );
static void dfh_insertel ( DFibHeap * h, DFibHeapNode * x );

/*
 * general functions
 */
static inline IDnum ceillog2 ( IDnum a );

#endif /* _DFIBPRIV_H_ */
SOAPdenovo-V1.05/src/31mer/inc/extfunc.h000644 000765 000024 00000027517 11530651532 017725 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/extfunc.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "check.h"
#include "extfunc2.h"

extern NODE ** seq2nodes_with_pair ( char * seqfile, char * outfile );
extern NODE ** prlSeq2nodes_with_pair ( char * seqfile, char * outfile );
extern void readseq1by1 ( char * src_seq, char * src_name, int * len_seq, FILE * fp, long long num_seq );
extern void readseqPbyP ( char * src_seq, char * src_name, int * insertS, int * len_seq, FILE * fp, long long num_seq );
extern void nodes2edges_with_pair ( NODE ** hash_table, EDGE_PT ** edge_list, char * outfile );
extern int findOrInsertOccurenceInNodeTree ( Kmer kmer, NODE ** T );
extern NODE * SplayNodeTree ( NODE * T, Kmer kmer );
extern Kmer reverseComplement ( Kmer word, int overlap );
extern Kmer hash_kmer ( Kmer kmer );
extern void link2next ( NODE * node, char ch );
extern unsigned char check_link2next ( NODE * node, char ch );
extern void unlink2next ( NODE * node, char ch );
extern void link2prev ( NODE * node, char ch );
extern unsigned char check_link2prev ( NODE * node, char ch );
extern void unlink2prev ( NODE * node, char ch );
extern int count_link2next ( NODE * node );
extern int count_link2prev ( NODE * node );
extern Kmer nextKmer ( Kmer prev, char ch );
extern Kmer prevKmer ( Kmer next, char ch );
extern long long readseqpar ( int * max_len, int * min_leg, int * max_name_len, FILE * fp );
extern void destroyNodeHash ( NODE ** hash_table );
extern void free_edge_list ( EDGE_PT * el );
extern void reverseComplementSeq ( char * seq, int len, char * bal_seq );
extern void free_node_list ( NODE_PT * np );
extern NODE * SplayNodeTree_FILTER ( NODE * T, Kmer kmer );
extern NODE * allocateNode_cvg ( Kmer kmer );
extern int findOrInsertOccurenceInNodeTree_cvg ( Kmer kmer, NODE ** T );
extern void free_edge_array ( EDGE * ed_array, int ed_num );
extern void free_lightctg_array ( LIGHTCTG * ed_array, int ed_num );
extern char getCharInTightString ( char * tightSeq, int pos );
extern void writeChar2tightSting ( char nt, char * tightSeq, int pos );
extern void short_reads_sum();
extern void read_one_sequence ( FILE * fp, long long * T, char ** X );
extern void output_edges ( preEDGE * ed_array, int ed_num, char * outfile );
extern void read2edge ( char * seqfile, NODE ** hash_table, char * outfile );
extern void loadVertex ( char * graphfile );
extern int kmer2vt ( Kmer kmer );
extern void loadEdge ( char * graphfile );
extern boolean loadPath ( char * graphfile );
extern READINTERVAL * allocateRV ( int readid, int edgeid );
extern void createRVmemo();
extern void dismissRV ( READINTERVAL * rv );
extern void destroyReadIntervMem();
extern void destroyConnectMem();
extern void u2uConcatenate();
extern void unlink2all ( NODE * node, NODE ** hash_table );
extern void cutTip ( NODE ** hash_table );
extern void output_contig ( EDGE * ed_array, unsigned int ed_num, char * outfile, int cut_len );
extern void printTightString ( char * tightSeq, int len );
extern int roughUniqueness ( unsigned int edgeno, char ignore_cvg, char * ignored );
extern void outputReadPos ( char * graphfile, int min_len );
extern NODE * reverseComplementNode ( NODE * node1, NODE ** hash_table );
extern void testSearch();
extern void print_kmer ( FILE * fp, Kmer kmer, char c );
extern void allpathConcatenate();
extern void output_updated_edges ( char * outfile );
extern void output_updated_vertex ( char * outfile );
extern void loadUpdatedEdges ( char * graphfile );
extern void loadUpdatedVertex ( char * graphfile );
extern void connectByPE ( char * infile );
extern void output_cntGVZ ( char * outfile );
extern void output_graph ( char * outfile );
extern void removeUnreliable ( NODE ** hash_table );
extern void testLinearC2C();
extern void output_contig_graph ( char * outfile );
extern void scaffolding ( unsigned int cut_len, char * outfile );
extern int cmp_int ( const void * a, const void * b );
extern CONNECT * allocateCN ( unsigned int contigId, int gap );
extern int recoverRep();
extern void loadPEgrads ( char * infile );
extern int putInsertS ( long long readid, int size, int * currGrads );
extern int getInsertS ( long long readid, int * readlen );
extern int connectByPE_grad ( FILE * fp, int peGrad, char * line );
extern void PEgradsScaf ( char * infile );
extern void reorderAnnotation ( char * infile, char * outfile );
extern int count_ends ( NODE ** hash_table );
extern void output_1edge ( preEDGE * edge, FILE * fp );
extern void prlRead2edge ( char * libfile, char * outfile );
extern int count_edges ( NODE ** hash_table );
extern int prlFindOrInsertOccurenceInNodeTree_cvg ( Kmer kmer, NODE ** T, MEM_MANAGER * node_mem_manager );
extern void prlDestroyNodeHash ( NODE ** hash_table );
extern void annotFileTrans ( char * infile, char * outfile );
extern void prlLoadPath ( char * graphfile );
extern void misCheck ( char * infile, char * outfile );
extern int uniqueLenSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target );
extern int cmp_vertex ( const void * a, const void * b );
extern void linkContig2Vts();
extern int bisearch ( VERTEX * vts, int num, Kmer target );
extern int connectByPE_gradPatch ( FILE * fp1, FILE * fp2, int peGrad, char * line1, char * line2 );
extern void scaftiging ( char * graphfile, int len_cut );
extern void gapFilling ( char * graphfile, int cut_len );
extern ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed );
extern void bubblePinch ( double simiCutoff, char * outfile, int M );
extern void linearConcatenate();
extern unsigned char setArcMulti ( unsigned int from_ed, unsigned int to_ed, unsigned char value );
extern ARC * allocateArc ( unsigned int edgeid );
extern void cutTipsInGraph ( int cutLen, boolean strict );
extern ARC * deleteArc ( ARC * arc_list, ARC * arc );
extern void compactEdgeArray();
extern void dismissArc ( ARC * arc );
extern void createArcMemo();
extern ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed );
extern ARC * allocateArc ( unsigned int edgeid );
extern void unlink2prevUncertain ( NODE * node, char ch, boolean smaller );
extern char firstCharInKmer ( Kmer kmer );
extern void writeChar2tightString ( char nt, char * tightSeq, int pos );
extern Kmer reverseComplementVerbose ( Kmer word, int overlap );
extern Kmer KmerPlus ( Kmer prev, char ch );
extern void output_heavyArcs ( char * outfile );
extern preARC * allocatePreArc ( unsigned int edgeid );
extern void destroyPreArcMem();
extern void traceAlongArc ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route );
extern void freeContig_array();
extern void output_scafSeq ( char * graphfile, int len_cut );
extern void putArcInHash ( unsigned int from_ed, unsigned int to_ed );
extern boolean DoesArcExist ( unsigned int from_ed, unsigned int to_ed );
extern void recordArcInHash();
extern void destroyArcHash();
extern void removeWeakEdges ( int lenCutoff, unsigned int multiCutoff );
extern void createArcLookupTable();
extern void deleteArcLookupTable();
extern void putArc2LookupTable ( unsigned int from_ed, ARC * arc );
extern void removeArcInLookupTable ( unsigned int from_ed, unsigned int to_ed );
extern ARC * arcCount ( unsigned int edgeid, unsigned int * num );
extern void mapFileTrans ( char * infile );
extern void solveReps();
extern void removeDeadArcs();
extern void destroyArcMem();
extern int count_link2prevB ( NODE * node );
extern int count_link2nextB ( NODE * node );
extern void getCntsInFile ( char * infile );
extern void scafByCntInfo ( char * infile );
extern CONNECT * add1Connect ( unsigned int e1, unsigned int e2, int gap, int weight, boolean inherit );
extern void getScaff ( char * infile );
extern void traceAlongMaskedCnt ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route );
extern void createPreArcMemManager();
extern boolean loadPathBin ( char * graphfile );
extern void analyzeTips ( NODE ** hash_table, char * graphfile );
extern void recordArcsInLookupTable();
extern FILE * multiFileRead1seq ( char * src_seq, char * src_name, int * len_seq, FILE * fp, FILE * freads );
extern void multiFileSeqpar ( FILE * fp );
extern long long multiFileParse ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp );
extern CONNECT * getCntBetween ( unsigned int from_ed, unsigned int to_ed );
extern void createCntMemManager();
extern void destroyConnectMem();
extern void createCntLookupTable();
extern void deleteCntLookupTable();
extern void putCnt2LookupTable ( unsigned int from_c, CONNECT * cnt );
extern int prlFindOrInsertOccurenceInEdonTree ( Kmer kmer, EDON ** T, MEM_MANAGER * node_mem_manager );
extern EDON * SplayEdonTree ( EDON * T, Kmer kmer );
extern void prlDestroyEdonHash ( EDON ** hash_table );
extern void prlRead2Ctg ( char * seqfile, char * outfile );
extern void prlLongRead2Ctg ( char * libfile, char * outfile );
extern boolean prlContig2nodes ( char * graphfile, int len_cut );
extern void scan_libInfo ( char * libfile );
extern int getMaxLongReadLen ( int num_libs );
extern void free_libs();
extern boolean read1seqInLib ( char * src_seq, char * src_name, int * len_seq, int * libNo, boolean pair, unsigned char purpose );
extern NODE ** prlEdge2nodes ( char * graphfile );
extern void prlRead2graph ( char * libfile, NODE ** hash_table, char * outfile );
extern void save4laterSolve();
extern void solveRepsAfter();
extern void free_pe_mem();
extern void alloc_pe_mem ( int gradsCounter );
extern NODE * searchNodeTree ( NODE * T, Kmer kmer );
extern EDON * searchEdonTree ( EDON * T, Kmer kmer );
extern void prlDestroyPreArcMem();
extern preARC * prlAllocatePreArc ( unsigned int edgeid, MEM_MANAGER * manager );
extern boolean prlRead2HashTable ( char * libfile, char * outfile );
extern void free_allSets();
extern void removeSingleTips();
extern void removeMinorTips();
extern void kmer2edges ( char * outfile );
extern void output_vertex ( char * outfile );
extern boolean prlRead2HashTable ( char * libfile, char * outfile );
extern void Links2Scaf ( char * infile );
extern void PE2Links ( char * infile );
extern void basicContigInfo ( char * infile );
extern unsigned int getTwinCtg ( unsigned int ctg );
extern boolean isSmallerThanTwin ( unsigned int ctg );
extern boolean isLargerThanTwin ( unsigned int ctg );
extern boolean isSameAsTwin ( unsigned int ctg );
extern boolean loadMarkerBin ( char * graphfile );
extern void readsCloseGap ( char * graphfile );
extern void prlReadsCloseGap ( char * graphfile );
extern void locateReadOnScaf ( char * graphfile );
extern unsigned int getTwinEdge ( unsigned int edge );
extern boolean EdSmallerThanTwin ( unsigned int edge );
extern boolean EdLargerThanTwin ( unsigned int edge );
extern boolean EdSameAsTwin ( unsigned int edge );
extern void removeLowCovEdges ( int lenCutoff, unsigned short covCutoff );
extern int localGraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int origOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, DARRAY * gapSeqArray, char * seqCtg1, char * seqCtg2, char * seqGap );
SOAPdenovo-V1.05/src/31mer/inc/extfunc2.h000644 000765 000024 00000002110 11530651532 017765 0ustar00Aquastaff000000 000000 
/*
 * 31mer/inc/extfunc2.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #ifndef _MEM_MANAGER #define _MEM_MANAGER extern MEM_MANAGER * createMem_manager ( int num_items, size_t unit_size ); extern void * getItem ( MEM_MANAGER * mem_Manager ); extern void returnItem ( MEM_MANAGER * mem_Manager, void * ); extern void freeMem_manager ( MEM_MANAGER * mem_Manager ); #endif SOAPdenovo-V1.05/src/31mer/inc/extvab.h000644 000765 000024 00000005352 11530651532 017533 0ustar00Aquastaff000000 000000 /* * 31mer/inc/extvab.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ /*** global variables ****/ extern int overlaplen; extern int inGraph; extern long long n_ban; extern Kmer WORDFILTER; extern boolean globalFlag; extern int thrd_num; /**** reads info *****/ extern long long n_solexa; extern long long prevNum; extern int ins_size_var; extern PE_INFO * pes; extern int maxReadLen; extern int maxReadLen4all; extern int minReadLen; extern int maxNameLen; extern int num_libs; extern LIB_INFO * lib_array; extern int libNo; extern long long readNumBack; extern int gradsCounter; /*** used for pregraph *****/ extern MEM_MANAGER * prearc_mem_manager; //also used in scaffolding extern MEM_MANAGER ** preArc_mem_managers; extern boolean deLowKmer; extern boolean deLowEdge; extern KmerSet ** KmerSets; // also used in mapping extern KmerSet ** KmerSetsPatch; /**** used for contiging ****/ extern boolean repsTie; extern long long arcCounter; extern unsigned int num_ed; extern unsigned int num_ed_limit; extern unsigned int extraEdgeNum; extern EDGE * edge_array; extern VERTEX * vt_array; extern MEM_MANAGER * rv_mem_manager; extern MEM_MANAGER * arc_mem_manager; extern unsigned int num_vt; extern int len_bar; extern ARC ** arcLookupTable; extern long long * markersArray; /***** used for scaffolding *****/ extern MEM_MANAGER * cn_mem_manager; extern unsigned int num_ctg; extern unsigned int * index_array; extern CONTIG * contig_array; extern int lineLen; extern int weakPE; extern long long newCntCounter; extern CONNECT ** cntLookupTable; extern unsigned int ctg_short; extern int cvgAvg; extern boolean orig2new; /**** used for gapFilling ****/ extern DARRAY * readSeqInGap; extern DARRAY * gapSeqDarray; extern DARRAY ** darrayBuf; extern int fillGap; /**** used for searchPath *****/ extern int maxSteps; extern int num_trace; extern unsigned int ** found_routes; extern unsigned int * so_far; extern int max_n_routes; extern boolean maskRep; extern int GLDiff; extern int initKmerSetSize; SOAPdenovo-V1.05/src/31mer/inc/fib.h000644 000765 000024 
00000006240 11530651532 016777 0ustar00Aquastaff000000 000000 /* * 31mer/inc/fib.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997, 1998-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #ifndef _FIB_H_ #define _FIB_H_ //#include "globals.h" #include #include "def2.h" typedef Coordinate ( *voidcmp ) ( unsigned int , unsigned int ); /* functions for key heaps */ boolean fh_isempty ( FibHeap * ); FibHeap * fh_makekeyheap ( void ); FibHeapNode * fh_insertkey ( FibHeap *, Coordinate, unsigned int ); Coordinate fh_minkey ( FibHeap * ); Coordinate fh_replacekey ( FibHeap *, FibHeapNode *, Coordinate ); unsigned int fh_replacekeydata ( FibHeap *, FibHeapNode *, Coordinate, unsigned int ); /* functions for unsigned int * heaps */ FibHeap * fh_makeheap ( void ); voidcmp fh_setcmp ( FibHeap *, voidcmp ); unsigned int fh_setneginf ( FibHeap *, unsigned int ); FibHeapNode * fh_insert ( FibHeap *, unsigned int ); /* shared functions */ unsigned int fh_extractmin ( FibHeap * ); unsigned int fh_min ( FibHeap * ); unsigned int fh_replacedata ( FibHeapNode *, unsigned int ); unsigned int fh_delete ( FibHeap *, FibHeapNode * ); void fh_deleteheap ( FibHeap * ); FibHeap * fh_union ( FibHeap *, FibHeap * ); #endif /* _FIB_H_ */ SOAPdenovo-V1.05/src/31mer/inc/fibHeap.h000644 000765 000024 00000002625 11530651532 017600 0ustar00Aquastaff000000 000000 /* * 31mer/inc/fibHeap.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef _FIBHEAP_H_ #define _FIBHEAP_H_ FibHeap * newFibHeap(); FibHeapNode * insertNodeIntoHeap ( FibHeap * heap, Coordinate key, unsigned int node ); Coordinate minKeyOfHeap ( FibHeap * heap ); Coordinate replaceKeyInHeap ( FibHeap * heap, FibHeapNode * node, Coordinate newKey ); void replaceValueInHeap ( FibHeapNode * node, unsigned int newValue ); unsigned int removeNextNodeFromHeap ( FibHeap * heap ); void * destroyNodeInHeap ( FibHeapNode * node, FibHeap * heap ); void destroyHeap ( FibHeap * heap ); boolean IsHeapEmpty ( FibHeap * heap ); #endif SOAPdenovo-V1.05/src/31mer/inc/fibpriv.h000644 000765 000024 00000007544 11530651532 017710 0ustar00Aquastaff000000 000000 /* * 31mer/inc/fibpriv.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ /*- * Copyright 1997, 1999-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $Id: fibpriv.h,v 1.10 2007/10/09 09:56:46 zerbino Exp $ * */ #ifndef _FIBPRIV_H_ #define _FIBPRIV_H_ #include "def2.h" /* * specific node operations */ struct fibheap_el { int fhe_degree; boolean fhe_mark; FibHeapNode * fhe_p; FibHeapNode * fhe_child; FibHeapNode * fhe_left; FibHeapNode * fhe_right; Coordinate fhe_key; unsigned int fhe_data; }; static FibHeapNode * fhe_newelem ( struct fibheap * ); static void fhe_initelem ( FibHeapNode * ); static void fhe_insertafter ( FibHeapNode * a, FibHeapNode * b ); static inline void fhe_insertbefore ( FibHeapNode * a, FibHeapNode * b ); static FibHeapNode * fhe_remove ( FibHeapNode * a ); /* * global heap operations */ struct fibheap { Coordinate ( *fh_cmp_fnct ) ( unsigned int, unsigned int ); MEM_MANAGER * nodeMemory; IDnum fh_n; IDnum fh_Dl; FibHeapNode ** fh_cons; FibHeapNode * fh_min; FibHeapNode * fh_root; unsigned int fh_neginf; boolean fh_keys: 1; }; static void fh_initheap ( FibHeap * ); static void fh_insertrootlist ( FibHeap *, FibHeapNode * ); static void fh_removerootlist ( FibHeap *, FibHeapNode * ); static void fh_consolidate ( FibHeap * ); static void fh_heaplink ( FibHeap * h, FibHeapNode * y, FibHeapNode * x ); static void fh_cut ( FibHeap *, FibHeapNode *, FibHeapNode * ); static void fh_cascading_cut ( FibHeap *, FibHeapNode * ); static FibHeapNode * fh_extractminel ( FibHeap * ); static void fh_checkcons ( FibHeap * h ); static void fh_destroyheap ( FibHeap * h ); static int fh_compare ( FibHeap * h, FibHeapNode * a, FibHeapNode * b ); static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b ); static void fh_insertel ( FibHeap * h, FibHeapNode * x ); /* * general functions */ static inline IDnum ceillog2 ( IDnum a ); #endif /* _FIBPRIV_H_ */ SOAPdenovo-V1.05/src/31mer/inc/global.h000644 000765 000024 00000004355 11530651532 017504 0ustar00Aquastaff000000 000000 /* * 31mer/inc/global.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ int overlaplen = 23; int inGraph; long long n_ban; long long n_solexa = 0; long long prevNum = 0; int ins_size_var = 20; PE_INFO * pes = NULL; MEM_MANAGER * rv_mem_manager = NULL; MEM_MANAGER * cn_mem_manager = NULL; MEM_MANAGER * arc_mem_manager = NULL; unsigned int num_vt = 0; unsigned int ** found_routes = NULL; unsigned int * so_far = NULL; int max_n_routes = 10; int num_trace; Kmer WORDFILTER; unsigned int num_ed = 0; unsigned int num_ctg = 0; unsigned int num_ed_limit; unsigned int extraEdgeNum; EDGE * edge_array = NULL; VERTEX * vt_array = NULL; unsigned int * index_array = NULL; CONTIG * contig_array = NULL; int lineLen; int len_bar = 100; int weakPE = 3; int fillGap = 0; boolean globalFlag; long long arcCounter; MEM_MANAGER * prearc_mem_manager = NULL; MEM_MANAGER ** preArc_mem_managers = NULL; int maxReadLen = 0; int maxReadLen4all = 0; int minReadLen = 0; int maxNameLen = 0; ARC ** arcLookupTable = NULL; long long * markersArray = NULL; boolean deLowKmer = 0; boolean deLowEdge = 1; long long newCntCounter; boolean repsTie = 0; CONNECT ** cntLookupTable = NULL; int num_libs = 0; LIB_INFO * lib_array = NULL; int libNo = 0; long long readNumBack; int gradsCounter; unsigned int ctg_short = 0; int thrd_num = 8; int cvgAvg = 0; KmerSet ** KmerSets = NULL; KmerSet ** KmerSetsPatch = NULL; DARRAY * readSeqInGap = NULL; DARRAY * gapSeqDarray = NULL; DARRAY ** darrayBuf; 
boolean orig2new; int maxSteps; boolean maskRep = 1; int GLDiff = 50; int initKmerSetSize = 0; SOAPdenovo-V1.05/src/31mer/inc/newhash.h000644 000765 000024 00000007274 11530651532 017704 0ustar00Aquastaff000000 000000 /* * 31mer/inc/newhash.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef __NEW_HASH_RJ #define __NEW_HASH_RJ #ifndef K_LOAD_FACTOR #define K_LOAD_FACTOR 0.75 #endif #define MAX_KMER_COV 63 #define EDGE_BIT_SIZE 6 #define EDGE_XOR_MASK 0x3FU #define LINKS_BITS 0x00FFFFFFU #define get_kmer_seq(mer) ((mer).seq) #define set_kmer_seq(mer, val) ((mer).seq = val) #define get_kmer_left_cov(mer, idx) (((mer).l_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK) #define set_kmer_left_cov(mer, idx, val) ((mer).l_links = ((mer).l_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) ) #define get_kmer_left_covs(mer) (get_kmer_left_cov(mer, 0) + get_kmer_left_cov(mer, 1) + get_kmer_left_cov(mer, 2) + get_kmer_left_cov(mer, 3)) #define get_kmer_right_cov(mer, idx) (((mer).r_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK) #define set_kmer_right_cov(mer, idx, val) ((mer).r_links = ((mer).r_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) ) #define get_kmer_right_covs(mer) (get_kmer_right_cov(mer, 0) + get_kmer_right_cov(mer, 1) + get_kmer_right_cov(mer, 
2) + get_kmer_right_cov(mer, 3)) #define is_kmer_entity_null(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x01) #define is_kmer_entity_del(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x02) #define set_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] |= (0x01u<<(((idx)&0x0f)<<1))) #define set_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] |= (0x02u<<(((idx)&0x0f)<<1))) #define clear_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] &= ~(0x01u<<(((idx)&0x0f)<<1))) #define clear_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] &= ~(0x02u<<(((idx)&0x0f)<<1))) #define exists_kmer_entity(flags, idx) (!((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x03)) typedef struct kmer_st { Kmer seq; ubyte4 l_links; /* serves as edgeID since make_edge */ ubyte4 r_links: 4 * EDGE_BIT_SIZE; ubyte4 linear: 1; ubyte4 deleted: 1; ubyte4 checked: 1; ubyte4 single: 1; ubyte4 twin: 2; ubyte4 inEdge: 2; } kmer_t; typedef struct kmerSet_st { kmer_t * array; ubyte4 * flags; ubyte8 size; ubyte8 count; ubyte8 max; double load_factor; ubyte8 iter_ptr; } KmerSet; typedef struct kmer_pt { kmer_t * node; Kmer kmer; boolean isSmaller; struct kmer_pt * next; } KMER_PT; extern KmerSet * init_kmerset ( ubyte8 init_size, float load_factor ); extern int search_kmerset ( KmerSet * set, ubyte8 seq, kmer_t ** rs ); extern int put_kmerset ( KmerSet * set, ubyte8 seq, ubyte left, ubyte right, kmer_t ** kmer_p ); extern byte8 count_kmerset ( KmerSet * set ); extern void free_Sets ( KmerSet ** KmerSets, int num ); extern void free_kmerset ( KmerSet * set ); extern void dislink2nextUncertain ( kmer_t * node, char ch, boolean smaller ); extern void dislink2prevUncertain ( kmer_t * node, char ch, boolean smaller ); extern int count_branch2prev ( kmer_t * node ); extern int count_branch2next ( kmer_t * node ); extern char firstCharInKmer ( Kmer kmer ); #endif SOAPdenovo-V1.05/src/31mer/inc/nuc.h000644 000765 000024 00000001675 11530651532 017033 0ustar00Aquastaff000000 000000 /* * 31mer/inc/nuc.h * * Copyright (c) 
2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ int total_nuc = 16; char na_name[17] = {'g', 'a', 't', 'c', 'n', 'r', 'y', 'w', 's', 'm', 'k', 'h', 'b', 'v', 'd', 'x' }; SOAPdenovo-V1.05/src/31mer/inc/stack.h000644 000765 000024 00000002724 11530651532 017347 0ustar00Aquastaff000000 000000 /* * 31mer/inc/stack.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #ifndef __STACK__ #define __STACK__ #include #include #include typedef struct block_starter { struct block_starter * prev; struct block_starter * next; } BLOCK_STARTER; typedef struct stack { BLOCK_STARTER * block_list; int index_in_block; int items_per_block; int item_c; size_t item_size; BLOCK_STARTER * block_backup; int index_backup; int item_c_backup; } STACK; void stackBackup ( STACK * astack ); void stackRecover ( STACK * astack ); void * stackPush ( STACK * astack ); void * stackPop ( STACK * astack ); void freeStack ( STACK * astack ); void emptyStack ( STACK * astack ); STACK * createStack ( int num_items, size_t unit_size ); #endif SOAPdenovo-V1.05/src/31mer/inc/stdinc.h000644 000765 000024 00000001721 11530651532 017522 0ustar00Aquastaff000000 000000 /* * 31mer/inc/stdinc.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include #include #include #include #include #include #include #include "def.h" SOAPdenovo-V1.05/src/31mer/inc/types.h000644 000765 000024 00000002124 11530651532 017400 0ustar00Aquastaff000000 000000 /* * 31mer/inc/types.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef __TYPES_RJ #define __TYPES_RJ typedef unsigned long long ubyte8; typedef unsigned int ubyte4; typedef unsigned short ubyte2; typedef unsigned char ubyte; typedef long long byte8; typedef int byte4; typedef short byte2; typedef char byte; #endif SOAPdenovo-V1.05/src/127mer/arc.c000644 000765 000024 00000012564 11530651532 016322 0ustar00Aquastaff000000 000000 /* * 127mer/arc.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define preARCBLOCKSIZE 100000 void createPreArcMemManager() { prearc_mem_manager = createMem_manager ( preARCBLOCKSIZE, sizeof ( preARC ) ); } void prlDestroyPreArcMem() { if ( !preArc_mem_managers ) { return; } int i; for ( i = 0; i < thrd_num; i++ ) { freeMem_manager ( preArc_mem_managers[i] ); } free ( ( void * ) preArc_mem_managers ); preArc_mem_managers = NULL; } void destroyPreArcMem() { freeMem_manager ( prearc_mem_manager ); prearc_mem_manager = NULL; } preARC * prlAllocatePreArc ( unsigned int edgeid, MEM_MANAGER * manager ) { preARC * newArc; newArc = ( preARC * ) getItem ( manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->next = NULL; return newArc; } preARC * allocatePreArc ( unsigned int edgeid ) { arcCounter++; preARC * newArc; newArc = ( preARC * ) getItem ( prearc_mem_manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->next = NULL; return newArc; } void output_arcGVZ ( char * outfile, boolean IsContig ) { ARC * pArc; preARC * pPreArc; char name[256]; FILE * fp; unsigned int i; sprintf ( name, "%s.arc.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = 1; i <= num_ed; i++ ) { if ( IsContig ) { pPreArc = contig_array[i].arcs; while ( pPreArc ) { fprintf ( fp, "\tC%d -> C%d[label =\"%d\"];\n", i, pPreArc->to_ed, pPreArc->multiplicity ); pPreArc = pPreArc->next; } } else { pArc = edge_array[i].arcs; while ( pArc ) { fprintf ( fp, "\tC%d -> C%d[label =\"%d\"];\n", i, pArc->to_ed, pArc->multiplicity ); pArc = pArc->next; } } } fprintf ( fp, "}\n" ); fclose ( fp ); } /**************** below this line all codes are about ARC ****************/ #define ARCBLOCKSIZE 100000 void createArcMemo() { if ( !arc_mem_manager ) { arc_mem_manager = createMem_manager ( ARCBLOCKSIZE, sizeof ( ARC ) ); } else { printf ( "Warning from createArcMemo: arc_mem_manager is active 
pointer\n" ); } } void destroyArcMem() { freeMem_manager ( arc_mem_manager ); arc_mem_manager = NULL; } ARC * allocateArc ( unsigned int edgeid ) { arcCounter++; ARC * newArc; newArc = ( ARC * ) getItem ( arc_mem_manager ); newArc->to_ed = edgeid; newArc->multiplicity = 1; newArc->prev = NULL; newArc->next = NULL; return newArc; } void dismissArc ( ARC * arc ) { returnItem ( arc_mem_manager, arc ); } /***************** below this line all codes are about lookup table *****************/ void createArcLookupTable() { if ( !arcLookupTable ) { arcLookupTable = ( ARC ** ) ckalloc ( ( 3 * num_ed + 1 ) * sizeof ( ARC * ) ); } } void deleteArcLookupTable() { if ( arcLookupTable ) { free ( ( void * ) arcLookupTable ); arcLookupTable = NULL; } } void putArc2LookupTable ( unsigned int from_ed, ARC * arc ) { if ( !arc || !arcLookupTable ) { return; } unsigned int index = 2 * from_ed + arc->to_ed; arc->nextInLookupTable = arcLookupTable[index]; arcLookupTable[index] = arc; } static ARC * getArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ) { unsigned int index = 2 * from_ed + to_ed; ARC * ite_arc = arcLookupTable[index]; while ( ite_arc ) { if ( ite_arc->to_ed == to_ed ) { return ite_arc; } ite_arc = ite_arc->nextInLookupTable; } return NULL; } void removeArcInLookupTable ( unsigned int from_ed, unsigned int to_ed ) { unsigned int index = 2 * from_ed + to_ed; ARC * ite_arc = arcLookupTable[index]; ARC * arc; if ( !ite_arc ) { printf ( "removeArcInLookupTable: not found A\n" ); return; } if ( ite_arc->to_ed == to_ed ) { arcLookupTable[index] = ite_arc->nextInLookupTable; return; } while ( ite_arc->nextInLookupTable && ite_arc->nextInLookupTable->to_ed != to_ed ) { ite_arc = ite_arc->nextInLookupTable; } if ( ite_arc->nextInLookupTable ) { arc = ite_arc->nextInLookupTable; ite_arc->nextInLookupTable = arc->nextInLookupTable; return; } printf ( "removeArcInLookupTable: not found B\n" ); return; } void recordArcsInLookupTable() { unsigned int i; ARC * ite_arc; for ( i = 
1; i <= num_ed; i++ ) { ite_arc = edge_array[i].arcs; while ( ite_arc ) { putArc2LookupTable ( i, ite_arc ); ite_arc = ite_arc->next; } } } ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed ) { ARC * parc; if ( arcLookupTable ) { parc = getArcInLookupTable ( from_ed, to_ed ); return parc; } parc = edge_array[from_ed].arcs; while ( parc ) { if ( parc->to_ed == to_ed ) { return parc; } parc = parc->next; } return parc; } SOAPdenovo-V1.05/src/127mer/attachPEinfo.c000644 000765 000024 00000021402 11530651532 020111 0ustar00Aquastaff000000 000000 /* * 127mer/attachPEinfo.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "stack.h" #define CNBLOCKSIZE 10000 static STACK * isStack; static int ignorePE1, ignorePE2, ignorePE3, static_flag; static int onsameCtgPE; static unsigned long long peSUM; static boolean staticF; static int existCounter; static int calcuIS ( STACK * intStack ); static int cmp_pe ( const void * a, const void * b ) { PE_INFO * A, *B; A = ( PE_INFO * ) a; B = ( PE_INFO * ) b; if ( A->rank > B->rank ) { return 1; } else if ( A->rank == B->rank ) { return 0; } else { return -1; } } void loadPEgrads ( char * infile ) { FILE * fp; char name[256], line[1024]; int i; boolean rankSet = 1; sprintf ( name, "%s.peGrads", infile ); fp = fopen ( name, "r" ); if ( !fp ) { printf ( "can not open file %s\n", name ); gradsCounter = 0; return; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'g' ) { sscanf ( line + 10, "%d %lld %d", &gradsCounter, &n_solexa, &maxReadLen ); printf ( "there're %d grads, %lld reads, max read len %d\n", gradsCounter, n_solexa, maxReadLen ); break; } } alloc_pe_mem ( gradsCounter ); for ( i = 0; i < gradsCounter; i++ ) { fgets ( line, sizeof ( line ), fp ); pes[i].rank = 0; sscanf ( line, "%d %lld %d %d", & ( pes[i].insertS ), & ( pes[i].PE_bound ), & ( pes[i].rank ), & ( pes[i].pair_num_cut ) ); if ( pes[i].rank < 1 ) { rankSet = 0; } } fclose ( fp ); if ( rankSet ) { qsort ( &pes[0], gradsCounter, sizeof ( PE_INFO ), cmp_pe ); return; } int lastRank = 0; for ( i = 0; i < gradsCounter; i++ ) { if ( i == 0 ) { pes[i].rank = ++lastRank; } else if ( pes[i].insertS < 300 ) { pes[i].rank = lastRank; } else if ( pes[i].insertS < 800 ) { if ( pes[i - 1].insertS < 300 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else if ( pes[i].insertS < 3000 ) { if ( pes[i - 1].insertS < 800 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else if ( pes[i].insertS < 7000 ) { if ( pes[i - 1].insertS < 3000 ) { 
pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } else { if ( pes[i - 1].insertS < 7000 ) { pes[i].rank = ++lastRank; } else { pes[i].rank = lastRank; } } } } CONNECT * add1Connect ( unsigned int e1, unsigned int e2, int gap, int weight, boolean inherit ) { if ( e1 == e2 || e1 == getTwinCtg ( e2 ) ) { return NULL; } CONNECT * connect = NULL; long long sum; if ( weight > 255 ) { weight = 255; } connect = getCntBetween ( e1, e2 ); if ( connect ) { if ( !weight ) { return connect; } existCounter++; if ( !inherit ) { sum = connect->weightNotInherit * connect->gapLen + gap * weight; connect->gapLen = sum / ( connect->weightNotInherit + weight ); if ( connect->weightNotInherit + weight <= 255 ) { connect->weightNotInherit += weight; } else if ( connect->weightNotInherit < 255 ) { connect->weightNotInherit = 255; } } else { sum = connect->weight * connect->gapLen + gap * weight; connect->gapLen = sum / ( connect->weight + weight ); if ( !connect->inherit ) { connect->maxSingleWeight = connect->weightNotInherit; } connect->inherit = 1; connect->maxSingleWeight = connect->maxSingleWeight > weight ? 
				connect->maxSingleWeight : weight;
		}

		if ( connect->weight + weight <= 255 )
		{
			connect->weight += weight;
		}
		else if ( connect->weight < 255 )
		{
			connect->weight = 255;
		}
	}
	else
	{
		newCntCounter++;
		connect = allocateCN ( e2, gap );

		if ( cntLookupTable )
		{
			putCnt2LookupTable ( e1, connect );
		}

		connect->weight = weight;

		if ( contig_array[e1].mask || contig_array[e2].mask )
		{
			connect->mask = 1;
		}

		connect->next = contig_array[e1].downwardConnect;
		contig_array[e1].downwardConnect = connect;

		if ( !inherit )
		{
			connect->weightNotInherit = weight;
		}
		else
		{
			connect->weightNotInherit = 0;
			connect->inherit = 1;
			connect->maxSingleWeight = weight;
		}
	}

	return connect;
}

int attach1PE ( unsigned int e1, int pre_pos, unsigned int bal_e2, int pos, int insert_size )
{
	int gap, realpeSize;
	unsigned int bal_e1, e2;

	if ( e1 == bal_e2 )
	{
		ignorePE1++;
		return -1; //orientation wrong
	}

	bal_e1 = getTwinCtg ( e1 );
	e2 = getTwinCtg ( bal_e2 );

	if ( e1 == e2 )
	{
		realpeSize = contig_array[e1].length + overlaplen - pre_pos - pos;

		if ( realpeSize > 0 )
		{
			peSUM += realpeSize;
			onsameCtgPE++;

			if ( ( int ) contig_array[e1].length > insert_size )
			{
				int * item = ( int * ) stackPush ( isStack );
				( *item ) = realpeSize;
			}
		}

		return 2;
	}

	gap = insert_size - overlaplen + pre_pos + pos - contig_array[e1].length - contig_array[e2].length;

	if ( gap < - ( insert_size / 10 ) )
	{
		ignorePE2++;
		return 0;
	}

	if ( gap > insert_size )
	{
		ignorePE3++;
		return 0;
	}

	add1Connect ( e1, e2, gap, 1, 0 );
	add1Connect ( bal_e2, bal_e1, gap, 1, 0 );
	return 1;
}

int connectByPE_grad ( FILE * fp, int peGrad, char * line )
{
	long long pre_readno, readno, minno, maxno;
	int pre_pos, pos, flag, PE, count = 0;
	unsigned int pre_contigno, contigno, newIndex;

	/* pes[] holds gradsCounter entries, so the last valid index is gradsCounter - 1 */
	if ( peGrad < 0 || peGrad >= gradsCounter )
	{
		printf ( "specified pe grad is out of bound\n" );
		return 0;
	}

	maxno = pes[peGrad].PE_bound;

	if ( peGrad == 0 )
	{
		minno = 0;
	}
	else
	{
		minno = pes[peGrad - 1].PE_bound;
	}

	onsameCtgPE = peSUM = 0;
	PE = pes[peGrad].insertS;

	if ( strlen ( line ) )
	{
		sscanf ( line, "%lld %d %d", &pre_readno, &pre_contigno, &pre_pos );

		//printf("first record %d %d %d\n",pre_readno,pre_contigno,pre_pos);
		if ( pre_readno <= minno )
		{
			pre_readno = -1;
		}
	}
	else
	{
		pre_readno = -1;
	}

	ignorePE1 = ignorePE2 = ignorePE3 = 0;
	static_flag = 1;
	isStack = ( STACK * ) createStack ( CNBLOCKSIZE, sizeof ( int ) );

	while ( fgets ( line, lineLen, fp ) != NULL )
	{
		sscanf ( line, "%lld %d %d", &readno, &contigno, &pos );

		if ( readno > maxno )
		{
			break;
		}

		if ( readno <= minno )
		{
			continue;
		}

		newIndex = index_array[contigno];

		//if(contig_array[newIndex].bal_edge==0)
		if ( isSameAsTwin ( newIndex ) )
		{
			continue;
		}

		if ( PE && ( readno % 2 == 0 ) && ( pre_readno == readno - 1 ) ) // they are a pair of reads
		{
			flag = attach1PE ( pre_contigno, pre_pos, newIndex, pos, PE );

			if ( flag == 1 )
			{
				count++;
			}
		}

		pre_readno = readno;
		pre_contigno = newIndex;
		pre_pos = pos;
	}

	printf ( "%d PEs with insert size %d attached, %d + %d + %d ignored\n", count, PE, ignorePE1, ignorePE2, ignorePE3 );

	if ( onsameCtgPE > 0 )
	{
		printf ( "estimated PE size %lli, by %d pairs\n", peSUM / onsameCtgPE, onsameCtgPE );
	}

	printf ( "on contigs longer than %d, %d pairs found,", PE, isStack->item_c );
	printf ( "insert_size estimated: %d\n", calcuIS ( isStack ) );
	freeStack ( isStack );
	return count;
}

static int calcuIS ( STACK * intStack )
{
	long long sum = 0;
	int avg = 0;
	int * item;
	int num = intStack->item_c;

	if ( num < 100 )
	{
		return avg;
	}

	stackBackup ( intStack );

	while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL )
	{
		sum += *item;
	}

	stackRecover ( intStack );
	num = intStack->item_c;
	avg = sum / num;
	sum = 0;
	stackBackup ( intStack );

	while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL )
	{
		sum += ( *item - avg ) * ( *item - avg );
	}

	int SD = sqrt ( sum / ( num - 1 ) );

	if ( SD == 0 )
	{
		printf ( "SD=%d, ", SD );
		return avg;
	}

	stackRecover ( intStack );
	sum = num = 0;

	while ( ( item = ( int * ) stackPop ( intStack ) ) != NULL )
		if ( abs ( *item - avg ) < 3 * SD )
		{
			sum
				+= *item;
			num++;
		}

	avg = sum / num;
	printf ( "SD=%d, ", SD );
	return avg;
}

unsigned int getTwinCtg ( unsigned int ctg )
{
	return ctg + contig_array[ctg].bal_edge - 1;
}

boolean isSmallerThanTwin ( unsigned int ctg )
{
	return contig_array[ctg].bal_edge > 1;
}

boolean isLargerThanTwin ( unsigned int ctg )
{
	return contig_array[ctg].bal_edge < 1;
}

boolean isSameAsTwin ( unsigned int ctg )
{
	return contig_array[ctg].bal_edge == 1;
}

SOAPdenovo-V1.05/src/127mer/bubble.c

/*
 * 127mer/bubble.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "dfibHeap.h" #include "fibHeap.h" #define false 0 #define true 1 #define SLOW_TO_FAST 1 #define FAST_TO_SLOW 0 #define MAXREADLENGTH 100 #define MAXCONNECTION 100 static int MAXNODELENGTH; static int DIFF; static unsigned int outNodeArray[MAXCONNECTION]; static ARC * outArcArray[MAXCONNECTION]; static boolean HasChanged; //static boolean staticFlag = 0; //static boolean staticFlag2 = 0; static const int INDEL = 0; static const int SIM[4][4] = { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }; //static variables static READINTERVAL * fastPath; static READINTERVAL * slowPath; static char fastSequence[MAXREADLENGTH]; static char slowSequence[MAXREADLENGTH]; static int fastSeqLength; static int slowSeqLength; static Time * times; static unsigned int * previous; static unsigned int expCounter; static unsigned int * expanded; static double cutoff; static int Fmatrix[MAXREADLENGTH + 1][MAXREADLENGTH + 1]; static int slowToFastMapping[MAXREADLENGTH + 1]; static int fastToSlowMapping[MAXREADLENGTH + 1]; static DFibHeapNode ** dheapNodes; static DFibHeap * dheap; static unsigned int activeNode; //static ARC *activeArc; static unsigned int startingNode; static int progress; static unsigned int * eligibleStartingPoints; // DEBUG static long long caseA, caseB, caseC, caseD; static long long dnodeCounter; static long long rnodeCounter; static long long btCounter; static long long cmpCounter; static long long simiCounter; static long long pinCounter; static long long replaceCounter; static long long getArcCounter; // END OF DEBUG static void output_seq ( char * seq, int length, FILE * fp, unsigned int from_vt, unsigned int dest ) { int i; Kmer kmer; kmer = vt_array[from_vt].kmer; printKmerSeq ( fp, kmer ); fprintf ( fp, " " ); for ( i = 0; i < length; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) seq[i] ) ); } if ( edge_array[dest].seq ) { fprintf ( fp, " %c\n", int2base ( ( 
int ) getCharInTightString ( edge_array[dest].seq, 0 ) ) ); } else { fprintf ( fp, " N\n" ); } } static void print_path ( FILE * fp ) { READINTERVAL * marker; marker = fastPath->nextInRead; while ( marker->nextInRead ) { fprintf ( fp, "%u ", marker->edgeid ); marker = marker->nextInRead; } fprintf ( fp, "\n" ); marker = slowPath->nextInRead; while ( marker->nextInRead ) { fprintf ( fp, "%u ", marker->edgeid ); marker = marker->nextInRead; } fprintf ( fp, "\n" ); } static void output_pair ( int lengthF, int lengthS, FILE * fp, int nodeF, int nodeS, boolean merged, unsigned int source, unsigned int destination ) { fprintf ( fp, "$$ %d vs %d $$ %d\n", nodeF, nodeS, merged ); //print_path(fp); output_seq ( fastSequence, lengthF, fp, edge_array[source].to_vt, destination ); output_seq ( slowSequence, lengthS, fp, edge_array[source].to_vt, destination ); //fprintf(fp,"\n"); } static void resetNodeStatus() { unsigned int index; ARC * arc; unsigned int bal_ed; for ( index = 1; index <= num_ed; index++ ) { if ( EdSameAsTwin ( index ) ) { edge_array[index].multi = 1; continue; } arc = edge_array[index].arcs; bal_ed = getTwinEdge ( index ); while ( arc ) { if ( arc->to_ed == bal_ed ) { break; } arc = arc->next; } if ( arc ) { edge_array[index].multi = 1; edge_array[bal_ed].multi = 1; index++; continue; } arc = edge_array[bal_ed].arcs; while ( arc ) { if ( arc->to_ed == index ) { break; } arc = arc->next; } if ( arc ) { edge_array[index].multi = 1; edge_array[bal_ed].multi = 1; } else { edge_array[index].multi = 0; edge_array[bal_ed].multi = 0; } index++; } } /* static void determineEligibleStartingPoints() { long long index,counter=0; unsigned int node; unsigned int maxmult; ARC *parc; FibHeap *heap = newFibHeap(); for(index=1;index<=num_ed;index++){ if(edge_array[index].deleted||edge_array[index].length<1) continue; maxmult = counter = 0; parc = edge_array[index].arcs; while(parc){ if(parc->multiplicity > maxmult) maxmult = parc->multiplicity; parc = parc->next; } 
if(maxmult<1){ continue; } insertNodeIntoHeap(heap,-maxmult,index); } counter = 0; while((index=removeNextNodeFromHeap(heap))!=0){ eligibleStartingPoints[counter++] = index; } destroyHeap(heap); printf("%lld edges out of %d are eligible starting points\n",counter,num_ed); } */ static unsigned int nextStartingPoint() { unsigned int index = 1; unsigned int result = 0; for ( index = progress + 1; index < num_ed; index++ ) { //result = eligibleStartingPoints[index]; result = index; if ( edge_array[index].deleted || edge_array[index].length < 1 ) { continue; } if ( result == 0 ) { return 0; } if ( edge_array[result].multi > 0 ) { continue; } progress = index; return result; } return 0; } static void updateNodeStatus() { unsigned int i, node; for ( i = 0; i < expCounter; i++ ) { node = expanded[i]; edge_array[node].multi = 1; edge_array[getTwinEdge ( node )].multi = 1; } } unsigned int getNodePrevious ( unsigned int node ) { return previous[node]; } static boolean isPreviousToNode ( unsigned int previous, unsigned int target ) { unsigned int currentNode = target; unsigned int previousNode = 0; Time targetTime = times[target]; while ( currentNode ) { if ( currentNode == previous ) { return 1; } if ( currentNode == previousNode ) { return 0; } if ( times[currentNode] != targetTime ) { return 0; } previousNode = currentNode; currentNode = getNodePrevious ( currentNode ); } return 0; } static void copySeq ( char * targetS, char * sourceS, int pos, int length ) { char ch; int i, index; index = pos; for ( i = 0; i < length; i++ ) { ch = getCharInTightString ( sourceS, i ); targetS[index++] = ch; //writeChar2tightString(ch,targetS,index++); } } // return the length of sequence static int extractSequence ( READINTERVAL * path, char * sequence ) { READINTERVAL * marker; int seqLength, writeIndex; seqLength = writeIndex = 0; path->start = -10; marker = path->nextInRead; while ( marker->nextInRead ) { marker->start = seqLength; seqLength += edge_array[marker->edgeid].length; marker 
= marker->nextInRead; } marker->start = seqLength; if ( seqLength > MAXREADLENGTH ) { return 0; } marker = path->nextInRead; while ( marker->nextInRead ) { if ( edge_array[marker->edgeid].length && edge_array[marker->edgeid].seq ) { copySeq ( sequence, edge_array[marker->edgeid].seq, writeIndex, edge_array[marker->edgeid].length ); writeIndex += edge_array[marker->edgeid].length; } /* else if(edge_array[marker->edgeid].length==0) printf("node %d with length 0 in this path\n",marker->edgeid); else if(edge_array[marker->edgeid].seq==NULL) printf("node %d without seq in this path\n",marker->edgeid); */ marker = marker->nextInRead; } return seqLength; } static int max ( int A, int B, int C ) { A = A >= B ? A : B; return ( A >= C ? A : C ); } static boolean compareSequences ( char * sequence1, char * sequence2, int length1, int length2 ) { int i, j; int maxLength; int Choice1, Choice2, Choice3; int maxScore; if ( length1 == 0 || length2 == 0 ) { caseA++; return 0; } if ( abs ( ( int ) length1 - ( int ) length2 ) > 2 ) { caseB++; return 0; } if ( length1 < overlaplen - 1 || length2 < overlaplen - 1 ) { caseB++; return 0; } /* if (length1 < overlaplen || length2 < overlaplen){ if(abs((int)length1 - (int)length2) > 3){ caseB++; return 0; } } */ //printf("length %d vs %d\n",length1,length2); for ( i = 0; i <= length1; i++ ) { Fmatrix[i][0] = 0; } for ( j = 0; j <= length2; j++ ) { Fmatrix[0][j] = 0; } for ( i = 1; i <= length1; i++ ) { for ( j = 1; j <= length2; j++ ) { Choice1 = Fmatrix[i - 1][j - 1] + SIM[ ( int ) sequence1[i - 1]] [ ( int ) sequence2[j - 1]]; Choice2 = Fmatrix[i - 1][j] + INDEL; Choice3 = Fmatrix[i][j - 1] + INDEL; Fmatrix[i][j] = max ( Choice1, Choice2, Choice3 ); } } maxScore = Fmatrix[length1][length2]; maxLength = ( length1 > length2 ? 
length1 : length2 ); if ( maxScore < maxLength - DIFF ) { caseC++; return 0; } if ( ( 1 - ( double ) maxScore / maxLength ) > cutoff ) { caseD++; return 0; } //printf("\niTOTO %i / %li\n", maxScore, maxLength); return 1; } static void mapSlowOntoFast() { int slowIndex = slowSeqLength; int fastIndex = fastSeqLength; int fastn, slown; if ( slowIndex == 0 ) { slowToFastMapping[0] = fastIndex; while ( fastIndex >= 0 ) { fastToSlowMapping[fastIndex--] = 0; } return; } if ( fastIndex == 0 ) { while ( slowIndex >= 0 ) { slowToFastMapping[slowIndex--] = 0; } fastToSlowMapping[0] = slowIndex; return; } while ( slowIndex > 0 && fastIndex > 0 ) { fastn = ( int ) fastSequence[fastIndex - 1]; //getCharInTightString(fastSequence,fastIndex-1); slown = ( int ) slowSequence[slowIndex - 1]; //getCharInTightString(slowSequence,slowIndex-1); if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex - 1] + SIM[fastn][slown] ) { fastToSlowMapping[--fastIndex] = --slowIndex; slowToFastMapping[slowIndex] = fastIndex; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex] + INDEL ) { fastToSlowMapping[--fastIndex] = slowIndex - 1; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex][slowIndex - 1] + INDEL ) { slowToFastMapping[--slowIndex] = fastIndex - 1; } else { //printPaths(); printf ( "Error" ); fflush ( stdout ); abort(); } } while ( slowIndex > 0 ) { slowToFastMapping[--slowIndex] = -1; } while ( fastIndex > 0 ) { fastToSlowMapping[--fastIndex] = -1; } slowToFastMapping[slowSeqLength] = fastSeqLength; fastToSlowMapping[fastSeqLength] = slowSeqLength; } /* //add an arc to the head of an arc list static ARC *addArc(ARC *arc_list, ARC *arc) { arc->prev = NULL; arc->next = arc_list; if(arc_list) arc_list->prev = arc; arc_list = arc; return arc_list; } */ //remove an arc from the double linked list and return the updated list ARC * deleteArc ( ARC * arc_list, ARC * arc ) { if ( arc->prev ) { arc->prev->next = arc->next; } else { arc_list = 
arc->next; } if ( arc->next ) { arc->next->prev = arc->prev; } /* if(checkActiveArc&&arc==activeArc){ activeArc = arc->next; } */ dismissArc ( arc ); return arc_list; } //add a rv to the head of a rv list static READINTERVAL * addRv ( READINTERVAL * rv_list, READINTERVAL * rv ) { rv->prevOnEdge = NULL; rv->nextOnEdge = rv_list; if ( rv_list ) { rv_list->prevOnEdge = rv; } rv_list = rv; return rv_list; } //remove a rv from the double linked list and return the updated list static READINTERVAL * deleteRv ( READINTERVAL * rv_list, READINTERVAL * rv ) { if ( rv->prevOnEdge ) { rv->prevOnEdge->nextOnEdge = rv->nextOnEdge; } else { rv_list = rv->nextOnEdge; } if ( rv->nextOnEdge ) { rv->nextOnEdge->prevOnEdge = rv->prevOnEdge; } return rv_list; } /* static void disconnect(unsigned int from_ed, unsigned int to_ed) { READINTERVAL *rv_temp; rv_temp = edge_array[from_ed].rv; while(rv_temp){ if(!rv_temp->nextInRead||rv_temp->nextInRead->edgeid!=to_ed){ rv_temp = rv_temp->nextOnEdge; continue; } rv_temp->nextInRead->prevInRead = NULL; rv_temp->nextInRead = NULL; rv_temp = rv_temp->nextOnEdge; } } */ static int mapDistancesOntoPaths() { READINTERVAL * marker; int totalDistance = 0; marker = slowPath; while ( marker->nextInRead ) { marker = marker->nextInRead; marker->start = totalDistance; totalDistance += edge_array[marker->edgeid].length; marker->bal_rv->start = totalDistance; } totalDistance = 0; marker = fastPath; while ( marker->nextInRead ) { marker = marker->nextInRead; marker->start = totalDistance; totalDistance += edge_array[marker->edgeid].length; marker->bal_rv->start = totalDistance; } return totalDistance; } //attach a path to the graph and mean while make the reverse complementary path of it static void attachPath ( READINTERVAL * path ) { READINTERVAL * marker, *bal_marker; unsigned int ed, bal_ed; marker = path; while ( marker ) { ed = marker->edgeid; edge_array[ed].rv = addRv ( edge_array[ed].rv, marker ); bal_ed = getTwinEdge ( ed ); bal_marker = allocateRV ( 
-marker->readid, bal_ed ); edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_marker ); if ( marker->prevInRead ) { marker->prevInRead->bal_rv->prevInRead = bal_marker; bal_marker->nextInRead = marker->prevInRead->bal_rv; } bal_marker->bal_rv = marker; marker->bal_rv = bal_marker; marker = marker->nextInRead; } } static void detachPathSingle ( READINTERVAL * path ) { READINTERVAL * marker, *nextMarker; unsigned int ed; marker = path; while ( marker ) { nextMarker = marker->nextInRead; ed = marker->edgeid; edge_array[ed].rv = deleteRv ( edge_array[ed].rv, marker ); dismissRV ( marker ); marker = nextMarker; } } static void detachPath ( READINTERVAL * path ) { READINTERVAL * marker, *bal_marker, *nextMarker; unsigned int ed, bal_ed; marker = path; while ( marker ) { nextMarker = marker->nextInRead; bal_marker = marker->bal_rv; ed = marker->edgeid; edge_array[ed].rv = deleteRv ( edge_array[ed].rv, marker ); dismissRV ( marker ); //printf("%d (%d),",ed,edge_array[ed].deleted); bal_ed = getTwinEdge ( ed ); edge_array[bal_ed].rv = deleteRv ( edge_array[bal_ed].rv, bal_marker ); dismissRV ( bal_marker ); marker = nextMarker; } // printf("\n"); fflush ( stdout ); } static void remapNodeMarkersOntoNeighbour ( unsigned int source, unsigned int target ) { READINTERVAL * marker, *bal_marker; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); while ( ( marker = edge_array[source].rv ) != NULL ) { edge_array[source].rv = deleteRv ( edge_array[source].rv, marker ); marker->edgeid = target; edge_array[target].rv = addRv ( edge_array[target].rv, marker ); bal_marker = marker->bal_rv; edge_array[bal_source].rv = deleteRv ( edge_array[bal_source].rv, bal_marker ); bal_marker->edgeid = bal_target; edge_array[bal_target].rv = addRv ( edge_array[bal_target].rv, bal_marker ); } } static void remapNodeInwardReferencesOntoNode ( unsigned int source, unsigned int target ) { ARC * arc; unsigned int destination; for ( arc = 
edge_array[source].arcs; arc != NULL; arc = arc->next ) { destination = arc->to_ed; if ( destination == target || destination == source ) { continue; } if ( previous[destination] == source ) { previous[destination] = target; } } } static void remapNodeTimesOntoTargetNode ( unsigned int source, unsigned int target ) { Time nodeTime = times[source]; unsigned int prevNode = previous[source]; Time targetTime = times[target]; if ( nodeTime == -1 ) { return; } if ( prevNode == source ) { times[target] = nodeTime; previous[target] = target; } else if ( targetTime == -1 || targetTime > nodeTime || ( targetTime == nodeTime && !isPreviousToNode ( target, prevNode ) ) ) { times[target] = nodeTime; if ( prevNode != getTwinEdge ( source ) ) { previous[target] = prevNode; } else { previous[target] = getTwinEdge ( target ); } } remapNodeInwardReferencesOntoNode ( source, target ); previous[source] = 0; } static void remapNodeTimesOntoNeighbour ( unsigned int source, unsigned int target ) { remapNodeTimesOntoTargetNode ( source, target ); remapNodeTimesOntoTargetNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); //questionable } static void destroyArc ( unsigned int from_ed, ARC * arc ) { unsigned int bal_dest; ARC * twinArc; if ( !arc ) { return; } bal_dest = getTwinEdge ( arc->to_ed ); twinArc = arc->bal_arc; removeArcInLookupTable ( from_ed, arc->to_ed ); edge_array[from_ed].arcs = deleteArc ( edge_array[from_ed].arcs, arc ); if ( bal_dest != from_ed ) { removeArcInLookupTable ( bal_dest, getTwinEdge ( from_ed ) ); edge_array[bal_dest].arcs = deleteArc ( edge_array[bal_dest].arcs, twinArc ); } } static void createAnalogousArc ( unsigned int originNode, unsigned int destinationNode, ARC * refArc ) { ARC * arc, *twinArc; unsigned int destinationTwin; arc = getArcBetween ( originNode, destinationNode ); if ( arc ) { if ( refArc->bal_arc != refArc ) { arc->multiplicity += refArc->multiplicity; arc->bal_arc->multiplicity += refArc->multiplicity; } else { arc->multiplicity += 
refArc->multiplicity / 2; arc->bal_arc->multiplicity += refArc->multiplicity / 2; } return; } arc = allocateArc ( destinationNode ); arc->multiplicity = refArc->multiplicity; arc->prev = NULL; arc->next = edge_array[originNode].arcs; if ( edge_array[originNode].arcs ) { edge_array[originNode].arcs->prev = arc; } edge_array[originNode].arcs = arc; putArc2LookupTable ( originNode, arc ); destinationTwin = getTwinEdge ( destinationNode ); if ( destinationTwin == originNode ) { //printf("arc from A to A'\n"); arc->bal_arc = arc; if ( refArc->bal_arc != refArc ) { arc->multiplicity += refArc->multiplicity; } return; } twinArc = allocateArc ( getTwinEdge ( originNode ) ); arc->bal_arc = twinArc; twinArc->bal_arc = arc; twinArc->multiplicity = refArc->multiplicity; twinArc->prev = NULL; twinArc->next = edge_array[destinationTwin].arcs; if ( edge_array[destinationTwin].arcs ) { edge_array[destinationTwin].arcs->prev = twinArc; } edge_array[destinationTwin].arcs = twinArc; putArc2LookupTable ( destinationTwin, twinArc ); } static void remapNodeArcsOntoTarget ( unsigned int source, unsigned int target ) { ARC * arc; if ( source == activeNode ) { activeNode = target; } arc = edge_array[source].arcs; if ( !arc ) { return; } while ( arc != NULL ) { createAnalogousArc ( target, arc->to_ed, arc ); destroyArc ( source, arc ); arc = edge_array[source].arcs; } } static void remapNodeArcsOntoNeighbour ( unsigned int source, unsigned int target ) { remapNodeArcsOntoTarget ( source, target ); remapNodeArcsOntoTarget ( getTwinEdge ( source ), getTwinEdge ( target ) ); } static DFibHeapNode * getNodeDHeapNode ( unsigned int node ) { return dheapNodes[node]; } static void setNodeDHeapNode ( unsigned int node, DFibHeapNode * dheapNode ) { dheapNodes[node] = dheapNode; } static void remapNodeFibHeapReferencesOntoNode ( unsigned int source, unsigned int target ) { DFibHeapNode * sourceDHeapNode = getNodeDHeapNode ( source ); DFibHeapNode * targetDHeapNode = getNodeDHeapNode ( target ); if ( 
sourceDHeapNode == NULL ) { return; } if ( targetDHeapNode == NULL ) { setNodeDHeapNode ( target, sourceDHeapNode ); replaceValueInDHeap ( sourceDHeapNode, target ); } else if ( getKey ( targetDHeapNode ) > getKey ( sourceDHeapNode ) ) { setNodeDHeapNode ( target, sourceDHeapNode ); replaceValueInDHeap ( sourceDHeapNode, target ); destroyNodeInDHeap ( targetDHeapNode, dheap ); } else { destroyNodeInDHeap ( sourceDHeapNode, dheap ); } setNodeDHeapNode ( source, NULL ); } static void combineCOV ( unsigned int source, int len_s, unsigned int target, int len_t ) { if ( len_s < 1 || len_t < 1 ) { return; } int cov = ( len_s * edge_array[source].cvg + len_t * edge_array[target].cvg ) / len_t; edge_array[target].cvg = cov > MaxEdgeCov ? MaxEdgeCov : cov; edge_array[getTwinEdge ( target )].cvg = cov > MaxEdgeCov ? MaxEdgeCov : cov; } static void remapNodeOntoNeighbour ( unsigned int source, unsigned int target ) { combineCOV ( source, edge_array[source].length, target, edge_array[target].length ); remapNodeMarkersOntoNeighbour ( source, target ); remapNodeTimesOntoNeighbour ( source, target ); //questionable remapNodeArcsOntoNeighbour ( source, target ); remapNodeFibHeapReferencesOntoNode ( source, target ); remapNodeFibHeapReferencesOntoNode ( getTwinEdge ( source ), getTwinEdge ( target ) ); edge_array[source].deleted = 1; edge_array[getTwinEdge ( source )].deleted = 1; if ( startingNode == source ) { startingNode = target; } if ( startingNode == getTwinEdge ( source ) ) { startingNode = getTwinEdge ( target ); } edge_array[source].length = 0; edge_array[getTwinEdge ( source )].length = 0; } static void connectInRead ( READINTERVAL * previous, READINTERVAL * next ) { if ( previous ) { previous->nextInRead = next; if ( next ) { previous->bal_rv->prevInRead = next->bal_rv; } else { previous->bal_rv->prevInRead = NULL; } } if ( next ) { next->prevInRead = previous; if ( previous ) { next->bal_rv->nextInRead = previous->bal_rv; } else { next->bal_rv->nextInRead = NULL; } } } 
static int remapBackOfNodeMarkersOntoNeighbour ( unsigned int source, READINTERVAL * sourceMarker, unsigned int target, READINTERVAL * targetMarker, boolean slowToFast ) { READINTERVAL * marker, *newMarker, *bal_new, *previousMarker; int halfwayPoint, halfwayPointOffset, breakpoint; int * targetToSourceMapping, *sourceToTargetMapping; unsigned int bal_ed; int targetFinish = targetMarker->bal_rv->start; int sourceStart = sourceMarker->start; int sourceFinish = sourceMarker->bal_rv->start; int alignedSourceLength = sourceFinish - sourceStart; int realSourceLength = edge_array[source].length; if ( slowToFast ) { sourceToTargetMapping = slowToFastMapping; targetToSourceMapping = fastToSlowMapping; } else { sourceToTargetMapping = fastToSlowMapping; targetToSourceMapping = slowToFastMapping; } if ( alignedSourceLength > 0 && targetFinish > 0 ) { halfwayPoint = targetToSourceMapping[targetFinish - 1] - sourceStart + 1; halfwayPoint *= realSourceLength; halfwayPoint /= alignedSourceLength; } else { halfwayPoint = 0; } if ( halfwayPoint < 0 ) { halfwayPoint = 0; } if ( halfwayPoint > realSourceLength ) { halfwayPoint = realSourceLength; } halfwayPointOffset = realSourceLength - halfwayPoint; bal_ed = getTwinEdge ( target ); for ( marker = edge_array[source].rv; marker != NULL; marker = marker->nextOnEdge ) { if ( marker->prevInRead && marker->prevInRead->edgeid == target ) { continue; } newMarker = allocateRV ( marker->readid, target ); edge_array[target].rv = addRv ( edge_array[target].rv, newMarker ); bal_new = allocateRV ( -marker->readid, bal_ed ); edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_new ); newMarker->bal_rv = bal_new; bal_new->bal_rv = newMarker; newMarker->start = marker->start; if ( realSourceLength > 0 ) { breakpoint = halfwayPoint + marker->start; } else { breakpoint = marker->start; } bal_new->start = breakpoint; marker->start = breakpoint; previousMarker = marker->prevInRead; connectInRead ( previousMarker, newMarker ); connectInRead ( 
newMarker, marker ); } return halfwayPointOffset; } static void printKmer ( Kmer kmer ) { printKmerSeq ( stdout, kmer ); printf ( "\n" ); } static void splitNodeDescriptor ( unsigned int source, unsigned int target, int offset ) { int originalLength = edge_array[source].length; int backLength = originalLength - offset; int index, seqLen; char * tightSeq, nt, *newSeq; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); edge_array[source].length = offset; edge_array[bal_source].length = offset; edge_array[source].flag = 1; edge_array[bal_source].flag = 1; if ( target != 0 ) { edge_array[target].length = backLength; edge_array[bal_target].length = backLength; free ( ( void * ) edge_array[target].seq ); edge_array[target].seq = NULL; free ( ( void * ) edge_array[bal_target].seq ); edge_array[bal_target].seq = NULL; } if ( backLength == 0 ) { return; } tightSeq = edge_array[source].seq; seqLen = backLength / 4 + 1; if ( target != 0 ) { edge_array[target].flag = 1; edge_array[bal_target].flag = 1; newSeq = ( char * ) ckalloc ( seqLen * sizeof ( char ) ); edge_array[target].seq = newSeq; for ( index = 0; index < backLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, newSeq, index ); } } //source node for ( index = backLength; index < originalLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, tightSeq, index - backLength ); } if ( target == 0 ) { return; } //target twin tightSeq = edge_array[bal_source].seq; newSeq = ( char * ) ckalloc ( seqLen * sizeof ( char ) ); edge_array[bal_target].seq = newSeq; for ( index = offset; index < originalLength; index++ ) { nt = getCharInTightString ( tightSeq, index ); writeChar2tightString ( nt, newSeq, index - offset ); } } static void remapBackOfNodeDescriptorOntoNeighbour ( unsigned int source, unsigned int target, boolean slowToFast, int offset ) { unsigned int bal_source = getTwinEdge ( source ); 
    unsigned int bal_target = getTwinEdge ( target );

    if ( slowToFast ) {
        splitNodeDescriptor ( source, 0, offset );
    } else {
        splitNodeDescriptor ( source, target, offset );
    }

    //printf("%d vs %d\n",edge_array[source].from_vt,edge_array[target].to_vt);
    edge_array[source].from_vt = edge_array[target].to_vt;
    edge_array[bal_source].to_vt = edge_array[bal_target].from_vt;
}

static void remapBackOfNodeTimesOntoNeighbour ( unsigned int source, unsigned int target )
{
    Time targetTime = times[target];
    Time nodeTime = times[source];
    unsigned int twinTarget = getTwinEdge ( target );
    unsigned int twinSource = getTwinEdge ( source );
    unsigned int previousNode;

    if ( nodeTime != -1 ) {
        previousNode = previous[source];

        if ( previousNode == source ) {
            times[target] = nodeTime;
            previous[target] = target;
        } else if ( targetTime == -1 || targetTime > nodeTime
                    || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) {
            times[target] = nodeTime;

            if ( previousNode != twinSource ) {
                previous[target] = previousNode;
            } else {
                previous[target] = twinTarget;
            }
        }

        previous[source] = target;
    }

    targetTime = times[twinTarget];
    nodeTime = times[twinSource];

    if ( nodeTime != -1 ) {
        if ( targetTime == -1 || targetTime > nodeTime
                || ( targetTime == nodeTime && !isPreviousToNode ( twinTarget, twinSource ) ) ) {
            times[twinTarget] = nodeTime;
            previous[twinTarget] = twinSource;
        }
    }

    remapNodeInwardReferencesOntoNode ( twinSource, twinTarget );
}

static void remapBackOfNodeArcsOntoNeighbour ( unsigned int source, unsigned int target )
{
    ARC * arc;
    remapNodeArcsOntoTarget ( getTwinEdge ( source ), getTwinEdge ( target ) );

    for ( arc = edge_array[source].arcs; arc != NULL; arc = arc->next ) {
        createAnalogousArc ( target, source, arc );
    }
}

static void remapBackOfNodeOntoNeighbour ( unsigned int source, READINTERVAL * sourceMarker, unsigned int target, READINTERVAL * targetMarker, boolean slowToFast )
{
    int offset;
    offset = remapBackOfNodeMarkersOntoNeighbour ( source, sourceMarker, target, targetMarker, slowToFast );
    remapBackOfNodeDescriptorOntoNeighbour ( source, target, slowToFast, offset );
    combineCOV ( source, edge_array[target].length, target, edge_array[target].length );
    remapBackOfNodeTimesOntoNeighbour ( source, target );
    remapBackOfNodeArcsOntoNeighbour ( source, target );
    remapNodeFibHeapReferencesOntoNode ( getTwinEdge ( source ), getTwinEdge ( target ) );

    //Why not "remapNodeFibHeapReferencesOntoNode(source,target);"?
    //Because the downstream part of source is retained and can still serve as previousNode.
    if ( getTwinEdge ( source ) == startingNode ) {
        startingNode = getTwinEdge ( target );
    }
}

static boolean markerLeadsToNode ( READINTERVAL * marker, unsigned int node )
{
    READINTERVAL * currentMarker;

    for ( currentMarker = marker; currentMarker != NULL; currentMarker = currentMarker->nextInRead )
        if ( currentMarker->edgeid == node ) {
            return true;
        }

    return false;
}

static void reduceNode ( unsigned int node )
{
    unsigned int bal_ed = getTwinEdge ( node );
    edge_array[node].length = 0;
    edge_array[bal_ed].length = 0;
}

static void reduceSlowNodes ( READINTERVAL * slowMarker, unsigned int finish )
{
    READINTERVAL * marker;

    for ( marker = slowMarker; marker->edgeid != finish; marker = marker->nextInRead ) {
        reduceNode ( marker->edgeid );
    }
}

static boolean markerLeadsToArc ( READINTERVAL * marker, unsigned int nodeA, unsigned int nodeB )
{
    READINTERVAL * current, *next;
    unsigned int twinA = getTwinEdge ( nodeA );
    unsigned int twinB = getTwinEdge ( nodeB );
    current = marker;

    while ( current != NULL ) {
        next = current->nextInRead;

        //the last marker has no successor, so it cannot start an arc;
        //stopping here also avoids dereferencing a NULL next pointer
        if ( next == NULL ) {
            return false;
        }

        if ( current->edgeid == nodeA && next->edgeid == nodeB ) {
            return true;
        }

        if ( current->edgeid == twinB && next->edgeid == twinA ) {
            return true;
        }

        current = next;
    }

    return false;
}

static void remapEmptyPathArcsOntoMiddlePathSimple ( READINTERVAL * emptyPath, READINTERVAL * targetPath )
{
    READINTERVAL * pathMarker, *marker;
    unsigned int start = emptyPath->prevInRead->edgeid;
    unsigned int finish =
        emptyPath->edgeid;
    unsigned int previousNode = start;
    unsigned int currentNode;
    ARC * originalArc = getArcBetween ( start, finish );

    if ( !originalArc ) {
        printf ( "remapEmptyPathArcsOntoMiddlePathSimple: no arc between %d and %d\n", start, finish );
        marker = fastPath;
        printf ( "fast path: " );

        while ( marker ) {
            printf ( "%d,", marker->edgeid );
            marker = marker->nextInRead;
        }

        printf ( "\n" );
        marker = slowPath;
        printf ( "slow path: " );

        while ( marker ) {
            printf ( "%d,", marker->edgeid );
            marker = marker->nextInRead;
        }

        printf ( "\n" );
        fflush ( stdout );
    }

    for ( pathMarker = targetPath; pathMarker->edgeid != finish; pathMarker = pathMarker->nextInRead ) {
        currentNode = pathMarker->edgeid;
        createAnalogousArc ( previousNode, currentNode, originalArc );
        previousNode = currentNode;
    }

    createAnalogousArc ( previousNode, finish, originalArc );
    destroyArc ( start, originalArc );
}

static void remapEmptyPathMarkersOntoMiddlePathSimple ( READINTERVAL * emptyPath, READINTERVAL * targetPath, boolean slowToFast )
{
    READINTERVAL * marker, *newMarker, *previousMarker, *pathMarker, *bal_marker;
    unsigned int start = emptyPath->prevInRead->edgeid;
    unsigned int finish = emptyPath->edgeid;
    unsigned int markerStart, bal_ed;
    READINTERVAL * oldMarker = edge_array[finish].rv;

    while ( oldMarker ) {
        marker = oldMarker;
        oldMarker = marker->nextOnEdge;
        newMarker = marker->prevInRead;

        if ( newMarker->edgeid != start ) {
            continue;
        }

        if ( ( slowToFast && marker->readid != 2 ) || ( !slowToFast && marker->readid != 1 ) ) {
            continue;
        }

        markerStart = marker->start;

        for ( pathMarker = targetPath; pathMarker->edgeid != finish; pathMarker = pathMarker->nextInRead ) {
            previousMarker = newMarker;
            //make a new marker
            newMarker = allocateRV ( marker->readid, pathMarker->edgeid );
            newMarker->start = markerStart;
            edge_array[pathMarker->edgeid].rv = addRv ( edge_array[pathMarker->edgeid].rv, newMarker );
            //make the twin marker
            bal_ed = getTwinEdge ( pathMarker->edgeid );
            bal_marker = allocateRV ( -marker->readid, bal_ed );
            bal_marker->start = markerStart;
            edge_array[bal_ed].rv = addRv ( edge_array[bal_ed].rv, bal_marker );
            newMarker->bal_rv = bal_marker;
            bal_marker->bal_rv = newMarker;
            connectInRead ( previousMarker, newMarker );
        }

        connectInRead ( newMarker, marker );
    }
}

static void remapNodeTimesOntoForwardMiddlePath ( unsigned int source, READINTERVAL * path )
{
    READINTERVAL * marker;
    unsigned int target;
    Time nodeTime = times[source];
    unsigned int previousNode = previous[source];
    Time targetTime;

    for ( marker = path; marker->edgeid != source; marker = marker->nextInRead ) {
        target = marker->edgeid;
        targetTime = times[target];

        if ( targetTime == -1 || targetTime > nodeTime
                || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) {
            times[target] = nodeTime;
            previous[target] = previousNode;
        }

        previousNode = target;
    }

    previous[source] = previousNode;
}

static void remapNodeTimesOntoTwinMiddlePath ( unsigned int source, READINTERVAL * path )
{
    READINTERVAL * marker;
    unsigned int target;
    unsigned int previousNode = getTwinEdge ( source );
    Time targetTime;
    READINTERVAL * limit = path->prevInRead->bal_rv;
    Time nodeTime = times[limit->edgeid];
    marker = path;

    while ( marker->edgeid != source ) {
        marker = marker->nextInRead;
    }

    marker = marker->bal_rv;

    while ( marker != limit ) {
        marker = marker->nextInRead;
        target = marker->edgeid;
        targetTime = times[target];

        if ( targetTime == -1 || targetTime > nodeTime
                || ( targetTime == nodeTime && !isPreviousToNode ( target, previousNode ) ) ) {
            times[target] = nodeTime;
            previous[target] = previousNode;
        }

        previousNode = target;
    }
}

static void remapEmptyPathOntoMiddlePath ( READINTERVAL * emptyPath, READINTERVAL * targetPath, boolean slowToFast )
{
    unsigned int start = emptyPath->prevInRead->edgeid;
    unsigned int finish = emptyPath->edgeid;

    // Remapping markers
    if ( !markerLeadsToArc ( targetPath, start, finish ) ) {
        remapEmptyPathArcsOntoMiddlePathSimple ( emptyPath, targetPath );
    }
    remapEmptyPathMarkersOntoMiddlePathSimple ( emptyPath, targetPath, slowToFast );

    //Remap times and previous(if necessary)
    if ( getNodePrevious ( finish ) == start ) {
        remapNodeTimesOntoForwardMiddlePath ( finish, targetPath );
    }

    if ( getNodePrevious ( getTwinEdge ( start ) ) == getTwinEdge ( finish ) ) {
        remapNodeTimesOntoTwinMiddlePath ( finish, targetPath );
    }
}

static boolean cleanUpRedundancy()
{
    READINTERVAL * slowMarker = slowPath->nextInRead, *fastMarker = fastPath->nextInRead;
    unsigned int slowNode, fastNode;
    int slowLength, fastLength;
    int fastConstraint = 0;
    int slowConstraint = 0;
    int finalLength;
    attachPath ( slowPath );
    attachPath ( fastPath );
    mapSlowOntoFast();
    finalLength = mapDistancesOntoPaths();
    slowLength = fastLength = 0;

    while ( slowMarker != NULL && fastMarker != NULL ) {
        //modifCounter++;
        if ( !slowMarker->nextInRead ) {
            slowLength = finalLength;
        } else {
            slowLength = slowToFastMapping[slowMarker->bal_rv->start - 1];

            if ( slowLength < slowConstraint ) {
                slowLength = slowConstraint;
            }
        }

        fastLength = fastMarker->bal_rv->start - 1;

        if ( fastLength < fastConstraint ) {
            fastLength = fastConstraint;
        }

        slowNode = slowMarker->edgeid;
        fastNode = fastMarker->edgeid;

        if ( false )
            printf ( "Slow %d\tFast %d\n", slowLength, fastLength );

        if ( slowNode == fastNode ) {
            //if (false)
            if ( false )
                printf ( "0/ Already merged together %d == %d\n", slowNode, fastNode );

            if ( fastLength > slowLength ) {
                slowConstraint = fastLength;
            }

            //else if (fastLength < slowLength);
            fastConstraint = slowLength;
            slowMarker = slowMarker->nextInRead;
            fastMarker = fastMarker->nextInRead;
        } else if ( slowNode == getTwinEdge ( fastNode ) ) {
            //if (false)
            if ( false )
                printf ( "1/ Creme de la hairpin %d $$ %d\n", slowNode, fastNode );

            if ( fastLength > slowLength ) {
                slowConstraint = fastLength;
            }

            //else if (fastLength < slowLength);
            fastConstraint = slowLength;
            slowMarker = slowMarker->nextInRead;
            fastMarker = fastMarker->nextInRead;
            //foldSymmetricalNode(fastNode);
        } else if ( markerLeadsToNode ( slowMarker, fastNode ) ) {
            //if (false)
            if ( false ) {
                printf ( "2/ Remapping empty fast arc onto slow nodes\n" );
            }

            reduceSlowNodes ( slowMarker, fastNode );
            remapEmptyPathOntoMiddlePath ( fastMarker, slowMarker, FAST_TO_SLOW );

            while ( slowMarker->edgeid != fastNode ) {
                slowMarker = slowMarker->nextInRead;
            }
        } else if ( markerLeadsToNode ( fastMarker, slowNode ) ) {
            //if (false)
            if ( false ) {
                printf ( "3/ Remapping empty slow arc onto fast nodes\n" );
            }

            remapEmptyPathOntoMiddlePath ( slowMarker, fastMarker, SLOW_TO_FAST );

            while ( fastMarker->edgeid != slowNode ) {
                fastMarker = fastMarker->nextInRead;
            }
        } else if ( slowLength == fastLength ) {
            if ( false ) {
                printf ( "A/ Mapped equivalent nodes together %d <=> %d\n", slowNode, fastNode );
            }

            remapNodeOntoNeighbour ( slowNode, fastNode );
            slowMarker = slowMarker->nextInRead;
            fastMarker = fastMarker->nextInRead;
        } else if ( slowLength < fastLength ) {
            if ( false ) {
                printf ( "B/ Mapped back of fast node into slow %d -> %d\n", fastNode, slowNode );
            }

            remapBackOfNodeOntoNeighbour ( fastNode, fastMarker, slowNode, slowMarker, FAST_TO_SLOW );
            slowMarker = slowMarker->nextInRead;
        } else {
            if ( false ) {
                printf ( "C/ Mapped back of slow node into fast %d -> %d\n", slowNode, fastNode );
            }

            remapBackOfNodeOntoNeighbour ( slowNode, slowMarker, fastNode, fastMarker, SLOW_TO_FAST );
            fastMarker = fastMarker->nextInRead;
        }

        fflush ( stdout );
    }

    detachPath ( fastPath );
    detachPath ( slowPath );
    return 1;
}

static void comparePaths ( unsigned int destination, unsigned int origin )
{
    int slowLength, fastLength, i;
    unsigned int fastNode, slowNode;
    READINTERVAL * marker;
    slowLength = fastLength = 0;
    fastNode = destination;
    slowNode = origin;
    btCounter++;

    while ( fastNode != slowNode ) {
        if ( times[fastNode] > times[slowNode] ) {
            fastLength++;
            fastNode = previous[fastNode];
        } else if ( times[fastNode] < times[slowNode] ) {
            slowLength++;
            slowNode = previous[slowNode];
        } else if ( isPreviousToNode ( slowNode, fastNode
                                      ) ) {
            while ( fastNode != slowNode ) {
                fastLength++;
                fastNode = previous[fastNode];
            }
        } else if ( isPreviousToNode ( fastNode, slowNode ) ) {
            while ( slowNode != fastNode ) {
                slowLength++;
                slowNode = previous[slowNode];
            }
        } else {
            fastLength++;
            fastNode = previous[fastNode];
            slowLength++;
            slowNode = previous[slowNode];
        }

        if ( slowLength > MAXNODELENGTH || fastLength > MAXNODELENGTH ) {
            return;
        }
    }

    if ( fastLength == 0 ) {
        return;
    }

    marker = allocateRV ( 1, destination );
    fastPath = marker;

    for ( i = 0; i < fastLength; i++ ) {
        marker = allocateRV ( 1, previous[fastPath->edgeid] );
        //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid);
        marker->nextInRead = fastPath;
        fastPath->prevInRead = marker;
        fastPath = marker;
    }

    marker = allocateRV ( 2, destination );
    //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid);
    slowPath = marker;
    marker = allocateRV ( 2, origin );
    //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid);
    marker->nextInRead = slowPath;
    slowPath->prevInRead = marker;
    slowPath = marker;

    for ( i = 0; i < slowLength; i++ ) {
        marker = allocateRV ( 2, previous[slowPath->edgeid] );
        //printf("marker for read %d on edge %d\n",marker->readid,marker->edgeid);
        marker->nextInRead = slowPath;
        slowPath->prevInRead = marker;
        slowPath = marker;
    }

    //printf("node num %d vs %d\n",fastLength,slowLength);
    fastSeqLength = extractSequence ( fastPath, fastSequence );
    slowSeqLength = extractSequence ( slowPath, slowSequence );

    /*
    if(destination==6359){
        printf("destination %d, slowLength %d, fastLength %d\n",destination,slowLength,fastLength);
        printf("fastSeqLength %d, slowSeqLength %d\n",fastSeqLength,slowSeqLength);
    }
    */
    if ( !fastSeqLength || !slowSeqLength ) {
        detachPathSingle ( slowPath );
        detachPathSingle ( fastPath );
        return;
    }

    cmpCounter++;

    if ( !compareSequences ( fastSequence, slowSequence, fastSeqLength, slowSeqLength ) ) {
        //output_pair(fastSeqLength,slowSeqLength,ftemp,fastLength-1,slowLength,
        //            0,slowPath->edgeid,destination);
        detachPathSingle ( slowPath );
        detachPathSingle ( fastPath );
        return;
    }

    simiCounter++;
    //output_pair(fastSeqLength,slowSeqLength,ftemp,fastLength-1,slowLength, 1,slowPath->edgeid,destination);
    //pinCounter++;
    pinCounter += cleanUpRedundancy();

    if ( pinCounter % 100000 == 0 ) {
        printf ( ".............%lld\n", pinCounter );
    }

    HasChanged = 1;
}

static void tourBusArc ( unsigned int origin, unsigned int destination, unsigned int arcMulti, Time originTime )
{
    Time arcTime, totalTime, destinationTime;
    unsigned int oldPrevious = previous[destination];

    if ( oldPrevious == origin || edge_array[destination].multi == 1 ) {
        return;
    }

    arcCounter++;

    if ( arcMulti > 0 ) {
        arcTime = ( ( Time ) edge_array[origin].length ) / ( ( Time ) arcMulti );
    } else {
        arcTime = 0.0;
        printf ( "arc from %d to %d with flags %d originTime %f, arc %d\n",
                 origin, destination, edge_array[destination].multi, originTime, arcMulti );
        fflush ( stdout );
    }

    totalTime = originTime + arcTime;
    /*
    if(destination==289129||destination==359610){
        printf("arc from %d to %d with flags %d time %f originTime %f, arc %d\n",
            origin,destination,edge_array[destination].multi,totalTime,originTime,arcMulti);
        fflush(stdout);
    }
    */
    destinationTime = times[destination];

    if ( destinationTime == -1 ) {
        times[destination] = totalTime;
        dheapNodes[destination] = insertNodeIntoDHeap ( dheap, totalTime, destination );
        dnodeCounter++;
        previous[destination] = origin;
        return;
    } else if ( destinationTime > totalTime ) {
        if ( dheapNodes[destination] == NULL ) {
            //printf("node %d Already expanded though\n",destination);
            return;
        }

        replaceCounter++;
        times[destination] = totalTime;
        replaceKeyInDHeap ( dheap, dheapNodes[destination], totalTime );
        previous[destination] = origin;
        comparePaths ( destination, oldPrevious );
        return;
    } else {
        if ( destinationTime == times[origin] && isPreviousToNode ( destination, origin ) ) {
            return;
        }

        comparePaths ( destination, origin );
    }
}

static void tourBusNode ( unsigned int node )
{
    ARC * parc;
    int index = 0, outNodeNum;
    /*
    if(node==745)
        printf("to expand %d\n",node);
    */
    expanded[expCounter++] = node;
    //edge_array[node].multi = 2;
    activeNode = node;
    parc = edge_array[activeNode].arcs;

    while ( parc ) {
        outArcArray[index] = parc;
        outNodeArray[index++] = parc->to_ed;

        if ( index >= MAXCONNECTION ) {
            //printf("node %d has more than MAXCONNECTION arcs\n",node);
            break;
        }

        parc = parc->next;
    }

    outNodeNum = index;
    HasChanged = 0;

    for ( index = 0; index < outNodeNum; index++ ) {
        if ( HasChanged ) {
            parc = getArcBetween ( activeNode, outNodeArray[index] );
            getArcCounter++;
        } else {
            parc = outArcArray[index];
        }

        if ( !parc ) {
            continue;
        }

        tourBusArc ( activeNode, outNodeArray[index], parc->multiplicity, times[activeNode] );
    }
}

/*
static void dumpNodeFromDHeap()
{
    unsigned int currentNode;
    while((currentNode = removeNextNodeFromDHeap(dheap))!=0){
        rnodeCounter++;
        times[currentNode] = -1;
        previous[currentNode] = 0;
        dheapNodes[currentNode] = NULL;
        if(dnodeCounter-rnodeCounter<250)
            break;
    }
}
*/

static void tourBus ( unsigned int startingPoint )
{
    unsigned int currentNode = startingPoint;
    times[startingPoint] = 0;
    previous[startingPoint] = currentNode;

    while ( currentNode > 0 ) {
        dheapNodes[currentNode] = NULL;
        tourBusNode ( currentNode );
        currentNode = removeNextNodeFromDHeap ( dheap );

        if ( currentNode > 0 ) {
            rnodeCounter++;
        }
    }
}

void bubblePinch ( double simiCutoff, char * outfile, int M )
{
    unsigned int index, counter = 0;
    unsigned int startingNode;
    char temp[256];
    //bounded write: outfile may be longer than the 256-byte buffer
    snprintf ( temp, sizeof ( temp ), "%s.pathpair", outfile );
    //ftemp = ckopen(temp,"w");
    //linearConcatenate();
    //initiator
    caseA = caseB = caseC = caseD = 0;
    progress = 0;
    arcCounter = 0;
    dnodeCounter = 0;
    rnodeCounter = 0;
    btCounter = 0;
    cmpCounter = 0;
    simiCounter = 0;
    pinCounter = 0;
    replaceCounter = 0;
    getArcCounter = 0;
    cutoff = 1.0 - simiCutoff;

    if ( M <= 1 ) {
        MAXNODELENGTH = 3;
        DIFF = 2;
    } else if ( M == 2 ) {
        MAXNODELENGTH = 9;
        DIFF = 3;
    } else {
        MAXNODELENGTH = 30;
        DIFF = 10;
    }

    printf ( "start 
to pinch bubbles, cutoff %f, MAX NODE NUM %d, MAX DIFF %d\n", cutoff, MAXNODELENGTH, DIFF );
    createRVmemo();
    times = ( Time * ) ckalloc ( ( num_ed + 1 ) * sizeof ( Time ) );
    previous = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
    expanded = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
    dheapNodes = ( DFibHeapNode ** ) ckalloc ( ( num_ed + 1 ) * sizeof ( DFibHeapNode * ) );
    WORDFILTER = createFilter ( overlaplen );

    for ( index = 1; index <= num_ed; index++ ) {
        times[index] = -1;
        previous[index] = 0;
        dheapNodes[index] = NULL;
    }

    dheap = newDFibHeap();
    eligibleStartingPoints = ( unsigned int * ) ckalloc ( ( num_ed + 1 ) * sizeof ( unsigned int ) );
    resetNodeStatus();
    //determineEligibleStartingPoints();
    createArcLookupTable();
    recordArcsInLookupTable();

    while ( ( startingNode = nextStartingPoint() ) > 0 ) {
        counter++;
        //printf("starting point %d with length %d\n",startingNode,edge_array[startingNode].length);
        expCounter = 0;
        tourBus ( startingNode );
        updateNodeStatus();
    }

    resetNodeStatus();
    deleteArcLookupTable();
    destroyReadIntervMem();
    printf ( "%u startingPoints, %lld dheap nodes\n", counter, dnodeCounter );
    //printf("%lld times getArcBetween for tourBusNode\n",getArcCounter);
    printf ( "%lld pairs found, %lld pairs of paths compared, %lld pairs merged\n", btCounter, cmpCounter, pinCounter );
    printf ( "sequence compare failure: %lld %lld %lld %lld\n", caseA, caseB, caseC, caseD );
    //fclose(ftemp);
    free ( ( void * ) eligibleStartingPoints );
    destroyDHeap ( dheap );
    free ( ( void * ) dheapNodes );
    free ( ( void * ) times );
    free ( ( void * ) previous );
    free ( ( void * ) expanded );
    linearConcatenate();
}

/*
 * 127mer/check.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.  If not, see <http://www.gnu.org/licenses/>.
 *
 */

/* The include target was lost in this copy of the file; the standard
 * headers below are an assumption that covers every call made here. */
#include <stdio.h>   /* FILE, fopen, printf, fflush */
#include <stdlib.h>  /* calloc, realloc, free, exit */
#include <string.h>  /* memcpy */

void * ckalloc ( unsigned long long amount );
FILE * ckopen ( char * name, char * mode );

/* ckopen - open a file; exit on failure */
FILE * ckopen ( char * name, char * mode )
{
    FILE * fp;

    if ( ( fp = fopen ( name, mode ) ) == NULL ) {
        printf ( "Cannot open %s. Now exit to system...\n", name );
        exit ( -1 );
    }

    return ( fp );
}

/* ckalloc - allocate zero-filled space; check for success */
void * ckalloc ( unsigned long long amount )
{
    void * p;

    if ( ( p = ( void * ) calloc ( 1, ( size_t ) amount ) ) == NULL && amount != 0 ) {
        printf ( "Ran out of memory while allocating %llu bytes\n", amount );
        printf ( "Possible causes:\n" );
        printf ( "1) Not enough memory.\n" );
        printf ( "2) An array may have been overrun.\n" );
        printf ( "3) A wild pointer may have corrupted the heap.\n" );
        fflush ( stdout );
        exit ( -1 );
    }

    return ( p );
}

/* reallocate memory */
void * ckrealloc ( void * p, size_t new_size, size_t old_size )
{
    void * q;
    q = realloc ( ( void * ) p, new_size );

    if ( new_size == 0 || q != ( void * ) 0 ) {
        return q;
    }

    /* manually reallocate space */
    q = ckalloc ( new_size );

    /* move old memory to new space, never copying more than the new block holds */
    if ( p != NULL ) {
        memcpy ( q, p, old_size < new_size ? old_size : new_size );
        free ( p );
    }

    return q;
}

/*
 * 127mer/compactEdge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.  If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//copy the fields of edge source into slot target; ownership of seq and arcs moves to target
void copyEdge ( unsigned int source, unsigned int target )
{
    edge_array[target].from_vt = edge_array[source].from_vt;
    edge_array[target].to_vt = edge_array[source].to_vt;
    edge_array[target].length = edge_array[source].length;
    edge_array[target].cvg = edge_array[source].cvg;
    edge_array[target].multi = edge_array[source].multi;

    if ( edge_array[target].seq ) {
        free ( ( void * ) edge_array[target].seq );
    }

    edge_array[target].seq = edge_array[source].seq;
    edge_array[source].seq = NULL;
    edge_array[target].arcs = edge_array[source].arcs;
    edge_array[source].arcs = NULL;
    edge_array[target].deleted = edge_array[source].deleted;
}

//move edge from source to target
void edgeMove ( unsigned int source, unsigned int target )
{
    unsigned int bal_source, bal_target;
    ARC * arc;
    copyEdge ( source, target );
    bal_source = getTwinEdge ( source );

    //bal_edge
    if ( bal_source != source ) {
        bal_target = target + 1;
        copyEdge ( bal_source, bal_target );
        edge_array[target].bal_edge = 2;
        edge_array[bal_target].bal_edge = 0;
    } else {
        edge_array[target].bal_edge = 1;
        bal_target = target;
    }

    //take care of the arcs
    arc = edge_array[target].arcs;

    while ( arc ) {
        arc->bal_arc->to_ed = bal_target;
        arc = arc->next;
    }

    if ( bal_target == target ) {
        return;
    }

    arc = edge_array[bal_target].arcs;

    while ( arc ) {
        arc->bal_arc->to_ed = target;
        arc = arc->next;
    }
}

//squeeze deleted edges out of edge_array, keeping each edge next to its twin
void compactEdgeArray()
{
    unsigned int i;
    unsigned int validCounter = 0;
    unsigned int bal_ed;
    printf ( "there are %u edges\n", num_ed );

    for ( i = 1; i <= num_ed; i++ ) {
        if ( edge_array[i].deleted ) {
            continue;
        }

        validCounter++;

        if ( i == validCounter ) {
            continue;
        }

        bal_ed = getTwinEdge ( i );
        edgeMove ( i, validCounter );

        if ( bal_ed != i ) {
            i++;
            validCounter++;
        }
    }

    num_ed = validCounter;
    printf ( "after compacting %u edges left\n", num_ed );
}

/*
 * 127mer/concatenateEdge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.  If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//copy length bases of the packed sequence sourceS into targetS, starting at position pos of targetS
void copySeq ( char * targetS, char * sourceS, int pos, int length )
{
    char ch;
    int i, index;
    index = pos;

    for ( i = 0; i < length; i++ ) {
        ch = getCharInTightString ( sourceS, i );
        writeChar2tightString ( ch, targetS, index++ );
    }
}

//a path from e1 to e2 is merged into e1 (indicate=0) or e2 (indicate=1); update graph topology
void linearUpdateConnection ( unsigned int e1, unsigned int e2, int indicate )
{
    unsigned int bal_ed;
    ARC * parc;

    if ( !indicate ) {
        edge_array[e1].to_vt = edge_array[e2].to_vt;
        bal_ed = getTwinEdge ( e1 );
        parc = edge_array[e2].arcs;

        while ( parc ) {
            parc->bal_arc->to_ed = bal_ed;
            parc = parc->next;
        }

        edge_array[e1].arcs = edge_array[e2].arcs;
        edge_array[e2].arcs = NULL;

        if ( edge_array[e1].length || edge_array[e2].length )
            edge_array[e1].cvg = ( edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length )
                                 / ( edge_array[e1].length + edge_array[e2].length );

        edge_array[e2].deleted = 1;
    } else {
        //all the arcs pointing to e1 switch to e2
        parc = edge_array[getTwinEdge ( e1 )].arcs;

        while ( parc ) {
            parc->bal_arc->to_ed = e2;
            parc = parc->next;
        }

        edge_array[e1].arcs = NULL;
        edge_array[e2].from_vt = edge_array[e1].from_vt;

        if ( edge_array[e1].length || edge_array[e2].length )
            edge_array[e2].cvg = ( edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length )
                                 / ( edge_array[e1].length + edge_array[e2].length );

        edge_array[e1].deleted = 1;
    }
}

static void printEdgeSeq ( FILE * fp, char * tightSeq, int len )
{
    int i;

    for ( i = 0; i < len; i++ ) {
        fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( tightSeq, i ) ) );

        if ( ( i + overlaplen + 1 ) % 100 == 0 ) {
            fprintf ( fp, "\n" );
        }
    }

    fprintf ( fp, "\n" );
}

void allpathUpdateEdge ( unsigned int e1, unsigned int e2, int indicate )
{
    int tightLen;
    char * tightSeq = NULL;

    if ( edge_array[e1].cvg == 0 ) {
        edge_array[e1].cvg = edge_array[e2].cvg;
    }

    if ( edge_array[e2].cvg == 0 ) {
        edge_array[e2].cvg = edge_array[e1].cvg;
    }

    /*
    if(edge_array[e1].length&&edge_array[e2].length){
        fprintf(stderr,">e1\n");
        printEdgeSeq(stderr,edge_array[e1].seq,edge_array[e1].length);
        fprintf(stderr,">e2\n");
        printEdgeSeq(stderr,edge_array[e2].seq,edge_array[e2].length);
    }
    */
    unsigned int cvgsum = edge_array[e1].cvg * edge_array[e1].length + edge_array[e2].cvg * edge_array[e2].length;
    tightLen = edge_array[e1].length + edge_array[e2].length;

    if ( tightLen ) {
        tightSeq = ( char * ) ckalloc ( ( tightLen / 4 + 1 ) * sizeof ( char ) );
    }

    tightLen = 0;

    if ( edge_array[e1].length ) {
        copySeq ( tightSeq, edge_array[e1].seq, 0, edge_array[e1].length );
        tightLen = edge_array[e1].length;

        if ( edge_array[e1].seq ) {
            free ( ( void * ) edge_array[e1].seq );
            edge_array[e1].seq = NULL;
        } else {
            printf ( "allpathUpdateEdge: edge %d with length %d, but without seq\n", e1, edge_array[e1].length );
        }
    }

    if ( edge_array[e2].length ) {
        copySeq ( tightSeq, edge_array[e2].seq, tightLen, edge_array[e2].length );
        tightLen += edge_array[e2].length;

        if ( edge_array[e2].seq ) {
            free ( ( void * ) edge_array[e2].seq );
            edge_array[e2].seq = NULL;
        } else {
            printf ( "allpathUpdateEdge: edge %d with length %d, but without seq\n", e2, edge_array[e2].length );
        }
    }

    /*
    if(edge_array[e1].length&&edge_array[e2].length){
        fprintf(stderr,">e1+e2\n");
        printEdgeSeq(stderr,tightSeq,tightLen);
    }
    */

    //edge_array[e2].extend_len = tightLen-edge_array[e2].length;
    //the sequence of e1 is to be updated
    if ( !indicate ) {
        edge_array[e2].length = 0;    //e2 is removed from the graph
        edge_array[e1].to_vt = edge_array[e2].to_vt;    //e2 is part of e1 now
        edge_array[e1].length = tightLen;
        edge_array[e1].seq = tightSeq;

        if ( tightLen ) {
            edge_array[e1].cvg = cvgsum / tightLen;
        }

        edge_array[e1].cvg = edge_array[e1].cvg > 0 ? edge_array[e1].cvg : 1;
    } else {
        edge_array[e1].length = 0;    //e1 is removed from the graph
        edge_array[e2].from_vt = edge_array[e1].from_vt;    //e1 is part of e2 now
        edge_array[e2].length = tightLen;
        edge_array[e2].seq = tightSeq;

        if ( tightLen ) {
            edge_array[e2].cvg = cvgsum / tightLen;
        }

        edge_array[e2].cvg = edge_array[e2].cvg > 0 ? edge_array[e2].cvg : 1;
    }
}

static void debugging ( unsigned int i )
{
    ARC * parc;
    parc = edge_array[i].arcs;

    if ( !parc ) {
        printf ( "no downward connection for %d\n", i );
    }

    while ( parc ) {
        printf ( "%d -> %d\n", i, parc->to_ed );
        parc = parc->next;
    }
}

//concatenate two edges if they are linearly linked
void linearConcatenate()
{
    unsigned int i;
    int conc_c = 1;
    int counter;
    unsigned int from_ed, to_ed, bal_ed;
    ARC * parc, *parc2;
    unsigned int bal_fe;

    //debugging(30514);
    while ( conc_c ) {
        conc_c = 0;
        counter = 0;

        for ( i = 1; i <= num_ed; i++ ) { //num_ed
            if ( edge_array[i].deleted || EdSameAsTwin ( i ) ) {
                continue;
            }

            if ( edge_array[i].length > 0 ) {
                counter++;
            }

            parc = edge_array[i].arcs;

            if ( !parc || parc->next ) {
                continue;
            }

            to_ed = parc->to_ed;
            bal_ed = getTwinEdge ( to_ed );
            parc2 = edge_array[bal_ed].arcs;

            if ( bal_ed == to_ed || !parc2 || parc2->next ) {
                continue;
            }

            from_ed = i;

            if ( from_ed == to_ed || from_ed == bal_ed ) {
                continue;
            }

            //linear connection found
            conc_c++;
            linearUpdateConnection ( from_ed, to_ed, 0 );
            allpathUpdateEdge ( from_ed, to_ed, 0 );
            bal_fe = getTwinEdge ( from_ed );
            linearUpdateConnection ( bal_ed, bal_fe, 1 );
            allpathUpdateEdge ( bal_ed, bal_fe, 1 );
            /*
            if(from_ed==6589||to_ed==6589)
                printf("%d <- %d (%d)\n",from_ed,to_ed,i);
            if(bal_fe==6589||bal_ed==6589)
                printf("%d <- %d (%d)\n",bal_fe,bal_ed,i);
            */
        }

        printf ( "a linear concatenation lap, %d concatenated\n", conc_c );
    }

    printf ( "%d edges in graph\n", counter );
}

/*
 * 127mer/connect.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.  If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define CNBLOCKSIZE 100000

void createCntMemManager()
{
    if ( !cn_mem_manager ) {
        cn_mem_manager = createMem_manager ( CNBLOCKSIZE, sizeof ( CONNECT ) );
    } else {
        printf ( "cn_mem_manager was already created\n" );
    }
}

void destroyConnectMem()
{
    freeMem_manager ( cn_mem_manager );
    cn_mem_manager = NULL;
}

CONNECT * allocateCN ( unsigned int contigId, int gap )
{
    CONNECT * newCN;
    newCN = ( CONNECT * ) getItem ( cn_mem_manager );
    newCN->contigID = contigId;
    newCN->gapLen = gap;
    newCN->minGap = 0;
    newCN->maxGap = 0;
    newCN->bySmall = 0;
    newCN->weakPoint = 0;
    newCN->weight = 1;
    newCN->weightNotInherit = 0;
    newCN->mask = 0;
    newCN->used = 0;
    newCN->checking = 0;
    newCN->deleted = 0;
    newCN->prevInScaf = 0;
    newCN->inherit = 0;
    newCN->singleInScaf = 0;
    newCN->nextInScaf = NULL;
    return newCN;
}

void output_cntGVZ ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i;
    CONNECT * connect;
    boolean flag;
    //bounded write: outfile may be longer than the 256-byte buffer
    snprintf ( name, sizeof ( name ), "%s.scaffold.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- ) {
        //if(contig_array[i].mask||!contig_array[i].downwardConnect)
        if ( !contig_array[i].downwardConnect ) {
            continue;
        }

        connect = contig_array[i].downwardConnect;

        while ( connect ) {
            //if(connect->mask||connect->deleted){
            if ( connect->deleted ) {
                connect = connect->next;
                continue;
            }

            if ( connect->prevInScaf || connect->nextInScaf ) {
                flag = 1;
            } else {
                flag = 0;
            }

            if ( !connect->mask )
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\"];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );
            else
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\", color = red];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );

            connect = connect->next;
        }
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

/***************** everything below this line concerns the lookup table *****************/

void createCntLookupTable()
{
    if ( !cntLookupTable ) {
        cntLookupTable = ( CONNECT ** ) ckalloc ( ( 3 * num_ctg + 1 ) * sizeof ( CONNECT * ) );
    }
}

void deleteCntLookupTable()
{
    if ( cntLookupTable ) {
        free ( ( void * ) cntLookupTable );
        cntLookupTable = NULL;
    }
}

void putCnt2LookupTable ( unsigned int from_c, CONNECT * cnt )
{
    if ( !cnt || !cntLookupTable ) {
        return;
    }

    //bucket index; at most 3 * num_ctg, which is why the table has 3 * num_ctg + 1 slots
    unsigned int index = 2 * from_c + cnt->contigID;
    cnt->nextInLookupTable = cntLookupTable[index];
    cntLookupTable[index] = cnt;
}

static CONNECT * getCntInLookupTable ( unsigned int from_c, unsigned int to_c )
{
    unsigned int index = 2 * from_c + to_c;
    CONNECT * ite_cnt = cntLookupTable[index];

    while ( ite_cnt ) {
        if ( ite_cnt->contigID == to_c ) {
            return ite_cnt;
        }

        ite_cnt = ite_cnt->nextInLookupTable;
    }

    return NULL;
}

CONNECT * getCntBetween ( unsigned int from_c, unsigned int to_c )
{
    CONNECT * pcnt;

    if ( cntLookupTable ) {
        pcnt = getCntInLookupTable ( from_c, to_c );
        return pcnt;
    }

    pcnt = contig_array[from_c].downwardConnect;

    while ( pcnt ) {
        if ( pcnt->contigID == to_c ) {
            return pcnt;
        }

        pcnt = pcnt->next;
    }

    return pcnt;
}

/*
void removeCntInLookupTable(unsigned int from_c,unsigned int to_c)
{
    unsigned int index = 2*from_c + to_c;
    CONNECT *ite_cnt = cntLookupTable[index];
    CONNECT *cnt;
    if(!ite_cnt){
        printf("removeCntInLookupTable: not found A\n");
        return;
    }
    if(ite_cnt->contigID==to_c){
        cntLookupTable[index] = ite_cnt->nextInLookupTable;
        return;
    }
    while(ite_cnt->nextInLookupTable&&ite_cnt->nextInLookupTable->contigID!=to_c)
        ite_cnt = ite_cnt->nextInLookupTable;
    if(ite_cnt->nextInLookupTable){
        cnt = ite_cnt->nextInLookupTable;
        ite_cnt->nextInLookupTable = cnt->nextInLookupTable;
        return;
    }
    printf("removeCntInLookupTable: not found B\n");
    return;
}
*/

/*
 * 127mer/contig.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.  If not, see <http://www.gnu.org/licenses/>.
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static void display_contig_usage(); char shortrdsfile[256], graphfile[256]; static boolean repeatSolve; static int M = 1; int call_heavygraph ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); boolean ret; initenv ( argc, argv ); loadVertex ( graphfile ); loadEdge ( graphfile ); if ( repeatSolve ) { time ( &time_bef ); ret = loadPathBin ( graphfile ); if ( ret ) { solveReps(); } else { printf ( "repeat solving can't be done...\n" ); } time ( &time_aft ); printf ( "time spent on solving repeat: %ds\n", ( int ) ( time_aft - time_bef ) ); } //edgecvg_bar(edge_array,num_ed,graphfile,100); //0531 if ( M > 0 ) { time ( &time_bef ); bubblePinch ( 0.90, graphfile, M ); time ( &time_aft ); printf ( "time spent on bubblePinch: %ds\n", ( int ) ( time_aft - time_bef ) ); } removeWeakEdges ( 2 * overlaplen, 1 ); removeLowCovEdges ( 2 * overlaplen, 1 ); cutTipsInGraph ( 0, 0 ); //output_graph(graphfile); output_contig ( edge_array, num_ed, graphfile, overlaplen + 1 ); output_updated_edges ( graphfile ); output_heavyArcs ( graphfile ); if ( vt_array ) { free ( ( void * ) vt_array ); vt_array = NULL; } if ( edge_array ) { free_edge_array ( edge_array, num_ed_limit ); edge_array = NULL; } destroyArcMem(); time ( &stop_t ); printf ( "time elapsed: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 ); return 0; } /***************************************************************************** * Parse command line switches *****************************************************************************/ void initenv ( int argc, char ** argv ) { int copt; int inpseq, outseq; extern char * optarg; char temp[100]; inpseq = outseq = repeatSolve = 0; optind = 1; while ( ( copt = getopt ( argc, argv, "g:M:R" ) ) != EOF ) { switch ( copt ) { case 'M': sscanf ( optarg, "%s", temp ); // M = atoi ( temp ); continue; case 'g': inGraph 
= 1;
                /* graphfile is declared as char[256]; bound the copy to avoid overflow */
                sscanf ( optarg, "%255s", graphfile );
                continue;
            case 'R':
                repeatSolve = 1;
                continue;
            default:

                if ( inGraph == 0 )
                {
                    display_contig_usage();
                    exit ( -1 );
                }
        }
    }

    if ( inGraph == 0 )
    {
        display_contig_usage();
        exit ( -1 );
    }
}

static void display_contig_usage()
{
    printf ( "\ncontig -g InputGraph [-M mergeLevel -R]\n" );
    printf ( " -g InputFile: prefix of graph file names\n" );
    printf ( " -M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during contiging\n" );
    printf ( " -R solve_repeats (optional): solve repeats by read paths(default: no)\n" );
}
SOAPdenovo-V1.05/src/127mer/cutTip_graph.c000644 000765 000024 00000016724 11530651532 020210 0ustar00Aquastaff000000 000000
/*
 * 127mer/cutTip_graph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int caseA, caseB, caseC, caseD, caseE; void destroyEdge ( unsigned int edgeid ) { unsigned int bal_ed = getTwinEdge ( edgeid ); ARC * arc; if ( bal_ed == edgeid ) { edge_array[edgeid].length = 0; return; } arc = edge_array[edgeid].arcs; while ( arc ) { arc->bal_arc->to_ed = 0; arc = arc->next; } arc = edge_array[bal_ed].arcs; while ( arc ) { arc->bal_arc->to_ed = 0; arc = arc->next; } edge_array[edgeid].arcs = NULL; edge_array[bal_ed].arcs = NULL; edge_array[edgeid].length = 0; edge_array[bal_ed].length = 0; edge_array[edgeid].deleted = 1; edge_array[bal_ed].deleted = 1; //printf("Destroyed %d and %d\n",edgeid,bal_ed); } ARC * arcCount ( unsigned int edgeid, unsigned int * num ) { ARC * arc; ARC * firstValidArc = NULL; unsigned int count = 0; arc = edge_array[edgeid].arcs; while ( arc ) { if ( arc->to_ed > 0 ) { count++; if ( count == 1 ) { firstValidArc = arc; } else if ( count > 1 ) { *num = count; return firstValidArc; } } arc = arc->next; } *num = count; return firstValidArc; } void removeWeakEdges ( int lenCutoff, unsigned int multiCutoff ) { unsigned int bal_ed; unsigned int arcRight_n, arcLeft_n; ARC * arcLeft, *arcRight; unsigned int i; int counter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted || edge_array[i].length > lenCutoff || EdSameAsTwin ( i ) ) { continue; } bal_ed = getTwinEdge ( i ); arcRight = arcCount ( i, &arcRight_n ); if ( arcRight_n > 1 || !arcRight || arcRight->multiplicity > multiCutoff ) { continue; } arcLeft = arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 1 || !arcLeft || arcLeft->multiplicity > multiCutoff ) { continue; } destroyEdge ( i ); counter++; } printf ( "%d weak inner edges destroyed\n", counter ); removeDeadArcs(); /* linearConcatenate(); compactEdgeArray(); */ } void removeLowCovEdges ( int lenCutoff, unsigned short covCutoff ) { unsigned int bal_ed; unsigned int arcRight_n, arcLeft_n; ARC * arcLeft, 
*arcRight; unsigned int i; int counter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted || edge_array[i].cvg == 0 || edge_array[i].cvg > covCutoff * 10 || edge_array[i].length >= lenCutoff || EdSameAsTwin ( i ) || edge_array[i].length == 0 ) { continue; } bal_ed = getTwinEdge ( i ); arcRight = arcCount ( i, &arcRight_n ); arcLeft = arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n < 1 || arcRight_n < 1 ) { continue; } destroyEdge ( i ); counter++; } printf ( "Remove low coverage(%d): %d inner edges destroyed\n", covCutoff, counter ); removeDeadArcs(); linearConcatenate(); compactEdgeArray(); } boolean isUnreliableTip ( unsigned int edgeid, int cutLen, boolean strict ) { unsigned int arcRight_n, arcLeft_n; unsigned int bal_ed; unsigned int currentEd = edgeid; int length = 0; unsigned int mult = 0; ARC * arc, *activeArc = NULL, *tempArc; if ( edgeid == 0 ) { return 0; } bal_ed = getTwinEdge ( edgeid ); if ( bal_ed == edgeid ) { return 0; } arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 0 ) { return 0; } while ( currentEd ) { arcCount ( bal_ed, &arcLeft_n ); tempArc = arcCount ( currentEd, &arcRight_n ); if ( arcLeft_n > 1 || arcRight_n > 1 ) { break; } length += edge_array[currentEd].length; if ( tempArc ) { activeArc = tempArc; currentEd = activeArc->to_ed; bal_ed = getTwinEdge ( currentEd ); } else { currentEd = 0; } } if ( length >= cutLen ) { return 0; } if ( currentEd == 0 ) { caseB++; return 1; } if ( !strict ) { if ( arcLeft_n < 2 ) { length += edge_array[currentEd].length; } if ( length >= cutLen ) { return 0; } else { caseC++; return 1; } } if ( arcLeft_n < 2 ) { return 0; } if ( !activeArc ) { printf ( "no activeArc while checking edge %d\n", edgeid ); } if ( activeArc->multiplicity == 1 ) { caseD++; return 1; } for ( arc = edge_array[bal_ed].arcs; arc != NULL; arc = arc->next ) if ( arc->multiplicity > mult ) { mult = arc->multiplicity; } if ( mult > activeArc->multiplicity ) { caseE++; } return mult > activeArc->multiplicity; } boolean 
isUnreliableTip_strict ( unsigned int edgeid, int cutLen ) { unsigned int arcRight_n, arcLeft_n; unsigned int bal_ed; unsigned int currentEd = edgeid; int length = 0; unsigned int mult = 0; ARC * arc, *activeArc = NULL, *tempArc; if ( edgeid == 0 ) { return 0; } bal_ed = getTwinEdge ( edgeid ); if ( bal_ed == edgeid ) { return 0; } arcCount ( bal_ed, &arcLeft_n ); if ( arcLeft_n > 0 ) { return 0; } while ( currentEd ) { arcCount ( bal_ed, &arcLeft_n ); tempArc = arcCount ( currentEd, &arcRight_n ); if ( arcLeft_n > 1 || arcRight_n > 1 ) { if ( arcLeft_n == 0 || length == 0 ) { return 0; } else { break; } } length += edge_array[currentEd].length; if ( length >= cutLen ) { return 0; } if ( tempArc ) { activeArc = tempArc; currentEd = activeArc->to_ed; bal_ed = getTwinEdge ( currentEd ); } else { currentEd = 0; } } if ( currentEd == 0 ) { caseA++; return 1; } if ( !activeArc ) { printf ( "no activeArc while checking edge %d\n", edgeid ); } if ( activeArc->multiplicity == 1 ) { caseB++; return 1; } for ( arc = edge_array[bal_ed].arcs; arc != NULL; arc = arc->next ) if ( arc->multiplicity > mult ) { mult = arc->multiplicity; } if ( mult > activeArc->multiplicity ) { caseC++; } return mult > activeArc->multiplicity; } void removeDeadArcs() { unsigned int i, count = 0; ARC * arc, *arc_temp; for ( i = 1; i <= num_ed; i++ ) { arc = edge_array[i].arcs; while ( arc ) { arc_temp = arc; arc = arc->next; if ( arc_temp->to_ed == 0 ) { count++; edge_array[i].arcs = deleteArc ( edge_array[i].arcs, arc_temp ); } } } printf ( "%d dead arcs removed\n", count ); } void cutTipsInGraph ( int cutLen, boolean strict ) { int flag = 1; unsigned int i; if ( !cutLen ) { cutLen = 2 * overlaplen; } //if(cutLen > 100) cutLen = 100; printf ( "strict %d, cutLen %d\n", strict, cutLen ); if ( strict ) { linearConcatenate(); } caseA = caseB = caseC = caseD = caseE = 0; while ( flag ) { flag = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].deleted ) { continue; } if ( isUnreliableTip ( i, 
cutLen, strict ) )
            {
                destroyEdge ( i );
                flag++;
            }
        }

        printf ( "a cutTipsInGraph lap, %d tips cut\n", flag );
    }

    removeDeadArcs();

    if ( strict )
    {
        printf ( "case A %d, B %d C %d D %d E %d\n", caseA, caseB, caseC, caseD, caseE );
    }

    linearConcatenate();
    compactEdgeArray();
}
SOAPdenovo-V1.05/src/127mer/cutTipPreGraph.c000644 000765 000024 00000024641 11530651532 020455 0ustar00Aquastaff000000 000000
/*
 * 127mer/cutTipPreGraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int tip_c; static long long * linearCounter; static void Mark1in1outNode(); static void thread_mark ( KmerSet * set, unsigned char thrdID ); /* static void printKmer(Kmer kmer) { printKmerSeq(stdout,kmer); printf("\n"); } */ static int clipTipFromNode ( kmer_t * node1, int cut_len, boolean THIN ) { unsigned char ret = 0, in_num, out_num, link; int sum, count; kmer_t * out_node; Kmer tempKmer, pre_word, word, bal_word; ubyte8 hash_ban; char ch1, ch; boolean smaller, found; int setPicker; unsigned int max_links, singleCvg; if ( node1->linear || node1->deleted ) { return ret; } if ( THIN && !node1->single ) { return ret; } in_num = count_branch2prev ( node1 ); out_num = count_branch2next ( node1 ); if ( in_num == 0 && out_num == 1 ) { pre_word = node1->seq; for ( ch1 = 0; ch1 < 4; ch1++ ) { link = get_kmer_right_cov ( *node1, ch1 ); if ( link ) { break; } } word = nextKmer ( pre_word, ch1 ); } else if ( in_num == 1 && out_num == 0 ) { pre_word = reverseComplement ( node1->seq, overlaplen ); for ( ch1 = 0; ch1 < 4; ch1++ ) { link = get_kmer_left_cov ( *node1, ch1 ); if ( link ) { break; } } word = nextKmer ( pre_word, int_comp ( ch1 ) ); } else { return ret; } count = 1; bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx%llx%llx%llx not found, node1 %llx%llx%llx%llx\n", word.high1, word.low1, word.high2, word.low2, node1->seq.high1, node1->seq.low1, node1->seq.high2, node1->seq.low2 ); exit ( 1 ); } while ( out_node->linear ) { count++; if ( THIN && !out_node->single ) { break; } if ( count > cut_len ) { return ret; } if ( smaller ) { pre_word = word; for ( ch = 0; ch < 4; ch++ ) { link 
= get_kmer_right_cov ( *out_node, ch ); if ( link ) { break; } } word = nextKmer ( pre_word, ch ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx%llx%llx%llx not found, node1 %llx%llx%llx%llx\n", word.high1, word.low1, word.high2, word.low2, node1->seq.high1, node1->seq.low1, node1->seq.high2, node1->seq.low2 ); printf ( "pre_word %llx%llx%llx%llx with %d(smaller)\n", pre_word.high1, pre_word.low1, pre_word.high2, pre_word.low2, ch ); exit ( 1 ); } } else { pre_word = bal_word; for ( ch = 0; ch < 4; ch++ ) { link = get_kmer_left_cov ( *out_node, ch ); if ( link ) { break; } } word = nextKmer ( pre_word, int_comp ( ch ) ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &out_node ); if ( !found ) { printf ( "kmer %llx%llx%llx%llx not found, node1 %llx%llx%llx%llx\n", word.high1, word.low1, word.high2, word.low2, node1->seq.high1, node1->seq.low1, node1->seq.high2, node1->seq.low2 ); printf ( "pre_word %llx%llx%llx%llx with %d(larger)\n", reverseComplement ( pre_word, overlaplen ).high1, reverseComplement ( pre_word, overlaplen ).low1, reverseComplement ( pre_word, overlaplen ).high2, reverseComplement ( pre_word, overlaplen ).low2, int_comp ( ch ) ); exit ( 1 ); } } } if ( ( sum = count_branch2next ( out_node ) + count_branch2prev ( out_node ) ) == 1 ) { tip_c++; node1->deleted = 1; out_node->deleted = 1; return 1; } else { ch = firstCharInKmer ( pre_word ); if ( THIN ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( 
out_node, ch, smaller ); out_node->linear = 0; return 1; } // make sure this tip doesn't provide most links to out_node max_links = 0; for ( ch1 = 0; ch1 < 4; ch1++ ) { if ( smaller ) { singleCvg = get_kmer_left_cov ( *out_node, ch1 ); if ( singleCvg > max_links ) { max_links = singleCvg; } } else { singleCvg = get_kmer_right_cov ( *out_node, ch1 ); if ( singleCvg > max_links ) { max_links = singleCvg; } } } if ( smaller && get_kmer_left_cov ( *out_node, ch ) < max_links ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( out_node, ch, smaller ); if ( count_branch2prev ( out_node ) == 1 && count_branch2next ( out_node ) == 1 ) { out_node->linear = 1; } return 1; } if ( !smaller && get_kmer_right_cov ( *out_node, int_comp ( ch ) ) < max_links ) { tip_c++; node1->deleted = 1; dislink2prevUncertain ( out_node, ch, smaller ); if ( count_branch2prev ( out_node ) == 1 && count_branch2next ( out_node ) == 1 ) { out_node->linear = 1; } return 1; } } return 0; } void removeSingleTips() { int i, flag = 0, cut_len_tip; kmer_t * rs; KmerSet * set; //count_ends(hash_table); cut_len_tip = 2 * overlaplen; // >= maxReadLen4all-overlaplen+1 ? 2*overlaplen : maxReadLen4all-overlaplen+1; //if(cut_len_tip > 100) cut_len_tip = 100; printf ( "Start to remove tips of single frequency kmers short than %d\n", cut_len_tip ); tip_c = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; flag += clipTipFromNode ( rs, cut_len_tip, 1 ); } set->iter_ptr ++; } } printf ( "%d tips off\n", tip_c ); Mark1in1outNode(); } void removeMinorTips() { int i, flag = 0, cut_len_tip; kmer_t * rs; KmerSet * set; //count_ends(hash_table); //cut_len_tip = 2*overlaplen >= maxReadLen4all-overlaplen+1 ? 
2*overlaplen : maxReadLen4all-overlaplen+1; cut_len_tip = 2 * overlaplen; //if(cut_len_tip > 100) cut_len_tip = 100; printf ( "Start to remove tips which don't contribute the most links\n" ); tip_c = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; flag = 1; while ( flag ) { flag = 0; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; flag += clipTipFromNode ( rs, cut_len_tip, 0 ); } set->iter_ptr ++; } } printf ( "kmer set %d done\n", i ); } printf ( "%d tips off\n", tip_c ); Mark1in1outNode(); } static void threadRoutine ( void * para ) { PARAMETER * prm; unsigned char id; prm = ( PARAMETER * ) para; id = prm->threadID; //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 2 ) { * ( prm->selfSignal ) = 0; break; } else if ( * ( prm->selfSignal ) == 1 ) { thread_mark ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } usleep ( 1 ); } } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_mark ( KmerSet * set, unsigned char thrdID ) { int in_num, out_num; kmer_t * rs; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; if ( rs->deleted || rs->linear ) { set->iter_ptr ++; continue;; } in_num = count_branch2prev ( rs ); out_num = count_branch2next ( rs ); if ( in_num == 1 && out_num == 1 ) { rs->linear = 1; linearCounter[thrdID]++; } } set->iter_ptr ++; } //printf("%lld more linear\n",linearCounter[thrdID]); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; 
i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void Mark1in1outNode() { int i; long long counter = 0; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; for ( i = 0; i < thrd_num; i++ ) { thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); thrdSignal[0] = 0; linearCounter = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) ); for ( i = 0; i < thrd_num; i++ ) { linearCounter[i] = 0; } sendWorkSignal ( 1, thrdSignal ); //mark linear nodes sendWorkSignal ( 2, thrdSignal ); thread_wait ( threads ); for ( i = 0; i < thrd_num; i++ ) { counter += linearCounter[i]; } free ( ( void * ) linearCounter ); printf ( "%lld linear nodes\n", counter ); } SOAPdenovo-V1.05/src/127mer/darray.c000644 000765 000024 00000004255 11530651532 017035 0ustar00Aquastaff000000 000000 /* * 127mer/darray.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
 *
 */

#include "darray.h"
#include "check.h"

DARRAY * createDarray ( int num_items, size_t unit_size )
{
    DARRAY * newDarray = ( DARRAY * ) malloc ( 1 * sizeof ( DARRAY ) );
    newDarray->array_size = num_items;
    newDarray->item_size = unit_size;
    newDarray->item_c = 0;
    newDarray->array = ( void * ) ckalloc ( num_items * unit_size );
    return newDarray;
}

void * darrayPut ( DARRAY * darray, long long index )
{
    int i = 2;

    if ( index + 1 > darray->item_c )
    {
        darray->item_c = index + 1;
    }

    if ( index < darray->array_size )
    {
        return darray->array + darray->item_size * index;
    }

    /* Grow by the smallest factor i such that index < i * array_size.
     * The test must be >=, not >: with > an index equal to i * array_size
     * would land one slot past the end of the reallocated buffer. */
    while ( index >= i * darray->array_size )
    {
        i++;
    }

    darray->array = ( void * ) ckrealloc ( darray->array, i * darray->array_size * darray->item_size
                                           , darray->array_size * darray->item_size );
    darray->array_size *= i;
    return ( void * ) ( ( void * ) darray->array + darray->item_size * index );
}

void * darrayGet ( DARRAY * darray, long long index )
{
    if ( index < darray->array_size )
    {
        return ( void * ) ( ( void * ) darray->array + darray->item_size * index );
    }

    printf ( "array read index %lld out of range %lld\n", index, darray->array_size );
    return NULL;
}

void emptyDarray ( DARRAY * darray )
{
    darray->item_c = 0;
}

void freeDarray ( DARRAY * darray )
{
    if ( !darray )
    {
        return;
    }

    if ( darray->array )
    {
        free ( ( void * ) darray->array );
    }

    free ( ( void * ) darray );
}
SOAPdenovo-V1.05/src/127mer/dfib.c000644 000765 000024 00000026445 11530651532 016464 0ustar00Aquastaff000000 000000
/*
 * 127mer/dfib.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
 *
 * $Id: dfib.c,v 1.12 2007/10/19 13:09:26 zerbino Exp $
 *
 */

#include <limits.h>  /* INT_MIN, used by dfh_delete */
#include <stdlib.h>  /* malloc, realloc, free, abort */
#include "dfib.h"
#include "dfibpriv.h"
#include "extfunc2.h"

#define HEAPBLOCKSIZE 1000

static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b );

static DFibHeapNode * allocateDFibHeapNode ( DFibHeap * heap )
{
    return ( DFibHeapNode * ) getItem ( heap->nodeMemory );
}

static void deallocateDFibHeapNode ( DFibHeapNode * a, DFibHeap * heap )
{
    returnItem ( heap->nodeMemory, a );
}

IDnum dfibheap_getSize ( DFibHeap * heap )
{
    return heap->dfh_n;
}

#define swap(type, a, b) \
    do { \
        type c; \
        c = a; \
        a = b; \
        b = c; \
    } while (0)

#define INT_BITS (sizeof(IDnum) * 8)

/* Returns the ceiling of log2(a), found by binary search over the bit positions. */
static inline IDnum ceillog2 ( IDnum a )
{
    IDnum oa;
    IDnum i;
    IDnum b;
    IDnum cons;
    oa = a;
    b = INT_BITS / 2;
    i = 0;

    while ( b )
    {
        i = ( i << 1 );
        cons = ( ( IDnum ) 1 ) << b;

        if ( a >= cons )
        {
            a /= cons;
            i = i | 1;
        }
        else
        {
            a &= cons - 1;
        }

        b /= 2;
    }

    if ( ( ( ( IDnum ) 1 << i ) ) == oa )
    {
        return i;
    }
    else
    {
        return i + 1;
    }
}

/*
 * Public Heap Functions
 */
DFibHeap * dfh_makekeyheap()
{
    DFibHeap * n;

    if ( ( n = malloc ( sizeof * n ) ) == NULL )
    {
        return NULL;
    }

    n->nodeMemory = createMem_manager ( HEAPBLOCKSIZE, sizeof ( DFibHeapNode ) );
    n->dfh_n = 0;
    n->dfh_Dl = -1;
    n->dfh_cons = NULL;
    n->dfh_min = NULL;
    n->dfh_root = NULL;
    return n;
}

void dfh_deleteheap ( DFibHeap * h )
{
    printf ( "DFibHeap: %lld Nodes allocated\n", h->nodeMemory->counter );
    freeMem_manager ( h->nodeMemory );
    h->nodeMemory = NULL;

    if ( h->dfh_cons != NULL )
    {
        free ( h->dfh_cons );
    }

    free ( h );
}

/*
 * Public Key Heap Functions
 */
DFibHeapNode * dfh_insertkey ( DFibHeap * h, Time key, unsigned int data )
{
    DFibHeapNode * x;

    if ( ( x = dfhe_newelem ( h ) ) == NULL )
    {
        return NULL;
    }

    /* just insert on root list, and make sure it's not the new min */
    x->dfhe_data = data;
    x->dfhe_key = key;
    dfh_insertel ( h, x );
    return x;
}

Time dfh_replacekey ( DFibHeap * h, DFibHeapNode * x, Time key )
{
    Time ret;
    ret = x->dfhe_key;
    ( void
) dfh_replacekeydata ( h, x, key, x->dfhe_data ); return ret; } unsigned int minInDHeap ( DFibHeap * h ) { if ( h->dfh_min ) { return h->dfh_min->dfhe_data; } else { return 0; } } boolean HasMin ( DFibHeap * h ) { if ( h->dfh_min ) { return 1; } else { return 0; } } unsigned int dfh_replacekeydata ( DFibHeap * h, DFibHeapNode * x, Time key, unsigned int data ) { unsigned int odata; Time okey; DFibHeapNode * y; int r; odata = x->dfhe_data; okey = x->dfhe_key; /* * we can increase a key by deleting and reinserting, that * requires O(lgn) time. */ if ( ( r = dfh_comparedata ( h, key, data, x ) ) > 0 ) { /* XXX - bad code! */ abort(); } x->dfhe_data = data; x->dfhe_key = key; /* because they are equal, we don't have to do anything */ if ( r == 0 ) { return odata; } y = x->dfhe_p; if ( okey == key ) { return odata; } if ( y != NULL && dfh_compare ( h, x, y ) <= 0 ) { dfh_cut ( h, x, y ); dfh_cascading_cut ( h, y ); } /* * the = is so that the call from dfh_delete will delete the proper * element. 
*/ if ( dfh_compare ( h, x, h->dfh_min ) <= 0 ) { h->dfh_min = x; } return odata; } /* * Public void * Heap Functions */ /* * this will return these values: * NULL failed for some reason * ptr token to use for manipulation of data */ unsigned int dfh_extractmin ( DFibHeap * h ) { DFibHeapNode * z; unsigned int ret; ret = 0; if ( h->dfh_min != NULL ) { z = dfh_extractminel ( h ); ret = z->dfhe_data; deallocateDFibHeapNode ( z, h ); } return ret; } unsigned int dfh_replacedata ( DFibHeapNode * x, unsigned int data ) { unsigned int odata = x->dfhe_data; //printf("replace node value %d with %d\n",x->dfhe_data,data); x->dfhe_data = data; return odata; } unsigned int dfh_delete ( DFibHeap * h, DFibHeapNode * x ) { unsigned int k; //printf("destroy node %d in dheap\n",x->dfhe_data); k = x->dfhe_data; dfh_replacekey ( h, x, INT_MIN ); dfh_extractmin ( h ); return k; } /* * begin of private element fuctions */ static DFibHeapNode * dfh_extractminel ( DFibHeap * h ) { DFibHeapNode * ret; DFibHeapNode * x, *y, *orig; ret = h->dfh_min; orig = NULL; /* put all the children on the root list */ /* for true consistancy, we should use dfhe_remove */ for ( x = ret->dfhe_child; x != orig && x != NULL; ) { if ( orig == NULL ) { orig = x; } y = x->dfhe_right; x->dfhe_p = NULL; dfh_insertrootlist ( h, x ); x = y; } /* remove minimum from root list */ dfh_removerootlist ( h, ret ); h->dfh_n--; /* if we aren't empty, consolidate the heap */ if ( h->dfh_n == 0 ) { h->dfh_min = NULL; } else { h->dfh_min = ret->dfhe_right; dfh_consolidate ( h ); } return ret; } static void dfh_insertrootlist ( DFibHeap * h, DFibHeapNode * x ) { if ( h->dfh_root == NULL ) { h->dfh_root = x; x->dfhe_left = x; x->dfhe_right = x; return; } dfhe_insertafter ( h->dfh_root, x ); } static void dfh_removerootlist ( DFibHeap * h, DFibHeapNode * x ) { if ( x->dfhe_left == x ) { h->dfh_root = NULL; } else { h->dfh_root = dfhe_remove ( x ); } } static void dfh_consolidate ( DFibHeap * h ) { DFibHeapNode ** a; 
DFibHeapNode * w; DFibHeapNode * y; DFibHeapNode * x; IDnum i; IDnum d; IDnum D; dfh_checkcons ( h ); /* assign a the value of h->dfh_cons so I don't have to rewrite code */ D = h->dfh_Dl + 1; a = h->dfh_cons; for ( i = 0; i < D; i++ ) { a[i] = NULL; } while ( ( w = h->dfh_root ) != NULL ) { x = w; dfh_removerootlist ( h, w ); d = x->dfhe_degree; /* XXX - assert that d < D */ while ( a[d] != NULL ) { y = a[d]; if ( dfh_compare ( h, x, y ) > 0 ) { swap ( DFibHeapNode *, x, y ); } dfh_heaplink ( h, y, x ); a[d] = NULL; d++; } a[d] = x; } h->dfh_min = NULL; for ( i = 0; i < D; i++ ) if ( a[i] != NULL ) { dfh_insertrootlist ( h, a[i] ); if ( h->dfh_min == NULL || dfh_compare ( h, a[i], h->dfh_min ) < 0 ) { h->dfh_min = a[i]; } } } static void dfh_heaplink ( DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x ) { /* make y a child of x */ if ( x->dfhe_child == NULL ) { x->dfhe_child = y; } else { dfhe_insertbefore ( x->dfhe_child, y ); } y->dfhe_p = x; x->dfhe_degree++; y->dfhe_mark = 0; } static void dfh_cut ( DFibHeap * h, DFibHeapNode * x, DFibHeapNode * y ) { dfhe_remove ( x ); y->dfhe_degree--; dfh_insertrootlist ( h, x ); x->dfhe_p = NULL; x->dfhe_mark = 0; } static void dfh_cascading_cut ( DFibHeap * h, DFibHeapNode * y ) { DFibHeapNode * z; while ( ( z = y->dfhe_p ) != NULL ) { if ( y->dfhe_mark == 0 ) { y->dfhe_mark = 1; return; } else { dfh_cut ( h, y, z ); y = z; } } } /* * begining of handling elements of dfibheap */ static DFibHeapNode * dfhe_newelem ( DFibHeap * h ) { DFibHeapNode * e; if ( ( e = allocateDFibHeapNode ( h ) ) == NULL ) { return NULL; } e->dfhe_degree = 0; e->dfhe_mark = 0; e->dfhe_p = NULL; e->dfhe_child = NULL; e->dfhe_left = e; e->dfhe_right = e; e->dfhe_data = 0; return e; } static void dfhe_insertafter ( DFibHeapNode * a, DFibHeapNode * b ) { if ( a == a->dfhe_right ) { a->dfhe_right = b; a->dfhe_left = b; b->dfhe_right = a; b->dfhe_left = a; } else { b->dfhe_right = a->dfhe_right; a->dfhe_right->dfhe_left = b; a->dfhe_right = b; 
		b->dfhe_left = a;
	}
}

static inline void dfhe_insertbefore ( DFibHeapNode * a, DFibHeapNode * b )
{
	dfhe_insertafter ( a->dfhe_left, b );
}

static DFibHeapNode * dfhe_remove ( DFibHeapNode * x )
{
	DFibHeapNode * ret;

	if ( x == x->dfhe_left )
	{
		ret = NULL;
	}
	else
	{
		ret = x->dfhe_left;
	}

	/* fix the parent pointer */
	if ( x->dfhe_p != NULL && x->dfhe_p->dfhe_child == x )
	{
		x->dfhe_p->dfhe_child = ret;
	}

	x->dfhe_right->dfhe_left = x->dfhe_left;
	x->dfhe_left->dfhe_right = x->dfhe_right;
	/* clear out hanging pointers */
	x->dfhe_p = NULL;
	x->dfhe_left = x;
	x->dfhe_right = x;
	return ret;
}

static void dfh_checkcons ( DFibHeap * h )
{
	IDnum oDl;

	/* make sure we have enough memory allocated to "reorganize" */
	if ( h->dfh_Dl == -1 || h->dfh_n > ( 1 << h->dfh_Dl ) )
	{
		oDl = h->dfh_Dl;

		if ( ( h->dfh_Dl = ceillog2 ( h->dfh_n ) + 1 ) < 8 )
		{
			h->dfh_Dl = 8;
		}

		if ( oDl != h->dfh_Dl )
		{
			h->dfh_cons = ( DFibHeapNode ** ) realloc ( h->dfh_cons, sizeof * h->dfh_cons * ( h->dfh_Dl + 1 ) );
		}

		if ( h->dfh_cons == NULL )
		{
			abort();
		}
	}
}

static int dfh_compare ( DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b )
{
	if ( a->dfhe_key < b->dfhe_key )
	{
		return -1;
	}

	if ( a->dfhe_key == b->dfhe_key )
	{
		return 0;
	}

	return 1;
}

static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int data, DFibHeapNode * b )
{
	DFibHeapNode a;
	a.dfhe_key = key;
	a.dfhe_data = data;
	return dfh_compare ( h, &a, b );
}

static void dfh_insertel ( DFibHeap * h, DFibHeapNode * x )
{
	dfh_insertrootlist ( h, x );

	if ( h->dfh_min == NULL || x->dfhe_key < h->dfh_min->dfhe_key )
	{
		h->dfh_min = x;
	}

	h->dfh_n++;
}

Time dfibheap_el_getKey ( DFibHeapNode * node )
{
	return node->dfhe_key;
}

/*
 * 127mer/dfibHeap.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include <stdlib.h>
#include <stdio.h>
#include "def2.h"
#include "dfib.h"

// Return number of elements stored in heap
IDnum getDFibHeapSize ( DFibHeap * heap )
{
	return dfibheap_getSize ( heap );
}

// Constructor
// Memory allocated
DFibHeap * newDFibHeap()
{
	return dfh_makekeyheap();
}

// Add a new node to the heap with a key and the index of the specified node
DFibHeapNode * insertNodeIntoDHeap ( DFibHeap * heap, Time key, unsigned int node )
{
	DFibHeapNode * res;
	res = dfh_insertkey ( heap, key, node );
	return res;
}

// Replaces the key for a given node and returns the old key
Time replaceKeyInDHeap ( DFibHeap * heap, DFibHeapNode * node, Time newKey )
{
	Time res;
	res = dfh_replacekey ( heap, node, newKey );
	return res;
}

// Removes the node with the smallest key, then returns it.
unsigned int removeNextNodeFromDHeap ( DFibHeap * heap )
{
	unsigned int node;
	node = ( unsigned int ) dfh_extractmin ( heap );
	return node;
}

// Destructor
void destroyDHeap ( DFibHeap * heap )
{
	dfh_deleteheap ( heap );
}

// Replace the node pointed to by a heap node
void replaceValueInDHeap ( DFibHeapNode * node, unsigned int newValue )
{
	dfh_replacedata ( node, newValue );
}

// Remove unwanted node
void destroyNodeInDHeap ( DFibHeapNode * node, DFibHeap * heap )
{
	dfh_delete ( heap, node );
}

Time getKey ( DFibHeapNode * node )
{
	return dfibheap_el_getKey ( node );
}

/*
 * 127mer/fib.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

/*-
 * Copyright 1997-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $Id: fib.c,v 1.10 2007/10/19 13:09:26 zerbino Exp $ * */ #include #include #include "fib.h" #include "fibpriv.h" #include "extfunc2.h" #define HEAPBLOCKSIZE 10000 static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b ); unsigned int fh_replacekeydata ( FibHeap * h, FibHeapNode * x, Coordinate key, unsigned int data ); static FibHeapNode * allocateFibHeapEl ( FibHeap * heap ) { return ( FibHeapNode * ) getItem ( heap->nodeMemory ); }; static void deallocateFibHeapEl ( FibHeapNode * a, FibHeap * heap ) { returnItem ( heap->nodeMemory, a ); } #define swap(type, a, b) \ do { \ type c; \ c = a; \ a = b; \ b = c; \ } while (0) \ #define INT_BITS (sizeof(IDnum) * 8) static inline IDnum ceillog2 ( IDnum a ) { IDnum oa; IDnum i; IDnum b; IDnum cons; oa = a; b = INT_BITS / 2; i = 0; while ( b ) { i = ( i << 1 ); cons = ( ( IDnum ) 1 ) << b; if ( a >= cons ) { a /= cons; i = i | 1; } else { a &= cons - 1; } b /= 2; } if ( ( ( ( IDnum ) 1 << i ) ) == oa ) { return i; } else { return i + 1; } } /* * Private Heap Functions */ static void fh_initheap ( FibHeap * new ) { new->fh_cmp_fnct = NULL; new->nodeMemory = createMem_manager ( sizeof ( FibHeapNode ), HEAPBLOCKSIZE ); new->fh_neginf = 0; 
new->fh_n = 0; new->fh_Dl = -1; new->fh_cons = NULL; new->fh_min = NULL; new->fh_root = NULL; new->fh_keys = 0; } static void fh_destroyheap ( FibHeap * h ) { h->fh_cmp_fnct = NULL; h->fh_neginf = 0; if ( h->fh_cons != NULL ) { free ( h->fh_cons ); } h->fh_cons = NULL; free ( h ); } /* * Public Heap Functions */ FibHeap * fh_makekeyheap() { FibHeap * n; if ( ( n = malloc ( sizeof * n ) ) == NULL ) { return NULL; } fh_initheap ( n ); n->fh_keys = 1; return n; } FibHeap * fh_makeheap() { FibHeap * n; if ( ( n = malloc ( sizeof * n ) ) == NULL ) { return NULL; } fh_initheap ( n ); return n; } voidcmp fh_setcmp ( FibHeap * h, voidcmp fnct ) { voidcmp oldfnct; oldfnct = h->fh_cmp_fnct; h->fh_cmp_fnct = fnct; return oldfnct; } unsigned int fh_setneginf ( FibHeap * h, unsigned int data ) { unsigned int old; old = h->fh_neginf; h->fh_neginf = data; return old; } FibHeap * fh_union ( FibHeap * ha, FibHeap * hb ) { FibHeapNode * x; if ( ha->fh_root == NULL || hb->fh_root == NULL ) { /* either one or both are empty */ if ( ha->fh_root == NULL ) { fh_destroyheap ( ha ); return hb; } else { fh_destroyheap ( hb ); return ha; } } ha->fh_root->fhe_left->fhe_right = hb->fh_root; hb->fh_root->fhe_left->fhe_right = ha->fh_root; x = ha->fh_root->fhe_left; ha->fh_root->fhe_left = hb->fh_root->fhe_left; hb->fh_root->fhe_left = x; ha->fh_n += hb->fh_n; /* * we probably should also keep stats on number of unions */ /* set fh_min if necessary */ if ( fh_compare ( ha, hb->fh_min, ha->fh_min ) < 0 ) { ha->fh_min = hb->fh_min; } fh_destroyheap ( hb ); return ha; } void fh_deleteheap ( FibHeap * h ) { freeMem_manager ( h->nodeMemory ); h->nodeMemory = NULL; fh_destroyheap ( h ); } /* * Public Key Heap Functions */ FibHeapNode * fh_insertkey ( FibHeap * h, Coordinate key, unsigned int data ) { FibHeapNode * x; if ( ( x = fhe_newelem ( h ) ) == NULL ) { return NULL; } /* just insert on root list, and make sure it's not the new min */ x->fhe_data = data; x->fhe_key = key; fh_insertel ( h, x ); 
return x; } boolean fh_isempty ( FibHeap * h ) { if ( h->fh_min == NULL ) { return 1; } else { return 0; } } Coordinate fh_minkey ( FibHeap * h ) { if ( h->fh_min == NULL ) { return INT_MIN; } return h->fh_min->fhe_key; } unsigned int fh_replacekeydata ( FibHeap * h, FibHeapNode * x, Coordinate key, unsigned int data ) { unsigned int odata; Coordinate okey; FibHeapNode * y; int r; odata = x->fhe_data; okey = x->fhe_key; /* * we can increase a key by deleting and reinserting, that * requires O(lgn) time. */ if ( ( r = fh_comparedata ( h, key, data, x ) ) > 0 ) { /* XXX - bad code! */ abort(); } x->fhe_data = data; x->fhe_key = key; /* because they are equal, we don't have to do anything */ if ( r == 0 ) { return odata; } y = x->fhe_p; if ( h->fh_keys && okey == key ) { return odata; } if ( y != NULL && fh_compare ( h, x, y ) <= 0 ) { fh_cut ( h, x, y ); fh_cascading_cut ( h, y ); } /* * the = is so that the call from fh_delete will delete the proper * element. */ if ( fh_compare ( h, x, h->fh_min ) <= 0 ) { h->fh_min = x; } return odata; } Coordinate fh_replacekey ( FibHeap * h, FibHeapNode * x, Coordinate key ) { Coordinate ret; ret = x->fhe_key; ( void ) fh_replacekeydata ( h, x, key, x->fhe_data ); return ret; } /* * Public void * Heap Functions */ /* * this will return these values: * NULL failed for some reason * ptr token to use for manipulation of data */ FibHeapNode * fh_insert ( FibHeap * h, unsigned int data ) { FibHeapNode * x; if ( ( x = fhe_newelem ( h ) ) == NULL ) { return NULL; } /* just insert on root list, and make sure it's not the new min */ x->fhe_data = data; fh_insertel ( h, x ); return x; } unsigned int fh_min ( FibHeap * h ) { if ( h->fh_min == NULL ) { return 0; } return h->fh_min->fhe_data; } unsigned int fh_extractmin ( FibHeap * h ) { FibHeapNode * z; unsigned int ret = 0; if ( h->fh_min != NULL ) { z = fh_extractminel ( h ); ret = z->fhe_data; #ifndef NO_FREE deallocateFibHeapEl ( z, h ); #endif } return ret; } unsigned int 
fh_replacedata ( FibHeapNode * x, unsigned int data ) { unsigned int odata = x->fhe_data; x->fhe_data = data; return odata; } unsigned int fh_delete ( FibHeap * h, FibHeapNode * x ) { unsigned int k; k = x->fhe_data; if ( !h->fh_keys ) { fh_replacedata ( x, h->fh_neginf ); } else { fh_replacekey ( h, x, INT_MIN ); } fh_extractmin ( h ); return k; } /* * begin of private element fuctions */ static FibHeapNode * fh_extractminel ( FibHeap * h ) { FibHeapNode * ret; FibHeapNode * x, *y, *orig; ret = h->fh_min; orig = NULL; /* put all the children on the root list */ /* for true consistancy, we should use fhe_remove */ for ( x = ret->fhe_child; x != orig && x != NULL; ) { if ( orig == NULL ) { orig = x; } y = x->fhe_right; x->fhe_p = NULL; fh_insertrootlist ( h, x ); x = y; } /* remove minimum from root list */ fh_removerootlist ( h, ret ); h->fh_n--; /* if we aren't empty, consolidate the heap */ if ( h->fh_n == 0 ) { h->fh_min = NULL; } else { h->fh_min = ret->fhe_right; fh_consolidate ( h ); } return ret; } static void fh_insertrootlist ( FibHeap * h, FibHeapNode * x ) { if ( h->fh_root == NULL ) { h->fh_root = x; x->fhe_left = x; x->fhe_right = x; return; } fhe_insertafter ( h->fh_root, x ); } static void fh_removerootlist ( FibHeap * h, FibHeapNode * x ) { if ( x->fhe_left == x ) { h->fh_root = NULL; } else { h->fh_root = fhe_remove ( x ); } } static void fh_consolidate ( FibHeap * h ) { FibHeapNode ** a; FibHeapNode * w; FibHeapNode * y; FibHeapNode * x; IDnum i; IDnum d; IDnum D; fh_checkcons ( h ); /* assign a the value of h->fh_cons so I don't have to rewrite code */ D = h->fh_Dl + 1; a = h->fh_cons; for ( i = 0; i < D; i++ ) { a[i] = NULL; } while ( ( w = h->fh_root ) != NULL ) { x = w; fh_removerootlist ( h, w ); d = x->fhe_degree; /* XXX - assert that d < D */ while ( a[d] != NULL ) { y = a[d]; if ( fh_compare ( h, x, y ) > 0 ) { swap ( FibHeapNode *, x, y ); } fh_heaplink ( h, y, x ); a[d] = NULL; d++; } a[d] = x; } h->fh_min = NULL; for ( i = 0; i < D; i++ 
) if ( a[i] != NULL ) { fh_insertrootlist ( h, a[i] ); if ( h->fh_min == NULL || fh_compare ( h, a[i], h->fh_min ) < 0 ) { h->fh_min = a[i]; } } } static void fh_heaplink ( FibHeap * h, FibHeapNode * y, FibHeapNode * x ) { /* make y a child of x */ if ( x->fhe_child == NULL ) { x->fhe_child = y; } else { fhe_insertbefore ( x->fhe_child, y ); } y->fhe_p = x; x->fhe_degree++; y->fhe_mark = 0; } static void fh_cut ( FibHeap * h, FibHeapNode * x, FibHeapNode * y ) { fhe_remove ( x ); y->fhe_degree--; fh_insertrootlist ( h, x ); x->fhe_p = NULL; x->fhe_mark = 0; } static void fh_cascading_cut ( FibHeap * h, FibHeapNode * y ) { FibHeapNode * z; while ( ( z = y->fhe_p ) != NULL ) { if ( y->fhe_mark == 0 ) { y->fhe_mark = 1; return; } else { fh_cut ( h, y, z ); y = z; } } } /* * begining of handling elements of fibheap */ static FibHeapNode * fhe_newelem ( FibHeap * h ) { FibHeapNode * e; if ( ( e = allocateFibHeapEl ( h ) ) == NULL ) { return NULL; } fhe_initelem ( e ); return e; } static void fhe_initelem ( FibHeapNode * e ) { e->fhe_degree = 0; e->fhe_mark = 0; e->fhe_p = NULL; e->fhe_child = NULL; e->fhe_left = e; e->fhe_right = e; e->fhe_data = 0; } static void fhe_insertafter ( FibHeapNode * a, FibHeapNode * b ) { if ( a == a->fhe_right ) { a->fhe_right = b; a->fhe_left = b; b->fhe_right = a; b->fhe_left = a; } else { b->fhe_right = a->fhe_right; a->fhe_right->fhe_left = b; a->fhe_right = b; b->fhe_left = a; } } static inline void fhe_insertbefore ( FibHeapNode * a, FibHeapNode * b ) { fhe_insertafter ( a->fhe_left, b ); } static FibHeapNode * fhe_remove ( FibHeapNode * x ) { FibHeapNode * ret; if ( x == x->fhe_left ) { ret = NULL; } else { ret = x->fhe_left; } /* fix the parent pointer */ if ( x->fhe_p != NULL && x->fhe_p->fhe_child == x ) { x->fhe_p->fhe_child = ret; } x->fhe_right->fhe_left = x->fhe_left; x->fhe_left->fhe_right = x->fhe_right; /* clear out hanging pointers */ x->fhe_p = NULL; x->fhe_left = x; x->fhe_right = x; return ret; } static void 
fh_checkcons ( FibHeap * h )
{
	IDnum oDl;

	/* make sure we have enough memory allocated to "reorganize" */
	if ( h->fh_Dl == -1 || h->fh_n > ( 1 << h->fh_Dl ) )
	{
		oDl = h->fh_Dl;

		if ( ( h->fh_Dl = ceillog2 ( h->fh_n ) + 1 ) < 8 )
		{
			h->fh_Dl = 8;
		}

		if ( oDl != h->fh_Dl )
		{
			h->fh_cons = ( FibHeapNode ** ) realloc ( h->fh_cons, sizeof * h->fh_cons * ( h->fh_Dl + 1 ) );
		}

		if ( h->fh_cons == NULL )
		{
			abort();
		}
	}
}

static int fh_compare ( FibHeap * h, FibHeapNode * a, FibHeapNode * b )
{
	if ( a->fhe_key < b->fhe_key )
	{
		return -1;
	}

	if ( a->fhe_key == b->fhe_key )
	{
		return 0;
	}

	return 1;
}

static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b )
{
	FibHeapNode a;
	a.fhe_key = key;
	a.fhe_data = data;
	return fh_compare ( h, &a, b );
}

static void fh_insertel ( FibHeap * h, FibHeapNode * x )
{
	fh_insertrootlist ( h, x );

	if ( h->fh_min == NULL
	        || ( h->fh_keys ? x->fhe_key < h->fh_min->fhe_key : h->fh_cmp_fnct ( x->fhe_data, h->fh_min->fhe_data ) < 0 ) )
	{
		h->fh_min = x;
	}

	h->fh_n++;
}

/*
 * 127mer/fibHeap.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "fib.h"

// Constructor
// Memory allocated
FibHeap * newFibHeap()
{
	return fh_makekeyheap();
}

// Add new node into heap with a key, and a pointer to the specified node
FibHeapNode * insertNodeIntoHeap ( FibHeap * heap, Coordinate key, unsigned int node )
{
	return fh_insertkey ( heap, key, node );
}

// Returns smallest key in heap
Coordinate minKeyOfHeap ( FibHeap * heap )
{
	return fh_minkey ( heap );
}

// Replaces the key for a given node
Coordinate replaceKeyInHeap ( FibHeap * heap, FibHeapNode * node, Coordinate newKey )
{
	return fh_replacekey ( heap, node, newKey );
}

// Removes the node with the shortest key, then returns it.
unsigned int removeNextNodeFromHeap ( FibHeap * heap )
{
	return ( unsigned int ) fh_extractmin ( heap );
}

boolean IsHeapEmpty ( FibHeap * heap )
{
	return fh_isempty ( heap );
}

// Destructor
void destroyHeap ( FibHeap * heap )
{
	fh_deleteheap ( heap );
}

// Replace the node pointed to by a heap node
void replaceValueInHeap ( FibHeapNode * node, unsigned int newValue )
{
	fh_replacedata ( node, newValue );
}

// Remove unwanted node
void destroyNodeInHeap ( FibHeapNode * node, FibHeap * heap )
{
	fh_delete ( heap, node );
}

/*
 * 127mer/hashFunction.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo.
If not, see . * */ #include #define KMER_HASH_MASK 0x0000000000ffffffL #define KMER_HASH_BUCKETS 16777216 // 4^12 static int crc_table[256] = { 0x00000000, 0x77073096, 0xee0e612c, 0x990951ba, 0x076dc419, 0x706af48f, 0xe963a535, 0x9e6495a3, 0x0edb8832, 0x79dcb8a4, 0xe0d5e91e, 0x97d2d988, 0x09b64c2b, 0x7eb17cbd, 0xe7b82d07, 0x90bf1d91, 0x1db71064, 0x6ab020f2, 0xf3b97148, 0x84be41de, 0x1adad47d, 0x6ddde4eb, 0xf4d4b551, 0x83d385c7, 0x136c9856, 0x646ba8c0, 0xfd62f97a, 0x8a65c9ec, 0x14015c4f, 0x63066cd9, 0xfa0f3d63, 0x8d080df5, 0x3b6e20c8, 0x4c69105e, 0xd56041e4, 0xa2677172, 0x3c03e4d1, 0x4b04d447, 0xd20d85fd, 0xa50ab56b, 0x35b5a8fa, 0x42b2986c, 0xdbbbc9d6, 0xacbcf940, 0x32d86ce3, 0x45df5c75, 0xdcd60dcf, 0xabd13d59, 0x26d930ac, 0x51de003a, 0xc8d75180, 0xbfd06116, 0x21b4f4b5, 0x56b3c423, 0xcfba9599, 0xb8bda50f, 0x2802b89e, 0x5f058808, 0xc60cd9b2, 0xb10be924, 0x2f6f7c87, 0x58684c11, 0xc1611dab, 0xb6662d3d, 0x76dc4190, 0x01db7106, 0x98d220bc, 0xefd5102a, 0x71b18589, 0x06b6b51f, 0x9fbfe4a5, 0xe8b8d433, 0x7807c9a2, 0x0f00f934, 0x9609a88e, 0xe10e9818, 0x7f6a0dbb, 0x086d3d2d, 0x91646c97, 0xe6635c01, 0x6b6b51f4, 0x1c6c6162, 0x856530d8, 0xf262004e, 0x6c0695ed, 0x1b01a57b, 0x8208f4c1, 0xf50fc457, 0x65b0d9c6, 0x12b7e950, 0x8bbeb8ea, 0xfcb9887c, 0x62dd1ddf, 0x15da2d49, 0x8cd37cf3, 0xfbd44c65, 0x4db26158, 0x3ab551ce, 0xa3bc0074, 0xd4bb30e2, 0x4adfa541, 0x3dd895d7, 0xa4d1c46d, 0xd3d6f4fb, 0x4369e96a, 0x346ed9fc, 0xad678846, 0xda60b8d0, 0x44042d73, 0x33031de5, 0xaa0a4c5f, 0xdd0d7cc9, 0x5005713c, 0x270241aa, 0xbe0b1010, 0xc90c2086, 0x5768b525, 0x206f85b3, 0xb966d409, 0xce61e49f, 0x5edef90e, 0x29d9c998, 0xb0d09822, 0xc7d7a8b4, 0x59b33d17, 0x2eb40d81, 0xb7bd5c3b, 0xc0ba6cad, 0xedb88320, 0x9abfb3b6, 0x03b6e20c, 0x74b1d29a, 0xead54739, 0x9dd277af, 0x04db2615, 0x73dc1683, 0xe3630b12, 0x94643b84, 0x0d6d6a3e, 0x7a6a5aa8, 0xe40ecf0b, 0x9309ff9d, 0x0a00ae27, 0x7d079eb1, 0xf00f9344, 0x8708a3d2, 0x1e01f268, 0x6906c2fe, 0xf762575d, 0x806567cb, 0x196c3671, 0x6e6b06e7, 0xfed41b76, 0x89d32be0, 
0x10da7a5a, 0x67dd4acc, 0xf9b9df6f, 0x8ebeeff9, 0x17b7be43, 0x60b08ed5, 0xd6d6a3e8, 0xa1d1937e, 0x38d8c2c4, 0x4fdff252, 0xd1bb67f1, 0xa6bc5767, 0x3fb506dd, 0x48b2364b, 0xd80d2bda, 0xaf0a1b4c, 0x36034af6, 0x41047a60, 0xdf60efc3, 0xa867df55, 0x316e8eef, 0x4669be79, 0xcb61b38c, 0xbc66831a, 0x256fd2a0, 0x5268e236, 0xcc0c7795, 0xbb0b4703, 0x220216b9, 0x5505262f, 0xc5ba3bbe, 0xb2bd0b28, 0x2bb45a92, 0x5cb36a04, 0xc2d7ffa7, 0xb5d0cf31, 0x2cd99e8b, 0x5bdeae1d, 0x9b64c2b0, 0xec63f226, 0x756aa39c, 0x026d930a, 0x9c0906a9, 0xeb0e363f, 0x72076785, 0x05005713, 0x95bf4a82, 0xe2b87a14, 0x7bb12bae, 0x0cb61b38, 0x92d28e9b, 0xe5d5be0d, 0x7cdcefb7, 0x0bdbdf21, 0x86d3d2d4, 0xf1d4e242, 0x68ddb3f8, 0x1fda836e, 0x81be16cd, 0xf6b9265b, 0x6fb077e1, 0x18b74777, 0x88085ae6, 0xff0f6a70, 0x66063bca, 0x11010b5c, 0x8f659eff, 0xf862ae69, 0x616bffd3, 0x166ccf45, 0xa00ae278, 0xd70dd2ee, 0x4e048354, 0x3903b3c2, 0xa7672661, 0xd06016f7, 0x4969474d, 0x3e6e77db, 0xaed16a4a, 0xd9d65adc, 0x40df0b66, 0x37d83bf0, 0xa9bcae53, 0xdebb9ec5, 0x47b2cf7f, 0x30b5ffe9, 0xbdbdf21c, 0xcabac28a, 0x53b39330, 0x24b4a3a6, 0xbad03605, 0xcdd70693, 0x54de5729, 0x23d967bf, 0xb3667a2e, 0xc4614ab8, 0x5d681b02, 0x2a6f2b94, 0xb40bbe37, 0xc30c8ea1, 0x5a05df1b, 0x2d02ef8d }; static int crc32 ( int crc, const char * buf, int len ) { if ( buf == NULL ) { return 0; } crc = crc ^ 0xffffffff; while ( len-- ) { crc = crc_table[ ( ( int ) crc ^ ( *buf++ ) ) & 0xff] ^ ( crc >> 8 ); } return crc ^ 0xffffffff; } ubyte8 hash_kmer ( Kmer kmer ) { ubyte8 hash = kmer.low2; hash = crc32 ( 0, ( char * ) &kmer, sizeof ( Kmer ) ); hash &= KMER_HASH_MASK; return hash; } SOAPdenovo-V1.05/src/127mer/inc/000755 000765 000024 00000000000 11530651532 016152 5ustar00Aquastaff000000 000000 SOAPdenovo-V1.05/src/127mer/kmer.c000644 000765 000024 00000036657 11530651532 016524 0ustar00Aquastaff000000 000000 /* * 127mer/kmer.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

/*
U256b Kmer2int256(Kmer seq)
{
	U256b temp;
	temp.low = seq.high2;
	temp.low <<= 64;
	temp.low |= seq.low2;
	temp.high = seq.high1;
	temp.high <<= 64;
	temp.high |= seq.low1;
	//temp.low = seq.high2 << 64 | seq.low2;
	//temp.high = seq.high1 << 64 | seq.low1;
	return temp;
}
*/

/* Compare two kmers word by word, most significant word (high1) first. */
boolean KmerSmaller ( Kmer kmer1, Kmer kmer2 )
{
	if ( kmer1.high1 != kmer2.high1 )
	{
		return ( kmer1.high1 < kmer2.high1 );
	}
	else
	{
		if ( kmer1.low1 != kmer2.low1 )
		{
			return ( kmer1.low1 < kmer2.low1 );
		}
		else
		{
			if ( kmer1.high2 != kmer2.high2 )
			{
				return ( kmer1.high2 < kmer2.high2 );
			}
			else
			{
				return ( kmer1.low2 < kmer2.low2 );
			}
		}
	}

	/*
	   Legacy version; the text of this block was lost in extraction and is
	   reconstructed here as the mirror image of the block in KmerLarger.
	if(kmer1.high1<kmer2.high1) return 1;
	else if(kmer1.high1==kmer2.high1){
		if(kmer1.low1<kmer2.low1) return 1;
		else if(kmer1.low1==kmer2.low1){
			if(kmer1.high2<kmer2.high2) return 1;
			else if(kmer1.high2==kmer2.high2){
				if(kmer1.low2<kmer2.low2) return 1;
				else return 0;
			}else return 0;
		}else return 0;
	}else return 0;
	*/
}

boolean KmerLarger ( Kmer kmer1, Kmer kmer2 )
{
	if ( kmer1.high1 != kmer2.high1 )
	{
		return ( kmer1.high1 > kmer2.high1 );
	}
	else
	{
		if ( kmer1.low1 != kmer2.low1 )
		{
			return ( kmer1.low1 > kmer2.low1 );
		}
		else
		{
			if ( kmer1.high2 != kmer2.high2 )
			{
				return ( kmer1.high2 > kmer2.high2 );
			}
			else
			{
				return ( kmer1.low2 > kmer2.low2 );
			}
		}
	}

	/*
	if(kmer1.high1>kmer2.high1) return 1;
	else if(kmer1.high1==kmer2.high1){
		if(kmer1.low1>kmer2.low1) return 1;
		else if(kmer1.low1==kmer2.low1){
			if(kmer1.high2>kmer2.high2) return 1;
			else if(kmer1.high2==kmer2.high2){
				if(kmer1.low2>kmer2.low2) return 1;
				else return 0;
			}else return 0;
		}else return 0;
	}else return 0;
	*/
}

boolean KmerEqual ( Kmer kmer1, Kmer kmer2 )
{
	if ( kmer1.low2 != kmer2.low2 || kmer1.high2 != kmer2.high2 || kmer1.low1 != kmer2.low1 ||
	        kmer1.high1 != kmer2.high1 )
	{
		return 0;
	}
	else
	{
		return 1;
	}
}

// kmer1 = kmer1 & kmer2
Kmer KmerAnd ( Kmer kmer1, Kmer kmer2 )
{
	kmer1.high1 &= kmer2.high1;
	kmer1.low1 &= kmer2.low1;
	kmer1.high2 &= kmer2.high2;
	kmer1.low2 &= kmer2.low2;
	return kmer1;
}

// kmer <<= 2
Kmer KmerLeftBitMoveBy2 ( Kmer word )
{
	word.high1 = ( word.high1 << 2 ) | ( word.low1 >> 62 );
	word.low1 = ( word.low1 << 2 ) | ( word.high2 >> 62 );
	word.high2 = ( word.high2 << 2 ) | ( word.low2 >> 62 );
	word.low2 <<= 2;
	return word;
}

// kmer >>= 2
Kmer KmerRightBitMoveBy2 ( Kmer word )
{
	word.low2 = ( word.low2 >> 2 ) | ( word.high2 & 0x3 ) << 62;
	word.high2 = ( word.high2 >> 2 ) | ( word.low1 & 0x3 ) << 62;
	word.low1 = ( word.low1 >> 2 ) | ( word.high1 & 0x3 ) << 62;
	word.high1 >>= 2;
	return word;
}

Kmer KmerPlus ( Kmer prev, char ch )
{
	Kmer word = KmerLeftBitMoveBy2 ( prev );
	word.low2 |= ch;
	return word;
}

Kmer nextKmer ( Kmer prev, char ch )
{
	Kmer word = KmerLeftBitMoveBy2 ( prev );
	word = KmerAnd ( word, WORDFILTER );
	word.low2 |= ch;
	return word;
}

Kmer prevKmer ( Kmer next, char ch )
{
	Kmer word = KmerRightBitMoveBy2 ( next );

	switch ( overlaplen )
	{
		case 1 ... 32:
			word.low2 |= ( ( ( ubyte8 ) ch ) << 2 * ( overlaplen - 1 ) );
			break;
		case 33 ... 64:
			word.high2 |= ( ( ubyte8 ) ch ) << ( 2 * ( overlaplen - 1 ) - 64 );
			break;
		case 65 ... 96 :
			word.low1 |= ( ( ubyte8 ) ch ) << ( 2 * ( overlaplen - 1 ) - 128 );
			break;
		case 97 ... 128:
			word.high1 |= ( ( ubyte8 ) ch ) << ( 2 * ( overlaplen - 1 ) - 192 );
			break;
	}

	return word;
}

char lastCharInKmer ( Kmer kmer )
{
	return ( char ) ( kmer.low2 & 0x3 );
}

char firstCharInKmer ( Kmer kmer )
{
	switch ( overlaplen )
	{
		case 1 ... 32:
			kmer.low2 >>= 2 * ( overlaplen - 1 );
			return kmer.low2; // & 3;
		case 33 ... 64:
			kmer.high2 >>= 2 * ( overlaplen - 1 ) - 64;
			return kmer.high2; // & 3;
		case 65 ... 96 :
			kmer.low1 >>= 2 * ( overlaplen - 1 ) - 128;
			return kmer.low1;
		case 97 ...
128: kmer.high1 >>= 2 * ( overlaplen - 1 ) - 192; return kmer.high1; } } Kmer createFilter ( int overlaplen ) { Kmer word; word.high1 = word.low1 = word.high2 = word.low2 = 0; switch ( overlaplen ) { case 1 ... 31: word.low2 = ( ( ( ubyte8 ) 1 ) << ( 2 * overlaplen ) ) - 1; break; case 32 ... 63: word.low2 = ~word.low2; word.high2 = ( ( ( ubyte8 ) 1 ) << ( 2 * overlaplen - 64 ) ) - 1; break; case 64 ... 95: word.high2 = word.low2 = ~word.low2; word.low1 = ( ( ( ubyte8 ) 1 ) << ( 2 * overlaplen - 128 ) ) - 1; break; case 96 ... 127: word.low1 = word.high2 = word.low2 = ~word.low2; word.high1 = ( ( ( ubyte8 ) 1 ) << ( 2 * overlaplen - 192 ) ) - 1; break; } return word; } Kmer KmerRightBitMove ( Kmer word, int dis ) { ubyte8 mask; switch ( dis ) { case 1 ... 63: mask = ( ( ( ubyte8 ) 1 ) << dis ) - 1; word.low2 = ( word.low2 >> dis ) | ( word.high2 & mask ) << ( 64 - dis ); word.high2 = ( word.high2 >> dis ) | ( word.low1 & mask ) << ( 64 - dis ); word.low1 = ( word.low1 >> dis ) | ( word.high1 & mask ) << ( 64 - dis ); word.high1 >>= dis; return word; case 64 ... 127: mask = ( ( ( ubyte8 ) 1 ) << ( dis - 64 ) ) - 1; word.low2 = word.high2 >> ( dis - 64 ) | ( word.low1 & mask ) << ( 128 - dis ); word.high2 = word.low1 >> ( dis - 64 ) | ( word.high1 & mask ) << ( 128 - dis ); word.low1 = word.high1 >> ( dis - 64 ); word.high1 = 0; return word; case 128 ... 191: mask = ( ( ( ubyte8 ) 1 ) << ( dis - 128 ) ) - 1; word.low2 = word.low1 >> ( dis - 128 ) | ( word.high1 & mask ) << ( 192 - dis ); word.high2 = word.high1 >> ( dis - 128 ); word.high1 = word.low1 = 0; return word; case 192 ... 
255: word.low2 = word.high1 >> ( dis - 192 ); word.high1 = word.low1 = word.high2 = 0; return word; } /* if(dis<64){ ubyte8 mask = (((ubyte8) 1) << dis) - 1; ubyte8 temp1 = (word.high1&mask)<<(64-dis); ubyte8 temp2 = (word.low1&mask)<<(64-dis); ubyte8 temp3 = (word.high2&mask)<<(64-dis); word.high1 >>= dis; word.low1 >>= dis; word.high2 >>= dis; word.low2 >>= dis; word.low1 |= temp1; word.high2 |= temp2; word.low2 |= temp3; return word; } if(dis>=64 && dis<128){ ubyte8 mask = (((ubyte8) 1) << (dis-64)) - 1; ubyte8 temp1 = (word.high1&mask)<<(128-dis); ubyte8 temp2 = (word.low1&mask)<<(128-dis); word.high1 >>= (dis-64); word.low1 >>= (dis-64); word.high2 >>= (dis-64); word.high2 |= temp2; word.low2 = word.high2; word.low1 |= temp1; word.high2 = word.low1; word.low1 = word.high1; word.high1 = 0; return word; } if(dis>=128 && dis<192){ ubyte8 mask = (((ubyte8) 1) << (dis-128)) - 1; ubyte8 temp1 = (word.high1&mask)<<(192-dis); word.high1 >>= (dis-128); word.low1 >>= (dis-128); word.low1 |= temp1; word.low2 = word.low1; word.high2 = word.high1; word.low1 =0; word.high1 = 0; return word; } if(dis>=192 && dis<256){ word.high1 >>= (dis-192); word.low2 = word.high1; word.high2 = 0; word.low1 =0; word.high1 = 0; return word; }*/ } void printKmerSeq ( FILE * fp, Kmer kmer ) { int i, bit1, bit2, bit3, bit4; bit4 = bit3 = bit2 = bit1 = 0; char kmerSeq[128]; switch ( overlaplen ) { case 1 ... 31: bit4 = overlaplen; break; case 32 ... 63: bit4 = 32; bit3 = overlaplen - 32; break; case 64 ... 95: bit4 = bit3 = 32; bit2 = overlaplen - 64; break; case 96 ... 
127: bit4 = bit3 = bit2 = 32; bit1 = overlaplen - 96; break; } for ( i = bit1 - 1; i >= 0; i-- ) { kmerSeq[i] = kmer.high1 & 0x3; kmer.high1 >>= 2; } for ( i = bit2 - 1; i >= 0; i-- ) { kmerSeq[i + bit1] = kmer.low1 & 0x3; kmer.low1 >>= 2; } for ( i = bit3 - 1; i >= 0; i-- ) { kmerSeq[i + bit1 + bit2] = kmer.high2 & 0x3; kmer.high2 >>= 2; } for ( i = bit4 - 1; i >= 0; i-- ) { kmerSeq[i + bit1 + bit2 + bit3] = kmer.low2 & 0x3; kmer.low2 >>= 2; } for ( i = 0; i < overlaplen; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) kmerSeq[i] ) ); } } void print_kmer ( FILE * fp, Kmer kmer, char c ) { fprintf ( fp, "%llx %llx %llx %llx", kmer.high1, kmer.low1, kmer.high2, kmer.low2 ); fprintf ( fp, "%c", c ); } static const ubyte2 BitReverseTable [65536] = { # define R2(n) n, n + 1*16384, n + 2*16384, n + 3*16384 # define R4(n) R2(n), R2(n + 1*4096), R2(n + 2*4096), R2(n + 3*4096) # define R6(n) R4(n), R4(n + 1*1024), R4(n + 2*1024), R4(n + 3*1024) # define R8(n) R6(n), R6(n + 1*256 ), R6(n + 2*256 ), R6(n + 3*256 ) # define R10(n) R8(n), R8(n + 1*64 ), R8(n + 2*64 ), R8(n + 3*64 ) # define R12(n) R10(n),R10(n + 1*16), R10(n + 2*16 ), R10(n + 3*16 ) # define R14(n) R12(n),R12(n + 1*4 ), R12(n + 2*4 ), R12(n + 3*4 ) R14 ( 0 ), R14 ( 1 ), R14 ( 2 ), R14 ( 3 ) }; static Kmer fastReverseComp ( Kmer seq, char seq_size ) { seq.low2 ^= 0xAAAAAAAAAAAAAAAALLU; seq.low2 = ( ( ubyte8 ) BitReverseTable[ seq.low2 & 0xffff] << 48 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.low2 >> 16 ) & 0xffff] << 32 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.low2 >> 32 ) & 0xffff] << 16 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.low2 >> 48 ) & 0xffff] ); if ( seq_size < 32 ) { seq.low2 >>= ( 64 - ( seq_size << 1 ) ); return seq; } seq.high2 ^= 0xAAAAAAAAAAAAAAAALLU; seq.high2 = ( ( ubyte8 ) BitReverseTable[ seq.high2 & 0xffff] << 48 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.high2 >> 16 ) & 0xffff] << 32 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.high2 >> 32 ) & 0xffff] << 16 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.high2 
>> 48 ) & 0xffff] ); if ( seq_size < 64 ) { seq.high2 = seq.high2 ^ seq.low2; seq.low2 = seq.high2 ^ seq.low2; seq.high2 = seq.high2 ^ seq.low2; seq = KmerRightBitMove ( seq, 128 - ( seq_size << 1 ) ); return seq; } seq.low1 ^= 0xAAAAAAAAAAAAAAAALLU; seq.low1 = ( ( ubyte8 ) BitReverseTable[ seq.low1 & 0xffff] << 48 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.low1 >> 16 ) & 0xffff] << 32 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.low1 >> 32 ) & 0xffff] << 16 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.low1 >> 48 ) & 0xffff] ); if ( seq_size < 96 ) { seq.low1 = seq.low1 ^ seq.low2; seq.low2 = seq.low1 ^ seq.low2; seq.low1 = seq.low1 ^ seq.low2; seq = KmerRightBitMove ( seq, 192 - ( seq_size << 1 ) ); return seq; } seq.high1 ^= 0xAAAAAAAAAAAAAAAALLU; seq.high1 = ( ( ubyte8 ) BitReverseTable[ seq.high1 & 0xffff] << 48 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.high1 >> 16 ) & 0xffff] << 32 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.high1 >> 32 ) & 0xffff] << 16 ) | ( ( ubyte8 ) BitReverseTable[ ( seq.high1 >> 48 ) & 0xffff] ); seq.low1 = seq.low1 ^ seq.high2; seq.high2 = seq.low1 ^ seq.high2; seq.low1 = seq.low1 ^ seq.high2; seq.low2 = seq.low2 ^ seq.high1; seq.high1 = seq.low2 ^ seq.high1; seq.low2 = seq.low2 ^ seq.high1; seq = KmerRightBitMove ( seq, 256 - ( seq_size << 1 ) ); return seq; } /* seq.low2 ^= 0xAAAAAAAAAAAAAAAALLU; seq.low2 = ((seq.low2 & 0x3333333333333333LLU)<< 2) | ((seq.low2 & 0xCCCCCCCCCCCCCCCCLLU)>> 2); seq.low2 = ((seq.low2 & 0x0F0F0F0F0F0F0F0FLLU)<< 4) | ((seq.low2 & 0xF0F0F0F0F0F0F0F0LLU)>> 4); seq.low2 = ((seq.low2 & 0x00FF00FF00FF00FFLLU)<< 8) | ((seq.low2 & 0xFF00FF00FF00FF00LLU)>> 8); seq.low2 = ((seq.low2 & 0x0000FFFF0000FFFFLLU)<<16) | ((seq.low2 & 0xFFFF0000FFFF0000LLU)>>16); seq.low2 = ((seq.low2 & 0x00000000FFFFFFFFLLU)<<32) | ((seq.low2 & 0xFFFFFFFF00000000LLU)>>32); if(seq_size<32){ seq.low2 >>= (64 - (seq_size<<1)); return seq; } seq.high2 ^= 0xAAAAAAAAAAAAAAAALLU; seq.high2 = ((seq.high2 & 0x3333333333333333LLU)<< 2) | ((seq.high2 & 
0xCCCCCCCCCCCCCCCCLLU)>> 2); seq.high2 = ((seq.high2 & 0x0F0F0F0F0F0F0F0FLLU)<< 4) | ((seq.high2 & 0xF0F0F0F0F0F0F0F0LLU)>> 4); seq.high2 = ((seq.high2 & 0x00FF00FF00FF00FFLLU)<< 8) | ((seq.high2 & 0xFF00FF00FF00FF00LLU)>> 8); seq.high2 = ((seq.high2 & 0x0000FFFF0000FFFFLLU)<<16) | ((seq.high2 & 0xFFFF0000FFFF0000LLU)>>16); seq.high2 = ((seq.high2 & 0x00000000FFFFFFFFLLU)<<32) | ((seq.high2 & 0xFFFFFFFF00000000LLU)>>32); ubyte8 temp; if(seq_size<64){ temp = seq.high2; seq.high2 = seq.low2; seq.low2 = temp; seq = KmerRightBitMove(seq,128-(seq_size<<1)); return seq; } seq.low1 ^= 0xAAAAAAAAAAAAAAAALLU; seq.low1 = ((seq.low1 & 0x3333333333333333LLU)<< 2) | ((seq.low1 & 0xCCCCCCCCCCCCCCCCLLU)>> 2); seq.low1 = ((seq.low1 & 0x0F0F0F0F0F0F0F0FLLU)<< 4) | ((seq.low1 & 0xF0F0F0F0F0F0F0F0LLU)>> 4); seq.low1 = ((seq.low1 & 0x00FF00FF00FF00FFLLU)<< 8) | ((seq.low1 & 0xFF00FF00FF00FF00LLU)>> 8); seq.low1 = ((seq.low1 & 0x0000FFFF0000FFFFLLU)<<16) | ((seq.low1 & 0xFFFF0000FFFF0000LLU)>>16); seq.low1 = ((seq.low1 & 0x00000000FFFFFFFFLLU)<<32) | ((seq.low1 & 0xFFFFFFFF00000000LLU)>>32); if(seq_size<96){ temp = seq.low2; seq.low2 = seq.low1; seq.low1 = temp; seq = KmerRightBitMove(seq,192-(seq_size<<1)); return seq; } seq.high1 ^= 0xAAAAAAAAAAAAAAAALLU; seq.high1 = ((seq.high1 & 0x3333333333333333LLU)<< 2) | ((seq.high1 & 0xCCCCCCCCCCCCCCCCLLU)>> 2); seq.high1 = ((seq.high1 & 0x0F0F0F0F0F0F0F0FLLU)<< 4) | ((seq.high1 & 0xF0F0F0F0F0F0F0F0LLU)>> 4); seq.high1 = ((seq.high1 & 0x00FF00FF00FF00FFLLU)<< 8) | ((seq.high1 & 0xFF00FF00FF00FF00LLU)>> 8); seq.high1 = ((seq.high1 & 0x0000FFFF0000FFFFLLU)<<16) | ((seq.high1 & 0xFFFF0000FFFF0000LLU)>>16); seq.high1 = ((seq.high1 & 0x00000000FFFFFFFFLLU)<<32) | ((seq.high1 & 0xFFFFFFFF00000000LLU)>>32); ubyte8 temp_t; temp = seq.high2; seq.high2 = seq.low1; seq.low1 = temp; temp_t = seq.high1; seq.high1 = seq.low2; seq.low2 = temp_t; seq = KmerRightBitMove(seq,256-(seq_size<<1)); return seq;*/ Kmer reverseComplementVerbose ( Kmer word, int 
overlap ) { return fastReverseComp ( word, overlap ); } Kmer reverseComplement ( Kmer word, int overlap ) { return fastReverseComp ( word, overlap ); }
/* * 127mer/lib.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */
#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"
static char tabs[2][1024]; int getMaxLongReadLen ( int num_libs ) { int i; int maxLong = 0; boolean Has = 0; for ( i = 0; i < num_libs; i++ ) { if ( lib_array[i].asm_flag != 4 ) { continue; } Has = 1; maxLong = maxLong < lib_array[i].rd_len_cutoff ? lib_array[i].rd_len_cutoff : maxLong; } if ( !Has ) { return maxLong; } else { return maxLong > 0 ? 
maxLong : maxReadLen; } } static boolean splitColumn ( char * line ) { int len = strlen ( line ); int i = 0, j; int tabs_n = 0; while ( i < len ) { if ( line[i] >= 32 && line[i] <= 126 && line[i] != '=' ) { j = 0; while ( i < len && line[i] >= 32 && line[i] <= 126 && line[i] != '=' ) { tabs[tabs_n][j++] = line[i]; i++; } tabs[tabs_n][j] = '\0'; tabs_n++; if ( tabs_n == 2 ) { return 1; } } i++; } if ( tabs_n == 2 ) { return 1; } else { return 0; } } static int cmp_lib ( const void * a, const void * b ) { LIB_INFO * A, *B; A = ( LIB_INFO * ) a; B = ( LIB_INFO * ) b; if ( A->avg_ins > B->avg_ins ) { return 1; } else if ( A->avg_ins == B->avg_ins ) { return 0; } else { return -1; } } void scan_libInfo ( char * libfile ) { FILE * fp; char line[1024], ch; int i, j, index; int libCounter; boolean flag; fp = ckopen ( libfile, "r" ); num_libs = 0; while ( fgets ( line, 1024, fp ) ) { ch = line[5]; line[5] = '\0'; if ( strcmp ( line, "[LIB]" ) == 0 ) { num_libs++; } if ( !num_libs ) { line[5] = ch; flag = splitColumn ( line ); if ( !flag ) { continue; } if ( strcmp ( tabs[0], "max_rd_len" ) == 0 ) { maxReadLen = atoi ( tabs[1] ); } } } //count file numbers of each type lib_array = ( LIB_INFO * ) ckalloc ( num_libs * sizeof ( LIB_INFO ) ); for ( i = 0; i < num_libs; i++ ) { lib_array[i].asm_flag = 3; lib_array[i].rank = 0; lib_array[i].pair_num_cut = 0; lib_array[i].rd_len_cutoff = 0; lib_array[i].map_len = 0; lib_array[i].num_s_a_file = 0; lib_array[i].num_s_q_file = 0; lib_array[i].num_p_file = 0; lib_array[i].num_a1_file = 0; lib_array[i].num_a2_file = 0; lib_array[i].num_q1_file = 0; lib_array[i].num_q2_file = 0; } libCounter = -1; rewind ( fp ); i = -1; while ( fgets ( line, 1024, fp ) ) { ch = line[5]; line[5] = '\0'; if ( strcmp ( line, "[LIB]" ) == 0 ) { i++; continue; } line[5] = ch; flag = splitColumn ( line ); if ( !flag ) { continue; } if ( strcmp ( tabs[0], "f1" ) == 0 ) { lib_array[i].num_a1_file++; } else if ( strcmp ( tabs[0], "q1" ) == 0 ) { 
lib_array[i].num_q1_file++; } else if ( strcmp ( tabs[0], "f2" ) == 0 ) { lib_array[i].num_a2_file++; } else if ( strcmp ( tabs[0], "q2" ) == 0 ) { lib_array[i].num_q2_file++; } else if ( strcmp ( tabs[0], "f" ) == 0 ) { lib_array[i].num_s_a_file++; } else if ( strcmp ( tabs[0], "q" ) == 0 ) { lib_array[i].num_s_q_file++; } else if ( strcmp ( tabs[0], "p" ) == 0 ) { lib_array[i].num_p_file++; } } //allocate memory for filenames for ( i = 0; i < num_libs; i++ ) { if ( lib_array[i].num_s_a_file ) { lib_array[i].s_a_fname = ( char ** ) ckalloc ( lib_array[i].num_s_a_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_s_a_file; j++ ) { lib_array[i].s_a_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_s_q_file ) { lib_array[i].s_q_fname = ( char ** ) ckalloc ( lib_array[i].num_s_q_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_s_q_file; j++ ) { lib_array[i].s_q_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_p_file ) { lib_array[i].p_fname = ( char ** ) ckalloc ( lib_array[i].num_p_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_p_file; j++ ) { lib_array[i].p_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_a1_file ) { lib_array[i].a1_fname = ( char ** ) ckalloc ( lib_array[i].num_a1_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_a1_file; j++ ) { lib_array[i].a1_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_a2_file ) { lib_array[i].a2_fname = ( char ** ) ckalloc ( lib_array[i].num_a2_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_a2_file; j++ ) { lib_array[i].a2_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } if ( lib_array[i].num_q1_file ) { lib_array[i].q1_fname = ( char ** ) ckalloc ( lib_array[i].num_q1_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_q1_file; j++ ) { lib_array[i].q1_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( 
char ) ); } } if ( lib_array[i].num_q2_file ) { lib_array[i].q2_fname = ( char ** ) ckalloc ( lib_array[i].num_q2_file * sizeof ( char * ) ); for ( j = 0; j < lib_array[i].num_q2_file; j++ ) { lib_array[i].q2_fname[j] = ( char * ) ckalloc ( 1024 * sizeof ( char ) ); } } } // get file names for ( i = 0; i < num_libs; i++ ) { lib_array[i].curr_type = 1; lib_array[i].curr_index = 0; lib_array[i].fp1 = NULL; lib_array[i].fp2 = NULL; lib_array[i].num_s_a_file = 0; lib_array[i].num_s_q_file = 0; lib_array[i].num_p_file = 0; lib_array[i].num_a1_file = 0; lib_array[i].num_a2_file = 0; lib_array[i].num_q1_file = 0; lib_array[i].num_q2_file = 0; } libCounter = -1; rewind ( fp ); i = -1; while ( fgets ( line, 1024, fp ) ) { ch = line[5]; line[5] = '\0'; if ( strcmp ( line, "[LIB]" ) == 0 ) { i++; continue; } line[5] = ch; flag = splitColumn ( line ); if ( !flag ) { continue; } if ( strcmp ( tabs[0], "f1" ) == 0 ) { index = lib_array[i].num_a1_file++; strcpy ( lib_array[i].a1_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "q1" ) == 0 ) { index = lib_array[i].num_q1_file++; strcpy ( lib_array[i].q1_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "f2" ) == 0 ) { index = lib_array[i].num_a2_file++; strcpy ( lib_array[i].a2_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "q2" ) == 0 ) { index = lib_array[i].num_q2_file++; strcpy ( lib_array[i].q2_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "f" ) == 0 ) { index = lib_array[i].num_s_a_file++; strcpy ( lib_array[i].s_a_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "q" ) == 0 ) { index = lib_array[i].num_s_q_file++; strcpy ( lib_array[i].s_q_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "p" ) == 0 ) { index = lib_array[i].num_p_file++; strcpy ( lib_array[i].p_fname[index], tabs[1] ); } else if ( strcmp ( tabs[0], "min_ins" ) == 0 ) { lib_array[i].min_ins = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "max_ins" ) == 0 ) { lib_array[i].max_ins = atoi ( tabs[1] ); } else if ( strcmp ( 
tabs[0], "avg_ins" ) == 0 ) { lib_array[i].avg_ins = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "rd_len_cutoff" ) == 0 ) { lib_array[i].rd_len_cutoff = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "reverse_seq" ) == 0 ) { lib_array[i].reverse = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "asm_flags" ) == 0 ) { lib_array[i].asm_flag = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "rank" ) == 0 ) { lib_array[i].rank = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "pair_num_cutoff" ) == 0 ) { lib_array[i].pair_num_cut = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "rd_len_cutoff" ) == 0 ) { lib_array[i].rd_len_cutoff = atoi ( tabs[1] ); } else if ( strcmp ( tabs[0], "map_len" ) == 0 ) { lib_array[i].map_len = atoi ( tabs[1] ); } } fclose ( fp ); qsort ( &lib_array[0], num_libs, sizeof ( LIB_INFO ), cmp_lib ); } void free_libs() { if ( !lib_array ) { return; } int i, j; for ( i = 0; i < num_libs; i++ ) { printf ( "[LIB] %d, avg_ins %d, reverse %d \n", i, lib_array[i].avg_ins, lib_array[i].reverse ); if ( lib_array[i].num_s_a_file ) { //printf("%d single fasta files\n",lib_array[i].num_s_a_file); for ( j = 0; j < lib_array[i].num_s_a_file; j++ ) { free ( ( void * ) lib_array[i].s_a_fname[j] ); } free ( ( void * ) lib_array[i].s_a_fname ); } if ( lib_array[i].num_s_q_file ) { //printf("%d single fastq files\n",lib_array[i].num_s_q_file); for ( j = 0; j < lib_array[i].num_s_q_file; j++ ) { free ( ( void * ) lib_array[i].s_q_fname[j] ); } free ( ( void * ) lib_array[i].s_q_fname ); } if ( lib_array[i].num_p_file ) { //printf("%d paired fasta files\n",lib_array[i].num_p_file); for ( j = 0; j < lib_array[i].num_p_file; j++ ) { free ( ( void * ) lib_array[i].p_fname[j] ); } free ( ( void * ) lib_array[i].p_fname ); } if ( lib_array[i].num_a1_file ) { //printf("%d read1 fasta files\n",lib_array[i].num_a1_file); for ( j = 0; j < lib_array[i].num_a1_file; j++ ) { free ( ( void * ) lib_array[i].a1_fname[j] ); } free ( ( void * ) lib_array[i].a1_fname ); } if ( 
lib_array[i].num_a2_file ) { /* printf("%d read2 fasta files\n",lib_array[i].num_a2_file); */ for ( j = 0; j < lib_array[i].num_a2_file; j++ ) { free ( ( void * ) lib_array[i].a2_fname[j] ); } free ( ( void * ) lib_array[i].a2_fname ); } if ( lib_array[i].num_q1_file ) { /* printf("%d read1 fastq files\n",lib_array[i].num_q1_file); */ for ( j = 0; j < lib_array[i].num_q1_file; j++ ) { free ( ( void * ) lib_array[i].q1_fname[j] ); } free ( ( void * ) lib_array[i].q1_fname ); } if ( lib_array[i].num_q2_file ) { /* printf("%d read2 fastq files\n",lib_array[i].num_q2_file); */ for ( j = 0; j < lib_array[i].num_q2_file; j++ ) { free ( ( void * ) lib_array[i].q2_fname[j] ); } free ( ( void * ) lib_array[i].q2_fname ); } } num_libs = 0; free ( ( void * ) lib_array ); lib_array = NULL; /* avoid a double free if free_libs() is called again */ } void alloc_pe_mem ( int gradsCounter ) { if ( gradsCounter ) { pes = ( PE_INFO * ) ckalloc ( gradsCounter * sizeof ( PE_INFO ) ); } } void free_pe_mem() { if ( pes ) { free ( ( void * ) pes ); pes = NULL; } }
/* * 127mer/loadGraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */
#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"
#define preARCBLOCKSIZE 100000
static void loadArcs ( char * graphfile ); static void loadContig ( char * graphfile ); /* void loadUpdatedVertex(char *graphfile) { char name[256],line[256]; FILE *fp; Kmer word,bal_word; int num_kmer,i; char ch; sprintf(name,"%s.updated.vertex",graphfile); fp = ckopen(name,"r"); while(fgets(line,sizeof(line),fp)!=NULL){ if(line[0] == 'V'){ sscanf(line+6, "%d %c %d",&num_kmer,&ch,&overlaplen); printf("there're %d kmers in vertex file\n",num_kmer); break; } } vt_array = (VERTEX *)ckalloc((2*num_kmer)*sizeof(VERTEX)); for(i=0;i< ... */
/* The head of uniqueLenSearch was lost in extraction; it is reconstructed here on the assumption that it mirrors lengthSearch below and matches the call in loadUpdatedEdges. */
int uniqueLenSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target ) { int mid, low, high; low = 1; high = num; while ( low <= high ) { mid = ( low + high ) / 2; if ( len_array[mid] == target ) { break; } else if ( target > len_array[mid] ) { low = mid + 1; } else { high = mid - 1; } } if ( low > high ) { return -1; } /* locate the first unflagged entry of the same length */ return flag_array[mid]++; } int lengthSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target ) { int mid, low, high, i; low = 1; high = num; while ( low <= high ) { mid = ( low + high ) / 2; if ( len_array[mid] == target ) { break; } else if ( target > len_array[mid] ) { low = mid + 1; } else { high = mid - 1; } } if ( low > high ) { return -1; } /* locate the first unflagged entry of the same length */ if ( !flag_array[mid] ) { for ( i = mid - 1; i > 0; i-- ) { if ( len_array[i] != len_array[mid] || flag_array[i] ) { break; } } flag_array[i + 1] = 1; return i + 1; } else { for ( i = mid + 1; i <= num; i++ ) { if ( !flag_array[i] ) { break; } } flag_array[i] = 1; return i; } } void quick_sort_int ( unsigned int * length_array, int low, int high ) { int i, j; unsigned int pivot; if ( low < high ) { pivot = length_array[low]; i = low; j = high; while ( i < j ) { while ( i < j && length_array[j] >= pivot ) { j--; } if ( i < j ) { length_array[i++] = length_array[j]; } while ( i < j && length_array[i] <= pivot ) { i++; } if ( i < j ) { length_array[j--] = length_array[i]; } } length_array[i] = pivot; quick_sort_int ( length_array, low, i - 1 ); quick_sort_int ( length_array, i + 
1, high ); } } void loadUpdatedEdges ( char * graphfile ) { char c, name[256], line[1024]; int bal_ed, cvg; FILE * fp, *out_fp; unsigned int num_ctgge, length, index = 0, num_kmer; unsigned int i = 0, j; int newIndex; unsigned int * length_array, *flag_array, diff_len; char * outfile = graphfile; long long cvgSum = 0; long long counter = 0; //get overlaplen from *.preGraphBasic sprintf ( name, "%s.preGraphBasic", graphfile ); fp = ckopen ( name, "r" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'V' ) { sscanf ( line + 6, "%d %c %d", &num_kmer, &c, &overlaplen ); printf ( "K = %d\n", overlaplen ); break; } } if ( ctg_short == 0 ) { ctg_short = overlaplen + 2; } fclose ( fp ); sprintf ( name, "%s.updated.edge", graphfile ); fp = ckopen ( name, "r" ); sprintf ( name, "%s.newContigIndex", outfile ); out_fp = ckopen ( name, "w" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'E' ) { sscanf ( line + 5, "%d", &num_ctgge ); printf ( "there're %d edge in edge file\n", num_ctgge ); break; } } index_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) ); length_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) ); flag_array = ( unsigned int * ) ckalloc ( ( num_ctgge + 1 ) * sizeof ( unsigned int ) ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { sscanf ( line + 7, "%d", &length ); index_array[++index] = length; length_array[++i] = length; } } num_ctg = index; orig2new = 1; //quick_sort_int(length_array,1,num_ctg); qsort ( & ( length_array[1] ), num_ctg, sizeof ( length_array[0] ), cmp_int ); //extract unique length diff_len = 0; for ( i = 1; i <= num_ctg; i++ ) { for ( j = i + 1; j <= num_ctg; j++ ) if ( length_array[j] != length_array[i] ) { break; } length_array[++diff_len] = length_array[i]; flag_array[diff_len] = i; i = j - 1; } /* for(i=1;i<=num_ctg;i++) flag_array[i] = 0; */ contig_array = ( CONTIG * ) ckalloc ( ( num_ctg + 
1 ) * sizeof ( CONTIG ) ); //load edges index = 0; rewind ( fp ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { sscanf ( line, ">length %u,%d,%d", &length, &bal_ed, &cvg ); newIndex = uniqueLenSearch ( length_array, flag_array, diff_len, length ); index_array[++index] = newIndex; contig_array[newIndex].length = length; contig_array[newIndex].bal_edge = bal_ed + 1; contig_array[newIndex].downwardConnect = NULL; contig_array[newIndex].mask = 0; contig_array[newIndex].flag = 0; contig_array[newIndex].arcs = NULL; contig_array[newIndex].seq = NULL; contig_array[newIndex].multi = 0; contig_array[newIndex].inSubGraph = 0; contig_array[newIndex].cvg = cvg / 10; if ( cvg ) { counter += length; cvgSum += cvg * length; } fprintf ( out_fp, "%d %d %d\n", index, newIndex, contig_array[newIndex].bal_edge ); } } if ( counter ) { cvgAvg = cvgSum / counter / 10 > 2 ? cvgSum / counter / 10 : 3; } //mark repeats int bal_i; if ( maskRep ) { counter = 0; for ( i = 1; i <= num_ctg; i++ ) { bal_i = getTwinCtg ( i ); if ( ( contig_array[i].cvg + contig_array[bal_i].cvg ) > 4 * cvgAvg ) { contig_array[i].mask = 1; contig_array[bal_i].mask = 1; counter += 2; } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "average contig coverage is %d, %lld contig masked\n", cvgAvg, counter ); } counter = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } bal_i = getTwinCtg ( i ); if ( contig_array[i].length < ctg_short ) { contig_array[i].mask = 1; contig_array[bal_i].mask = 1; counter += 2; } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "Mask contigs shorter than %d, %lld contig masked\n", ctg_short, counter ); loadArcs ( graphfile ); //tipsCount(); loadContig ( graphfile ); printf ( "done loading updated edges\n" ); fflush ( stdout ); free ( ( void * ) length_array ); free ( ( void * ) flag_array ); fclose ( fp ); fclose ( out_fp ); } static void add1Arc ( unsigned int from_ed, unsigned int to_ed, unsigned int weight ) { preARC * 
parc; unsigned int from_c = index_array[from_ed]; unsigned int to_c = index_array[to_ed]; parc = allocatePreArc ( to_c ); parc->multiplicity = weight; parc->next = contig_array[from_c].arcs; contig_array[from_c].arcs = parc; } void loadArcs ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int target, weight; unsigned int from_ed; char * seg; sprintf ( name, "%s.Arc", graphfile ); fp = ckopen ( name, "r" ); createPreArcMemManager(); arcCounter = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { seg = strtok ( line, " " ); from_ed = atoi ( seg ); //printf("%d\n",from_ed); while ( ( seg = strtok ( NULL, " " ) ) != NULL ) { target = atoi ( seg ); seg = strtok ( NULL, " " ); weight = atoi ( seg ); add1Arc ( from_ed, target, weight ); } } printf ( "%lld arcs loaded\n", arcCounter ); fclose ( fp ); } void loadContig ( char * graphfile ) { char c, name[256], line[1024], *tightSeq = NULL; FILE * fp; int n = 0, length, index = -1, edgeno; unsigned int i; unsigned int newIndex; sprintf ( name, "%s.contig", graphfile ); fp = ckopen ( name, "r" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index >= 0 ) { newIndex = index_array[edgeno]; contig_array[newIndex].seq = tightSeq; } n = 0; index++; sscanf ( line + 1, "%d %s %d", &edgeno, name, &length ); //printf("contig %d, length %d\n",edgeno,length); tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); } else { for ( i = 0; i < strlen ( line ); i++ ) { if ( line[i] >= 'a' && line[i] <= 'z' ) { c = base2int ( line[i] - 'a' + 'A' ); writeChar2tightString ( c, tightSeq, n++ ); } else if ( line[i] >= 'A' && line[i] <= 'Z' ) { c = base2int ( line[i] ); writeChar2tightString ( c, tightSeq, n++ ); } } } } if ( index >= 0 ) { newIndex = index_array[edgeno]; contig_array[newIndex].seq = tightSeq; } printf ( "input %d contigs\n", index + 1 ); fclose ( fp ); //printf("the %dth contig with index 107\n",index); } void freeContig_array() { if ( 
!contig_array ) { return; } unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].seq ) { free ( ( void * ) contig_array[i].seq ); } if ( contig_array[i].closeReads ) { freeStack ( contig_array[i].closeReads ); } } free ( ( void * ) contig_array ); contig_array = NULL; } /* void loadCvg(char *graphfile) { char name[256],line[1024]; FILE *fp; int cvg; unsigned int newIndex,edgeno,bal_ctg; sprintf(name,"%s.contigCVG",graphfile); fp = fopen(name,"r"); if(!fp){ printf("contig coverage file %s is not found!\n",name); return; } while(fgets(line,sizeof(line),fp)!=NULL){ if(line[0]=='>'){ sscanf(line+1,"%d %d",&edgeno,&cvg); newIndex = index_array[edgeno]; cvg = cvg <= 255 ? cvg:255; contig_array[newIndex].multi = cvg; bal_ctg = getTwinCtg(newIndex); contig_array[bal_ctg].multi= cvg; } } fclose(fp); } */
/* * 127mer/loadPath.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void add1marker2edge ( unsigned int edgeno, long long readid ) { if ( edge_array[edgeno].multi == 255 ) { return; } unsigned int bal_ed = getTwinEdge ( edgeno ); unsigned char counter = edge_array[edgeno].multi++; edge_array[edgeno].markers[counter] = readid; counter = edge_array[bal_ed].multi++; edge_array[bal_ed].markers[counter] = -readid; } boolean loadPath ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int i, bal_ed, num1, edgeno, num2; long long markCounter = 0, readid = 0; char * seg; sprintf ( name, "%s.markOnEdge", graphfile ); fp = fopen ( name, "r" ); if ( !fp ) { return 0; } for ( i = 1; i <= num_ed; i++ ) { edge_array[i].multi = 0; } for ( i = 1; i <= num_ed; i++ ) { fscanf ( fp, "%d", &num1 ); if ( EdSmallerThanTwin ( i ) ) { fscanf ( fp, "%d", &num2 ); bal_ed = getTwinEdge ( i ); if ( num1 + num2 >= 255 ) { edge_array[i].multi = 255; edge_array[bal_ed].multi = 255; } else { edge_array[i].multi = num1 + num2; edge_array[bal_ed].multi = num1 + num2; markCounter += 2 * ( num1 + num2 ); } i++; } else { if ( 2 * num1 >= 255 ) { edge_array[i].multi = 255; } else { edge_array[i].multi = 2 * num1; markCounter += 2 * num1; } } } fclose ( fp ); printf ( "%lld markers overall\n", markCounter ); markersArray = ( long long * ) ckalloc ( markCounter * sizeof ( long long ) ); markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } edge_array[i].markers = markersArray + markCounter; markCounter += edge_array[i].multi; edge_array[i].multi = 0; } sprintf ( name, "%s.path", graphfile ); fp = fopen ( name, "r" ); if ( !fp ) { return 0; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { //printf("%s",line); readid++; seg = strtok ( line, " " ); while ( seg ) { edgeno = atoi ( seg ); //printf("%s, %d\n",seg,edgeno); add1marker2edge ( edgeno, readid ); seg = strtok ( NULL, " " ); } } fclose ( fp ); 
markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } markCounter += edge_array[i].multi; } printf ( "%lld marks loaded\n", markCounter ); return 1; } boolean loadPathBin ( char * graphfile ) { FILE * fp; char name[256]; unsigned int i, bal_ed, num1, num2; long long markCounter = 0, readid = 0; unsigned char seg, ch; unsigned int * freadBuf; sprintf ( name, "%s.markOnEdge", graphfile ); fp = fopen ( name, "r" ); if ( !fp ) { return 0; } for ( i = 1; i <= num_ed; i++ ) { edge_array[i].multi = 0; edge_array[i].markers = NULL; } for ( i = 1; i <= num_ed; i++ ) { fscanf ( fp, "%d", &num1 ); if ( EdSmallerThanTwin ( i ) ) { fscanf ( fp, "%d", &num2 ); bal_ed = getTwinEdge ( i ); if ( num1 + num2 >= 255 ) { edge_array[i].multi = 255; edge_array[bal_ed].multi = 255; } else { edge_array[i].multi = num1 + num2; edge_array[bal_ed].multi = num1 + num2; markCounter += 2 * ( num1 + num2 ); } i++; } else { if ( 2 * num1 >= 255 ) { edge_array[i].multi = 255; } else { edge_array[i].multi = 2 * num1; markCounter += 2 * num1; } } } fclose ( fp ); printf ( "%lld markers overall\n", markCounter ); markersArray = ( long long * ) ckalloc ( markCounter * sizeof ( long long ) ); markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } edge_array[i].markers = markersArray + markCounter; markCounter += edge_array[i].multi; edge_array[i].multi = 0; } sprintf ( name, "%s.path", graphfile ); fp = fopen ( name, "rb" ); if ( !fp ) { return 0; } freadBuf = ( unsigned int * ) ckalloc ( ( maxReadLen - overlaplen + 1 ) * sizeof ( unsigned int ) ); while ( fread ( &ch, sizeof ( char ), 1, fp ) == 1 ) { //printf("%s",line); if ( fread ( freadBuf, sizeof ( unsigned int ), ch, fp ) != ch ) { break; } readid++; for ( seg = 0; seg < ch; seg++ ) { add1marker2edge ( freadBuf[seg], readid ); } } fclose ( fp ); markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( edge_array[i].multi == 255 ) { continue; } markCounter += 
edge_array[i].multi; } printf ( "%lld markers loaded\n", markCounter ); free ( ( void * ) freadBuf ); return 1; }
/* * 127mer/loadPreGraph.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */
#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"
static void loadPreArcs ( char * graphfile ); int cmp_vertex ( const void * a, const void * b ) { VERTEX * A, *B; A = ( VERTEX * ) a; B = ( VERTEX * ) b; if ( KmerLarger ( A->kmer, B->kmer ) ) { return 1; } else if ( KmerEqual ( A->kmer, B->kmer ) ) { return 0; } else { return -1; } } void loadVertex ( char * graphfile ) { char name[256], line[256]; FILE * fp; Kmer word, bal_word, temp; int num_kmer, i; char ch; sprintf ( name, "%s.preGraphBasic", graphfile ); fp = ckopen ( name, "r" ); while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'V' ) { sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen ); printf ( "there're %d kmers in vertex file\n", num_kmer ); } else if ( line[0] == 'E' ) { sscanf ( line + 5, "%d", &num_ed ); printf ( "there're %d edge in edge file\n", num_ed ); } else if ( line[0] == 'M' ) { sscanf ( line, "MaxReadLen %d MinReadLen %d MaxNameLen %d", &maxReadLen, &minReadLen, &maxNameLen ); } } fclose ( fp ); 
vt_array = ( VERTEX * ) ckalloc ( ( 2 * num_kmer ) * sizeof ( VERTEX ) ); sprintf ( name, "%s.vertex", graphfile ); fp = ckopen ( name, "r" ); for ( i = 0; i < num_kmer; i++ ) { fscanf ( fp, "%llx %llx %llx %llx", & ( word.high1 ), & ( word.low1 ), & ( word.high2 ), & ( word.low2 ) ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerSmaller ( word, bal_word ) ) { vt_array[i].kmer = word; } else { vt_array[i].kmer = bal_word; } } temp = vt_array[num_kmer - 1].kmer; qsort ( &vt_array[0], num_kmer, sizeof ( vt_array[0] ), cmp_vertex ); printf ( "done sort\n" ); fclose ( fp ); for ( i = 0; i < num_kmer; i++ ) { bal_word = reverseComplement ( vt_array[i].kmer, overlaplen ); vt_array[i + num_kmer].kmer = bal_word; } num_vt = num_kmer; } int bisearch ( VERTEX * vts, int num, Kmer target ) { int mid, low, high; low = 0; high = num - 1; while ( low <= high ) { mid = ( low + high ) / 2; if ( KmerEqual ( vts[mid].kmer, target ) ) { break; } else if ( KmerLarger ( target, vts[mid].kmer ) ) { low = mid + 1; } else { high = mid - 1; } } if ( low <= high ) { return mid; } else { return -1; } } int kmer2vt ( Kmer kmer ) { Kmer bal_word; int vt_id; bal_word = reverseComplement ( kmer, overlaplen ); if ( KmerSmaller ( kmer, bal_word ) ) { vt_id = bisearch ( &vt_array[0], num_vt, kmer ); if ( vt_id < 0 ) { printf ( "no vt found for kmer %llx %llx %llx %llx\n", kmer.high1, kmer.low1, kmer.high2, kmer.low2 ); } return vt_id; } else { vt_id = bisearch ( &vt_array[0], num_vt, bal_word ); if ( vt_id >= 0 ) { vt_id += num_vt; } else { printf ( "no vt found for kmer %llx %llx %llx %llx\n", kmer.high1, kmer.low1, kmer.high2, kmer.low2 ); } return vt_id; } } // create an edge with index edgeno+1 reverse complememtary to edge with index edgeno static void buildReverseComplementEdge ( unsigned int edgeno ) { int length = edge_array[edgeno].length; int i, index = 0; char * sequence, ch, *tightSeq; Kmer kmer = vt_array[edge_array[edgeno].from_vt].kmer; sequence = ( char * ) ckalloc ( ( 
overlaplen + length ) * sizeof ( char ) ); int bit1, bit2, bit3, bit4; if ( overlaplen < 32 ) {bit4 = overlaplen; bit3 = 0; bit2 = 0; bit1 = 0;} if ( overlaplen >= 32 && overlaplen < 64 ) {bit4 = 32; bit3 = overlaplen - 32; bit2 = 0; bit1 = 0;} if ( overlaplen >= 64 && overlaplen < 96 ) {bit4 = 32; bit3 = 32; bit2 = overlaplen - 64; bit1 = 0;} if ( overlaplen >= 96 && overlaplen < 128 ) {bit4 = 32; bit3 = 32; bit2 = 32; bit1 = overlaplen - 96;} for ( i = bit1 - 1; i >= 0; i-- ) { ch = kmer.high1 & 0x3; kmer.high1 >>= 2; sequence[i] = ch; } for ( i = bit2 - 1; i >= 0; i-- ) { ch = kmer.low1 & 0x3; kmer.low1 >>= 2; sequence[i + bit1] = ch; } for ( i = bit3 - 1; i >= 0; i-- ) { ch = kmer.high2 & 0x3; kmer.high2 >>= 2; sequence[i + bit1 + bit2] = ch; } for ( i = bit4 - 1; i >= 0; i-- ) { ch = kmer.low2 & 0x3; kmer.low2 >>= 2; sequence[i + bit1 + bit2 + bit3] = ch; } for ( i = 0; i < length; i++ ) { sequence[i + overlaplen] = getCharInTightString ( edge_array[edgeno].seq, i ); } tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( i = length - 1; i >= 0; i-- ) { writeChar2tightString ( int_comp ( sequence[i] ), tightSeq, index++ ); } edge_array[edgeno + 1].length = length; edge_array[edgeno + 1].cvg = edge_array[edgeno].cvg; kmer = vt_array[edge_array[edgeno].from_vt].kmer; edge_array[edgeno + 1].to_vt = kmer2vt ( reverseComplement ( kmer, overlaplen ) ); kmer = vt_array[edge_array[edgeno].to_vt].kmer; edge_array[edgeno + 1].from_vt = kmer2vt ( reverseComplement ( kmer, overlaplen ) ); edge_array[edgeno + 1].seq = tightSeq; edge_array[edgeno + 1].bal_edge = 0; edge_array[edgeno + 1].rv = NULL; edge_array[edgeno + 1].arcs = NULL; edge_array[edgeno + 1].flag = 0; edge_array[edgeno + 1].deleted = 0; free ( ( void * ) sequence ); } void loadEdge ( char * graphfile ) { char c, name[256], line[1024], str[32]; char * tightSeq = NULL; FILE * fp; Kmer from_kmer, to_kmer; int n = 0, i, length, cvg, index = -1, bal_ed, edgeno; int linelen; unsigned int j; 
sprintf ( name, "%s.edge", graphfile ); fp = ckopen ( name, "r" ); num_ed_limit = 1.2 * num_ed; edge_array = ( EDGE * ) ckalloc ( ( num_ed_limit + 1 ) * sizeof ( EDGE ) ); for ( j = num_ed + 1; j <= num_ed_limit; j++ ) { edge_array[j].seq = NULL; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index >= 0 ) { edgeno = index + 1; edge_array[edgeno].length = length; edge_array[edgeno].cvg = cvg; edge_array[edgeno].from_vt = kmer2vt ( from_kmer ); edge_array[edgeno].to_vt = kmer2vt ( to_kmer ); edge_array[edgeno].seq = tightSeq; edge_array[edgeno].bal_edge = bal_ed + 1; edge_array[edgeno].rv = NULL; edge_array[edgeno].arcs = NULL; edge_array[edgeno].flag = 0; edge_array[edgeno].deleted = 0; if ( bal_ed ) { buildReverseComplementEdge ( edgeno ); index++; } } n = 0; index++; sscanf ( line + 7, "%d,%llx %llx %llx %llx,%llx %llx %llx %llx,%s %d,%d", &length, & ( from_kmer.high1 ), & ( from_kmer.low1 ), & ( from_kmer.high2 ), & ( from_kmer.low2 ), & ( to_kmer.high1 ), & ( to_kmer.low1 ), & ( to_kmer.high2 ), & ( to_kmer.low2 ), str, &cvg, &bal_ed ); tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); } else { linelen = strlen ( line ); for ( i = 0; i < linelen; i++ ) { if ( line[i] >= 'a' && line[i] <= 'z' ) { c = base2int ( line[i] - 'a' + 'A' ); writeChar2tightString ( c, tightSeq, n++ ); } else if ( line[i] >= 'A' && line[i] <= 'Z' ) { c = base2int ( line[i] ); writeChar2tightString ( c, tightSeq, n++ ); } } } } if ( index >= 0 ) { edgeno = index + 1; edge_array[edgeno].length = length; edge_array[edgeno].cvg = cvg; edge_array[edgeno].from_vt = kmer2vt ( from_kmer ); edge_array[edgeno].to_vt = kmer2vt ( to_kmer ); edge_array[edgeno].seq = tightSeq; edge_array[edgeno].bal_edge = bal_ed + 1; if ( bal_ed ) { buildReverseComplementEdge ( edgeno ); index++; } } printf ( "input %d edges\n", index + 1 ); fclose ( fp ); createArcMemo(); loadPreArcs ( graphfile ); } unsigned int getTwinEdge ( unsigned int edgeno ) { 
return edgeno + edge_array[edgeno].bal_edge - 1; } boolean EdSmallerThanTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge > 1; } boolean EdLargerThanTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge < 1; } boolean EdSameAsTwin ( unsigned int edgeno ) { return edge_array[edgeno].bal_edge == 1; } static void add1Arc ( unsigned int from_ed, unsigned int to_ed, unsigned int weight ) { unsigned int bal_fe = getTwinEdge ( from_ed ); unsigned int bal_te = getTwinEdge ( to_ed ); if ( from_ed > num_ed || to_ed > num_ed || bal_fe > num_ed || bal_te > num_ed ) { return; } ARC * parc, *bal_parc; //both arcs already exist parc = getArcBetween ( from_ed, to_ed ); if ( parc ) { bal_parc = parc->bal_arc; parc->multiplicity += weight; bal_parc->multiplicity += weight; return; } //create new arcs parc = allocateArc ( to_ed ); parc->multiplicity = weight; parc->prev = NULL; if ( edge_array[from_ed].arcs ) { edge_array[from_ed].arcs->prev = parc; } parc->next = edge_array[from_ed].arcs; edge_array[from_ed].arcs = parc; // A->A' if ( bal_te == from_ed ) { //printf("preArc from A to A'\n"); parc->bal_arc = parc; parc->multiplicity += weight; return; } bal_parc = allocateArc ( bal_fe ); bal_parc->multiplicity = weight; bal_parc->prev = NULL; if ( edge_array[bal_te].arcs ) { edge_array[bal_te].arcs->prev = bal_parc; } bal_parc->next = edge_array[bal_te].arcs; edge_array[bal_te].arcs = bal_parc; //link them to each other parc->bal_arc = bal_parc; bal_parc->bal_arc = parc; } void loadPreArcs ( char * graphfile ) { FILE * fp; char name[256], line[1024]; unsigned int target, weight; unsigned int from_ed; char * seg; sprintf ( name, "%s.preArc", graphfile ); fp = ckopen ( name, "r" ); arcCounter = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { seg = strtok ( line, " " ); from_ed = atoi ( seg ); while ( ( seg = strtok ( NULL, " " ) ) != NULL ) { target = atoi ( seg ); seg = strtok ( NULL, " " ); weight = atoi ( seg ); add1Arc ( from_ed, target, weight 
);
        }
    }

    printf ( "%lli pre-arcs loaded\n", arcCounter );
    fclose ( fp );
}

void free_edge_array ( EDGE * ed_array, int ed_num )
{
    int i;

    for ( i = 1; i <= ed_num; i++ )
        if ( ed_array[i].seq ) {
            free ( ( void * ) ed_array[i].seq );
        }

    free ( ( void * ) ed_array );
}

/*
 * 127mer/localAsm.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define CTGendLen 35  // shouldn't be larger than max_read_len
#define UPlimit 5000
#define MaxRouteNum 10

static void kmerSet_mark ( KmerSet * set );
static void trace4Repeat ( Kmer currW, int steps, int min, int max, int * num_route,
                           KmerSet * kset, Kmer kmerDest, int overlap, Kmer WORDF,
                           int * traceCounter, int maxRoute, kmer_t ** soFarNode,
                           short * multiOccu1, short * multiOccu2, int * routeLens,
                           char ** foundRoutes, char * soFarSeq, long long * soFarLinks,
                           double * avgLinks );

// shift the kmer right by one base and put ch into the highest base position
static Kmer prevKmerLocal ( Kmer next, char ch, int overlap )
{
    Kmer word = KmerRightBitMoveBy2 ( next );

    if ( 2 * ( overlap - 1 ) < 64 ) {
        word.low2 |= ( ( ( ubyte8 ) ch ) << 2 * ( overlap - 1 ) );
    }

    if ( 2 * ( overlap - 1 ) >= 64 && 2 * ( overlap - 1 ) < 128 ) {
        word.high2 |= ( ( ubyte8 ) ch ) << ( 2 * ( overlap - 1 ) - 64 );
    }

    if ( 2 * ( overlap - 1 ) >= 128 && 2 * ( overlap - 1 ) < 192 ) {
        word.low1 |= ( ( ubyte8 ) ch ) << ( 2 * ( overlap - 1 ) - 128 );
    }

    if ( 2 * ( overlap - 1 ) >= 192 && 2 * ( overlap - 1 ) < 256 ) {
        word.high1 |= ( ( ubyte8 ) ch ) << ( 2 * ( overlap - 1 ) - 192 );
    }

    return word;
}

// shift the kmer left by one base, mask it back to kmer length and append ch
static Kmer nextKmerLocal ( Kmer prev, char ch, Kmer WordFilter )
{
    Kmer word = KmerLeftBitMoveBy2 ( prev );
    word = KmerAnd ( word, WordFilter );
    word.low2 |= ch;
    return word;
}

// insert one kmer and merge its inEdge flag: 1 and 2 seen on the same node become 3
static void singleKmer ( int t, KmerSet * kset, int flag, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer )
{
    kmer_t * pos;
    put_kmerset ( kset, kmerBuffer[t], prevcBuffer[t], nextcBuffer[t], &pos );

    if ( pos->inEdge == flag ) {
        return;
    } else if ( pos->inEdge == 0 ) {
        pos->inEdge = flag;
    } else if ( pos->inEdge == 1 && flag == 2 ) {
        pos->inEdge = 3;
    } else if ( pos->inEdge == 2 && flag == 1 ) {
        pos->inEdge = 3;
    }
}

static void putKmer2DBgraph ( KmerSet * kset, int flag, int kmer_c, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer )
{
    int t;

    for ( t = 0; t < kmer_c; t++ ) {
        singleKmer ( t, kset, flag, kmerBuffer,
prevcBuffer, nextcBuffer ); } } static void getSeqFromRead ( READNEARBY read, char * src_seq ) { int len_seq = read.len; int j; char * tightSeq = ( char * ) darrayGet ( readSeqInGap, read.seqStarter ); for ( j = 0; j < len_seq; j++ ) { src_seq[j] = getCharInTightString ( tightSeq, j ); } } static void chopKmer4Ctg ( Kmer * kmerCtg, int lenCtg, int overlap, char * src_seq, Kmer WORDF ) { int index, j; Kmer word; word.high1 = word.low1 = word.high2 = word.low2 = 0; for ( index = 0; index < overlap; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= src_seq[index]; } index = 0; kmerCtg[index++] = word; for ( j = 1; j <= lenCtg - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); kmerCtg[index++] = word; } } static void chopKmer4read ( int len_seq, int overlap, char * src_seq, char * bal_seq, Kmer * kmerBuffer, char * prevcBuffer, char * nextcBuffer, int * kmer_c, Kmer WORDF ) { int j, bal_j; Kmer word, bal_word; int index; char InvalidCh = 4; if ( len_seq < overlap + 1 ) { *kmer_c = 0; return; } word.high1 = word.low1 = word.high2 = word.low2 = 0; for ( index = 0; index < overlap; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlap ); bal_j = len_seq - 0 - overlap; // 0; index = 0; if ( KmerSmaller ( word, bal_word ) ) { kmerBuffer[index] = word; prevcBuffer[index] = InvalidCh; nextcBuffer[index++] = src_seq[0 + overlap]; } else { kmerBuffer[index] = bal_word; prevcBuffer[index] = bal_seq[bal_j - 1]; nextcBuffer[index++] = InvalidCh; } for ( j = 1; j <= len_seq - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); bal_j = len_seq - j - overlap; // j; bal_word = prevKmerLocal ( bal_word, bal_seq[bal_j], overlap ); if ( KmerSmaller ( word, bal_word ) ) { kmerBuffer[index] = word; prevcBuffer[index] = src_seq[j - 1]; if ( j < len_seq - overlap ) { 
nextcBuffer[index++] = src_seq[j + overlap]; } else { nextcBuffer[index++] = InvalidCh; } //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node kmerBuffer[index] = bal_word; if ( bal_j > 0 ) { prevcBuffer[index] = bal_seq[bal_j - 1]; } else { prevcBuffer[index] = InvalidCh; } nextcBuffer[index++] = bal_seq[bal_j + overlap]; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } *kmer_c = index; } static void headTightStr ( char * tightStr, int length, int start, int headLen, int revS, char * src_seq ) { int i, index = 0; if ( !revS ) { for ( i = start; i < start + headLen; i++ ) { src_seq[index++] = getCharInTightString ( tightStr, i ); } } else { for ( i = length - 1 - start; i >= length - headLen - start; i-- ) { src_seq[index++] = int_comp ( getCharInTightString ( tightStr, i ) ); } } } static int getSeqFromCtg ( CTGinSCAF * ctg, boolean fromHead, unsigned int len, int originOverlap, char * src_seq ) { unsigned int ctgId = ctg->ctgID; unsigned int bal_ctg = getTwinCtg ( ctgId ); if ( contig_array[ctgId].length < 1 ) { return 0; } unsigned int length = contig_array[ctgId].length + originOverlap; len = len < length ? len : length; if ( fromHead ) { if ( contig_array[ctgId].seq ) { headTightStr ( contig_array[ctgId].seq, length, 0, len, 0, src_seq ); } else { headTightStr ( contig_array[bal_ctg].seq, length, 0, len, 1, src_seq ); } } else { if ( contig_array[ctgId].seq ) { headTightStr ( contig_array[ctgId].seq, length, length - len, len, 0, src_seq ); } else { headTightStr ( contig_array[bal_ctg].seq, length, length - len, len, 1, src_seq ); } } return len; } static KmerSet * readsInGap2DBgraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int originOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, Kmer WordFilter ) { int kmer_c; Kmer * kmerBuffer; char * nextcBuffer, *prevcBuffer; int i; int buffer_size = maxReadLen > CTGendLen ? 
maxReadLen : CTGendLen; KmerSet * kmerS = NULL; int lenCtg1; int lenCtg2; char * bal_seq; char * src_seq; src_seq = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); bal_seq = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); lenCtg1 = getSeqFromCtg ( ctg1, 0, CTGendLen, originOverlap, src_seq ); lenCtg2 = getSeqFromCtg ( ctg2, 1, CTGendLen, originOverlap, src_seq ); if ( lenCtg1 <= overlap || lenCtg2 <= overlap ) { free ( ( void * ) src_seq ); free ( ( void * ) bal_seq ); return kmerS; } kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); prevcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); nextcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); kmerS = init_kmerset ( 1024, 0.77f ); for ( i = 0; i < num; i++ ) { getSeqFromRead ( rdArray[i], src_seq ); chopKmer4read ( rdArray[i].len, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 0, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); } lenCtg1 = getSeqFromCtg ( ctg1, 0, CTGendLen, originOverlap, src_seq ); chopKmer4Ctg ( kmerCtg1, lenCtg1, overlap, src_seq, WordFilter ); chopKmer4read ( lenCtg1, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 1, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); lenCtg2 = getSeqFromCtg ( ctg2, 1, CTGendLen, originOverlap, src_seq ); chopKmer4Ctg ( kmerCtg2, lenCtg2, overlap, src_seq, WordFilter ); chopKmer4read ( lenCtg2, overlap, src_seq, bal_seq, kmerBuffer, prevcBuffer, nextcBuffer, &kmer_c, WordFilter ); putKmer2DBgraph ( kmerS, 2, kmer_c, kmerBuffer, prevcBuffer, nextcBuffer ); /* if(ctg1->ctgID==3733&&ctg2->ctgID==3067){ for(i=0;i= 32 && overlap < 64 ) {bit4 = 32; bit3 = overlap - 32; bit2 = 0; bit1 = 0;} if ( overlap >= 64 && overlap < 96 ) {bit4 = 32; bit3 = 32; bit2 = overlap - 64; bit1 = 0;} if ( overlap >= 96 && overlap < 128 ) {bit4 = 32; bit3 = 32; bit2 = 32; bit1 = overlap - 96;} for ( i = bit1 - 
1; i >= 0; i-- ) {
        ch = kmer.high1 & 0x3;
        kmer.high1 >>= 2;
        kmerSeq[i] = ch;
    }

    for ( i = bit2 - 1; i >= 0; i-- ) {
        ch = kmer.low1 & 0x3;
        kmer.low1 >>= 2;
        kmerSeq[i + bit1] = ch;
    }

    for ( i = bit3 - 1; i >= 0; i-- ) {
        ch = kmer.high2 & 0x3;
        kmer.high2 >>= 2;
        kmerSeq[i + bit1 + bit2] = ch;
    }

    for ( i = bit4 - 1; i >= 0; i-- ) {
        ch = kmer.low2 & 0x3;
        kmer.low2 >>= 2;
        kmerSeq[i + bit1 + bit2 + bit3] = ch;
    }

    for ( i = 0; i < overlap; i++ ) {
        fprintf ( fp, "%c", int2base ( ( int ) kmerSeq[i] ) );
    }
}

// mark nodes that have exactly one incoming and one outgoing link as linear
static void kmerSet_mark ( KmerSet * set )
{
    int i, in_num, out_num, cvgSingle;
    kmer_t * rs;
    long long counter = 0, linear = 0;
    Kmer word;
    set->iter_ptr = 0;

    while ( set->iter_ptr < set->size ) {
        if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) {
            in_num = out_num = 0;
            rs = set->array + set->iter_ptr;
            word = rs->seq;

            for ( i = 0; i < 4; i++ ) {
                cvgSingle = get_kmer_left_cov ( *rs, i );

                if ( cvgSingle > 0 ) {
                    in_num++;
                }

                cvgSingle = get_kmer_right_cov ( *rs, i );

                if ( cvgSingle > 0 ) {
                    out_num++;
                }
            }

            if ( rs->single ) {
                counter++;
            }

            if ( in_num == 1 && out_num == 1 ) {
                rs->linear = 1;
                linear++;
            }
        }

        set->iter_ptr ++;
    }

    //printf("Allocated %ld node, %ld single nodes, %ld linear\n",(long)count_kmerset(set),counter,linear);
}

// look up a kmer by its canonical (smaller) strand; returns NULL if it is absent
static kmer_t * searchNode ( Kmer word, KmerSet * kset, int overlap )
{
    Kmer bal_word = reverseComplement ( word, overlap );
    kmer_t * node;
    boolean found;

    if ( KmerSmaller ( word, bal_word ) ) {
        found = search_kmerset ( kset, word, &node );
    } else {
        found = search_kmerset ( kset, bal_word, &node );
    }

    if ( found ) {
        return node;
    } else {
        return NULL;
    }
}

// linear scan for currW among the kmers of a contig end; returns its index or -1
static int searchKmerOnCtg ( Kmer currW, Kmer * kmerDest, int num )
{
    int i;

    for ( i = 0; i < num; i++ ) {
        if ( KmerEqual ( currW, kmerDest[i] ) ) {
            return i;
        }
    }

    return -1;
}

// pick one of n items; the random draw (drand48) is commented out, so this always takes the last item
static int nPick1 ( int * array, int n )
{
    int m, i;
    m = n - 1; //(int)(drand48()*n);
    int value = array[m];

    for ( i = m; i < n - 1; i++ ) {
        array[i] = array[i + 1];
    }

    return value;
}

static void
traceAlongDBgraph ( Kmer currW, int steps, int min, int max, int * num_route, KmerSet * kset, Kmer * kmerDest, int num, int overlap, Kmer WORDF, char ** foundRoutes, int * routeEndOnCtg2, int * routeLens, char * soFarSeq, int * traceCounter, int maxRoute, kmer_t ** soFarNode, boolean * multiOccu, long long * soFarLinks, double * avgLinks ) { ( *traceCounter ) ++; if ( *traceCounter > UPlimit ) { /* if(overlap==19&&kmerDest[0]==pubKmer) printf("UPlimit\n"); */ return; } if ( steps > max || *num_route >= maxRoute ) { /* if(overlap==19&&kmerDest[0]==pubKmer) printf("max steps/maxRoute\n"); */ return; } Kmer word = reverseComplement ( currW, overlap ); boolean isSmaller = KmerSmaller ( currW, word ); int i; char ch; unsigned char links; if ( isSmaller ) { word = currW; } kmer_t * node; boolean found = search_kmerset ( kset, word, &node ); if ( !found ) { printf ( "%s Trace: can't find kmer %llx %llx %llx %llx (input %llx %llx %llx %llx) at step %d\n", __FUNCTION__, word.high1, word.low1, word.high2, word.low2, currW.high1, currW.low1, currW.high2, currW.low2, steps ); return; } if ( node->twin > 1 ) { return; } if ( soFarNode ) { soFarNode[steps] = node; } if ( steps > 0 ) { soFarSeq[steps - 1] = lastCharInKmer ( currW ); } int index, end; int linkCounter = *soFarLinks; if ( steps >= min && node->inEdge > 1 && ( end = searchKmerOnCtg ( currW, kmerDest, num ) ) >= 0 ) { index = *num_route; if ( steps > 0 ) { avgLinks[index] = ( double ) linkCounter / steps; } else { avgLinks[index] = 0; } //find node that appears more than once in the path multiOccu[index] = 0; for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } for ( i = 0; i < steps + 1; i++ ) { if ( soFarNode[i]->deleted ) { multiOccu[index] = 1; break; } soFarNode[i]->deleted = 1; } routeEndOnCtg2[index] = end; routeLens[index] = steps; char * array = foundRoutes[index]; for ( i = 0; i < steps; i++ ) { array[i] = soFarSeq[i]; } if ( i < max ) { array[i] = 4; } //indicate the end of the sequence 
*num_route = ++index; return; } steps++; if ( isSmaller ) { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_right_cov ( *node, ch ); if ( !links ) { continue; } *soFarLinks = linkCounter + links; word = nextKmerLocal ( currW, ch, WORDF ); traceAlongDBgraph ( word, steps, min, max, num_route, kset, kmerDest, num, overlap, WORDF, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, traceCounter, maxRoute, soFarNode, multiOccu, soFarLinks, avgLinks ); } } else { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_left_cov ( *node, ch ); if ( !links ) { continue; } *soFarLinks = linkCounter + links; word = nextKmerLocal ( currW, int_comp ( ch ), WORDF ); traceAlongDBgraph ( word, steps, min, max, num_route, kset, kmerDest, num, overlap, WORDF, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, traceCounter, maxRoute, soFarNode, multiOccu, soFarLinks, avgLinks ); } } } static int searchFgap ( KmerSet * kset, CTGinSCAF * ctg1, CTGinSCAF * ctg2, Kmer * kmerCtg1, Kmer * kmerCtg2, unsigned int origOverlap, int overlap, DARRAY * gapSeqArray, int len1, int len2, Kmer WordFilter, int * offset1, int * offset2, char * seqGap, int * cut1, int * cut2 ) { int i; int ret = 0; kmer_t * node, **soFarNode; int num_route; int gapLen = ctg2->start - ctg1->end - origOverlap + overlap; int min = gapLen - GLDiff > 0 ? gapLen - GLDiff : 0; //0531 int max = gapLen + GLDiff < 10 ? 
10 : gapLen + GLDiff; char ** foundRoutes; char * soFarSeq; int traceCounter; int * routeEndOnCtg2; int * routeLens; boolean * multiOccu; long long soFarLinks; double * avgLinks; //mask linear internal linear kmer on contig1 end routeEndOnCtg2 = ( int * ) ckalloc ( MaxRouteNum * sizeof ( int ) ); routeLens = ( int * ) ckalloc ( MaxRouteNum * sizeof ( int ) ); multiOccu = ( boolean * ) ckalloc ( MaxRouteNum * sizeof ( boolean ) ); short * MULTI1 = ( short * ) ckalloc ( MaxRouteNum * sizeof ( short ) ); short * MULTI2 = ( short * ) ckalloc ( MaxRouteNum * sizeof ( short ) ); soFarSeq = ( char * ) ckalloc ( max * sizeof ( char ) ); soFarNode = ( kmer_t ** ) ckalloc ( ( max + 1 ) * sizeof ( kmer_t * ) ); foundRoutes = ( char ** ) ckalloc ( MaxRouteNum * sizeof ( char * ) );; avgLinks = ( double * ) ckalloc ( MaxRouteNum * sizeof ( double ) );; for ( i = 0; i < MaxRouteNum; i++ ) { foundRoutes[i] = ( char * ) ckalloc ( max * sizeof ( char ) ); } for ( i = len1 - 1; i >= 0; i-- ) { num_route = traceCounter = soFarLinks = 0; int steps = 0; traceAlongDBgraph ( kmerCtg1[i], steps, min, max, &num_route, kset, kmerCtg2, len2, overlap, WordFilter, foundRoutes, routeEndOnCtg2, routeLens, soFarSeq, &traceCounter, MaxRouteNum, soFarNode, multiOccu, &soFarLinks, avgLinks ); if ( num_route > 0 ) { int m, minEnd = routeEndOnCtg2[0]; for ( m = 0; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( routeEndOnCtg2[m] < minEnd ) { minEnd = routeEndOnCtg2[m]; } } /* else if(minFreq>1){ for(m=0;m3) break; printf("%c",int2base((int)foundRoutes[m][j])); } printf(": %4.2f\n",avgLinks[m]); } } */ num_route = traceCounter = soFarLinks = 0; steps = 0; trace4Repeat ( kmerCtg1[i], steps, min, max, &num_route, kset, kmerCtg2[minEnd], overlap, WordFilter, &traceCounter, MaxRouteNum, soFarNode, MULTI1, MULTI2, routeLens, foundRoutes, soFarSeq, &soFarLinks, avgLinks ); int j, best = 0; int maxLen = routeLens[0]; double maxLink = avgLinks[0]; char * pt; boolean repeat = 0, sameLen = 1; 
int leftMost = max, rightMost = max; if ( num_route < 1 ) { fprintf ( stderr, "After trace4Repeat: non route was found\n" ); continue; } if ( num_route > 1 ) { // if multi paths are found, we check on the repeatative occurrences and links/length for ( m = 0; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( MULTI1[m] >= 0 && MULTI2[m] >= 0 ) { repeat = 1; leftMost = leftMost > MULTI1[m] ? MULTI1[m] : leftMost; rightMost = rightMost > MULTI2[m] ? MULTI2[m] : rightMost; } if ( routeLens[m] != maxLen ) { sameLen = 0; } if ( routeLens[m] < maxLen ) { maxLen = routeLens[m]; } if ( avgLinks[m] > maxLink ) { maxLink = avgLinks[m]; best = m; } } } if ( repeat ) { *offset1 = *offset2 = *cut1 = *cut2 = 0; int index = 0; char ch; for ( j = 0; j < leftMost; j++ ) { if ( routeLens[0] < j + overlap + 1 ) { break; } else { ch = foundRoutes[0][j]; } for ( m = 1; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( ch != foundRoutes[m][j] ) { break; } } if ( m == num_route ) { seqGap[index++] = ch; } else { break; } } *offset1 = index; index = 0; for ( j = 0; j < rightMost; j++ ) { if ( routeLens[0] - overlap - 1 < j ) { break; } else { ch = foundRoutes[0][routeLens[0] - overlap - 1 - j]; } for ( m = 1; m < num_route; m++ ) { if ( routeLens[m] < 0 ) { continue; } if ( ch != foundRoutes[m][routeLens[m] - overlap - 1 - j] ) { break; } } if ( m == num_route ) { index++; } else { break; } } *offset2 = index; for ( j = 0; j < *offset2; j++ ) { seqGap[*offset1 + *offset2 - 1 - j] = foundRoutes[0][routeLens[0] - overlap - 1 - j]; } if ( *offset1 > 0 || *offset2 > 0 ) { *cut1 = len1 - i - 1; *cut2 = minEnd; //fprintf(stderr,"\n"); for ( m = 0; m < num_route; m++ ) { for ( j = 0; j < max; j++ ) { if ( foundRoutes[m][j] > 3 ) { break; } //fprintf(stderr,"%c",int2base((int)foundRoutes[m][j])); } //fprintf(stderr,": %4.2f\n",avgLinks[m]); } /* fprintf(stderr,">Gap (%d + %d) (%d + %d)\n",*offset1,*offset2,*cut1,*cut2); for(index=0;index<*offset1+*offset2;index++) 
                        fprintf(stderr,"%c",int2base(seqGap[index]));
                    fprintf(stderr,"\n");
                    */
                }

                ret = 3;
                break;
            }

            if ( overlap + ( len1 - i - 1 ) + minEnd - routeLens[best] > ( int ) origOverlap ) {
                continue;
            }

            ctg1->gapSeqOffset = gapSeqArray->item_c;
            ctg1->gapSeqLen = routeLens[best];

            if ( !darrayPut ( gapSeqArray, ctg1->gapSeqOffset + maxLen / 4 ) ) {
                continue;
            }

            pt = ( char * ) darrayPut ( gapSeqArray, ctg1->gapSeqOffset );
            /*
            printKmerSeqLocal(stderr,kmerCtg1[i],overlap);
            fprintf(stderr,"-");
            */

            for ( j = 0; j < max; j++ ) {
                if ( foundRoutes[best][j] > 3 ) {
                    break;
                }

                writeChar2tightString ( foundRoutes[best][j], pt, j );
                //fprintf(stderr,"%c",int2base((int)foundRoutes[best][j]));
            }

            //fprintf(stderr,": GAPSEQ %d + %d, avglink %4.2f\n",len1-i-1,minEnd,avgLinks[best]);
            ctg1->cutTail = len1 - i - 1;
            ctg2->cutHead = overlap + minEnd;
            ctg2->scaftig_start = 0;
            ret = 1;
            break;
            /*
            }if(num_route>1){
                ret = 2;
                break;
            */
        } else { //mark node which leads to dead end
            node = searchNode ( kmerCtg1[i], kset, overlap );

            if ( node ) {
                node->twin = 2;
            }
        }
    }

    for ( i = 0; i < MaxRouteNum; i++ ) {
        free ( ( void * ) foundRoutes[i] );
    }

    free ( ( void * ) soFarSeq );
    free ( ( void * ) soFarNode );
    free ( ( void * ) multiOccu );
    free ( ( void * ) MULTI1 );
    free ( ( void * ) MULTI2 );
    free ( ( void * ) foundRoutes );
    free ( ( void * ) routeEndOnCtg2 );
    free ( ( void * ) routeLens );
    free ( ( void * ) avgLinks ); // avgLinks is allocated in this function and was never released
    return ret;
}

static void trace4Repeat ( Kmer currW, int steps, int min, int max, int * num_route,
                           KmerSet * kset, Kmer kmerDest, int overlap, Kmer WORDF,
                           int * traceCounter, int maxRoute, kmer_t ** soFarNode,
                           short * multiOccu1, short * multiOccu2, int * routeLens,
                           char ** foundRoutes, char * soFarSeq, long long * soFarLinks,
                           double * avgLinks )
{
    ( *traceCounter ) ++;

    if ( *traceCounter > UPlimit ) {
        return;
    }

    if ( steps > max || *num_route >= maxRoute ) {
        return;
    }

    Kmer word = reverseComplement ( currW, overlap );
    boolean isSmaller = KmerSmaller ( currW, word );
    char ch;
    unsigned char links;
    int index, i;

    if ( isSmaller ) {
        word
= currW; } kmer_t * node; boolean found = search_kmerset ( kset, word, &node ); if ( !found ) { printf ( "%s Trace: can't find kmer %llx %llx %llx %llx (input %llx %llx %llx %llx) at step %d\n", __FUNCTION__, word.high1, word.low1, word.high2, word.low2, currW.high1, currW.low1, currW.high2, currW.low2, steps ); return; } if ( soFarNode ) { soFarNode[steps] = node; } if ( soFarSeq && steps > 0 ) { soFarSeq[steps - 1] = lastCharInKmer ( currW ); } int linkCounter; if ( soFarLinks ) { linkCounter = *soFarLinks; } if ( steps >= min && KmerEqual ( currW, kmerDest ) ) { index = *num_route; if ( avgLinks && steps > 0 ) { avgLinks[index] = ( double ) linkCounter / steps; } else if ( avgLinks ) { avgLinks[index] = 0; } //find node that appears more than once in the path if ( multiOccu1 && multiOccu2 ) { for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } int rightMost = 0; boolean MULTI = 0; for ( i = 0; i < steps + 1; i++ ) { if ( soFarNode[i]->deleted ) { rightMost = rightMost < i - 1 ? i - 1 : rightMost; MULTI = 1; } soFarNode[i]->deleted = 1; } if ( !MULTI ) { multiOccu1[index] = multiOccu2[index] = -1; } else { multiOccu2[index] = steps - 2 - rightMost < 0 ? 0 : steps - 2 - rightMost; //[0 steps-2] for ( i = 0; i < steps + 1; i++ ) { soFarNode[i]->deleted = 0; } int leftMost = steps - 2; for ( i = steps; i >= 0; i-- ) { if ( soFarNode[i]->deleted ) { leftMost = leftMost > i - 1 ? i - 1 : leftMost; } soFarNode[i]->deleted = 1; } multiOccu1[index] = leftMost < 0 ? 
0 : leftMost; //[0 steps-2] } } if ( routeLens ) { routeLens[index] = steps; } if ( soFarSeq ) { char * array = foundRoutes[index]; for ( i = 0; i < steps; i++ ) { array[i] = soFarSeq[i]; } if ( i < max ) { array[i] = 4; } //indicate the end of the sequence } *num_route = ++index; } steps++; if ( isSmaller ) { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_right_cov ( *node, ch ); if ( !links ) { continue; } if ( soFarLinks ) { *soFarLinks = linkCounter + links; } word = nextKmerLocal ( currW, ch, WORDF ); trace4Repeat ( word, steps, min, max, num_route, kset, kmerDest, overlap, WORDF, traceCounter, maxRoute, soFarNode, multiOccu1, multiOccu2, routeLens, foundRoutes, soFarSeq, soFarLinks, avgLinks ); } } else { int array[] = {0, 1, 2, 3}; for ( i = 4; i > 0; i-- ) { ch = nPick1 ( array, i ); links = get_kmer_left_cov ( *node, ch ); if ( !links ) { continue; } if ( soFarLinks ) { *soFarLinks = linkCounter + links; } word = nextKmerLocal ( currW, int_comp ( ch ), WORDF ); trace4Repeat ( word, steps, min, max, num_route, kset, kmerDest, overlap, WORDF, traceCounter, maxRoute, soFarNode, multiOccu1, multiOccu2, routeLens, foundRoutes, soFarSeq, soFarLinks, avgLinks ); } } } //found repeat node on contig ends static void maskRepeatNode ( KmerSet * kset, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, int len1, int len2, int max, Kmer WordFilter ) { int i; int num_route, steps; int min = 1, maxRoute = 1; int traceCounter; Kmer word, bal_word; kmer_t * node; boolean found; int counter = 0; for ( i = 0; i < len1; i++ ) { word = kmerCtg1[i]; bal_word = reverseComplement ( word, overlap ); if ( KmerLarger ( word, bal_word ) ) { word = bal_word; } found = search_kmerset ( kset, word, &node ); if ( !found || node->linear ) { //printf("Found no node for kmer %llx\n",word); continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( word, steps, min, max, &num_route, kset, word, overlap, WordFilter, &traceCounter, 
maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { continue; } counter++; node->checked = 1; } for ( i = 0; i < len2; i++ ) { word = kmerCtg2[i]; bal_word = reverseComplement ( word, overlap ); if ( KmerLarger ( word, bal_word ) ) { word = bal_word; } found = search_kmerset ( kset, word, &node ); if ( !found || node->linear ) { //printf("Found no node for kmer %llx\n",word); continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( word, steps, min, max, &num_route, kset, word, overlap, WordFilter, &traceCounter, maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { continue; } counter++; node->checked = 1; } //printf("MR: %d(%d)\n",counter,len1+len2); } /* static boolean chopReadFillGap(int len_seq,int overlap,char *src_seq, char *bal_seq, KmerSet *kset,Kmer WORDF,int *start,int *end,boolean *bal, Kmer *KmerCtg1,int len1,Kmer *KmerCtg2,int len2,int *index1,int *index2) { int index,j=0,bal_j; Kmer word,bal_word; int flag=0,bal_flag=0; int ctg1start,bal_ctg1start,ctg2end,bal_ctg2end; int seqStart,bal_start,seqEnd,bal_end; kmer_t *node; boolean found; if(len_seqlinear&&!node->checked){ if(!flag&&node->inEdge==1){ ctg1start = searchKmerOnCtg(word,KmerCtg1,len1); if(ctg1start>0){ flag = 1; seqStart = j + overlap-1; } } if(!bal_flag&&node->inEdge==2){ bal_ctg2end = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(bal_ctg2end>0){ bal_flag = 2; bal_end = bal_j+overlap-1; } } } for(j = 1; j <= len_seq - overlap; j ++) { word = nextKmerLocal(word,src_seq[j-1+overlap],WORDF); bal_j = len_seq-j-overlap; // j; bal_word = prevKmerLocal(bal_word,bal_seq[bal_j],overlap); if(wordlinear&&!node->checked){ if(!flag&&node->inEdge==1){ ctg1start = searchKmerOnCtg(word,KmerCtg1,len1); if(ctg1start>0){ flag = 1; seqStart = j + overlap-1; } }else if(flag==1&&node->inEdge==1){ index = searchKmerOnCtg(word,KmerCtg1,len1); if(index>ctg1start){ // choose hit closer to gap ctg1start = index; seqStart = j + overlap-1; } }else 
if(flag==1&&node->inEdge==2){ ctg2end = searchKmerOnCtg(word,KmerCtg2,len2); if(ctg2end>0){ flag = 3; seqEnd = j+overlap-1; break; } } if(!bal_flag&&node->inEdge==2){ bal_ctg2end = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(bal_ctg2end>0){ bal_flag = 2; bal_end = bal_j+overlap-1; } }else if(bal_flag==2&&node->inEdge==2){ index = searchKmerOnCtg(bal_word,KmerCtg2,len2); if(indexinEdge==1){ bal_ctg1start = searchKmerOnCtg(bal_word,KmerCtg1,len1); if(bal_ctg1start>0){ bal_flag = 3; bal_start = bal_j+overlap-1; break; } } } } if(flag==3){ *start = seqStart; *end = seqEnd; *bal = 0; *index1 = ctg1start; *index2 = ctg2end; return 1; }else if(bal_flag==3){ *start = bal_start; *end = bal_end; *bal = 1; *index1 = bal_ctg1start; *index2 = bal_ctg2end; return 1; } return 0; } static boolean readsCrossGap(READNEARBY *rdArray, int num, int originOverlap,DARRAY *gapSeqArray, Kmer *kmerCtg1,Kmer *kmerCtg2,int overlap,int len1,int len2, CTGinSCAF *ctg1,CTGinSCAF *ctg2,KmerSet *kmerS,Kmer WordFilter,int min,int max) { int i,j,start,end,startOnCtg1,endOnCtg2; char *bal_seq; char *src_seq; char *pt; boolean bal,ret=0,FILL; src_seq = (char *)ckalloc(maxReadLen*sizeof(char)); bal_seq = (char *)ckalloc(maxReadLen*sizeof(char)); for(i=0;imax) continue; fprintf(stderr,"Read across\n"); //printf("Filled: K %d, ctg1 %d ctg2 %d,start %d end %d\n",overlap,startOnCtg1,endOnCtg2,start,end); if(overlap+(len1-startOnCtg1-1)+endOnCtg2-(end-start)>(int)originOverlap) continue; // contig1 and contig2 could not overlap more than origOverlap bases ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = end-start; if(!darrayPut(gapSeqArray,ctg1->gapSeqOffset+(end-start)/4)) continue; pt = (char *)darrayPut(gapSeqArray,ctg1->gapSeqOffset); for(j=start+1;j<=end;j++){ if(bal) writeChar2tightString(bal_seq[j],pt,j-start-1); else writeChar2tightString(src_seq[j],pt,j-start-1); } ctg1->cutTail = len1-startOnCtg1-1; ctg2->cutHead = overlap + endOnCtg2; ctg2->scaftig_start = 0; ret = 1; break; } 
free((void*)src_seq); free((void*)bal_seq); return ret; } */ static void kmerSet_markTandem ( KmerSet * set, Kmer WordFilter, int overlap ); static boolean readsCrossGap ( READNEARBY * rdArray, int num, int originOverlap, DARRAY * gapSeqArray, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, CTGinSCAF * ctg1, CTGinSCAF * ctg2, KmerSet * kmerS, Kmer WordFilter, int min, int max, int offset1, int offset2, char * seqGap, char * seqCtg1, char * seqCtg2, int cut1, int cut2 ); int localGraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int origOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, DARRAY * gapSeqArray, char * seqCtg1, char * seqCtg2, char * seqGap ) { /**************** put kmer in DBgraph ****************/ KmerSet * kmerSet; Kmer WordFilter = createFilter ( overlap ); /* if(ctg1->ctgID==56410&&ctg2->ctgID==61741) printf("Extract %d reads for gap [%d %d]\n",num,ctg1->ctgID,ctg2->ctgID); */ kmerSet = readsInGap2DBgraph ( rdArray, num, ctg1, ctg2, origOverlap, kmerCtg1, kmerCtg2, overlap, WordFilter ); if ( !kmerSet ) { //printf("no kmer found\n"); return 0; } time_t tt; time ( &tt ); srand48 ( ( int ) tt ); /* int i,j; for(i=0;i<2;i++){ int array[] = {0,1,2,3}; for(j=4;j>0;j--) fprintf(stderr,"%d ", nPick1(array,j)); } fprintf(stderr,"\n"); */ /***************** search path to connect contig ends ********/ int gapLen = ctg2->start - ctg1->end - origOverlap + overlap; int min = gapLen - GLDiff > 0 ? gapLen - GLDiff : 0; int max = gapLen + GLDiff < 10 ? 10 : gapLen + GLDiff; //count kmer number for contig1 and contig2 ends int len1, len2; len1 = CTGendLen < contig_array[ctg1->ctgID].length + origOverlap ? CTGendLen : contig_array[ctg1->ctgID].length + origOverlap; len2 = CTGendLen < contig_array[ctg2->ctgID].length + origOverlap ? 
CTGendLen : contig_array[ctg2->ctgID].length + origOverlap; len1 -= overlap - 1; len2 -= overlap - 1; //int pathNum = 2; int offset1 = 0, offset2 = 0, cut1 = 0, cut2 = 0; int pathNum = searchFgap ( kmerSet, ctg1, ctg2, kmerCtg1, kmerCtg2, origOverlap, overlap, gapSeqArray, len1, len2, WordFilter, &offset1, &offset2, seqGap, &cut1, &cut2 ); //printf("SF: %d K %d\n",pathNum,overlap); if ( pathNum == 0 ) { free_kmerset ( kmerSet ); return 0; } else if ( pathNum == 1 ) { free_kmerset ( kmerSet ); return 1; }/* else{ printf("ret %d\n",pathNum); free_kmerset(kmerSet); return 0; } */ /******************* cross the gap by single reads *********/ //kmerSet_markTandem(kmerSet,WordFilter,overlap); maskRepeatNode ( kmerSet, kmerCtg1, kmerCtg2, overlap, len1, len2, max, WordFilter ); boolean found = readsCrossGap ( rdArray, num, origOverlap, gapSeqArray, kmerCtg1, kmerCtg2, overlap, ctg1, ctg2, kmerSet, WordFilter, min, max, offset1, offset2, seqGap, seqCtg1, seqCtg2, cut1, cut2 ); if ( found ) { //fprintf(stderr,"read across\n"); free_kmerset ( kmerSet ); return found; } else { free_kmerset ( kmerSet ); return 0; } } static void kmerSet_markTandem ( KmerSet * set, Kmer WordFilter, int overlap ) { kmer_t * rs; long long counter = 0; int num_route, steps; int min = 1, max = overlap, maxRoute = 1; int traceCounter; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { rs = set->array + set->iter_ptr; if ( rs->inEdge > 0 ) { set->iter_ptr ++; continue; } num_route = traceCounter = 0; steps = 0; trace4Repeat ( rs->seq, steps, min, max, &num_route, set, rs->seq, overlap, WordFilter, &traceCounter, maxRoute, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL ); if ( num_route < 1 ) { set->iter_ptr ++; continue; } /* printKmerSeqLocal(stderr,rs->seq,overlap); fprintf(stderr, "\n"); */ rs->checked = 1; counter++; } set->iter_ptr ++; } } /******************* the following is for read-crossing gaps *************************/ 
#define MAXREADLENGTH 100 static const int INDEL = 0; static const int SIM[4][4] = { {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1} }; static char fastSequence[MAXREADLENGTH]; static char slowSequence[MAXREADLENGTH]; static int Fmatrix[MAXREADLENGTH + 1][MAXREADLENGTH + 1]; static int slowToFastMapping[MAXREADLENGTH + 1]; static int fastToSlowMapping[MAXREADLENGTH + 1]; static int max ( int A, int B, int C ) { A = A >= B ? A : B; return ( A >= C ? A : C ); } static int compareSequences ( char * sequence1, char * sequence2, int length1, int length2 ) { if ( length1 < 1 || length2 < 1 || length1 > MAXREADLENGTH || length2 > MAXREADLENGTH ) { return 0; } int i, j; int Choice1, Choice2, Choice3; int maxScore; for ( i = 0; i <= length1; i++ ) { Fmatrix[i][0] = 0; } for ( j = 0; j <= length2; j++ ) { Fmatrix[0][j] = 0; } for ( i = 1; i <= length1; i++ ) { for ( j = 1; j <= length2; j++ ) { Choice1 = Fmatrix[i - 1][j - 1] + SIM[ ( int ) sequence1[i - 1]] [ ( int ) sequence2[j - 1]]; Choice2 = Fmatrix[i - 1][j] + INDEL; Choice3 = Fmatrix[i][j - 1] + INDEL; Fmatrix[i][j] = max ( Choice1, Choice2, Choice3 ); } } maxScore = Fmatrix[length1][length2]; return maxScore; } static void mapSlowOntoFast ( int slowSeqLength, int fastSeqLength ) { int slowIndex = slowSeqLength; int fastIndex = fastSeqLength; int fastn, slown; if ( slowIndex == 0 ) { slowToFastMapping[0] = fastIndex; while ( fastIndex >= 0 ) { fastToSlowMapping[fastIndex--] = 0; } return; } if ( fastIndex == 0 ) { while ( slowIndex >= 0 ) { slowToFastMapping[slowIndex--] = 0; } fastToSlowMapping[0] = slowIndex; return; } while ( slowIndex > 0 && fastIndex > 0 ) { fastn = ( int ) fastSequence[fastIndex - 1]; //getCharInTightString(fastSequence,fastIndex-1); slown = ( int ) slowSequence[slowIndex - 1]; //getCharInTightString(slowSequence,slowIndex-1); if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex - 1] + SIM[fastn][slown] ) { fastToSlowMapping[--fastIndex] = --slowIndex; 
slowToFastMapping[slowIndex] = fastIndex; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex - 1][slowIndex] + INDEL ) { fastToSlowMapping[--fastIndex] = slowIndex - 1; } else if ( Fmatrix[fastIndex][slowIndex] == Fmatrix[fastIndex][slowIndex - 1] + INDEL ) { slowToFastMapping[--slowIndex] = fastIndex - 1; } else { printf ( "compareSequence: Error trace\n" ); fflush ( stdout ); abort(); } } while ( slowIndex > 0 ) { slowToFastMapping[--slowIndex] = -1; } while ( fastIndex > 0 ) { fastToSlowMapping[--fastIndex] = -1; } slowToFastMapping[slowSeqLength] = fastSeqLength; fastToSlowMapping[fastSeqLength] = slowSeqLength; } static boolean chopReadFillGap ( int len_seq, int overlap, char * src_seq, char * bal_seq, KmerSet * kset, Kmer WORDF, int * start, int * end, boolean * bal, Kmer * KmerCtg1, int len1, Kmer * KmerCtg2, int len2, int * index1, int * index2 ) { int index, j = 0, bal_j; Kmer word, bal_word; int flag = 0, bal_flag = 0; int ctg1start, bal_ctg1start, ctg2end, bal_ctg2end; int seqStart, bal_start, seqEnd, bal_end; kmer_t * node; boolean found; if ( len_seq < overlap + 1 ) { return 0; } word.high1 = word.low1 = word.high2 = word.low2 = 0; for ( index = 0; index < overlap; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlap ); bal_j = len_seq - 0 - overlap; // 0; flag = bal_flag = 0; if ( KmerSmaller ( word, bal_word ) ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( !found ) { printf ( "chopReadFillGap 1292 not found!\n" ); } if ( found && !node->linear && !node->checked ) { if ( !flag && node->inEdge == 1 ) { ctg1start = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( ctg1start >= 0 ) { flag = 1; seqStart = j + overlap - 1; } } if ( !bal_flag && node->inEdge == 2 ) { bal_ctg2end = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( 
bal_ctg2end >= 0 ) { bal_flag = 2; bal_end = bal_j + overlap - 1; } } } for ( j = 1; j <= len_seq - overlap; j ++ ) { word = nextKmerLocal ( word, src_seq[j - 1 + overlap], WORDF ); bal_j = len_seq - j - overlap; // j; bal_word = prevKmerLocal ( bal_word, bal_seq[bal_j], overlap ); if ( KmerSmaller ( word, bal_word ) ) { found = search_kmerset ( kset, word, &node ); } else { found = search_kmerset ( kset, bal_word, &node ); } if ( !found ) { printf ( "chopReadFillGap 1321 not found!\n" ); } if ( found && !node->linear && !node->checked ) { if ( !flag && node->inEdge == 1 ) { ctg1start = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( ctg1start >= 0 ) { flag = 1; seqStart = j + overlap - 1; } } else if ( flag == 1 && node->inEdge == 1 ) { index = searchKmerOnCtg ( word, KmerCtg1, len1 ); if ( index >= 0 && index > ctg1start ) // choose hit closer to gap { ctg1start = index; seqStart = j + overlap - 1; } } else if ( flag == 1 && node->inEdge == 2 ) { ctg2end = searchKmerOnCtg ( word, KmerCtg2, len2 ); if ( ctg2end >= 0 ) { flag = 3; seqEnd = j + overlap - 1; break; } } if ( !bal_flag && node->inEdge == 2 ) { bal_ctg2end = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( bal_ctg2end >= 0 ) { bal_flag = 2; bal_end = bal_j + overlap - 1; } } else if ( bal_flag == 2 && node->inEdge == 2 ) { index = searchKmerOnCtg ( bal_word, KmerCtg2, len2 ); if ( index >= 0 && index < bal_ctg2end ) // choose hit closer to gap { bal_ctg2end = index; bal_end = bal_j + overlap - 1; } } else if ( bal_flag == 2 && node->inEdge == 1 ) { bal_ctg1start = searchKmerOnCtg ( bal_word, KmerCtg1, len1 ); if ( bal_ctg1start >= 0 ) { bal_flag = 3; bal_start = bal_j + overlap - 1; break; } } } } if ( flag == 3 ) { *start = seqStart; *end = seqEnd; *bal = 0; *index1 = ctg1start; *index2 = ctg2end; return 1; } else if ( bal_flag == 3 ) { *start = bal_start; *end = bal_end; *bal = 1; *index1 = bal_ctg1start; *index2 = bal_ctg2end; return 1; } return 0; } static int cutSeqFromTightStr ( char * tightStr, 
int length, int start, int end, int revS, char * src_seq ) { int i, index = 0; end = end < length ? end : length - 1; start = start >= 0 ? start : 0; if ( !revS ) { for ( i = start; i <= end; i++ ) { src_seq[index++] = getCharInTightString ( tightStr, i ); } } else { for ( i = length - 1 - start; i >= length - end - 1; i-- ) { src_seq[index++] = int_comp ( getCharInTightString ( tightStr, i ) ); } } return end - start + 1; } static int cutSeqFromCtg ( unsigned int ctgID, int start, int end, char * sequence, int originOverlap ) { unsigned int bal_ctg = getTwinCtg ( ctgID ); if ( contig_array[ctgID].length < 1 ) { return 0; } int length = contig_array[ctgID].length + originOverlap; if ( contig_array[ctgID].seq ) { return cutSeqFromTightStr ( contig_array[ctgID].seq, length, start, end, 0, sequence ); } else { return cutSeqFromTightStr ( contig_array[bal_ctg].seq, length, start, end, 1, sequence ); } } static int cutSeqFromRead ( char * src_seq, int length, int start, int end, char * sequence ) { if ( end >= length ) { printf ( "******: end %d length %d\n", end, length ); } end = end < length ? end : length - 1; start = start >= 0 ? 
start : 0; int i; for ( i = start; i <= end; i++ ) { sequence[i - start] = src_seq[i]; } return end - start + 1; } static void printSeq ( FILE * fo, char * seq, int len ) { int i; for ( i = 0; i < len; i++ ) { fprintf ( fo, "%c", int2base ( ( int ) seq[i] ) ); } fprintf ( fo, "\n" ); } static boolean readsCrossGap ( READNEARBY * rdArray, int num, int originOverlap, DARRAY * gapSeqArray, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, CTGinSCAF * ctg1, CTGinSCAF * ctg2, KmerSet * kmerS, Kmer WordFilter, int min, int max, int offset1, int offset2, char * seqGap, char * seqCtg1, char * seqCtg2, int cut1, int cut2 ) { int i, j, start, end, startOnCtg1, endOnCtg2; char * bal_seq; char * src_seq; char * pt; boolean bal, ret = 0, FILL; double maxScore = 0.0; int maxIndex; int lenCtg1, lenCtg2; //build sequences on left and right of the uncertain region int buffer_size = maxReadLen > 100 ? maxReadLen : 100; int length = contig_array[ctg1->ctgID].length + originOverlap; if ( buffer_size > offset1 ) { lenCtg1 = cutSeqFromCtg ( ctg1->ctgID, length - cut1 - ( buffer_size - offset1 ), length - 1 - cut1, seqCtg1, originOverlap ); for ( i = 0; i < offset1; i++ ) { seqCtg1[lenCtg1 + i] = seqGap[i]; } lenCtg1 += offset1; } else { for ( i = offset1 - buffer_size; i < offset1; i++ ) { seqCtg1[i + buffer_size - offset1] = seqGap[i]; } lenCtg1 = buffer_size; } length = contig_array[ctg2->ctgID].length + originOverlap; if ( buffer_size > offset2 ) { lenCtg2 = cutSeqFromCtg ( ctg2->ctgID, cut2, buffer_size - offset2 - 1 + cut2, & ( seqCtg2[offset2] ), originOverlap ); for ( i = 0; i < offset2; i++ ) { seqCtg2[i] = seqGap[i + offset1]; } lenCtg2 += offset2; } else { for ( i = 0; i < buffer_size; i++ ) { seqCtg2[i] = seqGap[i + offset1]; } lenCtg2 = buffer_size; } /* if(offset1>0||offset2>0){ for(i=0;i max ) { continue; } if ( overlap + ( len1 - startOnCtg1 - 1 ) + endOnCtg2 - ( end - start ) > ( int ) originOverlap ) { continue; } // contig1 and contig2 could not overlap more than 
origOverlap bases START[i] = start; END[i] = end; INDEX1[i] = startOnCtg1; INDEX2[i] = endOnCtg2; BAL[i] = bal; int matchLen = 2 * overlap < ( end - start + overlap ) ? 2 * overlap : ( end - start + overlap ); int match; int alignLen = matchLen; //compare the left of hit kmer on ctg1 //int ctgLeft = (contig_array[ctg1->ctgID].length+originOverlap)-(len1+overlap-1)+startOnCtg1; int ctgLeft = ( lenCtg1 ) - ( len1 + overlap - 1 ) + startOnCtg1; int readLeft = start - overlap + 1; int cmpLen = ctgLeft < readLeft ? ctgLeft : readLeft; cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg1->ctgID,ctgLeft-cmpLen,ctgLeft-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg1, lenCtg1, ctgLeft - cmpLen, ctgLeft - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); alignLen += cmpLen; matchLen += match; //compare the right of hit kmer on ctg1 int ctgRight = len1 - startOnCtg1 - 1; cmpLen = ctgRight < ( rdArray[i].len - start - 1 ) ? ctgRight : ( rdArray[i].len - start - 1 ); cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH; //cutSeqFromCtg(ctg1->ctgID,ctgLeft+overlap,ctgLeft+overlap+cmpLen-1,fastSequence,originOverlap); cutSeqFromRead ( seqCtg1, lenCtg1, ctgLeft + overlap, ctgLeft + overlap + cmpLen - 1, fastSequence ); if ( !bal ) { cutSeqFromRead ( src_seq, rdArray[i].len, start + 1, start + cmpLen, slowSequence ); } else { cutSeqFromRead ( bal_seq, rdArray[i].len, start + 1, start + cmpLen, slowSequence ); } match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen ); //fprintf(stderr,"%d -- %d\n",match,cmpLen); alignLen += cmpLen; matchLen += match; //compare the left of hit kmer on ctg2 ctgLeft = endOnCtg2; readLeft = end - overlap + 1; cmpLen = ctgLeft < readLeft ? 
        ctgLeft : readLeft;
        // clamp the shorter of the two flanks, as the other three comparisons do
        cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH;
        //cutSeqFromCtg(ctg2->ctgID,endOnCtg2-cmpLen,endOnCtg2-1,fastSequence,originOverlap);
        cutSeqFromRead ( seqCtg2, lenCtg2, endOnCtg2 - cmpLen, endOnCtg2 - 1, fastSequence );

        if ( !bal )
        {
            cutSeqFromRead ( src_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence );
        }
        else
        {
            cutSeqFromRead ( bal_seq, rdArray[i].len, readLeft - cmpLen, readLeft - 1, slowSequence );
        }

        match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen );
        alignLen += cmpLen;
        matchLen += match;
        //compare the right of hit kmer on ctg2
        //ctgRight = contig_array[ctg2->ctgID].length+originOverlap-endOnCtg2-overlap;
        ctgRight = lenCtg2 - endOnCtg2 - overlap;
        cmpLen = ctgRight < ( rdArray[i].len - end - 1 ) ? ctgRight : ( rdArray[i].len - end - 1 );
        cmpLen = cmpLen <= MAXREADLENGTH ? cmpLen : MAXREADLENGTH;
        //cutSeqFromCtg(ctg2->ctgID,endOnCtg2+overlap,endOnCtg2+overlap+cmpLen-1,fastSequence,originOverlap);
        cutSeqFromRead ( seqCtg2, lenCtg2, endOnCtg2 + overlap, endOnCtg2 + overlap + cmpLen - 1, fastSequence );

        if ( !bal )
        {
            cutSeqFromRead ( src_seq, rdArray[i].len, end + 1, end + cmpLen, slowSequence );
        }
        else
        {
            cutSeqFromRead ( bal_seq, rdArray[i].len, end + 1, end + cmpLen, slowSequence );
        }

        match = compareSequences ( fastSequence, slowSequence, cmpLen, cmpLen );
        alignLen += cmpLen;
        matchLen += match;
        /*
        if(cmpLen>0&&match!=cmpLen+overlap){
            printSeq(stderr,fastSequence,cmpLen+overlap);
            printSeq(stderr,slowSequence,cmpLen+overlap);
            printKmerSeqLocal(stderr,kmerCtg2[endOnCtg2],overlap);
            fprintf(stderr,": %d(%d)\n",bal,endOnCtg2);
        }else if(cmpLen>0&&match==cmpLen+overlap)
            fprintf(stderr,"Perfect\n");
        */
        double score = ( double ) matchLen / alignLen;

        if ( maxScore < score )
        {
            maxScore = score;
            //fprintf(stderr,"%4.2f (%d/%d)\n",maxScore,matchLen,alignLen);
            maxIndex = i;
        }

        SCORE[i] = score;
    }

    /*
    if(maxScore>0.0)
        fprintf(stderr,"SCORE: %4.2f\n",maxScore);
    */
    if ( maxScore > 0.9 )
    {
        /* for(i=0;i 0 ?
offset1 - ( len1 - INDEX1[maxIndex] - 1 ) : 0; int rightRemain = offset2 - ( overlap + INDEX2[maxIndex] ) > 0 ? offset2 - ( overlap + INDEX2[maxIndex] ) : 0; ctg1->gapSeqOffset = gapSeqArray->item_c; ctg1->gapSeqLen = END[maxIndex] - START[maxIndex] + leftRemain + rightRemain; if ( darrayPut ( gapSeqArray, ctg1->gapSeqOffset + ( END[maxIndex] - START[maxIndex] + leftRemain + rightRemain ) / 4 ) ) { pt = ( char * ) darrayPut ( gapSeqArray, ctg1->gapSeqOffset ); for ( j = 0; j < leftRemain; j++ ) //get the left side of the gap region from search { writeChar2tightString ( seqGap[j], pt, j ); //fprintf(stderr,"%c",int2base(seqGap[j])); } for ( j = START[maxIndex] + 1; j <= END[maxIndex]; j++ ) { if ( BAL[maxIndex] ) { writeChar2tightString ( bal_seq[j], pt, j - START[maxIndex] - 1 + leftRemain ); //fprintf(stderr,"%c",int2base(bal_seq[j])); } else { writeChar2tightString ( src_seq[j], pt, j - START[maxIndex] - 1 + leftRemain ); //fprintf(stderr,"%c",int2base(src_seq[j])); } } for ( j = offset2 - rightRemain; j < offset2; j++ ) //get the right side of the gap region from search { writeChar2tightString ( seqGap[j + leftRemain], pt, j + END[maxIndex] - START[maxIndex] + leftRemain ); //fprintf(stderr,"%c",int2base(seqGap[j+leftRemain])); } /* fprintf(stderr,": GAPSEQ (%d+%d)(%d+%d)(%d+%d)(%d+%d) B %d\n",offset1,offset2,cut1,cut2, len1-INDEX1[maxIndex]-1,INDEX2[maxIndex],START[maxIndex],END[maxIndex],BAL[maxIndex]); */ ctg1->cutTail = len1 - INDEX1[maxIndex] - 1 - offset1 + cut1 > cut1 ? len1 - INDEX1[maxIndex] - 1 - offset1 + cut1 : cut1; ctg2->cutHead = overlap + INDEX2[maxIndex] - offset2 + cut2 > cut2 ? 
overlap + INDEX2[maxIndex] - offset2 + cut2 : cut2; ctg2->scaftig_start = 0; ret = 1; } } free ( ( void * ) START ); free ( ( void * ) END ); free ( ( void * ) INDEX1 ); free ( ( void * ) INDEX2 ); free ( ( void * ) SCORE ); free ( ( void * ) BAL ); free ( ( void * ) src_seq ); free ( ( void * ) bal_seq ); return ret; } SOAPdenovo-V1.05/src/127mer/main.c000644 000765 000024 00000016002 11530651532 016470 0ustar00Aquastaff000000 000000 /* * 127mer/main.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "global.h" extern int call_pregraph ( int arc, char ** argv ); extern int call_heavygraph ( int arc, char ** argv ); extern int call_map2contig ( int arc, char ** argv ); extern int call_scaffold ( int arc, char ** argv ); extern int call_align ( int arc, char ** argv ); static void display_usage(); static void display_all_usage(); static void pipeline ( int argc, char ** argv ); int main ( int argc, char ** argv ) { printf ( "\nVersion 1.05: testing... 
2010\n\n" );
    argc--;
    argv++;

    /*
    printf("Size of Kmer is %d\n",sizeof(Kmer));
    printf("Size of kmer_t is %d\n",sizeof(kmer_t));
    */
    if ( argc == 0 )
    {
        display_usage();
        return 0;
    }

    if ( strcmp ( "pregraph", argv[0] ) == 0 )
    {
        call_pregraph ( argc, argv );
    }
    else if ( strcmp ( "contig", argv[0] ) == 0 )
    {
        call_heavygraph ( argc, argv );
    }
    else if ( strcmp ( "map", argv[0] ) == 0 )
    {
        call_align ( argc, argv );
    }
    //call_map2contig(argc,argv);
    else if ( strcmp ( "scaff", argv[0] ) == 0 )
    {
        call_scaffold ( argc, argv );
    }
    else if ( strcmp ( "all", argv[0] ) == 0 )
    {
        pipeline ( argc, argv );
    }
    else
    {
        display_usage();
    }

    return 0;
}

static void display_usage()
{
    printf ( "\nUsage: SOAPdenovo [option]\n" );
    printf ( "    pregraph     construct kmer-graph\n" );
    printf ( "    contig       eliminate errors and output contigs\n" );
    printf ( "    map          map reads to contigs\n" );
    printf ( "    scaff        scaffolding\n" );
    printf ( "    all          do all of the above in turn\n" );
}

static void pipeline ( int argc, char ** argv )
{
    char * options[16];
    unsigned char getK, getRfile, getOfile, getD, getDD, getL, getR, getP;
    char readfile[256], outfile[256];
    char temp[128];
    char * name;
    int kmer = 0, cutoff_len = 0, ncpu = 0;
    char kmer_s[16], len_s[16], ncpu_s[16], M_s[16];
    int i, copt, index, M = 1;
    extern char * optarg;
    time_t start_t, stop_t;
    time ( &start_t );
    getK = getRfile = getOfile = getD = getDD = getL = getR = getP = 0;

    // field widths keep sscanf from writing past readfile, outfile and temp
    while ( ( copt = getopt ( argc, argv, "s:o:K:M:L:p:G:dDRua:" ) ) != EOF )
    {
        switch ( copt )
        {
            case 's':
                getRfile = 1;
                sscanf ( optarg, "%255s", readfile );
                continue;
            case 'o':
                getOfile = 1;
                sscanf ( optarg, "%255s", outfile );
                continue;
            case 'K':
                getK = 1;
                sscanf ( optarg, "%127s", temp );
                kmer = atoi ( temp );
                continue;
            case 'G':
                sscanf ( optarg, "%127s", temp );
                GLDiff = atoi ( temp );
                continue;
            case 'M':
                sscanf ( optarg, "%127s", temp );
                M = atoi ( temp );
                continue;
            case 'p':
                getP = 1;
                sscanf ( optarg, "%127s", temp );
                ncpu = atoi ( temp );
                continue;
            case 'L':
                getL = 1;
                sscanf ( optarg, "%127s", temp );
                cutoff_len = atoi ( temp );
                continue;
            case 'R':
                getR = 1;
                continue;
            case 'u':
                maskRep = 0;
                continue;
            case 'd':
                getD = 1;
                continue;
            case 'D':
                getDD = 1;
                continue;
            case 'a':
                initKmerSetSize = atoi ( optarg );
                break;
            default:

                if ( getRfile == 0 || getOfile == 0 )
                {
                    display_all_usage();
                    exit ( -1 );
                }
        }
    }

    if ( getRfile == 0 || getOfile == 0 )
    {
        display_all_usage();
        exit ( -1 );
    }

    if ( thrd_num < 1 )
    {
        thrd_num = 1;
    }

    // getK = getRfile = getOfile = getD = getL = getR = 0;
    name = "pregraph";
    index = 0;
    options[index++] = name;
    options[index++] = "-s";
    options[index++] = readfile;

    if ( getK )
    {
        options[index++] = "-K";
        sprintf ( kmer_s, "%d", kmer );
        options[index++] = kmer_s;
    }

    if ( getP )
    {
        options[index++] = "-p";
        sprintf ( ncpu_s, "%d", ncpu );
        options[index++] = ncpu_s;
    }

    if ( getD )
    {
        options[index++] = "-d";
    }

    if ( getDD )
    {
        options[index++] = "-D";
    }

    if ( getR )
    {
        options[index++] = "-R";
    }

    options[index++] = "-o";
    options[index++] = outfile;

    for ( i = 0; i < index; i++ )
    {
        printf ( "%s ", options[i] );
    }

    printf ( "\n" );
    call_pregraph ( index, options );
    name = "contig";
    index = 0;
    options[index++] = name;
    options[index++] = "-g";
    options[index++] = outfile;
    options[index++] = "-M";
    sprintf ( M_s, "%d", M );
    options[index++] = M_s;

    if ( getR )
    {
        options[index++] = "-R";
    }

    for ( i = 0; i < index; i++ )
    {
        printf ( "%s ", options[i] );
    }

    printf ( "\n" );
    call_heavygraph ( index, options );
    name = "map";
    index = 0;
    options[index++] = name;
    options[index++] = "-s";
    options[index++] = readfile;
    options[index++] = "-g";
    options[index++] = outfile;

    if ( getP )
    {
        options[index++] = "-p";
        sprintf ( ncpu_s, "%d", ncpu );
        options[index++] = ncpu_s;
    }

    for ( i = 0; i < index; i++ )
    {
        printf ( "%s ", options[i] );
    }

    printf ( "\n" );
    call_align ( index, options );
    name = "scaff";
    index = 0;
    options[index++] = name;
    options[index++] = "-g";
    options[index++] = outfile;
    options[index++] = "-F";

    if ( getL )
    {
        options[index++] = "-L";
        sprintf ( len_s, "%d", cutoff_len );
        options[index++] = len_s;
    }

    for ( i = 0; i < index; i++ )
    {
        printf ( "%s ", options[i] );
    }

    printf ( "\n" );
    call_scaffold ( index, options );
    time ( &stop_t );
    printf ( "time for the whole pipeline: %dm\n", ( int ) ( stop_t - start_t ) / 60 );
}

static void display_all_usage()
{
    printf ( "\nSOAPdenovo all -s configFile [-K kmer -a initKmerSetSize -d -D -M mergeLevel -R -u -G gapLenDiff -L minContigLen -p n_cpu] -o Output\n" );
    printf ( "  -s configFile: the configuration file describing the solexa reads\n" );
    printf ( "  -K kmer(default 23): k value in kmer\n" );
    printf ( "  -p n_cpu(default 8): number of cpu for use\n" );
    printf ( "  -a initKmerSetSize: define the initial KmerSet size(unit: GB)\n" );
    printf ( "  -M mergeLevel(default 1,min 0, max 3): the strength of merging similar sequences during contiging\n" );
    printf ( "  -d (optional): delete kmers with frequency one (default no)\n" );
    printf ( "  -D (optional): delete edges with coverage one (default no)\n" );
    printf ( "  -R (optional): resolve repeats by reads (default no)\n" );
    printf ( "  -G gapLenDiff(default 50): allowed length difference between estimated and filled gap\n" );
    printf ( "  -L minLen(default K+2): shortest contig for scaffolding\n" );
    printf ( "  -u (optional): un-mask contigs with high coverage before scaffolding (default mask)\n" );
    printf ( "  -o Output: prefix of output file name\n" );
}

SOAPdenovo-V1.05/src/127mer/Makefile

CC=             gcc
CFLAGS=         -O3 -fomit-frame-pointer
DFLAGS=
OBJS=		arc.o attachPEinfo.o bubble.o check.o compactEdge.o \
		concatenateEdge.o connect.o contig.o cutTipPreGraph.o cutTip_graph.o \
		darray.o dfib.o dfibHeap.o fib.o fibHeap.o \
		hashFunction.o kmer.o lib.o loadGraph.o loadPath.o \
		loadPreGraph.o localAsm.o main.o map.o mem_manager.o \
		newhash.o node2edge.o orderContig.o output_contig.o output_pregraph.o \
		output_scaffold.o pregraph.o prlHashCtg.o prlHashReads.o prlRead2Ctg.o \
		prlRead2path.o prlReadFillGap.o read2scaf.o readInterval.o stack.o \
		readseq1by1.o scaffold.o searchPath.o seq.o splitReps.o
PROG=           SOAPdenovo-127mer
INCLUDES=	-Iinc
SUBDIRS=    .
LIBPATH=
LIBS=		-pthread -lm
EXTRA_FLAGS=

BIT_ERR = 0
ifeq (,$(findstring $(shell uname -m), x86_64 ppc64 ia64))
BIT_ERR = 1
endif

LINUX = 0
ifneq (,$(findstring Linux,$(shell uname)))
LINUX = 1
EXTRA_FLAGS += -Wl,--hash-style=both
endif

ifneq (,$(findstring $(shell uname -m), x86_64))
CFLAGS += -m64
endif

ifneq (,$(findstring $(shell uname -m), ia64))
CFLAGS +=
endif

ifneq (,$(findstring $(shell uname -m), ppc64))
CFLAGS += -m64 -mpowerpc64
endif

.SUFFIXES:.c .o

.c.o:
	@printf "Compiling $<... \r"; \
	$(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $< || echo "Error in command: $(CC) -c $(CFLAGS) $(DFLAGS) $(INCLUDES) $<"

all:		SOAPdenovo

.PHONY:all clean install

envTest:
	@test $(BIT_ERR) != 1 || sh -c 'echo "Fatal: 64bit CPU and Operating System required!";false;'

# EXTRA_FLAGS was misspelled ENTRAFLAGS here, so the Linux linker flag was never passed
SOAPdenovo:	envTest $(OBJS)
	@$(CC) $(CFLAGS) -o $(PROG) $(OBJS) $(LIBPATH) $(LIBS) $(EXTRA_FLAGS)
	@printf "Linking...\r"
	@printf "$(PROG) compilation done.\n";

clean:
	@rm -fr gmon.out *.o a.out *.exe *.dSYM $(PROG) *~ *.a *.so.* *.so *.dylib
	@printf "$(PROG) cleaning done.\n";

install:
	@cp $(PROG) ../../bin/
	@printf "$(PROG) installed at ../../bin/$(PROG)\n"

SOAPdenovo-V1.05/src/127mer/map.c

/*
 * 127mer/map.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static char shortrdsfile[256]; static char graphfile[256]; static void display_map_usage(); static int getMinOverlap ( char * gfile ) { char name[256], ch; FILE * fp; int num_kmer, overlaplen = 23; char line[1024]; sprintf ( name, "%s.preGraphBasic", gfile ); fp = fopen ( name, "r" ); if ( !fp ) { return overlaplen; } while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == 'V' ) { sscanf ( line + 6, "%d %c %d", &num_kmer, &ch, &overlaplen ); } else if ( line[0] == 'M' ) { sscanf ( line, "MaxReadLen %d MinReadLen %d MaxNameLen %d", &maxReadLen, &minReadLen, &maxNameLen ); } } fclose ( fp ); return overlaplen; } int call_align ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); initenv ( argc, argv ); overlaplen = getMinOverlap ( graphfile ); printf ( "K = %d\n", overlaplen ); time ( &time_bef ); ctg_short = overlaplen + 2; printf ( "contig len cutoff: %d\n", ctg_short ); prlContig2nodes ( graphfile, ctg_short ); time ( &time_aft ); printf ( "time spent on De bruijn graph construction: %ds\n\n", ( int ) ( time_aft - time_bef ) ); //map long read to edge one by one time ( &time_bef ); prlLongRead2Ctg ( shortrdsfile, graphfile ); time ( &time_aft ); printf ( "time spent on mapping long reads: %ds\n\n", ( int ) ( time_aft - time_bef ) ); //map read to edge one by one time ( &time_bef ); prlRead2Ctg ( shortrdsfile, graphfile ); time ( &time_aft ); printf ( "time spent on mapping reads: %ds\n\n", ( int ) ( time_aft - time_bef ) ); free_Sets 
( KmerSets, thrd_num ); time ( &stop_t ); printf ( "overall time for alignment: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 ); return 0; } /***************************************************************************** * Parse command line switches *****************************************************************************/ void initenv ( int argc, char ** argv ) { int copt; int inpseq, outseq; extern char * optarg; char temp[100]; optind = 1; inpseq = outseq = 0; while ( ( copt = getopt ( argc, argv, "s:g:K:p:" ) ) != EOF ) { //printf("get option\n"); switch ( copt ) { case 's': inpseq = 1; sscanf ( optarg, "%s", shortrdsfile ); continue; case 'g': outseq = 1; sscanf ( optarg, "%s", graphfile ); // continue; case 'K': sscanf ( optarg, "%s", temp ); // overlaplen = atoi ( temp ); continue; case 'p': sscanf ( optarg, "%s", temp ); // thrd_num = atoi ( temp ); continue; default: if ( inpseq == 0 || outseq == 0 ) // { display_map_usage(); exit ( 1 ); } } } if ( inpseq == 0 || outseq == 0 ) // { //printf("need more\n"); display_map_usage(); exit ( 1 ); } } static void display_map_usage() { printf ( "\nmap -s readsInfoFile -g graphfile [-p n_cpu]\n" ); printf ( " -s readsInfoFile: The file contains information of solexa reads\n" ); printf ( " -p n_cpu(default 8): number of cpu for use\n" ); printf ( " -g graphfile: prefix of graph files\n" ); } SOAPdenovo-V1.05/src/127mer/mem_manager.c000644 000765 000024 00000005354 11530651532 020024 0ustar00Aquastaff000000 000000 /* * 127mer/mem_manager.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" MEM_MANAGER * createMem_manager ( int num_items, size_t unit_size ) { MEM_MANAGER * mem_Manager = ( MEM_MANAGER * ) ckalloc ( 1 * sizeof ( MEM_MANAGER ) ); mem_Manager->block_list = NULL; mem_Manager->items_per_block = num_items; mem_Manager->item_size = unit_size; mem_Manager->recycle_list = NULL; mem_Manager->counter = 0; return mem_Manager; } void freeMem_manager ( MEM_MANAGER * mem_Manager ) { BLOCK_START * ite_block, *temp_block; if ( !mem_Manager ) { return; } ite_block = mem_Manager->block_list; while ( ite_block ) { temp_block = ite_block; ite_block = ite_block->next; free ( ( void * ) temp_block ); } free ( ( void * ) mem_Manager ); } void * getItem ( MEM_MANAGER * mem_Manager ) { RECYCLE_MARK * mark; //this is the type of return value BLOCK_START * block; if ( !mem_Manager ) { return NULL; } if ( mem_Manager->recycle_list ) { mark = mem_Manager->recycle_list; mem_Manager->recycle_list = mark->next; return mark; } mem_Manager->counter++; if ( !mem_Manager->block_list || mem_Manager->index_in_block == mem_Manager->items_per_block ) { //pthread_mutex_lock(&gmutex); block = ckalloc ( sizeof ( BLOCK_START ) + mem_Manager->items_per_block * mem_Manager->item_size ); //mem_Manager->counter += sizeof(BLOCK_START)+mem_Manager->items_per_block*mem_Manager->item_size; //pthread_mutex_unlock(&gmutex); block->next = mem_Manager->block_list; mem_Manager->block_list = block; mem_Manager->index_in_block = 1; return ( RECYCLE_MARK * ) ( ( void * ) block + sizeof ( BLOCK_START ) ); } block = mem_Manager->block_list; return ( RECYCLE_MARK * ) ( 
	         ( void * ) block + sizeof ( BLOCK_START ) + mem_Manager->item_size * ( mem_Manager->index_in_block++ ) );
}

void returnItem ( MEM_MANAGER * mem_Manager, void * item )
{
	RECYCLE_MARK * mark;
	mark = item;
	mark->next = mem_Manager->recycle_list;
	mem_Manager->recycle_list = mark;
}

/*
 * 127mer/newhash.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"
#include "def.h"

#define PUBLIC_FUNC
#define PROTECTED_FUNC

static const kmer_t empty_kmer = {{0, 0, 0, 0}, 0, 0, 0, 0, 0, 1, 0, 0};

static inline ubyte8 modular ( KmerSet * set, Kmer seq )
{
	ubyte8 temp;
	temp = ( seq.high1 % set->size ) << 32 | ( seq.low1 >> 32 & 0xffffffff );
	temp = ( temp % set->size ) << 32 | ( seq.low1 & 0xffffffff );
	temp = ( temp % set->size ) << 32 | ( seq.high2 >> 32 & 0xffffffff );
	temp = ( temp % set->size ) << 32 | ( seq.high2 & 0xffffffff );
	temp = ( temp % set->size ) << 32 | ( seq.low2 >> 32 & 0xffffffff );
	temp = ( temp % set->size ) << 32 | ( seq.low2 & 0xffffffff );
	temp = ( ubyte8 ) ( temp % set->size );
	return temp;
}

static inline void update_kmer ( kmer_t * mer, ubyte left, ubyte right )
{
	ubyte4 cov;

	if ( left < 4 )
	{
		cov = get_kmer_left_cov ( *mer, left );

		if ( cov < MAX_KMER_COV )
		{
			set_kmer_left_cov ( *mer, left, cov + 1 );
		}
	}

	if ( right < 4 )
	{
		cov = get_kmer_right_cov ( *mer, right );

		if ( cov < MAX_KMER_COV )
		{
			set_kmer_right_cov ( *mer, right, cov + 1 );
		}
	}
}

static inline void set_new_kmer ( kmer_t * mer, Kmer seq, ubyte left, ubyte right )
{
	*mer = empty_kmer;
	set_kmer_seq ( *mer, seq );

	if ( left < 4 )
	{
		set_kmer_left_cov ( *mer, left, 1 );
	}

	if ( right < 4 )
	{
		set_kmer_right_cov ( *mer, right, 1 );
	}
}

static inline int is_prime_kh ( ubyte8 num )
{
	ubyte8 i;

	if ( num < 4 )
	{
		return 1;
	}

	if ( num % 2 == 0 )
	{
		return 0;
	}

	/* i <= num / i is equivalent to i * i <= num without overflow, so the
	 * largest candidate divisor is tested and odd squares such as 9 or 25
	 * are no longer reported as prime */
	for ( i = 3; i <= num / i; i += 2 )
	{
		if ( num % i == 0 )
		{
			return 0;
		}
	}

	return 1;
}

static inline ubyte8 find_next_prime_kh ( ubyte8 num )
{
	if ( num % 2 == 0 )
	{
		num ++;
	}

	while ( 1 )
	{
		if ( is_prime_kh ( num ) )
		{
			return num;
		}

		num += 2;
	}
}

PUBLIC_FUNC KmerSet * init_kmerset ( ubyte8 init_size, float load_factor )
{
	KmerSet * set;

	if ( init_size < 3 )
	{
		init_size = 3;
	}
	else
	{
		init_size = find_next_prime_kh ( init_size );
	}

	set = ( KmerSet * ) malloc ( sizeof (
	                                     KmerSet ) );
	set->size = init_size;
	set->count = 0;

	/* clamp the load factor first so that max is derived from the value actually stored */
	if ( load_factor <= 0 )
	{
		load_factor = 0.25f;
	}
	else if ( load_factor >= 1 )
	{
		load_factor = 0.75f;
	}

	set->load_factor = load_factor;
	set->max = set->size * load_factor;
	set->iter_ptr = 0;
	set->array = calloc ( set->size, sizeof ( kmer_t ) );
	set->flags = malloc ( ( set->size + 15 ) / 16 * 4 );
	memset ( set->flags, 0x55, ( set->size + 15 ) / 16 * 4 );
	return set;
}

PROTECTED_FUNC static inline ubyte8 get_kmerset ( KmerSet * set, Kmer seq )
{
	ubyte8 hc;
	// U256b temp;
	// temp = Kmer2int256(seq);
	// hc = temp.low % set->size;
	hc = modular ( set, seq );

	while ( 1 )
	{
		if ( is_kmer_entity_null ( set->flags, hc ) )
		{
			return hc;
		}
		else
		{
			if ( KmerEqual ( get_kmer_seq ( set->array[hc] ), seq ) )
			{
				return hc;
			}
		}

		hc ++;

		if ( hc == set->size )
		{
			hc = 0;
		}
	}

	return set->size;
}

PUBLIC_FUNC int search_kmerset ( KmerSet * set, Kmer seq, kmer_t ** rs )
{
	ubyte8 hc;
	// U256b temp;
	// temp = Kmer2int256(seq);
	// hc = temp.low % set->size;
	hc = modular ( set, seq );

	while ( 1 )
	{
		if ( is_kmer_entity_null ( set->flags, hc ) )
		{
			return 0;
		}
		else
		{
			if ( KmerEqual ( get_kmer_seq ( set->array[hc] ), seq ) )
			{
				*rs = set->array + hc;
				return 1;
			}
		}

		hc ++;

		if ( hc == set->size )
		{
			hc = 0;
		}
	}

	return 0;
}

PUBLIC_FUNC static inline int exists_kmerset ( KmerSet * set, Kmer seq )
{
	ubyte8 idx;
	idx = get_kmerset ( set, seq );
	return !is_kmer_entity_null ( set->flags, idx );
}

PROTECTED_FUNC static inline void encap_kmerset ( KmerSet * set, ubyte8 num )
{
	ubyte4 * flags, *f;
	ubyte8 i, n, size, hc;
	kmer_t key, tmp;

	if ( set->count + num <= set->max )
	{
		return;
	}

	n = set->size;

	do
	{
		if ( n < 0xFFFFFFFU )
		{
			n <<= 1;
		}
		else
		{
			n += 0xFFFFFFU;
		}

		n = find_next_prime_kh ( n );
	}
	while ( n * set->load_factor < set->count + num );

	set->array = realloc ( set->array, n * sizeof ( kmer_t ) );
	//printf("Allocate Mem %lld(%d*%lld*%d)bytes\n",thrd_num*n*sizeof(kmer_t),thrd_num,n,sizeof(kmer_t));
	fflush ( stdout );

	if ( set->array == NULL )
	{
		fprintf ( stderr,
"-- Out of memory --\n" ); abort(); } flags = malloc ( ( n + 15 ) / 16 * 4 ); memset ( flags, 0x55, ( n + 15 ) / 16 * 4 ); size = set->size; set->size = n; set->max = n * set->load_factor; f = set->flags; set->flags = flags; flags = f; // U256b temp; for ( i = 0; i < size; i++ ) { if ( !exists_kmer_entity ( flags, i ) ) { continue; } key = set->array[i]; set_kmer_entity_del ( flags, i ); while ( 1 ) { // temp = Kmer2int256(get_kmer_seq(key)); // hc = temp.low % set->size; hc = modular ( set, get_kmer_seq ( key ) ); while ( !is_kmer_entity_null ( set->flags, hc ) ) { hc ++; if ( hc == set->size ) { hc = 0; } } clear_kmer_entity_null ( set->flags, hc ); if ( hc < size && exists_kmer_entity ( flags, hc ) ) { tmp = key; key = set->array[hc]; set->array[hc] = tmp; set_kmer_entity_del ( flags, hc ); } else { set->array[hc] = key; break; } } } free ( flags ); } PUBLIC_FUNC int put_kmerset ( KmerSet * set, Kmer seq, ubyte left, ubyte right, kmer_t ** kmer_p ) { ubyte8 hc; encap_kmerset ( set, 1 ); // U256b temp; // temp = Kmer2int256(seq); // hc = temp.low % set->size; hc = modular ( set, seq ); do { if ( is_kmer_entity_null ( set->flags, hc ) ) { clear_kmer_entity_null ( set->flags, hc ); set_new_kmer ( set->array + hc, seq, left, right ); set->count ++; *kmer_p = set->array + hc; return 0; } else { if ( KmerEqual ( get_kmer_seq ( set->array[hc] ), seq ) ) { update_kmer ( set->array + hc, left, right ); set->array[hc].single = 0; *kmer_p = set->array + hc; return 1; } } hc ++; if ( hc == set->size ) { hc = 0; } } while ( 1 ); *kmer_p = NULL; return 0; } PUBLIC_FUNC byte8 count_kmerset ( KmerSet * set ) { return set->count; } PUBLIC_FUNC static inline void reset_iter_kmerset ( KmerSet * set ) { set->iter_ptr = 0; } PUBLIC_FUNC static inline ubyte8 iter_kmerset ( KmerSet * set, kmer_t ** rs ) { while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { *rs = set->array + set->iter_ptr; set->iter_ptr ++; return 1; } set->iter_ptr ++; } 
	return 0;
}

PUBLIC_FUNC void free_kmerset ( KmerSet * set )
{
	free ( set->array );
	free ( set->flags );
	free ( set );
}

PUBLIC_FUNC void free_Sets ( KmerSet ** sets, int num )
{
	int i;

	for ( i = 0; i < num; i++ )
	{
		free_kmerset ( sets[i] );
	}

	free ( ( void * ) sets );
}

int count_branch2prev ( kmer_t * node )
{
	int num = 0, i;

	for ( i = 0; i < 4; i++ )
	{
		if ( get_kmer_left_cov ( *node, i ) > 0 )
		{
			num++;
		}
	}

	return num;
}

int count_branch2next ( kmer_t * node )
{
	int num = 0, i;

	for ( i = 0; i < 4; i++ )
	{
		if ( get_kmer_right_cov ( *node, i ) > 0 )
		{
			num++;
		}
	}

	return num;
}

void dislink2prevUncertain ( kmer_t * node, char ch, boolean smaller )
{
	if ( smaller )
	{
		set_kmer_left_cov ( *node, ch, 0 );
	}
	else
	{
		set_kmer_right_cov ( *node, int_comp ( ch ), 0 );
	}
}

void dislink2nextUncertain ( kmer_t * node, char ch, boolean smaller )
{
	if ( smaller )
	{
		set_kmer_right_cov ( *node, ch, 0 );
	}
	else
	{
		set_kmer_left_cov ( *node, int_comp ( ch ), 0 );
	}
}

/*
 * 127mer/node2edge.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "stack.h" #define KMERPTBLOCKSIZE 1000 //static int loopCounter; static int nodeCounter; static int edge_length_limit = 100000; static int edge_c, edgeCounter; static preEDGE temp_edge; static char edge_seq[100000]; static void make_edge ( FILE * fp ); static void merge_linearV2 ( char bal_edge, STACK * nStack, int count, FILE * fp ); static int check_iden_kmerList ( STACK * stack1, STACK * stack2 ); //for stack static STACK * nodeStack; static STACK * bal_nodeStack; void kmer2edges ( char * outfile ) { FILE * fp; char temp[256]; sprintf ( temp, "%s.edge", outfile ); fp = ckopen ( temp, "w" ); make_edge ( fp ); fclose ( fp ); num_ed = edge_c; } static void stringBeads ( KMER_PT * firstBead, char nextch, int * node_c ) { boolean smaller, found; Kmer tempKmer, bal_word; Kmer word = firstBead->kmer; ubyte8 hash_ban; kmer_t * outgoing_node; int nodeCounter = 1, setPicker; char ch; unsigned char flag; KMER_PT * temp_pt, *prev_pt = firstBead; word = prev_pt->kmer; nodeCounter = 1; word = nextKmer ( word, nextch ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); while ( found && ( outgoing_node->linear ) ) // for every node in this line { nodeCounter++; temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = outgoing_node; temp_pt->isSmaller = smaller; if ( smaller ) { temp_pt->kmer = word; } else { temp_pt->kmer = bal_word; } prev_pt = temp_pt; if ( smaller ) { for ( ch = 0; ch < 4; ch++ ) { flag = get_kmer_right_cov ( *outgoing_node, ch ); if ( flag ) { break; } } word = nextKmer ( prev_pt->kmer, ch ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = 
bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); } else { for ( ch = 0; ch < 4; ch++ ) { flag = get_kmer_left_cov ( *outgoing_node, ch ); if ( flag ) { break; } } word = nextKmer ( prev_pt->kmer, int_comp ( ch ) ); bal_word = reverseComplement ( word, overlaplen ); if ( KmerLarger ( word, bal_word ) ) { tempKmer = bal_word; bal_word = word; word = tempKmer; smaller = 0; } else { smaller = 1; } hash_ban = hash_kmer ( word ); setPicker = hash_ban % thrd_num; found = search_kmerset ( KmerSets[setPicker], word, &outgoing_node ); } } if ( outgoing_node ) //this is always true { nodeCounter++; temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = outgoing_node; temp_pt->isSmaller = smaller; if ( smaller ) { temp_pt->kmer = word; } else { temp_pt->kmer = bal_word; } } *node_c = nodeCounter; } //search linear structure starting with the root of a tree static int startEdgeFromNode ( kmer_t * node1, FILE * fp ) { int node_c, palindrome; unsigned char flag; KMER_PT * ite_pt, *temp_pt; Kmer word1, bal_word1; char ch1; if ( node1->linear || node1->deleted ) { return 0; } // ignore floating loop word1 = node1->seq; bal_word1 = reverseComplement ( word1, overlaplen ); // linear structure for ( ch1 = 0; ch1 < 4; ch1++ ) // for every node on outgoing list { flag = get_kmer_right_cov ( *node1, ch1 ); if ( !flag ) { continue; } emptyStack ( nodeStack ); temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = node1; temp_pt->isSmaller = 1; temp_pt->kmer = word1; stringBeads ( temp_pt, ch1, &node_c ); //printf("%d nodes\n",node_c); if ( node_c < 2 ) { printf ( "%d nodes in this line!!!!!!!!!!!\n", node_c ); } else { //make a reverse complement node list stackBackup ( nodeStack ); emptyStack ( bal_nodeStack ); while ( ( ite_pt = ( KMER_PT * ) stackPop ( nodeStack ) ) != NULL ) { temp_pt = ( KMER_PT * ) 
stackPush ( bal_nodeStack ); temp_pt->kmer = reverseComplement ( ite_pt->kmer, overlaplen ); } stackRecover ( nodeStack ); palindrome = check_iden_kmerList ( nodeStack, bal_nodeStack ); stackRecover ( nodeStack ); if ( palindrome ) { merge_linearV2 ( 0, nodeStack, node_c, fp ); } else { merge_linearV2 ( 1, nodeStack, node_c, fp ); } } } //every possible outgoing edges for ( ch1 = 0; ch1 < 4; ch1++ ) // for every node on incoming list { flag = get_kmer_left_cov ( *node1, ch1 ); if ( !flag ) { continue; } emptyStack ( nodeStack ); temp_pt = ( KMER_PT * ) stackPush ( nodeStack ); temp_pt->node = node1; temp_pt->isSmaller = 0; temp_pt->kmer = bal_word1; stringBeads ( temp_pt, int_comp ( ch1 ), &node_c ); if ( node_c < 2 ) { printf ( "%d nodes in this line!!!!!!!!!!!\n", node_c ); } else { //make a reverse complement node list stackBackup ( nodeStack ); emptyStack ( bal_nodeStack ); while ( ( ite_pt = ( KMER_PT * ) stackPop ( nodeStack ) ) != NULL ) { temp_pt = ( KMER_PT * ) stackPush ( bal_nodeStack ); temp_pt->kmer = reverseComplement ( ite_pt->kmer, overlaplen ); } stackRecover ( nodeStack ); palindrome = check_iden_kmerList ( nodeStack, bal_nodeStack ); stackRecover ( nodeStack ); if ( palindrome ) { merge_linearV2 ( 0, nodeStack, node_c, fp ); //printf("edge is palindrome with length %d\n",temp_edge.length); } else { merge_linearV2 ( 1, nodeStack, node_c, fp ); } } } //every possible incoming edges return 0; } void make_edge ( FILE * fp ) { int i = 0; kmer_t * node1; KmerSet * set; KmerSetsPatch = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); for ( i = 0; i < thrd_num; i++ ) { KmerSetsPatch[i] = init_kmerset ( 1000, K_LOAD_FACTOR ); } nodeStack = ( STACK * ) createStack ( KMERPTBLOCKSIZE, sizeof ( KMER_PT ) ); bal_nodeStack = ( STACK * ) createStack ( KMERPTBLOCKSIZE, sizeof ( KMER_PT ) ); edge_c = nodeCounter = 0; edgeCounter = 0; for ( i = 0; i < thrd_num; i++ ) { set = KmerSets[i]; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( 
!is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { node1 = set->array + set->iter_ptr; startEdgeFromNode ( node1, fp ); } set->iter_ptr ++; } } printf ( "%d (%d) edges %d extra nodes\n", edge_c, edgeCounter, nodeCounter ); freeStack ( nodeStack ); freeStack ( bal_nodeStack ); } static void merge_linearV2 ( char bal_edge, STACK * nStack, int count, FILE * fp ) { int length, char_index; preEDGE * newedge; kmer_t * del_node, *longNode; char * tightSeq, firstCh; long long symbol = 0; int len_tSeq; Kmer wordplus, bal_wordplus; ubyte8 hash_ban; KMER_PT * last_np = ( KMER_PT * ) stackPop ( nStack ); KMER_PT * second_last_np = ( KMER_PT * ) stackPop ( nStack ); KMER_PT * first_np, *second_np = NULL; KMER_PT * temp; boolean found; int setPicker; length = count - 1; len_tSeq = length; if ( len_tSeq >= edge_length_limit ) { tightSeq = ( char * ) ckalloc ( len_tSeq * sizeof ( char ) ); } else { tightSeq = edge_seq; } char_index = length - 1; newedge = &temp_edge; newedge->to_node = last_np->kmer; newedge->length = length; newedge->bal_edge = bal_edge; tightSeq[char_index--] = lastCharInKmer ( last_np->kmer ); firstCh = firstCharInKmer ( second_last_np->kmer ); dislink2prevUncertain ( last_np->node, firstCh, last_np->isSmaller ); stackRecover ( nStack ); while ( nStack->item_c > 1 ) { second_np = ( KMER_PT * ) stackPop ( nStack ); } first_np = ( KMER_PT * ) stackPop ( nStack ); //unlink first node to the second one dislink2nextUncertain ( first_np->node, lastCharInKmer ( second_np->kmer ), first_np->isSmaller ); //printf("from %llx, to %llx\n",first_np->node->seq,last_np->node->seq); //now temp is the last node in line, out_node is the second last node in line newedge->from_node = first_np->kmer; //create a long kmer for edge with length 1 if ( length == 1 ) { nodeCounter++; wordplus = KmerPlus ( newedge->from_node, lastCharInKmer ( newedge->to_node ) ); bal_wordplus = reverseComplement ( wordplus, overlaplen + 1 ); /* Kmer temp = 
KmerPlus(reverseComplement(newedge->to_node,overlaplen), lastCharInKmer(reverseComplement(newedge->from_node,overlaplen))); fprintf(stderr,"(%llx %llx) (%llx %llx) (%llx %llx)\n", wordplus.high,wordplus.low,temp.high,temp.low, bal_wordplus.high,bal_wordplus.low); */ edge_c++; edgeCounter++; if ( KmerSmaller ( wordplus, bal_wordplus ) ) { hash_ban = hash_kmer ( wordplus ); setPicker = hash_ban % thrd_num; found = put_kmerset ( KmerSetsPatch[setPicker], wordplus, 4, 4, &longNode ); if ( found ) { printf ( "longNode %llx %llx %llx %llx already exist\n", wordplus.high1, wordplus.low1, wordplus.high2, wordplus.low2 ); } longNode->l_links = edge_c; longNode->twin = ( unsigned char ) ( bal_edge + 1 ); } else { hash_ban = hash_kmer ( bal_wordplus ); setPicker = hash_ban % thrd_num; found = put_kmerset ( KmerSetsPatch[setPicker], bal_wordplus, 4, 4, &longNode ); if ( found ) { printf ( "longNode %llx %llx %llx %llx already exist\n", wordplus.high1, wordplus.low1, wordplus.high2, wordplus.low2 ); } longNode->l_links = edge_c + bal_edge; longNode->twin = ( unsigned char ) ( -bal_edge + 1 ); } } else { edge_c++; edgeCounter++; } stackRecover ( nStack ); //mark all the internal nodes temp = ( KMER_PT * ) stackPop ( nStack ); while ( nStack->item_c > 1 ) { temp = ( KMER_PT * ) stackPop ( nStack ); del_node = temp->node; del_node->inEdge = 1; symbol += get_kmer_left_covs ( *del_node ); if ( temp->isSmaller ) { del_node->l_links = edge_c; del_node->twin = ( unsigned char ) ( bal_edge + 1 ); } else { del_node->l_links = edge_c + bal_edge; del_node->twin = ( unsigned char ) ( -bal_edge + 1 ); } tightSeq[char_index--] = lastCharInKmer ( temp->kmer ); } newedge->seq = tightSeq; if ( length > 1 ) { newedge->cvg = symbol / ( length - 1 ) * 10 > MaxEdgeCov ? 
		               MaxEdgeCov : symbol / ( length - 1 ) * 10;
	}
	else
	{
		newedge->cvg = 0;
	}

	output_1edge ( newedge, fp );

	if ( len_tSeq >= edge_length_limit )
	{
		free ( ( void * ) tightSeq );
	}

	edge_c += bal_edge;

	if ( edge_c % 10000000 == 0 )
	{
		printf ( "--- %d edges built\n", edge_c );
	}

	return;
}

static int check_iden_kmerList ( STACK * stack1, STACK * stack2 )
{
	KMER_PT * ite1, *ite2;

	if ( !stack1->item_c || !stack2->item_c ) // one of them is empty
	{
		return 0;
	}

	while ( ( ite1 = ( KMER_PT * ) stackPop ( stack1 ) ) != NULL && ( ite2 = ( KMER_PT * ) stackPop ( stack2 ) ) != NULL )
	{
		if ( !KmerEqual ( ite1->kmer, ite2->kmer ) )
		{
			return 0;
		}
	}

	if ( stack1->item_c || stack2->item_c ) // one of them is not empty
	{
		return 0;
	}
	else
	{
		return 1;
	}
}

/*
 * 127mer/orderContig.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #include "dfibHeap.h" #include "fibHeap.h" #include "darray.h" #define CNBLOCKSIZE 10000 #define MAXC 10000 #define MAXCinBetween 200 #define MaxNodeInSub 10000 #define GapLowerBound -2000 #define GapUpperBound 300000 static boolean static_f = 0; static double OverlapPercent = 0.05; static double ConflPercent = 0.05; static int gapCounter; static int orienCounter; static int throughCounter; static DARRAY * solidArray; static DARRAY * tempArray; static int solidCounter; static CTGinHEAP ctg4heapArray[MaxNodeInSub + 1]; // index in this array are put to heaps, start from 1 static unsigned int nodesInSub[MaxNodeInSub]; static int nodeDistance[MaxNodeInSub]; static int nodeCounter; static unsigned int nodesInSubInOrder[MaxNodeInSub]; static int nodeDistanceInOrder[MaxNodeInSub]; static DARRAY * scaf3, *scaf5; static DARRAY * gap3, *gap5; static unsigned int downstreamCTG[MAXCinBetween]; static unsigned int upstreamCTG[MAXCinBetween]; static int dsCtgCounter; static int usCtgCounter; static CONNECT * checkConnect ( unsigned int from_c, unsigned int to_c ); static int maskPuzzle ( int num_connect, unsigned int contigLen ); static void freezing(); static boolean checkOverlapInBetween ( double tolerance ); static int setConnectDelete ( unsigned int from_c, unsigned int to_c, char flag, boolean cleanBinding ); static int setConnectWP ( unsigned int from_c, unsigned int to_c, char flag ); static void general_linearization ( boolean strict ); static void debugging2(); static void smallScaf(); static void detectBreakScaf(); static boolean checkSimple ( DARRAY * ctgArray, int count ); static void checkCircle(); //find the only connection involved in connection binding static CONNECT * getBindCnt ( unsigned int ctg ) { CONNECT * ite_cnt; CONNECT * bindCnt = NULL; CONNECT * temp_cnt = NULL; CONNECT * temp3_cnt = NULL; int count = 0; int count2 = 0; int count3 = 0; ite_cnt = 
contig_array[ctg].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->nextInScaf ) { count++; bindCnt = ite_cnt; } if ( ite_cnt->prevInScaf ) { temp_cnt = ite_cnt; count2++; } if ( ite_cnt->singleInScaf ) { temp3_cnt = ite_cnt; count3++; } ite_cnt = ite_cnt->next; } if ( count == 1 ) { return bindCnt; } if ( count == 0 && count2 == 1 ) { return temp_cnt; } if ( count == 0 && count2 == 0 && count3 == 1 ) { return temp3_cnt; } return NULL; } static void createAnalogousCnt ( unsigned int sourceStart, CONNECT * originCnt, int gap, unsigned int targetStart, unsigned int targetStop ) { CONNECT * temp_cnt; unsigned int balTargetStart = getTwinCtg ( targetStart ); unsigned int balTargetStop = getTwinCtg ( targetStop ); unsigned int balSourceStart = getTwinCtg ( sourceStart ); unsigned int balSourceStop = getTwinCtg ( originCnt->contigID ); originCnt->deleted = 1; temp_cnt = getCntBetween ( balSourceStop, balSourceStart ); temp_cnt->deleted = 1; if ( gap < GapLowerBound ) { gapCounter++; return; } temp_cnt = add1Connect ( targetStart, targetStop, gap, originCnt->weight, 1 ); if ( temp_cnt ) { temp_cnt->inherit = 1; } temp_cnt = add1Connect ( balTargetStop, balTargetStart, gap, originCnt->weight, 1 ); if ( temp_cnt ) { temp_cnt->inherit = 1; } } // increase #long_pe_support for a conncet by 1 static void add1LongPEcov ( unsigned int fromCtg, unsigned int toCtg, int weight ) { //check if they are on the same scaff if ( contig_array[fromCtg].from_vt != contig_array[toCtg].from_vt || contig_array[fromCtg].to_vt != contig_array[toCtg].to_vt ) { printf ( "Warning from add1LongPEcov: contig %d and %d not on the same scaffold\n", fromCtg, toCtg ); return; } if ( contig_array[fromCtg].indexInScaf >= contig_array[toCtg].indexInScaf ) { printf ( "Warning from add1LongPEcov: wrong about order between contig %d and %d\n", fromCtg, toCtg ); return; } CONNECT * bindCnt; unsigned int prevCtg = fromCtg; bindCnt = getBindCnt ( fromCtg ); while ( bindCnt ) { if ( bindCnt->maxGap + weight <= 
1000 ) { bindCnt->maxGap += weight; } else { bindCnt->maxGap = 1000; } if ( fromCtg == 0 && toCtg == 0 ) printf ( "link (%d %d ) covered by link (%d %d), wt %d\n", prevCtg, bindCnt->contigID, fromCtg, toCtg, weight ); if ( bindCnt->contigID == toCtg ) { break; } prevCtg = bindCnt->contigID; bindCnt = bindCnt->nextInScaf; } unsigned int bal_fc = getTwinCtg ( fromCtg ); unsigned int bal_tc = getTwinCtg ( toCtg ); bindCnt = getBindCnt ( bal_tc ); prevCtg = bal_tc; while ( bindCnt ) { if ( bindCnt->maxGap + weight <= 1000 ) { bindCnt->maxGap += weight; } else { bindCnt->maxGap = 1000; } if ( fromCtg == 0 && toCtg == 0 ) printf ( "link (%d %d ) covered by link (%d %d), wt %d\n", prevCtg, bindCnt->contigID, fromCtg, toCtg, weight ); if ( bindCnt->contigID == bal_fc ) { return; } prevCtg = bindCnt->contigID; bindCnt = bindCnt->nextInScaf; } printf ( "Warning from add1LongPEcov: not reach the end (%d %d) (B)\n", bal_tc, bal_fc ); } // for long pair ends, move the connections along scaffolds established by shorter pair ends till reach the ends static void downSlide() { int len = 0, gap; unsigned int i; CONNECT * ite_cnt, *bindCnt, *temp_cnt; unsigned int bottomCtg, topCtg, bal_i; unsigned int targetCtg, bal_target; boolean getThrough, orienConflict; int slideLen, slideLen2; orienCounter = throughCounter = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } bal_i = getTwinCtg ( i ); len = slideLen = 0; bottomCtg = i; //find the last unmasked contig in this binding while ( bindCnt->nextInScaf ) { len += bindCnt->gapLen + contig_array[bindCnt->contigID].length; if ( contig_array[bindCnt->contigID].mask == 0 ) { bottomCtg = bindCnt->contigID; slideLen = len; } bindCnt = bindCnt->nextInScaf; } len += bindCnt->gapLen + contig_array[bindCnt->contigID].length; if ( contig_array[bindCnt->contigID].mask == 0 || bottomCtg == 0 ) { bottomCtg = bindCnt->contigID; 
			slideLen = len;
		}

		//check each connection from long pair ends
		ite_cnt = contig_array[i].downwardConnect;

		while ( ite_cnt )
		{
			if ( ite_cnt->deleted || ite_cnt->mask || ite_cnt->singleInScaf
			        || ite_cnt->nextInScaf || ite_cnt->prevInScaf || ite_cnt->inherit )
			{
				ite_cnt = ite_cnt->next;
				continue;
			}

			targetCtg = ite_cnt->contigID;
			/* set here so the warning below never reads bal_target before it is assigned */
			bal_target = getTwinCtg ( targetCtg );

			if ( contig_array[i].from_vt == contig_array[targetCtg].from_vt ) // on the same scaff
			{
				if ( contig_array[i].indexInScaf > contig_array[targetCtg].indexInScaf )
				{
					orienCounter++;
				}
				else
				{
					throughCounter++;
				}

				setConnectDelete ( i, ite_cnt->contigID, 1, 0 );
				ite_cnt = ite_cnt->next;
				continue;
			}

			//check if this connection conflicts with previous scaffold orientationally
			temp_cnt = getBindCnt ( targetCtg );
			orienConflict = 0;

			if ( temp_cnt )
			{
				while ( temp_cnt->nextInScaf )
				{
					if ( temp_cnt->contigID == i )
					{
						orienConflict = 1;
						printf ( "Warning from downSlide: still on the same scaff: %d and %d\n"
						         , i, targetCtg );
						printf ( "on scaff %d and %d\n", contig_array[i].from_vt, contig_array[targetCtg].from_vt );
						printf ( "on bal_scaff %d and %d\n", contig_array[bal_target].to_vt, contig_array[bal_i].to_vt );
						break;
					}

					temp_cnt = temp_cnt->nextInScaf;
				}

				if ( temp_cnt->contigID == i )
				{
					orienConflict = 1;
				}
			}

			if ( orienConflict )
			{
				orienCounter++;
				setConnectDelete ( i, ite_cnt->contigID, 1, 0 );
				ite_cnt = ite_cnt->next;
				continue;
			}

			//find the most top contig along previous scaffold starting with the target contig of this connection
			slideLen2 = 0;

			if ( contig_array[targetCtg].mask == 0 )
			{
				topCtg = bal_target;
			}
			else
			{
				topCtg = 0;
			}

			temp_cnt = getBindCnt ( bal_target );
			getThrough = len = 0;

			if ( temp_cnt )
			{
				//find the last contig in this binding
				while ( temp_cnt->nextInScaf )
				{
					//check if this route reaches bal_i
					if ( temp_cnt->contigID == bal_i )
					{
						printf ( "Warning from downSlide: (B) still on the same scaff: %d and %d (%d and %d)\n",
						         i, targetCtg, bal_target, bal_i );
						printf ( "on scaff %d and %d\n",
contig_array[i].from_vt, contig_array[targetCtg].from_vt ); printf ( "on bal_scaff %d and %d\n", contig_array[bal_target].to_vt, contig_array[bal_i].to_vt ); getThrough = 1; break; } len += temp_cnt->gapLen + contig_array[temp_cnt->contigID].length; if ( contig_array[temp_cnt->contigID].mask == 0 ) { topCtg = temp_cnt->contigID; slideLen2 = len; } temp_cnt = temp_cnt->nextInScaf; } len += temp_cnt->gapLen + contig_array[temp_cnt->contigID].length; if ( contig_array[temp_cnt->contigID].mask == 0 || topCtg == 0 ) { topCtg = temp_cnt->contigID; slideLen2 = len; } if ( temp_cnt->contigID == bal_i ) { getThrough = 1; } else { topCtg = getTwinCtg ( topCtg ); } } else { topCtg = targetCtg; } if ( getThrough ) { throughCounter++; setConnectDelete ( i, ite_cnt->contigID, 1, 0 ); ite_cnt = ite_cnt->next; continue; } //add a connection between bottomCtg and topCtg gap = ite_cnt->gapLen - slideLen - slideLen2; if ( bottomCtg != topCtg && ! ( i == bottomCtg && targetCtg == topCtg ) ) { createAnalogousCnt ( i, ite_cnt, gap, bottomCtg, topCtg ); if ( contig_array[bottomCtg].mask || contig_array[topCtg].mask ) { printf ( "downSlide to masked contig\n" ); } } ite_cnt = ite_cnt->next; } //for each connect } // for each contig printf ( "downSliding is done...orienConflict %d, fall inside %d\n", orienCounter, throughCounter ); } static boolean setNextInScaf ( CONNECT * cnt, CONNECT * nextCnt ) { if ( !cnt ) { printf ( "setNextInScaf: empty pointer\n" ); return 0; } if ( !nextCnt ) { cnt->nextInScaf = nextCnt; return 1; } if ( cnt->mask || cnt->deleted ) { printf ( "setNextInScaf: cnt is masked or deleted\n" ); return 0; } if ( nextCnt->deleted || nextCnt->mask ) { printf ( "setNextInScaf: nextCnt is masked or deleted\n" ); return 0; } cnt->nextInScaf = nextCnt; return 1; } static boolean setPrevInScaf ( CONNECT * cnt, boolean flag ) { if ( !cnt ) { printf ( "setPrevInScaf: empty pointer\n" ); return 0; } if ( !flag ) { cnt->prevInScaf = flag; return 1; } if ( cnt->mask || cnt->deleted 
) { printf ( "setPrevInScaf: cnt is masked or deleted\n" ); return 0; } cnt->prevInScaf = flag; return 1; } /* connect A is upstream to B, replace A with C from_c > branch_c - to_c from_c_new */ static void substitueUSinScaf ( CONNECT * origin, unsigned int from_c_new ) { if ( !origin || !origin->nextInScaf ) { return; } unsigned int branch_c, to_c; unsigned int bal_branch_c, bal_to_c; unsigned int bal_from_c_new = getTwinCtg ( from_c_new ); CONNECT * bal_origin, *bal_nextCNT, *prevCNT, *bal_prevCNT; branch_c = origin->contigID; to_c = origin->nextInScaf->contigID; bal_branch_c = getTwinCtg ( branch_c ); bal_to_c = getTwinCtg ( to_c ); prevCNT = checkConnect ( from_c_new, branch_c ); bal_nextCNT = checkConnect ( bal_to_c, bal_branch_c ); if ( !bal_nextCNT ) { printf ( "substitueUSinScaf: no connect between %d and %d\n", bal_to_c, bal_branch_c ); return; } bal_origin = bal_nextCNT->nextInScaf; bal_prevCNT = checkConnect ( bal_branch_c, bal_from_c_new ); setPrevInScaf ( bal_nextCNT->nextInScaf, 0 ); setNextInScaf ( prevCNT, origin->nextInScaf ); setNextInScaf ( bal_nextCNT, bal_prevCNT ); setPrevInScaf ( bal_prevCNT, 1 ); setNextInScaf ( origin, NULL ); setPrevInScaf ( bal_origin, 0 ); } /* connect B is downstream to C, replace B with A to_c from_c - branch_c < to_c_new */ static void substitueDSinScaf ( CONNECT * origin, unsigned int branch_c, unsigned int to_c_new ) { if ( !origin || !origin->prevInScaf ) { return; } unsigned int to_c; unsigned int bal_branch_c, bal_to_c, bal_to_c_new; unsigned int from_c, bal_from_c; CONNECT * bal_origin, *prevCNT, *bal_prevCNT; CONNECT * nextCNT, *bal_nextCNT; to_c = origin->contigID; bal_branch_c = getTwinCtg ( branch_c ); bal_to_c = getTwinCtg ( to_c ); bal_origin = getCntBetween ( bal_to_c, bal_branch_c ); if ( !bal_origin ) { printf ( "substitueDSinScaf: no connect between %d and %d\n", bal_to_c, bal_branch_c ); return; } bal_from_c = bal_origin->nextInScaf->contigID; from_c = getTwinCtg ( bal_from_c ); bal_to_c_new = 
getTwinCtg ( to_c_new );
	prevCNT = checkConnect ( from_c, branch_c );
	nextCNT = checkConnect ( branch_c, to_c_new );
	setNextInScaf ( prevCNT, nextCNT );
	setPrevInScaf ( nextCNT, 1 );
	bal_nextCNT = checkConnect ( bal_to_c_new, bal_branch_c );
	bal_prevCNT = checkConnect ( bal_branch_c, bal_from_c );
	setNextInScaf ( bal_nextCNT, bal_prevCNT );
	setPrevInScaf ( origin, 0 );
	setNextInScaf ( bal_origin, NULL );
}

//count the valid (neither deleted nor masked) connects leaving ctg;
//return 1 directly if preCNT is already bound to a next connect
static int validConnect ( unsigned int ctg, CONNECT * preCNT )
{
	if ( preCNT && preCNT->nextInScaf )
	{
		return 1;
	}

	CONNECT * cn_temp;
	int count = 0;

	if ( !contig_array[ctg].downwardConnect )
	{
		return count;
	}

	cn_temp = contig_array[ctg].downwardConnect;

	while ( cn_temp )
	{
		if ( !cn_temp->deleted && !cn_temp->mask )
		{
			count++;
		}

		cn_temp = cn_temp->next;
	}

	return count;
}

//get the connect leading to the next contig of ctg: follow the binding of preCNT if there is one,
//otherwise require exactly one valid connect out of ctg and at most one valid connect into it
static CONNECT * getNextContig ( unsigned int ctg, CONNECT * preCNT, boolean * exception )
{
	CONNECT * cn_temp, *retCNT = NULL;
	int count = 0, valid_in;
	unsigned int nextCtg, bal_ctg;
	*exception = 0;

	if ( preCNT && preCNT->nextInScaf )
	{
		if ( preCNT->contigID != ctg )
		{
			printf ( "pre cnt does not lead to %d\n", ctg );
		}

		nextCtg = preCNT->nextInScaf->contigID;
		cn_temp = getCntBetween ( ctg, nextCtg );

		if ( cn_temp && ( cn_temp->mask || cn_temp->deleted ) )
		{
			printf ( "getNextContig: arc(%d %d) twin (%d %d) with mask %d deleted %d\n"
			         , ctg, nextCtg, getTwinCtg ( nextCtg ), getTwinCtg ( ctg )
			         , cn_temp->mask, cn_temp->deleted );

			if ( !cn_temp->prevInScaf )
			{
				printf ( "it does not even have a prevInScaf\n" );
			}

			cn_temp = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( ctg ) );

			//the twin connect may be missing altogether, so test the pointer first
			if ( !cn_temp || !cn_temp->nextInScaf )
			{
				printf ( "its twin cnt does not have a nextInScaf\n" );
			}

			fflush ( stdout );
			*exception = 1;
		}
		else
		{
			return preCNT->nextInScaf;
		}
	}

	bal_ctg = getTwinCtg ( ctg );
	valid_in = validConnect ( bal_ctg, NULL );

	if ( valid_in > 1 )
	{
		return NULL;
	}

	if ( !contig_array[ctg].downwardConnect )
	{
		return NULL;
	}

	cn_temp = contig_array[ctg].downwardConnect;

	while ( cn_temp )
	{
		if ( cn_temp->mask || cn_temp->deleted )
		{
cn_temp = cn_temp->next; continue; } count++; if ( count == 1 ) { retCNT = cn_temp; } else if ( count == 2 ) { return NULL; } cn_temp = cn_temp->next; } return retCNT; } // get the valid connect between 2 given ctgs static CONNECT * checkConnect ( unsigned int from_c, unsigned int to_c ) { CONNECT * cn_temp = getCntBetween ( from_c, to_c ); if ( !cn_temp ) { return NULL; } if ( !cn_temp->mask && !cn_temp->deleted ) { return cn_temp; } return NULL; } static int setConnectMask ( unsigned int from_c, unsigned int to_c, char mask ) { CONNECT * cn_temp, *cn_bal, *cn_ds, *cn_us; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); unsigned int ctg3, bal_ctg3; cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->mask = mask; cn_bal->mask = mask; if ( !mask ) { return 1; } if ( cn_temp->nextInScaf ) //undo the binding { setPrevInScaf ( cn_temp->nextInScaf, 0 ); ctg3 = cn_temp->nextInScaf->contigID; setNextInScaf ( cn_temp, NULL ); bal_ctg3 = getTwinCtg ( ctg3 ); cn_ds = getCntBetween ( bal_ctg3, bal_tc ); setNextInScaf ( cn_ds, NULL ); setPrevInScaf ( cn_bal, 0 ); } // ctg3 -> from_c -> to_c // bal_ctg3 <- bal_fc <- bal_tc if ( cn_bal->nextInScaf ) { setPrevInScaf ( cn_bal->nextInScaf, 0 ); bal_ctg3 = cn_bal->nextInScaf->contigID; setNextInScaf ( cn_bal, NULL ); ctg3 = getTwinCtg ( bal_ctg3 ); cn_us = getCntBetween ( ctg3, from_c ); setNextInScaf ( cn_us, NULL ); setPrevInScaf ( cn_temp, 0 ); } return 1; } static boolean setConnectUsed ( unsigned int from_c, unsigned int to_c, char flag ) { CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->used = flag; cn_bal->used = flag; return 1; } static int setConnectWP ( unsigned int from_c, unsigned int to_c, char flag ) 
{ CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->weakPoint = flag; cn_bal->weakPoint = flag; //fprintf(stderr,"contig %d and %d, weakPoint %d\n",from_c,to_c,cn_temp->weakPoint); //fprintf(stderr,"contig %d and %d, weakPoint %d\n",bal_tc,bal_fc,cn_bal->weakPoint); return 1; } static int setConnectDelete ( unsigned int from_c, unsigned int to_c, char flag, boolean cleanBinding ) { CONNECT * cn_temp, *cn_bal; unsigned int bal_fc = getTwinCtg ( from_c ); unsigned int bal_tc = getTwinCtg ( to_c ); cn_temp = getCntBetween ( from_c, to_c ); cn_bal = getCntBetween ( bal_tc, bal_fc ); if ( !cn_temp || !cn_bal ) { return 0; } cn_temp->deleted = flag; cn_bal->deleted = flag; if ( !flag ) { return 1; } if ( cleanBinding ) { cn_temp->prevInScaf = 0; cn_temp->nextInScaf = NULL; cn_bal->prevInScaf = 0; cn_bal->nextInScaf = NULL; } return 1; } static void maskContig ( unsigned int ctg, boolean flag ) { unsigned int bal_ctg, ctg2, bal_ctg2; CONNECT * cn_temp; bal_ctg = getTwinCtg ( ctg ); cn_temp = contig_array[ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->prevInScaf || cn_temp->nextInScaf || cn_temp->singleInScaf ) { cn_temp = cn_temp->next; continue; } ctg2 = cn_temp->contigID; setConnectMask ( ctg, ctg2, flag ); cn_temp = cn_temp->next; } // bal_ctg2 <- bal_ctg cn_temp = contig_array[bal_ctg].downwardConnect; while ( cn_temp ) { if ( cn_temp->mask || cn_temp->prevInScaf || cn_temp->nextInScaf || cn_temp->singleInScaf ) { cn_temp = cn_temp->next; continue; } bal_ctg2 = cn_temp->contigID; setConnectMask ( bal_ctg, bal_ctg2, flag ); cn_temp = cn_temp->next; } contig_array[ctg].mask = flag; contig_array[bal_ctg].mask = flag; } static int maskPuzzle ( int num_connect, unsigned int contigLen ) { int in_num, out_num, flag = 0, puzzleCounter = 0; unsigned int 
i, bal_i;

	for ( i = 1; i <= num_ctg; i++ )
	{
		if ( contigLen && contig_array[i].length > contigLen )
		{
			break;
		}

		if ( contig_array[i].mask )
		{
			continue;
		}

		bal_i = getTwinCtg ( i );
		in_num = validConnect ( bal_i, NULL );
		out_num = validConnect ( i, NULL );

		if ( ( in_num > 1 || out_num > 1 ) && ( in_num + out_num >= num_connect ) )
		{
			flag++;
			maskContig ( i, 1 );
		}

		in_num = validConnect ( bal_i, NULL );
		out_num = validConnect ( i, NULL );

		if ( in_num > 1 || out_num > 1 )
		{
			puzzleCounter++;
			//debugging2(i);
		}

		if ( isSmallerThanTwin ( i ) )
		{
			i++;
		}
	}

	printf ( "Masked %d contigs, %d puzzles left\n", flag, puzzleCounter );
	return flag;
}

//delete unbound connects whose weight is below cut_off, and restore
//previously deleted weak connects whose weight now reaches cut_off
static void deleteWeakCnt ( int cut_off )
{
	unsigned int i;
	CONNECT * cn_temp1;
	int weaks = 0, counter = 0;

	for ( i = 1; i <= num_ctg; i++ )
	{
		cn_temp1 = contig_array[i].downwardConnect;

		while ( cn_temp1 )
		{
			if ( !cn_temp1->mask && !cn_temp1->deleted && !cn_temp1->nextInScaf
			        && !cn_temp1->singleInScaf && !cn_temp1->prevInScaf )
			{
				counter++;
			}

			if ( cn_temp1->weak && cn_temp1->deleted && cn_temp1->weight >= cut_off )
			{
				cn_temp1->deleted = 0;
				cn_temp1->weak = 0;
			}
			else if ( !cn_temp1->deleted && cn_temp1->weight > 0 && cn_temp1->weight < cut_off
			          && !cn_temp1->nextInScaf && !cn_temp1->prevInScaf )
			{
				cn_temp1->deleted = 1;
				cn_temp1->weak = 1;

				if ( cn_temp1->singleInScaf )
				{
					cn_temp1->singleInScaf = 0;
				}

				if ( !cn_temp1->mask )
				{
					weaks++;
				}
			}

			cn_temp1 = cn_temp1->next;
		}
	}

	printf ( "%d weak connects removed (there were %d active connects)\n", weaks, counter );
	checkCircle();
}

//check if one contig is linearly connected to the other ->C1->C2...
static int linearC2C ( unsigned int starter, CONNECT * cnt2c1, unsigned int c2, int min_dis, int max_dis ) { int out_num, in_num; CONNECT * prevCNT, *cnt, *cn_temp; unsigned int c1, bal_c1, ctg, bal_c2; int len = 0; unsigned int bal_start = getTwinCtg ( starter ); boolean excep; c1 = cnt2c1->contigID; if ( c1 == c2 ) { printf ( "linearC2C: c1(%d) and c2(%d) are the same contig\n", c1, c2 ); return -1; } bal_c1 = getTwinCtg ( c1 ); in_num = validConnect ( bal_c1, NULL ); if ( in_num > 1 ) { return 0; } dsCtgCounter = 1; usCtgCounter = 0; downstreamCTG[dsCtgCounter++] = c1; bal_c2 = getTwinCtg ( c2 ); upstreamCTG[usCtgCounter++] = bal_c2; // check if c1 is linearly connected to c2 by pe connections cnt = prevCNT = cnt2c1; while ( ( cnt = getNextContig ( c1, prevCNT, &excep ) ) != NULL ) { c1 = cnt->contigID; len += cnt->gapLen + contig_array[c1].length; if ( c1 == c2 ) { return 1; } if ( len > max_dis || c1 == starter || c1 == bal_start ) { return 0; } downstreamCTG[dsCtgCounter++] = c1; if ( dsCtgCounter >= MAXCinBetween ) { printf ( "%d downstream contigs, start at %d, max_dis %d, current dis %d\n" , dsCtgCounter, starter, max_dis, len ); return 0; } prevCNT = cnt; } out_num = validConnect ( c1, NULL ); if ( out_num ) { return 0; } //find the most upstream contig to c2 cnt = prevCNT = NULL; ctg = bal_c2; while ( ( cnt = getNextContig ( ctg, prevCNT, &excep ) ) != NULL ) { ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; if ( len > max_dis || ctg == starter || ctg == bal_start ) { return 0; } prevCNT = cnt; upstreamCTG[usCtgCounter++] = ctg; if ( usCtgCounter >= MAXCinBetween ) { printf ( "%d upstream contigs, start at %d, max_dis %d, current dis %d\n" , usCtgCounter, starter, max_dis, len ); return 0; } } if ( dsCtgCounter + usCtgCounter > MAXCinBetween ) { printf ( "%d downstream and %d upstream contigs\n", dsCtgCounter, usCtgCounter ); return 0; } out_num = validConnect ( ctg, NULL ); if ( out_num ) { return 0; } c2 = getTwinCtg ( ctg ); 
min_dis -= len; max_dis -= len; if ( c1 == c2 || c1 == ctg || max_dis < 0 ) { return 0; } cn_temp = getCntBetween ( c1, c2 ); if ( cn_temp ) { setConnectMask ( c1, c2, 0 ); setConnectDelete ( c1, c2, 0, 0 ); return 1; } len = ( min_dis + max_dis ) / 2 >= 0 ? ( min_dis + max_dis ) / 2 : 0; cn_temp = allocateCN ( c2, len ); if ( cntLookupTable ) { putCnt2LookupTable ( c1, cn_temp ); } cn_temp->weight = 0; // special connect from the original graph cn_temp->next = contig_array[c1].downwardConnect; contig_array[c1].downwardConnect = cn_temp; bal_c1 = getTwinCtg ( c1 ); bal_c2 = getTwinCtg ( c2 ); cn_temp = allocateCN ( bal_c1, len ); if ( cntLookupTable ) { putCnt2LookupTable ( bal_c2, cn_temp ); } cn_temp->weight = 0; // special connect from the original graph cn_temp->next = contig_array[bal_c2].downwardConnect; contig_array[bal_c2].downwardConnect = cn_temp; return 1; } //catenate upstream contig array and downstream contig array to solidArray static void catUsDsContig() { int i; for ( i = 0; i < dsCtgCounter; i++ ) { * ( unsigned int * ) darrayPut ( solidArray, i ) = downstreamCTG[i]; } for ( i = usCtgCounter - 1; i >= 0; i-- ) { * ( unsigned int * ) darrayPut ( solidArray, dsCtgCounter++ ) = getTwinCtg ( upstreamCTG[i] ); } solidCounter = dsCtgCounter; } //binding the connections between contigs in solidArray static void consolidate() { int i, j; CONNECT * prevCNT = NULL; CONNECT * cnt; unsigned int to_ctg; unsigned int from_ctg = * ( unsigned int * ) darrayGet ( solidArray, 0 ); for ( i = 1; i < solidCounter; i++ ) { to_ctg = * ( unsigned int * ) darrayGet ( solidArray, i ); cnt = checkConnect ( from_ctg, to_ctg ); if ( !cnt ) { printf ( "consolidate A: no connect from %d to %d\n", from_ctg, to_ctg ); for ( j = 0; j < solidCounter; j++ ) { printf ( "%d-->", * ( unsigned int * ) darrayGet ( solidArray, j ) ); } printf ( "\n" ); return; } cnt->singleInScaf = solidCounter == 2 ? 
1 : 0; if ( prevCNT ) { setNextInScaf ( prevCNT, cnt ); setPrevInScaf ( cnt, 1 ); } prevCNT = cnt; from_ctg = to_ctg; } //the reverse complementary path from_ctg = getTwinCtg ( * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 1 ) ); prevCNT = NULL; for ( i = solidCounter - 2; i >= 0; i-- ) { to_ctg = getTwinCtg ( * ( unsigned int * ) darrayGet ( solidArray, i ) ); cnt = checkConnect ( from_ctg, to_ctg ); if ( !cnt ) { printf ( "consolidate B: no connect from %d to %d\n", from_ctg, to_ctg ); return; } cnt->singleInScaf = solidCounter == 2 ? 1 : 0; if ( prevCNT ) { setNextInScaf ( prevCNT, cnt ); setPrevInScaf ( cnt, 1 ); } prevCNT = cnt; from_ctg = to_ctg; } } static void debugging1 ( unsigned int ctg1, unsigned int ctg2 ) { CONNECT * cn1; cn1 = getCntBetween ( ctg1, ctg2 ); if ( cn1 ) { printf ( "(%d,%d) mask %d deleted %d w %d,singleInScaf %d\n", ctg1, ctg2, cn1->mask, cn1->deleted, cn1->weight, cn1->singleInScaf ); if ( cn1->nextInScaf ) { printf ( "%d->%d->%d\n", ctg1, ctg2, cn1->nextInScaf->contigID ); } if ( cn1->prevInScaf ) { printf ( "*->%d->%d\n", ctg1, ctg2 ); } else if ( !cn1->nextInScaf ) { printf ( "NULL->%d->%d->NULL\n", ctg1, ctg2 ); } } else { printf ( "%d -X- %d\n", ctg1, ctg2 ); } } //remove transitive connections which cross linear paths (these paths may be broken) //if a->b->c and a->c, mask a->c static void removeTransitive() { unsigned int i, bal_ctg; int flag = 1, out_num, in_num, count, min, max, linear; CONNECT * cn_temp, *cn1 = NULL, *cn2 = NULL; while ( flag ) { flag = 0; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } out_num = validConnect ( i, NULL ); if ( out_num != 2 ) { continue; } cn_temp = contig_array[i].downwardConnect; count = 0; while ( cn_temp ) { if ( cn_temp->deleted || cn_temp->mask ) { cn_temp = cn_temp->next; continue; } count++; if ( count == 1 ) { cn1 = cn_temp; } else if ( count == 2 ) { cn2 = cn_temp; } else // count > 2 { break; } cn_temp = cn_temp->next; } if ( count > 2 ) { 
				printf ( "%d valid connections from ctg %d\n", count, i );
				continue;
			}

			if ( cn1->gapLen > cn2->gapLen )
			{
				cn_temp = cn1;
				cn1 = cn2;
				cn2 = cn_temp;
			}

			//make sure cn1 is closer to contig i than cn2
			if ( cn1->prevInScaf && cn2->prevInScaf )
			{
				continue;
			}

			bal_ctg = getTwinCtg ( cn2->contigID );
			in_num = validConnect ( bal_ctg, NULL );

			if ( in_num > 2 )
			{
				continue;
			}

			min = cn2->gapLen - cn1->gapLen - contig_array[cn1->contigID].length - ins_size_var / 2;
			max = cn2->gapLen - cn1->gapLen - contig_array[cn1->contigID].length + ins_size_var / 2;

			if ( max < 0 )
			{
				continue;
			}

			//temporarily delete cn2
			setConnectDelete ( i, cn2->contigID, 1, 0 );
			linear = linearC2C ( i, cn1, cn2->contigID, min, max );

			if ( linear != 1 )
			{
				setConnectDelete ( i, cn2->contigID, 0, 0 );
				continue;
			}
			else
			{
				downstreamCTG[0] = i;
				catUsDsContig();

				if ( !checkSimple ( solidArray, solidCounter ) )
				{
					continue;
				}

				cn1 = getCntBetween ( * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 2 ),
				                      * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 1 ) );

				if ( cn1 && cn1->nextInScaf && cn2->nextInScaf )
				{
					setConnectDelete ( i, cn2->contigID, 0, 0 );
					continue;
				}

				consolidate();

				if ( cn2->prevInScaf )
					substitueDSinScaf ( cn2, * ( unsigned int * ) darrayGet ( solidArray, 0 ),
					                    * ( unsigned int * ) darrayGet ( solidArray, 1 ) );

				if ( cn2->nextInScaf )
				{
					substitueUSinScaf ( cn2, * ( unsigned int * ) darrayGet ( solidArray, solidCounter - 2 ) );
				}

				flag++;
			}
		} //for each contig

		printf ( "a remove transitive lag, %d connections removed\n", flag );
	}
}

//get repeat contigs back into the scaffold according to connected unique contigs on both sides
/*
         A ------ D
          > [i] <
         B        E
*/

//print every connect leaving ctg, with its flags, to stderr
static void debugging2 ( unsigned int ctg )
{
	CONNECT * cn1 = contig_array[ctg].downwardConnect;

	while ( cn1 )
	{
		if ( cn1->nextInScaf )
		{
			fprintf ( stderr, "with nextInScaf," );
		}

		if ( cn1->prevInScaf )
		{
			fprintf ( stderr, "with prevInScaf," );
		}

		fprintf ( stderr, "%u >> %d, mask %d deleted %d, inherit %d, singleInScaf %d\n", ctg,
cn1->contigID, cn1->mask, cn1->deleted, cn1->inherit, cn1->singleInScaf );
		cn1 = cn1->next;
	}
}

static void debugging()
{
	/*
	debugging1(1777,1468);
	debugging2(8065);
	debugging2(8066);
	*/
}

static void simplifyCnt()
{
	removeTransitive();
	debugging();
	general_linearization ( 1 );
	debugging();
}

//return the position of node in nodesInSub, or -1 if it is not there
static int getIndexInArray ( unsigned int node )
{
	int index;

	for ( index = 0; index < nodeCounter; index++ )
		if ( nodesInSub[index] == node )
		{
			return index;
		}

	return -1;
}

static boolean putNodeIntoSubgraph ( FibHeap * heap, int distance, unsigned int node, int index )
{
	int pos = getIndexInArray ( node );

	if ( pos > 0 )
	{
		//printf("exists\n");
		return 0;
	}

	if ( index >= MaxNodeInSub )
	{
		return -1;
	}

	insertNodeIntoHeap ( heap, distance, node );
	nodesInSub[index] = node;
	nodeDistance[index] = distance;
	return 1;
}

//follow the chain of bound connects that starts after prevC and put every contig on it into the subgraph
static boolean putChainIntoSubgraph ( FibHeap * heap, int distance, unsigned int node, int * index, CONNECT * prevC )
{
	unsigned int ctg = node;
	CONNECT * nextCnt;
	boolean excep, flag;
	int counter = *index;

	while ( 1 )
	{
		nextCnt = getNextContig ( ctg, prevC, &excep );

		if ( excep || !nextCnt )
		{
			*index = counter;
			return 1;
		}

		ctg = nextCnt->contigID;
		//distance to the end of ctg: the gap plus the contig length, as computed in checkUnique
		distance += nextCnt->gapLen + contig_array[ctg].length;
		flag = putNodeIntoSubgraph ( heap, distance, ctg, counter );

		if ( flag < 0 )
		{
			return 0;
		}

		if ( flag > 0 )
		{
			counter++;
		}

		prevC = nextCnt;
	}
}

// check if a contig is unique by trying to line its downstream/upstream nodes together
static boolean checkUnique ( unsigned int node, double tolerance )
{
	CONNECT * ite_cnt;
	unsigned int currNode;
	int distance;
	int popCounter = 0;
	boolean flag;
	currNode = node;
	FibHeap * heap = newFibHeap();
	putNodeIntoSubgraph ( heap, 0, currNode, 0 );
	nodeCounter = 1;
	ite_cnt = contig_array[currNode].downwardConnect;

	while ( ite_cnt )
	{
		if ( ite_cnt->deleted || ite_cnt->mask )
		{
			ite_cnt = ite_cnt->next;
			continue;
		}

		currNode = ite_cnt->contigID;
		distance = ite_cnt->gapLen + contig_array[currNode].length;
		flag = putNodeIntoSubgraph ( heap, distance, currNode,
nodeCounter ); if ( flag < 0 ) { destroyHeap ( heap ); return 0; } if ( flag > 0 ) { nodeCounter++; } flag = putChainIntoSubgraph ( heap, distance, currNode, &nodeCounter, ite_cnt ); if ( !flag ) { destroyHeap ( heap ); return 0; } ite_cnt = ite_cnt->next; } if ( nodeCounter <= 2 ) // no more than 2 valid connections { destroyHeap ( heap ); return 1; } while ( ( currNode = removeNextNodeFromHeap ( heap ) ) != 0 ) { nodesInSubInOrder[popCounter++] = currNode; } destroyHeap ( heap ); flag = checkOverlapInBetween ( tolerance ); return flag; } //mask contigs with downstream and/or upstream can not be lined static void maskRepeat() { int in_num, out_num, flagA, flagB; int counter = 0; int puzzleCounter = 0; unsigned int i, bal_i; for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].mask ) { continue; } bal_i = getTwinCtg ( i ); in_num = validConnect ( bal_i, NULL ); out_num = validConnect ( i, NULL ); if ( in_num > 1 || out_num > 1 ) { puzzleCounter++; } else { if ( isSmallerThanTwin ( i ) ) { i++; } continue; } if ( contig_array[i].cvg > 2 * cvgAvg ) { counter++; maskContig ( i, 1 ); //printf("thick mask contig %d and %d\n",i,bal_i); if ( isSmallerThanTwin ( i ) ) { i++; } continue; } if ( in_num > 1 ) { flagA = checkUnique ( bal_i, OverlapPercent ); } else { flagA = 1; } if ( out_num > 1 ) { flagB = checkUnique ( i, OverlapPercent ); } else { flagB = 1; } if ( !flagA || !flagB ) { counter++; maskContig ( i, 1 ); } if ( isSmallerThanTwin ( i ) ) { i++; } } printf ( "maskRepeat: %d contigs masked from %d puzzles\n", counter, puzzleCounter ); } static void ordering ( boolean deWeak, boolean downS, boolean nonlinear, char * infile ) { debugging(); if ( downS ) { downSlide(); debugging(); if ( deWeak ) { deleteWeakCnt ( weakPE ); } } else { if ( deWeak ) { deleteWeakCnt ( weakPE ); } } //output_scaf(infile); debugging(); printf ( "variance for insert size %d\n", ins_size_var ); simplifyCnt(); debugging(); maskRepeat(); debugging(); simplifyCnt(); if ( nonlinear ) { 
printf ( "non-strict linearization\n" ); general_linearization ( 0 ); //linearization(0,0); } maskPuzzle ( 2, 0 ); debugging(); freezing(); debugging(); } //check if contigs next to each other have reasonable overlap boolean checkOverlapInBetween ( double tolerance ) { int i, gap; int index; unsigned int node; int lenSum, lenOlp; lenSum = lenOlp = 0; for ( i = 0; i < nodeCounter; i++ ) { node = nodesInSubInOrder[i]; lenSum += contig_array[node].length; index = getIndexInArray ( node ); nodeDistanceInOrder[i] = nodeDistance[index]; } if ( lenSum < 1 ) { return 1; } for ( i = 0; i < nodeCounter - 1; i++ ) { gap = nodeDistanceInOrder[i + 1] - nodeDistanceInOrder[i] - contig_array[nodesInSubInOrder[i + 1]].length; if ( -gap > 0 ) { lenOlp += -gap; } //if(-gap>ins_size_var) if ( ( double ) lenOlp / lenSum > tolerance ) { return 0; } } return 1; } /********* the following codes are for freezing current scaffolds ****************/ //set connections between contigs in a array to used or not //meanwhile set mask to the opposite value static boolean setUsed ( unsigned int start, unsigned int * array, int max_steps, boolean flag ) { unsigned int prevCtg = start; unsigned int twinA, twinB; int j; CONNECT * cnt; boolean usedFlag = 0; // save 'used' to 'checking' prevCtg = start; for ( j = 0; j < max_steps; j++ ) { if ( array[j] == 0 ) { break; } cnt = getCntBetween ( prevCtg, array[j] ); if ( !cnt ) { printf ( "setUsed: no connect between %d and %d\n", prevCtg, array[j] ); prevCtg = array[j]; continue; } if ( cnt->used == flag || cnt->nextInScaf || cnt->prevInScaf || cnt->singleInScaf ) { return 1; } cnt->checking = cnt->used; twinA = getTwinCtg ( prevCtg ); twinB = getTwinCtg ( array[j] ); cnt = getCntBetween ( twinB, twinA ); if ( cnt ) { cnt->checking = cnt->used; } prevCtg = array[j]; } // set used to flag prevCtg = start; for ( j = 0; j < max_steps; j++ ) { if ( array[j] == 0 ) { break; } cnt = getCntBetween ( prevCtg, array[j] ); if ( !cnt ) { prevCtg = array[j]; 
			continue;
		}

		if ( cnt->used == flag )
		{
			usedFlag = 1;
			break;
		}

		cnt->used = flag;
		twinA = getTwinCtg ( prevCtg );
		twinB = getTwinCtg ( array[j] );
		cnt = getCntBetween ( twinB, twinA );

		if ( cnt )
		{
			cnt->used = flag;
		}

		prevCtg = array[j];
	}

	// set mask to 'NOT flag' or set used to original value
	prevCtg = start;

	for ( j = 0; j < max_steps; j++ )
	{
		if ( array[j] == 0 )
		{
			break;
		}

		cnt = getCntBetween ( prevCtg, array[j] );

		if ( !cnt )
		{
			prevCtg = array[j];
			continue;
		}

		if ( !usedFlag )
		{
			cnt->mask = 1 - flag;
		}
		else
		{
			cnt->used = cnt->checking;
		}

		twinA = getTwinCtg ( prevCtg );
		twinB = getTwinCtg ( array[j] );
		cnt = getCntBetween ( twinB, twinA );

		// the twin connect may be missing, so test the pointer as the two loops above do
		if ( cnt )
		{
			cnt->used = 1 - flag;

			if ( !usedFlag )
			{
				cnt->mask = 1 - flag;
			}
			else
			{
				cnt->used = cnt->checking;
			}
		}

		prevCtg = array[j];
	}

	return usedFlag;
}

// break down scaffolds poorly supported by longer PE
static void recoverMask()
{
	unsigned int i, ctg, bal_ctg, start, finish;
	int num3, num5, j, t;
	CONNECT * bindCnt, *cnt;
	int min, max, max_steps = 5, num_route, length;
	int tempCounter, recoverCounter = 0;
	boolean multiUSE, change;

	for ( i = 1; i <= num_ctg; i++ )
	{
		contig_array[i].flag = 0;
	}

	so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) );
	found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) );

	for ( j = 0; j < max_n_routes; j++ )
	{
		found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) );
	}

	for ( i = 1; i <= num_ctg; i++ )
	{
		if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect )
		{
			continue;
		}

		bindCnt = getBindCnt ( i );

		if ( !bindCnt )
		{
			continue;
		}

		//first scan get the average coverage by longer pe
		num5 = num3 = 0;
		ctg = i;
		* ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i;
		contig_array[i].flag = 1;
		contig_array[getTwinCtg ( i )].flag = 1;

		while ( bindCnt )
		{
			if ( bindCnt->used )
			{
				break;
			}

			setConnectUsed ( ctg, bindCnt->contigID, 1 );
			ctg = bindCnt->contigID;
			* ( unsigned int * ) darrayPut ( scaf5, num5++
) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( i ); bindCnt = getBindCnt ( ctg ); while ( bindCnt ) { if ( bindCnt->used ) { break; } setConnectUsed ( ctg, bindCnt->contigID, 1 ); ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; bindCnt = bindCnt->nextInScaf; } if ( num5 + num3 < 2 ) { continue; } tempCounter = solidCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); change = 0; for ( t = 0; t < tempCounter - 1; t++ ) { * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( tempArray, t ); start = * ( unsigned int * ) darrayGet ( tempArray, t ); finish = * ( unsigned int * ) darrayGet ( tempArray, t + 1 ); num_route = num_trace = 0; cnt = checkConnect ( start, finish ); if ( !cnt ) { printf ( "Warning from recoverMask: no connection (%d %d), start at %d\n", start, finish, i ); cnt = getCntBetween ( start, finish ); if ( cnt ) { debugging1 ( start, finish ); } continue; } length = cnt->gapLen + contig_array[finish].length; min = length - 1.5 * ins_size_var; max = length + 1.5 * ins_size_var; traceAlongMaskedCnt ( finish, start, max_steps, min, max, 0, 0, &num_route ); if ( finish == start ) { for ( j = 0; j < tempCounter; j++ ) { printf ( "->%d", * ( unsigned int * ) darrayGet ( tempArray, j ) ); } printf ( ": start at %d\n", i ); } if ( num_route == 1 ) { for ( j = 0; j < max_steps; j++ ) if ( found_routes[0][j] == 0 ) { break; } if ( j < 1 ) { continue; } //check if connects have been used more than once multiUSE = setUsed ( start, found_routes[0], max_steps, 1 ); 
if ( multiUSE ) { continue; } for ( j = 0; j < max_steps; j++ ) { if ( j + 1 == max_steps || found_routes[0][j + 1] == 0 ) { break; } * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = found_routes[0][j]; contig_array[found_routes[0][j]].flag = 1; contig_array[getTwinCtg ( found_routes[0][j] )].flag = 1; } recoverCounter += j; setConnectDelete ( start, finish, 1, 1 ); change = 1; } //end if num_route=1 } // for each gap * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( tempArray, tempCounter - 1 ); if ( change ) { consolidate(); } } printf ( "%d contigs recovered\n", recoverCounter ); fflush ( stdout ); for ( i = 1; i <= num_ctg; i++ ) { cnt = contig_array[i].downwardConnect; while ( cnt ) { cnt->used = 0; cnt->checking = 0; cnt = cnt->next; } } for ( j = 0; j < max_n_routes; j++ ) { free ( ( void * ) found_routes[j] ); } free ( ( void * ) found_routes ); free ( ( void * ) so_far ); } // A -> B -> C -> D un-bind link B->C to link A->B and B->C // A' <- B' <- C' <- D' static void unBindLink ( unsigned int CB, unsigned int CC ) { //fprintf(stderr,"Unbind link (%d %d) to others...\n",CB,CC); CONNECT * cnt1 = getCntBetween ( CB, CC ); if ( !cnt1 ) { return; } if ( cnt1->singleInScaf ) { cnt1->singleInScaf = 0; } CONNECT * cnt2 = getCntBetween ( getTwinCtg ( CC ), getTwinCtg ( CB ) ); if ( !cnt2 ) { return; } if ( cnt2->singleInScaf ) { cnt2->singleInScaf = 0; } if ( cnt1->nextInScaf ) { unsigned int CD = cnt1->nextInScaf->contigID; cnt1->nextInScaf->prevInScaf = 0; cnt1->nextInScaf = NULL; CONNECT * cnt3 = getCntBetween ( getTwinCtg ( CD ), getTwinCtg ( CC ) ); if ( cnt3 ) { cnt3->nextInScaf = NULL; } cnt2->prevInScaf = 0; } if ( cnt2->nextInScaf ) { unsigned int bal_CA = cnt2->nextInScaf->contigID; cnt2->nextInScaf->prevInScaf = 0; cnt2->nextInScaf = NULL; CONNECT * cnt4 = getCntBetween ( getTwinCtg ( bal_CA ), CB ); if ( cnt4 ) { cnt4->nextInScaf = NULL; } cnt1->prevInScaf = 0; } } static void freezing() { 
int num5, num3; unsigned int ctg, bal_ctg; unsigned int i; int j, t; CONNECT * cnt, *prevCNT, *nextCnt; boolean excep; for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; contig_array[i].from_vt = 0; contig_array[i].to_vt = 0; cnt = contig_array[i].downwardConnect; while ( cnt ) { cnt->used = 0; cnt->checking = 0; cnt->singleInScaf = 0; cnt = cnt->next; } } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask ) { continue; } if ( !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { if ( contig_array[cnt->contigID].flag ) { unBindLink ( ctg, cnt->contigID ); break; } nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; } ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { if ( contig_array[cnt->contigID].flag ) { unBindLink ( ctg, cnt->contigID ); break; } nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } if ( num5 + num3 < 2 ) { continue; } solidCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) 
* ( unsigned int * ) darrayPut ( solidArray, solidCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); unsigned int firstCtg = 0; unsigned int lastCtg = 0; unsigned int firstTwin = 0; unsigned int lastTwin = 0; for ( t = 0; t < solidCounter; t++ ) if ( !contig_array[* ( unsigned int * ) darrayGet ( solidArray, t )].mask ) { firstCtg = * ( unsigned int * ) darrayGet ( solidArray, t ); break; } for ( t = solidCounter - 1; t >= 0; t-- ) if ( !contig_array[* ( unsigned int * ) darrayGet ( solidArray, t )].mask ) { lastCtg = * ( unsigned int * ) darrayGet ( solidArray, t ); break; } if ( firstCtg == 0 || lastCtg == 0 ) { printf ( "scaffold start at %d, stop at %d, freezing began with %d\n", firstCtg, lastCtg, i ); for ( j = 0; j < solidCounter; j++ ) printf ( "->%d(%d %d)", * ( unsigned int * ) darrayGet ( solidArray, j ) , contig_array[* ( unsigned int * ) darrayGet ( solidArray, j )].mask , contig_array[* ( unsigned int * ) darrayGet ( solidArray, j )].flag ); printf ( "\n" ); } else { firstTwin = getTwinCtg ( firstCtg ); lastTwin = getTwinCtg ( lastCtg ); } for ( t = 0; t < solidCounter; t++ ) { unsigned int ctg = * ( unsigned int * ) darrayGet ( solidArray, t ); if ( contig_array[ctg].from_vt > 0 ) { contig_array[ctg].mask = 1; contig_array[getTwinCtg ( ctg )].mask = 1; printf ( "Repeat: contig %d (%d) appears more than once\n", ctg, getTwinCtg ( ctg ) ); } else { contig_array[ctg].from_vt = firstCtg; contig_array[ctg].to_vt = lastCtg; contig_array[ctg].indexInScaf = t + 1; contig_array[getTwinCtg ( ctg )].from_vt = lastTwin; contig_array[getTwinCtg ( ctg )].to_vt = firstTwin; contig_array[getTwinCtg ( ctg )].indexInScaf = solidCounter - t; } } consolidate(); } printf ( "Freezing is done....\n" ); fflush ( stdout ); for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag ) { contig_array[i].flag = 0; } if ( contig_array[i].from_vt == 0 ) { contig_array[i].from_vt = i; contig_array[i].to_vt = i; } cnt = contig_array[i].downwardConnect; while ( cnt ) { 
cnt->used = 0; cnt->checking = 0; cnt = cnt->next; } } } /************** codes below this line are for pulling the scaffolds out ************/ void output1gap ( FILE * fo, int max_steps ) { int i, len, seg; len = seg = 0; for ( i = 0; i < max_steps - 1; i++ ) { if ( found_routes[0][i + 1] == 0 ) { break; } len += contig_array[found_routes[0][i]].length; seg++; } fprintf ( fo, "GAP %d %d", len, seg ); for ( i = 0; i < max_steps - 1; i++ ) { if ( found_routes[0][i + 1] == 0 ) { break; } fprintf ( fo, " %d", found_routes[0][i] ); } fprintf ( fo, "\n" ); } static int weakCounter; static boolean printCnts ( FILE * fp, unsigned int ctg ) { CONNECT * cnt = contig_array[ctg].downwardConnect; boolean flag = 0, ret = 0; unsigned int bal_ctg = getTwinCtg ( ctg ); unsigned int linkCtg; if ( isSameAsTwin ( ctg ) ) { return ret; } CONNECT * bindCnt = getBindCnt ( ctg ); if ( bindCnt && bindCnt->bySmall && bindCnt->weakPoint ) { weakCounter++; fprintf ( fp, "\tWP" ); ret = 1; } while ( cnt ) { if ( cnt->weight && !cnt->inherit ) { if ( !flag ) { flag = 1; fprintf ( fp, "\t#DOWN " ); } linkCtg = cnt->contigID; if ( isLargerThanTwin ( linkCtg ) ) { linkCtg = getTwinCtg ( linkCtg ); } fprintf ( fp, "%d:%d:%d ", index_array[linkCtg], cnt->weight, cnt->gapLen ); } cnt = cnt->next; } flag = 0; cnt = contig_array[bal_ctg].downwardConnect; while ( cnt ) { if ( cnt->weight && !cnt->inherit ) { if ( !flag ) { flag = 1; fprintf ( fp, "\t#UP " ); } linkCtg = cnt->contigID; if ( isLargerThanTwin ( linkCtg ) ) { linkCtg = getTwinCtg ( linkCtg ); } fprintf ( fp, "%d:%d:%d ", index_array[linkCtg], cnt->weight, cnt->gapLen ); } cnt = cnt->next; } fprintf ( fp, "\n" ); return ret; } void scaffolding ( unsigned int len_cut, char * outfile ) { unsigned int prev_ctg, ctg, bal_ctg, *length_array, count = 0, num_lctg = 0; unsigned int i, max_steps = 5; int num5, num3, j, len, flag, num_route, gap_c = 0; short gap = 0; long long sum = 0, N50, N90; FILE * fp, *fo = NULL; char name[256]; CONNECT * cnt, 
*prevCNT, *nextCnt; boolean excep, weak; weakCounter = 0; so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) ); found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) ); for ( j = 0; j < max_n_routes; j++ ) { found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) ); } length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with original index: index_array[i] orig2new = 0; sprintf ( name, "%s.scaf", outfile ); fp = ckopen ( name, "w" ); sprintf ( name, "%s.scaf_gap", outfile ); fo = ckopen ( name, "w" ); scaf3 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen >= len_cut ) { num_lctg++; } else { continue; } if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; //printf("%d",i); * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; bal_ctg = getTwinCtg ( ctg ); contig_array[bal_ctg].flag = 1; len = contig_array[i].length; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception --- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { 
break; } setConnectUsed ( ctg, cnt->contigID, 1 ); * ( int * ) darrayPut ( gap5, num5 - 1 ) = cnt->gapLen; ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; //printf("->%d",ctg); } //printf("\n"); ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } //printf("%d",i); //fflush(stdout); cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception -- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; //printf("<-%d",bal_ctg); //fflush(stdout); * ( int * ) darrayPut ( gap3, num3 ) = cnt->gapLen; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } //printf("\n"); len += overlaplen; sum += len; length_array[count++] = len; if ( num5 + num3 < 1 ) { printf ( "no scaffold created for contig %d\n", i ); continue; } fprintf ( fp, ">scaffold%d %d %d\n", count, num5 + num3, len ); fprintf ( fo, ">scaffold%d %d %d\n", count, num5 + num3, len ); len = prev_ctg = 0; for ( j = num3 - 1; j >= 0; j-- ) { if ( !isLargerThanTwin ( * ( unsigned int * ) darrayGet ( scaf3, j ) ) ) { fprintf ( fp, "%-10d %-10d + %d " , index_array[* ( unsigned int * ) darrayGet ( scaf3, j )], len, contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf3, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } else { fprintf ( fp, 
"%-10d %-10d - %d " , index_array[getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf3, j ) )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf3, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf3, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { output1gap ( fo, max_steps ); gap_c++; } } fprintf ( fo, "%-10d %-10d\n", * ( unsigned int * ) darrayGet ( scaf3, j ), len ); len += contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + * ( int * ) darrayGet ( gap3, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf3, j ); gap = * ( int * ) darrayGet ( gap3, j ) > 0 ? * ( int * ) darrayGet ( gap3, j ) : 0; } for ( j = 0; j < num5; j++ ) { if ( !isLargerThanTwin ( * ( unsigned int * ) darrayGet ( scaf5, j ) ) ) { fprintf ( fp, "%-10d %-10d + %d " , index_array[* ( unsigned int * ) darrayGet ( scaf5, j )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf5, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } else { fprintf ( fp, "%-10d %-10d - %d " , index_array[getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, j ) )], len , contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + overlaplen ); weak = printCnts ( fp, * ( unsigned int * ) darrayGet ( scaf5, j ) ); /* if(weak) fprintf(stderr,"scaffold%d\n",count); */ } if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf5, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { output1gap ( fo, max_steps ); gap_c++; } } fprintf ( fo, "%-10d %-10d\n", * ( unsigned int * ) darrayGet ( scaf5, j ), len ); if ( j < num5 - 1 ) { 
				len += contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + * ( int * ) darrayGet ( gap5, j );
				prev_ctg = * ( unsigned int * ) darrayGet ( scaf5, j );
				gap = * ( int * ) darrayGet ( gap5, j ) > 0 ? * ( int * ) darrayGet ( gap5, j ) : 0;
			}
		}
	}

	freeDarray ( scaf3 );
	freeDarray ( scaf5 );
	freeDarray ( gap3 );
	freeDarray ( gap5 );
	fclose ( fp );
	fclose ( fo );
	printf ( "\nthe final rank" );
	// guard the average against count == 0 (no scaffold was built)
	printf ( "\n%d scaffolds from %d contigs sum up %lldbp, with average length %lld, %d gaps filled\n"
	         , count, num_lctg / 2, sum, count ? sum / count : 0, gap_c );

	//output singleton
	for ( i = 1; i <= num_ctg; i++ ) {
		if ( contig_array[i].length + ( unsigned int ) overlaplen < len_cut || contig_array[i].flag ) {
			continue;
		}
		length_array[count++] = contig_array[i].length;
		sum += contig_array[i].length;
		if ( isSmallerThanTwin ( i ) ) {
			i++;
		}
	}

	// calculate N50/N90
	printf ( "%d scaffolds&singleton sum up %lldbp, with average length %lld\n"
	         , count, sum, count ? sum / count : 0 );
	qsort ( length_array, count, sizeof ( length_array[0] ), cmp_int );
	if ( count > 0 ) { // length_array[count - 1] is out of bounds when count == 0
		printf ( "the longest is %dbp,", length_array[count - 1] );
	}
	N50 = sum * 0.5;
	N90 = sum * 0.9;
	sum = flag = 0;
	for ( j = count - 1; j >= 0; j-- ) {
		sum += length_array[j];
		if ( !flag && sum >= N50 ) {
			printf ( "scaffold N50 is %d bp, ", length_array[j] );
			flag++;
		}
		if ( sum >= N90 ) {
			printf ( "scaffold N90 is %d bp\n", length_array[j] );
			break;
		}
	}
	printf ( "Found %d weak points in scaffolds\n", weakCounter );
	fflush ( stdout );
	free ( ( void * ) length_array );
	for ( j = 0; j < max_n_routes; j++ ) {
		free ( ( void * ) found_routes[j] );
	}
	free ( ( void * ) found_routes );
	free ( ( void * ) so_far );
}

void scaffold_count ( unsigned int len_cut )
{
	static DARRAY * scaf3, *scaf5;
	static DARRAY * gap3, *gap5;
	unsigned int prev_ctg, ctg, bal_ctg, *length_array, count = 0, num_lctg = 0;
	unsigned int i, max_steps = 5;
	int num5, num3, j, len, flag, num_route, gap_c = 0;
	short gap = 0;
	long long sum = 0, N50, N90;
	CONNECT * cnt, *prevCNT, *nextCnt;
	boolean excep;
so_far = ( unsigned int * ) ckalloc ( max_n_routes * sizeof ( unsigned int ) ); found_routes = ( unsigned int ** ) ckalloc ( max_n_routes * sizeof ( unsigned int * ) ); for ( j = 0; j < max_n_routes; j++ ) { found_routes[j] = ( unsigned int * ) ckalloc ( max_steps * sizeof ( unsigned int ) ); } length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with original index: index_array[i] orig2new = 0; scaf3 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].length + ( unsigned int ) overlaplen >= len_cut ) { num_lctg++; } else { continue; } if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect || !validConnect ( i, NULL ) ) { continue; } num5 = num3 = 0; ctg = i; //printf("%d",i); * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; bal_ctg = getTwinCtg ( ctg ); contig_array[bal_ctg].flag = 1; len = contig_array[i].length; prevCNT = NULL; cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception --- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); * ( int * ) darrayPut ( gap5, num5 - 1 ) = cnt->gapLen; ctg = cnt->contigID; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; len += 
cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCNT = cnt; cnt = nextCnt; //printf("->%d",ctg); } //printf("\n"); ctg = getTwinCtg ( i ); if ( num5 >= 2 ) { prevCNT = checkConnect ( getTwinCtg ( * ( unsigned int * ) darrayGet ( scaf5, 1 ) ), ctg ); } else { prevCNT = NULL; } //printf("%d",i); //fflush(stdout); cnt = getNextContig ( ctg, prevCNT, &excep ); while ( cnt ) { nextCnt = getNextContig ( cnt->contigID, cnt, &excep ); if ( excep && prevCNT ) { printf ( "scaffolding: exception -- prev cnt from %u\n", prevCNT->contigID ); } if ( nextCnt && nextCnt->used ) { break; } setConnectUsed ( ctg, cnt->contigID, 1 ); ctg = cnt->contigID; len += cnt->gapLen + contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; //printf("<-%d",bal_ctg); //fflush(stdout); * ( int * ) darrayPut ( gap3, num3 ) = cnt->gapLen; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; prevCNT = cnt; cnt = nextCnt; } //printf("\n"); len += overlaplen; sum += len; length_array[count++] = len; if ( num5 + num3 < 1 ) { printf ( "no scaffold created for contig %d\n", i ); continue; } len = prev_ctg = 0; for ( j = num3 - 1; j >= 0; j-- ) { if ( prev_ctg ) { num_route = num_trace = 0; traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf3, j ), prev_ctg, max_steps , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route ); if ( num_route == 1 ) { gap_c++; } } len += contig_array[* ( unsigned int * ) darrayGet ( scaf3, j )].length + * ( int * ) darrayGet ( gap3, j ); prev_ctg = * ( unsigned int * ) darrayGet ( scaf3, j ); gap = * ( int * ) darrayGet ( gap3, j ) > 0 ? 
				* ( int * ) darrayGet ( gap3, j ) : 0;
		}

		for ( j = 0; j < num5; j++ ) {
			if ( prev_ctg ) {
				num_route = num_trace = 0;
				traceAlongArc ( * ( unsigned int * ) darrayGet ( scaf5, j ), prev_ctg, max_steps
				                , gap - ins_size_var, gap + ins_size_var, 0, 0, &num_route );
				if ( num_route == 1 ) {
					gap_c++;
				}
			}
			if ( j < num5 - 1 ) {
				len += contig_array[* ( unsigned int * ) darrayGet ( scaf5, j )].length + * ( int * ) darrayGet ( gap5, j );
				prev_ctg = * ( unsigned int * ) darrayGet ( scaf5, j );
				gap = * ( int * ) darrayGet ( gap5, j ) > 0 ? * ( int * ) darrayGet ( gap5, j ) : 0;
			}
		}
	}

	freeDarray ( scaf3 );
	freeDarray ( scaf5 );
	freeDarray ( gap3 );
	freeDarray ( gap5 );
	// guard the average against count == 0 (no scaffold was built)
	printf ( "\n%d scaffolds from %d contigs sum up %lldbp, with average length %lld, %d gaps filled\n"
	         , count, num_lctg / 2, sum, count ? sum / count : 0, gap_c );

	//output singleton
	for ( i = 1; i <= num_ctg; i++ ) {
		if ( contig_array[i].length + ( unsigned int ) overlaplen < len_cut || contig_array[i].flag ) {
			continue;
		}
		length_array[count++] = contig_array[i].length;
		sum += contig_array[i].length;
		if ( isSmallerThanTwin ( i ) ) {
			i++;
		}
	}

	// calculate N50/N90
	printf ( "%d scaffolds&singleton sum up %lldbp, with average length %lld\n"
	         , count, sum, count ? sum / count : 0 );
	qsort ( length_array, count, sizeof ( length_array[0] ), cmp_int );
	if ( count > 0 ) { // length_array[count - 1] is out of bounds when count == 0
		printf ( "the longest is %dbp,", length_array[count - 1] );
	}
	N50 = sum * 0.5;
	N90 = sum * 0.9;
	sum = flag = 0;
	for ( j = count - 1; j >= 0; j-- ) {
		sum += length_array[j];
		if ( !flag && sum >= N50 ) {
			printf ( "scaffold N50 is %d bp, ", length_array[j] );
			flag++;
		}
		if ( sum >= N90 ) {
			printf ( "scaffold N90 is %d bp\n", length_array[j] );
			break;
		}
	}
	fflush ( stdout );
	free ( ( void * ) length_array );
	for ( j = 0; j < max_n_routes; j++ ) {
		free ( ( void * ) found_routes[j] );
	}
	free ( ( void * ) found_routes );
	free ( ( void * ) so_far );
}

static void outputLinks ( FILE * fp, int insertS )
{
	unsigned int i, bal_ctg, bal_toCtg;
	CONNECT * cnts, *temp_cnt;
	//printf("outputLinks, %d contigs\n",num_ctg);
	for ( i = 1; i <= num_ctg; i++ ) {
		cnts = contig_array[i].downwardConnect;
		bal_ctg = getTwinCtg ( i );
		while ( cnts ) {
			if ( cnts->weight < 1 ) {
				cnts = cnts->next;
				continue;
			}
			fprintf ( fp, "%-10d %-10d\t%d\t%d\t%d\n"
			          , i, cnts->contigID, cnts->gapLen, cnts->weight, insertS );
			cnts->weight = 0;
			bal_toCtg = getTwinCtg ( cnts->contigID );
			temp_cnt = getCntBetween ( bal_toCtg, bal_ctg );
			if ( temp_cnt ) {
				temp_cnt->weight = 0;
			}
			cnts = cnts->next;
		}
	}
}

//use PE info in ascending order
void PE2Links ( char * infile )
{
	char name[256], *line;
	FILE * fp, *linkF;
	int i;
	int flag = 0;
	unsigned int j;
	// snprintf keeps a long prefix from overflowing name[]
	snprintf ( name, sizeof ( name ), "%s.links", infile );
	/*linkF = fopen(name,"r");
	if(linkF){
	    printf("file %s exists, skip creating the links...\n",name);
	    fclose(linkF);
	    return;
	}*/
	linkF = ckopen ( name, "w" );
	if ( !pes ) {
		loadPEgrads ( infile );
	}
	snprintf ( name, sizeof ( name ), "%s.readOnContig", infile );
	fp = ckopen ( name, "r" );
	lineLen = 1024;
	line = ( char * ) ckalloc ( lineLen * sizeof ( char ) );
	fgets ( line, lineLen, fp );
	line[0] = '\0';
	printf ( "\n" );
	for ( i = 0; i < gradsCounter; i++ ) {
		createCntMemManager();
		createCntLookupTable();
		newCntCounter = 0;
		flag += connectByPE_grad ( fp, i, line );
		printf ( "%lld new connections\n", newCntCounter / 2 );
		if ( !flag ) {
			destroyConnectMem();
			deleteCntLookupTable();
			for ( j = 1; j <= num_ctg; j++ ) {
				contig_array[j].downwardConnect = NULL;
			}
			printf ( "\n" );
			continue;
		}
		flag = 0;
		outputLinks ( linkF, pes[i].insertS );
		destroyConnectMem();
		deleteCntLookupTable();
		for ( j = 1; j <= num_ctg; j++ ) {
			contig_array[j].downwardConnect = NULL;
		}
	}
	free ( ( void * ) line );
	fclose ( fp );
	fclose ( linkF );
	printf ( "all PEs attached\n" );
}

static int inputLinks ( FILE * fp, int insertS, char * line )
{
	unsigned int ctg, bal_ctg, toCtg, bal_toCtg;
	int gap, wt, ins;
	unsigned int counter = 0, onScafCounter = 0;
	unsigned int maskCounter = 0;
	if ( strlen ( line ) ) {
		// ctg and toCtg are unsigned, so they are read with %u
		sscanf ( line, "%u %u %d %d %d", &ctg, &toCtg, &gap, &wt, &ins );
		if ( ins != insertS ) {
			return
				counter;
		}
		//if(contig_array[ctg].length>=ctg_short&&contig_array[toCtg].length>=ctg_short){
		if ( 1 ) {
			bal_ctg = getTwinCtg ( ctg );
			bal_toCtg = getTwinCtg ( toCtg );
			add1Connect ( ctg, toCtg, gap, wt, 0 );
			add1Connect ( bal_toCtg, bal_ctg, gap, wt, 0 );
			counter++;
			if ( contig_array[ctg].mask || contig_array[toCtg].mask ) {
				maskCounter++;
			}
			if ( insertS > 1000 &&
			        contig_array[ctg].from_vt == contig_array[toCtg].from_vt && // on the same scaff
			        contig_array[ctg].indexInScaf < contig_array[toCtg].indexInScaf ) {
				add1LongPEcov ( ctg, toCtg, wt );
				onScafCounter++;
			}
		}
	}

	while ( fgets ( line, lineLen, fp ) != NULL ) {
		// ctg and toCtg are unsigned, so they are read with %u
		sscanf ( line, "%u %u %d %d %d", &ctg, &toCtg, &gap, &wt, &ins );
		if ( ins > insertS ) {
			break;
		}
		/* disabled, like the ctg_short filter above:
		if(contig_array[ctg].length<ctg_short||contig_array[toCtg].length<ctg_short)
		    continue;
		*/
		if ( insertS > 1000 &&
		        contig_array[ctg].from_vt == contig_array[toCtg].from_vt && // on the same scaff
		        contig_array[ctg].indexInScaf < contig_array[toCtg].indexInScaf ) {
			add1LongPEcov ( ctg, toCtg, wt );
			onScafCounter++;
		}
		bal_ctg = getTwinCtg ( ctg );
		bal_toCtg = getTwinCtg ( toCtg );
		add1Connect ( ctg, toCtg, gap, wt, 0 );
		add1Connect ( bal_toCtg, bal_ctg, gap, wt, 0 );
		counter++;
		if ( contig_array[ctg].mask || contig_array[toCtg].mask ) {
			maskCounter++;
		}
	}
	printf ( "%d link to masked contigs, %d links on a single scaff\n", maskCounter, onScafCounter );
	return counter;
}

//use linkage info in ascending order
void Links2Scaf ( char * infile )
{
	char name[256], *line;
	FILE * fp;
	int i, j = 1, lib_n = 0, cutoff_sum = 0;
	int flag = 0, flag2;
	boolean downS, nonLinear = 0, smallPE = 0, isPrevSmall = 0, markSmall;
	if ( !pes ) {
		loadPEgrads ( infile );
	}
	// snprintf keeps a long prefix from overflowing name[]
	snprintf ( name, sizeof ( name ), "%s.links", infile );
	fp = ckopen ( name, "r" );
	createCntMemManager();
	createCntLookupTable();
	lineLen = 1024;
	line = ( char * ) ckalloc ( lineLen * sizeof ( char ) );
	fgets ( line, lineLen, fp );
	line[0] = '\0';
	solidArray = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
	tempArray = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
	scaf3 = ( DARRAY * )
		createDarray ( 1000, sizeof ( unsigned int ) );
	scaf5 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) );
	gap3 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) );
	gap5 = ( DARRAY * ) createDarray ( 1000, sizeof ( int ) );
	weakPE = 3;
	//0531
	printf ( "\n" );
	for ( i = 0; i < gradsCounter; i++ ) {
		if ( pes[i].insertS < 1000 ) {
			isPrevSmall = 1;
		} else if ( pes[i].insertS > 1000 && isPrevSmall ) {
			smallScaf();
			isPrevSmall = 0;
		}
		flag2 = inputLinks ( fp, pes[i].insertS, line );
		printf ( "Insert size %d: %d links input\n", pes[i].insertS, flag2 );
		if ( flag2 ) {
			lib_n++;
			cutoff_sum += pes[i].pair_num_cut;
		}
		flag += flag2;
		if ( !flag ) {
			printf ( "\n" );
			continue;
		}
		if ( i == gradsCounter - 1 || pes[i + 1].rank != pes[i].rank ) {
			flag = nonLinear = downS = markSmall = 0;
			if ( pes[i].insertS > 1000 && pes[i].rank > 1 ) {
				downS = 1;
			}
			if ( pes[i].insertS <= 1000 ) {
				smallPE = 1;
			}
			if ( pes[i].insertS >= 1000 ) {
				ins_size_var = 50;
				OverlapPercent = 0.05;
			} else if ( pes[i].insertS >= 300 ) {
				ins_size_var = 30;
				OverlapPercent = 0.05;
			} else {
				ins_size_var = 20;
				OverlapPercent = 0.05;
			}
			if ( pes[i].insertS > 1000 ) {
				weakPE = 5;
			}
			//static_f = 1;
			if ( lib_n > 0 ) {
				weakPE = weakPE < cutoff_sum / lib_n ?
					cutoff_sum / lib_n : weakPE;
				lib_n = cutoff_sum = 0;
			}
			printf ( "Cutoff for number of pairs to make a reliable connection: %d\n", weakPE );
			if ( i == gradsCounter - 1 ) {
				nonLinear = 1;
			}
			if ( i == gradsCounter - 1 && !isPrevSmall && smallPE ) {
				detectBreakScaf();
			}
			ordering ( 1, downS, nonLinear, infile );
			if ( i == gradsCounter - 1 ) {
				recoverMask();
			} else {
				printf ( "\nthe %d rank", j++ );
				scaffold_count ( 100 );
				printf ( "\n" );
			}
		}
	}
	freeDarray ( tempArray );
	freeDarray ( solidArray );
	freeDarray ( scaf3 );
	freeDarray ( scaf5 );
	freeDarray ( gap3 );
	freeDarray ( gap5 );
	free ( ( void * ) line );
	fclose ( fp );
	printf ( "all links loaded\n" );
}

/*
  The code below picks up a general subgraph: one in which at most one node
  has upstream connections to the rest and at most one node has downstream
  connections.
*/
// static int nodeCounter
static boolean putNodeInArray ( unsigned int node, int maxNodes, int dis )
{
	if ( contig_array[node].inSubGraph ) {
		return 1;
	}
	int index = nodeCounter;
	if ( index > maxNodes ) {
		return 0;
	}
	if ( contig_array[getTwinCtg ( node )].inSubGraph ) {
		return 0;
	}
	ctg4heapArray[index].ctgID = node;
	ctg4heapArray[index].dis = dis;
	contig_array[node].inSubGraph = 1;
	ctg4heapArray[index].ds_shut4dheap = 0;
	ctg4heapArray[index].us_shut4dheap = 0;
	ctg4heapArray[index].ds_shut4uheap = 0;
	ctg4heapArray[index].us_shut4uheap = 0;
	return 1;
}

static void setInGraph ( boolean flag )
{
	int i;
	int node;
	nodeCounter = nodeCounter > MaxNodeInSub ?
		MaxNodeInSub : nodeCounter;
	for ( i = 1; i <= nodeCounter; i++ ) {
		node = ctg4heapArray[i].ctgID;
		if ( node > 0 ) {
			contig_array[node].inSubGraph = flag;
		}
	}
}

static boolean dispatch1node ( int dis, unsigned int tempNode, int maxNodes,
                               FibHeap * dheap, FibHeap * uheap, int * DmaxDis, int * UmaxDis )
{
	boolean ret;
	if ( dis >= 0 ) { // put it to Dheap
		nodeCounter++;
		ret = putNodeInArray ( tempNode, maxNodes, dis );
		if ( !ret ) {
			return 0;
		}
		insertNodeIntoHeap ( dheap, dis, nodeCounter );
		if ( dis > *DmaxDis ) {
			*DmaxDis = dis;
		}
		return 1;
	} else { // put it to Uheap
		nodeCounter++;
		ret = putNodeInArray ( tempNode, maxNodes, dis );
		if ( !ret ) {
			return 0;
		}
		insertNodeIntoHeap ( uheap, -dis, nodeCounter );
		int temp_len = contig_array[tempNode].length;
		if ( -dis + temp_len > *UmaxDis ) {
			*UmaxDis = -dis + contig_array[tempNode].length;
		}
		return -1;
	}
	return 0;
}

static boolean canDheapWait ( unsigned int currNode, int dis, int DmaxDis )
{
	if ( dis < DmaxDis ) {
		return 0;
	} else {
		return 1;
	}
}

static boolean workOnDheap ( FibHeap * dheap, FibHeap * uheap, boolean * Dwait, boolean * Uwait,
                             int * DmaxDis, int * UmaxDis, int maxNodes )
{
	if ( *Dwait ) {
		return 1;
	}
	unsigned int currNode, twin, tempNode;
	CTGinHEAP * ctgInHeap;
	int indexInArray;
	CONNECT * us_cnt, *ds_cnt;
	int dis0, dis;
	boolean ret, isEmpty;
	while ( ( indexInArray = removeNextNodeFromHeap ( dheap ) ) != 0 ) {
		ctgInHeap = &ctg4heapArray[indexInArray];
		currNode = ctgInHeap->ctgID;
		dis0 = ctgInHeap->dis;
		isEmpty = IsHeapEmpty ( dheap );
		twin = getTwinCtg ( currNode );
		us_cnt = ctgInHeap->us_shut4dheap ?
NULL : contig_array[twin].downwardConnect; while ( us_cnt ) { if ( us_cnt->deleted || us_cnt->mask || contig_array[getTwinCtg ( us_cnt->contigID )].inSubGraph ) { us_cnt = us_cnt->next; continue; } tempNode = getTwinCtg ( us_cnt->contigID ); if ( contig_array[tempNode].inSubGraph ) { us_cnt = us_cnt->next; continue; } dis = dis0 - us_cnt->gapLen - ( int ) contig_array[twin].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret < 0 ) { *Uwait = 0; } us_cnt = us_cnt->next; } if ( nodeCounter > 1 && isEmpty ) { *Dwait = canDheapWait ( currNode, dis0, *DmaxDis ); if ( *Dwait ) { isEmpty = IsHeapEmpty ( dheap ); insertNodeIntoHeap ( dheap, dis0, indexInArray ); ctg4heapArray[indexInArray].us_shut4dheap = 1; if ( isEmpty ) { return 1; } else { continue; } } } ds_cnt = ctgInHeap->ds_shut4dheap ? NULL : contig_array[currNode].downwardConnect; while ( ds_cnt ) { if ( ds_cnt->deleted || ds_cnt->mask || contig_array[ds_cnt->contigID].inSubGraph ) { ds_cnt = ds_cnt->next; continue; } tempNode = ds_cnt->contigID; dis = dis0 + ds_cnt->gapLen + ( int ) contig_array[tempNode].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret < 0 ) { *Uwait = 0; } } // for each downstream connections } // for each node comes off the heap *Dwait = 1; return 1; } static boolean canUheapWait ( unsigned int currNode, int dis, int UmaxDis ) { int temp_len = contig_array[currNode].length; if ( -dis + temp_len < UmaxDis ) { return 0; } else { return 1; } } static boolean workOnUheap ( FibHeap * dheap, FibHeap * uheap, boolean * Dwait, boolean * Uwait, int * DmaxDis, int * UmaxDis, int maxNodes ) { if ( *Uwait ) { return 1; } unsigned int currNode, twin, tempNode; CTGinHEAP * ctgInHeap; int indexInArray; CONNECT * us_cnt, *ds_cnt; int dis0, dis; boolean ret, isEmpty; while ( ( indexInArray = removeNextNodeFromHeap ( uheap ) ) != 0 ) { ctgInHeap = 
&ctg4heapArray[indexInArray]; currNode = ctgInHeap->ctgID; dis0 = ctgInHeap->dis; isEmpty = IsHeapEmpty ( uheap ); ds_cnt = ctgInHeap->ds_shut4uheap ? NULL : contig_array[currNode].downwardConnect; while ( ds_cnt ) { if ( ds_cnt->deleted || ds_cnt->mask || contig_array[ds_cnt->contigID].inSubGraph ) { ds_cnt = ds_cnt->next; continue; } tempNode = ds_cnt->contigID; dis = dis0 + ds_cnt->gapLen + contig_array[tempNode].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret > 0 ) { *Dwait = 0; } } // for each downstream connections if ( nodeCounter > 1 && isEmpty ) { *Uwait = canUheapWait ( currNode, dis0, *UmaxDis ); if ( *Uwait ) { isEmpty = IsHeapEmpty ( uheap ); insertNodeIntoHeap ( uheap, dis0, indexInArray ); ctg4heapArray[indexInArray].ds_shut4uheap = 1; if ( isEmpty ) { return 1; } else { continue; } } } twin = getTwinCtg ( currNode ); us_cnt = ctgInHeap->us_shut4uheap ? NULL : contig_array[twin].downwardConnect; while ( us_cnt ) { if ( us_cnt->deleted || us_cnt->mask || contig_array[getTwinCtg ( us_cnt->contigID )].inSubGraph ) { us_cnt = us_cnt->next; continue; } tempNode = getTwinCtg ( us_cnt->contigID ); if ( contig_array[tempNode].inSubGraph ) { us_cnt = us_cnt->next; continue; } dis = dis0 - us_cnt->gapLen - contig_array[twin].length; ret = dispatch1node ( dis, tempNode, maxNodes, dheap, uheap, DmaxDis, UmaxDis ); if ( ret == 0 ) { return 0; } else if ( ret > 0 ) { *Dwait = 1; } us_cnt = us_cnt->next; } } // for each node comes off the heap *Uwait = 1; return 1; } static boolean pickUpGeneralSubgraph ( unsigned int node1, int maxNodes ) { FibHeap * Uheap = newFibHeap(); // heap for upstream contigs to node1 FibHeap * Dheap = newFibHeap(); int UmaxDis; // max distance upstream to node1 int DmaxDis; boolean Uwait; // wait signal for Uheap boolean Dwait; int dis; boolean ret; //initiate: node1 is put to array once, and to both Dheap and Uheap dis = 0; nodeCounter = 1; 
putNodeInArray ( node1, maxNodes, dis ); insertNodeIntoHeap ( Dheap, dis, nodeCounter ); ctg4heapArray[nodeCounter].us_shut4dheap = 1; Dwait = 0; DmaxDis = 0; insertNodeIntoHeap ( Uheap, dis, nodeCounter ); ctg4heapArray[nodeCounter].ds_shut4uheap = 1; Uwait = 1; UmaxDis = contig_array[node1].length; while ( 1 ) { ret = workOnDheap ( Dheap, Uheap, &Dwait, &Uwait, &DmaxDis, &UmaxDis, maxNodes ); if ( !ret ) { setInGraph ( 0 ); destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 0; } ret = workOnUheap ( Dheap, Uheap, &Dwait, &Uwait, &DmaxDis, &UmaxDis, maxNodes ); if ( !ret ) { setInGraph ( 0 ); destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 0; } if ( Uwait && Dwait ) { destroyHeap ( Dheap ); destroyHeap ( Uheap ); return 1; } } } static int cmp_ctg ( const void * a, const void * b ) { CTGinHEAP * A, *B; A = ( CTGinHEAP * ) a; B = ( CTGinHEAP * ) b; if ( A->dis > B->dis ) { return 1; } else if ( A->dis == B->dis ) { return 0; } else { return -1; } } static boolean checkEligible() { unsigned int firstNode = ctg4heapArray[1].ctgID; unsigned int twin; int i; boolean flag = 0; //check if the first node has incoming link from twin of any node in subgraph // or it has multi outgoing links bound to incoming links twin = getTwinCtg ( firstNode ); CONNECT * ite_cnt = contig_array[twin].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( contig_array[ite_cnt->contigID].inSubGraph ) { /* if(firstNode==3693) printf("eligible link %d -> %d\n",twin,ite_cnt->contigID); */ return 0; } if ( ite_cnt->prevInScaf ) { if ( flag ) { return 0; } flag = 1; } ite_cnt = ite_cnt->next; } //check if the last node has outgoing link to twin of any node in subgraph // or it has multi outgoing links bound to incoming links unsigned int lastNode = ctg4heapArray[nodeCounter].ctgID; ite_cnt = contig_array[lastNode].downwardConnect; flag = 0; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = 
ite_cnt->next; continue; } twin = getTwinCtg ( ite_cnt->contigID ); if ( contig_array[twin].inSubGraph ) { /* if(firstNode==3693) printf("eligible link %d -> %d\n",lastNode,ite_cnt->contigID); */ return 0; } if ( ite_cnt->prevInScaf ) { if ( flag ) { return 0; } flag = 1; } ite_cnt = ite_cnt->next; } //check if any node has outgoing link to node outside the subgraph for ( i = 1; i < nodeCounter; i++ ) { ite_cnt = contig_array[ctg4heapArray[i].ctgID].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( !contig_array[ite_cnt->contigID].inSubGraph ) { /* printf("eligible check: ctg %d links to ctg %d\n", ctg4heapArray[i].ctgID,ite_cnt->contigID); */ return 0; } ite_cnt = ite_cnt->next; } } //check if any node has incoming link from node outside the subgraph for ( i = 2; i <= nodeCounter; i++ ) { twin = getTwinCtg ( ctg4heapArray[i].ctgID ); ite_cnt = contig_array[twin].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( !contig_array[getTwinCtg ( ite_cnt->contigID )].inSubGraph ) { /* printf("eligible check: ctg %d links to ctg %d\n", ctg4heapArray[i].ctgID,ite_cnt->contigID); */ return 0; } ite_cnt = ite_cnt->next; } } return 1; } //put nodes in sub-graph in a line static void arrangeNodes_general() { int i, gap; CONNECT * ite_cnt, *temp_cnt, *bal_cnt, *prev_cnt, *next_cnt; unsigned int node1, node2; unsigned int bal_nd1, bal_nd2; //delete original connections for ( i = 1; i <= nodeCounter; i++ ) { node1 = ctg4heapArray[i].ctgID; ite_cnt = contig_array[node1].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->mask || ite_cnt->deleted || !contig_array[ite_cnt->contigID].inSubGraph ) { ite_cnt = ite_cnt->next; continue; } ite_cnt->deleted = 1; setNextInScaf ( ite_cnt, NULL ); setPrevInScaf ( ite_cnt, 0 ); ite_cnt = ite_cnt->next; } bal_nd1 = getTwinCtg ( node1 ); ite_cnt = contig_array[bal_nd1].downwardConnect; while ( ite_cnt ) { 
if ( ite_cnt->mask || ite_cnt->deleted || !contig_array[getTwinCtg ( ite_cnt->contigID )].inSubGraph ) { ite_cnt = ite_cnt->next; continue; } ite_cnt->deleted = 1; setNextInScaf ( ite_cnt, NULL ); setPrevInScaf ( ite_cnt, 0 ); ite_cnt = ite_cnt->next; } } //create new connections prev_cnt = next_cnt = NULL; for ( i = 1; i < nodeCounter; i++ ) { node1 = ctg4heapArray[i].ctgID; node2 = ctg4heapArray[i + 1].ctgID; bal_nd1 = getTwinCtg ( node1 ); bal_nd2 = getTwinCtg ( node2 ); gap = ctg4heapArray[i + 1].dis - ctg4heapArray[i].dis - contig_array[node2].length; temp_cnt = getCntBetween ( node1, node2 ); if ( temp_cnt ) { temp_cnt->deleted = 0; temp_cnt->mask = 0; //temp_cnt->gapLen = gap; bal_cnt = getCntBetween ( bal_nd2, bal_nd1 ); bal_cnt->deleted = 0; bal_cnt->mask = 0; //bal_cnt->gapLen = gap; } else { temp_cnt = allocateCN ( node2, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( node1, temp_cnt ); } temp_cnt->next = contig_array[node1].downwardConnect; contig_array[node1].downwardConnect = temp_cnt; bal_cnt = allocateCN ( bal_nd1, gap ); if ( cntLookupTable ) { putCnt2LookupTable ( bal_nd2, bal_cnt ); } bal_cnt->next = contig_array[bal_nd2].downwardConnect; contig_array[bal_nd2].downwardConnect = bal_cnt; } if ( prev_cnt ) { setNextInScaf ( prev_cnt, temp_cnt ); setPrevInScaf ( temp_cnt, 1 ); } if ( next_cnt ) { setNextInScaf ( bal_cnt, next_cnt ); setPrevInScaf ( next_cnt, 1 ); } prev_cnt = temp_cnt; next_cnt = bal_cnt; } //re-binding connection at both ends bal_nd2 = getTwinCtg ( ctg4heapArray[1].ctgID ); ite_cnt = contig_array[bal_nd2].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( ite_cnt->prevInScaf ) { break; } ite_cnt = ite_cnt->next; } if ( ite_cnt ) { bal_nd1 = ite_cnt->contigID; node1 = getTwinCtg ( bal_nd1 ); node2 = ctg4heapArray[1].ctgID; temp_cnt = checkConnect ( node1, node2 ); bal_cnt = ite_cnt; next_cnt = checkConnect ( ctg4heapArray[1].ctgID, ctg4heapArray[2].ctgID ); 
prev_cnt = checkConnect ( getTwinCtg ( ctg4heapArray[2].ctgID ), getTwinCtg ( ctg4heapArray[1].ctgID ) ); if ( temp_cnt ) { setNextInScaf ( temp_cnt, next_cnt ); setPrevInScaf ( temp_cnt->nextInScaf, 0 ); setPrevInScaf ( next_cnt, 1 ); setNextInScaf ( prev_cnt, bal_cnt ); } } node1 = ctg4heapArray[nodeCounter].ctgID; ite_cnt = contig_array[node1].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || ite_cnt->mask ) { ite_cnt = ite_cnt->next; continue; } if ( ite_cnt->prevInScaf ) { break; } ite_cnt = ite_cnt->next; } if ( ite_cnt ) { node2 = ite_cnt->contigID; bal_nd1 = getTwinCtg ( node1 ); bal_nd2 = getTwinCtg ( node2 ); temp_cnt = ite_cnt; bal_cnt = checkConnect ( bal_nd2, bal_nd1 ); next_cnt = checkConnect ( getTwinCtg ( ctg4heapArray[nodeCounter].ctgID ), getTwinCtg ( ctg4heapArray[nodeCounter - 1].ctgID ) ); prev_cnt = checkConnect ( ctg4heapArray[nodeCounter - 1].ctgID, ctg4heapArray[nodeCounter].ctgID ); setNextInScaf ( prev_cnt, temp_cnt ); setNextInScaf ( bal_cnt, next_cnt ); setPrevInScaf ( next_cnt, 1 ); } } //check if contigs next to each other have reasonable overlap boolean checkOverlapInBetween_general ( double tolerance ) { int i, gap; unsigned int node; int lenSum, lenOlp; lenSum = lenOlp = 0; for ( i = 1; i <= nodeCounter; i++ ) { node = ctg4heapArray[i].ctgID; lenSum += contig_array[node].length; } if ( lenSum < 1 ) { return 1; } for ( i = 1; i < nodeCounter; i++ ) { gap = ctg4heapArray[i + 1].dis - ctg4heapArray[i].dis - contig_array[ctg4heapArray[i + 1].ctgID].length; if ( -gap > 0 ) { lenOlp += -gap; } //if(-gap>ins_size_var) if ( ( double ) lenOlp / lenSum > tolerance ) { return 0; } } return 1; } //check if there's any connect indicates the opposite order between nodes in sub-graph static boolean checkConflictCnt_general ( double tolerance ) { int i, j; int supportCounter = 0; int objectCounter = 0; CONNECT * cnt; for ( i = 1; i < nodeCounter; i++ ) { for ( j = i + 1; j <= nodeCounter; j++ ) { 
//cnt=getCntBetween(nodesInSubInOrder[j],nodesInSubInOrder[i]); cnt = checkConnect ( ctg4heapArray[i].ctgID, ctg4heapArray[j].ctgID ); if ( cnt ) { supportCounter += cnt->weight; } cnt = checkConnect ( ctg4heapArray[j].ctgID, ctg4heapArray[i].ctgID ); if ( cnt ) { objectCounter += cnt->weight; } //return 1; } } if ( supportCounter < 1 ) { return 1; } if ( ( double ) objectCounter / supportCounter < tolerance ) { return 0; } return 1; } // turn sub-graph to linear struct static void general_linearization ( boolean strict ) { unsigned int i; int subCounter = 0; int out_num; boolean flag; int conflCounter = 0, overlapCounter = 0, eligibleCounter = 0; double overlapTolerance, conflTolerance; for ( i = num_ctg; i > 0; i-- ) { if ( contig_array[i].mask ) { continue; } out_num = validConnect ( i, NULL ); if ( out_num < 2 ) { continue; } //flag = pickSubGraph(i,strict); flag = pickUpGeneralSubgraph ( i, MaxNodeInSub ); if ( !flag ) { continue; } subCounter++; qsort ( &ctg4heapArray[1], nodeCounter, sizeof ( CTGinHEAP ), cmp_ctg ); flag = checkEligible(); if ( !flag ) { eligibleCounter++; setInGraph ( 0 ); continue; } if ( strict ) { overlapTolerance = OverlapPercent; conflTolerance = ConflPercent; } else { overlapTolerance = 2 * OverlapPercent; conflTolerance = 2 * ConflPercent; } flag = checkOverlapInBetween_general ( overlapTolerance ); if ( !flag ) { overlapCounter++; setInGraph ( 0 ); continue; } flag = checkConflictCnt_general ( conflTolerance ); if ( flag ) { conflCounter++; setInGraph ( 0 ); continue; } arrangeNodes_general(); setInGraph ( 0 ); } printf ( "Picked %d subgraphs,%d have conflicting connections,%d have significant overlapping, %d eligible\n", subCounter, conflCounter, overlapCounter, eligibleCounter ); } /**** the fowllowing codes for detecting and break down scaffold at weak point **********/ // mark connections in scaffolds made by small pe static void smallScaf() { unsigned int i, ctg, bal_ctg, prevCtg; int counter = 0; CONNECT * bindCnt, *cnt; for ( 
i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } counter++; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; prevCtg = getTwinCtg ( i ); while ( bindCnt ) { ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); bindCnt->bySmall = 1; cnt = getCntBetween ( bal_ctg, prevCtg ); if ( cnt ) { cnt->bySmall = 1; } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCtg = bal_ctg; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( i ); bindCnt = getBindCnt ( ctg ); prevCtg = i; while ( bindCnt ) { ctg = bindCnt->contigID; bal_ctg = getTwinCtg ( ctg ); bindCnt->bySmall = 1; cnt = getCntBetween ( bal_ctg, prevCtg ); if ( cnt ) { cnt->bySmall = 1; } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; prevCtg = bal_ctg; bindCnt = bindCnt->nextInScaf; } } printf ( "Report from smallScaf: %d scaffolds by smallPE\n", counter ); } static boolean putItem2Sarray ( unsigned int scaf, int wt, DARRAY * SCAF, DARRAY * WT, int counter ) { int i; unsigned int * scafP, *wtP; for ( i = 0; i < counter; i++ ) { scafP = ( unsigned int * ) darrayGet ( SCAF, i ); if ( ( *scafP ) == scaf ) { wtP = ( unsigned int * ) darrayGet ( WT, i ); *wtP = ( *wtP + wt ); return 0; } } scafP = ( unsigned int * ) darrayPut ( SCAF, counter ); wtP = ( unsigned int * ) darrayPut ( WT, counter ); *scafP = scaf; *wtP = wt; return 1; } static int getDSLink2Scaf ( STACK * scafStack, DARRAY * SCAF, DARRAY * WT ) { CONNECT * ite_cnt; unsigned int ctg, targetCtg, *pt; int counter = 0; boolean inc; stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; if ( contig_array[ctg].mask || !contig_array[ctg].downwardConnect ) { continue; } ite_cnt = contig_array[ctg].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->deleted || 
ite_cnt->mask || ite_cnt->singleInScaf || ite_cnt->nextInScaf || ite_cnt->prevInScaf || ite_cnt->inherit ) { ite_cnt = ite_cnt->next; continue; } targetCtg = ite_cnt->contigID; if ( contig_array[ctg].from_vt == contig_array[targetCtg].from_vt ) // on the same scaff { ite_cnt = ite_cnt->next; continue; } inc = putItem2Sarray ( contig_array[targetCtg].from_vt, ite_cnt->weight, SCAF, WT, counter ); if ( inc ) { counter++; } ite_cnt = ite_cnt->next; } } return counter; } static int getScaffold ( unsigned int start, STACK * scafStack ) { int len = contig_array[start].length; unsigned int * pt, ctg; emptyStack ( scafStack ); pt = ( unsigned int * ) stackPush ( scafStack ); *pt = start; CONNECT * bindCnt = getBindCnt ( start ); while ( bindCnt ) { ctg = bindCnt->contigID; pt = ( unsigned int * ) stackPush ( scafStack ); *pt = ctg; len += contig_array[ctg].length; bindCnt = bindCnt->nextInScaf; } stackBackup ( scafStack ); return len; } static boolean isLinkReliable ( DARRAY * WT, int count ) { int i; for ( i = 0; i < count; i++ ) if ( * ( int * ) darrayGet ( WT, i ) >= weakPE ) { return 1; } return 0; } static int getWtFromSarray ( DARRAY * SCAF, DARRAY * WT, int count, unsigned int scaf ) { int i; for ( i = 0; i < count; i++ ) if ( * ( unsigned int * ) darrayGet ( SCAF, i ) == scaf ) { return * ( int * ) darrayGet ( WT, i ); } return 0; } static void switch2twin ( STACK * scafStack ) { unsigned int * pt; stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { *pt = getTwinCtg ( *pt ); } } /* ------> scaf1 --- --- -- -- --- scaf2 -- --- --- -- ----> */ static boolean checkScafConsist ( STACK * scafStack1, STACK * scafStack2 ) { DARRAY * downwardTo1 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); // scaf links to those scaffolds DARRAY * downwardTo2 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); DARRAY * downwardWt1 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); // scaf links to 
scaffolds with those wt DARRAY * downwardWt2 = ( DARRAY * ) createDarray ( 1000, sizeof ( unsigned int ) ); int linkCount1 = getDSLink2Scaf ( scafStack1, downwardTo1, downwardWt1 ); int linkCount2 = getDSLink2Scaf ( scafStack2, downwardTo2, downwardWt2 ); if ( !linkCount1 || !linkCount2 ) { freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return 1; } boolean flag1 = isLinkReliable ( downwardWt1, linkCount1 ); boolean flag2 = isLinkReliable ( downwardWt2, linkCount2 ); if ( !flag1 || !flag2 ) { freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return 1; } unsigned int scaf; int i, wt1, wt2, ret = 1; for ( i = 0; i < linkCount1; i++ ) { wt1 = * ( int * ) darrayGet ( downwardWt1, i ); if ( wt1 < weakPE ) { continue; } scaf = * ( unsigned int * ) darrayGet ( downwardTo1, i ); wt2 = getWtFromSarray ( downwardTo2, downwardWt2, linkCount2, scaf ); if ( wt2 < 1 ) { //fprintf(stderr,"Inconsistant link to %d\n",scaf); ret = 0; break; } } freeDarray ( downwardTo1 ); freeDarray ( downwardTo2 ); freeDarray ( downwardWt1 ); freeDarray ( downwardWt2 ); return ret; } static void setBreakPoints ( DARRAY * ctgArray, int count, int weakest, int * start, int * finish ) { int index = weakest - 1; unsigned int thisCtg; unsigned int nextCtg = * ( unsigned int * ) darrayGet ( ctgArray, weakest ); CONNECT * cnt; *start = weakest; while ( index >= 0 ) { thisCtg = * ( unsigned int * ) darrayGet ( ctgArray, index ); cnt = getCntBetween ( thisCtg, nextCtg ); if ( cnt->maxGap > 2 ) { break; } else { *start = index; } nextCtg = thisCtg; index--; } unsigned int prevCtg = * ( unsigned int * ) darrayGet ( ctgArray, weakest + 1 ); *finish = weakest + 1; index = weakest + 2; while ( index < count ) { thisCtg = * ( unsigned int * ) darrayGet ( ctgArray, index ); cnt = getCntBetween ( prevCtg, thisCtg ); if ( cnt->maxGap > 2 ) { break; } else { *finish = index; } prevCtg = 
thisCtg; index++; } } static void changeScafEnd ( STACK * scafStack, unsigned int end ) { unsigned int ctg, *pt; unsigned int start = getTwinCtg ( end ); stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; contig_array[ctg].to_vt = end; contig_array[getTwinCtg ( ctg )].from_vt = start; } } static void changeScafBegin ( STACK * scafStack, unsigned int start ) { unsigned int ctg, *pt; unsigned int end = getTwinCtg ( start ); stackRecover ( scafStack ); while ( ( pt = ( unsigned int * ) stackPop ( scafStack ) ) != NULL ) { ctg = *pt; contig_array[ctg].from_vt = start; contig_array[getTwinCtg ( ctg )].to_vt = end; } } // break down scaffolds poorly supported by longer PE static void detectBreakScaf() { unsigned int i, avgPE, scafLen, len, ctg, bal_ctg, prevCtg, thisCtg; long long peCounter, linkCounter; int num3, num5, weakPoint, tempCounter, j, t, counter = 0; CONNECT * bindCnt, *cnt, *weakCnt; STACK * scafStack1 = ( STACK * ) createStack ( 1000, sizeof ( unsigned int ) ); STACK * scafStack2 = ( STACK * ) createStack ( 1000, sizeof ( unsigned int ) ); for ( i = 1; i <= num_ctg; i++ ) { contig_array[i].flag = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( contig_array[i].flag || contig_array[i].mask || !contig_array[i].downwardConnect ) { continue; } bindCnt = getBindCnt ( i ); if ( !bindCnt ) { continue; } //first scan get the average coverage by longer pe num5 = num3 = peCounter = linkCounter = 0; scafLen = contig_array[i].length; ctg = i; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = i; contig_array[i].flag = 1; contig_array[getTwinCtg ( i )].flag = 1; while ( bindCnt ) { if ( !bindCnt->bySmall ) { break; } linkCounter++; peCounter += bindCnt->maxGap; ctg = bindCnt->contigID; scafLen += contig_array[ctg].length; * ( unsigned int * ) darrayPut ( scaf5, num5++ ) = ctg; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; bindCnt = bindCnt->nextInScaf; } ctg = getTwinCtg ( 
i ); bindCnt = getBindCnt ( ctg ); while ( bindCnt ) { if ( !bindCnt->bySmall ) { break; } linkCounter++; peCounter += bindCnt->maxGap; ctg = bindCnt->contigID; scafLen += contig_array[ctg].length; bal_ctg = getTwinCtg ( ctg ); contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; * ( unsigned int * ) darrayPut ( scaf3, num3++ ) = bal_ctg; bindCnt = bindCnt->nextInScaf; } if ( linkCounter < 1 || scafLen < 5000 ) { continue; } avgPE = peCounter / linkCounter; if ( avgPE < 10 ) { continue; } tempCounter = 0; for ( j = num3 - 1; j >= 0; j-- ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf3, j ); for ( j = 0; j < num5; j++ ) * ( unsigned int * ) darrayPut ( tempArray, tempCounter++ ) = * ( unsigned int * ) darrayGet ( scaf5, j ); prevCtg = * ( unsigned int * ) darrayGet ( tempArray, 0 ); weakCnt = NULL; weakPoint = 0; len = contig_array[prevCtg].length; for ( t = 1; t < tempCounter; t++ ) { thisCtg = * ( unsigned int * ) darrayGet ( tempArray, t ); if ( len < 2000 ) { len += contig_array[thisCtg].length; prevCtg = thisCtg; continue; } else if ( len > scafLen - 2000 ) { break; } len += contig_array[thisCtg].length; if ( contig_array[prevCtg].from_vt != contig_array[thisCtg].from_vt || contig_array[prevCtg].indexInScaf > contig_array[thisCtg].indexInScaf ) { prevCtg = thisCtg; continue; } cnt = getCntBetween ( prevCtg, thisCtg ); if ( !weakCnt || weakCnt->maxGap > cnt->maxGap ) { weakCnt = cnt; weakPoint = t; } prevCtg = thisCtg; } if ( !weakCnt || ( weakCnt->maxGap > 2 && weakCnt->maxGap > avgPE / 5 ) ) { continue; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, weakPoint - 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, weakPoint ); if ( contig_array[prevCtg].from_vt != contig_array[thisCtg].from_vt || contig_array[prevCtg].indexInScaf > contig_array[thisCtg].indexInScaf ) { printf ( "contig %d and %d not on the same scaff\n", prevCtg, thisCtg ); continue; } setConnectWP ( prevCtg, thisCtg, 1 
); /* fprintf(stderr,"scaffold len %d, avg long pe cov %d (%ld/%ld)\n", scafLen,avgPE,peCounter,linkCounter); fprintf(stderr,"Weak connect (%d) between %d(%dth of %d) and %d\n" ,weakCnt->maxGap,prevCtg,weakPoint-1,tempCounter,thisCtg); */ // set start and end to break down the scaffold int index1, index2; setBreakPoints ( tempArray, tempCounter, weakPoint - 1, &index1, &index2 ); //fprintf(stderr,"break %d ->...-> %d\n",index1,index2); unsigned int start = * ( unsigned int * ) darrayGet ( tempArray, index1 ); unsigned int finish = * ( unsigned int * ) darrayGet ( tempArray, index2 ); int len1 = getScaffold ( getTwinCtg ( start ), scafStack1 ); int len2 = getScaffold ( finish, scafStack2 ); if ( len1 < 2000 || len2 < 2000 ) { continue; } switch2twin ( scafStack1 ); int flag1 = checkScafConsist ( scafStack1, scafStack2 ); switch2twin ( scafStack1 ); switch2twin ( scafStack2 ); int flag2 = checkScafConsist ( scafStack2, scafStack1 ); if ( !flag1 || !flag2 ) { changeScafBegin ( scafStack1, getTwinCtg ( start ) ); changeScafEnd ( scafStack2, getTwinCtg ( finish ) ); //unbind links unsigned int nextCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 + 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 ); cnt = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( thisCtg ) ); if ( cnt->nextInScaf ) { prevCtg = getTwinCtg ( cnt->nextInScaf->contigID ); cnt->nextInScaf->prevInScaf = 0; cnt = getCntBetween ( prevCtg, thisCtg ); cnt->nextInScaf = NULL; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, index2 - 1 ); thisCtg = * ( unsigned int * ) darrayGet ( tempArray, index2 ); cnt = getCntBetween ( prevCtg, thisCtg ); if ( cnt->nextInScaf ) { nextCtg = cnt->nextInScaf->contigID; cnt->nextInScaf->prevInScaf = 0; cnt = getCntBetween ( getTwinCtg ( nextCtg ), getTwinCtg ( thisCtg ) ); cnt->nextInScaf = NULL; } prevCtg = * ( unsigned int * ) darrayGet ( tempArray, index1 ); for ( t = index1 + 1; t <= index2; t++ ) { thisCtg = * ( unsigned int * ) 
darrayGet ( tempArray, t ); cnt = getCntBetween ( prevCtg, thisCtg ); cnt->mask = 1; cnt->nextInScaf = NULL; cnt->prevInScaf = 0; cnt = getCntBetween ( getTwinCtg ( thisCtg ), getTwinCtg ( prevCtg ) ); cnt->mask = 1; cnt->nextInScaf = NULL; cnt->prevInScaf = 0; /* fprintf(stderr,"(%d %d)/(%d %d) ", prevCtg,thisCtg,getTwinCtg(thisCtg),getTwinCtg(prevCtg)); */ prevCtg = thisCtg; } //fprintf(stderr,": BREAKING\n"); counter++; } } freeStack ( scafStack1 ); freeStack ( scafStack2 ); printf ( "Report from checkScaf: %d scaffold segments broken\n", counter ); } static boolean checkSimple ( DARRAY * ctgArray, int count ) { int i; unsigned int ctg; for ( i = 0; i < count; i++ ) { ctg = * ( unsigned int * ) darrayGet ( ctgArray, i ); contig_array[ctg].flag = 0; contig_array[getTwinCtg ( ctg )].flag = 0; } for ( i = 0; i < count; i++ ) { ctg = * ( unsigned int * ) darrayGet ( ctgArray, i ); if ( contig_array[ctg].flag ) { return 0; } contig_array[ctg].flag = 1; contig_array[getTwinCtg ( ctg )].flag = 1; } return 1; } static void checkCircle() { unsigned int i, ctg; CONNECT * cn_temp1; int counter = 0; for ( i = 1; i <= num_ctg; i++ ) { cn_temp1 = contig_array[i].downwardConnect; while ( cn_temp1 ) { if ( cn_temp1->weak || cn_temp1->deleted ) { cn_temp1 = cn_temp1->next; continue; } ctg = cn_temp1->contigID; if ( checkConnect ( ctg, i ) ) { counter++; maskContig ( i, 1 ); maskContig ( ctg, 1 ); } cn_temp1 = cn_temp1->next; } } printf ( "%d circles removed \n", counter ); } SOAPdenovo-V1.05/src/127mer/output_contig.c000644 000765 000024 00000015523 11530651532 020456 0ustar00Aquastaff000000 000000 /* * 127mer/output_contig.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static char * kmerSeq; void output_graph ( char * outfile ) { char name[256]; FILE * fp; unsigned int i, bal_i; sprintf ( name, "%s.edge.gvz", outfile ); fp = ckopen ( name, "w" ); fprintf ( fp, "digraph G{\n" ); fprintf ( fp, "\tsize=\"512,512\";\n" ); for ( i = num_ed; i > 0; i-- ) { if ( edge_array[i].deleted ) { continue; } bal_i = getTwinEdge ( i ); fprintf ( fp, "\tV%d -> V%d[label =\"%d(%d)\"];\n", edge_array[i].from_vt, edge_array[i].to_vt, i, edge_array[i].length ); } fprintf ( fp, "}\n" ); fclose ( fp ); } static void output_1contig ( int id, EDGE * edge, FILE * fp, boolean tip ) { int i; Kmer kmer; fprintf ( fp, ">%d length %d cvg_%.1f_tip_%d\n", id, edge->length + overlaplen, ( double ) edge->cvg / 10, tip ); //fprintf(fp,">%d\n",id); kmer = vt_array[edge->from_vt].kmer; printKmerSeq ( fp, kmer ); for ( i = 0; i < edge->length; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( edge->seq, i ) ) ); if ( ( i + overlaplen + 1 ) % 100 == 0 ) { fprintf ( fp, "\n" ); } } fprintf ( fp, "\n" ); } int cmp_int ( const void * a, const void * b ) { int * A, *B; A = ( int * ) a; B = ( int * ) b; if ( *A > *B ) { return 1; } else if ( *A == *B ) { return 0; } else { return -1; } } int cmp_edge ( const void * a, const void * b ) { EDGE * A, *B; A = ( EDGE * ) a; B = ( EDGE * ) b; if ( A->length > B->length ) { return 1; } else if ( A->length == B->length ) { return 0; } else { return -1; } } void output_contig ( EDGE * ed_array, unsigned int ed_num, char * outfile, int cut_len ) { char temp[256]; FILE * fp, *fp_contig; 
int flag, count, len_c; int signI; unsigned int i; long long sum = 0, N90, N50; unsigned int * length_array; boolean tip; sprintf ( temp, "%s.contig", outfile ); fp = ckopen ( temp, "w" ); qsort ( &ed_array[1], ed_num, sizeof ( EDGE ), cmp_edge ); length_array = ( unsigned int * ) ckalloc ( ed_num * sizeof ( unsigned int ) ); kmerSeq = ( char * ) ckalloc ( overlaplen * sizeof ( char ) ); //first scan for number counting count = len_c = 0; for ( i = 1; i <= ed_num; i++ ) { if ( ( ed_array[i].length + overlaplen ) >= len_bar ) { length_array[len_c++] = ed_array[i].length + overlaplen; } if ( ed_array[i].length < 1 || ed_array[i].deleted ) { continue; } count++; if ( EdSmallerThanTwin ( i ) ) { i++; } } sum = 0; for ( signI = len_c - 1; signI >= 0; signI-- ) { sum += length_array[signI]; } qsort ( length_array, len_c, sizeof ( length_array[0] ), cmp_int ); if ( len_c > 0 ) { printf ( "%d ctgs longer than %d, sum up %lldbp, with average length %lld\n", len_c, len_bar, sum, sum / len_c ); printf ( "the longest is %dbp, ", length_array[len_c - 1] ); } //qsort(length_array,len_c,sizeof(length_array[0]),cmp_int); //printf("the longest is %dbp, ",length_array[len_c-1]); N50 = sum * 0.5; N90 = sum * 0.9; sum = flag = 0; for ( signI = len_c - 1; signI >= 0; signI-- ) { sum += length_array[signI]; if ( !flag && sum >= N50 ) { printf ( "contig N50 is %d bp,", length_array[signI] ); flag = 1; } if ( sum >= N90 ) { printf ( "contig N90 is %d bp\n", length_array[signI] ); break; } } //fprintf(fp,"Number %d\n",count); for ( i = 1; i <= ed_num; i++ ) { //if(ed_array[i].multi!=1||ed_array[i].length<1||(ed_array[i].length+overlaplen)length %d,", edge->length ); //print_kmer(fp,vt_array[edge->from_vt].kmer,','); //print_kmer(fp,vt_array[edge->to_vt].kmer,','); if ( EdSmallerThanTwin ( i ) ) { fprintf ( fp, "1," ); } else if ( EdLargerThanTwin ( i ) ) { fprintf ( fp, "-1," ); } else { fprintf ( fp, "0," ); } //fprintf(fp,"%d\n",edge->cvg); fprintf ( fp, "%d ", edge->cvg ); print_kmer ( 
fp, vt_array[edge->from_vt].kmer, ',' );
        print_kmer ( fp, vt_array[edge->to_vt].kmer, ',' );
        fprintf ( fp, "\n" );
    }

    fclose ( fp );
}

void output_heavyArcs ( char * outfile )
{
    unsigned int i, j;
    char name[256];
    FILE * outfp;
    ARC * parc;
    // bounded write: outfile is caller-supplied and may exceed the buffer
    snprintf ( name, sizeof ( name ), "%s.Arc", outfile );
    outfp = ckopen ( name, "w" );

    for ( i = 1; i <= num_ed; i++ )
    {
        parc = edge_array[i].arcs;

        if ( !parc )
        {
            continue;
        }

        j = 0;
        fprintf ( outfp, "%u", i );

        while ( parc )
        {
            fprintf ( outfp, " %u %u", parc->to_ed, parc->multiplicity );

            // start a new line, repeating the edge ID, after every 10 arcs
            if ( ( ++j ) % 10 == 0 )
            {
                fprintf ( outfp, "\n%u", i );
            }

            parc = parc->next;
        }

        fprintf ( outfp, "\n" );
    }

    fclose ( outfp );
}

SOAPdenovo-V1.05/src/127mer/output_pregraph.c000644 000765 000024 00000005152 11530651532 021000 0ustar00Aquastaff000000 000000

/*
 * 127mer/output_pregraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include <stdinc.h>
#include "newhash.h"
#include <extfunc.h>
#include <extvab.h>

static int outvCounter = 0;

//after this LINKFLAGFILTER in the Kmer is destroyed
static void output1vt ( kmer_t * node1, FILE * fp )
{
    if ( !node1 )
    {
        return;
    }

    if ( ! ( node1->linear ) && !
( node1->deleted ) )
    {
        outvCounter++;
        print_kmer ( fp, node1->seq, ' ' );

        //printKmerSeq(stdout,node1->seq);
        //printf("\n");
        if ( outvCounter % 8 == 0 )
        {
            fprintf ( fp, "\n" );
        }
    }
}

void output_vertex ( char * outfile )
{
    char temp[256];
    FILE * fp;
    int i;
    kmer_t * node;
    KmerSet * set;
    snprintf ( temp, sizeof ( temp ), "%s.vertex", outfile );
    fp = ckopen ( temp, "w" );

    for ( i = 0; i < thrd_num; i++ )
    {
        set = KmerSets[i];
        set->iter_ptr = 0;

        while ( set->iter_ptr < set->size )
        {
            if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) )
            {
                node = set->array + set->iter_ptr;
                output1vt ( node, fp );
            }

            set->iter_ptr ++;
        }
    }

    fprintf ( fp, "\n" );
    printf ( "%d vertices output\n", outvCounter );
    fclose ( fp );
    snprintf ( temp, sizeof ( temp ), "%s.preGraphBasic", outfile );
    fp = ckopen ( temp, "w" );
    fprintf ( fp, "VERTEX %d K %d\n", outvCounter, overlaplen );
    fprintf ( fp, "\nEDGEs %d\n", num_ed );
    fprintf ( fp, "\nMaxReadLen %d MinReadLen %d MaxNameLen %d\n", maxReadLen4all, minReadLen, maxNameLen );
    fclose ( fp );
}

void output_1edge ( preEDGE * edge, FILE * fp )
{
    int i;
    fprintf ( fp, ">length %d,", edge->length );
    print_kmer ( fp, edge->from_node, ',' );
    print_kmer ( fp, edge->to_node, ',' );
    fprintf ( fp, "cvg %d, %d\n", edge->cvg, edge->bal_edge );

    // write the edge sequence, 100 bases per line
    for ( i = 0; i < edge->length; i++ )
    {
        fprintf ( fp, "%c", int2base ( ( int ) edge->seq[i] ) );

        if ( ( i + 1 ) % 100 == 0 )
        {
            fprintf ( fp, "\n" );
        }
    }

    fprintf ( fp, "\n" );
}

SOAPdenovo-V1.05/src/127mer/output_scaffold.c000644 000765 000024 00000005110 11530651532 020743 0ustar00Aquastaff000000 000000

/*
 * 127mer/output_scaffold.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

void output_contig_graph ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i;
    snprintf ( name, sizeof ( name ), "%s.contig.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- )
    {
        fprintf ( fp, "\tV%d -> V%d[label =\"%d(%d)\"];\n", contig_array[i].from_vt, contig_array[i].to_vt, i, contig_array[i].length );
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

void output_scaf ( char * outfile )
{
    char name[256];
    FILE * fp;
    unsigned int i;
    CONNECT * connect;
    boolean flag;
    snprintf ( name, sizeof ( name ), "%s.scaffold.gvz", outfile );
    fp = ckopen ( name, "w" );
    fprintf ( fp, "digraph G{\n" );
    fprintf ( fp, "\tsize=\"512,512\";\n" );

    for ( i = num_ctg; i > 0; i-- )
    {
        //if(contig_array[i].mask||!contig_array[i].downwardConnect)
        if ( !contig_array[i].downwardConnect )
        {
            continue;
        }

        connect = contig_array[i].downwardConnect;

        while ( connect )
        {
            //if(connect->mask||connect->deleted){
            if ( connect->deleted )
            {
                connect = connect->next;
                continue;
            }

            // flag marks connections that are part of a scaffold
            if ( connect->prevInScaf || connect->nextInScaf )
            {
                flag = 1;
            }
            else
            {
                flag = 0;
            }

            // masked connections are drawn in red
            if ( !connect->mask )
            {
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\"];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );
            }
            else
            {
                fprintf ( fp, "\tC%d_%d -> C%d_%d [label = \"%d(%d_%d)\", color = red];\n"
                          , i, contig_array[i].length, connect->contigID, contig_array[connect->contigID].length,
                          connect->gapLen, flag, connect->weight );
            }

            connect = connect->next;
        }
    }

    fprintf ( fp, "}\n" );
    fclose ( fp );
}

SOAPdenovo-V1.05/src/127mer/pregraph.c000644 000765 000024 00000011445 11530651532 017362 0ustar00Aquastaff000000 000000

/*
 * 127mer/pregraph.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static void initenv ( int argc, char ** argv );

static char shortrdsfile[256];
static char graphfile[256];
static int cutTips = 1;

static void display_pregraph_usage();

int call_pregraph ( int argc, char ** argv )
{
    time_t start_t, stop_t, time_bef, time_aft;
    time ( &start_t );
    initenv ( argc, argv );

    // K must be odd and within [13, 127]; adjust it rather than abort
    if ( overlaplen % 2 == 0 )
    {
        overlaplen++;
        printf ( "K should be an odd number\n" );
    }

    if ( overlaplen < 13 )
    {
        overlaplen = 13;
        printf ( "K should not be less than 13\n" );
    }
    else if ( overlaplen > 127 )
    {
        overlaplen = 127;
        printf ( "K should not be greater than 127\n" );
    }

    time ( &time_bef );
    prlRead2HashTable ( shortrdsfile, graphfile );
    time ( &time_aft );
    printf ( "time spent on pre-graph construction: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    printf ( "deLowKmer %d, deLowEdge %d\n", deLowKmer, deLowEdge );

    //analyzeTips(hash_table, graphfile);
    if ( !deLowKmer && cutTips )
    {
        time ( &time_bef );
        removeSingleTips();
        removeMinorTips();
        time ( &time_aft );
        printf ( "time spent on cutting tips: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    }
    else
    {
        time ( &time_bef );
        removeMinorTips();
        time ( &time_aft );
        printf ( "time spent on cutting tips: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    }

    //combine each linear part to an edge
    time ( &time_bef );
    kmer2edges ( graphfile );
    time ( &time_aft );
    printf ( "time spent on making edges: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    //map read to edge one by one
    time ( &time_bef );
    prlRead2edge ( shortrdsfile, graphfile );
    time ( &time_aft );
    printf ( "time spent on mapping reads: %ds\n\n", ( int ) ( time_aft - time_bef ) );
    output_vertex ( graphfile );
    free_Sets ( KmerSets, thrd_num );
    free_Sets ( KmerSetsPatch, thrd_num );
    time ( &stop_t );
    printf ( "overall time for lightgraph: %dm\n\n", ( int ) ( stop_t - start_t ) / 60 );
    return 0;
}

/*****************************************************************************
 * Parse command line switches
 *****************************************************************************/

void initenv ( int argc, char ** argv )
{
    int copt;
    int inpseq, outseq;
    extern char * optarg;
    char temp[100];
    optind = 1;
    inpseq = outseq = 0;

    while ( ( copt = getopt ( argc, argv, "s:o:K:p:a:dDR" ) ) != EOF )
    {
        //printf("get option\n");
        switch ( copt )
        {
            case 's':
                inpseq = 1;
                // field widths keep long arguments from overflowing the 256-byte buffers
                sscanf ( optarg, "%255s", shortrdsfile );
                continue;
            case 'o':
                outseq = 1;
                sscanf ( optarg, "%255s", graphfile );
                continue;
            case 'K':
                sscanf ( optarg, "%99s", temp );
                overlaplen = atoi ( temp );
                continue;
            case 'p':
                sscanf ( optarg, "%99s", temp );
                thrd_num = atoi ( temp );
                continue;
            case 'R':
                repsTie = 1;
                continue;
            case 'd':
                deLowKmer = 1;
                continue;
            case 'D':
                deLowEdge = 1;
                continue;
            case 'a':
                initKmerSetSize = atoi ( optarg );
                break;
            default:

                if ( inpseq == 0 || outseq == 0 )
                {
                    display_pregraph_usage();
                    exit ( -1 );
                }
        }
    }

    // both -s and -o are mandatory
    if ( inpseq == 0 || outseq == 0 )
    {
        //printf("need more\n");
        display_pregraph_usage();
        exit ( -1 );
    }
}

static void display_pregraph_usage()
{
    printf ( "\npregraph -s readsInfoFile [-d -D -R -K kmer -p n_cpu -a initKmerSetSize] -o OutputFile\n" );
    printf ( " -s readsInfoFile: file containing information about the Solexa reads\n" );
    printf ( " -p n_cpu(default 8): number of CPUs to use\n" );
    printf ( " -K kmer(default 21): k value in kmer\n" );
    printf ( " -a initKmerSetSize: define the initial KmerSet size(unit: GB)\n" );
    printf ( " -d (optional): delete kmers with frequency one (default no)\n" );
    printf ( " -D (optional): delete edges with coverage one (default no)\n" );
    printf ( " -R (optional): resolve repeats by reads (default no)\n" );
    printf ( " -o OutputFile: prefix of output file name\n" );
}

SOAPdenovo-V1.05/src/127mer/prlHashCtg.c000644 000765 000024 00000024640 11530651532 017612 0ustar00Aquastaff000000 000000

/*
 * 127mer/prlHashCtg.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//debugging variables
static long long * kmerCounter;

//buffer related variables for chop kmer
static unsigned int read_c;
static char ** rcSeq;
static char * seqBuffer;
static int * lenBuffer;
static unsigned int * indexArray;
static unsigned int * seqBreakers;
static int * ctgIdArray;

//static Kmer *firstKmers;

//buffer related variables for splay tree
static unsigned int buffer_size = 100000000;
static unsigned int seq_buffer_size;
static unsigned int max_read_c;
static volatile unsigned int kmer_c;
static Kmer * kmerBuffer;
static ubyte8 * hashBanBuffer;
static boolean * smallerBuffer;

static void singleKmer ( int t, KmerSet * kset, unsigned int seq_index, unsigned int pos );
static void chopKmer4read ( int t, int threadID );

// worker loop: polls its signal byte (1 = hash kmers, 2 = chop reads, 3 = quit)
// returns void * so that it matches the start routine type pthread_create expects
static void * threadRoutine ( void * para )
{
    PARAMETER * prm;
    unsigned int i;
    unsigned char id;
    prm = ( PARAMETER * ) para;
    id = prm->threadID;

    //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
    while ( 1 )
    {
        if ( * ( prm->selfSignal ) == 1 )
        {
            unsigned int seq_index = 0;
            unsigned int pos = 0;

            for ( i = 0; i < kmer_c; i++ )
            {
                if ( seq_index < read_c && indexArray[seq_index + 1] == i )
                {
                    seq_index++; // which sequence this kmer belongs to
                    pos = 0;
                }

                //if((unsigned char)(hashBanBuffer[i]&taskMask)!=id){
                if ( ( unsigned char ) ( hashBanBuffer[i] % thrd_num ) != id )
                {
                    pos++;
                    continue;
                }

                kmerCounter[id + 1]++;
                singleKmer ( i, KmerSets[id], seq_index, pos++ );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 2 )
        {
            for ( i = 0; i < read_c; i++ )
            {
                if ( i % thrd_num != id )
                {
                    continue;
                }

                chopKmer4read ( i, id + 1 );
            }

            * ( prm->selfSignal ) = 0;
        }
        else if ( * ( prm->selfSignal ) == 3 )
        {
            * ( prm->selfSignal ) = 0;
            break;
        }

        usleep ( 1 );
    }

    return NULL;
}

static void singleKmer ( int t, KmerSet * kset, unsigned int seq_index, unsigned int pos )
{
    boolean flag;
    kmer_t * node;
    flag = put_kmerset ( kset, kmerBuffer[t], 4, 4, &node );

    //printf("singleKmer: kmer %llx\n",kmerBuffer[t]);
    if ( !flag )
    {
        if ( smallerBuffer[t] )
        {
            node->twin = 0;
        }
        else
        {
            node->twin = 1;
        }

        node->l_links = ctgIdArray[seq_index];
        node->r_links = pos;
    }
    else
    {
        node->deleted = 1;
    }
}

static void creatThrds ( pthread_t * threads, PARAMETER * paras )
{
    unsigned char i;
    int temp;

    for ( i = 0; i < thrd_num; i++ )
    {
        //printf("to create %dth thread\n",(*(char *)&(threadID[i])));
        if ( ( temp = pthread_create ( &threads[i], NULL, threadRoutine, & ( paras[i] ) ) ) != 0 )
        {
            printf ( "create threads failed\n" );
            exit ( 1 );
        }
    }

    printf ( "%d threads created\n", thrd_num );
}

static void thread_wait ( pthread_t * threads )
{
    int i;

    for ( i = 0; i < thrd_num; i++ )
        if ( threads[i] != 0 )
        {
            pthread_join ( threads[i], NULL );
        }
}

static void chopKmer4read ( int t, int threadID )
{
    char * src_seq = seqBuffer + seqBreakers[t];
    char * bal_seq = rcSeq[threadID];
    int len_seq = lenBuffer[t];
    int j, bal_j;
    ubyte8 hash_ban, bal_hash_ban;
    Kmer word, bal_word;
    int index;
    word.high1 = word.low1 = word.high2 = word.low2 = 0;

    // build the first kmer of the sequence, 2 bits per base
    for ( index = 0; index < overlaplen; index++ )
    {
        word = KmerLeftBitMoveBy2 ( word );
        word.low2 |= src_seq[index];
    }

    reverseComplementSeq ( src_seq, len_seq, bal_seq );
    // complementary node
    bal_word = reverseComplement ( word, overlaplen );
    bal_j = len_seq - 0 - overlaplen; // 0;
    index = indexArray[t];

    // store whichever of the kmer and its reverse complement is smaller
    if ( KmerSmaller ( word, bal_word ) )
    {
        hash_ban = hash_kmer ( word );
        kmerBuffer[index] = word;
        hashBanBuffer[index] = hash_ban;
        smallerBuffer[index++] = 1;
    }
    else
    {
        bal_hash_ban = hash_kmer ( bal_word );
        kmerBuffer[index] = bal_word;
        hashBanBuffer[index] = bal_hash_ban;
        smallerBuffer[index++] = 0;
    }

    //printf("%dth: %p with %p\n",kmer_c-1,bal_word,bal_hash_ban);
    for ( j = 1; j <= len_seq - overlaplen; j ++ )
    {
        word = nextKmer ( word, src_seq[j - 1 + overlaplen] );
        bal_j = len_seq - j - overlaplen; // j;
        bal_word = prevKmer ( bal_word, bal_seq[bal_j] );

        if ( KmerSmaller ( word, bal_word ) )
        {
            hash_ban =
hash_kmer ( word ); kmerBuffer[index] = word; hashBanBuffer[index] = hash_ban; smallerBuffer[index++] = 1; //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; hashBanBuffer[index] = bal_hash_ban; smallerBuffer[index++] = 0; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static int getID ( char * name ) { if ( name[0] >= '0' && name[0] <= '9' ) { return atoi ( & ( name[0] ) ); } else { return 0; } } boolean prlContig2nodes ( char * grapfile, int len_cut ) { long long i, num_seq; char name[256], *next_name; FILE * fp; pthread_t threads[thrd_num]; time_t start_t, stop_t; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; int maxCtgLen, minCtgLen, nameLen; unsigned int lenSum, contigId; WORDFILTER = createFilter ( overlaplen ); time ( &start_t ); sprintf ( name, "%s.contig", grapfile ); fp = ckopen ( name, "r" ); maxCtgLen = nameLen = 10; minCtgLen = 1000; num_seq = readseqpar ( &maxCtgLen, &minCtgLen, &nameLen, fp ); printf ( "\nthere're %lld contigs in file: %s, max seq len %d, min seq len %d, max name len %d\n", num_seq, grapfile, maxCtgLen, minCtgLen, nameLen ); maxReadLen = maxCtgLen; fclose ( fp ); time ( &stop_t ); printf ( "time spent on parse contigs file %ds\n", ( int ) ( stop_t - start_t ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); // extract all the EDONs seq_buffer_size = buffer_size * 2; max_read_c = seq_buffer_size / 20; kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); smallerBuffer = ( 
boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); seqBuffer = ( char * ) ckalloc ( seq_buffer_size * sizeof ( char ) ); lenBuffer = ( int * ) ckalloc ( max_read_c * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( max_read_c + 1 ) * sizeof ( unsigned int ) ); seqBreakers = ( unsigned int * ) ckalloc ( ( max_read_c + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( int * ) ckalloc ( max_read_c * sizeof ( int ) ); fp = ckopen ( name, "r" ); //node_mem_manager = createMem_manager(EDONBLOCKSIZE,sizeof(EDON)); rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( 1 ) { kmerCounter = ( long long * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( long long ) ); KmerSets = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); for ( i = 0; i < thrd_num; i++ ) { KmerSets[i] = init_kmerset ( 1024, 0.77f ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; kmerCounter[i + 1] = 0; rcSeq[i + 1] = ( char * ) ckalloc ( maxCtgLen * sizeof ( char ) ); } creatThrds ( threads, paras ); } kmer_c = thrdSignal[0] = kmerCounter[0] = 0; time ( &start_t ); read_c = lenSum = i = seqBreakers[0] = indexArray[0] = 0; readseq1by1 ( seqBuffer + seqBreakers[read_c], next_name, & ( lenBuffer[read_c] ), fp, -1 ); while ( !feof ( fp ) ) { contigId = getID ( next_name ); readseq1by1 ( seqBuffer + seqBreakers[read_c], next_name, & ( lenBuffer[read_c] ), fp, 1 ); if ( ( ++i ) % 10000000 == 0 ) { printf ( "--- %lldth contigs\n", i ); } if ( lenBuffer[read_c] < overlaplen + 1 || lenBuffer[read_c] < len_cut ) { contigId = getID ( next_name ); continue; } //printf("len of seq %d is %d, ID %d\n",read_c,lenBuffer[read_c],contigId); ctgIdArray[read_c] = contigId > 0 ? 
		                     contigId : i;
		lenSum += lenBuffer[read_c];
		kmer_c += lenBuffer[read_c] - overlaplen + 1;
		read_c++;
		seqBreakers[read_c] = lenSum;
		indexArray[read_c] = kmer_c;

		//printf("seq %d start at %d\n",read_c,seqBreakers[read_c]);
		if ( read_c == max_read_c || ( lenSum + maxCtgLen ) > seq_buffer_size || ( kmer_c + maxCtgLen - overlaplen + 1 ) > buffer_size )
		{
			kmerCounter[0] += kmer_c;
			sendWorkSignal ( 2, thrdSignal );
			sendWorkSignal ( 1, thrdSignal );
			kmer_c = read_c = lenSum = 0;
		}
	}

	if ( read_c )
	{
		kmerCounter[0] += kmer_c;
		sendWorkSignal ( 2, thrdSignal );
		sendWorkSignal ( 1, thrdSignal );
	}

	sendWorkSignal ( 3, thrdSignal );
	thread_wait ( threads );
	time ( &stop_t );
	printf ( "time spent on hash reads: %ds\n", ( int ) ( stop_t - start_t ) );

	if ( 1 )
	{
		unsigned long long alloCounter = 0;
		unsigned long long allKmerCounter = 0;

		for ( i = 0; i < thrd_num; i++ )
		{
			alloCounter += count_kmerset ( ( KmerSets[i] ) );
			allKmerCounter += kmerCounter[i + 1];
			free ( ( void * ) rcSeq[i + 1] );
		}

		printf ( "%lli nodes allocated, %lli kmer in reads, %lli kmer processed\n", alloCounter, kmerCounter[0], allKmerCounter );
	}

	free ( ( void * ) rcSeq );
	free ( ( void * ) kmerCounter );
	free ( ( void * ) seqBuffer );
	free ( ( void * ) lenBuffer );
	free ( ( void * ) indexArray );
	free ( ( void * ) seqBreakers );
	free ( ( void * ) ctgIdArray );
	free ( ( void * ) kmerBuffer );
	free ( ( void * ) hashBanBuffer );
	free ( ( void * ) smallerBuffer );
	free ( ( void * ) next_name );
	fclose ( fp );
	return 1;
}

/*
 * 127mer/prlHashReads.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

//debugging variables
static long long * tips;
static long long * kmerCounter;
static long long ** kmerFreq;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;
static int * indexArray;

//buffer related variables for splay tree
static int buffer_size = 100000000;
static volatile int kmer_c;
static Kmer * kmerBuffer;
static ubyte8 * hashBanBuffer;
static char * nextcBuffer, *prevcBuffer;

static void thread_mark ( KmerSet * set, unsigned char thrdID );
static void Mark1in1outNode ( unsigned char * thrdSignal );
static void thread_delow ( KmerSet * set, unsigned char thrdID );
static void deLowCov ( unsigned char * thrdSignal );
static void singleKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void freqStat ( char * outfile );

static void threadRoutine ( void * para )
{
	PARAMETER * prm;
	int i;
	unsigned char id;
	prm = ( PARAMETER * ) para;
	id = prm->threadID;

	//printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table);
	while ( 1 )
	{
		if ( * ( prm->selfSignal ) == 1 )
		{
			for ( i = 0; i < kmer_c; i++ )
			{
				//if((unsigned char)(magic_seq(hashBanBuffer[i])%thrd_num)!=id)
				//if((kmerBuffer[i]%thrd_num)!=id)
				if ( ( hashBanBuffer[i] % thrd_num ) != id )
				{
					continue;
				}

				kmerCounter[id + 1]++;
				singleKmer ( i, KmerSets[id] );
			}

			* ( prm->selfSignal ) = 0;
		}
		else if ( * ( prm->selfSignal ) == 2 )
		{
			for ( i = 0; i < read_c; i++ )
			{
				if ( i % thrd_num != id )
				{
					continue;
				}

				chopKmer4read ( i, id + 1 );
			}

			* (
prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { * ( prm->selfSignal ) = 0; break; } else if ( * ( prm->selfSignal ) == 4 ) { thread_mark ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { thread_delow ( KmerSets[id], id ); * ( prm->selfSignal ) = 0; } usleep ( 1 ); } } static void singleKmer ( int t, KmerSet * kset ) { kmer_t * pos; put_kmerset ( kset, kmerBuffer[t], prevcBuffer[t], nextcBuffer[t], &pos ); } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; ubyte8 hash_ban, bal_hash_ban; Kmer word, bal_word; int index; char InvalidCh = 4; word.high1 = word.low1 = word.high2 = word.low2 = 0; for ( index = 0; index < overlaplen; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); hashBanBuffer[index] = hash_ban; kmerBuffer[index] = word; prevcBuffer[index] = InvalidCh; nextcBuffer[index++] = src_seq[0 + overlaplen]; } else { bal_hash_ban = hash_kmer ( bal_word ); hashBanBuffer[index] = bal_hash_ban; kmerBuffer[index] = bal_word; prevcBuffer[index] = bal_seq[bal_j - 1]; 
nextcBuffer[index++] = InvalidCh; } for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); hashBanBuffer[index] = hash_ban; kmerBuffer[index] = word; prevcBuffer[index] = src_seq[j - 1]; if ( j < len_seq - overlaplen ) { nextcBuffer[index++] = src_seq[j + overlaplen]; } else { nextcBuffer[index++] = InvalidCh; } //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); hashBanBuffer[index] = bal_hash_ban; kmerBuffer[index] = bal_word; if ( bal_j > 0 ) { prevcBuffer[index] = bal_seq[bal_j - 1]; } else { prevcBuffer[index] = InvalidCh; } nextcBuffer[index++] = bal_seq[bal_j + overlaplen]; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } boolean prlRead2HashTable ( char * libfile, char * outfile ) { long long i; char * next_name, name[256]; FILE * fo; time_t start_t, stop_t; int maxReadNum; int libNo; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; boolean flag, pairs = 0; WORDFILTER = createFilter ( overlaplen ); maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } maxReadLen4all = maxReadLen; printf ( "In %s, %d libs, max seq len %d, max name len %d\n\n", libfile, num_libs, maxReadLen, maxNameLen ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); 
hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); prevcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); nextcBuffer = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); //printf("buffer size %d, max read len %d, max read num %d\n",buffer_size,maxReadLen,maxReadNum); seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); if ( 1 ) { kmerCounter = ( long long * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( long long ) ); KmerSets = ( KmerSet ** ) ckalloc ( thrd_num * sizeof ( KmerSet * ) ); ubyte8 init_size = 1024; ubyte8 k = 0; if ( initKmerSetSize ) { init_size = ( ubyte8 ) ( ( double ) initKmerSetSize * 1024.0f * 1024.0f * 1024.0f / ( double ) thrd_num / 32 ); do { ++k; } while ( k * 0xFFFFFFLLU < init_size ); } for ( i = 0; i < thrd_num; i++ ) { //KmerSets[i] = init_kmerset(1024,0.77f); KmerSets[i] = init_kmerset ( k * 0xFFFFFFLLU, 0.77f ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; kmerCounter[i + 1] = 0; rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } creatThrds ( threads, paras ); } thrdSignal[0] = kmerCounter[0] = 0; time ( &start_t ); kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter = 0; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 1 ) ) != 0 ) { if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } if ( lenBuffer[read_c] < 0 ) { printf ( "read len %d\n", lenBuffer[read_c] ); } if ( lenBuffer[read_c] < overlaplen + 1 ) { continue; } /* 
if(lenBuffer[read_c]>70) lenBuffer[read_c] = 50; else if(lenBuffer[read_c]>40) lenBuffer[read_c] = 40; */ indexArray[read_c] = kmer_c; kmer_c += lenBuffer[read_c] - overlaplen + 1; read_c++; if ( read_c == maxReadNum ) { kmerCounter[0] += kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); kmer_c = read_c = 0; } } if ( read_c ) { kmerCounter[0] += kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); } time ( &stop_t ); printf ( "time spent on hash reads: %ds, %lld reads processed\n", ( int ) ( stop_t - start_t ), i ); //record insert size info if ( pairs ) { if ( gradsCounter ) printf ( "%d pe insert size, the largest boundary is %lld\n\n", gradsCounter, pes[gradsCounter - 1].PE_bound ); else { printf ( "no paired reads found\n" ); } sprintf ( name, "%s.peGrads", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "grads&num: %d\t%lld\n", gradsCounter, n_solexa ); for ( i = 0; i < gradsCounter; i++ ) { fprintf ( fo, "%d\t%lld\t%d\n", pes[i].insertS, pes[i].PE_bound, pes[i].rank ); } fclose ( fo ); } free_pe_mem(); free_libs(); if ( 1 ) { unsigned long long alloCounter = 0; unsigned long long allKmerCounter = 0; for ( i = 0; i < thrd_num; i++ ) { alloCounter += count_kmerset ( ( KmerSets[i] ) ); allKmerCounter += kmerCounter[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } printf ( "%lli nodes allocated, %lli kmer in reads, %lli kmer processed\n" , alloCounter, kmerCounter[0], allKmerCounter ); } free ( ( void * ) rcSeq ); free ( ( void * ) kmerCounter ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nextcBuffer ); free ( ( void * ) prevcBuffer ); free ( ( void * ) next_name ); //printf("done hashing nodes\n"); if ( deLowKmer ) { time ( &start_t ); deLowCov ( thrdSignal ); time ( &stop_t ); printf ( "time spent on 
delowcvgNode %ds\n", ( int ) ( stop_t - start_t ) ); } time ( &start_t ); Mark1in1outNode ( thrdSignal ); freqStat ( outfile ); time ( &stop_t ); printf ( "time spent on marking linear nodes %ds\n", ( int ) ( stop_t - start_t ) ); fflush ( stdout ); sendWorkSignal ( 3, thrdSignal ); thread_wait ( threads ); /* Kmer word = 0x21c3ca82c734c8d0; Kmer hash_ban = hash_kmer(word); int setPicker = hash_ban%thrd_num; kmer_t *node; boolean found = search_kmerset(KmerSets[setPicker], word, &node); if(!found) printf("kmer %llx not found,\n",word); else{ printf("kmer %llx, linear %d\n",word,node->linear); for(i=0;i<4;i++){ if(get_kmer_right_cov(*node,i)>0) printf("right %d, kmer %llx\n",i,nextKmer(node->seq,i)); if(get_kmer_left_cov(*node,i)>0) printf("left %d, kmer %llx\n",i,prevKmer(node->seq,i)); } } */ return 1; } static void thread_delow ( KmerSet * set, unsigned char thrdID ) { int i, in_num, out_num, cvgSingle; int l_cvg, r_cvg; kmer_t * rs; set->iter_ptr = 0; while ( set->iter_ptr < set->size ) { if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) ) { in_num = out_num = l_cvg = r_cvg = 0; rs = set->array + set->iter_ptr; for ( i = 0; i < 4; i++ ) { cvgSingle = get_kmer_left_cov ( *rs, i ); if ( cvgSingle > 0 && cvgSingle == 1 ) { set_kmer_left_cov ( *rs, i, 0 ); } cvgSingle = get_kmer_right_cov ( *rs, i ); if ( cvgSingle > 0 && cvgSingle == 1 ) { set_kmer_right_cov ( *rs, i, 0 ); } } if ( rs->l_links == 0 && rs->r_links == 0 ) { rs->deleted = 1; tips[thrdID]++; } } set->iter_ptr ++; } //printf("%lld single nodes, %lld linear\n",counter,tips[thrdID]); } static void deLowCov ( unsigned char * thrdSignal ) { int i; long long counter = 0; tips = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) ); for ( i = 0; i < thrd_num; i++ ) { tips[i] = 0; } sendWorkSignal ( 5, thrdSignal ); //mark linear nodes for ( i = 0; i < thrd_num; i++ ) { counter += tips[i]; } free ( ( void * ) tips ); printf ( "%lld kmer removed\n", counter ); } static void thread_mark ( KmerSet * 
                          set, unsigned char thrdID )
{
	int i, in_num, out_num, cvgSingle;
	int l_cvg, r_cvg;
	kmer_t * rs;
	long long counter = 0;
	set->iter_ptr = 0;

	while ( set->iter_ptr < set->size )
	{
		if ( !is_kmer_entity_null ( set->flags, set->iter_ptr ) )
		{
			in_num = out_num = l_cvg = r_cvg = 0;
			rs = set->array + set->iter_ptr;

			for ( i = 0; i < 4; i++ )
			{
				cvgSingle = get_kmer_left_cov ( *rs, i );

				if ( cvgSingle > 0 )
				{
					in_num++;
					l_cvg += cvgSingle;
				}

				cvgSingle = get_kmer_right_cov ( *rs, i );

				if ( cvgSingle > 0 )
				{
					out_num++;
					r_cvg += cvgSingle;
				}
			}

			if ( rs->single )
			{
				kmerFreq[thrdID][1]++;
				counter++;
			}
			else
			{
				kmerFreq[thrdID][ ( l_cvg > r_cvg ? l_cvg : r_cvg )]++;
			}

			if ( in_num == 1 && out_num == 1 )
			{
				rs->linear = 1;
				tips[thrdID]++;
			}
		}

		set->iter_ptr ++;
	}

	//printf("%lld single nodes, %lld linear\n",counter,tips[thrdID]);
}

static void Mark1in1outNode ( unsigned char * thrdSignal )
{
	int i;
	long long counter = 0;
	tips = ( long long * ) ckalloc ( thrd_num * sizeof ( long long ) );
	kmerFreq = ( long long ** ) ckalloc ( thrd_num * sizeof ( long long * ) );

	for ( i = 0; i < thrd_num; i++ )
	{
		kmerFreq[i] = ( long long * ) ckalloc ( 257 * sizeof ( long long ) );
		memset ( kmerFreq[i], 0, 257 * sizeof ( long long ) );
		tips[i] = 0;
	}

	sendWorkSignal ( 4, thrdSignal ); //mark linear nodes

	for ( i = 0; i < thrd_num; i++ )
	{
		counter += tips[i];
	}

	free ( ( void * ) tips );
	printf ( "%lld linear nodes\n", counter );
}

static void freqStat ( char * outfile )
{
	FILE * fo;
	char name[256];
	int i, j;
	long long sum;
	sprintf ( name, "%s.kmerFreq", outfile );
	fo = ckopen ( name, "w" );

	for ( i = 1; i < 256; i++ )
	{
		sum = 0;

		for ( j = 0; j < thrd_num; j++ )
		{
			sum += kmerFreq[j][i];
		}

		fprintf ( fo, "%lld\n", sum );
	}

	for ( i = 0; i < thrd_num; i++ )
	{
		free ( ( void * ) kmerFreq[i] );
	}

	free ( ( void * ) kmerFreq );
	fclose ( fo );
}

/*
 * 127mer/prlRead2Ctg.c
 *
 * Copyright (c) 2008-2010
BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see <http://www.gnu.org/licenses/>.
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static long long readsInGap = 0;

static int buffer_size = 100000000;
static long long readCounter;
static long long mapCounter;
static int ALIGNLEN = 0;

//buffer related variables for chop kmer
static int read_c;
static char ** rcSeq;
static char ** seqBuffer;
static int * lenBuffer;
static unsigned int * ctgIdArray;
static int * posArray;
static char * orienArray;
static char * footprint; // flag indicates whether the read should leave markers on contigs

// kmer related variables
static int kmer_c;
static Kmer * kmerBuffer;
static ubyte8 * hashBanBuffer;
static kmer_t ** nodeBuffer;
static boolean * smallerBuffer;
static unsigned int * indexArray;

static int * deletion;

static void parse1read ( int t );
static void threadRoutine ( void * thrdID );
static void searchKmer ( int t, KmerSet * kset );
static void chopKmer4read ( int t, int threadID );
static void thread_wait ( pthread_t * threads );

static void creatThrds ( pthread_t * threads, PARAMETER * paras )
{
	unsigned char i;
	int temp;

	for ( i = 0; i < thrd_num; i++ )
	{
		//printf("to create %dth thread\n",(*(char *)&(threadID[i])));
		if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 )
		{
			printf ( "create threads failed\n" );
			exit ( 1 );
		}
	}
printf ( "%d thread created\n", thrd_num ); } static void threadRoutine ( void * para ) { PARAMETER * prm; int i, t; unsigned char id; prm = ( PARAMETER * ) para; id = prm->threadID; //printf("%dth thread with task %d, hash_table %p\n",id,prm.task,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { for ( i = 0; i < kmer_c; i++ ) { //if((hashBanBuffer[i]&taskMask)!=prm.threadID) if ( ( hashBanBuffer[i] % thrd_num ) != id ) { continue; } searchKmer ( i, KmerSets[id] ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { for ( i = 0; i < read_c; i++ ) { if ( i % thrd_num != id ) { continue; } chopKmer4read ( i, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { // parse reads for ( t = 0; t < read_c; t++ ) { if ( t % thrd_num != id ) { continue; } parse1read ( t ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } /* static void chopReads() { int i; for(i=0;ideleted ) { nodeBuffer[t] = node; } else { nodeBuffer[t] = NULL; } } static void parse1read ( int t ) { unsigned int j, i, s; unsigned int contigID; int counter = 0, counter2 = 0; unsigned int ctgLen, pos; kmer_t * node; boolean isSmaller; int flag, maxOcc = 0; kmer_t * maxNode = NULL; int alldgnLen = lenBuffer[t] > ALIGNLEN ? ALIGNLEN : lenBuffer[t]; int multi = alldgnLen - overlaplen + 1 < 2 ? 
2 : alldgnLen - overlaplen + 1; unsigned int start, finish; footprint[t] = 0; start = indexArray[t]; finish = indexArray[t + 1]; if ( finish == start ) //too short { ctgIdArray[t] = 0; return; } for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; if ( !node ) //same as previous { continue; } flag = 1; for ( s = j + 1; s < finish; s++ ) { if ( !nodeBuffer[s] ) { continue; } if ( nodeBuffer[s]->l_links == node->l_links ) { flag++; nodeBuffer[s] = NULL; } } if ( ( overlaplen < 32 && flag >= 2 ) || overlaplen > 32 ) { counter2++; } if ( flag >= multi ) { counter++; } else { continue; } if ( flag > maxOcc ) { pos = j; maxOcc = flag; maxNode = node; } } if ( !counter ) //no match { ctgIdArray[t] = 0; return; } if ( counter2 > 1 ) { footprint[t] = 1; } //use as a flag j = pos; i = pos - start + 1; node = nodeBuffer[j]; isSmaller = smallerBuffer[j]; contigID = node->l_links; ctgLen = contig_array[contigID].length; pos = node->r_links; if ( node->twin == isSmaller ) { orienArray[t] = '-'; ctgIdArray[t] = getTwinCtg ( contigID ); posArray[t] = ctgLen - pos - overlaplen - i + 1; } else { orienArray[t] = '+'; ctgIdArray[t] = contigID; posArray[t] = pos - i + 1; } } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void locate1read ( int t ) { int i, j, start, finish; kmer_t * node; unsigned int contigID; int pos, ctgLen; boolean isSmaller; start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; if ( !node ) //same as previous { continue; } i = j - start + 1; isSmaller = smallerBuffer[j]; contigID = node->l_links; ctgLen = contig_array[contigID].length; pos = node->r_links; if ( node->twin == isSmaller ) { ctgIdArray[t] = getTwinCtg ( contigID ); posArray[t] = ctgLen - pos - 
overlaplen - i + 1; } else { ctgIdArray[t] = contigID; posArray[t] = pos - i + 1; } } } static void output1read ( int t, FILE * outfp ) { int len = lenBuffer[t]; int index; readsInGap++; /* if(ctgIdArray[t]==735||ctgIdArray[t]==getTwinCtg(735)){ printf("%d\t%d\t%d\t",t+1,ctgIdArray[t],posArray[t]); int j; for(j=0;j R2 <-- R1 output1read ( read1, outfp ); } else { read2 = t; read1 = t - 1; ctgIdArray[read2] = ctgIdArray[read1]; posArray[read2] = posArray[read1] + insSize - lenBuffer[read2]; // --> R1 <-- R2 output1read ( read2, outfp ); } } static void recordLongRead ( FILE * outfp ) { int t; for ( t = 0; t < read_c; t++ ) { readCounter++; if ( footprint[t] ) { output1read ( t, outfp ); } } } static void recordAlldgn ( FILE * outfp, int insSize, FILE * outfp2 ) { int t, ctgId; boolean rd1gap, rd2gap; for ( t = 0; t < read_c; t++ ) { readCounter++; rd1gap = rd2gap = 0; ctgId = ctgIdArray[t]; if ( outfp2 && t % 2 == 1 ) //make sure this is read2 in a pair { if ( ctgIdArray[t] < 1 && ctgIdArray[t - 1] > 0 ) { getReadIngap ( t, insSize, outfp2, 0 ); //read 2 in gap rd2gap = 1; } else if ( ctgIdArray[t] > 0 && ctgIdArray[t - 1] < 1 ) { getReadIngap ( t - 1, insSize, outfp2, 1 ); //read 1 in gap rd1gap = 1; } } if ( ctgId < 1 ) { continue; } mapCounter++; fprintf ( outfp, "%lld\t%u\t%d\t%c\n", readCounter, ctgIdArray[t], posArray[t], orienArray[t] ); if ( t % 2 == 0 ) { continue; } // reads are not located by pe info but across edges if ( outfp2 && footprint[t - 1] && !rd1gap ) { if ( ctgIdArray[t - 1] < 1 ) { locate1read ( t - 1 ); } output1read ( t - 1, outfp2 ); } if ( outfp2 && footprint[t] && !rd2gap ) { if ( ctgIdArray[t] < 1 ) { locate1read ( t ); } output1read ( t, outfp2 ); } } } //load contig index and length void basicContigInfo ( char * infile ) { char name[256], lldne[1024]; FILE * fp; int length, bal_ed, num_all, num_long, index; sprintf ( name, "%s.ContigIndex", infile ); fp = ckopen ( name, "r" ); fgets ( lldne, sizeof ( lldne ), fp ); sscanf ( lldne + 8, 
"%d %d", &num_all, &num_long ); printf ( "%d edges in graph\n", num_all ); num_ctg = num_all; contig_array = ( CONTIG * ) ckalloc ( ( num_all + 1 ) * sizeof ( CONTIG ) ); fgets ( lldne, sizeof ( lldne ), fp ); num_long = 0; while ( fgets ( lldne, sizeof ( lldne ), fp ) != NULL ) { sscanf ( lldne, "%d %d %d", &index, &length, &bal_ed ); contig_array[++num_long].length = length; contig_array[num_long].bal_edge = bal_ed + 1; if ( index != num_long ) { printf ( "basicContigInfo: %d vs %d\n", index, num_long ); } if ( bal_ed == 0 ) { continue; } contig_array[++num_long].length = length; contig_array[num_long].bal_edge = -bal_ed + 1; } fclose ( fp ); } void prlRead2Ctg ( char * libfile, char * outfile ) { long long i; char * src_name, *next_name, name[256]; FILE * fo, *outfp2 = NULL; int maxReadNum, libNo, prevLibNo, insSize; boolean flag, pairs = 1; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); alloc_pe_mem ( num_libs ); if ( !maxReadLen ) { maxReadLen = 100; } printf ( "In file: %s, max seq len %d, max name len %d\n\n", libfile, maxReadLen, maxNameLen ); if ( maxReadLen > maxReadLen4all ) { maxReadLen4all = maxReadLen; } src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); maxReadNum = maxReadNum % 2 == 0 ? 
maxReadNum : maxReadNum - 1; //make sure paired reads are processed at the same batch seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); posArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); orienArray = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); footprint = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); thrdSignal[0] = 0; if ( 1 ) { for ( i = 0; i < thrd_num; i++ ) { rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( !contig_array ) { basicContigInfo ( outfile ); } sprintf ( name, "%s.readInGap", outfile ); outfp2 = ckopen ( name, "wb" ); sprintf ( name, "%s.readOnContig", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "read\tcontig\tpos\n" ); readCounter = mapCounter = readsInGap = 0; kmer_c = n_solexa = read_c = i = libNo = readNumBack = gradsCounter = 0; prevLibNo = -1; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 0 ) ) != 0 ) { if ( libNo != prevLibNo ) { prevLibNo = libNo; insSize = lib_array[libNo].avg_ins; ALIGNLEN = lib_array[libNo].map_len; printf ( "current insert size %d, map_len %d\n", insSize, ALIGNLEN ); if ( insSize > 1000 ) { ALIGNLEN = ALIGNLEN < 35 ? 35 : ALIGNLEN; } else { ALIGNLEN = ALIGNLEN < 32 ? 32 : ALIGNLEN; } } if ( insSize > 1000 ) { ALIGNLEN = ALIGNLEN < ( lenBuffer[read_c] / 2 + 1 ) ? 
( lenBuffer[read_c] / 2 + 1 ) : ALIGNLEN; } if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } indexArray[read_c] = kmer_c; if ( lenBuffer[read_c] >= overlaplen + 1 ) { kmer_c += lenBuffer[read_c] - overlaplen + 1; } read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordAlldgn ( fo, insSize, outfp2 ); kmer_c = 0; read_c = 0; } } if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordAlldgn ( fo, insSize, outfp2 ); printf ( "Output %lld out of %lld (%.1f)%% reads in gaps\n", readsInGap, readCounter, ( float ) readsInGap / readCounter * 100 ); } printf ( "%lld out of %lld (%.1f)%% reads mapped to contigs\n", mapCounter, readCounter, ( float ) mapCounter / readCounter * 100 ); sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); fclose ( fo ); sprintf ( name, "%s.peGrads", outfile ); fo = ckopen ( name, "w" ); fprintf ( fo, "grads&num: %d\t%lld\t%d\n", gradsCounter, n_solexa, maxReadLen4all ); if ( pairs ) { if ( gradsCounter ) printf ( "%d pe insert size, the largest boundary is %lld\n\n", gradsCounter, pes[gradsCounter - 1].PE_bound ); else { printf ( "no paired reads found\n" ); } for ( i = 0; i < gradsCounter; i++ ) { fprintf ( fo, "%d\t%lld\t%d\t%d\n", pes[i].insertS, pes[i].PE_bound, pes[i].rank, pes[i].pair_num_cut ); } fclose ( fo ); } fclose ( outfp2 ); free_pe_mem(); free_libs(); if ( 1 ) // multi-threads { for ( i = 0; i < thrd_num; i++ ) { free ( ( void * ) rcSeq[i + 1] ); } } free ( ( void * ) rcSeq ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void 
* ) ctgIdArray ); free ( ( void * ) posArray ); free ( ( void * ) orienArray ); free ( ( void * ) footprint ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); if ( contig_array ) { free ( ( void * ) contig_array ); contig_array = NULL; } } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } /********************* map long reads for gap filling ************************/ void prlLongRead2Ctg ( char * libfile, char * outfile ) { long long i; char * src_name, *next_name, name[256]; FILE * outfp2; int maxReadNum, libNo, prevLibNo; boolean flag, pairs = 0; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; maxReadLen = 0; maxNameLen = 256; scan_libInfo ( libfile ); if ( !maxReadLen ) { maxReadLen = 100; } int longReadLen = getMaxLongReadLen ( num_libs ); if ( longReadLen < 1 ) // no long reads { return; } maxReadLen4all = maxReadLen < longReadLen ? longReadLen : maxReadLen; printf ( "In file: %s, long read len %d, max name len %d\n\n", libfile, longReadLen, maxNameLen ); maxReadLen = longReadLen; src_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); next_name = ( char * ) ckalloc ( ( maxNameLen + 1 ) * sizeof ( char ) ); kmerBuffer = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); hashBanBuffer = ( ubyte8 * ) ckalloc ( buffer_size * sizeof ( ubyte8 ) ); nodeBuffer = ( kmer_t ** ) ckalloc ( buffer_size * sizeof ( kmer_t * ) ); smallerBuffer = ( boolean * ) ckalloc ( buffer_size * sizeof ( boolean ) ); maxReadNum = buffer_size / ( maxReadLen - overlaplen + 1 ); maxReadNum = maxReadNum % 2 == 0 ? 
maxReadNum : maxReadNum - 1; //make sure paired reads are processed at the same batch seqBuffer = ( char ** ) ckalloc ( maxReadNum * sizeof ( char * ) ); lenBuffer = ( int * ) ckalloc ( maxReadNum * sizeof ( int ) ); indexArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); ctgIdArray = ( unsigned int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( unsigned int ) ); posArray = ( int * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( int ) ); orienArray = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); footprint = ( char * ) ckalloc ( ( maxReadNum + 1 ) * sizeof ( char ) ); for ( i = 0; i < maxReadNum; i++ ) { seqBuffer[i] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); } rcSeq = ( char ** ) ckalloc ( ( thrd_num + 1 ) * sizeof ( char * ) ); deletion = ( int * ) ckalloc ( ( thrd_num + 1 ) * sizeof ( int ) ); thrdSignal[0] = 0; deletion[0] = 0; if ( 1 ) { for ( i = 0; i < thrd_num; i++ ) { rcSeq[i + 1] = ( char * ) ckalloc ( maxReadLen * sizeof ( char ) ); deletion[i + 1] = 0; thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } creatThrds ( threads, paras ); } if ( !contig_array ) { basicContigInfo ( outfile ); } sprintf ( name, "%s.longReadInGap", outfile ); outfp2 = ckopen ( name, "wb" ); readCounter = 0; kmer_c = n_solexa = read_c = i = libNo = 0; prevLibNo = -1; while ( ( flag = read1seqInLib ( seqBuffer[read_c], next_name, & ( lenBuffer[read_c] ), &libNo, pairs, 4 ) ) != 0 ) { if ( libNo != prevLibNo ) { prevLibNo = libNo; ALIGNLEN = lib_array[libNo].map_len; ALIGNLEN = ALIGNLEN < 35 ? 
35 : ALIGNLEN; printf ( "Map_len %d\n", ALIGNLEN ); } if ( ( ++i ) % 100000000 == 0 ) { printf ( "--- %lldth reads\n", i ); } indexArray[read_c] = kmer_c; if ( lenBuffer[read_c] >= overlaplen + 1 ) { kmer_c += lenBuffer[read_c] - overlaplen + 1; } read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordLongRead ( outfp2 ); kmer_c = 0; read_c = 0; } } if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); recordLongRead ( outfp2 ); printf ( "Output %lld out of %lld (%.1f)%% reads in gaps\n", readsInGap, readCounter, ( float ) readsInGap / readCounter * 100 ); } sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); fclose ( outfp2 ); free_libs(); if ( 1 ) // multi-threads { for ( i = 0; i < thrd_num; i++ ) { deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } printf ( "%d reads deleted\n", deletion[0] ); free ( ( void * ) rcSeq ); free ( ( void * ) deletion ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) kmerBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) ctgIdArray ); free ( ( void * ) posArray ); free ( ( void * ) orienArray ); free ( ( void * ) footprint ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); }
/* * 127mer/prlRead2path.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include "newhash.h" #include #include #define preARCBLOCKSIZE 100000 static const Kmer kmerZero = {0, 0, 0, 0}; static unsigned int * arcCounters; static int buffer_size = 100000000; static long long markCounter = 0; static unsigned int * fwriteBuf; static unsigned char * markerOnEdge; //buffer related variables for chop kmer static int read_c; static char ** rcSeq; static char ** seqBuffer; static int * lenBuffer; //edge and (K+1)mer related variables static preARC ** preArc_array; static Kmer * mixBuffer; static boolean * flagArray; //indicates whether each item in mixBuffer is a (K+1)mer // kmer related variables static char ** flags; static int kmer_c; static Kmer * kmerBuffer; static ubyte8 * hashBanBuffer; static kmer_t ** nodeBuffer; static boolean * smallerBuffer; static int * indexArray; static int * deletion; static void parse1read ( int t, int threadID ); static void search1kmerPlus ( int j, unsigned char thrdID ); static void threadRoutine ( void * thrdID ); static void searchKmer ( int t, KmerSet * kset ); static void chopKmer4read ( int t, int threadID ); static void thread_wait ( pthread_t * threads ); static void thread_add1preArc ( unsigned int from_ed, unsigned int to_ed, unsigned int thrdID ); static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { //printf("to create %dth 
thread\n",(*(char *)&(threadID[i]))); if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n", thrd_num ); } static void threadRoutine ( void * para ) { PARAMETER * prm; int i, t, j, start, finish; unsigned char id; prm = ( PARAMETER * ) para; id = prm->threadID; //printf("%dth thread with task %d, hash_table %p\n",id,prm.task,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { for ( i = 0; i < kmer_c; i++ ) { //if((hashBanBuffer[i]&taskMask)!=prm.threadID) if ( ( hashBanBuffer[i] % thrd_num ) != id ) { continue; } searchKmer ( i, KmerSets[id] ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { for ( i = 0; i < read_c; i++ ) { if ( i % thrd_num != id ) { continue; } chopKmer4read ( i, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 3 ) { // parse reads for ( t = 0; t < read_c; t++ ) { if ( t % thrd_num != id ) { continue; } parse1read ( t, id + 1 ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 4 ) { //printf("thread %d, reads %d splay kmerplus\n",id,read_c); for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish; j++ ) { if ( flagArray[j] == 0 ) { if ( mixBuffer[j].low2 == 0 ) { break; } } else if ( hashBanBuffer[j] % thrd_num == id ) { //fprintf(stderr,"thread %d search for ban %lld\n",id,hashBanBuffer[j]); search1kmerPlus ( j, id ); } /* if(flagArray[j]==0&&mixBuffer[j]==0) break; if(!flagArray[j]||(hashBanBuffer[j]%thrd_num)!=id) continue; search1kmerPlus(j,id); */ } } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 6 ) { for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; for ( j = start; j < finish - 1; j++ ) { if ( mixBuffer[j].low2 == 0 || mixBuffer[j + 1].low2 == 0 ) { break; } if ( mixBuffer[j].low2 % thrd_num != id ) { continue; } 
thread_add1preArc ( mixBuffer[j].low2, mixBuffer[j + 1].low2, id ); } } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 5 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } static void chopKmer4read ( int t, int threadID ) { char * src_seq = seqBuffer[t]; char * bal_seq = rcSeq[threadID]; int len_seq = lenBuffer[t]; int j, bal_j; ubyte8 hash_ban, bal_hash_ban; Kmer word, bal_word; int index; word = kmerZero; for ( index = 0; index < overlaplen; index++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= src_seq[index]; } reverseComplementSeq ( src_seq, len_seq, bal_seq ); // complementary node bal_word = reverseComplement ( word, overlaplen ); bal_j = len_seq - 0 - overlaplen; // 0; index = indexArray[t]; if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; smallerBuffer[index] = 1; hashBanBuffer[index++] = hash_ban; } else { bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; smallerBuffer[index] = 0; hashBanBuffer[index++] = bal_hash_ban; } //printf("%dth: %p with %p\n",kmer_c-1,bal_word,bal_hash_ban); for ( j = 1; j <= len_seq - overlaplen; j ++ ) { word = nextKmer ( word, src_seq[j - 1 + overlaplen] ); bal_j = len_seq - j - overlaplen; // j; bal_word = prevKmer ( bal_word, bal_seq[bal_j] ); if ( KmerSmaller ( word, bal_word ) ) { hash_ban = hash_kmer ( word ); kmerBuffer[index] = word; smallerBuffer[index] = 1; hashBanBuffer[index++] = hash_ban; //printf("%dth: %p with %p\n",kmer_c-1,word,hashBanBuffer[kmer_c-1]); } else { // complementary node bal_hash_ban = hash_kmer ( bal_word ); kmerBuffer[index] = bal_word; smallerBuffer[index] = 0; hashBanBuffer[index++] = bal_hash_ban; //printf("%dth: %p with %p\n",kmer_c-1,bal_word,hashBanBuffer[kmer_c-1]); } } } //splay for one kmer in buffer and save the node to nodeBuffer static void searchKmer ( int t, KmerSet * kset ) { kmer_t * node; boolean found = search_kmerset ( kset, kmerBuffer[t], &node ); if ( !found ) { printf ( 
"searchKmer: kmer %llx %llx %llx %llx is not found\n", kmerBuffer[t].high1, kmerBuffer[t].low1, kmerBuffer[t].high2, kmerBuffer[t].low2 ); } nodeBuffer[t] = node; } static preARC * getPreArcBetween ( unsigned int from_ed, unsigned int to_ed ) { preARC * parc; parc = preArc_array[from_ed]; while ( parc ) { if ( parc->to_ed == to_ed ) { return parc; } parc = parc->next; } return parc; } static void thread_add1preArc ( unsigned int from_ed, unsigned int to_ed, unsigned int thrdID ) { preARC * parc = getPreArcBetween ( from_ed, to_ed ); if ( parc ) { parc->multiplicity++; } else { parc = prlAllocatePreArc ( to_ed, preArc_mem_managers[thrdID] ); arcCounters[thrdID]++; parc->next = preArc_array[from_ed]; preArc_array[from_ed] = parc; } } static void memoAlloc4preArc() { unsigned int i; preArc_array = ( preARC ** ) ckalloc ( ( num_ed + 1 ) * sizeof ( preARC * ) ); for ( i = 0; i <= num_ed; i++ ) { preArc_array[i] = NULL; } } static void memoFree4preArc() { prlDestroyPreArcMem(); if ( preArc_array ) { free ( ( void * ) preArc_array ); } } static void output_arcs ( char * outfile ) { unsigned int i; char name[256]; FILE * outfp, *outfp2 = NULL; preARC * parc; sprintf ( name, "%s.preArc", outfile ); outfp = ckopen ( name, "w" ); if ( repsTie ) { sprintf ( name, "%s.markOnEdge", outfile ); outfp2 = ckopen ( name, "w" ); } markCounter = 0; for ( i = 1; i <= num_ed; i++ ) { if ( repsTie ) { markCounter += markerOnEdge[i]; fprintf ( outfp2, "%d\n", markerOnEdge[i] ); } parc = preArc_array[i]; if ( !parc ) { continue; } fprintf ( outfp, "%u", i ); while ( parc ) { fprintf ( outfp, " %u %u", parc->to_ed, parc->multiplicity ); parc = parc->next; } fprintf ( outfp, "\n" ); } fclose ( outfp ); if ( repsTie ) { fclose ( outfp2 ); printf ( "%lld markers counted\n", markCounter ); } } static void recordPathBin ( FILE * outfp ) { int t, j, start, finish; unsigned char counter; for ( t = 0; t < read_c; t++ ) { start = indexArray[t]; finish = indexArray[t + 1]; if ( finish - start < 3 || 
mixBuffer[start].low2 == 0 || mixBuffer[start + 1].low2 == 0 || mixBuffer[start + 2].low2 == 0 ) { continue; } counter = 0; for ( j = start; j < finish; j++ ) { if ( mixBuffer[j].low2 == 0 ) { break; } fwriteBuf[counter++] = ( unsigned int ) mixBuffer[j].low2; if ( markerOnEdge[mixBuffer[j].low2] < 255 ) { markerOnEdge[mixBuffer[j].low2]++; } markCounter++; } fwrite ( &counter, sizeof ( char ), 1, outfp ); fwrite ( fwriteBuf, sizeof ( unsigned int ), ( int ) counter, outfp ); } } static void search1kmerPlus ( int j, unsigned char thrdID ) { kmer_t * node; boolean found = search_kmerset ( KmerSetsPatch[thrdID], mixBuffer[j], &node ); if ( !found ) { /* fprintf(stderr,"kmerPlus %llx %llx (hashban %lld) not found, flag %d!\n", mixBuffer[j].high,mixBuffer[j].low,hashBanBuffer[j],flagArray[j]); */ mixBuffer[j] = kmerZero; return; } //else fprintf(stderr,"kmerPlus found\n"); if ( smallerBuffer[j] ) { mixBuffer[j].low2 = node->l_links; } else { mixBuffer[j].low2 = node->l_links + node->twin - 1; } } static void parse1read ( int t, int threadID ) { unsigned int j, retain = 0; unsigned int edge_index = 0; kmer_t * node; boolean isSmaller; Kmer wordplus, bal_wordplus; unsigned int start, finish, pos; Kmer prevKmer, currentKmer; boolean IsPrevKmer = 0; start = indexArray[t]; finish = indexArray[t + 1]; pos = start; for ( j = start; j < finish; j++ ) { node = nodeBuffer[j]; //extract edges or keep kmers if ( ( node->deleted ) || ( node->linear && !node->inEdge ) ) // deleted or in a floating loop { if ( retain < 2 ) { retain = 0; pos = start; } else { break; } continue; } isSmaller = smallerBuffer[j]; if ( node->linear ) { if ( isSmaller ) { edge_index = node->l_links; } else { edge_index = node->l_links + node->twin - 1; } if ( retain == 0 || IsPrevKmer ) { retain++; mixBuffer[pos].low2 = edge_index; flagArray[pos++] = 0; IsPrevKmer = 0; } else if ( edge_index != mixBuffer[pos - 1].low2 ) { retain++; mixBuffer[pos].low2 = edge_index; flagArray[pos++] = 0; } } else { if ( 
isSmaller ) { currentKmer = node->seq; } else { currentKmer = reverseComplement ( node->seq, overlaplen ); } if ( IsPrevKmer ) { retain++; wordplus = KmerPlus ( prevKmer, lastCharInKmer ( currentKmer ) ); bal_wordplus = reverseComplement ( wordplus, overlaplen + 1 ); if ( KmerSmaller ( wordplus, bal_wordplus ) ) { smallerBuffer[pos] = 1; hashBanBuffer[pos] = hash_kmer ( wordplus ); mixBuffer[pos] = wordplus; } else { smallerBuffer[pos] = 0; hashBanBuffer[pos] = hash_kmer ( bal_wordplus ); mixBuffer[pos] = bal_wordplus; } // fprintf(stderr,"%lld\n",hashBanBuffer[pos]); flagArray[pos++] = 1; } IsPrevKmer = 1; prevKmer = currentKmer; } } /* for(j=start;j70) lenBuffer[read_c] = 70; else if(lenBuffer[read_c]>40) lenBuffer[read_c] = 40; */ indexArray[read_c] = kmer_c; kmer_c += lenBuffer[read_c] - overlaplen + 1; read_c++; if ( read_c == maxReadNum ) { indexArray[read_c] = kmer_c; time ( &read_end ); t0 += read_end - read_start; time ( &time_bef ); sendWorkSignal ( 2, thrdSignal ); time ( &time_aft ); t1 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 1, thrdSignal ); time ( &time_aft ); t2 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 3, thrdSignal ); time ( &time_aft ); t3 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 4, thrdSignal ); time ( &time_aft ); t4 += time_aft - time_bef; time ( &time_bef ); sendWorkSignal ( 6, thrdSignal ); time ( &time_aft ); t5 += time_aft - time_bef; time ( &time_bef ); //recordPreArc(); if ( repsTie ) { recordPathBin ( outfp ); } time ( &time_aft ); t6 += time_aft - time_bef; kmer_c = 0; read_c = 0; time ( &read_start ); } } printf ( "%lld reads processed\n", i ); printf ( "time %d,%d,%d,%d,%d,%d,%d\n", t0, t1, t2, t3, t4, t5, t6 ); if ( read_c ) { indexArray[read_c] = kmer_c; sendWorkSignal ( 2, thrdSignal ); sendWorkSignal ( 1, thrdSignal ); sendWorkSignal ( 3, thrdSignal ); sendWorkSignal ( 4, thrdSignal ); sendWorkSignal ( 6, thrdSignal ); //recordPreArc(); if ( repsTie ) { recordPathBin ( 
outfp ); } } printf ( "%lld markers output\n", markCounter ); sendWorkSignal ( 5, thrdSignal ); thread_wait ( threads ); output_arcs ( outfile ); memoFree4preArc(); if ( 1 ) // multi-threads { arcCounter = 0; for ( i = 0; i < thrd_num; i++ ) { arcCounter += arcCounters[i]; free ( ( void * ) flags[i + 1] ); deletion[0] += deletion[i + 1]; free ( ( void * ) rcSeq[i + 1] ); } } if ( 1 ) { free ( ( void * ) flags[0] ); free ( ( void * ) rcSeq[0] ); } printf ( "done mapping reads, %d reads deleted, %lld arcs created\n", deletion[0], arcCounter ); if ( repsTie ) { free ( ( void * ) markerOnEdge ); free ( ( void * ) fwriteBuf ); } free ( ( void * ) arcCounters ); free ( ( void * ) rcSeq ); for ( i = 0; i < maxReadNum; i++ ) { free ( ( void * ) seqBuffer[i] ); } free ( ( void * ) seqBuffer ); free ( ( void * ) lenBuffer ); free ( ( void * ) indexArray ); free ( ( void * ) flags ); free ( ( void * ) deletion ); free ( ( void * ) kmerBuffer ); free ( ( void * ) mixBuffer ); free ( ( void * ) smallerBuffer ); free ( ( void * ) flagArray ); free ( ( void * ) hashBanBuffer ); free ( ( void * ) nodeBuffer ); free ( ( void * ) src_name ); free ( ( void * ) next_name ); if ( repsTie ) { fclose ( outfp ); } free_pe_mem(); free_libs(); } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } }
/* * 127mer/prlReadFillGap.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. 
* * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" #define RDBLOCKSIZE 50 #define CTGappend 50 static Kmer MAXKMER; static int Ncounter; static int allGaps; // for multi threads static int * counters; static pthread_mutex_t mutex; static int scafBufSize = 100; static boolean * flagBuf; static unsigned char * thrdNoBuf; static STACK ** ctgStackBuffer; static int scafCounter; static int scafInBuf; static void MarkCtgOccu ( unsigned int ctg ); /* static void printRead(int len,char *seq) { int j; fprintf(stderr,">read\n"); for(j=0;jlen = len; rd->dis = pos; rd->seqStarter = starter; } static void convertIndex() { int * length_array = ( int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( int ) ); unsigned int i; for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } //contig i with new index: index_array[i] free ( ( void * ) length_array ); } static long long getRead1by1 ( FILE * fp, DARRAY * readSeqInGap ) { long long readCounter = 0; if ( !fp ) { return readCounter; } int len, ctgID, pos; long long starter; char * pt; char * freadBuf = ( char * ) ckalloc ( ( maxReadLen / 4 + 1 ) * sizeof ( char ) ); while ( fread ( &len, sizeof ( int ), 1, fp ) == 1 ) { if ( fread ( &ctgID, sizeof ( int ), 1, fp ) != 1 ) { break; } if ( fread ( &pos, sizeof ( int ), 1, fp ) != 1 ) { break; } if ( fread ( freadBuf, sizeof ( char ), len / 4 + 1, fp ) != ( unsigned ) ( len / 4 + 1 ) ) { break; } //put seq to dynamic array starter = 
readSeqInGap->item_c; if ( !darrayPut ( readSeqInGap, starter + len / 4 ) ) // make sure there's room for this seq { break; } pt = ( char * ) darrayPut ( readSeqInGap, starter ); bcopy ( freadBuf, pt, len / 4 + 1 ); attach1read2contig ( ctgID, len, pos, starter ); readCounter++; } free ( ( void * ) freadBuf ); return readCounter; } // Darray *readSeqInGap static boolean loadReads4gap ( char * graphfile ) { FILE * fp, *fp2; char name[1024]; long long readCounter; sprintf ( name, "%s.readInGap", graphfile ); fp = fopen ( name, "rb" ); sprintf ( name, "%s.longReadInGap", graphfile ); fp2 = fopen ( name, "rb" ); if ( !fp && !fp2 ) { return 0; } if ( !orig2new ) { convertIndex(); orig2new = 1; } readSeqInGap = ( DARRAY * ) createDarray ( 1000000, sizeof ( char ) ); if ( fp ) { readCounter = getRead1by1 ( fp, readSeqInGap ); printf ( "Loaded %lld reads from %s.readInGap\n", readCounter, graphfile ); fclose ( fp ); } if ( fp2 ) { readCounter = getRead1by1 ( fp2, readSeqInGap ); printf ( "Loaded %lld reads from %s.longReadInGap\n", readCounter, graphfile ); fclose ( fp2 ); } return 1; } static void debugging1() { unsigned int i; if ( orig2new ) { unsigned int * length_array = ( unsigned int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( unsigned int ) ); //use length_array to change info in index_array for ( i = 1; i <= num_ctg; i++ ) { length_array[i] = 0; } for ( i = 1; i <= num_ctg; i++ ) { if ( index_array[i] > 0 ) { length_array[index_array[i]] = i; } } for ( i = 1; i <= num_ctg; i++ ) { index_array[i] = length_array[i]; } /* contig i with original index: index_array[i] */ free ( ( void * ) length_array ); /* release the temporary remapping table */ orig2new = 0; } READNEARBY * rd; int j; char * pt; for ( i = 1; i <= num_ctg; i++ ) { if ( !contig_array[i].closeReads ) { continue; } if ( index_array[i] != 735 ) { continue; } printf ( "contig %d, len %d: \n", index_array[i], contig_array[i].length ); stackBackup ( contig_array[i].closeReads ); while ( ( rd = ( READNEARBY * ) stackPop ( contig_array[i].closeReads ) ) != NULL ) { printf ( "%d\t%d\t%lld\t", 
rd->dis, rd->len, rd->seqStarter ); pt = ( char * ) darrayGet ( readSeqInGap, rd->seqStarter ); for ( j = 0; j < rd->len; j++ ) { printf ( "%c", int2base ( ( int ) getCharInTightString ( pt, j ) ) ); } printf ( "\n" ); } stackRecover ( contig_array[i].closeReads ); } } static void initiateCtgInScaf ( CTGinSCAF * actg ) { actg->cutTail = 0; actg->cutHead = overlaplen; actg->gapSeqLen = 0; } static int procGap ( char * line, STACK * ctgsStack ) { char * tp; int length, i, seg; unsigned int ctg; CTGinSCAF * ctgPt; tp = strtok ( line, " " ); tp = strtok ( NULL, " " ); //length length = atoi ( tp ); tp = strtok ( NULL, " " ); //seg seg = atoi ( tp ); if ( !seg ) { return length; } for ( i = 0; i < seg; i++ ) { tp = strtok ( NULL, " " ); ctg = atoi ( tp ); MarkCtgOccu ( ctg ); ctgPt = ( CTGinSCAF * ) stackPush ( ctgsStack ); initiateCtgInScaf ( ctgPt ); ctgPt->ctgID = ctg; ctgPt->start = 0; ctgPt->end = 0; ctgPt->scaftig_start = 0; ctgPt->mask = 1; } return length; } static void debugging2 ( int index, STACK * ctgsStack ) { CTGinSCAF * actg; stackBackup ( ctgsStack ); printf ( ">scaffold%d\t%d 0.0\n", index, ctgsStack->item_c ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { printf ( "%d\t%d\t%d\t%d\n", actg->ctgID, actg->start, actg->end, actg->scaftig_start ); } stackRecover ( ctgsStack ); } static int cmp_reads ( const void * a, const void * b ) { READNEARBY * A, *B; A = ( READNEARBY * ) a; B = ( READNEARBY * ) b; if ( A->dis > B->dis ) { return 1; } else if ( A->dis == B->dis ) { return 0; } else { return -1; } } static void cutRdArray ( READNEARBY * rdArray, int gapStart, int gapEnd, int * count, int arrayLen, READNEARBY * cutArray ) { int i; int num = 0; for ( i = 0; i < arrayLen; i++ ) { if ( rdArray[i].dis > gapEnd ) { break; } if ( ( rdArray[i].dis + rdArray[i].len ) >= gapStart ) { cutArray[num].dis = rdArray[i].dis; cutArray[num].len = rdArray[i].len; cutArray[num++].seqStarter = rdArray[i].seqStarter; } } *count = num; } static void outputTightStr ( 
FILE * fp, char * tightStr, int start, int length, int outputlen, int revS, int * col ) { int i; int end; int column = *col; if ( !revS ) { end = start + outputlen <= length ? start + outputlen : length; for ( i = start; i < end; i++ ) { fprintf ( fp, "%c", int2base ( ( int ) getCharInTightString ( tightStr, i ) ) ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } } else { end = length - start - outputlen - 1 >= 0 ? length - start - outputlen : 0; for ( i = length - 1 - start; i >= end; i-- ) { fprintf ( fp, "%c", int2compbase ( ( int ) getCharInTightString ( tightStr, i ) ) ); if ( ( ++column ) % 100 == 0 ) { fprintf ( fp, "\n" ); //column = 0; } } } *col = column; } static void outputTightStrLowerCase ( FILE * fp, char * tightStr, int start, int length, int outputlen, int revS, int * col ) { int i; int end; int column = *col; if ( !revS ) { end = start + outputlen <= length ? start + outputlen : length; for ( i = start; i < end; i++ ) { fprintf ( fp, "%c", "actg"[ ( int ) getCharInTightString ( tightStr, i )] ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } } else { end = length - start - outputlen - 1 >= 0 ? 
length - start - outputlen : 0; for ( i = length - 1 - start; i >= end; i-- ) { fprintf ( fp, "%c", "tgac"[ ( int ) getCharInTightString ( tightStr, i )] ); if ( ( ++column ) % 100 == 0 ) { fprintf ( fp, "\n" ); //column = 0; } } } *col = column; } static void outputNs ( FILE * fp, int gapN, int * col ) { int i, column = *col; for ( i = 0; i < gapN; i++ ) { fprintf ( fp, "N" ); if ( ( ++column ) % 100 == 0 ) { //column = 0; fprintf ( fp, "\n" ); } } *col = column; } static void outputGapInfo ( unsigned int ctg1, unsigned int ctg2 ) { unsigned int bal_ctg1 = getTwinCtg ( ctg1 ); unsigned int bal_ctg2 = getTwinCtg ( ctg2 ); if ( isLargerThanTwin ( ctg1 ) ) { fprintf ( stderr, "%d\t", index_array[bal_ctg1] ); } else { fprintf ( stderr, "%d\t", index_array[ctg1] ); } if ( isLargerThanTwin ( ctg2 ) ) { fprintf ( stderr, "%d\n", index_array[bal_ctg2] ); } else { fprintf ( stderr, "%d\n", index_array[ctg2] ); } } static void output1gap ( FILE * fo, int scafIndex, CTGinSCAF * prevCtg, CTGinSCAF * actg, DARRAY * gapSeqArray ) { unsigned int ctg1, bal_ctg1, length1; int start1, outputlen1; unsigned int ctg2, bal_ctg2, length2; int start2, outputlen2; char * pt; int column = 0; ctg1 = prevCtg->ctgID; bal_ctg1 = getTwinCtg ( ctg1 ); start1 = prevCtg->cutHead; length1 = contig_array[ctg1].length + overlaplen; if ( length1 - prevCtg->cutTail - start1 > CTGappend ) { outputlen1 = CTGappend; start1 = length1 - prevCtg->cutTail - outputlen1; } else { outputlen1 = length1 - prevCtg->cutTail - start1; } ctg2 = actg->ctgID; bal_ctg2 = getTwinCtg ( ctg2 ); start2 = actg->cutHead; length2 = contig_array[ctg2].length + overlaplen; if ( length2 - actg->cutTail - start2 > CTGappend ) { outputlen2 = CTGappend; } else { outputlen2 = length2 - actg->cutTail - start2; } if ( isLargerThanTwin ( ctg1 ) ) { fprintf ( fo, ">S%d_C%d_L%d_G%d", scafIndex, index_array[bal_ctg1], outputlen1, prevCtg->gapSeqLen ); } else { fprintf ( fo, ">S%d_C%d_L%d_G%d", scafIndex, index_array[ctg1], outputlen1, 
prevCtg->gapSeqLen ); } if ( isLargerThanTwin ( ctg2 ) ) { fprintf ( fo, "_C%d_L%d\n", index_array[bal_ctg2], outputlen2 ); } else { fprintf ( fo, "_C%d_L%d\n", index_array[ctg2], outputlen2 ); } if ( contig_array[ctg1].seq ) { outputTightStr ( fo, contig_array[ctg1].seq, start1, length1, outputlen1, 0, &column ); } else if ( contig_array[bal_ctg1].seq ) { outputTightStr ( fo, contig_array[bal_ctg1].seq, start1, length1, outputlen1, 1, &column ); } pt = ( char * ) darrayPut ( gapSeqArray, prevCtg->gapSeqOffset ); outputTightStrLowerCase ( fo, pt, 0, prevCtg->gapSeqLen, prevCtg->gapSeqLen, 0, &column ); if ( contig_array[ctg2].seq ) { outputTightStr ( fo, contig_array[ctg2].seq, start2, length2, outputlen2, 0, &column ); } else if ( contig_array[bal_ctg2].seq ) { outputTightStr ( fo, contig_array[bal_ctg2].seq, start2, length2, outputlen2, 1, &column ); } fprintf ( fo, "\n" ); } static void outputGapSeq ( FILE * fo, int index, STACK * ctgsStack, DARRAY * gapSeqArray ) { CTGinSCAF * actg, *prevCtg = NULL; stackRecover ( ctgsStack ); fprintf ( fo, ">scaffold%d\n", index ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( prevCtg ) { if ( actg->scaftig_start ) { fprintf ( fo, "0\t%d\t%d\n", prevCtg->mask, actg->mask ); } else { fprintf ( fo, "1\t%d\t%d\n", prevCtg->mask, actg->mask ); } } /* if(prevCtg&&prevCtg->gapSeqLen>0) output1gap(fo,index,prevCtg,actg,gapSeqArray); */ prevCtg = actg; } } static void outputScafSeq ( FILE * fo, int index, STACK * ctgsStack, DARRAY * gapSeqArray ) { CTGinSCAF * actg, *prevCtg = NULL; unsigned int ctg, bal_ctg, length; int start, outputlen, gapN; char * pt; int column = 0; long long cvgSum = 0; int lenSum = 0; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( ! 
( contig_array[actg->ctgID].cvg > 0 ) ) { continue; } lenSum += contig_array[actg->ctgID].length; cvgSum += contig_array[actg->ctgID].length * contig_array[actg->ctgID].cvg; } if ( lenSum > 0 ) { fprintf ( fo, ">scaffold%d %4.1f\n", index, ( double ) cvgSum / lenSum ); } else { fprintf ( fo, ">scaffold%d 0.0\n", index ); } stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); length = contig_array[ctg].length + overlaplen; if ( prevCtg && actg->scaftig_start ) { gapN = actg->start - prevCtg->start - contig_array[prevCtg->ctgID].length; gapN = gapN > 0 ? gapN : 1; outputNs ( fo, gapN, &column ); //outputGapInfo(prevCtg->ctgID,ctg); Ncounter++; } if ( !prevCtg ) { start = 0; } else { start = actg->cutHead; } outputlen = length - start - actg->cutTail; if ( contig_array[ctg].seq ) { outputTightStr ( fo, contig_array[ctg].seq, start, length, outputlen, 0, &column ); } else if ( contig_array[bal_ctg].seq ) { outputTightStr ( fo, contig_array[bal_ctg].seq, start, length, outputlen, 1, &column ); } if ( actg->gapSeqLen < 1 ) { prevCtg = actg; continue; } pt = ( char * ) darrayPut ( gapSeqArray, actg->gapSeqOffset ); outputTightStrLowerCase ( fo, pt, 0, actg->gapSeqLen, actg->gapSeqLen, 0, &column ); prevCtg = actg; } fprintf ( fo, "\n" ); } static void fill1scaf ( int index, STACK * ctgsStack, int thrdID ); static void check1scaf ( int t, int thrdID ) { if ( flagBuf[t] ) { return; } boolean late = 0; pthread_mutex_lock ( &mutex ); if ( !flagBuf[t] ) { flagBuf[t] = 1; thrdNoBuf[t] = thrdID; } else { late = 1; } pthread_mutex_unlock ( &mutex ); if ( late ) { return; } counters[thrdID]++; fill1scaf ( scafCounter + t + 1, ctgStackBuffer[t], thrdID ); } static void fill1scaf ( int index, STACK * ctgsStack, int thrdID ) { CTGinSCAF * actg, *prevCtg = NULL; READNEARBY * rdArray, *rdArray4gap, *rd; int numRd = 0, count, maxGLen = 0; unsigned int ctg, bal_ctg; STACK * rdStack; while ( ( actg = stackPop ( 
ctgsStack ) ) != NULL ) { if ( prevCtg ) { maxGLen = maxGLen < ( actg->start - prevCtg->end ) ? ( actg->start - prevCtg->end ) : maxGLen; } ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( actg->mask ) { prevCtg = actg; continue; } if ( contig_array[ctg].closeReads ) { numRd += contig_array[ctg].closeReads->item_c; } else if ( contig_array[bal_ctg].closeReads ) { numRd += contig_array[bal_ctg].closeReads->item_c; } prevCtg = actg; } if ( numRd < 1 ) { return; } rdArray = ( READNEARBY * ) ckalloc ( numRd * sizeof ( READNEARBY ) ); rdArray4gap = ( READNEARBY * ) ckalloc ( numRd * sizeof ( READNEARBY ) ); //fprintf(stderr,"scaffold%d reads4gap %d\n",index,numRd); // collect reads appended to contigs in this scaffold int numRd2 = 0; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { ctg = actg->ctgID; bal_ctg = getTwinCtg ( ctg ); if ( actg->mask ) { continue; } if ( contig_array[ctg].closeReads ) { rdStack = contig_array[ctg].closeReads; } else if ( contig_array[bal_ctg].closeReads ) { rdStack = contig_array[bal_ctg].closeReads; } else { continue; } stackBackup ( rdStack ); while ( ( rd = ( READNEARBY * ) stackPop ( rdStack ) ) != NULL ) { rdArray[numRd2].len = rd->len; rdArray[numRd2].seqStarter = rd->seqStarter; if ( isSmallerThanTwin ( ctg ) ) { rdArray[numRd2++].dis = actg->start - overlaplen + rd->dis; } else rdArray[numRd2++].dis = actg->start - overlaplen + contig_array[ctg].length - rd->len - rd->dis; } stackRecover ( rdStack ); } if ( numRd2 != numRd ) { printf ( "##reads numbers doesn't match, %d vs %d when scaffold %d\n", numRd, numRd2, index ); } qsort ( rdArray, numRd, sizeof ( READNEARBY ), cmp_reads ); //fill gap one by one int gapStart, gapEnd; int numIn = 0; boolean flag; int buffer_size = maxReadLen > 100 ? maxReadLen : 100; int maxGSLen = maxGLen + GLDiff < 10 ? 
10 : maxGLen + GLDiff; //fprintf(stderr,"maxGlen %d, maxGSlen %d\n",maxGLen,maxGSLen); char * seqGap = ( char * ) ckalloc ( maxGSLen * sizeof ( char ) ); // temp array for gap sequence Kmer * kmerCtg1 = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); Kmer * kmerCtg2 = ( Kmer * ) ckalloc ( buffer_size * sizeof ( Kmer ) ); char * seqCtg1 = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); char * seqCtg2 = ( char * ) ckalloc ( buffer_size * sizeof ( char ) ); prevCtg = NULL; stackRecover ( ctgsStack ); while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) { if ( !prevCtg || !actg->scaftig_start ) { prevCtg = actg; continue; } gapStart = prevCtg->end - 100; gapEnd = actg->start - overlaplen + 100; cutRdArray ( rdArray, gapStart, gapEnd, &count, numRd, rdArray4gap ); numIn += count; /* if(!count){ prevCtg = actg; continue; } */ int overlap; for ( overlap = overlaplen; overlap > 14; overlap -= 2 ) { flag = localGraph ( rdArray4gap, count, prevCtg, actg, overlaplen, kmerCtg1, kmerCtg2, overlap, darrayBuf[thrdID], seqCtg1, seqCtg2, seqGap ); //free_kmerset(kmerSet); if ( flag == 1 ) { /* fprintf(stderr,"Between ctg %d and %d, Found with %d\n",prevCtg->ctgID ,actg->ctgID,overlap); */ break; } } /* if(count==0) printf("Gap closed without reads\n"); if(!flag) fprintf(stderr,"Between ctg %d and %d, NO routes found\n",prevCtg->ctgID,actg->ctgID); */ prevCtg = actg; } //fprintf(stderr,"____scaffold%d reads in gap %d\n",index,numIn); free ( ( void * ) seqGap ); free ( ( void * ) kmerCtg1 ); free ( ( void * ) kmerCtg2 ); free ( ( void * ) seqCtg1 ); free ( ( void * ) seqCtg2 ); free ( ( void * ) rdArray ); free ( ( void * ) rdArray4gap ); } static void reverseStack ( STACK * dStack, STACK * sStack ) { CTGinSCAF * actg, *ctgPt; emptyStack ( dStack ); while ( ( actg = ( CTGinSCAF * ) stackPop ( sStack ) ) != NULL ) { ctgPt = ( CTGinSCAF * ) stackPush ( dStack ); ctgPt->ctgID = actg->ctgID; ctgPt->start = actg->start; ctgPt->end = actg->end; ctgPt->scaftig_start = 
actg->scaftig_start; ctgPt->mask = actg->mask; ctgPt->cutHead = actg->cutHead; ctgPt->cutTail = actg->cutTail; ctgPt->gapSeqLen = actg->gapSeqLen; ctgPt->gapSeqOffset = actg->gapSeqOffset; } stackBackup ( dStack ); } static Kmer tightStr2Kmer ( char * tightStr, int start, int length, int revS ) { int i; Kmer word; word.high1 = word.low1 = word.high2 = word.low2 = 0; if ( !revS ) { if ( start + overlaplen > length ) { printf ( "tightStr2Kmer A: no enough bases for kmer\n" ); return word; } for ( i = start; i < start + overlaplen; i++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= getCharInTightString ( tightStr, i ); } } else { if ( length - start - overlaplen < 0 ) { printf ( "tightStr2Kmer B: no enough bases for kmer\n" ); return word; } for ( i = length - 1 - start; i > length - 1 - start - overlaplen; i-- ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= int_comp ( getCharInTightString ( tightStr, i ) ); } } return word; } static Kmer maxKmer() { Kmer word; word.high1 = word.low1 = word.high2 = word.low2 = 0; int i; for ( i = 0; i < overlaplen; i++ ) { word = KmerLeftBitMoveBy2 ( word ); word.low2 |= 0x3; } return word; } static int contigCatch ( unsigned int prev_ctg, unsigned int ctg ) { if ( contig_array[prev_ctg].length == 0 || contig_array[ctg].length == 0 ) { return 0; } Kmer kmerAtEnd, kmerAtStart; Kmer MaxKmer; unsigned int bal_ctg1 = getTwinCtg ( prev_ctg ); unsigned int bal_ctg2 = getTwinCtg ( ctg ); int i, start; int len1 = contig_array[prev_ctg].length + overlaplen; int len2 = contig_array[ctg].length + overlaplen; start = contig_array[prev_ctg].length; if ( contig_array[prev_ctg].seq ) { kmerAtEnd = tightStr2Kmer ( contig_array[prev_ctg].seq, start, len1, 0 ); } else { kmerAtEnd = tightStr2Kmer ( contig_array[bal_ctg1].seq, start, len1, 1 ); } start = 0; if ( contig_array[ctg].seq ) { kmerAtStart = tightStr2Kmer ( contig_array[ctg].seq, start, len2, 0 ); } else { kmerAtStart = tightStr2Kmer ( contig_array[bal_ctg2].seq, start, len2, 1 ); } 
MaxKmer = MAXKMER; for ( i = 0; i < 10; i++ ) { if ( KmerEqual ( kmerAtStart, kmerAtEnd ) ) { break; } MaxKmer = KmerRightBitMoveBy2 ( MaxKmer ); kmerAtEnd = KmerAnd ( kmerAtEnd, MaxKmer ); kmerAtStart = KmerRightBitMoveBy2 ( kmerAtStart ); } if ( i < 10 ) { return overlaplen - i; } else { return 0; } } static void initStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { flagBuf[i] = 1; ctgStackBuffer[i] = ( STACK * ) createStack ( 100, sizeof ( CTGinSCAF ) ); } } static void freeStackBuf ( STACK ** ctgStackBuffer, int scafBufSize ) { int i; for ( i = 0; i < scafBufSize; i++ ) { freeStack ( ctgStackBuffer[i] ); } } static void threadRoutine ( void * para ) { PARAMETER * prm; int i; prm = ( PARAMETER * ) para; //printf("%dth thread with threadID %d, hash_table %p\n",id,prm.threadID,prm.hash_table); while ( 1 ) { if ( * ( prm->selfSignal ) == 1 ) { emptyDarray ( darrayBuf[prm->threadID] ); for ( i = 0; i < scafInBuf; i++ ) { check1scaf ( i, prm->threadID ); } * ( prm->selfSignal ) = 0; } else if ( * ( prm->selfSignal ) == 2 ) { * ( prm->selfSignal ) = 0; break; } usleep ( 1 ); } } static void creatThrds ( pthread_t * threads, PARAMETER * paras ) { unsigned char i; int temp; for ( i = 0; i < thrd_num; i++ ) { if ( ( temp = pthread_create ( &threads[i], NULL, ( void * ) threadRoutine, & ( paras[i] ) ) ) != 0 ) { printf ( "create threads failed\n" ); exit ( 1 ); } } printf ( "%d thread created\n...\n", thrd_num ); } static void sendWorkSignal ( unsigned char SIG, unsigned char * thrdSignals ) { int t; for ( t = 0; t < thrd_num; t++ ) { thrdSignals[t + 1] = SIG; } while ( 1 ) { usleep ( 10 ); for ( t = 0; t < thrd_num; t++ ) if ( thrdSignals[t + 1] ) { break; } if ( t == thrd_num ) { break; } } } static void thread_wait ( pthread_t * threads ) { int i; for ( i = 0; i < thrd_num; i++ ) if ( threads[i] != 0 ) { pthread_join ( threads[i], NULL ); } } static void outputSeqs ( FILE * fo, FILE * fo2, int scafInBuf ) { int i, 
thrdID; for ( i = 0; i < scafInBuf; i++ ) { thrdID = thrdNoBuf[i]; outputScafSeq ( fo, scafCounter + i + 1, ctgStackBuffer[i], darrayBuf[thrdID] ); outputGapSeq ( fo2, scafCounter + i + 1, ctgStackBuffer[i], darrayBuf[thrdID] ); } } static void MaskContig ( unsigned int ctg ) { contig_array[ctg].mask = 1; contig_array[getTwinCtg ( ctg )].mask = 1; } static void MarkCtgOccu ( unsigned int ctg ) { contig_array[ctg].flag = 1; contig_array[getTwinCtg ( ctg )].flag = 1; } static void output_ctg ( unsigned int ctg, FILE * fo ) { if ( contig_array[ctg].length < 1 ) { return; } int len; unsigned int bal_ctg = getTwinCtg ( ctg ); len = contig_array[ctg].length + overlaplen; int col = 0; if ( contig_array[ctg].seq ) { fprintf ( fo, ">C%d %4.1f\n", ctg, ( double ) contig_array[ctg].cvg ); outputTightStr ( fo, contig_array[ctg].seq, 0, len, len, 0, &col ); } else if ( contig_array[bal_ctg].seq ) { fprintf ( fo, ">C%d %4.1f\n", bal_ctg, ( double ) contig_array[ctg].cvg ); outputTightStr ( fo, contig_array[bal_ctg].seq, 0, len, len, 0, &col ); } contig_array[ctg].flag = 1; contig_array[bal_ctg].flag = 1; fprintf ( fo, "\n" ); } void prlReadsCloseGap ( char * graphfile ) { //thrd_num=1; if ( fillGap ) { boolean flag; printf ( "\nStart to load reads for gap filling. 
%d length discrepancy is allowed\n", GLDiff ); printf ( "...\n" ); flag = loadReads4gap ( graphfile ); if ( !flag ) { return; } } if ( orig2new ) { convertIndex(); orig2new = 0; } FILE * fp, *fo, *fo2; char line[1024]; CTGinSCAF * actg; STACK * ctgStack, *aStack; int index = 0, offset = 0, counter, overallLen; int i, starter, prev_start, gapLen, catchable; unsigned int ctg, prev_ctg = 0; boolean IsPrevGap; pthread_t threads[thrd_num]; unsigned char thrdSignal[thrd_num + 1]; PARAMETER paras[thrd_num]; for ( ctg = 1; ctg <= num_ctg; ctg++ ) { contig_array[ctg].flag = 0; } MAXKMER = maxKmer(); ctgStack = ( STACK * ) createStack ( 1000, sizeof ( CTGinSCAF ) ); sprintf ( line, "%s.scaf_gap", graphfile ); fp = ckopen ( line, "r" ); sprintf ( line, "%s.scafSeq", graphfile ); fo = ckopen ( line, "w" ); sprintf ( line, "%s.gapSeq", graphfile ); fo2 = ckopen ( line, "w" ); pthread_mutex_init ( &mutex, NULL ); flagBuf = ( boolean * ) ckalloc ( scafBufSize * sizeof ( boolean ) );; thrdNoBuf = ( unsigned char * ) ckalloc ( scafBufSize * sizeof ( unsigned char ) );; memset ( thrdNoBuf, 0, scafBufSize * sizeof ( char ) ); ctgStackBuffer = ( STACK ** ) ckalloc ( scafBufSize * sizeof ( STACK * ) ); initStackBuf ( ctgStackBuffer, scafBufSize ); darrayBuf = ( DARRAY ** ) ckalloc ( thrd_num * sizeof ( DARRAY * ) ); counters = ( int * ) ckalloc ( thrd_num * sizeof ( int ) ); for ( i = 0; i < thrd_num; i++ ) { counters[i] = 0; darrayBuf[i] = ( DARRAY * ) createDarray ( 100000, sizeof ( char ) ); thrdSignal[i + 1] = 0; paras[i].threadID = i; paras[i].mainSignal = &thrdSignal[0]; paras[i].selfSignal = &thrdSignal[i + 1]; } if ( fillGap ) { creatThrds ( threads, paras ); } Ncounter = scafCounter = scafInBuf = allGaps = 0; while ( fgets ( line, sizeof ( line ), fp ) != NULL ) { if ( line[0] == '>' ) { if ( index ) { aStack = ctgStackBuffer[scafInBuf]; flagBuf[scafInBuf++] = 0; reverseStack ( aStack, ctgStack ); if ( scafInBuf == scafBufSize ) { if ( fillGap ) { sendWorkSignal ( 1, 
thrdSignal ); } outputSeqs ( fo, fo2, scafInBuf ); scafCounter += scafInBuf; scafInBuf = 0; } if ( index % 1000 == 0 ) { printf ( "Processed %d scaffolds\n", index ); } } //read next scaff emptyStack ( ctgStack ); IsPrevGap = offset = prev_ctg = 0; sscanf ( line + 9, "%d %d %d", &index, &counter, &overallLen ); continue; } if ( line[0] == 'G' ) // gap appears { if ( fillGap ) { gapLen = procGap ( line, ctgStack ); IsPrevGap = 1; } continue; } if ( line[0] >= '0' && line[0] <= '9' ) // a contig line { sscanf ( line, "%d %d", &ctg, &starter ); actg = ( CTGinSCAF * ) stackPush ( ctgStack ); actg->ctgID = ctg; if ( contig_array[ctg].flag ) { MaskContig ( ctg ); } else { MarkCtgOccu ( ctg ); } initiateCtgInScaf ( actg ); if ( !prev_ctg ) { actg->cutHead = 0; } else if ( !IsPrevGap ) { allGaps++; } if ( !IsPrevGap ) { if ( prev_ctg && ( starter - prev_start - ( int ) contig_array[prev_ctg].length ) < ( ( int ) overlaplen * 4 ) ) { /* if(fillGap) catchable = contigCatch(prev_ctg,ctg); else */ catchable = 0; if ( catchable ) // prev_ctg and ctg overlap **bp { allGaps--; /* if(isLargerThanTwin(prev_ctg)) fprintf(stderr,"%d ####### by_overlap\n",getTwinCtg(prev_ctg)); else fprintf(stderr,"%d ####### by_overlap\n",prev_ctg); */ actg->scaftig_start = 0; actg->cutHead = catchable; offset += - ( starter - prev_start - contig_array[prev_ctg].length ) + ( overlaplen - catchable ); } else { actg->scaftig_start = 1; } } else { actg->scaftig_start = 1; } } else { offset += - ( starter - prev_start - contig_array[prev_ctg].length ) + gapLen; actg->scaftig_start = 0; } actg->start = starter + offset; actg->end = actg->start + contig_array[ctg].length - 1; actg->mask = contig_array[ctg].mask; IsPrevGap = 0; prev_ctg = ctg; prev_start = starter; } } if ( index ) { aStack = ctgStackBuffer[scafInBuf]; flagBuf[scafInBuf++] = 0; reverseStack ( aStack, ctgStack ); if ( fillGap ) { sendWorkSignal ( 1, thrdSignal ); } outputSeqs ( fo, fo2, scafInBuf ); } if ( fillGap ) { sendWorkSignal ( 2, 
                         thrdSignal );
        thread_wait ( threads );
    }

    for ( ctg = 1; ctg <= num_ctg; ctg++ ) {
        if ( ( contig_array[ctg].length + overlaplen ) < 100 || contig_array[ctg].flag ) {
            continue;
        }

        output_ctg ( ctg, fo );
    }

    printf ( "Done with %d scaffolds, %d gaps finished, %d gaps overall\n", index, allGaps - Ncounter, allGaps );
    index = 0;

    for ( i = 0; i < thrd_num; i++ ) {
        freeDarray ( darrayBuf[i] );
        index += counters[i];
    }

    if ( fillGap ) {
        printf ( "Threads processed %d scaffolds\n", index );
    }

    free ( ( void * ) darrayBuf );

    if ( readSeqInGap ) {
        freeDarray ( readSeqInGap );
    }

    fclose ( fp );
    fclose ( fo );
    fclose ( fo2 );
    freeStack ( ctgStack );
    freeStackBuf ( ctgStackBuffer, scafBufSize );
    free ( ( void * ) flagBuf );
    free ( ( void * ) thrdNoBuf );
    free ( ( void * ) ctgStackBuffer );
}

/*
 * 127mer/read2scaf.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static int Ncounter;
static int allGaps;

// for multi threads
static int scafBufSize = 100;
static STACK ** ctgStackBuffer;
static int scafCounter;
static int scafInBuf;

static void convertIndex()
{
    int * length_array = ( int * ) ckalloc ( ( num_ctg + 1 ) * sizeof ( int ) );
    unsigned int i;

    for ( i = 1; i <= num_ctg; i++ ) {
        length_array[i] = 0;
    }

    for ( i = 1; i <= num_ctg; i++ ) {
        if ( index_array[i] > 0 ) {
            length_array[index_array[i]] = i;
        }
    }

    for ( i = 1; i <= num_ctg; i++ ) {
        index_array[i] = length_array[i];
    } //contig i with new index: index_array[i]

    free ( ( void * ) length_array );
}

static void reverseStack ( STACK * dStack, STACK * sStack )
{
    CTGinSCAF * actg, *ctgPt;
    emptyStack ( dStack );

    while ( ( actg = ( CTGinSCAF * ) stackPop ( sStack ) ) != NULL ) {
        ctgPt = ( CTGinSCAF * ) stackPush ( dStack );
        ctgPt->ctgID = actg->ctgID;
        ctgPt->start = actg->start;
        ctgPt->end = actg->end;
    }

    stackBackup ( dStack );
}

static void initStackBuf ( STACK ** ctgStackBuffer, int scafBufSize )
{
    int i;

    for ( i = 0; i < scafBufSize; i++ ) {
        ctgStackBuffer[i] = ( STACK * ) createStack ( 100, sizeof ( CTGinSCAF ) );
    }
}

static void freeStackBuf ( STACK ** ctgStackBuffer, int scafBufSize )
{
    int i;

    for ( i = 0; i < scafBufSize; i++ ) {
        freeStack ( ctgStackBuffer[i] );
    }
}

static void mapCtg2Scaf ( int scafInBuf )
{
    int i, scafID;
    CTGinSCAF * actg;
    STACK * ctgsStack;
    unsigned int ctg, bal_ctg;

    for ( i = 0; i < scafInBuf; i++ ) {
        scafID = scafCounter + i + 1;
        ctgsStack = ctgStackBuffer[i];

        while ( ( actg = stackPop ( ctgsStack ) ) != NULL ) {
            ctg = actg->ctgID;
            bal_ctg = getTwinCtg ( ctg );

            if ( contig_array[ctg].from_vt != 0 ) {
                contig_array[ctg].multi = 1;
                contig_array[bal_ctg].multi = 1;
                continue;
            }

            contig_array[ctg].from_vt = scafID;
            contig_array[ctg].to_vt = actg->start;
            contig_array[ctg].flag = 0; //ctg and scaf on the same strand
            contig_array[bal_ctg].from_vt = scafID;
            contig_array[bal_ctg].to_vt = actg->start;
            contig_array[bal_ctg].flag = 1;
        }
    }
}

static void locateContigOnscaff ( char * graphfile )
{
    FILE * fp;
    char line[1024];
    CTGinSCAF * actg;
    STACK * ctgStack, *aStack;
    int index = 0, counter, overallLen;
    int starter, prev_start, gapN, scafLen;
    unsigned int ctg, prev_ctg = 0;

    for ( ctg = 1; ctg <= num_ctg; ctg++ ) {
        contig_array[ctg].from_vt = 0;
        contig_array[ctg].multi = 0;
    }

    ctgStack = ( STACK * ) createStack ( 1000, sizeof ( CTGinSCAF ) );
    sprintf ( line, "%s.scaf_gap", graphfile );
    fp = ckopen ( line, "r" );
    ctgStackBuffer = ( STACK ** ) ckalloc ( scafBufSize * sizeof ( STACK * ) );
    initStackBuf ( ctgStackBuffer, scafBufSize );
    Ncounter = scafCounter = scafInBuf = allGaps = 0;

    while ( fgets ( line, sizeof ( line ), fp ) != NULL ) {
        if ( line[0] == '>' ) {
            if ( index ) {
                aStack = ctgStackBuffer[scafInBuf++];
                reverseStack ( aStack, ctgStack );

                if ( scafInBuf == scafBufSize ) {
                    mapCtg2Scaf ( scafInBuf );
                    scafCounter += scafInBuf;
                    scafInBuf = 0;
                }

                if ( index % 1000 == 0 ) {
                    printf ( "Processed %d scaffolds\n", index );
                }
            }

            //read next scaff
            scafLen = prev_ctg = 0;
            emptyStack ( ctgStack );
            sscanf ( line + 9, "%d %d %d", &index, &counter, &overallLen );
            //fprintf(stderr,">%d\n",index);
            continue;
        }

        if ( line[0] == 'G' ) { // gap appears
            continue;
        }

        if ( line[0] >= '0' && line[0] <= '9' ) { // a contig line
            // ctg is unsigned int, so it is read with %u
            sscanf ( line, "%u %d", &ctg, &starter );
            actg = ( CTGinSCAF * ) stackPush ( ctgStack );
            actg->ctgID = ctg;

            if ( !prev_ctg ) {
                actg->start = scafLen;
                actg->end = actg->start + overlaplen + contig_array[ctg].length - 1;
            } else {
                gapN = starter - prev_start - ( int ) contig_array[prev_ctg].length;
                gapN = gapN < 1 ?
                       1 : gapN;
                actg->start = scafLen + gapN;
                actg->end = actg->start + contig_array[ctg].length - 1;
            }

            //fprintf(stderr,"%d\t%d\n",actg->start,actg->end);
            scafLen = actg->end + 1;
            prev_ctg = ctg;
            prev_start = starter;
        }
    }

    if ( index ) {
        aStack = ctgStackBuffer[scafInBuf++];
        reverseStack ( aStack, ctgStack );
        mapCtg2Scaf ( scafInBuf );
    }

    gapN = 0;

    for ( ctg = 1; ctg <= num_ctg; ctg++ ) {
        if ( contig_array[ctg].from_vt == 0 || contig_array[ctg].multi == 1 ) {
            continue;
        }

        gapN++;
    }

    printf ( "\nDone with %d scaffolds, %d contigs in scaffolds\n", index, gapN );
    /*
    if(readSeqInGap)
        freeDarray(readSeqInGap);
    */
    fclose ( fp );
    freeStack ( ctgStack );
    freeStackBuf ( ctgStackBuffer, scafBufSize );
    free ( ( void * ) ctgStackBuffer );
}

static boolean contigElligible ( unsigned int contigno )
{
    unsigned int ctg = index_array[contigno];

    if ( contig_array[ctg].from_vt == 0 || contig_array[ctg].multi == 1 ) {
        return 0;
    } else {
        return 1;
    }
}

static void output1read ( FILE * fo, long long readno, unsigned int contigno, int pos )
{
    unsigned int ctg = index_array[contigno];
    int posOnScaf;
    char orien;
    pos = pos < 0 ?
          0 : pos;

    if ( contig_array[ctg].flag == 0 ) {
        posOnScaf = contig_array[ctg].to_vt + pos - overlaplen;
        orien = '+';
    } else {
        posOnScaf = contig_array[ctg].to_vt + contig_array[ctg].length - pos;
        orien = '-';
    }

    /*
    if(readno==676)
        printf("Read %lld in region from %d, extend %d, pos %d, orien %c\n",
            readno,contig_array[ctg].to_vt,contig_array[ctg].length,posOnScaf,orien);
    */
    fprintf ( fo, "%lld\t%d\t%d\t%c\n", readno, contig_array[ctg].from_vt, posOnScaf, orien );
}

void locateReadOnScaf ( char * graphfile )
{
    char name[1024], line[1024];
    FILE * fp, *fo;
    long long readno, counter = 0, pre_readno = 0;
    unsigned int contigno, pre_contigno;
    int pre_pos, pos;
    locateContigOnscaff ( graphfile );
    sprintf ( name, "%s.readOnContig", graphfile );
    fp = ckopen ( name, "r" );
    sprintf ( name, "%s.readOnScaf", graphfile );
    fo = ckopen ( name, "w" );

    if ( !orig2new ) {
        convertIndex();
        orig2new = 1;
    }

    fgets ( line, 1024, fp );

    while ( fgets ( line, 1024, fp ) != NULL ) {
        // contigno is unsigned int, so it is read with %u
        sscanf ( line, "%lld %u %d", &readno, &contigno, &pos );

        if ( ( readno % 2 == 0 ) && ( pre_readno == readno - 1 ) // they are a pair of reads
                && contigElligible ( pre_contigno ) && contigElligible ( contigno ) ) {
            output1read ( fo, pre_readno, pre_contigno, pre_pos );
            output1read ( fo, readno, contigno, pos );
            counter++;
        }

        pre_readno = readno;
        pre_contigno = contigno;
        pre_pos = pos;
    }

    printf ( "%lld pairs on contig\n", counter );
    fclose ( fp );
    fclose ( fo );
}

/*
 * 127mer/readInterval.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

#define RVBLOCKSIZE 1000

void destroyReadIntervMem()
{
    freeMem_manager ( rv_mem_manager );
    rv_mem_manager = NULL;
}

READINTERVAL * allocateRV ( int readid, int edgeid )
{
    READINTERVAL * newRV;
    newRV = ( READINTERVAL * ) getItem ( rv_mem_manager );
    newRV->readid = readid;
    newRV->edgeid = edgeid;
    newRV->nextInRead = NULL;
    newRV->prevInRead = NULL;
    newRV->nextOnEdge = NULL;
    newRV->prevOnEdge = NULL;
    return newRV;
}

void dismissRV ( READINTERVAL * rv )
{
    returnItem ( rv_mem_manager, rv );
}

void createRVmemo()
{
    if ( !rv_mem_manager ) {
        rv_mem_manager = createMem_manager ( RVBLOCKSIZE, sizeof ( READINTERVAL ) );
    } else {
        printf ( "Warning from createRVmemo: rv_mem_manager is an active pointer\n" );
    }
}

/*
 * 127mer/readseq1by1.c
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "stdinc.h"
#include "newhash.h"
#include "extfunc.h"
#include "extvab.h"

static char src_rc_seq[1024];

void readseq1by1 ( char * src_seq, char * src_name, int * len_seq, FILE * fp, long long num_seq )
{
    int i, k, n, strL;
    char c;
    char str[5000];
    n = 0;
    k = num_seq;

    while ( fgets ( str, 4950, fp ) ) {
        if ( str[0] == '#' ) {
            continue;
        }

        if ( str[0] == '>' ) {
            /*
            if(k >= 0) {  // if this isn't the first '>' in the file
                *len_seq = n;
            }
            */
            *len_seq = n;
            n = 0;
            sscanf ( &str[1], "%s", src_name );
            return;
        } else {
            strL = strlen ( str );

            if ( strL + n > maxReadLen ) {
                strL = maxReadLen - n;
            }

            for ( i = 0; i < strL; i ++ ) {
                if ( str[i] >= 'a' && str[i] <= 'z' ) {
                    c = base2int ( str[i] - 'a' + 'A' );
                    src_seq[n ++] = c;
                } else if ( str[i] >= 'A' && str[i] <= 'Z' ) {
                    c = base2int ( str[i] );
                    src_seq[n ++] = c; // after pre-processing, all the symbols will be a,g,c,t,n in lower or upper case.
                } else if ( str[i] == '.' ) {
                    c = base2int ( 'A' );
                    src_seq[n ++] = c;
                } // after pre-processing, all the symbols will be a,g,c,t,n in lower or upper case.
            }

            //printf("%d: %d\n",k,n);
        }
    }

    if ( k >= 0 ) {
        *len_seq = n;
        return;
    }

    *len_seq = 0;
}

void read_one_sequence ( FILE * fp, long long * T, char ** X )
{
    char * fasta, *src_name; //point to fasta array
    // readseqpar compares against these bounds, so they must start initialized
    // (same starting values as multiFileParse)
    int num_seq, len = 1, name_len = 1, min_len = 1000;
    num_seq = readseqpar ( &len, &min_len, &name_len, fp );

    if ( num_seq < 1 ) {
        printf ( "no fasta sequence in file\n" );
        *T = 0;
        return;
    }

    fasta = ( char * ) ckalloc ( len * sizeof ( char ) );
    src_name = ( char * ) ckalloc ( ( name_len + 1 ) * sizeof ( char ) );
    rewind ( fp );
    readseq1by1 ( fasta, src_name, &len, fp, -1 );
    readseq1by1 ( fasta, src_name, &len, fp, 0 );
    *X = fasta;
    *T = len;
    free ( ( void * ) src_name );
}

long long multiFileParse ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp )
{
    char str[5000];
    FILE * freads;
    int slen;
    long long counter = 0;
    *max_name_leg = *max_leg = 1;
    *min_leg = 1000;

    while ( fgets ( str, 4950, fp ) ) {
        slen = strlen ( str );

        // strip the trailing newline only if one is present
        if ( slen > 0 && str[slen - 1] == '\n' ) {
            str[--slen] = '\0';
        }

        // skip empty lines instead of trying to open them
        if ( slen == 0 ) {
            continue;
        }

        freads = ckopen ( str, "r" );
        counter += readseqpar ( max_leg, min_leg, max_name_leg, freads );
        fclose ( freads );
    }

    return counter;
}

long long readseqpar ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp )
{
    int l, n;
    long long k;
    char str[5000], src_name[5000];
    n = 0;
    k = -1;

    while ( fgets ( str, 4950, fp ) ) {
        if ( str[0] == '>' ) {
            if ( k >= 0 ) {
                if ( n > *max_leg ) {
                    *max_leg = n;
                }

                if ( n < *min_leg ) {
                    *min_leg = n;
                }
            }

            n = 0;
            k ++;
            sscanf ( &str[1], "%s", src_name );

            if ( ( l = strlen ( src_name ) ) > *max_name_leg ) {
                *max_name_leg = l;
            }
        } else {
            n += strlen ( str ) - 1;
        }
    }

    if ( n > *max_leg ) {
        *max_leg = n;
    }

    if ( n < *min_leg ) {
        *min_leg = n;
    }

    k ++;
    return ( k );
}

void read1seqfq ( char * src_seq, char * src_name, int * len_seq, FILE * fp )
{
    int i, n, strL;
    char c;
    char str[5000];
    boolean flag = 0;

    while ( fgets ( str, 4950, fp ) ) {
        if ( str[0] == '@' ) {
            flag = 1;
            sscanf ( &str[1], "%s", src_name );
            break;
        }
    }

    if ( !flag ) { // reached on the last call, when the fq file has no more records
        *len_seq = 0;
        return;
    }

    n = 0;

    while (
            fgets ( str, 4950, fp ) ) {
        if ( str[0] == '+' ) {
            fgets ( str, 4950, fp ); // pass quality value line
            *len_seq = n;
            return;
        } else {
            strL = strlen ( str );

            if ( strL + n > maxReadLen ) {
                strL = maxReadLen - n;
            }

            for ( i = 0; i < strL; i ++ ) {
                if ( str[i] >= 'a' && str[i] <= 'z' ) {
                    c = base2int ( str[i] - 'a' + 'A' );
                    src_seq[n ++] = c;
                } else if ( str[i] >= 'A' && str[i] <= 'Z' ) {
                    c = base2int ( str[i] );
                    src_seq[n ++] = c; // after pre-processing, all the symbols will be a,g,c,t,n in lower or upper case.
                } else if ( str[i] == '.' ) {
                    c = base2int ( 'A' );
                    src_seq[n ++] = c;
                } // after pre-processing, all the symbols will be a,g,c,t,n in lower or upper case.
            }

            //printf("%d: %d\n",k,n);
        }
    }

    *len_seq = n;
    return;
}

// find the next file to open in libs
static int nextValidIndex ( int libNo, boolean pair, unsigned char asm_ctg )
{
    int i = libNo;

    while ( i < num_libs ) {
        if ( asm_ctg == 1 && ( lib_array[i].asm_flag != 1 && lib_array[i].asm_flag != 3 ) ) {
            i++;
            continue;
        } else if ( asm_ctg == 0 && ( lib_array[i].asm_flag != 2 && lib_array[i].asm_flag != 3 ) ) {
            i++;
            continue;
        } else if ( asm_ctg > 1 && lib_array[i].asm_flag != asm_ctg ) { // reads for other purpose
            i++;
            continue;
        }

        if ( lib_array[i].curr_type == 1 && lib_array[i].curr_index < lib_array[i].num_a1_file ) {
            return i;
        }

        if ( lib_array[i].curr_type == 2 && lib_array[i].curr_index < lib_array[i].num_q1_file ) {
            return i;
        }

        if ( lib_array[i].curr_type == 3 && lib_array[i].curr_index < lib_array[i].num_p_file ) {
            return i;
        }

        if ( pair ) {
            if ( lib_array[i].curr_type < 3 ) {
                lib_array[i].curr_type++;
                lib_array[i].curr_index = 0;
            } else {
                i++;
            }

            continue;
        }

        if ( lib_array[i].curr_type == 4 && lib_array[i].curr_index < lib_array[i].num_s_a_file ) {
            return i;
        }

        if ( lib_array[i].curr_type == 5 && lib_array[i].curr_index < lib_array[i].num_s_q_file ) {
            return i;
        }

        if ( lib_array[i].curr_type < 5 ) {
            lib_array[i].curr_type++;
            lib_array[i].curr_index = 0;
        } else {
            i++;
        }
    } //for each lib

    return i;
}

static FILE *
openFile4read ( char * fname ) { FILE * fp; if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { char * cmd = ( char * ) ckalloc ( ( strlen ( fname ) + 20 ) * sizeof ( char ) ); sprintf ( cmd, "gzip -dc %s", fname ); fp = popen ( cmd, "r" ); free ( cmd ); return fp; } else { return ckopen ( fname, "r" ); } } void openFileInLib ( int libNo ) { int i = libNo; if ( lib_array[i].curr_type == 1 ) { printf ( "read from file:\n %s\n", lib_array[i].a1_fname[lib_array[i].curr_index] ); printf ( "read from file:\n %s\n", lib_array[i].a2_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].a1_fname[lib_array[i].curr_index] ); lib_array[i].fp2 = openFile4read ( lib_array[i].a2_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 1; } else if ( lib_array[i].curr_type == 2 ) { printf ( "read from file:\n %s\n", lib_array[i].q1_fname[lib_array[i].curr_index] ); printf ( "read from file:\n %s\n", lib_array[i].q2_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].q1_fname[lib_array[i].curr_index] ); lib_array[i].fp2 = openFile4read ( lib_array[i].q2_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 1; } else if ( lib_array[i].curr_type == 3 ) { printf ( "read from file:\n %s\n", lib_array[i].p_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].p_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } else if ( lib_array[i].curr_type == 4 ) { printf ( "read from file:\n %s\n", lib_array[i].s_a_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( lib_array[i].s_a_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } else if ( lib_array[i].curr_type == 5 ) { printf ( "read from file:\n %s\n", lib_array[i].s_q_fname[lib_array[i].curr_index] ); lib_array[i].fp1 = openFile4read ( 
lib_array[i].s_q_fname[lib_array[i].curr_index] ); lib_array[i].curr_index++; lib_array[i].paired = 0; } } static void reverse2k ( char * src_seq, int len_seq ) { if ( !len_seq ) { return; } int i; reverseComplementSeq ( src_seq, len_seq, src_rc_seq ); for ( i = 0; i < len_seq; i++ ) { src_seq[i] = src_rc_seq[i]; } } static void closeFp1InLab ( int libNo ) { int ftype = lib_array[libNo].curr_type; int index = lib_array[libNo].curr_index - 1; char * fname; if ( ftype == 1 ) { fname = lib_array[libNo].a1_fname[index]; } else if ( ftype == 2 ) { fname = lib_array[libNo].q1_fname[index]; } else if ( ftype == 3 ) { fname = lib_array[libNo].p_fname[index]; } else if ( ftype == 4 ) { fname = lib_array[libNo].s_a_fname[index]; } else if ( ftype == 5 ) { fname = lib_array[libNo].s_q_fname[index]; } else { return; } if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { pclose ( lib_array[libNo].fp1 ); } else { fclose ( lib_array[libNo].fp1 ); } } static void closeFp2InLab ( int libNo ) { int ftype = lib_array[libNo].curr_type; int index = lib_array[libNo].curr_index - 1; char * fname; if ( ftype == 1 ) { fname = lib_array[libNo].a2_fname[index]; } else if ( ftype == 2 ) { fname = lib_array[libNo].q2_fname[index]; } else { return; } if ( strlen ( fname ) > 3 && strcmp ( fname + strlen ( fname ) - 3, ".gz" ) == 0 ) { pclose ( lib_array[libNo].fp2 ); } else { fclose ( lib_array[libNo].fp2 ); } } boolean read1seqInLib ( char * src_seq, char * src_name, int * len_seq, int * libNo, boolean pair, unsigned char asm_ctg ) { int i = *libNo; int prevLib = i; if ( !lib_array[i].fp1 // file1 does not exist || ( lib_array[i].curr_type != 1 && feof ( lib_array[i].fp1 ) ) // file1 reaches end and not type1 || ( lib_array[i].curr_type == 1 && feof ( lib_array[i].fp1 ) && feof ( lib_array[i].fp2 ) ) ) //f1&f2 reaches end { if ( lib_array[i].fp1 && feof ( lib_array[i].fp1 ) ) { closeFp1InLab ( i ); } if ( lib_array[i].fp2 && feof ( lib_array[i].fp2 ) ) { 
closeFp2InLab ( i ); } *libNo = nextValidIndex ( i, pair, asm_ctg ); i = *libNo; if ( lib_array[i].rd_len_cutoff > 0 ) maxReadLen = lib_array[i].rd_len_cutoff < maxReadLen4all ? lib_array[i].rd_len_cutoff : maxReadLen4all; else { maxReadLen = maxReadLen4all; } //record insert size info //printf("from lib %d to %d, read %lld to %ld\n",prevLib,i,readNumBack,n_solexa); if ( pair && i != prevLib ) { if ( readNumBack < n_solexa ) { pes[gradsCounter].PE_bound = n_solexa; pes[gradsCounter].rank = lib_array[prevLib].rank; pes[gradsCounter].pair_num_cut = lib_array[prevLib].pair_num_cut; pes[gradsCounter++].insertS = lib_array[prevLib].avg_ins; readNumBack = n_solexa; } } if ( i >= num_libs ) { return 0; } openFileInLib ( i ); if ( lib_array[i].curr_type == 1 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, -1 ); readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp2, -1 ); } else if ( lib_array[i].curr_type == 3 || lib_array[i].curr_type == 4 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, -1 ); } } if ( lib_array[i].curr_type == 1 ) { if ( lib_array[i].paired == 1 ) { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, 1 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 2; if ( *len_seq > 0 || !feof ( lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } else { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp2, 1 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 1; n_solexa++; return 1; //can't fail to read a read2 } } if ( lib_array[i].curr_type == 2 ) { if ( lib_array[i].paired == 1 ) { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp1 ); /* if(*len_seq>0){ for(j=0;j<*len_seq;j++) printf("%c",int2base(src_seq[j])); printf("\n"); } */ if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 2; if ( *len_seq > 0 || !feof ( 
lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } else { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp2 ); if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } lib_array[i].paired = 1; n_solexa++; return 1; //can't fail to read a read2 } } if ( lib_array[i].curr_type == 5 ) { read1seqfq ( src_seq, src_name, len_seq, lib_array[i].fp1 ); } else { readseq1by1 ( src_seq, src_name, len_seq, lib_array[i].fp1, 1 ); } /* int t; for(t=0;t<*len_seq;t++) printf("%d",src_seq[t]); printf("\n"); */ if ( lib_array[i].reverse ) { reverse2k ( src_seq, *len_seq ); } if ( *len_seq > 0 || !feof ( lib_array[i].fp1 ) ) { n_solexa++; return 1; } else { return read1seqInLib ( src_seq, src_name, len_seq, libNo, pair, asm_ctg ); } } SOAPdenovo-V1.05/src/127mer/scaffold.c000644 000765 000024 00000007401 11530651532 017330 0ustar00Aquastaff000000 000000 /* * 127mer/scaffold.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static void initenv ( int argc, char ** argv ); static void display_scaff_usage(); static boolean LINK, SCAFF; static char graphfile[256]; int call_scaffold ( int argc, char ** argv ) { time_t start_t, stop_t, time_bef, time_aft; time ( &start_t ); initenv ( argc, argv ); loadPEgrads ( graphfile ); time ( &time_bef ); loadUpdatedEdges ( graphfile ); time ( &time_aft ); printf ( "time spent on loading edges %ds\n", ( int ) ( time_aft - time_bef ) ); if ( !SCAFF ) { time ( &time_bef ); PE2Links ( graphfile ); time ( &time_aft ); printf ( "time spent on loading pair end info %ds\n", ( int ) ( time_aft - time_bef ) ); time ( &time_bef ); Links2Scaf ( graphfile ); time ( &time_aft ); printf ( "time spent on creating scaffolds %ds\n", ( int ) ( time_aft - time_bef ) ); scaffolding ( 100, graphfile ); } prlReadsCloseGap ( graphfile ); // locateReadOnScaf(graphfile); free_pe_mem(); if ( index_array ) { free ( ( void * ) index_array ); } freeContig_array(); destroyPreArcMem(); destroyConnectMem(); deleteCntLookupTable(); time ( &stop_t ); printf ( "time elapsed: %dm\n", ( int ) ( stop_t - start_t ) / 60 ); return 0; } /***************************************************************************** * Parse command line switches *****************************************************************************/ void initenv ( int argc, char ** argv ) { int copt; int inpseq; extern char * optarg; char temp[256]; inpseq = 0; LINK = 0; SCAFF = 0; optind = 1; while ( ( copt = getopt ( argc, argv, "g:L:p:G:FuS" ) ) != EOF ) { switch ( copt ) { case 'g': inGraph = 1; sscanf ( optarg, "%s", graphfile ); // continue; case 'G': sscanf ( optarg, "%s", temp ); // GLDiff = atoi ( temp ); continue; case 'L': sscanf ( optarg, "%s", temp ); ctg_short = atoi ( temp ); continue; case 'F': fillGap = 1; continue; case 'S': SCAFF = 1; continue; case 'u': maskRep = 0; continue; case 'p': sscanf ( optarg, "%s", temp ); 
// thrd_num = atoi ( temp ); continue; default: if ( inGraph == 0 ) // { display_scaff_usage(); exit ( -1 ); } } } if ( inGraph == 0 ) // { display_scaff_usage(); exit ( -1 ); } } static void display_scaff_usage() { printf ( "\nscaff -g InputGraph [-F -u -S] [-G gapLenDiff -L minContigLen] [-p n_cpu]\n" ); printf ( " -g InputFile: prefix of graph file names\n" ); printf ( " -F (optional) fill gaps in scaffold\n" ); printf ( " -S (optional) scaffold structure exists(default: NO)\n" ); printf ( " -G gapLenDiff(default 50): allowed length difference between estimated and filled gap\n" ); printf ( " -u (optional): un-mask contigs with high coverage before scaffolding (default mask)\n" ); printf ( " -p n_cpu(default 8): number of cpu for use\n" ); printf ( " -L minLen(default K+2): shortest contig for scaffolding\n" ); } SOAPdenovo-V1.05/src/127mer/searchPath.c000644 000765 000024 00000013454 11530651532 017636 0ustar00Aquastaff000000 000000 /* * 127mer/searchPath.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static int trace_limit = 5000; //the times function is called in a search /* search connection paths which were masked along related contigs start from one contig, end with another path length includes the length of the last contig */ void traceAlongMaskedCnt ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route ) { num_trace++; if ( num_trace > trace_limit || *num_route >= max_n_routes ) { return; } unsigned int * array; int num, i, length; CONNECT * ite_cnt; if ( index > 0 ) // there're at most max_steps edges stored in this array including the destination edge { length = len + contig_array[currE].length; } else { length = 0; } if ( index > max_steps || length > max ) { return; } // this is the only situation we stop if ( index > 0 ) // there're at most max_steps edges stored in this array including the destination edge { so_far[index - 1] = currE; } if ( currE == destE && index == 0 ) { printf ( "traceAlongMaskedCnt: start and destination are the same\n" ); return; } if ( currE == destE && length >= min && length <= max ) { num = *num_route; array = found_routes[num]; for ( i = 0; i < index; i++ ) { array[i] = so_far[i]; } if ( index < max_steps ) { array[index] = 0; } //indicate the end of the route *num_route = ++num; } // one route is extrated, but we don't terminate searching ite_cnt = contig_array[currE].downwardConnect; while ( ite_cnt ) { if ( !ite_cnt->mask || ite_cnt->deleted ) { ite_cnt = ite_cnt->next; continue; } traceAlongMaskedCnt ( destE, ite_cnt->contigID, max_steps, min, max, index + 1, length + ite_cnt->gapLen, num_route ); ite_cnt = ite_cnt->next; } } // search connection paths from one connect to a contig // path length includes the length of the last contig void traceAlongConnect ( unsigned int destE, CONNECT * currCNT, int max_steps, int min, int max, int index, int len, int * num_route ) { 
num_trace++; if ( num_trace > trace_limit || *num_route >= max_n_routes ) { return; } unsigned int * array, currE; int num, i, length; CONNECT * ite_cnt; currE = currCNT->contigID; length = len + currCNT->gapLen; length += contig_array[currE].length; if ( index > max_steps || length > max ) { return; } // this is the only situation we stop /* if(globalFlag) printf("B: step %d, ctg %d, length %d\n",index,currCNT->contigID,length); */ if ( currE == destE && index == 1 ) { printf ( "traceAlongConnect: start and destination are the same\n" ); return; } so_far[index - 1] = currE; // there're at most max_steps edges stored in this array including the destination edge if ( currE == destE && length >= min && length <= max ) { num = *num_route; array = found_routes[num]; for ( i = 0; i < index; i++ ) { array[i] = so_far[i]; } if ( index < max_steps ) { array[index] = 0; } //indicate the end of the route *num_route = ++num; } // one route is extrated, but we don't terminate searching if ( currCNT->nextInScaf ) { traceAlongConnect ( destE, currCNT->nextInScaf, max_steps, min, max, index + 1, length, num_route ); return; } ite_cnt = contig_array[currE].downwardConnect; while ( ite_cnt ) { if ( ite_cnt->mask || ite_cnt->deleted ) { ite_cnt = ite_cnt->next; continue; } traceAlongConnect ( destE, ite_cnt, max_steps, min, max, index + 1, length, num_route ); ite_cnt = ite_cnt->next; } } //find paths in the graph from currE to destE, its length does not include length of both end contigs void traceAlongArc ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route ) { num_trace++; if ( num_trace > trace_limit || *num_route >= max_n_routes ) { return; } unsigned int * array, out_ed, vt; int num, i, pos, length; preARC * parc; pos = index; if ( pos > max_steps || len > max ) { return; } // this is the only situation we stop if ( currE == destE && pos == 0 ) { printf ( "traceAlongArc: start and destination are the same\n" ); return; 
} if ( pos > 0 ) // pos starts with 0 for the starting edge { so_far[pos - 1] = currE; } // there're at most max_steps edges stored in this array including the destination edge if ( currE == destE && len >= min ) { num = *num_route; array = found_routes[num]; for ( i = 0; i < pos; i++ ) { array[i] = so_far[i]; } if ( pos < max_steps ) { array[pos] = 0; } //indicate the end of the route *num_route = ++num; } // one route is extrated, but we don't terminate searching if ( pos == max_steps || len == max ) { return; } if ( pos++ > 0 ) //not the starting edge { length = len + contig_array[currE].length; } else { length = len; } vt = contig_array[currE].to_vt; parc = contig_array[currE].arcs; while ( parc ) { out_ed = parc->to_ed; traceAlongArc ( destE, out_ed, max_steps, min, max, pos, length, num_route ); parc = parc->next; } } SOAPdenovo-V1.05/src/127mer/seq.c000644 000765 000024 00000006160 11530651532 016340 0ustar00Aquastaff000000 000000 /* * 127mer/seq.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" /* void print_kmer(FILE *fp,Kmer kmer,char c) { fprintf(fp,"%llx %llx %llx %llx",kmer.high1,kmer.low1,kmer.high2,kmer.low2); fprintf(fp,"%c",c); }*/ void printTightString ( char * tightSeq, int len ) { int i; for ( i = 0; i < len; i++ ) { printf ( "%c", int2base ( ( int ) getCharInTightString ( tightSeq, i ) ) ); if ( ( i + 1 ) % 100 == 0 ) { printf ( "\n" ); } } printf ( "\n" ); } void writeChar2tightString ( char nt, char * tightSeq, int pos ) { char * byte = tightSeq + pos / 4; switch ( pos % 4 ) { case 0: *byte &= 63; *byte += nt << 6; return; case 1: *byte &= 207; *byte += nt << 4; return; case 2: *byte &= 243; *byte += nt << 2; return; case 3: *byte &= 252; *byte += nt; return; } } char getCharInTightString ( char * tightSeq, int pos ) { char * byte = tightSeq + pos / 4; switch ( pos % 4 ) { case 3: return ( *byte & 3 ); case 2: return ( *byte & 12 ) >> 2; case 1: return ( *byte & 48 ) >> 4; case 0: return ( *byte & 192 ) >> 6; } return 0; } // complement of sequence denoted 0, 1, 2, 3 void reverseComplementSeq ( char * seq, int len, char * bal_seq ) { int i, index = 0; if ( len < 1 ) { return; } for ( i = len - 1; i >= 0; i-- ) { bal_seq[index++] = int_comp ( seq[i] ); } return; } // complement of sequence denoted 0, 1, 2, 3 char * compl_int_seq ( char * seq, int len ) { char * bal_seq = NULL, c, bal_c; int i, index; if ( len < 1 ) { return bal_seq; } bal_seq = ( char * ) ckalloc ( len * sizeof ( char ) ); index = 0; for ( i = len - 1; i >= 0; i-- ) { c = seq[i]; if ( c < 4 ) { bal_c = int_comp ( c ); } //3-c; else { bal_c = c; } bal_seq[index++] = bal_c; } return bal_seq; } long long trans_seq ( char * seq, int len ) { int i; long long res; res = 0; for ( i = 0; i < len; i ++ ) { res = res * 4 + seq[i]; } return ( res ); } /* char *kmer2seq(Kmer word) { int i; char *seq; Kmer charMask = 3; seq = (char *)ckalloc(overlaplen*sizeof(char)); for(i=overlaplen-1;i>=0;i--){ 
seq[i] = charMask&word; word >>= 2; } return seq; } */ SOAPdenovo-V1.05/src/127mer/splitReps.c000644 000765 000024 00000023066 11530651532 017541 0ustar00Aquastaff000000 000000 /* * 127mer/splitReps.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include "stdinc.h" #include "newhash.h" #include "extfunc.h" #include "extvab.h" static unsigned int involved[9]; static unsigned int lefts[4]; static unsigned int rights[4]; static unsigned char gothrough[4][4]; static boolean interferingCheck ( unsigned int edgeno, int repTimes ) { int i, j, t; unsigned int bal_ed; involved[0] = edgeno; i = 1; for ( j = 0; j < repTimes; j++ ) { involved[i++] = lefts[j]; } for ( j = 0; j < repTimes; j++ ) { involved[i++] = rights[j]; } for ( j = 0; j < i - 1; j++ ) for ( t = j + 1; t < i; t++ ) if ( involved[j] == involved[t] ) { return 1; } for ( j = 0; j < i; j++ ) { bal_ed = getTwinEdge ( involved[j] ); for ( t = 0; t < i; t++ ) if ( bal_ed == involved[t] ) { return 1; } } return 0; } static ARC * arcCounts ( unsigned int edgeid, unsigned int * num ) { ARC * arc; ARC * firstValidArc = NULL; unsigned int count = 0; arc = edge_array[edgeid].arcs; while ( arc ) { if ( arc->to_ed > 0 ) { count++; } if ( count == 1 ) { firstValidArc = arc; } arc = arc->next; } *num = count; return firstValidArc; } static boolean readOnEdge ( long long readid, unsigned int edge ) { 
    int index;
    int markNum;
    long long * marklist;

    if ( edge_array[edge].markers )
    {
        markNum = edge_array[edge].multi;
        marklist = edge_array[edge].markers;
    }
    else
    {
        return 0;
    }

    for ( index = 0; index < markNum; index++ )
    {
        if ( readid == marklist[index] )
        {
            return 1;
        }
    }

    return 0;
}

/* return the id of the first read marked on all three edges, or 0 if there is none */
static long long cntByReads ( unsigned int left, unsigned int middle , unsigned int right )
{
    int markNum;
    long long * marklist;

    if ( edge_array[left].markers )
    {
        markNum = edge_array[left].multi;
        marklist = edge_array[left].markers;
    }
    else
    {
        return 0;
    }

    int index;
    long long readid;

    /*
    if(middle==8553)
        printf("%d markers on %d\n",markNum,left);
    */
    for ( index = 0; index < markNum; index++ )
    {
        readid = marklist[index];

        if ( readOnEdge ( readid, middle ) && readOnEdge ( readid, right ) )
        {
            return readid;
        }
    }

    return 0;
}

/* - - > - < - - */
unsigned int solvable ( unsigned int edgeno )
{
    if ( EdSameAsTwin ( edgeno ) || edge_array[edgeno].multi == 255 )
    {
        return 0;
    }

    unsigned int bal_ed = getTwinEdge ( edgeno );
    unsigned int arcRight_n, arcLeft_n;
    unsigned int counter;
    unsigned int i, j;
    unsigned int branch, bal_branch;
    ARC * parcL, *parcR;
    parcL = arcCounts ( bal_ed, &arcLeft_n );

    if ( arcLeft_n < 2 )
    {
        return 0;
    }

    parcR = arcCounts ( edgeno, &arcRight_n );

    if ( arcLeft_n != arcRight_n )
    {
        return 0;
    }

    // check that each right branch has only one upstream connection
    arcRight_n = 0;

    while ( parcR )
    {
        if ( parcR->to_ed == 0 )
        {
            parcR = parcR->next;
            continue;
        }

        branch = parcR->to_ed;

        if ( EdSameAsTwin ( branch ) || edge_array[branch].multi == 255 )
        {
            return 0;
        }

        rights[arcRight_n++] = branch;
        bal_branch = getTwinEdge ( branch );
        arcCounts ( bal_branch, &counter );

        if ( counter != 1 )
        {
            return 0;
        }

        parcR = parcR->next;
    }

    // check that each left branch has only one downstream connection
    arcLeft_n = 0;

    while ( parcL )
    {
        if ( parcL->to_ed == 0 )
        {
            parcL = parcL->next;
            continue;
        }

        branch = parcL->to_ed;

        if ( EdSameAsTwin ( branch ) || edge_array[branch].multi == 255 )
        {
            return 0;
        }

        bal_branch = getTwinEdge ( branch );
lefts[arcLeft_n++] = bal_branch; arcCounts ( bal_branch, &counter ); if ( counter != 1 ) { return 0; } parcL = parcL->next; } //check if reads indicate one to one connection between upsteam and downstream edges for ( i = 0; i < arcLeft_n; i++ ) { counter = 0; for ( j = 0; j < arcRight_n; j++ ) { gothrough[i][j] = cntByReads ( lefts[i], edgeno, rights[j] ) == 0 ? 0 : 1; counter += gothrough[i][j]; if ( counter > 1 ) { return 0; } } if ( counter != 1 ) { return 0; } } for ( j = 0; j < arcRight_n; j++ ) { counter = 0; for ( i = 0; i < arcLeft_n; i++ ) { counter += gothrough[i][j]; } if ( counter != 1 ) { return 0; } } return arcLeft_n; } static unsigned int cp1edge ( unsigned int source, unsigned int target ) { int length = edge_array[source].length; char * tightSeq; int index; unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target; if ( bal_source > source ) { bal_target = target + 1; } else { bal_target = target; target = target + 1; } tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( index = 0; index < length / 4 + 1; index++ ) { tightSeq[index] = edge_array[source].seq[index]; } edge_array[target].length = length; edge_array[target].cvg = edge_array[source].cvg; edge_array[target].to_vt = edge_array[source].to_vt; edge_array[target].from_vt = edge_array[source].from_vt; edge_array[target].seq = tightSeq; edge_array[target].bal_edge = edge_array[source].bal_edge; edge_array[target].rv = NULL; edge_array[target].arcs = NULL; edge_array[target].markers = NULL; edge_array[target].flag = 0; edge_array[target].deleted = 0; tightSeq = ( char * ) ckalloc ( ( length / 4 + 1 ) * sizeof ( char ) ); for ( index = 0; index < length / 4 + 1; index++ ) { tightSeq[index] = edge_array[bal_source].seq[index]; } edge_array[bal_target].length = length; edge_array[bal_target].cvg = edge_array[bal_source].cvg; edge_array[bal_target].to_vt = edge_array[bal_source].to_vt; edge_array[bal_target].from_vt = edge_array[bal_source].from_vt; 
edge_array[bal_target].seq = tightSeq; edge_array[bal_target].bal_edge = edge_array[bal_source].bal_edge; edge_array[bal_target].rv = NULL; edge_array[bal_target].arcs = NULL; edge_array[bal_target].markers = NULL; edge_array[bal_target].flag = 0; edge_array[bal_target].deleted = 0; return target; } static void moveArc2cp ( unsigned int leftEd, unsigned int rightEd, unsigned int source, unsigned int target ) { unsigned int bal_left = getTwinEdge ( leftEd ); unsigned int bal_right = getTwinEdge ( rightEd ); unsigned int bal_source = getTwinEdge ( source ); unsigned int bal_target = getTwinEdge ( target ); ARC * arc; ARC * newArc, *twinArc; //between left and source arc = getArcBetween ( leftEd, source ); arc->to_ed = 0; newArc = allocateArc ( target ); newArc->multiplicity = arc->multiplicity; newArc->prev = NULL; newArc->next = edge_array[leftEd].arcs; if ( edge_array[leftEd].arcs ) { edge_array[leftEd].arcs->prev = newArc; } edge_array[leftEd].arcs = newArc; arc = getArcBetween ( bal_source, bal_left ); arc->to_ed = 0; twinArc = allocateArc ( bal_left ); twinArc->multiplicity = arc->multiplicity; twinArc->prev = NULL; twinArc->next = NULL; edge_array[bal_target].arcs = twinArc; newArc->bal_arc = twinArc; twinArc->bal_arc = newArc; //between source and right arc = getArcBetween ( source, rightEd ); arc->to_ed = 0; newArc = allocateArc ( rightEd ); newArc->multiplicity = arc->multiplicity; newArc->prev = NULL; newArc->next = NULL; edge_array[target].arcs = newArc; arc = getArcBetween ( bal_right, bal_source ); arc->to_ed = 0; twinArc = allocateArc ( bal_target ); twinArc->multiplicity = arc->multiplicity; twinArc->prev = NULL; twinArc->next = edge_array[bal_right].arcs; if ( edge_array[bal_right].arcs ) { edge_array[bal_right].arcs->prev = twinArc; } edge_array[bal_right].arcs = twinArc; newArc->bal_arc = twinArc; twinArc->bal_arc = newArc; } static void split1edge ( unsigned int edgeno, int repTimes ) { int i, j; unsigned int target; for ( i = 1; i < repTimes; i++ 
) { for ( j = 0; j < repTimes; j++ ) if ( gothrough[i][j] > 0 ) { break; } target = cp1edge ( edgeno, extraEdgeNum ); moveArc2cp ( lefts[i], rights[j], edgeno, target ); extraEdgeNum += 2; } } static void debugging ( unsigned int i ) { ARC * parc; parc = edge_array[i].arcs; if ( !parc ) { printf ( "no downward connection for %d\n", i ); } while ( parc ) { printf ( "%d -> %d\n", i, parc->to_ed ); parc = parc->next; } } void solveReps() { unsigned int i; unsigned int repTime; int counter = 0; boolean flag; //debugging(30514); extraEdgeNum = num_ed + 1; for ( i = 1; i <= num_ed; i++ ) { repTime = solvable ( i ); if ( repTime == 0 ) { continue; } flag = interferingCheck ( i, repTime ); if ( flag ) { continue; } split1edge ( i, repTime ); counter ++; //+= 2*(repTime-1); if ( EdSmallerThanTwin ( i ) ) { i++; } } printf ( "%d repeats solvable, %d more edges\n", counter, extraEdgeNum - 1 - num_ed ); num_ed = extraEdgeNum - 1; removeDeadArcs(); if ( markersArray ) { free ( ( void * ) markersArray ); markersArray = NULL; } } SOAPdenovo-V1.05/src/127mer/stack.c000644 000765 000024 00000007161 11530651532 016657 0ustar00Aquastaff000000 000000 /* * 127mer/stack.c * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #include "stack.h" STACK * createStack ( int num_items, size_t unit_size ) { STACK * newStack = ( STACK * ) malloc ( 1 * sizeof ( STACK ) ); newStack->block_list = NULL; newStack->items_per_block = num_items; newStack->item_size = unit_size; newStack->item_c = 0; return newStack; } void emptyStack ( STACK * astack ) { BLOCK_STARTER * block; if ( !astack || !astack->block_list ) { return; } block = astack->block_list; if ( block->next ) { block = block->next; } astack->block_list = block; astack->item_c = 0; astack->index_in_block = 0; } void freeStack ( STACK * astack ) { BLOCK_STARTER * ite_block, *temp_block; if ( !astack ) { return; } ite_block = astack->block_list; if ( ite_block ) { while ( ite_block->next ) { ite_block = ite_block->next; } } while ( ite_block ) { temp_block = ite_block; ite_block = ite_block->prev; free ( ( void * ) temp_block ); } free ( ( void * ) astack ); } void stackBackup ( STACK * astack ) { astack->block_backup = astack->block_list; astack->index_backup = astack->index_in_block; astack->item_c_backup = astack->item_c; } void stackRecover ( STACK * astack ) { astack->block_list = astack->block_backup; astack->index_in_block = astack->index_backup; astack->item_c = astack->item_c_backup; } void * stackPop ( STACK * astack ) { BLOCK_STARTER * block; if ( !astack || !astack->block_list || !astack->item_c ) { return NULL; } astack->item_c--; block = astack->block_list; if ( astack->index_in_block == 1 ) { if ( block->next ) { astack->block_list = block->next; astack->index_in_block = astack->items_per_block; } else { astack->index_in_block = 0; astack->item_c = 0; } return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) ); } return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) + astack->item_size * ( --astack->index_in_block ) ); } void * stackPush ( STACK * astack ) { BLOCK_STARTER * block; if ( !astack ) { return NULL; } astack->item_c++; if ( !astack->block_list || ( astack->index_in_block == 
astack->items_per_block && !astack->block_list->prev ) )
    {
        /* no block yet, or the current block is full and no spare block exists: allocate one */
        block = malloc ( sizeof ( BLOCK_STARTER ) + astack->items_per_block * astack->item_size );
        block->prev = NULL;

        if ( astack->block_list )
        {
            astack->block_list->prev = block;
        }

        block->next = astack->block_list;
        astack->block_list = block;
        astack->index_in_block = 1;
        return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) );
    }
    else if ( astack->index_in_block == astack->items_per_block && astack->block_list->prev )
    {
        /* the current block is full, but a previously allocated block can be reused */
        astack->block_list = astack->block_list->prev;
        astack->index_in_block = 1;
        return ( void * ) ( ( void * ) astack->block_list + sizeof ( BLOCK_STARTER ) );
    }

    block = astack->block_list;
    return ( void * ) ( ( void * ) block + sizeof ( BLOCK_STARTER ) + astack->item_size * astack->index_in_block++ );
}
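
The block-chained layout used by stack.c can be illustrated with a much smaller sketch. The version below is not the SOAPdenovo implementation: the names BlockStack, bs_init, bs_push, bs_pop and bs_free are hypothetical, and two behaviours are assumptions made for simplicity. Pop copies the item into a caller-supplied buffer instead of returning a pointer, and emptied blocks are freed immediately, whereas stack.c keeps them around for reuse. It also assumes items_per_block and item_size are non-zero.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Forces the item storage to start at a suitably aligned offset. */
typedef union { long long ll; long double ld; void *p; } Align;

/* One malloc'ed block: a small header followed by the item slots. */
typedef struct Block {
    struct Block *below;   /* block holding older items, or NULL */
    Align data[];          /* items_per_block * item_size bytes of storage */
} Block;

/* Hypothetical simplified stack; not the STACK type from stack.h. */
typedef struct {
    Block *top;              /* block currently receiving pushes */
    size_t items_per_block;
    size_t item_size;
    size_t used_in_top;      /* occupied slots in `top` */
    size_t count;            /* total number of items */
} BlockStack;

static void bs_init(BlockStack *s, size_t items_per_block, size_t item_size)
{
    assert(items_per_block > 0 && item_size > 0);
    s->top = NULL;
    s->items_per_block = items_per_block;
    s->item_size = item_size;
    s->used_in_top = 0;
    s->count = 0;
}

/* Reserve one slot and return its address, or NULL if malloc fails. */
static void *bs_push(BlockStack *s)
{
    if (s->top == NULL || s->used_in_top == s->items_per_block) {
        Block *b = malloc(sizeof(Block) + s->items_per_block * s->item_size);
        if (b == NULL) {
            return NULL;
        }
        b->below = s->top;
        s->top = b;
        s->used_in_top = 0;
    }
    s->count++;
    return (unsigned char *)s->top->data + s->item_size * s->used_in_top++;
}

/* Copy the newest item into `out` and remove it; returns 0 if the stack is empty. */
static int bs_pop(BlockStack *s, void *out)
{
    if (s->count == 0) {
        return 0;
    }
    s->used_in_top--;
    memcpy(out, (unsigned char *)s->top->data + s->item_size * s->used_in_top,
           s->item_size);
    s->count--;
    if (s->used_in_top == 0) {
        /* The top block is empty: release it; any block below it is full. */
        Block *empty = s->top;
        s->top = empty->below;
        free(empty);
        s->used_in_top = (s->top != NULL) ? s->items_per_block : 0;
    }
    return 1;
}

/* Release every remaining block. */
static void bs_free(BlockStack *s)
{
    while (s->top != NULL) {
        Block *b = s->top;
        s->top = b->below;
        free(b);
    }
    s->used_in_top = 0;
    s->count = 0;
}
```

A caller would initialise the stack with bs_init, write each new item through the pointer returned by bs_push, and read items back in reverse order with bs_pop. Allocating items in blocks rather than one at a time keeps the number of malloc calls low, which is presumably the motivation for the layout in stack.c as well.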
Q @ * ) (    vcQA/%$ trcWC,"!ARQ%#endifextern char firstCharInKmer(Kmer kmer);extern int count_branch2next(kmer_t *node);extern int count_branch2prev(kmer_t *node);extern void dislink2prevUncertain(kmer_t *node,char ch,boolean smaller);extern void dislink2nextUncertain(kmer_t *node,char ch,boolean smaller);extern void free_Sets(KmerSet **KmerSets,int num);extern byte8 count_kmerset(KmerSet *set);extern int put_kmerset(KmerSet *set, ubyte8 seq, ubyte left, ubyte right,kmer_t **kmer_p);extern int search_kmerset(KmerSet *set, ubyte8 seq, kmer_t **rs);extern KmerSet* init_kmerset(ubyte8 init_size, float load_factor);}KMER_PT; struct kmer_pt *next; boolean isSmaller; Kmer kmer; kmer_t *node;{typedef struct kmer_pt} KmerSet; ubyte8 iter_ptr; double load_factor; ubyte8 max; ubyte8 count; ubyte8 size; ubyte4 *flags; kmer_t *array;{typedef struct kmerSet_st} kmer_t; ubyte4 inEdge:2; ubyte4 twin:2; ubyte4 single:1; ubyte4 checked:1; ubyte4 deleted:1; ubyte4 linear:1; ubyte4 r_links:4*EDGE_BIT_SIZE; ubyte4 l_links; // sever as edgeID since make_edge Kmer seq;{typedef struct kmer_st#define exists_kmer_entity(flags, idx) (!((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x03))#define clear_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] &= ~(0x02u<<(((idx)&0x0f)<<1)))#define clear_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] &= ~(0x01u<<(((idx)&0x0f)<<1)))#define set_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] |= (0x02u<<(((idx)&0x0f)<<1)))#define set_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] |= (0x01u<<(((idx)&0x0f)<<1)))#define is_kmer_entity_del(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x02)#define is_kmer_entity_null(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x01)#define get_kmer_right_covs(mer) (get_kmer_right_cov(mer, 0) + get_kmer_right_cov(mer, 1) + get_kmer_right_cov(mer, 2) + get_kmer_right_cov(mer, 3))#define set_kmer_right_cov(mer, idx, val) ((mer).r_links = ((mer).r_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | 
SOAPdenovo-V1.05/src/127mer/inc/check.h000644 000765 000024 00000001715 11530651532 017404 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/check.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

extern void * ckalloc ( unsigned long long amount );
extern void * ckrealloc ( void * p, size_t new_size, size_t old_size );
extern FILE * ckopen ( char * name, char * mode );
SOAPdenovo-V1.05/src/127mer/inc/darray.h000644 000765 000024 00000002362 11530651532 017610 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/darray.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef __DARRAY__
#define __DARRAY__

#include
#include
#include

typedef struct dynamic_array
{
    void * array;
    long long array_size;
    size_t item_size;
    long long item_c;
} DARRAY;

void * darrayPut ( DARRAY * darray, long long index );
void * darrayGet ( DARRAY * darray, long long index );
DARRAY * createDarray ( int num_items, size_t unit_size );
void freeDarray ( DARRAY * darray );
void emptyDarray ( DARRAY * darray );

#endif
SOAPdenovo-V1.05/src/127mer/inc/def.h000644 000765 000024 00000014546 11530651532 017073 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/def.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/* this file provides some datatype definitions */

#ifndef _DEF
#define _DEF

#include "def2.h"
#include "types.h"
#include "stack.h"
#include "darray.h"

#define EDGE_BIT_SIZE 6
#define word_len 12
#define taskMask 0xf //the low 4 bits
#define MaxEdgeCov 16000

#define base2int(base) (char)(((base)&0x06)>>1)
#define int2base(seq) "ACTG"[seq]
#define int2compbase(seq) "TGAC"[seq]
#define int_comp(seq) (char)(seq^0x02) //(char)((0x4E>>((seq)<<1))&0x03)

int b_ban;

typedef struct kmer
{
    unsigned long long high1, low1, high2, low2;
} Kmer;

typedef struct preedge
{
    Kmer from_node;
    Kmer to_node;
    char * seq;
    int length;
    unsigned short cvg: 14;
    unsigned bal_edge: 2; //indicates whether its bal_edge is the previous edge, the next edge or itself
} preEDGE;

typedef struct readinterval
{
    int readid;
    unsigned int edgeid;
    int start;
    struct readinterval * bal_rv;
    struct readinterval * nextOnEdge;
    struct readinterval * prevOnEdge;
    struct readinterval * nextInRead;
    struct readinterval * prevInRead;
} READINTERVAL;

struct arc;

typedef struct edge
{
    unsigned int from_vt;
    unsigned int to_vt;
    int length;
    unsigned short cvg: 14;
    unsigned short bal_edge: 2;
    unsigned short multi: 14;
    unsigned short deleted : 1;
    unsigned short flag : 1;
    char * seq;
    READINTERVAL * rv;
    struct arc * arcs;
    long long * markers;
} EDGE;

typedef struct edge_pt
{
    EDGE * edge;
    struct edge_pt * next;
} EDGE_PT;

typedef struct vertex
{
    Kmer kmer;
} VERTEX;

/*
typedef struct connection
{
    unsigned int contigID;
    int gapLen;
    short maxGap;
    unsigned char minGap;
    unsigned char bySmall:1;
    unsigned char weakPoint:1;
    unsigned char weightNotInherit;
    unsigned char weight;
    unsigned char
    maxSingleWeight;
    unsigned char mask : 1;
    unsigned char used : 1;
    unsigned char weak : 1;
    unsigned char deleted : 1;
    unsigned char prevInScaf : 1;
    unsigned char inherit : 1;
    unsigned char checking : 1;
    unsigned char singleInScaf : 1;
    struct connection *nextInScaf;
    struct connection *next;
    struct connection *nextInLookupTable;
}CONNECT;
*/

typedef struct connection
{
    unsigned int contigID;
    int gapLen;
    unsigned short maxGap;
    unsigned char minGap;
    unsigned char bySmall: 1;
    unsigned char weakPoint: 1;
    unsigned char weightNotInherit;
    unsigned char weight;
    unsigned char maxSingleWeight;
    unsigned char mask : 1;
    unsigned char used : 1;
    unsigned char weak : 1;
    unsigned char deleted : 1;
    unsigned char prevInScaf : 1;
    unsigned char inherit : 1;
    unsigned char checking : 1;
    unsigned char singleInScaf : 1;
    struct connection * nextInScaf;
    struct connection * next;
    struct connection * nextInLookupTable;
} CONNECT;

typedef struct prearc
{
    unsigned int to_ed;
    unsigned int multiplicity;
    struct prearc * next;
} preARC;

typedef struct contig
{
    unsigned int from_vt;
    unsigned int to_vt;
    unsigned int length;
    unsigned short indexInScaf;
    unsigned char cvg;
    unsigned char bal_edge: 2; // 0, 1 or 2
    unsigned char mask : 1;
    unsigned char flag : 1;
    unsigned char multi: 1;
    unsigned char inSubGraph: 1;
    char * seq;
    CONNECT * downwardConnect;
    preARC * arcs;
    STACK * closeReads;
} CONTIG;

typedef struct read_nearby
{
    int len;
    int dis; // dis to nearby contig or scaffold's start position
    long long seqStarter; //sequence start position in dynamic array
} READNEARBY;

typedef struct annotation
{
    unsigned long long readID;
    unsigned int contigID;
    int pos;
} ANNOTATION;

typedef struct parameter
{
    unsigned char threadID;
    void ** hash_table;
    unsigned char * mainSignal;
    unsigned char * selfSignal;
} PARAMETER;

typedef struct lightannot
{
    int contigID;
    int pos;
} LIGHTANNOT;

typedef struct edgepatch
{
    Kmer from_kmer, to_kmer;
    unsigned int length;
    char bal_edge;
} EDGEPATCH;

typedef struct lightctg
{
    unsigned int
    index;
    int length;
    char * seq;
} LIGHTCTG;

typedef struct arc
{
    unsigned int to_ed;
    unsigned int multiplicity;
    struct arc * prev;
    struct arc * next;
    struct arc * bal_arc;
    struct arc * nextInLookupTable;
} ARC;

typedef struct arcexist
{
    Kmer kmer;
    struct arcexist * left;
    struct arcexist * right;
} ARCEXIST;

typedef struct lib_info
{
    int min_ins;
    int max_ins;
    int avg_ins;
    int rd_len_cutoff;
    int reverse;
    int asm_flag;
    int map_len;
    int pair_num_cut;
    int rank;
    //indicate which file is next to be read
    int curr_type;
    int curr_index;

    //file handlers to opened files
    FILE * fp1;
    FILE * fp2;
    boolean f1_start;
    boolean f2_start;
    //whether last read is read1 in pair
    int paired; // 0 -- single; 1 -- read1; 2 -- read2;

    //type1
    char ** a1_fname;
    char ** a2_fname;
    int num_a1_file;
    int num_a2_file;

    //type2
    char ** q1_fname;
    char ** q2_fname;
    int num_q1_file;
    int num_q2_file;

    //type3
    char ** p_fname;
    int num_p_file; //fasta only

    //type4 &5
    char ** s_a_fname;
    int num_s_a_file;
    char ** s_q_fname;
    int num_s_q_file;
} LIB_INFO;

typedef struct ctg4heap
{
    unsigned int ctgID;
    int dis;
    unsigned char ds_shut4dheap: 1; // ignore downstream connections
    unsigned char us_shut4dheap: 1; // ignore upstream connections
    unsigned char ds_shut4uheap: 1; // ignore downstream connections
    unsigned char us_shut4uheap: 1; // ignore upstream connections
} CTGinHEAP;

typedef struct ctg4scaf
{
    unsigned int ctgID;
    int start;
    int end; //position in scaff
    unsigned int cutHead : 8; //
    unsigned int cutTail : 7; //
    unsigned int scaftig_start : 1; //is it a scaftig starter
    unsigned int mask : 1; // is it masked for further operations
    unsigned int gapSeqLen: 15;
    int gapSeqOffset;
} CTGinSCAF;

typedef struct pe_info
{
    int insertS;
    long long PE_bound;
    int rank;
    int pair_num_cut;
} PE_INFO;

#endif
SOAPdenovo-V1.05/src/127mer/inc/def2.h000644 000765 000024 00000003247 11530651532 017151 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/def2.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef _DEF2
#define _DEF2

typedef char boolean;
typedef long long IDnum;
typedef double Time;
typedef long long Coordinate;

// Fibonacci heaps used mainly in Tour Bus
typedef struct fibheap FibHeap;
typedef struct fibheap_el FibHeapNode;
typedef struct dfibheap DFibHeap;
typedef struct dfibheap_el DFibHeapNode;

//Memory manager
typedef struct block_start
{
    struct block_start * next;
} BLOCK_START;

typedef struct recycle_mark
{
    struct recycle_mark * next;
} RECYCLE_MARK;

typedef struct mem_manager
{
    BLOCK_START * block_list;
    int index_in_block;
    int items_per_block;
    size_t item_size;
    RECYCLE_MARK * recycle_list;
    unsigned long long counter;
} MEM_MANAGER;

struct dfibheap_el
{
    int dfhe_degree;
    boolean dfhe_mark;
    DFibHeapNode * dfhe_p;
    DFibHeapNode * dfhe_child;
    DFibHeapNode * dfhe_left;
    DFibHeapNode * dfhe_right;
    Time dfhe_key;
    unsigned int dfhe_data;//void *dfhe_data;
};

#endif
SOAPdenovo-V1.05/src/127mer/inc/dfib.h000644 000765 000024 00000005566 11530651532 017243 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/dfib.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/*-
 * Copyright 1997, 1998-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $Id: dfib.h,v 1.8 2007/04/24 12:16:41 zerbino Exp $
 *
 */

#ifndef _DFIB_H_
#define _DFIB_H_

#include
#include "def2.h"
//#include "globals.h"

/* functions for key heaps */
DFibHeap * dfh_makekeyheap ( void );
DFibHeapNode * dfh_insertkey ( DFibHeap *, Time, unsigned int );
Time dfh_replacekey ( DFibHeap *, DFibHeapNode *, Time );
unsigned int dfh_replacekeydata ( DFibHeap *, DFibHeapNode *, Time, unsigned int );

unsigned int dfh_extractmin ( DFibHeap * );
unsigned int dfh_replacedata ( DFibHeapNode *, unsigned int );
unsigned int dfh_delete ( DFibHeap *, DFibHeapNode * );
void dfh_deleteheap ( DFibHeap * );

// DEBUG
IDnum dfibheap_getSize ( DFibHeap * );
Time dfibheap_el_getKey ( DFibHeapNode * );
// END DEBUG

#endif /* _DFIB_H_ */
SOAPdenovo-V1.05/src/127mer/inc/dfibHeap.h000644 000765 000024 00000002565 11530651532 020035 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/dfibHeap.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#ifndef _DFIBHEAP_H_
#define _DFIBHEAP_H_

DFibHeap * newDFibHeap();
DFibHeapNode * insertNodeIntoDHeap ( DFibHeap * heap, Time key, unsigned int node );
Time replaceKeyInDHeap ( DFibHeap * heap, DFibHeapNode * node, Time newKey );
unsigned int removeNextNodeFromDHeap ( DFibHeap * heap );
void destroyDHeap ( DFibHeap * heap );
boolean HasMin ( DFibHeap * h );
void replaceValueInDHeap ( DFibHeapNode * node, unsigned int newValue );
void * destroyNodeInDHeap ( DFibHeapNode * node, DFibHeap * heap );
IDnum getDFibHeapSize ( DFibHeap * heap );
Time getKey ( DFibHeapNode * node );

#endif
SOAPdenovo-V1.05/src/127mer/inc/dfibpriv.h000644 000765 000024 00000007072 11530651532 020136 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/dfibpriv.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

/*
 * Copyright 1997, 1999-2003 John-Mark Gurney.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $Id: dfibpriv.h,v 1.8 2007/10/09 09:56:46 zerbino Exp $
 *
 */

#ifndef _DFIBPRIV_H_
#define _DFIBPRIV_H_

//#include "globals.h"
#include "def2.h"

/*
 * specific node operations
 */
static DFibHeapNode * dfhe_newelem ( DFibHeap * );
static void dfhe_insertafter ( DFibHeapNode * a, DFibHeapNode * b );
static inline void dfhe_insertbefore ( DFibHeapNode * a, DFibHeapNode * b );
static DFibHeapNode * dfhe_remove ( DFibHeapNode * a );

/*
 * global heap operations
 */
struct dfibheap
{
    MEM_MANAGER * nodeMemory;
    IDnum dfh_n;
    IDnum dfh_Dl;
    DFibHeapNode ** dfh_cons;
    DFibHeapNode * dfh_min;
    DFibHeapNode * dfh_root;
};

static void dfh_insertrootlist ( DFibHeap *, DFibHeapNode * );
static void dfh_removerootlist ( DFibHeap *, DFibHeapNode * );
static void dfh_consolidate ( DFibHeap * );
static void dfh_heaplink ( DFibHeap * h, DFibHeapNode * y, DFibHeapNode * x );
static void dfh_cut ( DFibHeap *, DFibHeapNode *, DFibHeapNode * );
static void dfh_cascading_cut ( DFibHeap *, DFibHeapNode * );
static DFibHeapNode * dfh_extractminel ( DFibHeap * );
static void dfh_checkcons ( DFibHeap * h );
static int dfh_compare ( DFibHeap * h, DFibHeapNode * a, DFibHeapNode * b );
static int dfh_comparedata ( DFibHeap * h, Time key, unsigned int
        data, DFibHeapNode * b );
static void dfh_insertel ( DFibHeap * h, DFibHeapNode * x );

/*
 * general functions
 */
static inline IDnum ceillog2 ( IDnum a );

#endif /* _DFIBPRIV_H_ */
SOAPdenovo-V1.05/src/127mer/inc/extfunc.h000644 000765 000024 00000024135 11530651532 020004 0ustar00Aquastaff000000 000000
/*
 * 127mer/inc/extfunc.h
 *
 * Copyright (c) 2008-2010 BGI-Shenzhen .
 *
 * This file is part of SOAPdenovo.
 *
 * SOAPdenovo is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * SOAPdenovo is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with SOAPdenovo. If not, see .
 *
 */

#include "check.h"
#include "extfunc2.h"

extern void readseq1by1 ( char * src_seq, char * src_name, int * len_seq, FILE * fp, long long num_seq );
extern void readseqPbyP ( char * src_seq, char * src_name, int * insertS, int * len_seq, FILE * fp, long long num_seq );
extern long long readseqpar ( int * max_len, int * min_leg, int * max_name_len, FILE * fp );
extern void free_edge_list ( EDGE_PT * el );
extern void reverseComplementSeq ( char * seq, int len, char * bal_seq );
extern void free_edge_array ( EDGE * ed_array, int ed_num );
extern void free_lightctg_array ( LIGHTCTG * ed_array, int ed_num );
extern char getCharInTightString ( char * tightSeq, int pos );
extern void writeChar2tightSting ( char nt, char * tightSeq, int pos );
extern void short_reads_sum();
extern void read_one_sequence ( FILE * fp, long long * T, char ** X );
extern void output_edges ( preEDGE * ed_array, int ed_num, char * outfile );
extern void loadVertex ( char * graphfile );
extern void loadEdge ( char * graphfile );
extern boolean loadPath ( char * graphfile );
extern READINTERVAL * allocateRV ( int readid, int edgeid );
extern void createRVmemo();
extern void dismissRV ( READINTERVAL * rv );
extern void destroyReadIntervMem();
extern void destroyConnectMem();
extern void u2uConcatenate();
extern void output_contig ( EDGE * ed_array, unsigned int ed_num, char * outfile, int cut_len );
extern void printTightString ( char * tightSeq, int len );
extern int roughUniqueness ( unsigned int edgeno, char ignore_cvg, char * ignored );
extern void outputReadPos ( char * graphfile, int min_len );
extern void testSearch();
extern void allpathConcatenate();
extern void output_updated_edges ( char * outfile );
extern void output_updated_vertex ( char * outfile );
extern void loadUpdatedEdges ( char * graphfile );
extern void loadUpdatedVertex ( char * graphfile );
extern void connectByPE ( char * infile );
extern void output_cntGVZ ( char * outfile );
extern void output_graph ( char * outfile
        );
extern void testLinearC2C();
extern void output_contig_graph ( char * outfile );
extern void scaffolding ( unsigned int cut_len, char * outfile );
extern int cmp_int ( const void * a, const void * b );
extern CONNECT * allocateCN ( unsigned int contigId, int gap );
extern int recoverRep();
extern void loadPEgrads ( char * infile );
extern int putInsertS ( long long readid, int size, int * currGrads );
extern int getInsertS ( long long readid, int * readlen );
extern int connectByPE_grad ( FILE * fp, int peGrad, char * line );
extern void PEgradsScaf ( char * infile );
extern void reorderAnnotation ( char * infile, char * outfile );
extern void output_1edge ( preEDGE * edge, FILE * fp );
extern void prlRead2edge ( char * libfile, char * outfile );
extern void annotFileTrans ( char * infile, char * outfile );
extern void prlLoadPath ( char * graphfile );
extern void misCheck ( char * infile, char * outfile );
extern int uniqueLenSearch ( unsigned int * len_array, unsigned int * flag_array, int num, unsigned int target );
extern int cmp_vertex ( const void * a, const void * b );
extern void linkContig2Vts();
extern int connectByPE_gradPatch ( FILE * fp1, FILE * fp2, int peGrad, char * line1, char * line2 );
extern void scaftiging ( char * graphfile, int len_cut );
extern void gapFilling ( char * graphfile, int cut_len );
extern ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed );
extern void bubblePinch ( double simiCutoff, char * outfile, int M );
extern void linearConcatenate();
extern unsigned char setArcMulti ( unsigned int from_ed, unsigned int to_ed, unsigned char value );
extern ARC * allocateArc ( unsigned int edgeid );
extern void cutTipsInGraph ( int cutLen, boolean strict );
extern ARC * deleteArc ( ARC * arc_list, ARC * arc );
extern void compactEdgeArray();
extern void dismissArc ( ARC * arc );
extern void createArcMemo();
extern ARC * getArcBetween ( unsigned int from_ed, unsigned int to_ed );
extern ARC * allocateArc ( unsigned int
        edgeid );
extern void writeChar2tightString ( char nt, char * tightSeq, int pos );
extern void output_heavyArcs ( char * outfile );
extern preARC * allocatePreArc ( unsigned int edgeid );
extern void destroyPreArcMem();
extern void traceAlongArc ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route );
extern void freeContig_array();
extern void output_scafSeq ( char * graphfile, int len_cut );
extern void putArcInHash ( unsigned int from_ed, unsigned int to_ed );
extern boolean DoesArcExist ( unsigned int from_ed, unsigned int to_ed );
extern void recordArcInHash();
extern void destroyArcHash();
extern void removeWeakEdges ( int lenCutoff, unsigned int multiCutoff );
extern void createArcLookupTable();
extern void deleteArcLookupTable();
extern void putArc2LookupTable ( unsigned int from_ed, ARC * arc );
extern void removeArcInLookupTable ( unsigned int from_ed, unsigned int to_ed );
extern ARC * arcCount ( unsigned int edgeid, unsigned int * num );
extern void mapFileTrans ( char * infile );
extern void solveReps();
extern void removeDeadArcs();
extern void destroyArcMem();
extern void getCntsInFile ( char * infile );
extern void scafByCntInfo ( char * infile );
extern CONNECT * add1Connect ( unsigned int e1, unsigned int e2, int gap, int weight, boolean inherit );
extern void getScaff ( char * infile );
extern void traceAlongMaskedCnt ( unsigned int destE, unsigned int currE, int max_steps, int min, int max, int index, int len, int * num_route );
extern void createPreArcMemManager();
extern boolean loadPathBin ( char * graphfile );
extern void recordArcsInLookupTable();
extern FILE * multiFileRead1seq ( char * src_seq, char * src_name, int * len_seq, FILE * fp, FILE * freads );
extern void multiFileSeqpar ( FILE * fp );
extern long long multiFileParse ( int * max_leg, int * min_leg, int * max_name_leg, FILE * fp );
extern CONNECT * getCntBetween ( unsigned int from_ed, unsigned int to_ed );
extern void
        createCntMemManager();
extern void destroyConnectMem();
extern void createCntLookupTable();
extern void deleteCntLookupTable();
extern void putCnt2LookupTable ( unsigned int from_c, CONNECT * cnt );
extern void prlRead2Ctg ( char * seqfile, char * outfile );
extern boolean prlContig2nodes ( char * grapfile, int len_cut );
extern void scan_libInfo ( char * libfile );
extern void free_libs();
extern boolean read1seqInLib ( char * src_seq, char * src_name, int * len_seq, int * libNo, boolean pair, unsigned char asm_ctg );
extern void save4laterSolve();
extern void solveRepsAfter();
extern void free_pe_mem();
extern void alloc_pe_mem ( int gradsCounter );
extern void prlDestroyPreArcMem();
extern preARC * prlAllocatePreArc ( unsigned int edgeid, MEM_MANAGER * manager );
extern boolean prlRead2HashTable ( char * libfile, char * outfile );
extern void free_allSets();
extern void removeSingleTips();
extern void removeMinorTips();
extern void kmer2edges ( char * outfile );
extern void output_vertex ( char * outfile );
extern boolean prlRead2HashTable ( char * libfile, char * outfile );
extern void Links2Scaf ( char * infile );
extern void PE2Links ( char * infile );
extern unsigned int getTwinCtg ( unsigned int ctg );
extern void basicContigInfo ( char * infile );
extern boolean isSmallerThanTwin ( unsigned int ctg );
extern boolean isLargerThanTwin ( unsigned int ctg );
extern boolean isSameAsTwin ( unsigned int ctg );
extern boolean loadMarkerBin ( char * graphfile );
extern void readsCloseGap ( char * graphfile );
extern void prlReadsCloseGap ( char * graphfile );
extern void locateReadOnScaf ( char * graphfile );

/*********** Kmer related *************/
extern Kmer createFilter ( int overlaplen );
extern void printKmerSeq ( FILE * fp, Kmer kmer );
//extern U256b Kmer2int256(Kmer seq);
extern boolean KmerLarger ( Kmer kmer1, Kmer kmer2 );
extern boolean KmerSmaller ( Kmer kmer1, Kmer kmer2 );
extern boolean KmerEqual ( Kmer kmer1, Kmer kmer2 );
extern Kmer KmerAnd (
Kmer kmer1, Kmer kmer2 ); extern Kmer KmerLeftBitMoveBy2 ( Kmer word ); extern Kmer KmerRightBitMoveBy2 ( Kmer word ); extern Kmer KmerPlus ( Kmer prev, char ch ); extern Kmer nextKmer ( Kmer prev, char ch ); extern Kmer prevKmer ( Kmer next, char ch ); extern char firstCharInKmer ( Kmer kmer ); extern Kmer KmerRightBitMove ( Kmer word, int dis ); extern Kmer reverseComplement ( Kmer word, int overlap ); extern ubyte8 hash_kmer ( Kmer kmer ); extern int kmer2vt ( Kmer kmer ); extern void print_kmer ( FILE * fp, Kmer kmer, char c ); extern int bisearch ( VERTEX * vts, int num, Kmer target ); extern void printKmerSeq ( FILE * fp, Kmer kmer ); extern char lastCharInKmer ( Kmer kmer ); int localGraph ( READNEARBY * rdArray, int num, CTGinSCAF * ctg1, CTGinSCAF * ctg2, int origOverlap, Kmer * kmerCtg1, Kmer * kmerCtg2, int overlap, DARRAY * gapSeqArray, char * seqCtg1, char * seqCtg2, char * seqGap ); extern unsigned int getTwinEdge ( unsigned int edgeno ); extern boolean EdSmallerThanTwin ( unsigned int edgeno ); extern boolean EdLargerThanTwin ( unsigned int edgeno ); extern boolean EdSameAsTwin ( unsigned int edgeno ); extern void removeLowCovEdges ( int lenCutoff, unsigned short covCutoff ); extern int getMaxLongReadLen ( int num_libs ); extern void prlLongRead2Ctg ( char * libfile, char * outfile ); SOAPdenovo-V1.05/src/127mer/inc/extfunc2.h000644 000765 000024 00000002111 11530651532 020054 0ustar00Aquastaff000000 000000 /* * 127mer/inc/extfunc2.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef _MEM_MANAGER #define _MEM_MANAGER extern MEM_MANAGER * createMem_manager ( int num_items, size_t unit_size ); extern void * getItem ( MEM_MANAGER * mem_Manager ); extern void returnItem ( MEM_MANAGER * mem_Manager, void * ); extern void freeMem_manager ( MEM_MANAGER * mem_Manager ); #endif SOAPdenovo-V1.05/src/127mer/inc/extvab.h000644 000765 000024 00000005353 11530651532 017622 0ustar00Aquastaff000000 000000 /* * 127mer/inc/extvab.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ /*** global variables ****/ extern int overlaplen; extern int inGraph; extern long long n_ban; extern Kmer WORDFILTER; extern boolean globalFlag; extern int thrd_num; /**** reads info *****/ extern long long n_solexa; extern long long prevNum; extern int ins_size_var; extern PE_INFO * pes; extern int maxReadLen; extern int maxReadLen4all; extern int minReadLen; extern int maxNameLen; extern int num_libs; extern LIB_INFO * lib_array; extern int libNo; extern long long readNumBack; extern int gradsCounter; /*** used for pregraph *****/ extern MEM_MANAGER * prearc_mem_manager; //also used in scaffolding extern MEM_MANAGER ** preArc_mem_managers; extern boolean deLowKmer; extern boolean deLowEdge; extern KmerSet ** KmerSets; // also used in mapping extern KmerSet ** KmerSetsPatch; /**** used for contiging ****/ extern boolean repsTie; extern long long arcCounter; extern unsigned int num_ed; extern unsigned int num_ed_limit; extern unsigned int extraEdgeNum; extern EDGE * edge_array; extern VERTEX * vt_array; extern MEM_MANAGER * rv_mem_manager; extern MEM_MANAGER * arc_mem_manager; extern unsigned int num_vt; extern int len_bar; extern ARC ** arcLookupTable; extern long long * markersArray; /***** used for scaffolding *****/ extern MEM_MANAGER * cn_mem_manager; extern unsigned int num_ctg; extern unsigned int * index_array; extern CONTIG * contig_array; extern int lineLen; extern int weakPE; extern long long newCntCounter; extern CONNECT ** cntLookupTable; extern unsigned int ctg_short; extern int cvgAvg; extern boolean orig2new; /**** used for gapFilling ****/ extern DARRAY * readSeqInGap; extern DARRAY * gapSeqDarray; extern DARRAY ** darrayBuf; extern int fillGap; /**** used for searchPath *****/ extern int maxSteps; extern int num_trace; extern unsigned int ** found_routes; extern unsigned int * so_far; extern int max_n_routes; extern boolean maskRep; extern int GLDiff; extern int initKmerSetSize; SOAPdenovo-V1.05/src/127mer/inc/fib.h000644 000765 000024 
00000006241 11530651532 017066 0ustar00Aquastaff000000 000000 /* * 127mer/inc/fib.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ /*- * Copyright 1997, 1998-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #ifndef _FIB_H_ #define _FIB_H_ //#include "globals.h" #include #include "def2.h" typedef Coordinate ( *voidcmp ) ( unsigned int , unsigned int ); /* functions for key heaps */ boolean fh_isempty ( FibHeap * ); FibHeap * fh_makekeyheap ( void ); FibHeapNode * fh_insertkey ( FibHeap *, Coordinate, unsigned int ); Coordinate fh_minkey ( FibHeap * ); Coordinate fh_replacekey ( FibHeap *, FibHeapNode *, Coordinate ); unsigned int fh_replacekeydata ( FibHeap *, FibHeapNode *, Coordinate, unsigned int ); /* functions for unsigned int * heaps */ FibHeap * fh_makeheap ( void ); voidcmp fh_setcmp ( FibHeap *, voidcmp ); unsigned int fh_setneginf ( FibHeap *, unsigned int ); FibHeapNode * fh_insert ( FibHeap *, unsigned int ); /* shared functions */ unsigned int fh_extractmin ( FibHeap * ); unsigned int fh_min ( FibHeap * ); unsigned int fh_replacedata ( FibHeapNode *, unsigned int ); unsigned int fh_delete ( FibHeap *, FibHeapNode * ); void fh_deleteheap ( FibHeap * ); FibHeap * fh_union ( FibHeap *, FibHeap * ); #endif /* _FIB_H_ */ SOAPdenovo-V1.05/src/127mer/inc/fibHeap.h000644 000765 000024 00000002626 11530651532 017667 0ustar00Aquastaff000000 000000 /* * 127mer/inc/fibHeap.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef _FIBHEAP_H_ #define _FIBHEAP_H_ FibHeap * newFibHeap(); FibHeapNode * insertNodeIntoHeap ( FibHeap * heap, Coordinate key, unsigned int node ); Coordinate minKeyOfHeap ( FibHeap * heap ); Coordinate replaceKeyInHeap ( FibHeap * heap, FibHeapNode * node, Coordinate newKey ); void replaceValueInHeap ( FibHeapNode * node, unsigned int newValue ); unsigned int removeNextNodeFromHeap ( FibHeap * heap ); void * destroyNodeInHeap ( FibHeapNode * node, FibHeap * heap ); void destroyHeap ( FibHeap * heap ); boolean IsHeapEmpty ( FibHeap * heap ); #endif SOAPdenovo-V1.05/src/127mer/inc/fibpriv.h000644 000765 000024 00000007545 11530651532 017777 0ustar00Aquastaff000000 000000 /* * 127mer/inc/fibpriv.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ /*- * Copyright 1997, 1999-2003 John-Mark Gurney. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $Id: fibpriv.h,v 1.10 2007/10/09 09:56:46 zerbino Exp $ * */ #ifndef _FIBPRIV_H_ #define _FIBPRIV_H_ #include "def2.h" /* * specific node operations */ struct fibheap_el { int fhe_degree; boolean fhe_mark; FibHeapNode * fhe_p; FibHeapNode * fhe_child; FibHeapNode * fhe_left; FibHeapNode * fhe_right; Coordinate fhe_key; unsigned int fhe_data; }; static FibHeapNode * fhe_newelem ( struct fibheap * ); static void fhe_initelem ( FibHeapNode * ); static void fhe_insertafter ( FibHeapNode * a, FibHeapNode * b ); static inline void fhe_insertbefore ( FibHeapNode * a, FibHeapNode * b ); static FibHeapNode * fhe_remove ( FibHeapNode * a ); /* * global heap operations */ struct fibheap { Coordinate ( *fh_cmp_fnct ) ( unsigned int, unsigned int ); MEM_MANAGER * nodeMemory; IDnum fh_n; IDnum fh_Dl; FibHeapNode ** fh_cons; FibHeapNode * fh_min; FibHeapNode * fh_root; unsigned int fh_neginf; boolean fh_keys: 1; }; static void fh_initheap ( FibHeap * ); static void fh_insertrootlist ( FibHeap *, FibHeapNode * ); static void fh_removerootlist ( FibHeap *, FibHeapNode * ); static void fh_consolidate ( FibHeap * ); static void fh_heaplink ( FibHeap * h, FibHeapNode * y, FibHeapNode * x ); static void fh_cut ( FibHeap *, FibHeapNode *, FibHeapNode * ); static void fh_cascading_cut ( FibHeap *, FibHeapNode * ); static FibHeapNode * fh_extractminel ( FibHeap * ); static void fh_checkcons ( FibHeap * h ); static void fh_destroyheap ( FibHeap * h ); static int fh_compare ( FibHeap * h, FibHeapNode * a, FibHeapNode * b ); static int fh_comparedata ( FibHeap * h, Coordinate key, unsigned int data, FibHeapNode * b ); static void fh_insertel ( FibHeap * h, FibHeapNode * x ); /* * general functions */ static inline IDnum ceillog2 ( IDnum a ); #endif /* _FIBPRIV_H_ */ SOAPdenovo-V1.05/src/127mer/inc/global.h000644 000765 000024 00000004356 11530651532 017573 0ustar00Aquastaff000000 000000 /* * 127mer/inc/global.h * * Copyright (c) 2008-2010 BGI-Shenzhen . 
* * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ int overlaplen = 23; int inGraph; long long n_ban; long long n_solexa = 0; long long prevNum = 0; int ins_size_var = 20; PE_INFO * pes = NULL; MEM_MANAGER * rv_mem_manager = NULL; MEM_MANAGER * cn_mem_manager = NULL; MEM_MANAGER * arc_mem_manager = NULL; unsigned int num_vt = 0; unsigned int ** found_routes = NULL; unsigned int * so_far = NULL; int max_n_routes = 10; int num_trace; Kmer WORDFILTER; unsigned int num_ed = 0; unsigned int num_ctg = 0; unsigned int num_ed_limit; unsigned int extraEdgeNum; EDGE * edge_array = NULL; VERTEX * vt_array = NULL; unsigned int * index_array = NULL; CONTIG * contig_array = NULL; int lineLen; int len_bar = 100; int weakPE = 3; int fillGap = 0; boolean globalFlag; long long arcCounter; MEM_MANAGER * prearc_mem_manager = NULL; MEM_MANAGER ** preArc_mem_managers = NULL; int maxReadLen = 0; int maxReadLen4all = 0; int minReadLen = 0; int maxNameLen = 0; ARC ** arcLookupTable = NULL; long long * markersArray = NULL; boolean deLowKmer = 0; boolean deLowEdge = 1; long long newCntCounter; boolean repsTie = 0; CONNECT ** cntLookupTable = NULL; int num_libs = 0; LIB_INFO * lib_array = NULL; int libNo = 0; long long readNumBack; int gradsCounter; unsigned int ctg_short = 0; int thrd_num = 8; int cvgAvg = 0; KmerSet ** KmerSets = NULL; KmerSet ** KmerSetsPatch = NULL; DARRAY * readSeqInGap = NULL; DARRAY * 
gapSeqDarray = NULL; DARRAY ** darrayBuf; boolean orig2new; int maxSteps; boolean maskRep = 1; int GLDiff = 50; int initKmerSetSize = 0; SOAPdenovo-V1.05/src/127mer/inc/newhash.h000644 000765 000024 00000007416 11530651532 017770 0ustar00Aquastaff000000 000000 /* * 127mer/inc/newhash.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef __NEW_HASH_RJ #define __NEW_HASH_RJ #ifndef K_LOAD_FACTOR #define K_LOAD_FACTOR 0.75 #endif #define MAX_KMER_COV 63 #define EDGE_BIT_SIZE 6 #define EDGE_XOR_MASK 0x3FU #define LINKS_BITS 0x00FFFFFFU #define get_kmer_seq(mer) ((mer).seq) #define set_kmer_seq(mer, val) ((mer).seq = val) #define get_kmer_left_cov(mer, idx) (((mer).l_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK) #define set_kmer_left_cov(mer, idx, val) ((mer).l_links = ((mer).l_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) ) #define get_kmer_left_covs(mer) (get_kmer_left_cov(mer, 0) + get_kmer_left_cov(mer, 1) + get_kmer_left_cov(mer, 2) + get_kmer_left_cov(mer, 3)) #define get_kmer_right_cov(mer, idx) (((mer).r_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK) #define set_kmer_right_cov(mer, idx, val) ((mer).r_links = ((mer).r_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) ) #define get_kmer_right_covs(mer) (get_kmer_right_cov(mer, 0) + 
get_kmer_right_cov(mer, 1) + get_kmer_right_cov(mer, 2) + get_kmer_right_cov(mer, 3)) #define is_kmer_entity_null(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x01) #define is_kmer_entity_del(flags, idx) ((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x02) #define set_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] |= (0x01u<<(((idx)&0x0f)<<1))) #define set_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] |= (0x02u<<(((idx)&0x0f)<<1))) #define clear_kmer_entity_null(flags, idx) ((flags)[(idx)>>4] &= ~(0x01u<<(((idx)&0x0f)<<1))) #define clear_kmer_entity_del(flags, idx) ((flags)[(idx)>>4] &= ~(0x02u<<(((idx)&0x0f)<<1))) #define exists_kmer_entity(flags, idx) (!((flags)[(idx)>>4]>>(((idx)&0x0f)<<1)&0x03)) typedef __uint128_t u128b; typedef struct u256b { u128b low; u128b high; } U256b; typedef struct kmer_st { Kmer seq; ubyte4 l_links; // serves as edgeID after make_edge ubyte4 r_links: 4 * EDGE_BIT_SIZE; ubyte4 linear: 1; ubyte4 deleted: 1; ubyte4 checked: 1; ubyte4 single: 1; ubyte4 twin: 2; ubyte4 inEdge: 2; } kmer_t; typedef struct kmerSet_st { kmer_t * array; ubyte4 * flags; ubyte8 size; ubyte8 count; ubyte8 max; double load_factor; ubyte8 iter_ptr; } KmerSet; typedef struct kmer_pt { kmer_t * node; Kmer kmer; boolean isSmaller; struct kmer_pt * next; } KMER_PT; extern KmerSet * init_kmerset ( ubyte8 init_size, float load_factor ); extern int search_kmerset ( KmerSet * set, Kmer seq, kmer_t ** rs ); extern int put_kmerset ( KmerSet * set, Kmer seq, ubyte left, ubyte right, kmer_t ** kmer_p ); extern byte8 count_kmerset ( KmerSet * set ); extern void free_Sets ( KmerSet ** KmerSets, int num ); extern void free_kmerset ( KmerSet * set ); extern void dislink2nextUncertain ( kmer_t * node, char ch, boolean smaller ); extern void dislink2prevUncertain ( kmer_t * node, char ch, boolean smaller ); extern int count_branch2prev ( kmer_t * node ); extern int count_branch2next ( kmer_t * node ); extern char firstCharInKmer ( Kmer kmer ); #endif 
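/* Illustrative sketch, not part of the SOAPdenovo sources: the coverage macros in newhash.h above appear to pack four 6-bit counters (one per slot 0..3, each capped at MAX_KMER_COV = 63) into the low 24 bits of a link word. The snippet below copies the three left-coverage macros verbatim and exercises them on a hypothetical stand-in struct, demo_kmer_t, that keeps only l_links; the helper names demo_pack_left_cov and demo_overwrite_slot are assumptions made for this example. */

```c
#include <assert.h>
#include <stdint.h>

/* Constants and macros copied from newhash.h. */
#define EDGE_BIT_SIZE 6
#define EDGE_XOR_MASK 0x3FU
#define get_kmer_left_cov(mer, idx) (((mer).l_links>>((idx)*EDGE_BIT_SIZE))&EDGE_XOR_MASK)
#define set_kmer_left_cov(mer, idx, val) ((mer).l_links = ((mer).l_links&(~(EDGE_XOR_MASK<<((idx)*EDGE_BIT_SIZE)))) | (((val)&EDGE_XOR_MASK)<<((idx)*EDGE_BIT_SIZE)) )
#define get_kmer_left_covs(mer) (get_kmer_left_cov(mer, 0) + get_kmer_left_cov(mer, 1) + get_kmer_left_cov(mer, 2) + get_kmer_left_cov(mer, 3))

/* Hypothetical stand-in for kmer_t: only the packed link word is kept. */
typedef struct demo_kmer
{
	uint32_t l_links;
} demo_kmer_t;

/* Store one coverage value per slot (0..3) and return the summed coverage.
 * Values above 63 are truncated to their low 6 bits by the setter macro. */
unsigned int demo_pack_left_cov ( unsigned int c0, unsigned int c1, unsigned int c2, unsigned int c3 )
{
	demo_kmer_t mer = { 0 };
	set_kmer_left_cov ( mer, 0, c0 );
	set_kmer_left_cov ( mer, 1, c1 );
	set_kmer_left_cov ( mer, 2, c2 );
	set_kmer_left_cov ( mer, 3, c3 );
	return get_kmer_left_covs ( mer );
}

/* Overwrite slot idx (0..3) in an existing link word and read it back. */
unsigned int demo_overwrite_slot ( uint32_t links, int idx, unsigned int val )
{
	demo_kmer_t mer;
	mer.l_links = links;
	set_kmer_left_cov ( mer, idx, val );
	/* The top 8 bits lie outside the four 6-bit slots and stay untouched. */
	assert ( ( mer.l_links >> 24 ) == ( links >> 24 ) );
	return get_kmer_left_cov ( mer, idx );
}
```

/* Because the setter masks its argument with EDGE_XOR_MASK, a caller that wants saturation at 63 rather than wrap-around would have to clamp the value before storing it; whether the real code does so is not visible in this header. */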
SOAPdenovo-V1.05/src/127mer/inc/nuc.h000644 000765 000024 00000001676 11530651532 017122 0ustar00Aquastaff000000 000000 /* * 127mer/inc/nuc.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ int total_nuc = 16; char na_name[17] = {'g', 'a', 't', 'c', 'n', 'r', 'y', 'w', 's', 'm', 'k', 'h', 'b', 'v', 'd', 'x' }; SOAPdenovo-V1.05/src/127mer/inc/stack.h000644 000765 000024 00000002725 11530651532 017436 0ustar00Aquastaff000000 000000 /* * 127mer/inc/stack.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . 
* */ #ifndef __STACK__ #define __STACK__ #include #include #include typedef struct block_starter { struct block_starter * prev; struct block_starter * next; } BLOCK_STARTER; typedef struct stack { BLOCK_STARTER * block_list; int index_in_block; int items_per_block; int item_c; size_t item_size; BLOCK_STARTER * block_backup; int index_backup; int item_c_backup; } STACK; void stackBackup ( STACK * astack ); void stackRecover ( STACK * astack ); void * stackPush ( STACK * astack ); void * stackPop ( STACK * astack ); void freeStack ( STACK * astack ); void emptyStack ( STACK * astack ); STACK * createStack ( int num_items, size_t unit_size ); #endif SOAPdenovo-V1.05/src/127mer/inc/stdinc.h000644 000765 000024 00000001746 11530651532 017617 0ustar00Aquastaff000000 000000 /* * 127mer/inc/stdinc.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. * * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #include #include #include #include #include #include #include #include #include "def.h" #include "types.h" SOAPdenovo-V1.05/src/127mer/inc/types.h000644 000765 000024 00000002125 11530651532 017467 0ustar00Aquastaff000000 000000 /* * 127mer/inc/types.h * * Copyright (c) 2008-2010 BGI-Shenzhen . * * This file is part of SOAPdenovo. 
* * SOAPdenovo is free software: you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation, either version 3 of the License, or * (at your option) any later version. * * SOAPdenovo is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the * GNU General Public License for more details. * * You should have received a copy of the GNU General Public License * along with SOAPdenovo. If not, see . * */ #ifndef __TYPES_RJ #define __TYPES_RJ typedef unsigned long long ubyte8; typedef unsigned int ubyte4; typedef unsigned short ubyte2; typedef unsigned char ubyte; typedef long long byte8; typedef int byte4; typedef short byte2; typedef char byte; #endif
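
/* Illustrative sketch, not part of the SOAPdenovo sources: the aliases in types.h above encode a byte width in each name (ubyte8, ubyte4, ...), but the C standard only guarantees minimum sizes for the underlying types. The hypothetical helper demo_type_widths_ok below repeats the typedefs and reports whether each alias really has the width its name promises on the current platform. It is assumed here that the rest of the code relies on those exact widths. */

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Typedefs repeated from types.h so the snippet compiles on its own. */
typedef unsigned long long ubyte8;
typedef unsigned int ubyte4;
typedef unsigned short ubyte2;
typedef unsigned char ubyte;
typedef long long byte8;
typedef int byte4;
typedef short byte2;
typedef char byte;

/* Return 1 when every alias has the byte width its name implies, else 0. */
int demo_type_widths_ok ( void )
{
	return CHAR_BIT == 8
	       && sizeof ( ubyte8 ) == 8 && sizeof ( byte8 ) == 8
	       && sizeof ( ubyte4 ) == 4 && sizeof ( byte4 ) == 4
	       && sizeof ( ubyte2 ) == 2 && sizeof ( byte2 ) == 2
	       && sizeof ( ubyte ) == 1 && sizeof ( byte ) == 1;
}
```

/* On common 32-bit and 64-bit desktop and server platforms the check is expected to pass; a port to a platform where it fails would probably want the fixed-width types from <stdint.h> instead. */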